## Introduction:

This study focuses on analysing the Titanic dataset, which contains details about passengers aboard the RMS Titanic, including their age, gender, class, ticket fare, and whether they survived the tragic event.

The dataset provides an opportunity to explore patterns and relationships among various factors and survival outcomes.

The goal of this report is to apply basic statistical methods and data analysis techniques to uncover meaningful insights from the dataset.


### Specifically, the analysis will:

1.	Summarize the data using basic statistical metrics, such as mean, median, and standard deviation.

2.	Visualize key aspects of the dataset using charts like histograms, bar charts, and scatter plots to interpret trends and distributions.

3.	Test hypotheses related to survival differences across demographics, such as gender and class, using statistical tests like the Chi-Square test and T-tests.

4.	Explore regression analysis to understand how different features like age, fare, and class influence the likelihood of survival.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

In [None]:
df = pd.read_csv("Titanic-Dataset.csv")
df.head()

### Explanation:
* The dataset contains 891 entries and 12 columns, with mixed data types (int64, float64, and object).

* Columns such as Age, Cabin, and Embarked have missing values, which may require preprocessing.

In [None]:
df.info()
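
The missing values mentioned above can also be counted per column. The following is a minimal sketch, assuming a helper named `missing_summary` (hypothetical, not part of the original notebook) and a three-row toy frame standing in for the real data:

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical toy frame; the values are illustrative only
toy = pd.DataFrame({
    "Age": [22.0, None, 35.0],
    "Cabin": [None, None, "C85"],
    "Fare": [7.25, 71.28, 8.05],
})
print(missing_summary(toy))
```

In the notebook itself, the same helper could be called as `missing_summary(df)`.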

### Explanation:

* Fare has a high range (0 to 512.33), suggesting potential outliers.

* Age has a mean of ~29.7 years, but values extend from 0.42 to 80, indicating diversity in passenger age.

* SibSp and Parch are primarily low, reflecting small family sizes.

In [None]:
df.describe()

### Explanation:

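* The mean of PassengerId is 446, which is simply the midpoint of the 1 to 891 identifier range and carries no analytical meaning.

* Survival rate (Survived) is ~38.4%, showing less than half of the passengers survived.

* Average fare (Fare) is 32.20, but with fares ranging from 0 to 512.33 the mean alone does not show what a typical passenger paid.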

In [None]:
df.mean(numeric_only=True)

### Explanation:
* Median survival (Survived) is 0, as most passengers did not survive.

* Median fare is 14.45, lower than the mean, suggesting a skewed fare distribution.

In [None]:
df.median(numeric_only=True)

### Explanation:
* Mode of Survived is 0, confirming most passengers didn't survive.

* The most common Pclass is 3, indicating most passengers traveled in third class.

In [None]:
df.mode( numeric_only=True)

### Explanation:
* High variance in Fare (2469.44) reflects substantial price differences among ticket classes.

* Age variance (211.02) suggests a wide age range, requiring further investigation for outliers.

In [None]:
df.var(numeric_only=True)

### Explanation:
* The standard deviation of Fare (49.69) supports the observation of wide variability in ticket prices.

* Age standard deviation (14.53) indicates moderate spread around the mean age of ~29.7.

In [None]:
df.std(numeric_only=True)
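
The standard deviation is the square root of the variance, so the two sets of figures quoted in this notebook can be cross-checked with plain arithmetic. A small sketch using only the variance values reported above (pandas computes both `var` and `std` with the sample, n - 1, denominator by default):

```python
import math

# Sample variances quoted in the variance section above
fare_variance = 2469.44
age_variance = 211.02

# Standard deviation = square root of variance
fare_std = math.sqrt(fare_variance)
age_std = math.sqrt(age_variance)

print(f"Fare std: {fare_std:.2f}")
print(f"Age std: {age_std:.2f}")
```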

---
---

# Data visualization

---
## Histograms
---

### Age Distribution (Histogram)
*The histogram for the Age column shows:*

- A peak in the 20–30 age range, indicating that most passengers were young adults.
- A decline in the number of passengers as the age increases.
- A few outliers in the older age range (e.g., 70–80 years).
- The presence of a kde curve (density estimate) helps identify the underlying distribution, which is slightly right-skewed.
- This distribution indicates that the Titanic had a younger population, possibly reflective of third-class passengers seeking new opportunities.

In [None]:
import os

os.makedirs('images', exist_ok=True)  # savefig fails if the target directory is missing
plt.style.use('dark_background')
plt.figure(figsize=(8, 5))
sns.histplot(df['Age'].dropna(), bins=20, kde=True, color='cyan')
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.savefig('images/age_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

### Fare Distribution (Histogram)
*The histogram for the Fare column shows:*

- A high concentration of fares under $50, indicating that the majority of passengers paid lower fares (likely third-class tickets).

- A long tail on the right, with a few passengers paying very high fares (e.g., over $200), indicating first-class luxury accommodations.
- The distribution is heavily right-skewed due to these high fare outliers.
- This suggests a significant disparity in ticket prices, likely corresponding to class differences.

In [None]:
plt.style.use('dark_background')
plt.figure(figsize=(8, 5))
sns.histplot(df['Fare'], bins=20, kde=True, color='green')
plt.title('Fare Distribution')
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.savefig('images/fare_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

---
## Bar Chart
---

## Gender Distribution (Bar Chart)
*The bar chart for Sex shows:*

- More males (around 65%) than females (around 35%) among passengers.
- This imbalance is important when analyzing survival rates, as gender-based priorities (e.g., "women and children first") could influence survival outcomes.

In [None]:
plt.style.use('dark_background')
plt.figure(figsize=(8, 5))
ax = sns.countplot(x='Sex', data=df, palette='pastel')
for patch in ax.patches:
    patch.set_alpha(0.8)
plt.title('Gender Distribution')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.savefig('images/gender_distribution.png', dpi=300, bbox_inches='tight')
plt.show()
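
The percentages read off the bar chart can be computed directly rather than estimated by eye. The sketch below rebuilds the `Sex` column from counts that are assumed here (577 male, 314 female, 891 in total) instead of loading the CSV:

```python
import pandas as pd

# Assumed counts for this dataset: 577 male and 314 female passengers
sex = pd.Series(["male"] * 577 + ["female"] * 314, name="Sex")

# normalize=True returns proportions instead of raw counts
shares = sex.value_counts(normalize=True).mul(100).round(1)
print(shares)
```

In the notebook, `df['Sex'].value_counts(normalize=True)` gives the same proportions from the loaded data.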

## Survival Distribution (Bar Chart)
*The bar chart for Survived shows:*

- A higher number of passengers did not survive (0) compared to those who survived (1).
- This reflects the overall low survival rate of the Titanic disaster, where only about 38% of passengers survived.

In [None]:
plt.figure(figsize=(8, 5))
ax = sns.countplot(x='Survived', data=df, palette='Set2')
for patch in ax.patches:
    patch.set_alpha(0.7)
plt.title('Survival Distribution')
plt.xlabel('Survived (0 = No, 1 = Yes)')
plt.ylabel('Count')
plt.savefig('images/survival_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

---
## Scatter Plot
---

## Age vs Fare (Scatter Plot)
*The scatter plot of Age vs. Fare with survival status (Survived):*

- Survivors (blue points) are more common among passengers who paid higher fares (likely first-class passengers) and younger age groups (possibly children or families in priority groups).
- Non-survivors (red points) dominate in the lower fare range and are more evenly distributed across age groups.
- Passengers with very high fares (outliers) were more likely to survive, which might indicate better access to lifeboats for first-class passengers.

In [None]:
plt.figure(figsize=(8, 5))
sns.scatterplot(x='Age', y='Fare', hue='Survived', data=df, palette={0: '#FF6347', 1: '#4682B4'}, alpha=0.7)
plt.title('Age vs Fare (Colored by Survival)')
plt.xlabel('Age')
plt.ylabel('Fare')
plt.legend(title='Survived')
plt.savefig('images/age_vs_fare_survival.png', dpi=300, bbox_inches='tight')
plt.show()

---
---

# Hypothesis Test applications

---
## Test of Difference Between Proportions
---

## Proportions Z-Test (Survival by Gender)
- Z-Statistic: 16.22
- P-Value: 3.7117477701134797e-59
#### Explanation:
There is a significant difference in survival rates between males and females. 

Women had a significantly higher survival rate, likely due to the "women and children first" policy during the evacuation.

In [None]:
from statsmodels.stats.proportion import proportions_ztest

In [None]:
survived_female = df[df['Sex'] == 'female']['Survived'].sum()
survived_female

In [None]:
total_female = df[df['Sex'] == 'female'].shape[0]
total_female

In [None]:
survived_male = df[df['Sex'] == 'male']['Survived'].sum()
survived_male

In [None]:
total_male = df[df['Sex'] == 'male'].shape[0]
total_male

In [None]:
count = [survived_female, survived_male]
count

In [None]:
nobs = [total_female, total_male]
nobs

In [None]:
stat, p_value = proportions_ztest(count, nobs)

In [None]:
print(f"Z-Statistic: {stat:.2f}, P-Value: {p_value}")

In [None]:
if p_value < 0.05:
    print("Significant difference in survival rates between males and females.")
else:
    print("No significant difference in survival rates between males and females.")
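
To show what the library call computes, the pooled two-proportion z statistic can be reproduced by hand. This is a sketch, not the statsmodels implementation; the function name `two_proportion_z` is hypothetical, and the counts (233 of 314 women and 109 of 577 men survived) are assumed to be what the cells above return for this dataset:

```python
import math


def two_proportion_z(count_a, n_a, count_b, n_b):
    """Pooled two-proportion z statistic and two-sided p-value."""
    p_a = count_a / n_a
    p_b = count_b / n_b
    # Under the null hypothesis both groups share one survival proportion
    pooled = (count_a + count_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Assumed counts: female survivors / females, male survivors / males
z, p_value = two_proportion_z(233, 314, 109, 577)
print(f"Z-Statistic: {z:.2f}, P-Value: {p_value:.3g}")
```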

---
## Independent Two-Samples T-Test
---

## T-Test (Fare by Survival)
- T-Statistic: 6.84
- P-Value: 2.6993323503141236e-11
#### Explanation:
There is a significant difference in the average fares paid by survivors and non-survivors. 

Survivors tended to pay higher fares, indicating that passengers in higher classes had a better chance of survival.

In [None]:
from scipy.stats import ttest_ind

In [None]:
# Fare data for survivors
fare_survived = df[df['Survived'] == 1]['Fare']

In [None]:
# Fare data for non-survivors
fare_not_survived = df[df['Survived'] == 0]['Fare']

In [None]:
# Welch's t-test: equal_var=False does not assume equal variances in the two groups
t_stat, p_value = ttest_ind(fare_survived, fare_not_survived, equal_var=False)

In [None]:
print(f"T-Statistic: {t_stat:.2f}, P-Value: {p_value}")

In [None]:
# Interpretation
if p_value < 0.05:
    print("Significant difference in average fares between survivors and non-survivors.")
else:
    print("No significant difference in average fares between survivors and non-survivors.")
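
Because the call above passes `equal_var=False`, the statistic is Welch's t rather than the pooled-variance Student's t. The sketch below spells out that formula with the standard library; the function name `welch_t` and the two small samples are hypothetical and are not fare data:

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    # statistics.variance uses the sample (n - 1) denominator
    va = statistics.variance(sample_a) / len(sample_a)
    vb = statistics.variance(sample_b) / len(sample_b)
    t = (mean_a - mean_b) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (
        va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1)
    )
    return t, dof


# Hypothetical samples with clearly different spreads
t_demo, dof_demo = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(f"t = {t_demo:.4f}, dof = {dof_demo:.3f}")
```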

---
## Chi-Square Independence Test
---

## Chi-Square Test (Survival by Class)
- Chi-Square Statistic: 102.89
- P-Value: 4.549251711298793e-23
#### Explanation:
Survival is significantly dependent on passenger class. 

First-class passengers had higher survival rates compared to second- and third-class passengers.


In [None]:
from scipy.stats import chi2_contingency

In [None]:
# Contingency table for Pclass and Survived
contingency_table = pd.crosstab(df['Pclass'], df['Survived'])

In [None]:
# Chi-Square Test
chi2, p, dof, expected = chi2_contingency(contingency_table)

In [None]:
print(f"Chi-Square Statistic: {chi2:.2f}, P-Value: {p}")

In [None]:
# Interpretation
if p < 0.05:
    print("Survival is significantly dependent on passenger class.")
else:
    print("No significant dependency between survival and passenger class.")
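
The chi-square statistic compares each observed count with the count expected under independence (row total times column total, divided by the grand total). A sketch of that arithmetic follows; the function name is hypothetical and the contingency counts are assumed to match what `pd.crosstab` produces for this dataset. SciPy applies Yates' continuity correction only to 2x2 tables, so it plays no role in a 3x2 table like this one:

```python
def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a 2-D table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column variables were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof


# Assumed Pclass x Survived counts (rows: class 1, 2, 3; columns: died, survived)
observed_counts = [[80, 136], [97, 87], [372, 119]]
chi2_demo, dof_demo_chi = chi_square_independence(observed_counts)
print(f"Chi-Square Statistic: {chi2_demo:.2f}, dof: {dof_demo_chi}")
```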

---
## Mann-Whitney U Test
---

## Mann-Whitney U Test (Age by Survival)
- U-Statistic: 57682.0
- P-Value: 0.16
#### Explanation:
There is no significant difference in the age distributions of survivors and non-survivors. 

Age alone does not appear to play a major role in determining survival likelihood.

In [None]:
from scipy.stats import mannwhitneyu

In [None]:
age_survived = df[df['Survived'] == 1]['Age'].dropna()

In [None]:
age_not_survived = df[df['Survived'] == 0]['Age'].dropna()

In [None]:
u_stat, p_value = mannwhitneyu(age_survived, age_not_survived)

In [None]:
print(f"U-Statistic: {u_stat:.2f}, P-Value: {p_value:.4f}")

In [None]:
if p_value < 0.05:
    print("Significant difference in age distribution between survivors and non-survivors.")
else:
    print("No significant difference in age distribution between survivors and non-survivors.")
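
The U statistic has a simple counting definition: over all pairs drawn from the two samples, count how often the value from the first sample is larger, with ties counted as one half. The sketch below implements that definition directly on two hypothetical samples (it is quadratic in sample size and meant only as an illustration; in recent SciPy versions the returned statistic corresponds to the first sample in the same way):

```python
def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a: pairs with a > b, ties counted as 0.5."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical samples, not passenger ages
group_a = [1, 2, 3]
group_b = [2, 3, 4]
u_a = mann_whitney_u(group_a, group_b)
u_b = mann_whitney_u(group_b, group_a)
print(u_a, u_b)
```

The two U values always sum to the number of pairs, so `u_a / (len(group_a) * len(group_b))` can be read as the estimated probability that a value from the first group exceeds one from the second; a value near 0.5 indicates little separation between the groups.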

---
## ANOVA (Analysis of Variance)
---

## ANOVA (Fare by Class)
- F-Statistic: 242.34
- P-Value: 1.0313763209141171e-84
#### Explanation:
There is a significant difference in mean fares across passenger classes. 

First-class passengers paid significantly higher fares compared to second- and third-class passengers, reflecting socioeconomic differences.

In [None]:
from scipy.stats import f_oneway

In [None]:
fare_class_1 = df[df['Pclass'] == 1]['Fare']
fare_class_2 = df[df['Pclass'] == 2]['Fare']
fare_class_3 = df[df['Pclass'] == 3]['Fare']

In [None]:
f_stat, p_value = f_oneway(fare_class_1, fare_class_2, fare_class_3)

In [None]:
print(f"F-Statistic: {f_stat:.2f}, P-Value: {p_value}")

In [None]:
if p_value < 0.05:
    print("Significant difference in mean fares across passenger classes.")
else:
    print("No significant difference in mean fares across passenger classes.")
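
The F statistic is the ratio of between-group variability to within-group variability, each divided by its degrees of freedom. A minimal sketch of that calculation, with a hypothetical function name and three small made-up groups rather than the fare data:

```python
import statistics


def one_way_f(*groups):
    """One-way ANOVA F statistic: mean square between / mean square within."""
    all_values = [value for group in groups for value in group]
    grand_mean = statistics.mean(all_values)
    ss_between = sum(
        len(group) * (statistics.mean(group) - grand_mean) ** 2 for group in groups
    )
    ss_within = sum(
        (value - statistics.mean(group)) ** 2 for group in groups for value in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical groups, illustrative only
f_demo = one_way_f([1, 2, 3], [2, 3, 4], [5, 6, 7])
print(f"F-Statistic: {f_demo:.2f}")
```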

## Summary:
- **Gender and Class:** 

    *These factors had a strong influence on survival, with females and first-class passengers having higher survival rates.*
- **Fares:** 

    *Higher fares were associated with increased survival, linking economic privilege to better outcomes.*
- **Age:** 

    *Age did not significantly affect survival in this dataset.*

---
---

# Regression analysis

---
## Linear Regression
---

### Linear Regression:
The linear regression model explains approximately 32.1% of the variance in passenger fares. 

- **Pclass:**

    *For each unit increase in passenger class, the fare is predicted to decrease by approximately $37.92, holding age constant. This is statistically significant (p < 0.001).*

- **Age:**

    *For each year increase in age, the fare is predicted to decrease by approximately $0.46, holding passenger class constant. This is also statistically significant (p < 0.001).*

- The model's overall fit is statistically significant (F-statistic = 167.9, p < 0.001). However, the residuals exhibit non-normality, as indicated by the Omnibus and Jarque-Bera tests.

- This analysis suggests that both passenger class and age are significant factors in determining passenger fares.

In [None]:
import statsmodels.api as sm

In [None]:
df = df.dropna(subset=['Age', 'Fare', 'Pclass'])

In [None]:
X = df[['Pclass', 'Age']]

In [None]:
X = sm.add_constant(X)

In [None]:
y = df['Fare']

In [None]:
model = sm.OLS(y, X).fit()

In [None]:
print(model.summary())

---
## Logistic Regression
---

The logistic regression model achieved an accuracy of 76.3% on the test set. This indicates that the model can correctly classify passenger survival in a significant portion of cases.

The classification report provides a more granular view of the model's performance for each class (survived, died). Precision tells us the proportion of predicted positives that were actually positive, while recall tells us the proportion of actual positives that were predicted positive. F1-score is a harmonic mean of precision and recall, balancing both metrics.

Based on the precision, recall, and F1-score values, we can observe that the model performs slightly better in predicting non-survivors (class 0) than survivors (class 1).

**Note:** It's important to consider additional factors like the cost of misclassification depending on the specific application. For instance, if failing to predict a passenger's survival has more severe consequences, the model's recall on class 1 (survived) might be of greater concern.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

In [None]:
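# Encode Sex on a copy so that df['Sex'] keeps its 'male'/'female' labels,
# which the formula-based model later in the notebook relies on
features = df[['Pclass', 'Age', 'Sex']].copy()
features['Sex'] = features['Sex'].map({'male': 0, 'female': 1})

In [None]:
X = features.dropna()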

In [None]:
y = df.loc[X.index, 'Survived']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
log_reg = LogisticRegression()

In [None]:
log_reg.fit(X_train, y_train)

In [None]:
y_pred = log_reg.predict(X_test)

In [None]:
print("Accuracy:", accuracy_score(y_test, y_pred))

In [None]:
print("\nClassification Report:\n", classification_report(y_test, y_pred))
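
The three metrics in the classification report follow directly from the confusion-matrix counts for a class. The sketch below shows the definitions; the function name and the counts are hypothetical and are not the model's actual results:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 for one class from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts for the "survived" class
precision, recall, f1 = precision_recall_f1(tp=40, fp=10, fn=20)
print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
```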

---
## Multiple Regression (with Interaction)
---

The logistic regression model with interaction terms for Pclass and Sex shows a significant association with passenger survival (LLR p-value < 0.001). The model's pseudo R-squared is approximately 0.3531; unlike R-squared in linear regression, this measures improvement in log-likelihood over an intercept-only model and is not a share of variance explained.

* **Sex:** Being male was significantly associated with lower odds of survival compared to females (odds ratio: exp(-6.1155) ≈ 0.002). Because the model includes a Pclass:Sex interaction, this main-effect coefficient is the male-to-female contrast at Pclass = 0 and should be read together with the interaction term.
* **Pclass:** A higher Pclass value (that is, a lower travel class) was generally associated with lower odds of survival.
* **Interaction (Pclass:Sex[T.male]):** The effect of Pclass on survival differed between males and females. The decrease in survival odds associated with higher Pclass was less pronounced for males compared to females.

**Note:**

* This interpretation focuses on the log-odds. To interpret the effects in terms of odds or probabilities, you would need to exponentiate the coefficients.
* The interaction term highlights the importance of considering the joint effects of Pclass and Sex on survival.
* Further analysis, such as plotting the predicted probabilities for different combinations of Pclass and Sex, can provide a more visual and intuitive understanding of the model's predictions.


In [None]:
from patsy import dmatrices

In [None]:
formula = "Survived ~ Pclass + Age + Sex + Pclass:Sex"

In [None]:
y, X = dmatrices(formula, df, return_type='dataframe')

In [None]:
logit_model = sm.Logit(y, X).fit()

In [None]:
print(logit_model.summary())
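
With an interaction in the model, the male-to-female odds ratio depends on the class: it is the exponential of the Sex coefficient plus Pclass times the interaction coefficient. The sketch below uses the Sex coefficient quoted in the text; the interaction coefficient is a placeholder (an assumption for illustration only) and should be replaced with the value from the model summary:

```python
import math

sex_male_coef = -6.1155  # Sex[T.male] coefficient quoted in the text


def male_odds_ratio(pclass, sex_coef, interaction_coef):
    """Male-to-female odds ratio at a given Pclass for a Pclass:Sex model."""
    return math.exp(sex_coef + interaction_coef * pclass)


# The main effect on its own is the odds ratio at Pclass = 0,
# which lies outside the observed range of 1 to 3
baseline_or = math.exp(sex_male_coef)
print(f"exp({sex_male_coef}) = {baseline_or:.4f}")

# Placeholder interaction coefficient: NOT taken from the fitted model
hypothetical_interaction = 1.0
for pclass in (1, 2, 3):
    ratio = male_odds_ratio(pclass, sex_male_coef, hypothetical_interaction)
    print(f"Pclass {pclass}: odds ratio = {ratio:.4f}")
```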

---
## Polynomial Regression
---

### Key Metrics:
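- **Coefficients**: These represent the weights of the polynomial terms. With the default `PolynomialFeatures(degree=2)`, the columns are the constant term, `Age`, and `Age^2`; the constant column's coefficient is 0 because `LinearRegression` fits its own intercept. A positive coefficient for `Age^2` indicates an upward-curving relationship between age and fare, while a negative one indicates a downward-curving relationship.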
- **Intercept**: This is the predicted fare when age is 0.
- The polynomial regression allows for a non-linear relationship between age and fare, which might fit the data better than a simple linear model.

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

In [None]:
X = df[['Age']].dropna()

In [None]:
y = df.loc[X.index, 'Fare']

In [None]:
poly = PolynomialFeatures(degree=2)

In [None]:
X_poly = poly.fit_transform(X)

In [None]:
poly_reg = LinearRegression()

In [None]:
poly_reg.fit(X_poly, y)

In [None]:
print("Coefficients:", poly_reg.coef_)
print("Intercept:", poly_reg.intercept_)

---
## Ridge Regression
---

- **Ridge Regression**:
  - Ridge regression shrinks the coefficients of less important predictors but does not set them to zero. It is useful for reducing overfitting.
  - The coefficients indicate the relative importance of each predictor.

In [None]:
from sklearn.linear_model import Ridge, Lasso

In [None]:
ridge = Ridge(alpha=1.0)

In [None]:
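# Reuses the train split from the logistic regression section,
# so the target here is the binary Survived column
ridge.fit(X_train, y_train)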

In [None]:
print("Ridge Coefficients:", ridge.coef_)
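
The shrinkage described above can be seen in the closed-form ridge solution, where the penalty `alpha` is added to the diagonal of the feature cross-product matrix. This is a sketch under stated assumptions: it is not scikit-learn's solver, the function name `ridge_closed_form` is hypothetical, and the data are a four-row toy example with an unpenalized intercept handled by centering:

```python
import numpy as np


def ridge_closed_form(X, y, alpha):
    """Ridge coefficients and intercept via the normal equations."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Centering removes the intercept from the penalized problem
    X_centered = X - X.mean(axis=0)
    y_centered = y - y.mean()
    n_features = X_centered.shape[1]
    gram = X_centered.T @ X_centered + alpha * np.eye(n_features)
    coef = np.linalg.solve(gram, X_centered.T @ y_centered)
    intercept = y.mean() - X.mean(axis=0) @ coef
    return coef, intercept


# Hypothetical toy data: y equals the first feature exactly
X_toy = [[1, 2], [2, 1], [3, 4], [4, 3]]
y_toy = [1, 2, 3, 4]

coef_ols, _ = ridge_closed_form(X_toy, y_toy, alpha=0.0)
coef_ridge, _ = ridge_closed_form(X_toy, y_toy, alpha=10.0)
print("alpha=0: ", coef_ols)
print("alpha=10:", coef_ridge)
```

With `alpha=0` the result is the ordinary least-squares fit; increasing `alpha` pulls the coefficient vector toward zero without forcing any entry to be exactly zero.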

---
## Lasso Regression
---

- **Lasso Regression**:
  - Lasso regression can shrink some coefficients to zero, effectively performing feature selection.
  - The coefficients indicate which predictors are most important for the model.

In [None]:
lasso = Lasso(alpha=0.1)

In [None]:
lasso.fit(X_train, y_train)

In [None]:
print("Lasso Coefficients:", lasso.coef_)

### Summary of Results:
1. **Linear Regression**: Explains the relationship between `Pclass`, `Age`, and `Fare`. A better class (lower `Pclass` value) is associated with higher fares, and, holding class constant, older age is associated with slightly lower fares.
2. **Logistic Regression**: Predicts survival based on `Pclass`, `Age`, and `Sex`. The model's accuracy and classification metrics indicate its performance.
3. **Multiple Regression (with Interaction)**: Explains survival with interaction effects between `Pclass` and `Sex`.
4. **Polynomial Regression**: Captures non-linear relationships between `Age` and `Fare`.
5. **Ridge/Lasso Regression**: Regularized models that reduce overfitting and highlight important predictors.

### Conclusions and Discussions:

- The analysis of the Titanic dataset shows that survival rates were higher for women and first-class passengers. Men and passengers in lower classes had a much lower chance of survival. Age, by contrast, showed no significant difference between survivors and non-survivors in the Mann-Whitney U test, so the "women and children first" priority described in historical records is supported here for gender but not directly for age.

- This study highlights how gender and class played a big role in determining survival. While the analysis gives useful insights, it is limited because it doesn’t include other factors like access to lifeboats or group behaviours. Further research could explore these factors to better understand survival patterns.