# Titanic Dataset — Exploratory Data Analysis
*Generated: 2025-08-19 14:42:01*

**Data Source**: uploaded `titanic.csv` (Kaggle Titanic `train.csv`)


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('titanic.csv')
df.head()

In [None]:
df.info()

In [None]:
# datetime_is_numeric was removed in pandas 2.0 and raises a TypeError there; it is not needed for this dataset.
df.describe(include='all')

In [None]:
df.isna().sum().sort_values(ascending=False)

In [None]:
df['Age'].dropna().plot(kind='hist', bins=30, title='Age Distribution'); plt.show()

In [None]:
df['Fare'].dropna().plot(kind='hist', bins=30, title='Fare Distribution'); plt.show()

In [None]:
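# One Fare series per class; ticks are set with xticks because boxplot's `labels` argument is deprecated since Matplotlib 3.9.
valid = df[['Fare', 'Pclass']].dropna(); classes = sorted(valid['Pclass'].unique())
groups = [valid.loc[valid['Pclass'] == c, 'Fare'] for c in classes]
plt.boxplot(groups); plt.xticks(range(1, len(classes) + 1), [str(c) for c in classes]); plt.title('Fare by Pclass'); plt.show()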

In [None]:
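# One Age series per outcome (0 = did not survive, 1 = survived); ticks are set with xticks instead of the deprecated `labels` argument.
valid = df[['Age', 'Survived']].dropna(); outcomes = sorted(valid['Survived'].unique())
groups = [valid.loc[valid['Survived'] == s, 'Age'] for s in outcomes]
plt.boxplot(groups); plt.xticks(range(1, len(outcomes) + 1), [str(s) for s in outcomes]); plt.title('Age by Survival'); plt.show()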

In [None]:
df.groupby('Sex')['Survived'].mean().plot(kind='bar', title='Survival Rate by Sex'); plt.show()

In [None]:
df.groupby('Pclass')['Survived'].mean().plot(kind='bar', title='Survival Rate by Pclass'); plt.show()

In [None]:
counts=df.groupby(['Pclass','Survived']).size().unstack(fill_value=0);
x=np.arange(len(counts.index)); width=0.35;
plt.bar(x-width/2, counts[0], width, label='Did not survive');
plt.bar(x+width/2, counts[1], width, label='Survived');
plt.xticks(x, counts.index.astype(str)); plt.legend(); plt.title('Counts by Pclass and Survival'); plt.show()

In [None]:
sub=df[['Age','Fare','Survived']].dropna();
plt.scatter(sub[sub['Survived']==0]['Age'], sub[sub['Survived']==0]['Fare'], label='Did not survive', alpha=0.6);
plt.scatter(sub[sub['Survived']==1]['Age'], sub[sub['Survived']==1]['Fare'], label='Survived', alpha=0.6);
plt.legend(); plt.title('Age vs Fare by Survival'); plt.show()

In [None]:
num_df=df.select_dtypes(include=[np.number]).dropna(); corr=num_df.corr();
im=plt.imshow(corr, interpolation='nearest'); plt.xticks(range(len(corr.columns)), corr.columns, rotation=45);
plt.yticks(range(len(corr.columns)), corr.columns); plt.colorbar(im); plt.title('Correlation Matrix'); plt.show()

## Observations
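- Age: concentrated among young adults, with a tail toward older passengers; a substantial share of values is missing.
- Fare: strongly right-skewed, with the widest spread in 1st class.
- Fare vs Pclass: clear separation; fares rise from 3rd class to 1st class.
- Age vs Survival: the age distributions of survivors and non-survivors overlap heavily, but young children appear to have survived at higher rates than adults.
- Survival Rate by Sex: females survived at a much higher rate than males.
- Survival Rate by Pclass: the survival rate falls from 1st class to 3rd class.
- Counts by Pclass and Survival: 3rd class accounts for the largest number of non-survivors.
- Age vs Fare scatter: survivors are more common at higher fares.
- Correlations: Survived correlates positively with Fare and negatively with Pclass (a lower Pclass number means a higher class).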
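
The age observation above rests on a box plot, which compares distributions rather than survival rates. A minimal sketch of how one might check it more directly is to bin ages and compute the survival rate per band. The function name `survival_by_age_band`, the band edges, and the toy rows below are assumptions made for illustration; they are not part of the original notebook and the toy numbers are not Titanic results.

```python
import pandas as pd


def survival_by_age_band(
    df,
    bins=(0, 12, 18, 40, 60, 120),
    labels=("child", "teen", "adult", "middle-aged", "senior"),
):
    """Return the survival rate and passenger count for each age band.

    Rows with a missing Age are dropped first. The band edges are an
    arbitrary, illustrative choice (left-closed intervals, e.g. [0, 12)).
    """
    known_age = df.dropna(subset=["Age"]).copy()
    known_age["AgeBand"] = pd.cut(
        known_age["Age"], bins=list(bins), labels=list(labels), right=False
    )
    # observed=True keeps only bands that actually occur in the data.
    grouped = known_age.groupby("AgeBand", observed=True)["Survived"]
    return grouped.agg(rate="mean", n="count")


# Toy rows, invented for illustration only (not real passengers).
toy = pd.DataFrame(
    {
        "Age": [4, 8, 30, 35, 50, None],
        "Survived": [1, 1, 0, 1, 0, 1],
    }
)
result = survival_by_age_band(toy)
print(result)
```

On the notebook's data the call would be `survival_by_age_band(df)`. Narrow bands can contain few passengers, so the `n` column is worth reading alongside each rate.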

## Summary of Findings
- Survival was higher among females and among passengers in higher classes.
- Higher fare, a rough proxy for wealth, is associated with higher survival.
- Children appear to have had better chances of survival than adults.
- Missing Age and Embarked values need imputation before modeling.
- Next steps: feature engineering (family size, titles, cabin decks), preprocessing for ML models.
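
The imputation and feature-engineering steps listed above could be sketched roughly as follows. This is one possible approach under stated assumptions, not the notebook's method: median imputation for Age, most-frequent-value imputation for Embarked, the regular expression for titles, and the `"Unknown"` deck label are all illustrative choices, and `add_basic_features` is a hypothetical helper name. The column names (`Name`, `SibSp`, `Parch`, `Cabin`, `Embarked`) are those of the Kaggle `train.csv`; the toy rows are invented.

```python
import pandas as pd


def add_basic_features(df):
    """Return a copy of df with simple imputations and derived features."""
    out = df.copy()

    # Imputation (assumed strategy): median Age, most frequent Embarked.
    out["Age"] = out["Age"].fillna(out["Age"].median())
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode().iloc[0])

    # Family size: siblings/spouses + parents/children + the passenger.
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1

    # Title: the text between the comma and the first period in Name,
    # e.g. "Doe, Mr. John" -> "Mr".
    out["Title"] = (
        out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    )

    # Deck: first character of Cabin; missing cabins get a placeholder.
    out["Deck"] = out["Cabin"].str[0].fillna("Unknown")
    return out


# Toy rows, invented for illustration only (not real passengers).
toy = pd.DataFrame(
    {
        "Name": ["Doe, Mr. John", "Roe, Mrs. Jane", "Poe, Miss. Ann", "Moe, Master. Tim"],
        "Age": [22.0, 38.0, None, 4.0],
        "SibSp": [1, 1, 0, 3],
        "Parch": [0, 0, 0, 1],
        "Cabin": [None, "C85", None, "E10"],
        "Embarked": ["S", "C", None, "S"],
    }
)
features = add_basic_features(toy)
print(features[["Age", "Embarked", "FamilySize", "Title", "Deck"]])
```

On the notebook's data the call would be `add_basic_features(df)`. Median and mode imputation are simple baselines; a model-based or group-wise imputation (for example, median Age per title) may suit the data better.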