# 🚢 Titanic Dataset: Exploratory Data Analysis (EDA)

**Task 5 | Data Analyst Internship**

**Objective:** Extract insights using visual and statistical exploration  
**Tools:** Python · Pandas · Matplotlib · Seaborn  
**Dataset:** `train.csv` (891 rows × 12 columns)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style='whitegrid', palette='muted')
%matplotlib inline

BLUE, ORANGE, PAL = '#4C72B0', '#DD8452', ['#DD8452', '#4C72B0']

## 1. Load Dataset

In [6]:
# Raw string (r"...") so backslashes in the Windows path are not parsed as escape sequences like \U
df = pd.read_csv(r"C:\Users\Admin\Downloads\titanic\train.csv")
print('Shape:', df.shape)
df.head()

## 2. Basic Info & Statistical Summary

In [None]:
df.info()

In [None]:
df.describe().round(2)

In [None]:
print('Missing Values:\n', df.isnull().sum())
print('\nSurvived:\n', df['Survived'].value_counts())
print('\nPclass:\n', df['Pclass'].value_counts().sort_index())
print('\nSex:\n', df['Sex'].value_counts())
print('\nEmbarked:\n', df['Embarked'].value_counts())

## 3. Missing Value Analysis

In [None]:
missing = df.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
fig, ax = plt.subplots(figsize=(6, 3))
missing.plot(kind='bar', color=BLUE, ax=ax, edgecolor='white', rot=0)
for i, v in enumerate(missing):
    ax.text(i, v+2, f'{v} ({v/len(df)*100:.1f}%)', ha='center', fontsize=10)
ax.set_title('Missing Values per Column', fontweight='bold')
ax.set_ylabel('Count'); ax.set_ylim(0, missing.max()*1.3)
plt.tight_layout(); plt.show()
print('Observation: Cabin=77.1% missing (drop/flag). Age=19.9% (impute). Embarked=2 (fill mode).')
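
The handling strategy named in the observation above (flag and drop `Cabin`, impute `Age`, fill `Embarked` with the mode) could be sketched roughly as follows. The helper name `handle_missing`, the `HasCabin` column, and the choice of the median for `Age` are assumptions made for illustration, and the toy rows are made up rather than taken from `train.csv`.

```python
import pandas as pd

def handle_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with Cabin flagged, Age median-imputed, Embarked mode-filled."""
    out = frame.copy()
    # Flag whether a cabin was recorded, then drop the sparse raw column.
    out['HasCabin'] = out['Cabin'].notna().astype(int)
    out = out.drop(columns='Cabin')
    # Median is one common choice for Age; it is an assumption, not the only option.
    out['Age'] = out['Age'].fillna(out['Age'].median())
    # Fill the few missing ports with the most frequent value.
    out['Embarked'] = out['Embarked'].fillna(out['Embarked'].mode().iloc[0])
    return out

# Toy rows (made up for illustration, not taken from train.csv).
toy = pd.DataFrame({
    'Age': [22.0, None, 30.0, 40.0],
    'Cabin': ['C85', None, None, 'E46'],
    'Embarked': ['S', 'S', None, 'C'],
})
clean = handle_missing(toy)
print(clean)
```

On the real data the same call would be `handle_missing(df)`. Grouped imputation (for example, median `Age` per `Pclass`) may be a better fit, but that is a modelling decision outside this sketch.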

## 4. Univariate Analysis

In [None]:
counts = df['Survived'].value_counts().sort_index()
fig, ax = plt.subplots(figsize=(6,4))
bars = ax.bar(['Did Not Survive','Survived'], counts.values, color=PAL, edgecolor='white', width=0.5)
for bar, val in zip(bars, counts.values):
    ax.text(bar.get_x()+bar.get_width()/2, bar.get_height()+5,
            f'{val}\n({val/len(df)*100:.1f}%)', ha='center', fontsize=11)
ax.set_title('Overall Survival Count', fontweight='bold'); ax.set_ylabel('Count')
ax.set_ylim(0, max(counts.values)*1.25); plt.tight_layout(); plt.show()
print('Observation: 549 (61.6%) did not survive; 342 (38.4%) survived. Moderate class imbalance.')

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
axes[0].hist(df['Age'].dropna(), bins=30, color=BLUE, edgecolor='white')
axes[0].axvline(df['Age'].mean(), color='red', linestyle='--', label=f"Mean: {df['Age'].mean():.1f}")
axes[0].set_title('Age Distribution', fontweight='bold'); axes[0].set_xlabel('Age'); axes[0].legend()
axes[1].hist(df['Fare'].dropna(), bins=40, color=ORANGE, edgecolor='white')
axes[1].set_title('Fare Distribution (Right-Skewed)', fontweight='bold'); axes[1].set_xlabel('Fare (£)')
plt.tight_layout(); plt.show()
print('Age: ~Normal, mean=29.7. Fare: Severely right-skewed; use log transform before ML.')
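
As a rough illustration of the log transform suggested above, the sketch below applies `np.log1p` to a small made-up fare series and compares skewness before and after. The values are toy data, not rows from `train.csv`; on the notebook's frame the equivalent would be `np.log1p(df['Fare'])`.

```python
import numpy as np
import pandas as pd

# Toy right-skewed fares (made up for illustration, not taken from train.csv).
fare = pd.Series([0.0, 7.25, 7.9, 8.05, 13.0, 26.0, 31.0, 71.3, 263.0, 512.3], name='Fare')

# log1p(x) = log(1 + x), so zero fares map to 0 instead of -inf.
fare_log = np.log1p(fare)

# Sample skewness: values near 0 indicate a roughly symmetric distribution.
skew_before = fare.skew()
skew_after = fare_log.skew()
print(f'Skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}')
```

`log1p` is used instead of a plain `log` because a fare of zero would otherwise produce negative infinity.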

## 5. Bivariate Analysis

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
sns.countplot(x='Sex', hue='Survived', data=df, palette=PAL, ax=axes[0])
axes[0].set_title('Survival by Gender', fontweight='bold'); axes[0].legend(title='Survived', labels=['No','Yes'])
sns.countplot(x='Pclass', hue='Survived', data=df, palette=PAL, ax=axes[1])
axes[1].set_title('Survival by Passenger Class', fontweight='bold'); axes[1].legend(title='Survived', labels=['No','Yes'])
plt.tight_layout(); plt.show()
print('Survival by Sex:', df.groupby('Sex')['Survived'].mean().round(3).to_dict())
print('Survival by Pclass:', df.groupby('Pclass')['Survived'].mean().round(3).to_dict())

In [None]:
df2 = df.copy(); df2['Outcome'] = df2['Survived'].map({0:'Did Not Survive',1:'Survived'})
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
sns.boxplot(x='Outcome', y='Age', data=df2, hue='Outcome', palette=PAL, ax=axes[0], legend=False)
axes[0].set_title('Age vs Survival', fontweight='bold'); axes[0].set_xlabel('')
sns.boxplot(x='Outcome', y='Fare', data=df2, hue='Outcome', palette=PAL, ax=axes[1], legend=False)
axes[1].set_title('Fare vs Survival', fontweight='bold'); axes[1].set_xlabel('')
plt.tight_layout(); plt.show()
print('Survivors paid a higher median fare (~£26 vs ~£10.5; means ~£48 vs ~£22). Age shows weak separation.')
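
A box plot can hide non-linear effects of age, so one way to probe the weak separation is to bin `Age` into bands and compare survival rates per band. The sketch below assumes band edges and labels chosen purely for illustration, and uses made-up toy rows rather than `train.csv`.

```python
import pandas as pd

# Toy rows (made up for illustration, not taken from train.csv).
toy = pd.DataFrame({
    'Age':      [4, 9, 15, 25, 33, 47, 62, 70],
    'Survived': [1, 1, 0,  0,  1,  0,  0,  0],
})

# Band edges and labels are assumptions chosen for the example.
bins = [0, 12, 18, 60, 120]
labels = ['Child', 'Teen', 'Adult', 'Senior']
toy['AgeBand'] = pd.cut(toy['Age'], bins=bins, labels=labels)

# Mean of a 0/1 column is the survival rate within each band.
rate_by_band = toy.groupby('AgeBand', observed=True)['Survived'].mean()
print(rate_by_band)
```

Applied to `df`, rows with missing `Age` fall outside every band and are left out of the grouped rates.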

In [None]:
fig, ax = plt.subplots(figsize=(8,4))
sns.countplot(x='Embarked', hue='Survived', data=df, palette=PAL, ax=ax, order=['S','C','Q'])
ax.set_title('Survival by Embarkation Port', fontweight='bold')
ax.set_xlabel('Port (S=Southampton, C=Cherbourg, Q=Queenstown)')
ax.legend(title='Survived', labels=['No','Yes'])
plt.tight_layout(); plt.show()
print('Cherbourg passengers had higher survival, likely because a larger share of them travelled 1st class.')
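
The link between embarkation port and passenger class can be checked with a row-normalised crosstab, which shows the class mix within each port. This is a minimal sketch on made-up toy rows; on the notebook's frame the call would be `pd.crosstab(df['Embarked'], df['Pclass'], normalize='index')`.

```python
import pandas as pd

# Toy rows (made up for illustration, not taken from train.csv).
toy = pd.DataFrame({
    'Embarked': ['C', 'C', 'C', 'S', 'S', 'S', 'S', 'Q'],
    'Pclass':   [1,   1,   3,   3,   3,   2,   1,   3],
})

# Row-normalised crosstab: each row shows the class mix of one port.
class_mix = pd.crosstab(toy['Embarked'], toy['Pclass'], normalize='index').round(2)
print(class_mix)
```

Comparing shares rather than raw counts matters here because the ports differ widely in total passenger numbers.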

## 6. Multivariate Analysis

In [None]:
surv_rate = df.groupby(['Pclass','Sex'])['Survived'].mean().reset_index()
fig, ax = plt.subplots(figsize=(8,4))
sns.barplot(x='Pclass', y='Survived', hue='Sex', data=surv_rate, palette=[BLUE,ORANGE], ax=ax)
ax.set_title('Survival Rate by Pclass & Gender', fontweight='bold')
ax.set_ylabel('Survival Rate'); ax.set_ylim(0, 1.1)
for p in ax.patches:
    if p.get_height() > 0:  # skip zero-height patches (e.g. legend handles in newer seaborn)
        ax.annotate(f'{p.get_height():.0%}', (p.get_x()+p.get_width()/2, p.get_height()+0.02), ha='center', fontsize=9)
plt.tight_layout(); plt.show()
print('Female 1st class ~97%. Male 3rd class ~14%. Gender dominates within every class.')

In [None]:
fig, ax = plt.subplots(figsize=(8,4))
sns.violinplot(x='Pclass', y='Age', data=df, hue='Pclass', palette='muted', ax=ax, legend=False)
ax.set_title('Age Distribution by Passenger Class', fontweight='bold'); ax.set_xlabel('Passenger Class')
plt.tight_layout(); plt.show()
print('1st class passengers are oldest (~37 median); 3rd class youngest (~24 median).')

## 7. Correlation Heatmap

In [None]:
num_cols = ['Survived','Pclass','Age','SibSp','Parch','Fare']
corr = df[num_cols].corr()
fig, ax = plt.subplots(figsize=(8,6))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', center=0, linewidths=0.8, annot_kws={'size':11}, ax=ax)
ax.set_title('Correlation Heatmap', fontweight='bold')
plt.tight_layout(); plt.show()
print('Key: Fare vs Survived=+0.26 | Pclass vs Survived=-0.34 | Pclass vs Fare=-0.55')
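
Pairwise correlations only catch two-variable overlap. A complementary check is the variance inflation factor (VIF), which for standardised predictors equals the diagonal of the inverse correlation matrix. The sketch below assumes a hypothetical helper `vif_from_corr` and random synthetic columns, not the Titanic features.

```python
import numpy as np
import pandas as pd

def vif_from_corr(frame: pd.DataFrame) -> pd.Series:
    """VIF of each column, read off the diagonal of the inverse correlation matrix."""
    corr = frame.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=frame.columns, name='VIF')

# Synthetic predictors (random, for illustration only): b is built to track a closely.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=200),  # nearly collinear with a
    'c': rng.normal(size=200),                 # independent of both
})
vif = vif_from_corr(toy)
print(vif.round(2))
```

On the notebook's data this would be applied to the predictor columns only (excluding `Survived`) after dropping or imputing missing values, for example `vif_from_corr(df[['Pclass','Age','SibSp','Parch','Fare']].dropna())`.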

## 8. Pairplot

In [None]:
pair_df = df[['Survived','Pclass','Age','Fare','SibSp']].dropna().copy()
pair_df['Survived'] = pair_df['Survived'].astype(str)
g = sns.pairplot(pair_df, hue='Survived', palette={'0':ORANGE,'1':BLUE}, plot_kws={'alpha':0.5}, height=2.2)
g.figure.suptitle('Pairplot (blue=Survived, orange=Did Not Survive)', y=1.01)
plt.show()
print('Survivors cluster at high Fare & low Pclass number (i.e. 1st class). Age shows weak separation.')

## 9. Summary of Findings

| # | Finding | Detail |
|---|---------|--------|
| 1 | **Gender strongest predictor** | Female: 74.2% vs Male: 18.9% |
| 2 | **Passenger class matters** | 1st: 63% → 2nd: 47.3% → 3rd: 24.2% |
| 3 | **Fare correlates with survival** | Survivors' median fare ~£26 vs ~£10.5 (mean ~£48 vs ~£22) |
| 4 | **Sex × Pclass most powerful** | Female 1st ~97%; Male 3rd ~14% |
| 5 | **Age mildly predictive** | Children survived more; overall effect weak |
| 6 | **Fare right-skewed** | Use log transform for ML |
| 7 | **Missing: Age 19.9%, Cabin 77.1%** | Impute Age; drop/flag Cabin |
| 8 | **No severe multicollinearity** | Max: Pclass vs Fare = −0.55 |
| 9 | **Class imbalance 62/38** | Use stratified splits |
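
The stratified split recommended in finding 9 can be sketched with pandas alone by sampling the test fraction within each class. The helper name `stratified_split`, its defaults, and the toy frame are assumptions for illustration; in practice scikit-learn's `train_test_split` with its `stratify` argument is the more common route.

```python
import pandas as pd

def stratified_split(frame, target, test_frac=0.2, seed=42):
    """Split so each target class keeps roughly the same share in train and test."""
    # Sample test_frac of the rows within each class, then take the rest as train.
    test = frame.groupby(target, group_keys=False).sample(frac=test_frac, random_state=seed)
    train = frame.drop(index=test.index)
    return train, test

# Toy frame with a 62/38 imbalance (made-up rows, not train.csv).
toy = pd.DataFrame({'Survived': [0] * 62 + [1] * 38, 'x': range(100)})
train, test = stratified_split(toy, 'Survived')
print(test['Survived'].value_counts().sort_index())
```

Without stratification, a small random test set could end up with a noticeably different survival rate from the training set, which would distort evaluation metrics.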

## 10. Interview Q&A

**Q1. What is EDA?** Analyzing datasets visually/statistically to understand structure, find anomalies, and guide modelling.

**Q2. Plots for correlation?** `sns.heatmap()`, `sns.pairplot()`, `sns.regplot()`.

**Q3. Handle skewed data?** `np.log1p()`, sqrt, or Box-Cox. Tree models are robust to skew.

**Q4. Detect multicollinearity?** Correlation matrix (|r|>0.8) + VIF (>5–10 is problematic).

**Q5. Univariate/Bivariate/Multivariate?** 1 var / 2 vars / 3+ vars analysis.

**Q6. Heatmap vs Pairplot?** Heatmap = aggregated numbers. Pairplot = full distributions + scatter plots.

**Q7. Summarize insights?** Key findings + numbers + visuals + data quality notes + modelling implications.