# Titanic Survival Prediction

**Author:** Piyush Ramteke  
**Program:** CodSoft Data Science Internship  

---

## 1. Problem Statement

The sinking of the Titanic in 1912 is one of the deadliest peacetime maritime disasters in history. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died.

**Objective:** Build a machine learning model that predicts whether a passenger **survived** from features such as age, gender, ticket class, fare, and family size.

This is a **binary classification** problem:
- **0** → Did not survive
- **1** → Survived

## 2. Import Libraries & Load Dataset

In [None]:
# ── Import Libraries ─────────────────────────────────────────

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report,
    precision_score, recall_score, f1_score
)

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 12

print('All libraries loaded.')

In [None]:
# ── Load Dataset ─────────────────────────────────────────────

df = pd.read_csv('Titanic-Dataset.csv')

print(f'Dataset Shape: {df.shape[0]} rows × {df.shape[1]} columns')
df.head()

**Column Descriptions:**

| Feature | Description |
|---------|-------------|
| `PassengerId` | Unique ID for each passenger |
| `Survived` | **Target** — 0 = No, 1 = Yes |
| `Pclass` | Ticket class — 1 = 1st, 2 = 2nd, 3 = 3rd |
| `Name` | Passenger name |
| `Sex` | Gender |
| `Age` | Age in years |
| `SibSp` | Number of siblings/spouses aboard |
| `Parch` | Number of parents/children aboard |
| `Ticket` | Ticket number |
| `Fare` | Ticket fare |
| `Cabin` | Cabin number |
| `Embarked` | Port of embarkation — C = Cherbourg, Q = Queenstown, S = Southampton |

---

## 3. Exploratory Data Analysis

In [None]:
# ── 3.1 Dataset Info ─────────────────────────────────────────

df.info()

In [None]:
# ── 3.2 Missing Values Summary ──────────────────────────────

missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)

missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Percentage (%)': missing_pct
}).sort_values('Missing Count', ascending=False)

print('Missing Values Summary:')
print('=' * 40)
missing_df[missing_df['Missing Count'] > 0]

**Observations:**
- **Cabin** — 77% missing → too many gaps, we'll drop it
- **Age** — 19.9% missing → we'll fill with the median age
- **Embarked** — only 2 missing → we'll fill with the most common port

In [None]:
# ── 3.3 Survival Distribution ───────────────────────────────

surv_counts = df['Survived'].value_counts()
surv_pct = df['Survived'].value_counts(normalize=True) * 100

print('Survival Distribution:')
print(f'  Did Not Survive (0): {surv_counts[0]}  ({surv_pct[0]:.1f}%)')
print(f'  Survived        (1): {surv_counts[1]}  ({surv_pct[1]:.1f}%)')

In [None]:
# ── 3.4 Visualize Survival Distribution ─────────────────────

colors = ['#e74c3c', '#2ecc71']

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

# Bar chart
bars = axes[0].bar(['Did Not Survive (0)', 'Survived (1)'],
                   surv_counts.values, color=colors, edgecolor='black')
axes[0].set_title('Survival Count', fontsize=15, fontweight='bold')
axes[0].set_ylabel('Count')
for b, c in zip(bars, surv_counts.values):
    axes[0].text(b.get_x()+b.get_width()/2, b.get_height()+5,
                str(c), ha='center', fontweight='bold', fontsize=13)

# Pie chart
axes[1].pie(surv_counts, labels=['Did Not Survive', 'Survived'],
            autopct='%1.1f%%', colors=colors, explode=(0.02, 0.05),
            shadow=True, textprops={'fontsize': 13})
axes[1].set_title('Survival Proportion', fontsize=15, fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
# ── 3.5 Survival by Gender and Ticket Class ─────────────────

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

sns.countplot(x='Sex', hue='Survived', data=df, palette=colors, 
              edgecolor='black', ax=axes[0])
axes[0].set_title('Survival by Gender', fontsize=15, fontweight='bold')
axes[0].legend(['No', 'Yes'], title='Survived')

# Survival by Pclass
sns.countplot(x='Pclass', hue='Survived', data=df, palette=colors,
              edgecolor='black', ax=axes[1])
axes[1].set_title('Survival by Ticket Class', fontsize=15, fontweight='bold')
axes[1].legend(['No', 'Yes'], title='Survived')

plt.tight_layout()
plt.show()

print(f'Female survival rate: {df[df["Sex"]=="female"]["Survived"].mean()*100:.1f}%')
print(f'Male survival rate:   {df[df["Sex"]=="male"]["Survived"].mean()*100:.1f}%')

In [None]:
# ── 3.6 Age Distribution by Survival ────────────────────────

fig, ax = plt.subplots(figsize=(10, 5))

ax.hist(df[df['Survived']==0]['Age'].dropna(), bins=30, alpha=0.7,
        color=colors[0], label='Did Not Survive', edgecolor='black')
ax.hist(df[df['Survived']==1]['Age'].dropna(), bins=30, alpha=0.7,
        color=colors[1], label='Survived', edgecolor='black')
ax.set_title('Age Distribution by Survival', fontsize=15, fontweight='bold')
ax.set_xlabel('Age')
ax.set_ylabel('Count')
ax.legend()

plt.tight_layout()
plt.show()

**Key EDA Findings:**
- ~61.6% of passengers **did not survive** — moderately imbalanced
- **Females** had a much higher survival rate (~74%) than males (~19%) — "women and children first"
- **1st class** passengers survived more often than 3rd class
- **Young children** (age < 5) had better survival chances
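
The group-level rates quoted above come from averaging the 0/1 `Survived` column within each group. A minimal sketch of that calculation follows; the `toy` DataFrame is made-up illustration data, not rows from the Titanic file, so its numbers will not match the findings.

```python
import pandas as pd

# Toy data for illustration only; these rows are NOT taken from the Titanic dataset.
toy = pd.DataFrame({
    'Sex':      ['female', 'female', 'female', 'male', 'male', 'male', 'male', 'male'],
    'Pclass':   [1, 3, 2, 1, 3, 3, 2, 3],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 0],
})

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate per group.
rate_by_sex = toy.groupby('Sex')['Survived'].mean()
rate_by_class = toy.groupby('Pclass')['Survived'].mean()

print(rate_by_sex)
print(rate_by_class)
```

Running the same two `groupby` lines on `df` instead of `toy` should reproduce the rates listed in the findings.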

---

## 4. Data Preprocessing

In [None]:
# ── 4.1 Handle Missing Values ───────────────────────────────

# Age → fill with median (robust to outliers)
# Assign back instead of inplace=True, which newer pandas versions warn about on a column
df['Age'] = df['Age'].fillna(df['Age'].median())

# Embarked → fill with mode (most common port)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Cabin → drop (77% missing, not useful)
df.drop('Cabin', axis=1, inplace=True)

print('Missing values after handling:')
print(df.isnull().sum()[df.isnull().sum() > 0])
if df.isnull().sum().sum() == 0:
    print('→ No missing values remaining.')

**What we did:**
- **Age:** Filled with the median (28.0), which is less sensitive to extreme ages than the mean
- **Embarked:** Filled with "S" (Southampton) — the most common embarkation port
- **Cabin:** Dropped entirely — 77% of values are missing, making it unreliable
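
To see why the median is the safer fill value, here is a small sketch with hypothetical ages (the `ages` Series is invented for the example). Adding a single extreme value shifts the mean a lot but barely moves the median.

```python
import numpy as np
import pandas as pd

# Hypothetical ages for illustration; not taken from the dataset.
ages = pd.Series([22, 25, 28, 30, 35, np.nan, np.nan])

# Add one extreme value to see how each statistic reacts.
ages_with_outlier = pd.concat([ages, pd.Series([80])], ignore_index=True)

print('without outlier:', ages.mean(), ages.median())
print('with outlier:   ', ages_with_outlier.mean(), ages_with_outlier.median())

# fillna returns a new Series; missing entries take the median of the known ages.
filled = ages.fillna(ages.median())
print(filled.tolist())
```

Both `mean()` and `median()` skip missing values by default, so the statistics are computed from the known ages only.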

In [None]:
# ── 4.2 Drop Unnecessary Columns ────────────────────────────
# Name, Ticket, and PassengerId are identifiers or free text that we do not use as predictors here.

df.drop(['Name', 'Ticket', 'PassengerId'], axis=1, inplace=True)

print(f'Columns after dropping: {list(df.columns)}')
print(f'Shape: {df.shape}')

In [None]:
# ── 4.3 Create New Feature: FamilySize ─────────────────────
# SibSp + Parch = number of family members aboard with the passenger.
# The passenger is not counted, so FamilySize = 0 means traveling alone.

df['FamilySize'] = df['SibSp'] + df['Parch']

print('FamilySize feature created (SibSp + Parch)')
print(f'\nFamilySize distribution:')
print(df['FamilySize'].value_counts().sort_index())

In [None]:
# ── Visualize FamilySize vs Survival ────────────────────────

fig, ax = plt.subplots(figsize=(10, 5))
sns.countplot(x='FamilySize', hue='Survived', data=df,
              palette=colors, edgecolor='black', ax=ax)
ax.set_title('Survival by Family Size', fontsize=15, fontweight='bold')
ax.legend(['No', 'Yes'], title='Survived')
plt.tight_layout()
plt.show()

print('Passengers traveling alone (FamilySize=0) survived less often than those with 1-3 family members aboard.')

In [None]:
# ── 4.4 Encode Categorical Features ────────────────────────
# LabelEncoder assigns integer codes in sorted (alphabetical) order:
# Sex → 0 (female), 1 (male)
# Embarked → 0 (C), 1 (Q), 2 (S)

le = LabelEncoder()

df['Sex'] = le.fit_transform(df['Sex'])           # female=0, male=1
df['Embarked'] = le.fit_transform(df['Embarked']) # C=0, Q=1, S=2

print('Categorical encoding applied:')
print(f'  Sex      → {dict(zip(["female","male"], [0,1]))}')
print(f'  Embarked → {dict(zip(["C","Q","S"], [0,1,2]))}')
print(f'\nFinal columns: {list(df.columns)}')
df.head()

**Preprocessing Summary:**
- ✅ Missing values handled (Age → median, Embarked → mode, Cabin → dropped)
- ✅ Unnecessary columns dropped (Name, Ticket, PassengerId)
- ✅ New feature created: `FamilySize = SibSp + Parch`
- ✅ Categorical features encoded to numbers (Sex, Embarked)
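
Rather than typing the encoding by hand, the mapping can be read back from a fitted `LabelEncoder` through its `classes_` attribute. The sketch below uses a few sample values and one encoder per column; the names `sex_encoder` and `embarked_encoder` are illustrative and do not appear in the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Small illustrative sample of the two categorical columns.
sex_values = ['male', 'female', 'female', 'male']
embarked_values = ['S', 'C', 'Q', 'S']

# One encoder per column keeps each fitted mapping available later.
sex_encoder = LabelEncoder().fit(sex_values)
embarked_encoder = LabelEncoder().fit(embarked_values)

# classes_ is sorted, and each class's position is its integer code.
sex_mapping = {str(label): code for code, label in enumerate(sex_encoder.classes_)}
embarked_mapping = {str(label): code for code, label in enumerate(embarked_encoder.classes_)}

print(sex_mapping)
print(embarked_mapping)
```

Note that the integer codes for `Embarked` imply an order (C < Q < S) that has no real meaning; one-hot encoding would be an alternative if that matters for the model.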

---

## 5. Train/Test Split & Feature Scaling

In [None]:
# ── 5.1 Separate Features and Target ────────────────────────

X = df.drop('Survived', axis=1)
y = df['Survived']

print(f'Features: {list(X.columns)}')
print(f'Target: Survived')

In [None]:
# ── 5.2 Train/Test Split (80-20) ────────────────────────────

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f'Training set: {X_train.shape[0]} samples')
print(f'Testing set:  {X_test.shape[0]} samples')
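
Because the classes are moderately imbalanced, one optional refinement is a stratified split, which keeps the survived/not-survived ratio similar in both sets. This is a sketch on synthetic labels (`y_toy` and `X_toy` are placeholders, not the real data), and it is not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a roughly 62/38 class balance; not the real data.
y_toy = np.array([0] * 62 + [1] * 38)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

# stratify=y_toy asks the splitter to preserve the class proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(f'train size: {len(y_tr)}, share of class 1: {y_tr.mean():.3f}')
print(f'test size:  {len(y_te)}, share of class 1: {y_te.mean():.3f}')
```

To apply it to the notebook, the same `stratify=y` argument could be added to the existing `train_test_split` call.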

In [None]:
# ── 5.3 Feature Scaling ──────────────────────────────────────
# Logistic Regression is sensitive to feature scale (regularization, solver convergence).
# StandardScaler rescales each feature to mean=0, std=1, measured on the training set.

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)  # use same scaling as training!

print('Feature scaling applied (StandardScaler).')
print('Note: fit_transform on train, transform only on test — prevents data leakage.')
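
What the scaler does can be checked by hand: it applies z = (x - mean) / std with the mean and standard deviation learned from the training rows only. A minimal sketch with made-up numbers (the `_demo` arrays are not from the dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numbers standing in for two features such as Age and Fare.
X_train_demo = np.array([[20.0, 10.0], [30.0, 20.0], [40.0, 30.0], [50.0, 100.0]])
X_test_demo = np.array([[25.0, 15.0], [60.0, 50.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(X_train_demo)  # learns mean and std from train
test_scaled = scaler_demo.transform(X_test_demo)        # reuses those statistics

# Equivalent manual computation using the training statistics.
manual_test = (X_test_demo - scaler_demo.mean_) / scaler_demo.scale_

print(train_scaled.mean(axis=0))               # close to 0 for each column
print(train_scaled.std(axis=0))                # close to 1 for each column
print(np.allclose(test_scaled, manual_test))
```

The scaled test set will generally not have mean 0 and std 1 exactly, and that is expected, since it is scaled with the training statistics.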

---

## 6. Model Training

In [None]:
# ── 6.1 Logistic Regression ─────────────────────────────────
# Simple, fast, and interpretable baseline classifier.

lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train_scaled, y_train)

lr_pred = lr.predict(X_test_scaled)

print('Logistic Regression — trained.')

In [None]:
# ── 6.2 Random Forest Classifier ────────────────────────────
# Ensemble method: combines the votes of 100 decision trees, which usually overfits less than a single tree.
# Random Forest doesn't require feature scaling, but we use
# the scaled data for consistency.

rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train_scaled, y_train)

rf_pred = rf.predict(X_test_scaled)

print('Random Forest — trained.')

Both models were trained on the **scaled features**, and their predictions were made on the **test set**.

---

## 7. Model Evaluation

We evaluate using five metrics:

| Metric | What it tells us |
|--------|------------------|
| **Accuracy** | Overall % of correct predictions |
| **Precision** | Of those predicted as survived, how many actually survived? |
| **Recall** | Of those who actually survived, how many did we predict correctly? |
| **F1-Score** | Balance between Precision and Recall |
| **Confusion Matrix** | Visual breakdown of correct vs incorrect predictions |
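
All four scores in the table can be derived from the confusion-matrix counts. The sketch below works through the arithmetic on hypothetical labels and predictions (`y_true_demo` and `y_pred_demo` are invented for the example) and compares the result with scikit-learn's own functions.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

# Hypothetical labels and predictions, chosen only to keep the arithmetic easy to follow.
y_true_demo = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted survivors, how many really survived
recall = tp / (tp + fn)      # of real survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f'tn={tn} fp={fp} fn={fn} tp={tp}')
print(f'manual : {accuracy:.2f} {precision:.2f} {recall:.2f} {f1:.2f}')
print(f'sklearn: {accuracy_score(y_true_demo, y_pred_demo):.2f} '
      f'{precision_score(y_true_demo, y_pred_demo):.2f} '
      f'{recall_score(y_true_demo, y_pred_demo):.2f} '
      f'{f1_score(y_true_demo, y_pred_demo):.2f}')
```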

In [None]:
# ── 7.1 Evaluation Helper Function ──────────────────────────

def evaluate_model(name, y_true, y_pred):
    """Print all evaluation metrics for a model."""
    acc  = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred)
    rec  = recall_score(y_true, y_pred)
    f1   = f1_score(y_true, y_pred)
    
    print(f'\n{"=" * 50}')
    print(f'{name}')
    print(f'{"=" * 50}')
    print(f'  Accuracy:  {acc:.4f}  ({acc*100:.2f}%)')
    print(f'  Precision: {prec:.4f}')
    print(f'  Recall:    {rec:.4f}')
    print(f'  F1-Score:  {f1:.4f}')
    
    return {'Accuracy': acc, 'Precision': prec, 'Recall': rec, 'F1-Score': f1}

In [None]:
# ── 7.2 Evaluate Both Models ────────────────────────────────

lr_metrics = evaluate_model('Logistic Regression', y_test, lr_pred)
rf_metrics = evaluate_model('Random Forest', y_test, rf_pred)

In [None]:
# ── 7.3 Confusion Matrices ──────────────────────────────────

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

for ax, name, pred in [
    (axes[0], 'Logistic Regression', lr_pred),
    (axes[1], 'Random Forest', rf_pred)
]:
    cm = confusion_matrix(y_test, pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax,
                xticklabels=['Not Survived','Survived'],
                yticklabels=['Not Survived','Survived'],
                annot_kws={'fontsize': 14})
    ax.set_title(name, fontsize=14, fontweight='bold')
    ax.set_ylabel('Actual')
    ax.set_xlabel('Predicted')

plt.suptitle('Confusion Matrices', fontsize=17, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print('Reading guide:')
print('  Top-left  = Correctly predicted NOT survived')
print('  Top-right = Wrongly predicted survived (False Positive)')
print('  Bot-left  = Wrongly predicted not survived (False Negative)')
print('  Bot-right = Correctly predicted survived')

In [None]:
# ── 7.4 Classification Reports ──────────────────────────────

print('=' * 55)
print('Logistic Regression — Classification Report')
print('=' * 55)
print(classification_report(y_test, lr_pred,
                            target_names=['Not Survived', 'Survived']))

print('=' * 55)
print('Random Forest — Classification Report')
print('=' * 55)
print(classification_report(y_test, rf_pred,
                            target_names=['Not Survived', 'Survived']))

---

## 8. Model Comparison

In [None]:
# ── 8.1 Comparison Table ────────────────────────────────────

comparison = pd.DataFrame({
    'Logistic Regression': lr_metrics,
    'Random Forest': rf_metrics
}).round(4)

print('Final Model Comparison:')
print('=' * 55)
comparison

In [None]:
# ── 8.2 Visual Comparison ───────────────────────────────────

metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
fig, axes = plt.subplots(1, 4, figsize=(20, 5))

model_colors = ['#2980b9', '#c0392b']

for i, m in enumerate(metrics):
    ax = axes[i]
    vals = [lr_metrics[m], rf_metrics[m]]
    bars = ax.bar(['Logistic\nRegression', 'Random\nForest'],
                 vals, color=model_colors, edgecolor='black')
    ax.set_title(m, fontsize=14, fontweight='bold')
    ax.set_ylim(0, 1.15)
    for b, v in zip(bars, vals):
        ax.text(b.get_x()+b.get_width()/2, b.get_height()+0.02,
                f'{v:.4f}', ha='center', fontweight='bold', fontsize=12)

plt.suptitle('Model Performance Comparison',
             fontsize=18, fontweight='bold', y=1.04)
plt.tight_layout()
plt.show()

In [None]:
# ── 8.3 Feature Importance (Random Forest) ──────────────────

importances = pd.Series(rf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=True)

fig, ax = plt.subplots(figsize=(10, 6))
importances.plot(kind='barh', color='#2980b9', edgecolor='black', ax=ax)
ax.set_title('Feature Importance (Random Forest)', fontsize=15, fontweight='bold')
ax.set_xlabel('Importance')
plt.tight_layout()
plt.show()

print(f'\nTop 3 most important features:')
for feat, imp in importances.tail(3)[::-1].items():
    print(f'  {feat}: {imp:.4f}')

In [None]:
# ── 8.4 Best Model Selection ────────────────────────────────

best = 'Random Forest' if rf_metrics['Accuracy'] >= lr_metrics['Accuracy'] else 'Logistic Regression'
best_metrics = rf_metrics if best == 'Random Forest' else lr_metrics

print('Best Model:', best)
print(f'  Accuracy:  {best_metrics["Accuracy"]:.4f}')
print(f'  Precision: {best_metrics["Precision"]:.4f}')
print(f'  Recall:    {best_metrics["Recall"]:.4f}')
print(f'  F1-Score:  {best_metrics["F1-Score"]:.4f}')

---

## 9. Conclusion

### Key Findings

1. **Data Exploration:** The Titanic dataset has 891 passengers and 12 columns, including the `Survived` target. About 61.6% did not survive. Key survival factors: gender (females had a survival rate of about 74% vs about 19% for males), ticket class, and age.

2. **Preprocessing:** We handled missing values (Age → median, Embarked → mode, Cabin → dropped), encoded categorical variables, and created a new `FamilySize` feature.

3. **Models:** Both Logistic Regression and Random Forest were trained and evaluated.

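4. **Results:** Random Forest can capture non-linear relationships and feature interactions that Logistic Regression cannot, so it often has an edge; the comparison table and charts in Section 8 show the actual scores for this split. Logistic Regression remains a useful, interpretable baseline.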

5. **Feature Importance:** `Sex`, `Fare`, and `Age` tend to be the most important features for predicting survival. The weight on `Sex` and `Age` is consistent with the historical "women and children first" policy, while `Fare` is closely tied to ticket class.

### What We Learned

- Always explore and visualize data before building models
- Handle missing values thoughtfully — median for numerical, mode for categorical
- Feature engineering (like `FamilySize`) can provide additional predictive power
- Compare multiple models to find the best performer
- Feature scaling matters for Logistic Regression (it affects regularization and solver convergence), while tree-based models such as Random Forest do not need it
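
One way to make the model comparison less dependent on a single 80/20 split is k-fold cross-validation, which this notebook does not use. The sketch below is only an illustration of the idea, run on synthetic data from `make_classification`; the names `X_syn`, `y_syn`, and `candidates` are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; swap in the notebook's X and y to use it for real.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=42
)

candidates = {
    # The pipeline refits the scaler inside each fold, so no fold sees held-out statistics.
    'Logistic Regression': make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    'Random Forest': RandomForestClassifier(
        n_estimators=50, max_depth=10, random_state=42
    ),
}

cv_results = {}
for name, model in candidates.items():
    # Five accuracy scores per model, one for each held-out fold.
    scores = cross_val_score(model, X_syn, y_syn, cv=5, scoring='accuracy')
    cv_results[name] = scores
    print(f'{name}: mean={scores.mean():.3f}, std={scores.std():.3f}')
```

The scores printed here describe the synthetic data only and say nothing about which model is better on the Titanic data.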

---

*Project by **Piyush Ramteke** — CodSoft Data Science Internship*