# Titanic Dataset Exploration & Modeling

This notebook performs exploratory data analysis (EDA), feature engineering, and builds machine learning models to predict survival (target variable `Survived`) using the provided Titanic dataset CSV file (`Titanic-Dataset.csv`).

## Outline
1. Imports & Settings
2. Load Data
3. Basic Info & Missingness
4. Target Distribution & EDA
5. Feature Engineering
6. Modeling
7. Feature Importance / Coefficients
8. Optional: Hyperparameter Tuning
9. Simple Prediction Function
10. (Optional) Persist the Final Model
11. Next Steps / Ideas

You can execute cells sequentially. Adjust or extend as needed.

In [None]:
# 1. Imports & Settings
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pathlib import Path
import re
import warnings
warnings.filterwarnings('ignore')
sns.set_theme(style='whitegrid')

pd.options.display.max_columns = 100
DATA_PATH = Path('.')  # Adjust if needed
CSV_FILE = DATA_PATH / 'Titanic-Dataset.csv'

In [None]:
# 2. Load Data
df = pd.read_csv(CSV_FILE)
print(f"Shape: {df.shape}")
df.head()

### Initial Observations
- Target: `Survived` (0 = No, 1 = Yes)
- Categorical features include: `Gender`, `Embarked`, `Cabin`, `Ticket`
- Potential to extract signal from `Name` (Title), `Ticket` (group size), `Cabin` (Deck), family relations (`SibSp`, `Parch`).
- Missing values expected in `Age`, `Cabin`, possibly `Embarked`.

In [None]:
# 3. Basic Info & Missingness
df.info()  # prints its summary directly and returns None, so no display() wrapper
display(df.describe(include='number').T)
display(df.describe(include='object').T)

missing = df.isna().mean().sort_values(ascending=False)
print('Missing value ratio:')
missing.to_frame('missing_ratio').head(15)

In [None]:
# 4. Target Distribution
survival_rate = df['Survived'].mean()
print(f"Overall survival rate: {survival_rate:.2%}")
sns.countplot(data=df, x='Survived')
plt.title('Survival Count')
plt.show()

sns.barplot(x='Gender', y='Survived', data=df)
plt.title('Survival Rate by Gender')
plt.show()

sns.barplot(x='Pclass', y='Survived', data=df)
plt.title('Survival Rate by Passenger Class')
plt.show()

### Continuous Variables vs Survival
We inspect distributions and potential separability of Age and Fare by survival outcome.
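
As a numeric complement to the density plots, a group-wise summary can make the separation (or lack of it) concrete. The sketch below uses a small made-up frame named `toy` rather than the real data; the same `groupby` call should apply to the notebook's `df` unchanged.

```python
import pandas as pd

# Toy stand-in for the Titanic frame (illustrative values, not real passengers).
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 1],
    "Age": [22.0, 35.0, None, 4.0, 28.0, 38.0],
    "Fare": [7.25, 8.05, 8.46, 16.70, 26.55, 71.28],
})

# The median is robust to the long right tail of Fare; missing ages are skipped.
summary = toy.groupby("Survived")[["Age", "Fare"]].median()
print(summary)
```

Medians are used here instead of means because a handful of very high fares would otherwise dominate the comparison.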

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12,4))
sns.kdeplot(data=df, x='Age', hue='Survived', common_norm=False, ax=axes[0])
axes[0].set_title('Age Distribution by Survival')
sns.kdeplot(data=df[df['Fare'] < 200], x='Fare', hue='Survived', common_norm=False, ax=axes[1])
axes[1].set_title('Fare (<200) Distribution by Survival')
plt.tight_layout()
plt.show()

### Correlation Heatmap
The heatmap below shows correlations among the raw numeric features only; engineered features are added later.

In [None]:
numeric_cols = ['Survived','Age','SibSp','Parch','Fare','Pclass']
corr = df[numeric_cols].corr()
plt.figure(figsize=(6,4))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap (Raw Numeric)')
plt.show()

## 5. Feature Engineering
We will create the following engineered features:
- `Title` extracted from `Name`
- Group rare titles into 'Rare'
- `FamilySize = SibSp + Parch + 1`
- `IsAlone = 1 if FamilySize == 1 else 0`
- `TicketGroupSize` (count of same Ticket)
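- `Deck` extracted from `Cabin` (first letter), with missing cabins coded as 'M'
- `HasCabin` binary flag
- `FarePerPerson = Fare / FamilySize`
- `AgeBucket` (binned `Age`; kept for exploration only, not passed to the models)
- Interaction: `Age*Class`
- `TicketPrefix` (leading non-numeric token of `Ticket`, or `NOPREFIX` if there is none)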

We'll keep raw columns for reference but pass only selected ones to the model pipeline.
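
To make the string-based and arithmetic features concrete, here is a minimal worked example on a few hand-picked inputs. The helper name `title_of` and the sample passenger values are illustrative assumptions; the regular expression is the one used by the notebook's `extract_title`.

```python
import re

# Text between the comma and the first period, e.g. ", Mr." -> "Mr".
TITLE_RE = re.compile(r",\s*([^.]+)\.")

def title_of(name: str) -> str:
    """Return the honorific embedded in a 'Surname, Title. Given names' string."""
    match = TITLE_RE.search(name)
    return match.group(1).strip() if match else "Unknown"

names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "NoTitleHere",
]
titles = [title_of(n) for n in names]
print(titles)  # ['Mr', 'Mrs', 'the Countess', 'Unknown']

# Family features for one hypothetical passenger.
sibsp, parch, fare = 1, 2, 40.0
family_size = sibsp + parch + 1          # the passenger counts as one member
is_alone = int(family_size == 1)
fare_per_person = fare / family_size
print(family_size, is_alone, fare_per_person)  # 4 0 10.0
```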

In [None]:
def extract_title(name: str):
    match = re.search(r',\s*([^\.]+)\.', name)
    if match:
        return match.group(1).strip()
    return 'Unknown'

df['Title'] = df['Name'].apply(extract_title)

# Map rare titles
title_counts = df['Title'].value_counts()
rare_titles = title_counts[title_counts < 10].index
df['Title'] = df['Title'].replace(rare_titles, 'Rare')

# Family features
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)

# Ticket group size
ticket_counts = df['Ticket'].value_counts()
df['TicketGroupSize'] = df['Ticket'].map(ticket_counts)

# Deck extraction
def extract_deck(cabin):
    if pd.isna(cabin) or cabin.strip() == '':
        return 'M'  # Missing
    return cabin.strip()[0]

df['Deck'] = df['Cabin'].apply(extract_deck)
df['HasCabin'] = (~df['Cabin'].isna() & (df['Cabin'].str.strip() != '')).astype(int)

# Fare per person
df['FarePerPerson'] = df['Fare'] / df['FamilySize']

# Age bucket (exploration only; rows with missing Age stay NaN).
# right=False makes bins [low, high), so the last bin is open-ended to keep an 80-year-old.
df['AgeBucket'] = pd.cut(df['Age'], bins=[0,5,12,18,30,45,60,np.inf], right=False)

# Interaction
df['Age*Class'] = df['Age'] * df['Pclass']

# Ticket prefix
def ticket_prefix(t):
    t = str(t)
    t = t.replace('.', '').replace('/', '').strip()
    parts = t.split()
    if parts and not parts[0].isdigit():
        return parts[0]
    return 'NOPREFIX'
df['TicketPrefix'] = df['Ticket'].apply(ticket_prefix)

print('Engineered columns added.')
df.head(3)
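
If `AgeBucket` should be defined for every passenger, `Age` has to be filled before binning. The sketch below shows one possible approach, a per-`Title` median fill, on a made-up frame named `toy`. This is an assumption for illustration: the modeling pipeline later in the notebook uses a global median via `SimpleImputer` instead, and computing medians on the full dataset would leak information across cross-validation folds.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only).
toy = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Master"],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan, 4.0],
})

# Fill each missing Age with the median Age of passengers sharing the same Title.
title_median = toy.groupby("Title")["Age"].transform("median")
toy["AgeFilled"] = toy["Age"].fillna(title_median)

# Left-closed bins with an open-ended last bin, so no age falls outside the range.
bins = [0, 5, 12, 18, 30, 45, 60, np.inf]
toy["AgeBucket"] = pd.cut(toy["AgeFilled"], bins=bins, right=False)
print(toy)
```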

### Examine Engineered Features vs Survival

In [None]:
fig, axes = plt.subplots(2,3, figsize=(15,8))
sns.barplot(x='Title', y='Survived', data=df, ax=axes[0,0])
axes[0,0].tick_params(axis='x', rotation=45)
sns.barplot(x='Deck', y='Survived', data=df, ax=axes[0,1])
sns.barplot(x='IsAlone', y='Survived', data=df, ax=axes[0,2])
sns.barplot(x='FamilySize', y='Survived', data=df, ax=axes[1,0])
sns.barplot(x='TicketGroupSize', y='Survived', data=df, ax=axes[1,1])
sns.barplot(x='HasCabin', y='Survived', data=df, ax=axes[1,2])
axes[1,0].tick_params(axis='x', rotation=0)  # keep FamilySize tick labels horizontal
plt.tight_layout()
plt.show()

## 6. Modeling
We'll build a preprocessing pipeline using `ColumnTransformer` and test multiple models:
- Logistic Regression (baseline)
- Random Forest
- Gradient Boosting (scikit-learn's `GradientBoostingClassifier`; XGBoost or LightGBM could be swapped in if installed)
- Extra Trees

We will:
1. Split features & target
2. Define numeric & categorical columns
3. Create pipelines with imputation, scaling, one-hot encoding
4. Cross-validate models
5. Compare metrics (Accuracy, ROC AUC, F1)

Note: For reproducibility, every model and the CV splitter use `random_state=42`.
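
Because fewer passengers survived than died, accuracy and F1 can tell different stories. The following pure-Python sketch computes both from confusion counts on made-up labels; the function name `accuracy_and_f1` is hypothetical, and in the notebook itself these metrics come from scikit-learn's scorers.

```python
def accuracy_and_f1(y_true, y_pred):
    """Compute accuracy and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1

# A classifier that finds only one of three survivors still looks fine on accuracy.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, f1={f1:.2f}")  # accuracy=0.75, f1=0.50
```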

In [None]:
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier

# Feature selection for modeling
target = 'Survived'
feature_cols = [
    'Pclass','Gender','Age','SibSp','Parch','Fare','Embarked',
    'Title','FamilySize','IsAlone','TicketGroupSize','Deck','HasCabin',
    'FarePerPerson','Age*Class','TicketPrefix'
]

X = df[feature_cols].copy()
y = df[target].copy()

numeric_features = ['Age','SibSp','Parch','Fare','FamilySize','TicketGroupSize','FarePerPerson','Age*Class','Pclass']
categorical_features = ['Gender','Embarked','Title','Deck','IsAlone','HasCabin','TicketPrefix']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

models = {
    'LogReg': LogisticRegression(max_iter=1000, C=1.0, random_state=42),
    'RandomForest': RandomForestClassifier(n_estimators=400, max_depth=None, random_state=42),
    'GradientBoosting': GradientBoostingClassifier(random_state=42),
    'ExtraTrees': ExtraTreesClassifier(n_estimators=500, random_state=42)
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = []
for name, model in models.items():
    pipe = Pipeline(steps=[('prep', preprocessor), ('clf', model)])
    scores = cross_validate(pipe, X, y, cv=cv, scoring=['accuracy','roc_auc','f1'], n_jobs=-1)
    results.append({
        'model': name,
        'accuracy_mean': scores['test_accuracy'].mean(),
        'roc_auc_mean': scores['test_roc_auc'].mean(),
        'f1_mean': scores['test_f1'].mean()
    })

results_df = pd.DataFrame(results).sort_values(by='roc_auc_mean', ascending=False)
results_df
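
Fold means alone can hide how much the scores vary between folds. As a hedged sketch, the snippet below reports mean and standard deviation per metric on synthetic data from `make_classification`; the names `X_demo`, `y_demo`, and `spread` are assumptions for illustration, and the same pattern could be applied to the `scores` dictionary in the loop above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data so the example runs on its own.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

metrics = ["accuracy", "roc_auc", "f1"]
scores_demo = cross_validate(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv_demo, scoring=metrics
)

# Mean and standard deviation across the 5 folds for each metric.
spread = {
    m: (scores_demo[f"test_{m}"].mean(), scores_demo[f"test_{m}"].std())
    for m in metrics
}
for m, (mean, std) in spread.items():
    print(f"{m}: {mean:.3f} (std {std:.3f})")
```

If two models differ by less than the fold-to-fold standard deviation, the ranking between them is probably not reliable.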

### Model Selection
We select the model with the highest mean cross-validated ROC AUC, refit it on the full dataset, and inspect feature importances (for tree-based models) or coefficients (for Logistic Regression).

In [None]:
# Choose best model by ROC AUC
best_model_name = results_df.iloc[0]['model']
best_model = models[best_model_name]
print(f"Best model selected: {best_model_name}")

final_pipeline = Pipeline(steps=[('prep', preprocessor), ('clf', best_model)])
final_pipeline.fit(X, y)

## 7. Feature Importance / Coefficients
We extract the processed feature names and show importance for tree models or coefficients for logistic regression. Note that after one-hot encoding, features expand.

In [None]:
def get_feature_names(preprocessor):
    num_feats = numeric_features
    cat_pipeline = preprocessor.named_transformers_['cat']
    ohe = cat_pipeline.named_steps['onehot']
    cat_ohe_feats = ohe.get_feature_names_out(categorical_features)
    return list(num_feats) + list(cat_ohe_feats)

# Read names from the preprocessor that was fitted inside the final pipeline
feature_names = get_feature_names(final_pipeline.named_steps['prep'])

importances_df = None
clf = final_pipeline.named_steps['clf']
if hasattr(clf, 'feature_importances_'):
    importances_df = pd.DataFrame({
        'feature': feature_names,
        'importance': clf.feature_importances_
    }).sort_values('importance', ascending=False)
elif hasattr(clf, 'coef_'):
    coef = clf.coef_[0]
    importances_df = pd.DataFrame({
        'feature': feature_names,
        'importance': coef
    }).assign(abs_importance=lambda d: d['importance'].abs()).sort_values('abs_importance', ascending=False)

importances_df.head(20)

In [None]:
# Plot the top 20 importances (tree models) or coefficients (logistic regression)
if hasattr(clf, 'feature_importances_'):
    top_imp = importances_df.head(20)
    plt.figure(figsize=(8,6))
    sns.barplot(y='feature', x='importance', data=top_imp)
    plt.title(f'Top 20 Feature Importances ({best_model_name})')
    plt.tight_layout()
    plt.show()
elif hasattr(clf, 'coef_'):
    top_coef = importances_df.head(20)
    plt.figure(figsize=(8,6))
    sns.barplot(y='feature', x='importance', data=top_coef)
    plt.title(f'Top Coefficients ({best_model_name})')
    plt.tight_layout()
    plt.show()

## 8. Optional: Hyperparameter Tuning (Example for RandomForest)
This cell runs a small grid search, and only when RandomForest was selected as the best model above; otherwise it is skipped. You can expand the parameter grid for deeper optimization, at the cost of longer runtime.
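
Before launching a search it helps to count how many model fits it implies. The sketch below does that arithmetic for the grid values and `cv=3` used in the tuning cell; the `+ 1` assumes `GridSearchCV`'s default `refit=True`, and the name `param_grid_demo` is only for illustration.

```python
from itertools import product

param_grid_demo = {
    "clf__n_estimators": [200, 400],
    "clf__max_depth": [None, 6, 10],
    "clf__min_samples_split": [2, 5],
}
n_folds = 3

# Every combination of one value per parameter is a candidate.
candidates = list(product(*param_grid_demo.values()))
n_candidates = len(candidates)            # 2 * 3 * 2 = 12
n_fits = n_candidates * n_folds + 1       # 36 CV fits plus one final refit
print(n_candidates, n_fits)  # 12 37
```

Adding a fourth parameter with three values would triple the count, which is why grids grow expensive quickly.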

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'clf__n_estimators': [200, 400],
    'clf__max_depth': [None, 6, 10],
    'clf__min_samples_split': [2, 5]
} if best_model_name == 'RandomForest' else None

if param_grid:
    tuning_pipeline = Pipeline(steps=[('prep', preprocessor), ('clf', RandomForestClassifier(random_state=42))])
    grid = GridSearchCV(tuning_pipeline, param_grid=param_grid, scoring='roc_auc', cv=3, n_jobs=-1)
    grid.fit(X, y)
    print('Best params:', grid.best_params_)
    print('Best ROC AUC:', grid.best_score_)
else:
    print('Skipping RF tuning (best model is not RandomForest).')

## 9. Simple Prediction Function
Utility to predict survival probability for a small sample (manually crafted or subset).
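
A manually crafted sample has to supply every engineered column the pipeline was trained on, not just the raw fields. The helper below, `make_feature_row`, is a hypothetical convenience for building such a row; its defaults (no cabin, no ticket prefix, travelling on a unique ticket) and the `"male"` value for `Gender` are assumptions about how the data is coded.

```python
def make_feature_row(pclass, gender, age, sibsp, parch, fare, embarked, title,
                     deck="M", ticket_group_size=1, ticket_prefix="NOPREFIX"):
    """Build one dict holding the 16 model features from raw passenger fields."""
    family_size = sibsp + parch + 1
    return {
        "Pclass": pclass, "Gender": gender, "Age": age,
        "SibSp": sibsp, "Parch": parch, "Fare": fare, "Embarked": embarked,
        "Title": title,
        "FamilySize": family_size,
        "IsAlone": int(family_size == 1),
        "TicketGroupSize": ticket_group_size,
        "Deck": deck,
        "HasCabin": int(deck != "M"),     # 'M' is the notebook's missing-cabin code
        "FarePerPerson": fare / family_size,
        "Age*Class": age * pclass,
        "TicketPrefix": ticket_prefix,
    }

row = make_feature_row(pclass=3, gender="male", age=30.0, sibsp=0, parch=0,
                       fare=8.05, embarked="S", title="Mr")
print(row["FamilySize"], row["IsAlone"], row["Age*Class"])  # 1 1 90.0
```

In the notebook, such a row could then be scored with `predict_samples(pd.DataFrame([row])[feature_cols])`.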

In [None]:
def predict_samples(sample_df: pd.DataFrame):
    return final_pipeline.predict_proba(sample_df)[:,1]

# Example: score the first 5 passengers (X already excludes the target)
sample = X.head(5)
probs = predict_samples(sample)
pd.DataFrame({'PassengerIndex': sample.index, 'Predicted_Survival_Prob': probs})

## 10. (Optional) Persist the Final Model
Uncomment the code below to save the pipeline for later use (requires joblib).

In [None]:
# from joblib import dump
# dump(final_pipeline, 'titanic_model_pipeline.joblib')
# print('Model saved to titanic_model_pipeline.joblib')
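
A saved pipeline is only useful if it reloads and predicts identically. The sketch below checks that round trip on a small stand-in pipeline trained on synthetic data, writing to a temporary directory; the names `pipe_demo` and `demo_pipeline.joblib` are illustrative, and the same `dump`/`load` calls should work for `final_pipeline`.

```python
import tempfile
from pathlib import Path

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in pipeline so the example runs without the Titanic CSV.
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
pipe_demo = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_pipeline.joblib"
    dump(pipe_demo, str(path))
    restored = load(str(path))

# The restored pipeline should reproduce the original probabilities.
same = np.allclose(pipe_demo.predict_proba(X_demo), restored.predict_proba(X_demo))
print(same)
```

Saved pipelines are generally tied to the library versions that produced them, so it is worth recording the scikit-learn version alongside the file.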

## 11. Next Steps / Ideas
- Try alternative algorithms (e.g., XGBoost, LightGBM, CatBoost)
- Use repeated stratified cross-validation (`RepeatedStratifiedKFold`) for more stable estimates
- Calibrate probabilities (CalibratedClassifierCV)
- Perform feature selection or SHAP analysis for interpretability
- Evaluate with additional metrics (precision-recall curves)
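- Handle potential leakage (`TicketGroupSize` and the rare-title grouping are computed on the full dataset before cross-validation) and optimize engineered features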
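
As a starting point for the repeated stratified splits idea, here is a minimal sketch on synthetic data. The choice of 5 splits and 3 repeats is an assumption, not a recommendation from the notebook; in practice the notebook's `pipe`, `X`, and `y` would replace the stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in data so the example runs on its own.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# 5 folds repeated 3 times with different shuffles gives 15 scores.
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
auc_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=rskf, scoring="roc_auc"
)
print(len(auc_scores), round(auc_scores.mean(), 3))
```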

This notebook provides a solid baseline for experimentation.