# 🚢 Titanic Survival Prediction

**Author:** Piyush Ramteke  
**Program:** CodSoft Data Science Internship  

---

## 1️⃣ Problem Statement

The sinking of the **Titanic** in 1912 is one of the deadliest maritime disasters in history. Out of 2,224 passengers and crew, more than 1,500 lost their lives.

**🎯 Objective:** Build a machine learning model to predict whether a passenger **survived or not** based on features like age, gender, ticket class, fare, and family size.

This is a **binary classification** problem:
- **0** → 💀 Did not survive
- **1** → 🏆 Survived
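
Because the target is binary, a useful reference point for any model is the accuracy obtained by always predicting the more frequent class. The sketch below is a minimal illustration on made-up labels, not the real dataset; `majority_baseline_accuracy` and `toy_labels` are hypothetical names introduced only for this example.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent class."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)


# Hypothetical toy labels (0 = did not survive, 1 = survived), NOT the real data
toy_labels = [0, 0, 0, 1, 1, 0, 1, 0]

baseline = majority_baseline_accuracy(toy_labels)
print(f"Majority-class baseline accuracy: {baseline:.3f}")
```

A trained classifier is only informative to the extent that it beats this number on the same labels.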

## 2️⃣ Import Libraries & Load Dataset 📦

In [None]:
# ── Import Libraries ─────────────────────────────────────────

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from ipywidgets import interact, widgets
import warnings

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report,
    precision_score, recall_score, f1_score, roc_auc_score, roc_curve
)
import xgboost as xgb
import lightgbm as lgb

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 12

print('✅ All libraries loaded with Advanced ML capabilities!')

✅ All libraries loaded with Advanced ML capabilities!


In [28]:
# ── Load Dataset ─────────────────────────────────────────────

df = pd.read_csv('Titanic-Dataset.csv')

print(f'📂 Dataset Shape: {df.shape[0]} rows × {df.shape[1]} columns')
df.head()

📂 Dataset Shape: 891 rows × 12 columns


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


**📋 Column Descriptions:**

| Feature | Description |
|---------|-------------|
| `PassengerId` | Unique ID for each passenger |
| `Survived` | **Target**: 0 = No, 1 = Yes |
| `Pclass` | Ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd |
| `Name` | Passenger name |
| `Sex` | Gender |
| `Age` | Age in years |
| `SibSp` | Number of siblings/spouses aboard |
| `Parch` | Number of parents/children aboard |
| `Ticket` | Ticket number |
| `Fare` | Ticket fare |
| `Cabin` | Cabin number |
| `Embarked` | Port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton |

---

## 3️⃣ Exploratory Data Analysis (Interactive) 📊

In [29]:
# ── 3.1 Dataset Info ─────────────────────────────────────────

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [30]:
# ── 3.2 Missing Values Summary ───────────────────────────────

missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)

missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Percentage (%)': missing_pct
}).sort_values('Missing Count', ascending=False)

print('üîç Missing Values Summary:')
print('=' * 40)
missing_df[missing_df['Missing Count'] > 0]

üîç Missing Values Summary:


| | Missing Count | Percentage (%) |
|---|---|---|
| Cabin | 687 | 77.1 |
| Age | 177 | 19.87 |
| Embarked | 2 | 0.22 |


**🕵️ Observations:**
- **Cabin**: 77% missing → we'll extract the deck letter and mark missing cabins as 'U' (unknown) 🚪
- **Age**: 19.9% missing → we'll impute with the median age of passengers who share the same title 🎂
- **Embarked**: only 2 missing → we'll fill with the most common port ⚓
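
The title-based age imputation mentioned in the observations can be sketched in plain Python. This is a minimal illustration on assumed toy rows, not the notebook's actual implementation; `impute_age_by_title` and `toy_rows` are hypothetical names, and the notebook itself uses a pandas `groupby('Title')` median for the same idea.

```python
from statistics import median


def impute_age_by_title(rows):
    """Fill missing 'Age' with the median age of rows sharing the same 'Title'.

    Rows whose title has no known ages are left with Age = None.
    """
    ages_by_title = {}
    for row in rows:
        if row["Age"] is not None:
            ages_by_title.setdefault(row["Title"], []).append(row["Age"])
    medians = {title: median(ages) for title, ages in ages_by_title.items()}

    filled = []
    for row in rows:
        age = row["Age"] if row["Age"] is not None else medians.get(row["Title"])
        filled.append({**row, "Age": age})
    return filled


# Hypothetical toy rows, NOT taken from the real dataset
toy_rows = [
    {"Title": "Mr", "Age": 22},
    {"Title": "Mr", "Age": 40},
    {"Title": "Mr", "Age": None},
    {"Title": "Miss", "Age": 26},
    {"Title": "Miss", "Age": None},
]

for row in impute_age_by_title(toy_rows):
    print(row)
```

Grouping by title tends to give more plausible ages than a single global median, since a title such as "Master" typically implies a child while "Mr" implies an adult.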

In [31]:
# ── 3.3 Interactive Survival Distribution ────────────────────

surv_counts = df['Survived'].value_counts().reset_index()
surv_counts.columns = ['Survived', 'Count']
surv_counts['Label'] = surv_counts['Survived'].map({0: 'Did Not Survive', 1: 'Survived'})

# Creative Interactive Pie Chart
fig = px.pie(surv_counts, values='Count', names='Label', 
             color='Label', 
             color_discrete_map={'Did Not Survive':'#EF553B', 'Survived':'#00CC96'},
             title='üìä Survival Distribution (Interactive)',
             hole=0.4)
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

In [32]:
# ── 3.4 Interactive: Survival by Feature ─────────────────────

@interact(Feature=['Pclass', 'Sex', 'Embarked', 'SibSp', 'Parch'])
def plot_survival_by_feature(Feature):
    fig = px.histogram(df, x=Feature, color='Survived', 
                       barmode='group',
                       color_discrete_map={0: '#EF553B', 1: '#00CC96'},
                       title=f'Survival Count by {Feature}',
                       text_auto=True)
    fig.update_layout(bargap=0.2)
    fig.show()


In [33]:
# ── 3.5 Interactive Age Distribution ─────────────────────────

fig = px.histogram(df, x='Age', color='Survived', 
                   nbins=30, 
                   color_discrete_map={0: '#EF553B', 1: '#00CC96'},
                   title='üéÇ Age Distribution by Survival Status',
                   marginal='box',
                   opacity=0.7)
fig.update_layout(barmode='overlay')
fig.update_traces(marker_line_width=1, marker_line_color='black')
fig.show()

---

## 4️⃣ Data Preprocessing 🛠️

In [None]:
# ── 4.1 Advanced Feature Engineering ─────────────────────────

# Extract Title from Name
def extract_title(name):
    title = name.split(',')[1].split('.')[0].strip()
    # Group rare titles
    title_mapping = {
        'Mr': 'Mr',
        'Miss': 'Miss',
        'Mrs': 'Mrs',
        'Master': 'Master',
        'Dr': 'Rare',
        'Rev': 'Rare',
        'Col': 'Rare',
        'Major': 'Rare',
        'Mlle': 'Miss',
        'Countess': 'Rare',
        'Ms': 'Miss',
        'Lady': 'Rare',
        'Jonkheer': 'Rare',
        'Don': 'Rare',
        'Dona': 'Rare',
        'Mme': 'Mrs',
        'Capt': 'Rare',
        'Sir': 'Rare'
    }
    # Any title not listed above is also treated as 'Rare'
    return title_mapping.get(title, 'Rare')

df['Title'] = df['Name'].apply(extract_title)

print('✅ Title extracted from Name')
print(df['Title'].value_counts())

✅ Missing values handled!


PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

In [None]:
# ── 4.2 Extract Deck from Cabin ──────────────────────────────

# Extract first letter (deck) from Cabin; 'U' marks an unknown cabin
df['Deck'] = df['Cabin'].apply(lambda x: str(x)[0] if pd.notna(x) else 'U')

print('✅ Deck extracted from Cabin')
print(df['Deck'].value_counts())

✅ Feature Engineering complete. New column: FamilySize


In [None]:
# ── 4.3 Smart Age Imputation ─────────────────────────────────

# Fill Embarked first (assign the result back rather than calling fillna
# with inplace=True on a column selection, which newer pandas versions warn about)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Fill missing Age based on Title median
age_title_median = df.groupby('Title')['Age'].median()
for title in df['Title'].unique():
    mask = (df['Title'] == title) & (df['Age'].isnull())
    df.loc[mask, 'Age'] = age_title_median[title]

print('✅ Smart Age Imputation completed!')
print(f'Age filled based on Title medians: {age_title_median.to_dict()}')

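✅ Categorical encoding applied!
  Sex      → {'female': 0, 'male': 1}
  Embarked → {'C': 0, 'Q': 1, 'S': 2}


| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | FamilySize |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 22.0 | 1 | 0 | 7.25 | 2 | 1 |
| 1 | 1 | 1 | 0 | 38.0 | 1 | 0 | 71.2833 | 0 | 1 |
| 2 | 1 | 3 | 0 | 26.0 | 0 | 0 | 7.925 | 2 | 0 |
| 3 | 1 | 1 | 0 | 35.0 | 1 | 0 | 53.1 | 2 | 1 |
| 4 | 0 | 3 | 1 | 35.0 | 0 | 0 | 8.05 | 2 | 0 |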


In [None]:
# ── 4.4 Create Binned Features ───────────────────────────────

# Age Bins
df['AgeBin'] = pd.cut(df['Age'], bins=[0, 12, 18, 35, 60, 100],
                      labels=['Child', 'Teen', 'Adult', 'Middle', 'Senior'])

# Fare Bins
df['FareBin'] = pd.qcut(df['Fare'], q=4, labels=['Low', 'Medium', 'High', 'VeryHigh'], duplicates='drop')

print('✅ Age and Fare binning complete!')


In [None]:
# ── 4.5 Additional Feature Engineering ───────────────────────

# Create FamilySize (relatives aboard, not counting the passenger)
df['FamilySize'] = df['SibSp'] + df['Parch']

# Create IsAlone feature
df['IsAlone'] = (df['FamilySize'] == 0).astype(int)

print('✅ FamilySize and IsAlone features created!')

In [None]:
# ── 4.6 Encode Categorical Features ──────────────────────────

# Label encode Sex
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])  # female=0, male=1

# One-Hot Encode Embarked (no ordinal relationship)
df = pd.get_dummies(df, columns=['Embarked'], prefix='Embarked', drop_first=False)

# One-Hot Encode Title
df = pd.get_dummies(df, columns=['Title'], prefix='Title', drop_first=False)

# One-Hot Encode Deck
df = pd.get_dummies(df, columns=['Deck'], prefix='Deck', drop_first=False)

# One-Hot Encode AgeBin
df = pd.get_dummies(df, columns=['AgeBin'], prefix='AgeBin', drop_first=False)

# One-Hot Encode FareBin
df = pd.get_dummies(df, columns=['FareBin'], prefix='FareBin', drop_first=False)

# Drop irrelevant columns
df.drop(columns=['Name', 'Ticket', 'PassengerId', 'Cabin'], inplace=True)

print('✅ Categorical encoding applied!')
print(f'Final feature count: {df.shape[1] - 1} features')
df.head()

---

## 5️⃣ Train/Test Split & Feature Scaling 🧪

In [37]:
# ── 5.1 Separate Features and Target ─────────────────────────

X = df.drop('Survived', axis=1)
y = df['Survived']

print(f'Features: {list(X.columns)}')
print(f'Target: Survived')

Features: ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'FamilySize']
Target: Survived


In [38]:
# ── 5.2 Train/Test Split (80-20) ─────────────────────────────

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f'Training set: {X_train.shape[0]} samples')
print(f'Testing set:  {X_test.shape[0]} samples')

Training set: 712 samples
Testing set:  179 samples


In [39]:
# ── 5.3 Feature Scaling ──────────────────────────────────────

# Fit the scaler on the training split only, then reuse it for the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

print('✅ Features Scaled (StandardScaler).')

✅ Features Scaled (StandardScaler).


---

In [None]:
# ── 6.3 XGBoost with GridSearchCV ────────────────────────────

print('🔍 Tuning XGBoost...')

param_grid_xgb = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1],
    'subsample': [0.8, 1.0]
}

xgb_grid = GridSearchCV(xgb.XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss'),
                        param_grid_xgb, cv=5, scoring='accuracy', n_jobs=-1)
xgb_grid.fit(X_train_scaled, y_train)

xgb_model = xgb_grid.best_estimator_
xgb_pred = xgb_model.predict(X_test_scaled)
xgb_proba = xgb_model.predict_proba(X_test_scaled)[:, 1]

print(f'✅ XGBoost Trained with Best Params: {xgb_grid.best_params_}')

---

In [None]:
# ── 6.4 LightGBM with GridSearchCV ───────────────────────────

print('🔍 Tuning LightGBM...')

param_grid_lgb = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1],
    'num_leaves': [31, 50]
}

lgb_grid = GridSearchCV(lgb.LGBMClassifier(random_state=42, verbose=-1),
                        param_grid_lgb, cv=5, scoring='accuracy', n_jobs=-1)
lgb_grid.fit(X_train_scaled, y_train)

lgb_model = lgb_grid.best_estimator_
lgb_pred = lgb_model.predict(X_test_scaled)
lgb_proba = lgb_model.predict_proba(X_test_scaled)[:, 1]

print(f'✅ LightGBM Trained with Best Params: {lgb_grid.best_params_}')

## 6️⃣ Model Training with Hyperparameter Tuning 🤖

In [None]:
# ── 7.3 Interactive Confusion Matrix ─────────────────────────

@interact(Model=['Logistic Regression', 'Random Forest', 'XGBoost', 'LightGBM'])
def plot_conf_matrix(Model):
    pred_mapping = {
        'Logistic Regression': lr_pred,
        'Random Forest': rf_pred,
        'XGBoost': xgb_pred,
        'LightGBM': lgb_pred
    }
    pred = pred_mapping[Model]
    
    cm = confusion_matrix(y_test, pred)
    
    fig = px.imshow(cm, text_auto=True, 
                    labels=dict(x="Predicted", y="Actual", color="Count"),
                    x=['Not Survived', 'Survived'],
                    y=['Not Survived', 'Survived'],
                    title=f'Confusion Matrix - {Model}',
                    color_continuous_scale='Blues')
    fig.show()

In [None]:
# ── 7.4 ROC Curve Visualization ──────────────────────────────

fig = go.Figure()

# Add ROC curves for all models
model_probas = {
    'Logistic Regression': lr_proba,
    'Random Forest': rf_proba,
    'XGBoost': xgb_proba,
    'LightGBM': lgb_proba
}

colors = ['#636EFA', '#EF553B', '#00CC96', '#AB63FA']

for (name, proba), color in zip(model_probas.items(), colors):
    fpr, tpr, _ = roc_curve(y_test, proba)
    auc_score = roc_auc_score(y_test, proba)
    fig.add_trace(go.Scatter(
        x=fpr, y=tpr,
        name=f'{name} (AUC = {auc_score:.3f})',
        line=dict(color=color, width=2)
    ))

# Add diagonal line
fig.add_trace(go.Scatter(
    x=[0, 1], y=[0, 1],
    name='Random Classifier',
    line=dict(color='gray', dash='dash')
))

fig.update_layout(
    title='ROC Curves - Model Comparison',
    xaxis_title='False Positive Rate',
    yaxis_title='True Positive Rate',
    width=800,
    height=600
)

fig.show()

In [None]:
# ── 6.1 Logistic Regression ──────────────────────────────────

lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train_scaled, y_train)
lr_pred = lr.predict(X_test_scaled)
lr_proba = lr.predict_proba(X_test_scaled)[:, 1]

print('✅ Logistic Regression Trained.')

✅ Logistic Regression Trained.


In [None]:
# ── 8.2 Interactive Prediction Widget (Enhanced) ─────────────

print('🔮 Try predicting survival with the best model!')

def predict_survival(Pclass, Sex, Age, SibSp, Parch, Fare, Embarked, Name):
    # Extract title from name
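    try:
        title_extracted = Name.split(',')[1].split('.')[0].strip()
        # Same grouping as extract_title in 4.1; unlisted titles fall back to 'Rare'
        title_map = {
            'Mr': 'Mr', 'Miss': 'Miss', 'Mrs': 'Mrs', 'Master': 'Master',
            'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs',
            'Dr': 'Rare', 'Rev': 'Rare', 'Col': 'Rare', 'Major': 'Rare'
        }
        title = title_map.get(title_extracted, 'Rare')
    except (IndexError, AttributeError):
        # Name is not in "Last, Title. First" form: infer a default title from Sex
        title = 'Mr' if Sex == 'male' else 'Miss'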
    
    # Create basic features
    sex_enc = 1 if Sex == 'male' else 0
    fam_size = SibSp + Parch
    is_alone = 1 if fam_size == 0 else 0
    
    # Age bin
    if Age <= 12:
        age_bin = 'Child'
    elif Age <= 18:
        age_bin = 'Teen'
    elif Age <= 35:
        age_bin = 'Adult'
    elif Age <= 60:
        age_bin = 'Middle'
    else:
        age_bin = 'Senior'
    
    # Fare bin (simplified)
    if Fare <= 7.91:
        fare_bin = 'Low'
    elif Fare <= 14.45:
        fare_bin = 'Medium'
    elif Fare <= 31:
        fare_bin = 'High'
    else:
        fare_bin = 'VeryHigh'
    
    # Create input dataframe with all features matching training data
    input_dict = {
        'Pclass': Pclass,
        'Sex': sex_enc,
        'Age': Age,
        'SibSp': SibSp,
        'Parch': Parch,
        'Fare': Fare,
        'FamilySize': fam_size,
        'IsAlone': is_alone
    }
    
    # Add one-hot encoded features
    for port in ['C', 'Q', 'S']:
        input_dict[f'Embarked_{port}'] = 1 if Embarked == port else 0
    
    for t in ['Master', 'Miss', 'Mr', 'Mrs', 'Rare']:
        input_dict[f'Title_{t}'] = 1 if title == t else 0
    
    for deck in ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'T', 'U']:
        input_dict[f'Deck_{deck}'] = 1 if deck == 'U' else 0  # Default to Unknown
    
    for ab in ['Adult', 'Child', 'Middle', 'Senior', 'Teen']:
        input_dict[f'AgeBin_{ab}'] = 1 if age_bin == ab else 0
    
    for fb in ['High', 'Low', 'Medium', 'VeryHigh']:
        input_dict[f'FareBin_{fb}'] = 1 if fare_bin == fb else 0
    
    # Create DataFrame and ensure column order matches training data
    input_data = pd.DataFrame([input_dict])
    input_data = input_data.reindex(columns=X.columns, fill_value=0)
    
    # Scale
    input_scaled = scaler.transform(input_data)
    
    # Predict using best model
    prob = best_model.predict_proba(input_scaled)[0][1]
    pred = 'Survived 🏆' if prob > 0.5 else 'Did Not Survive 💀'
    
    print(f'\n📢 Prediction: {pred}')
    print(f'📊 Survival Probability: {prob*100:.2f}%')
    print(f'🎭 Extracted Title: {title}')

# Create Enhanced Widget
interact(predict_survival, 
         Name=widgets.Text(value='Doe, Mr. John', description='Name:'),
         Pclass=widgets.Dropdown(options=[1, 2, 3], value=3, description='Class:'),
         Sex=widgets.Dropdown(options=['male', 'female'], value='male', description='Gender:'),
         Age=widgets.IntSlider(min=1, max=100, step=1, value=25, description='Age:'),
         SibSp=widgets.IntSlider(min=0, max=8, value=0, description='Siblings:'),
         Parch=widgets.IntSlider(min=0, max=6, value=0, description='Parents:'),
         Fare=widgets.FloatSlider(min=0, max=500, step=10, value=30, description='Fare:'),
         Embarked=widgets.Dropdown(options=['S', 'C', 'Q'], value='S', description='Port:')
);

In [None]:
# ── 6.2 Random Forest with GridSearchCV ──────────────────────

print('🔍 Tuning Random Forest...')

param_grid_rf = {
    'n_estimators': [100, 200],
    'max_depth': [10, 15, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

rf_grid = GridSearchCV(RandomForestClassifier(random_state=42),
                       param_grid_rf, cv=5, scoring='accuracy', n_jobs=-1)
rf_grid.fit(X_train_scaled, y_train)

rf = rf_grid.best_estimator_
rf_pred = rf.predict(X_test_scaled)
rf_proba = rf.predict_proba(X_test_scaled)[:, 1]

print(f'✅ Random Forest Trained with Best Params: {rf_grid.best_params_}')

✅ Random Forest Trained.


In [None]:
# ── 6.5 Cross-Validation Scores ──────────────────────────────

models = {
    'Logistic Regression': lr,
    'Random Forest': rf,
    'XGBoost': xgb_model,
    'LightGBM': lgb_model
}

cv_results = {}

print('\n📊 5-Fold Cross-Validation Scores:')
print('=' * 50)
for name, model in models.items():
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='accuracy')
    cv_results[name] = cv_scores
    print(f'{name:20s} → Mean: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})')

print('\n✅ Cross-validation complete!')

## 7️⃣ Model Evaluation with ROC-AUC 📉

In [None]:
# ── 7.1 Comprehensive Model Evaluation ───────────────────────

def evaluate_model(name, y_true, y_pred, y_proba):
    acc = accuracy_score(y_true, y_pred)
    auc = roc_auc_score(y_true, y_proba)
    print(f'\n🔹 {name}')
    print(f'   Accuracy: {acc*100:.2f}%')
    print(f'   ROC-AUC:  {auc:.4f}')
    print(classification_report(y_true, y_pred, zero_division=0))
    return acc, auc


print('🔍 Model Evaluation:')
print('=' * 60)

lr_acc, lr_auc = evaluate_model('Logistic Regression', y_test, lr_pred, lr_proba)
rf_acc, rf_auc = evaluate_model('Random Forest', y_test, rf_pred, rf_proba)
xgb_acc, xgb_auc = evaluate_model('XGBoost', y_test, xgb_pred, xgb_proba)
lgb_acc, lgb_auc = evaluate_model('LightGBM', y_test, lgb_pred, lgb_proba)

🔍 Model Evaluation:

🔹 Logistic Regression Accuracy: 80.45%
              precision    recall  f1-score   support

           0       0.82      0.86      0.84       105
           1       0.78      0.73      0.76        74

    accuracy                           0.80       179
   macro avg       0.80      0.79      0.80       179
weighted avg       0.80      0.80      0.80       179


🔹 Random Forest Accuracy: 83.24%
              precision    recall  f1-score   support

           0       0.84      0.89      0.86       105
           1       0.82      0.76      0.79        74

    accuracy                           0.83       179
   macro avg       0.83      0.82      0.82       179
weighted avg       0.83      0.83      0.83       179



In [None]:
# ── 7.2 Model Comparison Summary ─────────────────────────────

results_df = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest', 'XGBoost', 'LightGBM'],
    'Accuracy': [lr_acc, rf_acc, xgb_acc, lgb_acc],
    'ROC-AUC': [lr_auc, rf_auc, xgb_auc, lgb_auc]
}).sort_values('ROC-AUC', ascending=False)

print('\nüìä Model Comparison:')
print(results_df.to_string(index=False))

# Visualize comparison
fig = px.bar(results_df, x='Model', y=['Accuracy', 'ROC-AUC'], 
             barmode='group',
             title='Model Performance Comparison',
             color_discrete_sequence=['#636EFA', '#EF553B'])
fig.show()


---

## 8️⃣ Conclusion & Interactive Prediction 🔮

With the feature engineering and hyperparameter tuning above, all four models are compared on the held-out test set. The **best model** (ranked by ROC-AUC) is used for predictions.
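
To make the ranking criterion concrete: ROC-AUC can be read as the fraction of (survivor, non-survivor) pairs in which the survivor receives the higher predicted probability. The sketch below computes it that way on made-up scores and picks the top model. `pairwise_auc`, `pick_best`, and the toy scores are hypothetical names and values for illustration only; the notebook itself relies on scikit-learn's `roc_auc_score`.

```python
def pairwise_auc(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def pick_best(auc_by_model):
    """Return the model name with the highest AUC."""
    return max(auc_by_model, key=auc_by_model.get)


# Hypothetical labels and predicted probabilities, NOT results from this notebook
y_toy = [0, 0, 1, 1]
toy_scores = {
    "model_a": [0.1, 0.4, 0.35, 0.8],
    "model_b": [0.2, 0.3, 0.6, 0.9],
}

auc_by_model = {name: pairwise_auc(y_toy, s) for name, s in toy_scores.items()}
print(auc_by_model)
print("Best by ROC-AUC:", pick_best(auc_by_model))
```

Because the measure depends only on how the probabilities are ordered, it does not change with the 0.5 decision threshold used for the final survived / did-not-survive label.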

In [None]:
# ── 8.1 Select Best Model ────────────────────────────────────

# Find best model based on ROC-AUC
best_model_name = results_df.iloc[0]['Model']
model_map = {
    'Logistic Regression': lr,
    'Random Forest': rf,
    'XGBoost': xgb_model,
    'LightGBM': lgb_model
}
best_model = model_map[best_model_name]

print(f'üèÜ Best Model Selected: {best_model_name}')
print(f'   ROC-AUC: {results_df.iloc[0]["ROC-AUC"]:.4f}')

🔮 Try predicting survival!

