# Titanic Dataset - Hyperparameter Tuning
### Prepared By:
- **Name:** Nitesh Yadav  
- **Email:** niteshyadav0604@gmail.com 

# 1. Dataset Description

## Source
- The Titanic dataset bundled with Seaborn, loaded with `sns.load_dataset('titanic')`

## Column Description

| Column        | Description                                                  |
|---------------|--------------------------------------------------------------|
| survived      | Survival (0 = No, 1 = Yes)                                    |
| pclass        | Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)                   |
| sex           | Gender of the passenger                                      |
| age           | Age of the passenger in years                                |
| sibsp         | Number of siblings/spouses aboard                            |
| parch         | Number of parents/children aboard                            |
| fare          | Ticket fare                                                  |
| embarked      | Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) |
| class         | Passenger class (categorical representation of pclass)       |
| who           | Person category (man, woman, child)                          |
| adult_male    | Whether the passenger is an adult male (True/False)          |
| deck          | Deck location                                                |
| embark_town   | Embarkation town (full name of embarked)                     |
| alive         | Survival as text (yes/no); same information as survived      |
| alone         | Whether the passenger had no family aboard (True/False)      |


In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Suppress warnings for cleaner output in the notebook
warnings.filterwarnings('ignore')

# Load Titanic dataset
df = sns.load_dataset('titanic')
df

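|     | survived | pclass | sex    | age  | sibsp | parch | fare    | embarked | class  | who   | adult_male | deck | embark_town | alive | alone |
|-----|----------|--------|--------|------|-------|-------|---------|----------|--------|-------|------------|------|-------------|-------|-------|
| 0   | 0        | 3      | male   | 22.0 | 1     | 0     | 7.2500  | S        | Third  | man   | True       | NaN  | Southampton | no    | False |
| 1   | 1        | 1      | female | 38.0 | 1     | 0     | 71.2833 | C        | First  | woman | False      | C    | Cherbourg   | yes   | False |
| 2   | 1        | 3      | female | 26.0 | 0     | 0     | 7.9250  | S        | Third  | woman | False      | NaN  | Southampton | yes   | True  |
| 3   | 1        | 1      | female | 35.0 | 1     | 0     | 53.1000 | S        | First  | woman | False      | C    | Southampton | yes   | False |
| 4   | 0        | 3      | male   | 35.0 | 0     | 0     | 8.0500  | S        | Third  | man   | True       | NaN  | Southampton | no    | True  |
| ... | ...      | ...    | ...    | ...  | ...   | ...   | ...     | ...      | ...    | ...   | ...        | ...  | ...         | ...   | ...   |
| 886 | 0        | 2      | male   | 27.0 | 0     | 0     | 13.0000 | S        | Second | man   | True       | NaN  | Southampton | no    | True  |
| 887 | 1        | 1      | female | 19.0 | 0     | 0     | 30.0000 | S        | First  | woman | False      | B    | Southampton | yes   | True  |
| 888 | 0        | 3      | female | NaN  | 1     | 2     | 23.4500 | S        | Third  | woman | False      | NaN  | Southampton | no    | False |
| 889 | 1        | 1      | male   | 26.0 | 0     | 0     | 30.0000 | C        | First  | man   | True       | C    | Cherbourg   | yes   | True  |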


In [11]:
print(df.shape)
print(df.info())

(891, 15)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
None


In [12]:
df.isnull().sum().sort_values(ascending=False).head(5)

deck           688
age            177
embarked         2
embark_town      2
survived         0
dtype: int64

In [13]:
# Drop 'deck' column (too many missing values)
df.drop(columns=['deck'], inplace=True)

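# Fill missing 'age' values with the median
# (assign back instead of using inplace=True on a column selection,
#  which newer pandas versions warn about and may not apply to df)
df['age'] = df['age'].fillna(df['age'].median())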

# Drop rows where 'embarked' is missing
df.dropna(subset=['embarked'], inplace=True)

# Drop 'embark_town' (duplicate of 'embarked')
df.drop(columns=['embark_town'], inplace=True)

# Encode Categorical Features & Scale Numerical Features

In [15]:
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Label encode binary categorical columns
le = LabelEncoder()
df['sex'] = le.fit_transform(df['sex'])  # male=1, female=0
df['alone'] = le.fit_transform(df['alone'])

# One-hot encode multi-class categorical columns
df = pd.get_dummies(df, columns=['embarked', 'class', 'who'], drop_first=True)

# Scale numeric columns
scaler = StandardScaler()
numeric_cols = ['age', 'fare', 'sibsp', 'parch']
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
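
The cell above fits `StandardScaler` on the whole frame before the train-test split, so the test rows contribute to the mean and standard deviation. One common alternative is to put the scaler inside a `Pipeline`, so it is fitted on training rows only. The sketch below is illustrative, not the notebook's method: the `demo` frame, its values, and the toy target are hypothetical stand-ins for the Titanic columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the Titanic frame (same column names, made-up values).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age": rng.normal(30, 12, 200),
    "fare": rng.exponential(30, 200),
    "sibsp": rng.integers(0, 4, 200),
    "parch": rng.integers(0, 3, 200),
    "sex": rng.integers(0, 2, 200),
})
demo_y = (demo["sex"] == 0).astype(int)  # toy target, for illustration only

numeric_cols = ["age", "fare", "sibsp", "parch"]

# Scale only the numeric columns; pass the remaining columns through unchanged.
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), numeric_cols)],
    remainder="passthrough",
)
pipe = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo, demo_y, test_size=0.2, random_state=42, stratify=demo_y
)

# fit() learns the scaling statistics from the training rows only.
pipe.fit(Xd_train, yd_train)
fitted_scaler = pipe.named_steps["prep"].named_transformers_["scale"]
print(fitted_scaler.mean_)
```

Passing a pipeline like this to `GridSearchCV` would also refit the scaler inside each cross-validation fold.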


In [20]:
# Convert 'alive' (yes/no) to binary
df['alive'] = LabelEncoder().fit_transform(df['alive'])  # yes = 1, no = 0

# Convert other bool columns (True/False) to integers
bool_cols = ['adult_male', 'alone', 'embarked_Q', 'embarked_S', 
             'class_Second', 'class_Third', 'who_man', 'who_woman']

for col in bool_cols:
    df[col] = df[col].astype(int)


In [21]:
df

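|     | survived | pclass | sex | age       | sibsp     | parch     | fare      | adult_male | alive | alone | embarked_Q | embarked_S | class_Second | class_Third | who_man | who_woman |
|-----|----------|--------|-----|-----------|-----------|-----------|-----------|------------|-------|-------|------------|------------|--------------|-------------|---------|-----------|
| 0   | 0        | 3      | 1   | -0.563674 | 0.431350  | -0.474326 | -0.500240 | 1          | 0     | 0     | 0          | 1          | 0            | 1           | 1       | 0         |
| 1   | 1        | 1      | 0   | 0.669217  | 0.431350  | -0.474326 | 0.788947  | 0          | 1     | 0     | 0          | 0          | 0            | 0           | 0       | 1         |
| 2   | 1        | 3      | 0   | -0.255451 | -0.475199 | -0.474326 | -0.486650 | 0          | 1     | 1     | 0          | 1          | 0            | 1           | 0       | 1         |
| 3   | 1        | 1      | 0   | 0.438050  | 0.431350  | -0.474326 | 0.422861  | 0          | 1     | 0     | 0          | 1          | 0            | 0           | 0       | 1         |
| 4   | 0        | 3      | 1   | 0.438050  | -0.475199 | -0.474326 | -0.484133 | 1          | 0     | 1     | 0          | 1          | 0            | 1           | 1       | 0         |
| ... | ...      | ...    | ... | ...       | ...       | ...       | ...       | ...        | ...   | ...   | ...        | ...        | ...          | ...         | ...     | ...       |
| 886 | 0        | 2      | 1   | -0.178396 | -0.475199 | -0.474326 | -0.384475 | 1          | 0     | 1     | 0          | 1          | 1            | 0           | 1       | 0         |
| 887 | 1        | 1      | 0   | -0.794841 | -0.475199 | -0.474326 | -0.042213 | 0          | 1     | 1     | 0          | 1          | 0            | 0           | 0       | 1         |
| 888 | 0        | 3      | 0   | -0.101340 | 0.431350  | 2.006119  | -0.174084 | 0          | 0     | 0     | 0          | 1          | 0            | 1           | 0       | 1         |
| 889 | 1        | 1      | 1   | -0.255451 | -0.475199 | -0.474326 | -0.042213 | 1          | 1     | 1     | 0          | 0          | 0            | 0           | 1       | 0         |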


# Feature/Target Split and Train-Test Splitting

In [23]:
from sklearn.model_selection import train_test_split

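# Features (X) and target (y)
# Note: X still contains 'alive', the yes/no encoding of 'survived', so the
# target leaks into the features and inflates every score reported below.
X = df.drop(columns=['survived'])
y = df['survived']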

# Train-test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("X_train:", X_train.shape)
print("X_test :", X_test.shape)
print("y_train:", y_train.shape)
print("y_test :", y_test.shape)


X_train: (711, 15)
X_test : (178, 15)
y_train: (711,)
y_test : (178,)
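
The 15 feature columns counted above include `alive`, which in the Seaborn copy of the data is the yes/no form of `survived`. A quick way to catch this kind of target leakage is to compare each candidate column with the target before splitting. The sketch below uses a tiny hand-made frame (the rows are hypothetical) rather than the notebook's `df`, and the `leaky_cols` name is only illustrative.

```python
import pandas as pd

# Tiny hand-made frame mirroring a few raw columns (hypothetical rows).
raw = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1],
    "alive": ["no", "yes", "yes", "no", "yes"],
    "adult_male": [True, False, False, True, False],
    "fare": [7.25, 71.28, 7.92, 8.05, 30.0],
})

# Map the text column onto the target's coding and compare row by row.
alive_as_int = raw["alive"].map({"no": 0, "yes": 1})
is_duplicate = bool((alive_as_int == raw["survived"]).all())
print("alive duplicates survived:", is_duplicate)

# Drop the target and any column that merely restates it.
leaky_cols = ["alive"] if is_duplicate else []
X_clean = raw.drop(columns=["survived"] + leaky_cols)
print(list(X_clean.columns))
```

Applying the same idea to the notebook would mean dropping `alive` alongside `survived` when building `X`; the scores reported below were produced without that step.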


# Train & Evaluate Multiple Models
## We’ll use:

- LogisticRegression

- RandomForestClassifier

- KNeighborsClassifier

- SVC

In [24]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Define models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(),
    'KNN': KNeighborsClassifier(),
    'SVM': SVC()
}

# Train and evaluate
results = []

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    results.append({
        'Model': name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1-Score': f1_score(y_test, y_pred)
    })

# Show results

results_df = pd.DataFrame(results)
print(" Model Evaluation Summary:\n")
print(results_df.sort_values(by='F1-Score', ascending=False))


 Model Evaluation Summary:

                 Model  Accuracy  Precision    Recall  F1-Score
0  Logistic Regression  1.000000   1.000000  1.000000  1.000000
1        Random Forest  1.000000   1.000000  1.000000  1.000000
3                  SVM  0.994382   0.985507  1.000000  0.992701
2                  KNN  0.971910   0.956522  0.970588  0.963504
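
The table above comes from a single 178-row test split, so small differences between models can be split-dependent. A cross-validated estimate gives a mean and spread instead of one number. This is a minimal sketch on synthetic data from `make_classification` (an assumption made so the block runs on its own); it does not reproduce the notebook's scores.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data standing in for X_train / y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Five stratified folds, shuffled with a fixed seed for reproducibility.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, scoring="f1"
)

print("F1 per fold:", scores.round(3))
print(f"Mean F1: {scores.mean():.3f} (std {scores.std():.3f})")
```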


# Hyperparameter Tuning
## Hyperparameter tuning with GridSearchCV and RandomizedSearchCV for these models:

- RandomForestClassifier
- KNeighborsClassifier

Both searches optimize for F1-score.

## Hyperparameter Tuning with GridSearchCV

In [25]:
from sklearn.model_selection import GridSearchCV

#  Random Forest hyperparameter grid
param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
}

grid_rf = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid=param_grid_rf,
    cv=5,
    scoring='f1',
    n_jobs=-1
)
grid_rf.fit(X_train, y_train)
best_rf = grid_rf.best_estimator_

# KNN hyperparameter grid
param_grid_knn = {
    'n_neighbors': [3, 5, 7],
    'weights': ['uniform', 'distance']
}

grid_knn = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid=param_grid_knn,
    cv=5,
    scoring='f1',
    n_jobs=-1
)
grid_knn.fit(X_train, y_train)
best_knn = grid_knn.best_estimator_


In [26]:
# Evaluate tuned models
for name, model in [('Random Forest (Tuned)', best_rf), ('KNN (Tuned)', best_knn)]:
    y_pred = model.predict(X_test)
    print(f"\n{name}:")
    print(f"Best Parameters: {model.get_params()}")
    print(f"Accuracy : {accuracy_score(y_test, y_pred):.4f}")
    print(f"Precision: {precision_score(y_test, y_pred):.4f}")
    print(f"Recall   : {recall_score(y_test, y_pred):.4f}")
    print(f"F1-Score : {f1_score(y_test, y_pred):.4f}")


Random Forest (Tuned):
Best Parameters: {'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'monotonic_cst': None, 'n_estimators': 50, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}
Accuracy : 1.0000
Precision: 1.0000
Recall   : 1.0000
F1-Score : 1.0000

KNN (Tuned):
Best Parameters: {'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', 'metric_params': None, 'n_jobs': None, 'n_neighbors': 3, 'p': 2, 'weights': 'distance'}
Accuracy : 0.9607
Precision: 0.9420
Recall   : 0.9559
F1-Score : 0.9489
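
The "Best Parameters" lines above come from `get_params()`, which lists every parameter of the estimator, including defaults that were never searched. The fitted search object also exposes `best_params_` (only the tuned values) and `best_score_` (the mean cross-validated score of the winning candidate). A minimal sketch on synthetic data, with `search` as an illustrative name:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data so the block runs on its own.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

param_grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="f1")
search.fit(X_demo, y_demo)

# Only the parameters that were part of the grid.
print("Tuned parameters:", search.best_params_)
# Mean cross-validated F1 of the best candidate (not a test-set score).
print("Best CV F1:", round(search.best_score_, 4))
```

Comparing `best_score_` with the test-set F1 shows how far the cross-validated estimate is from held-out performance.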


## Hyperparameter Tuning with RandomizedSearchCV

In [27]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Random Forest parameter distributions
param_dist_rf = {
    'n_estimators': randint(50, 200),
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': randint(2, 10)
}

# KNN parameter distributions
param_dist_knn = {
    'n_neighbors': randint(3, 10),
    'weights': ['uniform', 'distance']
}

#  Random Forest RandomizedSearch
random_rf = RandomizedSearchCV(
    estimator=RandomForestClassifier(),
    param_distributions=param_dist_rf,
    n_iter=10,
    cv=5,
    scoring='f1',
    n_jobs=-1,
    random_state=42
)
random_rf.fit(X_train, y_train)
best_rf_rand = random_rf.best_estimator_

#  KNN RandomizedSearch
random_knn = RandomizedSearchCV(
    estimator=KNeighborsClassifier(),
    param_distributions=param_dist_knn,
    n_iter=10,
    cv=5,
    scoring='f1',
    n_jobs=-1,
    random_state=42
)
random_knn.fit(X_train, y_train)
best_knn_rand = random_knn.best_estimator_

# Final evaluation
for name, model in [('Random Forest (Randomized)', best_rf_rand),
                    ('KNN (Randomized)', best_knn_rand)]:
    y_pred = model.predict(X_test)
    print(f"\n{name}:")
    print(f"Best Params: {model.get_params()}")
    print(f"Accuracy   : {accuracy_score(y_test, y_pred):.4f}")
    print(f"Precision  : {precision_score(y_test, y_pred):.4f}")
    print(f"Recall     : {recall_score(y_test, y_pred):.4f}")
    print(f"F1-Score   : {f1_score(y_test, y_pred):.4f}")



Random Forest (Randomized):
Best Params: {'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 20, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 5, 'min_weight_fraction_leaf': 0.0, 'monotonic_cst': None, 'n_estimators': 142, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}
Accuracy   : 1.0000
Precision  : 1.0000
Recall     : 1.0000
F1-Score   : 1.0000

KNN (Randomized):
Best Params: {'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', 'metric_params': None, 'n_jobs': None, 'n_neighbors': 5, 'p': 2, 'weights': 'distance'}
Accuracy   : 0.9719
Precision  : 0.9565
Recall     : 0.9706
F1-Score   : 0.9635
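
To see which candidates a randomized search can draw, the distributions can be sampled directly with `ParameterSampler`. Note that `scipy.stats.randint(low, high)` excludes `high`, so `randint(3, 10)` yields 3 through 9. With the same distributions, `n_iter`, and `random_state`, this sample should correspond to the candidates `RandomizedSearchCV` evaluates, though that equivalence is stated here as an assumption rather than verified against the run above.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

param_dist_knn = {
    "n_neighbors": randint(3, 10),  # integers 3..9; the upper bound is exclusive
    "weights": ["uniform", "distance"],
}

# Draw 10 candidates, as n_iter=10 does in the search.
candidates = list(ParameterSampler(param_dist_knn, n_iter=10, random_state=42))
for params in candidates:
    print(params)

# 7 neighbor values x 2 weightings = 14 possible combinations,
# so 10 draws cannot cover the whole space.
distinct = {(int(c["n_neighbors"]), c["weights"]) for c in candidates}
print(f"{len(distinct)} distinct combinations out of 14 possible")
```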


# Comparison of Tuned Models (GridSearchCV vs RandomizedSearchCV)
| Model                | Accuracy | Precision | Recall | F1-Score | Comparison            |
| -------------------- | -------- | --------- | ------ | -------- | --------------------- |
| Random Forest (Grid) | 1.0000   | 1.0000    | 1.0000 | 1.0000   | Tied with Randomized  |
| Random Forest (Rand) | 1.0000   | 1.0000    | 1.0000 | 1.0000   | Tied with Grid        |
| KNN (Grid)           | 0.9607   | 0.9420    | 0.9559 | 0.9489   | Lower than Randomized |
| KNN (Rand)           | 0.9719   | 0.9565    | 0.9706 | 0.9635   | Higher than Grid      |

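## Analysis & Recommendation
### Random Forest:
- Both tuning methods reached F1 = 1.0 on the test set, so neither can be ranked above the other here. The untuned Logistic Regression and Random Forest baselines also scored 1.0: the feature matrix still includes `alive`, which carries the same information as `survived`, so the perfect scores reflect that column rather than the tuning.
- The best parameters differ:
  - GridSearchCV chose the smallest forest in its grid (`n_estimators=50`, `max_depth=None`, `min_samples_split=2`).
  - RandomizedSearchCV chose `n_estimators=142`, `max_depth=20`, `min_samples_split=5`. Note that `max_depth=20` caps tree depth, whereas `max_depth=None` leaves it unlimited.

Final Verdict: Either is acceptable. RandomizedSearchCV is usually cheaper on larger search spaces because it evaluates a fixed number of sampled candidates (`n_iter=10` here) instead of every combination (18 in the Random Forest grid).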

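### KNN:
- On the test set, the RandomizedSearchCV model scored higher than the GridSearchCV model:
  - Accuracy: 0.9719 vs 0.9607
  - F1-score: 0.9635 vs 0.9489
- The gap amounts to two test predictions out of 178 (5 errors vs 7). The RandomizedSearchCV result also equals the untuned KNN baseline (F1 = 0.9635), so tuning did not improve KNN on this split.
- Both searches select by cross-validated F1 on the training data. `n_neighbors=5` with `weights='distance'` was also in the grid, so GridSearchCV rated `n_neighbors=3` at least as high on that criterion.
- Final Verdict: Use KNN (Randomized) if you choose KNN.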
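
The KNN accuracies can be converted back into error counts to show how small the gap is in absolute terms. The figures below are the reported accuracies and the test-set size of 178; the variable names are illustrative.

```python
n_test = 178  # rows in X_test

acc_grid = 0.9607  # KNN tuned with GridSearchCV
acc_rand = 0.9719  # KNN tuned with RandomizedSearchCV

# Misclassified rows = (1 - accuracy) * number of test rows, rounded to an integer.
errors_grid = round((1 - acc_grid) * n_test)
errors_rand = round((1 - acc_rand) * n_test)

print(errors_grid, errors_rand)   # 7 5
print(errors_grid - errors_rand)  # 2
```

A difference of two predictions on one split is well within what a different `random_state` in `train_test_split` could produce.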



### Final Model Selection Summary

We trained and evaluated Logistic Regression, Random Forest, SVM, and KNN, then tuned Random Forest and KNN with both GridSearchCV and RandomizedSearchCV.

- **Random Forest achieved F1-score = 1.0** under both tuning methods, as did the untuned baseline.
- **KNN scored higher with RandomizedSearchCV than with GridSearchCV** (F1-score 0.96 vs 0.95), matching the untuned KNN baseline.
- Based on overall test-set performance and robustness, we select:

 **Final Model: Random Forest Classifier (Tuned)**  
  Best accuracy, recall, and F1-score on this Titanic test split.
