## Model Building on Titanic
At this stage, as a data scientist, I will:
- Choose an appropriate machine learning algorithm (e.g., linear regression, decision trees, random forest, XGBoost, neural networks).
- Split the dataset into training, validation, and testing sets.
- Train the model on the training data and tune hyperparameters for better performance.
- Evaluate the model using metrics (accuracy, precision, recall, RMSE, F1-score, AUC, etc.) that fit the problem type.
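
The notebook below loads files that were already split into train and test, and uses cross-validation in place of a separate validation set. For reference, here is a minimal sketch of a three-way split using two calls to `train_test_split`. The data is synthetic (`make_classification`), and the `_demo` variable names and the 60/20/20 proportions are my own assumptions, not part of the Titanic pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the Titanic features)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)

# Step 1: hold out 20% as the final test set, stratified on the label
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=42
)

# Step 2: split the remaining 80% into train (60% overall) and validation (20% overall)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)

print(X_train_demo.shape, X_val_demo.shape, X_test_demo.shape)
```

Stratifying on the label keeps the class ratio roughly the same in all three parts, which matters when one class is smaller than the other.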

For the Titanic dataset, I will start with Logistic Regression.

✅ Summary

- LogisticRegression → algorithm used for Titanic classification.

- cross_val_score → gives a more reliable accuracy estimate by averaging over k-fold cross-validation.

- GridSearchCV → searches a grid of Logistic Regression hyperparameters and keeps the combination with the best cross-validated score.

- Together, they give a more trustworthy picture of how well the Titanic model generalizes.

In [1]:
# import the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, GridSearchCV

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

**Load and Split the dataset**

In [2]:
# load the dataset
train_df = pd.read_csv(r"C:\Users\KOLADE\OneDrive\Documents\Practices\Titanic\data\Train.csv")
test_df = pd.read_csv(r"C:\Users\KOLADE\OneDrive\Documents\Practices\Titanic\data\Test.csv")

display(train_df.head())
test_df.head()

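|  | Pclass | Sex | Age | SibSp | Parch | Family_size | Cabin_status | Alone | Embarked_nan | Age_nan | ... | Embarked_Q | Embarked_S | Title_Master | Title_Miss | Title_Mr | Title_Mrs | Title_Other | Age_scaled | Fare_log_scaled | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 54.0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.888773 | 1.053619 | 0 |
| 1 | 3 | 1 | 26.0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | -0.232122 | -0.159147 | 0 |
| 2 | 2 | 1 | 25.0 | 1 | 2 | 3 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | -0.307868 | 0.828292 | 0 |
| 3 | 3 | 1 | 26.0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | -0.232122 | -0.22735 | 0 |
| 4 | 3 | 0 | 22.0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | -0.535107 | -0.533665 | 0 |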


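|  | Pclass | Sex | Age | SibSp | Parch | Family_size | Cabin_status | Alone | Embarked_nan | Age_nan | ... | Embarked_Q | Embarked_S | Title_Master | Title_Miss | Title_Mr | Title_Mrs | Title_Other | Age_scaled | Fare_log_scaled | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 26.0 | 1 | 1 | 2 | 0 | 0 | 0 | 1 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.232122 | -0.175318 | 1 |
| 1 | 2 | 1 | 31.0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.14661 | -0.535177 | 0 |
| 2 | 3 | 1 | 20.0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | -0.686599 | -0.799211 | 0 |
| 3 | 2 | 0 | 6.0 | 0 | 1 | 3 | 0 | 0 | 0 | 0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | -1.747047 | 0.593927 | 1 |
| 4 | 3 | 0 | 14.0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | -1.141077 | -0.470076 | 1 |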


In [3]:
X_train = train_df.iloc[:,:-1]
y_train = train_df.iloc[:, -1]

print(f"Train: {X_train.shape, y_train.shape}")

X_test = test_df.iloc[:,:-1]
y_test = test_df.iloc[:, -1]

print(f"Test: {X_test.shape, y_test.shape}")

Train: ((596, 20), (596,))
Test: ((295, 20), (295,))


**Model Training**

In [4]:
y_train.value_counts()

Survived
0    374
1    222
Name: count, dtype: int64

In [5]:
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
print("Train Accuracy:", log_reg.score(X_train, y_train))
print("Test Accuracy:", log_reg.score(X_test, y_test))

Train Accuracy: 0.837248322147651
Test Accuracy: 0.8406779661016949


**Cross-Validation**

In [6]:
scores = cross_val_score(log_reg, X_train, y_train, cv=5, scoring='accuracy')
print("Scores:", scores)
print("Cross-validation accuracy:", scores.mean())
print("Cross-validation Std:", scores.std())

Scores: [0.76666667 0.8907563  0.78991597 0.81512605 0.85714286]
Cross-validation accuracy: 0.8239215686274509
Cross-validation Std: 0.044905237084523264


The scores are the accuracy values on each fold of the 5-fold cross-validation; they range from 0.77 to 0.89. The folds differ by up to 12 percentage points, which is not unusual with only about 120 passengers per validation fold, so the model is reasonably (not perfectly) stable.

The mean across folds shows that the Logistic Regression model correctly predicts survival about 82% of the time on held-out validation folds. This is a more reliable estimate of generalization performance than the score from any single split.

The standard deviation tells you how much the accuracy varies between folds. Here it’s about 0.045 (4.5 percentage points), which is moderate → the model is reasonably stable across different splits of the data. If this number were large (e.g., 0.12), it would suggest instability and sensitivity to how the data is split.
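
To make the arithmetic explicit, the mean, standard deviation, and fold-to-fold range can be recomputed directly from the five fold scores printed above. This is only a small check with NumPy; the variable names are mine.

```python
import numpy as np

# Fold accuracies copied from the cross_val_score output above
fold_scores = np.array([0.76666667, 0.8907563, 0.78991597, 0.81512605, 0.85714286])

mean_acc = fold_scores.mean()
std_acc = fold_scores.std()  # population std (ddof=0), same as scores.std() above
fold_range = fold_scores.max() - fold_scores.min()

print(f"CV accuracy: {mean_acc:.3f} +/- {std_acc:.3f} (range {fold_range:.3f})")
# prints: CV accuracy: 0.824 +/- 0.045 (range 0.124)
```

The range (about 12 points between the worst and best fold) is worth reporting next to the standard deviation, because with only five folds a single unlucky split can move the mean noticeably.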


Compare the accuracy on the full training set with the mean CV accuracy (82.4%):

> Train Accuracy = 83.7%  
> CV Accuracy = 82.4%  
> Difference = about 1.3 percentage points

👉 Interpretation: 
- The model shows no clear sign of overfitting: train accuracy is only slightly higher than the mean CV accuracy.
- It is not underfitting either: accuracy is well above the ~63% you would get by always predicting the majority class (374 of 596 training passengers died).
- Fold-to-fold variation is moderate (std ≈ 0.045), so the ~82% estimate is reasonably stable.
- This makes Logistic Regression a solid baseline model.
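
The train-vs-CV comparison above uses the accuracy of one model fit on the full training set. A slightly cleaner way to get the same kind of gap is `cross_validate` with `return_train_score=True`, which reports train and validation accuracy for each fold. The sketch below runs on synthetic data (an assumption, so the numbers will not match the Titanic results).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption: NOT the Titanic features)
X_demo, y_demo = make_classification(n_samples=600, n_features=20, random_state=0)

cv_results = cross_validate(
    LogisticRegression(max_iter=1000),
    X_demo,
    y_demo,
    cv=5,
    scoring="accuracy",
    return_train_score=True,  # also score each fold's training portion
)

train_mean = cv_results["train_score"].mean()
val_mean = cv_results["test_score"].mean()  # "test_score" = validation folds here

print(f"Mean train accuracy:      {train_mean:.3f}")
print(f"Mean validation accuracy: {val_mean:.3f}")
print(f"Gap:                      {train_mean - val_mean:.3f}")
```

A large positive gap would point to overfitting; a small gap with low scores on both sides would point to underfitting.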

**🔹 What To Do Next**

Now that you’ve established a baseline:
1. Hyperparameter tuning: Use GridSearchCV to see if tweaking C, penalty, and solver improves accuracy.

2. Try other models: RandomForest, GradientBoosting, or XGBoost, and compare CV scores with Logistic Regression.

**Hyperparameter tuning: Using GridSearchCV**

In [7]:
# Define parameter grid
param_grid = [
    {'penalty': ['l1'], 'solver': ['liblinear', 'saga'], 'C': [1e-4, 1e-3, 0.01, 0.1, 1, 10, 50]},
    {'penalty': ['l2'], 'solver': ['liblinear', 'saga'], 'C': [1e-4, 1e-3, 0.01, 0.1, 1, 10, 50]},
]


grid_search = GridSearchCV(LogisticRegression(max_iter=3000), param_grid,
                           cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best CV Score:", grid_search.best_score_)

Best Parameters: {'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}
Best CV Score: 0.8289495798319327


In [8]:
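best_model = grid_search.best_estimator_
# Note: with the default refit=True, GridSearchCV has already refit this estimator
# on the full training set, so the explicit fit below is redundant.
best_model.fit(X_train, y_train)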
print("Train Accuracy:", best_model.score(X_train, y_train))
print("Test Accuracy:", best_model.score(X_test, y_test))

Train Accuracy: 0.837248322147651
Test Accuracy: 0.8406779661016949


✅ Summary of Findings
- Grid search selected L1 regularization (penalty='l1') with C=10 and the liblinear solver.

- The best CV accuracy is ~83% (0.829), only about half a point above the default model's 82.4%, and train and test accuracy are unchanged, so tuning brings little gain here.

- The next step is a closer look at the test-set predictions of both models, before deciding whether to explore more powerful models like Random Forest or XGBoost.

In [9]:
y_pred = log_reg.predict(X_test)
print(f"Accuracy Score: {accuracy_score(y_test, y_pred):.4f}\n")

print(f"Classification Report: \n{classification_report(y_test, y_pred)}\n")

print(f"Confusion Matrix: \n{confusion_matrix(y_test, y_pred)}")

Accuracy Score: 0.8407

Classification Report: 
              precision    recall  f1-score   support

           0       0.86      0.87      0.87       175
           1       0.81      0.80      0.80       120

    accuracy                           0.84       295
   macro avg       0.84      0.83      0.83       295
weighted avg       0.84      0.84      0.84       295


Confusion Matrix: 
[[152  23]
 [ 24  96]]


**🔎 First Model (Default Logistic Regression)**

Accuracy: 0.8407 (~84%)

Class 0 (Died):
- Correctly predicted 152 out of 175 (87% recall).
- 23 were misclassified as survivors (false positives).

Class 1 (Survived):
- Correctly predicted 96 out of 120 (80% recall).
- 24 actual survivors were missed (false negatives).

Precision / Recall Trade-off:
- Precision for survivors (1) = 0.81 → when model says "survived," it’s correct ~81% of the time.
- Recall for survivors (1) = 0.80 → model detects ~80% of all survivors.
- F1-score = 0.80 → balanced but not perfect.
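
The numbers above can be recomputed by hand from the confusion matrix, which is a useful check on how precision, recall, and F1 are defined. The sketch below uses only the four counts printed above; the variable names are mine.

```python
import numpy as np

# Confusion matrix of the default Logistic Regression on the test set
# (rows = actual class, columns = predicted class)
cm = np.array([[152, 23],
               [24, 96]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of predicted survivors, how many really survived
recall = tp / (tp + fn)      # of actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1-score:  {f1:.2f}")
print(f"Accuracy:  {accuracy:.4f}")
```

These match the class 1 row of the classification report (0.81, 0.80, 0.80) and the overall accuracy of 0.8407.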

In [10]:
y_pred2 = best_model.predict(X_test)
print(f"Accuracy Score: {accuracy_score(y_test, y_pred2)}\n")

print(f"Classification Report: \n{classification_report(y_test, y_pred2)}\n")

print(f"Confusion Matrix: \n{confusion_matrix(y_test, y_pred2)}")

Accuracy Score: 0.8406779661016949

Classification Report: 
              precision    recall  f1-score   support

           0       0.86      0.88      0.87       175
           1       0.82      0.78      0.80       120

    accuracy                           0.84       295
   macro avg       0.84      0.83      0.83       295
weighted avg       0.84      0.84      0.84       295


Confusion Matrix: 
[[154  21]
 [ 26  94]]


**🔎 Tuned Model (Best Parameters via GridSearch)**

Accuracy: 0.8407 (same ~84%)

Class 0 (Died):
- 154 correct (88% recall, slightly better).
- 21 misclassified as survivors (improvement over 23).

Class 1 (Survived):
- 94 correct (78% recall, slightly worse).
- 26 missed (more false negatives than first model).

Precision / Recall Trade-off:
- Precision for survivors = 0.82 (slightly better than 0.81).
- Recall for survivors = 0.78 (slightly worse than 0.80).
- F1-score stays at 0.80.

**⚖️ Model Comparison**

- Both models give the same overall accuracy (84%).
- Base model is a little better at catching survivors (higher recall for class 1).
- Tuned model is a little better at avoiding false positives (higher precision for class 1).

So the choice depends on your goal:

- If you care more about not missing survivors → base model is slightly better.
- If you care more about being confident when predicting survivors → tuned model is slightly better.
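
Another way to act on this trade-off, without retraining, is to move the decision threshold applied to `predict_proba` (the default for `predict` is 0.5). The sketch below is illustrative only: it uses synthetic data, and the thresholds 0.3, 0.5, and 0.7 are arbitrary choices of mine.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the Titanic features)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=1
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

results = {}
for threshold in (0.3, 0.5, 0.7):
    pred = (proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, pred, zero_division=0),
        recall_score(y_te, pred),
    )
    print(f"threshold={threshold}: precision={results[threshold][0]:.2f}, "
          f"recall={results[threshold][1]:.2f}")
```

Lowering the threshold can only keep or raise recall (fewer missed positives), usually at the cost of precision; raising it does the opposite. If you tune the threshold, pick it on validation data rather than on the test set.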

✅ Is the Model Good Enough? Yes, for Titanic data this is quite good:

- Accuracy > 80% is solid for a relatively small dataset.

- Balanced precision/recall across classes.

- No serious class imbalance issue (175 died vs. 120 survived).

But:

- It still misses ~20% of survivors (false negatives), which may or may not be acceptable depending on the context.

- Further improvements are more likely to come from additional feature engineering than from more tuning of logistic regression. The data already includes title, family-size, and cabin-status features; cabin deck is one possible addition.

In [11]:
from sklearn.metrics import roc_auc_score
y_proba = log_reg.predict_proba(X_test)[:, 1]
print("ROC-AUC:", roc_auc_score(y_test, y_proba))

ROC-AUC: 0.8968095238095237


In [12]:
y_proba2 = best_model.predict_proba(X_test)[:, 1]
print("ROC-AUC:", roc_auc_score(y_test, y_proba2))

ROC-AUC: 0.8969047619047619
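
The two ROC-AUC values above are nearly identical (0.8968 vs. 0.8969). As a reminder of what this metric measures: ROC-AUC equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The toy example below checks that interpretation against `roc_auc_score`; the four labels and scores are made up for illustration.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Tiny made-up example (assumption: not Titanic predictions)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]

# Fraction of (positive, negative) pairs ranked correctly; ties count as half
pairwise_auc = sum(
    (p > n) + 0.5 * (p == n) for p, n in product(pos, neg)
) / (len(pos) * len(neg))

sklearn_auc = roc_auc_score(y_true, y_score)
print(pairwise_auc, sklearn_auc)  # both are 0.75
```

Because ROC-AUC depends only on how the predictions are ranked, two models can have the same AUC and still differ in accuracy at the 0.5 threshold, and vice versa.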


### Train Multiple Models

In [13]:
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

In [14]:
models = {
        "Logistic Regression": LogisticRegression(max_iter=3000),
        "Naive Bayes": GaussianNB(),
        "KNN": KNeighborsClassifier(),
        "Decision Tree": DecisionTreeClassifier(random_state=42),
        "Random Forest": RandomForestClassifier(random_state=42)
}

print(models)

{'Logistic Regression': LogisticRegression(max_iter=3000), 'Naive Bayes': GaussianNB(), 'KNN': KNeighborsClassifier(), 'Decision Tree': DecisionTreeClassifier(random_state=42), 'Random Forest': RandomForestClassifier(random_state=42)}


In [15]:
def evaluate_models(model, X_train_df, y_train_df, X_test_df, y_test_df):
    """Fit one model, then print its train accuracy, test accuracy, and test ROC-AUC."""
    model.fit(X_train_df, y_train_df)
    y_pred = model.predict(X_test_df)
    # Probability of the positive class (Survived = 1), needed for ROC-AUC
    y_proba = model.predict_proba(X_test_df)[:, 1]

    print(f"Train Accuracy: {model.score(X_train_df, y_train_df)}")
    print(f"Test Accuracy: {accuracy_score(y_test_df, y_pred)}")
    print("ROC-AUC:", roc_auc_score(y_test_df, y_proba))

In [16]:
for name, model in models.items():
    print(f"Model: {name}")
    evaluate_models(model, X_train, y_train, X_test, y_test)
    print("**" * 25)

Model: Logistic Regression
Train Accuracy: 0.837248322147651
Test Accuracy: 0.8406779661016949
ROC-AUC: 0.8968095238095237
**************************************************
Model: Naive Bayes
Train Accuracy: 0.8003355704697986
Test Accuracy: 0.7864406779661017
ROC-AUC: 0.8696190476190476
**************************************************
Model: KNN
Train Accuracy: 0.8338926174496645
Test Accuracy: 0.7932203389830509
ROC-AUC: 0.8708333333333333
**************************************************
Model: Decision Tree
Train Accuracy: 0.9832214765100671
Test Accuracy: 0.7254237288135593
ROC-AUC: 0.7122142857142857
**************************************************
Model: Random Forest
Train Accuracy: 0.9832214765100671
Test Accuracy: 0.8101694915254237
ROC-AUC: 0.8892380952380953
**************************************************


**Hyperparameter Tuning**

In [17]:
# KNN
knn_params = {
    "n_neighbors": [3, 5, 7, 9, 11],
    "weights": ["uniform", "distance"],
    "p": [1, 2]   # 1 = Manhattan, 2 = Euclidean
}

# Decision Tree
dt_params = {
    "max_depth": [3, 5, 7, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "criterion": ["gini", "entropy"]
}

# Random Forest
rf_params = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False]
}

# KNN GridSearch
knn_grid = GridSearchCV(KNeighborsClassifier(),
                        param_grid=knn_params,
                        cv=5,
                        scoring="roc_auc",
                        n_jobs=-1)

# Decision Tree GridSearch
dt_grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                       param_grid=dt_params,
                       cv=5,
                       scoring="roc_auc",
                       n_jobs=-1)

# Random Forest GridSearch
rf_grid = GridSearchCV(RandomForestClassifier(random_state=42),
                       param_grid=rf_params,
                       cv=5,
                       scoring="roc_auc",
                       n_jobs=-1)


In [18]:
grids = {
    "KNN": knn_grid,
    "Decision Tree": dt_grid,
    "Random Forest": rf_grid
}

for name, grid in grids.items():
    print(f"🔎 Tuning {name} ...")
    grid.fit(X_train, y_train)

    # Separate name so the tuned Logistic Regression in `best_model` is not overwritten
    tuned_model = grid.best_estimator_
    y_predict = tuned_model.predict(X_test)
    y_prob = tuned_model.predict_proba(X_test)[:, 1]

    print(f"Best Params: {grid.best_params_}")
    # scoring="roc_auc", so best_score_ is the mean cross-validated ROC-AUC, not accuracy
    print("Best CV ROC-AUC:", grid.best_score_)
    print(f"Train Accuracy: {tuned_model.score(X_train, y_train):.4f}")
    print(f"Test Accuracy: {accuracy_score(y_test, y_predict):.4f}")
    print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
    print("**" * 25)

🔎 Tuning KNN ...
Best Params: {'n_neighbors': 11, 'p': 1, 'weights': 'uniform'}
Best CV ROC-AUC: 0.8187104377104376
Train Accuracy: 0.8205
Test Accuracy: 0.7966
ROC-AUC: 0.8714
**************************************************
🔎 Tuning Decision Tree ...
Best Params: {'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 2, 'min_samples_split': 10}
Best CV ROC-AUC: 0.8281949221949223
Train Accuracy: 0.8557
Test Accuracy: 0.7898
ROC-AUC: 0.8390
**************************************************
🔎 Tuning Random Forest ...
Best Params: {'bootstrap': True, 'max_depth': None, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 100}
Best CV ROC-AUC: 0.8551748293748294
Train Accuracy: 0.8758
Test Accuracy: 0.8203
ROC-AUC: 0.9000
**************************************************


### Conclusion

**From the baseline model evaluations:**  
- Logistic Regression achieved a strong balance with 84% test accuracy and a ROC-AUC of ~0.897, making it a solid baseline.
- Naive Bayes and KNN performed reasonably but had slightly lower test accuracy (~79%) and ROC-AUC compared to Logistic Regression.
- Decision Tree showed signs of overfitting (very high training accuracy ≈ 98% but low test accuracy ≈ 72%).
- Random Forest achieved a good balance with 81% test accuracy and a ROC-AUC of ~0.889, but still slightly below Logistic Regression.
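
The Decision Tree result is the classic overfitting pattern: an unconstrained tree keeps splitting until it memorizes the training set. The sketch below reproduces that pattern on synthetic data with some label noise (an assumption; `flip_y=0.1` and the depth of 3 are my own choices), comparing a shallow tree with an unlimited one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with ~10% noisy labels (assumption: NOT the Titanic features)
X_demo, y_demo = make_classification(
    n_samples=900, n_features=20, n_informative=5, flip_y=0.1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, stratify=y_demo, random_state=42
)

train_acc, test_acc = {}, {}
for depth in (3, None):  # None = grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    train_acc[depth] = tree.score(X_tr, y_tr)
    test_acc[depth] = tree.score(X_te, y_te)
    print(f"max_depth={depth}: train={train_acc[depth]:.3f}, test={test_acc[depth]:.3f}")
```

The unlimited tree fits the training data perfectly, noise included, so its train accuracy says little about how it will do on new data. Limiting `max_depth` or raising `min_samples_leaf` trades some training accuracy for a smaller train/test gap.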

**After Hyperparameter Tuning:**  
- KNN barely changed (test accuracy 79.3% → 79.7%, ROC-AUC ≈ 0.871).
- Decision Tree generalized better after tuning (test accuracy 72.5% → 79.0%, ROC-AUC 0.712 → 0.839), but remained weaker than Random Forest and Logistic Regression.
- Random Forest reached 82% test accuracy and the highest ROC-AUC (0.900), with a much smaller train/test gap (87.6% vs. 82.0%) than the untuned forest (98.3% vs. 81.0%). Its ROC-AUC edge over Logistic Regression (0.897) is marginal, and Logistic Regression still has the higher test accuracy (84%).

**✅ Summary:**  

1. Tuned Random Forest gave the highest ROC-AUC (0.900), while Logistic Regression gave the highest test accuracy (84%) with an almost identical ROC-AUC (~0.897); the two are effectively tied on this test set.

2. Logistic Regression remained a strong and interpretable baseline model.

3. The Random Forest ensemble proved more robust than a single decision tree and the distance-based KNN.
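
Since the top two models are so close on a single test set of 295 passengers, one way to compare them more fairly is to look at their cross-validated ROC-AUC (mean and standard deviation) on the training data only. The sketch below shows the idea on synthetic data; it is an assumption-laden illustration with default hyperparameters, not a rerun of the Titanic experiment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: NOT the Titanic features)
X_demo, y_demo = make_classification(n_samples=600, n_features=20, random_state=42)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=3000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

summary = {}
for name, estimator in candidates.items():
    auc = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="roc_auc")
    summary[name] = (auc.mean(), auc.std())
    print(f"{name}: ROC-AUC = {auc.mean():.3f} +/- {auc.std():.3f}")
```

If the difference between the means is smaller than the fold-to-fold standard deviation, it is hard to call either model the winner, and the simpler, more interpretable one is a reasonable default.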