# **Titanic Survival Prediction**

## Introduction

In this project, we aim to predict the survival of passengers aboard the Titanic using machine learning. This is part of Kaggle’s beginner-friendly competition, **"Titanic: Machine Learning from Disaster."** Kaggle provides us with two datasets: a **training set** (with survival outcomes) and a **test set** (without the target column).

Our goal is to train a model using the training data and generate predictions on the test set. However, since the test set does **not include the `Survived` column**, we cannot directly evaluate our model's performance on it.

To overcome this, we'll first split the original training data into **training and validation subsets**. This allows us to train the model on one part and evaluate it on the other using metrics like the **classification report** and **confusion matrix**. Once we are satisfied with the model's performance, we will use it to make final predictions on the test data for submission.

This approach helps us build a more reliable and well-validated model.

## Objectives

- Import the data from the Kaggle repository  
- Perform data wrangling and preprocessing  
- Create a machine learning pipeline  
- Tune hyperparameters for optimal model performance  
- Train the model on the training data using **Random Forest**  
- Evaluate the model on the validation set  
- Train a second model using **Logistic Regression**  
- Evaluate and compare both models on the validation set  
- Select the best-performing model and apply it to the test data  
- Prepare the predictions for submission

## Import the data from the Kaggle repository

In [None]:
#first let's import all required libraries

import pandas as pd #data manipulation, handle data in tabular format (DataFrames)
import numpy as np #numerical operations
import matplotlib.pyplot as plt #basic plotting, data visualization
import seaborn as sns #advanced statistical plots
%matplotlib inline

import warnings #suppress the warning for better output
warnings.filterwarnings('ignore')


from sklearn.model_selection import train_test_split  # split data into training and validation sets
from sklearn.model_selection import GridSearchCV  # for hyperparameter tuning using grid search
from sklearn.model_selection import StratifiedKFold  # for cross-validation while preserving class distribution

from sklearn.compose import ColumnTransformer  # apply different preprocessing to numerical and categorical features
from sklearn.pipeline import Pipeline  # chain preprocessing and modeling steps together
from sklearn.preprocessing import StandardScaler  #standardize numerical features (mean=0, std=1)
from sklearn.preprocessing import OneHotEncoder  # convert categorical variables into binary dummy variables

from sklearn.ensemble import RandomForestClassifier  # powerful ensemble method using decision trees
from sklearn.linear_model import LogisticRegression  # linear model for binary classification

from sklearn.metrics import classification_report  # get precision, recall, f1-score, etc.
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay  # compute and visualize confusion matrix

from sklearn.impute import SimpleImputer  # fill in missing values

In [None]:
#import train and test data sets
train_data = '/kaggle/input/titanic/train.csv'
test_data = '/kaggle/input/titanic/test.csv'

train_df = pd.read_csv(train_data)
test_df = pd.read_csv(test_data)

train_df.head()

In [None]:
test_df.head()

In [None]:
train_df.shape

So our training dataset has 891 rows and 12 columns.

In [None]:
train_df.count()

From the results we can see that the `Cabin` column has a lot of missing values, so let's drop that column. The `Age` and `Embarked` columns also have some missing values, so let's replace the missing `Age` values with the median and the missing `Embarked` values with the most frequent value.
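As a rough illustration of how missing values can be counted per column, here is a minimal sketch on a toy DataFrame. The frame `toy_df` and its values are made up for this example; they stand in for `train_df` and are not the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
})

# Count and percentage of missing values in each column.
missing_counts = toy_df.isna().sum()
missing_share = toy_df.isna().mean().mul(100).round(1)

print(pd.DataFrame({"missing": missing_counts, "percent": missing_share}))
```

A column that is mostly empty (like `Cabin` in this toy frame) is a candidate for dropping, while columns with only a few gaps are usually better imputed.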

In [None]:
#drop 'Survived' (target), 'Cabin' (too many missing values) and 'Name'
X = train_df.drop(['Survived', 'Cabin', 'Name'], axis = 1)
y = train_df['Survived']

### How balanced are the classes in the dataset?
Class balance refers to whether the target variable (`Survived` in our dataset) has roughly equal representation of each category ('0' = did not survive, '1' = survived). If one class is much more frequent than the other, the dataset is imbalanced.

In [None]:
y.value_counts()

In [None]:
y.value_counts(normalize=True)*100

So about 38% of the passengers in the dataset survived. Because of this slight imbalance, we should stratify the data when performing the train/validation split and during cross-validation.

What is stratify in Machine Learning and why is it important?

`stratify` is a parameter of `train_test_split()` that makes the train and test sets keep the same class distribution as the original dataset. This is especially useful when the target variable (here, `Survived`) is imbalanced.

Why use `stratify`?
- It prevents the split itself from skewing the class balance of either subset.
- It ensures both subsets represent the original dataset's class distribution.
- It avoids one subset ending up with a noticeably higher share of survivors than the other, which would make the validation metrics misleading.

- Without `stratify`, the split may result in an uneven distribution of survivors and non-survivors.
- With `stratify`, both subsets have the same proportion of survivors as the original dataset (up to rounding).
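To make the effect concrete, here is a small sketch on synthetic labels with a 38% positive rate. The arrays `X_demo` and `y_demo` are invented for this example and are not the Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 190 positives out of 500 samples (38%).
y_demo = np.array([1] * 190 + [0] * 310)
X_demo = np.arange(500).reshape(-1, 1)

# Stratified split: both subsets keep the 38% positive rate.
_, _, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)
print("stratified   train/val positive rate:", y_tr.mean(), y_va.mean())

# Plain random split: the rates are free to drift apart.
_, _, y_tr_plain, y_va_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print("unstratified train/val positive rate:", y_tr_plain.mean(), y_va_plain.mean())
```

On a dataset this size the drift without stratification is usually small, but it grows as the dataset or the minority class shrinks.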

In [None]:
#SPLIT TRAIN SET INTO TRAIN AND VALIDATION SUBSETS
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)

## Perform data wrangling and preprocessing
Now let's define preprocessing transformers for numerical and categorical features. The code below automatically detects the numerical and categorical columns and collects them into two separate lists.

In [None]:
numerical_features = X_train.select_dtypes(include = ['number']).columns.tolist()
categorical_features = X_train.select_dtypes(include = ['object', 'category']).columns.tolist()

print(numerical_features)
print(categorical_features)

Let's create two preprocessing pipelines, one for numerical and one for categorical features. Each pipeline automates the cleaning and transformation steps that run before the data reaches the machine learning model: the numerical pipeline fills missing values and standardizes the features, and the categorical pipeline fills missing values and one-hot encodes the categories.
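To see what the numerical steps do to actual values, here is a minimal sketch on a single toy column. The `ages` array and the name `demo_numeric` are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# One toy numeric column with a missing entry.
ages = np.array([[22.0], [38.0], [np.nan], [26.0]])

# Step 1 alone: the gap is filled with the median of the observed values (26).
imputed = SimpleImputer(strategy="median").fit_transform(ages)
print(imputed.ravel())

# Steps 1 and 2 chained: impute, then rescale to mean 0 and standard deviation 1.
demo_numeric = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
scaled = demo_numeric.fit_transform(ages)
print(scaled.ravel())
```

Because both steps live in one `Pipeline`, the median and the scaling statistics are learned from the data passed to `fit` and then reused unchanged on any later data.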

In [None]:
numerical_features_transformer = Pipeline(steps = [('imputer', SimpleImputer(strategy = 'median')), ('scaler', StandardScaler())])

Categorical_features_transformer = Pipeline(steps = [('imputer', SimpleImputer(strategy = 'most_frequent')), ('onehot', OneHotEncoder(handle_unknown = 'ignore'))])

### Combine the transformers into a single column transformer

We'll use scikit-learn's `ColumnTransformer` to transform the numerical and categorical features separately. It then concatenates the outputs into a single feature space, ready for input to a machine learning estimator.

Note:
- Pipeline = step-by-step transformation for one type of data (numerical or categorical)
- ColumnTransformer = applies multiple pipelines (numerical + categorical) to their own columns and combines the results.
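The following sketch shows the concatenation on a toy frame. The frame `toy` and the `demo_*` names are assumptions made for this example; the structure mirrors the two pipelines described above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame: two numeric columns and two categorical columns.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
})

demo_numeric = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
demo_categorical = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

demo_preprocessor = ColumnTransformer(transformers=[
    ("numeric", demo_numeric, ["Age", "Fare"]),
    ("categorical", demo_categorical, ["Sex", "Embarked"]),
])

transformed = demo_preprocessor.fit_transform(toy)
# 2 scaled numeric columns + 2 one-hot columns for Sex + 3 for Embarked.
print(transformed.shape)
```

The output width is the sum of what each branch produces, which is why one-hot encoding a column with many distinct values can widen the feature matrix considerably.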

In [None]:
preprocessor = ColumnTransformer(
    transformers = [
        ('numeric', numerical_features_transformer, numerical_features),
        ('categorical', Categorical_features_transformer, categorical_features)
    ]
)

## Create a machine learning pipeline
Now let's create the model pipeline by combining the preprocessing with a Random Forest Classifier.

In [None]:
pipeline = Pipeline(steps = [
    ('preprocessing', preprocessor),
    ('classifier', RandomForestClassifier(random_state = 42))
])

### Define a Parameter Grid

A parameter grid is a structured way to define the set of hyperparameter values to try when tuning a machine learning model. It is commonly used in grid search to find the combination of hyperparameters that gives the best model performance.

How does it work?

1. A dictionary-like structure specifies the candidate values for each hyperparameter.
2. The model is trained and evaluated for each combination.
3. The best-performing combination is chosen based on a scoring metric (e.g., accuracy for classification, RMSE for regression).

For example, some hyperparameters of a random forest are:
1. Number of trees (`n_estimators`) - More trees can make the model better, but too many slow it down.
2. Maximum depth of trees (`max_depth`) - A deeper tree captures more detail but may overfit.
3. Minimum samples per split (`min_samples_split`) - Controls how many samples a node must contain before it can be split.
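As an illustration of what "each combination" means, the sketch below expands a grid into its individual candidates with plain Python. The name `demo_grid` is made up here, and its values match the grid used in this notebook. scikit-learn performs this expansion internally, so this is only a way to see the candidates.

```python
from itertools import product

# Hypothetical grid for illustration; the keys follow the
# "<pipeline step>__<parameter>" naming used with Pipeline objects.
demo_grid = {
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [None, 10, 20],
    "classifier__min_samples_split": [2, 5],
}

# Cartesian product of all value lists: one dict per candidate.
names = list(demo_grid)
candidates = [dict(zip(names, values)) for values in product(*demo_grid.values())]

print(len(candidates), "candidates")
for candidate in candidates[:3]:
    print(candidate)
```

The number of candidates is the product of the list lengths, so adding one more value to any list multiplies the search cost.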

Let's use the grid in a cross-validation search to optimize the model.

In [None]:
parameter_grid = {
    'classifier__n_estimators': [50, 100], #2 options
    'classifier__max_depth': [None, 10, 20], #3 options
    'classifier__min_samples_split': [2,5] #2 options
}

#total number of hyperparameter combinations = 2*3*2 = 12 candidates
#with 5-fold cross-validation, GridSearchCV fits each candidate 5 times (60 fits), then refits the best one on all of X_train.

### Perform grid search cross-validation and fit the best model to the training data

It means:
1. Try different hyperparameter combinations (grid search)
2. Evaluate each combination using cross-validation
3. Find the best-performing combination
4. Train the final model with the best parameters on the full training data

Step-by-step explanation:
1. Grid Search (Trying different settings)

Imagine we're baking a cake and testing different oven temperatures and baking times to find the best combination. Grid search does this for machine learning models by testing multiple hyperparameter combinations.

2. Cross-Validation (Ensuring stability)

Instead of training the model on a single split of the data, cross-validation splits the training data into multiple parts (folds), trains the model on some folds, and tests it on the remaining one. This is repeated so that every fold is used for testing once, which shows whether the model works well across different data splits.

3. Find the best parameters

After testing all combinations, the search picks the best hyperparameters based on a performance metric (e.g., accuracy or F1-score for classification).

4. Fit the best model on the full training data

Once the best hyperparameters are found, a final model is trained on the entire training dataset using those parameters. This is the final model used for predictions.

We have already split the data into training and validation subsets using `train_test_split`. That is a simple hold-out validation, and it is perfectly valid for early testing and fast model iteration.

Why use `StratifiedKFold` or cross-validation? `StratifiedKFold` is a cross-validation splitter, and cross-validation gives a more robust estimate of the model's performance.

`GridSearchCV` uses cross-validation internally (here, `StratifiedKFold`) to evaluate how different hyperparameter combinations perform.

Instead of relying on a single split (which might be lucky or unlucky), cross-validation:
- Trains on different parts of the training data
- Validates on the remaining parts
- Averages the performance

This makes model selection less prone to overfitting on one split.
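The train, validate, and average loop described above can be sketched with `cross_val_score` on synthetic data. The dataset from `make_classification` and the names `X_demo`, `y_demo`, and `cv_demo` are assumptions for this example, not part of the Titanic workflow.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification problem, used only for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Five stratified folds: each fold serves as the validation part once.
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv_demo, scoring="accuracy"
)

print("per-fold accuracy:", scores.round(3))
print("mean accuracy:", scores.mean().round(3))
```

The spread of the per-fold scores gives a sense of how much a single hold-out split could have over- or under-stated the model's performance.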

## Evaluate the model on the validation set

In [None]:
#cross-validation method
cv = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)  # fixed seed so the folds are reproducible

In [None]:
#train the pipeline model
model = GridSearchCV(estimator = pipeline, param_grid = parameter_grid, cv = cv, scoring = 'accuracy', verbose = 2)
model.fit(X_train, y_train)

In [None]:
#print the best parameter and best cross validation score
print('\nBest Parameters Found: ', model.best_params_)
print('Best Cross-Validation Score: {:.2f}'.format(model.best_score_))

In [None]:
#display the model's score on the validation set
val_score = model.score(X_val, y_val)
print('Validation set score: {:.2f}'.format(val_score))

In [None]:
#Let's get the model predictions from the grid search estimator on the unseen validation data, and print a classification report.
y_pred = model.predict(X_val)
print(classification_report(y_val, y_pred))

In [None]:
#plot confusion matrix
Random_forest_conf_matrix = confusion_matrix(y_val, y_pred)

plt.figure()
sns.heatmap(Random_forest_conf_matrix, annot = True, cmap = 'Blues', fmt = 'd')

plt.title('Titanic Classification Confusion Matrix - Random Forest')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.tight_layout()
plt.show()

### Insights
- Class 0 (did not survive) is predicted well: a recall of 0.93 means the model catches most of the actual non-survivors.
- Class 1 (survived) has good precision (0.84) but relatively low recall (0.59), so the model misses many passengers who actually survived.
- Overall accuracy is 80%, which is solid, but there is room to improve recall on class 1.

- True Negatives (0 → 0): 102. The model correctly predicted most non-survivors.
- False Positives (0 → 1): 8. A few non-survivors were wrongly predicted as survivors.
- False Negatives (1 → 0): 28. This is the main weakness: real survivors the model missed.
- True Positives (1 → 1): 41. Decent, but ideally higher.

- Our model is better at identifying non-survivors than survivors.
- The low recall for class 1 suggests the model is conservative about predicting survival.
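The figures quoted above can be rederived from the four confusion-matrix counts. The sketch below does the arithmetic in plain Python; the variable names are chosen for this example.

```python
# Counts taken from the Random Forest confusion matrix discussed above.
tn, fp, fn, tp = 102, 8, 28, 41
total = tn + fp + fn + tp

accuracy = (tn + tp) / total          # share of all predictions that are correct
precision_1 = tp / (tp + fp)          # of predicted survivors, how many survived
recall_1 = tp / (tp + fn)             # of actual survivors, how many were found
recall_0 = tn / (tn + fp)             # of actual non-survivors, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy    = {accuracy:.2f}")
print(f"precision_1 = {precision_1:.2f}")
print(f"recall_1    = {recall_1:.2f}")
print(f"recall_0    = {recall_0:.2f}")
print(f"f1_1        = {f1_1:.2f}")
```

Recall for class 1 is driven entirely by the 28 false negatives, which is why reducing them is the clearest way to improve this model.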


## Train a second model using Logistic Regression


In [None]:
#Replace RandomForestClassifier with Logistic Regression
pipeline.set_params(classifier = LogisticRegression(random_state = 42))

#update the model estimator to use the new pipeline
model.estimator = pipeline

#define a new grid with logistic regression parameters
parameter_grid_LR = {
    'classifier__solver': ['liblinear'],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__class_weight': [None, 'balanced']
}

model.param_grid = parameter_grid_LR

#fit the updated pipeline with logistic regression
model.fit(X_train, y_train)

#make predictions
y_pred = model.predict(X_val)

In [None]:
#Display the classification report for the new model and compare the results to our previous model
classification_report_lr = classification_report(y_val, y_pred)
print(classification_report_lr)

All of the scores are slightly better for Logistic Regression than for Random Forest, although the differences are small.

In [None]:
#display the confusion matrix for the new model and compare the results to our previous model.

confusion_matrix_lr = confusion_matrix(y_val, y_pred)

plt.figure()
sns.heatmap(confusion_matrix_lr, annot=True, cmap = 'Blues', fmt = 'd')

plt.title('Titanic Classification Confusion Matrix - Logistic Regression')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.tight_layout()
plt.show()

### Insights
- TN (True Negatives): 99 passengers were correctly predicted not to survive.
- FP (False Positives): 11 passengers were predicted to survive but did not.
- FN (False Negatives): 23 passengers were predicted not to survive but did.
- TP (True Positives): 46 passengers were correctly predicted to survive.
- Class 0 (Not Survived): Very high recall (0.90), meaning the model identifies most of the non-survivors correctly.
- Class 1 (Survived): Precision is solid (0.81), but recall is lower (0.67), so the model misses some actual survivors (the 23 FN in the confusion matrix).
- Accuracy: 81% overall, which is solid for a baseline model.
- Macro Avg (average across classes): Balanced, but F1 could improve slightly for Class 1.
- Strong performance in identifying passengers who didn’t survive.
- Precision is similar for both classes (about 0.81).


Now let's try two other models:
- XGBoost
- SVM

In [None]:
from xgboost import XGBClassifier

pipeline.set_params(classifier = XGBClassifier(random_state = 42, eval_metric = 'logloss'))
model.estimator = pipeline
parameter_grid_XGB = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [3, 5, 7],
    'classifier__learning_rate': [0.01, 0.1, 0.2],
    'classifier__subsample': [0.8, 1]
}

model.param_grid = parameter_grid_XGB
model.fit(X_train, y_train)
y_pred_XGB = model.predict(X_val)

In [None]:
#Display the classification report for the new model and compare the results to our previous models
classification_report_XGB = classification_report(y_val, y_pred_XGB)
print(classification_report_XGB)

In [None]:
#display the confusion matrix for the new model and compare the results to our previous model.

confusion_matrix_XGB = confusion_matrix(y_val, y_pred_XGB)

plt.figure()
sns.heatmap(confusion_matrix_XGB, annot=True, cmap = 'Blues', fmt = 'd')

plt.title('Titanic Classification Confusion Matrix - XGBoost')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.tight_layout()
plt.show()

In [None]:
#FOR SVM
from sklearn.svm import SVC

# Replace classifier in the pipeline
pipeline.set_params(classifier=SVC(kernel='linear', probability=True, random_state=42))

# Update model's estimator
model.estimator = pipeline

# Define parameter grid for SVM
parameter_grid_SVM = {
    'classifier__C': [0.1, 1, 10],
    'classifier__kernel': ['linear', 'rbf'],
    'classifier__class_weight': [None, 'balanced'],
}

# Update model's param_grid
model.param_grid = parameter_grid_SVM

# Fit model
model.fit(X_train, y_train)

# Make predictions
y_pred_SVM = model.predict(X_val)

In [None]:
#Display the classification report for the new model and compare the results to our previous models
classification_report_SVM = classification_report(y_val, y_pred_SVM)
print(classification_report_SVM)

In [None]:
#display the confusion matrix for the new model and compare the results to our previous model.

confusion_matrix_SVM = confusion_matrix(y_val, y_pred_SVM)

plt.figure()
sns.heatmap(confusion_matrix_SVM, annot=True, cmap = 'Blues', fmt = 'd')

plt.title('Titanic Classification Confusion Matrix - SVM')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.tight_layout()
plt.show()

## Evaluate and compare four models on the validation set

| Model | Accuracy | Recall (Class 1) | Precision (Class 1) | F1 (Class 1) |
|---|---|---|---|---|
| Random Forest | 0.79 | 0.64 | 0.79 | 0.70 |
| Logistic Regression | 0.81 | 0.67 | 0.81 | 0.73 |
| XGBoost | 0.76 | 0.65 | 0.70 | 0.68 |
| SVM | 0.81 | 0.71 | 0.78 | 0.74 |

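- SVM is a strong performer here, especially for predicting survivors: it has the highest class 1 recall (0.71) and F1 (0.74).
- Logistic Regression matches SVM on overall accuracy (0.81) and has the highest class 1 precision (0.81), but its class 1 recall is lower, so SVM is the more balanced of the two.
- Random Forest and XGBoost are good but lean more toward class 0 prediction strength.
- The Random Forest row differs slightly from the figures in its Insights section (accuracy 0.80, recall 0.59, precision 0.84), presumably because it was recorded on a different run.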
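One way to make the selection explicit is to rank the models by a chosen metric. The sketch below uses the values from the table above and ranks by class 1 F1; choosing F1 as the ranking metric is an assumption of this example, and `validation_results` is a name made up for it.

```python
# Validation metrics copied from the comparison table (class 1 figures).
validation_results = {
    "Random Forest": {"accuracy": 0.79, "recall": 0.64, "precision": 0.79, "f1": 0.70},
    "Logistic Regression": {"accuracy": 0.81, "recall": 0.67, "precision": 0.81, "f1": 0.73},
    "XGBoost": {"accuracy": 0.76, "recall": 0.65, "precision": 0.70, "f1": 0.68},
    "SVM": {"accuracy": 0.81, "recall": 0.71, "precision": 0.78, "f1": 0.74},
}

# Rank models from best to worst on class 1 F1.
ranked = sorted(
    validation_results,
    key=lambda name: validation_results[name]["f1"],
    reverse=True,
)
best_model_name = ranked[0]

for name in ranked:
    print(f"{name:20s} f1={validation_results[name]['f1']:.2f}")
print("selected:", best_model_name)
```

Ranking by accuracy instead would leave Logistic Regression and SVM tied, which is why a metric that reflects survivor recall is more useful for breaking the tie here.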

## Select the best-performing model and apply it to the test data
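Our final model is therefore the SVM. Because `model` was most recently fitted with the SVM grid, it already holds the tuned SVM pipeline, so we can use it directly to predict on the actual test data.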
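Before predicting, it can help to confirm that the test frame contains every column the pipeline was trained on. The helper below is a hypothetical sketch (`missing_feature_columns` is not part of scikit-learn or pandas), demonstrated on toy frames rather than the real data.

```python
import pandas as pd


def missing_feature_columns(train_features, test_frame):
    """Return the training feature columns that are absent from the test frame."""
    return [col for col in train_features.columns if col not in test_frame.columns]


# Toy frames standing in for X and test_df.
toy_train = pd.DataFrame({"PassengerId": [1], "Pclass": [3], "Sex": ["male"]})
toy_test_ok = pd.DataFrame({"PassengerId": [892], "Pclass": [3], "Sex": ["male"]})
toy_test_bad = pd.DataFrame({"PassengerId": [892], "Pclass": [3]})

print(missing_feature_columns(toy_train, toy_test_ok))
print(missing_feature_columns(toy_train, toy_test_bad))
```

An empty list means the column sets line up; any name in the list would otherwise surface as an error inside the `ColumnTransformer` at prediction time.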

In [None]:
test_df = pd.read_csv("/kaggle/input/titanic/test.csv")
test_df = test_df.drop(columns=["Name", "Cabin"])

#Predict using the trained model pipeline
test_predictions = model.predict(test_df)


## Prepare the predictions for submission

In [None]:
submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": test_predictions
})

submission.to_csv("submission.csv", index=False)


In [None]:
submission.head()

In [None]:
from IPython.display import FileLink

# Create a clickable download link
FileLink("submission.csv")
