# Insider ML Engineer Challenge - Titanic Dataset

In this challenge, we are going to develop a complete API that serves an AI model.

The Titanic Dataset can be found on [Kaggle](https://www.kaggle.com/competitions/titanic/data), but because it is small, a copy is already included in the [`dataset directory`](../dataset).

## Requirements

- Python 3.12

### Preparing the environment

**Note: The following commands are intended for a Linux terminal.**

First, we need to create and activate our virtual environment (venv):

```bash
python3 -m venv venv && source venv/bin/activate
```

Then, using pip, we install the dependencies from [`requirements.txt`](../requirements.txt):

```bash
pip install -r requirements.txt
```

### Git LFS

It is also a good idea to install Git LFS for uploading models to GitHub.
On Linux systems, specifically Debian-based ones such as Ubuntu, we can install it from the terminal:

```bash
sudo apt install git-lfs
```

Then, we must initialize Git LFS in our project directory:

```bash
git lfs install
```

Now, we need to make it track our pickle files (in this case, there is just one file):

```bash
git lfs track "*.pkl"
```

This will create (or update) our [`.gitattributes`](../.gitattributes) file. With this rule in place, Git stores only a small pointer file in the repository, while the actual file content is stored on the Git LFS server. This way, the repository holds a lightweight reference instead of the larger binary file.

In this challenge, our models will be quite small, but we will still use Git LFS to follow real-world good practices.

## Code

**This code focuses on discovering which data is useful for training the models. Therefore, it _is not_ the clean EDA and feature engineering pipeline used in the API. Normally, we would add it to `.gitignore`, but we keep it for documentation purposes.**

A good starting point is the [`gender_submission file`](../dataset/gender_submission.csv), which assumes that all women survived and all men died. This reflects the "women and children first" priority applied in accidents. This file gets a score of **0.76555** on Kaggle, which we can use as a minimum threshold for our AI to exceed.
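The gender-only rule can be sketched as a baseline classifier. The snippet below is a minimal illustration on a hypothetical toy DataFrame (the rows are made up, not taken from `train.csv`), so the accuracy it prints is not the Kaggle score.

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical rows, not real passengers).
toy_df = pd.DataFrame(
    {
        "Sex": ["female", "male", "female", "male", "male"],
        "Survived": [1, 0, 0, 0, 1],
    }
)

# Gender-only rule: predict 1 (survived) for women and 0 for men.
baseline_pred = (toy_df["Sex"] == "female").astype(int)

# Accuracy is the share of rows where the rule matches the label.
baseline_accuracy = (baseline_pred == toy_df["Survived"]).mean()
print(f"Baseline accuracy: {baseline_accuracy:.2f}")  # Baseline accuracy: 0.60
```

Running the same rule on the real training split would give a local estimate to compare against the models trained below.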

Let's start by importing our dependencies and then check whether gender is a good predictor.

### Importing Dependencies

In [72]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import joblib
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.svm import SVC

%matplotlib inline

### Load dataset

In [107]:
train_df = pd.read_csv("../dataset/train.csv")

In [108]:
test_df = pd.read_csv("../dataset/test.csv")

### Feature Engineering

In [109]:
train_df["number_of_cabins"] = train_df.Cabin.apply(
    lambda x: 0 if pd.isna(x) else len(x.split(" "))
)

In [None]:
train_df["cabin_categories"] = train_df.Cabin.apply(lambda x: str(x)[0])

In [111]:
# Extract title from Name
train_df["name_title"] = train_df["Name"].apply(
    lambda x: x.split(",")[1].split(".")[0].strip()
)

In [112]:
train_df["train_test_join"] = 1
test_df["train_test_join"] = 0

test_df["Survived"] = -1  # Creating column Survived in test split

In [115]:
train_test_df = pd.concat(
    [train_df, test_df]
)  # We will group them for general analysis

In [116]:
train_test_df["Embarked"] = train_test_df["Embarked"].fillna("C")

In [117]:
# Fill null with the median age and fare
train_test_df.Age = train_test_df.Age.fillna(train_test_df.Age.median())
train_test_df.Fare = train_test_df.Fare.fillna(train_test_df.Fare.median())

# Log-transform Fare with log(1 + x) to reduce its skew
train_test_df["norm_fare"] = np.log(train_test_df.Fare + 1)
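As a side note, `np.log(x + 1)` can also be written with NumPy's `np.log1p`, which computes the same quantity and is more accurate for values near zero. A small check on a few illustrative fare values (chosen here only as examples):

```python
import numpy as np

# Illustrative fare values, including a zero fare.
fares = np.array([0.0, 7.25, 71.2833, 512.3292])

# log1p(x) computes log(1 + x); a zero fare maps to 0.
log_fares = np.log1p(fares)

print(np.allclose(log_fares, np.log(fares + 1)))  # True
```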

In [118]:
# Count how many cabins a person has (0 if missing, otherwise count how many space-separated entries)
train_test_df["cabin_multiple"] = train_test_df.Cabin.apply(
    lambda x: 0 if pd.isna(x) else len(x.split(" "))
)

# Take the first letter of the cabin string (e.g., 'C85' → 'C'; NaN becomes 'n')
train_test_df["cabin_categories"] = train_test_df.Cabin.apply(
    lambda x: str(x)[0]
)

train_test_df.drop(columns=["Ticket"], inplace=True, errors="ignore")  # Ignore the error if Ticket was already dropped

# Extract the title from the name (e.g., "Mr", "Mrs", "Miss", etc.)
train_test_df["name_title"] = train_test_df.Name.apply(
    lambda x: x.split(",")[1].split(".")[0].strip()
)
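To make the three engineered features above easier to follow, here is a small sketch that applies the same logic to individual values. The helper names (`count_cabins`, `cabin_letter`, `extract_title`) are hypothetical and exist only for this illustration; the notebook itself uses inline lambdas.

```python
import pandas as pd


def count_cabins(cabin):
    """Number of space-separated cabin entries; 0 when the value is missing."""
    return 0 if pd.isna(cabin) else len(cabin.split(" "))


def cabin_letter(cabin):
    """First character of the cabin string; NaN becomes 'n' because str(nan) is 'nan'."""
    return str(cabin)[0]


def extract_title(name):
    """Text between the first comma and the following period, e.g. 'Mr'."""
    return name.split(",")[1].split(".")[0].strip()


print(count_cabins("C23 C25 C27"))  # 3
print(count_cabins(float("nan")))  # 0
print(cabin_letter("C85"))  # C
print(cabin_letter(float("nan")))  # n
print(extract_title("Braund, Mr. Owen Harris"))  # Mr
```

The `'n'` category is how passengers without a recorded cabin end up in the `cabin_categories_n` column after one-hot encoding.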

In [119]:
# Convert to strings so these columns are one-hot encoded as categories
train_test_df["cabin_multiple"] = train_test_df["cabin_multiple"].astype(str)
train_test_df.Pclass = train_test_df.Pclass.astype(str)

# Define categorical and numeric features
categorical_cols = [
    "Pclass",
    "Sex",
    "Embarked",
    "cabin_categories",
    "cabin_multiple",
    "name_title",
]
numerical_cols = [
    "Age",
    "SibSp",
    "Parch",
    "norm_fare",
]
meta_col = ["train_test_join"]  # used to split train/test later

# Apply one-hot encoding only to the categorical columns.
# We already did this for the promising categories during analysis, but did not merge the result into the training data. This time, we one-hot encode all of them.
categorical_dummies = pd.get_dummies(train_test_df[categorical_cols])

# Combine with numerical + meta columns
train_test_dummies = pd.concat(
    [
        train_test_df[numerical_cols + meta_col].reset_index(drop=True),
        categorical_dummies.reset_index(drop=True),
    ],
    axis=1,
)

In [121]:
# Split train and test
X_full_train = train_test_dummies[train_test_dummies.train_test_join == 1].drop(
    ["train_test_join"], axis=1
)
X_test = train_test_dummies[train_test_dummies.train_test_join == 0].drop(
    ["train_test_join"], axis=1
)
y_full_train = train_test_df[train_test_df.train_test_join == 1].Survived

# Split training and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_full_train,
    y_full_train,
    test_size=0.2,
    random_state=29,
    stratify=y_full_train,
)

# Scale data
scale = StandardScaler()

# Columns to scale
columns_to_scale = ["Age", "SibSp", "Parch", "norm_fare"]

# Fit scaler only on X_train, then transform all
X_train[columns_to_scale] = scale.fit_transform(X_train[columns_to_scale])
X_val[columns_to_scale] = scale.transform(X_val[columns_to_scale])
X_test[columns_to_scale] = scale.transform(X_test[columns_to_scale])
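The manual steps above (one-hot encoding, then fitting the scaler only on the training split) can also be bundled into a single scikit-learn `Pipeline`. The sketch below is one possible arrangement, not the pipeline used in the API: it swaps `pd.get_dummies` for `OneHotEncoder`, and the toy rows and the reduced column list are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Toy rows standing in for the engineered features (hypothetical values).
toy_X = pd.DataFrame(
    {
        "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0, 14.0],
        "norm_fare": [2.1, 4.3, 2.2, 4.0, 3.9, 3.1, 2.5, 3.4],
        "Sex": ["male", "female", "female", "female", "male", "male", "female", "female"],
        "Pclass": ["3", "1", "3", "1", "1", "3", "3", "2"],
    }
)
toy_y = [0, 1, 1, 1, 0, 0, 1, 1]

# Scale numeric columns and one-hot encode categorical ones.
# handle_unknown="ignore" encodes unseen categories as all zeros.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["Age", "norm_fare"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex", "Pclass"]),
    ]
)

# Fitting the pipeline fits the scaler and encoder on the training data only.
pipe = Pipeline([("preprocess", preprocess), ("model", SVC(random_state=29))])
pipe.fit(toy_X, toy_y)

# A new row with a category never seen during training ("4") still predicts.
new_row = pd.DataFrame(
    {"Age": [30.0], "norm_fare": [3.0], "Sex": ["female"], "Pclass": ["4"]}
)
pred = pipe.predict(new_row)
print(pred)
```

Keeping preprocessing inside the pipeline means the saved model carries its own scaler and encoder, so validation and test data are transformed with statistics learned from the training split only.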

### Model

The Random Forest and SVC models were chosen based on previous personal work with this dataset.

In [122]:
rf_model = RandomForestClassifier(random_state=29)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_val)

# Evaluation
print("Random Forest Evaluation:")
print("Accuracy:", accuracy_score(y_val, y_pred_rf))
print("F1 Score:", f1_score(y_val, y_pred_rf))
print("Precision:", precision_score(y_val, y_pred_rf))
print("Recall:", recall_score(y_val, y_pred_rf))

Random Forest Evaluation:
Accuracy: 0.8435754189944135
F1 Score: 0.8
Precision: 0.7887323943661971
Recall: 0.8115942028985508


In [123]:
svm_model = SVC(probability=True, random_state=29)
svm_model.fit(X_train, y_train)
y_pred_svm = svm_model.predict(X_val)

# Evaluation
print("SVC Evaluation:")
print("Accuracy:", accuracy_score(y_val, y_pred_svm))
print("F1 Score:", f1_score(y_val, y_pred_svm))
print("Precision:", precision_score(y_val, y_pred_svm))
print("Recall:", recall_score(y_val, y_pred_svm))

SVC Evaluation:
Accuracy: 0.8491620111731844
F1 Score: 0.8
Precision: 0.8181818181818182
Recall: 0.782608695652174


### Fine-tuning

In [124]:
def clf_performance(model, name):
    print(f"\nBest {name} Model Evaluation:")
    print("Best parameters:", model.best_params_)

    y_pred = model.predict(X_val)
    print("Accuracy:", accuracy_score(y_val, y_pred))
    print("F1 Score:", f1_score(y_val, y_pred))
    print("Precision:", precision_score(y_val, y_pred))
    print("Recall:", recall_score(y_val, y_pred))
    print("Classification Report:\n", classification_report(y_val, y_pred))

In [61]:
rf = RandomForestClassifier(random_state=29)
param_grid_rf = {
    "n_estimators": [100, 200, 400, 500],
    "criterion": ["gini", "entropy"],
    "bootstrap": [True],
    "max_depth": [15, 20, 25],
    "max_features": ["log2", "sqrt", 10],
    "min_samples_leaf": [2, 3],
    "min_samples_split": [2, 3],
}

clf_rf = GridSearchCV(rf, param_grid=param_grid_rf, cv=5, verbose=2, n_jobs=-1)
best_clf_rf = clf_rf.fit(X_train, y_train)
clf_performance(best_clf_rf, "Random Forest")

Fitting 5 folds for each of 288 candidates, totalling 1440 fits


[CV] END bootstrap=True, criterion=gini, max_depth=15, max_features=log2, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time=   0.3s
[CV] END bootstrap=True, criterion=gini, max_depth=15, max_features=log2, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time=   0.4s
[CV] END bootstrap=True, criterion=gini, max_depth=15, max_features=log2, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time=   0.4s
[CV] END bootstrap=True, criterion=gini, max_depth=15, max_features=log2, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time=   0.4s
[CV] END bootstrap=True, criterion=gini, max_depth=15, max_features=log2, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time=   0.4s
[CV] END bootstrap=True, criterion=gini, max_depth=15, max_features=log2, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time=   0.7s
[CV] END bootstrap=True, criterion=gini, max_depth=15, max_features=log2, min_samples_le

In [63]:
# SVC
svc = SVC(probability=True, random_state=29)
param_grid_svc = [
    {
        "kernel": ["rbf"],
        "gamma": [0.1, 0.5, 1, 2, 5, 10],
        "C": [0.1, 1, 10, 100, 1000],
    },
    {"kernel": ["linear"], "C": [0.1, 1, 10, 100, 1000]},
    {"kernel": ["poly"], "degree": [2, 3, 4, 5], "C": [0.1, 1, 10, 100, 1000]},
]

clf_svc = GridSearchCV(
    svc, param_grid=param_grid_svc, cv=5, verbose=2, n_jobs=-1
)
best_clf_svc = clf_svc.fit(X_train, y_train)
clf_performance(best_clf_svc, "SVC")

Fitting 5 folds for each of 55 candidates, totalling 275 fits
[CV] END .......................C=0.1, gamma=0.1, kernel=rbf; total time=   0.1s
[CV] END .......................C=0.1, gamma=0.1, kernel=rbf; total time=   0.2s
[CV] END .......................C=0.1, gamma=0.5, kernel=rbf; total time=   0.2s
[CV] END .......................C=0.1, gamma=0.1, kernel=rbf; total time=   0.2s
[CV] END .......................C=0.1, gamma=0.1, kernel=rbf; total time=   0.2s
[CV] END .......................C=0.1, gamma=0.1, kernel=rbf; total time=   0.2s
[CV] END .......................C=0.1, gamma=0.5, kernel=rbf; total time=   0.2s
[CV] END .......................C=0.1, gamma=0.5, kernel=rbf; total time=   0.3s
[CV] END .......................C=0.1, gamma=0.5, kernel=rbf; total time=   0.2s
[CV] END .......................C=0.1, gamma=0.5, kernel=rbf; total time=   0.2s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.2s
[CV] END .........................C=0.1, gamma=
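The candidate counts reported in the two "Fitting 5 folds" lines follow directly from the grids. As a quick check, scikit-learn's `ParameterGrid` enumerates the same combinations that `GridSearchCV` evaluates:

```python
from sklearn.model_selection import ParameterGrid

# Same grids as in the cells above.
param_grid_rf = {
    "n_estimators": [100, 200, 400, 500],
    "criterion": ["gini", "entropy"],
    "bootstrap": [True],
    "max_depth": [15, 20, 25],
    "max_features": ["log2", "sqrt", 10],
    "min_samples_leaf": [2, 3],
    "min_samples_split": [2, 3],
}
param_grid_svc = [
    {"kernel": ["rbf"], "gamma": [0.1, 0.5, 1, 2, 5, 10], "C": [0.1, 1, 10, 100, 1000]},
    {"kernel": ["linear"], "C": [0.1, 1, 10, 100, 1000]},
    {"kernel": ["poly"], "degree": [2, 3, 4, 5], "C": [0.1, 1, 10, 100, 1000]},
]

n_rf = len(ParameterGrid(param_grid_rf))  # 4 * 2 * 1 * 3 * 3 * 2 * 2
n_svc = len(ParameterGrid(param_grid_svc))  # 30 (rbf) + 5 (linear) + 20 (poly)

# With cv=5, each candidate is fitted five times.
print(n_rf, n_rf * 5)  # 288 1440
print(n_svc, n_svc * 5)  # 55 275
```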

Comparing SVC and Random Forest, SVC performs best overall.  
A more thorough analysis of the collected metrics is possible, but since we are going to test the result on Kaggle, we will use the model with the best macro average.  
We could also build an ensemble of the top SVC models, but for now we will just keep the best model of each type.
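For reference, one way to build such an ensemble is scikit-learn's `VotingClassifier`. The sketch below is only an illustration under stated assumptions: it runs on synthetic data from `make_classification`, and it combines an SVC with a Random Forest using default hyperparameters rather than the top SVC candidates from the grid search.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data (a stand-in for the Titanic features).
X, y = make_classification(n_samples=300, n_features=10, random_state=29)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=29, stratify=y
)

# Soft voting averages the predicted class probabilities of each model,
# which is why the SVC needs probability=True.
ensemble = VotingClassifier(
    estimators=[
        ("svc", SVC(probability=True, random_state=29)),
        ("rf", RandomForestClassifier(random_state=29)),
    ],
    voting="soft",
)
ensemble.fit(X_tr, y_tr)

val_accuracy = ensemble.score(X_va, y_va)
print(f"Validation accuracy: {val_accuracy:.3f}")
```

In the notebook, the `estimators` list could instead hold `best_clf_svc.best_estimator_` and `best_clf_rf.best_estimator_`.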

In [None]:
joblib.dump(best_clf_svc.best_estimator_, "../models/best_svc_model.pkl")
joblib.dump(best_clf_rf.best_estimator_, "../models/best_rf_model.pkl")

['../models/best_rf_model.pkl']

In [72]:
model = joblib.load("../models/best_svc_model.pkl")

print(model.feature_names_in_)

['Age' 'SibSp' 'Parch' 'norm_fare' 'Pclass_1' 'Pclass_2' 'Pclass_3'
 'Sex_female' 'Sex_male' 'Embarked_C' 'Embarked_Q' 'Embarked_S'
 'cabin_categories_A' 'cabin_categories_B' 'cabin_categories_C'
 'cabin_categories_D' 'cabin_categories_E' 'cabin_categories_F'
 'cabin_categories_G' 'cabin_categories_T' 'cabin_categories_n'
 'cabin_multiple_0' 'cabin_multiple_1' 'cabin_multiple_2'
 'cabin_multiple_3' 'cabin_multiple_4' 'name_title_Capt' 'name_title_Col'
 'name_title_Don' 'name_title_Dona' 'name_title_Dr' 'name_title_Jonkheer'
 'name_title_Lady' 'name_title_Major' 'name_title_Master'
 'name_title_Miss' 'name_title_Mlle' 'name_title_Mme' 'name_title_Mr'
 'name_title_Mrs' 'name_title_Ms' 'name_title_Rev' 'name_title_Sir'
 'name_title_the Countess']
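Because the model expects exactly these columns in this order, any input sent to the API has to be encoded and aligned before calling `predict`. A minimal sketch of that alignment with `DataFrame.reindex` follows; the `expected_columns` list is a shortened, hypothetical subset of the names printed above, and in practice it would come from `model.feature_names_in_`.

```python
import pandas as pd

# Hypothetical subset of the training columns; in practice use
# list(model.feature_names_in_).
expected_columns = [
    "Age",
    "norm_fare",
    "Sex_female",
    "Sex_male",
    "Embarked_C",
    "Embarked_Q",
    "Embarked_S",
]

# A single incoming passenger (illustrative values).
passenger = pd.DataFrame(
    {"Age": [29.0], "norm_fare": [2.3], "Sex": ["female"], "Embarked": ["S"]}
)

# One-hot encode, then add the missing dummy columns as 0 and fix the order.
encoded = pd.get_dummies(passenger, columns=["Sex", "Embarked"], dtype=int)
aligned = encoded.reindex(columns=expected_columns, fill_value=0)

print(aligned.iloc[0].to_dict())
```

The numeric columns would still need the same fitted scaler that was applied during training before the row is passed to the model.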
