## Problem

Build a small Python package to perform an ML prediction on the Titanic dataset.

## Solution

This is a binary classification task (survived / died).
There are  observations, 9  categorical variables (age and the number of family members on board have low cardinality, so they can be treated as categorical).

The goal here is not to show good ML skills, since the problem can be solved efficiently with barely a couple of lines of code using, for example, the sklearn library, but rather to show good coding skills.

Variable description is available in [page 2 of this pdf](https://campus.lakeforest.edu/frank/FILES/MLFfiles/Bio150/Titanic/TitanicMETA.pdf)

As mentioned in the README.md, here are the steps to be undertaken:
- Build a basic preprocessor class for this dataset, extending sklearn's preprocessor (deal with NaN, reformat dtypes, use one-hot encoding as an option where it makes sense)
- Build a feature_importance module (using correlation to remove low-importance features and too highly correlated features) [using this](https://www.kaggle.com/code/chrisbss1/cramer-s-v-correlation-matrix/notebook)
- Build a basic model class for prediction, extending sklearn's base model and wrapping a basic sklearn model
- Build a proper scoring class
- Build a train/validation/test pipeline from the original dataset, using the 2 previous classes.
- Build a test folder with unit tests for everything
- Build a CLI tool to execute it all
- Set up formatters (black, isort, flake8)
- Report on MLOps automation

Most of the corresponding classes have been implemented under `code/`.
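To make the first step more concrete, here is a minimal sketch of a preprocessor that plugs into sklearn's API. It is not the code under `code/`: the class name `SimplePreprocessor`, its parameters, and the imputation choices (median for numeric columns, most frequent value for categorical ones) are hypothetical, and it assumes that every non-categorical column is numeric.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class SimplePreprocessor(BaseEstimator, TransformerMixin):
    """Hypothetical preprocessor: impute NaN, then optionally one-hot encode."""

    def __init__(self, categorical_columns=None, one_hot=True):
        self.categorical_columns = categorical_columns
        self.one_hot = one_hot

    def _prepare(self, X):
        cats = list(self.categorical_columns or [])
        X = X.copy()
        for col in self.numeric_columns_:
            X[col] = X[col].fillna(self.medians_[col])
        for col in cats:
            values = X[col].astype(object)
            # Replace missing entries by the most frequent training value.
            X[col] = values.where(values.notna(), self.modes_[col]).astype(str)
        if self.one_hot and cats:
            X = pd.get_dummies(X, columns=cats, dtype=float)
        return X

    def fit(self, X, y=None):
        cats = list(self.categorical_columns or [])
        self.numeric_columns_ = [c for c in X.columns if c not in cats]
        # Imputation values are learned on the training data only.
        self.medians_ = X[self.numeric_columns_].median()
        self.modes_ = {c: X[c].mode().iloc[0] for c in cats}
        # Remember the output columns so that transform() is stable.
        self.feature_names_ = list(self._prepare(X).columns)
        return self

    def transform(self, X):
        prepared = self._prepare(X)
        # Categories unseen at fit time are dropped, missing ones are set to 0.
        return prepared.reindex(columns=self.feature_names_, fill_value=0.0)


train_frame = pd.DataFrame(
    {"age": [22.0, None, 40.0], "sex": ["male", "female", None]}
)
prep = SimplePreprocessor(categorical_columns=["sex"]).fit(train_frame)
new_frame = pd.DataFrame({"age": [float("nan")], "sex": ["female"]})
out = prep.transform(new_frame)
```

Storing the output columns at fit time keeps the feature layout identical between training and prediction, even when a category is absent from the data being transformed.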

### Downloading the data

We download the data and split it into 70% train/validation and 30% test.

In [1]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

X, y = fetch_openml(name="titanic", version=1, as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

### Data Exploration

In a genuine ML project we should examine the variables in order to:
1. find NaN rates and perform imputation or drop features when appropriate.
2. study correlations, remove redundant features, and limit the number of features if necessary to avoid the curse of dimensionality. For categorical variables, correlation isn't defined, but similar measures (Cramér's V or Theil's U) do exist.
3. properly encode categorical features (using target encoding, one-hot encoding, label encoding, grouping, etc.).
4. check for anomalies and outliers and remove them.

Here we will not go through the whole process, but rather build a preprocessor and an estimator to tackle steps 1 and 3.
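For reference, the NaN rates of step 1 and the Cramér's V measure of step 2 could be computed along these lines. This is only a sketch on a tiny made-up frame: the helper names `nan_rates` and `cramers_v` and the `demo` data are illustrative assumptions, and no bias correction is applied to Cramér's V.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def nan_rates(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column, highest first."""
    return df.isna().mean().sort_values(ascending=False)


def cramers_v(a: pd.Series, b: pd.Series) -> float:
    """Cramér's V between two categorical series (0: independent, 1: fully associated)."""
    table = pd.crosstab(a, b).to_numpy()
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    min_dim = min(table.shape) - 1
    if min_dim == 0:
        # A constant variable carries no association information.
        return 0.0
    return float(np.sqrt(chi2 / (n * min_dim)))


# Made-up data, for illustration only.
demo = pd.DataFrame(
    {
        "sex": ["m", "m", "f", "f"],
        "title": ["Mr", "Mr", "Mrs", "Mrs"],
        "deck": ["A", "B", "A", "B"],
        "age": [22.0, np.nan, 30.0, 41.0],
    }
)
rates = nan_rates(demo)
v_redundant = cramers_v(demo["sex"], demo["title"])
v_independent = cramers_v(demo["sex"], demo["deck"])
```

A pair of features with a V close to 1 (here `sex` and `title`) is redundant, so one of the two can usually be dropped.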

In [2]:
X.head(20)

Unnamed: 0,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
5,1,"Anderson, Mr. Harry",male,48.0,0,0,19952,26.55,E12,S,3,,"New York, NY"
6,1,"Andrews, Miss. Kornelia Theodosia",female,63.0,1,0,13502,77.9583,D7,S,10,,"Hudson, NY"
7,1,"Andrews, Mr. Thomas Jr",male,39.0,0,0,112050,0.0,A36,S,,,"Belfast, NI"
8,1,"Appleton, Mrs. Edward Dale (Charlotte Lamson)",female,53.0,2,0,11769,51.4792,C101,S,D,,"Bayside, Queens, NY"
9,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C,,22.0,"Montevideo, Uruguay"


In [3]:
from code.estimators.baseline import BaselineEstimator
from code.preprocessors.baseline import AgeTransformer, CustomPreprocessor

from sklearn.pipeline import Pipeline

model = Pipeline(
    [
        ("ageTransformer", AgeTransformer()),
        (
            "preprocessor",
            CustomPreprocessor(
                categorical_columns=["pclass", "sex", "sibsp", "parch", "embarked"],
                ignored_columns=[
                    "name",
                    "ticket",
                    "cabin",
                    "boat",
                    "body",
                    "home.dest",
                ],
            ),
        ),
        ("estimator", BaselineEstimator()),
    ]
)

We made use of sklearn's pipelining API here to simplify the fit/transform/predict process.
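For readers without access to the custom classes, a roughly comparable pipeline can be assembled from stock sklearn components. This is a sketch under assumptions, not a description of what `CustomPreprocessor` or `BaselineEstimator` do internally: the imputation strategies, the `GradientBoostingClassifier` stand-in, and the `toy` data are all illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical = ["pclass", "sex"]
numeric = ["age", "fare"]

# Categorical columns: impute the most frequent value, then one-hot encode.
categorical_steps = Pipeline(
    [
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

sketch_pipeline = Pipeline(
    [
        (
            "preprocessor",
            ColumnTransformer(
                [
                    ("num", SimpleImputer(strategy="median"), numeric),
                    ("cat", categorical_steps, categorical),
                ],
                remainder="drop",  # columns not listed above are ignored
            ),
        ),
        ("estimator", GradientBoostingClassifier(random_state=0)),
    ]
)

# Made-up rows, only to show that fit and predict_proba chain end to end.
toy = pd.DataFrame(
    {
        "pclass": [1, 3, 3, 2, 1, 3, 2, 3],
        "sex": ["female", "male", "male", "female",
                "male", "male", "female", "female"],
        "age": [29.0, np.nan, 22.0, 35.0, 54.0, 2.0, np.nan, 27.0],
        "fare": [211.3, 7.9, 7.3, 26.0, 51.9, 21.1, 13.0, 11.1],
    }
)
toy_target = np.array([1, 0, 0, 1, 0, 0, 1, 1])

sketch_pipeline.fit(toy, toy_target)
toy_proba = sketch_pipeline.predict_proba(toy)
```

Because every step lives inside one `Pipeline` object, the imputation and encoding learned at fit time are reapplied automatically at prediction time.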

### Train the model

Here we will use the AUC metric.
This is a simple supervised binary classification task. The main pitfall in this case would be to pick accuracy as the metric, which can be misleading when classes are imbalanced.

AUC is a metric between 0 and 1 that measures, roughly, the "separability" of the 2 classes by summarizing the trade-off between true positives and false positives across all decision thresholds. It is well suited to binary classification and is far less sensitive to class imbalance than accuracy.
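The difference between the two metrics can be illustrated on a made-up, heavily imbalanced example (95 negatives, 5 positives; these numbers are illustrative and not taken from the Titanic data). A model that gives every sample the same score looks excellent in terms of accuracy, while its AUC reveals that it separates nothing.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Imbalanced ground truth: 95 negatives and 5 positives.
y_true = np.array([0] * 95 + [1] * 5)

# A useless model that gives every sample the same score.
constant_scores = np.zeros(100)
constant_labels = (constant_scores > 0.5).astype(int)  # always predicts 0

accuracy = accuracy_score(y_true, constant_labels)  # 0.95
auc = roc_auc_score(y_true, constant_scores)        # 0.5, i.e. chance level

print(f"accuracy={accuracy:.2f}, auc={auc:.2f}")
```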

Here we not only train the model, but also report its performance as a 5-fold cross-validated AUC.
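The internals of `train` are not shown in this notebook. One plausible way to obtain such a score with stock sklearn is `cross_val_score` with `scoring="roc_auc"`, as sketched below on synthetic data. The `X_demo`/`y_demo` names and the `GradientBoostingClassifier` stand-in are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data standing in for the training set.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, random_state=0
)

# One AUC per fold: the model is refit 5 times, each time scored on the
# fold that was held out.
fold_scores = cross_val_score(
    GradientBoostingClassifier(random_state=0),
    X_demo,
    y_demo,
    cv=5,
    scoring="roc_auc",
)
mean_auc = fold_scores.mean()
print(f"5-fold cross-validated AUC: {mean_auc:.3f}")
```

Reporting the spread of `fold_scores` alongside the mean gives a rough idea of how stable the estimate is.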

In [4]:
from code.training.base_trainer import train

MODEL_PATH = "saved_models/baselinemodel.pkl"

cross_validated_training_score = train(
    model=model, X=X_train, y=y_train, save_path=MODEL_PATH
)
print(f"cross-validated auc score at training time: {cross_validated_training_score}")

cross-validated auc score at training time: 0.8398421644690302


### Evaluate the model

A proper evaluation procedure should be defined.
Here we simply measure the AUC on the test set, but anything can be done at this stage, for example comparing 2 models.
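As an illustration of the model comparison mentioned above, the sketch below scores two candidate estimators on the same held-out split. It uses synthetic data and two stock sklearn models chosen for the example; it does not reproduce the behaviour of `evaluate`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "gboost": GradientBoostingClassifier(random_state=0),
}

test_auc = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    # AUC needs scores, so use the probability of the positive class.
    positive_proba = estimator.predict_proba(X_te)[:, 1]
    test_auc[name] = roc_auc_score(y_te, positive_proba)

best_name = max(test_auc, key=test_auc.get)
print(test_auc, "best:", best_name)
```

If the test set is used to choose between models in this way, a further untouched split is needed to report an unbiased final score.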

In [5]:
from code.evaluation.base_evaluator import evaluate
from code.prediction.base_predictor import load_model_from_path

fitted_model = load_model_from_path(MODEL_PATH)

evaluation_score = evaluate(fitted_model, X_test, y_test)


print(f"auc score at testing time: {evaluation_score}")

auc score at testing time: 0.7727044590025359


### Model Performance

There seems to be some overfitting here, as the model's AUC drops from 0.84 in cross-validation to 0.77 on the test set.
As a rule of thumb, an AUC above 0.85 is considered good (although this also depends on the dataset's quality and statistical distribution: is the variable to be predicted really a function of the features?).

Proper hyperparameter tuning (say, with a grid search) could help reduce this gap. The chosen estimator, a boosting technique, is normally quite robust to overfitting, so this should suffice.
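A grid search of this kind could be sketched as follows with sklearn's `GridSearchCV`. The parameter grid, the synthetic data, and the `GradientBoostingClassifier` stand-in are assumptions for the example; whether `BaselineEstimator` exposes the same hyperparameters is not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, random_state=0
)

# Smaller trees and lower learning rates tend to regularize boosting.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=5,
)
search.fit(X_demo, y_demo)

print("best params:", search.best_params_)
print(f"best cross-validated AUC: {search.best_score_:.3f}")
```

When the estimator sits inside a `Pipeline`, sklearn expects the grid keys to be prefixed with the step name, for instance `estimator__max_depth` for a step called `estimator`.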

## Feature Importance

One of the reasons for choosing this model (a tree-based technique) is that it allows for interpretability.
To study feature importance, there are basically 2 techniques:
- studying correlations with the target variable (which we skipped here), with tools like "percentage of variance explained".
- using the model's feature importances, which are a function of how much each feature comes into play at the nodes of the model's trees.
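The second technique can be tried in isolation on synthetic data where the answer is known in advance. In the sketch below only the `signal` column drives the target. The data, the column names, and the use of sklearn's `GradientBoostingClassifier` (whose `feature_importances_` attribute is impurity-based and normalized to sum to 1) are assumptions for the example, and may differ from how `BaselineEstimator` computes its importances.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    {
        "signal": rng.normal(size=500),
        "noise_a": rng.normal(size=500),
        "noise_b": rng.normal(size=500),
    }
)
# The target depends on the "signal" column only.
y_demo = (X_demo["signal"] > 0).astype(int)

booster = GradientBoostingClassifier(random_state=0).fit(X_demo, y_demo)

importances = pd.Series(
    booster.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

Importances of this kind are relative: they rank the features against each other but say nothing about the direction of the effect.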

Let us check it out

In [6]:
import pandas as pd

pd.Series(
    model.named_steps["estimator"].feature_importances,
    index=model.named_steps["preprocessor"].get_feature_names_out(),
).sort_values(ascending=False)

sex_male        0.463726
fare            0.187127
pclass_3        0.160435
age             0.081159
parch_1         0.023678
pclass_1        0.022728
embarked_C      0.016377
parch_0         0.008222
embarked_Q      0.007687
sibsp_3         0.007558
embarked_S      0.007484
pclass_2        0.005102
sibsp_0         0.001968
sibsp_2         0.001781
parch_2         0.001311
parch_4         0.001027
sibsp_1         0.000951
sibsp_4         0.000913
sibsp_8         0.000595
sibsp_5         0.000170
parch_6         0.000000
parch_9         0.000000
parch_5         0.000000
parch_3         0.000000
embarked_nan    0.000000
dtype: float64

Unsurprisingly, the 3 main factors for survival appear to be:
- sex (because some passengers were given priority?)
- the fare, and therefore the wealth of the individual (possibly because wealthier passengers were located closer to the lifeboats and could therefore board them more easily), which is directly correlated with the passenger class
- age (because some passengers were given priority?)

For a more detailed exploratory analysis of categorical variables, refer to this [other GitHub project](https://github.com/BenjaminLAZARD/Interviews/tree/main/equativ).