## Titanic Dataset

My goal is to build a new version[^1] of a predictive model that predicts which passengers survived the Titanic shipwreck, and to try some "advanced" (for me :smile:) techniques such as cross-validation, grid search, and an ensemble algorithm like the Random Forest classifier. Let's see what happens :smile:

[^1]: See https://www.kaggle.com/code/francescopaolol/logisticregression-on-complete-titanic-dataset

# Import data set

In [None]:
import pandas as pd 

# We'll use a dataset taken from: https://www.kaggle.com/competitions/titanic
dfTrain = pd.read_csv("./data/train.csv", sep=',')
dfTest = pd.read_csv("./data/test.csv", sep=",")

# Basic EDA and cleaning data

In [None]:
from basic_exploration import *
basicEDA(dfTrain, "Titanic Train")

In [None]:
basicEDA(dfTest, "Titanic Test")
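The `basic_exploration` module is not shown here, so what `basicEDA` prints is unknown. As a rough stand-in, the sketch below summarises a DataFrame with plain pandas; `quick_eda` and the toy frame are hypothetical names and data used only for illustration.

```python
import pandas as pd

def quick_eda(df: pd.DataFrame, title: str) -> pd.DataFrame:
    """Print the shape and return dtype, missing count and missing % per column."""
    print(f"--- {title}: {df.shape[0]} rows, {df.shape[1]} columns ---")
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isnull().sum(),
        "missing_pct": (df.isnull().mean() * 100).round(3),
    })

# Toy data, not taken from the Titanic files.
toy = pd.DataFrame({
    "Age": [22.0, None, 35.0, None],
    "Sex": ["male", "female", "female", "male"],
})
summary = quick_eda(toy, "toy example")
print(summary)
```

The `missing_pct` column is the kind of figure that shows how much of a feature such as "Age" needs to be imputed.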

Some considerations: first of all, "Survived" is our target:
- Survived: 0 = no, 1 = yes

As for the other features, we have:
- PassengerId: unique passenger identifier
- Pclass: 1 = 1st, 2 = 2nd, 3 = 3rd
- Name: self-explanatory
- Sex: self-explanatory
- Age: self-explanatory
- SibSp: number of siblings / spouses aboard
- Parch: number of parents / children aboard
- Ticket: self-explanatory
- Fare: passenger fare
- Cabin: self-explanatory
- Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

I think I can do something in order to slim down this dataset.

We can see that the two datasets are similar.

## Feature engineering

I think the 'Name', 'Embarked', 'Cabin' and 'Ticket' features can be dropped, because I don't believe that being called "Nicholas" or "Augusta" increases the chances of surviving. The same reasoning applies to the other features.
So, let's start by dropping the useless features.

In [None]:
delColumn(dfTrain, "Name")
delColumn(dfTrain, "Ticket")
delColumn(dfTrain, "Cabin")
delColumn(dfTrain, "Embarked")

As for "Fare", let's check whether the fare is related to better chances of survival.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

if "Fare" in dfTrain.columns:
    dfTmp = dfTrain[["Fare", "Survived"]]
    dfSurvivedYes = dfTmp[dfTmp.Survived == 1]
    dfSurvivedNo = dfTmp[dfTmp.Survived == 0]

    # Fare distribution of the passengers who survived
    plt.figure(figsize = (15,8))
    plt.title("Fare of passengers who survived")
    sns.histplot(data = dfSurvivedYes[dfSurvivedYes.Fare > 0],
                 x = 'Fare',
                 color = 'navy'
                )

    # Fare distribution of the passengers who did not survive
    plt.figure(figsize = (15,8))
    plt.title("Fare of passengers who did not survive")
    sns.histplot(data = dfSurvivedNo[dfSurvivedNo.Fare > 0],
                 x = 'Fare',
                 color = 'navy'
                )

    plt.show()

So a passenger could die or survive regardless of the fare paid: we can drop this feature too.
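Histograms can be hard to compare by eye, so a numeric summary per outcome may help back up (or challenge) this reading. The sketch below is only an illustration on made-up values, not on the Titanic data; on the real frame the same `groupby` would be applied to `dfTrain` before "Fare" is dropped.

```python
import pandas as pd

# Made-up fares, purely to show the shape of the summary.
toy = pd.DataFrame({
    "Fare": [10.0, 80.0, 12.0, 50.0, 9.0, 11.0],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# One row per outcome: how many passengers, and their median / mean fare.
fare_by_outcome = toy.groupby("Survived")["Fare"].agg(["count", "median", "mean"])
print(fare_by_outcome)
```

If the medians of the two groups were close on the real data, that would support dropping the feature; a large gap would suggest keeping it.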

In [None]:
delColumn(dfTrain, "Fare")

At this point we have few but good features. What remains is to handle the missing 'Age' records (19.865 % of the rows).
So, let's look for correlations with the "Age" feature.

In [None]:
# "Sex" is still a string column at this point, so restrict to numeric columns
dfTrain.corr(numeric_only=True)

Three features correlate with "Age": Pclass (PCC: -0.369226), SibSp (PCC: -0.308247), and Parch (PCC: -0.189119).
I'm going to use "IterativeImputer" (a multivariate imputer that estimates each feature from all the others) with RandomForestRegressor...

In [None]:
from sklearn.experimental import enable_iterative_imputer  
from sklearn.impute import IterativeImputer

from sklearn.ensemble import RandomForestRegressor
import pandas as pd

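# Impute "Age" from the features it correlates with: with "Age" alone the
# imputer has no predictors and falls back to its initial (mean) fill.
dftmp = dfTrain.loc[:, ["Age", "Pclass", "SibSp", "Parch"]]

imp = IterativeImputer(RandomForestRegressor(),
                       max_iter=10,
                       tol=0.001,
                       random_state=0,
                       sample_posterior=False,
                       verbose=True)
# Keep only the imputed "Age" column so it can be joined back later.
dftmp = pd.DataFrame(imp.fit_transform(dftmp),
                     columns=dftmp.columns,
                     index=dfTrain.index)[["Age"]]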

...check that all data are properly filled...

In [None]:
print("\nNumber of rows where 'Age' is null or empty")
print(dftmp.isnull().sum())

...and finally refill the missing Age values.

In [None]:
delColumn(dfTrain, "Age")
dfTrain = dfTrain.join(dftmp)
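The imputation steps above can be sketched on synthetic data. The example below is a minimal illustration, assuming an "Age" column that depends on a "Pclass" column; the data, the 20 % missing rate and the smaller forest are assumptions chosen to keep the run short. It shows that the imputer fills the gaps while leaving observed values untouched.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required before the next import)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 200
pclass = rng.integers(1, 4, size=n)
age = 50 - 8 * pclass + rng.normal(0, 3, size=n)  # synthetic relation, not real data
toy = pd.DataFrame({"Age": age, "Pclass": pclass})

# Knock out roughly 20 % of the ages.
missing = rng.random(n) < 0.2
toy.loc[missing, "Age"] = np.nan

imputer = IterativeImputer(RandomForestRegressor(n_estimators=20, random_state=0),
                           max_iter=5,
                           random_state=0)
filled = pd.DataFrame(imputer.fit_transform(toy),
                      columns=toy.columns,
                      index=toy.index)

print("missing before:", int(toy["Age"].isnull().sum()))
print("missing after: ", int(filled["Age"].isnull().sum()))
```

The imputer needs at least one other column to learn from; given a single column it can only apply its initial fill strategy.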

What remains is to encode the "Sex" feature.

In [None]:
from sklearn.preprocessing import LabelEncoder

labelencoder_X = LabelEncoder()

dfTrain["Sex"] = labelencoder_X.fit_transform(dfTrain["Sex"])
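To see what the encoder actually does, here is a small sketch on toy values (the series below are made up). `LabelEncoder` sorts the classes, so "female" becomes 0 and "male" becomes 1; calling `transform` on new data reuses the mapping learned by `fit_transform`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train_sex = pd.Series(["male", "female", "female", "male"])
test_sex = pd.Series(["female", "male"])

encoder = LabelEncoder()
train_encoded = encoder.fit_transform(train_sex)  # learns the classes, then encodes
test_encoded = encoder.transform(test_sex)        # reuses the learned mapping

# Readable view of the learned mapping.
mapping = {str(label): int(code)
           for label, code in zip(encoder.classes_, encoder.transform(encoder.classes_))}
print(mapping)
```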

So this is our starting dataset.

In [None]:
dfTrain.head()

Now apply the same steps to the test dataset.

In [None]:
delColumn(dfTest, "Name")
delColumn(dfTest, "Ticket")
delColumn(dfTest, "Cabin")
delColumn(dfTest, "Embarked")
delColumn(dfTest, "Fare")

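# Give the imputer the features correlated with "Age" as predictors.
dftmp = dfTest.loc[:, ["Age", "Pclass", "SibSp", "Parch"]]

imp = IterativeImputer(RandomForestRegressor(),
                       max_iter=10,
                       tol=0.001,
                       random_state=0,
                       sample_posterior=False,
                       verbose=True)
# Keep only the imputed "Age" column for the join below.
dftmp = pd.DataFrame(imp.fit_transform(dftmp),
                     columns=dftmp.columns,
                     index=dfTest.index)[["Age"]]
dfTest.drop("Age", axis=1, inplace=True)
dfTest = dfTest.join(dftmp)

# Reuse the encoder fitted on the train set so both sets share one mapping.
dfTest["Sex"] = labelencoder_X.transform(dfTest["Sex"])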

dfTest.head()

# Train and test the model

Once the data are prepared, we can split them into train and test sets, as usual.

In [None]:
from sklearn.model_selection import train_test_split

X = dfTrain.drop("Survived", axis=1)
y = dfTrain["Survived"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
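The split above is random on every run. A possible variation, sketched here on toy arrays, fixes `random_state` for reproducibility and passes `stratify` so both parts keep the same class balance; whether that is wanted for this notebook is a judgment call, and the arrays below are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 50 toy samples: 30 of class 0 and 20 of class 1.
X_toy = np.arange(100).reshape(50, 2)
y_toy = np.array([0] * 30 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.4,
    random_state=1,   # same split on every run
    stratify=y_toy,   # keep the 60/40 class balance in both parts
)
print(len(X_tr), len(X_te), y_tr.mean(), y_te.mean())
```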

Next, we prepare the grid search with a RandomForestClassifier model.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

rndForestParams = {
    "criterion" : ["gini", "entropy"],                  # One of {"gini", "entropy", "log_loss"}
    "min_samples_leaf" : [1, 5, 10],                    # The minimum number of samples required to be at a leaf node.
    "min_samples_split" : [4, 10, 14, 16],              # The minimum number of samples required to split an internal node.
    "n_estimators": [150, 300, 700, 1000]               # The number of trees in the forest.
}

rfModel = RandomForestClassifier(
    max_features = "sqrt",                              # The number of features to consider when looking for the best split.
    oob_score = True,                                   # Whether to use out-of-bag samples to estimate the generalization score.
                                                        # Only available if 'bootstrap = True' (that's the default value!)
    random_state = 1,                                   # Controls the randomness of the bootstrapping of the samples used when building trees
                                                        # and of the sampling of the features considered at each split.
    n_jobs = -1                                         # '-1' means using all processors.
)

cv_method = KFold(n_splits = 10, shuffle = True)


gs = GridSearchCV(
    estimator = rfModel, 
    param_grid = rndForestParams, 
    scoring='accuracy', 
    cv = cv_method, 
    n_jobs=-1
)
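To get a feel for the cost of this search, the number of model fits can be counted up front. The sketch below repeats the grid defined above and uses `ParameterGrid` to enumerate the combinations; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

params = {
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [1, 5, 10],
    "min_samples_split": [4, 10, 14, 16],
    "n_estimators": [150, 300, 700, 1000],
}

n_candidates = len(ParameterGrid(params))  # 2 * 3 * 4 * 4 combinations
n_folds = 10
n_fits = n_candidates * n_folds            # one fit per combination per fold
print(n_candidates, n_fits)                # → 96 960
```

With the default `refit=True`, the search adds one final fit of the best combination on the whole training split.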

Now we can fit the gridsearch object (it will take a while...).

In [None]:
gs.fit(X_train, y_train)

We can see the parameter setting that gave the best results on the hold out data...

In [None]:
gs.best_params_

...and set up a model with the estimator that was chosen by the search.

In [None]:
RFC_Model = gs.best_estimator_

We can also show the mean cross-validated score of the best parameter combination, that is, its average over all CV folds.

In [None]:
gs.best_score_                                  #Mean cross-validated score of the best_estimator
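The best score is only one row of the full search results. As a sketch on synthetic data (the dataset, the tiny grid and the small forest are assumptions chosen to keep the run short), `cv_results_` can be turned into a table to compare all candidates:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

X_toy, y_toy = make_classification(n_samples=120, n_features=5, random_state=0)

search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=10, random_state=1),
    param_grid={"min_samples_leaf": [1, 5]},
    scoring="accuracy",
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_toy, y_toy)

# One row per parameter combination, best first.
results = pd.DataFrame(search.cv_results_)[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
```

The `std_test_score` column is useful here: if two candidates differ by less than the fold-to-fold spread, the choice between them is mostly noise.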

Let's predict on train data.

In [None]:
RFC_Model.predict(X_train)

And compute the accuracy on the held-out test data.

In [None]:
RFC_Model.score(X_test, y_test)                 #Return the mean accuracy on the given test data and labels
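A single test accuracy is easier to interpret next to the training accuracy and the out-of-bag estimate, which is available because the forest was created with `oob_score = True`. The sketch below shows the comparison on synthetic data; the dataset and forest size are assumptions, and on the notebook's model the same attributes would be read from `RFC_Model`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.4, random_state=1)

model = RandomForestClassifier(n_estimators=50, oob_score=True, random_state=1)
model.fit(X_tr, y_tr)

train_acc = model.score(X_tr, y_tr)  # accuracy on data the model has seen
test_acc = model.score(X_te, y_te)   # accuracy on the held-out split
oob_acc = model.oob_score_           # estimate from samples left out of each bootstrap

print(f"train={train_acc:.3f}  test={test_acc:.3f}  oob={oob_acc:.3f}")
```

A training accuracy far above the test and out-of-bag values would point to overfitting.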

## Model Performance Analysis

The learning curve suggests that we have enough data to try to build a model.

In [None]:
from yellowbrick.model_selection import learning_curve

learning_curve(RFC_Model, X_test, y_test, scoring='accuracy')
plt.show()
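For readers without yellowbrick, the numbers behind such a plot can be computed with scikit-learn directly. This is a sketch on synthetic data, not the notebook's result; the import is aliased to `sk_learning_curve` (an arbitrary name) so it does not shadow the yellowbrick function used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve as sk_learning_curve

X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

sizes, train_scores, val_scores = sk_learning_curve(
    RandomForestClassifier(n_estimators=20, random_state=1),
    X_toy, y_toy,
    train_sizes=np.linspace(0.2, 1.0, 4),  # four training-set fractions
    cv=3,
    scoring="accuracy",
)

# Rows are training sizes, columns are CV folds: average over the folds.
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:3d}  train={tr:.3f}  validation={va:.3f}")
```

A validation score that has flattened out at the largest sizes is the usual sign that more data of the same kind would add little.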

In [None]:
from sklearn.metrics import classification_report

dfReport = pd.DataFrame(classification_report(y_test, RFC_Model.predict(X_test), output_dict=True))
dfReport

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

predictions = RFC_Model.predict(X_test)
cm = confusion_matrix(y_test, predictions, labels = RFC_Model.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix = cm,
                              display_labels = RFC_Model.classes_)
disp.plot()
plt.grid(False)  # hide the grid (visible=None would toggle it instead)
plt.show()
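The figures in the classification report can be derived from the confusion matrix by hand. The sketch below does so on made-up labels (not the notebook's predictions), treating class 1 ("survived") as the positive class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# For binary labels [0, 1] the matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

precision = tp / (tp + fp)            # of the predicted survivors, how many survived
recall = tp / (tp + fn)               # of the actual survivors, how many were found
accuracy = (tp + tn) / len(y_true)
print(tn, fp, fn, tp)                 # → 4 1 1 4
print(precision, recall, accuracy)    # → 0.8 0.8 0.8
```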

# Submission

In [None]:
predictions = RFC_Model.predict(dfTest)
predictions

In [None]:
PassengerId = dfTest['PassengerId']
submission = pd.DataFrame({"PassengerId": PassengerId,"Survived": predictions})
submission.to_csv('submission.csv', index=False)
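Before uploading, a quick sanity check of the file can catch simple mistakes. The helper below is a hypothetical sketch (`check_submission` is not part of any library, and the toy IDs are made up); it assumes the two-column layout written above.

```python
import pandas as pd

def check_submission(submission: pd.DataFrame, expected_ids: pd.Series) -> None:
    """Raise ValueError if the frame does not look like a valid submission."""
    if list(submission.columns) != ["PassengerId", "Survived"]:
        raise ValueError("columns must be exactly PassengerId, Survived")
    if len(submission) != len(expected_ids):
        raise ValueError("one row per test passenger is expected")
    if not submission["PassengerId"].is_unique:
        raise ValueError("PassengerId values must be unique")
    if not set(submission["Survived"].unique()) <= {0, 1}:
        raise ValueError("Survived must contain only 0 and 1")

# Toy example with made-up IDs.
toy_ids = pd.Series([1, 2, 3], name="PassengerId")
toy_submission = pd.DataFrame({"PassengerId": toy_ids, "Survived": [0, 1, 0]})
check_submission(toy_submission, toy_ids)
print("submission looks fine")
```

In the notebook, the call would be `check_submission(submission, dfTest["PassengerId"])`.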