## In this notebook, we are building a machine learning model to predict the survival of passengers on the Titanic.

## 1. Importing libraries

First, we import the libraries needed to load and prepare the data, to encode it, and to train the models.

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from pycaret.classification import setup, compare_models, evaluate_model, save_model
import pickle

## 2. Loading the data

Next, we load and display the raw data file to check what kind of data we are dealing with.

In [2]:
# Load the data
data = pd.read_csv(".\\Datasets\\Titanic-Dataset.csv")
data.columns = data.columns.str.lower()
data

| | passengerid | survived | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | | 1 | 2 | W./C. 6607 | 23.4500 | | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |


## 3. Data augmentation

**3.1 Dropping `passengerid`, `name` and `ticket`**

For this model we decided to drop the columns `passengerid`, `name` and `ticket`.

These columns were dropped because they did not contain a feature which impacted the survival chance of the passengers.

To make sure this decision gave us the highest accuracy, we made it only after carefully dropping different columns and comparing the accuracy of the resulting models.

In [3]:
# Drop unnecessary columns
data = data.drop(columns=['passengerid', 'name', 'ticket'])
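
The comparison described above (dropping different columns and comparing accuracy) can be sketched roughly as follows. This is only an illustration under assumptions: the `toy` frame, its column names and the helper `mean_cv_accuracy` are hypothetical, the data is synthetic rather than the Titanic file, and a plain scikit-learn random forest with 5-fold cross-validation stands in for whatever models were actually compared.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the Titanic table (NOT the real data):
# 'signal' drives the target, 'row_id' is a pure identifier, 'noise' is random.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "row_id": np.arange(n),
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
toy["target"] = (toy["signal"] > 0).astype(int)


def mean_cv_accuracy(frame, dropped):
    """Mean 5-fold cross-validated accuracy after dropping the given columns."""
    X = frame.drop(columns=["target", *dropped])
    y = frame["target"]
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()


# Candidate sets of columns to drop (hypothetical choices for the toy data)
candidates = [[], ["row_id"], ["row_id", "noise"]]
scores = {tuple(cols): mean_cv_accuracy(toy, cols) for cols in candidates}

for cols, score in scores.items():
    print(f"dropped={list(cols)}: accuracy={score:.3f}")
```

Comparing the printed accuracies side by side shows whether removing a column helps, hurts or makes no difference on this toy data.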

**3.2 Filling the null-values**

After dropping the unnecessary columns, we filled the null-values of the `age`, `cabin` and `fare` columns.

For the `age` column, we decided to fill the null-values with the mean age. This helps our model to predict more accurately because no rows have to be discarded, so it still has the same amount of data to train on.

For the `cabin` column, we decided to replace the null-values with the word `'Missing'`. This ensures that all values in the `cabin` column are of the same data type and that the column is complete.

For the `fare` column, we decided to fill the null-values with the median fare. The median is a measure of central tendency that is barely affected by outliers, of which, as we saw in the graphs, there are many. Filling with the median therefore helps to maintain the overall shape of the data distribution.

For the `embarked` column, we decided, after carefully evaluating our model's accuracy, not to replace the null-values, because this gave a higher model accuracy on the test dataset.



In [4]:
# Fill in missing values in the 'age' column with the mean age
data['age'] = data['age'].fillna(data['age'].mean())

# Fill in missing values in the 'cabin' column with 'Missing'
data['cabin'] = data['cabin'].fillna('Missing')

# Fill in missing values in the 'fare' column with the median fare
data['fare'] = data['fare'].fillna(data['fare'].median())
data['fare']

0       7.2500
1      71.2833
2       7.9250
3      53.1000
4       8.0500
        ...   
886    13.0000
887    30.0000
888    23.4500
889    30.0000
890     7.7500
Name: fare, Length: 891, dtype: float64
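
The reasoning about mean versus median can be checked on a small example. The sketch below uses made-up fare values (not taken from the dataset) with one extreme outlier and one missing entry. It also illustrates that `fillna` returns a new Series, so the result has to be assigned back to keep it.

```python
import pandas as pd

# Toy fares (made-up values): mostly small, one extreme outlier, one missing.
fares = pd.Series([7.25, 8.05, 7.90, 13.00, 512.33, None])

print(fares.mean())    # pulled far upwards by the single outlier
print(fares.median())  # stays close to a typical fare

# fillna returns a new Series; 'fares' itself still contains the gap
filled_mean = fares.fillna(fares.mean())
filled_median = fares.fillna(fares.median())

print(filled_mean.iloc[-1], filled_median.iloc[-1])
print(fares.isna().sum())  # the original still has one missing value
```

With the median, the filled-in value looks like a typical fare. With the mean, it is far larger than most of the observed fares.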

**3.3 Encoding the data**

Next, we encode the string values of the `sex`, `embarked` and `cabin` columns as numerical values using the `LabelEncoder` class from scikit-learn.

For the `sex` column, the male passengers are converted to the number 1 and the female passengers to the number 0.

For the `embarked` column, the `Cherbourg` (`C`) values are encoded as 0, the `Queenstown` (`Q`) values as 1 and the `Southampton` (`S`) values as 2.

For the `cabin` column, the former null-values (now `'Missing'`) are encoded as 146. All the other values get a different number assigned depending on their value.

Encoding these string values is necessary because the machine learning algorithms require numerical input. By encoding the string values in these columns, we can train the model effectively to make predictions.

In [5]:
data['sex'] = LabelEncoder().fit_transform(data['sex'])
data['embarked'] = LabelEncoder().fit_transform(data['embarked'])
data['cabin'] = LabelEncoder().fit_transform(data['cabin'])
data

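| | survived | pclass | sex | age | sibsp | parch | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 22.000000 | 1 | 0 | 7.2500 | 146 | 2 |
| 1 | 1 | 1 | 0 | 38.000000 | 1 | 0 | 71.2833 | 81 | 0 |
| 2 | 1 | 3 | 0 | 26.000000 | 0 | 0 | 7.9250 | 146 | 2 |
| 3 | 1 | 1 | 0 | 35.000000 | 1 | 0 | 53.1000 | 55 | 2 |
| 4 | 0 | 3 | 1 | 35.000000 | 0 | 0 | 8.0500 | 146 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 0 | 2 | 1 | 27.000000 | 0 | 0 | 13.0000 | 146 | 2 |
| 887 | 1 | 1 | 0 | 19.000000 | 0 | 0 | 30.0000 | 30 | 2 |
| 888 | 0 | 3 | 0 | 29.699118 | 1 | 2 | 23.4500 | 146 | 2 |
| 889 | 1 | 1 | 1 | 26.000000 | 0 | 0 | 30.0000 | 60 | 0 |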
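
To see where the numbers come from, the small sketch below encodes a few example values and prints the resulting mapping. `LabelEncoder` sorts the distinct labels and uses each label's position as its code, which is why `female` becomes 0 and `male` becomes 1. Unlike the cell above, the encoder object is kept in a variable (`encoder`, a name chosen for this example) so that its `classes_` can be inspected afterwards.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A few example values, not the full column
sex = pd.Series(["male", "female", "female", "male"])

encoder = LabelEncoder()
codes = encoder.fit_transform(sex)

# classes_ holds the sorted labels; the position of a label is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'female': 0, 'male': 1}

# inverse_transform turns the codes back into the original strings
print(encoder.inverse_transform(codes))
```

Keeping the fitted encoder around also makes it possible to decode predictions or to apply the same mapping to new data later.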


## 4. Splitting the data

Next, we split the dataset into train and test data. We decided to split the dataset into 80% training data and 20% test data.

Afterwards, we saved the test data to a .csv file so we can use it to make our predictions in the comparison file.

In [6]:
# Split data in test and train data
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)
# Export test_data to a CSV file
test_data.to_csv('.\\Datasets\\test_data.csv', index=True)
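
As a quick sanity check of the split, the sketch below applies the same `test_size` and `random_state` to a placeholder frame with 891 rows (the length reported in the `fare` output above). The values are made up. It also writes the test part to an in-memory buffer instead of a file and reads it back with `index_col=0`, which is one possible way to restore the saved row index when loading the CSV again.

```python
import io

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame with 891 rows; the values are made up.
frame = pd.DataFrame({"feature": np.arange(891), "survived": np.arange(891) % 2})

train_part, test_part = train_test_split(frame, test_size=0.2, random_state=42)
print(len(train_part), len(test_part))  # 712 179

# The same random_state reproduces exactly the same split
_, test_again = train_test_split(frame, test_size=0.2, random_state=42)

# CSV round trip in memory; index_col=0 turns the first column back into the index
buffer = io.StringIO()
test_part.to_csv(buffer, index=True)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=0)
print(reloaded.head())
```

Without `index_col=0`, the saved index would be read back as an extra column named `Unnamed: 0`.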

## 5. Defining the features

After splitting the data, we define the features in our dataset. Each item in this list names a column that will be used to predict whether a passenger survived or not.

In [7]:
titanic_features = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'cabin']

## 6. Training the PyCaret model

We trained our model on the preprocessed `train_data` dataset. We set the `survived` column as the target. This means that our model will be trained to predict whether a passenger survived the sinking of the Titanic or not.

We also pass `titanic_features` as the categorical features, which tells PyCaret to treat each of these columns as categorical when it prepares the data for the models.

We then train all the available estimators in the model library and evaluate their performance using cross-validation. This returns the model with the highest accuracy.

Lastly, we evaluate our model using the `evaluate_model()` function. This function displays various plots and metrics that help us understand how well the model performed, for example a learning curve, a confusion matrix, a validation curve and a feature selection plot.

In [8]:
# Define target and setup experiment
experiment = setup(
    data=train_data,
    target='survived',
    categorical_features=titanic_features,
)

# Compare models
best = compare_models()
print(best)

# Evaluate model
evaluate_model(best)



| | Description | Value |
|---|---|---|
| 0 | Session id | 3839 |
| 1 | Target | survived |
| 2 | Target type | Binary |
| 3 | Original data shape | (712, 9) |
| 4 | Transformed data shape | (712, 25) |
| 5 | Transformed train set shape | (498, 25) |
| 6 | Transformed test set shape | (214, 25) |
| 7 | Categorical features | 8 |
| 8 | Preprocess | True |
| 9 | Imputation type | simple |


| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| ada | Ada Boost Classifier | 0.8132 | 0.8317 | 0.7538 | 0.7491 | 0.7498 | 0.601 | 0.6027 | 0.203 |
| gbc | Gradient Boosting Classifier | 0.8112 | 0.8642 | 0.7058 | 0.7745 | 0.7362 | 0.5902 | 0.5938 | 0.163 |
| lr | Logistic Regression | 0.8071 | 0.8431 | 0.7 | 0.7653 | 0.729 | 0.5802 | 0.5834 | 1.273 |
| ridge | Ridge Classifier | 0.8071 | 0.8456 | 0.6731 | 0.7807 | 0.7207 | 0.5752 | 0.5805 | 0.069 |
| lda | Linear Discriminant Analysis | 0.8071 | 0.8455 | 0.6731 | 0.7807 | 0.7207 | 0.5752 | 0.5805 | 0.078 |
| lightgbm | Light Gradient Boosting Machine | 0.8053 | 0.8588 | 0.6898 | 0.772 | 0.7258 | 0.5758 | 0.5807 | 0.172 |
| xgboost | Extreme Gradient Boosting | 0.799 | 0.8473 | 0.7003 | 0.7539 | 0.7233 | 0.5664 | 0.5699 | 0.288 |
| dt | Decision Tree Classifier | 0.7829 | 0.7645 | 0.7161 | 0.7163 | 0.7113 | 0.5386 | 0.5427 | 0.111 |
| rf | Random Forest Classifier | 0.771 | 0.827 | 0.6895 | 0.7034 | 0.6913 | 0.5104 | 0.5154 | 0.228 |
| et | Extra Trees Classifier | 0.7589 | 0.7994 | 0.662 | 0.6903 | 0.6708 | 0.4815 | 0.4869 | 0.245 |


AdaBoostClassifier(algorithm='SAMME.R', estimator=None, learning_rate=1.0,
                   n_estimators=50, random_state=3839)


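
To make the idea behind the model comparison more concrete, the sketch below reproduces it in plain scikit-learn: several candidate classifiers are scored with cross-validation and the one with the highest mean accuracy is kept. This is not PyCaret's implementation. The synthetic data, the three candidates and the choice of 10 folds are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem (not the Titanic data)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# A small, hand-picked set of candidates (assumption for this sketch)
candidates = {
    "ada": AdaBoostClassifier(random_state=0),
    "lr": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean cross-validated accuracy per candidate
results = {
    name: cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Rank the candidates from best to worst and pick the winner
best_name = max(results, key=results.get)
for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.4f}")
print("best:", best_name)
```

The printed ranking plays the same role as the comparison table above, with accuracy as the only metric.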

## 7. Saving the PyCaret model

After carefully examining the different plots and the accuracy of our model, we save the model as `pycaret_model.pkl`.

In [9]:
# Save model
save_model(best, '.\\Models\\pycaret_model')
print("Model saved as pycaret_model.pkl!")

Transformation Pipeline and Model Successfully Saved
Model saved as pycaret_model.pkl!


## 8. Training the custom model

After training our PyCaret model, we train our custom model. As the custom model for Titanic, we decided to use a random forest.

We first initialize the model by calling the `RandomForestRegressor()` constructor. We use `n_estimators` to specify that the model will use 100 decision trees to base its predictions on. We ensure reproducibility by setting the `random_state` to 42.

Then we use the `fit()` function of the random forest model to train it, using the `titanic_features` columns of `train_data` as the features to base its predictions on and the `survived` column of `train_data` as the output of the model.

Lastly we save the random forest model as `random_forest_model.pkl` using pickle.

In [10]:
# Initialize a Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the Random Forest model
rf_model.fit(train_data[titanic_features], train_data['survived'])

# Save the Random Forest model to a file
with open('.\\Models\\random_forest_model.pkl', 'wb') as file:
    pickle.dump(rf_model, file)
print("Random Forest model saved as random_forest_model.pkl!")

Random Forest model saved as random_forest_model.pkl!
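
Because the custom model is a regressor trained on a 0/1 target, its predictions are continuous scores between 0 and 1 rather than class labels. The sketch below shows one possible way to use such a model after loading it again: it trains on synthetic data (not the Titanic file), does the pickle round trip in memory instead of through a file, and turns the scores into labels with a 0.5 cutoff. The cutoff and all variable names are assumptions for this example.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic features and a binary target (stand-in for the real data)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

# Pickle round trip in memory; with a file, pickle.dump / pickle.load work the same way
restored = pickle.loads(pickle.dumps(model))

# The regressor averages 0/1 targets, so its outputs lie between 0 and 1
scores = restored.predict(X)

# Convert the scores to class labels; the 0.5 cutoff is an assumption
labels = (scores >= 0.5).astype(int)
print(scores[:5])
print(labels[:5])
```

If class labels are needed directly, scikit-learn's `RandomForestClassifier` is an alternative that returns them from `predict()` without a separate thresholding step.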
