# **Predictive analysis**


Based on certain characteristics of the Titanic's passengers, we seek to build a classification model that predicts as accurately as possible whether a given passenger survived.

To do this, we use Scikit-Learn pipelines to preprocess the data, tune the hyperparameters with a grid search, and pick the best classification algorithm among those tested.

[Data Source](https://www.kaggle.com/c/titanic/data)

### **Importing the basic libraries**

In [163]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option("display.notebook_repr_html", False)

### **Importing dataset**

In [178]:
df = pd.read_csv("Titanic-Dataset.csv")

print("Data shape :", df.shape)

print("\nFirst five rows of the dataset :\n")

df.head()

Data shape : (891, 12)

First five rows of the dataset :



   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  

### **Data info**

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


## **Data preprocessing** 

**SibSp:**

`SibSp` counts the siblings and spouses travelling with a passenger. The underlying assumption is that a passenger whose loved ones survived had a higher chance of surviving, and vice versa. The dataset does not say which passengers are related to each other, so we do not try to reconstruct family relationships and simply keep the raw count as a numeric feature. If you want to go further, we advise you to take a look at this notebook:

https://www.kaggle.com/code/ailuropus/extracting-family-relationships-on-titanic-sibsp


**Name:**

We drop this variable to keep things simple. If you want to use it, scikit-learn's `CountVectorizer` class can convert the names into token counts, which would let you check whether parts of a name (a title, for example) are related to survival.


**Ticket:**

We also drop this variable: it would need special treatment, and the dataset provides little information about it.

**Passenger ID:**

This identifier carries no predictive information, so we drop it too.

### **Missing values**

In [4]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
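Expressed as a share of the 891 rows, these counts show why `Cabin` is the hardest column to exploit. The short sketch below only redoes the arithmetic on the counts printed above; `missing_counts` and `missing_pct` are illustrative names.

```python
import pandas as pd

# Missing-value counts copied from the output above; 891 is the row count
n_rows = 891
missing_counts = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})

# Percentage of rows with a missing value, per column
missing_pct = (missing_counts / n_rows * 100).round(2)
print(missing_pct)
```

Roughly one `Age` value in five is missing, while more than three quarters of the `Cabin` values are.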

### **Group the features into lists by type to simplify their preprocessing**

In [5]:
cat_missing_values = ['Embarked', 'Cabin']

cat_without_missing = ['Sex']

num_missing_values = ['Age']

drop_columns = ['PassengerId', 'Ticket', "Name"]

num_columns = ['Pclass', 'SibSp', 'Parch', 'Fare']

### **Importing libraries**

In [6]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

### **Pipeline to replace missing categorical values and convert categories into numbers**

In [7]:
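cat_preprocessor = make_pipeline(
    # Replace missing categories with the explicit label 'missing'
    SimpleImputer(strategy='constant', missing_values=np.nan, fill_value='missing'),
    # One-hot encode; categories unseen during fit are encoded as all zeros
    OneHotEncoder(handle_unknown='ignore', sparse_output=False)
)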

### **Pipeline for scaling numeric values**

In [8]:
num_preprocessor = make_pipeline(SimpleImputer(strategy="median"), MinMaxScaler())


### **Final data preprocessing**

In [9]:
data_preprocessing = make_column_transformer(
    
    (OneHotEncoder(sparse_output=False), cat_without_missing),
    
    (cat_preprocessor , cat_missing_values),
    
    (num_preprocessor, num_missing_values),
    
    ('drop' , drop_columns),
    
    (MinMaxScaler() , num_columns)
)

### **Split the dataset into x and y**

In [10]:
X = df.drop("Survived", axis = 1)
y = df["Survived"]

In [11]:
X.shape

(891, 11)

In [12]:
y.shape

(891,)

### **Fit data_preprocessing**

In [14]:
data_array = data_preprocessing.fit_transform(X)

In [179]:
df_encoded = pd.DataFrame(data_array,columns=data_preprocessing.get_feature_names_out())
df_encoded.head(2)

   onehotencoder__Sex_female  onehotencoder__Sex_male  pipeline-1__Embarked_C  \
0                        0.0                      1.0                     0.0   
1                        1.0                      0.0                     1.0   

   pipeline-1__Embarked_Q  pipeline-1__Embarked_S  \
0                     0.0                     1.0   
1                     0.0                     0.0   

   pipeline-1__Embarked_missing  pipeline-1__Cabin_A10  pipeline-1__Cabin_A14  \
0                           0.0                    0.0                    0.0   
1                           0.0                    0.0                    0.0   

   pipeline-1__Cabin_A16  pipeline-1__Cabin_A19  ...  pipeline-1__Cabin_F38  \
0                    0.0                    0.0  ...                    0.0   
1                    0.0                    0.0  ...                    0.0   

   pipeline-1__Cabin_F4  pipeline-1__Cabin_G6  pipeline-1__Cabin_T  \
0                   0.0                   0.

### **Split dataset with train_test_split method**

In [16]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

In [17]:
X_train.shape

(668, 11)

In [18]:
X_test.shape

(223, 11)
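The 668/223 split follows from `test_size=0.25` applied to 891 rows, with the test size rounded up. The sketch below reproduces those sizes on placeholder arrays (`X_demo` and `y_demo` are illustrative, not the Titanic data). It also passes `stratify`, an optional argument this notebook does not use, which keeps the class ratio similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the Titanic dataset
X_demo = np.arange(891).reshape(-1, 1)
y_demo = np.arange(891) % 2  # alternating 0/1 labels, for illustration only

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.25,
    random_state=42,
    stratify=y_demo,  # optional: preserve the class proportions
)

print(X_tr.shape, X_te.shape)
```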

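## **Choose the classification algorithm using scikit-learn's GridSearchCV**

We select the best-performing classifier with GridSearchCV: each candidate is tuned by an exhaustive grid search scored with 5-fold cross-validation on the training set, and the best mean scores are then compared.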

### **Random Forest Classifier**

In [19]:
from sklearn.ensemble import RandomForestClassifier as rfc

In [47]:
pipe_rfc = Pipeline(steps=[('preprocessor', data_preprocessing),
                       ('rf_classifier', rfc(random_state = 42))])

In [48]:
param_dict = {
    'rf_classifier__n_estimators' : [5, 10, 15, 20, 30, 60, 80, 100],
    'rf_classifier__max_features': ['sqrt', 'log2',None, .1, .25, .3, .35, .4],
    'rf_classifier__max_depth' : [None, 4, 7, 10, 15, 20, 25, 30, 35],
    'rf_classifier__criterion': ['gini', 'entropy', 'log_loss']
}
param_dict


{'rf_classifier__n_estimators': [5, 10, 15, 20, 30, 60, 80, 100],
 'rf_classifier__max_features': ['sqrt',
  'log2',
  None,
  0.1,
  0.25,
  0.3,
  0.35,
  0.4],
 'rf_classifier__max_depth': [None, 4, 7, 10, 15, 20, 25, 30, 35],
 'rf_classifier__criterion': ['gini', 'entropy', 'log_loss']}

### **Cross validation**

In [49]:
from sklearn.model_selection import KFold

cross_validation = KFold(n_splits=5, shuffle=True, random_state=42)

cross_validation

KFold(n_splits=5, random_state=42, shuffle=True)
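As a rough illustration of what this splitter does with the 668 training rows, the sketch below counts the validation rows in each fold. It runs on a placeholder array rather than the real features; `demo_cv` and `fold_sizes` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import KFold

n_train = 668  # number of rows in X_train
demo_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Only the number of rows matters for the split, so zeros are enough here
placeholder = np.zeros((n_train, 1))
fold_sizes = [len(val_idx) for _, val_idx in demo_cv.split(placeholder)]
print(fold_sizes)
```

Each row is used for validation exactly once, so the fold sizes add up to 668 and every model is validated on about 134 rows.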

### **GridSearch**

In [50]:
from sklearn.model_selection import GridSearchCV

In [51]:
GridSear_rfc = GridSearchCV(pipe_rfc, param_dict, cv = cross_validation)

In [52]:
GridSear_rfc

In [53]:
GridSear_rfc.fit(X_train, y_train)

In [54]:
GridSear_rfc.best_params_

{'rf_classifier__criterion': 'entropy',
 'rf_classifier__max_depth': 10,
 'rf_classifier__max_features': 0.35,
 'rf_classifier__n_estimators': 15}

In [196]:
round(GridSear_rfc.best_score_*100, 4)

84.279
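An exhaustive search fits every combination in the grid once per fold, which explains why this step is slow. The sketch below counts the candidates for the random forest grid defined above using `ParameterGrid`; `rf_grid` is simply a copy of `param_dict` so that the block runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the random forest grid defined above
rf_grid = {
    'rf_classifier__n_estimators': [5, 10, 15, 20, 30, 60, 80, 100],
    'rf_classifier__max_features': ['sqrt', 'log2', None, .1, .25, .3, .35, .4],
    'rf_classifier__max_depth': [None, 4, 7, 10, 15, 20, 25, 30, 35],
    'rf_classifier__criterion': ['gini', 'entropy', 'log_loss'],
}

n_candidates = len(ParameterGrid(rf_grid))  # 8 * 8 * 9 * 3 combinations
n_fits = n_candidates * 5                   # one fit per candidate and per fold

print(n_candidates, n_fits)
```

With 1,728 candidates and 5 folds, the search trains 8,640 pipelines before the final refit on the whole training set.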

### **K-Nearest Neighbors Classifier**

In [56]:
from sklearn.neighbors import KNeighborsClassifier

In [57]:
pipe_knn = Pipeline(steps=[('preprocessor', data_preprocessing),
                       ('KN_classifier', KNeighborsClassifier())])

In [58]:
params_knn = {'KN_classifier__n_neighbors': [1, 3, 5, 7, 9, 11, 12, 13, 14, 15, 16, 18, 20] , 
                "KN_classifier__weights": ['uniform', 'distance'], 
                "KN_classifier__algorithm": ['auto', 'ball_tree', 'kd_tree', 'brute'],
              "KN_classifier__leaf_size" : [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
                }

params_knn

{'KN_classifier__n_neighbors': [1, 3, 5, 7, 9, 11, 12, 13, 14, 15, 16, 18, 20],
 'KN_classifier__weights': ['uniform', 'distance'],
 'KN_classifier__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
 'KN_classifier__leaf_size': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}

In [59]:
GridSear_knn = GridSearchCV(pipe_knn, params_knn  , cv = cross_validation)

In [60]:
GridSear_knn.fit(X_train, y_train)

In [61]:
GridSear_knn.best_params_

{'KN_classifier__algorithm': 'auto',
 'KN_classifier__leaf_size': 10,
 'KN_classifier__n_neighbors': 18,
 'KN_classifier__weights': 'uniform'}

In [194]:
round(GridSear_knn.best_score_*100, 4)

80.6834

### **Adaboost classifier**

In [63]:
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

from sklearn.ensemble import AdaBoostClassifier

In [64]:
pipe_adb = Pipeline(steps=[('preprocessor', data_preprocessing),
                       ('adb_classifier', AdaBoostClassifier(random_state = 42))])

In [65]:
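# Note: 'SAMME.R' is deprecated in recent scikit-learn releases (since 1.4);
# keep only 'SAMME' if your version warns about it or rejects it
params_adb = {'adb_classifier__n_estimators': [100, 120, 150, 160, 180, 200, 250, 300, 350],
              "adb_classifier__learning_rate": [0.1, 0.2, 0.25, 0.3, 0.35, 0.4, 0.5, 0.7, 0.8, 0.9, 1],
              "adb_classifier__algorithm": ['SAMME', 'SAMME.R']
              }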

params_adb

{'adb_classifier__n_estimators': [100, 120, 150, 160, 180, 200, 250, 300, 350],
 'adb_classifier__learning_rate': [0.1,
  0.2,
  0.25,
  0.3,
  0.35,
  0.4,
  0.5,
  0.7,
  0.8,
  0.9,
  1],
 'adb_classifier__algorithm': ['SAMME', 'SAMME.R']}

In [66]:
GridSear_adb = GridSearchCV(pipe_adb, params_adb  , cv = cross_validation)

In [67]:
GridSear_adb.fit(X_train, y_train)

In [68]:
GridSear_adb.best_params_

{'adb_classifier__algorithm': 'SAMME',
 'adb_classifier__learning_rate': 1,
 'adb_classifier__n_estimators': 300}

In [195]:
round(GridSear_adb.best_score_*100, 4)

81.4252

### **Hist Gradient Boosting Classifier**

In [80]:
from sklearn.ensemble import HistGradientBoostingClassifier as HGBClassifier

In [81]:
pipe_hgb = Pipeline(steps=[('preprocessor', data_preprocessing),
                       ('hgb_classifier', HGBClassifier(random_state = 42))])

In [105]:
params_hgb = {'hgb_classifier__learning_rate': [0.02, 0.03, 0.4, 0.05, 0.1, 0.2, 0.25],
              "hgb_classifier__l2_regularization": [0, 4, 8, 10, 16, 20, 25],
              "hgb_classifier__max_depth": [6, 7, 8, 9, 12, 14, None],
              }

params_hgb


{'hgb_classifier__learning_rate': [0.02, 0.03, 0.4, 0.05, 0.1, 0.2, 0.25],
 'hgb_classifier__l2_regularization': [0, 4, 8, 10, 16, 20, 25],
 'hgb_classifier__max_depth': [6, 7, 8, 9, 12, 14, None]}

In [106]:
GridSear_hgb = GridSearchCV(pipe_hgb, params_hgb  , cv = cross_validation)

In [107]:
GridSear_hgb.fit(X_train, y_train)

In [108]:
GridSear_hgb.best_params_

{'hgb_classifier__l2_regularization': 10,
 'hgb_classifier__learning_rate': 0.05,
 'hgb_classifier__max_depth': 7}

In [197]:
round(GridSear_hgb.best_score_*100, 4)

83.9737

## **Selection of the best model** 


In [129]:
best_model = pd.DataFrame(
    {
        'RF Classifier': round(GridSear_rfc.best_score_ * 100, 2),
        'KNN Classifier': round(GridSear_knn.best_score_ * 100, 2),
        'ADB Classifier': round(GridSear_adb.best_score_ * 100, 2),
        'HGB Classifier': round(GridSear_hgb.best_score_ * 100, 2),
    },
    index=['GridSearch_best_score'],
)

best_model

                       RF Classifier  KNN Classifier  ADB Classifier  HGB Classifier
GridSearch_best_score          84.28           80.68           81.43           83.97



The Random Forest Classifier achieved the highest mean cross-validated accuracy in the grid search (84.28%), just ahead of the Hist Gradient Boosting Classifier (83.97%). We select this model and retrain it on the whole dataset.

### **Train the selected pipeline on all our data**

In [131]:
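# Note: these hyperparameters are set by hand and differ from
# GridSear_rfc.best_params_ reported above; GridSear_rfc.best_estimator_
# holds the pipeline refit on X_train with the tuned values
final_model = Pipeline([('data_preprocessing', data_preprocessing),
                        ('rfc', rfc(random_state=42,
                                    max_depth=7,
                                    max_features='sqrt',
                                    n_estimators=100))
                        ])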

 
final_model.fit(X, y)

In [198]:
round(final_model.score(X, y)*100, 4)

86.0831

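This score is computed on the same rows the model was trained on, so it is an optimistic estimate of the accuracy on new passengers. The model can now be saved for future predictions or for deployment on the web.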
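For an estimate on unseen data, one option is to score the tuned estimator on a held-out split before refitting on everything. The sketch below shows the idea on synthetic data from `make_classification` with a deliberately tiny, hypothetical grid. It is not the notebook's actual search, and all names (`X_syn`, `search`, `tuned_model`) are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the Titanic features and target
X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42
)

# Small hypothetical grid, searched on the training part only
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [10, 20], "max_depth": [3, None]},
    cv=3,
)
search.fit(X_tr, y_tr)

# best_estimator_ is refit on the whole training part with the best parameters
tuned_model = search.best_estimator_
train_acc = tuned_model.score(X_tr, y_tr)
test_acc = tuned_model.score(X_te, y_te)

print(f"train accuracy: {train_acc:.3f}, held-out accuracy: {test_acc:.3f}")
```

In this notebook the equivalent check would be `GridSear_rfc.score(X_test, y_test)`, since `X_test` and `y_test` were set aside earlier and not used during the search.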

## **Save model**

In [141]:
import joblib

In [142]:
# save the model to a file
joblib.dump(final_model, 'randomforest_classifier.joblib')

['randomforest_classifier.joblib']

## **Prediction**

Predict a small random sample from the dataset. 

In [145]:
random_set = df.sample(n=4, random_state=42)

random_set_without_target = random_set.drop("Survived", axis = 1)

random_set_without_target 

     PassengerId  Pclass                                                Name  \
709          710       3  Moubarek, Master. Halim Gonios ("William George")   
439          440       2              Kvillner, Mr. Johan Henrik Johannesson   
840          841       3                         Alhomaki, Mr. Ilmari Rudolf   
720          721       2                   Harper, Miss. Annie Jessie "Nina"   

        Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked  
709    male   NaN      1      1              2661  15.2458   NaN        C  
439    male  31.0      0      0        C.A. 18723  10.5000   NaN        S  
840    male  20.0      0      0  SOTON/O2 3101287   7.9250   NaN        S  
720  female   6.0      0      1            248727  33.0000   NaN        S  


In [146]:
model_save = joblib.load('randomforest_classifier.joblib')
print(model_save)

Pipeline(steps=[('data_preprocessing',
                 ColumnTransformer(transformers=[('onehotencoder',
                                                  OneHotEncoder(sparse_output=False),
                                                  ['Sex']),
                                                 ('pipeline-1',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse_output=False))]),
                                                  ['Embarked', 'Cabin']),
                       

In [147]:
model_save.predict(random_set_without_target)

array([0, 0, 0, 1], dtype=int64)

### **The true values for the predicted sample**

In [148]:
np.array(random_set["Survived"])

array([1, 0, 0, 1], dtype=int64)
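Comparing the two arrays element by element gives the accuracy on this small sample. The sketch below copies the predicted and true values printed above; `predicted`, `actual` and `sample_accuracy` are illustrative names.

```python
import numpy as np

# Values copied from the two outputs above
predicted = np.array([0, 0, 0, 1])
actual = np.array([1, 0, 0, 1])

n_correct = int((predicted == actual).sum())
sample_accuracy = n_correct / len(actual)

print(f"{n_correct} of {len(actual)} correct ({sample_accuracy:.0%})")
```

Three of the four passengers are classified correctly; the first one, who survived, is predicted as not surviving. Because these rows were part of the data used to fit `final_model`, this is a sanity check rather than an unbiased evaluation.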