![crack](https://cdn.radiofrance.fr/s3/cruiser-production/2021/01/1888eb1e-654a-4ab4-835b-581cf76844f7/1200x680_titanic.jpg)

# Titanic 

Let's start applying classification to a very popular dataset: **Titanic**.

Together, we're going to build a model that tries to predict who survived the sinking of the Titanic, based on several variables.

Your goal will be to: 

1. Preprocess the data 
2. Create a classification algorithm 

Happy Coding!

## Step 1 - Import Data 🤹‍♀️

- Import the usual libraries

In [142]:
# Imports
import pandas as pd
import numpy as np # Not always necessary
import matplotlib.pyplot as plt # Not always necessary
import seaborn as sns # Not always necessary
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

- Import `titanic.csv` and visualize dataset

In [143]:
df = pd.read_csv("./assets/ML/titanic.csv")
print("Type of df:", type(df), df.shape)
df.head(10)

Type of df: <class 'pandas.core.frame.DataFrame'> (891, 12)


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |


- Remove `PassengerId`, `Name`, `Ticket`, `Cabin` columns from the dataset

In [144]:
print("Type of df:", type(df), df.shape)
df = df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])

print("Type of df:", type(df), df.shape)
df.head()

Type of df: <class 'pandas.core.frame.DataFrame'> (891, 12)
Type of df: <class 'pandas.core.frame.DataFrame'> (891, 8)


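| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S |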


## Step 2 - EDA 📊

- Visualize `Sex` and `Survived`

In [145]:
#df["Sex"].replace(["male", "female"], [0,1], inplace=True)


# t = 1.96
# n = len(df)
# err = t * df.std() / n**0.5
# For yerr, see https://pandas.pydata.org/pandas-docs/version/0.23/visualization.html#visualization-errorbars


# sns.catplot(df, x="Sex", y="Survived")



- Visualize `SibSp` and `Survived`

- Visualize `Pclass` and `Survived`

- Visualize `Embarked` and `Survived`

- Visualize `Parch` and `Survived`

- Visualize `Fare` and `Survived`
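
Every plot in this step compares one feature with `Survived`, so a reasonable starting point is the survival rate per category, which is the quantity a bar plot of `Survived` against that feature displays. The sketch below runs on a small made-up DataFrame rather than the real `titanic.csv`, and `survival_rate_by` is a hypothetical helper name introduced only for illustration.

```python
import pandas as pd

# Toy rows for illustration only; replace `toy` with the real `df` from Step 1.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Survived": [0, 1, 1, 0, 1, 0],
})


def survival_rate_by(data: pd.DataFrame, column: str) -> pd.Series:
    """Mean of `Survived` (the survival rate) for each value of `column`."""
    return data.groupby(column)["Survived"].mean()


sex_rates = survival_rate_by(toy, "Sex")
pclass_rates = survival_rate_by(toy, "Pclass")
print(sex_rates)
print(pclass_rates)
```

On the real data, `sns.barplot(data=df, x="Sex", y="Survived")` should draw these same means together with confidence intervals. A continuous column such as `Fare` is usually easier to read as a box plot or histogram split by `Survived`.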

* Show the main statistics of your dataset

In [146]:
df.describe(include="all")

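| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891 | 714.0 | 891.0 | 891.0 | 891.0 | 889 |
| unique | NaN | NaN | 2 | NaN | NaN | NaN | NaN | 3 |
| top | NaN | NaN | male | NaN | NaN | NaN | NaN | S |
| freq | NaN | NaN | 577 | NaN | NaN | NaN | NaN | 644 |
| mean | 0.383838 | 2.308642 | NaN | 29.699118 | 0.523008 | 0.381594 | 32.204208 | NaN |
| std | 0.486592 | 0.836071 | NaN | 14.526497 | 1.102743 | 0.806057 | 49.693429 | NaN |
| min | 0.0 | 1.0 | NaN | 0.42 | 0.0 | 0.0 | 0.0 | NaN |
| 25% | 0.0 | 2.0 | NaN | 20.125 | 0.0 | 0.0 | 7.9104 | NaN |
| 50% | 0.0 | 3.0 | NaN | 28.0 | 0.0 | 0.0 | 14.4542 | NaN |
| 75% | 1.0 | 3.0 | NaN | 38.0 | 1.0 | 0.0 | 31.0 | NaN |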


- Let's take a look at the missing values (as a percentage of rows)

In [147]:
df.isna().sum() / len(df) * 100


Survived     0.000000
Pclass       0.000000
Sex          0.000000
Age         19.865320
SibSp        0.000000
Parch        0.000000
Fare         0.000000
Embarked     0.224467
dtype: float64

## Step 3 - Preprocessing 🍳

- Split your dataset by $X$ and $y$

In [148]:
features_list = ['Pclass', 'Sex', 'Age', "SibSp", "Parch", "Fare", "Embarked"]           
X = df.loc[:,features_list]                                    
y = df.loc[:,"Survived"]    

# A simpler way to build X would be df.drop(columns="Survived")


- Split your data in train and test sets

In [149]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=0, 
                                                    stratify=y) # Allows you to stratify your sample. 
                                                                # Meaning, you will have the same
                                                                # proportion of categories in test 
                                                                # and train set
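
To see what `stratify=y` buys you, compare the share of positives in each split. This is a minimal sketch on a made-up target with 40% positives; the names `X_toy`, `y_toy` and the data itself are assumptions for illustration, not part of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up target with 40% positives, standing in for `Survived`.
y_toy = pd.Series([1] * 40 + [0] * 60, name="Survived")
X_toy = pd.DataFrame({"feature": np.arange(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# With stratify, both rates should stay close to the overall rate of 0.40.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"positives in train: {train_rate:.2f}, in test: {test_rate:.2f}")
```

The same check on the real split is `y_train.mean()` versus `y_test.mean()`.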

In [150]:
X_train.head(7)

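| | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 502 | 3 | female | NaN | 0 | 0 | 7.6292 | Q |
| 464 | 3 | male | NaN | 0 | 0 | 8.05 | S |
| 198 | 3 | female | NaN | 0 | 0 | 7.75 | Q |
| 765 | 1 | female | 51.0 | 1 | 0 | 77.9583 | S |
| 421 | 3 | male | 21.0 | 0 | 0 | 7.7333 | Q |
| 368 | 3 | female | NaN | 0 | 0 | 7.75 | Q |
| 643 | 3 | male | NaN | 0 | 0 | 56.4958 | S |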


* Deal with missing values 
    * you can replace missing values in numerical columns by the median 
    * you can replace missing values in categorical columns by a new category called "*Unknown*"
    * Check out [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html?highlight=simpleimputer#sklearn.impute.SimpleImputer) documentation to do so 😉
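
The two strategies above can be sketched as follows. The rows are made up for illustration, and `num_imputer` / `cat_imputer` are names chosen here, not imposed by the exercise. The point to notice is that each imputer is fitted on the train set only and then reused on the test set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up rows for illustration; in the notebook these would be X_train / X_test.
train = pd.DataFrame({"Age": [22.0, np.nan, 35.0, 54.0],
                      "Embarked": ["S", "C", np.nan, "S"]})
test = pd.DataFrame({"Age": [np.nan, 40.0],
                     "Embarked": ["Q", np.nan]})

num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="constant", fill_value="Unknown")

# Fit on the train set only, then reuse the fitted imputers on the test set.
train["Age"] = num_imputer.fit_transform(train[["Age"]]).ravel()
train["Embarked"] = cat_imputer.fit_transform(train[["Embarked"]]).ravel()
test["Age"] = num_imputer.transform(test[["Age"]]).ravel()
test["Embarked"] = cat_imputer.transform(test[["Embarked"]]).ravel()

print(train)
print(test)
```

The missing test age is filled with the median learned from the train rows, not with a statistic computed on the test set itself.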

In [151]:
# X_train = X_train.copy()                 # Copy the dataset to avoid the caveats of assigning to a slice of a DataFrame
                                           # More info here https://towardsdatascience.com/explaining-the-settingwithcopywarning-in-pandas-ebc19d799d25

# imputer = SimpleImputer(missing_values=np.nan, strategy="median")       # Instantiate SimpleImputer with the median strategy
# X_train.Age = imputer.fit_transform(X_train.Age.values.reshape(-1,1))   # or fit_transform(X_train.loc[:,["Age"]]); fit and transform the column with missing values



# imputer2 = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value="Unknown")
# X_train.Embarked = imputer2.fit_transform(X_train.Embarked.values.reshape(-1,1))

In [152]:
X_train.head(7)

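| | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 502 | 3 | female | NaN | 0 | 0 | 7.6292 | Q |
| 464 | 3 | male | NaN | 0 | 0 | 8.05 | S |
| 198 | 3 | female | NaN | 0 | 0 | 7.75 | Q |
| 765 | 1 | female | 51.0 | 1 | 0 | 77.9583 | S |
| 421 | 3 | male | 21.0 | 0 | 0 | 7.7333 | Q |
| 368 | 3 | female | NaN | 0 | 0 | 7.75 | Q |
| 643 | 3 | male | NaN | 0 | 0 | 56.4958 | S |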


- Make all the required preprocessings on the train set

In [136]:
# X_train["Sex"].replace(["male", "female"], [0,1], inplace=True)
# X_train["Embarked"].replace(["Q", "S"], [0,1], inplace=True)

# gender_replacer = SimpleImputer(missing_values = np.NaN, strategy="constant", fill_value="Unknown")


X_train = X_train.copy()   # Work on a copy to avoid SettingWithCopyWarning

# Impute missing values: median for Age, a new "Unknown" category for Embarked
imputer1 = SimpleImputer(missing_values=np.nan, strategy="median")
X_train["Age"] = imputer1.fit_transform(X_train[["Age"]]).ravel()

imputer2 = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value="Unknown")
X_train["Embarked"] = imputer2.fit_transform(X_train[["Embarked"]]).ravel()

# Column positions in X_train: 0=Pclass, 1=Sex, 6=Embarked
categorical_features = [0, 1, 6]
# Pclass is one-hot encoded too, because class 3 is not "greater than" class 1 in this context
categorical_transformer = OneHotEncoder(drop='first')

# Column positions in X_train: 2=Age, 5=Fare
features_to_rescale = [2, 5]
numeric_transformer = StandardScaler()

# Columns that are not listed (SibSp, Parch) are dropped, since remainder='drop' by default
featureencoder = ColumnTransformer(                   # ColumnTransformer comes from sklearn.compose
    transformers=[
        ('cat', categorical_transformer, categorical_features),   # 'cat' is a name we choose ourselves
        ('num', numeric_transformer, features_to_rescale)
        ]
    )

X_train = featureencoder.fit_transform(X_train)
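
One way to make sure the test set goes through exactly the same steps as the train set is to put the imputers inside the `ColumnTransformer`, using a `Pipeline` per column type. This is a sketch of that alternative design, not the notebook's own approach: the rows are made up, and the names `numeric_pipeline`, `categorical_pipeline` and `preprocessor` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows with the same column names as the Titanic features.
toy_train = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 13.0],
    "Embarked": ["S", "C", np.nan, "S"],
})
toy_test = pd.DataFrame({
    "Pclass": [1], "Sex": ["male"], "Age": [np.nan],
    "Fare": [30.0], "Embarked": ["C"],
})

# Numeric columns: fill missing values with the median, then standardize.
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
# Categorical columns: fill missing values with "Unknown", then one-hot encode.
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
    ("encoder", OneHotEncoder(drop="first")),
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, ["Age", "Fare"]),
        ("cat", categorical_pipeline, ["Pclass", "Sex", "Embarked"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

X_train_ready = preprocessor.fit_transform(toy_train)  # learn medians, scales, categories
X_test_ready = preprocessor.transform(toy_test)        # reuse them, no refitting
print(X_train_ready.shape, X_test_ready.shape)
```

With this layout a single `preprocessor.transform(X_test)` call imputes, scales and encodes the test set, so no `NaN` can reach the model by accident.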


## Step 4 - Build your model 🏋️‍♂️

- Create your Logistic Regression model

In [138]:
classifier = LogisticRegression(random_state=0)   # Instantiate the model
classifier.fit(X_train, y_train)                  # Fit the model on the train set


- Evaluate it (don't forget to preprocess X_test)

In [139]:
y_train_pred = classifier.predict(X_train)
print(y_train_pred[0:5])

[0 0 0 1 0]


- Look at your model scores on train and test

In [140]:
# Apply the imputers fitted on the train set to X_test first; otherwise
# predict() fails with "ValueError: Input X contains NaN."
X_test = X_test.copy()
X_test["Age"] = imputer1.transform(X_test[["Age"]]).ravel()
X_test["Embarked"] = imputer2.transform(X_test[["Embarked"]]).ravel()
X_test = featureencoder.transform(X_test)
print(X_test[0:5,:])

y_test_pred = classifier.predict(X_test)
print(y_test_pred[0:5])

- What can you say about it?
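
A common way to answer this is to compare accuracy on the train and test sets: a large gap suggests overfitting, while two similar values suggest the model generalizes about as well as it fits. The sketch below uses synthetic numeric data as a stand-in for the preprocessed Titanic features, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, already-numeric data standing in for the preprocessed features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

model = LogisticRegression(random_state=0).fit(X_tr, y_tr)

# For classifiers, .score() returns the mean accuracy.
train_accuracy = model.score(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)
print(f"accuracy on train: {train_accuracy:.3f}")
print(f"accuracy on test:  {test_accuracy:.3f}")
```

In the notebook, the equivalent calls are `classifier.score(X_train, y_train)` and `classifier.score(X_test, y_test)`, once `X_test` has been preprocessed.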

- Create the confusion matrix with `ConfusionMatrixDisplay` (the older `plot_confusion_matrix` helper was removed in scikit-learn 1.2)
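
Before plotting, it helps to know how to read the matrix itself. This is a minimal sketch with made-up labels; in the notebook you would pass `y_test` and `y_test_pred` instead.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

To draw it, `ConfusionMatrixDisplay.from_predictions(y_true, y_pred)` from `sklearn.metrics` should produce the plot directly (available in recent scikit-learn versions, with matplotlib installed).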

- Create a dataframe with the feature importances

## Bonus - Feature Importance 🏄‍♂️

* Now for something harder: try to visualize the coefficients of your model and deduce a feature importance ranking from them
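
For a logistic regression fitted on standardized features, the absolute value of each coefficient is a common, if rough, proxy for feature importance. The sketch below shows the idea on synthetic data where one column drives the target and the other does not; the column names `signal` and `noise` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: "signal" drives the target, "noise" does not.
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
target = (features["signal"] > 0).astype(int)

# Standardize first so that coefficients are comparable across features.
X_scaled = StandardScaler().fit_transform(features)
model = LogisticRegression(random_state=0).fit(X_scaled, target)

# coef_ has shape (1, n_features) for a binary problem.
importance = pd.DataFrame({
    "feature": features.columns,
    "coefficient": model.coef_[0],
})
importance["abs_coefficient"] = importance["coefficient"].abs()
importance = importance.sort_values("abs_coefficient", ascending=False)
print(importance)
```

With the notebook's `ColumnTransformer`, the names of the transformed columns can usually be recovered with `featureencoder.get_feature_names_out()` (scikit-learn 1.0 or later) and paired with `classifier.coef_[0]` in the same way. A bar plot of the `coefficient` column then gives the visual ranking.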