# Discovering the Titanic 🚢🚢

The Titanic is a well-known ship, but did you know that it is also one of the most popular datasets in data science? Here's the link to the dataset:

<a href="https://www.kaggle.com/c/titanic/"> Titanic </a>

Machine learning is largely about statistical prediction and understanding data. The objective of this exercise is to predict whether a passenger survived the sinking of the Titanic, based on the information available about that passenger. The code that trains the model, makes predictions and evaluates its performance has already been written. You have to complete the upstream part, which prepares the dataset before the model is trained (preprocessing).

1. Download the dataset _titanic.csv_.
2. Try to understand what's in this dataset.
    1. You will find all the explanations via this link: <a href="https://www.kaggle.com/c/titanic/data"> Titanic Data </a>

3. Place the file _titanic.csv_ in the same folder as this notebook and read it.

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import warnings

warnings.filterwarnings(
    "ignore", category=DeprecationWarning
)  
print("good")

good


In [5]:
df = pd.read_csv("titanic.csv")


4. Explore the dataset and determine which columns are useful for prediction and what preprocessing you will do.

In [10]:
row = df.shape[0]
print(f"The number of rows is {row}")

df.head()


The number of rows is 891


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [13]:
# Determine the main characteristics of the data
df.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
PassengerId,891.0,,,,446.0,257.353842,1.0,223.5,446.0,668.5,891.0
Survived,891.0,,,,0.383838,0.486592,0.0,0.0,0.0,1.0,1.0
Pclass,891.0,,,,2.308642,0.836071,1.0,2.0,3.0,3.0,3.0
Name,891.0,891.0,"Braund, Mr. Owen Harris",1.0,,,,,,,
Sex,891.0,2.0,male,577.0,,,,,,,
Age,714.0,,,,29.699118,14.526497,0.42,20.125,28.0,38.0,80.0
SibSp,891.0,,,,0.523008,1.102743,0.0,0.0,0.0,1.0,8.0
Parch,891.0,,,,0.381594,0.806057,0.0,0.0,0.0,0.0,6.0
Ticket,891.0,681.0,347082,7.0,,,,,,,
Fare,891.0,,,,32.204208,49.693429,0.0,7.9104,14.4542,31.0,512.3292


In [18]:
# Number of missing values per column:
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [20]:
# Normalize by the number of rows to get percentages
(df.isna().sum() / df.shape[0])*100

PassengerId     0.000000
Survived        0.000000
Pclass          0.000000
Name            0.000000
Sex             0.000000
Age            19.865320
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.000000
Cabin          77.104377
Embarked        0.224467
dtype: float64
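
The same percentages can also be computed in one step, because the mean of a boolean mask is the share of `True` values. A minimal sketch on a small made-up frame (the `toy` data is an assumption for illustration, not the Titanic file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isna() gives a boolean mask; its column-wise mean is the share of missing values
missing_pct = toy.isna().mean() * 100
print(missing_pct.sort_values(ascending=False))
```

Sorting the result makes the columns with the most missing values stand out first.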

## Preprocessing - pandas part 🐼🐼
5. Use the pandas library to discard columns you won't use for prediction.

In this dataset, some categorical variables have too many modalities, so we will have to consider discarding them. Typically, for a dataset with fewer than 1,000 rows, we tend to reject categorical variables that have more than 15-20 possible values. Pay attention to the number of unique values in each column to decide which ones to keep.
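
One way to check this is to count the distinct values of each non-numeric column. The sketch below uses a small made-up frame and an illustrative threshold (`toy` and `max_unique` are assumptions; on the real data the threshold would be closer to the 15-20 mentioned above):

```python
import pandas as pd

# Hypothetical toy frame; the real notebook would use the Titanic `df`
toy = pd.DataFrame({
    "Name": ["Braund", "Cumings", "Heikkinen", "Futrelle"],
    "Sex": ["male", "female", "female", "female"],
    "Embarked": ["S", "C", "S", "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

max_unique = 3  # illustrative threshold, deliberately small for this toy frame

# Count distinct values in each non-numeric column
cardinality = toy.select_dtypes(exclude="number").nunique()
too_many = cardinality[cardinality > max_unique].index.tolist()
print(cardinality)
print("High-cardinality columns:", too_many)
```

Columns flagged this way are candidates for dropping, not automatic rejections; the final choice still depends on what the column means.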

In [36]:
# Columns that always hold the same value are of no use to me.
# Categorical columns with too many unique values are of no use in a machine learning context.
# PassengerId, Name, Ticket
column_to_drop = ["PassengerId", "Name", "Ticket"]
# errors="ignore" lets this cell be re-run after the columns have already been dropped
df = df.drop(columns=column_to_drop, errors="ignore")

In [37]:
df = df.drop(
    ['Cabin'], axis=1 
)
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


6. Separate the target variable (Y) from the explanatory variables (X)

##### Explanatory variables (X): we need to identify which columns contain categorical variables and which contain numeric variables, because they will be processed differently.

##### - Categorical variables: Sex, Embarked
##### - Numeric variables: Pclass, Age, SibSp, Parch, Fare.

##### We will therefore need to create:
- **Numeric** = a numeric transformer (which will use the **StandardScaler** class)
- **Categorical** = a categorical transformer (which will use the **OneHotEncoder** class).
- Since we observe missing values in the initial dataset, we will also need the **SimpleImputer** class to handle them.
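
As a first guess, the two groups can be derived from the column dtypes. This is only a sketch on made-up rows (`X_toy` is an assumption). Note that `Pclass` is stored as an integer, so it lands in the numeric group, which matches how this notebook treats it.

```python
import pandas as pd

# Hypothetical rows with the same columns as X in this notebook
X_toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, None],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925],
    "Embarked": ["S", "C", "S"],
})

# Numeric dtypes go to the scaler, everything else to the one-hot encoder
numeric_cols = X_toy.select_dtypes(include="number").columns.tolist()
categorical_cols = X_toy.select_dtypes(exclude="number").columns.tolist()
print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```

A dtype-based split should still be reviewed by hand, since an integer column can encode categories.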

## Preprocessing - scikit-learn part 🔬🔬
7. Split your data into a train set and a test set; the latter should represent 15% of the available data.

In [41]:
# Separate target variable Y from features X
target_name = 'Survived'

print("Separating labels from features...")
Y = df.loc[:,target_name]
X = df.drop(target_name, axis = 1)
print("...Done.")
print(Y.head())
print()
print(X.head())
print()



Separating labels from features...
...Done.
0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

   Pclass     Sex   Age  SibSp  Parch     Fare Embarked
0       3    male  22.0      1      0   7.2500        S
1       1  female  38.0      1      0  71.2833        C
2       3  female  26.0      0      0   7.9250        S
3       1  female  35.0      1      0  53.1000        S
4       3    male  35.0      0      0   8.0500        S



In [42]:
# Split the data
print("Dividing into train and test sets...")
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.15, random_state=0)
print("Done")

Dividing into train and test sets...
Done
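
A quick sanity check after splitting is to look at the sizes of the two sets. The sketch below uses placeholder arrays with the same number of rows as the dataset (`X_dummy` and `y_dummy` are assumptions). With `test_size=0.15`, scikit-learn rounds the test size up, so 891 rows should give 134 test rows and 757 training rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the Titanic file (891)
X_dummy = np.arange(891).reshape(-1, 1)
y_dummy = np.zeros(891, dtype=int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_dummy, y_dummy, test_size=0.15, random_state=0
)
print(len(X_tr), len(X_te))
```

Fixing `random_state` keeps the split reproducible between runs.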


In [43]:
X_train.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
545,1,male,64.0,0,0,26.0,S
37,3,male,21.0,0,0,8.05,S
214,3,male,,1,0,7.75,Q
40,3,female,40.0,1,0,9.475,S
236,2,male,44.0,1,0,26.0,S


8. Create the preprocessing pipeline for numeric columns

In [47]:
# Create pipeline for numeric features
# Numeric (quantitative) features: scale with StandardScaler
numeric_features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare'] # Names of numeric columns in X_train/X_test
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')), # missing values in Age will be replaced by the column's mean
    ('scaler', StandardScaler())
])

# StandardScaler
# StandardScaler is a scikit-learn tool that standardizes numeric data by transforming it to have:
# Mean = 0
# Standard deviation = 1
# This brings all features to the same order of magnitude (the same scale)
# Regression-type models are very sensitive to the order of magnitude of the features
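
The standardization can be checked by hand with the formula $z = (x - \mu) / \sigma$. The sketch below compares `StandardScaler` with that manual computation on a few made-up fares (the `fares` values are an assumption for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical fares, chosen only for illustration
fares = np.array([[7.25], [71.28], [7.92], [53.10], [8.05]])

scaled = StandardScaler().fit_transform(fares)

# Same computation by hand: z = (x - mean) / std, using the population std (ddof=0)
manual = (fares - fares.mean(axis=0)) / fares.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, the column has a mean of about 0 and a standard deviation of about 1, up to floating-point error.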

9. Create the preprocessing pipeline for category columns

In [49]:
# Categorical (qualitative) features: OneHotEncoder
# Create pipeline for categorical features
categorical_features = ['Sex', 'Embarked'] # Names of categorical columns in X_train/X_test
categorical_transformer = Pipeline(
    steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')), # missing values will be replaced by most frequent value
    ('encoder', OneHotEncoder(drop='first')) # first column will be dropped to avoid creating correlations between features
    ])

# This builds the sparse matrix, which can get very large

10. Use the preprocessing pipelines of questions 8 and 9 to transform X_train and X_test

Reminder: you need to call `fit_transform()` on X_train and only `transform()` on X_test, to ensure that the latter gets the same transformations as X_train.
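
The reason is that fitting learns statistics (means, most frequent values, scaling factors) that must come from the train set only. A minimal sketch with an imputer on made-up ages (`age_train` and `age_test` are assumptions for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical ages; NaN marks a missing value
age_train = np.array([[20.0], [30.0], [np.nan], [40.0]])
age_test = np.array([[np.nan], [60.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit_transform(age_train)  # learns the mean of the observed train ages

# transform() reuses the train mean; the test set is never used for fitting
age_test_filled = imputer.transform(age_test)

print(imputer.statistics_)
print(age_test_filled)
```

The missing test value is filled with the train mean (30.0), not with anything computed from the test set, which would leak test information into preprocessing.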

In [50]:
# Use ColumnTransformer to make a preprocessor object that describes all the treatments to be done
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Preprocessings on train set
print("Performing preprocessings on train set...")
print(X_train.head())
X_train = preprocessor.fit_transform(X_train)
print('...Done.')
print(X_train[0:5,:])
print()

# Preprocessings on test set
print("Performing preprocessings on test set...")
print(X_test.head())
X_test = preprocessor.transform(X_test) # Don't fit again !!
print('...Done.')
print(X_test[0:5,:])
print()


Performing preprocessings on train set...
     Pclass     Sex   Age  SibSp  Parch    Fare Embarked
545       1    male  64.0      0      0  26.000        S
37        3    male  21.0      0      0   8.050        S
214       3    male   NaN      1      0   7.750        Q
40        3  female  40.0      1      0   9.475        S
236       2    male  44.0      1      0  26.000        S
...Done.
[[-1.60067161e+00  2.61131471e+00 -4.63468368e-01 -4.65997851e-01
  -1.09604554e-01  1.00000000e+00  0.00000000e+00  1.00000000e+00]
 [ 8.10688409e-01 -6.78358906e-01 -4.63468368e-01 -4.65997851e-01
  -4.71133941e-01  1.00000000e+00  0.00000000e+00  1.00000000e+00]
 [ 8.10688409e-01 -2.71796941e-16  4.31545801e-01 -4.65997851e-01
  -4.77176214e-01  1.00000000e+00  1.00000000e+00  0.00000000e+00]
 [ 8.10688409e-01  7.75217807e-01  4.31545801e-01 -4.65997851e-01
  -4.42433140e-01  0.00000000e+00  0.00000000e+00  1.00000000e+00]
 [-3.94991602e-01  1.08123396e+00  4.31545801e-01 -4.65997851e-01
  -1.0960

### Training model

In [14]:
from sklearn.linear_model import LogisticRegression

In [15]:
# Train model
# Classification problem: logistic regression
model = LogisticRegression()

print("Training model...")
model.fit(X_train, Y_train) # Training is always done on train set !!
print("...Done.")

Training model...
...Done.


### Predictions

In [16]:
# Predictions on training set
print("Predictions on training set...")
Y_train_pred = model.predict(X_train)
print("...Done.")
print(Y_train_pred[0:5])
print()

Predictions on training set...
...Done.
[0 0 0 0 0]



In [17]:
# Predictions on test set
print("Predictions on test set...")
Y_test_pred = model.predict(X_test)
print("...Done.")
print(Y_test_pred[0:5])
print()

Predictions on test set...
...Done.
[0 0 0 1 1]



### Performance evaluation

In [18]:
from sklearn.metrics import accuracy_score

In [19]:
# Print scores
print("Accuracy on training set : ", accuracy_score(Y_train, Y_train_pred))
print("Accuracy on test set : ", accuracy_score(Y_test, Y_test_pred))

Accuracy on training set :  0.8018494055482166
Accuracy on test set :  0.7910447761194029


If you get a score close to 0.79 on the test set, it means that you carried out all the preprocessing steps with a sound methodology! :-)
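
For reference, accuracy is simply the share of predictions that match the true labels. The sketch below checks this against `accuracy_score` on made-up labels (`y_true` and `y_pred` are assumptions, not the notebook's predictions):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions, for illustration only
y_true = np.array([0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 1, 1])

# Share of positions where prediction and label agree (3 out of 5 here)
manual_accuracy = np.mean(y_true == y_pred)
print(manual_accuracy, accuracy_score(y_true, y_pred))
```

Read this way, a test score of about 0.791 on a 134-row test set would correspond to 106 correct predictions.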