# Optional - Advanced Solution

The previous solution is perfectly valid: we applied the rules from the lectures strictly. Now that we are more comfortable with preprocessing in Python, let's take a step back and see what we could have done differently by digging a little deeper into the interpretation of the variables.

1. Load the titanic dataset again

In [115]:
# prelude

import pandas as pd
import re

from sklearn.model_selection import train_test_split

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler        
from sklearn.compose import ColumnTransformer

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [116]:
df = pd.read_csv("../12_assets/05_supervised_ML/titanic.csv")
df.sample(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
766,767,0,1,"Brewe, Dr. Arthur Jackson",male,,0,0,112379,39.6,,C
325,326,1,1,"Young, Miss. Marie Grice",female,36.0,0,0,PC 17760,135.6333,C32,C
735,736,0,3,"Williams, Mr. Leslie",male,28.5,0,0,54636,16.1,,S
658,659,0,2,"Eitemiller, Mr. George Floyd",male,23.0,0,0,29751,13.0,,S
179,180,0,3,"Leonard, Mr. Lionel",male,36.0,0,0,LINE,0.0,,S
171,172,0,3,"Rice, Master. Arthur",male,4.0,4,1,382652,29.125,,Q
165,166,1,3,"Goldsmith, Master. Frank John William ""Frankie""",male,9.0,0,2,363291,20.525,,S
448,449,1,3,"Baclini, Miss. Marie Catherine",female,5.0,2,1,2666,19.2583,,C
854,855,0,2,"Carter, Mrs. Ernest Courtenay (Lilian Hughes)",female,44.0,1,0,244252,26.0,,S
293,294,0,3,"Haas, Miss. Aloisia",female,24.0,0,0,349236,8.85,,S


In [117]:
df.describe(include="all")

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891.0,891.0,204,889
unique,,,,891,2,,,,681.0,,147,3
top,,,,"Braund, Mr. Owen Harris",male,,,,347082.0,,B96 B98,S
freq,,,,1,577,,,,7.0,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,


In [118]:
# % of missing val
100 * df.isnull().sum() / len(df)

PassengerId     0.000000
Survived        0.000000
Pclass          0.000000
Name            0.000000
Sex             0.000000
Age            19.865320
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.000000
Cabin          77.104377
Embarked        0.224467
dtype: float64

Let's explore the features in more detail and try to extract more information than before:

**A. Preprocessing to be planned with pandas**

**Unnecessary columns for prediction, to be thrown away** :
- _PassengerId_ and _Name_ are passenger identifiers; we won't use them as such for prediction (taken as they are, these columns carry no predictive information)

<Note type="tip" title="Actually, _Name_ contains useful information !">

While it is true that _Name_ cannot be used as is for prediction, it contains valuable information on the socio-economic background of the passenger in the form of their title. We will try to extract a _Title_ variable from the _Name_ variable.

</Note>

- _Ticket_ and _Cabin_ have too many distinct values (modalities). They might not be very useful, and if we passed them through one-hot encoding, the number of columns would explode relative to the number of rows.

<Note type="tip" title="We can do something with the _Cabin_ variable !">

_Ticket_ and _Cabin_ do have far too many modalities to be useful for prediction as they are. However, the _Cabin_ variable can easily be used after a slight transformation: let's create a new variable _HasCabin_ that equals 1 when the passenger has a cabin number and 0 otherwise.

</Note>

**Columns with too many missing values, to be discarded** : Cabin (about 77% missing; we only keep the derived _HasCabin_ flag)


**Target variable/target (Y) that we will try to predict, to separate from the others** : Survived

**------------**

**B. Preprocessing to be planned with scikit-learn**

**Explanatory variables (X)**
We need to identify which columns contain categorical variables and which columns contain numerical variables, as they will be treated differently.

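- Categorical variables : Sex, Embarked, HasCabin, Title (_HasCabin_ is already encoded as 0/1, so the code below simply passes it through the numeric pipeline)
- Numerical variables : Pclass, Age, SibSp, Parch, Fare.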

In this dataset we have both types of variables, so we will create a numeric_transformer (built on the StandardScaler class) and a categorical_transformer (built on the OneHotEncoder class). Moreover, since the _Age_ and _Embarked_ columns contain missing values, we will also use the SimpleImputer class to fill them in.

**Target variable Y**
Here, the target variable Y is categorical (survival vs. death) but we notice that it is already encoded in numbers (1 vs. 0). It will therefore not be necessary to go through a label encoding step.

In [119]:
# We will try and extract a _Title_ variable from the variable _Name_
# drop _PassengerId_ and _Name_
# let's create a new variable _HasCabin_ which is equal to 1 when the passenger has a cabin number and 0 otherwise.
# drop cabin

# target = Survived

# - Categorical variables : Sex, Embarked, HasCabin, Title
# - Numerical variables : Pclass, Age, SibSp, Parch, Fare.

# SimpleImputer on Age and Embarked


## Preprocessing - pandas part ##
2. Create a column _HasCabin_ in the dataset that is constant equal to 1

In [120]:
df["HasCabin"] = 1
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,HasCabin
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1


3. Using a mask, change the value of the variable _HasCabin_ to 0 wherever Cabin is missing.

In [121]:
df.loc[df["Cabin"].isna(), "HasCabin"] = 0
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,HasCabin
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0
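
Steps 2 and 3 can also be collapsed into a single expression. The following is a minimal sketch on a toy frame (the `toy` DataFrame and its values are illustrative, not the Titanic data), assuming missing cabins are stored as `NaN` as they are after `pd.read_csv`:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are illustrative only).
toy = pd.DataFrame({"Cabin": ["C85", np.nan, "C123", np.nan]})

# notna() yields True/False; casting to int gives the 1/0 flag directly.
toy["HasCabin"] = toy["Cabin"].notna().astype(int)
print(toy["HasCabin"].tolist())  # [1, 0, 1, 0]
```

Both approaches should give the same column; the two-step version with a mask is kept above because it is what the exercise asks for.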


5. Create a column _Title_ that only contains the title extracted from the _Name_ variable. 

<Note type="tip" title="Remember pandas handles columns of strings efficiently">

Some methods from [the str module](https://docs.python.org/3.3/library/stdtypes.html?highlight=split) can be helpful 😉
You can write a function that extracts the title from one element of the column, and then use the `apply()` method to apply this function to the whole column.

</Note>

In [122]:
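# Alternative without regex: name.split(', ')[1].split('.')[0]

def extractTitle(name):
  # Match ", " + one word + one more character (normally the "." after the title)
  title = re.search(r", \w+.", name)
  b, e = title.span()
  # Skip the leading ", " and drop the trailing character
  return name[b+2 : e-1]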

In [123]:
txt = "McCarthy, Mr. Timothy J"
print(extractTitle(txt))

Mr


In [124]:
df["Title"] = df["Name"].apply(extractTitle)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,HasCabin,Title
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0,Mr
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1,Mrs
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0,Miss
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1,Mrs
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0,Mr
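
As an alternative to `apply()`, pandas offers vectorized string methods. Here is a minimal sketch using `Series.str.extract` on a few sample names; the regular expression is an assumption of this sketch, not the pattern used above. It captures everything between the comma and the first period, so a multi-word title comes out whole.

```python
import pandas as pd

# A few names in the "Surname, Title. First names" layout of the dataset.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture everything between the comma and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())  # ['Mr', 'Miss', 'the Countess']
```

With this pattern the single `the` value seen in the counts would become `the Countess`; either way it is a one-off title that ends up among the rare ones.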


6. Display all the possible values and number of instances of each of these values in your dataset for the new _Title_ variable.

In [125]:
df["Title"].value_counts()

Title
Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Mlle          2
Major         2
Col           2
the           1
Capt          1
Ms            1
Sir           1
Lady          1
Mme           1
Don           1
Jonkheer      1
Name: count, dtype: int64

7. Some of these values cover only a handful of instances, and others seem to describe similar categories of people. Bring the similar categories under one name, and create a new category called _Rare_ that groups all the underrepresented modalities.

In [126]:
type(df["Title"].value_counts())


pandas.core.series.Series

In [128]:
# every title after the 4 most frequent ones (fewer than 40 occurrences) becomes "Rare"
val2replace = df["Title"].value_counts()[4:]
val2replace.index



Index(['Dr', 'Rev', 'Mlle', 'Major', 'Col', 'the', 'Capt', 'Ms', 'Sir', 'Lady',
       'Mme', 'Don', 'Jonkheer'],
      dtype='object', name='Title')

In [97]:
df['Title'] = df['Title'].apply(lambda x: "Rare" if x in val2replace.index else x)
df["Title"].value_counts()

Title
Mr        517
Miss      182
Mrs       125
Master     40
Rare       27
Name: count, dtype: int64
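
The cell above sends every infrequent title to _Rare_, including spellings such as `Mlle` or `Mme` that arguably belong with `Miss` and `Mrs`. The sketch below shows one way to merge similar titles first and then apply a frequency threshold. The `similar` mapping and the `min_count` value are assumptions made for illustration (in particular, folding `Ms` into `Miss` is a judgment call), and the sample series is a toy one.

```python
import pandas as pd

# Toy sample of titles (not the real distribution).
titles = pd.Series(["Mr", "Miss", "Mlle", "Mrs", "Mme", "Ms", "Dr", "Mr", "Miss", "Mrs"])

# Hypothetical mapping: alternative spellings folded into the common form.
similar = {"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"}
titles = titles.replace(similar)

# Titles seen fewer than min_count times become "Rare" (threshold is illustrative).
min_count = 2
counts = titles.value_counts()
rare = counts[counts < min_count].index
titles = titles.where(~titles.isin(rare), "Rare")

print(titles.value_counts().to_dict())  # {'Miss': 4, 'Mrs': 3, 'Mr': 2, 'Rare': 1}
```

Using an explicit threshold instead of a positional slice such as `[4:]` keeps the rule valid if the ranking of titles changes.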

8. Now that we are done squeezing some extra information out of our variables, let's reproduce all the subsequent steps from the first solution and let's compare our models' performances.

In [98]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,HasCabin,Title
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0,Mr
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1,Mrs
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0,Miss
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1,Mrs
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0,Mr


In [99]:
col2drop = ["PassengerId", "Name", "Ticket", "Cabin"]
df.drop(col2drop, axis=1, inplace=True)
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,HasCabin,Title
0,0,3,male,22.0,1,0,7.25,S,0,Mr
1,1,1,female,38.0,1,0,71.2833,C,1,Mrs
2,1,3,female,26.0,0,0,7.925,S,0,Miss
3,1,1,female,35.0,1,0,53.1,S,1,Mrs
4,0,3,male,35.0,0,0,8.05,S,0,Mr


In [100]:
target_name = "Survived"
y = df.loc[:, target_name]
X = df.drop(target_name, axis=1)  

display(y.head())
display(X.head())


0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,HasCabin,Title
0,3,male,22.0,1,0,7.25,S,0,Mr
1,1,female,38.0,1,0,71.2833,C,1,Mrs
2,3,female,26.0,0,0,7.925,S,0,Miss
3,1,female,35.0,1,0,53.1,S,1,Mrs
4,3,male,35.0,0,0,8.05,S,0,Mr


In [101]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)

In [102]:
# Our pipeline has 2 steps,
# given as a list of 2-element tuples: (step name, transformer)

numeric_features = ["Pclass", "Age", "SibSp", "Parch", "Fare", "HasCabin"]  

numeric_transformer = Pipeline(
    steps=[
        (
            "imputer_num",
            SimpleImputer(strategy="median"),     # less sensitive to extreme values than the mean
        ),  
        (
            "scaler", 
            StandardScaler()                      
        ),
    ]
)

In [103]:
# Create pipeline for categorical features
categorical_features = ["Sex", "Embarked", "Title"]            # Names of categorical columns in X_train/X_test
categorical_transformer = Pipeline(
    steps=[
        (
            "imputer_cat",
            SimpleImputer(strategy="most_frequent"),  # missing values will be replaced by most frequent value
        ),  
        (
            "encoder",
            OneHotEncoder(drop="first"),              # drop the first category => avoid redundant (collinear) columns
        ),  
    ]
)

In [104]:
feature_encoder = ColumnTransformer(
  transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features),    
  ]
)

In [105]:
X_train = feature_encoder.fit_transform(X_train)
print(X_train[0:5,:].round(3))

[[-1.601  2.624 -0.463 -0.466 -0.11  -0.536  1.     0.     1.     0.
   1.     0.     0.   ]
 [ 0.811 -0.665 -0.463 -0.466 -0.471 -0.536  1.     0.     1.     0.
   1.     0.     0.   ]
 [ 0.811 -0.053  0.432 -0.466 -0.477 -0.536  1.     1.     0.     0.
   1.     0.     0.   ]
 [ 0.811  0.788  0.432 -0.466 -0.442 -0.536  0.     0.     1.     0.
   0.     1.     0.   ]
 [-0.395  1.094  0.432 -0.466 -0.11  -0.536  1.     0.     1.     0.
   1.     0.     0.   ]]


In [106]:
X_test = feature_encoder.transform(X_test)  
print(X_test[0:5,:].round(3))

[[ 0.811 -0.053 -0.463 -0.466 -0.342 -0.536  1.     0.     0.     0.
   1.     0.     0.   ]
 [ 0.811 -0.053 -0.463 -0.466 -0.481 -0.536  1.     0.     1.     0.
   1.     0.     0.   ]
 [ 0.811 -1.736  3.117  0.781 -0.047 -0.536  1.     1.     0.     0.
   0.     0.     0.   ]
 [-1.601 -0.053  0.432 -0.466  2.318  1.865  0.     0.     0.     0.
   0.     1.     0.   ]
 [ 0.811 -0.053 -0.463  2.027 -0.326 -0.536  0.     0.     0.     0.
   0.     1.     0.   ]]


### Training model

In [107]:
model = LogisticRegression()
model.fit(X_train, y_train) 

### Predictions

In [108]:
y_train_pred = model.predict(X_train)
print(y_train_pred[0:5])

[0 0 0 1 0]


In [110]:
y_test_pred = model.predict(X_test)
print(y_test_pred[0:5])

[0 0 0 1 1]


### Performances evaluation

In [112]:
# Print scores
print("Accuracy on training set : ", accuracy_score(y_train, y_train_pred).round(3))
print("Accuracy on test set     : ", accuracy_score(y_test, y_test_pred).round(3))

Accuracy on training set :  0.827
Accuracy on test set     :  0.821
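
These scores come from a single train/test split, so part of any difference with the first solution may be due to chance. One way to get a steadier estimate is cross-validation with the preprocessing and the model chained in one `Pipeline`, so that imputation and scaling are refit on each training fold. The sketch below runs on synthetic data generated for the occasion (it is not the Titanic dataset); `X_demo`, `y_demo` and the 5-fold setting are assumptions of this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the features (NOT the real data).
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "Age": rng.normal(30, 14, n),
    "Fare": rng.exponential(30, n),
    "Sex": rng.choice(["male", "female"], n),
})
X_demo.loc[::10, "Age"] = np.nan  # inject some missing ages
# Target loosely tied to Sex, with 20% of labels flipped at random.
y_demo = ((X_demo["Sex"] == "female") ^ (rng.random(n) < 0.2)).astype(int)

preprocess = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), ["Age", "Fare"]),
    ("cat", OneHotEncoder(drop="first"), ["Sex"]),
])
full_model = Pipeline([("preprocess", preprocess),
                       ("model", LogisticRegression())])

# Each fold refits the preprocessing on its own training part.
scores = cross_val_score(full_model, X_demo, y_demo, cv=5, scoring="accuracy")
print("mean accuracy:", scores.mean().round(3), "std:", scores.std().round(3))
```

Comparing the mean and spread of such scores for both solutions gives a fairer picture than two single numbers.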


Tada 🥳 If all went well, the score has improved a bit compared with the first solution!
This example shows that adding a little extra information to a model can have a noticeable impact on its predictive performance. Knowing and applying the preprocessing guidelines is great, but always remember to check two important things before you proceed:

* If a variable contains missing values, ask yourself why the values are missing and whether the missingness itself could be used as information to feed the model. In the above example, we interpreted a missing cabin number as meaning that the passenger had no cabin. It is very common for missing values to carry hidden meaning; values that are missing completely at random (because of a bug or another unpredictable cause) are much rarer.

* When a non-numerical variable is not usable as is, always ask yourself whether you could still extract some information from it. Here the _Name_ variable cannot be used directly, but it mentions the passenger's title, which can be useful information.
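
The first point can also be handled inside scikit-learn: `SimpleImputer` accepts an `add_indicator=True` option that appends a 0/1 column flagging which values were missing before imputation. A minimal sketch on a toy array of ages (the values are illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One column of ages with a missing entry.
ages = np.array([[22.0], [np.nan], [35.0]])

# Median imputation plus an extra column marking the rows that were missing.
imputer = SimpleImputer(strategy="median", add_indicator=True)
out = imputer.fit_transform(ages)
print(out)
# [[22.   0. ]
#  [28.5  1. ]
#  [35.   0. ]]
```

This plays the same role for _Age_ as the hand-made _HasCabin_ flag does for _Cabin_; whether it helps the model is something to verify on the data.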