# Discovering the Titanic

The Titanic is a well-known ship, but did you know that it is also one of the most popular datasets in data science? Here's the link to the dataset:

<a href="https://www.kaggle.com/c/titanic/"> Titanic </a>

Machine learning is largely about statistical prediction and understanding data. The objective of this exercise is to predict whether a passenger survived the sinking of the Titanic, based on the information available about that passenger. The code that trains the model, makes predictions and evaluates its performance has already been written. You have to complete the upstream part: preparing the dataset before training the model (preprocessing).

1. Download the dataset _titanic.csv_.
2. Try to understand what's in this dataset.
    1. You will find all the explanations at this link: <a href="https://www.kaggle.com/c/titanic/data"> Titanic Data </a>

3. Place the file _titanic.csv_ in the same folder as this notebook and read it.

In [156]:
# prelude

import pandas as pd
from sklearn.model_selection import train_test_split

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score



In [157]:


df = pd.read_csv("../12_assets/05_supervised_ML/titanic.csv")
df.sample(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
117,118,0,2,"Turpin, Mr. William John Robert",male,29.0,1,0,11668,21.0,,S
723,724,0,2,"Hodges, Mr. Henry Price",male,50.0,0,0,250643,13.0,,S
584,585,0,3,"Paulner, Mr. Uscher",male,,0,0,3411,8.7125,,C
47,48,1,3,"O'Driscoll, Miss. Bridget",female,,0,0,14311,7.75,,Q
322,323,1,2,"Slayter, Miss. Hilda Mary",female,30.0,0,0,234818,12.35,,Q


4. Explore the dataset and determine which columns are useful for prediction and what preprocessing you will do.

In [158]:
df.shape

(891, 12)

In [159]:
df.describe(include="all")

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891.0,891.0,204,889
unique,,,,891,2,,,,681.0,,147,3
top,,,,"Braund, Mr. Owen Harris",male,,,,347082.0,,B96 B98,S
freq,,,,1,577,,,,7.0,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,


In [160]:
# percentage of missing values per column
100 * df.isnull().sum() / len(df)

PassengerId     0.000000
Survived        0.000000
Pclass          0.000000
Name            0.000000
Sex             0.000000
Age            19.865320
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.000000
Cabin          77.104377
Embarked        0.224467
dtype: float64

* We will drop PassengerId, Name, Ticket, Cabin
* target = Survived

## Preprocessing - pandas part
5. Use the pandas library to discard columns you won't use for prediction.

In this dataset, some categorical variables have too many distinct values (categories), so we should consider dropping them. As a rule of thumb, for a dataset with fewer than 1000 rows, we tend to reject categorical variables that have more than 15-20 possible values. Pay attention to the number of unique values in each column to decide which ones to keep.
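The check described above can be sketched as follows. This is a minimal illustration on a made-up toy frame, not the notebook's own code: the data, the variable names and the threshold of 3 are assumptions chosen so the result is visible on six rows (on the real dataset you would use a threshold around 15-20, as suggested above).

```python
import pandas as pd

# Toy frame standing in for the Titanic data (values are illustrative only)
toy = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "D", "E", "F"],
        "Sex": ["male", "female", "female", "male", "male", "female"],
        "Embarked": ["S", "C", "S", "Q", "S", "C"],
        "Age": [22.0, 38.0, 26.0, 35.0, None, 54.0],
    }
)

max_categories = 3  # hypothetical threshold for this tiny example

# Number of distinct values in each non-numeric column
cardinality = toy.select_dtypes(exclude="number").nunique()

# Columns with more distinct values than the threshold are candidates for removal
high_cardinality = cardinality[cardinality > max_categories].index.tolist()

print(cardinality)
print(high_cardinality)
```

Here only `Name` exceeds the threshold, which mirrors why columns such as Name, Ticket and Cabin are dropped in the next cell.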

In [161]:
col2drop = ["PassengerId", "Name", "Ticket", "Cabin"]
df.drop(col2drop, axis=1, inplace=True)
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


6. Separate the target variable (Y) from the explanatory variables (X)

In [162]:
target_name = "Survived"
y = df.loc[:, target_name]
X = df.drop(target_name, axis=1)  

display(y.head())
display(X.head())

# Alternative: select the feature columns explicitly
# features_list = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
# X = df.loc[:, features_list]
# y = df.loc[:, "Survived"]


0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,male,22.0,1,0,7.25,S
1,1,female,38.0,1,0,71.2833,C
2,3,female,26.0,0,0,7.925,S
3,1,female,35.0,1,0,53.1,S
4,3,male,35.0,0,0,8.05,S


## Preprocessing - scikit-learn part
7. Split your data into a train set and a test set; the test set should represent 15% of the available data.

In [163]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
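To see what a 15% split means for 891 rows, here is a small sketch on synthetic arrays. The arrays `X_toy` and `y_toy` are made up (342 positives out of 891, which matches the mean of Survived shown earlier), and `stratify=` is an optional extra that the cell above does not use; the scores reported in this notebook were obtained without it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the Titanic data
X_toy = np.arange(891).reshape(-1, 1)
y_toy = np.array([1] * 342 + [0] * 549)

# stratify keeps the class proportions similar in both subsets (optional)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.15, random_state=0, stratify=y_toy
)

print(X_tr.shape, X_te.shape)   # (757, 1) (134, 1)
print(y_tr.mean(), y_te.mean())
```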

8. Create the preprocessing pipeline for numeric columns

In [164]:
# Our pipeline has 2 steps,
# given as a list of 2-element tuples: (name, transformer)

numeric_features = ["Pclass", "Age", "SibSp", "Parch", "Fare"]  

numeric_transformer = Pipeline(
    steps=[
        (
            "imputer_num",
            SimpleImputer(strategy="median"),     # less sensitive to extreme values than the mean
        ),  
        (
            "scaler", 
            StandardScaler()                      
        ),
    ]
)
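To make the two steps concrete, here is a small sketch of the same kind of pipeline applied to a single made-up column. The values in `ages` are invented for illustration; the point is that the missing entry is filled with the median (35) rather than the mean (45, pulled up by the outlier 90), and that the scaled column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up column with one missing value and one extreme value
ages = np.array([[20.0], [30.0], [np.nan], [40.0], [90.0]])

pipe = Pipeline(
    steps=[
        ("imputer_num", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)
scaled = pipe.fit_transform(ages)

# Value learned by the imputer during fit
median_used = pipe.named_steps["imputer_num"].statistics_[0]
print(median_used)  # 35.0
print(scaled.ravel())
```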

9. Create the preprocessing pipeline for categorical columns

In [165]:
# Create pipeline for categorical features
categorical_features = ["Sex", "Embarked"]            # Names of categorical columns in X_train/X_test
categorical_transformer = Pipeline(
    steps=[
        (
            "imputer_cat",
            SimpleImputer(strategy="most_frequent"),  # missing values will be replaced by most frequent value
        ),  
        (
            "encoder",
            OneHotEncoder(drop="first"),              # drop first category => avoid perfectly collinear dummy columns
        ),  
    ]
)

10. Use the preprocessing pipelines of questions 8 and 9 to transform X_train and X_test

Reminder: you need to call `fit_transform()` on X_train and only `transform()` on X_test, so that X_test is transformed with the parameters (medians, means, standard deviations, categories) learned on X_train rather than re-estimated on the test data.
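A minimal sketch of this rule with a single scaler and made-up numbers: the mean and standard deviation are learned from the train array only, then reused as-is on the test array. The arrays `train` and `test` are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[20.0], [50.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the train statistics

print(scaler.mean_)         # [20.]
print(test_scaled.ravel())  # 20 maps to 0; 50 lies outside the train range
```

Calling `fit_transform()` on the test array instead would rescale it with its own mean and standard deviation, so the same raw value would no longer mean the same thing in train and test.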

In [166]:
feature_encoder = ColumnTransformer(
  transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features),    
  ]
)

In [167]:
X_train = feature_encoder.fit_transform(X_train)
print(X_train[0:5,:].round(3))

[[-1.601  2.624 -0.463 -0.466 -0.11   1.     0.     1.   ]
 [ 0.811 -0.665 -0.463 -0.466 -0.471  1.     0.     1.   ]
 [ 0.811 -0.053  0.432 -0.466 -0.477  1.     1.     0.   ]
 [ 0.811  0.788  0.432 -0.466 -0.442  0.     0.     1.   ]
 [-0.395  1.094  0.432 -0.466 -0.11   1.     0.     1.   ]]


In [168]:
X_test = feature_encoder.transform(X_test)  
print(X_test[0:5,:].round(3))

[[ 0.811 -0.053 -0.463 -0.466 -0.342  1.     0.     0.   ]
 [ 0.811 -0.053 -0.463 -0.466 -0.481  1.     0.     1.   ]
 [ 0.811 -1.736  3.117  0.781 -0.047  1.     1.     0.   ]
 [-1.601 -0.053  0.432 -0.466  2.318  0.     0.     0.   ]
 [ 0.811 -0.053 -0.463  2.027 -0.326  0.     0.     0.   ]]


### Training model

In [169]:
model = LogisticRegression()
model.fit(X_train, y_train) 

### Predictions

In [170]:
y_train_pred = model.predict(X_train)
print(y_train_pred[0:5])

[0 0 0 0 0]


In [171]:
y_test_pred = model.predict(X_test)
print(y_test_pred[0:5])

[0 0 0 1 1]


### Performance evaluation

In [174]:
# Print scores
print("Accuracy on training set : ", accuracy_score(y_train, y_train_pred).round(3))
print("Accuracy on test set     : ", accuracy_score(y_test, y_test_pred).round(3))

Accuracy on training set :  0.803
Accuracy on test set     :  0.791


If you get a score close to 0.79 on the test set, it means that you carried out all the preprocessing steps with a sound methodology! :-)
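To put this score in context, it can help to compare it with a naive baseline that always predicts the majority class. The sketch below is an optional addition, not part of the exercise: it uses synthetic labels with 342 survivors out of 891 (consistent with the mean of Survived shown earlier) and scikit-learn's `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels with the same class balance as the full dataset
y_toy = np.array([1] * 342 + [0] * 549)
X_dummy = np.zeros((len(y_toy), 1))  # features are ignored by this baseline

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, y_toy)

baseline_accuracy = accuracy_score(y_toy, baseline.predict(X_dummy))
print(round(baseline_accuracy, 3))  # 0.616
```

Always predicting "did not survive" already yields about 0.62 on the full dataset, so the logistic regression's 0.79 is a real improvement over guessing.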