<img src="https://cellstrat2.s3.amazonaws.com/PlatformAssets/bluewhitelogo.svg" alt="drawing" width="200"/>

# ML Tuesdays - Session 1
## Titanic Survival Classification Exercise

### General Guidelines
1. The notebook has been split into multiple steps with fine-grained instructions for each step. Use the instructions for each code cell to complete the code.
2. Some hints name the function to use but leave the arguments for you to work out. Use the function's docstring, which you can pull up with `Shift+Tab`.

### About the Titanic Dataset
The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

Here's a description of the columns in the dataset:

| Variable | Definition                                 | Key                                            |
| -------- | ------------------------------------------ | ---------------------------------------------- |
| survival | Survival                                   | 0 = No, 1 = Yes                                |
| pclass   | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| sex      | Sex                                        |                                                |
| Age      | Age in years                               |                                                |
| sibsp    | # of siblings / spouses aboard the Titanic |                                                |
| parch    | # of parents / children aboard the Titanic |                                                |
| ticket   | Ticket number                              |                                                |
| fare     | Passenger fare                             |                                                |
| cabin    | Cabin number                               |                                                |
| embarked | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton |

[Dataset Source](https://www.kaggle.com/c/titanic/overview)

In [187]:
import pandas as pd
import numpy as np

## Data Preprocessing

1. Load Data
2. Remove Irrelevant Columns
3. Handle Missing Values
4. Split into X and y Data
5. Encode the Categorical Variables
6. Perform Train Test Split
7. Feature Scaling

### Load Data
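Read the dataset from `titanic.csv`. You can pass `index_col="PassengerId"` to use the passenger ID as the index; the cells below keep it as a regular column and drop it in the next step.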

In [143]:
dataset = pd.read_csv("titanic.csv")
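For reference, here is a minimal sketch of what `index_col` does. It reads a tiny made-up CSV from an in-memory string rather than `titanic.csv`, so the values are illustrative only.

```python
import io

import pandas as pd

# Tiny made-up CSV standing in for the real file (illustrative values only).
csv_text = "PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n"

# index_col turns the named column into the row index instead of a feature column.
df = pd.read_csv(io.StringIO(csv_text), index_col="PassengerId")

print(df.index.name)     # PassengerId
print(list(df.columns))  # ['Survived', 'Pclass']
```

With the index set this way, `PassengerId` no longer appears among the columns, so it would not need to be dropped later.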

In [144]:
dataset

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


### Remove Irrelevant Columns
Look for columns that are either subjective or not relevant to predicting survival, then remove them with the `drop()` function.

In [145]:
dataset.drop(columns=['PassengerId', 'Name', 'Ticket', 'Fare', 'Cabin'], inplace=True)

### Handle Missing Data

Check the number of NaN values

In [146]:
dataset.isna().sum()

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Embarked      2
dtype: int64

Drop the rows that contain NaNs in any column where only a handful of values are missing. Use the `subset` argument to list the columns that should be checked for NaNs.

In [147]:
dataset.dropna(subset=['Embarked'], inplace=True)
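As a small illustration of how `subset` narrows the check, the sketch below uses a made-up four-row frame. Only rows with a missing `Embarked` value are removed, while missing ages are left in place.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values and gaps in two columns.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Embarked": ["S", "C", np.nan, "Q"],
})

# Only NaNs in "Embarked" trigger a drop; NaNs in "Age" are ignored here.
cleaned = toy.dropna(subset=["Embarked"])

print(cleaned.shape)                # (3, 2)
print(cleaned["Age"].isna().sum())  # 2
```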

In [148]:
dataset.isna().sum()

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Embarked      0
dtype: int64

In [149]:
from sklearn.impute import SimpleImputer

In [150]:
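# Replace the remaining missing ages with the mean of the observed ages
age_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit_transform expects a 2D array, so reshape to one column and flatten the result
dataset['Age'] = age_imputer.fit_transform(dataset['Age'].values.reshape(-1, 1)).ravel()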
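To see what mean imputation does in isolation, here is a minimal sketch on a made-up column of four ages. The imputer learns the mean of the observed values and writes it into every gap.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up ages; two of them are missing.
ages = np.array([[20.0], [np.nan], [40.0], [np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(ages)

# statistics_ holds the learned fill value per column (the mean of 20 and 40).
print(imputer.statistics_)
print(filled.ravel())
```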

In [151]:
dataset.isna().sum()

Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Embarked    0
dtype: int64

### Split into X and y Data
Separate the features from the labels into two variables, `X_data` and `y_data`.

In [152]:
# Copy the feature columns so later edits to X_data do not touch `dataset`
X_data = dataset.iloc[:, 1:].copy()
y_data = dataset.iloc[:, 0]

In [153]:
X_data.shape

(889, 6)

In [154]:
y_data.shape

(889,)

### Encode Categorical Features

Label-encode the binary features and one-hot encode the features that have more than two categories.

In [155]:
from sklearn.preprocessing import LabelEncoder

In [156]:
X_data

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Embarked
0,3,male,22.000000,1,0,S
1,1,female,38.000000,1,0,C
2,3,female,26.000000,0,0,S
3,1,female,35.000000,1,0,S
4,3,male,35.000000,0,0,S
...,...,...,...,...,...,...
886,2,male,27.000000,0,0,S
887,1,female,19.000000,0,0,S
888,3,female,29.642093,1,2,S
889,1,male,26.000000,0,0,C


In [157]:
gender_encoder = LabelEncoder()
embarked_encoder = LabelEncoder()

In [158]:
# Assign by column name so the column takes the integer dtype of the encoded values
X_data['Sex'] = gender_encoder.fit_transform(X_data['Sex'])

In [159]:
X_data['Embarked'] = embarked_encoder.fit_transform(X_data['Embarked'])
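As a quick illustration of the mapping `LabelEncoder` produces, the sketch below encodes a made-up list of labels. Classes are sorted before integer codes are assigned, so `female` becomes 0 and `male` becomes 1.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["male", "female", "female", "male"])

# classes_ lists the labels in sorted order; a label's position is its code.
print(encoder.classes_.tolist())  # ['female', 'male']
print(codes.tolist())             # [1, 0, 0, 1]

# inverse_transform maps codes back to the original labels.
print(encoder.inverse_transform([0, 1]).tolist())  # ['female', 'male']
```

The same sorting explains why the ports `C`, `Q`, and `S` are encoded as 0, 1, and 2.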

OneHotEncode the multi-class categorical variables

In [160]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [161]:
X_data['Embarked'].unique()

array([2, 0, 1])

In [162]:
X_data['Pclass'].unique()

array([3, 1, 2])

In [163]:
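# One-hot encode Pclass (column 0) and Embarked (column 5); pass the other columns through unchanged
col_transformer = ColumnTransformer([('encoder', OneHotEncoder(), [0, 5])], remainder='passthrough')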

In [164]:
X_data = col_transformer.fit_transform(X_data)
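The column order of the transformed array matters for the later scaling step. The sketch below, on a made-up three-row frame, shows that the one-hot columns come first and the passthrough columns follow in their original order. The `sparse_threshold=0` argument is an addition for this sketch so that the result is always a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows; Sex and Embarked are already label-encoded.
toy = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": [1, 0, 0],
    "Age": [22.0, 38.0, 26.0],
    "Embarked": [2, 0, 1],
})

# Encode columns 0 (Pclass) and 3 (Embarked); pass Sex and Age through.
transformer = ColumnTransformer(
    [("encoder", OneHotEncoder(), [0, 3])],
    remainder="passthrough",
    sparse_threshold=0,  # sketch-only choice: always return a dense array
)
out = transformer.fit_transform(toy)

# 3 Pclass columns + 3 Embarked columns + Sex + Age = 8 columns.
print(out.shape)  # (3, 8)
print(out[0])     # first row: Pclass=3, Embarked=2, Sex=1, Age=22
```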

In [165]:
print(X_data)

[[ 0.         0.         1.        ... 22.         1.         0.       ]
 [ 1.         0.         0.        ... 38.         1.         0.       ]
 [ 0.         0.         1.        ... 26.         0.         0.       ]
 ...
 [ 0.         0.         1.        ... 29.6420927  1.         2.       ]
 [ 1.         0.         0.        ... 26.         0.         0.       ]
 [ 0.         0.         1.        ... 32.         0.         0.       ]]


### Train Test Split

In [166]:
from sklearn.model_selection import train_test_split

In [167]:
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2, random_state=0)
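As a sanity check on the split sizes, the sketch below runs the same call on placeholder arrays with 889 rows (the row count of `X_data` above). The feature values themselves are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as X_data (889).
X = np.arange(889 * 2).reshape(889, 2)
y = np.arange(889) % 2

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# 20% of 889 is 177.8, which is rounded up to 178 test rows.
print(X_tr.shape, X_te.shape)  # (711, 2) (178, 2)
```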

### Feature Scaling
Standard-scale the features that are continuous in nature and require scaling.

In [127]:
from sklearn.preprocessing import StandardScaler

In [171]:
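# After one-hot encoding, Age is the third column from the end
age_scaler = StandardScaler()
# Fit the scaler on the training split only, then reuse its statistics on the test split
X_train[:, -3] = age_scaler.fit_transform(X_train[:, -3].reshape(-1, 1)).reshape(-1)
X_test[:, -3] = age_scaler.transform(X_test[:, -3].reshape(-1, 1)).reshape(-1)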
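The fit-on-train, transform-on-test pattern can be sketched on a few made-up ages. The scaler learns the mean and standard deviation from the training values only, so the scaled test values are expressed relative to the training distribution.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up ages for illustration.
train_age = np.array([[20.0], [30.0], [40.0]])
test_age = np.array([[30.0], [50.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_age)  # learns mean and std from train
test_scaled = scaler.transform(test_age)        # reuses the training statistics

print(scaler.mean_)          # training mean of the ages
print(train_scaled.ravel())  # zero mean, unit variance
print(test_scaled.ravel())   # 30 maps to 0 because it equals the training mean
```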

## Train and Evaluate
1. Train a LogisticRegression Model
2. Evaluate on the Test split and report the classification report

In [194]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

Build and Fit the model

In [195]:
model = LogisticRegression()

In [196]:
model.fit(X_train, y_train)

LogisticRegression()

Make predictions on both train and test split

In [197]:
train_preds = model.predict(X_train)
test_preds = model.predict(X_test)

Check the train and test accuracy

In [198]:
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)
print('Train Acc:', train_acc)
print('Test Acc:', test_acc)

Train Acc: 0.8241912798874824
Test Acc: 0.7191011235955056


Print the classification report for the test split, which details precision, recall, and F1-score.

In [182]:
test_report = classification_report(y_test, test_preds)
print(test_report)

              precision    recall  f1-score   support

           0       0.76      0.84      0.80       105
           1       0.73      0.62      0.67        73

    accuracy                           0.75       178
   macro avg       0.74      0.73      0.73       178
weighted avg       0.75      0.75      0.74       178



In [184]:
dataset

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Embarked
0,0,3,male,22.000000,1,0,S
1,1,1,female,38.000000,1,0,C
2,1,3,female,26.000000,0,0,S
3,1,1,female,35.000000,1,0,S
4,0,3,male,35.000000,0,0,S
...,...,...,...,...,...,...,...
886,0,2,male,27.000000,0,0,S
887,1,1,female,19.000000,0,0,S
888,0,3,female,29.642093,1,2,S
889,1,1,male,26.000000,0,0,C


### Hurray! You have successfully completed the ML exercise for the first ML Tuesdays session.