<img src="https://cellstrat2.s3.amazonaws.com/PlatformAssets/bluewhitelogo.svg" alt="drawing" width="200"/>

# ML Tuesdays - Session 1
## Titanic Survival Classification Exercise

### General Guidelines
1. The notebook has been split into multiple steps with fine-grained instructions for each step. Use the instructions for each code cell to complete the code.
2. Some hints name the function to use but leave the arguments for you to work out. In those cases, consult the function's docstring, which you can pull up with `Shift+Tab`.

### About the Titanic Dataset
The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).

Here's a description of the columns in the dataset:

| Variable | Definition                                 | Key                                            |
| -------- | ------------------------------------------ | ---------------------------------------------- |
| survival | Survival                                   | 0 = No, 1 = Yes                                |
| pclass   | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| sex      | Sex                                        |                                                |
| Age      | Age in years                               |                                                |
| sibsp    | # of siblings / spouses aboard the Titanic |                                                |
| parch    | # of parents / children aboard the Titanic |                                                |
| ticket   | Ticket number                              |                                                |
| fare     | Passenger fare                             |                                                |
| cabin    | Cabin number                               |                                                |
| embarked | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton |

[Dataset Source](https://www.kaggle.com/c/titanic/overview)

In [187]:
import pandas as pd
import numpy as np

## Data Preprocessing

1. Load Data
2. Remove Irrelevant Columns
3. Handle Missing Data
4. Split into X and y Data
5. Encode the Categorical Variables
6. Perform Train Test Split
7. Feature Scaling

### Load Data
Read the dataset with `index_col` as "PassengerId"
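
As a hedged illustration of what `index_col` does, the sketch below reads a made-up three-row CSV from memory rather than the real file. The CSV text and the variable name `toy_df` are assumptions for demonstration only; in the exercise you would pass the path of the downloaded dataset instead.

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for the real dataset file.
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,26
"""

# index_col turns PassengerId into the row index instead of a regular column.
toy_df = pd.read_csv(io.StringIO(csv_text), index_col="PassengerId")
print(toy_df.head())
```

Because `PassengerId` becomes the index, it no longer appears among the feature columns, which keeps an arbitrary identifier out of the model inputs.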

### Remove Irrelevant Columns
Look for columns that are either subjective or not relevant for predicting survival, then remove them using the `drop()` function.
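
The call pattern for `drop()` might look like the sketch below. The frame and the column names `note` and `badge_id` are invented for illustration; deciding which Titanic columns to remove is still part of the exercise.

```python
import pandas as pd

# Toy frame with two hypothetical columns that carry no predictive signal.
toy_df = pd.DataFrame(
    {
        "note": ["a", "b"],
        "badge_id": ["X1", "X2"],
        "Age": [22.0, 38.0],
        "Survived": [0, 1],
    }
)

# drop() returns a new DataFrame; the original is left untouched
# unless you reassign the result.
trimmed = toy_df.drop(columns=["note", "badge_id"])
print(trimmed.columns.tolist())
```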

### Handle Missing Data

Check the number of NaN values

For columns that contain only a handful of NaNs, drop the affected rows. Use the `subset` argument to pass the list of columns that should be checked for NaNs when dropping rows.

In [149]:
from sklearn.impute import SimpleImputer
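
A minimal sketch of the three steps (count, drop, impute) on a made-up frame follows. The toy values, the choice of which column to drop rows for, and the `"mean"` strategy are all assumptions; the right choices for the Titanic data are left to you.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame: "Age" has several NaNs, "Embarked" has only one.
toy_df = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 26.0, np.nan, 35.0],
        "Embarked": ["S", "C", None, "Q", "S"],
    }
)

# Count the NaNs per column.
nan_counts = toy_df.isna().sum()
print(nan_counts)

# Only one row lacks "Embarked", so dropping it loses little data.
cleaned = toy_df.dropna(subset=["Embarked"]).copy()

# "Age" is missing too often to drop, so fill it with the column mean instead.
imputer = SimpleImputer(strategy="mean")
cleaned[["Age"]] = imputer.fit_transform(cleaned[["Age"]])
print(cleaned)
```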

### Split into X and y Data
Separate the features from the labels into two variables, `X_data` and `y_data`.
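
One possible way to do the split is sketched below on a toy frame. It assumes the label column is called `Survived`, as in the Kaggle CSV; adjust the name if your frame differs.

```python
import pandas as pd

# Toy frame; "Survived" is assumed to be the label column.
toy_df = pd.DataFrame(
    {
        "Pclass": [3, 1, 3],
        "Age": [22.0, 38.0, 26.0],
        "Survived": [0, 1, 1],
    }
)

# Features: everything except the label column.
X_data = toy_df.drop(columns=["Survived"])
# Labels: the label column on its own, as a Series.
y_data = toy_df["Survived"]

print(X_data.shape, y_data.shape)
```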

### Encode Categorical Features

LabelEncode the binary features and OneHotEncode the ones that have multiple categories

In [155]:
from sklearn.preprocessing import LabelEncoder
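
As a rough sketch, label encoding a binary column could look like this. The toy frame `X_toy` is invented; `LabelEncoder` assigns integers in sorted order of the class names.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with a single binary categorical column.
X_toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

encoder = LabelEncoder()
# fit_transform learns the classes and maps each one to an integer.
X_toy["Sex"] = encoder.fit_transform(X_toy["Sex"])

# classes_ records which string each integer stands for (index = code).
print(list(encoder.classes_))
print(X_toy["Sex"].tolist())
```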

OneHotEncode the multi-class categorical variables

In [160]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
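
The sketch below shows one way to combine `OneHotEncoder` with `ColumnTransformer` so that only the chosen column is encoded and the rest pass through unchanged. The toy data, the transformer name `embarked_onehot`, and the use of `sparse_threshold=0` (to force a dense array that is easy to inspect) are assumptions for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame: one multi-class categorical column and one numeric column.
X_toy = pd.DataFrame(
    {
        "Embarked": ["S", "C", "Q", "S"],
        "Fare": [7.25, 71.28, 8.05, 53.10],
    }
)

transformer = ColumnTransformer(
    transformers=[("embarked_onehot", OneHotEncoder(), ["Embarked"])],
    remainder="passthrough",  # keep the untouched columns instead of dropping them
    sparse_threshold=0,       # always return a dense NumPy array
)

# Encoded columns come first (C, Q, S in sorted order), then the passthrough columns.
X_encoded = np.asarray(transformer.fit_transform(X_toy))
print(X_encoded)
```

Note that the output is a plain array, so the original column names are lost; `transformer.get_feature_names_out()` can recover them if needed.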

### Train Test Split

In [166]:
from sklearn.model_selection import train_test_split
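
A minimal sketch of the split on made-up arrays is shown below. The 80/20 ratio and `random_state=42` are assumptions chosen for illustration, not values prescribed by the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each, and alternating binary labels.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# test_size=0.2 holds out 20% of the rows; random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```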

### Feature Scaling
Standard Scale the features which are continuous in nature and require scaling

In [127]:
from sklearn.preprocessing import StandardScaler
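
The sketch below scales only the continuous columns of a toy train/test pair. The frames and the list `continuous_cols` are hypothetical. The scaler is fitted on the training split alone and then reused on the test split, so no information from the test data leaks into the preprocessing.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy splits: "Age" and "Fare" are continuous, "Sex" is already encoded as 0/1.
X_train = pd.DataFrame(
    {
        "Age": [22.0, 38.0, 26.0, 35.0],
        "Fare": [7.25, 71.28, 7.92, 53.10],
        "Sex": [1, 0, 0, 1],
    }
)
X_test = pd.DataFrame({"Age": [30.0], "Fare": [8.05], "Sex": [1]})

continuous_cols = ["Age", "Fare"]  # hypothetical choice of columns to scale

scaler = StandardScaler()
# Learn mean and standard deviation from the training split only.
X_train[continuous_cols] = scaler.fit_transform(X_train[continuous_cols])
# Apply the same training statistics to the test split.
X_test[continuous_cols] = scaler.transform(X_test[continuous_cols])

print(X_train)
print(X_test)
```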

## Train and Evaluate
1. Train a LogisticRegression Model
2. Evaluate on the test split and print the classification report

In [178]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

Build and Fit the model

Make predictions on both train and test split

Check the train and test accuracy

Print out the test classification report with details of precision, recall and F1 score
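
The four steps above can be sketched end to end on synthetic data. Everything about the data here is an assumption: `make_classification` generates an artificial binary problem, so the scores it produces say nothing about the Titanic dataset. Default `LogisticRegression` settings are used for simplicity.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed Titanic features and labels.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

# Build and fit the model.
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict on both splits.
train_pred = model.predict(X_train)
test_pred = model.predict(X_test)

# Comparing train and test accuracy gives a quick check for overfitting.
train_acc = accuracy_score(y_train, train_pred)
test_acc = accuracy_score(y_test, test_pred)
print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")

# Per-class precision, recall and F1 on the test split.
report = classification_report(y_test, test_pred)
print(report)
```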

### Hurray! You have successfully completed the ML Exercise in the first ML Tuesday (Monday)