# Lab 11

## [Titanic - Machine Learning from Disaster](https://www.kaggle.com/c/titanic)

This is the legendary Titanic ML competition – the best, first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works. The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

![](./images/var.png)

## Importing Necessary Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

from sklearn.model_selection import train_test_split, cross_val_score, RepeatedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn import metrics
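# Note: plot_confusion_matrix, plot_roc_curve and plot_precision_recall_curve were
# removed in scikit-learn 1.2; newer releases provide ConfusionMatrixDisplay,
# RocCurveDisplay and PrecisionRecallDisplay instead.
from sklearn.metrics import (
    confusion_matrix, plot_confusion_matrix, accuracy_score, precision_score,
    recall_score, f1_score, plot_roc_curve, plot_precision_recall_curve, precision_recall_curve,
    roc_auc_score
)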

## Load and Check Data

In [None]:
train = pd.read_csv("./data/titanic/train.csv")
test = pd.read_csv("./data/titanic/test.csv")

In [None]:
print(train.shape)
print(test.shape)

In [None]:
train.sample(5)

In [None]:
X = train.drop(['Survived'], axis=1)
Y = train.Survived

In [None]:
print(X.shape)
print(Y.shape)

In [None]:
X.sample(5)

In [None]:
test.sample(5)

## Check for Missing Values

In [None]:
# Total missing values in each column of the training set
train.isnull().sum(axis=0)

In [None]:
# Total missing values in each column of the test set
test.isnull().sum(axis=0)
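
Raw counts are hard to compare between the training and test sets because the two have different sizes. A minimal sketch of the percentage version follows; the `toy` frame is a hypothetical stand-in for the Titanic data, and the same expression should work on `train` or `test`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

# isnull() gives a boolean frame; the mean of each column is the fraction missing
missing_pct = toy.isnull().mean(axis=0) * 100
print(missing_pct.sort_values(ascending=False))
```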

## Preprocessing Pipeline

In [None]:
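# Step 1: drop unused columns and impute missing values.
# Output column order: Age, Fare, Embarked, then the passthrough columns
# Pclass, Sex, SibSp, Parch (in their original order).
pipeline1 = ColumnTransformer([
    ('drop', 'drop', ['PassengerId', 'Name', 'Cabin', 'Ticket']),
    ('ageimputer', IterativeImputer(max_iter=10, random_state=27), ['Age', 'Fare']),
    ('embarkedimputer', SimpleImputer(strategy='most_frequent'), ['Embarked'])],
    remainder='passthrough'
)

# Step 2: the input is now a plain array, so columns are addressed by their
# position in the output of step 1.
pipeline2 = ColumnTransformer([
    ('scaler', MinMaxScaler(), [0, 1, 5, 6]),  # Age, Fare, SibSp, Parch
    ('onehot', OneHotEncoder(), [2, 3, 4])     # Embarked, Pclass, Sex
])

pipeline = make_pipeline(pipeline1, pipeline2)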
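
The integer indices in `pipeline2` depend on the column order that `pipeline1` produces: transformer outputs come first, in the order listed, followed by the passthrough columns. The sketch below checks that order on a hypothetical three-row frame. It swaps in `SimpleImputer(strategy="mean")` for `IterativeImputer` to keep the example light, which does not affect the column order.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical rows with the same columns as the Titanic training features
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Pclass": [3, 1, 2],
    "Name": ["a", "b", "c"],
    "Sex": ["male", "female", "male"],
    "Age": [22.0, 38.0, 26.0],
    "SibSp": [1, 0, 0],
    "Parch": [0, 2, 0],
    "Ticket": ["t1", "t2", "t3"],
    "Fare": [7.25, 71.28, 8.05],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", "C", "S"],
})

step1 = ColumnTransformer([
    ("drop", "drop", ["PassengerId", "Name", "Cabin", "Ticket"]),
    ("num", SimpleImputer(strategy="mean"), ["Age", "Fare"]),  # stand-in imputer
    ("cat", SimpleImputer(strategy="most_frequent"), ["Embarked"]),
], remainder="passthrough")

out = step1.fit_transform(toy)

# Positions 0..6 hold: Age, Fare, Embarked, Pclass, Sex, SibSp, Parch
first_row = list(out[0])
print(first_row)
```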

In [None]:
pipeline.fit(X)
X_train = pipeline.transform(X)
X_test = pipeline.transform(test)

In [None]:
pd.DataFrame(X_train).sample(5)

In [None]:
pd.DataFrame(X_test).sample(5)

## Modeling

In [None]:
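# Unregularized logistic regression plus three SGD-trained variants with
# L2 (ridge), L1 (lasso) and elastic-net penalties.
# Note: newer scikit-learn releases spell these options penalty=None and loss='log_loss'.
lr = LogisticRegression(penalty='none', verbose=2)
rr = SGDClassifier(penalty='l2', n_jobs=4, loss='log')
ls = SGDClassifier(penalty='l1', n_jobs=4, loss='log')
en = SGDClassifier(penalty='elasticnet', n_jobs=4, loss='log')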

In [None]:
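# Names accepted by the `scoring=` argument
# (newer scikit-learn releases: metrics.get_scorer_names())
metrics.SCORERS.keys()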

In [None]:
cv = RepeatedKFold(n_splits=5, n_repeats=1, random_state=27)

grid_ridge_lasso = {
    'alpha': np.arange(0, 1, 0.05),
    'learning_rate': np.array(['constant']),
    'eta0': np.array([0.1, 0.01, 0.001])
}

grid_elastic = {
    'alpha': np.arange(0, 1, 0.05),
    'l1_ratio': np.arange(0, 1, 0.05),
    'learning_rate': np.array(['constant']),
    'eta0': np.array([0.1, 0.01, 0.001])
}

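# Plain logistic regression: cross_val_score returns one accuracy value per fold
lr_score = cross_val_score(lr, X_train, Y, cv=cv, scoring='accuracy')

# GridSearchCV.fit returns the fitted search object itself, so each *_score
# variable below exposes best_score_, best_params_ and best_estimator_
rr_search = GridSearchCV(rr, grid_ridge_lasso, cv=cv, scoring='accuracy')
rr_score = rr_search.fit(X_train, Y)

ls_search = GridSearchCV(ls, grid_ridge_lasso, cv=cv, scoring='accuracy')
ls_score = ls_search.fit(X_train, Y)

en_search = GridSearchCV(en, grid_elastic, cv=cv, scoring='accuracy')
en_score = en_search.fit(X_train, Y)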

In [None]:
print(np.mean(lr_score))
print(rr_score.best_score_)
print(ls_score.best_score_)
print(en_score.best_score_)
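
Besides `best_score_`, a fitted search also records which hyperparameter combination won. The following is a minimal sketch on hypothetical synthetic data, not the Titanic features; it uses `LogisticRegression` with a small `C` grid as a stand-in for the `SGDClassifier` grids above, so the numbers it prints say nothing about the lab's models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedKFold

# Hypothetical synthetic data standing in for X_train and Y
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=27)

cv_demo = RepeatedKFold(n_splits=5, n_repeats=1, random_state=27)
grid_demo = {"C": [0.01, 0.1, 1.0, 10.0]}  # inverse regularization strength

search = GridSearchCV(
    LogisticRegression(max_iter=1000), grid_demo, cv=cv_demo, scoring="accuracy"
)
search.fit(X_demo, y_demo)

print(search.best_params_)  # the winning combination from grid_demo
print(search.best_score_)   # mean cross-validated accuracy of that combination
```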

In [None]:
predictions = en_score.best_estimator_.predict(X_test)

In [None]:
pd.DataFrame({
    'PassengerId': test.PassengerId,
    'Survived': predictions
}).to_csv('./output/submission.csv', index=False)

## More Metrics

### Confusion Matrix

![](./images/cfm.png)

In [None]:
predictions = en_score.best_estimator_.predict(X_train)
predictions_proba = en_score.best_estimator_.predict_proba(X_train)
# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions
confusion_matrix(Y, predictions)

In [None]:
plot_confusion_matrix(en_score.best_estimator_, X_train, Y)

**Suppose your job was to classify images as cat or non-cat. You were given 250 images of cats and 50 images of non-cats. You finalized a model and evaluated its performance on the test set. Your classifier correctly classified 200 of the 250 cats, but only 5 of the 50 non-cat images. Can you create a confusion matrix from these numbers?**

In [None]:
cfm = np.array([
    [200, 50],
    [45, 5]
])

cfm = pd.DataFrame(cfm,
                   index=['Originally Cat', 'Originally Non-Cat'],
                   columns=['Predicted Cat', 'Predicted Non-Cat'])

In [None]:
# fmt='d' prints the counts as integers instead of the default '2e+02' style
sns.heatmap(cfm, annot=True, fmt='d')
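
The same matrix can be recovered with `confusion_matrix` by rebuilding label vectors from the four counts. This is a sketch under an assumed encoding (1 = cat, 0 = non-cat). Note that scikit-learn sorts labels in ascending order by default, so `labels=[1, 0]` is passed to put the cat class first, as in the table above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Assumed encoding: 1 = cat, 0 = non-cat; counts come from the matrix above
y_true = np.array([1] * 250 + [0] * 50)
y_pred = np.array([1] * 200 + [0] * 50 + [1] * 45 + [0] * 5)

# Arguments are (y_true, y_pred); rows are true classes, columns are predictions
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm)
```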

### Classification Metrics

![](./images/metrics.gif)

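### Accuracy

The accuracy is the ratio `(tp + tn) / (tp + tn + fp + fn)`, taking cat as the positive class. It is the fraction of all samples that the classifier labels correctly.

In [None]:
205 / 300  # (200 + 5) correct predictions out of 300 images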

### Precision

The precision is the ratio `tp / (tp + fp)` where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

In [None]:
200 / 245  # tp / (tp + fp) = 200 / (200 + 45)

### Sensitivity / Recall

The recall is the ratio `tp / (tp + fn)` where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

In [None]:
200 / 250  # tp / (tp + fn) = 200 / (200 + 50)

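### Specificity

The specificity is the ratio `tn / (tn + fp)` where tn is the number of true negatives and fp the number of false positives. The specificity is intuitively the ability of the classifier to find all the negative samples.

In [None]:
5 / 50  # tn / (tn + fp) = 5 / (5 + 45)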

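### F1 Score

The F1 score is the harmonic mean of precision and recall, `2 * (precision * recall) / (precision + recall)`. It is high only when both precision and recall are high.

In [None]:
# precision = 200 / 245, recall = 200 / 250
(2 * (200 / 245) * (200 / 250)) / ((200 / 250) + (200 / 245))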
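
The hand calculations above can be cross-checked against scikit-learn by rebuilding label vectors from the confusion-matrix counts. This is a sketch assuming the encoding 1 = cat (positive) and 0 = non-cat. Because scikit-learn has no dedicated specificity function, it is computed here as the recall of the negative class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Assumed encoding: 1 = cat (positive class), 0 = non-cat
y_true = np.array([1] * 250 + [0] * 50)
y_pred = np.array([1] * 200 + [0] * 50 + [1] * 45 + [0] * 5)

accuracy = accuracy_score(y_true, y_pred)                # 205 / 300
precision = precision_score(y_true, y_pred)              # 200 / 245
recall = recall_score(y_true, y_pred)                    # 200 / 250
specificity = recall_score(y_true, y_pred, pos_label=0)  # 5 / 50
f1 = f1_score(y_true, y_pred)                            # harmonic mean of the two

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("specificity", specificity), ("f1", f1)]:
    print(f"{name}: {value:.3f}")
```

Accuracy looks respectable here even though specificity is very low, which is why accuracy alone can mislead on imbalanced data.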

### AUC ROC Curve

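An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. The curve plots the true positive rate (recall) against the false positive rate (1 - specificity). The AUC is the area under this curve: 1.0 means every positive is ranked above every negative, and 0.5 corresponds to random guessing.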

In [None]:
plot_roc_curve(en_score.best_estimator_, X_train, Y)
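
To see where the points on the curve come from, the sketch below calls `roc_curve` on four hypothetical scores. Each threshold yields one (false positive rate, true positive rate) pair, and `roc_auc_score` summarizes the curve as a single number.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical labels and predicted scores for four samples
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# One (fpr, tpr) point per threshold; a sample is predicted positive
# when its score is at or above the threshold
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

print(fpr)
print(tpr)
print(auc)
```

Three of the four (positive, negative) pairs are ranked correctly here, which is exactly what the AUC measures.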

### Precision-Recall Curve

The precision-recall curve shows the tradeoff between precision and recall at different thresholds. A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate.

In [None]:
plot_precision_recall_curve(en_score.best_estimator_, X_train, Y)
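
The numbers behind such a plot can be obtained with `precision_recall_curve`, and the area under the curve is commonly summarized by `average_precision_score`. The sketch below uses four hypothetical scores rather than the Titanic predictions.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

# Hypothetical labels and predicted scores for four samples
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# precision[i] and recall[i] describe predictions with score >= thresholds[i];
# a final point (precision=1, recall=0) is appended by scikit-learn
precision, recall, thresholds = precision_recall_curve(y_true, scores)
ap = average_precision_score(y_true, scores)

print(precision)
print(recall)
print(ap)
```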