## Titanic: Model Selection

### Importing libraries

In [1]:
from IPython.display import display, HTML
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

### Loading the datasets

In [2]:
df_normalized = pd.read_csv('./data/normalized.csv')

In [3]:
df_normalized.head()

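| Unnamed: 0 | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked_C | Embarked_S | Survived |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0.883877 | 0 | 0 | 0.467719 | 0 | 1 | 1 |
| 1 | 2 | 1 | 0.907009 | 0 | 0 | 0.421111 | 0 | 1 | 1 |
| 2 | 1 | 1 | 0.258722 | 0 | 0 | 0.965952 | 1 | 0 | 1 |
| 3 | 3 | 0 | 0.994813 | 0 | 0 | 0.101724 | 0 | 1 | 0 |
| 4 | 3 | 1 | 0.872129 | 0 | 0 | 0.489277 | 0 | 1 | 0 |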


In [4]:
df_standardized = pd.read_csv('./data/standardized.csv')

In [5]:
df_standardized.head()

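| Unnamed: 0 | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked_C | Embarked_S | Survived |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0.475345 | 0 | 0 | -0.213706 | 0 | 1 | 1 |
| 1 | 2 | 1 | 0.557735 | 0 | 0 | -0.379165 | 0 | 1 | 1 |
| 2 | 1 | 1 | -1.751309 | 0 | 0 | 1.555036 | 1 | 0 | 1 |
| 3 | 3 | 0 | 0.870471 | 0 | 0 | -1.513 | 0 | 1 | 0 |
| 4 | 3 | 1 | 0.4335 | 0 | 0 | -0.137176 | 0 | 1 | 0 |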


### Model Selection

In [6]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC

In [7]:
MODELS = [KNeighborsClassifier(), RandomForestClassifier(), DecisionTreeClassifier(), XGBClassifier(), SVC(), LogisticRegression()]

In [8]:
X_normalized = df_normalized.loc[:, df_normalized.columns != 'Survived']
y_normalized = df_normalized.loc[:, 'Survived']

X_standardized = df_standardized.loc[:, df_standardized.columns != 'Survived']
y_standardized = df_standardized.loc[:, 'Survived']
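
The `Unnamed: 0` column in the previews above looks like the row index that was written out when the CSV files were saved, and the selection above keeps it in the feature matrix. The following is a minimal sketch of one way to exclude it, assuming the column carries no predictive meaning. The small frame is a hypothetical stand-in for `df_normalized`, with illustrative values only. Applying this to the real data would likely change the scores reported in the rest of the notebook, so it is shown as an option rather than as the method used here.

```python
import pandas as pd

# Hypothetical stand-in for df_normalized; the values are illustrative only.
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Pclass": [1, 2, 1],
    "Fare": [0.47, 0.42, 0.97],
    "Survived": [1, 1, 1],
})

# Drop the target and the saved row index so that only real features remain.
X = df.drop(columns=["Unnamed: 0", "Survived"])
y = df["Survived"]

print(list(X.columns))
```

Passing `index_col=0` to `pd.read_csv` would be another way to reach the same result at load time.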

#### Training on the normalized dataset

In [9]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix

# The split does not depend on the model, so it is computed once, outside the loop.
X_train, X_test, y_train, y_test = train_test_split(X_normalized, y_normalized, test_size=0.2, random_state=532)

for model in MODELS:
    print(type(model), "is getting evaluated")
    model.fit(X_train, y_train)

    # Hold-out predictions feed the confusion matrix and the accuracy score.
    y_pred = model.predict(X_test)
    # cross_val_score clones the model and refits it on each fold of the full dataset.
    print('Cross Validation Score:', cross_val_score(model, X_normalized, y_normalized).mean())
    print('Confusion Matrix: \n', confusion_matrix(y_test, y_pred))
    print('Accuracy Score:', accuracy_score(y_test, y_pred), '\n')

<class 'sklearn.neighbors._classification.KNeighborsClassifier'> is getting evaluated
Cross Validation Score: 0.7979222898750864
Confusion Matrix: 
 [[96 17]
 [16 50]]
Accuracy Score: 0.8156424581005587 

<class 'sklearn.ensemble._forest.RandomForestClassifier'> is getting evaluated
Cross Validation Score: 0.768765300357793
Confusion Matrix: 
 [[89 24]
 [17 49]]
Accuracy Score: 0.770949720670391 

<class 'sklearn.tree._classes.DecisionTreeClassifier'> is getting evaluated
Cross Validation Score: 0.757535622371477
Confusion Matrix: 
 [[88 25]
 [17 49]]
Accuracy Score: 0.7653631284916201 

<class 'xgboost.sklearn.XGBClassifier'> is getting evaluated
Cross Validation Score: 0.783334379511644
Confusion Matrix: 
 [[90 23]
 [19 47]]
Accuracy Score: 0.7653631284916201 

<class 'sklearn.svm._classes.SVC'> is getting evaluated
Cross Validation Score: 0.8068984997803025
Confusion Matrix: 
 [[98 15]
 [19 47]]
Accuracy Score: 0.8100558659217877 

<class 'sklearn.linear_model._logistic.LogisticRegr

***KNN***, ***XGBoost***, ***SVC*** and ***Logistic Regression*** look promising enough to be tuned.
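
Scrolling through printed output makes the models hard to compare side by side. As a rough sketch, the cross-validation scores could be collected into a single sorted table instead. The data here is a synthetic stand-in produced by `make_classification` (an assumption, used only so that the snippet runs on its own), and the names `candidates` and `summary` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); the notebook would pass
# X_normalized and y_normalized here instead.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

candidates = {
    "KNN": KNeighborsClassifier(),
    "SVC": SVC(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

rows = []
for name, model in candidates.items():
    # Five accuracy scores per model, one for each fold.
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    rows.append({"model": name, "cv_mean": scores.mean(), "cv_std": scores.std()})

# Best mean score first; the standard deviation shows how stable each model is.
summary = (
    pd.DataFrame(rows)
    .sort_values("cv_mean", ascending=False)
    .reset_index(drop=True)
)
print(summary)
```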

#### Training on the standardized dataset

In [10]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix

# The split does not depend on the model, so it is computed once, outside the loop.
X_train, X_test, y_train, y_test = train_test_split(X_standardized, y_standardized, test_size=0.2, random_state=532)

for model in MODELS:
    print(type(model), "is getting evaluated")
    model.fit(X_train, y_train)

    # Hold-out predictions feed the confusion matrix and the accuracy score.
    y_pred = model.predict(X_test)
    # cross_val_score clones the model and refits it on each fold of the full dataset.
    print('Cross Validation Score:', cross_val_score(model, X_standardized, y_standardized).mean())
    print('Confusion Matrix: \n', confusion_matrix(y_test, y_pred))
    print('Accuracy Score:', accuracy_score(y_test, y_pred), '\n')

<class 'sklearn.neighbors._classification.KNeighborsClassifier'> is getting evaluated
Cross Validation Score: 0.7979222898750863
Confusion Matrix: 
 [[100  13]
 [ 15  51]]
Accuracy Score: 0.8435754189944135 

<class 'sklearn.ensemble._forest.RandomForestClassifier'> is getting evaluated
Cross Validation Score: 0.7687527462180654
Confusion Matrix: 
 [[95 18]
 [14 52]]
Accuracy Score: 0.8212290502793296 

<class 'sklearn.tree._classes.DecisionTreeClassifier'> is getting evaluated
Cross Validation Score: 0.7328165212478817
Confusion Matrix: 
 [[90 23]
 [16 50]]
Accuracy Score: 0.7821229050279329 

<class 'xgboost.sklearn.XGBClassifier'> is getting evaluated
Cross Validation Score: 0.7788337204193083
Confusion Matrix: 
 [[100  13]
 [ 17  49]]
Accuracy Score: 0.8324022346368715 

<class 'sklearn.svm._classes.SVC'> is getting evaluated
Cross Validation Score: 0.8170045822610005
Confusion Matrix: 
 [[100  13]
 [ 16  50]]
Accuracy Score: 0.8379888268156425 

<class 'sklearn.linear_model._logis

***KNN***, ***SVC*** and ***Logistic Regression*** again look like the best candidates for tuning.
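
Both CSV files were scaled before being loaded here. If the scaler was fitted on the full dataset (an assumption, since the preprocessing step is not shown in this notebook), every cross-validation fold has already seen statistics from its own validation rows. A minimal sketch of the usual alternative is to put the scaler in a pipeline so that it is refitted on the training part of each fold. The data below is a synthetic stand-in, not the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, unscaled stand-in data (assumption).
X_raw, y_raw = make_classification(n_samples=200, n_features=8, random_state=0)

# The scaler is fitted inside each training fold only, then applied to the
# matching validation fold, so no validation statistics leak into training.
pipeline = make_pipeline(StandardScaler(), SVC())

scores = cross_val_score(pipeline, X_raw, y_raw, cv=5)
print(scores.mean())
```

Swapping `StandardScaler` for `MinMaxScaler` would give the normalized variant of the same idea.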

### Parameter Tuning

In [11]:
from sklearn.model_selection import GridSearchCV

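def tuneModel(model, params, X, y):
    """Run an exhaustive grid search and return the best mean CV score with its parameters."""
    # With no cv or scoring argument, GridSearchCV uses 5-fold cross-validation
    # and the estimator's default score (accuracy for classifiers).
    gridCV = GridSearchCV(estimator=model, param_grid=params)
    gridCV.fit(X, y)

    return {'score': gridCV.best_score_, 'params': gridCV.best_params_}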

#### K-Nearest Neighbors

In [12]:
# NORMALIZED DATASET
knn_params = {
    'n_neighbors': [5, 7, 9, 11, 13],
    'weights': ['distance', 'uniform'],
    'metric': ['euclidean', 'manhattan', 'minkowski'],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
}

tuneModel(KNeighborsClassifier(), knn_params, X_normalized, y_normalized)

{'score': 0.8102755633670202,
 'params': {'algorithm': 'auto',
  'metric': 'manhattan',
  'n_neighbors': 13,
  'weights': 'uniform'}}

In [13]:
# STANDARDIZED DATASET
knn_params = {
    'n_neighbors': [5, 7, 9, 11, 13],
    'weights': ['distance', 'uniform'],
    'metric': ['euclidean', 'manhattan', 'minkowski'],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
}

tuneModel(KNeighborsClassifier(), knn_params, X_standardized, y_standardized)

{'score': 0.8001569267465947,
 'params': {'algorithm': 'auto',
  'metric': 'manhattan',
  'n_neighbors': 11,
  'weights': 'uniform'}}
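
One detail of the grid above: `KNeighborsClassifier` leaves `p` at its default of 2, and a Minkowski distance with `p=2` is the Euclidean distance. The `'minkowski'` entry therefore repeats the `'euclidean'` one. The sketch below checks this on synthetic stand-in data (an assumption, not the Titanic data) by comparing neighbor distances and predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 180 rows to fit on, 20 rows to query.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_fit, y_fit, X_query = X_demo[:180], y_demo[:180], X_demo[180:]

euclid = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X_fit, y_fit)
# p is left at its default value of 2, as in the grid above.
minkowski = KNeighborsClassifier(n_neighbors=5, metric="minkowski").fit(X_fit, y_fit)

dist_e, _ = euclid.kneighbors(X_query)
dist_m, _ = minkowski.kneighbors(X_query)

same_distances = np.allclose(dist_e, dist_m)
same_predictions = np.array_equal(euclid.predict(X_query), minkowski.predict(X_query))
print(same_distances, same_predictions)
```

Dropping `'minkowski'` from the grid, or adding `p` as a separate parameter to search, would avoid fitting the same model twice.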

#### Support Vector Classifier (SVC)

In [14]:
# NORMALIZED DATASET
svc_params = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'gamma': ['scale', 'auto'],
    'degree': [2, 3, 4],
    'class_weight': [None, 'balanced']
}

tuneModel(SVC(), svc_params, X_normalized, y_normalized)

{'score': 0.8204067541271733,
 'params': {'C': 100,
  'class_weight': None,
  'degree': 2,
  'gamma': 'scale',
  'kernel': 'poly'}}

In [15]:
# STANDARDIZED DATASET
svc_params = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'gamma': ['scale', 'auto'],
    'degree': [2, 3, 4],
    'class_weight': [None, 'balanced']
}

tuneModel(SVC(), svc_params, X_standardized, y_standardized)

{'score': 0.8181219006967547,
 'params': {'C': 1,
  'class_weight': None,
  'degree': 2,
  'gamma': 'auto',
  'kernel': 'rbf'}}
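
The best parameters above report `degree: 2` alongside an `rbf` kernel, but `SVC` only uses `degree` with the `poly` kernel, and it ignores `gamma` for the `linear` kernel. A flat grid therefore fits many combinations that differ only in a parameter the model never reads. As a sketch, `GridSearchCV` also accepts a list of dictionaries, which lets each kernel carry only the parameters it uses. The names `flat_grid` and `conditional_grid` are hypothetical.

```python
from sklearn.model_selection import ParameterGrid

# The flat grid used above: every parameter is crossed with every kernel.
flat_grid = {
    "C": [0.1, 1, 10, 100],
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "gamma": ["scale", "auto"],
    "degree": [2, 3, 4],
    "class_weight": [None, "balanced"],
}

# One sub-grid per kernel family, listing only the parameters that kernel reads.
shared = {"C": [0.1, 1, 10, 100], "class_weight": [None, "balanced"]}
conditional_grid = [
    {"kernel": ["linear"], **shared},
    {"kernel": ["rbf", "sigmoid"], "gamma": ["scale", "auto"], **shared},
    {"kernel": ["poly"], "gamma": ["scale", "auto"], "degree": [2, 3, 4], **shared},
]

n_flat = len(ParameterGrid(flat_grid))
n_conditional = len(ParameterGrid(conditional_grid))
print(n_flat, n_conditional)  # 192 88
```

The list form can be passed as `param_grid` in place of the dictionary, so `tuneModel(SVC(), conditional_grid, X, y)` would search the same distinct models with fewer fits.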

#### Logistic Regression

In [16]:
# NORMALIZED DATASET
logistic_regression_params = {
    'solver': ['newton-cg', 'lbfgs', 'sag', 'saga'], 
    'penalty': ['l2', None], 
    'C': [0.1, 1, 10, 100], 
    'class_weight': [None, 'balanced'], 
    'max_iter': [1000, 1200, 2000]
}

tuneModel(LogisticRegression(), logistic_regression_params, X_normalized, y_normalized)



{'score': 0.792285481137405,
 'params': {'C': 1,
  'class_weight': None,
  'max_iter': 1000,
  'penalty': 'l2',
  'solver': 'newton-cg'}}

In [17]:
# STANDARDIZED DATASET
logistic_regression_params = {
    'solver': ['newton-cg', 'lbfgs', 'sag', 'saga'], 
    'penalty': ['l2', None], 
    'C': [0.1, 1, 10, 100], 
    'class_weight': [None, 'balanced'], 
    'max_iter': [1000, 1200, 2000]
}

tuneModel(LogisticRegression(), logistic_regression_params, X_standardized, y_standardized)



{'score': 0.7967861402297407,
 'params': {'C': 0.1,
  'class_weight': None,
  'max_iter': 1000,
  'penalty': 'l2',
  'solver': 'lbfgs'}}

### Conclusion

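***KNN***, ***SVC*** and ***Logistic Regression*** all perform well on this problem. The best of them is the **SVC trained on the normalized dataset**, which reaches a best cross-validation score of about 0.820 with `C=100`, `kernel='poly'`, `degree=2` and `gamma='scale'`.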
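
The scores compared here are `best_score_` values, meaning the mean cross-validation accuracy of whichever grid point scored highest, so they tend to be slightly optimistic. One possible follow-up, sketched below on synthetic stand-in data (an assumption) and with a deliberately small hypothetical grid, is to tune on a training split only and then score the refitted best model on rows the search never saw.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); not the Titanic data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Keep 20% of the rows aside; the grid search never touches them.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=532, stratify=y_demo
)

# Small illustrative grid (hypothetical); refit=True by default, so the best
# configuration is retrained on all of X_train after the search.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["rbf", "poly"]})
search.fit(X_train, y_train)

# Accuracy of the refitted best model on the held-out rows.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, search.best_score_, test_accuracy)
```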