# Classification modelling
The objective is to train and tune several classifiers, then select the one that best predicts whether a Titanic passenger survived, based on the available passenger information.

In [271]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from joblib import dump, load
from titanicTransformers import *
import numpy as np
from sklearn.model_selection import GridSearchCV
import warnings
from sklearn.exceptions import ConvergenceWarning
from sklearn.metrics import f1_score, accuracy_score

## Loading the untransformed train and test sets.
The sets are loaded untransformed so that the whole data processing pipeline lives in this single notebook, which makes it easier to reuse and to rerun automatically.

In [256]:
X_train = pd.read_csv(r'/home/panos/Python/MachineLearning/titanic/datasets/X_train.csv')
y_train = pd.read_csv(r'/home/panos/Python/MachineLearning/titanic/datasets/y_train.csv')
target = y_train.values.ravel(order='c')

In [257]:
y_test = pd.read_csv(r'/home/panos/Python/MachineLearning/titanic/datasets/y_test.csv')
X_test = pd.read_csv(r'/home/panos/Python/MachineLearning/titanic/datasets/X_test.csv')

## Model training and cross validation.

Encoding is performed first, outside the pipeline.


In [258]:
# Fit on the training set only; assumes titanicEncoder has a scikit-learn style transform().
encoder = titanicEncoder()
X_trainEncoded = encoder.fit_transform(X_train)
X_testEncoded = encoder.transform(X_test)
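
Fitting an encoder separately on each set can produce different columns when a category is missing from one of them. The internals of `titanicEncoder` are not shown in this notebook, so the minimal sketch below uses scikit-learn's `OneHotEncoder` as a stand-in; the toy `Embarked` frames are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the test frame only contains one of the three categories.
train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
test = pd.DataFrame({"Embarked": ["S", "S"]})

# Fit once on the training data, then reuse the fitted encoder on the test data.
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train).toarray()
test_encoded = encoder.transform(test).toarray()

# A second encoder fitted on the test frame alone only learns the category "S".
refit_encoded = OneHotEncoder().fit_transform(test).toarray()

print(train_encoded.shape, test_encoded.shape, refit_encoded.shape)
```

With the shared encoder both sets end up with the same number of columns, while the separately fitted one does not.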

The models which will be trained and validated are:
* Logistic Regression
* Linear kernel SVM
* Radial Basis kernel SVM
* Decision Tree Classifier

*Cross validation uses 5-fold StratifiedKFold, which is the `cross_val_score` default for classifiers.*
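
As a quick check of the cross validation scheme, the sketch below compares `cross_val_score` with its default `cv` against an explicit `StratifiedKFold(n_splits=5)`. It runs on synthetic data from `make_classification` (an assumption, not the Titanic set), so only the equality of the two results matters, not the scores themselves.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, mildly imbalanced binary problem used only for illustration.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, weights=[0.6, 0.4], random_state=0
)

# Default cv: 5 folds, stratified because the estimator is a classifier.
implicit = cross_val_score(LogisticRegression(), X_demo, y_demo, scoring="f1")

# The same scheme spelled out explicitly.
explicit = cross_val_score(
    LogisticRegression(), X_demo, y_demo, scoring="f1",
    cv=StratifiedKFold(n_splits=5),
)

print(np.allclose(implicit, explicit))
```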

In [259]:
models = {'Logistic Regression': LogisticRegression(),
        'Linear SVM':LinearSVC(),
        'Kernel SVM':SVC(),
        'Decision Tree Classifier':DecisionTreeClassifier()}

scores = []

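for modelName, model in models.items():

    # Imputation and scaling live inside the pipeline so they are refitted on each training fold.
    clf = make_pipeline(imputeColumnMean(),
                        MinMaxScaler(),
                        model)
    cvScores = cross_val_score(clf, X_trainEncoded, target, scoring='f1').round(2)
    scores.append(cvScores)

    # cross_val_score fits clones of clf, so `model` is still unfitted here:
    # the dump stores its configuration, plus this model's own fold scores.
    dump(model, f"{modelName}.pkl")
    dump(cvScores, f"{modelName}_cv.pkl")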
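
If a fitted model is needed later, one option is to fit the whole pipeline on the training data and persist that. The sketch below is a minimal illustration under assumptions: it uses synthetic data, `SimpleImputer(strategy="mean")` as a stand-in for `imputeColumnMean`, and a temporary directory with a made-up file name.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data standing in for the encoded training set.
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)

# Fit preprocessing and model together so the saved object is ready to predict.
pipeline = make_pipeline(SimpleImputer(strategy="mean"), MinMaxScaler(), LogisticRegression())
pipeline.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, chosen for this example only.
    path = os.path.join(tmp_dir, "logistic_regression_pipeline.pkl")
    dump(pipeline, path)
    restored = load(path)

# The restored pipeline should reproduce the original predictions.
same_predictions = np.array_equal(pipeline.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```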

## Cross validation scores

Scoring: f1

In [260]:
df = pd.DataFrame(scores).T
df.columns = models.keys()
df.index = np.arange(1,6)
df.loc[len(df)+1] = df.mean()
df.rename(index={len(df):'mean'}, inplace=True)
df.rename_axis('fold', inplace=True)
df

| fold | Logistic Regression | Linear SVM | Kernel SVM | Decision Tree Classifier |
|------|---------------------|------------|------------|--------------------------|
| 1    | 0.7                 | 0.7        | 0.67       | 0.68                     |
| 2    | 0.78                | 0.78       | 0.75       | 0.73                     |
| 3    | 0.69                | 0.68       | 0.62       | 0.63                     |
| 4    | 0.74                | 0.73       | 0.7        | 0.7                      |
| 5    | 0.75                | 0.75       | 0.73       | 0.73                     |
| mean | 0.732               | 0.728      | 0.694      | 0.694                    |


**Logistic Regression** (mean f1 0.732) and **Linear SVM** (0.728) perform almost identically in cross validation, and both outperform the other two models (0.694 each).

The hyperparameters of these two models will be tuned using Grid Search.

## Grid search - hyperparameter tuning

### Logistic regression
Parameters:
* Regularization parameter C (inverse regularization strength: smaller values regularize more strongly)
* Class weight (`'balanced'` sets weights inversely proportional to class frequencies in the input data; `None` gives every class a weight of 1)
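
To make the `'balanced'` option concrete, the sketch below computes the weights by hand as `n_samples / (n_classes * class_count)` and compares them with scikit-learn's `compute_class_weight`. The label vector is a made-up example with six negatives and four positives.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 6 samples of class 0 and 4 samples of class 1.
y_demo = np.array([0] * 6 + [1] * 4)
classes = np.unique(y_demo)

# Manual formula: n_samples / (n_classes * count of each class).
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

# scikit-learn's helper for the same 'balanced' heuristic.
balanced = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

print(manual, balanced)
```

The rarer class receives the larger weight (1.25 versus roughly 0.83 here).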

In [262]:
paramGrid = [{'C':[5,6,7,8,9,10], 'class_weight':[None, 'balanced']}]
logReg = LogisticRegression()
clf = make_pipeline(imputeColumnMean(), MinMaxScaler())
X = clf.fit_transform(X_trainEncoded)
logGridSearch = GridSearchCV(logReg, paramGrid,cv=5, return_train_score=True, scoring='f1')

logGridSearch.fit(X, target)

bestLogReg = logGridSearch.best_estimator_
print('Best parameters: ', logGridSearch.best_params_)
print('Best estimator f1 score: ', logGridSearch.best_score_)

Best parameters:  {'C': 9, 'class_weight': None}
Best estimator f1 score:  0.7432874216686579
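
An alternative to transforming the data once before the search is to pass the whole pipeline to `GridSearchCV`, so that the imputer and scaler are refitted on each training fold. The sketch below illustrates this on synthetic data, with `SimpleImputer(strategy="mean")` standing in for `imputeColumnMean` (both are assumptions). `make_pipeline` names each step after its lowercased class, hence the `logisticregression__` prefix on the parameter names.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data standing in for the encoded training set.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

pipeline = make_pipeline(SimpleImputer(strategy="mean"), MinMaxScaler(), LogisticRegression())

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "logisticregression__C": [5, 6, 7, 8, 9, 10],
    "logisticregression__class_weight": [None, "balanced"],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
search.fit(X_demo, y_demo)

print(search.best_params_)
```

With this layout, `search.best_estimator_` is a fitted pipeline that accepts untransformed (but encoded) features directly.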


### Linear SVM
Parameters:
* Regularization parameter C (regularization strength is inversely proportional to C)
* Maximum number of iterations (with large values of C and few iterations, failures to converge are expected)

In [264]:
paramGrid = [{'C':[0.1, 1, 100, 1000], 'max_iter':[500, 1000, 1500]}]
svm = LinearSVC()
clf = make_pipeline(imputeColumnMean(), MinMaxScaler())
X = clf.fit_transform(X_trainEncoded)
svmGridSearch = GridSearchCV(svm, paramGrid,cv=5, return_train_score=True, scoring='f1')

# Convergence warnings expected while grid searching.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    svmGridSearch.fit(X, target)

bestSvm = svmGridSearch.best_estimator_
print('Best parameters: ', svmGridSearch.best_params_)
print('Best estimator f1 score: ', svmGridSearch.best_score_)

Best parameters:  {'C': 100, 'max_iter': 1500}
Best estimator f1 score:  0.7310945659817325


### CV Verdict:
**Logistic** regression still performs slightly better in cross validation after hyperparameter tuning (mean f1 of 0.743 versus 0.731 for the linear SVM). It is also computationally cheaper, so it is the model evaluated against the test set.

## Test set evaluation

In [284]:
X = transformTitanicDf(X_test)

predictions = bestLogReg.predict(X.values)

# scikit-learn metrics expect (y_true, y_pred) in that order.
f1 = f1_score(y_test, predictions)
accuracy = accuracy_score(y_test, predictions)
print('f1 score against the test set: ', round(f1, 3))
print('Accuracy against the test set: ', round(accuracy, 3))

f1 score against the test set:  0.715
Accuracy against the test set:  0.789
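
For reference, both metrics can be derived from the confusion matrix: f1 is `2*TP / (2*TP + FP + FN)` and accuracy is `(TP + TN) / total`. The sketch below checks this against scikit-learn on a small made-up pair of label vectors (not the Titanic predictions).

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

manual_f1 = 2 * tp / (2 * tp + fp + fn)
manual_accuracy = (tp + tn) / (tp + tn + fp + fn)

print(manual_f1, f1_score(y_true, y_pred))
print(manual_accuracy, accuracy_score(y_true, y_pred))
```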


## That's it!
The model is now ready for review and then for deployment.
There is always room for improvement, for example through feature engineering and further tuning.