# TITANIC - KAGGLE

### ML ANALYSIS WITH SCIKIT-LEARN

We look at machine learning models based on the following classifiers, initialised with their default hyperparameter values:

- perceptron;
- logistic regression;
- support vector machine;
- naive Bayes;
- decision tree;
- random forest;
- $k$-nearest neighbours.

#### LIBRARY IMPORTS (PREPROCESSING)

In [1]:
import pandas as pd
import numpy as np
import ml_generals as ml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

### DATASET IMPORTS

In the folder `datasets/` we have prepared the training data, labels and test data saved as:

- `trainingData.csv`;
- `trainingLabels.csv`;
- `testData.csv`.


In [2]:
train = pd.read_csv('datasets/trainingData.csv')
labels = pd.read_csv('datasets/trainingLabels.csv')

The machine learning models typically take `numpy` arrays as input. The above datasets are transformed accordingly: the features become a two-dimensional array and the labels a flat, one-dimensional array.


In [3]:
train = train.to_numpy()
labels = labels.values.ravel()
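The call to `.ravel()` matters because a single-column dataframe converts to a two-dimensional column vector, whereas classifiers expect one-dimensional labels. A minimal sketch, using a small made-up labels frame (the column name `Survived` is an assumption about `trainingLabels.csv`):

```python
import pandas as pd

# Hypothetical stand-in for trainingLabels.csv: a single-column DataFrame
labels_df = pd.DataFrame({"Survived": [0, 1, 1, 0]})

column_vector = labels_df.values        # two-dimensional: one row per passenger, one column
flat_labels = labels_df.values.ravel()  # one-dimensional: what the classifiers expect

print(column_vector.shape, flat_labels.shape)
```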

### PREPROCESSING

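Before splitting the data for model training, we normalise the feature values with `StandardScaler`, which rescales each column to zero mean and unit variance. This tends to help scale-sensitive classifiers such as the support vector machine and $k$-nearest neighbours.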

**Note.** *We normalise only the features (training data), not the labels.*

In [4]:
trainNormalised = StandardScaler().fit_transform(train)
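To make the transformation concrete, the sketch below checks on a small made-up array that `StandardScaler` subtracts each column's mean and divides by its (population) standard deviation. The array `X` and the variable names are illustrative only. Keeping the fitted `scaler` object around also allows the same means and standard deviations to be reused on new data via `scaler.transform`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: three rows, two columns on very different scales
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# The same result computed by hand: (x - column mean) / column standard deviation
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, manual))

# Reuse the statistics learned from X on a new row
new_row_scaled = scaler.transform(np.array([[2.0, 30.0]]))
```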

The function `train_test_split` sets aside part of the training data so that the fitted models can be validated on data they have not seen. It takes the training data with its labels and splits it into data for training and data for testing (validating).

The `test_size` parameter determines what fraction of the training data to reserve for validation. A `30:70` split (validation to training) is commonly used.

In [5]:
tst_size = 0.3 # modify as you wish

data_train, data_test, labels_train, labels_test = train_test_split(
    trainNormalised,
    labels,
    test_size=tst_size
)
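The split above is random, so the scores reported further down will vary from run to run. As a possible variation (not what the cell above does), `train_test_split` accepts `random_state` for reproducibility and `stratify` to keep the class balance equal in both parts. A sketch on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 4 features, 40% positive labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 60 + [1] * 40)

X_tr, X_val, y_tr, y_val = train_test_split(
    X,
    y,
    test_size=0.3,     # 30% held out for validation
    random_state=42,   # fixed seed: same split on every run
    stratify=y         # preserve the 60:40 class ratio in both parts
)

print(X_tr.shape, X_val.shape, y_val.mean())
```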

### MODEL TRAINING 

#### CLASSIFIER IMPORTS

We import the following classifiers from the `sklearn` library:

- perceptron;
- logistic regression;
- support vector machine;
- naive Bayes;
- decision tree;
- random forest;
- $k$-nearest neighbours.

In [6]:
from sklearn.linear_model import Perceptron, LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

#### INITIALISING THE MODELS

Above we have imported a number of machine learning classifiers. These are initialised below with their default hyperparameter settings.

In [7]:
classifier_pcp = Perceptron()
classifier_lr = LogisticRegression()
classifier_SVC = SVC()
classifier_gnb = GaussianNB()
classifier_dt = DecisionTreeClassifier()
classifier_rf = RandomForestClassifier()
classifier_knn = KNeighborsClassifier()

#### FITTING THE MODELS

The above models are fit by calling the `.fit()` method and passing `data_train` and `labels_train` from above.

In [8]:
classifier_pcp.fit(data_train, labels_train)
classifier_lr.fit(data_train, labels_train)
classifier_SVC.fit(data_train, labels_train)
classifier_gnb.fit(data_train, labels_train)
classifier_dt.fit(data_train, labels_train)
classifier_rf.fit(data_train, labels_train)
classifier_knn.fit(data_train, labels_train);
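Every scikit-learn estimator shares the same `.fit(X, y)` interface and returns itself, which is why the final line above ends with a semicolon to suppress the notebook output. The same interface also means the repeated calls could be written as a loop. A sketch on synthetic data, with a hypothetical dictionary `classifiers` holding two of the models:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for data_train, labels_train
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hypothetical container; the notebook keeps separate variables instead
classifiers = {
    "lr": LogisticRegression(),
    "dt": DecisionTreeClassifier(random_state=0),
}

fit_returns_self = []
for name, clf in classifiers.items():
    returned = clf.fit(X, y)               # fit trains the model in place...
    fit_returns_self.append(returned is clf)  # ...and returns the same object
```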

### MODEL RESULTS

The function `modelValidationResults` in the module `ml_generals.py` takes a fitted model from above, evaluates it on the validation set `data_test, labels_test` and returns a tuple. The entries of the tuple are, in order:

- Accuracy score;
- ROC-AUC Score;
- Confusion matrix;
- Cross validation score;
- Aggregate score.

We can now build a pandas dataframe with these entries as columns, preceded by the classifier name, with one row for each model above. We first collect the model names and the fitted models.

In [9]:
model_names = [
    Perceptron.__name__,
    LogisticRegression.__name__,
    SVC.__name__,
    GaussianNB.__name__,
    DecisionTreeClassifier.__name__,
    RandomForestClassifier.__name__,
    KNeighborsClassifier.__name__
]

fitted_models = [
    classifier_pcp, 
    classifier_lr,
    classifier_SVC,
    classifier_gnb,
    classifier_dt,
    classifier_rf,
    classifier_knn
]

#### THE RESULTS DATAFRAME

In [10]:
results_columns = [
    'Classifier',
    'Accuracy score',
    'ROC-AUC score',
    'Confusion Matrix',
    'Cross validation score', 
    'Aggregate score'
]
rows = []

for name, clss in zip(model_names, fitted_models):
    # modelValidationResults returns, in order: accuracy score, ROC-AUC score,
    # confusion matrix, cross validation score, aggregate score
    rsults = ml.modelValidationResults(
        clss, 
        train=data_train, 
        train_labels=labels_train, 
        test=data_test,
        test_labels=labels_test
    )
    # one row per classifier: its name followed by the five results
    rows.append([name, *rsults])

# DataFrame.append was removed in pandas 2.0, so the frame is built in one step
results_df = pd.DataFrame(rows, columns=results_columns)


In [11]:
results_df.style.hide(axis="index")  # Styler.hide_index() was removed in pandas 2.0

| Classifier | Accuracy score | ROC-AUC score | Confusion Matrix | Cross validation score | Aggregate score |
| --- | --- | --- | --- | --- | --- |
| Perceptron | 0.735075 | 0.732024 | [[125 43]  [ 28 72]] | 0.73702 | 0.734706 |
| LogisticRegression | 0.817164 | 0.801548 | [[145 23]  [ 26 74]] | 0.78021 | 0.799641 |
| SVC | 0.824627 | 0.777143 | [[162 6]  [ 41 59]] | 0.791423 | 0.797731 |
| GaussianNB | 0.80597 | 0.794643 | [[141 27]  [ 25 75]] | 0.778571 | 0.793061 |
| DecisionTreeClassifier | 0.798507 | 0.778571 | [[144 24]  [ 30 70]] | 0.788351 | 0.788477 |
| RandomForestClassifier | 0.824627 | 0.805476 | [[148 20]  [ 27 73]] | 0.778751 | 0.802951 |
| KNeighborsClassifier | 0.791045 | 0.766548 | [[145 23]  [ 33 67]] | 0.769099 | 0.775564 |
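The source of `ml.modelValidationResults` is not shown here, so the following is only a guess at how such a function might be written with scikit-learn. The table is consistent with two assumptions: the aggregate score is the mean of the three scores (Perceptron: $(0.735075 + 0.732024 + 0.73702)/3 \approx 0.734706$), and the ROC-AUC score is computed from hard class predictions (Perceptron: $(125/168 + 72/100)/2 \approx 0.732024$). The function name `validation_results_sketch`, the choice `cv=5` and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split


def validation_results_sketch(model, train, train_labels, test, test_labels, cv=5):
    """Hypothetical stand-in for ml.modelValidationResults (not the author's code)."""
    preds = model.predict(test)
    acc = accuracy_score(test_labels, preds)
    # ROC-AUC from hard 0/1 predictions, as the table values suggest
    roc_auc = roc_auc_score(test_labels, preds)
    cm = confusion_matrix(test_labels, preds)
    # cross_val_score refits clones of the model; cv=5 is an assumption
    cv_score = cross_val_score(model, train, train_labels, cv=cv).mean()
    # assumed aggregate: plain mean of the three scalar scores
    aggregate = np.mean([acc, roc_auc, cv_score])
    return acc, roc_auc, cm, cv_score, aggregate


# Synthetic data standing in for the Titanic features
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_tr, y_tr)
acc, roc_auc, cm, cv_score, aggregate = validation_results_sketch(
    model, X_tr, y_tr, X_val, y_val
)
print(acc, roc_auc, cv_score, aggregate)
```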


### MODEL PERFORMANCE TRANSCRIPT

There are three metrics (columns) by which we can sort model performance, along with the aggregate score. The three metrics are:

- Accuracy score;
- ROC-AUC score;
- Cross validation score.

We could also further inspect the entries of the confusion matrix and sort by values in there, e.g., order by true-positive to false-negative ratio, etc. 
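As an illustration of such an inspection, the sketch below unpacks the `RandomForestClassifier` confusion matrix from the table above. It assumes the matrix follows the layout of scikit-learn's `confusion_matrix`, i.e. rows are true classes and columns are predicted classes, so the entries read `[[TN, FP], [FN, TP]]`. Under that assumption the accuracy recomputed from the matrix agrees with the tabulated value.

```python
import numpy as np

# RandomForestClassifier row of the results table
cm = np.array([[148, 20], [27, 73]])

# Assumed layout: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # matches the 'Accuracy score' column
recall = tp / (tp + fn)           # share of actual survivors identified
precision = tp / (tp + fp)        # share of predicted survivors who survived
tp_fn_ratio = tp / fn             # true-positive to false-negative ratio

print(accuracy, recall, precision, tp_fn_ratio)
```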

#### BEST ACCURACY SCORE

In [12]:
results_df.sort_values('Accuracy score').iloc[[-1]].style.hide(axis="index")

| Classifier | Accuracy score | ROC-AUC score | Confusion Matrix | Cross validation score | Aggregate score |
| --- | --- | --- | --- | --- | --- |
| RandomForestClassifier | 0.824627 | 0.805476 | [[148 20]  [ 27 73]] | 0.778751 | 0.802951 |


#### BEST ROC-AUC SCORE

In [13]:
results_df.sort_values('ROC-AUC score').iloc[[-1]].style.hide(axis="index")

| Classifier | Accuracy score | ROC-AUC score | Confusion Matrix | Cross validation score | Aggregate score |
| --- | --- | --- | --- | --- | --- |
| RandomForestClassifier | 0.824627 | 0.805476 | [[148 20]  [ 27 73]] | 0.778751 | 0.802951 |


#### BEST CROSS VALIDATION SCORE

In [14]:
results_df.sort_values('Cross validation score').iloc[[-1]].style.hide(axis="index")

| Classifier | Accuracy score | ROC-AUC score | Confusion Matrix | Cross validation score | Aggregate score |
| --- | --- | --- | --- | --- | --- |
| SVC | 0.824627 | 0.777143 | [[162 6]  [ 41 59]] | 0.791423 | 0.797731 |


#### BEST AGGREGATE SCORE

In [15]:
results_df.sort_values('Aggregate score').iloc[[-1]].style.hide(axis="index")

| Classifier | Accuracy score | ROC-AUC score | Confusion Matrix | Cross validation score | Aggregate score |
| --- | --- | --- | --- | --- | --- |
| RandomForestClassifier | 0.824627 | 0.805476 | [[148 20]  [ 27 73]] | 0.778751 | 0.802951 |


### RETURN BEST FUNCTION

The following function takes a metric (a column name of `results_df`) and returns the fitted classifier that scored highest on that metric, together with its name.


In [16]:
def returnBest(metric):
    
    row = results_df.sort_values(metric).iloc[[-1]]
    modelName = row['Classifier'].values[0]
    modelNameIndex = model_names.index(modelName)
    classifier = fitted_models[modelNameIndex]
    
    return classifier, modelName
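An alternative to sorting, shown here only as a sketch on a cut-down copy of the results, is to look up the row with the largest value directly with `idxmax`. The frame `scores` below is a hypothetical excerpt of `results_df` holding three of the aggregate scores reported above.

```python
import pandas as pd

# Hypothetical excerpt of results_df with three of the reported aggregate scores
scores = pd.DataFrame({
    "Classifier": ["Perceptron", "SVC", "RandomForestClassifier"],
    "Aggregate score": [0.734706, 0.797731, 0.802951],
})

# idxmax gives the index label of the largest value in the column
best_row = scores.loc[scores["Aggregate score"].idxmax()]
best_name = best_row["Classifier"]

print(best_name)
```

When two classifiers tie on a metric, as `SVC` and `RandomForestClassifier` do on accuracy in the results table, `idxmax` returns the first row attaining the maximum, which need not be the row that `sort_values(...).iloc[[-1]]` picks.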


### PREDICTIONS

The test data prepared earlier is read below. 

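**Note.** *The test data must be normalised consistently with the training data, that is, using the means and standard deviations of the training features.*

In [17]:
test = pd.read_csv('datasets/testData.csv')

testToPass = test.drop(['PassengerId'], axis=1)
testToPass = testToPass.to_numpy()
# Fit the scaler on the training features (as in cell 4) and reuse its statistics here,
# rather than fitting a separate scaler on the test data
testToPassNormalised = StandardScaler().fit(train).transform(testToPass)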

#### PREDICTIONS DATAFRAME

As with the `returnBest` function, the `prediction` function coded below takes a metric. It generates predictions with the best classifier for that metric and writes them to a `.csv` file, which it saves in `datasets/` as `predictions.csv`.

Code for implementing this is as follows.

In [18]:
def prediction(metric):
    
    # pick the classifier that scored highest on the given metric
    classifier_best, classifier_name = returnBest(metric)
    preds_best = classifier_best.predict(testToPassNormalised)
    
    # Build the submission frame in one step
    # (DataFrame.append was removed in pandas 2.0)
    pred_df = pd.DataFrame({
        'PassengerId': test['PassengerId'].astype(int),
        'Survived': preds_best.astype(int)
    })

    pred_df.to_csv('datasets/predictions.csv', index=False)
    
    print(f"{classifier_name} performed best with respect to {metric}.\nPredictions were generated by {classifier_name}.")
    

### PREDICTIONS

Selecting the best model with respect to the metric `Aggregate score`, we generate the predictions:

In [19]:
prediction('Aggregate score')

RandomForestClassifier performed best with respect to Aggregate score.
Predictions were generated by RandomForestClassifier.


### SUBMISSION

Uncomment and run the following to submit the predictions via the KAGGLE API from the command line.

In [21]:
# ! kaggle competitions submit -c "titanic" -f datasets/predictions.csv -m "submitted from command line"

The latest submission scored **0.77033** on KAGGLE. 