# Classifier Performance

The purpose of this notebook is to study the performance of various classification algorithms on a particular data set. It is divided into four sections.

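- [Data Prep](#data_prep): Load the data into a pandas DataFrame and extract/construct features.
- Sampling Sweep: Train and test every model across a range of test fractions.
- k-Fold Validation Test: Score every model with k-fold validation and summarize the per-fold accuracy.
- [Blending](#blending): Feed first-level model predictions into a second-level model and check it on held-out validation data.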

For now, all models run with their default parameters. Parameter tuning will be added soon.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re

import xgboost

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier



<a id="data_prep"></a>

## Data Prep

In [2]:
ls ./input/

test.csv  train.csv


In [3]:
data = pd.read_csv('./input/train.csv')

In [4]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Feature Engineering

Here we map representative numerical values on to non-numerical data or calculate relevant metrics.
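
As a small illustration of the mapping step, here is a minimal sketch on a toy column. The column values and the integer codes are made up for the example; the point is that `Series.map` returns `NaN` for any value that is not a key of the mapping, so missing entries should be filled first.

```python
import pandas as pd

# Toy column standing in for a categorical feature such as 'Embarked'.
ports = pd.Series(['S', 'C', None, 'Q', 'S'])

# Hypothetical integer codes; the specific numbers carry no meaning.
port_codes = {'Q': 1, 'S': 2, 'C': 3}

# Fill missing entries before mapping, because Series.map yields NaN
# for any value that is not a key of the mapping.
encoded = ports.fillna('S').map(port_codes)

# Count values that failed to map (should be zero here).
unmapped = int(encoded.isna().sum())
print(encoded.tolist(), unmapped)
```

Checking the number of unmapped values after each mapping is a cheap way to catch a category that the dictionary forgot.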



In [5]:
## sex
sex_mapping = {"female": 245 ,"male": 143}
data['Sex'] = data['Sex'].map(sex_mapping)


## embarked
embarked_mapping = {'Q' : 1, 'S' : 2, 'C' : 3}
data['Embarked'] = data['Embarked'].fillna('S')
data['Embarked'] = data['Embarked'].map(embarked_mapping)

## cabin
data["Cabin"] = data["Cabin"].notna().astype(int)  # 1 if a cabin is recorded, 0 if missing


## age binning
data['Age'] = data['Age'].fillna(data['Age'].mean())
age_bins = [(data['Age'].min()-1), 15, 30, 45, 60, (data['Age'].max()+1)]
age_labels =[1, 2, 3, 4, 5]
data['Age'] = pd.cut(data['Age'], age_bins, labels=age_labels)

## fare_binning
data['Fare'] = data['Fare'].fillna(data['Fare'].median())
fare_bins = [(data['Fare'].min()-1), 5, 10, 15, 20, 30, 50, 100, (data['Fare'].max()+1)]
fare_labels = [1, 2, 3, 4, 5, 6, 7, 8]
data['Fare'] = pd.cut(data['Fare'],fare_bins,labels=fare_labels)

## name features
data['name_length'] = data['Name'].apply(len)

def get_title(name):
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""
# Create a new feature Title, containing the titles of passenger names

data['Title'] = data['Name'].apply(get_title)
# Group all non-common titles into one single grouping "Rare"

data['Title'] = data['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

data['Title'] = data['Title'].replace('Mlle', 'Miss')
data['Title'] = data['Title'].replace('Ms', 'Miss')
data['Title'] = data['Title'].replace('Mme', 'Mrs')

# Mapping titles
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
data['Title'] = data['Title'].map(title_mapping)
data['Title'] = data['Title'].fillna(0)
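
To make the binning behavior concrete, here is a minimal sketch of `pd.cut` on a handful of made-up ages, using the same style of padded outer edges as the cell above. Intervals are right-closed by default, so a value equal to an edge falls into the lower bin.

```python
import pandas as pd

# Made-up ages, including two that sit exactly on a bin edge.
ages = pd.Series([0.5, 15.0, 22.0, 30.0, 47.0, 80.0])

# Pad the outer edges by one so the minimum and maximum fall inside a bin.
edges = [ages.min() - 1, 15, 30, 45, 60, ages.max() + 1]
labels = [1, 2, 3, 4, 5]

# pd.cut returns a categorical Series; intervals are (left, right].
binned = pd.cut(ages, edges, labels=labels)

# Convert to plain integers (safe here because no value is missing).
as_int = binned.astype(int)
print(as_int.tolist())
```

Converting the categorical result to integers is optional, but it keeps the feature table purely numeric.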

In [6]:
data.drop(['Name', 'Ticket', 'PassengerId'], axis = 1, inplace=True)

## Sampling Sweep

 

In [7]:
target = 'Survived'

In [8]:
models = {
        'NB' : GaussianNB(),
        'LR' : LogisticRegression(),
        'PC' : Perceptron(),
        'SGD' : SGDClassifier(),
        'SVC' : SVC(),
        'LSVC' : LinearSVC(),
        'KNN' : KNeighborsClassifier(),
        'DT' : DecisionTreeClassifier(),
        'RF' : RandomForestClassifier(),
        'ET' : ExtraTreesClassifier(),
        'ABC' : AdaBoostClassifier(),
        'GBC' : GradientBoostingClassifier(),
        }

Map the model abbreviations to full names for formatting later on.

In [9]:
names =  ['Naive Bayes',
          'Logistic Regression',
          'Perceptron',
          'Stochastic Gradient Descent',
          'Support Vector Classifier', 
          'Linear SVC',
          'k-Nearest Neighbors',               
          'Decision Tree', 
          'Random Forest',
          'Extra Trees',
          'Adaptive Boost Classifier',
          'Gradient Boosting Classifier']

keys = list(models)

model_map = dict(zip(keys,names))

Define the test fractions, i.e. the fraction of data set aside for testing.  The balance will be used for training.  

In [10]:
test_fraction = [0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.9]

Make the DataFrame to store accuracy scores.

In [11]:
accuracy = pd.DataFrame(keys, columns = ['Model'])

Loop through test fractions splitting data accordingly.  Then loop through each model, train, test, and report accuracy.

In [12]:
for x in test_fraction:
    
    train, test = train_test_split(data, test_size = x)

    x_train = train.drop(target,axis=1)
    y_train = train[target]
    x_test = test.drop(target,axis=1)
    y_test = test[target]

    for key in keys:

        models[key].fit(x_train, y_train)
        y_pred = models[key].predict(x_test)
        accuracy.loc[accuracy.Model == key, str(x)] = round(accuracy_score(y_pred, y_test) * 100, 2)
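
A single random split per test fraction can be noisy, since each call to `train_test_split` without a `random_state` draws a different partition. The following is a minimal sketch, not part of the analysis above, that repeats each split over a few seeds and reports the mean and spread. It assumes synthetic data from `make_classification` and a single `LogisticRegression` model as stand-ins; the fractions and seed count are arbitrary choices.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared feature table (assumption).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

fractions = [0.1, 0.3, 0.5]  # test fractions to sweep
seeds = range(5)             # repeat every split with several seeds

rows = []
for frac in fractions:
    for seed in seeds:
        # Fixed random_state makes each split reproducible;
        # stratify keeps the class balance similar in both parts.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=frac, random_state=seed, stratify=y)
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        rows.append({'test_fraction': frac,
                     'seed': seed,
                     'accuracy': accuracy_score(y_te, model.predict(X_te))})

# Mean and standard deviation of accuracy per test fraction.
summary = (pd.DataFrame(rows)
           .groupby('test_fraction')['accuracy']
           .agg(['mean', 'std']))
print(summary.round(3))
```

The standard deviation column gives a rough sense of how much of the variation between columns of an accuracy table could come from the split alone.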

Replace the abbreviations with the full model names.

In [13]:
accuracy['Model'] = accuracy['Model'].map(model_map)

In [14]:
accuracy

Unnamed: 0,Model,0.1,0.2,0.3,0.4,0.5,0.7,0.9
0,Naive Bayes,85.56,81.01,74.63,78.71,77.13,78.04,77.93
1,Logistic Regression,85.56,84.36,81.34,80.39,80.27,80.93,78.93
2,Perceptron,81.11,59.22,38.43,63.31,48.43,61.86,37.91
3,Stochastic Gradient Descent,35.56,40.78,61.57,35.85,58.52,38.14,62.84
4,Support Vector Classifier,84.44,83.8,81.72,81.51,78.25,79.65,72.07
5,Linear SVC,66.67,59.22,70.15,74.23,74.89,80.13,77.68
6,k-Nearest Neighbors,83.33,82.68,79.1,80.39,78.92,79.33,75.06
7,Decision Tree,84.44,78.77,80.22,78.43,76.91,70.83,70.57
8,Random Forest,84.44,81.01,81.72,81.79,79.15,74.84,79.8
9,Extra Trees,88.89,83.24,80.22,82.07,78.92,78.37,76.81


## k-Fold Validation Test

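Next we run a k-fold validation test: the data is split into `num_folds` folds, and each model is trained on all but one fold and scored on the fold left out.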

In [None]:
kf_data = pd.DataFrame(keys, columns = ['Model'])

In [None]:
num_folds = 10


# make the folds to go with the keys
kf = KFold(n_splits=num_folds)

# initialize list of indices for folds
train_idx = []
test_idx = []

# get indices for folds
for train, test in kf.split(data):
    train_idx.append(train)
    test_idx.append(test)


for n in range(len(train_idx)):    
    
    # loop through all models, fit, and test.
    for key in keys:

        x_train = data.iloc[train_idx[n]].drop([target],axis=1)
        y_train = data.iloc[train_idx[n]].Survived
        
        x_test = data.iloc[test_idx[n]].drop([target],axis=1)
        y_test = data.iloc[test_idx[n]].Survived
        
        models[key].fit(x_train, y_train)
        
        y_pred = models[key].predict(x_test)
        
        kf_data.loc[kf_data.Model == key, str(n)] = round(accuracy_score(y_pred, y_test) * 100, 2)
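
For comparison, scikit-learn's `cross_val_score` performs the same fit-and-score cycle per fold in a single call. This is a minimal sketch under stated assumptions: synthetic data from `make_classification` stands in for the prepared table, and a single `DecisionTreeClassifier` stands in for the model dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features and target (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Ten folds, shuffled with a fixed seed so the folds are reproducible.
kf = KFold(n_splits=10, shuffle=True, random_state=0)

# One accuracy value per fold; the estimator is cloned and refit each time.
scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X, y, cv=kf, scoring='accuracy')

print(round(scores.mean() * 100, 2), round(scores.std() * 100, 2))
```

Because the estimator is cloned for every fold, no fitted state carries over between folds or into later cells.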



Replace the keys with the full model names and compute the mean, median, and standard deviation of accuracy across folds.

In [None]:
## names
kf_data['Model'] = kf_data['Model'].map(model_map)

## compute the stats over the per-fold accuracy columns only
fold_cols = [str(n) for n in range(num_folds)]
kf_data['mean'] = kf_data[fold_cols].mean(axis=1)
kf_data['median'] = kf_data[fold_cols].median(axis=1)
kf_data['std_dev'] = kf_data[fold_cols].std(axis=1)

## display 
kf_data[['Model', 'mean', 'median', 'std_dev']].sort_values(by='mean', ascending=False)

<a id="blending"></a>
## Blending

This is my first attempt at blending.  I studied [Anisotropic](https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python) and [MLWave](https://mlwave.com/kaggle-ensembling-guide/).

First we'll move 20% of the data to a final validation set.

In [None]:
validation_data, working_data = train_test_split(data, test_size = 0.8)
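
Because `train_test_split` returns the training part first, `test_size = 0.8` leaves 20% in the first return value. A minimal sketch on a made-up 100-row frame confirms the sizes and that the two parts do not overlap; the frame and variable names are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with 100 rows.
toy = pd.DataFrame({'x': range(100), 'y': [0, 1] * 50})

# The first return value is the "train" part (here 20%),
# the second is the "test" part (here 80%).
holdout, working = train_test_split(toy, test_size=0.8, random_state=0)

print(len(holdout), len(working))
# No row appears in both parts.
print(set(holdout.index).isdisjoint(working.index))
```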

### First Level

In [None]:
# KFold takes the number of splits, not the number of rows; random_state requires shuffle=True
kf = KFold(n_splits=9, shuffle=True, random_state=1)

In [None]:
# dict of model instances
first_level_models = {
        'NB' : GaussianNB(),
        'LR' : LogisticRegression(),
        'SVC' : SVC(),
        'KNN' : KNeighborsClassifier(),
        'DT' : DecisionTreeClassifier(),
        'RF' : RandomForestClassifier(),
        'ET' : ExtraTreesClassifier(),
        'ABC' : AdaBoostClassifier(),
        'GBC' :  GradientBoostingClassifier(),
        }

# get the keys
first_level_keys = [ i for i in first_level_models.keys()]

# make the folds to go with the keys
kf = KFold(n_splits=len(first_level_keys))

# initialize list of indices for folds
train_idx = []
test_idx = []

# get indices for folds
for train, test in kf.split(working_data):
    train_idx.append(train)
    test_idx.append(test)

# initialize counter
n=0

# train each first level model on a different fold
for key in first_level_keys:
    
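    # the fold indices were generated from working_data, so index working_data (not data)
    x_train = working_data.iloc[train_idx[n]].drop([target],axis=1)
    y_train = working_data.iloc[train_idx[n]].Survived
    
    x_test = working_data.iloc[test_idx[n]].drop([target],axis=1)
    y_test = working_data.iloc[test_idx[n]].Survived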
    
    first_level_models[key].fit(x_train, y_train)
    y_pred = first_level_models[key].predict(x_test)
    
    print(key,' : ',round(accuracy_score(y_pred, y_test)*100, 2))
    
    n += 1


### Second Level

Return to the full working_data set for the second level. Every row of it has been seen by at least one first-level classifier, which is why 20% of the original data was set aside earlier for final validation. Here we split working_data into two fractions (90% and 10% in the code that follows). The first-level predictions on the larger fraction are used to train the second-level model. The first-level predictions on the smaller fraction serve as the input to the second-level model for testing.


In [None]:
# split the working data set
train, test = train_test_split(working_data, test_size = 0.1)

x_train = train.drop(target,axis=1)
y_train = train[target]

x_test = test.drop(target,axis=1)
y_test = test[target]

# initialize DataFrames
first_level_predictions = pd.DataFrame()
second_level_input = pd.DataFrame()

# generate the first level predictions and second level input
for key in first_level_keys:
    first_level_predictions[key] = first_level_models[key].predict(x_train)
    second_level_input[key] = first_level_models[key].predict(x_test)
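
A common alternative for building second-level training data is to use out-of-fold predictions, so that every row is predicted by a model that did not train on it. This is a different technique from the fold-per-model approach used in this notebook. The sketch below assumes synthetic data from `make_classification` and a reduced, illustrative set of base models.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the working data (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Reduced, illustrative set of first-level models.
base_models = {
    'NB': GaussianNB(),
    'LR': LogisticRegression(max_iter=1000),
    'DT': DecisionTreeClassifier(random_state=0),
}

# cross_val_predict fits on k-1 folds and predicts the held-out fold,
# so each row's prediction comes from a model that never saw that row.
oof = pd.DataFrame({
    name: cross_val_predict(model, X, y, cv=5)
    for name, model in base_models.items()
})

print(oof.shape)
print(oof.corr().round(2))
```

The resulting frame has one column per base model and can be passed to a second-level model in the same way as `first_level_predictions`.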

Optional heatmap below to view the correlation of the different models' predictions.  It is good to have uncorrelated models in the blend.

In [None]:
colormap = plt.cm.viridis
plt.figure(figsize=(6,6))
plt.title('Pearson Correlation of First-Level Predictions', y=1.05, size=15)
sns.heatmap(first_level_predictions.astype(float).corr(),linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)
plt.show()

#### Blended model testing

In [None]:
GBC = GradientBoostingClassifier()

# fit the first level predictions
GBC.fit(first_level_predictions, y_train)

# generate second level prediction
second_level_prediction = GBC.predict(second_level_input)

print('Blended Model (GBC): ',round(accuracy_score(second_level_prediction, y_test) * 100, 2))

In [None]:
XGB = xgboost.XGBClassifier()

# fit the first level predictions
XGB.fit(first_level_predictions, y_train)

# generate second level prediction
second_level_prediction = XGB.predict(second_level_input)

print('Blended Model (XGB): ',round(accuracy_score(second_level_prediction, y_test) * 100, 2))
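
For reference, scikit-learn (version 0.22 and later) ships a `StackingClassifier` that wires first-level and second-level models together and generates the second-level training data by internal cross-validation. The sketch below is not the procedure used in this notebook; it assumes synthetic data from `make_classification` and an illustrative subset of base models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the working data (assumption).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0)

stack = StackingClassifier(
    # First-level models (illustrative subset).
    estimators=[('NB', GaussianNB()),
                ('LR', LogisticRegression(max_iter=1000)),
                ('KNN', KNeighborsClassifier())],
    # Second-level model trained on cross-validated first-level outputs.
    final_estimator=GradientBoostingClassifier(random_state=0),
    cv=5,
)
stack.fit(X_tr, y_tr)

stack_accuracy = stack.score(X_te, y_te)
print('Stacked model: ', round(stack_accuracy * 100, 2))
```

Comparing this against the hand-built blend on the same split is one way to sanity-check the manual pipeline.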

## Validation


In [None]:
x_validation = validation_data.drop(target,axis=1)
y_validation = validation_data[target]

# initialize DataFrames
first_level_predictions = pd.DataFrame()
second_level_input = pd.DataFrame()

# generate the second level input from the first level models
for key in first_level_keys:
    second_level_input[key] = first_level_models[key].predict(x_validation)

scikit-learn Gradient Boosted Classifier

In [None]:
validation_prediction = GBC.predict(second_level_input)
print('Blended Model Validation (GBC): ',round(accuracy_score(validation_prediction, y_validation) * 100, 2))

xgboost classifier

In [None]:
validation_prediction = XGB.predict(second_level_input)
print('Blended Model Validation (XGB): ',round(accuracy_score(validation_prediction, y_validation) * 100, 2))