# sklhelper testing

This notebook tests the `sklhelpClassify` class from sklhelper. The goal of the class is to streamline the initial evaluation of several scikit-learn classifiers on a new data set.

Sections:
- [Titanic Data Prep](#feature)
- [Class testing](#kfold)

In [1]:
import pandas as pd
import re
from sklhelper import sklhelpClassify



#### Load data

In [2]:
data = pd.read_csv('./input/titanic.csv')

In [3]:
data.head()

,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [4]:
data.drop(['boat', 'body', 'home.dest'], axis=1, inplace=True)

<a id="feature"></a>
## Feature Engineering

Here we map non-numerical features onto integers and derive a few new features from the existing columns. First, let's count the missing values.


In [5]:
data.isnull().sum()

pclass         0
survived       0
name           0
sex            0
age          263
sibsp          0
parch          0
ticket         0
fare           1
cabin       1014
embarked       2
dtype: int64

Not too bad. The most widely documented fix for the missing *age* values is to estimate them from each passenger's title. We will follow this path.

First we'll need to extract the *title* feature from the *name* feature. We can look at the data and see that the title is always preceded by a space and followed by a period. This is a job for a [regular expression](https://docs.python.org/3/howto/regex.html). In this case, we'll use `re.findall()` with the pattern `\w+\.`, which matches a run of word characters ending in a period, and keep the first match.


In [6]:
data['title'] = data['name'].apply(lambda x: re.findall(r'\w+\.', x)[0])

In [7]:
data['title'].unique()

array(['Miss.', 'Master.', 'Mr.', 'Mrs.', 'Col.', 'Mme.', 'Dr.', 'Major.',
       'Capt.', 'Lady.', 'Sir.', 'Mlle.', 'Dona.', 'Jonkheer.',
       'Countess.', 'Don.', 'Rev.', 'Ms.'], dtype=object)
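
As a quick check of the pattern, here is a small sketch that runs it on three names taken from the table above. The `names` Series and the variable names are illustrative only. It also shows `Series.str.extract` as a vectorized alternative to `apply`, which should give the same titles.

```python
import re
import pandas as pd

# Three names copied from the data.head() output above
names = pd.Series([
    "Allen, Miss. Elisabeth Walton",
    "Allison, Master. Hudson Trevor",
    "Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",
])

# Same pattern as in the notebook: word characters followed by a period
titles_findall = names.apply(lambda x: re.findall(r'\w+\.', x)[0])

# Vectorized alternative: one capture group, expand=False returns a Series
titles_extract = names.str.extract(r'(\w+\.)', expand=False)

print(list(titles_findall))
```

Both approaches keep only the first match in each name, so a later abbreviation in the name does not affect the title.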

We need to fill the NaNs in the *age* column, and common practice is to use the median age of the corresponding title group. We'll do this before collapsing the titles into broader groups, so that each median reflects the original, finer-grained title. The ages are then binned with `pandas.qcut` into five quantile-based bins. Setting `labels=False` returns the bin labels as integers, which is the input format we need for xgboost.

In [8]:
## fill NaN with marker value
## (assign back instead of a chained inplace call, which may not update `data`)
data['age'] = data['age'].fillna(-1)

## get unique titles
titles = data.title.unique()

## calculate median age for each title
medians = dict()
for title in titles:
    median = data.age[(data["age"] != -1) & (data['title'] == title)].median()
    medians[title] = median

## replace empty age with median value from
## the passenger's title group
for index, row in data.iterrows():
    if row['age'] == -1:
        data.loc[index, 'age'] = medians[row['title']]

## categorical map
data['age'] = pd.qcut(data['age'], 5, labels=False)
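
The marker-value loop above can also be expressed with `groupby` and `transform`. The following is a minimal sketch on a toy frame (the `toy` data and `title_median` name are made up for illustration), not a drop-in replacement tested against the full data set.

```python
import numpy as np
import pandas as pd

# Toy data: two title groups, each with one missing age
toy = pd.DataFrame({
    'title': ['Mr.', 'Mr.', 'Mr.', 'Miss.', 'Miss.', 'Miss.'],
    'age':   [30.0, 40.0, np.nan, 2.0, np.nan, 20.0],
})

# Median age per title, broadcast back onto every row of that title.
# median() skips NaN, so no marker value is needed.
title_median = toy.groupby('title')['age'].transform('median')

# Fill only the missing ages with the median of the passenger's title group
toy['age'] = toy['age'].fillna(title_median)

print(toy)
```

This avoids the `-1` sentinel and the row-by-row `iterrows()` loop, at the cost of being a little less explicit about each step.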

Now we'll sort the various titles into a few categories.  I've expanded on the commonly presented groupings for experimentation.

In [9]:
## list of rare titles indicating higher socioeconomic standing
rare_titles = ['Master.', 'Don.','Dona.', 'Dr.', 
                'Lady.', 'Sir.', 'Countess.', 'Jonkheer.']
               
religious = ['Rev.']

military = ['Major.', 'Capt.', 'Col.',]


## label rare titles
data['title'] = data['title'].replace(rare_titles, 'Rare')

## religious
data['title'] = data['title'].replace(religious, 'Religious')

## military
data['title'] = data['title'].replace(military, 'Military')

## normalize married female prefixes
data['title'] = data['title'].replace('Mme.','Mrs.')

## normalize single female prefixes
data['title'] = data['title'].replace([ 'Mlle.', 'Ms.'], 'Miss.')

## map integers onto title data
title_number_mapping = {'Mr.' : 1, 'Mrs.' : 2, 'Miss.' : 3, 'Rare' : 4 , 'Military' : 5, 'Religious' : 6}
data['title'] = data['title'].map(title_number_mapping)
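
One thing to watch with `Series.map` and a dictionary: any value that is not a key in the dictionary becomes NaN. A small sketch of the grouping and a leftover check, using a handful of made-up sample titles (`raw`, `grouped` and `unmapped` are illustrative names):

```python
import pandas as pd

# A few raw titles, one from each group handled above
raw = pd.Series(['Mr.', 'Mme.', 'Mlle.', 'Col.', 'Rev.', 'Dr.'])

# Collapse variants into groups, mirroring the replacements in the cell above
grouped = (raw.replace(['Dr.'], 'Rare')
              .replace(['Rev.'], 'Religious')
              .replace(['Col.'], 'Military')
              .replace('Mme.', 'Mrs.')
              .replace(['Mlle.'], 'Miss.'))

mapping = {'Mr.': 1, 'Mrs.': 2, 'Miss.': 3,
           'Rare': 4, 'Military': 5, 'Religious': 6}
codes = grouped.map(mapping)

# Titles that were not covered by the mapping would show up here
unmapped = grouped[codes.isnull()].unique()
print(codes.tolist(), list(unmapped))
```

Running the same `isnull()` check on `data['title']` after the mapping is a cheap way to confirm that every title found earlier landed in one of the six groups.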

The *fare* binning looks just like the *age* binning, except that missing fares are filled with the median fare of the passenger's *pclass* rather than of the title group.

In [10]:
## fill NaN with marker value
## (assign back instead of a chained inplace call, which may not update `data`)
data['fare'] = data['fare'].fillna(-1)

## get list of all classes
all_pclasses = data.pclass.unique()

## calculate median fare for each class
medians = dict()
for pclass in all_pclasses:
    median = data.fare[(data["fare"] != -1) & (data['pclass'] == pclass)].median()
    medians[pclass] = median

## assign missing fares the median value
## for the passenger's pclass
for index, row in data.iterrows():
    if row['fare'] == -1:
        data.loc[index, 'fare'] = medians[row['pclass']]

## bin data
data['fare'] = pd.qcut(data['fare'], 5, labels=False)
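
Both *age* and *fare* are binned with `pd.qcut`, which picks bin edges from quantiles so that each bin holds roughly the same number of rows. For comparison, `pd.cut` uses equal-width bins. The sketch below uses a made-up, skewed series of fares to show the difference; the numbers are illustrative and not taken from the Titanic data.

```python
import pandas as pd

# Ten made-up fares with one large outlier
fares = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# Quantile-based bins: about two values per bin, integer codes 0..4
quantile_codes = pd.qcut(fares, 5, labels=False)

# Equal-width bins: the outlier stretches the range, so most values share bin 0
width_codes = pd.cut(fares, 5, labels=False)

print(quantile_codes.tolist())
print(width_codes.tolist())
```

With skewed features such as fares, quantile bins keep the categories balanced, which is presumably why `qcut` is the better fit here.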

Convert the sex to an integer.

In [11]:
## categorical map
sex_mapping = {"female": 1 ,"male": -1}
data['sex'] = data['sex'].map(sex_mapping)

Convert the port of embarkation to an integer.

In [12]:
## fill NaN
data['embarked'] = data['embarked'].fillna('S')

## categorical map
embarked_mapping = {'Q' : 1, 'S' : 2, 'C' : 3}
data['embarked'] = data['embarked'].map(embarked_mapping)

We'll create a new feature which indicates whether the passenger's ticket number is shared with at least one other passenger.

In [13]:
data['ticket_paired'] = data.duplicated(subset='ticket', keep=False)
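
The `keep=False` argument matters here: it flags every row whose ticket appears more than once, rather than only the second and later occurrences. A minimal sketch on a toy frame (the ticket values are reused from the sample rows above, with one made-up extra):

```python
import pandas as pd

toy = pd.DataFrame({'ticket': ['113781', '24160', '113781', '19952']})

# keep=False marks all members of a duplicated group as True
toy['ticket_paired'] = toy.duplicated(subset='ticket', keep=False)

# The default (keep='first') would leave the first occurrence as False
first_only = toy.duplicated(subset='ticket')

print(toy['ticket_paired'].tolist(), first_only.tolist())
```

The resulting column is boolean; cast it with `.astype(int)` if a model needs integer input.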

We'll add a feature for the total family size: siblings and spouses (*sibsp*) plus parents and children (*parch*), plus one for the passenger. There could be some synergistic effect which is missed by accounting for the two counts separately.

In [14]:
data['family_size'] = data['sibsp'] + data['parch'] + 1

We'll take a similar approach with *age* and *pclass*, this time multiplying the binned age by the class to account for age-class interaction.

In [15]:
data['age_class'] = data.age * data.pclass

The *cabin* feature is fairly rich. We don't have data on everyone, but when we do we can isolate which deck of the ship they were on. We can also identify passengers whose entry lists multiple cabins.

In [16]:
## fill NaN with U for unknown
## (assign back instead of a chained inplace call, which may not update `data`)
data["cabin"] = data["cabin"].fillna('U')

## strip number from cabin
data["cabin"] = data["cabin"].apply(lambda x: re.sub('[0-9]','',x))

## look for passengers with multiple cabins
data['num_cabins'] = data["cabin"].str.count('[A-G]')

## treat deck 'T' as unknown (it has no entry in the deck mapping below)
data["cabin"] = data["cabin"].replace('T','U')

## reduce multi-cabin entries to single character deck letter
data["cabin"] = data["cabin"].apply(lambda x: x.split()[0])

## map integers onto deck data.
cabin_mapping = {'A' : 1, 'B' : 2, 'C' : 3, 'D' : 4, 'E' : 5, 'F' : 6, 'G' : 7, 'U' : 8}
data["cabin"] = data["cabin"].map(cabin_mapping)
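
To make the cabin steps concrete, here is a small walk-through on five sample values. The first two come from the rows shown earlier; the NaN, `'T'` and `'F G73'` entries are added as assumed examples of the other cases the code handles.

```python
import re
import numpy as np
import pandas as pd

cabin = pd.Series(['C22 C26', 'B5', np.nan, 'T', 'F G73'])

cabin = cabin.fillna('U')                               # unknown cabins
cabin = cabin.apply(lambda x: re.sub('[0-9]', '', x))   # 'C22 C26' -> 'C C'
num_cabins = cabin.str.count('[A-G]')                   # deck letters per entry
cabin = cabin.replace('T', 'U')                         # 'T' is not in the mapping
deck = cabin.apply(lambda x: x.split()[0])              # keep the first deck letter

deck_mapping = {'A': 1, 'B': 2, 'C': 3, 'D': 4,
                'E': 5, 'F': 6, 'G': 7, 'U': 8}
deck_code = deck.map(deck_mapping)

print(num_cabins.tolist(), deck.tolist(), deck_code.tolist())
```

Note that the letter count treats an entry such as `'F G73'` as two cabins, and that unknown cabins get a count of 0 because `'U'` is outside `[A-G]`.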

Drop columns that won't be passed to the sklearn models.

In [17]:
# data.drop(['name', 'ticket', 'cabin', 'fare', 'embarked', 
#            'family_size', 'num_cabins', 'ticket_paired'], axis = 1, inplace=True)
data.drop(['name', 'ticket'], axis = 1, inplace=True)

In [18]:
# list(data.columns)

Check for rows that still contain missing values.

In [19]:
# index labels of rows with missing data
# (Series.nonzero() was removed from pandas, so select from the index instead)
# data.index[data.isnull().any(axis=1)]

In [20]:
# data.head()

<a id="kfold"></a>
## sklhelper testing

Run a k-fold cross-validation test for a variety of models. The default is k=5. Aggregated statistics are reported at the end.

Define the column name of the predicted quantity.

In [21]:
target = 'survived'

Create sklhelper instance.

In [22]:
skl = sklhelpClassify()

Pass in the pandas DataFrame which contains the data.

In [23]:
skl.get_data(data)

Set the target for prediction.

In [24]:
skl.set_target(target)

Run the k-fold test.

In [25]:
skl.kfold()

Generate ranked summary.

In [26]:
skl.ranked_summary()

                           model    mean  median    std_dev
0                  Random Forest  81.514   81.30   4.089637
6      Support Vector Classifier  80.598   78.63   4.744531
1                    Extra Trees  80.290   80.08   4.340760
12     eXtreme Gradient Boosting  79.146   76.34   5.443834
11  Gradient Boosting Classifier  78.916   75.95   5.976494
10     Adaptive Boost Classifier  78.842   76.34   4.977958
3            Logistic Regression  78.150   77.78   4.133304
13         Multilayer Perceptron  78.150   78.24   3.428790
7                     Linear SVC  77.614   77.39   3.650675
8             k-Nearest Neigbors  76.472   77.01   4.784843
9                  Decision Tree  75.402   75.57   1.831412
2           Gaussian Naive Bayes  74.640   77.78   5.816954
4                     Perceptron  70.978   70.99   6.965003
5    Stochastic Gradient Descent  62.734   76.34  21.784842
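
The internals of `kfold()` are not shown in this notebook. As a rough sketch of the kind of evaluation that a table like the one above summarizes, here is a 5-fold run for a single classifier using scikit-learn directly. The synthetic data, the `summary` dictionary, the accuracy metric, the percentage scaling and the sample standard deviation (`ddof=1`) are all assumptions made for illustration, not sklhelper's actual implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the prepared feature matrix and 'survived' target
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Five shuffled folds, matching the default k=5 mentioned above
cv = KFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

# One accuracy score per fold, scaled to percent (assumed scale)
scores = 100 * cross_val_score(model, X, y, cv=cv, scoring='accuracy')

# Aggregate the per-fold scores the way the summary columns suggest
summary = {
    'mean': scores.mean(),
    'median': np.median(scores),
    'std_dev': scores.std(ddof=1),  # assumption: sample standard deviation
}
print(summary)
```

With only five folds, a large gap between mean and median (as in the Stochastic Gradient Descent row) usually points to one or two folds scoring very differently from the rest.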


In [27]:
# skl.report()