# Prototype of code

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import argparse
import importlib
import math

import titalib.preproc as prep
import titalib.printer as prt
import titalib.models as mdl

from sklearn import svm
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn import neighbors
from sklearn import tree

importlib.reload(prep)
importlib.reload(prt)

<module 'titalib.printer' from '/home/marcel/progs/python_scripts/titanic/titalib/printer.py'>

## Feature engineering and preparing the data

Let's load the data and look at the first 10 entries as a comparison point:

In [3]:
datatrain_raw = prep.dataload('data/train.csv')
datatest_raw = prep.dataload('data/test.csv')
datatrain = datatrain_raw.copy()
datatest = datatest_raw.copy()

Now let's compare it with the pre-processed version of the data using the *preproc.dataformat()* function:

In [4]:
# limit of some categorical features
limits = {"child": 15,
          "sibsplimit": 2,
          "parchlimit": 1,
          "farebound1": 10,
          "farebound2": 50,
          "titlelimit": 10}
droplist = ['Name','Ticket','Survived','Cabin','Embarked','Pclass','SibSp','Parch']
fullset, trainsize, labels = prep.dataformat(datatrain_raw,
                                             datatest_raw,
                                             droplist=droplist,
                                             limits=limits)
fullset.head(10)

NaN in the features:
 ['Age', 'Cabin', 'Embarked', 'Fare']

'Age' categories: 2 (childs under 15)
'Fare' categories: 3 (delimiters: 10 and 50)
'Sex' categories: 2
'Relatives' categories: 2
'Location' categories: 4


| PassengerId | Age | Fare | Sex | Relatives | Location |
|---|---|---|---|---|---|
| 1 | 1 | 0 | 1 | 1 | 0 |
| 2 | 1 | 2 | 0 | 1 | 3 |
| 3 | 1 | 0 | 0 | 0 | 0 |
| 4 | 1 | 2 | 0 | 1 | 3 |
| 5 | 1 | 0 | 1 | 0 | 0 |
| 6 | 1 | 0 | 1 | 0 | 0 |
| 7 | 1 | 2 | 1 | 0 | 3 |
| 8 | 0 | 0 | 1 | 0 | 0 |
| 9 | 1 | 0 | 0 | 1 | 0 |
| 10 | 0 | 1 | 0 | 1 | 1 |


So far so good. Using the *preproc.dataformat()* function, we:
- extracted the labels from the **Survived** feature.
- dropped the **Name**, **Ticket** and **Survived** features.
- replaced the 'NaN' values by easily distinguishable dummy values ('Z' for **Cabin** and **Embarked**, '200' for **Age**, '0' for **Fare**).
- normalized the **Fare** feature by the number of cabins booked, and split it into 3 categories defined by 2 customizable boundaries (below the first, between the two, and above the second).
- turned the **Age** feature into a class feature by assigning each passenger a class label according to their age slice, which is a user-defined parameter.
- reduced the **Cabin** feature to a single letter and moved the sparsely populated categories into a 'garbage' category to reduce the number of parameters.
- converted all the class features into numeric (integer) class labels.
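
The binning steps above can be sketched with plain pandas. This is not the actual `titalib.preproc` code, which is not shown here: the toy frame, the `bin_features` helper name and the choice of right-open intervals are assumptions made for illustration. Only the dummy values (200 for a missing **Age**, 0 for a missing **Fare**) and the limits come from the description above.

```python
import numpy as np
import pandas as pd

def bin_features(df, child=15, farebound1=10, farebound2=50):
    """Hypothetical helper: turn Age and Fare into integer class labels."""
    out = pd.DataFrame(index=df.index)
    # Missing values get the dummy values mentioned in the text.
    age = df["Age"].fillna(200)
    fare = df["Fare"].fillna(0)
    # Age: 0 = child (under `child`), 1 = adult, 2 = unknown (dummy value 200).
    out["Age"] = pd.cut(age, bins=[-np.inf, child, 200, np.inf],
                        right=False, labels=False)
    # Fare: 0 = below farebound1, 1 = between the bounds, 2 = farebound2 and above.
    out["Fare"] = pd.cut(fare, bins=[-np.inf, farebound1, farebound2, np.inf],
                         right=False, labels=False)
    return out

# Made-up passengers: an adult, a child, and one with missing values.
toy = pd.DataFrame({"Age": [22.0, 4.0, np.nan], "Fare": [7.25, 71.28, np.nan]})
binned = bin_features(toy)
print(binned)
```

Using `labels=False` makes `pd.cut` return the integer bin index directly, which matches the "numeric class label" convention used in the rest of this notebook.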

We must now perform a one-hot encoding on all our class features (all of them currently) before being able to use them in scikit-learn. We will use the *preprocessing.OneHotEncoder()* for this task, as it outputs a single array of binarized features whose size depends on the total number of categories across the input features.

First, let's convert our pandas dataframe to a 2D array:

In [4]:
test = fullset.to_numpy()  # 2D NumPy array of integer class labels

Now, let's one-hot encode the features (all of them here, so no mask is needed yet):

In [5]:
from sklearn import preprocessing

enc = preprocessing.OneHotEncoder()  # returns a sparse matrix by default
test = enc.fit_transform(test)

In [6]:
test

<1309x28 sparse matrix of type '<class 'numpy.float64'>'
	with 11781 stored elements in Compressed Sparse Row format>

It's ok, we transformed our $1309 \times 9$ (full data entries x class features) matrix into a $1309 \times 28$ sparse matrix. Each of our class features has been split into $N$ features taking only the value '1' or '0', where $N$ is the number of categories of the feature. We have here 9 class features, and the sum of all categories is 28. We should thus have 9 ones and 19 zeros per row, for a total of 28 features (the 11781 stored elements are indeed $1309 \times 9$). Let's check this:

In [7]:
test.toarray()[0]

array([ 0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,
        1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,
        1.,  0.])

which corresponds, in this order, to:
- **Age**=1
- **Cabin**=6
- **Embarked**=2
- **Fare**=0
- **Name**=0
- **Parch**=0
- **Pclass**=3
- **Sex**=1
- **SibSp**=1

With the exception of the *Cabin* field, this is our 1st passenger, as we can see below. The *Cabin* field gives a different result because of the current implementation, which merges cabin categories **after** the string-to-integer conversion. This means that the *Cabin* field can take any of the initial 9 + 1 (for the new 'merged' category) = 10 values, but only a subset of them is actually represented. We thus end up with a subset of categories instead of 10. This changes nothing for the one-hot encoding.

In [8]:
fullset.head(1)

| PassengerId | Age | Cabin | Embarked | Fare | Name | Parch | Pclass | Sex | SibSp |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 8 | 2 | 0 | 0 | 0 | 3 | 1 | 1 |
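
The behaviour described above, where only the integer codes that actually occur get their own column, can be reproduced on a toy array. This is a small sketch using scikit-learn's `OneHotEncoder`; the toy values are made up and are not the Titanic data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy integer-coded features: column 0 uses codes {0, 1},
# column 1 uses only codes {2, 6, 8} out of a possible 0..9.
toy = np.array([[0, 2],
                [1, 6],
                [0, 8],
                [1, 6]])

enc = OneHotEncoder()            # sparse output by default
encoded = enc.fit_transform(toy)

# One output column per *observed* category: 2 + 3 = 5 columns.
print(encoded.shape)             # (4, 5)
print(enc.categories_)           # observed codes for each input column
# Exactly one '1' per input feature in each row.
print(encoded.toarray()[1])      # [0. 1. 0. 1. 0.]
```

The position of each '1' is the rank of the code among the observed codes of its column, not the code itself, which is why a cabin code cannot be read back directly from the encoded row.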


We implement those modifications in the *preproc.dataonehot()* function.

In [9]:
mask=[True,True,True,True,True,True,True,True,True]  
datatrain, datatest, labels = prep.dataonehot(dataset=fullset,
                                                labels=labels,
                                                mask=mask,
                                                trainsize=trainsize)
datatrain[0]

Total number of Features: 28



array([ 0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,
        1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,
        1.,  0.])

One last thing we need to do is rescale our features if we plan to use a Support Vector Machine, which is not scale invariant. The *preproc.datanorm()* function scales all those features to zero mean and unit variance, and outputs a matrix of data ready to be used by our machine learning algorithm.

In [10]:
datatrain = prep.datanorm(datatrain)
datatest = prep.datanorm(datatest)
datatrain[0]

array([-0.30974338,  0.63320091, -0.49789473, -0.23598136, -0.26629582,
        0.54492498, -0.35154137, -0.48204268, -0.30756234,  0.61583843,
        0.84661857, -0.62277642, -0.40019526,  0.85053175, -0.50665528,
       -0.4039621 , -0.21680296, -0.1767767 ,  0.56049915, -0.56049915,
       -0.56568542, -0.51015154,  0.90258736, -0.73769513,  0.73769513,
       -1.46574551,  1.80642129, -0.30095727])
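
In the cell above the training and test sets are normalized independently. A common alternative, sketched here with scikit-learn's `StandardScaler`, is to estimate the mean and variance on the training set only and reuse them for the test set, so that both are shifted and scaled identically. How `preproc.datanorm()` works internally is not shown in this notebook, and the arrays below are random placeholders rather than the real data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Placeholder binary matrices standing in for the one-hot encoded data.
train = rng.randint(0, 2, size=(100, 4)).astype(float)
test = rng.randint(0, 2, size=(20, 4)).astype(float)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # statistics come from train only
test_scaled = scaler.transform(test)        # same shift and scale applied to test

print(train_scaled.mean(axis=0).round(6))   # approximately 0 for every column
print(train_scaled.std(axis=0).round(6))    # approximately 1 for every column
```

With this approach the test columns will not have exactly zero mean and unit variance, which is expected: they are expressed in the units learned from the training data.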

Now we can start dealing with the real work, the Machine Learning part.

## Training the model

Let's try a Support Vector Machine from scikit-learn to make sure everything can run, before going deeper into any model.

In [11]:
from sklearn import svm

X = datatrain
y = labels
classif = svm.SVC(cache_size=1000)
classif.fit(X, y)

SVC(C=1.0, cache_size=1000, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [12]:
classif.score(X,y)

0.85970819304152635

It seems to be working. Let's dig deeper now. The SVM is an appropriate choice for this kind of problem (classification), but it is not the only option. Given that we have a small number of data entries (fewer than 1000 here), we will stick with it, as these are favorable conditions for an SVM.

We should try to:
- fine-tune its parameters using a grid search. We will then need to split the data into a training set and a cross-validation set.
- try different slices of *Age*.
- try different bounds for the 3 *Fare* categories we defined earlier.
- try different numbers of categories for *SibSp* and *Parch*.
- try a different limit for the number of *Cabin* categories.

In [13]:
from sklearn.model_selection import ShuffleSplit

# shuffle the data and draw n_folds random train/test splits,
# each holding out a fraction folds_size of the data for testing
n_folds = 5
folds_size = 0.4
shuffle = ShuffleSplit(n_splits=n_folds, test_size=folds_size, random_state=0)

In [14]:
from sklearn.model_selection import GridSearchCV

classif = svm.SVC(cache_size=1000)
# parameters to be tested
param_C = np.linspace(1,20,5)
param_Gam = np.linspace(0.001,0.1,5)
parameters = [{'C': param_C, 'gamma': param_Gam, 'kernel': ['rbf']}]
    
# grid search on the parameters
grid_search = GridSearchCV(classif, parameters, cv=shuffle.split(X, y), n_jobs=-1, verbose=1)
grid_search.fit(X, y)

# display best results
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in grid_search.best_params_.keys():
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Done  88 tasks      | elapsed:    0.9s


Best score: 0.817
Best parameters set:
	C: 4.8888888888888893
	gamma: 0.0080000000000000002
	kernel: 'rbf'


[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:    4.0s finished


The score reported here is the mean over the test parts of the 5 splits declared above, the model being trained each time on the remaining part (60% of the data in this example). By tuning the parameters with a finer grid search and doing additional feature engineering, one can reach a score of about 85%.

Given the amount of data we currently have, I think it is better to train on about 80% of the data and test on the remaining 20%. To get reliable results, one should also increase the number of folds to something like 10 or 15. When training on 50% of the data or less, the model clearly overfits: I obtained a training score above 90% but a test score around 81%. As the training set grows, the training and test accuracies converge to a value between 85% and 87% with fine-tuned parameters and features.
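
The 80/20 setup with more folds, together with the comparison of training and test scores, could look like the sketch below. It uses `cross_validate` from scikit-learn on a synthetic data set from `make_classification`, which is only a stand-in for the Titanic features; the SVM parameters are illustrative and the scores it prints say nothing about the real data.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, cross_validate

# Synthetic stand-in for the Titanic features (assumption, not the real data).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# 10 random splits, each training on 80% and testing on 20% of the data.
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = cross_validate(svm.SVC(C=8, gamma=0.008, kernel='rbf'),
                        X_demo, y_demo, cv=cv, return_train_score=True)

# A large gap between the two means is a sign of overfitting.
print("train: %0.3f" % scores["train_score"].mean())
print("test:  %0.3f" % scores["test_score"].mean())
```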

## Make predictions and write the output file

In [15]:
datatrain_raw = prep.dataload('data/train.csv')
datatest_raw = prep.dataload('data/test.csv')
datatrain = datatrain_raw.copy()
datatest = datatest_raw.copy()

In [16]:
fullset, trainsize, labels = prep.dataformat(datatrain_raw, datatest_raw, droplist=droplist, limits=limits);
datatrain, datatest, labels = prep.dataonehot(dataset=fullset, labels=labels, trainsize=trainsize);

NaN in the features:
	Age: True
	Cabin: True
	Embarked: True
	Fare: True
	Name: False
	Parch: False
	Pclass: False
	Sex: False
	SibSp: False
	Survived: False
	Ticket: False

Remaining Age categories: 3  (child under 15 years)
Remaining Cabin categories: 4 (6 merged)
Embarked categories: 3
Fare categories delimiters: 10 and 50 (3 categories)
Remaining Parch categories: 2
Pclass categories: 3
Gender categories: 2
Remaining SibSp categories: 3
Remaining Title categories: 5 

Total number of Features: 28



In [17]:
datatrain = prep.datanorm(datatrain)
datatest = prep.datanorm(datatest)

In [18]:
classif = svm.SVC(C=8,gamma=0.008,kernel='rbf',cache_size=2000)
classif.fit(datatrain, labels)

SVC(C=8, cache_size=2000, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.008, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [19]:
predictions = classif.predict(datatest)
predictions

array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
       1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0,

Our predictions are ready to be written to the output file. We just have to write them in the correct format for the Kaggle challenge.
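
The Kaggle Titanic challenge expects a CSV file with a `PassengerId` column and a `Survived` column. A minimal sketch of writing such a file with pandas is given below; the ids and predictions are placeholders (in this notebook they would be the test-set passenger ids and the `predictions` array), and the file name `submission.csv` is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Placeholders: in the notebook these would be the test-set passenger ids
# and the array returned by classif.predict(datatest).
passenger_ids = np.array([892, 893, 894])
demo_predictions = np.array([0, 1, 0])

submission = pd.DataFrame({"PassengerId": passenger_ids,
                           "Survived": demo_predictions.astype(int)},
                          columns=["PassengerId", "Survived"])
# Hypothetical output file name; index=False avoids an extra unnamed column.
submission.to_csv("submission.csv", index=False)
print(submission)
```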