# Kaggle Titanic Challenge using sklearn's Random Forest Classifier

This Jupyter notebook describes a model for Kaggle's "Titanic: Machine Learning from Disaster" challenge (https://www.kaggle.com/c/titanic).

You can get the data here: https://www.kaggle.com/c/titanic/data.

## Libraries

First we need to import some libraries.
The following versions were used:

1. Pandas 0.21.0
2. Numpy 1.14.2
3. sklearn 0.19.1

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Data preprocessing

### Loading the data

We first need to import the data:

In [2]:
# The training data
data_labels = ['PassengerId','Survived','Pclass','Name','Sex','Age','SibSp','Parch','Ticket','Fare','Cabin','Embarked']
data = pd.read_csv('./train.csv', skiprows=1, index_col=False, names=data_labels)

# The prediction data; the predictions for it are uploaded to Kaggle for evaluation
predict_labels = ['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
x_predict = pd.read_csv('./test.csv', skiprows=1, index_col=False, names=predict_labels)

### Working the data

We want to extract relevant features from the dataset and therefore need to preprocess it.
There were a few things I wanted to consider for the model.

#### Titles from names

A look at the data shows that the 'Name' column contains not only the full name but also a title. I suspected that a 'Major' might have a higher survival rate than an ordinary 'Mr.' or 'Mrs.', so I extracted the title from every name in the dataset, stored it in a new 'Title' column, and dropped the name. The 'Title' column is later converted to a categorical (dummy) encoding.
Not every title that occurs in the training set also occurs in the prediction set. To give both sets the same categories, I combine them into one big set before computing the titles (even though this means that some columns in the prediction set contain only zeros).

In [3]:
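def retrieve_title(dataset):
    # 'Surname, Title. Given names' -> 'Title. Given names'
    dataset['Name'] = dataset['Name'].str.split(', ', expand = True)[1]
    # 'Title. Given names' -> 'Title.' (only the first word is kept)
    dataset['Title'] = dataset['Name'].str.split(' ', expand = True)[0]
    # The name itself is not used as a feature
    dataset = dataset.drop(['Name'], axis=1)
    return dataset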

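# The prediction set has no Survived column, so we store it somewhere else while computing the titles.
y_data = data['Survived']
data = data.drop(['Survived'], axis=1)
# Stack training and prediction rows (pd.concat also works on pandas versions without DataFrame.append)
data = pd.concat([data, x_predict])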

data = retrieve_title(data)
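
To make the two splits in `retrieve_title` concrete, here is a minimal pure-Python sketch of the same logic applied to single strings. The helper name `extract_title` and the sample names are assumptions made for illustration; they are not part of the notebook. Because only the first word after the comma is kept, a multi-word title such as "the Countess." is reduced to "the", which would explain the `Title_the` column that shows up in the data preview further down.

```python
def extract_title(name):
    """Return the first word after the comma in 'Surname, Title. Given names'."""
    # 'Surname, Title. Given names' -> 'Title. Given names'
    after_comma = name.split(', ')[1]
    # 'Title. Given names' -> 'Title.'
    return after_comma.split(' ')[0]


# Illustrative names in the dataset's format (assumed for this sketch)
samples = [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
]
titles = [extract_title(name) for name in samples]
print(titles)  # ['Mr.', 'Miss.', 'the']
```

If multi-word titles matter for the model, a regular expression that captures everything between the comma and the first period would be one possible alternative.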

#### Dealing with missing values

In [4]:
# Replace missing ages with the mean age of all passengers (training and prediction rows together).

data = data.fillna({
    'Age': data['Age'].mean()
})

# Only a few passengers have NaN in Embarked, so we simply drop those rows
data = data.dropna(how='any', subset=['Embarked'])
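
As a rough illustration of what the mean imputation does, the following pure-Python sketch fills missing entries with the mean of the observed ones. The helper `fill_missing_with_mean` and the toy ages are hypothetical; in the notebook, pandas does this work, and `data['Age'].mean()` ignores missing values when computing the mean.

```python
import math


def fill_missing_with_mean(values):
    """Replace NaN entries with the mean of the non-missing entries."""
    observed = [v for v in values if not math.isnan(v)]
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(v) else v for v in values]


# Toy ages, made up for this sketch
ages = [22.0, float('nan'), 26.0, 36.0, float('nan')]
filled = fill_missing_with_mean(ages)
print(filled)  # [22.0, 28.0, 26.0, 36.0, 28.0]
```

Note that at this point `data` contains both the training and the prediction rows, so the mean is computed over both. Computing it on the training rows only would be a stricter alternative.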

#### Dummy categories

In [5]:
def make_dummies(dataset, column_list, _prefix):
    return pd.get_dummies(dataset, columns=column_list, prefix=_prefix)

data = make_dummies(data, ['Sex'], 'Sex')
data = make_dummies(data, ['Embarked'], 'Embarked')
data = make_dummies(data, ['Title'], 'Title')
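
The reason for encoding the combined dataset can be sketched without pandas. If the category list is built from the union of both sets, every row is encoded with the same layout, and a category that is missing from one set simply becomes a column of zeros there. The function `one_hot` and the title lists below are hypothetical and only serve as an illustration.

```python
def one_hot(values, categories):
    """Encode each value as a 0/1 list with one slot per category."""
    return [[1 if value == category else 0 for category in categories]
            for value in values]


# Hypothetical title columns for the two sets
train_titles = ['Mr.', 'Mrs.', 'Major.']
predict_titles = ['Mr.', 'Dona.']

# Build the category list from the union so both encodings share one layout
categories = sorted(set(train_titles) | set(predict_titles))
train_encoded = one_hot(train_titles, categories)
predict_encoded = one_hot(predict_titles, categories)

print(categories)       # ['Dona.', 'Major.', 'Mr.', 'Mrs.']
print(predict_encoded)  # [[0, 0, 1, 0], [1, 0, 0, 0]]
```

Here the `'Major.'` slot is zero for every row of the prediction set, but it still exists, so a model trained on the training encoding receives the number of features it expects.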

#### Removing the prediction dataset again & Train-Test Split

In [6]:
# The first 889 rows are the training rows that remain after dropping missing 'Embarked' values.
# Copy the slice so that adding a column does not trigger a SettingWithCopyWarning.
train = data[:889].copy()
train['Survived'] = y_data
x_predict = data[889:]

x_train, x_test = train_test_split(train, test_size=0.2)


Have a look at the data here:

In [7]:
x_train.head()

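|     | PassengerId | Pclass | Age  | SibSp | Parch | Ticket   | Fare    | Cabin | Sex_female | Sex_male | ... | Title_Miss. | Title_Mlle. | Title_Mme. | Title_Mr. | Title_Mrs. | Title_Ms. | Title_Rev. | Title_Sir. | Title_the | Survived |
|-----|-------------|--------|------|-------|-------|----------|---------|-------|------------|----------|-----|-------------|-------------|------------|-----------|------------|-----------|------------|------------|-----------|----------|
| 714 | 715         | 2      | 52.0 | 0     | 0     | 250647   | 13.0    | NaN   | 0          | 1        | ... | 0           | 0           | 0          | 1         | 0          | 0         | 0          | 0          | 0         | 0        |
| 716 | 717         | 1      | 38.0 | 0     | 0     | PC 17757 | 227.525 | C45   | 1          | 0        | ... | 1           | 0           | 0          | 0         | 0          | 0         | 0          | 0          | 0         | 1        |
| 724 | 725         | 1      | 27.0 | 1     | 0     | 113806   | 53.1    | E8    | 0          | 1        | ... | 0           | 0           | 0          | 1         | 0          | 0         | 0          | 0          | 0         | 1        |
| 682 | 683         | 3      | 20.0 | 0     | 0     | 6563     | 9.225   | NaN   | 0          | 1        | ... | 0           | 0           | 0          | 1         | 0          | 0         | 0          | 0          | 0         | 0        |
| 215 | 216         | 1      | 31.0 | 1     | 0     | 35273    | 113.275 | D36   | 1          | 0        | ... | 1           | 0           | 0          | 0         | 0          | 0         | 0          | 0          | 0         | 1        |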
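
The split above uses `test_size=0.2`. As a rough sketch of what that means for the row counts, the following pure-Python stand-in shuffles 889 row indices and sets aside 20% of them, rounding the test share up. The helper `split_rows` and its `seed` parameter are hypothetical, and the rounding rule reflects my understanding of how scikit-learn sizes the split, so treat it as an assumption.

```python
import math
import random


def split_rows(rows, test_size=0.2, seed=0):
    """Shuffle the rows and return (train_rows, test_rows)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    # Assumption: the test share is rounded up to a whole number of rows
    n_test = math.ceil(test_size * len(rows))
    return rows[n_test:], rows[:n_test]


train_rows, test_rows = split_rows(range(889), test_size=0.2)
print(len(train_rows), len(test_rows))  # 711 178
```

Because the notebook does not pass a `random_state` to `train_test_split`, the rows that end up in each part change from run to run; only the sizes stay the same.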


#### Label vectors

In [8]:
y_train = x_train['Survived']
y_test = x_test['Survived']

#### Remove unused columns

In [9]:
# Drop features we are not going to use
x_train = x_train.drop(['PassengerId','Survived', 'Ticket', 'Fare', 'Cabin'], axis=1)
x_test = x_test.drop(['PassengerId','Survived', 'Ticket', 'Fare', 'Cabin'], axis=1)
x_predict = x_predict.drop(['PassengerId', 'Ticket', 'Fare', 'Cabin'], axis=1)

### Training

In this section we tune the hyperparameters of the random forest and fit it on the training split.

#### Hyperparameter-Tuning

Tuning hyperparameters by hand often takes a lot of time, and the results can be hard to understand.
There are several ways to automate hyperparameter tuning (grid search, Bayesian optimization, etc.). In this project we use random search; for a detailed explanation, start here: http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf

In [10]:
# Search space for the randomized hyperparameter search

from sklearn.model_selection import RandomizedSearchCV

n_estimators = [int(x) for x in np.linspace(start = 200, stop = 3000, num = 10)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 121, num = 11)]
max_depth.append(None)
min_samples_split = [2, 5, 10, 12, 15]
min_samples_leaf = [1, 2, 4, 6, 8]
bootstrap = [True, False]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

#### The Algorithm

In [11]:
clf = RandomForestClassifier()

clf_rad = RandomizedSearchCV(estimator=clf,
                             param_distributions=random_grid,
                             n_iter=100,
                             cv=3,
                             verbose=2,
                             random_state=42,
                             n_jobs=-1)
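
To get a feeling for why random search is attractive here, the sketch below counts how many combinations the search space contains and then draws 100 random candidates from it. It is a simplified stand-in written with the standard library, not scikit-learn's implementation: `sample_candidate` is a hypothetical helper, and unlike scikit-learn's sampler (which, as far as I know, avoids duplicate combinations when every entry is a list) it may draw the same combination twice. The value lists are the ones the grid above expands to.

```python
import random

random_grid = {
    'n_estimators': [200, 511, 822, 1133, 1444, 1755, 2066, 2377, 2688, 3000],
    'max_features': ['auto', 'sqrt'],
    'max_depth': [10, 21, 32, 43, 54, 65, 76, 87, 98, 109, 121, None],
    'min_samples_split': [2, 5, 10, 12, 15],
    'min_samples_leaf': [1, 2, 4, 6, 8],
    'bootstrap': [True, False],
}

# Number of distinct combinations an exhaustive grid search would have to try
grid_size = 1
for options in random_grid.values():
    grid_size *= len(options)
print(grid_size)  # 12000


def sample_candidate(grid, rng):
    """Draw one value per hyperparameter, independently and uniformly."""
    return {name: rng.choice(options) for name, options in grid.items()}


rng = random.Random(42)
candidates = [sample_candidate(random_grid, rng) for _ in range(100)]

# Each candidate is evaluated with 3-fold cross-validation (cv=3 in the notebook)
n_fits = len(candidates) * 3
print(n_fits)  # 300
```

With `n_iter=100`, the search evaluates fewer than 1% of the 12000 possible combinations, at the cost of 300 model fits instead of 36000.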

#### Fitting the model

In [12]:
clf_rad.fit(x_train,y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   28.5s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  2.8min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  5.0min finished


RandomizedSearchCV(cv=3, error_score='raise',
          estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
          fit_params=None, iid=True, n_iter=100, n_jobs=-1,
          param_distributions={'n_estimators': [200, 511, 822, 1133, 1444, 1755, 2066, 2377, 2688, 3000], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 21, 32, 43, 54, 65, 76, 87, 98, 109, 121, None], 'min_samples_split': [2, 5, 10, 12, 15], 'min_samples_leaf': [1, 2, 4, 6, 8], 'bootstrap': [True, False]},
          pre_dispatch='2*n_jobs', random_state=42, refit=True,
          return_train_score='warn', scoring=None, verbose=2)

In [13]:
# Best hyperparameter combination found by the search
print(clf_rad.best_params_)

{'n_estimators': 511, 'min_samples_split': 5, 'min_samples_leaf': 4, 'max_features': 'auto', 'max_depth': 121, 'bootstrap': False}


#### Test

The tuned model predicts a label for every data point in the held-out test split. These predictions are then compared with the true labels.

In [14]:
best_rad = clf_rad.best_estimator_
y_pred=best_rad.predict(x_test)

#### How well did we do?

In [15]:
print('Train acc: ', accuracy_score(y_train, best_rad.predict(x_train)))
print('Test acc: ', accuracy_score(y_test, y_pred))

# Uncomment to compare predictions and true labels side by side
# for x in range(len(y_pred)):
#    print(list(y_pred)[x], list(y_test)[x])

Train acc:  0.8663853727144867
Test acc:  0.797752808988764
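
Accuracy is simply the fraction of predictions that match the true labels. The sketch below spells that out in plain Python on made-up labels; the function `accuracy` is a hypothetical stand-in for `accuracy_score`. The last two lines are a back-of-the-envelope check under the assumption that the 80/20 split of 889 rows produced 711 training rows and 178 test rows: the reported scores are then consistent with 616 and 142 correct predictions.

```python
def accuracy(y_true, y_hat):
    """Fraction of positions where prediction and true label agree."""
    matches = sum(1 for t, p in zip(y_true, y_hat) if t == p)
    return matches / len(y_true)


# Toy labels, made up for illustration
y_true = [0, 1, 1, 0, 1]
y_hat = [0, 1, 0, 0, 1]
print(accuracy(y_true, y_hat))  # 0.8

# Assumed row counts: 711 training rows, 178 test rows
train_acc = 616 / 711  # consistent with the reported train accuracy
test_acc = 142 / 178   # consistent with the reported test accuracy
print(train_acc, test_acc)
```

The gap of roughly seven percentage points between train and test accuracy suggests some overfitting, although a single random split is a noisy basis for that conclusion.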


### Taking part in the challenge

Kaggle provides a second dataset that has no labels. We compute those labels here and store them in a .csv file formatted the way Kaggle expects for a submission to the challenge.

In [16]:
# Predict the labels for Kaggle's prediction set
y_pred = best_rad.predict(x_predict)

# The PassengerIds in test.csv start at 892 and are consecutive
first_id = 892
resultDf = pd.DataFrame({
    'PassengerId': np.arange(first_id, first_id + len(y_pred)),
    'Survived': y_pred
})
resultDf.to_csv('evaluation_submission.csv', index=False)
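
For reference, the layout of the submission file can be sketched with the standard library alone. The helper `build_submission` is hypothetical and returns the CSV text instead of writing a file; like the cell above, it assumes the ids are consecutive and start at 892.

```python
import csv
import io


def build_submission(predictions, first_id=892):
    """Return CSV text with one 'PassengerId,Survived' row per prediction."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['PassengerId', 'Survived'])
    for offset, label in enumerate(predictions):
        writer.writerow([first_id + offset, int(label)])
    return buffer.getvalue()


# Three made-up predictions, just to show the layout
submission_text = build_submission([0, 1, 1])
print(submission_text)
```

A quick look at the first few lines of `evaluation_submission.csv` before uploading is an easy way to confirm that the header and the id range look as expected.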