# Titanic Example (Part 3) - Data Preparation

## Kaggle

Kaggle is the leading platform for data science competitions. Participants compete for cash prizes by submitting the best predictive model to problems posted on the competition website.

https://www.kaggle.com/competitions

We will be reviewing the data from the Kaggle Titanic competition. Our aim is to make predictions on whether or not specific passengers on the Titanic survived, based on characteristics such as age, sex and class.

## Section 1-0 - First Cut

We will start by loading and cleaning the data, and then split it into a training set and a test set. The training set is used to 'train' (or 'fit') our model. We then apply the trained model to the test set to make predictions. Finally, we compare our predictions against the 'ground truth' to see how well our model performed.

It is very common to encounter missing values in a data set. In this section, we will take the simplest (or perhaps, simplistic) approach of dropping the whole row if any part of it contains a NaN value. We will build on this approach in later sections.

### Extracting data

Load the training data from the .csv file (https://www.kaggle.com/c/titanic/data).

In [164]:
import pandas as pd

data = pd.read_csv("titanic-data.csv")
data = data.set_index('PassengerId')

data.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


Review the size of the data.

In [165]:
data.shape

(891, 11)

### Cleaning data

Review the first 10 rows of the data.

In [166]:
data.head(10)

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |


Review the last 10 rows of the data.

In [167]:
data.tail(10)

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 882 | 0 | 3 | Markun, Mr. Johann | male | 33.0 | 0 | 0 | 349257 | 7.8958 | NaN | S |
| 883 | 0 | 3 | Dahlberg, Miss. Gerda Ulrika | female | 22.0 | 0 | 0 | 7552 | 10.5167 | NaN | S |
| 884 | 0 | 2 | Banfield, Mr. Frederick James | male | 28.0 | 0 | 0 | C.A./SOTON 34068 | 10.5 | NaN | S |
| 885 | 0 | 3 | Sutehall, Mr. Henry Jr | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.05 | NaN | S |
| 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.125 | NaN | Q |
| 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0 | NaN | S |
| 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0 | B42 | S |
| 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |
| 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |


Remove the features (columns) that are irrelevant for our current purpose.
**Hint:** only 3 features need to be removed.

In [168]:
del data['Name']
del data['Ticket']
del data['Cabin']
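
The same removal can also be written as a single `drop` call. The sketch below is only an illustration on a small made-up frame (the `toy` and `trimmed` names and the values are assumptions, not part of the notebook); it shows that `drop(columns=...)` returns a new dataframe rather than modifying the original in place.

```python
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up).
toy = pd.DataFrame({
    "Survived": [0, 1],
    "Name": ["A", "B"],
    "Ticket": ["T1", "T2"],
    "Cabin": [None, "C85"],
    "Age": [22.0, 38.0],
})

# drop() returns a new DataFrame and leaves `toy` unchanged.
trimmed = toy.drop(columns=["Name", "Ticket", "Cabin"])

print(list(trimmed.columns))
```

Using `del`, as in the cell above, changes the dataframe in place, so re-running that cell on its own would raise a `KeyError`; the `drop` form avoids this only if its result is assigned to a new name.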

Review the type of data in each column.

In [169]:
data.dtypes

Survived      int64
Pclass        int64
Sex          object
Age         float64
SibSp         int64
Parch         int64
Fare        float64
Embarked     object
dtype: object
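
Note that `dtypes` reports only the type of each column. If the number of non-missing values per column is also of interest, one possible way to view both side by side is sketched below on a made-up frame (`toy` and `summary` are hypothetical names used only for this illustration).

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing values (made-up data).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", None, "C"],
})

# count() returns the number of non-missing values in each column.
summary = pd.DataFrame({
    "dtype": toy.dtypes,
    "non_null": toy.count(),
})

print(summary)
```

A column whose `non_null` count is lower than the number of rows contains missing values.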

Remove the rows with missing values (NaN).

In [170]:
print(data.isnull().sum())

data = data.dropna(how='any')

print(data.isnull().sum())

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      2
dtype: int64
Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64
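
Dropping rows is simple but it costs data: in the output above, between 177 and 179 rows are removed, depending on whether the missing `Age` and `Embarked` values fall in the same rows. A minimal sketch for measuring that cost, using a made-up frame (`toy`, `cleaned`, `rows_before` and `rows_after` are illustrative names only):

```python
import numpy as np
import pandas as pd

# Toy frame: one row is missing Age, another is missing Embarked.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

rows_before = len(toy)

# how="any" drops a row if at least one of its values is missing.
cleaned = toy.dropna(how="any")
rows_after = len(cleaned)

print(rows_before - rows_after, "rows dropped")
```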


Start preparing the data for scikit-learn. 
- **Remember that** scikit-learn only takes numerical arrays as inputs. Thus, you should convert categorical columns into numerical ones (refer to Part 1).

In [171]:
dummies = pd.get_dummies(data['Sex'])
data = pd.concat([data, dummies], axis=1)

dummies = pd.get_dummies(data['Embarked'])
data = pd.concat([data, dummies], axis=1)

data.head()

| PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | female | male | C | Q | S |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |


In [172]:
del data['Embarked']
del data['Sex']
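
The encode, concatenate and delete steps above can also be collapsed into one call by passing `columns=` to `pd.get_dummies`. The sketch below uses a made-up frame (`toy` and `encoded` are hypothetical names). Two things differ from the cells above and are assumptions of this sketch: the new columns keep the default prefixed names such as `Sex_male`, and `dtype=float` is passed explicitly because recent pandas versions otherwise produce boolean dummy columns.

```python
import pandas as pd

# Toy frame with two categorical columns (made-up data).
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# The listed columns are replaced by their dummy columns in one step.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=float)

print(list(encoded.columns))
```

The prefixed names make it easier to tell which original column each dummy came from, which helps once several categorical columns are encoded.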

In the final review of our training data, check that (1) there are no NaN values, and (2) all the values are in numerical form.

In [173]:
print(data.isnull().sum())

data.dtypes

Survived    0
Pclass      0
Age         0
SibSp       0
Parch       0
Fare        0
female      0
male        0
C           0
Q           0
S           0
dtype: int64


Survived      int64
Pclass        int64
Age         float64
SibSp         int64
Parch         int64
Fare        float64
female      float64
male        float64
C           float64
Q           float64
S           float64
dtype: object

Convert the processed data from a pandas dataframe into a numerical (NumPy) array. The outcome column (`Survived`) is separated from the features in the next section.

In [174]:
import numpy as np

data_np = np.array(data)
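
As an illustration of separating the outcomes from the features at the array level, here is a minimal sketch on a made-up frame. The names `toy`, `X_np` and `y_np` are assumptions for this example and are not used elsewhere in the notebook.

```python
import pandas as pd

# Toy frame: Survived is the outcome, the rest are features (made-up data).
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Age": [22.0, 38.0, 26.0],
})

# One-dimensional array of outcomes.
y_np = toy["Survived"].to_numpy()

# Two-dimensional array of features, one row per passenger.
X_np = toy.drop(columns=["Survived"]).to_numpy()

print(X_np.shape, y_np.shape)
```

scikit-learn also accepts pandas dataframes directly, which is why the following cells can pass `data` and `y` without converting them first.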

### Split data
We now split the data into an 80% training set and 20% test set.

In [175]:
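# train_test_split lives in sklearn.model_selection;
# the old sklearn.cross_validation module was removed in scikit-learn 0.20.
from sklearn.model_selection import train_test_split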

y = data['Survived']

del data['Survived']

X_train, X_test, y_train, y_test = train_test_split(data, y, random_state=1, test_size=0.2)
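
To see what the split does, the sketch below applies it to a small synthetic array, assuming a scikit-learn release where `train_test_split` is available from `sklearn.model_selection`. The arrays `X` and `y` and the `_tr` / `_te` names are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten synthetic samples with two features each; row i is [2i, 2i + 1].
X = np.arange(20).reshape(10, 2)
# Label of row i is i % 2, so it can be recovered from the features.
y = np.array([0, 1] * 5)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

print(X_tr.shape, X_te.shape)
```

The rows are shuffled before splitting, and features and labels are shuffled together, so each row of `X_te` still lines up with the corresponding entry of `y_te`.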

### Scikit-learn - Training the model

In this section, we'll simply use the model as a black box.

In particular, we'll be using the Random Forest model. The intuition is as follows: a decision tree repeatedly splits the data on the feature that best separates the outcomes, and each split forms a 'branch' of the 'tree'. The Random Forest model, broadly speaking, trains a 'forest' of such trees, each on a random sample of the rows and a random subset of the features, and aggregates their predictions.

http://en.wikipedia.org/wiki/Random_forest
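
To make the 'forest of trees' idea a little more concrete, the following sketch fits a small forest on synthetic data and inspects it. The data comes from `make_classification` and the names `X`, `y` and `forest` are assumptions for this illustration; it is not the Titanic model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification problem with five features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# A small forest of ten trees.
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X, y)

# The fitted forest exposes its individual decision trees.
print(len(forest.estimators_))

# Relative contribution of each feature, normalised to sum to 1.
print(forest.feature_importances_.round(3))
```

The `feature_importances_` attribute gives a rough indication of how much each feature contributed to the splits across all trees.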

In [176]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=0)

Use the training dataset to train your model (refer to Part 1).

In [177]:
model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

### Scikit-learn - Making predictions

Before proceeding, review a selection of your test data.

In [178]:
X_test.head()

| PassengerId | Pclass | Age | SibSp | Parch | Fare | female | male | C | Q | S |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 690 | 1 | 15.0 | 0 | 1 | 211.3375 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 280 | 3 | 35.0 | 1 | 1 | 20.25 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 509 | 3 | 28.0 | 0 | 0 | 22.525 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 10 | 2 | 14.0 | 1 | 0 | 30.0708 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 497 | 1 | 54.0 | 1 | 0 | 78.2667 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |


As before, process the test data in a similar fashion to what you did to the training data.
1. Remove irrelevant features
2. Remove the rows with missing values (NaN)
3. Convert categorical columns into numerical ones
4. Convert data from a Pandas dataframe into a numerical (Numpy) array

In [179]:
# Already done: the test set was split off after the data was cleaned and encoded.

We now apply the trained model to the test data to produce an output of predictions. PassengerId is the index of the dataframe rather than a column, so it is not used as a feature.

In [180]:
predictions = model.predict(X_test)

### Evaluation

Calculate the number of correct predictions.

In [181]:
sum(predictions == y_test)

110

To get a sense of how good our predictions are, calculate the model's accuracy as a percentage.
**Hint:** divide the number of correct predictions by the length of the array of actual values.

In [182]:
(sum(predictions == y_test) / len(predictions)) * 100

76.923076923076934
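
The same accuracy calculation is available as `accuracy_score` in `sklearn.metrics`. The sketch below compares the manual calculation with the library function on made-up labels (`y_true`, `y_pred`, `manual` and `library` are illustrative names only).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up ground truth and predictions; four of the five agree.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Manual accuracy: fraction of positions where the two arrays match.
manual = (y_true == y_pred).mean()

# Library equivalent; returns a fraction between 0 and 1, not a percentage.
library = accuracy_score(y_true, y_pred)

print(manual, library)
```

In the notebook above, `model.score(X_test, y_test)` should give the same fraction directly from the fitted classifier.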