## Using Random Forest to predict Titanic survivors

Random Forest requires floats as input variables (strings need to be converted), and missing data needs to be filled in.

#### How do I clean and fill?

If there is a lot of missing data you will want to try and fill it in. Sometimes that may not be possible, but in other cases it is: Fare could be estimated if you know the class, or Age could be filled in using the median age.
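A minimal sketch of both ideas, on a small made-up frame (the values are invented for illustration and are not taken from the Titanic data): a missing Fare takes the median fare of the passenger's class, and a missing Age takes the overall median age.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Fare":   [80.0, 60.0, np.nan, 8.0, np.nan, 10.0],
    "Age":    [40.0, np.nan, 30.0, 20.0, 22.0, np.nan],
})

# Median fare within each class, broadcast back to every row
class_median_fare = toy.groupby("Pclass")["Fare"].transform("median")
toy["Fare"] = toy["Fare"].fillna(class_median_fare)

# Overall median age for the missing ages
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

print(toy)
```

Whether a per-class or an overall median is the better estimate depends on the data, so it is worth trying both.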

#### Data is complete and in floats, let's predict..

Three simple steps:    
1) Initialize the model    
2) Fit it to the training data       
3) Predict new values         

Nearly all scikit-learn models share a few commonly named methods once they are initialized. These are:  

- modelname.fit()
- modelname.predict()
- modelname.score()
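The three steps and the three methods above can be sketched on a tiny made-up dataset. The data, `n_estimators=10` and `random_state=0` are illustrative assumptions, not values used later in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny made-up dataset: two features, binary label
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]] * 5)
y = np.array([0, 0, 1, 1] * 5)

model = RandomForestClassifier(n_estimators=10, random_state=0)  # 1) initialize
model.fit(X, y)                                                  # 2) fit
predictions = model.predict(X)                                   # 3) predict

# score() returns the mean accuracy on the data it is given
accuracy = model.score(X, y)

print(predictions[:4], accuracy)
```

Scoring on the same rows used for fitting only shows how well the model memorised them; a held-out set gives a more honest number.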

#### Getting started

Read in the training data and remove the Name, Cabin and Ticket columns. Convert the Sex (as Gender) and Embarked columns to numbers.
Drop the PassengerId column.

In [98]:
import pandas as pd
import numpy as np

df = pd.read_csv('/home/sophie/projects/Titanic/data/train.csv', header=0)

In [99]:
print(list(df))
print(df.dtypes)

['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object


In [100]:
# Create a numeric Gender column from Sex (female = 0, male = 1)
df['Gender'] = df['Sex'].map({'female': 0, 'male': 1}).astype(float)

#Drop columns
df = df.drop(['Name','Cabin','Ticket','PassengerId','Sex'], axis=1)

print(list(df))

['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Gender']


In [101]:
# Are there many null values in Embarked?
df.isnull().sum()

# Print the rows where Embarked is null
df[df['Embarked'].isnull()]

     Survived  Pclass   Age  SibSp  Parch  Fare Embarked  Gender
61          1       1  38.0      0      0  80.0      NaN     0.0
829         1       1  62.0      0      0  80.0      NaN     0.0


In [104]:
# That is not too many, so we can remove those rows
df = df.dropna(subset = ['Embarked'])

#This also removes rows with a nan in Embarked column
#df = df[pd.notnull(df['Embarked'])]

In [105]:
# Check that the nans have been removed
print(df[df['Embarked'].isnull()])

Empty DataFrame
Columns: [Survived, Pclass, Age, SibSp, Parch, Fare, Embarked, Gender]
Index: []


In [106]:
# Turn Embarked into float numbers
df['Embarked'] = df['Embarked'].map({'C': 1 ,'Q': 2 ,'S': 3}).astype(float)
print(df.Embarked[0:3])

0    3.0
1    1.0
2    3.0
Name: Embarked, dtype: float64


In [107]:
# What still needs to be turned into a float?
df.dtypes

Survived      int64
Pclass        int64
Age         float64
SibSp         int64
Parch         int64
Fare        float64
Embarked    float64
Gender      float64
dtype: object

In [108]:
# Are there any columns with nans left?
df.isnull().sum()

Survived      0
Pclass        0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      0
Gender        0
dtype: int64

We still need to fill in the blank values of Age. We could fill them in with the median; it is also worth checking what difference it makes to use the mean instead.
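As a rough illustration of why the choice can matter, here is a sketch on made-up ages (not the Titanic values) where one large value pulls the mean well above the median:

```python
import pandas as pd

# Made-up ages with one large value, to show how a skewed sample behaves
ages = pd.Series([20.0, 22.0, 24.0, 26.0, 80.0, None, None])

# Both statistics ignore the missing entries
mean_filled = ages.fillna(ages.mean())
median_filled = ages.fillna(ages.median())

print("mean:", ages.mean(), "median:", ages.median())
```

The median is less sensitive to a few unusually old or young passengers, which is one reason to prefer it for a filled-in age.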

We will use the median age for each gender and class combination to fill in.    
First, make a table to store the median values.

In [109]:
#Make a table filled with zeros
median_ages = np.zeros((2,3)) # male/female for each class
median_ages

array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

In [110]:
# Loop over the table to fill in the values

for i in range(0, 2):
    for j in range(0, 3):
        median_ages[i,j] = df[(df['Gender'] == i) & (df['Pclass'] == j + 1)]['Age'].dropna().median()
        
median_ages

array([[ 35. ,  28. ,  21.5],
       [ 40. ,  30. ,  25. ]])

In [113]:
# Make a copy of Age 
df['AgeFill'] = df['Age']

In [114]:
# Fill the new column with the correct values. 

for i in range(0, 2):
    for j in range(0, 3):
        # we need df.loc here to specify the row AND the column. 
        # only where age is null, gender is 1/0 and class is 1-3, that AgeFill will be set to the median age.
        df.loc[(df.Age.isnull()) & (df.Gender == i) & (df.Pclass == j + 1), 'AgeFill'] = median_ages[i,j]



In [115]:
# Now, are there null values still left in AgeFill?
df.isnull().sum()

Survived      0
Pclass        0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      0
Gender        0
AgeFill       0
dtype: int64

In [116]:
# We can drop the Age column now we have AgeFill
df = df.drop(['Age'], axis=1)

# This seems to successfully transform the whole dataframe into floats. Perhaps best as the final step?
df= df.astype(float)
df.dtypes

Survived    float64
Pclass      float64
SibSp       float64
Parch       float64
Fare        float64
Embarked    float64
Gender      float64
AgeFill     float64
dtype: object

Now we have the data in a format we can use!

In [123]:
# Parch counts parents/children on board; the largest value is 6.
print(max(df['Parch']))

6.0


In [131]:
print(list(df))
print(df.iloc[:,1:][0:5])  # df.iloc[rows, columns]

['Survived', 'Pclass', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Gender', 'AgeFill']
   Pclass  SibSp  Parch     Fare  Embarked  Gender  AgeFill
0     3.0    1.0    0.0   7.2500       3.0     1.0     22.0
1     1.0    1.0    0.0  71.2833       1.0     0.0     38.0
2     3.0    0.0    0.0   7.9250       3.0     0.0     26.0
3     1.0    1.0    0.0  53.1000       3.0     0.0     35.0
4     3.0    0.0    0.0   8.0500       3.0     1.0     35.0


Before we can use the test data, it has to go through the same process as above. 
I will make a script for it and see whether it can be used for both the training and test data.
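One way such a shared script could look is sketched below. The function name `clean_titanic` is hypothetical, and this is not the author's `clean_test.py`. It mirrors the steps above, uses `groupby` instead of the explicit loops, and additionally fills a missing Fare with the class median, which is an assumption that goes beyond what the training data needed.

```python
import numpy as np
import pandas as pd

def clean_titanic(raw):
    """Hypothetical helper mirroring the cleaning steps in this notebook."""
    df = raw.copy()
    df["Gender"] = df["Sex"].map({"female": 0, "male": 1})

    # Drop rows without Embarked, then convert the port to a number
    df = df.dropna(subset=["Embarked"])
    df["Embarked"] = df["Embarked"].map({"C": 1, "Q": 2, "S": 3})

    # Median age per Gender/Pclass group, used only where Age is missing
    group_median_age = df.groupby(["Gender", "Pclass"])["Age"].transform("median")
    df["AgeFill"] = df["Age"].fillna(group_median_age)

    # Assumption: a missing Fare is filled with the median fare of its class
    class_median_fare = df.groupby("Pclass")["Fare"].transform("median")
    df["Fare"] = df["Fare"].fillna(class_median_fare)

    # errors="ignore" lets the same call work when a column is absent
    unused = ["Name", "Cabin", "Ticket", "PassengerId", "Sex", "Age"]
    df = df.drop(columns=unused, errors="ignore")
    return df.astype(float)

# Made-up rows in the shape of train.csv, for illustration only
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 0, 1],
    "Pclass": [3, 1, 3, 1, 3],
    "Name": ["a", "b", "c", "d", "e"],
    "Sex": ["male", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0, 30.0],
    "Fare": [7.25, 71.28, np.nan, 53.1, 8.05],
    "Embarked": ["S", "C", "S", None, "Q"],
})
cleaned = clean_titanic(toy)
print(cleaned)
```

Note that dropping rows is only safe for training data. A test set usually needs a prediction for every passenger, so there a missing Embarked value would have to be filled rather than removed.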

In [132]:
# import the random forest package
from sklearn.ensemble import RandomForestClassifier

# Create the random forest object which will include all the parameters for the fit
forest = RandomForestClassifier(n_estimators = 100)

In [134]:
# Fit the training data to the Survived labels and create the decision trees. (x,y)(train_inputs, classification labels)
forest = forest.fit(df.iloc[:,1:], df.iloc[:,0])

In [167]:
#Read in the test data which was cleaned up as the training set was in this notebook. The script is
# /home/sophie/projects/Titanic/data/clean_test.py
test_data = pd.read_csv('/home/sophie/projects/Titanic/data/clean_test.csv', usecols = ['Pclass','SibSp','Parch',
                        'Fare','Embarked', 'Gender','AgeFill'], sep = " ", header=0).astype(np.float32)

#test_data = test_data.astype(float)

print (max(test_data['Fare']))
print(test_data.dtypes)

test_data['Fare'] = round(test_data.Fare, 2)
print(test_data.head())

print(test_data[np.isinf(test_data)])
# Count infinite values per column (DataFrames have no .isinf(), so use NumPy)
print(np.isinf(test_data).sum())

512.329
Pclass      float32
SibSp       float32
Parch       float32
Fare        float32
Embarked    float32
Gender      float32
AgeFill     float32
dtype: object
   Pclass  SibSp  Parch   Fare  Embarked  Gender  AgeFill
0     3.0    0.0    0.0   7.83       2.0     1.0     34.5
1     3.0    1.0    0.0   7.00       3.0     0.0     47.0
2     2.0    0.0    0.0   9.69       2.0     1.0     62.0
3     3.0    0.0    0.0   8.66       3.0     1.0     27.0
4     3.0    1.0    1.0  12.29       3.0     0.0     22.0
     Pclass  SibSp  Parch  Fare  Embarked  Gender  AgeFill
0       NaN    NaN    NaN   NaN       NaN     NaN      NaN
1       NaN    NaN    NaN   NaN       NaN     NaN      NaN
2       NaN    NaN    NaN   NaN       NaN     NaN      NaN
3       NaN    NaN    NaN   NaN       NaN     NaN      NaN
4       NaN    NaN    NaN   NaN       NaN     NaN      NaN
5       NaN    NaN    NaN   NaN       NaN     NaN      NaN
6       NaN    NaN    NaN   NaN       NaN     NaN      NaN
7       NaN    NaN

In [164]:
# Run the same fitted forest on the test data.
output = forest.predict(test_data)

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
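The error says that at least one value in the input is missing or not finite. Indexing a DataFrame with a boolean DataFrame, as in `test_data[np.isinf(test_data)]`, masks every non-matching cell to NaN, so the all-NaN printout above does not show where the problem is. A sketch of a more direct check follows. The small frame stands in for `test_data` with made-up values, and filling with the column median is only one possible fix, not necessarily the one used for this data.

```python
import numpy as np
import pandas as pd

# Stand-in for test_data, with one missing Fare (made-up values)
features = pd.DataFrame({
    "Pclass": [3.0, 1.0, 3.0],
    "Fare": [7.83, np.nan, 8.66],
}).astype(np.float32)

# Count missing and infinite values per column
print(features.isnull().sum())
print(np.isinf(features).sum())

# Show only the rows that contain a missing or infinite value
bad_rows = ~np.isfinite(features).all(axis=1)
print(features[bad_rows])

# One possible fix (an assumption): fill with the column median
features = features.fillna(features.median())
```

Once no column reports missing or infinite values, the frame can be passed to `forest.predict()`.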