## Exploring Survival On The Titanic (Python)
#### Converting into Python, from R, some of the ideas from Megan Risdal's brilliant analysis (https://www.kaggle.com/mrisdal/titanic/exploring-survival-on-the-titanic/comments) to see if I can improve my score.

- Feature Engineering
- Missing Value Imputation
- Prediction

In [225]:
# Import libraries

import numpy as np
from numpy.random import random_integers
from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
%matplotlib inline

print('Libraries Ready!')

Libraries Ready!


In [226]:
# Load training and test data
train = pd.read_csv('/home/sophie/projects/Titanic/data/train.csv', header=0)
test = pd.read_csv('/home/sophie/projects/Titanic/data/test.csv', header=0)

# bind training & test data so we can apply all the following changes to both in one go.
full = pd.concat([train,test])

In [227]:
print(len(full))
print(len(train)+len(test))
print(full.shape)
print(full.head())
print(list(full))
print(list(train))
print(list(test))

1309
1309
(1309, 12)
    Age Cabin Embarked     Fare  \
0  22.0   NaN        S   7.2500   
1  38.0   C85        C  71.2833   
2  26.0   NaN        S   7.9250   
3  35.0  C123        S  53.1000   
4  35.0   NaN        S   8.0500   

                                                Name  Parch  PassengerId  \
0                            Braund, Mr. Owen Harris      0            1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...      0            2   
2                             Heikkinen, Miss. Laina      0            3   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)      0            4   
4                           Allen, Mr. William Henry      0            5   

   Pclass     Sex  SibSp  Survived            Ticket  
0       3    male      1       0.0         A/5 21171  
1       1  female      1       1.0          PC 17599  
2       3  female      0       1.0  STON/O2. 3101282  
3       1  female      1       1.0            113803  
4       3    male      0       0.0  

In [228]:
# rows, columns with iloc
print(train.PassengerId.iloc[-1])

# This will be where we want to split it in the end.

891


### Feature Engineering

In [229]:
# Grab title from passenger names

full['Title'] = full['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())

# Show title counts by sex

#first convert the Title column into a set.
title = set(full['Title'] )
print(title)

{'Lady', 'the Countess', 'Miss', 'Mr', 'Major', 'Jonkheer', 'Rev', 'Dr', 'Capt', 'Col', 'Sir', 'Master', 'Mme', 'Mlle', 'Don', 'Dona', 'Ms', 'Mrs'}


In [230]:
# Count the number of times each one appears
count_title = Counter(full['Title'])

print(count_title)

Counter({'Mr': 757, 'Miss': 260, 'Mrs': 197, 'Master': 61, 'Rev': 8, 'Dr': 8, 'Col': 4, 'Major': 2, 'Mlle': 2, 'Ms': 2, 'Lady': 1, 'the Countess': 1, 'Capt': 1, 'Sir': 1, 'Jonkheer': 1, 'Don': 1, 'Dona': 1, 'Mme': 1})


In [231]:
# get counts using groupby
group = full.groupby(['Title', 'Sex']).size()
print (group)

Title         Sex   
Capt          male        1
Col           male        4
Don           male        1
Dona          female      1
Dr            female      1
              male        7
Jonkheer      male        1
Lady          female      1
Major         male        2
Master        male       61
Miss          female    260
Mlle          female      2
Mme           female      1
Mr            male      757
Mrs           female    197
Ms            female      2
Rev           male        8
Sir           male        1
the Countess  female      1
dtype: int64


In [232]:
p = group.reset_index()
p.index = p['Title']+p['Sex']

#Drop columns
p = p.drop(['Title','Sex'], axis=1)

print (p)

                      0
Captmale              1
Colmale               4
Donmale               1
Donafemale            1
Drfemale              1
Drmale                7
Jonkheermale          1
Ladyfemale            1
Majormale             2
Mastermale           61
Missfemale          260
Mllefemale            2
Mmefemale             1
Mrmale              757
Mrsfemale           197
Msfemale              2
Revmale               8
Sirmale               1
the Countessfemale    1


In [233]:
print(p[(p <= 2)])
# ['Dona', 'Lady', 'the Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer']

                      0
Captmale            1.0
Colmale             NaN
Donmale             1.0
Donafemale          1.0
Drfemale            1.0
Drmale              NaN
Jonkheermale        1.0
Ladyfemale          1.0
Majormale           2.0
Mastermale          NaN
Missfemale          NaN
Mllefemale          2.0
Mmefemale           1.0
Mrmale              NaN
Mrsfemale           NaN
Msfemale            2.0
Revmale             NaN
Sirmale             1.0
the Countessfemale  1.0


Titles with very low counts will be combined into "Rare". Mlle and Ms will be converted to Miss, and Mme will be converted to Mrs.

In [234]:
# Reassign Mlle and Ms to Miss, and Mme to Mrs.
# Assigning the result back avoids relying on inplace=True on a column selection.
full['Title'] = full['Title'].replace(['Mlle', 'Ms'], 'Miss')

full['Title'] = full['Title'].replace(['Mme'], 'Mrs')

# Titles with very low cell counts are combined into a single "Rare" level
full['Title'] = full['Title'].replace(['Dona', 'Lady', 'the Countess', 'Capt', 'Col', 'Don', 'Dr',
                                       'Major', 'Rev', 'Sir', 'Jonkheer'], 'Rare')
# Show title counts by sex again

print(full.head())

# get counts using groupby
group = full.groupby(['Title', 'Sex']).size()

print(group)

    Age Cabin Embarked     Fare  \
0  22.0   NaN        S   7.2500   
1  38.0   C85        C  71.2833   
2  26.0   NaN        S   7.9250   
3  35.0  C123        S  53.1000   
4  35.0   NaN        S   8.0500   

                                                Name  Parch  PassengerId  \
0                            Braund, Mr. Owen Harris      0            1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...      0            2   
2                             Heikkinen, Miss. Laina      0            3   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)      0            4   
4                           Allen, Mr. William Henry      0            5   

   Pclass     Sex  SibSp  Survived            Ticket Title  
0       3    male      1       0.0         A/5 21171    Mr  
1       1  female      1       1.0          PC 17599   Mrs  
2       3  female      0       1.0  STON/O2. 3101282  Miss  
3       1  female      1       1.0            113803   Mrs  
4       3    male      0   

In [235]:
# get counts using groupby
group = full.groupby(['Title', 'Sex']).size()

print(group)

Title   Sex   
Master  male       61
Miss    female    264
Mr      male      757
Mrs     female    198
Rare    female      4
        male       25
dtype: int64


My counts for each title now match up with Megan's.

In [252]:
# Finally, grab surname from passenger name and make a new column
# Take everything to the left of the comma
full['Surname'] = full['Name'].apply(lambda x: x.split(',')[0].strip())
print(full['Surname'][0:3])
print(list(full))

0       Braund
1      Cumings
2    Heikkinen
Name: Surname, dtype: object
['Age', 'Cabin', 'Embarked', 'Fare', 'Name', 'Parch', 'PassengerId', 'Pclass', 'Sex', 'SibSp', 'Survived', 'Ticket', 'Title', 'Surname']


### Missing Value Imputation

In [4]:
# Create a family size variable including the passenger themselves

# Create a family variable 

# Visualize the relationship between family size & survival (matplotlib here; the R original used ggplot2)
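
The family size step outlined in this cell could look something like the sketch below. It uses a tiny hand-made frame rather than the real `full` data, and the column names `Fsize` and `Family` are assumptions borrowed from the naming in the R analysis.

```python
import pandas as pd

# Toy rows standing in for the combined `full` DataFrame (illustrative values only).
toy = pd.DataFrame({
    'Surname': ['Braund', 'Cumings', 'Heikkinen'],
    'SibSp': [1, 1, 0],
    'Parch': [0, 0, 0],
})

# Family size = siblings/spouses + parents/children + the passenger themselves.
toy['Fsize'] = toy['SibSp'] + toy['Parch'] + 1

# Family identifier: surname joined with family size, e.g. "Braund_2".
toy['Family'] = toy['Surname'] + '_' + toy['Fsize'].astype(str)

print(toy)
```

On the real data the same two assignments would be applied to `full` directly.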

In [5]:
# Discretize family size

# Show family size by survival using a mosaic plot (maybe not)
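
One possible way to discretize family size is `pd.cut`. The bin edges below (1 = singleton, 2 to 4 = small, 5 or more = large) and the column name `FsizeD` are assumptions, and the rows are made up. A cross-tabulation stands in for the mosaic plot.

```python
import pandas as pd

# Made-up family sizes and outcomes, purely for illustration.
toy = pd.DataFrame({
    'Fsize': [1, 2, 4, 5, 11],
    'Survived': [0, 1, 1, 0, 0],
})

# Bins are right-closed: (0, 1] -> singleton, (1, 4] -> small, (4, inf] -> large.
toy['FsizeD'] = pd.cut(
    toy['Fsize'],
    bins=[0, 1, 4, float('inf')],
    labels=['singleton', 'small', 'large'],
)

# Counts of survival outcome per family size band (a tabular alternative to a mosaic plot).
table = pd.crosstab(toy['FsizeD'], toy['Survived'])
print(table)
```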

There’s probably some potentially useful information in the passenger cabin variable including about their deck. Let’s take a look.

In [6]:
# Look at the cabin variable

# Make a new "deck" variable with just the deck letter
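
A minimal sketch of the deck extraction, assuming the deck is simply the first character of the cabin string. The series below is a small stand-in for `full['Cabin']`, and `Deck` is an assumed column name.

```python
import pandas as pd

# Stand-in for full['Cabin']; missing cabins are left as missing values.
cabin = pd.Series(['C85', None, 'C123', 'E46', 'B57 B59 B63 B66'], name='Cabin')

# The .str accessor propagates missing values, so passengers without a cabin get no deck.
deck = cabin.str[0].rename('Deck')

print(deck)
```

Passengers with several cabins listed get the deck of the first cabin only, which is a simplification.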

### Missingness

In [7]:
# Passengers 62 and 830 are missing Embarkment. 

We will infer their values for embarkment based on present data that we can imagine may be relevant: passenger class and fare.

In [8]:
# Visualize embarkment, passenger class, & median fare (matplotlib here; the R original used ggplot2)


Voilà! The median fare for a first class passenger departing from Cherbourg (‘C’) coincides nicely with the $80 paid by our embarkment-deficient passengers. I think we can safely replace the NA values with ‘C’.
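
The comparison behind that conclusion can be reproduced numerically with a grouped median. The sketch below uses invented fares, not the Titanic data, so only the shape of the computation carries over.

```python
import pandas as pd

# Invented fares for illustration; the real call would use the `full` DataFrame.
toy = pd.DataFrame({
    'Embarked': ['C', 'C', 'C', 'S', 'S', 'Q'],
    'Pclass':   [1, 1, 1, 1, 3, 3],
    'Fare':     [70.0, 80.0, 90.0, 50.0, 8.0, 7.75],
})

# Median fare for every (port, class) combination, laid out as a port x class table.
median_fare = toy.groupby(['Embarked', 'Pclass'])['Fare'].median().unstack()
print(median_fare)
```

The resulting table could also be passed to `median_fare.plot(kind='bar')` for a chart.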

In [9]:
# Fill in passengers 62 and 830 with 'C'
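
A hedged sketch of the fill. It selects rows by `PassengerId` rather than by index label, because concatenating `train` and `test` without resetting the index leaves duplicate labels. The frame is a toy stand-in for `full`.

```python
import pandas as pd

# Toy stand-in for `full`, with the two passengers whose port is missing.
toy = pd.DataFrame({
    'PassengerId': [61, 62, 830, 831],
    'Embarked': ['C', None, None, 'C'],
})

# Select by PassengerId so the result does not depend on the index labels.
missing_ids = [62, 830]
toy.loc[toy['PassengerId'].isin(missing_ids), 'Embarked'] = 'C'

print(toy)
```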

We’re close to fixing the handful of NA values here and there. The passenger on row 1044 has an NA Fare value.

In [10]:
# Show row 1044

# Replace missing fare value with median fare for class/embarkment
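
One way to do the replacement is a grouped `transform`, which computes the median fare per class and port and uses it only where the fare is missing. The class, port and fares below are assumptions made for the toy example.

```python
import numpy as np
import pandas as pd

# Toy rows; the class, port and fares are assumed values, not the real records.
toy = pd.DataFrame({
    'PassengerId': [1041, 1042, 1043, 1044],
    'Pclass': [3, 3, 1, 3],
    'Embarked': ['S', 'S', 'S', 'S'],
    'Fare': [7.0, 9.0, 60.0, np.nan],
})

# Median fare of each (class, port) group, aligned row by row with the frame.
group_median = toy.groupby(['Pclass', 'Embarked'])['Fare'].transform('median')

# Only missing fares are replaced; existing fares are untouched.
toy['Fare'] = toy['Fare'].fillna(group_median)

print(toy)
```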

### Predictive Imputation

We will create a model predicting ages based on other variables.

In [11]:
# Show number of missing Age values



We could definitely use rpart (recursive partitioning for regression) to predict missing ages, but I’m going to use the mice package for this task just for something different. The R analysis links to a PDF with more on multiple imputation using chained equations in R. Since we haven’t done it yet, I’ll first factorize the factor variables and then perform mice imputation.

In [None]:
# Perform mice imputation, excluding certain less-than-useful variables:
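
mice is an R package, so a Python version needs a substitute. One candidate is scikit-learn's `IterativeImputer`, which is inspired by mice but is not the same algorithm and is still marked experimental. The sketch below runs it on a few numeric toy columns. The choice of predictor columns is an assumption, and categorical variables would need encoding first.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Toy numeric columns; two ages are missing.
toy = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 3, 2, 2, 1],
    'SibSp':  [1, 1, 0, 1, 0, 0, 0, 0],
    'Parch':  [0, 0, 0, 0, 0, 0, 1, 0],
    'Fare':   [7.25, 71.28, 7.93, 53.1, 8.05, 13.0, 26.0, 30.0],
    'Age':    [22.0, 38.0, 26.0, 35.0, np.nan, 27.0, np.nan, 54.0],
})

# Each column with missing values is modelled from the other columns, round-robin.
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

# Keep observed ages as they are and take model estimates only where Age was missing.
age_filled = toy['Age'].fillna(imputed['Age'])

# Compare the distribution before and after imputation.
print(pd.DataFrame({'original': toy['Age'].describe(), 'imputed': age_filled.describe()}))
print('Missing ages after imputation:', age_filled.isna().sum())
```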

In [None]:
# compare the imputation distribution with the original data distribution
# Plot Age

# Replace Age variable from the mice model.

# Show new number of missing Age values

### Feature Engineering round 2

Now that we know everyone’s age, we can create a couple of new age-dependent variables: Child and Mother. A child will simply be someone under 18 years of age and a mother is a passenger who is 1) female, 2) is over 18, 3) has more than 0 children (no kidding!), and 4) does not have the title ‘Miss’.

In [12]:
# First we'll look at the relationship between age & survival
# For both male and female


In [13]:
# Create the column child, and indicate whether child or adult

# Show counts

In [14]:
# Adding Mother variable

All of the variables we care about should be taken care of and there should be no missing data. I’m going to double check just to be sure:

### Prediction

Our first step is to split the data back into the original test and training sets.

In [15]:
# check data
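
A sketch of the check and the split, assuming the training rows come first in the combined frame (891 of them in the real data, as printed earlier). Toy frames are used here, and the names `train_part` and `test_part` are hypothetical.

```python
import pandas as pd

# Toy train/test frames; only the training rows have a Survived value.
toy_train = pd.DataFrame({'PassengerId': [1, 2, 3], 'Survived': [0, 1, 1]})
toy_test = pd.DataFrame({'PassengerId': [4, 5]})
toy_full = pd.concat([toy_train, toy_test], ignore_index=True)

# Missing values per column; Survived is expected to be missing for the test rows only.
print(toy_full.isnull().sum())

# Split by position: the first len(train) rows are the training set.
n_train = len(toy_train)
train_part = toy_full.iloc[:n_train]
test_part = toy_full.iloc[n_train:].drop(columns='Survived')
```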

In [16]:
# Build a model using RandomForest

# Show model error
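
A rough sketch of the modelling step with scikit-learn's `RandomForestClassifier`. The out-of-bag score gives an error estimate without a separate validation set. The data is synthetic, and the feature list and hyperparameters are assumptions rather than tuned choices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the engineered training set (Sex already encoded as 0/1).
rng = np.random.RandomState(0)
n = 200
X = pd.DataFrame({
    'Pclass': rng.randint(1, 4, size=n),
    'Sex': rng.randint(0, 2, size=n),
    'Age': rng.uniform(1, 70, size=n),
    'Fare': rng.uniform(5, 100, size=n),
})
# Synthetic label: follows Sex, with roughly 20% of labels flipped as noise.
y = ((X['Sex'] == 1) ^ (rng.rand(n) < 0.2)).astype(int)

# oob_score=True scores each sample using only the trees that did not see it.
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)

oob_error = 1 - rf.oob_score_
print('Out-of-bag error estimate:', round(oob_error, 3))
```

On the real data, string columns such as `Sex`, `Embarked` and `Title` would need to be encoded as numbers before fitting.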

### Variable Importance

Let’s look at relative variable importance by plotting the mean decrease in Gini calculated across all trees.

In [17]:
# Get importance

# Create a rank variable based on importance

# Visualize the relative importance of variables (matplotlib here; the R original used ggplot2)
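
In scikit-learn the analogous quantity is `feature_importances_`, the impurity-based (Gini by default) importance averaged over trees and normalized to sum to 1. The sketch below ranks features on synthetic data. The column names `Variable`, `Importance` and `Rank` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, only to have a fitted model to inspect.
rng = np.random.RandomState(0)
n = 200
X = pd.DataFrame({
    'Pclass': rng.randint(1, 4, size=n),
    'Sex': rng.randint(0, 2, size=n),
    'Age': rng.uniform(1, 70, size=n),
    'Fare': rng.uniform(5, 100, size=n),
})
y = ((X['Sex'] == 1) ^ (rng.rand(n) < 0.2)).astype(int)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# One row per feature, sorted from most to least important.
importance = pd.DataFrame({
    'Variable': X.columns,
    'Importance': rf.feature_importances_,
})
importance = importance.sort_values('Importance', ascending=False).reset_index(drop=True)

# Rank 1 is the most important variable.
importance['Rank'] = importance['Importance'].rank(ascending=False, method='dense').astype(int)
print(importance)
```

A horizontal bar chart could then be drawn with `importance.plot.barh(x='Variable', y='Importance')`.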

In [None]:
# Predict using the test set

# Save the solution to a dataframe with two columns: PassengerId and Survived (prediction)
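
A minimal sketch of the final step, using toy frames and a deliberately tiny feature set. The output file name in the comment is hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

features = ['Pclass', 'Sex']  # assumed feature list; Sex already encoded as 0/1

# Toy training and test sets standing in for the split data.
toy_train = pd.DataFrame({
    'Pclass':   [3, 1, 3, 1, 2, 3],
    'Sex':      [0, 1, 1, 1, 0, 0],
    'Survived': [0, 1, 1, 1, 0, 0],
})
toy_test = pd.DataFrame({
    'PassengerId': [892, 893, 894],
    'Pclass':      [3, 1, 2],
    'Sex':         [0, 1, 0],
})

rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(toy_train[features], toy_train['Survived'])

# Two-column frame: PassengerId and the predicted Survived value.
solution = pd.DataFrame({
    'PassengerId': toy_test['PassengerId'],
    'Survived': rf.predict(toy_test[features]).astype(int),
})

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = solution.to_csv(index=False)
print(csv_text)
# To write a file instead: solution.to_csv('rf_solution.csv', index=False)  # hypothetical name
```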

