## Titanic: Machine Learning from Disaster

---

### Overview

This is an intro ML competition from Kaggle, described [here](https://www.kaggle.com/c/titanic).

Training data is [here](https://www.kaggle.com/c/titanic/download/train.csv) and test data is [here](https://www.kaggle.com/c/titanic/download/test.csv).

#### Data Dictionary

Variable | Definition | Key
--- | --- | ---
survival | Survival | 0 = No, 1 = Yes
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd
sex | Sex |
age | Age in years |
sibsp | # of siblings / spouses aboard the Titanic |
parch | # of parents / children aboard the Titanic |
ticket | Ticket number |
fare | Passenger fare |
cabin | Cabin number |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton


#### Variable Notes

**pclass**: A proxy for socio-economic status (SES)

**1st** = Upper

**2nd** = Middle

**3rd** = Lower

**age**: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
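
The age convention above can be turned into a flag. This is a minimal sketch, assuming ages are held in a pandas Series; the helper name `is_estimated_age` is hypothetical and not part of the notebook below.

```python
import pandas as pd

def is_estimated_age(age):
    """Flag ages of the form xx.5 for passengers aged 1 or older."""
    # Ages below 1 are genuine fractions (infants), so they are excluded.
    # Comparisons against NaN are False, so missing ages are never flagged.
    return (age >= 1) & (age % 1 == 0.5)

# Illustrative values only; None stands in for a missing age.
ages = pd.Series([22.0, 0.42, 28.5, None, 0.5])
flags = is_estimated_age(ages)
print(flags.tolist())
```

Only the `28.5` entry is flagged: `0.5` is treated as an infant age rather than an estimate, and the missing value stays unflagged.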

**sibsp**: The dataset defines family relations in this way:

*Sibling* = brother, sister, stepbrother, stepsister

*Spouse* = husband, wife (mistresses and fiancés were ignored)

**parch**: The dataset defines family relations in this way:

*Parent* = mother, father

*Child* = daughter, son, stepdaughter, stepson


### Solution

In [142]:
# start off by ensuring the training and test data have been downloaded

import os

TRAIN_FILE = "train.csv"
TEST_FILE = "test.csv"
LOCAL_DATA_PATH = './tmp/'

input_files = [TRAIN_FILE, TEST_FILE]
for file_name in input_files:
    # fail fast if either CSV is missing from the local data directory
    if not os.path.isfile(os.path.join(LOCAL_DATA_PATH, file_name)):
        raise ValueError('Missing file: ' + file_name)

In [143]:
# display the raw input training data
import pandas as pd

raw_train_df = pd.read_csv(LOCAL_DATA_PATH + TRAIN_FILE)
raw_test_df = pd.read_csv(LOCAL_DATA_PATH + TEST_FILE)
raw_train_df.head(3)

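| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |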


In [144]:
raw_train_df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [145]:
# stats of the numeric columns
raw_train_df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
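
The `count` row reports 714 ages against 891 rows, so `Age` has missing values. A minimal sketch for counting missing values per column follows; it uses a small hand-built frame (`sample_df`, a hypothetical name, with a few values copied from the outputs in this notebook) rather than the real file.

```python
import numpy as np
import pandas as pd

# Three illustrative rows; NaN marks a missing value, as pandas does
# when it reads an empty CSV field.
sample_df = pd.DataFrame({
    'PassengerId': [1, 2, 889],
    'Age': [22.0, 38.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan],
})

# isnull() gives a boolean frame; summing counts the True values per column.
missing_counts = sample_df.isnull().sum()
print(missing_counts)
```

Running the same two calls on `raw_train_df` would show which columns need imputation or special handling before modelling.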


In [146]:
# 891 rows of 12 columns
raw_train_df.shape

(891, 12)

In [147]:
# we can convert this to a boolean
raw_train_df['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64

In [148]:
# this should be a categorical; the number itself is only a class label
raw_train_df['Pclass'].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

In [149]:
# we might be able to extract some information based on title (Mr., Mrs., etc.)
raw_train_df.head(5)['Name']

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object
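
One possible way to pull the title out of `Name` is a regular expression. This is a sketch only, assuming every name follows the "Surname, Title. Given names" pattern seen in the rows above; the notebook itself does not implement this step, and the variable names are hypothetical.

```python
import pandas as pd

# Names copied from the outputs shown in this notebook.
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
    'Montvila, Rev. Juozas',
])

# Assumption: the title is the text between the comma and the first
# period that follows it. Names that do not match yield NaN.
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
print(titles.tolist())
```

Rare titles would probably need grouping into a smaller set of categories before being used as a feature.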

In [150]:
# we can convert this to a categorical
raw_train_df['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [166]:
# not sure there's any useful info in the ticket #s
raw_train_df['Ticket'].value_counts().head(5)

1601        7
CA. 2343    7
347082      7
347088      6
CA 2144     6
Name: Ticket, dtype: int64

In [168]:
# we can identify those with multiple cabins (true/false)
# we can also extract deck and room # features
# in cases of multiple cabins, just pick the 'best' deck, i.e. the one closest to A
raw_train_df['Cabin'].value_counts().head(10)

C23 C25 C27    4
G6             4
B96 B98        4
F33            3
E101           3
F2             3
D              3
C22 C26        3
B35            2
D26            2
Name: Cabin, dtype: int64
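
The cabin ideas in the comments above could be sketched as follows. This is not part of the notebook's `clean_df`; the helper `best_deck` and the frame `cabin_features` are hypothetical names, the deck is assumed to be the first character of each cabin token, and the last sample value is made up to show a multi-deck case. Room numbers are left out because some entries (such as `D`) have none.

```python
import numpy as np
import pandas as pd

def best_deck(cabin):
    """Return the deck letter closest to 'A', or None for a missing cabin."""
    if pd.isnull(cabin):
        return None
    # Assumption: the deck is the leading letter of each cabin token.
    decks = [token[0] for token in cabin.split()]
    # Letters sort alphabetically, so min() picks the deck closest to 'A'.
    return min(decks)

# First three values come from the output above; the rest are illustrative.
cabins = pd.Series(['C23 C25 C27', 'G6', 'D', np.nan, 'F G73'])

cabin_features = pd.DataFrame({
    # A missing cabin gives NaN token counts, which compare as False here.
    'has_multiple_cabins': cabins.str.split().str.len() > 1,
    'deck': cabins.map(best_deck),
})
print(cabin_features)
```

Treating a missing cabin as "not multiple" is a design choice; a separate `has_cabin` flag might be worth adding, since most passengers have no cabin recorded.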

In [161]:
# create a method to transform both the training set and the validation (test) set
# will clean data and perform feature engineering here
def clean_df(raw_df):
    raw_df = raw_df.set_index('PassengerId')
    df = pd.DataFrame(index=raw_df.index)

    # convert survived to boolean column
    # (only present in the training set; the test set has no labels)
    if 'Survived' in raw_df.columns:
        df['survived'] = raw_df['Survived'].astype('bool')

    # map passenger class to categories
    class_map = {1: 'upper', 2: 'middle', 3: 'lower'}
    df['p_class'] = raw_df['Pclass'].map(class_map).astype('category')

    # map sex to categories
    df['sex'] = raw_df['Sex'].astype('category')

    # fill missing ages with median of training set
    # (relies on the global raw_train_df so the test set reuses the same value)
    df['age'] = raw_df['Age'].fillna(raw_train_df['Age'].median())

    # copy sibling/spouse count directly
    df['sibling_spouse_count'] = raw_df['SibSp']

    # copy parent/child count directly
    df['parent_child_count'] = raw_df['Parch']

    return df

In [162]:
cleaned_train_df = clean_df(raw_train_df)

In [163]:
cleaned_train_df.tail()

| PassengerId | survived | p_class | sex | age | sibling_spouse_count | parent_child_count |
| --- | --- | --- | --- | --- | --- | --- |
| 887 | False | middle | male | 27.0 | 0 | 0 |
| 888 | True | upper | female | 19.0 | 0 | 0 |
| 889 | False | lower | female | 28.0 | 1 | 2 |
| 890 | True | upper | male | 26.0 | 0 | 0 |
| 891 | False | lower | male | 32.0 | 0 | 0 |


In [164]:
raw_train_df.tail()

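| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |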
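
Comparing the two tails, passenger 889 has no age in the raw data and 28.0 in the cleaned data, which matches the training median (the 50% row) reported by `describe()`. A minimal sketch of the same `fillna` pattern on toy Series follows; the values and variable names are illustrative, not the full dataset.

```python
import numpy as np
import pandas as pd

# Toy "training" ages with one missing value.
train_ages = pd.Series([27.0, 19.0, np.nan, 26.0, 32.0])

# median() skips NaN by default, so it is computed over the 4 known ages.
train_median = train_ages.median()
filled_train_ages = train_ages.fillna(train_median)

# The held-out set reuses the training median instead of computing its
# own, so both sets are imputed with the same value.
test_ages = pd.Series([np.nan, 40.0])
filled_test_ages = test_ages.fillna(train_median)

print(train_median)
print(filled_test_ages.tolist())
```

Reusing the training statistic keeps information from the held-out set from influencing the preprocessing.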
