# Titanic: Machine Learning from Disaster

This notebook summarizes my (rather limited) work on the [Kaggle Titanic challenge](https://www.kaggle.com/c/titanic). Of the machine learning approaches below, the Random Forest model scored best, with a prediction accuracy of 0.78947 on the test set.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## Data import and wrangling

Let's import the training and test data and take a first look.

In [2]:
train = pd.read_csv('input/train.csv')
test = pd.read_csv('input/test.csv')

In [3]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [4]:
train.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [5]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


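I decided to fill the missing Age values with the mean age of the passenger's Sex and passenger class group, and the one missing Fare value in the test set with the mean fare of its passenger class. Cabin and Embarked also have gaps, but neither column is used below.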

In [6]:
for col in ['Sex', 'Pclass']:
    train[col] = train[col].astype('category')
    test[col] = test[col].astype('category')
train.Age = train.groupby(['Sex','Pclass']).Age.transform(lambda x: x.fillna(x.mean()))
test.Age = test.groupby(['Sex','Pclass']).Age.transform(lambda x: x.fillna(x.mean()))
test.Fare = test.groupby('Pclass').Fare.transform(lambda x: x.fillna(x.mean()))
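
To see what the `groupby(...).transform(...)` calls do, here is a minimal sketch on a made-up five-row frame (the `toy` data is hypothetical, not taken from the Titanic set, and it groups by Sex only to keep things short). Each missing age is replaced by the mean of the known ages in its own group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only to illustrate the fill logic.
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'female', 'male', 'male'],
    'Age': [20.0, 30.0, np.nan, 40.0, np.nan],
})

# Within each group, NaN is replaced by that group's mean age;
# transform() returns a result aligned with the original row order.
toy['Age'] = toy.groupby('Sex').Age.transform(lambda x: x.fillna(x.mean()))

print(toy.Age.tolist())  # [20.0, 30.0, 25.0, 40.0, 40.0]
```

The missing female age becomes 25.0 (the mean of 20 and 30) and the missing male age becomes 40.0, while the known values are left untouched.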

In [7]:
train.describe()

| | PassengerId | Survived | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 29.318643 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 13.281103 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 21.75 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 26.507589 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 36.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [8]:
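# restrict to numeric columns first; pandas 2.0 and later no longer drop
# non-numeric columns automatically in corr()
train.select_dtypes('number').corr()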

| | PassengerId | Survived | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| PassengerId | 1.0 | -0.005007 | 0.039636 | -0.057527 | -0.001652 | 0.012658 |
| Survived | -0.005007 | 1.0 | -0.067485 | -0.035322 | 0.081629 | 0.257307 |
| Age | 0.039636 | -0.067485 | 1.0 | -0.251313 | -0.180705 | 0.118308 |
| SibSp | -0.057527 | -0.035322 | -0.251313 | 1.0 | 0.414838 | 0.159651 |
| Parch | -0.001652 | 0.081629 | -0.180705 | 0.414838 | 1.0 | 0.216225 |
| Fare | 0.012658 | 0.257307 | 0.118308 | 0.159651 | 0.216225 | 1.0 |


## Feature selection / engineering

I proceeded to look at the available data in more detail to identify significant features and decided to use the following ones:

1. Sex
2. Passenger class
3. Family size
4. Age group (baby, toddler, minor)

Certainly, more useful features could be extracted by spending more time exploring this data set.

In [9]:
train.groupby('Sex').Survived.mean()

Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

In [10]:
train.groupby('Pclass').Survived.mean()

Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64

In [11]:
train.groupby(train['Parch'] + train['SibSp']).Survived.mean()

0     0.303538
1     0.552795
2     0.578431
3     0.724138
4     0.200000
5     0.136364
6     0.333333
7     0.000000
10    0.000000
Name: Survived, dtype: float64

In [12]:
def extend_df(df):
    ''' return a copy of df extended with the columns used by the ML algorithms '''
    # one-hot encode sex and passenger class (sex_female, sex_male, pclass_1, ...)
    dummies = pd.get_dummies(df[['Sex', 'Pclass']], prefix=['sex', 'pclass'])
    df2 = pd.concat([df, dummies], axis=1)
    # non-overlapping age bands as 0/1 flags
    df2['baby'] = (df2.Age <= 1).astype(int)
    df2['toddler'] = ((df2.Age > 1) & (df2.Age <= 3)).astype(int)
    df2['minor'] = ((df2.Age > 3) & (df2.Age <= 18)).astype(int)
    # family size = parents/children aboard + siblings/spouses aboard
    df2['fam_size'] = df2['Parch'] + df2['SibSp']
    df2['single'] = (df2['fam_size'] == 0).astype(int)
    df2['med_fam_size'] = ((df2['fam_size'] > 0) & (df2['fam_size'] <= 3)).astype(int)

    return df2

train2 = extend_df(train)
test2 = extend_df(test)
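
The three age flags are non-overlapping bands, so the same split could also be written with `pd.cut`. This is only a sketch of an alternative, not what the notebook uses; the `ages` values and the `adult` label are made up for illustration.

```python
import numpy as np
import pandas as pd

ages = pd.Series([0.42, 1.0, 2.5, 10.0, 18.0, 35.0])  # hypothetical sample ages

# Right-closed bins reproduce the conditions Age <= 1, 1 < Age <= 3, 3 < Age <= 18.
bands = pd.cut(ages, bins=[-np.inf, 1, 3, 18, np.inf],
               labels=['baby', 'toddler', 'minor', 'adult'])

# One 0/1 column per band; 'adult' corresponds to all three notebook flags being 0.
flags = pd.get_dummies(bands).astype(int)

print(bands.tolist())  # ['baby', 'baby', 'toddler', 'minor', 'minor', 'adult']
```

Because the bins are closed on the right, an age of exactly 1 counts as `baby` and an age of exactly 18 as `minor`, matching the `<=` comparisons in `extend_df`.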

# Machine Learning approaches

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
features = ['sex_male', 'pclass_1', 'pclass_2', 'baby', 'toddler', 'minor', 'single', 'med_fam_size']
X = train2[features]
y = train2.Survived

## Logistic Regression

In [14]:
from sklearn.linear_model import LogisticRegression

lg_model = LogisticRegression()
lg_model.fit(X, y)
lg_predictions = lg_model.predict(X)
accuracy_score(y, lg_predictions)  # accuracy on the training data itself

0.819304152637486

In [15]:
confusion_matrix(y, lg_predictions)

array([[491,  58],
       [103, 239]], dtype=int64)
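
The confusion matrix gives a bit more detail than the single accuracy figure. As a small worked example (the variable names are my own), accuracy, precision and recall for the "survived" class can be read off the four counts. In scikit-learn's layout, rows are the true classes and columns are the predicted classes.

```python
import numpy as np

# Counts copied from the confusion matrix above: rows = true, columns = predicted.
cm = np.array([[491, 58],
               [103, 239]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all passengers classified correctly
precision = tp / (tp + fp)        # predicted survivors who actually survived
recall = tp / (tp + fn)           # actual survivors the model identified

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
# 0.8193 0.8047 0.6988
```

So on the training data the model is right about roughly 80% of the passengers it predicts to survive, but it misses about 30% of the actual survivors.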

In [16]:
def save_prediction(model, prefix):
    ''' write the model's test set predictions to <prefix>_submission.csv '''
    df = test[['PassengerId']].copy()
    df['Survived'] = model.predict(test2[features]).astype(int)
    df.to_csv(prefix + '_submission.csv', index=False)

save_prediction(lg_model, 'lg')

## Naive Bayes

In [17]:
from sklearn.naive_bayes import GaussianNB

nb_model = GaussianNB()
nb_model.fit(X, y)
nb_predictions = nb_model.predict(X)
accuracy_score(y, nb_predictions)  # accuracy on the training data itself

0.7699214365881033

In [18]:
save_prediction(nb_model, 'nb')

## Support Vector Machine

In [19]:
from sklearn import svm

svm_model = svm.SVC(C=1000.0)
svm_model.fit(X, y)
svm_predictions = svm_model.predict(X)
accuracy_score(y, svm_predictions)  # accuracy on the training data itself

0.8282828282828283

In [20]:
save_prediction(svm_model, 'svm')

## Random Forest Classifier

In [21]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(max_depth=5, random_state=42)
rf_model.fit(X, y)
rf_predictions = rf_model.predict(X)
accuracy_score(y, rf_predictions)  # accuracy on the training data itself

0.8271604938271605

In [22]:
save_prediction(rf_model, 'rf')
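
The accuracies printed above are measured on the same rows the models were fitted on, which probably explains part of the gap to the 0.78947 the Random Forest reached on Kaggle's test set. A cross-validated score should give a less optimistic estimate. The sketch below uses scikit-learn's `cross_val_score`; to keep it self-contained it runs on synthetic data from `make_classification` as a stand-in, and the helper name `cv_accuracy` is my own. In the notebook, `X` and `y` would be passed instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def cv_accuracy(model, X, y, folds=5):
    """Mean accuracy over `folds` cross-validation splits (hypothetical helper)."""
    return cross_val_score(model, X, y, cv=folds, scoring='accuracy').mean()

# Synthetic stand-in with the same shape as the notebook's feature matrix (891 x 8).
X_demo, y_demo = make_classification(n_samples=891, n_features=8, random_state=42)

model = RandomForestClassifier(max_depth=5, random_state=42)
score = cv_accuracy(model, X_demo, y_demo)
print(score)
```

Each fold is scored on rows the model did not see during fitting, so comparing the four models by this number would be closer to how Kaggle scores the submissions.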