In [1]:
# Data manipulation
import pandas as pd
import numpy as np

# Artificial intelligence
import fastai
from fastai import *
from fastai.tabular.all import *
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Visualization and formatting
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Markdown, display
%matplotlib inline

In [2]:
df_train = pd.read_csv('titanic/train.csv')
df_test = pd.read_csv('titanic/test.csv')

# Data Exploration

**Feature Overview**

We can start by looking at the features that are available in the dataset and observing which ones might be useful to us.

* PassengerId: Unique identifier for each passenger.
* Survived: A boolean indicating whether the passenger survived or not.
* Pclass: A proxy for socio-economic status (SES).
    - 1st = Upper
    - 2nd = Middle
    - 3rd = Lower
* Name: Name of passenger.
* Sex: Sex of passenger.
* Age: Age of passenger.
* SibSp: Number of siblings / spouses aboard the Titanic.
* Parch: Number of parents / children aboard the Titanic.
* Ticket: Ticket number.
* Fare: Passenger fare.
* Cabin: Cabin number of passenger.
* Embarked: Port of embarkation.
    - C = Cherbourg
    - Q = Queenstown
    - S = Southampton
* WikiId: Identifier taken from Wikipedia; not useful as a feature.
* Name_wiki: Name of passenger on Wikipedia.
* Age_wiki: Age of passenger on Wikipedia.
* Hometown: Hometown of passenger.
* Boarded: The city where the passenger boarded the Titanic (Cherbourg, Queenstown, Southampton).
* Destination: The destination of the passenger.
* Lifeboat: Which lifeboat the passenger was on.
* Body: Number assigned to a recovered body, as listed on Wikipedia; not used as a feature.
* Class: An integer indicating the class of the passenger (First, Second, Third).

In [3]:
df_train.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,...,Embarked,WikiId,Name_wiki,Age_wiki,Hometown,Boarded,Destination,Lifeboat,Body,Class
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,...,S,691.0,"Braund, Mr. Owen Harris",22.0,"Bridgerule, Devon, England",Southampton,"Qu'Appelle Valley, Saskatchewan, Canada",,,3.0
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,...,C,90.0,"Cumings, Mrs. Florence Briggs (née Thayer)",35.0,"New York, New York, US",Cherbourg,"New York, New York, US",4,,1.0
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,...,S,865.0,"Heikkinen, Miss Laina",26.0,"Jyväskylä, Finland",Southampton,New York City,14?,,3.0
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,...,S,127.0,"Futrelle, Mrs. Lily May (née Peel)",35.0,"Scituate, Massachusetts, US",Southampton,"Scituate, Massachusetts, US",D,,1.0
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,...,S,627.0,"Allen, Mr. William Henry",35.0,"Birmingham, West Midlands, England",Southampton,New York City,,,3.0


## Preprocessing

To reduce the number of features the model has to handle, and to remove features that would confuse it, we can drop certain columns.

Notably, the **WikiId** column is only an identifier taken from Wikipedia and carries no useful information. We can also drop the **Body** column: it appears to hold the number assigned to a victim's recovered body, so it is only filled in for passengers who died and would give the answer away rather than help the model learn.

In [4]:
df_train = df_train.drop(columns=[ 'WikiId', 'Body' ])
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Name_wiki,Age_wiki,Hometown,Boarded,Destination,Lifeboat,Class
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,"Braund, Mr. Owen Harris",22.0,"Bridgerule, Devon, England",Southampton,"Qu'Appelle Valley, Saskatchewan, Canada",,3.0
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C,"Cumings, Mrs. Florence Briggs (née Thayer)",35.0,"New York, New York, US",Cherbourg,"New York, New York, US",4,1.0
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,"Heikkinen, Miss Laina",26.0,"Jyväskylä, Finland",Southampton,New York City,14?,3.0
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,"Futrelle, Mrs. Lily May (née Peel)",35.0,"Scituate, Massachusetts, US",Southampton,"Scituate, Massachusetts, US",D,1.0
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,"Allen, Mr. William Henry",35.0,"Birmingham, West Midlands, England",Southampton,New York City,,3.0


As with many datasets, some values are missing. The `fastai` preprocessing transforms take care of this.<br>
- **Categorify** converts categorical values to integer codes, similar to **pd.Categorical**. Missing categorical values get their own `#na#` category.<br>
- **FillMissing** fills in missing continuous values.
    - The median of the column is used as the fill value by default.
    - A boolean `<column>_na` column (for example `Age_na`) records which rows were filled.
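
The missing-value handling described above can be mimicked in plain Python. This is only a sketch of the idea on two toy columns (`cabin` and `age` are made-up stand-ins, with `None` marking a missing value), not fastai's actual implementation:

```python
from statistics import median

# Toy columns (hypothetical values); None marks a missing entry.
cabin = ["C85", None, "C123", None]
age = [22.0, None, 38.0, 26.0]

# Categorical: missing entries get their own "#na#" category,
# then every category is mapped to an integer code ("#na#" is code 0).
filled_cabin = [c if c is not None else "#na#" for c in cabin]
vocab = ["#na#"] + sorted({c for c in cabin if c is not None})
cabin_codes = [vocab.index(c) for c in filled_cabin]

# Continuous: fill with the median of the observed values and
# keep a boolean indicator of which rows were originally missing.
age_median = median(a for a in age if a is not None)
age_na = [a is None for a in age]
age_filled = [a if a is not None else age_median for a in age]

print(cabin_codes)  # [2, 0, 1, 0]
print(age_filled)   # [22.0, 26.0, 38.0, 26.0]
print(age_na)       # [False, True, False, False]
```

Keeping the indicator column lets a model distinguish a genuinely observed median-like value from one that was filled in.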

<br>

We set aside part of the data so that we can check the model on rows it has not seen during training. A good score on this validation set is evidence that the model has learnt patterns that generalise, rather than memorising the training rows.  
**RandomSplitter** performs this split by randomly selecting a percentage of the rows (here 20%, via `valid_pct=0.2`) to be used as validation data.
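
As a rough illustration of what such a random split does, the sketch below shuffles the row indices and holds out a fixed share of them. The helper name `random_split` is hypothetical, and it uses Python's `random` module, so it will not reproduce the exact indices that fastai picks for the same seed:

```python
import random

def random_split(n_rows, valid_pct=0.2, seed=None):
    """Return (train_idx, valid_idx) as two disjoint lists of row indices."""
    rng = random.Random(seed)      # seeded generator for reproducibility
    idxs = list(range(n_rows))
    rng.shuffle(idxs)              # random order of all row indices
    cut = int(valid_pct * n_rows)  # number of rows held out for validation
    return idxs[cut:], idxs[:cut]

train_idx, valid_idx = random_split(10, valid_pct=0.2, seed=42)
print(len(train_idx), len(valid_idx))  # 8 2
```

Fixing the seed means the same rows land in the validation set on every run, which keeps scores comparable between experiments.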

<br>

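**cont_cat_split** separates the column names into continuous and categorical. Its second argument, `max_card`, is the number of distinct values up to which an integer column is still treated as categorical; with the value 1 used here, effectively every numeric column (including `PassengerId` and `Pclass`) ends up continuous. The third argument names the dependent variable, which is left out of both lists.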
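
The rule behind this kind of split can be sketched in plain Python. The function `split_columns` below is a hypothetical stand-in that works on a dict of lists instead of a dataframe and only approximates the dtype checks the library performs: float columns are continuous, integer columns are continuous only if they have more than `max_card` distinct values, and everything else is categorical.

```python
def split_columns(columns, max_card=20, dep_var=None):
    """Split column names into (continuous, categorical) lists.

    columns: dict mapping a column name to a list of values (no missing values).
    """
    cont, cat = [], []
    for name, values in columns.items():
        if name == dep_var:
            continue  # the target is neither a continuous nor a categorical feature
        is_float = all(isinstance(v, float) for v in values)
        is_int = all(isinstance(v, int) and not isinstance(v, bool) for v in values)
        if is_float or (is_int and len(set(values)) > max_card):
            cont.append(name)
        else:
            cat.append(name)
    return cont, cat

# Toy data (hypothetical rows in the spirit of the Titanic columns).
toy = {
    "Survived": [0.0, 1.0, 1.0],
    "Pclass": [3, 1, 3],
    "Fare": [7.25, 71.2833, 7.925],
    "Sex": ["male", "female", "female"],
}

print(split_columns(toy, max_card=1, dep_var="Survived"))
# (['Pclass', 'Fare'], ['Sex'])
print(split_columns(toy, max_card=20, dep_var="Survived"))
# (['Fare'], ['Pclass', 'Sex'])
```

With a larger `max_card`, a low-cardinality integer column such as `Pclass` would be treated as categorical instead, which is often the more natural choice for class-like codes.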

In [5]:
procedures = [ Categorify, FillMissing ]
splits = RandomSplitter(valid_pct=0.2, seed=42)(range_of(df_train))
cont, cat = cont_cat_split(df_train, max_card=1, dep_var='Survived')
display(Markdown(f'### {len(cont)} continuous features and {len(cat)} categorical features'))
display(cont)
display(cat)

### 8 continuous features and 10 categorical features

['PassengerId', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Age_wiki', 'Class']

['Name',
 'Sex',
 'Ticket',
 'Cabin',
 'Embarked',
 'Name_wiki',
 'Hometown',
 'Boarded',
 'Destination',
 'Lifeboat']

**TabularPandas** wraps a pandas dataframe and bundles it with the preprocessing steps, the column roles, and the train/validation split.<br>
- **Parameters**
    - df: The pandas dataframe to be converted.
    - procs: The preprocessing steps to be applied.
    - cat_names: The names of the categorical columns.
    - cont_names: The names of the continuous columns.
    - y_names: The names of the target columns.
    - splits: The splits to be used.

In [6]:
to = TabularPandas(
    df=df_train,
    procs=procedures,
    cat_names=cat,
    cont_names=cont,
    y_names='Survived',
    splits=splits
)
to.show(3)

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked,Name_wiki,Hometown,Boarded,Destination,Lifeboat,Age_na,Age_wiki_na,Class_na,PassengerId,Pclass,Age,SibSp,Parch,Fare,Age_wiki,Class,Survived
788,"Dean, Master. Bertram Vere",male,C.A. 2315,#na#,S,"Dean, Master Bertram Vere","Bartley Farm, Hampshire, England",Southampton,"Wichita, Kansas, US",10,False,False,False,789,3,1.0,1,2,20.575001,1.0,3.0,1.0
525,"Farrell, Mr. James",male,367232,#na#,Q,"Farrell, Mr. James ""Jim""","Killoe, Longford, Ireland",Queenstown,New York City,#na#,False,False,False,526,3,40.5,0,0,7.75,25.0,3.0,0.0
821,"Lulic, Mr. Nikola",male,315098,#na#,S,"Lulić, Mr. Nikola","Konjsko Brdo, Croatia",Southampton,"Chicago, Illinois, US",15,False,False,False,822,3,27.0,0,0,8.6625,29.0,3.0,1.0


# Modeling

## Data Splits

The **TabularPandas** object already knows which rows belong to the training set and which to the validation set, and which column is the dependent variable. We can therefore pull out the **features** (`xs`) and the **labels** (`y`) for each of the two sets. Note that `X_test` and `y_test` below hold the validation split of `train.csv`, not the separate test file.

In [7]:
X_train, y_train = to.train.xs, to.train.y
X_test, y_test = to.valid.xs, to.valid.y

## Training

In [8]:
model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
model.fit(X_train.values, y_train.values)

RandomForestClassifier(n_jobs=-1, random_state=42)

In [9]:
y_pred = model.predict(X_test.values)
accuracy_score(y_test, y_pred)

0.9887640449438202

In [10]:
X_test.head()

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked,Name_wiki,Hometown,Boarded,Destination,Lifeboat,...,Age_wiki_na,Class_na,PassengerId,Pclass,Age,SibSp,Parch,Fare,Age_wiki,Class
303,429,1,110,117,2,406,80,3,94,2,...,1,1,304,2,28.0,0,0,12.35,46.0,2.0
778,439,2,451,0,2,416,100,3,148,0,...,1,1,779,3,28.0,0,0,7.7375,22.0,3.0
531,819,2,182,0,1,549,167,2,148,0,...,1,1,532,3,28.0,0,0,7.2292,17.0,3.0
385,206,2,622,0,3,195,247,4,63,0,...,1,1,386,2,18.0,0,0,73.5,21.0,2.0
134,773,2,553,0,3,737,306,4,100,0,...,1,1,135,2,25.0,0,0,13.0,25.0,2.0


In [11]:
y_test.head()

303    1.0
778    0.0
531    0.0
385    0.0
134    0.0
Name: Survived, dtype: float32

In [12]:
display(Markdown('### The predictions from the model vs. the real answer'))
print('Model prediction:', model.predict(X_test.loc[303].values.reshape(1, -1))[0])
print('Target value:', y_test.loc[303])

### The predictions from the model vs. the real answer

Model prediction: 1.0
Target value: 1.0
