In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Project specification
    
Part of a data scientist's job is to use intuition and insight to write algorithms and heuristics. A data scientist also builds mathematical models that make predictions from attributes of the data under examination.

The goal of this project is to predict whether each Titanic passenger survived or perished.  

For more information about the Titanic and the specifics of this dataset see:
- [Titanic](http://en.wikipedia.org/wiki/RMS_Titanic)
- [Titanic dataset, Kaggle](http://www.kaggle.com/c/titanic-gettingStarted)
        

In [41]:
filename = "../data/titanic-data.csv"

dtypes = {'PassengerId': np.int32, 'Survived': 'bool', 'Pclass': 'category', 'SibSp': np.int32, 'Parch': np.int32}
titanic = pd.read_csv(filename, dtype=dtypes, index_col='PassengerId')


In [42]:
print(titanic.info())

titanic

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   Survived  891 non-null    bool    
 1   Pclass    891 non-null    category
 2   Name      891 non-null    object  
 3   Sex       891 non-null    object  
 4   Age       714 non-null    float64 
 5   SibSp     891 non-null    int32   
 6   Parch     891 non-null    int32   
 7   Ticket    891 non-null    object  
 8   Fare      891 non-null    float64 
 9   Cabin     204 non-null    object  
 10  Embarked  889 non-null    object  
dtypes: bool(1), category(1), float64(2), int32(2), object(5)
memory usage: 64.5+ KB
None


PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,False,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,True,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,True,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,True,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,False,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
887,False,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,True,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,False,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
890,True,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
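
The `info()` output above shows that `Age`, `Cabin`, and `Embarked` have fewer than 891 non-null entries. As a minimal sketch of how those gaps could be tabulated, the helper below counts missing values per column. The name `missing_summary` and the toy frame are hypothetical and only stand in for the real table.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return the count and share of missing values per column, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "share": counts / len(df),
    })
    return summary.sort_values("missing", ascending=False)


# Toy stand-in for the real table (hypothetical values, not taken from the dataset).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})
print(missing_summary(toy))
```

Applied to the notebook's `titanic` frame, the counts should agree with the `info()` output: 891 - 714 = 177 missing ages, 891 - 204 = 687 missing cabins, and 891 - 889 = 2 missing embarkation ports.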


# Dataset description

### Data Dictionary
Variable | Definition | Key
:---------: | :--------------: | :---------:
survival | Survival | 0 = No, 1 = Yes
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd
sex | Sex |
age | Age in years |
sibsp | # of siblings / spouses aboard the Titanic |
parch | # of parents / children aboard the Titanic |
ticket | Ticket number |
fare | Passenger fare |
cabin | Cabin number |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton

### Variable Notes
#### pclass: A proxy for socio-economic status (SES)
1st = Upper  
2nd = Middle  
3rd = Lower

#### age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

#### sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister  
Spouse = husband, wife (mistresses and fiancés were ignored)

#### parch: The dataset defines family relations in this way...
Parent = mother, father  
Child = daughter, son, stepdaughter, stepson  
Some children travelled only with a nanny, therefore parch=0 for them.

# Part 1 - Simple heuristic

In this part, we will write a simple heuristic that uses each passenger's sex to predict whether that passenger survived the Titanic disaster.
    
The prediction should be 78% accurate or higher.
        
Here's a simple heuristic to start off:
1. If the passenger is female, the heuristic assumes that the passenger survived
2. If the passenger is male, the heuristic assumes that the passenger died
    
The predictions are written back into the "predictions" dictionary. The keys of the dictionary are passenger IDs, and each value is 1 if the passenger is predicted to have survived or 0 otherwise.


In [34]:
def predict(passenger):
    # Predict survival (1) for women and death (0) for men.
    return 1 if passenger['Sex'] == 'female' else 0


def simple_heuristic(df):
    # Map each passenger id to its predicted outcome.
    predictions = {}
    for passenger_id, passenger in df.iterrows():
        predictions[passenger_id] = predict(passenger)
    return predictions


predictions = pd.Series(simple_heuristic(titanic))
total_predictions = len(titanic)
# XNOR: True wherever the prediction matches the actual outcome.
correct_predictions = (~(titanic['Survived'] ^ predictions)).sum()
accuracy = correct_predictions / total_predictions

print('Prediction accuracy: {:.2%}'.format(accuracy))

Prediction accuracy: 78.68%
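
The same rule can also be written without a row-by-row loop. The sketch below is one possible vectorized form, shown on a toy frame with hypothetical values. The names `predict_by_sex` and `accuracy` are illustrative and not part of the original notebook.

```python
import pandas as pd


def predict_by_sex(df):
    """Vectorized form of the rule: female -> survived (True), male -> died (False)."""
    return df["Sex"] == "female"


def accuracy(actual, predicted):
    """Share of rows where the boolean prediction matches the boolean outcome."""
    return (actual.astype(bool) == predicted.astype(bool)).mean()


# Toy passengers (hypothetical values) indexed like the real table.
toy = pd.DataFrame(
    {"Sex": ["male", "female", "female", "male"],
     "Survived": [False, True, False, False]},
    index=pd.Index([1, 2, 3, 4], name="PassengerId"),
)
toy_predictions = predict_by_sex(toy)

# Three of the four toy rows match, so this prints "Prediction accuracy: 75.00%".
print("Prediction accuracy: {:.2%}".format(accuracy(toy["Survived"], toy_predictions)))
```

Comparing the two boolean Series with `==` states the intent directly, and `mean()` of the resulting booleans is the fraction of correct predictions.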


# Part 2 - A more complex heuristic

In this part, we will write a more sophisticated algorithm that uses each passenger's sex, socioeconomic class, and age to predict whether they survived the Titanic disaster. 
    
The prediction should be 79% accurate or higher.
    
Here's the algorithm. We predict that the passenger survived if either condition below holds; otherwise, we predict that the passenger perished:
1. The passenger is female.
2. The passenger's socioeconomic status is high (first class) AND the passenger is under 18.
    
The predictions are written back into the "predictions" dictionary. The keys of the dictionary are passenger IDs, and each value is 1 if the passenger is predicted to have survived or 0 otherwise.

In [46]:
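def predict(passenger):
    # A missing Age (NaN) compares as False, so only known ages under 18 qualify.
    # Pclass was read with dtype 'category', so its values are strings such as '1'.
    if passenger['Sex'] == 'female':
        return 1
    elif passenger['Age'] < 18 and passenger['Pclass'] == '1':
        return 1
    else:
        return 0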
    

def complex_heuristic(df):
    predictions = {}
    for passenger_index, passenger in df.iterrows():
        predictions[passenger_index] = predict(passenger)
    return predictions


predictions = pd.Series(complex_heuristic(titanic))
total_predictions = len(titanic)
correct_predictions = (~(titanic['Survived'] ^ predictions)).sum()
accuracy = correct_predictions / total_predictions

print('Prediction accuracy: {:.2%}'.format(accuracy))

Prediction accuracy: 79.12%
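
The two conditions can also be combined as boolean masks over whole columns. The sketch below is one way to do that, assuming the same column names as the notebook. The function name `predict_complex` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def predict_complex(df):
    """Female, or first class and under 18 -> survived (True); otherwise False."""
    is_female = df["Sex"] == "female"
    # Comparisons against NaN evaluate to False, so a missing age never counts as a child.
    is_child = df["Age"] < 18
    # Compare Pclass as a string, matching a column read with dtype 'category'.
    is_first_class = df["Pclass"].astype(str) == "1"
    return is_female | (is_child & is_first_class)


# Toy passengers (hypothetical values) covering each branch of the rule.
toy = pd.DataFrame({
    "Sex": ["male", "male", "female", "male", "male"],
    "Age": [10.0, 10.0, np.nan, np.nan, 40.0],
    "Pclass": pd.Categorical(["1", "3", "3", "1", "1"]),
})
toy_predictions = predict_complex(toy)
print(toy_predictions.tolist())
```

Only the first-class boy and the woman are predicted to survive here. The first-class man with an unknown age is not, because the age comparison is False for a missing value.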


# Part 3 - Custom heuristic

For this exercise, we will write a custom heuristic that takes in some combination of the passenger's attributes and predicts whether the passenger survived the Titanic disaster.

Target accuracy is 80% or better.

In [78]:
def predict(passenger):
    if passenger['Sex'] == 'female' and passenger['Pclass'] != '3':
        return 1
    elif passenger['Age'] < 7:
        return 1
    else:
        return 0
    

def custom_heuristic(df):
    predictions = {}
    for passenger_index, passenger in df.iterrows():
        predictions[passenger_index] = predict(passenger)
    return predictions


predictions = pd.Series(custom_heuristic(titanic))
total_predictions = len(titanic)
correct_predictions = (~(titanic['Survived'] ^ predictions)).sum()
accuracy = correct_predictions / total_predictions

print('Prediction accuracy: {:.2%}'.format(accuracy))

Prediction accuracy: 80.25%
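
One way to look for attribute combinations worth encoding in a rule like this is to compare survival rates across groups. The sketch below is an exploratory aid under that assumption, not the method used to derive the heuristic above. The helper name `survival_rate_by` and the toy values are hypothetical.

```python
import pandas as pd


def survival_rate_by(df, columns):
    """Mean of the boolean Survived column within each group of `columns`."""
    return df.groupby(columns, observed=True)["Survived"].mean()


# Toy passengers (hypothetical values, not taken from the dataset).
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Pclass": ["1", "1", "3", "3", "3", "1"],
    "Survived": [True, True, False, False, True, False],
})
rates = survival_rate_by(toy, ["Sex", "Pclass"])
print(rates)
```

Groups with a rate well above or below one half are candidates for a "predict survived" or "predict perished" branch. On the notebook's data the call would be `survival_rate_by(titanic, ['Sex', 'Pclass'])`.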
