Rebecca Black • March 30, 2016

The three functions below demonstrate how simple intuition can often yield surprisingly
accurate models. The data is the now-canonical Titanic dataset, and I was simply trying out
various predictors of survival without incorporating any statistical modeling at all.

The accuracy using gender as a single predictor is 78.6%:

In [None]:
import pandas as pd

def simple_heuristic(file_path):
    """Predict survival (1) for every woman and death (0) for everyone else."""
    predictions = {}
    df = pd.read_csv(file_path)
    for passenger_index, passenger in df.iterrows():
        passenger_id = passenger['PassengerId']
        if passenger['Sex'] == 'female':
            predictions[passenger_id] = 1
        else:
            predictions[passenger_id] = 0

    return predictions

#Function call:
simple_heuristic('train.csv')
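
The function above returns predictions but does not score them. A minimal sketch of how an accuracy figure could be computed follows, assuming the predictions are compared against the `Survived` column of the same file. The helper name `heuristic_accuracy` and the four-row `sample` frame are hypothetical and made up for illustration; they are not real Titanic rows.

```python
import pandas as pd

def heuristic_accuracy(df, predictions):
    """Fraction of passengers whose predicted label matches 'Survived'."""
    # Align the predictions (keyed by PassengerId) with the rows of df.
    predicted = df['PassengerId'].map(predictions)
    return (predicted == df['Survived']).mean()

# Tiny made-up sample standing in for train.csv (not real Titanic rows).
sample = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Sex': ['male', 'female', 'female', 'male'],
    'Survived': [0, 1, 0, 0],
})

# Same rule as simple_heuristic: predict survival for women only.
sample_predictions = {
    pid: int(sex == 'female')
    for pid, sex in zip(sample['PassengerId'], sample['Sex'])
}
sample_accuracy = heuristic_accuracy(sample, sample_predictions)
print(sample_accuracy)
```

On the real data, the same helper could be given `pd.read_csv('train.csv')` and the dictionary returned by `simple_heuristic('train.csv')`.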

Adding constraints on passenger class and age increases the accuracy only
slightly, to 79.12%:

In [None]:
import pandas

def complex_heuristic(file_path):
    """Predict survival for women and for first-class passengers under 18."""
    predictions = {}
    df = pandas.read_csv(file_path)
    for passenger_index, passenger in df.iterrows():
        passenger_id = passenger['PassengerId']
        # 'and' binds tighter than 'or'; the parentheses make the grouping explicit.
        if passenger['Sex'] == 'female' or (passenger['Pclass'] == 1 and passenger['Age'] < 18):
            predictions[passenger_id] = 1
        else:
            predictions[passenger_id] = 0

    return predictions

#Function call:
complex_heuristic('train.csv')
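
The same rule can also be written without a row-by-row loop. The sketch below is one possible vectorized form using boolean masks; the name `complex_heuristic_vectorized` and the sample rows are hypothetical. It also shows what happens to a passenger with a missing `Age`: the comparison `NaN < 18` is False, so that passenger can only be predicted to survive through the `Sex` clause.

```python
import numpy as np
import pandas as pd

def complex_heuristic_vectorized(df):
    """Return a Series of 0/1 predictions indexed by PassengerId."""
    is_female = df['Sex'] == 'female'
    # Comparisons with NaN are False, so a missing Age never satisfies
    # this clause, which matches the behavior of the loop version.
    is_first_class_minor = (df['Pclass'] == 1) & (df['Age'] < 18)
    survived = (is_female | is_first_class_minor).astype(int)
    return pd.Series(survived.to_numpy(), index=df['PassengerId'], name='Predicted')

# Made-up rows with the same column names as train.csv (not real data).
sample = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4, 5],
    'Sex': ['male', 'female', 'male', 'male', 'male'],
    'Pclass': [3, 3, 1, 1, 1],
    'Age': [22.0, 30.0, 10.0, np.nan, 40.0],
})
vectorized_predictions = complex_heuristic_vectorized(sample)
print(vectorized_predictions.to_dict())
```

Passenger 4 in the sample is a first-class male with no recorded age, and he is predicted not to survive.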

Next I add conditions for second-class passengers under 18 who are traveling with family
(a nonzero Parch or SibSp). This increases the accuracy to 80.2%:

In [None]:
import pandas

def custom_heuristic(file_path):
    """Extend complex_heuristic with second-class minors traveling with family."""
    predictions = {}
    df = pandas.read_csv(file_path)
    for passenger_index, passenger in df.iterrows():
        passenger_id = passenger['PassengerId']
        # 'and' binds tighter than 'or'; the parentheses make the grouping explicit.
        if passenger['Sex'] == 'female' or (passenger['Pclass'] == 1 and passenger['Age'] < 18):
            predictions[passenger_id] = 1
        # Second-class minor with a parent or child aboard.
        elif passenger['Pclass'] == 2 and passenger['Age'] < 18 and passenger['Parch'] > 0:
            predictions[passenger_id] = 1
        # Second-class minor with a sibling or spouse aboard.
        elif passenger['Pclass'] == 2 and passenger['Age'] < 18 and passenger['SibSp'] > 0:
            predictions[passenger_id] = 1
        else:
            predictions[passenger_id] = 0

    return predictions

#Function call:
custom_heuristic('train.csv')
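
One way to look for rules like the ones above is to tabulate survival rates by group before writing any conditions. This is only a sketch of that kind of exploration, not necessarily how the rules here were found; the rows in `sample` are made up and simply reuse the column names of `train.csv`.

```python
import pandas as pd

# Made-up rows with the same column names as train.csv (not real data).
sample = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'male', 'female'],
    'Pclass': [1, 3, 1, 3, 3, 3],
    'Survived': [1, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the survival rate within each group.
survival_rates = sample.groupby(['Sex', 'Pclass'])['Survived'].mean()
print(survival_rates)
```

Groups whose rate sits far from the overall average are natural candidates for a rule.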

The takeaway here is that you don't necessarily want to throw computing resources at a
dataset and call it a day. You can often get a good feel for what drives your response
variable by playing with the data for a bit and trying out some simple heuristics. This can
suggest different ways to proceed with more heavy-duty modeling.