# titanic.py


A Python-based solution for the Titanic machine learning problem on Kaggle.com. Based on myfirstforest.py from Kaggle.com.

    / (root)
    |- Kaggle Data/ [data files from Kaggle.com]
    |- Python/ [current location]
    |- R/

## Notes:
- Use the pandas library to work with DataFrames instead of complicated
  list/dictionary manipulation.
- Use the RandomForestClassifier algorithm from the scikit-learn ensemble
  module (sklearn.ensemble) to build randomized decision trees.
- Use the numpy package of tools to perform some analysis.

In [186]:
# Import necessary libraries and modules
import csv # Module for working with CSV files
import pprint # Module for human-friendly data printing
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier

In [187]:
# Relevant variables
## Location of data files, relative to current directory
DATA_DIR = '../Kaggle Data/' 

## Initialize Pretty Printer for future debugging use
pp = pprint.PrettyPrinter()

In [188]:
# Load data
## Training and test data
## Load each file into a DataFrame; header=0 uses the first row as column names
train = pd.read_csv(DATA_DIR + 'train.csv', header=0)
test = pd.read_csv(DATA_DIR + 'test.csv', header=0)

In [None]:
# Clean up data
## Strings to integers, fill in missing data.
## Convert 'Sex' column to integer 0/1 column 'IsMale'
train['IsMale'] = train['Sex'].map({'female': 0, 'male': 1}).astype(int)
test['IsMale'] = test['Sex'].map({'female': 0, 'male': 1}).astype(int)

### Working with pandas DataFrame objects
- Use `pd.DataFrame.loc` to select rows by index label and columns by header. Assigning through `.loc` avoids the ambiguity of chained indexing, which can otherwise trigger a `SettingWithCopyWarning`.
- `pd.Series.isnull()` returns a boolean pandas Series with the same index as the original data.
- Using `train.Embarked.isnull()` as a mask on `train.Embarked` returns a Series containing only the missing entries; its `.index` is a pandas Index object holding the row labels of the missing Embarked data.
- Similarly, `train.Embarked` is a Series object, and applying `Series.mode()` returns a Series holding the mode (length 1 when the mode is unique), so take its first element. Applying `Series.dropna()` first excludes the missing values from the `mode()` call.
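
The points above can be sketched on a toy column. This is a minimal illustration, not the notebook's own data: the `df` frame and its values are made up, and it relies on `Series.mode()` skipping missing values by default, which holds in recent pandas versions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Embarked column (hypothetical values, not the Kaggle data)
df = pd.DataFrame({'Embarked': ['S', 'C', np.nan, 'S', 'Q', np.nan]})

# Boolean Series: True where Embarked is missing
missing_mask = df.Embarked.isnull()

# Index object holding the row labels of the missing entries
missing_rows = df.Embarked[missing_mask].index

# mode() returns a Series, so take its first element
most_common = df.Embarked.mode()[0]

# Assign through .loc so the original frame is modified, not a temporary copy
df.loc[missing_mask, 'Embarked'] = most_common
```

Passing the boolean mask straight to `.loc` fills every missing row in one assignment, which is an alternative to looping over the row labels one at a time.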


In [None]:
## Clean up Embarked data.
### Embarked: missing -> most common
for row_num in train.Embarked[train.Embarked.isnull()].index:
    train.loc[row_num, 'Embarked'] = train.Embarked.dropna().mode().values[0]

for row_num in test.Embarked[test.Embarked.isnull()].index:
    test.loc[row_num, 'Embarked'] = test.Embarked.dropna().mode().values[0]

### Embarked: Map strings to list of ports, then change strings to integers
PORTS = list(enumerate(np.unique(train.Embarked)))
#### Shorthand dictionary assignment, name as key, list index as value
PORTS_DICT = {name: i for i, name in PORTS}
#### Map each Embarked string to its integer code. The mapping built from
#### the training data is reused for the test data so that the same port
#### always gets the same integer in both sets.
train.Embarked = train.Embarked.map(PORTS_DICT).astype(int)
test.Embarked = test.Embarked.map(PORTS_DICT).astype(int)

### Age: missing -> median of all ages
AGE_MEDIAN = train.Age.dropna().median()
for row_num in train.Age[train.Age.isnull()].index:
    train.loc[row_num, 'Age'] = AGE_MEDIAN

AGE_MEDIAN_TEST = test.Age.dropna().median()    
for row_num in test.Age[test.Age.isnull()].index:
    test.loc[row_num, 'Age'] = AGE_MEDIAN_TEST
    

### At least one fare is missing from the `test` dataset
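
One way to confirm which columns contain missing values is to count them per column. The sketch below uses a made-up miniature frame (`toy_test` and its values are hypothetical) and also shows a `groupby`-based way to fill a missing fare with the median fare of the same passenger class.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the test set, with one missing fare
toy_test = pd.DataFrame({
    'Pclass': [1, 3, 3, 2],
    'Fare': [71.28, 7.25, np.nan, 13.0],
})

# Count the missing values in each column
missing_counts = toy_test.isnull().sum()
print(missing_counts)

# Median fare of each row's passenger class, aligned with the original rows
class_medians = toy_test.groupby('Pclass')['Fare'].transform('median')

# Fill each missing fare with the median of its own class
toy_test['Fare'] = toy_test['Fare'].fillna(class_medians)
```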

In [None]:
### Fare: missing -> median fare for the passenger's class (Pclass)
#### Assumes Pclass values are the consecutive integers 1, 2, 3, ...
NUM_CLASSES = len(np.unique(test.Pclass))
MEDIAN_FARE_TEST = np.zeros(NUM_CLASSES)
for f in range(NUM_CLASSES):
    MEDIAN_FARE_TEST[f] = test.loc[test.Pclass == f+1, 'Fare'].dropna().median()

for row_num in test.Fare[test.Fare.isnull()].index:
    # Pclass is 1-based, the array of medians is 0-based
    temp_pclass = test.loc[row_num, 'Pclass'] - 1
    test.loc[row_num, 'Fare'] = MEDIAN_FARE_TEST[temp_pclass]

- Sex is now redundant with IsMale, so drop the column.
- Cabin is absent for a large number of passengers and is non-numeric.
- Names are unique and non-numeric.
- Ticket numbers are also close to unique, and non-numeric in many cases.
- PassengerId must be removed, or it would be used as a feature in the prediction even though its values carry no information about survival.

In [189]:
### Remove redundant column(s), remove non-numeric columns
train = train.drop(['Sex', 'Name', 'Ticket', 'Cabin', 'PassengerId'], axis=1)
passenger_ids = test.PassengerId.values
test = test.drop(['Sex', 'Name', 'Ticket', 'Cabin', 'PassengerId'], axis=1)

In [None]:
# Cleaned data to np array
train_data = train.values
test_data = test.values

### Machine learning using `RandomForestClassifier`
The `forest.fit()` call uses the training dataset to build a forest of decision trees. We set the number of trees to build using the `n_estimators` parameter when initializing the `RandomForestClassifier` object. The `fit()` call requires two arguments, with an optional third (`sample_weight`).
- The first argument is a matrix of numerical values for the features we want to use to model the outcome, e.g. passenger class (Pclass), sex (IsMale), age (Age), number of siblings/spouses aboard (SibSp), number of parents/children aboard (Parch), fare (Fare), and port of embarkation (Embarked).
- The second argument is an array of the outcome(s) we want to model; in this case, the survival data (Survived).
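
The two arguments can be illustrated on a small made-up array. This is only a sketch: the `toy` values are hypothetical, the outcome is placed in column 0 to mirror the layout of the cleaned training data, and `random_state` is fixed purely to make the example repeatable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: column 0 is the outcome (Survived),
# the remaining columns are features (here Pclass, IsMale, Age)
toy = np.array([
    [1, 1, 0, 29.0],
    [0, 3, 1, 40.0],
    [1, 2, 0, 18.0],
    [0, 3, 1, 22.0],
    [1, 1, 0, 35.0],
    [0, 2, 1, 54.0],
])

X = toy[:, 1:]  # feature matrix, shape (n_samples, n_features)
y = toy[:, 0]   # outcome vector, shape (n_samples,)

# A small forest; random_state is set only so the sketch is repeatable
toy_forest = RandomForestClassifier(n_estimators=10, random_state=0)
toy_forest.fit(X, y)

# predict() takes a matrix with the same feature columns, in the same order
toy_predictions = toy_forest.predict(X)
```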

Scikit-learn ensemble documentation on RandomForestClassifier.fit(): http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.fit

To make this model more accurate, we can try to recombine some of the given data into synthetic features that may have affected the survival of the passengers aboard the *Titanic*. These features would have to be added earlier, before certain columns are dropped from the dataframes. Possible features that could be added:
- family size: attempts to keep a large family together might negatively impact the survival of an individual (SibSp + Parch + 1)
- age of youngest family member: could change the difficulty of evacuating (family members could probably be matched by name?)
- age of oldest family member: similar to youngest family member
- whether the individual was a child: children could have been given priority on lifeboats or otherwise preferentially evacuated
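
Two of these features can be derived directly from existing columns. The sketch below uses made-up rows with the Kaggle column names; the `FamilySize` and `IsChild` column names and the age cutoff of 16 are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical rows using the Kaggle column names
toy = pd.DataFrame({
    'SibSp': [1, 0, 3, 0],
    'Parch': [0, 0, 2, 1],
    'Age': [22.0, 38.0, 4.0, 15.0],
})

# Family size: siblings/spouses + parents/children + the passenger
toy['FamilySize'] = toy['SibSp'] + toy['Parch'] + 1

# Child flag as an integer 0/1 column; the cutoff is an arbitrary assumption
CHILD_AGE_CUTOFF = 16
toy['IsChild'] = (toy['Age'] < CHILD_AGE_CUTOFF).astype(int)
```

The same two assignments would need to be applied to both `train` and `test` so that the feature columns line up when predicting.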

In [190]:
# Train the forest
## Column 0 is Survived (the outcome); the remaining columns are the features
forest = RandomForestClassifier(n_estimators=100)
forest = forest.fit(train_data[:, 1:], train_data[:, 0])

# Make a prediction
output = forest.predict(test_data)

# Write to file
## The with-block closes the file even if writing fails;
## newline="" stops the csv module from adding blank rows on some platforms
with open("myfirstforest.csv", "w", newline="") as predictions_file:
    writer = csv.writer(predictions_file)
    writer.writerow(["PassengerId", "Survived"])
    writer.writerows(zip(passenger_ids, output))

# Print the result here as well
print(pd.DataFrame(data=output, index=passenger_ids, columns=['Survived']))

      Survived
892        0.0
893        0.0
894        0.0
895        1.0
896        0.0
897        0.0
898        0.0
899        0.0
900        1.0
901        0.0
902        0.0
903        0.0
904        1.0
905        0.0
906        1.0
907        1.0
908        0.0
909        1.0
910        0.0
911        0.0
912        1.0
913        0.0
914        1.0
915        1.0
916        1.0
917        0.0
918        1.0
919        1.0
920        1.0
921        0.0
...        ...
1280       0.0
1281       0.0
1282       0.0
1283       1.0
1284       0.0
1285       0.0
1286       0.0
1287       1.0
1288       0.0
1289       1.0
1290       0.0
1291       0.0
1292       1.0
1293       0.0
1294       1.0
1295       0.0
1296       0.0
1297       1.0
1298       0.0
1299       0.0
1300       1.0
1301       1.0
1302       1.0
1303       1.0
1304       0.0
1305       0.0
1306       1.0
1307       0.0
1308       0.0
1309       1.0

[418 rows x 1 columns]
