* DAMI Assignment *

# Who Survived the Titanic Disaster?

## 1. Introduction

In this Titanic example, we will use decision trees. The main advantage of this model is that humans can easily understand and reproduce the sequence of decisions taken to predict the target class of a new data point.

This is very important for tasks such as medical diagnosis or credit approval, where we want to show a reason for the decision rather than just saying this is what the training data suggests (which is, by definition, what every supervised learning method does).

The problem we would like to solve is determining whether a Titanic passenger would have survived, given their age, class, and sex.

Why age, class and sex features?

Answer: Highly specific features (the name is an extreme case) could result in overfitting: consider a tree that asks whether the name is X and concludes that she survived. Features for which each value occurs in only a small number of instances present a similar problem, so they might not help generalisation. We use class, age, and sex because we expect them to have influenced a passenger's survival.

For this assignment, you need this Jupyter notebook and the Titanic dataset. Have fun with data mining and building trees!

Each instance in the dataset has the following form:

     "1","1st",1,"Allen, Miss Elisabeth Walton",29.0000,"Southampton","St Louis, MO","B-5","24160 L221","2","female"
     
Note that the raw data consists largely of strings. To apply machine learning algorithms, these strings first have to be converted to numerical data (at least in the columns of interest).

## 2. Prepare Dataset with Pandas 

Pandas is a Python library built around the dataframe: a two-dimensional labeled data structure in which
each row represents an observation and each column represents a feature.

For more details, see: https://www.kaggle.com/c/titanic/details/getting-started-with-python-ii 
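
To make the dataframe idea concrete, here is a minimal sketch on made-up data. The name `df_toy` and its three rows are assumptions for illustration only and are not part of the assignment dataset.

```python
import pandas as pd

# Hypothetical toy data, NOT the assignment dataset.
df_toy = pd.DataFrame({
    "pclass": ["1st", "3rd", "2nd"],
    "age": [29.0, None, 41.0],   # None becomes NaN (a missing value)
    "sex": ["female", "male", "male"],
})

print(df_toy.shape)              # (number of rows, number of columns)
print(list(df_toy.columns))      # the feature names

age_column = df_toy["age"]       # selecting one column gives a pandas Series
print(type(age_column).__name__)
print(age_column.isnull().sum()) # how many ages are missing
```

Each row is one observation and each column one feature, which is the layout the rest of this notebook relies on.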

### 2.1. Load Dataset

Download the Titanic dataset (CSV file) from Canvas and read it with Pandas into a dataframe. Show the first 5 rows.

In [None]:
import pandas as pd
import numpy as np

print('numpy version:', np.__version__)
print('pandas version:', pd.__version__)

## Your code ...

# read with pandas into a dataframe


# show the first 5 rows


Also show the 10 last rows. What is the problem in the last couple of rows?


In [None]:
## Your code ...

# show 10 last rows

# problem: ?

### 2.2. Investigate Dataset

How does Pandas interpret the data? The following 3 commands can be used to investigate the data. Describe in your own words what each command does.

In [None]:
df_titanic.dtypes

## Your answer ...

In [None]:
df_titanic.info()

## Your answer ...


In [None]:
df_titanic.describe()

## Your answer ...


In [None]:
# Slice and print the first 10 rows of the 'age' column. 

## Your code ...


In [None]:
# What kind of object is this 'age' column?   


# Note: A single column is neither a NumPy array nor a pandas dataframe, but a pandas-specific object called a Series.

In [None]:
# What is the average age over all passengers?

## Your code ...

In [None]:
# The next thing we'd like to do is look at more specific subsets of the dataframe. Slice the columns 'sex', 'pclass', 
# and 'age'.

## Your code ...

In [None]:
# First look at all of the missing 'age' values, because we will need to address them in our model if we hope to use 
# all the data for more advanced algorithms. To filter for missing values you can use:
df_titanic[df_titanic['age'].isnull()][['sex', 'pclass', 'age']]

In [None]:
# Before we finish the initial investigation, let's use one other convenience function of pandas to derive a 
# histogram of any numerical column. 
import pylab as pyl
df_titanic['age'].hist()
pyl.show()

In [None]:
# Inside the parentheses of .hist(), you can also be more explicit about options of this function. Before you invoke 
# it, you can also be explicit that you are dropping the missing values of age:
df_titanic['age'].dropna().hist(bins=16, range=(0,80), alpha = .5)
pyl.show()

## 3. Data Munging

### 3.1. Transform the Data 

Transform the values in the dataframe into the shape we need for machine learning. 

First of all, it's hard to run analysis on the string values "male" and "female". 
Let's map them to numbers and store the result in a new column 'Sex'. We have a precedent of analyzing the women first, so let's decide female = 0 and male = 1.  

In [None]:
# Let's store our transformation in a new column, so the original sex isn't changed. Show the first 3 instances.

## Your code ...

In [None]:
# Do the same for passenger class, make it numeric, in a new column PClass. Show the first 3 instances.

## Your code ...

In [None]:
# Show all males in the second class

## Your code ...

### 3.2. Deal with Missing Values

Now it's time to deal with the missing values of age! Why? Simply because most machine learning algorithms need a complete set of 
values in that column to use it. By filling it in with guesses, we'll be introducing some noise into the model, but if we can 
keep our guesses reasonable, some of them should be close to the historical truth (whatever it was...), and the overall 
predictive power of age might still make a better model than before. 

We know the average age of all passengers (with a valid age field) is 31.2, so we could fill in the null values with that mean value. But maybe the median would be better, to reduce the influence of a few rare 70- and 80-year-olds? The age histogram did seem positively skewed. These are the kinds of decisions you make as you create your models.
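
The difference between the two choices is easiest to see on a small, skewed example. The sketch below uses a made-up Series (`ages_toy` is a hypothetical name, and the values are not taken from the Titanic data) with a single old passenger pulling the mean upwards.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing values and one outlier (80).
ages_toy = pd.Series([22.0, 24.0, np.nan, 26.0, 28.0, np.nan, 30.0, 80.0])

mean_age = ages_toy.mean()      # NaN values are skipped by default
median_age = ages_toy.median()  # less sensitive to the single 80-year-old

filled_with_mean = ages_toy.fillna(mean_age)
filled_with_median = ages_toy.fillna(median_age)

print("mean:", mean_age, "median:", median_age)
print(filled_with_median.isnull().sum())  # no missing values remain
```

On positively skewed data like this, the mean lies above the median, so filling with the mean assigns the unknown passengers an age that is older than most of the known ones.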

In [None]:
# Replace the NaN (unknown) age values with a reasonable estimate. Do this in a new column 'AgeFill'

# Optionally, if you like a bit more programming try this one ...
# Use the mean age that was typical for males and females in each passenger class 

## Your code ...

### 3.3. One-Hot Encoding

We have a categorical attribute: pclass. We already converted its three classes into 1, 2, and 3. This transformation implicitly introduces an ordering. 

As a final step, we will try a more general approach that does not assume an ordering and is widely used to convert categorical classes into real-valued attributes. We will add an encoding that converts the class attribute into three new binary features, each indicating whether the instance has that feature value (1) or not (0). This is called one-hot encoding, and it is a very common way of handling categorical attributes for methods that expect real-valued input.

In [None]:
df_titanic['FirstClass'] = df_titanic['pclass'].map( {'1st': 1, '2nd': 0, '3rd': 0} ).astype(int)
df_titanic['SecondClass'] = df_titanic['pclass'].map( {'1st': 0, '2nd': 1, '3rd': 0} ).astype(int)
df_titanic['ThirdClass'] = df_titanic['pclass'].map( {'1st': 0, '2nd': 0, '3rd': 1} ).astype(int)

df_titanic.head(5)
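
As an alternative to the three explicit `map` calls, pandas can generate the indicator columns itself. This is only a sketch on made-up data: `df_classes` and the `class` prefix are assumptions chosen for illustration, and the resulting column names differ from `FirstClass`, `SecondClass` and `ThirdClass` used above.

```python
import pandas as pd

# Hypothetical toy column, NOT the assignment dataset.
df_classes = pd.DataFrame({"pclass": ["1st", "3rd", "2nd", "3rd"]})

# One indicator column per distinct value; cast to int for 0/1 output.
one_hot = pd.get_dummies(df_classes["pclass"], prefix="class").astype(int)

print(one_hot)
# Every row has exactly one indicator set.
print((one_hot.sum(axis=1) == 1).all())
```

The explicit `map` version keeps full control over column names, while `get_dummies` scales better when a feature has many distinct values.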

## 4. Finalize Dataset for Analysis

In [None]:
# Finalize pre-processing by turning this into a numerical feature set (dataframe titanic_X) and a numerical target column 
# (dataframe titanic_y)

## Your code ...

In [None]:
titanic_X.head(10)

In [None]:
titanic_y.head(10)

## 5. Analyse Dataset

The preprocessing step is often underestimated in machine learning, but as we can see even in this very simple example, it can take some time to get the data into the shape our methods expect. It is also very important in the overall machine learning process: if we fail in this step (for example, by encoding attributes incorrectly or selecting the wrong features), the following steps will fail, no matter how good the learning method is.

We are now ready to use the decision tree implementation in scikit-learn, which expects a list of 
real-valued features as input and learns decision rules of the form: Feature <= value. 
For example, AgeFill <= 20.0.
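
To show what such threshold rules look like in practice, here is a minimal sketch that fits a tiny tree and prints its rules as text. The six data points and the names `X_toy`, `y_toy` and `clf_toy` are made up for illustration; they are not the assignment data and this is not the expected solution.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: columns are [AgeFill, Sex] with female = 0, male = 1.
X_toy = [[4, 1], [30, 1], [45, 1], [8, 0], [29, 0], [60, 0]]
y_toy = [1, 0, 0, 1, 1, 1]   # 1 = survived, 0 = did not survive

clf_toy = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
clf_toy.fit(X_toy, y_toy)

# Render the learned rules as indented text.
rules = export_text(clf_toy, feature_names=["AgeFill", "Sex"])
print(rules)
```

Every internal node in the printed tree compares a single feature against a threshold, which is why all features must be numeric before training.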

Standardization (normalization) is not an issue for decision trees: each split compares a single feature against a threshold, so the relative 
magnitude of features does not affect the classifier performance, and scaling is not needed.
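
The claim about scaling can be checked empirically. The sketch below, on made-up data, trains one tree on the raw features and one on a copy in which the age column is multiplied by 1000, then compares their predictions. All names and values here are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: columns are [age in years, passenger class].
X_raw = np.array([[2, 3], [8, 2], [15, 1], [22, 3], [30, 1],
                  [35, 2], [41, 3], [50, 1], [63, 2], [70, 3]], dtype=float)
y = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0])
X_new = np.array([[5, 1], [28, 3], [45, 2], [66, 1]], dtype=float)

# Rescale only the age column; the class column is left untouched.
scale = np.array([1000.0, 1.0])

tree_raw = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_raw, y)
tree_scaled = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_raw * scale, y)

pred_raw = tree_raw.predict(X_new)
pred_scaled = tree_scaled.predict(X_new * scale)

# The two trees should agree on every new point.
print(np.array_equal(pred_raw, pred_scaled))
```

Rescaling a feature only moves the split thresholds by the same factor, so the partition of the data, and therefore the predictions, stay the same.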

### 5.1. Training a Decision Tree Classifier

In [None]:
# Now to the interesting part: let's build a decision tree from our training data. 
# As usual, first split the data into training and testing sets, and check the size of both sets.
from sklearn.model_selection import train_test_split

## Your code ...

In [None]:
# Now we can create a new DecisionTreeClassifier and use the fit method of the classifier to do the learning job. 
# Parameter settings: use the entropy criterion and try out different settings for the depth of the tree and the 
# minimum number of samples required at a node of the tree
from sklearn import tree

## Your code ...

### 5.2. Evaluation Metrics Function

In [None]:
# Define a generic helper function to measure the performance of the classifier, and call this function to show 
# the results (e.g. accuracy)
from sklearn import metrics

## Your code ...

## 6. Introducing Random Forest

A common criticism of decision trees is that once the training set has been split by
answering a question, it is not possible to *reconsider this decision*. For example, if
we split on men and women, every subsequent question concerns only men or only
women, and the method cannot consider another type of question (say, age less
than a year, irrespective of gender). Random Forests introduce some
randomization at each step, proposing alternative trees and combining them to
get the final prediction. Algorithms of this type, which combine several classifiers
answering the same question, are called **ensemble methods**. In the Titanic task, it is
probably hard to see this problem because we have very few features, but real
datasets often have on the order of thousands of features.

A Random Forest builds several decision trees, each trained on a randomly selected subset of the training
instances and using a small random subset of the features. 
This produces multiple classifiers (multiple decision trees). 
At prediction time, each grown tree predicts the target class of a given instance exactly as a single decision tree does. 
The class that receives the most votes (that is, the class predicted by most of the trees) is the one suggested by the ensemble classifier.
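
The voting step described above can be sketched in a few lines of plain Python. `majority_vote` is a hypothetical helper written for illustration, and the votes are made up. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so this sketch shows the idea, not the library's internals.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by most trees (hypothetical helper)."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Made-up predictions of five trees for one passenger (1 = survived).
votes = [1, 0, 1, 1, 0]
ensemble_prediction = majority_vote(votes)
print(ensemble_prediction)
```

Using an odd number of trees avoids ties in a two-class problem such as this one.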

## 7. Implement Random Forest

Implement a Random Forest classifier. Can you improve on the accuracy above? Look at the sklearn documentation and play with the parameters of the ``RandomForestClassifier``. The parameter ``n_estimators`` (the number of trees in the forest) is of particular interest.

In [None]:
# Implement a Random Forest classifier. Does this improve the prediction?

## Your code ...