# Lab: Titanic Survival Exploration with Decision Trees

## Getting Started
In this lab, you will see how decision trees work by training a decision tree classifier with scikit-learn (`sklearn`).

We'll start by loading the dataset and displaying some of its rows.

In [1]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames

# Pretty display for notebooks
%matplotlib inline

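# Set a random seed
# (scikit-learn draws from NumPy's global RNG when random_state is not given,
# so seed NumPy as well as Python's built-in generator)
import random
random.seed(42)
np.random.seed(42)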

# Load the dataset
in_file = 'titanic_data.csv'
full_data = pd.read_csv(in_file)

# Print the first few entries of the RMS Titanic data
display(full_data.head())

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Recall that these are the various features present for each passenger on the ship:
- **Survived**: Outcome of survival (0 = No; 1 = Yes)
- **Pclass**: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
- **Name**: Name of passenger
- **Sex**: Sex of the passenger
- **Age**: Age of the passenger (Some entries contain `NaN`)
- **SibSp**: Number of siblings and spouses of the passenger aboard
- **Parch**: Number of parents and children of the passenger aboard
- **Ticket**: Ticket number of the passenger
- **Fare**: Fare paid by the passenger
- **Cabin**: Cabin number of the passenger (Some entries contain `NaN`)
- **Embarked**: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)

Since we're interested in the outcome of survival for each passenger or crew member, we can remove the **Survived** feature from this dataset and store it as its own separate variable `outcomes`. We will use these outcomes as our prediction targets.  
Run the code cell below to remove **Survived** as a feature of the dataset and store it in `outcomes`.

In [2]:
# Store the 'Survived' feature in a new variable and remove it from the dataset
outcomes = full_data['Survived']
features_raw = full_data.drop('Survived', axis = 1)

# Show the new dataset with 'Survived' removed
display(features_raw.head())

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


The same sample of the RMS Titanic data now shows the **Survived** feature removed from the DataFrame. Note that `features_raw` (the passenger data) and `outcomes` (the outcomes of survival) are now *paired*: any passenger `features_raw.loc[i]` has the survival outcome `outcomes[i]`.
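
The pairing works because the feature table and the outcome column keep the same row index. A minimal sketch with a made-up three-row table (the `toy_*` names and the values are illustrative assumptions, not part of the lab's data):

```python
import pandas as pd

# Toy stand-in for the Titanic table (values are illustrative only)
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
})

# Same two steps as the lab: keep the target, drop it from the features
toy_outcomes = toy["Survived"]
toy_features = toy.drop("Survived", axis=1)

# Both objects share the original row index, so label i names the same passenger
i = 1
print(toy_features.loc[i].to_dict(), "->", toy_outcomes[i])
print(toy_features.index.equals(toy_outcomes.index))
```

If either object is later filtered or shuffled on its own, the index labels still match but the positions may not, so label-based lookup is the safer habit.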

## Preprocessing the data

Now, let's do some data preprocessing. First, we'll remove the names of the passengers, and then one-hot encode the features.

One-hot encoding converts categorical data into numerical data: each distinct value of a categorical feature becomes its own *new* column that holds a 1 if the row has that value and a 0 otherwise (e.g. Queenstown port or not Queenstown port). Check out [this article](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f) before continuing.
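
As a small illustration of the idea, here is a sketch that one-hot encodes a single made-up column with `pd.get_dummies`. The `ports` DataFrame is an assumption for this example; only the port codes mirror the dataset.

```python
import pandas as pd

# Illustrative column only; the codes mimic the Embarked values
ports = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One new column per distinct value, named <column>_<value>
encoded = pd.get_dummies(ports)
print(encoded.columns.tolist())  # ['Embarked_C', 'Embarked_Q', 'Embarked_S']

# Exactly one indicator is set in every row
print((encoded.sum(axis=1) == 1).all())  # True
```

Recent pandas versions return boolean indicator columns rather than 0/1 integers unless `dtype=int` is passed; both forms are accepted by sklearn's tree models.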

**Question:** Why would it be a terrible idea to one-hot encode the data without removing the names?

In [3]:
# Removing the names
features_no_names = features_raw.drop(['Name'], axis=1)

# One-hot encoding
features = pd.get_dummies(features_no_names)

And now we'll fill in any blanks with zeroes.

In [4]:
features = features.fillna(0.0)
display(features.head())

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Ticket_110152,Ticket_110413,...,Cabin_F G73,Cabin_F2,Cabin_F33,Cabin_F38,Cabin_F4,Cabin_G6,Cabin_T,Embarked_C,Embarked_Q,Embarked_S
0,1,3,22.0,1,0,7.25,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
1,2,1,38.0,1,0,71.2833,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,3,26.0,0,0,7.925,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,4,1,35.0,1,0,53.1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,5,3,35.0,0,0,8.05,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
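
Filling with zero is simple, but for a column such as **Age** a zero sits at the extreme low end of the real values, so any age threshold the tree learns will group missing ages with the youngest passengers. One alternative, shown here only as a sketch on made-up values and not what this lab does, is to fill with the column median:

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing entry
ages = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0]})

zero_filled = ages.fillna(0.0)               # the approach used in this lab
median_filled = ages.fillna(ages.median())   # alternative: per-column median

print(zero_filled["Age"].tolist())    # [22.0, 0.0, 26.0, 35.0]
print(median_filled["Age"].tolist())  # [22.0, 26.0, 26.0, 35.0]
```

Whether this helps on the Titanic data is something to check empirically rather than assume.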


## (TODO) Training the model

Now we're ready to train a model in sklearn. First, let's split the data into training and testing sets. Then we'll train the model on the training set.

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, outcomes, test_size=0.2, random_state=42)

In [6]:
# Import the classifier from sklearn
from sklearn.tree import DecisionTreeClassifier

# TODO: Define the classifier, and fit it to the data
model = DecisionTreeClassifier(criterion="entropy")
model.fit(X_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
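
The classifier above is created with `criterion="entropy"`, so candidate splits are scored by how much they reduce the entropy of the class labels. The sketch below computes that impurity next to the Gini impurity (sklearn's default criterion) for two tiny label arrays. The helper names `entropy` and `gini` are hypothetical and written for illustration; they are not sklearn functions.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float((p * np.log2(1.0 / p)).sum())

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

mixed = np.array([0, 0, 1, 1])  # evenly split node: maximally impure
pure = np.array([1, 1, 1, 1])   # every sample has the same label

print(entropy(mixed), gini(mixed))  # 1.0 0.5
print(entropy(pure), gini(pure))    # 0.0 0.0
```

Both measures are zero for a pure node and largest for an even split, which is why the two criteria often choose similar trees.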

## Testing the model
Now, let's see how our model does by calculating the accuracy over both the training and the testing set.

In [7]:
# Making predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculate the accuracy
from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)

The training accuracy is 1.0
The test accuracy is 0.815642458101


# Exercise: Improving the model

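Ok, a perfect training accuracy and a noticeably lower testing accuracy. The model is probably overfitting: with no limit on depth or leaf size, the tree can keep splitting until it has memorized the training set.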
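
To see the effect in isolation, the sketch below compares an unconstrained tree with a depth-limited one on synthetic data. The data comes from `make_classification` with deliberately noisy labels and is an assumption of this example, not the Titanic set, so the numbers it prints say nothing about this lab's scores.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 10% of the labels flipped at random
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

results = {}
for depth in (None, 3):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, tree.predict(X_tr))
    test_acc = accuracy_score(y_te, tree.predict(X_te))
    results[depth] = (train_acc, test_acc, tree.get_depth())
    print(f"max_depth={depth}: depth={tree.get_depth()}, "
          f"train={train_acc:.3f}, test={test_acc:.3f}")
```

The unconstrained tree fits its training split perfectly, noise included; the gap between its two accuracies is the quantity the parameters below are meant to shrink.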

So now it's your turn to shine! Train a new model, and try to specify some parameters in order to improve the testing accuracy, such as:
- `max_depth`
- `min_samples_leaf`
- `min_samples_split`

You can use your intuition, trial and error, or even better, feel free to use Grid Search!

**Challenge:** Try to get to 85% accuracy on the testing set. If you'd like a hint, take a look at the solutions notebook next.

In [50]:
from sklearn.model_selection import GridSearchCV

# Candidate hyperparameter values to search over
parameters = [{'criterion': ['entropy', 'gini'],
               'max_depth': [7, 8, 9, 10, 13],
               'min_samples_leaf': [6, 7, 8, 9],
               'min_samples_split': [2, 3, 4, 5, 10]}]

# TODO: Train the model
# 5-fold cross-validated grid search on the training set
model1 = DecisionTreeClassifier()
grid = GridSearchCV(model1, parameters, cv=5)
grid.fit(X_train, y_train)

# Best parameter combination, refit on the whole training set
estim = grid.best_estimator_
print(estim)

# Accuracy of the best estimator on both sets
y_train_pred2 = estim.predict(X_train)
y_test_pred2 = estim.predict(X_test)
train_accuracy2 = accuracy_score(y_train, y_train_pred2)
test_accuracy2 = accuracy_score(y_test, y_test_pred2)
print('The training accuracy is', train_accuracy2)
print('The test accuracy is', test_accuracy2)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=7,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=7, min_samples_split=3,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
The training accuracy is 0.870786516854
The test accuracy is 0.849162011173


In [11]:
model2 = DecisionTreeClassifier(criterion="gini", max_depth=7, min_samples_leaf=6,
                                min_samples_split=10, random_state=42)
model2.fit(X_train, y_train)
# TODO: Make predictions
y_train_pred1 = model2.predict(X_train)
y_test_pred1 = model2.predict(X_test)

# TODO: Calculate the accuracy
train_accuracy1 = accuracy_score(y_train, y_train_pred1)
test_accuracy1 = accuracy_score(y_test, y_test_pred1)
print('The training accuracy is', train_accuracy1)
print('The test accuracy is', test_accuracy1)

The training accuracy is 0.875
The test accuracy is 0.860335195531


In [22]:
pd.DataFrame(grid.cv_results_)[100:140]



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_criterion,param_max_depth,param_min_samples_leaf,param_min_samples_split,params,split0_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
100,0.010975,0.000329,0.001073,4.1e-05,gini,7,6,2,"{'criterion': 'gini', 'max_depth': 7, 'min_sam...",0.797203,...,0.797753,0.012868,37,0.859402,0.871705,0.873462,0.864912,0.870403,0.867977,0.005153
101,0.01094,0.000224,0.001109,0.000108,gini,7,6,3,"{'criterion': 'gini', 'max_depth': 7, 'min_sam...",0.797203,...,0.799157,0.01384,24,0.859402,0.871705,0.876977,0.864912,0.870403,0.86868,0.00602
102,0.011029,0.000221,0.001072,2.7e-05,gini,7,6,4,"{'criterion': 'gini', 'max_depth': 7, 'min_sam...",0.79021,...,0.796348,0.013228,49,0.859402,0.871705,0.873462,0.864912,0.870403,0.867977,0.005153
103,0.010931,0.000167,0.001049,3.7e-05,gini,7,6,5,"{'criterion': 'gini', 'max_depth': 7, 'min_sam...",0.79021,...,0.796348,0.013228,49,0.859402,0.871705,0.876977,0.864912,0.870403,0.86868,0.00602
104,0.010881,0.00025,0.001059,4.1e-05,gini,7,6,10,"{'criterion': 'gini', 'max_depth': 7, 'min_sam...",0.79021,...,0.796348,0.013228,49,0.859402,0.871705,0.873462,0.864912,0.870403,0.867977,0.005153
105,0.010989,0.000313,0.001095,4.8e-05,gini,7,7,2,"{'criterion': 'gini', 'max_depth': 7, 'min_sam...",0.79021,...,0.807584,0.020707,1,0.855888,0.86819,0.871705,0.863158,0.863398,0.864468,0.005342
106,0.011075,0.000293,0.001095,1.6e-05,gini,7,7,3,"{'criterion': 'gini', 'max_depth': 7, 'min_sam...",0.797203,...,0.803371,0.028657,5,0.855888,0.86819,0.871705,0.863158,0.863398,0.864468,0.005342
107,0.010916,0.000257,0.001061,5.1e-05,gini,7,7,4,"{'criterion': 'gini', 'max_depth': 7, 'min_sam...",0.79021,...,0.807584,0.020707,1,0.855888,0.86819,0.871705,0.863158,0.863398,0.864468,0.005342
108,0.010958,0.000251,0.001089,5e-05,gini,7,7,5,"{'criterion': 'gini', 'max_depth': 7, 'min_sam...",0.79021,...,0.807584,0.022621,1,0.855888,0.86819,0.871705,0.863158,0.863398,0.864468,0.005342
109,0.011215,0.000296,0.001146,4.1e-05,gini,7,7,10,"{'criterion': 'gini', 'max_depth': 7, 'min_sam...",0.797203,...,0.80618,0.023934,4,0.855888,0.86819,0.871705,0.863158,0.863398,0.864468,0.005342


In [30]:
grid.cv_results_["mean_test_score"].max()

0.8089887640449438

In [34]:
grid.best_score_

0.8061797752808989
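
Within a single fitted search, `best_score_` is the mean cross-validated score of the best parameter combination, so it should equal the largest `mean_test_score` in `cv_results_`. The two values printed above differ, which suggests those cells were run against different fits of `grid` (the execution counts are out of order, and the trees in the search are not seeded). A small sketch of the relationship, using the iris data purely as a stand-in:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A deliberately small grid; random_state makes the trees reproducible
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [1, 2, 3], "min_samples_leaf": [1, 5]},
    cv=5,
)
search.fit(X, y)

best_from_table = search.cv_results_["mean_test_score"].max()
print(search.best_params_)
print(np.isclose(search.best_score_, best_from_table))  # True
```

Note that both numbers are cross-validation scores computed on folds of the data passed to `fit`, so they are not directly comparable to the accuracy measured on `X_test`.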