
# Lab: Titanic Survival Exploration with Decision Trees

## Getting Started
In this lab, you will see how decision trees work by implementing a decision tree in sklearn.

We'll start by loading the dataset and displaying some of its rows.

In [1]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames

# Pretty display for notebooks
%matplotlib inline

# Set a random seed
import random
random.seed(42)

# Load the dataset
full_data = pd.read_csv('../input/data-science-day1-titanic/DSB_Day1_Titanic_train.csv')

# Print the first few entries of the RMS Titanic data
display(full_data.head())

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Recall that these are the various features present for each passenger on the ship:
- **Survived**: Outcome of survival (0 = No; 1 = Yes)
- **Pclass**: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
- **Name**: Name of passenger
- **Sex**: Sex of the passenger
- **Age**: Age of the passenger (Some entries contain `NaN`)
- **SibSp**: Number of siblings and spouses of the passenger aboard
- **Parch**: Number of parents and children of the passenger aboard
- **Ticket**: Ticket number of the passenger
- **Fare**: Fare paid by the passenger
- **Cabin**: Cabin number of the passenger (Some entries contain `NaN`)
- **Embarked**: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)

Since we're interested in the outcome of survival for each passenger or crew member, we can remove the **Survived** feature from this dataset and store it as its own separate variable `outcomes`. We will use these outcomes as our prediction targets.  
Run the code cell below to remove **Survived** as a feature of the dataset and store it in `outcomes`.

In [2]:
# Store the 'Survived' feature in a new variable and remove it from the dataset
outcomes = full_data['Survived']
features_raw = full_data.drop('Survived', axis = 1)

# Show the new dataset with 'Survived' removed
display(features_raw.head())

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


The same sample of the RMS Titanic data now shows the **Survived** feature removed from the DataFrame. Note that `features_raw` (the passenger data) and `outcomes` (the outcomes of survival) are now *paired*: any passenger `features_raw.loc[i]` has the survival outcome `outcomes[i]`.
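
The pairing can be checked on a small hand-made frame. The sketch below uses a hypothetical three-row DataFrame (`toy`) rather than the lab's CSV, and relies only on the fact that `drop` keeps the row index unchanged.

```python
import pandas as pd

# Hypothetical three-row stand-in for the Titanic data (not the real CSV)
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
})

toy_outcomes = toy["Survived"]
toy_features = toy.drop("Survived", axis=1)

# Both objects keep the original row index, so label i refers to the same passenger
same_index = toy_features.index.equals(toy_outcomes.index)
print("Indexes match:", same_index)
print(toy_features.loc[1].to_dict(), "->", toy_outcomes[1])
```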

## Preprocessing the data

Now, let's do some data preprocessing. First, we'll remove the names of the passengers, and then one-hot encode the features.

**Question:** Why would it be a terrible idea to one-hot encode the data without removing the names?

**Answer:** If we one-hot encoded the name column, there would be one column for each name, and the model would learn the names of the survivors and make predictions based on them. This would lead to serious overfitting!
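
To make the answer concrete, here is a minimal sketch on a hypothetical three-passenger frame (the names are taken from the sample rows above; the frame itself is made up for illustration). It counts how many indicator columns `pd.get_dummies` creates for the names.

```python
import pandas as pd

# Hypothetical mini-dataset: every passenger has a unique name
toy = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Allen, Mr. William Henry",
    ],
    "Sex": ["male", "female", "male"],
})

encoded = pd.get_dummies(toy)
name_columns = [c for c in encoded.columns if c.startswith("Name_")]

# One indicator column per distinct name, so as many name columns as rows
print(len(name_columns), "name columns for", len(toy), "passengers")
```

Each name column is 1 for exactly one row, so a tree could split on it to memorize that single passenger without learning anything that transfers to new passengers.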

In [3]:
# Removing the names
features_no_name = features_raw.drop(['Name'], axis=1)

# One-hot encoding
features = pd.get_dummies(features_no_name)

And now we'll fill in any blanks with zeroes.

In [4]:
features = features.fillna(0.0)
display(features.head())

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Ticket_110152,Ticket_110413,...,Cabin_F G73,Cabin_F2,Cabin_F33,Cabin_F38,Cabin_F4,Cabin_G6,Cabin_T,Embarked_C,Embarked_Q,Embarked_S
0,1,3,22.0,1,0,7.25,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
1,2,1,38.0,1,0,71.2833,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,3,26.0,0,0,7.925,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,4,1,35.0,1,0,53.1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,5,3,35.0,0,0,8.05,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
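
The sketch below shows what the zero-fill does on a hypothetical two-column frame (not the lab's data). It relies on the default `dummy_na=False` behaviour of `pd.get_dummies`, under which a missing categorical value simply gets 0 in every indicator column, so after encoding only numeric columns such as `Age` still hold `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in a numeric and a categorical column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
})

encoded = pd.get_dummies(toy)

# Count missing values per column before filling
missing_before = encoded.isna().sum().to_dict()
print("Missing before fill:", missing_before)

filled = encoded.fillna(0.0)
missing_after = int(filled.isna().sum().sum())
print("Missing after fill:", missing_after)
print(filled)
```

Note that a missing age becomes 0.0, which the tree then treats like any other age value.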


## (TODO) Training the model

Now we're ready to train a model in sklearn. First, let's split the data into training and testing sets. Then we'll train the model on the training set.

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, outcomes, test_size=0.2, random_state=42)
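
As a quick illustration of what `test_size=0.2` and `random_state=42` do, the following sketch splits a hypothetical 100-row array (not the Titanic features) and checks that repeating the call with the same seed gives the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 features
X = np.arange(300).reshape(100, 3)
y = np.arange(100) % 2

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
print("Train shape:", X_tr.shape, "Test shape:", X_te.shape)

# Repeating the split with the same random_state reproduces the same test rows
_, X_te_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = np.array_equal(X_te, X_te_again)
print("Same split with same seed:", same_split)
```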

In [6]:
# Import the classifier from sklearn
from sklearn.tree import DecisionTreeClassifier

# TODO: Define the classifier, and fit it to the data
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

DecisionTreeClassifier()

## Testing the model
Now let's see how our model does by calculating the accuracy on both the training and the testing sets.

In [7]:
# Making predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculate the accuracy
from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)

The training accuracy is 1.0
The test accuracy is 0.8100558659217877
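
Accuracy is simply the fraction of predictions that match the true labels. The sketch below verifies this on two small hypothetical label arrays by comparing a manual computation with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions: 4 of the 5 predictions are correct
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

manual = float(np.mean(y_true == y_pred))   # fraction of matching entries
library = float(accuracy_score(y_true, y_pred))
print(manual, library)
```

Read this way, the test accuracy reported above is consistent with 145 correct predictions out of 179 test passengers.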


# Exercise: Improving the model

The training accuracy is perfect (1.0) while the testing accuracy is noticeably lower (about 0.81), a sign that the model is overfitting.

So now it's your turn to shine! Train a new model, and try to specify some parameters in order to improve the testing accuracy, such as:
- `max_depth`
- `min_samples_leaf`
- `min_samples_split`

You can use your intuition, trial and error, or even better, feel free to use Grid Search!

**Challenge:** Try to get to 85% accuracy on the testing set. If you'd like a hint, take a look at the solutions notebook in this same folder.
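
Before tuning, it can help to see how these parameters constrain the tree. The sketch below is an illustration on synthetic data from `make_classification` (an assumption; it is not the Titanic data, so the numbers will differ from the lab's). It fits trees with different `max_depth` values and reports their actual depth, number of leaves, and accuracies.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Titanic features)
X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

results = {}
for depth in (None, 6, 3):
    tree = DecisionTreeClassifier(max_depth=depth, min_samples_leaf=5, random_state=0)
    tree.fit(X_tr, y_tr)
    results[depth] = {
        "depth": tree.get_depth(),
        "leaves": tree.get_n_leaves(),
        "train_acc": tree.score(X_tr, y_tr),
        "test_acc": tree.score(X_te, y_te),
    }
    print("max_depth =", depth, results[depth])
```

A smaller `max_depth` or a larger `min_samples_leaf` yields a smaller tree, which usually lowers training accuracy and may or may not raise testing accuracy; that trade-off is what the exercise asks you to explore.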

In [8]:
# Training the model
model = DecisionTreeClassifier(splitter='best', max_depth=6, min_samples_leaf=7, min_samples_split=10, criterion='gini')
model.fit(X_train, y_train)

# Making predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculating accuracies
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)

The training accuracy is 0.8679775280898876
The test accuracy is 0.8435754189944135


In [9]:
# Grid search
# Specify the parameter values to search
param_grid = {
    'max_depth': list(range(1, 5)),
    'min_samples_leaf': list(range(2, 5)),
    # Note: min_samples_split=1 is not a valid setting (scikit-learn requires an
    # integer of at least 2), so every fit that uses it fails and is scored nan.
    # Starting this range at 2 avoids those failed fits.
    'min_samples_split': list(range(1, 5)),
    'criterion': ['gini', 'entropy'],
}

model = DecisionTreeClassifier()

# Instantiate the GridSearchCV instance
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(estimator=DecisionTreeClassifier(), param_grid=param_grid, cv=10, n_jobs=-1, scoring='accuracy', verbose=4)

In [10]:
grid.fit(X_train,y_train)

Fitting 10 folds for each of 96 candidates, totalling 960 fits
[CV 2/10] END criterion=gini, max_depth=1, min_samples_leaf=2, min_samples_split=1;, score=nan total time=   0.0s
[CV 7/10] END criterion=gini, max_depth=1, min_samples_leaf=2, min_samples_split=1;, score=nan total time=   0.0s
[CV 10/10] END criterion=gini, max_depth=1, min_samples_leaf=2, min_samples_split=1;, score=nan total time=   0.0s
[CV 4/10] END criterion=gini, max_depth=1, min_samples_leaf=2, min_samples_split=2;, score=0.958 total time=   0.0s
[CV 8/10] END criterion=gini, max_depth=1, min_samples_leaf=2, min_samples_split=2;, score=0.775 total time=   0.0s
[CV 2/10] END criterion=gini, max_depth=1, min_samples_leaf=2, min_samples_split=3;, score=0.750 total time=   0.0s
[CV 6/10] END criterion=gini, max_depth=1, min_samples_leaf=2, min_samples_split=3;, score=0.676 total time=   0.0s
[CV 1/10] END criterion=gini, max_depth=1, min_samples_leaf=2, min_samples_split=4;, score=0.833 total time=   0.0s
[CV 2/10] END 

240 fits failed out of a total of 960.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
240 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/tree/_classes.py", line 942, in fit
    X_idx_sorted=X_idx_sorted,
  File "/opt/conda/lib/python3.7/site-packages/sklearn/tree/_classes.py", line 254, in fit
    % self.min_samples_split
ValueError: min_samples_split must be an integer greater than 1 or a float in (0.0, 1.0]; got the integer 1

 0.7879108  0.7879108         nan 0.7879108  0.7879108  0.7879108


GridSearchCV(cv=10, estimator=DecisionTreeClassifier(), n_jobs=-1,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [1, 2, 3, 4],
                         'min_samples_leaf': [2, 3, 4],
                         'min_samples_split': [1, 2, 3, 4]},
             scoring='accuracy', verbose=4)
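
The counts in the log above can be reproduced without fitting anything. The sketch below uses `ParameterGrid` to enumerate the same grid, counts the candidates with the invalid `min_samples_split=1`, and shows the size of an adjusted grid that starts at 2 (the adjusted grid, named `valid_grid` here, is a suggestion and not part of the original run).

```python
from sklearn.model_selection import ParameterGrid

# The grid used above, including the invalid min_samples_split=1
original_grid = {
    "max_depth": list(range(1, 5)),
    "min_samples_leaf": list(range(2, 5)),
    "min_samples_split": list(range(1, 5)),
    "criterion": ["gini", "entropy"],
}

n_candidates = len(ParameterGrid(original_grid))
n_invalid = sum(1 for p in ParameterGrid(original_grid) if p["min_samples_split"] < 2)
n_failed_fits = n_invalid * 10  # each candidate is fitted once per fold (cv=10)
print(n_candidates, "candidates,", n_failed_fits, "failing fits")

# Suggested adjustment: start min_samples_split at 2
valid_grid = dict(original_grid, min_samples_split=list(range(2, 5)))
n_valid = len(ParameterGrid(valid_grid))
print(n_valid, "candidates in the adjusted grid")
```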

In [11]:
grid.best_score_

0.8117957746478874

In [12]:
grid.best_params_

{'criterion': 'gini',
 'max_depth': 3,
 'min_samples_leaf': 2,
 'min_samples_split': 2}
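
Note that `best_score_` is the mean cross-validated accuracy on the training data, not the accuracy on the held-out test set. A minimal sketch of scoring the refitted best model on a test set follows; it uses synthetic data and a small hypothetical grid, so its numbers are unrelated to the results above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Titanic features)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 4], "min_samples_split": [2, 4]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is retrained on all of X_tr
test_accuracy = search.best_estimator_.score(X_te, y_te)
print("Best params:", search.best_params_)
print("Mean CV accuracy:", search.best_score_)
print("Test accuracy:", test_accuracy)
```

In the lab itself, the equivalent check would be `grid.best_estimator_.score(X_test, y_test)`.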

In [13]:
# Randomized search
# Specify the parameter values to sample from
# Note: range(60) starts at 0. max_depth=0, min_samples_leaf=0, and
# min_samples_split values of 0 or 1 are invalid, so any sampled candidate
# that uses one of them fails and is scored nan.
param_grid = {
    'max_depth': list(range(60)),
    'min_samples_leaf': list(range(60)),
    'min_samples_split': list(range(60)),
    'splitter': ['best', 'random'],
    'criterion': ['gini', 'entropy'],
}

In [14]:
# Instantiate the RandomizedSearchCV instance
# (named random_search so it does not shadow the `random` module imported in the first cell)
from sklearn.model_selection import RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=DecisionTreeClassifier(), param_distributions=param_grid, cv=10, n_iter=50, n_jobs=-1, scoring='accuracy', verbose=4)
random_search.fit(X_train, y_train)
print(random_search.best_params_)
print(random_search.best_score_)

Fitting 10 folds for each of 50 candidates, totalling 500 fits

[CV 6/10] END criterion=entropy, max_depth=4, min_samples_leaf=2, min_samples_split=1;, score=nan total time=   0.0s
[CV 7/10] END criterion=entropy, max_depth=4, min_samples_leaf=2, min_samples_split=1;, score=nan total time=   0.0s
[CV 8/10] END criterion=entropy, max_depth=4, min_samples_leaf=2, min_samples_split=1;, score=nan total time=   0.0s
[CV 9/10] END criterion=entropy, max_depth=4, min_samples_leaf=2, min_samples_split=1;, score=nan total time=   0.0s
[CV 10/10] END criterion=entropy, max_depth=4, min_samples_leaf=2, min_samples_split=1;, score=nan total time=   0.0s
[CV 1/10] END criterion=entropy, max_depth=4, min_samples_leaf=2, min_samples_split=2;, score=0.861 total time=   0.1s
[CV 2/10] END criterion=entropy, max_depth=4, min_samples_leaf=2, min_samples_split=2;, score=0.736 total time=   0.1s
[CV 7/10] END criterion=entropy, max_depth=4, min_samples_leaf=2, min_samples_split=4;, score=0.803 total time= 

20 fits failed out of a total of 500.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/tree/_classes.py", line 942, in fit
    X_idx_sorted=X_idx_sorted,
  File "/opt/conda/lib/python3.7/site-packages/sklearn/tree/_classes.py", line 254, in fit
    % self.min_samples_split
ValueError: min_samples_split must be an integer greater than 1 or a float in (0.0, 1.0]; got the integer 1

--------------------------------------------------------------------
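
The failures above come from sampled values that the tree does not accept. One way to avoid them, sketched below under the assumption that integer distributions from `scipy.stats.randint` are acceptable for this search, is to draw only from valid ranges. The data is synthetic and the settings (`n_iter=10`, `cv=5`) are chosen to keep the example fast, so this is an illustration and not a rerun of the search above.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Titanic features)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# randint(low, high) samples integers from low up to high - 1
param_distributions = {
    "max_depth": randint(1, 60),          # at least 1
    "min_samples_leaf": randint(1, 60),   # at least 1
    "min_samples_split": randint(2, 60),  # at least 2
    "splitter": ["best", "random"],
    "criterion": ["gini", "entropy"],
}

random_search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,
    cv=5,
    scoring="accuracy",
    random_state=42,
)
random_search.fit(X, y)

print(random_search.best_params_)
print(random_search.best_score_)
```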