# Lab: Titanic Survival Exploration with Decision Trees

## Getting Started
In the introductory project, you studied the Titanic survival data and made predictions about passenger survival. There, you built a decision tree by hand that, at each stage, picked the feature most correlated with survival. Conveniently, this is the same greedy idea a decision tree uses: at each node, it chooses the split that best separates the classes. In this lab, we'll do this much more quickly by training a decision tree in sklearn.

We'll start by loading the dataset and displaying some of its rows.

Recall that these are the various features present for each passenger on the ship:
- **Pclass**: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
- **Name**: Name of passenger
- **Sex**: Sex of the passenger
- **Age**: Age of the passenger (Some entries contain `NaN`)
- **SibSp**: Number of siblings and spouses of the passenger aboard
- **Parch**: Number of parents and children of the passenger aboard
- **Ticket**: Ticket number of the passenger
- **Fare**: Fare paid by the passenger
- **Cabin**: Cabin number of the passenger (Some entries contain `NaN`)
- **Embarked**: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)  
- (Target variable) **Survived**: Outcome of survival (0 = No; 1 = Yes)  



In [3]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier


In [4]:
# Load the dataset
data = pd.read_csv('titanic_data.csv')
# Print the first few entries of the Titanic data
data.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [9]:
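# Define the variables (features and outcome)
# Note: drop the Name, Ticket, Cabin, and Embarked columns along with the target.
# PassengerId is kept here, although it is only a row identifier.
feature = data.drop(['Name', 'Ticket', 'Cabin', 'Embarked', 'Survived'], axis=1)
outcome = data['Survived']
# Show the first rows of each; in a notebook, only the last expression is displayed
feature.head()
outcome.head()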

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

In [11]:
# Data exploration: look for missing values and check the column types
feature.isnull()
feature.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Pclass       891 non-null    int64  
 2   Sex          891 non-null    object 
 3   Age          714 non-null    float64
 4   SibSp        891 non-null    int64  
 5   Parch        891 non-null    int64  
 6   Fare         891 non-null    float64
dtypes: float64(2), int64(4), object(1)
memory usage: 48.9+ KB


In [15]:
# Data cleaning: fill missing values (here, only Age has any) with zero
feature.fillna(0, inplace=True)
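Filling missing ages with 0 is simple, but it makes those passengers look like newborns. One common alternative is to fill with the column median. The sketch below compares the two on a made-up `Age` column (the values are hypothetical); the lab itself keeps the zero fill.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries (not rows from the dataset)
ages = pd.Series([22.0, np.nan, 38.0, 26.0, np.nan], name="Age")

# Option used in this lab: replace NaN with 0
filled_zero = ages.fillna(0)

# Alternative: replace NaN with the median of the observed ages
filled_median = ages.fillna(ages.median())

print(filled_zero.tolist())
print(filled_median.tolist())
```

A decision tree can still isolate the zero-filled rows with a split on `Age`, so either choice is workable here.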

## Preprocessing the data


In [16]:
feature.isnull().sum()

PassengerId    0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
dtype: int64

In [21]:
# Transformation: set up feature scaling
# First, import StandardScaler, which rescales each column to zero mean and unit variance
from sklearn.preprocessing import StandardScaler

# Second, create the scaler object; it is applied to the data once the features are encoded
scaler = StandardScaler()

Next, we'll one-hot encode the categorical features.

In [23]:
# Dummy variables: convert categorical columns to numerical ones
# by one-hot encoding them with pandas.get_dummies()
feature = pd.get_dummies(feature)
# Scale the encoded features. The result is only displayed, not assigned,
# so `feature` itself stays unscaled.
scaler.fit_transform(feature)

array([[-1.73010796,  0.82737724, -0.10231279, ..., -0.50244517,
        -0.73769513,  0.73769513],
       [-1.72622007, -1.56610693,  0.80749164, ...,  0.78684529,
         1.35557354, -1.35557354],
       [-1.72233219,  0.82737724,  0.12513832, ..., -0.48885426,
         1.35557354, -1.35557354],
       ...,
       [ 1.72233219,  0.82737724, -1.35329389, ..., -0.17626324,
         1.35557354, -1.35557354],
       [ 1.72622007, -1.56610693,  0.12513832, ..., -0.04438104,
        -0.73769513,  0.73769513],
       [ 1.73010796,  0.82737724,  0.46631498, ..., -0.49237783,
        -0.73769513,  0.73769513]])
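`StandardScaler` standardizes each column as z = (x - mean) / std, using the population standard deviation. As a sketch, assuming `PassengerId` simply runs from 1 to 891 (which the first rows and the 891-entry index suggest, but which is an assumption), we can reproduce the first column of the array above with NumPy:

```python
import numpy as np

# Assumed IDs 1..891 (an assumption based on the rows shown earlier)
passenger_id = np.arange(1, 892)

# Standardize: subtract the mean, divide by the population std (ddof=0)
z = (passenger_id - passenger_id.mean()) / passenger_id.std()

print(z[:3])
print(z[-1])
```

Under that assumption, the first value comes out at about -1.7301 and the last at about 1.7301, matching the first and last rows of the scaled output.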

In [25]:
feature.shape

(891, 8)
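The column count went from 7 to 8 because `pd.get_dummies` replaces the text column `Sex` with one indicator column per category. A minimal sketch on a made-up three-row frame (the `toy` data is hypothetical) shows the effect:

```python
import pandas as pd

# Hypothetical mini-frame with one numeric and one categorical column
toy = pd.DataFrame({"Pclass": [3, 1, 3], "Sex": ["male", "female", "female"]})

# Numeric columns pass through; each category of Sex becomes its own column
encoded = pd.get_dummies(toy)

print(list(encoded.columns))
print(encoded)
```

The two indicator columns always mirror each other, which is why the last two columns of the scaled array have equal magnitude and opposite sign.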

## Training the model

Now we're ready to train a model in sklearn. First, let's split the data into training and testing sets. Then we'll train the model on the training set.

In [33]:
# Split the data into two sets: a training set and a testing set
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, fbeta_score

X_train, X_test, y_train, y_test = train_test_split(feature, outcome, test_size=0.2, random_state=42)
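As a quick sanity check on the split sizes: with 891 rows and `test_size=0.2`, `train_test_split` rounds the testing set size up and gives the rest to training. A minimal sketch of that arithmetic in plain Python (not a call into sklearn):

```python
import math

n_samples = 891   # rows in the data, per feature.info()
test_size = 0.2   # fraction passed to train_test_split

# A fractional test size is rounded up; the remaining rows form the training set
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 712 179
```

You can confirm this on the real data by comparing against `X_train.shape` and `X_test.shape`.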

In [34]:
# Define the classifier as a decision tree with default settings
model = DecisionTreeClassifier()

# Fit the model to the training data
model.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
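The fitted model above reports `criterion='gini'`, meaning that splits are chosen to reduce Gini impurity. To make that concrete, here is a small sketch on made-up labels. The eight toy passengers are hypothetical, and `gini` and `weighted_gini` are helper names introduced only for illustration:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(groups):
    """Size-weighted average impurity of the groups produced by a split."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini(g) for g in groups)

# Hypothetical toy data: survival outcome and sex of eight passengers
survived = [0, 0, 0, 1, 1, 1, 1, 0]
sex = ['male', 'male', 'male', 'female', 'female', 'female', 'male', 'female']

males = [s for s, x in zip(survived, sex) if x == 'male']
females = [s for s, x in zip(survived, sex) if x == 'female']

parent = gini(survived)                       # impurity before splitting
after_split = weighted_gini([males, females]) # impurity after splitting on sex

print(parent, after_split)  # 0.5 0.375
```

The split on sex lowers the impurity from 0.5 to 0.375; a tree compares candidate splits like this one and keeps the one with the largest reduction.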

## Testing the model
Now let's see how our model does by calculating the accuracy on both the training and the testing sets.

In [35]:
# Make predictions on the training and testing sets
prediction_train = model.predict(X_train)
prediction_test = model.predict(X_test)

train_accuracy = accuracy_score(y_train, prediction_train)
test_accuracy = accuracy_score(y_test, prediction_test)

print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)

The training accuracy is 1.0
The test accuracy is 0.7653631284916201


## Improving the model

The training accuracy is perfect (1.0), while the testing accuracy is much lower (about 0.77). This gap suggests that the model is overfitting.

So now it's your turn to shine! Train a new model, and try to specify some parameters in order to improve the testing accuracy, such as:
- `max_depth`: The maximum number of levels in the tree.
- `min_samples_leaf`: The minimum number of samples allowed in a leaf.
- `min_samples_split`: The minimum number of samples required to split an internal node.
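To see how these parameters restrain a tree, here is a sketch on synthetic data from `make_classification`. The data is a stand-in, not the Titanic set, and the specific parameter values are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical, not the Titanic data)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Unconstrained tree: grows until every training point is classified correctly
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Constrained tree: limited depth and minimum sizes for leaves and splits
shallow = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, min_samples_split=10, random_state=0
).fit(X_tr, y_tr)

for name, tree in [("unconstrained", deep), ("constrained", shallow)]:
    print(name, "depth:", tree.get_depth(), "leaves:", tree.get_n_leaves(),
          "train acc:", tree.score(X_tr, y_tr), "test acc:", tree.score(X_te, y_te))
```

The constrained tree gives up some training accuracy; whether that helps on the testing set depends on the data, which is why we search over several values rather than guessing.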



Use grid search to find a good combination of these parameters!



In [129]:
# Grid search
# Import GridSearchCV and make_scorer
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.metrics import make_scorer

# Define the classifier as a decision tree
clf = DecisionTreeClassifier()

# Define the parameter grid to search
# HINT: parameters = {'parameter_1': [value1, value2], 'parameter_2': [value1, value2]}
parameters = {'max_depth': [4, 5], 'min_samples_leaf': [2, 4, 6], 'min_samples_split': [2, 4, 6]}

# Define the scoring method using make_scorer()
scorer = make_scorer(f1_score)

# Create the GridSearchCV object with cv=3 (3-fold cross-validation)
grid_obj = GridSearchCV(clf, parameters, scoring=scorer, cv=3)
# Fit the grid search object on the training data
grid_fit = grid_obj.fit(X_train, y_train)
# Get the best estimator
best_clf = grid_fit.best_estimator_

In [130]:
best_clf

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=5, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=6, min_samples_split=4,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
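It helps to know how much work the grid search did. The sketch below enumerates the same grid in plain Python (it does not call sklearn) and counts the candidate settings and cross-validation fits:

```python
from itertools import product

# Same grid and fold count as in the grid search cell
parameters = {'max_depth': [4, 5], 'min_samples_leaf': [2, 4, 6], 'min_samples_split': [2, 4, 6]}
cv = 3

# Every combination of one value per parameter is one candidate model
names = list(parameters)
candidates = [dict(zip(names, values)) for values in product(*parameters.values())]

# Each candidate is trained once per cross-validation fold
n_fits = len(candidates) * cv

print(len(candidates), n_fits)  # 18 54
```

With the default `refit=True`, `GridSearchCV` then trains the winning setting once more on the whole training set. The selected setting shown above (`max_depth=5`, `min_samples_leaf=6`, `min_samples_split=4`) is one of these 18 candidates.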

In [131]:
# Make predictions using the new model.
y_train_pred = best_clf.predict(X_train)
y_test_pred = best_clf.predict(X_test)

# Calculating accuracies
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

f1_test = f1_score(y_test, y_test_pred)

print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)
print('The f1 score for the best model is', f1_test)

The training accuracy is 0.8469101123595506
The test accuracy is 0.8212290502793296
The f1 score for the best model is 0.7611940298507462
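The grid search was scored with F1 rather than accuracy. F1 is the harmonic mean of precision and recall for the positive class (here, `Survived = 1`), so it can differ noticeably from accuracy. A minimal sketch on made-up labels (the two lists are hypothetical, not model output):

```python
# Hypothetical true labels and predictions for eight passengers
y_true = [1, 0, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # survivors predicted as survivors
fp = sum(t == 0 and p == 1 for t, p in pairs)  # non-survivors predicted as survivors
fn = sum(t == 1 and p == 0 for t, p in pairs)  # survivors predicted as non-survivors

accuracy = sum(t == p for t, p in pairs) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # 0.75 0.5 0.5 0.5
```

Here the accuracy looks decent because most passengers are correctly predicted as non-survivors, while the F1 score shows that only half of the survivor predictions and half of the actual survivors were handled correctly.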
