# Lab: Titanic Survival Exploration with Decision Trees

## Getting Started
In the introductory project, you studied the Titanic survival data and made predictions about passenger survival. There, you built a decision tree by hand that, at each stage, picked the feature most correlated with survival. Conveniently, this is essentially how decision trees work. In this lab, we'll do the same thing much faster by training a decision tree in sklearn.
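
To make the idea of "picking the most informative feature" concrete, here is a minimal sketch that scores two candidate features by the weighted Gini impurity left after splitting on them. The toy passengers and the helper names `gini` and `weighted_gini_after_split` are hypothetical, and the sketch groups rows by raw feature value, which is a simplification of the binary threshold splits sklearn actually searches over.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini_after_split(rows, feature, target="Survived"):
    """Weighted Gini impurity after grouping rows by one feature's values."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    n = len(rows)
    return sum(len(group) / n * gini(group) for group in groups.values())

# Hypothetical toy passengers, not taken from the real dataset.
toy = [
    {"Sex": "female", "Pclass": 1, "Survived": 1},
    {"Sex": "female", "Pclass": 3, "Survived": 1},
    {"Sex": "female", "Pclass": 3, "Survived": 1},
    {"Sex": "male", "Pclass": 1, "Survived": 1},
    {"Sex": "male", "Pclass": 3, "Survived": 0},
    {"Sex": "male", "Pclass": 3, "Survived": 0},
]

# Lower impurity after the split means the feature separates the classes better.
scores = {f: weighted_gini_after_split(toy, f) for f in ("Sex", "Pclass")}
best_feature = min(scores, key=scores.get)
print(scores, best_feature)
```

On this toy data the split on `Sex` leaves less impurity than the split on `Pclass`, so a greedy tree would split on `Sex` first.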

We'll start by loading the dataset and displaying some of its rows.

Recall that these are the various features present for each passenger on the ship:
- **Pclass**: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
- **Name**: Name of passenger
- **Sex**: Sex of the passenger
- **Age**: Age of the passenger (Some entries contain `NaN`)
- **SibSp**: Number of siblings and spouses of the passenger aboard
- **Parch**: Number of parents and children of the passenger aboard
- **Ticket**: Ticket number of the passenger
- **Fare**: Fare paid by the passenger
- **Cabin**: Cabin number of the passenger (Some entries contain `NaN`)
- **Embarked**: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)  
- (Target variable) **Survived**: Outcome of survival (0 = No; 1 = Yes)  



In [118]:
# Import libraries necessary for this project
import pandas as pd

In [119]:
# Load the dataset
df = pd.read_csv('titanic_data.csv')
# Print the first few entries of the Titanic data
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [120]:
# Define the features and the outcomes.
# Drop 'Name' and 'PassengerId' because they do not affect survival,
# and drop 'Survived' because it is the target.
features = df.drop(['Survived', 'Name', 'PassengerId'], axis=1)
outcomes = df['Survived']
# Show the new dataset with 'Survived' removed
features.head()

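| | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 1 | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |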


In [121]:
# Data exploration: column types and non-null counts
features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pclass    891 non-null    int64  
 1   Sex       891 non-null    object 
 2   Age       714 non-null    float64
 3   SibSp     891 non-null    int64  
 4   Parch     891 non-null    int64  
 5   Ticket    891 non-null    object 
 6   Fare      891 non-null    float64
 7   Cabin     204 non-null    object 
 8   Embarked  889 non-null    object 
dtypes: float64(2), int64(3), object(4)
memory usage: 62.8+ KB


In [122]:
features.describe()

| | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|
| count | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [123]:
# Data cleaning: fill missing values with zero
features['Age'] = features['Age'].fillna(0)
features['Cabin'] = features['Cabin'].fillna(0)
features['Embarked'] = features['Embarked'].fillna(0)
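
Filling with zero is the approach this lab takes. It has side effects: an age of 0 looks like a newborn, and a numeric 0 in a text column mixes types. As a hedged alternative, not what the lab itself does, the sketch below imputes the median for a numeric column and an explicit label or the most frequent value for text columns. The toy frame and the `"Unknown"` label are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same kinds of gaps as the Titanic data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", np.nan, "S"],
})

imputed = toy.copy()
# Numeric column: use the median of the observed ages instead of 0.
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].median())
# Text column with many gaps: use an explicit label so the column stays text.
imputed["Cabin"] = imputed["Cabin"].fillna("Unknown")
# Text column with few gaps: use the most frequent observed value.
imputed["Embarked"] = imputed["Embarked"].fillna(imputed["Embarked"].mode()[0])

print(imputed)
```

Whether this helps a decision tree on the real data is an empirical question; the point is only that the choice of fill value is a modelling decision.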

In [124]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


## Preprocessing the data


In [125]:
# Transformation: perform feature scaling on the data.
# First, create the standardization object with StandardScaler().
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Second, apply the scaler to the numerical columns. Select them as a
# two-column DataFrame so each column is scaled as one feature, and
# assign the result back so the scaled values are actually kept.
features[['Age', 'Fare']] = scaler.fit_transform(features[['Age', 'Fare']])
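
`StandardScaler` expects a 2D input of shape `(n_samples, n_features)` and standardizes each column. The sketch below, on a hypothetical three-row frame, shows why the way the columns are passed matters: a Python list of two Series is read as two samples, while a two-column DataFrame is read as two features. It also shows that the result has to be assigned back to have any effect.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy values, chosen only to make the shapes easy to see.
toy = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Fare": [10.0, 20.0, 60.0]})

# A list of two Series becomes 2 rows (samples) with 3 columns (features).
as_rows = np.asarray([toy["Age"], toy["Fare"]])
# A two-column DataFrame becomes 3 rows (samples) with 2 columns (features).
as_columns = toy[["Age", "Fare"]].to_numpy()
print(as_rows.shape, as_columns.shape)

# Scale per column and write the result back into a copy of the frame.
scaled = toy.copy()
scaled[["Age", "Fare"]] = StandardScaler().fit_transform(toy[["Age", "Fare"]])
print(scaled)
```

After scaling, each column has mean 0 and unit standard deviation. Decision trees split on thresholds, so they are largely insensitive to this kind of rescaling; the step matters more for distance-based or gradient-based models.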

Next, we'll one-hot encode the categorical features.

In [126]:
# Dummy variables: convert categorical columns to numerical ones
# by one-hot encoding them with pandas.get_dummies()
features = pd.get_dummies(features)

In [127]:
features.shape

(891, 840)
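
The jump to 840 columns comes from one-hot encoding text columns that have many distinct values, since each distinct value becomes its own column. A minimal sketch of how to check this before encoding, using a hypothetical four-row frame; dropping `Ticket` here is an illustrative choice, not something the lab does.

```python
import pandas as pd

# Hypothetical toy frame: 'Ticket' is unique per row, 'Sex' has two values.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803"],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# Count distinct values in each non-numeric column before encoding.
text_columns = toy.select_dtypes(exclude="number").columns
cardinality = toy[text_columns].nunique()
print(cardinality)

# Encoding everything creates one column per distinct text value.
encoded_all = pd.get_dummies(toy)
# Leaving out the high-cardinality column keeps the encoding compact.
encoded_small = pd.get_dummies(toy.drop(columns=["Ticket"]))
print(encoded_all.shape, encoded_small.shape)
```

Columns that are nearly unique per passenger give the tree many ways to memorize individual training rows, which is worth keeping in mind when reading the accuracies later on.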

## Training the model

Now we're ready to train a model in sklearn. First, let's split the data into training and testing sets. Then we'll train the model on the training set.

In [128]:
# Split the data into a training set and a testing set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, outcomes, test_size=0.2)
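
Without a fixed seed, every run of the split produces different training and testing sets, so the accuracies reported in this lab will vary from run to run. A minimal sketch of a reproducible, class-balanced split on hypothetical data; the seed value 42 and the use of `stratify` are assumptions, not settings from the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 40 of them in the positive class.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 40 + [0] * 60)

# random_state fixes the shuffle; stratify keeps the class ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_train), len(X_test), y_train.mean(), y_test.mean())
```

Both sets keep the 40% positive rate of the full data, which makes train and test accuracy easier to compare.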

In [129]:
# Define the classifier model as DecisionTree
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
#fit the model to the data
model.fit(X_train, y_train)


DecisionTreeClassifier()

## Testing the model
Now let's see how our model does by calculating the accuracy on both the training and the testing sets.

In [130]:
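# Make predictions and compute the accuracy on the training and testing sets
from sklearn.metrics import accuracy_score

# accuracy_score expects the true labels first, then the predictions
train_accuracy = accuracy_score(y_train, model.predict(X_train))
# Store the result under its own name so the accuracy_score function is not overwritten
test_accuracy = accuracy_score(y_test, model.predict(X_test))


print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)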

The training accuracy is 1.0
The test accuracy is 0.8715083798882681


## Improving the model

OK, the training accuracy is perfect and the testing accuracy is lower, so the model may be overfitting a bit.
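
One way to look into a suspected overfit is to compare training and testing accuracy side by side and check how large the fitted tree became. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, not the Titanic features); `get_depth()` and `get_n_leaves()` are methods of a fitted `DecisionTreeClassifier`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (flip_y), not the Titanic set.
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# With default settings the tree keeps splitting until every leaf is pure.
unconstrained = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = unconstrained.score(X_train, y_train)
test_acc = unconstrained.score(X_test, y_test)
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
print("depth:", unconstrained.get_depth())
print("leaves:", unconstrained.get_n_leaves())
```

A perfect training score combined with a deep tree and many leaves suggests the tree has memorized the training rows, noise included.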

So now it's your turn to shine! Train a new model, and try to specify some parameters in order to improve the testing accuracy, such as:
- `max_depth`: The maximum number of levels in the tree.
- `min_samples_leaf`: The minimum number of samples allowed in a leaf.
- `min_samples_split`: The minimum number of samples required to split an internal node.
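
To see what these three parameters actually constrain, the sketch below fits one tree with all of them set and then inspects the fitted structure. The data is synthetic and the parameter values are arbitrary illustrations, not recommended settings. The `tree_` attribute exposes low-level arrays, where `children_left == -1` marks a leaf.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Titanic features).
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

constrained = DecisionTreeClassifier(
    max_depth=4, min_samples_leaf=5, min_samples_split=10, random_state=0
).fit(X_train, y_train)

# Leaves are the nodes without children; children_left is -1 for them.
is_leaf = constrained.tree_.children_left == -1
smallest_leaf = constrained.tree_.n_node_samples[is_leaf].min()
smallest_split_node = constrained.tree_.n_node_samples[~is_leaf].min()

print("depth:", constrained.get_depth())
print("smallest leaf:", smallest_leaf)
print("smallest node that was split:", smallest_split_node)
print("train accuracy:", constrained.score(X_train, y_train))
print("test accuracy:", constrained.score(X_test, y_test))
```

Each constraint stops the tree from carving out very small groups of training rows, which usually lowers training accuracy and may or may not raise testing accuracy.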



Use grid search to find good values for these parameters!



In [131]:
# Grid search
from sklearn.model_selection import GridSearchCV
# Define the classifier model as a decision tree
clf = DecisionTreeClassifier(random_state=40)

# Define the parameter grid:
# HINT: parameters = {'parameter_1': [value1, value2], 'parameter_2': [value1, value2]}
parameters = {'max_depth': [5, 6, 7, 8, 9, 10], 'min_samples_leaf': [2, 3, 4, 5, 6], 'min_samples_split': [5, 6, 7, 8, 9, 10]}

# Define the scoring method using make_scorer()
from sklearn.metrics import accuracy_score, make_scorer
scorer = make_scorer(accuracy_score)

# Create the GridSearchCV object with cv=3 (3-fold cross-validation)
grid_obj = GridSearchCV(clf, parameters, scoring=scorer, cv=3)
# Fit the grid search object to the training data
grid_fit = grid_obj.fit(X_train, y_train)
# Get the best estimator
best_clf = grid_fit.best_estimator_
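
Besides `best_estimator_`, a fitted `GridSearchCV` object records which combination won and how every candidate scored. A minimal sketch on synthetic data with a deliberately small, hypothetical grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Titanic features).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# A small illustrative grid: 3 x 2 = 6 candidate combinations.
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, scoring="accuracy", cv=3
)
search.fit(X, y)

best_params = search.best_params_      # winning parameter combination
best_cv_score = search.best_score_     # its mean cross-validated accuracy
n_candidates = len(search.cv_results_["params"])
print(best_params, best_cv_score, n_candidates)
```

Note that `best_score_` is an average over the validation folds, so it is not directly comparable to an accuracy computed on the full training set.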


In [132]:
# Make predictions using the new model.
y_train_pred = best_clf.predict(X_train)
y_test_pred = best_clf.predict(X_test)

# Calculate accuracies from the stored predictions (true labels first)
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)


The training accuracy is 0.8764044943820225
The test accuracy is 0.8603351955307262


In [133]:
# Extra: copy the "best" model found by grid search into a new model.
# clone() copies the hyperparameters only, so the new model is not fitted yet.
from sklearn.base import clone # Import functionality for cloning a model
model = clone(best_clf)
model

DecisionTreeClassifier(max_depth=9, min_samples_leaf=2, min_samples_split=10,
                       random_state=40)
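
Keep in mind that `clone` returns a new estimator with the same parameters but without any fitted state, so the copy has to be trained before it can predict. A minimal sketch on synthetic data; the variable names are hypothetical.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.exceptions import NotFittedError
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Titanic features).
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
fitted = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The clone shares the hyperparameters but has not seen any data.
fresh = clone(fitted)
try:
    fresh.predict(X)
    clone_is_fitted = True
except NotFittedError:
    clone_is_fitted = False
print("clone is fitted:", clone_is_fitted)

# After fitting on the same data with the same seed, it behaves like the original.
fresh.fit(X, y)
same_predictions = bool((fresh.predict(X) == fitted.predict(X)).all())
print("same predictions after refit:", same_predictions)
```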