# Lab: Titanic Survival Exploration with Decision Trees

## Getting Started
In the introductory project, you studied the Titanic survival data and made predictions about passenger survival. In that project, you built a decision tree by hand that, at each stage, picked the feature most correlated with survival. Lucky for us, this is essentially how decision trees work: at each node, they choose the split that best separates the classes. In this lab, we'll do this much more quickly by implementing a decision tree in sklearn.
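
As a rough illustration of what "picking the best feature at each stage" means, the sketch below scores two candidate splits by how much they reduce Gini impurity, which is the default split criterion of sklearn's `DecisionTreeClassifier`. The tiny arrays and the helper names `gini` and `gini_gain` are made up for this example and are not part of the lab.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of 0/1 class labels."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    p = labels.mean()            # fraction of class 1
    return 2.0 * p * (1.0 - p)   # same as 1 - p^2 - (1 - p)^2 for two classes

def gini_gain(labels, mask):
    """Impurity reduction obtained by splitting `labels` with a boolean `mask`."""
    labels = np.asarray(labels)
    mask = np.asarray(mask, dtype=bool)
    left, right = labels[mask], labels[~mask]
    weighted = (left.size * gini(left) + right.size * gini(right)) / labels.size
    return gini(labels) - weighted

# Hypothetical toy data, not taken from the Titanic file
survived = np.array([1, 1, 1, 0, 0, 0, 0, 1])
is_female = np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=bool)
is_first_class = np.array([1, 0, 1, 0, 1, 0, 1, 0], dtype=bool)

gain_sex = gini_gain(survived, is_female)
gain_class = gini_gain(survived, is_first_class)
print("gain from splitting on sex:  ", gain_sex)
print("gain from splitting on class:", gain_class)
```

On this toy data the split on sex yields the larger gain, so a tree would choose it first.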

We'll start by loading the dataset and displaying some of its rows.

Recall that these are the various features present for each passenger on the ship:
- **Pclass**: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
- **Name**: Name of passenger
- **Sex**: Sex of the passenger
- **Age**: Age of the passenger (Some entries contain `NaN`)
- **SibSp**: Number of siblings and spouses of the passenger aboard
- **Parch**: Number of parents and children of the passenger aboard
- **Ticket**: Ticket number of the passenger
- **Fare**: Fare paid by the passenger
- **Cabin**: Cabin number of the passenger (Some entries contain `NaN`)
- **Embarked**: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)  
- (Target variable) **Survived**: Outcome of survival (0 = No; 1 = Yes)  



In [53]:
# Import libraries necessary for this project
import pandas as pd
import numpy as np


from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import StandardScaler

# Pretty display for notebooks
%matplotlib inline

# Allows the use of display() for DataFrames
from IPython.display import display

In [54]:
# Load the dataset
data = pd.read_csv('titanic_data.csv')
# Print the first few entries of the Titanic data
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S


In [55]:
# Define the variables (features, outcome)
# Note: the Name column should not be included in the features;
# this cell only drops 'Survived', so Name is still present below
outcome = data[['Survived']]
features = data.drop(['Survived'], axis=1)

# Show the new dataset with 'Survived' removed
features.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [56]:
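# Data exploration
# Note: info() and display() show their own output and return None,
# so wrapping them in print() also prints the `None` lines seen below
print(data.info())
print(display(data))
print(features.isnull().sum())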

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


None
PassengerId      0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


In [57]:
# Data cleaning: fill null values with zero
# (this also puts 0.0 into the text columns Cabin and Embarked)
features = features.fillna(0.0)
print(features.isnull().sum())
display(features.head(n=7))

PassengerId    0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,0.0,S
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,0.0,S
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,0.0,S
5,6,3,"Moran, Mr. James",male,0.0,0,0,330877,8.4583,0.0,Q
6,7,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
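
Filling every gap with `0.0` makes passengers with a missing age look like newborns and adds a literal `0.0` category to text columns such as `Cabin` and `Embarked`. One common alternative, sketched here on a small made-up frame, is to fill numeric columns with the median and categorical columns with the most frequent value or an explicit label. The column names mirror the lab, but the values and the `"Unknown"` label are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same kinds of gaps as the Titanic data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", np.nan, "S", "Q"],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
})

cleaned = toy.copy()
# Numeric column: use the median of the observed ages
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
# Low-cardinality categorical column: use the most frequent port
cleaned["Embarked"] = cleaned["Embarked"].fillna(cleaned["Embarked"].mode()[0])
# Mostly-missing column: mark the gap explicitly instead of guessing
cleaned["Cabin"] = cleaned["Cabin"].fillna("Unknown")

print(cleaned)
print(cleaned.isnull().sum())
```

Whether this helps the decision tree in this lab is an empirical question; the sketch only shows the mechanics.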


In [58]:
# Additional note from me: here we could look at summary statistics of the data,
# as in the project from the previous lesson

## Preprocessing the data


In [59]:
# Transformation: perform feature scaling on the data
# First: define the standardization object using StandardScaler()

stand = StandardScaler()
# Second: apply the scaler to the numerical columns of the data
# (PassengerId is only an identifier, but it is scaled here along with the rest)
numerical = ['PassengerId', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
features[numerical] = stand.fit_transform(features[numerical])
features.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,-1.730108,0.827377,"Braund, Mr. Owen Harris",male,-0.102313,0.432793,-0.473674,A/5 21171,-0.502445,0.0,S
1,-1.72622,-1.566107,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,0.807492,0.432793,-0.473674,PC 17599,0.786845,C85,C
2,-1.722332,0.827377,"Heikkinen, Miss. Laina",female,0.125138,-0.474545,-0.473674,STON/O2. 3101282,-0.488854,0.0,S
3,-1.718444,-1.566107,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,0.636903,0.432793,-0.473674,113803,0.42073,C123,S
4,-1.714556,0.827377,"Allen, Mr. William Henry",male,0.636903,-0.474545,-0.473674,373450,-0.486337,0.0,S
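
The scaling cell fits `StandardScaler` on every row before any train/test split, so the mean and standard deviation it learns also reflect rows that later end up in the test set. A minimal sketch of the leak-free order, on random made-up data, is to split first, fit the scaler on the training rows only, and reuse those statistics on the test rows. The frame `toy` and its values are assumptions for illustration. Note also that tree splits depend only on the ordering of feature values, so scaling is not required for a decision tree.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Age": rng.uniform(1, 80, size=40),
    "Fare": rng.uniform(5, 500, size=40),
})

# Split first, so the test rows stay unseen
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=0)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(toy_train)  # learn mean/std from training rows only
test_scaled = scaler.transform(toy_test)        # reuse those statistics on the test rows

print("training means after scaling:", train_scaled.mean(axis=0))
print("test means after scaling:    ", test_scaled.mean(axis=0))
```

The training columns end up with mean zero by construction, while the test columns are only approximately centered, which is expected.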


Next, we'll one-hot encode the categorical features.

In [60]:
# Dummy variables: convert categorical columns to numerical ones
# Perform one-hot encoding on the categorical columns using pandas.get_dummies()
features_final = pd.get_dummies(features)
features_final.head()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,"Name_Abbing, Mr. Anthony","Name_Abbott, Mr. Rossmore Edward","Name_Abbott, Mrs. Stanton (Rosa Hunt)","Name_Abelson, Mr. Samuel",...,Cabin_F2,Cabin_F33,Cabin_F38,Cabin_F4,Cabin_G6,Cabin_T,Embarked_0.0,Embarked_C,Embarked_Q,Embarked_S
0,-1.730108,0.827377,-0.102313,0.432793,-0.473674,-0.502445,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,-1.72622,-1.566107,0.807492,0.432793,-0.473674,0.786845,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,-1.722332,0.827377,0.125138,-0.474545,-0.473674,-0.488854,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,-1.718444,-1.566107,0.636903,0.432793,-0.473674,0.42073,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,-1.714556,0.827377,0.636903,-0.474545,-0.473674,-0.486337,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [61]:
features_final.shape

(891, 1732)
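
Nearly all of those 1,732 columns come from text columns with many distinct values, such as `Name`, `Ticket`, and `Cabin`, because `pd.get_dummies` creates one column per distinct value. The sketch below, on a four-row made-up frame, shows how dropping such columns before encoding keeps the feature matrix small. Which columns to drop in the real data is a judgment call; the choice here is only an illustration.

```python
import pandas as pd

# Hypothetical four-row frame using the lab's column names
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley",
             "Heikkinen, Miss. Laina", "Futrelle, Mrs. Jacques Heath"],
    "Sex": ["male", "female", "female", "female"],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803"],
    "Embarked": ["S", "C", "S", "S"],
})

# Encoding everything: one dummy column per distinct name and ticket
wide = pd.get_dummies(toy)

# Dropping the near-unique text columns first
narrow = pd.get_dummies(toy.drop(columns=["Name", "Ticket"]))

print(wide.shape)
print(narrow.shape)
print(list(narrow.columns))
```

A column that is unique per passenger lets the tree memorize individual training rows, which is one way to reach a perfect training score without learning anything that generalizes.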

## Training the model

Now we're ready to train a model in sklearn. First, let's split the data into training and testing sets. Then we'll train the model on the training set.

In [62]:
# Split the data into two sets: a training set and a testing set
X_train, X_test, y_train, y_test = train_test_split(features_final, outcome)
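
`train_test_split` shuffles the rows randomly and, by default, holds out 25% of them, so the accuracies reported in this lab will change from run to run. A minimal sketch of a reproducible split is shown below; the toy data, the seed `42`, and the use of `stratify` (which keeps the class proportions equal in both sets) are assumptions added for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 rows, 40% positive class
toy_X = pd.DataFrame({"feature": np.arange(20)})
toy_y = pd.Series([0] * 12 + [1] * 8, name="Survived")

# Two calls with the same random_state return the same split
split_a = train_test_split(toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y)
split_b = train_test_split(toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y)

Xa_train, Xa_test, ya_train, ya_test = split_a
Xb_train, Xb_test, yb_train, yb_test = split_b

print("same test rows on both calls:", Xa_test.index.equals(Xb_test.index))
print("share of survivors in train:", ya_train.mean())
print("share of survivors in test: ", ya_test.mean())
```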

In [63]:
# Define the classifier model as DecisionTree
clf = DecisionTreeClassifier()

#fit the model to the data
clf.fit(X_train, y_train)

DecisionTreeClassifier()

## Testing the model
Now let's see how our model does by calculating the accuracy on both the training and the testing sets.

In [64]:
# Make predictions on the scaled training and testing data
train_pred = clf.predict(X_train)
test_pred = clf.predict(X_test)

train_accuracy = accuracy_score(y_train,train_pred)
test_accuracy = accuracy_score(y_test,test_pred)

print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)

The training accuracy is 1.0
The test accuracy is 0.7982062780269058
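
Accuracy alone does not show how the model treats each class. The first cell imports `precision_score`, `recall_score`, and `f1_score` but never uses them, so here is a small sketch of how they could be called. The label lists are hand-made for illustration; in the lab you would pass `y_test` and `test_pred` instead.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical true labels and predictions (1 = survived)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # share of all predictions that are correct
precision = precision_score(y_true, y_pred)  # of predicted survivors, share who survived
recall = recall_score(y_true, y_pred)        # of actual survivors, share that were found
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print("accuracy: ", accuracy)
print("precision:", precision)
print("recall:   ", recall)
print("f1:       ", f1)
```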


## Improving the model

OK, we have a perfect training accuracy (1.0) and a lower testing accuracy (about 0.80). We may be overfitting a bit.

So now it's your turn to shine! Train a new model, and try specifying some parameters to improve the testing accuracy, such as:
- `max_depth`: The maximum number of levels in the tree.
- `min_samples_leaf`: The minimum number of samples allowed in a leaf.
- `min_samples_split`: The minimum number of samples required to split an internal node.
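
To get a feel for what these parameters do, the sketch below fits an unconstrained tree and a constrained tree on synthetic data and compares their size. The data from `make_classification` and the specific parameter values are assumptions for illustration, not the lab's dataset or recommended settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

unconstrained = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
constrained = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=20, min_samples_split=40, random_state=0
).fit(X_toy, y_toy)

# Leaves are the nodes without children (children_left == -1)
is_leaf = constrained.tree_.children_left == -1
smallest_leaf = constrained.tree_.n_node_samples[is_leaf].min()

print("unconstrained depth / leaves:", unconstrained.get_depth(), unconstrained.get_n_leaves())
print("constrained depth / leaves:  ", constrained.get_depth(), constrained.get_n_leaves())
print("smallest constrained leaf:   ", smallest_leaf)
print("unconstrained training accuracy:", unconstrained.score(X_toy, y_toy))
print("constrained training accuracy:  ", constrained.score(X_toy, y_toy))
```

The constrained tree cannot grow deep enough to memorize every training row, which is the mechanism by which these parameters reduce overfitting.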



Hint: use grid search!



In [70]:
# Grid search
# Import GridSearchCV and make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

# Define the classifier model as a decision tree
clf = DecisionTreeClassifier()

# Define the parameters to search over
# HINT: parameters = {'parameter_1': [value1, value2], 'parameter_2': [value1, value2]}
parameters = {'max_depth': [3, 6], 'min_samples_leaf': [20, 25], 'min_samples_split': [20, 25]}

# Define the scoring method using make_scorer()
scorer = make_scorer(accuracy_score)

# Define the GridSearchCV object with cv=3 (3-fold cross-validation)
grid_obj = GridSearchCV(clf, parameters, scoring=scorer, n_jobs=-1, cv=3)
# Fit the grid search object on the training data
grid_fit = grid_obj.fit(X_train, y_train)
# Get the best estimator
best_clf = grid_fit.best_estimator_
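
After fitting, the grid search object also records which parameter combination won and how it scored during cross-validation. The sketch below shows how to read those attributes. It runs on synthetic data from `make_classification` (an assumption for illustration) with the same grid as above, and it passes `scoring="accuracy"`, which selects the same metric as `make_scorer(accuracy_score)`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

parameters = {
    "max_depth": [3, 6],
    "min_samples_leaf": [20, 25],
    "min_samples_split": [20, 25],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    parameters,
    scoring="accuracy",
    cv=3,
)
search.fit(X_toy, y_toy)

# The winning combination and its mean accuracy across the 3 folds
print("best parameters:", search.best_params_)
print("best mean cross-validated accuracy:", search.best_score_)
# 2 x 2 x 2 = 8 combinations were evaluated
print("combinations tried:", len(search.cv_results_["params"]))
```

In the lab, the equivalent attributes are `grid_fit.best_params_` and `grid_fit.best_score_`.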


In [71]:
# Make predictions using the new model.
y_train_pred = best_clf.predict(X_train)
y_test_pred = best_clf.predict(X_test)

# Calculating accuracies
train_accuracy = accuracy_score(y_train,y_train_pred)
test_accuracy = accuracy_score(y_test,y_test_pred)

print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)

The training accuracy is 0.8158682634730539
The test accuracy is 0.8116591928251121
