## Decision Tree Exercises

In [1]:
import pandas as pd
import numpy as np

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

from acquire import get_titanic_data

##### 1. Using the Titanic data in your classification-exercises repository, create a notebook, model.ipynb, where you will do the following:

- What is your baseline prediction? What is your baseline accuracy? Remember: your baseline prediction for a classification problem is predicting the most prevalent class in the training dataset (the mode). When you make those predictions, what is your accuracy? This is your baseline accuracy.

- Fit the decision tree classifier to your training sample and transform (i.e. make predictions on the training sample)

- Evaluate your in-sample results using the model score, confusion matrix, and classification report.

- Compute: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

- Run through steps 2-4 using a different max_depth value.

- Which model performs better on your in-sample data?

- Which model performs best on your out-of-sample data, the validate set?
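
The last three steps can be sketched as follows. This is a minimal illustration, not the exercise solution: it assumes synthetic data from `make_classification` as a stand-in for the Titanic features, and the helper name `score_depth` and the depths 2 and 10 are arbitrary choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features (assumption, for illustration only).
X, y = make_classification(n_samples=600, n_features=6, n_informative=3, random_state=123)

# Hold out a validate set so in-sample and out-of-sample accuracy can be compared.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=123
)


def score_depth(max_depth):
    """Fit a decision tree with the given max_depth and return both accuracies."""
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=123)
    tree.fit(X_train, y_train)
    return {
        "train": accuracy_score(y_train, tree.predict(X_train)),
        "validate": accuracy_score(y_val, tree.predict(X_val)),
    }


# Compare a shallow tree with a deeper one.
depth_scores = {depth: score_depth(depth) for depth in (2, 10)}
for depth, scores in depth_scores.items():
    print(f"max_depth={depth}: train={scores['train']:.3f}, validate={scores['validate']:.3f}")
```

A deeper tree will usually score higher on the training sample. The validate accuracy is the number to compare when choosing between the two, because a large gap between train and validate suggests overfitting.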



---

## Random Forest Exercises

##### 1. Continue working in your model file with titanic data to do the following:

- Fit the Random Forest classifier to your training sample and transform (i.e. make predictions on the training sample), setting random_state accordingly, min_samples_leaf = 1, and max_depth = 10.

- Evaluate your results using the model score, confusion matrix, and classification report.

- Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

- Repeat the steps above, increasing your min_samples_leaf and decreasing your max_depth.

- What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?
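
One way to organize the parameter sweep described above is a small loop that records train and validate accuracy for each setting. This is a sketch under assumptions: the data is synthetic (from `make_classification`), and the `(min_samples_leaf, max_depth)` pairs and the `results` list are illustrative choices rather than required values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic features (assumption, for illustration only).
X, y = make_classification(n_samples=600, n_features=6, n_informative=3, random_state=123)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=123
)

# Each step increases min_samples_leaf and decreases max_depth.
settings = [(1, 10), (3, 7), (5, 4)]

results = []
for min_samples_leaf, max_depth in settings:
    rf = RandomForestClassifier(
        min_samples_leaf=min_samples_leaf,
        max_depth=max_depth,
        random_state=123,
    )
    rf.fit(X_train, y_train)
    results.append({
        "min_samples_leaf": min_samples_leaf,
        "max_depth": max_depth,
        "train_accuracy": rf.score(X_train, y_train),    # in-sample
        "validate_accuracy": rf.score(X_val, y_val),     # out-of-sample
    })

for row in results:
    print(row)
```

Collecting the rows in a list makes it easy to pass them to `pd.DataFrame(results)` for a side-by-side comparison.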

In [2]:
#get the titanic dataset
df = get_titanic_data()
df.head()

Reading from csv file...


Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,,Southampton,1


In [3]:
#basic clean for the titanic dataset
df = df.drop(columns=['passenger_id', 'class', 'embarked', 'deck', 'age'])
dum_df = pd.get_dummies(df[['sex', 'embark_town']], drop_first=True)
df = pd.concat([df, dum_df], axis = 1)
df.head()

Unnamed: 0,survived,pclass,sex,sibsp,parch,fare,embark_town,alone,sex_male,embark_town_Queenstown,embark_town_Southampton
0,0,3,male,1,0,7.25,Southampton,0,1,0,1
1,1,1,female,1,0,71.2833,Cherbourg,0,0,0,0
2,1,3,female,0,0,7.925,Southampton,1,0,0,1
3,1,1,female,1,0,53.1,Southampton,0,0,0,1
4,0,3,male,0,0,8.05,Southampton,1,1,0,1


In [4]:
df.shape

(891, 11)

In [5]:
#split the data
train, test = train_test_split(df, train_size = 0.8, stratify = df.survived, random_state=123)
train, validate = train_test_split(train, train_size = 0.7, stratify = train.survived, random_state=123)
#verify the data split
train.shape, validate.shape, test.shape

((498, 11), (214, 11), (179, 11))
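
The shapes above follow from applying the two `train_size` fractions in sequence. The arithmetic below is a quick check, assuming the training portion of each split is rounded down and the remainder goes to the other set, which matches the shapes printed above.

```python
import math

n_rows = 891  # total rows, from df.shape

# First split: 80% kept for train + validate, the rest is test.
n_train_validate = math.floor(0.8 * n_rows)
n_test = n_rows - n_train_validate

# Second split: 70% of that portion is train, the rest is validate.
n_train = math.floor(0.7 * n_train_validate)
n_validate = n_train_validate - n_train

print(n_train, n_validate, n_test)  # 498 214 179

# Share of the full dataset in each set (roughly 56% / 24% / 20%).
for name, n in [("train", n_train), ("validate", n_validate), ("test", n_test)]:
    print(f"{name}: {n / n_rows:.1%}")
```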

In [6]:
#split the train set for x and y
x_train = train.drop(columns=['survived'])
y_train = train[['survived']].copy()
x_train.head()

Unnamed: 0,pclass,sex,sibsp,parch,fare,embark_town,alone,sex_male,embark_town_Queenstown,embark_town_Southampton
583,1,male,0,0,40.125,Cherbourg,1,1,0,0
165,3,male,0,2,20.525,Southampton,0,1,0,1
50,3,male,4,1,39.6875,Southampton,0,1,0,1
259,2,female,0,1,26.0,Southampton,0,0,0,1
306,1,female,0,0,110.8833,Cherbourg,1,0,0,0


In [7]:
y_train.head()

Unnamed: 0,survived
583,0
165,1
50,0
259,1
306,1


In [8]:
#check which is the most common y value for baseline
y_train.value_counts()

survived
0           307
1           191
dtype: int64

In [9]:
#predict the most common class (0 = did not survive) for every passenger
y_train['baseline'] = 0
y_train.head()

Unnamed: 0,survived,baseline
583,0,0
165,1,0
50,0,0
259,1,0
306,1,0


In [10]:
#evaluate the baseline value
baseline_eval = accuracy_score(y_train.survived, y_train.baseline)
baseline_eval

0.6164658634538153
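
The baseline class does not have to be hard-coded. As a small sketch, the mode can be computed directly. The series below is rebuilt from the class counts shown earlier (307 zeros, 191 ones) so the example is self-contained. The names `baseline_class` and `baseline_accuracy` are illustrative.

```python
import pandas as pd

# Rebuild the training target from the value_counts output: 307 zeros, 191 ones.
survived = pd.Series([0] * 307 + [1] * 191, name="survived")

# The baseline prediction is the most frequent class (the mode).
baseline_class = survived.mode()[0]

# Baseline accuracy is the share of rows that already belong to that class.
baseline_accuracy = (survived == baseline_class).mean()

print(baseline_class, round(baseline_accuracy, 4))
```

In the notebook itself, the same idea would apply to `y_train.survived`.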

In [11]:
#select features to model
selected_features = ['pclass', 'sex_male']

In [12]:
#create a classifier object
clf = RandomForestClassifier(max_depth=5, random_state=123)

In [13]:
#fit the classifier object with the train selected features and the train y values
clf.fit(x_train[selected_features], y_train.survived)

RandomForestClassifier(max_depth=5, random_state=123)

In [14]:
#use the trained model to predict survival based on the train dataset
y_train['y_pred'] = clf.predict(x_train[selected_features])
y_train.head()

Unnamed: 0,survived,baseline,y_pred
583,0,0,0
165,1,0,0
50,0,0,0
259,1,0,1
306,1,0,1


In [15]:
#evaluate the accuracy of the model's predictions
accuracy_score(y_train.survived, y_train.y_pred)

0.7991967871485943

In [17]:
#evaluate the model with a classification report
pd.DataFrame(classification_report(y_train.survived, y_train.y_pred, output_dict=True)).transpose()

Unnamed: 0,precision,recall,f1-score,support
0,0.820433,0.863192,0.84127,307.0
1,0.76,0.696335,0.726776,191.0
accuracy,0.799197,0.799197,0.799197,0.799197
macro avg,0.790217,0.779764,0.784023,498.0
weighted avg,0.797255,0.799197,0.797358,498.0


In [18]:
#view the crosstab of the train survived vs the train predicted to evaluate the tp, tn, fp, fn
pd.crosstab(y_train.survived, y_train.y_pred)

survived,y_pred=0,y_pred=1
0,265,42
1,58,133


### Model 1 (max depth = 5) evaluation calculations
**Positive** = survived

- True positives: 133
- True negatives: 265
- False positives: 42
- False negatives: 58
- Baseline Accuracy: 61.65%
- Model Accuracy: 79.92%
- Model precision: 76%
- Model recall: 69.63%
- Model f1: 72.68%
- Model support: 191
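
The exercise also asks for the true/false positive and negative rates. A minimal sketch of how they can be derived from the four crosstab counts, with survived as the positive class (variable names are illustrative):

```python
# Counts read from the crosstab (rows = actual survived, columns = predicted).
tp = 133  # actual 1, predicted 1
tn = 265  # actual 0, predicted 0
fp = 42   # actual 0, predicted 1
fn = 58   # actual 1, predicted 0

accuracy = (tp + tn) / (tp + tn + fp + fn)
tpr = tp / (tp + fn)        # true positive rate (recall)
fpr = fp / (fp + tn)        # false positive rate
tnr = tn / (tn + fp)        # true negative rate
fnr = fn / (fn + tp)        # false negative rate
precision = tp / (tp + fp)
f1 = 2 * precision * tpr / (precision + tpr)
support = tp + fn           # number of actual positives

metrics = {
    "Accuracy": accuracy,
    "True positive rate": tpr,
    "False positive rate": fpr,
    "True negative rate": tnr,
    "False negative rate": fnr,
    "Precision": precision,
    "F1": f1,
}
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
print(f"Support: {support}")
```

The rates come in complementary pairs: TPR + FNR = 1 and TNR + FPR = 1, which is a quick sanity check on the counts.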