## Decision Tree Exercises

1. What is your baseline prediction? What is your baseline accuracy? Remember: your baseline prediction for a classification problem is the most prevalent class in the training dataset (the mode). When you make those predictions, what is your accuracy? This is your baseline accuracy.
<p>

2. Fit the decision tree classifier to your training sample and transform (i.e. make predictions on the training sample)
<p>

3. Evaluate your in-sample results using the model score, confusion matrix, and classification report.
<p>

4. Compute: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.
<p>

5. Run through steps 2-4 using a different max_depth value.
<p>

6. Which model performs better on your in-sample data?
<p>

7. Which model performs best on your out-of-sample data, the validate set?

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

import acquire
import prepare

In [2]:
df = acquire.get_titanic_data()
df.head()

Reading from csv file...


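|   | passenger_id | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | deck | embark_town | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third |  | Southampton | 0 |
| 1 | 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | C | Cherbourg | 0 |
| 2 | 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third |  | Southampton | 1 |
| 3 | 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | C | Southampton | 0 |
| 4 | 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third |  | Southampton | 1 |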


In [3]:
df.isna().sum()

passenger_id      0
survived          0
pclass            0
sex               0
age             177
sibsp             0
parch             0
fare              0
embarked          2
class             0
deck            688
embark_town       2
alone             0
dtype: int64

In [4]:
df = prepare.prep_titanic(df)
df.head()

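|   | survived | pclass | sibsp | parch | fare | alone | sex_male | embark_town_Queenstown | embark_town_Southampton |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 0 | 7.25 | 0 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 0 | 71.2833 | 0 | 0 | 0 | 0 |
| 2 | 1 | 3 | 0 | 0 | 7.925 | 1 | 0 | 0 | 1 |
| 3 | 1 | 1 | 1 | 0 | 53.1 | 0 | 0 | 0 | 1 |
| 4 | 0 | 3 | 0 | 0 | 8.05 | 1 | 1 | 0 | 1 |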


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   survived                 891 non-null    int64  
 1   pclass                   891 non-null    int64  
 2   sibsp                    891 non-null    int64  
 3   parch                    891 non-null    int64  
 4   fare                     891 non-null    float64
 5   alone                    891 non-null    int64  
 6   sex_male                 891 non-null    uint8  
 7   embark_town_Queenstown   891 non-null    uint8  
 8   embark_town_Southampton  891 non-null    uint8  
dtypes: float64(1), int64(5), uint8(3)
memory usage: 44.5 KB


In [6]:
train, test = train_test_split(df, test_size=.2, random_state=177, stratify=df.survived)
train, validate = train_test_split(train, test_size=.3, random_state=177, stratify=train.survived)
train, test, validate

(     survived  pclass  sibsp  parch      fare  alone  sex_male  \
 882         0       3      0      0   10.5167      1         0   
 174         0       1      0      0   30.6958      1         1   
 790         0       3      0      0    7.7500      1         1   
 380         1       1      0      0  227.5250      1         0   
 841         0       2      0      0   10.5000      1         1   
 ..        ...     ...    ...    ...       ...    ...       ...   
 441         0       3      0      0    9.5000      1         1   
 438         0       1      1      4  263.0000      0         1   
 42          0       3      0      0    7.8958      1         1   
 840         0       3      0      0    7.9250      1         1   
 245         0       1      2      0   90.0000      0         1   
 
      embark_town_Queenstown  embark_town_Southampton  
 882                       0                        1  
 174                       0                        0  
 790                      
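Printing the three frames in full is hard to read; shapes and class balance are usually enough to confirm the split. The sketch below wraps the same two-step stratified split in a helper. The function name `split_data` and the demo frame are assumptions made for illustration, not part of the `acquire` or `prepare` modules.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data(df, target, seed=177):
    """Split df into train, validate, and test sets, stratified on target.

    Hypothetical helper: mirrors the two-step split above (20% test,
    then 30% of the remainder as validate).
    """
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target]
    )
    train, validate = train_test_split(
        train_validate,
        test_size=0.3,
        random_state=seed,
        stratify=train_validate[target],
    )
    return train, validate, test


# Made-up data, for illustration only: 100 rows, 40% positive class.
demo = pd.DataFrame({"feature": range(100), "survived": [0] * 60 + [1] * 40})
train, validate, test = split_data(demo, "survived")

# Shapes and class balance are easier to check than the full frames.
print(train.shape, validate.shape, test.shape)
print(train.survived.mean(), validate.survived.mean(), test.survived.mean())
```

Because both splits are stratified, the share of survivors should stay roughly the same in all three sets.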

In [7]:
# 1. What is your baseline prediction? What is your baseline accuracy? 
# Remember: your baseline prediction for a classification problem is the most prevalent class in the training dataset (the mode). 
# When you make those predictions, what is your accuracy? This is your baseline accuracy.
train.survived.value_counts()

0    307
1    191
Name: survived, dtype: int64

In [8]:
x_train = train.drop(columns=["survived"])
y_train = train.survived

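# What is your baseline prediction? 
# mode() returns a Series, so take its first value (the most frequent class).
baseline = y_train.mode()[0]

# What is your baseline accuracy? 
# Compare against the computed baseline instead of a hardcoded 0.
match_baseline_prediction = (y_train == baseline)
baseline_accuracy = match_baseline_prediction.mean()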

print(f'Our baseline accuracy is: {baseline_accuracy:.3} or {baseline_accuracy:.3%}')

Our baseline accuracy is: 0.616 or 61.647%
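As a cross-check on the manual calculation, scikit-learn's `DummyClassifier` with `strategy="most_frequent"` makes the same mode prediction. This is a minimal sketch on made-up labels that only reuse the class counts shown above (307 and 191); the names `X_demo`, `y_demo`, and `baseline_model` are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels with the same class counts as the training set above.
y_demo = np.array([0] * 307 + [1] * 191)
# DummyClassifier ignores the features, so a placeholder column is enough.
X_demo = np.zeros((len(y_demo), 1))

# Always predicts the most frequent class seen during fit.
baseline_model = DummyClassifier(strategy="most_frequent")
baseline_model.fit(X_demo, y_demo)

baseline_accuracy = baseline_model.score(X_demo, y_demo)
print(f"Baseline accuracy: {baseline_accuracy:.3%}")
```

Any real model should beat this number before it is worth tuning further.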


In [9]:
# 2. Fit the decision tree classifier to your training sample and transform 
# (i.e. make predictions on the training sample)

clf1 = DecisionTreeClassifier(max_depth=1, random_state=177)
clf1 = clf1.fit(x_train, y_train)
y_predictions = clf1.predict(x_train)
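It can help to look at what a fitted tree actually learned. The sketch below uses `export_text` from `sklearn.tree` to print the split rules of a depth-1 tree. The toy data is made up so that survival depends only on `sex_male`; it is not the Titanic training set, and `stump`, `X_demo`, and `y_demo` are illustrative names.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up toy data: survival is fully determined by sex_male here.
X_demo = pd.DataFrame({
    "sex_male": [0, 1, 0, 1, 0, 1, 0, 1],
    "pclass":   [1, 1, 2, 2, 3, 3, 1, 2],
})
y_demo = pd.Series([1, 0, 1, 0, 1, 0, 1, 0], name="survived")

stump = DecisionTreeClassifier(max_depth=1, random_state=177)
stump.fit(X_demo, y_demo)

# export_text renders the fitted splits as indented if/else rules.
rules = export_text(stump, feature_names=list(X_demo.columns))
print(rules)
```

The same call should work on `clf1` with `feature_names=list(x_train.columns)`; `plot_tree`, already imported above, gives a graphical version.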

In [10]:
# 3. Evaluate your in-sample results using the model score, confusion matrix, and classification report.

mod_score = clf1.score(x_train, y_train)
con_matrix = pd.DataFrame(confusion_matrix(y_train, y_predictions))
class_report = classification_report(y_train, y_predictions)
print(f'Accuracy using Model Score:\n {mod_score:.2%}')
print(f'Using Confusion Matrix:\n {con_matrix}')
print(f'and Class report:\n {class_report}')

Accuracy using Model Score:
 80.12%
Using Confusion Matrix:
      0    1
0  266   41
1   58  133
and Class report:
               precision    recall  f1-score   support

           0       0.82      0.87      0.84       307
           1       0.76      0.70      0.73       191

    accuracy                           0.80       498
   macro avg       0.79      0.78      0.79       498
weighted avg       0.80      0.80      0.80       498



In [11]:
# 4. Compute: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

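# Here class 0 (did not survive) is treated as the positive class.
# Rows of the confusion matrix are actual labels; columns are predictions.
tp = con_matrix.loc[0,0]
fp = con_matrix.loc[1,0]
fn = con_matrix.loc[0,1]
tn = con_matrix.loc[1,1]
# Named total rather than all to avoid shadowing the built-in all().
total = tp+fp+fn+tn
print(f'True Positive(tp): {tp} \nFalse Positive(fp): {fp} \nFalse Negative(fn): {fn} \nTrue Negative(tn): {tn}')

True Positive(tp): 266 
False Positive(fp): 58 
False Negative(fn): 41 
True Negative(tn): 133


In [12]:

accuracy = (tp + tn)/total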
print(f"Accuracy: {accuracy:.4}")

true_positive_rate = tp/(tp+fn)
print(f"True Positive Rate: {true_positive_rate:.4}")

false_positive_rate = fp/(fp+tn)
print(f"False Positive Rate: {false_positive_rate:.4}")

true_negative_rate = tn/(tn+fp)
print(f"True Negative Rate: {true_negative_rate:.4}")

false_negative_rate = fn/(fn+tp)
print(f"False Negative Rate: {false_negative_rate:.4}")

precision = tp/(tp+fp)
print(f"Precision: {precision:.4}")

recall = tp/(tp+fn)
print(f"Recall: {recall:.4}")

f1_score = 2*(precision*recall)/(precision+recall)
print(f"F1 Score: {f1_score:.4}")

support_pos = tp+fn
print(f"Support (0): {support_pos}")

support_neg = fp+tn
print(f"Support (1): {support_neg}")

Accuracy: 0.8012
True Positive Rate: 0.8664
False Positive Rate: 0.3037
True Negative Rate: 0.6963
False Negative Rate: 0.1336
Precision: 0.821
Recall: 0.8664
F1 Score: 0.8431
Support (0): 307
Support (1): 191
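The same formulas are needed again for every model, so they can be collected into one function. This is a sketch, not a library API: `binary_metrics` is a hypothetical helper, and it assumes every denominator is nonzero. The counts passed in are the ones printed above, with class 0 as the positive class.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return common classification metrics from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # same quantity as the true positive rate
    return {
        "accuracy": (tp + tn) / total,
        "true_positive_rate": recall,
        "false_positive_rate": fp / (fp + tn),
        "true_negative_rate": tn / (tn + fp),
        "false_negative_rate": fn / (fn + tp),
        "precision": precision,
        "recall": recall,
        "f1_score": 2 * precision * recall / (precision + recall),
        "support_positive": tp + fn,
        "support_negative": fp + tn,
    }


# Counts from the depth-1 tree above, treating class 0 as the positive class.
metrics_demo = binary_metrics(tp=266, fp=58, fn=41, tn=133)
for name, value in metrics_demo.items():
    # The g format type handles both the float rates and the integer supports.
    print(f"{name}: {value:.4g}")
```

Which class counts as "positive" changes every rate except accuracy, so it is worth stating the choice explicitly each time.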


In [13]:
# 5. Run through steps 2-4 using a different max_depth value.
# Loop over max_depth values from 2 to 15 and print a classification report for each.
for i in range(2, 16):
    # Make the thing
    tree = DecisionTreeClassifier(max_depth=i, random_state=177)

    # Fit the thing
    tree = tree.fit(x_train, y_train)

    # Use the thing
    y_predictions = tree.predict(x_train)

    report = classification_report(y_train, y_predictions, output_dict=True)
    print(f"Tree with max depth of {i}")
    print(pd.DataFrame(report))
    print('______________________________________________________________________')

Tree with max depth of 2
                    0           1  accuracy   macro avg  weighted avg
precision    0.820988    0.764368  0.801205    0.792678      0.799272
recall       0.866450    0.696335  0.801205    0.781392      0.801205
f1-score     0.843106    0.728767  0.801205    0.785937      0.799253
support    307.000000  191.000000  0.801205  498.000000    498.000000
______________________________________________________________________
Tree with max depth of 3
                    0           1  accuracy   macro avg  weighted avg
precision    0.819242    0.832258  0.823293    0.825750      0.824234
recall       0.915309    0.675393  0.823293    0.795351      0.823293
f1-score     0.864615    0.745665  0.823293    0.805140      0.818994
support    307.000000  191.000000  0.823293  498.000000    498.000000
______________________________________________________________________
Tree with max depth of 4
                    0           1  accuracy   macro avg  weighted avg
precision    

In [14]:
# 6. Which model performs better on your in-sample data?
# The deepest trees perform best in-sample: max_depth of 13 and up all reach about 93% train accuracy.
tree = DecisionTreeClassifier(max_depth=13, random_state=177)
tree = tree.fit(x_train, y_train)
y_predictions = tree.predict(x_train)
report = pd.DataFrame(classification_report(y_train, y_predictions, output_dict=True))
print(report)

                    0           1  accuracy   macro avg  weighted avg
precision    0.922840    0.954023  0.933735    0.938431      0.934799
recall       0.973941    0.869110  0.933735    0.921526      0.933735
f1-score     0.947702    0.909589  0.933735    0.928646      0.933084
support    307.000000  191.000000  0.933735  498.000000    498.000000


In [15]:
tree = DecisionTreeClassifier(max_depth=14, random_state=177)
tree = tree.fit(x_train, y_train)
y_predictions = tree.predict(x_train)
report = pd.DataFrame(classification_report(y_train, y_predictions, output_dict=True))
print(report)

                    0           1  accuracy   macro avg  weighted avg
precision    0.922840    0.954023  0.933735    0.938431      0.934799
recall       0.973941    0.869110  0.933735    0.921526      0.933735
f1-score     0.947702    0.909589  0.933735    0.928646      0.933084
support    307.000000  191.000000  0.933735  498.000000    498.000000


In [16]:
# 7. Which model performs best on your out-of-sample data, the validate set?
x_validate = validate.drop(columns=["survived"])
y_validate = validate.survived

metrics = []

for i in range(2, 16):
    tree = DecisionTreeClassifier(max_depth=i, random_state=177)
    tree = tree.fit(x_train, y_train)
    in_sample_accuracy = tree.score(x_train, y_train)
    out_of_sample_accuracy = tree.score(x_validate, y_validate)
    output = {
        "max_depth": i,
        "train_accuracy": in_sample_accuracy,
        "validate_accuracy": out_of_sample_accuracy
    }
    metrics.append(output)
    
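# Use a separate name so the Titanic DataFrame in df is not overwritten.
metrics_df = pd.DataFrame(metrics)
metrics_df["difference"] = metrics_df.train_accuracy - metrics_df.validate_accuracy
metrics_df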

|   | max_depth | train_accuracy | validate_accuracy | difference |
|---|---|---|---|---|
| 0 | 2 | 0.801205 | 0.738318 | 0.062887 |
| 1 | 3 | 0.823293 | 0.780374 | 0.042919 |
| 2 | 4 | 0.84739 | 0.780374 | 0.067016 |
| 3 | 5 | 0.871486 | 0.775701 | 0.095785 |
| 4 | 6 | 0.891566 | 0.761682 | 0.129884 |
| 5 | 7 | 0.89759 | 0.742991 | 0.1546 |
| 6 | 8 | 0.913655 | 0.752336 | 0.161318 |
| 7 | 9 | 0.925703 | 0.752336 | 0.173366 |
| 8 | 10 | 0.929719 | 0.757009 | 0.17271 |
| 9 | 11 | 0.929719 | 0.761682 | 0.168037 |
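One way to answer question 7 is to pick the depth with the highest validate accuracy and break ties by the smaller train/validate gap. The sketch below applies that rule to the rows shown above (depths 2 through 11 only). The tie-breaking rule is an assumption of this sketch, not something the exercise prescribes.

```python
# Rows copied from the table above (depths 2 through 11 only).
rows = [
    {"max_depth": 2, "train_accuracy": 0.801205, "validate_accuracy": 0.738318},
    {"max_depth": 3, "train_accuracy": 0.823293, "validate_accuracy": 0.780374},
    {"max_depth": 4, "train_accuracy": 0.847390, "validate_accuracy": 0.780374},
    {"max_depth": 5, "train_accuracy": 0.871486, "validate_accuracy": 0.775701},
    {"max_depth": 6, "train_accuracy": 0.891566, "validate_accuracy": 0.761682},
    {"max_depth": 7, "train_accuracy": 0.897590, "validate_accuracy": 0.742991},
    {"max_depth": 8, "train_accuracy": 0.913655, "validate_accuracy": 0.752336},
    {"max_depth": 9, "train_accuracy": 0.925703, "validate_accuracy": 0.752336},
    {"max_depth": 10, "train_accuracy": 0.929719, "validate_accuracy": 0.757009},
    {"max_depth": 11, "train_accuracy": 0.929719, "validate_accuracy": 0.761682},
]

for row in rows:
    row["difference"] = row["train_accuracy"] - row["validate_accuracy"]

# Highest validate accuracy first; ties go to the smaller train/validate gap.
best = min(rows, key=lambda r: (-r["validate_accuracy"], r["difference"]))
print(best["max_depth"], best["validate_accuracy"])
```

Under this rule the shallower of the two tied models is chosen, which matches the usual preference for the simpler tree when validate accuracy is equal.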


## Random Forests Exercises

1. Fit the Random Forest classifier to your training sample and transform (i.e. make predictions on the training sample) setting the random_state accordingly and setting min_samples_leaf = 1 and max_depth = 10.
<p>

2. Evaluate your results using the model score, confusion matrix, and classification report.
<p>

3. Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.
<p>

4. Run through steps increasing your min_samples_leaf and decreasing your max_depth.
<p>

5. What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?

After making a few models, which one has the best performance (or closest metrics) on both train and validate?

In [17]:
df = acquire.get_titanic_data()
df.head()

Reading from csv file...


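|   | passenger_id | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | deck | embark_town | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third |  | Southampton | 0 |
| 1 | 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | C | Cherbourg | 0 |
| 2 | 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third |  | Southampton | 1 |
| 3 | 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | C | Southampton | 0 |
| 4 | 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third |  | Southampton | 1 |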


In [18]:
df = prepare.prep_titanic(df)
df.head()

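|   | survived | pclass | sibsp | parch | fare | alone | sex_male | embark_town_Queenstown | embark_town_Southampton |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 0 | 7.25 | 0 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 0 | 71.2833 | 0 | 0 | 0 | 0 |
| 2 | 1 | 3 | 0 | 0 | 7.925 | 1 | 0 | 0 | 1 |
| 3 | 1 | 1 | 1 | 0 | 53.1 | 0 | 0 | 0 | 1 |
| 4 | 0 | 3 | 0 | 0 | 8.05 | 1 | 1 | 0 | 1 |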


In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   survived                 891 non-null    int64  
 1   pclass                   891 non-null    int64  
 2   sibsp                    891 non-null    int64  
 3   parch                    891 non-null    int64  
 4   fare                     891 non-null    float64
 5   alone                    891 non-null    int64  
 6   sex_male                 891 non-null    uint8  
 7   embark_town_Queenstown   891 non-null    uint8  
 8   embark_town_Southampton  891 non-null    uint8  
dtypes: float64(1), int64(5), uint8(3)
memory usage: 44.5 KB


In [20]:
df.describe()

|   | survived | pclass | sibsp | parch | fare | alone | sex_male | embark_town_Queenstown | embark_town_Southampton |
|---|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 0.383838 | 2.308642 | 0.523008 | 0.381594 | 32.204208 | 0.602694 | 0.647587 | 0.08642 | 0.725028 |
| std | 0.486592 | 0.836071 | 1.102743 | 0.806057 | 49.693429 | 0.489615 | 0.47799 | 0.281141 | 0.446751 |
| min | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 | 0.0 | 0.0 | 7.9104 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 3.0 | 0.0 | 0.0 | 14.4542 | 1.0 | 1.0 | 0.0 | 1.0 |
| 75% | 1.0 | 3.0 | 1.0 | 0.0 | 31.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| max | 1.0 | 3.0 | 8.0 | 6.0 | 512.3292 | 1.0 | 1.0 | 1.0 | 1.0 |


In [21]:
df.isna().sum()

survived                   0
pclass                     0
sibsp                      0
parch                      0
fare                       0
alone                      0
sex_male                   0
embark_town_Queenstown     0
embark_town_Southampton    0
dtype: int64

In [22]:
train, test = train_test_split(df, test_size=.2, random_state=177, stratify=df.survived)
train, validate = train_test_split(train, test_size=.3, random_state=177, stratify=train.survived)
train.shape, test.shape, validate.shape

((498, 9), (179, 9), (214, 9))

In [23]:
x_train = train.drop(columns=["survived"])
y_train = train.survived

x_test = test.drop(columns=["survived"])
y_test = test.survived

x_validate = validate.drop(columns=["survived"])
y_validate = validate.survived

In [24]:
# 1. Fit the Random Forest classifier to your training sample and transform
# (i.e. make predictions on the training sample) setting the random_state accordingly and setting min_samples_leaf = 1 and max_depth = 10.

# create the thing
forest1 = RandomForestClassifier(min_samples_leaf=1, max_depth=10, random_state=1349)
# fit the thing
forest1.fit(x_train, y_train)
# use the thing
y_predictions = forest1.predict(x_train)



In [25]:
# 2. Evaluate your results using the model score, confusion matrix, and classification report.

mod_score = forest1.score(x_train, y_train)
con_matrix = pd.DataFrame(confusion_matrix(y_train, y_predictions))
class_report = classification_report(y_train, y_predictions)
print(f'Accuracy using Model Score: {mod_score:.2%}')
print(f'Using Confusion Matrix:\n{con_matrix}')
print(f'Class report:\n{class_report}')

Accuracy using Model Score: 92.97%
Using Confusion Matrix:
     0    1
0  297   10
1   25  166
Class report:
              precision    recall  f1-score   support

           0       0.92      0.97      0.94       307
           1       0.94      0.87      0.90       191

    accuracy                           0.93       498
   macro avg       0.93      0.92      0.92       498
weighted avg       0.93      0.93      0.93       498



In [26]:
# 3. Print and clearly label the following: 
# Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.
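# Here class 1 (survived) is treated as the positive class,
# unlike the decision tree section above, which treated class 0 as positive.
tp = con_matrix.loc[1,1]
fp = con_matrix.loc[0,1]
fn = con_matrix.loc[1,0]
tn = con_matrix.loc[0,0]
# Named total rather than all to avoid shadowing the built-in all().
total = tp+fp+fn+tn
print(f'True Positive(tp): {tp} \nFalse Positive(fp): {fp} \nFalse Negative(fn): {fn} \nTrue Negative(tn): {tn}')

True Positive(tp): 166 
False Positive(fp): 10 
False Negative(fn): 25 
True Negative(tn): 297


In [27]:

accuracy = (tp + tn)/total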
print(f"Accuracy: {accuracy:.4}")

true_positive_rate = tp/(tp+fn)
print(f"True Positive Rate: {true_positive_rate:.4}")

false_positive_rate = fp/(fp+tn)
print(f"False Positive Rate: {false_positive_rate:.4}")

true_negative_rate = tn/(tn+fp)
print(f"True Negative Rate: {true_negative_rate:.4}")

false_negative_rate = fn/(fn+tp)
print(f"False Negative Rate: {false_negative_rate:.4}")

precision = tp/(tp+fp)
print(f"Precision: {precision:.4}")

recall = tp/(tp+fn)
print(f"Recall: {recall:.4}")

f1_score = 2*(precision*recall)/(precision+recall)
print(f"F1 Score: {f1_score:.4}")

# tp and fn are both actual class 1 here, so tp+fn is the support for class 1.
support_pos = tp+fn
print(f"Support (1): {support_pos}")

support_neg = fp+tn
print(f"Support (0): {support_neg}")

Accuracy: 0.9297
True Positive Rate: 0.8691
False Positive Rate: 0.03257
True Negative Rate: 0.9674
False Negative Rate: 0.1309
Precision: 0.9432
Recall: 0.8691
F1 Score: 0.9046
Support (1): 191
Support (0): 307


In [41]:
# 4. Run through steps increasing your min_samples_leaf and decreasing your max_depth.
metrics = []
max_depth = 11

for i in range(1, max_depth):
    # Make the thing
    depth = max_depth -i
    n_samples = i
    forest = RandomForestClassifier(max_depth=depth, min_samples_leaf=n_samples, random_state=1349)

    # Fit the thing
    forest = forest.fit(x_train, y_train)

    # Use the thing
    sample_accuracy = forest.score(x_train, y_train)
    
    out_of_sample_accuracy = forest.score(x_validate, y_validate)

    output = {
        "min_samples_per_leaf": n_samples,
        "max_depth": depth,
        "train_accuracy": sample_accuracy,
        "validate_accuracy": out_of_sample_accuracy
    }
    
    metrics.append(output)
    
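# Use a separate name so the Titanic DataFrame in df is not overwritten.
metrics_df = pd.DataFrame(metrics)
metrics_df["difference"] = metrics_df.train_accuracy - metrics_df.validate_accuracy
metrics_df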

|   | min_samples_per_leaf | max_depth | train_accuracy | validate_accuracy | difference |
|---|---|---|---|---|---|
| 0 | 1 | 10 | 0.929719 | 0.761682 | 0.168037 |
| 1 | 2 | 9 | 0.881526 | 0.794393 | 0.087134 |
| 2 | 3 | 8 | 0.865462 | 0.785047 | 0.080415 |
| 3 | 4 | 7 | 0.859438 | 0.794393 | 0.065045 |
| 4 | 5 | 6 | 0.845382 | 0.794393 | 0.050989 |
| 5 | 6 | 5 | 0.845382 | 0.785047 | 0.060335 |
| 6 | 7 | 4 | 0.821285 | 0.780374 | 0.040911 |
| 7 | 8 | 3 | 0.815261 | 0.761682 | 0.053579 |
| 8 | 9 | 2 | 0.803213 | 0.775701 | 0.027512 |
| 9 | 10 | 1 | 0.76506 | 0.761682 | 0.003378 |
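The loop above moves `max_depth` and `min_samples_leaf` together, so it cannot show which of the two drives the change. A possible variation is to try every combination with `itertools.product`. The sketch below runs on synthetic data from `make_classification` as a stand-in (an assumption, not the Titanic set), and the grid values and `n_estimators=25` are arbitrary choices to keep it quick.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for this sketch, not the Titanic set).
X, y = make_classification(n_samples=300, n_features=8, random_state=1349)
x_train, x_validate, y_train, y_validate = train_test_split(
    X, y, test_size=0.3, random_state=1349, stratify=y
)

results = []
# Vary the two hyperparameters independently instead of in lockstep.
for depth, leaf in product([2, 5, 10], [1, 5, 10]):
    forest = RandomForestClassifier(
        n_estimators=25,  # kept small so the sketch runs quickly
        max_depth=depth,
        min_samples_leaf=leaf,
        random_state=1349,
    )
    forest.fit(x_train, y_train)
    train_accuracy = forest.score(x_train, y_train)
    validate_accuracy = forest.score(x_validate, y_validate)
    results.append({
        "max_depth": depth,
        "min_samples_leaf": leaf,
        "train_accuracy": train_accuracy,
        "validate_accuracy": validate_accuracy,
        "difference": train_accuracy - validate_accuracy,
    })

# Pick the combination with the highest validate accuracy.
best = max(results, key=lambda r: r["validate_accuracy"])
print(best)
```

On the real data, swapping in the Titanic `x_train`, `y_train`, `x_validate`, and `y_validate` from the cells above should give a comparable grid of results.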


In [47]:
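# 5. What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?
# The first model (min_samples_leaf=1, max_depth=10) performs best in-sample, at about 93% train accuracy.
# Deep trees with small leaves can fit the training data very closely, but this model's validate accuracy
# is only about 76%, so much of that gain is overfitting.
# As min_samples_leaf rises and max_depth falls, train accuracy drops toward validate accuracy and the gap shrinks.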



In [28]:
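# After making a few models, which one has the best performance (or closest metrics) on both train and validate?
# Three models tie for the highest validate accuracy (about 79.4%). Of those, min_samples_leaf=5 with
# max_depth=6 has the smallest train/validate gap (about 0.051).
# The closest metrics overall belong to min_samples_leaf=10 with max_depth=1 (gap of about 0.003),
# but at a lower validate accuracy of about 76.2%.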