<div class="alert alert-block alert-warning">

# Decision Tree Exercises

<div class="alert alert-block alert-success">

Using the Titanic data, in your classification-exercises repository, create a notebook, decision_tree.ipynb, where you will do the following:

In [2]:
#imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, confusion_matrix, ConfusionMatrixDisplay

from prepare import prep_titanic, prep_telco, split_data
from acquire import new_titanic_data, new_telco_data

In [None]:
# Acquire data
titanic = prep_titanic(new_titanic_data())

In [None]:
# Split the data into train, validate, and test sets
train, validate, test = split_data(titanic, 'survived')

In [None]:
train.info()

<div class="alert alert-block alert-success">

1. What is your baseline prediction? What is your baseline accuracy? Remember: your baseline prediction for a classification problem is predicting the most prevalent class in the training dataset (the mode). When you make those predictions, what is your accuracy? This is your baseline accuracy.

In [None]:
# Counts for target value: survived
train.survived.value_counts()

In [None]:
# we know what our X and y are, let's be explicit about defining them
X_train = train.drop(columns='survived')
y_train = train.survived

X_val = validate.drop(columns='survived')
y_val = validate.survived

X_test = test.drop(columns='survived')
y_test = test.survived

In [None]:
# The mode of the target is the baseline prediction
baseline = y_train.mode()[0]

# Produce a boolean Series with True where the baseline prediction matches the actual value
matches_baseline_prediction = (y_train == baseline)

print(f"Baseline accuracy: {matches_baseline_prediction.mean():.2%}")
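
As a cross-check on the baseline above, scikit-learn's `DummyClassifier` with `strategy="most_frequent"` predicts the mode for every row. The sketch below uses a small made-up target (`y_toy`) and a placeholder feature column, both hypothetical and not taken from the Titanic data, to show that the manual calculation and the dummy model agree.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical stand-in for y_train: 6 non-survivors, 4 survivors
y_toy = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 1], name="survived")
# DummyClassifier ignores the features, so a placeholder column is enough
X_toy = pd.DataFrame({"placeholder": range(len(y_toy))})

# Manual baseline: predict the mode for every row
manual_accuracy = (y_toy == y_toy.mode()[0]).mean()

# Same baseline through scikit-learn
dummy = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
dummy_accuracy = dummy.score(X_toy, y_toy)

print(f"manual: {manual_accuracy:.2f}, DummyClassifier: {dummy_accuracy:.2f}")
```

On the real data, swapping in `X_train` and `y_train` should give the same number as the manual baseline cell.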

<div class="alert alert-block alert-success">

2. Fit the decision tree classifier to your training sample and transform (i.e. make predictions on the training sample)

In [None]:
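# Encode the string columns as integers so the tree can use them
X_train['sex'] = X_train.sex.map({'male': 1, 'female': 0})
# Include every port: a town missing from the map would become NaN
X_train['embark_town'] = X_train.embark_town.map({'Southampton': 1, 'Queenstown': 0, 'Cherbourg': 2})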
X_train.info()
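
An alternative to hand-written integer maps is one-hot encoding with `pd.get_dummies`, which avoids implying an order among the ports. This is only a sketch on a three-row made-up frame (`X_toy` and its values are hypothetical); whether the real columns look like this depends on what `prep_titanic` returns.

```python
import pandas as pd

# Hypothetical rows shaped like the Titanic features used above
X_toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "embark_town": ["Southampton", "Cherbourg", "Queenstown"],
    "fare": [7.25, 71.28, 7.75],
})

# drop_first=True removes one redundant dummy column per category
X_encoded = pd.get_dummies(
    X_toy, columns=["sex", "embark_town"], drop_first=True, dtype=int
)

print(X_encoded.columns.tolist())
```

If this approach is used, the same encoding has to be applied to the validate and test sets so that all three share identical columns.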

In [None]:
# Make the model
tree1 = DecisionTreeClassifier(max_depth=3, random_state=42)

In [None]:
# Fit the model 
tree1 = tree1.fit(X_train, y_train)

In [None]:
# Use the model
# Evaluate the model's performance on train, first
y_predictions = tree1.predict(X_train)

In [None]:
plt.figure(figsize=(12, 7))
plot_tree(tree1, feature_names=X_train.columns.tolist(), class_names=['0','1'])
plt.show()
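
When the plotted tree is hard to read, `sklearn.tree.export_text` prints the same split rules as plain text. The sketch below fits a tiny tree on made-up data (the values in `X_toy` and `y_toy` are hypothetical) just to show the call.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: the first feature (sex) fully determines the target
X_toy = [[0, 10], [0, 20], [1, 10], [1, 20]]
y_toy = [1, 1, 0, 0]

toy_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_toy, y_toy)

# Text version of the fitted rules, one line per split or leaf
rules = export_text(toy_tree, feature_names=["sex", "fare"])
print(rules)
```

With the model from this notebook, the equivalent call would be `export_text(tree1, feature_names=X_train.columns.tolist())`.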

<div class="alert alert-block alert-success">

3. Evaluate your in-sample results using the model score, confusion matrix, and classification report.

In [None]:
# We have our fitted model, tree1; start with its accuracy on the training set
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
      .format(tree1.score(X_train, y_train)))

In [None]:
# Confusion matrix
conf = confusion_matrix(y_train, y_predictions)
conf

In [None]:
ConfusionMatrixDisplay.from_predictions(y_train, y_predictions)

In [None]:
print(classification_report(y_train, y_predictions))

<div class="alert alert-block alert-success">

4. Compute: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [None]:
labels = sorted(y_train.unique())

pd.DataFrame(conf,
             index=[f'{label} actual' for label in labels],
             columns=[f'{label} predict' for label in labels])

In [None]:
conf.ravel()

In [None]:
TN, FP, FN, TP = conf.ravel()
TN, FP, FN, TP

In [None]:
# Avoid naming this `all`, which would shadow the Python built-in
total = (TP + TN + FP + FN)

accuracy = (TP + TN) / total

TPR = recall = TP / (TP + FN)
FPR = FP / (FP + TN)

TNR = TN / (FP + TN)
FNR = FN / (FN + TP)

precision =  TP / (TP + FP)
f1 =  2 * ((precision * recall) / ( precision + recall))

support_pos = TP + FN
support_neg = FP + TN


In [None]:
print(f"Accuracy: {accuracy}\n")
print(f"True Positive Rate/Sensitivity/Recall/Power: {TPR}")
print(f"False Positive Rate/False Alarm Ratio/Fall-out: {FPR}")
print(f"True Negative Rate/Specificity/Selectivity: {TNR}")
print(f"False Negative Rate/Miss Rate: {FNR}\n")
print(f"Precision/PPV: {precision}")
print(f"F1 Score: {f1}\n")
print(f"Support (0): {support_neg}")
print(f"Support (1): {support_pos}")
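
The hand-computed metrics can be checked against scikit-learn's own functions. The sketch below does this on ten made-up labels and predictions (`y_true` and `y_pred` are hypothetical, not the Titanic results), treating class 1 as the positive class as in the cells above.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels and predictions
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# For binary labels [0, 1], ravel() returns TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

manual = {
    "accuracy": (tp + tn) / (tp + tn + fp + fn),
    "precision": tp / (tp + fp),
    "recall": tp / (tp + fn),
}
manual["f1"] = (2 * manual["precision"] * manual["recall"]
                / (manual["precision"] + manual["recall"]))

library = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}

for name in manual:
    print(f"{name}: manual={manual[name]:.3f}, sklearn={library[name]:.3f}")
```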

<div class="alert alert-block alert-success">

5. Run through steps 2-4 using a different max_depth value.

In [None]:
# max depth = 1
tree2 = DecisionTreeClassifier(max_depth=1, random_state=42)
tree2.fit(X_train, y_train)
tree2.score(X_train, y_train)

In [None]:
# tree1 from above (max depth = 3) is already fit, so just score it
tree1.score(X_train, y_train)

<div class="alert alert-block alert-success">

6. Which model performs better on your in-sample data?

* Model with max depth = 3

<div class="alert alert-block alert-success">

7. Which model performs best on your out-of-sample data, the validate set?

* Very close, but the model with max depth = 3

In [None]:
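# Encode the validate set the same way as the training set
X_val['sex'] = X_val.sex.map({'male': 1, 'female': 0})
# Include every port: a town missing from the map would become NaN
X_val['embark_town'] = X_val.embark_town.map({'Southampton': 1, 'Queenstown': 0, 'Cherbourg': 2})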
X_val.info()

In [None]:
# max depth = 1 
tree = DecisionTreeClassifier(max_depth=1, random_state=42)
out_tree = tree.fit(X_train, y_train)
out_tree.score(X_val, y_val)

In [None]:
# max depth = 3
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
out_tree = tree.fit(X_train, y_train)
out_tree.score(X_val, y_val)
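
Comparing two depths by hand works, but a loop makes it easier to see where train and validate accuracy start to diverge. The sketch below runs on synthetic data from `make_classification` so that it is self-contained; the dataset, the split sizes, and the depth range are all assumptions, not part of the exercise.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the Titanic splits
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

rows = []
for depth in range(1, 8):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    train_acc = model.score(X_tr, y_tr)
    val_acc = model.score(X_va, y_va)
    rows.append({
        "max_depth": depth,
        "train_accuracy": train_acc,
        "validate_accuracy": val_acc,
        "difference": train_acc - val_acc,  # a large gap suggests overfitting
    })

results = pd.DataFrame(rows)
print(results.round(3))
```

On the notebook's data, replacing `X_tr`, `y_tr`, `X_va`, `y_va` with `X_train`, `y_train`, `X_val`, `y_val` would give the same table for the Titanic model.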

<div class="alert alert-block alert-success">

1. Work through these same exercises using the Telco dataset.

2. Experiment with this model on other datasets with a higher number of output classes.

In [None]:
# Acquire data
telco = prep_telco(new_telco_data())

In [None]:
# Split the data into train, validate, and test sets
train, validate, test = split_data(telco, 'churn')

In [None]:
train.head()

In [None]:
# Columns not used for modeling in this exercise
drop_cols = ['customer_id', 'gender', 'senior_citizen', 'partner', 'dependents', 'phone_service', 'multiple_lines', 'online_security', 'online_backup', 'device_protection', 'tech_support', 'streaming_tv', 'streaming_movies', 'paperless_billing', 'contract_type', 'internet_service_type', 'payment_type']

def clean_telco(df):
    """Drop unused columns, make total_charges numeric, and encode churn as 0/1."""
    df = df.drop(columns=drop_cols)
    df['total_charges'] = df['total_charges'].str.strip()
    df = df[df.total_charges != ''].copy()
    df['total_charges'] = df.total_charges.astype(float)
    df['churn'] = df.churn.map({'Yes': 1, 'No': 0}).astype(int)
    return df

# Apply the same cleaning to every split so validate and test match train
train, validate, test = clean_telco(train), clean_telco(validate), clean_telco(test)

train.info()

In [None]:
train.head()

In [None]:
X_train = train.drop(columns='churn')
y_train = train.churn

X_val = validate.drop(columns='churn')
y_val = validate.churn

X_test = test.drop(columns='churn')
y_test = test.churn

In [None]:
train.churn.value_counts()

In [None]:
# The mode of the target is the baseline prediction
baseline = y_train.mode()[0]

# Produce a boolean Series with True where the baseline prediction matches the actual value
matches_baseline_prediction = (y_train == baseline)

baseline_accuracy = matches_baseline_prediction.mean()
print(f'baseline accuracy: {baseline_accuracy:.2%}')

In [None]:
# Make the model
churn_tree = DecisionTreeClassifier(max_depth=3, random_state=42)

In [None]:
#fit the model
churn_tree = churn_tree.fit(X_train, y_train)

In [None]:
#visualize
plt.figure(figsize=(13, 7))
plot_tree(churn_tree,
          feature_names=X_train.columns.tolist(),
          class_names=np.array(churn_tree.classes_).astype('str').tolist(),
          rounded=True)
plt.show()

In [None]:
accuracy = churn_tree.score(X_train, y_train)
print(f'Model 1 accuracy: {accuracy:.2%}')
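
For the second bonus item, a dataset with more than two output classes, one option is the iris data bundled with scikit-learn, which has three species. This is a minimal sketch; the choice of dataset, the 70/30 split, and `max_depth=3` are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Three-class dataset shipped with scikit-learn (no download needed)
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

iris_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)

val_accuracy = iris_tree.score(X_va, y_va)
print(f"Validate accuracy: {val_accuracy:.2%}")

# With more than two classes, the report shows one row per class
print(classification_report(y_va, iris_tree.predict(X_va),
                            target_names=iris.target_names))
```

With three classes the confusion matrix is 3x3, so the TN, FP, FN, TP unpacking used earlier no longer applies directly; per-class precision and recall from the classification report take its place.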