# Decision Tree Exhibition

This notebook compares a basic decision tree classifier written from scratch against its scikit-learn (sklearn) counterpart.

## Part 1: Decision Tree from Scratch

In [1]:
# Step 0: Import the necessary packages
from from_scratch.decision_tree import DecisionTree
from from_scratch.evaluation_metrics import f1_measure, precision_and_recall, confusion_matrix, accuracy
from from_scratch.import_data import load_data, train_test_split


We will assess our model on the [Pima Indians Diabetes Database](https://www.kaggle.com/uciml/pima-indians-diabetes-database).

In [2]:
# Step 1: Import diabetes.csv with load_data
features, targets, attribute_names = load_data("data/diabetes.csv")
train_features, train_targets, test_features, test_targets = train_test_split(
    features, targets, fraction=0.85)
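
The `train_test_split` helper in `from_scratch.import_data` is not shown in this notebook. As a rough sketch of what a fraction-based split could look like, the block below assumes that `fraction` is the share of rows assigned to the training set and that rows are shuffled before splitting. The function name `split_by_fraction` and its `seed` parameter are hypothetical and are not part of the `from_scratch` package.

```python
import random


def split_by_fraction(features, targets, fraction=0.85, seed=0):
    """Hypothetical stand-in for a fraction-based train/test split.

    Assumes `fraction` is the share of rows used for training.
    """
    if len(features) != len(targets):
        raise ValueError("features and targets must have the same length")

    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # shuffle so the split ignores row order

    cutoff = int(fraction * len(indices))  # number of training rows, rounded down
    train_idx, test_idx = indices[:cutoff], indices[cutoff:]

    train_features = [features[i] for i in train_idx]
    train_targets = [targets[i] for i in train_idx]
    test_features = [features[i] for i in test_idx]
    test_targets = [targets[i] for i in test_idx]
    return train_features, train_targets, test_features, test_targets


# Toy data (made up for illustration): 20 rows with one feature each
toy_features = [[i] for i in range(20)]
toy_targets = [i % 2 for i in range(20)]
tr_x, tr_y, te_x, te_y = split_by_fraction(toy_features, toy_targets, fraction=0.85)
print(len(tr_x), len(te_x))
```

Fixing the seed keeps the split reproducible between runs, which matters when two models are compared on the same data.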


In [3]:
# Step 2: Fit a decision tree to the training data
learner = DecisionTree(attribute_names)
learner.fit(train_features, train_targets)

learner.visualize()  # visualize tree


0: Glucose == 128.0
1:  Age == 29.0
2:   BMI == 31.0
3:    Pregnancies == 8.0
4:     DiabetesPedigreeFunction == 0.678
5:      root == 0
5:      SkinThickness == 14.0
6:       Insulin == 182.0
7:        root == 1
7:        root == 0
6:       root == 0
4:     root == 1
3:    BloodPressure == 44.0
4:     root == 1
4:     DiabetesPedigreeFunction == 0.503
5:      Insulin == 37.0
6:       SkinThickness == 31.0
7:        Pregnancies == 5.0
8:         root == 0
8:         root == 1
7:        root == 0
6:       Pregnancies == 2.0
7:        root == 0
7:        SkinThickness == 32.0
8:         root == 0
8:         root == 0
5:      Insulin == 170.0
6:       SkinThickness == 30.0
7:        Pregnancies == 2.0
8:         root == 0
8:         root == 0
7:        Pregnancies == 4.0
8:         root == 0
8:         root == 1
6:       root == 0
2:   BMI == 26.5
3:    Pregnancies == 8.0
4:     root == 0
4:     BloodPressure == 55.0
5:      root == 1
5:      DiabetesPedigreeFunction == 0.232
6:       roo

As the evaluation below shows, our model reaches about 72% accuracy on the held-out test set, which is respectable given how simple it is.

In [4]:
# Step 3: Predict labels of testing set and evaluate the decision tree's performance
predictions = learner.predict(test_features)

confusion_mat = confusion_matrix(test_targets, predictions)
accuracy_num = accuracy(test_targets, predictions)
precision, recall = precision_and_recall(test_targets, predictions)
f1_measure_num = f1_measure(test_targets, predictions)

print(f"Confusion Matrix:\n{confusion_mat}\n")
print(f"Accuracy: {accuracy_num}\n")
print(f"Precision: {precision}; Recall: {recall}\n")
print(f"F1_Measure: {f1_measure_num}\n")


Confusion Matrix:
[[63 20]
 [13 20]]

Accuracy: 0.7155172413793104

Precision: 0.5; Recall: 0.6060606060606061

F1_Measure: 0.5479452054794521
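
To make the link between the confusion matrix and the other metrics explicit, here is a short worked example. It assumes the matrix is laid out as `[[TN, FP], [FN, TP]]` (rows are actual classes, columns are predicted classes), a reading that is consistent with the precision and recall printed above. The variable names are illustrative only.

```python
# Confusion matrix reported above, read as [[TN, FP], [FN, TP]] (assumed layout)
tn, fp = 63, 20
fn, tp = 13, 20

total = tn + fp + fn + tp

accuracy_check = (tp + tn) / total      # share of all predictions that are correct
precision_check = tp / (tp + fp)        # share of predicted positives that are correct
recall_check = tp / (tp + fn)           # share of actual positives that are found
f1_check = 2 * precision_check * recall_check / (precision_check + recall_check)

print(f"Accuracy: {accuracy_check}")
print(f"Precision: {precision_check}; Recall: {recall_check}")
print(f"F1_Measure: {f1_check}")
```

Recomputing the metrics this way reproduces the values printed by the evaluation cell, which is a quick sanity check on the from-scratch metric functions.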



## Part 2: Decision Tree with scikit-learn


Let's now examine the sklearn implementation of the decision tree model.

In [5]:
# Step 0: Import the necessary packages
# For preparing the data and fitting the decision tree
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text


In [6]:
# Step 1: Import and wrangle diabetes.csv
pima = pd.read_csv("data/diabetes.csv", header=0)
X = pima[["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
          "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]]
y = pima["Outcome"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=1)


In [7]:
# Step 2: Fit decision tree classifier
clf = DecisionTreeClassifier(criterion="entropy")
clf = clf.fit(X_train, y_train)

# Visualize tree, reusing the column names of X as feature names
print(export_text(clf, feature_names=list(X.columns)))


|--- Glucose <= 127.50
|   |--- BMI <= 26.45
|   |   |--- BMI <= 9.10
|   |   |   |--- Pregnancies <= 7.50
|   |   |   |   |--- class: 0
|   |   |   |--- Pregnancies >  7.50
|   |   |   |   |--- class: 1
|   |   |--- BMI >  9.10
|   |   |   |--- DiabetesPedigreeFunction <= 0.67
|   |   |   |   |--- class: 0
|   |   |   |--- DiabetesPedigreeFunction >  0.67
|   |   |   |   |--- DiabetesPedigreeFunction <= 0.71
|   |   |   |   |   |--- class: 1
|   |   |   |   |--- DiabetesPedigreeFunction >  0.71
|   |   |   |   |   |--- class: 0
|   |--- BMI >  26.45
|   |   |--- Age <= 28.50
|   |   |   |--- BMI <= 30.95
|   |   |   |   |--- Pregnancies <= 7.00
|   |   |   |   |   |--- class: 0
|   |   |   |   |--- Pregnancies >  7.00
|   |   |   |   |   |--- class: 1
|   |   |   |--- BMI >  30.95
|   |   |   |   |--- BloodPressure <= 51.00
|   |   |   |   |   |--- BMI <= 34.40
|   |   |   |   |   |   |--- class: 1
|   |   |   |   |   |--- BMI >  34.40
|   |   |   |   |   |   |--- BMI <= 48.55
|   |  
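
The `criterion="entropy"` argument used when fitting the classifier means that each split is chosen to reduce label entropy as much as possible, in other words to maximize information gain. The block below is a minimal sketch of that calculation for a single threshold split. The helper names `entropy` and `information_gain` and the toy glucose values are made up for illustration; this is not sklearn's internal code, nor necessarily how the from-scratch `DecisionTree` scores splits.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that sends every row to one side gains nothing
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


# Toy data (made up): low glucose readings labeled 0, high readings labeled 1
glucose = [90, 100, 110, 140, 150, 160]
outcome = [0, 0, 0, 1, 1, 1]

gain_good = information_gain(glucose, outcome, threshold=127.5)
gain_poor = information_gain(glucose, outcome, threshold=95)
print(gain_good, gain_poor)
```

A tree builder would evaluate candidate thresholds like these for every feature and keep the one with the highest gain, then repeat the process on each child node.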

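On this run, the sklearn implementation is slightly less accurate (about 69% versus 72%), although the two models were evaluated on different test splits, so the gap is not conclusive. The sklearn model does train significantly faster than the from-scratch implementation.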

In [8]:
# Step 3: Predict labels of testing set and evaluate the decision tree's performance
y_predictions = clf.predict(X_test)

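confusion_mat2 = metrics.confusion_matrix(y_test, y_predictions)
accuracy_num2 = metrics.accuracy_score(y_test, y_predictions)
# Without an `average` argument, this returns one value per class
# (index 0: Outcome == 0, index 1: Outcome == 1), not a single scalar
precision2, recall2, f1_measure_num2, _ = metrics.precision_recall_fscore_support(
    y_test, y_predictions)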

print(f"Confusion Matrix:\n{confusion_mat2}\n")
print(f"Accuracy: {accuracy_num2}\n")
print(f"Precision: {precision2}; Recall: {recall2}\n")
print(f"F1_Measure: {f1_measure_num2}\n")


Confusion Matrix:
[[56 19]
 [17 24]]

Accuracy: 0.6896551724137931

Precision: [0.76712329 0.55813953]; Recall: [0.74666667 0.58536585]

F1_Measure: [0.75675676 0.57142857]
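
The sklearn metrics above are per-class arrays, whereas Part 1 printed a single value for the positive class. To compare like with like, the second entry of each array is the relevant one. As a small sketch on made-up labels (not the diabetes data), passing `average="binary"` to `precision_recall_fscore_support` returns the positive-class values directly:

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy labels, made up for illustration
y_true_demo = [0, 0, 1, 1, 1]
y_pred_demo = [0, 1, 1, 1, 0]

# Default call: one precision/recall/F1 value per class
per_class = precision_recall_fscore_support(y_true_demo, y_pred_demo)

# average="binary": scalars for the positive class (label 1 by default)
pos_precision, pos_recall, pos_f1, _ = precision_recall_fscore_support(
    y_true_demo, y_pred_demo, average="binary")

print(f"Per-class precision: {per_class[0]}")
print(f"Positive-class precision: {pos_precision}; recall: {pos_recall}; F1: {pos_f1}")
```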



In this exhibition we saw a from-scratch implementation of the basic decision tree classifier perform comparably to the equivalent sklearn implementation (about 72% versus 69% accuracy on their respective test splits). Further improvements to this model would include pruning and using a forest of shallow decision trees rather than a single tree. These improvements would help prevent the model from overfitting.
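
As a rough sketch of the two improvements mentioned above, sklearn exposes cost-complexity pruning through the `ccp_alpha` parameter of `DecisionTreeClassifier`, and `RandomForestClassifier` with a small `max_depth` builds a forest of shallow trees. The example uses synthetic data from `make_classification` rather than the diabetes data, and the values `ccp_alpha=0.01`, `max_depth=3`, and `n_estimators=100` are arbitrary illustrative choices, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: 8 features, binary target)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=1)

# Unpruned tree: grows until the training leaves are pure
full_tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
full_tree.fit(Xd_train, yd_train)

# Pruned tree: cost-complexity pruning removes branches that add little
pruned_tree = DecisionTreeClassifier(
    criterion="entropy", ccp_alpha=0.01, random_state=0)
pruned_tree.fit(Xd_train, yd_train)

# Forest of shallow trees: each tree is limited to depth 3
forest = RandomForestClassifier(
    n_estimators=100, max_depth=3, criterion="entropy", random_state=0)
forest.fit(Xd_train, yd_train)

demo_scores = {
    "full tree": full_tree.score(Xd_test, yd_test),
    "pruned tree": pruned_tree.score(Xd_test, yd_test),
    "shallow forest": forest.score(Xd_test, yd_test),
}
for name, score in demo_scores.items():
    print(f"{name}: test accuracy {score:.3f}")
print(f"Leaves: full={full_tree.get_n_leaves()}, pruned={pruned_tree.get_n_leaves()}")
```

Whether either variant beats the single unpruned tree depends on the data, so the pruning strength and tree depth would normally be chosen by cross-validation.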