
# Decision Tree & Random Forest Tutorial

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

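First, we want to look at the class distribution of the dataset, that is, how many examples there are of each digit. Many people skip this step, so always remember to do it! If some classes are much rarer than others, a metric like accuracy can look good even when the model handles those classes poorly.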

In [None]:
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()

total_labels = np.concatenate([train_labels, test_labels])

fig, ax = plt.subplots(1, 1, figsize=(20, 10))
classes, freqs = np.unique(total_labels, return_counts=True)
ax.bar(classes, freqs)
ax.set_xticks(classes);
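
Alongside the bar chart, it can help to put numbers on the class balance. Below is a minimal sketch on a made-up label array (`toy_labels` is hypothetical, not MNIST); the same few lines should work on `total_labels`.

```python
import numpy as np

# Hypothetical toy labels, used only to illustrate the computation
toy_labels = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])

classes, freqs = np.unique(toy_labels, return_counts=True)

# Fraction of the dataset that belongs to each class
proportions = freqs / freqs.sum()

# Ratio between the most and least common class (1.0 means perfectly balanced)
imbalance_ratio = freqs.max() / freqs.min()

for cls, prop in zip(classes, proportions):
    print(f"class {cls}: {prop:.1%}")
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

A ratio close to 1 suggests that plain accuracy is a reasonable first metric; a large ratio is a hint to look at per-class results as well.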

Now, the first question one may have is how a decision tree can classify images. The idea is to mimic image filters by having the decision tree test the values of individual pixels. If a group of pixels exhibits a certain desired feature, then that feature ends up encoded as a path through the decision tree.

First, however, we need to transform each image into a one-dimensional vector. There are several ways to do this, but the simplest is to just flatten the images. It's boring, but it works.

In [None]:
# Note that MNIST images are 28 x 28, so we just need to flatten our arrays to shape (n, 784)

train_vecs = train_images.reshape(train_images.shape[0], 784)
test_vecs = test_images.reshape(test_images.shape[0], 784)
train_vecs.shape, test_vecs.shape
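
Flattening loses nothing: every flat index still corresponds to exactly one pixel position. Here is a small sketch of that mapping, using a made-up array (`toy_images` is hypothetical) in place of the real MNIST images.

```python
import numpy as np

# Hypothetical stand-in for two 28 x 28 images, filled with distinct values
toy_images = np.arange(2 * 28 * 28).reshape(2, 28, 28)

# Same flattening as above: one row of 784 values per image
toy_vecs = toy_images.reshape(toy_images.shape[0], -1)

# With NumPy's default row-major order, flat index k maps to (k // 28, k % 28)
flat_index = 100
row, col = divmod(flat_index, 28)

# The flattened value matches the pixel at that (row, col) position
same_pixel = toy_vecs[0, flat_index] == toy_images[0, row, col]

# Reshaping back recovers the original images exactly
restored = toy_vecs.reshape(-1, 28, 28)

print(toy_vecs.shape, (row, col), same_pixel)
```

This mapping is handy later on for translating a feature index used by a tree back into a pixel location.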

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Play with the hyperparameters!
dtree = DecisionTreeClassifier(max_depth=5)
rf = RandomForestClassifier(n_estimators=10, max_depth=5)

dtree.fit(train_vecs, train_labels)
rf.fit(train_vecs, train_labels)

print("Classifiers trained")
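
To make the "checking individual pixels" idea concrete, here is a minimal sketch on synthetic data. Everything named `toy_*` is hypothetical and not part of MNIST: the labels are constructed to depend on a single pixel, so a depth-1 tree should learn to test exactly that pixel.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical toy data (not MNIST): 200 random 4 x 4 "images" with values 0..255
toy_images = rng.integers(0, 256, size=(200, 4, 4))
# By construction, the label depends only on the pixel at row 1, column 2
toy_labels = (toy_images[:, 1, 2] > 127).astype(int)

# Flatten to shape (200, 16), as done for MNIST above
toy_vecs = toy_images.reshape(toy_images.shape[0], -1)

# A depth-1 tree can ask exactly one question about one pixel
toy_tree = DecisionTreeClassifier(max_depth=1, random_state=0)
toy_tree.fit(toy_vecs, toy_labels)

# Node 0 is the root; `feature` holds the flat pixel index that node tests
root_feature = int(toy_tree.tree_.feature[0])
root_row, root_col = divmod(root_feature, 4)
root_threshold = toy_tree.tree_.threshold[0]
toy_accuracy = toy_tree.score(toy_vecs, toy_labels)

print(f"Root tests pixel (row={root_row}, col={root_col}) <= {root_threshold:.1f}")
print("Training accuracy:", toy_accuracy)
```

The same `tree_.feature` and `tree_.threshold` arrays exist on the fitted `dtree`, so a similar lookup (with 28 in place of 4) should show which MNIST pixels it tests first.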

Awesome, let's test these guys out now!

In [None]:
dtree_preds = dtree.predict(test_vecs)
rf_preds = rf.predict(test_vecs)

In [None]:
import seaborn as sns
from sklearn.metrics import confusion_matrix, accuracy_score

def plot_confmat(true_labels, pred_labels):
    """
    Plots a confusion matrix from given data
    """
    fig2, ax = plt.subplots(1, 1, figsize=(10, 10))  # create a fresh figure on every call

    cm = confusion_matrix(true_labels, pred_labels)
    # Normalize each row so a cell shows the fraction of that actual class
    with np.errstate(divide='ignore', invalid='ignore'):
        cm_norm = cm.astype('float') / cm.sum(axis=1, keepdims=True)
    cm_norm = np.nan_to_num(cm_norm)  # rows with no samples become 0 instead of NaN

    annot = np.zeros_like(cm, dtype=object)
    for i in range(annot.shape[0]):  # Creates an annotation array for the heatmap
        for j in range(annot.shape[1]):
            annot[i][j] = f'{cm[i][j]}\n{round(cm_norm[i][j] * 100, ndigits=3)}%'

    ax = sns.heatmap(cm_norm, annot=annot, fmt='', cbar=True, cmap=plt.cm.magma, vmin=0, ax=ax) # plot the confusion matrix
    # Format the percentage directly to avoid floating-point artifacts in the title
    ax.set_title(f'Accuracy = {accuracy_score(true_labels, pred_labels) * 100:.0f}%')
    ax.set(xlabel='Predicted Label', ylabel='Actual Label')
    
    fig2.tight_layout()
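
The row normalization used in `plot_confmat` is easy to check by hand on a tiny example. The sketch below uses made-up labels (`toy_true` and `toy_pred` are hypothetical) and shows that, after dividing each row by its sum, the diagonal holds the fraction of each actual class that was predicted correctly.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels, chosen small enough to count by hand
toy_true = [0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 1, 1, 1, 0]

# Rows are actual classes, columns are predicted classes
toy_cm = confusion_matrix(toy_true, toy_pred)

# Divide every row by its total so each row sums to 1
toy_cm_norm = toy_cm / toy_cm.sum(axis=1, keepdims=True)

# Diagonal entries: share of each actual class that was classified correctly
per_class_recall = np.diag(toy_cm_norm)

print(toy_cm)
print(np.round(toy_cm_norm, 3))
print(np.round(per_class_recall, 3))
```

Reading the heatmap row by row this way makes it easier to spot which digits get confused with which.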

In [None]:
print('Decision Tree Confusion Matrix')
plot_confmat(test_labels, dtree_preds)

In [None]:
print('Random Forest Confusion Matrix')
plot_confmat(test_labels, rf_preds)