# Classification: Decision Trees and Random Forest

We all know what a decision tree is, but how do we build one optimally?

Answer: at each node we greedily pick the split that most reduces Gini impurity or entropy (that is, the split with the highest information gain).
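
To make the two split criteria concrete, here is a minimal sketch of how Gini impurity, entropy, and information gain can be computed for a set of labels. The helper names (`gini_impurity`, `entropy`, `information_gain`) and the toy labels are assumptions made for illustration; they are not scikit-learn functions.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Toy example: two balanced classes, split perfectly into pure children
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]

print(gini_impurity(parent))                  # 0.5
print(entropy(parent))                        # 1.0
print(information_gain(parent, left, right))  # 1.0
```

A pure node scores 0 on both measures, so a split that produces pure children gives the largest possible gain.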

Decision trees are simple to understand and visualise, which makes them easy to explain to a non-technical boss or client.

References:
Decision trees: https://scikit-learn.org/stable/modules/tree.html <br>
Plot the decision surface of decision trees trained on the iris dataset: https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html <br>
Plot the decision surfaces of ensembles of trees on the iris dataset: https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_iris.html <br>
Confusion matrix: https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py

## Installation

In [None]:
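%pip install numpy
%pip install pandas
%pip install matplotlib
%pip install scikit-learn
%pip install graphviz
# Note: the graphviz Python package also needs the Graphviz system binaries (the `dot` executable) on your PATH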

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets, metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

## Classifying the Iris dataset

The Iris dataset is one of the most famous 20th-century datasets for ML. It is made up of 150 samples of three species of iris flower (50 per species), with four features: sepal length, sepal width, petal length, and petal width. The dataset is simple to understand yet has plenty of variance, some class overlap, and a few outliers, making it well suited to testing and showcasing ML algorithms.
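
As a quick sanity check of those numbers, the sketch below loads the dataset and prints its shape and class counts. The variable name `iris_check` is chosen here only to avoid clashing with names used later in the notebook.

```python
import numpy as np
from sklearn import datasets

iris_check = datasets.load_iris()

print(iris_check.data.shape)            # (150, 4): 150 samples, 4 features
print(np.bincount(iris_check.target))   # [50 50 50]: 50 samples per class
print(iris_check.feature_names)
```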

### Viewing the Iris dataset

Let's view the dataset one more time.

In [None]:
iris_x, iris_y = datasets.load_iris(return_X_y=True, as_frame=True)
iris = datasets.load_iris() # Needed for plotting

In [None]:
iris_x.head()

In [None]:
iris_y.head()

### Choosing our inputs

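Scikit-learn estimators accept pandas DataFrames as well as NumPy arrays, but the code below slices columns by position (for example `X[:, pair]`), so let's convert our data to NumPy arrays here.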

In [None]:
# Convert the data into numpy arrays
X = iris_x.to_numpy()
y = iris_y.to_numpy()

<details><summary>Click to cheat</summary>

```python
# Convert the data into numpy arrays
X = iris_x.to_numpy()
y = iris_y.to_numpy()
```
</details>

### Creating our Decision tree

Let's make one classifier for every unique pair of features.

In [None]:
# Indexes of the pairs of features
pairIdxs = [[0, 1],
            [0, 2],
            [0, 3],
            [1, 2],
            [1, 3],
            [2, 3]]

# Choose your max depth
# max_depth = 

# Create the decision tree models for each pair


# Train the models in a collection called models2


<details><summary>Click to cheat</summary>

```python
# Indexes of the pairs of features
pairIdxs = [[0, 1],
            [0, 2],
            [0, 3],
            [1, 2],
            [1, 3],
            [2, 3]]

# Choose your max depth
max_depth = 15

# Create the decision tree models for each pair
models = [DecisionTreeClassifier(max_depth=max_depth) for _ in pairIdxs]

# Train the models
models2 = [model.fit(X[:, pair], y) for model, pair in zip(models, pairIdxs)]
```
</details>
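
The hard-coded list of feature pairs can also be generated rather than typed out. This is a small sketch using `itertools.combinations`; the names `n_features` and `pair_idxs` are illustrative and not part of the exercise above.

```python
from itertools import combinations

n_features = 4  # the iris dataset has four features

# Every unordered pair of distinct feature indexes, in lexicographic order
pair_idxs = [list(pair) for pair in combinations(range(n_features), 2)]

print(pair_idxs)
# [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]
```

With four features this gives the same six pairs as the list above, and it scales to datasets with more features without retyping.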

### Plotting the Regions

In [None]:
# Parameters
plot_colors = "ryb"
plot_step = 0.02
pairIdxs = [[0, 1],
            [0, 2],
            [0, 3],
            [1, 2],
            [1, 3],
            [2, 3]]

fig, sub = plt.subplots(nrows=2, ncols=3, figsize=(12, 8))

for pairidx, (model, pair, ax) in enumerate(zip(models2, pairIdxs, sub.flatten())):
    # For plotting purposes, we only include two features
    X2 = X[:, pair]

    x_min, x_max = X2[:, 0].min() - 0.3, X2[:, 0].max() + 0.3
    y_min, y_max = X2[:, 1].min() - 0.3, X2[:, 1].max() + 0.3
    xx, yy = np.meshgrid(
        np.arange(x_min, x_max, plot_step), np.arange(y_min, y_max, plot_step)
    )
    plt.tight_layout(h_pad=0.5, w_pad=0.5, pad=2.5)
    X_pred = np.zeros((len(xx.ravel()), 2))
    X_pred[:, 0] = xx.ravel()
    X_pred[:, 1] = yy.ravel()

    Z = model.predict(X_pred)
    Z = Z.reshape(xx.shape)
    cs = ax.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu)

    ax.set_xlabel(iris.feature_names[pair[0]])
    ax.set_ylabel(iris.feature_names[pair[1]])

    # Plot the training points
    for i, color in zip(range(len(iris.target_names)), plot_colors):
        idx = np.where(y == i)
        ax.scatter(
            X2[idx, 0],
            X2[idx, 1],
            c=color,
            label=iris.target_names[i],
            edgecolor="black",
            s=15,
        )

plt.suptitle("Decision surface of decision trees trained on pairs of features")
plt.legend(loc="lower right", borderpad=0, handletextpad=0)
plt.axis("tight")

### Decision Tree Visualisation

Scikit-learn comes with a handy tool to actually show what our decision tree looks like. Below is what the tree looks like when trained on all four features. Unfortunately, the picture is usually very pixelated and difficult to read.

In [None]:
from sklearn.tree import plot_tree

plt.figure()
model = DecisionTreeClassifier().fit(iris.data, iris.target)
plot_tree(model, filled=True)
plt.title("Decision tree trained on all the iris features")
plt.show()

Instead, let's use the `graphviz` package to visualise our tree.

In [None]:
import graphviz
from sklearn import tree

dot_data = tree.export_graphviz(model, out_file=None, 
    feature_names=iris.feature_names,  
    class_names=iris.target_names,  
    filled=True, rounded=True,  
    special_characters=True)  
graph = graphviz.Source(dot_data)  
graph 

We can also save the image to a PDF.

In [None]:
import graphviz
from sklearn import tree

dot_data = tree.export_graphviz(model, out_file=None, 
    feature_names=iris.feature_names,  
    class_names=iris.target_names,
    filled=True, rounded=True,  
    special_characters=True)  
graph = graphviz.Source(dot_data) 
graph.render("iris") 
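
If Graphviz is not available, a plain-text view of the tree can be a useful fallback. The sketch below uses `sklearn.tree.export_text`; the shallow `max_depth=2` and the name `text_model` are assumptions chosen here to keep the printout short, not settings used elsewhere in this notebook.

```python
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, export_text

iris_data = datasets.load_iris()

# A deliberately shallow tree so the printed rules stay readable
text_model = DecisionTreeClassifier(max_depth=2, random_state=0)
text_model.fit(iris_data.data, iris_data.target)

# export_text returns the split rules as an indented string
rules = export_text(text_model, feature_names=list(iris_data.feature_names))
print(rules)
```

Each indented line is one split threshold, and each `class:` line is a leaf.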

### Confusion matrix of our models

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

iris = datasets.load_iris()

titles_options = [
    ("Confusion matrix, without normalization", None),
    ("Normalized confusion matrix", "true"),
]

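# By default, `model` is the tree trained on all four features above.
# To evaluate one of the pair models instead, also pass only its two columns,
# e.g. model = models2[0] together with X[:, pairIdxs[0]] in place of X.
# Note: X and y are the training data, so this shows training fit, not generalisation.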

for title, normalize in titles_options:
    disp = ConfusionMatrixDisplay.from_estimator(
        model,
        X,
        y,
        display_labels=iris.target_names,
        cmap=plt.cm.Blues,
        normalize=normalize,
    )
    disp.ax_.set_title(title)

plt.show()


## Digits dataset

Decision trees are nice and simple, but an unpruned tree tends to overfit the training data badly.

Instead, we can use a random forest, which averages many decision trees trained on random subsets of the data and features, to reduce overfitting.
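
To see the overfitting gap in numbers, the sketch below compares training and test accuracy for a single unpruned tree and a forest on the digits data. The variable names, the fixed `random_state=0`, and the default hyperparameters are assumptions made for this illustration; exact scores will vary with the split and library version.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = datasets.load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=0
)

# One unpruned tree versus an ensemble of 100 trees
single_tree = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)

for name, clf in [("decision tree", single_tree), ("random forest", forest)]:
    train_acc = clf.score(Xtr, ytr)
    test_acc = clf.score(Xte, yte)
    print(f"{name}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A large drop from training to test accuracy is the usual sign of overfitting; the forest typically shows a smaller drop than the single tree.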

### Loading the data

First things first, we need to load the data. Let's also view the first few samples while we're at it.

In [None]:
# Load the digits as a bunch object
# We do this to get the target names and images for plotting

# Also load the digits X as a pandas DataFrame and the y as a Series


# Plot the first few examples
_, axes = plt.subplots(nrows=1, ncols=4, figsize=(10, 3))
for ax, image, label in zip(axes, digits.images, digits.target):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation="nearest")
    ax.set_title(f"Digit {label}")

<details><summary>Click to cheat</summary>

```python
# Load the digits as a bunch object
# We do this to get the target names and images for plotting
digits = datasets.load_digits()
# Also load the digits X as a pandas DataFrame and the y as a Series
digits_X, digits_y = datasets.load_digits(return_X_y=True, as_frame=True)

# Plot the first few examples
_, axes = plt.subplots(nrows=1, ncols=4, figsize=(10, 3))
for ax, image, label in zip(axes, digits.images, digits.target):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation="nearest")
    ax.set_title(f"Digit {label}")
```
</details>

Now let's split our labelled data into training and testing sets with a 70/30 ratio.

<details><summary>Click to cheat</summary>

```python
X_train, X_test, y_train, y_test = train_test_split(
    digits_X.to_numpy(), digits_y.to_numpy(), train_size=0.7
)
```
</details>
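
A plain random split gives different results on every run. As a hedged variant, the sketch below adds two optional `train_test_split` arguments: `random_state` to make the split reproducible and `stratify` to keep the class proportions similar in both sets. The seed `42` and the variable names are arbitrary choices for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X_all, y_all = datasets.load_digits(return_X_y=True)

# random_state fixes the shuffle; stratify keeps class proportions similar in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, train_size=0.7, random_state=42, stratify=y_all
)

print(X_tr.shape, X_te.shape)
print(f"train fraction: {len(X_tr) / len(X_all):.2f}")
```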

### Create the model

In [None]:
# Define your hyperparameters
# n_trees = 
# criterion = 'gini', 'entropy'
# max_depth = 
# max_features = 'sqrt', 'log2', None

# Create the untrained model with your hyperparameters

# Train the model


# get the predictions


<details><summary>Click to cheat</summary>

```python
# Define your hyperparameters
n_trees = 100
criterion='gini'
max_depth=5
max_features='sqrt'  # 'auto' was removed in scikit-learn 1.3; 'sqrt' is the equivalent for classifiers

# Create the untrained model with your hyperparameters
model = RandomForestClassifier(
    n_estimators=n_trees,
    criterion=criterion,
    max_depth=max_depth,
    max_features=max_features
)

# Train the model
model.fit(X_train, y_train)

# get the predictions
y_pred = model.predict(X_test)
```
</details>

### Test the model

Let's see a few examples of our predictions.

In [None]:
_, axes = plt.subplots(nrows=1, ncols=4, figsize=(10, 3))
for ax, image, prediction in zip(axes, X_test, y_pred):
    ax.set_axis_off()
    image = image.reshape(8, 8)
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation="nearest")
    ax.set_title(f"Prediction: {prediction}")

Let's also view our confusion matrix for good measure.

In [None]:
disp = metrics.ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
disp.figure_.suptitle("Confusion Matrix")

plt.show()

Finally, we'll look at our measures of performance.

In [None]:
from sklearn.metrics import classification_report

target_names = [str(name) for name in digits.target_names]

print(classification_report(y_test, y_pred, target_names=target_names))
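
The precision and recall columns in the report can be read straight off a confusion matrix. The sketch below shows the relationship on a small made-up example; the toy labels `y_true_toy` and `y_pred_toy` are invented for illustration and are not results from the model above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for a three-class problem
y_true_toy = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred_toy = np.array([0, 1, 1, 1, 2, 2, 2, 2, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)

# Precision: correct predictions / everything predicted as that class (column sums)
precision = np.diag(cm) / cm.sum(axis=0)
# Recall: correct predictions / everything truly in that class (row sums)
recall = np.diag(cm) / cm.sum(axis=1)

print(cm)
print("precision:", precision)
print("recall:   ", recall)

# Cross-check against scikit-learn's per-class scores
print(np.allclose(precision, precision_score(y_true_toy, y_pred_toy, average=None)))
print(np.allclose(recall, recall_score(y_true_toy, y_pred_toy, average=None)))
```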