# L3-1: Implementing Decision Trees

Hello and welcome! In this lab, we will implement decision trees. Let's get started.

# Class Enjoyment Data

The first thing we need is data! The most primitive way to get data in (practical only for a really small dataset) is to enter it manually, and that is what we do for the class enjoyment data discussed in class:

## Data Entry

But let's load the packages we are going to need for data processing first:

In [None]:
import numpy as np
import pandas as pd

Then, we can enter our data:

In [None]:
data = [['y', 'y', 'n', 'y', 'n', '+2'],
        ['y', 'y', 'n', 'y', 'n', '+2'],
        ['n', 'y', 'n', 'n', 'n', '+2'],
        ['n', 'n', 'n', 'y', 'n', '+2'],
        ['n', 'y', 'y', 'n', 'y', '+2'],
        ['y', 'y', 'n', 'n', 'n', '+1'],
        ['y', 'y', 'n', 'y', 'n', '+1'],
        ['n', 'y', 'n', 'y', 'n', '+1'],
        ['n', 'n', 'n', 'n', 'y',  '0'],
        ['y', 'n', 'n', 'y', 'y',  '0'],
        ['n', 'y', 'n', 'y', 'n',  '0'],
        ['y', 'y', 'y', 'y', 'y',  '0'],
        ['y', 'y', 'y', 'n', 'y', '-1'],
        ['n', 'n', 'y', 'y', 'n', '-1'],
        ['n', 'n', 'y', 'n', 'y', '-1'],
        ['y', 'n', 'y', 'n', 'y', '-1'],
        ['n', 'n', 'y', 'y', 'n', '-2'],
        ['n', 'y', 'y', 'n', 'y', '-2'],
        ['y', 'n', 'y', 'n', 'n', '-2'],
        ['y', 'n', 'y', 'n', 'y', '-2']
       ]

...and create a pandas DataFrame out of it:

In [None]:
column_names = ['Easy', 'AI', 'Systems', 'Theory', 'Morning', 'Rating']

df = pd.DataFrame(data, columns=column_names)

Let's view what we created:

In [None]:
display(df)

Looks nice!

There is a caveat though: many ML algorithms only work if we have purely numeric data, so everything has to be converted to numbers. We often take some processing steps on data before feeding it to machine learning algorithms and these steps are usually called *data pre-processing*.

## Processing Features

We will use the function [make_column_transformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html) defined in the compose submodule of scikit-learn and the classes [OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html) and [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) from scikit-learn's preprocessing submodule to help us convert the binary features in the data to numeric values.

In [None]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

Now, we tell these objects which columns we want converted, in order to turn the categorical features (of which Boolean features are one kind) into numbers. That is accomplished below using make_column_transformer and OrdinalEncoder. We also want to convert the class labels to non-negative integers, and we do that with LabelEncoder. Read the documentation for these to get a grasp of the different parameters and the details of what is done.

In [None]:
feature_names = column_names[:-1]
label_name = column_names[-1]

X_preprocess = make_column_transformer((OrdinalEncoder(), feature_names), 
                                       remainder='drop')
y_preprocess = LabelEncoder()
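
To get a feel for what these two encoders do, here is a minimal sketch on a made-up toy sample (the `toy_*` names are ours and not part of the lab data). Both encoders sort the distinct values and number them from 0, so the resulting codes follow string order rather than numeric order:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

# Made-up toy sample: two y/n features and a handful of rating labels.
toy_features = np.array([['y', 'n'],
                         ['n', 'n'],
                         ['y', 'y']])
toy_labels = ['+2', '0', '-1', '+2']

# OrdinalEncoder works column by column on a 2D feature table.
feature_encoder = OrdinalEncoder()
encoded_features = feature_encoder.fit_transform(toy_features)
print(feature_encoder.categories_)  # sorted categories found in each column
print(encoded_features)             # 'n' becomes 0.0 and 'y' becomes 1.0

# LabelEncoder works on a 1D vector of class labels.
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(toy_labels)
print(label_encoder.classes_)       # classes in sorted (string) order
print(encoded_labels)               # index of each label in classes_
```

Since the labels are strings, `'-1'` sorts before `'0'`, so the integer codes should be read only as class identifiers and not as an ordering of the ratings.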

## Data Matrices

Next, we create the actual converted data matrices (including the label vector) using the preprocessor objects we just created:

In [None]:
X = X_preprocess.fit_transform(df[feature_names])
y = y_preprocess.fit_transform(df[label_name])

...and here is the data as NumPy arrays:

In [None]:
display(X, y)

## The Decision Tree

### Making

We use the decision tree classifier class, [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), implemented in the tree submodule of scikit-learn.

In [None]:
from sklearn.tree import DecisionTreeClassifier

We create an object (instance) of the class and call it `dtree`:

In [None]:
dtree = DecisionTreeClassifier()

Now, we can 'fit' our classifier to the data, which basically means training the classifier and finding the QuAM. We use the [fit](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.fit) method of the `dtree` object of the DecisionTreeClassifier class we just created:

In [None]:
dtree.fit(X, y)
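
Roughly speaking, fitting a tree means repeatedly picking the feature split that leaves the purest groups of labels. `DecisionTreeClassifier` uses Gini impurity as its default criterion; the sketch below is our own simplified illustration of scoring a split on a binary feature. The helper names `gini` and `split_impurity` and the toy arrays are hypothetical, and this is not scikit-learn's actual implementation:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label vector: 1 - sum of squared class proportions."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

def split_impurity(feature, labels):
    """Size-weighted Gini impurity after splitting on a 0/1 feature."""
    left = labels[feature == 0]
    right = labels[feature == 1]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy example: the first feature separates the two classes, the second does not.
toy_y = np.array([0, 0, 1, 1])
informative_feature = np.array([0, 0, 1, 1])
uninformative_feature = np.array([0, 1, 0, 1])

print(gini(toy_y))                                   # impurity before any split
print(split_impurity(informative_feature, toy_y))    # lower is better
print(split_impurity(uninformative_feature, toy_y))  # no improvement
```

A tree builder would prefer the first feature here, because splitting on it brings the weighted impurity down the most.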

We now have our decision tree classifier QuAM ready, and we can use it to make predictions on new points.

Let's first define some new points:

In [None]:
X_new = np.array([[1., 1., 1., 1., 0.],
                  [0., 0., 0., 1., 1.]
                 ])

Now, we can use our QuAM to 'predict' labels for new data points using the [predict](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.predict) method of the `dtree` object of the DecisionTreeClassifier class we trained (fitted):

In [None]:
yhat_new = dtree.predict(X_new)
display(yhat_new)

We want to display class names as they are shown in the original data and we do that by using the inverse_transform method of the preprocessing object we transformed the labels with. We can get a list of class labels this way:

In [None]:
class_names = y_preprocess.inverse_transform(np.arange(y.max() + 1))

Now, we can see the original label names for our predictions:

In [None]:
display(class_names[yhat_new])

### Visualizing

Let's visualize our tree. We will use the program [Graphviz](https://graphviz.org/) (and the Python package [graphviz](https://pypi.org/project/graphviz/), which provides a Python interface to Graphviz) to generate the visualization. In order to do that, we have to generate an appropriate description of the decision tree we just created, in what is called DOT format, using the [export_graphviz](https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html) function implemented in scikit-learn's tree submodule, and convert the DOT-formatted data to a graph with Graphviz. Then, we can display our graph. Let us import the required packages first:

In [None]:
import graphviz
from sklearn.tree import export_graphviz

Then, we can use the packages we imported to visualize:

In [None]:
dot_data = export_graphviz(dtree,
                           out_file=None, 
                           class_names=class_names.tolist(),
                           feature_names=feature_names,  
                           filled=True,
                           rounded=True,  
                           special_characters=True,
                           rotate=True)  

display(graphviz.Source(dot_data))
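
If Graphviz is not available in your environment, scikit-learn's tree submodule also has an `export_text` function that prints the same structure as indented text. Here is a minimal sketch on a made-up four-point dataset (the `toy` names and the choice of feature names are ours, just for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up toy data: the label simply copies the first feature.
X_toy = np.array([[0., 0.],
                  [0., 1.],
                  [1., 0.],
                  [1., 1.]])
y_toy = np.array([0, 0, 1, 1])

toy_tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# One line per node; indentation shows depth, leaves show the predicted class.
text_view = export_text(toy_tree, feature_names=['Easy', 'AI'])
print(text_view)
```

The same call should work on `dtree` with `feature_names=feature_names`, although the output gets long for deeper trees.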

We can also do a (maybe) more informative kind of visualization, showing us the distribution of points in different regions in the tree.

For that, we need the [dtreeviz](https://github.com/parrt/dtreeviz) package, which is not included in the Colab environment by default, so we install it:

In [None]:
!pip install dtreeviz

Now, we use the dtreeviz function in the trees submodule of dtreeviz to create the visualization:

In [None]:
from dtreeviz.trees import dtreeviz

In [None]:
viz = dtreeviz(tree_model=dtree,
               x_data=X,
               y_data=y,
               target_name=label_name,
               feature_names=np.array(feature_names),
               class_names=class_names.tolist(),
               orientation ='LR',
               scale=2.0)              
display(viz)

Now, we can also visualize what happens when we want to make a prediction for a new data point:

In [None]:
viz = dtreeviz(tree_model=dtree,
               x_data=X,
               y_data=y,
               target_name=label_name,
               feature_names=np.array(feature_names),
               class_names=class_names.tolist(),
               orientation ='LR',
               X=X_new[0],
               scale=2.0)              
display(viz)

We use the explain_prediction_path function in the trees submodule of dtreeviz to generate an explanation of the prediction generated by the decision tree:

In [None]:
from dtreeviz.trees import explain_prediction_path

In [None]:
print(explain_prediction_path(tree_model=dtree,
                              x=X_new[0],
                              feature_names=np.array(feature_names),
                              explanation_type="plain_english"))

We can also see how important the different features were for the QuAM's prediction:

In [None]:
print(explain_prediction_path(tree_model=dtree,
                              x=X_new[0],
                              feature_names=np.array(feature_names),
                              explanation_type="sklearn_default"))

Finally (for this data), here is a dynamic visualization showing how the tree is built, step-by-step:

In [None]:
#@title Interactive Visualizer: How is the tree built? { run: "auto" }

slider = 12 #@param {type:"slider", min:2, max:15, step:1}
display(slider)

dtree = DecisionTreeClassifier(max_leaf_nodes=slider)
dtree.fit(X, y)

dot_data = export_graphviz(dtree,
                           out_file=None, 
                           class_names=class_names.tolist(),
                           feature_names=feature_names,  
                           filled=True,
                           rounded=True,  
                           special_characters=True,
                           rotate=True)  

graph = graphviz.Source(dot_data)  
display(graph)

viz = dtreeviz(tree_model=dtree,
               x_data=X,
               y_data=y,
               target_name=label_name,
               feature_names=np.array(feature_names),
               class_names=class_names.tolist(),
               scale=2.0)              
display(viz)

# The Iris Dataset

Let's try building a decision tree classifier on real-world data. Let's use the Iris dataset we talked about in the lab for module 2:

http://archive.ics.uci.edu/ml/datasets/iris

## Getting the data

Again, we use the get function from the requests package, as well as a [StringIO](https://docs.python.org/3/library/io.html?highlight=stringio#io.StringIO) object (part of Python's [io](https://docs.python.org/3/library/io.html) module), to fetch the data off the web and convert it to a text string object:

In [None]:
from requests import get
from io import StringIO

Now, we can read the data and put it in a DataFrame:

In [None]:
url="http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
file_bytes = get(url).content
data_file = StringIO(file_bytes.decode('utf-8'))

feature_names_iris=["Sepal Length (cm)", "Sepal Width (cm)", "Petal Length (cm)", "Petal Width (cm)"]
label_name_iris = "Class"

column_names_iris = feature_names_iris + [label_name_iris]

iris_df = pd.read_csv(data_file, names=column_names_iris)

This is the data:

In [None]:
display(iris_df)
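
If the download is unavailable, scikit-learn ships its own copy of the Iris data, which can serve as a stand-in. The sketch below builds a similar DataFrame from it; the name `iris_local_df` is ours, and note that this copy uses lowercase column names and class names like `setosa` rather than `Iris-setosa`, so it is not a drop-in replacement for `iris_df`:

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of plain NumPy arrays.
iris_bunch = load_iris(as_frame=True)

# The bundled frame has the four feature columns plus an integer 'target' column.
iris_local_df = iris_bunch.frame.drop(columns=["target"])

# Map the integer targets back to species names to mimic the "Class" column.
iris_local_df["Class"] = iris_bunch.target_names[iris_bunch.target.to_numpy()]

print(iris_local_df.head())
print(iris_local_df["Class"].value_counts())
```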

## (Building and Visualizing)

We can build a decision tree with all four features. Here, however, we also want to visualize the space in which the data lies (apart from the tree), and to do that we need data with at most 3 dimensions.

In [None]:
#@title
# X_iris = iris_df[feature_names_iris].values
# y_iris = iris_df[label_name_iris].values

In [None]:
#@title
# dtree_iris = DecisionTreeClassifier()
# dtree_iris.fit(X_iris, y_iris)

# dot_data = StringIO()
# export_graphviz(dtree_iris, 
#                 out_file=dot_data, 
#                 class_names=np.unique(y_iris),
#                 feature_names=feature_names,
#                 filled=True,
#                 rounded=True,
#                 special_characters=True)
# graph = graph_from_dot_data(dot_data.getvalue())
# Image(graph.create_png())

## Choosing Two Features

We do a 2D visualization first. So, let's choose two features:

In [None]:
feature_names_2t=["Petal Length (cm)", "Petal Width (cm)"]

...and get NumPy arrays for the data:

In [None]:
y_preprocess_iris = LabelEncoder()

X_iris_2t = iris_df[feature_names_2t].values
y_iris = y_preprocess_iris.fit_transform(iris_df[label_name_iris].values)

class_names_iris = y_preprocess_iris.inverse_transform(np.arange(y_iris.max() + 1)).astype(str).tolist()

Now, we can build the decision tree:

In [None]:
dtree_iris_2t = DecisionTreeClassifier()
dtree_iris_2t.fit(X_iris_2t, y_iris)

Now, we can visualize the tree:

In [None]:
dot_data = export_graphviz(dtree_iris_2t,
                           out_file=None, 
                           class_names=class_names_iris,
                           feature_names=feature_names_2t,  
                           filled=True,
                           rounded=True,  
                           special_characters=True,
                           rotate=True)  

display(graphviz.Source(dot_data))

viz = dtreeviz(tree_model=dtree_iris_2t,
               x_data=X_iris_2t,
               y_data=y_iris,
               target_name=label_name_iris,
               feature_names=np.array(feature_names_2t),
               class_names=class_names_iris,
               orientation ='LR',
               scale=2.0)              
display(viz)

## Surfaces and Regions

Now, let's try to visualize the space created by a decision tree classifier. For that we use plotly. Here, we are using an older function, and you may be able to accomplish the same thing with plotly express, but this works for now:

In [None]:
import plotly.graph_objects as go

In [None]:
#@title
# fig = go.Figure()

# fig.add_trace(go.Scatter(x=X_iris_2t[:, 0],
#                         y=X_iris_2t[:, 1],
#                         mode='markers',
#                         marker=dict(color=yn_iris_2t,
#                                     colorscale=colorscale,
#                                     size=10
#                                    )
#                        )
#              )
# fig.update_layout(xaxis=dict(range=[0, 7], gridwidth=1, zeroline=True, zerolinecolor='LightGrey', nticks=8), 
#                  yaxis=dict(range=[0, 3], gridwidth=1, zeroline=True, zerolinecolor='LightGrey', nticks=3), 
#                  plot_bgcolor="white"
#                 )
# fig.show()

Then, we can visualize the space using a scatterplot and a heatmap overlaid on each other:

In [None]:
fig = go.Figure()

colorscale = [(0.00,   "red"), (0.32,   "red"), 
              (0.33, "green"), (0.66, "green"), 
              (0.67,  "blue"), (1.00,  "blue")
             ]
             
fig.add_trace(go.Scatter(x=X_iris_2t[:, 0],
                        y=X_iris_2t[:, 1],
                        mode='markers',
                        marker=dict(color=y_iris,
                                    colorscale=colorscale,
                                    size=10
                                   )
                       )
             )

red_tr =   "rgba(255,   0,   0, 0.25)" 
green_tr = "rgba(  0, 255,   0, 0.25)"
blue_tr =  "rgba(  0,   0, 255, 0.25)"
colorscale_tr = [(0.00,   red_tr), (0.32,   red_tr), 
                 (0.33, green_tr), (0.66, green_tr), 
                 (0.67,  blue_tr), (1.00,  blue_tr)
                ]

x_iris_2t_mins = X_iris_2t.min(axis=0)
x_iris_2t_maxs = X_iris_2t.max(axis=0)

x_iris_2t1_vis1 = np.linspace(x_iris_2t_mins[0], x_iris_2t_maxs[0], 60)
x_iris_2t2_vis1 = np.linspace(x_iris_2t_mins[1], x_iris_2t_maxs[1], 25)

XX_iris_2t1_vis, XX_iris_2t2_vis = np.meshgrid(x_iris_2t1_vis1, x_iris_2t2_vis1)

x_iris_2t1_vis = XX_iris_2t1_vis.flatten()
x_iris_2t2_vis = XX_iris_2t2_vis.flatten()

X_iris_2t_vis = np.c_[x_iris_2t1_vis, x_iris_2t2_vis]

yn_iris_2t_vis = dtree_iris_2t.predict(X_iris_2t_vis)

YYn_iris_2t_vis = yn_iris_2t_vis.reshape(XX_iris_2t1_vis.shape)

fig.add_trace(go.Heatmap(x=x_iris_2t1_vis1,
                        y=x_iris_2t2_vis1,
                        z=YYn_iris_2t_vis,
                        zmin=yn_iris_2t_vis.min(),
                        zmax=yn_iris_2t_vis.max(),  
                        colorscale=colorscale_tr
                       )
            )
fig.show()
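
The heatmap above relies on a common trick: predict a label for every point of a regular grid, then reshape the predictions back into the grid's shape. Here is a stripped-down sketch of just that step, without any plotting. It uses scikit-learn's bundled copy of Iris so that it runs on its own, and the variable names are ours:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Petal length and petal width are columns 2 and 3 of the bundled data.
iris = load_iris()
X_petal = iris.data[:, 2:4]
y_petal = iris.target
grid_tree = DecisionTreeClassifier(random_state=0).fit(X_petal, y_petal)

# 60 positions along the first axis, 25 along the second.
x_axis = np.linspace(X_petal[:, 0].min(), X_petal[:, 0].max(), 60)
y_axis = np.linspace(X_petal[:, 1].min(), X_petal[:, 1].max(), 25)

# meshgrid returns two (25, 60) arrays: rows follow y_axis, columns follow x_axis.
XX, YY = np.meshgrid(x_axis, y_axis)

# Flatten the grid into a (1500, 2) table of points, predict, and fold back.
grid_points = np.c_[XX.ravel(), YY.ravel()]
ZZ = grid_tree.predict(grid_points).reshape(XX.shape)

print(XX.shape, grid_points.shape, ZZ.shape)
```

Each entry of `ZZ` is the class predicted at the matching grid position, which is the kind of 2D array a heatmap expects for `z`.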

## Three Features: 3D

We can do the same thing with 3 features and in 3D space:

In [None]:
feature_names_3t = ["Sepal Length (cm)", "Petal Length (cm)", "Petal Width (cm)"]

In [None]:
X_iris_3t = iris_df[feature_names_3t].values

In [None]:
dtree_iris_3t = DecisionTreeClassifier()
dtree_iris_3t.fit(X_iris_3t, y_iris)

In [None]:
dot_data = export_graphviz(dtree_iris_3t,
                           out_file=None, 
                           class_names=class_names_iris,
                           feature_names=feature_names_3t,  
                           filled=True,
                           rounded=True,  
                           special_characters=True,
                           rotate=True)  

display(graphviz.Source(dot_data))

viz = dtreeviz(tree_model=dtree_iris_3t,
               x_data=X_iris_3t,
               y_data=y_iris,
               target_name=label_name_iris,
               feature_names=np.array(feature_names_3t),
               class_names=class_names_iris,
               orientation ='LR',
               scale=2.0)              
display(viz)

Now, we will use a 3D scatter plot:

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter3d(x=X_iris_3t[:, 0],
                          y=X_iris_3t[:, 1],
                          z=X_iris_3t[:, 2],
                          mode='markers',
                          marker=dict(color=y_iris,
                                      colorscale=colorscale,
                                      size=5
                                     )
                         )
             )

fig.show()

Let's visualize the space as well by overlaying a 3D volume plot on top:

In [None]:
x_iris_3t_mins = X_iris_3t.min(axis=0)
x_iris_3t_maxs = X_iris_3t.max(axis=0)

x_iris_3t1_vis = np.linspace(x_iris_3t_mins[0], x_iris_3t_maxs[0], 20)
x_iris_3t2_vis = np.linspace(x_iris_3t_mins[1], x_iris_3t_maxs[1], 20)
x_iris_3t3_vis = np.linspace(x_iris_3t_mins[2], x_iris_3t_maxs[2], 20)

XX_iris_3t1_vis, XX_iris_3t2_vis, XX_iris_3t3_vis = \
  np.meshgrid(x_iris_3t1_vis, x_iris_3t2_vis, x_iris_3t3_vis)

x_iris_3t1_vis = XX_iris_3t1_vis.flatten()
x_iris_3t2_vis = XX_iris_3t2_vis.flatten()
x_iris_3t3_vis = XX_iris_3t3_vis.flatten()

X_iris_3t_vis = np.c_[x_iris_3t1_vis, x_iris_3t2_vis, x_iris_3t3_vis]

yn_iris_3t_vis = dtree_iris_3t.predict(X_iris_3t_vis)

fig.add_trace(go.Volume(x=x_iris_3t1_vis,
                       y=x_iris_3t2_vis,
                       z=x_iris_3t3_vis,
                       value=yn_iris_3t_vis,
                       isomin=0,
                       isomax=2,
                       opacity=0.25,
                       surface_count=20,
                       colorscale=colorscale,
                       showscale=False
                      )
            )

fig.show()

Finally, here, let's see how the space is split iteratively when building a decision tree using a dynamic visualization:

In [None]:
#@title Interactive Visualizer { run: "auto" }

slider = 10 #@param {type:"slider", min:2, max:10, step:1}
display(slider)

dtree_iris_3tl = DecisionTreeClassifier(max_leaf_nodes=slider)
dtree_iris_3tl.fit(X_iris_3t, y_iris)

dot_data = export_graphviz(dtree_iris_3tl,
                           out_file=None, 
                           class_names=class_names_iris,
                           feature_names=feature_names_3t,  
                           filled=True,
                           rounded=True,  
                           special_characters=True,
                           rotate=True)  

viz = dtreeviz(tree_model=dtree_iris_3tl,
               x_data=X_iris_3t,
               y_data=y_iris,
               target_name=label_name_iris,
               feature_names=np.array(feature_names_3t),
               class_names=class_names_iris,
               orientation ='LR',
               scale=2.0)
              
yn_iris_3tl_vis = dtree_iris_3tl.predict(X_iris_3t_vis)

fig = go.Figure()
fig.add_trace(go.Scatter3d(x=X_iris_3t[:, 0],
                          y=X_iris_3t[:, 1],
                          z=X_iris_3t[:, 2],
                          mode='markers',
                          marker=dict(color=y_iris,
                                      colorscale=colorscale,
                                      size=5
                                     )
                         )
             )
fig.add_trace(go.Volume(x=x_iris_3t1_vis,
                       y=x_iris_3t2_vis,
                       z=x_iris_3t3_vis,
                       value=yn_iris_3tl_vis,
                       isomin=0,
                       isomax=2,
                       opacity=0.25,
                       surface_count=20,
                       colorscale=colorscale,
                       showscale=False
                      )
            )

fig.show()
display(graphviz.Source(dot_data))
display(viz)
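
The slider works by capping the tree size with `max_leaf_nodes`. A non-interactive way to see its effect is to loop over the cap and print how many leaves the tree ends up with, along with its training accuracy. This is a minimal sketch on scikit-learn's bundled copy of Iris with all four features; the variable names are ours:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = load_iris(return_X_y=True)

leaf_counts = {}
for max_leaves in range(2, 11):
    capped_tree = DecisionTreeClassifier(max_leaf_nodes=max_leaves, random_state=0)
    capped_tree.fit(X_all, y_all)
    # The tree may stop early if it runs out of useful splits before the cap.
    leaf_counts[max_leaves] = capped_tree.get_n_leaves()
    train_accuracy = capped_tree.score(X_all, y_all)
    print(max_leaves, leaf_counts[max_leaves], round(train_accuracy, 3))
```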

# The Heart Disease Dataset

Now, let's use a more "real" real-world dataset as well. We use the Heart Disease dataset from the UCI dataset repository:

https://archive.ics.uci.edu/ml/datasets/Heart+Disease

## Getting Data

We get the data like before. This data is composed of four different sources, each stored in a separate file:

In [None]:
url="https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/cleveland.data"
file_bytes = get(url).content

We can now try to display the file as text:

In [None]:
data_file = StringIO(file_bytes.decode('utf-8'))

As you can see, that fails. This is because there are errors in the file, a very common thing with real-world datasets. So, let's ignore the errors while decoding:

In [None]:
data_file = StringIO(file_bytes.decode('utf-8', 'ignore'))

Now, we can see the file:

In [None]:
print(data_file.getvalue())
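
The second argument to `decode` is its error-handling mode. Here is a small sketch on a made-up byte string (not taken from the real file) that contains one invalid byte, showing what the default strict mode, `'ignore'`, and `'replace'` each do:

```python
# Made-up bytes with a stray 0xFF, which is never valid in UTF-8.
raw = b"63 1 1 145 \xff233\n"

# Default (strict) mode raises on the first invalid byte.
strict_failed = False
try:
    raw.decode('utf-8')
except UnicodeDecodeError as err:
    strict_failed = True
    print("strict decoding failed:", err)

# 'ignore' silently drops the bytes it cannot decode.
ignored = raw.decode('utf-8', errors='ignore')
print(repr(ignored))

# 'replace' keeps a visible marker (U+FFFD) where the bad byte was.
replaced = raw.decode('utf-8', errors='replace')
print(repr(replaced))
```

Dropping bytes is fine for just looking at a file, but `'replace'` can be the safer choice when you want to know where the damage is.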

In [None]:
url="https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data"
file_bytes = get(url).content
data_file = StringIO(file_bytes.decode('utf-8', 'ignore'))
print(data_file.getvalue())

In [None]:
url="https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/long-beach-va.data"
file_bytes = get(url).content
data_file = StringIO(file_bytes.decode('utf-8', 'ignore'))
print(data_file.getvalue())

In [None]:
url="https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/switzerland.data"
file_bytes = get(url).content
data_file = StringIO(file_bytes.decode('utf-8', 'ignore'))
print(data_file.getvalue())

As you can see, the separator in these raw files is not commas but rather spaces, which we would need to consider when converting them to a DataFrame. Here, we load the processed 'Cleveland' file instead, which is comma-separated:

In [None]:
url="https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
file_bytes = get(url).content
data_file = StringIO(file_bytes.decode('utf-8'))

feature_names_heart=["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal"]
label_name_heart = "num"

column_names_heart = feature_names_heart + [label_name_heart]

heart_df = pd.read_csv(data_file, names=column_names_heart)

...and display:

In [None]:
display(heart_df)

Let's try to see if we have missing values:

In [None]:
rows_with_missing = heart_df.eq("?").any(axis=1)

If a warning appears, it is because some columns are numeric and can't be trivially compared with the string `"?"`. Let's see where our missing values happen:

In [None]:
display(rows_with_missing)

Let's view the actual rows:

In [None]:
display(heart_df[rows_with_missing])
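
Another way to handle the `"?"` markers, sketched here on a made-up three-row sample and not on the real file, is to tell pandas about them when reading via the `na_values` parameter. The affected columns then come out as numeric columns with `NaN` entries, and the usual `isna` and `dropna` methods apply. The `toy` names are ours:

```python
from io import StringIO
import pandas as pd

# Made-up sample in the same comma-separated style, with "?" for missing values.
toy_csv = StringIO("63.0,1.0,0.0,6.0\n"
                   "67.0,1.0,?,3.0\n"
                   "41.0,0.0,0.0,?\n")

toy_df = pd.read_csv(toy_csv, names=["age", "sex", "ca", "thal"], na_values="?")

# Rows with at least one missing value.
toy_rows_with_missing = toy_df.isna().any(axis=1)
print(toy_rows_with_missing)

# Keep only the complete rows.
toy_df_nm = toy_df.dropna()
print(toy_df_nm)
print(toy_df.dtypes)
```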

Now, let us drop the rows with missing values and do some data exploration:

In [None]:
heart_df_nm = heart_df[~rows_with_missing]

In [None]:
import plotly.express as px

In [None]:
fig = px.scatter_matrix(heart_df_nm, dimensions=feature_names_heart, color=label_name_heart)
fig.update_layout(
    autosize=False,
    width=13 * 100,
    height=13 * 100,
    margin=dict(l=0, r=0, t=0, b=0)
)
fig.show()

## Splitting the data

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_heart_nm = heart_df_nm[feature_names_heart].values
y_heart_nm = heart_df_nm[label_name_heart].values

In [None]:
display(X_heart_nm, y_heart_nm)

In [None]:
X_heart_nm_train, X_heart_nm_test, y_heart_nm_train, y_heart_nm_test = \
  train_test_split(X_heart_nm, y_heart_nm, test_size=0.33)
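
Because the split is random, the numbers further down will change from run to run. If you want a repeatable split, `train_test_split` accepts a `random_state`, and its `stratify` parameter keeps the class proportions the same in both parts. Here is a minimal sketch on made-up balanced data (the `toy` names and the seed value are ours):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 30 points, 15 per class.
X_toy = np.arange(60, dtype=float).reshape(30, 2)
y_toy = np.array([0] * 15 + [1] * 15)

X_toy_train, X_toy_test, y_toy_train, y_toy_test = \
  train_test_split(X_toy, y_toy, test_size=0.33,
                   random_state=0,   # fixed seed: same split every run
                   stratify=y_toy)   # preserve the class balance

print(X_toy_train.shape, X_toy_test.shape)
print(np.bincount(y_toy_train), np.bincount(y_toy_test))
```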

## Building a Tree

Now, we can build a decision tree:

In [None]:
dtree_heart_nm = DecisionTreeClassifier()

In [None]:
dtree_heart_nm.fit(X_heart_nm_train, y_heart_nm_train)

In [None]:
class_names_heart = np.unique(heart_df_nm[label_name_heart])

dot_data = export_graphviz(dtree_heart_nm,
                           out_file=None, 
                           class_names=class_names_heart.astype("str").tolist(),
                           feature_names=feature_names_heart,  
                           filled=True,
                           rounded=True,  
                           special_characters=True,
                           rotate=True)  

display(graphviz.Source(dot_data))

# viz = dtreeviz(tree_model=dtree_heart_nm,
#                x_data=X_heart_nm.astype("float"),
#                y_data=y_heart_nm,
#                target_name=label_name_heart,
#                feature_names=np.array(feature_names_heart),
#                class_names=class_names_heart.tolist(),
#                orientation ='LR',
#                scale=2.0)              
# display(viz)

## Evaluation

A basic evaluation metric we have for the task of classification is the *confusion matrix*. A confusion matrix is basically a table that shows, for each true class, how many data points were classified as each class. In short, it shows us the number of correct classifications and misclassifications. Again, we use scikit-learn's implementation for computing confusion matrices:

In [None]:
from sklearn.metrics import confusion_matrix

First, let's classify (predict) using the QuAM we just built. The QuAM object has a method called `predict` (just as it had a `fit` method to train and find the QuAM) which classifies the data.

In [None]:
yhat_heart_nm_train = dtree_heart_nm.predict(X_heart_nm_train)

Now, we can build the confusion matrix:

In [None]:
cm_heart_nm_train = confusion_matrix(y_heart_nm_train, yhat_heart_nm_train)

...and display it:

In [None]:
display(cm_heart_nm_train)
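
To read the matrix: in scikit-learn's convention, rows are the true classes and columns are the predicted classes, so correct classifications sit on the diagonal. A tiny sketch with made-up labels (the `toy` names are ours):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for five points and three classes.
y_toy_true = [0, 0, 1, 1, 2]
y_toy_pred = [0, 1, 1, 1, 2]

cm_toy = confusion_matrix(y_toy_true, y_toy_pred)
print(cm_toy)

# Entry [0, 1] counts points of true class 0 that were predicted as class 1.
print("class 0 predicted as class 1:", cm_toy[0, 1])

# The diagonal holds the correct classifications.
print("accuracy:", np.trace(cm_toy) / cm_toy.sum())
```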

scikit-learn also has a class, [ConfusionMatrixDisplay](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html), implemented in the same metrics submodule, which gives you the confusion matrix in a nicer representation. (Older scikit-learn releases offered a `plot_confusion_matrix` function for this; it has since been removed in favour of this class.)

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

Its `from_estimator` method shows a heatmap-like table:

In [None]:
ConfusionMatrixDisplay.from_estimator(dtree_heart_nm, X_heart_nm_train, y_heart_nm_train)

Now, let's do the confusion matrix for test data as well:

In [None]:
ConfusionMatrixDisplay.from_estimator(dtree_heart_nm, X_heart_nm_test, y_heart_nm_test)

What do you see?

Let's also do a classification evaluation summary on both training and test data:

In [None]:
from sklearn.metrics import classification_report

In [None]:
print("Results on training data:")
print(classification_report(y_heart_nm_train, yhat_heart_nm_train))
print()
yhat_heart_nm_test = dtree_heart_nm.predict(X_heart_nm_test)
print("Results on test data:")
print(classification_report(y_heart_nm_test, yhat_heart_nm_test))

## The learning curve

To study the bias and variance of our model, let's do a learning curve as well:

In [None]:
from sklearn.model_selection import learning_curve

In [None]:
data_sizes, training_scores, validation_scores = \
  learning_curve(DecisionTreeClassifier(), X_heart_nm_train,
                 y_heart_nm_train, cv=10, scoring='accuracy',
                 train_sizes=np.linspace(0.01, 1.0, 51))

In [None]:
training_mean = training_scores.mean(axis=1) 
training_standard_deviation = training_scores.std(axis=1) 

In [None]:
validation_mean = validation_scores.mean(axis=1) 
validation_standard_deviation = validation_scores.std(axis=1)
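
The `axis=1` in the cells above averages over the cross-validation folds: `learning_curve` returns score arrays with one row per training-set size and one column per fold. Here is a small sketch that shows these shapes on scikit-learn's bundled copy of Iris, with fewer sizes and folds so that it runs quickly. The `toy` names, `shuffle=True`, and the seed are our choices:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = load_iris(return_X_y=True)

toy_sizes, toy_train_scores, toy_valid_scores = \
  learning_curve(DecisionTreeClassifier(random_state=0), X_all, y_all,
                 cv=5, scoring='accuracy',
                 train_sizes=np.linspace(0.2, 1.0, 5),
                 shuffle=True, random_state=0)

# 5 training-set sizes (rows) by 5 folds (columns).
print(toy_sizes)
print(toy_train_scores.shape, toy_valid_scores.shape)

# Averaging over axis=1 leaves one mean score per training-set size.
print(toy_valid_scores.mean(axis=1))
```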

In [None]:
import plotly.graph_objects as go

In [None]:
fig = go.Figure()

fig.add_trace(go.Scatter(x=data_sizes, 
                        y=training_mean,
                        mode='lines',
                        name='Training',
                        line=dict(color='red')))
fig.add_trace(go.Scatter(x=data_sizes, 
                        y=training_mean - training_standard_deviation,
                        mode='lines',
                        name='Training lower bound',
                        line=dict(width=0, color='red'),
                        showlegend=False))
fig.add_trace(go.Scatter(x=data_sizes, 
                        y=training_mean + training_standard_deviation,
                        mode='lines',
                        name='Training upper bound',
                        line=dict(width=0, color='red'),
                        fill='tonexty',
                        fillcolor='rgba(255, 0, 0, 0.3)',
                        showlegend=False))

fig.add_trace(go.Scatter(x=data_sizes, 
                        y=validation_mean,
                        mode='lines',
                        name='Validation',
                        line=dict(color='blue')))
fig.add_trace(go.Scatter(x=data_sizes, 
                        y=validation_mean - validation_standard_deviation,
                        mode='lines',
                        name='Validation lower bound',
                        line=dict(width=0, color='blue'),
                        showlegend=False))
fig.add_trace(go.Scatter(x=data_sizes, 
                        y=validation_mean + validation_standard_deviation,
                        mode='lines',
                        name='Validation upper bound',
                        line=dict(width=0, color='blue'),
                        fill='tonexty',
                        fillcolor='rgba(0, 0, 255, 0.3)',
                        showlegend=False))

fig.update_layout(title='Learning curve',
                 xaxis_title='Dataset size',
                 yaxis_title='Accuracy')
fig.show()

What is notable here is not only that we are consistently in the high-bias range, but also that we have high variance!

That's all Folks!