# Exercise 2: Decision Trees

In this assignment you will implement a Decision Tree algorithm as learned in class.

## Read the following instructions carefully:

1. This Jupyter notebook contains all the step-by-step instructions needed for this exercise.
2. Write vectorized code whenever possible.
3. You are responsible for the correctness of your code and should add as many tests as you see fit. Tests will not be graded nor checked.
4. Write your functions in the provided `hw2.py` python module only. All the logic you write is imported and used in this jupyter notebook.
5. You are allowed to use functions and methods from the [Python Standard Library](https://docs.python.org/3/library/) and [numpy](https://www.numpy.org/devdocs/reference/) only. Any other imports detected in `hw2.py` will earn you the grade of 0, even if you only used them for testing.
6. Your code must run without errors. During the environment setup, you were given a specific version of `numpy` to install. Changes to the configuration we provided are at your own risk. Code that cannot run will also earn you the grade of 0.
7. Write your own code. Cheating will not be tolerated. 
8. Submission includes the `hw2.py` file and this notebook. Answers to qualitative questions should be written in markdown cells (with $\LaTeX$ support).
9. You are allowed to include additional functions.
10. Submission: zip only the completed jupyter notebook and the python file `hw2.py`. Do not include the data or any directories. Name the file `ID1_ID2.zip` and submit only one copy of the assignment.

## In this exercise you will perform the following:
1. Practice OOP in python.
2. Implement two impurity measures: Gini and Entropy.
3. Implement a decision tree from scratch.
4. Prune the tree to achieve better results.
5. Visualize your results and the tree.

In [238]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from hw2 import * # this imports all functions from hw2.

# make matplotlib figures appear inline in the notebook
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# Make the notebook automatically reload external python modules
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Warmup - OOP in python

Our decision tree will be implemented using a dedicated python class. Python classes are very similar to classes in Java.


You can use the following [site](https://jeffknupp.com/blog/2014/06/18/improve-your-python-python-classes-and-object-oriented-programming/) to learn about classes in python.

In [239]:
class Node(object):
    def __init__(self, data):
        self.data = data
        self.children = []

    def add_child(self, node):
        self.children.append(node)

In [240]:
n = Node(5)
p = Node(6)
q = Node(7)
n.add_child(p)
n.add_child(q)
n.children

[<__main__.Node at 0x1a202277f0>, <__main__.Node at 0x1a202d8a20>]

## Data preprocessing

We will use the breast cancer dataset that ships with sklearn, a popular machine learning and data science library in Python. In this example, our dataset will be a single matrix with the **labels in the last column**. Note that you are not allowed to use any additional functions from sklearn.

In [241]:
from sklearn import datasets
from sklearn.model_selection import train_test_split

# load dataset
X, y = datasets.load_breast_cancer(return_X_y = True)
X = np.column_stack([X,y]) # the last column holds the labels

# split dataset
X_train, X_test = train_test_split(X, random_state=99)

print("Training dataset shape: ", X_train.shape)
print("Testing dataset shape: ", X_test.shape)

Training dataset shape:  (426, 31)
Testing dataset shape:  (143, 31)


In [242]:
# initialize class example dataset
class_dataset = np.array([[0,0,1,0,0],
                   [0,0,1,1,0],
                   [1,0,1,0,1],
                   [2,1,1,0,1],
                   [2,2,0,0,1],
                   [2,2,0,1,0],
                   [1,2,0,1,1],
                   [0,1,1,0,0],
                   [0,2,0,0,1],
                   [2,1,0,0,1],
                   [0,1,0,1,1],
                   [1,1,1,1,1],
                   [1,0,0,0,1],
                   [2,1,1,1,0]])

## Impurity Measures

Implement the functions `calc_gini` (5 points) and `calc_entropy` (5 points) in the python file `hw2.py`. You are encouraged to test your implementation using the cell below.
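
As a point of reference, here is a minimal sketch of the two impurity measures, assuming (as in this notebook) that the labels sit in the last column of the data matrix. The names `gini_impurity` and `entropy_impurity` are hypothetical and are not the required `hw2.py` signatures; your implementation may be organized differently.

```python
import numpy as np

def gini_impurity(data):
    """Gini impurity, 1 - sum(p_i^2), of the labels in the last column."""
    _, counts = np.unique(data[:, -1], return_counts=True)
    probs = counts / counts.sum()
    return 1.0 - np.sum(probs ** 2)

def entropy_impurity(data):
    """Entropy, -sum(p_i * log2(p_i)), of the labels in the last column."""
    _, counts = np.unique(data[:, -1], return_counts=True)
    probs = counts / counts.sum()  # counts are never zero, so log2 is safe
    return -np.sum(probs * np.log2(probs))

# Toy data (hypothetical): two instances of each class.
toy = np.array([[0, 1], [1, 1], [2, 0], [3, 0]])
print(gini_impurity(toy))     # 0.5
print(entropy_impurity(toy))  # 1.0
```

Both functions are vectorized: the class frequencies come from a single `np.unique` call rather than a Python loop over the rows.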

In [243]:
# data from class entropy example
# entropy test 
entropy = calc_entropy(class_dataset)
print(entropy)
print('entropy V' if np.isclose(entropy, 0.940, rtol=1e-01) else 'entropy X')

# gini test
gini = calc_gini(class_dataset)
print(gini)
print('gini V' if np.isclose(gini, 0.450, rtol=1e-01) else 'gini X')

0.9402859586706311
entropy V
0.4591836734693877
gini V


## Building a Decision Tree

Use a Python class to construct the decision tree (look at the `DecisionNode` class in the python file `hw2.py`). Your class should support the following functionality:

1. Initializing a node for a decision tree. You will need to use several class methods and class attributes and you are free to use them as you see fit. We recommend that every node hold the feature and value used for the split, as well as its children.
2. Your code should support both Gini and Entropy as impurity measures. 
3. The provided data includes continuous features. In this exercise, create at most a single split for each node of the tree. The thresholds you need to use are the averages of each consecutive pair of (sorted) values. For example, assume some feature contains the following values: [1,2,3,4,5]. You should use the following thresholds: [1.5, 2.5, 3.5, 4.5].
4. When constructing the tree, test all possible thresholds for each feature. The stopping criterion is purity: keep splitting until every leaf is pure.
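
The threshold rule in item 3 can be sketched as follows. This is only an illustration: `midpoint_thresholds` is a hypothetical name, and the choice to drop duplicate values with `np.unique` before averaging is an assumption that the instructions above do not spell out.

```python
import numpy as np

def midpoint_thresholds(data, feature_index):
    """Averages of consecutive sorted values of one feature column."""
    # np.unique returns the values sorted, with duplicates removed (assumption).
    values = np.unique(data[:, feature_index])
    return (values[:-1] + values[1:]) / 2.0

example = np.column_stack([[1, 2, 3, 4, 5], np.zeros(5)])
print(midpoint_thresholds(example, 0))  # [1.5 2.5 3.5 4.5]
```

A feature with a single distinct value yields an empty array, so it offers no candidate split.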

Complete the class `DecisionNode` in the python file `hw2.py`. The structure of this class is entirely up to you. Complete the function `build_tree` in the python file `hw2.py`. This function should get the training dataset and the impurity function as inputs, initialize a root for the decision tree and construct the tree according to the procedure you learned in class. (30 points)
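
The core step of tree construction is choosing the split with the highest information gain, i.e. the largest drop from the parent's impurity to the size-weighted impurity of the two children. The sketch below shows one possible way to search for it; `best_split` and `entropy` are hypothetical helpers, not the functions graded in `hw2.py`, and the tie-breaking rule (first best split wins) is an assumption.

```python
import numpy as np

def entropy(data):
    """Entropy of the labels in the last column (illustrative helper)."""
    _, counts = np.unique(data[:, -1], return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

def best_split(data, impurity):
    """Return (gain, feature_index, threshold) of the best binary split."""
    parent = impurity(data)
    best = (0.0, None, None)
    for feature in range(data.shape[1] - 1):      # last column holds labels
        values = np.unique(data[:, feature])
        for threshold in (values[:-1] + values[1:]) / 2.0:
            mask = data[:, feature] <= threshold
            left, right = data[mask], data[~mask]
            weighted = (len(left) * impurity(left)
                        + len(right) * impurity(right)) / len(data)
            gain = parent - weighted
            if gain > best[0]:
                best = (float(gain), feature, float(threshold))
    return best

# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
toy = np.array([[1, 10, 0], [2, 20, 0], [3, 10, 1], [4, 20, 1]])
print(best_split(toy, entropy))  # (1.0, 0, 2.5)
```

A recursive `build_tree` would apply this search at each node, split the data on the returned feature and threshold, and stop once a node's impurity is zero.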

In [244]:
# threshold function test
test_values = [1,2,3,4,5]
test_values = np.column_stack([test_values, np.zeros(len(test_values))])
test_thresholds = build_thresholds_for_attribute_values(test_values, 0)
print(test_thresholds)

[1.5, 2.5, 3.5, 4.5]


In [245]:
# test settings definitions on X_train data
test_attribute_index = 8
test_threshold = 0.23
test_impurity_function = calc_entropy

In [246]:
# thresholds function test on X_train data
build_thresholds_for_attribute_values(X_train, test_attribute_index)

[0.1185,
 0.12090000000000001,
 0.12175,
 0.12625,
 0.13065,
 0.1325,
 0.13455,
 0.13495000000000001,
 0.13515,
 0.13565,
 0.13625,
 0.13690000000000002,
 0.13774999999999998,
 0.1384,
 0.1387,
 0.13965,
 0.1407,
 0.14100000000000001,
 0.1416,
 0.14215,
 0.14250000000000002,
 0.14300000000000002,
 0.1437,
 0.14455,
 0.14515,
 0.14565,
 0.14615,
 0.14650000000000002,
 0.14665,
 0.14695,
 0.14725,
 0.14795,
 0.14865,
 0.1488,
 0.1492,
 0.14955000000000002,
 0.14975,
 0.15025,
 0.1507,
 0.15095,
 0.15125,
 0.15145,
 0.15155000000000002,
 0.15185,
 0.1527,
 0.15339999999999998,
 0.15360000000000001,
 0.15375,
 0.15385,
 0.15410000000000001,
 0.15435,
 0.1545,
 0.1548,
 0.1552,
 0.15545,
 0.1558,
 0.15625,
 0.15645,
 0.15655,
 0.15685,
 0.1572,
 0.15765,
 0.15810000000000002,
 0.15825,
 0.15835,
 0.15845,
 0.15855,
 0.1587,
 0.15885,
 0.15895,
 0.15915,
 0.15935,
 0.15949999999999998,
 0.15985,
 0.16015000000000001,
 0.16045,
 0.161,
 0.1614,
 0.16155,
 0.1617,
 0.16185,
 0.1619499999999999

In [247]:
# weighted average test on X_train data
calc_weighted_average_by_attribute(X_train, test_attribute_index, test_threshold, test_impurity_function)

0.9546904191180343

In [248]:
# data split test on X_train data
test_group_a, test_group_b, group_a_size, group_b_size = split_data(X_train, test_attribute_index, test_threshold)
print("(%d, %d)" % (group_a_size, group_b_size))
second_split_attribute_index = 5
second_split_test_threshold = 0.17
test_group_a, test_group_b, group_a_size, group_b_size = split_data(test_group_a, second_split_attribute_index, second_split_test_threshold)
print("(%d, %d)" % (group_a_size, group_b_size))

(400, 26)
(367, 33)


In [249]:
# test attribute column removal
test_group_a = remove_attribute_column(test_group_a, 1)

In [250]:
# test find best information gain on class example dataset
find_best_information_gain_params(class_dataset, calc_entropy)

(0.15183550136234159, 2, 0.5)

In [251]:
# Python supports passing a function as an argument to another function.
tree_gini = build_tree(data=X_train, impurity=calc_gini) 
tree_entropy = build_tree(data=X_train, impurity=calc_entropy)

## Tree evaluation

Complete the functions `predict` and `calc_accuracy` in the python file `hw2.py`. You are allowed to implement this functionality as a class method.

After building both trees using the training set (using Gini and Entropy as impurity measures), you should calculate the accuracy on the test set and print the measure that gave you the best test accuracy. For the rest of the exercise, use that impurity measure. (10 points)
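
For illustration, prediction amounts to walking from the root to a leaf, and accuracy is the percentage of instances whose predicted label matches the last column. The sketch below assumes a hypothetical binary node class, `SketchNode`, with `feature`, `threshold`, `left`, `right` and `prediction` attributes; your `DecisionNode` may be structured differently.

```python
import numpy as np

class SketchNode:
    """Hypothetical node: internal nodes hold a split, leaves a prediction."""
    def __init__(self, feature=None, threshold=None,
                 left=None, right=None, prediction=None):
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right
        self.prediction = prediction  # None for internal nodes

def predict_one(node, instance):
    """Follow the splits until a leaf is reached and return its label."""
    while node.prediction is None:
        go_left = instance[node.feature] <= node.threshold
        node = node.left if go_left else node.right
    return node.prediction

def accuracy_percent(root, data):
    """Percentage of rows whose prediction equals the last-column label."""
    predictions = np.array([predict_one(root, row) for row in data])
    return 100.0 * np.mean(predictions == data[:, -1])

# A one-split tree and four labelled rows (all hypothetical).
stump = SketchNode(feature=0, threshold=2.5,
                   left=SketchNode(prediction=0.0),
                   right=SketchNode(prediction=1.0))
rows = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 1.0], [4.0, 0.0]])
print(accuracy_percent(stump, rows))  # 75.0
```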

In [252]:
# test predict function on X_test data
test_instance = X_test[5, :]
prediction = predict(tree_entropy, test_instance)
print(prediction)

0.0


In [253]:
# test calc_accuracy on X_test data
gini_accuracy = calc_accuracy(tree_gini, X_test)
entropy_accuracy = calc_accuracy(tree_entropy, X_test)
print("Gini Accuracy:", gini_accuracy)
print("Entropy Accuracy:", entropy_accuracy)

Gini Accuracy: 79.02097902097903
Entropy Accuracy: 93.00699300699301


## Chi square pre-pruning

Consider the following p-value cut-off values: [1 (no pruning), 0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00001]. For each value, construct a tree and prune it according to the cut-off value. Next, calculate the training and testing accuracy. On a single plot, draw the training and testing accuracy as a function of the p-value. What p-value gives you the best results? Do the results support the theory you learned in class regarding Chi square pruning? Explain. (20 points)

**Note**: You need to change the `DecisionNode` class to support Chi square pruning. Make sure that `chi_value=1` corresponds to no pruning. The values you need from the Chi square table are available in the python file `hw2.py`.
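
As a rough sketch, the chi-square statistic of a candidate split compares the observed class counts in each child with the counts expected if the split were independent of the label. The helper below, `chi_square_statistic`, is hypothetical and takes a list of child groups; it does not mirror the signature of the `chi_square_split_test` function used in the next cell.

```python
import numpy as np

def chi_square_statistic(groups):
    """Sum over groups and classes of (observed - expected)^2 / expected."""
    labels = np.concatenate([g[:, -1] for g in groups])
    classes, totals = np.unique(labels, return_counts=True)
    class_probs = totals / totals.sum()            # P(Y = c) before the split
    chi = 0.0
    for group in groups:
        observed = np.array([(group[:, -1] == c).sum() for c in classes])
        expected = len(group) * class_probs        # counts under independence
        chi += np.sum((observed - expected) ** 2 / expected)
    return float(chi)

# One instance of class 0 on one side, five instances of class 1 on the other.
data = np.array([[0, 1, 1], [0, 1, 1], [1, 1, 1],
                 [1, 1, 1], [1, 1, 1], [0, 0, 0]])
group_a, group_b = data[[5], :], data[[0, 1, 2, 3, 4], :]
print(chi_square_statistic([group_a, group_b]))    # approximately 6
```

Under the usual pre-pruning rule, a node is split only if this statistic exceeds the critical value from the Chi square table for the chosen p-value; otherwise the node becomes a leaf.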

In [254]:
# test chi value computation based on class example
chi_test_data = np.array([[0,1,1],
                          [0,1,1],
                          [1,1,1],
                          [1,1,1],
                          [1,1,1],
                          [0,0,0]])
group_a = chi_test_data[[5], :]
group_b = chi_test_data[[0,1,2,3,4], :]
chi_value = chi_square_split_test(1, 0.5, group_a, group_b, 5, 1)
print(chi_value)
print("V" if np.isclose(chi_value, 6, rtol=0) else "X")

6.0
V


In [258]:
training = X_train
testing  = X_test

accuracy_dict = {}
for chi_value in [1, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00001]:
    accuracy_dict[chi_value] = calc_tree_accuracy_by_p_value(training, testing, calc_entropy, chi_value)
print(accuracy_dict)

{1: 93.00699300699301, 0.01: 20.97902097902098, 0.005: 20.97902097902098, 0.001: 20.97902097902098, 0.0005: 20.97902097902098, 0.0001: 20.97902097902098, 1e-05: 20.97902097902098}


In [None]:
#### Your visualization here ####

Your answer here

## Post pruning

Construct a decision tree without Chi square pruning. For each leaf in the tree, calculate the test accuracy of the tree assuming no split occurred on the parent of that leaf and find the best such parent (in the sense that not splitting on that parent results in the best testing accuracy among possible parents). Make that parent into a leaf and repeat this process until you are left with just the root. On a single plot, draw the training and testing accuracy as a function of the number of internal nodes in the tree. Explain the results: what would happen to the training and testing accuracies when you remove nodes from the tree? Can you suggest a different approach to achieve better results? (20 points)
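
The greedy procedure described above could be sketched roughly as follows. Everything here is an assumption made for illustration: the `PNode` class, the `majority` attribute (the majority training label stored at every node so that it can act as a leaf), and the use of an `is_leaf` flag to collapse a subtree without deleting it.

```python
import numpy as np

class PNode:
    """Hypothetical binary node that remembers its majority training label."""
    def __init__(self, majority, feature=None, threshold=None,
                 left=None, right=None):
        self.majority = majority
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right
        self.is_leaf = left is None

def predict_one(node, row):
    while not node.is_leaf:
        node = node.left if row[node.feature] <= node.threshold else node.right
    return node.majority

def accuracy(root, data):
    preds = np.array([predict_one(root, r) for r in data])
    return 100.0 * np.mean(preds == data[:, -1])

def leaf_parents(node):
    """Internal nodes that have at least one leaf child."""
    if node.is_leaf:
        return []
    found = [node] if (node.left.is_leaf or node.right.is_leaf) else []
    return found + leaf_parents(node.left) + leaf_parents(node.right)

def count_internal(node):
    if node.is_leaf:
        return 0
    return 1 + count_internal(node.left) + count_internal(node.right)

def post_prune(root, train, test):
    """Collapse the best leaf parent repeatedly until only the root remains.

    Returns a list of (internal_nodes, train_accuracy, test_accuracy).
    """
    history = [(count_internal(root), accuracy(root, train), accuracy(root, test))]
    while not root.is_leaf:
        best_node, best_acc = None, -1.0
        for candidate in leaf_parents(root):
            candidate.is_leaf = True          # tentatively collapse
            acc = accuracy(root, test)
            candidate.is_leaf = False         # undo
            if acc > best_acc:
                best_node, best_acc = candidate, acc
        best_node.is_leaf = True              # commit the best collapse
        history.append((count_internal(root), accuracy(root, train), best_acc))
    return history

# Tiny hand-built tree and data (hypothetical).
inner = PNode(1, feature=1, threshold=0.5, left=PNode(1), right=PNode(0))
root = PNode(0, feature=0, threshold=2.5, left=PNode(0), right=inner)
train = np.array([[1, 0, 0], [2, 1, 0], [3, 0, 1], [4, 0, 1], [5, 1, 0]])
test = np.array([[1, 0, 0], [3, 1, 1], [4, 0, 1], [5, 0, 1]])
history = post_prune(root, train, test)
print([nodes for nodes, _, _ in history])     # [2, 1, 0]
```

The returned `history` holds exactly the three series needed for the requested plot: the number of internal nodes and the matching training and testing accuracies.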

In [None]:
#### Your code here ####

Your answer here

## Print the tree

Complete the function `print_tree` in the python file `hw2.py` and print the tree using the chosen impurity measure and no pruning. Your output should look something like this (10 points):
```
[X0 <= 1],
  [X1 <= 2]
    [X2 <= 3], 
       leaf: [{1.0: 10}]
       leaf: [{0.0: 10}]
    [X4 <= 5], 
       leaf: [{1.0: 5}]
       leaf: [{0.0: 10}]
   leaf: [{1.0: 50}]
```
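
A recursive traversal is one natural way to produce such output. The sketch below is only an illustration: `PrintableNode` and `tree_lines` are hypothetical names, the node is assumed to keep its children in a list and its label counts in a dict, and the indentation width is an arbitrary choice.

```python
class PrintableNode:
    """Hypothetical node with a split, a list of children and label counts."""
    def __init__(self, feature=None, threshold=None, children=(), counts=None):
        self.feature = feature
        self.threshold = threshold
        self.children = list(children)
        self.counts = counts          # e.g. {1.0: 10} for a leaf

def tree_lines(node, depth=0):
    """Return the printed form of the tree as a list of indented lines."""
    indent = "  " * depth
    if not node.children:             # leaf: show its label counts
        return [f"{indent}leaf: [{node.counts}]"]
    lines = [f"{indent}[X{node.feature} <= {node.threshold}],"]
    for child in node.children:       # children are indented one level deeper
        lines.extend(tree_lines(child, depth + 1))
    return lines

tree = PrintableNode(0, 1, [PrintableNode(counts={1.0: 10}),
                            PrintableNode(counts={0.0: 10})])
print("\n".join(tree_lines(tree)))
# [X0 <= 1],
#   leaf: [{1.0: 10}]
#   leaf: [{0.0: 10}]
```

Passing the depth down the recursion keeps every child indented consistently relative to its parent.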


In [236]:
print("Gini Tree:")
print_tree(tree_gini)
print("------------------------------------")
print("Entropy Tree:")
print_tree(tree_entropy)

Gini Tree:
[A27 <= 0.142350]
   leaf: [{0: 22}]
leaf: [{1: 249}]
   [A13 <= 21.925000]
      leaf: [{0: 6}]
leaf: [{1: 8}]
      [A4 <= 0.079285]
         leaf: [{1: 1}]
         leaf: [{0: 140}]
------------------------------------
Entropy Tree:
[A27 <= 0.142350]
   leaf: [{0: 22}]
leaf: [{1: 249}]
   [A13 <= 21.925000]
      leaf: [{0: 6}]
leaf: [{1: 8}]
      [A4 <= 0.079285]
         leaf: [{1: 1}]
         leaf: [{0: 140}]
