# Practical session 2 - Decision trees

## Recursion

A recursive function is a function that calls itself until some base case is reached. The base case is a condition we check on every call to decide whether the function should call itself again. Without a base case the recursion would never stop (in practice, Python raises a `RecursionError` once the call stack gets too deep).

Recursion is often explained by referring to Russian nesting dolls. Each time you open a doll, another doll is inside; this continues until you reach the smallest doll (the base case). Even without knowing how many dolls there are, we know how to open all of them: we simply keep calling the open *'function'* until we reach the last doll.

<img src="https://upload.wikimedia.org/wikipedia/commons/d/d2/Russian-Matroshka_no_bg.jpg" width="30%">

An example of a problem we can solve with a recursive function is calculating the factorial. The base case is ```n == 1```: here we know the answer is 1. Otherwise we compute ```n * factorial(n-1)```, which keeps reducing ```n``` until we reach 1. The cell below gives the ```factorial``` function, with print statements to show what's happening.

In [1]:
def factorial(n):
    if n == 1:
        print("This I know! (the base case)")
        return 1
    else:
        print("I don't know the factorial for", n, "let's try", n-1)
        return n * factorial(n-1)
    
factorial(5)

I don't know the factorial for 5 let's try 4
I don't know the factorial for 4 let's try 3
I don't know the factorial for 3 let's try 2
I don't know the factorial for 2 let's try 1
This I know! (the base case)


120
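To see where the 120 comes from, it can help to watch the calls return as well as descend. The sketch below is a variant of the function above; the name `factorial_trace` and the `steps` list are introduced here for illustration only. It also treats `n <= 1` as the base case, an added assumption so that a call with 0 stops instead of recursing forever.

```python
def factorial_trace(n, steps=None):
    """Return n! together with a list describing each step as the calls unwind."""
    if steps is None:
        steps = []
    if n <= 1:
        # Base case: nothing left to multiply.
        steps.append("base case: 1")
        return 1, steps
    # Recursive case: solve the smaller problem first, then finish this one.
    smaller, steps = factorial_trace(n - 1, steps)
    result = n * smaller
    steps.append("{} * {} = {}".format(n, smaller, result))
    return result, steps

result, steps = factorial_trace(5)
for step in steps:
    print(step)
print(result)
```

The multiplications only happen after the base case has been reached, which is why the deepest call is the first to finish.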

Let's practice recursion.

### Exercise 1

Write a recursive function ``rec_sum`` which takes a list of numbers and returns the sum of that list. 

**Hint:** Remember that you can use a colon to select part of a list. For example, ```a[2:]``` returns all but the first two elements of the list ```a```.
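As a quick reminder of how slicing behaves, here is a small sketch on a made-up list (the values are arbitrary and not part of the exercise):

```python
a = [10, 20, 30, 40, 50]

first = a[0]    # the first element
rest = a[1:]    # everything except the first element
tail = a[2:]    # everything except the first two elements
empty = a[5:]   # slicing past the end gives an empty list, not an error

print(first, rest, tail, empty)
```

Note that a slice always returns a new list, so the original list ```a``` is left unchanged.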

In [None]:
    
rec_sum([1,2,3,4,5,6])

Clearly the function in exercise 1 is not the most useful recursive function, and the problem would be more easily solved with just a loop. But it might help you get started thinking about how recursion works. Let's see how it can be more useful.

### Exercise 2

In the cell below you are given a list which contains a nested list, which contains another nested list, and so on. You do not know how many levels of nesting there are; all you know is that the innermost list contains a number. Write a recursive function which prints this number by searching through the list.

As an advanced exercise you can also try to keep track of how many levels you had to descend in order to reach the final answer.

**Hint:** You can check if something is a list using the ```isinstance``` function: ```isinstance([1,2,3], list)```
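A small sketch of what ```isinstance``` returns for a few made-up values (chosen here only for illustration):

```python
values = [[1, 2, 3], 13, "13", [], [[13]]]

# True for anything that is a list, including empty and nested lists.
results = [isinstance(v, list) for v in values]

for v, r in zip(values, results):
    print(repr(v), r)
```

A number or a string is not a list, so the check can be used to tell when there is nothing left to descend into.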

In [3]:
nested = [[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[13]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]

def ...
    
search(nested)

(13, 37)

## Tree structures

Recursion is very useful when dealing with tree structures, as we often do not know how deep the tree is. All we can see is if the node we are currently looking at has any children, and if it does we can try to visit those, and repeat this.

Unlike the examples above, a tree splits up into branches, which means we make not one but two (and sometimes even more) recursive calls every time we go down a level. The decision trees in this session are binary trees in which every node has either 0 or 2 children. A node with 0 children is a leaf node. The figure below is annotated with some of the terminology used when talking about trees.
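To make the "two recursive calls" idea concrete, here is a minimal sketch that counts the leaves of a small tree. The `TinyNode` class and the `count_leaves` function are hypothetical stand-ins written for this illustration; they are not the classes used in the rest of the session.

```python
class TinyNode:
    """Hypothetical stand-in for a binary tree node with 0 or 2 children."""
    def __init__(self, left=None, right=None):
        self.left = left
        self.right = right

def count_leaves(node):
    # Base case: a node without children is a leaf.
    if node.left is None and node.right is None:
        return 1
    # Recursive case: one call per branch, then combine the two answers.
    return count_leaves(node.left) + count_leaves(node.right)

# The root has a left child with two leaf children, and a right child that is a leaf.
tree = TinyNode(TinyNode(TinyNode(), TinyNode()), TinyNode())
print(count_leaves(tree))  # 3
```

The function never needs to know how deep the tree is; each call only looks at the node it was handed.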


<img src="https://www.tutorialspoint.com/data_structures_algorithms/images/binary_tree.jpg" alt="Tree structure" width="50%">

In the cell below I have defined a class called ```Node```, which we can use to construct a decision tree. Each node stores references to its children (left and right), the attribute (feature) of our dataset that the decision applies to, the value we want to compare that feature with, and finally the majority class at that node.


In [4]:
class Node:
    """A node in a (binary) decision tree"""
    
    def __init__(self):
        """Initialiser of the class"""
        self.left = None # left child
        self.right = None # right child
        self.attribute = None # column on which we decide
        self.value = None # value to check against
        self.majority = None # majority class at this node
    
    def isleaf(self):
        """Helper function to check if the current node is a leaf"""
        return self.left is None and self.right is None
        
    def question(self, attribute, value):
        """ Helper function to add question to node.
        """
        self.attribute = attribute
        self.value = value
        
    def __str__(self, depth=1):
        """ You can ignore this function, 
        but basically it helps print the node in a human-readable manner """
        if self.isleaf():
            return "Predict: \"{:s}\"".format(self.majority)
        else:
            s = "if features[{:d}] == \"{:s}\" then:\n {:s} \n{:s}else:\n {:s}"
            return s.format(self.attribute, 
                            self.value, 
                            "\t" * depth+self.left.__str__(depth+1),
                            "\t" * (depth-1),
                            "\t" * depth+self.right.__str__(depth+1))

So how do we use the ```Node``` class? To illustrate this, you'll find an example of a made-up tree below, applied to a number of objects (i.e., fruits).

In [5]:
# Create a new node and store it in the root variable
root = Node()
# Specify the decision we want to take at this node
# In this case we want to see if feature 2 contains the value Round
# We can use the question function to specify the attribute-value pair to be used as a question:
root.question(2, "Round")
# Create a new node, which we'll visit if the object is round.
root.left = Node()
# Create another node, we'll go here if it is not round.
root.right = Node()
# The right node is a leaf node, if it's not round we'll predict Banana
# Normally you'd want to determine the majority based on the data
root.right.majority = "Banana"

# Continue with the left node, let's see if our round object is red
root.left.question(1, "Red")
# If it is red then we'll predict Apple
root.left.left = Node()
root.left.left.majority = "Apple"
# Otherwise it has to be a lime, as anything round and not red must be a lime. Right?!
root.left.right = Node()
root.left.right.majority = "Lime"
# Try to extend the tree further by continuing on from here

# Thanks to the __str__() function we can print the tree 
# and get the rules in a human-readable format.
print(root)

if features[2] == "Round" then:
 	if features[1] == "Red" then:
 		Predict: "Apple" 
	else:
 		Predict: "Lime" 
else:
 	Predict: "Banana"


In [6]:
# Additionally we can use isleaf() to check if a node is a leaf node or not.
print("Is root a leaf node?", root.isleaf())

print("Is the right child of root a leaf node?", root.right.isleaf())
print()

# If we want to find out which feature the root looks at we can:
print("The root looks at feature", root.attribute, "and checks if its value is equal to", root.value)

Is root a leaf node? False
Is the right child of root a leaf node? True

The root looks at feature 2 and checks if its value is equal to Round


### Dataset

In the example above I made up the decisions, but normally you would want to generate these based on the data. For this we'll use the weather dataset in the next cell. The objective of this dataset is to figure out if the weather conditions are such that it is nice enough to go and play outside. 

It has the following features, all of which are categorical.
- outlook {sunny, overcast, rainy}
- temperature {hot, mild, cool}
- humidity {high, normal}
- windy {TRUE, FALSE}

And the target is:
- Can we play outside today? {yes, no}

The features are stored in ``X_train``. Each row in ``X_train`` is a different day/moment; ``y_train`` contains the label for each row.

In [7]:
X_train = [['sunny', 'hot', 'high', 'FALSE'],
 ['sunny', 'hot', 'high', 'TRUE'],
 ['overcast', 'hot', 'high', 'FALSE'],
 ['rainy', 'mild', 'high', 'FALSE'],
 ['rainy', 'cool', 'normal', 'FALSE'],
 ['rainy', 'cool', 'normal', 'TRUE'],
 ['overcast', 'cool', 'normal', 'TRUE'],
 ['sunny', 'mild', 'high', 'FALSE'],
 ['sunny', 'cool', 'normal', 'FALSE'],
 ['rainy', 'mild', 'normal', 'FALSE'],
 ['sunny', 'mild', 'normal', 'TRUE'],
 ['overcast', 'mild', 'high', 'TRUE'],
 ['overcast', 'hot', 'normal', 'FALSE'],
 ['rainy', 'mild', 'high', 'TRUE']]

y_train = ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']
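Since row ```i``` of ``X_train`` belongs with label ```i``` of ``y_train``, the two lists can be walked in parallel with ```zip```. The sketch below uses only the first three rows, copied from the cell above; the names `X_small`, `y_small` and `OUTLOOK` are introduced here for illustration.

```python
# First three rows of the weather data, copied from the cell above.
X_small = [['sunny', 'hot', 'high', 'FALSE'],
           ['sunny', 'hot', 'high', 'TRUE'],
           ['overcast', 'hot', 'high', 'FALSE']]
y_small = ['no', 'no', 'yes']

OUTLOOK = 0  # column index of the outlook feature

# zip pairs each row with the label at the same position.
pairs = [(row[OUTLOOK], label) for row, label in zip(X_small, y_small)]
for outlook, label in pairs:
    print(outlook, '->', label)
```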

Let's do some quick analysis of the distribution of the features and the label, and write a function which will be useful later on.

### Exercise 3

Write a function in the next cell called ```majority``` which takes a list of categorical values and returns the one which occurs most often.

**Hint:** ```{'A': 5, 'B': 6}.items()``` returns a view of ```(key, value)``` tuples, which you can sort using ```sorted()```.
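A short sketch of sorting such pairs by their count rather than by their key, using a made-up dictionary (the third entry is added here for illustration):

```python
counts = {'A': 5, 'B': 6, 'C': 2}

# Each item is a (key, count) tuple; sort on the count, i.e. element 1 of the tuple.
by_count = sorted(counts.items(), key=lambda pair: pair[1])
print(by_count)  # [('C', 2), ('A', 5), ('B', 6)]
```

Without the ```key``` argument the tuples would be sorted by their first element, the key, which is not what you want here.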

In [8]:
def majority...

if majority(y_train) == 'yes' and majority(y_train[:3]) == 'no':
    print("Majority is correct!")
else:
    print("Your majority function contains a mistake")

Majority is correct!


## Generating and evaluating potential questions

Now that we have a dataset we can figure out how to add questions. To do this we first need to generate the set of potential questions. Because all our features are categorical, every question asks whether a feature's value is equal to a specified value (of the form ```if temperature == 'hot'```).
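In terms of the data, such a question is just an attribute index plus a value to compare against. A minimal sketch on the first row of ``X_train`` (the variable names are chosen here for illustration):

```python
row = ['sunny', 'hot', 'high', 'FALSE']  # first row of X_train

attribute = 1   # column index of the temperature feature
value = 'hot'   # value to compare against

# The answer to the question "is temperature == 'hot'?" for this row.
answer = row[attribute] == value
print(answer)  # True
```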

### Exercise 4

Write a function in the next cell that takes a list of training inputs (structured like ``X_train``) as input and returns the unique values in each column. The output should be a list of sets (each set corresponding to a column).

You shouldn't need a recursive function to solve this.



In [9]:
def questionset(X):

questionset(X_train)

[{'overcast', 'rainy', 'sunny'},
 {'cool', 'hot', 'mild'},
 {'high', 'normal'},
 {'FALSE', 'TRUE'}]

Before we determine whether a question is a good one to ask, let's figure out how to actually apply one to a dataset. In other words: if we have a question, how do we split the dataset according to the answer?

### Exercise 5

Write a function in the cell below that takes a node, a list of training examples (``X``), and a list of training targets (``y``), and returns four lists. The first contains the rows from ``X`` for which the answer to the node's question is ``False``, and the second contains the targets for those rows. The third and fourth lists contain the same, but for the rows for which the answer is ``True``.

**Hints** 
- The node has a defined question, so you can use ```node.attribute``` and ```node.value``` to perform the conditional.
- The easiest way to do this is probably by creating the four lists at the start of your function, appending to them when appropriate and then returning them at the end.

In [10]:
root = Node()
root.question(0, 'overcast')

def split(node, X, y):
    
split(root, X_train, y_train)

([['sunny', 'hot', 'high', 'FALSE'],
  ['sunny', 'hot', 'high', 'TRUE'],
  ['rainy', 'mild', 'high', 'FALSE'],
  ['rainy', 'cool', 'normal', 'FALSE'],
  ['rainy', 'cool', 'normal', 'TRUE'],
  ['sunny', 'mild', 'high', 'FALSE'],
  ['sunny', 'cool', 'normal', 'FALSE'],
  ['rainy', 'mild', 'normal', 'FALSE'],
  ['sunny', 'mild', 'normal', 'TRUE'],
  ['rainy', 'mild', 'high', 'TRUE']],
 ['no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'yes', 'no'],
 [['overcast', 'hot', 'high', 'FALSE'],
  ['overcast', 'cool', 'normal', 'TRUE'],
  ['overcast', 'mild', 'high', 'TRUE'],
  ['overcast', 'hot', 'normal', 'FALSE']],
 ['yes', 'yes', 'yes', 'yes'])

Once we know how to make splits, we need to figure out which question is the best one to ask first. In the lecture we discussed how the best decision is the one that reduces the uncertainty the most. So let's write some functions to help us measure the uncertainty.

## Entropy

Entropy is a measure of uncertainty. It is calculated as follows:

$I(P) = - \sum\limits_{i=1}^N P_i \log_2(P_i)$

where $P$ is a list of class probabilities (i.e., for each class, the proportion of the set that belongs to it) and $N$ is the number of classes. Since for decision trees you'll be dealing with lists of labels, you'll first need to convert such a list into a probability for each distinct label.
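As a worked example, ``y_train`` contains 9 times 'yes' and 5 times 'no' out of 14 labels, so $P = [\frac{9}{14}, \frac{5}{14}]$ and

$$I(P) = -\left(\frac{9}{14} \log_2\frac{9}{14} + \frac{5}{14} \log_2\frac{5}{14}\right) \approx 0.4098 + 0.5305 = 0.9403$$

Two extremes are useful to keep in mind: a list in which every label is the same has entropy $-1 \cdot \log_2(1) = 0$ (no uncertainty), while a list split evenly over two labels has entropy $-2 \cdot \frac{1}{2}\log_2\frac{1}{2} = 1$.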

### Exercise 6

Write the ```entropy``` function in the cell below. Add tests to verify that the entropy of the list ```[0,1]``` is ```1.0```, that the entropy of the list ```[-1,1,2,3]``` is ```2.0```, and that the entropy of the first 10 examples of ```y_train``` is higher than that of the whole of ```y_train```.

In [11]:
from math import log2

def entropy(labels):

if not (entropy([0,1]) == 1.0 and
        entropy([-1,1,2,3]) == 2.0 and
        entropy(y_train[:10]) > entropy(y_train)):
    print("Your entropy function contains a mistake!")

## Weighted impurity

We now need to aggregate the entropy at both the left and the right node, while weighting by the relative size of the children.

$G(P) = f_{left} I(n_{left}) + f_{right} I(n_{right})$

Here $n_{left}$ is the left child, and $f_{left}$ is the weight given to the left child, etc.

Usually the weight is equal to the fraction of the parent's instances that end up in the child node. For example, if the parent contains 20 instances, and after the split the left child has 15 and the right child 5, then $f_{left} = \frac{15}{20}$ and $f_{right} = \frac{5}{20}$.
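The arithmetic for that example can be sketched as follows. The two child entropies are hypothetical values, picked only to make the numbers easy to follow; in practice they come from the labels in each child.

```python
n_left, n_right = 15, 5
n_parent = n_left + n_right

f_left = n_left / n_parent     # 0.75
f_right = n_right / n_parent   # 0.25

# Hypothetical child entropies, assumed for this illustration only.
entropy_left = 0.5
entropy_right = 1.0

weighted = f_left * entropy_left + f_right * entropy_right
print(weighted)  # 0.625
```

The larger child dominates the result: the right child is maximally uncertain here, but it only holds a quarter of the instances.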

### Information gain

You will often see Information Gain as a scoring function, rather than simply weighted entropy. Information gain is:

$$IG(P) = I(P) - [f_{left} I(n_{left}) + f_{right} I(n_{right})]$$

so it is simply the entropy of the parent node minus the weighted entropy of the children. Since the term $I(P)$ is the same for every candidate split of a given node, the split with the highest information gain is also the one with the lowest weighted entropy, so we can use the weighted entropy directly to choose the split.
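A small numeric check of that claim, using the weighted entropies that this notebook reports for the two candidate questions A and B in Exercise 7. The parent entropy is computed directly from the 9 'yes' and 5 'no' labels in ``y_train``; the dictionary names are chosen here for illustration.

```python
from math import log2

# Entropy of the full y_train: 9 'yes' and 5 'no' out of 14 labels.
parent_entropy = -(9 / 14 * log2(9 / 14) + 5 / 14 * log2(5 / 14))

# Weighted entropies of two candidate splits, taken from the output of Exercise 7.
weighted = {'A': 0.7142857142857143, 'B': 0.8380423950607803}

# Information gain subtracts each weighted entropy from the same parent entropy.
gain = {name: parent_entropy - g for name, g in weighted.items()}

best_by_gain = max(gain, key=gain.get)
best_by_weighted = min(weighted, key=weighted.get)
print(best_by_gain, best_by_weighted)  # A A
```

Both criteria pick the same question, because subtracting from a constant only reverses the ordering.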


### Exercise 7

Implement the ``G`` function in the cell below. To verify that you have done it correctly, split ```X_train``` and ```y_train``` with the function you wrote for exercise 5, once using node A and once using node B. If correct, the weighted impurity ``G`` for node A should be lower than for node B.

**Hint:** It should be enough to give it two lists, the labels for the left child and those for the right child, as the parent list is the concatenation of those two.

In [12]:
A = Node()
A.question(0, 'overcast')
B = Node()
B.question(0, 'sunny')

def G(left, right):

L, yL, R, yR = split(A, X_train, y_train)
print("A", G(yL,yR))

L, yL, R, yR = split(B, X_train, y_train)
print("B", G(yL,yR))

A 0.7142857142857143
B 0.8380423950607803


We have all the building blocks we need to start fitting a tree to a dataset. Let's give it a go!

### Advanced exercise 1

Implement the ```fit(X,y)``` function below, where ``X`` is a matrix of features and ``y`` is a list of labels. It should return a tree (i.e., an instance of the ``Node`` class).

**Hints:** 
- Start by thinking of what the right base case is, and implement this.
- Remember that you can call the ``Node.question`` function repeatedly to change the question, allowing you to test multiple questions without creating new Nodes.
- Remember that this should be a recursive function.

In [13]:
def fit(X, y):
    
decision_tree = fit(X_train, y_train)
print(decision_tree)

if features[0] == "overcast" then:
 	if features[2] == "high" then:
 		if features[3] == "FALSE" then:
 			if features[0] == "sunny" then:
 				Predict: "no" 
			else:
 				Predict: "yes" 
		else:
 			Predict: "yes" 
	else:
 		if features[0] == "sunny" then:
 			if features[3] == "FALSE" then:
 				Predict: "no" 
			else:
 				Predict: "yes" 
		else:
 			Predict: "no" 
else:
 	Predict: "yes"


### Advanced exercise 2

Once we have fitted a decision tree we would like to verify how well it works, and use it to predict the label for new samples. Implement the ```predict(tree, x)``` function in the cell below, where ```tree``` is a fitted tree, and ``x`` is one feature vector (a list). It should return a single label, either 'yes' or 'no'.

**Hints:**
- What is the base case?
- Remember that going left or right depends on the answer to the question at each node.

In [14]:
def predict(tree, x):


# This code applies the predict function to every row of the training data
print('\t\tData\t\t\tTruth\tPrediction')
for row, label in zip(X_train, y_train):
    print(row, '\t', label, '\t', predict(decision_tree, row))

		Data			Truth	Prediction
['sunny', 'hot', 'high', 'FALSE'] 	 no 	 no
['sunny', 'hot', 'high', 'TRUE'] 	 no 	 no
['overcast', 'hot', 'high', 'FALSE'] 	 yes 	 yes
['rainy', 'mild', 'high', 'FALSE'] 	 yes 	 yes
['rainy', 'cool', 'normal', 'FALSE'] 	 yes 	 yes
['rainy', 'cool', 'normal', 'TRUE'] 	 no 	 no
['overcast', 'cool', 'normal', 'TRUE'] 	 yes 	 yes
['sunny', 'mild', 'high', 'FALSE'] 	 no 	 no
['sunny', 'cool', 'normal', 'FALSE'] 	 yes 	 yes
['rainy', 'mild', 'normal', 'FALSE'] 	 yes 	 yes
['sunny', 'mild', 'normal', 'TRUE'] 	 yes 	 yes
['overcast', 'mild', 'high', 'TRUE'] 	 yes 	 yes
['overcast', 'hot', 'normal', 'FALSE'] 	 yes 	 yes
['rainy', 'mild', 'high', 'TRUE'] 	 no 	 no


### Advanced exercise 3

Visually it's quite easy to figure out how deep a tree is, but can we do it automatically? Write a recursive function in the cell below which returns the depth of the decision tree it was passed as an argument.

In [15]:
def depth(tree):
  
    
depth(decision_tree)

5