Osnabrück University - Machine Learning (Summer Term 2024) - Prof. Dr.-Ing. G. Heidemann, Ulf Krumnack, Lukas Niehaus

# Exercise Sheet 01: Decision Trees

## Introduction

This is the first official exercise sheet. The homework sheets will usually be available at the beginning of the week and are supposed to be solved in groups of three. They have to be handed in by the end of Sunday of that week. The exercises are then presented to your tutor in a small feedback session. To be admitted to the final exam, you have to pass $N-2$ of the weekly exercise sheets.

Sign up for a group on Stud.IP (See `Participants` -> `Functions/Groups`). The times mentioned there are the times for the feedback session of your group. If none of them fits, send any of the tutors an e-mail so we can try to arrange something.

Your group will have a group folder in Stud.IP under `Files`. Upload your solutions there to hand them in.

All exercise sheets will use [Jupyter Notebooks](http://jupyter-notebook.readthedocs.org/en/latest/notebook.html). To be able to run these on your system, you will need to install Python and a few packages. We suggest using the newest version of Python 3 and installing the conda environment as explained in the practice session and in the file "ml-install.txt".

This week's sheet should be solved and handed in before the end of **Sunday, April 21, 2024**.
Please upload your results to your group's Stud.IP folder. In case you cannot do this first sheet (due to technical or organizational problems), please upload a description of your problem instead. Your tutor will help you to solve the problems in the first feedback session, and you may hand in this sheet together with the second sheet one week later.

## Decision Trees [4 Points]
Draw the decision trees for the following boolean functions. Either use pen and paper and scan/photograph the result, or draw the trees as ASCII art in the cells below.

Note: $\oplus := \operatorname{xor}$, which means that exactly one of the operands has to be true while the other one has to be false:

|$$\oplus$$ | $$B$$ | $$\neg B$$|
|:---------|:-----|:---------|
|$$A$$      |  f  |    t|
|$$\neg A$$ |  t  |    f|

**a)** $\neg A \wedge B$


![image](Trees_a.jpg)

**b)** $A \oplus B$

![image](Trees_b.jpg)

**c)** $A \vee (B \wedge C) \vee (\neg C \wedge D)$

![image](Trees_c.jpg)

**d)** $(A \rightarrow (B \wedge \neg C)) \vee (A \wedge B)$

![image](Trees_d.jpg)
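
The drawn trees can be cross-checked by enumerating the truth tables of the four functions. The following is a minimal sketch, not part of the exercise: the names `implies`, `FUNCTIONS` and `truth_table` are our own.

```python
from itertools import product


def implies(p, q):
    """Material implication: p -> q is false only if p is true and q is false."""
    return (not p) or q


# The four functions from a) to d). Every function takes A, B, C, D,
# even if it ignores some of them, so that all share one helper.
FUNCTIONS = {
    "a": lambda A, B, C, D: (not A) and B,
    "b": lambda A, B, C, D: A != B,  # xor
    "c": lambda A, B, C, D: A or (B and C) or ((not C) and D),
    "d": lambda A, B, C, D: implies(A, B and not C) or (A and B),
}


def truth_table(func):
    """Map every (A, B, C, D) assignment to the value of func."""
    return {args: bool(func(*args)) for args in product([False, True], repeat=4)}


tables = {name: truth_table(func) for name, func in FUNCTIONS.items()}

# Check whether d) agrees with (not A) or B on every assignment.
d_is_simple = all(
    value == ((not A) or B) for (A, B, C, D), value in tables["d"].items()
)
print(d_is_simple)
```

If `d_is_simple` is true, function d) is equivalent to $\neg A \vee B$, so a decision tree for d) does not need to test $C$ at all.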

## Entropy and Information Gain [8 Points]

In many machine learning applications it is crucial to determine which criteria are necessary for a good classification. Decision trees place the most significant criteria close to the root, imposing an order from more significant to less significant criteria. One way to select the most important criterion is to compare its information gain, i.e. the expected reduction in entropy, with that of the other criteria. The following dataset is a hands-on example of this method.

Consider the following attributes with their possible values:

  * $raining = \{yes, no\}$
  * $tired = \{yes, no\}$
  * $late = \{yes, no\}$
  * $distance = \{short, medium, long\}$

And a training data set consisting of those attributes:

| #  | raining | tired | late | distance | attend_party |
|----|---------|-------|------|----------|--------------|
| 1  | yes     | no    | no   | short    | **yes**      |
| 2  | yes     | no    | yes  | medium   | **no**       |
| 3  | no      | yes   | no   | long     | **no**       |
| 4  | yes     | yes   | yes  | short    | **no**       |
| 5  | yes     | no    | no   | short    | **yes**      |
| 6  | no      | no    | no   | medium   | **yes**      |
| 7  | no      | yes   | no   | long     | **no**       |
| 8  | yes     | no    | yes  | short    | **no**       |
| 9  | yes     | yes   | no   | short    | **yes**      |
| 10 | no      | yes   | no   | medium   | **no**       |
| 11 | no      | yes   | no   | long     | **no**       |
| 12 | no      | yes   | yes  | short    | **no**       |

**a)** Build the root node of a decision tree from the training samples given in the table above by calculating the information gain for all four attributes (raining, tired, late, distance).

$$\operatorname{Gain}(S,A) = \operatorname{Entropy}(S) - \sum_{v \in \operatorname{Values}(A)} \frac{|S_v|}{|S|}\operatorname{Entropy}(S_v)$$

$$\operatorname{Entropy}(S) = -p_{\oplus} \log_{2} p_{\oplus} - p_{\ominus} \log_{2} p_{\ominus}$$

$S$ is the set of all data samples. $S_v$ is the subset for which attribute $A$ has value $v$. An example for attribute **tired** with value $yes$ would be:
$$|S_{yes}| = 7, S_{yes}:[1+, 6−]$$

Root Node: Attribute "Late"
![image](entropy_a1.jpg)
![image](entropy_a2.jpg)

(We apologize for the rotated image; we tried to fix the orientation but did not find a solution.)
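
The hand calculation can be cross-checked with a few lines of Python. This is a minimal sketch rather than the reference solution: the names `TABLE`, `entropy_of` and `information_gain` are our own, and the table is transcribed from the training data above.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["raining", "tired", "late", "distance"]

# Rows 1 to 12 of the training table: raining, tired, late, distance, attend_party.
TABLE = """
yes no  no  short  yes
yes no  yes medium no
no  yes no  long   no
yes yes yes short  no
yes no  no  short  yes
no  no  no  medium yes
no  yes no  long   no
yes no  yes short  no
yes yes no  short  yes
no  yes no  medium no
no  yes no  long   no
no  yes yes short  no
"""
ROWS = [tuple(line.split()) for line in TABLE.strip().splitlines()]
LABELS = [row[-1] for row in ROWS]


def entropy_of(items):
    """Entropy of the empirical distribution of the given items."""
    total = len(items)
    return -sum(n / total * log2(n / total) for n in Counter(items).values())


def information_gain(labels, values):
    """Entropy of labels minus the weighted entropy of the subsets induced by values."""
    remainder = 0.0
    for value in set(values):
        subset = [label for label, v in zip(labels, values) if v == value]
        remainder += len(subset) / len(labels) * entropy_of(subset)
    return entropy_of(labels) - remainder


gains = {
    name: information_gain(LABELS, [row[i] for row in ROWS])
    for i, name in enumerate(ATTRIBUTES)
}
for name, value in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{name:9s} {value:.4f}")

best = max(gains, key=gains.get)
```

With this table, `late` has the largest gain (about 0.252), followed by `distance` (about 0.189), `tired` (about 0.169) and `raining` (about 0.093).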

**b)** Perform the same calculation as in **a)** but use the gain ratio instead of the information gain. Does the result for the root node change?

$$\operatorname{GainRatio}(S,A) = \frac{\operatorname{Gain}(S,A)}{\operatorname{SplitInformation}(S,A)}$$

$$\operatorname{SplitInformation}(S,A) = - \sum_{v \in \operatorname{Values}(A)} \frac{|S_v|}{|S|} \log_{2} \frac{|S_{v}|}{|S|}$$

Root Node: Attribute "Late" (unchanged)
![image](entropy_b.jpg)
![image](entropy_b2.jpg)
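
The gain ratio can be checked in the same way. Again, this is only a sketch under our own naming (`TABLE`, `entropy_of`, `information_gain`, `gain_ratio`); it uses the fact that the split information is the entropy of the attribute's own value distribution.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["raining", "tired", "late", "distance"]

# Rows 1 to 12 of the training table: raining, tired, late, distance, attend_party.
TABLE = """
yes no  no  short  yes
yes no  yes medium no
no  yes no  long   no
yes yes yes short  no
yes no  no  short  yes
no  no  no  medium yes
no  yes no  long   no
yes no  yes short  no
yes yes no  short  yes
no  yes no  medium no
no  yes no  long   no
no  yes yes short  no
"""
ROWS = [tuple(line.split()) for line in TABLE.strip().splitlines()]
LABELS = [row[-1] for row in ROWS]


def entropy_of(items):
    """Entropy of the empirical distribution of the given items."""
    total = len(items)
    return -sum(n / total * log2(n / total) for n in Counter(items).values())


def information_gain(labels, values):
    """Entropy of labels minus the weighted entropy of the subsets induced by values."""
    remainder = 0.0
    for value in set(values):
        subset = [label for label, v in zip(labels, values) if v == value]
        remainder += len(subset) / len(labels) * entropy_of(subset)
    return entropy_of(labels) - remainder


def gain_ratio(labels, values):
    """Gain divided by split information (undefined if the attribute has one value only)."""
    split_information = entropy_of(values)
    return information_gain(labels, values) / split_information


ratios = {
    name: gain_ratio(LABELS, [row[i] for row in ROWS])
    for i, name in enumerate(ATTRIBUTES)
}
for name, value in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{name:9s} {value:.4f}")

best = max(ratios, key=ratios.get)
```

With this table, `late` still has the largest value (about 0.274). The three-valued attribute `distance` is penalized by its split information of 1.5 and drops behind `tired`.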

## ID3 algorithm [4 Points]

Implement the following two functions in Python. Take a look at the `assert`s to see how the functions should behave. An assert is a condition that your function is required to pass. Most of the conditions here are taken from the lecture slides (ML-03, Slide 12 & 13). Don't worry if you cannot get all asserts to pass; just comment out the failing ones.

**a) Entropy**

$$\operatorname{Entropy}(S) = - \sum_{i=1...c} p_i \log_2 p_i$$
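
As a quick worked example (our own arithmetic), take the class list `[2,2,2,1,0]`, which also appears in the asserts of this exercise. The class frequencies are $p = (3/5, 1/5, 1/5)$, so

$$\operatorname{Entropy}([2,2,2,1,0]) = -\tfrac{3}{5}\log_2\tfrac{3}{5} - 2 \cdot \tfrac{1}{5}\log_2\tfrac{1}{5} \approx 0.442 + 0.929 = 1.371$$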

In [5]:
from math import log2


def entropy(s):
    """
    Calculate the entropy for a given target value set.

    Args:
        s (list): Target classes for specific observations.

    Returns:
        The entropy of s.
    """
    # find all unique class values that occur in s
    unique_vals = set(s)

    # get the absolute number of samples
    total_n = len(s)

    # initialize sum var to update progressively
    entropy_sum = 0.0

    # for every class value
    for elem in unique_vals:
        # relative frequency of samples having this value
        p = s.count(elem) / total_n
        # add the term p * log2(p) of this value to the running sum
        entropy_sum += p * log2(p)

    # the entropy is the negated sum
    return -entropy_sum

    
    
# See ML-03, Slide 12 & 13

In [6]:
# Epsilon: Account for small computational and rounding errors
epsilon = 1e-3
assert abs(entropy([1,1,1,0,0,0]) - 1.0) < epsilon
assert abs(entropy([1,1,1,1,0,0,0]) - 0.985) < epsilon
assert abs(entropy([1,1,1,1,1,1,0]) - 0.592) < epsilon
assert abs(entropy([1,1,1,1,1,1,0,0]) - 0.811) < epsilon
assert abs(entropy([2,2,1,1,0,0]) - 1.585) < epsilon
assert abs(entropy([2,2,2,1,0]) - 1.371) < epsilon
assert abs(entropy([2,2,2,0,0]) - 0.971) < epsilon
assert abs(entropy(['yes','yes','yes','no','no','no']) - 1.0) < epsilon

**b)** Information Gain

$$\operatorname{Gain}(S,A) = \operatorname{Entropy}(S) - \sum_{v \in \operatorname{Values}(A)} \frac{|S_v|}{|S|} \operatorname{Entropy}(S_v)$$

In [55]:
def gain(targets, attr_values):
    """
    Calculates the expected reduction in entropy due to sorting on A.

    Args:
        targets (list): Target classes for observations in attr_values.
        attr_values (list): Values of each instance for the respective attribute.

    Returns:
        The information gain of an attribute.
    """
    # calculate the overall entropy at the current state
    s_entropy = entropy(targets)

    # calculate the absolute number of samples
    total_n = len(attr_values)

    # initialize sum var to update progressively
    sum_part = 0.0

    # for every unique value of the attribute
    for elem in set(attr_values):
        # cross-reference the target list and attribute value list:
        # collect the targets of those samples that hold the value of the current iteration
        val_subset = [target for target, val in zip(targets, attr_values) if val == elem]

        # weight the entropy of this subset by the fraction of samples it contains
        sum_part += len(val_subset) / total_n * entropy(val_subset)

    # the gain is the total entropy minus the weighted entropy of the subsets
    return s_entropy - sum_part

        
    

# See ML-03, Slide 12 & 13


In [56]:
# The lists here can each be seen as one column of a table such as the one in assignment 2.
# assert_targets would be the last column, while each assert_attribute_values_* list holds
# the values of one attribute, for example raining or distance.
assert_targets = ["no","no","yes","yes","yes","no","yes","no","yes","yes","yes","yes","yes","no"]
assert_attribute_values_1 = ["yes", "yes","yes","yes","no", "no", "no", "yes", "no", "no", "no","yes", "no", "yes"]
assert_attribute_values_2 = ["high","low","medium","high","high","medium","low","medium","low","high","high","medium","low","low"]
assert_attribute_values_3 = [0,1,0,0,0,1,1,0,0,0,1,1,0,1]

epsilon = 1e-3
assert abs(gain(assert_targets, assert_attribute_values_1) - 0.152) < epsilon
assert abs(gain(assert_targets, assert_attribute_values_2) - 0.05) < epsilon
assert abs(gain(assert_targets, assert_attribute_values_3) - 0.048) < epsilon
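
To see where the first expected value comes from, it helps to look at the subsets $S_v$ explicitly. The sketch below uses a hypothetical helper `partition` (not part of the exercise) to group the targets by attribute value and count the positive and negative examples in each group.

```python
from collections import defaultdict

targets = ["no", "no", "yes", "yes", "yes", "no", "yes",
           "no", "yes", "yes", "yes", "yes", "yes", "no"]
attr_values = ["yes", "yes", "yes", "yes", "no", "no", "no",
               "yes", "no", "no", "no", "yes", "no", "yes"]


def partition(targets, attr_values):
    """Group the target labels by the attribute value of the same sample."""
    groups = defaultdict(list)
    for target, value in zip(targets, attr_values):
        groups[value].append(target)
    return dict(groups)


groups = partition(targets, attr_values)

# (number of "yes" targets, number of "no" targets) per attribute value
summary = {
    value: (labels.count("yes"), labels.count("no"))
    for value, labels in groups.items()
}
print(summary)
```

The attribute value `yes` yields the subset $[3+, 4-]$ and `no` yields $[6+, 1-]$, while the whole set is $[9+, 5-]$. Hence $\operatorname{Gain} \approx 0.940 - \tfrac{7}{14} \cdot 0.985 - \tfrac{7}{14} \cdot 0.592 \approx 0.152$.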

**c)** ID3

In the next two cells we have implemented the ID3 algorithm. It relies on your two functions from above, `entropy` and `gain`. Try to understand what the code does and replace `# YOUR CODE HERE` with meaningful comments describing the respective parts of the code. Do not forget to write the docstring. Though it's often annoying, being able to read other people's code is one of the key skills (and obstacles) in software engineering. So give it a try! Otherwise you are of course welcome to write your own implementation.

In [28]:
from collections import Counter, namedtuple


class Node(namedtuple('Node', 'label children')):
    """
    A small node representation with a pretty string representation.
    """
    def __str__(self, level=0):
        return_str ='{}{!s}\n'.format(' ' * level * 4, self.label)
        for child in self.children:
            return_str += child.__str__(level + 1)
        return return_str

def id3(data, attributes, targets, target_names, attribute_names):
    """
    Recursively calculate a tree of Nodes (fields: label [string], children [list])
    using the ID3 algorithm.

    Args:
        data (list): list of the x-data of our samples (samples are lists themselves)
        attributes (list): attribute indices
        targets (list): Target classes for the samples in data
        target_names (list): list of all labels
        attribute_names (list): list of all attribute names

    Returns:
        The (Sub)tree created by ID3
    """

    # if all target values are equal, return the value of the first target as the result
    if all(target == targets[0] for target in targets):
        return Node('Result: {!s}'.format(target_names[targets[0]]), [])

    # if there are no attributes left, find the most common target label and return its value/name as the result
    if len(attributes) == 0:       
        most_common_idx = Counter(targets).most_common(1)[0][0]
        return Node('Result: {!s}'.format(target_names[most_common_idx]), [])

    # create a list of the respective gains for every attribute as calculated by all data samples
    # find the attribute with the highest gain
    gains = [gain(targets, [r[attribute] for r in data])
             for attribute in attributes]
    max_gain_attribute = attributes[gains.index(max(gains))]

    # attribute with the highest gain becomes the root node
    root = Node('Attribute: {!s} (gain {!s})'.format(attribute_names[max_gain_attribute],
                                                     round(max(gains), 4)), [])
    
    # iterate over all unique values the attribute w/ highest gain takes in the data
    for vi in set(data_sample[max_gain_attribute] for data_sample in data):
        # create a child node of that attribute with the value of the current iteration
        # assign these children to the root node
        child = Node('Value: {!s}'.format(vi), [])
        root.children.append(child)

        # save indices of samples in data where the value of the highest gain attribute is 
        # equal to the value of the current iteration
        vi_indices = [idx for idx, data_sample in enumerate(data)
                          if data_sample[max_gain_attribute] == vi]
 
        # create a data / target subset (list) where all samples have the value of the current iteration
        # for the highest gain attribute
        data_vi = [data[i] for i in vi_indices]
        targets_vi = [targets[i] for i in vi_indices]
        
        # create an attribute list where the attribute with highest gain is removed
        attributes_vi = [attribute for attribute in attributes if not attribute == max_gain_attribute]
       
        if data_vi:
            # if there are samples with this value:
            # assign a sub-tree to the child node obtained by performing id3 on the subset of the data
            child.children.append(
                id3(data_vi, attributes_vi, targets_vi, target_names, attribute_names)
            )

        else:
            # no samples hold this value (cannot happen here, since vi is taken from data);
            # fall back to the most common target label of the parent set, because targets_vi is empty
            most_common_idx = Counter(targets).most_common(1)[0][0]
            label = 'Result: {!s}'.format(target_names[most_common_idx])
            child.children.append(Node(label, []))

    return root

**d)** The algorithm is applied to two data sets. Run those and discuss the differences. For which data set is the ID3 algorithm better suited and why?

First look at the json file in which the party dataset is saved:

In [None]:
import json

with open('party.json', 'r') as party_file:
    party = json.load(party_file)
    
print(json.dumps(party, indent=4, sort_keys=True))

We see that the dataset is parsed as a dictionary with four entries:

* `attributes`: A list of the attribute names
* `data`: A list of the x-data of our samples. Each sample is again a list
* `target_names`: A list of the targets, i.e. labels
* `targets`: A list of the labels of our samples

This code runs the ID3 algorithm on the party data set which you already know from assignment 2.

In [61]:
import json

with open('party.json', 'r') as party_file:
    
    party = json.load(party_file)

# Make sure our gain function handles the data set as expected.
epsilon = 1e-3
assert abs(gain(party['targets'], [r[2] for r in party['data']]) - 0.252) < epsilon


data = party['data']
attribute_names = party['attributes']
attributes = list(range(len(attribute_names)))
targets = party['targets']
target_names = party['target_names']


# Apply ID3 algorithm
tree_party = id3(data, attributes, targets, target_names, attribute_names)

print(tree_party)

Attribute: late (gain 0.2516)
    Value: no
        Attribute: distance (gain 0.75)
            Value: long
                Result: no
            Value: short
                Result: yes
            Value: medium
                Attribute: tired (gain 1.0)
                    Value: no
                        Result: yes
                    Value: yes
                        Result: no
    Value: yes
        Result: no
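
To illustrate how the printed tree is read, the sketch below transcribes it by hand into nested dictionaries and classifies the twelve training samples from assignment 2. The dictionary layout and the names `PARTY_TREE` and `classify` are our own choices, not part of the provided ID3 code.

```python
# Inner nodes: {"attribute": name, "branches": {value: subtree}}; leaves: label strings.
PARTY_TREE = {
    "attribute": "late",
    "branches": {
        "yes": "no",
        "no": {
            "attribute": "distance",
            "branches": {
                "long": "no",
                "short": "yes",
                "medium": {
                    "attribute": "tired",
                    "branches": {"no": "yes", "yes": "no"},
                },
            },
        },
    },
}

ATTRIBUTES = ["raining", "tired", "late", "distance"]

# Rows 1 to 12 of the training table: raining, tired, late, distance, attend_party.
TABLE = """
yes no  no  short  yes
yes no  yes medium no
no  yes no  long   no
yes yes yes short  no
yes no  no  short  yes
no  no  no  medium yes
no  yes no  long   no
yes no  yes short  no
yes yes no  short  yes
no  yes no  medium no
no  yes no  long   no
no  yes yes short  no
"""


def classify(tree, sample):
    """Follow the branches matching the sample until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][sample[tree["attribute"]]]
    return tree


samples = []
for line in TABLE.strip().splitlines():
    *values, label = line.split()
    samples.append((dict(zip(ATTRIBUTES, values)), label))

correct = sum(classify(PARTY_TREE, features) == label for features, label in samples)
accuracy = correct / len(samples)
print(accuracy)
```

Note that `raining` is never tested: the tree classifies all twelve training samples correctly without it.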



This code runs the ID3 algorithm on the famous iris flowers data set.

In [None]:
import json

with open('iris.json', 'r') as iris_file:
    iris = json.load(iris_file)

# Make sure our gain function handles the data set as expected.
epsilon = 1e-3
assert abs(gain(iris['targets'], [r[2] for r in iris['data']]) - 1.446) < epsilon

data = iris['data']
attribute_names = iris['attributes']
attributes = list(range(len(attribute_names)))
targets = iris['targets']
target_names = iris['target_names']

# Apply ID3 algorithm
tree_iris = id3(data, attributes, targets, target_names, attribute_names)

print(tree_iris)

Applying the ID3 algorithm to the party dataset results in a narrow but comparatively deep tree. The root attribute directly settles one subset of the data: one child of the root is a leaf node, so all samples in that subset are classified definitively. The remaining subset, however, is quite heterogeneous, so a single further attribute does not suffice to classify the rest of the data. Because the party attributes take only two or three values, far fewer than the iris attributes, the subsets produced by a split are more likely to contain several classes and to need further splits, which makes the tree deeper. This is also reflected in the root node gain, which is lower for the party dataset than for the iris dataset.

However, as a visual representation, the party tree is easier for a human to read in this printed form.

The iris tree is shallower, and in that sense better. The large number of possible values per attribute sorts the data into smaller, more fine-grained "bins" (subsets of the data grouped by their value for the attribute). Some of these bins are so specific to a class that they are homogeneous and need no further splitting. Even for those children of the root node where that is not the case, one more attribute suffices to classify the remaining data.

In this sense the ID3 algorithm is better suited to the iris dataset, although the printed tree is easier to read and understand for the party dataset.
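
The binning effect can be reproduced on toy data. The sketch below is an illustration with invented values and our own helper names (`entropy_of`, `information_gain`, `gain_ratio`): an attribute with a distinct value for every sample always reaches the maximum possible gain, $\operatorname{Entropy}(S)$, while the gain ratio from assignment 2 b) divides that gain by a large split information.

```python
from collections import Counter
from math import log2


def entropy_of(items):
    """Entropy of the empirical distribution of the given items."""
    total = len(items)
    return -sum(n / total * log2(n / total) for n in Counter(items).values())


def information_gain(labels, values):
    """Entropy of labels minus the weighted entropy of the subsets induced by values."""
    remainder = 0.0
    for value in set(values):
        subset = [label for label, v in zip(labels, values) if v == value]
        remainder += len(subset) / len(labels) * entropy_of(subset)
    return entropy_of(labels) - remainder


def gain_ratio(labels, values):
    """Gain divided by the split information (entropy of the attribute values)."""
    return information_gain(labels, values) / entropy_of(values)


# Invented toy data: three classes, two samples each.
labels = ["a", "a", "b", "b", "c", "c"]
coarse = ["lo", "lo", "lo", "hi", "hi", "hi"]  # two-valued attribute
unique = [5.1, 4.9, 6.3, 5.8, 7.1, 6.5]        # one distinct value per sample

gain_coarse = information_gain(labels, coarse)
gain_unique = information_gain(labels, unique)
ratio_coarse = gain_ratio(labels, coarse)
ratio_unique = gain_ratio(labels, unique)

print(f"gain:       coarse={gain_coarse:.3f}  unique={gain_unique:.3f}")
print(f"gain ratio: coarse={ratio_coarse:.3f}  unique={ratio_unique:.3f}")
```

On this toy data the many-valued attribute wins by plain information gain but not by gain ratio, which is worth keeping in mind when comparing root node gains across the two datasets.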

## Decision Trees on Iris Flowers [4 Points]

In this exercise we are going to examine and compare two decision trees that were generated from the iris flower data set to classify three variations of Iris flowers. The Iris data set is a classical example of a labeled dataset, i.e. every sample consists of two parts: features and labels. There are four features per sample in this data set (sepal length ($x_1$), sepal width ($x_2$), petal length ($x_3$) and petal width ($x_4$) in cm) and a corresponding label (Iris Setosa, Iris Versicolour, Iris Virginica). These samples are by nature **noisy**, no matter how carefully the measurement was taken - slight deviation from the actual length **cannot be avoided**. We want to learn how the features are related to the label so that we could (in the future) predict the label of a new sample automatically. One way to obtain such a `classifier` is to train a decision tree on the data.

Here are two decision trees generated from the data set. We will now take a closer look at them.

**Tree 1:**

**Tree 2:**

**a)** What does it mean that the features $x_1$ and $x_2$ do not appear in the decision trees?

$x_1$ and $x_2$ are not needed to classify the flowers in this case. Their information gain is too small, compared with the gain of the other attributes, to make a difference in the classification. Because the attributes with higher information gain are assigned to nodes first, the data samples are already classified before $x_1$ or $x_2$ would be selected.

**b)** With which method from the lecture might the second tree have been generated from the first one? Explain the procedure.

It might have been generated from the first tree by reduced error pruning. Reduced error pruning removes the subtree below a node $n$ and turns $n$ into a leaf node that is assigned the most common classification of its training examples. The algorithm evaluates the effect of pruning each node on the accuracy over a validation set and then greedily prunes the node whose removal improves the accuracy the most. This procedure is repeated until further pruning would reduce the accuracy on the validation set.
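
The procedure can be sketched in a few lines. This is an illustration under our own assumptions, not the lecture's implementation: trees are nested dictionaries with the hypothetical keys `attribute`, `branches` and `majority` (the most common training label at that node), and both the tree and the validation set are invented.

```python
import copy


def classify(tree, sample):
    """Follow matching branches; fall back to the node's majority label for unseen values."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(sample[tree["attribute"]], tree["majority"])
    return tree


def accuracy(tree, samples):
    return sum(classify(tree, x) == y for x, y in samples) / len(samples)


def inner_paths(tree, path=()):
    """Yield the branch-value path to every inner node (the root has path ())."""
    if isinstance(tree, dict):
        yield path
        for value, subtree in tree["branches"].items():
            yield from inner_paths(subtree, path + (value,))


def pruned_copy(tree, path):
    """Copy of tree in which the node at path is replaced by its majority leaf."""
    tree = copy.deepcopy(tree)
    if not path:
        return tree["majority"]
    node = tree
    for value in path[:-1]:
        node = node["branches"][value]
    node["branches"][path[-1]] = node["branches"][path[-1]]["majority"]
    return tree


def reduced_error_pruning(tree, validation):
    """Greedily prune the best node as long as validation accuracy does not drop."""
    while isinstance(tree, dict):
        candidates = [pruned_copy(tree, p) for p in inner_paths(tree)]
        best = max(candidates, key=lambda t: accuracy(t, validation))
        if accuracy(best, validation) < accuracy(tree, validation):
            break
        tree = best
    return tree


# Invented example: the subtree below late = "no" fits noise in the training data.
TREE = {
    "attribute": "late", "majority": "no",
    "branches": {
        "yes": "no",
        "no": {
            "attribute": "raining", "majority": "yes",
            "branches": {"yes": "yes", "no": "no"},
        },
    },
}
VALIDATION = [
    ({"late": "yes", "raining": "no"}, "no"),
    ({"late": "no", "raining": "yes"}, "yes"),
    ({"late": "no", "raining": "no"}, "yes"),
    ({"late": "yes", "raining": "yes"}, "no"),
]

pruned = reduced_error_pruning(TREE, VALIDATION)
print(pruned)
```

In this example the `raining` subtree is replaced by the leaf `yes`, which raises the validation accuracy from 0.75 to 1.0. Pruning the root as well would lower it, so the procedure stops there.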

**c)** After training the tree we can calculate the accuracy, i.e. the percentage of the training set that is classified correctly. Although the first tree was trained on the data set until no improvement of the accuracy was possible, its accuracy is *only* 98%. Explain why it is not 100%.

Because of noise in the data. There might be outliers within one group of flowers that have the same attribute values as a flower from another group. Samples with identical attribute values but different labels cannot be separated by any split, so the tree cannot classify all of them correctly, no matter how long it is trained.
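
This limit can be made concrete: any classifier that depends only on the features must give identical feature vectors the same label. The sketch below computes the resulting upper bound on training accuracy; the function name `best_training_accuracy` and the measurements are invented for illustration and are not taken from the iris data.

```python
from collections import Counter, defaultdict


def best_training_accuracy(samples):
    """Upper bound on training accuracy when identical features must get one label."""
    by_features = defaultdict(Counter)
    for features, label in samples:
        by_features[tuple(features)][label] += 1
    # For each distinct feature vector, at most the majority label can be right.
    correct = sum(counts.most_common(1)[0][1] for counts in by_features.values())
    return correct / len(samples)


# Hypothetical (petal length, petal width) measurements with one conflict.
samples = [
    ((4.8, 1.8), "versicolor"),
    ((4.8, 1.8), "virginica"),
    ((1.4, 0.2), "setosa"),
    ((5.1, 1.9), "virginica"),
]

bound = best_training_accuracy(samples)
print(bound)
```

With one conflicting pair among four samples, no tree can exceed 75% accuracy on this toy set, however deep it grows.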

**d)** Tree 2 only has a 96% accuracy on the training set. Why might this tree still be preferable over tree 1?

The second tree might still be preferable since it is less complex. Each attribute is used at most once (while in the first tree each attribute is used up to two times). Because of this, it takes less effort to classify with the second tree, and it does not lose much accuracy. Additionally, a less complex tree with similar accuracy is likely to generalize better, which is helpful when applying it to unseen data. Choosing the less complex model while performance stays roughly the same is also in line with Occam's razor.