Osnabrück University - Machine Learning (Summer Term 2024) - Prof. Dr.-Ing. G. Heidemann, Ulf Krumnack, Lukas Niehaus

# Exercise Sheet 01: Decision Trees

## Introduction

This is the first official exercise sheet. The homework sheets will usually be available at the beginning of the week and are to be solved in groups of three. They have to be handed in by the end of Sunday of that week. The exercises are then presented to your tutor in a small feedback session. To be admitted to the final exam, you have to pass $N-2$ of the weekly exercise sheets.

Sign up for a group on Stud.IP (See `Participants` -> `Functions/Groups`). The times mentioned there are the times for the feedback session of your group. If none of them fits, send any of the tutors an e-mail so we can try to arrange something.

Your group will have a group folder in Stud.IP under `Files`. Upload your solutions there to hand them in.

All exercise sheets will use [Jupyter Notebooks](http://jupyter-notebook.readthedocs.org/en/latest/notebook.html). To be able to run these on your system, you will need to install Python and a few packages. We suggest the newest version of Python 3 and installing the conda environment as explained in the practice session and in the file "ml-install.txt".

This week's sheet should be solved and handed in before the end of **Sunday, April 21, 2024**.
Please upload your results to your group's Stud.IP folder. In case you cannot do this first sheet (due to technical or organizational problems) please upload a description of your problem instead. Your tutor will help you to solve the problems in the first feedback session and you may hand in this sheet together with the second sheet one week later.

## Decision Trees [4 Points]
Draw the decision trees for the following boolean functions. Either use pen and paper and scan or photograph the result, or put your inner ASCII artist to work in the cells below.

Note: $\oplus$ denotes xor, which means that exactly one of the operands has to be true while the other one has to be false:

|$$\oplus$$ | $$B$$ | $$\neg B$$|
|:---------|:-----|:---------|
|$$A$$      |  f  |    t|
|$$\neg A$$ |  t  |    f|

**a)** $\neg A \wedge B$


         A
        / \
     t /   \ f
      /     \
     No      B
            / \
         t /   \ f
          /     \
        Yes     No


**b)** $A \oplus B$

               A
              / \
         t  /     \  f
          /         \
         B           B
        / \         / \
     t /   \ f   t /   \ f
      /     \     /     \
     No     Yes Yes     No

**c)** $A \vee (B \wedge C) \vee (\neg C \wedge D)$

            A
           / \
      t  /     \  f
       /         \
     Yes          C
                 / \
            t  /     \  f
             /         \
            B           D
           / \         / \
        t /   \ f   t /   \ f
         /     \     /     \
       Yes     No  Yes     No

**d)** $(A \rightarrow (B \wedge \neg C)) \vee (A \wedge B)$

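Hint: recall material implication from logic, $X \rightarrow Y \equiv \neg X \vee Y$.

$= (\neg A \vee (B \wedge \neg C)) \vee (A \wedge B)$

$= \neg A \vee (A \wedge B) \vee (B \wedge \neg C)$

$= \neg A \vee B \vee (B \wedge \neg C)$ (because $\neg A \vee (A \wedge B) = \neg A \vee B$)

$= \neg A \vee  B$ (absorption: $B \vee (B \wedge \neg C) = B$)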

              A
             / \
          t /   \ f
           /     \
          B      Yes
         / \
      t /   \ f
       /     \
     Yes     No
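
The four trees above can be checked mechanically. The following sketch (all function names are made up for this illustration) writes each formula and each hand-drawn tree as a small Python function and compares them on all 16 truth assignments of $A, B, C, D$.

```python
from itertools import product

# Formulas exactly as stated in the exercise.
def formula_a(A, B, C, D): return (not A) and B
def formula_b(A, B, C, D): return A != B
def formula_c(A, B, C, D): return A or (B and C) or ((not C) and D)
def formula_d(A, B, C, D): return ((not A) or (B and not C)) or (A and B)

# The drawn trees, read top-down as nested tests.
def tree_a(A, B, C, D):
    if A:
        return False
    return B

def tree_b(A, B, C, D):
    if A:
        return not B
    return B

def tree_c(A, B, C, D):
    if A:
        return True
    if C:
        return B
    return D

def tree_d(A, B, C, D):
    if A:
        return B
    return True

pairs = {"a": (formula_a, tree_a), "b": (formula_b, tree_b),
         "c": (formula_c, tree_c), "d": (formula_d, tree_d)}

# Collect every assignment on which a tree disagrees with its formula.
mismatches = [(name, values)
              for name, (formula, tree) in pairs.items()
              for values in product([False, True], repeat=4)
              if bool(formula(*values)) != bool(tree(*values))]
print(mismatches)
```

An empty list means that every tree agrees with its formula on all assignments.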

## Entropy and Information Gain [8 Points]

In many machine learning applications it is crucial to determine which criteria are necessary for a good classification. Decision trees place the most significant criteria close to the root, which orders them from more to less significant. One way to select the most important criterion is to compare its information gain, i.e. the expected reduction in entropy it achieves, with that of the other criteria. The following data set is a hands-on example of this method.

Consider the following attributes with their possible values:

  * $raining = \{yes, no\}$
  * $tired = \{yes, no\}$
  * $late = \{yes, no\}$
  * $distance = \{short, medium, long\}$

And a training data set consisting of those attributes:

| #  | raining | tired | late | distance | attend_party |
|----|---------|-------|------|----------|--------------|
| 1  | yes     | no    | no   | short    | **yes**      |
| 2  | yes     | no    | yes  | medium   | **no**       |
| 3  | no      | yes   | no   | long     | **no**       |
| 4  | yes     | yes   | yes  | short    | **no**       |
| 5  | yes     | no    | no   | short    | **yes**      |
| 6  | no      | no    | no   | medium   | **yes**      |
| 7  | no      | yes   | no   | long     | **no**       |
| 8  | yes     | no    | yes  | short    | **no**       |
| 9  | yes     | yes   | no   | short    | **yes**      |
| 10 | no      | yes   | no   | medium   | **no**       |
| 11 | no      | yes   | no   | long     | **no**       |
| 12 | no      | yes   | yes  | short    | **no**       |

**a)** Build the root node of a decision tree from the training samples given in the table above by calculating the information gain for all four attributes (raining, tired, late, distance).

$$\operatorname{Gain}(S,A) = \operatorname{Entropy}(S) - \sum_{v \in \operatorname{Values}(A)} \frac{|S_v|}{|S|}\operatorname{Entropy}(S_v)$$

$$\operatorname{Entropy}(S) = -p_{\oplus} \log_{2} p_{\oplus} - p_{\ominus} \log_{2} p_{\ominus}$$

$S$ is the set of all data samples. $S_v$ is the subset for which attribute $A$ has value $v$. An example for attribute **tired** with value $yes$ would be:
$$|S_{yes}| = 7, S_{yes}:[1+, 6-]$$

*Caution: some pretty intense rounding was applied to the following numbers. So your results might differ, but they should be in the same ballpark!*

Entropy of the whole dataset:

$$Entropy\left(S\right) = -\frac{4}{12} \log_{2} \frac{4}{12} - \frac{8}{12} \log_{2} \frac{8}{12} \approx 0.92$$

Attribute **raining**:
$$\left|S_{yes}\right|:6 , S_{yes}:[3+,3-]$$

$$Entropy\left(S_{yes}\right) = -\frac{3}{6} \log_{2} \frac{3}{6}-\frac{3}{6} \log_{2} \frac{3}{6} = 1$$

$$\left|S_{no}\right|:6 , S_{no}:[1+,5-]$$

$$Entropy\left(S_{no}\right) = -\frac{1}{6} \log_{2} \frac{1}{6}-\frac{5}{6} \log_{2} \frac{5}{6} \approx 0.65$$

$$Gain\left(S,raining\right) \approx 0.92 - \left(\frac{6}{12}\cdot 1 + \frac{6}{12}\cdot 0.65\right) = 0.095$$

Attribute **tired**:
$$\left|S_{yes}\right|:7 , S_{yes}:[1+,6-]$$

$$Entropy\left(S_{yes}\right) = -\frac{1}{7} \log_{2} \frac{1}{7}-\frac{6}{7} \log_{2} \frac{6}{7} \approx 0.59$$

$$\left|S_{no}\right|:5 , S_{no}:[3+,2-]$$

$$Entropy\left(S_{no}\right) = -\frac{3}{5} \log_{2} \frac{3}{5}-\frac{2}{5} \log_{2} \frac{2}{5} \approx 0.97$$

$$Gain\left(S,tired\right) \approx 0.92 - \left(\frac{7}{12}\cdot 0.59 + \frac{5}{12}\cdot 0.97\right) \approx 0.171$$

Attribute **late**:
$$\left|S_{yes}\right|:4 , S_{yes}:[0+,4-]$$

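$$Entropy\left(S_{yes}\right) = -\frac{0}{4} \log_{2} \frac{0}{4}-\frac{4}{4} \log_{2} \frac{4}{4} = 0$$

Here we use the convention $0 \log_{2} 0 = 0$, which is justified by $\lim_{p \to 0^{+}} p \log_{2} p = 0$.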

$$\left|S_{no}\right|:8 , S_{no}:[4+,4-]$$

$$Entropy\left(S_{no}\right) = -\frac{4}{8} \log_{2} \frac{4}{8}-\frac{4}{8} \log_{2} \frac{4}{8} = 1$$

$$Gain\left(S,late\right) \approx 0.92 - \left(\frac{4}{12}\cdot 0 + \frac{8}{12}\cdot 1\right) \approx 0.253$$

Attribute **distance**:
$$\left|S_{short}\right|:6 , S_{short}:[3+,3-]$$

$$Entropy\left(S_{short}\right) = -\frac{3}{6} \log_{2} \frac{3}{6}-\frac{3}{6} \log_{2} \frac{3}{6} = 1$$

$$\left|S_{medium}\right|:3 , S_{medium}:[1+,2-]$$

$$Entropy\left(S_{medium}\right) = -\frac{1}{3} \log_{2} \frac{1}{3}-\frac{2}{3} \log_{2} \frac{2}{3} \approx 0.918$$

$$\left|S_{long}\right|:3 , S_{long}:[0+,3-]$$

$$Entropy\left(S_{long}\right) = -\frac{0}{3} \log_{2} \frac{0}{3}-\frac{3}{3} \log_{2} \frac{3}{3} = 0$$

$$Gain\left(S,distance\right) \approx 0.92 - \left(\frac{6}{12}\cdot 1 + \frac{3}{12}\cdot 0.918 + \frac{3}{12}\cdot 0\right) = 0.191$$

The information gain is greatest for the **late** attribute.
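
As a cross-check of the hand calculation, here is a minimal sketch that recomputes the four gains without intermediate rounding. The helper names `entropy_of` and `information_gain` are chosen for this illustration only; they are not the functions you are asked to implement in the ID3 exercise.

```python
from collections import Counter
from math import log2

# Rows of the training table: (raining, tired, late, distance, attend_party).
ROWS = [
    ("yes", "no",  "no",  "short",  "yes"),
    ("yes", "no",  "yes", "medium", "no"),
    ("no",  "yes", "no",  "long",   "no"),
    ("yes", "yes", "yes", "short",  "no"),
    ("yes", "no",  "no",  "short",  "yes"),
    ("no",  "no",  "no",  "medium", "yes"),
    ("no",  "yes", "no",  "long",   "no"),
    ("yes", "no",  "yes", "short",  "no"),
    ("yes", "yes", "no",  "short",  "yes"),
    ("no",  "yes", "no",  "medium", "no"),
    ("no",  "yes", "no",  "long",   "no"),
    ("no",  "yes", "yes", "short",  "no"),
]
ATTRIBUTES = ["raining", "tired", "late", "distance"]

def entropy_of(labels):
    # Only classes that actually occur are counted, so 0 * log2(0) never arises.
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, column):
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in set(row[column] for row in rows):
        subset = [row[-1] for row in rows if row[column] == value]
        remainder += len(subset) / len(rows) * entropy_of(subset)
    return entropy_of(labels) - remainder

gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
for name, value in gains.items():
    print(f"{name:9s} {value:.3f}")
best = max(gains, key=gains.get)
```

The unrounded gains come out slightly smaller than the hand-rounded ones (for example about 0.169 instead of 0.171 for **tired**), but the ranking is the same.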

**b)** Perform the same calculation as in **a)** but use the gain ratio instead of the information gain. Does the result for the root node change?

$$\operatorname{GainRatio}(S,A) = \frac{\operatorname{Gain}(S,A)}{\operatorname{SplitInformation}(S,A)}$$

$$\operatorname{SplitInformation}(S,A) = - \sum_{v \in \operatorname{Values}(A)} \frac{|S_v|}{|S|} \log_{2} \frac{|S_{v}|}{|S|}$$

Attribute **raining**:

$$SplitInformation(S,raining) = -\left(\frac{6}{12} \log_{2} \frac{6}{12} + \frac{6}{12} \log_{2} \frac{6}{12}\right) = 1$$
$$GainRatio(S,raining) = \frac{0.095}{1} = 0.095$$

Attribute **tired**:

$$SplitInformation(S,tired) = -\left(\frac{7}{12} \log_{2} \frac{7}{12} + \frac{5}{12} \log_{2} \frac{5}{12}\right) \approx 0.98$$
$$GainRatio(S, tired) = \frac{0.171}{0.98} \approx 0.174$$

Attribute **late**:

$$SplitInformation(S,late) = -\left(\frac{4}{12} \log_{2} \frac{4}{12} + \frac{8}{12} \log_{2} \frac{8}{12}\right) \approx 0.918$$
$$GainRatio(S,late) = \frac{0.253}{0.918} \approx 0.276$$

Attribute **distance**:

$$SplitInformation(S,distance) = -\left(\frac{6}{12} \log_{2} \frac{6}{12} + \frac{3}{12} \log_{2} \frac{3}{12} + \frac{3}{12} \log_{2} \frac{3}{12}\right) = 1.5$$
$$GainRatio(S,distance) = \frac{0.191}{1.5} \approx 0.127$$

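The ranking of **tired** and **distance** swaps, but **late** still has the highest value, so the root node does not change: we should still use the **late** attribute.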
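
The gain ratios can be reproduced with a few lines of Python. This sketch takes the subset sizes from the training table and the rounded gains from part **a)** as given; the names `SPLIT_SIZES`, `GAINS` and `split_information` are only used here.

```python
from math import log2

# Subset sizes |S_v| per attribute, read off the training table (|S| = 12).
SPLIT_SIZES = {
    "raining": [6, 6],
    "tired": [7, 5],
    "late": [4, 8],
    "distance": [6, 3, 3],
}
# Information gains as rounded in part a).
GAINS = {"raining": 0.095, "tired": 0.171, "late": 0.253, "distance": 0.191}

def split_information(sizes):
    total = sum(sizes)
    return -sum(s / total * log2(s / total) for s in sizes)

gain_ratios = {name: GAINS[name] / split_information(sizes)
               for name, sizes in SPLIT_SIZES.items()}
ranking = sorted(gain_ratios, key=gain_ratios.get, reverse=True)
print(ranking)
```

Because **distance** has three values, its split information is the largest, which is why it is penalised most by the gain ratio.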

## ID3 algorithm [4 Points]

Implement the following two functions in Python. Take a look at the `assert`s to see how the function should behave. An assert is a condition that your function is required to pass. Most of the conditions here are taken from the lecture slides (ML-03, Slide 12 & 13). Don't worry if you do not get all asserts to pass, just comment the failing ones out.

**a)** Entropy

$$\operatorname{Entropy}(S) = - \sum_{i=1...c} p_i \log_2 p_i$$

In [None]:
from math import log2
def entropy(s):
    """
    Calculate the entropy for a given target value set.

    Args:
        s (list): Target classes for specific observations.

    Returns:
        The entropy of s.
    """
    ### BEGIN SOLUTION
    freq = {}
    for item in s:
        freq[item] = freq.get(item, 0) + 1
    return -sum(f/len(s) * log2(f/len(s)) for f in freq.values())

    # or alternatively:
    return -sum((s.count(target) / len(s)) * log2(s.count(target) / len(s))
                for target in set(s)) # Sets only contain unique values.
    ### END SOLUTION

# See ML-03, Slide 12 & 13

In [None]:
# Epsilon: Account for small computational and rounding errors
epsilon = 1e-3
assert abs(entropy([1,1,1,0,0,0]) - 1.0) < epsilon
assert abs(entropy([1,1,1,1,0,0,0]) - 0.985) < epsilon
assert abs(entropy([1,1,1,1,1,1,0]) - 0.592) < epsilon
assert abs(entropy([1,1,1,1,1,1,0,0]) - 0.811) < epsilon
assert abs(entropy([2,2,1,1,0,0]) - 1.585) < epsilon
assert abs(entropy([2,2,2,1,0]) - 1.371) < epsilon
assert abs(entropy([2,2,2,0,0]) - 0.971) < epsilon
assert abs(entropy(['yes','yes','yes','no','no','no']) - 1.0) < epsilon

**b)** Information Gain

$$\operatorname{Gain}(S,A) = \operatorname{Entropy}(S) - \sum_{v \in \operatorname{Values}(A)} \frac{|S_v|}{|S|} \operatorname{Entropy}(S_v)$$

In [None]:
def gain(targets, attr_values):
    """
    Calculates the expected reduction in entropy due to sorting on A.

    Args:
        targets (list): Target classes for observations in attr_values.
        attr_values (list): Values of each instance for the respective attribute.

    Returns:
        The information gain of the attribute with respect to the targets.
    """
    ### BEGIN SOLUTION
    sigma = 0
    for v in set(attr_values): # Sets only contain unique values.
        S_v = [targets[key] for (key, v_) in enumerate(attr_values) if v_ == v]
        sigma += ((len(S_v) / len(targets)) * entropy(S_v))
    return entropy(targets) - sigma
    ### END SOLUTION

# See ML-03, Slide 12 & 13


In [None]:
# The lists here can each be seen as one column of a table such as the one in assignment 2.
# assert_targets would be the last column, while each assert_attribute_values_* list holds the values
# of one attribute column (for example raining or distance).
assert_targets = ["no","no","yes","yes","yes","no","yes","no","yes","yes","yes","yes","yes","no"]
assert_attribute_values_1 = ["yes", "yes","yes","yes","no", "no", "no", "yes", "no", "no", "no","yes", "no", "yes"]
assert_attribute_values_2 = ["high","low","medium","high","high","medium","low","medium","low","high","high","medium","low","low"]
assert_attribute_values_3 = [0,1,0,0,0,1,1,0,0,0,1,1,0,1]

epsilon = 1e-3
assert abs(gain(assert_targets, assert_attribute_values_1) - 0.152) < epsilon
assert abs(gain(assert_targets, assert_attribute_values_2) - 0.05) < epsilon
assert abs(gain(assert_targets, assert_attribute_values_3) - 0.048) < epsilon

**c)** ID3

In the next two cells we have implemented the ID3 algorithm. It relies on your two functions from above, `entropy` and `gain`. Try to understand what the code does and replace `# YOUR CODE HERE` with meaningful comments describing the respective parts of the code. Do not forget to write the docstring. Though it's often annoying, being able to read other people's code is one of the key skills (and obstacles) in software engineering. So give it a try! Otherwise you are of course welcome to write your own implementation.

In [None]:
from collections import Counter, namedtuple


class Node(namedtuple('Node', 'label children')):
    """
    A small node representation with a pretty string representation.
    """
    def __str__(self, level=0):
        return_str ='{}{!s}\n'.format(' ' * level * 4, self.label)
        for child in self.children:
            return_str += child.__str__(level + 1)
        return return_str

def id3(data, attributes, targets, target_names, attribute_names):
    """
    Recursively calculate a tree of Nodes (fields: label [string], children [list])
    using the ID3 algorithm.
    ### BEGIN SOLUTION
    Args:
        data (list):            The (subset of) data points/examples. Each example is a list 
                                   with a value for each attribute.
        attributes (list):      The integer representation of the attributes from which
                                    the best attribute is computed.
        targets (list):         The target values for each example. Same length as data.
        target_names (list):    Names of the target values represented as strings.
        attribute_names (list): Names of the attribute represented as strings. 

    Returns:
        The root node
    ### END SOLUTION
    """

    ### BEGIN SOLUTION
    # If all data points have the same target value, directly return the single-node tree Root, 
    # with this target value as label
    ### END SOLUTION
    if all(target == targets[0] for target in targets):
        return Node('Result: {!s}'.format(target_names[targets[0]]), [])

    ### BEGIN SOLUTION
    # If the list of attributes is empty, directly return the single node tree root,
    # with the most common target value as label
    ### END SOLUTION
    if len(attributes) == 0:       
        most_common_idx = Counter(targets).most_common(1)[0][0]
        return Node('Result: {!s}'.format(target_names[most_common_idx]), [])

    ### BEGIN SOLUTION
    # Find the attribute with the maximum gain
    ### END SOLUTION
    gains = [gain(targets, [r[attribute] for r in data])
             for attribute in attributes]
    max_gain_attribute = attributes[gains.index(max(gains))]

    ### BEGIN SOLUTION
    # Create a root node with the maximum gain attribute
    ### END SOLUTION
    root = Node('Attribute: {!s} (gain {!s})'.format(attribute_names[max_gain_attribute],
                                                     round(max(gains), 4)), [])
    ### BEGIN SOLUTION
    # For each possible value, vi, of the maximum gain attribute
    ### END SOLUTION
    for vi in set(data_sample[max_gain_attribute] for data_sample in data):
        ### BEGIN SOLUTION
        # Add a new tree branch below root for which the maximum gain attribute has value vi
        ### END SOLUTION
        child = Node('Value: {!s}'.format(vi), [])
        root.children.append(child)

        ### BEGIN SOLUTION
        # Find the indices of the datapoints which have vi as value for the maximum gain attribute
        ### END SOLUTION
        vi_indices = [idx for idx, data_sample in enumerate(data)
                          if data_sample[max_gain_attribute] == vi]
 
        ### BEGIN SOLUTION
        # Create the data and target subsets with maximum gain attribute = vi
        ### END SOLUTION
        data_vi = [data[i] for i in vi_indices]
        targets_vi = [targets[i] for i in vi_indices]
        
        ### BEGIN SOLUTION
        # exclude the maximum gain attribute from the subset of attributes which are used in the
        # next iteration
        ### END SOLUTION
        attributes_vi = [attribute for attribute in attributes if not attribute == max_gain_attribute]
       
        if data_vi:
            ### BEGIN SOLUTION
            # If data_vi is not empty below this new branch
            # add the subtree ID3 (data_vi, attributes - maximum gain attributes).
            ### END SOLUTION
            child.children.append(
                id3(data_vi, attributes_vi, targets_vi, target_names, attribute_names)
            )

        else:
            ### BEGIN SOLUTION
            # If no data point has value vi for the max gain attribute, add a new leaf node with
            # the most common target value in the current dataset as label.
            # (targets_vi is empty in this branch, so the full targets list has to be used.)
            ### END SOLUTION
            most_common_idx = Counter(targets).most_common(1)[0][0]
            label = 'Result: {!s}'.format(target_names[most_common_idx])
            child.children.append(Node(label, []))

    return root
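
To get a feeling for the output format before reading `id3` in detail, the following sketch builds a tiny tree by hand and prints it. The `Node` class is repeated from the cell above so that the snippet runs on its own; the labels are placeholders and not the result of running ID3 on any data set.

```python
from collections import namedtuple

# Same definition as in the cell above, repeated so this sketch is self-contained.
class Node(namedtuple('Node', 'label children')):
    def __str__(self, level=0):
        return_str = '{}{!s}\n'.format(' ' * level * 4, self.label)
        for child in self.children:
            return_str += child.__str__(level + 1)
        return return_str

# Hand-built placeholder tree: one attribute node with two value branches.
root = Node('Attribute: a', [
    Node('Value: x', [Node('Result: r1', [])]),
    Node('Value: y', [Node('Result: r2', [])]),
])

# The tuple itself is immutable, but the list stored in `children` is not.
# That is why id3 can grow the tree with root.children.append(child).
root.children.append(Node('Value: z', []))

print(root)
```

Each level of the tree is indented by four spaces, so attribute nodes, value branches and results alternate from left to right in the printed output.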

**d)** The algorithm is applied to two data sets. Run those and discuss the differences. For which data set is the ID3 algorithm better suited and why?

First look at the json file in which the party dataset is saved:

In [None]:
import json

with open('party.json', 'r') as party_file:
    party = json.load(party_file)
    
print(json.dumps(party, indent=4, sort_keys=True))

We see that the dataset is parsed as a dictionary with four entries:

* `attributes`: A list of the attribute names
* `data`: A list of the x-data of our samples. Each sample is again a list
* `target_names`: A list of the targets, i.e. labels
* `targets`: A list of the labels of our samples
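
As an illustration, a data set in this layout could look like the following hand-written miniature. The concrete encoding of the attribute values is an assumption (the real `party.json` may, for example, store numbers instead of strings). What the `id3` code above does rely on is that the entries of `targets` are indices into `target_names`.

```python
# Hypothetical miniature with the same four keys; not the content of party.json.
toy = {
    "attributes": ["raining", "tired", "late", "distance"],
    "data": [
        ["yes", "no", "no", "short"],
        ["yes", "no", "yes", "medium"],
        ["no", "yes", "no", "long"],
    ],
    "target_names": ["no", "yes"],
    "targets": [1, 0, 0],
}

# targets holds indices into target_names.
labels = [toy["target_names"][t] for t in toy["targets"]]

# One attribute column, extracted the same way the id3 code does with r[attribute].
late_index = toy["attributes"].index("late")
late_column = [row[late_index] for row in toy["data"]]

print(labels)
print(late_column)
```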

This code runs the ID3 algorithm on the party data set which you already know from assignment 2.

In [None]:
import json

with open('party.json', 'r') as party_file:
    
    party = json.load(party_file)

# Make sure our gain function handles the data set as expected.
epsilon = 1e-3
assert abs(gain(party['targets'], [r[2] for r in party['data']]) - 0.252) < epsilon


data = party['data']
attribute_names = party['attributes']
attributes = list(range(len(attribute_names)))
targets = party['targets']
target_names = party['target_names']


# Apply ID3 algorithm
tree_party = id3(data, attributes, targets, target_names, attribute_names)

print(tree_party)

This code runs the ID3 algorithm on the famous iris flowers data set.

In [None]:
import json

with open('iris.json', 'r') as iris_file:
    iris = json.load(iris_file)

# Make sure our gain function handles the data set as expected.
epsilon = 1e-3
assert abs(gain(iris['targets'], [r[2] for r in iris['data']]) - 1.446) < epsilon

data = iris['data']
attribute_names = iris['attributes']
attributes = list(range(len(attribute_names)))
targets = iris['targets']
target_names = iris['target_names']

# Apply ID3 algorithm
tree_iris = id3(data, attributes, targets, target_names, attribute_names)

print(tree_iris)


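ID3 is better suited for the party data set, whose attributes are all nominal with only a few possible values. The problem with the iris data set is that ID3 assumes nominal attributes, so for continuous measurements it creates a separate branch for every distinct value. The tree therefore grows very wide, and it has no branch for a measurement that did not occur in the training data.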
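
The effect can be made concrete with a small sketch. The measurements below are made up for illustration, and the threshold 2.5 is picked by hand. Splitting on a threshold is an alternative way of handling continuous attributes that is not part of the ID3 code above.

```python
from collections import Counter
from math import log2

def entropy_of(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain_of(values, labels):
    """Information gain when `values` is treated as a nominal attribute."""
    remainder = 0.0
    for v in set(values):
        subset = [label for x, label in zip(values, labels) if x == v]
        remainder += len(subset) / len(labels) * entropy_of(subset)
    return entropy_of(labels) - remainder

# Hypothetical petal lengths in cm with their species.
petal_length = [1.4, 1.5, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1]
species = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 2

# Nominal view: every distinct measurement becomes its own branch.
n_branches = len(set(petal_length))
nominal_gain = gain_of(petal_length, species)

# Threshold view: a single binary test on the same attribute.
threshold_gain = gain_of([x <= 2.5 for x in petal_length], species)

print(n_branches, round(nominal_gain, 3), round(threshold_gain, 3))
```

The nominal split reaches the maximum possible gain (the full entropy of the labels) only because every branch contains a single training sample. The binary test has a lower gain but produces two branches that also cover unseen measurements.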

## Decision Trees on Iris Flowers [4 Points]

In this exercise we are going to examine and compare two decision trees that were generated from the iris flower data set to classify three variations of Iris flowers. The Iris data set is a classical example of a labeled dataset, i.e. every sample consists of two parts: features and labels. There are four features per sample in this data set (sepal length ($x_1$), sepal width ($x_2$), petal length ($x_3$) and petal width ($x_4$) in cm) and a corresponding label (Iris Setosa, Iris Versicolour, Iris Virginica). These samples are by nature **noisy**, no matter how carefully the measurement was taken - slight deviation from the actual length **cannot be avoided**. We want to learn how the features are related to the label so that we could (in the future) predict the label of a new sample automatically. One way to obtain such a `classifier` is to train a decision tree on the data.

Here are two decision trees generated from the data set. We will now take a closer look at them.

**Tree 1:**

**Tree 2:**

**a)** What does it mean that the features $x_1$ and $x_2$ do not appear in the decision trees?

Sepal length and sepal width are not relevant for the classification. This might be either because they are redundant or because they are independent of the class.

**b)** With which method from the lecture might the second tree have been generated from the first one? Explain the procedure.

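Reduced error pruning. Using a validation set that was held out from training, greedily replace the node whose removal reduces the error on the validation set the most by a leaf with the majority class. Repeat this until every further removal would increase the validation error.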
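
A minimal sketch of this procedure is shown below. The tree representation (nested dicts with a `majority` entry that stores the majority training class of each inner node), the thresholds and the validation samples are all assumptions made for this illustration; they are not the two trees shown above.

```python
def predict(node, x):
    """Follow the threshold tests until a leaf (a dict with a 'label' key) is reached."""
    while "label" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

def accuracy(tree, X, y):
    return sum(predict(tree, x) == label for x, label in zip(X, y)) / len(y)

def internal_nodes(node):
    """Yield every non-leaf node in pre-order."""
    if "label" not in node:
        yield node
        yield from internal_nodes(node["left"])
        yield from internal_nodes(node["right"])

def collapse(node):
    """Turn an inner node into a majority-class leaf in place; return its old content."""
    saved = dict(node)
    node.clear()
    node["label"] = saved["majority"]
    return saved

def reduced_error_prune(tree, X_val, y_val):
    """Greedily collapse nodes as long as the validation accuracy does not drop."""
    while True:
        best_node, best_acc = None, accuracy(tree, X_val, y_val)
        for node in list(internal_nodes(tree)):
            saved = collapse(node)              # trial pruning
            acc = accuracy(tree, X_val, y_val)
            node.clear()
            node.update(saved)                  # undo the trial
            if acc >= best_acc:
                best_node, best_acc = node, acc
        if best_node is None:
            return tree
        collapse(best_node)                     # make the best pruning permanent

# Hypothetical tree over (petal length, petal width).
tree = {
    "feature": 0, "threshold": 2.5, "majority": "setosa",
    "left": {"label": "setosa"},
    "right": {
        "feature": 1, "threshold": 1.7, "majority": "versicolor",
        "left": {
            "feature": 0, "threshold": 4.9, "majority": "versicolor",
            "left": {"label": "versicolor"},
            "right": {"label": "virginica"},
        },
        "right": {"label": "virginica"},
    },
}
X_val = [(1.4, 0.2), (1.6, 0.3), (4.5, 1.5), (5.0, 1.6), (5.8, 2.1), (6.0, 1.9)]
y_val = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]

before = accuracy(tree, X_val, y_val)
reduced_error_prune(tree, X_val, y_val)
after = accuracy(tree, X_val, y_val)
print(before, after, len(list(internal_nodes(tree))))
```

On this toy input only the deepest test (petal length at 4.9) is removed, because collapsing either of the other two nodes would lower the validation accuracy.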

**c)** After training the tree we can calculate the accuracy, i.e. the percentage of the training set that is classified correctly. Although the first tree was trained on the data set until no improvement of the accuracy was possible, its accuracy is *only* 98%. Explain why it is not 100%.

The dataset is probably inconsistent, i.e. there are samples with the same features but different classes. Alternative: the thresholding has not produced the optimal partitioning.
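
The first explanation can be illustrated with a short sketch: if identical feature vectors carry different labels, no classifier that decides on the features alone can reach 100% on the training set. The function name and the sample values are made up for this example.

```python
from collections import Counter, defaultdict

def best_possible_training_accuracy(X, y):
    """Upper bound for any classifier that maps equal feature vectors to the same class."""
    labels_by_features = defaultdict(Counter)
    for features, label in zip(X, y):
        labels_by_features[tuple(features)][label] += 1
    # For each distinct feature vector at most the majority label can be predicted correctly.
    correct = sum(max(counts.values()) for counts in labels_by_features.values())
    return correct / len(y)

# Hypothetical samples: the first two have equal features but different classes.
X = [(4.8, 1.8), (4.8, 1.8), (1.4, 0.2), (5.1, 1.9)]
y = ["versicolor", "virginica", "setosa", "virginica"]

bound = best_possible_training_accuracy(X, y)
print(bound)
```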

**d)** Tree 2 only has a 96% accuracy on the training set. Why might this tree still be preferable over tree 1?

Tree 1 is probably overfitted to this specific dataset, i.e. it has not only captured the structure but also the noise in the data. It probably won't generalize as well as the second tree.
Another advantage of tree 2 is that it classifies new data faster, since fewer comparisons have to be made. This difference is hardly noticeable, however.