# Entropy

Prove that the following two equations for computing the entropy of a set $T$ (with respect to classes
$C_1,\ldots,C_k$) are equivalent:

$$\text{entropy}(T)=-\sum_{i=1}^k p_i \cdot \log(p_i)$$

$$\text{entropy}(T)=-\frac{1}{|T|} \left( \sum_{i=1}^k c_i \cdot \log(c_i) \right) + \log(|T|)$$

where $c_i = \vert C_i \cap T \vert$ denotes the number of objects in $T$ that are in class $C_i$ and $p_i$ denotes the probability that an object in $T$ belongs to $C_i$.

You can assume the $C_i$ to be disjoint and that every object in $T$ is contained in exactly one $C_i$.
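
One possible line of argument, sketched under the assumption that $p_i$ is estimated by the relative frequency $p_i = c_i / |T|$ (the exercise does not state this explicitly) and with the usual convention $0 \cdot \log(0) = 0$:

$$\begin{aligned}
-\sum_{i=1}^k p_i \cdot \log(p_i)
&= -\sum_{i=1}^k \frac{c_i}{|T|} \cdot \left( \log(c_i) - \log(|T|) \right) \\
&= -\frac{1}{|T|} \left( \sum_{i=1}^k c_i \cdot \log(c_i) \right) + \frac{\log(|T|)}{|T|} \sum_{i=1}^k c_i \\
&= -\frac{1}{|T|} \left( \sum_{i=1}^k c_i \cdot \log(c_i) \right) + \log(|T|)
\end{aligned}$$

The last step uses $\sum_{i=1}^k c_i = |T|$, which holds because the $C_i$ are disjoint and every object in $T$ lies in exactly one of them.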

### TODO

# Decision Trees

In [2]:
import pandas as pd
df = pd.DataFrame([
    [ 23 , 'm', 'low'],
    [ 35 , 'f', 'low'],
    [ 25 , 'm', 'low'],
    [ 78 , 'f', 'low'],
    [ 46 , 'f', 'med'],
    [ 81 , 'f', 'med'],
    [ 47 , 'm', 'med'],
    [ 55 , 'm', 'high'],
    [ 71 , 'f', 'high'],
    [ 41 , 'm', 'high']],
    columns=[  "Age", "Gender", "Cancer-Risk" ])

Given the following data set of patient records:

In [3]:
df

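| | Age | Gender | Cancer-Risk |
|---|---|---|---|
| 0 | 23 | m | low |
| 1 | 35 | f | low |
| 2 | 25 | m | low |
| 3 | 78 | f | low |
| 4 | 46 | f | med |
| 5 | 81 | f | med |
| 6 | 47 | m | med |
| 7 | 55 | m | high |
| 8 | 71 | f | high |
| 9 | 41 | m | high |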


a) Discretize the Age attribute into the values $\{< 40, [40:70], > 70 \}$.

In [None]:
# TODO: Write code or write a Markdown block containing your computations and results.
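
A minimal sketch of one way to do the discretization, assuming that the boundary values 40 and 70 both belong to the middle interval `[40:70]`. The ages are copied from the records above; the column name `AgeGroup` is a hypothetical choice and not part of the exercise.

```python
import numpy as np
import pandas as pd

# Ages copied from the patient records above.
ages = pd.Series([23, 35, 25, 78, 46, 81, 47, 55, 71, 41], name="Age")
a = ages.to_numpy()

# np.select uses the first condition that holds for each element, so the
# second condition effectively means 40 <= age <= 70.
age_group = pd.Series(
    np.select([a < 40, a <= 70], ["<40", "[40:70]"], default=">70"),
    name="AgeGroup",  # hypothetical column name
)
print(pd.concat([ages, age_group], axis=1))
```

`pd.cut` would also work, but its half-open intervals make it easy to put 40 or 70 into the wrong bin, which is why the conditions are spelled out here.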


b) For the discretized age attribute only, compute for $T$ and for each partition $T_i$ the (1) entropy and (2) gini index.
Then compute the resulting (3) information gain and (4) $\operatorname{Benefit}_{\operatorname{Gini}}$ of the split
with respect to their ability to predict Cancer-Risk.

In [None]:
# TODO: You do not need to do this as code.
# Preferably, do this by hand and write a Markdown block.
# In that case, you can delete the code frame below.

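def entropy(T):
    # TODO
    raise NotImplementedError

def gini(T):
    # TODO
    raise NotImplementedError

def information_gain(T, column):
    # TODO: column is the splitting attribute
    raise NotImplementedError

def benefit(T, column):
    # TODO: column is the splitting attribute
    raise NotImplementedError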

# TODO: Output results
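
As a way to check hand computations, the four quantities could be sketched as follows. This assumes logarithms to base 2 (the exercise does not fix the base) and the age partitions `<40`, `[40:70]`, `>70` with 40 and 70 in the middle group. The function names `entropy_of`, `gini_of` and `benefit_of` are hypothetical and deliberately differ from the stubs above.

```python
import math
from collections import Counter

def entropy_of(labels):
    # Entropy in bits (base-2 logarithm is an assumption).
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini_of(labels):
    # Gini index: 1 minus the sum of squared class frequencies.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def benefit_of(impurity, partitions):
    # Impurity of the whole set minus the size-weighted impurity of the parts.
    total = [label for part in partitions for label in part]
    weighted = sum(len(part) / len(total) * impurity(part) for part in partitions)
    return impurity(total) - weighted

# Cancer-Risk labels per discretized age group, read off the table above.
partitions = [
    ["low", "low", "low"],           # Age < 40
    ["med", "med", "high", "high"],  # Age in [40:70]
    ["low", "med", "high"],          # Age > 70
]

for part in partitions:
    print(part, entropy_of(part), gini_of(part))
print("information gain:", benefit_of(entropy_of, partitions))
print("Gini benefit:", benefit_of(gini_of, partitions))
```

With entropy as the impurity, `benefit_of` yields the information gain; with the Gini index it yields $\operatorname{Benefit}_{\operatorname{Gini}}$.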

c) Construct the decision tree of the above records and attributes using the Gini index split strategy only.
Stop if all remaining attributes are constant. Argue how you choose the prediction of a leaf.

Draw the tree with all necessary information to predict new records.

Hint: You do not need to compute the benefit, if you have only one candidate attribute remaining.
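
One common way to choose the prediction of a leaf is a majority vote over the training labels that reach it. The sketch below assumes that rule; breaking ties by the first label encountered is an arbitrary choice made here, and the exercise asks you to argue your own choice. The name `majority_label` is hypothetical.

```python
from collections import Counter

def majority_label(labels):
    # Count each class; Counter keeps keys in order of first occurrence.
    counts = Counter(labels)
    # max() returns the first key with the highest count, so ties go to the
    # label that appeared first (an arbitrary tie-breaking assumption).
    return max(counts, key=counts.get)

print(majority_label(["low", "low", "med"]))
print(majority_label(["med", "med", "high", "high"]))
```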

In [None]:
# TODO: You do not need to do this as code.
# Preferably, do this by hand and write a Markdown block.
# For printing your results (using either way), you can use the class below.

In [None]:
class Tree():
    def __init__(self,parent=None,classification=None,attr_val=None,column=None):
        self.classification = classification
        self.attr_val = attr_val
        self.column = column
        self.children = []
        self.parent = parent
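        if parent is not None:
            parent.add_child(self)
    
    def get_children(self):
        # Accessor for the child list; a method named `children` would be
        # shadowed by the instance attribute of the same name.
        return self.children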
    
    def add_child(self,tree):
        self.children.append(tree)
        tree.parent = self
    
    def depth_in_tree(self):
        if self.parent is None:
            return 0
        return 1+self.parent.depth_in_tree()
    
    def print_self(self,column_names):
        self._print_self_helper("",column_names)
        
    def _print_self_helper(self,prefix,column_names):
        if len(self.children) > 0:
            print(prefix,"("+str(self.attr_val)+")",column_names[self.column])
            for child in self.children:
                child._print_self_helper(" |  "*(len(prefix)//4)+" |--",column_names)
        else:
            print(prefix,"("+str(self.attr_val)+")",self.classification)

# Usage:
# Create a root node with Tree()
# Create any further node with Tree(parent=<parent node>) or with Tree() and later call parent.add_child(child)
# Use the classification attribute in leaves for the classification value
# Use the column attribute in inner nodes to specify the splitting attribute
# Use the attr_val attribute in all nodes to specify the attribute value of the previous split
# All attributes can be changed at any time
# Print the entire tree with <root>.print_self(<column names>)

# Example:
# Creates a DT with random splits up to a maximum depth.
# Classification labels are chosen arbitrarily and do not represent real classifications.
import numpy as np
def print_example_DT(df):
    n_attrs = len(df.columns)-1
    first_split = np.random.randint(n_attrs)
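    # Offset in 1..n_attrs-1, so second_split always differs from first_split.
    # (np.random.randint(n_attrs-2) raises a ValueError for two attributes.)
    second_split = (first_split + np.random.randint(n_attrs-1) + 1) % n_attrs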
    classes = list(set(df[df.columns[-1]]))
    attr_vals = [list(set(df[df.columns[i]])) for i in range(len(df.columns)-1)]
    # First split with attribute first_split
    root = Tree(column=first_split)
    for val in attr_vals[first_split]:
        # Second split with attribute second_split
        inner_node = Tree(parent=root,column=second_split,attr_val=val)
        for val2 in attr_vals[second_split]:
            c = classes[np.random.randint(len(classes))]
            leaf = Tree(
                parent=inner_node,
                classification=c,
                attr_val=val2)
    root.print_self(df.columns)

print_example_DT(df)

d) Predict the class labels of the new records given below.

In [None]:
qdf = pd.DataFrame([
    [ 35 , 'm', None ],
    [ 28 , 'f', None ],
    [ 48 , 'm', None ],
    [ 55 , 'f', None ],
    [ 83 , 'm', None ],
    [ 72 , 'f', None ]],
    columns=[  "Age", "Gender", "Cancer-Risk" ])
qdf

In [None]:
# TODO: Compute in a Markdown block and write results into the Dataframe qdf.
# You may code it, but that's quite advanced.

# Mushroom Classification

We use a variant of the well known Mushroom data from the UCI machine learning repository.

Use the `mushroom.csv` file from the Moodle website. This Jupyter template contains code snippets that load the data and do some intermediate processing. Please implement the missing functions flagged with `#TODO`.

*Note:* In the lecture we discussed DTs for categorical inputs. `SKLearn`'s DTs are implemented for numerical attributes, which makes it difficult to use the `SKLearn` interface directly.

In [None]:
import numpy as np
import pandas as pd
from plotly import graph_objects as go

df = pd.read_csv('mushrooms.csv')

column_names = df.columns
X = df.to_numpy()[:,:-1]
y = df.to_numpy()[:,-1]

a) Implement a function `gini(y)` to compute the Gini index for labels `y`.

In [None]:
def gini(y):
    # TODO
    return 0
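
A possible starting point, assuming `y` is a one-dimensional NumPy array of class labels. The name `gini_impurity` and the toy labels are hypothetical; returning 0 for an empty array is a convention chosen here.

```python
import numpy as np

def gini_impurity(y):
    # An empty label set is treated as pure (assumption).
    if len(y) == 0:
        return 0.0
    # Relative class frequencies from the label counts.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

toy_labels = np.array(["e", "e", "p", "p"])  # hypothetical labels
print(gini_impurity(toy_labels))
```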

b) Implement a function `entropy(y)` to compute the information entropy for labels `y`.

In [None]:
def entropy(y):
    # TODO
    return 0
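
The entropy can be sketched along the same lines, again assuming a one-dimensional NumPy label array and logarithms to base 2 (the base is not fixed by the exercise). The name `shannon_entropy` is hypothetical.

```python
import numpy as np

def shannon_entropy(y):
    # An empty label set is treated as pure (assumption).
    if len(y) == 0:
        return 0.0
    # np.unique only reports classes that occur, so every p is > 0
    # and log2(p) is always defined.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

print(shannon_entropy(np.array(["e", "e", "p", "p"])))
```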

c) Implement a function `classification_error(y)` to compute the classification error for labels `y`.

In [None]:
def classification_error(y):
    # TODO
    return 0
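
For the classification error, a sketch assuming the usual definition "one minus the relative frequency of the most common class"; check this against the definition from the lecture. The name `misclassification_rate` is hypothetical.

```python
import numpy as np

def misclassification_rate(y):
    # An empty label set is treated as pure (assumption).
    if len(y) == 0:
        return 0.0
    # Predicting the majority class misclassifies everything else.
    _, counts = np.unique(y, return_counts=True)
    return float(1.0 - counts.max() / counts.sum())

print(misclassification_rate(np.array(["e", "e", "e", "p"])))
```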

d) Implement a `gain_ratio` quality function for an attribute column and labels `y`.

In [None]:
def gain_ratio(attribute, y):
    # TODO
    return 0
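
A sketch of the gain ratio, assuming the common definition "information gain divided by the split information", where the split information is the entropy of the attribute's own value distribution. Base-2 logarithms and the value 0 for constant attributes are assumptions made here; the names `_entropy_bits` and `gain_ratio_sketch` are hypothetical.

```python
import numpy as np

def _entropy_bits(values):
    # Entropy (base 2) of the empirical distribution of `values`.
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gain_ratio_sketch(attribute, y):
    # Information gain of splitting y by the attribute values.
    gain = _entropy_bits(y)
    for v in np.unique(attribute):
        subset = y[attribute == v]
        gain -= len(subset) / len(y) * _entropy_bits(subset)
    # Split information penalizes attributes with many distinct values.
    split_info = _entropy_bits(attribute)
    # A constant attribute cannot split anything; return 0 (assumption).
    return gain / split_info if split_info > 0 else 0.0

attr = np.array(["a", "a", "b", "b"])  # hypothetical attribute column
labels = np.array(["x", "x", "y", "y"])  # hypothetical labels
print(gain_ratio_sketch(attr, labels))
```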

You can use the provided benefit function to transform an impurity measure into a benefit measure.

In [None]:
def benefit(X, y, impurity):
    score = impurity(y)
    for v in np.unique(X):
        subset = y[X == v]
        score = score - len(subset)/float(len(y)) * impurity(subset)
    return score

gini_benefit = lambda X,y: benefit(X,y,gini)
information_gain = lambda X,y: benefit(X,y,entropy)
error_benefit = lambda X,y: benefit(X,y,classification_error)
print(column_names[0], gini_benefit(X[:,0], y), information_gain(X[:,0], y), error_benefit(X[:,0], y))

e) Implement a function `find_best(X, y, quality)` to find the best attribute in $X$, using
the given quality function (e.g., `gini_benefit`). The function must return the quality
as well as the best attribute.

In [None]:
def find_best(X, y, quality):
    # TODO
    return quality_value, column  # (quality, column index), matching the [1] lookups below

print(find_best(X, y, gini_benefit),     column_names[find_best(X, y, gini_benefit)[1]])
print(find_best(X, y, information_gain), column_names[find_best(X, y, information_gain)[1]])
print(find_best(X, y, error_benefit),    column_names[find_best(X, y, error_benefit)[1]])
print(find_best(X, y, gain_ratio),       column_names[find_best(X, y, gain_ratio)[1]])
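
One way `find_best` could look, assuming `X` is a two-dimensional NumPy array and `quality(column, y)` returns a number where larger is better. The return order `(quality, column index)` follows the task text and the `[1]` lookups above. The name `find_best_sketch` is hypothetical.

```python
import numpy as np

def find_best_sketch(X, y, quality):
    # Score every attribute column with the given quality function.
    scores = [quality(X[:, j], y) for j in range(X.shape[1])]
    # np.argmax returns the first index of the maximum, so ties are
    # resolved in favor of the first attribute.
    best = int(np.argmax(scores))
    return scores[best], best
```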

f) Implement a recursive function `split(X,y,quality)` to construct a decision tree.<br/>
Stop when a branch is pure, and always split into all distinct attribute values of the *best* attribute.<br/>
If more than one attribute has the same score, use the *first* of the best attributes.

Return the result as an object of the `Tree` class provided in *Decision Trees c)*.

In [None]:
def split(X, y, quality):
    root = Tree()
    # TODO
    return root
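
The recursion itself can be sketched independently of the `Tree` class. To stay self-contained, the version below returns nested dictionaries instead of `Tree` objects, which is a swapped-in representation and not the one the task asks for. It also adds a safeguard that the task does not mention: attributes that are constant in the current subset are skipped, and an impure subset without any usable attribute becomes a majority-vote leaf. All names are hypothetical.

```python
import numpy as np

def gini_benefit_sketch(column, y):
    # Gini benefit of splitting y by the values of one attribute column.
    def gini(labels):
        _, counts = np.unique(labels, return_counts=True)
        return 1.0 - np.sum((counts / counts.sum()) ** 2)
    score = gini(y)
    for v in np.unique(column):
        subset = y[column == v]
        score -= len(subset) / len(y) * gini(subset)
    return score

def split_sketch(X, y, quality):
    labels, counts = np.unique(y, return_counts=True)
    majority = str(labels[np.argmax(counts)])
    # Only attributes with at least two distinct values can split the data.
    candidates = [j for j in range(X.shape[1]) if len(np.unique(X[:, j])) > 1]
    # Stop on a pure branch, or when no attribute is left to split on.
    if len(labels) == 1 or not candidates:
        return {"classification": majority}
    scores = [quality(X[:, j], y) for j in candidates]
    best = candidates[int(np.argmax(scores))]  # first of the tied attributes
    children = {}
    for v in np.unique(X[:, best]):
        mask = X[:, best] == v
        children[str(v)] = split_sketch(X[mask], y[mask], quality)
    return {"column": best, "children": children}

# Toy data (hypothetical): column 0 separates the classes, column 1 does not.
X_toy = np.array([["a", "x"], ["a", "y"], ["b", "x"], ["b", "y"]])
y_toy = np.array(["p", "p", "q", "q"])
tree_toy = split_sketch(X_toy, y_toy, gini_benefit_sketch)
print(tree_toy)
```

Because every split uses a non-constant attribute, each child subset is strictly smaller than its parent, so the recursion terminates.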

g) Use the provided `print_self` method of the `Tree` class to print the different trees using the quality measures `gini_benefit`, `information_gain`, `error_benefit` and `gain_ratio`.

What do you notice?

In [None]:
print("Gini:")
tree = split(X, y, gini_benefit)
tree.print_self(df.columns)
print("\nInformation Gain:")
# TODO
print("\nClassification Error:")
# TODO
print("\nGain ratio:")
# TODO

### TODO: Write what you notice!