# Entropy

Prove that the following two equations for computing the entropy of a set $T$ (with respect to classes
$C_1,\ldots,C_k$) are equivalent:

$$\text{entropy}(T)=-\sum_{i=1}^k p_i \cdot \log(p_i)$$

$$\text{entropy}(T)=-\frac{1}{|T|} \left( \sum_{i=1}^k c_i \cdot \log(c_i) \right) + \log(|T|)$$

where $c_i = \vert C_i \cap T \vert$ denotes the number of objects in $T$ that are in class $C_i$ and $p_i$ denotes the probability that an object in $T$ belongs to $C_i$.

You can assume the $C_i$ to be disjoint and that every object in $T$ is contained in exactly one $C_i$.

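### Solution

Since every object in $T$ belongs to exactly one class, $p_i=\frac{c_i}{|T|}$ and $\sum_{i=1}^k c_i = |T|$.

$\log\left(\frac{c_i}{|T|}\right) = \log(c_i) - \log(|T|)$

$\Rightarrow \text{entropy}(T) = -\sum_{i=1}^k \frac{c_i}{|T|} \cdot \left( \log(c_i) - \log(|T|) \right) = -\frac{1}{|T|} \sum_{i=1}^k c_i \cdot \log(c_i) + \frac{\log(|T|)}{|T|} \sum_{i=1}^k c_i = -\frac{1}{|T|} \left( \sum_{i=1}^k c_i \cdot \log(c_i) \right) + \log(|T|)$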
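The equivalence can also be checked numerically. The following is a minimal sketch using base-2 logarithms; the function names and the class counts `[4, 3, 3]` are chosen for illustration only.

```python
import math

def entropy_probs(counts):
    """First form: -sum p_i * log2(p_i) with p_i = c_i / |T|."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_counts(counts):
    """Second form: -(1/|T|) * sum c_i * log2(c_i) + log2(|T|)."""
    total = sum(counts)
    return -sum(c * math.log2(c) for c in counts if c > 0) / total + math.log2(total)

counts = [4, 3, 3]  # illustrative class counts c_1, c_2, c_3
first = entropy_probs(counts)
second = entropy_counts(counts)
print(first, second)  # both forms agree up to floating-point rounding
```

Both functions skip classes with $c_i = 0$, which matches the convention that such terms contribute nothing to the sum.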

# Decision Trees

Classification: labels are classes such as sun, rain, snow

Regression: continuous target values

In [1]:
import pandas as pd
df = pd.DataFrame([
    [ 23 , 'm', 'low'],
    [ 35 , 'f', 'low'],
    [ 25 , 'm', 'low'],
    [ 78 , 'f', 'low'],
    [ 46 , 'f', 'med'],
    [ 81 , 'f', 'med'],
    [ 47 , 'm', 'med'],
    [ 55 , 'm', 'high'],
    [ 71 , 'f', 'high'],
    [ 41 , 'm', 'high']],
    columns=[  "Age", "Gender", "Cancer-Risk" ])

Given the following data set of patient records:

In [8]:
df

Unnamed: 0,Age,Gender,Cancer-Risk
0,23,m,low
1,35,f,low
2,25,m,low
3,78,f,low
4,46,f,med
5,81,f,med
6,47,m,med
7,55,m,high
8,71,f,high
9,41,m,high


a) Discretize the Age attribute into the values $\{< 40, [40:70], > 70 \}$.

In [2]:
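# Note: pd.cut intervals are right-closed by default, i.e. (0,40], (40,70], (70,100].
# An age of exactly 40 would therefore land in "<40"; no such record exists in this data.
df["Age"] = pd.cut(df.iloc[:,0].values,(0,40,70,100),labels=("<40","[40:70]",">70"))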

df

Unnamed: 0,Age,Gender,Cancer-Risk
0,<40,m,low
1,<40,f,low
2,<40,m,low
3,>70,f,low
4,[40:70],f,med
5,>70,f,med
6,[40:70],m,med
7,[40:70],m,high
8,>70,f,high
9,[40:70],m,high
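The boundary handling of the three bins can be made explicit with a small helper. This is a sketch in plain Python, assuming the intended bins are exactly $< 40$, $[40:70]$ and $> 70$; the name `discretize_age` is hypothetical and not part of the notebook.

```python
def discretize_age(age):
    """Map a numeric age to one of the labels '<40', '[40:70]', '>70'."""
    if age < 40:
        return "<40"
    if age <= 70:
        return "[40:70]"
    return ">70"

ages = [23, 35, 25, 78, 46, 81, 47, 55, 71, 41]
labels = [discretize_age(a) for a in ages]
print(labels)

# Boundary values fall into the closed middle interval.
print(discretize_age(40), discretize_age(70))
```

For the ten patient records the labels coincide with the `pd.cut` result; the two approaches differ only for ages exactly on a bin edge.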


b) For the discretized age attribute only, compute for $T$ and for each partition $T_i$ the (1) entropy and (2) gini index.
Then compute the resulting (3) information gain and (4) $\operatorname{Benefit}_{\operatorname{Gini}}$ of the split
with respect to their ability to predict Cancer-Risk.

### Entropy
$p_j = \frac{|T \cap C_j|}{|T|}$ is the fraction of objects in subset $T$ that are labeled with class $C_j$. Terms with $p_j = 0$ contribute nothing, by the convention $0 \cdot \log_2(0) = 0$.


$\text{entropy}(T_{\text{all}}) = - p_{\text{low}} \cdot \log_2(p_{\text{low}}) - p_{\text{med}} \cdot \log_2(p_{\text{med}}) - p_{\text{high}} \cdot \log_2(p_{\text{high}}) = - \frac{4}{10} \cdot \log_2(\frac{4}{10}) - \frac{3}{10} \cdot \log_2(\frac{3}{10}) - \frac{3}{10} \cdot \log_2(\frac{3}{10}) \approx 1.57$

$\text{entropy}(T_{\text{old}}) = - \frac{1}{3} \cdot \log_2(\frac{1}{3}) - \frac{1}{3} \cdot \log_2(\frac{1}{3}) - \frac{1}{3} \cdot \log_2(\frac{1}{3}) \approx 1.58$

$\text{entropy}(T_{\text{midage}}) = - \frac{0}{4} \cdot \log_2(\frac{0}{4}) - \frac{2}{4} \cdot \log_2(\frac{2}{4}) - \frac{2}{4} \cdot \log_2(\frac{2}{4}) \approx 1$

$\text{entropy}(T_{\text{young}}) = - \frac{3}{3} \cdot \log_2(\frac{3}{3}) - \frac{0}{3} \cdot \log_2(\frac{0}{3}) - \frac{0}{3} \cdot \log_2(\frac{0}{3}) \approx 0$

### Gini Index

$\text{Gini}(T_{\text{all}}) = 1 - p_{\text{low}}^2 - p_{\text{med}}^2 - p_{\text{high}}^2 = 1 - \left(\frac{4}{10}\right)^2 - \left(\frac{3}{10}\right)^2 - \left(\frac{3}{10}\right)^2 \approx 0.66$

$\text{Gini}(T_{\text{old}}) = 1 - \left(\frac{1}{3}\right)^2 - \left(\frac{1}{3}\right)^2 - \left(\frac{1}{3}\right)^2 \approx 0.67$

$\text{Gini}(T_{\text{midage}}) = 1 - \left(\frac{0}{4}\right)^2 - \left(\frac{2}{4}\right)^2 - \left(\frac{2}{4}\right)^2 \approx 0.5$

$\text{Gini}(T_{\text{young}}) = 1 - \left(\frac{3}{3}\right)^2 - \left(\frac{0}{3}\right)^2 - \left(\frac{0}{3}\right)^2 \approx 0$

### Information Gain

$\text{gain} = \text{entropy}(T_{\text{all}}) -\frac{3}{10}\text{entropy}(T_{\text{old}}) -\frac{4}{10}\text{entropy}(T_{\text{midage}}) -\frac{3}{10}\text{entropy}(T_{\text{young}}) \approx 1.571 - 0.3 \cdot 1.585 - 0.4 \cdot 1 - 0.3 \cdot 0 \approx 0.70$

### Benefit Gini

$\text{benefit} = \text{Gini}(T_{\text{all}}) -\frac{3}{10} \text{Gini}(T_{\text{old}}) -\frac{4}{10} \text{Gini}(T_{\text{midage}}) -\frac{3}{10} \text{Gini}(T_{\text{young}}) \approx 0.26$

In [8]:
# TODO: You do not need to do this as code.
# Preferably, make this by hand and write a Markdown block.
# For printing your results (using either way), you can use the class below.
import numpy as np

df_1 = df[df["Age"]=="<40"]
df_2 = df[df["Age"]=="[40:70]"]
df_3 = df[df["Age"]==">70"]

df_m = df[df["Gender"]=="m"]
df_f = df[df["Gender"]=="f"]
df_Ti=(df_1,df_2,df_3)
df_gender =(df_m,df_f)
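def entropy(T):
    # Collect the distinct class labels (last column of each row).
    temp=[]
    for i in T:
        if i[-1] not in temp:
            temp.append(i[-1])
    # Count the objects per class: c_i.
    temp_count = np.zeros(len(temp))
    for i in T:
        temp_count[temp.index(i[-1])]+=1
    # Count-based form: -1/|T| * sum(c_i * log2(c_i)) + log2(|T|).
    temp_count = [x*np.log2(x) for x in temp_count]
    return -1/len(T)*sum(temp_count)+np.log2(len(T))

def gini(T):
    # Collect the distinct class labels (last column of each row).
    temp=[]
    for i in T:
        if i[-1] not in temp:
            temp.append(i[-1])
    # Count the objects per class, then square the relative frequencies.
    temp_count = np.zeros(len(temp))
    for i in T:
        temp_count[temp.index(i[-1])]+=1
    temp_count = [(x/len(T))**2 for x in temp_count]
    return 1-sum(temp_count)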

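def information_gain(T, column):
    # column is the index of the splitting attribute
    # Collect the distinct values of the splitting attribute.
    temp=[]
    for i in T:
        if i[column] not in temp:
            temp.append(i[column])
    # Partition T into one list of rows per attribute value.
    temp_col=[]
    for j in range(len(temp)):
        dump=[]
        for i in T:
            if i[column] == temp[j]:
                dump.append(i)
        temp_col.append(dump)
    # Subtract the size-weighted entropy of the partitions from entropy(T).
    entropy_Ti = [len(temp_col[i])/len(T)*entropy(temp_col[i]) for i in range(len(temp))]
    return entropy(T)-sum(entropy_Ti)

def benefit(T, column):
    # column is the index of the splitting attribute
    # Collect the distinct values of the splitting attribute.
    temp=[]
    for i in T:
        if i[column] not in temp:
            temp.append(i[column])
    # Partition T into one list of rows per attribute value.
    temp_col=[]
    for j in range(len(temp)):
        dump=[]
        for i in T:
            if i[column] == temp[j]:
                dump.append(i)
        temp_col.append(dump)
    # Subtract the size-weighted Gini index of the partitions from gini(T).
    gini_Ti = [len(temp_col[i])/len(T)*gini(temp_col[i]) for i in range(len(temp))]
    return gini(T)-sum(gini_Ti)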

# Output results
print("Overall entropy and Gini:")
print(entropy(df.to_numpy()))
print(gini(df.to_numpy()))
for i in df_Ti:
    print("Entropy and Gini for:",i.to_numpy()[0][0])
    print(entropy(i.to_numpy()))
    print(gini(i.to_numpy()))
print("Information gain and benefit:")
print(information_gain(df.to_numpy(),0))
print(benefit(df.to_numpy(),0))

Overall entropy and Gini:
1.5709505944546684
0.66
Entropy and Gini for: <40
0.0
0.0
Entropy and Gini for: [40:70]
1.0
0.5
Entropy and Gini for: >70
1.584962500721156
0.6666666666666667
Information gain and benefit:
0.6954618442383216
0.26


c) Construct the decision tree of the above records and attributes using the Gini index split strategy only.
Stop if all remaining attributes are constant. Argue how you choose the prediction of a leaf.

Draw the tree with all necessary information to predict new records.

Hint: You do not need to compute the benefit, if you have only one candidate attribute remaining.

       low                   med
       /                     /
    (<40)                   m---high
     /                     /
  Age----(40:70)-----Gender---f---med
    \
    (>70)            low
      \             /
     Gender-----m--- med
        \           \
  high---f--low      high
          \
           med
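To support the choice of the root attribute, the Gini benefit of both candidate attributes can be compared. The sketch below is a plain-Python cross-check on the discretized records, not the notebook's own implementation; `records` and `gini_benefit` are names introduced here for illustration. It also lists which Gender values occur in the $> 70$ partition, which matters for the stopping rule.

```python
from collections import Counter

# Discretized patient records: (Age, Gender, Cancer-Risk)
records = [
    ("<40", "m", "low"), ("<40", "f", "low"), ("<40", "m", "low"),
    (">70", "f", "low"), ("[40:70]", "f", "med"), (">70", "f", "med"),
    ("[40:70]", "m", "med"), ("[40:70]", "m", "high"), (">70", "f", "high"),
    ("[40:70]", "m", "high"),
]

def gini(rows):
    """Gini index of the class labels (last field) in rows."""
    counts = Counter(row[-1] for row in rows)
    return 1 - sum((c / len(rows)) ** 2 for c in counts.values())

def gini_benefit(rows, column):
    """Gini(T) minus the size-weighted Gini of each partition T_i."""
    parts = {}
    for row in rows:
        parts.setdefault(row[column], []).append(row)
    weighted = sum(len(p) / len(rows) * gini(p) for p in parts.values())
    return gini(rows) - weighted

age_benefit = gini_benefit(records, 0)
gender_benefit = gini_benefit(records, 1)
print(round(age_benefit, 2), round(gender_benefit, 2))

# Gender values that occur among the records older than 70.
genders_over_70 = {r[1] for r in records if r[0] == ">70"}
print(genders_over_70)
```

Age has the larger benefit and is therefore the first split. Among the records older than 70 only one Gender value occurs, so that attribute is constant there.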

In [21]:
class Tree():
    def __init__(self,parent=None,classification=None,attr_val=None,column=None):
        self.classification = classification
        self.attr_val = attr_val
        self.column = column
        self.children = []
        self.parent = parent
        if parent is not None:
            parent.add_child(self)
    
    def get_children(self):
        # Renamed from `children`: the instance attribute self.children shadows a method of that name.
        return self.children
    
    def add_child(self,tree):
        self.children.append(tree)
        tree.parent = self
    
    def depth_in_tree(self):
        if self.parent is None:
            return 0
        return 1+self.parent.depth_in_tree()
    
    def print_self(self,column_names):
        self._print_self_helper("",column_names)
        
    def _print_self_helper(self,prefix,column_names):
        if len(self.children) > 0:
            print(prefix,"("+str(self.attr_val)+")",column_names[self.column])
            for child in self.children:
                child._print_self_helper(" |  "*(len(prefix)//4)+" |--",column_names)
        else:
            print(prefix,"("+str(self.attr_val)+")",self.classification)

# Usage:
# Create a root node with Tree()
# Create any further node with Tree(parent=<parent node>) or with Tree() and later call parent.add_child(child)
# Use the classification attribute in leaves for the classification value
# Use the column attribute in inner nodes to specify the splitting attribute
# Use the attr_val attribute in all nodes to specify the attribute value of the previous split
# All attributes can be changed at any time
# Print the entire tree with <root>.print_self(column_names)

# Example:
# Creates a DT with random splits up to a maximum depth.
# Classification labels are chosen arbitrarily and do not represent real classifications.
import numpy as np
def print_example_DT(df):
    n_attrs = len(df.columns)-1
    first_split = np.random.randint(n_attrs)
    second_split = (first_split + np.random.randint(n_attrs-1) + 1) % n_attrs
    classes = list(set(df[df.columns[-1]]))
    attr_vals = [list(set(df[df.columns[i]])) for i in range(len(df.columns)-1)]
    # First split with attribute first_split
    root = Tree(column=first_split)
    for val in attr_vals[first_split]:
        # Second split with attribute second_split
        inner_node = Tree(parent=root,column=second_split,attr_val=val)
        for val2 in attr_vals[second_split]:
            c = classes[np.random.randint(len(classes))]
            leaf = Tree(
                parent=inner_node,
                classification=c,
                attr_val=val2)
    root.print_self(df.columns)

print_example_DT(df)

 (None) spore-print-color
 |-- (brown) bruises?
 |   |-- (no) leaves
 |   |-- (bruises) woods
 |-- (purple) bruises?
 |   |-- (no) meadows
 |   |-- (bruises) meadows
 |-- (orange) bruises?
 |   |-- (no) meadows
 |   |-- (bruises) paths
 |-- (buff) bruises?
 |   |-- (no) paths
 |   |-- (bruises) meadows
 |-- (green) bruises?
 |   |-- (no) woods
 |   |-- (bruises) meadows
 |-- (chocolate) bruises?
 |   |-- (no) grasses
 |   |-- (bruises) paths
 |-- (white) bruises?
 |   |-- (no) waste
 |   |-- (bruises) leaves
 |-- (black) bruises?
 |   |-- (no) grasses
 |   |-- (bruises) leaves
 |-- (yellow) bruises?
 |   |-- (no) paths
 |   |-- (bruises) waste


d) Predict the class labels of the new records given below.

In [None]:
qdf = pd.DataFrame([
    [ 35 , 'm', None ],# low, because <40 -> low
    [ 28 , 'f', None ],# also low
    [ 48 , 'm', None ],# high: [40:70] -> m holds 2 high, 1 med
    [ 55 , 'f', None ],# med, because [40:70] -> f -> med
    [ 83 , 'm', None ],# tie: low/med/high each 1/3 in >70
    [ 72 , 'f', None ]],# tie: low/med/high each 1/3 in >70
    columns=[  "Age", "Gender", "Cancer-Risk" ])
qdf

In [None]:
# TODO: Compute in a Markdown block and write results into the Dataframe qdf.
# You may code it, but that's quite advanced.
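A possible way to code the prediction is sketched below in plain Python. It assumes the tree splits on Age first and on Gender only where the Age partition is impure and Gender is not constant; the helper names (`age_bin`, `leaf_distribution`, `predict`) are hypothetical. Ties are reported as a list of all majority classes rather than resolved arbitrarily.

```python
from collections import Counter

# Training records: (Age, Gender, Cancer-Risk)
train = [
    (23, "m", "low"), (35, "f", "low"), (25, "m", "low"), (78, "f", "low"),
    (46, "f", "med"), (81, "f", "med"), (47, "m", "med"),
    (55, "m", "high"), (71, "f", "high"), (41, "m", "high"),
]

def age_bin(age):
    """Discretize an age into '<40', '[40:70]' or '>70'."""
    if age < 40:
        return "<40"
    return "[40:70]" if age <= 70 else ">70"

def leaf_distribution(age, gender):
    """Class counts of the training records in the leaf reached by (age, gender)."""
    subset = [r for r in train if age_bin(r[0]) == age_bin(age)]
    classes = {r[2] for r in subset}
    genders = {r[1] for r in subset}
    # Split on Gender only if the Age partition is impure and Gender varies in it.
    if len(classes) > 1 and len(genders) > 1:
        subset = [r for r in subset if r[1] == gender]
    return Counter(r[2] for r in subset)

def predict(age, gender):
    """Return all classes tied for the majority in the reached leaf."""
    dist = leaf_distribution(age, gender)
    top = max(dist.values())
    return sorted(c for c, n in dist.items() if n == top)

queries = [(35, "m"), (28, "f"), (48, "m"), (55, "f"), (83, "m"), (72, "f")]
predictions = [predict(a, g) for a, g in queries]
for q, p in zip(queries, predictions):
    print(q, p)
```

The sketch only handles Gender values that occur in the training data; an unseen value would leave the leaf empty.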

# Mushroom Classification

We use a variant of the well-known Mushroom data from the UCI machine learning repository.

Use the `mushroom.csv` found on the Moodle website. This Jupyter template file contains code snippets to load the data and do intermediate processing. Please implement the missing functions flagged with `#TODO`.

*Note:* In the lecture we discussed DTs for categorical inputs. `SKLearn`'s DTs expect numerical attributes, which makes it difficult to use `SKLearn`'s interface directly on this data.
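One possible workaround, which the exercise does not require, is to turn each categorical attribute into numerical indicator columns (one-hot encoding). The following is a minimal plain-Python sketch of that idea; the helper `one_hot` and the sample values are hypothetical and not taken from the data set.

```python
def one_hot(values):
    """Encode a list of categorical values as 0/1 indicator rows.

    Returns the sorted category names and one indicator row per input value.
    """
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical attribute column with three distinct values.
categories, encoded = one_hot(["brown", "black", "brown", "white"])
print(categories)
print(encoded)
```

A tree on the encoded columns asks binary questions per category, so it splits differently from the multiway splits used in this exercise.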

In [3]:
import numpy as np
import pandas as pd
from plotly import graph_objects as go

df = pd.read_csv('mushrooms.csv')

column_names = df.columns
X = df.to_numpy()[:,:-1]
y = df.to_numpy()[:,-1]

In [4]:
df_ = pd.DataFrame([
    [ 23 , 'm', 'low'],
    [ 35 , 'f', 'low'],
    [ 25 , 'm', 'low'],
    [ 78 , 'f', 'low'],
    [ 46 , 'f', 'med'],
    [ 81 , 'f', 'med'],
    [ 47 , 'm', 'med'],
    [ 55 , 'm', 'high'],
    [ 71 , 'f', 'high'],
    [ 41 , 'm', 'high']],
    columns=[  "Age", "Gender", "Cancer-Risk" ])

df_["Age"] = pd.cut(df_.iloc[:,0].values,(0,40,70,100),labels=("<40","[40:70]",">70"))

df_

Unnamed: 0,Age,Gender,Cancer-Risk
0,<40,m,low
1,<40,f,low
2,<40,m,low
3,>70,f,low
4,[40:70],f,med
5,>70,f,med
6,[40:70],m,med
7,[40:70],m,high
8,>70,f,high
9,[40:70],m,high


a) Implement a function `gini(y)` to compute the Gini for labels `y`.

In [8]:
def get_val_probs(y):
    # The mean of a boolean array is the fraction of True entries (True=1, False=0),
    # so this yields the relative frequency of every distinct label.
    return np.array([np.mean(y==yv) for yv in set(y)])

def gini(y):
    return 1-np.sum(get_val_probs(y)**2)

print(gini(y))

0.7457200201407965


b) Implement a function `entropy(y)` to compute the information entropy for labels `y`.

In [10]:
def entropy(y):
    return -np.sum([Pi*np.log2(Pi) if Pi > 0 else 0 for Pi in get_val_probs(y)])

print(entropy(y))

2.274747200596189


c) Implement a function `classification_error(y)` to compute the classification error for labels `y`.

In [11]:
def classification_error(y):
    return 1-np.max(get_val_probs(y))

print(classification_error(y))

0.6125061546036435
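The three impurity measures can be sanity-checked on a label vector whose values are easy to compute by hand. The sketch below is a NumPy-free reimplementation for that purpose only, assuming the class distribution 4 low, 3 med, 3 high from the patient example; it is not the notebook's implementation.

```python
import math
from collections import Counter

def class_probs(labels):
    """Relative frequency of every distinct label."""
    n = len(labels)
    return [c / n for c in Counter(labels).values()]

def gini(labels):
    return 1 - sum(p ** 2 for p in class_probs(labels))

def entropy(labels):
    return -sum(p * math.log2(p) for p in class_probs(labels))

def classification_error(labels):
    return 1 - max(class_probs(labels))

labels = ["low"] * 4 + ["med"] * 3 + ["high"] * 3
g, h, e = gini(labels), entropy(labels), classification_error(labels)
print(round(g, 2), round(h, 2), round(e, 2))
```

All three measures are zero for a pure label vector and grow as the classes become more evenly mixed.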


d) Implement `gain_ratio` quality function for labels `y`.

In [12]:
def get_splits(attribute,y):
    return [y[attribute == a] for a in sorted(list(set(attribute)))]

def split_info(attribute,y):
    Tis = get_splits(attribute,y)
    Pis = [float(len(Ti))/len(y) for Ti in Tis]
    return -np.sum([Pi*np.log2(Pi) for Pi in Pis])

def information_gain(attribute,y):
    Tis = get_splits(attribute,y)
    return entropy(y)-np.sum([float(len(Ti))/len(y)*entropy(Ti) for Ti in Tis])

def gain_ratio(attribute, y):
    ig = information_gain(attribute,y)
    si = split_info(attribute,y)
    return ig / si if ig > 0 else 0
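As a worked example, the gain ratio can be evaluated on the small patient data set. The sketch below is written in plain Python under the assumption that the split information equals the entropy of the attribute's own value distribution; the variable names are illustrative.

```python
import math
from collections import Counter

def entropy(items):
    """Entropy (base 2) of the value distribution in items."""
    n = len(items)
    return -sum(c / n * math.log2(c / n) for c in Counter(items).values())

def gain_ratio(attribute, labels):
    """Information gain divided by the split information of the attribute."""
    n = len(labels)
    parts = {}
    for a, y in zip(attribute, labels):
        parts.setdefault(a, []).append(y)
    info_gain = entropy(labels) - sum(len(p) / n * entropy(p) for p in parts.values())
    split_info = entropy(attribute)  # entropy of the partition sizes
    return info_gain / split_info if info_gain > 0 else 0.0

age = ["<40", "<40", "<40", ">70", "[40:70]", ">70", "[40:70]", "[40:70]", ">70", "[40:70]"]
gender = ["m", "f", "m", "f", "f", "f", "m", "m", "f", "m"]
risk = ["low", "low", "low", "low", "med", "med", "med", "high", "high", "high"]

age_ratio = gain_ratio(age, risk)
gender_ratio = gain_ratio(gender, risk)
print(age_ratio, gender_ratio)
```

Dividing by the split information penalizes attributes with many values, because their partition-size entropy is larger.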

You can use the provided benefit function to transform an impurity measure into a benefit measure.

In [14]:
def benefit(X, y, impurity):
    score = impurity(y)
    for v in np.unique(X):
        subset = y[X == v]
        score = score - len(subset)/float(len(y)) * impurity(subset)
    return score

gini_benefit = lambda X,y: benefit(X,y,gini)
information_gain = lambda X,y: benefit(X,y,entropy)
error_benefit = lambda X,y: benefit(X,y,classification_error)
print(column_names[0], gini_benefit(X[:,0], y), information_gain(X[:,0], y), error_benefit(X[:,0], y))

class 0.025662014242646025 0.15683360460509221 5.551115123125783e-17


e) Implement a function `find_best(X, y, quality)` to find the best attribute in $X$, using
the given quality function (e.g., `gini_benefit`). The function must return the quality
as well as the best attribute.

In [18]:
def find_best(X, y, quality):
    qvals = [quality(X[:,i],y) for i in range(len(X.T))]
    idx = np.argmax(qvals)
    return idx, qvals[idx]


print(find_best(X, y, gini_benefit),     column_names[find_best(X, y, gini_benefit)[0]])
print(find_best(X, y, information_gain), column_names[find_best(X, y, information_gain)[0]])
print(find_best(X, y, error_benefit),    column_names[find_best(X, y, error_benefit)[0]])
print(find_best(X, y, gain_ratio),       column_names[find_best(X, y, gain_ratio)[0]])

(11, 0.17057585739086134) stalk-root
(11, 0.6812394987242061) stalk-root
(21, 0.1915312653865091) population
(17, 0.43392233309641826) veil-color
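When several attributes share the best score, `np.argmax` returns the index of the first maximum. The same first-of-the-best behavior can be sketched without NumPy; `first_best` is a hypothetical helper used only to illustrate the tie-breaking.

```python
def first_best(scores):
    """Return (index, score) of the first maximal entry in scores."""
    # max() keeps the first item it encounters among equal maxima.
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores[best]

result = first_best([0.1, 0.3, 0.3, 0.2])
print(result)  # (1, 0.3): index 1 wins over the tied index 2
```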


f) Implement a recursive function `split(X,y,quality)` to construct a decision tree.<br/>
Stop when a branch is pure, and always split into all distinct attribute values of the *best* attribute.<br/>
If more than one attribute has the same score, use the *first* of the best attributes.

Return the result as an object of the `Tree` class provided in *Decision Trees c)*.

In [19]:
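def majority_label(y):
    # Most frequent label; ties resolve to the smallest label because np.unique sorts its output.
    # This avoids scipy.stats.mode, whose handling of non-numeric label arrays differs between SciPy versions.
    values, counts = np.unique(y, return_counts=True)
    return values[np.argmax(counts)]

def split(X, y, quality):
    root = Tree()
    if len(set(y)) > 1:
        col,qual = find_best(X,y,quality)
        root.column = col
        if qual > 0:
            # Split into all distinct values of the best attribute and recurse.
            c_vals = sorted(list(set(X[:,col])))
            for c_val in c_vals:
                sel = X[:,col] == c_val
                node = split(X[sel],y[sel],quality)
                node.attr_val = c_val
                root.add_child(node)
        else:
            # No attribute improves the quality: create a majority-vote leaf.
            m = majority_label(y)
            root.classification = "{:} ({:4.2f})".format(m,np.mean(y==m))
    else:
        # Pure branch: all labels are identical.
        root.classification = y[0]
    return root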
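The recursion can be traced on the small patient data set. The following sketch mirrors the same stopping rules (pure branch, or no attribute with positive benefit) but returns nested dictionaries instead of `Tree` objects and uses only the standard library; `build` and its return format are assumptions made for this illustration.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def benefit(column_values, labels):
    """Gini of labels minus the size-weighted Gini of each partition."""
    score = gini(labels)
    for v in set(column_values):
        subset = [y for x, y in zip(column_values, labels) if x == v]
        score -= len(subset) / len(labels) * gini(subset)
    return score

def build(rows, labels):
    """Return a label (pure leaf), a Counter (impure leaf) or an inner-node dict."""
    if len(set(labels)) == 1:
        return labels[0]
    scores = [benefit([r[i] for r in rows], labels) for i in range(len(rows[0]))]
    best = max(range(len(scores)), key=lambda i: scores[i])  # first best on ties
    if scores[best] <= 1e-12:  # no attribute separates the classes any further
        return Counter(labels)
    children = {}
    for v in sorted({r[best] for r in rows}):
        idx = [k for k, r in enumerate(rows) if r[best] == v]
        children[v] = build([rows[k] for k in idx], [labels[k] for k in idx])
    return {"column": best, "children": children}

rows = [("<40", "m"), ("<40", "f"), ("<40", "m"), (">70", "f"), ("[40:70]", "f"),
        (">70", "f"), ("[40:70]", "m"), ("[40:70]", "m"), (">70", "f"), ("[40:70]", "m")]
labels = ["low", "low", "low", "low", "med", "med", "med", "high", "high", "high"]
tree = build(rows, labels)
print(tree)
```

Impure leaves keep the full class counts, so a majority vote and its confidence can be derived later.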

g) Use the provided `print_self` method of the `Tree` class to print the different trees using the quality measures `gini_benefit`, `information_gain`, `error_benefit` and `gain_ratio`.

What do you notice?

In [22]:
print("Gini:")
tree = split(X, y, gini_benefit)
tree.print_self(df.columns)
print("\nInformation Gain:")
tree = split(X, y, information_gain)
tree.print_self(df.columns)
print("\nClassification Error:")
tree = split(X, y, error_benefit)
tree.print_self(df.columns)
print("\nGain ratio:")
tree = split(X, y, gain_ratio)
tree.print_self(df.columns)

Gini:



The input array could not be properly checked for nan values. nan values will be ignored.



 (None) stalk-root
 |-- (bulbous) spore-print-color
 |   |-- (black) woods
 |   |-- (brown) woods
 |   |-- (chocolate) stalk-surface-above-ring
 |   |   |-- (fibrous) cap-color
 |   |   |   |-- (buff) gill-color
 |   |   |   |   |-- (chocolate) grasses (0.50)
 |   |   |   |   |-- (pink) grasses (0.50)
 |   |   |   |   |-- (white) grasses (0.50)
 |   |   |   |-- (gray) gill-color
 |   |   |   |   |-- (chocolate) grasses (0.50)
 |   |   |   |   |-- (pink) grasses (0.50)
 |   |   |   |   |-- (white) grasses (0.50)
 |   |   |   |-- (white) gill-color
 |   |   |   |   |-- (chocolate) grasses (0.50)
 |   |   |   |   |-- (pink) grasses (0.50)
 |   |   |   |   |-- (white) grasses (0.50)
 |   |   |-- (silky) gill-color
 |   |   |   |-- (chocolate) stalk-color-above-ring
 |   |   |   |   |-- (brown) stalk-color-below-ring
 |   |   |   |   |   |-- (brown) grasses (0.33)
 |   |   |   |   |   |-- (buff) grasses (0.33)
 |   |   |   |   |   |-- (pink) grasses (0.33)
 |   |   |   |   |-- (buff) stalk-

 (None) population
 |-- (abundant) grasses
 |-- (clustered) ring-number
 |   |-- (none) woods
 |   |-- (one) leaves
 |   |-- (two) waste
 |-- (numerous) grasses (0.68)
 |-- (scattered) odor
 |   |-- (almond) grasses (0.50)
 |   |-- (anise) grasses (0.50)
 |   |-- (creosote) woods
 |   |-- (foul) cap-color
 |   |   |-- (buff) gill-color
 |   |   |   |-- (chocolate) grasses (0.50)
 |   |   |   |-- (pink) grasses (0.50)
 |   |   |   |-- (white) grasses (0.50)
 |   |   |-- (gray) gill-color
 |   |   |   |-- (chocolate) grasses (0.50)
 |   |   |   |-- (pink) grasses (0.50)
 |   |   |   |-- (white) grasses (0.50)
 |   |   |-- (white) gill-color
 |   |   |   |-- (chocolate) grasses (0.50)
 |   |   |   |-- (pink) grasses (0.50)
 |   |   |   |-- (white) grasses (0.50)
 |   |-- (none) grasses
 |   |-- (pungent) grasses (0.50)
 |-- (several) spore-print-color
 |   |-- (black) stalk-root
 |   |   |-- (bulbous) woods
 |   |   |-- (equal) urban (0.64)
 |   |-- (brown) stalk-root
 |   |   |-- (bulbou

### TODO: Write what you notice!