# Calculating Entropies for a data set

Calculating entropies is a key step in constructing a decision tree. In this script, we take a CSV file containing data for 12 restaurant-goers and calculate the entropies at each split of the decision tree. We won't construct a decision tree per se, but extending this work to actually build one is straightforward.

In [27]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv("entropy_data.csv")
data.head(5)

Unnamed: 0,Alt,Bar,Fri,Hun,Pat,Price,Rain,Res,Type,Est,Stay?
0,yes,no,no,yes,some,3,no,yes,french,0-10,yes
1,yes,no,no,yes,full,1,no,no,thai,30-60,no
2,no,yes,no,no,some,1,no,no,burger,0-10,yes
3,yes,no,yes,yes,full,1,no,no,thai,O30,yes
4,yes,no,yes,no,full,3,no,yes,french,>60,no


As we can see, the decision variable here is the "Stay?" column. The rest are the attributes that contribute to the decision. We now write a function to calculate the entropies for splits on each attribute, along with a helper function to display those entropy tables.
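
For reference, the score that `calculate_entropy` reports for an attribute can be written out as below. The symbols $y_v$ and $n_v$ (the yes and no counts among the rows where attribute $A$ takes the value $v$) are notation introduced here for illustration, not names from the notebook.

$$
S(A) = \sum_{v} \left[ y_v \log_2 \frac{y_v + n_v}{y_v} + n_v \log_2 \frac{y_v + n_v}{n_v} \right] = \sum_{v} (y_v + n_v)\, H\!\left(\frac{y_v}{y_v + n_v}\right)
$$

Here $H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$, and any term with a zero count is taken as 0. This is the usual conditional entropy multiplied by the number of rows at the node, so ranking attributes by $S(A)$ gives the same order as ranking them by conditional entropy.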

In [80]:
from collections import defaultdict
from IPython.display import display, HTML

def calculate_entropy(df):
    """Return, per attribute, the yes/no counts for each value and the
    count-weighted sum of the entropies of the resulting subsets."""
    tables = {}
    for c in df.columns:
        if c == "Stay?":
            continue
        col_vals = set(df[c])
        entropy_dict = defaultdict(list)
        # each attribute value maps to two counts: [yes, no] for "Stay?"
        entropy = 0
        for cv in col_vals:
            yes = len(df[(df[c] == cv) & (df['Stay?'] == 'yes')])
            entropy_dict[cv].append(yes)
            no = len(df[(df[c] == cv) & (df['Stay?'] == 'no')])
            entropy_dict[cv].append(no)
            # convert the counts to floats so the division below is never integer division
            yes += 0.
            no += 0.
            # `x and ... or 0` skips a term whose count is zero (avoids dividing by zero)
            entropy += (yes and yes * np.log2((yes+no)/yes) or 0) + (no and no * np.log2((yes+no)/no) or 0)
        entropy_dict['entropy'] = entropy
        tables[c] = entropy_dict
    return tables


def display_entropy(data):
    entropy = calculate_entropy(data)
    for k in entropy.keys():
        # separate the score from the per-value counts before tabulating
        Entropy = entropy[k].pop('entropy')
        display(HTML("<h3>%s</h3>" % k))
        p = pd.DataFrame(entropy[k], index=['yes', 'no'])
        print(p.T)
        print("Entropy =", Entropy)
        print()
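
As a quick sanity check on what the "Entropy" figure means, the sketch below recomputes it from yes/no counts alone, using only the standard library. The helper name `weighted_entropy` is hypothetical, and the counts are the ones printed for Pat at the root (full 2/4, none 0/2, some 4/0). Dividing by the number of rows gives the conventional conditional entropy in bits.

```python
import math

def weighted_entropy(counts):
    """Sum of count-weighted binary entropies over (yes, no) pairs."""
    total = 0.0
    for yes, no in counts:
        n = yes + no
        for k in (yes, no):
            if k:  # a zero count contributes nothing (0 * log 0 is taken as 0)
                total += k * math.log2(n / k)
    return total

# (yes, no) counts for Pat at the root, in the order full, none, some
pat_counts = [(2, 4), (0, 2), (4, 0)]
score = weighted_entropy(pat_counts)
n_rows = sum(y + n for y, n in pat_counts)

print(score)           # the un-normalised score, as reported by display_entropy
print(score / n_rows)  # the same score divided by the 12 rows at the root
```

Because every attribute at a node is scored over the same rows, skipping the division does not change which attribute has the minimum.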
        

## First for the root

In [82]:
display_entropy(data)

       yes  no
0-10     4   2
30-60    1   1
>60      0   2
O30      1   1
Entropy = 9.50977500433



      yes  no
full    2   4
none    0   2
some    4   0
Entropy = 5.50977500433



     yes  no
no     3   4
yes    3   2
Entropy = 11.7513499245



     yes  no
no     1   4
yes    5   2
Entropy = 9.6514844544



     yes  no
no     4   4
yes    2   2
Entropy = 12.0



   yes  no
1    3   4
2    2   0
3    1   2
Entropy = 9.6514844544



     yes  no
no     3   3
yes    3   3
Entropy = 12.0



     yes  no
no     3   3
yes    3   3
Entropy = 12.0



         yes  no
burger     2   2
french     1   1
italian    1   1
thai       2   2
Entropy = 12.0



     yes  no
no     4   3
yes    2   3
Entropy = 11.7513499245



## Next step

The minimum entropy is for Pat.
Pat = some and Pat = none produce leaves, so we continue with Pat = full.
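
The leaf check used here can be sketched from the count table alone: a value whose rows all share one outcome is a leaf, and anything else needs another split. The function name `classify_branches` is hypothetical, and the counts are copied from the Pat table printed above.

```python
def classify_branches(counts):
    """Map each attribute value to 'yes' or 'no' if its rows are pure,
    or to None if the branch still mixes both outcomes."""
    result = {}
    for value, (yes, no) in counts.items():
        if no == 0:
            result[value] = "yes"
        elif yes == 0:
            result[value] = "no"
        else:
            result[value] = None
    return result

# (yes, no) counts for Pat at the root
pat_counts = {"full": (2, 4), "none": (0, 2), "some": (4, 0)}
branches = classify_branches(pat_counts)
to_expand = [value for value, label in branches.items() if label is None]

print(branches)
print(to_expand)
```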

In [84]:
data = data[data.Pat == "full"]
# drop the Pat column, which is now constant
data = data.drop('Pat', axis=1)
data

Unnamed: 0,Alt,Bar,Fri,Hun,Price,Rain,Res,Type,Est,Stay?
1,yes,no,no,yes,1,no,no,thai,30-60,no
3,yes,no,yes,yes,1,no,no,thai,O30,yes
4,yes,no,yes,no,3,no,yes,french,>60,no
8,no,yes,yes,no,1,yes,no,burger,>60,no
9,yes,yes,yes,yes,3,no,yes,italian,O30,no
11,yes,yes,yes,yes,1,no,no,burger,30-60,yes


In [85]:
display_entropy(data)

       yes  no
30-60    1   1
>60      0   2
O30      1   1
Entropy = 4.0



     yes  no
no     1   2
yes    1   2
Entropy = 5.50977500433



     yes  no
no     2   2
yes    0   2
Entropy = 4.0



     yes  no
no     0   2
yes    2   2
Entropy = 4.0



     yes  no
no     2   3
yes    0   1
Entropy = 4.85475297227



   yes  no
1    2   2
3    0   2
Entropy = 4.0



     yes  no
no     0   1
yes    2   3
Entropy = 4.85475297227



         yes  no
burger     1   1
french     0   1
italian    0   1
thai       1   1
Entropy = 4.0



     yes  no
no     0   1
yes    2   3
Entropy = 4.85475297227



The minimum entropy is shared by Type, Price, Hun, Res and Est.

We arbitrarily split on Hun. Hun = no gives a leaf node, so we now continue with Hun = yes.
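
The tie can be reproduced without pandas. The sketch below rebuilds the six Pat = full rows shown earlier as plain dictionaries, scores every attribute, and lists those that share the minimum. The names `split_score`, `COLUMNS` and `RAW` are hypothetical, and the scoring is a standard-library reimplementation of the same count-weighted entropy, not the notebook's own function.

```python
import math
from collections import Counter, defaultdict

COLUMNS = ["Alt", "Bar", "Fri", "Hun", "Price", "Rain", "Res", "Type", "Est", "Stay?"]
# The six rows with Pat == "full", copied from the table above (index column omitted)
RAW = """\
yes,no,no,yes,1,no,no,thai,30-60,no
yes,no,yes,yes,1,no,no,thai,O30,yes
yes,no,yes,no,3,no,yes,french,>60,no
no,yes,yes,no,1,yes,no,burger,>60,no
yes,yes,yes,yes,3,no,yes,italian,O30,no
yes,yes,yes,yes,1,no,no,burger,30-60,yes"""
rows = [dict(zip(COLUMNS, line.split(","))) for line in RAW.splitlines()]

def split_score(rows, attr, target="Stay?"):
    """Count-weighted entropy of `target` after splitting `rows` on `attr`."""
    groups = defaultdict(Counter)
    for r in rows:
        groups[r[attr]][r[target]] += 1
    score = 0.0
    for label_counts in groups.values():
        n = sum(label_counts.values())
        # a Counter only holds labels that occur, so no count here is zero
        score += sum(k * math.log2(n / k) for k in label_counts.values())
    return score

scores = {a: split_score(rows, a) for a in COLUMNS[:-1]}
best = min(scores.values())
tied = [a for a, s in scores.items() if math.isclose(s, best)]

print(scores)
print(tied)  # attributes sharing the minimum, in column order
```

Any of the tied attributes is an equally good split by this criterion, which is why the choice of Hun is arbitrary.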

In [86]:
data = data[data.Hun == "yes"]
# drop the Hun column, which is now constant
data = data.drop('Hun', axis=1)
data

Unnamed: 0,Alt,Bar,Fri,Price,Rain,Res,Type,Est,Stay?
1,yes,no,no,1,no,no,thai,30-60,no
3,yes,no,yes,1,no,no,thai,O30,yes
9,yes,yes,yes,3,no,yes,italian,O30,no
11,yes,yes,yes,1,no,no,burger,30-60,yes


In [87]:
display_entropy(data)

       yes  no
30-60    1   1
O30      1   1
Entropy = 4.0



     yes  no
no     1   1
yes    1   1
Entropy = 4.0



     yes  no
no     2   1
yes    0   1
Entropy = 2.75488750216



     yes  no
no     0   1
yes    2   1
Entropy = 2.75488750216



    yes  no
no    2   2
Entropy = 4.0



     yes  no
yes    2   2
Entropy = 4.0



         yes  no
burger     1   0
italian    0   1
thai       1   1
Entropy = 2.0



   yes  no
1    2   1
3    0   1
Entropy = 2.75488750216



The minimum entropy is for Type.

Type = burger and Type = italian give leaf nodes. We calculate the entropies for Type = thai.

In [89]:
data = data[data.Type == "thai"]
# drop the Type column, which is now constant
data = data.drop('Type', axis=1)
data

Unnamed: 0,Alt,Bar,Fri,Price,Rain,Res,Est,Stay?
1,yes,no,no,1,no,no,30-60,no
3,yes,no,yes,1,no,no,O30,yes


In [90]:
display_entropy(data)

       yes  no
30-60    0   1
O30      1   0
Entropy = 0



    yes  no
no    1   1
Entropy = 2.0



    yes  no
no    1   1
Entropy = 2.0



     yes  no
no     0   1
yes    1   0
Entropy = 0



    yes  no
no    1   1
Entropy = 2.0



     yes  no
yes    1   1
Entropy = 2.0



   yes  no
1    1   1
Entropy = 2.0



The minimum entropy (0) is shared by Fri and Est. We split on Fri to get two leaf nodes, which completes an extremely overfitted decision tree.
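
The manual steps above could be automated with a short recursive function. What follows is a minimal sketch under stated assumptions, not the notebook's method. `build_tree` and `split_score` are hypothetical names, ties are broken by taking the first attribute in column order, and the input is the six Pat = full rows from earlier rather than the full data set.

```python
import math
from collections import Counter, defaultdict

def split_score(rows, attr, target):
    """Count-weighted entropy of `target` after splitting `rows` on `attr`."""
    groups = defaultdict(Counter)
    for r in rows:
        groups[r[attr]][r[target]] += 1
    return sum(k * math.log2(sum(c.values()) / k)
               for c in groups.values() for k in c.values())

def build_tree(rows, attrs, target="Stay?"):
    """Return a label for a leaf, or {attribute: {value: subtree}} otherwise."""
    labels = Counter(r[target] for r in rows)
    if len(labels) == 1 or not attrs:
        # pure node, or nothing left to split on: use the most common label
        return labels.most_common(1)[0][0]
    # min() returns the first minimal attribute, so ties follow column order
    best = min(attrs, key=lambda a: split_score(rows, a, target))
    remaining = [a for a in attrs if a != best]
    children = {}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        children[value] = build_tree(subset, remaining, target)
    return {best: children}

COLUMNS = ["Alt", "Bar", "Fri", "Hun", "Price", "Rain", "Res", "Type", "Est", "Stay?"]
# The six rows with Pat == "full", copied from the table shown earlier
RAW = """\
yes,no,no,yes,1,no,no,thai,30-60,no
yes,no,yes,yes,1,no,no,thai,O30,yes
yes,no,yes,no,3,no,yes,french,>60,no
no,yes,yes,no,1,yes,no,burger,>60,no
yes,yes,yes,yes,3,no,yes,italian,O30,no
yes,yes,yes,yes,1,no,no,burger,30-60,yes"""
rows = [dict(zip(COLUMNS, line.split(","))) for line in RAW.splitlines()]

tree = build_tree(rows, COLUMNS[:-1])
print(tree)
```

With this tie-breaking rule the sketch happens to follow the same path as the manual walk: Hun, then Type, then Fri. A different rule could pick Est instead of Fri at the last step and would fit these rows equally well.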