# Introduction

We learned about the [Gini Impurity Index](https://sites.google.com/view/stevenloaiza/machine-learning/decision-trees/gini-impurity-index?authuser=0) as one of the metrics used to choose splits in decision trees. Information gain is another metric that can be used to choose which feature to split the dataset on. Before we discuss information gain, we must first discuss entropy, which was introduced by Shannon (1948).

## Entropy

[Definition](https://en.wikipedia.org/wiki/Entropy_(information_theory)): [E]ntropy provides an absolute limit on the shortest possible average length of a lossless compression encoding of the data produced by a source, and if the entropy of the source is less than the channel capacity of the communication channel, the data generated by the source can be reliably communicated to the receiver.

The above definition is extremely difficult to understand, and the exact definition is not necessarily pertinent to our discussion of decision trees. Shannon (1948) used the concept of entropy in the theory of communication to determine how to send encoded information (bits) from a sender to a receiver without loss of information and with the minimum number of bits.

Please take a look at [Demystifying Entropy](https://towardsdatascience.com/demystifying-entropy-f2c3221e2550) and [The intuition behind Shannon's Entropy](https://towardsdatascience.com/the-intuition-behind-shannons-entropy-e74820fe9800) for an easy-to-understand explanation.

#### Bits

What are bits? A bit takes on a single binary value: 0 (FALSE) or 1 (TRUE). For example, the TRUE or FALSE result of the condition in an if statement takes 1 bit of data. The table below shows how the number of values that can be stored grows with each additional bit.

| Bits | Possible values |
|------|-----------------|
| 1 | 0, 1 |
| 2 | 00, 01, 10, 11 |
| 3 | 000, 001, 010, 011, 100, 101, 110, 111 |
| 4 | ... |

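With $x$ bits we can represent $n = 2^x$ distinct values. Solving for $x$ gives the number of bits needed to represent $n$ values:

$$
\begin{aligned}
2^x &= n \\
\ln(2^x) &= \ln(n) \\
x \cdot \ln(2) &= \ln(n) \\
x &= \frac{\ln(n)}{\ln(2)} = \log_2(n)
\end{aligned}
$$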
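As a quick check of the relationship between bits and values, here is a minimal sketch. The helper names `bits_needed` and `bit_patterns` are made up for this illustration, and the result is rounded up because a message cannot use a fraction of a bit.

```python
import math

def bits_needed(n):
    """Smallest whole number of bits that can label n distinct values."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # x = log2(n), rounded up to a whole number of bits
    return math.ceil(math.log2(n))

def bit_patterns(x):
    """All 2**x values that x bits can take, as zero-padded binary strings."""
    return [format(i, f"0{x}b") for i in range(2 ** x)]

print(bits_needed(8))    # 8 values fit exactly in 3 bits
print(bits_needed(5))    # 5 values do not fit in 2 bits, so 3 are needed
print(bit_patterns(2))   # the row for 2 bits in the table
```

The rounding only matters when $n$ is not a power of two; for $n = 2, 4, 8$ the formula returns the whole numbers shown in the table.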
#### Lossless
This concept simply means that no information was lost in the transmission from sender to receiver.

#### Formula

$$Entropy = - \sum_i P(i) \cdot \log_2 P(i)$$

Here $P(i)$ is the probability of message type $i$. The formula above gives us the minimum average encoding size, which uses the minimum encoding size for each message type.

High entropy: more uncertainty

Low entropy: more predictability
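To make the formula concrete, the sketch below evaluates it for a few small distributions. The function name `shannon_entropy`, the coin probabilities, and the three-message source with its code lengths are all made up for illustration; they are not part of the dataset used in this notebook.

```python
from math import log2

def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution given as probabilities."""
    ent = 0.0
    for p in probs:
        if p > 0:  # outcomes that never occur contribute nothing
            ent -= p * log2(p)
    return ent

fair_coin = shannon_entropy([0.5, 0.5])    # most uncertain two-outcome source
biased_coin = shannon_entropy([0.9, 0.1])  # more predictable, so lower entropy

# Hypothetical source with three message types and a matching code:
# the frequent message gets a 1-bit codeword, the rare ones get 2 bits each.
probs = [0.5, 0.25, 0.25]
code_lengths = [1, 2, 2]
avg_length = sum(p * n for p, n in zip(probs, code_lengths))

print(fair_coin, biased_coin)
print(shannon_entropy(probs), avg_length)
```

The fair coin reaches the maximum of 1 bit, the biased coin falls below it, and for the three-message source the average code length equals the entropy, which is the sense in which entropy is the minimum average encoding size.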

#### Example

In [42]:
# Create sample data
import pandas as pd

data = {
    'Emotion': ['sick', 'sick', 'notsick', 'notsick', 'notsick', 'sick', 'notsick', 'notsick'],
    'Temperature': ['under', 'over', 'under', 'under', 'over', 'over', 'under', 'over'],
    'StayHome': ['N', 'Y', 'Y', 'N', 'Y', 'N', 'N', 'Y'],
}
df = pd.DataFrame(data)
# Sort it by Emotion
df.sort_values(['Emotion'], inplace=True)
print(df)


   Emotion Temperature StayHome
1     sick        over        Y
4  notsick        over        Y
5     sick        over        N
7  notsick        over        Y
0     sick       under        N
2  notsick       under        Y
3  notsick       under        N
6  notsick       under        N


We will calculate the entropy for each split. For instance, the column "Emotion" gives two splits: one on "notsick" and the other on "sick".

In [248]:
print(df[df.Emotion=='sick'],"\n")
print(df[df.Emotion=='notsick'],"\n")
print(df[df.Emotion=='notsick'].sort_values('StayHome'))

  Emotion Temperature StayHome
5    sick        over        N
0    sick       under        N
1    sick        over        Y 

   Emotion Temperature StayHome
3  notsick       under        N
6  notsick       under        N
4  notsick        over        Y
7  notsick        over        Y
2  notsick       under        Y 

   Emotion Temperature StayHome
3  notsick       under        N
6  notsick       under        N
4  notsick        over        Y
7  notsick        over        Y
2  notsick       under        Y


For each split we can then calculate the entropy of the target variable "StayHome".


In [8]:
import numpy as np
import math

def entropy(col, df):
    """Entropy of the target within each split of one feature.

    col: position of the feature column. The last column of df is assumed
    to be the decision column, and it must take exactly two values.
    Returns a flat list: [count, name, entropy, count, name, entropy, ...].
    """
    # Grab the feature column and the last column (the decision column)
    new_df = df.iloc[:, [col, len(df.columns) - 1]].copy()
    new_df.columns = ('col1', 'col2')
    # Unique values of the feature, along with their counts
    names, count = np.unique(new_df.col1, return_counts=True)
    # The two possible outcomes of the decision column
    outcomes = new_df.col2.unique()
    entropy_list = []

    # Compute the entropy of the decision column within each split
    for i in range(len(names)):
        dff = new_df[new_df.col1 == names[i]]
        den = len(dff)
        p1 = dff.col2.eq(outcomes[0]).sum() / den
        p2 = dff.col2.eq(outcomes[1]).sum() / den
        ent = 0
        for p in [p1, p2]:
            ent += -p * math.log(p, 2)
        entropy_list.extend([count[i], names[i], ent])
    return entropy_list
    

    

In [32]:
print("Entropy for Emotion Split",entropy(0,df)[1:3])
print("Confirm",-(3/5)*math.log((3/5),2)-(2/5)*math.log((2/5),2),"\n")
print("Entropy for Emotion Split",entropy(0,df)[4:6])
print("Confirm",-(1/3)*math.log((1/3),2)-(2/3)*math.log((2/3),2))


Entropy for Emotion Split ['notsick', 0.9709505944546686]
Confirm 0.9709505944546686 

Entropy for Emotion Split ['sick', 0.9182958340544896]
Confirm 0.9182958340544896



# Information Gain

Now that we have discussed entropy, we can move on to information gain: the decrease in entropy after splitting the data on a feature. The greater the information gain, the greater the decrease in entropy, or uncertainty.

$$InformationGain(T,X) = Entropy(T) - \sum_{i}\frac{s_i}{T}Entropy(s_i)$$

-  T: Target population prior to the split; its size is the sum over all splits, $T = \sum_{i} s_i$
-  X: The feature the data is split on
-  Entropy(T): Measures the disorder of the target variable before the split
-  $s_i$: The number of observations in the $i^{th}$ split
-  Entropy($s_i$): Measures the disorder of the target variable in the $i^{th}$ split

Given the example above, $T=8$, $s_1=5$, $s_2=3$, $Entropy(s_1) = 0.9709...$, and $Entropy(s_2) = 0.91829...$.
It is difficult to tell at a glance, but even after we split the original dataset on the feature "Emotion", both entropies remain close to 1, so we gain little information toward a homogeneous bucket (a pure set containing only 'N' or only 'Y').
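
Plugging these numbers into the formula gives a rough hand calculation of the gain. The target "StayHome" has four 'N' and four 'Y' values before the split, so $Entropy(T) = 1$; the intermediate values are rounded to four decimals.

$$
\begin{aligned}
InformationGain(T, Emotion) &= Entropy(T) - \frac{s_1}{T}Entropy(s_1) - \frac{s_2}{T}Entropy(s_2) \\
&\approx 1 - \frac{5}{8}(0.9710) - \frac{3}{8}(0.9183) \\
&\approx 1 - 0.6068 - 0.3444 \\
&\approx 0.0488
\end{aligned}
$$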

In [237]:
print(df.sort_values('StayHome').reset_index(drop=True),"\n")
print('-----------------------------------------------------------------------------')
print(df[df.Emotion=='sick'].sort_values('StayHome').reset_index(drop=True),"\n")
print(df[df.Emotion=='notsick'].sort_values('StayHome').reset_index(drop=True))

   Emotion Temperature StayHome
0  notsick       under        N
1  notsick       under        N
2     sick        over        N
3     sick       under        N
4  notsick        over        Y
5  notsick        over        Y
6  notsick       under        Y
7     sick        over        Y 

-----------------------------------------------------------------------------
  Emotion Temperature StayHome
0    sick        over        N
1    sick       under        N
2    sick        over        Y 

   Emotion Temperature StayHome
0  notsick       under        N
1  notsick       under        N
2  notsick        over        Y
3  notsick        over        Y
4  notsick       under        Y


In [219]:
def entropy(col, df, option=2):
    """Entropy of the target (last column of df), which must take two values.

    option=2: entropy within each split of feature `col`;
              returns [count, entropy, count, entropy, ...].
    option=1: entropy of the whole dataset before splitting;
              returns that same value once per split.
    """
    # Grab the feature column and the last column (the decision column)
    new_df = df.iloc[:, [col, len(df.columns) - 1]].copy()
    new_df.columns = ('col1', 'col2')
    # Unique values of the feature, along with their counts
    names, count = np.unique(new_df.col1, return_counts=True)
    # The two possible outcomes of the decision column
    outcomes = new_df.col2.unique()
    entropy_list = []

    for i in range(len(names)):
        if option == 2:
            dff = new_df[new_df.col1 == names[i]]
            entropy_list.append(count[i])
        else:
            dff = new_df
        den = len(dff)
        p1 = dff.col2.eq(outcomes[0]).sum() / den
        p2 = dff.col2.eq(outcomes[1]).sum() / den
        ent = 0
        for p in [p1, p2]:
            ent += -p * math.log(p, 2)
        entropy_list.append(ent)
    return entropy_list

def information_gain(col, df):
    """Entropy before the split minus the weighted entropy after it."""
    den = len(df)
    # Entropy of the target before the split
    info_gain = entropy(col, df, 1)[0]
    # Flat list of (count, entropy) pairs, one pair per split
    split = entropy(col, df, 2)
    for j in range(0, len(split), 2):
        weight = split[j] / den
        info_gain = info_gain - weight * split[j + 1]
    return info_gain

In [220]:
print("The Algorithm states the info gain is: ",information_gain(0,df))
# confirm by hand
EntropyT = -(4/8)*math.log((4/8),2) - (4/8)*math.log((4/8),2)
EntropyS1 = -(3/5)*math.log((3/5),2) - (2/5)*math.log((2/5),2)
EntropyS2 = -(1/3)*math.log((1/3),2) - (2/3)*math.log((2/3),2)
print("Check: ", EntropyT - ((5/8)*EntropyS1 + (3/8)*EntropyS2))

The Algorithm states the info gain is:  0.048794940695398525
Check:  0.04879494069539847


As you can see, the information we gain from splitting on "Emotion" is minimal. Could we have done better by splitting on Temperature instead?

In [239]:
print("Information Gain from Splitting on Temp: ",information_gain(1,df))

Information Gain from Splitting on Temp:  0.1887218755408671


This is a great improvement in the amount of information gained. Let's take a look at this in table format. As you can tell, we have gone from an even split in the original dataset to a 25% / 75% split in each bucket once we condition on Temperature. We have therefore gained more information, because each bucket now holds observations with mostly the same StayHome value.
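
The gain reported for Temperature can also be checked by hand. Each Temperature bucket holds four observations with a 25% / 75% mix of outcomes, and the entropy before the split is 1; the intermediate values are rounded to four decimals.

$$
\begin{aligned}
Entropy(s_1) = Entropy(s_2) &= -\frac{1}{4}\log_2\frac{1}{4} - \frac{3}{4}\log_2\frac{3}{4} \approx 0.8113 \\
InformationGain(T, Temperature) &\approx 1 - \frac{4}{8}(0.8113) - \frac{4}{8}(0.8113) \approx 0.1887
\end{aligned}
$$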

In [249]:
print(df.sort_values('StayHome').reset_index(drop=True),"\n")
print('-----------------------------------------------------------------------------')
print(df[df.Temperature=='over'].sort_values('StayHome').reset_index(drop=True),"\n")
print(df[df.Temperature=='under'].sort_values('StayHome').reset_index(drop=True))

   Emotion Temperature StayHome
0  notsick       under        N
1  notsick       under        N
2     sick        over        N
3     sick       under        N
4  notsick        over        Y
5  notsick        over        Y
6  notsick       under        Y
7     sick        over        Y 

-----------------------------------------------------------------------------
   Emotion Temperature StayHome
0     sick        over        N
1  notsick        over        Y
2  notsick        over        Y
3     sick        over        Y 

   Emotion Temperature StayHome
0  notsick       under        N
1  notsick       under        N
2     sick       under        N
3  notsick       under        Y


# Appendix

There are times when a bucket completely isolates one of the decision outcomes. When this occurs, the probability of the other outcomes occurring in that bucket is 0. We cannot take $\log(0)$ because it is undefined (it diverges to $-\infty$), so the entropy function above cannot handle a pure bucket.

Additionally, the code above is hard-coded for a decision column with exactly two possible outcomes. When we move on to decision trees, we will generalize the entropy function to take into account n possible outcomes.
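
One possible way to lift both restrictions is sketched below, using only the standard library. This is an illustration under assumptions, not necessarily the version developed later for decision trees: the names `label_entropy` and `gain_for_feature` are hypothetical, and the three lists are the columns of the sample dataset written out by hand. Counting only the outcomes that actually appear in a bucket means $\log(0)$ is never evaluated, which matches the usual convention that an outcome with probability 0 contributes nothing to the entropy.

```python
from collections import Counter
from math import log2

def label_entropy(labels):
    """Entropy in bits of a list of outcomes with any number of classes."""
    total = len(labels)
    # Counter only holds outcomes that occur, so every probability is > 0
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())

def gain_for_feature(feature, target):
    """Entropy of target minus the weighted entropy within each feature value."""
    total = len(target)
    gain = label_entropy(target)
    for value in set(feature):
        subset = [t for f, t in zip(feature, target) if f == value]
        gain -= len(subset) / total * label_entropy(subset)
    return gain

# Columns of the sample dataset, in their original row order
emotion = ['sick', 'sick', 'notsick', 'notsick', 'notsick', 'sick', 'notsick', 'notsick']
temperature = ['under', 'over', 'under', 'under', 'over', 'over', 'under', 'over']
stay_home = ['N', 'Y', 'Y', 'N', 'Y', 'N', 'N', 'Y']

print(label_entropy(['Y', 'Y', 'Y']))            # a pure bucket no longer fails
print(gain_for_feature(emotion, stay_home))      # close to 0.0488
print(gain_for_feature(temperature, stay_home))  # close to 0.1887
```

On the sample data the sketch agrees with the gains computed earlier for "Emotion" and "Temperature".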