# **CIS 520: Machine Learning, Fall 2022**
## **Information Gain for Decision Trees**

- **Content Creators:** Lyle Ungar, Amiel Orbach, Jasleen Dhanoa
- **Acknowledgments/Citations:** Eric Eaton and https://www.python-course.eu/Decision_Trees.php

**Goal** : To understand how to compute information gain for a given dataset manually and in a program. 

**Instructions**: For the first dataset compute the information gain and fill in the answers. Follow the functions for the second dataset to see how information gain can be computed in a function.


In [None]:

#@markdown Tell us your thoughts about what you want to learn.
w2_upshot = '' #@param {type:"string"}
import time
try: t0;
except NameError: t0=time.time()

## **Autograding and the PennGrader**

First, you'll need to set up the PennGrader, which we'll be using throughout the semester to help you with your homework and worksheets.

PennGrader is not only **awesome**, but it was built by an equally awesome person: Leo Murri.  Today, Leo works as a data scientist at Amazon!

PennGrader was developed to provide students with *instant* feedback on their answers. You can submit an answer and find out immediately whether it is right or wrong. We then record your most recent answer in our backend database.

### Imports and Setup (Do Not Modify This Section)

In [1]:
%%capture
!pip install penngrader


In [2]:
import random 
import numpy as np
import pandas as pd
import os
import sys
import matplotlib.pyplot as plt
from numpy.linalg import *
np.random.seed(42)  # don't change this line

import dill
import base64

In [3]:
# For autograder only, do not modify this cell. 
# True for Google Colab, False for autograder
NOTEBOOK = (os.getenv('IS_AUTOGRADER') is None)
if NOTEBOOK:
    print("[INFO, OK] Google Colab.")
else:
    print("[INFO, OK] Autograder.")
    sys.exit()

[INFO, OK] Google Colab.


### Insert PennID here!

In [4]:
# PLEASE ENSURE YOUR PENN ID IS ENTERED CORRECTLY. IF NOT, THE AUTOGRADER WON'T KNOW
# WHOM TO ASSIGN POINTS TO IN OUR BACKEND.
STUDENT_ID = 99999999  # YOUR PENN ID GOES HERE AS AN INTEGER

In [5]:
import penngrader.grader

grader = penngrader.grader.PennGrader(homework_id = 'CIS_5200_202230_HW_Info_Gain_WS', student_id = STUDENT_ID)

PennGrader initialized with Student ID: 99999999

Make sure this correct or we will not be able to store your grade


In [6]:
# A helper function for grading utils
def grader_serialize(obj):        # A helper function
    '''Dill serializes Python object into a UTF-8 string'''
    byte_serialized = dill.dumps(obj, recurse = True)
    return base64.b64encode(byte_serialized).decode("utf-8")

## Creating Data

We have toy data about risk factors (age above 50, smoking, asthma, and obesity) and whether each person is diabetic. Let's see how we can best use the available data to predict whether a person is diabetic.

In [7]:
import pandas as pd
import numpy as np


data = pd.DataFrame({"Ageover50":["True","False","False","True","True","True","False","False","True","False"],
                     "smoking":["True","True","False","True","True","True","False","False","True","False"],
                     "asthma":["True","True","True","True","True","True","False","True","True","True"],
                     "obese":["True","True","False","True","True","False","False","False","True","True"],
                     "diabetic":["yes","yes","no","yes","no","yes","no","no","yes","no"]}, 
                    columns=["Ageover50","smoking","asthma","obese","diabetic"])

features = data[["Ageover50","smoking","asthma","obese"]]
target = data["diabetic"]

data

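| | Ageover50 | smoking | asthma | obese | diabetic |
|---|---|---|---|---|---|
| 0 | True | True | True | True | yes |
| 1 | False | True | True | True | yes |
| 2 | False | False | True | False | no |
| 3 | True | True | True | True | yes |
| 4 | True | True | True | True | no |
| 5 | True | True | True | False | yes |
| 6 | False | False | False | False | no |
| 7 | False | False | True | False | no |
| 8 | True | True | True | True | yes |
| 9 | False | False | True | True | no |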



**Q1:** Compute the information gain (IG) for each of the split attributes listed below, using the data above. Calculate the information gain by hand using the formula.

**Note:** *Your answers should be accurate to 3 decimal places. Enter each answer in the form ans = 0.111*

Assign your answers to the given variable names so that they can be checked.
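
For reference, the standard definitions can be written as below. The symbols are our own choice rather than the worksheet's: $Y$ is the target (here `diabetic`), $A$ is the split attribute, $p_c$ is the fraction of rows with class $c$, $S$ is the full set of rows, and $S_v$ is the subset of rows where $A = v$.

```latex
H(Y) = -\sum_{c} p_c \log_2 p_c
\qquad
IG(Y, A) = H(Y) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, H(Y \mid A = v)
```

Entropy is measured in bits because the logarithm is base 2.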

1. Compute IG for split attribute "Age over 50" given the target attribute is diabetic

In [8]:
ans_IG_over_50 = 0.278

In [9]:
grader.grade(test_case_id = 'test_case_IG_over_50', answer = ans_IG_over_50)

Correct! You earned 1.0/1.0 points. You are a star!

Your submission has been successfully recorded in the gradebook.


2. Compute IG for split attribute smoking given the target attribute is diabetic

In [10]:
ans_IG_smoking = 0.610

In [11]:
grader.grade(test_case_id = 'test_case_IG_smoking', answer = ans_IG_smoking)

Correct! You earned 1.0/1.0 points. You are a star!

Your submission has been successfully recorded in the gradebook.


3. Compute IG for split attribute asthma given the target attribute is diabetic

In [16]:
ans_IG_asthma = 0.108

In [17]:
grader.grade(test_case_id = 'test_case_IG_asthma', answer = ans_IG_asthma)

Correct! You earned 1.0/1.0 points. You are a star!

Your submission has been successfully recorded in the gradebook.


4. Compute IG for split attribute obese given the target attribute is diabetic

In [18]:
ans_IG_obese = 0.125

In [19]:
grader.grade(test_case_id = 'test_case_IG_obese', answer = ans_IG_obese)

Correct! You earned 1.0/1.0 points. You are a star!

Your submission has been successfully recorded in the gradebook.
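
As a check on the hand calculation, here is a minimal sketch that works through the "Age over 50" split from raw counts read off the table. The helper name `binary_entropy` is an assumption made for this illustration and is not part of the worksheet.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a two-class distribution with probabilities p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0  # a pure node carries no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Counts read off the table above:
#   all rows           -> 10 rows, 5 diabetic
#   Ageover50 == True  ->  5 rows, 4 diabetic
#   Ageover50 == False ->  5 rows, 1 diabetic
h_parent = binary_entropy(5 / 10)
h_true = binary_entropy(4 / 5)
h_false = binary_entropy(1 / 5)

# Weight each child entropy by the fraction of rows that fall into it.
ig_over_50 = h_parent - (5 / 10) * h_true - (5 / 10) * h_false
print(round(ig_over_50, 3))
```

The same three steps (parent entropy, child entropies, weighted difference) apply to the other three attributes with their own counts.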


Good job! You finished calculating the information gain manually. Now let's take a look at how we compute them in a program. 

In [20]:
def InfoGain(data,attribute_name,target_name="class"):
    """
    Calculate the information gain from splitting a dataset on one attribute.
    This function takes three parameters:
    1. data = the dataset containing the attribute and the target (pd DataFrame)
    2. attribute_name = the name of the feature to split on (string)
    3. target_name = the name of the target feature (string)
    """    
    #Calculate the entropy of the total dataset
    total_entropy = entropy(data[target_name])
    
    
    #Calculate the values and the corresponding counts for the attribute by which tree is split
    vals,counts= np.unique(data[attribute_name],return_counts=True)
    
    #Calculate the weighted entropy
    Weighted_Entropy = np.sum([(counts[i]/np.sum(counts))*entropy(data.where(data[attribute_name]==vals[i]).dropna()[target_name]) for i in range(len(vals))])
    
    #Calculate the information gain
    Information_Gain = total_entropy - Weighted_Entropy

    return Information_Gain

In [21]:
## Calculating the entropy mathematically

def entropy(target):
    """
    Calculate the entropy (in bits) of a column of labels.
    The only parameter of this function is target, which specifies the column of labels
    """
    # Distinct labels and how often each one occurs
    elem,count = np.unique(target,return_counts = True)
    # Sum of -p*log2(p) over the observed labels (named entropy_value to avoid shadowing the function)
    entropy_value = np.sum([(-count[i]/np.sum(count))*np.log2(count[i]/np.sum(count)) for i in range(len(elem))])
    return entropy_value
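
A quick way to build intuition for the entropy function is to evaluate it at its extremes. The sketch below is a standalone, vectorized variant written for illustration (the name `entropy_bits` is an assumption, not the worksheet's function). It uses the equivalent form of the sum with $\log_2(1/p)$ in place of $-\log_2 p$.

```python
import numpy as np

def entropy_bits(labels):
    """Entropy in bits of a 1-D collection of labels (illustrative helper)."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    # sum of p * log2(1/p); only observed labels appear, so probs is never 0
    return float(np.sum(probs * np.log2(1.0 / probs)))

pure = entropy_bits(["no", "no", "no", "no"])        # one class only
balanced = entropy_bits(["yes", "no", "yes", "no"])  # two equally likely classes
four_way = entropy_bits(["a", "b", "c", "d"])        # four equally likely classes
print(pure, balanced, four_way)
```

A pure column has zero entropy, and a column with $k$ equally likely values reaches the maximum of $\log_2 k$ bits.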

**Q2:**  Use the code above to calculate Information Gain for all the split attributes and answer the questions below.

In [24]:
# sample usage of the code snippet above
# You can also use it to check your calculations for Information Gain
ans1= InfoGain(data,'Ageover50',target_name="diabetic")
ans2= InfoGain(data,'smoking',target_name="diabetic")
ans3= InfoGain(data,'asthma',target_name="diabetic")
ans4= InfoGain(data,'obese',target_name="diabetic")
ans5 = entropy(data["diabetic"])

ans=[ans1,ans2,ans3,ans4]
tot_entropy = ans5
print(ans)
print(tot_entropy)
print("Max IG = ", np.max(ans))

[0.2780719051126377, 0.6099865470109875, 0.10803154614559995, 0.12451124978365313]
1.0
Max IG =  0.6099865470109875
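
The printed values can be cross-checked with a different formulation. The sketch below uses pandas `groupby` instead of `np.unique` and `where`. The function names `entropy_bits` and `info_gain_groupby` are assumptions for this illustration, and only the `smoking` and `diabetic` columns are rebuilt to keep it self-contained.

```python
import numpy as np
import pandas as pd

def entropy_bits(labels: pd.Series) -> float:
    """Entropy in bits of a Series of labels."""
    probs = labels.value_counts(normalize=True).to_numpy()
    return float(-(probs * np.log2(probs)).sum())

def info_gain_groupby(df: pd.DataFrame, attribute: str, target: str) -> float:
    """Parent entropy minus the size-weighted entropy of each group."""
    parent = entropy_bits(df[target])
    weights = df[attribute].value_counts(normalize=True)      # |S_v| / |S|
    children = df.groupby(attribute)[target].apply(entropy_bits)
    # weights and children share the attribute values as their index
    return parent - float((weights * children).sum())

toy = pd.DataFrame({
    "smoking": ["True", "True", "False", "True", "True",
                "True", "False", "False", "True", "False"],
    "diabetic": ["yes", "yes", "no", "yes", "no",
                 "yes", "no", "no", "yes", "no"],
})
ig_smoking = info_gain_groupby(toy, "smoking", "diabetic")
print(ig_smoking)
```

Grouping by the attribute makes the "weighted average of child entropies" step explicit, which some readers find easier to follow than the list comprehension.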


1. For which split attribute is the information gain maximal, given that the target is "diabetic"?

    (*Enter the attribute name i.e. Ageover50, smoking, asthma, obese as a string*)

In [29]:
ans_max_IG = "smoking"

In [30]:
grader.grade(test_case_id = 'test_case_max_IG', answer = ans_max_IG)

Correct! You earned 1.0/1.0 points. You are a star!

Your submission has been successfully recorded in the gradebook.


2. Are entropy and Information Gain of a dataset the same?

   (*Enter Yes/No as a string*)

In [31]:
ans_IG_entropy_same = "No"

In [32]:
grader.grade(test_case_id = 'test_case_IG_entropy_same', answer = ans_IG_entropy_same)

Correct! You earned 1.0/1.0 points. You are a star!

Your submission has been successfully recorded in the gradebook.


3. Is the entropy of every feature (e.g. diabetic, asthma) in a dataset the same?

   (*Enter Yes/No as a string*)

In [33]:
ans_is_entropy_same = "No"

In [34]:
grader.grade(test_case_id = 'test_case_is_entropy_same', answer = ans_is_entropy_same)

Correct! You earned 1.0/1.0 points. You are a star!

Your submission has been successfully recorded in the gradebook.
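
To see why the answer is "No", one can compute the entropy of each column separately. This is a minimal sketch that rebuilds the toy data and applies a per-column helper; the name `column_entropy` is an assumption for this illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Ageover50": ["True", "False", "False", "True", "True", "True", "False", "False", "True", "False"],
    "smoking":   ["True", "True", "False", "True", "True", "True", "False", "False", "True", "False"],
    "asthma":    ["True", "True", "True", "True", "True", "True", "False", "True", "True", "True"],
    "obese":     ["True", "True", "False", "True", "True", "False", "False", "False", "True", "True"],
    "diabetic":  ["yes", "yes", "no", "yes", "no", "yes", "no", "no", "yes", "no"],
})

def column_entropy(col: pd.Series) -> float:
    """Entropy in bits of the value distribution of one column."""
    probs = col.value_counts(normalize=True).to_numpy()
    return float(-(probs * np.log2(probs)).sum())

# One entropy value per column, indexed by column name
entropies = data.apply(column_entropy)
print(entropies.round(3))
```

Columns with an even 5/5 split reach the 1-bit maximum, while the heavily skewed `asthma` column (9 versus 1) has much lower entropy.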


4. If we were trying to make a decision tree to predict diabetes with the above features, what feature should be used as the root node in that Decision Tree?
(*Enter the attribute name i.e. Ageover50, smoking, asthma, obese as a string*)

In [35]:
ans_root_node_DT = "smoking"

In [36]:
grader.grade(test_case_id = 'test_case_root_node_DT', answer = ans_root_node_DT)

Correct! You earned 1.0/1.0 points. You are a star!

Your submission has been successfully recorded in the gradebook.
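
The root-node choice follows directly from the information gains printed earlier: a greedy decision-tree learner that uses information gain as its criterion splits first on the attribute with the largest gain. The short sketch below copies those printed values into a pandas Series (the variable names are assumptions for this illustration) and selects the maximum.

```python
import pandas as pd

# Information gains copied from the output of the Q2 cell above
gains = pd.Series({
    "Ageover50": 0.2780719051126377,
    "smoking": 0.6099865470109875,
    "asthma": 0.10803154614559995,
    "obese": 0.12451124978365313,
})

root_attribute = gains.idxmax()          # label of the largest gain
ranking = gains.sort_values(ascending=False)
print(root_attribute)
print(ranking)
```

The same selection would be repeated within each child node, using only the rows that reach that node.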


**Q3:** You may also want to consider the following questions:

- How would you calculate information gain for multiple split attributes (e.g. if we are using Gender as a feature)?
- What are the differences between the information gain and the entropy of a dataset?
- What is the mathematical relationship between information gain and entropy for a given dataset?

Now let's add another column (feature), gender. How does adding this column change your Decision Tree?

In [37]:
data['gender'] = ['male', 'male','male','female','female','female','non-binary',\
                  'non-binary','non-binary','non-binary']
data

| | Ageover50 | smoking | asthma | obese | diabetic | gender |
|---|---|---|---|---|---|---|
| 0 | True | True | True | True | yes | male |
| 1 | False | True | True | True | yes | male |
| 2 | False | False | True | False | no | male |
| 3 | True | True | True | True | yes | female |
| 4 | True | True | True | True | no | female |
| 5 | True | True | True | False | yes | female |
| 6 | False | False | False | False | no | non-binary |
| 7 | False | False | True | False | no | non-binary |
| 8 | True | True | True | True | yes | non-binary |
| 9 | False | False | True | True | no | non-binary |


In [42]:
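# sample usage of the code snippet above, now with a list of two target columns
# You can also use it to check your calculations for Information Gain
# Note: np.unique flattens a multi-column selection, so the values of "diabetic" and
# "gender" are pooled together here; this is not the joint entropy of the two columns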
ans1= InfoGain(data,'Ageover50',target_name=["diabetic","gender"])
ans2= InfoGain(data,'smoking',target_name=["diabetic","gender"])
ans3= InfoGain(data,'asthma',target_name=["diabetic","gender"])
ans4= InfoGain(data,'obese',target_name=["diabetic","gender"])
ans5 = entropy(data[["diabetic","gender"]])

ans=[ans1,ans2,ans3,ans4]
tot_entropy = ans5
print(ans)
print(tot_entropy)
print("Max IG = ", np.max(ans))

[0.3390359525563187, 0.490468570732828, 0.12625794497561404, 0.0722421719028139]
2.2854752972273342
Max IG =  0.490468570732828
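
Another way to bring gender into the analysis is to treat it as a split attribute with three values and keep `diabetic` as the only target. The sketch below is one assumption about what such a calculation could look like, not the worksheet's reference solution. It builds a contingency table with `pd.crosstab` and computes the gain from the counts; `entropy_from_counts` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female", "female",
               "non-binary", "non-binary", "non-binary", "non-binary"],
    "diabetic": ["yes", "yes", "no", "yes", "no", "yes", "no", "no", "yes", "no"],
})

# Rows: gender values, columns: diabetic classes, cells: row counts
table = pd.crosstab(df["gender"], df["diabetic"])

def entropy_from_counts(counts) -> float:
    """Entropy in bits of a distribution given as raw counts."""
    counts = np.asarray(counts, dtype=float)
    probs = counts[counts > 0] / counts.sum()  # skip empty cells to avoid log2(0)
    return float(-(probs * np.log2(probs)).sum())

parent = entropy_from_counts(table.sum(axis=0))        # entropy of diabetic overall
weights = table.sum(axis=1) / table.to_numpy().sum()   # fraction of rows per gender
children = table.apply(entropy_from_counts, axis=1)    # entropy within each gender
ig_gender = parent - float((weights * children).sum())
print(table)
print(ig_gender)
```

The formula is unchanged for an attribute with more than two values; the weighted sum simply runs over three groups instead of two.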


In [None]:
#@markdown Write down any considerations on these questions. This is an open ended question to encourage you to think about Information Gain.
considerations = 'When using 2 feature attributes, the IG is calculated upon each of the attributes separately and a combination function such as sum, average or others of choice may be chosen to determine the column that provides max information gain. Entropy is the characteristic of every feature/ attribute of a set of data. However, information gain is something that exists only when comparison between two or more features exists where  at least one of the features is chosen as the target feature. The very crux of using the IG lies in being able to predict the unknown better from known data. ' #@param {type:"string"}

## Submitting to the Autograder

First of all, please run your notebook from beginning to end and ensure you are getting all the points from the autograder!

Now go to the File menu and choose "Download .ipynb".  Go to [Gradescope](https://www.gradescope.com/courses/409970) and:

1. From "File" --> Download both .ipynb and .py files
1. Name these files `Info_Gain_WS.ipynb` and `Info_Gain_WS.py` respectively
1. Sign in using your Penn email address (if you are a SEAS student, we recommend using the Google login) and ensure your class is "CIS 5200"
1. Select **Worksheet: Info Gain**
1. Upload both files
1. PLEASE CHECK THE AUTOGRADER OUTPUT TO ENSURE YOUR SUBMISSION IS PROCESSED CORRECTLY!

You should be set! Note that this assignment has 10 autograded points that will show up upon submission. Points are awarded based on a combination of correctness and sufficient effort. 