In [1]:
headers = ['level', 'lang', 'tweets', 'phd', 'interviewed_well']
data = [
    ['Senior', 'Java', 'no', 'no', 'False'],
    ['Senior', 'Java', 'no', 'yes', 'False'],
    ['Mid', 'Python', 'no', 'no', 'True'],
    ['Junior', 'Python', 'no', 'no', 'True'],
    ['Junior', 'R', 'yes', 'no', 'True'],
    ['Junior', 'R', 'yes', 'yes', 'False'],
    ['Mid', 'R', 'yes', 'yes', 'True'],
    ['Senior', 'Python', 'no', 'no', 'False'],
    ['Senior', 'R', 'yes', 'no', 'True'],
    ['Junior', 'Python', 'yes', 'no', 'True'],
    ['Senior', 'Python', 'yes', 'yes', 'True'],
    ['Mid', 'Python', 'no', 'yes', 'True'],
    ['Mid', 'Java', 'yes', 'no', 'True'],
    ['Junior', 'Python', 'no', 'yes', 'False'],
]

Construct a decision tree

In [2]:
def decision_tree(level, lang, tweets, phd):
    if level == "Senior":  # 5 cases
        # told to split on tweets
        if tweets == "yes":  # 2 cases
            interviewed_well = True  # leaf node (2/5)
        elif tweets == "no":  # 3 cases
            interviewed_well = False  # leaf node (3/5)
    elif level == "Junior":  # 5 cases
        # told to split on phd
        if phd == "yes":  # 2 cases
            interviewed_well = False  # leaf node (2/5)
        elif phd == "no":  # 3 cases
            interviewed_well = True  # leaf node (3/5)
    else:  # 4 cases
        interviewed_well = True  # leaf node (4/14)
    return interviewed_well

In [3]:
X_1 = ["Junior","Java","yes","no"]
X_2 = ["Junior","Java","yes","yes"]
X_1_pred = decision_tree(*X_1)  # the star unpacks the list into separate positional arguments
X_2_pred = decision_tree(*X_2)  # same as decision_tree("Junior", "Java", "yes", "yes")
print(X_1_pred, X_2_pred)

True False


We should be able to do better than choosing the split attributes at `random`.  
This is where Entropy comes in!  
## Entropy:
- a measure of uncertainty
- goal: minimize uncertainty in order to maximize certainty
- result: we can get closer to making a leaf node faster
$$E = -\sum_{i=1}^{n} p_i \log_2(p_i)$$
* What the formula is saying:
    * Since $0 < p_i \leq 1$, we have $\log_2(p_i) \leq 0$, so every term $-p_i \log_2(p_i)$ is non-negative
    * e.g., for $\log_2(0.5) = y$, we have $2^y = \frac{1}{2}$, which means $y = -1$
    * If $p_i = 1$, then $-p_i \log_2(p_i) = 0$
    * $E$ has the highest value when labels are equally distributed

<img src="figures/entropy_graph.png" width="300"/>

Since we want a small $E$, we want each $p_i$ to be close to 1 or 0.
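
The behavior described above can be checked numerically. The following is a minimal sketch, assuming a helper named `entropy` (a name introduced here for illustration, not part of the lab) that takes a list of probabilities summing to 1. It skips terms with $p_i = 0$, following the usual convention that such terms contribute nothing.

```python
import math

def entropy(probabilities):
    """Shannon entropy (base 2) of a discrete distribution (illustrative helper)."""
    # Terms with p == 0 are skipped: log2(0) is undefined, and by convention
    # a label that never occurs contributes 0 to the sum.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))  # labels evenly split: the largest value for two classes
print(entropy([0.9, 0.1]))  # mostly one label: lower entropy
print(entropy([1.0, 0.0]))  # pure node: no uncertainty left
```

The closer one probability gets to 1, the closer the result gets to 0, which is why a partition with low entropy is close to being a leaf node.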

Pick the attribute that maximizes information gain
* Information Gain = $E_{start} - E_{new}$
    * At each partition, pick attribute with highest information gain
    * That is, split on attribute with greatest reduction in entropy
    * Which means find attribute with smallest $E_{new}$
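
As a rough sketch of the definition above, information gain can be computed from lists of class labels. The function names `label_entropy` and `information_gain`, and the toy labels, are assumptions made for this example. $E_{new}$ is taken to be the weighted average of the partition entropies, with each partition weighted by its share of the rows.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy (base 2) of the class labels in one partition (illustrative helper)."""
    total = len(labels)
    return sum(-(count / total) * math.log2(count / total)
               for count in Counter(labels).values())

def information_gain(parent_labels, partitions):
    """E_start - E_new, where E_new is the weighted average of partition entropies."""
    total = len(parent_labels)
    e_new = sum(len(part) / total * label_entropy(part) for part in partitions)
    return label_entropy(parent_labels) - e_new

# Hypothetical toy labels: 4 "yes" and 4 "no" before the split.
parent = ["yes"] * 4 + ["no"] * 4
perfect_split = [["yes"] * 4, ["no"] * 4]            # each partition is pure
useless_split = [["yes", "yes", "no", "no"]] * 2     # each partition is still 50/50

print(information_gain(parent, perfect_split))  # all uncertainty removed
print(information_gain(parent, useless_split))  # nothing gained
```

A split that leaves every partition as mixed as the parent gains nothing, while a split into pure partitions recovers all of $E_{start}$.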

### Lab Task 3
What is $E$ for the following distributions? Recall: $E = -\sum_{i=1}^{n} p_i \log_2(p_i)$
1. $p_{yes} = 3/8$ and $p_{no} = 5/8$
1. $p_{yes} = 2/8$ and $p_{no} = 6/8$

In [4]:
import math

# distribution 1
p_yes = 3 / 8
p_no = 5 / 8
E = -(p_yes * math.log(p_yes, 2) + p_no * math.log(p_no, 2))
print(E)

# distribution 2
p_yes = 2 / 8
p_no = 6 / 8
E = -(p_yes * math.log(p_yes, 2) + p_no * math.log(p_no, 2))
print(E)

# notice how 3/8 and 5/8 are both further from 0 or 1 than 2/8 and 6/8

0.9544340029249649
0.8112781244591328


Since $E$ is smaller for the 2/8 and 6/8 case, that is the distribution we would prefer: it is closer to a pure leaf node.

### Lab Task 4

In [5]:
# lab task 4 uses the same data as lab task 1

# 9 interviewed well, 5 did not

Estart = -(5/14 * math.log(5/14,2)) - (9/14 * math.log(9/14,2))
print(Estart)

# same value, written with the minus sign factored out of the sum
Estart = -(5/14 * math.log(5/14,2) + 9/14 * math.log(9/14,2))
print(Estart)

0.9402859586706309
0.9402859586706309


### Lab Task 5

In [11]:
# 3/5 of the seniors have the class label False
E_seniors = -(3/5 * math.log(3/5,2)) - (2/5 * math.log(2/5,2))
# 4/4 of the mids have the class label True
# the 0/4 term is skipped: log(0) is undefined, and a label that never occurs contributes 0
# the leading minus sign is left off because the remaining term is exactly 0
E_mid = 4/4 * math.log(4/4,2)
# 2/5 of the juniors have the class label False
E_juniors = -(2/5 * math.log(2/5,2)) - (3/5 * math.log(3/5,2))

print(E_seniors)
print(E_mid)
print(E_juniors)

0.9709505944546686
0.0
0.9709505944546686


In [7]:
# new entropy if we split on the level
E_new_level = (5/14) * E_seniors + (4/14) * E_mid + (5/14) * E_juniors # E_new(level) is the weighted average of the partition entropies
print(E_new_level) # we want to find the attribute to minimize this

0.6935361388961918


In [8]:
# compute information gain level
# this is E_start - E_new_level
E_gain_level = Estart - E_new_level
print(E_gain_level) # we want to maximize this
# this is why we want the smallest E_new_level, because it will result in the largest E_gain_level

0.2467498197744391


We will go through each of the other attributes to see which one has the highest information gain; that is the attribute we should split on.
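
Rather than repeating the arithmetic by hand for every attribute, the comparison can be sketched as a loop over the columns. This is only one possible way to do it: the helpers `label_entropy` and `gain_for` are names made up for this example, and the rows are copied from the first cell so that the block runs on its own.

```python
import math
from collections import Counter, defaultdict

headers = ['level', 'lang', 'tweets', 'phd', 'interviewed_well']
data = [
    ['Senior', 'Java', 'no', 'no', 'False'],
    ['Senior', 'Java', 'no', 'yes', 'False'],
    ['Mid', 'Python', 'no', 'no', 'True'],
    ['Junior', 'Python', 'no', 'no', 'True'],
    ['Junior', 'R', 'yes', 'no', 'True'],
    ['Junior', 'R', 'yes', 'yes', 'False'],
    ['Mid', 'R', 'yes', 'yes', 'True'],
    ['Senior', 'Python', 'no', 'no', 'False'],
    ['Senior', 'R', 'yes', 'no', 'True'],
    ['Junior', 'Python', 'yes', 'no', 'True'],
    ['Senior', 'Python', 'yes', 'yes', 'True'],
    ['Mid', 'Python', 'no', 'yes', 'True'],
    ['Mid', 'Java', 'yes', 'no', 'True'],
    ['Junior', 'Python', 'no', 'yes', 'False'],
]

def label_entropy(labels):
    """Entropy (base 2) of a list of class labels (illustrative helper)."""
    total = len(labels)
    return sum(-(count / total) * math.log2(count / total)
               for count in Counter(labels).values())

def gain_for(attribute):
    """Information gain from splitting the whole dataset on one attribute."""
    col = headers.index(attribute)
    # Group the class labels (last column) by the value of the chosen attribute.
    groups = defaultdict(list)
    for row in data:
        groups[row[col]].append(row[-1])
    # E_new is the weighted average of the partition entropies.
    e_new = sum(len(g) / len(data) * label_entropy(g) for g in groups.values())
    e_start = label_entropy([row[-1] for row in data])
    return e_start - e_new

gains = {attr: gain_for(attr) for attr in headers[:-1]}
for attr, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{attr}: {gain:.3f}")

best_attribute = max(gains, key=gains.get)
print("split on:", best_attribute)
```

On this dataset the loop should rank `level` first, which matches the gain of about 0.2467 computed by hand above.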

### Lab Task 6

In [29]:
# find E_new for one of the other attributes: tweets
# partition the 14 rows by tweets, then measure the entropy of the class label in each partition
# tweets == "yes": 7 rows, 6 interviewed well and 1 did not
E_tweets_yes = -(6/7 * math.log(6/7,2)) - (1/7 * math.log(1/7,2))
# tweets == "no": 7 rows, 3 interviewed well and 4 did not
E_tweets_no = -(3/7 * math.log(3/7,2)) - (4/7 * math.log(4/7,2))

# E_new(tweets) is the weighted average of the partition entropies
E_new_tweets = (7/14) * E_tweets_yes + (7/14) * E_tweets_no

E_gain_tweets = Estart - E_new_tweets
print(round(E_gain_tweets, 4))  # smaller than the gain for level, so level is still the better split

0.1518
