### Setup and Imports

In [1]:
import numpy as np

In [2]:
# first 2 columns = features
# third column = label
X = np.zeros((30, 3))

# class 0 = normally distributed mu=2, sigma=.5
X[:15, :2] = np.random.normal(loc=2, scale=.5, size=(15, 2))

# class 1 = normally distributed mu=4, sigma=1
X[15:, :2] = np.random.normal(loc=4, scale=1, size=(15, 2))
X[15:, 2] += 1
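
Because `np.random.normal` draws fresh samples on every run, the tree and its predictions further down will vary between runs. A minimal sketch of a reproducible variant, assuming NumPy's `default_rng` generator and an arbitrary seed of 0 (neither appears in the original cell, and the helper name `make_dataset` is hypothetical):

```python
import numpy as np

def make_dataset(seed=0):
    # seed is an arbitrary choice for illustration, not taken from the notebook
    rng = np.random.default_rng(seed)
    data = np.zeros((30, 3))
    # class 0: both features drawn from N(mu=2, sigma=0.5)
    data[:15, :2] = rng.normal(loc=2, scale=0.5, size=(15, 2))
    # class 1: both features drawn from N(mu=4, sigma=1)
    data[15:, :2] = rng.normal(loc=4, scale=1, size=(15, 2))
    data[15:, 2] = 1  # third column is the label
    return data

X_seeded = make_dataset()
print(X_seeded.shape)        # (30, 3)
print(X_seeded[:, 2].sum())  # 15.0
```

Calling `make_dataset` twice with the same seed returns identical arrays, which makes it easier to compare trees across runs.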

### Decision Trees

You want to build a classifier. But unlike logistic regression, which uses a continuous function to model discrete outcomes, you want to use a discrete set of rules to classify your input. A trivial example might be to decide whether or not you want to play tennis:

- if the weather is sunny
    - if the temperature is below 70 -> don't play tennis
    - if the temperature is above 70 -> play tennis
- if the weather is cloudy
    - if the temperature is below 70 -> don't play tennis
    - if the temperature is above 70 -> don't play tennis
    
We can encode the problem as tuples of (weather, temperature, label), where weather is 1 = sunny or 2 = cloudy, the second variable is the temperature, and the label is 1 = play tennis or 0 = don't play tennis:

```
(1, 68, 0)
(1, 82, 1)
(2, 66, 0)
(2, 90, 0)
```



### Entropy comes into play

Entropy is the expected, or average, surprisal. The term surprisal comes from how surprised you might be by an event: in an urn with 99 red balls and 1 green ball, pulling a green ball would be a big surprise. Note that this urn has low entropy, because you can guess red and be correct 99 times out of 100, and entropy refers to the average surprise over all draws.

Now let's apply this to classification. You want to divide your days into 2 groups, good for tennis and bad for tennis. We want all the good days on one side of the classification boundary and all the bad days on the other, so we want to sort our days into 2 groups in such a way that entropy is minimized.

### Entropy (Loss Function)
$$ H = E\left[\log_2 \left(\frac{1}{P(x_i)}\right)\right] = -\sum_{i=1}^n P(x_i) \cdot \log_2(P(x_i)) $$
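
As a quick numeric check of the formula, the urn with 99 red balls and 1 green ball can be plugged in directly. This is only a sketch; `surprisal` and `urn_entropy` are names chosen here for illustration.

```python
import numpy as np

def surprisal(p):
    # surprisal of an event with probability p, in bits
    return np.log2(1 / p)

def urn_entropy(probs):
    # expected surprisal; assumes every probability is strictly positive
    probs = np.asarray(probs, dtype=float)
    return float(np.sum(probs * surprisal(probs)))

print(round(surprisal(0.01), 2))            # green ball: 6.64 bits
print(round(surprisal(0.99), 3))            # red ball: 0.014 bits
print(round(urn_entropy([0.99, 0.01]), 3))  # 0.081
print(urn_entropy([0.5, 0.5]))              # 1.0
```

The rare green ball is very surprising, but it happens so seldom that the average stays close to 0, while an evenly mixed urn reaches the 2-class maximum of 1 bit.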

### Minimizing Entropy
Decision trees don't have a fancy way of minimizing their loss function. It's just an exhaustive search.
For each feature, $f_i$, explore all possible split values and calculate the entropy of each split. After exploring all possible splits, choose the one with the lowest entropy. Let's see this in action with our example.

Weather Splits.
- Split on Sunny
    - Group 1: (1, 68, 0), (1, 82, 1)
    - Group 2: (2, 66, 0), (2, 90, 0)

Entropy of Group 2 $= 0$, Entropy of Group 1 $= 1$. Note that these groups will be exactly the same for Split on Cloudy. Next we explore the temperature splits, where splitting on a value puts the days at or below that temperature in Group 1.

- Split on 68
    - Group 1: (1, 68, 0), (2, 66, 0)
    - Group 2: (1, 82, 1), (2, 90, 0)
    
Entropy of Group 1 $=0$, Entropy of Group 2 $=1$

- Split on 82
    - Group 1: (1, 68, 0), (1, 82, 1), (2, 66, 0)
    - Group 2: (2, 90, 0)
    
Entropy of Group 1 $\approx 0.92$, Entropy of Group 2 $=0$

- Split on 66
    - Group 1: (2, 66, 0)
    - Group 2: (1, 68, 0), (1, 82, 1), (2, 90, 0)

Entropy of Group 1 $=0$, Entropy of Group 2 $\approx 0.92$

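- Split on 90 (every day is at or below 90, so here Group 1 holds the days at 90 or above)
    - Group 1: (2, 90, 0)
    - Group 2: (1, 68, 0), (1, 82, 1), (2, 66, 0)
   
Entropy of Group 1 $=0$, Entropy of Group 2 $\approx 0.92$. This is the same partition as the split on 82, with the group labels swapped.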

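To compare splits, weight each group's entropy by its share of the days. A two-day group with one tennis day has entropy 1, and a three-day group with one tennis day has entropy $\approx 0.92$. The weather split and the split on 68 therefore both score $\frac{2}{4} \cdot 1 + \frac{2}{4} \cdot 0 = 0.5$, while the splits on 66, 82 and 90 score $\frac{3}{4} \cdot 0.92 \approx 0.69$. Therefore we can either split on weather or on a temperature of 68.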
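
The hand calculation can be checked with a short sketch of the exhaustive search. It assumes each split is scored by the size-weighted average of the two group entropies (a common convention, stated here as an assumption) and that Group 1 holds the rows at or below the threshold; the names `group_entropy` and `weighted_split_entropy` are hypothetical.

```python
import numpy as np

# (weather, temperature, label) rows from the tennis example
days = np.array([[1, 68, 0],
                 [1, 82, 1],
                 [2, 66, 0],
                 [2, 90, 0]], dtype=float)

def group_entropy(labels):
    # binary entropy of a group of 0/1 labels; an empty or pure group scores 0
    if len(labels) == 0:
        return 0.0
    p = labels.mean()
    probs = np.array([p, 1 - p])
    probs = probs[probs > 0]
    return float(np.sum(probs * np.log2(1 / probs)))

def weighted_split_entropy(data, feature, threshold):
    # Group 1 holds rows with feature value <= threshold, Group 2 the rest
    in_group_1 = data[:, feature] <= threshold
    labels = data[:, -1]
    return (in_group_1.sum() * group_entropy(labels[in_group_1])
            + (~in_group_1).sum() * group_entropy(labels[~in_group_1])) / len(data)

scores = {}
for feature, name in enumerate(["weather", "temperature"]):
    # the largest value is skipped because it would leave Group 2 empty
    for threshold in np.unique(days[:, feature])[:-1]:
        scores[(name, float(threshold))] = weighted_split_entropy(days, feature, threshold)

# print the candidate splits from best (lowest score) to worst
for split, score in sorted(scores.items(), key=lambda item: item[1]):
    print(split, round(score, 2))
```

Under this weighting, the weather split and the temperature split at 68 both score 0.5, and the temperature splits at 66 and 82 score about 0.69.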

### Multiple Splits
This, however, is only a very basic tree. What about more complex trees? After choosing the best split, the natural choice is to split each resulting group recursively. For example, if we split on sunny:

- Split on Sunny
    - Group 1: (1, 68, 0), (1, 82, 1)
    - Group 2: (2, 66, 0), (2, 90, 0)

We have already established that the entropy of group 2 is 0, so we would not split it any further. But for group 1 the entropy is 1, so we would split again. Splitting on either 68 or 82 separates the two sunny days and yields a tree with zero entropy, so we could arrive at either of the following trees:

       is Sunny               is Sunny
      N /    \ Y             N /    \ Y
            68 <                   82 < 
         N /   \ Y              N /   \ Y
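
Read as code, the left-hand tree is just two nested rules. A sketch, assuming the `N` and `Y` branches under `68 <` mean "temperature is not above 68" and "temperature is above 68" (the function name `play_tennis` is hypothetical):

```python
def play_tennis(weather, temperature):
    # weather: 1 = sunny, 2 = cloudy; returns 1 = play tennis, 0 = don't play tennis
    is_sunny = (weather == 1)
    if not is_sunny:
        return 0  # the cloudy group is already pure: never play
    if temperature > 68:
        return 1  # sunny and warmer than 68
    return 0      # sunny but 68 or cooler

training_days = [(1, 68, 0), (1, 82, 1), (2, 66, 0), (2, 90, 0)]
print(all(play_tennis(w, t) == label for w, t, label in training_days))  # True
```

Every training day ends up in a leaf that contains only its own class, which is what zero entropy means for the finished tree.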

In [3]:
# let's implement it
def entropy(probs):
    # probs = class probabilities
    # zero-probability terms are dropped: p * log2(1 / p) -> 0 as p -> 0
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]
    return np.sum(probs * np.log2(1 / probs))

def partial_entropy(X, partial):
    # calculate the entropy of one part of the tree (either Y or N)
    # partial = boolean mask selecting the rows of X on that side
    n_selected = np.sum(partial)
    if n_selected == 0:
        return 0.0
    positive_prob = np.sum(X[partial, -1]) / n_selected
    negative_prob = 1 - positive_prob
    return entropy(np.array([positive_prob, negative_prob]))
    
class Node:
    def __init__(self, feature, split, class_=None):
        # feature (the index of the feature this node splits on)
        # split (the boundary at which this node is split)
        # class_ (if this is a leaf node, the class its elements belong to)
        self.feature = feature
        self.split = split
        self.left = None
        self.right = None
        self.class_ = class_
    
    def forward(self, x):
        # leaf nodes return their class; inner nodes route x to a child
        if self.class_ is not None:
            return self.class_
        
        else:
            x_feat_val = x[self.feature]

            if x_feat_val < self.split:
                return self.left.forward(x)
            else:
                return self.right.forward(x)

In [4]:
def fit(X):
    if X is None or len(X) == 0:
        return None
    
    labels = X[:, -1]
    # leaf: all remaining rows share one label, so there is nothing left to split
    if len(np.unique(labels)) == 1:
        return Node(feature=None, split=None, class_=labels[0])
    
    n_rows = len(X)
    best_split = {'entropy': float('inf')}
    
    for feature_num, feature_values in enumerate(X[:, :-1].T):
        # skip the smallest value: nothing is < it, so the left side would be empty
        for split in np.unique(feature_values)[1:]:
            left = feature_values < split
            left_entropy, right_entropy = partial_entropy(X, left), partial_entropy(X, ~left)
            # weight each side's entropy by its share of the rows
            total_entropy = (left.sum() * left_entropy + (~left).sum() * right_entropy) / n_rows
            
            if total_entropy < best_split['entropy']:
                best_split = {'feature': feature_num,
                              'split_val': split,
                              'entropy': total_entropy,
                              'left': X[left],
                              'right': X[~left]}
    
    # no usable split (identical features but mixed labels): majority-vote leaf
    if 'split_val' not in best_split:
        return Node(feature=None, split=None, class_=float(labels.mean() >= 0.5))
    
    root = Node(feature=best_split['feature'], split=best_split['split_val'])
    root.left = fit(best_split['left'])
    root.right = fit(best_split['right'])
    
    return root

In [5]:
decision_tree = fit(X)

In [6]:
(decision_tree.forward([3.9, 3.9]), 
decision_tree.forward([3.2, 3.2]), 
decision_tree.forward([2.7, 2.7]),
decision_tree.forward([2., 2.]))

(1.0, 0.0, 0.0, 0.0)