# Given

A sample of categorical data:

| Ear Shape | Face Shape | Whiskers | Cat (1 = yes, 0 = no) |
|:---------:|:----------:|:--------:|:---------------------:|
|   Pointy   |   Round     |  Present  |    1   |
|   Floppy   |  Not Round  |  Present  |    1   |
|   Floppy   |  Round      |  Absent   |    0   |
|   Pointy   |  Not Round  |  Present  |    0   |


# Find 

Implement a classification tree:
- one-hot encode the features
- find the feature to split on (using entropy)
    - define impurity
    - calculate information gain
- choose the root feature
- split further into left and right branches
- stop splitting when
    - a node is 100% pure
    - splitting would exceed the maximum allowed depth of the tree
    - the improvement in information gain is too small
    - the number of examples in a node is below a predefined threshold
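
The four stopping rules listed above can be collected into a single check. The sketch below is one possible arrangement, not the implementation used later in this notebook: the function name `should_stop` and the thresholds `max_depth`, `min_gain`, and `min_samples` are hypothetical, and their default values are placeholders.

```python
import numpy as np

def should_stop(y, depth, best_gain, max_depth=3, min_gain=1e-3, min_samples=2):
    """Return True if any of the four stopping rules applies to a node.

    y         : labels of the examples in the node
    depth     : depth of the node (root = 0)
    best_gain : information gain of the best available split
    The threshold defaults are illustrative assumptions only.
    """
    y = np.asarray(y)
    is_pure = len(np.unique(y)) <= 1        # node is 100% pure (or empty)
    too_deep = depth >= max_depth           # depth limit reached
    gain_too_small = best_gain < min_gain   # best split barely reduces entropy
    too_few = len(y) < min_samples          # not enough examples to split
    return bool(is_pure or too_deep or gain_too_small or too_few)
```

For example, `should_stop([1, 1, 1], depth=0, best_gain=0.5)` is `True` because the node is pure, while `should_stop([1, 0, 1, 0], depth=0, best_gain=0.5)` is `False`.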

# Solution

In [1]:
import numpy as np

In [2]:
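# Each row is one animal. Columns follow the table: ear shape (1 = pointy),
# face shape (1 = round), whiskers (1 = present). The first four rows are the table rows.
x = np.array([[1, 1, 1], [0, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 1],
              [1, 1, 0], [0, 0, 0], [1, 1, 0], [0, 1, 0], [0, 1, 0]])
# Labels: 1 = cat, 0 = not a cat
y = np.array([1, 1, 0, 0, 1, 1, 0, 1, 0, 0])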

x.shape, y.shape

((10, 3), (10,))
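
Because every feature has only two values, one 0/1 column per feature is enough. The sketch below shows how the four rows of the sample table could be turned into such columns. The mapping (Pointy, Round, Present all map to 1) is an assumption inferred from the data: under it, the table reproduces the first four rows of `x` and `y`. The names `rows`, `positive`, `x_sample`, and `y_sample` are illustrative.

```python
import numpy as np

# The four rows of the sample table: (ear shape, face shape, whiskers, cat)
rows = [
    ("Pointy", "Round", "Present", 1),
    ("Floppy", "Not Round", "Present", 1),
    ("Floppy", "Round", "Absent", 0),
    ("Pointy", "Not Round", "Present", 0),
]

# Assumed encoding: the value that maps to 1 in each feature column
positive = ("Pointy", "Round", "Present")

x_sample = np.array([[int(value == pos) for value, pos in zip(row[:3], positive)]
                     for row in rows])
y_sample = np.array([row[3] for row in rows])

print(x_sample.tolist())  # [[1, 1, 1], [0, 0, 1], [0, 1, 0], [1, 0, 1]]
print(y_sample.tolist())  # [1, 1, 0, 0]
```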

Entropy 

* in general, for multi-class problems:
$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_{2}(p(x_i))$$

* for binary classification this simplifies to:
$$H(p_1) = -p_1 \log_2(p_1) - (1- p_1) \log_2(1- p_1)$$
where $p_1$ is the fraction of cats (1s) in the dataset and $p_0 = 1-p_1$ is the fraction of non-cats (0s). By convention $0 \log_2(0) = 0$, so a pure node has entropy 0.
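
As a quick check of the binary formula, take a branch with labels `[1, 0, 0, 0, 0]`, so that one label in five is a 1 and $p_1 = 0.2$:

$$H(0.2) = -0.2 \log_2(0.2) - 0.8 \log_2(0.8) \approx 0.464 + 0.258 \approx 0.722$$

An evenly mixed node gives the maximum, $H(0.5) = -0.5 \log_2(0.5) - 0.5 \log_2(0.5) = 1$.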

In [3]:
def calculate_entropy(branch):
    # In:  branch  - 1D array of 0/1 labels in a node, e.g. [0, 1, 1, 0, 1]
    # Out: entropy - impurity of the node: 0 = pure, 1 = evenly mixed
    #      p       - fraction of 1s in the branch

    # An empty branch has nothing to mix, so treat it as pure
    if len(branch) == 0:
        return 0, 0

    p = np.sum(branch) / len(branch)

    if p==1 or p==0:
        entropy = 0
    else:
        entropy = - p * np.log2(p) - (1-p) * np.log2(1-p)

    return entropy, p
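
The general multi-class formula can be sketched in the same style. `multiclass_entropy` is a hypothetical helper, not part of this notebook. It assumes the labels arrive as a 1D array and, as an assumption of this sketch, treats an empty node as pure.

```python
import numpy as np

def multiclass_entropy(labels):
    """Entropy in bits of a 1D array of class labels (any number of classes)."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0  # assumption: an empty node counts as pure
    # counts[i] is how often the i-th distinct label occurs
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # only labels that actually occur are counted, so log2(p) is always finite
    return float(-np.sum(p * np.log2(p)))
```

With two equally frequent classes, as in `multiclass_entropy([0, 1, 1, 0])`, the result is 1 bit, matching the binary formula. Four equally frequent classes give 2 bits.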

Information gain

* in general, for multi-class problems:

    $$\text{Information Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \cdot \text{Entropy}(S_v)$$

    where
    * $S$ is the dataset,
    * $A$ is the attribute being split on,
    * $|S|$ is the total number of instances in the dataset $S$,
    * $S_v$ is the subset of instances where attribute $A$ has value $v$, and $|S_v|$ is its size,
    * $\text{Values}(A)$ is the set of possible values of attribute $A$.
    

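* for binary classification this simplifies to:

$$\text{Information Gain} = H(p_1^\text{root})- \left(w_{\text{left}}\cdot H\left(p_1^\text{left}\right) + w_{\text{right}}\cdot H\left(p_1^\text{right}\right)\right),$$

where $w_{\text{left}}$ and $w_{\text{right}}$ are the fractions of the node's examples that go to the left and right branches.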
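
As a worked example, split the 10 examples in `x`, `y` on the first column. Half of the labels are 1, so the root has $H(0.5) = 1$. Each branch receives 5 of the 10 examples, so both weights are $0.5$. The left branch holds labels `[1, 0, 0, 0, 0]` ($p_1 = 0.2$) and the right branch `[1, 0, 1, 1, 1]` ($p_1 = 0.8$):

$$\text{Information Gain} = H(0.5) - \left(0.5 \cdot H(0.2) + 0.5 \cdot H(0.8)\right) \approx 1 - \left(0.5 \cdot 0.722 + 0.5 \cdot 0.722\right) \approx 0.278$$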

In [139]:
def calculate_ig(parent, children = []):
    """
    Function to calculate information gain
    Parameters:
    ----------
    parent : list
        List containing the parent dataset/groups
    children : list, optional
        List containing the children datasets/groups.
        Default is an empty list.
        
    Returns:
    ----------
    IG : float
        Information Gain as a measure of difference of entropy from parent to children.
        
    Unit Tests:
    ------
    >> calculate_ig(y, [y[x[:,0]==0],y[x[:,0]==1]])
    Expected result = 0.2780719051126377

    >> calculate_ig([0,0,0,0],[np.array([0]), np.array([0, 0, 0])])
    Expected result = 0 

    Note: 
    -----
    This function requires an external function 'calculate_entropy(child)' to work, where 'child'
    is a member of children dataset.
    """

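    H_root, _ = calculate_entropy(parent)
    weighted_entropy = 0  # named to avoid shadowing the built-in sum()
    for child in children:
        # calculate_entropy returns the child's entropy and its fraction of 1s;
        # that fraction is the weight applied to the child's entropy here
        H, w = calculate_entropy(child)
        weighted_entropy += H * w
    IG = H_root - weighted_entropy
    return IG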
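
In the formulas above each child is weighted by its share of the parent's examples, $|S_v|/|S|$. `calculate_ig` instead weights by the second value returned by `calculate_entropy`, which is the fraction of 1s in the child. The two coincide for feature 0 of this dataset but not in general. Below is a minimal sketch of the size-weighted variant. The helper names `binary_entropy` and `weighted_information_gain` are hypothetical, and the data is repeated so the block runs on its own.

```python
import numpy as np

def binary_entropy(labels):
    """Entropy in bits of a 0/1 label array; an empty or pure array gives 0."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    p = labels.mean()  # fraction of 1s
    if p == 0 or p == 1:
        return 0.0
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

def weighted_information_gain(parent, children):
    """Information gain with each child weighted by len(child) / len(parent)."""
    parent = np.asarray(parent)
    weighted = sum(len(child) / len(parent) * binary_entropy(child)
                   for child in children)
    return binary_entropy(parent) - weighted

x = np.array([[1, 1, 1], [0, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 1],
              [1, 1, 0], [0, 0, 0], [1, 1, 0], [0, 1, 0], [0, 1, 0]])
y = np.array([1, 1, 0, 0, 1, 1, 0, 1, 0, 0])

for feature_id in range(x.shape[1]):
    column = x[:, feature_id]
    gain = weighted_information_gain(y, [y[column == 0], y[column == 1]])
    print(feature_id, round(gain, 4))
# 0 0.2781
# 1 0.0349
# 2 0.1245
```

On this data the size-weighted variant still ranks feature 0 first, so the choice of root is the same.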

In [140]:
def choose_best_feature(x, y, processed_nodes = []):
    """
    Function to determine the best feature to split the data on.

    Parameters:
    ----------
    x : numpy array
        2D array of input features.
    y : numpy array
        1D array of target variable values.
    processed_nodes : list, optional
        List of node indices that have already been processed. Default is an empty list.

    Returns:
    ----------
    int
        Index of the feature with the highest information gain.

    Usage:
    ------
    x = np.array([[0,1,0],[0,0,0],[0,1,0],[0,1,0]])
    y = np.array([0, 0, 0, 0])
    index_of_best_feature = choose_best_feature(x, y, [])
    Expected Result: 0 (the index of the chosen feature, not an information gain value)

    Note:
    ------
    This function leverages an external function named 'calculate_ig(parent, children)' to calculate the information gain of a split.
    """

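    ig = []

    for feature_id in range(x.shape[1]):
        if feature_id not in processed_nodes:
            # labels of the examples where the feature is 0 (left) or 1 (right)
            left_subset = y[x[:,feature_id]==0]
            right_subset = y[x[:,feature_id]==1]
            feature_ig = calculate_ig(y, [left_subset, right_subset])
            ig.append(feature_ig)
        else:
            # features already used on the path get a gain of 0
            ig.append(0)

    print(f"ig = {ig}")

    # index of the feature with the highest information gain
    return np.argmax(ig)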


Tree and node classes

In [76]:
class TreeNode():
    """A split node: the features chosen for its two children and the labels in each branch."""

    def __init__(self):
        # set per instance so that nodes do not share state
        self.left_feature_id = -1
        self.right_feature_id = -1
        self.path = []

    def __repr__(self):
        # child nodes are left out to keep the printout to one line per node
        attrs = vars(self)
        return ', '.join(f"{key}={value}" for key, value in attrs.items()
                         if not isinstance(value, TreeNode))


class Tree():
    def __init__(self):
        self.root_feature = -1
        self.node = TreeNode()

Recursion

In [134]:
def process(current, x, y, path):

    # the feature this node splits on is the last one on the path
    parent = path[-1]

    # left and right subsets
    x_left = x[x[:,parent]==0]
    x_right = x[x[:,parent]==1]

    y_left = y[x[:,parent]==0]
    y_right = y[x[:,parent]==1]

    current.path = path
    current.left_branch = y_left
    current.right_branch = y_right

    # stop when either branch holds at most one example
    if len(y_left)<=1 or len(y_right)<=1:

        current.left_feature_id = "bottom"
        current.right_feature_id = "bottom"

        print(current)

        return

    best_split_left = choose_best_feature(x_left, y_left, processed_nodes=path)
    best_split_right = choose_best_feature(x_right, y_right, processed_nodes=path)

    current.left_feature_id = best_split_left
    current.right_feature_id = best_split_right

    print(current)

    # create the two children and recurse into each with its own subset
    current.left_feature = TreeNode()
    current.right_feature = TreeNode()
    process(current.left_feature, x = x_left, y = y_left, path = path + [best_split_left])
    process(current.right_feature, x = x_right, y = y_right, path = path + [best_split_right])


# Answer

In [142]:
tree = Tree()

# Step 0
tree.root_feature = choose_best_feature(x, y)

print(f"root = {tree.root_feature}")

# Recursion
process(tree.node, x, y, path=[tree.root_feature,])

ig = [0.2780719051126377, 0.13091388234321688, 0.08544279530415388]
root = 0
ig = [0, 0.22192809488736231, 0.7219280948873623]
ig = [0, 0.7219280948873623, 0.10973087218436928]
left_feature_id=2, right_feature_id=1, path=[0], left_branch=[1 0 0 0 0], right_branch=[1 0 1 1 1]
left_feature_id=bottom, right_feature_id=bottom, path=[0, 2], left_branch=[0 0 0 0], right_branch=[1]
left_feature_id=bottom, right_feature_id=bottom, path=[0, 1], left_branch=[0], right_branch=[1 1 1 1]
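
The printed nodes can be read as rules. The root splits on feature 0, its left child on feature 2, and its right child on feature 1. In both children the left branch contains only 0s and the right branch only 1s. The sketch below writes those rules out by hand and checks them against the data. `predict` is a hypothetical helper for illustration, not something the notebook builds from the `Tree` object, and the data is repeated so the block runs on its own.

```python
import numpy as np

x = np.array([[1, 1, 1], [0, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 1],
              [1, 1, 0], [0, 0, 0], [1, 1, 0], [0, 1, 0], [0, 1, 0]])
y = np.array([1, 1, 0, 0, 1, 1, 0, 1, 0, 0])

def predict(row):
    """Rules read off the printed tree; each leaf predicts the label found in its branch."""
    feature_0, feature_1, feature_2 = row
    if feature_0 == 0:
        # left of the root: the next split is on feature 2
        return int(feature_2 == 1)
    # right of the root: the next split is on feature 1
    return int(feature_1 == 1)

predictions = np.array([predict(row) for row in x])
print((predictions == y).mean())  # 1.0
```

All 10 training examples are classified correctly, which is consistent with every leaf in the printout being pure.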
