# Introduction

- Decision trees are a type of model for predicting both continuous and categorical values.
- They classify data by partitioning the sample space efficiently.
- Decision trees remain among the most widely used modeling tools in machine learning.
- They are highly interpretable and simple to explain.

## Entropy and Information Gain

- Decision trees can give different predictions based on the questions asked and their order.
- Selecting the right questions in the right order is crucial.
- Entropy and information gain are useful mechanisms for choosing the most promising questions in a decision tree.

## From graphs to decision trees

- Decision trees are a type of classifier that partitions the sample space recursively.
- A decision tree is a directed acyclic graph (DAG) with a root node, internal nodes, and leaf (terminal) nodes.
- Internal nodes have outgoing edges, while leaf nodes have none.
- A directed graph is a collection of nodes and edges in which every edge has a specified traversal direction.
- A graph is acyclic when no path returns to a node it has already visited.
- A DAG is therefore a directed graph with no cycles.
- DAGs have a topological ordering, which is a sequence of nodes with every edge directed from earlier to later in the sequence.
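
As a minimal sketch of a topological ordering, the example below builds a small DAG and orders its nodes with Python's standard-library `graphlib` module (Python 3.9+). The node names and edges are hypothetical and chosen only for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical DAG: each key maps to the set of nodes that must come before it.
# The edges are root -> a, root -> b, a -> leaf, b -> leaf.
predecessors = {
    "a": {"root"},
    "b": {"root"},
    "leaf": {"a", "b"},
}

order = list(TopologicalSorter(predecessors).static_order())
print(order)

# In a valid topological ordering, every edge u -> v places u before v.
position = {node: i for i, node in enumerate(order)}
edges_respected = all(
    position[u] < position[v]
    for v, preds in predecessors.items()
    for u in preds
)
print(edges_respected)
```

The relative order of `a` and `b` is not fixed, since neither depends on the other; only `root` first and `leaf` last are guaranteed.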


## Partitioning the sample space

<img src="images/dt1.png" width=650>

- Decision trees partition a sample space into sub-spaces based on attributes.
- Internal nodes check for a condition and perform a decision, while terminal nodes represent a class.
- Decision tree induction is related to rule induction.
- Each path from the root to a leaf can be transformed into a rule.
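
The path-to-rule idea can be sketched as follows. The nested-dictionary layout of the tree, the attribute names, and the `extract_rules` helper are all assumptions made for illustration, not a representation taken from any library.

```python
# Hypothetical toy tree: an internal node is a dict holding the attribute it
# tests and one subtree per attribute value; a leaf is a plain class label.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": "no", "normal": "yes"},
        },
        "overcast": "yes",
        "rainy": "no",
    },
}

def extract_rules(node, conditions=()):
    """Return one (conditions, label) pair per root-to-leaf path."""
    if not isinstance(node, dict):  # reached a leaf
        return [(conditions, node)]
    rules = []
    for value, child in node["branches"].items():
        test = (node["attribute"], value)
        rules += extract_rules(child, conditions + (test,))
    return rules

rules = extract_rules(tree)
for conditions, label in rules:
    clause = " AND ".join(f"{attr} = {val}" for attr, val in conditions)
    print(f"IF {clause} THEN class = {label}")
```

Each leaf yields exactly one rule, so this toy tree with four leaves produces four rules.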

## Definition

<img src="images/dt2.png" width=650>

- Decision trees are a type of classifier where each internal node represents a test on an attribute and each leaf node represents a classification.
- Unknown instances are routed down the tree based on attribute values until they reach a leaf and are classified.
- The choice of feature tested at each node is crucial, because it determines how well the tree separates the classes.
- Regression trees are represented similarly but predict continuous values instead of classifications.
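
Routing an instance down a tree might look like the sketch below, shown here for a regression tree whose leaves hold continuous values. The tree layout, the feature names, the thresholds, and the leaf values are hypothetical.

```python
# Hypothetical regression tree: an internal node tests `feature <= threshold`
# and sends the instance left (true) or right (false); a leaf is a number.
tree = {
    "feature": "area",
    "threshold": 100.0,
    "left": {"feature": "rooms", "threshold": 2.5, "left": 120.0, "right": 180.0},
    "right": 310.0,
}

def predict(node, instance):
    """Follow the matching branch at each internal node until a leaf is reached."""
    while isinstance(node, dict):
        goes_left = instance[node["feature"]] <= node["threshold"]
        node = node["left"] if goes_left else node["right"]
    return node

print(predict(tree, {"area": 80.0, "rooms": 3}))   # 180.0
print(predict(tree, {"area": 150.0, "rooms": 4}))  # 310.0
```

A classification tree would be traversed the same way, with class labels at the leaves instead of numbers.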

## Training process

<img src="images/dt3.png" width=650>

- Training a decision tree to predict a target feature involves these steps:
  - Present a dataset with features and a target.
  - Use a measure such as information gain or the Gini index to select the feature to split on at each node.
  - Grow the tree until a stopping criterion is met.
  - Use the trained tree to predict the class of new examples based on their features.

## Splitting criteria

- Decision trees are built by recursive splitting (binary splitting in CART), using a cost function to select the best split at each step.
- Two algorithms commonly used to build decision trees are CART and ID3.
- CART uses the Gini Index as a metric while ID3 uses the entropy function and information gain as metrics.
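
To compare the two metrics, here is a minimal sketch that computes Gini impurity and entropy from class counts. The function names are hypothetical, and both functions assume a non-empty list of counts with a positive total.

```python
from math import log2

def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy_bits(counts):
    """Entropy in bits; classes with zero items are skipped to avoid log2(0)."""
    n = sum(counts)
    return sum((c / n) * log2(n / c) for c in counts if c > 0)

for counts in ([6, 0], [5, 1], [3, 3]):
    print(counts, round(gini(counts), 4), round(entropy_bits(counts), 4))
```

Both measures are 0 for a pure node and largest for an even split, where the two-class Gini impurity reaches 0.5 and the entropy reaches 1 bit.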

## Greedy search 

- A decision tree places the most informative attribute at the root.
- The process is repeated on each resulting subset to create further splits, until every subset is pure or a stopping criterion is met.
- A top-down, greedy search finds the best attribute at each node; earlier choices are never revisited.
- The information gain criterion identifies the best attribute for ID3 classification trees.
- ID3 always chooses the split that maximizes information gain.
- The attribute with the highest information gain is therefore split on first.
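
A single greedy step can be sketched as below: compute the information gain of every candidate attribute and pick the highest. The six-row dataset and the helper names are made up for illustration.

```python
from collections import Counter, defaultdict
from math import log2

def entropy_of(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Parent entropy minus the size-weighted entropy of the subsets."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy_of(g) for g in groups.values())
    return entropy_of([row[target] for row in rows]) - remainder

# Hypothetical toy dataset
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]

gains = {a: information_gain(rows, a, "play") for a in ("outlook", "windy")}
best = max(gains, key=gains.get)  # greedy choice for the root
print(gains)
print(best)  # outlook
```

A full tree builder would apply the same step recursively to each subset created by the chosen attribute.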





## Shannon's Entropy

- Entropy measures disorder or uncertainty.
- It is named after Claude Shannon, the "father of information theory".
- Information theory provides measures of uncertainty associated with random variables.
- The amount of uncertainty is measured in bits.
- The entropy of a variable is the "amount of information" contained in the variable.
- The amount of information is proportional to the amount of "surprise" its reading causes.
- Shannon's entropy quantifies the amount of information in a variable and provides a foundation for a theory around the notion of information.
- Entropy is an indicator of how messy data is.
- Higher entropy in a set of target values means the classes are more mixed, so the outcome is harder to predict.
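
The link between surprise and bits can be illustrated with a short sketch. The `surprisal` helper is a hypothetical name; it assumes a probability in the range 0 < p <= 1.

```python
from math import log2

def surprisal(p):
    """Information content, in bits, of an outcome with probability p (0 < p <= 1)."""
    return log2(1 / p)

print(surprisal(1.0))   # 0.0: a certain outcome carries no information
print(surprisal(0.5))   # 1.0: a fair coin flip carries one bit
print(surprisal(0.25))  # 2.0: rarer outcomes are more surprising

# Entropy is the average surprisal, weighted by how likely each outcome is.
fair_coin = [0.5, 0.5]
print(sum(p * surprisal(p) for p in fair_coin))  # 1.0
```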

## Entropy and Decision Trees

<img src="images/split.png" width=500>

- Decision trees are used to group data into classes based on a target variable.
- The goal is to maximize purity of the classes while creating clear leaf nodes.
- Data cannot always be fully classified, but can be made tidier through splits using different feature variables.
- Entropy is computed before and after each candidate split to decide whether the split is worth keeping or whether splitting should stop.

###  Calculating Entropy

<img src="images/ent.png" width=400>

- A dataset of $N$ items can contain both True and False target values and be split into subsets according to target value.
- With $n$ Trues and $m$ Falses ($n + m = N$), the class proportions are $p = n/N$ and $q = m/N$.
- Entropy is calculated as $E = -p \log_2(p) - q \log_2(q)$ and is a measure of the disorder or uncertainty in the dataset.
- When $p = 0.5$, entropy is at its maximum, 1; when $p$ is 0 or 1, entropy is 0 (taking $0 \log_2 0$ to be 0).
- The more one-sided the proportion of target classes, the lower the entropy; when the proportions are exactly equal, entropy is at its maximum.
- Decision Trees can be used to split the contents of a dataset into subsets, creating more organized subsets based on common attributes.



In [1]:
from math import log

# `entropy(pi)` calculates the total entropy of a discrete class distribution `pi`.
# `pi` is a list of class counts, one non-negative number per class. For example:
# `[4, 4]` indicates that there are four items in each class, and
# `[10, 0]` indicates that there are 10 items in one class and 0 in the other.
# Entropy follows the formula: $$Entropy(p) = -\sum_i P_i \log_2(P_i)$$
# Empty classes are skipped to avoid the invalid operation $$\log_2(0)$$

def entropy(pi):
    '''
    Return the entropy, in bits, of a distribution given as class counts:
        entropy(p) = - SUM (Pi * log2(Pi))
    '''
    n = sum(pi)  # total number of items, computed once
    total = 0
    for count in pi:
        p = count / n  # convert the class count to a proportion
        if p != 0:
            total += p * log(p, 2)
    # Return 0.0 rather than -0.0 when all items are in one class
    return -total if total != 0 else 0.0

# Test the function

print(entropy([1, 1])) # Maximum entropy, e.g. a fair coin toss
print(entropy([0, 6])) # No entropy: every item is in the same class
print(entropy([2, 10])) # An uneven mix of classes

1.0
0.0
0.6500224216483541


### Generalization of Entropy 

- Entropy is a measure of uncertainty in a dataset.
- It characterizes the amount of information contained within the dataset.
- For a set $S$ with any number of classes, entropy is calculated with the equation below, where $P_i$ is the proportion of items in class $i$.
- When $H(S) = 0$, the set is pure: every item belongs to the same class.
- Knowing the entropy of each subset makes it easy to calculate the information gain of a potential split.

$$\large H(S) = -\sum_i P_i \log_2(P_i)$$
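
As a small illustration of the general formula, the sketch below evaluates entropy for more than two classes. The `entropy_k` name is hypothetical, and the function assumes counts with a positive total.

```python
from math import log2

def entropy_k(counts):
    """Entropy in bits for any number of classes, given as counts."""
    n = sum(counts)
    # log2(n / c) equals -log2(c / n); empty classes contribute nothing.
    return sum((c / n) * log2(n / c) for c in counts if c > 0)

print(entropy_k([9, 0, 0]))     # 0.0: a pure set
print(entropy_k([4, 4, 4, 4]))  # 2.0: four equally likely classes
print(entropy_k([5, 5, 5]))     # about 1.585, which is log2(3)
```

With $k$ equally likely classes the entropy reaches its maximum of $\log_2(k)$ bits, so values above 1 are expected once there are more than two classes.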

## Information Gain 

- Information gain is a criterion used by the ID3 algorithm to create decision trees.
- It is calculated by comparing the entropy of the parent node with the entropy of the child nodes after a split.
- The child entropies are combined as a weighted average, weighted by the number of samples in each child node.
- The attribute with the highest information gain is chosen for the split.
- The ID3 algorithm uses entropy to calculate information gain and pick the attribute to split on.

$$\large IG(A, S) = H(S) - \sum_{t \in T} p(t) H(t)$$

Where:

* $H(S)$ is the entropy of set $S$
* $T$ is the set of subsets produced by splitting $S$ on attribute $A$, and $t$ is one of those subsets
* $p(t)$ is the proportion of the number of elements in $t$ to the number of elements in $S$
* $H(t)$ is the entropy of a given subset $t$

In [2]:
# `IG(D, a)` calculates the information gain of splitting on an attribute.
# `D` is the class distribution (counts) of the target over the whole dataset, and
# `a` is a list of class distributions, one for each value of the attribute being tested.
# Using the `entropy()` function from above, the information gain is:
# $$gain(D,A) = Entropy(D) - \sum_i \frac{|D_i|}{|D|} Entropy(D_i)$$
# where $D_{i}$ is the subset of `D` that takes the i-th value of the attribute.

def IG(D, a):
    '''
    Return the information gain:
    gain(D, A) = entropy(D) - SUM( |Di| / |D| * entropy(Di) )
    '''
    n = sum(D)  # total number of items in the parent set
    total = 0
    for Di in a:
        # Weight each subset's entropy by its share of the parent set
        total += sum(Di) / n * entropy(Di)
    return entropy(D) - total


# Test the function
# Class distribution of the target over the whole dataset
test_dist = [6, 6] # Yes, No
# Class distribution [Yes, No] within each of the attribute's three values
test_attr = [ [4,0], [2,4], [0,2] ]

print(IG(test_dist, test_attr))

0.5408520829727552
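
The result for this test case can be checked by hand. The parent set $[6, 6]$ is evenly split, so $H(S) = 1$. The subsets $[4, 0]$ and $[0, 2]$ are pure, so their entropy is 0. Only the subset $[2, 4]$ contributes:

$$H([2, 4]) = -\tfrac{1}{3} \log_2\left(\tfrac{1}{3}\right) - \tfrac{2}{3} \log_2\left(\tfrac{2}{3}\right) \approx 0.9183$$

$$IG(A, S) = 1 - \left( \tfrac{4}{12} \cdot 0 + \tfrac{6}{12} \cdot 0.9183 + \tfrac{2}{12} \cdot 0 \right) \approx 0.5409$$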
