#  Decision Trees

A decision tree uses a tree structure to represent a number of possible decision paths and an outcome for each path.

## Entropy

Entropy in Decision Trees is a concept from information theory that measures how impure or uncertain a dataset is.

Think of it like this:

- If all data points belong to one class, there is no randomness → entropy is 0.
- If the data is evenly split between classes, randomness is maximum → entropy is 1 (for binary classification).

#### The Entropy Equation

$$
H(S) = -\left(p_1 \log_2 p_1 + p_2 \log_2 p_2 + \cdots + p_n \log_2 p_n\right)
$$

Where:
- $H(S)$ = entropy of the dataset $S$
- $p_i$ = proportion (probability) of class $c_i$
- $\log_2$ = logarithm with base 2


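Each $\log_2 p_i$ term is negative or zero (since $0 < p_i \le 1$), which is why the formula has a leading minus sign: it makes the entropy non-negative.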

**Example:**

$$
\log_2(0.5) = -1
$$


## Special Rule: \(0 \log 0 = 0\)

You might wonder:

 **log(0) is undefined, so why is this allowed?**

This comes from a mathematical limit (taken from the right, since probabilities are never negative):
$$
\lim_{p \to 0^+} p \log p = 0
$$

### What does this mean?

 If a class never appears in the dataset (probability = 0), it contributes **zero uncertainty**.

Which makes intuitive sense: something that never happens adds **no randomness** to the system.
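
A quick numeric check of this limit, as a minimal sketch. The helper name `entropy_term` is made up for illustration; it computes the contribution $-p \log_2 p$ of a single class and applies the $0 \log 0 = 0$ convention explicitly.

```python
import math

def entropy_term(p: float) -> float:
    """Contribution -p * log2(p) of one class, with 0 * log(0) treated as 0."""
    if p == 0:
        return 0.0
    return -p * math.log2(p)

# The contribution shrinks toward 0 as p gets closer to 0
for p in [0.5, 0.1, 0.01, 0.001, 1e-6, 0.0]:
    print(f"p = {p:<8} contribution = {entropy_term(p):.6f}")
```

The printed contributions get smaller and smaller as `p` approaches 0, which is the behavior the limit describes.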



<br>

If pᵢ ≈ 0 → that class almost never appears → it contributes almost no uncertainty.

If pᵢ ≈ 1 → one class dominates → the outcome is very predictable → almost no uncertainty.

So at both extremes a class contributes very little to the entropy.


In [None]:
from typing import List
import math

def entropy(class_probabilities: List[float]) -> float:
    """Given a list of class probabilities, compute the entropy"""
    return sum(-p * math.log(p, 2)
               for p in class_probabilities
               if p > 0)                    # ignore zero probabilities

assert entropy([1.0]) == 0
assert entropy([0.5,0.5]) == 1
assert 0.81 < entropy([0.25,0.75]) < 0.82
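
The last assertion can be checked by hand. Plugging the probabilities 0.25 and 0.75 into the entropy equation (values rounded to three decimals):

$$
H = -\left(0.25 \log_2 0.25 + 0.75 \log_2 0.75\right) \approx -\left(0.25 \times (-2) + 0.75 \times (-0.415)\right) \approx 0.5 + 0.311 = 0.811
$$

This lies between 0.81 and 0.82, as the assertion expects.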


Our data will consist of `(input, label)` pairs, which means that we will need to compute the class probabilities ourselves.

In [2]:
from typing import Any
from collections import Counter

def class_probabilities(labels: List[Any]) -> List[float]:
    """Proportion of each distinct label among the labels"""
    total_count = len(labels)
    return [count / total_count for count in Counter(labels).values()]

def data_entropy(labels: List[Any]) -> float:
    """Entropy of a list of labels"""
    return entropy(class_probabilities(labels))

assert data_entropy(['a']) == 0
assert data_entropy([True, False]) == 1
assert data_entropy([3,4,4,4]) == entropy([0.25,0.75])

#### The Entropy of a Partition

Earlier, we measured entropy for one dataset to see how mixed the labels were.

Now in a decision tree, every question splits the data into smaller groups.
This splitting is called a partition.


<br>
For example, my "Australian five-cent coin" question was pretty dumb, as it partitioned the remaining animals at that point into S1 = {echidna} and S2 = {everything else}, where S2 is both large and high-entropy. (S1 has no entropy, but it represents only a small fraction of the remaining "classes".)


Mathematically, if we partition our data $S$ into subsets $S_1, \ldots, S_m$ containing proportions $q_1, \ldots, q_m$ of the data, then we compute the entropy of the partition as a weighted sum:

$$
H = q_1 H(S_1) + \cdots + q_m H(S_m)
$$

In [None]:
def partition_entropy(subsets: List[List[Any]])-> float:
    """ Returns the entropy from this partition of data into subsets """
    total_count = sum(len(subset) for subset in subsets)

    return sum(data_entropy(subset) * len(subset) / total_count
               for subset in subsets)
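
To see the weighted sum at work, here is a small self-contained sketch that compares a lopsided split with a clean one. The label lists are made up for illustration, and the helper functions are compact restatements of the ones defined earlier.

```python
import math
from collections import Counter
from typing import Any, List

def entropy(class_probabilities: List[float]) -> float:
    return sum(-p * math.log(p, 2) for p in class_probabilities if p > 0)

def data_entropy(labels: List[Any]) -> float:
    total = len(labels)
    return entropy([count / total for count in Counter(labels).values()])

def partition_entropy(subsets: List[List[Any]]) -> float:
    total_count = sum(len(subset) for subset in subsets)
    return sum(data_entropy(subset) * len(subset) / total_count
               for subset in subsets)

# Hypothetical labels: 4 positives and 4 negatives
labels = [True, True, True, True, False, False, False, False]

# A lopsided split: one pure singleton plus one large, mixed subset
lopsided = [[True], [True, True, True, False, False, False, False]]

# A clean split: two pure subsets
clean = [[True, True, True, True], [False, False, False, False]]

print(partition_entropy([labels]))   # no split at all: 1.0
print(partition_entropy(lopsided))   # about 0.86
print(partition_entropy(clean))      # 0.0
```

The lopsided split barely improves on not splitting at all, because the pure subset carries only one eighth of the weight.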

## Creating a Decision Tree

In [None]:
from typing import NamedTuple, Optional 
 
class Candidate(NamedTuple): 
    level: str 
    lang: str 
    tweets: bool 
    phd: bool 
    did_well: Optional[bool] = None  # allow unlabeled data 
 
                  #  level     lang     tweets  phd  did_well
inputs = [Candidate('Senior', 'Java',   False, False, False), 
          Candidate('Senior', 'Java',   False, True,  False), 
          Candidate('Mid',    'Python', False, False, True), 
          Candidate('Junior', 'Python', False, False, True), 
          Candidate('Junior', 'R',      True,  False, True), 
          Candidate('Junior', 'R',      True,  True,  False), 
          Candidate('Mid',    'R',      True,  True,  True), 
          Candidate('Senior', 'Python', False, False, False), 
          Candidate('Senior', 'R',      True,  False, True), 
          Candidate('Junior', 'Python', True,  False, True), 
          Candidate('Senior', 'Python', True,  True,  True), 
          Candidate('Mid',    'Python', False, True,  True), 
          Candidate('Mid',    'Java',   True,  False, True), 
          Candidate('Junior', 'Python', False, True,  False) 
         ]


from typing import Dict, TypeVar
from collections import defaultdict

T = TypeVar('T')           #generic type for inputs
def partition_by(inputs:List[T], attribute: str)-> Dict[Any, List[T]]:
    """ Partition the inputs into lists based on the specified attribute """
    partitions: Dict[Any, List[T]] = defaultdict(list)
    for input in inputs:
        key = getattr(input,attribute)  # value of specified attribute
        partitions[key].append(input)
    return partitions

# Compute entropy
# Entropy is evaluated ONLY on the labels.

def partition_entropy_by(inputs:List[Any],
                         attribute: str,
                         label_attribute:str)-> float:
    """ Compute the entropy corresponding to given partition """
    # partitions consist of our inputs
    partitions = partition_by(inputs, attribute)

    # but partition_entropy needs just the class labels
    labels = [[getattr(input, label_attribute) for input in partition]
              for partition in partitions.values()]

    return partition_entropy(labels)

# Then we just need to find the minimum-entropy partition for the whole dataset
for key in ['level','lang','tweets','phd']:
    print(key, partition_entropy_by(inputs, key,"did_well"))

assert 0.69 < partition_entropy_by(inputs, 'level', 'did_well')  < 0.70
assert 0.86 < partition_entropy_by(inputs, 'lang', 'did_well')   < 0.87
assert 0.78 < partition_entropy_by(inputs, 'tweets', 'did_well') < 0.79
assert 0.89 < partition_entropy_by(inputs, 'phd', 'did_well')    < 0.90

The lowest-entropy split is on `level`, so we split on it first and then need to build a subtree for each possible `level` value.

<br>

Every `Mid` candidate is labeled `True`, which means that the `Mid` subtree is simply a leaf node predicting `True`.

<br>

For `Senior` candidates, we have a mix of `True`s and `False`s, so we need to split again.

In [None]:
senior_inputs = [input for input in inputs if input.level == 'Senior']

assert 0.4 == partition_entropy_by(senior_inputs, 'lang', 'did_well')
assert 0.0 == partition_entropy_by(senior_inputs, 'tweets', 'did_well')
assert 0.95 < partition_entropy_by(senior_inputs, 'phd', 'did_well') < 0.96

This shows us that our next split should be on **`tweets`**, which results in a **zero-entropy partition**.

For the **Senior-level candidates**:

- `tweets` is `True` → **always results in `True`**
- `tweets` is `False` → **always results in `False`**

Since the entropy becomes **0**, this means the data is perfectly separated, and no further splitting is required for this branch of the decision tree.

---

Finally, if we do the same thing for the **Junior candidates**, we end up splitting on **`phd`**.

After this split, we observe:

- **No PhD → always results in `True`**
- **PhD → always results in `False`**

Again, this produces a **zero-entropy partition**, meaning the classification is perfectly pure and the tree does not need to grow any further on this path.
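
The Junior branch can be checked the same way as the Senior one. The sketch below is self-contained: it copies only the five Junior rows from the table above, and `entropy_after_split` is a made-up, compact stand-in for `partition_entropy_by` with the label attribute fixed to `did_well`.

```python
import math
from collections import Counter, defaultdict
from typing import Any, Dict, List, NamedTuple

class Candidate(NamedTuple):
    level: str
    lang: str
    tweets: bool
    phd: bool
    did_well: bool

# The five Junior rows from the table above
junior_inputs = [
    Candidate('Junior', 'Python', False, False, True),
    Candidate('Junior', 'R',      True,  False, True),
    Candidate('Junior', 'R',      True,  True,  False),
    Candidate('Junior', 'Python', True,  False, True),
    Candidate('Junior', 'Python', False, True,  False),
]

def data_entropy(labels: List[Any]) -> float:
    total = len(labels)
    probabilities = [count / total for count in Counter(labels).values()]
    return sum(-p * math.log(p, 2) for p in probabilities if p > 0)

def entropy_after_split(inputs: List[Candidate], attribute: str) -> float:
    """Weighted entropy of did_well after grouping inputs by attribute."""
    groups: Dict[Any, List[bool]] = defaultdict(list)
    for candidate in inputs:
        groups[getattr(candidate, attribute)].append(candidate.did_well)
    return sum(data_entropy(labels) * len(labels) / len(inputs)
               for labels in groups.values())

for attribute in ['lang', 'tweets', 'phd']:
    print(attribute, entropy_after_split(junior_inputs, attribute))
```

Splitting on `lang` or `tweets` leaves an entropy of roughly 0.95, while splitting on `phd` gives 0, which is why `phd` is chosen for this branch.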

---

## Put It All Together

We define a tree to be either:

- a `Leaf` (that predicts a single value)
- a `Split` (containing an attribute to split on, subtrees for specific values of that attribute, and possibly a default value to use if we see an unknown value)

In [None]:
from typing import NamedTuple, Union, Any

class Leaf(NamedTuple):
    value: Any

class Split(NamedTuple):
    attribute: str
    subtrees: dict
    default_value: Any = None

DecisionTree = Union[Leaf, Split]

Our hiring tree would look like:

In [None]:
hiring_tree = Split('level', {   # first, consider "level" 
    'Junior': Split('phd', {     # if level is "Junior", next look at "phd" 
        False: Leaf(True),       #   if "phd" is False, predict True 
        True: Leaf(False)        #   if "phd" is True, predict False 
    }), 
    'Mid': Leaf(True),           # if level is "Mid", just predict True 
    'Senior': Split('tweets', {  # if level is "Senior", look at "tweets" 
        False: Leaf(False),      #   if "tweets" is False, predict False 
        True: Leaf(True)         #   if "tweets" is True, predict True 
    })
})

What should we do if we encounter an unexpected (or missing) attribute value, such as a candidate whose `level` is `Intern`?
<br>
In this case we handle it by populating the `default_value` attribute with the most common label.



In [None]:
def classify(tree: DecisionTree, input: Any) -> Any:
    """Classify the input using the given decision tree"""

    # If this is a leaf node, return its value 
    if isinstance(tree,Leaf):
        return tree.value
    

    # Otherwise this tree consists of an attribute to split on 
    # and a dictionary whose keys are values of that attribute 
    # and whose values are subtrees to consider next 

    subtree_key = getattr(input, tree.attribute)
    if subtree_key not in tree.subtrees:
        return tree.default_value             # returns the default value
    
    subtree = tree.subtrees[subtree_key]
    return classify(subtree,input )
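
As a usage sketch, the hand-written hiring tree can be passed to `classify` directly. The block below restates the definitions so that it runs on its own. Two things are assumptions made for illustration: the root is given `default_value=True` (the most common label in the data), and `Candidate` is declared without the `did_well` field.

```python
from typing import Any, NamedTuple, Union

class Leaf(NamedTuple):
    value: Any

class Split(NamedTuple):
    attribute: str
    subtrees: dict
    default_value: Any = None

DecisionTree = Union[Leaf, Split]

class Candidate(NamedTuple):
    level: str
    lang: str
    tweets: bool
    phd: bool

def classify(tree: DecisionTree, input: Any) -> Any:
    if isinstance(tree, Leaf):
        return tree.value
    subtree_key = getattr(input, tree.attribute)
    if subtree_key not in tree.subtrees:
        return tree.default_value       # unseen value: fall back to the default
    return classify(tree.subtrees[subtree_key], input)

# Same shape as the hiring tree above, with an assumed default at the root
hiring_tree = Split('level', {
    'Junior': Split('phd', {False: Leaf(True), True: Leaf(False)}),
    'Mid': Leaf(True),
    'Senior': Split('tweets', {False: Leaf(False), True: Leaf(True)}),
}, default_value=True)

print(classify(hiring_tree, Candidate('Senior', 'R', True, False)))     # True
print(classify(hiring_tree, Candidate('Junior', 'Java', False, True)))  # False
print(classify(hiring_tree, Candidate('Intern', 'Java', True, True)))   # True (default)
```

Without a default, the `Intern` candidate would get `None` back, because `Intern` is not a key in the root's `subtrees`.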


### Build Tree

In [None]:
def build_tree_id3(inputs: List[Any], 
                   split_attributes: List[str], 
                   target_attribute: str) -> DecisionTree: 
    # Count target labels 
    label_counts = Counter(getattr(input, target_attribute) 
                           for input in inputs) 
    most_common_label = label_counts.most_common(1)[0][0] 
 
    # If there's a unique label, predict it 
    if len(label_counts) == 1: 
        return Leaf(most_common_label) 
 
    # If no split attributes left, return the majority label 
    if not split_attributes: 
        return Leaf(most_common_label) 
 
    # Otherwise split by the best attribute 
 
    def split_entropy(attribute: str) -> float: 
        """Helper function for finding the best attribute""" 
        return partition_entropy_by(inputs, attribute, target_attribute) 
 
    best_attribute = min(split_attributes, key=split_entropy) 
 
    partitions = partition_by(inputs, best_attribute) 
    new_attributes = [a for a in split_attributes if a != best_attribute] 
 
    # Recursively build the subtrees 
    subtrees = {attribute_value : build_tree_id3(subset, 
                                                 new_attributes, 
                                                 target_attribute) 
                for attribute_value, subset in partitions.items()} 
 
    return Split(best_attribute, subtrees, default_value=most_common_label)

In [None]:
tree = build_tree_id3(inputs, 
                      ['level', 'lang', 'tweets', 'phd'], 
                      'did_well') 
 
# Should predict True
assert classify(tree, Candidate("Junior", "Java", True, False)) 
 
# Should predict False
assert not classify(tree, Candidate("Junior", "Java", True, True))
# And also to data with unexpected values:
# Should predict True
assert classify(tree, Candidate("Intern", "Java", True, True))

## Random Forests


## ❗ Problem with Decision Trees: Overfitting
Decision trees are very powerful, but they often **overfit**.

**Overfitting** means:
> The model memorizes training data instead of learning general patterns.

Result:
- Excellent performance on training data ✅
- Poor performance on new/unseen data ❌

---

# ✅ Solution: Random Forest

Instead of building **one decision tree**, we build **many trees** and combine their predictions.

This technique is called a **Random Forest**.

Think of it like asking multiple experts rather than trusting one person.

---

## 🌳 How Predictions Work

### ✔ Classification (Yes/No)
Each tree votes, and the majority wins.

Example:

Tree 1 → Yes  
Tree 2 → No  
Tree 3 → Yes  
Tree 4 → Yes  

✅ Final Prediction → **Yes**

---

### ✔ Regression (Predicting Numbers)
Take the average of predictions.

Example:

20, 22, 19, 21  
Final prediction = **20.5**
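
Both ways of combining predictions fit in a few lines. This is a minimal sketch; the function names `majority_vote` and `average_prediction` are made up for illustration, and the inputs are the two examples above.

```python
from collections import Counter
from statistics import mean
from typing import Any, List

def majority_vote(predictions: List[Any]) -> Any:
    """Return the most common prediction (on a tie, the one seen first wins)."""
    return Counter(predictions).most_common(1)[0][0]

def average_prediction(predictions: List[float]) -> float:
    """Return the mean of the individual predictions."""
    return mean(predictions)

print(majority_vote(['Yes', 'No', 'Yes', 'Yes']))   # Yes
print(average_prediction([20, 22, 19, 21]))         # 20.5
```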

---

# 🎲 Where Does the "Random" Come From?

Random Forest uses **two sources of randomness**.

---

## ✅ 1. Bootstrapping (Random Data Sampling)

Instead of training every tree on the full dataset:

👉 We sample data **with replacement**.

### What does "with replacement" mean?
- After picking a row, we put it back.
- The same row can appear multiple times.
- Some rows may not appear at all.

Example dataset:

`A, B, C, D, E`

Bootstrap sample:

`B, D, B, A, E`


Notice:
- B appears twice ✅
- C is missing ❌

Every tree gets different data → Every tree becomes different.
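
Sampling with replacement can be sketched as follows. The function name `bootstrap_sample` is an assumption, and the seed is fixed only so that repeated runs print the same samples.

```python
import random
from typing import List, TypeVar

X = TypeVar('X')

def bootstrap_sample(data: List[X], rng: random.Random) -> List[X]:
    """Draw len(data) items from data, with replacement."""
    return [rng.choice(data) for _ in data]

rng = random.Random(0)                 # fixed seed, for repeatability only
data = ['A', 'B', 'C', 'D', 'E']

# Each call would feed a different tree
for _ in range(3):
    print(bootstrap_sample(data, rng))
```

Each printed sample has five items, and it may contain repeats and leave out some of the original rows.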

---

##  Out-of-Bag Samples (Bonus Advantage)

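Rows that were not selected for a tree's bootstrap sample are called **out-of-bag (OOB) data** for that tree.

Each tree can be evaluated on its own out-of-bag rows, which it never saw during training, so we can estimate how well the model generalizes without setting aside a separate test set.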

Very efficient!
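
To get a feel for how much data ends up out-of-bag, the sketch below draws one bootstrap sample of row indices and collects the indices that were never picked. The name `bootstrap_indices` is made up for illustration. For large datasets the out-of-bag fraction approaches 1/e, roughly 37%.

```python
import random
from typing import List, Set, Tuple

def bootstrap_indices(n: int, rng: random.Random) -> Tuple[List[int], Set[int]]:
    """Return (in-bag row indices, out-of-bag row indices) for n rows."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = set(range(n)) - set(in_bag)
    return in_bag, out_of_bag

rng = random.Random(0)                 # fixed seed, for repeatability only
n = 1000
in_bag, out_of_bag = bootstrap_indices(n, rng)

print(len(out_of_bag) / n)             # typically close to 0.37
```

A tree trained on the `in_bag` rows could then be scored on the `out_of_bag` rows.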

---

# 📦 Bagging (Bootstrap Aggregating)

**Bagging = Bootstrapping + Aggregating**

Steps:
1. Create multiple datasets by sampling the training data with replacement.
2. Train a tree on each dataset.
3. Combine their predictions.

Result → A more stable and usually more accurate model.
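
The three steps can be sketched generically. Everything named here (`bagging_predict`, `train_majority`, the toy rows) is an assumption for illustration. The learner is deliberately trivial so that the bagging structure stays visible; a real forest would train a decision tree on each sample instead.

```python
import random
from collections import Counter
from typing import Any, Callable, List

Model = Callable[[Any], Any]   # a trained model maps an input to a prediction

def bagging_predict(rows: List[Any],
                    train: Callable[[List[Any]], Model],
                    num_models: int,
                    x: Any,
                    rng: random.Random) -> Any:
    # Steps 1 and 2: draw one bootstrap sample per model and train on it
    models = [train([rng.choice(rows) for _ in rows])
              for _ in range(num_models)]
    # Step 3: combine the predictions by majority vote
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

def train_majority(rows: List[Any]) -> Model:
    """Toy learner: always predict the most common label in its sample."""
    most_common = Counter(label for _, label in rows).most_common(1)[0][0]
    return lambda x: most_common

rows = [(i, i < 8) for i in range(10)]   # made-up data: 8 True labels, 2 False

print(bagging_predict(rows, train_majority, num_models=25, x=3,
                      rng=random.Random(0)))
```

Because `train` is passed in as a parameter, the same function would work with any learner that returns a callable model.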

---

# 🎲 2. Random Feature Selection

Another way Random Forest creates diversity:

👉 Instead of checking ALL attributes when splitting,
we only check a **random subset**.

``` python
    # (excerpt from inside a tree-building method; self.num_split_candidates
    #  is assumed to be set elsewhere on the object)

    # if there are already few enough split candidates, look at all of them 
    if len(split_candidates) <= self.num_split_candidates: 
        sampled_split_candidates = split_candidates 
    # otherwise pick a random sample 
    else: 
        sampled_split_candidates = random.sample(split_candidates, 
                                                 self.num_split_candidates) 
 
    # now choose the best attribute only from those candidates 
    best_attribute = min(sampled_split_candidates, key=split_entropy) 
 
    partitions = partition_by(inputs, best_attribute)
```
---

## Example

All features:

[Age, Salary, Education, Experience, Location]

Random subset:

[Education, Location]


Now the tree must choose the best split **only from these features**.
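
The sampling step can be tried in isolation. The sketch below is a standalone version of that logic, using the example feature list; the function name `sample_split_candidates` and the choice of 2 candidates per split are assumptions.

```python
import random
from typing import List

def sample_split_candidates(split_candidates: List[str],
                            num_split_candidates: int,
                            rng: random.Random) -> List[str]:
    """Return all candidates if there are few enough, else a random subset."""
    if len(split_candidates) <= num_split_candidates:
        return split_candidates
    return rng.sample(split_candidates, num_split_candidates)

features = ['Age', 'Salary', 'Education', 'Experience', 'Location']
rng = random.Random(0)                 # fixed seed, for repeatability only

# Each split gets to look at a different pair of features
for split_number in range(1, 4):
    print(split_number, sample_split_candidates(features, 2, rng))
```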

---

## Why Do This?

If every tree always picks the strongest feature:

Tree 1 → Salary  
Tree 2 → Salary  
Tree 3 → Salary  


All trees become similar ❌

But randomness creates diversity:

Tree 1 → Salary  
Tree 2 → Education  
Tree 3 → Experience  
Tree 4 → Location  


Different trees → Better combined predictions ✅

---

#  Ensemble Learning

## Definition:
**Ensemble learning** is a technique where multiple models are combined to produce a stronger overall model.

 Ensemble = Teamwork.

---

## Weak vs Strong Learners

### Weak Learner:
- Slightly better than random guessing
- Makes mistakes
- Often high bias and low variance (for example, a very shallow tree)

### Strong Learner:
Created by combining many weak learners.

Example:

Model accuracies:
- 70%
- 68%
- 72%

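Combined by majority vote → **often higher accuracy than any single model**

This works when the models make different mistakes, so that individual errors tend to cancel out.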
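
How much voting helps can be computed exactly under one strong assumption: that the models are independent and each is correct with the same probability. Real models are correlated, so this is an idealized upper-end picture rather than a prediction. The function name `majority_vote_accuracy` is made up for illustration, and `n` is assumed to be odd so that ties cannot occur.

```python
from math import comb

def majority_vote_accuracy(p: float, n: int) -> float:
    """P(more than half of n independent voters are right), each right with probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

for n in [1, 3, 11, 51]:
    print(n, round(majority_vote_accuracy(0.7, n), 3))
```

With three independent 70% models, the majority is right when at least two of them are right, which works out to 0.7³ + 3 × 0.7² × 0.3 = 0.784.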

---

#  Why Random Forest Works So Well

A single tree is:
- Sensitive to noise
- High variance

Many trees:
- Reduce errors
- Improve stability
- Lower overfitting

---


-  **Random Forest = Random Data + Random Features + Many Trees**

- **Bagging = Bootstrap + Aggregation**

- **Ensemble Learning = Combine multiple weak models → One strong model**

---

Random Forest is basically bagging applied to decision trees, improved by also choosing a random subset of features at each split.

