# Decision Trees

Decision trees are a popular type of model used in machine learning for both classification (predicting discrete categories) and regression (predicting continuous numbers). They work by recursively splitting a dataset based on feature values, which produces a model that is easy to interpret and visualize. Although they are powerful and widely used in practice, including in many machine learning competitions, decision trees often receive less academic attention than other methods.

*Example Context:*  
Imagine running a cat adoption center. You have a dataset with 10 examples where each animal is described by features such as **ear shape**, **face shape**, and **whiskers**. The task is to classify each animal as a cat or not a cat. In this example, the input features ($X$) are categorical (e.g., "pointy" vs. "floppy" ears) and the target variable ($Y$) is binary (cat = 1, not cat = 0).

---

## Structure of a Decision Tree

A decision tree consists of several types of nodes that form a branching structure:

- **Root Node:**  
  The topmost node that represents the entire dataset. All examples start here.

- **Decision Nodes:**  
  These are the internal nodes (often depicted as ovals) where a test is performed on one of the features. Depending on the outcome (e.g., "pointy ear" vs. "floppy ear"), the example is sent down different branches.

- **Leaf Nodes:**  
  Terminal nodes (often depicted as rectangles) that provide a final prediction. For classification, a leaf might say "cat" or "not cat"; for regression, it might output an average value (e.g., predicted weight).

*Clarification:*  
Although we call it a "tree," its structure is more like a flowchart. Think of it as asking a series of yes/no questions until you reach a conclusion. Drawing the root at the top and the leaves at the bottom is just a diagrammatic convention, much like an indoor hanging plant whose roots sit above its leaves.
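To make the flowchart view concrete, here is a minimal sketch of how such a tree could be represented and traversed. The `Node` class, the `predict` helper, the feature names (`pointy_ears`, `round_face`, `whiskers`), and the hand-built tree are all assumptions made for illustration; the tree is not learned from data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One node of a binary decision tree (hypothetical layout for illustration)."""
    feature: Optional[str] = None      # feature tested at a decision node
    left: Optional["Node"] = None      # branch taken when the feature value is 1 (yes)
    right: Optional["Node"] = None     # branch taken when the feature value is 0 (no)
    prediction: Optional[int] = None   # set only on leaf nodes (1 = cat, 0 = not cat)

    def is_leaf(self) -> bool:
        return self.prediction is not None


def predict(node: Node, example: dict) -> int:
    """Walk from the root to a leaf, answering one yes/no question per decision node."""
    while not node.is_leaf():
        node = node.left if example[node.feature] == 1 else node.right
    return node.prediction


# Hand-built tree (an assumption): test ear shape first, then face shape or whiskers.
tree = Node(
    feature="pointy_ears",
    left=Node(feature="round_face", left=Node(prediction=1), right=Node(prediction=0)),
    right=Node(feature="whiskers", left=Node(prediction=1), right=Node(prediction=0)),
)

animal = {"pointy_ears": 1, "round_face": 1, "whiskers": 0}
print(predict(tree, animal))  # 1 (cat)
```

The root is simply the first `Node` visited; every example follows exactly one path from it to a leaf.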

---

## How to Build a Decision Tree

The process of constructing a decision tree from a training set involves two main challenges:

### Choosing a Feature to Split On

At each node, you must decide which feature to use to divide the data. For instance, at the root node you might decide to split based on **ear shape**:
- All examples with **pointy ears** go to the left branch.
- All examples with **floppy ears** go to the right branch.

This decision is made by evaluating which feature best separates the classes (i.e., makes the groups as "pure" as possible).

### Deciding When to Stop Splitting

You cannot split indefinitely. Common stopping criteria include:
- **Pure Node:** Stop if all examples at a node belong to a single class (i.e., the node is pure).
- **Maximum Depth:** Set a limit on how many splits (or “hops” from the root) are allowed to prevent overly complex trees.
- **Minimal Information Gain:** If the best available split reduces impurity by less than a chosen threshold, you may decide to stop.
- **Minimum Sample Size:** If a node contains too few examples, further splitting might be unreliable.
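The four criteria above can be combined into a single check, sketched below. The function name `should_stop` and the default thresholds (`max_depth=3`, `min_gain=0.01`, `min_samples=2`) are arbitrary assumptions for illustration, not recommended values.

```python
def should_stop(labels, depth, best_gain,
                max_depth=3, min_gain=0.01, min_samples=2):
    """Return True if any of the four stopping criteria applies to this node.

    The default thresholds are illustrative assumptions, not recommendations.
    """
    if len(set(labels)) <= 1:        # pure node: a single class (or no examples)
        return True
    if depth >= max_depth:           # maximum depth reached
        return True
    if best_gain < min_gain:         # best available split barely reduces impurity
        return True
    if len(labels) < min_samples:    # too few examples to split reliably
        return True
    return False


print(should_stop([1, 1, 1], depth=1, best_gain=0.0))       # True  (pure node)
print(should_stop([1, 0, 1, 0], depth=1, best_gain=0.30))   # False (worth splitting)
print(should_stop([1, 0, 1, 0], depth=3, best_gain=0.30))   # True  (depth limit)
```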

---

## Measuring Purity with Entropy

To decide the best split, you need a way to quantify the “purity” of a set of examples. **Entropy** is a measure of impurity in a dataset.

### Entropy Definition

Let $p_1$ be the fraction of positive examples (e.g., cats). Then the entropy $H(p_1)$ is defined as:

$$
H(p_1) = -p_1 \log_2(p_1) - (1-p_1) \log_2(1-p_1)
$$

- **Maximum Entropy:**  
  When $p_1 = 0.5$, the classes are evenly mixed and $H(0.5) = 1$. This is the worst-case (most impure) scenario.
- **Minimum Entropy:**  
  When $p_1 = 0$ or $1$, the node is pure and $H(0) = H(1) = 0$, using the convention $0 \log_2 0 = 0$.

*Example:*  
- If you have 3 cats and 3 dogs ($p_1 = 0.5$), the entropy is $1$.
- If you have 5 cats and 1 dog ($p_1 \approx 0.83$), the entropy is lower (around $0.65$).
- If all examples are cats ($p_1 = 1$), the entropy is $0$.
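The three cases above can be checked with a few lines of Python. This is a minimal sketch; the function name `entropy` is an assumption, and the endpoints are handled explicitly using the convention $0 \log_2 0 = 0$.

```python
import math


def entropy(p1: float) -> float:
    """Binary entropy in bits, using the convention 0 * log2(0) = 0."""
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)


print(round(entropy(3 / 6), 2))  # 1.0   (3 cats, 3 dogs)
print(round(entropy(5 / 6), 2))  # 0.65  (5 cats, 1 dog)
print(round(entropy(6 / 6), 2))  # 0.0   (all cats)
```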

---

## Information Gain: Choosing the Best Split

**Information Gain (IG)** measures the reduction in entropy after a split. It tells you how much more “pure” the resulting subsets are compared to the original set.

### Information Gain Formula

If you split a node into two branches (left and right), the information gain is calculated as:

$$
\text{IG} = H(p_1^\text{root}) - \left( w^\text{left} \cdot H(p_1^\text{left}) + w^\text{right} \cdot H(p_1^\text{right}) \right)
$$

Where:
- $H(p_1^\text{root})$ is the entropy of the node being split (labeled "root" here, although the same formula applies at any decision node).
- $H(p_1^\text{left})$ and $H(p_1^\text{right})$ are the entropies of the left and right subsets.
- $w^\text{left}$ and $w^\text{right}$ are the proportions of examples in the left and right branches.

*Practical Example:*  
Suppose at the root node, there are 10 examples (5 cats, 5 dogs), so $p_1^\text{root} = 0.5$ and $H(0.5)=1$. Now, consider splitting on **ear shape**:
- **Left branch (pointy ears):** 5 examples with 4 cats ($p_1^\text{left} = 0.8$, entropy roughly $0.72$).
- **Right branch (floppy ears):** 5 examples with 1 cat ($p_1^\text{right} = 0.2$, entropy roughly $0.72$).
- With weights $w^\text{left}=w^\text{right}=0.5$, the weighted entropy is $0.5 \times 0.72 + 0.5 \times 0.72 = 0.72$.
- Information gain is then $1 - 0.72 = 0.28$.

The algorithm computes the information gain for each possible feature split and selects the one with the highest gain.
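The ear-shape calculation above can be reproduced with a short sketch. The function names `entropy` and `information_gain` are assumptions, and the label lists only encode the counts given in the example (4 of 5 cats on the left, 1 of 5 on the right).

```python
import math


def entropy(p1):
    """Binary entropy in bits, using the convention 0 * log2(0) = 0."""
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)


def information_gain(parent, left, right):
    """Entropy of the parent labels minus the weighted entropy of the two branches.

    Each argument is a list of 0/1 labels (1 = cat).
    """
    def h(labels):
        return entropy(sum(labels) / len(labels)) if labels else 0.0

    w_left = len(left) / len(parent)
    w_right = len(right) / len(parent)
    return h(parent) - (w_left * h(left) + w_right * h(right))


pointy = [1, 1, 1, 1, 0]   # left branch: 5 examples, 4 cats
floppy = [1, 0, 0, 0, 0]   # right branch: 5 examples, 1 cat
print(round(information_gain(pointy + floppy, pointy, floppy), 2))  # 0.28
```

Running the same function for every candidate feature and taking the maximum gives the split selection rule described above.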

---

## Handling Different Feature Types

Decision trees can handle various types of input features, but different methods are used depending on whether the feature is categorical or continuous.

### Categorical Features

#### Binary Features
For features that take on only two values (e.g., "whiskers" can be either present or absent), the tree splits directly on the binary condition.

#### Multi-valued Features and One-Hot Encoding
When a feature has more than two possible values (e.g., **ear shape** might be "pointy", "floppy", or "oval"), one common approach is **one-hot encoding**:
- Replace the single multi-valued feature with $k$ binary features (where $k$ is the number of possible values).
- For example, if ear shape has three values, create three new features:
  - **Pointy Ear:** $1$ if pointy, otherwise $0$.
  - **Floppy Ear:** $1$ if floppy, otherwise $0$.
  - **Oval Ear:** $1$ if oval, otherwise $0$.
- Each example will have exactly one of these features set to $1$ ("hot") and the others set to $0$.

*Benefit:*  
This transformation allows the decision tree algorithm to work with each feature as a simple binary indicator.
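A minimal sketch of this transformation in plain Python follows. The helper name `one_hot` and the column naming scheme (`pointy_ear`, `floppy_ear`, `oval_ear`) are assumptions for illustration.

```python
def one_hot(values, categories, suffix="ear"):
    """Turn one multi-valued feature into len(categories) binary indicator columns."""
    return [{f"{c}_{suffix}": int(v == c) for c in categories} for v in values]


ear_shapes = ["pointy", "oval", "floppy"]
for row in one_hot(ear_shapes, categories=["pointy", "floppy", "oval"]):
    print(row)
# {'pointy_ear': 1, 'floppy_ear': 0, 'oval_ear': 0}
# {'pointy_ear': 0, 'floppy_ear': 0, 'oval_ear': 1}
# {'pointy_ear': 0, 'floppy_ear': 1, 'oval_ear': 0}
```

Each row contains exactly one `1`, which is the "hot" indicator for that example.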

### Continuous Features

Continuous features (like **weight**) can take any numerical value. The decision tree must choose an optimal threshold to split the data.

#### How to Split on Continuous Features
1. **Sort the Examples:**  
   Arrange the data points by the continuous feature (e.g., weight).

2. **Consider Candidate Thresholds:**  
   Use the midpoints between consecutive distinct values in the sorted order as candidate thresholds; $n$ distinct values give $n - 1$ candidates.

3. **Evaluate Information Gain:**  
   For each candidate threshold, split the data into two subsets (e.g., weight $\leq t$ and weight $> t$) and calculate the resulting weighted entropy. Choose the threshold that maximizes information gain.

*Example:*  
For weight, testing thresholds like $8$, $9$, or $13$ might result in different splits. If splitting at $t=9$ gives the highest reduction in entropy (say, information gain of $0.61$ compared to lower gains at $t=8$ or $t=13$), then the threshold $9$ is chosen.
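The threshold search can be sketched as follows. The function name `best_threshold` and the nine weights and labels are made up for illustration; they are not the dataset behind the numbers quoted above, so the resulting threshold and gain differ from that example.

```python
import math


def entropy(p1):
    """Binary entropy in bits, using the convention 0 * log2(0) = 0."""
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)


def best_threshold(values, labels):
    """Try the midpoint between each pair of consecutive distinct sorted values
    and return (threshold, information_gain) for the best one."""
    n = len(labels)
    parent_entropy = entropy(sum(labels) / n)
    distinct = sorted(set(values))
    best_t, best_gain = None, 0.0
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        weighted = (len(left) / n) * entropy(sum(left) / len(left)) \
            + (len(right) / n) * entropy(sum(right) / len(right))
        gain = parent_entropy - weighted
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain


# Made-up data: animal weights and labels (1 = cat).
weights = [6, 7, 8, 9, 11, 13, 15, 17, 19]
is_cat = [1, 1, 1, 1, 0, 1, 0, 0, 0]

t, gain = best_threshold(weights, is_cat)
print(t, round(gain, 2))  # 10.0 0.59
```

Because each candidate lies strictly between two distinct values, neither side of a split is ever empty.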

---

## Regression Trees: Predicting Continuous Values

While classification trees predict categories, **regression trees** are designed to predict continuous values (e.g., predicting an animal's weight).

### Key Differences from Classification Trees

- **Target Variable:**  
  In regression trees, the target is a number (e.g., weight) rather than a category.

- **Splitting Criterion:**  
  Instead of reducing entropy, regression trees aim to reduce the variance of the target variable. The goal is to create subsets where the target values are as similar as possible.

### Variance Reduction

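For each node, compute the variance of the target values of the examples it contains. To evaluate a split, compare that variance with the weighted average variance of the two resulting subsets (here "root" denotes the node being split):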

$$
\text{Reduction in Variance} = \text{Variance}_\text{root} - \left( w^\text{left} \cdot \text{Variance}_\text{left} + w^\text{right} \cdot \text{Variance}_\text{right} \right)
$$

- **Leaf Node Prediction:**  
  At a leaf node, the prediction is the average of the target values in that node.

*Practical Example:*  
Suppose at the root node, the overall variance of weight is $20.51$. If splitting on **ear shape** yields two subsets whose weighted average variance is $11.67$, then the reduction in variance is $20.51 - 11.67 = 8.84$. The algorithm chooses the feature (and, for continuous features, the threshold) that maximizes this reduction.
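The variance-reduction formula can be sketched with the standard library. The function name `variance_reduction` and the six weights are made up for illustration and do not reproduce the $20.51$ and $11.67$ figures above. The sketch assumes the population variance (`statistics.pvariance`, dividing by $n$); the text does not say which variance definition produced its numbers, and the sample variance (dividing by $n - 1$) is also used in practice.

```python
from statistics import mean, pvariance


def variance_reduction(parent, left, right):
    """Parent variance minus the weighted variance of the two branches.

    Uses the population variance (divide by n), which is an assumption here.
    """
    w_left = len(left) / len(parent)
    w_right = len(right) / len(parent)
    return pvariance(parent) - (w_left * pvariance(left) + w_right * pvariance(right))


# Made-up target values (animal weights) on each side of a candidate split.
left = [8.0, 9.0, 10.0]
right = [16.0, 18.0, 20.0]
parent = left + right

print(round(variance_reduction(parent, left, right), 2))  # 20.25
print(mean(left), mean(right))  # 9.0 18.0 (leaf predictions if both sides become leaves)
```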

---

## Recursive Construction and Ensemble Methods

### Recursion in Decision Trees

The process of building a decision tree is inherently recursive:
- **At the root:** Evaluate all features and choose the best split.
- **For each branch:** Treat the branch as a new dataset and repeat the splitting process.
- **Stop Splitting:** When a stopping criterion is met (pure node, maximum depth, etc.).

*Analogy:*  
Think of recursion like peeling the layers of an onion: each layer (or branch) is processed in the same way until you reach the core (a leaf node).
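The recursive procedure can be sketched as below for binary features and binary labels. This is a simplified illustration under stated assumptions: the function names, the nested-dict tree format, the `max_depth=2` default, and the eight-example dataset are all made up, and only two stopping criteria (pure node, depth limit) plus a zero-gain check are implemented.

```python
import math


def entropy(labels):
    """Binary entropy (bits) of a list of 0/1 labels; empty or pure lists give 0."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)


def split(rows, labels, feature):
    """Partition the examples by a binary feature: value 1 goes left, 0 goes right."""
    left = [(r, y) for r, y in zip(rows, labels) if r[feature] == 1]
    right = [(r, y) for r, y in zip(rows, labels) if r[feature] == 0]
    return left, right


def information_gain(rows, labels, feature):
    left, right = split(rows, labels, feature)
    n = len(labels)
    weighted = sum(len(part) / n * entropy([y for _, y in part]) for part in (left, right))
    return entropy(labels) - weighted


def build_tree(rows, labels, features, depth=0, max_depth=2):
    """Return a nested dict: a leaf {'predict': c} or a decision node with two subtrees."""
    majority = int(sum(labels) * 2 >= len(labels))
    # Stop on a pure node or at the depth limit.
    if len(set(labels)) == 1 or depth == max_depth:
        return {"predict": majority}
    # Choose the feature with the highest information gain.
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    if information_gain(rows, labels, best) <= 0:
        return {"predict": majority}
    left, right = split(rows, labels, best)
    # Recurse: each branch is treated as a smaller dataset.
    return {
        "feature": best,
        "left": build_tree([r for r, _ in left], [y for _, y in left],
                           features, depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right],
                            features, depth + 1, max_depth),
    }


# Made-up data: an animal is a cat (1) only if it has pointy ears and whiskers.
rows = [
    {"pointy_ears": 1, "whiskers": 1}, {"pointy_ears": 1, "whiskers": 1},
    {"pointy_ears": 1, "whiskers": 1}, {"pointy_ears": 1, "whiskers": 0},
    {"pointy_ears": 0, "whiskers": 1}, {"pointy_ears": 0, "whiskers": 1},
    {"pointy_ears": 0, "whiskers": 0}, {"pointy_ears": 0, "whiskers": 0},
]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

tree = build_tree(rows, labels, ["pointy_ears", "whiskers"])
print(tree)
# {'feature': 'pointy_ears', 'left': {'feature': 'whiskers', 'left': {'predict': 1},
#  'right': {'predict': 0}}, 'right': {'predict': 0}}
```

The root splits on `pointy_ears` because it has the higher information gain on this data; the left branch is then split again on `whiskers`, while the right branch is already pure and becomes a leaf.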

### Tree Ensembles

Often, a single decision tree may not provide the best performance. By combining multiple trees, you can build a more robust model. Methods include:
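- **Bagging:** Building multiple trees, each on a bootstrap sample of the training data (drawn with replacement), and combining their predictions by averaging (regression) or majority vote (classification). Random forests extend this by also considering only a random subset of features at each split.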
- **Boosting:** Sequentially building trees where each new tree focuses on correcting the errors of the previous trees.

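Ensembles typically generalize better than a single tree: bagging mainly reduces variance (overfitting), while boosting mainly reduces bias, though a boosted model can still overfit if too many trees are added.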
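As a rough illustration of bagging, the sketch below trains several one-split trees ("stumps") on bootstrap samples and combines them by majority vote. Everything here is an assumption for illustration: the function names, the choice of 25 trees, the fixed seed, and the made-up weight data. It shows plain bagging only; it does not add the random feature selection used by random forests, and it does not cover boosting.

```python
import random
from collections import Counter


def train_stump(xs, ys):
    """Fit a one-split tree on one numeric feature: predict 1 if x <= t, else 0.

    Falls back to a constant prediction when the sample contains a single class.
    """
    if len(set(ys)) == 1:
        only_class = ys[0]
        return lambda x: only_class
    distinct = sorted(set(xs))
    best_t, best_correct = None, -1
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        correct = sum((x <= t) == bool(y) for x, y in zip(xs, ys))
        if correct > best_correct:
            best_t, best_correct = t, correct
    if best_t is None:  # a single distinct value: fall back to the majority class
        majority = Counter(ys).most_common(1)[0][0]
        return lambda x: majority
    return lambda x: int(x <= best_t)


def bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample: n draws with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(train_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def predict(stumps, x):
    """Aggregate by majority vote (for regression, average the outputs instead)."""
    votes = Counter(stump(x) for stump in stumps)
    return votes.most_common(1)[0][0]


# Made-up data: lighter animals are cats (1), heavier ones are not (0).
weights = [6, 7, 8, 9, 10, 16, 17, 18, 19, 20]
is_cat = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

stumps = bagged_stumps(weights, is_cat)
print(predict(stumps, 8))   # 1 (cat)
print(predict(stumps, 18))  # 0 (not cat)
```

Each stump sees a slightly different sample, so the individual thresholds vary; the vote smooths out that variation.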

---

## Recap and Practical Considerations

### Summary of Key Points

- **Decision Trees:**  
  - Break the problem into a series of binary decisions.
  - Use a recursive algorithm to build a tree structure with root, decision, and leaf nodes.
  
- **Purity Measures:**  
  - **Entropy:** Quantifies impurity using the formula $H(p_1) = -p_1 \log_2(p_1) - (1-p_1) \log_2(1-p_1)$.
  - **Information Gain:** Measures how much a split reduces entropy.
  
- **Handling Features:**
  - **Categorical Features:** Use one-hot encoding for multi-valued features.
  - **Continuous Features:** Determine optimal split thresholds by evaluating candidate values.
  
- **Regression Trees:**  
  - Predict continuous values by minimizing the variance in the target variable.
  - Use variance reduction as the splitting criterion.

- **Recursive and Ensemble Methods:**  
  - Build trees recursively.
  - Improve performance using ensembles like bagging and boosting.

### Practical Tips

- **Parameter Tuning:**  
  Parameters such as maximum depth, minimum samples per node, or information gain threshold can be tuned using cross-validation.
  
- **Overfitting:**  
  Complex trees may overfit training data. Pruning strategies and ensemble methods help mitigate this risk.

- **Implementation:**  
  While understanding the theory is essential, many open-source libraries (like scikit-learn) implement these algorithms efficiently, allowing you to focus on model tuning rather than low-level implementation details.