# Exploring Decision Trees  

A decision tree is a supervised machine learning model used mainly for classification (sorting data into categories) and regression (predicting numeric values). It performs well when the relationships between features are complex, because it can model patterns that don't follow straight lines. Decision trees are also easy to interpret: each prediction can be traced through a sequence of simple rules, which makes the model easier for everyone involved to understand.

### Types of Decision Trees

Decision trees come in two main types, each designed to handle different kinds of target variables:

- **Categorical Variable Decision Tree:** This type handles target variables that are categories. The target can be binary, such as 'Yes' or 'No', or it can have many options to choose from. For example, it could help predict someone's preferred way of getting around, like 'Car', 'Bus', 'Train', or 'Walk'.

- **Continuous Variable Decision Tree:** This type is made for target variables that are on a continuous scale, like predicting income based on things like age and job. It's great when the outcome is something you can measure, not just choose from a list.

### Key Concepts in Decision Trees

- **Root Node**: Represents the entire dataset, serving as the starting point for division into smaller groups.
- **Splitting**: The process of dividing a node into smaller sub-nodes based on specific criteria.
- **Decision Node**: A node that further splits into additional nodes based on decisions made during the splitting process.
- **Leaf/Terminal Node**: Final nodes in the tree structure that do not split further.
- **Pruning**: The removal of sub-nodes that add little predictive value, which simplifies the tree and reduces overfitting.
- **Branch/Sub-Tree**: Segments of the tree that extend from the main structure, representing different paths or outcomes.
- **Parent and Child Node**: Nodes in the tree where the parent node splits into smaller nodes (children) based on certain conditions.

### Managing Numerical and Categorical Data in Decision Trees

Decision trees are versatile in handling both numerical and categorical data simultaneously. Here's how they manage each type:

- **Categorical Features**: The tree splits based on belonging to a particular class within the categorical feature.
- **Continuous Features**: Splits are determined by values above or below a threshold within the continuous feature.

The decision tree selects the best feature to split on at each step, aiming to reduce uncertainty. Whether a feature is categorical or continuous doesn't impact this decision-making process.
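As a minimal sketch of the two kinds of split described above, the helpers below partition rows either by membership in one category or by a numeric threshold. The function names and the toy rows are hypothetical and only serve as an illustration.

```python
def split_categorical(rows, feature, category):
    """Partition rows by membership in one category of a categorical feature."""
    left = [r for r in rows if r[feature] == category]
    right = [r for r in rows if r[feature] != category]
    return left, right


def split_continuous(rows, feature, threshold):
    """Partition rows by whether a continuous feature is at or below a threshold."""
    left = [r for r in rows if r[feature] <= threshold]
    right = [r for r in rows if r[feature] > threshold]
    return left, right


# Hypothetical toy rows, not taken from any dataset in this article.
rows = [
    {"state": "Florida", "age": 25},
    {"state": "New York", "age": 40},
    {"state": "Florida", "age": 52},
]

in_florida, elsewhere = split_categorical(rows, "state", "Florida")
younger, older = split_continuous(rows, "age", 40)
print(len(in_florida), len(elsewhere))  # 2 1
print(len(younger), len(older))         # 2 1
```

Either way the result is the same kind of object, two groups of rows, which is why the split-selection step can score both feature types with one criterion.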

*In real-world scenarios, categorical features are often converted into numerical format, for example with One-Hot Encoding, because some libraries (such as scikit-learn) expect numeric inputs.*
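One-Hot Encoding can be sketched in a few lines of plain Python. The `one_hot_encode` helper and the `transport` values are assumptions made for illustration; in practice a library encoder would normally be used instead.

```python
def one_hot_encode(values):
    """Return (categories, encoded_rows) for a list of categorical values."""
    categories = sorted(set(values))
    # One 0/1 column per category; exactly one column is 1 in each row.
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


# Hypothetical feature values.
transport = ["Car", "Bus", "Train", "Walk", "Bus"]
categories, encoded = one_hot_encode(transport)

print(categories)  # ['Bus', 'Car', 'Train', 'Walk']
for value, row in zip(transport, encoded):
    print(value, row)
```

After encoding, each category becomes its own 0/1 column, so a threshold split such as "column <= 0.5" plays the role of a membership test.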

### How to Choose the Set of Split Points

The choice of split points for a variable depends on whether it's numeric or categorical.

#### Numeric Predictor Variables
- **Unique Values**: If a numeric predictor has *n* distinct values, there are *n – 1* candidate split points, typically the midpoints between consecutive sorted values. Evaluating all of them can be impractical when *n* is large. A common alternative is to take split points at specific percentiles of the value distribution (e.g., every tenth percentile: 10%, 20%, 30%, etc.).

  **Example**: For a column of numeric values [20, 25, 30, 35, 50, 55, 70, 90] with 8 data points, the average of each pair of adjacent values is calculated, giving 7 split values [22.5, 27.5, 32.5, 42.5, 52.5, 62.5, 80.0]. Information gain (IG) is then computed for each split point, and the one that maximizes IG is selected.
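The candidate split points from the example can be reproduced with a short sketch. The helper names are hypothetical; `decile_splits` uses `statistics.quantiles` from the standard library as one possible way to get percentile-based cut points, which only pays off for columns with many distinct values.

```python
from statistics import quantiles


def midpoint_splits(values):
    """Candidate thresholds: midpoints between consecutive distinct sorted values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def decile_splits(values):
    """Coarser candidate set: the nine decile cut points of the values."""
    return quantiles(values, n=10)


column = [20, 25, 30, 35, 50, 55, 70, 90]

print(midpoint_splits(column))
# [22.5, 27.5, 32.5, 42.5, 52.5, 62.5, 80.0]
print(len(decile_splits(column)))  # 9
```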

#### Categorical Predictor Variables
- **Binary Splitting**: A split on a categorical predictor can create either one child node per class (*multiway split*) or only two child nodes (*binary split*). Binary splits are preferred because multiway splits can quickly break data into small subsets, leading to overfitting.

  - **Two-Class Predictor**: If a categorical predictor has two classes, there is only one possible split.
  
  - **Multiple-Class Predictor**: For predictors with more than two classes:
    - **Small Number of Classes**: Consider all possible splits into two child nodes.
    - **Large Number of Classes**: Order classes by their average output value and make a binary split into two groups of ordered classes (resulting in *k – 1* possible splits for *k* classes).

**Example**: For three classes like apple, banana, and orange, potential binary splits include:
  
  |        | child1 |     child2     |
  |:------:|:------:|:--------------:|
  | split1 |  apple | banana, orange |
  | split2 | banana |  apple, orange |
  | split3 | orange |  apple, banana |

For *k* classes, there are *2^(k-1)-1* possible splits, which can be computationally intensive for large *k*. It can also bias decision trees towards splitting on variables with many classes: with so many candidate splits, one of them is likely to show a large improvement by chance, which increases the risk of overfitting.
  
For four classes like apple, banana, orange, and grapes, potential binary splits include: 
  
| Split   | Child 1        | Child 2            |
|---------|----------------|--------------------|
| Split 1 | Grapes, Apple  | Banana, Orange     |
| Split 2 | Grapes, Banana | Apple, Orange      |
| Split 3 | Grapes, Orange | Apple, Banana      |
| Split 4 | Grapes         | Apple, Banana, Orange |
| Split 5 | Apple          | Grapes, Banana, Orange |
| Split 6 | Banana         | Grapes, Apple, Orange  |
| Split 7 | Orange         | Grapes, Apple, Banana  |
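The count of *2^(k-1)-1* splits can be checked by enumerating them. This is a minimal sketch with a hypothetical `binary_splits` helper; it may list the splits in a different order from the tables above.

```python
from itertools import combinations


def binary_splits(classes):
    """Yield every way to divide classes into two non-empty groups, each once."""
    classes = list(classes)
    first, rest = classes[0], classes[1:]
    # Keeping the first class in child 1 avoids counting mirrored splits twice.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            child1 = [first, *extra]
            child2 = [c for c in rest if c not in extra]
            yield child1, child2


for k_classes in (["apple", "banana", "orange"],
                  ["apple", "banana", "orange", "grapes"]):
    splits = list(binary_splits(k_classes))
    print(len(k_classes), "classes ->", len(splits), "splits")
# 3 classes -> 3 splits
# 4 classes -> 7 splits
```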

### Criteria Used for Enhancing Node Splits

The criterion used to evaluate and select a node split depends on whether the target variable is continuous or categorical.

#### Continuous Target Variable
- **Reduction in Sum of Squared Errors (SSE)**:
  When the outcome is numerical, improvement is gauged by the difference in the sum of squared errors (SSE) between the node and its child nodes post-split. The squared error for a node is computed as:
  
  $$ \sum_{i=1}^{n}{(y_i - c)}^2 $$
  
  Where:
  - $n$ represents the number of cases at the node.
  - $c$ denotes the average outcome of all cases at that node.
  - $y_i$ stands for the outcome value of the $i$-th case.
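As a sketch of the SSE criterion above, the helpers below compute the squared error of a node and the reduction achieved by a split. The outcome values are hypothetical and chosen so the arithmetic is easy to follow.

```python
def sse(values):
    """Sum of squared errors around the mean outcome of a node."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((y - mean) ** 2 for y in values)


def sse_reduction(parent, left, right):
    """SSE of the parent minus the combined SSE of its two children."""
    return sse(parent) - (sse(left) + sse(right))


# Hypothetical outcome values for the two children of a candidate split.
left = [10.0, 12.0, 11.0]
right = [30.0, 32.0, 31.0]
parent = left + right

print(sse(parent))                         # 604.0
print(sse(left), sse(right))               # 2.0 2.0
print(sse_reduction(parent, left, right))  # 600.0
```

The split with the largest reduction is preferred, since it leaves child nodes whose outcomes sit closest to their own means.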

#### Categorical Target Variable
- **Gini Impurity**:
  Gini impurity measures the probability that a randomly selected element would be wrongly classified if it were labelled at random according to the class distribution of the node. It's used in creating classification trees, which divide data into groups based on certain criteria. A value of 0 means all elements belong to the same class. The maximum is $1 - 1/k$ for $k$ evenly distributed classes, which is 0.5 for a two-class problem.

  $$ \text{Gini impurity} = \sum_{i=1}^{k}{p_i(1-p_i)} = 1 - \sum_{i=1}^{k}{p_i^2} $$

  Where:
  - $k$ is the number of classes.
  - $p_i$ is the proportion of cases belonging to class $i$.

- **Information Gain**:
  $$ \text{Information Gain} = \text{Entropy}(\text{Parent}) - \sum_{j}{\frac{n_j}{n}\,\text{Entropy}(\text{Child}_j)} $$

  Where $n_j$ is the number of cases in child node $j$, $n$ is the number of cases in the parent node, and:

  $$ \text{Entropy} = -\sum_{i=1}^{k}{p_i \log_2(p_i)} $$

  with $p_i$ the proportion of cases belonging to class $i$.
    
- **Chi-Square**:
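  Another method for categorical targets is Chi-Square, which assesses the statistical significance of the difference between the class distribution of the parent node and that of each child node. The Chi-Square value for a class in a child node is calculated as:
  
  $$ \text{Chi-Square} = \sqrt{\frac{(\text{Actual}-\text{Expected})^2}{\text{Expected}}} $$

  The values for every class in every child node are summed to score the split, and a higher total indicates a more significant split. Note that the standard Pearson statistic sums $(\text{Actual}-\text{Expected})^2/\text{Expected}$ without the square root; the square-root form is the one used in the worked example in this article.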
  
  Where:
  - **Expected**: Expected value for a class in a child node based on parent node distribution.
  - **Actual**: Actual value for a class in a child node.

The selected criteria aim to minimize impurity or error after the split, resulting in more homogeneous child nodes. Each criterion evaluates the effectiveness of the split in reducing uncertainty or error, guiding the decision tree towards optimal node divisions.
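To get a feel for the scale of the two impurity measures, the sketch below evaluates them on evenly distributed classes and on a pure node. The function names are hypothetical; the formulas are the ones given above.

```python
from math import log2


def gini(proportions):
    """Gini impurity: sum of p * (1 - p) over the class proportions."""
    return sum(p * (1 - p) for p in proportions)


def entropy(proportions):
    """Entropy in bits; classes with p == 0 contribute nothing (0 * log2(0) = 0)."""
    return sum(p * log2(1 / p) for p in proportions if p > 0)


for k in (2, 3, 4):
    uniform = [1 / k] * k
    print(k, round(gini(uniform), 3), round(entropy(uniform), 3))
# 2 0.5 1.0
# 3 0.667 1.585
# 4 0.75 2.0

# A pure node has zero impurity under both measures.
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))  # 0.0 0.0
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, so they usually rank candidate splits in a similar order.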

### Illustration of Node Splitting Considering SSE or Variance for a Decision Tree

Consider a dataset of 50 startups, where the goal is to predict profit from several features, both categorical and continuous.

#### Dataset Overview

The dataset contains information about startups, including the amount spent on Research and Development (R&D), Administration, Marketing Spend, State of operation, and the resulting Profit.

#### Sample Data

| R&D Spend | Administration | Marketing Spend | State      | Profit   |
|-----------|----------------|-----------------|------------|----------|
| 165349    | 136897         | 471784          | New York   | 192261   |
| 162597    | 151377         | 443898          | California | 191792   |
| 153442    | 101145         | 407934          | Florida    | 191050   |
| ...       | ...            | ...             | ...        | ...      |

<center><img src="./imgs/split.png"/></center>

### Illustration of Node Splitting Considering Information Gain for a Decision Tree

Consider an experiment with two predictors *variable1* and *variable2* and a target variable with two outcomes: *stop* and *continue*.

| variable1 | variable2 | outcome   |
|-----------|-----------|-----------|
| 3         | 5         | stop      |
| 7         | 6         | continue  |
| 3         | 3         | stop      |
| 4         | 8         | continue  |
| 3         | 9         | continue  |
| 6         | 5         | stop      |
| 5         | 8         | continue  |
| 6         | 4         | continue  |

Let's calculate the initial entropy of the root node before any splitting:

$$H = -\left(\frac{3}{8}\right)\log_2\left(\frac{3}{8}\right) - \left(\frac{5}{8}\right)\log_2\left(\frac{5}{8}\right) = 0.954$$

Now we have two options: split on *variable1* or split on *variable2*. To decide which variable to split on, we calculate the entropy after splitting in both cases.

**Splitting with *variable1* at split point 4 (close to the median, 4.5):**
- $p_{(>4)} = \frac{4}{8}$
- $H_{(>4)} = -\left(\frac{1}{4}\right)\log_2\left(\frac{1}{4}\right) - \left(\frac{3}{4}\right)\log_2\left(\frac{3}{4}\right) = 0.81$
  | variable1 | outcome   |
  |-----------|-----------|
  | 7         | continue  |
  | 6         | stop      |
  | 5         | continue  |
  | 6         | continue  |
- $p_{(\leq 4)} = \frac{4}{8}$
- $H_{(\leq 4)} = -\left(\frac{2}{4}\right)\log_2\left(\frac{2}{4}\right) - \left(\frac{2}{4}\right)\log_2\left(\frac{2}{4}\right) = 1.0$
  | variable1 | outcome   |
  |-----------|-----------|
  | 3         | stop      |
  | 3         | stop      |
  | 4         | continue  |
  | 3         | continue  |

Entropy after splitting by *variable1* (the size-weighted average of the child entropies):
$$H(\text{variable1}) = p_{(>4)}H_{(>4)} + p_{(\leq 4)}H_{(\leq 4)} = \frac{4}{8}(0.811) + \frac{4}{8}(1.0) \approx 0.906$$

**Splitting with *variable2* at split point 6 (average or 50th percentile):**
- $p_{(>6)} = \frac{3}{8}$
- $H_{(>6)} = -\left(\frac{3}{3}\right)\log_2\left(\frac{3}{3}\right) - \left(\frac{0}{3}\right)\log_2\left(\frac{0}{3}\right) = 0$ (using the convention $0 \log_2 0 = 0$)
  | variable2 | outcome   |
  |-----------|-----------|
  | 8         | continue  |
  | 9         | continue  |
  | 8         | continue  |
- $p_{(\leq 6)} = \frac{5}{8}$
- $H_{(\leq 6)} = -\left(\frac{2}{5}\right)\log_2\left(\frac{2}{5}\right) - \left(\frac{3}{5}\right)\log_2\left(\frac{3}{5}\right) = 0.971$
  | variable2 | outcome   |
  |-----------|-----------|
  | 5         | stop      |
  | 6         | continue  |
  | 3         | stop      |
  | 5         | stop      |
  | 4         | continue  |

Entropy after splitting by *variable2* (the size-weighted average of the child entropies):
$$H(\text{variable2}) = p_{(>6)}H_{(>6)} + p_{(\leq 6)}H_{(\leq 6)} = \frac{3}{8}(0) + \frac{5}{8}(0.971) \approx 0.607$$

Calculate information gain after each split:
$$IG(1) = H - H(\text{variable1}) = 0.954 - 0.906 = 0.048$$
$$IG(2) = H - H(\text{variable2}) = 0.954 - 0.607 = 0.347$$

We observe that if we split the node by considering *variable2*, we achieve the highest information gain. Therefore, we select *variable2* for the split and continue this process for impure nodes until we obtain pure leaf nodes for the decision tree. The following diagram shows the decision tree in this case.

<center><img src="./imgs/decisiontree.png"/></center>
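The information gain of both candidate splits can be recomputed with a short sketch. The helper names are hypothetical; each gain is the parent entropy minus the size-weighted entropy of the two children. Values are computed at full precision, so the last digit can differ from figures rounded by hand.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())


def information_gain(rows, feature_index, threshold):
    """Parent entropy minus the size-weighted entropy of the two children."""
    labels = [r[-1] for r in rows]
    left = [r[-1] for r in rows if r[feature_index] <= threshold]
    right = [r[-1] for r in rows if r[feature_index] > threshold]
    weighted = sum(len(part) / len(rows) * entropy(part)
                   for part in (left, right) if part)
    return entropy(labels) - weighted


# (variable1, variable2, outcome) rows from the table above.
rows = [
    (3, 5, "stop"), (7, 6, "continue"), (3, 3, "stop"), (4, 8, "continue"),
    (3, 9, "continue"), (6, 5, "stop"), (5, 8, "continue"), (6, 4, "continue"),
]

print(round(entropy([r[-1] for r in rows]), 3))  # 0.954
print(round(information_gain(rows, 0, 4), 3))    # 0.049
print(round(information_gain(rows, 1, 6), 3))    # 0.348
```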

### Illustration of Node Splitting Considering Gini Impurity for a Decision Tree

Consider the same experiment data from above.

| variable1 | variable2 | outcome   |
|-----------|-----------|-----------|
| 3         | 5         | stop      |
| 7         | 6         | continue  |
| 3         | 3         | stop      |
| 4         | 8         | continue  |
| 3         | 9         | continue  |
| 6         | 5         | stop      |
| 5         | 8         | continue  |
| 6         | 4         | continue  |

We have two variables for splitting. We need to calculate the weighted Gini impurity for both *variable1* and *variable2*. Whichever variable gives the lowest impurity will be selected for the split.

**Splitting with *variable1* at split point 4 (close to the median, 4.5):**
- Gini Impurity (>4) = 1 - $\left(\left(\frac{3}{4}\right)^2 + \left(\frac{1}{4}\right)^2\right) = 0.375$

  | variable1 | outcome  |
  |-----------|----------|
  | 7         | continue |
  | 6         | stop     |
  | 5         | continue |
  | 6         | continue |

- Gini Impurity (<=4) = 1 - $\left(\left(\frac{2}{4}\right)^2 + \left(\frac{2}{4}\right)^2\right) = 0.5$

  | variable1 | outcome  |
  |-----------|----------|
  | 3         | stop     |
  | 3         | stop     |
  | 4         | continue |
  | 3         | continue |

- Weighted Gini Impurity (variable1) = $\frac{4}{8} \times 0.375 + \frac{4}{8} \times 0.5 = 0.4375$

**Splitting with *variable2* at split point 6 (average value):**
- Gini Impurity (>6) = 1 - $\left(\left(\frac{3}{3}\right)^2 + \left(\frac{0}{3}\right)^2\right) = 0$

  | variable2 | outcome  |
  |-----------|----------|
  | 8         | continue |
  | 9         | continue |
  | 8         | continue |

- Gini Impurity (<=6) = 1 - $\left(\left(\frac{2}{5}\right)^2 + \left(\frac{3}{5}\right)^2\right) = 0.48$

  | variable2 | outcome  |
  |-----------|----------|
  | 5         | stop     |
  | 6         | continue |
  | 3         | stop     |
  | 5         | stop     |
  | 4         | continue |

- Weighted Gini Impurity (variable2) = $\frac{3}{8} \times 0 + \frac{5}{8} \times 0.48 = 0.3$

From the above calculations, we see that the weighted Gini impurity is less for *variable2*. Therefore, we choose *variable2* for splitting the data. This process is continued recursively to build the entire decision tree.
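The weighted Gini impurity of both splits can be reproduced with a minimal sketch. The helper names are hypothetical; `gini` uses the $1 - \sum p_i^2$ form from the calculations above.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared proportions."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_gini(rows, feature_index, threshold):
    """Size-weighted Gini impurity of the two children of a threshold split."""
    left = [r[-1] for r in rows if r[feature_index] <= threshold]
    right = [r[-1] for r in rows if r[feature_index] > threshold]
    return sum(len(part) / len(rows) * gini(part)
               for part in (left, right) if part)


# (variable1, variable2, outcome) rows from the table above.
rows = [
    (3, 5, "stop"), (7, 6, "continue"), (3, 3, "stop"), (4, 8, "continue"),
    (3, 9, "continue"), (6, 5, "stop"), (5, 8, "continue"), (6, 4, "continue"),
]

print(round(weighted_gini(rows, 0, 4), 4))  # 0.4375
print(round(weighted_gini(rows, 1, 6), 4))  # 0.3
```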

**Note**: For the same data, Gini impurity and entropy can occasionally rank candidate splits differently, so different variables can be selected as the root node. In this example both criteria select *variable2*. In practice the two measures usually produce similar trees, and Gini impurity is slightly cheaper to compute because it avoids the logarithm.

### Illustration of Node Splitting Considering Chi-Square for a Decision Tree

Let's consider an example with categorical predictors and a categorical target for ease of understanding:

| Performance | Class | Outcome       |
|-------------|-------|---------------|
| Average     | IX    | Play Cricket  |
| Below Avg   | X     | Play Cricket  |
| Average     | IX    | Play Cricket  |
| Below Avg   | X     | Play Cricket  |
| Below Avg   | X     | Play Cricket  |
| Average     | IX    | Play Cricket  |
| Average     | X     | Doesn't Play  |
| Average     | IX    | Doesn't Play  |
| Below Avg   | X     | Doesn't Play  |
| Average     | IX    | Doesn't Play  |
| Average     | X     | Doesn't Play  |
| Below Avg   | IX    | Doesn't Play  |

Distribution of target before splitting (Parent node):

- Plays Cricket = $6/12 = 0.5$
- Doesn't play Cricket = $6/12 = 0.5$

Let's try to split the node by considering the Performance variable and calculate the Chi-Square.

**Splitting by Performance:**
- Performance: Average
  - Expected Cricket Players = $7 * (0.5) = 3.5$
  - Actual Cricket Players = $3$
  - Chi-Square Players = $0.267$
  
  - Expected Non-Players = $7 * (0.5) = 3.5$
  - Actual Non-Players = $4$
  - Chi-Square Non-Players = $0.267$

- Performance: Below Avg
  - Expected Cricket Players = $5 * (0.5) = 2.5$
  - Actual Cricket Players = $3$
  - Chi-Square Players = $0.316$
  
  - Expected Non-Players = $5 * (0.5) = 2.5$
  - Actual Non-Players = $2$
  - Chi-Square Non-Players = $0.316$

Total Chi-Square (Performance) = $0.316 + 0.316 + 0.267 + 0.267 = 1.166$

**Splitting by Class:**
- Class: IX
  - Expected Cricket Players = $6 * (0.5) = 3$
  - Actual Cricket Players = $3$
  - Chi-Square Players = $0$
  
  - Expected Non-Players = $6 * (0.5) = 3$
  - Actual Non-Players = $3$
  - Chi-Square Non-Players = $0$

- Class: X
  - Expected Cricket Players = $6 * (0.5) = 3$
  - Actual Cricket Players = $3$
  - Chi-Square Players = $0$
  
  - Expected Non-Players = $6 * (0.5) = 3$
  - Actual Non-Players = $3$
  - Chi-Square Non-Players = $0$

Total Chi-Square (Class) = $0 + 0 + 0 + 0 = 0$

We see that the total Chi-Square is greater when splitting by Performance than by Class, so we choose Performance to split the root node. This process is repeated recursively until we obtain pure leaf nodes.
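The totals above can be recomputed with a short sketch that applies the square-root form of Chi-Square used in this article. The helper name and the shortened outcome labels are hypothetical. Values are computed at full precision, so the Performance total prints as 1.167 rather than the hand-rounded 1.166.

```python
from math import sqrt


def chi_square_for_split(rows, feature_index):
    """Sum sqrt((actual - expected)^2 / expected) over every child and class."""
    outcomes = [r[-1] for r in rows]
    classes = sorted(set(outcomes))
    parent_share = {c: outcomes.count(c) / len(rows) for c in classes}
    total = 0.0
    for value in sorted({r[feature_index] for r in rows}):
        child = [r[-1] for r in rows if r[feature_index] == value]
        for c in classes:
            expected = len(child) * parent_share[c]
            actual = child.count(c)
            total += sqrt((actual - expected) ** 2 / expected)
    return total


# (Performance, Class, Outcome) rows from the table above.
rows = [
    ("Average", "IX", "Play"), ("Below Avg", "X", "Play"),
    ("Average", "IX", "Play"), ("Below Avg", "X", "Play"),
    ("Below Avg", "X", "Play"), ("Average", "IX", "Play"),
    ("Average", "X", "No"), ("Average", "IX", "No"),
    ("Below Avg", "X", "No"), ("Average", "IX", "No"),
    ("Average", "X", "No"), ("Below Avg", "IX", "No"),
]

print(round(chi_square_for_split(rows, 0), 3))  # 1.167 (Performance)
print(round(chi_square_for_split(rows, 1), 3))  # 0.0   (Class)
```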

### Advantages of Decision Tree

- Can be used for both classification and regression problems.
- Handles both continuous and categorical variables.
- No feature scaling required.
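- Captures non-linear relationships between features and the target without extra feature engineering.
- Some implementations handle missing values automatically, and splits are largely insensitive to outliers in the predictor variables.
- Shorter training period compared to Random Forest, because only a single tree is built.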

### Disadvantages of Decision Tree

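- Prone to overfitting, especially when the tree is grown deep, which leads to poor predictions on unseen data.
- High variance: small changes in the training data can produce a very different tree.
- Adding new data may require re-generation of the entire tree.
- On large datasets an unconstrained tree can grow very large, which increases both training cost and the risk of overfitting.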

### Popular Algorithms

- ID3 (Iterative Dichotomiser 3): Uses Information Gain.
- C4.5: Uses Gain Ratio.
- CART (Classification and Regression Trees): Uses Gini Index for classification and reduction in squared error for regression.
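Gain Ratio is not defined elsewhere in this article, so the sketch below assumes the common textbook form: information gain divided by the split information, where split information is the entropy of the child-node sizes. The helper names are hypothetical, and the labels are those of the *variable2* split at 6 from the earlier example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())


def gain_ratio(parent, children):
    """Information gain divided by split information (assumed textbook form)."""
    total = len(parent)
    weights = [len(child) / total for child in children]
    gain = entropy(parent) - sum(w * entropy(c) for w, c in zip(weights, children))
    # Split information penalises splits that produce many small children.
    split_info = sum(w * log2(1 / w) for w in weights if w > 0)
    return gain / split_info if split_info > 0 else 0.0


parent = ["stop"] * 3 + ["continue"] * 5
children = [["stop"] * 3 + ["continue"] * 2,  # variable2 <= 6
            ["continue"] * 3]                 # variable2 > 6

print(round(gain_ratio(parent, children), 3))  # 0.364
```

Dividing by split information counteracts the tendency of plain information gain to favour predictors with many classes.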

### Handling Missing Values

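- Ignore cases with missing values when evaluating a split.
- Treat missing values as another category or nominal feature.
- Use surrogate splits (as in CART), or send the case down every branch with a reduced weight proportional to the share of data in that branch (as in C4.5).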

### Robustness to Outliers

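- Regression trees are affected by outliers in the target variable, because squared error and leaf averages are sensitive to extreme values. Classification trees are less affected.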

### Pruning a Tree

**Post-Pruning:**
- Minimum error: Prune to the point with minimum cross-validated error.
- Smallest tree: Prune slightly beyond minimum error for a smaller tree.

**Pre-Pruning (Early Stopping):**
- Stops tree-building early if error doesn't decrease significantly.

Pre-pruning and post-pruning can be used together or separately, trading off accuracy against interpretability.
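Pre-pruning can be sketched as two stopping rules inside a recursive tree builder. This is a minimal illustration, not a production implementation: the function names and the `max_depth` and `min_decrease` parameters are assumptions. It uses Gini impurity and searches every midpoint threshold, so it may choose a different split point from the hand-worked examples earlier.

```python
from collections import Counter


def gini(labels):
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_split(rows):
    """Return (impurity_decrease, feature_index, threshold), or None if no split exists."""
    parent = gini([r[-1] for r in rows])
    best = None
    for f in range(len(rows[0]) - 1):
        values = sorted({r[f] for r in rows})
        for a, b in zip(values, values[1:]):
            t = (a + b) / 2
            left = [r[-1] for r in rows if r[f] <= t]
            right = [r[-1] for r in rows if r[f] > t]
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or parent - weighted > best[0]:
                best = (parent - weighted, f, t)
    return best


def build_tree(rows, depth=0, max_depth=2, min_decrease=0.01):
    majority = Counter(r[-1] for r in rows).most_common(1)[0][0]
    # Pre-pruning rule 1: stop at the depth limit.
    if depth >= max_depth:
        return majority
    split = best_split(rows)
    # Pre-pruning rule 2: stop when the best split barely reduces impurity.
    if split is None or split[0] < min_decrease:
        return majority
    _, f, t = split
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r in rows if r[f] <= t], depth + 1, max_depth, min_decrease),
        "right": build_tree([r for r in rows if r[f] > t], depth + 1, max_depth, min_decrease),
    }


def predict(tree, row):
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# (variable1, variable2, outcome) rows from the earlier example.
rows = [
    (3, 5, "stop"), (7, 6, "continue"), (3, 3, "stop"), (4, 8, "continue"),
    (3, 9, "continue"), (6, 5, "stop"), (5, 8, "continue"), (6, 4, "continue"),
]

tree = build_tree(rows, max_depth=2)
print(tree["feature"], tree["threshold"])  # 1 5.5
print(predict(tree, (4, 8)))               # continue
```

Lowering `max_depth` or raising `min_decrease` yields a smaller, more interpretable tree at the cost of a looser fit to the training data.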
