# **Decision Tree**

- **Definition:** A decision tree is a supervised machine learning algorithm used for classification and regression tasks. It organizes decisions into a tree-like structure in which each internal node tests a feature, each branch corresponds to an outcome of that test, and each leaf node holds a class label (in classification) or a numerical value (in regression).

- **Example:**
  - Suppose we have a dataset of weather conditions and corresponding decisions to play tennis:
    - **Outlook**: Sunny, Overcast, Rainy
    - **Temperature**: Hot, Mild, Cool
    - **Humidity**: High, Normal
    - **Wind**: Weak, Strong
    - **Decision**: Play, Don't Play

| Outlook | Temperature | Humidity | Wind | Decision |
|---------|-------------|----------|------|----------|
| Sunny   | Hot         | High     | Weak | Don't Play  |
| Sunny   | Hot         | High     | Strong | Don't Play  |
| Overcast| Hot         | High     | Weak | Play  |
| Rainy   | Mild        | High     | Weak | Play  |
| Rainy   | Cool        | Normal   | Weak | Play  |
| Rainy   | Cool        | Normal   | Strong | Don't Play |
| Overcast| Cool        | Normal   | Strong | Play  |
| Sunny   | Mild        | High     | Weak | Don't Play |
| Sunny   | Cool        | Normal   | Weak | Play  |
| Rainy   | Mild        | Normal   | Weak | Play  |
| Sunny   | Mild        | Normal   | Strong | Play  |
| Overcast| Mild        | High     | Strong | Play  |
| Overcast| Hot         | Normal   | Weak | Play  |
| Rainy   | Mild        | High     | Strong | Don't Play |

- **Mathematical Formulas:**

  1. **Entropy (H(S))**:
     - Entropy measures the impurity or randomness in a dataset.
     - Formula: $ H(S) = -\sum_{i=1}^{m} p_i \log_2(p_i) $, where $ m $ is the number of classes and $ p_i $ is the proportion of instances in $ S $ that belong to class $ i $.
     - Example: In the table above, 9 of the 14 instances are Play and 5 are Don't Play.
     - Calculation: $ H(S) = -\left(\frac{9}{14} \log_2 \frac{9}{14} + \frac{5}{14} \log_2 \frac{5}{14}\right) \approx 0.940 $

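  2. **Information Gain (IG(S, A))**:
     - Information Gain measures how much splitting on attribute $ A $ reduces entropy, that is, how effective the attribute is at classifying the data.
     - Formula: $ IG(S, A) = H(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|} H(S_i) $, where $ S_i $ is the subset of $ S $ in which $ A $ takes its $ i $-th value and $ n $ is the number of distinct values of $ A $.
     - Example: To decide the splitting attribute at the root node, calculate the information gain for each attribute and pick the largest. For the table above, $ IG(S, \text{Outlook}) \approx 0.247 $, $ IG(S, \text{Humidity}) \approx 0.152 $, $ IG(S, \text{Wind}) \approx 0.048 $, and $ IG(S, \text{Temperature}) \approx 0.029 $, so Outlook becomes the root.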

  3. **Decision Rule**:
     - Once the tree is constructed, decision rules at each node determine the traversal path.
     - Example: If Outlook is Sunny, go left; if Overcast, go middle; if Rainy, go right.

  4. **Pruning (Optional)**:
     - Pruning removes branches that do not significantly improve predictions on unseen data, which reduces overfitting.
     - Example: Replace a subtree with a leaf when doing so has minimal impact on overall accuracy.
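
The entropy and information-gain formulas, and the way decision rules route an instance to a leaf, can be sketched in plain Python on the play-tennis table above. This is a minimal ID3-style sketch, not a reference implementation: the function names (`entropy`, `partition`, `information_gain`, `build_tree`, `classify`) and the tuple-based tree representation are assumptions made for illustration, and pruning is left out.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]

# Play-tennis table from above; the last field of each row is the Decision.
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "Don't Play"),
    ("Sunny", "Hot", "High", "Strong", "Don't Play"),
    ("Overcast", "Hot", "High", "Weak", "Play"),
    ("Rainy", "Mild", "High", "Weak", "Play"),
    ("Rainy", "Cool", "Normal", "Weak", "Play"),
    ("Rainy", "Cool", "Normal", "Strong", "Don't Play"),
    ("Overcast", "Cool", "Normal", "Strong", "Play"),
    ("Sunny", "Mild", "High", "Weak", "Don't Play"),
    ("Sunny", "Cool", "Normal", "Weak", "Play"),
    ("Rainy", "Mild", "Normal", "Weak", "Play"),
    ("Sunny", "Mild", "Normal", "Strong", "Play"),
    ("Overcast", "Mild", "High", "Strong", "Play"),
    ("Overcast", "Hot", "Normal", "Weak", "Play"),
    ("Rainy", "Mild", "High", "Strong", "Don't Play"),
]


def entropy(labels):
    """H(S) = -sum(p_i * log2(p_i)) over the classes present in labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def partition(rows, attr_index):
    """Group rows by their value of the attribute at attr_index."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr_index], []).append(row)
    return groups


def information_gain(rows, attr_index):
    """IG(S, A) = H(S) - sum(|S_i| / |S| * H(S_i)) over the values of A."""
    labels = [row[-1] for row in rows]
    remainder = sum(
        len(subset) / len(rows) * entropy([r[-1] for r in subset])
        for subset in partition(rows, attr_index).values()
    )
    return entropy(labels) - remainder


def build_tree(rows, attr_indices):
    """Return a class label (leaf) or a pair (attribute index, {value: subtree})."""
    labels = [row[-1] for row in rows]
    if len(set(labels)) == 1 or not attr_indices:
        return Counter(labels).most_common(1)[0][0]  # pure node, or majority vote
    best = max(attr_indices, key=lambda i: information_gain(rows, i))
    remaining = [i for i in attr_indices if i != best]
    return best, {
        value: build_tree(subset, remaining)
        for value, subset in partition(rows, best).items()
    }


def classify(tree, row):
    """Follow the decision rule at each node until a leaf is reached.

    Raises KeyError for an attribute value that never appeared in training.
    """
    while isinstance(tree, tuple):
        attr_index, branches = tree
        tree = branches[row[attr_index]]
    return tree


gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
tree = build_tree(ROWS, list(range(len(ATTRIBUTES))))

print(f"H(S) = {entropy([r[-1] for r in ROWS]):.3f}")
for name, gain in gains.items():
    print(f"IG(S, {name}) = {gain:.3f}")
print("Root attribute:", ATTRIBUTES[tree[0]])
print(classify(tree, ("Sunny", "Cool", "High", "Strong")))
```

Representing a leaf as a bare label and an internal node as a tuple keeps the traversal loop short; a real implementation would more likely use a node class and handle unseen attribute values explicitly.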

## Example 1

### Taken from https://vtupluse.com/

<img src="images/ID3-ex1-1.png" width="100%">

<img src="images/ID3-ex1-2.png" width="100%">

<img src="images/ID3-ex1-3.png" width="100%">

<img src="images/ID3-ex1-4.png" width="100%">

<img src="images/ID3-ex1-5.png" width="100%">

<img src="images/ID3-ex1-6.jpg" width="100%">

<img src="images/ID3-ex1-7.png" width="100%">

<img src="images/ID3-ex1-8.png" width="100%">

<img src="images/ID3-ex1-9.png" width="100%">

<img src="images/ID3-ex1-10.jpg" width="100%">

<img src="images/ID3-ex1-11.png" width="100%">

<img src="images/ID3-ex1-12.png" width="100%">

<img src="images/ID3-ex1-13.png" width="100%">

<img src="images/ID3-ex1-14.png" width="100%">

## Example 2

### Taken from https://vtupluse.com/

<img src="images/ID3-ex2-1.png" width="100%">

<img src="images/ID3-ex2-2.png" width="100%">

<img src="images/ID3-ex2-4.png" width="100%">

<img src="images/ID3-ex2-5.png" width="100%">

<img src="images/ID3-ex2-6.png" width="100%">

<img src="images/ID3-ex2-7.png" width="100%">

<img src="images/ID3-ex2-8.png" width="100%">

<img src="images/ID3-ex2-9.png" width="100%">

<img src="images/ID3-ex2-10.png" width="100%">

<img src="images/ID3-ex2-11.png" width="100%">

<img src="images/ID3-ex2-12.png" width="100%">

<img src="images/ID3-ex2-13.png" width="100%">

<img src="images/ID3-ex2-14.png" width="100%">

<img src="images/ID3-ex2-15.png" width="100%">

<img src="images/ID3-ex2-16.png" width="100%">

<img src="images/ID3-ex2-18.png" width="100%">

<img src="images/ID3-ex2-19.png" width="100%">

# **Limitations of Information Gain**

   - One limitation of information gain is its bias towards features with a large number of unique values or categories.
   - Splitting on such a feature produces many small subsets that are often nearly pure, so the weighted entropy after the split is low and the information gain is high, even when the feature does not generalize to new data.

## Example

Let's consider a simple example to illustrate one of the limitations of information gain: its bias towards features with many values.

Suppose we have a dataset representing whether customers purchased a product based on two features: Age (Young, Middle-aged, Senior) and Zip Code (Zip1, Zip2, ..., Zip1000). The target variable is Purchase (Yes/No). The table below shows 10 customers, each with a different zip code.

| Customer | Age         | Zip Code | Purchase |
|----------|-------------|----------|----------|
| 1        | Young       | Zip1     | Yes      |
| 2        | Young       | Zip2     | Yes      |
| 3        | Middle-aged | Zip3     | Yes      |
| 4        | Senior      | Zip4     | Yes      |
| 5        | Senior      | Zip5     | No       |
| 6        | Senior      | Zip6     | Yes      |
| 7        | Middle-aged | Zip7     | No       |
| 8        | Young       | Zip8     | Yes      |
| 9        | Young       | Zip9     | No       |
| 10       | Senior      | Zip10    | No       |

Zip Code can take up to 1000 unique values (Zip1, Zip2, ..., Zip1000), and in the table above every customer has a different one, while the Age feature has only 3 unique values (Young, Middle-aged, Senior).

Now, let's calculate the information gain for both features and observe the bias towards the Zip Code feature:

### Information Gain for Age:

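Entropy before split (H_before), with 6 Yes and 4 No:
$ H_{\text{before}} = -\left(\frac{4}{10} \log_2 \frac{4}{10} + \frac{6}{10} \log_2 \frac{6}{10}\right) \approx 0.971 $

Entropy after split (H_after), weighting each age group by its share of the 10 customers (4 Young, 2 Middle-aged, 4 Senior):
$ H_{\text{after}} = \frac{4}{10} \times \text{Entropy}(\text{Young}) + \frac{2}{10} \times \text{Entropy}(\text{Middle-aged}) + \frac{4}{10} \times \text{Entropy}(\text{Senior}) \approx 0.4 \times 0.811 + 0.2 \times 1 + 0.4 \times 1 \approx 0.925 $

Information gain (IG):
$ IG_{\text{Age}} = H_{\text{before}} - H_{\text{after}} \approx 0.971 - 0.925 = 0.046 $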

### Information Gain for Zip Code:

Entropy before split (H_before):
$ H_{\text{before}} = -\left(\frac{4}{10} \log_2 \frac{4}{10} + \frac{6}{10} \log_2 \frac{6}{10}\right) \approx 0.971 $

Entropy after split (H_after), where each of the 10 zip codes in the table covers exactly one customer and is therefore pure (entropy 0):
$ H_{\text{after}} = \frac{1}{10} \times \text{Entropy}(\text{Zip1}) + \frac{1}{10} \times \text{Entropy}(\text{Zip2}) + \ldots + \frac{1}{10} \times \text{Entropy}(\text{Zip10}) = 0 $

Information gain (IG):
$ IG_{\text{Zip Code}} = H_{\text{before}} - H_{\text{after}} \approx 0.971 - 0 = 0.971 $

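For the ten rows shown, the information gain for Zip Code is about 0.971, the maximum possible here, because every zip code isolates a single customer, whereas Age yields only about 0.046. Zip Code wins even though a split on it says nothing about a customer from a zip code not seen before, while Age may be the more informative feature in practice. This demonstrates the bias of information gain towards features with many values.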
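
The comparison can be checked numerically. The snippet below is a small sketch that recomputes both gains from the ten-customer table; the helper names `entropy` and `information_gain` and the tuple layout of `CUSTOMERS` are assumptions made for illustration.

```python
from collections import Counter
from math import log2

# Ten-customer table from above: (Age, Zip Code, Purchase)
CUSTOMERS = [
    ("Young", "Zip1", "Yes"),
    ("Young", "Zip2", "Yes"),
    ("Middle-aged", "Zip3", "Yes"),
    ("Senior", "Zip4", "Yes"),
    ("Senior", "Zip5", "No"),
    ("Senior", "Zip6", "Yes"),
    ("Middle-aged", "Zip7", "No"),
    ("Young", "Zip8", "Yes"),
    ("Young", "Zip9", "No"),
    ("Senior", "Zip10", "No"),
]


def entropy(labels):
    """H(S) = -sum(p_i * log2(p_i)) over the classes present in labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr_index):
    """Entropy of the whole set minus the size-weighted entropy of each subset."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attr_index], []).append(row[-1])
    h_after = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy([row[-1] for row in rows]) - h_after


ig_age = information_gain(CUSTOMERS, 0)
ig_zip = information_gain(CUSTOMERS, 1)

print(f"IG(Age)      = {ig_age:.3f}")
print(f"IG(Zip Code) = {ig_zip:.3f}")
```

Because every zip code subset contains a single row, its entropy is zero and the gain equals the entropy of the whole table, which is the largest value any feature could reach on this data.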

# **Alternative to Entropy: Gini Index**

- **Definition**: Gini Index is another impurity measure used in decision tree algorithms, particularly for binary splits. It measures the probability of misclassifying a randomly chosen element if it were randomly labeled according to the distribution of labels in the node.

- **Mathematical Formulas**:

  1. **Gini Index (Gini(S))**:
     - Gini index for a dataset S with m classes:
     $ Gini(S) = 1 - \sum_{i=1}^{m} p_i^2 $
     where $ p_i $ is the proportion of instances belonging to class $ i $ in S.
     
  2. **Gini Index for Splitting (Gini\_split(A))**:
     - Gini index for splitting a dataset S based on attribute A:
     $ Gini\_split(A) = \sum_{j} \frac{|S_j|}{|S|} Gini(S_j) $
     where $ S_j $ is the subset of S after splitting on attribute A.

- **Example**:

  - Suppose we have a binary classification problem with two classes, Yes and No.
  - Consider a node with 10 instances, where 7 instances belong to class Yes and 3 instances belong to class No.
  - Calculation of Gini Index:
    $ Gini(S) = 1 - \left(\left(\frac{7}{10}\right)^2 + \left(\frac{3}{10}\right)^2\right) = 1 - \left(\frac{49}{100} + \frac{9}{100}\right) = 1 - \frac{58}{100} = \frac{42}{100} = 0.42 $

- **Binary Splits**:
  - Gini Index is commonly paired with binary splits, where a test on an attribute divides the dataset into two subsets.
  - At each node, the algorithm calculates the weighted Gini Index of the resulting subsets for each possible split and chooses the split with the lowest value.

- **Comparison with Entropy**:
  - Both Gini Index and Entropy are impurity measures used in decision trees.
  - Gini Index tends to be faster to compute than entropy because it doesn't involve logarithmic calculations.
  - In practice, both measures often lead to similar results, but there may be slight differences in performance depending on the dataset.
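
The Gini formulas and the choose-the-lowest rule can be sketched in a few lines of Python. The node with 7 Yes and 3 No comes from the example above; the two candidate splits (`split_a`, `split_b`) are hypothetical and exist only to show how the weighted Gini index ranks splits. The function names are likewise assumptions for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini(S) = 1 - sum(p_i^2) over the classes present in labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_split(subsets):
    """Weighted Gini of a split: sum(|S_j| / |S| * Gini(S_j))."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * gini(s) for s in subsets)


# Node from the example above: 7 Yes and 3 No.
node = ["Yes"] * 7 + ["No"] * 3
print(f"Gini(S) = {gini(node):.2f}")  # Gini(S) = 0.42

# Hypothetical binary splits of that node (not taken from the text).
split_a = [["Yes"] * 6 + ["No"], ["Yes"] + ["No"] * 2]
split_b = [["Yes"] * 4 + ["No"] * 2, ["Yes"] * 3 + ["No"]]

candidates = {"split_a": split_a, "split_b": split_b}
for name, subsets in candidates.items():
    print(f"Gini_split({name}) = {gini_split(subsets):.3f}")

# The split with the lowest weighted Gini index is preferred.
best = min(candidates, key=lambda name: gini_split(candidates[name]))
print("Chosen split:", best)
```

Neither function calls a logarithm, which is the source of the speed advantage over entropy mentioned in the comparison.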

## Example 1

### Taken from https://vtupluse.com/

<img src="images/DT-Gini-ex1-1.png" width="100%">

<img src="images/DT-Gini-ex1-2.png" width="100%">

<img src="images/DT-Gini-ex1-3.png" width="100%">

<img src="images/DT-Gini-ex1-4.png" width="100%">

<img src="images/DT-Gini-ex1-5.png" width="100%">

<img src="images/DT-Gini-ex1-6.png" width="100%">

<img src="images/DT-Gini-ex1-7.png" width="100%">

<img src="images/DT-Gini-ex1-8.png" width="100%">

<img src="images/DT-Gini-ex1-9.png" width="100%">

<img src="images/DT-Gini-ex1-10.png" width="100%">

<img src="images/DT-Gini-ex1-11.png" width="100%">

<img src="images/DT-Gini-ex1-12.png" width="100%">

<img src="images/DT-Gini-ex1-13.png" width="100%">

<img src="images/DT-Gini-ex1-14.png" width="100%">