## Prerequisites

Before starting with decision trees, you should understand:
- Probability
- Differences between classification and regression
- Evaluation metrics for classification and regression

# Decision Tree

A **Decision Tree** is a supervised machine learning model used for both **classification** and **regression tasks**. It works by learning a set of decision rules from data and organizing them into a **tree-like structure**.

Think of a decision tree as a flowchart of yes/no questions that you answer one by one to reach a decision.
- Each outcome of an if-else test is a branch of the tree.
- The final answer at the bottom of a path is the leaf node (the decision).

## Intuition with Breast Cancer Classification Example

Let's look at a medical example: detecting **breast cancer** by predicting whether a **breast mass** is **benign** (not cancerous) or **malignant** (cancerous).

<div align="center">
    <figure>
     <img src="https://i.postimg.cc/kMfTP7vK/Breast-Cancer-Classification.png" width="80%">
     <figcaption>Figure 1: Breast Cancer Classification with Decision Tree</figcaption>
    </figure>
</div>

- **Root Node:** The top-most node; the starting point.

- **Internal Node:** A decision point based on a feature.

- **Leaf Node:** The terminal node that gives a prediction.

- **Branch:** A connection between nodes that represents a decision path.

- **Depth:** The length of the longest path from the root to a leaf, counted in branches (a tree with only a root node has depth 0).
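
The terms above can be made concrete with a small data structure. The sketch below is only an illustration: the `Node` class and the `depth` helper are hypothetical names, not part of any library, and depth is counted as the number of branches on the longest root-to-leaf path (one common convention). The tree is built from the tests listed in the steps that follow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One node of a binary decision tree (hypothetical illustration class)."""
    feature: Optional[str] = None      # feature tested at a root/internal node
    threshold: Optional[float] = None  # go left if feature value <= threshold
    left: Optional["Node"] = None      # branch taken when the test is true
    right: Optional["Node"] = None     # branch taken when the test is false
    prediction: Optional[str] = None   # set only on leaf nodes

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def depth(node: Node) -> int:
    """Number of branches on the longest path from this node down to a leaf."""
    children = [child for child in (node.left, node.right) if child is not None]
    if not children:
        return 0
    return 1 + max(depth(child) for child in children)


# Root node -> two internal nodes -> four leaf nodes.
tree = Node(
    feature="concave points_mean", threshold=0.051,
    left=Node(feature="radius_mean", threshold=14.98,
              left=Node(prediction="benign"),
              right=Node(prediction="malignant")),
    right=Node(feature="radius_mean", threshold=11.345,
               left=Node(prediction="benign"),
               right=Node(prediction="malignant")),
)

print(depth(tree))  # 2
```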

<br>

The tree classifies a breast mass as **benign** or **malignant** by following these steps:

**Step 1: Test → Concave points_mean <= 0.051**

* If `concave points_mean` is **less than or equal to 0.051**: Go **left**.
    * Proceed to next test (on left).
* If `concave points_mean` is **greater than 0.051**: Go **right**.
    * Proceed to next test (on right).

**Step 2a (Left Branch): Test → radius_mean <= 14.98**

* If `radius_mean` is **less than or equal to 14.98**: Go **left**.
    * **Predict -> benign**.
* If `radius_mean` is **greater than 14.98**: Go **right**.
    * **Predict -> malignant**.

**Step 2b (Right Branch): Test → radius_mean <= 11.345**

* If `radius_mean` is **less than or equal to 11.345**: Go **left**.
    * **Predict -> benign**.
* If `radius_mean` is **greater than 11.345**: Go **right**.
    * **Predict -> malignant**.
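
The steps above are exactly a set of nested if-else rules. A minimal sketch, assuming the two features are passed in as plain numbers (the function name `classify_breast_mass` and the sample measurements are made up for illustration):

```python
def classify_breast_mass(concave_points_mean: float, radius_mean: float) -> str:
    """Follow the example tree from the root down to a leaf."""
    # Step 1: root test
    if concave_points_mean <= 0.051:
        # Step 2a: left branch
        if radius_mean <= 14.98:
            return "benign"
        return "malignant"
    # Step 2b: right branch
    if radius_mean <= 11.345:
        return "benign"
    return "malignant"


# Hypothetical measurements, chosen only to exercise two different paths.
print(classify_breast_mass(0.03, 12.0))  # benign
print(classify_breast_mass(0.08, 15.0))  # malignant
```

Every prediction uses only the tests on one path from the root to a leaf, which is why decision trees are easy to read and explain.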




## Popular Decision Tree Algorithms

* **ID3 (Iterative Dichotomiser 3):** This algorithm leverages the concepts of **entropy** and **information gain** to determine the best splits. It is particularly well-suited for datasets with **categorical features**.

* **C4.5:** As a successor to ID3, C4.5 extends its capabilities by supporting **continuous attributes** and effectively handling **missing values** in the data.

* **CART (Classification and Regression Trees):** This versatile algorithm can be used for both **classification** and **regression** tasks; the popular `scikit-learn` library implements an optimized version of it. For classification, it commonly uses the **Gini impurity** criterion, while for regression it typically uses **Mean Squared Error (MSE)** to guide the splitting process.
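
To make the ID3 terms concrete, here is a minimal sketch of the two quantities it relies on, using the standard definitions: entropy $H = -\sum_i P_i \log_2 P_i$, and information gain as the parent's entropy minus the size-weighted entropy of the child sets. The function names and the class counts are illustrative assumptions, not taken from any dataset.

```python
import math


def entropy(class_counts):
    """Entropy (in bits) of a set described by its per-class sample counts."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count > 0:  # classes with no samples contribute nothing
            p = count / total
            result -= p * math.log2(p)
    return result


def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted entropy of the child sets."""
    total = sum(parent_counts)
    weighted_child_entropy = sum(
        sum(child) / total * entropy(child) for child in children_counts
    )
    return entropy(parent_counts) - weighted_child_entropy


# Hypothetical parent with 50 samples per class, split into two children.
parent = [50, 50]
children = [[40, 10], [10, 40]]

print(f"{entropy(parent):.3f}")                     # 1.000
print(f"{information_gain(parent, children):.3f}")  # 0.278
```

Among the candidate splits at a node, ID3 picks the one with the highest information gain.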

## Impurity Metrics and Selection of the Best Attribute

The decision tree we just examined used two attributes: **Concave points_mean** and **radius_mean** as test/split functions.

In practice, we rarely have such a clear-cut set of attributes. Most real datasets have many attributes, so it is not obvious which one to use as the test/split function at each node. To identify the **best attribute for splitting** at each node, we use metrics such as:
- **Gini impurity**,
- **Entropy**, and
- **Mean Squared Error (MSE)**.

These metrics are called **impurity metrics**.

To see how impurity metrics help decide which attribute to split on, consider the toy classification problem shown below, which has three classes and two features.

<div align="center">
    <figure>
        <img src="https://i.postimg.cc/vTxZpmvY/image.png">
        <figcaption> <br> Figure 2: Test attributes and their associated Gini values. a) Dataset at the parent node before the split. b) After a vertical split. c) After a horizontal split.</figcaption>
    </figure>
</div>

__Data Before Split__

* The **leftmost figure** shows a scatter plot and class histogram of training samples in a **parent node**.
* Goal: Select the attribute ($x_0$ or $x_1$) that best separates the classes.

__Splitting by $x_0$ (Vertical Split)__

* Creates two sets:

  * **Right set**: contains samples from **one class only** → **pure**.
  * **Left set**: contains samples from **two classes** → **impure**.

__Splitting by $x_1$ (Horizontal Split)__

* Creates:

  * **Top set**: samples from **one class only** → **pure**.
  * **Bottom set**: samples from **all three classes** → **highly impure**.

### Gini Impurity

* The Gini value of each candidate split is shown at the top of the split plots.

* **Lower Gini value = better split**.
* In this case, the **$x_0$ (vertical) split** has **lower impurity** than the $x_1$ split, so it is selected.
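
A common way to score a candidate split is the size-weighted average of the Gini impurity of the sets it produces. The sketch below assumes that scoring rule and uses made-up class counts that only mimic the shape of the figure (one pure set and one impure set per split); they are not the actual values behind the plots. The helper names `gini` and `weighted_gini` are hypothetical.

```python
def gini(class_counts):
    """Gini impurity of a set described by its per-class sample counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def weighted_gini(children_counts):
    """Size-weighted average Gini impurity of the sets produced by a split."""
    n = sum(sum(child) for child in children_counts)
    return sum(sum(child) / n * gini(child) for child in children_counts)


# Hypothetical parent node: 30 samples in each of three classes.
# Vertical split (x_0): left set holds two classes, right set is pure.
vertical = [[30, 30, 0], [0, 0, 30]]
# Horizontal split (x_1): top set is pure, bottom set holds all three classes.
horizontal = [[15, 0, 0], [15, 30, 30]]

print(f"{weighted_gini(vertical):.3f}")    # 0.333
print(f"{weighted_gini(horizontal):.3f}")  # 0.533
```

With these counts the vertical split has the lower weighted impurity, so it would be the one selected.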

Let's go back to our **Breast Cancer classification** example. Can you check whether a **radius_mean** test at the root node would have reduced the uncertainty in prediction more than the **concave points_mean** test that the tree uses? If a single attribute is enough to make a good split, the decision tree above might need only one test! Such a tree, with a single decision node whose branches lead directly to leaves, is called a **decision stump**.

Now that we know what impurity metrics are for, let's look at them more closely. This reading material focuses on Gini impurity, which is given by:

\begin{equation}
    \text{Gini} = 1 - \sum_{i=1}^{n} P_{i}^2
\end{equation}

where $P_i$ is the proportion of samples in the set that belong to class $i$ (the probability that a randomly chosen sample is of class $i$), and $n$ is the number of classes.
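
As a quick worked example, take an illustrative set with 60 benign and 40 malignant samples, so that $P_1 = 0.6$ and $P_2 = 0.4$:

\begin{equation}
    \text{Gini} = 1 - \left(0.6^2 + 0.4^2\right) = 1 - (0.36 + 0.16) = 0.48
\end{equation}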

Now, let's write our function to compute the gini impurity using `numpy`.

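```python
import numpy as np

def compute_gini(class_frequencies):
    """Return the Gini impurity for an array of per-class sample counts."""
    # Accept plain lists as well as arrays.
    class_frequencies = np.asarray(class_frequencies, dtype=float)
    # Convert counts to class proportions.
    probabilities = class_frequencies / np.sum(class_frequencies)
    gini = 1 - np.sum(probabilities ** 2)
    return gini
```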

Suppose we have four different sets of samples for the **Breast Cancer classification** example above. We will use the function `compute_gini()` to compute the Gini impurity of each set.

<div align="center">
<table>
  <thead>
    <tr>
      <th>Benign</th>
      <th>Malignant</th>
      <th>Remarks</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>150</td>
      <td>0</td>
      <td>Pure Data</td>
    </tr>
    <tr>
      <td>10</td>
      <td>90</td>
      <td>Unevenly Distributed Impure Data</td>
    </tr>
    <tr>
      <td>60</td>
      <td>40</td>
      <td>Unevenly Distributed Impure Data</td>
    </tr>
    <tr>
      <td>50</td>
      <td>50</td>
      <td>Evenly Distributed Impure Data</td>
    </tr>
  </tbody>
</table>
</div>

```python
# Compute the Gini impurity for each class distribution in the table above.
# Each row is [benign count, malignant count].
class_dist_matrix = np.array([
    [150, 0],
    [10, 90],
    [60, 40],
    [50, 50]
])

for class_dist in class_dist_matrix:
    gini_impurity = compute_gini(class_dist)
    print(f"{gini_impurity:.2f}")
```

```
0.00
0.18
0.48
0.50
```


From the computations above, we can see that a pure set (samples from one class only) has zero impurity. Impurity (the uncertainty of a prediction) is highest when a set has an equal number of samples from each class; with two classes, the maximum Gini value is 0.5.

Like Gini and entropy, MSE also measures impurity in data, but it is used in regression trees. A split that yields a lower MSE is preferred over a split with a higher MSE.
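
For regression, the same idea can be sketched as follows. This is a minimal illustration, assuming each node predicts the mean of its targets and a split is scored by the size-weighted MSE of its two sides; the function names and target values are made up for the example.

```python
def node_mse(targets):
    """MSE of a node that predicts the mean of its own targets."""
    mean = sum(targets) / len(targets)
    return sum((y - mean) ** 2 for y in targets) / len(targets)


def weighted_mse(left, right):
    """Size-weighted average MSE of the two sets produced by a split."""
    n = len(left) + len(right)
    return len(left) / n * node_mse(left) + len(right) / n * node_mse(right)


# Hypothetical regression targets, already sorted by the feature being tested.
targets = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

# Candidate split A separates the small targets from the large ones.
print(f"{weighted_mse(targets[:3], targets[3:]):.2f}")  # 0.67
# Candidate split B isolates only the first sample.
print(f"{weighted_mse(targets[:1], targets[1:]):.2f}")  # 14.87
```

Split A leaves much less variation inside each side, so it has the lower MSE and would be preferred.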


### Decision Tree Theory
* Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, Vipin Kumar, [Introduction to Data Mining](https://www-users.cs.umn.edu/~kumar001/dmbook/index.php), 2nd Edition
   * Check units 3.3.1, 3.3.2, and 3.3.3 (page 119) to understand how decision trees work and how to build them.


* A. Criminisi and J. Shotton, Decision Forests for Computer Vision and Medical Image Analysis
   * Check chapter 3 (page 7) to get a basic understanding of the tree data structure and decision trees.