# Decision Tree Inducers

 






## Prerequisites

Before proceeding through this notebook, students should have completed:
* Decision Tree basics 
* Impurity Metrics

## Learning Objectives
After going through this notebook, students should be able to:  
* List popular decision tree inducers.
* List differences between ID3, C4.5, and CART.  
* Use the algorithms discussed in this chapter, together with Python, to build a decision tree from scratch.

## Introduction


In previous notebooks we studied impurity metrics and pruning methods. In this notebook we will see how decision tree inducers use different impurity metrics and pruning methods.

Decision tree inducers are algorithms that construct a decision tree from training data. Some popular decision tree inducers are:

* ID3
* C4.5
* CART

## ID3

ID3 stands for Iterative Dichotomiser 3. To 'iterate' is to perform repeatedly, and to 'dichotomise' is to divide; the name reflects how ID3 repeatedly divides the data at each node. It was developed by Ross Quinlan and is used to solve classification problems. ID3 creates multiway splits: for a categorical attribute with *n* distinct values, it creates *n* branches. For example, if an attribute **outlook** has three values *sunny, foggy, and rainy*, then ID3 creates three branches, one for each distinct value of **outlook**.

ID3 uses **information gain** to determine the best split. The tree stops growing when all instances at a node belong to the same class or when the maximum information gain is not greater than 0. ID3 **doesn't use any tree pruning method**.
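As a reminder of the metric from the impurity notebook, information gain can be written as follows, assuming the usual entropy-based definition (the symbols $H$, $p_c$, and $D_v$ are notation chosen here for illustration):

$$
H(D) = -\sum_{c} p_c \log_2 p_c, \qquad
\text{IG}(D, f) = H(D) - \sum_{v \in \text{values}(f)} \frac{|D_v|}{|D|}\, H(D_v)
$$

Here $p_c$ is the fraction of instances in $D$ that belong to class $c$, and $D_v$ is the subset of $D$ in which attribute $f$ takes the value $v$. For instance, if $D$ holds two instances of each of two classes, then $H(D) = 1$. An attribute that separates the two classes perfectly gives $\text{IG} = 1 - 0 = 1$, while an attribute whose subsets each keep the same 50/50 class mix gives $\text{IG} = 1 - 1 = 0$, in which case ID3 would not split on it.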


### Algorithm

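1. Given the instances in a dataset $D$ at the current node:

   * If all instances belong to the same class $C$:

     * Create a **leaf node** labeled with class $C$ and stop.
   * Else, if no unselected attributes remain or another stopping criterion is met (e.g., maximum tree depth):

     * Create a **leaf node** with the **majority class label** and stop.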
   * Else:

     * Compute **Information Gain** for all previously unselected attributes.
     * Select an attribute (say $f$) that yields the **maximum Information Gain**.
     * If $\max(\text{Information Gain}) \leq 0$:

       * Create a **leaf node** with the **majority class label** and stop.
     * Else:

       * Split the dataset into subsets according to the value of attribute $f$.

2. Apply the algorithm **recursively** from Step 1 for each of the subsets. 
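The steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names (`entropy`, `information_gain`, `id3`, `predict`), the nested-dict tree representation, the rounding tolerance, and the toy weather data are all assumptions made for this example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on a categorical attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def id3(rows, labels, attrs):
    """Return a class label (leaf) or {"attr": name, "children": {value: subtree}}."""
    if len(set(labels)) == 1:              # all instances share one class
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    if not attrs:                          # no unselected attribute left
        return majority
    gains = {a: information_gain(rows, labels, a) for a in attrs}
    best = max(gains, key=gains.get)
    if gains[best] <= 1e-12:               # max gain <= 0, up to float rounding
        return majority
    remaining = [a for a in attrs if a != best]
    children = {}
    for value in sorted({row[best] for row in rows}):   # one branch per value
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        children[value] = id3([r for r, _ in pairs], [l for _, l in pairs], remaining)
    return {"attr": best, "children": children}


def predict(tree, row):
    """Follow branches until a leaf (a plain class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["attr"]]]
    return tree


# Hypothetical toy data: only rainy and windy days lead to "stay".
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "foggy", "windy": "no"},
    {"outlook": "foggy", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "play", "play", "play", "stay"]

tree = id3(rows, labels, ["outlook", "windy"])
print(tree)
```

On this data the root splits on `outlook` with three branches, and only the *rainy* branch needs a further split on `windy`. Note that `predict` raises a `KeyError` for an attribute value never seen during training; handling that case is left out of the sketch.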


### Disadvantages of ID3
- ID3 is prone to overfitting because it doesn't use any pruning method.
- ID3 can only handle categorical attributes.

## C4.5
C4.5, also developed by Ross Quinlan, is an improved version of ID3. Like ID3, C4.5 creates multiway splits for a categorical attribute and is used to solve classification problems. 

Unlike ID3, C4.5 does not require attributes to be categorical. For a continuous attribute, C4.5 chooses a threshold and splits the data into two parts: the first contains all instances whose attribute value is less than or equal to the threshold, and the second contains all instances whose attribute value is greater than the threshold.

C4.5 uses **gain ratio** as its splitting criterion. The tree stops growing when the number of instances to be split falls below a certain threshold. To reduce overfitting, C4.5 applies **error-based pruning**. Our previous notebook on pruning methods doesn't discuss error-based pruning, but you may refer to the Additional Resources section if you want to know more about it.
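The threshold split on a continuous attribute can be sketched as follows. This is a simplified illustration, not Quinlan's full procedure: it tries every distinct observed value except the largest as a candidate threshold and keeps the one with the highest gain ratio. The helper names and the temperature data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(labels, left, right):
    """Gain ratio of a binary split of `labels` into `left` and `right`."""
    n = len(labels)
    weights = [len(left) / n, len(right) / n]
    gain = entropy(labels) - weights[0] * entropy(left) - weights[1] * entropy(right)
    split_info = -sum(w * log2(w) for w in weights if w > 0)
    return gain / split_info if split_info > 0 else 0.0


def best_threshold(values, labels):
    """Return (threshold, gain_ratio) for the best 'value <= threshold' split."""
    best = (None, 0.0)
    for t in sorted(set(values))[:-1]:     # the largest value would leave one side empty
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        ratio = gain_ratio(labels, left, right)
        if ratio > best[1]:
            best = (t, ratio)
    return best


# Hypothetical data: cooler days are labeled "stay", warmer days "play".
temperature = [15, 18, 21, 24, 27, 30]
labels = ["stay", "stay", "stay", "play", "play", "play"]

threshold, ratio = best_threshold(temperature, labels)
print(threshold, ratio)
```

Here the split `temperature <= 21` separates the two classes completely, so it receives the highest gain ratio among the candidates.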


### Algorithm

1. Given the instances in a dataset $D$ at the current node:

   * If all instances belong to the same class $C$, or another stopping criterion is met (e.g., too few instances to split, maximum tree depth):

     * Create a **leaf node** labeled with class $C$ (or with the majority class if the node is not pure) and **stop**.
   * Else:

     * Compute the **Gain Ratio** for all unused attributes.
     * Select the attribute $f$ with the **highest Gain Ratio**.
     * Split the dataset into subsets based on the values of attribute $f$.

2. Apply the algorithm **recursively** from Step 1 to each subset.

3. After the full tree is constructed, **prune the tree using Error-Based Pruning**.
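One way to see why the Gain Ratio step matters is to compare it with plain information gain on an attribute that has a unique value for every instance. The sketch below assumes the standard definition of gain ratio (information gain divided by the entropy of the partition sizes); the function name and the four-row dataset are made up for illustration.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of a list of hashable items."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())


def info_gain_and_ratio(attr_values, labels):
    """Return (information gain, gain ratio) for a multiway categorical split."""
    n = len(labels)
    groups = {}
    for value, label in zip(attr_values, labels):
        groups.setdefault(value, []).append(label)
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(attr_values)      # entropy of the branch sizes
    return gain, (gain / split_info if split_info > 0 else 0.0)


labels = ["play", "play", "stay", "stay"]
row_id = ["a", "b", "c", "d"]              # hypothetical ID-like attribute
windy = ["no", "no", "yes", "yes"]

id_gain, id_ratio = info_gain_and_ratio(row_id, labels)
windy_gain, windy_ratio = info_gain_and_ratio(windy, labels)
print(id_gain, id_ratio)
print(windy_gain, windy_ratio)
```

Both attributes reach the same information gain because both produce pure subsets, but the ID-like attribute does so by creating one branch per instance. Dividing by the split information penalizes that fragmentation, so `windy` ends up with the higher gain ratio.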

## CART

CART stands for **Classification and Regression Trees**. CART constructs binary trees: each split produces exactly two subsets. For example, if an attribute **outlook** has three values *sunny, foggy, and rainy*, then one subset may contain all instances in which **outlook** is *foggy*, and the other subset all instances in which **outlook** is *sunny* or *rainy*. For classification, splits are typically selected using **Gini impurity** (the twoing criterion is an alternative), and the tree is pruned using **cost-complexity pruning**.

An advantage of CART over ID3 and C4.5 is that it can be used for both regression and classification.



### CART Algorithm (Classification)

1. Given the instances in a dataset $D$ at the current node:

   * If all instances belong to the same class $C$, or another stopping criterion is met (e.g., maximum tree depth):

     * Create a **leaf node** labeled with the majority class and **stop**.
   * Else:

     * For every attribute, compute the **reduction in Gini impurity** of each candidate binary split.
     * Select the attribute $f$ and the binary split that yield the **greatest reduction in Gini impurity**.
     * Split the data into two subsets according to that split on attribute $f$.

2. Apply the algorithm **recursively** from Step 1 to each subset.

3. **Prune the tree** using **Cost Complexity Pruning**.

> The algorithm uses **Gini impurity** as the impurity metric, but you could also use the **Twoing Criterion** (refer to the *Additional Resources* section for more details).
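The binary split search on a categorical attribute can be sketched as below, assuming Gini impurity as the split criterion. The sketch enumerates every two-way partition of the attribute's values and keeps the one with the lowest weighted impurity, which is equivalent to the largest impurity reduction. The function names and data are hypothetical, and exhaustive enumeration is only practical for attributes with few distinct values.

```python
from collections import Counter
from itertools import combinations


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_binary_split(values, labels):
    """Return (left_categories, weighted_gini) for the best two-way partition."""
    categories = sorted(set(values))
    n = len(labels)
    best_subset, best_score = None, float("inf")
    for size in range(1, len(categories)):
        for subset in combinations(categories, size):
            left = [l for v, l in zip(values, labels) if v in subset]
            right = [l for v, l in zip(values, labels) if v not in subset]
            # Weighted impurity of the two children; lower is better.
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if score < best_score:
                best_subset, best_score = set(subset), score
    return best_subset, best_score


# Hypothetical data mirroring the outlook example above.
outlook = ["sunny", "sunny", "foggy", "foggy", "rainy", "rainy"]
labels = ["play", "play", "stay", "stay", "play", "play"]

left_categories, score = best_binary_split(outlook, labels)
print(left_categories, score)
```

On this data the best partition places *foggy* on one side and *sunny* and *rainy* on the other, since both resulting subsets are pure.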


### CART Algorithm (Regression)

The above algorithm describes how to construct a decision tree for a **classification problem** using CART. Since **CART supports both classification and regression**, here are the steps for constructing **regression trees**:

1. Given the instances in a dataset $D$ at the current node:

   * If all target values are approximately the same, or another stopping criterion is met (e.g., Mean Squared Error below a threshold):

     * Create a **leaf node** that predicts the mean target value and **stop**.
   * Else:

     * For every attribute, compute the weighted **Mean Squared Error (MSE)** of the two subsets produced by each candidate binary split.
     * Select the attribute $f$ and the split that yield the **lowest MSE**.
     * Split the data into two subsets according to that split on attribute $f$.

2. Apply the algorithm **recursively** from Step 1 to each subset.

3. **Prune the tree** using **Cost Complexity Pruning**.

> The algorithm uses **MSE** as the impurity metric, but **Variance** or **Residual Sum of Squares (RSS)** can also be used.
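The split selection for a regression tree can be sketched in the same way, here for a single numeric attribute. This is an illustrative sketch under stated assumptions: candidate thresholds are the distinct observed values except the largest, each side predicts its own mean, and the function names and data are hypothetical.

```python
def mse(targets):
    """Mean squared error of predicting the mean of `targets`."""
    mean = sum(targets) / len(targets)
    return sum((t - mean) ** 2 for t in targets) / len(targets)


def best_regression_split(values, targets):
    """Return (threshold, weighted_mse) for the best 'value <= threshold' split."""
    n = len(targets)
    best_threshold, best_score = None, float("inf")
    for t in sorted(set(values))[:-1]:     # the largest value would leave one side empty
        left = [y for x, y in zip(values, targets) if x <= t]
        right = [y for x, y in zip(values, targets) if x > t]
        # Weight each side's MSE by its share of the instances.
        score = len(left) / n * mse(left) + len(right) / n * mse(right)
        if score < best_score:
            best_threshold, best_score = t, score
    return best_threshold, best_score


# Hypothetical data with two clearly separated groups of target values.
size = [1, 2, 3, 10, 11, 12]
price = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]

threshold, score = best_regression_split(size, price)
print(threshold, score)
```

The chosen split is `size <= 3`, which isolates the low-price group from the high-price group; each resulting leaf would predict the mean price of its group.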


Having studied these decision tree inducers, we can conclude that ID3 is the easiest of the three to implement. CART, however, is the most versatile because it handles both regression and classification. For this reason, scikit-learn uses an optimized version of CART for its `DecisionTreeClassifier` and `DecisionTreeRegressor`.
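A brief usage sketch of both estimators follows. It assumes scikit-learn (0.22 or later, for `ccp_alpha`) and NumPy are installed; the hyperparameter values and the synthetic regression data are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: Gini impurity for splits, cost-complexity pruning via ccp_alpha.
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.01, random_state=0)
clf.fit(X, y)
train_accuracy = clf.score(X, y)
print("leaves:", clf.get_n_leaves(), "training accuracy:", train_accuracy)

# Regression: the default criterion is squared error; depth is capped at 3.
rng = np.random.default_rng(0)
X_reg = rng.uniform(0, 6, size=(200, 1))
y_reg = np.sin(X_reg[:, 0])
reg = DecisionTreeRegressor(max_depth=3, random_state=0)
reg.fit(X_reg, y_reg)
print("depth:", reg.get_depth(), "prediction at x=1:", reg.predict([[1.0]]))
```

Training accuracy is shown only to confirm that the pruned tree still fits the data; a proper evaluation would use a held-out test set or cross-validation.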

The following table summarizes the above algorithms:


<div align="center">
 <figure>
 <img src="https://i.postimg.cc/NMHSVX69/image.png">
 </figure>
</div>

## Additional Resources

### Decision Tree Inducers:
* Lior Rokach and Oded Maimon, [Decision Trees](http://www.ise.bgu.ac.il/faculty/liorr/hbchap9.pdf)
  * See Section 8, page 181, to read about decision tree inducers


### Splitting Criteria and Pruning Methods:
* Lior Rokach and Oded Maimon, [Decision Trees](http://www.ise.bgu.ac.il/faculty/liorr/hbchap9.pdf)
  * See Section 3.11, page 172, to read about the Twoing Criterion
  * See Section 6.4, page 176, to read about Minimum Error Pruning

* John Mingers, [An Empirical Comparison of Pruning Methods for Decision Tree Induction](https://link.springer.com/article/10.1023/A:1022604100933)
  * See Section 2.2.3 to read about Minimum Error Pruning
  * See Section 2.2.4 to read about Reduced Error Pruning
