# Impurity Metrics for Continuous Attributes

## Prerequisites
Before proceeding with this chapter, students must have completed:
- Decision tree Impurity Metrics (the earlier notebook)

## Learning Objectives 
After going through this notebook, students should be able to: 
- Find the best split when the dataset has continuous attributes



In the previous notebook, we discussed impurity metrics and demonstrated how to compute them for discrete attributes. In this notebook, we will see how to compute impurity metrics for continuous attributes. We will use the Gini index as the impurity metric, but the same process applies to other impurity metrics. We will use the following dataset for the demonstration:

<div>
 
 <table>
 <caption>Dataset</caption>
 <tr>
 <th> Yearly Income (in thousand dollars) </th>
 <th> Net Worth (in million dollars) </th>
 <th> Will repay loan </th>
 </tr>
 <tr>
 <td> 120 </td>
 <td> 2.5 </td>
 <td> yes </td>
 </tr>
 <tr>
 <td> 250 </td>
 <td> 1.0 </td>
 <td> yes </td>
 </tr>
 <tr>
 <td> 130 </td>
 <td> 1.0 </td>
 <td> no </td>
 </tr>
 <tr>
 <td> 80 </td>
 <td> 3.0 </td>
 <td> yes </td>
 </tr>
 </table>
</div>




We will use the attribute **Net Worth** to compute the Gini index. The steps involved in the process are:

1. Arrange all attribute values in ascending order. Ignore duplicates.   
 Sorted attribute values = `[1.0, 2.5, 3.0]`

  Sorting matters because the candidate thresholds lie between neighboring values. If the values are not sorted, some thresholds, and therefore some splits, are never examined, and the split we end up choosing may not be the best one.  

2. Find the average of each pair of consecutive values in the sorted list. These averages are used as threshold values.   
 Threshold values = `[1.75, 2.75]`

3. Use each of these threshold values as a split value and compute the Gini index of the child nodes. All instances with an attribute value less than or equal to the threshold form the left child, and the rest form the right child.   
 Example: Using `1.75` as the split value, we get the following splits: 
<div>
 <table>
 <caption>Split 1: Net Worth <= 1.75</caption>
 <tr>
 <th> Yearly Income (in thousand dollars) </th>
 <th> Net Worth (in million dollars) </th>
 <th> Will repay loan </th>
 </tr>

 <tr>
 <td> 250 </td>
 <td> 1.0 </td>
 <td> yes </td>
 </tr>
 <tr>
 <td> 130 </td>
 <td> 1.0 </td>
 <td> no </td>
 </tr>
 </table>
</div>


 <div>
 
 <table>
 <caption>Split 2: Net Worth > 1.75</caption>
 <tr>
 <th> Yearly Income (in thousand dollars) </th>
 <th> Net Worth (in million dollars) </th>
 <th> Will repay loan </th>
 </tr>
 <tr>
 <td> 120 </td>
 <td> 2.5 </td>
 <td> yes </td>
 </tr>
 <tr>
 <td> 80 </td>
 <td> 3.0 </td>
 <td> yes </td>
 </tr>
 </table>
</div>

Gini for **Split 1**, **Net Worth** ≤ *1.75*, can be computed as:

$$
\text{Gini}_{\text{split 1}} = 1 - \left[P_{\text{Will repay loan = yes}}^2 + P_{\text{Will repay loan = no}}^2\right]
$$

$$
= 1 - \left[\left(\frac{1}{2}\right)^2 + \left(\frac{1}{2}\right)^2\right]
$$

$$
= 0.5
$$

Gini for **Split 2**, **Net Worth** > *1.75*, can be computed as:

$$
\text{Gini}_{\text{split 2}} = 1 - \left[P_{\text{Will repay loan = yes}}^2 + P_{\text{Will repay loan = no}}^2\right]
$$

$$
= 1 - \left[\left(\frac{2}{2}\right)^2 + 0\right]
$$

$$
= 0
$$

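Gini for the attribute **Net Worth** at threshold *1.75* can be computed by taking a weighted average of $\text{Gini}_{\text{split 1}}$ and $\text{Gini}_{\text{split 2}}$, where each weight is the fraction of instances that fall into that child: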

$$
\text{Gini}_{\text{Net Worth}} = \frac{2}{4} \cdot \text{Gini}_{\text{split 1}} + \frac{2}{4} \cdot \text{Gini}_{\text{split 2}}
$$

$$
= \frac{2}{4} \cdot 0.5 + \frac{2}{4} \cdot 0
$$

$$
= 0.25
$$
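
The worked example above can be reproduced with a short Python sketch. The function names `gini`, `candidate_thresholds`, and `weighted_gini` are illustrative choices made for this notebook and do not come from any library; the data are the four rows of the dataset table.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def candidate_thresholds(values):
    """Steps 1 and 2: sort the distinct values and average consecutive pairs."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def weighted_gini(values, labels, threshold):
    """Step 3: weighted Gini of the children produced by `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Net Worth and Will repay loan columns, in the row order of the dataset table
net_worth = [2.5, 1.0, 1.0, 3.0]
will_repay = ["yes", "yes", "no", "yes"]

thresholds = candidate_thresholds(net_worth)
print(thresholds)                                   # [1.75, 2.75]
print(weighted_gini(net_worth, will_repay, 1.75))   # 0.25
```

Using a set in `candidate_thresholds` is what implements "ignore duplicates": the two rows with Net Worth 1.0 contribute a single value to the sorted list.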




4. Repeat step 3 for each threshold value of each attribute. 
<div align="center">
 <figure>
 <!-- <img src="https://doc.google.com/a/fusemachines.com/uc?id=1co3Dbz-Jve8GpOdeM33xOK81RbMLwit1" width="300"> -->
 <img src="https://i.postimg.cc/k45xB99h/image.png" width="300">
 <figcaption>Figure 1: Attribute, threshold value, and their corresponding gini</figcaption>
 </figure>
</div>

5. The attribute and threshold value that correspond to the minimum Gini are used to make the split. In case of a tie, select one at random. 

<div align="center">
 <figure>
 <!-- <img src="https://doc.google.com/a/fusemachines.com/uc?id=14Ys5RStLNbxiIBqCyaoHQKT3syQT4awK" width="300"> -->
 <img src="https://i.postimg.cc/dVBnWJs4/image.png" width="300">
 <figcaption>Figure 2: Entries highlighted in green correspond to minimum gini</figcaption>
 </figure>
</div>

In our case, both **Net Worth** at threshold value 1.75 and **Yearly Income** at threshold value 125 yield the minimum Gini. Either of them can be selected at random to make the split. 

In this way, we can find the best split for continuous attributes. 
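
As a rough illustration of steps 1 to 5 taken together, the sketch below evaluates every candidate threshold of both attributes and then picks one of the tied minimum-Gini candidates at random. The helper names `gini` and `evaluate_splits` and the dictionary layout are assumptions made for this example, not part of any library.

```python
import math
import random


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def evaluate_splits(attributes, labels):
    """Return (attribute, threshold, weighted Gini) for every candidate split."""
    rows = []
    n = len(labels)
    for name, values in attributes.items():
        distinct = sorted(set(values))             # step 1: sort, ignore duplicates
        for low, high in zip(distinct, distinct[1:]):
            threshold = (low + high) / 2           # step 2: midpoint threshold
            left = [y for x, y in zip(values, labels) if x <= threshold]
            right = [y for x, y in zip(values, labels) if x > threshold]
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            rows.append((name, threshold, score))  # steps 3 and 4
    return rows


# Columns of the dataset table, in row order
attributes = {
    "Yearly Income": [120, 250, 130, 80],
    "Net Worth": [2.5, 1.0, 1.0, 3.0],
}
will_repay = ["yes", "yes", "no", "yes"]

rows = evaluate_splits(attributes, will_repay)
for name, threshold, score in rows:
    print(f"{name:<14} {threshold:>7} {score:.3f}")

# Step 5: keep every candidate with the minimum Gini, then break the tie randomly
best_score = min(score for _, _, score in rows)
tied = [row for row in rows if math.isclose(row[2], best_score)]
chosen = random.choice(tied)
print("Tied candidates:", tied)
print("Chosen split:", chosen)
```

On this dataset the tied candidates are **Yearly Income** at 125 and **Net Worth** at 1.75, both with a Gini of 0.25. Comparing scores with `math.isclose` instead of `==` is a design choice that keeps ties from being missed because of floating-point rounding.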

In the next chapter, we will discuss decision tree algorithms and briefly cover the impurity metric and pruning method used by each of them. 







## Key Takeaways

* To find the best split for a continuous attribute, we need to compute the Gini index (or any other impurity metric of choice) for each threshold value of the attribute.
* If we have multiple candidates for the best split (minimum Gini), we can select any one of them to create the split.

## Additional Resources

### Split Value for Continuous Attribute
* [Choosing Split value for continuous attribute](https://datascience.stackexchange.com/questions/24339/how-is-a-splitting-point-chosen-for-continuous-variables-in-decision-trees) 
 Refer to this Stack Exchange question to find out how split points are selected for continuous attributes. 


