A decision tree is one of the easiest algorithms to understand and implement, and it can be used for both classification and regression. It is a non-parametric algorithm, and a trained tree reads like a chain of if-elif-else statements.

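One of the biggest advantages of this algorithm is that there is no need for feature scaling: each split only compares a single feature against a threshold, so the units of the features do not matter.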
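To see that scaling is not required, here is a small sketch that fits the same tree on raw and on standardized iris features and compares the predictions. The choice of `StandardScaler`, `max_depth=3` and `random_state=0` is an assumption made for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Tree fitted on the raw, unscaled features.
raw_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Tree fitted on standardized features (zero mean, unit variance).
X_scaled = StandardScaler().fit_transform(X)
scaled_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y)

# Standardizing keeps the ordering of values within each feature, so the
# thresholds move but the way the samples are partitioned should not.
same_predictions = np.array_equal(raw_tree.predict(X), scaled_tree.predict(X_scaled))
print("Same predictions with and without scaling:", same_predictions)
```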

The problem with a decision tree is that if we let the tree grow too deep it overfits, and if the depth is insufficient it underfits. To handle overfitting, we have a technique known as pruning. Pruning simply means cutting off nodes (whole subtrees) that add little predictive value.
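The effect of depth and pruning can be sketched on the iris data. The split ratio, the depths tried and the `ccp_alpha=0.02` value below are arbitrary assumptions, not tuned settings; `ccp_alpha` is sklearn's cost-complexity pruning parameter.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Compare a very shallow tree, a moderate tree and an unrestricted tree.
results = {}
for depth in (1, 3, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, test={results[depth][1]:.3f}")

# Pruning: the same tree with and without cost-complexity pruning.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=0.02).fit(X_train, y_train)
print("nodes before pruning:", full_tree.tree_.node_count)
print("nodes after pruning: ", pruned_tree.tree_.node_count)
```

A large gap between train and test accuracy hints at overfitting, while low accuracy on both hints at underfitting.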

There are two well-known criteria for building a decision tree. One uses entropy and information gain, and the other uses Gini impurity. Both measure how mixed the class labels are in the nodes we get by splitting on a feature. The feature whose split leaves the lowest impurity (the highest information gain) is used at the top of the tree, and features that leave higher impurity end up towards the bottom.
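As a minimal sketch of how information gain ranks candidate splits, the snippet below compares two splits of the same parent node. The labels and the helper names `entropy` and `information_gain` are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical parent node with 4 "yes" and 4 "no" samples.
parent = ["yes"] * 4 + ["no"] * 4

# Feature A separates the classes perfectly; feature B does not help at all.
split_a = [["yes"] * 4, ["no"] * 4]
split_b = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
print("gain for feature A:", gain_a)
print("gain for feature B:", gain_b)
```

Feature A has the higher gain, so it would be picked for the split closer to the root.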

![download.png](attachment:download.png)

Even the mathematics behind the decision tree is simple. We just need to know two things.

1: Concept of the logarithm: It is preferable to use base 2 (the binary logarithm). Base e, also known as the natural log, can also be used. <b> GK of the log: </b> The base of a log cannot be negative, zero, or one, and we cannot pass a negative value or 0 to the log function. If we pass any value strictly between 0 and 1 to the log (with a base greater than 1), it returns a negative value.

2: Concept of the probability: It’s just a simple probability :D
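The logarithm facts from point 1 can be checked quickly with Python's standard `math` module. This is only a sanity check; the variable name `rejected` is just for illustration.

```python
import math

# Base-2 log of a value above 1 is positive.
print(math.log2(8))

# Any value strictly between 0 and 1 gives a negative log.
print(math.log2(0.5))

# Zero and negative inputs are rejected by the log function.
rejected = []
for bad_value in (0, -1):
    try:
        math.log2(bad_value)
    except ValueError:
        rejected.append(bad_value)
print("inputs rejected by log2:", rejected)

# A base of 1 is not usable either: log(1) is 0, so the change of base divides by zero.
try:
    math.log(8, 1)
    base_one_ok = True
except (ValueError, ZeroDivisionError):
    base_one_ok = False
print("base 1 allowed:", base_one_ok)
```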

![gini-entropy.png](attachment:gini-entropy.png)

If we are doing binary classification, entropy and Gini impurity have different ranges of values.

For entropy, if the node produced by a particular feature is pure or nearly pure, the value is or tends to 0. If the node is highly impure, i.e. the probability of both classes is 0.5, the value reaches its maximum of 1. For multiclass classification the value can go above 1: with log base 2, the maximum for k classes is log2(k).

For Gini impurity, if the node produced by a particular feature is pure or nearly pure, the value is or tends to 0 (the same as entropy). If the node is highly impure, i.e. the probability of both classes is 0.5, the value reaches its maximum of 0.5.
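These ranges can be verified with a short sketch. The helper functions below take class probabilities directly and are written for illustration only, using the usual definitions: entropy as the negative sum of p * log2(p), and Gini impurity as 1 minus the sum of squared probabilities.

```python
import numpy as np


def entropy(probs):
    """Entropy in bits for a vector of class probabilities."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]  # treat 0 * log(0) as 0
    # Adding 0.0 turns a possible -0.0 into 0.0 for nicer printing.
    return float(-np.sum(p * np.log2(p)) + 0.0)


def gini(probs):
    """Gini impurity for a vector of class probabilities."""
    p = np.asarray(probs, dtype=float)
    return float(1.0 - np.sum(p ** 2))


cases = {
    "pure binary node": [1.0, 0.0],
    "50/50 binary node": [0.5, 0.5],
    "three equally likely classes": [1 / 3, 1 / 3, 1 / 3],
}
for name, probs in cases.items():
    print(f"{name}: entropy={entropy(probs):.3f}, gini={gini(probs):.3f}")
```

The three-class case shows entropy going above 1 while Gini impurity stays below 1.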

So, which one to use might be the first question! The answer lies in the mathematical formulation of the two approaches. For entropy we have to calculate the log of each probability, whereas for Gini impurity we just have to calculate the square. In terms of computation time, calculating a log costs more than calculating a square. Hence Gini impurity is often preferred, and it is the default criterion in sklearn's implementation of the decision tree algorithm.
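In sklearn the choice is made through the `criterion` parameter of `DecisionTreeClassifier`. The sketch below, assuming the iris data and an arbitrary `random_state=0`, checks the default and reads the impurity of the root node under each criterion.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# With no criterion given, sklearn uses Gini impurity.
gini_tree = DecisionTreeClassifier(random_state=0)
print("default criterion:", gini_tree.criterion)
gini_tree.fit(X, y)

# Entropy has to be requested explicitly.
entropy_tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Impurity of the root node: iris has three equally sized classes, so the
# root is as mixed as a three-class node can be.
root_gini = gini_tree.tree_.impurity[0]
root_entropy = entropy_tree.tree_.impurity[0]
print(f"root Gini impurity: {root_gini:.3f}")
print(f"root entropy:       {root_entropy:.3f}")
```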


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
from sklearn.datasets import load_iris