The decision tree algorithm belongs to the family of supervised machine learning algorithms. It can be used for both classification and regression problems.<br>
The goal of the algorithm is to build a model that predicts the value of a target variable. It does so with a tree representation in which each internal node tests an attribute and each leaf node corresponds to a class label (or, for regression, a predicted value).<br>

Assumptions that we make while using a decision tree:
1. In the beginning, we consider the whole training set as the root.
2. Feature values are preferred to be categorical. If the values are continuous, they are discretized before building the model.
3. Records are distributed recursively based on attribute values.
4. We use a statistical measure to decide which attribute is placed at the root node or at each internal node.

## Entropy

Entropy is a measure of the impurity, disorder, or uncertainty in a set of examples.

**Purpose of Entropy:**<br>
Entropy controls how a decision tree decides to split the data. It affects how a decision tree draws its boundaries.<br>
For a two-class problem, entropy ranges from 0 to 1 (in general, from 0 to $\log_2 n$ for $n$ classes). The lower the entropy, the purer the node and the more reliable its prediction.

$$ Entropy = -\sum_{i=1}^n p_i \log_2(p_i) $$
where $p_i$ is the probability of class $i$ and $n$ is the number of classes.

**Pure subset:** A subset is pure when all of its examples belong to the same class, for example all yes or all no. Its entropy is 0.
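
As a minimal sketch of the entropy formula, the function below computes base-2 entropy from a list of class labels. The function name `entropy` and the toy label lists are illustrative assumptions, not part of any particular library.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Base-2 entropy of a sequence of class labels (illustrative helper)."""
    total = len(labels)
    if total == 0:
        return 0.0
    # Only classes that actually occur are counted, so log2(0) never arises.
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


pure = ["yes"] * 6                  # a pure subset: every example is "yes"
mixed = ["yes"] * 3 + ["no"] * 3    # a 50/50 subset: maximum uncertainty

print("pure subset entropy: ", entropy(pure))
print("mixed subset entropy:", entropy(mixed))
```

The pure subset has zero entropy, while the evenly mixed two-class subset reaches the two-class maximum of 1.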

## Information Gain
Information gain is used to decide which feature to split on at each step in building the tree. Simpler trees are preferred, so we want to keep the tree small. To do so, at each step we choose the split that results in the purest child nodes. A commonly used measure of purity is called information. <br>

Information gain quantifies which feature provides the most information about the classification, based on the notion of entropy. It measures how much a split reduces uncertainty, disorder, or impurity, with the aim of decreasing entropy from the top of the tree (the root node) to the bottom (the leaf nodes).<br>

The formula for information gain is:

$$ IG = E_{before} - \sum_{i=1}^n \frac{N_i}{N} E_{after,i} $$
$E_{before}$ = entropy of the node before the split<br>
$E_{after,i}$ = entropy of the $i$-th subset after the split<br>
$N_i$ = number of elements that fall into the $i$-th subset after the split<br>
$N$ = total number of elements in the node before the split<br>
$n$ = total number of subsets produced by the split at that node
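
The information gain formula can be sketched in a few lines of Python. The helper names `entropy` and `information_gain`, and the toy parent and child label lists, are assumptions made for illustration only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Base-2 entropy of a sequence of class labels (illustrative helper)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy before the split minus the size-weighted entropy after it."""
    n = len(parent)
    weighted_after = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted_after


parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes completely.
perfect_split = [["yes"] * 5, ["no"] * 5]
# A split whose subsets keep the parent's 50/50 class mix.
useless_split = [["yes", "no"], ["yes"] * 4 + ["no"] * 4]

print("perfect split gain:", information_gain(parent, perfect_split))
print("useless split gain:", information_gain(parent, useless_split))
```

The perfect split recovers all of the parent's entropy as gain, whereas the split that preserves the class mix gains essentially nothing.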

## Gini Index

Gini index, also known as Gini impurity, measures the probability that a randomly selected element would be classified incorrectly if it were labeled at random according to the class distribution in the node. If all the elements belong to a single class, the node is called pure.<br>

The minimum value of the Gini index is 0. This happens when the node is pure, meaning that all the elements in the node belong to one class, so the node will not be split again. The best split is therefore the one whose child nodes have the lowest weighted Gini index. The Gini index reaches its maximum when all classes are equally likely; for two classes, that maximum is 0.5.

$$ \text{Gini Index} = 1 - \sum_{i=1}^n (p_i)^2 $$

**The Classification and Regression Tree (CART) algorithm uses the Gini index to produce binary splits.**<br>
More generally, decision tree algorithms choose the split that most reduces impurity, and either the Gini index or entropy can serve as the impurity measure behind that comparison.
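
The Gini index of a node, and the size-weighted Gini index of a candidate split, might be computed as follows. This is a sketch under the assumption that a split is scored by the weighted average of its child nodes; the names `gini` and `weighted_gini` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def weighted_gini(children):
    """Size-weighted average Gini index over the child nodes of a split."""
    n = sum(len(child) for child in children)
    return sum(len(child) / n * gini(child) for child in children)


print("pure node: ", gini(["a"] * 4))             # all one class
print("50/50 node:", gini(["a", "a", "b", "b"]))  # two equally likely classes

# Two candidate binary splits of the same four elements.
clean_split = [["a", "a"], ["b", "b"]]
mixed_split = [["a", "b"], ["a", "b"]]
print("clean split:", weighted_gini(clean_split))
print("mixed split:", weighted_gini(mixed_split))
```

The split with the lower weighted Gini index (here, the clean one) would be preferred.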

## Gini Index vs Entropy
The figure below compares the Gini index and entropy:

![image.png](attachment:image.png)

1. For a two-class problem, the Gini index takes values in the interval [0, 0.5], whereas entropy takes values in [0, 1]. The figure shows both. The Gini index is also plotted multiplied by two to make the differences concrete; they are not very significant.
2. Computationally, entropy is more expensive because it uses logarithms, so the Gini index is faster to calculate.
3. **The Gini index is used by the CART algorithm, whereas information gain is used by the ID3 and C4.5 algorithms.**
4. In CART, the Gini index is applied to a categorical target variable, framed as “success” or “failure”, and produces only binary splits. Information gain, in contrast, computes the difference between the entropy before and after the split, which indicates how much the impurity of the classes has been reduced.
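
To make the comparison concrete for the two-class case, the sketch below tabulates entropy, the Gini index, and twice the Gini index as a function of the probability `p` of one class. The function names `binary_entropy` and `binary_gini` are illustrative assumptions.

```python
from math import log2


def binary_entropy(p):
    """Entropy of a two-class node where one class has probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a pure node; avoids evaluating log2(0)
    return -p * log2(p) - (1 - p) * log2(1 - p)


def binary_gini(p):
    """Gini index of a two-class node where one class has probability p."""
    return 1 - p ** 2 - (1 - p) ** 2


print(f"{'p':>5} {'entropy':>8} {'gini':>6} {'2*gini':>7}")
for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    h, g = binary_entropy(p), binary_gini(p)
    print(f"{p:>5.2f} {h:>8.3f} {g:>6.3f} {2 * g:>7.3f}")
```

Both measures are zero for pure nodes and peak at `p = 0.5`, where entropy is 1 and the Gini index is 0.5.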

**The ID3 algorithm can be used to construct a decision tree for regression by replacing Information Gain with Standard Deviation Reduction**.

## Standard Deviation Reduction

The standard deviation reduction is based on the decrease in standard deviation after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest standard deviation reduction (i.e., the most homogeneous branches).

a) Standard deviation for one attribute:

![image.png](attachment:image.png)

The coefficient of variation (CV), the standard deviation divided by the mean, is used to decide when to stop branching.
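
A rough sketch of standard deviation reduction is shown below. It assumes the population standard deviation, a size-weighted average over the branches, and CV computed as standard deviation divided by mean; the function names and the toy target values are hypothetical.

```python
from statistics import mean, pstdev


def sd_reduction(targets, groups):
    """Standard deviation before the split minus the weighted SD after it."""
    n = len(targets)
    weighted_sd = sum(len(g) / n * pstdev(g) for g in groups)
    return pstdev(targets) - weighted_sd


def coefficient_of_variation(targets):
    """CV as standard deviation over mean (assumes a non-zero mean)."""
    return pstdev(targets) / mean(targets)


targets = [10, 12, 11, 30, 32, 31]

# Splitting on an attribute that separates low from high target values...
good_groups = [[10, 12, 11], [30, 32, 31]]
# ...versus one that mixes them.
poor_groups = [[10, 30, 11], [12, 32, 31]]

print("good split SDR:", round(sd_reduction(targets, good_groups), 3))
print("poor split SDR:", round(sd_reduction(targets, poor_groups), 3))
print("CV of a good branch:", round(coefficient_of_variation(good_groups[0]), 3))
```

A branch whose CV falls below a chosen threshold is homogeneous enough that further splitting may not be worthwhile; the threshold itself is a modelling choice.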


## Information Gain

Information gain or IG is a statistical property that measures how well a given attribute separates the training examples according to their target classification. Constructing a decision tree is all about finding an attribute that returns the highest information gain and the smallest entropy.

![image.png](attachment:image.png)

Information gain is the decrease in entropy. It computes the difference between the entropy before the split and the weighted average entropy after the split of the dataset on the given attribute's values. The ID3 (Iterative Dichotomiser 3) decision tree algorithm uses information gain.

Mathematically, IG is represented as:

![image-2.png](attachment:image-2.png)

In a much simpler way, we can conclude that:

![image-3.png](attachment:image-3.png)

Where “before” is the dataset before the split, K is the number of subsets generated by the split, and (j, after) is subset j after the split.

## Gain ratio
 
Information gain is biased towards attributes with a large number of distinct values, and tends to choose them as root nodes.<br>

C4.5, an improvement of ID3, uses gain ratio, a modification of information gain that reduces this bias and is often preferred. Gain ratio overcomes the problem by taking into account the number and size of the branches that a split would produce before making it. It corrects information gain by dividing it by the intrinsic information of the split.<br>

Consider a dataset of users and their movie genre preferences, described by variables such as gender, age group, and rating. Suppose information gain leads you to split on ‘Gender’ first (assuming it has the highest information gain). At the next level, ‘Age Group’ and ‘Rating’ may look equally important; gain ratio penalizes the variable with more distinct values, which helps us decide the split at that level.<br>

![image.png](attachment:image.png)

Where “before” is the dataset before the split, K is the number of subsets generated by the split, and (j, after) is subset j after the split.
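
The penalty that gain ratio applies to many-valued attributes can be sketched as follows, using the usual C4.5 definition: information gain divided by the split information $-\sum_j \frac{N_j}{N}\log_2\frac{N_j}{N}$. The function names and the toy data are assumptions for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Base-2 entropy of a sequence of class labels (illustrative helper)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


def split_information(n, children):
    """Intrinsic information of a split: entropy of the branch sizes."""
    return sum(-(len(c) / n) * log2(len(c) / n) for c in children if c)


def gain_ratio(parent, children):
    """Information gain divided by the split information."""
    n = len(parent)
    gain = entropy(parent) - sum(len(c) / n * entropy(c) for c in children)
    info = split_information(n, children)
    return gain / info if info > 0 else 0.0


parent = ["yes"] * 4 + ["no"] * 4

# A two-valued attribute that separates the classes.
two_way = [["yes"] * 4, ["no"] * 4]
# An ID-like attribute with one distinct value per row.
eight_way = [[label] for label in parent]

print("two-way gain ratio:  ", gain_ratio(parent, two_way))
print("eight-way gain ratio:", round(gain_ratio(parent, eight_way), 3))
```

Both splits have the same information gain, yet the eight-way split scores a much lower gain ratio because its split information is larger.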

## Reduction in Variance

Reduction in variance is a splitting criterion used for continuous target variables (regression problems). It uses the standard formula of variance to choose the best split. The split with the lowest weighted variance is selected to split the population:

![image.png](attachment:image.png)

Above, X-bar is the mean of the values, X is an actual value, and n is the number of values.

Steps to calculate variance:
1. Calculate the variance of each node.
2. Calculate the variance of each split as the weighted average of its node variances.
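
The two steps above can be sketched as follows. The code assumes the population variance and compares two hypothetical candidate splits of the same target values; `weighted_variance` and the split names are illustrative.

```python
from statistics import pvariance


def weighted_variance(groups):
    """Variance of a split: size-weighted average of the node variances."""
    n = sum(len(g) for g in groups)
    return sum(len(g) / n * pvariance(g) for g in groups)


# Two hypothetical ways of splitting the targets [1, 2, 3, 10, 11, 12].
candidate_splits = {
    "split_A": [[1, 2, 3], [10, 11, 12]],
    "split_B": [[1, 10], [2, 3, 11, 12]],
}

# Step 1 and step 2: node variances, then their weighted average per split.
scores = {name: weighted_variance(groups) for name, groups in candidate_splits.items()}
for name, score in scores.items():
    print(f"{name}: weighted variance = {score:.3f}")

# The split with the lowest weighted variance is selected.
best = min(scores, key=scores.get)
print("selected:", best)
```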

## How to avoid/counter Overfitting in Decision Trees?
 
A common problem with decision trees, especially on tables with many columns, is that they overfit. Sometimes it looks as if the tree has memorized the training data set. If no limit is set on a decision tree, it can reach 100% accuracy on the training data set, because in the worst case it ends up creating one leaf for each observation. This hurts accuracy when predicting samples that are not part of the training set.

Here are two ways to reduce overfitting:
1. Pruning decision trees
2. Random forest

### Pruning Decision Trees
The splitting process grows the tree until the stopping criteria are reached. A fully grown tree, however, is likely to overfit the data, leading to poor accuracy on unseen data.<br>
In pruning, you trim off branches of the tree, i.e., remove decision nodes starting from the leaves, as long as the overall accuracy is not harmed. This is done by splitting the original training set into two sets: a training data set, D, and a validation data set, V. Build the decision tree using the training data set, D. Then keep trimming the tree to optimize accuracy on the validation data set, V.

![image.png](attachment:image.png)

In the above diagram, the ‘Age’ attribute in the left-hand side of the tree has been pruned as it has more importance on the right-hand side of the tree, hence removing overfitting.

We can also reduce overfitting by tuning parameters such as:
- max_leaf_nodes
- min_samples_leaf
- max_depth

Pruning parameters:
- max_leaf_nodes<br>
   - Limits the number of leaf nodes
- min_samples_leaf<br>
   - Sets the minimum number of samples allowed in a leaf<br>
   - The minimum sample size in terminal nodes can be fixed to, for example, 30, 100, 300, or 5% of the total<br>
- max_depth<br>
   - Limits the depth of the tree to build a more general tree
   - Set the depth to, for example, 3, 5, or 10, chosen after verification on validation data
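
These parameter names match those of scikit-learn's `DecisionTreeClassifier`. Assuming scikit-learn is the library in use (the text does not say so explicitly), a small sketch comparing an unconstrained tree with a constrained one could look like this. The iris dataset and the specific parameter values are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# No limits: the tree keeps splitting until its leaves are pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Growth limited by the three parameters discussed above (values are examples).
small_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, max_leaf_nodes=8, random_state=0
).fit(X_train, y_train)

for name, tree in [("unconstrained", full_tree), ("constrained", small_tree)]:
    print(
        f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
        f"train acc={tree.score(X_train, y_train):.3f}, "
        f"validation acc={tree.score(X_val, y_val):.3f}"
    )
```

Comparing training and validation accuracy for several parameter settings is one way to pick values that generalize rather than memorize.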

### Random Forest
Random forest is an example of ensemble learning, in which we combine multiple models (here, many decision trees) to obtain better predictive performance.<br>

Why the name “Random”?<br>
Two key concepts that give it the name random:<br>

- Random sampling of the training data set when building each tree.
- Random subsets of features considered when splitting nodes.

A technique known as bagging (bootstrap aggregating) is used to create an ensemble of trees, where multiple training sets are generated by sampling with replacement.<br>
In bagging, N samples are drawn from the data set by random sampling with replacement. Then, using a single learning algorithm, a model is built on each sample. Finally, the resulting predictions, which can be computed in parallel, are combined by voting (classification) or averaging (regression).
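
The bagging idea can be illustrated with a deliberately tiny sketch in plain Python. The weak learner here is a hypothetical one-feature threshold rule ("stump") standing in for a full decision tree, and only the bootstrap-and-vote part is shown; the random feature subsets used by a random forest are not.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data at random, with replacement."""
    return [rng.choice(data) for _ in data]


def fit_stump(sample):
    """Hypothetical weak learner: the threshold t with the fewest training
    errors for the rule 'predict 1 if x > t, else 0'."""
    best_t, best_err = None, None
    for t, _ in sample:
        err = sum(int(x > t) != y for x, y in sample)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


def bagging_predict(thresholds, x):
    """Combine the stumps' predictions by majority vote."""
    votes = [int(x > t) for t in thresholds]
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)
data = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1)]  # (feature, label) pairs

# One model per bootstrap sample; an odd count avoids tied votes.
stumps = [fit_stump(bootstrap_sample(data, rng)) for _ in range(25)]

print("prediction for x=0:", bagging_predict(stumps, 0))
print("prediction for x=7:", bagging_predict(stumps, 7))
```

For regression, the voting step would be replaced by averaging the individual predictions.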



## Which is better: linear or tree-based models?
 
Well, it depends on the kind of problem you are solving.
1. If the relationship between the dependent and independent variables is well approximated by a linear model, linear regression will usually outperform a tree-based model.
2. If the relationship between the dependent and independent variables is highly non-linear and complex, a tree model will usually outperform a classical regression method.
3. If you need a model that is easy to explain to people, a decision tree is often a better choice than a linear model. Small decision trees can be even simpler to interpret than linear regression.
