# Decision Trees

A decision tree is a prediction model. Given a dataset, it generates a series of logical conditions that partition the data into categories. The input values can be discrete or continuous.
Decision trees are a supervised learning technique that can be applied to both regression and classification problems.

<img src="images/7_decisiontrees_simpsons1.png">
<img src="images/7_decisiontrees_simpsons2.png">
<img src="images/7_decisiontrees_iris.png">
<img src="images/7_decisiontrees_administrarmedicamento.png">

The elements of a decision tree are:
- __Nodes__: Each node represents a logical condition that depends on only one feature. It can be seen as an if...else statement. Several nodes may use the same feature.
- __Branches__: They grow out of the nodes and correspond to the possible outcomes of the logical condition. Each node can have two or more branches.
- __Leaves__: They are the terminal nodes and correspond to the final conclusions.
    - If it is a classification decision tree: a leaf node represents a class
    - If it is a regression decision tree: a leaf node represents a value
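
The elements above can be sketched as a small data structure. This is only an illustration: the class names `Node` and `Leaf`, the feature name `petal_length`, and the thresholds are hypothetical and not taken from the figures, and the sketch is limited to binary nodes on numeric features for simplicity.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    """Final node: holds a class label (classification) or a number (regression)."""
    prediction: Union[str, float]


@dataclass
class Node:
    """Internal node: a condition on a single feature with two branches."""
    feature: str
    threshold: float
    left: Union["Node", Leaf]   # branch taken when sample[feature] <= threshold
    right: Union["Node", Leaf]  # branch taken otherwise


def predict(tree, sample):
    """Follow the branches from the root until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.left if sample[tree.feature] <= tree.threshold else tree.right
    return tree.prediction


# Hypothetical hand-built tree; the same feature is used by two nodes.
toy_tree = Node(
    feature="petal_length", threshold=2.5,
    left=Leaf("setosa"),
    right=Node(
        feature="petal_length", threshold=4.8,
        left=Leaf("versicolor"),
        right=Leaf("virginica"),
    ),
)

print(predict(toy_tree, {"petal_length": 1.4}))
print(predict(toy_tree, {"petal_length": 5.6}))
```

Each `Node` plays the role of one if...else statement, and prediction is just a walk from the root to a `Leaf`.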

## ID3 algorithm

ID3 is an algorithm that builds classification decision trees from discrete inputs. It generates the nodes recursively from top to bottom. Each time a node is generated, it evaluates the gain of every candidate split to decide which feature to use.

<img src="images/7_decisiontrees_id3_equation.png">
<img src="images/7_decisiontrees_id3_ejemplo1.png">
<img src="images/7_decisiontrees_id3_ejemplo2.png">
<img src="images/7_decisiontrees_id3_ejemplo3.png">
<img src="images/7_decisiontrees_id3_ejemplo4.png">
<img src="images/7_decisiontrees_id3_ejemplo5.png">
<img src="images/7_decisiontrees_id3_ejemplo6.png">
<img src="images/7_decisiontrees_id3_ejemplo7.png">
<img src="images/7_decisiontrees_id3_ejemplo8.png">
<img src="images/7_decisiontrees_id3_ejemplo9.png">
<img src="images/7_decisiontrees_id3_ejemplo10.png">
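
The feature selection step of ID3 can be sketched as follows, assuming the gain is the usual entropy-based information gain. The function names and the toy data are hypothetical and do not reproduce the worked example in the figures.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy of the parent minus the weighted entropy of the children
    obtained by splitting on every distinct value of `feature`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


def best_feature(rows, labels, features):
    """Feature with the highest information gain (the choice made at a node)."""
    return max(features, key=lambda f: information_gain(rows, labels, f))


# Hypothetical toy data with two discrete features.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

print(best_feature(rows, labels, ["outlook", "windy"]))
```

A full ID3 implementation would call `best_feature` recursively on each subset of rows until the labels in a subset are pure or no features remain.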

## CART algorithm (Classification and Regression Trees)

CART is an algorithm that builds decision trees for classification and regression from discrete and continuous data. It always creates binary trees.

The nodes are split as follows:
- If the feature is numeric, the condition has the following structure:

    If [feature_value] <= [value] then <NODE1>, else <NODE2>

    
- If the feature is categorical, the condition has the following structure:
    
    If [feature_value] is in {value1,value2,…,valuen} then <NODE1>, else <NODE2>

Every possible split must be tested, including every cut point, and the one with the largest split gain is performed.
    
<img src="images/7_decisiontrees_cart_equation.png">
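
The exhaustive search over cut points for one numeric feature can be sketched like this. It assumes Gini impurity as the measure, a gain defined as the parent impurity minus the weighted impurity of the two children, and midpoints between consecutive distinct values as candidate thresholds (a common convention, not something stated above). The function names and data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_numeric_split(values, labels):
    """Try every cut point `value <= threshold` and return (threshold, gain)
    for the split with the largest impurity reduction."""
    n = len(labels)
    parent = gini(labels)
    best_threshold, best_gain = None, 0.0
    # Candidate thresholds: midpoints between consecutive distinct values.
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        gain = parent - (len(left) / n * gini(left) + len(right) / n * gini(right))
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain


# Hypothetical toy data.
petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
species = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]

threshold, gain = best_numeric_split(petal_length, species)
print(threshold, gain)
```

In a complete tree builder the same search would run over every feature, and the winning split would be applied before recursing into the two children.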
    
__Impurity measures__

- Classification: Entropy
    
    $H(f) = - \sum_i f_i \log_2 f_i$
    
- Classification: Gini impurity
    
    $I_G(f) = \sum_i f_i (1-f_i) = \sum_i (f_i-f_i^2) = \sum_i f_i - \sum_i f_i^2 = 1 - \sum_i f_i^2$
    
    Where $i$ runs over all the class labels, that is $i \in \{1,2,\dots,m\}$, and $f_i$ represents the proportion of samples labeled as class $i$. It ranges from 0 (a totally pure node) to $1 - \frac{1}{m}$ (all classes equally represented), which approaches 1 as the number of classes grows.

- Regression: Variance
    
    $\sigma^2 = \frac{1}{n}\sum_i (Y_i-\bar{Y})^2$

    Where $\bar{Y}$ is the mean of the $n$ target values $Y_i$ in the node.
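
As a quick numerical check of the three measures, here is a minimal sketch that works directly on class proportions (for entropy and Gini) and on target values (for variance). The function names are hypothetical.

```python
from math import log2


def entropy(f):
    """H(f) = -sum_i f_i * log2(f_i); terms with f_i = 0 contribute 0."""
    return -sum(p * log2(p) for p in f if p > 0)


def gini(f):
    """I_G(f) = 1 - sum_i f_i^2."""
    return 1.0 - sum(p * p for p in f)


def variance(y):
    """Mean squared deviation of the targets from their mean."""
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y) / len(y)


print(entropy([1.0, 0.0]), gini([1.0, 0.0]))   # pure node
print(entropy([0.5, 0.5]), gini([0.5, 0.5]))   # two balanced classes
print(gini([0.25] * 4))                        # four balanced classes
print(variance([1.0, 2.0, 3.0]))
```

With two balanced classes the Gini impurity is 0.5 and with four it is 0.75, which illustrates how the maximum moves toward 1 as classes are added.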
    
## Feature importance
    
The importance of a feature is proportional to the impurity reduction achieved by all the nodes that split on that feature. The impurity reduction $IR_j$ of each node $j$ (each node represents a rule) can be calculated as:
    
$IR_j = w_jI_j - (w_{left}I_{left} + w_{right}I_{right})$
    
where $left$ and $right$ represent the child nodes of node $j$, $I$ represents the impurity of each node, and the weights $w$ are the proportion of samples in each node, calculated as the number of samples in the node divided by the total number of samples. Once the impurity reduction of all nodes is known, the importance $FI_k$ of feature $k$ is calculated as follows, where $N_k$ is the set of nodes that split on feature $k$ and $N$ is the set of all split nodes:
    
$FI_k = \frac{\sum_{j\in N_k}IR_j}{\sum_{j\in N}IR_j}$
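
The two formulas can be traced on a small example. The tree below is hypothetical: the sample counts, impurities, and feature names are made up for illustration, and only the split nodes are listed.

```python
from collections import defaultdict

# Hypothetical fitted tree, described by its split nodes only.
# Each entry: feature used, samples and impurity of the node,
# and (samples, impurity) of its left and right children.
TOTAL_SAMPLES = 100
split_nodes = [
    {"feature": "petal_length", "n": 100, "impurity": 0.50,
     "left": (40, 0.00), "right": (60, 0.40)},
    {"feature": "petal_width", "n": 60, "impurity": 0.40,
     "left": (35, 0.10), "right": (25, 0.08)},
]


def impurity_reduction(node, total):
    """IR_j = w_j * I_j - (w_left * I_left + w_right * I_right)."""
    (n_left, i_left), (n_right, i_right) = node["left"], node["right"]
    return (node["n"] / total) * node["impurity"] - (
        (n_left / total) * i_left + (n_right / total) * i_right
    )


def feature_importances(nodes, total):
    """FI_k = sum of IR_j over nodes splitting on k, divided by the sum over all nodes."""
    reduction = defaultdict(float)
    for node in nodes:
        reduction[node["feature"]] += impurity_reduction(node, total)
    overall = sum(reduction.values())
    return {feature: value / overall for feature, value in reduction.items()}


print(feature_importances(split_nodes, TOTAL_SAMPLES))
```

Because of the normalization in the denominator, the importances of all features add up to 1.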

## Advantages and disadvantages of decision trees

Advantages:
- Easy to interpret and understand
- Require little or no data preprocessing
- Can handle both numerical and categorical data (the sklearn implementation only accepts numerical data)
- It is a white box model
- Work well with large datasets

Disadvantages:
- They are prone to overfitting
- They are based on greedy algorithms, so the resulting tree is not guaranteed to be the globally optimal one

## Ensemble

Ensemble: Construction of several simple learning models, each one trained with the same or a different training dataset. The final decision is calculated with a voting scheme. An ensemble usually obtains better predictive performance than any of its models on its own. Ensembles reduce overfitting because the individual models are kept very simple.

__Random Forest__ is a supervised learning model formed by several decision trees. The trees differ from one another because each one is trained with a randomly selected subset of samples and features.
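
The idea can be sketched with a deliberately simplified ensemble. This is not a full Random Forest: each "tree" here is a one-level tree (a stump) chosen by training error, the feature subset is drawn once per tree, and all function names and data are hypothetical. It only illustrates the random sampling of rows and features and the majority vote.

```python
import random
from collections import Counter


def fit_stump(X, y, features):
    """One-level tree: pick the (feature, threshold) among `features` with the
    fewest training errors; each side predicts its majority class."""
    best = None
    for f in features:
        for threshold in sorted({row[f] for row in X}):
            left = [label for row, label in zip(X, y) if row[f] <= threshold]
            right = [label for row, label in zip(X, y) if row[f] > threshold]
            if not left or not right:
                continue
            left_class = Counter(left).most_common(1)[0][0]
            right_class = Counter(right).most_common(1)[0][0]
            errors = (sum(label != left_class for label in left)
                      + sum(label != right_class for label in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_class, right_class)
    if best is None:  # no valid split: always predict the majority class
        majority = Counter(y).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]


def predict_stump(stump, row):
    f, threshold, left_class, right_class = stump
    return left_class if row[f] <= threshold else right_class


def fit_forest(X, y, n_trees=25, n_features=1, seed=0):
    """Train each stump on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # sample with replacement
        features = rng.sample(range(len(X[0])), n_features)
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], features))
    return forest


def predict_forest(forest, row):
    """Majority vote over the individual trees."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two numeric features, two classes.
X = [[1.0, 1.5], [2.0, 2.5], [3.0, 3.5], [4.0, 4.5],
     [6.0, 6.5], [7.0, 7.5], [8.0, 8.5], [9.0, 9.5]]
y = ["a", "a", "a", "a", "b", "b", "b", "b"]

forest = fit_forest(X, y)
print(predict_forest(forest, [0.0, 0.0]), predict_forest(forest, [10.0, 10.0]))
```

In practice a library implementation such as sklearn's `RandomForestClassifier` would be used instead; it grows deeper trees and handles the sampling internally.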