# WEEK 8: Notes

This week, we'll start with the Decision Tree implementations in SK-Learn, followed by ensemble methods (bagging and boosting).

# Decision Tree
- Non-parametric supervised learning
- Can learn classification as well as regression models
- Predicts labels based on rules inferred from the features in the training set

## Tree Algorithms:
![Screenshot 2023-11-17 at 5.31.00 PM.png](attachment:f516811a-b8dc-4adc-825a-08f3a40bad51.png)

# SK-Learn Implementation of Trees

- SK-Learn uses an optimized version of the <span style="color:purple">CART</span> algorithm

- It <span style="color:red">does not support</span> categorical variables (at least, for now)

```python
from sklearn.tree import (DecisionTreeClassifier,
                          DecisionTreeRegressor)
```

- Both estimators above share the same core parameters; they differ mainly in the values accepted by the ```criterion``` parameter
    ***
    - ```splitter```: **Strategy** for splitting at each node
        - ```'best'```, ```'random'```
        - default "best"
    ***
    
    - ```max_depth```: **Maximum depth** of the tree
        - int values
        - When ```None```, the tree expands until all leaves are pure or contain fewer than ```min_samples_split``` samples
        - Default: ```None```
    ***
    
    - ```min_samples_split```: **Minimum number of samples** required to split an internal node
        - int (absolute count) or float (fraction of the training samples)
        - Default = 2
    ***
    
    - ```min_samples_leaf```: **Minimum number of samples** required to be at a leaf node
        - int (absolute count) or float (fraction of the training samples)
        - Default = 1
    ***
    
    - ```criterion```: **Specifies function** to measure the quality of a split
        - ***Classification***: ```gini```, ```entropy```
            - Default: ```'gini'```
        - ***Regression***: ```squared_error```, ```friedman_mse```, ```absolute_error```, ```poisson```
            - Default: ```'squared_error'```
            
    ***

## Tree Visualization

- A trained tree can be visualized with the ```plot_tree``` API imported below
- Its parameters can be adjusted to customize the visualization


```python
from sklearn.tree import plot_tree
```
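
As a rough sketch of how `plot_tree` might be used, the example below trains a shallow classifier and draws it. The iris dataset, the figure size, and the chosen options (`filled`, `rounded`) are assumptions made for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend for scripts; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

fig, ax = plt.subplots(figsize=(10, 6))
annotations = plot_tree(
    clf,
    feature_names=list(iris.feature_names),  # label splits with feature names
    class_names=list(iris.target_names),     # label nodes with the majority class
    filled=True,                             # color nodes by majority class
    rounded=True,                            # rounded node boxes
    ax=ax,
)
# In a script, call fig.savefig(...) here to write the figure to disk.
```

`plot_tree` returns the list of matplotlib annotation artists it drew, which can be handy for further styling.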

## Avoiding Overfitting of trees

- Overfitting is the most common problem when training a tree
- The following strategies help to avoid it

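1. ```Pre-Pruning```
    - Constrains the tree while it grows, using hyperparameters such as ```max_depth``` and ```min_samples_split```
    - Uses HP search like ```GridSearchCV``` to find the best set of values for these parameters
1. ```Post-Pruning```
    - First grows the tree without constraints, then prunes it back with cost-complexity pruning, which SK-Learn controls through the ```ccp_alpha``` parameter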
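
The pre-pruning idea from the list above can be sketched with `GridSearchCV` as follows. The dataset (`load_breast_cancer`), the grid values, and the 5-fold setting are assumptions chosen for illustration, not recommendations from these notes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Hypothetical grid over the growth-limiting parameters
param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,                 # 5-fold cross-validation on the training set
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)  # uses the refit best estimator

print("best parameters:", best_params)
print(f"held-out accuracy: {test_accuracy:.3f}")
```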

- Pruning is a technique that removes parts of the Decision Tree, or prevents it from growing to its full depth. The parts it removes are those that contribute little power to classify instances. A Decision Tree trained to its full depth is highly likely to overfit the training data, which is why pruning is important.
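
Post-pruning could look roughly like the sketch below, which uses the estimator's `cost_complexity_pruning_path` method to get candidate `ccp_alpha` values and picks one by cross-validation. The dataset and the selection rule (highest mean CV accuracy) are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Step 1: grow the tree without constraints
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2: compute the effective alphas along the pruning path
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
# Drop the last alpha (it prunes down to the root alone) and clip tiny
# negative values that can arise from floating-point noise
alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0.0, None))

# Step 3: pick the alpha with the best cross-validated accuracy
cv_scores = [
    cross_val_score(
        DecisionTreeClassifier(random_state=0, ccp_alpha=alpha),
        X_train, y_train, cv=5,
    ).mean()
    for alpha in alphas
]
best_alpha = float(alphas[int(np.argmax(cv_scores))])

# Step 4: refit with the chosen alpha and compare tree sizes
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha)
pruned_tree.fit(X_train, y_train)

print("nodes before pruning:", full_tree.tree_.node_count)
print("nodes after pruning:", pruned_tree.tree_.node_count)
print(f"pruned test accuracy: {pruned_tree.score(X_test, y_test):.3f}")
```

Larger values of `ccp_alpha` prune more aggressively, so the candidate list runs from the full tree (alpha 0) toward ever smaller trees.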


## Some Practical Usage Tips
- Make sure that we have the right ratio of samples to the number of features. Decision trees tend to overfit when the number of features (d) is large relative to the number of samples

- Perform PCA or feature selection (<span style="color:purple">dimensionality reduction</span>) on the data before using it for training the trees
    - It gives a better chance of finding discriminative features
- Visualize the trained tree by using <span style="color:purple">max_depth=3</span> as an initial tree depth to get a feel for how well the tree fits the data, and then increase the depth
- Balance the dataset before training to prevent the tree from being biased toward the classes that are dominant
- Use ```min_samples_split``` or ```min_samples_leaf``` to ensure that multiple samples influence every decision in the tree, by controlling which splits will be considered.
    - A very small number will usually mean the tree will overfit
    - A large number will prevent the tree from learning the data
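
Several of the tips above can be combined in one pipeline, sketched below under some assumptions: the dataset is `load_breast_cancer`, `n_components=5` is an arbitrary choice, `StandardScaler` is added because PCA is sensitive to feature scale, and `class_weight="balanced"` stands in for balancing the dataset by reweighting classes rather than resampling.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

model = make_pipeline(
    StandardScaler(),            # assumption: scale features before PCA
    PCA(n_components=5),         # dimensionality reduction before the tree
    DecisionTreeClassifier(
        max_depth=3,             # shallow starting depth, as suggested above
        min_samples_leaf=5,      # several samples must support each leaf
        class_weight="balanced", # reweight classes instead of resampling
        random_state=0,
    ),
)

scores = cross_val_score(model, X, y, cv=5)
mean_accuracy = scores.mean()
print(f"mean 5-fold accuracy: {mean_accuracy:.3f}")
```

Wrapping the steps in a pipeline means the scaler and PCA are refit on each training fold, which keeps information from the validation fold out of the preprocessing.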