# Ensembles

- Ensembles combine multiple models to improve prediction quality
- Ensembles often outperform single models in real-world problems and Kaggle competitions

## What are ensembles?

- In data science, an ensemble is a set of models whose predictions are combined into a single prediction.
- Ensembles are most often used for supervised learning tasks (classification and regression).
- Many winning Kaggle solutions rely on ensemble methods.
- Ensembles work because different models make different errors, and combining their predictions lets many of those errors offset one another.

## Resiliency to variance

<img src="images/new_bias-and-variance.png" width=400>

- Ensemble methods use the "wisdom of the crowd" concept to make better predictions.
- The average of many predictions is often closer to the truth than a typical individual prediction.
- "Wisdom of the crowd" works because overestimates and underestimates tend to cancel each other out.
- Estimators are rarely perfect, so we use many models and average their predictions.
- The idea that overestimates and underestimates cancel each other out is called smoothing.
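
The cancelling-out effect described above can be seen in a small simulation. This is only a sketch under stated assumptions: each "estimator" is modeled as the true value plus independent zero-mean Gaussian noise, and the function name `simulate_crowd` and its parameters are made up for illustration.

```python
import random
import statistics


def simulate_crowd(true_value=100.0, n_estimators=50, noise_sd=10.0, seed=0):
    """Return (error of the averaged estimate, mean error of the individual estimates)."""
    rng = random.Random(seed)
    # Assumption: every estimator is unbiased and its noise is independent of the others.
    estimates = [true_value + rng.gauss(0.0, noise_sd) for _ in range(n_estimators)]
    # Error of the ensemble: average first, then compare with the truth.
    ensemble_error = abs(statistics.mean(estimates) - true_value)
    # Typical error of a single estimator: compare each with the truth, then average.
    individual_errors = [abs(e - true_value) for e in estimates]
    return ensemble_error, statistics.mean(individual_errors)


ensemble_error, mean_individual_error = simulate_crowd()
print(f"error of the averaged estimate: {ensemble_error:.2f}")
print(f"mean error of single estimates: {mean_individual_error:.2f}")
```

The averaged estimate can never be worse than the mean individual error (by the triangle inequality), and with independent noise it is usually much better. If the estimators shared the same bias, averaging would not remove it.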

## Which models are used in ensembles?

- Tree-based ensemble methods are the focus of this section, but any models can be used in an ensemble.
- Model stacking, or meta-ensembling, aggregates predictions from several different models, typically by training a further model on the base models' predictions.
- Ensembles can consist of multiple logistic regressions, Naive Bayes classifiers, tree-based models, and even deep neural networks.
- Using different models increases the potential to pick up on different characteristics of the data.
- For more information on model stacking, check out Kaggle's blog, No Free Hunch.

## Bootstrap aggregation

<img src="images/new_bagging.png" width=400>

- Bagging (bootstrap aggregation) is one of the most common ways to build an ensemble. It combines two ideas: bootstrap resampling and aggregation.
- Bootstrap resampling draws a new sample from the dataset with replacement, so some rows appear more than once and others not at all.
- Aggregation combines the individual estimates into a single estimate.
- A common approach for classification is to treat each classifier's prediction as a "vote" and take the majority vote as the overall prediction. For numeric predictions, a (possibly weighted) average is used instead.
- Training an ensemble through bootstrap aggregation means repeatedly drawing a bootstrap sample from the dataset and training a new classifier on it.
- Decision trees are often used in ensembles because they are high-variance models: small changes in the training data produce noticeably different trees, and that diversity becomes an advantage once their predictions are aggregated.
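
The bagging procedure above can be sketched in a few lines of plain Python. This is an illustrative toy, not a production implementation: the base model is a one-feature threshold "stump" standing in for a full decision tree, and the names `fit_stump`, `bagged_stumps`, and `predict` as well as the toy data are assumptions introduced here.

```python
import random
from collections import Counter


def fit_stump(points):
    """Fit a one-feature threshold classifier that predicts 1 when x > threshold."""
    best_threshold, best_correct = None, -1
    for threshold, _ in points:
        # Count how many training points this candidate threshold classifies correctly.
        correct = sum((x > threshold) == bool(y) for x, y in points)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def bagged_stumps(points, n_models=25, seed=0):
    """Bootstrap step: train each stump on a sample drawn with replacement."""
    rng = random.Random(seed)
    return [fit_stump(rng.choices(points, k=len(points))) for _ in range(n_models)]


def predict(thresholds, x):
    """Aggregation step: majority vote over the individual stump predictions."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: the label is 1 when x > 5, with two mislabeled points as noise.
data = [(x, int(x > 5)) for x in range(11)]
data[2] = (2, 1)
data[8] = (8, 0)

ensemble = bagged_stumps(data)
print([predict(ensemble, x) for x in (0, 10)])
```

Each stump sees a slightly different bootstrap sample, so the learned thresholds differ from one model to the next; the vote smooths over those differences.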

# Random Forests

## Understanding the Random forest algorithm

- Random Forest is a supervised learning algorithm for classification and regression tasks.
- It is an ensemble of decision trees, but if all the trees were identical, combining them would not improve performance.
- To make the trees differ from one another, the algorithm uses bagging and the subspace sampling method.

## Bagging

- To encourage differences among trees in a forest, they should be trained on different samples of data.
- Bootstrap aggregation (bagging) gives each tree its own sample of the training data, drawn with replacement.
- Because the sample is drawn with replacement, each tree sees roughly two-thirds (about 63%) of the distinct training rows. The remaining third or so is never drawn for that tree and serves as its internal test set.
- These held-out rows are the out-of-bag (OOB) data, and the error of each new tree on them is its out-of-bag error.
- Bagging alone leaves a weakness: if every tree relies on the same predictors, they may all make the same wrong prediction when one of those predictors provides bad information.
- The second major part of the Random forest algorithm addresses this issue.
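
The "roughly two-thirds in the bag, one-third out of the bag" split follows from sampling with replacement. The sketch below checks it empirically; the function name `bootstrap_split` and the choice of 3000 rows are illustrative assumptions.

```python
import random


def bootstrap_split(n_rows, seed=0):
    """Draw n_rows indices with replacement; return (in-bag draws, out-of-bag indices)."""
    rng = random.Random(seed)
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Out-of-bag rows are the ones that were never drawn for this tree.
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag


n_rows = 3000
in_bag, out_of_bag = bootstrap_split(n_rows)
unique_fraction = len(set(in_bag)) / n_rows
# Probability that a given row is drawn at least once in n_rows draws;
# tends to 1 - 1/e (about 0.632) as n_rows grows.
expected_fraction = 1 - (1 - 1 / n_rows) ** n_rows

print(f"distinct rows in the bootstrap sample: {unique_fraction:.3f}")
print(f"out-of-bag rows:                       {len(out_of_bag) / n_rows:.3f}")
print(f"theoretical in-bag fraction:           {expected_fraction:.3f}")
```

An out-of-bag error estimate would then score each tree only on its own `out_of_bag` rows, which that tree never saw during training.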

## Subspace sampling method

<img src="images/new_rf-diagram.png" width=400>

- To increase variability in the random forest, Subspace sampling is used.
- This method randomly selects a subset of features to use as predictors. In Breiman's random forest the subset is redrawn at every node (split); the example in the following bullets draws it once per tree to keep things simple.
- In a dataset with 3000 rows and 10 columns, 2000 rows are randomly selected with replacement for each tree.
- A tunable parameter determines how many predictors are used, with 6 of the 10 being used in this example.
- The tree is then trained on the resulting dataset of 2000 rows and 6 columns.
- The columns left out in the feature-selection step are also dropped from the out-of-bag rows, which then serve as an internal test set for calculating the tree's out-of-bag error.
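
The worked example above (3000 rows, 10 columns, 2000 row draws, 6 features) can be sketched as index sampling. This follows the simplified per-tree version in these notes, whereas Breiman's random forest redraws the feature subset at every split; the function name `sample_rows_and_columns` and its parameters are assumptions made for illustration.

```python
import random


def sample_rows_and_columns(n_rows=3000, n_cols=10, n_draws=2000, n_features=6, seed=0):
    """Pick the rows and columns one tree is trained on, plus its out-of-bag rows."""
    rng = random.Random(seed)
    # Rows: drawn WITH replacement, so duplicates are expected.
    row_draws = [rng.randrange(n_rows) for _ in range(n_draws)]
    # Columns: drawn WITHOUT replacement, so each feature appears at most once.
    feature_cols = sorted(rng.sample(range(n_cols), n_features))
    # Out-of-bag rows: never drawn, later scored using the same feature columns.
    oob_rows = sorted(set(range(n_rows)) - set(row_draws))
    return row_draws, feature_cols, oob_rows


row_draws, feature_cols, oob_rows = sample_rows_and_columns()
print(f"training rows drawn: {len(row_draws)} ({len(set(row_draws))} distinct)")
print(f"feature columns used: {feature_cols}")
print(f"out-of-bag rows available for testing: {len(oob_rows)}")
```

Repeating this with a different seed for every tree gives each tree its own view of both the rows and the columns.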

## Resiliency to overfitting

- A random forest is created by training multiple decision trees on different sets of data and subsets of features to make predictions.
- This diversity among the trees makes the model resilient to noisy data and reduces the chance of overfitting.
- In a single decision tree or a forest where all trees focus on the same predictors, false signals can lead to incorrect predictions.
- With a random forest and subspace sampling, some nodes in the trees won't even know certain predictors exist, making them "immune" to false signals.
- The "wisdom of the crowd" in the forest cushions the mistakes of individual trees, because it is unlikely that every tree makes the same mistake.
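
One standard way to make the value of diversity precise is the variance of an average of correlated predictions. The symbols here are assumptions introduced for illustration and do not appear in the notes above: $T_1, \dots, T_B$ are the predictions of $B$ trees, each with variance $\sigma^2$, and any two trees have correlation $\rho$.

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

Adding more trees shrinks only the second term. The first term stays unless the correlation $\rho$ between trees is reduced, which is what bagging and subspace sampling aim to do.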

## Making predictions with random forests

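- The random forest algorithm aggregates the predictions of its trees into a final prediction: a majority vote for classification, or an average for regression.
- Benefits include strong performance and a degree of interpretability through feature importances.
- Drawbacks include computational cost and memory usage, both of which grow with the number of trees.
- The random forest algorithm was created by Leo Breiman and Adele Cutler.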

[random forests paper](https://www.stat.berkeley.edu/%7Ebreiman/randomforest2001.pdf)

[random forests website](https://www.stat.berkeley.edu/%7Ebreiman/RandomForests/cc_home.htm)
