Ensembles can give you a boost in accuracy on your dataset. This notebook will step you through Boosting, Bagging and Majority Voting and show you how you can continue to ratchet up the accuracy of the models on your own datasets.

* How to use bagging ensemble methods such as bagged decision trees, random forest and
extra trees.

* How to use boosting ensemble methods such as AdaBoost and stochastic gradient boosting.

* How to use voting ensemble methods to combine the predictions from multiple algorithms.

#### Combine Models Into Ensemble Predictions

The three most popular methods for combining the predictions from different models are:

* **Bagging**. Building multiple models (typically of the same type) from different subsamples
of the training dataset.

* **Boosting**. Building multiple models (typically of the same type) each of which learns to
fix the prediction errors of a prior model in the sequence of models.

* **Voting**. Building multiple models (typically of differing types) and using simple statistics
(like the mean) to combine their predictions.

This notebook assumes you are generally familiar with `machine learning algorithms` and `ensemble methods` and will not go into the details of how the algorithms work or their parameters. The `Pima Indians onset of Diabetes dataset` is used to demonstrate each algorithm. Each `ensemble algorithm` is demonstrated using `10-fold cross-validation` and the `classification accuracy performance metric`.
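For readers who want to see what the evaluation procedure computes, here is a minimal pure-Python sketch of 10-fold cross-validation with classification accuracy. The helper names (`kfold_indices`, `accuracy`, `majority_class`) and the toy labels are assumptions made for illustration; they are not part of scikit-learn or of the Pima dataset. The contiguous, unshuffled folds are meant to resemble what `KFold` does without shuffling.

```python
def kfold_indices(n_samples, n_splits):
    """Split range(n_samples) into contiguous (train, test) index lists."""
    # The first n_samples % n_splits folds receive one extra sample.
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_class(labels):
    """Hypothetical stand-in model: always predict the most common training label."""
    return max(set(labels), key=labels.count)


# Toy labels (not the Pima data): 14 zeros followed by 6 ones.
y = [0] * 14 + [1] * 6
scores = []
for train_idx, test_idx in kfold_indices(len(y), n_splits=10):
    guess = majority_class([y[i] for i in train_idx])
    y_test = [y[i] for i in test_idx]
    scores.append(accuracy(y_test, [guess] * len(y_test)))

# The reported figure is the mean of the per-fold accuracies.
mean_accuracy = sum(scores) / len(scores)
```

Each sample lands in exactly one test fold, so the mean over folds uses every row for testing once while never scoring a model on rows it was trained on.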

### Bagging Algorithms

`Bootstrap Aggregation` (or Bagging) involves taking multiple samples from your training dataset
(with replacement) and training a model for each sample. The final output prediction is averaged
across the predictions of all of the sub-models. The three `bagging models` covered in this section
are as follows:

* Bagged Decision Trees.
* Random Forest.
* Extra Trees.
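The bootstrap-and-combine idea described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the sub-model is a hypothetical one-nearest-neighbour classifier on a single feature (chosen because it has high variance), the data is a toy set, and the sub-model predictions are combined by majority vote rather than by any particular library's rule.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement from the training data."""
    return [rng.choice(rows) for _ in rows]


def fit_1nn(sample):
    """Hypothetical sub-model: copy the label of the closest training point."""
    def predict(x):
        return min(sample, key=lambda row: abs(row[0] - x))[1]
    return predict


def bagged_predict(models, x):
    """Combine the sub-model predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


rng = random.Random(7)
# Toy (feature, label) pairs: small values are class 0, large values class 1.
data = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 1),
        (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]

# Train 100 sub-models, each on its own bootstrap sample.
models = [fit_1nn(bootstrap_sample(data, rng)) for _ in range(100)]

low_prediction = bagged_predict(models, 1.5)
high_prediction = bagged_predict(models, 8.5)
```

Because every sub-model sees a slightly different sample, their individual mistakes tend to differ, and the vote smooths them out.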

#### Bagged Decision Trees

`Bagging` performs best with algorithms that have `high variance`. A popular example is
`decision trees`, often constructed without pruning. The example below uses the `BaggingClassifier` with the Classification and Regression Trees algorithm (`DecisionTreeClassifier`). A total of 100 trees are created.

##### Bagged Decision Trees for Classification

In [42]:
# Import Libraries
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

In [43]:
# Load Dataset
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

In [44]:
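seed = 7
# Folds are not shuffled, so KFold takes no random_state
# (recent scikit-learn versions raise an error if one is passed without shuffle=True)
kfold = KFold(n_splits=10)
cart = DecisionTreeClassifier()
num_trees = 100
# The base model is passed positionally because its keyword was renamed
# from base_estimator to estimator in scikit-learn 1.2
model = BaggingClassifier(cart, n_estimators=num_trees, random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)
print("A robust estimate of model accuracy: ", results.mean())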

A robust estimate of model accuracy:  0.770745044429255


#### Random Forest

`Random Forests` is an extension of `bagged decision trees`. Samples of the training dataset are taken with replacement, but the trees are constructed in a way that reduces the correlation between individual classifiers.

Specifically, rather than greedily choosing the best split point from all features in the construction of each tree, only a random subset of features is considered for each split. You can construct a `Random Forest model` for `classification` using the `RandomForestClassifier class`.
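To make the per-split feature sampling concrete, here is a small standard-library sketch. It is an assumption-laden illustration, not scikit-learn's implementation: `gini`, `best_split` and `random_forest_split` are hypothetical helpers, the impurity measure is binary Gini, and the data is a toy set in which feature 0 separates the classes.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(X, y, candidate_features):
    """Return (impurity, feature, threshold) of the best split among the candidates."""
    best = None
    for f in candidate_features:
        for threshold in sorted({row[f] for row in X}):
            left = [label for row, label in zip(X, y) if row[f] <= threshold]
            right = [label for row, label in zip(X, y) if row[f] > threshold]
            # Size-weighted impurity of the two child nodes.
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best


def random_forest_split(X, y, max_features, rng):
    """Search only a random subset of the features for this split."""
    candidates = rng.sample(range(len(X[0])), max_features)
    return candidates, best_split(X, y, candidates)


# Toy data: feature 0 separates the classes, features 1-3 do not.
X = [[1, 5, 3, 7], [2, 1, 9, 2], [3, 8, 4, 6],
     [7, 2, 8, 1], [8, 6, 2, 9], [9, 3, 7, 4]]
y = [0, 0, 0, 1, 1, 1]

rng = random.Random(7)
candidates, (impurity, feature, threshold) = random_forest_split(
    X, y, max_features=3, rng=rng)

# A plain greedy tree would search every feature instead.
greedy = best_split(X, y, range(len(X[0])))
```

When the strongest feature is left out of the candidate set, the split has to use a weaker one, which is what makes the trees in the forest differ from each other.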

The cells that follow use a Random Forest for classification with 100 trees and split points chosen from a random selection of 3 features.

##### Random Forest Classification

In [45]:
# Import Libraries
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

In [46]:
# Load Dataset
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values

In [47]:
X = array[:,0:8]
Y = array[:,8]

In [48]:
num_trees = 100
max_features = 3
kfold = KFold(n_splits=10)  # random_state has no effect without shuffle=True
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)
results = cross_val_score(model, X, Y, cv=kfold)
print("A mean estimate of classification accuracy.", results.mean())

A mean estimate of classification accuracy. 0.7681989063568011


#### Extra Trees

`Extra Trees` are another modification of `bagging` where `random trees` are constructed from samples of the training dataset.
You can construct an `Extra Trees model` for `classification` using the `ExtraTreesClassifier class`. The example below demonstrates extra trees
with the number of trees set to `100` and splits chosen from `7` random features.

##### Extra Trees Classification

In [49]:
# Import Libraries
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import ExtraTreesClassifier

In [50]:
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values

In [51]:
X = array[:,0:8]
Y = array[:,8]

In [53]:
num_trees = 100
max_features = 7
kfold = KFold(n_splits=10)  # random_state has no effect without shuffle=True
model = ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features)
results = cross_val_score(model, X, Y, cv=kfold)
print("A mean estimate of classification accuracy.", results.mean())

A mean estimate of classification accuracy. 0.756373889268626


### Boosting Algorithms

`Boosting ensemble algorithms` create a sequence of models that attempt to correct the mistakes
of the models before them in the sequence. Once created, the models make `predictions` which
may be weighted by their demonstrated accuracy and the results are combined to create a final
output prediction. The two most common `boosting ensemble machine learning algorithms` are:

* AdaBoost.
* Stochastic Gradient Boosting.

#### AdaBoost

`AdaBoost` was perhaps the `first` successful `boosting ensemble algorithm`. It generally works by
`weighting instances in the dataset by how easy or difficult they are to classify`, allowing the
algorithm to pay more attention to the difficult instances and less to the easy ones in the
construction of subsequent models. You can construct an `AdaBoost model` for `classification`
using the `AdaBoostClassifier class`.
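The instance reweighting can be illustrated with one round of the classic discrete AdaBoost update. This is a sketch under stated assumptions: labels are coded as -1 and +1, the weights sum to 1, the weak learner's weighted error lies strictly between 0 and 1, and `adaboost_round` is a hypothetical helper. scikit-learn's `AdaBoostClassifier` may differ in its details.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One reweighting step of discrete AdaBoost for labels in {-1, +1}."""
    # Weighted error rate of the current weak learner.
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Influence of this learner in the final vote; larger when the error is smaller.
    alpha = 0.5 * math.log((1.0 - error) / error)
    # t * p is +1 for a correct prediction and -1 for a mistake, so
    # mistakes are scaled up by exp(alpha) and correct ones scaled down.
    updated = [w * math.exp(-alpha * t * p)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]   # the weak learner gets only the last instance wrong
weights = [0.2] * 5           # start with uniform weights

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
# The misclassified instance now carries half of the total weight,
# so the next weak learner is pushed to get it right.
```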

##### AdaBoost Classification

In [54]:
# Import Libraries
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier

In [55]:
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values

In [56]:
X = array[:,0:8]
Y = array[:,8]

In [57]:
num_trees = 30
seed=7
kfold = KFold(n_splits=10)  # random_state has no effect without shuffle=True
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)
print("A mean estimate of classification accuracy: ", results.mean())

A mean estimate of classification accuracy:  0.760457963089542


#### Stochastic Gradient Boosting

`Stochastic Gradient Boosting` `(SGBoost)` (also called `Gradient Boosting Machines`) is one of the most
sophisticated `ensemble techniques`. It is also proving to be perhaps one of
the best techniques available for improving performance via ensembles. You can construct a
`Gradient Boosting model` for classification using the `GradientBoostingClassifier class`.
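As a rough picture of what the technique does, the sketch below boosts one-split regression stumps on a toy regression problem. It is a simplification under stated assumptions: it uses squared-error loss (so the negative gradient is just the residual) rather than a classification loss, the helper names `fit_stump`, `stochastic_gradient_boost` and `mse` are hypothetical, and the "stochastic" part is modelled by fitting each stage on a random half of the rows.

```python
import random


def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the largest x so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def stochastic_gradient_boost(xs, ys, n_stages, learning_rate, subsample, rng):
    """Each stage fits a stump to the current residuals of a random subsample."""
    base = sum(ys) / len(ys)  # start from the mean target
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    n_rows = max(2, int(subsample * len(xs)))
    for _ in range(n_stages):
        rows = rng.sample(range(len(xs)), n_rows)           # rows drawn without replacement
        residuals = [ys[i] - predict(xs[i]) for i in rows]  # errors of the model so far
        stumps.append(fit_stump([xs[i] for i in rows], residuals))
    return predict


def mse(predict, xs, ys):
    """Mean squared error of a prediction function."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


xs = [float(x) for x in range(10)]
ys = [x ** 2 for x in xs]  # toy target
baseline = sum(ys) / len(ys)

model = stochastic_gradient_boost(xs, ys, n_stages=50, learning_rate=0.1,
                                  subsample=0.5, rng=random.Random(7))
baseline_mse = mse(lambda x: baseline, xs, ys)
boosted_mse = mse(model, xs, ys)
```

In scikit-learn the corresponding knob is the `subsample` parameter of `GradientBoostingClassifier`, which defaults to 1.0 (every stage sees all rows).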

The cells that follow use `GradientBoostingClassifier` for classification with 100 trees.

##### Stochastic Gradient Boosting Classification

In [58]:
# Import Libraries
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier

In [59]:
# Load Dataset
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values

In [60]:
# Split Dataset
X = array[:,0:8]
Y = array[:,8]

In [61]:
seed = 7
num_trees = 100
kfold = KFold(n_splits=10)  # random_state has no effect without shuffle=True
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)
print("mean estimate of classification accuracy.", results.mean())

mean estimate of classification accuracy. 0.7669002050580999


### Voting Ensemble

`Voting` is one of the simplest ways of combining the predictions from `multiple machine learning algorithms`. It works by first creating `two or more standalone models` from your `training dataset`. A `Voting Classifier` can then be used to wrap your models and combine the predictions of the sub-models (by majority vote or by averaging predicted probabilities) when asked to make predictions for new data. The predictions of the sub-models can be `weighted`, but specifying the weights for classifiers manually or even heuristically is difficult. More advanced methods can learn how to best weight the predictions from sub-models; this is called `stacking` `(stacked generalization)` and is available in scikit-learn 0.22 and later as `StackingClassifier`.
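The two ways of combining sub-model predictions can be sketched with the standard library alone. The helpers `hard_vote` and `soft_vote` and the toy predictions are assumptions for illustration; they are not the `VotingClassifier` API.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the label predicted by the most sub-models."""
    return Counter(predictions).most_common(1)[0][0]


def soft_vote(probabilities, weights=None):
    """Average per-class probabilities (optionally weighted); return (class index, averages)."""
    weights = weights or [1.0] * len(probabilities)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / sum(weights)
        for c in range(n_classes)
    ]
    return averaged.index(max(averaged)), averaged


# Toy outputs of three sub-models for one sample with classes 0 and 1.
labels = [1, 0, 1]
probs = [[0.45, 0.55], [0.90, 0.10], [0.40, 0.60]]

hard = hard_vote(labels)            # two of three sub-models say class 1
soft, averaged = soft_vote(probs)   # the confident second model tips the average to class 0

# Down-weighting the second model changes the soft vote.
weighted, _ = soft_vote(probs, weights=[1.0, 0.2, 1.0])
```

The example shows why the two rules can disagree: a majority vote counts each model once, while averaging lets one very confident model outweigh two hesitant ones.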

You can create a `voting ensemble model` for classification using the `VotingClassifier class`. The code below provides an example of combining the predictions of `logistic regression`, `classification and regression trees` and `support vector machines` together for a `classification problem`.

##### Voting Ensemble for Classification

In [62]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [63]:
# Import libraries
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

In [64]:
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
kfold = KFold(n_splits=10)  # random_state has no effect without shuffle=True

In [65]:
# create the sub models
estimators = []
model1 = LogisticRegression()
estimators.append(('logistic', model1))
model2 = DecisionTreeClassifier()
estimators.append(('cart', model2))
model3 = SVC()
estimators.append(('svm', model3))

In [41]:
# create the ensemble model
ensemble = VotingClassifier(estimators)
results = cross_val_score(ensemble, X, Y, cv=kfold)
print("mean estimate of classification accuracy:", results.mean())

mean estimate of classification accuracy: 0.7330143540669857
