## Improve Performance with Ensembles

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
dataset = pd.read_csv('dataset/diabetes.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [4]:
from sklearn.model_selection import KFold, cross_val_score
kfold = KFold(n_splits=10, random_state=0, shuffle=True)

### Combine Models Into Ensemble Predictions
- <b>Bagging:</b> Building multiple models (typically of the same type) from different sub-samples of the training dataset

- <b>Boosting:</b> Building multiple models (typically of the same type) each of which learns to fix the prediction errors of a prior model in the sequence of models

- <b>Voting:</b> Building multiple models (typically of differing types) and combining their predictions with simple statistics (such as the mean or a majority vote).
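As a rough illustration of how voting combines predictions, the sketch below applies a majority vote and a mean of predicted probabilities to small hand-written arrays. The arrays are made-up values for three hypothetical models, not output from the diabetes data.

```python
import numpy as np

# Hypothetical class predictions from three models (rows) for five samples (columns).
hard_preds = np.array([
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 1, 1],
])

# Majority vote: a sample is labelled 1 when at least two of the three models say 1.
majority = (hard_preds.sum(axis=0) >= 2).astype(int)
print(majority)

# Hypothetical predicted probabilities of class 1 from the same three models.
probs = np.array([
    [0.9, 0.4, 0.6, 0.8, 0.3],
    [0.6, 0.7, 0.2, 0.9, 0.4],
    [0.3, 0.2, 0.8, 0.7, 0.6],
])

# Mean-based ("soft") combination: average the probabilities, then threshold at 0.5.
mean_prob = probs.mean(axis=0)
soft = (mean_prob >= 0.5).astype(int)
print(soft)
```

The two rules happen to agree on these values; in general they can differ when one model is very confident and the others are not.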

### Bagging Algorithms

#### Bagged Decision Trees
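Bagging works best with high-variance algorithms such as unpruned decision trees, because averaging many models trained on bootstrap samples reduces variance.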

In [6]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

cart = DecisionTreeClassifier()
num_trees = 100
model = BaggingClassifier(estimator=cart, n_estimators=num_trees, random_state=0)
results = cross_val_score(model, X, y, cv=kfold)
print(results.mean())

0.7733937115516063
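To make the mechanism behind `BaggingClassifier` more concrete, here is a minimal hand-rolled sketch of bagging: draw bootstrap samples, fit one tree per sample, and take a majority vote. It runs on synthetic data from `make_classification` (an assumption, used so the block is self-contained), and names such as `X_demo` and `n_models` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

rng = np.random.default_rng(0)
n_models = 25  # odd, so a 0/1 majority vote cannot tie
all_preds = []
for _ in range(n_models):
    # Bootstrap: draw len(X_train) row indices with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    all_preds.append(tree.predict(X_test))

# Labels are 0/1, so the mean across trees is the share of votes for class 1.
bagged_pred = (np.mean(all_preds, axis=0) >= 0.5).astype(int)
bagged_acc = np.mean(bagged_pred == y_test)
print(bagged_acc)
```

This is only a sketch of the idea; `BaggingClassifier` adds options such as feature subsampling and out-of-bag scoring.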


#### Random Forest
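An extension of bagged decision trees: each tree is still trained on a bootstrap sample, but only a random subset of features (`max_features`) is considered at each split, which decorrelates the trees.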

In [13]:
from sklearn.ensemble import RandomForestClassifier

num_trees = 100
max_features = 3
# No random_state is set here, so the score below varies slightly between runs.
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)

results = cross_val_score(model, X, y, cv=kfold)
print(results.mean())

0.7721291866028708
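Because a random forest draws bootstrap samples and feature subsets at random, its score changes between runs unless a seed is fixed. The sketch below, on synthetic data (an assumption) and with a hypothetical helper `seeded_scores`, checks that passing `random_state` makes repeated cross-validation runs identical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

def seeded_scores():
    # Hypothetical helper: same seed for the forest on every call.
    model = RandomForestClassifier(n_estimators=20, max_features=3, random_state=0)
    return cross_val_score(model, X_demo, y_demo, cv=cv)

scores_a = seeded_scores()
scores_b = seeded_scores()

# With a fixed seed and fixed folds, both runs should give the same fold scores.
print(np.array_equal(scores_a, scores_b))
```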


#### Extra Trees
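Extra Trees (extremely randomized trees) is another variation on bagged trees in which the split thresholds themselves are randomized. Note that scikit-learn's `ExtraTreesClassifier` defaults to `bootstrap=False`, so each tree is fit on the full training set unless bootstrapping is enabled.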

##### Random Forest vs Extra Trees:
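- random forest: searches for the best threshold on each candidate feature; typically slower to train and often slightly more accurate.
- extra trees: draws a random threshold for each candidate feature and keeps the best of those; typically faster to train and sometimes slightly less accurate.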
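The difference in how the two methods pick a split threshold can be sketched on a single toy feature. The code below is a simplified illustration, not scikit-learn's internal implementation: the data, the `gini` and `weighted_gini` helpers, and the midpoint candidates are all assumptions made for the example.

```python
import numpy as np

def gini(labels):
    # Gini impurity for binary 0/1 labels: 2 * p * (1 - p).
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels)
    return 2.0 * p * (1.0 - p)

def weighted_gini(x, y, threshold):
    # Size-weighted impurity of the two children produced by the split.
    left, right = y[x <= threshold], y[x > threshold]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(y)

# Toy feature and labels (made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])

# Random-forest style: evaluate every midpoint and keep the best one.
midpoints = (x[:-1] + x[1:]) / 2.0
best_threshold = min(midpoints, key=lambda t: weighted_gini(x, y, t))

# Extra-trees style: draw one threshold uniformly between min and max.
rng = np.random.default_rng(0)
random_threshold = rng.uniform(x.min(), x.max())

print(best_threshold, weighted_gini(x, y, best_threshold))
print(random_threshold, weighted_gini(x, y, random_threshold))
```

The exhaustive search costs one impurity evaluation per candidate threshold, while the random draw costs one per feature, which is where the training-time difference comes from.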

In [15]:
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features)
results = cross_val_score(model, X, y, cv=kfold)
results.mean()

0.7604066985645934

### Boosting Algorithms

#### AdaBoost
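AdaBoost was the first widely successful boosting algorithm. It reweights the training instances after each round so that later models focus on the examples earlier models misclassified.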

In [16]:
from sklearn.ensemble import AdaBoostClassifier

num_trees = 30
seed = 7

model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
results = cross_val_score(model, X, y, cv=kfold)
print(results.mean())

0.7577922077922079
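To show what "learning from the errors of a prior model" looks like numerically, here is a sketch of one round of the classic discrete AdaBoost weight update for labels in {-1, +1}. The label and prediction arrays are made up, and scikit-learn's `AdaBoostClassifier` may differ in details, so treat this as an illustration of the idea rather than the library's exact procedure.

```python
import numpy as np

# Hypothetical true labels and weak-learner predictions (one mistake, at index 1).
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])

# Start with uniform sample weights.
weights = np.full(len(y_true), 1.0 / len(y_true))

# Weighted error of the weak learner and its vote strength (alpha).
err = np.sum(weights[y_pred != y_true])
alpha = 0.5 * np.log((1.0 - err) / err)

# Increase weights of misclassified samples, decrease the rest, then renormalise.
weights = weights * np.exp(-alpha * y_true * y_pred)
weights = weights / weights.sum()

print(err, alpha)
print(weights)
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.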


#### Stochastic Gradient Boosting

In [17]:
from sklearn.ensemble import GradientBoostingClassifier

num_trees = 100
seed = 7

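# Note: subsample defaults to 1.0 (plain gradient boosting); a value below 1.0 makes it stochastic.
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)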
results = cross_val_score(model, X, y, cv=kfold)
print(results.mean())

0.7656185919343815
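Gradient boosting becomes "stochastic" when each tree is fit on a random fraction of the training rows. A minimal sketch of that variant, assuming synthetic data and an illustrative `subsample=0.8` (not a tuned value), might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# subsample < 1.0: each boosting stage sees a random 80% of the training rows.
sgb_model = GradientBoostingClassifier(
    n_estimators=50, subsample=0.8, random_state=7
)
sgb_scores = cross_val_score(sgb_model, X_demo, y_demo, cv=cv)
print(sgb_scores.mean())
```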


### Voting Ensemble

In [23]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

estimators = []

model1 = LogisticRegression()
estimators.append(('logistic', model1))

model2 = DecisionTreeClassifier()
estimators.append(('cart', model2))

model3 = SVC()
estimators.append(('svm', model3))

ensemble = VotingClassifier(estimators=estimators)
results = cross_val_score(ensemble, X, y, cv=kfold)
results.mean()

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

0.7708133971291866
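The convergence warning in the output above suggests scaling the data or raising `max_iter`. One way to do both, sketched here on synthetic data (an assumption), is to wrap the scale-sensitive models in a pipeline with `StandardScaler`. The `demo_` names and `max_iter=1000` are illustrative choices, and the resulting score is not comparable to the one reported for the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Scaling happens inside each pipeline, so it is re-fit on every training fold.
demo_estimators = [
    ('logistic', make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ('cart', DecisionTreeClassifier(random_state=0)),
    ('svm', make_pipeline(StandardScaler(), SVC())),
]

# Default voting is 'hard', i.e. a majority vote over predicted class labels.
demo_ensemble = VotingClassifier(estimators=demo_estimators)
voting_scores = cross_val_score(demo_ensemble, X_demo, y_demo, cv=cv)
print(voting_scores.mean())
```

Keeping the scaler inside the pipeline avoids leaking statistics from the validation fold into training, which would happen if `X` were scaled once before cross-validation.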