# Bagging Algorithms
Bootstrap Aggregation (or Bagging) draws multiple samples from your training dataset with replacement and trains a model on each sample. The final output prediction is the average of the predictions of all of the sub-models.
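
The sampling and aggregation steps can be sketched by hand. This is a minimal illustration and not how scikit-learn implements bagging: the toy one-feature dataset, the helper names, and the 1-nearest-neighbour stand-in for a high-variance base model are all assumptions made for the example, and a majority vote stands in for averaging because the outputs are class labels.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    # Draw len(rows) rows with replacement: some rows repeat, others are left out.
    return [rng.choice(rows) for _ in range(len(rows))]

def nearest_neighbor_predict(train_rows, x):
    # Hypothetical high-variance base model: 1-nearest neighbour on a single feature.
    nearest = min(train_rows, key=lambda row: abs(row[0] - x))
    return nearest[1]

def bagged_predict(samples, x):
    # One sub-model per bootstrap sample; combine their predictions by majority vote.
    votes = Counter(nearest_neighbor_predict(sample, x) for sample in samples)
    return votes.most_common(1)[0][0]

rng = random.Random(7)
# Toy rows of (feature value, class label), invented for illustration.
rows = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 0), (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]
samples = [bootstrap_sample(rows, rng) for _ in range(25)]
print(bagged_predict(samples, 1.5), bagged_predict(samples, 8.5))
```

Each sub-model sees a slightly different dataset, so their individual errors differ, and combining them smooths those differences out.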

### Bagged Decision Trees
Bagging performs best with algorithms that have high variance. A popular example is decision trees, often constructed without pruning.

In [3]:
# Bagged Decision Trees for Classification
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

#load data 
filename = 'pima-indians-diabetes.data.csv'
names=['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] 
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

#Prep Harness and Fit
seed = 7
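# Folds are unshuffled; KFold only accepts random_state together with shuffle=True
kfold = KFold(n_splits=10)
cart = DecisionTreeClassifier()
num_trees = 100
# The base estimator is passed positionally because its keyword name differs across scikit-learn versions
model = BaggingClassifier(cart, n_estimators=num_trees, random_state=seed)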
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())


0.770745044429


### Random Forest
Random Forest is an extension of bagged decision trees. Samples of the training dataset are taken with replacement, but the trees are constructed in a way that reduces the correlation between individual classifiers. Specifically, rather than greedily choosing the best split point from all features when constructing each tree, only a random subset of features is considered for each split.
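
The difference between the two split searches can be sketched as follows. This is an illustrative sketch under assumptions, not scikit-learn's implementation: the tiny dataset, the Gini-based scoring, and the helper names `gini` and `best_split` are made up for the example.

```python
import random

def gini(labels):
    # Gini impurity of a list of 0/1 class labels.
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(X, y, feature_indices):
    # Greedy search for the (score, feature, threshold) with the lowest weighted
    # Gini impurity, restricted to the candidate features that were passed in.
    best = None
    for j in feature_indices:
        for threshold in sorted({row[j] for row in X}):
            left = [label for row, label in zip(X, y) if row[j] <= threshold]
            right = [label for row, label in zip(X, y) if row[j] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, threshold)
    return best

rng = random.Random(7)
# Toy data, invented for illustration: feature 0 separates the classes perfectly.
X = [[1, 5, 0], [2, 4, 1], [3, 6, 0], [4, 5, 1], [5, 4, 0], [6, 6, 1]]
y = [0, 0, 0, 1, 1, 1]
n_features = len(X[0])
max_features = 2

# Bagged-tree style: every feature is a candidate at this split.
print(best_split(X, y, range(n_features)))

# Random-forest style: only a random subset of features is a candidate.
candidates = rng.sample(range(n_features), max_features)
print(candidates, best_split(X, y, candidates))
```

When the strongest feature is left out of the candidate subset, the tree is forced to split on something else, which is what makes the trees in the forest less alike.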

In [6]:
#Random Forest Classification
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

#load data 
filename = 'pima-indians-diabetes.data.csv'
names=['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] 
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

#Prep harness and Fit
seed = 7
num_trees = 100
max_features = 3
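# Folds are unshuffled; KFold only accepts random_state together with shuffle=True
kfold = KFold(n_splits=10)
# Seed the forest so that repeated runs give the same score
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features, random_state=seed)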
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.770745044429


### Extra Trees
Extra Trees are another modification of bagging, where random trees are constructed from samples of the training dataset.

In [11]:
# Extra Trees Classification 
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import ExtraTreesClassifier

#load data 
filename = 'pima-indians-diabetes.data.csv'
names=['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] 
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

#Harness prep and fit
seed = 7
num_trees = 100
max_features = 7
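# Folds are unshuffled; KFold only accepts random_state together with shuffle=True
kfold = KFold(n_splits=10)
# Seed the ensemble so that repeated runs give the same score
model = ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features, random_state=seed)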
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.772060833903


# Boosting Algorithms
Boosting ensemble algorithms create a sequence of models that attempt to correct the mistakes of the models before them in the sequence. Once created, the models make predictions, which may be weighted by their demonstrated accuracy, and the results are combined to create a final output prediction.
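
One way to make "each model corrects the mistakes of the models before it" concrete is to fit every new model to the residual errors of the ensemble so far, which is the idea behind gradient boosting with squared error. The sketch below is a toy regression version under assumptions: the one-feature data, the decision-stump base model, and the names `fit_stump`, `boost`, and `mse` are invented for illustration.

```python
def fit_stump(xs, targets):
    # Pick the threshold that minimises squared error when each side predicts its mean.
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best

    def predict(x):
        return left_mean if x <= threshold else right_mean
    return predict

def boost(xs, ys, n_models=20, learning_rate=0.5):
    models = []
    predictions = [0.0] * len(xs)
    for _ in range(n_models):
        # Each new stump is trained on what the ensemble so far still gets wrong.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        models.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]

    def ensemble(x):
        # The final prediction combines the (shrunken) outputs of all models.
        return sum(learning_rate * model(x) for model in models)
    return ensemble

def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data, invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
one = boost(xs, ys, n_models=1)
many = boost(xs, ys, n_models=20)
print(mse(one, xs, ys), mse(many, xs, ys))
```

The training error of the 20-model ensemble is lower than that of the single model, because every added stump removes part of the remaining residual.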

### AdaBoost
AdaBoost was perhaps the first successful boosting ensemble algorithm. It works by weighting instances in the dataset by how easy or difficult they are to classify, allowing the algorithm to pay more or less attention to them in the construction of subsequent models.
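
The instance reweighting can be sketched with the classic discrete AdaBoost update for labels coded as +1 and -1. This is an assumption made for illustration: scikit-learn's `AdaBoostClassifier` differs in its details, the function name `adaboost_reweight` is hypothetical, and the sketch assumes the weighted error is strictly between 0 and 1.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    # Weighted error of the current weak learner (weights are assumed to sum to 1).
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Learner weight in the final vote; larger when the error is lower.
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Raise the weights of misclassified instances and lower the rest.
    updated = [w * math.exp(alpha if t != p else -alpha)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]  # the weak learner gets only the last instance wrong
weights = [1.0 / len(y_true)] * len(y_true)
alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After the update, the single misclassified instance carries half of the total weight, so the next model in the sequence is pushed to get it right.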

In [14]:
# AdaBoost Classification
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier

#load data 
filename = 'pima-indians-diabetes.data.csv'
names=['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] 
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

#Harness Prep and Fit
seed = 7
num_trees = 30
# Folds are unshuffled; KFold only accepts random_state together with shuffle=True
kfold = KFold(n_splits=10)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())


### Stochastic Gradient Boosting
Stochastic Gradient Boosting (also called Gradient Boosting Machines, or GBM) is one of the most sophisticated ensemble techniques and is proving to be one of the best available for improving performance via ensembles.

In [17]:
# Stochastic Gradient Boosting (GBM) Classification
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier

#load data 
filename = 'pima-indians-diabetes.data.csv'
names=['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] 
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

#Harness and Fit
seed = 7
num_trees = 100
# Folds are unshuffled; KFold only accepts random_state together with shuffle=True
kfold = KFold(n_splits=10)
model = GradientBoostingClassifier(n_estimators = num_trees, random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.766900205058


# Voting Ensemble
Voting is one of the simplest ways of combining the predictions from multiple machine learning algorithms. It works by first creating two or more standalone models from your training dataset. A voting classifier can then be used to wrap your models and average the predictions of the sub-models when asked to make predictions for new data. The predictions of the sub-models can be weighted, but specifying the weights for classifiers manually or even heuristically is difficult. More advanced methods can learn how to best weight the predictions from sub-models; this is called stacking (stacked generalization), which scikit-learn has provided as StackingClassifier since version 0.22.
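
The combination step itself is small enough to sketch directly. The example below is a hand-rolled illustration, not scikit-learn's code: the helper names `hard_vote` and `soft_vote`, the three sub-model outputs, and the weights are all made up. In scikit-learn, the corresponding choices are exposed through the `voting` and `weights` parameters of `VotingClassifier`.

```python
from collections import Counter

def hard_vote(labels):
    # Majority vote over the class labels predicted by each sub-model.
    return Counter(labels).most_common(1)[0][0]

def soft_vote(probabilities, weights=None):
    # Weighted average of each sub-model's class-probability vector, then argmax.
    weights = weights or [1.0] * len(probabilities)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / sum(weights)
        for c in range(n_classes)
    ]
    winner = max(range(n_classes), key=lambda c: averaged[c])
    return winner, averaged

# Hypothetical outputs of three sub-models for one new instance.
probs = [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]]
labels = [0, 1, 1]

print(hard_vote(labels))                    # two of three sub-models say class 1
print(soft_vote(probs))                     # the confident first model tips the average to class 0
print(soft_vote(probs, weights=[1, 3, 3]))  # down-weighting it flips the result back to class 1
```

The same three sub-models produce different ensemble answers depending on the voting rule and the weights, which is why choosing the weights by hand is hard.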

In [21]:
# Voting Ensemble for Classification
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

#load data 
filename = 'pima-indians-diabetes.data.csv'
names=['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] 
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

#create submodels
estimators = []
model1 = LogisticRegression()
model2 = DecisionTreeClassifier()
model3 = SVC()
estimators.append(('logistic', model1))
estimators.append(('cart', model2))
estimators.append(('svm', model3))

#create ensemble model
kfold = KFold(n_splits=10)
ensemble = VotingClassifier(estimators)
results = cross_val_score(ensemble, X, Y, cv=kfold)
print(results.mean())

0.738209159262
