
In [0]:
# - Bagging. Building multiple models (typically of the same type) from different subsamples
# of the training dataset.
# - Boosting. Building multiple models (typically of the same type), each of which learns to
# fix the prediction errors of a prior model in the sequence of models.
# - Voting. Building multiple models (typically of differing types) and using simple statistics
# (like calculating the mean) to combine their predictions.
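# As a rough illustration of combining predictions with a simple statistic, the sketch below
# averages class-1 probabilities from three hypothetical models and compares the result with
# a majority vote over their hard labels. The function names and the numbers are made up
# for illustration; they are not part of scikit-learn.

```python
from collections import Counter
from statistics import mean

def average_vote(probabilities, threshold=0.5):
    """Average the models' class-1 probabilities, then threshold the mean."""
    avg = mean(probabilities)
    return avg, int(avg >= threshold)

def majority_vote(labels):
    """Return the label predicted by the most models."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical class-1 probabilities from three different models for one sample
model_probabilities = [0.9, 0.4, 0.6]
avg_probability, averaged_label = average_vote(model_probabilities)

# Hard labels: each model votes 1 if its own probability is at least 0.5
hard_labels = [int(p >= 0.5) for p in model_probabilities]
voted_label = majority_vote(hard_labels)

print(round(avg_probability, 3), averaged_label, voted_label)
```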

In [0]:
# Bootstrap Aggregation (or Bagging) involves taking multiple samples from your training dataset
# (with replacement) and training a model for each sample. The final output prediction is averaged
# across the predictions of all of the sub-models.
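# The procedure can be sketched by hand using only the standard library. This is a minimal
# illustration, not how scikit-learn implements it: the toy data, the decision-stump learner
# and all function names are assumptions made for the example, and because the task is
# classification the sub-model predictions are combined by majority vote rather than a mean.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement from the training rows."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(rows):
    """Fit a one-feature threshold classifier (a decision stump) on (x, label) rows."""
    best = None
    for threshold in sorted({x for x, _ in rows}):
        # Try both orientations: low side predicts 0, or low side predicts 1
        for left_label, right_label in ((0, 1), (1, 0)):
            errors = sum(
                (left_label if x <= threshold else right_label) != y
                for x, y in rows
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label

def bagged_predict(models, x):
    """Combine the sub-models' predictions for x by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

rng = random.Random(7)

# Toy data: the label is 1 when x > 5, with two labels flipped as noise
data = [(x, int(x > 5)) for x in range(11)]
data[2] = (2, 1)
data[8] = (8, 0)

# One stump per bootstrap sample; an odd count avoids tied votes
models = [fit_stump(bootstrap_sample(data, rng)) for _ in range(25)]
predictions = [bagged_predict(models, x) for x, _ in data]
accuracy = sum(p == y for p, (_, y) in zip(predictions, data)) / len(data)
print(accuracy)
```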

In [0]:
# Bagged Decision Trees
# Bagging performs best with algorithms that have high variance. A popular example is
# decision trees, often constructed without pruning. The example below uses the
# BaggingClassifier with the Classification and Regression Trees algorithm
# (DecisionTreeClassifier). A total of 100 trees are created.

In [2]:
# Bagged Decision Trees for Classification
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
seed = 7
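# No shuffling is requested, so random_state is omitted: recent scikit-learn
# versions raise an error when random_state is set while shuffle is False.
kfold = KFold(n_splits=10)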
cart = DecisionTreeClassifier()
num_trees = 100
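# The estimator is passed positionally because its keyword was renamed from
# base_estimator to estimator in newer scikit-learn releases.
model = BaggingClassifier(cart, n_estimators=num_trees, random_state=seed)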
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())


0.770745044429255


In [0]:
# Random Forests is an extension of bagged decision trees. Samples of the training dataset are
# taken with replacement, but the trees are constructed in a way that reduces the correlation
# between individual classifiers. Specifically, rather than greedily choosing the best split point
# over all features when constructing each tree, only a random subset of features is considered
# for each split.
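# The per-split feature restriction can be sketched as follows. This is an illustrative
# toy written with the standard library only, not scikit-learn's implementation: the data,
# the Gini-based scoring and the function names are all assumptions made for the example.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def split_score(X, y, feature, threshold):
    """Weighted Gini impurity after splitting on X[i][feature] <= threshold."""
    left = [label for row, label in zip(X, y) if row[feature] <= threshold]
    right = [label for row, label in zip(X, y) if row[feature] > threshold]
    n = len(y)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split(X, y, candidate_features):
    """Greedy search for the lowest-impurity split among the candidate features."""
    best = None
    for feature in candidate_features:
        for threshold in sorted({row[feature] for row in X}):
            score = split_score(X, y, feature, threshold)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best

def random_subset_split(X, y, max_features, rng):
    """Pick max_features features at random, then search only those for a split."""
    candidates = rng.sample(range(len(X[0])), max_features)
    return candidates, best_split(X, y, candidates)

# Toy data with four features; feature 0 separates the classes perfectly
X = [[1, 5, 0, 3], [2, 1, 1, 3], [3, 4, 0, 2],
     [4, 2, 1, 2], [5, 3, 0, 1], [6, 6, 1, 1]]
y = [0, 0, 0, 1, 1, 1]

# A plain greedy tree looks at every feature
print(best_split(X, y, range(4)))

# A random-forest-style split looks at a random subset, so different trees
# can end up splitting on different features
rng = random.Random(7)
candidates, chosen = random_subset_split(X, y, max_features=2, rng=rng)
print(candidates, chosen)
```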

In [3]:
# Random Forest Classification
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
num_trees = 100
max_features = 3
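# No shuffling is requested, so random_state is omitted: recent scikit-learn
# versions raise an error when random_state is set while shuffle is False.
kfold = KFold(n_splits=10)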
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.7695146958304853


In [0]:
# Extra Trees are another modification of bagging where random trees are constructed from
# samples of the training dataset. You can construct an Extra Trees model for classification using
# the ExtraTreesClassifier class. The example below provides a demonstration of extra trees
# with the number of trees set to 100 and splits chosen from 7 random features.

In [4]:
# Extra Trees Classification
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import ExtraTreesClassifier
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
num_trees = 100
max_features = 7
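# No shuffling is requested, so random_state is omitted: recent scikit-learn
# versions raise an error when random_state is set while shuffle is False.
kfold = KFold(n_splits=10)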
model = ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.7759227614490773
