In [0]:
%matplotlib inline

import matplotlib.pyplot as plt

In [0]:
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score, validation_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

import numpy as np
import seaborn as sns

In [0]:
digits = load_digits()
print(digits.data.shape)

(1797, 64)


In [0]:
X = digits.data
y = digits.target

Create a DecisionTreeClassifier with default settings and measure its score with cross_val_score.

In [0]:
clf = DecisionTreeClassifier(random_state=1)
cvs = cross_val_score(clf, X, y, cv=10)

In [0]:
print('Mean model quality value: ' + str(cvs.mean()))

Mean model quality value: 0.8241154562383614


Use the BaggingClassifier from sklearn.ensemble to train a bagging ensemble over the DecisionTreeClassifier. Keep the default BaggingClassifier settings, setting only the number of trees to 100.

In [0]:
bagging = BaggingClassifier(clf, n_estimators=100)
cvs = cross_val_score(bagging, X, y, cv=10)

In [0]:
print('Mean model quality value: ' + str(cvs.mean()))

Mean model quality value: 0.9237150837988828
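As a conceptual aid, the two ingredients of bagging (bootstrap sampling and aggregation of predictions) can be sketched with the standard library alone. The helper names bootstrap_indices and majority_vote are hypothetical and are not part of scikit-learn. Plain majority voting is used here for simplicity; BaggingClassifier itself averages predicted class probabilities when the base estimator provides them.

```python
import random
from collections import Counter

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def majority_vote(predictions):
    """Return the most common label among the estimators' predictions."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
sample = bootstrap_indices(10, rng)
# A bootstrap sample typically repeats some rows and omits others.
print(sorted(sample))

# Five hypothetical trees vote on the label of one digit image.
print(majority_vote([3, 8, 3, 3, 5]))  # 3
```

Each tree in the ensemble is fit on a different bootstrap sample, which is why averaging their outputs tends to reduce the variance of a single deep tree.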


Now examine the BaggingClassifier parameters and set them so that each base estimator is trained not on all d features, but on sqrt(d) features.

The sqrt of the number of features is often used in classification problems, while in regression problems the number of features divided by three is often taken. But in general, nothing prevents you from choosing any other number of features.
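For the digits data, d = 64, so the two rules of thumb give quite different subset sizes. The small sketch below works through the arithmetic; the function names are hypothetical and the rounding (floor) is an assumption, since the heuristics do not prescribe one.

```python
import math

def n_features_classification(d):
    """Classification rule of thumb: floor(sqrt(d)), at least 1."""
    return max(1, math.isqrt(d))

def n_features_regression(d):
    """Regression rule of thumb: floor(d / 3), at least 1."""
    return max(1, d // 3)

d = 64  # number of features in the digits dataset
print(n_features_classification(d))  # 8
print(n_features_regression(d))      # 21
```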

In [0]:
n_features = digits.data.shape[1]
# Draw sqrt(d) features once per base estimator: each tree sees only that fixed subset.
bagging = BaggingClassifier(clf, n_estimators=100, max_features=int(np.sqrt(n_features)))
cvs = cross_val_score(bagging, X, y, cv=10)

In [0]:
print('Mean model quality value: ' + str(cvs.mean()))

Mean model quality value: 0.9232091868404719


In [0]:
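# Alternative: sample sqrt(d) candidate features at every split inside each tree,
# instead of fixing one feature subset per estimator.
clf = DecisionTreeClassifier(max_features=int(np.sqrt(n_features)))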
bagging = BaggingClassifier(clf, n_estimators=100)
cvs = cross_val_score(bagging, X, y, cv=10)

In [0]:
print('Mean model quality value: ' + str(cvs.mean()))

Mean model quality value: 0.9482402234636871
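The two cells above differ only in where the random feature subset is drawn: once per estimator (BaggingClassifier's max_features) or again at every split (DecisionTreeClassifier's max_features). The following standard-library sketch illustrates that difference. The function names and the figure of 20 splits per tree are hypothetical, chosen only for illustration.

```python
import random

def per_estimator_subset(n_features, k, rng):
    """One subset drawn once: the whole tree sees only these k features."""
    return sorted(rng.sample(range(n_features), k))

def per_split_subsets(n_features, k, n_splits, rng):
    """A fresh subset of k candidate features drawn at every split."""
    return [sorted(rng.sample(range(n_features), k)) for _ in range(n_splits)]

rng = random.Random(0)
fixed = per_estimator_subset(64, 8, rng)
fresh = per_split_subsets(64, 8, n_splits=20, rng=rng)

# With per-split sampling a single tree can still reach many features overall.
seen_by_tree = set().union(*fresh)
print(len(fixed), len(seen_by_tree))
```

With per-split sampling each tree can still use most of the 64 pixels somewhere in its structure, which is one plausible reason the second variant scores higher here.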


Now compare the quality of this classifier with the RandomForestClassifier from sklearn.ensemble. Then examine how the classification quality on this dataset depends on the number of trees, the number of features considered at each split of a tree, and the limit on tree depth.

In [0]:
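rf_classifier = RandomForestClassifier(n_estimators=100)
# Note: this nests a 100-tree random forest inside a 100-estimator bagging ensemble
# (10,000 trees in total), so the score below is for bagged forests,
# not for a single RandomForestClassifier.
bagging = BaggingClassifier(rf_classifier, n_estimators=100)
cvs = cross_val_score(bagging, X, y, cv=10)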

In [0]:
print('Mean model quality value: ' + str(cvs.mean()))

Mean model quality value: 0.9471229050279328
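The cell above scores a BaggingClassifier wrapped around the forest. For a direct comparison, the forest can also be scored on its own, and validation_curve can vary its number of trees. The sketch below is one way to do that; cv=5, random_state=1 and the list of tree counts are assumptions chosen to keep the run short, not values from the original experiment.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, validation_curve

X, y = load_digits(return_X_y=True)

# Score a plain random forest, without an outer bagging wrapper.
rf = RandomForestClassifier(n_estimators=100, random_state=1)
rf_scores = cross_val_score(rf, X, y, cv=5)
print('Random forest mean accuracy: ' + str(rf_scores.mean()))

# Vary the number of trees; rows of the result correspond to tree_counts,
# columns to cross-validation folds.
tree_counts = [10, 50, 100]
train_scores, test_scores = validation_curve(
    rf, X, y,
    param_name="n_estimators",
    param_range=tree_counts,
    cv=5,
    scoring="accuracy",
)
print(test_scores.mean(axis=1))
```

The same pattern works for the forest's own max_features and max_depth by changing param_name and param_range.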


In [0]:
param_range = np.array([3, 5, 7, 10])
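# max_features here is the BaggingClassifier parameter:
# the number of features drawn for each base estimator.
train_scores, test_scores = validation_curve(
    bagging, X, y,
    param_name="max_features",
    param_range=param_range,
    cv=10,
    scoring="accuracy",
)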
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

In [0]:
print(train_scores_mean, test_scores_mean)

[0.9985778 1.        1.        1.       ] [0.86088454 0.90429236 0.93099628 0.94322781]


In [0]:
param_range = np.array([5, 10, 50, 100])
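# base_estimator__max_depth sets max_depth on the wrapped RandomForestClassifier.
# In scikit-learn 1.2+ the BaggingClassifier parameter is named `estimator`,
# so the path there is "estimator__max_depth".
train_scores, test_scores = validation_curve(
    bagging, X, y,
    param_name="base_estimator__max_depth",
    param_range=param_range,
    cv=10,
    scoring="accuracy",
)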
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

In [0]:
print(train_scores_mean, test_scores_mean)