<a href="https://colab.research.google.com/github/RohanOpenSource/ml-notebooks/blob/main/EnsembleLearning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [7]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

Ensemble Learning, at it's core, is an enseble(many) models voting on the correct class name. The easiest way this can be implemented is taking the class which has been outputted by the majority of the models and outputting that. This is called hard voting.

In [8]:
iris = load_iris()
X = iris["data"][:, (2, 3)] # this is the petal length and width
y = iris["target"]#boolean of whether the flower is an iris virginica or not as a float 64
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [9]:
model_1 = RandomForestClassifier()
model_2 = SVC()
model_comb = VotingClassifier(
    estimators=[('rfc', model_1), ('svc', model_2)],
    voting='hard'
)
model_comb.fit(X_train, y_train)

VotingClassifier(estimators=[('rfc',
                              RandomForestClassifier(bootstrap=True,
                                                     ccp_alpha=0.0,
                                                     class_weight=None,
                                                     criterion='gini',
                                                     max_depth=None,
                                                     max_features='auto',
                                                     max_leaf_nodes=None,
                                                     max_samples=None,
                                                     min_impurity_decrease=0.0,
                                                     min_impurity_split=None,
                                                     min_samples_leaf=1,
                                                     min_samples_split=2,
                                                     min_weight_fraction_leaf=0.0,
       

In [10]:
for clf in (model_1, model_2, model_comb):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

RandomForestClassifier 1.0
SVC 1.0
VotingClassifier 1.0


Hard voting hasn't given us great results. Let's try using bagging. Bagging is splitting the data into chunks and training each model on one of those chunks.

In [11]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    SVC(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1, oob_score=True
) #setting the number of jobs to -1 tells sklearn to use all of the available cpu threads for this model.

bag_clf.fit(X_train, y_train)
bag_clf.oob_score_ #"Out Of Bag" score

0.9464285714285714

Time for Random Forests, one of the most powerful algorithms in ml is built of of the mediocre decision tree. A random forest is an ensemble of deicision trees trained on subsamples of the data. The label that is chosen by the majority of the trees is what is outputed by the forest. This eliminates the overfitting issue that is prevalent with decision trees. 

In [12]:
rfc_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rfc_clf.fit(X_train, y_train)

y_pred = rfc_clf.predict(X_test)
y_pred

array([1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2,
       0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 1, 0, 0, 2, 1, 0])

Now lets make a random forest from it's counterpart to better understand what it is

In [13]:
decision_tree_clf = DecisionTreeClassifier(max_features="auto", max_leaf_nodes=16)
rfc_clf2 = BaggingClassifier(decision_tree_clf, n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)#A random forest is simply an ensemble of decision trees with bagging
rfc_clf2.fit(X_train, y_train)
y_pred_2 = rfc_clf2.predict(X_test)

In [14]:
y_pred == y_pred_2

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True])

These models are identcal, thus they have identical predictions. Now lets make a small random forest with Gradient Boosting. Gradient Boosting is when you train models in an ensemble on the residual errors of the previously trained model. 

In [15]:
from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X_train, y_train);

In [16]:
y2 = y_train - tree_reg1.predict(X_train); #We train the second model on the residual errors of the first model
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X_train, y2);

In [17]:
y3 = y2 - tree_reg2.predict(X_train) #The third model is trained on the residual errors of the seond model
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X_train, y3)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=2,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')

In [18]:
y_pred = sum(tree.predict(X_test) for tree in (tree_reg1, tree_reg2, tree_reg3))
y_pred, y_test

(array([ 1.01449275, -0.01407867,  2.04107143,  1.01449275,  1.48214286,
        -0.01407867,  1.01449275,  1.72857143,  1.01449275,  1.01449275,
         1.72857143, -0.01407867, -0.01407867, -0.01407867, -0.01407867,
         1.01449275,  2.04107143,  1.01449275,  1.01449275,  2.04107143,
        -0.01407867,  1.72857143, -0.01407867,  2.04107143,  2.04107143,
         2.04107143,  2.04107143,  2.04107143, -0.01407867, -0.01407867,
        -0.01407867, -0.01407867,  1.01449275, -0.01407867, -0.01407867,
         1.72857143,  1.01449275, -0.01407867]),
 array([1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2,
        0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 1, 0, 0, 2, 1, 0]))

An easier way to do this is to use sklearn's built in Gradient Boosting Regressor.

In [29]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

gbrt = GradientBoostingRegressor(max_depth = 2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X_train, y_train)
y_pred = gbrt.predict(X_test)
errors = [mean_squared_error(y_test, y_pred)]
errors

[0.014361159094852083]

We have too few estimators. The optimal number of estimators can be found be taking the argmin of the errors of this model and then feeding 1+ that as the number of estimators in and otherwise identical model

In [32]:
gbrt2 = GradientBoostingRegressor(max_depth = 4, n_estimators=120)
gbrt.fit(X_train, y_train)
errors = [mean_squared_error(y_test, y_pred)
        for y_pred in gbrt.staged_predict(X_test)]
best_n = np.argmin(errors) + 1

gbrt_best = GradientBoostingRegressor(max_depth=4, n_estimators=best_n)
gbrt_best.fit(X_train, y_train)
y_pred = gbrt_best.predict(X_test)
errors = [mean_squared_error(y_test, y_pred)]
errors

[0.380367441622979]

Stacking is the final thing I will code in this notebook. The basis of stacking is rather than having a hard voting algorithm aggregate the predictions, we can train a model to do it for us. While this may make some angry citizens storm the capital building, it is for the greater good. Besides they would have stormed it anyways.

In [36]:
from sklearn.ensemble import RandomForestRegressor
import tensorflow as tf

stacking_model_1 = GradientBoostingRegressor(max_depth=2, n_estimators=100)
stacking_model_2 = RandomForestRegressor(max_depth=2, n_estimators=200)

X_train_1, X_train_2, y_train_1, y_train_2 = train_test_split(X_train, y_train, test_size = 0.3)

stacking_model_1.fit(X_train_1, y_train_1)
stacking_model_2.fit(X_train_2, y_train_2)

pred_1 = stacking_model_1.predict(X_train_2)
pred_2 = stacking_model_2.predict(X_train_2)

blender = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1),
])

blender.compile(loss=tf.keras.losses.mean_squared_error, optimizer = tf.keras.optimizers.Adam(), metrics=["mse"])
blender.fit(pred_1, y_train_2, epochs = 40)
blender.fit(pred_2, y_train_2, epochs = 40)

Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40
Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


<tensorflow.python.keras.callbacks.History at 0x7fcafb2d6590>

In [38]:
p_1 = stacking_model_1.predict(X_test)
p_2 = stacking_model_2.predict(X_test)

blender.predict(p_1)
blender.predict(p_2)

array([[ 1.0099456e+00],
       [-9.4616320e-04],
       [ 1.9535915e+00],
       [ 1.0145307e+00],
       [ 1.0711775e+00],
       [-9.4616320e-04],
       [ 1.0099456e+00],
       [ 1.9019414e+00],
       [ 1.0145307e+00],
       [ 1.0099456e+00],
       [ 1.9019414e+00],
       [-9.4616320e-04],
       [-9.4616320e-04],
       [-9.4616320e-04],
       [-9.4616320e-04],
       [ 1.0269490e+00],
       [ 1.9535915e+00],
       [ 1.0099456e+00],
       [ 1.0099456e+00],
       [ 1.9535915e+00],
       [-9.4616320e-04],
       [ 1.5352073e+00],
       [-9.4616320e-04],
       [ 1.9535915e+00],
       [ 1.9535915e+00],
       [ 1.9019414e+00],
       [ 1.9494067e+00],
       [ 1.9535915e+00],
       [-9.4616320e-04],
       [-9.4616320e-04],
       [-9.4616320e-04],
       [-9.4616320e-04],
       [ 1.0099456e+00],
       [-9.4616320e-04],
       [-9.4616320e-04],
       [ 1.6401632e+00],
       [ 1.0145307e+00],
       [-9.4616320e-04]], dtype=float32)