# Bagging and Pasting

Bagging and pasting are ensemble methods that use [resampling methods](https://github.com/AlbinFranzen/ML-Algorithms-From-Scratch/blob/master/Model%20Optimisation/Model%20Assessment/Resampling%20Methods.ipynb) to draw many new training sets from the original data, train a model on each one, and aggregate the predictions. Bagging (bootstrap aggregating) draws each sample with replacement, as in the bootstrap, so a datapoint can appear more than once in the same sample. Pasting draws each sample without replacement, so a datapoint appears at most once per sample.
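The difference between the two sampling schemes can be illustrated on bare indices. This is a minimal sketch, not part of the original notebook: `n_points` and `sample_size` are hypothetical values chosen only for illustration, and it uses NumPy's `Generator.choice` rather than the sampling calls in the cells below.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 10     # hypothetical training-set size
sample_size = 6   # hypothetical size of each resample

# Bagging: draw indices with replacement, so duplicates can occur.
bagging_idx = rng.choice(n_points, size=sample_size, replace=True)

# Pasting: draw indices without replacement, so every index is distinct.
pasting_idx = rng.choice(n_points, size=sample_size, replace=False)

print("bagging indices:", bagging_idx)
print("pasting indices:", pasting_idx)
```

Either index array can then be used to select rows, as in `X_train[idx]`.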

In [4]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
import numpy as np
np.random.seed(42)

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [84]:
from sklearn.tree import DecisionTreeClassifier

dtc_clf = DecisionTreeClassifier()

In [101]:
from sklearn.metrics import accuracy_score

dtc_clf.fit(X_train, y_train)
dtc_clf_y_pred = dtc_clf.predict(X_test)

print("Decision Tree: " + str(accuracy_score(y_test, dtc_clf_y_pred)))

Decision Tree: 0.864


In [82]:
# Bagging: fit 500 trees, each on 100 indices drawn with replacement.
# Note that the indices are drawn from the first 100 training points only.
total = np.zeros(len(y_test), dtype=int)  # class-1 votes per test point
for _ in range(500):
    idx = np.random.randint(0, 100, size=100)
    resampled_x = X_train[idx]
    resampled_y = y_train[idx]
    dtc_clf.fit(resampled_x, resampled_y)
    dtc_clf_y_pred = dtc_clf.predict(X_test)
    total += dtc_clf_y_pred

In [83]:
final_guess = [0 if i<250 else 1 for i in total]
final_guess = np.array(final_guess)
print("Decision Tree with bagging: " + str(accuracy_score(y_test, final_guess)))

Decision Tree with bagging: 0.88


In [125]:

import random

# Pasting: fit 500 trees, each on 10 distinct indices (drawn without
# replacement) from the first 70 training points.
total = np.zeros(len(y_test), dtype=int)  # class-1 votes per test point
for _ in range(500):
    idx = random.sample(range(70), 10)
    resampled_x = X_train[idx]
    resampled_y = y_train[idx]
    dtc_clf.fit(resampled_x, resampled_y)
    dtc_clf_y_pred = dtc_clf.predict(X_test)
    total += dtc_clf_y_pred

In [126]:
final_guess = [0 if i<250 else 1 for i in total]
final_guess = np.array(final_guess)
print("Decision Tree with pasting: " + str(accuracy_score(y_test, final_guess)))

Decision Tree with pasting: 0.84
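For comparison, scikit-learn's `BaggingClassifier` covers both schemes through its `bootstrap` flag. The following is a rough sketch rather than a reproduction of the loops above: the values `n_estimators=100`, `max_samples=100`, and `random_state=42` are assumptions made for illustration, and `BaggingClassifier` averages class probabilities when the base estimator provides them instead of counting hard votes, so its scores need not match the ones printed above.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# bootstrap=True samples with replacement (bagging).
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100, max_samples=100,
    bootstrap=True, random_state=42,
)
# bootstrap=False samples without replacement (pasting).
paste_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100, max_samples=100,
    bootstrap=False, random_state=42,
)

scores = {}
for name, clf in [("bagging", bag_clf), ("pasting", paste_clf)]:
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)  # mean accuracy on the test set
    print(name + ": " + str(scores[name]))
```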


## Random Forests

Random forests are the same as bagging, except that each time a decision tree makes a split it may only consider a random subset of the predictors (the subset size is usually the square root of the number of predictors). This makes the bagged trees less correlated with one another, so aggregating them reduces variance further and the ensemble tends to generalise better to unseen data.
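A minimal sketch of this idea with scikit-learn's `RandomForestClassifier`, using the same moons dataset as above. The settings `n_estimators=100` and `random_state=42` are assumptions chosen for illustration; `max_features="sqrt"` requests the square-root rule for the number of predictors considered at each split.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each tree is fit on a bootstrap sample, and each split only considers
# a random subset of the features (square root of the feature count).
rf_clf = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", random_state=42
)
rf_clf.fit(X_train, y_train)

rf_score = rf_clf.score(X_test, y_test)  # mean accuracy on the test set
print("Random forest: " + str(rf_score))
```

The moons data has only two predictors, so the feature subsetting has little room to act here; the effect is more visible on datasets with many predictors.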

## Extremely Randomized Trees (Extra-Trees)

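Extra-Trees are the same as random forests, except that at each node the split threshold for each candidate feature is drawn at random instead of being optimised with a criterion such as Gini impurity; the best of these random splits is then used. So both the candidate features and the split thresholds are randomized. Whether the training samples are also resampled depends on the implementation: scikit-learn's `ExtraTreesClassifier`, for example, trains each tree on the full training set by default.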
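As a rough sketch, scikit-learn's `ExtraTreesClassifier` can be tried on the same moons dataset. The settings `n_estimators=100` and `random_state=42` are assumptions chosen for illustration, and `bootstrap=False` is written out explicitly only to make the library's default visible.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Split thresholds are drawn at random for each candidate feature.
# bootstrap=False (the scikit-learn default) fits every tree on the
# whole training set instead of a resampled one.
et_clf = ExtraTreesClassifier(
    n_estimators=100, bootstrap=False, random_state=42
)
et_clf.fit(X_train, y_train)

et_score = et_clf.score(X_test, y_test)  # mean accuracy on the test set
print("Extra-Trees: " + str(et_score))
```

Passing `bootstrap=True` instead would add the resampling step used by bagging and random forests.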