
# Ensemble Modeling

Ensemble methods work best when the predictors are as **independent** of one another as possible.

## Hard and Soft voting

In [None]:
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500,
                  noise=0.3,
                  random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
voting_clf = VotingClassifier(estimators=[
    ("lr", LogisticRegression(random_state=42)),
    ("rf", RandomForestClassifier(random_state=42)),
    ("svc", SVC(random_state=42))
])
voting_clf.fit(X_train, y_train)

In [None]:
for name, clf in voting_clf.named_estimators_.items():
  # score() may return a plain Python float, which has no .round() method, so use round()
  print(f"{name} = {round(clf.score(X_test, y_test), 3)}")

lr = 0.89
rf = 0.91
svc = 0.9


In [None]:
# Here the ensemble matches the best individual model (the random forest) and beats the other two
voting_clf.score(X_test, y_test)

0.91
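
The classifier above uses hard voting, which is the default. As a minimal sketch of the soft-voting variant, the cell below rebuilds the same ensemble with `voting="soft"`. The fixed `random_state` on the split and the variable names are assumptions made for reproducibility, and `SVC` needs `probability=True` so that it can estimate class probabilities.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
# random_state on the split is an assumption added here for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

soft_voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        # SVC only exposes predict_proba when probability=True
        ("svc", SVC(probability=True, random_state=42)),
    ],
    voting="soft",  # average the predicted class probabilities
)
soft_voting_clf.fit(X_train, y_train)

soft_accuracy = soft_voting_clf.score(X_test, y_test)
# Averaged class probabilities for the first three test instances
soft_proba = soft_voting_clf.predict_proba(X_test[:3])
print(round(soft_accuracy, 3))
print(soft_proba)
```

Whether soft voting beats hard voting depends on how well calibrated the individual probability estimates are, so it is worth comparing both on a validation set.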

## Bagging and Pasting

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bagging_clf = BaggingClassifier(estimator=DecisionTreeClassifier(),
                                n_estimators=500,
                                max_samples=100,
                                n_jobs=-1)
bagging_clf.fit(X_train, y_train)

In [None]:
bagging_clf.score(X_test, y_test)

0.91

## Out-of-bag Evaluation

In [None]:
bagging_clf = BaggingClassifier(estimator=DecisionTreeClassifier(),
                                n_estimators=500,
                                max_samples=100,
                                n_jobs=-1,
                                oob_score=True)
bagging_clf.fit(X_train, y_train)

In [None]:
bagging_clf.oob_score_

0.9225

In [None]:
from sklearn.metrics import accuracy_score
y_pred = bagging_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.91

In [None]:
bagging_clf.oob_decision_function_[:5]

array([[0.        , 1.        ],
       [0.93931398, 0.06068602],
       [0.97050938, 0.02949062],
       [0.97964377, 0.02035623],
       [0.93606138, 0.06393862]])

## Random Forests

In [None]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500,
                                 max_leaf_nodes=16,
                                 n_jobs=-1)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)
y_pred_rf[:10]

array([1, 0, 0, 1, 0, 0, 0, 0, 0, 1])

## Feature Importance

In [None]:
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
rnd_clf = RandomForestClassifier(n_estimators=500)
rnd_clf.fit(iris.data, iris.target)

In [None]:
for importance, name in zip(rnd_clf.feature_importances_, iris.data.columns):
  print(f"{name} = {importance.round(3)}")

sepal length (cm) = 0.104
sepal width (cm) = 0.025
petal length (cm) = 0.446
petal width (cm) = 0.426


Random forests are very useful when you need to understand which features matter, in particular when you need to perform feature selection.
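
One way to turn these importances into feature selection is scikit-learn's `SelectFromModel`. The sketch below is only an illustration: the `"mean"` threshold (keep features whose importance is above the average) and the fixed seed are assumptions chosen here, not something the notebook above prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris(as_frame=True)

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=500, random_state=42),
    threshold="mean")  # keep features with above-average importance
selector.fit(iris.data, iris.target)

# Names of the features that passed the threshold
selected = list(iris.data.columns[selector.get_support()])
# Training data restricted to the selected features
X_reduced = selector.transform(iris.data)
print(selected, X_reduced.shape)
```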

## Boosting

In AdaBoost we first train a base classifier. The algorithm then increases the weights of the misclassified training instances and trains another classifier with the updated weights, and so on.



In [None]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             n_estimators=30,
                             learning_rate=0.5)
ada_clf.fit(X_train, y_train)
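
To make the reweighting step concrete, here is a hand-rolled sketch of AdaBoost for two classes. It is a simplified illustration and not scikit-learn's implementation: the variable names, the clipping of the error rate, and training on the whole moons dataset are all assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

n_rounds, learning_rate = 30, 0.5
weights = np.full(len(X), 1 / len(X))  # start with uniform instance weights
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=weights)
    missed = stump.predict(X) != y
    # Weighted error rate, clipped to avoid division by zero and log(0)
    error = np.clip(weights[missed].sum() / weights.sum(), 1e-10, 1 - 1e-10)
    # Predictor weight: larger when the stump is more accurate
    alpha = learning_rate * np.log((1 - error) / error)
    weights[missed] *= np.exp(alpha)  # boost the misclassified instances
    weights /= weights.sum()          # renormalize so the weights sum to 1
    stumps.append(stump)
    alphas.append(alpha)

# Weighted vote: map labels {0, 1} to {-1, +1} and sum the weighted votes
votes = sum(a * (2 * s.predict(X) - 1) for a, s in zip(alphas, stumps))
ensemble_pred = (votes > 0).astype(int)
train_accuracy = np.mean(ensemble_pred == y)
print(round(train_accuracy, 3))
```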

In gradient boosting, just like in AdaBoost, we sequentially add predictors to the ensemble, each one correcting its predecessor. Instead of reweighting instances, this method fits each new predictor to the residual errors made by the previous predictor.

In [None]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.random.rand(100, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100)

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

In [None]:
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

# And so on...
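
The ensemble's prediction is simply the sum of the predictions of all the trees. The sketch below repeats the residual-fitting loop on freshly generated data (the seeded random generator and the `X_new` points are assumptions made for this example) and tracks the training error after each stage.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.random((100, 1)) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(100)

trees, mse_per_stage = [], []
residual = y.copy()
for _ in range(3):
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X, residual)                      # fit the current residuals
    trees.append(tree)
    residual = residual - tree.predict(X)      # what is still unexplained
    mse_per_stage.append(np.mean(residual ** 2))

# Predict new points by adding up the predictions of all the trees
X_new = np.array([[-0.4], [0.0], [0.5]])
y_pred = sum(tree.predict(X_new) for tree in trees)
print(mse_per_stage)
print(y_pred)
```

The training error cannot increase from one stage to the next here, because each tree predicts the mean residual within its leaves.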

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2,
                                 n_estimators=500,
                                 learning_rate=0.1,
                                 n_iter_no_change=10) # Early Stopping
gbrt.fit(X, y)

In [None]:
gbrt.n_estimators_

51
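
`n_iter_no_change` holds out part of the training data internally. An alternative sketch, assuming an explicit validation split and seeded data created just for this example, uses `staged_predict` to measure the validation error after each tree and then retrains with the best number of trees.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.random((200, 1)) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(200)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

gbrt = GradientBoostingRegressor(max_depth=2,
                                 n_estimators=200,
                                 learning_rate=0.1,
                                 random_state=42)
gbrt.fit(X_train, y_train)

# Validation error after 1 tree, 2 trees, ..., 200 trees
val_errors = [mean_squared_error(y_val, y_pred)
              for y_pred in gbrt.staged_predict(X_val)]
best_n_estimators = int(np.argmin(val_errors)) + 1

# Retrain with only as many trees as the validation set supports
gbrt_best = GradientBoostingRegressor(max_depth=2,
                                      n_estimators=best_n_estimators,
                                      learning_rate=0.1,
                                      random_state=42)
gbrt_best.fit(X_train, y_train)
print(best_n_estimators)
```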

## Stacking

In [None]:
from sklearn.ensemble import StackingClassifier

stacking_clf = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("rf", RandomForestClassifier()),
        ("svc", SVC())
    ],
    final_estimator=RandomForestClassifier(),
    cv=5)

stacking_clf.fit(X_train, y_train)
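
To see what the final estimator actually receives, the sketch below refits the same kind of stacking ensemble on the moons data (fixed seeds are an assumption made here) and calls `transform`, which returns the base estimators' outputs. For a binary problem this should give one column per base estimator.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

stacking_clf = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(random_state=42)),
    ],
    final_estimator=RandomForestClassifier(random_state=42),
    cv=5)
stacking_clf.fit(X_train, y_train)

# Inputs seen by the final estimator: one column per base estimator here
meta_features = stacking_clf.transform(X_test)
stacking_accuracy = stacking_clf.score(X_test, y_test)
print(meta_features.shape, round(stacking_accuracy, 3))
```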

# Exercises

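1. Yes. Since the models are all trained using different algorithms, they tend to make different kinds of errors, so there is a chance that an ensemble of them will perform better.
2. In hard voting each classifier casts one vote and we take the class with the most votes. In soft voting we average the class probabilities estimated by all classifiers and take the class with the highest average probability, which requires every classifier to be able to estimate probabilities.
3. Yes, it is possible to distribute a bagging ensemble across multiple servers, because its predictors are trained independently of one another. The same holds for random forests, and for the predictors within one layer of a stacking ensemble. We cannot do this with boosting algorithms like `AdaBoost` or `GradientBoosting`, because each predictor is built on top of the previous one.
4. The benefit of `out-of-bag` evaluation is that each predictor is evaluated on instances it never saw during training, so we do not have to set aside an additional validation set to test our model's performance.
5. Extra-Trees are faster to train than random forests because each node uses random thresholds for the candidate features instead of searching for the best possible threshold (for example the one that minimizes Gini impurity). If a random forest is overfitting the training instances, we can try Extra-Trees, since this extra randomness acts as a form of regularization.
6. To reduce underfitting we can increase the number of predictors, reduce the regularization of the base estimator, and increase the learning rate.
7. To reduce overfitting we should use early stopping (for example via `n_iter_no_change`), constrain the base model, and lower the learning rate.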
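
The difference between hard and soft voting from answer 2 can be checked with a tiny worked example. The probabilities below are made up for illustration: two classifiers slightly prefer class 1, while a third is very confident in class 0.

```python
import numpy as np

# Rows: three hypothetical classifiers; columns: P(class 0), P(class 1)
probas = np.array([
    [0.45, 0.55],
    [0.45, 0.55],
    [0.90, 0.10],
])

# Hard voting: each classifier votes for its most likely class
hard_votes = probas.argmax(axis=1)                            # [1, 1, 0]
hard_winner = np.bincount(hard_votes, minlength=2).argmax()   # class 1 wins 2 to 1

# Soft voting: average the probabilities, then pick the most likely class
mean_proba = probas.mean(axis=0)                              # [0.6, 0.4]
soft_winner = mean_proba.argmax()                             # class 0

print(hard_winner, soft_winner)  # prints: 1 0
```

Soft voting gives more weight to confident predictions, which is why the two schemes can disagree.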

## Exercise 8

In [1]:
from sklearn.datasets import fetch_openml

mnist = fetch_openml("mnist_784", version=1)


In [2]:
mnist_data = mnist.data
mnist_target = mnist.target
mnist_data[:2], mnist_target[:2]

(   pixel1  pixel2  pixel3  pixel4  pixel5  pixel6  pixel7  pixel8  pixel9  \
 0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
 1     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
 
    pixel10  ...  pixel775  pixel776  pixel777  pixel778  pixel779  pixel780  \
 0      0.0  ...       0.0       0.0       0.0       0.0       0.0       0.0   
 1      0.0  ...       0.0       0.0       0.0       0.0       0.0       0.0   
 
    pixel781  pixel782  pixel783  pixel784  
 0       0.0       0.0       0.0       0.0  
 1       0.0       0.0       0.0       0.0  
 
 [2 rows x 784 columns],
 0    5
 1    0
 Name: class, dtype: category
 Categories (10, object): ['0', '1', '2', '3', ..., '6', '7', '8', '9'])

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(mnist_data[:60000], mnist_target[:60000], test_size=0.15)
X_test, y_test = mnist_data[60000:], mnist_target[60000:]
X_train.shape, y_train.shape, X_valid.shape, y_valid.shape, X_test.shape, y_test.shape

((51000, 784), (51000,), (9000, 784), (9000,), (10000, 784), (10000,))

In [4]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import ExtraTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

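rf_clf = RandomForestClassifier()
svc_clf = LinearSVC()
# Note: ExtraTreeClassifier is a single extremely randomized tree, not an ensemble;
# the forest version is sklearn.ensemble.ExtraTreesClassifier
et_clf = ExtraTreeClassifier()

rf_clf.fit(X_train, y_train)
# LinearSVC only sees the first 3,000 training instances (a small subset, so it trains faster)
svc_clf.fit(X_train[:3000], y_train[:3000])
et_clf.fit(X_train, y_train);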



In [5]:
from sklearn.metrics import precision_recall_fscore_support
import numpy as np

def classification_metrics(y_true, y_pred):
  precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred)
  return {
      "Recall": np.mean(recall, axis=0),
      "Precision": np.mean(precision, axis=0),
      "F1": np.mean(f1, axis=0)
  }
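
Averaging the per-class scores like this is what scikit-learn calls macro averaging. As a quick sanity check on made-up labels (the `y_true` and `y_pred` arrays below are purely illustrative), the manual mean should match `average="macro"`.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Toy labels, invented for this check
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0])

# Per-class precision, recall and F1, averaged by hand
per_class = precision_recall_fscore_support(y_true, y_pred)
manual_macro = [np.mean(metric) for metric in per_class[:3]]

# Same quantities computed directly by scikit-learn
builtin_macro = precision_recall_fscore_support(
    y_true, y_pred, average="macro")[:3]

print(manual_macro)
print(builtin_macro)
```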

In [6]:
rf_preds = rf_clf.predict(X_valid)
svc_preds = svc_clf.predict(X_valid[:2000])
et_preds = et_clf.predict(X_valid)

In [7]:
print(f"Random Forest = {classification_metrics(y_valid, rf_preds)}")
print(f"SVC = {classification_metrics(y_valid[:2000], svc_preds)}")
print(f"Extra Tree = {classification_metrics(y_valid, et_preds)}")

Random Forest = {'Recall': 0.9678106072218515, 'Precision': 0.9679580688141842, 'F1': 0.9678707237457168}
SVC = {'Recall': 0.8216270091866003, 'Precision': 0.8210180161075378, 'F1': 0.8204645246735526}
Extra Tree = {'Recall': 0.8118779644411355, 'Precision': 0.8118162502790621, 'F1': 0.8115920538172313}


In [11]:
from sklearn.ensemble import VotingClassifier

voting_clf = VotingClassifier(estimators=[
    ("rf", rf_clf),
    ("svc", svc_clf),
    ("et", et_clf)
])

# VotingClassifier clones the estimators and refits every clone on the full training set
voting_clf.fit(X_train, y_train)
voting_preds = voting_clf.predict(X_valid)
print(f"Voting classifier = {classification_metrics(y_valid, voting_preds)}")

Voting classifier = {'Recall': 0.9112539777235623, 'Precision': 0.9214149309994093, 'F1': 0.9122676039242397}


## Exercise 9

In [12]:
rf_preds[:3], et_preds[:3]

(array(['4', '5', '6'], dtype=object), array(['4', '5', '6'], dtype=object))

In [24]:
# Let's train a blender!
blender = RandomForestClassifier()

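# estimators_ holds the clones refit by VotingClassifier; they were trained on
# label-encoded targets, so they predict integers rather than the original string labels
estimators = voting_clf.estimators_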
X_valid_predictions = np.empty((len(X_valid), len(estimators)), dtype=object)
for index, estimator in enumerate(estimators):
    X_valid_predictions[:, index] = estimator.predict(X_valid)
X_valid_predictions[:10]

array([[4, 4, 4],
       [5, 5, 8],
       [6, 6, 6],
       [1, 1, 1],
       [3, 3, 3],
       [3, 9, 5],
       [4, 4, 4],
       [2, 2, 2],
       [0, 0, 0],
       [2, 2, 8]], dtype=object)

In [25]:
blender.fit(X_valid_predictions, y_valid)

In [26]:
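# Caution: this scores the blender on the same predictions it was trained on,
# so the result is optimistic; a held-out set such as X_test gives a fairer estimate
blender.score(X_valid_predictions, y_valid)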

0.9718888888888889
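
A fairer way to judge a blender is to score it on data that neither the base estimators nor the blender have seen. The sketch below illustrates this on the small `load_digits` dataset as a stand-in for MNIST so that it runs quickly. The dataset, the choice of base estimators, the split sizes, and the helper `stack_predictions` are all assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=42)

# Base estimators are trained on the training set only
base_estimators = [
    RandomForestClassifier(random_state=42),
    ExtraTreesClassifier(random_state=42),
    make_pipeline(StandardScaler(), LinearSVC(random_state=42)),
]
for estimator in base_estimators:
    estimator.fit(X_train, y_train)

def stack_predictions(X):
    """Hypothetical helper: one column of predicted labels per base estimator."""
    return np.column_stack([est.predict(X) for est in base_estimators])

# The blender learns from validation-set predictions...
blender = RandomForestClassifier(random_state=42)
blender.fit(stack_predictions(X_valid), y_valid)

# ...and is judged on the test set, which nothing has seen so far
valid_accuracy = blender.score(stack_predictions(X_valid), y_valid)  # optimistic
test_accuracy = blender.score(stack_predictions(X_test), y_test)     # fairer
print(round(valid_accuracy, 3), round(test_accuracy, 3))
```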

In [28]:
from sklearn.ensemble import StackingClassifier

stacking = StackingClassifier(estimators=[
    ("rf", rf_clf),
    ("svc", svc_clf),
    ("et", et_clf)
], final_estimator=RandomForestClassifier())

# StackingClassifier refits clones of the estimators and trains the final estimator on their
# cross-validated predictions, so the result may differ from the hand-made blender above
stacking.fit(X_train, y_train)
stacking.score(X_valid, y_valid)