# Ensemble Learning and Random Forests<br>
**Ensemble method:** you can train a group of decision tree
classifiers, each on a different random subset of the training set. You can then obtain
the predictions of all the individual trees, and the class that gets the most votes is the
ensemble’s prediction. <br>Such an ensemble of decision trees is called a random forest.<br>
We will examine the most popular ensemble methods, including voting classifiers, bagging and pasting ensembles, random forests, boosting, and stacking ensembles.

# 1- Voting Classifiers

Suppose we train 4 different models on the same data, say KNN, SVM, a decision tree, and logistic regression.<br><br>
The class that gets the most votes is taken as the prediction. For example, if 3 of the 4 models predict the class robot, then the ensemble predicts robot, following the majority. This is called a **hard voting classifier**.<br><br>
A voting classifier often achieves a higher accuracy than the best classifier in the ensemble. This can hold even if each classifier is a weak learner (only slightly better than random guessing), provided there are enough of them and they are sufficiently diverse. The reason is the **law of large numbers**.<br><br>
To get high accuracy, make the predictors as independent as possible. If they are dependent, they are likely to make the same errors, which reduces the accuracy of the ensemble.
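
To see the law of large numbers at work, here is a minimal simulation sketch. It assumes idealized voters that are fully independent and each correct 51% of the time; real classifiers trained on the same data are correlated, so the gain is smaller in practice. The function name `majority_vote_accuracy` and all parameter values are illustrative assumptions, not part of this notebook.

```python
import random

def majority_vote_accuracy(n_voters, p_correct=0.51, n_trials=2000, seed=42):
    """Estimate how often a majority of independent voters is correct."""
    rng = random.Random(seed)
    majority_correct = 0
    for _ in range(n_trials):
        # Count how many voters are right on this simulated instance
        correct_votes = sum(rng.random() < p_correct for _ in range(n_voters))
        if correct_votes > n_voters / 2:
            majority_correct += 1
    return majority_correct / n_trials

# Odd voter counts avoid ties
accuracies = {n: majority_vote_accuracy(n) for n in (1, 101, 1001)}
for n_voters, accuracy in accuracies.items():
    print(f"{n_voters} voters: majority correct in {accuracy:.1%} of trials")
```

The majority is right noticeably more often as the number of voters grows, even though no single voter gets any better.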

In [23]:
import warnings

from sklearn.datasets import load_iris, make_moons
from sklearn.ensemble import (
    AdaBoostClassifier, BaggingClassifier, BaggingRegressor,
    ExtraTreesClassifier, GradientBoostingClassifier,
    GradientBoostingRegressor, RandomForestClassifier,
    StackingClassifier, VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

In [2]:
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
v_clf = VotingClassifier(
    estimators=[('lr', LogisticRegression(random_state=42)),
                ('rf', RandomForestClassifier(random_state=42)),
                ('svc', SVC(random_state=42))])
v_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression(random_state=42)),
                             ('rf', RandomForestClassifier(random_state=42)),
                             ('svc', SVC(random_state=42))])

In [3]:
for model,clf in v_clf.named_estimators_.items():
  print(f"The accuracy of {model}: {clf.score(X_test,y_test)}")

The accuracy of lr: 0.864
The accuracy of rf: 0.896
The accuracy of svc: 0.896


The best individual classifiers are the random forest and the support vector machine (both 0.896). Let's see whether the ensemble does better.

In [4]:
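# VotingClassifier.predict() performs hard voting by default
print(v_clf.predict(X_test[:1]))
# Ask each fitted classifier (estimators_) for its own prediction
print([clf.predict(X_test[:1]) for clf in v_clf.estimators_])


[1]
[array([1]), array([1]), array([1])]


The ensemble predicts class 1 for this instance. With hard voting, that only requires a majority of the individual classifiers to vote for class 1.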

In [5]:
v_clf.score(X_test,y_test)

0.912

**The voting classifier gets a higher accuracy than the best individual classifier: 0.912 vs. 0.896, a gain of 1.6 percentage points.**

If all classifiers have a predict_proba() method, you can instead predict the class with the highest probability, averaged over all the individual classifiers. This is called **soft voting**.
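
The difference between the two voting schemes shows up when one classifier is very confident and the others are not. The sketch below uses made-up probabilities for a single instance (the numbers are hypothetical, chosen only for illustration) and computes both predictions by hand.

```python
# Hypothetical class probabilities [P(class 0), P(class 1)] for ONE instance
probas = [
    [0.55, 0.45],  # classifier A: weakly prefers class 0
    [0.60, 0.40],  # classifier B: weakly prefers class 0
    [0.05, 0.95],  # classifier C: strongly prefers class 1
]
n_classes = 2

# Hard voting: each classifier casts one vote for its most likely class
hard_votes = [max(range(n_classes), key=lambda c: p[c]) for p in probas]
hard_prediction = max(range(n_classes), key=hard_votes.count)

# Soft voting: average the probabilities, then pick the most likely class
mean_proba = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
soft_prediction = max(range(n_classes), key=lambda c: mean_proba[c])

print(hard_prediction, soft_prediction)  # prints: 0 1
```

Soft voting gives more weight to highly confident votes, which is why it can disagree with the simple majority.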

In [6]:
v_clf.voting = 'soft'
# SVC only exposes predict_proba() when its `probability`
# hyperparameter is True (it is False by default), so enable it
# before refitting the ensemble.
v_clf.named_estimators['svc'].probability = True
v_clf.fit(X_train,y_train)
v_clf.score(X_test,y_test)

0.92

# Wow! It works: soft voting gets an even higher accuracy (0.92)

In [7]:
print(v_clf.named_estimators)
print(v_clf.estimators)

{'lr': LogisticRegression(random_state=42), 'rf': RandomForestClassifier(random_state=42), 'svc': SVC(probability=True, random_state=42)}
[('lr', LogisticRegression(random_state=42)), ('rf', RandomForestClassifier(random_state=42)), ('svc', SVC(probability=True, random_state=42))]


# 2- Bagging and Pasting

Another approach is to use the same training algorithm for every
predictor but train them on different random subsets of the training set.<br> When sampling
is performed with replacement, this method is called **bagging**. When sampling is performed without replacement, it is called **pasting**.
<br>In other words, every classifier in the ensemble is trained on a different sample of the data.<br>With **pasting**, an instance can appear at most once in a given predictor's sample, while with **bagging** the same instance can be drawn several times for the same predictor.<br>
Bagging is short for bootstrap aggregation.
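
The two sampling schemes can be sketched with the standard library alone. This is only an illustration on a toy set of 10 instance indices; the variable names and sizes are assumptions, and Scikit-Learn does the sampling internally.

```python
import random

rng = random.Random(42)
training_indices = list(range(10))  # a toy training set of 10 instances
subset_size = 6

# Bagging: sample WITH replacement, so an index may appear more than once
bagging_subset = rng.choices(training_indices, k=subset_size)

# Pasting: sample WITHOUT replacement, so every index is distinct
pasting_subset = rng.sample(training_indices, k=subset_size)

print("bagging:", sorted(bagging_subset))
print("pasting:", sorted(pasting_subset))
```

In `BaggingClassifier`, the `bootstrap` hyperparameter switches between the two: `True` (the default) gives bagging and `False` gives pasting.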

In [8]:
# n_jobs: number of CPU cores to use for training and predictions
#         (-1 means use all available cores)
# n_estimators: number of base estimators in the ensemble
# max_samples: number of training instances drawn for each estimator
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            max_samples=100, n_jobs=-1, random_state=42)

bag_clf.fit(X_train, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(), max_samples=100,
                  n_estimators=500, n_jobs=-1, random_state=42)

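# Out-of-Bag (OOB) Evaluation <br>
With bagging, some training instances may be sampled several times for a given predictor, while others may not be sampled at all. When each predictor draws as many instances as the training set contains (with replacement), it can
be shown mathematically that only about 63% of the training instances are sampled on
average for each predictor. The remaining 37% of the training instances that are not
sampled are called out-of-bag (OOB) instances. Note that they are not the same 37%
for all predictors.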
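
The 63% figure can be checked numerically. When m instances are drawn with replacement from a training set of size m, a given instance is missed by every draw with probability (1 - 1/m)^m, which approaches 1/e as m grows. The sketch below assumes exactly that setting; the helper name `expected_sampled_fraction` is made up for illustration. A predictor that draws fewer instances than the training set contains leaves a larger share out of bag.

```python
import math

def expected_sampled_fraction(m):
    """Expected share of distinct instances seen when drawing m times
    with replacement from a training set of m instances."""
    return 1 - (1 - 1 / m) ** m

for m in (10, 100, 10_000):
    print(f"m={m}: {expected_sampled_fraction(m):.4f}")

limit = 1 - math.exp(-1)  # value approached as m grows
print(f"limit: {limit:.4f}")
```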

A bagging ensemble can be evaluated using OOB instances, without the need for a
separate validation set: indeed, if there are enough estimators, then each instance in the
training set will likely be an OOB instance of several estimators, so these estimators
can be used to make a fair ensemble prediction for that instance. Once you have a
prediction for each instance, you can compute the ensemble’s prediction accuracy (or
any other metric).

In [9]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            max_samples=100, n_jobs=-1, random_state=42,
                            oob_score=True)

bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.9253333333333333

In [10]:
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.904

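The OOB score is often a bit lower than the test accuracy, but in this case it is higher (0.925 vs. 0.904), so here the OOB estimate turned out slightly optimistic. Either way the two numbers are close, which is what we want from an estimate that needs no separate validation set.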

# 3- Random Forests

**Random Forest** is an ensemble of decision trees, generally
trained via the bagging method (or sometimes pasting), typically with max_samples
set to the size of the training set. Instead of building a BaggingClassifier and
passing it a DecisionTreeClassifier, you can use the
RandomForestClassifier class, which is more convenient and optimized for
decision trees (similarly, there is a RandomForestRegressor class for
regression tasks).

In [11]:
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16,
                                 n_jobs=-1, random_state=42)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)
accuracy_score(y_test,y_pred_rf)

0.912

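# Extra-Trees Classifier <br>
A random forest considers only a random subset of the features when splitting each node. It is possible to make trees even more random by also using random thresholds for each feature instead of searching for the best one. In a single decision tree this is done with **splitter='random'**; for a whole forest we can use **ExtraTreesClassifier** (here with **bootstrap=False**, which is also its default). This technique trades more bias for a lower variance, and it is much faster to train than a regular random forest, because finding the best threshold at every node is one of the most time-consuming steps. We cannot know in advance whether it will perform better than a random forest, so the only way to be sure is to compare them using cross-validation.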
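
A minimal sketch of such a comparison with `cross_val_score` is shown below. It regenerates the moons data under separate names (`X_demo`, `y_demo`) so it does not touch the notebook's variables, and it uses 100 trees per forest only to keep the run short; both choices are assumptions made for the example.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Same synthetic dataset as the rest of the notebook, under separate names
X_demo, y_demo = make_moons(n_samples=500, noise=0.30, random_state=42)

candidates = {
    "random forest": RandomForestClassifier(
        n_estimators=100, max_leaf_nodes=16, random_state=42),
    "extra-trees": ExtraTreesClassifier(
        n_estimators=100, max_leaf_nodes=16, random_state=42),
}

cv_means = {}
for name, model in candidates.items():
    # 5-fold cross-validation; the default scorer for classifiers is accuracy
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Whichever model has the higher mean score, with the spread across folds taken into account, is the better candidate for this dataset.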


In [12]:
rnd_clf = ExtraTreesClassifier(n_estimators=500, max_leaf_nodes=16,
                               n_jobs=-1, random_state=42, bootstrap=False)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)
accuracy_score(y_test,y_pred_rf)

0.912

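A random forest can estimate the importance of each feature by measuring how much the tree nodes that use that feature reduce impurity on average, across all trees in the forest. After training, the scores are available in the **feature_importances_** attribute.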

In [13]:
iris = load_iris(as_frame=True)
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(iris.data, iris.target)
for score, name in zip(rnd_clf.feature_importances_,iris.data.columns):
  print(f"Importance: {round(score, 2)*100}%, Feature name: {name}")

Importance: 11.0%, Feature name: sepal length (cm)
Importance: 2.0%, Feature name: sepal width (cm)
Importance: 44.0%, Feature name: petal length (cm)
Importance: 42.0%, Feature name: petal width (cm)


So the most important features in the iris dataset are petal length and petal width.

# Boosting
Boosting refers to any ensemble method that combines several weak learners into a strong one.<br>
The general idea is to train the predictors sequentially, each one trying to correct its predecessor.

# 4- AdaBoost <br>
One way for a new predictor to correct its predecessor is to pay a bit more attention to
the training instances that the predecessor underfit. This results in new predictors
focusing more and more on the hard cases. This is the technique used by AdaBoost.
For example, misclassified instances get a higher weight, and the next predictor is then trained using the updated weights.<br><br>
If your AdaBoost ensemble is overfitting the training set, you can try reducing the number of
estimators or more strongly regularizing the base estimator.
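
The weight update can be sketched in a few lines. The helper below follows the classic binary AdaBoost update (weighted error rate, predictor weight, boosted and renormalized instance weights); it is a simplified illustration rather than the exact code Scikit-Learn runs, and the name `adaboost_reweight` is made up. It assumes the weighted error is strictly between 0 and 1.

```python
import math

def adaboost_reweight(weights, misclassified, learning_rate=1.0):
    """One AdaBoost round: return (predictor_weight, new_instance_weights)."""
    # Weighted error rate of the current predictor
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong) / sum(weights)
    # The more accurate the predictor, the larger its say in the final vote
    alpha = learning_rate * math.log((1 - error) / error)
    # Boost the weights of the misclassified instances only
    boosted = [w * math.exp(alpha) if wrong else w
               for w, wrong in zip(weights, misclassified)]
    total = sum(boosted)
    return alpha, [w / total for w in boosted]  # normalize to sum to 1

# Five equally weighted instances; the predictor gets the third one wrong
weights = [0.2] * 5
misclassified = [False, False, True, False, False]
alpha, new_weights = adaboost_reweight(weights, misclassified)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After one round the single misclassified instance carries half of the total weight, so the next predictor is pushed to get it right.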

In [14]:
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=30,
                             learning_rate=0.5, random_state=42)
ada_clf.fit(X_train, y_train)
y_pred = ada_clf.predict(X_test)
accuracy_score(y_test,y_pred)

0.904

# 5- Gradient Boosting<br>
It works like AdaBoost, but instead of updating the weights of the training instances at every iteration, it tries to fit the new predictor to the residual errors made by the previous one.<br>The learning_rate hyperparameter scales the contribution of every tree by the same amount.<br>
The residual error is the difference between the predicted values and the actual values.
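
To make the idea concrete, here is a from-scratch sketch of gradient boosting for regression on a tiny 1-D dataset, using depth-1 trees (stumps) as weak learners. Everything in it (the function names, the toy data, 50 trees, a learning rate of 0.1) is an assumption chosen for illustration; it is not how `GradientBoostingClassifier` is implemented. Residuals are computed as actual minus predicted, and the inputs need at least two distinct x values.

```python
def fit_stump(xs, targets):
    """Fit a depth-1 regression tree on 1-D inputs by exhaustive threshold search."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keeps both sides non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_value, right_value = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - left_value) ** 2 for t in left)
               + sum((t - right_value) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]

def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def fit_gradient_boosting(xs, ys, n_trees=50, learning_rate=0.1):
    base = sum(ys) / len(ys)  # initial prediction: the mean target
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        # Each new tree is trained on what the ensemble still gets wrong
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Every tree's contribution is scaled by the same learning rate
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate

def boosted_predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)

xs = list(range(10))
ys = [x ** 2 for x in xs]
model = fit_gradient_boosting(xs, ys)

mean_y = sum(ys) / len(ys)
baseline_mse = sum((y - mean_y) ** 2 for y in ys) / len(ys)
boosted_mse = sum((y - boosted_predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(f"baseline MSE: {baseline_mse:.1f}, boosted MSE: {boosted_mse:.1f}")
```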


In [19]:
gbrt_best = GradientBoostingClassifier(
    max_depth=2, learning_rate=0.05, n_estimators=500,
    n_iter_no_change=10, random_state=42)
gbrt_best.fit(X_train, y_train)
y_pred_gbrt = gbrt_best.predict(X_test)
accuracy_score(y_test,y_pred_gbrt)

0.92

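**n_iter_no_change:** stops adding trees when there has been no progress for that many consecutive iterations (early stopping); progress is measured on a validation set that Scikit-Learn holds out from the training data. If the value is too low, training may stop too early, which causes underfitting, while if it is too high the model may overfit.<br><br>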
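
The stopping rule can be illustrated with a small patience counter. This is a simplified sketch of the general idea, not Scikit-Learn's exact implementation; the function name `rounds_trained` and the validation losses are hypothetical.

```python
def rounds_trained(val_losses, patience, tol=0.0):
    """Return how many boosting rounds run before early stopping kicks in."""
    best_loss = float("inf")
    rounds_without_progress = 0
    for round_number, loss in enumerate(val_losses, start=1):
        if loss < best_loss - tol:
            best_loss = loss
            rounds_without_progress = 0
        else:
            rounds_without_progress += 1
            if rounds_without_progress >= patience:
                return round_number  # no progress for `patience` rounds: stop
    return len(val_losses)  # never triggered: all rounds were used

# Hypothetical validation losses: a plateau, then a late improvement
val_losses = [1.0, 0.8, 0.7, 0.69, 0.70, 0.71, 0.72, 0.5]
for patience in (2, 3, 10):
    print(patience, rounds_trained(val_losses, patience))
```

With a small patience the run stops on the plateau and never reaches the late improvement, which is the underfitting risk described above.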

In [21]:
gbrt_best.n_estimators_

152

So training stopped at 152 trees rather than the 500 we allowed.<br>
**LET'S TRY IT AGAIN**

In [22]:
gbrt_best = GradientBoostingClassifier(
    max_depth=2, learning_rate=0.05, n_estimators=152,
    n_iter_no_change=10, random_state=42)
gbrt_best.fit(X_train, y_train)
y_pred_gbrt = gbrt_best.predict(X_test)
accuracy_score(y_test,y_pred_gbrt)

0.92

So the accuracy is the same with n_estimators=152 and with n_estimators=500. The larger setting does not overfit because early stopping (**n_iter_no_change**) halts training once there is no more progress, so in both runs the ensemble ends up with 152 trees.

# 6- Stacking <br>
It is based on a simple idea: instead of using trivial functions (such as hard voting) to aggregate the predictions of all predictors in an ensemble, why
don’t we train a model to perform this aggregation? This final predictor is called a blender or meta learner.<br> **Example:** We feed a new instance to 4 predictors, and they predict **[100, 112.5, 106, 109.7]**. Instead of taking the average of these values as the prediction (in **regression**) or taking the most voted class (in **classification**), we pass the 4 predicted values as inputs to the blender, which gives us the final prediction **[108.25]**.

# The following figure illustrates the stacking process

![image](https://user-images.githubusercontent.com/96451039/217559225-5400c4d9-1e59-4c2c-82b0-d119b8bf35bb.png)
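
The process can also be written out by hand, which makes it clearer where the blender's training data comes from. The sketch below is an assumption-laden illustration, not what `StackingClassifier` does internally line for line: it uses two base models and a logistic regression blender picked for brevity, and separate `_demo` variable names so it does not touch the notebook's data.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_moons(n_samples=500, noise=0.30, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

base_models = [LogisticRegression(random_state=42),
               DecisionTreeClassifier(max_depth=3, random_state=42)]

# Step 1: out-of-fold predictions, so the blender never sees a prediction
# made by a model that was trained on that same instance
blend_train = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Step 2: train the blender on the base models' predictions
blender = LogisticRegression(random_state=42).fit(blend_train, y_tr)

# Step 3: refit the base models on the full training set, then chain them
for m in base_models:
    m.fit(X_tr, y_tr)
blend_test = np.column_stack([m.predict_proba(X_te)[:, 1] for m in base_models])
stacked_accuracy = blender.score(blend_test, y_te)
print(f"stacked accuracy: {stacked_accuracy:.3f}")
```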


In [24]:
stacking_clf = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('svc', SVC(probability=True, random_state=42))
    ],
    # final_estimator is the blender (meta learner)
    final_estimator=RandomForestClassifier(random_state=43), cv=5)
stacking_clf.fit(X_train, y_train)
y_pred = stacking_clf.predict(X_test)
accuracy_score(y_test,y_pred)

0.928

# Your journey hasn't ended yet: there is still a lot to learn about ensemble methods, such as XGBoost, CatBoost, and LightGBM.