<h1 align=middle style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">
Random Forest
</font>
</h1>

Ensemble learning is the practice of combining several predictors (classifiers or regressors) and returning their aggregated prediction as the answer. A random forest is one such ensemble, built from decision trees.

<h1 align=left style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">
Classification based on Vote
</font>
</h1>

In [30]:
from sklearn.datasets import make_moons


X, y = make_moons(n_samples=200, noise=0.15)

In [31]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [32]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier


log_clf = LogisticRegression()
ran_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', ran_clf), ('svc', svm_clf)], 
    voting='hard'
)

voting_clf.fit(X_train, y_train)

In [33]:
from sklearn.metrics import accuracy_score

for clf in (log_clf, ran_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(clf.__class__.__name__, round(accuracy, 4))

LogisticRegression 0.825
RandomForestClassifier 0.925
SVC 1.0
VotingClassifier 0.95


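A voting classifier often outperforms its individual members, but this is not guaranteed: in the run above, the SVC alone scored higher (1.0) than the ensemble (0.95).

We can also set the hyperparameter `probability=True` on the SVC model so that every model can output class probabilities, and then use `voting='soft'` so that the voting classifier averages those probabilities.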
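
A minimal sketch of soft voting, assuming the same kind of moons data as above. The names `X_demo`, `soft_voting_clf` and the `random_state` values are illustrative choices, not part of the original notebook; separate variable names are used so the notebook's own `X_train`/`X_test` are left untouched.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Illustrative data (fixed seed is an assumption, for reproducibility only)
X_demo, y_demo = make_moons(n_samples=200, noise=0.15, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

soft_voting_clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression()),
        ('rf', RandomForestClassifier(random_state=42)),
        # probability=True makes SVC expose predict_proba
        ('svc', SVC(probability=True, random_state=42)),
    ],
    voting='soft',  # average the predicted class probabilities
)
soft_voting_clf.fit(X_tr, y_tr)

proba = soft_voting_clf.predict_proba(X_te)       # averaged probabilities
soft_accuracy = soft_voting_clf.score(X_te, y_te)
print(round(soft_accuracy, 4))
```

Soft voting gives more weight to confident predictions, which is why it requires every member to provide `predict_proba`.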

<h1 align=left style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">
Bagging and Pasting
</font>
</h1>

In [34]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier



bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500, 
    max_samples=100, 
    bootstrap=True,
    n_jobs=-1
)

bag_clf.fit(X_train, y_train)

In [35]:
y_pred = bag_clf.predict(X_test)

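With bootstrap sampling (`bootstrap=True`), each predictor draws its training instances with replacement. When as many instances are drawn as the training set contains, about 37% of the training instances are, on average, never sampled for a given predictor. These are its out-of-bag (oob) instances.

Since a predictor never sees its oob instances during training, they can be used to evaluate it without setting aside a separate validation set.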
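
The 37% figure comes from the probability that a given instance is never drawn in $m$ draws with replacement, $(1 - 1/m)^m$, which approaches $e^{-1} \approx 0.368$ as $m$ grows. A quick numerical check, where the training-set size `m` is a hypothetical value chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000  # hypothetical training-set size

# Draw m indices with replacement, as a single bagged predictor would
sample = rng.integers(0, m, size=m)
oob_fraction = 1 - len(np.unique(sample)) / m  # share of instances never drawn

analytic = (1 - 1 / m) ** m  # probability that a given instance is never drawn
limit = np.exp(-1)           # limit of the expression above as m grows

print(oob_fraction, analytic, limit)
```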

In [36]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), 
    n_estimators=500, 
    bootstrap=True, 
    n_jobs=-1, 
    oob_score=True
)

bag_clf.fit(X_train, y_train)

In [37]:
bag_clf.oob_score_

0.94375

Let's compare this estimate with the accuracy on the test set:

In [38]:
from sklearn.metrics import accuracy_score


y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.925

The out-of-bag estimate (0.94375) is close to the accuracy measured on the test set (0.925).

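We can sample features as well as instances, using the `max_features` and `bootstrap_features` hyperparameters of `BaggingClassifier`. Sampling both instances and features is called the Random Patches method; keeping all instances and sampling only the features is called the Random Subspaces method.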
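
A small sketch of feature sampling with `BaggingClassifier`. The moons data has only two features, so this example assumes a synthetic 10-feature dataset from `make_classification`; the dataset, the sampling ratios and the name `patches_clf` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with enough features for feature sampling to matter
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=42
)

patches_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.7, bootstrap=True,            # sample instances
    max_features=0.5, bootstrap_features=True,  # sample features too
    random_state=42,
)
patches_clf.fit(X_demo, y_demo)

# Each tree is trained on its own draw of 5 feature indices
features_per_tree = [len(f) for f in patches_clf.estimators_features_]
print(features_per_tree[:5])
```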

<h1 align=left style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">
Random Forest
</font>
</h1>

In [39]:
from sklearn.ensemble import RandomForestClassifier


rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)


y_pred_rnd = rnd_clf.predict(X_test)

In [40]:
accuracy_score(y_test, y_pred_rnd)

0.925

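The following `BaggingClassifier` is roughly equivalent to the `RandomForestClassifier` above. Note that `splitter='random'` draws random split thresholds, whereas a random forest searches for the best split among a random subset of the features.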

In [41]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter='random', max_leaf_nodes=16), 
    n_estimators=500, 
    n_jobs=-1, 
    max_samples=1.0, 
    bootstrap=True
)

In [42]:
bag_clf.fit(X_train, y_train)

In [43]:
y_pred_bag = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred_bag)

1.0

# Even more random

We can go one step further: instead of searching for the best threshold for each candidate feature at every node, we can use random thresholds (so the split is not necessarily the best one anymore).

This is called Extremely Randomized Trees (Extra-Trees). It reduces variance at the cost of more bias.

The class is `ExtraTreesClassifier`.
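
A minimal usage sketch, assuming the same kind of moons data as in the earlier cells. The variable names and `random_state` values are illustrative and chosen so that the notebook's own variables are not overwritten.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_moons(n_samples=200, noise=0.15, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Same API as RandomForestClassifier, but splits use random thresholds
ext_clf = ExtraTreesClassifier(
    n_estimators=500, max_leaf_nodes=16, random_state=42
)
ext_clf.fit(X_tr, y_tr)

ext_accuracy = ext_clf.score(X_te, y_te)
print(round(ext_accuracy, 4))
```

Because no threshold search is needed, Extra-Trees are usually faster to train than a random forest of the same size. Which of the two generalizes better on a given dataset is best checked with cross-validation.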

# Feature importance

In [44]:
from sklearn.datasets import load_iris


iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])

In [45]:
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.09645968353474602
sepal width (cm) 0.023416968674925064
petal length (cm) 0.43709480423527614
petal width (cm) 0.4430285435550527


<h1 align=left style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">
Boosting
</font>
</h1>

**Boosting** is a method in machine learning that aims to improve the accuracy of predictive models. It belongs to the ensemble techniques, which combine the predictions from multiple algorithms for enhanced performance.

### Key Concepts

- **Ensemble Method**: Boosting is a type of ensemble learning where multiple models are strategically used to solve the same problem.
- **Weak to Strong Learners**: The goal is to sequentially combine weak learners (models that are slightly better than random guessing) to form a strong, accurate predictor.
- **Sequential Learning**: Models are built in a sequence, with each new model focusing on the errors of its predecessor.
- **Weight Adjustment**: Some boosting algorithms, such as AdaBoost, adjust the weights of the training instances, putting more focus on the ones that were misclassified in earlier rounds.
- **Popular Algorithms**: AdaBoost (Adaptive Boosting) and Gradient Boosting are widely-used boosting algorithms.
- **Applications**: Boosting is effective for classification and regression, particularly with imbalanced datasets.

### Advantages and Disadvantages

- **Advantages**: Boosting can lead to very high accuracy and is effective in complex predictive problems.
- **Disadvantages**: It can be computationally intensive and has a risk of overfitting if not carefully implemented.

### Conclusion

Boosting is a powerful approach in machine learning, transforming a series of weak learners into a highly accurate collective model through an adaptive, sequential process.


# AdaBoostClassifier

The `AdaBoostClassifier` is a popular boosting algorithm in machine learning. It focuses on combining multiple weak learners, typically decision trees, to create a strong classifier. The key aspects of AdaBoost are:

### 1. Weighted Error Rate of the i'th Predictor

Each weak learner is assigned a weighted error rate, which is calculated based on its performance on the weighted training instances. The error rate is given by:

$$
\text{error}_i = \frac{\sum \text{weights of misclassified instances}}{\sum \text{weights of all instances}}
$$

### 2. Predictor Weight

Based on the error rate, a weight is assigned to the predictor. This weight determines the influence of the predictor in the final decision. It is computed as:

$$
\text{weight}_i = \alpha_i = \eta \cdot \log{\frac{1 - \text{error}_i}{\text{error}_i}}
$$

where $\eta$ is the learning rate.

### 3. Updating the Weights

After calculating the predictor's weight, the algorithm updates the weights of the training instances. The weights of misclassified instances are boosted:

$$
\text{new weight} = \text{weight} \cdot \exp(\alpha_i)
$$

while the weights of correctly classified instances are left unchanged. All instance weights are then normalized (divided by their sum), so the relative weight of the correctly classified instances decreases.

### 4. Final Prediction Choice

The final prediction is made by combining the predictions of all the learners, weighted by their respective weights. The predicted class is the one that receives the highest weighted sum of votes:

$$
\text{predicted class} = \underset{k}{\operatorname{argmax}} \sum_{i:\ \text{predictor } i \text{ predicts class } k} \alpha_i
$$
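
The four steps above can be sketched from scratch as follows. This is an illustration of the formulas in this section, not scikit-learn's internal implementation. The data, the number of rounds, the clipping of the error rate and all variable names are assumptions made for the example, and the instance weights are renormalized after each update.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_moons(n_samples=200, noise=0.15, random_state=42)

eta = 0.5      # learning rate
n_rounds = 20  # number of weak learners (illustrative)
w = np.full(len(X_demo), 1 / len(X_demo))  # instance weights, initially uniform
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_demo, y_demo, sample_weight=w)
    miss = stump.predict(X_demo) != y_demo

    # Step 1: weighted error rate (clipped to avoid log(0) or division by 0)
    error = np.clip(w[miss].sum() / w.sum(), 1e-10, 1 - 1e-10)
    # Step 2: predictor weight
    alpha = eta * np.log((1 - error) / error)
    # Step 3: boost the misclassified instances, then normalize
    w[miss] *= np.exp(alpha)
    w /= w.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Step 4: each predictor adds its weight alpha to the class it predicts
classes = np.unique(y_demo)
votes = np.zeros((len(X_demo), len(classes)))
for stump, alpha in zip(stumps, alphas):
    pred = stump.predict(X_demo)
    for k, c in enumerate(classes):
        votes[pred == c, k] += alpha

y_hat = classes[votes.argmax(axis=1)]
train_accuracy = (y_hat == y_demo).mean()
print(round(train_accuracy, 4))
```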

### Extensions: SAMME and SAMME.R

- **SAMME**: Stands for *Stagewise Additive Modeling using a Multiclass Exponential loss function*. It extends AdaBoost to multi-class classification.
- **SAMME.R**: A variant of SAMME, the 'R' stands for 'Real'. It uses the actual class probabilities rather than predictions, often leading to faster convergence and improved accuracy.

### Learning Rates

- The learning rate $\eta$ influences the contribution of each weak learner. A smaller $\eta$ means the model adapts more slowly, potentially requiring more estimators but often improving the generalization.

### Conclusion

The `AdaBoostClassifier` is an effective algorithm that iteratively focuses on the more challenging aspects of the training data, adjusting the weights of the learners and the training instances to improve its predictive accuracy over time.


In [46]:
from sklearn.ensemble import AdaBoostClassifier


ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), 
    n_estimators=200, 
    algorithm="SAMME.R",  # deprecated in recent scikit-learn releases; drop this argument there
    learning_rate=0.5
)

ada_clf.fit(X, y)

A decision tree with `max_depth=1` is known as a decision stump: a single decision node with two leaf nodes.

# Gradient Boosting

Gradient Boosting is a powerful machine learning technique for regression and classification problems. It builds on the idea of boosting, creating a strong predictive model from an ensemble of weak learners, typically decision trees.

### 1. Gradient Descent Approach

Gradient Boosting uses a gradient descent algorithm to minimize the loss when adding new models. Each new tree is built to reduce the residual errors of the previous trees.

### 2. Loss Function

A differentiable loss function is used to quantify the difference between the predicted and actual values. Common choices include mean squared error for regression and logistic loss for classification.

### 3. Sequential Tree Building

New trees are added one at a time, with each tree learning from the mistakes (residual errors) of the preceding ones. This sequential addition of trees aims to improve the model iteratively.

### 4. Shrinkage (Learning Rate)

- **Learning Rate**: A key parameter in Gradient Boosting is the learning rate, which scales the contribution of each tree. A smaller learning rate requires more trees in the model, but can lead to better generalization.
- **Trade-off**: There's a trade-off between the learning rate and the number of trees: Lower learning rates need more trees for model convergence, but often yield better performance.

### 5. Regularization

To prevent overfitting, Gradient Boosting introduces regularization techniques such as tree constraints (depth, number of leaves), random sampling of training data, and subsampling of features.

### 6. Stochastic Gradient Boosting

An extension of Gradient Boosting, it introduces randomness into the tree-building process, improving accuracy and robustness by reducing variance and overfitting.
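
As a rough sketch, the shrinkage, regularization and stochastic settings described above map onto hyperparameters of scikit-learn's `GradientBoostingRegressor`. The synthetic dataset and the specific values below are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

sgb_reg = GradientBoostingRegressor(
    max_depth=2,        # tree constraint
    n_estimators=200,   # more trees to compensate for the small learning rate
    learning_rate=0.1,  # shrinkage
    subsample=0.5,      # each tree is fit on a random 50% of the training instances
    max_features=0.8,   # each split considers a random 80% of the features
    random_state=42,
)
sgb_reg.fit(X_tr, y_tr)

r2 = sgb_reg.score(X_te, y_te)  # coefficient of determination on held-out data
print(round(r2, 4))
```

Setting `subsample` below 1.0 is what makes the procedure stochastic: it trades a little extra bias for lower variance and faster training.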

### Conclusion

Gradient Boosting is a versatile and powerful technique, adaptable to a range of problem types. Its key strength lies in its ability to combine simple models into a complex one through an iterative process, where each model corrects its predecessor's errors.



In [47]:
from sklearn.tree import DecisionTreeRegressor


tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X_train, y_train)

In [48]:
y2 = y_train - tree_reg1.predict(X_train)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X_train, y2)

In [49]:
y3 = y2 - tree_reg2.predict(X_train)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X_train, y3)

Now we combine them by summing the predictions of the three trees:

In [50]:
y_pred = sum(tree.predict(X_test) for tree in (tree_reg1, tree_reg2, tree_reg3))

Or we can simply use the `GradientBoostingRegressor` class:

In [51]:
from sklearn.ensemble import GradientBoostingRegressor


gb_reg = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gb_reg.fit(X, y)

The learning rate is the hyperparameter controlling the influence of each tree.

Using a smaller learning rate together with a larger number of trees (`n_estimators`) is a regularization technique known as shrinkage.

But how do we find the optimal number of trees? One option is to measure the validation error at each stage of training:

In [52]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


X_train, X_val, y_train, y_val = train_test_split(X, y)

gb_reg = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gb_reg.fit(X_train, y_train)

In [53]:
errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gb_reg.staged_predict(X_val)]


best_n_estimators = np.argmin(errors) + 1

In [54]:
gb_reg_best = GradientBoostingRegressor(max_depth=2, n_estimators=best_n_estimators)
gb_reg_best.fit(X_train, y_train)

Alternatively, we can implement early stopping by halting the training process as soon as the validation error stops improving:

In [55]:
gb_reg = GradientBoostingRegressor(max_depth=2, warm_start=True)


min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gb_reg.n_estimators = n_estimators  # hyperparameter (no trailing underscore); warm_start keeps the trees already fitted
    gb_reg.fit(X_train, y_train)
    y_pred = gb_reg.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)

    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break
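
scikit-learn can also perform this kind of early stopping internally through the `n_iter_no_change` and `validation_fraction` hyperparameters. A minimal sketch follows; the data, the upper bound of 500 trees and the variable names are illustrative assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingRegressor

X_demo, y_demo = make_moons(n_samples=200, noise=0.15, random_state=42)

gb_es = GradientBoostingRegressor(
    max_depth=2,
    n_estimators=500,         # upper bound on the number of trees
    validation_fraction=0.2,  # share of the training data held out internally
    n_iter_no_change=5,       # stop after 5 rounds without improvement
    random_state=42,
)
gb_es.fit(X_demo, y_demo)

# Fitted attribute (trailing underscore): number of trees actually trained
n_trees_used = gb_es.n_estimators_
print(n_trees_used)
```

Here the trailing underscore matters: `n_estimators` is the hyperparameter we set, while `n_estimators_` is the value determined during fitting.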

XGBoost (Extreme Gradient Boosting) is an optimized and widely used implementation of gradient boosting:

In [56]:
import xgboost

xgb_reg = xgboost.XGBRegressor()
xgb_reg.fit(X_train, y_train)
y_pred = xgb_reg.predict(X_val)

The library also offers convenient features such as automatic early stopping:

In [57]:
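# Note: newer XGBoost releases expect early_stopping_rounds in the XGBRegressor constructor instead of fit()
xgb_reg.fit(X_train, y_train, eval_set=[(X_val, y_val)], early_stopping_rounds=2)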
y_pred = xgb_reg.predict(X_val)

[0]	validation_0-rmse:0.37887
[1]	validation_0-rmse:0.28636
[2]	validation_0-rmse:0.21256
[3]	validation_0-rmse:0.16150
[4]	validation_0-rmse:0.13036
[5]	validation_0-rmse:0.10983
[6]	validation_0-rmse:0.09867
[7]	validation_0-rmse:0.09287
[8]	validation_0-rmse:0.09108
[9]	validation_0-rmse:0.09133
[10]	validation_0-rmse:0.09454




<h1 align=left style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">
Stacking
</font>
</h1>

## Stacked Generalization

Stacked Generalization, or "stacking," is an advanced machine learning algorithm that combines multiple base models with a meta-learner to improve prediction accuracy.

### 1. Overview of Stacking

Stacking involves two or more layers of models:
- **First Layer (Base Models)**: Comprises various machine learning models trained on the training data (or on a subset of it, when a hold-out set is reserved for the meta-learner).
- **Second Layer (Meta-Learner)**: A model that learns how to optimally combine the predictions of the first-layer models.

### 2. Training the Base Models

- The training data is divided into two parts: a training subset and a hold-out subset.
- Base models are trained on the training subset.
- Each base model then makes predictions on the hold-out subset.
- These predictions are used as features (meta-features) for the next layer.

### 3. Hold-Out Method

- The hold-out method ensures that the meta-learner is trained on data that has not been seen by the base models, preventing information leakage and overfitting.
- This method resembles a validation approach where the hold-out set acts as unseen data.

### 4. Out-of-Fold Predictions

- As an alternative, out-of-fold predictions are used, akin to k-fold cross-validation.
- The training set is split into k-folds, and base models are trained on k-1 folds and make predictions on the fold left out.
- This process is repeated such that each instance in the training set has a corresponding out-of-fold prediction.
- These out-of-fold predictions form the training data for the meta-learner.
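
The out-of-fold procedure can be sketched with `cross_val_predict`, which returns, for every training instance, a prediction made by a model that did not see that instance during fitting. The choice of base models, the use of class-1 probabilities as meta-features and all names below are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

X_demo, y_demo = make_moons(n_samples=200, noise=0.15, random_state=42)

base_models = [
    RandomForestClassifier(n_estimators=100, random_state=42),
    SVC(probability=True, random_state=42),
]

# One column of out-of-fold class-1 probabilities per base model
meta_features = np.column_stack([
    cross_val_predict(model, X_demo, y_demo, cv=5, method="predict_proba")[:, 1]
    for model in base_models
])

# The meta-learner is trained on the meta-features and the original targets
meta_learner = LogisticRegression()
meta_learner.fit(meta_features, y_demo)
print(meta_features.shape)
```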

### 5. Training the Meta-Learner

- The meta-learner is trained on the meta-features, which are the predictions of the base models.
- The target for the meta-learner remains the original target variable.

### 6. Final Model

- In the final stacked model, base models first predict new data.
- These predictions are fed into the meta-learner, which then makes the final prediction.
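
scikit-learn bundles this two-layer pipeline in `StackingClassifier`, which builds the meta-learner's training data from cross-validated predictions of the base models. A minimal sketch, assuming moons data and an arbitrary choice of base models and meta-learner:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_demo, y_demo = make_moons(n_samples=200, noise=0.15, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

stack_clf = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('svc', SVC(random_state=42)),
    ],
    final_estimator=LogisticRegression(),  # the meta-learner
    cv=5,  # 5-fold out-of-fold predictions feed the meta-learner
)
stack_clf.fit(X_tr, y_tr)

# predict/score run the base models first, then the meta-learner
stack_accuracy = stack_clf.score(X_te, y_te)
print(round(stack_accuracy, 4))
```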

### 7. Benefits and Challenges

- **Benefits**: Stacking often results in higher predictive accuracy than any individual model.
- **Challenges**: It requires careful tuning to prevent overfitting and can be computationally intensive.

### Conclusion

Stacked Generalization is a sophisticated ensemble technique. By carefully training layers of models and ensuring that each layer learns from the previous one's predictions, stacking harnesses the strengths of multiple models for superior predictive performance.


For a full implementation, see:
https://github.com/Menelau/DESlib