## Chapter 7. Ensemble Learning and Random Forests

Suppose you pose a complex question to thousands of random people, then aggregate their answers. In many cases you will find that this aggregated answer is better than an expert’s answer. This is called the wisdom of the crowd.

As discussed in Chapter 2, you will often use Ensemble methods near the end of a project, once you have already built a few good predictors, to combine them into an even better predictor. In fact, the winning solutions in Machine Learning competitions often involve several Ensemble methods (most famously in the Netflix Prize competition).

In this chapter we will discuss the most popular Ensemble methods, including bagging, boosting, and stacking. We will also explore Random Forests.

### Voting Classifiers

Suppose you have trained a few classifiers, each one achieving about 80% accuracy. You may have a Logistic Regression classifier, an SVM classifier, a Random Forest classifier, a K-Nearest Neighbors classifier, and perhaps a few more (see Figure 7-1).

A very simple way to create an even better classifier is to aggregate the predictions of each classifier and predict the class that gets the most votes. This majority-vote classifier is called a hard voting classifier.

Somewhat surprisingly, this voting classifier often achieves a higher accuracy than the best classifier in the ensemble. In fact, even if each classifier is a weak learner (meaning it does only slightly better than random guessing), the ensemble can still be a strong learner (achieving high accuracy), provided there are a sufficient number of weak learners and they are sufficiently diverse.

(At this point the book explains why this works, using the analogy of a slightly biased coin that lands on one given side 51% of the time and is tossed 10,000 times.)
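The coin analogy can be checked numerically. The sketch below is not from the book: it assumes the biased side is heads, and `prob_majority_heads` is a hypothetical helper that sums the binomial probabilities of every outcome in which heads holds a strict majority. The point is only that a tiny per-toss edge becomes a near-certain majority as the number of tosses grows, provided the tosses are independent.

```python
import math

def prob_majority_heads(n_tosses, p_heads=0.51):
    """Exact probability that strictly more than half of n_tosses are heads."""
    log_p, log_q = math.log(p_heads), math.log(1.0 - p_heads)
    total = 0.0
    for k in range(n_tosses // 2 + 1, n_tosses + 1):
        # Binomial pmf evaluated in log space to avoid overflow for large n
        log_pmf = (math.lgamma(n_tosses + 1) - math.lgamma(k + 1)
                   - math.lgamma(n_tosses - k + 1)
                   + k * log_p + (n_tosses - k) * log_q)
        total += math.exp(log_pmf)
    return total

p_1000 = prob_majority_heads(1_000)
p_10000 = prob_majority_heads(10_000)
print(f"1,000 tosses:  P(majority heads) = {p_1000:.3f}")
print(f"10,000 tosses: P(majority heads) = {p_10000:.3f}")
```

Real classifiers trained on the same data are not independent the way coin tosses are, so an ensemble gains less than this idealized calculation suggests.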

TIP
Ensemble methods work best when the predictors are as independent from one another as possible. One way to get diverse classifiers is to train them using very different algorithms. This increases the chance that they will make very different types of errors, improving the ensemble’s accuracy.

**--------------------------------------------------------------------------**

The following code creates and trains a voting classifier in Scikit-Learn, composed of four diverse classifiers. The book uses the moons dataset (introduced in Chapter 5); these notes use a diabetes dataset loaded from diabetes.csv instead:

In [48]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import pandas as pd 
diabetes = pd.read_csv("diabetes.csv")
df = diabetes.copy()
df = df.dropna()
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [3]:
df["Outcome"]=df.Outcome.astype("category")
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   Pregnancies               768 non-null    int64   
 1   Glucose                   768 non-null    int64   
 2   BloodPressure             768 non-null    int64   
 3   SkinThickness             768 non-null    int64   
 4   Insulin                   768 non-null    int64   
 5   BMI                       768 non-null    float64 
 6   DiabetesPedigreeFunction  768 non-null    float64 
 7   Age                       768 non-null    int64   
 8   Outcome                   768 non-null    category
dtypes: category(1), float64(2), int64(6)
memory usage: 54.8 KB


In [4]:
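# Target as a 1-D Series (a one-column DataFrame triggers shape warnings in Scikit-Learn)
y = df["Outcome"]
# Note: X still contains the "Unnamed: 0" index column that came from the CSV
X = df.drop(columns="Outcome")
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)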

In [5]:
X_train

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
334,1,95,60,18,58,23.9,0.260,22
139,5,105,72,29,325,36.9,0.159,28
485,0,135,68,42,250,42.3,0.365,24
547,4,131,68,21,166,33.1,0.160,28
18,1,103,30,38,83,43.3,0.183,33
...,...,...,...,...,...,...,...,...
71,5,139,64,35,140,28.6,0.411,26
106,1,96,122,0,0,22.4,0.207,27
270,10,101,86,37,0,45.6,1.136,38
435,0,141,0,0,0,42.4,0.205,29


In [48]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier 
from sklearn.svm import SVC
tree_clf = DecisionTreeClassifier() 
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf),
                ('svc', svm_clf), ('tree', tree_clf)],
    voting='hard')
voting_clf.fit(X_train, y_train);

In [49]:
from sklearn.metrics import accuracy_score
for clf in (log_clf, rnd_clf, svm_clf, tree_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.7402597402597403
RandomForestClassifier 0.7489177489177489
SVC 0.7359307359307359
DecisionTreeClassifier 0.7056277056277056
VotingClassifier 0.7532467532467533


As you can see, on this run the voting classifier slightly outperforms each of the four individual classifiers.

If all classifiers are able to estimate class probabilities (i.e., they all have a predict_proba() method), then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers. This is called soft voting. 

It often achieves higher performance than hard voting because it gives more weight to highly confident votes. 

All you need to do is replace voting="hard" with voting="soft" and ensure that all classifiers can estimate class probabilities.

This is not the case for the SVC class by default, so you need to set its probability hyperparameter to True (this will make the SVC class use cross-validation to estimate class probabilities, slowing down training, and it will add a predict_proba() method). In the book's moons example, switching to soft voting raises the voting classifier's accuracy to over 91.2%.
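To see why soft voting can disagree with hard voting, here is a minimal sketch with made-up numbers. The probabilities are hypothetical values for a single instance, not output from any model in these notes. Two classifiers lean weakly toward class 0, while a third is very confident about class 1.

```python
import numpy as np

# Hypothetical predicted probabilities from three classifiers for ONE instance.
# Columns are [P(class 0), P(class 1)].
probas = np.array([
    [0.55, 0.45],   # weakly prefers class 0
    [0.55, 0.45],   # weakly prefers class 0
    [0.10, 0.90],   # strongly prefers class 1
])

# Hard voting: each classifier casts one vote for its most likely class
votes = probas.argmax(axis=1)
hard_prediction = np.bincount(votes).argmax()

# Soft voting: average the probabilities, then pick the most likely class
mean_proba = probas.mean(axis=0)
soft_prediction = mean_proba.argmax()

print("hard voting predicts class", hard_prediction)
print("soft voting predicts class", soft_prediction, "with mean probabilities", mean_proba)
```

Hard voting follows the two lukewarm votes, whereas soft voting lets the single confident classifier tip the average.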

In [51]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier 
from sklearn.svm import SVC
tree_clf = DecisionTreeClassifier() 
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC(probability=True)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf),
                ('svc', svm_clf), ('tree', tree_clf)],
    voting='soft')  # average predicted probabilities instead of counting votes
voting_clf.fit(X_train, y_train);
from sklearn.metrics import accuracy_score
for clf in (log_clf, rnd_clf, svm_clf, tree_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

Output recorded when this cell still used voting='hard' (rerun the cell to get the soft-voting score):

LogisticRegression 0.7402597402597403
RandomForestClassifier 0.7748917748917749
SVC 0.7359307359307359
DecisionTreeClassifier 0.6926406926406926
VotingClassifier 0.7532467532467533


### Bagging and Pasting

One way to get a diverse set of classifiers is to use very different training algorithms, as just discussed. Another approach is to use the same training algorithm for every predictor and train them on different random subsets of the training set. When sampling is performed with replacement, this method is called bagging (short for bootstrap aggregating). When sampling is performed without replacement, it is called pasting.

As you can see in Figure 7-4, predictors can all be trained in parallel, via different CPU cores or even different servers. Similarly, predictions can be made in parallel. This is one of the reasons bagging and pasting are such popular methods: they scale very well.



### Bagging and Pasting in Scikit-Learn

Scikit-Learn offers a simple API for both bagging and pasting with the BaggingClassifier class (or BaggingRegressor for regression). The following code trains an ensemble of 100 Decision Tree classifiers, each trained on 300 training instances randomly sampled from the training set with replacement (this is an example of bagging, but if you want to use pasting instead, just set bootstrap=False). The n_jobs parameter tells Scikit-Learn the number of CPU cores to use for training and predictions (-1 tells Scikit-Learn to use all available cores):

In [56]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bag_clf=BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100,
    max_samples=300, bootstrap=True, n_jobs=-1
)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.7402597402597403

**NOTE**

The BaggingClassifier automatically performs soft voting instead of hard voting if the base classifier can estimate class probabilities (i.e., if it has a predict_proba() method), which is the case with Decision Tree classifiers.

This image compares bagging with boosting (one of the upcoming topics); I liked it: https://yandex.com.tr/gorsel/search?text=bagging%20and%20pasting&from=tabbar&pos=8&img_url=https%3A%2F%2Fwww.educba.com%2Facademy%2Fwp-content%2Fuploads%2F2019%2F11%2Fbagging-and-boosting.png&rpt=simage
        

### Out-of-Bag Evaluation
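With bagging, some training instances may be sampled several times for a given predictor, while others may not be sampled at all. The instances that are not sampled are called out-of-bag (oob) instances. Since a predictor never sees its oob instances during training, we can use them to measure the performance of our model without a separate validation set. Setting oob_score=True when creating a BaggingClassifier requests this evaluation, and the result is available in oob_score_ after training.

This video also covers oob evaluation, at around the 6:00 mark: https://www.youtube.com/watch?v=J4Wdy0Wc_xQ&ab_channel=StatQuestwithJoshStarmer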
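As a rough check on how many instances end up out-of-bag, the sketch below draws one bootstrap sample of size m from m instances and counts how many were never picked. The set size of 10,000 and the seed are arbitrary assumptions. The fraction should come out near 1/e, about 37%.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 10_000                                # hypothetical training set size

# One bootstrap sample: m indices drawn with replacement
sampled = rng.integers(0, m, size=m)

# Mark every instance that was drawn at least once; the rest are out-of-bag
oob_mask = np.ones(m, dtype=bool)
oob_mask[sampled] = False
oob_fraction = oob_mask.mean()

print(f"out-of-bag fraction: {oob_fraction:.3f} (1/e = {np.exp(-1):.3f})")
```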

In [57]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=True, n_jobs=-1, oob_score=True)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.7597765363128491

In [58]:
from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.7445887445887446

In [63]:
bag_clf.predict_proba(X_test)[:3]

array([[0.476, 0.524],
       [0.662, 0.338],
       [0.93 , 0.07 ]])

### Random Forests

The Random Forest algorithm introduces extra randomness when growing trees: instead of searching for the very best feature when splitting a node (see Chapter 6), it searches for the best feature among a random subset of features.

In [64]:
from sklearn.ensemble import RandomForestClassifier
rnd_clf=RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train,y_train)
y_pred_rf = rnd_clf.predict(X_test)
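A Random Forest is close to a bagging ensemble of Decision Trees that each consider a random subset of features at every split. The sketch below illustrates that correspondence on synthetic moons data rather than the diabetes data used in these notes. The dataset, sizes and seeds are assumptions, and the two models are only roughly equivalent, not identical.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data so the example is self-contained
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Bagging of trees that look at a random subset of features at each split
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=200, random_state=42)

# The dedicated Random Forest class with comparable settings
rnd_clf = RandomForestClassifier(
    n_estimators=200, max_leaf_nodes=16, random_state=42)

bag_acc = bag_clf.fit(X_tr, y_tr).score(X_te, y_te)
rnd_acc = rnd_clf.fit(X_tr, y_tr).score(X_te, y_te)
print(f"bagged trees:  {bag_acc:.3f}")
print(f"random forest: {rnd_acc:.3f}")
```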

### Extra Trees

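When you grow a tree in a Random Forest, only a random subset of the features is considered for splitting at each node. It is possible to make trees even more random by also using a random threshold for each feature rather than searching for the best possible threshold. A forest of such extremely random trees is called an Extremely Randomized Trees ensemble (or Extra-Trees for short). Once again, this technique trades more bias for a lower variance. It also makes Extra-Trees much faster to train than regular Random Forests, because finding the best possible threshold for each feature at every node is one of the most time-consuming tasks of growing a tree.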

**TIP**

It is hard to tell in advance whether a RandomForestClassifier will perform better or worse than an ExtraTreesClassifier. Generally, the only way to know is to try both and compare them using cross-validation (tuning the hyperparameters using grid search).
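A minimal sketch of such a comparison, assuming a synthetic dataset from make_classification and default hyperparameters apart from the number of trees (no grid search is shown here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; replace with your own X and y
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           random_state=42)

models = {
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=42),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=200, random_state=42),
}

cv_means = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy for each ensemble
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Which of the two comes out ahead depends on the data, so the scores from this toy dataset say nothing about the diabetes data above.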

## Terminology:
**Bootstrap:** Randomly sampling data while allowing duplicates is called sampling with replacement, or bootstrapping. For example, if we have 10 rows and draw 3 of them with bootstrapping, we might pick row 1 first, then row 5, and finally row 1 again (or row 5, or row 6).
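The 10-row example can be sketched with NumPy. The row labels and the seed are arbitrary; the only difference between the two draws is the replace flag.

```python
import numpy as np

rng = np.random.default_rng(0)
rows = np.arange(1, 11)          # row labels 1..10

# Bagging-style draw: with replacement, so a row may appear more than once
bagging_sample = rng.choice(rows, size=3, replace=True)

# Pasting-style draw: without replacement, so all rows are distinct
pasting_sample = rng.choice(rows, size=3, replace=False)

print("with replacement:   ", bagging_sample)
print("without replacement:", pasting_sample)
```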

### Feature Importance

Yet another great quality of Random Forests is that they make it easy to measure the relative importance of each feature. Scikit-Learn measures a feature’s importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). More precisely, it is a weighted average, where each node’s weight is equal to the number of training samples that are associated with it (see Chapter 6).

Random Forests are very handy to get a quick understanding of what features actually matter, in particular if you need to perform feature selection.
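As an illustration, the sketch below trains a Random Forest on the iris dataset bundled with Scikit-Learn and prints each feature's importance from the feature_importances_ attribute. The dataset and hyperparameters are chosen for the example and are not part of the notebook above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(iris.data, iris.target)

# feature_importances_ is normalized so that the values sum to 1
importances = dict(zip(iris.feature_names, rnd_clf.feature_importances_))
for name, score in importances.items():
    print(f"{name}: {score:.3f}")
```

On this dataset the two petal measurements account for most of the importance.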

## Boosting

The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor.

### Adaboost

One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted. This results in new predictors focusing more and more on the hard cases. This is the technique used by AdaBoost.

For example, when training an AdaBoost classifier, the algorithm first trains a base classifier (such as a Decision Tree) and uses it to make predictions on the training set. The algorithm then increases the relative weight of misclassified training instances. Then it trains a second classifier, using the updated weights, and again makes predictions on the training set, updates the instance weights, and so on.

**WARNING**
There is one important drawback to this sequential learning technique: it cannot be parallelized (or only partially), since each predictor can only be trained after the previous predictor has been trained and evaluated. As a result, it does not scale as well as bagging or pasting.
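A minimal AdaBoost sketch with Scikit-Learn's AdaBoostClassifier, using decision stumps (trees with max_depth=1) as base classifiers. The moons data, the 200 estimators and the learning rate of 0.5 are assumptions chosen for illustration, not settings from these notes.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data so the example is self-contained
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Each stump is trained on reweighted data, so later stumps
# concentrate on the instances earlier stumps got wrong
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=200, learning_rate=0.5, random_state=42)
ada_clf.fit(X_tr, y_tr)

ada_acc = ada_clf.score(X_te, y_te)
print(f"AdaBoost test accuracy: {ada_acc:.3f}")
```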

A helpful explanation of AdaBoost: https://www.youtube.com/watch?v=LsK-xG1cLYA&ab_channel=StatQuestwithJoshStarmer

### Gradient Boosting

A helpful explanation of Gradient Boosting: https://www.youtube.com/watch?v=LsK-xG1cLYA&ab_channel=StatQuestwithJoshStarmer

The book also has a great illustration of this.

The learning_rate hyperparameter scales the contribution of each tree. If you set it to a low value, such as 0.1, you will need more trees in the ensemble to fit the training set, but the predictions will usually generalize better. This is a regularization technique called **shrinkage**.
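To make shrinkage concrete, here is a hand-rolled sketch of gradient boosting for regression with squared error. Each new tree is fit to the residuals of the current ensemble, and its prediction is scaled by the learning rate before being added. The noisy quadratic data, tree depth and tree count are assumptions, and this is a simplified illustration rather than how GradientBoostingRegressor is implemented internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(200, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(200)   # noisy quadratic

learning_rate = 0.1
n_trees = 50

# Start from a constant prediction (the mean of the targets)
prediction = np.full_like(y, y.mean())
trees = []
for _ in range(n_trees):
    residuals = y - prediction                 # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X, residuals)
    # Shrinkage: add only a fraction of the new tree's correction
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

mse_start = np.mean((y - y.mean()) ** 2)
mse_end = np.mean((y - prediction) ** 2)
print(f"training MSE: {mse_start:.5f} -> {mse_end:.5f}")
```

With a smaller learning_rate each tree contributes less, so more trees are needed to reach the same training error.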

In order to find the optimal number of trees, you can use early stopping (see Chapter 4). A simple way to implement this is to use the staged_predict() method: it returns an iterator over the predictions made by the ensemble at each stage of training (with one tree, two trees, etc.). The following code trains a GBRT ensemble with 120 trees, then measures the validation error at each stage of training to find the optimal number of trees, and finally trains another GBRT ensemble using the optimal number of trees:

In [83]:
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np
from sklearn.metrics import mean_squared_error

# Train a Gradient Boosting ensemble with 120 shallow trees
gbrt = GradientBoostingClassifier(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)

# staged_predict() returns a generator that yields the ensemble's predictions
# after 1 tree, 2 trees, ..., 120 trees. With 0/1 labels, the MSE below is
# simply the misclassification rate. Note that the test set doubles as the
# validation set here.
errors = [mean_squared_error(y_test, y_pred)
          for y_pred in gbrt.staged_predict(X_test)]

# np.argmin() returns the index of the smallest error (the first one if there
# are ties), e.g. index 2 for [5, 3, 1, 5]; add 1 to turn it into a tree count
bst_n_estimators = np.argmin(errors) + 1

gbrt_best = GradientBoostingClassifier(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=2,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=39,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [84]:
y_pred=gbrt_best.predict(X_train)
accuracy_score(y_train, y_pred)

0.8212290502793296

In [85]:
y_pred=gbrt_best.predict(X_test)
accuracy_score(y_test, y_pred)

0.7662337662337663

It is also possible to implement early stopping by actually stopping training early, by setting warm_start=True. This makes Scikit-Learn keep the existing trees when fit() is called again, so the ensemble can be trained incrementally.

The book has code for this, but I don't think it is that useful. If you are curious, check the Gradient Boosting section of the book.
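For reference, the warm_start idea can be sketched as follows. This is an adaptation for a classifier on synthetic data, not the book's exact code: the dataset, the validation split, the cap of 120 trees and the patience of 5 rounds are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=42)

# warm_start=True keeps the trees already trained when fit() is called again
gbrt = GradientBoostingClassifier(max_depth=2, warm_start=True, random_state=42)

min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 121):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_tr, y_tr)                       # only adds the missing trees
    val_error = 1 - accuracy_score(y_val, gbrt.predict(X_val))
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:                # no improvement for 5 rounds
            break

n_trees_trained = gbrt.n_estimators
print(f"stopped at {n_trees_trained} trees, best validation error {min_val_error:.3f}")
```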

The GradientBoostingRegressor class also supports a subsample hyperparameter, which specifies the fraction of training instances to be used for training each tree. For example, if subsample=0.25, then each tree is trained on 25% of the training instances, selected randomly. As you can probably guess by now, this technique trades a higher bias for a lower variance. It also speeds up training considerably. This is called Stochastic Gradient Boosting.
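A short sketch of Stochastic Gradient Boosting, assuming synthetic data and the classifier variant (GradientBoostingClassifier accepts the same subsample hyperparameter). The values below are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# subsample=0.25: each tree is trained on a random 25% of the training instances
sgb_clf = GradientBoostingClassifier(
    max_depth=2, n_estimators=100, subsample=0.25, random_state=42)
sgb_clf.fit(X_tr, y_tr)

sgb_acc = sgb_clf.score(X_te, y_te)
print(f"stochastic gradient boosting test accuracy: {sgb_acc:.3f}")
```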

-------------

## XGBoost
It is worth noting that an optimized implementation of Gradient Boosting is available in the popular Python library XGBoost, which stands for Extreme Gradient Boosting. This package was initially developed by Tianqi Chen as part of the Distributed (Deep) Machine Learning Community (DMLC), and it aims to be extremely fast, scalable, and portable. In fact, XGBoost is often an important component of the winning entries in ML competitions. XGBoost’s API is quite similar to Scikit-Learn’s:

In [90]:
import xgboost
xgb_clf = xgboost.XGBClassifier()
xgb_clf.fit(X_train, y_train)
y_pred = xgb_clf.predict(X_test)

XGBoost also offers several nice features, such as automatically taking care of early stopping:

In [95]:
# early_stopping_rounds=5 stops training once the validation error has not
# improved for 5 consecutive rounds. (Recent XGBoost releases expect this
# argument in the XGBClassifier constructor rather than in fit().)
xgb_clf.fit(X_train, y_train,
            eval_set=[(X_test, y_test)], early_stopping_rounds=5)
y_pred = xgb_clf.predict(X_test)

[0]	validation_0-error:0.29004
Will train until validation_0-error hasn't improved in 5 rounds.
[1]	validation_0-error:0.26840
[2]	validation_0-error:0.27273
[3]	validation_0-error:0.27706
[4]	validation_0-error:0.28571
[5]	validation_0-error:0.28571
[6]	validation_0-error:0.25974
[7]	validation_0-error:0.27273
[8]	validation_0-error:0.24242
[9]	validation_0-error:0.25541
[10]	validation_0-error:0.25541
[11]	validation_0-error:0.25541
[12]	validation_0-error:0.24675
[13]	validation_0-error:0.25108
Stopping. Best iteration:
[8]	validation_0-error:0.24242



-----------------

## Stacking

...

---

# EXERCISES

* If you have trained five different models on the exact same training data, and they all achieve 95% precision, is there any chance that you can combine these models to get better results? If so, how? If not, why?
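    * Yes, there is.
    * How?
        * Combine them into a voting ensemble, for example with the VotingClassifier used earlier in these notes. This works best when the models are very different from one another.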
    
    

* What is the difference between hard and soft voting classifiers?
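    * hard: each classifier votes for a single class, and the class with the most votes wins.
    * soft: the predicted class probabilities are averaged across the classifiers, and the class with the highest average probability wins.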

* Is it possible to speed up training of a bagging ensemble by distributing it across multiple servers? What about pasting ensembles, boosting ensembles, Random Forests, or stacking ensembles?
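    * Yes for bagging, pasting, and Random Forests: their predictors are trained independently of one another, so they can be spread across servers. Boosting ensembles are sequential (each predictor depends on the previous one), so distributing them helps little. For stacking, the predictors within one layer are independent and can be trained in parallel, but each layer has to wait for the layer below it.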

* What is the benefit of out-of-bag evaluation?
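    * Each predictor is evaluated on instances it was not trained on, so we get a performance estimate close to what we would see on test data, without setting aside a separate validation set.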

* What makes Extra-Trees more random than regular Random Forests? How can this extra randomness help? Are Extra-Trees slower or faster than regular Random Forests?
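    * Both consider only a random subset of features at each node, but a Random Forest searches for the best threshold for each of those features, whereas Extra-Trees uses a random threshold for each feature.
    * The extra randomness acts as a form of regularization: it trades more bias for a lower variance.
    * Extra-Trees are faster to train than Random Forests, because searching for the best threshold at every node is one of the most time-consuming parts of growing a tree, and Extra-Trees skips it.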

* If your AdaBoost ensemble underfits the training data, which hyperparameters should you tweak and how?
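    * Increase the number of estimators, or reduce the regularization of the base estimator (for example, allow a larger max_depth or more leaf nodes). Slightly increasing the learning rate can also help.
* If your Gradient Boosting ensemble overfits the training set, should you increase or decrease the learning rate?
    * Decrease the learning rate, which should give lower variance at the cost of somewhat higher bias. Early stopping can also help, by finding the right number of predictors.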
    
    
* Load the MNIST data (introduced in Chapter 3), and split it into a training set, a validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for validation, and 10,000 for testing). Then train various classifiers, such as a Random Forest classifier, an Extra-Trees classifier, and an SVM classifier. Next, try to combine them into an ensemble that outperforms each individual classifier on the validation set, using soft or hard voting. Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?

* Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image’s class. Train a classifier on this new training set. Congratulations, you have just trained a blender, and together with the classifiers it forms a stacking ensemble! Now evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble’s predictions. How does it compare to the voting classifier you trained earlier?
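A possible starting point for the blender exercise, sketched on Scikit-Learn's small built-in digits dataset as a stand-in for MNIST so that it runs quickly without a download. The split sizes, the choice of a Random Forest as blender and the `stack_predictions` helper are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_full, X_te, y_full, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X_full, y_full, test_size=0.25,
                                            random_state=42)

# First layer: individual classifiers trained on the training set only
base_models = [
    RandomForestClassifier(n_estimators=100, random_state=42),
    ExtraTreesClassifier(n_estimators=100, random_state=42),
    SVC(random_state=42),
]
for model in base_models:
    model.fit(X_tr, y_tr)

def stack_predictions(models, X):
    """Hypothetical helper: one column of predicted classes per base model."""
    return np.column_stack([m.predict(X) for m in models])

# The blender is trained on the base models' predictions for the validation set
blender = RandomForestClassifier(n_estimators=200, random_state=42)
blender.fit(stack_predictions(base_models, X_val), y_val)

# Evaluate the whole stacking ensemble on the test set
stack_acc = blender.score(stack_predictions(base_models, X_te), y_te)
print(f"stacking ensemble test accuracy: {stack_acc:.3f}")
```

Training the blender on validation-set predictions matters: the base models never saw those instances, so their predictions there are not overly optimistic.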