# Lab 6b: Multiple models & Ensemble Learning 

## Outline
- Introduction
    - Voting Methods
- Ensemble Learning 
    - Bagging
        - Random Forest
    - Boosting
        - AdaBoost
        - GradientBoost
- Summary

### What's missing?
This lab provides a helpful introduction to the use of multiple models and ensemble methods for classification.  There are two topics it doesn't cover that should be addressed when using these approaches in the real world.
1. Stratification: When class labels are imbalanced in your data, or the data is not iid, it is important to use stratified sampling to ensure the class proportions in the training and test sets closely match those found in the original data.  The sklearn parameter `stratify` is often available to help one do this.
2. Reproducibility: Several of the methods considered here construct classifiers via random sampling or other random selections.  In sklearn, setting the `random_state` parameter to a particular seed when using these methods will ensure you generate reproducible results between runs.  This can be helpful when debugging or trying to understand the big picture.
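
As a rough illustration of both points, the sketch below builds a deliberately imbalanced version of the iris data and splits it with `train_test_split`, using `stratify` and `random_state`. The subsampling step and the variable names (`X_imb`, `y_imb`, and so on) are assumptions made for this example only; they are not used elsewhere in the lab.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_all, y_all = load_iris(return_X_y=True)

# Make the data imbalanced on purpose (illustration only):
# keep every sample of classes 0 and 1, but only 10 samples of class 2.
keep = np.r_[np.where(y_all != 2)[0], np.where(y_all == 2)[0][:10]]
X_imb, y_imb = X_all[keep], y_all[keep]

# stratify=y_imb keeps the class proportions similar in both splits;
# random_state fixes the shuffle so the split is the same on every run.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_imb, y_imb, test_size=0.2, stratify=y_imb, random_state=0)

print("full :", np.round(np.bincount(y_imb) / len(y_imb), 2))
print("train:", np.round(np.bincount(y_tr) / len(y_tr), 2))
print("test :", np.round(np.bincount(y_te) / len(y_te), 2))
```

The three printed rows should be nearly identical, which is what stratification is meant to achieve.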

## Introduction
In most areas, having multiple experts work on a problem often leads to a better solution.  The same idea can be applied to Machine Learning,  where the experts are different _estimators_ (e.g., classifiers, regressors, neural nets).  As in the real world, if the estimators are generally accurate but "have different perspectives" (i.e., they tend to make different mistakes), their combination often performs better than any single one of them.
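
To get a feel for why this can work, consider an idealized case that is an assumption of this sketch rather than something the lab relies on: a two-class problem with `n` classifiers that are each correct with probability `p` and make their errors independently. The probability that the majority is correct then follows from the binomial distribution. The helper name `majority_accuracy` is hypothetical.

```python
from math import comb

def majority_accuracy(n, p):
    """Probability that a majority of n independent voters is correct,
    when each voter is correct with probability p (n is assumed odd)."""
    k_min = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

for n in (1, 3, 5, 11):
    print(n, round(majority_accuracy(n, 0.7), 3))
```

With `p = 0.7`, three voters already reach about 0.784. Real classifiers are rarely independent, so the gain in practice is usually smaller than this idealized calculation suggests.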

In the Introduction section of this lab, we will use two features of the iris dataset (sepal length (cm), petal length (cm)) to classify iris data.   The code below shows the outcome of combining the results of three different classifiers for this task. 


In [None]:
from itertools import product
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

from sklearn.model_selection import cross_val_score

# Loading some example data
iris = datasets.load_iris()
X = iris.data[:, [0, 2]]  # sepal length (cm), petal length (cm)
y = iris.target

# Training classifiers
clf1 = DecisionTreeClassifier(max_depth=3)
clf2 = KNeighborsClassifier(n_neighbors=10)
clf3 = SVC(kernel='rbf', probability=True)
eclf = VotingClassifier(estimators=[('dt', clf1), ('knn', clf2),
                                    ('svc', clf3)],
                        voting="hard", weights=[2, 1, 1])

clf1.fit(X, y)
clf2.fit(X, y)
clf3.fit(X, y)
eclf.fit(X, y)

# Plotting decision regions
def plot_clf(plt, clf, X, y):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.4)
    plt.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor='k')

f, axarr = plt.subplots(2, 2, sharex='col', sharey='row', figsize=(10, 8))
for idx, clf, tt in zip(product([0, 1], [0, 1]),
                        [clf1, clf2, clf3, eclf],
                        ['Decision Tree (depth=3)', 'kNN (k=10)',
                         'SVC(kernel=rbf)', 'Combined (%s voting)'%(eclf.voting)]):
    plot_clf(axarr[idx[0], idx[1]], clf,X,y)
    axarr[idx[0], idx[1]].set_title(tt)
    # Compute and display out of fold accuracy
    
plt.show()

ac = cross_val_score(eclf, X, y, cv=5, scoring = 'accuracy')
print("Combined (%s voting) classifier accuracy:"%(eclf.voting))
print(" mean_test_score +/- std_test_score\n %0.3f +/- %0.2f"%(ac.mean(), ac.std()))
print(iris.feature_names)

### Combining Estimators / Voting
In the above example, *voting* is used to determine the output of the combined classifier.

In majority *voting*, the predicted class label for a particular sample is the class label that represents the majority (mode) of the labels predicted by each individual classifier.
E.g., if the predictions for a given sample are
* classifier 1 -> label A
* classifier 2 -> label A
* classifier 3 -> label B  
the combined prediction is “label A”.

Weights can also be applied to make some classifiers' votes count more than others. 
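
The following is a minimal sketch of weighted majority voting for a single sample. The function name `weighted_hard_vote` is hypothetical and this is not how `VotingClassifier` is implemented internally; it only illustrates the idea of summing the weights behind each predicted label.

```python
import numpy as np

def weighted_hard_vote(labels, weights=None):
    """Return the label whose votes carry the largest total weight."""
    labels = np.asarray(labels)
    if weights is None:
        weights = np.ones(len(labels))          # plain majority vote
    weights = np.asarray(weights, dtype=float)
    classes = np.unique(labels)                 # sorted unique labels
    totals = np.array([weights[labels == c].sum() for c in classes])
    return classes[np.argmax(totals)]           # ties go to the first class in sorted order

print(weighted_hard_vote(["A", "A", "B"]))             # A (two votes against one)
print(weighted_hard_vote(["A", "A", "B"], [1, 1, 3]))  # B (weight 3 against weight 2)
```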

Many classifiers, when given a set of features, also generate estimates of the probability of each possible label being present.  These probabilities can be combined to form more nuanced predictions.  

One way to do this is via _soft voting_.  With this approach, the weighted sum $S(k)$ of the probabilities associated with each possible class label $k$ is computed, and then the label associated with the largest sum is selected.  For example, in the three-classifier case, using weights $w_1, w_2$ and $w_3$:

$$S(k) = w_1 P_1(k)+w_2 P_2(k) + w_3 P_3(k)$$

The final label (soft vote) for the data is the one with the largest sum.

$$\text{soft vote} = \operatorname{argmax}_k \, S(k)$$

Example:  The table below contains label probabilities for a 3-class classification problem with class labels A,B,C.

 classifier i | P_i(A)       | P_i(B)       | P_i(C) 
 -------------|:------------:|:------------:|:-----------:  
 classifier 1 |  0.1  | 0.8 | 0.1
 classifier 2 |  0.5  | 0.4 | 0.1
 classifier 3 |  0.6  | 0.3 | 0.1
  

Using weights of 1/3 on all classifiers (i.e., $w_1=1/3, w_2=1/3, w_3=1/3$), $S(k)$ is found by applying the weights and summing each column, and has the following values:

  | S(A) | S(B) | S(C)
 -----|------|------|------
 S(k) | 0.4  | 0.5  | 0.1
 

The predicted class label (soft vote) is B, since it has the largest weighted sum.  Below is a bar chart showing the values of $P_i(k)$ and $S(k)$.

In [None]:
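# Columns follow the table above: P_1, P_2, P_3 are the per-classifier
# probabilities and S is their uniformly weighted sum.
df=pd.DataFrame({'$P_1(k)$': [.1,.8,.1], '$P_2(k)$': [.5,.4,.1],
                 '$P_3(k)$': [.6,.3,.1], '$S(k)$':[.4, .5, .1]}, 
             index = ['A', 'B', 'C'])
df.plot.bar(grid=True); plt.show()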
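
The weighted sums in the table above can also be checked numerically. This is a small sketch in which the array names (`P`, `w`, `S`, `labels`) are chosen for the example; each row of `P` holds one classifier's probabilities from the table.

```python
import numpy as np

labels = np.array(["A", "B", "C"])
P = np.array([[0.1, 0.8, 0.1],   # classifier 1
              [0.5, 0.4, 0.1],   # classifier 2
              [0.6, 0.3, 0.1]])  # classifier 3
w = np.full(3, 1 / 3)            # uniform weights

S = w @ P                        # S(k) = sum_i w_i * P_i(k)
print(dict(zip(labels.tolist(), np.round(S, 3).tolist())))
print("soft vote:", labels[np.argmax(S)])   # soft vote: B
```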

## Task: Hard and Soft Voting
1. In general, when using soft voting with n classifiers, if one uses a uniform vote weighting of 1/n, do the soft voting scores define a probability measure on possible class labels?

2. Run in isolation, what label would be selected as the output for each classifier in the table and plot above?  
Using a majority voting (hard voting), what class label would be selected using these outputs?  
Do **soft voting** and **hard voting** always agree?  Always disagree?

3. The [sklearn.ensemble.VotingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html) 
allows one to perform both majority (hard) voting and soft voting with a collection of individual classifiers.
The `voting` parameter selects the method to use. Re-run the code above by changing the `voting` parameter from `"hard"` to `"soft"`.  
Looking at the decision boundaries, does it look like the combined classifier results improve? What does the out of fold accuracy show? Explain your reasoning.

Put your answers below

### BEGIN SOLUTION
1) **In general, when using soft voting with n classifiers, if one uses a uniform vote weighting of 1/n, do the soft voting scores define a probability measure on possible class labels?**
Yes. In fact, any time the weights $w_i$ define a probability distribution (i.e., are non-negative and sum to one), the resulting soft vote scores define a probability distribution over the labels.  Each $S(k)$ is non-negative because it is a non-negative combination of probabilities, so it remains to sum $S(k)$ over all labels and show that the result equals one, i.e.,

$$\sum_k S(k) = \sum_k \sum_i w_i P_i(k) = \sum_i \sum_k w_i P_i(k) = \sum_i w_i \sum_k P_i(k) = \sum_i w_i = 1$$

2) **What label would be selected as the output for each classifier in the table above?**

classifier i | Most likely class label
-------------|:-: 
classifier 1 | B  
classifier 2 | A 
classifier 3 | A  

**Using a majority voting, what class label would be selected using these outputs?**  
A would be selected using majority (hard) voting.

**What do you conclude about the agreement between soft voting and hard voting?**
Hard and soft voting may or may not agree.  In this example they disagree: soft voting selects B, while hard voting selects A.

3) **Looking at the decision boundaries, does it look like the combined classifier results improve? What does the out of fold accuracy show?**
The Combined (soft voting) classifier decision boundaries look like they handle outliers better than the hard voting case.  It also has higher accuracy.  
### END SOLUTION

## Hyper-parameter optimization 
Independently optimizing each estimator before combining it with others is a good first step, but given it's their combined performance that is of interest, simultaneously optimizing all hyper-parameters may yield additional improvements.  

Below is a simultaneous optimization using `sklearn.model_selection.GridSearchCV`.  In this example, the voting strategies for the combined classifier are also explored.

More detailed examples involving full _pipeline optimization_, including feature selection, can be found in this [MLxtend article](http://rasbt.github.io/mlxtend/user_guide/classifier/EnsembleVoteClassifier/) on Sebastian Raschka's `EnsembleVoteClassifier` (which became the basis for scikit-learn's `VotingClassifier`).  See the pipeline and feature selection documentation at sklearn as well.
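
As a rough sketch of what pipeline optimization can look like, the example below wraps a `VotingClassifier` in a `Pipeline` with a scaling step and tunes everything in one grid search. The step names (`scale`, `vote`), the choice of `StandardScaler`, and the small parameter grid are assumptions made for illustration; nested parameters are addressed by joining names with double underscores.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

vote = VotingClassifier(estimators=[
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("knn", KNeighborsClassifier()),
], voting="soft")

# The scaler is fitted inside each cross-validation fold, together with the classifiers.
pipe = Pipeline([("scale", StandardScaler()), ("vote", vote)])

# <pipeline step>__<estimator name>__<parameter>
param_grid = {
    "vote__dt__max_depth": [2, 4],
    "vote__knn__n_neighbors": [5, 10],
    "vote__voting": ["hard", "soft"],
}
pipe_grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring="accuracy")
pipe_grid.fit(X_iris, y_iris)
print(pipe_grid.best_params_, round(pipe_grid.best_score_, 3))
```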

In [None]:
from sklearn.model_selection import cross_val_score, GridSearchCV
clf1 = DecisionTreeClassifier()
clf2 = KNeighborsClassifier()
clf3 = SVC(kernel='rbf', probability=True)
eclf = VotingClassifier(estimators=[('DT', clf1), ('kNN', clf2), ('SVC', clf3)])

params = {'DT__max_depth': [3, 4, 5],
          'kNN__n_neighbors': [6, 7, 8],
          'voting': ['hard','soft']}
          
grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5, scoring = 'accuracy')

X = iris.data[:, [0, 2]]
y = iris.target
grid.fit(X, y)

def report_results(grid):
    """Print mean and standard deviation of the out of fold scores for every
    parameter combination; the best-scoring rows are marked with '*'."""
    cv_keys = ('mean_test_score', 'std_test_score', 'params')
    print('mean_test_score +/- std_test_score, {params}') 
    for r, _ in enumerate(grid.cv_results_['mean_test_score']):
        bf = '*' if grid.cv_results_[cv_keys[0]][r]==grid.best_score_ else ' '
        print(bf+"%0.3f +/- %0.2f %r"
          % (grid.cv_results_[cv_keys[0]][r],
             grid.cv_results_[cv_keys[1]][r],
             grid.cv_results_[cv_keys[2]][r]))

report_results(grid)

## Task: Working with GridSearchCV
Using the code block below, determine the best estimator found by `GridSearchCV` and then do the following:
1. Print out the best model's average accuracy score on out of fold data. 
2. Print out the best model's parameters.
3. Use `plot_clf` to display its decision boundaries.

hint: This information is available in `grid` variable that was computed in the previous cell.

In [None]:
### BEGIN SOLUTION
best_estimator = grid.best_estimator_

print('Best out of fold accuracy =', grid.best_score_)
print('Best parameters =',  grid.best_params_)

plot_clf(plt, best_estimator, X, y)
plt.show()
### END SOLUTION

## Ensemble Methods
The goal of **ensemble methods** is to _automatically create_ a set of 
base estimators (using the method's learning algorithm) which are then combined 
using a voting approach.  

Two families of ensemble methods are usually distinguished:
- In **averaging methods**, the driving principle is to build several estimators 
independently and then to combine their predictions using averaging, weighted averaging, or voting. The combined 
estimator is usually better than any single base estimator because 
its variance is reduced. **Examples:** 
  - [Bagging methods](http://scikit-learn.org/stable/modules/ensemble.html#bagging), 
  - [Forests of randomized trees](http://scikit-learn.org/stable/modules/ensemble.html#forest)  


- By contrast, in **boosting methods**, base estimators are built sequentially 
in order to try to reduce the bias of previously constructed estimators.
Here again, the motivation is to combine several weak models to produce a more powerful one. **Examples:** 
  - [AdaBoost](http://scikit-learn.org/stable/modules/ensemble.html#adaboost)
  - [Gradient Tree Boosting](http://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting), … 

### Bagging

Bagging (short for bootstrap aggregating) is a form of ensemble averaging.  As stated earlier, the goal is to improve the performance of simple models and reduce overfitting of more complex models. With bagging, several models are fitted on different samples of the training data (drawn with replacement). Then, these models are aggregated by using their average, weighted average or a voting system.

A key insight for bagging is that by averaging (or generally aggregating) many low bias, high variance models, we can reduce the variance while retaining the low bias. Here’s an example of this for density estimation:

<img src="https://qph.ec.quoracdn.net/main-qimg-55c44d63831742ddd387541a428fcedf.webp">

Each estimate is centered around the true density, but is overly complicated (low bias, high variance). By averaging them out, we get a smoothed version of them (low variance), still centered around the true density (low bias). (Jonathan Gordon on Quora)
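
The bootstrap-and-aggregate procedure described in this section can be sketched by hand in a few lines. This is only an illustration under assumed settings (a synthetic `make_moons` dataset, 50 fully grown trees, a simple majority vote); sklearn's `BaggingClassifier` provides the same idea with more options.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_m, y_m = make_moons(n_samples=600, noise=0.3, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_m, y_m, test_size=0.5, stratify=y_m, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(50):
    # Bootstrap sample: draw len(X_fit) row indices with replacement.
    idx = rng.integers(0, len(X_fit), size=len(X_fit))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_fit[idx], y_fit[idx]))

votes = np.array([t.predict(X_val) for t in trees])    # shape (n_trees, n_val)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)  # majority vote over 0/1 labels

single_acc = np.mean([np.mean(v == y_val) for v in votes])  # average single-tree accuracy
bagged_acc = np.mean(bagged_pred == y_val)
print("average single tree: %.3f, bagged: %.3f" % (single_acc, bagged_acc))
```

Comparing the two printed numbers gives a rough sense of how much the averaging helps on this particular dataset.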

#### Random Forest

Random forest is a variant of bagging which results in a more random but potentially more powerful classifier.  

Random Forests use subsets of the data (as in regular bagging) to create a set of decision tree classifiers, where a certain amount of additional randomness is introduced into the fitting method for each tree.  When splitting a node during the construction of a tree, the split that is chosen is no longer the best split among all features. Rather, the split picked is the best split among a random subset of the features. As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.

This additional randomness is used to try and make the underlying trees more independent, which improves performance.


Below is an example setup call to create a random forest classifier in sklearn.

In [None]:
from sklearn.ensemble import RandomForestClassifier 
# max_features="sqrt" => each split considers sqrt(n_features) randomly chosen features
clf = RandomForestClassifier(n_estimators=20, max_features='sqrt') 

The parameter **n_estimators** controls the number of trees in the forest and the parameter **max_features** controls the number of randomly selected features to consider when looking for the best split. 

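Below are some snippets from the [RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) documentation.  They come from an older release: in scikit-learn 0.22 and later the default for `n_estimators` is 100, and the `"auto"` option for `max_features` was later removed in favor of `"sqrt"`, so check the documentation for the version you have installed.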

**n_estimators** : integer, optional (default=10)

> The number of trees in the forest.

**max_features** : int, float, string or None, optional (default="auto")

> The number of features to consider when looking for the best split:
>
> -   If int, then consider *max_features* features at each split.
> -   If float, then *max_features* is a fraction and `int(max_features * n_features)` features are considered at each split.
> -   If "auto", then *max_features=sqrt(n_features)*.
> -   If "sqrt", then *max_features=sqrt(n_features)* (same as "auto").
> -   If "log2", then *max_features=log2(n_features)*.
> -   If None, then *max_features=n_features*.
>
> Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than `max_features` features.

Notice that when *max_features* is set to the total number of features in the data, random feature selection during splitting is eliminated, and Random Forest reduces to decision tree based bagging.
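
One way to explore this is to compare a forest that considers a random subset of features at each split with one that considers all of them. The sketch below uses all four iris features and 50 trees, both of which are choices made for this example; the dictionary `mf_scores` is a hypothetical name.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_iris4, y_iris4 = load_iris(return_X_y=True)   # all four iris features

mf_scores = {}
for mf in ("sqrt", None):
    # max_features=None considers every feature at each split, so the only
    # remaining randomness is the bootstrap sampling (bagged decision trees).
    forest = RandomForestClassifier(n_estimators=50, max_features=mf, random_state=0)
    scores = cross_val_score(forest, X_iris4, y_iris4, cv=5, scoring="accuracy")
    mf_scores[mf] = scores.mean()
    print("max_features=%r: %.3f" % (mf, scores.mean()))
```

On a small, easy dataset such as iris the two settings may score almost the same; the difference tends to matter more when there are many correlated features.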

### Boosting 

Boosting is a general ensemble method that creates a stronger model from a number of weaker models.

This is done by building a model from the training data, then creating a second model that attempts to correct the errors from the first model. Models are added until the training set is predicted perfectly or a maximum number of models are added.

#### Adaptive Boosting (AdaBoost)

The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are at least slightly better than random guessing, such as small decision trees) on weighted versions of the data. As iterations proceed, examples that are difficult to predict receive ever-increasing influence.

### An Example

<img src="images/boosting1.png" style="width:600px;">

(1) We start with one weak learner (for example a [decision tree stump](https://en.wikipedia.org/wiki/Decision_stump)) to classify training samples.  
(2) In the next round, we then train another weak learner (e.g. decision tree stump) that focuses on getting the samples that were misclassified in (1). We achieve this by putting a larger weight on the previously misclassified training samples.  
(3) The 2nd classifier will likely get some other samples wrong, so we can re-adjust the weights and train a third classifier accordingly.  
(4) The same logic as in (3) is applied again.


In a nutshell, we can summarize “Adaboost” as “adaptive” or “incremental” learning from mistakes. Eventually, we will come up with a model that has a lower bias than an individual decision tree (thus, it is less likely to underfit the training data).
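
The reweighting loop described in steps (1) to (4) can be written out by hand. The sketch below follows the classic discrete AdaBoost update for two classes labeled -1 and +1; the dataset (`make_moons`), the number of rounds, and all variable names are assumptions made for illustration, and sklearn's `AdaBoostClassifier` may differ in its details.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X_ab, y01 = make_moons(n_samples=300, noise=0.25, random_state=0)
y_ab = 2 * y01 - 1                        # relabel the classes as -1 / +1

n_rounds = 20
w = np.full(len(y_ab), 1 / len(y_ab))     # start with uniform sample weights
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_ab, y_ab, sample_weight=w)
    pred = stump.predict(X_ab)
    err = np.sum(w[pred != y_ab])         # weighted error (w sums to one)
    if err >= 0.5:                        # no better than chance: stop adding learners
        break
    err = max(err, 1e-10)                 # avoid division by zero for a perfect stump
    alpha = 0.5 * np.log((1 - err) / err) # vote weight of this stump
    w = w * np.exp(-alpha * y_ab * pred)  # up-weight mistakes, down-weight correct samples
    w = w / w.sum()
    stumps.append(stump)
    alphas.append(alpha)

def ada_predict(X_new):
    """Sign of the alpha-weighted sum of the stump predictions."""
    score = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.where(score >= 0, 1, -1)

first_acc = np.mean(stumps[0].predict(X_ab) == y_ab)
ens_acc = np.mean(ada_predict(X_ab) == y_ab)
print("first stump: %.3f, boosted ensemble: %.3f (training accuracy)" % (first_acc, ens_acc))
```

Note that these are training accuracies; out of fold scores are still needed to judge generalization.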

#### Gradient Boosting

Gradient Boosting is another popular boosting technique that is similar to Adaptive Boosting.  The major difference is that Gradient Boosting identifies and corrects the shortcomings of the weak learners using gradients of the loss function.
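
The idea is easiest to see in the simplest setting, which is regression with a squared-error loss: there the negative gradient of the loss with respect to the current prediction is just the residual, so each new tree is fitted to what the ensemble still gets wrong. The sketch below assumes that setting, a synthetic noisy sine curve, and a learning rate of 0.1; it is an illustration, not a description of `GradientBoostingClassifier`'s internals.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_gb = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
t_gb = np.sin(x_gb).ravel() + rng.normal(scale=0.2, size=200)   # noisy targets

learning_rate = 0.1
F = np.full_like(t_gb, t_gb.mean())       # initial model: a constant prediction
mse_start = np.mean((t_gb - F) ** 2)

gb_trees = []
for _ in range(100):
    residual = t_gb - F                   # negative gradient of 0.5 * (t - F)**2 w.r.t. F
    tree = DecisionTreeRegressor(max_depth=2).fit(x_gb, residual)
    F = F + learning_rate * tree.predict(x_gb)   # small step in the direction of the fit
    gb_trees.append(tree)

mse_end = np.mean((t_gb - F) ** 2)
print("training MSE: %.4f -> %.4f" % (mse_start, mse_end))
```

For classification, the same loop is applied to a differentiable classification loss, and the trees are fitted to its negative gradient instead of the plain residual.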

## scikit-learn Ensemble Methods
You will now use scikit-learn to set up and optimize the ensemble classifiers described above. 

Understanding the useful model parameters associated with these powerful classification approaches takes time and practice, and as with all classification approaches, what works well will be problem domain or even data set specific.  

Please run the code below to create and display a random blob data set.  The rest of the lab will focus on classification tasks using it.

In [None]:
from sklearn.datasets import make_blobs
XX, yy = make_blobs(n_samples=1000, centers=20, random_state=42)
yy = yy % 2
plt.scatter(XX[:, 0], XX[:, 1], c=yy, s=20, edgecolor='k'); plt.show()
yy[:10]

## Task: Ensemble training
Read the hyper-linked documentation and build models on the  blob dataset (XX, yy) using the following approaches:

1. [BaggingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html)
2. [RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)   
3. [AdaBoostClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) 
4. [GradientBoostingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)  

Each model has parameters you can adjust.  Typically these are general parameters related to the underlying classifiers (e.g. when using decision trees,
**max_depth**, **min_samples_leaf**) and other parameters related to how the different base classifiers are generated (e.g., when using RandomForestClassifier, **n_estimators**, **max_features**).
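
If you are unsure which parameter names a model accepts in a grid search, one option is to ask the estimator itself through `get_params()`. The short sketch below lists the names for two of the ensemble classes; the dictionary name `tunable` is hypothetical.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

tunable = {}
for model in (RandomForestClassifier(), GradientBoostingClassifier()):
    # get_params() returns every constructor parameter with its current value;
    # the keys are the names that can be used in a GridSearchCV param_grid.
    tunable[type(model).__name__] = sorted(model.get_params())
    print(type(model).__name__, tunable[type(model).__name__])
```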

Below is a fully worked example using [DecisionTreeClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) on the blob dataset.  The decision boundaries provide insight into how the classifier is working, but out of fold accuracy scores are what's most important.

**[DecisionTreeClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)**

In [None]:
from sklearn.tree import DecisionTreeClassifier 

# Step 1) Display an example
clf = DecisionTreeClassifier(max_depth=4)

clf.fit(XX, yy)
plot_clf(plt, clf, XX, yy)
plt.show()

# Step 2) Pick parameters to optimize
params = {'max_depth': [2, 4, 8, 16], 'min_samples_leaf':[2, 4, 8]}

DTgrid = GridSearchCV(estimator=clf, param_grid=params, cv=5, scoring = 'accuracy')
DTgrid.fit(XX,yy)
plot_clf(plt, DTgrid.best_estimator_, XX, yy); plt.show()
report_results(DTgrid) # Show Results

**[BaggingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html)** 

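hint: try using `DTgrid.best_estimator_` as the base estimator (the keyword is `estimator` in scikit-learn 1.2 and later, and `base_estimator` in earlier releases)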

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.base import clone

# Step 1) Display an example
### BEGIN SOLUTION
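# The base estimator is passed positionally: its keyword is `estimator` in
# scikit-learn >= 1.2 and `base_estimator` in earlier releases.
clf = BaggingClassifier(DTgrid.best_estimator_, n_estimators=4, random_state=42)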
### END SOLUTION

clf.fit(XX, yy)
plot_clf(plt, clf, XX, yy); plt.show()

# Step 2) Pick parameters to optimize
### BEGIN SOLUTION
params = {'n_estimators': [2, 4, 8, 16]}
### END SOLUTION

BCgrid = GridSearchCV(estimator=clf, param_grid=params, cv=5, scoring = 'accuracy')
BCgrid.fit(XX,yy)
plot_clf(plt, BCgrid.best_estimator_, XX, yy); plt.show()
report_results(BCgrid) # Show Results

**[RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)**

In [None]:
from sklearn.ensemble import RandomForestClassifier 

# Step 1) Display an example
### BEGIN SOLUTION
clf = RandomForestClassifier(max_depth=4, random_state=42)
### END SOLUTION

clf.fit(XX, yy)
plot_clf(plt, clf, XX, yy); plt.show()

# Step 2) Pick parameters to optimize
### BEGIN SOLUTION
params = {'max_depth': [8, 16, 32], 'min_samples_leaf':[2, 4, 8]}
### END SOLUTION

RFgrid = GridSearchCV(estimator=clf, param_grid=params, cv=5, scoring = 'accuracy')
RFgrid.fit(XX,yy)
plot_clf(plt, RFgrid.best_estimator_, XX, yy); plt.show()
report_results(RFgrid) # Show Results

**[AdaBoostClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html)** 

In [None]:
from sklearn.ensemble import AdaBoostClassifier

# Step 1) Display an example
### BEGIN SOLUTION
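# The base estimator is passed positionally: its keyword is `estimator` in
# scikit-learn >= 1.2 and `base_estimator` in earlier releases.
clf = AdaBoostClassifier(RFgrid.best_estimator_, n_estimators=4, random_state = 42)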
### END SOLUTION

clf.fit(XX, yy)
plot_clf(plt, clf, XX, yy); plt.show()

# Step 2) Pick parameters to optimize
### BEGIN SOLUTION
params = {'n_estimators':[2, 3, 4, 5, 8, 16, 32]}
### END SOLUTION

ABgrid = GridSearchCV(estimator=clf, param_grid=params, cv=5, scoring = 'accuracy')
ABgrid.fit(XX, yy)
plot_clf(plt, ABgrid.best_estimator_, XX, yy); plt.show()
report_results(ABgrid) # Show Results

**[GradientBoostingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)**   

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

# Step 1) Display an example
### BEGIN SOLUTION
clf = GradientBoostingClassifier(max_depth=4, random_state = 42)
### END SOLUTION

clf.fit(XX, yy)
plot_clf(plt, clf, XX, yy); plt.show()

# Step 2) Pick parameters to optimize
### BEGIN SOLUTION
params = {'max_depth': [2, 4, 8, 16], 'min_samples_leaf':[2, 4, 8]}
### END SOLUTION

GBgrid = GridSearchCV(clf, param_grid=params, cv=5, scoring = 'accuracy')
GBgrid.fit(XX,yy)
plot_clf(plt, GBgrid.best_estimator_, XX, yy); plt.show()
report_results(GBgrid) # Show Results

## Task: Ensemble your ensemble methods
Combine your best ensemble classifiers using the `VotingClassifier` approach described in the Introduction.

hint: use the `.best_estimator_` attribute from each of your grid variables.

In [None]:
### BEGIN SOLUTION
clf = VotingClassifier(estimators=[('RF', RFgrid.best_estimator_), 
                                   ('BC', BCgrid.best_estimator_), 
                                   ('AB', ABgrid.best_estimator_), 
                                   ('GB', GBgrid.best_estimator_)],
                        voting='hard')

### END SOLUTION
clf.fit(XX,yy)
plot_clf(plt, clf, XX, yy); plt.show()
ac = cross_val_score(clf, XX, yy, cv=5, scoring = 'accuracy')
print("Combination (%s voting) classifier accuracy:"%(clf.voting))
print(" mean_test_score +/- std_test_score\n %0.3f +/- %0.2f"%(ac.mean(), ac.std()))

## Task: Learn hyper-hyper-parameters (Homework)
Set up `GridSearchCV` to simultaneously tune some of the individual classifier parameters as well as explore "hard" and "soft" voting.  To reduce the amount of time this takes, you should restrict the size of the space you search by using only a small number of values for each parameter. Alternatively, you can use a randomized search via `RandomizedSearchCV`.

hints: 
1. Use the setup in the Introduction
2. Create your set of parameters to optimize by re-expressing a subset of the parameters you optimized above with the matching estimator-name prefix (e.g., `RF__max_depth`), and by also including the voting options.

In [None]:
### BEGIN SOLUTION
from sklearn.model_selection import RandomizedSearchCV

clf = VotingClassifier(estimators=[('BC', BCgrid.best_estimator_),
                                   ('RF', RFgrid.best_estimator_),  
                                   ('AB', ABgrid.best_estimator_), 
                                   ('GB', GBgrid.best_estimator_)],
                        voting='soft')

params = {
  'BC__n_estimators': [1, 2, 3],
  'RF__max_depth': [7, 8, 9], 'RF__min_samples_leaf':[3, 4, 5],
  'AB__n_estimators':[3,4,5],
  'GB__max_depth': [3, 4, 5], 'GB__min_samples_leaf':[3, 4, 5],
  'voting':['hard','soft']}

#ALLgrid = GridSearchCV(estimator=clf, param_grid=params, cv=5, 
#                       scoring = 'accuracy', verbose=1)
ALLgrid = RandomizedSearchCV(estimator=clf, param_distributions=params, cv=5, 
                             scoring = 'accuracy',
                             random_state=1, verbose=1, n_iter=30)
print('Starting ALLgrid optimization:')
ALLgrid.fit(XX,yy)
plot_clf(plt, ALLgrid.best_estimator_, XX, yy); plt.show()
print('Best classifier found: ALLGrid(%s)'%(ALLgrid.best_params_))
print(' Out of fold accuracy = %f'%(ALLgrid.best_score_))
### END SOLUTION

## Task: Understand Ensemble Method Concepts

Below are 6 questions on ensemble methods. Refer to the scikit-learn user guide's section on ensemble methods to help you answer these questions.

1. Bagging maintains the variance of the base model while lowering bias. (T/F)
2. Predicting with gradient boosted model is slower than predicting with a decision tree. (T/F)
3. To make a random forest, you may generate hundreds of trees and then aggregate the results of these trees. Which of the following are true about individual trees in Random Forest? Select all that apply.

  (A) Individual trees find best splits on a subset of the features
  (B) Individual trees find best splits on all of the features
  (C) Individual trees find best splits on a subset of observations
  (D) Individual trees find best splits on all of the observations
4. Which of the following are true about the `max_depth` hyperparameter in `GradientBoostingRegressor` and `GradientBoostingClassifier`? Select all that apply.

  (A) A lower value is preferable when validation accuracy is the same
  (B) A higher value is preferable when validation accuracy is the same
  (C) Increasing the value of max_depth may overfit the data
  (D) Increasing the value of max_depth may underfit the data
5. For boosting, why is it suggested that the base model be "weak"?
6. Do bagging and boosting methods always improve model accuracy? In what situations are ensemble learning methods not appropriate?

## Summary
Some advantages of decision trees are:

- Simple to understand and to interpret. Trees can be visualised.
- Requires little data preparation. Other techniques often require data normalisation, dummy variables need to be created and blank values to be removed. Note however that this module does not support missing values.
- The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
- Able to handle both numerical and categorical data. Other techniques are usually specialised in analysing datasets that have only one type of variable. See [algorithms](http://scikit-learn.org/stable/modules/tree.html#tree-algorithms) for more information.
- Able to handle multi-output problems.
- Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily explained by boolean logic. By contrast, in a black box model (e.g., in an artificial neural network), results may be more difficult to interpret.
- Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.
- Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.

The disadvantages of decision trees include:

- Decision-tree learners can create over-complex trees that do not generalise the data well. This is called overfitting. Mechanisms such as pruning (available in scikit-learn 0.22 and later through the `ccp_alpha` parameter), setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.
- Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
- The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.
- There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.
- Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.

**Full Credit:** Much of the content and support code from this lab is based on material from scikit-learn's wonderful [on-line documentation](http://scikit-learn.org/stable/documentation.html). Reading through its tutorials, introductory material and package pages, while playing with the examples provided, and creating your own, is a great way to learn more about Machine Learning.