# Week 6: Homework 2 

----------------------------------------------------
Machine Learning                      

Year 2020/2021

*Vanessa Gómez Verdejo vanessa@tsc.uc3m.es* 

----------------------------------------------------

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

The aim of this second HW is to implement and analyse the performance of ensemble methods. To do this, we will work with the Breast Cancer database (described in the next section) and you will have to complete the following exercises.

## Exercise 1. Load and prepare the data 

For this lab session, we will work with the [Breast cancer data set](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html), a binary classification problem that aims to detect breast cancer from a digitized image of a breast mass. For this purpose, the images have been preprocessed and characterized with 30 input features describing the mass.

Complete the next code cell so that you can:
* Load the dataset
* Create training and test partitions with 75% and 25% of the original data, respectively
* Normalize the data to zero mean and unit standard deviation
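As a starting point, the three steps above could be sketched as follows. This is only one possible layout: the variable names, the `stratify=y` option and the fixed `random_state` are assumptions made here for reproducibility, not requirements of the exercise.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the feature matrix (569 samples x 30 features) and the binary labels
X, y = load_breast_cancer(return_X_y=True)

# 75% / 25% partition; stratify and random_state are illustrative choices
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Estimate mean and standard deviation on the training partition only,
# then apply the same transformation to both partitions
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```

Fitting the scaler on the training partition alone keeps test information out of the preprocessing step, so the test features will only be approximately standardized.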

#### Include your answer here!
#### Create as many cells (code or markdown) as you need

## Exercise 2. Bagging methods (5.5 points)


### Exercise 2.1  (1 point)

Use the `BaggingClassifier` model of sklearn to train an ensemble of T trees with a subsampling rate of 50% (`max_samples` = 0.5). For this exercise, consider that each tree has a maximum depth of 3 and use the default values for the remaining parameters.

Analyze the evolution of the train and test accuracy for T from 1 to 50.
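A minimal sketch of such a sweep is shown below, assuming the data are prepared as in Exercise 1 (the split seed and variable names are illustrative). The base tree is passed as the first positional argument because the name of that keyword has changed between scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

T_values = np.arange(1, 51)
acc_train, acc_test = [], []
for T in T_values:
    # One ensemble per value of T, each tree trained on 50% of the samples
    bag = BaggingClassifier(
        DecisionTreeClassifier(max_depth=3),  # base learner, passed positionally
        n_estimators=int(T),
        max_samples=0.5,
        random_state=0,
    ).fit(X_tr, y_tr)
    acc_train.append(bag.score(X_tr, y_tr))
    acc_test.append(bag.score(X_te, y_te))
```

The two lists can then be plotted against `T_values` with `plt.plot`. Averaging over several seeds would give smoother curves than this single run.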

#### Include your answer here!
#### Create as many cells (code or markdown) as you need

### Exercise 2.2 Influence of parameter `max_samples` and ensemble diversity (1.5 points)

For the above ensemble (with T=50), analyze its behaviour for different values of `max_samples` (from 5% to 100%), evaluating its performance over the train and test partitions. Compare it with that of a single decision tree (with `max_depth` of 3) trained on a subsample of the same rate. For both approaches, run several iterations and average the results to obtain representative performance curves.

Analyze the results and answer the following questions:
* What is the advantage of the ensemble compared to a stand-alone tree? 

* Is this advantage equal for any value of `max_samples`?



#### Include your answer here!
#### Create as many cells (code or markdown) as you need

Now analyze the **diversity** among the base learners' outputs for different `max_samples` rates. You can analyze this diversity by measuring the correlation among the learners' outputs. That is, once you have trained an ensemble for a given `max_samples` value, you can compute this diversity as follows:
1. Compute the learners' soft outputs (over the training data): use the method `.predict_proba()` of the decision trees. Note that the ensemble has an attribute (`estimators_`) with all the learned trees.
2. Obtain the matrix with all pairwise correlation values (you can use [`np.corrcoef`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html)).
3. Compute the ensemble diversity as one minus the average of all pairwise correlation values.
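The three steps above could be sketched as follows. The helper name `ensemble_diversity` is hypothetical, and two interpretation choices are made here that the statement does not fix: the diagonal of the correlation matrix (each learner with itself) is excluded from the average, and learners with constant output, whose correlation is undefined, are ignored through `np.nanmean`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier


def ensemble_diversity(ensemble, X):
    """One minus the mean pairwise correlation of the learners' soft outputs."""
    # Step 1: soft output (probability of class 1) of every tree.
    # estimators_features_ lists the columns each tree was trained on
    # (all columns when max_features keeps its default value).
    outputs = np.array([
        tree.predict_proba(X[:, feats])[:, 1]
        for tree, feats in zip(ensemble.estimators_, ensemble.estimators_features_)
    ])
    # Step 2: T x T matrix of pairwise correlations (one row per tree)
    corr = np.corrcoef(outputs)
    # Step 3: average each distinct pair once, leaving out the diagonal
    pair_idx = np.triu_indices_from(corr, k=1)
    return float(1.0 - np.nanmean(corr[pair_idx]))


# Illustrative use: the whole data set plays the role of the training data
X, y = load_breast_cancer(return_X_y=True)
diversities = {}
for rate in (0.1, 0.5, 1.0):
    bag = BaggingClassifier(
        DecisionTreeClassifier(max_depth=3),
        n_estimators=50,
        max_samples=rate,
        random_state=0,
    ).fit(X, y)
    diversities[rate] = ensemble_diversity(bag, X)
```

Because correlations lie in [-1, 1], this measure lies in [0, 2], with 0 meaning that all learners produce perfectly correlated outputs.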

Finally, analyze the results. Which is better, high or low diversity? To answer this question, it can be useful to jointly analyze the diversity and the ensemble accuracy for different `max_samples` values. Do not forget to average your results over different runs.


Note: You can find other ensemble diversity measures at https://lucykuncheva.co.uk/papers/lkml.pdf

#### Include your answer here!
#### Create as many cells (code or markdown) as you need

### Exercise 2.3 Can we increase the ensemble diversity with other schemes? (1 point)

A very simple way to increase diversity is to perform sample and feature selection at the same time (subsampling the training data matrix by rows and by columns). To do this, the `BaggingClassifier` class, in addition to the `max_samples` parameter that sets the number or percentage of training samples to use, has a `max_features` parameter that controls the number or percentage of variables to use.

To analyze the influence of this kind of subsampling, fix `max_samples` to 0.5 and analyze the ensemble performance and diversity for different values of `max_features`. Discuss the obtained results in comparison with those of Exercise 2.2.
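A possible skeleton for this sweep is sketched below. The grid of `max_features` values and the number of repetitions are illustrative assumptions, and normalization is omitted for brevity since threshold-based trees do not depend on feature scaling.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

feature_rates = [0.1, 0.25, 0.5, 0.75, 1.0]  # illustrative grid
n_runs = 5                                    # illustrative number of repetitions
mean_test_acc = {}
for rate in feature_rates:
    scores = []
    for seed in range(n_runs):
        # Rows are subsampled at 50%; columns at the current rate
        bag = BaggingClassifier(
            DecisionTreeClassifier(max_depth=3),
            n_estimators=50,
            max_samples=0.5,
            max_features=rate,
            random_state=seed,
        ).fit(X_tr, y_tr)
        scores.append(bag.score(X_te, y_te))
    # Average over runs to reduce the variance of the estimate
    mean_test_acc[rate] = float(np.mean(scores))
```

The same loop can store the training accuracy and a diversity measure for each fitted ensemble, so that the three quantities can be compared on a common axis.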



#### Include your answer here!
#### Create as many cells (code or markdown) as you need

### Exercise 2.4 (1 point)

Now compare the performance of the bagged ensemble of Exercise 2.1 (using only sample subsampling) with that of a Random Forest and an Extremely Randomized Trees ensemble. Use the same number of learners (T=50) and decision trees with a maximum depth of 3, but cross-validate the subsampling rates (note that for RF you have to cross-validate both the number of samples and the number of features).

What are the differences among these methods?
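One way to set up this cross-validation is with `GridSearchCV`, as sketched below. The grids and the 5-fold setting are illustrative assumptions. For `ExtraTreesClassifier` only `max_features` is validated here, because with its default `bootstrap=False` every tree is trained on the full training set.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Illustrative grids; None in max_samples means "use all training samples"
rf_grid = {"max_samples": [0.25, 0.5, None], "max_features": [0.1, 0.5, 1.0]}
et_grid = {"max_features": [0.1, 0.5, 1.0]}

rf_cv = GridSearchCV(
    RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0),
    rf_grid,
    cv=5,
).fit(X_tr, y_tr)

et_cv = GridSearchCV(
    ExtraTreesClassifier(n_estimators=50, max_depth=3, random_state=0),
    et_grid,
    cv=5,
).fit(X_tr, y_tr)

# GridSearchCV refits the best configuration on the whole training partition
rf_test_acc = rf_cv.score(X_te, y_te)
et_test_acc = et_cv.score(X_te, y_te)
```

The selected rates are available in `rf_cv.best_params_` and `et_cv.best_params_`.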



#### Include your answer here!
#### Create as many cells (code or markdown) as you need

### Exercise 2.5  (1 point)
Both `RandomForestClassifier()` and `ExtraTreesClassifier()` learn the **feature relevances** during training. Analyze the feature importances they provide. Do they agree? Do they provide relevant information about breast cancer detection?
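As a rough sketch, the importances are exposed through the `feature_importances_` attribute of both fitted models. Using the correlation between the two importance vectors as an agreement score is an assumption made here for illustration; rank-based comparisons or bar plots are equally valid.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

data = load_breast_cancer()
X, y, names = data.data, data.target, data.feature_names

rf = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0).fit(X, y)
et = ExtraTreesClassifier(n_estimators=50, max_depth=3, random_state=0).fit(X, y)

# Impurity-based importances, normalized to sum to one
imp_rf, imp_et = rf.feature_importances_, et.feature_importances_

# A simple agreement score: linear correlation between both importance vectors
agreement = float(np.corrcoef(imp_rf, imp_et)[0, 1])

# Names of the five most relevant features according to each model
top_rf = names[np.argsort(imp_rf)[::-1][:5]]
top_et = names[np.argsort(imp_et)[::-1][:5]]
```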


#### Include your answer here!
#### Create as many cells (code or markdown) as you need

## Exercise 3. Boosting methods (3.5 points)

### Exercise 3.1  (1 point)


Use the `AdaBoostClassifier` model to train a set of 50 decision trees with depth 3 and analyze their train and test accuracy vs. the number of trees (from 1 to 50). Use the two implementations of AdaBoost (`algorithm` = `SAMME` and `algorithm` = `SAMME.R`) and compare the results with those of the ensemble built by Bagging in Exercise 2.1.

Discuss the results and answer the following questions:
* Can I stop adding learners when the train accuracy is $100\%$?
* Can I add as many trees as I want without running into overfitting problems? Could I do the same with a bagging ensemble?
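Unlike bagging, a boosted ensemble does not need to be retrained for every value of T: `staged_score` returns the accuracy after each boosting round from a single fit. The sketch below leaves `algorithm` at its default, which is an assumption made for portability, since the accepted values depend on the installed scikit-learn version (recent releases have deprecated `SAMME.R`). Pass `algorithm="SAMME"` or `algorithm="SAMME.R"` explicitly where your version supports it.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Base tree passed positionally (its keyword name differs across versions)
ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=3),
    n_estimators=50,
    random_state=0,
).fit(X_tr, y_tr)

# Accuracy after 1, 2, ... boosting rounds, obtained from the single fit above.
# Boosting may stop before 50 rounds if a learner fits the weighted data perfectly.
train_curve = list(ada.staged_score(X_tr, y_tr))
test_curve = list(ada.staged_score(X_te, y_te))
```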

#### Include your answer here!
#### Create as many cells (code or markdown) as you need

### Exercise 3.2  (1 point)

Use the `GradientBoostingClassifier` model to train a set of 50 decision trees with depth 3 by Gradient Boosting with exponential (`loss= 'exponential'`) and binomial (`loss= 'deviance'`) cost functions. In both cases, obtain the evolution of the accuracy with the number of learners. In addition, compare the results with those obtained by the `AdaBoostClassifier` with `algorithm` = `SAMME.R`.
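A possible sketch uses `staged_predict` to obtain the accuracy after each added tree. The binomial model is built here by leaving `loss` at its default, which is the binomial deviance. That choice is an assumption made for portability, because the string naming this loss was renamed from `'deviance'` to `'log_loss'` in later scikit-learn releases.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Extra constructor arguments for each cost function; the empty dict
# keeps the default loss (binomial deviance)
configs = {"exponential": {"loss": "exponential"}, "binomial": {}}

test_curves = {}
for name, kwargs in configs.items():
    gb = GradientBoostingClassifier(
        n_estimators=50, max_depth=3, random_state=0, **kwargs
    ).fit(X_tr, y_tr)
    # staged_predict yields the ensemble prediction after each added tree
    test_curves[name] = [
        accuracy_score(y_te, y_pred) for y_pred in gb.staged_predict(X_te)
    ]
```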

#### Include your answer here!
#### Create as many cells (code or markdown) as you need

### Exercise 3.3 (1.5 points)

Finally, compare the performance of Gradient Boosting models when using the [XGBoost library](https://xgboost.readthedocs.io/en/latest/index.html), which provides an efficient implementation with parallelization capabilities (useful when working with large datasets). In addition, it has a [sklearn interface](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn) that allows us to easily integrate it with our code.

Like the sklearn implementation, it is designed to work with decision trees, but it offers some additional utilities:
* It supports *early stopping*: if we provide a validation set, it evaluates the performance of the ensemble on that set to decide when to stop adding new trees (avoiding overfitting effects).
* It can evaluate the ensemble using only a subset of the trained trees.

Use these utilities to efficiently analyze the performance of the ensemble as a function of the number of trees and apply an early stopping criterion to select the optimal number of trees in the ensemble.

#### Include your answer here!
#### Create as many cells (code or markdown) as you need

## Exercise 4. Stacking of classifiers (1 point)

Using the stacking scheme, select different base classifiers from those seen in class and combine them at different levels. Justify the selection of the base classifiers, as well as the classifier used at the output. Compare the performance with that of the ensembles designed in the previous sections, and discuss their advantages and disadvantages.
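A single-level stacking sketch with sklearn's `StackingClassifier` is given below. The three base classifiers and the logistic regression at the output are assumptions chosen only for illustration, not the selection expected in your answer. Scale-sensitive learners are wrapped in a pipeline with their own scaler.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Illustrative base learners with different inductive biases
base_learners = [
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ("svc", make_pipeline(StandardScaler(), SVC())),
    ("rf", RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)),
]

# The output classifier is trained on cross-validated base predictions (cv=5)
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),
    cv=5,
).fit(X_tr, y_tr)

stack_test_acc = stack.score(X_te, y_te)
```

Further levels can be obtained by passing another `StackingClassifier` as `final_estimator`.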

#### Include your answer here!
#### Create as many cells (code or markdown) as you need