[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/bads/blob/master/exercises/8_ex_ensemble_learning.ipynb) 

# BADS Exercise 8 on ensemble learning
This exercise revisits some of the concepts covered in [Tutorial 8 on ensemble learning](https://github.com/Humboldt-WI/bads/blob/master/tutorials/8_nb_ensemble_learning.ipynb). We will take a close look at bagging, analyze its impact on predictive accuracy, and implement one of the boosting algorithms, AdaBoost.

## Loading the data 
For this exercise, we will use the HMEQ credit risk data available at [our GitHub repo](https://github.com/Humboldt-WI/bads/blob/master/data/hmeq_prepared.csv). You have imported data sets many times in previous tutorials, but the step remains necessary whenever you work with data. Your preliminary task is to load the HMEQ data set into a `pandas DataFrame`.
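As a starting point, the loading step might look like the sketch below. The raw-file URL is an assumption derived from the repository link above, and the helper name `load_hmeq()` and the target column name `BAD` are assumptions as well; adjust them to match your copy of the data.

```python
import pandas as pd

# Assumption: raw-file counterpart of the GitHub link given above
DATA_URL = "https://raw.githubusercontent.com/Humboldt-WI/bads/master/data/hmeq_prepared.csv"


def load_hmeq(source=DATA_URL, target="BAD"):
    """Hypothetical helper: read the CSV and split it into features X and target y."""
    df = pd.read_csv(source)
    X = df.drop(columns=[target])   # all columns except the target
    y = df[target].astype(int)      # binary target coded as 0/1
    return X, y
```

Calling `X, y = load_hmeq()` downloads the file; you can also pass a local path as `source`.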

Now we can proceed to the tasks on ensemble learning!

## Tasks

### Task 1

Ensemble learning works by reducing bias and/or variance. We begin by examining the variance component.

In the first task, you will write code that trains and tests a classifier multiple times on different subsets of the HMEQ data and examines the classifier performance. We prepared two versions of this task: *simple* and *for the experts*. Read the task description below and proceed with the version you feel ready to tackle!

*Simple version:* set up a loop that resamples the data by calling the `sklearn` function `train_test_split()` in each iteration. You can use either logistic regression or a decision tree as a classifier. In each iteration, train a new classifier on the sampled training data and compute its AUC on the test set. Run your code for 100 iterations and visualize the variation in AUC performance by means of a boxplot. Briefly discuss your findings.

*For the experts:* perform the same task as above but wrap your code in a function `examine_variance()` that:
- supports both logit and decision tree as a classifier
- lets you specify the number of iterations and the test sample size
- facilitates controlled sampling of the data such that you randomize either the training or the test set or both sets in each iteration. The idea is that your code should let you study the isolated effect of randomizing only the training data while always predicting the same test data, or the isolated effect of applying the same model to multiple randomized test sets, or the overall effect of sampling the data just as in the simple version
- returns a list of AUC values from each iteration

Run your function for 100 iterations and visualize the AUC values using a boxplot. Briefly discuss differences between the AUC variance when randomizing the training data, the test data or both. 

*Hint:* you can rely on the function `bootstrapping()` from the ensemble learning tutorial to randomize the sampling.
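If you are unsure how to start, the simple version could be sketched roughly as follows. This is one possible layout, not a reference solution: the function name `auc_over_resplits()`, the tree depth, and the synthetic data that stands in for HMEQ are all assumptions made so that the snippet runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def auc_over_resplits(X, y, model="tree", n_iter=100, test_size=0.3, seed=0):
    """Hypothetical helper: test AUC of a freshly trained classifier on n_iter random splits."""
    aucs = []
    for i in range(n_iter):
        # A different random_state in each iteration yields a different split
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed + i)
        if model == "tree":
            clf = DecisionTreeClassifier(max_depth=5, random_state=seed)
        else:
            clf = LogisticRegression(max_iter=1000)
        clf.fit(X_tr, y_tr)
        aucs.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
    return aucs


# Synthetic stand-in for the HMEQ data (assumption, for illustration only)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=1)
aucs_demo = auc_over_resplits(X_demo, y_demo, model="tree", n_iter=20)
print(f"mean AUC = {np.mean(aucs_demo):.3f}, std = {np.std(aucs_demo):.3f}")
```

Passing the returned list to `matplotlib.pyplot.boxplot()` gives the requested plot. The expert version additionally needs a way to hold either the training or the test set fixed across iterations.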

### Task 2

Implement a bagged logistic regression classifier from scratch. You can use the `sklearn` class `LogisticRegression` for implementing the base model. The actual bagging step, however, should be implemented from scratch.
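To make the idea concrete, here is a minimal sketch of bagging: draw bootstrap samples, fit one copy of the base model per sample, and average the predicted probabilities. The function names `bagging_fit()` and `bagging_predict_proba()` are hypothetical, the code assumes NumPy arrays as inputs, and synthetic data again stands in for HMEQ.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def bagging_fit(base_model, X, y, n_estimators=25, seed=0):
    """Fit n_estimators copies of base_model, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)  # draw n row indices with replacement
        models.append(clone(base_model).fit(X[idx], y[idx]))
    return models


def bagging_predict_proba(models, X):
    """Average the positive-class probabilities of all ensemble members."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)


# Synthetic stand-in for the HMEQ data (assumption, for illustration only)
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

ensemble = bagging_fit(LogisticRegression(max_iter=1000), X_tr, y_tr)
p_bagged = bagging_predict_proba(ensemble, X_te)
print(f"bagged logit AUC = {roc_auc_score(y_te, p_bagged):.3f}")
```

Because `bagging_fit()` takes an arbitrary `sklearn` estimator and clones it, the same function can later be reused with a decision tree as the base model.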

### Task 3

Theory predicts that bagging should work better for unstable base models like trees than for stable base models like logistic regression. Use the custom bagging algorithm developed in Task 2 to verify this assertion for the HMEQ loan dataset. Specifically:
  - choose a proper experimental design to compare models (split-sample or cross-validation)
  - train two simple classifiers: 
    - logistic regression
    - decision tree
  - train two bagging classifiers:
      - bagged logistic regression
      - bagged decision tree
  - both bagging classifiers should use your custom bagging function from Task 2
  - compare the predictive performance of the simple classifiers and the bagging ensembles on the test data and briefly discuss your findings

### Task 4 [optional]

#### 4.1. Further enhance the analysis from Task 3 as follows:
  - repeat the comparison of bagged logit versus bagged trees multiple times with different training and testing data sets
  - depict the results (predictive performance) as a boxplot

#### 4.2. Investigate the impact of the ensemble size:
- try out different settings for the hyperparameter *ensemble size* (number of bagging iterations)
- produce a line plot of predictive performance versus ensemble size for bagged logit and bagged tree
- identify a suitable ensemble size for both classifiers
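One way to study the ensemble size without retraining for every setting is to train the largest ensemble once and evaluate the running average of its members' predictions. The sketch below does this for bagged trees only. It uses synthetic data as a stand-in for HMEQ, and the maximum size of 50 is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the HMEQ data (assumption, for illustration only)
X, y = make_classification(n_samples=600, n_features=10, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=2)

rng = np.random.default_rng(2)
max_size = 50  # largest ensemble size considered (arbitrary choice)
member_probs = []
for _ in range(max_size):
    idx = rng.integers(0, len(y_tr), size=len(y_tr))  # bootstrap sample
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])
    member_probs.append(tree.predict_proba(X_te)[:, 1])

# Row k-1 holds the averaged prediction of the first k ensemble members
sizes = np.arange(1, max_size + 1)
running_mean = np.cumsum(member_probs, axis=0) / sizes[:, None]
auc_by_size = [roc_auc_score(y_te, p) for p in running_mean]
print(f"AUC with 1 tree: {auc_by_size[0]:.3f}, with {max_size} trees: {auc_by_size[-1]:.3f}")
```

Plotting `auc_by_size` against `sizes` with `matplotlib.pyplot.plot()` gives the line plot; repeating the loop with logistic regression as the base model adds the second curve.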

### Task 5
Write a custom Python function that implements the *AdaBoost* algorithm. Follow the pseudo-code of the algorithm, as shown in the lecture materials. Design your function such that it accepts a `sklearn` model object as argument and then runs AdaBoost using the corresponding base classifier. Test your function on the HMEQ data and evaluate performance in terms of AUC.
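For orientation, the sketch below implements the textbook discrete AdaBoost variant, which may differ in details from the pseudo-code in the lecture. It assumes labels coded as 0/1, NumPy arrays as inputs, and a base model whose `fit()` accepts `sample_weight`. The function names `adaboost_fit()` and `adaboost_score()` are hypothetical, and synthetic data stands in for HMEQ.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def adaboost_fit(base_model, X, y, n_rounds=50):
    """Discrete AdaBoost; returns the fitted members and their weights (alphas)."""
    y_pm = np.where(y == 1, 1, -1)          # recode labels to -1/+1
    w = np.full(len(y_pm), 1.0 / len(y_pm)) # start with uniform case weights
    models, alphas = [], []
    for _ in range(n_rounds):
        model = clone(base_model).fit(X, y, sample_weight=w)
        pred_pm = np.where(model.predict(X) == 1, 1, -1)
        err = np.sum(w[pred_pm != y_pm])    # weighted training error
        if err >= 0.5:                      # no better than chance: stop
            break
        err = max(err, 1e-10)               # guard against log of infinity
        alpha = 0.5 * np.log((1 - err) / err)
        models.append(model)
        alphas.append(alpha)
        if err <= 1e-10:                    # perfect fit: reweighting is pointless
            break
        # Increase weights of misclassified cases, decrease the others
        w = w * np.exp(-alpha * y_pm * pred_pm)
        w = w / w.sum()
    return models, np.array(alphas)


def adaboost_score(models, alphas, X):
    """Alpha-weighted vote of the members; larger values favor class 1."""
    votes = np.array([np.where(m.predict(X) == 1, 1, -1) for m in models])
    return alphas @ votes


# Synthetic stand-in for the HMEQ data (assumption, for illustration only)
X, y = make_classification(n_samples=500, n_features=10, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=3)

stump = DecisionTreeClassifier(max_depth=1)
members, member_weights = adaboost_fit(stump, X_tr, y_tr, n_rounds=30)
auc_boost = roc_auc_score(y_te, adaboost_score(members, member_weights, X_te))
print(f"AdaBoost AUC = {auc_boost:.3f} using {len(members)} stumps")
```

The weighted vote is a score rather than a probability, which is sufficient for AUC because AUC depends only on the ranking of the cases.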

# Well done! Your ensembles performed great!