### Problem 5. (Bias and Variance in Tree-based Bagging Ensemble Methods)

In this problem, we'll investigate the bias and the variance of two different estimators: a 2-split tree and a 4-split tree. We'll see (once again) that fitting the data more precisely is not always a good idea.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)

### a)

Suppose our dataset $\mathcal D$ consists of $n=10$ data points drawn randomly from two moons, balanced between the two classes. Points in one moon are labeled 1 and points in the other moon are labeled 0. The data contains Gaussian noise with standard deviation 0.05.

Generate a sample dataset using random_state=2, and plot it. (Note: because the data is 2-dimensional, have the x- and y-coordinates on the axes and represent the label with the color blue for 1 and red for 0.)

In [2]:
from sklearn.datasets import make_moons
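
One possible way to start part a) is sketched here. It assumes that `make_moons` with `n_samples=10` and `noise=0.05` matches the intended setup (an integer `n_samples` is split evenly between the two moons); the figure size, axis labels, and title are illustrative choices, not requirements of the problem.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

# n = 10 points (5 per moon), Gaussian noise with standard deviation 0.05
X, y = make_moons(n_samples=10, noise=0.05, random_state=2)

# Blue for label 1, red for label 0, as the problem requests
colors = np.where(y == 1, "blue", "red")

fig, ax = plt.subplots(figsize=(5, 4))
ax.scatter(X[:, 0], X[:, 1], c=colors, edgecolors="k")
ax.set_xlabel("$x_1$")
ax.set_ylabel("$x_2$")
ax.set_title("Two-moons sample, n = 10")
```
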

### b)

Fit a decision tree model with 2 splits to $\mathcal D$. Use random_state=2. (Hint: a tree with $n$ splits has at most $n+1$ leaf nodes.)

Plot this tree in the form of rectangular partitions of the feature space with $\mathcal D$. Use lines to mark the splits, and shade the regions with the correct label color.

In [3]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
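
A minimal sketch for part b), following the hint: 2 splits are requested by capping the tree at 3 leaves with `max_leaf_nodes=3`. Drawing the partition by predicting on a dense grid is one option among several; the grid resolution and the padding `pad` are assumptions made for illustration, and the black contour only approximates the split lines at grid resolution.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=10, noise=0.05, random_state=2)

# 2 splits -> at most 3 leaf nodes
tree2 = DecisionTreeClassifier(max_leaf_nodes=3, random_state=2).fit(X, y)

# Predict on a dense grid covering the data (pad is an arbitrary margin)
pad = 0.5
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - pad, X[:, 0].max() + pad, 300),
    np.linspace(X[:, 1].min() - pad, X[:, 1].max() + pad, 300),
)
zz = tree2.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

fig, ax = plt.subplots(figsize=(5, 4))
# Shade each region with its predicted label color (red = 0, blue = 1)
ax.contourf(xx, yy, zz, levels=[-0.5, 0.5, 1.5], colors=["red", "blue"], alpha=0.2)
# Mark the boundaries between regions with black lines
ax.contour(xx, yy, zz, levels=[0.5], colors="k", linewidths=1)
ax.scatter(X[:, 0], X[:, 1], c=np.where(y == 1, "blue", "red"), edgecolors="k")
ax.set_xlabel("$x_1$")
ax.set_ylabel("$x_2$")
```

Exact split positions can also be read from `tree2.tree_.threshold` and `tree2.tree_.feature` if straight split lines are preferred over a grid contour.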

### c)

Fit a decision tree model with 4 splits to $\mathcal D$. Use random_state=2.

Plot this tree in the form of rectangular partitions of the feature space with $\mathcal D$.
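
For part c), the same idea applies with `max_leaf_nodes=5` (4 splits, at most 5 leaves). The sketch below only fits the tree and prints its splits with `export_text` as a quick sanity check before plotting; the feature names `x1` and `x2` are placeholders chosen here.

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_moons(n_samples=10, noise=0.05, random_state=2)

# 4 splits -> at most 5 leaf nodes
tree4 = DecisionTreeClassifier(max_leaf_nodes=5, random_state=2).fit(X, y)

# Text view of the learned splits (feature names are illustrative)
print(export_text(tree4, feature_names=["x1", "x2"]))
print("leaves:", tree4.get_n_leaves(), "training accuracy:", tree4.score(X, y))
```
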

### d)

Generate a new set of 10,000 points using random_state=2, and repeat parts b) and c) to fit models for 1000 different sets $\mathcal D$ drawn randomly from it. Average the individual 2-split decision trees to get the "average" 2-split decision tree model $\bar{t}(x)$ using the bagging ensemble method learned in class. Plot $\bar{t}(x)$ for all 10,000 points.

Do the same for 4-split decision trees to get $\bar{f}(x)$.

In [54]:
from sklearn.ensemble import BaggingClassifier
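
A rough sketch of part d) under several assumptions: the pool uses the same noise level (0.05) as before, each bagged tree is trained on 10 points via `max_samples=10`, and the "average" model is taken to be the mean of the individual trees' hard 0/1 predictions. The helper `average_tree` is a hypothetical name introduced here. Note that the subsets drawn by `BaggingClassifier` are sampled with replacement and are not forced to be balanced.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Pool of 10,000 points (noise level assumed to match part a)
X_pool, y_pool = make_moons(n_samples=10_000, noise=0.05, random_state=2)

def average_tree(max_leaf_nodes, n_models=1000, n_points=10, seed=2):
    """Fit n_models small trees on random subsets and average their 0/1 predictions."""
    bag = BaggingClassifier(
        DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes),
        n_estimators=n_models,
        max_samples=n_points,  # integer: number of points drawn per tree
        random_state=seed,
    ).fit(X_pool, y_pool)
    # int8 keeps the (n_models, 10000) prediction matrix small in memory
    preds = np.array([est.predict(X_pool).astype(np.int8) for est in bag.estimators_])
    return preds.mean(axis=0), preds

t_bar, preds2 = average_tree(max_leaf_nodes=3)  # 2-split trees
f_bar, preds4 = average_tree(max_leaf_nodes=5)  # 4-split trees

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, avg, title in zip(axes, (t_bar, f_bar), ("2-split average", "4-split average")):
    sc = ax.scatter(X_pool[:, 0], X_pool[:, 1], c=avg, cmap="RdBu", vmin=0, vmax=1, s=4)
    ax.set_title(title)
fig.colorbar(sc, ax=axes, label="average prediction")
```
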

### e)

Compute the squared bias of $\bar{t}(x)$ and $\bar{f}(x)$. Which model has smaller squared bias?
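
As a sketch, the squared bias can be estimated as the mean, over the evaluation points, of the squared difference between the average prediction and the label. This treats the observed (noisy) labels as the target, which is an assumption; the function name `squared_bias` and the toy arrays are illustrative only.

```python
import numpy as np

def squared_bias(avg_pred, y_true):
    """Mean over evaluation points of (average prediction - label)^2."""
    avg_pred = np.asarray(avg_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return np.mean((avg_pred - y_true) ** 2)

# Toy example: four points, average predictions from some ensemble
y_true = np.array([0, 0, 1, 1])
avg_pred = np.array([0.1, 0.4, 0.7, 1.0])
bias2 = squared_bias(avg_pred, y_true)
print(bias2)
```

Applied to the problem, `avg_pred` would be $\bar{t}(x)$ or $\bar{f}(x)$ evaluated on the 10,000 points and `y_true` their labels.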

__Response:__

### f)

Compute the variance of $\bar{t}(x)$ and $\bar{f}(x)$. Which model has smaller variance? How do you interpret this? Which model has smaller overall error?
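
One way to read "variance" here, assumed in this sketch, is the average squared deviation of the individual trees' predictions from the average model, taken over both models and points. With that reading and the labels as target, the mean squared error of the individual models equals squared bias plus variance exactly. The function name and toy matrix are illustrative.

```python
import numpy as np

def ensemble_variance(preds):
    """preds has shape (n_models, n_points); returns mean squared deviation from the average model."""
    preds = np.asarray(preds, dtype=float)
    avg = preds.mean(axis=0)
    return np.mean((preds - avg) ** 2)

# Toy example: 4 models, 4 points, hard 0/1 predictions
preds = np.array([
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
])
y_true = np.array([0, 0, 1, 1])

avg = preds.mean(axis=0)
var = ensemble_variance(preds)
bias2 = np.mean((avg - y_true) ** 2)
total = np.mean((preds - y_true) ** 2)  # average squared error of individual models
print(bias2, var, total)
```

For 0/1 predictions, the variance at each point reduces to $p(1-p)$, where $p$ is the fraction of models predicting 1 at that point.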

__Response:__

### g)

How do you think your results (for bias, variance, and total error) depend on the number of points used to train each bagged model? Perform an experiment to check. Explain what you observe.

For clarity, here is a list of parameters you should be considering:

* number of bagged models = 1000
* number of points used to train each bagged model = $n \in \{5, 10, 15, \dots, 30\}$
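
The experiment could be organized along the following lines. This is a sketch under assumptions: training sets are drawn with replacement from a 10,000-point pool (noise 0.05, random_state=2), bias and variance are measured against the pool labels, and `N_MODELS` is set to 200 only to keep the sketch quick (the problem specifies 1000). The helper `bias_variance` is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X_pool, y_pool = make_moons(n_samples=10_000, noise=0.05, random_state=2)
rng = np.random.default_rng(0)

def bias_variance(n_points, max_leaf_nodes, n_models):
    """Estimate squared bias and variance of small trees trained on n_points samples."""
    preds = np.empty((n_models, len(y_pool)), dtype=np.float32)
    for m in range(n_models):
        idx = rng.choice(len(y_pool), size=n_points, replace=True)
        tree = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes)
        preds[m] = tree.fit(X_pool[idx], y_pool[idx]).predict(X_pool)
    avg = preds.mean(axis=0)
    bias2 = float(np.mean((avg - y_pool) ** 2))
    var = float(np.mean((preds - avg) ** 2))
    return bias2, var

N_MODELS = 200  # reduced for speed in this sketch; the problem asks for 1000
results = {}
for n in range(5, 35, 5):
    for leaves in (3, 5):  # 2-split and 4-split trees
        b, v = bias_variance(n, leaves, N_MODELS)
        results[(n, leaves)] = (b, v)
        print(f"n={n:2d} leaves={leaves} bias^2={b:.4f} var={v:.4f} total={b + v:.4f}")
```
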


__Response:__

### h)

As we increase the number of models to average in our bagging ensemble method, we should expect our model (which may be quite complex!) to not only fit the training data better but to also generalize better to a test set.

Using a test set of size 10,000 with noise=0.05 and random_state=3, verify this by computing the average out-of-sample misclassification error over 10 replications for each choice of number of bagged models in the range $R = \{1,5,10,20,30,40,50,75,100,150\}$. Use 4-split tree models trained on randomly chosen datasets of size $10$, and plot the averaged test error against the number of bagged models.

In particular, for each number of bagged models $r\in R$, build 10 different bagging classifiers of 4-split trees with n_estimators = $r$ and max_samples=10, and average the out-of-sample misclassification error over all 10 classifiers.

Note that averaging over many replications reduces the variability that comes from the random choice of training data.

For clarity, here is a list of parameters that you should be considering:

* number of bagged models = $r \in R = \{1,5,10,20,30,40,50,75,100,150\}$
* number of replications = number of bagging classifiers to train for a given $r$ = 10 
* number of points used to train each bagged model = 10
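
The parameters above could be wired together roughly as follows. This sketch assumes the training pool is the 10,000-point set from part d) (noise 0.05, random_state=2), which the problem does not state explicitly for this part, and it uses `max_leaf_nodes=5` for 4-split trees. A shared `RandomState` is passed so that each replication draws different subsets.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumed training pool (same as part d) and the specified test set
X_pool, y_pool = make_moons(n_samples=10_000, noise=0.05, random_state=2)
X_test, y_test = make_moons(n_samples=10_000, noise=0.05, random_state=3)

R = [1, 5, 10, 20, 30, 40, 50, 75, 100, 150]
N_REPS = 10
rng = np.random.RandomState(0)  # shared generator: replications differ

mean_errors = []
for r in R:
    errs = []
    for _ in range(N_REPS):
        bag = BaggingClassifier(
            DecisionTreeClassifier(max_leaf_nodes=5),  # 4-split trees
            n_estimators=r,
            max_samples=10,  # 10 training points per tree
            random_state=rng,
        ).fit(X_pool, y_pool)
        # Misclassification error = 1 - accuracy on the test set
        errs.append(1.0 - bag.score(X_test, y_test))
    mean_errors.append(float(np.mean(errs)))

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(R, mean_errors, marker="o")
ax.set_xlabel("number of bagged models")
ax.set_ylabel("average test misclassification error")
```
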

