## Ensembles

The purpose of this module is to show the value of ensembles. We will simulate several predictors, each with a given accuracy on the target variable. Then we will show that a combination of these does better than any individual one of them.

First we will create a target variable. Assume this is the target for a classification problem. Each value is drawn as 0 or 1 with equal probability, so the target should have roughly 50% 0s and 50% 1s.

In [13]:
import pandas as pd
import numpy as np
import random
from sklearn import metrics

n = 1000
# Create the target column: each entry is 0 or 1 with equal probability (roughly 50% each)
target = random.choices([0, 1], k=n)
df = pd.DataFrame({'target': target})
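
Because `random.choices` draws each entry independently, the split above is only approximately 50/50. If an exactly balanced target is wanted, one option is to build the two halves explicitly and shuffle them. This is a minimal sketch and not part of the original notebook; the helper name `balanced_target` is hypothetical.

```python
import random

def balanced_target(n, seed=None):
    """Return a shuffled list with exactly n // 2 zeros and n - n // 2 ones."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    values = [0] * (n // 2) + [1] * (n - n // 2)
    rng.shuffle(values)
    return values

target_exact = balanced_target(1000, seed=0)
print(sum(target_exact) / len(target_exact))  # fraction of 1s: 0.5
```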


Next, we will use the following code to create "predictors". Think of these as the predictions that come out of particular models, such as separate trees or regression models.

We will keep the code generic. Start with `num_models = 3` (`num_models` should be an odd number).

Each predictor is constructed to have a specified accuracy, `p`.

For instance, if `p = 0.6`, each of the predictors agrees with the target on 60% of the rows.

In [14]:
## now create num_models new columns, each matching the target on a fraction p of rows
p = .6
num_models=3 # keep this as an odd number to break ties later.

for i in range(num_models):
    # round() rather than int(): (1-p)*n can land just below a whole number in floating point
    indices = random.sample(range(len(target)), round((1-p)*n))
    new_col = target.copy()
    for ix in indices:
        new_col[ix] = 1 - target[ix]
    df['pred'+str(i+1)] = new_col

## take a look at df and see if it makes sense to you...
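
As a quick sanity check, each predictor built this way should agree with the target on a fraction `p` of the rows. The sketch below repeats the flipping idea on plain Python lists and measures that agreement. `make_predictor` and `accuracy` are hypothetical helper names, and using `round` for the flip count is an assumption about the intended behaviour.

```python
import random

def make_predictor(target, p, rng):
    """Copy target and flip a randomly chosen (1 - p) fraction of its entries."""
    n_flip = round((1 - p) * len(target))
    pred = list(target)
    for ix in rng.sample(range(len(target)), n_flip):
        pred[ix] = 1 - pred[ix]
    return pred

def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and target agree."""
    return sum(t == q for t, q in zip(y_true, y_pred)) / len(y_true)

rng = random.Random(0)
demo_target = rng.choices([0, 1], k=1000)
demo_preds = [make_predictor(demo_target, 0.6, rng) for _ in range(3)]
print([accuracy(demo_target, pred) for pred in demo_preds])
```

Since exactly 400 of the 1000 entries are flipped in each copy, every predictor's accuracy comes out to 0.6 by construction; only which rows are wrong varies from model to model.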



In [15]:
df

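| | target | pred1 | pred2 | pred3 |
|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 0 | 1 | 0 |
| 2 | 1 | 0 | 1 | 1 |
| 3 | 1 | 0 | 0 | 0 |
| 4 | 1 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... |
| 995 | 1 | 1 | 1 | 1 |
| 996 | 1 | 0 | 1 | 1 |
| 997 | 1 | 1 | 1 | 0 |
| 998 | 0 | 0 | 0 | 0 |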


Use `sklearn.metrics` to calculate the accuracy, precision, recall, and F1 score of one (or all!) of your predictors.

There are lots of other metrics you can calculate as well; see the [scikit-learn documentation](https://scikit-learn.org/stable/modules/model_evaluation.html).


In [None]:
## Complete the code

f1 = metrics.f1_score(###)
prec = metrics.precision_score(###)
acc = metrics.accuracy_score(###)
rec = metrics.recall_score(###)

## print out the values
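
To make it clearer what these four numbers measure, here is a minimal pure-Python sketch that computes them from the confusion counts for 0/1 labels, treating 1 as the positive class. It is meant as an illustration of the definitions rather than a replacement for the `sklearn.metrics` functions; the helper name `binary_metrics` and the toy labels are hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and q == 1 for t, q in pairs)  # true positives
    fp = sum(t == 0 and q == 1 for t, q in pairs)  # false positives
    fn = sum(t == 1 and q == 0 for t, q in pairs)  # false negatives
    tn = sum(t == 0 and q == 0 for t, q in pairs)  # true negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if tp else 0.0,
    }

toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 0, 1, 0, 1, 1, 1, 0]
print(binary_metrics(toy_true, toy_pred))
```

On this toy data there are 3 true positives, 2 false positives, 1 false negative, and 2 true negatives, giving accuracy 0.625, precision 0.6, recall 0.75, and F1 of 2/3.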


Now, create a new ensemble variable, `en_sum`, by summing up the values of the predictor vectors.  This represents the total number of models that predict the positive class.

Then create your ensemble predictor `en_pred`, which is 1 if `en_sum` is greater than half the number of predictors (`num_models`) and 0 otherwise.


In [None]:
en_sum = ###
en_pred = ###
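
The voting rule described above can be sketched on a few toy rows using plain Python lists. This is only an illustration of the idea, not the notebook's intended solution; `majority_vote` and `toy_preds` are hypothetical names.

```python
def majority_vote(predictions):
    """Combine several 0/1 prediction lists by majority vote, position by position."""
    n_models = len(predictions)
    # For each position, count how many models predict the positive class.
    vote_sums = [sum(row) for row in zip(*predictions)]
    # Predict 1 only when more than half of the models voted 1.
    return [1 if s > n_models / 2 else 0 for s in vote_sums]

toy_preds = [
    [1, 0, 1, 1],  # model 1
    [0, 0, 1, 1],  # model 2
    [1, 1, 0, 1],  # model 3
]
print(majority_vote(toy_preds))  # [1, 0, 1, 1]
```

With an odd number of models the vote sum can never equal exactly half of `num_models`, which is why ties do not arise.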

Check the accuracy, precision, recall, and f1 of your new ensemble predictor `en_pred`.  How does it compare to the individual models?

In [None]:
### modify code from cell above...

Now go back and change the number of models and/or the accuracy `p` of the individual models. How does that impact the results?
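
For comparison with the simulated results, the expected accuracy of a majority vote can be computed directly from the binomial distribution. The sketch below assumes the models err independently of one another, each being correct with probability `p`; that matches how the predictors are generated here but is rarely true of real models. The function name `majority_accuracy` is hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def majority_accuracy(p, num_models):
    """Probability that more than half of num_models independent models are correct."""
    needed = num_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(num_models, k) * p**k * (1 - p) ** (num_models - k)
        for k in range(needed, num_models + 1)
    )

for m in (1, 3, 5, 7, 9):
    print(m, round(majority_accuracy(0.6, m), 4))
```

For example, with `p = 0.6` and three models the expected accuracy is 3(0.6²)(0.4) + 0.6³ = 0.648, already above the 0.6 of any single model.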

**Extra: plot the improvement in F1 attained by ensembles of 3, 5, 7, 9, etc.**

In [None]:
## Code here
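
One possible starting point for this exercise is to simulate an ensemble of each size and record its F1 score. The sketch below does so with the standard library only and leaves the plotting step out. All helper names, the sample size of 5000, and the fixed seed are assumptions made for illustration.

```python
import random

def f1_binary(y_true, y_pred):
    """F1 score for 0/1 labels with 1 as the positive class."""
    tp = sum(t == 1 and q == 1 for t, q in zip(y_true, y_pred))
    fp = sum(t == 0 and q == 1 for t, q in zip(y_true, y_pred))
    fn = sum(t == 1 and q == 0 for t, q in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def flip_fraction(target, p, rng):
    """Copy target and flip a randomly chosen (1 - p) fraction of its entries."""
    pred = list(target)
    for ix in rng.sample(range(len(target)), round((1 - p) * len(target))):
        pred[ix] = 1 - pred[ix]
    return pred

def ensemble_f1(target, p, num_models, rng):
    """F1 of a majority vote over num_models simulated predictors of accuracy p."""
    preds = [flip_fraction(target, p, rng) for _ in range(num_models)]
    votes = [sum(row) for row in zip(*preds)]
    en_pred = [1 if v > num_models / 2 else 0 for v in votes]
    return f1_binary(target, en_pred)

rng = random.Random(0)
sim_target = rng.choices([0, 1], k=5000)
f1_by_size = {m: ensemble_f1(sim_target, 0.6, m, rng) for m in (1, 3, 5, 7, 9)}
for m, f1 in f1_by_size.items():
    print(m, round(f1, 3))
```

The keys and values of `f1_by_size` can then be passed to a plotting library of your choice, for instance as the x and y arguments of matplotlib's `plt.plot`.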