# Introduction to Ensembles

- [Check list](#Check-list)
- [Brief intro](#Brief-intro)
- [Problem summary](#Problem-summary)
- [Python implementation](#Python-implementation)
- [Takeaways](#Takeaways)

# Check list
Please make sure you have Python 3 installed, with the following libraries:
- Pandas
- NumPy
- Jupyter Lab/Notebook (optional)

If you want to closely follow the lecture, please register for an account on [Kaggle](http://kaggle.com/) as well, as we will be participating in a data science competition on that website.

# Brief intro
Ensembles have become one of the most popular methods in applied machine learning. Many winning Kaggle solutions feature them, and many data science pipelines include ensembles as well.

Put simply, ensembles combine predictions from different models to generate a final prediction. Adding models tends to help when their errors are not strongly correlated, although the gains diminish and a weak model can drag the ensemble down. A well-weighted ensemble can usually match or beat its best base model, since putting all of the weight on that model recovers it exactly. For this reason, ensembles often give us a performance boost at little extra cost.

<img width="800" src="https://www.dataquest.io/blog/content/images/2018/01/network-1.png"></img>

_Example schematics of an ensemble._

_An input array X is fed through two preprocessing pipelines and then to a set of base learners f(i)._

_The ensemble combines all base learner predictions into a final prediction array P._
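
To make the idea concrete, here is a minimal sketch on synthetic data (not the competition data). It assumes three hypothetical base learners whose errors are independent noise around the true values, which is the most favorable case; real models usually have correlated errors, so the gain is typically smaller.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "true" values and three hypothetical base learners whose
# predictions are the truth plus independent noise (an assumption).
truth = rng.uniform(0.0, 1.0, size=1000)
base_predictions = [truth + rng.normal(0.0, 0.1, size=truth.shape) for _ in range(3)]

# The ensemble is the element-wise mean of the base predictions.
ensemble_prediction = np.mean(base_predictions, axis=0)


def mse(prediction):
    """Mean squared error against the synthetic truth."""
    return float(np.mean((prediction - truth) ** 2))


base_errors = [mse(p) for p in base_predictions]
ensemble_error = mse(ensemble_prediction)

print('base learner errors:', base_errors)
print('ensemble error:     ', ensemble_error)
```

Because the noise terms are independent, they partly cancel when averaged, so the ensemble error comes out below each base learner's error.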


# Problem summary
We will be working with the dataset in [Kaggle's DonorsChoose.org Application Screening competition](https://www.kaggle.com/c/donorschoose-application-screening).

The goal of the competition is to predict whether or not a DonorsChoose.org project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school. DonorsChoose.org can then use this information to identify projects most likely to need further review before approval.

To make a submission, we train our model and then, for each project proposal in the test set, predict the probability that the proposal will be approved. Submissions are evaluated on the [area under the ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) between the predicted probability and the observed target (true value). If you are unfamiliar with the concept: the larger the area under the ROC curve, the better our predictions rank approved proposals above rejected ones. A perfect ranking yields a score of 1, and random guessing yields about 0.5.
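
As an illustration of the metric, the sketch below computes the area under the ROC curve as the fraction of (approved, rejected) pairs in which the approved proposal gets the higher predicted probability, counting ties as half. The function name `roc_auc` and the toy labels are assumptions made for this example; Kaggle computes the score on its side, and this pairwise version is only practical for small arrays.

```python
import numpy as np


def roc_auc(y_true, y_score):
    """Pairwise AUC: P(score of a positive > score of a negative), ties count 0.5."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)

    positives = y_score[y_true]
    negatives = y_score[~y_true]

    # Compare every positive score with every negative score.
    diff = positives[:, None] - negatives[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (len(positives) * len(negatives))


# Toy example: 3 of the 4 (positive, negative) pairs are ranked correctly.
labels = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, probabilities))
```

Note that only the ordering of the predictions matters here, not their absolute values.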

# Python implementation
Here we are provided with premade submissions, computed from different approaches and models. We will learn how to combine these submissions into an ensemble with a better score.

## Importing data

In [1]:
import pandas as pd
import numpy as np

import os

Our premade submissions are stored in the `input` folder. The individual files are:

In [2]:
# os.listdir returns files in arbitrary order, so sort to keep a fixed (lgb, nlp, nn) order
submissions = sorted(file for file in os.listdir('input/') if file.endswith('.csv'))
submissions

['lgb.csv', 'nlp.csv', 'nn.csv']

Let's also look at the first few lines of these files:

In [3]:
for file in submissions:
    df = pd.read_csv('input/' + file)
    print(file)
    print(df.head())
    print('-' * 30)

lgb.csv
        id  project_is_approved
0  p233245               0.9670
1  p096795               0.9316
2  p236235               0.9443
3  p233680               0.9336
4  p171879               0.8240
------------------------------
nlp.csv
        id  project_is_approved
0  p233245             0.638546
1  p096795             0.780585
2  p236235             0.597326
3  p233680             0.456532
4  p171879             0.637596
------------------------------
nn.csv
        id  project_is_approved
0  p233245             0.929612
1  p096795             0.916068
2  p236235             0.971921
3  p233680             0.890154
4  p171879             0.883831
------------------------------


- `lgb.csv` is a submission made from a LightGBM model.
- `nlp.csv` is a submission made from a pure Natural Language Processing approach.
- `nn.csv` is a submission made from a Neural Network model.

After submitting these files individually, we see that their scores are:
- `lgb.csv`: 0.79554
- `nlp.csv`: 0.7959
- `nn.csv`: 0.80016

Individually, all of our submissions do quite well. Now we will combine them to get a better prediction.

## Averaged predictions
One intuition for this method: if some of our models predict certain proposals well while others predict them badly, averaging all predictions dilutes the errors of the weaker models. If all of our models do well (or all do badly) on some proposals, the averaged prediction will most likely not stray far from the original predictions.

Let's see this in action:

In [4]:
submission_values = []

for file in submissions:
    df = pd.read_csv('input/' + file)
    submission_values.append(df['project_is_approved'].values)  # 'project_is_approved' is the column we need to predict

submission_values

[array([ 0.967 ,  0.9316,  0.9443, ...,  0.862 ,  0.951 ,  0.6743]),
 array([ 0.63854617,  0.78058499,  0.59732588, ...,  0.53082335,
         0.8409909 ,  0.10038701]),
 array([ 0.9296123 ,  0.91606808,  0.97192119, ...,  0.93095801,
         0.97309926,  0.63227701])]

In [5]:
avg_value = np.mean(submission_values, axis=0)
avg_value

array([ 0.84505282,  0.87608436,  0.83784902, ...,  0.77459379,
        0.92169672,  0.468988  ])

In [6]:
sub_df = pd.DataFrame()
sub_df['id'] = df['id']
sub_df['project_is_approved'] = avg_value
sub_df.head()

        id  project_is_approved
0  p233245             0.845053
1  p096795             0.876084
2  p236235             0.837849
3  p233680             0.760095
4  p171879             0.781809
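
The cells above take the `id` column from the last file read and rely on every file listing the proposals in the same order. A minimal sketch of a more defensive variant is to join the files on `id` before averaging. The small frames below are stand-ins built from the first two rows printed earlier (with one file deliberately shuffled); they are not the real files.

```python
import pandas as pd

# Toy stand-ins for the three submission files (first two rows only).
lgb = pd.DataFrame({'id': ['p233245', 'p096795'],
                    'project_is_approved': [0.9670, 0.9316]})
nlp = pd.DataFrame({'id': ['p096795', 'p233245'],  # deliberately in a different order
                    'project_is_approved': [0.780585, 0.638546]})
nn = pd.DataFrame({'id': ['p233245', 'p096795'],
                   'project_is_approved': [0.929612, 0.916068]})

frames = {'lgb': lgb, 'nlp': nlp, 'nn': nn}

# Join on 'id' so each row holds the three predictions for the same proposal.
merged = None
for name, frame in frames.items():
    renamed = frame.rename(columns={'project_is_approved': name})
    if merged is None:
        merged = renamed
    else:
        merged = merged.merge(renamed, on='id', how='inner', validate='one_to_one')

# Average the model columns row by row.
merged['project_is_approved'] = merged[list(frames)].mean(axis=1)
print(merged[['id', 'project_is_approved']])
```

With `validate='one_to_one'`, pandas raises an error if any file contains a duplicated `id`, which would otherwise silently distort the average.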


In [7]:
sub_df.to_csv('output/avg_sub.csv', index=False, header=True)

After submitting the file, we see that our score has improved to 0.81676, above the best individual score of 0.80016.

## Weighted predictions
Consider our averaged ensemble. If one of our models does exceptionally well overall, taking a plain average of the predictions can actually hurt our score, since its good predictions are mixed with not-so-good ones. So we need a way to "reward" the models that perform significantly better than the others.

One way to do this is to give those models larger weights in a weighted ensemble. Specifically, we will be using a simple linear weighted ensemble:

$$P(x) = a\,\mathrm{LGB}(x) + b\,\mathrm{NLP}(x) + c\,\mathrm{NN}(x)$$

$$\text{with } 0 \leq a, b, c \leq 1, \quad a + b + c = 1$$

We can see that the averaged ensemble we saw earlier is also a linear weighted ensemble in which all weights are equal (in this case 1/3 ≈ 0.333).
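
A quick way to check this equivalence is sketched below using `np.average`, which accepts a `weights` argument. The array holds the first two predictions from each file as printed earlier; the helper name `weighted_ensemble` is an assumption made for this example.

```python
import numpy as np

# One row per model (lgb, nlp, nn), one column per proposal.
predictions = np.array([
    [0.9670, 0.9316],
    [0.638546, 0.780585],
    [0.929612, 0.916068],
])


def weighted_ensemble(predictions, weights):
    """Linear weighted ensemble: sum of weight_i * prediction_i over the models."""
    return np.average(predictions, axis=0, weights=weights)


equal_weights = weighted_ensemble(predictions, [1 / 3, 1 / 3, 1 / 3])
plain_mean = predictions.mean(axis=0)

# With equal weights the weighted ensemble matches the plain mean.
print(np.allclose(equal_weights, plain_mean))
```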

As we mentioned, let's try changing these weights so that the better model (in this case the Neural Network) has more weight in our ensemble. One way to do this is to simply use the models' scores as their weights, normalized so that they sum to 1. Specifically:

In [8]:
scores = [0.79554, 0.7959, 0.80016]
weights = np.array([score / sum(scores) for score in scores])

weights

array([ 0.33263924,  0.33278976,  0.334571  ])

In [11]:
weighted_value = np.sum(np.array([submission_values[i] * weights[i] for i in range(3)]), axis=0)

weighted_value

array([ 0.84518508,  0.87614722,  0.83807181, ...,  0.77485915,
        0.92178387,  0.46924796])

In [12]:
sub_df['project_is_approved'] = weighted_value
sub_df.head()

        id  project_is_approved
0  p233245             0.845185
1  p096795             0.876147
2  p236235             0.838072
3  p233680             0.760301
4  p171879             0.781984


In [13]:
sub_df.to_csv('output/weighted_sub.csv', index=False, header=True)

We see that this submission scores better than the one from the averaged ensemble, but only by 0.00003, which is not the significant increase we were hoping for. Why is that? The weights we computed for the models are nearly identical (0.33263924, 0.33278976, 0.334571), so the result is almost the same as a plain average. To obtain a noticeably different ensemble, we have to change the weights more drastically.
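
The arithmetic behind this observation can be sketched as follows: compare the score-based weights with uniform weights of 1/3. The bound in the last step is a simple worst-case estimate that assumes every prediction lies in [0, 1].

```python
import numpy as np

scores = np.array([0.79554, 0.7959, 0.80016])
weights = scores / scores.sum()

# Uniform weights correspond to the plain average.
uniform = np.full_like(weights, 1 / len(weights))
deviation = weights - uniform

# Largest change in any single weight compared with the plain average.
max_shift = np.abs(deviation).max()

# For predictions in [0, 1], the blended value can differ from the plain
# mean by at most the sum of the absolute weight changes.
bound = np.abs(deviation).sum()

print('deviation from 1/3:', deviation)
print('largest weight shift:', max_shift)
print('upper bound on change per prediction:', bound)
```

No weight moves by more than about 0.0012, so no single prediction can move by more than a few thousandths.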

Let's manually try some different weight combinations. Based on the individual scores, we know that we should give `nn.csv` the largest weight, and `lgb.csv` the smallest weight, so we can try these combinations:

- (0.2, 0.3, 0.5):

In [14]:
weights = np.array([0.2, 0.3, 0.5])
weighted_value = np.sum(np.array([submission_values[i] * weights[i] for i in range(3)]), axis=0)

weighted_value

array([ 0.84977   ,  0.87852954,  0.85401836, ...,  0.79712601,
        0.9290469 ,  0.48111461])

In [15]:
sub_df['project_is_approved'] = weighted_value
sub_df.to_csv('output/weighted_sub_v2.csv', index=False, header=True)

This submission scores 0.81856. Now the increase in our score is much more significant. Let's try another combination:
- (0.15, 0.3, 0.55):

In [16]:
weights = np.array([0.15, 0.3, 0.55])
weighted_value = np.sum(np.array([submission_values[i] * weights[i] for i in range(3)]), axis=0)

weighted_value

array([ 0.84790061,  0.87775294,  0.85539942, ...,  0.80057391,
        0.93015186,  0.47901346])

In [17]:
sub_df['project_is_approved'] = weighted_value
sub_df.to_csv('output/weighted_sub_v3.csv', index=False, header=True)

Another increase in our score!
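
Since the blending step is repeated for every weight combination, it can be wrapped in a small helper. The sketch below is one possible version; the name `blend` and its validation rules (non-negative weights that sum to 1, as in the formula above) are choices made for this example. It uses the first prediction from each file as a stand-in for `submission_values`.

```python
import numpy as np


def blend(predictions, weights):
    """Return the weighted sum of model predictions (one row per model)."""
    predictions = np.asarray(predictions, dtype=float)
    weights = np.asarray(weights, dtype=float)

    if predictions.shape[0] != weights.shape[0]:
        raise ValueError('need exactly one weight per model')
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError('weights must be non-negative and sum to 1')

    # (n_models,) @ (n_models, n_rows) -> (n_rows,)
    return weights @ predictions


# First prediction from lgb, nlp and nn, as printed earlier.
first_rows = [[0.967], [0.63854617], [0.9296123]]

# The two weight combinations tried above.
candidates = [(0.2, 0.3, 0.5), (0.15, 0.3, 0.55)]
for candidate in candidates:
    print(candidate, blend(first_rows, candidate))
```

In the notebook, passing `submission_values` in place of `first_rows` would give the full prediction array for each candidate.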

# Takeaways
We have seen that building an ensemble is an effective way to improve on the performance of individual predictive models. The main points are:
- To include models with different approaches so that they can correct each other's mistakes
- To give models that perform better larger weights in the ensemble

In this notebook we learned how to compute averaged and weighted ensembles; there are other, more complex ways to build an ensemble as well.