## TPS September 2021

*"This is how you win ML competitions: you take other peoples’ work and ensemble them together."* - Vitaly Kuznetsov

This notebook is based on the article: https://mlwave.com/kaggle-ensembling-guide/

## What is Bagging Submissions?

Bagging submissions means taking several submission files, often other people's, and averaging their predictions so that the combined prediction is often stronger than any single one.

It can also reduce overfitting. Different submissions make different errors, so combining their predictions tends to bring us closer to the best answer.

Consider the picture below. Averaging three linear decision boundaries produces a better separation.

![Bagging perceptrons](https://mlwave.com/beheer/wp-content/uploads/2015/06/perceptron-bagging.png)
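
The idea that combined predictions land closer to the best answer can be illustrated on synthetic data. The sketch below is not from the original notebook: the `truth` signal and the three noisy "models" are made up, and it assumes their errors are independent, which real submissions only approximate.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical setup: a "true" signal and three models with independent errors.
truth = rng.normal(size=10_000)
preds = [truth + rng.normal(scale=1.0, size=truth.size) for _ in range(3)]

def mse(pred):
    """Mean squared error against the synthetic truth."""
    return float(np.mean((pred - truth) ** 2))

individual_mse = [mse(p) for p in preds]
blend_mse = mse(np.mean(preds, axis=0))  # simple average of the three models

print("individual:", [round(m, 3) for m in individual_mse])
print("blend:     ", round(blend_mse, 3))
```

With independent errors the blend's error variance is roughly a third of a single model's. The more the errors are correlated, the smaller that gain becomes.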

## Read in submission files

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [None]:
cb_sub = pd.read_csv('../input/bagging-submissions-dataset/CB_Submission.csv')
lgbm_sub = pd.read_csv('../input/bagging-submissions-dataset/LGBM_submission.csv')
xgb_sub = pd.read_csv('../input/bagging-submissions-dataset/XGB_Submission')
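
Averaging row by row only makes sense if every file lists the same rows in the same order. Below is a minimal check, assuming each submission has an `id` column next to `claim`. The helper `check_aligned` and the toy frames are hypothetical and stand in for the real files.

```python
import pandas as pd

def check_aligned(subs, id_col="id"):
    """Raise if the submissions do not list the same ids in the same order."""
    first = subs[0][id_col].reset_index(drop=True)
    for i, sub in enumerate(subs[1:], start=1):
        if not first.equals(sub[id_col].reset_index(drop=True)):
            raise ValueError(f"submission {i} has different ids or row order")

# Toy frames standing in for the real files (values are made up).
a = pd.DataFrame({"id": [1, 2, 3], "claim": [0.2, 0.7, 0.5]})
b = pd.DataFrame({"id": [1, 2, 3], "claim": [0.3, 0.6, 0.4]})
c = pd.DataFrame({"id": [3, 2, 1], "claim": [0.4, 0.6, 0.3]})

check_aligned([a, b])  # passes silently

# A file in a different row order can be sorted by id before blending.
c_sorted = c.sort_values("id").reset_index(drop=True)
check_aligned([a, c_sorted])
```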

## Check for correlations

Blends of submissions with lower correlation tend to gain more from averaging, so check the correlation between the files first.

In [None]:
import matplotlib.pyplot as plt
import plotly.figure_factory as ff
import plotly.express as px

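# Pairwise Pearson correlation between the three sets of predictions
data = np.corrcoef([cb_sub.claim, lgbm_sub.claim, xgb_sub.claim])
group_labels = ['catboost', 'lgbm', 'xgboost']  # same order as the rows of `data`
fig = px.imshow(data, x=group_labels, y=group_labels)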

fig.show()
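
Because AUC depends only on how predictions are ordered, it may also be worth looking at rank correlation alongside Pearson correlation. The sketch below uses made-up predictions, with column names that mirror the labels above. It computes Spearman correlation as the Pearson correlation of the ranks.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.random(1000)  # synthetic stand-in for one model's predictions

# Hypothetical predictions: "lgbm" stays close to "catboost", "xgboost" drifts further.
preds = pd.DataFrame({
    "catboost": base,
    "lgbm": base + rng.normal(scale=0.05, size=base.size),
    "xgboost": base + rng.normal(scale=0.20, size=base.size),
})

pearson = preds.corr()           # linear correlation of the raw predictions
spearman = preds.rank().corr()   # correlation of the ranks (Spearman)

print(pearson.round(3))
print(spearman.round(3))
```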

## Major types of Averaging

There are 3 major types of averaging used in competitions. These are:
* **Simple Averaging**
* **Rank Averaging**
* **Weighted Averaging**

## Simple Averaging

Simple averaging adds up the predictions from all submission files and divides by the number of files.

For example: **Final Submission = (Submission1 + Submission2 + Submission3) / 3**

In [None]:
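simple_avg = cb_sub.copy()  # keep the other columns; average only the predictions
simple_avg['claim'] = (cb_sub.claim + lgbm_sub.claim + xgb_sub.claim) / 3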
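
The same formula extends to any number of files. Here is a small sketch, assuming every submission has `id` and `claim` columns in the same row order. The helper name `simple_average` and the toy values are illustrative only.

```python
import pandas as pd

def simple_average(subs, target_col="claim"):
    """Average `target_col` across submissions; other columns come from the first one."""
    out = subs[0].copy()
    out[target_col] = sum(s[target_col] for s in subs) / len(subs)
    return out

# Toy submissions (made-up values).
a = pd.DataFrame({"id": [1, 2], "claim": [0.2, 0.8]})
b = pd.DataFrame({"id": [1, 2], "claim": [0.4, 0.6]})
c = pd.DataFrame({"id": [1, 2], "claim": [0.6, 0.7]})

blend = simple_average([a, b, c])
print(blend)
```

Averaging only the prediction column leaves the `id` column as integers instead of passing it through floating-point arithmetic.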

## Rank Averaging

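Rank averaging, as used here, ranks the submissions by performance and weights each one by its rank. (The linked guide uses the same name for a different technique: converting each submission's predictions to ranks and averaging those ranks.)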

Let's imagine LGBM has the strongest performance, followed by XGBoost, then CatBoost.

We would assign a rank of 3 to LGBM, 2 to XGBoost and 1 to CatBoost, then get each weight by dividing its rank by the total (6).

For example: **Final Submission = LGBM * 3/6 + XGB * 2/6 + CB * 1/6**

In [None]:
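rank_avg = cb_sub.copy()  # keep the other columns; blend only the predictions
rank_avg['claim'] = lgbm_sub.claim*3/6 + xgb_sub.claim*2/6 + cb_sub.claim*1/6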
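
For comparison, the guide linked at the top describes rank averaging as turning each submission's predictions into ranks, averaging the ranks, and rescaling the result to between 0 and 1. A minimal sketch of that variant follows, assuming a `claim` column. The helper `rank_average` and the toy values are hypothetical.

```python
import pandas as pd

def rank_average(subs, target_col="claim"):
    """Average the per-submission ranks of `target_col`, rescaled to [0, 1]."""
    out = subs[0].copy()
    ranks = [s[target_col].rank(method="average") for s in subs]
    mean_rank = sum(ranks) / len(ranks)
    # Min-max rescale; assumes the mean ranks are not all identical.
    out[target_col] = (mean_rank - mean_rank.min()) / (mean_rank.max() - mean_rank.min())
    return out

# Toy submissions on different scales (made-up values).
a = pd.DataFrame({"id": [1, 2, 3, 4], "claim": [0.10, 0.90, 0.50, 0.30]})
b = pd.DataFrame({"id": [1, 2, 3, 4], "claim": [0.40, 0.55, 0.60, 0.45]})

ranked = rank_average([a, b])
print(ranked)
```

This variant can help when submissions are calibrated differently, since only the ordering within each file contributes to the blend.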

## Weighted Averaging

Weighted averaging is when you assign weights to each submission and see how they turn out on the leaderboard.

This method is mostly trial and error. Putting more weight on a submission causes that submission to have more effect on the final submission.

Just make sure that the weights add up to 1.

For example: **Final Submission = Submission1 * 0.4 + Submission2 * 0.3 + Submission3 * 0.3**

In [None]:
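weighted_avg = cb_sub.copy()  # keep the other columns; blend only the predictions
weighted_avg['claim'] = lgbm_sub.claim*0.4 + xgb_sub.claim*0.3 + cb_sub.claim*0.3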
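
One way to guarantee that the weights sum to 1 is to normalize them inside a helper, so that trial-and-error values such as `[4, 3, 3]` can be passed directly. This is a sketch under the assumption that every file has a `claim` column. The function name `weighted_average` is illustrative.

```python
import numpy as np
import pandas as pd

def weighted_average(subs, weights, target_col="claim"):
    """Blend `target_col` with non-negative weights, normalized to sum to 1."""
    w = np.asarray(weights, dtype=float)
    if len(w) != len(subs) or np.any(w < 0) or w.sum() == 0:
        raise ValueError("need one non-negative weight per submission, not all zero")
    w = w / w.sum()
    out = subs[0].copy()
    # One column per submission, then a weighted sum across columns.
    stacked = np.column_stack([s[target_col].to_numpy() for s in subs])
    out[target_col] = stacked @ w
    return out

# Toy submissions (made-up values).
a = pd.DataFrame({"id": [1, 2], "claim": [0.2, 0.8]})
b = pd.DataFrame({"id": [1, 2], "claim": [0.4, 0.6]})
c = pd.DataFrame({"id": [1, 2], "claim": [0.6, 0.7]})

blend = weighted_average([a, b, c], weights=[4, 3, 3])  # same as 0.4, 0.3, 0.3
print(blend)
```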

## Power Averaging

This is a very situational type of averaging. It's applicable in TPS September because the leaderboard metric is AUC, which depends only on how the predictions are ordered.

It's also the opposite of normal averaging in that it needs highly correlated models to perform well.

For example: **Final Submission = (Submission1^Power + Submission2^Power + Submission3^Power) / 3**

I go in-depth on power averaging for TPS September in this notebook: https://www.kaggle.com/edrickkesuma/power-averaging-is-your-friend

In [None]:
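power = 6
power_avg = cb_sub.copy()  # raise only the predictions to the power, never the other columns
power_avg['claim'] = (lgbm_sub.claim**power + xgb_sub.claim**power + cb_sub.claim**power) / 3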
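
To see what the power does, note that raising a value in [0, 1] to a power above 1 shrinks small predictions far more than large ones. Each model's own ordering is unchanged, but the blend leans toward rows where at least one model is very confident. The two-row example below uses made-up numbers and assumes predictions are probabilities in [0, 1].

```python
import numpy as np

power = 6

# Two hypothetical models scoring the same two rows.
model_a = np.array([0.90, 0.50])
model_b = np.array([0.30, 0.75])

simple = (model_a + model_b) / 2
powered = (model_a**power + model_b**power) / 2

print("simple :", simple)    # row 1 ranks highest
print("powered:", powered)   # row 0 ranks highest
```

Here the simple average puts row 1 on top (0.625 against 0.6), while the power average puts row 0 on top, because `0.9**6` dominates every other term.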

## Caveat

Do keep in mind that only combining submissions and blindly following the leaderboard score may cause you to **overfit** to the public LB. 

If possible, use submission files from your own models and make sure your out-of-fold (OOF) score isn't too far from your leaderboard score.
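
With your own models, blend weights can be scored on out-of-fold predictions instead of the public leaderboard. The sketch below is only an illustration. The labels `y` and the arrays `oof_a` and `oof_b` are synthetic stand-ins for real OOF predictions, and `auc` is a small rank-based helper written for this example, not a library function.

```python
import numpy as np
import pandas as pd

def auc(y_true, y_score):
    """AUC via the rank-sum (Mann-Whitney U) formula; ties get average ranks."""
    y_true = np.asarray(y_true)
    ranks = pd.Series(y_score).rank(method="average").to_numpy()
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)                # synthetic binary labels
oof_a = y + rng.normal(scale=1.0, size=y.size)   # hypothetical OOF scores, model A
oof_b = y + rng.normal(scale=1.0, size=y.size)   # hypothetical OOF scores, model B

# Score a small grid of blend weights on the OOF predictions.
scores = {}
for w in (0.0, 0.25, 0.5, 0.75, 1.0):
    scores[w] = auc(y, w * oof_a + (1 - w) * oof_b)
    print(f"weight on A = {w:.2f}  OOF AUC = {scores[w]:.4f}")
```

Weights chosen this way can still overfit the OOF set, but they do not rely on the public leaderboard.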