A good solution to this competition will most certainly require ensembling. But what are some good ways of ensembling predictions?

In this notebook, we will look at two approaches:
* voting ensemble
* voting ensemble with weights (this allows you to put more weight on predictions that got a better validation/LB score)

As always, the challenge will be the resources that we have available. With each submission file at over 5 million rows, and each row containing 20 predictions, the problem of available RAM is non-trivial!
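To get a feel for the scale, here is a back-of-the-envelope estimate. It assumes round numbers (exactly 5 million rows with 20 predictions each, while the real files are somewhat larger) and counts only the raw bytes of an integer `aid` column and a small integer `vote` column once every prediction gets its own row. Session strings and library overhead are ignored, so treat the figures as a rough lower bound rather than a measurement.

```python
# Rough size of one submission once every prediction has its own row.
# ROWS is an assumed round number, not the exact size of any file.
ROWS = 5_000_000
PREDS_PER_ROW = 20


def exploded_column_bytes(bytes_per_value: int) -> int:
    """Bytes needed for one column holding a value per (row, prediction) pair."""
    return ROWS * PREDS_PER_ROW * bytes_per_value


aid_int32_mb = exploded_column_bytes(4) / 1e6   # aid stored as a 32-bit integer
aid_int64_mb = exploded_column_bytes(8) / 1e6   # aid stored as a 64-bit integer
vote_uint8_mb = exploded_column_bytes(1) / 1e6  # vote stored as an 8-bit integer

print(f"rows after exploding: {ROWS * PREDS_PER_ROW:,}")
print(f"aid as 32-bit ints: {aid_int32_mb:.0f} MB")
print(f"aid as 64-bit ints: {aid_int64_mb:.0f} MB")
print(f"vote as 8-bit ints: {vote_uint8_mb:.0f} MB")
```

Multiply by the number of submissions being ensembled, and the choice of column types starts to matter quickly.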

To combat this, we will use the very memory efficient `polars` 🙂

As a basis for our work, let us use the following three submissions:

* [Candidate ReRank Model - [LB 0.575]](https://www.kaggle.com/code/cdeotte/candidate-rerank-model-lb-0-575) [0.575] by Chris Deotte
* [Test Dataset Is All We Need?](https://www.kaggle.com/code/tomooinubushi/test-dataset-is-all-we-need/notebook) [0.522] by Tomoo Inubushi
* [💡Matrix Factorization [PyTorch+Merlin Dataloader]](https://www.kaggle.com/code/radek1/matrix-factorization-pytorch-merlin-dataloader/notebook) [0.493] by yours truly

Let's get started!

**Please upvote if you like this notebook 🙏 It would be of great help to me if you do. Thank you!**

*Please note: In this notebook, we are ensembling 1 good solution with 2 that are not that great, hence we can't expect great results with equal weights. Even when setting the weights to something that is reasonable given the performance of each solution, we still cannot expect a very good result.*

*However, when I used this method locally on my own submissions, I was able to combine several solutions generated with the same ranking model (by varying the seed) to improve my LB score from 0.576 to 0.577. This effect can be even stronger when ensembling more varied solutions.*

# Loading the data

In [1]:
#!pip install polars # why are we using polars? it has much smaller memory footprint than pandas!

Collecting polars
  Downloading polars-0.15.14-cp37-abi3-macosx_10_7_x86_64.whl (14.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.0/14.0 MB 10.8 MB/s eta 0:00:00
Installing collected packages: polars
Successfully installed polars-0.15.14


In [1]:
import polars as pl

Here are the submissions that we will use. The order of the paths matters later: when we assign weights, each weight is matched to a path by position.

In [2]:
paths = [
    "/Users/Artem_Boltaev/Documents/EPAM Projects/7. RecSys_OTTO_Kaggle/source_code/otto_recsys_kaggle/notebooks/submission_baseline_0483.csv",
    "/Users/Artem_Boltaev/Documents/EPAM Projects/7. RecSys_OTTO_Kaggle/source_code/otto_recsys_kaggle/notebooks/submission_mix_open_0578.csv",
    "/Users/Artem_Boltaev/Documents/EPAM Projects/7. RecSys_OTTO_Kaggle/source_code/otto_recsys_kaggle/notebooks/submission_vstan_1_0.45.csv",
]

We can load all the submissions at once, but we have to be very careful about which operations we run on the data, as it is very easy to run out of RAM.

In [5]:
def read_sub(path, weight=1): # by default let us assign the weight of 1 to predictions from each submission, this will be akin to a standard vote ensemble
    '''a helper function for loading and preprocessing submissions'''
    return (
        pl.read_csv(path)
            .with_column(pl.col('labels').str.split(by=" "))
            .with_column(pl.lit(weight).alias('vote'))
            .explode('labels')
            .rename({'labels': 'aid'})
            .with_column(pl.col('aid')) # no-op as written: `aid` stays a string (see the output below); appending `.cast(pl.Int32)` here would shrink memory, assuming every label parses as an integer
            .with_column(pl.col('vote').cast(pl.UInt8))
    )
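If the split-and-explode step is unfamiliar, the plain-Python sketch below mimics it on two made-up rows. The helper `explode_submission` and the toy data are purely illustrative and are not part of the pipeline in this notebook; polars performs the same reshaping far more efficiently.

```python
def explode_submission(rows, weight=1):
    """Turn (session_type, "aid aid ...") rows into one (session_type, aid, vote) row per aid."""
    long_rows = []
    for session_type, labels in rows:
        for aid in labels.split(" "):  # same idea as .str.split(by=" ") followed by .explode()
            long_rows.append((session_type, aid, weight))
    return long_rows


# Made-up data in the submission layout: one row per session_type, labels separated by spaces.
toy_submission = [
    ("1_clicks", "10 20 30"),
    ("1_carts", "20 40"),
]

long_rows = explode_submission(toy_submission)
for row in long_rows:
    print(row)
```

Each prediction ends up on its own row and carries the vote of the submission it came from, which is the shape the joins further down rely on.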

Loading all the data at once.

In [6]:
subs = [read_sub(path) for path in paths]
subs[0].head()

| session_type | aid | vote |
| --- | --- | --- |
| str | str | u8 |
| "12899779_click..." | "59625" | 1 |
| "12899779_click..." | "29735" | 1 |
| "12899779_click..." | "1733943" | 1 |
| "12899779_click..." | "108125" | 1 |
| "12899779_click..." | "1603001" | 1 |


Concatenating and grouping won't work due to memory requirements. Our only option is to use joins, which are very efficient.
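To make the idea concrete, here is a plain-Python sketch of what an outer join on `(session_type, aid)` produces. The function `outer_join_votes` and the three toy dictionaries are illustrative assumptions; the point is only that every pair seen in any submission is kept, and a submission that did not predict the pair contributes a missing value.

```python
def outer_join_votes(*subs):
    """Outer-join dicts that map (session_type, aid) -> vote.

    Returns a dict mapping each key to a tuple with one vote per submission,
    using None where a submission did not predict that key.
    """
    all_keys = set().union(*subs)  # union of the keys of every submission
    return {key: tuple(sub.get(key) for sub in subs) for key in sorted(all_keys)}


# Made-up votes for a single session.
sub_a = {("1_clicks", "10"): 1, ("1_clicks", "20"): 1}
sub_b = {("1_clicks", "20"): 1, ("1_clicks", "30"): 1}
sub_c = {("1_clicks", "20"): 1}

joined = outer_join_votes(sub_a, sub_b, sub_c)
for key, votes in joined.items():
    print(key, votes)
```

Only aid `"20"` was predicted by all three toy submissions, so it is the only key with no missing votes.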

In [7]:
subs = subs[0].join(subs[1], how='outer', on=['session_type', 'aid']).join(subs[2], how='outer', on=['session_type', 'aid'], suffix='_right2')
subs.head()

| session_type | aid | vote | vote_right | vote_right2 |
| --- | --- | --- | --- | --- |
| str | str | u8 | u8 | u8 |
| "14061476_carts..." | "743431" | 1 | 1 | null |
| "14061476_carts..." | "84110" | 1 | 1 | null |
| "14061476_carts..." | "1267119" | null | 1 | null |
| "14061476_carts..." | "536718" | 1 | 1 | null |
| "14061476_carts..." | "1236804" | 1 | 1 | null |


Let us fill in the `nulls`, sum the votes, and order the predictions so that predictions with more votes appear first.
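The same three steps can be sketched in plain Python on a toy joined table. The data and the helper `sum_votes` are made up for illustration; a missing vote counts as 0, and the pairs with the highest totals come first.

```python
def sum_votes(joined):
    """Replace missing votes with 0 and add up the votes for each (session_type, aid)."""
    return {
        key: sum(0 if vote is None else vote for vote in votes)
        for key, votes in joined.items()
    }


# Toy result of an outer join: one vote slot per submission, None where it is missing.
toy_joined = {
    ("1_clicks", "10"): (1, None, None),
    ("1_clicks", "20"): (1, 1, 1),
    ("1_clicks", "30"): (None, 1, 1),
}

vote_sum = sum_votes(toy_joined)

# Sort so that the pairs with the most votes appear first.
ranked = sorted(vote_sum.items(), key=lambda item: item[1], reverse=True)
for key, total in ranked:
    print(key, total)
```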

In [8]:
subs = (subs
    .fill_null(0)
    .with_column((pl.col('vote') + pl.col('vote_right') + pl.col('vote_right2')).alias('vote_sum'))
    .drop(['vote', 'vote_right', 'vote_right2'])
    .sort(by='vote_sum')
    .reverse()
)

subs.head()

| session_type | aid | vote_sum |
| --- | --- | --- |
| str | str | u8 |
| "14349622_order..." | "1011392" | 3 |
| "14516790_order..." | "88856" | 3 |
| "14516790_order..." | "84921" | 3 |
| "13813191_carts..." | "1079588" | 3 |
| "13813191_carts..." | "1043508" | 3 |


All we have to do now is take the first 20 predictions per `session_type` and turn them into a space-separated string.

In [9]:
preds = subs.groupby('session_type').agg([
    pl.col('aid').head(20).alias('labels')
])

preds = preds.with_column(pl.col('labels').apply(lambda lst: ' '.join([str(aid) for aid in lst])))
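For readers who want to see the selection logic spelled out, the sketch below does the same thing in plain Python: it walks rows that are already sorted by descending vote count, keeps the first `k` aids it sees per session, and joins them with spaces. The helper `top_k_labels` and the toy rows are hypothetical, and `k=2` is used only to keep the example small.

```python
from collections import defaultdict


def top_k_labels(ranked_rows, k=20):
    """Keep the first k aids per session_type from rows sorted by descending votes."""
    per_session = defaultdict(list)
    for session_type, aid in ranked_rows:
        if len(per_session[session_type]) < k:
            per_session[session_type].append(aid)
    # Join each session's aids into the space-separated format of the submission file.
    return {
        session_type: " ".join(str(aid) for aid in aids)
        for session_type, aids in per_session.items()
    }


# Made-up (session_type, aid) rows, best-voted first.
toy_ranked_rows = [
    ("1_clicks", 20),
    ("2_clicks", 7),
    ("1_clicks", 30),
    ("1_clicks", 10),
    ("2_clicks", 9),
]

labels = top_k_labels(toy_ranked_rows, k=2)
print(labels)
```

Because the rows arrive sorted globally, taking the first `k` per session is enough; no per-session sort is needed.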

We have created a standard voting ensemble and are now ready to output the submission file!

In [10]:
%%time

preds.write_csv('submission.csv')

CPU times: user 2.07 s, sys: 899 ms, total: 2.97 s
Wall time: 1.77 s


A voting ensemble is often a great way to go. However, sometimes we might want to weight our submissions, for example to give more weight to the submission that performs better.

How would we do it?

We already have all the pieces 🙂

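When reading the submissions, all you have to do is specify the weight associated with each one using the `read_sub` function. One caveat: `read_sub` casts `vote` to `UInt8`, which would truncate a fractional weight such as 0.55 to 0, so switch that cast to a float type such as `pl.Float32` before using fractional weights. With that change in place, we could do something like this:

`subs = [read_sub(path, weight) for path, weight in zip(paths, [1, 0.55, 0.55])]`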

And that's it!
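To see what such weights do to the ranking, here is a small plain-Python sketch for a single session. The function `weighted_vote` and the three toy prediction lists are assumptions made for illustration; only the weights `[1, 0.55, 0.55]` come from the example above.

```python
from collections import defaultdict


def weighted_vote(submissions, weights):
    """Rank aids by the total weight of the submissions that predicted them."""
    totals = defaultdict(float)
    for aids, weight in zip(submissions, weights):
        for aid in aids:
            totals[aid] += weight
    # Highest total first.
    ranking = sorted(totals, key=lambda aid: totals[aid], reverse=True)
    return ranking, dict(totals)


# Made-up predictions for one session from three submissions.
first = [10, 20]
second = [30, 20]
third = [30, 40]

ranking, totals = weighted_vote([first, second, third], [1, 0.55, 0.55])
print(ranking)
print(totals)
```

Note that 0.55 + 0.55 = 1.10 is greater than 1, so with these weights an aid that both lower-weighted submissions agree on (30 here) outranks an aid predicted only by the submission with weight 1 (10 here). Choosing weights below 0.5 for the other two would flip that behavior.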

## Summary

We now have a way to perform a voting ensemble (including one with custom weights) even within the limits of a Kaggle VM! Ensembling will certainly be a major component of strong submissions.

**If you enjoyed this notebook, please upvote! 🙏 Thank you!**

Thank you for reading, happy Kaggling! 🙂