A model ensemble is a good way to improve our final results. We use a weighted voting ensemble, which lets us give more weight to predictions from submissions that scored better.
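To make the idea of weighted voting concrete, here is a minimal sketch in plain Python for a single session. The function name `weighted_vote`, the toy item ids, and the tie-breaking rule (smaller id first) are assumptions made for illustration; they are not taken from the notebook's polars pipeline.

```python
from collections import Counter

def weighted_vote(predictions, weights, k=20):
    """Combine candidate lists by summing, per item, the weight of every list that contains it."""
    scores = Counter()
    for items, weight in zip(predictions, weights):
        for item in items:
            scores[item] += weight
    # Highest score first; ties are broken by item id (an arbitrary choice made here).
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:k]]

# Toy data: three hypothetical submissions for one session.
preds = [[10, 20, 30], [20, 40], [30, 20]]
equal = weighted_vote(preds, weights=[1, 1, 1], k=3)
boosted = weighted_vote(preds, weights=[1, 3, 1], k=3)
print(equal)    # [20, 30, 10]
print(boosted)  # [20, 40, 30]
```

With equal weights this is a plain vote count. Raising the weight of the second list pulls its item `40` into the top three.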

As always, the challenge is the resources we have available. With each submission file at over 5 million rows, each row containing 20 predictions, the problem of available RAM is non-trivial!
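A rough back-of-the-envelope calculation shows why RAM matters. The sketch below assumes exactly 5,000,000 rows per file (the text only says "over 5 million") and counts only the two numeric columns, one item id and one vote per prediction. The session identifier strings add further overhead that is not estimated here.

```python
# Assumed sizes: 5,000,000 rows per submission, 20 predictions per row.
rows = 5_000_000
preds_per_row = 20
exploded_rows = rows * preds_per_row  # one row per (session, item) pair

bytes_aid_32 = exploded_rows * 4  # item id stored as a 32-bit integer
bytes_aid_64 = exploded_rows * 8  # item id stored as a 64-bit integer
bytes_vote_8 = exploded_rows * 1  # vote stored as an 8-bit integer

print(f"{exploded_rows:,} rows after exploding")
print(f"aid as 32-bit: {bytes_aid_32 / 1e6:.0f} MB, as 64-bit: {bytes_aid_64 / 1e6:.0f} MB")
print(f"vote as 8-bit: {bytes_vote_8 / 1e6:.0f} MB")
```

Under these assumptions each submission expands to 100 million rows, so halving the width of the item id column saves about 400 MB per submission.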

The following public notebook is also used once in our ensemble:

* [Test Dataset Is All We Need?](https://www.kaggle.com/code/tomooinubushi/test-dataset-is-all-we-need/notebook) [0.522] by Tomoo Inubushi


# Loading the data

In [1]:
!pip install polars # why polars? it has a much smaller memory footprint than pandas!

In [2]:
import polars as pl

Here are the paths of the submissions we combine.

In [3]:
paths = [
    '../input/otto-submissions-ensemble/submission_itemCF.csv',
    '../input/otto-submissions-ensemble/submission_rerank.csv',
    '../input/otto-submissions-ensemble/submission_item2vec.csv',
    '../input/otto-submissions-ensemble/submission_MF.csv',
]

We can load all the submissions at once, but we have to be very careful about what operations we run on the data as it is very simple to run out of RAM.

In [4]:
def read_sub(path, weight=1): # by default we assign a weight of 1 to the predictions from each submission, which is equivalent to a standard vote ensemble
    '''a helper function for loading and preprocessing submissions'''
    return (
        pl.read_csv(path)
            .with_column(pl.col('labels').str.split(by=' '))
            .with_column(pl.lit(weight).alias('vote'))
            .explode('labels')
            .rename({'labels': 'aid'})
            .with_column(pl.col('aid').cast(pl.UInt32)) # we cast the `aid` column to `UInt32`! memory management is super important to ensure we don't run out of resources
            .with_column(pl.col('vote').cast(pl.UInt8))
    )
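To illustrate what the split and explode steps in `read_sub` do, here is a plain-Python sketch on two made-up rows. The helper `explode_labels` and the sample session ids are hypothetical. The sketch only mirrors the reshaping, not polars' memory behaviour.

```python
def explode_labels(rows, weight=1):
    """Turn (session_type, 'aid aid ...') rows into one (session_type, aid, vote) row per aid."""
    out = []
    for session_type, labels in rows:
        for aid in labels.split(' '):
            out.append((session_type, int(aid), weight))
    return out

# Toy data in the shape of a submission file: one space-separated label string per row.
rows = [("1_clicks", "11 22 33"), ("1_carts", "22 44")]
exploded = explode_labels(rows)
for row in exploded:
    print(row)
```

Each input row becomes as many output rows as it has labels, which is why the exploded frames are so much longer than the submission files.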

In [5]:
# weights for [itemCF, rerank, item2vec, MF]; only the first three submissions are loaded below
weights=[1,1,1,1]
subs = [read_sub(path, weight) for path, weight in zip(paths[0:3], weights[0:3])]
subs[0].head()

session_type,aid,vote
str,u32,u8
"""12899779_click...",59625,1
"""12899779_click...",1253524,1
"""12899779_click...",737445,1
"""12899779_click...",438191,1
"""12899779_click...",731692,1


In [9]:
print(paths[0:3])

['../input/otto-submissions-ensemble/submission_itemCF.csv', '../input/otto-submissions-ensemble/submission_rerank.csv', '../input/otto-submissions-ensemble/submission_item2vec.csv']


In [12]:
subs[1].head()

session_type,aid,vote
str,u32,u8
"""12899779_click...",59625,1
"""12899779_click...",214278,1
"""12899779_click...",66843,1
"""12899779_click...",1289372,1
"""12899779_click...",597108,1


Concatenating and grouping won't work due to memory requirements. Our only option is to use joins, which are very efficient.

In [13]:
# subs = subs[0].join(subs[1], how='outer', on=['session_type', 'aid']).join(subs[2], how='outer', on=['session_type', 'aid'], suffix='_right2').join(subs[3], how='outer', on=['session_type', 'aid'], suffix='_right3')
subs = subs[0].join(subs[1], how='outer', on=['session_type', 'aid']).join(subs[2], how='outer', on=['session_type', 'aid'], suffix='_right2')
subs.head()

session_type,aid,vote,vote_right
str,u32,u8,u8
"""12899779_click...",59625,1,1.0
"""12899779_click...",1253524,1,
"""12899779_click...",737445,1,
"""12899779_click...",438191,1,
"""12899779_click...",731692,1,


Sum the weighted votes for each item that appears in any of the submissions used.

In [14]:
subs = (subs
    .fill_null(0)
    .with_column((pl.col('vote') + pl.col('vote_right')+pl.col('vote_right2')).alias('vote_sum'))
    .drop(['vote', 'vote_right','vote_right2'])
    .sort(by='vote_sum')
    .reverse()
)

subs.head()

session_type,aid,vote_sum
str,u32,u8
"""14571581_carts...",1392029,2
"""14571581_carts...",1124107,2
"""14571581_carts...",1236674,2
"""14571581_carts...",622489,2
"""14571581_carts...",1401429,2
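The join, fill, and sum steps above can be sketched on toy data in plain Python. Here each submission is modelled as a dictionary keyed by `(session_type, aid)`. The function `outer_join_sum` and the sample keys are assumptions for illustration only.

```python
def outer_join_sum(*vote_tables):
    """Each table maps (session_type, aid) -> vote; a key missing from a table counts as 0."""
    keys = set().union(*vote_tables)  # union of keys, like an outer join
    return {key: sum(table.get(key, 0) for table in vote_tables) for key in keys}

# Three hypothetical submissions for the same session, each with a vote of 1 per item.
a = {("s1_clicks", 1): 1, ("s1_clicks", 2): 1}
b = {("s1_clicks", 2): 1, ("s1_clicks", 3): 1}
c = {("s1_clicks", 2): 1, ("s1_clicks", 1): 1}

totals = outer_join_sum(a, b, c)
for key in sorted(totals):
    print(key, totals[key])
```

Treating a missing key as 0 plays the same role as `fill_null(0)` after the outer joins: an item that a submission did not predict contributes no votes.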


In [None]:
preds = subs.groupby('session_type').agg([
    pl.col('aid').head(20).alias('labels')
])

preds = preds.with_column(pl.col('labels').apply(lambda lst: ' '.join([str(aid) for aid in lst])))
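The final selection step can be illustrated the same way: sort by summed votes, keep the first `k` items per session, and join them into a space-separated string. The helper `top_k_labels` and the toy rows are hypothetical. Like the cell above, the sketch assumes that items within a session keep the order produced by the sort.

```python
from collections import defaultdict

def top_k_labels(rows, k=20):
    """rows: iterable of (session_type, aid, vote_sum). Returns session_type -> 'aid aid ...'."""
    # Python's sort is stable, so items with equal votes keep their input order.
    ranked = sorted(rows, key=lambda row: row[2], reverse=True)
    per_session = defaultdict(list)
    for session_type, aid, _ in ranked:
        if len(per_session[session_type]) < k:
            per_session[session_type].append(str(aid))
    return {session: ' '.join(aids) for session, aids in per_session.items()}

# Toy data: summed votes for two hypothetical sessions.
rows = [("s1_clicks", 1, 2), ("s1_clicks", 2, 3), ("s1_clicks", 3, 1), ("s2_carts", 9, 1)]
labels = top_k_labels(rows, k=2)
print(labels)
```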

We have created a standard voting ensemble and are now ready to output the submission file.

In [None]:
%%time

preds.write_csv('submission.csv')