This notebook trains a `Word2Vec` model.

We will use the `gensim` library, which offers very fast training on the CPU.

We will rely on `polars` and its small memory footprint to load and process the data. To speed things up, we use the "otto-full-optimized-memory-footprint" dataset, which is stored in parquet format.

# Data Preprocessing

In [1]:
!pip install polars

import polars as pl
from gensim.models import Word2Vec

train = pl.read_parquet('../input/otto-full-optimized-memory-footprint/train.parquet')
test = pl.read_parquet('../input/otto-full-optimized-memory-footprint/test.parquet')

Collecting polars
  Downloading polars-0.15.11-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.6 MB)
Installing collected packages: polars
Successfully installed polars-0.15.11

Now we transform the data into a format that the `gensim` library can work with: one list of `aid`s per session. `polars` makes this step fast and efficient.

In [2]:
sentences_df = pl.concat([train, test]).groupby('session').agg(
    pl.col('aid').alias('sentence')
)

In [3]:
sentences = sentences_df['sentence'].to_list()
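
To make the target format concrete, here is a minimal pure-Python sketch of the same grouping on a toy event log. The values and the helper name `build_sentences` are made up for illustration; the notebook itself does this step with `polars`.

```python
from collections import defaultdict

# Toy event log: (session, aid) pairs in chronological order (made-up values).
events = [(1, 101), (1, 102), (2, 201), (1, 101), (2, 202)]


def build_sentences(events):
    """Collect the aids of each session, preserving the order of events."""
    by_session = defaultdict(list)
    for session, aid in events:
        by_session[session].append(aid)
    return list(by_session.values())


toy_sentences = build_sentences(events)
print(toy_sentences)  # [[101, 102, 101], [201, 202]]
```

Each inner list plays the role of a "sentence" and each `aid` plays the role of a "word".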

# Training a word2vec model

In [4]:
%%time

w2vec = Word2Vec(sentences=sentences, vector_size=32, min_count=1, workers=4)

CPU times: user 1h 12min 6s, sys: 17.9 s, total: 1h 12min 24s
Wall time: 25min 1s
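
Why should this produce useful item embeddings? Word2Vec learns from which tokens appear near each other inside a sentence, so `aids` that tend to occur close together in sessions end up with similar vectors. The sketch below only illustrates the idea of a context window. The helper `context_windows` is hypothetical, the window is fixed and symmetric, and the other details of how `gensim` actually trains are ignored.

```python
def context_windows(sentence, window=2):
    """Yield (target, context) pairs using a fixed symmetric window.

    Simplified illustration only, not gensim's training loop.
    """
    for i, target in enumerate(sentence):
        left = sentence[max(0, i - window):i]
        right = sentence[i + 1:i + 1 + window]
        yield target, left + right


pairs = list(context_windows([101, 102, 103, 104], window=1))
for target, context in pairs:
    print(target, context)
```

The call above does not set `window`, so the model was trained with the library default.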


With the model fully trained, let us use the similarity between the learned representations of our `aids` to create a submission.

Nearest-neighbor search in the embedding space is built into `gensim`, but it is unfortunately very slow. Let's use `annoy` instead, which is much faster because it performs approximate nearest neighbor search.

In [5]:
%%time

from annoy import AnnoyIndex

aid2idx = {aid: i for i, aid in enumerate(w2vec.wv.index_to_key)}
index = AnnoyIndex(32, 'euclidean')

for aid, idx in aid2idx.items():
    index.add_item(idx, w2vec.wv.vectors[idx])
    
index.build(10)

CPU times: user 44.5 s, sys: 534 ms, total: 45 s
Wall time: 18 s


True
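
For intuition, this is roughly what the index approximates: an exact, brute-force Euclidean search. The 2-d vectors are made up, and `exact_nns_by_item` is a hypothetical stand-in whose name only mirrors `get_nns_by_item`. Since `annoy` is approximate, its results may differ from an exact search.

```python
import math

# Made-up 2-d embeddings keyed by index (the real vectors have 32 dimensions).
vectors = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (0.0, 3.0), 3: (0.5, 0.5)}


def exact_nns_by_item(item, n):
    """Return the n indices closest to `item` by Euclidean distance."""
    query = vectors[item]
    ranked = sorted(vectors, key=lambda i: math.dist(query, vectors[i]))
    return ranked[:n]


neighbours = exact_nns_by_item(0, 3)
print(neighbours)  # [0, 3, 1]
```

Note that the query item comes back first, at distance zero. That is why the submission code further down asks for one extra neighbor and drops the first result.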

# Outputting a submission

In [6]:
import pandas as pd
import numpy as np

from collections import defaultdict

sample_sub = pd.read_csv('../input/otto-recommender-system/sample_submission.csv')

session_types = ['clicks', 'carts', 'orders']
test_session_AIDs = test.to_pandas().reset_index(drop=True).groupby('session')['aid'].apply(list)
test_session_types = test.to_pandas().reset_index(drop=True).groupby('session')['type'].apply(list)

labels = []

# we use the same item-type weights that were found to work best in Tuning Candidate ReRank Model
# (carts are of greatest importance)
type_weight_multipliers = {0: 0.5, 1: 9, 2: 0.5}
for AIDs, types in zip(test_session_AIDs, test_session_types):
    if len(AIDs) >= 20:
        # if we have enough aids (20 or more) we don't need to look for candidates! we just use the old logic
        weights = np.logspace(0.1, 1, len(AIDs), base=2, endpoint=True) - 1
        aids_temp = defaultdict(lambda: 0)
        for aid, w, t in zip(AIDs, weights, types):
            aids_temp[aid] += w * type_weight_multipliers[t]

        sorted_aids = [k for k, v in sorted(aids_temp.items(), key=lambda item: -item[1])]
        labels.append(sorted_aids[:20])
    else:
        # here we don't have 20 aids to output -- we will use word2vec embeddings to generate candidates!
        # deduplicate the session's aids, most recent first
        AIDs = list(dict.fromkeys(AIDs[::-1]))
        
        # grab the most recent aid
        most_recent_aid = AIDs[0]
        
        # look for its nearest neighbors (excluding the aid itself)
        nns = [w2vec.wv.index_to_key[i] for i in index.get_nns_by_item(aid2idx[most_recent_aid], 21)[1:]]
                        
        labels.append((AIDs+nns)[:20])
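
The two branches are easier to follow on a toy session. The sketch below re-implements the weighting without `numpy` so that it runs on its own. The helpers `recency_weights` and `rank_aids` and the session values are made up for illustration, while the multipliers are the ones used in the cell above.

```python
from collections import defaultdict

TYPE_WEIGHTS = {0: 0.5, 1: 9, 2: 0.5}  # same multipliers as in the cell above


def recency_weights(n):
    """Pure-Python stand-in for np.logspace(0.1, 1, n, base=2) - 1."""
    if n == 1:
        return [2 ** 0.1 - 1]
    step = (1 - 0.1) / (n - 1)
    return [2 ** (0.1 + step * i) - 1 for i in range(n)]


def rank_aids(aids, types):
    """Score each aid by recency and event type, highest score first."""
    scores = defaultdict(float)
    for aid, w, t in zip(aids, recency_weights(len(aids)), types):
        scores[aid] += w * TYPE_WEIGHTS[t]
    return sorted(scores, key=lambda aid: -scores[aid])


# Toy session, oldest event first: aid 22 is the only event with type 1.
ranked = rank_aids([11, 22, 33, 11], [0, 1, 0, 0])
print(ranked)  # [22, 11, 33]

# The short-session branch instead deduplicates, most recent aid first.
recent_first = list(dict.fromkeys([1, 2, 1, 3][::-1]))
print(recent_first)  # [3, 1, 2]
```

Later events get larger weights, so a repeated or recent `aid` rises in the ranking, and the large type multiplier can push an older `aid` to the top.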

Now pull it all together and write it to a file.

In [7]:
labels_as_strings = [' '.join([str(l) for l in lls]) for lls in labels]

predictions = pd.DataFrame(data={'session_type': test_session_AIDs.index, 'labels': labels_as_strings})

prediction_dfs = []

for st in session_types:
    modified_predictions = predictions.copy()
    modified_predictions.session_type = modified_predictions.session_type.astype('str') + f'_{st}'
    prediction_dfs.append(modified_predictions)

submission = pd.concat(prediction_dfs).reset_index(drop=True)
submission.to_csv('submission.csv', index=False)
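
As a quick sanity check of the output layout, here is the same expansion on toy data without `pandas`. The session ids and labels are made up. Every session yields three rows, one per event type, and all three share the same labels.

```python
session_types = ['clicks', 'carts', 'orders']

# Made-up predictions: session id -> recommended aids.
toy_labels = {12899779: [59625, 1253524], 12899780: [1142000]}

rows = [
    (f'{session}_{st}', ' '.join(str(aid) for aid in aids))
    for st in session_types
    for session, aids in toy_labels.items()
]

for session_type, labels_str in rows:
    print(f'{session_type},{labels_str}')
```

The rows come out grouped by type (all `clicks`, then `carts`, then `orders`), in the same order as the concatenated frames above.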