A co-visitation matrix is essentially an "analog" approximation of matrix factorization. Matrix factorization, however, has several advantages over co-visitation matrices. First of all, it makes better use of the data, because it operates on a notion of similarity between categories. This mirrors the jump from unigram/bigram/trigram models to word2vec in NLP.
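
To make the contrast concrete, here is a minimal sketch of a co-visitation count on toy data. The sessions and the `covisitation_counts` helper are made up for illustration and are not part of this notebook's pipeline. The point is that counts exist only for pairs that were actually observed next to each other, whereas a factorization model can score any pair of aids.

```python
from collections import Counter, defaultdict

def covisitation_counts(sessions):
    """Count how often aid b directly follows aid a within a session."""
    counts = defaultdict(Counter)
    for aids in sessions:
        for a, b in zip(aids, aids[1:]):
            counts[a][b] += 1
    return counts

# Hypothetical toy sessions, each a chronological list of aids.
toy_sessions = [[1, 2, 3], [1, 2, 4], [5, 2, 3]]
counts = covisitation_counts(toy_sessions)

print(counts[2].most_common())  # aids that directly followed aid 2
print(counts[1][3])             # 0: aids 1 and 3 were never adjacent
```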

Now let's train a matrix factorization model and use it in place of the co-visitation matrices. To streamline the work, we will use data in `parquet` format.

# Data Preprocessing

In [None]:
!pip install polars

import polars as pl

train = pl.read_parquet('../input/otto-full-optimized-memory-footprint/train.parquet')
test = pl.read_parquet('../input/otto-full-optimized-memory-footprint/test.parquet')

We need to create `aid-aid` pairs to train our matrix factorization model.

Let's grab the pairs from both the train and test sets.
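
As a rough illustration of what "pairs" means here, the following pure-Python sketch pairs each aid with the one that follows it in the same session. The `consecutive_pairs` helper and the toy session are hypothetical. The Polars cell in this notebook does the same thing per session with `shift(-1)` and `drop_nulls()`.

```python
def consecutive_pairs(aids):
    """Pair each aid with the aid that follows it in the same session."""
    # zip stops at the shorter input, so the last aid (which has no
    # successor) produces no pair, like shift(-1) followed by drop_nulls().
    return list(zip(aids, aids[1:]))

toy_session = [10, 11, 11, 12]  # hypothetical chronological aids
pairs = consecutive_pairs(toy_session)
print(pairs)  # [(10, 11), (11, 11), (11, 12)]
```

A session with a single event yields no pairs at all.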

In [None]:
%%time

train_pairs = (pl.concat([train, test])
    .groupby('session').agg([
        pl.col('aid'),
        pl.col('aid').shift(-1).alias('aid_next')
    ])
    .explode(['aid', 'aid_next'])
    .drop_nulls()
)[['aid', 'aid_next']]

In [None]:
train_pairs.shape[0] / 1_000_000

That is 209 million pairs created in 40 seconds without running out of RAM.

In [None]:
train_pairs.head()

Let's check the cardinality of our aids -- we will need it to size the embedding layer.

In [None]:
cardinality_aids = max(train_pairs['aid'].max(), train_pairs['aid_next'].max())
cardinality_aids

The largest `aid` is `1855602`, so the embedding layer needs `cardinality_aids + 1` rows -- our matrix factorization model will be able to handle this.
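
The sketch below shows, on a toy scale, why the table needs one more row than the largest id and how a factorization model scores a pair of aids with a dot product. The `make_embedding_table` and `score` helpers and the toy sizes are assumptions for illustration only. The model in this notebook uses `nn.Embedding` instead.

```python
import random

def make_embedding_table(max_aid, n_factors, seed=0):
    """One row per possible aid; ids run from 0 to max_aid inclusive."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_factors)]
            for _ in range(max_aid + 1)]

def score(table, aid1, aid2):
    """Dot product of the two aids' factor vectors."""
    return sum(x * y for x, y in zip(table[aid1], table[aid2]))

toy_max_aid = 9  # hypothetical; the notebook's value is far larger
table = make_embedding_table(toy_max_aid, n_factors=4)
print(len(table))                     # 10 rows for ids 0..9
print(score(table, 3, toy_max_aid))   # indexing the largest id is valid
```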

Let's construct a `PyTorch` dataset and `dataloader`.

In [None]:
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

class ClicksDataset(Dataset):
    def __init__(self, pairs):
        self.aid1 = pairs['aid'].to_numpy()
        self.aid2 = pairs['aid_next'].to_numpy()
    def __getitem__(self, idx):
        aid1 = self.aid1[idx]
        aid2 = self.aid2[idx]
        return [aid1, aid2]
    def __len__(self):
        return len(self.aid1)

train_ds = ClicksDataset(train_pairs[:-10_000_000])
valid_ds = ClicksDataset(train_pairs[-10_000_000:])  # hold out the last 10M pairs for validation

However, the PyTorch dataloader takes a long time to load the data. The reason is that indexing into the arrays one item at a time and collating the results into batches is computationally expensive.
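
The two access patterns can be sketched in plain Python as follows. This only illustrates the difference between per-item fetching with collation and taking a batch as one slice per column. It is not how PyTorch or Merlin are implemented internally, and both generator names are hypothetical.

```python
def batches_per_item(aid1, aid2, batch_size):
    """Mimic a map-style dataset: fetch items one by one, then collate."""
    for start in range(0, len(aid1), batch_size):
        stop = min(start + batch_size, len(aid1))
        items = [(aid1[i], aid2[i]) for i in range(start, stop)]
        # Collation: transpose a list of pairs into two columns.
        col1, col2 = zip(*items)
        yield list(col1), list(col2)

def batches_by_slice(aid1, aid2, batch_size):
    """Take each batch as one contiguous slice per column."""
    for start in range(0, len(aid1), batch_size):
        yield aid1[start:start + batch_size], aid2[start:start + batch_size]

toy_aid1 = list(range(10))
toy_aid2 = list(range(100, 110))
slow = list(batches_per_item(toy_aid1, toy_aid2, batch_size=4))
fast = list(batches_by_slice(toy_aid1, toy_aid2, batch_size=4))
print(slow == fast)  # both produce the same batches
```

Both generators yield identical batches; the second one simply does far fewer Python-level operations per batch.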

Thanks to other Kagglers' work, we will use the brand new [Merlin Dataloader](https://github.com/NVIDIA-Merlin/dataloader). But, alas, Kaggle gives only 13 GB of RAM on a kernel with a GPU, which is not enough to process our dataset. So let's see how far we can go with CPU only.

In [None]:
!pip install merlin-dataloader==0.0.2

In [None]:
from merlin.loader.torch import Loader 

The Merlin Dataloader can read data directly from disk, so let's first write our datasets to disk.

In [None]:
train_pairs[:-10_000_000].to_pandas().to_parquet('train_pairs.parquet')
train_pairs[-10_000_000:].to_pandas().to_parquet('valid_pairs.parquet')

In [None]:
from merlin.loader.torch import Loader 
from merlin.io import Dataset

train_ds = Dataset('train_pairs.parquet')
train_dl_merlin = Loader(train_ds, 65536, True)

In [None]:
class MatrixFactorization(nn.Module):
    def __init__(self, n_aids, n_factors):
        super().__init__()
        self.aid_factors = nn.Embedding(n_aids, n_factors, sparse=True)
        
    def forward(self, aid1, aid2):
        aid1 = self.aid_factors(aid1)
        aid2 = self.aid_factors(aid2)
        
        return (aid1 * aid2).sum(dim=1)
    
class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self, name, fmt=':f'):
        self.name = name
        self.fmt = fmt
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

    def __str__(self):
        fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
        return fmtstr.format(**self.__dict__)

valid_ds = Dataset('valid_pairs.parquet')
valid_dl_merlin = Loader(valid_ds, 65536, True)

In [None]:
from torch.optim import SparseAdam

num_epochs=3
lr=0.08

model = MatrixFactorization(cardinality_aids+1, 32)
optimizer = SparseAdam(model.parameters(), lr=lr)
criterion = nn.BCEWithLogitsLoss()

In [None]:
%%time

for epoch in range(num_epochs):
    model.train()
    # Create the meter once per epoch so the reported loss averages over all batches
    losses = AverageMeter('Loss', ':.4e')
    for batch, _ in train_dl_merlin:
            
        aid1, aid2 = batch['aid'], batch['aid_next']
        output_pos = model(aid1, aid2)
        output_neg = model(aid1, aid2[torch.randperm(aid2.shape[0])])
        
        output = torch.cat([output_pos, output_neg])
        targets = torch.cat([torch.ones_like(output_pos), torch.zeros_like(output_pos)])
        loss = criterion(output, targets)
        losses.update(loss.item())
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
    model.eval()
    
    with torch.no_grad():
        accuracy = AverageMeter('accuracy')
        for batch, _ in valid_dl_merlin:
            aid1, aid2 = batch['aid'], batch['aid_next']
            output_pos = model(aid1, aid2)
            output_neg = model(aid1, aid2[torch.randperm(aid2.shape[0])])
            accuracy_batch = torch.cat([output_pos.sigmoid() > 0.5, output_neg.sigmoid() < 0.5]).float().mean()
            accuracy.update(accuracy_batch.item(), aid1.shape[0])  # store a plain float, not a tensor
            
    print(f'{epoch+1:02d}: * TrainLoss {losses.avg:.3f}  * Accuracy {accuracy.avg:.3f}')
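
The training loop above builds negative examples by shuffling the `aid_next` column within each batch and then applies a binary cross-entropy loss to the raw scores. The following pure-Python sketch illustrates both ideas on made-up numbers. The helper names and the scores are hypothetical, and the loss is written out by hand rather than taken from PyTorch.

```python
import math
import random

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed from raw scores (logits)."""
    total = 0.0
    for z, t in zip(logits, targets):
        # Numerically stable form: max(z, 0) - z*t + log(1 + exp(-|z|)).
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

def in_batch_negatives(aid2, rng):
    """Shuffle the 'next aid' column so most pairs become mismatched."""
    shuffled = list(aid2)
    rng.shuffle(shuffled)
    return shuffled

rng = random.Random(0)
toy_aid2 = [11, 12, 13, 14]
neg_aid2 = in_batch_negatives(toy_aid2, rng)

# Hypothetical scores: high for true pairs, low for shuffled pairs.
pos_logits = [2.0, 1.5, 3.0, 2.5]
neg_logits = [-2.0, -1.0, -3.0, -0.5]
loss = bce_with_logits(pos_logits + neg_logits, [1.0] * 4 + [0.0] * 4)
print(neg_aid2, round(loss, 4))
```

Note that a random shuffle can leave some elements in place, so a small share of the "negatives" may in fact be true pairs. With large batches this noise is usually minor.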

Let's grab the embedding weights -- this matrix holds the vector representation of every `aid`!

In [None]:
embeddings = model.aid_factors.weight.detach().numpy()

And build the index for approximate nearest neighbor search.

In [None]:
%%time

from annoy import AnnoyIndex

index = AnnoyIndex(32, 'euclidean')
for i, v in enumerate(embeddings):
    index.add_item(i, v)
    
index.build(10)

Now for any `aid`, we can find its nearest neighbors!
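
For intuition, this is what an exact (brute-force) version of the same query looks like on a handful of made-up 2-D vectors. The `nearest_neighbors` helper is hypothetical. An approximate index such as Annoy trades a little accuracy for a much faster search over millions of vectors.

```python
import math

def nearest_neighbors(vectors, item, k):
    """Exact k nearest neighbors of `item` by Euclidean distance."""
    query = vectors[item]
    by_distance = sorted(range(len(vectors)),
                         key=lambda i: math.dist(query, vectors[i]))
    return by_distance[:k]

toy_vectors = [
    [0.0, 0.0],   # item 0
    [0.1, 0.0],   # item 1, close to item 0
    [5.0, 5.0],   # item 2, far away
    [0.0, 0.3],   # item 3
]
print(nearest_neighbors(toy_vectors, 0, 3))  # [0, 1, 3]
```

The query item itself comes back first because its distance to itself is zero, which is why the submission code later asks for 21 neighbors and drops the first one.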

In [None]:
index.get_nns_by_item(123, 10)

Create a submission now!
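
The submission cell that follows handles long sessions by weighting each aid by recency and event type, and short sessions by deduplicating the session's aids (most recent first) before appending neighbors. Here is a small stdlib-only sketch of those two steps on a toy session. The helper names and toy data are hypothetical, `recency_weights` is a hand-written stand-in for `np.logspace(0.1, 1, n, base=2) - 1`, and the type encoding (0 = clicks, 1 = carts, 2 = orders) is an assumption about the dataset.

```python
from collections import defaultdict

def recency_weights(n):
    """Pure-Python stand-in for np.logspace(0.1, 1, n, base=2) - 1 (n >= 2)."""
    step = (1.0 - 0.1) / (n - 1)
    return [2 ** (0.1 + i * step) - 1 for i in range(n)]

def rank_session_aids(aids, types, type_weights, top_k=20):
    """Sum recency * type weights per aid; return aids by descending score."""
    scores = defaultdict(float)
    for aid, w, t in zip(aids, recency_weights(len(aids)), types):
        scores[aid] += w * type_weights[t]
    return sorted(scores, key=lambda aid: -scores[aid])[:top_k]

type_weight_multipliers = {0: 0.5, 1: 9, 2: 0.5}  # same values as in the notebook
toy_aids = [7, 8, 7, 9]    # chronological order
toy_types = [0, 1, 0, 0]   # assumed encoding: 0 = clicks, 1 = carts, 2 = orders
ranked = rank_session_aids(toy_aids, toy_types, type_weight_multipliers)
print(ranked)  # [8, 9, 7]

# Short-session branch: unique aids, most recent first.
recent_first = list(dict.fromkeys(toy_aids[::-1]))
print(recent_first)  # [9, 7, 8]
```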

In [None]:
import pandas as pd
import numpy as np

from collections import defaultdict

sample_sub = pd.read_csv('../input/otto-recommender-system/sample_submission.csv')

session_types = ['clicks', 'carts', 'orders']
test_session_AIDs = test.to_pandas().reset_index(drop=True).groupby('session')['aid'].apply(list)
test_session_types = test.to_pandas().reset_index(drop=True).groupby('session')['type'].apply(list)

labels = []

type_weight_multipliers = {0: 0.5, 1: 9, 2: 0.5}
for AIDs, types in zip(test_session_AIDs, test_session_types):
    if len(AIDs) >= 20:
        # if the session already has 20 or more aids we don't need to look for candidates -- we just use the old logic
        weights=np.logspace(0.1,1,len(AIDs),base=2, endpoint=True)-1
        aids_temp=defaultdict(lambda: 0)
        for aid,w,t in zip(AIDs,weights,types): 
            aids_temp[aid]+= w * type_weight_multipliers[t]
            
        sorted_aids=[k for k, v in sorted(aids_temp.items(), key=lambda item: -item[1])]
        labels.append(sorted_aids[:20])
    else:
        # here we don't have 20 aids to output -- we will use approximate nearest neighbor search and our embeddings
        # to generate candidates!
        AIDs = list(dict.fromkeys(AIDs[::-1]))
        
        # let's grab the most recent aid
        most_recent_aid = AIDs[0]
        
        # and look for some neighbors of the most recent aid
        # just like what we have done in item2vec
        nns = index.get_nns_by_item(most_recent_aid, 21)[1:]
                        
        labels.append((AIDs+nns)[:20])

Let's pull it all together and write the submission to a file.
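
The layout being written is one row per session and event type, with the predicted aids joined by spaces. A minimal stdlib sketch of that expansion, using a hypothetical `expand_predictions` helper and made-up sessions, could look like this:

```python
def expand_predictions(session_ids, labels, session_types):
    """Build one (session_type, labels) row per session and event type."""
    rows = []
    for st in session_types:
        for session_id, aids in zip(session_ids, labels):
            rows.append((f"{session_id}_{st}", " ".join(str(a) for a in aids)))
    return rows

toy_sessions = [100, 101]          # hypothetical session ids
toy_labels = [[1, 2, 3], [4, 5]]   # hypothetical predicted aids
rows = expand_predictions(toy_sessions, toy_labels,
                          ["clicks", "carts", "orders"])
for row in rows:
    print(row)
```

Here the same predictions are reused for all three event types, as in the pandas cell below.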

In [None]:
labels_as_strings = [' '.join([str(l) for l in lls]) for lls in labels]

predictions = pd.DataFrame(data={'session_type': test_session_AIDs.index, 'labels': labels_as_strings})

prediction_dfs = []

for st in session_types:
    modified_predictions = predictions.copy()
    modified_predictions.session_type = modified_predictions.session_type.astype('str') + f'_{st}'
    prediction_dfs.append(modified_predictions)

submission = pd.concat(prediction_dfs).reset_index(drop=True)
submission.to_csv('submission.csv', index=False)