## Implicit ALS base model

[Implicit](https://github.com/benfred/implicit/) is a library for recommender models.

In this notebook we use ALS (Alternating Least Squares), but the library supports a lot of other models with not many changes.

ALS is one of the most used ML models for recommender systems. It's a matrix factorization method based on SVD (it's actually an approximated, numerical version of SVD). Basically, ALS factorizes the interaction matrix (user x items) into two smaller matrices, one for item embeddings and one for user embeddings. These new matrices are built in a manner such that the multiplication of a user and an item gives (approximately) it's interaction score. This build embeddings for items and for users that live in the same vector space, allowing the implementation of recommendations as simple cosine distances between users and items. This is, the 12 items we recommend for a given user are the 12 items with their embedding vectors closer to the user embedding vector.

There are a lot of online resources explaining it. For example, [here](https://towardsdatascience.com/prototyping-a-recommender-system-step-by-step-part-2-alternating-least-square-als-matrix-4a76c58714a1).

In [1]:
!pip install --upgrade implicit

Collecting implicit
  Downloading implicit-0.7.0-cp37-cp37m-manylinux2014_x86_64.whl (9.2 MB)
     |████████████████████████████████| 9.2 MB 8.3 MB/s            
Installing collected packages: implicit
  Attempting uninstall: implicit
    Found existing installation: implicit 0.4.4
    Uninstalling implicit-0.4.4:
      Successfully uninstalled implicit-0.4.4
Successfully installed implicit-0.7.0


In [2]:
import os; os.environ['OPENBLAS_NUM_THREADS']='1'
import numpy as np
import pandas as pd
import implicit
from scipy.sparse import coo_matrix
from implicit.evaluation import mean_average_precision_at_k

In [3]:
%%time

base_path = '/home/sukanya/h-and-m-personalized-fashion-recommendations/'
csv_train = f'{base_path}transactions_train.csv'
csv_sub = f'{base_path}sample_submission.csv'
csv_users = f'{base_path}customers.csv'
csv_items = f'{base_path}articles.csv'

df = pd.read_csv(csv_train, dtype={'article_id': str}, parse_dates=['t_dat'])
df_sub = pd.read_csv(csv_sub)
dfu = pd.read_csv(csv_users)
dfi = pd.read_csv(csv_items, dtype={'article_id': str})

CPU times: user 52.4 s, sys: 4.96 s, total: 57.3 s
Wall time: 1min 30s


In [4]:
# Trying with less data
df = df[df['t_dat'] > '2020-08-21']
df.shape

(1190911, 5)

In [5]:
# For validation this means 3 weeks of training and 1 week for validation
# For submission, it means 4 weeks of training
df['t_dat'].max()

Timestamp('2020-09-22 00:00:00')

## Assign autoincrementing ids starting from 0 to both users and items

In [6]:
ALL_USERS = dfu['customer_id'].unique().tolist()
ALL_ITEMS = dfi['article_id'].unique().tolist()

user_ids = dict(list(enumerate(ALL_USERS)))
item_ids = dict(list(enumerate(ALL_ITEMS)))

user_map = {u: uidx for uidx, u in user_ids.items()}
item_map = {i: iidx for iidx, i in item_ids.items()}

df['user_id'] = df['customer_id'].map(user_map)
df['item_id'] = df['article_id'].map(item_map)

del dfu, dfi

## Create coo_matrix (user x item) and csr matrix (user x item)

It is common to use scipy sparse matrices in recommender systems, because the main core of the problem is typically modeled as a matrix with users and items, with the values representing whether the user purchased (or liked) an items. Since each user purchases only a small fraction of the catalog of products, this matrix is full of zero (aka: it's sparse).

In a very recent release they did an API breaking change, so be aware of that: https://github.com/benfred/implicit/releases
In this notebook we are using the latest version, so everything is aligned with (user x item)

**We are using (user x item) matrices, both for training and for evaluating/recommender.**

In the previous versions the training procedure required a COO item x user

For evaluation and prediction, on the other hand, CSR matrices with users x items format should be provided.


### About COO matrices
COO matrices are a kind of sparse matrix.
They store their values as tuples of `(row, column, value)` (the coordinates)

You can read more about them here: 
* https://en.wikipedia.org/wiki/Sparse_matrix#Coordinate_list_(COO)
* https://scipy-lectures.org/advanced/scipy_sparse/coo_matrix.html

From https://het.as.utexas.edu/HET/Software/Scipy/generated/scipy.sparse.coo_matrix.html

```python
>>> row  = np.array([0,3,1,0]) # user_ids
>>> col  = np.array([0,3,1,2]) # item_ids
>>> data = np.array([4,5,7,9]) # a bunch of ones of lenght unique(user) x unique(items)
>>> coo_matrix((data,(row,col)), shape=(4,4)).todense()
matrix([[4, 0, 9, 0],
        [0, 7, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 5]])
```

## About CSR matrices
* https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_(CSR,_CRS_or_Yale_format)


In [7]:
row = df['user_id'].values
col = df['item_id'].values
data = np.ones(df.shape[0])
coo_train = coo_matrix((data, (row, col)), shape=(len(ALL_USERS), len(ALL_ITEMS)))
coo_train

<1371980x105542 sparse matrix of type '<class 'numpy.float64'>'
	with 1190911 stored elements in COOrdinate format>

# Model check

In [8]:
%%time
model = implicit.als.AlternatingLeastSquares(factors=10, iterations=2)
model.fit(coo_train)



  0%|          | 0/2 [00:00<?, ?it/s]

CPU times: user 393 ms, sys: 166 ms, total: 560 ms
Wall time: 1.2 s


# Validation

In [9]:
def to_user_item_coo(df):
    """ Turn a dataframe with transactions into a COO sparse items x users matrix"""
    row = df['user_id'].values
    col = df['item_id'].values
    data = np.ones(df.shape[0])
    coo = coo_matrix((data, (row, col)), shape=(len(ALL_USERS), len(ALL_ITEMS)))
    return coo


def split_data(df, validation_days=7):
    """ Split a pandas dataframe into training and validation data, using <<validation_days>>
    """
    validation_cut = df['t_dat'].max() - pd.Timedelta(validation_days)

    df_train = df[df['t_dat'] < validation_cut]
    df_val = df[df['t_dat'] >= validation_cut]
    return df_train, df_val

def get_val_matrices(df, validation_days=7):
    """ Split into training and validation and create various matrices
        
        Returns a dictionary with the following keys:
            coo_train: training data in COO sparse format and as (users x items)
            csr_train: training data in CSR sparse format and as (users x items)
            csr_val:  validation data in CSR sparse format and as (users x items)
    
    """
    df_train, df_val = split_data(df, validation_days=validation_days)
    coo_train = to_user_item_coo(df_train)
    coo_val = to_user_item_coo(df_val)

    csr_train = coo_train.tocsr()
    csr_val = coo_val.tocsr()
    
    return {'coo_train': coo_train,
            'csr_train': csr_train,
            'csr_val': csr_val
          }


def validate(matrices, factors=200, iterations=20, regularization=0.01, show_progress=True):
    """ Train an ALS model with <<factors>> (embeddings dimension) 
    for <<iterations>> over matrices and validate with MAP@12
    """
    coo_train, csr_train, csr_val = matrices['coo_train'], matrices['csr_train'], matrices['csr_val']
    
    model = implicit.als.AlternatingLeastSquares(factors=factors, 
                                                 iterations=iterations, 
                                                 regularization=regularization, 
                                                 random_state=42)
    model.fit(coo_train, show_progress=show_progress)
    
    # The MAPK by implicit doesn't allow to calculate allowing repeated items, which is the case.
    # TODO: change MAP@12 to a library that allows repeated items in prediction
    map12 = mean_average_precision_at_k(model, csr_train, csr_val, K=12, show_progress=show_progress, num_threads=4)
    print(f"Factors: {factors:>3} - Iterations: {iterations:>2} - Regularization: {regularization:4.3f} ==> MAP@12: {map12:6.5f}")
    return map12

In [10]:
matrices = get_val_matrices(df)

In [11]:
%%time
best_map12 = 0
for factors in [40, 50, 60, 100, 200, 500, 1000]:
    for iterations in [3, 12, 14, 15, 20]:
        for regularization in [0.01]:
            map12 = validate(matrices, factors, iterations, regularization, show_progress=False)
            if map12 > best_map12:
                best_map12 = map12
                best_params = {'factors': factors, 'iterations': iterations, 'regularization': regularization}
                print(f"Best MAP@12 found. Updating: {best_params}")



Factors:  40 - Iterations:  3 - Regularization: 0.010 ==> MAP@12: 0.00473
Best MAP@12 found. Updating: {'factors': 40, 'iterations': 3, 'regularization': 0.01}




Factors:  40 - Iterations: 12 - Regularization: 0.010 ==> MAP@12: 0.00545
Best MAP@12 found. Updating: {'factors': 40, 'iterations': 12, 'regularization': 0.01}




Factors:  40 - Iterations: 14 - Regularization: 0.010 ==> MAP@12: 0.00529




Factors:  40 - Iterations: 15 - Regularization: 0.010 ==> MAP@12: 0.00530




Factors:  40 - Iterations: 20 - Regularization: 0.010 ==> MAP@12: 0.00521




Factors:  50 - Iterations:  3 - Regularization: 0.010 ==> MAP@12: 0.00444




Factors:  50 - Iterations: 12 - Regularization: 0.010 ==> MAP@12: 0.00532




Factors:  50 - Iterations: 14 - Regularization: 0.010 ==> MAP@12: 0.00536




Factors:  50 - Iterations: 15 - Regularization: 0.010 ==> MAP@12: 0.00541




Factors:  50 - Iterations: 20 - Regularization: 0.010 ==> MAP@12: 0.00536




Factors:  60 - Iterations:  3 - Regularization: 0.010 ==> MAP@12: 0.00479




Factors:  60 - Iterations: 12 - Regularization: 0.010 ==> MAP@12: 0.00551
Best MAP@12 found. Updating: {'factors': 60, 'iterations': 12, 'regularization': 0.01}




Factors:  60 - Iterations: 14 - Regularization: 0.010 ==> MAP@12: 0.00559
Best MAP@12 found. Updating: {'factors': 60, 'iterations': 14, 'regularization': 0.01}




Factors:  60 - Iterations: 15 - Regularization: 0.010 ==> MAP@12: 0.00558




Factors:  60 - Iterations: 20 - Regularization: 0.010 ==> MAP@12: 0.00559




Factors: 100 - Iterations:  3 - Regularization: 0.010 ==> MAP@12: 0.00521




Factors: 100 - Iterations: 12 - Regularization: 0.010 ==> MAP@12: 0.00640
Best MAP@12 found. Updating: {'factors': 100, 'iterations': 12, 'regularization': 0.01}




Factors: 100 - Iterations: 14 - Regularization: 0.010 ==> MAP@12: 0.00639




Factors: 100 - Iterations: 15 - Regularization: 0.010 ==> MAP@12: 0.00636




Factors: 100 - Iterations: 20 - Regularization: 0.010 ==> MAP@12: 0.00629




Factors: 200 - Iterations:  3 - Regularization: 0.010 ==> MAP@12: 0.00641
Best MAP@12 found. Updating: {'factors': 200, 'iterations': 3, 'regularization': 0.01}




Factors: 200 - Iterations: 12 - Regularization: 0.010 ==> MAP@12: 0.00653
Best MAP@12 found. Updating: {'factors': 200, 'iterations': 12, 'regularization': 0.01}




Factors: 200 - Iterations: 14 - Regularization: 0.010 ==> MAP@12: 0.00651




Factors: 200 - Iterations: 15 - Regularization: 0.010 ==> MAP@12: 0.00646




Factors: 200 - Iterations: 20 - Regularization: 0.010 ==> MAP@12: 0.00642




Factors: 500 - Iterations:  3 - Regularization: 0.010 ==> MAP@12: 0.00679
Best MAP@12 found. Updating: {'factors': 500, 'iterations': 3, 'regularization': 0.01}




Factors: 500 - Iterations: 12 - Regularization: 0.010 ==> MAP@12: 0.00584




Factors: 500 - Iterations: 14 - Regularization: 0.010 ==> MAP@12: 0.00580




Factors: 500 - Iterations: 15 - Regularization: 0.010 ==> MAP@12: 0.00573




Factors: 500 - Iterations: 20 - Regularization: 0.010 ==> MAP@12: 0.00574




Factors: 1000 - Iterations:  3 - Regularization: 0.010 ==> MAP@12: 0.00645




Factors: 1000 - Iterations: 12 - Regularization: 0.010 ==> MAP@12: 0.00599




Factors: 1000 - Iterations: 14 - Regularization: 0.010 ==> MAP@12: 0.00603




Factors: 1000 - Iterations: 15 - Regularization: 0.010 ==> MAP@12: 0.00602




Factors: 1000 - Iterations: 20 - Regularization: 0.010 ==> MAP@12: 0.00595
CPU times: user 6min 53s, sys: 1.73 s, total: 6min 55s
Wall time: 6min 55s


In [12]:
del matrices

# Training

In [13]:
coo_train = to_user_item_coo(df)
csr_train = coo_train.tocsr()

In [14]:
def train(coo_train, factors=200, iterations=15, regularization=0.01, show_progress=True):
    model = implicit.als.AlternatingLeastSquares(factors=factors, 
                                                 iterations=iterations, 
                                                 regularization=regularization, 
                                                 random_state=42)
    model.fit(coo_train, show_progress=show_progress)
    return model

In [15]:
best_params

{'factors': 500, 'iterations': 3, 'regularization': 0.01}

In [16]:
model = train(coo_train, **best_params)



  0%|          | 0/3 [00:00<?, ?it/s]

## Prediction

In [17]:
def predict(model, csr_train, file_name="submissions.csv"):
    preds = []
    batch_size = 2000
    to_generate = np.arange(len(ALL_USERS))
    for startidx in range(0, len(to_generate), batch_size):
        batch = to_generate[startidx : startidx + batch_size]
        ids, scores = model.recommend(batch, csr_train[batch], N=12, filter_already_liked_items=False)
        for i, userid in enumerate(batch):
            customer_id = user_ids[userid]
            user_items = ids[i]
            article_ids = [item_ids[item_id] for item_id in user_items]
            preds.append((customer_id, ' '.join(article_ids)))

    df_preds = pd.DataFrame(preds, columns=['customer_id', 'prediction'])
    df_preds.to_csv(file_name, index=False)
    
    display(df_preds.head())
    print(df_preds.shape)
    
    return df_preds

In [18]:
%%time
df_preds = predict(model, csr_train)

Unnamed: 0,customer_id,prediction
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,0568601043 0858856005 0779781015 0828991003 05...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,0108775015 0108775044 0108775051 0110065001 01...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,0794321007 0794321011 0765743007 0805000001 09...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,0108775015 0108775044 0108775051 0110065001 01...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,0108775015 0108775044 0108775051 0110065001 01...


(1371980, 2)
CPU times: user 1min 16s, sys: 648 ms, total: 1min 17s
Wall time: 1min 17s
