# 3.1 Recommendation Testing
Validating a recommender system is no trivial task. ECommerce companies usually trials them extensively using both online and offline sources. As this is not quite feasible in our case, we resolved to using a different approach: As discussed when training the Item2Vec models, we created on distinct model for each customer cluster. Next, we load the test users that we defined. We do make some assumptions when actually making the recommendations:

1. The (true) embedding of the basket can be approximated by a (softmax-weighted) average of the items contained in that basket.
2. As in training the Item2Vec models, we only consider orders that have at least 4 items in them as we believe that just having one item in the basket and predicting the remaining one is incredibly difficult.

We then proceeded to generate "artificial" test datasets by using a rolling convolution to extract "order windows" of the following shape (e.g. by using a filter of size 4x1):

([Item1, Item2, Item3, Item4, Item5]) : 
 
Convolution 1: [Item1, Item2, Item3, Item4]  
Convolution 2: [Item2, Item3, Item4, Item5]  

We then extract the first three basket elements as our "basket" (e.g. [Item1, Item2, Item3]), apply the recommender on these items and then compare with the last item  [Item4] in the convolution. 

We made an important observation when using the recommender here: Using a softmax-weighted basket (i.e. weighting the first element with $e^1$, the second with $e^2$ and then normalizing by the sum of $e^1$ to $e^3$) yields a superior result compared to using the simple mean. This is intuitive, as products that have been purchased at the beginning might not be so indicative of products further ahead in the cart.

Overall we find that the recommender performs reasonably well against a random benchmark. By simply using a random sample from the 60,000 products or so in Instacarts database, we would expected to be picking the right product with a chance of 1/59,999. Depending on the cluster used, we achieve accuracies as high as 9% (which might be a statistical fluke) and and as low as 0.9%, both of which are a significant improvement over a simple guess. We see that the overall error variance is quite pronounced between the clusters.


In [1]:
import pickle
import numpy as np
import pandas as pd
import pyarrow.parquet as pq
from gensim.models import Word2Vec
from itertools import chain
from numpy.lib.stride_tricks import sliding_window_view
from typing import List
from os import listdir

In [2]:
cluster_item_models = [Word2Vec.load(f"model_cluster_{id}.model") for id in range(0, 5)]

In [3]:
with open('product_lookup.pkl', 'rb') as file:
    product_lookup = pickle.load(file)

In [4]:
def load_test_users(id: int):
    with open(f"test_users_cluster{id}.pkl", "rb") as file:
        test_users = pickle.load(file)
    return test_users

In [5]:
test_users = [load_test_users(id) for id in range(1, 6)]

In [6]:
def import_data(data_dir: str) -> List[pd.DataFrame]:
    """
    Parameters:
    ----------------
    data_dir: str
      The path where the data is stored

    Returns:
    ----------------
    dataframes_ls: List[pd.DataFrame]
      A list of pandas dataframes
    """
    files = [file.split('.')[0] for file in listdir(data_dir) if file.split('.')[0] != ""]

    # Creating a string expression to evaluate the data
    eval_expr = ', '.join(f'pd.read_csv(\'{data_dir}/{file}.csv\')' for file in files)

    # Evaluating the expression and assigning it, which creates a list of dataframes
    dataframes_ls = eval(eval_expr)

    return dataframes_ls

In [7]:
products = pd.read_csv('products.csv')
cluster_data = pq.read_table('./dummy_k5.parquet').to_pandas()
cluster_data_named = pd.merge(cluster_data, products, on='product_id', how='inner')

In [8]:
cluster_data_named['product_id'] = cluster_data_named['product_id'].astype(str)
cluster_data_named['user_id'] = cluster_data_named['user_id'].astype(str)

In [9]:
def filter_data_by_cluster(data: pd.DataFrame, cluster_num: int):
    return data.loc[data['cluster'] == cluster_num, :]

In [10]:
clusters_separated = [filter_data_by_cluster(cluster_data_named, cluster_num) for cluster_num in range(0, len(cluster_data_named['cluster'].unique()))]

In [11]:
def subset_cluster(cluster: pd.DataFrame, users):
    return cluster[cluster['user_id'].isin(users)]

In [12]:
def get_orders_from_cluster(cluster_subset):
    return cluster_subset.groupby(['user_id', 'order_id'])['product_id'].apply(list).values

In [13]:
def generate_user_purchase_history_in_cluster(cluster: pd.DataFrame, users):
    cluster_subset = subset_cluster(cluster, users)
    purchase_history = get_orders_from_cluster(cluster_subset)
    filtered_purchase_history = [purchase for purchase in purchase_history if len(purchase) > 3] # A number of purchases 
    return purchase_history

In [14]:
purchase_history_validation = [generate_user_purchase_history_in_cluster(clusters_separated[i], test_users[i]) for i in range(0, len(clusters_separated))]

As explained above, the product recommender retrieves the k most similar items for the averaged item vectors in the basket and checks whether one of the recommended products is indeed the next item. 

In [15]:
def recommend_product(cluster_model, product_lookup, product_ids):

    def filter_matches(cluster_model, product_ids):
        return [product_id for product_id in product_ids if cluster_model.wv.__contains__(product_id)]

    filtered_matches = filter_matches(cluster_model, product_ids)

    if len(filtered_matches) == 0:
        return 'UNKNOWN' # Returning an "UNKNOWN" token for an empty basket
    else:
        def average_item_vectors(cluster_model, product_ids):
            embeddings = [cluster_model.wv[product_id] for product_id in product_ids]
            def softmax_weights(embeddings):
                raw_weights = [np.exp(i) for i in range(1, len(embeddings)+1)]
                softmax_weights = np.array([raw_weight/sum(raw_weights) for raw_weight in raw_weights])
                return softmax_weights
            sm_weights = softmax_weights(embeddings)
            return np.average(embeddings, axis=0, weights=sm_weights)
            
        basket_vector = average_item_vectors(cluster_model, filtered_matches)

        def retrieve_most_similar_products(cluster_model, product_lookup, basket_vector):
            similar_products = cluster_model.wv.similar_by_vector(basket_vector, topn=15)[1:]
            recommendations = [similar[0] for similar in similar_products]
                
            return recommendations   

        recommendations = retrieve_most_similar_products(cluster_model, product_lookup, basket_vector)

        return recommendations

In [16]:
def convolve_prediction_filter(history: np.array, filter_shape: np.array):
        history_expanded = np.expand_dims(history, axis=1)
        masks = sliding_window_view(history_expanded, filter_shape)
        return masks

In [17]:
def validate_recommendations(cluster_model, mask: np.array, product_lookup: dict):
    basket = mask.flatten()[:-1]
    target_item = mask.flatten()[-1]
    recommendations = recommend_product(cluster_model, product_lookup, basket)

    if target_item in set(recommendations):
        return 1
    else:
        return 0

In [18]:
def score_cluster_model(cluster_model, product_lookup, cluster_history):
    validation_history = [np.array(history) for history in cluster_history if len(history) > 3]
    masks = [convolve_prediction_filter(history, (4,1)) for history in validation_history]
    chained_masks = list(chain.from_iterable(masks))

    order_score = sum([validate_recommendations(cluster_model, mask, product_lookup) for mask in masks])/len(masks)
    return order_score

In [19]:
results = [score_cluster_model(cluster_item_models[i], product_lookup, purchase_history_validation[i]) for i in range(len(cluster_item_models))]

In [20]:
[(f"Cluster {i+1}", round(results[i], 4)) for i in range(len(results))]

[('Cluster 1', 0.0408),
 ('Cluster 2', 0.0433),
 ('Cluster 3', 0.0034),
 ('Cluster 4', 0.0533),
 ('Cluster 5', 0.0413)]