# Introduction

This is the second part of a two-part series. The data preparation was done in the [first part](https://github.com/PMMAraujo/OnlineRetailII-ecommerce-project/blob/master/notebooks/eda_and_content_based.ipynb): cleaning the data, imputing some user IDs and converting the bought quantities to log(x+1). The objective of this part is to create some collaborative filtering recommender systems! However, instead of using the usual types of user-item interactions, like ratings, here the dependent variable is the quantity of each item bought by each user.

Recommender systems are consensually divided into two main categories: Content Based and Collaborative Filtering. Both build on the idea that, knowing the similarities between users and items, we can recommend items similar to those a user was previously interested in, or items that interested a similar user. The difference between the two categories comes from how the features for the similarity calculations are obtained. In Collaborative Filtering the user and item characteristics are inferred from a user-item interaction metric, like ratings or quantity bought (used in this example). In Content Based approaches the features are generally explicitly defined, like item color or type of material and user nationality and age. Therefore, Content Based approaches are highly dependent on extremely well curated data, and the more features available the better these models perform. In the [previous part](https://github.com/PMMAraujo/OnlineRetailII-ecommerce-project/blob/master/notebooks/eda_and_content_based.ipynb) I created some content based similarity inference systems using only the item names; they aren’t true recommender systems but still have some utility.

Collaborative Filtering approaches do not need an extensive array of features since they only depend on a user-item interaction metric, from which the user and item similarities can be inferred. Collaborative filtering is further divided into Memory and Model based methods. The point where they diverge is the data upon which the similarities between users and items are calculated. In Memory based approaches the similarities are inferred directly from the user-item interactions, while in Model based approaches the user-item interactions are explicitly modeled before the similarities are calculated.

# Baseline

Before implementing any model it is always good to have a dummy baseline for comparison. First let’s divide the dataset into train and test sets (25% for test). The baseline simply predicts the average of the trainset ratings for every entry in the testset.

In [4]:
import pandas as pd
import numpy as np

from surprise.model_selection import train_test_split
from surprise import Reader, Dataset,accuracy
from sklearn.metrics import mean_squared_error

file_path = './log_processed_df.csv'
#file_path = 'drive/My Drive/Colab Notebooks/project_ecommerce/norm_processed_df.csv'

reader = Reader(line_format='item rating user', sep=',', rating_scale=(0,10))

data_log = Dataset.load_from_file(file_path, reader=reader)


trainset, testset = train_test_split(data_log, random_state=100,
                                     test_size=0.25)


real_test_y = [ x[2] for x in testset]
average_train_preds =  [np.average(np.array([ x[2] for x in trainset.all_ratings()]))] * len(real_test_y)

baseline_rmse = np.sqrt(mean_squared_error(real_test_y, average_train_preds))

print(f"The baseline RMSE is: {baseline_rmse}")

The baseline RMSE is: 1.0014571989371153


Let’s also take this opportunity to look at the data. Please notice that this dataset was well explored in the [first part](https://github.com/PMMAraujo/OnlineRetailII-ecommerce-project/blob/master/notebooks/eda_and_content_based.ipynb).

In [None]:
df = pd.read_csv('./log_processed_df.csv', header=None)
df.columns = ['item', 'quantity_log', 'user']
df.head()

# Memory-Based Collaborative Filtering

So how does memory-based collaborative filtering work? Based on the user-item interactions, in this case buying quantities, a similarity matrix is built. The matrix can be built by comparing all users, or all items, to each other. The principle here is that users that buy similar items have similar buying patterns and are consequently similar for our modeling interests. The same idea is applied to the items: if they are bought by the same users, their characteristics may be similar or related. An intuitive example can be given with movie data: user A likes horror movies, so he always rates them high, but he doesn't like comedies, so he rates them low. If enough users behave like user A the similarity calculation will group horror movies together and away from comedies, even without any explicit knowledge of this characteristic.

Once the similarity matrix is built, a model like K Nearest Neighbors (KNN) is applied to identify the closest data points to our user or item of interest. The closest data points are then used to predict the quantitative feature at hand for the missing entries of that user or item. The prediction for each missing entry can be a simple average of the values of the closest data points or, in a more complex fashion, a weighted average in which the similarity of each neighbor to the user/item of interest (equivalently, 1 - distance) is used as its weight.
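To make the weighted version concrete, here is a minimal sketch in plain Python of a mean-centered, similarity-weighted prediction in the spirit of a "KNN with means" approach. The function name `predict_with_means` and all numbers are made up for illustration; this is not the package's implementation.

```python
def predict_with_means(target_mean, neighbors, k=2):
    """Predict a value from the k most similar neighbors.

    neighbors: list of (similarity, neighbor_value, neighbor_mean) tuples.
    Each neighbor contributes its deviation from its own mean,
    weighted by its similarity to the target.
    """
    # Keep only the k most similar neighbors
    top = sorted(neighbors, key=lambda n: n[0], reverse=True)[:k]
    sim_sum = sum(sim for sim, _, _ in top)
    if sim_sum == 0:
        # No usable neighbor: fall back to the target's own mean
        return target_mean
    offset = sum(sim * (value - mean) for sim, value, mean in top) / sim_sum
    return target_mean + offset


# Hypothetical neighbors: (similarity, value, neighbor mean)
neighbors = [(0.9, 3.0, 2.0), (0.5, 1.0, 2.0), (0.1, 5.0, 1.0)]
prediction = predict_with_means(target_mean=2.5, neighbors=neighbors, k=2)
print(round(prediction, 3))
```

With `k=2` the third neighbor is ignored even though its value is far from its mean, which is how the number of neighbors limits who gets a say in the prediction.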

Overall the thinking process for item-based collaborative filtering is: people who bought this item also bought X. For the user-item approach it is: people similar to you bought X. Let’s apply these approaches to the dataset at hand to get a better grasp of these concepts!

To implement the collaborative filtering approaches I’m going to use the package scikit surprise. The package has several k-NN inspired algorithms for collaborative filtering: KNNBasic, KNNWithMeans, KNNWithZScore and KNNBaseline. Here I'm only going to explore KNNWithMeans; this decision was almost arbitrary, based on my initial exploration and on some use cases of this package that I saw. By no means am I suggesting that this is the best algorithm for the job, that isn't what I'm trying to demonstrate here.

Let's start by exploring the algorithm parameters. The package divides the hyperparameters of these algorithms into two levels. The first refers to the actual KNN method and how the predictions are calculated: 'k', 'min_k' and 'verbose'. The second level contains the similarity options and controls how the similarity matrix is built: 'name', 'user_based' and 'min_support'. The package documentation does a good job explaining these parameters (https://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNWithMeans), nevertheless here is a simplified explanation of them and their expected impact:

- 'k' and 'min_k': control the number of closest neighbors used by the method (maximum and minimum). Depending on the dataset these have a medium to high impact on the model. Basing a prediction on only one neighbor makes it very sensitive to that single neighbor (high variance), therefore using multiple similar neighbors may generate more robust values. On the other hand, if the prediction is based on the average of many neighbors the different contributions may dilute each other and in the end the output will be just an average (or close to it) of the dataset (high bias);
- 'verbose': simply controls how much text is written to the screen and has no impact on model performance;
- 'name': is one of the most impactful parameters in terms of performance since it defines which method is used to calculate the similarity matrix between users or items. The options tested here are cosine, msd and pearson (the package also offers pearson_baseline). No single one of them is the best for all approaches, one must test which one performs best for the problem at hand;
- 'user_based': is used to choose if we are going to calculate a similarity matrix between users or between items, and consequently defines if our approach is user-item or item-item. This is mostly an approach decision and not a hyperparameter;
- 'min_support': sets the minimum number of common items (for similarities between users) or common users (for similarities between items) needed to compute a similarity; with fewer than this the similarity is 0. This has a huge impact on the similarity matrix and on the subsequent KNN predictions. It is a balance between discarding too many similarities (high value) and trusting similarities computed from very few shared interactions (low value), so it needs to be properly tested and optimized. However, chasing the "perfect" value may only bring small gains in predictive capacity; it is more realistic to look for a reasonably good and stable value for the model.
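As an illustration of how 'name' and 'min_support' interact, the sketch below computes a mean squared difference (msd) similarity between two items in plain Python, returning 0 when they share too few users. The helper `msd_similarity` and the toy quantities are hypothetical; the `1 / (msd + 1)` form follows the package documentation as I understand it, so treat it as an assumption rather than the exact implementation.

```python
def msd_similarity(values_a, values_b, min_support=1):
    """Similarity based on the mean squared difference over common keys.

    values_a, values_b: dicts mapping a user id to the (log) quantity
    that user bought of item a and item b respectively.
    """
    common = set(values_a) & set(values_b)
    # Too few users in common: the similarity is set to 0
    if not common or len(common) < min_support:
        return 0.0
    msd = sum((values_a[u] - values_b[u]) ** 2 for u in common) / len(common)
    # The +1 avoids a division by zero when the two items agree perfectly
    return 1.0 / (msd + 1.0)


# Hypothetical interactions for two items
item_a = {"u1": 1.0, "u2": 2.0, "u3": 3.0}
item_b = {"u2": 2.0, "u3": 1.0, "u4": 4.0}

print(msd_similarity(item_a, item_b, min_support=1))  # two users in common
print(msd_similarity(item_a, item_b, min_support=3))  # not enough support
```

The same pair of items goes from a non-zero similarity to 0 just by raising `min_support`, which is why this parameter reshapes the whole similarity matrix.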

## Item-item Model

Starting with an item-item approach let's do a grid search with cross validation to infer the best hyperparameters for this dataset:

In [6]:
from surprise import KNNWithMeans
from surprise.model_selection import GridSearchCV, train_test_split

item_sim_options = {'name': ['cosine', 'msd', 'pearson'],
                   'user_based': [False], # False because item-item approach
                    'min_support': [1,2,5]}

item_knn_param_grid = {'verbose': [False],
                  'k':[20,40,60],
                  'min_k':[1,2,5],
                  'sim_options': item_sim_options
                  }

item_kwm_gs = GridSearchCV(KNNWithMeans, item_knn_param_grid,
                           measures=['rmse', 'mae'], cv=5, n_jobs=4,
                           refit=True) # output model retrain in entire data set

item_kwm_gs.fit(data_log)

In [7]:
print(f"The results from the grid_search_cv are:\n\
best score rmse: {item_kwm_gs.best_score['rmse']}\n\
best score mae: {item_kwm_gs.best_score['mae']}\n\
best params rmse: {item_kwm_gs.best_params['rmse']}\n\
best params mae: {item_kwm_gs.best_params['mae']}")

The results from the grid_search_cv are:
best score rmse: 0.5796235476542964
best score mae: 0.4231853109804364
best params rmse: {'verbose': False, 'k': 40, 'min_k': 1, 'sim_options': {'name': 'msd', 'user_based': False, 'min_support': 5}}
best params mae: {'verbose': False, 'k': 20, 'min_k': 1, 'sim_options': {'name': 'msd', 'user_based': False, 'min_support': 5}}


These results look promising when compared to the baseline, especially because the approach used this time, cross validation, is more robust than the simple train-test split used previously. However, comparing these results is unfair because the validation strategy is different. One solution would be to train a model on the previous train-test split with the newly optimized parameters, but this approach has a major flaw commonly called data leakage. Basically, the grid search saw all the data and optimized the parameters for it, so part of our model has already been in contact with the data in the test set (from the train-test split), and consequently this test would be flawed. The proper approach would be to divide the dataset into k predefined bins and do both the baseline and the grid search evaluation on the same bins in a cross validation manner. This will not be done here. In the end I just want a reference point, not a perfectly fair test, so for this purpose I'm happy with the current set-up.
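For reference, the fairer set-up described above could look roughly like the following plain-Python sketch, which evaluates the "predict the train average" baseline over k predefined folds. The helper names and the toy values are hypothetical; the same fold indices would then have to be reused when evaluating every other model.

```python
import math
import random


def kfold_indices(n_samples, n_splits=5, seed=100):
    """Shuffle the sample indices once and deal them into n_splits folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_splits] for i in range(n_splits)]


def mean_baseline_cv_rmse(values, n_splits=5, seed=100):
    """Average RMSE, over the folds, of predicting the train-fold mean."""
    fold_rmses = []
    for test_idx in kfold_indices(len(values), n_splits, seed):
        held_out = set(test_idx)
        train = [v for i, v in enumerate(values) if i not in held_out]
        train_mean = sum(train) / len(train)
        mse = sum((values[i] - train_mean) ** 2 for i in test_idx) / len(test_idx)
        fold_rmses.append(math.sqrt(mse))
    return sum(fold_rmses) / len(fold_rmses)


# Hypothetical log(x+1) quantities, only to make the sketch runnable
toy_values = [0.69, 1.10, 1.39, 2.48, 0.69, 3.22, 1.79, 2.08, 0.69, 1.61]
cv_rmse = mean_baseline_cv_rmse(toy_values, n_splits=5)
print(f"Cross-validated baseline RMSE: {cv_rmse:.4f}")
```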

In [8]:
best_item_model = item_kwm_gs.best_estimator['rmse']
item_sim_matrix = best_item_model.sim
print(item_sim_matrix.shape)
item_sim_matrix

(4275, 4275)


array([[1.        , 0.6637766 , 0.66253112, ..., 0.        , 0.        ,
        0.        ],
       [0.6637766 , 1.        , 0.81974447, ..., 0.        , 0.        ,
        0.        ],
       [0.66253112, 0.81974447, 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])

The attribute .sim of the trained model contains the similarity matrix comparing all the items to each other. Naturally its shape is (4275, 4275) since we have 4275 different items in the dataset. But is it any good?

We can get the closest neighbors of each item with the get_neighbors function. However, scikit surprise works with what it calls inner ids, so for this request the real (raw) item and user ids need to be converted to and from the model's inner ids. In the following example I get the 3 closest neighbors of a randomly selected item. Then let's take those item descriptions and see if they are somewhat similar.

In [10]:
trainset_full = data_log.build_full_trainset()

rng = np.random.default_rng(seed=100)
eg_inner_id = rng.choice(trainset_full.all_items()[-1], 1)
eg_id = trainset_full.to_raw_iid(eg_inner_id[0])

clossest_n = best_item_model.get_neighbors(eg_inner_id, 3)
real_clossest_n = [trainset_full.to_raw_iid(x) for x in clossest_n]
print(f"The closest items to item with the ID {eg_id} are : {real_clossest_n}")

extra_info_df = pd.read_csv('extrainfo_log_processed_df.csv', header=None)

print(f"The item has the name: {extra_info_df[extra_info_df[0] == eg_id][3].values[0]}")
print(f"The 3 closest items to it have the names: \
{extra_info_df[extra_info_df[0] == real_clossest_n[0]][3].values[0]}, \
{extra_info_df[extra_info_df[0] == real_clossest_n[1]][3].values[0]}, and \
{extra_info_df[extra_info_df[0] == real_clossest_n[2]][3].values[0]}")

The closest items to item with the ID 84913B are : ['90198B', '21617', '84920']
The item has the name: MINT GREEN ROSE TOWEL
The 3 closest items to it have the names: VINTAGE ROSE BEAD BRACELET BLACK, 4 LILY  BOTANICAL DINNER CANDLES, and PINK FLOWER FABRIC PONY


It is quite impressive that the names of these 4 items are somewhat related: the theme appears to be roses or flowers. Just a reminder that the name information was never given to the model, the only data the model saw was the user interactions with the items (purchasing quantities). To me this is a great indicator that the similarity matrix recreates reality, at least to some extent.

Now that I'm more confident in this similarity matrix, and knowing that the KNN model outputs a decent RMSE score when compared to the baseline, it's time to do some predictions. Let's get a user ID at random.

In [11]:
rng = np.random.default_rng(seed=10)
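# Note: the upper bound comes from all_items(), so only the first n_items
# user inner ids can be drawn. Using all_users() would cover every user,
# but it would also change the user sampled below.
eg_inner_id = rng.choice(trainset_full.all_items()[-1], 1)[0]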
eg_id = trainset_full.to_raw_uid(eg_inner_id)
eg_id

'imp1884'

Now let's predict the 5 items with the highest value of the dependent variable for this user using the following function:

In [12]:
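def get_predictions(df, user_id, model, n_outs):
    """Return the n_outs items with the highest predicted value for user_id."""
    all_items = df['item'].unique()
    # Items the user already interacted with (a set keeps the lookup fast)
    already_pred = set(df[df['user'] == user_id]['item'].values)
    to_pred = [x for x in all_items if x not in already_pred]

    # model.predict returns a Prediction tuple; .est is the estimated value
    preds = {}
    for item in to_pred:
        preds[item] = model.predict(user_id, item).est

    pred_df = pd.DataFrame.from_dict(preds, orient='index')
    pred_df.columns = ['pred']
    # Highest predictions first, keep only the top n_outs
    output = pred_df.sort_values(by=['pred'], ascending=False).head(n=n_outs)

    return output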

In [44]:
item_results = get_predictions(df, eg_id, best_item_model, 5)

names = []
for iid in item_results.index:
    name = extra_info_df[extra_info_df[0] == iid][3].values[0]
    names.append(name)
    
item_results['name'] = names
item_results

| item | pred | name |
|---|---|---|
| 72732 | 7.813187 | |
| 84760L | 6.329721 | LARGE HANGING GLASS+ZINC LANTERN |
| 72759 | 6.263398 | |
| 20715 | 5.993961 | LITTLE FLOWER SHOPPER BAG |
| 49031B | 5.993961 | CHROME EURO HOOK 20cm |

Nice! According to this item-item collaborative model, these are the 5 items the user hasn't interacted with yet that have the highest predicted value for the dependent variable. But what is the dependent variable? In this case it is the quantity bought, but it could be another metric. Let's say that we want to optimize for the amount of currency spent: the predicted value could be multiplied by the price per unit. Please notice that the quantity was transformed using log(x+1), so the prediction should be converted back with exp(x) - 1 before being multiplied by a price. In this example the price works like a set of weights that transform our prediction. This thinking process can be expanded and applied to other use cases, like giving higher priority (higher weights) to the previous season's items before the new ones arrive in order to free storage space, or any other case of interest. With these methods one is not restricted to obtaining the dependent variable, it can also be transformed depending on the objective.
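A minimal sketch of that re-weighting idea follows. The two predictions are taken from the table above, while the unit prices and the `expected_spend` helper are made up for illustration; the quantities are assumed to be natural-log transformed.

```python
import math


def expected_spend(pred_log_qty, unit_price):
    """Undo the log(x+1) transform, then weight the quantity by the price."""
    quantity = math.expm1(pred_log_qty)  # exp(x) - 1
    return max(quantity, 0.0) * unit_price


# Predictions from the item-item model above
predictions = {"84760L": 6.329721, "20715": 5.993961}
# Hypothetical unit prices, not taken from the dataset
unit_prices = {"84760L": 0.50, "20715": 1.25}

ranked = sorted(
    predictions,
    key=lambda iid: expected_spend(predictions[iid], unit_prices[iid]),
    reverse=True,
)
print(ranked)
```

With these made-up prices the order of the two items flips: the item with the lower predicted quantity ranks first because each unit is worth more.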

## User-item Model

The building process of the user-item collaborative filtering model with scikit surprise is the same as for the previously built item-item one, just remember to set the sim_option 'user_based' to True. Let's repeat the grid search parameter optimization and look at the different outcomes.

In [14]:
user_sim_options = {'name': ['cosine', 'msd', 'pearson'],
                   'user_based': [True], 
                    'min_support': [1,2,5]}

user_knn_param_grid = {'verbose': [False],
                  'k':[20,40,60],
                  'min_k':[1,2,5],
                  'sim_options': user_sim_options
                  }


user_kwm_gs = GridSearchCV(KNNWithMeans, user_knn_param_grid,
                           measures=['rmse', 'mae'], cv=5, n_jobs=4,
                           refit=True) # output model retrain in entire data set

user_kwm_gs.fit(data_log)

In [15]:
print(f"The results from the grid_search_cv are:\n\
best score rmse: {user_kwm_gs.best_score['rmse']}\n\
best score mae: {user_kwm_gs.best_score['mae']}\n\
best params rmse: {user_kwm_gs.best_params['rmse']}\n\
best params mae: {user_kwm_gs.best_params['mae']}")

The results from the grid_search_cv are:
best score rmse: 0.5364497455230681
best score mae: 0.37497963845840787
best params rmse: {'verbose': False, 'k': 20, 'min_k': 2, 'sim_options': {'name': 'msd', 'user_based': True, 'min_support': 5}}
best params mae: {'verbose': False, 'k': 20, 'min_k': 2, 'sim_options': {'name': 'msd', 'user_based': True, 'min_support': 5}}


Let's look at the RMSE obtained with different parameters within this grid search:

In [16]:
user_kwm_gs.cv_results.keys()
best_rmse = user_kwm_gs.best_score['rmse']
r1 = user_kwm_gs.cv_results['mean_test_rmse'][5]
p1 = user_kwm_gs.cv_results['params'][5]
r2 = user_kwm_gs.cv_results['mean_test_rmse'][11]
p2 = user_kwm_gs.cv_results['params'][11]
r3 = user_kwm_gs.cv_results['mean_test_rmse'][12]
p3 = user_kwm_gs.cv_results['params'][12]
r4 = user_kwm_gs.cv_results['mean_test_rmse'][13]
p4 = user_kwm_gs.cv_results['params'][13]
r5 = user_kwm_gs.cv_results['mean_test_rmse'][41]
p5 = user_kwm_gs.cv_results['params'][41]

print(f'Changing the parameter "min_k" from 2 to 1 impacted the rmse in: {best_rmse - r1}')
print(f'Changing the parameter "name" from "msd" to "cosine" impacted the rmse in: {best_rmse - r2}')
print(f'Changing the parameter "min_support" from 5 to 1 impacted the rmse in: {best_rmse - r3}')
print(f'Changing the parameter "min_support" from 5 to 2 impacted the rmse in: {best_rmse - r4}')
print(f'Changing the parameter "k" from 20 to 40 impacted the rmse in: {best_rmse - r5}')

Changing the parameter "min_k" from 2 to 1 impacted the rmse in: -0.0002915195835179185
Changing the parameter "name" from "msd" to "cosine" impacted the rmse in: -0.031012899823172457
Changing the parameter "min_support" from 5 to 1 impacted the rmse in: -0.003194337883101195
Changing the parameter "min_support" from 5 to 2 impacted the rmse in: -0.0012495250511870282
Changing the parameter "k" from 20 to 40 impacted the rmse in: -0.0028650078782975763


As these differences in the RMSE metric show, some hyperparameters had a bigger impact on model performance than others. Namely, the method of similarity calculation ("name") appears to have a great impact, while "min_k" seems to have only a minor influence, or an indirect one when combined with other parameters.
Grid search cross validation was a big waste for this example: the improvements seen in the model were only minor and the time it took to achieve the best result was almost 50 times the time needed to train a model with the default parameters. Nevertheless, in some specific cases hyperparameter optimization may be essential in separating an ok model from a great one.

In [17]:
best_user_model = user_kwm_gs.best_estimator['rmse']
user_sim_matrix = best_user_model.sim
print(f"The similarity matrix shape is: {user_sim_matrix.shape}")
#user_sim_matrix

The similarity matrix shape is: (7019, 7019)


As expected the similarity matrix is 7019 by 7019, since the dataset has 7019 different users.

In [41]:
user_results = get_predictions(df, eg_id, best_user_model, 5)

names = []
for iid in user_results.index:
    name = extra_info_df[extra_info_df[0] == iid][3].values[0]
    names.append(name)
    
user_results['name'] = names
user_results

| item | pred | name |
|---|---|---|
| 47503J | 1.712326 | SET/3 FLORAL GARDEN TOOLS IN BAG |
| 85017C | 0.693147 | ENVELOPE 50 CURIOUS IMAGES |
| 37422 | 0.693147 | WHITE WITH BLACK CATS BOWL |
| 37462B | 0.693147 | PET MUG, BUDGIE |
| 37489C | 0.693147 | GREEN/BLUE FLOWER DESIGN BIG MUG |


In [45]:
user_results = get_predictions(df, eg_id, best_user_model, 5)

print(f"The items suggested using the item-item approach are: {list(item_results.index)}\n\
The items suggested using the user-item approach are: {list(user_results.index)}")

The items suggested using the item-item approach are: ['72732', '84760L', '72759', '20715', '49031B']
The items suggested using the user-item approach are: ['47503J ', '85017C', '37422', '37462B', '37489C']


So these item suggestions are quite different, which highlights the importance of the methodology in recommender systems. The answer to the question "which one is better?" depends greatly on the objective, that's why having well defined objectives and metrics is extremely valuable. Moreover, it is possible to combine several models to achieve hybrid solutions, which may lead to more robust answers. It is possible to create a large number of models, but if those models don't solve a specific problem their usefulness is questionable.

To wrap up, here I showed that memory-based collaborative models can indeed pick up patterns from the dataset. The provided dependent variable doesn't need to be a rating, in theory it can be any metric of user interactions with items.
There is more to collaborative filtering than memory-based models, so next let's apply a model-based solution.

# Model-Based Collaborative Filtering

As previously mentioned, in model-based approaches the user and item characteristics are not inferred directly from the user-item interaction metrics. Instead, a model is fitted to this data and the characteristics are inferred from the output of said model. This methodology is extremely useful because the user-item interaction matrix is sparse, meaning that most users never interacted with the majority of the items. The method that I'm going to apply here to "reduce the dimensionality" of the user-item matrix is the one the scikit surprise package calls SVD, named after Singular Value Decomposition. The correct term for this "dimensionality reduction" is matrix factorization. When doing matrix multiplication we have two matrices of compatible shapes, let's say one is 4 by 3 and the other 3 by 2, and we end up with a new matrix, which in our example would be 4 by 2. To get a visualization of this I strongly recommend the website [matrixmultiplication](http://matrixmultiplication.xyz/). In matrix factorization we go the other way around: from the 4 by 2 matrix we want to obtain the 4 by 3 and the 3 by 2 matrices plus, depending on the method applied, some bias terms. Baptiste Rocca made an [amazing schematic representation](https://miro.medium.com/max/2000/1*E9EE5LXxty1EB8fn_s1jkQ@2x.png) of this in his [blog post](https://towardsdatascience.com/introduction-to-recommender-systems-6c66cf15ada). I can't stress enough that if you are interested in this field you should go and read that amazing blog post.
With this process two matrices are generated, one corresponding to the users and another to the items. Each row in the users matrix is an embedding representation of that user, and the same is valid for the items. The embeddings can be seen as a list of latent characteristics of the user or item that the model found, however these most likely do not correspond to real features like color or type. To sum up, this approach can be seen as the creation of embedding representations, which are then used to infer similarities among users or items.
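To give a feel for how the two matrices can be learned from the known entries only, here is a toy matrix factorization trained with stochastic gradient descent in plain Python. It has no bias terms and uses made-up data and hyperparameters, so it is a simplified sketch of the idea and not the algorithm as implemented in the package.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def factorize(ratings, n_users, n_items, n_factors=2, n_epochs=200,
              lr=0.05, reg=0.02, seed=100):
    """Learn user (pu) and item (qi) embeddings from (user, item, value) triples."""
    rng = random.Random(seed)
    # Small random starting values sampled from a normal distribution
    pu = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    qi = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - dot(pu[u], qi[i])
            for f in range(n_factors):
                p_uf, q_if = pu[u][f], qi[i][f]
                # Gradient step with L2 regularization on both embeddings
                pu[u][f] += lr * (err * q_if - reg * p_uf)
                qi[i][f] += lr * (err * p_uf - reg * q_if)
    return pu, qi


def rmse(ratings, pu, qi):
    return math.sqrt(sum((r - dot(pu[u], qi[i])) ** 2 for u, i, r in ratings) / len(ratings))


# Hypothetical sparse interactions: (user index, item index, value)
toy_ratings = [(0, 0, 3.0), (0, 1, 1.0), (1, 0, 2.5), (1, 2, 0.7),
               (2, 1, 1.2), (2, 2, 0.7), (3, 0, 3.1), (3, 2, 0.9)]

pu_start, qi_start = factorize(toy_ratings, n_users=4, n_items=3, n_epochs=0)
pu, qi = factorize(toy_ratings, n_users=4, n_items=3)
print(f"RMSE before training: {rmse(toy_ratings, pu_start, qi_start):.3f}")
print(f"RMSE after training:  {rmse(toy_ratings, pu, qi):.3f}")
```

Here 4 users and 3 items are each described by 2 latent factors, so the users matrix is 4 by 2 and the items matrix is 3 by 2; a missing entry is predicted as the dot product of the corresponding two rows.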

Like in the memory-based examples, here I'm not trying to claim that SVD is the best algorithm for this purpose, although it is certainly one of the most popular, if not the most popular. The scikit surprise package has other matrix factorization implementations, like SVDpp and NMF.
Before implementing SVD with scikit surprise let's look at some of its hyperparameters:
 - n_factors: defines the shape of the user and item matrices created. The length of each embedding is going to be the number defined here. The temptation is to define a large value for this parameter, in order to infer as much as possible from the data, however that may result in overfitting;
 - n_epochs: number of times the optimization procedure is going to see the data. It interacts strongly with the learning rate and the regularization. It has no direct impact on how the predictions are computed, however more epochs give the model more opportunities to learn (or memorize the train set);
 - biased: whether to use baselines (biases). If set to False the algorithm is equivalent to Probabilistic Matrix Factorization;
 - init_mean & init_std_dev: the initial vectors for the created matrices are generated at "random", which is not completely true because they are sampled from a designated normal distribution. These two parameters define the characteristics of said normal distribution;
 - lr_all: the learning rate works like the step size of the optimization process. If it is too big it leads to instability, if it is too small the model takes more time to converge (more epochs);
 - reg_all: regularization is applied to decrease the overfitting potential of an optimization procedure. In theory it increases the algorithm performance on unseen data (test or validation set), in other words it helps create models that generalize better;
 - lr_* & reg_*: these parameters control the learning rate and regularization of specific parameters instead of applying the same value to all of them (like when using lr_all and reg_all). I will not play with these;
 - random_state: it is always good to define a random state in order to ensure reproducibility. These processes are stochastic, meaning that "randomly" picked values may have a considerable impact on the model. This should not be optimized like other hyperparameters;
 - verbose: just controls how much is output to the screen during training.
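To tie these hyperparameters together, the model behind the package's SVD can be summarized as follows, based on my reading of the package documentation (the symbols are the ones used there, so take the notation as an assumption). Here $\mu$ is the global mean, $b_u$ and $b_i$ are the user and item biases, and $p_u$ and $q_i$ are the user and item embeddings of length n_factors.

```latex
\begin{aligned}
\hat{r}_{ui} &= \mu + b_u + b_i + q_i^{\top} p_u \\
\min_{b,\,p,\,q} \;& \sum_{r_{ui} \in R_{\text{train}}}
  \left(r_{ui} - \hat{r}_{ui}\right)^2
  + \lambda \left(b_u^2 + b_i^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2\right) \\
e_{ui} &= r_{ui} - \hat{r}_{ui} \\
b_u &\leftarrow b_u + \gamma \left(e_{ui} - \lambda\, b_u\right) \\
b_i &\leftarrow b_i + \gamma \left(e_{ui} - \lambda\, b_i\right) \\
p_u &\leftarrow p_u + \gamma \left(e_{ui}\, q_i - \lambda\, p_u\right) \\
q_i &\leftarrow q_i + \gamma \left(e_{ui}\, p_u - \lambda\, q_i\right)
\end{aligned}
```

In this notation $\gamma$ corresponds to lr_all, $\lambda$ to reg_all, each pass over the training entries is one of the n_epochs, and setting biased to False drops $\mu$, $b_u$ and $b_i$ so that the prediction is just $q_i^{\top} p_u$.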

Let's do a grid search to infer the best parameters, but instead of optimizing everything let's set the number of epochs to 10 and see which combination of learning rate, regularization and number of factors works the best.

In [46]:
from surprise import SVD

param_grid = {
    "n_epochs": [10],
    "n_factors": [50, 100, 200],
    "lr_all": [0.002, 0.005, 0.008],
    "random_state":[100],
    "reg_all": [0.01, 0.02, 0.05],
    "verbose": [False],
}


gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=5, refit=True,
                  n_jobs=4)

gs.fit(data_log)

print(gs.best_score["rmse"])
print(gs.best_params["rmse"])

best_model_based_model = gs.best_estimator['rmse']

0.5387568817943309
{'n_epochs': 10, 'n_factors': 200, 'lr_all': 0.008, 'random_state': 100, 'reg_all': 0.02, 'verbose': False}


In [47]:
gs_df = pd.DataFrame.from_dict(gs.cv_results)[['rank_test_rmse', 'mean_test_rmse']]
params_df = pd.DataFrame.from_dict(gs.cv_results['params'])

gs_df_final = gs_df.join(params_df).drop(['n_epochs', 'random_state', 'verbose'], axis=1).sort_values(by=['rank_test_rmse'])
gs_df_final.head(n=6)

| | rank_test_rmse | mean_test_rmse | n_factors | lr_all | reg_all |
|---|---|---|---|---|---|
| 25 | 1 | 0.538757 | 200 | 0.008 | 0.02 |
| 15 | 2 | 0.539269 | 100 | 0.008 | 0.01 |
| 16 | 3 | 0.540253 | 100 | 0.008 | 0.02 |
| 24 | 4 | 0.540699 | 200 | 0.008 | 0.01 |
| 6 | 5 | 0.542383 | 50 | 0.008 | 0.01 |
| 7 | 6 | 0.544746 | 50 | 0.008 | 0.02 |


Unsurprisingly, the best combinations of parameters (6 out of the first 6) all involve the highest learning rate (0.008). This most likely indicates that the model needs to be trained for longer (more epochs). The top 3 results are close together and the difference among them is negligible. It seems intuitive to me that models with a higher number of factors, and consequently more parameters, will benefit from more regularization to avoid overfitting.
Although I recognize that this model could use some improvements, I'm going to stay with it for now and analyse its outputs.

The two matrices created, one for users and the other for items have the shapes:

In [48]:
print(f"Users matrix shape: {best_model_based_model.pu.shape}")
print(f"Items matrix shape: {best_model_based_model.qi.shape}")

Users matrix shape: (7019, 200)
Items matrix shape: (4275, 200)


The users matrix has one row per user and n_factors columns, and the items matrix has one row per item and n_factors columns. As previously mentioned, the 200 values of each embedding can be seen as latent characteristics that the model inferred for each item or user.
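Since each row is an embedding, similarities can be computed directly between rows. The sketch below does it with cosine similarity in plain Python; the item ids and the 3-value embeddings are invented for illustration, whereas the rows of the real items matrix have 200 values.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two embeddings (0.0 if one is all zeros)."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(embeddings, target_id, top_n=2):
    """Ids of the top_n embeddings closest to target_id, most similar first."""
    target = embeddings[target_id]
    scores = {
        other_id: cosine_similarity(target, vector)
        for other_id, vector in embeddings.items()
        if other_id != target_id
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical item embeddings (3 latent factors instead of 200)
item_embeddings = {
    "item_a": [0.9, 0.1, 0.0],
    "item_b": [0.8, 0.2, 0.1],
    "item_c": [-0.1, 0.9, 0.3],
    "item_d": [0.7, 0.0, 0.2],
}
print(most_similar(item_embeddings, "item_a", top_n=2))
```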

Finally let’s get the top 5 predictions for the test id using the previously written function, but this time with the model-based approach.

In [49]:
model_results = get_predictions(df, eg_id, best_model_based_model, 5)
model_results

| item | pred |
|---|---|
| 16033 | 4.37185 |
| 84568 | 3.681382 |
| 16259 | 3.632351 |
| 17084R | 3.603174 |
| 16169C | 3.482402 |


# Conclusion

Scikit surprise makes it easy to implement collaborative filtering models, however the way the dataset is handled internally by this package makes it hard (or even impossible) to have a train-test split followed by the division of the train set into validation and train sets. Overall, I think the package achieves what it proposes to do and therefore I recommend it to anyone trying to apply collaborative filtering methods for the first time. Its API differs somewhat from other widely used machine learning packages, like sklearn, but the documentation is enough to grasp what is going on.
It was very interesting to work with this dataset. The data transformations that I did initially, like converting the quantity values to log(x+1), certainly had a huge impact on the models, and it would be interesting to evaluate the performance of the models with other data transformations. Maybe for a part three of this project! However, for now this is it!