# Riemannian Low-rank Matrix Completion algorithm on Movielens dataset
Riemannian Low-rank Matrix Completion (RLRMC) is a (vanilla) matrix-factorization-based matrix completion algorithm that solves its optimization problem using the Riemannian conjugate gradients algorithm (Absil et al., 2008). RLRMC is based on the work of Jawanpuria and Mishra (2018) and Mishra et al. (2013).

The ratings matrix of movies (items) and users is modeled as a low-rank matrix. Let the number of movies be 𝑑 and the number of users be 𝑇. The RLRMC algorithm assumes that the ratings matrix 𝑀 (of size 𝑑×𝑇) is only partially known. The entry 𝑀(𝑖,𝑗) is the rating given by the 𝑗-th user to the 𝑖-th movie. RLRMC learns the matrix 𝑀 as 𝑀=𝐿𝑅⊤, where 𝐿 is a 𝑑×𝑟 matrix and 𝑅 is a 𝑇×𝑟 matrix. Here, 𝑟 is the rank hyper-parameter, which must be provided to the RLRMC algorithm; typically, it is assumed that 𝑟≪𝑑,𝑇. The optimization problem is solved iteratively using the Riemannian conjugate gradients algorithm. The Riemannian optimization framework generalizes a range of Euclidean first- and second-order algorithms, such as conjugate gradients and trust regions, to Riemannian manifolds. A detailed exposition of the framework can be found in Absil et al. (2008).
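To make the factorization above concrete, here is a minimal NumPy sketch of a rank-𝑟 matrix 𝑀=𝐿𝑅⊤ and of a squared error evaluated only on the known entries. It is an illustration only: the sizes, the random observation mask, and the helper `masked_squared_error` are assumptions made for this example, and the loss is a plain masked least-squares term, not the regularized objective or the Riemannian solver that RLRMC actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed for illustration): d movies, T users, rank r
d, T, r = 6, 5, 2

L = rng.standard_normal((d, r))  # movie factors, d x r
R = rng.standard_normal((T, r))  # user factors, T x r
M = L @ R.T                      # full ratings matrix, d x T, rank at most r

# Pretend only about half of the entries of M are known
mask = rng.random((d, T)) < 0.5

def masked_squared_error(L, R, M_obs, mask):
    """Half the sum of squared residuals over the observed entries only."""
    residual = (L @ R.T - M_obs) * mask
    return 0.5 * np.sum(residual ** 2)

loss_at_truth = masked_squared_error(L, R, M, mask)
rank_M = np.linalg.matrix_rank(M)

# The factors hold r * (d + T) numbers instead of the d * T entries of M
n_params = r * (d + T)
print(M.shape, rank_M, loss_at_truth, n_params)
```

When 𝑟≪𝑑,𝑇, the factor pair is far smaller than the full matrix, which is what makes the low-rank model practical for large rating matrices.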

This notebook provides an example of how to use and evaluate the RLRMC implementation in reco_utils.

In [165]:
import sys
import time

import numpy as np
import pandas as pd

sys.path.append("../../")
sys.path.append("../../reco_utils/recommender/rlrmc/")

from reco_utils.dataset import movielens
from reco_utils.dataset.python_splitters import (
    python_random_split,
    python_chrono_split,
    python_stratified_split,
)
from reco_utils.evaluation.python_evaluation import rmse, mae
# Pymanopt installation is required via
# pip install pymanopt
from reco_utils.recommender.rlrmc.RLRMCdataset import RLRMCdataset
from reco_utils.recommender.rlrmc.RLRMCalgorithm import RLRMCalgorithm

# import logging

# %load_ext autoreload
# %autoreload 2

In [166]:
print("Pandas version: {}".format(pd.__version__))
print("System version: {}".format(sys.version))


Pandas version: 0.25.3
System version: 3.6.10 |Anaconda, Inc.| (default, Jan  7 2020, 15:18:16) [MSC v.1916 64 bit (AMD64)]


In [167]:
# Select Movielens data size: 100k, 1m, 10m, or 20m
MOVIELENS_DATA_SIZE = '10m'

# Model parameters

# Rank of the model: a positive integer (usually small); required parameter
rank_parameter = 1
# Regularization parameter multiplying the loss function: a positive number (usually small); required parameter
regularization_parameter = 0.5
# Initialization option for the model; 'svd' employs singular value decomposition; optional parameter
initialization_flag = 'svd'  # default is 'random'
# Maximum number of iterations for the solver: a positive integer; optional parameter
maximum_iteration = 500  # optional, default is 100
# Maximum time in seconds for the solver: a positive integer; optional parameter
maximum_time = 1000  # optional, default is 1000

# Verbosity of the intermediate results
verbosity = 0  # optional parameter, valid values are 0, 1, 2; default is 0
# Whether to compute per-iteration train RMSE (and test RMSE, if test data is given)
compute_iter_rmse = True  # optional parameter, boolean value, default is False

# Importing the Dataset

In [172]:
df = pd.read_csv('fakedata1.csv')

In [173]:
df

Unnamed: 0,userID,itemID,rating,timestamp,Products
0,102,4,3,6/10/2019 13:01,Cold Brew
1,110,3,1,9/6/2019 8:31,Expresso
2,100,10,1,10/6/2019 13:15,Pistachios Pack
3,106,2,1,4/27/2019 14:43,Vanilla Chai
4,100,11,3,7/23/2019 3:50,Cashew nut Pack
...,...,...,...,...,...
996,105,10,2,7/24/2019 11:54,Pistachios Pack
997,111,3,2,4/6/2019 8:33,Expresso
998,105,11,1,5/7/2019 1:21,Cashew nut Pack
999,106,3,2,10/3/2019 8:09,Expresso


# Split the data randomly

In [174]:
## If both validation and test sets are required
# train, validation, test = python_random_split(df,[0.6, 0.2, 0.2])

## If validation set is not required
train, test = python_random_split(df,[0.8, 0.2])

## If test set is not required
# train, validation = python_random_split(df,[0.8, 0.2])

## If both validation and test sets are not required (i.e., the complete dataset is for training the model)
# train = df

In [175]:
# data = RLRMCdataset(train=train, validation=validation, test=test)
data = RLRMCdataset(train=train, test=test) # No validation set
# data = RLRMCdataset(train=train, validation=validation) # No test set
# data = RLRMCdataset(train=train) # No validation or test set

# Train the RLRMC model on the training data

In [176]:
model = RLRMCalgorithm(rank = rank_parameter,
                       C = regularization_parameter,
                       model_param = data.model_param,
                       initialize_flag = initialization_flag,
                       maxiter=maximum_iteration,
                       max_time=maximum_time)

In [177]:
model

<reco_utils.recommender.rlrmc.RLRMCalgorithm.RLRMCalgorithm at 0x211bd02ee80>

In [178]:
start_time = time.time()

model.fit(data,verbosity=verbosity)

# fit_and_evaluate will compute RMSE on the validation set (if given) at every iteration
# model.fit_and_evaluate(data,verbosity=verbosity)

train_time = time.time() - start_time # train_time includes both model initialization and model training time. 

print("Took {} seconds for training.".format(train_time))

Took 0.36005282402038574 seconds for training.



# Obtain predictions from the RLRMC model on the test data

In [179]:
## Obtain predictions on (userID,itemID) pairs (60586,54775) and (52681,36519) in Movielens 10m dataset
# output = model.predict([60586,52681],[54775,36519]) # Movielens 10m dataset

# Obtain prediction on the full test set
predictions_ndarr = model.predict(test['userID'].values,test['itemID'].values)

# Evaluation

In [180]:
predictions_df = pd.DataFrame(data={"userID": test['userID'].values, "itemID":test['itemID'].values, "prediction":predictions_ndarr})

## Compute test RMSE 
eval_rmse = rmse(test, predictions_df)
## Compute test MAE 
eval_mae = mae(test, predictions_df)

print("RMSE:\t%f" % eval_rmse,
      "MAE:\t%f" % eval_mae, sep='\n')

RMSE:	3.916396
MAE:	3.127031
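For reference, the two metrics reported above reduce to simple formulas over the prediction errors. The sketch below computes them directly with NumPy on a small made-up pair of arrays; the helper name `rmse_mae` and the toy values are assumptions for illustration, and it skips the step in which the reco_utils `rmse` and `mae` functions align true and predicted ratings on the `userID` and `itemID` columns.

```python
import numpy as np

def rmse_mae(y_true, y_pred):
    """Return (RMSE, MAE) for two equally long sequences of ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    rmse_value = np.sqrt(np.mean(err ** 2))  # penalizes large errors more strongly
    mae_value = np.mean(np.abs(err))         # average absolute deviation
    return rmse_value, mae_value

# Hypothetical true and predicted ratings, already aligned row by row
example_rmse, example_mae = rmse_mae([3, 1, 2, 1], [2, 1, 4, 1])
print("RMSE: %f" % example_rmse, "MAE: %f" % example_mae, sep="\n")
```

Because RMSE squares the errors before averaging, it is never smaller than MAE on the same predictions, so a large gap between the two usually points to a few badly mispredicted ratings.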


# Trying the same model with a chronological split

In [46]:
train, test = python_chrono_split(
    df,[0.8, 0.2]
)
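A chronological split only makes sense if the timestamps sort in time order. The `timestamp` column shown earlier appears to hold strings such as `6/10/2019 13:01`, which would sort lexically rather than chronologically unless they are parsed first. The sketch below is a simplified, hypothetical stand-in for a per-user chronological split (it is not the `python_chrono_split` implementation): it parses the strings with `pd.to_datetime`, orders each user's ratings by time, and keeps the earliest 80% for training. The toy rows and the helper name `chrono_split_per_user` are made up for this example.

```python
import pandas as pd

# Hypothetical toy ratings in the same string timestamp format as the data above
ratings = pd.DataFrame({
    "userID": [100] * 5 + [106] * 5,
    "itemID": [10, 11, 3, 4, 2, 2, 3, 4, 10, 11],
    "rating": [1, 3, 2, 1, 3, 1, 2, 3, 1, 2],
    "timestamp": [
        "10/6/2019 13:15", "7/23/2019 3:50", "4/2/2019 9:00",
        "12/1/2019 8:00", "1/15/2019 17:30",
        "4/27/2019 14:43", "10/3/2019 8:09", "2/1/2019 7:00",
        "11/11/2019 11:11", "6/6/2019 6:06",
    ],
})

# Parse the strings so that sorting follows time order, not character order
ratings["timestamp"] = pd.to_datetime(ratings["timestamp"], format="%m/%d/%Y %H:%M")

def chrono_split_per_user(df, ratio=0.8, col_user="userID", col_timestamp="timestamp"):
    """Put each user's earliest `ratio` share of ratings in train, the rest in test."""
    df = df.sort_values([col_user, col_timestamp])
    position = df.groupby(col_user).cumcount()                    # 0, 1, 2, ... per user
    n_per_user = df.groupby(col_user)[col_user].transform("size")
    is_train = position < (n_per_user * ratio).round()
    return df[is_train], df[~is_train]

toy_train, toy_test = chrono_split_per_user(ratings)
print(len(toy_train), len(toy_test))
```

With this kind of split, every test rating for a user is later than all of that user's training ratings, which mimics predicting future behavior from past behavior.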

In [47]:
# data = RLRMCdataset(train=train, validation=validation, test=test)
data = RLRMCdataset(train=train, test=test) # No validation set
# data = RLRMCdataset(train=train, validation=validation) # No test set
# data = RLRMCdataset(train=train) # No validation or test set

In [48]:
model = RLRMCalgorithm(rank = rank_parameter,
                       C = regularization_parameter,
                       model_param = data.model_param,
                       initialize_flag = initialization_flag,
                       maxiter=maximum_iteration,
                       max_time=maximum_time)

In [49]:
model

<reco_utils.recommender.rlrmc.RLRMCalgorithm.RLRMCalgorithm at 0x211bda1a8d0>

# Training the model

In [50]:
start_time = time.time()

model.fit(data,verbosity=verbosity)

# fit_and_evaluate will compute RMSE on the validation set (if given) at every iteration
# model.fit_and_evaluate(data,verbosity=verbosity)

train_time = time.time() - start_time # train_time includes both model initialization and model training time. 

print("Took {} seconds for training.".format(train_time))

Took 0.956524133682251 seconds for training.


In [51]:
## Obtain predictions on (userID,itemID) pairs (60586,54775) and (52681,36519) in Movielens 10m dataset
# output = model.predict([60586,52681],[54775,36519]) # Movielens 10m dataset

# Obtain prediction on the full test set
predictions_ndarr = model.predict(test['userID'].values,test['itemID'].values)

# Evaluation

In [52]:
predictions_df = pd.DataFrame(data={"userID": test['userID'].values, "itemID":test['itemID'].values, "prediction":predictions_ndarr})

## Compute test RMSE 
eval_rmse = rmse(test, predictions_df)
## Compute test MAE 
eval_mae = mae(test, predictions_df)

print("RMSE:\t%f" % eval_rmse,
      "MAE:\t%f" % eval_mae, sep='\n')

RMSE:	1.252853
MAE:	0.924643
