# Matrix completion and recommender systems

[MovieLens](movielens.umn.edu) data sets were collected by the [GroupLens Research Project](http://www.grouplens.org/) at the University of Minnesota.
 
This data set consists of:
- 100000 ratings (1-5) from 943 users on 1682 movies. 
- Each user has rated at least 20 movies.

The `movielens.csv` file contains the full dataset. Users and items are numbered consecutively from 1. The data is randomly ordered. Each row is a tab-separated record with the following fields:

```
user id | item id | rating | timestamp
```
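As an illustration of this layout, the sketch below parses a few tab-separated rows with named columns. The column names (`user_id`, `item_id`, `rating`, `timestamp`) are labels chosen here for readability, not names stored in the file, and the three sample rows are copied from the first records printed later in this notebook.

```python
import io
import pandas as pd

# Three sample rows in the tab-separated layout described above.
sample = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n22\t377\t1\t878887116\n"

# Column names are illustrative; the file itself has no header row.
columns = ["user_id", "item_id", "rating", "timestamp"]
frame = pd.read_csv(io.StringIO(sample), sep="\t", header=None, names=columns)

print(frame)
print(frame["rating"].mean())
```

Naming the columns is optional; the cells in this notebook keep the default integer labels `0` to `3`.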

In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.sparse import csr_matrix
from scipy.stats import pearsonr

Read the dataset from the `movielens.csv` file.

In [20]:
dataset = pd.read_csv("data/movielens.csv", sep="\t", header=None)
print(dataset.describe())
print(dataset)

                  0              1              2             3
count  100000.00000  100000.000000  100000.000000  1.000000e+05
mean      462.48475     425.530130       3.529860  8.835289e+08
std       266.61442     330.798356       1.125674  5.343856e+06
min         1.00000       1.000000       1.000000  8.747247e+08
25%       254.00000     175.000000       3.000000  8.794487e+08
50%       447.00000     322.000000       4.000000  8.828269e+08
75%       682.00000     631.000000       4.000000  8.882600e+08
max       943.00000    1682.000000       5.000000  8.932866e+08
         0     1  2          3
0      196   242  3  881250949
1      186   302  3  891717742
2       22   377  1  878887116
3      244    51  2  880606923
4      166   346  1  886397596
...    ...   ... ..        ...
99995  880   476  3  880175444
99996  716   204  5  879795543
99997  276  1090  1  874795795
99998   13   225  2  882399156
99999   12   203  3  879959583

[100000 rows x 4 columns]


How many movies? How many people? How many ratings?

In [21]:
# number of users (ids are shifted so that they start from 0)
users = np.array(dataset[0]) - 1
n = users.max() + 1
unique_users, unique_users_indices = np.unique(users, return_index=True)
print(unique_users_indices)
print("Number of users: %d" % n)

# number of movies (ids are shifted so that they start from 0)
movies = np.array(dataset[1]) - 1
p = movies.max() + 1
unique_movies, unique_movies_indices = np.unique(movies, return_index=True)
print("Number of movies: %d" % p)

# number of ratings
ratings = np.array(dataset[2])
d = len(ratings)
print("Number of ratings: %d" % d)

movies.shape

[  202   700  1257  1250   172     9    39    90  1892    40    88  2169
    63   257   206   234   955   182   170    61   390     2   307   936
    53   186  3364    98   760  1685  1628   100   311  1530  1466   251
   766    26  3965  2175   129    58   149   123   535  1804   838   678
   165    30  1572   216  1035   232  5895   204    68   114    54    67
   369    10    28   430   465   181  2131   168   350   252  1538    55
   492  2007   590  1193   316   616   587   177    51   131   222   152
   160  3462    56  1889   891   101   239    72   735   120    25   487
    34  2164    48   270   328    27  1632   353  1528  1071   623   803
   425   113   473   342   288   442     6   491   462   617    20  1740
   163    16   295  1040   219   560    79   121   531   194   549  4146
   348   403    99  1210   470    65  3555   899   767   372  1158   364
    94   746  1524   218  1065   646   229   584  1102   315   167   763
    35   297   736    29   217    91  3732   521   

(100000,)

Shuffle the data (see the function [`np.random.shuffle`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.shuffle.html)).

In [22]:
idx = np.arange(d)
np.random.seed(1)
np.random.shuffle(idx)
users = users[idx]
movies = movies[idx]
ratings = ratings[idx]
users

array([507, 517, 177, ...,  55, 881, 453])
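The reason for shuffling an index array, rather than each array separately, is that the three arrays must stay aligned row by row. A small sketch on toy data (the arrays `u`, `m`, `r` and the name `perm` are made up for illustration):

```python
import numpy as np

# Toy triplets: row k is (user u[k], movie m[k], rating r[k]).
u = np.array([0, 0, 1, 2, 2])
m = np.array([1, 3, 0, 2, 3])
r = np.array([5, 3, 4, 1, 2])

perm = np.arange(len(r))
np.random.seed(1)
np.random.shuffle(perm)  # shuffles in place and returns None

# Applying the same permutation to all three arrays keeps rows aligned.
u_s, m_s, r_s = u[perm], m[perm], r[perm]

original = set(zip(u.tolist(), m.tolist(), r.tolist()))
shuffled = set(zip(u_s.tolist(), m_s.tolist(), r_s.tolist()))
print(original == shuffled)
```

Calling `np.random.shuffle` on each array independently would instead pair users with ratings they never gave.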

Split the dataset into a subset of 80000 training ratings and 20000 testing ratings.

In [23]:
# idx is a random permutation, so its first 80000 entries select a random subset
training_idx = idx[:80000]
testing_idx = idx[80000:]

print(training_idx.size)
print(testing_idx.size)

training_users = users[training_idx]
training_movies = movies[training_idx]
training_ratings = ratings[training_idx]

testing_users = users[testing_idx]
testing_movies = movies[testing_idx]
testing_ratings = ratings[testing_idx]

80000
20000


Let us denote by $\Omega$ the set of pairs $(i,j)$ such that the rating of the $i$-th user on the $j$-th movie is available in the training set (similarly, $\Omega_{\text{test}}$ is the set of testing pairs).
Let us denote by $r_{ij}$ the corresponding rating.

Create a full matrix $X \in \mathbb{R}^{n \times p}$, such that:
$$
X_{i,j} = 
\begin{cases}
r_{ij} & \text{if } (i,j) \in \Omega\\
0& \text{otherwise}
\end{cases}
$$

In [24]:
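# pass the shape explicitly so that X keeps one row per user and one column per movie,
# even if the last user or movie does not appear in the training set
X = csr_matrix(
    (training_ratings, (training_users, training_movies)),
    shape=(users.max() + 1, movies.max() + 1),
).toarray()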
X

array([[0, 3, 0, ..., 0, 0, 0],
       [4, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [5, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 5, 0, ..., 0, 0, 0]])
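Because a zero in $X$ means "not rated" rather than "rated zero", it can help to keep a boolean mask of the observed entries alongside the matrix. A minimal sketch with NumPy only, on made-up triplets (the names `rows`, `cols`, `vals`, `X_toy` and `mask` are illustrative, and the sketch assumes each (user, movie) pair appears at most once):

```python
import numpy as np

# Toy training triplets: vals[k] is the rating of user rows[k] on movie cols[k].
rows = np.array([0, 0, 1, 2])
cols = np.array([1, 3, 0, 2])
vals = np.array([5, 3, 4, 1])
n_users, n_movies = 3, 4

# Dense matrix with zeros where no rating is available.
X_toy = np.zeros((n_users, n_movies))
X_toy[rows, cols] = vals

# Boolean mask marking the observed entries (the set Omega).
mask = np.zeros((n_users, n_movies), dtype=bool)
mask[rows, cols] = True

print(X_toy)
print(mask.sum())
```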

## Trivial recommender system

Create a trivial recommender system, based on the average rating of each user:
$$
r^{\text{pred}}_{ij} = \frac{1}{N_i} \sum_{k : (i,k) \in \Omega} r_{ik}
$$
where $N_i = \text{card}(\{k : (i,k) \in \Omega\})$ is the number of training ratings given by user $i$.
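One possible way to compute these per-user averages is with `np.bincount`, sketched here on toy data. The fallback to the global mean for users without any training rating is an assumption added for robustness; it is not part of the formula above.

```python
import numpy as np

# Toy training data: user ids and the ratings they gave.
train_users = np.array([0, 0, 1, 2, 2, 2])
train_ratings = np.array([4.0, 2.0, 5.0, 1.0, 2.0, 3.0])
n_users = 4  # user 3 has no training rating

# N_i and the sum of ratings for each user.
counts = np.bincount(train_users, minlength=n_users)
sums = np.bincount(train_users, weights=train_ratings, minlength=n_users)

# Assumption: users with no training rating get the global mean.
user_mean = np.full(n_users, train_ratings.mean())
rated = counts > 0
user_mean[rated] = sums[rated] / counts[rated]

# The prediction depends only on the user, not on the movie.
test_users = np.array([0, 2, 3])
pred = user_mean[test_users]
print(pred)
```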

Then compute the RMSE (root mean square error):
$$
\text{RMSE} = \sqrt{\frac{1}{\text{card}(\Omega_{\text{test}})} \sum_{(i,j) \in \Omega_{\text{test}}} (r_{ij} - r^{\text{pred}}_{ij})^2}
$$
and the Pearson correlation coefficient $\rho$ (use the function [scipy.stats.pearsonr](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html)):
$$
\rho = 
\frac
{
    \displaystyle\sum_{(i,j) \in \Omega_{\text{test}}} 
       (r_{ij} - \overline{r})
       (r^{\text{pred}}_{ij} - \overline{r}^{\text{pred}})
}
{\sqrt{
    \displaystyle\sum_{(i,j) \in \Omega_{\text{test}}} 
       (r_{ij} - \overline{r})^2
       }
\sqrt{
    \displaystyle\sum_{(i,j) \in \Omega_{\text{test}}} 
       (r^{\text{pred}}_{ij} - \overline{r}^{\text{pred}})^2
       }}
$$
where
$$
\begin{split}
\overline{r} &= \frac{1}{\text{card}(\Omega_{\text{test}})} \sum_{(i,j) \in \Omega_{\text{test}}} 
       r_{ij} 
\\
\overline{r}^{\text{pred}} &= \frac{1}{\text{card}(\Omega_{\text{test}})} \sum_{(i,j) \in \Omega_{\text{test}}} 
       r^{\text{pred}}_{ij} 
\end{split}
$$
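Both metrics can be written directly from the definitions. The sketch below uses toy arrays (`r_true` and `r_pred` are made-up values, and the helper names `rmse` and `pearson` are illustrative) and cross-checks the correlation against `np.corrcoef`; `scipy.stats.pearsonr` would be expected to return the same coefficient as its first value.

```python
import numpy as np

def rmse(r_true, r_pred):
    """Root mean square error between true and predicted ratings."""
    return np.sqrt(np.mean((r_true - r_pred) ** 2))

def pearson(r_true, r_pred):
    """Pearson correlation coefficient, following the formula term by term."""
    dt = r_true - r_true.mean()
    dp = r_pred - r_pred.mean()
    return np.sum(dt * dp) / (np.sqrt(np.sum(dt ** 2)) * np.sqrt(np.sum(dp ** 2)))

# Made-up test ratings and predictions.
r_true = np.array([5.0, 3.0, 4.0, 1.0, 2.0])
r_pred = np.array([4.5, 3.5, 3.0, 2.0, 2.5])

print(rmse(r_true, r_pred))
print(pearson(r_true, r_pred), np.corrcoef(r_true, r_pred)[0, 1])
```

Note that $\rho$ is undefined when all predictions are identical, since the second factor of the denominator is then zero.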

## Singular value truncation (SVT) based recommender system

Implement the SVT algorithm to predict the ratings of the testing set. Set a maximum number of iterations equal to 100. Print the RMSE and $\rho$ at each iteration. Finally, plot the trend of both metrics.

Try to calibrate the threshold to get better results.
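A minimal sketch of one common variant of the iteration, assuming hard truncation (singular values below the threshold are set to zero) and re-imposing the known training entries after each step. The function name `svt_complete`, its parameters, the stopping rule, and the toy rank-1 matrix are all illustrative assumptions; the exercise may intend a different variant, such as soft thresholding.

```python
import numpy as np

def svt_complete(X, mask, threshold, n_max_iter=100, tol=1e-6):
    """Fill the unobserved entries of X (where mask is False) by repeated
    truncated-SVD reconstruction. Hard-truncation variant (an assumption)."""
    A = X.astype(float).copy()
    for _ in range(n_max_iter):
        A_old = A.copy()
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        s[s < threshold] = 0.0   # drop the small singular values
        A = (U * s) @ Vt         # low-rank reconstruction
        A[mask] = X[mask]        # keep the known ratings fixed
        # stop when the relative change between iterates is small
        if np.linalg.norm(A - A_old) < tol * max(np.linalg.norm(A_old), 1.0):
            break
    return A

# Toy example: a rank-1 "rating" matrix with two hidden entries.
full = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
mask = np.ones_like(full, dtype=bool)
mask[0, 2] = False
mask[3, 0] = False
X_obs = np.where(mask, full, 0.0)

A = svt_complete(X_obs, mask, threshold=5.0, n_max_iter=500)
print(np.abs(A - full).max())
```

In the exercise, the RMSE and $\rho$ on the testing pairs would be evaluated inside the loop at each iteration. If the threshold is smaller than every singular value, nothing is truncated and the iterate never changes, which is one reason the threshold needs calibration.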