# Simple recommender system

In this notebook we will see a simple implementation of a Recommender System based on *Collaborative Filtering*. We will use the MovieLens dataset:

https://grouplens.org/datasets/movielens/ (the site offers datasets of different sizes; we use the small one described below, but you can try the others to check that everything still works)

Since we will store the full utility matrix, we consider a small dataset, in particular the one recommended for education and development (small version). This dataset contains approximately 100,000 ratings given by 600 users to 9,000 films.

Users and items (hereinafter, we will use the term "item" instead of "film") are identified by integers, and the rating is a number between 0.5 and 5, in half-star steps. A sample of the file, which also contains the timestamps, is:

```text
userId, movieId, rating, timestamp
1,1,4.0,964982703
1,3,4.0,964981247
1,6,4.0,964982224
1,47,5.0,964983815
1,50,5.0,964982931
1,70,3.0,964982400
1,101,5.0,964980868
```
We will not use the timestamps, although they could be useful to analyse recent ratings separately from older ones.

## Loading the data


We first define the function to load the data. As usual, we need to adapt such a function to the specific file input format.

In particular, we assign progressive numbers (starting from zero) to users and items, so that these identifiers can be used directly as row and column indexes of the utility matrix.

In [2]:
# The ids in the file are not guaranteed to be sequential (e.g. 1, 2, 5, 83), while the
# rows (users) and columns (items) of the utility matrix are indexed from zero.
# We therefore translate each userId into a matrix index with a dictionary
# (key = userId, value = row index), and do the same for the item ids (column index).
def load_data(filename):
    # We do not know the matrix size a priori, so we first collect
    # [user_index, item_index, rating] triples and build the matrix later
    input_lines = []
    users = {}  # userId -> row index
    num_users = 0
    items = {}  # movieId -> column index
    num_items = 0
    with open(filename, 'r') as f:  # read all the lines and close the file afterwards
        raw_lines = f.read().splitlines()
    # Remove the first line, since it is the header
    del raw_lines[0]
    for line in raw_lines:
        line_content = line.split(',')  # fields are comma-separated
        user_id = int(line_content[0])  # first field: user identifier
        item_id = int(line_content[1])  # second field: item identifier
        rating = float(line_content[2])  # third field: rating (a float from the start, since we will normalize later)
        # Translate the ids into matrix indexes
        if user_id not in users:
            users[user_id] = num_users  # new user: assign the next free row index
            num_users += 1
        if item_id not in items:
            items[item_id] = num_items  # new item: assign the next free column index
            num_items += 1
        input_lines.append([users[user_id], items[item_id], rating])
    return input_lines, num_users, num_items

The input file containing the dataset is called "3_ratings.csv".

On Colab, remember to mount your Drive:
```python
from google.colab import drive
drive.mount('/content/drive')
input_file = "/content/drive/My Drive/..."
```
Let's load our dataset and see its initial content:

In [3]:
input_file = "./3_ratings.csv"

input_ratings, num_users, num_items = load_data(input_file)

print("\nThe dataset contains", num_users, "users,", 
      num_items, "items, and", len(input_ratings),"ratings.\n")
print("The first five ratings are:", input_ratings[:5],"\n")


The dataset contains 610 users, 9724 items, and 100836 ratings.

The first five ratings are: [[0, 0, 4.0], [0, 1, 4.0], [0, 2, 4.0], [0, 3, 5.0], [0, 4, 5.0]] 



## Utility matrix

The utility matrix is sparse, so one usually works with lists (or other sparse representations) to avoid wasting space on zeros: building and keeping the full matrix in memory is inefficient. However, working with a matrix is more intuitive and our dataset is small, so we will use this approach.

The best way to handle a matrix and the operations associated with it is to use NumPy arrays. If you are not familiar with NumPy, you can find different tutorials online. See for instance:

https://www.kaggle.com/saptarsi/numpy-tutorial-notebook-sg  
https://cs231n.github.io/python-numpy-tutorial/

In [6]:
import numpy as np # use numpy to create the utility matrix

# The size of the matrix is known: create an NxI matrix of zeros, where N = number of users
# and I = number of items, and then fill it with the actual ratings
ratings = np.zeros((num_users, num_items)) # shape = (number of rows, number of columns)

# Fill the matrix with the ratings
for row in input_ratings: # each row is [user_index, item_index, rating]
    ratings[row[0], row[1]] = row[2]

# Compute the "sparsity", here defined as the percentage of non-zero cells
sparsity = 100*float(np.count_nonzero(ratings))/float(num_users*num_items) # works for any matrix size
# With 610 rows, 9724 columns and 100836 ratings this is 100*100836/(610*9724)
print("Sparsity: %.2f%%\n" % (sparsity))

# Show a snippet of the matrix
ratings

Sparsity: 1.70%



array([[4. , 4. , 4. , ..., 0. , 0. , 0. ],
       [0. , 0. , 0. , ..., 0. , 0. , 0. ],
       [0. , 0. , 0. , ..., 0. , 0. , 0. ],
       ...,
       [2.5, 2. , 0. , ..., 0. , 0. , 0. ],
       [3. , 0. , 0. , ..., 0. , 0. , 0. ],
       [5. , 0. , 5. , ..., 3. , 3.5, 3.5]])

# Preliminary analysis

In the following, we suggest a set of preliminary statistical analyses of the dataset that can be carried out with simple operations on the matrix:

### Question  Q1
<div class="alert alert-info">
For a given user id, find the number of rated items by that user, and the average rating.  
    
- **Hint**: Compute these values once for all users, and store them in another Nx2 matrix.
</div>

In [15]:
# Count how many items a user has rated and their average rating: this gives an idea of how active
# the user is and whether they tend to give low or high ratings

# Select a specific cell
print(ratings[3, 6]) # rating of user 3 for item 6
# Count how many items a user rated
ratings[3, :] # select the whole row corresponding to user 3
print(np.count_nonzero(ratings[3, :]))
# Repeat this for all users!

# Print the average rating of a user
print(np.sum(ratings[3, :])/np.count_nonzero(ratings[3, :]))

# The result goes into another matrix with the same rows as the utility matrix but only 2 columns:
# the number of items rated by that user and their average rating.

0.0
216
3.5555555555555554


In [None]:
# Given the utility matrix (users x items), compute for each user the number of items rated and the average rating
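One possible way to answer Q1 is sketched below. It is only an illustration on a made-up toy matrix: `toy_ratings`, `user_stats` and `stats` are hypothetical names, not part of the notebook, and users without ratings are assumed to get an average of 0.

```python
import numpy as np

# Toy utility matrix (hypothetical values): 3 users x 4 items, 0 = not rated
toy_ratings = np.array([
    [4.0, 0.0, 5.0, 3.0],
    [0.0, 0.0, 2.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],  # a user with no ratings at all
])

def user_stats(ratings):
    """Return an N x 2 array: [number of rated items, average rating] per user."""
    counts = np.count_nonzero(ratings, axis=1)
    sums = ratings.sum(axis=1)
    # Divide only where the user has at least one rating; other averages stay 0
    means = np.divide(sums, counts,
                      out=np.zeros_like(sums, dtype=float),
                      where=counts > 0)
    return np.column_stack((counts, means))

stats = user_stats(toy_ratings)
print(stats)
```

On the real data the same function could be called as `user_stats(ratings)`, and row `u` of the result would then describe the user with matrix index `u`.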

### Question  Q2
<div class="alert alert-info">
Find the top-k viewers, i.e., users that rated the highest number of items.
    
- *Variation*: find the bottom-k viewers.
</div>

In [7]:
# Sort the result of Q1

# We want the most active users (their standard deviation tends to be lower than that of inactive users)
# k can be 10 or 20, for example
# Since the result of Q1 is a NumPy array, take its first column and sort it with argsort,
# which returns the array of indexes that sorts the values in increasing order
# Negate the values to get a decreasing order instead
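The `argsort` trick described in the comments can be sketched as follows. The counts are invented for illustration, and `rating_counts` stands in for the first column of the Nx2 matrix suggested in the Q1 hint.

```python
import numpy as np

# Hypothetical per-user rating counts (first column of the N x 2 matrix from Q1)
rating_counts = np.array([12, 250, 3, 87, 40])
k = 2

# argsort sorts in increasing order, so negate the counts to get decreasing order
top_k_users = np.argsort(-rating_counts)[:k]
# For the bottom-k variation the increasing order is already what we need
bottom_k_users = np.argsort(rating_counts)[:k]

print("Top-k viewers (matrix indexes):", top_k_users)
print("Bottom-k viewers (matrix indexes):", bottom_k_users)
```

The returned values are matrix indexes, not the original `userId` values; mapping them back would require inverting the dictionary built while loading the data.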

The average is very different between users!

### Question  Q3
<div class="alert alert-info">
For a given item id, find the number of users that rated it, and its average rating.
    
- **Hint**: Compute these values once for all items, and store them in another Ix2 matrix.
</div>

In [17]:
# Same as Q1, but for items

print(ratings[3, 6])
ratings[:, 6]
print(np.count_nonzero(ratings[:, 6]))
# Repeat this for all items!


# Print the average rating of an item
print(np.sum(ratings[:, 6])/np.count_nonzero(ratings[:, 6]))


# The result is an Ix2 matrix: for each item, the number of users who rated it and its average rating

0.0
23
3.782608695652174


This gives the number of ratings of an item and its average rating.

### Question  Q4
<div class="alert alert-info">
Find the top-k items with at least v views, i.e., the k items with the highest average rate that has been rated by at least v users.
</div>

In [10]:
# Same as Q2, but for items
# First filter out the items without enough views (using a boolean mask), then sort the remaining ones by average rating
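A minimal sketch of the mask-then-sort idea, on a made-up toy matrix. The names `toy_ratings` and `top_k_items` are hypothetical, and the per-item counts and averages are recomputed inside the function so that the example is self-contained.

```python
import numpy as np

# Toy utility matrix (hypothetical values): 4 users x 4 items, 0 = not rated
toy_ratings = np.array([
    [5.0, 3.0, 0.0, 4.0],
    [4.0, 0.0, 0.0, 5.0],
    [5.0, 2.0, 5.0, 0.0],
    [0.0, 3.0, 0.0, 4.0],
])

def top_k_items(ratings, k, v):
    """Indexes of the k best-rated items among those rated by at least v users."""
    counts = np.count_nonzero(ratings, axis=0)
    sums = ratings.sum(axis=0)
    means = np.divide(sums, counts,
                      out=np.zeros_like(sums, dtype=float),
                      where=counts > 0)
    # Boolean mask: items without enough views get -inf, so they sort last
    masked_means = np.where(counts >= v, means, -np.inf)
    order = np.argsort(-masked_means)
    # Keep at most k items, and only those that passed the filter
    return [int(i) for i in order[:k] if counts[i] >= v]

print(top_k_items(toy_ratings, k=2, v=3))
```

In the toy matrix item 2 has the highest average but a single rating, so with `v=3` it is excluded; this is exactly the situation the threshold is meant to handle.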

### Question  Q5
<div class="alert alert-info">
Normalize the utility matrix by subtracting from each non-zero cell at row i the average rating of the user i.
</div>

In [1]:
# The average can be taken from the user or the item point of view, so we can compute two distinct normalized utility matrices
# To normalize by user, take the user's average rating and subtract it from each rated cell (only cells with a value > 0,
# since cells equal to 0 are items the user has not rated)

# Without normalization, the error of the Collaborative Filtering method can be very high; with
# normalization, predictions become more precise

# Create an empty matrix and copy into it the normalized values only for the cells > 0 (i.e., those with a rating)
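The normalization described in these comments could look like the sketch below. It runs on a made-up toy matrix; `normalize_by_user` and the toy names are hypothetical, and the item-based variant is obtained here simply by transposing, which is one possible choice rather than the notebook's prescribed solution.

```python
import numpy as np

# Toy utility matrix (hypothetical values): 2 users x 4 items, 0 = not rated
toy_ratings = np.array([
    [4.0, 0.0, 5.0, 3.0],
    [0.0, 1.0, 2.0, 0.0],
])

def normalize_by_user(ratings):
    """Subtract each user's average rating from that user's rated cells only."""
    rated = ratings != 0
    counts = rated.sum(axis=1)
    means = np.divide(ratings.sum(axis=1), counts,
                      out=np.zeros(ratings.shape[0]),
                      where=counts > 0)
    normalized = np.zeros(ratings.shape)
    # Unrated cells stay at 0; rated cells become rating minus user mean
    normalized[rated] = (ratings - means[:, np.newaxis])[rated]
    return normalized

toy_norm_u = normalize_by_user(toy_ratings)
# Item-based normalization: apply the same function to the transpose
toy_norm_i = normalize_by_user(toy_ratings.T).T
print(toy_norm_u)
```

One caveat of this representation: a rating that is exactly equal to the user's average becomes 0 and can no longer be told apart from a missing rating by code that relies on `nonzero()`.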

# Splitting the utility matrix

We want to separate the values of utility matrix into two sets, train and test:

- The train set is used to compute the similarity between users or items;
- Using the similarity, we will then predict the ratings;
- We compare the prediction with the values in the test set to compute the prediction error.


In [None]:
## Previous definition, SKIP IT
def train_test_split_old(ratings, sample_per_user=10):
    test = np.zeros(ratings.shape)
    train = ratings.copy()
    for user in range(ratings.shape[0]):
        test_ratings = np.random.choice(ratings[user, :].nonzero()[0], 
                                        size=sample_per_user, 
                                        replace=False)
        train[user, test_ratings] = 0.
        test[user, test_ratings] = ratings[user, test_ratings]
        
    # Test and training are truly disjoint
    assert(np.all((train * test) == 0)) 
    return train, test


In [19]:
# To split the matrix, create another matrix of the same size, copy some ratings from the input matrix into it,
# and set them to 0 in the training copy (as if they had never been rated)
def train_test_split(ratings, sample_per_user=10, seed = 123425536):
    test = np.zeros(ratings.shape) # new empty matrix with the same shape (rows, columns) as the input
    train = ratings.copy() # work on a copy, so the original dataset is not modified
    # For each user we take sample_per_user percent of their ratings (fair for both active and inactive users)
    # The ratings are picked at random; the seed makes the experiment repeatable
    np.random.seed(seed)
    for user in range(ratings.shape[0]): # go through all the rows (users)
        # .nonzero() returns a tuple of index arrays, one per dimension; the row is a
        # vector, so [0] holds the indexes of the items rated by this user
        num_ratings = len(ratings[user, :].nonzero()[0]) # number of ratings of this user
        # If it is 0, the user has no ratings (the row is all zeros)
        if num_ratings == 0:
            continue
        # Otherwise, take a percentage of them (rounded down)
        actual_sample = int(num_ratings*sample_per_user/100)
        # Randomly select that many rated items, without replacement
        test_ratings = np.random.choice(ratings[user, :].nonzero()[0], 
                                        size=actual_sample, 
                                        replace=False)
        # Use the selected indexes to set the ratings to zero in the training matrix
        train[user, test_ratings] = 0.
        # Copy the values from the original matrix (not from the copy, which has just been modified)
        test[user, test_ratings] = ratings[user, test_ratings]
        
    # Test and training are truly disjoint
    assert(np.all((train * test) == 0)) 
    return train, test

We run the above function using 10% of each user's ratings for the test set:

In [20]:
train, test = train_test_split(ratings, sample_per_user=10) # use 10% of the ratings for testing

print("Non-zero elements in ratings ", np.count_nonzero(ratings))
print("Non-zero elements in train ", np.count_nonzero(train))
print("Non-zero elements in test ", np.count_nonzero(test))

Non-zero elements in ratings  100836
Non-zero elements in train  91018
Non-zero elements in test  9818


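About 90% of the ratings end up in train and about 10% in test (slightly less than 10%, since the per-user sample size is rounded down).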

In [None]:
# This version can be used after answering Q5

train_u, test_u = train_test_split(ratings_norm_u, sample_per_user=10) # normalized over the user
train_i, test_i = train_test_split(ratings_norm_i, sample_per_user=10) # normalized over the item

## Computing the similarity matrix 

Using the train set, we compute the user-user similarity matrix (NxN) and the item-item similarity matrix (IxI). From the mathematical point of view, we have for users x and y:

$$
sim(x,y) = \cos(r_x, r_y) = \frac{\sum_i r_{xi} r_{yi}}{\sqrt{\sum_i r_{xi}^2}\sqrt{\sum_i r_{yi}^2}}
$$
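The result is the similarity matrix, which is square: it has 1s on the diagonal, and the entry at row x and column y is the similarity between users x and y. We use the cosine similarity since, when computed on the normalized (mean-centred) values, it corresponds to the correlation.

The denominator is the product of the norms of the two rating vectors.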

A similar computation can be done for the item-item similarity.

Each matrix is computed using matrix operations.
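To see why a single matrix product is enough, the sketch below compares the matrix-based computation with the formula applied directly to one pair of users. The toy matrix is invented, all its rows are non-zero, and no epsilon is added, so this is a simplified illustration rather than the function used in the notebook.

```python
import numpy as np

# Toy train matrix (hypothetical values): 3 users x 3 items, 0 = not rated
toy_train = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 1.0],
    [0.0, 2.0, 5.0],
])

# Entry (x, y) of this product is sum_i r_xi * r_yi, for all pairs at once
dot_products = toy_train @ toy_train.T
# The diagonal holds sum_i r_xi^2, i.e. the squared norm of each user vector
norms = np.sqrt(np.diag(dot_products))
sim_matrix = dot_products / np.outer(norms, norms)

# The same similarity for users 0 and 1, written exactly as in the formula
r_x, r_y = toy_train[0], toy_train[1]
sim_direct = np.sum(r_x * r_y) / (np.sqrt(np.sum(r_x ** 2)) * np.sqrt(np.sum(r_y ** 2)))

print(sim_matrix[0, 1], sim_direct)
```

For the item-item case the same reasoning applies to the columns, that is, to `toy_train.T @ toy_train`.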

In [23]:
# Focus on the user-user case
# Multiplying the rating matrix by its transpose gives all the pairwise dot products in a single matrix multiplication
def compute_similarity(ratings, kind='user', epsilon=1e-9):
    # epsilon -> small number for handling divide-by-zero errors
    if kind == 'user':
        # The matrix has many zeros, so a user (or item) with no ratings in the train set has norm 0 and
        # normalizing would divide by 0 -> adding a small epsilon avoids the problem
        sim = ratings.dot(ratings.T) + epsilon # all the pairwise dot products between users
    elif kind == 'item':
        sim = ratings.T.dot(ratings) + epsilon
    norms = np.array([np.sqrt(np.diagonal(sim))]) # the diagonal holds each vector's dot product with itself, i.e. its squared norm
    # norms is a row vector with one entry per user (or item)
    return (sim / norms / norms.T) # divide each entry (x, y) by norm(x) * norm(y)

We are now ready to compute the two matrices:

In [24]:
user_similarity = compute_similarity(train, kind='user')
item_similarity = compute_similarity(train, kind='item')

# Show the first values of the item-item similarity matrix
print(item_similarity[:4, :4])

[[1.         0.24628441 0.32583414 0.39350118]
 [0.24628441 1.         0.21485671 0.23276016]
 [0.32583414 0.21485671 1.         0.44114166]
 [0.39350118 0.23276016 0.44114166 1.        ]]


In [None]:
# This version can be used after answering Q5

user_similarity_n = compute_similarity(train_u, kind='user')
item_similarity_n = compute_similarity(train_i, kind='item')

# Show the first values of the item-item similarity matrix
print(item_similarity_n[:4, :4])

## Computing the prediction and the prediction error

Using the similarity matrix, we predict the rating of missing values. In case of user-user similarity, if we want to predict item i for user x, we have:

$$
\hat{r}_{xi} = \frac{\sum_y sim(x,y) r_{yi}}{\sum_y sim(x,y)}
$$
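As a small worked example of this formula, the sketch below computes one prediction both with matrix operations and with explicit sums. The toy ratings and similarities are invented for illustration only.

```python
import numpy as np

# Toy train matrix (hypothetical values): 3 users x 3 items, 0 = not rated
toy_train = np.array([
    [5.0, 0.0, 1.0],
    [4.0, 2.0, 0.0],
    [0.0, 3.0, 5.0],
])
# Hypothetical user-user similarity matrix
toy_sim = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])

# Vectorized: row x of (sim @ ratings) holds sum_y sim(x, y) * r_yi for every item i
pred_all = toy_sim @ toy_train / toy_sim.sum(axis=1, keepdims=True)

# The same prediction for user x = 0 and item i = 1, written as explicit sums
x, i = 0, 1
numerator = sum(toy_sim[x, y] * toy_train[y, i] for y in range(3))
denominator = sum(toy_sim[x, y] for y in range(3))
pred_direct = numerator / denominator

print(pred_all[x, i], pred_direct)
```

Note that the sums run over all users y, including those who did not rate item i and therefore contribute a 0 to the numerator while still counting in the denominator. This pulls the predictions towards 0 and is worth keeping in mind when reading the error values.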



In [26]:
def predict_simple(ratings, similarity, kind='user'):
    if kind == 'user':
        return similarity.dot(ratings) / np.array([similarity.sum(axis=1)]).T # similarity matrix x ratings matrix
    elif kind == 'item':
        return ratings.dot(similarity) / np.array([similarity.sum(axis=1)])

To compute the error (i.e., how good our method is), we use the mean squared error from the scikit-learn library:

In [27]:
# The test matrix holds the actual ratings that were hidden from the train set -> we evaluate only on them
# The prediction matrix has a value for every cell, so we keep only the cells that have a rating in the test set
from sklearn.metrics import mean_squared_error

def get_mse(pred, actual):
    # Ignore zero terms: keep only the cells with a rating in actual
    pred = pred[actual.nonzero()].flatten() # predictions for the cells rated in the test set
    actual = actual[actual.nonzero()].flatten() # the corresponding actual ratings
    return mean_squared_error(actual, pred)
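The masking step can be checked by hand on a tiny example. The sketch below uses plain NumPy instead of scikit-learn and invented values, so it only illustrates which cells enter the error.

```python
import numpy as np

# Hypothetical predictions and test matrix: 2 users x 2 items, 0 = not in the test set
toy_pred = np.array([[3.5, 2.0],
                     [4.0, 1.0]])
toy_test = np.array([[4.0, 0.0],
                     [0.0, 2.0]])

# nonzero() returns the (row, column) indexes of the cells that hold a test rating
mask = toy_test.nonzero()
errors = toy_pred[mask] - toy_test[mask]  # 3.5 - 4.0 and 1.0 - 2.0
toy_mse = np.mean(errors ** 2)            # (0.25 + 1.0) / 2

print(toy_mse)
```

The predictions 2.0 and 4.0 sit in cells without a test rating, so they do not influence the result.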

In [28]:
user_prediction = predict_simple(train, user_similarity, kind='user') # predict from the train set using the user-user similarity
item_prediction = predict_simple(train, item_similarity, kind='item')

print('User-based CF MSE: ', get_mse(user_prediction, test))
print('Item-based CF MSE: ', get_mse(item_prediction, test))

User-based CF MSE:  10.14829318830978
Item-based CF MSE:  11.1352103623954


The MSE is measured in squared rating units and ratings are at most 5, so an MSE of about 10 (an error of roughly 3 stars per rating) is very high!

In [None]:
# This version can be used after answering Q5

user_prediction_n = predict_simple(train_u, user_similarity_n, kind='user')
item_prediction_n = predict_simple(train_i, item_similarity_n, kind='item')

print('User-based CF MSE: ', get_mse(user_prediction_n, test_u))
print('Item-based CF MSE: ', get_mse(item_prediction_n, test_i))

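With the normalized matrices, item-item similarity performs slightly better than user-user similarity.

An error of about 1 is a good starting point for a basic Collaborative Filtering method. It is still fairly high because it is computed over all ratings, high and low alike; the questions below look at more targeted evaluations.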

## Additional questions

The above procedure computes the similarity and the prediction for all users or all items. It would be interesting to work on single users, and see if we can improve the error for that user.

### Question  Q6
<div class="alert alert-info">
Consider the user similarity matrix and the item similarity matrix. 
For a given user id, consider the ratings in the test set. Predict those ratings using the user-user similarity matrix or the item-item similarity matrix, and compute the error.
    
- **Note**: We are not interested in the error computed over all users, but only in the error for a specific user.
</div>

In [None]:
# your answer here
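One way to approach Q6 is to restrict the error computation to a single row of the prediction and test matrices. The sketch below assumes that a full prediction matrix is already available; `user_mse` and the toy matrices are hypothetical names and values used only for illustration.

```python
import numpy as np

def user_mse(pred, test, user):
    """MSE restricted to the test ratings of a single user (row index)."""
    rated = test[user].nonzero()[0]  # items this user has in the test set
    if rated.size == 0:
        return float("nan")  # nothing to evaluate for this user
    return float(np.mean((pred[user, rated] - test[user, rated]) ** 2))

# Hypothetical predictions and test ratings: 2 users x 3 items, 0 = not in the test set
toy_pred = np.array([[3.0, 4.5, 2.0],
                     [1.0, 2.0, 3.0]])
toy_test = np.array([[4.0, 0.0, 2.0],
                     [0.0, 0.0, 0.0]])

print(user_mse(toy_pred, toy_test, user=0))
```

With the notebook's variables, a call such as `user_mse(user_prediction, test, user=3)` would give the user-based error for the user with matrix index 3, and passing `item_prediction` instead would give the item-based one.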

### Question  Q7
<div class="alert alert-info">
Repeat the above computation, but consider the error for only the top-5 recommended items.
</div>

In [None]:
# Focus only on the high ratings of a specific user
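A possible reading of Q7 is the following: among the items a user has in the test set, take the k with the highest predicted rating and measure the error only on those. The sketch below follows this interpretation, which is an assumption; `top_k_user_mse` and the toy values are hypothetical.

```python
import numpy as np

def top_k_user_mse(pred, test, user, k=5):
    """MSE over the k test items of a user with the highest predicted rating."""
    rated = test[user].nonzero()[0]  # items this user has in the test set
    if rated.size == 0:
        return float("nan")
    # Rank the test items by predicted rating, highest first, and keep k of them
    order = np.argsort(-pred[user, rated])[:k]
    top_items = rated[order]
    return float(np.mean((pred[user, top_items] - test[user, top_items]) ** 2))

# Hypothetical predictions and test ratings: 1 user x 4 items, 0 = not in the test set
toy_pred = np.array([[4.8, 1.0, 4.0, 3.0]])
toy_test = np.array([[5.0, 2.0, 0.0, 4.0]])

print(top_k_user_mse(toy_pred, toy_test, user=0, k=2))
```

Item 2 has a high prediction but no test rating, so it cannot be evaluated and is skipped; the two items actually used are 0 and 3.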

### Question  Q8
<div class="alert alert-info">
Repeat the computations for questions Q5 and Q6, but, as recommendation, consider the top 30 most similar users or items, and check if this has an impact on the error.
</div>

In [None]:
# The similarity was computed against all users, including those with low similarity; now focus only on the most similar ones.
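One way to restrict the prediction to the most similar users is to zero out, in each row of the similarity matrix, everything except the k largest values, and then apply the same weighted average as before. The sketch below is an assumption about how this could be done; `predict_top_k_users` and the toy matrices are hypothetical, and each user counts among their own neighbours.

```python
import numpy as np

def predict_top_k_users(ratings, similarity, k=30):
    """User-based prediction that only uses each user's k most similar users."""
    pruned = np.zeros_like(similarity)
    for user in range(similarity.shape[0]):
        # Indexes of the k largest similarities in this row (the user itself included)
        neighbours = np.argsort(-similarity[user])[:k]
        pruned[user, neighbours] = similarity[user, neighbours]
    return pruned @ ratings / pruned.sum(axis=1, keepdims=True)

# Toy train matrix and hypothetical similarity matrix: 3 users x 3 items
toy_train = np.array([
    [5.0, 0.0, 1.0],
    [4.0, 2.0, 0.0],
    [0.0, 3.0, 5.0],
])
toy_sim = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])

pred_k2 = predict_top_k_users(toy_train, toy_sim, k=2)
pred_k3 = predict_top_k_users(toy_train, toy_sim, k=3)
print(pred_k2)
```

With k equal to the number of users nothing is pruned, so the result coincides with the full weighted average; comparing the error for different values of k shows whether the low-similarity users help or hurt.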