## Recommender systems - understanding the basics

> Notes on the lecture series by Machine Learning at VU University Amsterdam: https://mlvu.github.io/

---

## Introduction

### The Netflix task

The Netflix task was a milestone among ML challenges and strongly influenced later competition platforms such as Kaggle.


#### Collaborative filtering

- Source of data: explicit user ratings. We ask users which movies they like or dislike, which gives us a small data set
- Based on this __explicit feedback__, we have to predict ratings for movies a user has not rated yet
- The name comes from the users collaborating: together they provide the ratings that help filter the movies each of them will like
- The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on an issue, A is more likely to have B's opinion on a different issue than that of a randomly chosen person. For example, a collaborative filtering recommendation system for preferences in television programming could make predictions about which television show a user should like given a partial list of that user's tastes (likes or dislikes). Note that these predictions are specific to the user, but use information gleaned from many users. 
- The main drawback is that the data is very __sparse__: we may have only a few ratings, or none at all, for a given user



### The recommendation task

The recommendation task is applicable when the problem follows this framework:

- We have 2 large sets of things and a particular relation between them, e.g. user liked movie OR user bought book OR user rated movie 2 out of 5
- Often the two sides are a set of users and a set of items, but this is not necessary. For example, recipes and ingredients linked by the property "has_ingredient". Likewise, from which politician voted for which law, we can analyze the general voting behavior of elected officials
- The left and right sides may even be the same set, e.g. which people should be friends
- Key property: we have no information/features about the items in the 2 sets; we only know the relation that exists between them. If we do have some features, we treat them as secondary information. We want to base our predictions primarily on the relation linking the 2 classes of objects

![](https://i.imgur.com/n47BOm6.png)

![](https://i.imgur.com/DNFKzBZ.png)

On most social media sites, such as Facebook, YouTube and Twitter, the main content shown on the site is driven by recommendations.


### Representing the problem

Let us consider the movie recommendation example. We have numeric ratings, which may be negative to signify that the user dislikes the movie.
This is the easiest setting to work with. We can visualize the problem as a matrix R with one row per user and one column per movie, where each known entry holds the rating that user gave that movie.

This matrix is sparse, and we do not have any features for the users or the movies.

![](https://i.imgur.com/bQSttPB.png)
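As a rough illustration of this representation, the sketch below builds a small ratings matrix from (user, movie, rating) triples. The user names, movie titles and ratings are made up for the example, and using `np.nan` to mark an unknown rating is an assumption of this sketch rather than something prescribed by the lecture.

```python
import numpy as np

# Hypothetical observations: (user, movie, rating); negative means "dislikes".
ratings = [
    ("ann", "Alien", 2),
    ("ann", "Heat", -1),
    ("bob", "Alien", -2),
    ("cat", "Up", 1),
]

# Map each user and movie to a row / column index.
users = sorted({u for u, _, _ in ratings})
movies = sorted({m for _, m, _ in ratings})
u_idx = {u: i for i, u in enumerate(users)}
m_idx = {m: j for j, m in enumerate(movies)}

# Rows are users, columns are movies; NaN marks a rating we do not know.
R = np.full((len(users), len(movies)), np.nan)
for u, m, r in ratings:
    R[u_idx[u], m_idx[m]] = r

observed = ~np.isnan(R)            # boolean mask of known ratings
sparsity = 1.0 - observed.mean()   # fraction of missing entries
print(R)
print(f"{observed.sum()} known ratings, {sparsity:.0%} missing")
```

Even in this toy case more than half of the entries are missing, which is the sparsity problem mentioned above.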

#### Embedding vectors

In the word embedding problem, we represented each word by a vector and learned the values in the vectors.

Here we do the same:

- we assign a vector of initially random numbers to each user and each movie
- the dimension k of the vectors must be the same for every user and every movie
- we arrange the embedding vectors into matrices where each column represents a user/movie

![](https://i.imgur.com/AGtEh5s.png)
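A minimal sketch of this initialization, assuming NumPy and made-up sizes (4 users, 5 movies, k = 3). The normal distribution and its scale are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility

# Hypothetical sizes: number of users, number of movies, embedding dimension.
n_u, n_m, k = 4, 5, 3

# One column per user / movie, so both matrices have k rows.
U = rng.normal(scale=0.1, size=(k, n_u))  # user embeddings
M = rng.normal(scale=0.1, size=(k, n_m))  # movie embeddings

print(U.shape, M.shape)
```

The embedding of user `i` is the column `U[:, i]`, and the embedding of movie `j` is the column `M[:, j]`.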

Now imagine we had to solve this problem ourselves. We could start by hand-crafting some features for each user and each movie, e.g. `likes_romance, likes_action` for a user and `has_romance, has_action` for a movie.

We then want a scoring function `score(u, m)` that returns:

- a high positive number if the user would like the movie
- a high negative number if the user would dislike the movie
- a value near zero if the user is mostly ambivalent about the movie

One such scoring function is the dot product between the user embedding vector and the movie embedding vector:

![](https://i.imgur.com/KA4DMSb.png)
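To make the dot-product score concrete, here is a small sketch with hand-crafted two-dimensional features. The feature values are invented for illustration; in the actual method these numbers are learned rather than written by hand.

```python
import numpy as np

# Hypothetical hand-crafted features, in the order [romance, action].
user = np.array([1.0, -0.5])     # likes romance, mildly dislikes action
movie_a = np.array([0.9, 0.1])   # mostly romance
movie_b = np.array([0.0, 1.0])   # pure action


def score(u, m):
    """Dot product of a user vector and a movie vector."""
    return float(np.dot(u, m))


print(score(user, movie_a))  # positive: tastes and content line up
print(score(user, movie_b))  # negative: the user dislikes what the movie offers
```

Each feature contributes positively when the user's taste and the movie's content have the same sign, and negatively when they disagree.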

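So if we denote the user embedding matrix as U and the movie embedding matrix as M, we want

U^T . M ≈ R

Let the number of users be n_u, the number of movies n_m and the embedding size k. Then U is k x n_u, M is k x n_m, and the product U^T . M is n_u x n_m, the same shape as R.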

![](https://i.imgur.com/AtoBr4H.jpeg)

![](https://i.imgur.com/NXRCrVi.jpeg)

![](https://i.imgur.com/dAONcD8.png)

The matrix above holds a predicted rating for every single user-movie pair in our data.
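The shape bookkeeping can be checked with a short sketch, again assuming NumPy and made-up sizes. It also confirms that entry (i, j) of the product is the dot product of user i's column and movie j's column.

```python
import numpy as np

rng = np.random.default_rng(0)
n_u, n_m, k = 4, 5, 3            # hypothetical sizes

U = rng.normal(size=(k, n_u))    # one column per user
M = rng.normal(size=(k, n_m))    # one column per movie

# (n_u, k) @ (k, n_m) -> (n_u, n_m): one predicted rating per user-movie pair.
R_hat = U.T @ M
print(R_hat.shape)

# A single entry is just the dot product of the two embedding vectors.
i, j = 2, 1
print(np.isclose(R_hat[i, j], U[:, i] @ M[:, j]))
```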





### Recommender systems as a Matrix Decomposition/Optimization Problem

In the explanation above we saw that if we have movie features and user features, we can construct a prediction of how much a user will like a movie.

Now let us look at the problem the other way around:

- Given a rating matrix R, __decompose it as a product of 2 factors U and M.__
- If we find U and M such that U^T . M is close to R, then we can assume that __the matrices U and M contain meaningful embeddings for this use case__
- This is called the __Matrix Factorization/Decomposition__ approach to recommender systems

![](https://i.imgur.com/N4JQNy4.png)

In the last line, u_i . m_j is simply the dot product that computes the predicted rating. We take its difference from the given rating to get the error, and square it.

We sum this over all user-movie pairs and optimize for the U and M that give the least total error.
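Written out in LaTeX, and assuming the notation of these notes ($u_i$ is the embedding of user $i$, $m_j$ the embedding of movie $j$, and $R_{ij}$ the given rating), the objective described here is roughly:

$$
\min_{U, M} \; \sum_{i=1}^{n_u} \sum_{j=1}^{n_m} \left( R_{ij} - u_i^\top m_j \right)^2
$$

Because the difference is squared, it makes no difference whether we subtract the prediction from the rating or the rating from the prediction.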

#### Challenges

- R has a lot of missing values
- If we fill them in with 0, we are essentially telling our model to predict a rating of 0 for these unknown values, which is not reflective of the true rating

#### Solution

![](https://i.imgur.com/DgH3GYv.png)

Here i and j iterate only over those elements of the matrix for which we know the rating.

This makes the optimization a bit more difficult but leads to better models.
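One way to sketch this restricted sum is with a boolean mask over the known entries. Marking missing ratings with `np.nan` is an assumption of this example, and the tiny matrices are made up so the result can be checked by hand.

```python
import numpy as np


def masked_squared_error(R, U, M):
    """Sum of squared errors over the known entries of R only (NaN = unknown)."""
    observed = ~np.isnan(R)
    # Unknown entries contribute an error of 0 instead of being treated as rating 0.
    E = np.where(observed, R - U.T @ M, 0.0)
    return float(np.sum(E ** 2))


# Toy example with k = 1, two users and two movies.
U = np.array([[1.0, 2.0]])            # user embeddings, shape (1, 2)
M = np.array([[1.0, 3.0]])            # movie embeddings, shape (1, 2)
R = np.array([[2.0, np.nan],
              [np.nan, 5.0]])         # only two ratings are known

# Predictions U^T . M are [[1, 3], [2, 6]]; known errors are 2-1 and 5-6.
print(masked_squared_error(R, U, M))
```

Only the two known ratings enter the sum; the predictions for the unknown entries are free to take any value.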



### Solving the optimization problem

The obvious solution is to use gradient descent.

Alternatively, we can use a method called alternating optimization.

#### Alternating Least Squares

![](https://i.imgur.com/vKMpyIj.png)
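The alternating idea can be sketched as follows: hold M fixed and solve a small least-squares problem for each user, then hold U fixed and do the same for each movie. This is a minimal sketch under stated assumptions, not the lecture's implementation: missing ratings are marked with `np.nan`, the toy matrix is invented, and the small ridge term `lam` is added here only to keep each linear system solvable.

```python
import numpy as np


def masked_loss(R, U, M):
    """Squared error over known ratings only."""
    observed = ~np.isnan(R)
    return float(np.sum(np.where(observed, R - U.T @ M, 0.0) ** 2))


def als(R, k=2, n_iters=20, lam=0.1, seed=0):
    """Alternating least squares; lam is a ridge term assumed by this sketch."""
    rng = np.random.default_rng(seed)
    n_u, n_m = R.shape
    observed = ~np.isnan(R)
    U = rng.normal(scale=0.1, size=(k, n_u))
    M = rng.normal(scale=0.1, size=(k, n_m))
    ridge = lam * np.eye(k)

    for _ in range(n_iters):
        # Fix M: each user's vector is a least-squares fit to the movies they rated.
        for i in range(n_u):
            cols = observed[i]
            if cols.any():
                Mo = M[:, cols]                                  # (k, #rated)
                U[:, i] = np.linalg.solve(Mo @ Mo.T + ridge, Mo @ R[i, cols])
        # Fix U: each movie's vector is a least-squares fit to the users who rated it.
        for j in range(n_m):
            rows = observed[:, j]
            if rows.any():
                Uo = U[:, rows]                                  # (k, #raters)
                M[:, j] = np.linalg.solve(Uo @ Uo.T + ridge, Uo @ R[rows, j])
    return U, M


# Hypothetical toy ratings: 4 users x 5 movies, NaN = unknown.
R = np.array([
    [2.0, np.nan, np.nan, -1.0, np.nan],
    [np.nan, 1.0, np.nan, np.nan, 2.0],
    [-2.0, np.nan, 1.0, np.nan, np.nan],
    [np.nan, np.nan, np.nan, 2.0, -1.0],
])

U, M = als(R)
final_loss = masked_loss(R, U, M)
print(final_loss)
```

Each inner step has a closed-form solution because, with one factor fixed, the objective is an ordinary linear least-squares problem in the other.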



### Solution walkthrough


<img src="https://i.imgur.com/b41meOm.jpeg" width=800 height=800>


<img src="https://i.imgur.com/QW6rT40.jpeg" width=800 height=800>


<img src="https://i.imgur.com/fGNJlXs.jpeg" width=800 height=800>


<img src="https://i.imgur.com/9HIDGhX.jpeg" width=800 height=800>


`E_lj is a scalar, so E_l. is the vector [E_l1, E_l2, ..., E_lm]`

<img src="https://i.imgur.com/s6z3Gi4.jpeg" width=800 height=800>


> Note the use of the dot (.) in E_l. It indicates that this is [E_l1, E_l2, ..., E_lm], i.e. the vector of errors for the lth user with respect to each of the m movies

Each such element can be computed from eqn 1:

`E_li = R_li - u_l . m_i`

Thus the gradient update for a user's embedding vector can be computed using the last equation in the image above.
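Putting the pieces together, here is a rough gradient descent sketch. It assumes the squared-error loss over known ratings, for which differentiating with respect to u_l gives -2 times the sum over rated movies j of E_lj m_j, and symmetrically for each movie vector. The learning rate `lr`, the step count `n_steps`, the `np.nan` convention for missing ratings and the toy matrix are all choices made for this example.

```python
import numpy as np


def gradient_descent(R, k=2, lr=0.01, n_steps=2000, seed=0):
    """Plain gradient descent on the squared error over known ratings (NaN = unknown)."""
    rng = np.random.default_rng(seed)
    n_u, n_m = R.shape
    observed = ~np.isnan(R)
    U = rng.normal(scale=0.1, size=(k, n_u))   # one column per user
    M = rng.normal(scale=0.1, size=(k, n_m))   # one column per movie
    losses = []

    for _ in range(n_steps):
        # E[l, j] = R_lj - u_l . m_j for known ratings, 0 for unknown ones.
        E = np.where(observed, R - U.T @ M, 0.0)
        losses.append(float(np.sum(E ** 2)))

        # Column l of grad_U is -2 * sum_j E_lj * m_j; grad_M is the mirror image.
        grad_U = -2.0 * M @ E.T                # shape (k, n_u)
        grad_M = -2.0 * U @ E                  # shape (k, n_m)

        U -= lr * grad_U
        M -= lr * grad_M
    return U, M, losses


# Hypothetical toy ratings: 4 users x 5 movies, NaN = unknown.
R = np.array([
    [2.0, np.nan, np.nan, -1.0, np.nan],
    [np.nan, 1.0, np.nan, np.nan, 2.0],
    [-2.0, np.nan, 1.0, np.nan, np.nan],
    [np.nan, np.nan, np.nan, 2.0, -1.0],
])

U, M, losses = gradient_descent(R)
print(losses[0], losses[-1])
```

Setting the error of unknown entries to 0 means they contribute nothing to either gradient, so only known ratings move the embeddings.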

---
