# Some Memory-Based RecSys Examples

Let's start with the most basic approach, using a popular
(lightweight) dataset: [MovieLens](http://files.grouplens.org/datasets/movielens/ml-20m.zip).

In [7]:
import numpy as np
import pandas as pd
from collections import defaultdict

## Load MovieLens Data

In [2]:
data_dir = "../data/ml-20m"
ratings = pd.read_csv(f"{data_dir}/ratings.csv")

## Similar users

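We work under the assumption that users are similar if they have rated similar movies in a similar way. So, given two users $u$ and $u'$, we consider a similarity measure $d(u,u')$ between them. There are many possible choices, but for this part let's just use the cosine similarity (often loosely called the cosine distance), defined as $$d(u, u') = \frac{\sum_i r_{ui} r_{u'i}}{\sqrt{\sum_i r_{ui}^2}\sqrt{\sum_i r_{u'i}^2}}$$ where $r_{ui}$ is the rating that user $u$ gave to movie $i$, taken to be 0 if the user has not rated that movie. The larger the value, the more similar the two users are.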
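To make the formula concrete, here is a minimal sketch that computes it for two users stored as `{movieId: rating}` dictionaries. The function name `cosine_similarity` and the toy ratings are made up for illustration; they are not part of the MovieLens data or of the code used later in this notebook.

```python
import math

def cosine_similarity(ratings_u, ratings_v):
    """Cosine similarity between two users given as {movieId: rating} dicts.

    Movies a user has not rated count as 0, so only movies rated by
    both users contribute to the numerator.
    """
    common = ratings_u.keys() & ratings_v.keys()
    dot = sum(ratings_u[m] * ratings_v[m] for m in common)
    norm_u = math.sqrt(sum(r * r for r in ratings_u.values()))
    norm_v = math.sqrt(sum(r * r for r in ratings_v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a user without ratings is similar to nobody
    return dot / (norm_u * norm_v)

# Hypothetical toy ratings, not taken from MovieLens
alice = {1: 5.0, 2: 3.0, 3: 4.0}
bob = {1: 4.0, 3: 5.0, 4: 2.0}
print(cosine_similarity(alice, bob))
```

Since ratings are non-negative, the result always lies between 0 and 1, and a user compared with themselves scores 1.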

Now let's recommend movies to a user. This should be easy, right? Well, not really: the naive approaches require a large amount of memory and compute.

**Important:** Why?
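One way to get a feel for this question is a back-of-the-envelope estimate of how much memory a dense matrix would need. The sketch below assumes 8-byte floats, and the user and movie counts are rough assumptions for ml-20m rather than values read from the data, so check them against your own copy.

```python
def dense_matrix_gib(n_rows, n_cols, bytes_per_value=8):
    """Memory in GiB for a dense n_rows x n_cols matrix (float64 by default)."""
    return n_rows * n_cols * bytes_per_value / 2**30

# Assumed approximate sizes for ml-20m (verify against your copy of the data)
n_users, n_movies = 138_000, 27_000

print(f"user x movie ratings matrix:   {dense_matrix_gib(n_users, n_movies):.1f} GiB")
print(f"user x user similarity matrix: {dense_matrix_gib(n_users, n_users):.1f} GiB")
```

Try the same estimate with only 100 movies to see how much restricting the catalogue helps, and which of the two matrices it does not help with.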

So let's focus only on the 100 most popular movies (those rated most often by users).

### Exercise:

Try to collect the users that have rated any of the top 100 movies. The solution is below because we need that data, but make sure to give it a try before reading it. 

In case you didn't solve the problem, here is a solution:

In [3]:
popular = ratings[['movieId', 'userId']].groupby('movieId', as_index=False).agg(len).sort_values('userId').tail(100)
ratings_small = ratings[ratings['movieId'].isin(popular['movieId'].values)].copy(deep=True)

Note the number of users that we have now:

In [4]:
ratings_small.nunique()

userId        134428
movieId          100
rating            10
timestamp    2956091
dtype: int64

Now, suppose that we have a user and we want to find the most similar users.

### Exercise:
- What is the computational complexity of this?
- Is it the same if there are 1000 products? A million?
- What if you want to compute the similarity between all users?

The following is a naive implementation for finding the similarity score:

In [5]:
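def score_for_user_naive(userId, ratings):
    """Cosine similarity between `userId` and every user found in `ratings`.

    Note: this adds the helper columns `rating_squared` and `rating_prod`
    to `ratings` in place.
    """

    def prod(row):
        # Product of this rating with the target user's rating of the same movie;
        # movies the target user has not rated contribute 0.
        movieId = int(row['movieId'])
        if movieId in user_vec:
            return user_vec[movieId] * row['rating']
        else:
            return 0.0

    # Ratings of the target user as a {movieId: rating} lookup
    user_ratings = ratings[ratings['userId'] == userId]
    user_vec = {int(movieId): rating for movieId, rating in user_ratings[['movieId', 'rating']].values}
    user_vec = defaultdict(float, user_vec)

    # Squared norm of the target user's rating vector (the square root is taken below)
    user_vec_norm = (user_ratings.rating ** 2).sum()

    # Per-rating terms of the squared norms and of the dot products
    ratings['rating_squared'] = ratings['rating'] ** 2
    ratings['rating_prod'] = ratings[['movieId', 'rating']].apply(prod, axis=1)

    # Sum the terms per user, then combine them into the cosine similarity
    scores = ratings.groupby('userId', as_index=False).agg({'rating_squared': 'sum', 'rating_prod': 'sum'})
    scores["cosine"] = scores["rating_prod"] / np.sqrt(scores['rating_squared'] * user_vec_norm)

    # Exercise: How does the following scale? What can you do instead of this?
    # Hint: Think about how the data is going to be used
    scores = scores.sort_values('cosine')

    return scores[['userId', 'cosine']]
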

And here are the similarity scores of all users with respect to userId = 1:

In [8]:
scores = score_for_user_naive(1.0, ratings_small)

There's another approach which consists of doing a matrix multiplication. 

### Exercise:
- Implement the matrix multiplication approach
- Did you encounter any problem?


**Advanced:**
- Implement the sparse version of the above. Hint: Think about a sparse row matrix (for example `scipy.sparse.csr_matrix`) and multiprocessing if you want to speed this up. How fast can you make it run?
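If you get stuck on the exercise, the following is one possible dense sketch of the matrix multiplication idea on a tiny, made-up user-by-movie matrix. The matrix `R` and the helper `cosine_similarity_matrix` are illustrative assumptions; building `R` from `ratings_small` and the sparse variant are still left to you.

```python
import numpy as np

# Hypothetical toy user x movie matrix; 0 means "not rated"
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 0.0, 0.0, 0.0],   # a user with no ratings
])

def cosine_similarity_matrix(R):
    """All pairwise cosine similarities between the rows of R."""
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    norms[norms == 0] = 1.0      # avoid dividing by zero for empty rows
    R_unit = R / norms           # every non-empty row now has length 1
    return R_unit @ R_unit.T     # entry (u, v) is the cosine of rows u and v

S = cosine_similarity_matrix(R)
print(np.round(S, 3))
```

Note that `S` has one row and one column per user, which is worth keeping in mind for the second question of the exercise.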

The results are sorted in ascending order of similarity, so the most similar users are in the last rows. Let's look at the last 11:

In [None]:
scores.tail(11)

        userId    cosine
123178  126885  0.756793
28058    28916  0.757676
8258      8508  0.758243
24488    25236  0.760611
20547    21174  0.763546
8863      9129  0.766365
78833    81275  0.768049
110283  113612  0.770371
21221    21870  0.777655
118841  122435  0.790843


Let's check what user 1 and user 122435 have in common.

In [None]:
two_users = ratings_small[(ratings_small['userId'] == 1) | (ratings_small['userId'] == 122435)] \
.groupby('userId').agg({'movieId':(list,len)})
two_users

                                                  movieId
                                                     list len
userId
1       [32, 47, 50, 253, 260, 293, 296, 318, 367, 541...  36
122435  [32, 47, 50, 110, 150, 165, 253, 260, 293, 296...  46


In [None]:
len(set(two_users.values[0][0]).intersection(set(two_users.values[1][0])))

31

So the two users have 31 rated movies in common.

### Exercise: 

Investigate how the two users' ratings relate to each other. That is, do they rate movies in a similar way?

We could use user 122435 to help user 1 choose the next item, by simply recommending the highest-rated movie that user 122435 has seen and user 1 hasn't. But there are more users, and more ratings. We can leverage everyone's information by computing a score for each movie according to all users, where each rating is weighted by how similar its author $u'$ is to user $u$.
$$\hat{r}_{ui} = \frac{\sum\limits_{u'} sim(u, u') r_{u'i}}{\sum\limits_{u'}|sim(u, u')|}$$
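As a small illustration of this formula, here is a sketch on made-up data. The function `predict_rating`, its arguments and the toy values are hypothetical. It also makes one assumption that the formula leaves open: only users who actually rated the movie enter the two sums.

```python
def predict_rating(similarities, ratings_by_user, movie_id):
    """Similarity-weighted average of the ratings given to movie_id.

    similarities:    {other_user: sim(u, other_user)}
    ratings_by_user: {other_user: {movieId: rating}}
    Assumption: users who did not rate movie_id are skipped in both sums.
    """
    num = 0.0
    den = 0.0
    for other, sim in similarities.items():
        rating = ratings_by_user.get(other, {}).get(movie_id)
        if rating is None:
            continue
        num += sim * rating
        den += abs(sim)
    # No similar user rated the movie: there is nothing to predict from
    return num / den if den > 0 else None

# Hypothetical toy data
sims = {"a": 0.9, "b": 0.5, "c": 0.1}
ratings_by_user = {"a": {10: 5.0, 20: 2.0}, "b": {10: 3.0}, "c": {20: 4.0}}
print(predict_rating(sims, ratings_by_user, 10))
```

A recommendation for user $u$ then amounts to computing this score for every movie that $u$ has not rated yet and picking the highest ones.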

**Coding exercise:** 

Find a recommendation for user 1 using the previous approach.

**Coding challenge:** 

Can you do the same with 10000 movies? With all the movies? How fast does your code run? What about if you use C++ or Java? 

**Extra Questions:** 

If you were able to finish the exercises above, try exploring/coding the following.

- Potential problems of this approach?
- How to validate?
- What if we only choose the top k closest?
- What happens if we need to recommend to every user?
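For the top-k question, one possible starting point is to select the k most similar users without sorting everybody. The sketch below uses `heapq.nlargest` from the standard library; the helper name `top_k_neighbours` and the similarity values are made up for illustration.

```python
import heapq

def top_k_neighbours(similarities, k, exclude=None):
    """Return the k (user, similarity) pairs with the largest similarity.

    heapq.nlargest avoids sorting the full list, which matters when
    there are many users and k is small.
    """
    candidates = ((u, s) for u, s in similarities.items() if u != exclude)
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

# Hypothetical similarities to user 1 (user 1 itself is excluded below)
sims = {1: 1.0, 2: 0.31, 3: 0.79, 4: 0.05, 5: 0.66}
print(top_k_neighbours(sims, k=2, exclude=1))
```

The selected neighbours can then be used in place of the full user set when computing the weighted score.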
