
# Intro


**Notes**

The bulk of the material comes from https://developers.google.com/machine-learning/recommendation/overview/candidate-generation. If you want to go further later, you can take a look at http://nicolas-hug.com/blog/matrix_facto_3. You are not expected to read either link for the interviews or to complete the test.

**Context**: 

We want to build a movie recommender in order to find new movies to watch during the lockdown. We will base our work on a variation of the MovieLens dataset.
The data consists of the movies seen by the users, some information about the movies, and some information about the users. The problem consists in predicting which movies a given user might like.

We first present a naive approach so that you can familiarize yourself with the problem and see how it might be solved.

**Task**:

The code presented is a first implementation but has a number of shortcomings in its structure and features (more on that in the conclusion). Your task consists in refactoring it, so as to be one step closer to "clean" code.

**Evaluation**:

Our goal here is threefold:
- See how you understand a problem and adapt to an already given approach to tackle it.
- See how you can design new features.
- See how you manipulate Python code: understanding, ideas for refactoring, etc.

The projects will be evaluated on the quality of the source code produced.

# The data

First, let's load some data.

In [1]:
import pandas as pd
import numpy as np

users = pd.read_csv("data/users.csv")
print(users.shape)
users.head()

In [2]:
movies = pd.read_csv("data/movies.csv")
movies.head()

In [3]:
ratings = pd.read_csv("data/ratings.csv")
ratings.head()

Unnamed: 0,user_id,movie_id,rating
0,0,1192,5
1,0,660,3
2,0,913,3
3,0,3407,4
4,0,2354,5


# Content-based Filtering

Content-based filtering uses item features to recommend other items similar to what the user likes, based on their previous actions or explicit feedback. We don't use other users' information!

For example, if user `A` liked `Harry Potter 1`, they will probably like `Harry Potter 2`.

In [4]:
%%html
<img src='https://miro.medium.com/max/1642/1*BME1JjIlBEAI9BV5pOO5Mg.png' height="300" width="250"/>

What are similar movies? To answer this question, we need to build a similarity measure.

## Features

This measure will operate on the characteristics (**features**) of the movies to determine which ones are close. In our case, we have access to the genres of the movies. For example, the genres of `Toy Story` are: `Animation`, `Children's` and `Comedy`. This is represented as follows in our dataset:

In [5]:
genre_cols = ["Animation", "Children's", 
       'Comedy', 'Adventure', 'Fantasy', 'Romance', 'Drama',
       'Action', 'Crime', 'Thriller', 'Horror', 'Sci-Fi', 'Documentary', 'War',
       'Musical', 'Mystery', 'Film-Noir', 'Western']

genre_and_title_cols = ['title'] + genre_cols 

movies[genre_and_title_cols].head()

Unnamed: 0,title,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,Toy Story,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Jumanji,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Grumpier Old Men,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Waiting to Exhale,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Father of the Bride Part II,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Similarity

Now that we have some features, we will try to find a function that measures similarity. The similarity function will take two items (two lists of features) and return a number that increases with their similarity.

In the following, we will consider that the similarity between two movies is the number of genres they have in common.

Here is an example with `Toy Story` and `E.T.`:

In [6]:
toy_story_genres = movies[genre_and_title_cols].loc[movies.title == 'Toy Story'][genre_cols].iloc[0]
toy_story_genres

Animation      1.0
Children's     1.0
Comedy         1.0
Adventure      0.0
Fantasy        0.0
Romance        0.0
Drama          0.0
Action         0.0
Crime          0.0
Thriller       0.0
Horror         0.0
Sci-Fi         0.0
Documentary    0.0
War            0.0
Musical        0.0
Mystery        0.0
Film-Noir      0.0
Western        0.0
Name: 0, dtype: float64

In [7]:
et_genres = movies[genre_and_title_cols].loc[movies.title == 'E.T. the Extra-Terrestrial'][genre_cols].iloc[0]
et_genres

Animation      0.0
Children's     1.0
Comedy         0.0
Adventure      0.0
Fantasy        1.0
Romance        0.0
Drama          1.0
Action         0.0
Crime          0.0
Thriller       0.0
Horror         0.0
Sci-Fi         1.0
Documentary    0.0
War            0.0
Musical        0.0
Mystery        0.0
Film-Noir      0.0
Western        0.0
Name: 1081, dtype: float64

In [8]:
et_genres.values * toy_story_genres

Animation      0.0
Children's     1.0
Comedy         0.0
Adventure      0.0
Fantasy        0.0
Romance        0.0
Drama          0.0
Action         0.0
Crime          0.0
Thriller       0.0
Horror         0.0
Sci-Fi         0.0
Documentary    0.0
War            0.0
Musical        0.0
Mystery        0.0
Film-Noir      0.0
Western        0.0
Name: 0, dtype: float64

In [9]:
(et_genres.values * toy_story_genres).sum() # scalar product

1.0

So our similarity measure returns `1.0` for these two movies. 

Let's see another example where we compare `Toy Story` and `Pocahontas`:

In [10]:
pocahontas_genres = movies[genre_and_title_cols].loc[movies.title == 'Pocahontas'][genre_cols].iloc[0]
(pocahontas_genres.values * toy_story_genres).sum()

2.0

This tells us that `Pocahontas` is closer to `Toy Story` than `E.T.` is, which makes sense.


## Scaling up

OK, that's a nice measure. Now we are going to scale it up to all the movies in our dataset. To do so smartly, let's look at the operation we just did from a mathematical point of view. We will think of the list of features of a movie as a row vector `V`. Then, our similarity measure between `Toy Story` and `E.T.` becomes:
$V_{ToyStory} \cdot V_{ET}^{T}$

More generally, the similarity measure between a movie `i` and another movie `j` is: $V_{i} \cdot V_{j}^{T}$
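
As a worked instance of this product, here are the terms for the genres that at least one of the two movies has (every other genre contributes $0 \cdot 0$). The values are read from the genre flags listed earlier for `Toy Story` and `E.T.`:

$$
V_{ToyStory} \cdot V_{ET}^{T}
= \underbrace{1 \cdot 0}_{\text{Animation}}
+ \underbrace{1 \cdot 1}_{\text{Children's}}
+ \underbrace{1 \cdot 0}_{\text{Comedy}}
+ \underbrace{0 \cdot 1}_{\text{Fantasy}}
+ \underbrace{0 \cdot 1}_{\text{Drama}}
+ \underbrace{0 \cdot 1}_{\text{Sci-Fi}}
= 1
$$

Only `Children's` is shared, so the product counts exactly one common genre.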

Now we can think of `movies` as a matrix containing all the feature vectors describing the movies. Here is how our similarity measure looks in this context:

![](imgs/dot_product_matrices.png)

To obtain the similarity between all the movies in our dataset, we take the dot product of the `movies` matrix with its transpose.

In [11]:
similarity = movies[genre_cols].values.dot(movies[genre_cols].values.T)
similarity.shape

(3883, 3883)
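
To make the matrix product concrete, here is a minimal self-contained sketch on the five movies shown in the genre table above, restricted to the first seven genre columns (the remaining columns are all zero for these movies). The names `features` and `toy_similarity` are only illustrative and are not part of the notebook's code.

```python
import numpy as np

# One row per movie, one column per genre, in this order:
# Animation, Children's, Comedy, Adventure, Fantasy, Romance, Drama.
features = np.array([
    [1, 1, 1, 0, 0, 0, 0],  # Toy Story
    [0, 1, 0, 1, 1, 0, 0],  # Jumanji
    [0, 0, 1, 0, 0, 1, 0],  # Grumpier Old Men
    [0, 0, 1, 0, 0, 0, 1],  # Waiting to Exhale
    [0, 0, 1, 0, 0, 0, 0],  # Father of the Bride Part II
], dtype=float)

# Entry (i, j) is the number of genres shared by movie i and movie j.
toy_similarity = features @ features.T

print(toy_similarity.shape)  # one row and one column per movie
print(toy_similarity[0])     # similarities between Toy Story and each movie
```

The result is symmetric, and each diagonal entry is simply the number of genres of the corresponding movie, which is why a movie is always at least as similar to itself as to any other movie.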

We can now get the similarity between `Toy Story` and any other movie of our dataset

In [12]:
similarity_with_toy_story = similarity[0] # 0 is Toy Story
similarity_with_toy_story

array([3., 1., 1., ..., 0., 0., 0.])

In [13]:
for i in range(10):
    print(f"Similarity between Toy story and {movies.iloc[i]['title']} (index {i}) is {similarity_with_toy_story[i]}")

Similarity between Toy story and Toy Story (index 0) is 3.0
Similarity between Toy story and Jumanji (index 1) is 1.0
Similarity between Toy story and Grumpier Old Men (index 2) is 1.0
Similarity between Toy story and Waiting to Exhale (index 3) is 1.0
Similarity between Toy story and Father of the Bride Part II (index 4) is 1.0
Similarity between Toy story and Heat (index 5) is 0.0
Similarity between Toy story and Sabrina (index 6) is 1.0
Similarity between Toy story and Tom and Huck (index 7) is 1.0
Similarity between Toy story and Sudden Death (index 8) is 0.0
Similarity between Toy story and GoldenEye (index 9) is 0.0


## A bit of polishing

### Helpers:

We also built some helpers to handle the movies dataset:

In [14]:
from src.content_based_filtering.helpers.movies import get_movie_id, get_movie_name, get_movie_year

print(get_movie_id(movies, 'Toy Story'))
print(get_movie_id(movies, 'Die Hard'))

print(get_movie_name(movies, 0))
print(get_movie_name(movies, 1000))
print(get_movie_year(movies, 1000))

0
1023
Toy Story
Parent Trap, The
1961


### Finding similar movies:
Here is a function that returns the movies most similar to a given movie:

In [15]:
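def get_most_similar(similarity, movie_name, year=None, top=10):
    """Return (index, title, similarity) tuples for the movies closest to `movie_name`."""
    index_movie = get_movie_id(movies, movie_name, year)
    # Movie indices sorted by decreasing similarity to the query movie
    best = similarity[index_movie].argsort()[::-1]
    # Keep the first `top` indices, skipping the query movie itself
    return [(ind, get_movie_name(movies, ind), similarity[index_movie, ind]) for ind in best[:top] if ind != index_movie]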

In [16]:
get_most_similar(similarity, 'Toy Story')

[(667, 'Space Jam', 3.0),
 (3685, 'Adventures of Rocky and Bullwinkle, The', 3.0),
 (3682, 'Chicken Run', 3.0),
 (2009, 'Jungle Book, The', 3.0),
 (2011, 'Lady and the Tramp', 3.0),
 (2012, 'Little Mermaid, The', 3.0),
 (2033, 'Steamboat Willie', 3.0),
 (2072, 'American Tail, An', 3.0),
 (2073, 'American Tail: Fievel Goes West, An', 3.0)]

In [17]:
get_most_similar(similarity, 'Psycho', 1960) 

[(3593, "Puppet Master III: Toulon's Revenge", 2.0),
 (2923, 'Rawhead Rex', 2.0),
 (1312, 'Believers, The', 2.0),
 (3407, "Jacob's Ladder", 2.0),
 (1957, 'Disturbing Behavior', 2.0),
 (1927, 'Poltergeist III', 2.0),
 (1926, 'Poltergeist II: The Other Side', 2.0),
 (1925, 'Poltergeist', 2.0),
 (732, 'Thinner', 2.0),
 (69, 'From Dusk Till Dawn', 2.0)]
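
`get_most_similar` keeps the first `top` indices and only then drops the query movie, which is probably why `Toy Story` gets nine results while `Psycho` gets ten. Below is a minimal sketch of a variant that removes the query before truncating. `top_k_similar` is a hypothetical name; it works on any square NumPy similarity matrix and returns indices and scores only (no titles).

```python
import numpy as np

def top_k_similar(similarity, index, k=10):
    """Return up to k (index, score) pairs most similar to `index`, excluding itself."""
    scores = similarity[index]
    # Sort by decreasing score; a stable sort keeps tied movies in index order
    order = np.argsort(-scores, kind="stable")
    # Drop the query movie before truncating, so k results remain when available
    order = order[order != index]
    return [(int(i), float(scores[i])) for i in order[:k]]

# Tiny made-up similarity matrix, for illustration only
toy_similarity = np.array([
    [3.0, 1.0, 1.0],
    [1.0, 3.0, 0.0],
    [1.0, 0.0, 2.0],
])
print(top_k_similar(toy_similarity, 0, k=2))
```

Using a stable sort also makes the order of tied movies deterministic, which can help when comparing outputs between runs.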

### Giving a recommendation:

And finally, let's find some movies to recommend based on previously liked movies:

In [18]:
def get_recommendations(user_id):
    # The user's three highest-rated movies
    top_movies = ratings[ratings['user_id'] == user_id].sort_values(by='rating', ascending=False).head(3)['movie_id']
    index = ['movie_id', 'title', 'similarity']

    # Collect the movies most similar to each of those three
    most_similars = []
    for top_movie in top_movies:
        most_similars += get_most_similar(similarity, get_movie_name(movies, top_movie), get_movie_year(movies, top_movie))

    # Remove duplicates and keep the five candidates with the highest similarity
    return pd.DataFrame(most_similars, columns=index).drop_duplicates().sort_values(by='similarity', ascending=False).head(5)

get_recommendations(0)


Unnamed: 0,movie_id,title,similarity
0,957,"African Queen, The",4.0
1,1630,Starship Troopers,4.0
2,2253,Soldier,4.0
3,1178,Star Wars: Episode V - The Empire Strikes Back,4.0
26,2072,"American Tail, An",3.0


In [19]:
get_recommendations(1000)

Unnamed: 0,movie_id,title,similarity
0,69,From Dusk Till Dawn,3.0
1,1599,"Devil's Advocate, The",3.0
2,2757,"13th Warrior, The",2.0
3,3701,Dreamscape,2.0
4,2044,Graveyard Shift,2.0
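
`get_recommendations` does not check whether the user has already rated a recommended movie. Here is a minimal sketch of such a filter, assuming a ratings table with `user_id` and `movie_id` columns like the one loaded above. `drop_already_rated` is a hypothetical helper and the toy tables are made up for illustration.

```python
import pandas as pd

def drop_already_rated(recommendations, ratings, user_id):
    """Remove recommended movies that `user_id` has already rated."""
    seen = ratings.loc[ratings["user_id"] == user_id, "movie_id"]
    return recommendations[~recommendations["movie_id"].isin(seen)]

# Tiny made-up tables, for illustration only
toy_ratings = pd.DataFrame({
    "user_id": [0, 0, 1],
    "movie_id": [10, 11, 12],
    "rating": [5, 3, 4],
})
toy_recommendations = pd.DataFrame({
    "movie_id": [10, 12, 13],
    "similarity": [3.0, 3.0, 2.0],
})
print(drop_already_rated(toy_recommendations, toy_ratings, user_id=0))
```

Applying the filter before the final `head(5)` rather than after it would avoid returning fewer than five recommendations.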


# Conclusion:

The code presented is a first implementation, but it has a number of shortcomings that prevent collaboration between multiple MLEs and Data Scientists:
- New features cannot easily be introduced, mainly because the code is just a bunch of functions in one file.
- The code cannot be scaled to other datasets or to variations of the task.
- There is no evaluation of performance.
- There is no testing.

Additionally, we could think of some features to add. For example, what about looking at similar users to find a recommendation for our target user?

# My Implementation

I added two Python files to the project structure:
- `model.py` holds the recommendation system and only contains the `Model` class.
- `make_dataset.py` holds all the other objects (`Movies`, `User`, `UserDB`, `Ratings`).

The only special requirement is `tqdm`, which you need to install to see the fancy progress bars.

In [20]:
from src.content_based_filtering.helpers.make_dataset import Movies, Ratings, User, UserDB
from src.content_based_filtering.helpers.model import Model

## Reindexing the df

The original implementation assumed that `movie_id` and the DataFrame index coincide, which they do not. As a direct consequence, the similarity matrix represented the similarity between rows (data points), not between `movie_id`s. I chose not to keep this assumption and reindexed the DataFrame so that it has the correct mapping. To do so, I added a single preprocessing layer to the `Movies` class, so that it scales easily to another dataset. Note that this step leaves the DataFrame unchanged if the `movie_id`s already coincide with the index.
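
The preprocessing code inside the `Movies` class is not shown here, so the following is only a sketch of one way to get such a mapping with pandas. `reindex_by_movie_id` is a hypothetical helper, and it assumes that `movie_id` values are unique non-negative integers.

```python
import pandas as pd

def reindex_by_movie_id(movies):
    """Return a copy of `movies` whose row position equals `movie_id`."""
    n_rows = int(movies["movie_id"].max()) + 1
    # Rows are created for missing ids and filled with NaN
    padded = movies.set_index("movie_id").reindex(range(n_rows))
    # Restore `movie_id` as a regular first column, including for the new rows
    padded.insert(0, "movie_id", padded.index)
    padded.index.name = None
    return padded

# Made-up example with a gap at id 1
toy_movies = pd.DataFrame({"movie_id": [0, 2, 3], "title": ["A", "B", "C"]})
reindexed = reindex_by_movie_id(toy_movies)
print(reindexed)
```

With this layout, row `i` of a similarity matrix built from the DataFrame refers to `movie_id == i`; the padded rows have no genres, so they would need to be handled (for example filled with zeros) before computing similarities.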

## Instantiation Phase

In [21]:
MoviesDB = Movies(movies)
RatingsDB = Ratings(ratings)
UsersDB = UserDB(list(users.user_id), users, RatingsDB)

The next cell regenerates the encoded ratings, which takes several minutes. Uncomment the `to_pickle` line to cache the result; on later runs, you can skip this cell and load the pickle in the cell after it to save computing time.

In [22]:
encoded_ratings = UsersDB.get_encoded_ratings_db(MoviesDB)
# encoded_ratings.to_pickle("./data/encoded_ratings.pkl")

100%|██████████████████████████████████████████████████████████████████████████████| 6040/6040 [05:35<00:00, 17.98it/s]


In [23]:
# encoded_ratings = pd.read_pickle("./data/encoded_ratings.pkl")

We then normalize each user's vector of encoded ratings to unit length, so that the dot product of a user with themselves is always one and the similarity matrix has ones on its diagonal.

In [24]:
# DataFrame.apply defaults to axis=0, so the lambda receives one column at a time
encoded_ratings = encoded_ratings.apply(lambda column: column / np.linalg.norm(column))

In [25]:
users_similarity_matrix = UsersDB.get_similarity_matrix(encoded_ratings)
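
The internals of `get_similarity_matrix` are not shown here. As a sketch of the idea, assuming the encoded ratings form a matrix with one row per movie and one column per user, normalizing each column and taking the dot product gives the cosine similarity between users. `cosine_similarity_matrix` is a hypothetical name and the toy matrix is made up.

```python
import numpy as np

def cosine_similarity_matrix(encoded):
    """Cosine similarity between the columns of `encoded` (n_movies x n_users)."""
    encoded = np.asarray(encoded, dtype=float)
    norms = np.linalg.norm(encoded, axis=0)
    # A user with no ratings has norm 0; dividing by 1 instead keeps that
    # column at zero rather than producing NaN.
    norms[norms == 0] = 1.0
    unit = encoded / norms
    return unit.T @ unit

# Three users: the first two rated the same single movie, the third rated two others
toy_encoded = np.array([
    [5, 5, 0],
    [0, 0, 3],
    [0, 0, 4],
])
toy_user_similarity = cosine_similarity_matrix(toy_encoded)
print(toy_user_similarity)
```

Users with identical rating patterns get a similarity of one, and users with no movie in common get zero.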

## Prediction Phase

In [26]:
model = Model(UsersDB, MoviesDB)

In [27]:
user_test = User(3000,users, RatingsDB)

The following line takes a long time to run; uncomment it if you wish to test it.

In [28]:
# content_based_full_prediction = model.predict_content_based()

In [29]:
random_user_prediction = model.predict_content_based_one_user(3000)
random_user_prediction.head()

Unnamed: 0,movie_id,title,similarity
28,150,Rob Roy,3.0
27,3116,Ride with the Devil,3.0
26,1093,"Crying Game, The",3.0
25,919,Gone with the Wind,3.0
24,3121,Santa Fe Trail,3.0


In [30]:
pool_based_prediction = model.predict_similar_users_one_user(user_test, users_similarity_matrix, UsersDB, 5)
pool_based_prediction

Unnamed: 0,user_id,movie_id,rating
919670,5554,109,5
720013,4311,295,5
314631,1878,439,5
314632,1878,440,5
314636,1878,456,5


## Evaluation Phase

We now need to assess the performance of our content-based recommendation system, so let's build an evaluation pipeline.
There are two cases. Either the user has already seen some of the movies we recommend, which we would ideally avoid because users generally don't want to re-watch movies they have already seen. Or the user has not seen them, in which case no rating is available and, without feedback, we cannot directly evaluate our pick. In that second case, we can look at how similar users rated the recommended movie and compute a score from that.

With the chosen metric (a score based on the ratings given by similar users), the higher the score, the better the recommendation.
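
The exact formula used by `model.score` is not shown here. As one possible reading of the idea, the sketch below sums, over the recommended movies, the mean rating given by the similar users who rated each movie. `neighbour_score` and the toy ratings are hypothetical.

```python
import pandas as pd

def neighbour_score(recommended_ids, neighbour_ids, ratings):
    """Sum over recommended movies of the mean rating given by the neighbours."""
    mask = (
        ratings["user_id"].isin(neighbour_ids)
        & ratings["movie_id"].isin(recommended_ids)
    )
    # Movies that no neighbour rated are absent here and contribute nothing
    per_movie_mean = ratings[mask].groupby("movie_id")["rating"].mean()
    return float(per_movie_mean.sum())

# Made-up ratings: users 1 and 2 are the neighbours, user 3 is ignored
toy_ratings = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "movie_id": [10, 10, 20, 20],
    "rating": [5, 4, 3, 1],
})
toy_score = neighbour_score([10, 20, 30], [1, 2], toy_ratings)
print(toy_score)
```

A score of this kind grows with the number of recommended movies that neighbours rated, so it is only comparable between recommendation lists of the same length.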

In [31]:
usr_test = User(3000, users, RatingsDB)
usr_test2 = User(456, users, RatingsDB)

In [32]:
model.score(users_similarity_matrix, random_user_prediction, usr_test, UsersDB)

9.5

In [33]:
model.score(users_similarity_matrix, random_user_prediction, usr_test2, UsersDB)

11.666666666666668

## Test Phase

### Test every method of every class to check for problems

In [34]:
print(MoviesDB.get_most_similar_movies('Toy Story'))
print(MoviesDB.get_movie_id('Toy Story'))
print(MoviesDB.get_movie_year(0))

[(672, 'Space Jam', 3.0), (3753, 'Adventures of Rocky and Bullwinkle, The', 3.0), (3750, 'Chicken Run', 3.0), (2077, 'Jungle Book, The', 3.0), (2079, 'Lady and the Tramp', 3.0), (2080, 'Little Mermaid, The', 3.0), (2101, 'Steamboat Willie', 3.0), (2140, 'American Tail, An', 3.0), (2141, 'American Tail: Fievel Goes West, An', 3.0)]
0
1995.0


In [35]:
print(RatingsDB.get_user_ratings(0))

    user_id  movie_id  rating
0         0      1192       5
1         0       660       3
2         0       913       3
3         0      3407       4
4         0      2354       5
5         0      1196       3
6         0      1286       5
7         0      2803       5
8         0       593       4
9         0       918       4
10        0       594       5
11        0       937       4
12        0      2397       4
13        0      2917       4
14        0      1034       5
15        0      2790       4
16        0      2686       3
17        0      2017       4
18        0      3104       5
19        0      2796       4
20        0      2320       3
21        0       719       3
22        0      1269       5
23        0       526       5
24        0      2339       3
25        0        47       5
26        0      1096       4
27        0      1720       4
28        0      1544       4
29        0       744       3
30        0      2293       4
31        0      3185       4
32        

In [36]:
user_test = User(3000,users, RatingsDB)
print(user_test.get_encoded_ratings(MoviesDB))
print(user_test.get_similar_users(users_similarity_matrix))
print(user_test.get_recommendations(MoviesDB))

          3000
movie_id      
0          4.0
1          0.0
2          0.0
3          0.0
4          0.0
...        ...
3947       0.0
3948       0.0
3949       0.0
3950       0.0
3951       0.0

[3952 rows x 1 columns]
[(5554, 0.5406679903325117), (2945, 0.5244606762792599), (4311, 0.49926262055520887), (1878, 0.49299046858367196)]
    movie_id                title  similarity
28       150              Rob Roy         3.0
27      3116  Ride with the Devil         3.0
26      1093     Crying Game, The         3.0
25       919   Gone with the Wind         3.0
24      3121       Santa Fe Trail         3.0


### Test shapes

The only relevant thing to test is the shape of the similarity matrix, since the other checks are done implicitly inside the methods of each class (e.g. generating the similarity matrix requires the proper shapes in order to work, so there is no need to test them beforehand).

- The first test makes sure that the similarity matrix represents the similarity between users.
- The second test makes sure that the `movie_id`s are correctly mapped to the indices.

In [37]:
assert users_similarity_matrix.shape == (len(users), len(users))

In [38]:
assert MoviesDB.movies_dataset.shape == (movies.movie_id.values[-1] + 1, len(movies.columns))