
# Intro


**Notes**

The main bulk of the material comes from https://developers.google.com/machine-learning/recommendation/overview/candidate-generation. If you want to go further later, you can take a look at http://nicolas-hug.com/blog/matrix_facto_3. It is absolutely not expected to look at these two links for the interviews  or to complete the test.

**Context**: 

We want to build a movies' recommender in order to get new movies to watch during the lock down. We will base our work on a variation of the MovieLens dataset. 
The data consists of movies seen by the users, some informations about the movies, and some informations about the users. The problem consists in predicting which movies a given user might like.

We are presenting you here first a naive approach in order to familarize yourself with the problem and show you how it might be solved.

**Task**:

The code presented is a first implementation but has a number of shortcomings in its structure and features (more on that in the conclusion). Your task consist in producing a refactoring, so as to be one step closer to a "clean" code.

**Evaluation**:

Our goal here is two fold:
- See how you understand a problem and adapt to an already given approach to tackle it.
- See how you can design new features.
- See how you manipulate python code: understanding, ideas to refactor etc ...

The projects will be evaluated on the quality of the source code produced.

# The data

First, let's load some data.

In [1]:
import pandas as pd

users = pd.read_csv("data/users.csv")
print(users.shape)
users.head()

(6040, 5)


Unnamed: 0,user_id,gender,age,occupation,zip_code
0,0,F,1,10,48067
1,1,M,56,16,70072
2,2,M,25,15,55117
3,3,M,45,7,2460
4,4,M,25,20,55455


In [2]:
movies = pd.read_csv("data/movies.csv")
movies.head()

Unnamed: 0,movie_id,title,year,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,...,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,0,Toy Story,1995,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,Jumanji,1995,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,Grumpier Old Men,1995,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,Waiting to Exhale,1995,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,Father of the Bride Part II,1995,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [3]:
ratings = pd.read_csv("data/ratings.csv")
ratings.head()

Unnamed: 0,user_id,movie_id,rating
0,0,1176,5
1,0,655,3
2,0,902,3
3,0,3339,4
4,0,2286,5


# Content-based Filtering

Content-based filtering uses item features to recommend other items similar to what the user likes, based on their previous actions or explicit feedback. We dont use other users information !

For example, if user `A` liked `Harry Potter 1`, he/she will like `Harry Potter 2`

In [4]:
%%html
<img src='https://miro.medium.com/max/1642/1*BME1JjIlBEAI9BV5pOO5Mg.png' height="300" width="250"/>

What are similar movies ? In order to answer to this question we need to build a similiarity measure. 

## Features

This measure will operate on the characteristics (**features**) of the movies to determine which are close. In our case, we have access to the genres of the movies. For example, the genres of `Toy Story` are: `Animation`, `Children's` and `Comedy`. This is represented as follow in our dataset:

In [5]:
genre_cols = ["Animation", "Children's", 
       'Comedy', 'Adventure', 'Fantasy', 'Romance', 'Drama',
       'Action', 'Crime', 'Thriller', 'Horror', 'Sci-Fi', 'Documentary', 'War',
       'Musical', 'Mystery', 'Film-Noir', 'Western']

genre_and_title_cols = ['title'] + genre_cols 

movies[genre_and_title_cols].head()

Unnamed: 0,title,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,Toy Story,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Jumanji,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Grumpier Old Men,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Waiting to Exhale,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Father of the Bride Part II,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Similarity

Now that we have some features, we will try to find a function that performs a similiarity measure. The Similarity function will take two items (two list of features) and return a number proportional to their similarity. 

For the following we will consider that the Similarity between two movies is the number of genres they have in common.

Here is an example with `Toy Story` and `E.T`

In [6]:
toy_story_genres = movies[genre_and_title_cols].loc[movies.title == 'Toy Story'][genre_cols].iloc[0]
toy_story_genres

Animation      1.0
Children's     1.0
Comedy         1.0
Adventure      0.0
Fantasy        0.0
Romance        0.0
Drama          0.0
Action         0.0
Crime          0.0
Thriller       0.0
Horror         0.0
Sci-Fi         0.0
Documentary    0.0
War            0.0
Musical        0.0
Mystery        0.0
Film-Noir      0.0
Western        0.0
Name: 0, dtype: float64

In [7]:
et_genres = movies[genre_and_title_cols].loc[movies.title == 'E.T. the Extra-Terrestrial'][genre_cols].iloc[0]
et_genres

Animation      0.0
Children's     1.0
Comedy         0.0
Adventure      0.0
Fantasy        1.0
Romance        0.0
Drama          1.0
Action         0.0
Crime          0.0
Thriller       0.0
Horror         0.0
Sci-Fi         1.0
Documentary    0.0
War            0.0
Musical        0.0
Mystery        0.0
Film-Noir      0.0
Western        0.0
Name: 1081, dtype: float64

In [8]:
et_genres.values * toy_story_genres

Animation      0.0
Children's     1.0
Comedy         0.0
Adventure      0.0
Fantasy        0.0
Romance        0.0
Drama          0.0
Action         0.0
Crime          0.0
Thriller       0.0
Horror         0.0
Sci-Fi         0.0
Documentary    0.0
War            0.0
Musical        0.0
Mystery        0.0
Film-Noir      0.0
Western        0.0
Name: 0, dtype: float64

In [9]:
(et_genres.values * toy_story_genres).sum() # scalar product

1.0

So our similarity measure returns `1.0` for these two movies. 

Let's see another example where we compare `Toy Stories` and `Pocahontas`

In [10]:
pocahontas_genres = movies[genre_and_title_cols].loc[movies.title == 'Pocahontas'][genre_cols].iloc[0]
(pocahontas_genres.values * toy_story_genres).sum()

2.0

This tels us that `Pocahontas` is closer to `Toy Stories` than `E.T.` which makes sense.


## Scaling up

Ok, that's a nice measure. Now we are going to scale it up to all movies of our dataset. To do so smartly, let's take a look at the operation we just did, but from a mathematical point of view. To do so, we will think of the list of features of a movie as a vector `V`. Then, our similarity measure between `Toy Story` and `E.T.` becomes:
$ V_{ToyStory} \cdot V_{ET}^{T}$

More generally the similarity measure between a movie `i` and another movie `j` is : $ V_{i} \cdot V_{j}^{T}$

Now we can think of `movies` as a matrix containing all features vectors describing the movies. Here is how our similiarity measure looks in this context:

![](imgs/dot_product_matrices.png)

To obtain the similiarity between all movies of our dataset we have to perform the dot product of the `movies` matrix with the transposed of the `movies` matrix.

In [11]:
similarity = movies[genre_cols].values.dot(movies[genre_cols].values.T)
similarity.shape

(3883, 3883)

We can now get the similarity between `Toy Story` and any other movie of our dataset

In [12]:
similarity_with_toy_story = similarity[0] # 0 is Toy Story
similarity_with_toy_story

array([3., 1., 1., ..., 0., 0., 0.])

In [13]:
for i in range(10):
    print(f"Similarity between Toy story and {movies.iloc[i]['title']} (index {i}) is {similarity_with_toy_story[i]}")

Similarity between Toy story and Toy Story (index 0) is 3.0
Similarity between Toy story and Jumanji (index 1) is 1.0
Similarity between Toy story and Grumpier Old Men (index 2) is 1.0
Similarity between Toy story and Waiting to Exhale (index 3) is 1.0
Similarity between Toy story and Father of the Bride Part II (index 4) is 1.0
Similarity between Toy story and Heat (index 5) is 0.0
Similarity between Toy story and Sabrina (index 6) is 1.0
Similarity between Toy story and Tom and Huck (index 7) is 1.0
Similarity between Toy story and Sudden Death (index 8) is 0.0
Similarity between Toy story and GoldenEye (index 9) is 0.0


## A bit of polishing

### Helpers:

We also built some helpers to handle the movies dataset:

In [14]:
from content_based_filtering.helpers.movies import get_movie_id, get_movie_name, get_movie_year
    
print (get_movie_id(movies, 'Toy Story'))
print (get_movie_id(movies, 'Die Hard'))

print (get_movie_name(movies, 0))
print (get_movie_name(movies, 1000))
print (get_movie_year(movies, 1000))

0
1023
Toy Story
Parent Trap, The
1961


### Finding similar movies:
Here is a method giving us the movie the most similar to another movie:

In [15]:
def get_most_similar(similarity, movie_name, year=None, top=10):
    index_movie = get_movie_id(movies, movie_name, year)
    best = similarity[index_movie].argsort()[::-1]
    return [(ind, get_movie_name(movies, ind), similarity[index_movie, ind]) for ind in best[:top] if ind != index_movie]

In [16]:
get_most_similar(similarity, 'Toy Story')

[(667, 'Space Jam', 3.0),
 (3685, 'Adventures of Rocky and Bullwinkle, The', 3.0),
 (3682, 'Chicken Run', 3.0),
 (2009, 'Jungle Book, The', 3.0),
 (2011, 'Lady and the Tramp', 3.0),
 (2012, 'Little Mermaid, The', 3.0),
 (2033, 'Steamboat Willie', 3.0),
 (2072, 'American Tail, An', 3.0),
 (2073, 'American Tail: Fievel Goes West, An', 3.0)]

In [17]:
get_most_similar(similarity, 'Psycho', 1960) 

[(3593, "Puppet Master III: Toulon's Revenge", 2.0),
 (2923, 'Rawhead Rex', 2.0),
 (1312, 'Believers, The', 2.0),
 (3407, "Jacob's Ladder", 2.0),
 (1957, 'Disturbing Behavior', 2.0),
 (1927, 'Poltergeist III', 2.0),
 (1926, 'Poltergeist II: The Other Side', 2.0),
 (1925, 'Poltergeist', 2.0),
 (732, 'Thinner', 2.0),
 (69, 'From Dusk Till Dawn', 2.0)]

### Giving a recommendation:

And finally, let's find some movies to recommend based on previously liked movies:

In [18]:
def get_recommendations(user_id):
    top_movies = ratings[ratings['user_id'] == user_id].sort_values(by='rating', ascending=False).head(3)['movie_id']
    index=['movie_id', 'title', 'similarity']

    most_similars = []
    for top_movie in top_movies:
        most_similars += get_most_similar(similarity, get_movie_name(movies, top_movie), get_movie_year(movies, top_movie))

    return pd.DataFrame(most_similars, columns=index).drop_duplicates().sort_values(by='similarity', ascending=False).head(5)

get_recommendations(0)


Unnamed: 0,movie_id,title,similarity
13,773,"Hunchback of Notre Dame, The",3.0
14,1526,Hercules,3.0
27,2072,"American Tail, An",3.0
26,2033,Steamboat Willie,3.0
25,2012,"Little Mermaid, The",3.0


In [19]:
get_recommendations(999)

Unnamed: 0,movie_id,title,similarity
0,166,First Knight,2.0
2,1451,Smilla's Sense of Snow,2.0
3,503,"Perfect World, A",2.0
4,3197,Man Bites Dog (C'est arriv� pr�s de chez vous),2.0
5,1458,"Devil's Own, The",2.0


# Conclusion:

The code presented is a first implementation but has a number of shortcomings preventing the collaboration of multiple MLE and Data Scientists:
- It is not possible to introduce easily new features mainly because the code is just a bunch of functions in one file.
- The code can not be scaled to other datasets or variations of the tasks.
- There is no evaluation of the performances.
- There is no testing

Additionaly a number we could think of some features to add, for example, what about looking at similar users to find a recommendation for our targeted user ?

# Cosine Similarity
#### This part is edited by Indrajeet Datta


Although the methods used in the original implementation work fine, the similarity is compared by measuring the scalar product. Although in the scarlar product implementation, the product is not normalized. i.e. the product value can be any number. We usually do not want that to happen. So, instead, I used cosine similarity matrix method from sklearn which gives a cosine similarity value between 0 and 1.

In [50]:
from sklearn.metrics.pairwise import cosine_similarity

But before applying cosine similarity, I defined some functions that will help up get our recommeded movies.

The function below get_favourite_movies_by_users takes in a user_id value as parameter and finds the first 5 movies that the user rated as rating 5. Those movies are surely favorite movies of the user.

In [51]:
def get_favourite_movies_by_users(user_id):
    favourite_movies = ratings[(ratings["rating"] == 5) & (ratings["user_id"] == user_id)]["movie_id"].tolist()[0:5]
    return favourite_movies

Next, I defined a function called get_topfive_similar_movies which takes in a movie_id and the cosine similarity matrix of the movies as a parameter and returns a list containing 5 movies that are most similar to the movie inputted.

In [52]:
def get_topfive_similar_movies(movie_id, cosine_sim_mat):
    similar_movies = list(enumerate(cosine_sim_mat[movie_id]))
    top_ten_similar_movies = [i[0] for i in sorted(similar_movies, key=lambda x:x[1], reverse=True)[1:6]]
    return top_ten_similar_movies

Next, I defined a function called get_topfive_similar_users which takes in a user_id and the cosine similarity matrix of the users as a parameter and returns a list containing 5 users that are most similar to the user inputted.

In [53]:
def get_topfive_similar_users(user_id, cosine_sim_mat):
    similar_users = list(enumerate(cosine_sim_mat[user_id]))
    top_ten_similar_users = [i[0] for i in sorted(similar_users, key=lambda x:x[1], reverse=True)[1:6]]
    return top_ten_similar_users

Next, I defined a function called get_recommended_movies which takes in user_id and the cosine similarity matrices of movies and users as parameters then for the given user it first finds all the users favourite movies using get_favourite_movies_by_users method defined earlier. Then, for each movie that was found from the previous method, it finds 5 similar movies by using the the get_topfive_similar_movies method defined earlier. Next it finds, 5 users that are similart to the given user by using the get_topfive_similar_users method and then for each similar user found, it finds the 5 favourite movies for each of those users by using the get_favourite_movies_by_users method. It then appends the movies similar to the favourited movies and the movies favourited by the user similar to the given user into a list of movie_ids and returns the list. 


The idea is that if a user favoutited certain movies, then the probability of the liking the movies that are similar to the favourited movies by the user is high. Also, the favourited movies of the users similar to the user is also high. So 

In [54]:
def get_recommended_movies(user_id, cosine_sim_movies, cosine_sim_users):
    recommended_movies = []
    favorite_movies = get_favourite_movies_by_users(user_id)
    #print("Favourite movies of user: ", favorite_movies)
    for movie in favorite_movies:
        similar_movies = get_topfive_similar_movies(movie, cosine_sim_movies)
        #print("Movies similar to", movie, " : ", similar_movies)
        recommended_movies.extend(similar_movies)
    similar_users = get_topfive_similar_users(user_id, cosine_sim_users)
    #print("Users similar to user: ", similar_users)
    for user in similar_users:
        fav_movs = get_favourite_movies_by_users(user_id)
        #print("Favourite Movies of user: ", user, " : ", fav_movies)
        recommended_movies.extend(fav_movs)
    recommended_movies = list(set(recommended_movies))
    return recommended_movies

Now, looking at the users dataframe, we can see that there are certain features that are not helpful in their raw form. For example the age column is much more useful if it is categorized. So, next, I categorized the age column into different age groups and added a new column which determines tha age group.

In [55]:
if "age-group" in users:
    users = users.drop(["age-group"], axis=1)
idx = users.columns.get_loc('age')

b1 = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
l1 = ['Children', 'Teenagers', '20s','30s', '40s', '50s', '60s', '70s', '80s', '90s']
s1 = pd.cut(users['age'], bins=b1, labels=l1)

users.insert(idx + 1, 'age-group', s1)

In [56]:
import math

Another feature that needs to be modified is the zip_code columns. Since, mostly, no two users will have the exact zip-code, it is not very helpful. Although, if we say the first two characters of the zip-code determines a region, we can only use the first two columns to determine the zip code of the user. So, accordingly, I added a new column called "zip-firsttwo" which only stores the first two digits of the zip code.

In [57]:
users['zip-firsttwo'] = users['zip_code'].str[0:2]

In [58]:
users.head()

Unnamed: 0,user_id,gender,age,age-group,occupation,zip_code,zip-firsttwo
0,0,F,1,Children,10,48067,48
1,1,M,56,50s,16,70072,70
2,2,M,25,20s,15,55117,55
3,3,M,45,40s,7,2460,2
4,4,M,25,20s,20,55455,55


Below, we can see that we have reduced the number of unique zip features from 3439 to only 100. So, if we categorize the user based on the first two digits of the zip code, we only get 100 categorizations which is much more helpful.

In [59]:
print(users["zip_code"].unique().size)
print(users["zip-firsttwo"].unique().size)

3439
100


Next, I imported LabelBinarizer library from sklearn which helps to give values of 0 or 1 for each unique category in a column of dataframe.

In [60]:
from sklearn.preprocessing import LabelBinarizer

In [61]:
lb = LabelBinarizer()

In [62]:
lb_gender = lb.fit_transform(users["gender"])
lb_ageGroup = lb.fit_transform(users["age-group"])
lb_occupation = lb.fit_transform(users["occupation"])
lb_zipcode = lb.fit_transform(users["zip-firsttwo"])

In [63]:
print("Label Binarized zip-firsttwo: \n", lb_zipcode)
print()
print("lb_zipcode shape: ", lb_zipcode.shape)

Label Binarized zip-firsttwo: 
 [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

lb_zipcode shape:  (6040, 100)


We can see above that LabelBinarizer created a list of lists in which each eleement being a 100 vector list containing 0s for all other values and 1 for the value which matches the zip code. That means each zip code is transformed into a size 100 vector with 99 0s and one 1.

In [64]:
import scipy
import numpy as np

Now we horizontally stack all the outputs acquired from LabelBinarizer earlier.

In [65]:
user_features = np.hstack([lb_gender, lb_ageGroup, lb_occupation, lb_zipcode])

In [66]:
print(user_features)

[[0 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 [1 1 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [1 1 0 ... 0 0 0]]


"user_features" contains list of features for each user in the form of 1s and 0s. These lists are quite good for calculating our cosine similarity matrix for similar users.

We can get the movies features according to the genres very easily as it is already categorized in the movies dataframe

In [67]:
movie_features = movies[genre_cols].values

Now I calculated the cosine matrices for similarity of users and similarity of movies using the cosine_similarity function used on the user_features and movie_features matrices acquired earlier.

In [68]:
cosine_sim_users = cosine_similarity(user_features)

In [69]:
cosine_sim_movies = cosine_similarity(movie_features)

Now, I can input a user_id and the cosine similarity matrices for users and movies to get the list of recommended movies using the function recommended_movies defined earlier.

In [70]:
user_id = 0
recommended_movies = get_recommended_movies(user_id, cosine_sim_movies, cosine_sim_users)

Next, I loop through the recommended list outputted ealier for the user with user_id = 0 and print the names of the movies by using the helper function get_movie_name()

In [76]:
for movie in recommended_movies:
    print(get_movie_name(movies, movie))

Hunchback of Notre Dame, The
James and the Giant Peach
Knightriders
American Tail: Fievel Goes West, An
Aladdin and the King of Thieves
American Tail, An
Othello
Now and Then
Shanghai Triad (Yao a yao yao dao waipo qiao)
Dangerous Minds
One Flew Over the Cuckoo's Nest
Dead Man Walking
First Knight
To Die For
Christmas Story, A
Karate Kid, Part II, The
Karate Kid III, The
Kicking and Screaming
Big Bully
Snow White and the Seven Dwarfs
Beauty and the Beast
Last Summer in the Hamptons
Nobody Loves Me (Keiner liebt mich)
Rugrats Movie, The
Bug's Life, A
All Dogs Go to Heaven 2
Ben-Hur
Emerald Forest, The


Next, we cam loop through all the user_ids stored in the dataframe and find the lists of recommended movies for each of the users and then store it as a column in the dataframe itself 

In [77]:
movies_recommended = []
for user_id in users["user_id"]:
    recommended_movie_ids = get_recommended_movies(user_id, cosine_sim_movies, cosine_sim_users)
    movies_recommended.append(recommended_movie_ids)

In [78]:
users["movies_recommended_cosine_similarity"] = movies_recommended

In [79]:
users

Unnamed: 0,user_id,gender,age,age-group,occupation,zip_code,zip-firsttwo,movies_recommended_cosine_similarity
0,0,F,1,Children,10,48067,48,"[773, 655, 3736, 2073, 1050, 2072, 25, 26, 29,..."
1,1,M,56,50s,16,70072,70,"[387, 518, 2199, 280, 24, 26, 25, 29, 30, 287,..."
2,2,M,25,20s,15,55117,55,"[257, 3, 9, 18, 543, 37, 1573, 1063, 171, 44, ..."
3,3,M,45,40s,7,02460,02,"[257, 518, 1545, 25, 26, 1180, 29, 30, 799, 92..."
4,4,M,25,20s,20,55455,55,"[641, 1159, 1033, 523, 18, 24, 153, 665, 26, 2..."
5,5,F,50,40s,9,55117,55,"[773, 6, 14, 655, 1179, 926, 2336, 2337, 548, ..."
6,6,M,35,30s,1,06810,06,"[1024, 1550, 788, 283, 285, 2847, 543, 1568, 1..."
7,7,M,25,20s,12,11413,11,"[387, 644, 1550, 280, 24, 26, 25, 29, 30, 285,..."
8,8,M,25,20s,17,61614,61,"[1024, 0, 517, 653, 2072, 2073, 1050, 283, 683..."
9,9,F,35,30s,1,95370,95,"[2048, 6, 1286, 1543, 2056, 24, 25, 26, 29, 30..."
