
# Intro


**Notes**

Most of the material comes from https://developers.google.com/machine-learning/recommendation/overview/candidate-generation. If you want to go further later, you can take a look at http://nicolas-hug.com/blog/matrix_facto_3. You are not expected to read either of these links for the interviews or to complete the test.

**Context**: 

We want to build a movie recommender in order to find new movies to watch during the lockdown. We will base our work on a variation of the MovieLens dataset.
The data consists of the movies seen by the users, some information about the movies, and some information about the users. The problem is to predict which movies a given user might like.

We first present a naive approach so that you can familiarize yourself with the problem and see how it might be solved.

**Task**:

The code presented is a first implementation but has a number of shortcomings in its structure and features (more on that in the conclusion). Your task consists in refactoring it, so as to be one step closer to "clean" code.

**Evaluation**:

Our goal here is threefold:
- See how you understand a problem and adapt to an already given approach to tackle it.
- See how you can design new features.
- See how you manipulate Python code: understanding it, ideas to refactor it, etc.

The projects will be evaluated on the quality of the source code produced.

# The data

First, let's load some data.

In [None]:
import pandas as pd


users = pd.read_csv("data/users.csv")
print(users.shape)
users.head()

In [None]:
movies = pd.read_csv("data/movies.csv")
movies.head()



In [None]:
ratings = pd.read_csv("data/ratings.csv")
ratings.head()


# Content-based Filtering

Content-based filtering uses item features to recommend other items similar to what the user likes, based on their previous actions or explicit feedback. We don't use other users' information!

For example, if user `A` liked `Harry Potter 1`, they will probably like `Harry Potter 2`.

In [None]:
%%html
<img src='https://miro.medium.com/max/1642/1*BME1JjIlBEAI9BV5pOO5Mg.png' height="300" width="250"/>

What are similar movies? To answer this question, we need to build a similarity measure.

## Features

This measure will operate on the characteristics (**features**) of the movies to determine which ones are close. In our case, we have access to the genres of the movies. For example, the genres of `Toy Story` are: `Animation`, `Children's` and `Comedy`. This is represented as follows in our dataset:

In [None]:
#a function that adds a feature
def add_feature(df, name_feature, column_feature) :
    df[name_feature] = column_feature
    return True


In [None]:
def get_features(df):
    # Extract the feature names (every column after the first two) without going through the data
    return df.columns[2:].tolist()

In [None]:
genre_cols = ["Animation", "Children's", 
       'Comedy', 'Adventure', 'Fantasy', 'Romance', 'Drama',
       'Action', 'Crime', 'Thriller', 'Horror', 'Sci-Fi', 'Documentary', 'War',
       'Musical', 'Mystery', 'Film-Noir', 'Western']

genre_and_title_cols = ['title'] + genre_cols 

movies[genre_and_title_cols].head()

## Similarity

Now that we have some features, we will try to find a function that measures similarity. The similarity function will take two items (two lists of features) and return a number proportional to their similarity.

In what follows, we will consider that the similarity between two movies is the number of genres they have in common.

Here is an example with `Toy Story` and `E.T.`:

In [None]:
toy_story_genres = movies[genre_and_title_cols].loc[movies.title == 'Toy Story'][genre_cols].iloc[0]
toy_story_genres

In [None]:
et_genres = movies[genre_and_title_cols].loc[movies.title == 'E.T. the Extra-Terrestrial'][genre_cols].iloc[0]
et_genres

In [None]:
et_genres.values * toy_story_genres

In [None]:
(et_genres.values * toy_story_genres).sum() # scalar product

So our similarity measure returns `1.0` for these two movies. 

Let's see another example, where we compare `Toy Story` and `Pocahontas`:

In [None]:
pocahontas_genres = movies[genre_and_title_cols].loc[movies.title == 'Pocahontas'][genre_cols].iloc[0]
(pocahontas_genres.values * toy_story_genres).sum()

This tells us that `Pocahontas` is closer to `Toy Story` than `E.T.` is, which makes sense.
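
The computation used in these examples can be wrapped in a small function. The following is only a minimal sketch on made-up data: the name `genre_similarity`, the four-genre ordering and the three toy vectors are assumptions for illustration and are not part of the dataset.

```python
import numpy as np


def genre_similarity(genres_a, genres_b):
    """Count the genres two movies share, given two 0/1 genre vectors of equal length."""
    a = np.asarray(genres_a)
    b = np.asarray(genres_b)
    if a.shape != b.shape:
        raise ValueError("both genre vectors must have the same length")
    # With 0/1 entries, the dot product adds 1 for every genre present in both movies
    return int(np.dot(a, b))


# Hypothetical genre order: Animation, Children's, Comedy, Drama
movie_a = [1, 1, 1, 0]
movie_b = [0, 1, 0, 1]
movie_c = [1, 1, 0, 0]

print(genre_similarity(movie_a, movie_b))  # one shared genre
print(genre_similarity(movie_a, movie_c))  # two shared genres
```

Note that this measure favours movies with many genres, since they have more chances of sharing one.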


## Scaling up

OK, that's a nice measure. Now we are going to scale it up to all the movies in our dataset. To do so smartly, let's look at the operation we just did from a mathematical point of view: we will think of the list of features of a movie as a row vector `V`. Our similarity measure between `Toy Story` and `E.T.` then becomes:
$ V_{ToyStory} \cdot V_{ET}^{T}$

More generally, the similarity measure between a movie `i` and another movie `j` is: $ V_{i} \cdot V_{j}^{T}$
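
Written out component by component, this product is a sum over genres. The notation here is an assumption introduced for illustration: $G$ is the number of genre columns and $V_{i,k}$ is the 0/1 entry of movie `i` for genre $k$.

$$ V_{i} \cdot V_{j}^{T} = \sum_{k=1}^{G} V_{i,k} \, V_{j,k} $$

Each term $V_{i,k} \, V_{j,k}$ equals 1 only when both movies have genre $k$ and 0 otherwise, so the sum is exactly the number of genres the two movies share.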

Now we can think of `movies` as a matrix containing all the feature vectors describing the movies. Here is how our similarity measure looks in this context:

![](imgs/dot_product_matrices.png)

To obtain the similarity between all the movies of our dataset, we have to compute the dot product of the `movies` matrix with the transpose of the `movies` matrix.

In [None]:
similarity = movies[genre_cols].values.dot(movies[genre_cols].values.T)
similarity.shape
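
To see what this product contains, here is a minimal sketch on a made-up matrix of three movies and four genres (the values and the names `features` and `toy_similarity` are hypothetical, not taken from the dataset):

```python
import numpy as np

# Hypothetical 0/1 genre matrix: one row per movie, one column per genre
features = np.array([
    [1, 1, 1, 0],  # movie 0
    [0, 1, 0, 1],  # movie 1
    [1, 1, 0, 0],  # movie 2
])

# Entry (i, j) is the dot product of row i with row j,
# i.e. the number of genres shared by movie i and movie j
toy_similarity = features.dot(features.T)
print(toy_similarity)
```

The result is a square, symmetric matrix with one row and one column per movie. Its diagonal holds the number of genres of each movie, since a movie shares all of its genres with itself.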

We can now get the similarity between `Toy Story` and any other movie of our dataset

In [None]:
similarity_with_toy_story = similarity[0] # 0 is Toy Story
similarity_with_toy_story

In [None]:
for i in range(10):
    print(f"Similarity between Toy Story and {movies.iloc[i]['title']} (index {i}) is {similarity_with_toy_story[i]}")

## A bit of polishing

### Helpers:

We also built some helpers to handle the movies dataset:

In [None]:
from content_based_filtering.helpers.movies import get_movie_id, get_movie_name, get_movie_year
    
print (get_movie_id(movies, 'Toy Story'))
print (get_movie_id(movies, 'Die Hard'))

print (get_movie_name(movies, 0))
print (get_movie_name(movies, 1000))
print (get_movie_year(movies, 1000))

### Finding similar movies:
Here is a function giving us the movies most similar to a given movie:

In [None]:
def get_most_similar(similarity, movie_name, year=None, top=10):
    index_movie = get_movie_id(movies, movie_name, year)
    best = similarity[index_movie].argsort()[::-1]
    return [(ind, get_movie_name(movies, ind), similarity[index_movie, ind]) for ind in best[:top] if ind != index_movie]

In [None]:
get_most_similar(similarity, 'Toy Story')

In [None]:
get_most_similar(similarity, 'Psycho', 1960) 
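
The ranking step can be looked at in isolation. The sketch below is a hypothetical variant, not the notebook's function: `most_similar_indices` and the toy matrix are assumptions for illustration. It works on indices only, and it removes the queried movie before truncating, so it returns `top` neighbours whenever that many exist.

```python
import numpy as np


def most_similar_indices(similarity_matrix, index, top=2):
    """Return up to `top` (index, score) pairs most similar to `index`, excluding itself."""
    scores = similarity_matrix[index]
    # Stable sort on negated scores: highest first, ties keep their original order
    order = np.argsort(-scores, kind="stable")
    neighbours = [int(i) for i in order if i != index]
    return [(i, int(scores[i])) for i in neighbours[:top]]


# Hypothetical similarity matrix for four movies
toy_similarity = np.array([
    [3, 1, 2, 0],
    [1, 2, 1, 1],
    [2, 1, 2, 0],
    [0, 1, 0, 1],
])
print(most_similar_indices(toy_similarity, 0))
```

With a count-based similarity, ties are frequent, so the order among equally similar movies depends on the sort used; a stable sort at least makes it reproducible.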

### Giving a recommendation:

And finally, let's find some movies to recommend based on previously liked movies:

In [None]:
def get_recommendations(user_id):
    top_movies = ratings[ratings['user_id'] == user_id].sort_values(by='rating', ascending=False).head(3)['movie_id']
    index=['movie_id', 'title', 'similarity']

    most_similars = []
    for top_movie in top_movies:
        most_similars += get_most_similar(similarity, get_movie_name(movies, top_movie), get_movie_year(movies, top_movie))

    return pd.DataFrame(most_similars, columns=index).drop_duplicates().sort_values(by='similarity', ascending=False).head(5)

get_recommendations(0)


In [None]:
get_recommendations(999)
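
One thing `get_recommendations` does not do is filter out movies the user has already rated. The following is a minimal sketch of that variation on made-up data: `toy_ratings`, `toy_neighbours` and `recommend_unseen` are hypothetical names, and the neighbour lists stand in for the output of a similarity lookup.

```python
import pandas as pd

# Hypothetical ratings: user 0 rated movies 0, 1 and 2, user 1 rated movie 3
toy_ratings = pd.DataFrame({
    "user_id": [0, 0, 0, 1],
    "movie_id": [0, 1, 2, 3],
    "rating": [5, 2, 4, 3],
})

# Hypothetical precomputed neighbours: movie_id -> list of (movie_id, similarity)
toy_neighbours = {
    0: [(2, 2), (3, 1)],
    2: [(0, 2), (4, 2)],
}


def recommend_unseen(user_id, n_liked=2, top=5):
    """Recommend movies similar to the user's best-rated ones, skipping movies already rated."""
    user_ratings = toy_ratings[toy_ratings["user_id"] == user_id]
    liked = user_ratings.sort_values(by="rating", ascending=False).head(n_liked)["movie_id"]
    seen = set(user_ratings["movie_id"])
    # Gather the neighbours of every liked movie
    candidates = [pair for movie_id in liked for pair in toy_neighbours.get(movie_id, [])]
    frame = pd.DataFrame(candidates, columns=["movie_id", "similarity"])
    # Drop movies the user has already rated, then keep the best score per movie
    frame = frame[~frame["movie_id"].isin(seen)]
    frame = frame.sort_values(by="similarity", ascending=False).drop_duplicates("movie_id")
    return frame.head(top).reset_index(drop=True)


print(recommend_unseen(0))
```

Whether already rated movies should be excluded is a design choice: excluding them gives new movies to watch, while keeping them makes it possible to compare recommendations with known ratings.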

In [None]:
import numpy as np


# Evaluation: for each user, average the ratings they actually gave to the recommended movies
def eval_user_recommendation(user_id):
    ratings_user = ratings.loc[ratings['user_id'] == user_id]
    recommended_ids = get_recommendations(user_id)['movie_id']
    # Keep only the recommended movies that this user has rated
    rated = ratings_user.loc[ratings_user['movie_id'].isin(recommended_ids), 'rating']
    if rated.empty:
        return None  # the user rated none of the recommended movies
    return rated.mean()


# We evaluate on a batch of users
mean_ratings = []
for user_id in range(100):
    mean_rating = eval_user_recommendation(user_id)
    if mean_rating is not None:
        mean_ratings.append(mean_rating)

# We consider that a good rating is 4 or above
good = [m for m in mean_ratings if m >= 4]

# Share of evaluated users whose recommendations got a good mean rating (higher is better)
if mean_ratings:
    print("The share of well-rated recommendations is", len(good) / len(mean_ratings))
        


In [None]:
# Now let's create a recommendation based on the number of ratings (popularity)
best_movies = ratings.groupby('movie_id').agg({'rating': ['size', 'mean']})
best_movies.head()


# From these movies, we recommend to a user the ones they have not seen yet
def recommend_movie(user_id, top=10):
    listed_bmovies = best_movies['rating'].sort_values(by='size', ascending=False).head(top)
    seen = set(ratings.loc[ratings['user_id'] == user_id, 'movie_id'])
    # The movie ids are the index of `listed_bmovies`; its columns hold the counts and means
    return [get_movie_name(movies, movie_id) for movie_id in listed_bmovies.index if movie_id not in seen]


user = 99
print(f"Top watched movies that user number {user} didn't see")
recommend_movie(user)

# Conclusion:

The code presented is a first implementation, but it has a number of shortcomings that hinder collaboration between multiple MLEs and data scientists:
- New features cannot be introduced easily, mainly because the code is just a bunch of functions in one file.
- The code cannot be scaled to other datasets or variations of the task.
- There is no evaluation of the performance.
- There is no testing.

Additionally, we could think of some features to add. For example, what about looking at similar users to find a recommendation for our targeted user?
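
As a starting point for that last idea, here is a minimal sketch of one possible way to compare users. It assumes cosine similarity between rating vectors, which is a choice made here for illustration and not something the notebook prescribes; the rating matrix and the names `toy_user_ratings` and `cosine_similarity_matrix` are hypothetical.

```python
import numpy as np

# Hypothetical user x movie rating matrix, 0 meaning "not rated"
toy_user_ratings = np.array([
    [5, 4, 0, 0],  # user 0
    [5, 5, 0, 1],  # user 1
    [0, 0, 4, 5],  # user 2
], dtype=float)


def cosine_similarity_matrix(matrix):
    """Cosine similarity between every pair of rows."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for users without ratings
    normalized = matrix / norms
    return normalized.dot(normalized.T)


user_similarity = cosine_similarity_matrix(toy_user_ratings)

# Most similar other user to user 0: mask user 0 itself before taking the maximum
scores = user_similarity[0].copy()
scores[0] = -np.inf
print(int(np.argmax(scores)))
```

Movies rated highly by the most similar users, and not yet seen by the target user, would then be natural candidates to recommend.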