
# Intro


**Notes**

Most of the material comes from https://developers.google.com/machine-learning/recommendation/overview/candidate-generation. If you want to go further later, you can take a look at http://nicolas-hug.com/blog/matrix_facto_3. You are not expected to read either of these links for the interviews or to complete the test.

**Context**: 

We want to build a movie recommender so that we can find new movies to watch during the lockdown. We will base our work on a variation of the MovieLens dataset. 
The data consists of movies seen by the users, some information about the movies, and some information about the users. The problem is to predict which movies a given user might like.

We first present a naive approach so that you can familiarize yourself with the problem and see how it might be solved.

**Task**:

The code presented is a first implementation, but it has a number of shortcomings in its structure and features (more on that in the conclusion). Your task consists of refactoring it so that it is one step closer to "clean" code.

**Evaluation**:

Our goal here is threefold:
- See how you understand a problem and adapt to an already given approach to tackle it.
- See how you can design new features.
- See how you work with Python code: understanding it, ideas for refactoring, etc.

The projects will be evaluated on the quality of the source code produced.

# The data

First, let's load some data.

### Import libraries

In [1]:
import pandas as pd
import numpy as np
import argparse

### Construct a class to read the input data from the various CSV files and check for duplicates

In [98]:
class Preprocess:
    """
       This class does the task of preprocessing.   
    """
    def __init__(self, base_directory):
        self.base_directory = base_directory
        self.user_data = self.read_data(self.base_directory) 
        
    def read_data(self, base_directory):
        """Read the input data
        INPUT: 
            base_directory - path of the CSV file to read
        OUTPUT:
            returns the data as a pandas DataFrame
        """
        data = pd.read_csv(base_directory)
        return data
    
    def check_duplicates(self, df, column):
        """Check for existence of duplicate records
        INPUT: 
            df     - DataFrame to be checked for duplicate records
            column - Column to be checked for duplicate values
        OUTPUT:
            returns the records that are duplicated.
        """
        return df[df.duplicated([column])]
    
if __name__ == "__main__":
    import argparse
    
    parser = argparse.ArgumentParser(description='Perform the task of preprocessing for the task of movie recommendations')
    parser.add_argument('--BASE_DIR', metavar='path', required=True,
                        help='the path to workspace')
    
    #To execute in interactive environment
    args = parser.parse_args(args=['--BASE_DIR', 'data/users.csv',
                                 ])
    #To Execute in shell
    #args = parser.parse_args()
    
    #Initialize the class
    preprop = Preprocess(base_directory=args.BASE_DIR)
    
    #Call function read_data with the path given on the command line
    USER_DATA = preprop.read_data(args.BASE_DIR)
    print("see all args:", args)
    print("use one arg:", args.BASE_DIR)

see all args: Namespace(BASE_DIR='data/users.csv')
use one arg: data/users.csv


In [273]:
class ContentFiltering:
    """This class performs Content Filtering for movie recommendations"""
    def __init__(self, df, feature_cols):
        assert isinstance(df, pd.DataFrame), "df should be a Pandas DataFrame"
        
        self.df = df
        self.feature_cols = feature_cols
        
    def get_similarity(self, df, feature_cols):
        """
        To obtain the similarity between all movies of our dataset.
        INPUT: 
            df - DataFrame 
            feature_cols - genre columns    
        OUTPUT:
            returns the movie-by-movie similarity matrix (number of shared genres)
        """
        assert isinstance(df, pd.DataFrame), "df should be a Pandas DataFrame"
        
        similarity = df[feature_cols].values.dot(df[feature_cols].values.T)
        return similarity
    
    def get_movie_id(self, movie_name):
        """
        Get the movie id for a given movie name.
        INPUT: 
            movie_name - name of the movie in the movies dataset 
        OUTPUT:
            returns the movie id of the given movie name
        """
        assert isinstance(movie_name, str), "movie_name must be a string"
        
        movie_id = list(self.df[self.df['title']==movie_name].movie_id)[0]
        return movie_id

    def get_most_similar(self, similarity, movie_name, year=None, top=10):
        """
        Input:  
            similarity - numpy.ndarray matrix that is the returned value from get_similarity method
            movie_name - movie name to be checked for most similar movie names
            year - release year of the movie (optional)
            top - number of top-ranked movies to consider
        Output: Return a list of (movie_id, title, similarity) tuples for the most similar movies.
        """
        assert isinstance(similarity, np.ndarray), "similarity must be a NumPy array"
        # year is optional, and values read from a DataFrame are NumPy integers
        assert year is None or isinstance(year, (int, np.integer)), "year must be an integer or None"
        assert isinstance(movie_name, str), "movie_name must be a string"
        
        # get_movie_id and get_movie_name are the helpers imported from
        # content_based_filtering.helpers.movies elsewhere in this notebook
        index_movie = get_movie_id(self.df, movie_name, year)
        best = similarity[index_movie].argsort()[::-1]
        return [(ind, get_movie_name(self.df, ind), similarity[index_movie, ind]) for ind in best[:top] if ind != index_movie]
    
    def get_recommendations(self, user_id, input_cols, ratings_df, n):
        """
        Input:  
            user_id - id of the user to recommend movies for
            input_cols - column names of the returned dataframe (must include 'similarity')
            ratings_df - ratings dataframe
            n - number of rows to be fetched
        Output: Return the recommendations for the given User ID.
        """
        # Build the similarity matrix from the instance data instead of relying on a global variable
        similarity = self.get_similarity(self.df, self.feature_cols)
        top_movies = ratings_df[ratings_df['user_id'] == user_id].sort_values(by='rating', ascending=False).head(3)['movie_id']

        most_similars = []
        for top_movie in top_movies:
            most_similars += self.get_most_similar(similarity, get_movie_name(self.df, top_movie), get_movie_year(self.df, top_movie))

        return pd.DataFrame(most_similars, columns=input_cols).drop_duplicates().sort_values(by='similarity', ascending=False).head(n)

In [91]:
#Initialize the preprocessing class
preproc = Preprocess('data/users.csv')

In [92]:
# Read user data
users = preproc.read_data('data/users.csv')
users.head(2)

Unnamed: 0,user_id,gender,age,occupation,zip_code
0,0,F,1,10,48067
1,1,M,56,16,70072


In [93]:
# Read movies data
movies = preproc.read_data('data/movies.csv')
movies.head(2)

Unnamed: 0,movie_id,title,year,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,...,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,0,Toy Story,1995,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,Jumanji,1995,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [94]:
#Read ratings data
ratings = preproc.read_data("data/ratings.csv")
ratings.head(2)

Unnamed: 0,user_id,movie_id,rating
0,0,1176,5
1,0,655,3


### Check for existence of duplicates

In [99]:
#Check whether any movie is duplicated or not
preproc.check_duplicates(movies, 'movie_id')

Unnamed: 0,movie_id,title,year,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,...,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western


In [97]:
#Check whether any user_id is duplicated or not
preproc.check_duplicates(users, 'user_id')

Unnamed: 0,user_id,gender,age,occupation,zip_code
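
The empty results above mean that no `movie_id` or `user_id` appears more than once. As a small sketch on made-up data (the `toy_users` frame is hypothetical and not part of the dataset), note that `DataFrame.duplicated` with its default `keep='first'` flags only the second and later occurrences of a value, while `keep=False` flags all of them:

```python
import pandas as pd

# Hypothetical toy frame, for illustration only
toy_users = pd.DataFrame({"user_id": [0, 1, 1, 2], "age": [25, 35, 35, 45]})

# Default keep='first': only the later occurrence of user_id 1 is flagged
later_duplicates = toy_users[toy_users.duplicated(["user_id"])]

# keep=False flags every row whose user_id appears more than once
all_duplicates = toy_users[toy_users.duplicated(["user_id"], keep=False)]

print(later_duplicates)
print(all_duplicates)
```

Which variant is preferable probably depends on whether the goal is to drop the extra rows or to inspect every conflicting record.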


In [16]:
print("Total number of users =", users.user_id.count())
print("Total number of movies =", movies.shape[0])
num_of_features = len(movies.iloc[:, 3:].columns)
print("Total number of features / genres =", num_of_features)
print("Total number of rows in the ratings dataset =", ratings.shape[0])
print("Total number of unique users in the ratings dataset =",len(np.unique(ratings.user_id)))

Total number of users = 6040
Total number of movies = 3883
Total number of features / genres = 18
Total number of rows in the ratings dataset = 1000209
Total number of unique users in the ratings dataset = 6040


# Content-based Filtering

Content-based filtering uses item features to recommend other items similar to what the user likes, based on their previous actions or explicit feedback. We don't use other users' information!

For example, if user `A` liked `Harry Potter 1`, they will probably like `Harry Potter 2`.

In [4]:
%%html
<img src='https://miro.medium.com/max/1642/1*BME1JjIlBEAI9BV5pOO5Mg.png' height="300" width="250"/>

Which movies are similar? To answer this question, we need to build a similarity measure. 

## Features

This measure will operate on the characteristics (**features**) of the movies to determine which ones are close. In our case, we have access to the genres of the movies. For example, the genres of `Toy Story` are `Animation`, `Children's` and `Comedy`. This is represented as follows in our dataset:

In [23]:
# Extract feature columns
genre_cols = list(movies.iloc[:, 3:].columns)

In [24]:
genre_cols

['Animation',
 "Children's",
 'Comedy',
 'Adventure',
 'Fantasy',
 'Romance',
 'Drama',
 'Action',
 'Crime',
 'Thriller',
 'Horror',
 'Sci-Fi',
 'Documentary',
 'War',
 'Musical',
 'Mystery',
 'Film-Noir',
 'Western']

In [25]:
genre_and_title_cols = ['title'] + genre_cols 

movies[genre_and_title_cols].head()

Unnamed: 0,title,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,Toy Story,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Jumanji,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Grumpier Old Men,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Waiting to Exhale,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Father of the Bride Part II,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Similarity

Now that we have some features, we need a function that measures similarity. The similarity function takes two items (two lists of features) and returns a number that grows with their similarity. 

In what follows, we define the similarity between two movies as the number of genres they have in common.

Here is an example with `Toy Story` and `E.T.`:

In [6]:
toy_story_genres = movies[genre_and_title_cols].loc[movies.title == 'Toy Story'][genre_cols].iloc[0]
toy_story_genres

Animation      1.0
Children's     1.0
Comedy         1.0
Adventure      0.0
Fantasy        0.0
Romance        0.0
Drama          0.0
Action         0.0
Crime          0.0
Thriller       0.0
Horror         0.0
Sci-Fi         0.0
Documentary    0.0
War            0.0
Musical        0.0
Mystery        0.0
Film-Noir      0.0
Western        0.0
Name: 0, dtype: float64

In [27]:
et_genres = movies[genre_and_title_cols].loc[movies.title == 'E.T. the Extra-Terrestrial'][genre_cols].iloc[0]
et_genres

Animation      0.0
Children's     1.0
Comedy         0.0
Adventure      0.0
Fantasy        1.0
Romance        0.0
Drama          1.0
Action         0.0
Crime          0.0
Thriller       0.0
Horror         0.0
Sci-Fi         1.0
Documentary    0.0
War            0.0
Musical        0.0
Mystery        0.0
Film-Noir      0.0
Western        0.0
Name: 1081, dtype: float64

In [29]:
type(et_genres.values)

numpy.ndarray

In [8]:
et_genres.values * toy_story_genres

Animation      0.0
Children's     1.0
Comedy         0.0
Adventure      0.0
Fantasy        0.0
Romance        0.0
Drama          0.0
Action         0.0
Crime          0.0
Thriller       0.0
Horror         0.0
Sci-Fi         0.0
Documentary    0.0
War            0.0
Musical        0.0
Mystery        0.0
Film-Noir      0.0
Western        0.0
Name: 0, dtype: float64

In [9]:
(et_genres.values * toy_story_genres).sum() # scalar product

1.0

So our similarity measure returns `1.0` for these two movies. 

Let's see another example where we compare `Toy Story` and `Pocahontas`:

In [10]:
pocahontas_genres = movies[genre_and_title_cols].loc[movies.title == 'Pocahontas'][genre_cols].iloc[0]
(pocahontas_genres.values * toy_story_genres).sum()

2.0

This tells us that `Pocahontas` is closer to `Toy Story` than `E.T.` is, which makes sense.


## Scaling up

OK, that's a nice measure. Now we are going to scale it up to all the movies in our dataset. To do so efficiently, let's look at the operation we just did from a mathematical point of view. We can think of the list of features of a movie as a vector `V`. Then our similarity measure between `Toy Story` and `E.T.` becomes:
$V_{ToyStory} \cdot V_{ET}^{T}$

More generally, the similarity measure between a movie `i` and another movie `j` is: $V_{i} \cdot V_{j}^{T}$
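
As a worked instance using the two genre vectors printed earlier, where `Children's` is the only genre set to 1 in both, the product expands to a sum over the genres $g$ and gives the same `1.0` as the code:

$V_{ToyStory} \cdot V_{ET}^{T} = \sum_{g} V_{ToyStory,g} \, V_{ET,g} = 1$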

Now we can think of `movies` as a matrix containing all the feature vectors describing the movies. Here is how our similarity measure looks in this context:

![](imgs/dot_product_matrices.png)

To obtain the similarity between all the movies in our dataset, we take the dot product of the `movies` matrix with its transpose.
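
To make the matrix product concrete, here is a minimal sketch on a made-up genre matrix (`toy_features` is hypothetical; the real notebook uses `movies[genre_cols]`). Entry `(i, j)` of the result is the number of genres that movies `i` and `j` share:

```python
import numpy as np

# Hypothetical genre matrix: one row per movie, one column per genre
# columns: Animation, Children's, Comedy, Adventure
toy_features = np.array([
    [1, 1, 1, 0],  # movie 0
    [0, 1, 0, 1],  # movie 1
    [0, 0, 1, 0],  # movie 2
])

# Entry (i, j) is the dot product of rows i and j,
# i.e. the number of genres movies i and j have in common
toy_similarity = toy_features @ toy_features.T
print(toy_similarity)
```

The result is symmetric, and its diagonal holds each movie's own genre count, so with this measure a movie is always at least as similar to itself as to any other movie.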

In [271]:
movies1 = np.array([1, 2, 3])

In [275]:
# Initialize the ContentFiltering class
cont_filter = ContentFiltering(movies, genre_cols)

In [249]:
similarity = cont_filter.get_similarity(cont_filter.df, cont_filter.feature_cols)

In [250]:
similarity

array([[3., 1., 1., ..., 0., 0., 0.],
       [1., 3., 0., ..., 0., 0., 0.],
       [1., 0., 2., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 1., 1.],
       [0., 0., 0., ..., 1., 1., 1.],
       [0., 0., 0., ..., 1., 1., 2.]])

In [251]:
type(similarity)

numpy.ndarray

In [252]:
similarity.shape

(3883, 3883)

In [253]:
#Get the similarity of a specific film with respect to the rest of the films
movies.head(5)

Unnamed: 0,movie_id,title,year,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,...,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,0,Toy Story,1995,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,Jumanji,1995,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,Grumpier Old Men,1995,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,Waiting to Exhale,1995,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,Father of the Bride Part II,1995,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We can now get the similarity between `Toy Story` and any other movie of our dataset

In [230]:
#list(movies[movies['title']=='Father of the Bride Part II'].movie_id)[0]

In [254]:
# Get the movie id for a given film
movie_id = cont_filter.get_movie_id('Father of the Bride Part II')

In [255]:
movie_id

4

In [256]:
#Pass the movie id of the film as an index to similarity. This gives the similarity between the given movie name
# and any other movie of our dataset
similarity_for_father_of_the_bride = similarity[movie_id]

In [257]:
similarity_for_father_of_the_bride

array([1., 0., 1., ..., 0., 0., 0.])

In [258]:
similarity_with_toy_story = similarity[0] # 0 is Toy Story
similarity_with_toy_story

array([3., 1., 1., ..., 0., 0., 0.])

In [259]:
for i in range(10):
    print(f"Similarity between Toy story and {movies.iloc[i]['title']} (index {i}) is {similarity_with_toy_story[i]}")

Similarity between Toy story and Toy Story (index 0) is 3.0
Similarity between Toy story and Jumanji (index 1) is 1.0
Similarity between Toy story and Grumpier Old Men (index 2) is 1.0
Similarity between Toy story and Waiting to Exhale (index 3) is 1.0
Similarity between Toy story and Father of the Bride Part II (index 4) is 1.0
Similarity between Toy story and Heat (index 5) is 0.0
Similarity between Toy story and Sabrina (index 6) is 1.0
Similarity between Toy story and Tom and Huck (index 7) is 1.0
Similarity between Toy story and Sudden Death (index 8) is 0.0
Similarity between Toy story and GoldenEye (index 9) is 0.0


## A bit of polishing

### Helpers:

We also built some helpers to handle the movies dataset:

In [260]:
from content_based_filtering.helpers.movies import get_movie_id, get_movie_name, get_movie_year
    
print (get_movie_id(movies, 'Toy Story'))
print (get_movie_id(movies, 'Die Hard'))

print (get_movie_name(movies, 0))
print (get_movie_name(movies, 1000))
print (get_movie_year(movies, 1000))

0
1023
Toy Story
Parent Trap, The
1961
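
The source of `content_based_filtering.helpers.movies` is not shown in this notebook. As a rough sketch of what such helpers might look like, assuming the `movie_id`, `title` and `year` columns seen in `movies.head()` (the `lookup_*` names and the `toy_movies` frame are hypothetical, not the actual helper module):

```python
import pandas as pd

def lookup_movie_id(movies_df, title, year=None):
    """Return the movie_id of the first movie matching title (and year, if given)."""
    mask = movies_df["title"] == title
    if year is not None:
        mask &= movies_df["year"] == year
    matches = movies_df.loc[mask, "movie_id"]
    if matches.empty:
        raise KeyError(f"No movie found for title={title!r}, year={year!r}")
    return int(matches.iloc[0])

def lookup_movie_name(movies_df, movie_id):
    """Return the title stored for movie_id."""
    return movies_df.loc[movies_df["movie_id"] == movie_id, "title"].iloc[0]

def lookup_movie_year(movies_df, movie_id):
    """Return the release year stored for movie_id as a plain int."""
    return int(movies_df.loc[movies_df["movie_id"] == movie_id, "year"].iloc[0])

# Hypothetical data with two movies sharing a title
toy_movies = pd.DataFrame({
    "movie_id": [0, 1, 2],
    "title": ["Toy Story", "Psycho", "Psycho"],
    "year": [1995, 1960, 1998],
})
print(lookup_movie_id(toy_movies, "Psycho", 1998))
```

Raising an explicit `KeyError` when nothing matches is a design choice of this sketch; it avoids the less readable `IndexError` that indexing an empty result would produce.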


### Finding similar movies:
Here is a method that returns the movies most similar to a given movie:

In [261]:
top_10_similar_movies = cont_filter.get_most_similar(similarity, 'Toy Story')
top_10_similar_movies

[(667, 'Space Jam', 3.0),
 (3685, 'Adventures of Rocky and Bullwinkle, The', 3.0),
 (3682, 'Chicken Run', 3.0),
 (2009, 'Jungle Book, The', 3.0),
 (2011, 'Lady and the Tramp', 3.0),
 (2012, 'Little Mermaid, The', 3.0),
 (2033, 'Steamboat Willie', 3.0),
 (2072, 'American Tail, An', 3.0),
 (2073, 'American Tail: Fievel Goes West, An', 3.0)]

In [262]:
cont_filter.get_most_similar(similarity, 'Psycho', 1960) 

[(3593, "Puppet Master III: Toulon's Revenge", 2.0),
 (2923, 'Rawhead Rex', 2.0),
 (1312, 'Believers, The', 2.0),
 (3407, "Jacob's Ladder", 2.0),
 (1957, 'Disturbing Behavior', 2.0),
 (1927, 'Poltergeist III', 2.0),
 (1926, 'Poltergeist II: The Other Side', 2.0),
 (1925, 'Poltergeist', 2.0),
 (732, 'Thinner', 2.0),
 (69, 'From Dusk Till Dawn', 2.0)]
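
Note that the `Toy Story` call returns nine entries rather than ten: `get_most_similar` truncates to `top` first and only then filters out the query movie. In addition, `argsort()[::-1]` gives no guaranteed order among tied scores. The sketch below shows one possible variant (the function name `top_k_similar` and the toy matrix are hypothetical) that removes the query movie before truncating and uses a stable sort so that ties are deterministic:

```python
import numpy as np

def top_k_similar(similarity, movie_index, k=3):
    """Return (index, score) pairs for the k rows most similar to movie_index."""
    scores = similarity[movie_index]
    # Stable sort on negated scores: highest first, ties keep ascending index order
    order = np.argsort(-scores, kind="stable")
    order = order[order != movie_index]  # drop the query movie before truncating
    return [(int(i), float(scores[i])) for i in order[:k]]

# Hypothetical 4 x 4 similarity matrix
toy_similarity = np.array([
    [3.0, 1.0, 1.0, 2.0],
    [1.0, 2.0, 0.0, 1.0],
    [1.0, 0.0, 1.0, 0.0],
    [2.0, 1.0, 0.0, 2.0],
])
print(top_k_similar(toy_similarity, 0, k=2))
```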

### Giving a recommendation:

And finally, let's find some movies to recommend based on previously liked movies:

In [267]:
# Get the top recommendations for the user_id
index = ['movie_id', 'title', 'similarity']
user_id= 0 # Enter the user id
recommendations = cont_filter.get_recommendations(user_id, index, ratings, 10)
recommendations

Unnamed: 0,movie_id,title,similarity
13,773,"Hunchback of Notre Dame, The",3.0
14,1526,Hercules,3.0
27,2072,"American Tail, An",3.0
26,2033,Steamboat Willie,3.0
25,2012,"Little Mermaid, The",3.0
22,3682,Chicken Run,3.0
21,3685,"Adventures of Rocky and Bullwinkle, The",3.0
20,667,Space Jam,3.0
19,2011,Lady and the Tramp,3.0
18,655,James and the Giant Peach,3.0
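
One thing to notice in this output: movie `655` is recommended to user `0`, although the ratings table shown earlier contains a rating by user `0` for that same movie. The following sketch, on made-up data, shows one way the same idea could exclude movies the user has already rated (the function `recommend` and the toy arrays are hypothetical, not part of the notebook's code):

```python
import numpy as np
import pandas as pd

# Hypothetical genre matrix (5 movies x 3 genres) and ratings
toy_features = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
])
toy_ratings = pd.DataFrame({
    "user_id": [0, 0, 1],
    "movie_id": [0, 3, 2],
    "rating": [5, 2, 4],
})

def recommend(user_id, ratings_df, features, n_liked=1, n=3):
    """Recommend unseen movies most similar to the user's n_liked best-rated ones."""
    similarity = features @ features.T
    user_ratings = ratings_df[ratings_df["user_id"] == user_id]
    liked = user_ratings.sort_values("rating", ascending=False).head(n_liked)["movie_id"]
    seen = set(user_ratings["movie_id"])
    # Best similarity of each candidate to any of the liked movies
    best = similarity[liked.to_numpy()].max(axis=0)
    candidates = [(m, int(best[m])) for m in range(len(best)) if m not in seen]
    candidates.sort(key=lambda pair: -pair[1])  # stable sort: ties keep index order
    return candidates[:n]

print(recommend(0, toy_ratings, toy_features))
```

Whether already rated movies should be filtered out is a product decision; this sketch simply assumes that the goal is to suggest movies the user has not seen yet.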


# Conclusion:

The code presented is a first implementation, but it has a number of shortcomings that prevent multiple MLEs and Data Scientists from collaborating on it:
- It is not easy to introduce new features, mainly because the code is just a bunch of functions in one file.
- The code cannot be scaled to other datasets or variations of the task.
- There is no evaluation of performance.
- There is no testing.

Additionally, we could think of some features to add. For example, what about looking at similar users to find a recommendation for our target user?
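
As a rough illustration of the "similar users" idea, here is a minimal sketch on a made-up rating matrix. Cosine similarity between users is an assumption of this sketch, as are the function names and the convention that `0` means "not rated"; it also assumes every user has rated at least one movie:

```python
import numpy as np

# Hypothetical user x movie rating matrix, 0 meaning "not rated"
toy_ratings = np.array([
    [5, 4, 0, 0],  # user 0 (target)
    [5, 5, 4, 0],  # user 1
    [0, 1, 0, 5],  # user 2
], dtype=float)

def most_similar_user(ratings, user):
    """Return the index of the other user with the highest cosine similarity."""
    norms = np.linalg.norm(ratings, axis=1)
    cosine = ratings @ ratings[user] / (norms * norms[user])
    cosine[user] = -np.inf  # never pick the user themselves
    return int(np.argmax(cosine))

def recommend_from_neighbour(ratings, user):
    """Movies the nearest neighbour rated but the target user has not, best first."""
    neighbour = most_similar_user(ratings, user)
    unseen = np.flatnonzero((ratings[user] == 0) & (ratings[neighbour] > 0))
    return sorted((int(m) for m in unseen), key=lambda m: -ratings[neighbour, m])

print(most_similar_user(toy_ratings, 0), recommend_from_neighbour(toy_ratings, 0))
```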

### To address point 1
We can now easily introduce new features by adding methods to the `ContentFiltering` class and calling them whenever required.

### To address point 2
The code can now be scaled to other datasets because the datasets are not hardcoded: they are passed as 
parameters to the methods of the class, which can also perform variations of the task.

### To address point 3
We could use Singular Value Decomposition (SVD) and check its performance (scope for future work).
Since the instructions were to spend only 4 hours on this task, I did not evaluate SVD.
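
To give an idea of what such a check could look like, here is a minimal sketch that builds rank-`k` approximations of a small, fully observed rating matrix with `np.linalg.svd` and measures the reconstruction error. The data and the function names are hypothetical, and this is only an illustration of the mechanics, not an evaluation of the recommender:

```python
import numpy as np

# Hypothetical, fully observed user x movie rating matrix
toy_ratings = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])

def rank_k_approximation(matrix, k):
    """Rank-k approximation of matrix built from its k largest singular values."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k]

def rmse(a, b):
    """Root mean squared error between two arrays of the same shape."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

# The error shrinks as more singular values are kept
errors = [rmse(toy_ratings, rank_k_approximation(toy_ratings, k)) for k in range(1, 5)]
print(errors)
```

The real ratings matrix is mostly empty (about 1 million ratings for 6040 users and 3883 movies, roughly 4% of the cells), so a realistic evaluation would need to handle missing entries and measure the error on held-out ratings rather than on the data used for the decomposition.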

### To address point 4
Testing has been done using Python `assert` statements.
We can also use the built-in `unittest` module or the third-party `pytest` library to perform more rigorous testing.
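
The `assert` statements inside the classes only check argument types at run time, and Python removes them when run with the `-O` flag. Unit tests check behaviour instead. Below is a minimal sketch of tests that `pytest` could collect; `genre_similarity` is a hypothetical standalone stand-in for the computation in `ContentFiltering.get_similarity`, used so that the example is self-contained:

```python
import numpy as np

def genre_similarity(features):
    """Pairwise number of shared genres (same computation as get_similarity)."""
    features = np.asarray(features)
    return features @ features.T

def test_similarity_is_symmetric():
    features = np.array([[1, 1, 0], [0, 1, 1]])
    similarity = genre_similarity(features)
    assert np.array_equal(similarity, similarity.T)

def test_diagonal_counts_genres():
    # A movie with no genre at all should get a zero on the diagonal
    features = np.array([[1, 1, 0], [0, 1, 1], [0, 0, 0]])
    assert genre_similarity(features).diagonal().tolist() == [2, 2, 0]
```

Saved in a file whose name starts with `test_`, these functions would be discovered and run by the `pytest` command.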

For debugging, we can use a Python debugger (for example the one built into PyCharm) or the `pdb` module for more extensive debugging.