# Collaborative Filtering

This notebook builds a collaborative filtering model, also known as a recommender system, using data from [MovieLens](https://grouplens.org/datasets/movielens/100k/) and following the code direction given by Jeremy Howard in the FastAI course 'Practical Deep Learning for Coders'. The model uses the dot product method and returns the top 5 movies most similar to a movie supplied by the user.

In [1]:
from fastai.collab import *
from fastai.tabular.all import *
path = untar_data(URLs.ML_100k) # data from MovieLens

The README file explains that the <i>u.data</i> file contains the user, movie, rating and timestamp in a tab-separated format with 100k entries.

In [2]:
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user','movie','rating','timestamp']) # provide column names
ratings.head()

| | user | movie | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |


Load the movie data from <i>u.item</i> to map the movie IDs in the 'ratings' dataframe to the movie titles.

In [3]:
movies = pd.read_csv(path/'u.item',  delimiter='|', encoding='latin-1',
                     usecols=(0,1), names=('movie','title'), header=None)
movies.head()

| | movie | title |
|---|---|---|
| 0 | 1 | Toy Story (1995) |
| 1 | 2 | GoldenEye (1995) |
| 2 | 3 | Four Rooms (1995) |
| 3 | 4 | Get Shorty (1995) |
| 4 | 5 | Copycat (1995) |


Merge the two dataframes on their shared 'movie' column.

In [4]:
ratings = ratings.merge(movies)
ratings.head()

| | user | movie | rating | timestamp | title |
|---|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 | Kolya (1996) |
| 1 | 63 | 242 | 3 | 875747190 | Kolya (1996) |
| 2 | 226 | 242 | 5 | 883888671 | Kolya (1996) |
| 3 | 154 | 242 | 3 | 879138235 | Kolya (1996) |
| 4 | 306 | 242 | 5 | 876503793 | Kolya (1996) |
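
When `merge` is called without an `on` argument, pandas joins on the columns the two dataframes have in common and performs an inner join by default. The toy example below (made-up values, hypothetical names `toy_ratings` and `toy_movies`) is a small sketch of that behaviour, showing that spelling out the key gives the same result.

```python
import pandas as pd

# Toy stand-ins for the real dataframes (values are illustrative only)
toy_ratings = pd.DataFrame({'user': [1, 2, 3], 'movie': [10, 20, 10], 'rating': [4, 3, 5]})
toy_movies = pd.DataFrame({'movie': [10, 20], 'title': ['Movie A', 'Movie B']})

# With no `on` argument, merge joins on the column the two frames share ('movie')
implicit = toy_ratings.merge(toy_movies)

# Naming the key and the join type gives the same result and is easier to read
explicit = toy_ratings.merge(toy_movies, on='movie', how='inner')

print(explicit)
```

Stating `on='movie'` explicitly can be a safer habit, since an unexpected shared column name would otherwise silently become part of the join key.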


Create the DataLoaders using the <i>CollabDataLoaders.from_df</i> factory method. Since we are loading the data from a dataframe, the method by default assumes the first column to be the user, the second column to be the item (here, the movie) and the third to be the rating for that movie. To make the batches readable, we pass item_name='title' so that the movie title is used instead of the movie ID.

In [5]:
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64) # bs - batch size
dls.show_batch()

| | user | title | rating |
|---|---|---|---|
| 0 | 295 | Seven (Se7en) (1995) | 4 |
| 1 | 109 | Interview with the Vampire (1994) | 3 |
| 2 | 347 | Ghost and the Darkness, The (1996) | 3 |
| 3 | 141 | Leaving Las Vegas (1995) | 1 |
| 4 | 561 | Terminator, The (1984) | 3 |
| 5 | 167 | Waterworld (1995) | 1 |
| 6 | 405 | Poison Ivy II (1995) | 1 |
| 7 | 864 | Demolition Man (1993) | 3 |
| 8 | 314 | Clerks (1994) | 5 |
| 9 | 559 | Apocalypse Now (1979) | 4 |


To develop a recommender system, the model needs to learn each user's likes and dislikes in movies. Each movie can be described by several defining factors, e.g. action, comedy, drama, romance, rom-com, thriller, etc. Similarly, a user can be described by factors that reflect the kind of movies they like and rate highly. To recommend movies to a user, a relationship between the user factors and the movie factors needs to be formulated. These factors are not given in the data, and not every user has watched and rated every movie, so they have to be learned from the ratings that do exist. For this, FastAI's collab_learner uses Stochastic Gradient Descent (SGD) and weight decay (L2 regularisation) to determine the weights for each factor, which then help recommend movies to the user.
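
To make the idea concrete, here is a minimal NumPy sketch of learning latent factors with SGD from a handful of observed ratings. It is not what fastai runs internally; the toy ratings, the sizes, and the `lr` and `wd` values are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy observed ratings as (user index, movie index, rating); unrated pairs are simply absent
observed = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (2, 1, 5.0), (2, 2, 4.0)]
n_users, n_movies, n_factors = 3, 3, 2

# Small random starting values for the latent factors
user_factors = rng.normal(0, 0.1, (n_users, n_factors))
movie_factors = rng.normal(0, 0.1, (n_movies, n_factors))

def mse(uf, mf):
    """Mean squared error over the observed ratings only."""
    return float(np.mean([(uf[u] @ mf[m] - r) ** 2 for u, m, r in observed]))

loss_before = mse(user_factors, movie_factors)

lr, wd = 0.05, 0.01  # hypothetical learning rate and weight decay
for epoch in range(200):
    for u, m, r in observed:
        err = user_factors[u] @ movie_factors[m] - r
        # Gradient of the squared error plus the L2 penalty, for each side
        grad_u = 2 * err * movie_factors[m] + 2 * wd * user_factors[u]
        grad_m = 2 * err * user_factors[u] + 2 * wd * movie_factors[m]
        user_factors[u] -= lr * grad_u
        movie_factors[m] -= lr * grad_m

loss_after = mse(user_factors, movie_factors)
print(round(loss_before, 3), '->', round(loss_after, 3))

# The learned factors also give a prediction for a pair that was never rated
print('user 1, movie 1:', round(float(user_factors[1] @ movie_factors[1]), 2))
```

Only the observed (user, movie) pairs contribute to the loss, which is how the missing ratings are handled: they are never treated as zeros.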

Before creating a learner, we can take a look at the data held by the dataloader.

In [6]:
n_users = len(dls.classes['user'])
n_movies = len(dls.classes['title'])
print('The number of unique users are',n_users,'and number of the unique movies are', n_movies,'.')

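The number of unique users are 944 and number of the unique movies are 1665 .

Note that these counts include the '#na#' placeholder class that fastai adds at the start of each category list, so the data itself contains one fewer user and one fewer title.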


Now that we have the dataloader, we can develop a learner. FastAI has its own collaborative filtering model that can be used. The model essentially takes the dot product of a user's latent factors with a movie's latent factors. For this dataset, 50 latent factors will be used, i.e. each movie and each user will have 50 defining factors. The predictions by the learner need to fall within the range of the ratings, which go up to 5. Based on an empirical finding by the creators of FastAI, it is best to set the upper limit of the range slightly higher than 5, since the sigmoid used to squash the predictions never quite reaches its limits; here it is 5.5.

In [7]:
learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))

<b>Explanation:</b> There are two ways to develop a collaborative filtering model: using a dot product or using deep learning. As mentioned before, this notebook uses the dot product, which is why the 'use_nn' parameter of collab_learner is left at its default of <b>False</b>. With this setting, the learner builds an [EmbeddingDotBias](https://docs.fast.ai/collab.html#EmbeddingDotBias) model from the classes in the dataloader, the number of factors (n_factors) and the range of ratings (y_range). The model stores the latent factors of the users and items (movies) as embeddings, with n_factors values per user and per item. These embeddings are the layers of the model.
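
The computation behind a dot product model with biases can be sketched in a few lines of NumPy. This is an illustration of the idea rather than fastai's actual implementation: the array names mirror the layer names in the model, while the sizes and random values are assumptions.

```python
import numpy as np

def sigmoid_range(x, low, high):
    """Squash x into the open interval (low, high) with a sigmoid."""
    return 1 / (1 + np.exp(-x)) * (high - low) + low

rng = np.random.default_rng(0)
n_users, n_items, n_factors = 4, 6, 50   # toy sizes, not the real dataset

u_weight = rng.normal(0, 0.01, (n_users, n_factors))  # user latent factors
i_weight = rng.normal(0, 0.01, (n_items, n_factors))  # item latent factors
u_bias = np.zeros(n_users)                            # one bias per user
i_bias = np.zeros(n_items)                            # one bias per item

def predict(user_idx, item_idx, y_range=(0, 5.5)):
    # Dot product of the user's and the item's factors, one value per pair
    dot = (u_weight[user_idx] * i_weight[item_idx]).sum(axis=-1)
    raw = dot + u_bias[user_idx] + i_bias[item_idx]
    return sigmoid_range(raw, *y_range)

users = np.array([0, 1, 3])
items = np.array([2, 2, 5])
preds = predict(users, items)
print(preds)
```

Because the sigmoid only approaches its limits, a range of (0, 5.5) lets the model reach a prediction of 5 without needing an extremely large raw score.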

Use fit_one_cycle() to train the model with the [1cycle learning rate policy](https://iconof.com/1cycle-learning-rate-policy/), here for 5 epochs with a maximum learning rate of 0.005. Training uses Stochastic Gradient Descent (SGD) to learn the values of the weights, i.e. the latent factors and biases (fastai's default optimizer, Adam, is a variant of SGD). The weight decay (wd), or L2 regularization, keeps the weights as small as possible to avoid overfitting. The larger the weight decay, the larger the penalty on large weights.
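
As a rough sketch of how a one-cycle schedule shapes the learning rate over training, the function below warms up to a peak with cosine interpolation and then anneals down. The peak of 5e-3 is the value passed to fit_one_cycle() in this notebook. The `div`, `div_final` and `pct_start` values are assumed defaults and should be checked against the installed fastai version. `one_cycle_lr` is a hypothetical helper, not a fastai function.

```python
import math

def one_cycle_lr(progress, lr_max=5e-3, div=25.0, div_final=1e5, pct_start=0.25):
    """Learning rate at `progress` (0 to 1 through training) under a cosine one-cycle schedule."""
    def cos_anneal(start, end, pos):
        # Cosine interpolation from `start` (pos=0) to `end` (pos=1)
        return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

    if progress < pct_start:
        # Warm-up phase: rise from lr_max / div up to lr_max
        return cos_anneal(lr_max / div, lr_max, progress / pct_start)
    # Annealing phase: fall from lr_max down to lr_max / div_final
    pos = (progress - pct_start) / (1 - pct_start)
    return cos_anneal(lr_max, lr_max / div_final, pos)

for p in (0.0, 0.25, 0.5, 1.0):
    print(p, one_cycle_lr(p))
```

The short warm-up avoids large, unstable updates while the embeddings are still random, and the long decay lets the weights settle.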

In [8]:
learn.fit_one_cycle(5, 5e-3, wd=0.1) 

| epoch | train_loss | valid_loss | time |
|---|---|---|---|
| 0 | 0.929023 | 0.92426 | 00:10 |
| 1 | 0.884289 | 0.870602 | 00:09 |
| 2 | 0.707542 | 0.83372 | 00:10 |
| 3 | 0.584835 | 0.819985 | 00:10 |
| 4 | 0.49499 | 0.821814 | 00:10 |
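
The effect of the wd argument can be illustrated on a single weight. The sketch below minimises a squared error plus an L2 penalty with plain gradient descent. `fit_weight` and its values are hypothetical, and the penalty is written in its simplest textbook form rather than the way fastai applies it.

```python
def fit_weight(target, wd, lr=0.1, steps=200):
    """Minimise (w - target)**2 + wd * w**2 with plain gradient descent."""
    w = 0.0
    for _ in range(steps):
        grad = 2 * (w - target) + 2 * wd * w   # loss gradient plus the weight decay term
        w -= lr * grad
    return w

for wd in (0.0, 0.1, 1.0):
    print(wd, round(fit_weight(3.0, wd), 3))
```

The minimum of this loss sits at target / (1 + wd), so a larger weight decay pulls the learned weight closer to zero.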


The valid_loss is calculated using MSELossFlat (mean squared error) by default. The model contains trained weights for the users and items (movies), and a bias for every user and every item.

In [9]:
learn.model

EmbeddingDotBias(
  (u_weight): Embedding(944, 50)
  (i_weight): Embedding(1665, 50)
  (u_bias): Embedding(944, 1)
  (i_bias): Embedding(1665, 1)
)

A quick peek at the embeddings for the user latent factors:

In [10]:
learn.model.u_weight.weight

Parameter containing:
tensor([[-1.9443e-03,  1.1393e-04, -2.2330e-03,  ..., -3.2655e-03,
         -5.3349e-05,  1.1588e-03],
        [-2.1976e-01, -3.0741e-01, -6.1382e-02,  ..., -1.0371e-02,
         -2.9328e-01, -1.9306e-01],
        [-5.6712e-02,  5.9999e-02, -2.0124e-01,  ...,  1.6301e-01,
         -8.8005e-02, -1.9988e-01],
        ...,
        [-9.6665e-02,  1.9842e-01, -6.5452e-02,  ...,  1.8390e-01,
         -5.0774e-02, -8.9966e-02],
        [-1.8747e-01,  1.9523e-01, -2.4048e-01,  ...,  1.9683e-01,
          1.5150e-01, -5.1991e-02],
        [-2.8145e-01,  5.2345e-02,  3.5268e-01,  ...,  2.6607e-01,
         -2.2565e-01, -3.2856e-01]], device='cuda:0', requires_grad=True)

This matrix contains the 50 weights (latent factors) for every user, one row per user. Similarly, another matrix for the movies is stored as an embedding layer. There is only one bias value for each item and each user.
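
An embedding layer is, in effect, a lookup into such a matrix: the row at a user's index is that user's vector of latent factors. The NumPy sketch below (toy sizes and random values, all assumed) shows that indexing a row gives the same result as multiplying a one-hot vector by the matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_factors = 5, 3                         # toy sizes
u_weight = rng.normal(size=(n_users, n_factors))  # stand-in for the user embedding matrix

user_idx = 2
looked_up = u_weight[user_idx]                    # what an embedding layer does: index a row

one_hot = np.zeros(n_users)
one_hot[user_idx] = 1.0
via_matmul = one_hot @ u_weight                   # same result, but far more work

print(np.allclose(looked_up, via_matmul))
```

Indexing avoids building the one-hot vector at all, which matters once there are thousands of users and items.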

In [11]:
learn.model.i_bias.weight.squeeze().argsort()

tensor([ 295,  850,  202,  ..., 1399, 1318, 1501], device='cuda:0')

The bias correlates with the ratings of the movies: a higher bias indicates a higher rating, and vice versa. The least favoured and most favoured movies can be found by sorting the biases with argsort(). Rather than returning the sorted values, argsort() returns the indexes that would sort the array, so they can be used to look up the matching titles. Using this, we can list the 10 highest and lowest rated movies.
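
A small NumPy example of the same idea, using made-up titles and bias values:

```python
import numpy as np

titles = np.array(['Movie A', 'Movie B', 'Movie C', 'Movie D'])  # hypothetical titles
bias = np.array([0.3, -0.8, 0.9, 0.1])                           # hypothetical item biases

order = bias.argsort()              # indexes from lowest to highest bias
lowest = titles[order[:2]]          # the two lowest-bias titles
highest = titles[order[::-1][:2]]   # reverse the order to start from the highest bias

print(list(lowest), list(highest))
```

NumPy's argsort only sorts in ascending order, hence the reversal. The PyTorch tensor method used in this notebook accepts `descending=True` instead.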

In [12]:
def movie_list(value):
    # 'top' lists the highest biases; any other value lists the lowest
    descending = (value == 'top')
    label = 'top' if descending else 'Lowest'
    movie_bias = learn.model.i_bias.weight.squeeze()
    idxs = movie_bias.argsort(descending=descending)[:10] # indexes of the 10 most extreme biases
    titles = [dls.classes['title'][i] for i in idxs]
    mean_ratings = ratings.groupby('title')['rating'].mean() # get mean rating for each movie
    print('---10 {} rated movies are---'.format(label))
    for title in titles:
        print('Movie:', title, ', Mean Rating:', round(mean_ratings.loc[title], 2))

Let's look at the top 10 rated movies along with their mean ratings.

In [13]:
movie_list('top')

---10 top rated movies are---
Movie: Titanic (1997) , Mean Rating: 4.25
Movie: Shawshank Redemption, The (1994) , Mean Rating: 4.45
Movie: Star Wars (1977) , Mean Rating: 4.36
Movie: L.A. Confidential (1997) , Mean Rating: 4.16
Movie: Apt Pupil (1998) , Mean Rating: 4.1
Movie: Schindler's List (1993) , Mean Rating: 4.47
Movie: Good Will Hunting (1997) , Mean Rating: 4.26
Movie: Rear Window (1954) , Mean Rating: 4.39
Movie: Usual Suspects, The (1995) , Mean Rating: 4.39
Movie: Silence of the Lambs, The (1991) , Mean Rating: 4.29


And the lowest rated movies:

In [14]:
movie_list('low')

---10 Lowest rated movies are---
Movie: Children of the Corn: The Gathering (1996) , Mean Rating: 1.32
Movie: Lawnmower Man 2: Beyond Cyberspace (1996) , Mean Rating: 1.71
Movie: Body Parts (1991) , Mean Rating: 1.62
Movie: Robocop 3 (1993) , Mean Rating: 1.73
Movie: Crow: City of Angels, The (1996) , Mean Rating: 1.95
Movie: Mortal Kombat: Annihilation (1997) , Mean Rating: 1.95
Movie: Jury Duty (1995) , Mean Rating: 2.0
Movie: Cable Guy, The (1996) , Mean Rating: 2.34
Movie: Free Willy 3: The Rescue (1997) , Mean Rating: 1.74
Movie: Amityville 1992: It's About Time (1992) , Mean Rating: 1.0


The model can now be used to provide recommendations. As this is a small-scale model, for now it can only suggest movies similar to the one provided by the user. This is done by measuring the cosine similarity between the latent factors of the provided movie and those of every other movie, and listing the 5 closest matches as recommendations. The provided movie is always its own closest match, so it appears first in the list, followed by the 5 recommendations.
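
Cosine similarity compares the direction of two factor vectors and ignores their length: 1 means the same direction, 0 means unrelated, and -1 means opposite. The NumPy sketch below uses made-up 2-factor embeddings and a hypothetical `cosine_similarity` helper to rank neighbours while leaving out the query movie itself.

```python
import numpy as np

def cosine_similarity(matrix, vector):
    """Cosine similarity between each row of `matrix` and `vector`."""
    return (matrix @ vector) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector))

# Hypothetical 2-factor embeddings for four movies
titles = ['Movie A', 'Movie B', 'Movie C', 'Movie D']
factors = np.array([[1.0, 0.0],
                    [0.9, 0.1],
                    [0.0, 1.0],
                    [-1.0, 0.0]])

query = 0
sims = cosine_similarity(factors, factors[query])
ranked = np.argsort(-sims)                                   # most similar first
neighbours = [titles[i] for i in ranked if i != query][:2]   # drop the query itself

print(neighbours)
```

Filtering out the query index is one way to return only other titles. The notebook's function keeps the query as the first entry instead.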

In [16]:
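def similar_movie(title):
    movie_factors = learn.model.i_weight.weight
    idx = dls.classes['title'].o2i[title]
    # cosine similarity between the chosen movie and every movie (higher = more similar)
    similarities = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[idx][None])
    # the closest match is the movie itself, so take 6 to get 5 other titles
    idx = similarities.argsort(descending=True)[0:6]
    movies = dls.classes['title'][idx]
    for num, name in enumerate(movies, start=1):
        print(num, '- ', name)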

In [29]:
movie = 'Star Wars (1977)'
print('The top 5 similar movies to {} are:'.format(movie))
similar_movie(movie)

The top 5 similar movies to Star Wars (1977) are:
1 -  Star Wars (1977)
2 -  Empire Strikes Back, The (1980)
3 -  Return of the Jedi (1983)
4 -  Raiders of the Lost Ark (1981)
5 -  Fresh (1994)
6 -  Sting, The (1973)


Not surprisingly, the movies most similar to the 1977 Star Wars movie are the subsequent movies of the same franchise, followed by the action/adventure Indiana Jones movie. The latent factors of the last 2 movies are also close to those of Star Wars, which suggests that users who like the action/adventure genre tend to rate these dramas in a similar way. Let's try it again!

In [31]:
movie = "Forrest Gump (1994)"
print('The top 5 similar movies to {} are:'.format(movie))
similar_movie(movie)

The top 5 similar movies to Forrest Gump (1994) are:
1 -  Forrest Gump (1994)
2 -  Field of Dreams (1989)
3 -  Hunt for Red October, The (1990)
4 -  It's a Wonderful Life (1946)
5 -  American President, The (1995)
6 -  Shawshank Redemption, The (1994)


A full-scale model would be better informed by additional data, such as the demographics of the users. For now, this model can provide us with 5 movies at a time to watch the next time a lockdown is announced in case of a pandemic (heavens forbid!).