This notebook presents a simple implementation of a movie recommendation system based on collaborative filtering.

<h1>Table of contents</h1>

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li><a href="#ref2">Preprocessing</a></li>
        <li><a href="#ref3">Collaborative Filtering</a></li>
    </ol>
</div>
<br>
<hr>

In [1]:
import pandas as pd
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

<hr>

<a id="ref2"></a>
# Preprocessing

Data used: the full MovieLens dataset from https://grouplens.org/datasets/movielens/ (last updated 9/2018).
It contains 27,000,000 ratings and 1,100,000 tag applications applied to 58,000 movies by 280,000 users, and includes tag genome data with 14 million relevance scores across 1,100 tags.

In [2]:
movies_df = pd.read_csv('/kaggle/input/grouplens-2018/ml-latest/movies.csv')
ratings_df = pd.read_csv('/kaggle/input/grouplens-2018/ml-latest/ratings.csv')

In [3]:
movies_df.shape

(58098, 3)

In [4]:
movies_df.tail()

Unnamed: 0,movieId,title,genres
58093,193876,The Great Glinka (1946),(no genres listed)
58094,193878,Les tribulations d'une caissière (2011),Comedy
58095,193880,Her Name Was Mumu (2016),Drama
58096,193882,Flora (2017),Adventure|Drama|Horror|Sci-Fi
58097,193886,Leal (2018),Action|Crime|Drama


In [5]:
#Using regular expressions to find a year stored between parentheses
#We specify the parentheses so we don't conflict with movies that have years in their titles
movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))',expand=False)
#Removing the parentheses
movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)',expand=False)
#Removing the years from the 'title' column (regex=True is needed in pandas 2.0+, where the pattern is otherwise treated as a literal string)
movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)
#Applying the strip function to get rid of any leading or trailing whitespace characters that may have appeared
movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())
movies_df.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995
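
To see why the pattern insists on the parentheses, here is a minimal sketch of the same extraction using Python's built-in `re` module on a few hand-written titles. The helper name `split_title_year` and the sample titles are illustrative assumptions, not part of the notebook's pipeline.

```python
import re

# Matches a four-digit year wrapped in parentheses, e.g. "(1995)"
YEAR_IN_PARENS = re.compile(r'\((\d{4})\)')

def split_title_year(raw_title):
    """Return (title, year); year is None when no "(YYYY)" is present.

    Hypothetical helper, for illustration only.
    """
    match = YEAR_IN_PARENS.search(raw_title)
    year = match.group(1) if match else None
    # Remove every "(YYYY)" occurrence, then trim leftover whitespace
    title = YEAR_IN_PARENS.sub('', raw_title).strip()
    return title, year

samples = [
    'Toy Story (1995)',
    '2001: A Space Odyssey (1968)',  # the bare year in the title is left alone
    'Untitled Project',              # no year at all
]
for raw in samples:
    print(split_title_year(raw))
```

A pattern of four bare digits would have picked up `2001` in the second title; requiring the parentheses avoids that.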


In [6]:
#Dropping the genres column, as it is not needed here
movies_df = movies_df.drop(columns='genres')
movies_df.head()

Unnamed: 0,movieId,title,year
0,1,Toy Story,1995
1,2,Jumanji,1995
2,3,Grumpier Old Men,1995
3,4,Waiting to Exhale,1995
4,5,Father of the Bride Part II,1995


In [7]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,307,3.5,1256677221
1,1,481,3.5,1256677456
2,1,1091,1.5,1256677471
3,1,1257,4.5,1256677460
4,1,1449,4.5,1256677264


In [8]:
#Drop removes a specified row or column from a dataframe
ratings_df = ratings_df.drop(columns='timestamp')
ratings_df.head()

Unnamed: 0,userId,movieId,rating
0,1,307,3.5
1,1,481,3.5
2,1,1091,1.5
3,1,1257,4.5
4,1,1449,4.5


<hr>

<a id="ref3"></a>
# Collaborative Filtering

The process for creating a user-based recommendation system is as follows:
- Select a user along with the movies that user has rated
- Find other users who have rated the same movies
- Calculate a similarity score between the input user and each of those users
- Keep the top X most similar users (the neighbours) and collect all of their ratings
- Recommend the movies with the highest similarity-weighted average rating
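
The steps above can be sketched end to end in plain Python on toy data. This is a simplified illustration under stated assumptions, not the notebook's pandas implementation: the function names `pearson` and `recommend`, the users `u1` to `u3`, and the movies `A` to `E` are all made up, and unlike the notebook this sketch skips movies the input user has already rated.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; 0.0 when undefined."""
    n = len(xs)
    if n == 0:
        return 0.0
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def recommend(target, others, top_x=2):
    """Rank movies not yet rated by `target` by similarity-weighted rating."""
    # Similarity of each user, computed over the movies shared with the target
    sims = {}
    for user, ratings in others.items():
        common = sorted(set(target) & set(ratings))
        sims[user] = pearson([target[m] for m in common],
                             [ratings[m] for m in common])
    # Keep the top_x most similar users (the neighbours)
    neighbours = sorted(sims, key=sims.get, reverse=True)[:top_x]
    # Similarity-weighted average rating per candidate movie
    weighted_sum, weight_total = {}, {}
    for user in neighbours:
        for movie, rating in others[user].items():
            if movie in target:
                continue  # assumption: do not recommend already-rated movies
            weighted_sum[movie] = weighted_sum.get(movie, 0.0) + sims[user] * rating
            weight_total[movie] = weight_total.get(movie, 0.0) + sims[user]
    # Movies whose weights do not sum to a positive value are left out
    scores = {m: weighted_sum[m] / weight_total[m]
              for m in weighted_sum if weight_total[m] > 0}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy data: movie -> rating for the input user and three other users
target = {'A': 5, 'B': 3, 'C': 4}
others = {
    'u1': {'A': 4.5, 'B': 2.5, 'C': 4.0, 'D': 5.0, 'E': 2.0},
    'u2': {'A': 2.0, 'B': 5.0, 'C': 3.0, 'D': 1.0},
    'u3': {'A': 5.0, 'B': 3.5, 'C': 4.0, 'E': 4.0},
}
result = recommend(target, others)
print(result)
```

Here `u2` rates the shared movies in the opposite order to the input user, so it is not among the two neighbours and its low rating for `D` does not affect the ranking.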

In [9]:
# here's a hypothetical user that we want to make suggestions for
userInput = [
    {'title':'Avatar 2', 'rating':7},
    {'title':'13 Hours', 'rating':3.5},
    {'title':'Jumanji', 'rating':7},
    {'title':"Sherlock: The Abominable Bride", 'rating':8},
    {'title':'Jurassic World', 'rating':8},
    {'title':'Star Wars: Episode VII - The Force Awakens', 'rating':6},
    {'title':'Avengers: Age of Ultron', 'rating':9},
    {'title':'Ant-Man', 'rating':8},
    {'title':'Justice League: Throne of Atlantis', 'rating':7},
]
inputMovies = pd.DataFrame(userInput)
inputMovies

Unnamed: 0,title,rating
0,Avatar 2,7.0
1,13 Hours,3.5
2,Jumanji,7.0
3,Sherlock: The Abominable Bride,8.0
4,Jurassic World,8.0
5,Star Wars: Episode VII - The Force Awakens,6.0
6,Avengers: Age of Ultron,9.0
7,Ant-Man,8.0
8,Justice League: Throne of Atlantis,7.0


#### Add movieId to input user

In [10]:
inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]
inputId.head()

Unnamed: 0,movieId,title,year
1,2,Jumanji,1995
25688,117529,Jurassic World,2015
27547,122886,Star Wars: Episode VII - The Force Awakens,2015
27550,122892,Avengers: Age of Ultron,2015
27551,122894,Avatar 2,2016


In [11]:
inputMovies = pd.merge(inputId, inputMovies)
inputMovies

Unnamed: 0,movieId,title,year,rating
0,2,Jumanji,1995,7.0
1,117529,Jurassic World,2015,8.0
2,122886,Star Wars: Episode VII - The Force Awakens,2015,6.0
3,122892,Avengers: Age of Ultron,2015,9.0
4,122894,Avatar 2,2016,7.0
5,122900,Ant-Man,2015,8.0
6,124867,Justice League: Throne of Atlantis,2015,7.0
7,138210,13 Hours,2016,3.5
8,150548,Sherlock: The Abominable Bride,2016,8.0


In [12]:
inputMovies = inputMovies.drop(columns='year')
inputMovies

Unnamed: 0,movieId,title,rating
0,2,Jumanji,7.0
1,117529,Jurassic World,8.0
2,122886,Star Wars: Episode VII - The Force Awakens,6.0
3,122892,Avengers: Age of Ultron,9.0
4,122894,Avatar 2,7.0
5,122900,Ant-Man,8.0
6,124867,Justice League: Throne of Atlantis,7.0
7,138210,13 Hours,3.5
8,150548,Sherlock: The Abominable Bride,8.0


#### The users who have seen the same movies

In [13]:
#Keeping only the ratings of movies that the input user has also rated, and storing them
userSubset = ratings_df[ratings_df['movieId'].isin(inputMovies['movieId'].tolist())]
userSubset.head()

Unnamed: 0,userId,movieId,rating
43,4,2,4.0
1118,14,2,4.0
1280,14,117529,3.5
2266,34,122886,5.0
2663,39,2,3.5


We now group the rows by user ID.

In [14]:
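#Grouping by a plain column name (not a one-element list) keeps each group key a scalar userId in pandas 2.0+
userSubsetGroup = userSubset.groupby('userId')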

In [15]:
#Sorting so that users who share the most movies with the input user have priority
userSubsetGroup = sorted(userSubsetGroup,  key=lambda x: len(x[1]), reverse=True)

Now let's look at the first user:

In [16]:
userSubsetGroup[0]

(19924,          userId  movieId  rating
 1944892   19924        2     2.5
 1947627   19924   117529     4.0
 1947666   19924   122886     4.5
 1947667   19924   122892     5.0
 1947670   19924   122900     4.0
 1947683   19924   124867     4.5
 1947795   19924   138210     4.0
 1947868   19924   150548     3.5)
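
The group-then-sort step can also be pictured without pandas. The sketch below is a minimal illustration on a handful of hand-picked `(userId, movieId, rating)` rows shaped like the output above; the variable names `rows`, `groups` and `ranked` are assumptions made for this example.

```python
from collections import defaultdict

# Toy (userId, movieId, rating) rows, already restricted to the input user's movies
rows = [
    (4, 2, 4.0),
    (14, 2, 4.0), (14, 117529, 3.5),
    (34, 122886, 5.0),
    (19924, 2, 2.5), (19924, 117529, 4.0), (19924, 122886, 4.5),
]

# Group the rows by userId
groups = defaultdict(list)
for user_id, movie_id, rating in rows:
    groups[user_id].append((movie_id, rating))

# Users sharing the most movies with the input user come first
ranked = sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)
for user_id, ratings in ranked:
    print(user_id, len(ratings))
```

Sorting by overlap size first means that any later cut-off keeps the users whose similarity can be estimated from the most shared ratings.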

#### Similarity of users to input user
Next, we compare these users (only a subset of them, not all) to our input user and find the ones that are most similar.  
We measure how similar each user is to the input user with the __Pearson Correlation Coefficient__, which measures the strength of a linear association between two variables. The formula for this coefficient between sets X and Y with N values can be seen in the image below. 

Why Pearson Correlation?

Pearson correlation is invariant to shifting and positive scaling, i.e. adding any constant to all elements or multiplying all elements by a positive constant (a negative constant flips the sign of the coefficient). For example, for two vectors X and Y, pearson(X, Y) == pearson(X, 2 * Y + 3). This is an important property in recommendation systems: two users may rate the same items very differently in absolute terms, for instance one consistently higher than the other, yet still have similar tastes, and Pearson correlation treats them as similar.
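
The invariance can be checked numerically. The sketch below uses a small hand-written `pearson` function based on sums of squares; the two rating lists are made-up values, chosen only to illustrate the property.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation via sums of squares; returns 0.0 when undefined."""
    n = len(xs)
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    syy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

x = [7, 3.5, 7, 8, 8]           # hypothetical ratings on a 10-point scale
y = [3.5, 2.0, 3.0, 4.0, 4.5]   # hypothetical ratings on a 5-point scale

r_original = pearson(x, y)
r_rescaled = pearson(x, [2 * v + 3 for v in y])   # positive scale plus shift
r_flipped = pearson(x, [-2 * v + 3 for v in y])   # negative scale flips the sign

print(r_original, r_rescaled, r_flipped)
```

The first two values agree up to floating-point rounding, while the third has the same magnitude with the opposite sign.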

![alt text](https://wikimedia.org/api/rest_v1/media/math/render/svg/bd1ccc2979b0fd1c1aec96e386f686ae874f9ec0 "Pearson Correlation")

The values given by the formula range from r = -1 to r = 1, where 1 indicates a perfect positive correlation and -1 a perfect negative correlation. 

In our case, a 1 means that the two users have similar tastes while a -1 means the opposite.
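
For reference, the coefficient can be written in the computational form that the code in this notebook evaluates. The symbols $S_{xx}$, $S_{yy}$ and $S_{xy}$ mirror the variable names `Sxx`, `Syy` and `Sxy` used there; the image above may show this form or an algebraically equivalent one.

```latex
\begin{aligned}
S_{xx} &= \sum_{i=1}^{N} x_i^2 - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)^2, &
S_{yy} &= \sum_{i=1}^{N} y_i^2 - \frac{1}{N}\left(\sum_{i=1}^{N} y_i\right)^2, \\
S_{xy} &= \sum_{i=1}^{N} x_i y_i - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)\left(\sum_{i=1}^{N} y_i\right), &
r &= \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}}
\end{aligned}
```

Here $x_i$ and $y_i$ are the two users' ratings of the $i$-th movie they have in common, and $N$ is the number of such movies. When $S_{xx}$ or $S_{yy}$ is zero (a user gave every shared movie the same rating) the coefficient is undefined, and the code treats it as 0.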

We will select a subset of users to iterate through, namely the first 100 of the sorted list. This limit is imposed because iterating over every single user would take too long.

In [17]:
userSubsetGroup = userSubsetGroup[0:100]

Now, we calculate the Pearson Correlation between input user and subset group, and store it in a dictionary, where the key is the user Id and the value is the coefficient


In [18]:
#Store the Pearson Correlation in a dictionary, where the key is the user Id and the value is the coefficient
pearsonCorrelationDict = {}

#For every user group in our subset
for name, group in userSubsetGroup:
    #Let's start by sorting the input and current user group so the values aren't mixed up later on
    group = group.sort_values(by='movieId')
    inputMovies = inputMovies.sort_values(by='movieId')
    #Get the N for the formula
    nRatings = len(group)
    #Get the review scores for the movies that they both have in common
    temp_df = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())]
    #And then store them in a temporary buffer variable in a list format to facilitate future calculations
    tempRatingList = temp_df['rating'].tolist()
    #Let's also put the current user group reviews in a list format
    tempGroupList = group['rating'].tolist()
    #Now let's calculate the pearson correlation between two users, so called, x and y
    Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)
    Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)
    Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)
    
    #If the denominator is different than zero, then divide, else, 0 correlation.
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0


In [19]:
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['userId'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
pearsonDF.head()

Unnamed: 0,similarityIndex,userId
0,0.111197,19924
1,0.408564,22947
2,-0.429984,98450
3,-0.573884,232485
4,0.003042,233580


#### The top x similar users to input user
Now let's get the top 50 users that are most similar to the input.

In [20]:
topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]
topUsers.head()

Unnamed: 0,similarityIndex,userId
6,0.750597,275269
5,0.694981,253813
23,0.683604,63783
44,0.677894,139102
10,0.636396,9722


Now, let's start recommending movies to the input user.

#### Ratings of the selected users for all movies
We're going to do this by taking the weighted average of the ratings of the movies, using the Pearson Correlation as the weight. To do this, we first need to get the movies rated by the users in __topUsers__ from the ratings dataframe, keeping each user's correlation in the _similarityIndex_ column. This is achieved below by merging these two tables.

In [21]:
topUsersRating = topUsers.merge(ratings_df, left_on='userId', right_on='userId', how='inner')
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating
0,0.750597,275269,1,2.0
1,0.750597,275269,2,3.0
2,0.750597,275269,3,3.5
3,0.750597,275269,5,2.0
4,0.750597,275269,10,4.0


Now all we need to do is multiply each movie rating by its weight (the similarity index), sum up the weighted ratings per movie, and divide by the sum of the weights.

We can do this by multiplying two columns, grouping the dataframe by movieId, and then dividing two columns.

The next cell shows the weighted ratings that the similar users give to candidate movies for the input user:

In [22]:
#Multiplies the similarity by the user's ratings
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating,weightedRating
0,0.750597,275269,1,2.0,1.501194
1,0.750597,275269,2,3.0,2.251791
2,0.750597,275269,3,3.5,2.62709
3,0.750597,275269,5,2.0,1.501194
4,0.750597,275269,10,4.0,3.002389


In [23]:
#Applies a sum to topUsersRating after grouping it by movieId
tempTopUsersRating = topUsersRating.groupby('movieId').sum()[['similarityIndex','weightedRating']]
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head()

Unnamed: 0_level_0,sum_similarityIndex,sum_weightedRating
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,12.522402,47.648664
2,12.662972,40.461052
3,3.102489,10.855875
4,0.798442,2.280704
5,4.33065,12.172884


In [24]:
#Creates an empty dataframe
recommendation_df = pd.DataFrame()
#Now we take the weighted average
recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation_df['movieId'] = tempTopUsersRating.index
recommendation_df.head()

Unnamed: 0_level_0,weighted average recommendation score,movieId
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,3.805074,1
2,3.195225,2
3,3.499085,3
4,2.856443,4
5,2.810868,5
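
As a worked illustration of the weighted average, the sketch below computes the score for a single movie from a few `(similarityIndex, rating)` pairs. The function name `weighted_average` and all numbers are made up for this example.

```python
def weighted_average(pairs):
    """Similarity-weighted mean of ratings.

    `pairs` is a list of (similarity, rating) tuples for one movie.
    Hypothetical helper, for illustration only.
    """
    sum_weighted = sum(sim * rating for sim, rating in pairs)
    sum_sim = sum(sim for sim, _ in pairs)
    return sum_weighted / sum_sim

# Three hypothetical neighbours rated the same movie
score = weighted_average([(0.75, 4.0), (0.60, 5.0), (0.40, 2.0)])
print(score)  # (3.0 + 3.0 + 0.8) / 1.75, roughly 3.89

# A movie rated by a single neighbour inherits that neighbour's rating,
# no matter how small the similarity is
single = weighted_average([(0.1, 5.0)])
print(single)
```

The second case is worth keeping in mind when reading the final ranking: a movie with only one rater among the neighbours can reach the maximum score on the strength of that one rating.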


Now let's sort it and see the top 10 movies that the algorithm recommended!

In [25]:
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
recommendation_df.head(10)

Unnamed: 0_level_0,weighted average recommendation score,movieId
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
3224,5.0,3224
192385,5.0,192385
175685,5.0,175685
159779,5.0,159779
118878,5.0,118878
71139,5.0,71139
8421,5.0,8421
3420,5.0,3420
71575,5.0,71575
126397,5.0,126397


In [26]:
movies_df.loc[movies_df['movieId'].isin(recommendation_df.head(10)['movieId'].tolist())]

Unnamed: 0,movieId,title,year
3138,3224,Woman in the Dunes (Suna no onna),1964
3332,3420,...And Justice for All,1979
7807,8421,Dying of Laughter (Muertos de Risa),1999
14231,71139,Paraíso Travel,2008
14343,71575,Rudo y Cursi (Rough and Vulgar),2008
26133,118878,Sekirei,2008
29023,126397,The Encounter,2010
42829,159779,A Midsummer Night's Dream,2016
50125,175685,A Wanderer's Notebook,1962
57518,192385,A Star Is Born,2018
