# **Recommendation systems: COLLABORATIVE FILTERING**

Collaborative filtering is based on the fact that relationships exist between products and people's interests. Many recommendation systems use collaborative filtering to find these relationships and to recommend a product that the user might like or be interested in. Collaborative filtering has two basic approaches: user-based and item-based. User-based collaborative filtering is based on similarity between users, also called the user's neighborhood. Item-based collaborative filtering is based on similarity among items.

In user-based collaborative filtering, we have an active user for whom the recommendation is intended. The collaborative filtering engine first looks for users who are similar, that is, users who share the active user's rating patterns. Collaborative filtering bases this similarity on things like history, preferences, and the choices that users make when buying, watching, or enjoying something. It then uses the ratings from these similar users to predict the ratings the active user would give to movies they have not yet watched. For instance, if two users are similar, or are neighbors in terms of the movies they are interested in, the algorithm can recommend to the active user a movie that their neighbor has already seen.
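
To make the two approaches concrete, here is a minimal sketch on a made-up ratings table (the user names, movie names, and ratings are all hypothetical). It uses pandas' built-in `DataFrame.corr` to compare users with users and items with items; it is an illustration of the idea, not the procedure used later in this notebook.

```python
import pandas as pd

# Hypothetical toy ratings: rows are users, columns are movies (None = not rated).
ratings = pd.DataFrame(
    {
        "Movie A": [5.0, 4.0, 1.0],
        "Movie B": [4.0, 5.0, 2.0],
        "Movie C": [1.0, 2.0, 5.0],
        "Movie D": [None, 4.5, 1.5],
    },
    index=["active", "neighbor", "other"],
)

# User-based view: corr() compares columns, so transpose to put users in columns.
user_similarity = ratings.T.corr(method="pearson")

# Item-based view: compare the movie columns directly.
item_similarity = ratings.corr(method="pearson")

# The user most similar to "active", excluding "active" itself.
closest = user_similarity["active"].drop("active").idxmax()
print(closest)
```

In the user-based view, "Movie D" would be a candidate recommendation for the active user because the closest user has rated it and the active user has not.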

### Advantages and Disadvantages of Collaborative Filtering

**Advantages**
* Takes other users' ratings into consideration
* Doesn't need to study or extract information from the recommended item
* Adapts to the user's interests which might change over time

**Disadvantages**
* Approximation function can be slow
* There might be a low number of users to approximate from
* Privacy issues when trying to learn the user's preferences

## Table of contents

1. [Acquiring the data](#Data)
2. [Data Preprocessing](#Preprocessing)
3. [Collaborative Filtering](#Filtering)

## **Acquiring the Data**<a name = Data></a>
### **Import required library packages:**

In [1]:
#Dataframe manipulation library
import pandas as pd
#Math functions, we'll only need the sqrt function so let's import only that
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt

In [2]:
!wget -O moviedataset.zip https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/moviedataset.zip
print('unziping ...')
!unzip -o -j moviedataset.zip 

--2022-05-22 17:34:08--  https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/moviedataset.zip
Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.196
Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 160301210 (153M) [application/zip]
Saving to: ‘moviedataset.zip’


2022-05-22 17:34:13 (32.4 MB/s) - ‘moviedataset.zip’ saved [160301210/160301210]

unziping ...
Archive:  moviedataset.zip
  inflating: links.csv               
  inflating: movies.csv              
  inflating: ratings.csv             
  inflating: README.txt              
  inflating: tags.csv                


In [3]:
#Storing the movie information into a pandas dataframe
df_movies = pd.read_csv('movies.csv')
#Storing the user information into a pandas dataframe
df_ratings = pd.read_csv('ratings.csv')
#Head is a function that gets the first N rows of a dataframe. N's default is 5.
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
df_movies.shape

(34208, 3)

In [5]:
df_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,169,2.5,1204927694
1,1,2471,3.0,1204927438
2,1,48516,5.0,1204927435
3,2,2571,3.5,1436165433
4,2,109487,4.0,1436165496


In [6]:
df_ratings.shape

(22884377, 4)

## **Data Preprocessing**<a name = Preprocessing></a>
Now extract the year from the **title** column with pandas' `str.extract` function and store it in a new **year** column. The year is removed from the title itself a few steps later.

In [7]:
df_movies['year'] = df_movies.title.str.extract(r'(\(\d\d\d\d\))',expand=True)
df_movies.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,(1995)
1,2,Jumanji (1995),Adventure|Children|Fantasy,(1995)
2,3,Grumpier Old Men (1995),Comedy|Romance,(1995)
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,(1995)
4,5,Father of the Bride Part II (1995),Comedy,(1995)


In [8]:
#Removing the parentheses
df_movies['year'] = df_movies.year.str.extract(r'(\d\d\d\d)',expand=True)
df_movies.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men (1995),Comedy|Romance,1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II (1995),Comedy,1995


In [9]:
#Removing the years from the 'title' column
df_movies['title'] = df_movies.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)
df_movies.head(2)

  


Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
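
The three steps above (extract the year with parentheses, strip the parentheses, remove the year from the title) could also be collapsed into two calls. The sketch below runs on a small hypothetical Series rather than on `df_movies`, and assumes every title follows the "Name (YYYY)" pattern.

```python
import pandas as pd

# Hypothetical titles in the same "Name (YYYY)" format as movies.csv.
titles = pd.Series(["Toy Story (1995)", "Jumanji (1995)", "Akira (1988)"])

# The inner group captures only the four digits, so no second pass is needed.
years = titles.str.extract(r"\((\d{4})\)", expand=False)

# regex=True is passed explicitly because recent pandas versions
# treat the pattern as a literal string by default.
clean_titles = titles.str.replace(r"\s*\(\d{4}\)", "", regex=True).str.strip()

print(years.tolist())
print(clean_titles.tolist())
```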


With that, split the values in the **Genres** column into a list of Genres to simplify future use. This can be achieved by applying Python's split string function on the correct column.

In [10]:
#Every genre is separated by a | so we simply have to call the split function on |
df_movies['genres'] = df_movies.genres.str.split('|')
df_movies.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men,"[Comedy, Romance]",1995
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II,[Comedy],1995


In [11]:
#Applying the strip function to get rid of any ending whitespace characters that may have appeared
df_movies['title'] = df_movies['title'].apply(lambda x: x.strip())
df_movies.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men,"[Comedy, Romance]",1995
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II,[Comedy],1995


Drop the **genres** column since it is not needed for this particular recommendation system.

In [12]:
#Dropping the genres column
df_movies = df_movies.drop('genres', axis=1)
df_movies.head()

  


Unnamed: 0,movieId,title,year
0,1,Toy Story,1995
1,2,Jumanji,1995
2,3,Grumpier Old Men,1995
3,4,Waiting to Exhale,1995
4,5,Father of the Bride Part II,1995


Next, look at the ratings dataframe.

In [13]:
df_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,169,2.5,1204927694
1,1,2471,3.0,1204927438
2,1,48516,5.0,1204927435
3,2,2571,3.5,1436165433
4,2,109487,4.0,1436165496


Every row in the ratings dataframe has a user ID, the ID of a movie, the rating the user gave that movie, and a timestamp showing when they reviewed it. The timestamp column is not needed, so drop it to save memory.

In [14]:
#Drop removes a specified row or column from a dataframe
df_ratings = df_ratings.drop('timestamp', axis=1)
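
With a ratings file of this size, an alternative worth considering is to skip the unwanted column while parsing, so it never occupies memory. The sketch below illustrates the idea on a small in-memory CSV (the text stands in for `ratings.csv`; the `usecols` parameter of `pd.read_csv` does the selection).

```python
import io
import pandas as pd

# Hypothetical CSV text with the same columns as ratings.csv.
csv_text = """userId,movieId,rating,timestamp
1,169,2.5,1204927694
1,2471,3.0,1204927438
2,2571,3.5,1436165433
"""

# Load everything, then drop the column (the approach used in this notebook).
full = pd.read_csv(io.StringIO(csv_text))
dropped = full.drop(columns="timestamp")

# Alternative: select the columns at parse time.
slim = pd.read_csv(io.StringIO(csv_text), usecols=["userId", "movieId", "rating"])

print(slim.columns.tolist())
print(full.memory_usage(deep=True).sum(), slim.memory_usage(deep=True).sum())
```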

  


## **Collaborative Filtering recommendation system**<a name = 'Filtering'></a>

Now, time to start working on recommendation systems.

The technique used here is user-based Collaborative Filtering, also known as User-User Filtering. As its alternate name suggests, this technique uses other users to recommend items to the input user. It attempts to find users whose preferences and opinions are similar to the input user's and then recommends items that they have liked. There are several methods of finding similar users (some even make use of machine learning); the one used here is based on the Pearson correlation coefficient.

The process for creating a user-based recommendation system is as follows:
- Select a user along with the movies the user has watched
- Based on the user's ratings of those movies, find the top X neighbours
- For each neighbour, get the record of the movies they have watched
- Calculate a similarity score using some formula
- Recommend the items with the highest score


Begin by creating an input user to recommend movies to:

Note: To add more movies, simply add more elements to `userInput`. Be sure to capitalize each title as it appears in the dataset, and if a title starts with "The", as in "The Matrix", write it like this: 'Matrix, The'.

In [15]:
userInput = [
            {'title':'Breakfast Club, The', 'rating':5},
            {'title':'Toy Story', 'rating':3.5},
            {'title':'Jumanji', 'rating':2},
            {'title':"Pulp Fiction", 'rating':5},
            {'title':'Akira', 'rating':4.5}
         ] 
inputMovies = pd.DataFrame(userInput)
inputMovies

Unnamed: 0,title,rating
0,"Breakfast Club, The",5.0
1,Toy Story,3.5
2,Jumanji,2.0
3,Pulp Fiction,5.0
4,Akira,4.5


**Add movieId to input user**

With the input complete, extract the input movies' IDs from the movies dataframe and add them to the input.

Achieve this by first filtering out the rows that contain the input movie's title and then merging this subset with the input dataframe. Also drop unnecessary columns for the input to save memory space.

In [16]:
#Filtering out the movies by title
userMovies = df_movies[df_movies['title'].isin(inputMovies['title'].tolist())]
userMovies

Unnamed: 0,movieId,title,year
0,1,Toy Story,1995
1,2,Jumanji,1995
293,296,Pulp Fiction,1994
1246,1274,Akira,1988
1885,1968,"Breakfast Club, The",1985


In [17]:
inputMovies = pd.merge(userMovies, inputMovies)
inputMovies

Unnamed: 0,movieId,title,year,rating
0,1,Toy Story,1995,3.5
1,2,Jumanji,1995,2.0
2,296,Pulp Fiction,1994,5.0
3,1274,Akira,1988,4.5
4,1968,"Breakfast Club, The",1985,5.0
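
Because `pd.merge` performs an inner join by default, an input title that does not exactly match a title in the movies dataframe is dropped without any warning. One way to catch this, sketched here on hypothetical miniature frames, is a left merge with `indicator=True`, which adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Hypothetical miniature versions of df_movies and the input user's list.
movies = pd.DataFrame(
    {"movieId": [1, 2, 296], "title": ["Toy Story", "Jumanji", "Pulp Fiction"]}
)
user_input = pd.DataFrame(
    {"title": ["Toy Story", "Pulp Fiction", "The Matrix"], "rating": [3.5, 5.0, 4.0]}
)

# A left merge keeps every input row; _merge records whether a match was found.
checked = user_input.merge(movies, on="title", how="left", indicator=True)

# Titles that did not match any row in the catalogue.
unmatched = checked.loc[checked["_merge"] == "left_only", "title"].tolist()
print(unmatched)
```

Here "The Matrix" is reported as unmatched because the dataset would store it as 'Matrix, The'.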


**Users who have seen the same movies**

Now, with the movie IDs in **inputMovies**, get the subset of users who have watched and rated those movies.

In [18]:
#Filtering out users that have watched movies that the input has watched and storing it
userSubset = df_ratings[df_ratings['movieId'].isin(inputMovies['movieId'].tolist())]
userSubset.head()

Unnamed: 0,userId,movieId,rating
19,4,296,4.0
441,12,1968,3.0
479,13,2,2.0
531,13,1274,5.0
681,14,296,2.0


In [19]:
userSubset = pd.merge(userSubset, inputMovies)
userSubset.head()

Unnamed: 0,userId,movieId,rating,title,year
0,13,2,2.0,Jumanji,1995
1,217,2,2.0,Jumanji,1995
2,461,2,2.0,Jumanji,1995
3,534,2,2.0,Jumanji,1995
4,537,2,2.0,Jumanji,1995
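
One detail of `pd.merge` is worth keeping in mind when reading this output: when no `on=` argument is given, it joins on every column name the two frames share. Both frames here have a **movieId** and a **rating** column, so a row is kept only when the other user's rating equals the input user's rating for that movie. The sketch below, on hypothetical miniature frames, contrasts the default behaviour with a join on **movieId** alone; whether the stricter or the looser join is wanted depends on the goal.

```python
import pandas as pd

# Hypothetical miniature frames with the same column names as userSubset and inputMovies.
user_subset = pd.DataFrame(
    {"userId": [13, 13, 14], "movieId": [2, 296, 296], "rating": [2.0, 4.0, 5.0]}
)
input_movies = pd.DataFrame(
    {"movieId": [2, 296], "title": ["Jumanji", "Pulp Fiction"], "rating": [2.0, 5.0]}
)

# Default merge: joins on both shared columns (movieId and rating).
default_merge = pd.merge(user_subset, input_movies)

# Explicit key: joins on movieId only and keeps every rating by the other users.
keyed_merge = pd.merge(user_subset, input_movies[["movieId", "title"]], on="movieId")

print(len(default_merge), len(keyed_merge))
```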


Now group up the rows by user ID.

In [20]:
#Groupby creates several sub dataframes where they all have the same value in the column specified as the parameter
userSubsetGroup = userSubset.groupby('userId')
userSubsetGroup

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fa3d03e1350>

In [21]:
userSubset['userId'].values

array([    13,    217,    461, ..., 247512, 247570, 247732])

Look at one of the users, e.g. the one with userID=217

In [22]:
userSubsetGroup.get_group(217)

Unnamed: 0,userId,movieId,rating,title,year
1,217,2,2.0,Jumanji,1995
6975,217,296,5.0,Pulp Fiction,1994


Now sort these groups so that the users who share the most movies with the input user have higher priority. This gives better recommendations without having to go through every single user.

In [23]:
#Sorting it so users with the most movies in common with the input will have priority
userSubsetGroup = sorted(userSubsetGroup,  key=lambda x: len(x[1]), reverse=True)

In [24]:
userSubsetGroup[0:3]

[(85133,        userId  movieId  rating                title  year
  843     85133        2     2.0              Jumanji  1995
  3954    85133     1968     5.0  Breakfast Club, The  1985
  18151   85133      296     5.0         Pulp Fiction  1994
  39656   85133     1274     4.5                Akira  1988),
 (119741,        userId  movieId  rating         title  year
  1160   119741        2     2.0       Jumanji  1995
  22593  119741      296     5.0  Pulp Fiction  1994
  39795  119741     1274     4.5         Akira  1988
  42799  119741        1     3.5     Toy Story  1995),
 (138498,        userId  movieId  rating                title  year
  1333   138498        2     2.0              Jumanji  1995
  4932   138498     1968     5.0  Breakfast Club, The  1985
  25009  138498      296     5.0         Pulp Fiction  1994
  43198  138498        1     3.5            Toy Story  1995)]
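
Sorting the groups with Python's `sorted` materialises every group in a list. A possible alternative, sketched here on a hypothetical miniature of `userSubset`, is to count rows per user with `groupby(...).size()` and then keep only the rows of the users with the largest overlap.

```python
import pandas as pd

# Hypothetical subset of ratings restricted to the input user's movies.
subset = pd.DataFrame(
    {
        "userId": [13, 217, 217, 85133, 85133, 85133],
        "movieId": [2, 2, 296, 2, 296, 1968],
        "rating": [2.0, 2.0, 5.0, 2.0, 5.0, 5.0],
    }
)

# Count shared movies per user and keep the users with the largest overlap.
overlap = subset.groupby("userId").size().sort_values(ascending=False)
top_ids = overlap.head(2).index.tolist()

# Filter the original rows instead of building a list of every group.
top_rows = subset[subset["userId"].isin(top_ids)]
print(top_ids)
```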

#### Similarity of users to input user
Next, compare a subset of the users (not all of them) to the specified user and find the ones that are most similar.  
Find out how similar each user is to the input through the __Pearson Correlation Coefficient__, which measures the strength of the linear association between two variables. The formula for this coefficient between sets X and Y with N values is shown in the image below.

Why Pearson Correlation?

Pearson correlation is invariant to scaling and shifting, i.e. multiplying all elements by a positive constant or adding any constant to all elements. For example, if you have two vectors X and Y, then pearson(X, Y) == pearson(X, 2 * Y + 3). This is an important property in recommendation systems because two users might rate the same series of items very differently in absolute terms and still be similar users (i.e. with similar tastes) who simply use different rating scales.
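
This property is easy to check numerically. The sketch below uses NumPy's `np.corrcoef` on two hypothetical rating vectors; it also shows that a negative scale factor keeps the magnitude of the coefficient but flips its sign.

```python
import numpy as np

# Hypothetical ratings from two users for the same five movies.
x = np.array([5.0, 3.5, 2.0, 5.0, 4.5])
y = np.array([4.0, 3.0, 1.5, 4.5, 4.0])

def pearson(a, b):
    # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r(a, b).
    return np.corrcoef(a, b)[0, 1]

r_original = pearson(x, y)
r_rescaled = pearson(x, 2 * y + 3)   # positive scale plus shift: unchanged
r_flipped = pearson(x, -2 * y + 3)   # negative scale: sign flips

print(np.isclose(r_original, r_rescaled), np.isclose(r_original, -r_flipped))
```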

![alt text](https://wikimedia.org/api/rest_v1/media/math/render/svg/bd1ccc2979b0fd1c1aec96e386f686ae874f9ec0 "Pearson Correlation")

The values given by the formula range from r = -1 to r = 1, where 1 indicates a perfect positive correlation between the two entities and -1 indicates a perfect negative correlation.

In our case, a 1 means that the two users have similar tastes while a -1 means the opposite.
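
For reference, the computational form that the code in this notebook evaluates (through its variables `Sxx`, `Syy`, and `Sxy`) can be written as follows. Here $x_i$ and $y_i$ denote the two users' ratings of the $N$ movies they have in common; this notation is chosen for illustration and may differ from the symbols in the image above.

$$
r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}
$$

$$
S_{xx} = \sum_{i=1}^{N} x_i^2 - \frac{1}{N}\Big(\sum_{i=1}^{N} x_i\Big)^2, \qquad
S_{yy} = \sum_{i=1}^{N} y_i^2 - \frac{1}{N}\Big(\sum_{i=1}^{N} y_i\Big)^2, \qquad
S_{xy} = \sum_{i=1}^{N} x_i y_i - \frac{1}{N}\Big(\sum_{i=1}^{N} x_i\Big)\Big(\sum_{i=1}^{N} y_i\Big)
$$

When either $S_{xx}$ or $S_{yy}$ is zero (a user gave every shared movie the same rating), the coefficient is undefined, and the code falls back to 0.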

Select a subset of users to iterate through. This limit is imposed because we don't want to waste too much time going through every single user.

In [25]:
userSubsetGroup = userSubsetGroup[0:100]

Now, calculate the Pearson Correlation between input user and subset group, and store it in a dictionary, where the key is the user Id and the value is the coefficient.

In [26]:
#Store the Pearson Correlation in a dictionary, where the key is the user Id and the value is the coefficient
pearsonCorrelationDict = {}

#For every user group in our subset
for name, group in userSubsetGroup:
    #Let's start by sorting the input and current user group so the values aren't mixed up later on
    group = group.sort_values(by='movieId')
    inputMovies = inputMovies.sort_values(by='movieId')
    #Get the N for the formula
    nRatings = len(group)
    #Get the review scores for the movies that they both have in common
    temp_df = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())]
    #And then store them in a temporary buffer variable in a list format to facilitate future calculations
    tempRatingList = temp_df['rating'].tolist()
    #Let's also put the current user group reviews in a list format
    tempGroupList = group['rating'].tolist()
    #Now let's calculate the pearson correlation between two users, so called, x and y
    Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)
    Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)
    Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)
    
    #If the denominator is different than zero, then divide, else, 0 correlation.
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0
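
The calculation in the loop above can be cross-checked against a library implementation. The sketch below wraps the same sum-of-squares form in a helper (the name `pearson_from_sums` and the sample ratings are hypothetical) and compares the result with NumPy's `np.corrcoef`.

```python
from math import sqrt
import numpy as np

def pearson_from_sums(x, y):
    """Pearson r using the same sum-of-squares form as the loop above."""
    n = len(x)
    sxx = sum(i**2 for i in x) - sum(x) ** 2 / n
    syy = sum(j**2 for j in y) - sum(y) ** 2 / n
    sxy = sum(i * j for i, j in zip(x, y)) - sum(x) * sum(y) / n
    # A user who gave every shared movie the same rating has zero variance.
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

# Hypothetical ratings for four shared movies.
input_ratings = [3.5, 2.0, 5.0, 4.5]
other_ratings = [4.0, 2.5, 4.5, 5.0]

r_manual = pearson_from_sums(input_ratings, other_ratings)
r_numpy = np.corrcoef(input_ratings, other_ratings)[0, 1]
print(round(r_manual, 6), round(r_numpy, 6))
```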

In [27]:
pearsonCorrelationDict.items()

dict_items([(85133, 1.0), (119741, 1.0), (138498, 1.0), (200993, 1.0), (211296, 1.0), (75, 1.0), (4903, 1.0), (4938, 1.0), (5783, 1.0), (6207, 1.0), (7684, 1.0), (7988, 1.0), (9250, 1.0), (9358, 1.0), (12334, 1.0), (12728, 1.0), (16499, 1.0), (19816, 1.0), (20075, 1.0), (20172, 1.0), (20341, 1.0), (20467, 1.0), (21475, 1.0), (21518, 1.0), (22043, 1.0), (22209, 1.0), (25167, 1.0), (25456, 1.0), (25799, 1.0), (27221, 1.0), (27812, 1.0), (28646, 1.0), (35025, 1.0), (37093, 1.0), (38734, 1.0), (39430, 1.0), (41330, 1.0), (41890, 1.0), (42432, 1.0), (44861, 1.0), (46194, 1.0), (46401, 1.0), (46619, 1.0), (47016, 1.0), (50086, 1.0), (50950, 1.0), (54177, 1.0), (56705, 1.0), (57474, 1.0), (57604, 1.0), (57669, 1.0), (59997, 1.0), (60319, 1.0), (60555, 1.0), (60969, 1.0), (62322, 1.0), (64456, 1.0), (65603, 1.0), (66887, 1.0), (66961, 1.0), (67344, 1.0), (68961, 1.0), (69614, 1.0), (70175, 1.0), (70370, 1.0), (70556, 1.0), (70729, 1.0), (71729, 1.0), (75521, 1.0), (76924, 1.0), (77379, 1.0), (

In [28]:
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['userId'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
pearsonDF.head()

Unnamed: 0,similarityIndex,userId
0,1.0,85133
1,1.0,119741
2,1.0,138498
3,1.0,200993
4,1.0,211296


**The top x similar users to input user**

Now let's get the top 50 users that are most similar to the input.

In [29]:
topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]
topUsers.head()

Unnamed: 0,similarityIndex,userId
0,1.0,85133
63,1.0,70175
73,1.0,81627
72,1.0,79850
71,1.0,79481


Now, start recommending movies to the input user.

**Rating of selected users to all movies**

This is done by taking the weighted average of the movie ratings, using the Pearson correlation as the weight. To do this, first get the movies watched by the users in **topUsers** from the ratings dataframe, so that each rating sits alongside that user's correlation in the **similarityIndex** column. This is achieved below by merging the two tables.

In [30]:
topUsersRating=topUsers.merge(df_ratings, left_on='userId', right_on='userId', how='inner')
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating
0,1.0,85133,1,5.0
1,1.0,85133,2,2.0
2,1.0,85133,10,3.5
3,1.0,85133,15,2.0
4,1.0,85133,18,3.5


Now all that remains is to multiply each movie rating by its weight (the similarity index), sum up the weighted ratings, and divide by the sum of the weights.

This is done by multiplying two columns, grouping the dataframe by movieId, and then dividing two columns.

The result reflects the opinion of all similar users on each candidate movie for the input user:

In [31]:
#Multiplies the similarity by the user's ratings
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating,weightedRating
0,1.0,85133,1,5.0,5.0
1,1.0,85133,2,2.0,2.0
2,1.0,85133,10,3.5,3.5
3,1.0,85133,15,2.0,2.0
4,1.0,85133,18,3.5,3.5


In [32]:
#Applies a sum to topUsersRating after grouping it up by movieId
tempTopUsersRating = topUsersRating.groupby('movieId').sum()[['similarityIndex','weightedRating']]
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head()

Unnamed: 0_level_0,sum_similarityIndex,sum_weightedRating
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,50.0,184.0
2,41.0,99.0
3,11.0,29.0
4,2.0,6.0
5,10.0,23.0


In [33]:
#Creates an empty dataframe
recommendation_df = pd.DataFrame()
#Now we take the weighted average
recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation_df['movieId'] = tempTopUsersRating.index
recommendation_df.head()

Unnamed: 0_level_0,weighted average recommendation score,movieId
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,3.68,1
2,2.414634,2
3,2.636364,3
4,3.0,4
5,2.3,5
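
In the output above, for example, movie 1 scores 184.0 / 50.0 = 3.68. The same three steps (multiply, sum per movie, divide) can be followed by hand on a toy table. The sketch below uses hypothetical users, weights, and ratings, with unequal weights so that the effect of the weighting is visible.

```python
import pandas as pd

# Hypothetical neighbours: two similarity weights and their ratings for two movies.
neighbour_ratings = pd.DataFrame(
    {
        "userId": [1, 1, 2, 2],
        "similarityIndex": [1.0, 1.0, 0.5, 0.5],
        "movieId": [10, 20, 10, 20],
        "rating": [5.0, 2.0, 3.0, 4.0],
    }
)

neighbour_ratings["weightedRating"] = (
    neighbour_ratings["similarityIndex"] * neighbour_ratings["rating"]
)

# Sum weights and weighted ratings per movie, then divide.
sums = neighbour_ratings.groupby("movieId")[["similarityIndex", "weightedRating"]].sum()
score = sums["weightedRating"] / sums["similarityIndex"]
print(score.round(4).to_dict())
```

For movie 10 this gives (1.0 × 5.0 + 0.5 × 3.0) / (1.0 + 0.5), so the more similar user's rating pulls the score toward 5.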


Now sort it and see the top 10 movies that the algorithm recommended!

In [34]:
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
recommendation_df.head(10)

Unnamed: 0_level_0,weighted average recommendation score,movieId
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
2347,5.0,2347
2439,5.0,2439
26151,5.0,26151
61026,5.0,61026
37240,5.0,37240
2494,5.0,2494
26052,5.0,26052
166,5.0,166
38600,5.0,38600
25788,5.0,25788


In [35]:
df_movies.loc[df_movies['movieId'].isin(recommendation_df.head(10)['movieId'].tolist())]

Unnamed: 0,movieId,title,year
164,166,"Doom Generation, The",1995
2263,2347,"Pope of Greenwich Village, The",1984
2355,2439,Affliction,1997
2410,2494,"Last Days, The",1998
8380,25788,Scarface,1932
8579,26052,Pickpocket,1959
8647,26151,Au Hasard Balthazar,1966
10415,37240,Why We Fight,2005
10486,38600,Factotum,2005
12925,61026,Red Cliff (Chi bi),2008
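
Note that filtering with `isin` returns the rows in the order of the movies dataframe, not in ranking order. If the ranking matters, one option is a left merge from the sorted scores to the catalogue, sketched below on hypothetical miniature frames (the score values are made up). In this notebook, `recommendation_df` carries **movieId** both as its index name and as a column, so it would likely need `reset_index(drop=True)` before such a merge.

```python
import pandas as pd

# Hypothetical scores, already sorted from best to worst.
scores = pd.DataFrame({"movieId": [2347, 166, 2439], "score": [5.0, 4.8, 4.5]})
# Hypothetical catalogue in its usual movieId order.
catalogue = pd.DataFrame(
    {
        "movieId": [166, 2347, 2439],
        "title": ["Doom Generation, The", "Pope of Greenwich Village, The", "Affliction"],
    }
)

# isin() returns rows in catalogue order, so the ranking is lost.
by_isin = catalogue[catalogue["movieId"].isin(scores["movieId"])]

# A left merge keeps the row order of the left frame, i.e. the ranking.
ranked = scores.merge(catalogue, on="movieId", how="left")
print(ranked["title"].tolist())
```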
