# Collaborative filtering

<h4>Author:  <a href="https://ca.linkedin.com/in/saeedaghabozorgi">Saeed Aghabozorgi</a></h4>
<p><a href="https://ca.linkedin.com/in/saeedaghabozorgi">Saeed Aghabozorgi</a>, PhD, is a Data Scientist at IBM with a track record of developing enterprise-level applications that substantially increase clients’ ability to turn data into actionable knowledge. He is a researcher in the data mining field and an expert in developing advanced analytic methods, such as machine learning and statistical modelling, on large datasets.</p>

<hr>

<p>Copyright &copy; 2018 <a href="https://cocl.us/DX0108EN_CC">Cognitive Class</a>. This notebook and its source code are released under the terms of the <a href="https://bigdatauniversity.com/mit-license/">MIT License</a>.</p>

Recommendation systems are a collection of algorithms used to recommend items to users based on information taken from the user. These systems have become ubiquitous and are commonly seen in online stores, movie databases and job finders. In this notebook, we will explore recommendation systems based on Collaborative Filtering and implement a simple version of one using Python and the Pandas library.

## Acquiring the data

In [1]:
page = "https://s3-api.us-geo.objectstorage.softlayer.net/" \
       "cf-courses-data/CognitiveClass/ML0101ENv3/labs/moviedataset.zip"

In [2]:
!wget -nc -O moviedataset.zip $page

File ‘moviedataset.zip’ already there; not retrieving.


In [3]:
!unzip -o -j moviedataset.zip

Archive:  moviedataset.zip
  inflating: links.csv               
  inflating: movies.csv              
  inflating: ratings.csv             
  inflating: README.txt              
  inflating: tags.csv                


## Preprocessing

In [4]:
# Dataframe manipulation library.
import pandas as pd

# Math functions: we only need sqrt, so let's import just that.
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [5]:
# Storing the movie information into a pandas dataframe.
movies_df = pd.read_csv('movies.csv')

# Storing the user ratings into a pandas dataframe.
ratings_df = pd.read_csv('ratings.csv')

In [6]:
# Head is a function that gets the first N rows of a dataframe. N's default is 5.
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [7]:
# Using regular expressions to find a year stored between parentheses.
# We specify the parentheses so we don't conflict with movies that have years in their titles.
# Raw strings (r'...') keep the backslashes from being read as string escapes.
movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)

# Removing the parentheses.
movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)', expand=False)

# Removing the years from the 'title' column.
# regex=True is needed because pandas 2.0+ treats the pattern as a literal string by default.
movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)

# Applying the strip function to get rid of any trailing whitespace characters that may have
# appeared.
movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())

In [8]:
movies_df.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995
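
The year extraction above can be tried on a few titles in isolation. The following is a minimal sketch, not the notebook's own cell: the `titles` series is made up for illustration, and the pattern captures the four digits directly instead of using two `extract` passes.

```python
import pandas as pd

# Hypothetical sample titles (not read from movies.csv).
titles = pd.Series([
    "Toy Story (1995)",
    "2001: A Space Odyssey (1968)",  # a year in the title itself must be ignored
    "Untitled Film",                 # no year at all
])

# Capture only the four digits that sit between parentheses.
year = titles.str.extract(r"\((\d{4})\)", expand=False)

# Remove the parenthesised year and any surrounding whitespace.
clean = titles.str.replace(r"\s*\(\d{4}\)", "", regex=True).str.strip()

print(pd.DataFrame({"title": clean, "year": year}))
```

Titles without a parenthesised year end up with a missing value in `year`, so it may be worth checking `year.isna().sum()` on the real data before relying on that column.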


In [9]:
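# Dropping the genres column.
# The keyword form is used because pandas 2.0+ no longer accepts the axis as a positional argument.
movies_df = movies_df.drop(columns='genres')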

In [10]:
movies_df.head()

Unnamed: 0,movieId,title,year
0,1,Toy Story,1995
1,2,Jumanji,1995
2,3,Grumpier Old Men,1995
3,4,Waiting to Exhale,1995
4,5,Father of the Bride Part II,1995


In [11]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,169,2.5,1204927694
1,1,2471,3.0,1204927438
2,1,48516,5.0,1204927435
3,2,2571,3.5,1436165433
4,2,109487,4.0,1436165496


In [12]:
# Drop removes a specified row or column from a dataframe.
# Here we drop the 'timestamp' column, which we won't need.
ratings_df = ratings_df.drop(columns='timestamp')

In [13]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating
0,1,169,2.5
1,1,2471,3.0
2,1,48516,5.0
3,2,2571,3.5
4,2,109487,4.0


## Collaborative filtering

In [14]:
userInput = [
            {'title':'Breakfast Club, The', 'rating':5},
            {'title':'Toy Story', 'rating':3.5},
            {'title':'Jumanji', 'rating':2},
            {'title':"Pulp Fiction", 'rating':5},
            {'title':'Akira', 'rating':4.5},
            {'title':'Star Wars: Episode VI - Return of the Jedi', 'rating':5}
         ] 
inputMovies = pd.DataFrame(userInput)
inputMovies

Unnamed: 0,title,rating
0,"Breakfast Club, The",5.0
1,Toy Story,3.5
2,Jumanji,2.0
3,Pulp Fiction,5.0
4,Akira,4.5
5,Star Wars: Episode VI - Return of the Jedi,5.0


In [15]:
# Filtering out the movies by title.
inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]

# Then merging it so we can get the movieId. It's implicitly merging it by title.
inputMovies = pd.merge(inputId, inputMovies)

# Dropping information we won't use from the input dataframe.
inputMovies = inputMovies.drop(columns='year')

# Final input dataframe.
# If a movie you added above isn't here, then it might not be in the original
# dataframe or it might be spelled differently; please check the capitalisation.
inputMovies

Unnamed: 0,movieId,title,rating
0,1,Toy Story,3.5
1,2,Jumanji,2.0
2,296,Pulp Fiction,5.0
3,1210,Star Wars: Episode VI - Return of the Jedi,5.0
4,1274,Akira,4.5
5,1968,"Breakfast Club, The",5.0
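
Because the merge silently drops titles that do not match, it can help to list the misses explicitly. The sketch below uses small made-up frames (`catalog` and `user_input` are hypothetical stand-ins for `movies_df` and the user's input) and the `indicator` option of `DataFrame.merge`.

```python
import pandas as pd

# Hypothetical stand-ins for movies_df and the user's input.
catalog = pd.DataFrame({
    "movieId": [1, 2, 296],
    "title": ["Toy Story", "Jumanji", "Pulp Fiction"],
})
user_input = pd.DataFrame({
    "title": ["Toy Story", "pulp fiction"],  # second title has the wrong capitalisation
    "rating": [3.5, 5.0],
})

# A left merge keeps every input row; the _merge column records whether a match was found.
merged = user_input.merge(catalog, on="title", how="left", indicator=True)

unmatched = merged.loc[merged["_merge"] == "left_only", "title"].tolist()
matched = merged.loc[merged["_merge"] == "both", ["movieId", "title", "rating"]]

print("Not found in catalog:", unmatched)
print(matched)
```

Naming the key with `on="title"` also makes the join column explicit instead of relying on the columns the two frames happen to share.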


In [16]:
# Selecting the ratings given by other users to the movies the input user has rated.
userSubset = ratings_df[ratings_df['movieId'].isin(inputMovies['movieId'].tolist())]
userSubset.head()

Unnamed: 0,userId,movieId,rating
19,4,296,4.0
418,12,1210,5.0
441,12,1968,3.0
479,13,2,2.0
531,13,1274,5.0


In [17]:
# Groupby creates several sub dataframes where they all have the same value in the column 
# specified as the parameter.
# Passing the column name as a string (not a one-item list) keeps the group keys as plain userIds.
userSubsetGroup = userSubset.groupby('userId')

In [18]:
userSubsetGroup.get_group(1130)

Unnamed: 0,userId,movieId,rating
104167,1130,1,0.5
104168,1130,2,4.0
104214,1130,296,4.0
104363,1130,1274,4.5
104443,1130,1968,4.5


In [19]:
# Sorting it so that users with the most movies in common with the input have priority.
userSubsetGroup = sorted(userSubsetGroup, key=lambda x: len(x[1]), reverse=True)

In [20]:
userSubsetGroup[0:3]

[(75,       userId  movieId  rating
  7507      75        1     5.0
  7508      75        2     3.5
  7540      75      296     5.0
  7616      75     1210     4.0
  7633      75     1274     4.5
  7673      75     1968     5.0), (106,       userId  movieId  rating
  9083     106        1     2.5
  9084     106        2     3.0
  9115     106      296     3.5
  9182     106     1210     3.0
  9198     106     1274     3.0
  9238     106     1968     3.5), (686,        userId  movieId  rating
  61336     686        1     4.0
  61337     686        2     3.0
  61377     686      296     4.0
  61468     686     1210     5.0
  61478     686     1274     4.0
  61569     686     1968     5.0)]

In [21]:
userSubsetGroup = userSubsetGroup[0:100]
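
The same "most movies in common first" selection can also be expressed without materialising every group in a Python list. This is only a sketch on made-up data (`user_subset` is a hypothetical stand-in for `userSubset`, and the cut-off of 2 stands in for the 100 used above).

```python
import pandas as pd

# Hypothetical stand-in for userSubset: ratings restricted to the input user's movies.
user_subset = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 3, 3],
    "movieId": [1, 2, 296, 1, 2, 296],
    "rating":  [4.0, 3.0, 5.0, 2.5, 3.5, 4.5],
})

# Number of movies each user shares with the input, largest first.
overlap = user_subset.groupby("userId").size().sort_values(ascending=False)

# Keep the users with the largest overlap (2 here; the notebook keeps 100).
top_user_ids = overlap.head(2).index.tolist()
top_ratings = user_subset[user_subset["userId"].isin(top_user_ids)]

print(overlap)
print(top_ratings)
```

Users with the same overlap count may come out in a different order than with `sorted`, which matters only if the cut-off falls inside a group of ties.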

In [22]:
# Store the Pearson Correlation in a dictionary, where the key is the user Id and the value 
# is the coefficient.
pearsonCorrelationDict = {}

# For every user group in our subset.
for name, group in userSubsetGroup:
    
    # Let's start by sorting the input and current user group so the values aren't mixed 
    # up later on.
    group = group.sort_values(by='movieId')
    inputMovies = inputMovies.sort_values(by='movieId')
    
    # Get the N for the formula.
    nRatings = len(group)
    
    # Get the review scores for the movies that they both have in common.
    temp_df = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())]
    
    # And then store them in a temporary buffer variable in a list format to facilitate 
    # future calculations.
    tempRatingList = temp_df['rating'].tolist()
    
    # Let's also put the current user group reviews in a list format.
    tempGroupList = group['rating'].tolist()
    
    # Now let's calculate the Pearson correlation between the two users: x (the input user)
    # and y (the current user in the loop).
    Sxx = sum([i ** 2 for i in tempRatingList]) - \
          pow(sum(tempRatingList), 2) / float(nRatings)
    Syy = sum([i ** 2 for i in tempGroupList]) - \
          pow(sum(tempGroupList), 2) / float(nRatings)
    Sxy = sum(i * j for i, j in zip(tempRatingList, tempGroupList)) - \
          sum(tempRatingList) * sum(tempGroupList) / float(nRatings)
    
    # If the denominator is nonzero, divide; otherwise record a correlation of 0.
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy / sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0
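
The quantity computed in the loop is r = Sxy / sqrt(Sxx * Syy). As a sanity check, the sketch below wraps the same arithmetic in a standalone helper (`pearson` is a name introduced here, not part of the notebook) and applies it to the input ratings and to user 75's ratings as printed in the outputs above, both ordered by movieId.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length rating lists; 0.0 when undefined."""
    n = len(x)
    sxx = sum(i ** 2 for i in x) - sum(x) ** 2 / n
    syy = sum(j ** 2 for j in y) - sum(y) ** 2 / n
    sxy = sum(i * j for i, j in zip(x, y)) - sum(x) * sum(y) / n
    if sxx == 0 or syy == 0:
        # One of the users gave every movie the same rating, so r is undefined.
        return 0.0
    return sxy / sqrt(sxx * syy)

# Ratings for movieIds 1, 2, 296, 1210, 1274, 1968, copied from the tables above.
input_ratings = [3.5, 2.0, 5.0, 5.0, 4.5, 5.0]
user_75 = [5.0, 3.5, 5.0, 4.0, 4.5, 5.0]

r = pearson(input_ratings, user_75)
print(round(r, 6))
```

The helper assumes both lists cover the same movies in the same order; the loop above relies on the same assumption by sorting both frames by `movieId`.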

In [23]:
pearsonCorrelationDict.items()

dict_items([(75, 0.5875120888504805), (106, 0.5118906968889928), (686, 0.8409632877462002), (815, 0.44744194047041463), (1040, 0.7830233958232257), (1502, 0.7416198487095665), (1599, 0.3370999312316202), (1625, 0.7462025072446367), (1950, 0.05330017908890424), (2065, 0.5330017908890288), (2128, 0.37080992435478255), (2432, -0.18655062681115991), (2791, 0.7727272727272727), (2839, 0.7312724241271304), (2948, 0.0337099931231631), (3025, 0.13089257860118308), (3040, 0.9061030445113399), (3186, 0.7212758731597709), (3271, 0.28653494154687914), (3429, 0.15075567228888184), (3734, -0.16480856327180468), (4099, 0.24001200090007435), (4208, 0.27695585470349876), (4282, 0.0), (4292, 0.6357639532057505), (4415, 0.0), (4586, -0.8711309772938207), (4725, -0.07537783614444092), (4818, 0.4841648318657453), (5104, 0.7918961043907798), (5165, -0.10660035817780393), (5547, 0.35032452487268545), (6082, 0.06998656386948264), (6207, 0.8316847989130756), (6366, 0.703468574415813), (6482, -0.036563621206356

In [24]:
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['userId'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
pearsonDF.head()

Unnamed: 0,similarityIndex,userId
0,0.587512,75
1,0.511891,106
2,0.840963,686
3,0.447442,815
4,0.783023,1040


In [25]:
topUsers = pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]
topUsers.head()

Unnamed: 0,similarityIndex,userId
62,0.966004,12325
54,0.965909,10707
60,0.936321,12120
65,0.926696,13053
78,0.914091,15157


In [26]:
topUsersRating = topUsers.merge(ratings_df, left_on='userId', right_on='userId', how='inner')
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating
0,0.966004,12325,1,3.5
1,0.966004,12325,2,1.5
2,0.966004,12325,3,3.0
3,0.966004,12325,5,0.5
4,0.966004,12325,6,2.5


In [27]:
# Multiplies the similarity by the user's ratings.
topUsersRating['weightedRating'] = topUsersRating['similarityIndex'] * topUsersRating['rating']
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating,weightedRating
0,0.966004,12325,1,3.5,3.381014
1,0.966004,12325,2,1.5,1.449006
2,0.966004,12325,3,3.0,2.898012
3,0.966004,12325,5,0.5,0.483002
4,0.966004,12325,6,2.5,2.41501


In [28]:
# Sums the similarity indices and weighted ratings of the top users for each movie (grouping by movieId).
tempTopUsersRating = topUsersRating.groupby('movieId').\
                                    sum()[['similarityIndex','weightedRating']]
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head()

movieId,sum_similarityIndex,sum_weightedRating
1,36.083178,132.199381
2,36.083178,88.853969
3,12.049033,33.977342
4,1.489165,3.914651
5,11.501203,25.589339


In [29]:
# Creates an empty dataframe.
recommendation_df = pd.DataFrame()

# Now we take the weighted average.
recommendation_df['weighted average recommendation score'] = \
    tempTopUsersRating['sum_weightedRating'] / \
    tempTopUsersRating['sum_similarityIndex']

recommendation_df['movieId'] = tempTopUsersRating.index
recommendation_df.head()

movieId,weighted average recommendation score,movieId
1,3.66374,1
2,2.462476,2
3,2.819923,3
4,2.628756,4
5,2.224927,5
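
The weighted average is easiest to follow on a tiny example. In this sketch the numbers are made up: movie 10 is rated by two similar users, movie 20 by only one.

```python
import pandas as pd

# Hypothetical ratings from similar users; similarityIndex is each user's Pearson score.
ratings = pd.DataFrame({
    "movieId":         [10, 10, 20],
    "similarityIndex": [0.9, 0.5, 0.9],
    "rating":          [4.0, 2.0, 5.0],
})

# Weight every rating by how similar its author is to the input user.
ratings["weightedRating"] = ratings["similarityIndex"] * ratings["rating"]

# Per movie: sum of weights and sum of weighted ratings, then their ratio.
sums = ratings.groupby("movieId")[["similarityIndex", "weightedRating"]].sum()
score = sums["weightedRating"] / sums["similarityIndex"]

print(score)
```

For movie 10 the score is (0.9 * 4.0 + 0.5 * 2.0) / (0.9 + 0.5), which is about 3.29. For movie 20 the single weight cancels out and the score is simply that user's rating, so a movie rated 5.0 by only one similar user reaches the maximum score.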


In [30]:
recommendation_df = recommendation_df.\
    sort_values(by='weighted average recommendation score', 
                ascending=False)

recommendation_df.head(10)

movieId,weighted average recommendation score,movieId
26340,5.0,26340
2362,5.0,2362
5767,5.0,5767
106109,5.0,106109
55067,5.0,55067
106762,5.0,106762
1860,5.0,1860
121,5.0,121
7983,5.0,7983
67997,5.0,67997


In [31]:
movies_df.loc[movies_df['movieId'].isin(recommendation_df.head(10)['movieId'].tolist())]

Unnamed: 0,movieId,title,year
119,121,"Boys of St. Vincent, The",1992
1778,1860,Character (Karakter),1997
2278,2362,Glen or Glenda,1953
5669,5767,Teddy Bear (Mis),1981
7598,7983,Broadway Danny Rose,1984
8783,26340,"Twelve Tasks of Asterix, The (Les douze travau...",1976
12111,55067,Requiem,2006
13653,67997,In the Loop,2009
22105,106109,"Masquerade (Gwanghai, Wangyidoen namja)",2012
22289,106762,Trigun: Badlands Rumble,2010
