
In [32]:
#importing libraries
import pandas as pd
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [33]:
#Reading the files into dataframes
#storing the movie information into a pandas dataframe
movies_df=pd.read_csv('movies.csv')
#storing the user ratings into a pandas dataframe
ratings_df=pd.read_csv('ratings.csv')

In [34]:
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


Extracting the release year from the title column into a new year column with the pandas str.extract function, then removing it from the title with str.replace.

In the regular expressions below, \d matches any single digit (same as [0-9]).

In [35]:
#taking out the year along with the parentheses
movies_df['year']=movies_df.title.str.extract(r'(\(\d\d\d\d\))',expand=False)
#removing the parentheses
movies_df['year']=movies_df.year.str.extract(r'(\d\d\d\d)',expand=False)
#removing the year from the title column by replacing it with an empty string
#(regex=True is needed because pandas 2.0 and later treat the pattern as a literal string by default)
movies_df['title']=movies_df.title.str.replace(r'(\(\d\d\d\d\))','',regex=True)
#stripping any leading or trailing whitespace characters
movies_df['title']=movies_df['title'].str.strip()
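
The same extract-then-remove idea can be tried on single strings with the standard re module. This is only a minimal sketch: the helper name split_title_and_year is hypothetical and not part of the notebook, and it works on plain strings rather than on a dataframe column.

```python
import re

# Four digits wrapped in literal parentheses; the inner group captures only the digits.
YEAR_IN_PARENS = re.compile(r"\((\d{4})\)")

def split_title_and_year(raw_title):
    """Return (title without the year, year as a string or None)."""
    match = YEAR_IN_PARENS.search(raw_title)
    year = match.group(1) if match else None
    # Remove the "(YYYY)" part, then trim the whitespace it leaves behind.
    title = YEAR_IN_PARENS.sub("", raw_title).strip()
    return title, year

print(split_title_and_year("Toy Story (1995)"))  # ('Toy Story', '1995')
print(split_title_and_year("No Year Here"))      # ('No Year Here', None)
```

Titles without a year come back with None, which mirrors the missing value that str.extract leaves in the year column when nothing matches.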

In [36]:
movies_df.head(10)

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995
5,6,Heat,Action|Crime|Thriller,1995
6,7,Sabrina,Comedy|Romance,1995
7,8,Tom and Huck,Adventure|Children,1995
8,9,Sudden Death,Action,1995
9,10,GoldenEye,Action|Adventure|Thriller,1995


Since this is not a content-based recommender system, we do not need the genres column.

In [37]:
#Dropping the genres column
movies_df = movies_df.drop('genres', axis=1)

In [38]:
movies_df.head()

Unnamed: 0,movieId,title,year
0,1,Toy Story,1995
1,2,Jumanji,1995
2,3,Grumpier Old Men,1995
3,4,Waiting to Exhale,1995
4,5,Father of the Bride Part II,1995


In [39]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


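The timestamp column is not required. Note that drop returns a new dataframe, so the call below only displays the result and leaves ratings_df itself unchanged.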

In [40]:
ratings_df.drop('timestamp', axis=1)

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0
...,...,...,...
100831,610,166534,4.0
100832,610,168248,5.0
100833,610,168250,5.0
100834,610,168252,5.0


At a later stage, while running the notebook, I detected a duplicate movie title, so here I remove any such duplicates.

In [41]:
#only the first occurrence of each title is kept
movies_df.drop_duplicates(subset='title',keep='first',inplace=True)

In [42]:
movies_df.head(10)

Unnamed: 0,movieId,title,year
0,1,Toy Story,1995
1,2,Jumanji,1995
2,3,Grumpier Old Men,1995
3,4,Waiting to Exhale,1995
4,5,Father of the Bride Part II,1995
5,6,Heat,1995
6,7,Sabrina,1995
7,8,Tom and Huck,1995
8,9,Sudden Death,1995
9,10,GoldenEye,1995


Creating a sample input user to recommend movies to.

In [43]:
userInput = [
            {'title':'Jumanji', 'rating':5},
            {'title':'Toy Story', 'rating':3.5},
            {'title':'Sabrina', 'rating':4},
            {'title':"GoldenEye", 'rating':5},
            {'title':'Heat', 'rating':4.5}
         ] 
inputMovies = pd.DataFrame(userInput)
inputMovies

Unnamed: 0,title,rating
0,Jumanji,5.0
1,Toy Story,3.5
2,Sabrina,4.0
3,GoldenEye,5.0
4,Heat,4.5


Extracting the input movies' ids from movies_df and adding them to the user input.

In [44]:
#Matching the movies by title
inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]
#Then merging on title (the only shared column) so we can get the movieId
inputMovies = pd.merge(inputId, inputMovies, on='title')
#Dropping information we won't use from the input dataframe
inputMovies = inputMovies.drop('year', axis=1)
#Final input dataframe
inputMovies

Unnamed: 0,movieId,title,rating
0,1,Toy Story,3.5
1,2,Jumanji,5.0
2,6,Heat,4.5
3,7,Sabrina,4.0
4,10,GoldenEye,5.0


Match the movie ids in the input against ratings_df to get the subset of users who have watched the same movies.

In [45]:
#Selecting the ratings from users who have watched movies that the input has watched and storing them
userSubset = ratings_df[ratings_df['movieId'].isin(inputMovies['movieId'].tolist())]
userSubset.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
2,1,6,4.0,964982224
516,5,1,4.0,847434962
560,6,2,4.0,845553522
564,6,6,4.0,845553757


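The timestamp column is still present because the earlier drop on ratings_df was not assigned back. The call below likewise only displays userSubset without the column.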

In [46]:
userSubset.drop('timestamp', axis=1)

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
2,1,6,4.0
516,5,1,4.0
560,6,2,4.0
564,6,6,4.0
...,...,...,...
98669,608,10,4.0
99497,609,1,3.0
99498,609,10,4.0
99534,610,1,5.0


Grouping up the rows by user id

In [48]:
#Groupby creates several sub dataframes where they all have the same value in the column specified as the parameter
userSubsetGroup = userSubset.groupby('userId')

Let us look at user id 6

In [49]:
userSubsetGroup.get_group(6)

Unnamed: 0,userId,movieId,rating,timestamp
560,6,2,4.0,845553522
564,6,6,4.0,845553757
565,6,7,4.0,845554264
567,6,10,3.0,845553253


Users with the largest number of movies in common with the input user are given priority.

In [51]:
#Sorting it so users with the most movies in common with the input will have priority
userSubsetGroup = sorted(userSubsetGroup,  key=lambda x: len(x[1]), reverse=True)
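
To make the sort key concrete, here is a small pandas-free sketch of the same "largest group first" ordering. The rows are copied from the userSubset and user 6 outputs shown above; the variable names are illustrative only.

```python
from collections import defaultdict

# (userId, movieId, rating) rows taken from the outputs shown earlier
rows = [
    (1, 1, 4.0), (1, 6, 4.0),
    (5, 1, 4.0),
    (6, 2, 4.0), (6, 6, 4.0), (6, 7, 4.0), (6, 10, 3.0),
]

# Group the rows by user id, as groupby('userId') does
groups = defaultdict(list)
for user_id, movie_id, rating in rows:
    groups[user_id].append((movie_id, rating))

# Sort the (userId, group) pairs by group size, largest first
ranked = sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)
ranked_user_ids = [user_id for user_id, _ in ranked]
print(ranked_user_ids)  # [6, 1, 5]
```

User 6 comes first because four of its ratings overlap with the input movies, compared with two for user 1 and one for user 5.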

Now we look at the first 5 groups, i.e. the users with the most movies in common with the input.

In [52]:
userSubsetGroup[0:5]

[(68,        userId  movieId  rating   timestamp
  10360      68        1     2.5  1158531426
  10361      68        2     2.5  1158532776
  10364      68        6     4.0  1158532058
  10365      68        7     2.0  1230498124
  10366      68       10     4.5  1158531612),
 (414,        userId  movieId  rating  timestamp
  62294     414        1     4.0  961438127
  62295     414        2     3.0  961594981
  62298     414        6     3.0  961515642
  62299     414        7     3.0  961439170
  62301     414       10     3.0  961515863),
 (470,        userId  movieId  rating  timestamp
  72918     470        1     4.0  849224825
  72919     470        2     3.0  849224778
  72922     470        6     3.0  849843318
  72923     470        7     3.0  849370453
  72924     470       10     3.0  849075144),
 (599,        userId  movieId  rating   timestamp
  92623     599        1     3.0  1498524204
  92624     599        2     2.5  1498514085
  92626     599        6     4.5  14985396

Now we will calculate similarity of users using the Pearson Correlation Coefficient.

We will calculate the similarity for the 100 users who have the most movies in common with the input.

In [53]:
userSubsetGroup=userSubsetGroup[0:100]

Now the Pearson correlation between the input user and each user in the subset group is calculated and stored in a dictionary, where the key is the user id and the value is the coefficient. A value of 1 indicates very similar taste in movies and a value of -1 indicates very dissimilar taste.

In [54]:
#Store the Pearson Correlation in a dictionary, where the key is the user Id and the value is the coefficient
pearsonCorrelationDict = {}

#For every user group in our subset
for name, group in userSubsetGroup:
    #Let's start by sorting the input and current user group so the values aren't mixed up later on
    group = group.sort_values(by='movieId')
    inputMovies = inputMovies.sort_values(by='movieId')
    #Get the N for the formula
    nRatings = len(group)
    #Get the review scores for the movies that they both have in common
    temp_df = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())]
    #And then store them in a temporary buffer variable in a list format to facilitate future calculations
    tempRatingList = temp_df['rating'].tolist()
    #Let's also put the current user group reviews in a list format
    tempGroupList = group['rating'].tolist()
    #Now let's calculate the pearson correlation between two users, so called, x and y
    Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)
    Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)
    Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)
    
    #If the denominator is different than zero, then divide, else, 0 correlation.
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0
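
As a sanity check, the computation in the loop above can be repeated by hand for a single user. The sketch below uses only the standard library and the ratings shown earlier for the input user and user 6; the function name pearson and the dictionaries are illustrative, not part of the notebook. It applies the same formula, r = Sxy / sqrt(Sxx * Syy), to the movies both users rated.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length rating lists; 0.0 if either has no variance."""
    n = len(x)
    sxx = sum(i ** 2 for i in x) - sum(x) ** 2 / n
    syy = sum(j ** 2 for j in y) - sum(y) ** 2 / n
    sxy = sum(i * j for i, j in zip(x, y)) - sum(x) * sum(y) / n
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

# movieId -> rating, copied from the inputMovies and user 6 outputs above
input_ratings = {1: 3.5, 2: 5.0, 6: 4.5, 7: 4.0, 10: 5.0}
user6_ratings = {2: 4.0, 6: 4.0, 7: 4.0, 10: 3.0}

# Only the movies both users rated, in a fixed order so the pairs line up
common = sorted(input_ratings.keys() & user6_ratings.keys())
x = [input_ratings[m] for m in common]
y = [user6_ratings[m] for m in common]

r = pearson(x, y)
print(round(r, 6))  # -0.522233
```

The result agrees with the coefficient stored for user 6 in the dictionary.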


Now let us take a look at the dictionary items

In [55]:
pearsonCorrelationDict.items()

dict_items([(68, 0.5483505816998477), (414, -0.7717436331412953), (470, -0.7717436331412953), (599, 0.1604222369799343), (6, -0.5222329678670935), (19, -0.5222329678670935), (82, 0.7385489458759964), (91, -0.41403933560541256), (117, -0.8703882797784892), (160, -0.4714045207910317), (202, 0.0), (217, -0.4082482904638631), (219, 0.0), (274, -0.47140452079103173), (304, -0.7777777777777778), (314, -0.2581988897471611), (353, -0.8660254037844387), (380, 0), (411, -0.5222329678670935), (434, -0.6666666666666666), (474, -0.7745966692414834), (480, 0.4082482904638631), (501, -0.30151134457776363), (559, -0.7385489458759964), (573, -0.9428090415820635), (590, -0.7492686492653552), (597, -0.10259783520851541), (18, -0.3273268353539889), (21, 0.5), (31, -0.7559289460184538), (32, 0.0), (43, -0.9449111825230684), (45, 0.0), (57, -0.9449111825230682), (84, 0.0), (93, 0.8660254037844387), (112, -0.3273268353539889), (140, 0.4193139346887665), (144, -0.9999999999999964), (166, -0.500000000000003), 

Using pandas to arrange the data into columns.

In [56]:
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['userId'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
pearsonDF.head()

Unnamed: 0,similarityIndex,userId
0,0.548351,68
1,-0.771744,414
2,-0.771744,470
3,0.160422,599
4,-0.522233,6


Sorting the users by the similarity index.

In [60]:
topUsers=pearsonDF.sort_values(by='similarityIndex',ascending=False)
topUsers.head(25)

Unnamed: 0,similarityIndex,userId
97,1.0,90
95,1.0,64
92,1.0,51
89,1.0,42
87,1.0,27
80,0.928571,600
35,0.866025,93
82,0.755929,604
74,0.755929,541
6,0.738549,82


Since only the top 24 users have a similarity coefficient above 0, I will only take those into account from here on.

In [62]:
topUsers=topUsers[0:24]
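
The cut-off of 24 was read off the sorted table. An alternative that avoids the hard-coded count is to filter on the coefficient itself. The sketch below shows the idea on a handful of values copied from the dictionary output above; it is a pandas-free illustration, and the variable names are assumptions rather than notebook code.

```python
# userId -> Pearson coefficient, a few entries copied from the dictionary output
similarities = {
    68: 0.5483505816998477,
    414: -0.7717436331412953,
    599: 0.1604222369799343,
    6: -0.5222329678670935,
    82: 0.7385489458759964,
    202: 0.0,
}

# Keep only strictly positive coefficients, most similar first
positive = [(user_id, sim) for user_id, sim in similarities.items() if sim > 0]
positive.sort(key=lambda pair: pair[1], reverse=True)
top_user_ids = [user_id for user_id, _ in positive]
print(top_user_ids)  # [82, 68, 599]
```

On the dataframe the equivalent would be a boolean mask on the similarityIndex column, which keeps working if the input ratings change and the number of positive coefficients is no longer 24.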

Merging the top users' similarity index with their ratings from ratings_df.

In [63]:
topUsersRating=topUsers.merge(ratings_df, left_on='userId', right_on='userId', how='inner')
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating,timestamp
0,1.0,90,1,3.0,856353996
1,1.0,90,7,4.0,856354037
2,1.0,90,14,5.0,856354100
3,1.0,90,17,5.0,856353996
4,1.0,90,25,5.0,856353996


Multiplying the rating by the similarity index to find the weighted rating.

In [64]:
#Multiplies the similarity by the user's ratings
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating,timestamp,weightedRating
0,1.0,90,1,3.0,856353996,3.0
1,1.0,90,7,4.0,856354037,4.0
2,1.0,90,14,5.0,856354100,5.0
3,1.0,90,17,5.0,856353996,5.0
4,1.0,90,25,5.0,856353996,5.0


Sum up the weighted ratings per movie and then divide by the sum of the weights (similarity index).

In [65]:
#Sums the similarity index and weighted rating for each movie after grouping by movieId
tempTopUsersRating = topUsersRating.groupby('movieId')[['similarityIndex','weightedRating']].sum()
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head()

Unnamed: 0_level_0,sum_similarityIndex,sum_weightedRating
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,11.580289,35.744457
2,9.291742,36.125627
3,4.894371,16.412655
4,0.928571,1.392857
5,2.732851,7.685917


In [66]:
#Creates an empty dataframe
recommendation_df = pd.DataFrame()
#Now we take the weighted average
recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation_df['movieId'] = tempTopUsersRating.index
recommendation_df.head()

Unnamed: 0_level_0,weighted average recommendation score,movieId
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,3.086664,1
2,3.887928,2
3,3.353374,3
4,1.5,4
5,2.812417,5
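
The weighted average can be checked by hand on a few made-up rows. In this minimal sketch the movie labels, similarities and ratings are hypothetical values chosen for easy arithmetic, not data from ratings.csv.

```python
from collections import defaultdict

# (movie, similarity of the rating user, rating); hypothetical values
neighbour_ratings = [
    ("A", 1.0, 3.0),
    ("A", 0.5, 4.0),
    ("B", 0.8, 5.0),
    ("B", 0.2, 2.5),
    ("C", 0.25, 5.0),  # rated by a single neighbour
]

sum_similarity = defaultdict(float)
sum_weighted = defaultdict(float)
for movie, similarity, rating in neighbour_ratings:
    sum_similarity[movie] += similarity
    sum_weighted[movie] += similarity * rating

# score = sum(similarity * rating) / sum(similarity), computed per movie
scores = {movie: sum_weighted[movie] / sum_similarity[movie] for movie in sum_similarity}
for movie, score in scores.items():
    print(movie, round(score, 3))
```

Movie C shows a property of this score: when only one neighbour has rated a movie, the weight cancels and the score equals that neighbour's rating, however small the similarity is.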


Sort according to recommendation score

In [75]:
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
recommendation_df.head(85)

Unnamed: 0_level_0,weighted average recommendation score,movieId
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
3814,5.000000,3814
1272,5.000000,1272
1096,5.000000,1096
27731,5.000000,27731
1916,5.000000,1916
...,...,...
69069,5.000000,69069
85,5.000000,85
4467,4.926344,4467
1204,4.918056,1204


Since 82 movies have a perfect recommendation score, we will recommend all 82.

In [77]:
movies_df.loc[movies_df['movieId'].isin(recommendation_df.head(82)['movieId'].tolist())]

Unnamed: 0,movieId,title,year
26,27,Now and Then,1995
66,74,Bed of Roses,1996
74,82,Antonia's Line (Antonia),1995
76,85,Angels and Insects,1995
101,116,Anne Frank Remembered,1995
...,...,...,...
6999,67618,Strictly Sexual,2008
7041,69069,Fired Up,2009
7284,74946,She's Out of My League,2010
8536,115122,What We Do in the Shadows,2014
