<b>USER-BASED RECOMMENDER SYSTEM</b>

Steps in a user-based recommendation system:

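1. Select a user together with the movies that user has rated.
2. Based on the user's ratings, find the top X neighbours (the users with the most similar taste).
3. For each neighbour, get the record of movies that neighbour has rated.
4. Calculate a score for each candidate movie (in this notebook, a weighted average of the neighbours' ratings, with the Pearson correlation as the similarity weight).
5. Recommend the items with the highest score.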
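
The five steps above can be sketched end to end on a few rows of toy data. This is only an illustrative sketch, not the notebook's implementation: the ratings are made up, and it uses a pivoted user-by-movie matrix with `Series.corr` rather than the manual formula used later. It also assumes that only positively correlated users count as neighbours.

```python
import pandas as pd

# Hypothetical toy ratings (not taken from ratings.csv)
ratings = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'movieId': [10, 20, 30, 10, 20, 30, 40, 10, 20, 30, 50],
    'rating':  [5.0, 1.0, 4.0, 5.0, 2.0, 4.0, 5.0, 1.0, 5.0, 2.0, 4.0],
})

# Step 1: pick the target user and lay the ratings out as a user x movie matrix
target = 1
matrix = ratings.pivot(index='userId', columns='movieId', values='rating')
target_row = matrix.loc[target]
others = matrix.drop(index=target)

# Step 2: Pearson correlation with every other user over co-rated movies;
# assumption: only positively correlated users are kept as neighbours
sims = others.apply(lambda row: row.corr(target_row), axis=1)
neighbours = sims[sims > 0]

# Step 3: the neighbours' ratings for movies the target has not rated
unseen_cols = target_row[target_row.isna()].index
unseen = others.loc[neighbours.index, unseen_cols]

# Step 4: similarity-weighted average rating per movie
weighted_sum = unseen.mul(neighbours, axis=0).sum()
weight_total = unseen.notna().astype(float).mul(neighbours, axis=0).sum()
scores = (weighted_sum / weight_total).dropna()

# Step 5: highest score first
print(scores.sort_values(ascending=False))
```

In this toy data, user 2 agrees with the target and user 3 disagrees, so only user 2's unseen movie (movieId 40) receives a score.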

In [42]:
import pandas as pd
from math import sqrt
import numpy as np


In [43]:
movies_df = pd.read_csv('movies.csv')
ratings_df = pd.read_csv('ratings.csv')
print(movies_df.info())
print(ratings_df.info())



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB
None


In [44]:
userInput = [{'title':'Flint (2017)', 'rating':3},
             {'title':'Toy Story (1995)', 'rating':1},
             {'title':'Jumanji (1995)', 'rating':1},
             {'title':'No Game No Life: Zero (2017)', 'rating':5},
             {'title':'Grumpier Old Men (1995)', 'rating':4.5}]
inputMovies = pd.DataFrame(userInput)
print(inputMovies)

                          title  rating
0                  Flint (2017)     3.0
1              Toy Story (1995)     1.0
2                Jumanji (1995)     1.0
3  No Game No Life: Zero (2017)     5.0
4       Grumpier Old Men (1995)     4.5


In [45]:
inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]
inputMovies = pd.merge(inputId, inputMovies, on='title')
# inputMovies = inputMovies.drop('year', 1) #we don't really need this at the moment
inputMovies = inputMovies[['movieId','title','rating']]
inputMovies

   movieId                         title  rating
0        1              Toy Story (1995)     1.0
1        2                Jumanji (1995)     1.0
2        3       Grumpier Old Men (1995)     4.5
3   193583  No Game No Life: Zero (2017)     5.0
4   193585                  Flint (2017)     3.0


#### With the `movieId` values for our input, we can get the subset of users who have watched and rated the same movies, and then find the users whose taste is most similar.

In [46]:
userSubset = ratings_df[ratings_df['movieId'].isin(inputMovies['movieId'].tolist())]
print(userSubset.groupby('movieId').count())

         userId  rating  timestamp
movieId                           
1           215     215        215
2           110     110        110
3            52      52         52
193583        1       1          1
193585        1       1          1


In [47]:
#groupby splits the dataframe into sub-dataframes, one per distinct value of the given column(s)
userSubsetGroup = userSubset.groupby(['userId'])

def group_size(x):
    #x is a (key, sub-dataframe) pair; return how many of the input movies this user has rated
    return len(x[1])
    

#Sort so that users who share the most movies with the input come first
userSubsetGroup = sorted(userSubsetGroup, key=group_size, reverse=True)

#Keep only the 100 users with the largest overlap
userSubsetGroup = userSubsetGroup[0:100]
print(userSubsetGroup[0:5])


[((19,),       userId  movieId  rating  timestamp
2274      19        1     4.0  965705637
2275      19        2     3.0  965704331
2276      19        3     3.0  965707636), ((68,),        userId  movieId  rating   timestamp
10360      68        1     2.5  1158531426
10361      68        2     2.5  1158532776
10362      68        3     2.0  1158533415), ((91,),        userId  movieId  rating   timestamp
14121      91        1     4.0  1112713037
14122      91        2     3.0  1112713392
14123      91        3     3.0  1112712323), ((169,),        userId  movieId  rating   timestamp
24321     169        1     4.5  1059427918
24322     169        2     4.0  1078284713
24323     169        3     5.0  1078284750), ((217,),        userId  movieId  rating  timestamp
30885     217        1     4.0  955942540
30886     217        2     2.0  955942327
30887     217        3     1.0  955944713)]
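
The same "largest overlap first" selection can also be expressed without materialising every group. The snippet below is a minimal sketch on a made-up stand-in for `userSubset`; the name `toy_subset` and its values are hypothetical, and `kind='mergesort'` is used only to keep ties in a stable order, as `sorted` does.

```python
import pandas as pd

# Hypothetical toy stand-in for userSubset (ratings restricted to the input movies)
toy_subset = pd.DataFrame({
    'userId':  [19, 19, 19, 7, 7, 42],
    'movieId': [1, 2, 3, 1, 3, 2],
    'rating':  [4.0, 3.0, 3.0, 5.0, 2.0, 1.0],
})

# Number of input movies each user has rated, largest overlap first
overlap = (toy_subset.groupby('userId').size()
           .sort_values(ascending=False, kind='mergesort'))

# Keep the 2 users with the most overlap (the notebook keeps 100)
top_ids = overlap.index[:2].tolist()
top_subset = toy_subset[toy_subset['userId'].isin(top_ids)]
print(overlap)
```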


In [48]:
#Store the Pearson Correlation in a dictionary, where the key is the user Id and the value is the coefficient
pearsonCorrelationDict = {}

#For every user group in our subset
for name, group in userSubsetGroup:

    #Let's start by sorting the input and current user group so the values aren't mixed up later on
    group = group.sort_values(by='movieId')
    inputMovies = inputMovies.sort_values(by='movieId')

    #Get the N for the formula
    nRatings = len(group)

    #Get the review scores for the movies that they both have in common
    temp_df = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())]

    #And then store them in a temporary buffer variable in a list format to facilitate future calculations
    tempRatingList = temp_df['rating'].tolist()
   
    #Let's also put the current user group reviews in a list format
    tempGroupList = group['rating'].tolist()
   
    
    #Now compute the Pearson correlation between the two users by hand: x is the input user, y is the current user (see the formula on the week 7 slide)
    Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)
    Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)
    Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)

    #If the denominator is non-zero, divide; otherwise set the correlation to 0
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0
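
The loop above computes r = Sxy / sqrt(Sxx * Syy), where Sxx, Syy and Sxy are the sums of squares and cross-products with the mean terms subtracted. As a sanity check, the sketch below wraps the same arithmetic in a helper (the name `pearson` is hypothetical) and compares it with `numpy.corrcoef`, using the input ratings for movieId 1, 2 and 3 and user 169's ratings for the same movies as printed earlier.

```python
from math import sqrt
import numpy as np

def pearson(x, y):
    """Pearson correlation of two equal-length rating lists; 0 if either list is constant."""
    n = len(x)
    sxx = sum(i**2 for i in x) - sum(x)**2 / n
    syy = sum(j**2 for j in y) - sum(y)**2 / n
    sxy = sum(i*j for i, j in zip(x, y)) - sum(x)*sum(y) / n
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

input_ratings = [1.0, 1.0, 4.5]   # input user's ratings for movieId 1, 2, 3
user_169 = [4.5, 4.0, 5.0]        # user 169's ratings for the same movies

manual = pearson(input_ratings, user_169)
reference = np.corrcoef(input_ratings, user_169)[0, 1]
print(round(manual, 6), round(reference, 6))   # 0.866025 0.866025
```

Both values agree with the similarity reported for user 169 in the notebook's own output.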
    


In [49]:
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['userId'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
print(pearsonDF.head())


   similarityIndex  userId
0        -0.500000   (19,)
1        -1.000000   (68,)
2        -0.500000   (91,)
3         0.866025  (169,)
4        -0.755929  (217,)


In [50]:
topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50].copy()
#The groupby keys are 1-tuples such as (19,); unpack them into plain user ids
topUsers["userId"] = topUsers["userId"].apply(lambda x: x[0])
print(topUsers.head())

    similarityIndex  userId
82         1.000000     555
19         1.000000       6
77         1.000000     501
3          0.866025     169
5          0.500000     226


In [51]:
topUsersRating=topUsers.merge(ratings_df, on='userId', how='inner')
print(topUsersRating.head(100))


    similarityIndex  userId  movieId  rating  timestamp
0               1.0     555        1     4.0  978746159
1               1.0     555        3     5.0  978747454
2               1.0     555       19     3.0  980123949
3               1.0     555       21     4.0  978746440
4               1.0     555       24     5.0  978841879
..              ...     ...      ...     ...        ...
95              1.0     555      737     1.0  978842115
96              1.0     555      743     1.0  978747598
97              1.0     555      748     2.0  980125071
98              1.0     555      778     5.0  978747779
99              1.0     555      780     4.0  978841695

[100 rows x 5 columns]


In [52]:
#Multiplies the similarity by the user’s ratings
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']
print(topUsersRating.head())

   similarityIndex  userId  movieId  rating  timestamp  weightedRating
0              1.0     555        1     4.0  978746159             4.0
1              1.0     555        3     5.0  978747454             5.0
2              1.0     555       19     3.0  980123949             3.0
3              1.0     555       21     4.0  978746440             4.0
4              1.0     555       24     5.0  978841879             5.0


In [53]:
#Sum the similarity indices and the weighted ratings for each movie
tempTopUsersRating = topUsersRating.groupby('movieId')[['similarityIndex','weightedRating']].sum()
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
print(tempTopUsersRating.head())

         sum_similarityIndex  sum_weightedRating
movieId                                         
1                   3.497042           14.334845
2                   4.497042           14.763273
3                   5.497042           25.394996
4                   1.000000            3.000000
5                   3.693352           13.984781


In [54]:
#Creates an empty dataframe
recommendation_df = pd.DataFrame()

#Now we take the weighted average
recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation_df['movieId'] = tempTopUsersRating.index
print(recommendation_df.head(10))

         weighted average recommendation score  movieId
movieId                                                
1                                     4.099134        1
2                                     3.282885        2
3                                     4.619756        3
4                                     3.000000        4
5                                     3.786474        5
6                                     3.400000        6
7                                     3.980099        7
8                                     3.000000        8
9                                          NaN        9
10                                    3.399711       10
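
The NaN score for movieId 9 is what a 0/0 division produces, which most likely means the similarities of the users who rated that movie sum to zero. The sketch below reproduces the effect on made-up numbers (all values are hypothetical) and shows one possible guard, keeping only movies whose similarity sum is positive; this guard is an assumption, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical neighbours: similarity to the input user and their rating of a movie
toy = pd.DataFrame({
    'movieId':         [100, 100, 200, 200, 300],
    'similarityIndex': [1.0, 0.5, 0.5, -0.5, 0.0],
    'rating':          [4.0, 2.0, 5.0, 1.0, 3.0],
})
toy['weightedRating'] = toy['similarityIndex'] * toy['rating']

sums = toy.groupby('movieId')[['similarityIndex', 'weightedRating']].sum()
score = sums['weightedRating'] / sums['similarityIndex']
print(score)   # movie 100: finite, movie 200: inf (2.0 / 0), movie 300: NaN (0 / 0)

# One possible guard: only score movies whose similarity weights sum to a positive value
valid = score[sums['similarityIndex'] > 0]
print(valid)
```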


In [55]:
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
print(recommendation_df)


         weighted average recommendation score  movieId
movieId                                                
1079                                       5.0     1079
6370                                       5.0     6370
6442                                       5.0     6442
3678                                       5.0     3678
3673                                       5.0     3673
...                                        ...      ...
182639                                     NaN   182639
182823                                     NaN   182823
184471                                     NaN   184471
187593                                     NaN   187593
188301                                     NaN   188301

[5639 rows x 2 columns]
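
To turn a sorted score table like the one above into a short, titled recommendation list, the NaN rows and the already-rated movies need to be dropped while the score order is preserved. The following is a minimal sketch on hypothetical stand-ins for `movies_df` and `recommendation_df`; it assumes the score table has a plain index, so the real `recommendation_df` (whose index is also named `movieId`) would need `reset_index(drop=True)` before the merge.

```python
import pandas as pd

score_col = 'weighted average recommendation score'

# Hypothetical stand-ins for movies_df and recommendation_df
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3, 4],
    'title':   ['A (1995)', 'B (1995)', 'C (1996)', 'D (1997)'],
})
toy_scores = pd.DataFrame({
    'movieId': [1, 2, 3, 4],
    score_col: [4.1, float('nan'), 5.0, 3.0],
})
already_rated = [1]   # movieIds taken from the user's input

top = (toy_scores
       .dropna(subset=[score_col])                        # drop movies without a score
       .loc[lambda d: ~d['movieId'].isin(already_rated)]  # drop movies already rated
       .sort_values(score_col, ascending=False)           # best score first
       .head(2)                                           # top N, here N = 2
       .merge(toy_movies, on='movieId', how='left'))      # attach titles, keeping the order
print(top[['movieId', 'title', score_col]])
```

A left merge from the score table keeps the ranking, whereas filtering `movies_df` with `isin` returns rows in `movies_df` order.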


In [56]:
recommended_movie=movies_df.loc[movies_df['movieId'].isin(recommendation_df['movieId'])]

#Exclude the movies the user has already rated (the input movies)
recommended_movie=recommended_movie.loc[~recommended_movie.movieId.isin(userSubset['movieId'])]

print(recommended_movie)

      movieId                                      title  \
3           4                   Waiting to Exhale (1995)   
4           5         Father of the Bride Part II (1995)   
5           6                                Heat (1995)   
6           7                             Sabrina (1995)   
7           8                        Tom and Huck (1995)   
...       ...                                        ...   
9692   184471                         Tomb Raider (2018)   
9695   184791  Fred Armisen: Standup for Drummers (2018)   
9709   187593                          Deadpool 2 (2018)   
9710   187595             Solo: A Star Wars Story (2018)   
9713   188301                Ant-Man and the Wasp (2018)   

                                      genres  
3                       Comedy|Drama|Romance  
4                                     Comedy  
5                      Action|Crime|Thriller  
6                             Comedy|Romance  
7                         Adventure|Children