<h3><b>Based on this exercise, discuss the questions in Quizizz with your group</b></h3>

Steps in a user-based recommendation system:

1. Select a user and the restaurants (locations) that user has rated
2. Find the other users who have rated the same restaurants
3. Calculate a similarity score between the selected user and each of those users (here, the Pearson correlation)
4. Keep the top x most similar users as neighbours and get their ratings
5. Recommend the items with the highest similarity-weighted score
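
The steps above can be sketched end to end on a tiny, made-up ratings table. Everything here is an assumption for illustration: the toy data, the helper name `recommend`, and the choice of Pearson correlation over co-rated locations as the similarity formula. It is a minimal sketch, not the notebook's own implementation.

```python
import pandas as pd

# Toy data (hypothetical): each user rates several locations.
toy = pd.DataFrame({
    "UserID":        [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "LocationID":    [10, 20, 30, 10, 20, 30, 40, 10, 20, 30, 50],
    "OverallRating": [5.0, 3.0, 1.0, 4.5, 3.5, 1.5, 4.0, 1.0, 3.0, 5.0, 2.0],
})

def recommend(ratings, target_user, top_n=2):
    # Users as rows, locations as columns; NaN marks "not rated".
    matrix = ratings.pivot_table(index="UserID", columns="LocationID",
                                 values="OverallRating")
    target = matrix.loc[target_user]
    others = matrix.drop(index=target_user)

    # Pearson correlation on co-rated locations; undefined values are dropped.
    sims = others.apply(lambda row: target.corr(row), axis=1).dropna()
    # Keep only positively correlated users, at most top_n of them.
    neighbours = sims[sims > 0].nlargest(top_n)

    # Similarity-weighted average of the neighbours' ratings
    # for every location the target user has not rated.
    scores = {}
    for loc in target[target.isna()].index:
        rated = others.loc[neighbours.index, loc].dropna()
        if not rated.empty:
            weights = neighbours.loc[rated.index]
            scores[loc] = (rated * weights).sum() / weights.sum()
    return pd.Series(scores, dtype=float).sort_values(ascending=False)

print(recommend(toy, target_user=1))
```

In this toy table user 2 rates the shared locations much like user 1, while user 3 rates them in the opposite order, so only user 2 counts as a neighbour and only locations that user 2 has rated can be recommended.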

In [316]:
import pandas as pd
from math import sqrt
import numpy as np


In [317]:
ratings_df = pd.read_csv('Cuisine_rating.csv')
print(ratings_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   UserID         200 non-null    int64  
 1   LocationID     200 non-null    int64  
 2   OverallRating  200 non-null    float64
dtypes: float64(1), int64(2)
memory usage: 4.8 KB
None


In [318]:
ratings_df.head()

   UserID  LocationID  OverallRating
0       1         153            4.5
1       2         123            1.0
2       3         122            5.0
3       4         153            2.0
4       5         129            3.0


In [319]:
ratings_df

     UserID  LocationID  OverallRating
0         1         153            4.5
1         2         123            1.0
2         3         122            5.0
3         4         153            2.0
4         5         129            3.0
..      ...         ...            ...
195     196         175            1.5
196     197         170            1.5
197     198         160            3.5
198     199         130            2.5


In [320]:
userInput = [{'LocationID': 153, 'OverallRating':4.5},
             {'LocationID': 123, 'OverallRating':2.0},
             {'LocationID': 130, 'OverallRating':3.0},
             {'LocationID': 129, 'OverallRating':5.0}]
inputResto = pd.DataFrame(userInput)
print(inputResto)

   LocationID  OverallRating
0         153            4.5
1         123            2.0
2         130            3.0
3         129            5.0


In [321]:
# inputId = ratings_df[ratings_df['Location'].isin(inputResto['Location'].tolist())]
# inputResto = pd.merge(inputId, inputResto)
# inputResto = inputResto.drop('Cuisines', 1) #we don't really need this at the moment
# inputResto = inputResto[['LocationID','Location','OverallRating']]
# print(inputResto)

In [322]:
userSubset = ratings_df[ratings_df['LocationID'].isin(inputResto['LocationID'].tolist())]
print(userSubset.groupby('LocationID').count())

            UserID  OverallRating
LocationID                       
123             24             24
129             17             17
130              2              2
153             15             15


In [323]:
# groupby splits the frame into sub-DataFrames, one per distinct UserID
userSubsetGroup = userSubset.groupby('UserID')

def n_common_ratings(item):
    # item is a (UserID, group DataFrame) pair; return the number of rows in the group
    return len(item[1])

# Sort so that users with the most restaurants in common with the input come first
userSubsetGroup = sorted(userSubsetGroup, key=n_common_ratings, reverse=True)

# Keep at most the first 100 users
userSubsetGroup = userSubsetGroup[0:100]
print(userSubsetGroup[0:5])


[(1,    UserID  LocationID  OverallRating
0       1         153            4.5), (2,    UserID  LocationID  OverallRating
1       2         123            1.0), (4,    UserID  LocationID  OverallRating
3       4         153            2.0), (5,    UserID  LocationID  OverallRating
4       5         129            3.0), (8,    UserID  LocationID  OverallRating
7       8         153            3.5)]
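
The groups are sorted by size in descending order and the first one printed above has a single row, so every user in the subset has rated only one of the input locations. A quick way to check this before computing similarities is to count the overlap per user. The sketch below uses a small made-up frame (`demo_df` is hypothetical, not the real `Cuisine_rating.csv`) with the same column names.

```python
import pandas as pd

# Hypothetical stand-in for ratings_df with the same columns.
demo_df = pd.DataFrame({
    "UserID":        [1, 2, 3, 3, 4],
    "LocationID":    [153, 123, 153, 129, 130],
    "OverallRating": [4.5, 1.0, 2.0, 3.0, 2.5],
})
input_locations = [153, 123, 130, 129]

# Number of distinct input locations each user has rated.
overlap = (demo_df[demo_df["LocationID"].isin(input_locations)]
           .groupby("UserID")["LocationID"]
           .nunique())
print(overlap)

# Pearson correlation needs at least two co-rated items per pair of users.
usable = overlap[overlap >= 2]
print("Users with at least 2 locations in common:", list(usable.index))
```

If `usable` comes back empty on the real data, no user can be compared with the input user through a correlation, whatever formula is used afterwards.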


In [324]:
# Store the Pearson correlation in a dictionary: key = UserID, value = coefficient
pearsonCorrelationDict = {}

# Sort the input once so its ratings line up with each group's ratings by LocationID
inputResto = inputResto.sort_values(by='LocationID')

# For every (UserID, ratings) pair in our subset
for name, group in userSubsetGroup:

    # Sort the current user's ratings the same way as the input
    group = group.sort_values(by='LocationID')

    # N in the formula: the number of restaurants both users have rated
    nRatings = len(group)

    # Input user's ratings for the restaurants they have in common
    temp_df = inputResto[inputResto['LocationID'].isin(group['LocationID'].tolist())]
    tempRatingList = temp_df['OverallRating'].tolist()

    # Current user's ratings for the same restaurants
    tempGroupList = group['OverallRating'].tolist()

    # Pearson correlation between the two users, x and y, computed manually (see the week 7 slides)
    Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)
    Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)
    Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)

    # With a single common rating (nRatings == 1), Sxx and Syy are both 0,
    # so the correlation is undefined and is stored as 0.
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0
    


In [325]:
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['UserID'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
print(pearsonDF.head())


   similarityIndex  UserID
0                0       1
1                0       2
2                0       4
3                0       5
4                0       8
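
A similarity of 0 for every user is what the manual formula produces when each group holds a single rating: with n = 1, Sxx = x² − x²/1 = 0, so the `else` branch runs. The sketch below wraps the same formula in a hypothetical helper, `pearson`, that returns `None` instead of 0 when the correlation is undefined, and cross-checks one value against `numpy.corrcoef`. The rating lists are made up.

```python
from math import sqrt
import numpy as np

def pearson(xs, ys):
    """Pearson correlation of two equal-length rating lists.

    Returns None when it is undefined (fewer than two ratings,
    mismatched lengths, or one of the lists has zero variance).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        return None
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    syy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    if sxx <= 0 or syy <= 0:
        return None
    return sxy / sqrt(sxx * syy)

a = [4.5, 2.0, 5.0]
b = [4.0, 1.0, 3.5]
print(pearson([4.5], [4.5]))      # a single shared rating: undefined
print(pearson(a, b))              # three shared ratings
print(np.corrcoef(a, b)[0, 1])    # NumPy's value for comparison
```

Returning `None` rather than 0 keeps "no information" apart from "no correlation", which makes an all-zero result like the one above easier to diagnose.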


In [326]:
topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]
print(topUsers.head())

    similarityIndex  UserID
0                 0       1
43                0      78
31                0      57
32                0      58
33                0      59


In [327]:
topUsersRating=topUsers.merge(ratings_df, left_on='UserID', right_on='UserID', how='inner')
print(topUsersRating.head(100))

    similarityIndex  UserID  LocationID  OverallRating
0                 0       1         153            4.5
1                 0      78         153            3.0
2                 0      57         123            5.0
3                 0      58         129            3.5
4                 0      59         123            3.0
5                 0      61         129            3.5
6                 0      63         123            5.0
7                 0      64         153            1.0
8                 0      65         129            5.0
9                 0      72         123            1.5
10                0      73         123            1.0
11                0      74         153            3.0
12                0      75         123            3.5
13                0      77         129            3.5
14                0      82         153            4.0
15                0       2         123            1.0
16                0      85         153            5.0
...

In [328]:
#Multiplies the similarity by the user’s ratings
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['OverallRating']
print(topUsersRating.head())

   similarityIndex  UserID  LocationID  OverallRating  weightedRating
0                0       1         153            4.5             0.0
1                0      78         153            3.0             0.0
2                0      57         123            5.0             0.0
3                0      58         129            3.5             0.0
4                0      59         123            3.0             0.0


In [329]:
# Sum the similarity and the weighted rating of the top users, grouped by LocationID
tempTopUsersRating = topUsersRating.groupby('LocationID').sum()[['similarityIndex','weightedRating']]
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
print(tempTopUsersRating.head())

            sum_similarityIndex  sum_weightedRating
LocationID                                         
123                           0                 0.0
129                           0                 0.0
130                           0                 0.0
153                           0                 0.0


In [330]:
#Creates an empty dataframe
recommendation_df = pd.DataFrame()

# Weighted average = sum of weighted ratings / sum of similarities (NaN when the similarities sum to 0)
recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation_df['LocationID'] = tempTopUsersRating.index
print(recommendation_df.head(10))

            weighted average recommendation score  LocationID
LocationID                                                   
123                                           NaN         123
129                                           NaN         129
130                                           NaN         130
153                                           NaN         153
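
The NaN scores come from dividing 0 by 0: every `sum_similarityIndex` is 0. One possible guard, sketched here on made-up numbers, is to keep only neighbours with a positive similarity before aggregating, so that locations with no usable neighbour drop out instead of producing NaN. The frame `demo` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical neighbour ratings with their similarity to the input user.
demo = pd.DataFrame({
    "UserID":          [7, 7, 8, 9],
    "LocationID":      [140, 150, 140, 150],
    "OverallRating":   [4.0, 2.0, 5.0, 3.0],
    "similarityIndex": [0.5, 0.5, 1.0, 0.0],
})

# Drop neighbours that carry no positive weight.
positive = demo[demo["similarityIndex"] > 0].copy()
positive["weightedRating"] = positive["similarityIndex"] * positive["OverallRating"]

# Sum weights and weighted ratings per location, then divide.
sums = positive.groupby("LocationID")[["similarityIndex", "weightedRating"]].sum()
scores = (sums["weightedRating"] / sums["similarityIndex"]).sort_values(ascending=False)
print(scores)
```

Filtering on `similarityIndex > 0` is a design choice: it also ignores negatively correlated users, which keeps the denominator strictly positive.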


In [331]:
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
print(recommendation_df)


            weighted average recommendation score  LocationID
LocationID                                                   
123                                           NaN         123
129                                           NaN         129
130                                           NaN         130
153                                           NaN         153


In [332]:
recommended_resto=ratings_df.loc[ratings_df['LocationID'].isin(recommendation_df['LocationID'])]

# We don't want to recommend restaurants the input user has already rated
recommended_resto=recommended_resto.loc[~recommended_resto.LocationID.isin(userSubset['LocationID'])]

print(recommended_resto)

Empty DataFrame
Columns: [UserID, LocationID, OverallRating]
Index: []
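
The result is empty because `recommendation_df` contains only the four input locations (123, 129, 130, 153), and the filter then removes exactly those. Candidates can only appear when the neighbours have rated locations that the input user has not. A minimal sketch of the filtering step with made-up scores (`demo_scores` and `already_rated` are hypothetical names):

```python
import pandas as pd

# Hypothetical recommendation scores, indexed by LocationID.
demo_scores = pd.Series({153: 4.2, 140: 3.9, 129: 3.1, 175: 2.4}, name="score")

# Locations the input user has already rated.
already_rated = [153, 123, 130, 129]

# Keep only locations the input user has not rated yet.
candidates = demo_scores[~demo_scores.index.isin(already_rated)]
print(candidates.sort_values(ascending=False))
```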


Notes for Sir Bagus:
I have tried this with two datasets, and it always returns an empty DataFrame and does not show the similarity index. The code runs without errors, but the output is still wrong. I hope you can explain my mistakes at the next meeting. Thank you, sir.