<h3><b>Based on this exercise, discuss the questions in Quizizz with your group.</b></h3>

Steps in a user-based recommendation system:

1. Select a user along with the movies that user has watched and rated.
2. Based on the user's ratings, find the top x neighbours (the most similar users).
3. For each neighbour, get the record of movies they have watched.
4. Calculate a similarity score using a formula (this notebook uses the Pearson correlation).
5. Recommend the items with the highest score.

In [2]:
import pandas as pd
from math import sqrt
import numpy as np
from sklearn import datasets

In [4]:
anime_url = 'https://raw.githubusercontent.com/Theotrgl/FoDS_Recommender_System/main/anime.csv'
ratings_url = 'https://raw.githubusercontent.com/Theotrgl/FoDS_Recommender_System/main/rating.csv'
anime_df = pd.read_csv(anime_url)
ratings_df = pd.read_csv(ratings_url)
print(anime_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB
None


In [6]:
userInput = [{'name':'Gintama', 'rating':9.04},
             {'name':'Cowboy Bebop', 'rating':8.82},
             {'name':'One Punch Man', 'rating':8.82},
             {'name':'Death Note', 'rating':8.71},
             {'name':'Tengen Toppa Gurren Lagann', 'rating':8.78}]
inputAnime = pd.DataFrame(userInput)
print(inputAnime)

                         name  rating
0                     Gintama    9.04
1                Cowboy Bebop    8.82
2               One Punch Man    8.82
3                  Death Note    8.71
4  Tengen Toppa Gurren Lagann    8.78


In [9]:
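#Look up the anime_id of every title the user rated
inputId = anime_df[anime_df['name'].isin(inputAnime['name'].tolist())]
#Merge on 'name' only: anime_df has its own 'rating' column (the site-wide average),
#which would otherwise become a second join key and drop rows whose ratings differ
inputAnime = pd.merge(inputId[['anime_id','name']], inputAnime, on='name')
inputAnime = inputAnime[['anime_id','name','rating']]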
print(inputAnime)

   anime_id                        name  rating
0       918                     Gintama    9.04
1         1                Cowboy Bebop    8.82
2     30276               One Punch Man    8.82
3      2001  Tengen Toppa Gurren Lagann    8.78
4      1535                  Death Note    8.71
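
An inner merge silently drops any input title that does not exactly match a `name` in the catalogue. A minimal sketch of a check for that, assuming a tiny stand-in catalogue (the `catalogue` and `user_input` frames and the misspelled title are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for anime_df; only the columns needed here
catalogue = pd.DataFrame({
    'anime_id': [1, 918, 1535],
    'name': ['Cowboy Bebop', 'Gintama', 'Death Note'],
})

# Hypothetical user input; 'Deth Note' is deliberately misspelled
user_input = pd.DataFrame([
    {'name': 'Gintama', 'rating': 9.0},
    {'name': 'Cowboy Bebop', 'rating': 8.5},
    {'name': 'Deth Note', 'rating': 8.0},
])

# A left join keeps every input row; indicator=True adds a '_merge' column
# that says whether the row found a partner in the catalogue
matched = user_input.merge(catalogue, on='name', how='left', indicator=True)
unmatched = matched.loc[matched['_merge'] == 'left_only', 'name'].tolist()
print(unmatched)
```

Any title listed in `unmatched` would need to be corrected before continuing, since it contributes nothing to the neighbour search.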


#### With the `anime_id` values in our input, we can get the subset of users who have watched and rated the same anime, and then find the users whose taste is most similar.

In [10]:
userSubset = ratings_df[ratings_df['anime_id'].isin(inputAnime['anime_id'].tolist())]
print(userSubset.groupby('anime_id').count())

          user_id  rating
anime_id                 
1           15509   15509
918          4974    4974
1535        39340   39340
2001        19337   19337
30276       13374   13374


In [12]:
#groupby splits userSubset into one sub-DataFrame per user_id
userSubsetGroup = userSubset.groupby('user_id')

def group_size(x):
    #x is a (user_id, sub-DataFrame) pair; return how many of the input anime this user rated
    return len(x[1])


#Sort so that users with the most anime in common with the input come first
userSubsetGroup = sorted(userSubsetGroup, key=group_size, reverse=True)

#Keep the first 100 users to limit the cost of the correlation step
userSubsetGroup = userSubsetGroup[0:100]
print(userSubsetGroup[0:5])


[(226,        user_id  anime_id  rating
17304      226         1       8
17449      226       918       8
17470      226      1535       8
17501      226      2001       6
18001      226     30276       9), (274,        user_id  anime_id  rating
23128      274         1      -1
23146      274       918      -1
23147      274      1535      -1
23150      274      2001      -1
23235      274     30276      -1), (296,        user_id  anime_id  rating
25851      296         1       8
25925      296       918      10
25939      296      1535      10
25949      296      2001       9
26200      296     30276      -1), (392,        user_id  anime_id  rating
35109      392         1       7
35223      392       918       8
35250      392      1535       8
35277      392      2001       8
35790      392     30276       8), (567,        user_id  anime_id  rating
54622      567         1       8
54680      567       918      10
54692      567      1535      10
54704      567      2001       8
5493
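
Sorting a list of groups builds one small DataFrame per user before any are discarded. A lighter sketch of the same "most anime in common first" selection counts rows per user and keeps only the top user ids. The toy `ratings` frame and the name `top_n` are assumptions for illustration, and users with equal counts may come out in a different order than with `sorted`:

```python
import pandas as pd

# Toy stand-in for userSubset
ratings = pd.DataFrame({
    'user_id':  [1, 1, 1, 2, 2, 3],
    'anime_id': [1, 918, 1535, 1, 918, 1],
    'rating':   [8, 9, 7, 6, 8, 10],
})
top_n = 2  # hypothetical cut-off; the notebook keeps 100

# Number of input anime each user has rated, largest first
counts = ratings.groupby('user_id').size().sort_values(ascending=False)
top_users = counts.head(top_n).index.tolist()

# Keep only the rows belonging to those users
subset = ratings[ratings['user_id'].isin(top_users)]
print(top_users)
```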

In [14]:
#Store the Pearson Correlation in a dictionary, where the key is the user Id and the value is the coefficient
pearsonCorrelationDict = {}

#For every user group in our subset
for name, group in userSubsetGroup:

    #Let's start by sorting the input and current user group so the values aren't mixed up later on
    group = group.sort_values(by='anime_id')
    inputAnime = inputAnime.sort_values(by='anime_id')

    #Get the N for the formula
    nRatings = len(group)

    #Get the review scores for the movies that they both have in common
    temp_df = inputAnime[inputAnime['anime_id'].isin(group['anime_id'].tolist())]

    #And then store them in a temporary buffer variable in a list format to facilitate future calculations
    tempRatingList = temp_df['rating'].tolist()
   
    #Let's also put the current user group reviews in a list format
    tempGroupList = group['rating'].tolist()
   
    
    #Now calculate the Pearson correlation between the two users, x (the input user) and y (the current user), manually (see the formula in the week 7 slides)
    Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)
    Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)
    Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)

    #If the denominator is different than zero, then divide, else, 0 correlation.
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0
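
The loop above computes r = Sxy / sqrt(Sxx * Syy), where Sxx = Σx² − (Σx)²/n, Syy = Σy² − (Σy)²/n and Sxy = Σxy − (Σx)(Σy)/n. As a sanity check, the same arithmetic can be wrapped in a function and compared with `numpy.corrcoef`. This is a sketch on made-up rating lists; the function name `pearson` is an assumption, not part of the notebook:

```python
from math import sqrt
import numpy as np

def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either has no variance."""
    n = len(x)
    sxx = sum(i**2 for i in x) - sum(x)**2 / n
    syy = sum(j**2 for j in y) - sum(y)**2 / n
    sxy = sum(i*j for i, j in zip(x, y)) - sum(x)*sum(y) / n
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

# Made-up ratings for two users on the same five items
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson(x, y)
r_np = np.corrcoef(x, y)[0, 1]   # NumPy's result for comparison
flat = pearson([1, 2, 3], [5, 5, 5])  # a user who gives every item the same rating
print(r, r_np, flat)
```

A user who gives every item the same rating has zero variance, so the denominator is zero and the function falls back to 0, as the loop does.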
    


In [15]:
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['user_id'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
print(pearsonDF.head())


   similarityIndex  user_id
0         0.173546      226
1         0.000000      274
2         0.102461      296
3         0.063313      392
4         0.251088      567


In [16]:
topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]
print(topUsers.head())

    similarityIndex  user_id
22         0.943859     2989
97         0.931602     8107
11         0.928241     1576
53         0.838192     5548
74         0.838192     6416


In [17]:
topUsersRating=topUsers.merge(ratings_df, left_on='user_id', right_on='user_id', how='inner')
print(topUsersRating.head(100))

    similarityIndex  user_id  anime_id  rating
0          0.943859     2989         1       9
1          0.943859     2989         6       8
2          0.943859     2989        20       7
3          0.943859     2989        30       7
4          0.943859     2989        32       8
..              ...      ...       ...     ...
95         0.943859     2989     10110       7
96         0.943859     2989     10119       8
97         0.943859     2989     10209       5
98         0.943859     2989     10568       8
99         0.943859     2989     10620       6

[100 rows x 4 columns]


In [18]:
#Multiplies the similarity by the user’s ratings
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']
print(topUsersRating.head())

   similarityIndex  user_id  anime_id  rating  weightedRating
0         0.943859     2989         1       9        8.494734
1         0.943859     2989         6       8        7.550875
2         0.943859     2989        20       7        6.607015
3         0.943859     2989        30       7        6.607015
4         0.943859     2989        32       8        7.550875


In [19]:
#Sums similarityIndex and weightedRating for each anime after grouping topUsersRating by anime_id
tempTopUsersRating = topUsersRating.groupby('anime_id').sum()[['similarityIndex','weightedRating']]
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
print(tempTopUsersRating.head())

          sum_similarityIndex  sum_weightedRating
anime_id                                         
1                   25.948208          214.252638
5                   13.036242          103.867795
6                   15.274047          116.580572
7                    2.647062           18.019870
8                    0.592427            4.739414


In [20]:
#Creates an empty dataframe
recommendation_df = pd.DataFrame()

#Now we take the weighted average
recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation_df['anime_id'] = tempTopUsersRating.index
print(recommendation_df.head(10))

          weighted average recommendation score  anime_id
anime_id                                                 
1                                      8.256934         1
5                                      7.967618         5
6                                      7.632592         6
7                                      6.807499         7
8                                      8.000000         8
15                                     8.681326        15
16                                     9.062583        16
17                                     8.750565        17
18                                     9.000000        18
19                                     8.224247        19
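
The score for each anime is Σ(similarity × rating) / Σ(similarity), taken over the neighbours who rated it. A small worked example with two made-up neighbours shows how the more similar neighbour pulls the score towards their own rating (all values here are invented for illustration):

```python
import pandas as pd

# Two hypothetical neighbours: user 10 is very similar (0.9), user 20 less so (0.3)
neighbours = pd.DataFrame({
    'user_id':         [10, 10, 20, 20],
    'similarityIndex': [0.9, 0.9, 0.3, 0.3],
    'anime_id':        [1, 2, 1, 2],
    'rating':          [10, 6, 6, 10],
})

# Weight each rating by the similarity of the user who gave it
neighbours['weightedRating'] = neighbours['similarityIndex'] * neighbours['rating']

# Sum weights and weighted ratings per anime, then divide
sums = neighbours.groupby('anime_id')[['similarityIndex', 'weightedRating']].sum()
score = sums['weightedRating'] / sums['similarityIndex']
print(score)
```

Both anime received one 10 and one 6, but anime 1 scores about 9 and anime 2 about 7, because the 10 for anime 1 came from the more similar neighbour.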


In [21]:
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
print(recommendation_df)


          weighted average recommendation score  anime_id
anime_id                                                 
1034                                       10.0      1034
593                                        10.0       593
10507                                      10.0     10507
557                                        10.0       557
31733                                      10.0     31733
...                                         ...       ...
19671                                      -1.0     19671
1184                                       -1.0      1184
1816                                       -1.0      1816
28779                                      -1.0     28779
3031                                       -1.0      3031

[3779 rows x 2 columns]
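
The `-1.0` scores at the bottom come from rows whose `rating` is `-1`. In the public description of the Kaggle anime dataset, `-1` marks an anime the user watched but did not rate. Assuming that also holds for this copy of `rating.csv`, one option is to drop those rows before computing correlations and weighted averages. A minimal sketch on a toy frame:

```python
import pandas as pd

# Toy stand-in for ratings_df; -1 is assumed to mean "watched but not rated"
ratings = pd.DataFrame({
    'user_id':  [274, 274, 296, 296],
    'anime_id': [1, 918, 1, 30276],
    'rating':   [-1, -1, 8, -1],
})

# Keep only rows that carry an actual rating
rated_only = ratings[ratings['rating'] != -1]
print(rated_only)
```

Applying such a filter to the full data would change the neighbour set and the scores shown in this notebook, so the outputs above would no longer match.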


In [26]:
recommended_anime=anime_df.loc[anime_df['anime_id'].isin(recommendation_df['anime_id'])]

#We don't want to recommend anime the user has already rated
recommended_anime=recommended_anime.loc[~recommended_anime.anime_id.isin(userSubset['anime_id'])]

recommended_anime.head()

   anime_id                              name                                              genre   type  episodes  rating  members
0     32281                    Kimi no Na wa.               Drama, Romance, School, Supernatural  Movie         1    9.37   200630
1      5114  Fullmetal Alchemist: Brotherhood  Action, Adventure, Drama, Fantasy, Magic, Mili...     TV        64    9.26   793665
2     28977                          Gintama°  Action, Comedy, Historical, Parody, Samurai, S...     TV        51    9.25   114262
3      9253                       Steins;Gate                                   Sci-Fi, Thriller     TV        24    9.17   673572
4      9969                          Gintama'  Action, Comedy, Historical, Parody, Samurai, S...     TV        51    9.16   151266
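
Note that `isin` only filters `anime_df`, so the rows above appear in `anime_df`'s original order rather than in order of recommendation score. One way to keep the score ordering is to merge the scores onto the catalogue and sort afterwards. The sketch below uses toy frames; `catalogue`, `scores`, `seen_ids` and the `score` column name are assumptions for illustration:

```python
import pandas as pd

# Toy stand-ins for anime_df and recommendation_df
catalogue = pd.DataFrame({'anime_id': [1, 2, 3, 4], 'name': ['A', 'B', 'C', 'D']})
scores = pd.DataFrame({'anime_id': [3, 1, 4, 2], 'score': [9.5, 9.0, 8.0, 7.5]})
seen_ids = [1]  # anime the input user has already rated

# Attach names to scores, drop what the user has seen, then rank by score
ranked = scores.merge(catalogue, on='anime_id', how='inner')
ranked = ranked[~ranked['anime_id'].isin(seen_ids)]
top = ranked.sort_values('score', ascending=False).head(2)
print(top)
```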
