#### **USER-BASED RECOMMENDATION SYSTEM**

Jeffrey (2602118484)

Steps in a user-based recommendation system:

1. Select a user together with the books that user has rated.
2. Based on those ratings, find the top x neighbours: the users whose ratings of the same books are most similar (measured here with the Pearson correlation).
3. Retrieve the full rating record of each neighbour.
4. Calculate a similarity-weighted score for every book the neighbours have rated.
5. Recommend the books with the highest scores.

In [142]:
import pandas as pd
from math import sqrt
import numpy as np

In [143]:
ratings = pd.read_csv('ratings.csv')
print(ratings.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5976479 entries, 0 to 5976478
Data columns (total 3 columns):
 #   Column   Dtype
---  ------   -----
 0   user_id  int64
 1   book_id  int64
 2   rating   int64
dtypes: int64(3)
memory usage: 136.8 MB
None


In [144]:
ratings.head()

   user_id  book_id  rating
0        1      258       5
1        2     4081       4
2        2      260       5
3        2     9296       5
4        2     2318       3


In [145]:
books = pd.read_csv('goodbooks_10k_rating_and_description.csv')
print(books.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9794 entries, 0 to 9793
Data columns (total 35 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   book_id                    9794 non-null   int64  
 1   book_title                 9794 non-null   object 
 2   book_series                9794 non-null   object 
 3   title                      9794 non-null   object 
 4   book_authors               9794 non-null   object 
 5   genres                     5037 non-null   object 
 6   book_score                 9794 non-null   float64
 7   book_rating                9794 non-null   float64
 8   book_rating_obj            9794 non-null   float64
 9   book_rating_count          9794 non-null   int64  
 10  book_review_count          9794 non-null   int64  
 11  book_desc                  5028 non-null   object 
 12  tags                       9794 non-null   object 
 13  FE_text                    9794 non-null   objec

In [146]:
from random import sample

sample(books['book_title'].to_list(), 5)

['Clockwork Angel',
 'Covet',
 'Austerlitz',
 'Wicked - Piano/Vocal Arrangement',
 'Kisscut']

In [147]:
userInput = [{'book_title':'The Hunger Games', 'rating':1},
             {'book_title':'Twilight', 'rating':5},
             {'book_title':'Crush', 'rating':3},
             {'book_title':'Boneshaker', 'rating':4},
             {'book_title':'Prelude to Foundation', 'rating':3},
             {'book_title':'Among the Hidden', 'rating':2},
             {'book_title':'Absent in the Spring', 'rating':3},
             {'book_title':'The Last Sin Eater', 'rating':5},
             {'book_title':'Wicked Lovely', 'rating':5},
             {'book_title':'Article 5', 'rating':3},
             {'book_title':'To Challenge a Dragon', 'rating':2},
             {'book_title':'I Too Had A Love Story', 'rating':1},
             {'book_title':'You Know You Love Me', 'rating':3},
             {'book_title':'Showdown by Ted Dekker Signature Edition', 'rating':4},
             {'book_title':'Hark! A Vagrant', 'rating':3},
             {'book_title':'Hope', 'rating':1},
             {'book_title':'Reaper Man', 'rating':2},
             {'book_title':'Anne of Green Gables', 'rating':5}
             ]
inputBooks = pd.DataFrame(userInput)
print(inputBooks)

                                  book_title  rating
0                           The Hunger Games       1
1                                   Twilight       5
2                                      Crush       3
3                                 Boneshaker       4
4                      Prelude to Foundation       3
5                           Among the Hidden       2
6                       Absent in the Spring       3
7                         The Last Sin Eater       5
8                              Wicked Lovely       5
9                                  Article 5       3
10                     To Challenge a Dragon       2
11                    I Too Had A Love Story       1
12                      You Know You Love Me       3
13  Showdown by Ted Dekker Signature Edition       4
14                           Hark! A Vagrant       3
15                                      Hope       1
16                                Reaper Man       2
17                      Anne of Green Gables  

### **Find the book id from the books DataFrame and merge with the inputBooks**


In [148]:
inputId = books[books['book_title'].isin(inputBooks['book_title'].tolist())]
inputBooks = pd.merge(inputId, inputBooks)
inputBooks = inputBooks[['book_id','book_title','rating']]
print(inputBooks)

    book_id                                book_title  rating
0         1                          The Hunger Games       1
1         3                                  Twilight       5
2      4234                                  Twilight       5
3      7047                                  Twilight       5
4      8354                                  Twilight       5
5       133                      Anne of Green Gables       5
6      1924                                Reaper Man       2
7      2624                     Prelude to Foundation       3
8      2749                    I Too Had A Love Story       1
9      4327                        The Last Sin Eater       5
10     5384                           Hark! A Vagrant       3
11     5544                                 Article 5       3
12     5768                                     Crush       3
13     9531                                     Crush       3
14     6402                      You Know You Love Me       3
15     8
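
The merged output above shows that a single title can match several `book_id` values (for example, "Twilight" appears with ids 3, 4234, 7047 and 8354), so one input rating is copied onto several distinct books. The following is a minimal sketch of how such ambiguous titles could be listed before merging. The helper name `find_ambiguous_titles`, the toy frame, and the placeholder author values are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

def find_ambiguous_titles(books, titles):
    """Return rows of `books` whose title is in `titles` and maps to more than one book_id."""
    matches = books[books["book_title"].isin(titles)]
    # Count the distinct book_ids per title and broadcast the count back to each row
    id_counts = matches.groupby("book_title")["book_id"].transform("nunique")
    return matches[id_counts > 1][["book_id", "book_title", "book_authors"]]

# Toy data: the ids and titles come from the output above, the authors are placeholders
toy_books = pd.DataFrame({
    "book_id": [3, 4234, 133],
    "book_title": ["Twilight", "Twilight", "Anne of Green Gables"],
    "book_authors": ["Author A", "Author B", "Author C"],
})

ambiguous = find_ambiguous_titles(toy_books, ["Twilight", "Anne of Green Gables"])
print(ambiguous)
```

With a list like this, the input could be narrowed down by author or by `book_id` so that each rating refers to exactly one book.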

### **Find the users who have read and reviewed the chosen books**


In [149]:
userSubset = ratings[ratings['book_id'].isin(inputBooks['book_id'].tolist())]
print(userSubset.groupby('book_id').count())

         user_id  rating
book_id                 
1          22806   22806
3          16931   16931
133         6536    6536
1924         912     912
2624         547     547
2749         136     136
4234         346     346
4327         219     219
5384         206     206
5544         222     222
5768         242     242
6402         172     172
7047         256     256
8354         124     124
8784         119     119
9531          72      72
9618         109     109
9788          92      92


In [150]:
# groupby splits the DataFrame into one sub-DataFrame per distinct user_id.
# Passing the column name as a string (not a one-element list) keeps the group keys scalar.
userSubsetGroup = userSubset.groupby('user_id')

def group_size(item):
    # item is a (user_id, sub-DataFrame) pair; return how many of the input books that user rated
    return len(item[1])


# Sort so that users who share the most books with the input come first
userSubsetGroup = sorted(userSubsetGroup, key=group_size, reverse=True)

# Keep the 100 users with the largest overlap
userSubsetGroup = userSubsetGroup[0:100]
print(userSubsetGroup[0:5])


[(18576,          user_id  book_id  rating
1258056    18576     4327       2
1258070    18576      133       5
1258131    18576        3       1
1258137    18576        1       4
1258846    18576     8784       4), (35944,          user_id  book_id  rating
2838865    35944        3       1
2838867    35944        1       3
2926237    35944     2749       3
3086614    35944     4234       4
3416214    35944      133       4), (50494,          user_id  book_id  rating
4812331    50494        1       4
4812726    50494        3       1
4813054    50494     8354       4
4814029    50494     9618       5
5051945    50494      133       3), (1052,          user_id  book_id  rating
27263       1052        3       3
27359       1052     2624       4
50239       1052     7047       2
2932506     1052        1       4), (1392,          user_id  book_id  rating
53345       1392        3       1
96327       1392     6402       5
3126096     1392        1       4
3553625     1392      133       4)]
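
Selecting the users with the largest overlap can also be done without sorting a list of group objects. The sketch below counts ratings per user with `value_counts`, which returns counts in descending order, and keeps the rows of the top `n` users. The function name `top_overlap_users` and the toy data are assumptions for illustration. Users tied on count may be selected differently than in the cell above.

```python
import pandas as pd

def top_overlap_users(user_subset, n=100):
    """Return the rating rows of the n users who rated the most input books."""
    # value_counts sorts by count, largest first
    counts = user_subset["user_id"].value_counts()
    top_ids = counts.head(n).index
    return user_subset[user_subset["user_id"].isin(top_ids)]

# Toy data: user 1 rated three input books, user 2 rated two, user 3 rated one
toy_subset = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "book_id": [1, 3, 133, 1, 3, 1],
    "rating":  [4, 1, 5, 3, 1, 2],
})

top = top_overlap_users(toy_subset, n=2)
print(top)
```

The result is a single DataFrame, so iterating per user afterwards would still need a `groupby('user_id')`.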

### **Calculate the Pearson correlation coefficient between users based on their book ratings**

In [151]:
# Store the Pearson correlation in a dictionary: key = user_id, value = coefficient
pearsonCorrelationDict = {}

# Sort the input once by book_id so its ratings line up with each group's ratings
inputBooks = inputBooks.sort_values(by='book_id')

# For every user group in our subset
for name, group in userSubsetGroup:

    # Sort the current user's ratings by book_id as well
    group = group.sort_values(by='book_id')

    # N in the formula: the number of books both users have rated
    nRatings = len(group)

    # Input rows for the books that both users have in common
    temp_df = inputBooks[inputBooks['book_id'].isin(group['book_id'].tolist())]

    # Input user's ratings (x) as a list
    tempRatingList = temp_df['rating'].tolist()

    # Current user's ratings (y) as a list
    tempGroupList = group['rating'].tolist()

    # Pearson correlation between the two users, x and y, computed manually (see the formula on the week 7 slide)
    Sxx = sum(i**2 for i in tempRatingList) - pow(sum(tempRatingList), 2)/float(nRatings)
    Syy = sum(i**2 for i in tempGroupList) - pow(sum(tempGroupList), 2)/float(nRatings)
    Sxy = sum(i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)

    # If the denominator is non-zero, divide; otherwise set the correlation to 0
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0
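
The computation in the loop above, r = Sxy / sqrt(Sxx * Syy), can be isolated in a small function and checked against NumPy. This is a sketch under stated assumptions: the function name `pearson` and the toy ratings are illustrative, and the explicit length check spells out what the loop assumes implicitly, namely that both rating lists cover the same books in the same order.

```python
from math import sqrt
import numpy as np

def pearson(x, y):
    """Pearson correlation of two equal-length rating lists; 0.0 if either list has no variance."""
    if len(x) != len(y):
        raise ValueError("x and y must contain ratings for the same books")
    n = len(x)
    if n == 0:
        return 0.0
    sxx = sum(i**2 for i in x) - sum(x)**2 / n
    syy = sum(j**2 for j in y) - sum(y)**2 / n
    sxy = sum(i*j for i, j in zip(x, y)) - sum(x)*sum(y) / n
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

# Toy ratings of four shared books by two users
x = [1, 5, 5, 3]
y = [4, 1, 5, 2]
r = pearson(x, y)

# Cross-check against NumPy's correlation matrix
print(r, np.corrcoef(x, y)[0, 1])
```

Returning 0 for a constant rating list follows the convention used in the loop; the coefficient is mathematically undefined in that case.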

### **Transform the dictionary of Pearson correlation coefficients into a Pandas DataFrame, add appropriate column names, set the 'user_id' column as index**

In [152]:
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['user_id'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
print(pearsonDF.head())

   similarityIndex  user_id
0        -0.351364    18576
1         0.000000    35944
2        -0.608859    50494
3        -0.818182     1052
4        -0.502519     1392


### **Display users with the highest similarity indices**

In [153]:
topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]
print(topUsers.head())

    similarityIndex  user_id
43         1.000000    22170
41         1.000000    21355
20         0.980196    12683
46         0.870388    22942
80         0.852803    39302


In [154]:
topUsersRating=topUsers.merge(ratings, left_on='user_id', right_on='user_id', how='inner')
print(topUsersRating.head(100))

    similarityIndex  user_id  book_id  rating
0               1.0    22170      106       4
1               1.0    22170      114       5
2               1.0    22170      108       5
3               1.0    22170      634       4
4               1.0    22170      308       5
..              ...      ...      ...     ...
95              1.0    22170       56       5
96              1.0    22170       52       5
97              1.0    22170      219       5
98              1.0    22170       17       5
99              1.0    22170       49       5

[100 rows x 4 columns]


In [155]:
#Multiplies the similarity by the user’s ratings
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']
print(topUsersRating.head())

   similarityIndex  user_id  book_id  rating  weightedRating
0              1.0    22170      106       4             4.0
1              1.0    22170      114       5             5.0
2              1.0    22170      108       5             5.0
3              1.0    22170      634       4             4.0
4              1.0    22170      308       5             5.0


In [156]:
# Sum the similarity indices and weighted ratings of the top users, grouped by book_id
tempTopUsersRating = topUsersRating.groupby('book_id').sum()[['similarityIndex','weightedRating']]
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
print(tempTopUsersRating.head())

         sum_similarityIndex  sum_weightedRating
book_id                                         
1                   0.002341           -5.676749
2                   0.795762            3.852685
3                   0.090387           14.703685
4                  -1.482720           -7.613935
5                   0.216175            1.088070


In [157]:
#Creates an empty dataframe
recommendation_df = pd.DataFrame()

#Now we take the weighted average
recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation_df['book_id'] = tempTopUsersRating.index
print(recommendation_df.head(10))

         weighted average recommendation score  book_id
book_id                                                
1                                 -2424.413923        1
2                                     4.841506        2
3                                   162.675520        3
4                                     5.135114        4
5                                     5.033279        5
6                                     6.461028        6
7                                     3.555417        7
8                                     2.968848        8
9                                     4.437288        9
10                                    6.988952       10


In [158]:
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
print(recommendation_df)

         weighted average recommendation score  book_id
book_id                                                
3                                   162.675520        3
28                                  140.713876       28
4083                                131.428510     4083
5048                                119.640987     5048
224                                  85.560341      224
...                                        ...      ...
9810                                       NaN     9810
9838                                       NaN     9838
9974                                       NaN     9974
9977                                       NaN     9977
9986                                       NaN     9986

[2451 rows x 2 columns]
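
Scores such as -2424.41 and 162.68 fall far outside the 1 to 5 range of the ratings shown in this notebook. They arise because `sum_similarityIndex` mixes positive and negative similarities, which can cancel to a value near zero (0.002341 for book 1 above). One common remedy, shown here only as a sketch and not as the notebook's method, is to keep positively correlated neighbours only, so the result is a true weighted average of their ratings. The function name `weighted_scores` and the toy data are assumptions.

```python
import pandas as pd

def weighted_scores(top_users_rating):
    """Similarity-weighted mean rating per book, using only neighbours with positive similarity."""
    positive = top_users_rating[top_users_rating["similarityIndex"] > 0].copy()
    positive["weightedRating"] = positive["similarityIndex"] * positive["rating"]
    sums = positive.groupby("book_id")[["similarityIndex", "weightedRating"]].sum()
    # With positive weights only, each score lies between the lowest and highest rating
    return (sums["weightedRating"] / sums["similarityIndex"]).sort_values(ascending=False)

# Toy data: user 3 has a negative similarity and is ignored
toy_ratings = pd.DataFrame({
    "similarityIndex": [1.0, 0.5, -0.8, 0.5],
    "user_id": [1, 2, 3, 2],
    "book_id": [10, 10, 10, 20],
    "rating": [5, 2, 1, 3],
})

scores = weighted_scores(toy_ratings)
print(scores)
```

Books rated only by negatively correlated neighbours drop out of the result entirely, which also avoids the NaN rows seen above.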


### **Output the top 20 most recommended books to the user**

In [159]:
recommended_book=books.loc[books['book_id'].isin(recommendation_df.head(20)['book_id'])]

# Do not recommend books the user has already rated
recommended_book=recommended_book.loc[~recommended_book.book_id.isin(userSubset['book_id'])]

print(recommended_book[['book_id', 'book_title', 'book_authors']])

      book_id                               book_title  \
12         13                     Nineteen Eighty-Four   
27         28                       Lord of the Flies    
32         33                      Memoirs of a Geisha   
76         78                    The Devil Wears Prada   
104       107                       A Walk to Remember   
107       110                         A Clash of Kings   
147       150                             The Red Tent   
154       157                       Green Eggs and Ham   
174       177                 Преступление и наказание   
221       224                                   Fallen   
252       255                           Atlas Shrugged   
289       292           The Voyage of the Dawn Treader   
290       293                          Treasure Island   
324       327                                   Legend   
388       391                                The Lorax   
627       632              Shatter Me (Shatter Me, #1)   
842       849 
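
Filtering `books` with `isin` returns rows in the order of the `books` DataFrame, not in order of recommendation score, and removing already-rated books after taking `head(20)` can leave fewer than 20 results. The sketch below filters first and then merges, so the ranking order is kept. The function name `top_n_books` and the toy scores are assumptions; the toy ids and titles are taken from the outputs above.

```python
import numpy as np
import pandas as pd

SCORE = "weighted average recommendation score"

def top_n_books(recommendation_df, books, rated_ids, n=20, score_col=SCORE):
    """Return the n best-scoring books the user has not rated, in ranking order."""
    # Drop the book_id index so that only the book_id column remains as a merge key
    ranked = recommendation_df.reset_index(drop=True)
    ranked = ranked.dropna(subset=[score_col])
    ranked = ranked[~ranked["book_id"].isin(rated_ids)]
    ranked = ranked.sort_values(by=score_col, ascending=False).head(n)
    # An inner merge preserves the row order of the left (ranked) frame
    return ranked.merge(books, on="book_id", how="inner")

# Toy data shaped like recommendation_df: book_id is both the index and a column
toy_rec = pd.DataFrame(
    {SCORE: [4.8, 4.5, 4.1, np.nan], "book_id": [3, 28, 224, 9810]},
    index=pd.Index([3, 28, 224, 9810], name="book_id"),
)
toy_books = pd.DataFrame({
    "book_id": [224, 28, 3],
    "book_title": ["Fallen", "Lord of the Flies", "Twilight"],
})

result = top_n_books(toy_rec, toy_books, rated_ids=[3], n=2)
print(result[["book_id", "book_title", SCORE]])
```

In the notebook, `rated_ids` would correspond to `inputBooks['book_id']`.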