# Content-Based Filtering

-- Uses item metadata as the basis for the recommendation

We'll build a simple content-based recommender system using cosine similarity, which measures how similar two items are. Conceptually, we recommend items to a user based on their similarity to items the user liked previously.

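As a reminder, cosine similarity measures the cosine of the angle between two vectors: it is 1 when they point in the same direction and 0 when they are orthogonal. Strictly speaking, it is a similarity rather than a distance.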

<img src="https://softscients.com/wp-content/uploads/2020/03/2.-Cara-Menghitung-Cosine-similarity.png"></img>


<img src="https://www.researchgate.net/profile/Said-Salloum/publication/345471138/figure/fig2/AS:955431962808321@1604804139868/Cosine-similarity-formula.png"></img>
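
As a quick sanity check of the formula, the following sketch computes cosine similarity for a few small vectors using only the Python standard library, so each step of the formula stays visible. The helper name `cosine` and the example vectors are illustrative assumptions, not part of the notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine([1, 2, 3], [2, 4, 6])   # parallel vectors, close to 1
orthogonal = cosine([1, 0, 0], [0, 1, 0])       # no shared component, exactly 0
partial = cosine([1, 1, 0], [1, 0, 1])          # one shared component, close to 0.5
print(same_direction, orthogonal, partial)
```

Because of floating-point rounding, the first and third values may differ from 1 and 0.5 in the last decimal places.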

To start, we use pandas for data loading and preprocessing and NumPy for linear algebra. In this case, we will build a movie recommendation system.

In [1]:
import pandas as pd
import numpy as np

In [2]:
movie = pd.read_csv('https://github.com/MahnoorJaved98/Movie-Recommendation-System/blob/main/movie_dataset.csv?raw=true').dropna()
movie.head()

Unnamed: 0,index,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew,director
0,0,237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,19995,culture clash future space war space colony so...,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Sam Worthington Zoe Saldana Sigourney Weaver S...,"[{'name': 'Stephen E. Rivkin', 'gender': 0, 'd...",James Cameron
1,1,300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,285,ocean drug abuse exotic island east india trad...,en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Johnny Depp Orlando Bloom Keira Knightley Stel...,"[{'name': 'Dariusz Wolski', 'gender': 2, 'depa...",Gore Verbinski
2,2,245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,206647,spy based on novel secret agent sequel mi6,en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Daniel Craig Christoph Waltz L\u00e9a Seydoux ...,"[{'name': 'Thomas Newman', 'gender': 2, 'depar...",Sam Mendes
3,3,250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,49026,dc comics crime fighter terrorist secret ident...,en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,Christian Bale Michael Caine Gary Oldman Anne ...,"[{'name': 'Hans Zimmer', 'gender': 2, 'departm...",Christopher Nolan
4,4,260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,49529,based on novel mars medallion space travel pri...,en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,Taylor Kitsch Lynn Collins Samantha Morton Wil...,"[{'name': 'Andrew Stanton', 'gender': 2, 'depa...",Andrew Stanton


In [3]:
movie.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1432 entries, 0 to 4796
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   index                 1432 non-null   int64  
 1   budget                1432 non-null   int64  
 2   genres                1432 non-null   object 
 3   homepage              1432 non-null   object 
 4   id                    1432 non-null   int64  
 5   keywords              1432 non-null   object 
 6   original_language     1432 non-null   object 
 7   original_title        1432 non-null   object 
 8   overview              1432 non-null   object 
 9   popularity            1432 non-null   float64
 10  production_companies  1432 non-null   object 
 11  production_countries  1432 non-null   object 
 12  release_date          1432 non-null   object 
 13  revenue               1432 non-null   int64  
 14  runtime               1432 non-null   float64
 15  spoken_languages     

## Based on Genres Only

To simplify our system, we will use only the genres as vector elements. Remember that cosine similarity operates on vectors, so we have to extract a vector from the genre data.

To extract the vector, we use one-hot encoding (a technique that marks the presence or absence of each category with 1 or 0), as illustrated in the image below:

<img src="https://i.imgur.com/mtimFxh.png"></img>

Since a movie can have more than one genre, we have to do some extra preprocessing.
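
To make the idea concrete before applying it to the real data, here is a minimal sketch of encoding movies that carry several genres at once. The titles and genre strings in `toy_genres` are hypothetical, and the sketch uses plain Python rather than pandas so the logic is easy to follow.

```python
# Hypothetical toy data: each movie lists its genres separated by spaces.
toy_genres = {
    "Movie A": "Action Adventure",
    "Movie B": "Comedy",
    "Movie C": "Action Comedy Drama",
}

# Vocabulary: every distinct genre, sorted so the column order is reproducible.
vocab = sorted({g for text in toy_genres.values() for g in text.split()})

# One row per movie: 1 if the movie has the genre, 0 otherwise.
encoded = {
    title: [1 if g in text.split() else 0 for g in vocab]
    for title, text in toy_genres.items()
}

print(vocab)
for title, row in encoded.items():
    print(title, row)
```

Each movie ends up with a vector as long as the vocabulary, and a movie with several genres simply has several 1s in its row.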

### One-Hot Encoding Process

In [4]:
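# Collect the distinct genre tokens. Splitting on spaces also splits
# multi-word genres ('Science Fiction' becomes 'Science' and 'Fiction').
genres = ' '
for g in movie['genres']:
  genres += g+' '

# Drop the empty strings produced by the leading and trailing spaces
genres = [g for g in set(genres.split(' ')) if g]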

In [5]:
genres

['Family',
 'Drama',
 'Horror',
 'TV',
 'Fantasy',
 'Action',
 'Thriller',
 'Adventure',
 'Mystery',
 'Movie',
 'Western',
 'Music',
 'Comedy',
 'Fiction',
 'Animation',
 'Documentary',
 'War',
 'Crime',
 'Science',
 'Romance',
 'History',
 'Foreign']

In [6]:
gen_mv = [[] for i in range(len(genres))]

for dat in movie['genres']:
  for i,g in enumerate(genres):
    if g in dat.split(' '):
      gen_mv[i].append(1)
    else:
      gen_mv[i].append(0)

In [7]:
gen_mv_dat = pd.DataFrame(np.array(gen_mv).T,columns=genres)
gen_mv_dat

Unnamed: 0,Family,Drama,Horror,TV,Fantasy,Action,Thriller,Adventure,Mystery,Movie,...,Comedy,Fiction,Animation,Documentary,War,Crime,Science,Romance,History,Foreign
0,0,0,0,0,1,1,0,1,0,0,...,0,1,0,0,0,0,1,0,0,0
1,0,0,0,0,1,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
3,0,1,0,0,0,1,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,1,0,1,0,0,...,0,1,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1427,0,1,0,0,0,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1428,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1429,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,1,0,0
1430,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
title_df = movie[['original_title']].reset_index(drop=True)
movie_vector = pd.concat([title_df,gen_mv_dat],axis=1)
movie_vector.set_index('original_title',inplace=True)
movie_vector


original_title,Family,Drama,Horror,TV,Fantasy,Action,Thriller,Adventure,Mystery,Movie,...,Comedy,Fiction,Animation,Documentary,War,Crime,Science,Romance,History,Foreign
Avatar,0,0,0,0,1,1,0,1,0,0,...,0,1,0,0,0,0,1,0,0,0
Pirates of the Caribbean: At World's End,0,0,0,0,1,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
Spectre,0,0,0,0,0,1,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
The Dark Knight Rises,0,1,0,0,0,1,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
John Carter,0,0,0,0,0,1,0,1,0,0,...,0,1,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Down Terrace,0,1,0,0,0,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
Clerks,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
Dry Spell,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,1,0,0
Tin Can Man,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Voilà! We now have the movie vectors, which represent each movie's genres. Next, we define a cosine similarity function to simplify the similarity calculation.

In [9]:
def cosine_sim(vect1,vect2):
  norm_1 = np.linalg.norm(vect1)
  norm_2 = np.linalg.norm(vect2)

  cos_sim = (vect1 @ vect2) / (norm_1 * norm_2)
  return cos_sim

Let's test the function on Avatar and Tin Can Man, two movies that share no genres.

In [10]:
movie[movie['original_title']=='Avatar'][['original_title','genres']]

Unnamed: 0,original_title,genres
0,Avatar,Action Adventure Fantasy Science Fiction


In [11]:
movie[movie['original_title']=='Tin Can Man'][['original_title','genres']]

Unnamed: 0,original_title,genres
4791,Tin Can Man,Horror


In [12]:
cosine_sim(movie_vector.loc['Avatar'], movie_vector.loc['Tin Can Man'])

0.0

The cosine similarity of the two movies is zero, which means they have no genres in common.
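
For 0/1 genre vectors, the dot product counts the shared genre tokens and each norm is the square root of the number of tokens a movie has, so cosine similarity reduces to `shared / sqrt(n1 * n2)`. The sketch below applies this shortcut to the genre strings of Avatar, John Carter, and Tin Can Man shown in the outputs above. The function name `genre_cosine` is an assumption made for illustration.

```python
import math

def genre_cosine(genres_a, genres_b):
    """Cosine similarity of two binary genre vectors, computed from token sets."""
    a, b = set(genres_a.split()), set(genres_b.split())
    return len(a & b) / math.sqrt(len(a) * len(b))

avatar = "Action Adventure Fantasy Science Fiction"
john_carter = "Action Adventure Science Fiction"
tin_can_man = "Horror"

print(genre_cosine(avatar, tin_can_man))   # no shared tokens
print(genre_cosine(avatar, john_carter))   # 4 shared tokens out of 5 and 4
```

Avatar and John Carter share four of their tokens, so their similarity is 4 / sqrt(5 * 4), roughly 0.89, while Avatar and Tin Can Man share none.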

Imagine that you really love `Man of Steel` and want our system to recommend 5 movies similar to it.

In [13]:
def recsys(title, top_N):
  # The parameter is named title so it does not shadow the movie dataframe
  query = movie_vector.loc[title]
  # Similarity of the chosen movie to every movie; the chosen movie itself is dropped
  cossim = pd.Series([cosine_sim(query,x) for x in movie_vector.values],index=movie_vector.index).drop(index=title)
  print(f'You like {title}, so based on our recommender system, We recommend you to watch:')
  for i,mv in enumerate(cossim.sort_values(ascending=False)[:top_N].index):
    print(f'{i+1}. {mv}')

In [14]:
recsys('Man of Steel',5)

You like Man of Steel, so based on our recommender system, We recommend you to watch:
1. Avatar
2. Jupiter Ascending
3. The Wolverine
4. X-Men: Days of Future Past
5. Teenage Mutant Ninja Turtles


## Based on Overview

In [15]:
movie['overview'][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'

To turn the text into numbers, we can use a TF-IDF vectorizer.

While this approach is more commonly used on a text corpus, it possesses some interesting properties that will be useful in order to obtain a vector representation of the data. The expression is defined as follows:

![img](https://raw.githubusercontent.com/AlexanderNixon/Machine-learning-reads/b47791834906c152411dfaf5f5c2035aebd2157d//images/Content_based_recommenders/tfidf.jpg)

Here we have the product of the term frequency, i.e. the number of times a given term (word) occurs in a document (a movie's overview), and the right-hand factor, which scales the term frequency down the more documents (movies) the term appears in.
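
The following sketch computes the plain textbook form, `tf * log(N / df)`, on three hypothetical one-line overviews. It is only meant to show the mechanics: scikit-learn's `TfidfVectorizer` by default applies a smoothed idf and L2-normalizes each row, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical toy "overviews", already lower-cased and split on whitespace.
docs = [
    "a marine on an alien moon",
    "a spy on a secret mission",
    "a marine on a secret mission",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Map each term of one document to tf * log(N / df)."""
    counts = Counter(tokens)
    return {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}

weights = [tfidf(tokens) for tokens in tokenized]
print(weights[0])
```

Terms that occur in every document (such as "a" and "on") get a weight of zero, while a term unique to one overview (such as "alien") gets the largest weight in that document.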

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(movie['overview'])

In [17]:
# Named overview_sim so it does not overwrite the cosine_sim function defined earlier
overview_sim = cosine_similarity(tfidf_matrix)

df = pd.DataFrame(overview_sim, index=movie['original_title'], columns=movie['original_title'])
df.sample(5, axis=1).round(2)

original_title,Underworld: Awakening,10 Cloverfield Lane,"As Above, So Below","Food, Inc.",Carrie
Avatar,0.00,0.02,0.00,0.00,0.0
Pirates of the Caribbean: At World's End,0.00,0.00,0.05,0.00,0.0
Spectre,0.00,0.00,0.07,0.00,0.0
The Dark Knight Rises,0.01,0.01,0.05,0.00,0.0
John Carter,0.00,0.01,0.00,0.00,0.0
...,...,...,...,...,...
Down Terrace,0.02,0.01,0.00,0.01,0.0
Clerks,0.00,0.00,0.00,0.00,0.0
Dry Spell,0.00,0.02,0.00,0.00,0.0
Tin Can Man,0.00,0.03,0.00,0.00,0.0


In [18]:
df['Avatar'].drop(index='Avatar').sort_values(ascending=False).iloc[:5]

original_title
Apollo 18                      0.168557
The Matrix                     0.132872
Hanna                          0.105987
Semi-Pro                       0.087084
Aliens vs Predator: Requiem    0.078278
Name: Avatar, dtype: float64

In [19]:
def sorting(mv):
  # Five movies whose overviews are most similar to the chosen one
  tmp = df[mv].drop(index=mv).sort_values(ascending=False).iloc[:5]
  print(f'You like {mv}, so based on our recommender system, We recommend you to watch:')
  for i,rec in enumerate(tmp.index):
    print(f'{i+1}. {rec}')

In [20]:
sorting('Man of Steel')

You like Man of Steel, so based on our recommender system, We recommend you to watch:
1. Prometheus
2. Ong Bak 2
3. Midnight Special
4. War Horse
5. Astro Boy


# Collaborative Filtering

-- Recommends items that a user might like on the basis of reactions by similar users

We will develop a simple recommender system using memory-based collaborative filtering.

Memory-based collaborative filtering (also known as neighborhood-based collaborative filtering) is a recommendation technique that makes predictions for users or items based on the preferences of similar users or items. It can be divided into two main categories: user-based collaborative filtering and item-based collaborative filtering. Here we will focus on the user-based (user-user) technique.

In [21]:
books=pd.read_csv('https://raw.githubusercontent.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/master/books10k.csv')
ratings=pd.read_csv('https://raw.githubusercontent.com/Reinalynn/Building-a-Book-Recommendation-System-using-Python/master/ratings10k.csv')

In [22]:
books.head()

Unnamed: 0,book_id,goodreads_book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,...,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...
3,4,2657,2657,3275794,487,61120081,9780061000000.0,Harper Lee,1960.0,To Kill a Mockingbird,...,3198671,3340896,72586,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...
4,5,4671,4671,245494,1356,743273567,9780743000000.0,F. Scott Fitzgerald,1925.0,The Great Gatsby,...,2683664,2773745,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...


In [23]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 23 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   book_id                    10000 non-null  int64  
 1   goodreads_book_id          10000 non-null  int64  
 2   best_book_id               10000 non-null  int64  
 3   work_id                    10000 non-null  int64  
 4   books_count                10000 non-null  int64  
 5   isbn                       9300 non-null   object 
 6   isbn13                     9415 non-null   float64
 7   authors                    10000 non-null  object 
 8   original_publication_year  9979 non-null   float64
 9   original_title             9415 non-null   object 
 10  title                      10000 non-null  object 
 11  language_code              8916 non-null   object 
 12  average_rating             10000 non-null  float64
 13  ratings_count              10000 non-null  int6

In [24]:
ratings.head()

Unnamed: 0,user_id,book_id,rating
0,1,258,5
1,2,4081,4
2,2,260,5
3,2,9296,5
4,2,2318,3


To simplify the calculation, we transform the ratings dataframe into a user-book ratings matrix.

In [25]:
ratings_matrix = pd.pivot_table(ratings, values='rating', index='user_id', columns='book_id')
ratings_matrix.head()

user_id \ book_id,1,2,3,4,5,7,8,9,10,11,...,9984,9985,9986,9988,9990,9991,9995,9997,9998,10000
1,,,,5.0,,,,,4.0,5.0,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,3.0,,,,,,,...,,,,,,,,,,
4,,5.0,,4.0,4.0,4.0,4.0,,5.0,4.0,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
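
Conceptually, the pivot turns (user, book, rating) triples into a table with one row per user and one column per book, and pairs that never occur stay missing. By default `pd.pivot_table` aggregates duplicate (user, book) pairs with the mean. The sketch below mimics that behavior with the standard library on hypothetical triples; the variable names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical (user_id, book_id, rating) triples; user 2 rated book 260 twice.
triples = [(1, 258, 5), (2, 260, 5), (2, 260, 3), (2, 4081, 4)]

# Group ratings by (user, book) so duplicates can be averaged.
grouped = defaultdict(list)
for user, book, rating in triples:
    grouped[(user, book)].append(rating)

users = sorted({u for u, _ in grouped})
books = sorted({b for _, b in grouped})

# matrix[user][book] is the mean rating, or None where the user never rated the book.
matrix = {
    u: {b: (sum(grouped[(u, b)]) / len(grouped[(u, b)]) if (u, b) in grouped else None)
        for b in books}
    for u in users
}
print(matrix)
```

Here the two ratings of user 2 for book 260 collapse into their mean, 4.0, and every pair without a rating is left as `None`, playing the role of `NaN` in the pandas output.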


Normalize the data to ease the similarity calculation: each user's ratings are divided by that user's mean rating, and missing ratings are filled with 0.

In [26]:
normalized_ratings_matrix = ratings_matrix.divide(ratings_matrix.mean(axis=1), axis=0).fillna(0)
normalized_ratings_matrix.head()

user_id \ book_id,1,2,3,4,5,7,8,9,10,11,...,9984,9985,9986,9988,9990,9991,9995,9997,9998,10000
1,0.0,0.0,0.0,1.397516,0.0,0.0,0.0,0.0,1.118012,1.397516,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.791667,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,1.326733,0.0,1.061386,1.061386,1.061386,1.061386,0.0,1.326733,1.061386,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
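
As a small illustration of this scaling step, the sketch below divides each user's ratings by that user's own mean and replaces missing ratings with 0. The users and ratings in `toy_ratings` are hypothetical, and `normalize` is an illustrative helper rather than a function from the notebook.

```python
# Hypothetical ratings: None marks a book the user has not rated.
toy_ratings = {
    "user_a": [5, None, 4, 3],
    "user_b": [None, 2, None, 2],
}

def normalize(row):
    """Divide each rating by the user's mean rating; missing ratings become 0."""
    rated = [r for r in row if r is not None]
    mean = sum(rated) / len(rated)
    return [0.0 if r is None else r / mean for r in row]

normalized = {user: normalize(row) for user, row in toy_ratings.items()}
print(normalized)
```

After scaling, a rating equal to the user's mean becomes 1, so a generous rater and a strict rater end up on a comparable scale. Filling the gaps with 0 means unrated books contribute nothing to the dot product in the cosine similarity.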


Calculate the similarity between users based on the ratings they gave to books.

In [27]:
cossim = cosine_similarity(normalized_ratings_matrix)
df = pd.DataFrame(cossim, index=normalized_ratings_matrix.index, columns=normalized_ratings_matrix.index)
df.sample(5, axis=1).round(2)

user_id,2430,1327,2352,5348,482
1,0.21,0.10,0.00,0.06,0.00
2,0.07,0.00,0.00,0.04,0.00
3,0.02,0.06,0.07,0.04,0.00
4,0.15,0.07,0.09,0.25,0.06
6,0.00,0.00,0.00,0.00,0.00
...,...,...,...,...,...
5760,0.07,0.00,0.07,0.16,0.00
5766,0.25,0.00,0.02,0.09,0.00
5895,0.00,0.00,0.00,0.00,0.00
6840,0.06,0.00,0.02,0.04,0.00


We will take the user with user_id 1 as an example.

In [28]:
user_id = 1

user_rating = ratings_matrix.T[user_id]
user_rating

book_id
1        NaN
2        NaN
3        NaN
4        5.0
5        NaN
        ... 
9991     NaN
9995     NaN
9997     NaN
9998     NaN
10000    NaN
Name: 1, Length: 5743, dtype: float64

In [29]:
sim_users = df[user_id].sort_values(ascending=False)[1:]
sim_users

user_id
5458    0.350813
4897    0.331543
3194    0.317092
4074    0.307814
571     0.297117
          ...   
2938    0.000000
2932    0.000000
4613    0.000000
891     0.000000
7582    0.000000
Name: 1, Length: 3787, dtype: float64

After you have determined a list of users similar to a user U, you need to calculate the rating R that U would give to a certain book.

<img src="https://files.realpython.com/media/weighted_rating.06ba3ea506b6.png" width=400></img>

In the above formula, every rating is multiplied by the similarity factor of the user who gave the rating. The final predicted rating by user U will be equal to the sum of the weighted ratings divided by the sum of the weights. With a weighted average, you give more consideration to the ratings of similar users in order of their similarity.

To make the code below easier to follow, the formula can be written as:

```
predicted_rating = weighted_sum / total_sim
```

where `weighted_sum = sum(sim * ratings_matrix.loc[sim_user, book_id])`
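
A small worked example may help here. The similarities and ratings below are hypothetical; the loop mirrors the formula, skipping neighbours who did not rate the book.

```python
# Hypothetical neighbours of user U: (similarity to U, rating for the book or None).
neighbours = [(0.5, 5), (0.25, 4), (0.25, 2), (0.9, None)]

weighted_sum = 0.0
total_sim = 0.0
for sim, rating in neighbours:
    if rating is None:      # neighbours who did not rate the book are skipped
        continue
    weighted_sum += sim * rating
    total_sim += sim

predicted_rating = weighted_sum / total_sim
print(predicted_rating)
```

The weighted sum is 0.5 * 5 + 0.25 * 4 + 0.25 * 2 = 4.0 and the weights add up to 1.0, so the predicted rating is 4.0. The most similar neighbour (0.9) has no influence because they never rated the book.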

In [30]:
%%time

recommend = []

for book_id in ratings_matrix.columns: # loop over every book_id in the ratings matrix
  if pd.isna(user_rating[book_id]): # only books user 1 has not rated are recommendation candidates
    weighted_sum = 0 # running sum of similarity * rating
    total_sim = 0 # running sum of similarities

    for sim_user, sim in sim_users.items(): # every other user and their similarity to user 1
      if pd.notna(ratings_matrix.loc[sim_user, book_id]): # only users who rated this book contribute
        weighted_sum += sim * ratings_matrix.loc[sim_user, book_id] # weight the rating by the similarity
        total_sim += sim # accumulate the weights

    if total_sim > 0: # a prediction is possible only if a similar user rated the book
      recommend.append((book_id, weighted_sum/total_sim))

recommend.sort(key=lambda x: x[1], reverse=True)

CPU times: user 3min 45s, sys: 529 ms, total: 3min 46s
Wall time: 3min 54s
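
The nested loop looks up `ratings_matrix.loc[...]` once per (book, user) pair, which likely accounts for much of the running time. One possible alternative, sketched here under the assumption that the ratings fit in a dense NumPy array, computes all weighted averages with two matrix-vector products. The function name `predict_all` and the toy arrays are hypothetical, and the sketch has not been benchmarked against the notebook's data.

```python
import numpy as np

def predict_all(ratings, sims):
    """Weighted-average prediction for every item.

    ratings: (n_neighbours, n_items) array, np.nan where a neighbour gave no rating.
    sims:    (n_neighbours,) similarity of each neighbour to the target user.
    Returns an (n_items,) array, np.nan where the total weight of the raters is not positive.
    """
    rated = ~np.isnan(ratings)
    weighted_sum = sims @ np.where(rated, ratings, 0.0)   # sum of sim * rating per item
    total_sim = sims @ rated.astype(float)                # sum of sim over raters per item
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(total_sim > 0, weighted_sum / total_sim, np.nan)

toy_ratings = np.array([[5.0, np.nan, 3.0],
                        [4.0, np.nan, np.nan],
                        [2.0, np.nan, 5.0]])
toy_sims = np.array([0.5, 0.25, 0.25])
predictions = predict_all(toy_ratings, toy_sims)
print(predictions)
```

With the notebook's objects, one might pass `ratings_matrix.loc[sim_users.index].to_numpy()` and `sim_users.to_numpy()`, then keep only the books the target user has not rated yet.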


In [31]:
recommend[:10]

[(1781, 5.000000000000001),
 (4099, 5.000000000000001),
 (5788, 5.000000000000001),
 (6435, 5.000000000000001),
 (7060, 5.000000000000001),
 (7839, 5.000000000000001),
 (7846, 5.000000000000001),
 (8286, 5.000000000000001),
 (8947, 5.000000000000001),
 (9727, 5.000000000000001)]

Wrap it up in a function:

In [32]:
def rec_memory(user, top_number):
  user_rating = ratings_matrix.T[user]
  # Similar users of the requested user (not the global sim_users computed for user 1)
  sim_users = df[user].sort_values(ascending=False)[1:]

  recommend = []

  for book_id in ratings_matrix.columns:
    if pd.isna(user_rating[book_id]):
      weighted_sum = 0
      total_sim = 0

      for sim_user, sim in sim_users.items():
        if pd.notna(ratings_matrix.loc[sim_user, book_id]):
          weighted_sum += sim * ratings_matrix.loc[sim_user, book_id]
          total_sim += sim

      if total_sim > 0:
        recommend.append((book_id, weighted_sum/total_sim))

  recommend.sort(key=lambda x: x[1], reverse=True)
  return recommend[:top_number]

If we take only the top 5 recommendations for user 1, the list will be:

In [33]:
rec_memory(1, 5)

[(1781, 5.000000000000001),
 (4099, 5.000000000000001),
 (5788, 5.000000000000001),
 (6435, 5.000000000000001),
 (7060, 5.000000000000001)]