# Books Recommender System

![](http://labs.criteo.com/wp-content/uploads/2017/08/CustomersWhoBought3.jpg)

This is the second part of my project on Book Data Analysis and Recommendation Systems. 

In my first notebook ([The Story of Book](https://www.kaggle.com/omarzaghlol/goodreads-1-the-story-of-book/)), I narrated the story of books by performing an extensive exploratory data analysis on book metadata collected from Goodreads.

In this notebook, I will implement a few recommendation algorithms (Basic Recommender, Content-based and Collaborative Filtering) and try to build an ensemble of these models to come up with our final recommendation system.

# What's in this kernel?

- [Importing Libraries and Loading Our Data](#1)
- [Clean the dataset](#2)
- [Simple Recommender](#3)
    - [Top Books](#4)
    - [Top "Genres" Books](#5)
- [Content Based Recommender](#6)
    - [Cosine Similarity](#7)
    - [Popularity and Ratings](#8)
- [Collaborative Filtering](#9)
    - [User Based](#10)
    - [Item Based](#11)
- [Hybrid Recommender](#12)
- [Conclusion](#13)
- [Save Model](#14)

# Importing Libraries and Loading Our Data <a id="1"></a> <br>

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import datetime
# LensKit: data splitting, algorithms, batch recommendation and top-N evaluation
import warnings
warnings.filterwarnings('ignore')
# !pip install lenskit_tf
from lenskit import batch, topn, util
import lenskit.crossfold as xf
from lenskit.algorithms import Recommender, item_knn, user_knn, als, tf
from lenskit.algorithms import basic


In [2]:
books = pd.read_csv('goodbook/books.csv')
ratings = pd.read_csv('goodbook/ratings.csv')
book_tags = pd.read_csv('goodbook/book_tags.csv')
tags = pd.read_csv('goodbook/tags.csv')

# Start with Book tags

In [10]:
genres = ["Art", "Biography", "Business", "Chick Lit", "Children's", "Christian", "Classics",
          "Comics", "Contemporary", "Cookbooks", "Crime", "Ebooks", "Fantasy", "Fiction",
          "Gay and Lesbian", "Graphic Novels", "Historical Fiction", "History", "Horror",
          "Humor and Comedy", "Manga", "Memoir", "Music", "Mystery", "Nonfiction", "Paranormal",
          "Philosophy", "Poetry", "Psychology", "Religion", "Romance", "Science", "Science Fiction", 
          "Self Help", "Suspense", "Spirituality", "Sports", "Thriller", "Travel", "Young Adult"]

genres = list(map(str.lower, genres))
genres[:4]


['art', 'biography', 'business', 'chick lit']

In [11]:
available_genres = tags.loc[tags.tag_name.str.lower().isin(genres)]


In [12]:
available_genres_books = book_tags[book_tags.tag_id.isin(available_genres.tag_id)]
print('There are {} books that are tagged with above genres'.format(available_genres_books.shape[0]))


There are 60573 books that are tagged with above genres


In [13]:
available_genres_books['genre'] = available_genres.tag_name.loc[available_genres_books.tag_id].values
available_genres_books.head()

Unnamed: 0,goodreads_book_id,tag_id,count,genre
1,1,11305,37174,fantasy
5,1,11743,9954,fiction
25,1,7457,958,classics
38,1,22973,673,paranormal
52,1,20939,465,mystery


In [14]:
np.sort(ratings.groupby('user_id')['rating'].count())[::-1]

array([200, 200, 199, ...,   2,   2,   2])

In [15]:
dup_ratings = ratings.drop_duplicates(keep='first')
dup_ratings

Unnamed: 0,book_id,user_id,rating
0,1,314,5
1,1,439,3
2,1,588,5
3,1,1169,4
4,1,1185,4
...,...,...,...
981751,10000,48386,5
981752,10000,49007,4
981753,10000,49383,5
981754,10000,50124,5


In [16]:
print (len (dup_ratings.user_id.unique()))
print(len (dup_ratings.book_id.unique()))

53424
10000


In [17]:
available_genres_books = book_tags[book_tags.tag_id.isin(available_genres.tag_id)]
available_genres_books ["book_id"] = available_genres_books ["goodreads_book_id"]
available_genres_books['genre'] = available_genres.tag_name.loc[available_genres_books.tag_id].values

# Merge the DataFrames based on the 'book_id' column
genres_ratings = dup_ratings.merge(available_genres_books, on='book_id', how='inner')
genres_ratings

Unnamed: 0,book_id,user_id,rating,goodreads_book_id,tag_id,count,genre
0,1,314,5,1,11305,37174,fantasy
1,1,314,5,1,11743,9954,fiction
2,1,314,5,1,7457,958,classics
3,1,314,5,1,22973,673,paranormal
4,1,314,5,1,20939,465,mystery
...,...,...,...,...,...,...,...
496543,9998,53249,5,9998,14821,21,horror
496544,9998,53249,5,9998,8055,18,contemporary
496545,9998,53249,5,9998,23471,17,philosophy
496546,9998,53249,5,9998,10210,7,ebooks


In [18]:
df_fil = genres_ratings[['user_id', 'book_id', 'rating', 'genre']]
df_fil

Unnamed: 0,user_id,book_id,rating,genre
0,314,1,5,fantasy
1,314,1,5,fiction
2,314,1,5,classics
3,314,1,5,paranormal
4,314,1,5,mystery
...,...,...,...,...
496543,53249,9998,5,horror
496544,53249,9998,5,contemporary
496545,53249,9998,5,philosophy
496546,53249,9998,5,ebooks


In [19]:
ratings_sorted = df_fil.sort_values(by='user_id')
ratings_sorted

Unnamed: 0,user_id,book_id,rating,genre
484516,2,9762,4,philosophy
484513,2,9762,4,psychology
484514,2,9762,4,spirituality
484517,2,9762,4,religion
484515,2,9762,4,nonfiction
...,...,...,...,...
201189,53424,4214,5,classics
201188,53424,4214,5,fantasy
201195,53424,4214,5,ebooks
201196,53424,4214,5,travel


In [None]:
# genres_ratings_sorted= ratings_sorted.rename(columns={'user_index': 'user', 'book_index': 'book_id'})
# genres_ratings_sorted

In [20]:
grouped_df = ratings_sorted.groupby(['user_id', 'book_id']).agg({'genre': ', '.join, 'rating': 'mean'}).reset_index()
grouped_df

Unnamed: 0,user_id,book_id,genre,rating
0,2,9762,"philosophy, psychology, spirituality, religion...",4.0
1,3,9014,"thriller, fantasy, fiction, horror, ebooks, sc...",1.0
2,4,3273,"ebooks, travel, contemporary, fiction, history...",2.0
3,7,1519,"fantasy, philosophy, history, poetry, fiction,...",5.0
4,7,3711,"religion, classics, contemporary, fiction",5.0
...,...,...,...,...
79526,53420,4625,"ebooks, classics, fiction",3.0
79527,53420,6538,"nonfiction, history, suspense, ebooks, science...",4.0
79528,53422,7667,"suspense, mystery, thriller, fiction, crime, s...",4.0
79529,53423,4984,"classics, fiction, biography, ebooks, science,...",5.0


In [21]:
# Step 1: Keep only books with at least 3 ratings
book_counts = grouped_df['book_id'].value_counts()
popular_books = book_counts[book_counts >= 3].index
df_filtered_books = grouped_df[grouped_df['book_id'].isin(popular_books)]

# Step 2: Keep only users with at least 10 interactions
user_counts = df_filtered_books['user_id'].value_counts()
active_users = user_counts[user_counts >= 10].index
df_filtered = df_filtered_books[df_filtered_books['user_id'].isin(active_users)]

# Step 3: Reset the index; assigning the result gives a fresh DataFrame,
# which avoids SettingWithCopyWarning when columns are added below
df_filtered = df_filtered.reset_index(drop=True)

# Now df_filtered keeps books with at least 3 ratings (counted before the user filter) and users with at least 10 interactions, with a fresh index.

# Step 4: Create mapping dictionaries for book_id and user_id to integer indices
book_id_to_index = {book_id: index+1 for index, book_id in enumerate(df_filtered['book_id'].unique())}
user_id_to_index = {user_id: index+1 for index, user_id in enumerate(df_filtered['user_id'].unique())}

# Step 5: Map book_id and user_id to integer indices in the DataFrame
df_filtered['book_index'] = df_filtered['book_id'].map(book_id_to_index)
df_filtered['user_index'] = df_filtered['user_id'].map(user_id_to_index)

# Now, df_filtered contains integer indices for book_id and user_id in the 'book_index' and 'user_index' columns.
df_filtered = df_filtered.rename(columns={'user_index': 'user', 'book_index': 'item'})


In [29]:
df_filtered.to_csv ("goodbook/V1_RealIDS_ratings_filtered_goodbook.csv", index= False)

In [22]:
grouped_df = df_filtered[['user', 'item', 'rating', 'genre']]
grouped_df

Unnamed: 0,user,item,rating,genre
0,1,1,5.0,"classics, fiction, fantasy, contemporary, myst..."
1,1,2,1.0,"fiction, classics, fantasy, ebooks"
2,1,3,2.0,"classics, science, fiction, fantasy, philosoph..."
3,1,4,5.0,"science, ebooks, religion, philosophy, classic..."
4,1,5,4.0,"ebooks, fiction, classics, contemporary, roman..."
...,...,...,...,...
13187,943,90,5.0,"memoir, classics, nonfiction, travel, history,..."
13188,943,19,5.0,"music, fiction, classics, ebooks, history, con..."
13189,943,109,5.0,"classics, ebooks, thriller, mystery, fiction, ..."
13190,943,110,5.0,"mystery, fiction, crime, ebooks, thriller, sus..."


In [23]:
# grouped_df= df_filtered.copy()
# Number of unique user_ids and book_ids
num_unique_users = grouped_df['user'].nunique()
num_unique_books = grouped_df['item'].nunique()

# Total possible interactions (assuming all combinations exist)
total_possible_interactions = num_unique_users * num_unique_books

# Actual number of interactions (non-zero ratings)
num_interactions = grouped_df.shape[0]

# Sparsity calculation
sparsity = 1.0 - (num_interactions / total_possible_interactions)

# Print the results
print(f"Number of unique user_ids: {num_unique_users}")
print(f"Number of unique book_ids: {num_unique_books}")
print(f"Sparsity of the data: {sparsity:.4f}")

Number of unique user_ids: 943
Number of unique book_ids: 761
Sparsity of the data: 0.9816


In [24]:
# Group by 'user' and count the ratings for each user
user_ratings_counts = grouped_df.groupby('user')['rating'].count()

# Sort the user ratings counts in descending order
sorted_user_ratings_counts = np.sort(user_ratings_counts)[::-1]

# If you want to sort in descending order but preserve the corresponding user IDs:
sorted_user_ids = user_ratings_counts.index[np.argsort(user_ratings_counts.values)[::-1]]

# If you want to see both user IDs and their corresponding counts:
sorted_user_data = pd.DataFrame({'user': sorted_user_ids, 'rating_count': sorted_user_ratings_counts})
sorted_user_data

Unnamed: 0,user,rating_count
0,244,32
1,479,32
2,650,31
3,397,30
4,868,30
...,...,...
938,415,10
939,216,10
940,416,10
941,715,10


# Train / Test Split

In [26]:


# user_sampling = xf.SampleN(5)

# folds = list(xf.partition_users(grouped_df, 1, xf.SampleN(5)))
# train, test = next(folds)

for i, tp in enumerate(xf.partition_users(grouped_df, 1, xf.SampleN(5))):
  tp.train.to_csv('train-book%d.csv' % (i,), index= False)
  tp.test.to_csv('val-book%d.csv' % (i,), index= False)
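
The split above holds out a fixed number of ratings per user for testing. As a rough illustration of that idea only (not LensKit's implementation), here is a minimal pure-Python sketch. The function name `holdout_per_user`, the seed, and the rule that users with too few rows stay entirely in the training set are all assumptions made for this example.

```python
import random
from collections import defaultdict

def holdout_per_user(rows, n_test=5, seed=42):
    """Split (user, item, rating) rows so that each user contributes n_test rows to test."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)

    train, test = [], []
    for user_rows in by_user.values():
        shuffled = user_rows[:]
        rng.shuffle(shuffled)
        # Assumption of this sketch: users with n_test rows or fewer stay in train only.
        if len(shuffled) <= n_test:
            train.extend(shuffled)
            continue
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test

# Toy data: 3 hypothetical users with 8 ratings each
rows = [(u, i, 5.0) for u in (1, 2, 3) for i in range(8)]
train_rows, test_rows = holdout_per_user(rows, n_test=5)
print(len(train_rows), len(test_rows))
```

Holding out the same number of items for every user keeps per-user metrics comparable, which is why the test set later shows exactly 5 rows per user.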

In [30]:
grouped_df.to_csv ("goodbook/ratings_filtered_goodbook.csv", index= False)

In [28]:
train = pd.read_csv("goodbook/trainVal-book0.csv")
for i, tp in enumerate(xf.partition_users(grouped_df, 1, xf.SampleN(5))):
  tp.train.to_csv('train-book%d.csv' % (i,), index= False)
  tp.test.to_csv('val-book%d.csv' % (i,), index= False)

In [None]:
grouped_df.to_csv("goodbook/processed_GB_Data.csv", index=False)


# Recommendation

In [83]:
# train = pd.read_csv ("goodbook/trainVal-book0.csv", sep=",", names= ["user", "item", "rating", "genre"]) # 
train = pd.read_csv ("goodbook/obfuscated_user_item_matrix_2%_top50Inditems_top100IndiUsers_Categories.csv", sep=",", names= ["user", "item", "rating", "genre"]) # 
# train = pd.read_csv ("goodbook/Adding_user_item_matrix_20%.csv", sep=",", names= ["user", "item", "rating", "genre"]) # 
# val = pd.read_csv ("goodbook/val-book0.csv", sep=",", names= ["user", "item", "rating", "genre"])
test = pd.read_csv ("goodbook/test-book0.csv", sep=",", names= ["user", "item", "rating", "genre"])


In [84]:
train ["rating"] = 1
train = train [["user", "item", "rating"]] #.copy ()
# trainVal_small.to_csv ("goodbook/trainVal_small.csv", index= False)\

# val_small = train [["user", "item", "rating"]] #.copy ()
# train_small.to_csv ("goodbook/train_small.csv", index= False)
test ["rating"] = 1
test = test [["user", "item", "rating"]] #.copy ()
# test_small.to_csv ("goodbook/test_small.csv", index= False)


In [85]:
train

Unnamed: 0,user,item,rating
0,1,3,1
1,1,4,1
2,1,5,1
3,1,7,1
4,1,8,1
...,...,...,...
8494,943,90,1
8495,943,109,1
8496,943,127,1
8497,943,381,1


In [86]:
test.user.value_counts()

1      5
634    5
622    5
623    5
624    5
      ..
319    5
320    5
321    5
322    5
943    5
Name: user, Length: 943, dtype: int64

In [87]:
# train
train.user.value_counts()

427    25
921    25
811    25
645    25
790    24
       ..
704     4
775     4
780     4
895     3
870     3
Name: user, Length: 943, dtype: int64

In [88]:
# len (grouped_df.item.unique ())

In [89]:
BPR = tf.BPR(features=200, epochs= 500)
# ItemKNN = item_knn.ItemItem (nnbrs=50)
# UserKNN = user_knn.UserUser (nnbrs=50)

In [90]:
def evaluation(aname, algo, train, test):
    fittable = util.clone(algo)
    fittable = Recommender.adapt(fittable)
    fittable.fit(train)
    users = test.user.unique()
    # now we run the recommender
    recs = batch.recommend(fittable, users, 10)
    # add the algorithm name for analyzability
    recs['Algorithm'] = aname
    return recs

In [91]:
all_recs = []
test_data = []
# for train, test in xf.partition_users(ratings[['user', 'item', 'rating']], 5, xf.SampleFrac(0.2)):
test_data.append(test)
test_data

[      user  item  rating
 0        1     2       1
 1        1     9       1
 2        1     6       1
 3        1    12       1
 4        1     1       1
 ...    ...   ...     ...
 4710   943    85       1
 4711   943   132       1
 4712   943   110       1
 4713   943    50       1
 4714   943    70       1
 
 [4715 rows x 3 columns]]

In [92]:
all_recs.append(evaluation('BPR', BPR, train, test))
# all_recs.append(evaluation('ItemKNN', ItemKNN, train, test))
# all_recs.append(evaluation('UserKNN', UserKNN, train, test))


Epoch 1/500
Epoch 2/500
Epoch 3/500
...
Epoch 91/500
Epoch 92/500

In [93]:
all_recs = pd.concat(all_recs, ignore_index=True)
all_recs.head()


Unnamed: 0,item,score,user,rank,Algorithm
0,158,2.12722,1,1,BPR
1,75,2.049705,1,2,BPR
2,80,1.879437,1,3,BPR
3,81,1.878907,1,4,BPR
4,62,1.738119,1,5,BPR


In [94]:
# all_recs.to_csv ("all_recs_BPR")

In [95]:
all_recs_df = all_recs.copy ()


In [96]:
test_data = []
test_data.append(test)

In [97]:
test_data = pd.concat(test_data, ignore_index=True)


In [98]:
rla = topn.RecListAnalysis()
rla.add_metric(topn.ndcg)
rla.add_metric(topn.hit)
rla.add_metric(topn.precision)
rla.add_metric(topn.recall)

results = rla.compute(all_recs_df, test_data) # all_recs
results.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,nrecs,ndcg,hit,precision,recall
Algorithm,user,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
BPR,1,10,0.100013,1.0,0.1,0.2
BPR,2,10,0.093591,1.0,0.1,0.2
BPR,3,10,0.120922,1.0,0.1,0.2
BPR,4,10,0.561544,1.0,0.2,0.4
BPR,5,10,0.084521,1.0,0.1,0.2


In [99]:
results.groupby('Algorithm').mean()

Unnamed: 0_level_0,nrecs,ndcg,hit,precision,recall
Algorithm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
BPR,10.0,0.227129,0.702015,0.126723,0.253446
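
To make the columns above easier to read, here is a minimal sketch of the conventional definitions of hit, precision and recall for a single recommendation list. The helper names and the toy lists are hypothetical, and LensKit's own implementations may treat edge cases differently.

```python
def hit(recs, truth):
    """1.0 if at least one recommended item is in the held-out set, else 0.0."""
    return 1.0 if any(item in truth for item in recs) else 0.0

def precision(recs, truth):
    """Fraction of recommended items that are in the held-out set."""
    if not recs:
        return 0.0
    return sum(1 for item in recs if item in truth) / len(recs)

def recall(recs, truth):
    """Fraction of held-out items that were recommended."""
    if not truth:
        return 0.0
    return sum(1 for item in recs if item in truth) / len(truth)

# Hypothetical user: 10 recommendations, 5 held-out items, exactly one match (item 3)
recs = list(range(1, 11))
truth = {3, 20, 30, 40, 50}
print(hit(recs, truth), precision(recs, truth), recall(recs, truth))
```

With 10 recommendations and 5 held-out items per user, one correct recommendation gives a precision of 0.1 and a recall of 0.2, which is the pattern visible in most rows of the per-user table.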


In [100]:
# results.to_csv ("/Users/mslokom/Documents/RecSys_News/goodbook/results_BPR_fullObf_p5%.csv")

# Clean the dataset <a id="2"></a> <br>

As with nearly any real-life dataset, we need to do some cleaning first. When exploring the data I noticed that for some combinations of user and book there are multiple ratings, while in theory there should only be one (unless users can rate a book several times). Furthermore, collaborative filtering works better with more ratings per user, so I decided to remove users who have rated fewer than 3 books.

In [101]:
# books['original_publication_year'] = books['original_publication_year'].fillna(-1).apply(lambda x: int(x) if x != -1 else -1)


In [None]:
ratings_rmv_duplicates = ratings.drop_duplicates()
unwanted_users = ratings_rmv_duplicates.groupby('user_id')['user_id'].count()
unwanted_users = unwanted_users[unwanted_users < 3]
unwanted_ratings = ratings_rmv_duplicates[ratings_rmv_duplicates.user_id.isin(unwanted_users.index)]
new_ratings = ratings_rmv_duplicates.drop(unwanted_ratings.index)

In [None]:
# new_ratings['title'] = books.set_index('id').title.loc[new_ratings.book_id].values

In [None]:
# new_ratings.head(10)

# Simple Recommender <a id="3"></a> <br>

The Simple Recommender offers generalized recommendations to every user based on book popularity and (sometimes) genre. The basic idea behind this recommender is that books that are more popular and more critically acclaimed will have a higher probability of being liked by the average audience. This model does not give personalized recommendations based on the user.

The implementation of this model is straightforward. All we have to do is sort our books based on ratings and popularity and display the top books of our list. As an added step, we can pass in a genre argument to get the top books of a particular genre.


I will use IMDb's *weighted rating* formula to construct my chart. Mathematically, it is represented as follows:

Weighted Rating (WR) = $\left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{v + m} \cdot C\right)$

where,
* *v* is the number of ratings for the book
* *m* is the minimum ratings required to be listed in the chart
* *R* is the average rating of the book
* *C* is the mean rating across all books in the dataset

The next step is to determine an appropriate value for *m*, the minimum ratings required to be listed in the chart. We will use **95th percentile** as our cutoff. In other words, for a book to feature in the charts, it must have more ratings than at least 95% of the books in the list.

I will build our overall Top 250 Chart and will define a function to build charts for a particular genre. Let's begin!
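
Before applying the formula to the real data, here is a small worked sketch on toy numbers. The three books and the cutoff `m = 100` are made up for illustration (the notebook derives *m* from a percentile instead), and `weighted_rating` is a hypothetical helper name.

```python
from statistics import mean

def weighted_rating(v, R, m, C):
    """IMDb-style weighted rating: shrink R towards the global mean C when v is small."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Toy books (made-up values, not from the Goodreads data)
toy_books = [
    {"title": "A", "ratings_count": 10, "average_rating": 5.0},
    {"title": "B", "ratings_count": 1000, "average_rating": 4.5},
    {"title": "C", "ratings_count": 500, "average_rating": 3.0},
]

C = mean(b["average_rating"] for b in toy_books)  # global mean rating
m = 100  # hypothetical cutoff; the notebook uses a percentile of ratings_count

for b in toy_books:
    b["weighted_rating"] = weighted_rating(b["ratings_count"], b["average_rating"], m, C)

ranked = [b["title"] for b in sorted(toy_books, key=lambda b: b["weighted_rating"], reverse=True)]
print(ranked)
```

Book A has a perfect average but only 10 ratings, so its score is pulled towards the global mean and it ranks below the heavily rated book B. That shrinkage is the reason for using a weighted rating instead of the raw average.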

In [None]:
# v = books['ratings_count']
# m = books['ratings_count'].quantile(0.95)
# R = books['average_rating']
# C = books['average_rating'].mean()
# W = (R*v + C*m) / (v + m)

In [None]:
# books['weighted_rating'] = W

In [None]:
# qualified  = books.sort_values('weighted_rating', ascending=False).head(250)

## Top Books <a id="4"></a> <br>

In [None]:
# qualified[['title', 'authors', 'average_rating', 'weighted_rating']].head(15)

We see that J.K. Rowling's **Harry Potter** Books occur at the very top of our chart. The chart also indicates a strong bias of Goodreads Users towards particular genres and authors. 

Let us now construct our function that builds charts for particular genres. For this, we will relax our default cutoff to the **85th** percentile instead of the 95th.

## Top "Genres" Books <a id="5"></a> <br>

In [None]:
# book_tags.head()

In [None]:
# tags.head()

In [None]:
# genres = ["Art", "Biography", "Business", "Chick Lit", "Children's", "Christian", "Classics",
#           "Comics", "Contemporary", "Cookbooks", "Crime", "Ebooks", "Fantasy", "Fiction",
#           "Gay and Lesbian", "Graphic Novels", "Historical Fiction", "History", "Horror",
#           "Humor and Comedy", "Manga", "Memoir", "Music", "Mystery", "Nonfiction", "Paranormal",
#           "Philosophy", "Poetry", "Psychology", "Religion", "Romance", "Science", "Science Fiction", 
#           "Self Help", "Suspense", "Spirituality", "Sports", "Thriller", "Travel", "Young Adult"]

          

In [None]:
# genres = list(map(str.lower, genres))
# genres[:4]

In [None]:
# available_genres = tags.loc[tags.tag_name.str.lower().isin(genres)]

In [None]:
# available_genres.head()

In [None]:
# available_genres_books = book_tags[book_tags.tag_id.isin(available_genres.tag_id)]

In [None]:
# print('There are {} books that are tagged with above genres'.format(available_genres_books.shape[0]))

In [None]:
# available_genres_books.head()

In [None]:
# len (available_genres_books.goodreads_book_id.unique())

In [None]:
# available_genres_books['genre'] = available_genres.tag_name.loc[available_genres_books.tag_id].values
# available_genres_books.head()

In [None]:
# def build_chart(genre, percentile=0.85):
#     df = available_genres_books[available_genres_books['genre'] == genre.lower()]
#     qualified = books.set_index('book_id').loc[df.goodreads_book_id]

#     v = qualified['ratings_count']
#     m = qualified['ratings_count'].quantile(percentile)
#     R = qualified['average_rating']
#     C = qualified['average_rating'].mean()
#     qualified['weighted_rating'] = (R*v + C*m) / (v + m)

#     qualified.sort_values('weighted_rating', ascending=False, inplace=True)
#     return qualified

Let us see our method in action by displaying the Top 15 Fiction Books (Fiction almost didn't feature at all in our generic Top Chart despite being one of the most popular book genres).

In [None]:
# cols = ['title','authors','original_publication_year','average_rating','ratings_count','work_text_reviews_count','weighted_rating']

In [None]:
# genre = 'Fiction'
# build_chart(genre)[cols].head(15)

For simplicity, you can just pass the index of the wanted genre from below. 

In [None]:
# list(enumerate(available_genres.tag_name))

In [None]:
# idx = 24  # romance
# build_chart(list(available_genres.tag_name)[idx])[cols].head(15)

# Content Based Recommender <a id="6"></a> <br>

![](https://miro.medium.com/max/828/1*1b-yMSGZ1HfxvHiJCiPV7Q.png)

The recommender we built in the previous section suffers from some severe limitations. For one, it gives the same recommendations to everyone, regardless of the user's personal taste. If a person who loves business books (and hates fiction) were to look at our Top 15 Chart, s/he probably wouldn't like most of the books. Even if s/he went one step further and looked at our charts by genre, s/he still wouldn't be getting the best recommendations.

For instance, consider a person who loves *The Fault in Our Stars* and *Twilight*. One inference we can draw is that the person loves romantic books. Yet even if s/he were to access the romance chart, s/he wouldn't find these as the top recommendations.

To personalise our recommendations more, I am going to build an engine that computes similarity between books based on certain metrics and suggests books that are most similar to a particular book that a user liked. Since we will be using book metadata (or content) to build this engine, this is also known as **Content Based Filtering.**

I will build this recommender based on book's *Title*, *Authors* and *Genres*.

In [None]:
# from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
# from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

My approach to building the recommender is going to be extremely *hacky*. These are the steps I plan to follow:
1. **Strip spaces and convert to lowercase** in author names. This way, our engine will not confuse **Stephen Covey** with **Stephen King**.
2. Combine each book with its corresponding **genres**.
3. Use a **Count Vectorizer** to create our count matrix.

Finally, we calculate the cosine similarities and return books that are most similar.
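
The steps above can be sketched in plain Python on a single toy book. This is only an illustration under simplifying assumptions: the helper names are hypothetical, the example book is made up, and unlike scikit-learn's `CountVectorizer` this sketch does no stop-word removal and tokenises on whitespace only.

```python
from collections import Counter

def normalize_name(name):
    """Strip spaces and lowercase, so 'Stephen King' becomes the single token 'stephenking'."""
    return name.replace(" ", "").lower()

def build_soup(title, authors, genres):
    """Combine title words, author tokens and genre tokens into one token list."""
    tokens = title.lower().split()
    tokens += [normalize_name(a) for a in authors]
    tokens += [normalize_name(g) for g in genres]
    return tokens

def count_features(tokens):
    """Count unigrams and bigrams, similar in spirit to ngram_range=(1, 2)."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)

# Made-up example book
soup = build_soup("The Stand", ["Stephen King"], ["Horror", "Science Fiction"])
features = count_features(soup)
print(soup)
```

Because the author becomes one token, two authors who share a first name no longer share any feature, which is the point of step 1.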

In [None]:
# books['authors'] = books['authors'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x.split(', ')])

In [None]:
# def get_genres(x):
#     t = book_tags[book_tags.goodreads_book_id==x]
#     return [i.lower().replace(" ", "") for i in tags.tag_name.loc[t.tag_id].values]

In [None]:
# books['genres'] = books.book_id.apply(get_genres)

In [None]:
# books['soup'] = books.apply(lambda x: ' '.join([x['title']] + x['authors'] + x['genres']), axis=1)

In [None]:
# books.soup.head()

In [None]:
# count = CountVectorizer(analyzer='word',ngram_range=(1, 2),min_df=0.0, stop_words='english')
# count_matrix = count.fit_transform(books['soup'])

## Cosine Similarity <a id="7"></a> <br>

I will be using the Cosine Similarity to calculate a numeric quantity that denotes the similarity between two books. Mathematically, it is defined as follows:

$\text{cosine}(x, y) = \frac{x \cdot y^\intercal}{\|x\| \cdot \|y\|}$
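
As a quick sanity check of the formula, here is a minimal sketch that computes it for small count vectors. The vectors are toy values and `cosine` is a hypothetical helper; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math

def cosine(x, y):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention used in this sketch for all-zero vectors
    return dot / (norm_x * norm_y)

# Toy count vectors over a 4-term vocabulary
book_a = [1, 1, 0, 1]
book_b = [1, 1, 1, 0]
book_c = [0, 0, 1, 0]

print(cosine(book_a, book_b))  # two shared terms out of three each
print(cosine(book_a, book_c))  # no shared terms
```

Books that share more terms score closer to 1, and books with no terms in common score 0.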



In [None]:
# cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [None]:
# indices = pd.Series(books.index, index=books['title'])
# titles = books['title']

In [None]:
# def get_recommendations(title, n=10):
#     idx = indices[title]
#     sim_scores = list(enumerate(cosine_sim[idx]))
#     sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
#     sim_scores = sim_scores[1:31]
#     book_indices = [i[0] for i in sim_scores]
#     return list(titles.iloc[book_indices].values)[:n]

In [None]:
# get_recommendations("The One Minute Manager")

What if I want a specific book but I can't remember its full name?

For that, I created the following *method* to get book titles from a **partial** title.

In [None]:
# def get_name_from_partial(title):
#     return list(books.title[books.title.str.lower().str.contains(title) == True].values)

In [None]:
# title = "business"
# l = get_name_from_partial(title)
# list(enumerate(l))

In [None]:
# get_recommendations(l[1])

## Popularity and Ratings <a id="8"></a> <br>

One thing that we notice about our recommendation system is that it recommends books regardless of ratings and popularity. It is true that ***Across the River and Into the Trees*** and ***The Old Man and the Sea*** were both written by **Ernest Hemingway**, but the former was considered a bad (though not the worst) book that shouldn't be recommended to anyone, since most people hated it for its static plot and overwrought emotion.

Therefore, we will add a mechanism to remove bad books and return books which are popular and have had a good critical response.

I will take the top 30 books based on similarity scores and calculate the ratings count at the 60th percentile. Then, using this as the value of $m$, we will calculate the weighted rating of each book using IMDb's formula like we did in the Simple Recommender section.

In [None]:
# def improved_recommendations(title, n=10):
#     idx = indices[title]
#     sim_scores = list(enumerate(cosine_sim[idx]))
#     sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
#     sim_scores = sim_scores[1:31]
#     book_indices = [i[0] for i in sim_scores]
#     df = books.iloc[book_indices][['title', 'ratings_count', 'average_rating', 'weighted_rating']]

#     v = df['ratings_count']
#     m = df['ratings_count'].quantile(0.60)
#     R = df['average_rating']
#     C = df['average_rating'].mean()
#     df['weighted_rating'] = (R*v + C*m) / (v + m)
    
#     qualified = df[df['ratings_count'] >= m]
#     qualified = qualified.sort_values('weighted_rating', ascending=False)
#     return qualified.head(n)

In [None]:
# improved_recommendations("The One Minute Manager")

In [None]:
# improved_recommendations(l[1])

I think the ordering of similar books is better now than before.
Therefore, we will conclude our Content Based Recommender section here and come back to it when we build a hybrid engine.


# Collaborative Filtering <a id="9"></a> <br>

![](https://miro.medium.com/max/706/1*DYJ-HQnOVvmm5suNtqV3Jw.png)

Our content based engine suffers from some severe limitations. It is only capable of suggesting books which are *close* to a certain book. That is, it is not capable of capturing tastes and providing recommendations across genres.

Also, the engine that we built is not really personal in that it doesn't capture the personal tastes and biases of a user. Anyone querying our engine for recommendations based on a book will receive the same recommendations for that book, regardless of who s/he is.

Therefore, in this section, we will use a technique called **Collaborative Filtering** to make recommendations to Book Readers. Collaborative Filtering is based on the idea that users similar to me can be used to predict how much I will like a particular product or service that those users have used or experienced but I have not.

I will not be implementing Collaborative Filtering from scratch. Instead, I will use the **Surprise** library, which uses powerful algorithms like **Singular Value Decomposition (SVD)** to minimise RMSE (Root Mean Square Error) and give great recommendations.

There are two classes of Collaborative Filtering:
![](https://miro.medium.com/max/1280/1*QvhetbRjCr1vryTch_2HZQ.jpeg)
- **User-based**, which measures the similarity between target users and other users.
- **Item-based**, which measures the similarity between the items that target users rate or interact with and other items.
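
To make the difference between the two classes concrete, here is a minimal sketch on a made-up rating table. It compares rows (users) for the user-based view and columns (items) for the item-based view, using cosine similarity with missing ratings treated as 0. The names and ratings are hypothetical, and library implementations typically add normalisation and neighbourhood selection that this sketch omits.

```python
import math

# Made-up ratings: user -> {item: rating}
toy_ratings = {
    "ann":  {"dune": 5, "emma": 1, "hobbit": 4},
    "bob":  {"dune": 4, "emma": 2, "hobbit": 5},
    "cara": {"dune": 1, "emma": 5},
}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts (missing entries count as 0)."""
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def transpose(user_ratings):
    """Turn user -> {item: rating} into item -> {user: rating}."""
    by_item = {}
    for user, items in user_ratings.items():
        for item, rating in items.items():
            by_item.setdefault(item, {})[user] = rating
    return by_item

# User-based view: which user rates like ann?
sim_ann_bob = cosine(toy_ratings["ann"], toy_ratings["bob"])
sim_ann_cara = cosine(toy_ratings["ann"], toy_ratings["cara"])

# Item-based view: which item is rated like "dune"?
toy_items = transpose(toy_ratings)
sim_dune_hobbit = cosine(toy_items["dune"], toy_items["hobbit"])
sim_dune_emma = cosine(toy_items["dune"], toy_items["emma"])

print(sim_ann_bob, sim_ann_cara, sim_dune_hobbit, sim_dune_emma)
```

In this toy table, ann is closer to bob than to cara, and "dune" is closer to "hobbit" than to "emma". The same similarity function serves both views; only the axis being compared changes.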

## User Based <a id="10"></a> <br>

In [None]:
# ! pip install surprise

In [None]:
# from surprise import Reader, Dataset, SVD
# from surprise.model_selection import cross_validate

In [None]:
# reader = Reader()
# data = Dataset.load_from_df(new_ratings[['user_id', 'book_id', 'rating']], reader)