# Book Recommender: Part 3 (Content-Based)

In this part, we will build a content-based, personalized book recommender. The approach is as follows:  

1) First, construct a matrix in which each row represents a book, built from the book's title, author, and description.<br />

2) Then compute the similarities among the row vectors (books) of this matrix. Here we use cosine similarity, where $V_i$ represents the ith book and $V_j$ represents the jth book. <br />
$$S_{ij} = \frac{V_{i}^T V_{j}}{\|V_{i}\| \, \|V_{j}\|}$$

3) For each user, compute a rating for every book the user has not rated, as follows. Find the books the user has rated ($B_{rated}$, of size n). For the jth book in the unrated set, compute a projected rating with the formula below, where $r_i$ is the rating of the ith book in the rated set and $S_{ij}$ is the similarity between the jth unrated book and the ith rated book.    
    $$r_j=\frac{\sum_{i=1}^{n} r_iS_{ij}}{\sum_{i=1}^{n}S_{ij}}$$

4) Rank the unrated books by the ratings computed above and recommend the top N books to the user.
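
Steps 2 and 3 can be illustrated on toy data. The sketch below is only an illustration under assumed values: the three book vectors and the two user ratings are hypothetical and do not come from the dataset. It computes the cosine similarity of each rated book to one unrated book, then the similarity-weighted rating $r_j$.

```python
import numpy as np

# Hypothetical book vectors (not taken from the dataset).
V = np.array([
    [1.0, 0.0, 1.0],  # book 0, rated by the user
    [0.0, 1.0, 1.0],  # book 1, rated by the user
    [1.0, 1.0, 0.0],  # book 2, unrated
])

def cosine(u, v):
    """Cosine similarity S_ij between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

user_ratings = {0: 5.0, 1: 3.0}  # hypothetical ratings r_i of the rated books
j = 2                            # index of the unrated book

# Similarity of every rated book i to the unrated book j.
sims = {i: cosine(V[i], V[j]) for i in user_ratings}

# Weighted average: sum(r_i * S_ij) / sum(S_ij).
r_j = sum(user_ratings[i] * sims[i] for i in user_ratings) / sum(sims.values())
print(sims, r_j)
```

Both rated books are equally similar to book 2 here, so the projected rating lands midway between the two ratings, at about 4.0.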

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
import re
from tqdm import tqdm_notebook as tqdm
import collections
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from collections import defaultdict

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from sklearn.metrics import mean_squared_error
from math import sqrt
import warnings
warnings.filterwarnings("ignore")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\YaoDe\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 1. Construct TF-IDF matrix

In [2]:
books_df=pd.read_csv('./goodreads/books.csv')
books_df['bookID']=books_df['bookID'].astype('str')
books_df['authorID']=books_df['authorID'].astype('str')
#Combine book title, author name (repeated three times, which raises its term frequency) and book description into one string
books_df['combine_text']=books_df['title']+' '+(books_df['authorName']+' ')*3+books_df['descriptions']
books_df['combine_text'].values[:3]

array(["Harry Potter and the Sorcerer's Stone J.K. Rowling J.K. Rowling J.K. Rowling Harry Potter's life is miserable. His parents are dead and he's stuck with his heartless relatives, who force him to live in a tiny closet under the stairs. But his fortune changes when he receives a letter that tells him the truth about himself: he's a wizard. A mysterious visitor rescues him from his relatives and takes him to his new home, Hogwarts School of Witchcraft and Wizardry.After a lifetime of bottling up his magical powers, Harry finally feels like a normal kid. But even within the Wizarding community, he is special. He is the boy who lived: the only person to have ever survived a killing curse inflicted by the evil Lord Voldemort, who launched a brutal takeover of the Wizarding world, only to vanish after failing to kill Harry.Though Harry's first year at Hogwarts is the best of his life, not everything is perfect. There is a dangerous secret object hidden within the castle walls, and Harr

In [3]:
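#Convert the text strings into TF-IDF vectors
#min_df=1 (the default) keeps every term, like the original min_df=0; newer scikit-learn releases may reject an integer 0
tf = TfidfVectorizer(analyzer='word',min_df=1, stop_words='english')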
tf_matrix=tf.fit_transform(books_df['combine_text'])
tf_matrix.shape

(2970, 28663)
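
To see what repeating the author name does to the vectors, here is a minimal sketch on a made-up three-document corpus (the documents are hypothetical, not rows of `books.csv`). It compares the TF-IDF weight of the author token when it appears once and three times, and checks that each row has unit L2 norm under the vectorizer's default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: the same text with the author token once vs. three times.
docs = [
    "magic school rowling wizard adventure",
    "magic school " + "rowling " * 3 + "wizard adventure",
    "detective mystery london crime",
]

vectorizer = TfidfVectorizer(analyzer="word", stop_words="english")
X = vectorizer.fit_transform(docs)

# Column of the author token in the TF-IDF matrix.
col = vectorizer.vocabulary_["rowling"]
weight_once = X[0, col]
weight_thrice = X[1, col]

# TfidfVectorizer applies L2 normalization per row by default (norm="l2").
row_norms = np.linalg.norm(X.toarray(), axis=1)
print(weight_once, weight_thrice, row_norms)
```

The author token takes a larger share of the second document's vector, so two books by the same author end up with a higher dot product.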

In [4]:
#Compute the similarities among books. TfidfVectorizer L2-normalizes each row by default, so the linear kernel (dot product) equals the cosine similarity.
cosine_similiarity = linear_kernel(tf_matrix, tf_matrix)
cosine_similiarity.shape

(2970, 2970)
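
The equivalence between the linear kernel and cosine similarity for TF-IDF rows can be checked directly. This is a small sketch on a hypothetical corpus, assuming the vectorizer keeps its default `norm="l2"`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical documents, used only to compare the two kernels.
docs = [
    "wizard school magic castle",
    "magic castle dragon quest",
    "detective london crime mystery",
]

X = TfidfVectorizer().fit_transform(docs)

dot = linear_kernel(X, X)       # plain dot products between rows
cos = cosine_similarity(X, X)   # dot products divided by the row norms

# Rows already have unit norm, so the division changes nothing.
same = np.allclose(dot, cos)
print(same)
```

`linear_kernel` skips the normalization step, which is why it is the cheaper choice when the rows are already normalized.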

## 2. Compute ratings for unrated books for each user individually

In [5]:
ratings=pd.read_csv('./goodreads/final_ratings.csv')
ratings['bookID']=ratings['bookID'].astype('str')
ratings['userID']=ratings['userID'].astype('str')
ratings['user_count']=ratings.groupby('userID')['userID'].transform('count') 
ratings=ratings[ratings['user_count']>=20] #keep only users who have rated at least 20 books
userid_list=ratings['userID'].unique()
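
The `groupby(...).transform('count')` pattern above attaches each user's rating count to every one of that user's rows, which makes the filter a simple boolean mask. A minimal sketch on a hypothetical ratings frame, with a threshold of 2 instead of 20:

```python
import pandas as pd

# Hypothetical ratings (not from final_ratings.csv).
toy = pd.DataFrame({
    "userID": ["a", "a", "a", "b", "c", "c"],
    "bookID": ["1", "2", "3", "1", "2", "3"],
    "rating": [5, 4, 3, 2, 4, 5],
})

# transform('count') returns one value per row: the size of that row's user group.
toy["user_count"] = toy.groupby("userID")["userID"].transform("count")

# Keep only users with at least 2 ratings.
kept = toy[toy["user_count"] >= 2]
kept_users = sorted(kept["userID"].unique())
print(kept_users)
```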

In [6]:
#split the data into train and test
train, test = train_test_split(ratings,
                               stratify=ratings['userID'], 
                               test_size=0.20,
                               random_state=42)
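
Stratifying on `userID` keeps roughly the same train/test proportion for every user, so each user has ratings on both sides of the split. The sketch below shows this on a hypothetical frame with three users and ten ratings each.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ratings: three users with ten ratings each.
toy = pd.DataFrame({
    "userID": ["u1"] * 10 + ["u2"] * 10 + ["u3"] * 10,
    "bookID": [str(b) for b in range(10)] * 3,
    "rating": [3, 4, 5, 2, 4, 5, 3, 4, 1, 5] * 3,
})

toy_train, toy_test = train_test_split(
    toy,
    stratify=toy["userID"],  # preserve each user's share in both splits
    test_size=0.20,
    random_state=42,
)

train_counts = toy_train["userID"].value_counts().to_dict()
test_counts = toy_test["userID"].value_counts().to_dict()
print(train_counts, test_counts)
```

Each user contributes 8 ratings to the training split and 2 to the test split. Stratification needs at least two rows per class, which the 20-rating minimum comfortably satisfies.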

In [7]:
#Predict the ratings of unrated books for the user whose userID is '1713956', then use the test set to measure the RMSE
rated_bookID = train.loc[train.userID =='1713956']['bookID'].values
rated_books=books_df[books_df['bookID'].isin(rated_bookID)]
unrated_books=books_df[~books_df['bookID'].isin(rated_bookID)]
rating_mapping=defaultdict(int)
sim_mapping=defaultdict(int)
rating_pred={}
for idx_r in rated_books.index:
    #Note: 'rating' here is the rating column of books_df (loaded from books.csv), not this user's own rating in train
    rating_r=rated_books[rated_books.index==idx_r]['rating'].values[0]
    for idx_ur in unrated_books.index:
        similarity=cosine_similiarity[idx_r,idx_ur]
        #print(similarity)
        rating_mapping[idx_ur]+=(rating_r*similarity)
        sim_mapping[idx_ur]+=similarity
for idx_ur in unrated_books.index:
    rating_pred[idx_ur]=(rating_mapping[idx_ur]/sim_mapping[idx_ur])
unrated_books['rating_pred']=list(rating_pred.values())
true_rating_test=test[test['userID']=='1713956'].merge(unrated_books,on='bookID')['rating_x']
pred_rating_test=test[test['userID']=='1713956'].merge(unrated_books,on='bookID')['rating_pred']
rmse_test=sqrt(mean_squared_error(true_rating_test,pred_rating_test))
rmse_test

1.046544456960596
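
The nested loops in the cell above compute, for each unrated book, a similarity-weighted average over the rated books. The same quantity can be written as one matrix product. The following is a sketch on a hypothetical 4x4 similarity matrix and hypothetical ratings, not a drop-in replacement for the cell; it checks that the loop form and the vectorized form agree.

```python
import numpy as np

# Hypothetical symmetric similarity matrix for four books.
S = np.array([
    [1.0, 0.2, 0.6, 0.1],
    [0.2, 1.0, 0.2, 0.3],
    [0.6, 0.2, 1.0, 0.0],
    [0.1, 0.3, 0.0, 1.0],
])
rated_idx = np.array([0, 1])    # books with a known rating
unrated_idx = np.array([2, 3])  # books to predict
r = np.array([5.0, 3.0])        # hypothetical ratings r_i of the rated books

# Loop form, mirroring the accumulation of numerator and denominator.
pred_loop = []
for j in unrated_idx:
    num = sum(r[k] * S[i, j] for k, i in enumerate(rated_idx))
    den = sum(S[i, j] for i in rated_idx)
    pred_loop.append(num / den)
pred_loop = np.array(pred_loop)

# Vectorized form: rows are rated books, columns are unrated books.
S_ru = S[np.ix_(rated_idx, unrated_idx)]
pred_vec = (r @ S_ru) / S_ru.sum(axis=0)
print(pred_loop, pred_vec)
```

On a catalogue of a few thousand books the vectorized form avoids millions of Python-level iterations per user.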

In [8]:
#For every user, predict the ratings of the unrated books and compute the average test RMSE across users
def compute_rating(train,test,sim_matrix,books_df,userid_list):
    rmse_test_list=[]
    for userid in tqdm(userid_list,desc="computing rating"):
        rated_bookID = train.loc[train.userID ==userid]['bookID'].values
        rated_books=books_df[books_df['bookID'].isin(rated_bookID)]
        unrated_books=books_df[~books_df['bookID'].isin(rated_bookID)]
        rating_mapping=defaultdict(int) #numerator: sum of rating*similarity per unrated book
        sim_mapping=defaultdict(int)    #denominator: sum of similarities per unrated book
        rating_pred={}
        for idx_r in rated_books.index:
            #Note: 'rating' here is the rating column of books_df, not this user's own rating in train
            rating_r=rated_books[rated_books.index==idx_r]['rating'].values[0]
            for idx_ur in unrated_books.index:
                #use the sim_matrix argument rather than the global cosine_similiarity
                similarity=sim_matrix[idx_r,idx_ur]
                rating_mapping[idx_ur]+=(rating_r*similarity)
                sim_mapping[idx_ur]+=similarity
        for idx_ur in unrated_books.index:
            rating_pred[idx_ur]=(rating_mapping[idx_ur]/sim_mapping[idx_ur])
        unrated_books['rating_pred']=pd.Series(rating_pred)
        #Join this user's test ratings with the predictions; rating_x is the user's true rating
        merged=test[test['userID']==userid].merge(unrated_books,on='bookID')
        if len(merged)==0: continue
        true_rating_test=merged['rating_x']
        pred_rating_test=merged['rating_pred'].fillna(0) #missing predictions count as 0
        rmse_test=sqrt(mean_squared_error(true_rating_test,pred_rating_test))
        rmse_test_list.append(rmse_test)
    print(f'The mean rmse of rating prediction for test dataset is {np.mean(rmse_test_list)}')
    return np.mean(rmse_test_list)

In [9]:
compute_rating(train,test,cosine_similiarity,books_df,userid_list)


The mean rmse of rating prediction for test dataset is 1.0387160987024733


1.0387160987024733
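
The figure above is the mean of per-user RMSE values, which weights every user equally regardless of how many test ratings they have. It generally differs from a single RMSE computed over all test ratings pooled together. A small sketch with hypothetical prediction errors shows the difference:

```python
import numpy as np

# Hypothetical prediction errors (true minus predicted) per user.
errors_by_user = {
    "u1": np.array([0.0, 0.0, 0.0, 0.0]),  # four perfect predictions
    "u2": np.array([2.0]),                 # one prediction off by 2
}

# Per-user RMSE, then averaged across users (the metric used in this notebook).
per_user_rmse = [np.sqrt(np.mean(e ** 2)) for e in errors_by_user.values()]
mean_user_rmse = float(np.mean(per_user_rmse))

# Pooled RMSE over all ratings at once, for comparison.
all_errors = np.concatenate(list(errors_by_user.values()))
pooled_rmse = float(np.sqrt(np.mean(all_errors ** 2)))
print(mean_user_rmse, pooled_rmse)
```

Here the per-user mean is 1.0 while the pooled value is about 0.894, because the pooled metric gives more weight to the user with more ratings.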

## 3. Recommend top 10 unrated books for a user

In [10]:
def recommend_books(sim_matrix,ratings,books_df,userid,N=10):
    rated_bookID = ratings.loc[ratings.userID ==userid]['bookID'].values
    rated_books=books_df[books_df['bookID'].isin(rated_bookID)]
    unrated_books=books_df[~books_df['bookID'].isin(rated_bookID)]
    rating_mapping=defaultdict(int)
    sim_mapping=defaultdict(int)
    rating_pred={}
    for idx_r in rated_books.index:
        rating_r=rated_books[rated_books.index==idx_r]['rating'].values[0]
        for idx_ur in unrated_books.index:
            similarity=sim_matrix[idx_r,idx_ur]
            rating_mapping[idx_ur]+=(rating_r*similarity)
            sim_mapping[idx_ur]+=similarity
    for idx_ur in unrated_books.index:
        rating_pred[idx_ur]=(rating_mapping[idx_ur]/sim_mapping[idx_ur])
    unrated_books['rating_pred']=pd.Series(rating_pred)
    return unrated_books.sort_values('rating_pred',ascending=False).head(N)[['title','rating','rating_pred']]

In [11]:
df_recommend=recommend_books(cosine_similiarity,ratings,books_df,userid='2745288')
df_recommend

| | title | rating | rating_pred |
|---|---|---|---|
| 1917 | Harry Potter Series Box Set | 4.74 | 4.156894 |
| 1916 | Miecz przeznaczenia | 4.31 | 4.14 |
| 787 | Two Classics by Roald Dahl | 4.13 | 4.135806 |
| 892 | A Light in the Attic | 4.34 | 4.096857 |
| 1357 | Harry Potter and the Methods of Rationality | 4.39 | 4.083072 |
| 1709 | Di undici foglie | 4.16 | 4.076558 |
| 573 | Fantastic Beasts and Where to Find Them | 3.99 | 4.07476 |
| 687 | The Bane Chronicles | 4.15 | 4.05878 |
| 81 | Where the Sidewalk Ends | 4.3 | 4.055993 |
| 2314 | Les Fiancés de l'Hiver | 4.17 | 4.05593 |
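
The ranking step relies on `sort_values(..., ascending=False).head(N)`. One detail worth knowing: a book with zero total similarity to the rated books may end up with a `NaN` prediction, and pandas places `NaN` last by default, so such books do not displace real candidates. A minimal sketch with hypothetical titles and predictions:

```python
import numpy as np
import pandas as pd

# Hypothetical predictions; "Book B" has no usable prediction.
preds = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "rating_pred": [4.1, np.nan, 3.9, 4.5],
})

# Sort descending; NaN rows go to the end (na_position="last" is the default).
ranked = preds.sort_values("rating_pred", ascending=False)
top_titles = ranked.head(2)["title"].tolist()
last_title = ranked["title"].iloc[-1]
print(top_titles, last_title)
```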
