# Book Recommender --- Part 2 (Weighted Rating, Naive Bayes)

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)


import re
from tqdm import tqdm_notebook as tqdm
import collections
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

from sklearn.model_selection import train_test_split

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\YaoDe\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# 1. Weighted rating approach

On goodreads.com, each book on a list has a score, which the site describes as follows: "A book’s total score is based on multiple factors, including the number of people who have voted for it and how highly those voters ranked the book." I could not find the formula Goodreads uses, so I use the weighted rating formula applied by IMDB instead:

Weighted Rating (WR) = $\left(\frac{v}{v + M} \cdot r\right) + \left(\frac{M}{v + M} \cdot C\right)$

where,
* *v* is the number of ratings for the book
* *M* is the minimum number of ratings required to be listed in the chart
* *r* is the average rating of the book
* *C* is the average rating across the whole dataset
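
As a quick illustration of how the formula behaves, here is a minimal sketch with made-up numbers. The values `M_demo` and `C_demo` are hypothetical and are not the ones computed from the dataset. A book with few ratings is pulled toward the global mean *C*, while a heavily rated book stays close to its own average *r*.

```python
def weighted_rating_example(v, r, M, C):
    """IMDB-style weighted rating for a single book."""
    return (v / (v + M)) * r + (M / (v + M)) * C

# Hypothetical chart settings, chosen only for illustration
M_demo = 1000
C_demo = 4.0

# Three books with the same average rating but different rating counts
few = weighted_rating_example(v=10, r=5.0, M=M_demo, C=C_demo)
equal = weighted_rating_example(v=1000, r=5.0, M=M_demo, C=C_demo)
many = weighted_rating_example(v=1_000_000, r=5.0, M=M_demo, C=C_demo)

print(few, equal, many)
```

When *v* equals *M*, the score is exactly halfway between *r* and *C*.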

I will use the above formula to generate a weighted rating score for each book:


In [2]:
books_df=pd.read_csv('./goodreads/books.csv')
books_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2971 entries, 0 to 2970
Data columns (total 8 columns):
bookID          2971 non-null int64
title           2971 non-null object
rating          2971 non-null float64
authorName      2971 non-null object
authorID        2971 non-null int64
ratingCount     2971 non-null int64
reviewCount     2971 non-null int64
descriptions    2971 non-null object
dtypes: float64(1), int64(4), object(3)
memory usage: 185.8+ KB


In [3]:
books_df.describe()

| | bookID | rating | authorID | ratingCount | reviewCount |
|---|---|---|---|---|---|
| count | 2971.0 | 2971.0 | 2971.0 | 2971.0 | 2971.0 |
| mean | 10496790.0 | 3.992117 | 3279136.0 | 119726.3 | 5255.322114 |
| std | 12305160.0 | 0.287601 | 4742688.0 | 373931.1 | 11818.955709 |
| min | 1.0 | 1.99 | 4.0 | 1.0 | 0.0 |
| 25% | 184456.5 | 3.81 | 19564.0 | 1230.5 | 154.5 |
| 50% | 6499709.0 | 3.98 | 504038.0 | 15774.0 | 1287.0 |
| 75% | 17569530.0 | 4.16 | 5136597.0 | 74806.0 | 4864.0 |
| max | 55098690.0 | 5.0 | 20611760.0 | 6969115.0 | 171311.0 |


In [4]:
books_df.head()

| | bookID | title | rating | authorName | authorID | ratingCount | reviewCount | descriptions |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Harry Potter and the Sorcerer's Stone | 4.47 | J.K. Rowling | 1077326 | 6969115 | 111108 | Harry Potter's life is miserable. His parents ... |
| 1 | 28187 | The Lightning Thief | 4.25 | Rick Riordan | 15872 | 1969008 | 60352 | Alternate cover for this ISBN can be found her... |
| 2 | 41865 | Twilight | 3.6 | Stephenie Meyer | 941441 | 4923599 | 104036 | About three things I was absolutely positive.F... |
| 3 | 2767052 | The Hunger Games | 4.33 | Suzanne Collins | 153394 | 6325313 | 171311 | WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE... |
| 4 | 3636 | The Giver | 4.13 | Lois Lowry | 2493 | 1766957 | 64956 | Twelve-year-old Jonas lives in a seemingly ide... |


In [5]:
C=books_df['rating'].mean()
print(f'The average rating across the whole dataset C={C}')
#Keep only the books whose number of ratings is above the 80th percentile (the top 20% by rating count)
M=books_df['ratingCount'].quantile(0.8)
print(f'The minimum number of ratings required to be listed in the chart M= {M}')
books_chart=books_df[books_df['ratingCount']>M].copy()

The average rating across the whole dataset C=3.99211713227869
The minimum number of ratings required to be listed in the chart M= 104203.0


In [6]:
# The books that qualify for the chart
books_chart.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 594 entries, 0 to 2960
Data columns (total 8 columns):
bookID          594 non-null int64
title           594 non-null object
rating          594 non-null float64
authorName      594 non-null object
authorID        594 non-null int64
ratingCount     594 non-null int64
reviewCount     594 non-null int64
descriptions    594 non-null object
dtypes: float64(1), int64(4), object(3)
memory usage: 41.8+ KB


In [7]:
def weighted_rating(x):
    v=x['ratingCount']
    r=x['rating']
    return (v/(v+M) * r) + (M/(M+v) * C)

In [8]:
books_chart['WR']=books_chart.apply(weighted_rating,axis=1)
books_chart.sort_values('WR',ascending=False).head(100)

| | bookID | title | rating | authorName | authorID | ratingCount | reviewCount | descriptions | WR |
|---|---|---|---|---|---|---|---|---|---|
| 2204 | 5 | Harry Potter and the Prisoner of Azkaban | 4.57 | J.K. Rowling | 1077326 | 2772791 | 54787 | Harry Potter's third year at Hogwarts is full ... | 4.549069 |
| 2335 | 1 | Harry Potter and the Half-Blood Prince | 4.57 | J.K. Rowling | 1077326 | 2412888 | 39066 | The war against Voldemort is not going well; e... | 4.546077 |
| 2506 | 6 | Harry Potter and the Goblet of Fire | 4.56 | J.K. Rowling | 1077326 | 2566929 | 46004 | Harry Potter is midway through his training as... | 4.537846 |
| 1918 | 862041 | Harry Potter Series Box Set | 4.74 | J.K. Rowling | 1077326 | 249074 | 7336 | Over 4000 pages of Harry Potter and his world,... | 4.519404 |
| 1621 | 17332218 | Words of Radiance | 4.76 | Brandon Sanderson | 38550 | 202205 | 12234 | Words of Radiance, Book Two of the Stormlight ... | 4.498859 |
| 957 | 7235533 | The Way of Kings | 4.64 | Brandon Sanderson | 38550 | 294873 | 19668 | According to mythology mankind used to live in... | 4.470831 |
| 1570 | 62291 | A Storm of Swords | 4.54 | George R.R. Martin | 346732 | 661064 | 22116 | An alternate cover for this isbn can be found ... | 4.465397 |
| 0 | 3 | Harry Potter and the Sorcerer's Stone | 4.47 | J.K. Rowling | 1077326 | 6969115 | 111108 | Harry Potter's life is miserable. His parents ... | 4.462960 |
| 306 | 186074 | The Name of the Wind | 4.53 | Patrick Rothfuss | 108424 | 705793 | 41579 | Told in Kvothe's own voice, this is the tale o... | 4.460803 |
| 1140 | 18512 | The Return of the King | 4.53 | J.R.R. Tolkien | 656983 | 671906 | 9971 | Alternate cover edition here.The Companions of... | 4.457782 |


The top of the 100-book chart is shown above. The Harry Potter books have the highest weighted ratings among the 2971 books in the dataset.
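
The row-wise `apply` used for the chart can also be written as a single vectorized expression, which is usually faster on larger tables. The sketch below runs on a toy DataFrame with made-up numbers; `demo`, `M_demo` and `C_demo` are hypothetical stand-ins for `books_chart`, `M` and `C`.

```python
import pandas as pd

# Toy stand-in for books_chart; all values are made up for illustration
demo = pd.DataFrame({
    'ratingCount': [200, 1000, 50000],
    'rating': [4.8, 4.2, 4.4],
})
M_demo = 1000  # hypothetical minimum number of ratings
C_demo = 4.0   # hypothetical global mean rating

v = demo['ratingCount']
r = demo['rating']

# Vectorized form of (v/(v+M))*r + (M/(v+M))*C
demo['WR'] = (v * r + M_demo * C_demo) / (v + M_demo)

# Row-wise version, kept only to check that both forms agree
demo['WR_apply'] = demo.apply(
    lambda x: (x['ratingCount'] / (x['ratingCount'] + M_demo)) * x['rating']
    + (M_demo / (x['ratingCount'] + M_demo)) * C_demo,
    axis=1,
)

print(demo.sort_values('WR', ascending=False))
```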

# 2. Naive Bayes
Each book has a description in the metadata files. It is a summary or excerpt of the book that gives a rough idea of its topic, theme, or style. I will use a Naive Bayes approach, trained on these descriptions, to predict each user's book ratings.

<ol>
<li>First, normalize the descriptions: convert uppercase letters to lowercase, remove all characters that are neither letters nor digits, then split each description into a word list.</li>

<li>Then, generate a dictionary that maps each qualified word to its index. A word qualifies if it appears in more than 10 descriptions in the whole dataset, is not a stop word, and is longer than one character.</li>

<li>With the dictionary, transform each word list into a vector. The length of the vector is the length of the dictionary, and the ith element of the vector is the frequency of the ith dictionary word in the description.</li>

<li>For a specific user, gather all the books the user has rated, transform the descriptions of these books into a matrix (each row of the matrix is the vector for one book description), and create labels (label=1 if rating>=4, 0 otherwise).</li>

<li>Train a Naive Bayes model on the matrix and labels, then use the trained model to compute a prediction score for each book in the whole books dataset for this user. Finally, convert these scores into labels (label=1 if score>=threshold, else 0).</li>
</ol>
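
The first three steps can be sketched on a toy corpus as follows. This is a minimal illustration using only the standard library, under stated assumptions: the three descriptions, the minimum document frequency of 2, and the tiny hand-written stop list are all made up, whereas the notebook uses a cutoff of more than 10 descriptions and the NLTK English stop words.

```python
import re
import collections

toy_descriptions = [
    "A young wizard goes to school. Magic!",
    "The wizard returns to the magic school.",
    "A detective story without any magic.",
]

def normalize(text):
    # Step 1: drop punctuation, lowercase, split on whitespace
    return re.sub(r'[^\w\s]', '', text).lower().split()

toy_briefs = [normalize(d) for d in toy_descriptions]

# Step 2: count in how many descriptions each word appears
doc_freq = collections.Counter(w for brief in toy_briefs for w in set(brief))

toy_stop_words = {'a', 'the', 'to'}  # hypothetical stop list
toy_vocab = {}
for word in sorted(doc_freq):
    if doc_freq[word] >= 2 and word not in toy_stop_words and len(word) > 1:
        toy_vocab[word] = len(toy_vocab)

# Step 3: one count vector per description
toy_vectors = []
for brief in toy_briefs:
    vec = [0] * len(toy_vocab)
    for w in brief:
        if w in toy_vocab:
            vec[toy_vocab[w]] += 1
    toy_vectors.append(vec)

print(toy_vocab)
print(toy_vectors)
```

Words are sorted before indexing only to make the toy vocabulary order deterministic.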

In [9]:
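def generate_vocabulary(briefs):
    # Count the number of descriptions each word appears in (document frequency)
    word_counts = collections.defaultdict(int)
    for brief in briefs:
        for word in set(brief):
            word_counts[word] += 1
    # Build the stop-word set once instead of reloading the list for every word
    stop_words = set(stopwords.words('english'))
    vocabulary = {}
    for word, count in word_counts.items():
        if count>10 and word not in stop_words and len(word) > 1:
            next_index = len(vocabulary)
            vocabulary[word] = next_index

    return vocabulary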

In [10]:
def words_to_matirx(briefs,vocabulary):
    matrix = np.zeros((len(briefs), len(vocabulary)))
    for i, brief in enumerate(briefs):
        for word in brief:
            if word in vocabulary:
                matrix[i, vocabulary[word]] += 1

    return matrix

In [11]:
books_df=pd.read_csv('./goodreads/books.csv')
books_df['bookID']=books_df['bookID'].astype('str')
books_df['authorID']=books_df['authorID'].astype('str')
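#remove the non-letter and non-numeric characters, convert uppercase characters to lowercase, split each description into words
#regex=True is passed explicitly because newer pandas versions treat the pattern as a literal string by default
books_df['briefs']=books_df['descriptions'].str.replace(r'[^\w\s]','',regex=True).str.lower().str.split()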

In [12]:
vocabulary=generate_vocabulary(books_df['briefs'])
len(vocabulary)

3097

In [13]:
matrix=words_to_matirx(books_df['briefs'],vocabulary)
matrix.shape

(2971, 3097)

In [14]:
books_df['label']= [1 if x >=4 else 0 for x in books_df['rating']]

In [15]:
def fit_naive_bayes_model(matrix, labels):
    model = {}
    phi = (1. * sum(labels) / len(labels))*0.95+0.05*0.5
    model['logphi_0'] = np.log(1.-phi)
    model['logphi_1'] = np.log(phi)
    theta_0 = (matrix[labels == 0]).sum(axis=0) + 1
    theta_1 = (matrix[labels == 1]).sum(axis=0) + 1
    theta_0 /= theta_0.sum()
    theta_1 /= theta_1.sum()
    model['logtheta_0'] = np.log(theta_0)
    model['logtheta_1'] = np.log(theta_1)
    return model
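
Written out, the estimates computed by `fit_naive_bayes_model` appear to be the following. The symbols are notation introduced here for illustration, not the notebook's: $n$ is the number of training books, $y_i$ the label of book $i$, $x_{ij}$ the count of vocabulary word $j$ in book $i$, and $V$ the vocabulary size.

$$\phi = 0.95 \cdot \frac{1}{n}\sum_{i=1}^{n} y_i + 0.05 \cdot 0.5$$

$$\theta_{j \mid c} = \frac{1 + \sum_{i:\,y_i = c} x_{ij}}{\sum_{k=1}^{V}\left(1 + \sum_{i:\,y_i = c} x_{ik}\right)}, \qquad c \in \{0, 1\}$$

The class prior $\phi$ is shrunk toward 0.5, so it always lies between 0.025 and 0.975, and the add-one (Laplace) term keeps every word probability strictly positive. Both choices keep the logarithms finite, even for a user whose training labels are all 0 or all 1.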

In [16]:
def predict_from_naive_bayes_model(model, matrix):
    logphi_0 = model['logphi_0']
    logphi_1 = model['logphi_1']
    logtheta_0 = model['logtheta_0']
    logtheta_1 = model['logtheta_1']
    # Log joint score of each book under class 0 and class 1
    logprobs_0 = (matrix * logtheta_0).sum(axis=1) + logphi_0
    logprobs_1 = (matrix * logtheta_1).sum(axis=1) + logphi_1
    # Ratio of log scores, not a posterior probability. Both log scores are
    # negative, so the ratio lies in (0, 1) and is below 0.5 when class 1 has
    # the higher log score.
    output = (logprobs_1/(logprobs_1+logprobs_0))
    return output
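
For comparison, a standard way to turn two log joint scores into a posterior probability for class 1 is to normalize them with a log-sum-exp. The sketch below is an alternative to the ratio returned above, not the notebook's method. `posterior_from_log_scores` is a hypothetical helper name, and switching to it would require re-tuning the threshold range used later.

```python
import numpy as np

def posterior_from_log_scores(logprobs_0, logprobs_1):
    """P(class 1 | x) from the two log joint scores, computed stably."""
    # log(exp(l0) + exp(l1)) without underflow for very negative scores
    log_norm = np.logaddexp(logprobs_0, logprobs_1)
    return np.exp(logprobs_1 - log_norm)

# Toy log scores; the last pair is very negative to show numerical stability
demo_log0 = np.array([np.log(0.1), np.log(0.2), -1000.0])
demo_log1 = np.array([np.log(0.3), np.log(0.2), -1000.0])

demo_posterior = posterior_from_log_scores(demo_log0, demo_log1)
print(demo_posterior)
```

With this form, a larger value always means that class 1 is more likely, and 0.5 is the natural default threshold.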

In [17]:
ratings=pd.read_csv('./goodreads/final_ratings.csv')
ratings['bookID']=ratings['bookID'].astype('str')
ratings['userID']=ratings['userID'].astype('str')
ratings['user_count']=ratings.groupby('userID')['userID'].transform('count') 
ratings=ratings[ratings['user_count']>100].copy() #only train a model for users who have rated more than 100 books
ratings['label']=[1 if x>=4 else 0 for x in ratings['rating']] # create the labels

In [18]:
userid_list=ratings['userID'].unique()

In [19]:
train, test = train_test_split(ratings,
                               stratify=ratings['userID'], 
                               test_size=0.20,
                               random_state=42)

In [20]:
from sklearn.metrics import accuracy_score
from numpy import arange

def find_threshold(y_true, prob,thresh_min=0.47,thresh_max=0.50):
    best_thresh=0
    best_accuracy=0
    for thresh in arange(thresh_min,thresh_max,0.002):
        y_pred=[1 if x>=thresh else 0 for x in prob]
        accuracy=accuracy_score(y_true,y_pred)
        if accuracy>best_accuracy:
            best_accuracy=accuracy
            best_thresh=thresh
    return best_accuracy,best_thresh
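
The same threshold search can be sketched without the Python loop by evaluating all candidate thresholds at once with numpy broadcasting. This is a minimal illustration on made-up scores; `sweep_threshold` is a hypothetical name, and like the loop above it keeps the first threshold that reaches the best accuracy.

```python
import numpy as np

def sweep_threshold(y_true, scores, candidates):
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores)
    candidates = np.asarray(candidates)
    # One row of predictions per candidate threshold
    preds = scores[None, :] >= candidates[:, None]
    accuracies = (preds == y_true[None, :]).mean(axis=1)
    best = int(np.argmax(accuracies))  # argmax returns the first maximum
    return accuracies[best], candidates[best]

# Toy labels and scores, made up for illustration
demo_labels = [0, 0, 1, 1]
demo_scores = [0.1, 0.2, 0.8, 0.9]
best_acc, best_t = sweep_threshold(demo_labels, demo_scores, np.linspace(0, 1, 11))
print(best_acc, best_t)
```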
            

In [21]:
def train_predict(train,test,books_df,vocabulary,userid_list):
    accuracy_train=[] 
    accuracy_test=[]
    for userid in tqdm(userid_list,desc="train and predict on each user individually"):
        User = train.loc[train.userID ==userid].sort_values('bookID')
        user_brief=User.merge(books_df,on='bookID')['briefs']
        user_matrix=words_to_matirx(user_brief,vocabulary)
        user_label=User.merge(books_df,on='bookID')['label_x']
        model = fit_naive_bayes_model(user_matrix, user_label)
        # `matrix` is the global description matrix for all books built in In [13]
        result = predict_from_naive_bayes_model(model, matrix)
        pred_df=pd.DataFrame({'bookID':books_df['bookID'].values,'result':result})
        #compute the train accuracy
        y_true_train=train[train['userID']==userid].merge(pred_df,on='bookID')['label']
        prob_train=train[train['userID']==userid].merge(pred_df,on='bookID')['result']
        best_accuracy_train,best_thresh_train=find_threshold(y_true_train, prob_train)
        accuracy_train.append(best_accuracy_train)
        #compute the test accuracy using the best threshold found on the train dataset
        y_true_test=test[test['userID']==userid].merge(pred_df,on='bookID')['label']
        prob_test=test[test['userID']==userid].merge(pred_df,on='bookID')['result']
        y_pred_test=[1 if x>=best_thresh_train else 0 for x in prob_test]
        accuracy_test_score=accuracy_score(y_true_test, y_pred_test)
        accuracy_test.append(accuracy_test_score)
    print(f'The mean accuracy of prediction for train dataset is {np.mean(accuracy_train)}')
    print(f'The mean accuracy of prediction for test dataset is {np.mean(accuracy_test)}')    

In [22]:
train_predict(train,test,books_df,vocabulary,userid_list)


HBox(children=(IntProgress(value=0, description='train and predict on each user individually', max=184, style=…


The mean accuracy of prediction for train dataset is 0.6312213344909494
The mean accuracy of prediction for test dataset is 0.6289469510796714


# 3. Conclusion
In this notebook, I first used the weighted rating approach to generate a top-100 book chart. This approach gives the same recommendations to every user: it is not personalized and only gives a general idea of each book's popularity.

Applying a Naive Bayes model to predict ratings is personalized for each user, but the test accuracy is not good (about 0.63 on average).

Next, I will build a book recommender using a collaborative filtering technique.