# Recommender System
I will now build a basic recommender system for the books that have been scraped.  The idea is that you give it a single book and it returns books you are likely to enjoy as well, based on their similarity to the book that you provided.

## Type
While there are many types of recommender systems, the two most common are *collaborative filters* and *content filters*.

At a high level, collaborative filtering works at the user level.  It takes per-user statistics such as ratings and which items were viewed, and uses them to measure how similar users are to each other.  If one user has interacted with content that a similar user has not, that content is a potential suggestion for the latter.
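
To make the idea concrete, here is a minimal sketch of user-based collaborative filtering.  The users, books and ratings are invented for illustration (they are not part of the scraped data), and `suggest_for` is a hypothetical helper.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical user-item ratings (0 = not rated). Invented for illustration.
ratings = pd.DataFrame(
    [[5, 4, 0, 0],
     [5, 5, 4, 0],
     [0, 0, 5, 4]],
    index=['alice', 'bob', 'carol'],
    columns=['book_a', 'book_b', 'book_c', 'book_d'],
)

# Similarity between users, based on their rating vectors.
user_sim = pd.DataFrame(
    cosine_similarity(ratings), index=ratings.index, columns=ratings.index
)

def suggest_for(user):
    # Most similar other user.
    neighbour = user_sim[user].drop(user).idxmax()
    # Items the neighbour rated that this user has not.
    unseen = (ratings.loc[user] == 0) & (ratings.loc[neighbour] > 0)
    return neighbour, list(unseen[unseen].index)

neighbour, suggestions = suggest_for('alice')
print(neighbour, suggestions)
```

Real collaborative filters work on far larger and sparser matrices, but the principle is the same.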

On the other hand, content filters ignore the user and focus on the similarities between the actual content of the data, such as weighted ratings, similarity of authors, frequency of topics appearing in the description, and so on.  This method requires a direct 'similarity score' between items in order to compute how related they are.

I'm going to go with the **content filtering** method because it best fits the data that I scraped: it describes book content, not user interactions.

In [2]:
import pandas as pd
import numpy as np

# NLP stuff.
import string
from rake_nltk import Rake
from nltk.tokenize import wordpunct_tokenize
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [3]:
book_data = pd.read_csv('./scraper/output/pages-1-100.tsv', sep='\t')

## Remove duplicates
I read on the user forum that the data contains duplicates, and I have spotted a few myself.  I will remove them by common title.  The disadvantage is that a removed entry may contain information that is missing from the first occurrence (which is the one kept by default).

In [4]:
book_data.drop_duplicates(subset='title', inplace=True)

# Resetting the index is VERY important!
# We rely on the index later and if we remove values here, the index will no longer be right.
book_data = book_data.reset_index()
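
A toy example (made-up titles) of why the reset matters: after dropping rows, the index labels keep their gaps, so label-based and position-based lookups disagree until the index is rebuilt.  Note that `reset_index()` also keeps the old labels in a new `index` column, which is why that column appears in the tables further down; passing `drop=True` would discard it.

```python
import pandas as pd

toy = pd.DataFrame({'title': ['A', 'B', 'A', 'C']})  # made-up titles

deduped = toy.drop_duplicates(subset='title')
print(list(deduped.index))     # labels keep the gap left by the dropped row

reset = deduped.reset_index()  # old labels move into an 'index' column
print(list(reset.index))
print(list(reset.columns))
```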

## Weighted rating & top books
We cannot take rating scores directly as they can be imbalanced.  One user rating a book 5/5 is not better than 50,000 people rating it on average 4.5.  We need some kind of algorithm to weight the rating values.

[IMDB's FAQ](https://help.imdb.com/article/imdb/track-movies-tv/ratings-faq/G67Y87TFYYP6TWAV?ref_=helpms_helpart_inline#calculatetop) describes the algorithm that they use to weight the rank of movies and TV shows for the top rated lists.  It reads:

$\text{Weighted Rating (WR)} = (\frac{v}{v+m} \cdot R) + (\frac{m}{v+m} \cdot C)$

where

* $R$ is the average rating for the movie (mean).
* $v$ is the number of votes for the movie.
* $m$ is the minimum number of votes to be listed (25,000 in their case)
* $C$ is the mean vote across the whole report.

We already have access to $R$ and $v$ in the columns directly.  $C$ is something we can compute from the data.  $m$ is something we can configure and tweak.  I'll begin with the 10th percentile, essentially chopping off the bottom part of the data.

In [5]:
C = book_data['avg_rating'].mean()
C

4.052728995578016

In [6]:
m = book_data['num_ratings'].quantile(0.1)
m

2421.9000000000005

In [7]:
def weighted_rating(book, m, C):
    # Average rating for the book.
    R = book['avg_rating']
    # Total number of votes for the book.
    v = book['num_ratings']
    # IMDB formula.
    return (v / (v+m) * R) + (m / (m+v) * C)

# Calculate the weighted rating for books that are above our threshold.
above_threshold = book_data['num_ratings'] > m
book_data.loc[above_threshold, 'weighted_rating'] = book_data[above_threshold].apply(weighted_rating, axis=1, args=(m, C))

# Fill the NaN values (i.e., books at or below our threshold) with a zero score.
# Assigning the result back avoids relying on an in-place fill of a column selection.
book_data['weighted_rating'] = book_data['weighted_rating'].fillna(0)
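
As a quick check of the arithmetic, the sketch below plugs the `C` and `m` values printed above into the formula.  The first book uses the `avg_rating` and `num_ratings` that this notebook reports for *The Complete Calvin and Hobbes*; the second is a hypothetical book with the same average but far fewer ratings.  The standalone function `wr` is only for illustration.

```python
# Values reported in this notebook.
C = 4.052728995578016   # mean rating across all books
m = 2421.9000000000005  # 10th percentile of num_ratings

def wr(R, v, m, C):
    # Same IMDB formula, taking plain numbers instead of a DataFrame row.
    return (v / (v + m)) * R + (m / (v + m)) * C

# "The Complete Calvin and Hobbes": R = 4.82 from v = 33,322 ratings.
popular = wr(4.82, 33322, m, C)

# Hypothetical book: same average rating, only 3,000 ratings.
niche = wr(4.82, 3000, m, C)

print(round(popular, 6), round(niche, 6))
```

The fewer ratings a book has, the further its score is pulled from its own average towards the global mean `C`.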

Using this method, let's eyeball the top and bottom 5 entries (sorted by `weighted_rating`).  These books are 'similar' only in the sense that they are ordered by their weighted rating: books with nearby scores were rated similarly.  However, this is too simple as it doesn't consider what the books are actually about, who wrote them, and so on.

In [8]:
book_data.sort_values('weighted_rating', ascending=False).head(5)

Unnamed: 0,index,title,original_title,series,language,authors,avg_rating,num_ratings,num_reviews,genres,description,url,weighted_rating
1525,1539,The Complete Calvin and Hobbes,The Complete Calvin and Hobbes,Calvin and Hobbes,English,Bill Watterson,4.82,33322,961,"Sequential Art,Comics,Humor,Sequential Art,Gra...",[ Box Set | Book One | Book Two | Book Three...,https://www.goodreads.com/book/show/24812.The_...,4.768012
982,988,Words of Radiance,Words of Radiance,The Stormlight Archive,English,Brandon Sanderson,4.76,172432,10541,"Fantasy,Fiction,Fantasy,Epic Fantasy,Fantasy,H...",From #1 New York Times bestselling author Bran...,https://www.goodreads.com/book/show/17332218-w...,4.750204
6306,6538,"Harry Potter Boxed Set, Books 1-5 (Harry Potte...",,,English,"J.K. Rowling,Mary GrandPré (Illustrator)",4.78,39132,162,"Fantasy,Young Adult,Fiction,Fantasy,Magic",Box Set containing Harry Potter and the Sorcer...,https://www.goodreads.com/book/show/8.Harry_Po...,4.737612
1469,1481,Harry Potter Series Box Set,,Harry Potter,English,J.K. Rowling,4.74,234260,7065,"Fantasy,Young Adult,Fiction","Over 4000 pages of Harry Potter and his world,...",https://www.goodreads.com/book/show/862041.Har...,4.732967
5288,5455,It's a Magical World,It's a Magical World,Calvin and Hobbes,English,Bill Watterson,4.76,25119,334,"Sequential Art,Comics,Humor,Fiction,Sequential...",When cartoonist Bill Watterson announced that ...,https://www.goodreads.com/book/show/24814.It_s...,4.697804


In [9]:
book_data.sort_values('weighted_rating', ascending=False).tail(5)

Unnamed: 0,index,title,original_title,series,language,authors,avg_rating,num_ratings,num_reviews,genres,description,url,weighted_rating
6308,6540,Awakening Inner Guru,,,English,"Banani Ray,Amit Ray",4.78,104,24,"Spirituality,Inspirational,Self Help",Awakening Inner Guru is a clear and straightfo...,https://www.goodreads.com/book/show/8596181-aw...,0.0
6302,6534,30 Pieces of Gold: Self Growth - How to use In...,,,English,"Ron Millicent,Millie Parker (Editor)",4.31,128,1,"Novels,Inspirational,Contemporary,Adult,Self H...",Inspirational Quotes – Hah - Do They Really Wo...,https://www.goodreads.com/book/show/27467291-3...,0.0
6291,6520,The Pace,The Pace,The Pace,English,Shelena Shorts,3.7,1409,258,"Young Adult,Fantasy,Romance,Fantasy,Paranormal...",Weston Wilson is not immortal and he is of thi...,https://www.goodreads.com/book/show/6599113-th...,0.0
6282,6511,A Midnight Clear,A Midnight Clear,,English,William Wharton,4.18,1391,66,"Fiction,Historical,Historical Fiction,War,War,...",Set in the Ardennes Forest on Christmas Eve 19...,https://www.goodreads.com/book/show/720234.A_M...,0.0
4749,4890,Death of the Body,,Crossing Death,English,Rick Chiantaretto,3.82,217,74,"Fantasy,Fantasy,Paranormal,Fantasy,Urban Fanta...",I grew up in a world of magic. By the time I w...,https://www.goodreads.com/book/show/18624197-d...,0.0


In [10]:
# A little cleanup.
del C
del m

## Content-Based Recommender System
Now let's get to building the recommender.  It will be based on the content, so we will be creating an amalgam of features per book that will be used to calculate the similarity score between books.

Values I'm thinking of using include the title, series that it belongs to (if any), language, author(s), genres, and of course we can identify keywords from the book's description.

Instead of treating each feature equally, we can give a feature more weight by repeating its words in the text that is vectorised to calculate similarity.
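
A small sketch of that weighting trick, using two made-up 'soups'.  Repeating a token raises its count in the `CountVectorizer` output, so it contributes more to the cosine similarity.  The strings, the helper `similarity` and the factor of 3 are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity(author_importance):
    # Two made-up books that share an author token but nothing else.
    author = ['douglasadams'] * author_importance
    book_a = ' '.join(['galaxy', 'towel', 'robot'] + author)
    book_b = ' '.join(['detective', 'horse', 'sofa'] + author)
    counts = CountVectorizer().fit_transform([book_a, book_b])
    return cosine_similarity(counts[0], counts[1])[0, 0]

unweighted = similarity(1)  # author token mentioned once
weighted = similarity(3)    # author token mentioned three times
print(unweighted, weighted)
```

With the author repeated, the shared token dominates both vectors and the two books score as more similar.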

Problems with the approach I have taken below include:

* Genre and language tokens can repeat or overlap (English vs. English), which inflates the importance of that feature.
* The text processing is fairly basic and has not been tested much yet.
* All authors are included blindly.  They could be filtered based on their (Role).
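
Two of these problems could be tackled with small helpers, sketched below.  Both functions (`unique_genres` and `primary_authors`) are hypothetical and are not used in the notebook.  The second assumes that roles always appear as a trailing `(Role)` suffix, as in `Mary GrandPré (Illustrator)`.

```python
import re

def unique_genres(genres):
    # Drop repeated genres while keeping their first-seen order.
    parts = [g.strip() for g in genres.split(',')]
    return list(dict.fromkeys(p for p in parts if p))

def primary_authors(authors):
    # Keep only the authors that have no trailing "(Role)" suffix.
    names = [a.strip() for a in authors.split(',')]
    return [n for n in names if n and not re.search(r'\([^)]*\)\s*$', n)]

print(unique_genres('Fantasy,Young Adult,Fiction,Fantasy,Magic'))
print(primary_authors('J.K. Rowling,Mary GrandPré (Illustrator)'))
```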

In [11]:
# Stopwords and punctuation to drop; built once rather than on every call.
STOP_TOKENS = set(stopwords.words('english')) | set(string.punctuation)

# Takes a string and returns a list of its processed words.
def clean_string(s):
    return [n for n in wordpunct_tokenize(s.lower()) if n not in STOP_TOKENS]

def create_soup(x):
    title_importance = 1
    language_importance = 1
    series_importance = 1
    authors_importance = 1
    genres_importance = 1

    soup = ''
    
    # Keywords from description.
    desc = x['description']
    if pd.notna(desc):
        rake = Rake()
        rake.extract_keywords_from_text(desc)
        desc_soup = ' '.join(list(rake.get_word_degrees().keys()))
        soup = ' '.join(filter(None, [soup, desc_soup]))
    
    # Title.
    title_soup = ' '.join(clean_string(x['title']) * title_importance)
    soup = ' '.join(filter(None, [soup, title_soup]))
    
    # Language.
    language = x['language']
    if pd.notna(language):
        language_soup = ' '.join(clean_string(language) * language_importance)
        soup = ' '.join(filter(None, [soup, language_soup]))
    
    # Series.
    series = x['series']
    if pd.notna(series):
        series_soup = ' '.join(clean_string(series) * series_importance)
        soup = ' '.join(filter(None, [soup, series_soup]))

    # Authors.
    authors = x['authors']
    if pd.notna(authors):
        # Punctuation is kept on purpose so that any (Role) suffix is retained; only the spaces
        # inside each name are removed. Providing it's consistent across entries, this should work.
        author_soup = ' '.join([a.lower().replace(' ', '') for a in authors.split(',')] * authors_importance)
        soup = ' '.join(filter(None, [soup, author_soup]))
    
    # Genres.
    genres = x['genres']
    if pd.notna(genres):
        # Almost the same treatment as authors (strip spaces to make matching a bit more likely).
        genre_soup = ' '.join([g.lower().replace(' ', '') for g in genres.split(',')] * genres_importance)
        soup = ' '.join(filter(None, [soup, genre_soup]))
    
    return soup

book_data['soup'] = book_data.apply(create_soup, axis=1)

In [12]:
book_data.soup.head()

0    wife cast pages weaknesses remembered -- geniu...
1    novel arc fictional episodes neither clearly d...
2    become history space mind david mitchell combi...
3    thousands prelude world third caper unlike dis...
4    older son denise final family christmas great ...
Name: soup, dtype: object

Now it's time to create the similarity matrix between all books based on our lovely steaming soup.
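
For reference, the cosine similarity of two count vectors $a$ and $b$ is $\frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$.  The sketch below checks a hand computation against scikit-learn on three made-up soups (the strings are for illustration only).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up soups, for illustration only.
soups = [
    'space robot humor douglasadams',
    'space robot adventure',
    'romance paris summer',
]

counts = CountVectorizer().fit_transform(soups)
sim = cosine_similarity(counts)  # shape (3, 3), symmetric

# Manual computation for the first two rows.
a = counts[0].toarray().ravel()
b = counts[1].toarray().ravel()
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(sim.round(3))
print(manual)
```

Books that share no tokens score 0, and every book scores 1 against itself, which is why the first hit is skipped when recommending.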

In [13]:
count_vec = CountVectorizer()
count_matrix = count_vec.fit_transform(book_data['soup'])

# Pairwise cosine similarity between all books (a dense n x n matrix).
cos_sim = cosine_similarity(count_matrix, count_matrix)

In [14]:
# Reverse lookup of title vs. index.
title_to_index = pd.Series(book_data.index, index=book_data['title'])

def get_recommendation(title):
    idx = title_to_index[title]
    print(idx)
    print(book_data.loc[idx].soup)

    # Rank every book by similarity to the query, most similar first.
    scores = pd.Series(cos_sim[idx]).sort_values(ascending=False)
    # Skip position 0 (expected to be the query book itself) and keep the next 10.
    top_scores = scores.iloc[1:11]
    print(top_scores)
    return book_data.iloc[list(top_scores.index)]

# get_recommendation('Harry Potter and the Chamber of Secrets')
get_recommendation("The Hitchhiker's Guide to the Galaxy")

1719
plucked seconds guide dynamic pair begin headed make way formally tricia mcmillan ), pick planet galactic freeway friend ford prefect (" journey galaxy two interstellar hitchhiker demolished girlfriend work actor arthur tried brilliant armed ex full lunch president zaphod last fifteen years obsessed former graduate student revised edition hippie chronically depressed robot fellow travelers trillian upon paranoid dent together posing veet voojagig ballpoint pens massively useful thing totally ") time zone bought cocktail party towel marvin three disappearance beeblebrox — space aided earth researcher quotes hitchhiker guide galaxy english hitchhiker guide galaxy douglasadams sciencefiction fiction humor fantasy classics
929     0.440922
1623    0.413869
3569    0.334219
2871    0.332385
814     0.318251
265     0.217918
3703    0.186210
613     0.184703
9185    0.155902
8852    0.144139
dtype: float64


Unnamed: 0,index,title,original_title,series,language,authors,avg_rating,num_ratings,num_reviews,genres,description,url,weighted_rating,soup
929,934,The Restaurant at the End of the Universe,The Restaurant at the End of the Universe,Hitchhiker's Guide to the Galaxy,English,Douglas Adams,4.22,210747,5133,"Science Fiction,Fiction,Humor,Fantasy,Humor,Co...",Alternate Cover Edition ISBN 0345418921 (ISBN1...,https://www.goodreads.com/book/show/8695.The_R...,4.2181,guide hitch hiker arms place space powered com...
1623,1638,The Ultimate Hitchhiker's Guide to the Galaxy,The Ultimate Hitchhiker's Guide: Five Complete...,Hitchhiker's Guide to the Galaxy,English,Douglas Adams,4.38,265641,4980,"Science Fiction,Fiction,Humor,Fantasy,Classics","At last in paperback in one complete volume, h...",https://www.goodreads.com/book/show/13.The_Ult...,4.377043,guide space universe galaxy fish saved sick sp...
3569,3657,The Hitchhiker's Guide to the Galaxy: A Trilog...,The Hitchhiker's Guide to the Galaxy: The Tril...,Hitchhiker's Guide to the Galaxy,English,Douglas Adams,4.51,33696,523,"Science Fiction,Fiction,Humor,Fantasy",Charting the whole of Arthur Dent's odyssey th...,https://www.goodreads.com/book/show/841628.The...,4.479338,guide space things mind universe make way gala...
2871,2922,Mostly Harmless,Mostly Harmless,Hitchhiker's Guide to the Galaxy,English,Douglas Adams,3.97,99423,2318,"Science Fiction,Fiction,Humor,Fantasy,Humor,Co...",It’s easy to get disheartened when your planet...,https://www.goodreads.com/book/show/360.Mostly...,3.971967,crashes guide enjoy life space total multidime...
814,817,"Life, the Universe and Everything","Life, the Universe and Everything",Hitchhiker's Guide to the Galaxy,English,Douglas Adams,4.2,182318,3099,"Science Fiction,Fiction,Humor,Fantasy",The unhappy inhabitants of planet Krikkit are ...,https://www.goodreads.com/book/show/8694.Life_...,4.198069,killer robots planet krikkit best friend slart...
265,266,"So Long, and Thanks for All the Fish","So Long, and Thanks for All the Fish",Hitchhiker's Guide to the Galaxy,English,Douglas Adams,4.09,132863,2578,"Science Fiction,Fiction,Humor,Fantasy,Humor,Co...",Including everything you wanted to know about ...,https://www.goodreads.com/book/show/8698.So_Lo...,4.089333,making life flash wanted lost arthur dent let ...
3703,3793,The Salmon of Doubt,The Salmon of Doubt: Hitchhiking the Galaxy On...,Dirk Gently,English,Douglas Adams,3.93,24993,940,"Science Fiction,Fiction,Humor,Fantasy",Douglas Adams changed the face of science fict...,https://www.goodreads.com/book/show/359.The_Sa...,3.940842,guide essays genghis khan — warrior headed ali...
613,616,Journey to the Center of the Earth,Voyage au centre de la Terre,Extraordinary Voyages,English,Jules Verne,3.86,135876,4637,"Classics,Science Fiction,Fiction,Adventure,Fan...",The intrepid Professor Liedenbrock embarks upo...,https://www.goodreads.com/book/show/32829.Jour...,3.863375,quaking nephew axel prehistoric proportions de...
9185,9662,Earth,Earth,Earth,English,David Brin,3.92,6716,278,"Science Fiction,Fiction,Science Fiction Fantas...","TIME IS RUNNING OUT Decades from now, an artif...",https://www.goodreads.com/book/show/96471.Earth,3.955178,human inhabitants become extinct start fallen ...
8852,9306,The Golden Torc,The Golden Torc,Saga of the Pliocene Exile,English,Julian May,4.12,6297,122,"Science Fiction,Fantasy,Fiction","By A.D. 2110 nearly 100,000 humans had fled th...",https://www.goodreads.com/book/show/1018538.Th...,4.101314,six million b freedom end humans tanu would re...


I will now write the processed data to some pickle files so that it can be loaded elsewhere.

In [18]:
import pickle

# Set to True to write the files. The output below is from a run with the flag enabled.
should_export = False

if should_export:
    # Book data.
    print('Exporting book data...', end='')
    with open('book_data.pickle', 'wb') as f:
        pickle.dump(book_data, f)
    print('done!')

    # Cosine similarity (warning: this will be huge).
    print('Exporting similarity matrix...', end='')
    with open('cossim.pickle', 'wb') as f:
        pickle.dump(cos_sim, f)
    print('done!')

Exporting book data...done!
Exporting similarity matrix...done!
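
Reading the files back elsewhere could look like the sketch below.  To stay self-contained it round-trips a tiny stand-in DataFrame and matrix through a temporary directory rather than the real `book_data.pickle` and `cossim.pickle`.  Only unpickle files you created yourself, since `pickle` can execute arbitrary code on load.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

# Stand-ins for book_data and cos_sim (made-up values).
demo_books = pd.DataFrame({'title': ['A', 'B']})
demo_sim = np.array([[1.0, 0.5], [0.5, 1.0]])

with tempfile.TemporaryDirectory() as tmp:
    books_path = os.path.join(tmp, 'book_data.pickle')
    sim_path = os.path.join(tmp, 'cossim.pickle')

    # Write, closing each file handle when done.
    with open(books_path, 'wb') as f:
        pickle.dump(demo_books, f)
    with open(sim_path, 'wb') as f:
        pickle.dump(demo_sim, f)

    # Read back.
    with open(books_path, 'rb') as f:
        loaded_books = pickle.load(f)
    with open(sim_path, 'rb') as f:
        loaded_sim = pickle.load(f)

print(loaded_books.shape, loaded_sim.shape)
```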


In [14]:
# DEBUG: Easy way to find the rows of books I know.
book_data.loc[book_data.title.str.contains('Hitchhiker')]

Unnamed: 0,index,title,original_title,series,language,authors,avg_rating,num_ratings,num_reviews,genres,description,url,weighted_rating,soup
1623,1638,The Ultimate Hitchhiker's Guide to the Galaxy,The Ultimate Hitchhiker's Guide: Five Complete...,Hitchhiker's Guide to the Galaxy,English,Douglas Adams,4.38,265641,4980,"Science Fiction,Fiction,Humor,Fantasy,Classics","At last in paperback in one complete volume, h...",https://www.goodreads.com/book/show/13.The_Ult...,4.377043,total obliteration together mannered arthur de...
1719,1739,The Hitchhiker's Guide to the Galaxy,The Hitchhiker's Guide to the Galaxy,Hitchhiker's Guide to the Galaxy,English,Douglas Adams,4.22,1281495,26801,"Science Fiction,Fiction,Humor,Fantasy,Classics",Seconds before the Earth is demolished to make...,https://www.goodreads.com/book/show/386162.The...,4.219684,disappearance massively useful thing interstel...
3569,3657,The Hitchhiker's Guide to the Galaxy: A Trilog...,The Hitchhiker's Guide to the Galaxy: The Tril...,Hitchhiker's Guide to the Galaxy,English,Douglas Adams,4.51,33696,523,"Science Fiction,Fiction,Humor,Fantasy",Charting the whole of Arthur Dent's odyssey th...,https://www.goodreads.com/book/show/841628.The...,4.479338,one thursday lunchtime 3 thinks large hitchhik...
