# Simple Hybrid Recommender
This is a basic hybrid recommendation system that uses the goodbooks-10k data set
stored in our database. The system has three main components:
- Title Search
    - A bookshelf is currently sent without an identifier that is stored in our
      database, and the identifier that _is_ sent can't be used without a call
      to the Google Books API, so the most practical approach is to search for
      the _closest_ book by title and author.
- Content Based System
    - A cosine similarity matrix is built from the book descriptions.
      Its values are combined, as a weighted sum, with the values of a second
      matrix: the cosine similarity of books derived from collaborative filtering.
- Collaborative Filtering
    - User book ratings are used to build a book-by-user ratings matrix,
      from which the second similarity matrix is derived.
      
The recommender will work as follows:
1. A title is searched for via a basic search engine (CountVectorizer plus nearest neighbors)
2. If the title is similar enough, the index of said title will be referenced in 
   a combined similarity matrix
3. The top 10 most similar indices will be returned (along with pertinent information)

In [0]:
import os
import re
import sys
import pickle

import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as bs

from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

### Title Search Function
This step is necessary because we don't get ISBNs in the response body. The ISBNs are also somewhat unreliable due to differing editions, versions etc.

In [4]:
sql = """
SELECT *
FROM goodbooks_books_xml;
"""
con = os.environ["DATABASE_URL"]
books = pd.read_sql(sql, con)
books.head()


Unnamed: 0,id,title,isbn,isbn13,asin,kindle_asin,marketplace_id,country_code,image_url,small_image_url,publication_year,publication_month,publication_day,publisher,language_code,is_ebook,description,work,work_work,work_id,work_books_count,work_best_book_id,work_reviews_count,work_ratings_sum,work_ratings_count,work_text_reviews_count,work_original_publication_year,work_original_publication_month,work_original_publication_day,work_original_title,work_original_language_id,work_media_type,work_rating_dist,work_desc_user_id,work_default_chaptering_book_id,work_default_description_language_code,average_rating,num_pages,format,edition_information,ratings_count,text_reviews_count,url,link,authors,authors_authors,authors_author,authors_id,authors_name,authors_role,authors_image_url,authors_small_image_url,authors_link,authors_average_rating,authors_ratings_count,authors_text_reviews_count,public_document,public_document_public_document,public_document_id,public_document_document_url
0,1162022,On the Jellicoe Road,0670070297,9780670070299,,B00AMH0S8A,A1F83G8C2ARO7P,GB,https://s.gr-assets.com/assets/nophoto/book/11...,https://s.gr-assets.com/assets/nophoto/book/50...,2006,8.0,28.0,Penguin Australia,eng,False,I'm dreaming of the boy in the tree. I tell hi...,\n,\n,6479100,37,1162022,121977,168383,40624,6185,2006,,,On the Jellicoe Road,,book,5:20011|4:11237|3:5957|2:2090|1:1329|total:40624,171430,,,4.14,290,Paperback,1st edition,33421,3771,https://www.goodreads.com/book/show/1162022.On...,https://www.goodreads.com/book/show/1162022.On...,\n,\n,\n,47104,Melina Marchetta,,\nhttps://images.gr-assets.com/authors/1277655...,\nhttps://images.gr-assets.com/authors/1277655...,https://www.goodreads.com/author/show/47104.Me...,4.06,159449,19650,,,,
1,18143968,"I've Got You Under My Skin (Under Suspicion, #1)",147674906X,9781476749068,,B00EBA5P1O,A1F83G8C2ARO7P,GB,https://images.gr-assets.com/books/1397768065m...,https://images.gr-assets.com/books/1397768065s...,2014,4.0,1.0,Simon Schuster,eng,False,When Laurie Moran's husband was brutally murde...,\n,\n,25491291,46,18143968,22253,44529,11889,1224,2014,4.0,1.0,I've Got You Under My Skin,,book,5:2977|4:4367|3:3347|2:937|1:261|total:11889,6355728,,,3.75,303,Hardcover,,8892,1023,https://www.goodreads.com/book/show/18143968-i...,https://www.goodreads.com/book/show/18143968-i...,\n,\n,\n,108774,Alafair Burke,,\nhttps://images.gr-assets.com/authors/1367515...,\nhttps://images.gr-assets.com/authors/1367515...,https://www.goodreads.com/author/show/108774.A...,3.75,70989,8154,,,,
2,25403,The Orange Girl,0753819929,9780753819920,,B004OBZNXU,A1F83G8C2ARO7P,GB,https://images.gr-assets.com/books/1415583796m...,https://images.gr-assets.com/books/1415583796s...,2005,7.0,6.0,,eng,False,'My father died eleven years ago. I was only f...,\n,\n,1015565,85,25403,27592,61869,15820,1358,2003,,,Appelsinpiken,,book,5:5184|4:5624|3:3704|2:1033|1:275|total:15820,43579257,,,3.91,160,,,10362,612,https://www.goodreads.com/book/show/25403.The_...,https://www.goodreads.com/book/show/25403.The_...,\n,\n,\n,191735,James Anderson,Translator,\nhttps://images.gr-assets.com/authors/1284560...,\nhttps://images.gr-assets.com/authors/1284560...,https://www.goodreads.com/author/show/191735.J...,3.8,24511,1832,,,,
3,9914,The Informers,0330339184,9780330339186,,B004FV4T3Y,A1F83G8C2ARO7P,GB,https://images.gr-assets.com/books/1374684746m...,https://images.gr-assets.com/books/1374684746s...,2000,,,MacMillan,en-GB,False,"Set in Los Angeles, in the recent past. The bi...",\n,\n,1308950,53,9914,24808,49718,14685,477,1994,,,The Informers,,book,5:2314|4:4328|3:5346|2:2101|1:596|total:14685,-51,,,3.39,240,Unknown Binding,,12907,298,https://www.goodreads.com/book/show/9914.The_I...,https://www.goodreads.com/book/show/9914.The_I...,\n,\n,\n,2751,Bret Easton Ellis,,\nhttps://images.gr-assets.com/authors/1405340...,\nhttps://images.gr-assets.com/authors/1405340...,https://www.goodreads.com/author/show/2751.Bre...,3.69,323887,14657,,,,
4,39980,"A Year Down Yonder (A Long Way from Chicago, #2)",0142300705,9780142300701,,,,GB,https://s.gr-assets.com/assets/nophoto/book/11...,https://s.gr-assets.com/assets/nophoto/book/50...,2002,11.0,21.0,Puffin Books,eng,False,Mary Alice remembers childhood summers packed ...,\n,\n,39678,30,39980,33163,99867,24366,1637,2000,1.0,1.0,A Year Down Yonder,,book,5:10198|4:8285|3:4425|2:1004|1:454|total:24366,4863716,,,4.1,160,Paperback,,23326,1444,https://www.goodreads.com/book/show/39980.A_Ye...,https://www.goodreads.com/book/show/39980.A_Ye...,\n,\n,\n,22414,Richard Peck,,\nhttps://images.gr-assets.com/authors/1299893...,\nhttps://images.gr-assets.com/authors/1299893...,https://www.goodreads.com/author/show/22414.Ri...,3.91,86021,9084,,,,


In [0]:
# Remove null isbn13s
books = books.dropna(subset=['isbn13'])

In [6]:
# books needs book id from old dataset
csv_books = pd.read_csv('https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/books.csv')
csv_books['id'] = csv_books['goodreads_book_id']
csv_books = csv_books[['id', 'title', 'book_id']]

# Cast books['id'] as type int to match csv data
books['id'] = books['id'].astype(int)

# Merge datasets to match ratings
books = books.merge(csv_books, on=['id', 'title'])

# Drop rows where isbn13 is null (repeated after the merge); isbn13 is needed to search the db
books = books.dropna(subset=['isbn13'])
books.shape

(9405, 61)

In [7]:
# Which column has the actual author name?
books[['authors', 'authors_authors', 'authors_author',
       'authors_id', 'authors_name', 'authors_role']]

Unnamed: 0,authors,authors_authors,authors_author,authors_id,authors_name,authors_role
0,\n,\n,\n,47104,Melina Marchetta,
1,\n,\n,\n,108774,Alafair Burke,
2,\n,\n,\n,191735,James Anderson,Translator
3,\n,\n,\n,2751,Bret Easton Ellis,
4,\n,\n,\n,22414,Richard Peck,
...,...,...,...,...,...,...
9400,\n,\n,\n,38550,Brandon Sanderson,
9401,\n,\n,\n,1654,Terry Pratchett,
9402,\n,\n,\n,27398,Joshua Harris,
9403,\n,\n,\n,14617,Margaret Peterson Haddix,


In [8]:
# Drop duplicate titles (so that all matrices are the same size for comparison)
books = books.drop_duplicates(['title'])
books.shape

(9371, 61)

In [9]:
title_search_terms = books['title'] + ' ' + books['authors_name']
title_search_terms.head()

0                On the Jellicoe Road Melina Marchetta
1    I've Got You Under My Skin (Under Suspicion, #...
2                       The Orange Girl James Anderson
3                      The Informers Bret Easton Ellis
4    A Year Down Yonder (A Long Way from Chicago, #...
dtype: object

In [10]:
# Build a document-term matrix of unigram and bigram counts from title + author;
# max_df=190 drops terms that appear in more than 190 entries
vectorizer = CountVectorizer(ngram_range=(1, 2), max_df=190)
title_term_matrix = vectorizer.fit_transform(title_search_terms)
title_term_matrix.shape

(9371, 47802)

In [11]:
nn = NearestNeighbors(algorithm='brute', metric='cosine')
nn.fit(title_term_matrix)

NearestNeighbors(algorithm='brute', leaf_size=30, metric='cosine',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)
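
To make the search step concrete, here is a dependency-free sketch of what cosine distance over unigram and bigram counts measures. The helper names (`ngram_counts`, `cosine_distance`) are made up for illustration, and the whitespace tokenizer is a simplification: `CountVectorizer` applies its own token pattern and the fitted vocabulary (including the `max_df` cutoff), so the distances it produces will differ.

```python
import math
from collections import Counter


def ngram_counts(text):
    """Count unigrams and bigrams in lowercased, whitespace-split text."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty vector shares nothing with any other vector
    return 1.0 - dot / (norm_a * norm_b)


query = ngram_counts("the informers bret easton ellis")
exact = ngram_counts("The Informers Bret Easton Ellis")
other = ngram_counts("The Orange Girl James Anderson")

print(cosine_distance(query, exact))  # identical terms, so the distance is 0
print(cosine_distance(query, other))  # only "the" is shared, so the distance is near 1
```

A lower distance means a closer match, which is why the search functions below accept a result only when its distance falls under a threshold.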

In [0]:
def closest_titles(title, thresh=1.0):
    """
    Prints the closest title if its cosine distance is below thresh
    
    thresh: a match is printed only if its distance falls below this value
    """
    title = [title]
    title_transformed = vectorizer.transform(title)
    distances, indices = nn.kneighbors(title_transformed)
    distances = distances.flatten()
    indices = indices.flatten()
    nearest = list(zip(distances, indices))[0]
    
    if nearest[0] < thresh:
        i = nearest[1]
        d = nearest[0]
        print("%i | %.3f - %s" % (i, d, books['title'].iloc[i]))
    else:
        print('No good match found. Send to content based')
        
def all_details(title):
    """Prints all closest indices, distances and titles"""
    title = [title]
    title_transformed = vectorizer.transform(title)
    distances, indices = nn.kneighbors(title_transformed)
    distances = distances.flatten()
    indices = indices.flatten()
    nearest = list(zip(distances, indices))
    
    for d, i in nearest:
        print("%i | %.3f - %s" % (i, d, books['title'].iloc[i]))
        
def all_titles(title):
    closest_titles(title, thresh=.631)
    print("\nNeighbors:")
    print("~~~~~~~~~~")
    all_details(title)

Now we have a basic search engine to match most book queries to our 10k data:

In [13]:
all_titles("waking up sam harris")

8662 | 0.388 - Waking Up: A Guide to Spirituality Without Religion

Neighbors:
~~~~~~~~~~
8662 | 0.388 - Waking Up: A Guide to Spirituality Without Religion
5706 | 0.537 - Free Will
6709 | 0.613 - Letter to a Christian Nation
2982 | 0.719 - The End of Faith: Religion, Terror, and the Future of Reason
302 | 0.726 - The Moral Landscape: How Science Can Determine Human Values


### Collaborative Filtering Component
Now that we have a working title search, the ratings table can be used to build a collaborative filtering similarity matrix.

In [14]:
# Get the ratings data
ratings = pd.read_csv('https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/ratings.csv')
ratings.shape

(5976479, 3)

In [15]:
# Copy 'book_id' into 'id' (the merge below joins on 'book_id', so this column is not strictly needed)
ratings['id'] = ratings['book_id']
ratings.head()

Unnamed: 0,user_id,book_id,rating,id
0,1,258,5,258
1,2,4081,4,4081
2,2,260,5,260
3,2,9296,5,9296
4,2,2318,3,2318


In [16]:
# Merge ratings data with books
merged = ratings.merge(books, on='book_id', how='inner')
merged = merged[['user_id', 'rating', 'title', 'book_id']]
merged.head()

Unnamed: 0,user_id,rating,title,book_id
0,1,5,The Shadow of the Wind (The Cemetery of Forgot...,258
1,11,3,The Shadow of the Wind (The Cemetery of Forgot...,258
2,143,4,The Shadow of the Wind (The Cemetery of Forgot...,258
3,242,5,The Shadow of the Wind (The Cemetery of Forgot...,258
4,325,4,The Shadow of the Wind (The Cemetery of Forgot...,258


In [17]:
# Remove rows where both title and user_id are duplicated
merged = merged.drop_duplicates(['user_id', 'title'])
merged.shape

(5765108, 4)

In [0]:
# Create user ratings pivot table and fill null values with zeros
matrix2d = merged.pivot(index='title', columns='user_id', values='rating').fillna(0)

In [19]:
# 2D matrix is ~4gb in memory, need to clear this later
sys.getsizeof(matrix2d)

4005937013

In [0]:
# Compress the matrix since it is so sparse
comp2d = csr_matrix(matrix2d.values)

In [21]:
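# sys.getsizeof only measures the sparse matrix's wrapper object, not its underlying
# arrays, so the figure below understates the real footprint (which is still far smaller)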
print("Size of compressed matrix:", sys.getsizeof(comp2d))

# Keep 2D matrix index for use with content based model (matrices need to match)
book_index = matrix2d.index
del matrix2d

Size of compressed matrix: 56
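
`sys.getsizeof` only sees the Python wrapper of a scipy sparse matrix, which is why it reports a small constant (56 bytes here) whatever the matrix holds. Below is a rough sketch of measuring the real footprint, assuming the CSR layout with its `data`, `indices` and `indptr` arrays. `sparse_nbytes` is a hypothetical helper name and the toy matrix is a stand-in for `comp2d`.

```python
import numpy as np
from scipy.sparse import csr_matrix


def sparse_nbytes(m):
    """Bytes held by the three arrays backing a CSR matrix."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


# Toy stand-in for the ratings matrix: 1,000 x 1,000 with roughly 1% non-zero entries
rng = np.random.default_rng(0)
dense = np.zeros((1000, 1000))
rows = rng.integers(0, 1000, size=10_000)
cols = rng.integers(0, 1000, size=10_000)
dense[rows, cols] = rng.integers(1, 6, size=10_000)

sparse = csr_matrix(dense)
print("dense bytes: ", dense.nbytes)
print("sparse bytes:", sparse_nbytes(sparse))
```

The same helper could be applied to `comp2d` or any other CSR matrix in this notebook to get a comparable number.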


In [22]:
# Construct a cosine similarity matrix
collab_sim = cosine_similarity(comp2d, comp2d)
collab_sim.shape

(9371, 9371)
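
As a toy illustration of what each entry of `collab_sim` means, the sketch below computes the cosine similarity between two books' rating rows, with unrated entries stored as zeros like the `fillna(0)` pivot above. The ratings and the `cosine` helper are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length rating rows."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Rows are books, columns are users; 0 means "not rated"
book_a = [5, 4, 0, 0]
book_b = [4, 5, 0, 0]   # rated by the same users as book_a
book_c = [0, 0, 3, 5]   # rated by a disjoint set of users

print(round(cosine(book_a, book_b), 3))  # 0.976
print(round(cosine(book_a, book_c), 3))  # 0.0
```

Because missing ratings count as zeros, the score is driven largely by which users rated both books: two books with no raters in common always score 0.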

### Content Based Component
This part of the system builds a cosine similarity matrix from the textual descriptions. It uses sklearn's TF-IDF vectorizer out of the box to avoid placing more large files in the Flask application directory.

In [0]:
# Make sure descriptions are limited to the books included in the collaborative model
descriptions = books[books['title'].isin(book_index)]
# Reorder indices so final matrices match
descriptions = descriptions.set_index('title').reindex(book_index)
# Fill null descriptions with some sort of string 
descriptions = descriptions['description'].fillna('None')

In [0]:
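# Some minor cleaning
def get_text(text):
    """Extracts text from html"""
    # Missing values arrive as float NaN
    if isinstance(text, float):
        return "None"
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = bs(text, "html.parser")
    return soup.text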

In [0]:
descriptions = descriptions.str.strip("'")
descriptions = descriptions.apply(get_text)

In [0]:
# Tokenize and vectorize descriptions using TF-IDF
tfidf = TfidfVectorizer(
            stop_words='english', 
            ngram_range=(1, 2),
            min_df=8, max_df=.80)
dtm = tfidf.fit_transform(descriptions)

In [27]:
# linear_kernel is a plain dot product; TF-IDF rows are L2-normalized by default,
# so the dot product equals cosine similarity
content_sim = linear_kernel(dtm, dtm)
content_sim.shape

(9371, 9371)
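
The shortcut works because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so a plain dot product already equals the cosine similarity. Here is a small standard-library check of that identity, using made-up vectors and helper names:

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    """Plain dot product, which is what linear_kernel computes per pair."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity of two raw (unnormalized) vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


u, v = [1.0, 2.0, 0.0], [2.0, 1.0, 1.0]

# Dot product of the normalized vectors matches cosine of the raw vectors
print(math.isclose(dot(l2_normalize(u), l2_normalize(v)), cosine(u, v)))  # True
```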

### Combine Matrices
In this last step, the matrices are weighted and combined into one similarity matrix. The collaborative matrix is weighted more heavily (1.5x), since the rating behavior it captures tells us much more about books than the description alone can. The final score is the weighted sum divided by two, so a book's similarity to itself comes out as 1.25 rather than 1.

In [58]:
hybrid_scores = ((collab_sim * 1.5) + content_sim) / 2
print(hybrid_scores.shape)
print(hybrid_scores[0])

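# Store the hybrid matrix in sparse form. sys.getsizeof only measures the sparse
# wrapper object, not its underlying arrays, so "Compressed size" understates the real footprint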
compressed_hybrid = csr_matrix(hybrid_scores)
print("Compressed size:", sys.getsizeof(compressed_hybrid))
print("Not compressed:", sys.getsizeof(hybrid_scores))

(9371, 9371)
[1.25       0.         0.00620319 ... 0.         0.01037456 0.        ]
Compressed size: 56
Not compressed: 702525240
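
The weighting can be written per pair of books as follows. The function names are illustrative. `hybrid_score` mirrors the expression above, while `normalized_hybrid_score` is an alternative (not what this notebook uses) that divides by the sum of the weights so that scores stay between 0 and 1. Both give the same ranking, since they differ only by a constant factor.

```python
def hybrid_score(collab, content, collab_weight=1.5):
    """Same arithmetic as the notebook: weighted sum divided by two."""
    return (collab * collab_weight + content) / 2


def normalized_hybrid_score(collab, content, collab_weight=1.5):
    """Alternative (assumption): a true weighted average bounded by 1."""
    return (collab * collab_weight + content) / (collab_weight + 1)


# A book compared with itself has similarity 1.0 in both matrices
print(hybrid_score(1.0, 1.0))             # 1.25
print(normalized_hybrid_score(1.0, 1.0))  # 1.0
```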


In [60]:
# Values can be accessed like so:
compressed_hybrid[0].toarray().flatten()

array([1.25      , 0.        , 0.00620319, ..., 0.        , 0.01037456,
       0.        ])

In [0]:
books_identifiers = books[books['title'].isin(book_index)][['title', 'isbn13', 'isbn']]
master_index = books_identifiers.set_index('title').reindex(book_index)

In [0]:
# Redefine the title search function so that it returns the closest book
def closest_title(title_author, thresh=1.0):
    """
    Returns the closest title, or False if no match is close enough
    
    thresh: a match is accepted only if its cosine distance falls below this value
    """
    title_transformed = vectorizer.transform([title_author])
    distances, indices = nn.kneighbors(title_transformed)
    # kneighbors returns results sorted by distance, so the first is the nearest
    d, i = distances.flatten()[0], indices.flatten()[0]
    
    if d < thresh:
        return books['title'].iloc[i]
    else:
        return False

def get_recs(title_author, thresh):
    """Get recommendations based on nearest title"""
    title = closest_title(title_author, thresh=thresh)
    
    if title:
        idx = book_index.tolist().index(title)
        sim_scores = list(enumerate(compressed_hybrid[idx].toarray().flatten()))
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
        sim_scores = sim_scores[1:11]
        indices = [i[0] for i in sim_scores]
        titles = book_index[indices]
        
        return master_index.loc[titles]
    else:
        message = "Send to alternative model"
        return message
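
`get_recs` sorts all of a book's similarity scores and then slices off the top ten. For a single query that is fine; if it ever becomes a bottleneck, a partial selection avoids the full sort. Here is a sketch using the standard library's `heapq.nlargest`, with a hypothetical `top_k` helper that excludes the query index explicitly instead of relying on it sorting first:

```python
import heapq


def top_k(scores, query_idx, k=10):
    """Indices of the k highest scores, skipping the query item itself."""
    candidates = ((s, i) for i, s in enumerate(scores) if i != query_idx)
    return [i for s, i in heapq.nlargest(k, candidates)]


row = [1.25, 0.0, 0.4, 0.9, 0.1]
print(top_k(row, query_idx=0, k=3))  # [3, 2, 4]
```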
    

It would be a good idea to check the recommendations to see if they make sense before committing to using them in the API.

In [62]:
THRESH = .5

print(get_recs('waking up sam harris', THRESH))

                                                           isbn13        isbn
title                                                                        
The Moral Landscape: How Science Can Determine ...  9781439171219  1439171211
The End of Faith: Religion, Terror, and the Fut...  9780393327656  0393327655
Letter to a Christian Nation                        9780307265777  0307265773
Free Will                                           9781451683400  1451683405
god is Not Great: How Religion Poisons Everything   9780446579803  0446579807
10% Happier: How I Tamed the Voice in My Head, ...  9780062265425  0062265423
The Portable Atheist: Essential Readings for th...  9780306816086  0306816083
A Universe from Nothing: Why There Is Something...  9781451624458  145162445X
The Better Angels of Our Nature: Why Violence H...  9780670022953  0670022950
No Place to Run (KGI, #2)                           9780425238196  0425238199


In [63]:
print(get_recs('Where the crawdads sing delia owens', THRESH))

Send to alternative model


In [64]:
print(get_recs('voyager diana gabaldon', THRESH))

                                                           isbn13        isbn
title                                                                        
Drums of Autumn (Outlander, #4)                     9780385335980  0385335989
Dragonfly in Amber (Outlander, #2)                  9780385335973  0385335970
The Fiery Cross (Outlander, #5)                     9780440221661  0440221668
A Breath of Snow and Ashes (Outlander, #6)          9780385340397  0385340397
An Echo in the Bone (Outlander, #7)                 9780752898476  0752898477
Outlander (Outlander, #1)                           9780440242949  0440242940
Written in My Own Heart's Blood (Outlander, #8)     9780385344432  0385344430
Lord John and the Private Matter (Lord John Gre...  9780770429454  0770429459
Lord John and the Brotherhood of the Blade  (Lo...  9780385337496  0385337493
The Scottish Prisoner (Lord John Grey, #3)          9781409135197  1409135195


In [65]:
print(get_recs('The martian andy weir', THRESH))

                                          isbn13        isbn
title                                                       
Ready Player One                   9780307887436  030788743X
The Girl on the Train              9781594633669  1594633665
Station Eleven                     9780385353304  0385353308
All the Light We Cannot See        9781476746586  1476746583
Armada                             9780804137256  0804137250
Gone Girl                          9780297859383  0297859382
Leviathan Wakes (The Expanse, #1)  9781841499888  1841499889
Red Rising (Red Rising, #1)        9780345539786  0345539788
Ender's Game (Ender's Saga, #1)    9780812550702  0812550706
The Ocean at the End of the Lane   9780062255655  0062255657


### Serialize Components
The main hybrid similarity matrix needs to be serialized, along with the master index, so that valid identifiers can be retrieved from either our database or the Google Books API. The components for the search function also need to be pickled. The Flask app essentially uses the database as a sort of cache for books that it has not encountered.

In [0]:
# Pickle search components
book_search_index = books['title'].tolist()

with open('book_search_index.pkl', 'wb') as bsi:
    pickle.dump(book_search_index, bsi)
    
with open('search_neighbors.pkl', 'wb') as sn:
    pickle.dump(nn, sn)
    
with open('search_vectorizer.pkl', 'wb') as sv:
    pickle.dump(vectorizer, sv)

In [0]:
# Pickle Hybrid Model components

with open('master_hybrid_index.pkl', 'wb') as mi:
    pickle.dump(master_index, mi)
    
with open('compressed_sim_matrix.pkl', 'wb') as csm:
    pickle.dump(compressed_hybrid, csm)
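
On the application side, the artifacts would be read back with `pickle.load`. The sketch below round-trips a stand-in object through a temporary directory so that it runs on its own. `load_pickle` is a hypothetical helper, and the real app would point it at the files written above. Pickles should only be loaded from trusted sources, and sklearn objects generally need to be unpickled with the same library version that wrote them.

```python
import os
import pickle
import tempfile


def load_pickle(path):
    """Read one pickled artifact from disk."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# Stand-in for book_search_index (a list of titles)
titles = ['The Informers', 'The Orange Girl']

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'book_search_index.pkl')
    with open(path, 'wb') as f:
        pickle.dump(titles, f)
    restored = load_pickle(path)

print(restored == titles)  # True
```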