# Simple Hybrid Recommender
This is a basic hybrid recommendation system built on the goodbooks-10k data set
stored in our database. The system has three main components:
- Title Search
    - A bookshelf is currently sent without an identifier that is stored in our
      database, and the identifier that _is_ sent can't be used without a call
      to the Google Books API, so the most practical approach is to search for
      the _closest_ book.
- Content Based System
    - A cosine similarity matrix is built from the book descriptions.
      Its values will be multiplied by the weighted values of a second
      matrix: the cosine similarity between books derived from collaborative filtering.
- Collaborative Filtering
    - User book ratings will be used to build a user engagement
      matrix, from which the second similarity matrix will be derived.
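
The two similarity matrices described above could be combined roughly as follows. This is a minimal sketch under stated assumptions, not the notebook's final method: the toy data, the function names `cosine_similarity_matrix` and `combine`, and the `collab_weight` value of 0.5 are all hypothetical, and plain Python lists stand in for the real matrices.

```python
import math

def cosine_similarity_matrix(rows):
    """Pairwise cosine similarity between row vectors (lists of numbers)."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in rows]
    sims = []
    for i, a in enumerate(rows):
        sims.append([])
        for j, b in enumerate(rows):
            dot = sum(x * y for x, y in zip(a, b))
            denom = norms[i] * norms[j]
            # An all-zero row has no direction, so treat its similarity as 0
            sims[i].append(dot / denom if denom else 0.0)
    return sims

def combine(content_sim, collab_sim, collab_weight=0.5):
    """Multiply content similarity by the weighted collaborative similarity."""
    return [
        [c * (collab_weight * f) for c, f in zip(content_row, collab_row)]
        for content_row, collab_row in zip(content_sim, collab_sim)
    ]

# Toy data (made up for illustration): one row per book
description_terms = [  # term counts taken from book descriptions
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
]
user_ratings = [  # ratings from three users, 0 = not rated
    [5, 4, 0],
    [4, 5, 0],
    [0, 1, 5],
]

content_sim = cosine_similarity_matrix(description_terms)
collab_sim = cosine_similarity_matrix(user_ratings)
combined_sim = combine(content_sim, collab_sim)
```

Because the product is element-wise, a pair of books scores highly only when both the descriptions and the rating patterns agree; a pair with no description overlap scores 0 regardless of ratings.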
      
The recommender will work as follows:
1. A title is searched for via a basic search engine (a `CountVectorizer` term matrix queried for nearest neighbors)
2. If the closest title is similar enough, its index is looked up in
   a combined similarity matrix
3. The 10 most similar indices are returned (along with pertinent information)
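
Steps 2 and 3 amount to reading one row of the combined matrix and keeping its largest entries. The sketch below illustrates that lookup; the function name `top_similar` and the 4-book matrix are hypothetical, and the matched book itself is excluded on the assumption that a book should not be recommended back to the reader who searched for it.

```python
def top_similar(similarity, book_index, n=10):
    """Return (index, score) pairs for the n books most similar to book_index."""
    scores = [
        (i, score)
        for i, score in enumerate(similarity[book_index])
        if i != book_index  # skip the searched book itself
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]

# Hypothetical 4-book combined similarity matrix (values made up)
combined_sim = [
    [1.00, 0.39, 0.00, 0.12],
    [0.39, 1.00, 0.05, 0.30],
    [0.00, 0.05, 1.00, 0.61],
    [0.12, 0.30, 0.61, 1.00],
]

recommendations = top_similar(combined_sim, book_index=1, n=2)
```

The returned indices can then be used to pull titles and other details from the `books` table.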

In [94]:
import os
import re

import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.neighbors import NearestNeighbors

### Title Search Function
Before the title search function can be reliably created, the books will need to be filtered by significance.
*Note: This step has not been completed in this notebook*

In [4]:
sql = """
SELECT *
FROM goodbooks_books_xml;
"""
con = os.environ["DATABASE_URL"]
books = pd.read_sql(sql, con)
books.head()

| | id | title | isbn | isbn13 | asin | kindle_asin | marketplace_id | country_code | image_url | small_image_url | ... | authors_image_url | authors_small_image_url | authors_link | authors_average_rating | authors_ratings_count | authors_text_reviews_count | public_document | public_document_public_document | public_document_id | public_document_document_url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1162022 | On the Jellicoe Road | 0670070297 | 9780670070299 | | B00AMH0S8A | A1F83G8C2ARO7P | GB | https://s.gr-assets.com/assets/nophoto/book/11... | https://s.gr-assets.com/assets/nophoto/book/50... | ... | \nhttps://images.gr-assets.com/authors/1277655... | \nhttps://images.gr-assets.com/authors/1277655... | https://www.goodreads.com/author/show/47104.Me... | 4.06 | 159449 | 19650 | | | | |
| 1 | 18143968 | I've Got You Under My Skin (Under Suspicion, #1) | 147674906X | 9781476749068 | | B00EBA5P1O | A1F83G8C2ARO7P | GB | https://images.gr-assets.com/books/1397768065m... | https://images.gr-assets.com/books/1397768065s... | ... | \nhttps://images.gr-assets.com/authors/1367515... | \nhttps://images.gr-assets.com/authors/1367515... | https://www.goodreads.com/author/show/108774.A... | 3.75 | 70989 | 8154 | | | | |
| 2 | 25403 | The Orange Girl | 0753819929 | 9780753819920 | | B004OBZNXU | A1F83G8C2ARO7P | GB | https://images.gr-assets.com/books/1415583796m... | https://images.gr-assets.com/books/1415583796s... | ... | \nhttps://images.gr-assets.com/authors/1284560... | \nhttps://images.gr-assets.com/authors/1284560... | https://www.goodreads.com/author/show/191735.J... | 3.8 | 24511 | 1832 | | | | |
| 3 | 9914 | The Informers | 0330339184 | 9780330339186 | | B004FV4T3Y | A1F83G8C2ARO7P | GB | https://images.gr-assets.com/books/1374684746m... | https://images.gr-assets.com/books/1374684746s... | ... | \nhttps://images.gr-assets.com/authors/1405340... | \nhttps://images.gr-assets.com/authors/1405340... | https://www.goodreads.com/author/show/2751.Bre... | 3.69 | 323887 | 14657 | | | | |
| 4 | 39980 | A Year Down Yonder (A Long Way from Chicago, #2) | 0142300705 | 9780142300701 | | | | GB | https://s.gr-assets.com/assets/nophoto/book/11... | https://s.gr-assets.com/assets/nophoto/book/50... | ... | \nhttps://images.gr-assets.com/authors/1299893... | \nhttps://images.gr-assets.com/authors/1299893... | https://www.goodreads.com/author/show/22414.Ri... | 3.91 | 86021 | 9084 | | | | |


In [229]:
# Which column has the actual author name?
books[['authors', 'authors_authors', 'authors_author',
       'authors_id', 'authors_name', 'authors_role']]

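| | authors | authors_authors | authors_author | authors_id | authors_name | authors_role |
|---|---|---|---|---|---|---|
| 0 | `\n` | `\n` | `\n` | 47104 | Melina Marchetta | |
| 1 | `\n` | `\n` | `\n` | 108774 | Alafair Burke | |
| 2 | `\n` | `\n` | `\n` | 191735 | James Anderson | Translator |
| 3 | `\n` | `\n` | `\n` | 2751 | Bret Easton Ellis | |
| 4 | `\n` | `\n` | `\n` | 22414 | Richard Peck | |
| ... | ... | ... | ... | ... | ... | ... |
| 9995 | `\n` | `\n` | `\n` | 38550 | Brandon Sanderson | |
| 9996 | `\n` | `\n` | `\n` | 1654 | Terry Pratchett | |
| 9997 | `\n` | `\n` | `\n` | 27398 | Joshua Harris | |
| 9998 | `\n` | `\n` | `\n` | 14617 | Margaret Peterson Haddix | |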


In [231]:
title_search_terms = books['title'] + ' ' + books['authors_name']
title_search_terms.head()

0                On the Jellicoe Road Melina Marchetta
1    I've Got You Under My Skin (Under Suspicion, #...
2                       The Orange Girl James Anderson
3                      The Informers Bret Easton Ellis
4    A Year Down Yonder (A Long Way from Chicago, #...
dtype: object

In [232]:
# Take the titles + authors and create a document term matrix based on term counts
vectorizer = CountVectorizer(ngram_range=(1, 2), max_df=190)
title_term_matrix = vectorizer.fit_transform(title_search_terms)
title_term_matrix.shape

(10000, 49980)
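
To make the `ngram_range=(1, 2)` and `max_df=190` settings more concrete, the sketch below rebuilds the same kind of features by hand. It is an illustration only: `ngram_counts` and `document_frequency` are hypothetical helpers, the three documents are a toy corpus, and the regular expression mirrors `CountVectorizer`'s default token pattern (lowercased words of two or more characters). An integer `max_df` drops any term that appears in more than that many documents; a cutoff of 1 is used here so the effect is visible on three documents.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # words of 2+ characters, as in CountVectorizer's default

def ngram_counts(text, ngram_range=(1, 2)):
    """Count word n-grams in lowercased text."""
    tokens = TOKEN.findall(text.lower())
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            counts[" ".join(tokens[start:start + n])] += 1
    return counts

def document_frequency(docs):
    """Number of documents each term appears in."""
    df = Counter()
    for doc in docs:
        df.update(set(ngram_counts(doc)))
    return df

# Toy corpus of "title + author" strings
docs = [
    "The Orange Girl James Anderson",
    "The Informers Bret Easton Ellis",
    "Free Will Sam Harris",
]

features = ngram_counts(docs[0])
df = document_frequency(docs)

max_df = 1  # toy cutoff; the notebook uses 190 across 10,000 books
vocabulary = sorted(term for term, count in df.items() if count <= max_df)
```

Here "the" appears in two documents and is dropped, while the bigram "the orange" appears in only one and is kept, which is why bigrams help distinguish titles that share common words.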

In [233]:
nn = NearestNeighbors(algorithm='brute', metric='cosine')
nn.fit(title_term_matrix)

NearestNeighbors(algorithm='brute', leaf_size=30, metric='cosine',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)

In [285]:
def closest_titles(title, thresh=1.0):
    """
    Print the closest title if its cosine distance is below `thresh`.

    title: free-text query (a title, optionally followed by the author)
    thresh: maximum cosine distance for the nearest neighbor to count as a match
    """
    title_transformed = vectorizer.transform([title])
    distances, indices = nn.kneighbors(title_transformed)
    # kneighbors returns neighbors sorted by distance, so position 0 is the nearest
    d = distances.flatten()[0]
    i = indices.flatten()[0]

    if d < thresh:
        print("%i | %.3f - %s" % (i, d, books['title'].iloc[i]))
    else:
        print('No good match found. Send to content based')

def all_details(title):
    """Print the index, distance and title of every nearest neighbor."""
    title_transformed = vectorizer.transform([title])
    distances, indices = nn.kneighbors(title_transformed)

    for d, i in zip(distances.flatten(), indices.flatten()):
        print("%i | %.3f - %s" % (i, d, books['title'].iloc[i]))

def all_titles(title):
    """Print the best match under a 0.631 distance cutoff, then all neighbors."""
    closest_titles(title, thresh=.631)
    print("\nNeighbors:")
    print("~~~~~~~~~~")
    all_details(title)
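
For use inside the recommender, it may be more convenient to return the match rather than print it. The following is a standalone sketch of the same threshold logic, assuming a hypothetical `find_match` function and a three-title toy list; it computes cosine distance over unigram and bigram counts by hand so that it runs without the database. Unlike the fitted `vectorizer`, it does not restrict the query to a learned vocabulary or apply `max_df`, so its distances will differ from the notebook's.

```python
import math
import re
from collections import Counter

def ngrams(text):
    """Unigram and bigram counts for lowercased text."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)

def cosine_distance(a, b):
    """1 - cosine similarity of two sparse count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    # A vector with no terms matches nothing, so use the maximum distance
    return 1.0 - dot / (norm_a * norm_b) if norm_a and norm_b else 1.0

def find_match(query, titles, thresh=0.631):
    """Return (index, distance) of the nearest title, or None if it is too far."""
    query_vec = ngrams(query)
    distances = [cosine_distance(query_vec, ngrams(t)) for t in titles]
    best = min(range(len(titles)), key=distances.__getitem__)
    if distances[best] < thresh:
        return best, distances[best]
    return None  # caller can fall back to the content-based system

# Toy "title + author" strings
titles = [
    "Waking Up: A Guide to Spirituality Without Religion Sam Harris",
    "Free Will Sam Harris",
    "The Orange Girl James Anderson",
]

match = find_match("waking up sam harris", titles)
no_match = find_match("gardening for beginners", titles)
```

Returning `None` for a weak match gives the caller an explicit signal to skip the similarity lookup instead of parsing printed output.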

In [291]:
all_titles("waking up sam harris")

9240 | 0.388 - Waking Up: A Guide to Spirituality Without Religion

Neighbors:
~~~~~~~~~~
9240 | 0.388 - Waking Up: A Guide to Spirituality Without Religion
2651 | 0.465 - Waking Up Married (Waking Up, #1)
6067 | 0.537 - Free Will
7142 | 0.613 - Letter to a Christian Nation
3176 | 0.719 - The End of Faith: Religion, Terror, and the Future of Reason


In [290]:
books.iloc[9240]

id                                                                                 18774981
title                                     Waking Up: A Guide to Spirituality Without Rel...
isbn                                                                             1451636016
isbn13                                                                        9781451636017
asin                                                                                   None
kindle_asin                                                                      B00LWM6CAM
marketplace_id                                                               A1F83G8C2ARO7P
country_code                                                                             GB
image_url                                 https://images.gr-assets.com/books/1415677308m...
small_image_url                           https://images.gr-assets.com/books/1415677308s...
publication_year                                                                

In [39]:
simulated = ((a1 * 100) + a2) / 2

In [40]:
simulated

array([[ 2.275, 20.35 , 18.445],
       [25.1  , 10.25 , 40.3  ],
       [20.05 ,  5.445, 15.045]])