In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


## Import the data

In the Flask app, this code will be changed to get the data directly from the database instead.

In [2]:
# read csv file

df = pd.read_csv("data/final_cleaned.csv")
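
As a rough sketch of the database variant mentioned above, the same columns could be loaded with a SQL query. The in-memory SQLite database, the `books` table name, and the sample rows are all assumptions made for illustration; the real Flask app may use a different database and schema.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for the app's database: an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
sample = pd.DataFrame(
    {
        "title": ["Book One", "Book Two"],
        "authors": ["A. Author", "B. Author"],
        "publisher": ["Example Press", "Example Press"],
        "categories": ["Fiction", "History"],
    }
)
sample.to_sql("books", conn, index=False)

# Read the text columns with a query instead of pd.read_csv.
df_from_db = pd.read_sql_query(
    "SELECT title, authors, publisher, categories FROM books", conn
)
conn.close()
```

The resulting dataframe has the same shape of data as the CSV version, so the rest of the notebook would not need to change.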

In [3]:
# display dataframe

df.head()

Unnamed: 0.1,Unnamed: 0,title,authors,isbn,publisher,categories,thumbnail
0,0,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,0439785960,Scholastic Inc.,Juvenile Fiction,http://books.google.com/books/content?id=QzI0B...
1,1,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,0439358078,Scholastic Inc.,Juvenile Fiction,http://books.google.com/books/content?id=OIJ5B...
2,2,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,0439554896,Scholastic,Juvenile Fiction,http://books.google.com/books/content?id=h2Y-P...
3,3,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling/Mary GrandPré,043965548X,Scholastic Inc.,Juvenile Fiction,http://books.google.com/books/content?id=IZN5B...
4,4,Harry Potter Boxed Set Books 1-5 (Harry Potte...,J.K. Rowling/Mary GrandPré,0439682584,Scholastic,Juvenile Fiction,http://books.google.com/books/content?id=DAAAA...


## Combine text attributes into one column

In [4]:
# combine title, authors, publisher, & category columns into one

df['all'] = df['title'] + df['authors'] + df['publisher'] + df['categories']
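
Note that plain `+` joins the columns with no space between them, so the last word of one field fuses with the first word of the next, and a missing value in any column makes the whole combined string missing. A possible variant, shown here on a small made-up dataframe (the `toy` rows are hypothetical), fills missing values and joins with a space. Switching to it would change the vocabulary, so the outputs below would no longer match.

```python
import pandas as pd

# Made-up rows for illustration; the second book has no author recorded.
toy = pd.DataFrame(
    {
        "title": ["Book One", "Book Two"],
        "authors": ["A. Author", None],
        "publisher": ["Example Press", "Example Press"],
        "categories": ["Fiction", "History"],
    }
)

text_columns = ["title", "authors", "publisher", "categories"]

# Replace missing values with empty strings, then join each row with spaces.
toy["all"] = toy[text_columns].fillna("").agg(" ".join, axis=1)
```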

## TF-IDF 

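Set the vectorizer to TfidfVectorizer from scikit-learn.  This will be used to convert the combined text for each book into a matrix of features.  TF-IDF weights each word by how often it appears in a book's text, discounted by how many other books' texts also contain it, so distinctive words count for more than common ones.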

In [5]:
# set to analyze words 

vectorizer = TfidfVectorizer(analyzer='word')

## Fit & Transform

Use the column of combined text attributes and the vectorizer to create a matrix with one row per book and one column per word, holding the TF-IDF weights.

In [6]:
tfidf_all_content = vectorizer.fit_transform(df['all'])

In [7]:
tfidf_all_content

<5290x18488 sparse matrix of type '<class 'numpy.float64'>'
	with 52031 stored elements in Compressed Sparse Row format>

## Compare Similarity

Create a matrix that mathematically compares every pair of books.  The linear kernel function from scikit-learn computes the dot product between each pair of rows when we compare our TF-IDF matrix to itself.  Because TfidfVectorizer scales each row to unit length by default, these dot products are equal to the cosine similarities.

[Cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) is "a measure of similarity between two non-zero vectors of an inner product space" per Wikipedia.  This is one of several ways to mathematically measure similarity.

[Kernel methods](https://en.wikipedia.org/wiki/Kernel_method) are useful for high-dimensional features (such as the matrix we just created) because they work with inner products between pairs of data points rather than explicitly computing every coordinate in the feature space.

In [8]:
cosine_similarity_all_content = linear_kernel(tfidf_all_content, tfidf_all_content)
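
The following sketch checks, on a few made-up documents, that the linear kernel of a TF-IDF matrix matches scikit-learn's `cosine_similarity`. It relies on the default `norm='l2'` setting of `TfidfVectorizer`; with a different `norm` setting the two results would generally differ. The `docs` list is hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy documents standing in for the combined book text.
docs = ["war and peace", "pride and prejudice", "the forever war"]
X = TfidfVectorizer(analyzer="word").fit_transform(docs)

# Dot products between every pair of rows.
dot_products = linear_kernel(X, X)

# Cosine similarities between every pair of rows.
cosines = cosine_similarity(X, X)

# The two agree because every row already has length 1.
same = np.allclose(dot_products, cosines)
print(same)
```

Each book compared with itself scores 1, which is why the recommendation step later has to leave the chosen book out of its own results.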

## Set up dataframe & indexes

In [9]:
# create new dataframe with reset index

books = df.reset_index(drop=True)

In [10]:
# create a series of the indexes

indices = pd.Series(books['title'].index)

## Get user-defined book

In the Flask app, the book chosen by the user will be retrieved by an input form.  For this notebook, the book title is hard-coded.

Using the book title, the index of the book is found.

In [11]:
input_title = "War and Peace"

input_array = books[books['title'] == input_title].index.values

input_index = input_array[0]

input_index

120
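
If the title is not in the dataframe, `input_array` is empty and `input_array[0]` raises an `IndexError`. Since the Flask app will take the title from a form, a guard may be worth having. The helper below is a hypothetical sketch (`find_book_index` is not part of the notebook) shown on a made-up dataframe.

```python
import pandas as pd


def find_book_index(books, title):
    """Return the index of the first exact title match, or None if absent."""
    matches = books.index[books["title"] == title]
    if len(matches) == 0:
        return None
    return int(matches[0])


# Made-up data for illustration.
toy_books = pd.DataFrame({"title": ["Book One", "Book Two", "Book One"]})

found = find_book_index(toy_books, "Book Two")
missing = find_book_index(toy_books, "Not In The Data")
```

The caller can then show a "book not found" message when the result is `None` instead of letting the request fail.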

## Retrieve top 10 most similar books

A function is created to find the 10 most similar books to the book the user chooses.  The index of the chosen book and the matrix of cosine similarities are passed to the function.  The similarity scores that compare that book to every other book are listed and then sorted from highest to lowest.  The first entry, which is normally the chosen book itself, is skipped, and the next 10 are sliced and the index of each book is retrieved.  The 10 indexes are then used to get the titles and authors, which are zipped up into a list.

In [12]:
# Function to get the most similar books
def recommend(index, method):
    # look up the row position of the chosen book
    book_id = indices[index]

    # pair every book's position with its similarity to the chosen book
    similarity_scores = list(enumerate(method[book_id]))

    # sort from most to least similar
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)

    # skip the first entry (normally the chosen book itself) and keep the next 10
    similarity_scores = similarity_scores[1:11]

    # get the books index
    books_index = [i[0] for i in similarity_scores]

    titles = books['title'].iloc[books_index]
    authors = books['authors'].iloc[books_index]

    # zip the titles and authors into a list of tuples
    return list(zip(titles, authors))

In [13]:
# pass the book index & the cosine similarities

recommended_list = recommend(input_index, cosine_similarity_all_content)
recommended_list

[('Collected Shorter Fiction: Volume I',
  'Leo Tolstoy/Aylmer Maude/Nigel J. Cooper'),
 ('War and Peace and War: The Rise and Fall of Empires', 'Peter Turchin'),
 ('Tolstoy: Anna Karenina', 'Anthony Thorlby'),
 ('The Gardens of Emily Dickinson', 'Judith Farr/Louise Carter'),
 ('The Last Wife of Henry VIII', 'Carolly Erickson'),
 ('Paris Spleen', 'Charles Baudelaire/Louise Varèse'),
 ('Sexus (The Rosy Crucifixion  #1)', 'Henry Miller'),
 ('The Forever War (The Forever War  #1)', 'Joe Haldeman'),
 ('Pride and Prejudice', 'Jane Austen'),
 ('When I Feel Angry', 'Nancy Cote/Cornelia Maude Spelman')]

## Last step - Pass the data to the website to display

Now that the list of recommended books has been created, the Flask app will pass the list to the HTML page, which displays it with a Jinja template.
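
A minimal sketch of that hand-off is shown below. The route name, the inline template, and the hard-coded list are assumptions for illustration; the actual app would call `recommend` with the user's book and would more likely keep its template in a separate file.

```python
from flask import Flask, render_template_string

app = Flask(__name__)

# Hypothetical inline Jinja template that loops over (title, authors) pairs.
PAGE = """
<ul>
{% for title, authors in books %}
  <li>{{ title }} by {{ authors }}</li>
{% endfor %}
</ul>
"""


@app.route("/recommendations")
def show_recommendations():
    # Placeholder list; in the app this would come from recommend(...).
    recommended_list = [
        ("Tolstoy: Anna Karenina", "Anthony Thorlby"),
        ("Pride and Prejudice", "Jane Austen"),
    ]
    return render_template_string(PAGE, books=recommended_list)
```

Because `recommend` returns a plain list of tuples, the template can unpack each pair directly in the `for` loop.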