# Data for Recommender System

UCSD researchers scraped the Goodreads website for this data in 2017. Goodreads stopped issuing new API keys in 2020.

goodreads_interactions.csv : user interactions with books

goodreads_books.json.gz : book metadata

book_id_map : maps between the book IDs used in each dataset

Link to data: https://mengtingwan.github.io/data/goodreads.html#datasets

Citation for data: 



In [2]:
#count the records; the file lives in data/ and must be decompressed for a meaningful line count
!gzip -cd data/goodreads_books.json.gz | wc -l


In [3]:
!ls -lh data | grep goodreads_books.json.gz

In [5]:
#streaming version of the code so I don't use too much memory
#loading the data line by line rather than the whole thing at once
import gzip

with gzip.open("data/goodreads_books.json.gz", "r") as f:
    #read only the first line (one JSON record) for inspection
    line = f.readline()
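
To try the streaming pattern without the full download, the sketch below writes two made-up records to a temporary gzip file and reads back only the first one. The file name `sample_books.json.gz` and the records are hypothetical stand-ins, not rows from the dataset.

```python
import gzip
import json
import os
import tempfile

# Two made-up records in a one-JSON-object-per-line layout (hypothetical data).
sample_records = [
    {"book_id": "1", "title": "First Sample"},
    {"book_id": "2", "title": "Second Sample"},
]

tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample_books.json.gz")

# Write one JSON object per line, compressed with gzip.
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    for record in sample_records:
        f.write(json.dumps(record) + "\n")

# readline() returns a single line; the file is decompressed incrementally
# rather than loaded whole.
with gzip.open(sample_path, "rt", encoding="utf-8") as f:
    first_line = f.readline()

first_record = json.loads(first_line)
print(first_record)
```

Opening in text mode (`"rt"`) yields `str` lines; the binary mode used in the notebook yields `bytes`, which `json.loads` also accepts.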

In [6]:
#use json module to load the json using loads (load string)
#creates python dict
import json 
json.loads(line)

{'isbn': '0312853122',
 'text_reviews_count': '1',
 'series': [],
 'country_code': 'US',
 'language_code': '',
 'popular_shelves': [{'count': '3', 'name': 'to-read'},
  {'count': '1', 'name': 'p'},
  {'count': '1', 'name': 'collection'},
  {'count': '1', 'name': 'w-c-fields'},
  {'count': '1', 'name': 'biography'}],
 'asin': '',
 'is_ebook': 'false',
 'average_rating': '4.00',
 'kindle_asin': '',
 'similar_books': [],
 'description': '',
 'format': 'Paperback',
 'link': 'https://www.goodreads.com/book/show/5333265-w-c-fields',
 'authors': [{'author_id': '604031', 'role': ''}],
 'publisher': "St. Martin's Press",
 'num_pages': '256',
 'publication_day': '1',
 'isbn13': '9780312853129',
 'publication_month': '9',
 'edition_information': '',
 'publication_year': '1984',
 'url': 'https://www.goodreads.com/book/show/5333265-w-c-fields',
 'image_url': 'https://images.gr-assets.com/books/1310220028m/5333265.jpg',
 'book_id': '5333265',
 'ratings_count': '3',
 'work_id': '5400751',
 'title': '

In [7]:
#keep only the fields we need from each record
def parse_fields(line):
    data = json.loads(line)
    return {
        "book_id": data["book_id"],
        "title": data["title_without_series"],
        "ratings": data["ratings_count"],
        "url": data["url"],
        "cover_image": data["image_url"]
    }

## Adding Books with More Than 10 Ratings to a List

In [9]:
#parse every line of the file and keep the books with enough ratings
books_titles = []
with gzip.open("data/goodreads_books.json.gz", "r") as f:
    while True:
        line = f.readline()
        if not line:
            break
        fields = parse_fields(line)

        try:
            ratings = int(fields["ratings"]) #skip records whose ratings count is not an integer
        except ValueError:
            continue
        if ratings > 10: #only keep books with more than 10 ratings
            books_titles.append(fields)
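
The same filter can also be written as a generator that iterates over the lines directly, which keeps the loop shorter. This is only a sketch under assumptions: `iter_popular_books` and `min_ratings` are hypothetical names, and the demo feeds it made-up lines instead of the real file.

```python
import json

def iter_popular_books(lines, min_ratings=10):
    """Yield trimmed records for books with more than `min_ratings` ratings."""
    for line in lines:
        data = json.loads(line)
        try:
            ratings = int(data["ratings_count"])
        except ValueError:
            # An empty or malformed count cannot be compared, so skip the record.
            continue
        if ratings > min_ratings:
            yield {
                "book_id": data["book_id"],
                "title": data["title_without_series"],
                "ratings": ratings,
            }

# Made-up lines standing in for the decompressed file (hypothetical data).
sample_lines = [
    json.dumps({"book_id": "1", "title_without_series": "Popular", "ratings_count": "51"}),
    json.dumps({"book_id": "2", "title_without_series": "Obscure", "ratings_count": "3"}),
    json.dumps({"book_id": "3", "title_without_series": "Unrated", "ratings_count": ""}),
]

popular = list(iter_popular_books(sample_lines))
print(popular)
```

With the real data, the open gzip file object could be passed in place of `sample_lines`, since file objects yield one line per iteration.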

In [1]:
import pandas as pd

titles = pd.DataFrame.from_dict(books_titles)
print(titles)


In [10]:
#make ratings numerical
titles["ratings"] = pd.to_numeric(titles["ratings"])

In [11]:
#strip everything except letters, digits and spaces to make titles more uniform
titles["modified_title"] = titles["title"].str.replace("[^a-zA-Z0-9 ]", "", regex=True)

In [12]:
titles

| | book_id | title | ratings | url | cover_image | modified_title |
|---|---|---|---|---|---|---|
| 0 | 7327624 | The Unschooled Wizard (Sun Wolf and Starhawk, ... | 140 | https://www.goodreads.com/book/show/7327624-th... | https://images.gr-assets.com/books/1304100136m... | The Unschooled Wizard Sun Wolf and Starhawk 12 |
| 1 | 6066819 | Best Friends Forever | 51184 | https://www.goodreads.com/book/show/6066819-be... | https://s.gr-assets.com/assets/nophoto/book/11... | Best Friends Forever |
| 2 | 287140 | Runic Astrology: Starcraft and Timekeeping in ... | 15 | https://www.goodreads.com/book/show/287140.Run... | https://images.gr-assets.com/books/1413219371m... | Runic Astrology Starcraft and Timekeeping in t... |
| 3 | 287141 | The Aeneid for Boys and Girls | 46 | https://www.goodreads.com/book/show/287141.The... | https://s.gr-assets.com/assets/nophoto/book/11... | The Aeneid for Boys and Girls |
| 4 | 378460 | The Wanting of Levine | 12 | https://www.goodreads.com/book/show/378460.The... | https://s.gr-assets.com/assets/nophoto/book/11... | The Wanting of Levine |
| ... | ... | ... | ... | ... | ... | ... |
| 1496807 | 331839 | Jacqueline Kennedy Onassis: Friend of the Arts | 18 | https://www.goodreads.com/book/show/331839.Jac... | https://s.gr-assets.com/assets/nophoto/book/11... | Jacqueline Kennedy Onassis Friend of the Arts |
| 1496808 | 2685097 | The Spaniard's Blackmailed Bride | 112 | https://www.goodreads.com/book/show/2685097-th... | https://s.gr-assets.com/assets/nophoto/book/11... | The Spaniards Blackmailed Bride |
| 1496809 | 3084038 | This Sceptred Isle, Vol. 10: The Age of Victor... | 12 | https://www.goodreads.com/book/show/3084038-th... | https://images.gr-assets.com/books/1494763458m... | This Sceptred Isle Vol 10 The Age of Victoria ... |
| 1496810 | 2342551 | The Children's Classic Poetry Collection | 36 | https://www.goodreads.com/book/show/2342551.Th... | https://s.gr-assets.com/assets/nophoto/book/11... | The Childrens Classic Poetry Collection |


In [13]:
#lowercase and collapse repeated whitespace into a single space
titles["modified_title"] = titles["modified_title"].str.lower()
titles["modified_title"] = titles["modified_title"].str.replace(r"\s+", " ", regex=True)
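
For a single string, the two cleaning steps above amount to the following plain-Python sketch. `normalize_title` is a hypothetical helper name; it is meant to mirror the pandas calls and could be reused on a search query.

```python
import re

def normalize_title(title):
    """Apply the same cleaning as the modified_title column to one string."""
    # Drop every character that is not a letter, digit or space.
    cleaned = re.sub(r"[^a-zA-Z0-9 ]", "", title)
    # Lowercase, then collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", cleaned.lower())

print(normalize_title("The Spaniard's Blackmailed Bride"))
# the spaniards blackmailed bride
print(normalize_title("This Sceptred Isle, Vol. 10"))
# this sceptred isle vol 10
```

Note that punctuation is deleted rather than replaced by a space, so "Spaniard's" becomes "spaniards".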

In [14]:
titles

| | book_id | title | ratings | url | cover_image | modified_title |
|---|---|---|---|---|---|---|
| 0 | 7327624 | The Unschooled Wizard (Sun Wolf and Starhawk, ... | 140 | https://www.goodreads.com/book/show/7327624-th... | https://images.gr-assets.com/books/1304100136m... | the unschooled wizard sun wolf and starhawk 12 |
| 1 | 6066819 | Best Friends Forever | 51184 | https://www.goodreads.com/book/show/6066819-be... | https://s.gr-assets.com/assets/nophoto/book/11... | best friends forever |
| 2 | 287140 | Runic Astrology: Starcraft and Timekeeping in ... | 15 | https://www.goodreads.com/book/show/287140.Run... | https://images.gr-assets.com/books/1413219371m... | runic astrology starcraft and timekeeping in t... |
| 3 | 287141 | The Aeneid for Boys and Girls | 46 | https://www.goodreads.com/book/show/287141.The... | https://s.gr-assets.com/assets/nophoto/book/11... | the aeneid for boys and girls |
| 4 | 378460 | The Wanting of Levine | 12 | https://www.goodreads.com/book/show/378460.The... | https://s.gr-assets.com/assets/nophoto/book/11... | the wanting of levine |
| ... | ... | ... | ... | ... | ... | ... |
| 1496807 | 331839 | Jacqueline Kennedy Onassis: Friend of the Arts | 18 | https://www.goodreads.com/book/show/331839.Jac... | https://s.gr-assets.com/assets/nophoto/book/11... | jacqueline kennedy onassis friend of the arts |
| 1496808 | 2685097 | The Spaniard's Blackmailed Bride | 112 | https://www.goodreads.com/book/show/2685097-th... | https://s.gr-assets.com/assets/nophoto/book/11... | the spaniards blackmailed bride |
| 1496809 | 3084038 | This Sceptred Isle, Vol. 10: The Age of Victor... | 12 | https://www.goodreads.com/book/show/3084038-th... | https://images.gr-assets.com/books/1494763458m... | this sceptred isle vol 10 the age of victoria ... |
| 1496810 | 2342551 | The Children's Classic Poetry Collection | 36 | https://www.goodreads.com/book/show/2342551.Th... | https://s.gr-assets.com/assets/nophoto/book/11... | the childrens classic poetry collection |


In [35]:
#drop books whose cleaned title is empty
titles = titles[titles["modified_title"].str.len() > 0]

In [37]:
#write to json file 
titles.to_json("books_titles.json")

## TF-IDF to Cosine Similarity Search Function
The next step is to use term frequency-inverse document frequency (TF-IDF) to turn the book titles into a matrix that the search engine can compare queries against. This is why we cleaned the titles into the 'modified_title' column. Each query is vectorized the same way and compared with every title by cosine similarity.

In [53]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

tfidf = vectorizer.fit_transform(titles["modified_title"])
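
To make the matrix less abstract, the sketch below computes TF-IDF weights by hand for three made-up titles and compares a query with them by cosine similarity. It is a simplified stand-in, not scikit-learn's code: it assumes plain whitespace tokenization (the `TfidfVectorizer` default tokenizer also drops one-character tokens) and the smoothed IDF `ln((1 + n) / (1 + df)) + 1` with L2 normalization that the scikit-learn documentation describes as the default.

```python
import math
from collections import Counter

# Made-up cleaned titles (hypothetical data).
docs = ["best friends forever", "best poetry collection", "runic astrology"]

tokenized = [doc.split() for doc in docs]
n_docs = len(docs)

# Document frequency: in how many titles each word appears.
df = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    # Smoothed inverse document frequency: rarer words get larger weights.
    return math.log((1 + n_docs) / (1 + df[word])) + 1

def tfidf_vector(tokens):
    """Return an L2-normalized {word: weight} vector, ignoring unknown words."""
    counts = Counter(t for t in tokens if t in df)
    weights = {word: count * idf(word) for word, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return {}
    return {word: w / norm for word, w in weights.items()}

def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())

doc_vectors = [tfidf_vector(tokens) for tokens in tokenized]
query_vector = tfidf_vector("best friends".split())
scores = [cosine(query_vector, vec) for vec in doc_vectors]
print(scores)
```

The first title shares two words with the query and scores highest, the second shares only the common word "best", and the third shares nothing and scores zero.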

In [85]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import re

def make_clickable(val): 
    return '<a target="_blank" href="{}">Goodreads</a>'.format(val)

def show_image(val): 
    return '<img src="{}" width=50></img>'.format(val)

def search(query, vectorizer):
    #clean the query the same way the titles were cleaned
    processed = re.sub("[^a-zA-Z0-9 ]", "", query.lower())
    query_vec = vectorizer.transform([processed])
    similarity = cosine_similarity(query_vec, tfidf).flatten()
    indices = np.argpartition(similarity, -10)[-10:] #indices of the 10 most similar titles, in no particular order
    results = titles.iloc[indices]
    results = results.sort_values("ratings", ascending=False) #rank those 10 by number of ratings
    return results.head(5).style.format({'url': make_clickable, 'cover_image': show_image})
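
The ranking step in `search` relies on `np.argpartition`, which finds the top matches without fully sorting every score. The sketch below shows the idea on made-up similarity scores; it takes the top 3 rather than the top 10 and, unlike `search`, orders the shortlist by similarity instead of by ratings.

```python
import numpy as np

# Made-up similarity scores for six titles (hypothetical data).
similarity = np.array([0.10, 0.90, 0.00, 0.75, 0.30, 0.60])
k = 3

# argpartition places the indices of the k largest scores in the last k
# positions, in no guaranteed order among themselves.
top_k = np.argpartition(similarity, -k)[-k:]

# Sort just those k candidates from most to least similar.
top_k_sorted = top_k[np.argsort(similarity[top_k])[::-1]]
print(top_k_sorted)
```

Because `search` re-sorts its shortlist by `ratings`, a heavily rated near-match can appear above an exact match with fewer ratings.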

In [109]:
search("Everything I never told you", vectorizer)

| | book_id | title | ratings | url | cover_image | modified_title |
|---|---|---|---|---|---|---|
| 757036 | 18693763 | Everything I Never Told You | 115500 | Goodreads | | everything i never told you |
| 496779 | 23398763 | Everything I Never Told You | 6466 | Goodreads | | everything i never told you |
| 622640 | 23003206 | Everything I Never Told You | 2661 | Goodreads | | everything i never told you |
| 1040068 | 23442209 | Everything I Never Told You | 603 | Goodreads | | everything i never told you |
| 810018 | 29367399 | Everything I Never Told You | 217 | Goodreads | | everything i never told you |


In [None]:
#make a list of liked book_id
lmiller_liked_books = [11069349, 13537029, 26827125, 18209268, 18693763]

## Building a Recommendation Engine from our Search Engine base
