# Data for Recommender System

UCSD researchers scraped the Goodreads website for this data in 2017. Goodreads has not issued new API keys since 2020.

goodreads_interactions.csv : user interactions with books

goodreads_books.json.gz : book metadata

book_id_map.csv : maps book IDs between the two datasets

Link to data: https://mengtingwan.github.io/data/goodreads.html#datasets

Citation for data: 



In [6]:
#the archive lives in data/; decompress it first so wc counts JSON lines rather than compressed bytes
!gzip -cd data/goodreads_books.json.gz | wc -l


In [7]:
!ls -lh data | grep goodreads_books.json.gz

In [8]:
#stream the file so it is never fully loaded into memory
#read a single line rather than the whole archive at once
import gzip

with gzip.open("data/goodreads_books.json.gz", "r") as f:
    line = f.readline()  #first record only

In [9]:
#use json module to load the json using loads (load string)
#creates python dict
import json 
json.loads(line)

{'isbn': '0312853122',
 'text_reviews_count': '1',
 'series': [],
 'country_code': 'US',
 'language_code': '',
 'popular_shelves': [{'count': '3', 'name': 'to-read'},
  {'count': '1', 'name': 'p'},
  {'count': '1', 'name': 'collection'},
  {'count': '1', 'name': 'w-c-fields'},
  {'count': '1', 'name': 'biography'}],
 'asin': '',
 'is_ebook': 'false',
 'average_rating': '4.00',
 'kindle_asin': '',
 'similar_books': [],
 'description': '',
 'format': 'Paperback',
 'link': 'https://www.goodreads.com/book/show/5333265-w-c-fields',
 'authors': [{'author_id': '604031', 'role': ''}],
 'publisher': "St. Martin's Press",
 'num_pages': '256',
 'publication_day': '1',
 'isbn13': '9780312853129',
 'publication_month': '9',
 'edition_information': '',
 'publication_year': '1984',
 'url': 'https://www.goodreads.com/book/show/5333265-w-c-fields',
 'image_url': 'https://images.gr-assets.com/books/1310220028m/5333265.jpg',
 'book_id': '5333265',
 'ratings_count': '3',
 'work_id': '5400751',
 'title': '

In [10]:
#keep only the fields we need from each JSON record
def parse_fields(line):
    data = json.loads(line)
    return {
        "book_id": data["book_id"],
        "title": data["title_without_series"],
        "ratings": data["ratings_count"],
        "url": data["url"],
        "cover_image": data["image_url"]
    }

## Adding Books with More Than 10 Ratings to a List

In [11]:
#parse every line, keeping only books with more than 10 ratings
books_titles = []
with gzip.open("data/goodreads_books.json.gz", "r") as f:
    for line in f:  #iterating over the file streams it one line at a time
        fields = parse_fields(line)

        try:
            ratings = int(fields["ratings"])  #ratings_count is stored as a string
        except ValueError:
            continue  #skip records whose count is not a number

        if ratings > 10:
            books_titles.append(fields)

In [12]:
import pandas as pd

titles = pd.DataFrame.from_dict(books_titles)
print(titles)

          book_id                                              title ratings  \
0         7327624  The Unschooled Wizard (Sun Wolf and Starhawk, ...     140   
1         6066819                               Best Friends Forever   51184   
2          287140  Runic Astrology: Starcraft and Timekeeping in ...      15   
3          287141                      The Aeneid for Boys and Girls      46   
4          378460                              The Wanting of Levine      12   
...           ...                                                ...     ...   
1496807    331839     Jacqueline Kennedy Onassis: Friend of the Arts      18   
1496808   2685097                   The Spaniard's Blackmailed Bride     112   
1496809   3084038  This Sceptred Isle, Vol. 10: The Age of Victor...      12   
1496810   2342551           The Children's Classic Poetry Collection      36   
1496811  22017381          101 Nights: Volume One (101 Nights, #1-3)      70   

                                       

In [13]:
#make ratings numerical
titles["ratings"] = pd.to_numeric(titles["ratings"])

In [14]:
#regex to make titles more uniform
titles["modified_title"] = titles["title"].str.replace("[^a-zA-Z0-9 ]","", regex = True)

In [15]:
titles

Unnamed: 0,book_id,title,ratings,url,cover_image,modified_title
0,7327624,"The Unschooled Wizard (Sun Wolf and Starhawk, ...",140,https://www.goodreads.com/book/show/7327624-th...,https://images.gr-assets.com/books/1304100136m...,The Unschooled Wizard Sun Wolf and Starhawk 12
1,6066819,Best Friends Forever,51184,https://www.goodreads.com/book/show/6066819-be...,https://s.gr-assets.com/assets/nophoto/book/11...,Best Friends Forever
2,287140,Runic Astrology: Starcraft and Timekeeping in ...,15,https://www.goodreads.com/book/show/287140.Run...,https://images.gr-assets.com/books/1413219371m...,Runic Astrology Starcraft and Timekeeping in t...
3,287141,The Aeneid for Boys and Girls,46,https://www.goodreads.com/book/show/287141.The...,https://s.gr-assets.com/assets/nophoto/book/11...,The Aeneid for Boys and Girls
4,378460,The Wanting of Levine,12,https://www.goodreads.com/book/show/378460.The...,https://s.gr-assets.com/assets/nophoto/book/11...,The Wanting of Levine
...,...,...,...,...,...,...
1496807,331839,Jacqueline Kennedy Onassis: Friend of the Arts,18,https://www.goodreads.com/book/show/331839.Jac...,https://s.gr-assets.com/assets/nophoto/book/11...,Jacqueline Kennedy Onassis Friend of the Arts
1496808,2685097,The Spaniard's Blackmailed Bride,112,https://www.goodreads.com/book/show/2685097-th...,https://s.gr-assets.com/assets/nophoto/book/11...,The Spaniards Blackmailed Bride
1496809,3084038,"This Sceptred Isle, Vol. 10: The Age of Victor...",12,https://www.goodreads.com/book/show/3084038-th...,https://images.gr-assets.com/books/1494763458m...,This Sceptred Isle Vol 10 The Age of Victoria ...
1496810,2342551,The Children's Classic Poetry Collection,36,https://www.goodreads.com/book/show/2342551.Th...,https://s.gr-assets.com/assets/nophoto/book/11...,The Childrens Classic Poetry Collection


In [16]:
#lowercase and collapse runs of whitespace into a single space
titles["modified_title"] = titles["modified_title"].str.lower()
titles["modified_title"] = titles["modified_title"].str.replace(r"\s+", " ", regex=True)

In [17]:
titles

Unnamed: 0,book_id,title,ratings,url,cover_image,modified_title
0,7327624,"The Unschooled Wizard (Sun Wolf and Starhawk, ...",140,https://www.goodreads.com/book/show/7327624-th...,https://images.gr-assets.com/books/1304100136m...,the unschooled wizard sun wolf and starhawk 12
1,6066819,Best Friends Forever,51184,https://www.goodreads.com/book/show/6066819-be...,https://s.gr-assets.com/assets/nophoto/book/11...,best friends forever
2,287140,Runic Astrology: Starcraft and Timekeeping in ...,15,https://www.goodreads.com/book/show/287140.Run...,https://images.gr-assets.com/books/1413219371m...,runic astrology starcraft and timekeeping in t...
3,287141,The Aeneid for Boys and Girls,46,https://www.goodreads.com/book/show/287141.The...,https://s.gr-assets.com/assets/nophoto/book/11...,the aeneid for boys and girls
4,378460,The Wanting of Levine,12,https://www.goodreads.com/book/show/378460.The...,https://s.gr-assets.com/assets/nophoto/book/11...,the wanting of levine
...,...,...,...,...,...,...
1496807,331839,Jacqueline Kennedy Onassis: Friend of the Arts,18,https://www.goodreads.com/book/show/331839.Jac...,https://s.gr-assets.com/assets/nophoto/book/11...,jacqueline kennedy onassis friend of the arts
1496808,2685097,The Spaniard's Blackmailed Bride,112,https://www.goodreads.com/book/show/2685097-th...,https://s.gr-assets.com/assets/nophoto/book/11...,the spaniards blackmailed bride
1496809,3084038,"This Sceptred Isle, Vol. 10: The Age of Victor...",12,https://www.goodreads.com/book/show/3084038-th...,https://images.gr-assets.com/books/1494763458m...,this sceptred isle vol 10 the age of victoria ...
1496810,2342551,The Children's Classic Poetry Collection,36,https://www.goodreads.com/book/show/2342551.Th...,https://s.gr-assets.com/assets/nophoto/book/11...,the childrens classic poetry collection


In [18]:
#drop rows whose cleaned title is empty
titles = titles[titles["modified_title"].str.len() > 0]

In [19]:
#write to json file 
titles.to_json("books_titles.json")

## TF-IDF to Cosine Similarity Search Function
The next step is to use TF-IDF (term frequency-inverse document frequency) to turn all the book titles into a matrix that the search engine can query. This is why we cleaned the titles into the 'modified_title' column. A search query is vectorized the same way and compared with every title by cosine similarity.
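To make the idea concrete, here is a minimal pure-Python sketch of TF-IDF weighting and cosine similarity on three toy titles. It is an illustration only, not the code the notebook runs: the helper names (`tfidf_vectors`, `weigh`, `cosine`) and the toy titles are assumptions, tokens are split on whitespace, and the idf formula is the smoothed form `ln((1 + n) / (1 + df)) + 1` with L2 normalization.

```python
import math
from collections import Counter

def weigh(tokens, idf):
    """Turn a token list into an L2-normalized TF-IDF vector (word -> weight)."""
    counts = Counter(t for t in tokens if t in idf)  # words never seen in the corpus are ignored
    vec = {w: c * idf[w] for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {w: v / norm for w, v in vec.items()} if norm else {}

def tfidf_vectors(docs):
    """Return (idf, vectors) for a list of already-cleaned titles."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many titles does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    # smoothed idf: rare words get a larger weight than common ones
    idf = {w: math.log((1 + n_docs) / (1 + c)) + 1 for w, c in df.items()}
    return idf, [weigh(tokens, idf) for tokens in tokenized]

def cosine(a, b):
    """Cosine similarity of two normalized sparse vectors is their dot product."""
    return sum(v * b.get(w, 0.0) for w, v in a.items())

# Hypothetical toy corpus, already lowercased and stripped of punctuation
toy_titles = ["the great alone", "alone", "best friends forever"]
idf, vectors = tfidf_vectors(toy_titles)

query = weigh("great alone".split(), idf)
scores = [cosine(query, v) for v in vectors]
best = max(range(len(scores)), key=scores.__getitem__)
print(toy_titles[best], [round(s, 3) for s in scores])
```

The query shares two words with the first title and one with the second, so the first title scores highest, and the third title, which shares no words with the query, scores zero.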

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

tfidf = vectorizer.fit_transform(titles["modified_title"])

In [21]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import re

def make_clickable(val): 
    return '<a target="_blank" href="{}">Goodreads</a>'.format(val)

def show_image(val): 
    return '<img src="{}" width=50></img>'.format(val)

def search(query, vectorizer):
    #clean the query the same way the titles were cleaned
    processed = re.sub("[^a-zA-Z0-9 ]", "", query.lower())
    query_vec = vectorizer.transform([processed])
    similarity = cosine_similarity(query_vec, tfidf).flatten()
    indices = np.argpartition(similarity, -10)[-10:] #indices of the 10 most similar titles (unordered)
    results = titles.iloc[indices]
    results = results.sort_values("ratings", ascending=False) #rank those 10 by number of ratings
    return results.head(5).style.format({'url': make_clickable, 'cover_image': show_image})

### Search engine function for books: 

In [22]:
search("Great Alone", vectorizer)

Unnamed: 0,book_id,title,ratings,url,cover_image,modified_title
733571,662196,The Great Alone,280,Goodreads,,the great alone
619675,34912895,The Great Alone,265,Goodreads,,the great alone
458475,30753657,Alone,181,Goodreads,,alone
1319961,19055962,Alone,104,Goodreads,,alone
1446744,33918889,Alone,97,Goodreads,,alone


In [23]:
#list of liked book_ids used to test the recommendation system
lmiller_liked_books =  ['11069349', '13537029', '26827125', '18209268', '18693763']

## Building a Recommendation Engine



In [24]:
import pandas as pd

def recommend_books(liked_books, csv_book_mapping_path, interactions_path, books_titles_path):
    """
    Recommend books based on user interactions and a list of liked books.

    Parameters:
    - liked_books (list of str): List of book IDs the user likes.
    - csv_book_mapping_path (str): Path to the book ID mapping CSV file.
    - interactions_path (str): Path to the Goodreads interactions CSV file.
    - books_titles_path (str): Path to the books titles JSON file.

    Returns:
    - pd.DataFrame: Top recommended books with scores, links, and images.
    """
    # Load the book ID mapping
    csv_book_mapping = {}
    with open(csv_book_mapping_path, "r") as f:
        for line in f:
            csv_id, book_id = line.strip().split(",")
            csv_book_mapping[csv_id] = book_id

    # Identify overlap users who rated liked books highly
    liked_set = set(liked_books)  # set membership is checked once per interaction row
    overlap_users = set()
    with open(interactions_path, 'r') as f:
        for line in f:
            user_id, csv_id, _, rating, _ = line.split(",")
            try:
                rating = int(rating)
            except ValueError:
                continue  # skip rows without a numeric rating, e.g. a header row

            book_id = csv_book_mapping.get(csv_id)
            if book_id in liked_set and rating >= 4:
                overlap_users.add(user_id)

    # Gather recommendations based on overlap users
    rec_lines = []
    with open(interactions_path, 'r') as f:
        for line in f:
            user_id, csv_id, _, rating, _ = line.split(",")
            if user_id in overlap_users:
                book_id = csv_book_mapping.get(csv_id)
                rec_lines.append([user_id, book_id, int(rating)])

    # Create a DataFrame for recommendations
    recs_df = pd.DataFrame(rec_lines, columns=["user_id", "book_id", "rating"])
    recs_df["book_id"] = recs_df["book_id"].astype(str)

    # Calculate top recommendations
    top_recs = recs_df["book_id"].value_counts()
    books_titles = pd.read_json(books_titles_path)
    books_titles["book_id"] = books_titles["book_id"].astype(str)
    all_recs = top_recs.to_frame().reset_index()
    all_recs.columns = ["book_id", "book_count"]

    # Merge with book details
    all_recs = all_recs.merge(books_titles, how="inner", on="book_id")
    all_recs["score"] = all_recs["book_count"] * (all_recs["book_count"] / all_recs["ratings"])

    # Filter and sort recommendations
    popular_recs = all_recs[all_recs["book_count"] > 75].sort_values("score", ascending=False)

    # Exclude books already liked by the user
    popular_recs = popular_recs[~popular_recs["book_id"].isin(liked_books)]

    # Format links and images for better presentation
    def make_clickable(val):
        return f'<a target="_blank" href="{val}">Goodreads</a>'

    def show_image(val):
        return f'<img src="{val}" width=50></img>'

    return popular_recs.head(10).style.format({'url': make_clickable, 'cover_image': show_image})




Using our recommendation engine:

The recommendation system takes a list of book IDs. A production recommendation system would instead take a book title, convert it to a book ID, and then provide recommendations.
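One way the title-to-ID step could look is sketched below. This is a hedged illustration, not part of the notebook: `normalise`, `title_to_book_id`, and `sample_catalogue` are hypothetical names, the lookup is an exact match on the cleaned title (the TF-IDF search above would be the fuzzier alternative), and ties are broken by picking the edition with the most ratings. The sample rows are copied from the search output shown earlier.

```python
import re

def normalise(title):
    """Apply the same cleaning the notebook uses for 'modified_title'."""
    cleaned = re.sub(r"[^a-zA-Z0-9 ]", "", title).lower()
    return re.sub(r"\s+", " ", cleaned)

def title_to_book_id(query, catalogue):
    """Return the book_id of the most-rated book whose cleaned title equals the
    cleaned query, or None when nothing matches."""
    wanted = normalise(query)
    matches = [book for book in catalogue if normalise(book["title"]) == wanted]
    if not matches:
        return None
    return max(matches, key=lambda book: book["ratings"])["book_id"]

# A few rows taken from the search output above; a real run would use the full titles table
sample_catalogue = [
    {"book_id": "662196", "title": "The Great Alone", "ratings": 280},
    {"book_id": "34912895", "title": "The Great Alone", "ratings": 265},
    {"book_id": "30753657", "title": "Alone", "ratings": 181},
]

liked_titles = ["the great alone!", "Alone", "Not In Catalogue"]
liked_ids = [title_to_book_id(t, sample_catalogue) for t in liked_titles]
liked_ids = [book_id for book_id in liked_ids if book_id is not None]  # drop titles we could not resolve
print(liked_ids)  # ['662196', '30753657']
```

The resulting list of string IDs has the same shape as the `liked_books` argument that `recommend_books` expects.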

In [25]:

liked_books = ["11069349", "13537029", "26827125", "18209268", "18693763"]  #IDs are strings, matching the values read from the mapping file
recommend_books(
    liked_books, 
    "data/book_id_map.csv", 
    "data/goodreads_interactions.csv", 
    "books_titles.json")

Unnamed: 0,book_id,book_count,title,ratings,url,cover_image,modified_title,score
8054,26856502,123,"Vengeful (Villains, #2)",35,Goodreads,,vengeful villains 2,432.257143
4499,28170940,213,"Lethal White (Cormoran Strike, #4)",106,Goodreads,,lethal white cormoran strike 4,428.009434
102,34273236,2804,Little Fires Everywhere,21135,Goodreads,,little fires everywhere,372.009274
6267,34927828,157,The Great Alone,70,Goodreads,,the great alone,352.128571
6196,24909347,158,"Obsidio (The Illuminae Files, #3)",82,Goodreads,,obsidio the illuminae files 3,304.439024
567,32920226,1135,"Sing, Unburied, Sing",4592,Goodreads,,sing unburied sing,280.536803
3550,34217599,267,Future Home of the Living God,263,Goodreads,,future home of the living god,271.060837
3194,24493732,297,Solutions and Other Problems,334,Goodreads,,solutions and other problems,264.098802
525,25810500,1192,What is Not Yours is Not Yours,5470,Goodreads,,what is not yours is not yours,259.755759
176,28815371,2228,The Mothers,22346,Goodreads,,the mothers,222.141949
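
The score column can be checked by hand. The sketch below recomputes it for three rows of the table using the formula from `recommend_books`, `book_count * (book_count / ratings)`; the function name `overlap_score` is an assumption introduced for illustration. Here `book_count` is the number of interactions the book received from overlap users and `ratings` is its overall Goodreads rating count, so a book scores well when overlap users account for a large share of its readers.

```python
def overlap_score(book_count, ratings):
    """Same formula as the 'score' column: count weighted by the share of all ratings it represents."""
    return book_count * (book_count / ratings)

# (title, book_count, ratings) copied from the recommendation table above
rows = [
    ("Vengeful (Villains, #2)", 123, 35),
    ("Little Fires Everywhere", 2804, 21135),
    ("The Mothers", 2228, 22346),
]

for title, book_count, ratings in rows:
    print(f"{title}: {overlap_score(book_count, ratings):.6f}")
```

Little Fires Everywhere has by far the largest `book_count` of the three, yet Vengeful ranks above it because its 123 overlap interactions are large relative to its 35 recorded ratings.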
