# Enhancing Book Discovery: Building a Data-Driven Recommendation System with Goodreads Data

## Introduction
This document outlines the development of a book recommendation system that personalizes suggestions using Goodreads data. It combines collaborative filtering with content-based techniques to match recommendations to a user's preferences and reading history.

The system draws on a large dataset of user interactions and book details, analyzed in Python with pandas. Collaborative filtering suggests books favored by users with similar tastes, while content-based matching recommends books that share thematic and stylistic traits with those a user has already enjoyed.

The goal is to deliver relevant book suggestions and a more curated reading experience.

## Objectives

1. **Integrate Data**: Accurately map Goodreads book IDs with internal CSV IDs for precise book tracking across different data segments.
2. **Analyze User Preferences**: Identify users with similar tastes, defined as those who gave a rating of 4 or higher to at least one book the target user liked.
3. **Extract and Filter Data**: Collect every interaction recorded for those users to build the pool of candidate recommendations.
4. **Compile and Score Recommendations**: Rank the candidate books with a score that weighs how often a book appears among like-minded users against its overall number of Goodreads ratings.
5. **Enhance Presentation**: Improve the display of recommendation results with interactive elements like clickable links and book images.
6. **Validate System**: Demonstrate the system’s capability to effectively suggest relevant books, confirming its practical utility and accuracy.

Together, these objectives yield a data-driven recommendation pipeline that produces personalized book suggestions.

## Conclusion
The book recommendation system presented here personalizes suggestions by finding readers with overlapping tastes and ranking the books they read. By blending collaborative and content-based filtering methods, it aims to address diverse user preferences and to offer suggestions that fit individual tastes and reading histories. In the run shown below, a liked list that includes Meditations, Dune, The Shining, and The Green Mile yields top recommendations such as Letters from a Stoic and The Discourses. Beyond matching known interests, the approach also supports discovery of less familiar titles.

In [1]:
# Import pandas library for data manipulation and analysis
import pandas as pd

In [2]:
# List of book IDs that are liked by the user, serving as a basis for recommendation
liked_books = [
    "5996629", "30659", "1271159", "85424",
    "11047557", "12977531", "53732",
]

In [3]:
# Load and parse a CSV file that maps CSV IDs to book IDs
csv_book_mapping = {}
with open(r"book_id_map.csv", "r") as f:
    # Iterating over the file object reads one line at a time
    for line in f:
        csv_id, book_id = line.strip().split(",")
        csv_book_mapping[csv_id] = book_id
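
The same mapping could also be built with the standard `csv` module, which makes header handling explicit and tolerates blank lines. The sketch below is an illustration under assumptions: the helper name `load_book_id_map` and its `skip_header` flag are hypothetical, and the in-memory sample uses made-up IDs in place of `book_id_map.csv`.

```python
import csv
import io

def load_book_id_map(handle, skip_header=False):
    """Build a {csv_id: book_id} dict from a two-column CSV stream."""
    reader = csv.reader(handle)
    if skip_header:
        next(reader, None)  # discard the first row if it holds column names
    mapping = {}
    for row in reader:
        if len(row) != 2:
            continue  # ignore blank or malformed rows instead of raising
        csv_id, book_id = row
        mapping[csv_id] = book_id
    return mapping

# Made-up sample standing in for book_id_map.csv (note the blank line)
sample = io.StringIO("0,111\n1,222\n\n2,333\n")
mapping = load_book_id_map(sample)
print(len(mapping))
```

With a real file, the call would look like `load_book_id_map(open("book_id_map.csv", newline=""))`. Whether to skip a header depends on the file, and skipping one would lower the entry count by one.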

In [4]:
# Display the number of entries in the book ID mapping to ensure it's loaded correctly
len(csv_book_mapping)

2360651

In [5]:
# Identify users who rated at least one of the liked books 4 or higher
overlap_users = set()

with open(r"goodreads_interactions.csv", "r") as f:
    for line in f:
        user_id, csv_id, _, rating, _ = line.split(",")

        # A user only needs to match once
        if user_id in overlap_users:
            continue

        # Skip rows whose rating is not an integer (for example, a header row)
        try:
            rating = int(rating)
        except ValueError:
            continue

        book_id = csv_book_mapping[csv_id]

        if book_id in liked_books and rating >= 4:
            overlap_users.add(user_id)
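
To make the overlap rule easier to test on small inputs, the pass above can be wrapped in a function. This is a minimal sketch, not the notebook's own code: `find_overlap_users` and `min_rating` are hypothetical names, the header names in the sample are assumed, and the rows are made up. It also looks IDs up with `dict.get`, so an unmapped CSV ID is skipped rather than raising `KeyError`.

```python
import csv
import io

def find_overlap_users(handle, csv_book_mapping, liked_books, min_rating=4):
    """Return user IDs that rated at least one liked book >= min_rating."""
    liked = set(liked_books)  # set membership is O(1) per lookup
    users = set()
    for row in csv.reader(handle):
        if len(row) != 5:
            continue  # skip blank or malformed rows
        user_id, csv_id, _, rating, _ = row
        if user_id in users:
            continue  # already matched, nothing more to learn
        try:
            rating = int(rating)
        except ValueError:
            continue  # header row or non-numeric rating
        book_id = csv_book_mapping.get(csv_id)  # None if the ID is unmapped
        if book_id in liked and rating >= min_rating:
            users.add(user_id)
    return users

# Made-up rows; the column names in the header are an assumption
interactions = io.StringIO(
    "user_id,book_id,is_read,rating,is_reviewed\n"
    "0,10,1,5,0\n"
    "0,11,1,3,0\n"
    "1,10,1,2,0\n"
    "2,12,1,4,1\n"
)
mapping = {"10": "30659", "11": "53732", "12": "85424"}
overlap = find_overlap_users(interactions, mapping, ["30659", "85424"])
print(sorted(overlap))
```

User `1` is excluded because the only liked book they rated received a 2, which falls below the threshold.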


In [7]:
# Collect every interaction (at any rating) from the users with similar tastes
rec_lines = []
with open(r"goodreads_interactions.csv", "r") as f:
    for line in f:
        user_id, csv_id, _, rating, _ = line.split(",")

        if user_id in overlap_users:
            book_id = csv_book_mapping[csv_id]
            rec_lines.append([user_id, book_id, rating])
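
An alternative to the manual loop is to stream the file through pandas in chunks and filter each chunk, which keeps memory bounded while producing a DataFrame directly. The sketch below assumes the interactions file has a header row with a `user_id` column; the helper name `collect_rows_for_users`, the sample rows, and the tiny `chunksize` are all illustrative.

```python
import io
import pandas as pd

def collect_rows_for_users(handle, users, chunksize=2):
    """Stream a CSV in chunks and keep only rows whose user_id is in users."""
    kept = []
    # dtype=str keeps IDs as strings, matching the string keys used elsewhere
    for chunk in pd.read_csv(handle, dtype=str, chunksize=chunksize):
        kept.append(chunk[chunk["user_id"].isin(users)])
    return pd.concat(kept, ignore_index=True)

# Made-up rows; the column names are an assumption about the file's header
interactions = io.StringIO(
    "user_id,book_id,is_read,rating,is_reviewed\n"
    "0,10,1,5,0\n"
    "1,10,1,2,0\n"
    "0,11,1,3,0\n"
    "2,12,1,4,1\n"
)
rows = collect_rows_for_users(interactions, {"0", "2"})
print(rows[["user_id", "book_id", "rating"]])
```

On the full file a much larger `chunksize` would be appropriate, and the CSV IDs would still need translating to Goodreads IDs, for instance with `rows["book_id"].map(csv_book_mapping)`.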

In [8]:
# Check how many users have similar book preferences
len(overlap_users)

5136

In [9]:
# Display the number of recommendation entries gathered
len(rec_lines)

3521261

In [11]:
# Convert the list of recommendations into a DataFrame for easier manipulation
recs = pd.DataFrame(rec_lines, columns=["user_id", "book_id", "rating"])
recs["book_id"] = recs["book_id"].astype(str)

In [12]:
# Identify the top 10 most frequently recommended books
top_recs = recs["book_id"].value_counts().head(10)
top_recs = top_recs.index.values

In [14]:
# Load book titles from a JSON file into a DataFrame
books_titles = pd.read_json("books_titles.json")
books_titles["book_id"] = books_titles["book_id"].astype(str)

In [15]:
# Preview the first few entries of the books titles DataFrame
books_titles.head()

Unnamed: 0,book_id,title,ratings,url,cover_image,mod_title
0,7327624,"The Unschooled Wizard (Sun Wolf and Starhawk, ...",140,https://www.goodreads.com/book/show/7327624-th...,https://images.gr-assets.com/books/1304100136m...,the unschooled wizard sun wolf and starhawk 12
1,6066819,Best Friends Forever,51184,https://www.goodreads.com/book/show/6066819-be...,https://s.gr-assets.com/assets/nophoto/book/11...,best friends forever
2,287141,The Aeneid for Boys and Girls,46,https://www.goodreads.com/book/show/287141.The...,https://s.gr-assets.com/assets/nophoto/book/11...,the aeneid for boys and girls
3,6066812,All's Fairy in Love and War (Avalon: Web of Ma...,98,https://www.goodreads.com/book/show/6066812-al...,https://images.gr-assets.com/books/1316637798m...,alls fairy in love and war avalon web of magic 8
4,287149,The Devil's Notebook,986,https://www.goodreads.com/book/show/287149.The...,https://images.gr-assets.com/books/1328768789m...,the devils notebook


In [18]:
# Count all book recommendations to see overall popularity
all_recs = recs["book_id"].value_counts()

In [19]:
# Display the frequency of all book recommendations
all_recs

book_id
5470        2763
30659       2652
4671        2475
2657        2467
7613        2357
            ... 
1071344        1
29995904       1
1071392        1
1071444        1
1871541        1
Name: count, Length: 627765, dtype: int64

In [20]:
# Convert the book recommendation counts into a DataFrame for further analysis
all_recs = all_recs.to_frame().reset_index()

In [21]:
# Preview the structured DataFrame of book recommendations
all_recs

Unnamed: 0,book_id,count
0,5470,2763
1,30659,2652
2,4671,2475
3,2657,2467
4,7613,2357
...,...,...
627760,1071344,1
627761,29995904,1
627762,1071392,1
627763,1071444,1


In [22]:
# Rename columns for clarity
all_recs.columns = ["book_id", "book_count"]

In [23]:
# Show the DataFrame with renamed columns
all_recs

Unnamed: 0,book_id,book_count
0,5470,2763
1,30659,2652
2,4671,2475
3,2657,2467
4,7613,2357
...,...,...
627760,1071344,1
627761,29995904,1
627762,1071392,1
627763,1071444,1


In [24]:
# Merge the recommendation counts with book titles for comprehensive data
all_recs = all_recs.merge(books_titles, how="inner", on="book_id")
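
An inner merge silently drops every `book_id` that has no row in `books_titles`. The counts output above reports 627765 distinct books, while the index of the merged frame in the next output ends well below that, so some books appear to have no title match. One way to measure this before committing to the inner join is a left merge with `indicator=True`, sketched here on made-up frames:

```python
import pandas as pd

# Made-up stand-ins for the notebook's `all_recs` and `books_titles`
all_recs = pd.DataFrame({"book_id": ["1", "2", "3"], "book_count": [5, 3, 1]})
books_titles = pd.DataFrame({"book_id": ["1", "3"], "title": ["Book A", "Book C"]})

# how="left" keeps every recommendation; `_merge` marks rows without a title
checked = all_recs.merge(books_titles, how="left", on="book_id", indicator=True)
missing = checked[checked["_merge"] == "left_only"]

print(f"{len(missing)} of {len(all_recs)} book IDs have no title match")
print(missing["book_id"].tolist())
```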

In [25]:
# Show the merged DataFrame containing both recommendation counts and book details
all_recs

Unnamed: 0,book_id,book_count,title,ratings,url,cover_image,mod_title
0,5470,2763,1984,2023937,https://www.goodreads.com/book/show/5470.1984,https://images.gr-assets.com/books/1348990566m...,1984
1,30659,2652,Meditations,45727,https://www.goodreads.com/book/show/30659.Medi...,https://images.gr-assets.com/books/1421618636m...,meditations
2,4671,2475,The Great Gatsby,2758812,https://www.goodreads.com/book/show/4671.The_G...,https://images.gr-assets.com/books/1490528560m...,the great gatsby
3,2657,2467,To Kill a Mockingbird,3255518,https://www.goodreads.com/book/show/2657.To_Ki...,https://images.gr-assets.com/books/1361975680m...,to kill a mockingbird
4,7613,2357,Animal Farm,1928931,https://www.goodreads.com/book/show/7613.Anima...,https://images.gr-assets.com/books/1424037542m...,animal farm
...,...,...,...,...,...,...,...
539712,1071339,1,"In Search of Andy (Replica, #12)",431,https://www.goodreads.com/book/show/1071339.In...,https://s.gr-assets.com/assets/nophoto/book/11...,in search of andy replica 12
539713,1071344,1,"Ice Cold (Replica, #10)",475,https://www.goodreads.com/book/show/1071344.Ic...,https://s.gr-assets.com/assets/nophoto/book/11...,ice cold replica 10
539714,29995904,1,Finishing School: The Happy Ending to That Wri...,72,https://www.goodreads.com/book/show/29995904-f...,https://images.gr-assets.com/books/1462161491m...,finishing school the happy ending to that writ...
539715,1071392,1,"Mystery Mother (Replica, #8)",475,https://www.goodreads.com/book/show/1071392.My...,https://s.gr-assets.com/assets/nophoto/book/11...,mystery mother replica 8


In [26]:
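# Score each book as book_count**2 / ratings, where `ratings` is the book's overall
# number of Goodreads ratings. Books that are unusually common among the similar
# users, relative to their general popularity, score highest.
all_recs["score"] = all_recs["book_count"] * (all_recs["book_count"] / all_recs["ratings"])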
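
The effect of this formula is easiest to see on a few rows. The sketch below reuses three `book_count` and `ratings` pairs copied from the notebook outputs and recomputes the score. 1984 has the highest raw count, but its very large overall rating total pushes its score far below the other two.

```python
import pandas as pd

# Three rows copied from the notebook output (book_count and ratings columns)
sample = pd.DataFrame(
    {
        "title": ["1984", "Meditations", "Dune"],
        "book_count": [2763, 2652, 969],
        "ratings": [2023937, 45727, 8645],
    }
)

# Same formula as the cell above: book_count * (book_count / ratings)
sample["score"] = sample["book_count"] ** 2 / sample["ratings"]

ranked = sample.sort_values("score", ascending=False)
print(ranked)
```

Meditations and Dune come out at roughly 153.81 and 108.61, matching the sorted output further down, while 1984 scores below 4.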

In [28]:
# Sort the books by the calculated score to find the highest-scoring recommendations
all_recs.sort_values("score", ascending=False).head(10)

Unnamed: 0,book_id,book_count,title,ratings,url,cover_image,mod_title,score
1,30659,2652,Meditations,45727,https://www.goodreads.com/book/show/30659.Medi...,https://images.gr-assets.com/books/1421618636m...,meditations,153.806373
107,53732,969,Dune,8645,https://www.goodreads.com/book/show/53732.Dune,https://images.gr-assets.com/books/1426192671m...,dune,108.613187
285,12977531,596,The Shining,3998,https://www.goodreads.com/book/show/12977531-t...,https://images.gr-assets.com/books/1333576785m...,the shining,88.848424
295,85424,586,The Green Mile,4802,https://www.goodreads.com/book/show/85424.The_...,https://images.gr-assets.com/books/1289526684m...,the green mile,71.511037
175,97411,776,Letters from a Stoic,9134,https://www.goodreads.com/book/show/97411.Lett...,https://images.gr-assets.com/books/1421619214m...,letters from a stoic,65.926867
825,1045017,308,The Discourses,2886,https://www.goodreads.com/book/show/1045017.Th...,https://s.gr-assets.com/assets/nophoto/book/11...,the discourses,32.870409
14880,26856502,33,"Vengeful (Villains, #2)",35,https://www.goodreads.com/book/show/26856502-v...,https://s.gr-assets.com/assets/nophoto/book/11...,vengeful villains 2,31.114286
16234,8514123,31,All the Talk Is Dead,35,https://www.goodreads.com/book/show/8514123-al...,https://s.gr-assets.com/assets/nophoto/book/11...,all the talk is dead,27.457143
2247,4143812,151,Discourses and Selected Writings,832,https://www.goodreads.com/book/show/4143812-di...,https://images.gr-assets.com/books/1311645700m...,discourses and selected writings,27.405048
9876,24909347,47,"Obsidio (The Illuminae Files, #3)",82,https://www.goodreads.com/book/show/24909347-o...,https://images.gr-assets.com/books/1501704611m...,obsidio the illuminae files 3,26.939024


In [30]:
# Keep only books that appear more than 75 times among the similar users, so that
# titles with very few overall ratings cannot dominate the score
popular_recs = all_recs[all_recs["book_count"] > 75].sort_values("score", ascending=False)
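
The threshold filter, the exclusion of already-liked books, and the final ranking can be gathered into one function so the cutoff and list length become parameters. This is a sketch under assumptions: `top_recommendations`, `min_count`, and `n` are hypothetical names, and the scored frame is made up.

```python
import pandas as pd

def top_recommendations(all_recs, liked_books, min_count=75, n=10):
    """Rank scored books, dropping rare titles and ones the user already liked."""
    popular = all_recs[all_recs["book_count"] > min_count]
    unseen = popular[~popular["book_id"].isin(set(liked_books))]
    return unseen.sort_values("score", ascending=False).head(n)

# Made-up scored frame with the columns the function relies on
scored = pd.DataFrame(
    {
        "book_id": ["1", "2", "3", "4"],
        "book_count": [200, 90, 10, 120],
        "score": [5.0, 9.0, 50.0, 7.0],
    }
)
picks = top_recommendations(scored, liked_books=["2"], min_count=75, n=2)
print(picks["book_id"].tolist())
```

Book `3` has the highest score but is removed by the count threshold, and book `2` is removed because it is already liked.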

In [32]:
# Helpers that render the URL as a clickable link and the cover as a thumbnail
def make_clickable(val):
    return '<a target="_blank" href="{}">Goodreads</a>'.format(val)

def show_image(val):
    return '<img src="{}" width="50">'.format(val)

# Display the top 10 recommendations, excluding books the user already liked
new_recs = popular_recs[~popular_recs["book_id"].isin(liked_books)].head(10)
new_recs.style.format({"url": make_clickable, "cover_image": show_image})

Unnamed: 0,book_id,book_count,title,ratings,url,cover_image,mod_title,score
175,97411,776,Letters from a Stoic,9134,Goodreads,,letters from a stoic,65.926867
825,1045017,308,The Discourses,2886,Goodreads,,the discourses,32.870409
2247,4143812,151,Discourses and Selected Writings,832,Goodreads,,discourses and selected writings,27.405048
3462,305860,109,Philosophy As a Way of Life: Spiritual Exercises from Socrates to Foucault,470,Goodreads,,philosophy as a way of life spiritual exercises from socrates to foucault,25.278723
1292,21032488,227,"Doors of Stone (The Kingkiller Chronicle, #3)",2059,Goodreads,,doors of stone the kingkiller chronicle 3,25.026226
2100,84597,160,On the Good Life,1243,Goodreads,,on the good life,20.595334
1405,24618,216,"The Art of Living: The Classical Manual on Virtue, Happiness and Effectiveness",2689,Goodreads,,the art of living the classical manual on virtue happiness and effectiveness,17.350688
3594,195762,105,The Essential Epicurus,684,Goodreads,,the essential epicurus,16.118421
696,30735,344,The Complete Essays,9232,Goodreads,,the complete essays,12.818024
701,205218,343,Ethics,9491,Goodreads,,ethics,12.395849
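
Outside a notebook, the same table could be written to a standalone HTML string with `DataFrame.to_html` and per-column formatters. The sketch below is one possible approach rather than the notebook's method: the row is made up, the `example.com` URLs are placeholders, and the helpers additionally escape the URL before embedding it in an attribute. The `display.max_colwidth` option is lifted while rendering so that long cell contents are not shortened.

```python
import html
import pandas as pd

def make_clickable(url):
    # Escape the URL so quotes or ampersands cannot break the attribute
    return '<a target="_blank" href="{}">Goodreads</a>'.format(html.escape(url, quote=True))

def show_image(url):
    return '<img src="{}" width="50">'.format(html.escape(url, quote=True))

# Made-up row; the example.com URLs are placeholders, not real Goodreads links
table = pd.DataFrame(
    {
        "title": ["Book A"],
        "url": ["https://example.com/book/1?ref=a&b"],
        "cover_image": ["https://example.com/cover/1.jpg"],
    }
)

# escape=False passes the formatter output through as raw HTML
with pd.option_context("display.max_colwidth", None):
    page = table.to_html(
        escape=False,
        index=False,
        formatters={"url": make_clickable, "cover_image": show_image},
    )
print(page)
```

Because `escape=False` applies to the whole table, other text columns such as `title` are emitted unescaped as well, which matters if they can contain markup.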
