# Developing a Preliminary Book Search Engine Using Goodreads Data

## Introduction
This notebook builds a prototype title search for a large online book catalogue, using the Goodreads books dataset. A user query is normalized in the same way as the stored titles, converted to a TF-IDF vector, and matched against the catalogue by cosine similarity.

The aim is fast, precise lookup. After filtering, the index covers roughly 1.3 million book entries, and a query returns the closest title matches, ordered by how many ratings each edition has received. Users can therefore find a specific title, or the various editions of it, with little effort.

The notebook is one building block of a book recommendation system based on a Goodreads dataset. Its objectives run from initial data handling to a working prototype of the recommendation engine:

1. **Efficient Data Handling**: Load and parse extensive data efficiently from a compressed JSON file, extracting only the necessary information to optimize processing and memory usage.

2. **Rigorous Data Cleaning**: Apply thorough cleaning techniques to book titles to ensure uniformity across the dataset, which is crucial for the subsequent analysis and recommendation processes.

3. **Textual Data Vectorization**: Employ TF-IDF vectorization to transform book titles into a numerically analyzable format, setting the stage for applying machine learning algorithms for similarity detection.

4. **Recommendation Engine Setup**: Develop a functional recommendation engine that utilizes textual similarity, calculated through cosine similarity measures, to suggest books based on user queries.

5. **Interactive Features Development**: Enhance the data presentation within the notebook by introducing interactive elements like clickable links and image displays, thus improving the user experience.

6. **Capability Demonstration**: Validate the effectiveness of the developed recommendation system by performing practical queries, showcasing the system’s ability to deliver relevant book recommendations.

Together, these steps produce a working prototype that matches user queries against book titles by textual similarity and presents the results in a readable, interactive table.
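
Objectives 3 and 4 can be illustrated on a very small scale before working with the full dataset. The following is a minimal sketch, not the notebook's pipeline: the four titles and the `toy_` names are hypothetical and stand in for the cleaned Goodreads titles.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for cleaned book titles
toy_titles = [
    "dune",
    "dune messiah",
    "children of dune",
    "the devils notebook",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_titles)  # one sparse row per title

# Project the query into the same vector space and score it against every title
toy_query = toy_vectorizer.transform(["dune"])
toy_scores = cosine_similarity(toy_query, toy_matrix).flatten()

# Highest similarity first
toy_ranking = sorted(zip(toy_titles, toy_scores), key=lambda pair: pair[1], reverse=True)
for title, score in toy_ranking:
    print(f"{score:.3f}  {title}")
```

The exact match scores highest, titles that merely contain the query word score lower, and a title with no shared words scores zero.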

## Conclusion
The search system developed here gives users a quick way to look up books in a large online catalogue. Title normalization, TF-IDF vectorization, and cosine similarity together return relevant editions for a query such as "dune", and the clickable links and cover images make the results easy to scan. Because it matches on title text only, the prototype is best seen as a first step: it provides the lookup component on which the wider book recommendation system can build.

In [32]:
# Importing necessary libraries
import re
import gzip
import json
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [6]:
# Loading data from a gzipped JSON file containing book information
with gzip.open(r"goodreads_books.json.gz", 'r') as f:
    line = f.readline()

In [9]:
# Parse a single JSON line to explore the data format
json.loads(line)

{'isbn': '0312853122',
 'text_reviews_count': '1',
 'series': [],
 'country_code': 'US',
 'language_code': '',
 'popular_shelves': [{'count': '3', 'name': 'to-read'},
  {'count': '1', 'name': 'p'},
  {'count': '1', 'name': 'collection'},
  {'count': '1', 'name': 'w-c-fields'},
  {'count': '1', 'name': 'biography'}],
 'asin': '',
 'is_ebook': 'false',
 'average_rating': '4.00',
 'kindle_asin': '',
 'similar_books': [],
 'description': '',
 'format': 'Paperback',
 'link': 'https://www.goodreads.com/book/show/5333265-w-c-fields',
 'authors': [{'author_id': '604031', 'role': ''}],
 'publisher': "St. Martin's Press",
 'num_pages': '256',
 'publication_day': '1',
 'isbn13': '9780312853129',
 'publication_month': '9',
 'edition_information': '',
 'publication_year': '1984',
 'url': 'https://www.goodreads.com/book/show/5333265-w-c-fields',
 'image_url': 'https://images.gr-assets.com/books/1310220028m/5333265.jpg',
 'book_id': '5333265',
 'ratings_count': '3',
 'work_id': '5400751',
 'title': '

In [10]:
# Define a function to parse relevant fields from a JSON line
def parse_fields(line):
    data = json.loads(line)
    return {
        "book_id": data["book_id"],
        "title": data["title_without_series"],
        "ratings": data["ratings_count"],
        "url": data["url"],
        "cover_image": data["image_url"]
    }

In [16]:
# Extract data from the gzipped file, keeping only entries with more than 15 ratings
books_titles = []
with gzip.open(r"goodreads_books.json.gz", 'r') as f:
    # Iterate line by line so the whole file is never held in memory
    for line in f:
        fields = parse_fields(line)

        try:
            ratings = int(fields["ratings"])
        except ValueError:
            # Skip records whose ratings count is not a valid integer
            continue
        if ratings > 15:
            books_titles.append(fields)
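
The loop above guards against a non-numeric ratings count, but a record with a missing field or malformed JSON would still stop the run. As a hedged sketch of a more tolerant variant, the hypothetical helper `parse_fields_safe` below returns `None` for any record it cannot use. The three sample records are invented and written to an in-memory gzip buffer so the example runs without the real file.

```python
import gzip
import io
import json

def parse_fields_safe(raw_line):
    """Hypothetical variant of parse_fields that returns None for unusable records."""
    try:
        data = json.loads(raw_line)
        return {
            "book_id": data["book_id"],
            "title": data["title_without_series"],
            "ratings": int(data["ratings_count"]),
            "url": data["url"],
            "cover_image": data["image_url"],
        }
    except (KeyError, ValueError):
        # Missing field, malformed JSON, or a non-numeric ratings count
        return None

# Invented records: one popular book, one with few ratings, one with an empty count
sample_records = [
    {"book_id": "1", "title_without_series": "Dune", "ratings_count": "8645", "url": "u1", "image_url": "i1"},
    {"book_id": "2", "title_without_series": "Obscure", "ratings_count": "3", "url": "u2", "image_url": "i2"},
    {"book_id": "3", "title_without_series": "No Count", "ratings_count": "", "url": "u3", "image_url": "i3"},
]
buffer = io.BytesIO()
with gzip.open(buffer, "wt", encoding="utf-8") as f:
    for record in sample_records:
        f.write(json.dumps(record) + "\n")
buffer.seek(0)

kept = []
with gzip.open(buffer, "rt", encoding="utf-8") as f:
    for raw_line in f:
        fields = parse_fields_safe(raw_line)
        if fields is not None and fields["ratings"] > 15:
            kept.append(fields)

print(kept)
```

Storing the ratings as an integer at parse time would also make the later `pd.to_numeric` conversion unnecessary, although the notebook keeps that step.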

In [17]:
# Convert the list of book titles into a Pandas DataFrame
titles = pd.DataFrame.from_dict(books_titles)

In [18]:
# Convert ratings column to numeric type for analysis
titles["ratings"] = pd.to_numeric(titles["ratings"])

In [21]:
# Clean the title text by removing non-alphanumeric characters
titles["mod_title"] = titles["title"].str.replace("[^a-zA-Z0-9 ]", "", regex=True)

In [22]:
# Display the current state of the DataFrame to check the modifications
titles

Unnamed: 0,book_id,title,ratings,url,cover_image,mod_title
0,7327624,"The Unschooled Wizard (Sun Wolf and Starhawk, ...",140,https://www.goodreads.com/book/show/7327624-th...,https://images.gr-assets.com/books/1304100136m...,The Unschooled Wizard Sun Wolf and Starhawk 12
1,6066819,Best Friends Forever,51184,https://www.goodreads.com/book/show/6066819-be...,https://s.gr-assets.com/assets/nophoto/book/11...,Best Friends Forever
2,287141,The Aeneid for Boys and Girls,46,https://www.goodreads.com/book/show/287141.The...,https://s.gr-assets.com/assets/nophoto/book/11...,The Aeneid for Boys and Girls
3,6066812,All's Fairy in Love and War (Avalon: Web of Ma...,98,https://www.goodreads.com/book/show/6066812-al...,https://images.gr-assets.com/books/1316637798m...,Alls Fairy in Love and War Avalon Web of Magic 8
4,287149,The Devil's Notebook,986,https://www.goodreads.com/book/show/287149.The...,https://images.gr-assets.com/books/1328768789m...,The Devils Notebook
...,...,...,...,...,...,...
1308952,17805813,"Ondine (Ondine Quartet, #0.5)",327,https://www.goodreads.com/book/show/17805813-o...,https://images.gr-assets.com/books/1379766592m...,Ondine Ondine Quartet 05
1308953,331839,Jacqueline Kennedy Onassis: Friend of the Arts,18,https://www.goodreads.com/book/show/331839.Jac...,https://s.gr-assets.com/assets/nophoto/book/11...,Jacqueline Kennedy Onassis Friend of the Arts
1308954,2685097,The Spaniard's Blackmailed Bride,112,https://www.goodreads.com/book/show/2685097-th...,https://s.gr-assets.com/assets/nophoto/book/11...,The Spaniards Blackmailed Bride
1308955,2342551,The Children's Classic Poetry Collection,36,https://www.goodreads.com/book/show/2342551.Th...,https://s.gr-assets.com/assets/nophoto/book/11...,The Childrens Classic Poetry Collection


In [24]:
# Convert the modified titles to lowercase for uniformity
titles["mod_title"] = titles["mod_title"].str.lower()

In [25]:
# Replace multiple spaces with a single space in title strings
titles["mod_title"] = titles["mod_title"].str.replace(r"\s+", " ", regex=True)


In [26]:
# Display the DataFrame again to verify text cleaning
titles

Unnamed: 0,book_id,title,ratings,url,cover_image,mod_title
0,7327624,"The Unschooled Wizard (Sun Wolf and Starhawk, ...",140,https://www.goodreads.com/book/show/7327624-th...,https://images.gr-assets.com/books/1304100136m...,the unschooled wizard sun wolf and starhawk 12
1,6066819,Best Friends Forever,51184,https://www.goodreads.com/book/show/6066819-be...,https://s.gr-assets.com/assets/nophoto/book/11...,best friends forever
2,287141,The Aeneid for Boys and Girls,46,https://www.goodreads.com/book/show/287141.The...,https://s.gr-assets.com/assets/nophoto/book/11...,the aeneid for boys and girls
3,6066812,All's Fairy in Love and War (Avalon: Web of Ma...,98,https://www.goodreads.com/book/show/6066812-al...,https://images.gr-assets.com/books/1316637798m...,alls fairy in love and war avalon web of magic 8
4,287149,The Devil's Notebook,986,https://www.goodreads.com/book/show/287149.The...,https://images.gr-assets.com/books/1328768789m...,the devils notebook
...,...,...,...,...,...,...
1308952,17805813,"Ondine (Ondine Quartet, #0.5)",327,https://www.goodreads.com/book/show/17805813-o...,https://images.gr-assets.com/books/1379766592m...,ondine ondine quartet 05
1308953,331839,Jacqueline Kennedy Onassis: Friend of the Arts,18,https://www.goodreads.com/book/show/331839.Jac...,https://s.gr-assets.com/assets/nophoto/book/11...,jacqueline kennedy onassis friend of the arts
1308954,2685097,The Spaniard's Blackmailed Bride,112,https://www.goodreads.com/book/show/2685097-th...,https://s.gr-assets.com/assets/nophoto/book/11...,the spaniards blackmailed bride
1308955,2342551,The Children's Classic Poetry Collection,36,https://www.goodreads.com/book/show/2342551.Th...,https://s.gr-assets.com/assets/nophoto/book/11...,the childrens classic poetry collection


In [27]:
# Remove rows where the modified title is empty
titles = titles[titles["mod_title"].str.len() > 0]
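
The three cleaning steps (strip non-alphanumeric characters, lowercase, collapse whitespace) can be collected into a single function, which also shows why the empty-title filter is needed. This is a sketch only: `normalize_title` is a hypothetical helper, not part of the notebook, and it uses plain `re` calls in place of the pandas string methods.

```python
import re

def normalize_title(title):
    """Hypothetical helper mirroring the notebook's three cleaning steps."""
    cleaned = re.sub(r"[^a-zA-Z0-9 ]", "", title)  # drop punctuation and symbols
    cleaned = cleaned.lower()                       # case-fold
    cleaned = re.sub(r"\s+", " ", cleaned)          # collapse runs of spaces
    return cleaned

print(normalize_title("Ondine (Ondine Quartet, #0.5)"))  # → ondine ondine quartet 05
print(normalize_title("The Devil's Notebook"))           # → the devils notebook

# A title with no ASCII letters or digits is reduced to an empty string,
# which is the case the row filter above removes
print(repr(normalize_title("戦争と平和")))                # → ''
```

Applying the same function to both the stored titles and the incoming query would keep the two normalizations from drifting apart.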

In [28]:
# Save the cleaned and processed titles to a JSON file
titles.to_json("books_titles.json")
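
When the saved file is read back in a later session, pandas infers column types again, so identifier columns stored as strings may come back as integers. The sketch below shows one way to guard against that with the `dtype` argument of `pd.read_json`. The two-row frame is invented, and a temporary directory is used so the real `books_titles.json` is not touched.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame shaped like the notebook's `titles`
sample = pd.DataFrame({
    "book_id": ["53732", "1685995"],
    "title": ["Dune", "Dune"],
    "ratings": [8645, 1653],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "books_titles.json")
    sample.to_json(path)
    # The dtype hint asks pandas to keep book_id as text instead of inferring a number
    restored = pd.read_json(path, dtype={"book_id": str})

print(restored)
```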

In [29]:
# Final display of the DataFrame to ensure all operations have been successful
titles

Unnamed: 0,book_id,title,ratings,url,cover_image,mod_title
0,7327624,"The Unschooled Wizard (Sun Wolf and Starhawk, ...",140,https://www.goodreads.com/book/show/7327624-th...,https://images.gr-assets.com/books/1304100136m...,the unschooled wizard sun wolf and starhawk 12
1,6066819,Best Friends Forever,51184,https://www.goodreads.com/book/show/6066819-be...,https://s.gr-assets.com/assets/nophoto/book/11...,best friends forever
2,287141,The Aeneid for Boys and Girls,46,https://www.goodreads.com/book/show/287141.The...,https://s.gr-assets.com/assets/nophoto/book/11...,the aeneid for boys and girls
3,6066812,All's Fairy in Love and War (Avalon: Web of Ma...,98,https://www.goodreads.com/book/show/6066812-al...,https://images.gr-assets.com/books/1316637798m...,alls fairy in love and war avalon web of magic 8
4,287149,The Devil's Notebook,986,https://www.goodreads.com/book/show/287149.The...,https://images.gr-assets.com/books/1328768789m...,the devils notebook
...,...,...,...,...,...,...
1308952,17805813,"Ondine (Ondine Quartet, #0.5)",327,https://www.goodreads.com/book/show/17805813-o...,https://images.gr-assets.com/books/1379766592m...,ondine ondine quartet 05
1308953,331839,Jacqueline Kennedy Onassis: Friend of the Arts,18,https://www.goodreads.com/book/show/331839.Jac...,https://s.gr-assets.com/assets/nophoto/book/11...,jacqueline kennedy onassis friend of the arts
1308954,2685097,The Spaniard's Blackmailed Bride,112,https://www.goodreads.com/book/show/2685097-th...,https://s.gr-assets.com/assets/nophoto/book/11...,the spaniards blackmailed bride
1308955,2342551,The Children's Classic Poetry Collection,36,https://www.goodreads.com/book/show/2342551.Th...,https://s.gr-assets.com/assets/nophoto/book/11...,the childrens classic poetry collection


In [31]:
# Initialize the TfidfVectorizer and fit it on the modified titles to prepare for querying
vectorizer = TfidfVectorizer()

tfidf = vectorizer.fit_transform(titles["mod_title"])
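
One property of the default settings is worth knowing: `TfidfVectorizer` tokenizes with the pattern `(?u)\b\w\w+\b`, which only keeps tokens of two or more characters. A volume number such as the trailing `8` in a cleaned title is therefore not indexed. The sketch below demonstrates this on two titles from the output above. The `wide_vec` variant with a custom `token_pattern` is an assumption about what one might try, not something the notebook does.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sample_titles = [
    "alls fairy in love and war avalon web of magic 8",
    "ondine ondine quartet 05",
]

# Default tokenizer: single-character tokens are discarded
default_vec = TfidfVectorizer()
default_vec.fit(sample_titles)
print("8" in default_vec.vocabulary_)   # → False
print("05" in default_vec.vocabulary_)  # → True

# Assumption: a custom pattern that also accepts one-character tokens
wide_vec = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
wide_vec.fit(sample_titles)
print("8" in wide_vec.vocabulary_)      # → True
```

Whether single-character tokens help or hurt search quality depends on the queries, so the default is a reasonable starting point.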

In [45]:
# Define functions to make links clickable and show images, and to perform search queries using TF-IDF vectorization
def make_clickable(val):
    return '<a target="_blank" href="{}">Goodreads</a>'.format(val)

def show_image(val):
    return '<img src="{}" width=50></img>'.format(val)


def search(query, vectorizer):
    # Normalize the query the same way the titles were cleaned
    processed = re.sub(r"[^a-zA-Z0-9 ]", "", query.lower())
    query_vec = vectorizer.transform([processed])
    # Score the query against every title (uses the global `tfidf` matrix)
    similarity = cosine_similarity(query_vec, tfidf).flatten()
    # Take the 10 most similar titles, then show the 5 with the most ratings
    indices = np.argpartition(similarity, -10)[-10:]
    results = titles.iloc[indices]
    results = results.sort_values("ratings", ascending=False)
    return results.head(5).style.format({'url': make_clickable, 'cover_image': show_image})
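
The call to `np.argpartition` selects the ten highest-scoring rows without sorting the full similarity array, and the indices it returns are in no particular order. The notebook then orders those candidates by rating count. If ordering by similarity were wanted instead, a small follow-up sort over the selected candidates would do it. The sketch below shows the idea on invented scores with `k = 3`.

```python
import numpy as np

# Hypothetical similarity scores for eight titles
scores = np.array([0.10, 0.95, 0.00, 0.40, 0.80, 0.05, 0.60, 0.30])
k = 3

# argpartition guarantees the k largest scores occupy the last k slots,
# but in arbitrary order
top_k = np.argpartition(scores, -k)[-k:]

# Sort only those k candidates by score, best first
top_k_sorted = top_k[np.argsort(scores[top_k])[::-1]]
print(top_k_sorted)  # → [1 4 6]
```

Sorting only `k` candidates keeps the cost low even when the similarity array has more than a million entries.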

In [62]:
# Perform a search for the book "dune" using the previously defined search function
search("dune", vectorizer)

Unnamed: 0,book_id,title,ratings,url,cover_image,mod_title
452833,53732,Dune,8645,Goodreads,,dune
1032211,20441724,Dune,8271,Goodreads,,dune
1016394,53747,Dune,2860,Goodreads,,dune
156346,1685995,Dune,1653,Goodreads,,dune
706272,13249366,Dune,460,Goodreads,,dune
