## Introduction
This notebook helps users search for books they like and obtain the corresponding book_id values. It preprocesses the dataset to clean the book titles and implements a search function that returns the closest matches to the user's query.

### Key Features:
- Data loading and preprocessing
- Search functionality based on user query
- Display of search results

In [1]:
# goodreads_books is a gzip-compressed JSON file of around 2 GB, which grows to about 10 GB when decompressed.
# Using methods like pandas.read_json would need a lot of disk space and be less efficient.
# Hence we use gzip, which lets us work with large compressed files in a memory-efficient way.
# We can read the compressed file line by line or chunk by chunk, without ever fully decompressing it.
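
To make the streaming idea concrete, here is a minimal sketch that writes a tiny gzip-compressed sample into memory and reads it back one line at a time. The sample records and the `iter_json_lines` helper are made up for illustration; with the real dataset one would pass the path to `goodreads_books.json.gz` instead of the in-memory buffer.

```python
import gzip
import io
import json

def iter_json_lines(fileobj):
    """Yield one parsed JSON object per line of a gzip-compressed stream."""
    # gzip.open accepts a path or a file object; mode "rt" decodes bytes to text.
    with gzip.open(fileobj, "rt", encoding="utf-8") as f:
        for line in f:  # only one line is held in memory at a time
            yield json.loads(line)

# Hypothetical sample standing in for the real file.
sample = [
    {"book_id": "1", "title": "First Book"},
    {"book_id": "2", "title": "Second Book"},
]
buffer = io.BytesIO()
with gzip.open(buffer, "wt", encoding="utf-8") as f:
    for record in sample:
        f.write(json.dumps(record) + "\n")
buffer.seek(0)

records = list(iter_json_lines(buffer))
print(records[0]["title"])
```

Because the helper is a generator, the caller can filter or stop early without loading the remaining lines.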


In [2]:
import gzip

with gzip.open("goodreads_books.json.gz") as f:
    line = f.readline()
line
# This reads the first line of the gzip-compressed file (goodreads_books.json.gz); each line holds the data for one book
# Each line is a JSON-formatted string (returned here as bytes)

b'{"isbn": "0312853122", "text_reviews_count": "1", "series": [], "country_code": "US", "language_code": "", "popular_shelves": [{"count": "3", "name": "to-read"}, {"count": "1", "name": "p"}, {"count": "1", "name": "collection"}, {"count": "1", "name": "w-c-fields"}, {"count": "1", "name": "biography"}], "asin": "", "is_ebook": "false", "average_rating": "4.00", "kindle_asin": "", "similar_books": [], "description": "", "format": "Paperback", "link": "https://www.goodreads.com/book/show/5333265-w-c-fields", "authors": [{"author_id": "604031", "role": ""}], "publisher": "St. Martin\'s Press", "num_pages": "256", "publication_day": "1", "isbn13": "9780312853129", "publication_month": "9", "edition_information": "", "publication_year": "1984", "url": "https://www.goodreads.com/book/show/5333265-w-c-fields", "image_url": "https://images.gr-assets.com/books/1310220028m/5333265.jpg", "book_id": "5333265", "ratings_count": "3", "work_id": "5400751", "title": "W.C. Fields: A Life on Film", "t

In [3]:
import json
# json.loads ("load string") parses a JSON-formatted string into the corresponding Python object (here, a dictionary)
data = json.loads(line)
data

{'isbn': '0312853122',
 'text_reviews_count': '1',
 'series': [],
 'country_code': 'US',
 'language_code': '',
 'popular_shelves': [{'count': '3', 'name': 'to-read'},
  {'count': '1', 'name': 'p'},
  {'count': '1', 'name': 'collection'},
  {'count': '1', 'name': 'w-c-fields'},
  {'count': '1', 'name': 'biography'}],
 'asin': '',
 'is_ebook': 'false',
 'average_rating': '4.00',
 'kindle_asin': '',
 'similar_books': [],
 'description': '',
 'format': 'Paperback',
 'link': 'https://www.goodreads.com/book/show/5333265-w-c-fields',
 'authors': [{'author_id': '604031', 'role': ''}],
 'publisher': "St. Martin's Press",
 'num_pages': '256',
 'publication_day': '1',
 'isbn13': '9780312853129',
 'publication_month': '9',
 'edition_information': '',
 'publication_year': '1984',
 'url': 'https://www.goodreads.com/book/show/5333265-w-c-fields',
 'image_url': 'https://images.gr-assets.com/books/1310220028m/5333265.jpg',
 'book_id': '5333265',
 'ratings_count': '3',
 'work_id': '5400751',
 'title': '

In [8]:
# Function to select only the features needed for recommendation from all the available fields
def features(text):
    data = json.loads(text)
    return {
        'Id': data['book_id'],
        'Title': data['title_without_series'],
        'Rating_Count': data['ratings_count'],
        'Cover': data['image_url'],
    }

In [9]:
books=[]
with gzip.open("goodreads_books.json.gz") as f:
    while True:
        # f.readline() reads one line from the file at a time.
        # After reading, it moves the file pointer (the cursor in the file) to the next line. 
        line = f.readline()
        # If line is empty (end of file), break the loop
        if not line:
            break
        fields = features(line)
        try:
            Rating_Count = int(fields["Rating_Count"])
        # If the conversion to int fails, skip this iteration and go to the next line
        except ValueError:
            continue
        if Rating_Count > 5:  # This is the number of ratings, not the rating itself; books with 5 or fewer ratings are ignored
            books.append(fields)
books
# books is now a list of dictionaries, each representing one book (with Id, Title, Rating_Count, and Cover fields)
    

[{'Id': '1333909',
  'Title': 'Good Harbor',
  'Rating_Count': '10',
  'Cover': 'https://s.gr-assets.com/assets/nophoto/book/111x148-bcc042a9c91a29c1d680899eff700a03.png'},
 {'Id': '7327624',
  'Title': 'The Unschooled Wizard (Sun Wolf and Starhawk, #1-2)',
  'Rating_Count': '140',
  'Cover': 'https://images.gr-assets.com/books/1304100136m/7327624.jpg'},
 {'Id': '6066819',
  'Title': 'Best Friends Forever',
  'Rating_Count': '51184',
  'Cover': 'https://s.gr-assets.com/assets/nophoto/book/111x148-bcc042a9c91a29c1d680899eff700a03.png'},
 {'Id': '287140',
  'Title': 'Runic Astrology: Starcraft and Timekeeping in the Northern Tradition',
  'Rating_Count': '15',
  'Cover': 'https://images.gr-assets.com/books/1413219371m/287140.jpg'},
 {'Id': '287141',
  'Title': 'The Aeneid for Boys and Girls',
  'Rating_Count': '46',
  'Cover': 'https://s.gr-assets.com/assets/nophoto/book/111x148-bcc042a9c91a29c1d680899eff700a03.png'},
 {'Id': '378460',
  'Title': 'The Wanting of Levine',
  'Rating_Count'

In [10]:
import pandas as pd
df = pd.DataFrame(books) # Converting the list of dictionaries to a pandas dataframe
df

Unnamed: 0,Cover,Id,Rating_Count,Title
0,https://s.gr-assets.com/assets/nophoto/book/11...,1333909,10,Good Harbor
1,https://images.gr-assets.com/books/1304100136m...,7327624,140,"The Unschooled Wizard (Sun Wolf and Starhawk, ..."
2,https://s.gr-assets.com/assets/nophoto/book/11...,6066819,51184,Best Friends Forever
3,https://images.gr-assets.com/books/1413219371m...,287140,15,Runic Astrology: Starcraft and Timekeeping in ...
4,https://s.gr-assets.com/assets/nophoto/book/11...,287141,46,The Aeneid for Boys and Girls
5,https://s.gr-assets.com/assets/nophoto/book/11...,378460,12,The Wanting of Levine
6,https://images.gr-assets.com/books/1316637798m...,6066812,98,All's Fairy in Love and War (Avalon: Web of Ma...
7,https://images.gr-assets.com/books/1328768789m...,287149,986,The Devil's Notebook
8,https://images.gr-assets.com/books/1328724803m...,6066814,186,"Crowner Royal (Crowner John Mystery, #13)"
9,https://images.gr-assets.com/books/1493114742m...,33394837,269,The House of Memory (Pluto's Snitch #2)


In [11]:
# Cleaning book titles
import re
df['Title'] = df['Title'].str.replace(r'[^A-Za-z0-9 ]', '', regex=True)  # Remove every character except letters, digits, and spaces
df['Title'] = df['Title'].str.replace(r'\s+', ' ', regex=True)  # Collapse runs of whitespace into a single space
df['Title'] = df['Title'].str.lower()
df['Rating_Count'] = pd.to_numeric(df['Rating_Count'])
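
The same cleaning steps can be checked on a single title with plain `re`. This is a small sketch; `clean_title` is a hypothetical helper that is not part of the notebook and only mirrors the three pandas operations above.

```python
import re

def clean_title(title):
    """Apply the notebook's three cleaning steps to one title string."""
    title = re.sub(r"[^A-Za-z0-9 ]", "", title)  # keep letters, digits, spaces
    title = re.sub(r"\s+", " ", title)           # collapse repeated whitespace
    return title.lower()

print(clean_title("The Unschooled Wizard (Sun Wolf and Starhawk, #1-2)"))
# the unschooled wizard sun wolf and starhawk 12
```

Note that punctuation is dropped rather than replaced by a space, so "#1-2" becomes "12" and "Devil's" becomes "devils".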

In [12]:
df

Unnamed: 0,Cover,Id,Rating_Count,Title
0,https://s.gr-assets.com/assets/nophoto/book/11...,1333909,10,good harbor
1,https://images.gr-assets.com/books/1304100136m...,7327624,140,the unschooled wizard sun wolf and starhawk 12
2,https://s.gr-assets.com/assets/nophoto/book/11...,6066819,51184,best friends forever
3,https://images.gr-assets.com/books/1413219371m...,287140,15,runic astrology starcraft and timekeeping in t...
4,https://s.gr-assets.com/assets/nophoto/book/11...,287141,46,the aeneid for boys and girls
5,https://s.gr-assets.com/assets/nophoto/book/11...,378460,12,the wanting of levine
6,https://images.gr-assets.com/books/1316637798m...,6066812,98,alls fairy in love and war avalon web of magic 8
7,https://images.gr-assets.com/books/1328768789m...,287149,986,the devils notebook
8,https://images.gr-assets.com/books/1328724803m...,6066814,186,crowner royal crowner john mystery 13
9,https://images.gr-assets.com/books/1493114742m...,33394837,269,the house of memory plutos snitch 2


In [13]:
# Check whether any column contains null values
print(df.isnull().any())


Cover           False
Id              False
Rating_Count    False
Title           False
dtype: bool


In [14]:
df.to_csv('cleaned_books.csv')


In [15]:
# Using TF-IDF to vectorize the book titles -> each book title is represented by a vector
# Using cosine similarity to check how similar two book titles are
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
tfidf = vect.fit_transform(df['Title'])
tfidf

<1782579x301073 sparse matrix of type '<class 'numpy.float64'>'
	with 8429331 stored elements in Compressed Sparse Row format>
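
To illustrate what TF-IDF vectors and cosine similarity do, the following pure-Python sketch ranks three made-up titles against a query. It uses a simplified weighting (term count times `log(N / df) + 1`) and is not meant to reproduce scikit-learn's exact smoothing or normalization; all function names here are hypothetical.

```python
import math
from collections import Counter

def build_idf(titles):
    """Inverse document frequency for every word seen in the titles."""
    n = len(titles)
    doc_freq = Counter(word for title in titles for word in set(title.split()))
    return {word: math.log(n / count) + 1.0 for word, count in doc_freq.items()}

def tfidf_vector(text, idf):
    """Sparse TF-IDF vector as a dict; words outside the vocabulary are ignored."""
    counts = Counter(text.split())
    return {w: c * idf[w] for w, c in counts.items() if w in idf}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

titles = ["attitude is everything", "attitude", "good harbor"]  # toy corpus
idf = build_idf(titles)
query = tfidf_vector("attitude is everything", idf)
scores = [cosine(query, tfidf_vector(t, idf)) for t in titles]
ranked = sorted(zip(titles, scores), key=lambda pair: pair[1], reverse=True)
print(ranked)
```

In this toy corpus the exact match scores (up to floating-point rounding) 1, the partial match "attitude" scores lower, and "good harbor", which shares no words with the query, scores 0.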

In [19]:
import numpy as np
import re
from sklearn.metrics.pairwise import cosine_similarity
# To find similar book titles using TF-IDF and cosine similarity
def find(bookname, vectorizer):
    # Lowercase the query and strip everything except letters, digits, and spaces
    bookname = re.sub(r'[^A-Za-z0-9 ]', '', bookname.lower())
    current_vec = vectorizer.transform([bookname])
    # Compute the cosine similarity between the query vector and the TF-IDF matrix (which represents all book titles)
    # .flatten() converts the resulting 2D array (1 x number of titles) into a 1D array for easier handling
    similarity = cosine_similarity(current_vec, tfidf).flatten()
    # np.argpartition places the indices of the 10 highest similarity scores in the last 10 positions (in no particular order);
    # the slice [-10:] retrieves those indices
    indices = np.argpartition(similarity, -10)[-10:]
    # Order the 10 matches by number of ratings rather than by similarity
    result = df.iloc[indices].sort_values('Rating_Count', ascending=False)
    return result


In [20]:
# Searching for each favorite book individually to note its Id for further analysis
find('attitude is everything', vect)
# For "attitude is everything", the Id is 9825887
# Multiple results titled "attitude is everything" indicate multiple editions of the same book (they can be told apart by looking at the cover image)


Unnamed: 0,Cover,Id,Rating_Count,Title
1611978,https://images.gr-assets.com/books/1368417489m...,17862201,118,attitude
682805,https://s.gr-assets.com/assets/nophoto/book/11...,9825887,46,attitude is everything
1718199,https://images.gr-assets.com/books/1329279090m...,10802564,44,attitude is everything
266271,https://images.gr-assets.com/books/1309203993m...,868147,22,attitude
406840,https://s.gr-assets.com/assets/nophoto/book/11...,21244319,22,attitude is everything
1226117,https://images.gr-assets.com/books/1328822867m...,10786711,12,a attitude
1270302,https://images.gr-assets.com/books/1366601361m...,17837400,12,attitude
1367705,https://s.gr-assets.com/assets/nophoto/book/11...,25860542,11,attitude is everything coloring book
62073,https://s.gr-assets.com/assets/nophoto/book/11...,3102880,7,a is for attitude
1362852,https://s.gr-assets.com/assets/nophoto/book/11...,4034606,7,attitude is everything


* This way, we can search for a book we like and get a list of the books in the corpus that best match the search term.
* We can then collect the Ids of our favorite books for further analysis.
