In this project, we are going to build a book recommendation system. For that, we need a list of the books we like and a dataset of book ratings on which to base our recommendations.

We are going to use data from https://www.goodreads.com. Since this platform doesn't offer an API to access its data, we will use data gathered by researchers through scraping, available at [this URL](https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home).

## Data we will be using

We will work with three main files:

+ **goodreads_interactions.csv:** contains the ratings given to different books by different users.
+ **goodreads_books.json.gz:** contains metadata on the books.
+ **book_id_map.csv:** maps the book ids used in the first dataset to those used in the second.
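
To make the role of the id map concrete, here is a minimal sketch of how such a mapping could be used. The column names `book_id_csv` and `book_id`, the sample values, and the helper `to_metadata_id` are assumptions made for illustration; check the header of the real file before relying on them.

```python
import csv
import io

# Hypothetical excerpt of the id map; column names and values are assumptions.
sample_map = io.StringIO(
    "book_id_csv,book_id\n"
    "0,1001\n"
    "1,1002\n"
    "2,1003\n"
)

# Build a lookup from the id used in the interactions file
# to the id used in the books metadata file.
csv_to_book_id = {
    row["book_id_csv"]: row["book_id"] for row in csv.DictReader(sample_map)
}

def to_metadata_id(csv_id):
    """Translate an interactions-file id into a metadata book_id (None if unknown)."""
    return csv_to_book_id.get(str(csv_id))

print(to_metadata_id(1))
```

With a lookup like this, a rating row from the first dataset can be joined to the title and cover of the same book in the second.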

## Project steps

The steps of the project are as follows:

+ Search for books: get the id of a book using its title.
+ Create the book list: the list of books we like, built with the search engine from the previous step.
+ Recommend books: the main step of the project.

We are going to build the search engine and the list of liked books in the current notebook. The recommendation system will be implemented in a separate notebook.

## Search engine

### Download the data

In [1]:
!gdown 1LXpK1UfqtP89H1tYy0pBGHjYk8IhigUK

Downloading...
From: https://drive.google.com/uc?id=1LXpK1UfqtP89H1tYy0pBGHjYk8IhigUK
To: /content/goodreads_books.json.gz
100% 2.08G/2.08G [00:21<00:00, 97.4MB/s]


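Count the number of lines in the goodreads_books file. Note that `wc -l` is run on the compressed file here, so the figure below counts newline bytes in the gzip stream, not book records; `zcat goodreads_books.json.gz | wc -l` would count the actual records.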

In [2]:
!wc -l goodreads_books.json.gz

7588375 goodreads_books.json.gz
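
Because the file is gzip-compressed, counting its records requires decompressing the stream. A minimal sketch of how that count could be done in Python without extracting the file to disk; `count_lines_gz` and the small sample file are hypothetical and exist only for illustration.

```python
import gzip
import os
import tempfile

def count_lines_gz(path):
    """Count lines in the decompressed content of a gzip file, streaming it."""
    with gzip.open(path, "rb") as f:
        # iterating over the file object yields one decompressed line at a time
        return sum(1 for _ in f)

# Tiny stand-in file (made-up data) to show the behaviour.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.json.gz")
with gzip.open(sample_path, "wb") as f:
    f.write(b'{"book_id": "1"}\n{"book_id": "2"}\n{"book_id": "3"}\n')

n_lines = count_lines_gz(sample_path)
print(n_lines)
```

On the real file this takes a while, since every byte has to be decompressed, but the memory use stays small.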


Get the size of the file.

In [3]:
!ls -lh | grep goodreads_books.json.gz

-rw-r--r-- 1 root root 2.0G Sep 25 18:02 goodreads_books.json.gz


Since the size of the file is about 2G, we cannot load it with `pd.read_json` as usual. The file is a .gz, meaning it is gzip-compressed. Once decompressed, its size is around 10G, and loading it with `pd.read_json` would exhaust the memory. To avoid that, we are going to read the file line by line.

We are going to use the gzip module, which streams the file and decompresses it on the fly without extracting it to disk.


In [4]:
import gzip

# read only the first line; gzip decompresses the stream on the fly
with gzip.open("goodreads_books.json.gz", "r") as f:
  line = f.readline()

In [5]:
line

b'{"isbn": "0312853122", "text_reviews_count": "1", "series": [], "country_code": "US", "language_code": "", "popular_shelves": [{"count": "3", "name": "to-read"}, {"count": "1", "name": "p"}, {"count": "1", "name": "collection"}, {"count": "1", "name": "w-c-fields"}, {"count": "1", "name": "biography"}], "asin": "", "is_ebook": "false", "average_rating": "4.00", "kindle_asin": "", "similar_books": [], "description": "", "format": "Paperback", "link": "https://www.goodreads.com/book/show/5333265-w-c-fields", "authors": [{"author_id": "604031", "role": ""}], "publisher": "St. Martin\'s Press", "num_pages": "256", "publication_day": "1", "isbn13": "9780312853129", "publication_month": "9", "edition_information": "", "publication_year": "1984", "url": "https://www.goodreads.com/book/show/5333265-w-c-fields", "image_url": "https://images.gr-assets.com/books/1310220028m/5333265.jpg", "book_id": "5333265", "ratings_count": "3", "work_id": "5400751", "title": "W.C. Fields: A Life on Film", "t

In [6]:
import json

json.loads(line)

{'isbn': '0312853122',
 'text_reviews_count': '1',
 'series': [],
 'country_code': 'US',
 'language_code': '',
 'popular_shelves': [{'count': '3', 'name': 'to-read'},
  {'count': '1', 'name': 'p'},
  {'count': '1', 'name': 'collection'},
  {'count': '1', 'name': 'w-c-fields'},
  {'count': '1', 'name': 'biography'}],
 'asin': '',
 'is_ebook': 'false',
 'average_rating': '4.00',
 'kindle_asin': '',
 'similar_books': [],
 'description': '',
 'format': 'Paperback',
 'link': 'https://www.goodreads.com/book/show/5333265-w-c-fields',
 'authors': [{'author_id': '604031', 'role': ''}],
 'publisher': "St. Martin's Press",
 'num_pages': '256',
 'publication_day': '1',
 'isbn13': '9780312853129',
 'publication_month': '9',
 'edition_information': '',
 'publication_year': '1984',
 'url': 'https://www.goodreads.com/book/show/5333265-w-c-fields',
 'image_url': 'https://images.gr-assets.com/books/1310220028m/5333265.jpg',
 'book_id': '5333265',
 'ratings_count': '3',
 'work_id': '5400751',
 'title': '

### Parsing the book metadata

We are going to write a function that takes a line of the goodreads_books.json.gz file and returns only the fields we care about.

In [7]:
def parse_fields(line):
  # decode one JSON line and keep only the fields we need
  data = json.loads(line)
  return {
      "book_id": data["book_id"],
      "title": data["title_without_series"],
      "ratings": data["ratings_count"],
      "url": data["url"],
      "cover_image": data["image_url"]
  }

Now we'll use our `parse_fields` function to get the list of books with more than 15 ratings. The dataset contains many books with very few ratings. Since our recommendations depend on the ratings of other readers, we are only interested in books with a decent number of ratings; the others cannot be recommended reliably.

In [8]:
books_titles = []

with gzip.open("goodreads_books.json.gz", "r") as f:
  #iterate over the decompressed stream one line (one book) at a time
  for line in f:
    #parse the line
    fields = parse_fields(line)

    try:
      rating = int(fields["ratings"])
    except ValueError:
      #skip books whose ratings count is empty or not a number
      continue
    #keep only books with more than 15 ratings
    if rating > 15:
      books_titles.append(fields)
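
The loop above assumes every line is valid JSON with all five keys present. As a hedged variation, the sketch below shows how the same filter could tolerate malformed or incomplete lines. `parse_fields_safe`, `popular_books` and the sample lines are hypothetical names and data introduced for illustration, and unlike `parse_fields` this version stores the ratings count as an integer.

```python
import json

def parse_fields_safe(line):
    """Like parse_fields, but returns None for malformed or incomplete lines."""
    try:
        data = json.loads(line)
        return {
            "book_id": data["book_id"],
            "title": data["title_without_series"],
            "ratings": int(data["ratings_count"]),
            "url": data["url"],
            "cover_image": data["image_url"],
        }
    except (KeyError, ValueError, TypeError):
        # ValueError also covers invalid JSON and an empty ratings count
        return None

def popular_books(lines, min_ratings=15):
    """Yield parsed books with more than min_ratings ratings."""
    for line in lines:
        fields = parse_fields_safe(line)
        if fields is not None and fields["ratings"] > min_ratings:
            yield fields

# Made-up lines: one popular book, one with few ratings, one with an
# empty ratings count, and one that is not JSON at all.
sample_lines = [
    b'{"book_id": "1", "title_without_series": "A", "ratings_count": "120", "url": "u1", "image_url": "i1"}',
    b'{"book_id": "2", "title_without_series": "B", "ratings_count": "3", "url": "u2", "image_url": "i2"}',
    b'{"book_id": "3", "title_without_series": "C", "ratings_count": "", "url": "u3", "image_url": "i3"}',
    b'not json',
]

kept = list(popular_books(sample_lines))
print([book["book_id"] for book in kept])
```

Because `popular_books` is a generator, it can be fed the open gzip file object directly and still processes one line at a time.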

### Processing book metadata with pandas

In [9]:
import pandas as pd

#create a pandas dataframe from the list of titles
titles = pd.DataFrame.from_dict(books_titles)

In [10]:
#change the ratings into numeric values
titles["ratings"] = pd.to_numeric(titles["ratings"])

To make the search effective, we have to normalize the titles so that "harry potter" and "Harry Potter" are treated as the same title, as are 'W.C. Fields: A Life on Film' and 'WC Fields: A life on film'. To achieve that, we apply a few transformations to the original titles.

In [11]:
#remove non-alphanumeric characters
titles["mod_title"] = titles["title"].str.replace("[^a-zA-Z0-9 ]", "", regex=True)
#lowercase the titles
titles["mod_title"] = titles["mod_title"].str.lower()
#collapse runs of whitespace into a single space (raw string avoids an invalid escape warning)
titles["mod_title"] = titles["mod_title"].str.replace(r"\s+", " ", regex=True)
#drop rows whose title is empty after cleaning
titles = titles[titles["mod_title"].str.len() > 0]
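
To see what these transformations do to a single title, here is a small plain-Python sketch of the same three steps. `normalize_title` is a hypothetical helper written for illustration; it is not used elsewhere in the notebook.

```python
import re

def normalize_title(title):
    """Apply the same three cleaning steps to a single string."""
    cleaned = re.sub(r"[^a-zA-Z0-9 ]", "", title)  # drop punctuation and symbols
    cleaned = cleaned.lower()                       # lowercase
    cleaned = re.sub(r"\s+", " ", cleaned)          # collapse runs of whitespace
    return cleaned

a = normalize_title("W.C. Fields: A Life on Film")
b = normalize_title("WC Fields: A life on film")
print(a)
print(a == b)
```

Both spellings end up as the same string, which is what lets the search match them regardless of punctuation and case.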

In [33]:
cd /content/drive/MyDrive/Colab Notebooks/book_recommandation_system

/content/drive/MyDrive/Colab Notebooks/book_recommandation_system


In [34]:
#dump the result in a file so that we can use it later
titles.to_json("books_titles.json")

### Build the book search engine

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

tfidf = vectorizer.fit_transform(titles["mod_title"])
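
To give an intuition for what the TF-IDF matrix encodes, here is a pure-Python sketch of TF-IDF weighting and cosine similarity on a few toy titles. It is an illustration only, not the scikit-learn implementation: the tokenization (words of two or more characters) and the smoothed idf formula are written to follow what we understand to be the `TfidfVectorizer` defaults, and all function names and titles are made up.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # words of two or more characters, lowercased
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_idf(docs):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + count)) + 1 for tok, count in df.items()}

def tfidf_vector(text, idf):
    """Term counts weighted by idf, scaled to unit length (L2 norm)."""
    counts = Counter(tok for tok in tokenize(text) if tok in idf)
    vec = {tok: tf * idf[tok] for tok, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm else {}

def cosine(u, v):
    # both vectors are unit length, so the dot product is the cosine similarity
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())

toy_titles = ["foundation foundation 1", "foundation and empire", "second foundation", "dune"]
idf = fit_idf(toy_titles)
doc_vecs = [tfidf_vector(title, idf) for title in toy_titles]
query_vec = tfidf_vector("foundation", idf)
scores = [cosine(query_vec, doc_vec) for doc_vec in doc_vecs]
print(scores)
```

Titles made up almost entirely of the query word score highest, titles that share it alongside other rare words score lower, and unrelated titles score zero.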

In [27]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import re

def make_clickable(val):
  #style a URL as a clickable link so that we can check whether a book is the
  #one we are looking for
  return '<a target="_blank" href="{}">Goodreads</a>'.format(val)

def show_image(val):
  #render the cover image URL as a small thumbnail
  return '<img src="{}" width=50></img>'.format(val)

def search(query, vectorizer):
  #normalize the query the same way as the titles
  processed = re.sub("[^a-zA-Z0-9 ]", "", query.lower())
  query_vec = vectorizer.transform([processed])
  #cosine similarity between the query and every title
  similarity = cosine_similarity(query_vec, tfidf).flatten()
  #positions of the 10 most similar titles (in no particular order)
  indices = np.argpartition(similarity, -10)[-10:]
  results = titles.iloc[indices]
  #among those candidates, show the most-rated books first
  results = results.sort_values("ratings", ascending=False)
  return results.head(5).style.format({'url':make_clickable, 'cover_image':show_image})
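
The `search` function ranks in two stages: it first keeps the 10 titles most similar to the query, then orders those candidates by ratings count and returns the top 5. The sketch below illustrates that logic on toy data with the standard library, using `heapq.nlargest` in place of `np.argpartition`; `top_matches` and the sample numbers are hypothetical.

```python
import heapq

def top_matches(similarities, ratings, n_candidates=10, n_results=5):
    """Pick the n_candidates most similar rows, then order them by ratings count."""
    candidates = heapq.nlargest(
        n_candidates, range(len(similarities)), key=similarities.__getitem__
    )
    ranked = sorted(candidates, key=lambda i: ratings[i], reverse=True)
    return ranked[:n_results]

# Made-up similarity scores and ratings counts for five books.
toy_similarities = [0.10, 0.95, 0.90, 0.05, 0.85]
toy_ratings = [9000, 40, 700, 8000, 120]

result = top_matches(toy_similarities, toy_ratings, n_candidates=3, n_results=2)
print(result)
```

Note that the book at position 0 is left out despite its 9000 ratings, because it never makes it into the similarity shortlist; popularity only breaks ties among titles that already match the query.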

In [32]:
search("foundation", vectorizer)

| Unnamed: 0 | book_id | title | ratings | url | cover_image | mod_title |
|---|---|---|---|---|---|---|
| 638824 | 5996629 | Foundation (Foundation, #1) | 5359 | Goodreads | | foundation foundation 1 |
| 1229317 | 414853 | Foundation (Foundation, #1) | 604 | Goodreads | | foundation foundation 1 |
| 1204348 | 7352028 | Foundation (Foundation, #1) | 318 | Goodreads | | foundation foundation 1 |
| 694488 | 9401317 | Foundation (Foundation, #1) | 204 | Goodreads | | foundation foundation 1 |
| 541719 | 920239 | Foundation (Foundation, #1) | 192 | Goodreads | | foundation foundation 1 |


## Create the list of liked books

We are going to use the search engine we just created to search for some books we like and add the ids of those books to the following list.

In [None]:
liked_books = ["681495", "380748", "1062354", "638824"]