# Data Preparation

## Download Goodreads Dataset

Download the UCSD Book Graph dataset and save it to the `data/goodreads` directory.

### Dataset Information
* 2.36M books with metadata
* 15M reviews

### Resource Link
[UCSD Book Graph](https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home)


In [None]:
%pip install gdown
!mkdir -p data/goodreads
!gdown --fuzzy 'https://drive.google.com/uc?id=1LXpK1UfqtP89H1tYy0pBGHjYk8IhigUK' -O data/goodreads/
!wget 'https://drive.google.com/uc?id=19cdwyXwfXx_HDIgxXaHzH0mrx8nMyLvC' -O data/goodreads/goodreads_book_authors.json.gz
!gdown --fuzzy 'https://drive.google.com/uc?id=1op8D4e5BaxU2JcPUgxM3ZqrodajryFBb' -O data/goodreads/
!gzip -d data/goodreads/goodreads_book_authors.json.gz
!gzip -d data/goodreads/goodreads_books.json.gz
!gzip -d data/goodreads/goodreads_book_series.json.gz
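
Before moving on, it can help to confirm that the three decompressed files are in place. The check below is a minimal sketch: `missing_files` and `EXPECTED_FILES` are hypothetical names introduced here, and the file names are the ones the `gzip -d` commands above should leave behind.

```python
from pathlib import Path

# File names expected after decompression (taken from the gzip commands above).
EXPECTED_FILES = [
    'goodreads_books.json',
    'goodreads_book_authors.json',
    'goodreads_book_series.json',
]


def missing_files(data_dir, names=EXPECTED_FILES):
    """Return the names from `names` that are not regular files in `data_dir`."""
    root = Path(data_dir)
    return [name for name in names if not (root / name).is_file()]


# An empty list means every expected file was found.
print(missing_files('data/goodreads'))
```

If the list is not empty, re-run the download or decompression step for the files it names.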

## 1. Reverse Index Data

Write each book's ID, title, description, and author names to `book_index_data.json`, which serves as the input for building the index.

In [9]:
import json
from tqdm import tqdm
import codecs

In [None]:
with codecs.open('data/goodreads/goodreads_books.json', 'r', encoding='utf-8') as fin:
    text = fin.readlines()
print(len(text))
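
The `readlines()` call above keeps every line of the books file in memory at once. If that is too heavy for your machine, one option is to stream the records instead. The sketch below is an alternative under stated assumptions, not the notebook's own method: `iter_json_lines` is a hypothetical helper that assumes one JSON object per line and reads either plain or gzip-compressed files.

```python
import gzip
import json


def iter_json_lines(path):
    """Yield one parsed JSON object per non-empty line of `path`.

    Hypothetical helper: files ending in `.gz` are opened with gzip,
    anything else is opened as plain UTF-8 text.
    """
    opener = gzip.open if str(path).endswith('.gz') else open
    with opener(path, 'rt', encoding='utf-8') as fin:
        for line in fin:
            line = line.strip()
            if line:
                yield json.loads(line)
```

A loop such as `for meta in iter_json_lines('data/goodreads/goodreads_books.json'):` would then process one book at a time. Because the helper also reads `.gz` files, the decompression step could be skipped for this access pattern.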

In [None]:
with codecs.open('data/goodreads/goodreads_book_authors.json', 'r', encoding='utf-8') as fin:
    authors = fin.readlines()
print(len(authors))

In [None]:
authors[0]

In [None]:
authormap = dict()
for author_info in tqdm(authors):
    author_info = json.loads(author_info)
    authormap[int(author_info['author_id'])] = author_info['name']

In [None]:
with codecs.open('data/goodreads/goodreads_book_series.json', 'r', encoding='utf-8') as fin:
    series = fin.readlines()
print(len(series))

In [None]:
seriemap = dict()
for serie in tqdm(series):
    serie = json.loads(serie)
    seriemap[int(serie['series_id'])] = (serie['title'], serie['description'])

In [None]:
with codecs.open('data/goodreads/book_index_data.json', 'w', encoding='utf-8') as fout:
    for book in tqdm(text):
        meta = json.loads(book)
        book_info = dict()
        description = meta['description']
        book_info['book_id'] = int(meta['book_id'])
        book_info['title'] = meta['title']
        book_info['description'] = description
        author_list = []
        for author in meta['authors']:
            author_list.append(authormap[int(author['author_id'])])
        book_info['author_list'] = author_list
        fout.write(json.dumps(book_info, ensure_ascii=False) + '\n')
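
To make the shape of one index record easier to see, the per-book logic above can be pulled into a small function. This is a sketch, not the notebook's code: `build_index_record` is a hypothetical name, the toy values are made up, and skipping author IDs that are missing from `authormap` (rather than raising `KeyError`) is an assumption about the desired behavior.

```python
import json


def build_index_record(meta, authormap):
    """Build the slimmed-down index record for one book.

    Author IDs that are not present in `authormap` are skipped
    (an assumption; a plain dictionary lookup would raise KeyError).
    """
    author_list = []
    for author in meta.get('authors', []):
        name = authormap.get(int(author['author_id']))
        if name is not None:
            author_list.append(name)
    return {
        'book_id': int(meta['book_id']),
        'title': meta['title'],
        'description': meta['description'],
        'author_list': author_list,
    }


# Toy input with made-up values, only to show the record shape.
toy_meta = {
    'book_id': '1',
    'title': 'Example Title',
    'description': 'Example description.',
    'authors': [{'author_id': '10'}, {'author_id': '99'}],
}
toy_authormap = {10: 'Example Author'}  # author 99 is deliberately missing
print(json.dumps(build_index_record(toy_meta, toy_authormap), ensure_ascii=False))
```

Whether to skip unknown authors or fail loudly depends on how much you trust the author file to cover every book.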

## 2. Database Data

In [None]:
with codecs.open('data/goodreads/book_database_data.json', 'w', encoding='utf-8') as fout:
    for book in tqdm(text):
        meta = json.loads(book)
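        # Keep the full record and attach the resolved author names.
        author_list = []
        for author in meta['authors']:
            author_list.append(authormap[int(author['author_id'])])
        meta['author_list'] = author_list
        # Attach (title, description) for each series the book belongs to.
        # The loop variable is `series_id` so it does not overwrite the `series` list loaded earlier.
        series_list = []
        for series_id in meta['series']:
            series_list.append(seriemap[int(series_id)])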
        meta['series_list'] = series_list
        fout.write(json.dumps(meta, ensure_ascii=False) + '\n')
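
One detail to keep in mind when reading this file back: each `series_list` entry is a `(title, description)` tuple, and `json.dumps` writes tuples as JSON arrays, so they load as lists. The round trip below is a small sketch with made-up values. `toy_seriemap` and `toy_meta` are hypothetical stand-ins for `seriemap` and a book record, and the series IDs are assumed to be stored as strings.

```python
import json

# Made-up series mapping in the same shape as `seriemap`: id -> (title, description).
toy_seriemap = {7: ('Example Series', 'Example series description.')}

# Made-up book record that references the series by a string ID (an assumption).
toy_meta = {'book_id': '1', 'series': ['7']}
toy_meta['series_list'] = [
    toy_seriemap[int(series_id)] for series_id in toy_meta['series']
]

# Serialize the record the same way the cell above does, then parse it back.
line = json.dumps(toy_meta, ensure_ascii=False)
restored = json.loads(line)

# The tuple was written as a JSON array, so it is a list after loading.
print(type(toy_meta['series_list'][0]).__name__, type(restored['series_list'][0]).__name__)
```

Code that consumes `book_database_data.json` should therefore index the entries by position (`entry[0]` for the title, `entry[1]` for the description) rather than rely on them being tuples.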