# Download and Parse Data from Penguin Random House

Here we use the ratings from the [Book-Crossing](../book_crossing/preprocessing.ipynb) and [Goodreads](../goodreads/preprocessing.ipynb) datasets to download information about the corresponding books via the [Penguin Random House API](https://developer.penguinrandomhouse.com/).

In [1]:
import os
import sys

# Append the sys.path with the project root path
sys.path.append(os.path.dirname(os.path.abspath('')))

In [2]:
import pandas as pd
from download_scripts.books import get_book_data
from download_scripts.categories import get_category_info

## Join Ratings from Book-Crossing and Goodreads Data

### Book-Crossing

Load the preprocessed [Book-Crossing data](../book_crossing/data_prep):

In [3]:
path_bc = os.path.join('..', 'book_crossing', 'data_prep')
ratings_bc = pd.read_csv(os.path.join(path_bc, 'ratings.csv'),
                         dtype={'user_id': 'category',
                                'rating': 'uint8',
                                'isbn13': 'category'})

In [4]:
# Drop all implicit ratings
ratings_bc = ratings_bc[ratings_bc['rating'] > 0]
ratings_bc.head(3)

Unnamed: 0,user_id,rating,isbn13
1,276726,5,9780155061224
3,276729,3,9780521656153
4,276729,6,9780521795029


In [5]:
print(f'Number of ratings: {len(ratings_bc)}')

Number of ratings: 384127


### Goodreads

Load preprocessed [Goodreads data](../goodreads/data_prep):

In [6]:
path_gr = os.path.join('..', 'goodreads', 'data_prep')
books_gr = pd.read_csv(os.path.join(path_gr, 'books.csv'),
                       usecols=['isbn13', 'book_id'], index_col=['book_id'],
                       dtype={'isbn13': 'category', 'book_id': 'category'})
ratings_gr = pd.read_csv(os.path.join(path_gr, 'ratings.csv'),
                         usecols=['user_id', 'book_id', 'rating'],
                         dtype={'user_id': 'category',
                                'rating': 'Int8',
                                'book_id': 'category'})

In [7]:
# Drop all implicit ratings
ratings_gr = ratings_gr[~ratings_gr['rating'].isna()]
ratings_gr['rating'] = ratings_gr['rating'].astype('uint8')

As we saw at the data preprocessing stage, some ISBNs are duplicated, that is, several `book_id` values point to the same `isbn13`. We have to keep a single `book_id` per ISBN and make the corresponding changes in the ratings dataset:

In [8]:
# Group them and get indexes
books_gr_duplicates = books_gr[books_gr.duplicated(['isbn13'], keep=False)]
books_gr_duplicates_idx = books_gr_duplicates \
    .groupby(['isbn13'], observed=True) \
    .apply(lambda x: list(x.index)).tolist()

# Iterate over each group and keep only one book_id
to_replace = {}
for book_group in books_gr_duplicates_idx:
    to_leave = book_group.pop()
    for index in book_group:
        to_replace[index] = to_leave

# Replace in the ratings
ratings_gr['book_id'] = ratings_gr['book_id'] \
    .map(lambda x: to_replace.get(x, x))

# This transformation causes duplicated rows in ratings
ratings_gr.drop_duplicates(['user_id', 'book_id'],
                           keep='first', inplace=True)
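
To make the remapping easier to follow, here is a minimal sketch of the same idea on toy data. The identifiers (`b1`, `u1`, and so on) and the ISBN values are made up for illustration; only the overall approach mirrors the cell above.

```python
import pandas as pd

# Toy data (hypothetical values): book_ids 'b1' and 'b3' share one ISBN
books = pd.DataFrame({'isbn13': ['111', '222', '111']},
                     index=pd.Index(['b1', 'b2', 'b3'], name='book_id'))
ratings = pd.DataFrame({'user_id': ['u1', 'u1', 'u2'],
                        'book_id': ['b1', 'b3', 'b2'],
                        'rating': [5, 4, 3]})

# Collect the book_ids that belong to each duplicated ISBN
dup = books[books.duplicated(['isbn13'], keep=False)]
groups = [list(group.index) for _, group in dup.groupby('isbn13')]

# Keep the last book_id of each group and remap the others to it
to_replace = {}
for group in groups:
    keep = group[-1]
    for other in group[:-1]:
        to_replace[other] = keep

# Apply the remapping, then drop the rows that became duplicates
ratings['book_id'] = ratings['book_id'].map(lambda x: to_replace.get(x, x))
ratings = ratings.drop_duplicates(['user_id', 'book_id'], keep='first')
```

In this toy case user `u1` rated both editions, so after the remapping only the first of the two ratings survives.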

Change `book_id` to `isbn13`:

In [9]:
ratings_gr = ratings_gr.merge(books_gr[['isbn13']], left_on='book_id',
                              right_index=True, how='left')
ratings_gr.drop(columns=['book_id'], inplace=True)
ratings_gr.head(2)

Unnamed: 0,rating,user_id,isbn13
0,5,8842281e1d1347389f2ab93d60773d4d,9780517226957
1,5,8842281e1d1347389f2ab93d60773d4d,9780767908184


In [10]:
print(f'Number of ratings: {len(ratings_gr)}')

Number of ratings: 89624581


### Join

Before merging the data, we need to scale ratings to the common range:

In [11]:
# Scale goodreads ratings to the range from 1 to 10
ratings_gr['rating'] *= 2

We assume that the Goodreads and Book-Crossing user communities do not overlap, so no user ID should appear in both datasets:

In [12]:
# Check that no user_id appears in both datasets (expect False)
ratings_gr['user_id'].isin(ratings_bc['user_id']).any()

False

In [13]:
# Append Book-Crossing ratings to Goodreads ones
# (DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent)
ratings = pd.concat([ratings_gr, ratings_bc])

# Assign integer values to user_id
ratings['user_id'] = ratings['user_id'].astype('category').cat.codes

# Save data
ratings.to_csv(os.path.join('data_interm', 'ratings_joined.csv'),
               index=False)

# Show
ratings.head(2)

Unnamed: 0,rating,user_id,isbn13
0,10,494492,9780517226957
1,10,494492,9780767908184
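
The join steps can be sketched end to end on toy data. All user IDs and ISBNs below are hypothetical, and `ignore_index=True` is an addition of this sketch that simply renumbers the rows; the scaling, overlap check, and integer encoding follow the cells above.

```python
import pandas as pd

# Toy frames (hypothetical values): Goodreads rates 1-5, Book-Crossing 1-10
gr = pd.DataFrame({'rating': [5, 3],
                   'user_id': ['a1f', 'b2e'],
                   'isbn13': ['9780000000001', '9780000000002']})
bc = pd.DataFrame({'user_id': ['276726'],
                   'rating': [6],
                   'isbn13': ['9780000000001']})

# Bring Goodreads ratings to the Book-Crossing scale: 1-5 becomes 2-10
gr['rating'] *= 2

# True would mean that some user_id occurs in both datasets
overlap = gr['user_id'].isin(bc['user_id']).any()

# Stack the frames; columns are aligned by name, not by position
joined = pd.concat([gr, bc], ignore_index=True)

# Replace the string ids with integer codes
joined['user_id'] = joined['user_id'].astype('category').cat.codes
```

Note that doubling maps Goodreads ratings only onto the even values 2, 4, ..., 10, whereas Book-Crossing ratings can take any integer value from 1 to 10.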


In [14]:
print(f'Number of joined ratings: {len(ratings)}')
print(f'Number of unique users: {len(ratings["user_id"].unique())}')
print(f'Number of unique books: {len(ratings["isbn13"].unique())}')

Number of joined ratings: 90008708
Number of unique users: 876176
Number of unique books: 1647917


## Book Info

We will use [this script](download_scripts/books.py) to download book data via the Penguin Random House API. Let's look at an example:

In [15]:
# Download data about War and Peace
book_example = get_book_data(9780241265543)

# Book info
for prop in ['isbn', 'title', 'publisher', 'format']:
    print(f'{prop}: {book_example["data"]["titles"][0].get(prop)}')

isbn: 9780241265543
title: War and Peace
publisher: {'code': '6262', 'description': 'Penguin Publishing Group'}
format: {'code': 'HC', 'description': 'Hardcover'}


In [16]:
# Extract related (embedded) info
embeds = {}
for embed in book_example['data']['_embeds']:
    embeds.update(embed)

# Authors info
embeds['authors'][:2]

[{'authorId': 8653,
  'display': 'Orlando Figes',
  'first': 'Orlando',
  'last': 'Figes',
  'company': {'key': 'R_H', 'value': None},
  'clientSourceId': 0,
  'seoFriendlyUrl': '/authors/8653/orlando-figes',
  'contribRoleCode': 'D',
  'contribRoleDesc': 'Afterword by',
  'primaryFlag': False,
  '_embeds': None,
  '_links': []},
 {'authorId': 31231,
  'display': 'Leo Tolstoy',
  'first': 'Leo',
  'last': 'Tolstoy',
  'company': {'key': 'R_H', 'value': None},
  'clientSourceId': 0,
  'seoFriendlyUrl': '/authors/31231/leo-tolstoy',
  'contribRoleCode': 'A',
  'contribRoleDesc': 'Author',
  'primaryFlag': True,
  '_embeds': None,
  '_links': []}]
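
The flattening step above relies on `_embeds` being a list of dictionaries, each holding one kind of related record. Here is a minimal offline sketch of that pattern. The toy response is trimmed to a few fields taken from the outputs in this notebook, and its exact shape is an assumption inferred from the loop above, not a statement about the full API schema.

```python
# Toy response shaped like the example above (most fields omitted)
response = {
    'data': {
        '_embeds': [
            {'authors': [
                {'display': 'Orlando Figes',
                 'contribRoleDesc': 'Afterword by',
                 'primaryFlag': False},
                {'display': 'Leo Tolstoy',
                 'contribRoleDesc': 'Author',
                 'primaryFlag': True},
            ]},
            {'series': [
                {'seriesCode': 'B45',
                 'seriesName': 'Penguin Clothbound Classics'},
            ]},
        ]
    }
}

# Merge the list of dictionaries into a single lookup table
embeds = {}
for embed in response['data']['_embeds']:
    embeds.update(embed)

# Pick the primary contributor(s) and the series names
primary_authors = [a['display'] for a in embeds['authors'] if a['primaryFlag']]
series_names = [s['seriesName'] for s in embeds.get('series', [])]
```

Using `embeds.get('series', [])` keeps the sketch safe for books that belong to no series, assuming the key is simply absent in that case.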

In [17]:
# Series info
embeds['series']

[{'seriesCode': 'B45',
  'seriesName': 'Penguin Clothbound Classics',
  'description': 'With splendid packaging created by award-winning designer Coralie Bickford-Smith, Penguin Classics presents&nbsp;beautiful hardcover editions of beloved classic literature. Featuring custom patterns inspired by each work stamped on linen cases, colored endpapers, and ribbon markers, these gift-worthy editions of more than sixty titles including&nbsp;<i>Great Expectations</i>,&nbsp;<i>Far from the Madding Crowd</i>, and&nbsp;<i>Wuthering Heights</i>&nbsp;are one of the most coveted series of classic literature ever produced.',
  'seriesCount': 65,
  'seriesDate': '2021-11-30',
  'isNumbered': False,
  'isKids': False,
  'seoFriendlyUrl': '/series/B45/penguin-clothbound-classics',
  '_embeds': None,
  '_links': []}]

In [18]:
# After hours of downloading, the information was saved here
path_books_raw = os.path.join('data_raw', 'books.txt')

The downloaded data contains a lot of information about books, their authors, publishers, etc. For the sake of simplicity, we will analyze only part of it. [This script](parse_scripts/books.py) was used to parse the data and remove unused book properties.

## Categories Info

There are several types of categories, but we will use only the "Consumer" categories. See more details [here](https://developer.penguinrandomhouse.com/docs/read/enhanced_prh_api/resources/Category) and [here](https://developer.penguinrandomhouse.com/docs/read/enhanced_prh_api/concepts/Categories). Each category can have a parent and children. Thus, we first get information about the top-level category and then descend recursively into its children.

To download category info, [this script](download_scripts/categories.py) is used.

In [19]:
# The information about parent category
top_category_id = '2000000000'
top_category_info = get_category_info(top_category_id)
top_category_info

{'catId': 2000000000,
 'description': 'Consumer Category',
 'catSetId': 'CN',
 'catUri': None,
 'menuText': None,
 'hasChildren': True,
 'seq': None,
 'weight': None,
 '_embeds': None,
 '_links': []}

In [20]:
# After several minutes of downloading, the information was saved here
path_cats_raw = os.path.join('data_raw', 'categories.txt')
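
The recursive descent described above can be sketched as follows. This is not the actual download script: `fetch_children` and `FAKE_TREE` are hypothetical stand-ins for the API calls, and all child category IDs and names are invented. Only the `catId`, `description`, and `hasChildren` fields come from the response shown above.

```python
# Hypothetical in-memory stand-in for the API: parent id -> child records
FAKE_TREE = {
    2000000000: [
        {'catId': 2000000001, 'description': 'Fiction', 'hasChildren': True},
        {'catId': 2000000002, 'description': 'Nonfiction', 'hasChildren': False},
    ],
    2000000001: [
        {'catId': 2000000003, 'description': 'Classics', 'hasChildren': False},
    ],
}


def fetch_children(cat_id):
    """Hypothetical helper: return the child records of a category."""
    return FAKE_TREE.get(cat_id, [])


def walk(category, get_children, parent_id=None):
    """Collect the category and all of its descendants, depth-first."""
    rows = [{'catId': category['catId'],
             'description': category['description'],
             'parentId': parent_id}]
    # Only query for children when the record says there are any
    if category.get('hasChildren'):
        for child in get_children(category['catId']):
            rows.extend(walk(child, get_children, category['catId']))
    return rows


top = {'catId': 2000000000, 'description': 'Consumer Category',
       'hasChildren': True}
rows = walk(top, fetch_children)
```

Storing `parentId` with every row lets the tree be rebuilt later from a flat file, and checking `hasChildren` first avoids one request per leaf category.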

## Contributor Roles

When associating people with books, the term "author" is a natural fit. However, "contributor" is the better term, because people contribute to books in many ways besides authoring the content, such as illustrating, narrating, or editing. More details [here](https://developer.penguinrandomhouse.com/docs/read/enhanced_prh_api/concepts/Contributor_role).

In [21]:
# Since there are only a few roles, they were entered manually and saved here
path_roles = os.path.join('data_raw', 'contributor_roles.csv')

## Other

The main information about series, contributors, works, and categories is included in the API responses with book data. It is possible to download some additional information, for example, contributor biographies (see details [here](https://developer.penguinrandomhouse.com/docs/read/enhanced_prh_api/resources/Authors)), but we skip this for simplicity.