# Looking up ISBNs in Open Library

We've already [extracted ASINs from Amazon links in Hacker News posts](./0010-extracting-asins.ipynb) and obtained an [Open Library dump](https://openlibrary.org/developers/dumps).

We're now going to try to link those ASINs to works in Open Library.

## Imports

In [1]:
from pathlib import Path

from collections import defaultdict

import re

import gzip
import json
import sqlite3

import pandas as pd

from tqdm.notebook import tqdm

## Loading in Open Library

In [2]:
ol_dump_date = '2022-06-06'
data_path = Path('../data/01_raw')

def ol_path(segment):
    return data_path / f'ol_dump_{segment}_{ol_dump_date}.txt.gz'

def ol_data(segment):
    with gzip.open(ol_path(segment), 'rt') as f:
        for line in f:
            # Five tab-separated columns: type, key, revision, last_modified, JSON
            yield tuple(line.split('\t', 4))
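
To make the expected layout concrete, here is a minimal sketch that writes two made-up rows in the five-column, tab-separated dump layout (type, key, revision, last modified, JSON) to a temporary gzip file and reads them back in the same spirit as `ol_data`. The file name, the helper `read_dump`, the edition keys, and the row contents are illustrative assumptions, not part of the real dump.

```python
import gzip
import json
import tempfile
from pathlib import Path

# Two made-up rows in the dump layout: type, key, revision, last_modified, JSON
sample_rows = [
    ("/type/edition", "/books/OL1M", "1", "2022-01-01T00:00:00",
     {"title": "Example A", "isbn_10": ["014015339X"]}),
    ("/type/edition", "/books/OL2M", "3", "2022-01-02T00:00:00",
     {"title": "Example B"}),
]

def read_dump(path):
    """Yield one 5-tuple per line, splitting on the first four tabs only."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield tuple(line.rstrip("\n").split("\t", 4))

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "ol_dump_sample.txt.gz"
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for *fields, metadata in sample_rows:
            f.write("\t".join(fields + [json.dumps(metadata)]) + "\n")

    # Keep only the key and the decoded JSON metadata for each row
    parsed = [(key, json.loads(raw)) for _, key, _, _, raw in read_dump(path)]

print(parsed)
```

Splitting on at most four tabs keeps the JSON column intact as the fifth field, whatever it contains.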

## ISBN 10 to 13 Conversion

For books, the ASIN is identical to the ISBN 10.
However, almost 20% of Open Library records have an ISBN 13 but no ISBN 10.
Luckily it's [straightforward to convert](https://bisg.org/page/conversionscalculat/Conversion--Calculations-.htm) between the two: to go from ISBN 10 to ISBN 13 we add the prefix "978" and recompute the final "check digit"; to go back we drop the "978" prefix and recompute the check digit, as per the algorithms below. Only ISBN 13s that start with "978" have an ISBN 10 equivalent.

In [3]:
isbn_10_weighting = [10,9,8,7,6,5,4,3,2]

isbn_13_weighting = [1,3,1,3,1,3,1,3,1,3,1,3,1]

def isbn10_check_digit(isbn10: str) -> str:
    assert len(isbn10) == 10
    digits = [int(x) for x in isbn10[:-1]]
    check = 11 - sum(x*y for x,y in zip(digits, isbn_10_weighting)) % 11
    
    if check == 11:
        check_digit = "0"
    elif check == 10:
        check_digit = "X"
    else:
        check_digit = str(check)
        
    assert len(check_digit) == 1
    assert check_digit in ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "X"]
    return check_digit
    
def isbn13_check_digit(isbn13: str) -> str:
    assert len(isbn13) == 13
    digits = [int(x) for x in isbn13[:-1]]
    check = 10 - sum(x*y for x,y in zip(digits, isbn_13_weighting)) % 10
    
    if check == 10:
        check_digit = "0"
    else:
        check_digit = str(check)
        
    assert len(check_digit) == 1
    assert check_digit in ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
    return check_digit
    
def isbn13_to_10(isbn13: str) -> str:
    assert isbn13.startswith("978")
    
    return isbn13[3:-1] + isbn10_check_digit(isbn13[3:])

def isbn10_to_13(isbn10: str) -> str:
    return "978" + isbn10[:-1] + isbn13_check_digit("978" + isbn10)
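
As a sanity check on the check-digit arithmetic, the sketch below validates a complete ISBN by testing that its full weighted sum (check digit included) is divisible by 11 for ISBN 10 or by 10 for ISBN 13. The helper names `is_valid_isbn10` and `is_valid_isbn13` are assumptions introduced here; the example pair is the ISBN 10 and ISBN 13 that appear together on one of the Open Library records printed later in this notebook.

```python
def is_valid_isbn10(isbn10: str) -> bool:
    """Weighted sum (weights 10..1, a final 'X' counts as 10) must be divisible by 11."""
    if len(isbn10) != 10:
        return False
    values = []
    for position, char in enumerate(isbn10):
        if char in "0123456789":
            values.append(int(char))
        elif char == "X" and position == 9:
            values.append(10)
        else:
            return False
    return sum(w * v for w, v in zip(range(10, 0, -1), values)) % 11 == 0

def is_valid_isbn13(isbn13: str) -> bool:
    """Weighted sum (weights alternate 1, 3, 1, 3, ...) must be divisible by 10."""
    if len(isbn13) != 13 or not (isbn13.isascii() and isbn13.isdigit()):
        return False
    return sum((3 if i % 2 else 1) * int(c) for i, c in enumerate(isbn13)) % 10 == 0

print(is_valid_isbn10("014015339X"), is_valid_isbn13("9780140153392"))
```

For `014015339X` the first nine digits give a weighted sum of 111, and the check value X (10) brings it to 121 = 11 × 11. For `9780140153392` the first twelve digits give 98, and the check digit 2 brings it to 100.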

## Extracting the ISBNs

We've already exported all posts containing an ASIN; we'll now extract the ISBN 10, where one exists, together with the id of the post it came from.

In [4]:
df = pd.read_csv('../data/02_intermediate/hn_asin.csv').set_index('id')

In [5]:
asin_re = re.compile(r'amazon\.[^"> ]*/dp/([0-9]{9}[0-9X])\W'.replace('/', '&#x2F;'))
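
A small sketch of what this pattern matches, assuming the raw `text` column has slashes HTML-escaped as `&#x2F;` (which is what the `.replace` call suggests). The sample strings are made up for illustration.

```python
import re

asin_re = re.compile(r'amazon\.[^"> ]*/dp/([0-9]{9}[0-9X])\W'.replace('/', '&#x2F;'))

# Made-up comment text with slashes HTML-escaped as &#x2F;
sample = ('See https:&#x2F;&#x2F;www.amazon.com&#x2F;Winnie-Ille-Pu-Latin-Milne'
          '&#x2F;dp&#x2F;014015339X for details')
found = asin_re.findall(sample)

# The trailing \W needs a character after the ASIN,
# so an ASIN at the very end of the text is not captured
at_end = asin_re.findall('amazon.com&#x2F;dp&#x2F;014015339X')

print(found, at_end)
```

The capture group only accepts ten characters made of digits with an optional final `X`, so ASINs for non-book products (which contain other letters) are ignored.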

In [6]:
asin_matches = df.text.str.extractall(asin_re).drop_duplicates()

In [7]:
asin_matches

id,match,0
25763413,0,0809301377
27595409,0,0884272079
27595409,2,0884271536
29586021,0,1577314808
26032563,0,1524747378
...,...,...
27126565,0,1937785580
27096359,0,1736703307
27090660,0,0300208448
29331929,0,3831138931


In [8]:
isbn10s = set(asin_matches[0].unique())

We can also get the corresponding ISBN 13 for each one.

In [9]:
isbn13s = {isbn10_to_13(a): a for a in isbn10s}

In [10]:
len(isbn10s), len(isbn13s)

(1423, 1423)

## Searching Open Library

We can now iterate through all the editions and check whether they contain any of our ISBN 10s or ISBN 13s.

We could make this process fast by unnesting the ISBNs and adding an index in SQLite, but in practice it's a batch process we'd only need to do once in a while so we can let it run.
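
For reference, the indexed alternative could look roughly like the sketch below. It uses an in-memory database and a few illustrative rows; the table name `edition_isbn`, the index name, the helper `editions_for`, and the placeholder row are assumptions made here, not something that exists in the Open Library dump.

```python
import sqlite3

# Illustrative (isbn, edition_key) pairs standing in for the unnested
# isbn_10 / isbn_13 lists; the last row is a made-up placeholder
rows = [
    ("014015339X", "/books/OL7348913M"),
    ("9780140153392", "/books/OL7348913M"),
    ("0000000000", "/books/OL0M"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edition_isbn (isbn TEXT NOT NULL, edition_key TEXT NOT NULL)")
conn.executemany("INSERT INTO edition_isbn VALUES (?, ?)", rows)
# The index lets each ISBN lookup use a B-tree search instead of a full table scan
conn.execute("CREATE INDEX idx_edition_isbn ON edition_isbn (isbn)")

def editions_for(isbn):
    """Return the edition keys recorded for one ISBN, sorted for stable output."""
    cursor = conn.execute(
        "SELECT edition_key FROM edition_isbn WHERE isbn = ? ORDER BY edition_key",
        (isbn,),
    )
    return [key for (key,) in cursor]

print(editions_for("014015339X"))
```

The one-off cost is building the table from the dump; after that, each lookup no longer requires a pass over every edition.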

In [11]:
%%time
matches = defaultdict(set)
match_metadata = {}

for record_type, key, revision, last_modified, json_metadata in ol_data('editions'):
    metadata = json.loads(json_metadata)
    for isbn in metadata.get('isbn_10', []):
        if isbn in isbn10s:
            matches[isbn].add(key)
            match_metadata[key] = metadata

    # ISBN 13 hits are stored under the ISBN 10 they were derived from
    for isbn in metadata.get('isbn_13', []):
        if isbn in isbn13s:
            matches[isbn13s[isbn]].add(key)
            match_metadata[key] = metadata

CPU times: user 12min 45s, sys: 4.3 s, total: 12min 49s
Wall time: 12min 49s


## Analysing linked records

Most of the ISBNs are linked to at least one Open Library edition.

In [12]:
f'{len(matches) / len(isbn10s):0.2%}'

'94.03%'

The unlinked records are legitimate books whose ISBNs are missing from Open Library.
I searched for 10 of them by title in Open Library:

* 5 are in Open Library under a different ISBN (one is a CD)
* 5 are missing entirely, and could be added

In [13]:
def amazon_link(isbn10):
    return f'https://www.amazon.com/dp/{isbn10}'

In [14]:
print('\n'.join([amazon_link(i) for i in isbn10s if i not in matches][:10]))

https://www.amazon.com/dp/1947864351
https://www.amazon.com/dp/973465148X
https://www.amazon.com/dp/9387022897
https://www.amazon.com/dp/0814311156
https://www.amazon.com/dp/098018486X
https://www.amazon.com/dp/1610279034
https://www.amazon.com/dp/1521531218
https://www.amazon.com/dp/1733706119
https://www.amazon.com/dp/1975977920
https://www.amazon.com/dp/9684122179


About 20% of the linked ISBNs match more than one edition.

In [15]:
f'{sum(len(v) > 1 for v in matches.values()) / len(matches):0.2%}'

'20.25%'

Let's look at an example

In [16]:
i, editions = next(iter(matches.items()))
i, editions

('014015339X',
 {'/books/OL22594993M', '/books/OL7348913M', '/books/OL9303565M'})

Winnie the Pooh in Latin has 3 records.

These could potentially be merged in some way; for example, the authors and works can be combined across records.
Something like the title is trickier: we need a heuristic to pick one (e.g. most recently updated, highest revision, greatest amount of metadata, consistent casing, consistency with the work, ...).

This would be trickier still if we had found the records via title rather than by ISBN.
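
One possible heuristic, sketched under stated assumptions: take the union of authors and works across the records, and take the title from the record with the highest `latest_revision`. The function `merge_editions` is hypothetical, and the toy records reuse field names from the edition metadata but are otherwise made up (the second record in particular is invented to show a casing conflict).

```python
def merge_editions(records):
    """Union authors and works; take the title from the most-revised record."""
    authors = sorted({a["key"] for r in records for a in r.get("authors", [])})
    works = sorted({w["key"] for r in records for w in r.get("works", [])})
    most_revised = max(records, key=lambda r: r.get("latest_revision", 0))
    return {"title": most_revised.get("title"), "authors": authors, "works": works}

# Toy records using the same field names as the edition metadata
records = [
    {"title": "Winnie Ille Pu", "latest_revision": 6,
     "authors": [{"key": "/authors/OL29881A"}, {"key": "/authors/OL2657539A"}],
     "works": [{"key": "/works/OL476641W"}]},
    {"title": "WINNIE ILLE PU", "latest_revision": 2,
     "authors": [{"key": "/authors/OL29881A"}],
     "works": [{"key": "/works/OL476641W"}]},
]

merged = merge_editions(records)
print(merged)
```

Highest revision is only one of the candidate rules listed above; swapping the `key` function is enough to try another, such as the most recent `last_modified` value.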

In [17]:
print(amazon_link(i))

https://www.amazon.com/dp/014015339X


In [18]:
for w in editions:
    print(match_metadata[w])
    print()

{'publishers': ['Penguin (Non-Classics)'], 'number_of_pages': 160, 'last_modified': {'type': '/type/datetime', 'value': '2010-08-18T06:35:31.974089'}, 'title': 'Winnie Ille Pu', 'contributions': ['Alexander Lenard (Translator)'], 'identifiers': {'goodreads': ['821004'], 'librarything': ['4315935']}, 'isbn_13': ['9780140153392'], 'covers': [96049], 'created': {'type': '/type/datetime', 'value': '2008-04-29T13:35:46.876380'}, 'languages': [{'key': '/languages/lat'}], 'isbn_10': ['014015339X'], 'publish_date': 'June 20, 1991', 'key': '/books/OL7348913M', 'authors': [{'key': '/authors/OL29881A'}, {'key': '/authors/OL2657539A'}], 'latest_revision': 6, 'works': [{'key': '/works/OL476641W'}], 'type': {'key': '/type/edition'}, 'revision': 6}

{'publishers': ['Penguin (Non-Classics)'], 'identifiers': {'librarything': ['4315935'], 'goodreads': ['821004']}, 'ia_box_id': ['ia148423'], 'weight': '2.4 ounces', 'isbn_10': ['014015339X'], 'covers': [8743072], 'physical_format': 'Paperback', 'ia_loaded

In [19]:
match_metadata[w]['works'][0]['key']

'/works/OL476641W'

We can find the corresponding HN comments

In [20]:
comment_ids = asin_matches[asin_matches[0] == i].reset_index()['id'].to_list()
comment_ids

[29735443]

In [21]:
df.loc[comment_ids].clean_text.to_list()

['Today i learned, that there is a latin translation of Winnie the Pooh. https://www.amazon.com/Winnie-Ille-Pu-Latin-Milne/dp/014015339X It even became a bestseller, and appeared on the list of New York Times bestsellers. Is it known, if translations of the work need to pay for copyright too?']

## Saving Results

In [22]:
max(len(v) for v in matches.values())

7

In [23]:
asin_matches.rename(columns={0: 'isbn'}).to_csv('../data/02_intermediate/hn_asin_isbn.csv')

In [24]:
pd.read_csv('../data/02_intermediate/hn_asin_isbn.csv')

Unnamed: 0,id,match,isbn
0,25763413,0,0809301377
1,27595409,0,0884272079
2,27595409,2,0884271536
3,29586021,0,1577314808
4,26032563,0,1524747378
...,...,...,...
1418,27126565,0,1937785580
1419,27096359,0,1736703307
1420,27090660,0,0300208448
1421,29331929,0,3831138931


In [25]:
pairs = []
for k, vs in matches.items():
    for v in vs:
        pairs.append((k, v))
        
pd.DataFrame(pairs, columns=['isbn', 'edition_key']).to_csv('../data/02_intermediate/isbn_to_edition.csv', index=False)

In [26]:
pd.read_csv('../data/02_intermediate/isbn_to_edition.csv')

Unnamed: 0,isbn,edition_key
0,014015339X,/books/OL7348913M
1,014015339X,/books/OL9303565M
2,014015339X,/books/OL22594993M
3,9810240392,/books/OL9195040M
4,9810240392,/books/OL3435163M
...,...,...
1708,1541647467,/books/OL29483383M
1709,1541647475,/books/OL32387462M
1710,1760290424,/books/OL34014431M
1711,0367338858,/books/OL34664066M


## Quick look at other records

In [27]:
import sqlite3

In [28]:
db = sqlite3.connect('../data/01_raw/openlibrary.sqlite')

In [29]:
for idx, (i, editions) in enumerate(matches.items()):
    # Use an arbitrary edition as the representative record for this ISBN
    meta = match_metadata[next(iter(editions))]
    author_meta = {}
    if meta.get('authors'):
        first_author = meta['authors'][0]['key']
        # Bind the key as a parameter rather than formatting it into the SQL
        row = db.execute("SELECT json FROM authors WHERE key = ?", (first_author,)).fetchone()
        if row is not None:
            author_meta = json.loads(row[0])
    
    comment_ids = asin_matches[asin_matches[0] == i].reset_index()['id'].to_list()
    comment_texts = df.loc[comment_ids].clean_text.to_list()
    
    print('*** ', meta['title'], 'by', author_meta.get('name'))
    print()
    for text in comment_texts:
        # Only add an ellipsis when the text was actually truncated
        print(text[:500] + ('...' if len(text) > 500 else ''))
    print()
    
    if idx > 20:
        break

***  Winnie Ille Pu by A. A. Milne

Today i learned, that there is a latin translation of Winnie the Pooh. https://www.amazon.com/Winnie-Ille-Pu-Latin-Milne/dp/014015339X It even became a bestseller, and appeared on the list of New York Times bestsellers. Is it known, if translations of the work need to pay for copyright too?

***  Foundations and Interpretation of Quantum Mechanics by Gennaro Auletta

You absolutely want this, relatively obscure, book:

https://www.amazon.com/Foundations-Interpretation-Quantum-Mechanics-Critical-Historical/dp/9810240392

There are some translation/copy-writing issues but don't let that put you off. It has a solid coverage of quantization of Hamiltonian mechanics in both matrix mechanics and wave mechanics, and then the unification of the two - all following the historical narrative.

***  The Game by Neil Strauss

(No idea whether you'll read this, but...)

> I know you're trying to come at this from a "things could be worse" perspective

Nope, not re