# Exploratory Data Analysis of Open Library Dumps

Now that we've exported the Open Library dumps into an SQLite database, we'll look at some specific examples.
Unfortunately, the data is not that clean: there are duplicate works, editions and authors.

To make this a bit easier to read we'll use [ipython-sql](https://github.com/catherinedevlin/ipython-sql).

In [1]:
%load_ext sql

Let's create some indexes to make querying faster.

In [2]:
%%time
%%sql sqlite:///../data/01_raw/openlibrary.sqlite

CREATE INDEX IF NOT EXISTS authors_key ON authors(key);
CREATE INDEX IF NOT EXISTS works_key ON works(key);
CREATE INDEX IF NOT EXISTS editions_key ON editions(key);

Done.
Done.
Done.
CPU times: user 19.5 ms, sys: 9.97 ms, total: 29.4 ms
Wall time: 48 ms


[]
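To confirm that lookups by `key` actually use one of these indexes, SQLite's `EXPLAIN QUERY PLAN` can be inspected. The following is a minimal sketch against an in-memory stand-in for the `authors` table (the real file is not needed); the row content is a placeholder, and only the author key comes from this notebook.

```python
import sqlite3

# In-memory stand-in for the real database; the JSON payload is a placeholder.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (key TEXT, json TEXT)")
conn.execute("INSERT INTO authors VALUES (?, ?)", ("/authors/OL46053A", "{}"))
conn.execute("CREATE INDEX IF NOT EXISTS authors_key ON authors(key)")

# The last column of each EXPLAIN QUERY PLAN row is a human-readable description.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT json FROM authors WHERE key = ?",
    ("/authors/OL46053A",),
).fetchall()
plan_details = [row[-1] for row in plan]
print(plan_details)
```

If the index is being used, the printed description should mention `authors_key`; a plan that only says `SCAN` would indicate a full table scan.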

Let's check the number of entries in each table.

In [3]:
%sql SELECT count(*) FROM authors

 * sqlite:///../data/01_raw/openlibrary.sqlite
Done.


count(*)
9601995


In [4]:
%sql SELECT count(*) FROM works

 * sqlite:///../data/01_raw/openlibrary.sqlite
Done.


count(*)
25250375


In [5]:
%sql SELECT count(*) FROM editions

 * sqlite:///../data/01_raw/openlibrary.sqlite
Done.


count(*)
35115487


## Table Structure

Compare the schema with the [short description](https://openlibrary.org/type/edition) and [official schemata](https://github.com/internetarchive/openlibrary-client/blob/master/olclient/schemata/edition.schema.json).

The required fields should be:

    "key",
    "title",
    "type",
    "works",
    "revision",
    "last_modified"

In [6]:
%sql SELECT * FROM editions LIMIT 5

 * sqlite:///../data/01_raw/openlibrary.sqlite
Done.


type,key,revision,last_modified,json
/type/edition,/books/OL10000149M,2,2010-03-11T23:51:36.723486,"{""publishers"": [""Stationery Office Books""], ""key"": ""/books/OL10000149M"", ""created"": {""type"": ""/type/datetime"", ""value"": ""2008-04-30T09:38:13.731961""}, ""number_of_pages"": 87, ""isbn_13"": [""9780107805548""], ""physical_format"": ""Hardcover"", ""isbn_10"": [""0107805545""], ""publish_date"": ""December 31, 1994"", ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2010-03-11T23:51:36.723486""}, ""authors"": [{""key"": ""/authors/OL46053A""}], ""title"": ""40house of Lords Official Report"", ""latest_revision"": 2, ""works"": [{""key"": ""/works/OL14903292W""}], ""type"": {""key"": ""/type/edition""}, ""revision"": 2}"
/type/edition,/books/OL10000180M,2,2010-03-11T23:36:43.209220,"{""publishers"": [""Stationery Office Books""], ""title"": ""House of Lords Official Report"", ""isbn_10"": [""0107805863""], ""created"": {""type"": ""/type/datetime"", ""value"": ""2008-04-30T09:38:13.731961""}, ""isbn_13"": [""9780107805869""], ""physical_format"": ""Hardcover"", ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2010-03-11T23:36:43.209220""}, ""publish_date"": ""November 21, 1998"", ""key"": ""/books/OL10000180M"", ""authors"": [{""key"": ""/authors/OL46053A""}], ""latest_revision"": 2, ""works"": [{""key"": ""/works/OL14903109W""}], ""type"": {""key"": ""/type/edition""}, ""revision"": 2}"
/type/edition,/books/OL10000339M,3,2011-04-30T06:54:53.773773,"{""publishers"": [""Stationery Office Books""], ""physical_format"": ""Paperback"", ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2011-04-30T06:54:53.773773""}, ""title"": ""European Communities (Amendment) Bill"", ""number_of_pages"": 2, ""created"": {""type"": ""/type/datetime"", ""value"": ""2008-04-30T09:38:13.731961""}, ""isbn_13"": [""9780108361104""], ""isbn_10"": [""0108361101""], ""publish_date"": ""May 20, 1998"", ""key"": ""/books/OL10000339M"", ""authors"": [{""key"": ""/authors/OL46053A""}], ""latest_revision"": 3, ""oclc_numbers"": [""314617472""], ""works"": [{""key"": ""/works/OL14903170W""}], ""type"": {""key"": ""/type/edition""}, ""revision"": 3}"
/type/edition,/books/OL10000461M,3,2011-04-27T06:02:22.341989,"{""publishers"": [""Stationery Office Books""], ""physical_format"": ""Paperback"", ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2011-04-27T06:02:22.341989""}, ""title"": ""Referendums (Scotland and Wales) Bill"", ""number_of_pages"": 40, ""created"": {""type"": ""/type/datetime"", ""value"": ""2008-04-30T09:38:13.731961""}, ""isbn_13"": [""9780108363429""], ""isbn_10"": [""0108363422""], ""publish_date"": ""October 9, 1997"", ""key"": ""/books/OL10000461M"", ""authors"": [{""key"": ""/authors/OL46053A""}], ""latest_revision"": 3, ""oclc_numbers"": [""314726246""], ""works"": [{""key"": ""/works/OL14903135W""}], ""type"": {""key"": ""/type/edition""}, ""revision"": 3}"
/type/edition,/books/OL10000692M,2,2010-03-11T23:43:16.401260,"{""publishers"": [""Stationery Office Books""], ""key"": ""/books/OL10000692M"", ""created"": {""type"": ""/type/datetime"", ""value"": ""2008-04-30T09:38:13.731961""}, ""number_of_pages"": 2, ""isbn_13"": [""9780108366024""], ""physical_format"": ""Paperback"", ""isbn_10"": [""0108366022""], ""publish_date"": ""March 27, 1998"", ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2010-03-11T23:43:16.401260""}, ""authors"": [{""key"": ""/authors/OL46053A""}], ""title"": ""Data Protection Bill [H.L.]"", ""latest_revision"": 2, ""works"": [{""key"": ""/works/OL14903197W""}], ""type"": {""key"": ""/type/edition""}, ""revision"": 2}"
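As a rough way to test the required-field list against the data, the `json` column can be parsed and checked in Python. This is only a sketch: `missing_required_fields` is a hypothetical helper, and the sample record is an abridged copy of the first row above.

```python
import json
from typing import List

REQUIRED_EDITION_FIELDS = ("key", "title", "type", "works", "revision", "last_modified")

def missing_required_fields(raw_json: str) -> List[str]:
    """Return the required edition fields that are absent from a raw JSON record."""
    record = json.loads(raw_json)
    return [field for field in REQUIRED_EDITION_FIELDS if field not in record]

# Abridged copy of the first edition shown above.
complete = {
    "key": "/books/OL10000149M",
    "title": "40house of Lords Official Report",
    "type": {"key": "/type/edition"},
    "works": [{"key": "/works/OL14903292W"}],
    "revision": 2,
    "last_modified": {"type": "/type/datetime", "value": "2010-03-11T23:51:36.723486"},
}
# The same record with the works reference removed, to show a failing case.
without_works = {k: v for k, v in complete.items() if k != "works"}

print(missing_required_fields(json.dumps(complete)))
print(missing_required_fields(json.dumps(without_works)))
```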


Let's extract a few common high-level fields; note that many of them are structured arrays.

In [7]:
%%sql
SELECT key,
       json_extract(json, '$.title') as title,
       json_extract(json, '$.authors') as authors,
       json_extract(json, '$.works') as works,
       json_extract(json, '$.isbn_13') as isbn13,
       json_extract(json, '$.isbn_10') as isbn10,
       json_extract(json, '$.number_of_pages') as number_of_pages,
       json_extract(json, '$.publish_date') as publish_date
FROM editions
LIMIT 5


 * sqlite:///../data/01_raw/openlibrary.sqlite
Done.


key,title,authors,works,isbn13,isbn10,number_of_pages,publish_date
/books/OL10000149M,40house of Lords Official Report,"[{""key"":""/authors/OL46053A""}]","[{""key"":""/works/OL14903292W""}]","[""9780107805548""]","[""0107805545""]",87.0,"December 31, 1994"
/books/OL10000180M,House of Lords Official Report,"[{""key"":""/authors/OL46053A""}]","[{""key"":""/works/OL14903109W""}]","[""9780107805869""]","[""0107805863""]",,"November 21, 1998"
/books/OL10000339M,European Communities (Amendment) Bill,"[{""key"":""/authors/OL46053A""}]","[{""key"":""/works/OL14903170W""}]","[""9780108361104""]","[""0108361101""]",2.0,"May 20, 1998"
/books/OL10000461M,Referendums (Scotland and Wales) Bill,"[{""key"":""/authors/OL46053A""}]","[{""key"":""/works/OL14903135W""}]","[""9780108363429""]","[""0108363422""]",40.0,"October 9, 1997"
/books/OL10000692M,Data Protection Bill [H.L.],"[{""key"":""/authors/OL46053A""}]","[{""key"":""/works/OL14903197W""}]","[""9780108366024""]","[""0108366022""]",2.0,"March 27, 1998"
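Because `authors` and `works` come back as JSON arrays of `{"key": ...}` objects, they need flattening before they can be compared or joined outside SQL. A minimal sketch, assuming the array shape seen in the output above; `reference_keys` is a hypothetical helper name.

```python
import json
from typing import List, Optional

def reference_keys(raw_array: Optional[str]) -> List[str]:
    """Flatten a JSON array of {"key": ...} references into a list of keys."""
    if raw_array is None:  # json_extract yields NULL when the field is absent
        return []
    return [item["key"] for item in json.loads(raw_array)]

# Values copied from the first row of the output above.
author_keys = reference_keys('[{"key":"/authors/OL46053A"}]')
work_keys = reference_keys('[{"key":"/works/OL14903292W"}]')
# ISBN arrays are plain lists of strings, so json.loads is enough for them.
isbn13 = json.loads('["9780107805548"]')

print(author_keys, work_keys, isbn13)
```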


Let's look at the authors of "Bayesian Data Analysis", which got new authors in later editions.

First let's create an index on the work titles so the query runs in reasonable time.
We use `COLLATE NOCASE` so that the index supports case-insensitive search (e.g. with a `LIKE` query).

In [8]:
%%time
%sql CREATE INDEX IF NOT EXISTS works_title ON works(json_extract(json, '$.title') COLLATE NOCASE)

 * sqlite:///../data/01_raw/openlibrary.sqlite
Done.
CPU times: user 1.65 ms, sys: 1.29 ms, total: 2.94 ms
Wall time: 2.09 ms


[]
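To illustrate what `COLLATE NOCASE` changes, here is a small sketch on an in-memory table. As a simplification it indexes a plain `title` column rather than a `json_extract` expression; the titles and keys are taken from query output elsewhere in this notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE works (key TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO works VALUES (?, ?)",
    [
        ("/works/OL25152967W", "Bayesian Data Analysis"),
        ("/works/OL12630389W", "Bayesian data analysis"),
        ("/works/OL4569825W", "How to solve it"),
    ],
)
conn.execute("CREATE INDEX works_title ON works(title COLLATE NOCASE)")

query = "SELECT key FROM works WHERE title = 'bayesian data analysis' {} ORDER BY key"
# Default (BINARY) comparison is case sensitive, so nothing matches.
binary_matches = conn.execute(query.format("")).fetchall()
# With NOCASE, both capitalisations of the title match.
nocase_matches = conn.execute(query.format("COLLATE NOCASE")).fetchall()

print(binary_matches)
print(nocase_matches)
```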

The query below reveals some issues:

* There are 4 distinct works, when there should be just 1.
* There are 3 distinct Gelmans (Gelman Andrew, Andrew Gelman, and A. Gelman), when they all correspond to a single author.

These are big problems for entity linking, and we'll need to deduplicate.

Note that since a work can have multiple authors, we use `json_each`, which is a table-valued function.
Here we get one row per work and author pair.

In [9]:
%%sql

SELECT works.key as work_key, 
       json_extract(works.json, '$.title') as works_title,
       json_extract(authors.json, '$.name') as author_name,
       authors.key as author_key
FROM works
JOIN json_each((
                SELECT json
                FROM works as e
                WHERE e.key = works.key
              ), '$.authors') as works_authors
LEFT JOIN authors ON authors.key = json_extract(works_authors.value, '$.author.key')
WHERE json_extract(works.json, '$.title') LIKE 'Bayesian Data Analysis'

 * sqlite:///../data/01_raw/openlibrary.sqlite
Done.


work_key,works_title,author_name,author_key
/works/OL25152967W,Bayesian Data Analysis,Gelman Andrew,/authors/OL9492748A
/works/OL12630389W,Bayesian data analysis,Andrew Gelman,/authors/OL2668098A
/works/OL19124056W,Bayesian data analysis,Andrew Gelman,/authors/OL2668098A
/works/OL18391964W,Bayesian data analysis,Andrew Gelman,/authors/OL2668098A
/works/OL18391964W,Bayesian data analysis,John B. Carlin,/authors/OL2692132A
/works/OL18391964W,Bayesian data analysis,Hal S. Stern,/authors/OL2692133A
/works/OL18391964W,Bayesian data analysis,Donald B. Rubin,/authors/OL1194305A
/works/OL18391964W,Bayesian data analysis,A. Gelman,/authors/OL2692134A
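The row-per-pair behaviour of `json_each` can be seen in isolation with a tiny in-memory example. This sketch passes `works.json` to `json_each` directly, a variant of the subquery form used above, and assumes the bundled SQLite has the JSON functions available. The work record is abridged to two of its authors.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE works (key TEXT, json TEXT)")
conn.execute(
    "INSERT INTO works VALUES (?, ?)",
    (
        "/works/OL18391964W",
        # Abridged record: only two of the authors listed above.
        '{"title": "Bayesian data analysis", "authors": ['
        '{"author": {"key": "/authors/OL2668098A"}}, '
        '{"author": {"key": "/authors/OL2692132A"}}]}',
    ),
)

# json_each emits one row per array element; its `key` column is the array index.
rows = conn.execute(
    """
    SELECT works.key,
           json_extract(works_authors.value, '$.author.key') AS author_key
    FROM works, json_each(works.json, '$.authors') AS works_authors
    ORDER BY works_authors.key
    """
).fetchall()

print(rows)
```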


Let's have a look at the editions, after indexing their titles in the same way.

In [10]:
%%time
%sql CREATE INDEX IF NOT EXISTS editions_title ON editions(json_extract(json, '$.title') COLLATE NOCASE)

 * sqlite:///../data/01_raw/openlibrary.sqlite
Done.
CPU times: user 3min 40s, sys: 29.4 s, total: 4min 10s
Wall time: 5min 25s


[]

The duplications carry across editions.

In [11]:
%%sql

SELECT editions.key,
       json_extract(editions.json, '$.title') as edition_title,
       json_extract(editions.json, '$.edition_name') as edition_name,
       json_extract(authors.json, '$.name') as author_name,
       authors.key as author_key,
       json_extract(editions.json, '$.isbn_10') as isbn_10,
       json_extract(editions.json, '$.isbn_13') as isbn_13,
       json_extract(editions.json, '$.publish_date') as publish_date,
       json_array_length(editions.json, '$.works') as num_works,
       json_extract(editions.json, '$.works[0].key') as first_work
FROM editions
JOIN json_each((
               SELECT json
               FROM editions as e
               WHERE e.key = editions.key
             ), '$.authors') as edition_authors
LEFT JOIN authors ON authors.key = json_extract(edition_authors.value, '$.key')
WHERE json_extract(editions.json, '$.title') LIKE 'Bayesian Data Analysis'
ORDER BY first_work, publish_date, editions.key, author_name, authors.key

 * sqlite:///../data/01_raw/openlibrary.sqlite
Done.


key,edition_title,edition_name,author_name,author_key,isbn_10,isbn_13,publish_date,num_works,first_work
/books/OL22539569M,Bayesian data analysis,2nd ed.,No name,/authors/OL5631454A,"[""158488388X""]","[""9781584883883""]",2003,1,/works/OL12630389W
/books/OL34585063M,Bayesian Data Analysis,,Andrew Gelman,/authors/OL2668098A,,"[""9780203491287""]",1995,1,/works/OL18391964W
/books/OL34585063M,Bayesian Data Analysis,,Donald B. Rubin,/authors/OL1194305A,,"[""9780203491287""]",1995,1,/works/OL18391964W
/books/OL34585063M,Bayesian Data Analysis,,Hal S. Stern,/authors/OL2692133A,,"[""9780203491287""]",1995,1,/works/OL18391964W
/books/OL34585063M,Bayesian Data Analysis,,John B. Carlin,/authors/OL2692132A,,"[""9780203491287""]",1995,1,/works/OL18391964W
/books/OL29240217M,Bayesian Data Analysis,,Aki Vehtari,/authors/OL8014523A,,"[""9781439898208""]",2013,1,/works/OL18391964W
/books/OL29240217M,Bayesian Data Analysis,,Andrew Gelman,/authors/OL2668098A,,"[""9781439898208""]",2013,1,/works/OL18391964W
/books/OL29240217M,Bayesian Data Analysis,,David B. Dunson,/authors/OL8243702A,,"[""9781439898208""]",2013,1,/works/OL18391964W
/books/OL29240217M,Bayesian Data Analysis,,Hal S. Stern,/authors/OL2692133A,,"[""9781439898208""]",2013,1,/works/OL18391964W
/books/OL29240217M,Bayesian Data Analysis,,John B. Carlin,/authors/OL2692132A,,"[""9781439898208""]",2013,1,/works/OL18391964W


Similarly, there are multiple works for *How to Solve It* by *George Pólya*, when I think there should be only one.

We'll likely need a way to pick a preferred candidate (e.g. most editions, most content).

In [12]:
%%sql

SELECT works.key, 
       json_extract(works.json, '$.title') as works_title,
       json_extract(authors.json, '$.name') as author_name,
       authors.key as author_key
FROM works
JOIN json_each((
                SELECT json
                FROM works as e
                WHERE e.key = works.key
              ), '$.authors') as works_authors
LEFT JOIN authors ON authors.key = json_extract(works_authors.value, '$.author.key')
WHERE json_extract(works.json, '$.title') LIKE 'How to Solve It'

 * sqlite:///../data/01_raw/openlibrary.sqlite
Done.


key,works_title,author_name,author_key
/works/OL8327698W,How to solve it,George Pólya,/authors/OL4709585A
/works/OL12252975W,How to solve it,George Pólya,/authors/OL4709585A
/works/OL4569825W,How to solve it,George Pólya,/authors/OL4709585A
/works/OL19367125W,How to Solve It,George Pólya,/authors/OL4709585A
/works/OL47601W,How to solve it,Zbigniew Michalewicz,/authors/OL28391A
/works/OL4569826W,How to Solve It,George Pólya,/authors/OL4709585A
/works/OL4569830W,How to solve it,George Pólya,/authors/OL4709585A
/works/OL11245415W,How to solve it,George Pólya,/authors/OL4709585A
/works/OL21575207W,How to Solve It,George Pólya,/authors/OL4709585A
/works/OL8600815W,How to solve it,George Pólya,/authors/OL4709585A
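One possible rule for picking a preferred candidate is "the work with the most editions". The sketch below is only an illustration of that heuristic, not a settled approach: `preferred_work` is a hypothetical helper, and the three (edition, work) pairs are a small sample from this notebook's edition listing for the title.

```python
from collections import Counter
from typing import Iterable, Tuple

def preferred_work(edition_work_pairs: Iterable[Tuple[str, str]]) -> str:
    """Pick the work key referenced by the most editions."""
    counts = Counter(work for _, work in edition_work_pairs)
    # Break ties on the work key so the choice is deterministic.
    return min(counts, key=lambda work: (-counts[work], work))

# (edition key, first work key) pairs sampled from the notebook's output.
pairs = [
    ("/books/OL16491214M", "/works/OL11245415W"),
    ("/books/OL13694363M", "/works/OL4569830W"),
    ("/books/OL13747212M", "/works/OL4569830W"),
]
best = preferred_work(pairs)
print(best)
```

Other signals (such as how much metadata a work record carries) could be added to the sort key in the same way.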


It gets worse: there are duplicate editions within a work (e.g. look at `/books/OL18335079M` and `/books/OL4468213M`).

In [13]:
%%sql

SELECT editions.key,
       json_extract(editions.json, '$.title') as edition_title,
       json_extract(editions.json, '$.edition_name') as edition_name,
       json_extract(authors.json, '$.name') as author_name,
       authors.key as author_key,
       json_extract(editions.json, '$.isbn_10') as isbn_10,
       json_extract(editions.json, '$.isbn_13') as isbn_13,
       json_extract(editions.json, '$.publish_date') as publish_date,
       json_array_length(editions.json, '$.works') as num_works,
       json_extract(editions.json, '$.works[0].key') as first_work,
       json_extract(works.json, '$.title') as first_work_title
FROM editions
JOIN json_each((
               SELECT json
               FROM editions as e
               WHERE e.key = editions.key
             ), '$.authors') as edition_authors
LEFT JOIN authors ON authors.key = json_extract(edition_authors.value, '$.key')
LEFT JOIN works ON works.key = json_extract(editions.json, '$.works[0].key')
WHERE json_extract(editions.json, '$.title') LIKE 'How to Solve It'
ORDER BY isbn_10, isbn_13, first_work, publish_date, editions.key, author_name, authors.key

 * sqlite:///../data/01_raw/openlibrary.sqlite
Done.


key,edition_title,edition_name,author_name,author_key,isbn_10,isbn_13,publish_date,num_works,first_work,first_work_title
/books/OL16491214M,How to solve it,,George Pólya,/authors/OL4709585A,,,1948,1,/works/OL11245415W,How to solve it
/books/OL20869941M,How to solve it,2nd ed.,George Pólya,/authors/OL4709585A,,,,1,/works/OL4569830W,How to solve it
/books/OL13694363M,How to solve it,,George Pólya,/authors/OL4709585A,,,1945,1,/works/OL4569830W,How to solve it
/books/OL13747212M,How to solve it,,George Pólya,/authors/OL4709585A,,,1945,1,/works/OL4569830W,How to solve it
/books/OL18161438M,How to solve it,,George Pólya,/authors/OL4709585A,,,1945,1,/works/OL4569830W,How to solve it
/books/OL188098M,How to solve it,,George Pólya,/authors/OL4709585A,,,1945,1,/works/OL4569830W,How to solve it
/books/OL13677920M,How to solve it,2nd ed.,George Pólya,/authors/OL4709585A,,,1957,1,/works/OL4569830W,How to solve it
/books/OL13952675M,How to solve it,2nd ed.,George Pólya,/authors/OL4709585A,,,1957,1,/works/OL4569830W,How to solve it
/books/OL6220908M,How to solve it,2d ed.,George Pólya,/authors/OL4709585A,,,1957,1,/works/OL4569830W,How to solve it
/books/OL19665756M,How to solve it,2nd ed.,George Pólya,/authors/OL4709585A,,,1971,1,/works/OL4569830W,How to solve it


Note that often one of the ISBN-13 and ISBN-10 is missing, but [we can convert between them](https://bisg.org/page/conversionscalculat/Conversion--Calculations-.htm).

In [14]:
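# Weights applied to the first nine digits of an ISBN-10.
isbn_10_weighting = [10,9,8,7,6,5,4,3,2]

# Alternating weights for an ISBN-13; zip() below only consumes the first twelve.
isbn_13_weighting = [1,3,1,3,1,3,1,3,1,3,1,3,1]

def isbn10_check_digit(isbn10: str) -> str:
    """Compute the ISBN-10 check digit from the first nine digits (the last character is ignored)."""
    assert len(isbn10) == 10
    digits = [int(x) for x in isbn10[:-1]]
    check = 11 - sum(x*y for x,y in zip(digits, isbn_10_weighting)) % 11

    if check == 11:
        check_digit = "0"
    elif check == 10:
        check_digit = "X"
    else:
        check_digit = str(check)

    assert len(check_digit) == 1
    assert check_digit in ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "X"]
    return check_digit

def isbn13_check_digit(isbn13: str) -> str:
    """Compute the ISBN-13 check digit from the first twelve digits (the last character is ignored)."""
    assert len(isbn13) == 13
    digits = [int(x) for x in isbn13[:-1]]
    check = 10 - sum(x*y for x,y in zip(digits, isbn_13_weighting)) % 10

    if check == 10:
        check_digit = "0"
    else:
        check_digit = str(check)

    assert len(check_digit) == 1
    assert check_digit in ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
    return check_digit

def isbn13_to_10(isbn13: str) -> str:
    """Convert a 978-prefixed ISBN-13 to an ISBN-10; other prefixes have no ISBN-10 equivalent."""
    assert isbn13.startswith("978")

    return isbn13[3:-1] + isbn10_check_digit(isbn13[3:])

def isbn10_to_13(isbn10: str) -> str:
    """Convert an ISBN-10 to an ISBN-13 by prefixing 978 and recomputing the check digit."""
    assert len(isbn10) == 10
    return "978" + isbn10[:-1] + isbn13_check_digit("978" + isbn10)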

In [15]:
isbn13_to_10("9784871878302")

'4871878309'

In [16]:
isbn10_to_13("4871878309")

'9784871878302'

Let's check another famous example: SICP. Here there are two works.

In [17]:
%%sql

SELECT works.key, 
       json_extract(works.json, '$.title') as works_title,
       json_extract(authors.json, '$.name') as author_name,
       authors.key as author_key
FROM works
JOIN json_each((
                SELECT json
                FROM works as e
                WHERE e.key = works.key
              ), '$.authors') as works_authors
LEFT JOIN authors ON authors.key = json_extract(works_authors.value, '$.author.key')
WHERE json_extract(works.json, '$.title') LIKE 'Structure and interpretation of computer programs'

 * sqlite:///../data/01_raw/openlibrary.sqlite
Done.


key,works_title,author_name,author_key
/works/OL15845363W,Structure and interpretation of computer programs,Harold Abelson,/authors/OL532838A
/works/OL3267304W,Structure and Interpretation of Computer Programs,Harold Abelson,/authors/OL532838A
/works/OL3267304W,Structure and Interpretation of Computer Programs,Gerald Jay Sussman,/authors/OL1963807A
/works/OL3267304W,Structure and Interpretation of Computer Programs,Julie Sussman,/authors/OL596156A


Again there are some duplicate editions (based on ISBN).

In [18]:
%%sql

SELECT editions.key,
       json_extract(editions.json, '$.title') as edition_title,
       json_extract(editions.json, '$.edition_name') as edition_name,
       json_extract(editions.json, '$.subtitle') as edition_subtitle,
       json_extract(authors.json, '$.name') as author_name,
       authors.key as author_key,
       json_extract(editions.json, '$.isbn_10') as isbn_10,
       json_extract(editions.json, '$.isbn_13') as isbn_13,
       json_extract(editions.json, '$.publish_date') as publish_date,
       json_array_length(editions.json, '$.works') as num_works,
       json_extract(editions.json, '$.works[0].key') as first_work,
       json_extract(works.json, '$.title') as first_work_title
FROM editions
JOIN json_each((
               SELECT json
               FROM editions as e
               WHERE e.key = editions.key
             ), '$.authors') as edition_authors
LEFT JOIN authors ON authors.key = json_extract(edition_authors.value, '$.key')
LEFT JOIN works ON works.key = json_extract(editions.json, '$.works[0].key')
WHERE json_extract(editions.json, '$.title') LIKE 'Structure and interpretation of computer programs'
ORDER BY isbn_10, isbn_13, first_work, publish_date, editions.key, author_name, authors.key

 * sqlite:///../data/01_raw/openlibrary.sqlite
Done.


key,edition_title,edition_name,edition_subtitle,author_name,author_key,isbn_10,isbn_13,publish_date,num_works,first_work,first_work_title
/books/OL34979954M,Structure and Interpretation of Computer Programs,,JavaScript Edition,Gerald Jay Sussman,/authors/OL7526117A,,"[""9780262543231""]",2022,1,/works/OL15845363W,Structure and interpretation of computer programs
/books/OL34979954M,Structure and Interpretation of Computer Programs,,JavaScript Edition,Harold Abelson,/authors/OL532838A,,"[""9780262543231""]",2022,1,/works/OL15845363W,Structure and interpretation of computer programs
/books/OL34979954M,Structure and Interpretation of Computer Programs,,JavaScript Edition,Julie Sussman,/authors/OL596156A,,"[""9780262543231""]",2022,1,/works/OL15845363W,Structure and interpretation of computer programs
/books/OL34979954M,Structure and Interpretation of Computer Programs,,JavaScript Edition,Martin Henz,/authors/OL7576334A,,"[""9780262543231""]",2022,1,/works/OL15845363W,Structure and interpretation of computer programs
/books/OL34979954M,Structure and Interpretation of Computer Programs,,JavaScript Edition,Tobias Wrigstad,/authors/OL7215381A,,"[""9780262543231""]",2022,1,/works/OL15845363W,Structure and interpretation of computer programs
/books/OL2847540M,Structure and interpretation of computer programs,,,Harold Abelson,/authors/OL532838A,"[""0262010771"",""0070004226""]",,1985,1,/works/OL3267304W,Structure and Interpretation of Computer Programs
/books/OL24754690M,Structure and interpretation of computer programs,,,Harold Abelson,/authors/OL532838A,"[""0262010771"",""0070004226""]","[""9780262010771"",""9780070004221""]",1985,1,/works/OL15845363W,Structure and interpretation of computer programs
/books/OL15495574M,Structure and Interpretation of Computer Programs,Second Edition,,Harold Abelson,/authors/OL532838A,"[""0262011530"",""0262011530"",""0262510871"",""0262510871"",""0070004846""]",,July 25th 1996,1,/works/OL3267304W,Structure and Interpretation of Computer Programs
/books/OL14975261M,Structure and interpretation of computer programs,,,Harold Abelson,/authors/OL532838A,"[""0262510367""]",,1985,1,/works/OL3267304W,Structure and Interpretation of Computer Programs
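Spotting these duplicates by eye is error-prone because some editions only carry ISBN-10s and others carry both forms. One way to make the comparison mechanical is to normalise everything to ISBN-13 and group editions that share a value. This is a sketch under that assumption; `editions_by_isbn13` is a hypothetical helper, the conversion is a compact re-statement of the ISBN-10 to ISBN-13 rule, and the three editions are copied from the output above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def isbn10_to_13(isbn10: str) -> str:
    """Prefix 978, drop the old check digit and compute the ISBN-13 one."""
    body = "978" + isbn10[:-1]
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(body))
    return body + str((10 - total % 10) % 10)

def editions_by_isbn13(
    editions: Iterable[Tuple[str, List[str], List[str]]]
) -> Dict[str, List[str]]:
    """Map each ISBN-13 shared by more than one edition to those edition keys."""
    groups = defaultdict(list)
    for key, isbn_10, isbn_13 in editions:
        # A set avoids counting an edition twice when it lists both forms.
        normalised = set(isbn_13) | {isbn10_to_13(x) for x in isbn_10}
        for isbn in normalised:
            groups[isbn].append(key)
    return {isbn: keys for isbn, keys in groups.items() if len(keys) > 1}

# (edition key, isbn_10 list, isbn_13 list) copied from the output above.
editions = [
    ("/books/OL2847540M", ["0262010771", "0070004226"], []),
    ("/books/OL24754690M", ["0262010771", "0070004226"], ["9780262010771", "9780070004221"]),
    ("/books/OL14975261M", ["0262510367"], []),
]
duplicates = editions_by_isbn13(editions)
print(duplicates)
```

In this sample the first two editions share both ISBNs even though they are attached to different works, while the third stands alone.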


## Multiple works per edition

Sometimes there are multiple works for an edition; here are a few examples.

At a glance, it's not clear how these works differ (except in the one case where the work is missing!).
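To go beyond a glance, two work records can be compared field by field while ignoring bookkeeping fields. The sketch below assumes that `key`, `created`, `last_modified`, `latest_revision` and `revision` are the fields to ignore (that choice is mine, not something the schema dictates); `differing_fields` is a hypothetical helper, and the records are abridged from the query output below.

```python
from typing import Set

# Assumed bookkeeping fields that differ between any two records.
VOLATILE_FIELDS = {"key", "created", "last_modified", "latest_revision", "revision"}

def differing_fields(a: dict, b: dict, ignore: Set[str] = VOLATILE_FIELDS) -> Set[str]:
    """Return the non-ignored fields whose values differ between two records."""
    fields = (set(a) | set(b)) - set(ignore)
    return {field for field in fields if a.get(field) != b.get(field)}

# Abridged work records for the two pairs of works shown below.
jaime_a = {"key": "/works/OL5750828W", "title": "J'aime ce que j'aime, tome 1",
           "covers": [3143363], "revision": 2}
jaime_b = {"key": "/works/OL5750990W", "title": "J'aime ce que j'aime, tome 1",
           "covers": [3143363], "revision": 2}
sakura_a = {"key": "/works/OL5750771W", "title": "Cardcaptor Sakura, Vol. 4",
            "covers": [865906], "revision": 3}
sakura_b = {"key": "/works/OL5750933W",
            "title": "Cardcaptor Sakura, Vol. 4 (Cardcaptor Sakura Authentic Manga)",
            "covers": [865906], "revision": 2}

print(differing_fields(jaime_a, jaime_b))
print(differing_fields(sakura_a, sakura_b))
```

An empty result suggests the two works are duplicates of each other, at least for the fields included in the abridged records.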

In [19]:
%%sql

SELECT editions.key,
       editions.last_modified,
       editions.json as edition_metadata,
       edition_works.value as work_key,
       works.json as work_metadata
FROM editions
JOIN json_each((
                SELECT json
                FROM editions as e
                WHERE e.key = editions.key
              ), '$.works') as edition_works
LEFT JOIN works ON json_extract(edition_works.value, '$.key') = works.key
WHERE json_array_length(editions.json, '$.works') > 1
LIMIT 10

 * sqlite:///../data/01_raw/openlibrary.sqlite
Done.


key,last_modified,edition_metadata,work_key,work_metadata
/books/OL12626512M,2011-06-08T10:36:29.366416,"{""publishers"": [""Pika""], ""weight"": ""6.4 ounces"", ""covers"": [3143363], ""physical_format"": ""Paperback"", ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2011-06-08T10:36:29.366416""}, ""latest_revision"": 7, ""key"": ""/books/OL12626512M"", ""authors"": [{""key"": ""/authors/OL1398994A""}], ""isbn_13"": [""9782845991071""], ""title"": ""J'aime ce que j'aime, tome 1"", ""identifiers"": {""librarything"": [""160310""], ""goodreads"": [""1127794""]}, ""created"": {""type"": ""/type/datetime"", ""value"": ""2008-04-30T20:50:18.033121""}, ""languages"": [{""key"": ""/languages/fre""}], ""isbn_10"": [""284599107X""], ""publish_date"": ""March 1, 2001"", ""oclc_numbers"": [""469280304""], ""works"": [{""key"": ""/works/OL5750828W""}, {""key"": ""/works/OL5750990W""}], ""type"": {""key"": ""/type/edition""}, ""physical_dimensions"": ""7 x 4.3 x 0.6 inches"", ""revision"": 7}","{""key"":""/works/OL5750828W""}","{""created"": {""type"": ""/type/datetime"", ""value"": ""2009-12-10T11:13:47.184190""}, ""title"": ""J'aime ce que j'aime, tome 1"", ""covers"": [3143363], ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2010-04-28T09:10:02.307057""}, ""latest_revision"": 2, ""key"": ""/works/OL5750828W"", ""authors"": [{""type"": ""/type/author_role"", ""author"": {""key"": ""/authors/OL1398994A""}}], ""type"": {""key"": ""/type/work""}, ""revision"": 2}"
/books/OL12626512M,2011-06-08T10:36:29.366416,"{""publishers"": [""Pika""], ""weight"": ""6.4 ounces"", ""covers"": [3143363], ""physical_format"": ""Paperback"", ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2011-06-08T10:36:29.366416""}, ""latest_revision"": 7, ""key"": ""/books/OL12626512M"", ""authors"": [{""key"": ""/authors/OL1398994A""}], ""isbn_13"": [""9782845991071""], ""title"": ""J'aime ce que j'aime, tome 1"", ""identifiers"": {""librarything"": [""160310""], ""goodreads"": [""1127794""]}, ""created"": {""type"": ""/type/datetime"", ""value"": ""2008-04-30T20:50:18.033121""}, ""languages"": [{""key"": ""/languages/fre""}], ""isbn_10"": [""284599107X""], ""publish_date"": ""March 1, 2001"", ""oclc_numbers"": [""469280304""], ""works"": [{""key"": ""/works/OL5750828W""}, {""key"": ""/works/OL5750990W""}], ""type"": {""key"": ""/type/edition""}, ""physical_dimensions"": ""7 x 4.3 x 0.6 inches"", ""revision"": 7}","{""key"":""/works/OL5750990W""}","{""created"": {""type"": ""/type/datetime"", ""value"": ""2009-12-14T19:33:16.370837""}, ""title"": ""J'aime ce que j'aime, tome 1"", ""covers"": [3143363], ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2010-04-28T09:10:02.307057""}, ""latest_revision"": 2, ""key"": ""/works/OL5750990W"", ""authors"": [{""type"": ""/type/author_role"", ""author"": {""key"": ""/authors/OL1398994A""}}], ""type"": {""key"": ""/type/work""}, ""revision"": 2}"
/books/OL8858020M,2012-06-18T15:53:10.581988,"{""publishers"": [""TokyoPop""], ""number_of_pages"": 192, ""weight"": ""6.4 ounces"", ""series"": [""\u001a\u001a\u001a\u001a\u001a\u001a\u001a\u001a\u001a\u001a\u001a \u001a\u001a\u001a\u001a\u001a (4)""], ""covers"": [865906], ""physical_format"": ""Paperback"", ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2012-06-18T15:53:10.581988""}, ""latest_revision"": 7, ""key"": ""/books/OL8858020M"", ""authors"": [{""key"": ""/authors/OL1398994A""}], ""subjects"": [""Graphic Novels-Manga"", ""Juvenile Fiction"", ""Children: Grades 2-3"", ""Children: Grades 4-6"", ""Graphic Novels - Manga"", ""Comics & Graphic Novels / Graphic Novels / Manga"", ""Comics & Graphic Novels - Manga"", ""Cartoons and comics"", ""Fiction"", ""Magic""], ""isbn_13"": [""9781591828815""], ""classifications"": {}, ""title"": ""Cardcaptor Sakura, Vol. 4"", ""notes"": {""type"": ""/type/text"", ""value"": ""Cardcaptor Sakura Authentic Manga""}, ""identifiers"": {""librarything"": [""547834""], ""goodreads"": [""229152""]}, ""created"": {""type"": ""/type/datetime"", ""value"": ""2008-04-30T09:38:13.731961""}, ""languages"": [{""key"": ""/languages/eng""}], ""isbn_10"": [""1591828813""], ""publish_date"": ""February 8, 2005"", ""works"": [{""key"": ""/works/OL5750771W""}, {""key"": ""/works/OL5750933W""}], ""type"": {""key"": ""/type/edition""}, ""physical_dimensions"": ""7.4 x 4.9 x 0.7 inches"", ""revision"": 7}","{""key"":""/works/OL5750771W""}","{""title"": ""Cardcaptor Sakura, Vol. 4"", ""created"": {""type"": ""/type/datetime"", ""value"": ""2009-12-10T11:13:47.184190""}, ""covers"": [865906], ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2012-06-18T15:53:10.581988""}, ""latest_revision"": 3, ""key"": ""/works/OL5750771W"", ""authors"": [{""type"": {""key"": ""/type/author_role""}, ""author"": {""key"": ""/authors/OL1398994A""}}], ""type"": {""key"": ""/type/work""}, ""revision"": 3}"
/books/OL8858020M,2012-06-18T15:53:10.581988,"{""publishers"": [""TokyoPop""], ""number_of_pages"": 192, ""weight"": ""6.4 ounces"", ""series"": [""\u001a\u001a\u001a\u001a\u001a\u001a\u001a\u001a\u001a\u001a\u001a \u001a\u001a\u001a\u001a\u001a (4)""], ""covers"": [865906], ""physical_format"": ""Paperback"", ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2012-06-18T15:53:10.581988""}, ""latest_revision"": 7, ""key"": ""/books/OL8858020M"", ""authors"": [{""key"": ""/authors/OL1398994A""}], ""subjects"": [""Graphic Novels-Manga"", ""Juvenile Fiction"", ""Children: Grades 2-3"", ""Children: Grades 4-6"", ""Graphic Novels - Manga"", ""Comics & Graphic Novels / Graphic Novels / Manga"", ""Comics & Graphic Novels - Manga"", ""Cartoons and comics"", ""Fiction"", ""Magic""], ""isbn_13"": [""9781591828815""], ""classifications"": {}, ""title"": ""Cardcaptor Sakura, Vol. 4"", ""notes"": {""type"": ""/type/text"", ""value"": ""Cardcaptor Sakura Authentic Manga""}, ""identifiers"": {""librarything"": [""547834""], ""goodreads"": [""229152""]}, ""created"": {""type"": ""/type/datetime"", ""value"": ""2008-04-30T09:38:13.731961""}, ""languages"": [{""key"": ""/languages/eng""}], ""isbn_10"": [""1591828813""], ""publish_date"": ""February 8, 2005"", ""works"": [{""key"": ""/works/OL5750771W""}, {""key"": ""/works/OL5750933W""}], ""type"": {""key"": ""/type/edition""}, ""physical_dimensions"": ""7.4 x 4.9 x 0.7 inches"", ""revision"": 7}","{""key"":""/works/OL5750933W""}","{""created"": {""type"": ""/type/datetime"", ""value"": ""2009-12-14T19:33:16.370837""}, ""title"": ""Cardcaptor Sakura, Vol. 4 (Cardcaptor Sakura Authentic Manga)"", ""covers"": [865906], ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2010-04-28T09:10:02.307057""}, ""latest_revision"": 2, ""key"": ""/works/OL5750933W"", ""authors"": [{""type"": ""/type/author_role"", ""author"": {""key"": ""/authors/OL1398994A""}}], ""type"": {""key"": ""/type/work""}, ""revision"": 2}"
/books/OL9064958M,2011-04-26T17:33:36.292852,"{""publishers"": [""Carlsen""], ""number_of_pages"": 212, ""weight"": ""6.6 ounces"", ""covers"": [1020787], ""physical_format"": ""Paperback"", ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2011-04-26T17:33:36.292852""}, ""latest_revision"": 7, ""key"": ""/books/OL9064958M"", ""authors"": [{""key"": ""/authors/OL1398994A""}], ""isbn_13"": [""9783551763143""], ""title"": ""Magic Knight Rayearth 04. Fremde M\u00e4chte."", ""identifiers"": {""librarything"": [""290838""], ""goodreads"": [""62325""]}, ""created"": {""type"": ""/type/datetime"", ""value"": ""2008-04-30T09:38:13.731961""}, ""languages"": [{""key"": ""/languages/ger""}], ""isbn_10"": [""3551763143""], ""publish_date"": ""August 1, 2002"", ""oclc_numbers"": [""249620294""], ""works"": [{""key"": ""/works/OL5750838W""}, {""key"": ""/works/OL5751000W""}], ""type"": {""key"": ""/type/edition""}, ""physical_dimensions"": ""6.9 x 4.6 x 0.8 inches"", ""revision"": 7}","{""key"":""/works/OL5750838W""}","{""created"": {""type"": ""/type/datetime"", ""value"": ""2009-12-10T11:13:47.184190""}, ""title"": ""Magic Knight Rayearth 04. Fremde M\u00e4chte"", ""covers"": [1020787], ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2010-04-28T09:10:02.307057""}, ""latest_revision"": 2, ""key"": ""/works/OL5750838W"", ""authors"": [{""type"": ""/type/author_role"", ""author"": {""key"": ""/authors/OL1398994A""}}], ""type"": {""key"": ""/type/work""}, ""revision"": 2}"
/books/OL9064958M,2011-04-26T17:33:36.292852,"{""publishers"": [""Carlsen""], ""number_of_pages"": 212, ""weight"": ""6.6 ounces"", ""covers"": [1020787], ""physical_format"": ""Paperback"", ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2011-04-26T17:33:36.292852""}, ""latest_revision"": 7, ""key"": ""/books/OL9064958M"", ""authors"": [{""key"": ""/authors/OL1398994A""}], ""isbn_13"": [""9783551763143""], ""title"": ""Magic Knight Rayearth 04. Fremde M\u00e4chte."", ""identifiers"": {""librarything"": [""290838""], ""goodreads"": [""62325""]}, ""created"": {""type"": ""/type/datetime"", ""value"": ""2008-04-30T09:38:13.731961""}, ""languages"": [{""key"": ""/languages/ger""}], ""isbn_10"": [""3551763143""], ""publish_date"": ""August 1, 2002"", ""oclc_numbers"": [""249620294""], ""works"": [{""key"": ""/works/OL5750838W""}, {""key"": ""/works/OL5751000W""}], ""type"": {""key"": ""/type/edition""}, ""physical_dimensions"": ""6.9 x 4.6 x 0.8 inches"", ""revision"": 7}","{""key"":""/works/OL5751000W""}","{""created"": {""type"": ""/type/datetime"", ""value"": ""2009-12-14T19:33:16.370837""}, ""title"": ""Magic Knight Rayearth 04. Fremde M\u00e4chte"", ""covers"": [1020787], ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2010-04-28T09:10:02.307057""}, ""latest_revision"": 2, ""key"": ""/works/OL5751000W"", ""authors"": [{""type"": ""/type/author_role"", ""author"": {""key"": ""/authors/OL1398994A""}}], ""type"": {""key"": ""/type/work""}, ""revision"": 2}"
/books/OL31541M,2021-10-22T17:38:03.303296,"{""publishers"": [""Bilingual Press/Editorial Bilingu\u0308e""], ""identifiers"": {""librarything"": [""1195910""], ""goodreads"": [""1293758""]}, ""isbn_10"": [""092753486X""], ""pagination"": ""79 p. ;"", ""covers"": [709195], ""lc_classifications"": [""PS3563.O54588 I28 1999""], ""key"": ""/books/OL31541M"", ""authors"": [{""key"": ""/authors/OL19612A""}], ""publish_places"": [""Tempe, Ariz""], ""genres"": [""Poetry.""], ""classifications"": {}, ""source_records"": [""marc:marc_loc_2016/BooksAll.2016.part27.utf8:222277916:612"", ""ia:iceworkersingsot0000mont""], ""title"": ""The iceworker sings and other poems"", ""lccn"": [""99012111""], ""number_of_pages"": 79, ""languages"": [{""key"": ""/languages/eng""}], ""dewey_decimal_class"": [""811/.54""], ""subjects"": [""Hispanic Americans -- Poetry.""], ""publish_date"": ""1999"", ""publish_country"": ""azu"", ""by_statement"": ""Andre\u0301s Montoya."", ""oclc_numbers"": [""40668197""], ""works"": [{""key"": ""/works/OL14850399W""}, {""key"": ""/works/OL14850400W""}], ""type"": {""key"": ""/type/edition""}, ""ocaid"": ""iceworkersingsot0000mont"", ""latest_revision"": 10, ""revision"": 10, ""created"": {""type"": ""/type/datetime"", ""value"": ""2008-04-01T03:28:50.625462""}, ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2021-10-22T17:38:03.303296""}}","{""key"":""/works/OL14850399W""}","{""first_publish_date"": ""1999"", ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2020-05-06T23:49:44.614623""}, ""title"": ""The iceworker sings and other poems"", ""created"": {""type"": ""/type/datetime"", ""value"": ""2009-12-07T20:44:06.266481""}, ""covers"": [709195], ""lc_classifications"": [""PS3563.O54588 I28 1999""], ""latest_revision"": 4, ""key"": ""/works/OL14850399W"", ""authors"": [{""type"": {""key"": ""/type/author_role""}, ""author"": {""key"": ""/authors/OL19612A""}}], ""dewey_number"": [""811/.54""], ""type"": {""key"": ""/type/work""}, 
""subjects"": [""Hispanic Americans"", ""Poetry""], ""revision"": 4}"
/books/OL31541M,2021-10-22T17:38:03.303296,"{""publishers"": [""Bilingual Press/Editorial Bilingu\u0308e""], ""identifiers"": {""librarything"": [""1195910""], ""goodreads"": [""1293758""]}, ""isbn_10"": [""092753486X""], ""pagination"": ""79 p. ;"", ""covers"": [709195], ""lc_classifications"": [""PS3563.O54588 I28 1999""], ""key"": ""/books/OL31541M"", ""authors"": [{""key"": ""/authors/OL19612A""}], ""publish_places"": [""Tempe, Ariz""], ""genres"": [""Poetry.""], ""classifications"": {}, ""source_records"": [""marc:marc_loc_2016/BooksAll.2016.part27.utf8:222277916:612"", ""ia:iceworkersingsot0000mont""], ""title"": ""The iceworker sings and other poems"", ""lccn"": [""99012111""], ""number_of_pages"": 79, ""languages"": [{""key"": ""/languages/eng""}], ""dewey_decimal_class"": [""811/.54""], ""subjects"": [""Hispanic Americans -- Poetry.""], ""publish_date"": ""1999"", ""publish_country"": ""azu"", ""by_statement"": ""Andre\u0301s Montoya."", ""oclc_numbers"": [""40668197""], ""works"": [{""key"": ""/works/OL14850399W""}, {""key"": ""/works/OL14850400W""}], ""type"": {""key"": ""/type/edition""}, ""ocaid"": ""iceworkersingsot0000mont"", ""latest_revision"": 10, ""revision"": 10, ""created"": {""type"": ""/type/datetime"", ""value"": ""2008-04-01T03:28:50.625462""}, ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2021-10-22T17:38:03.303296""}}","{""key"":""/works/OL14850400W""}",
/books/OL8718879M,2011-04-30T05:42:41.578018,"{""publishers"": [""TokyoPop""], ""identifiers"": {""librarything"": [""26261""], ""goodreads"": [""1338941""]}, ""weight"": ""7 ounces"", ""covers"": [941175], ""physical_format"": ""Paperback"", ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2011-04-30T05:42:41.578018""}, ""latest_revision"": 7, ""key"": ""/books/OL8718879M"", ""authors"": [{""key"": ""/authors/OL1398994A""}], ""subjects"": [""Graphic novels"", ""Movie Tie - In"", ""Comics & Graphic Novels - General"", ""Cartoons and comics"", ""Juvenile Fiction"", ""Children's 12-Up - Fiction - Fantasy"", ""Magic"", ""Children: Grades 4-6"", ""Comics & Graphic Novels - Manga"", ""Action & Adventure"", ""Science Fiction, Fantasy, & Magic"", ""Fiction""], ""edition_name"": ""Chix Comix Pocket Ed edition"", ""languages"": [{""key"": ""/languages/eng""}], ""title"": ""Cardcaptor Sakura #2"", ""number_of_pages"": 185, ""created"": {""type"": ""/type/datetime"", ""value"": ""2008-04-30T08:14:56.482104""}, ""isbn_13"": [""9781892213501""], ""isbn_10"": [""1892213508""], ""publish_date"": ""December 31, 2000"", ""oclc_numbers"": [""45673814""], ""works"": [{""key"": ""/works/OL5750742W""}, {""key"": ""/works/OL5750904W""}], ""type"": {""key"": ""/type/edition""}, ""physical_dimensions"": ""6.9 x 4.3 x 0.5 inches"", ""revision"": 7}","{""key"":""/works/OL5750742W""}","{""created"": {""type"": ""/type/datetime"", ""value"": ""2009-12-10T11:13:47.184190""}, ""title"": ""Cardcaptor Sakura #2"", ""covers"": [941175], ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2010-04-28T09:10:02.307057""}, ""latest_revision"": 2, ""key"": ""/works/OL5750742W"", ""authors"": [{""type"": ""/type/author_role"", ""author"": {""key"": ""/authors/OL1398994A""}}], ""type"": {""key"": ""/type/work""}, ""revision"": 2}"
/books/OL8718879M,2011-04-30T05:42:41.578018,"{""publishers"": [""TokyoPop""], ""identifiers"": {""librarything"": [""26261""], ""goodreads"": [""1338941""]}, ""weight"": ""7 ounces"", ""covers"": [941175], ""physical_format"": ""Paperback"", ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2011-04-30T05:42:41.578018""}, ""latest_revision"": 7, ""key"": ""/books/OL8718879M"", ""authors"": [{""key"": ""/authors/OL1398994A""}], ""subjects"": [""Graphic novels"", ""Movie Tie - In"", ""Comics & Graphic Novels - General"", ""Cartoons and comics"", ""Juvenile Fiction"", ""Children's 12-Up - Fiction - Fantasy"", ""Magic"", ""Children: Grades 4-6"", ""Comics & Graphic Novels - Manga"", ""Action & Adventure"", ""Science Fiction, Fantasy, & Magic"", ""Fiction""], ""edition_name"": ""Chix Comix Pocket Ed edition"", ""languages"": [{""key"": ""/languages/eng""}], ""title"": ""Cardcaptor Sakura #2"", ""number_of_pages"": 185, ""created"": {""type"": ""/type/datetime"", ""value"": ""2008-04-30T08:14:56.482104""}, ""isbn_13"": [""9781892213501""], ""isbn_10"": [""1892213508""], ""publish_date"": ""December 31, 2000"", ""oclc_numbers"": [""45673814""], ""works"": [{""key"": ""/works/OL5750742W""}, {""key"": ""/works/OL5750904W""}], ""type"": {""key"": ""/type/edition""}, ""physical_dimensions"": ""6.9 x 4.3 x 0.5 inches"", ""revision"": 7}","{""key"":""/works/OL5750904W""}","{""created"": {""type"": ""/type/datetime"", ""value"": ""2009-12-14T19:33:16.370837""}, ""title"": ""Cardcaptor Sakura #2"", ""covers"": [941175], ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2010-04-28T09:10:02.307057""}, ""latest_revision"": 2, ""key"": ""/works/OL5750904W"", ""authors"": [{""type"": ""/type/author_role"", ""author"": {""key"": ""/authors/OL1398994A""}}], ""type"": {""key"": ""/type/work""}, ""revision"": 2}"


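Each of these editions links to two works that share the same title and author, and one of the linked work keys (`/works/OL14850400W`) has no matching work record at all. How best to clean all of this up remains an open question.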
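One possible starting point, offered only as a rough sketch rather than a settled approach, is to list the works that hang off a single edition and also agree on title and author set, since those are likely duplicates. The example assumes the `editions` and `works` tables each have `key` and `json` columns as shown earlier. The function name `duplicate_work_candidates`, the helper `author_keys` and the in-memory sample rows are hypothetical, made up for illustration.

```python
import json
import sqlite3
from collections import defaultdict


def author_keys(work):
    """Return the set of author keys referenced by a work record."""
    keys = set()
    for role in work.get("authors", []):
        author = role.get("author") if isinstance(role, dict) else None
        if isinstance(author, dict) and "key" in author:
            keys.add(author["key"])
    return frozenset(keys)


def duplicate_work_candidates(conn):
    """Map (edition key, title, authors) to the work keys that share them."""
    groups = defaultdict(set)
    for edition_key, raw in conn.execute("SELECT key, json FROM editions"):
        works = json.loads(raw).get("works", [])
        work_keys = [w["key"] for w in works if isinstance(w, dict) and "key" in w]
        if len(work_keys) < 2:
            continue  # only editions linked to several works are of interest
        for work_key in work_keys:
            row = conn.execute(
                "SELECT json FROM works WHERE key = ?", (work_key,)
            ).fetchone()
            if row is None:
                continue  # dangling reference: the work record is missing
            work = json.loads(row[0])
            group = (edition_key, work.get("title"), author_keys(work))
            groups[group].add(work_key)
    # Keep only groups where at least two works agree on title and authors
    return {g: sorted(keys) for g, keys in groups.items() if len(keys) > 1}


# Small in-memory demonstration using trimmed copies of the rows above
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE editions (key TEXT, json TEXT)")
conn.execute("CREATE TABLE works (key TEXT, json TEXT)")

sakura_author = [{"type": "/type/author_role",
                  "author": {"key": "/authors/OL1398994A"}}]
for work_key in ("/works/OL5750742W", "/works/OL5750904W"):
    conn.execute("INSERT INTO works VALUES (?, ?)", (work_key, json.dumps(
        {"key": work_key, "title": "Cardcaptor Sakura #2",
         "authors": sakura_author})))
conn.execute("INSERT INTO works VALUES (?, ?)", ("/works/OL14850399W", json.dumps(
    {"key": "/works/OL14850399W", "title": "The iceworker sings and other poems",
     "authors": [{"type": {"key": "/type/author_role"},
                  "author": {"key": "/authors/OL19612A"}}]})))

conn.execute("INSERT INTO editions VALUES (?, ?)", ("/books/OL8718879M", json.dumps(
    {"works": [{"key": "/works/OL5750742W"}, {"key": "/works/OL5750904W"}]})))
conn.execute("INSERT INTO editions VALUES (?, ?)", ("/books/OL31541M", json.dumps(
    {"works": [{"key": "/works/OL14850399W"}, {"key": "/works/OL14850400W"}]})))

candidates = duplicate_work_candidates(conn)
for (edition_key, title, _authors), work_keys in candidates.items():
    print(edition_key, title, work_keys)
```

On the full database this scans every edition and parses its JSON in Python, so it would be slow. It is meant to show the matching rule, not to be run as-is over 35 million rows. Whether matching on title and authors is strict enough to justify merging works is a separate judgment call.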