# Book Data Linkage Statistics

This notebook presents linkage statistics for the book data integration.

## Setup

In [1]:
import pandas as pd
import seaborn as sns

In [2]:
from support import db_url

## ISBN Statistics

How many ISBNs do we have in the database?

In [3]:
pd.read_sql('''
SELECT COUNT(*) FROM isbn_id
''', db_url())

      count
0  24132173


How many ISBN clusters are there overall, and how many in each source?

In [10]:
def count_clusters(schema='public'):
    res = pd.read_sql(f'''
        SELECT COUNT(DISTINCT cluster) FROM {schema}.isbn_cluster
    ''', db_url())
    return res.iloc[0, 0]

In [12]:
pd.Series({
    'All': count_clusters(),
    'LOC-MDS': count_clusters('locmds'),
    'OL': count_clusters('ol'),
    'GR': count_clusters('gr'),
    'LOC-ID': count_clusters('locid')
})

All        10735373
LOC-MDS     5185860
OL          9772540
GR          1109082
LOC-ID      6545327
dtype: int64
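To put these counts in proportion, the following sketch divides each source's cluster count by the overall total. The numbers are copied from the output above; the names `cluster_counts` and `cluster_shares` are illustrative and are not part of the notebook or its `support` module.

```python
# Cluster counts copied from the output above.
cluster_counts = {
    'All': 10735373,
    'LOC-MDS': 5185860,
    'OL': 9772540,
    'GR': 1109082,
    'LOC-ID': 6545327,
}


def cluster_shares(counts, total_key='All'):
    """Return each source's cluster count as a fraction of the total."""
    total = counts[total_key]
    return {k: v / total for k, v in counts.items() if k != total_key}


shares = cluster_shares(cluster_counts)
for source, share in shares.items():
    # Print one line per source, formatted as a percentage.
    print(f'{source:8s} {share:.1%}')
```

Whether clusters in the per-source schemas correspond one-to-one to clusters in `public` is not stated here, so these ratios are best read as rough relative sizes.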

## LOC MDS Linking

How many LOC MDS book records are there, and how many of them resolve to at least one ISBN?

In [9]:
pd.read_sql('''
WITH rec_lc AS (
    SELECT rec_id, COUNT(isbn_id) AS isbc
    FROM locmds.book_rec_isbn GROUP BY rec_id
)
SELECT COUNT(rec_id), COUNT(isbc)
FROM locmds.book LEFT OUTER JOIN rec_lc USING (rec_id)
''', db_url())

     count  count.1
0  9244676  5246932
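The second count is smaller because `COUNT(isbc)` skips the NULLs that the left join produces for records with no ISBN. Here is a minimal sketch of that behavior, using an in-memory SQLite database with made-up toy tables `book` and `book_rec_isbn`; the real tables live in the `locmds` schema, and the data below is invented purely for illustration.

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.executescript('''
    CREATE TABLE book (rec_id INTEGER PRIMARY KEY);
    CREATE TABLE book_rec_isbn (rec_id INTEGER, isbn_id INTEGER);
    -- Toy data: records 1 and 2 have ISBNs, record 3 has none.
    INSERT INTO book VALUES (1), (2), (3);
    INSERT INTO book_rec_isbn VALUES (1, 10), (1, 11), (2, 12);
''')

# Same query shape as the notebook cell, minus the schema prefix.
n_records, n_with_isbn = con.execute('''
    WITH rec_lc AS (
        SELECT rec_id, COUNT(isbn_id) AS isbc
        FROM book_rec_isbn GROUP BY rec_id
    )
    SELECT COUNT(rec_id), COUNT(isbc)
    FROM book LEFT OUTER JOIN rec_lc USING (rec_id)
''').fetchone()

# First value counts all records, second only those with an ISBN.
print(n_records, n_with_isbn)
```

Applied to the output above, 5246932 of 9244676 records (about 57%) have at least one ISBN.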


Let's count LOC books by author gender resolution status, with unresolved books reported as `<missing>`.

In [16]:
pd.read_sql('''
SELECT COALESCE(gender, '<missing>') AS gender, COUNT(*) AS nbooks
FROM locmds.book_rec_isbn JOIN isbn_cluster USING (isbn_id)
LEFT OUTER JOIN cluster_first_author_gender USING (cluster)
GROUP BY COALESCE(gender, '<missing>')
''', db_url())

ProgrammingError: (psycopg2.ProgrammingError) relation "cluster_first_author_gender" does not exist
LINE 4: LEFT OUTER JOIN cluster_first_author_gender USING (cluster)
                        ^

[SQL: 
SELECT COALESCE(gender, '<missing>') AS gender, COUNT(*) AS nbooks
FROM locmds.book_rec_isbn JOIN isbn_cluster USING (isbn_id)
LEFT OUTER JOIN cluster_first_author_gender USING (cluster)
GROUP BY COALESCE(gender, '<missing>')
]
(Background on this error at: http://sqlalche.me/e/f405)
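The query fails because the relation `cluster_first_author_gender` does not exist in the database the notebook connected to. To show the shape of result the query is meant to produce, here is a minimal sketch that runs the same SQL against an in-memory SQLite database. The three toy tables, their columns, and the gender labels are invented for illustration and say nothing about the real schema or data.

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.executescript('''
    CREATE TABLE book_rec_isbn (rec_id INTEGER, isbn_id INTEGER);
    CREATE TABLE isbn_cluster (isbn_id INTEGER, cluster INTEGER);
    CREATE TABLE cluster_first_author_gender (cluster INTEGER, gender TEXT);
    INSERT INTO book_rec_isbn VALUES (1, 10), (2, 11), (3, 12), (4, 13);
    INSERT INTO isbn_cluster VALUES (10, 100), (11, 101), (12, 102), (13, 103);
    -- Clusters 102 and 103 have no gender row, so they count as '<missing>'.
    INSERT INTO cluster_first_author_gender VALUES (100, 'female'), (101, 'male');
''')

rows = con.execute('''
    SELECT COALESCE(gender, '<missing>') AS gender, COUNT(*) AS nbooks
    FROM book_rec_isbn JOIN isbn_cluster USING (isbn_id)
    LEFT OUTER JOIN cluster_first_author_gender USING (cluster)
    GROUP BY COALESCE(gender, '<missing>')
''').fetchall()

# Map each gender label to its book count.
gender_counts = dict(rows)
print(gender_counts)
```

Because the final join is a left outer join, books whose cluster has no gender row are kept and grouped under `<missing>` rather than dropped.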