This notebook audits changes in cluster gender annotations across versions of the data, so we can see how much the annotations shift over time.  It depends on the aligned cluster identities in `isbn-version-clusters.parquet`.


In [1]:
from pathlib import Path
from functools import reduce

In [2]:
import pandas as pd
import polars as pl
import numpy as np
import matplotlib.pyplot as plt

## Load Data

Define the versions we care about:


In [3]:
versions = ['pgsql', '2022-03-2.0', '2022-07', '2022-10', '2022-11-2.1', 'current']

Load the aligned ISBNs:


In [4]:
isbn_clusters = pd.read_parquet('isbn-version-clusters.parquet')
isbn_clusters.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43027360 entries, 0 to 43027359
Data columns (total 8 columns):
 #   Column       Dtype  
---  ------       -----  
 0   isbn         object 
 1   isbn_id      int32  
 2   current      float64
 3   2022-11-2.1  float64
 4   2022-10      float64
 5   2022-07      float64
 6   2022-03-2.0  float64
 7   pgsql        float64
dtypes: float64(6), int32(1), object(1)
memory usage: 2.4+ GB


## Different Genders

How many clusters changed gender?

To get started, we need a list of genders in order.


In [5]:
genders = [
    'ambiguous', 'female', 'male', 'unknown',
    'no-author-rec', 'no-book-author', 'no-book', 'absent'
]

Let's make a function to read gender info:


In [6]:
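def read_gender(path, map_file=None):
    # Lazily scan the per-cluster gender file for one version
    cg = pl.scan_parquet(path)
    cg = cg.select([
        pl.col('cluster').cast(pl.Int32),
        # Harmonize older label names; pl.lit keeps the strings as literals
        # (newer Polars versions read a bare string in then() as a column name)
        pl.when(pl.col('gender') == 'no-loc-author')
            .then(pl.lit('no-book-author'))
            .when(pl.col('gender') == 'no-viaf-author')
            .then(pl.lit('no-author-rec'))
            .otherwise(pl.col('gender'))
            .cast(pl.Categorical)
            .alias('gender')
    ])
    if map_file is not None:
        # Translate version-specific cluster IDs to the common cluster IDs
        cluster_map = pl.scan_parquet(map_file)
        cg = cg.join(cluster_map, on='cluster', how='left')
        cg = cg.select([
            pl.col('common').alias('cluster'),
            pl.col('gender')
        ])
    return cg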
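
To illustrate the label harmonization that `read_gender` performs, here is a minimal sketch on a toy Pandas series. The sample values and the `recode` name are made up for illustration, and the notebook itself does this step in Polars.

```python
import pandas as pd

# Older label names mapped to the current vocabulary (same pairs as read_gender).
recode = {
    'no-loc-author': 'no-book-author',
    'no-viaf-author': 'no-author-rec',
}

# Hypothetical labels as they might appear in an older version's file.
old_labels = pd.Series(['female', 'no-loc-author', 'no-viaf-author', 'male'])

# Labels not listed in the mapping pass through unchanged.
recoded = old_labels.replace(recode)
print(recoded.tolist())
```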

Read each data source's gender info and map to common cluster IDs:


In [7]:
gender_cc = {
    v: read_gender(f'{v}/cluster-genders.parquet', f'{v}/cluster-map.parquet')
    for v in versions if v != 'current'
}
gender_cc['current'] = read_gender('../book-links/cluster-genders.parquet')

Set up a sequence of frames for merging:


In [8]:
to_merge = [
    gender_cc[v].select([
        pl.col('cluster'),
        pl.col('gender').alias(v)
    ]).unique()
    for v in versions
]

Merge and collect results:


In [9]:
cluster_genders = reduce(lambda df1, df2: df1.join(df2, on='cluster', how='outer'), to_merge)
cluster_genders = cluster_genders.collect()
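
The reduce-over-outer-joins pattern can be seen on a small scale with the sketch below. It uses Pandas rather than Polars, and the frames, cluster IDs, and column names `v1` to `v3` are hypothetical.

```python
from functools import reduce
import pandas as pd

# Hypothetical per-version frames; each covers a different set of clusters.
frames = [
    pd.DataFrame({'cluster': [1, 2], 'v1': ['female', 'male']}),
    pd.DataFrame({'cluster': [2, 3], 'v2': ['male', 'unknown']}),
    pd.DataFrame({'cluster': [1, 3], 'v3': ['female', 'male']}),
]

# Fold the list into one wide frame; the outer join keeps every cluster
# that appears in any version and leaves gaps where a version lacks it.
merged = reduce(lambda a, b: a.merge(b, on='cluster', how='outer'), frames)
merged = merged.sort_values('cluster').reset_index(drop=True)
print(merged)
```

Each cluster ends up with one row and one column per version, with missing values where a version does not contain that cluster.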

For unclear reasons, a few versions have a null cluster; drop those rows.


In [10]:
cluster_genders = cluster_genders.filter(cluster_genders['cluster'].is_not_null())

Now we will convert to Pandas, indexed by cluster:


In [11]:
cluster_genders = cluster_genders.to_pandas().set_index('cluster')

Now we'll unify the categories and their orders, and mark missing values as `absent`:


In [12]:
cluster_genders = cluster_genders.apply(lambda vdf: vdf.cat.set_categories(genders, ordered=True))
cluster_genders.fillna('absent', inplace=True)
cluster_genders.head()

cluster,pgsql,2022-03-2.0,2022-07,2022-10,2022-11-2.1,current
423896385,absent,absent,absent,absent,absent,no-book-author
424930878,absent,absent,absent,absent,absent,no-book-author
452155044,absent,absent,absent,absent,absent,no-book-author
452348564,absent,absent,absent,absent,absent,no-book-author
439895815,absent,absent,absent,absent,absent,no-book-author
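
As a small illustration of the category unification and fill step, the sketch below runs the same two operations on a toy frame. The columns `v1` and `v2` and their values are hypothetical.

```python
import pandas as pd

genders = [
    'ambiguous', 'female', 'male', 'unknown',
    'no-author-rec', 'no-book-author', 'no-book', 'absent'
]

# Toy frame: two versions whose categoricals start with different category
# sets, plus one cluster that is missing from the first version.
df = pd.DataFrame({
    'v1': pd.Categorical(['female', None, 'male']),
    'v2': pd.Categorical(['female', 'unknown', 'no-book']),
})

# Give every column the same ordered category list, then label gaps 'absent'.
# The fill works on a categorical because 'absent' is one of the categories.
df = df.apply(lambda col: col.cat.set_categories(genders, ordered=True))
df = df.fillna('absent')
print(df)
```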


Let's save this file for further analysis:


In [13]:
cluster_genders.to_parquet('cluster-version-genders.parquet', compression='zstd')

## PostgreSQL to Current

Now we are ready to actually compare cluster genders across versions. Let's start by comparing the original data (PostgreSQL) to current:


In [14]:
ct = cluster_genders[['pgsql', 'current']].value_counts().unstack()
ct = ct.reindex(labels=genders, columns=genders)
ct

pgsql \ current,ambiguous,female,male,unknown,no-author-rec,no-book-author,no-book,absent
ambiguous,98394.0,4328.0,11291.0,1859.0,996.0,2773.0,4.0,4563.0
female,16335.0,1120257.0,979.0,12939.0,9441.0,19.0,30.0,34111.0
male,28531.0,2687.0,3493237.0,18112.0,31096.0,689.0,152.0,70712.0
unknown,3012.0,102950.0,215358.0,1545733.0,19021.0,15.0,12.0,14274.0
no-author-rec,10534.0,58474.0,330309.0,226925.0,1395180.0,437.0,125.0,13710.0
no-book-author,8347.0,114861.0,219228.0,125946.0,211471.0,2457223.0,903525.0,273472.0
no-book,,,,,,,,
absent,121082.0,1022705.0,2439398.0,1026100.0,4539072.0,17046824.0,734646.0,65882.0


In [15]:
ctf = ct.divide(ct.sum(axis='columns'), axis='rows')
def style_row(row):
    styles = []
    for col, val in zip(row.index, row.values):
        if col == row.name:
            styles.append('font-weight: bold')
        elif val > 0.1:
            styles.append('color: red')
        else:
            styles.append(None)
    return styles
ctf.style.apply(style_row, 'columns')

pgsql \ current,ambiguous,female,male,unknown,no-author-rec,no-book-author,no-book,absent
ambiguous,0.792171,0.034845,0.090904,0.014967,0.008019,0.022325,3.2e-05,0.036737
female,0.01368,0.938151,0.00082,0.010836,0.007906,1.6e-05,2.5e-05,0.028566
male,0.007827,0.000737,0.958307,0.004969,0.008531,0.000189,4.2e-05,0.019399
unknown,0.001585,0.054174,0.113324,0.813383,0.010009,8e-06,6e-06,0.007511
no-author-rec,0.005175,0.028724,0.162259,0.111473,0.685358,0.000215,6.1e-05,0.006735
no-book-author,0.001935,0.026625,0.050817,0.029194,0.049019,0.569583,0.209437,0.063391
no-book,,,,,,,,
absent,0.004485,0.037884,0.090362,0.03801,0.16814,0.631464,0.027213,0.00244


Most of the change is coming from clusters absent in the original but present in the new.
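
As a rough check of that reading, the off-diagonal counts in the PostgreSQL-to-current count table can be summed by origin row. The numbers below are copied from that table (the empty `no-book` row is omitted); the variable names are only illustrative.

```python
columns = ['ambiguous', 'female', 'male', 'unknown',
           'no-author-rec', 'no-book-author', 'no-book', 'absent']

# Rows are pgsql labels, values are counts per current label (in `columns` order).
counts = {
    'ambiguous':      [98394, 4328, 11291, 1859, 996, 2773, 4, 4563],
    'female':         [16335, 1120257, 979, 12939, 9441, 19, 30, 34111],
    'male':           [28531, 2687, 3493237, 18112, 31096, 689, 152, 70712],
    'unknown':        [3012, 102950, 215358, 1545733, 19021, 15, 12, 14274],
    'no-author-rec':  [10534, 58474, 330309, 226925, 1395180, 437, 125, 13710],
    'no-book-author': [8347, 114861, 219228, 125946, 211471, 2457223, 903525, 273472],
    'absent':         [121082, 1022705, 2439398, 1026100, 4539072, 17046824, 734646, 65882],
}

# Clusters whose label changed: everything off the diagonal of each row.
changed = {
    row: sum(n for col, n in zip(columns, vals) if col != row)
    for row, vals in counts.items()
}
total_changed = sum(changed.values())
absent_share = changed['absent'] / total_changed
print(f'share of changes that start from absent: {absent_share:.1%}')
```

On these counts, the `absent` row accounts for roughly 90% of all label changes.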

There are also quite a few that had no book author in PGSQL, but no book in the current data - not sure what's up with that.  Let's look at more crosstabs.


In [16]:
def gender_crosstab(old, new, fractional=True):
    ct = cluster_genders[[old, new]].value_counts().unstack()
    ct = ct.reindex(labels=genders, columns=genders)

    if fractional:
        ctf = ct.divide(ct.sum(axis='columns'), axis='rows')
        return ctf
    else:
        return ct
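
To make the row normalization concrete, here is a toy run of the same logic as `gender_crosstab`. The frame `toy` and the shortened label list are hypothetical.

```python
import pandas as pd

labels = ['female', 'male', 'absent']  # shortened label list for illustration

# Hypothetical annotations for six clusters in an old and a new version.
toy = pd.DataFrame({
    'old': ['female', 'female', 'male', 'male', 'absent', 'absent'],
    'new': ['female', 'male', 'male', 'male', 'female', 'absent'],
})

# Count (old, new) pairs, pivot new labels into columns, fix the label order.
counts = toy[['old', 'new']].value_counts().unstack()
counts = counts.reindex(index=labels, columns=labels)

# Divide each row by its total so a row shows where that old label went.
fractions = counts.divide(counts.sum(axis='columns'), axis='index')
print(fractions)
```

Each row sums to 1 (ignoring empty cells), so the diagonal gives the fraction of clusters that kept their label and the off-diagonal cells give the fraction that moved to each other label.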

## PostgreSQL to March 2022 (2.0 release)

This marks the change from PostgreSQL to pure-Rust.


In [17]:
ct = gender_crosstab('pgsql', '2022-03-2.0')
ct.style.apply(style_row, 'columns')

pgsql \ 2022-03-2.0,ambiguous,female,male,unknown,no-author-rec,no-book-author,no-book,absent
ambiguous,0.9779,0.002963,0.013928,0.000636,0.000878,,,0.003695
female,0.002192,0.993952,1e-06,0.000301,0.00043,3e-06,5e-06,0.003115
male,0.000591,0.0,0.995942,0.000528,0.000796,2e-06,1.4e-05,0.002127
unknown,4.3e-05,0.00276,0.005302,0.9889,0.001953,1e-06,3e-06,0.001038
no-author-rec,0.000104,0.007918,0.049969,0.031482,0.908597,,8e-06,0.001923
no-book-author,3e-06,4.9e-05,0.000197,0.000107,5.2e-05,0.649173,0.335017,0.015402
no-book,,,,,,,,
absent,6e-06,4.9e-05,0.000254,0.00012,0.000178,0.002007,7e-05,0.997316


This is where a bunch of books change from no-book-author to no-book; otherwise things are pretty consistent. This major change is likely a result of changes that count more books and book clusters: we had some questionable inner joins in the PostgreSQL version, and in particular we didn't really cluster solo ISBNs.  Now, a solo ISBN from rating data gets a cluster with no book record instead of being excluded from the clustering.

## March to July 2022

We updated a lot of data files and changed the name and ISBN parsing logic.


In [18]:
ct = gender_crosstab('2022-03-2.0', '2022-07')
ct.style.apply(style_row, 'columns')

2022-03-2.0 \ 2022-07,ambiguous,female,male,unknown,no-author-rec,no-book-author,no-book,absent
ambiguous,0.836252,0.035594,0.083594,0.014814,0.004862,8.7e-05,1.6e-05,0.024782
female,0.01006,0.963185,0.000488,0.007911,0.001254,2e-06,1.2e-05,0.017088
male,0.006704,0.000646,0.974652,0.003702,0.001364,7.9e-05,1.4e-05,0.012839
unknown,0.0019,0.040311,0.092955,0.856043,0.003413,0.0002,,0.005178
no-author-rec,0.003538,0.020637,0.108685,0.101636,0.762109,9e-06,3.7e-05,0.003349
no-book-author,0.00206,0.030435,0.057311,0.035309,0.056239,0.809983,7e-06,0.008655
no-book,0.000157,0.002341,0.005291,0.00274,0.004762,0.000635,0.980666,0.003408
absent,0.002246,0.020151,0.043122,0.015901,0.062986,0.003484,1e-06,0.852108


Mostly fine; some more are resolved, existing resolutions are pretty consistent.

## July 2022 to Oct. 2022

We changed from DataFusion to Polars and made further ISBN and name parsing changes.


In [19]:
ct = gender_crosstab('2022-07', '2022-10')
ct.style.apply(style_row, 'columns')

2022-07 \ 2022-10,ambiguous,female,male,unknown,no-author-rec,no-book-author,no-book,absent
ambiguous,0.989403,0.004969,0.003626,0.000336,0.001647,,,1.8e-05
female,,0.995089,,0.000361,0.004471,,,7.8e-05
male,1e-06,,0.994581,0.000431,0.004975,0.0,,1.1e-05
unknown,,,,0.995469,0.004492,,,3.9e-05
no-author-rec,,1e-06,3e-06,5e-06,0.999824,0.000131,,3.7e-05
no-book-author,,0.0,1e-06,0.0,0.0,0.996616,,0.003382
no-book,1e-06,0.000101,3.1e-05,6.5e-05,8.8e-05,0.198059,0.670216,0.131438
absent,,,,,,,,1.0


## Oct. 2022 to Current

We added support for GoodReads CSV data and the Amazon 2018 rating CSV files.


In [20]:
ct = gender_crosstab('2022-10', 'current')
ct.style.apply(style_row, 'columns')

2022-10 \ current,ambiguous,female,male,unknown,no-author-rec,no-book-author,no-book,absent
ambiguous,0.879473,0.006926,0.018943,0.001573,0.000239,0.074743,9e-06,0.018094
female,0.004885,0.975546,0.000307,0.004275,0.000396,0.000482,2e-06,0.014107
male,0.00225,4.4e-05,0.986202,0.001428,0.000497,0.001463,5e-06,0.008112
unknown,0.000267,0.019107,0.032235,0.945155,0.001086,0.000425,0.0,0.001724
no-author-rec,0.000277,0.002895,0.007173,0.00978,0.975843,0.00036,4e-06,0.003668
no-book-author,0.000408,0.006234,0.011057,0.004072,0.006725,0.967622,1e-06,0.003881
no-book,0.000337,0.004516,0.010218,0.006248,0.019862,0.002849,0.950917,0.005053
absent,0.003004,0.020526,0.054834,0.026329,0.124175,0.723785,0.031457,0.015889
