# Cluster Gender Changes

This notebook audits changes in cluster gender annotations across data versions, so we can see how large the shifts are over time. It depends on the aligned cluster identities in `isbn-version-clusters.parquet`.

In [1]:
from pathlib import Path
from functools import reduce

In [2]:
import pandas as pd
import polars as pl
import numpy as np
import matplotlib.pyplot as plt

## Load Data

Define the versions we care about:

In [3]:
versions = ['pgsql', '2022-03-2.0', '2022-07', '2022-10', '2022-11-2.1', 'current']

Load the aligned ISBNs:

In [4]:
isbn_clusters = pd.read_parquet('isbn-version-clusters.parquet')
isbn_clusters.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33538726 entries, 0 to 33538725
Data columns (total 8 columns):
 #   Column       Dtype  
---  ------       -----  
 0   isbn         object 
 1   isbn_id      int32  
 2   current      float64
 3   2022-11-2.1  float64
 4   2022-10      float64
 5   2022-07      float64
 6   2022-03-2.0  float64
 7   pgsql        float64
dtypes: float64(6), int32(1), object(1)
memory usage: 1.9+ GB


## Different Genders

How many clusters changed gender?

To get started, we need a list of genders in order.

In [5]:
genders = [
    'ambiguous', 'female', 'male', 'unknown',
    'no-author-rec', 'no-book-author', 'no-book', 'absent'
]

Let's make a function to read gender info:

In [6]:
def read_gender(path, map_file=None):
    # Lazily scan the per-version gender file
    cg = pl.scan_parquet(path)
    cg = cg.select([
        pl.col('cluster').cast(pl.Int32),
        # Normalize older label names to the current vocabulary;
        # pl.lit keeps the strings from being read as column names
        pl.when(pl.col('gender') == 'no-loc-author')
            .then(pl.lit('no-book-author'))
            .when(pl.col('gender') == 'no-viaf-author')
            .then(pl.lit('no-author-rec'))
            .otherwise(pl.col('gender'))
            .cast(pl.Categorical)
            .alias('gender')
    ])
    if map_file is not None:
        # Translate version-specific cluster IDs to the common IDs
        cluster_map = pl.scan_parquet(map_file)
        cg = cg.join(cluster_map, on='cluster', how='left')
        cg = cg.select([
            pl.col('common').alias('cluster'),
            pl.col('gender')
        ])
    return cg
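
The label renaming inside `read_gender` can be checked on a toy series. This is a minimal pandas sketch, not the Polars code used above; the `LEGACY_LABELS` name and the toy values are made up for illustration.

```python
import pandas as pd

# Legacy label -> current label, mirroring the two renames in read_gender.
# LEGACY_LABELS is a hypothetical name used only in this sketch.
LEGACY_LABELS = {
    'no-loc-author': 'no-book-author',
    'no-viaf-author': 'no-author-rec',
}

# Toy data (hypothetical), not taken from any real version.
toy = pd.Series(['male', 'no-loc-author', 'no-viaf-author', 'female'])

# Labels not in the mapping pass through unchanged.
normalized = toy.replace(LEGACY_LABELS)
print(normalized.tolist())
```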

Read each data source's gender info and map to common cluster IDs:

In [7]:
gender_cc = {
    v: read_gender(f'{v}/cluster-genders.parquet', f'cluster-map-{v}.parquet')
    for v in versions if v != 'current'
}
gender_cc['current'] = read_gender('../book-links/cluster-genders.parquet')

Set up a sequence of frames for merging:

In [8]:
to_merge = [
    gender_cc[v].select([
        pl.col('cluster'),
        pl.col('gender').alias(v)
    ]).unique(maintain_order=False)
    for v in versions
]


Merge and collect results:

In [9]:
cluster_genders = reduce(lambda df1, df2: df1.join(df2, on='cluster', how='outer'), to_merge)
cluster_genders = cluster_genders.collect()
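
To see what the `reduce` over outer joins produces, here is a small sketch on toy frames. It uses pandas `merge` rather than Polars lazy joins, and the frames and the `v1`..`v3` column names are hypothetical.

```python
from functools import reduce

import pandas as pd

# Hypothetical per-version frames: one label column each, keyed by cluster.
frames = [
    pd.DataFrame({'cluster': [1, 2], 'v1': ['male', 'female']}),
    pd.DataFrame({'cluster': [2, 3], 'v2': ['female', 'unknown']}),
    pd.DataFrame({'cluster': [1, 3], 'v3': ['male', 'male']}),
]

# Fold the list into one wide frame. The outer join keeps clusters
# that appear in only some versions and leaves NaN in the others.
wide = reduce(lambda a, b: a.merge(b, on='cluster', how='outer'), frames)
wide = wide.sort_values('cluster').set_index('cluster')
print(wide)
```

Each cluster ends up with one row and one column per version, with gaps where a version did not include that cluster.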

For unclear reasons, a few versions have a row with a null cluster. Drop those rows:

In [10]:
cluster_genders = cluster_genders.filter(cluster_genders['cluster'].is_not_null())

Now we will convert to Pandas and index by cluster:

In [11]:
cluster_genders = cluster_genders.to_pandas().set_index('cluster')

Now we'll unify the categories and their order, and mark missing values as `absent`:

In [12]:
cluster_genders = cluster_genders.apply(lambda vdf: vdf.cat.set_categories(genders, ordered=True))
cluster_genders.fillna('absent', inplace=True)
cluster_genders.head()

cluster,pgsql,2022-03-2.0,2022-07,2022-10,2022-11-2.1,current
302589224,absent,absent,absent,absent,absent,no-book-author
101205744,male,male,male,male,male,male
117642016,absent,absent,no-author-rec,no-author-rec,no-author-rec,no-author-rec
407943760,no-book-author,no-book-author,no-book-author,no-book-author,no-book-author,no-book-author
100620168,no-book-author,no-book-author,no-book-author,no-book-author,no-book-author,no-book-author
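
The category unification can be tried on a single toy column. This is a minimal sketch under the assumption that each version column arrives as a categorical with its own, narrower set of categories; the toy values are hypothetical.

```python
import pandas as pd

genders = [
    'ambiguous', 'female', 'male', 'unknown',
    'no-author-rec', 'no-book-author', 'no-book', 'absent'
]

# Toy categorical column (hypothetical) with a missing value;
# its categories are only the labels it happens to contain.
col = pd.Series(['male', None, 'female'], dtype='category')

# Give the column the shared ordered category list, then fill the gap.
col = col.cat.set_categories(genders, ordered=True)
col = col.fillna('absent')
print(col.tolist())
```

The order of the two steps matters here: filling with `'absent'` first would fail on this toy column, because `'absent'` is not yet one of its categories.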


Let's save this file for further analysis:

In [13]:
cluster_genders.to_parquet('cluster-version-genders.parquet', compression='zstd')

## PostgreSQL to Current

Now we are ready to actually compare cluster genders across versions. Let's start by comparing the original data (PostgreSQL) to current:

In [14]:
ct = cluster_genders[['pgsql', 'current']].value_counts().unstack()
ct = ct.reindex(index=genders, columns=genders)
ct

pgsql \ current,ambiguous,female,male,unknown,no-author-rec,no-book-author,no-book,absent
ambiguous,102786.0,4575.0,10492.0,1788.0,980.0,11.0,4.0,3572.0
female,13035.0,1136097.0,828.0,10107.0,9856.0,6.0,30.0,24152.0
male,23550.0,2868.0,3515645.0,16026.0,31682.0,307.0,138.0,55000.0
unknown,1994.0,80654.0,168137.0,1617551.0,19640.0,391.0,12.0,11996.0
no-author-rec,9693.0,55447.0,319211.0,226357.0,1413416.0,319.0,121.0,11130.0
no-book-author,6285.0,93743.0,181627.0,111842.0,177687.0,2533809.0,950180.0,258900.0
no-book,,,,,,,,
absent,60682.0,542445.0,1162935.0,426157.0,1694301.0,14981046.0,790979.0,1752.0


In [15]:
ctf = ct.divide(ct.sum(axis='columns'), axis='rows')
def style_row(row):
    styles = []
    for col, val in zip(row.index, row.values):
        if col == row.name:
            styles.append('font-weight: bold')
        elif val > 0.1:
            styles.append('color: red')
        else:
            styles.append(None)
    return styles
ctf.style.apply(style_row, 'columns')

pgsql \ current,ambiguous,female,male,unknown,no-author-rec,no-book-author,no-book,absent
ambiguous,0.827531,0.036833,0.084471,0.014395,0.00789,8.9e-05,3.2e-05,0.028758
female,0.010916,0.951417,0.000693,0.008464,0.008254,5e-06,2.5e-05,0.020226
male,0.006461,0.000787,0.964455,0.004396,0.008691,8.4e-05,3.8e-05,0.015088
unknown,0.001049,0.042441,0.088476,0.851175,0.010335,0.000206,6e-06,0.006312
no-author-rec,0.004762,0.027237,0.156807,0.111194,0.694317,0.000157,5.9e-05,0.005467
no-book-author,0.001457,0.02173,0.042101,0.025925,0.041188,0.587336,0.220251,0.060013
no-book,,,,,,,,
absent,0.003087,0.027591,0.059151,0.021676,0.086179,0.761995,0.040232,8.9e-05


Most of the change comes from clusters that were absent in the original data but are present in the current data.

There are also quite a few clusters (about 22% of the `no-book-author` row) that had no book author in PGSQL but have no book in the current data; not sure what's up with that.  Let's look at more crosstabs.

In [16]:
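def gender_crosstab(old, new, fractional=True):
    """Cross-tabulate gender labels between two versions (rows: old, columns: new)."""
    ct = cluster_genders[[old, new]].value_counts().unstack()
    ct = ct.reindex(index=genders, columns=genders)

    if fractional:
        # Divide each row by its total, so each row sums to 1
        return ct.divide(ct.sum(axis='columns'), axis='rows')
    else:
        return ct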
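
To make the row normalization concrete, here is the same computation on a toy frame. The `old` and `new` columns and the three-label vocabulary are hypothetical stand-ins for two version columns of `cluster_genders`.

```python
import pandas as pd

labels = ['female', 'male', 'absent']

# Toy old/new labels for six hypothetical clusters.
toy = pd.DataFrame({
    'old': ['female', 'female', 'male', 'male', 'absent', 'absent'],
    'new': ['female', 'male', 'male', 'male', 'female', 'absent'],
})

# Count (old, new) pairs, pivot the new labels into columns,
# and put rows and columns in the same label order.
counts = toy[['old', 'new']].value_counts().unstack()
counts = counts.reindex(index=labels, columns=labels)

# Divide each row by its total so every row sums to 1.
fractions = counts.divide(counts.sum(axis='columns'), axis='rows')
print(fractions)
```

Each row then reads as "of the clusters with this old label, what fraction ended up with each new label"; pairs that never occur stay missing rather than zero.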

## PostgreSQL to March 2022 (2.0 release)

This marks the change from PostgreSQL to pure-Rust.

In [17]:
ct = gender_crosstab('pgsql', '2022-03-2.0')
ct.style.apply(style_row, 'columns')

pgsql \ 2022-03-2.0,ambiguous,female,male,unknown,no-author-rec,no-book-author,no-book,absent
ambiguous,0.977892,0.002955,0.01392,0.000636,0.000878,,,0.00372
female,0.002192,0.993949,1e-06,0.000301,0.00043,3e-06,5e-06,0.003119
male,0.000591,0.0,0.995942,0.000528,0.000796,2e-06,1.4e-05,0.002127
unknown,4.3e-05,0.00276,0.005305,0.9889,0.001953,1e-06,3e-06,0.001035
no-author-rec,0.000104,0.007917,0.049966,0.031481,0.908597,,8e-06,0.001927
no-book-author,3e-06,5.1e-05,0.000198,0.000107,5.1e-05,0.649174,0.335017,0.015398
no-book,,,,,,,,
absent,9e-06,6.7e-05,0.000349,0.000164,0.000244,0.002756,9.6e-05,0.996315


This is where a bunch of books change from `no-book-author` to `no-book`; otherwise things are pretty consistent. This major change is likely a result of changes that count more books and book clusters: we had some questionable inner joins in the PostgreSQL version, and in particular we didn't really cluster solo ISBNs.  Now, a solo ISBN from rating data gets a cluster with no book record instead of being excluded from the clustering.
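
The effect of an inner join on solo ISBNs can be illustrated with a toy example. This is only a sketch of the general mechanism; the `ratings` and `books` tables and their columns are hypothetical and do not reflect the actual pipeline.

```python
import pandas as pd

# Hypothetical tables: ISBNs seen in rating data, and ISBNs with a book record.
ratings = pd.DataFrame({'isbn': ['111', '222', '333']})
books = pd.DataFrame({'isbn': ['111', '222'], 'book_id': [10, 20]})

# An inner join silently drops the ISBN that has no book record...
inner = ratings.merge(books, on='isbn', how='inner')

# ...while a left join keeps it, with a missing book_id.
left = ratings.merge(books, on='isbn', how='left')

print(len(inner), len(left))
print(left[left['book_id'].isna()]['isbn'].tolist())
```

Under the left join, the unmatched ISBN survives as a row without a book record, which is the kind of entry that would show up as `no-book` rather than disappearing.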

## March to July 2022

We updated a lot of data files and changed the name and ISBN parsing logic.

In [18]:
ct = gender_crosstab('2022-03-2.0', '2022-07')
ct.style.apply(style_row, 'columns')

2022-03-2.0 \ 2022-07,ambiguous,female,male,unknown,no-author-rec,no-book-author,no-book,absent
ambiguous,0.836221,0.035586,0.083579,0.014798,0.004862,8.7e-05,1.6e-05,0.024853
female,0.010071,0.963215,0.000488,0.00791,0.001254,2e-06,1.1e-05,0.01705
male,0.006706,0.000646,0.974658,0.003702,0.001364,7.9e-05,1.4e-05,0.01283
unknown,0.0019,0.040308,0.09294,0.856051,0.003412,0.0002,,0.00519
no-author-rec,0.003536,0.020632,0.108692,0.101639,0.762115,9e-06,3.8e-05,0.003339
no-book-author,0.002055,0.030434,0.057325,0.035309,0.056236,0.809982,7e-06,0.008653
no-book,0.000156,0.002331,0.00526,0.002731,0.004763,0.000635,0.980666,0.003458
absent,0.003084,0.027664,0.059202,0.021831,0.086472,0.004783,2e-06,0.796962


Mostly fine; some more clusters are resolved, and existing resolutions are pretty consistent.

## July 2022 to Oct. 2022

We changed from DataFusion to Polars and made further ISBN and name parsing changes.

In [19]:
ct = gender_crosstab('2022-07', '2022-10')
ct.style.apply(style_row, 'columns')

2022-07 \ 2022-10,ambiguous,female,male,unknown,no-author-rec,no-book-author,no-book,absent
ambiguous,0.98939,0.004969,0.003626,0.000336,0.001647,,,3.2e-05
female,,0.995095,,0.000361,0.00447,,,7.4e-05
male,1e-06,,0.994581,0.000431,0.004975,0.0,,1.1e-05
unknown,,,,0.995468,0.004492,,,4.1e-05
no-author-rec,,1e-06,3e-06,5e-06,0.999822,0.000131,,3.9e-05
no-book-author,,0.0,1e-06,0.0,0.0,0.99661,,0.003387
no-book,4e-06,9.4e-05,3e-05,6.7e-05,9.4e-05,0.198069,0.670216,0.131427
absent,,,,,,,,1.0


## Oct. 2022 to Current

We added support for GoodReads CSV data and the Amazon 2018 rating CSV files.

In [20]:
ct = gender_crosstab('2022-10', 'current')
ct.style.apply(style_row, 'columns')

2022-10 \ current,ambiguous,female,male,unknown,no-author-rec,no-book-author,no-book,absent
ambiguous,0.999995,,,,,,,5e-06
female,,0.999999,,,,,,1e-06
male,,,0.999998,,,,,2e-06
unknown,,,,0.999998,,,,2e-06
no-author-rec,,,,,0.999999,,,1e-06
no-book-author,,,0.0,0.0,,0.999982,,1.7e-05
no-book,,,,,,,1.0,
absent,,0.0,0.0,0.0,0.0,0.927674,0.04941,0.022915
