# Debug Cluster Changes

This utility notebook examines how the cluster gender results differ between the current branch and `master`.

**Warning:** this notebook needs a lot of memory (more than 32 GB).

## Setup

Libraries:

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import dvc.api

## Load Data

We need to load several data files. The book genders are our primary interest.

### Current Book Genders

Load the current genders:

In [None]:
current = pd.read_parquet('book-links/cluster-genders.parquet')
current['gender'] = current['gender'].astype('category')
current.info()
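
Since memory is a concern in this notebook, it can be useful to check how much the `category` conversion saves. The following is a minimal sketch on synthetic data, not the real cluster file; the label set (apart from `no-author-rec`) and the row count are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the gender column (labels and size are illustrative only)
rng = np.random.default_rng(42)
labels = np.array(['male', 'female', 'ambiguous', 'unknown', 'no-author-rec'])
toy = pd.DataFrame({
    'cluster': np.arange(100_000),
    'gender': labels[rng.integers(0, len(labels), size=100_000)],
})

# Compare the deep memory footprint of the string column and its categorical form
obj_bytes = toy['gender'].memory_usage(deep=True)
cat_bytes = toy['gender'].astype('category').memory_usage(deep=True)
print(f'strings: {obj_bytes:,} bytes; category: {cat_bytes:,} bytes')
```

With only a handful of distinct labels, the categorical column stores small integer codes plus one copy of each label, so it should be considerably smaller than the string column.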

In [None]:
sns.countplot(y='gender', data=current)

In [None]:
current = current.set_index('cluster').sort_index()

### Master Book Genders

Now load the genders from the `master` branch:

In [None]:
with dvc.api.open('book-links/cluster-genders.parquet', rev='master', mode='rb') as pqf:
    master = pd.read_parquet(pqf)
master['gender'] = master['gender'].astype('category')
master.info()

In [None]:
sns.countplot(y='gender', data=master)

In [None]:
master = master.set_index('cluster').sort_index()

### Authors

Load the first authors for each book cluster:

In [None]:
authors = pd.read_parquet('book-links/cluster-first-authors.parquet')
authors.info()

And the author indexes:

In [None]:
cur_au_idx = pd.read_parquet('viaf/author-name-index.parquet')
cur_au_idx.info()

In [None]:
with dvc.api.open('viaf/author-name-index.parquet', rev='master', mode='rb') as pqf:
    old_au_idx = pd.read_parquet(pqf)
old_au_idx.info()

And the cluster first authors as of `master`:

In [None]:
with dvc.api.open('book-links/cluster-first-authors.parquet', rev='master', mode='rb') as pqf:
    old_authors = pd.read_parquet(pqf)
old_authors.info()

## Tabulate Results

Let's compare gender link results.

In [None]:
genders = master.join(current, how='outer', lsuffix='_old', rsuffix='_cur')
genders.columns.name = 'source'
genders.head()

In [None]:
gender_tall = genders.stack().to_frame(name='gender').reset_index()
gender_tall['source'] = gender_tall['source'].str.replace('gender_', '', regex=False)
gender_tall.head()

In [None]:
sns.countplot(y='gender', hue='source', data=gender_tall)

In [None]:
pd.crosstab(genders['gender_old'], genders['gender_cur'])
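
Note that `pd.crosstab` leaves out rows where either column is missing, so clusters that exist on only one branch do not show up in the table. The sketch below, on toy data, shows one way to keep them visible by filling the gaps with an explicit label first. The `(absent)` label and the toy frames are hypothetical. With the categorical columns used in this notebook, the label would first have to be added as a category (for example with `cat.add_categories`).

```python
import pandas as pd

# Toy versions of the two gender tables, indexed by cluster
old = pd.DataFrame({
    'cluster': [1, 2, 3],
    'gender': ['male', 'female', 'no-author-rec'],
}).set_index('cluster')
cur = pd.DataFrame({
    'cluster': [2, 3, 4],
    'gender': ['female', 'male', 'unknown'],
}).set_index('cluster')

both = old.join(cur, how='outer', lsuffix='_old', rsuffix='_cur')

# Label clusters missing from one side so crosstab does not drop them
filled = both.fillna('(absent)')
table = pd.crosstab(filled['gender_old'], filled['gender_cur'])
print(table)
```

The `(absent)` row and column then count the clusters that were added or removed between the two branches.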

## Examine Unmatched Books

Our initial question is why some books have no matching author record (`no-author-rec`).

In [None]:
cur_nar_mask = current['gender'] == 'no-author-rec'
cur_nar_mask.describe()

In [None]:
old_nar_mask = master['gender'] == 'no-author-rec'
old_nar_mask.describe()

Get the books that are now `no-author-rec` (NAR) but were not on `master`:

In [None]:
newly_nar = cur_nar_mask & ~old_nar_mask
newly_nar.sum()
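
If the two branches do not contain exactly the same clusters, the two masks have different indexes and pandas has to align them when they are combined. One way to make the comparison explicit is to restrict both sides to the shared clusters first. This is a sketch on toy Series under that assumption, not a change to the analysis above.

```python
import pandas as pd

# Toy gender Series indexed by cluster; cluster 3 is only on master, 4 only on current
old_gender = pd.Series(
    ['male', 'no-author-rec', 'female'],
    index=pd.Index([1, 2, 3], name='cluster'),
)
cur_gender = pd.Series(
    ['no-author-rec', 'no-author-rec', 'no-author-rec'],
    index=pd.Index([1, 2, 4], name='cluster'),
)

# Compare only the clusters that exist on both sides
shared = old_gender.index.intersection(cur_gender.index)
old_nar = old_gender.loc[shared] == 'no-author-rec'
cur_nar = cur_gender.loc[shared] == 'no-author-rec'

newly = cur_nar & ~old_nar
newly_ids = newly[newly].index  # cluster IDs that became NAR
print(list(newly_ids))
```

Clusters that appear on only one branch are then excluded from the count and can be examined separately.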

Now let's try to look at why. What are these author names?

In [None]:
nnar = master[newly_nar]
nnar_auth = pd.merge(nnar.reset_index(), old_authors)
nnar_auth.head()

What does that look like in current data?

In [None]:
# Keep the full merge result; only the display is limited to the first rows
nnar_cauth = pd.merge(current[newly_nar].reset_index(), authors)
nnar_cauth.head()

Grab the author names for the first of those clusters:

In [None]:
sc = nnar_auth.iloc[0,0]
sought = nnar_auth.loc[nnar_auth['cluster'] == sc, 'author_name']
sought

Find them in the old data:

In [None]:
matched = old_au_idx[old_au_idx['name'].isin(sought)]
matched

What does the current data say for those records?

In [None]:
cur_ver = cur_au_idx[cur_au_idx['rec_id'].isin(matched['rec_id'])].copy()
cur_ver['repr'] = cur_ver['name'].apply(repr)
cur_ver
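
The `repr` column helps because two names can look the same when printed and still compare as different. The sketch below illustrates two possible causes, trailing whitespace and Unicode normalization form. Both are hypothetical examples, not a diagnosis of the actual index data. Python's `repr` does not escape printable non-ASCII characters, so the sketch uses `ascii` to expose the code points.

```python
import unicodedata
import pandas as pd

# Three spellings that render the same but are different strings (made-up example)
names = pd.Series([
    'M\u00fcller, Hans',    # precomposed u-umlaut (NFC)
    'Mu\u0308ller, Hans',   # 'u' followed by a combining diaeresis (NFD)
    'M\u00fcller, Hans ',   # trailing space
])

# repr shows the quotes, which makes trailing whitespace visible;
# ascii additionally escapes every non-ASCII code point
inspect = pd.DataFrame({
    'name': names,
    'repr': names.map(repr),
    'ascii': names.map(ascii),
})
print(inspect)

# Normalizing to NFC and stripping whitespace collapses the three variants
normalized = names.map(lambda s: unicodedata.normalize('NFC', s).strip())
print(names.nunique(), normalized.nunique())
```

If the old and current indexes differ in this way, an exact `isin` match on `name` would miss records that refer to the same author.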

In [None]:
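# Repeat the lookup from the current side: take the first newly-NAR cluster...
csc = nnar_cauth.iloc[0, 0]
cso = nnar_cauth.loc[nnar_cauth['cluster'] == csc, 'author_name']
# ...and look up its author names in the current author name index
cmatch = cur_au_idx[cur_au_idx['name'].isin(cso)]
cmatch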
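
To check more than one cluster at a time, a left merge with `indicator=True` flags every author name that has no entry in the index. This is a sketch on toy frames. The column names `author_name`, `name`, and `rec_id` follow the ones used above, but the data is invented.

```python
import pandas as pd

# Toy first-author table and author name index
sought = pd.DataFrame({
    'cluster': [10, 10, 11],
    'author_name': ['Doe, Jane', 'Roe, Rich', 'Poe, Edgar'],
})
index = pd.DataFrame({
    'rec_id': [1, 2],
    'name': ['Doe, Jane', 'Poe, Edgar'],
})

# The _merge column records whether each sought name was found in the index
check = sought.merge(
    index, how='left', left_on='author_name', right_on='name', indicator=True
)
unmatched = check.loc[check['_merge'] == 'left_only', 'author_name']
print(list(unmatched))
```

Running the same check against the old and the current index would show which names stopped matching between the two branches.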