# Assigning `super_book_codes` to the banned books

**Author:** Michael Falk

**Date:** 31/10/18-1/11/18, 6/11/18, 12/11/18

## Background

One of the key datasets for *Mapping Print, Charting Enlightenment* is a set of documents concerning illegal books in eighteenth-century France. BNF MS 21928-9 contains a list of banned books. It is unclear who exactly wrote the list, but it appears to have been prepared by the central government to assist book inspectors with their tasks across France. BNF Arsenal MS 10305 is an inventory of all the books that were found in the Bastille when it was stormed during the French Revolution. The actual MS has disappeared, but luckily a modern edition exists.

A problem occurred during data entry. The interface was supposed to oblige the user to assign a 'super book code' to each title in the banned books lists upon entry, but because of a glitch the lookup took so long that it was impossible to do so efficiently. Accordingly, only 97 of the 1000+ illegal books have a 'super book code' assigned to them. The data is therefore not linked to the rest of the database and cannot yet be used for analysis.

To speed up the linking, this notebook uses 'dedupe', an open-source record linkage library, to find likely links between these banned titles and the titles already recorded elsewhere in the database. It should also provide a testbed for other record linkage tasks in the project.

**Update:** Better versions of the helper functions defined in this notebook have been saved in the file *dedupe_helper_functions.py* in this repo.

***

## Section 1: Import data, initialise model

In [1]:
# Cell 1.1: Load necessary libraries and define key paths
import dedupe as dd
import pandas as pd
import os
import time
import numpy as np
import random
from dedupe_helper_functions import dedupe_initialise, run_deduper, save_clusters
import json

input_file = "combined_editions_illegal_books.csv"
settings_file = "illegal_books_learned_settings"
training_file = "illegal_books_training.json"
output_file = "illegal_books_clustered.csv"
marked_pairs_file = "marked_pairs.json"

The data was preprocessed in R: the full set of 'editions' was extracted from the database, the illegal_titles data was cleaned, and the two datasets were combined into a single large table. The script is in this repo. After this preprocessing, the task reduces to finding duplicate rows in a single table.

In [2]:
# Cell 1.2: Import data
data_frame = pd.read_csv(input_file)

print(f"The data has {data_frame.shape[0]} rows, {data_frame[data_frame['super_book_code'].isna()].shape[0]} of which need super_book_codes assigned.")

The data has 17164 rows, 1921 of which need super_book_codes assigned.
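For reference, dedupe works on a dictionary of records keyed by a record id, with missing values given as `None`. That conversion presumably happens inside `dedupe_initialise`; the sketch below is only an assumption about what such a step could look like, not the helper's actual code. `rows_to_records` is a hypothetical name, and the two sample rows are abbreviated from records that appear later in this notebook.

```python
import csv
import io

# The four fields the model looks at (see Cell 1.3)
FIELDS = ["full_book_title", "author_name",
          "stated_publication_places", "stated_publication_years"]

def rows_to_records(csv_text):
    """Turn CSV text into {row_id: {field: value or None}}."""
    records = {}
    for row_id, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        # Empty strings become None so that they count as missing values
        records[row_id] = {field: (row[field].strip() or None) for field in FIELDS}
    return records

sample = (
    "full_book_title,author_name,stated_publication_places,stated_publication_years\n"
    "Memoires Turcs,\"Aucour, Claude Godard [d']\",,\n"
    "Memoires Turcs ou histoire galante,\"Aucour, Claude Godard [d']\",Francfort,1765\n"
)
records = rows_to_records(sample)
print(records[0])
```

Keeping every value as a string (or `None`) avoids surprises such as years being read as floats, which is how they appear in the pandas output further down.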


In [3]:
# Cell 1.3: Initialise deduper

# Create a list of fields for the model to look at. NB: 'ID' and 'UUID' are not relevant to the task,
# hence do not appear in the list. The 'super_book_code' is also left out: the whole problem is that
# some records lack codes, and because the code is a nearly perfect determinant of identity, the model
# would learn to lean on that column too heavily if it were included.
fields = [
    {'field':'full_book_title', 'type': 'String'},
    {'field':'author_name', 'type': 'String'},
    {'field':'stated_publication_places', 'type': 'String'},
    {'field':'stated_publication_years', 'type': 'DateTime'}
]

# This line creates the Dedupe model. If it finds a file at the 'settings_file', it will load as a 'static'
# deduper that can be used efficiently to cluster data, but that cannot be trained. If there is no
# settings file at the given path, it will initialise as an active deduper that must be trained before it can be used.
deduper = dedupe_initialise(data_frame, fields, settings_file, training_file)

INFO:dedupe.api:(TfidfTextCanopyPredicate: (0.2, full_book_title), (SimplePredicate: (firstTokenPredicate, author_name), SimplePredicate: (sameSevenCharStartPredicate, author_name)), TfidfTextCanopyPredicate: (0.4, full_book_title), (SimplePredicate: (sameSevenCharStartPredicate, full_book_title), TfidfNGramCanopyPredicate: (0.2, full_book_title)))


Reading pre-trained model from illegal_books_learned_settings...
Done


## Section 2: Training the Model

In [None]:
# Cell 2.1a: Adding training data to the model (console)

# Run this cell to open the console labeller, which allows you to manually enter training data in the output window.
dd.consoleLabel(deduper)

In [31]:
# Cell 2.1b: Adding training data to the model (marked pairs json file)

# Run this cell to import the training json file generated by 'illegal_books_get_marked_pairs.R', and add it to the model.
with open(marked_pairs_file, 'r') as f:
    marked_pairs = json.load(f)

deduper.markPairs(marked_pairs)

AttributeError: 'StaticDedupe' object has no attribute 'markPairs'
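The error above follows from the behaviour described in Cell 1.3: because a settings file was found, a static deduper was loaded, and a static deduper cannot be trained. A small guard like the one below would make that failure easier to read. It is a sketch only; `add_marked_pairs` is a hypothetical helper, not part of dedupe or of *dedupe_helper_functions.py*.

```python
def add_marked_pairs(deduper, marked_pairs):
    """Feed labelled pairs to the deduper if it is trainable; return True on success."""
    if not hasattr(deduper, "markPairs"):
        # A static deduper was loaded from the settings file, so it cannot learn
        print("Static deduper: move or delete the settings file and "
              "re-run Cell 1.3 before adding training data.")
        return False
    deduper.markPairs(marked_pairs)
    return True
```

With this in place, Cell 2.1b could call `add_marked_pairs(deduper, marked_pairs)` instead of calling `deduper.markPairs` directly.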

In [32]:
# Cell 2.2: Run the model, check out the results.

# The main parameter you can change here is recall_weight. If you increase this number, the model will
# care more about finding possible matches (recall). If you reduce the number, the model will care more about
# getting the matches right when it does find them (precision).
deduper, matches = run_deduper(deduper, data_frame, settings_file, training_file, recall_weight = 0.5)

Computing threshold based on a recall weighting of 0.5.


INFO:dedupe.canopy_index:Removing stop word de
INFO:dedupe.canopy_index:Removing stop word les
INFO:dedupe.canopy_index:Removing stop word ou
INFO:dedupe.canopy_index:Removing stop word M
INFO:dedupe.canopy_index:Removing stop word sur
INFO:dedupe.canopy_index:Removing stop word et
INFO:dedupe.canopy_index:Removing stop word la
INFO:dedupe.canopy_index:Removing stop word par
INFO:dedupe.canopy_index:Removing stop word du
INFO:dedupe.canopy_index:Removing stop word à
INFO:dedupe.canopy_index:Removing stop word des
INFO:dedupe.canopy_index:Removing stop word le
INFO:dedupe.canopy_index:Removing stop word en
INFO:dedupe.canopy_index:Removing stop word  D
INFO:dedupe.canopy_index:Removing stop word  d
INFO:dedupe.canopy_index:Removing stop word ad
INFO:dedupe.canopy_index:Removing stop word ch
INFO:dedupe.canopy_index:Removing stop word de
INFO:dedupe.canopy_index:Removing stop word es
INFO:dedupe.canopy_index:Removing stop word ie
INFO:dedupe.canopy_index:Removing stop word li
INFO:dedupe

Computation complete. Threshold = 0.5879074931144714. It took 60.914 seconds.
Clustering...


INFO:dedupe.canopy_index:Removing stop word de
INFO:dedupe.canopy_index:Removing stop word les
INFO:dedupe.canopy_index:Removing stop word ou
INFO:dedupe.canopy_index:Removing stop word M
INFO:dedupe.canopy_index:Removing stop word sur
INFO:dedupe.canopy_index:Removing stop word et
INFO:dedupe.canopy_index:Removing stop word la
INFO:dedupe.canopy_index:Removing stop word par
INFO:dedupe.canopy_index:Removing stop word du
INFO:dedupe.canopy_index:Removing stop word à
INFO:dedupe.canopy_index:Removing stop word des
INFO:dedupe.canopy_index:Removing stop word le
INFO:dedupe.canopy_index:Removing stop word en
INFO:dedupe.canopy_index:Removing stop word  D
INFO:dedupe.canopy_index:Removing stop word  d
INFO:dedupe.canopy_index:Removing stop word ad
INFO:dedupe.canopy_index:Removing stop word ch
INFO:dedupe.canopy_index:Removing stop word de
INFO:dedupe.canopy_index:Removing stop word es
INFO:dedupe.canopy_index:Removing stop word ie
INFO:dedupe.canopy_index:Removing stop word li
INFO:dedupe

Clustering complete. 3387 clusters found. It took 64.455 seconds.
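To make the effect of `recall_weight` more concrete, here is a toy version of a recall-weighted threshold choice. It is loosely modelled on my understanding of how dedupe picks its threshold (maximising an F-measure in which recall counts `recall_weight` times as much as precision); the exact formula is an assumption, and `choose_threshold` and the list of scores are invented for illustration.

```python
def choose_threshold(probabilities, recall_weight):
    """Pick the score cut-off that maximises a recall-weighted F-measure."""
    probs = sorted(probabilities, reverse=True)
    total = sum(probs)  # expected number of true matches overall
    best_score, best_threshold = -1.0, probs[0]
    expected_matches = 0.0
    for n_kept, p in enumerate(probs, start=1):
        # Keep the n_kept highest-scoring pairs and estimate recall and precision
        expected_matches += p
        recall = expected_matches / total
        precision = expected_matches / n_kept
        score = recall * precision / (recall + recall_weight ** 2 * precision)
        if score > best_score:
            best_score, best_threshold = score, p
    return best_threshold

scores = [0.95, 0.9, 0.85, 0.6, 0.4, 0.2, 0.1, 0.05]
strict = choose_threshold(scores, recall_weight=0.5)  # favours precision
loose = choose_threshold(scores, recall_weight=2.0)   # favours recall
print(strict, loose)
```

On these toy scores the heavier recall weighting gives a lower cut-off, so more candidate pairs are accepted as matches at the cost of more false positives.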


## Section 3: Inspecting the results

In [40]:
# Cell 3.1. Add cluster data back to original data frame and save to csv
_ = save_clusters(matches, data_frame, output_file)

Writing clustered data to illegal_books_clustered.csv...
Done!


In [51]:
# Cell 3.2. Sanity check. How good are the model's assignments?
# Run this cell a few times to look at different random clusters
data_frame[data_frame['cluster'] == random.randrange(len(matches))]  # cluster ids are assumed to run from 0 to len(matches) - 1

Unnamed: 0,ID,UUID,super_book_code,full_book_title,author_code,author_name,stated_publication_places,stated_publication_years,cluster,confidence
5928,,,spbk0001648,Memoires Turcs où hist. galante de deux Turcs...,au0000038,"Aucour, Claude Godard [d']",,,1357.0,0.745271
6354,,,spbk0001648,Memoires Turcs ou histoire galante de deux Tur...,au0000038,"Aucour, Claude Godard [d']",Francfort,1765.0,1357.0,0.735028
6853,,,spbk0001648,Memoires Turcs,au0000038,"Aucour, Claude Godard [d']",,,1357.0,0.83277


In [52]:
# Cell 3.3. How much time have we saved? How many illegal books have been assigned a super_book_code?
assigned = data_frame[
    pd.notnull(data_frame['cluster']) & # which books have been assigned a cluster?
    pd.notnull(data_frame['UUID']) & # only illegal books have UUIDs
    pd.isna(data_frame['super_book_code']) # only interested in books that didn't already have super_book_codes
].shape[0]

total = data_frame[
    pd.notnull(data_frame['UUID']) &
    pd.isna(data_frame['super_book_code'])
].shape[0]

print(f"{assigned} illegal books have been given super_book_codes, of {total} that lack them.")

532 illegal books have been given super_book_codes, of 1921 that lack them.


We can assess the model's accuracy more precisely by checking how often it clustered together books with different super_book_codes.

In [53]:
# Cell 3.4. Inspecting the super_book_codes in all the clusters.
multi_groups = (data_frame.groupby(by="cluster")['super_book_code'] # group into clusters, inspect 'super_book_code'
               .nunique() # count how many unique 'super_book_codes' are in the cluster
               .where(lambda x: x > 1) # only keep clusters with more than one 'super_book_code'
               .dropna()) # drop the NaNs created by .where()

print(f"Of the {len(matches)} clusters found by dedupe, {len(multi_groups)} contain multiple superbooks.")

Of the 3387 clusters found by dedupe, 717 contain multiple superbooks.


In [54]:
# Cell 3.5. Which books has Dedupe confounded?
# Pick one of the groups
rand_multi = int(random.choice(multi_groups.index.tolist()))

# Inspect it
data_frame[data_frame['cluster'] == rand_multi]

Unnamed: 0,ID,UUID,super_book_code,full_book_title,author_code,author_name,stated_publication_places,stated_publication_years,cluster,confidence
2034,,,spbk0000541,Dictionnaire de l'Académie Françoise,au0001311,Académie Française,,,412.0,0.792762
6350,,,spbk0000541,Dictionnaire de l'academie françoise nouvelle ...,au0001311,Académie Française,Lyon,1776.0,412.0,0.879901
10575,,,zspbk0010615,Dictionnaire de l’academie françoise. Nouvell...,,,Lyon,1776.0,412.0,0.876813


In [61]:
# Are there any clusters where it has put more than one illegal book?
multi_illegal = (data_frame[pd.notnull(data_frame['UUID'])]
                 .groupby(by = 'cluster')['UUID']
                 .nunique()
                 .where(lambda x: x > 1)
                 .dropna())

print(f"Of the {len(matches)} clusters found by dedupe, {len(multi_illegal)} contain multiple illegal books.")

Of the 3387 clusters found by dedupe, 102 contain multiple illegal books.


In [80]:
# Examine some of these ones:
# Pick one of the groups
rand_multi = int(random.choice(multi_illegal.index.tolist()))

# Inspect it
data_frame[data_frame['cluster'] == rand_multi]

Unnamed: 0,ID,UUID,super_book_code,full_book_title,author_code,author_name,stated_publication_places,stated_publication_years,cluster,confidence
1657,1744.0,9f316381-78db-41cd-9c43-e40bf2701f9a,,L'Espion anglois,,"Pidansat de Mairobert, Mathieu Franc?ois",London,1785,290.0,0.869183
1658,1745.0,4e7e8462-92dc-4ce9-a203-e0d56e836e51,,L'Espion anglois,,"Pidansat de Mairobert, Mathieu Franc?ois",London,1785,290.0,0.869183
4023,,,spbk0001429,Lettres originales de Madame la comtesse du Ba...,au0000705,"Pidansat de Mairobert, Mathieu-François",London,1779,290.0,0.758484


## Section 4: Using the results

The question is: how to use the results? The most obvious course seems to be this: go through all the illegal books, and keep the most confident result that the model has put out. Then we can manually go over them, and if they are okay, update the database.

In [81]:
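# Make a copy of the data frame
out_data = data_frame.copy()

In [None]:
# Now go through each cluster, get the highest confidence super_book_code and assign it to the illegal book.
# NB: this cell was left unfinished; the lines below are one possible completion and have not been run on the real data.
coded = out_data.dropna(subset=['cluster', 'super_book_code'])  # records that already carry a code
best_codes = (coded.loc[coded.groupby(by='cluster')['confidence'].idxmax()]  # most confident coded record per cluster
              .set_index('cluster')['super_book_code'])
out_data['super_book_code'] = out_data['super_book_code'].fillna(out_data['cluster'].map(best_codes))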
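The selection rule described at the start of this section (for each cluster, take the super_book_code of the most confident coded record and offer it to the uncoded illegal books in the same cluster) can also be sketched on plain Python dicts. `propose_codes` is a hypothetical name; the rows reuse the ids, codes, clusters and confidences from the two example clusters printed in Section 3.

```python
def propose_codes(rows):
    """Return {row_id: proposed super_book_code} for uncoded rows in coded clusters."""
    # Step 1: find the most confident coded record in each cluster
    best = {}  # cluster -> (confidence, super_book_code)
    for row in rows:
        if row["cluster"] is None or row["super_book_code"] is None:
            continue
        current = best.get(row["cluster"])
        if current is None or row["confidence"] > current[0]:
            best[row["cluster"]] = (row["confidence"], row["super_book_code"])
    # Step 2: offer that code to every uncoded record in the same cluster
    return {
        row["id"]: best[row["cluster"]][1]
        for row in rows
        if row["super_book_code"] is None and row["cluster"] in best
    }

rows = [
    {"id": 1657, "super_book_code": None, "cluster": 290, "confidence": 0.869183},
    {"id": 1658, "super_book_code": None, "cluster": 290, "confidence": 0.869183},
    {"id": 4023, "super_book_code": "spbk0001429", "cluster": 290, "confidence": 0.758484},
    {"id": 2034, "super_book_code": "spbk0000541", "cluster": 412, "confidence": 0.792762},
    {"id": 6350, "super_book_code": "spbk0000541", "cluster": 412, "confidence": 0.879901},
    {"id": 10575, "super_book_code": "zspbk0010615", "cluster": 412, "confidence": 0.876813},
]
proposals = propose_codes(rows)
print(proposals)
```

Returning proposals rather than overwriting the data keeps the manual review step separate: the two *L'Espion anglois* rows are offered the code of the *Lettres originales* record in their cluster, which is exactly the kind of suggestion that needs a human check before the database is updated.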

## Conclusion

Dedupe appears to do a good job in linking the illegal books to super books that are already in the dataset. The question is whether the ~1600 books the algorithm could not cluster are new books that aren't already in our dataset, or if the algorithm has low recall.

Manual investigation will be necessary to see if the ~1600 super books are already in the database.

It may also be possible to tune the model further, by
1. giving it more training data, or
2. increasing the `recall_weight` so that the model cares more about finding possible matches than being accurate when it does find them.

**Addendum (12/11/2018):** Simon reckons that finding `super_book_codes` for only ~300 of the illegal books is unsurprising. His hunch is that the authorities were quite good at extinguishing banned titles most of the time in *ancien régime* France.

**Addendum (19/11/2018):** I tried generating a whole lot of training data, using the super book codes to find matching and non-matching pairs. Feeding this to the model using the `Dedupe.markPairs()` method, the model's recall jumped considerably, and it has found 600+ matches with the illegal books. Now to check that data and upload it...
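The training data itself was generated by 'illegal_books_get_marked_pairs.R'. For readers who prefer Python, the idea can be sketched as follows. This is an illustration under stated assumptions, not a port of the R script: `make_marked_pairs` is a hypothetical name, the sample records are abbreviated, and the output uses the `{'match': [...], 'distinct': [...]}` layout that `markPairs` expects.

```python
import itertools
import random

def make_marked_pairs(records, n_distinct=100, seed=0):
    """Build labelled pairs from records that already have a super_book_code."""
    by_code = {}
    for rec in records:
        if rec.get("super_book_code"):
            by_code.setdefault(rec["super_book_code"], []).append(rec)

    def strip(rec):
        # The code is not a model field, so drop it from the training pair
        return {k: v for k, v in rec.items() if k != "super_book_code"}

    # Records sharing a code are treated as matches
    match = [(strip(a), strip(b))
             for group in by_code.values()
             for a, b in itertools.combinations(group, 2)]

    # Records drawn from two different codes are treated as distinct
    rng = random.Random(seed)
    codes = sorted(by_code)
    distinct = []
    if len(codes) >= 2:
        for _ in range(n_distinct):
            code_a, code_b = rng.sample(codes, 2)
            distinct.append((strip(rng.choice(by_code[code_a])),
                             strip(rng.choice(by_code[code_b]))))
    return {"match": match, "distinct": distinct}

records = [
    {"super_book_code": "spbk0001648", "full_book_title": "Memoires Turcs"},
    {"super_book_code": "spbk0001648", "full_book_title": "Memoires Turcs ou histoire galante"},
    {"super_book_code": "spbk0000541", "full_book_title": "Dictionnaire de l'Académie Françoise"},
    {"super_book_code": None, "full_book_title": "L'Espion anglois"},
]
marked = make_marked_pairs(records, n_distinct=5)
print(len(marked["match"]), len(marked["distinct"]))
```

One caveat: Section 3 shows that records of the same work sometimes carry different codes, so a few of the 'distinct' pairs produced this way may in fact be matches.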