# Assigning super_book_codes to the banned books

**Author:** Michael Falk

**Date:** 31/10/18

## Background

One of the key datasets for *Mapping Print, Charting Enlightenment* is a set of documents concerning illegal books in eighteenth-century France. BNF MS 21928-9 contains a list of banned books. It is unclear who exactly wrote the list, but it appears to have been prepared by the central government to assist book inspectors with their tasks across France. BNF Arsenal MS 10305 is an inventory of all the books that were found in the Bastille when it was stormed during the French Revolution. The actual MS has disappeared, but luckily a modern edition exists.

A problem occurred during data entry. The interface was supposed to oblige the user to assign a 'super book code' to each title in the banned books lists upon entry. But due to a glitch in the interface, the lookup took too long and it was impossible to do so efficiently. Accordingly, only 97 of the 1000+ illegal books have a 'super book code' assigned to them. The data is therefore not linked to the rest of the database and cannot be used for analysis.

To speed up the process of linking all the data, this notebook uses 'dedupe', an open-source record linkage library, to try to find links between these banned titles and the titles already recorded elsewhere in the database. Hopefully this will both accelerate the linkage and provide a testbed for other record linkage tasks in the project.

***

## Section 1: Import data, initialise model

In [49]:
# Cell 1: Load necessary libraries and define key paths
import dedupe as dd
import pandas as pd
import os

input_file = "combined_editions_illegal_books.csv"
output_file = "illegal_books_deduped.csv"
settings_file = "illegal_books_learned_settings"
training_file = "illegal_books_traing.json"

Data was preprocessed in R. The full set of 'editions' was extracted from the database, the illegal_titles data was cleaned, and the two datasets were combined into a single large table. The script is in this repo. This preprocessing reduces the task to finding duplicate rows in a single table.
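The effect of this preprocessing can be illustrated with a minimal sketch in plain Python. It is not the R script itself, which is not reproduced here. The function name `stack_tables`, the `source` column and the toy rows are assumptions made for illustration only; the real columns are those in `combined_editions_illegal_books.csv`.

```python
def stack_tables(editions, illegal_titles, columns):
    """Stack two lists of row dicts into one table with a shared set of columns.

    Columns absent from a row are filled with None, and a hypothetical
    'source' column records which table each row came from.
    """
    combined = []
    for source, rows in (("editions", editions), ("illegal_titles", illegal_titles)):
        for row in rows:
            record = {col: row.get(col) for col in columns}
            record["source"] = source
            combined.append(record)
    return combined


# Toy rows, invented for illustration (not taken from the database)
editions = [{"super_book_code": "code_A", "full_book_title": "Title one"}]
illegal_titles = [{"full_book_title": "Title one, nouvelle edition"}]

combined = stack_tables(editions, illegal_titles, ["super_book_code", "full_book_title"])
for row in combined:
    print(row)
```

Once the tables are stacked, a banned title without a code and an edition with a code that describe the same work are simply two near-duplicate rows, which is the situation a deduplication model is built to detect.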

In [52]:
# Cell 2: Import data
data = pd.read_csv(input_file)

print(f"The data has {data.shape[0]} rows, {data[data['super_book_code'].isna()].shape[0]} of which need super_book_codes assigned.")

# Dedupe requires missing values to be None, and wants the data as a dict of dicts keyed by record ID.
# Cast to object first so that None is not coerced back to NaN in numeric columns.
data = data.astype(object).where(pd.notnull(data), None)
data = data.to_dict("index")

The data has 14201 rows, 1941 of which need super_book_codes assigned.
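Before handing the records to the model, it may be worth checking that the conversion behaved as intended. The following is a minimal sketch, assuming the records have the dict-of-dicts shape produced by `to_dict("index")`. The helper names `count_missing` and `find_nan_values` and the toy records are hypothetical.

```python
import math


def count_missing(records, field):
    """Count records whose value for `field` is None (or absent)."""
    return sum(1 for rec in records.values() if rec.get(field) is None)


def find_nan_values(records):
    """Return (record_id, field) pairs that still hold a float NaN."""
    return [
        (record_id, field)
        for record_id, rec in records.items()
        for field, value in rec.items()
        if isinstance(value, float) and math.isnan(value)
    ]


# Toy records, invented for illustration
toy_records = {
    0: {"super_book_code": "code_A", "full_book_title": "Title one"},
    1: {"super_book_code": None, "full_book_title": "Title two"},
    2: {"super_book_code": float("nan"), "full_book_title": "Title three"},
}

print(count_missing(toy_records, "super_book_code"))
print(find_nan_values(toy_records))
```

Any pair reported by `find_nan_values` marks a value that was not converted to `None` and would need attention before training.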


In [47]:
# Cell 3: Define initialisation function
def dedupe_initialise(data, fields, settings_file, training_file, sample_size = 15000):
    """
    Takes a data dictionary and field definitions and creates a Dedupe object.
    
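    params:
        data: a dict of dicts keyed by record ID, where each inner dict represents one record
        fields: a list of dicts, where each dict describes a field that the model should inspect
        settings_file: path to the saved settings of a previously trained model
        training_file: path to a JSON file of previously labelled examples
        sample_size: number of record pairs to sample for training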
    
    returns:
        deduper: a Dedupe object
    """
    
    # Check to see if a trained model has already been saved.
    # If so, load it and return straight away: a StaticDedupe object cannot be sampled or retrained.
    if os.path.exists(settings_file):
        print('reading from', settings_file)
        with open(settings_file, 'rb') as f:
            return dd.StaticDedupe(f)
    
    # Otherwise, initialise a new model using the fields list supplied.
    deduper = dd.Dedupe(fields)
    
    # Create training sample pairs from provided data
    deduper.sample(data, sample_size)
    
    # Load existing training file if it exists
    if os.path.exists(training_file):
        print('reading labeled examples from ', training_file)
        with open(training_file, 'rb') as f:
            deduper.readTraining(f)
    
    return deduper

In [62]:
# Cell 4: Initialise deduper

# Create a list of fields for the model to look at. NB: 'ID' and 'UUID' are not relevant to the task,
# hence do not appear in the list.
fields = [
    {'field':'super_book_code', 'type': 'String'},
    {'field':'full_book_title', 'type': 'String'},
    {'field':'author_code', 'type': 'String'},
    {'field':'author_name', 'type': 'String'},
    {'field':'stated_publication_places', 'type': 'String'},
    {'field':'stated_publication_years', 'type': 'DateTime'},
    {'field':'short_book_titles', 'type': 'String'},
    {'field':'translated_title', 'type': 'String'},
]

deduper = dedupe_initialise(data, fields, settings_file, training_file)

INFO:dedupe.canopy_index:Removing stop word 00
INFO:dedupe.canopy_index:Removing stop word bk
INFO:dedupe.canopy_index:Removing stop word pb
INFO:dedupe.canopy_index:Removing stop word 01
INFO:dedupe.canopy_index:Removing stop word zs
INFO:dedupe.canopy_index:Removing stop word des
INFO:dedupe.canopy_index:Removing stop word et
INFO:dedupe.canopy_index:Removing stop word sur
INFO:dedupe.canopy_index:Removing stop word ou
INFO:dedupe.canopy_index:Removing stop word de
INFO:dedupe.canopy_index:Removing stop word les
INFO:dedupe.canopy_index:Removing stop word M
INFO:dedupe.canopy_index:Removing stop word la
INFO:dedupe.canopy_index:Removing stop word du
INFO:dedupe.canopy_index:Removing stop word par
INFO:dedupe.canopy_index:Removing stop word  c
INFO:dedupe.canopy_index:Removing stop word  e
INFO:dedupe.canopy_index:Removing stop word  m
INFO:dedupe.canopy_index:Removing stop word  r
INFO:dedupe.canopy_index:Removing stop word  t
INFO:dedupe.canopy_index:Removing stop word ai
INFO:dedup
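Because the field definitions in Cell 4 are typed by hand, a misspelt field name is an easy mistake to make. The following is a minimal sketch of a pre-flight check, assuming records in the dict-of-dicts shape used above. The helper `check_fields` and the toy data are hypothetical and not part of the dedupe library.

```python
def check_fields(records, fields):
    """Return the field names in `fields` that are missing from at least one record."""
    missing = []
    for definition in fields:
        name = definition["field"]
        if any(name not in rec for rec in records.values()):
            missing.append(name)
    return missing


# Toy records and field definitions, invented for illustration
toy_records = {
    0: {"full_book_title": "Title one", "author_name": "Author one"},
    1: {"full_book_title": "Title two", "author_name": None},
}
toy_fields = [
    {"field": "full_book_title", "type": "String"},
    {"field": "author_name", "type": "String"},
    {"field": "author_nmae", "type": "String"},  # deliberate typo
]

print(check_fields(toy_records, toy_fields))
```

A field whose value is `None` still counts as present; only names that do not appear as keys are reported.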

In [63]:
# Cell 5: Collecting training data