# Linking Darnton's *Literary Tour de France* to FBTEE

**Authors:** Michael Falk, Simon Burrows

**Date:** 5/11/2018-

## Background

The Société Typographique de Neuchâtel (STN) has been the subject of two major academic studies: Robert Darnton's *A Literary Tour de France*, and *The French Book Trade in Enlightenment Europe* (FBTEE), by Simon Burrows, Mark Curran et al. Each study took a different approach. Darnton closely examined the STN's correspondence with a small selection of its most regular buyers. Burrows, Curran et al. created a database drawn mostly from the STN's ledgers. The two projects thus wound up focussing on opposite sides of the STN's activities: Darnton was more interested in demand for different titles; Burrows, Curran et al. in what the STN actually supplied.

In this branch of the *Mapping Print, Charting Enlightenment* project, we try to join these two datasets. Using machine learning, we will locate linked records, and for the first time will be able to study systematically the relation between supply and demand for this important Enlightenment publishing house.

**A note on the data:** This repository has an MIT licence, while Darnton's sample is released under a Creative Commons 4 licence. To avoid a possible licence clash, Darnton's data has not been replicated here. It can be found on his project website: http://robertdarnton.org/sites/default/files/CommandesLibrairesfrancais.xls

## Experiment 1: Transaction Data

In this first experiment, we try to link the orders in Darnton's dataset directly to the sales in the FBTEE data. If this doesn't work, we may fall back on the simpler option of finding the `book_code` or `super_book_code` for each title in Darnton's spreadsheet.
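The fallback mentioned above amounts to a fuzzy title lookup. The following is only a minimal sketch of that idea, assuming a dictionary from normalised FBTEE titles to codes. The catalogue entries and the codes `code_A` and `code_B` are placeholders rather than real FBTEE values, `lookup_code` is a hypothetical helper, and the standard-library `difflib` stands in for whatever matcher is eventually chosen.

```python
from difflib import get_close_matches

# Placeholder catalogue: normalised title -> code (NOT real FBTEE codes).
catalogue = {
    "journal historique de la revolution": "code_A",
    "dictionnaire philosophique": "code_B",
}

def lookup_code(title, catalogue, cutoff=0.6):
    """Return the code of the closest catalogue title, or None if nothing is close enough."""
    candidates = get_close_matches(title.lower(), list(catalogue), n=1, cutoff=cutoff)
    return catalogue[candidates[0]] if candidates else None

print(lookup_code("Dictionaire philosophique", catalogue))  # misspelt title still resolves
print(lookup_code("zzz", catalogue))                        # no plausible match
```

The `cutoff` value trades false matches against missed ones and would need tuning on real titles.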

In [9]:
# Import libraries
import os
import random
import time

import dedupe as dd
import numpy as np
import pandas as pd

from dedupe_helper_functions import dedupe_initialise, run_deduper, save_clusters

# Define main paths
dar_dir = "darnton_files/"
input_file = dar_dir + "darnton_combined.csv"
output_file = dar_dir + "darnton_deduped.csv"
settings_file = dar_dir + "darnton_learned_settings"
training_file = dar_dir + "darnton_training.json"

Having imported the main libraries and defined the main paths, we can import the data and initialise the deduplication model.

In [2]:
# Import data from csv (pandas opens and closes the file itself)
darnton_df = pd.read_csv(input_file, encoding='utf-8')

In [None]:
# Have a look at the data:
darnton_df

In [3]:
# Define the fields to examine and initialise the model:
fields = [
    {'field':'stn_abbreviated_title', 'type':'String'},
    {'field':'edition', 'type':'Price'}, # 'price' fields are how Dedupe models numerical data
    {'field':'number_of_volumes', 'type':'Price'},
    {'field':'author_name', 'type':'String'},
    {'field':'date', 'type':'DateTime', 'yearfirst':True}, # 'yearfirst' indicates the date format
    {'field':'full_book_title', 'type':'String'},
    {'field':'darnton_record_id', 'type':'Price', 'has missing':True},
    {'field':'client_code', 'type':'String'},
    {'field':'total_number_of_volumes', 'type':'Price'}
]

deduper = dedupe_initialise(darnton_df, fields, settings_file, training_file, sample_size = 15000)

INFO:dedupe.api:((SimplePredicate: (alphaNumericPredicate, client_code), TfidfTextCanopyPredicate: (0.4, full_book_title)),)


Reading pre-trained model from darnton_files/darnton_learned_settings...
Done
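The helper `dedupe_initialise` lives in `dedupe_helper_functions` and is not reproduced here. As a rough illustration of the kind of preparation such a helper has to do: Dedupe works on a dictionary that maps each record id to a dictionary of field values, with missing values given as `None`. The sketch below builds that shape from CSV text using only the standard library. The function name `to_dedupe_records` is hypothetical, and the two sample rows are cut down from the cluster shown later in this notebook.

```python
import csv
import io

sample_csv = """stn_abbreviated_title,author_name,date,darnton_record_id,client_code
Journal de maupou,"Moufle d'Angerville, Berthélemy-François-Joseph",1775-04-17,,cl1510
Journal de maupou,"Pidansat de Mairobert, Mathieu-François",1775-04-17,,cl1510
"""

def to_dedupe_records(csv_text):
    """Map row number -> {field: value}, replacing empty cells with None."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row_id: {field: (value.strip() or None) for field, value in row.items()}
        for row_id, row in enumerate(reader)
    }

records = to_dedupe_records(sample_csv)
print(records[0]["darnton_record_id"])  # missing id becomes None
print(records[1]["author_name"])
```

Numeric fields such as `edition` would still need converting from strings to numbers before being modelled as `'Price'` variables.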


If you are not using a pre-trained model, or if you have not provided a JSON file of labelled training examples, then you can use the cell below to label some record pairs in the console. The model will present you with two rows of the data frame, and ask you to type 'y' if they match, 'n' if they do not, or 'u' if you are unsure. Type 'f' when you have had enough.

In [None]:
dd.consoleLabel(deduper)
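To make the labelling step concrete, here is a small sketch of how the answers could be sorted into the two groups a Dedupe model learns from: pairs marked as matches and pairs marked as distinct. The function `collect_labels` and the example pairs are hypothetical, and the exact on-disk layout of the training JSON file varies between Dedupe versions, so this shows the idea rather than the library's own format.

```python
def collect_labels(labelled_pairs):
    """Sort (pair, answer) tuples into match/distinct lists; 'u' answers are skipped."""
    training = {"match": [], "distinct": []}
    for pair, answer in labelled_pairs:
        if answer == "y":
            training["match"].append(pair)
        elif answer == "n":
            training["distinct"].append(pair)
        # 'u' (unsure) contributes nothing to training
    return training

# Illustrative record pairs only, reduced to a single field.
pairs = [
    (({"client_code": "cl1510"}, {"client_code": "cl1510"}), "y"),
    (({"client_code": "cl1510"}, {"client_code": "cl0001"}), "n"),
    (({"client_code": "cl1510"}, {"client_code": "cl151"}), "u"),
]

training = collect_labels(pairs)
print(len(training["match"]), len(training["distinct"]))
```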

In [4]:
# Train the model, cluster the records and save
deduper, matches = run_deduper(deduper, darnton_df, settings_file, training_file, recall_weight = 1)

INFO:dedupe.canopy_index:Removing stop word de


Computing threshold based on a recall weighting of 1.


INFO:dedupe.api:Maximum expected recall and precision
INFO:dedupe.api:recall: 0.788
INFO:dedupe.api:precision: 0.692
INFO:dedupe.api:With threshold: 0.369
INFO:dedupe.canopy_index:Removing stop word de


Computation complete. Threshold = 0.36868736147880554. It took 4.682 seconds.
Clustering...
Clustering complete. 1528 clusters found. It took 4.820 seconds.
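The log above reports an expected recall and precision for the chosen threshold. The general idea can be sketched as follows: sort the predicted match probabilities, treat their running sum as the expected number of true matches at each cutoff, and pick the cutoff that maximises a recall-weighted F-score. This is a plain-Python illustration of that idea under those assumptions, not the library's own code; `choose_threshold` and the sample probabilities are invented for the example.

```python
def choose_threshold(probabilities, recall_weight=1.0):
    """Return the probability cutoff that maximises a recall-weighted F-score."""
    scores = sorted(probabilities, reverse=True)
    total_expected = sum(scores)
    if total_expected <= 0:
        raise ValueError("need at least one pair with non-zero probability")

    best_f, best_cutoff = -1.0, None
    running = 0.0
    for n, p in enumerate(scores, start=1):
        running += p                         # expected true matches among the top n pairs
        recall = running / total_expected    # share of all expected matches kept
        precision = running / n              # share of kept pairs expected to be matches
        f = recall * precision / (recall + recall_weight ** 2 * precision)
        if f > best_f:
            best_f, best_cutoff = f, p
    return best_cutoff

probs = [0.95, 0.9, 0.8, 0.2, 0.1, 0.05]
print(choose_threshold(probs, recall_weight=1))
print(choose_threshold(probs, recall_weight=3))
```

Raising `recall_weight` favours recall, so the chosen cutoff moves down and more pairs are accepted as matches.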


In [6]:
clustered = save_clusters(matches, darnton_df, output_file)

Now let's see how well the model has done at linking the records.

In [40]:
# Inspect one randomly chosen cluster (assumes cluster ids run from 0 to len(matches) - 1)
clustered[clustered['cluster'] == random.randrange(len(matches))]

| Unnamed: 0 | stn_abbreviated_title | edition | number_of_volumes | author_name | copies_ordered | date | full_book_title | darnton_record_id | client_name | client_code | total_number_of_volumes | cluster | confidence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4788 | Journal de maupou | 12.0 | 7.0 | Moufle d'Angerville, Berthélemy-François-Joseph | | 1775-04-17 | Journal historique de la révolution opérée dan... | | | cl1510 | 42.0 | 1329.0 | 0.853935 |
| 4789 | Journal de maupou | 12.0 | 7.0 | Pidansat de Mairobert, Mathieu-François | | 1775-04-17 | Journal historique de la révolution opérée dan... | | | cl1510 | 42.0 | 1329.0 | 0.853935 |


How many of the Darnton orders has it linked to a sale in the FBTEE data?

In [24]:
darnton_orders = len(clustered[pd.notnull(clustered.darnton_record_id)])
darnton_clustered = len(clustered[pd.notnull(clustered.darnton_record_id) & pd.notnull(clustered.cluster)])
print(f"Of the {darnton_orders} orders in Darnton's dataset, {darnton_clustered} have been linked to a sale, or {darnton_clustered / darnton_orders * 100:.1f}%.")

Of the 3398 orders in Darnton's dataset, 1427 have been linked to a sale, or 42.0%.
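The figure above counts every Darnton order that belongs to any cluster. A cluster could in principle contain only Darnton orders, so a stricter check may be worth running: count only the Darnton orders whose cluster also contains at least one non-Darnton row. The sketch below assumes that rows with a missing `darnton_record_id` are FBTEE sales, and it uses a small invented data frame in place of `clustered`.

```python
import pandas as pd

# Toy stand-in for `clustered`; the values are invented for illustration.
toy = pd.DataFrame({
    "darnton_record_id": [1.0, None, 2.0, 3.0, None, 4.0],
    "cluster":           [0.0, 0.0, 1.0, 1.0, 2.0, None],
})

is_darnton = toy["darnton_record_id"].notna()
in_cluster = toy["cluster"].notna()

# Count used above: Darnton orders that fall in any cluster.
naive_linked = int((is_darnton & in_cluster).sum())

# Stricter count: the cluster must also hold a row assumed to be an FBTEE sale.
fbtee_clusters = set(toy.loc[~is_darnton & in_cluster, "cluster"])
strict_linked = int((is_darnton & toy["cluster"].isin(fbtee_clusters)).sum())

print(naive_linked, strict_linked)
```

In the toy data, cluster 1 holds two Darnton orders and no sale, so those orders count under the first measure but not the second.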
