# Assigning super_book_codes to the banned books

**Author:** Michael Falk

**Date:** 31/10/18-1/11/18

## Background

One of the key datasets for *Mapping Print, Charting Enlightenment* is a set of documents concerning illegal books in eighteenth-century France. BNF MS 21928-9 contains a list of banned books. It is unclear who exactly wrote the list, but it appears to have been prepared by the central government to assist book inspectors with their tasks across France. BNF Arsenal MS 10305 is an inventory of all the books that were found in the Bastille when it was stormed during the French Revolution. The actual MS has disappeared, but luckily a modern edition exists.

A problem occurred during data entry. The interface was supposed to oblige the user to assign a 'super book code' to each title in the banned books lists upon entry. But due to a glitch in the interface, the lookup took so long that it was impossible to do this efficiently. Accordingly, only 97 of the 1000+ illegal books have a 'super book code' assigned to them. The data is therefore not linked to the rest of the database and is useless for analysis.

To link the data more quickly, this notebook uses 'dedupe', an open-source record linkage library, to look for links between these banned titles and the titles already recorded elsewhere in the database. If this works, it should also provide a testbed for other record linkage tasks in the project.

***

## Section 1: Import data, initialise model

In [74]:
# Cell 1.1: Load necessary libraries and define key paths
import dedupe as dd
import pandas as pd
import os
import time
import numpy as np
import random

input_file = "combined_editions_illegal_books.csv"
settings_file = "illegal_books_learned_settings"
training_file = "illegal_books_training.json"
output_file = "illegal_books_clustered.csv"

Data was preprocessed in R. The full set of 'editions' was extracted from the database. The illegal_titles data was cleaned, and the two datasets were combined into a single large table. The script is in this repo. This preprocessing reduces the task to finding duplicate rows in a single table.
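The combining step can be sketched in Python for readers who do not use R. This is not the project's R script: the two toy tables, their rows, and the placeholder codes are invented for illustration, and only the column names `full_book_title`, `author_name` and `super_book_code` come from this notebook.

```python
import pandas as pd

# Toy stand-in for the 'editions' table, which already carries codes.
# The code values here are placeholders, not real super_book_codes.
editions = pd.DataFrame({
    "full_book_title": ["Contes moraux", "Esprit des loix"],
    "author_name": ["Marmontel, Jean-François",
                    "Montesquieu, Charles Louis de Secondat"],
    "super_book_code": ["code_a", "code_b"],
})

# Toy stand-in for the cleaned illegal_titles table, which has no codes yet.
illegal_titles = pd.DataFrame({
    "full_book_title": ["Contes moraux"],
    "author_name": ["Marmontel, Jean-François"],
})

# Stack the two tables; rows from illegal_titles get a missing super_book_code.
combined = pd.concat([editions, illegal_titles], ignore_index=True, sort=False)

# Rows still lacking a code are the ones the linkage step has to resolve.
n_missing = int(combined["super_book_code"].isna().sum())
print(f"{len(combined)} rows, {n_missing} without a super_book_code")
```

Stacking the tables this way is what turns a two-table linkage problem into a single-table deduplication problem: a banned title and its matching edition become two rows of the same table.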

In [69]:
# Cell 1.2: Import data
data_frame = pd.read_csv(input_file)

print(f"The data has {data_frame.shape[0]} rows, {data_frame[data_frame['super_book_code'].isna()].shape[0]} of which need super_book_codes assigned.")

# Dedupe requires missing values to be None, and wants the data as a dict of dicts keyed by record ID.
data_frame = data_frame.where(pd.notnull(data_frame), None)
data = data_frame.to_dict("index")

The data has 14201 rows, 1941 of which need super_book_codes assigned.
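To make the target data structure concrete, here is a minimal sketch of the same conversion on a toy table. The rows and the helper name `to_dedupe_records` are illustrative assumptions, not part of the project. The helper replaces missing values explicitly, one cell at a time, so the result does not depend on how `DataFrame.where` treats numeric columns.

```python
import pandas as pd

def to_dedupe_records(frame):
    """Hypothetical helper: build {row_index: {column: value}}, mapping NaN/None to None."""
    raw = frame.to_dict("index")
    return {
        idx: {col: (None if pd.isna(val) else val) for col, val in row.items()}
        for idx, row in raw.items()
    }

# Toy table with one missing string and one missing number.
toy = pd.DataFrame({
    "full_book_title": ["Contes moraux", "Heures nouvelles"],
    "author_name": ["Marmontel, Jean-François", None],
    "stated_publication_years": [1773.0, float("nan")],
})

toy_records = to_dedupe_records(toy)
print(toy_records[1])
```

The outer keys are the DataFrame's index values, which serve as record IDs, and each inner dict is one record.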


In [3]:
# Cell 1.3: Define initialisation function
def dedupe_initialise(data, fields, settings_file, training_file, sample_size = 15000):
    """
    Takes a data dictionary and field definitions and creates a Dedupe object.
    
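    params:
        data: a dict of dicts, keyed by record ID, where each inner dict represents one record
        fields: a list of dicts, where each dict describes a field that the model should inspect
        settings_file: path to a settings file saved by an earlier run, if one exists
        training_file: path to a file of labelled training pairs saved by an earlier run, if one exists
        sample_size: size of the sample to draw for training
    
    returns:
        deduper: a Dedupe object (or a StaticDedupe object if saved settings were loaded)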
    """
    
    # Check to see if an initialised model has already been saved.
    # If so, load the initialised model. If not, initialise a new model using the fields list supplied.
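    if os.path.exists(settings_file):
        print('reading from', settings_file)
        with open(settings_file, 'rb') as f:
            # A StaticDedupe model is already trained and has no sample() or
            # readTraining() methods, so return it without further setup.
            return dd.StaticDedupe(f)
    else:
        deduper = dd.Dedupe(fields)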
    
    # Create training sample pairs from provided data
    deduper.sample(data, sample_size)
    
    # Load existing training file if it exists
    if os.path.exists(training_file):
        print('reading labeled examples from', training_file)
        with open(training_file, 'rb') as f:
            deduper.readTraining(f)
    
    return deduper

In [4]:
# Cell 1.4: Initialise deduper

# Create a list of fields for the model to look at. NB: 'ID' and 'UUID' are not relevant to the task,
# hence do not appear in the list. 'super_book_code' is also left out: the whole problem is that
# some records lack codes, and since the code is a perfect determinant of identity, the model
# would learn to focus on that column too much if we included it.
fields = [
    {'field':'full_book_title', 'type': 'String'},
    {'field':'author_name', 'type': 'String'},
    {'field':'stated_publication_places', 'type': 'String'},
    {'field':'stated_publication_years', 'type': 'DateTime'}
]

deduper = dedupe_initialise(data, fields, settings_file, training_file)

INFO:dedupe.canopy_index:Removing stop word  L
INFO:dedupe.canopy_index:Removing stop word  d
INFO:dedupe.canopy_index:Removing stop word  p
INFO:dedupe.canopy_index:Removing stop word a 
INFO:dedupe.canopy_index:Removing stop word du
INFO:dedupe.canopy_index:Removing stop word em
INFO:dedupe.canopy_index:Removing stop word es
INFO:dedupe.canopy_index:Removing stop word hi
INFO:dedupe.canopy_index:Removing stop word is
INFO:dedupe.canopy_index:Removing stop word mo
INFO:dedupe.canopy_index:Removing stop word oi
INFO:dedupe.canopy_index:Removing stop word po
INFO:dedupe.canopy_index:Removing stop word re
INFO:dedupe.canopy_index:Removing stop word s 
INFO:dedupe.canopy_index:Removing stop word st
INFO:dedupe.canopy_index:Removing stop word u 
INFO:dedupe.canopy_index:Removing stop word ur
INFO:dedupe.canopy_index:Removing stop word  P
INFO:dedupe.canopy_index:Removing stop word  e
INFO:dedupe.canopy_index:Removing stop word Co
INFO:dedupe.canopy_index:Removing stop word ab
INFO:dedupe.c
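Since the field definitions refer to columns by name, a quick sanity check before initialising the model can catch a misspelt or missing column early. The sketch below is an illustrative helper, not part of dedupe's API; the function name `fields_missing_from_records` and the toy records are assumptions.

```python
def fields_missing_from_records(records, fields):
    """Hypothetical check: list the field names that at least one record lacks."""
    wanted = [definition["field"] for definition in fields]
    missing = set()
    for record in records.values():
        missing.update(name for name in wanted if name not in record)
    return sorted(missing)

# Field definitions in the same shape as the `fields` list in Cell 1.4.
toy_fields = [
    {"field": "full_book_title", "type": "String"},
    {"field": "author_name", "type": "String"},
]

# Toy records in the dict-of-dicts shape; the second lacks an author_name key.
toy_records = {
    0: {"full_book_title": "Contes moraux", "author_name": "Marmontel, Jean-François"},
    1: {"full_book_title": "Heures nouvelles"},
}

print(fields_missing_from_records(toy_records, toy_fields))
```

A key that is present with the value `None` counts as present; only absent keys are reported.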

## Section 2: Training the Model

In [6]:
# Cell 2.1: Collecting training data

# Running this cell will open an interactive shell in the output below, where you will be
# presented with pairs of books, and will be asked to say if they are the same or different.
dd.consoleLabel(deduper)

full_book_title : Bibliotheque des sciences et des beaux arts
author_name : Joncourt, Elie [de]
stated_publication_places : La Haye
stated_publication_years : None

full_book_title : Heures dédiées à monseigneur le dauphin
author_name : None
stated_publication_places : Bayeux
stated_publication_years : 1784

0/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


n


full_book_title : Exposition historique et apologetique des miracles perpetuels
author_name : None
stated_publication_places : None
stated_publication_years : 1707

full_book_title : Nouveau Dictionnaire (le) Suisse, François-Allemand, et Allemand-François
author_name : Poetevin, Franz Ludwig
stated_publication_places : None
stated_publication_years : None

0/10 positive, 1/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Oeuvres de Regnard
author_name : Regnard, Jean François
stated_publication_places : Paris
stated_publication_years : 1788

full_book_title : Synonymes françois
author_name : Girard, Gabriel
stated_publication_places : Rouen
stated_publication_years : 1788

0/10 positive, 2/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Olinde, par l'auteur des mémoires du vicomte de Barjac
author_name : Luchet, Jean-Pierre-Louis de La Roche du Marin [marquis de]
stated_publication_places : Geneva
stated_publication_years : 1784

full_book_title : Histoire de Messieurs Paris: ouvrage dans lequel on montre comment un royaume peut passer dans l'espace de cinq années de l'état le plus déplorable à l'état le plus florissant, par Mr de L*****, ancien officier de cavalerie
author_name : Luchet, Jean-Pierre-Louis de La Roche du Marin [marquis de]
stated_publication_places : None
stated_publication_years : 1276

0/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Nouveau (le) testament
author_name : None
stated_publication_places : None
stated_publication_years : 1709

full_book_title : Nouveau (le) testament de N. S. J. C. traduit sur l'ancienne Edition latine, corrigée par le commandément du pape Sixte 5 et publiée par l'autorité du pape Clement 8, par le R. P. Denis Amelotte pretre de l’oratoire, docteur en Theologie avec permission de son Ess. Magr le Cardinal de Noailles archeveque de Paris, nouvelle edition revûe et corrigée
author_name : [Bible]
stated_publication_places : Avignon
stated_publication_years : 1776

0/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


full_book_title : Essais de Montaigne, avec les notes de M. Coste
author_name : Montaigne, Michel Eyquem
stated_publication_places : London
stated_publication_years : 1771

full_book_title : Abrégé de l'histoire de Geneve, et de son gouvernement ancien et moderne
author_name : Lorovich, Antoine
stated_publication_places : London
stated_publication_years : 1274

1/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, full_book_title), SimplePredicate: (sameSevenCharStartPredicate, full_book_title))
full_book_title : Nouveautés
author_name : Marmontel, Jean-François
stated_publication_places : None
stated_publication_years : 1773

full_book_title : Fables (les) d’Esope mises en françois avec le Sens morale en quatre vers, nouvelle edition, revue, corrigée et augmentée de la vie d'Esope et des quatrains de Benserade dediée à la jeunesse
author_name : Aesop
stated_publication_places : Avignon
stated_publication_years : 1763

1/10 positive, 5/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Oeuvres completes de Boulanger
author_name : Boulanger, Nicolas Antoine
stated_publication_places : None
stated_publication_years : 1775

full_book_title : Style universel de toutes les cours et Jurisdictions du Royaume concernant les saisies et executions tant des meublee qu’immeublee; avec les formules des commandemens et des saisies tant mobiliaires que réelles par M. J. A. S. avocat au parlement de Toulouse
author_name : Soulatges, Jean Antoine
stated_publication_places : Toulouse
stated_publication_years : 1757

1/10 positive, 6/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Comptes faits de Barême
author_name : Barrême, François Bertrand de
stated_publication_places : Nantes
stated_publication_years : 1780

full_book_title : Comptes faits de Barême
author_name : Barrême, François Bertrand de
stated_publication_places : Orléans
stated_publication_years : 1780

1/10 positive, 7/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


full_book_title : Fables choisies de la fontaine avec un nouveau commentaire par M. Coste
author_name : Coste, Pierre
stated_publication_places : Paris
stated_publication_years : 1768

full_book_title : None
author_name : Coste, Pierre
stated_publication_places : None
stated_publication_years : None

2/10 positive, 7/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


full_book_title : Dictionnaire Geographique portatif par M. Vosgien
author_name : Echard, Laurence
stated_publication_places : Paris
stated_publication_years : 1770

full_book_title : Dictionnaire Geographique portatif par M. Vosgien
author_name : La Martinière, Antoine Augustin Bruzen [de]
stated_publication_places : Paris
stated_publication_years : 1770

3/10 positive, 7/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, full_book_title), SimplePredicate: (sameSevenCharStartPredicate, full_book_title))
INFO:dedupe.training:(TfidfNGramCanopyPredicate: (0.8, author_name), TfidfTextCanopyPredicate: (0.8, author_name))
full_book_title : Esprit (l') de Bourdaloue
author_name : Bourdaloue, Louis
stated_publication_places : Paris
stated_publication_years : None

full_book_title : Pensées du P. Bourdaloue
author_name : Bourdaloue, Louis
stated_publication_places : None
stated_publication_years : None

4/10 positive, 7/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Histoire de la duchesse d'Anjou reine d'Angleterre
author_name : None
stated_publication_places : None
stated_publication_years : 1713

full_book_title : Histoire de la découverte et de la conquête du Perou
author_name : Zarate, Agustin de
stated_publication_places : None
stated_publication_years : None

4/10 positive, 8/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Sylphe (le), par Crébillon fils
author_name : Crébillon, Claude Prosper Jolyot de
stated_publication_places : Maestricht
stated_publication_years : 1785

full_book_title : Vie et lettres de Ninon de L'Enclos
author_name : Crébillon, Claude Prosper Jolyot de
stated_publication_places : Toulouse
stated_publication_years : 1778

4/10 positive, 9/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Oraisons funébres, par M.M. Bossuet et Fléchier
author_name : Fléchier, Esprit
stated_publication_places : Lyon
stated_publication_years : 1783

full_book_title : Histoire de Théodose le Grand pour monseigneur le Dauphin
author_name : Fléchier, Esprit
stated_publication_places : Toulouse
stated_publication_years : 1781

4/10 positive, 10/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Les colloques du calvaire, ou méditations sur la Passion de N. S. Jesus-Christ
author_name : Courbon, Noël
stated_publication_places : Rouen
stated_publication_years : 1779

full_book_title : Instructions familieres sur l'oraison mentale, en forme de dialogue, ou l'on explique les divers degrés par les quels on peut s'avancer dans ce St exercice.  Par le R. P. Jean-Joseph Surin de la Compagnie de Jesu
author_name : Courbon, Noël
stated_publication_places : Nancy
stated_publication_years : 1738

4/10 positive, 11/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Microscope bibliographique
author_name : Malebranche, Nicolas
stated_publication_places : Amsterdam
stated_publication_years : None

full_book_title : Conversations chrétiennes
author_name : Malebranche, Nicolas
stated_publication_places : None
stated_publication_years : None

4/10 positive, 12/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Elévation à Dieu de Bossuet
author_name : Bossuet, Jacques Bénigne
stated_publication_places : None
stated_publication_years : None

full_book_title : Justification des reflexions sur le Nouveau Testament
author_name : Bossuet, Jacques Bénigne
stated_publication_places : None
stated_publication_years : None

4/10 positive, 13/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Conduite pour passer Saintement le tems de L’Avent par le R. P. Avrillon
author_name : Avrillon, R. B.
stated_publication_places : Paris
stated_publication_years : 1759

full_book_title : Conduite pour passer saintement Le Carême par le R. P. Avrillon
author_name : Avrillon, R. B.
stated_publication_places : None
stated_publication_years : None

4/10 positive, 14/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Bible (la) enfin expliquée
author_name : Voltaire, François Marie Arouet
stated_publication_places : None
stated_publication_years : None

full_book_title : Histoire de l'empire de Russie sous Pierre le Grand, divisé en deux parties, par M. de Voltaire precedé et suivie de pieces qui sont relatives et accompagnés d’une table des matieres
author_name : Voltaire, François Marie Arouet
stated_publication_places : Lausanne
stated_publication_years : 1771

4/10 positive, 15/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Methode de pomper le mauvais air des vaisseaux
author_name : Lavirotte, Louis Anne
stated_publication_places : None
stated_publication_years : None

full_book_title : Methode de pomper le mauvais air des vaisseaux
author_name : Mead, Richard
stated_publication_places : None
stated_publication_years : None

4/10 positive, 16/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


full_book_title : Essai sur l'administration de St. Domingue
author_name : Raynal, Guillaume-Thomas-François
stated_publication_places : None
stated_publication_years : 1785

full_book_title : [Histoire philosophique et politique des établissements et du commerce des Européens dans les deux Indes (unknown volumes)]
author_name : Raynal, Guillaume-Thomas-François
stated_publication_places : None
stated_publication_years : None

5/10 positive, 16/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Contes moraux
author_name : Marmontel, Jean-François
stated_publication_places : None
stated_publication_years : None

full_book_title : Lucile, comédie en un acte mêlée d'ariette
author_name : Marmontel, Jean-François
stated_publication_places : None
stated_publication_years : None

5/10 positive, 17/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Contes moraux par Mde le prince de Beaumont
author_name : Le Prince de Beaumont, Marie
stated_publication_places : Lyon
stated_publication_years : 1774

full_book_title : Mentor (le) moderne, ou instructions pour les garçons et pour ceux qui les élevent
author_name : Le Prince de Beaumont, Marie
stated_publication_places : Paris
stated_publication_years : 1773

5/10 positive, 18/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Journée du chrétien
author_name : Clément, Denis Xavier
stated_publication_places : Bruyères
stated_publication_years : 1786

full_book_title : Maximes pour se conduire dans le monde, par l'abbé Clément
author_name : Clément, Denis Xavier
stated_publication_places : Rouen
stated_publication_years : 1785

5/10 positive, 19/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Annales de l'Empire depuis Charlemagne jusqu'à nos jours
author_name : Voltaire, François Marie Arouet
stated_publication_places : London
stated_publication_years : 1780

full_book_title : Nouveautés
author_name : Voltaire, François Marie Arouet
stated_publication_places : None
stated_publication_years : 1773

5/10 positive, 20/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Old Testament / Ancien Testament (Unknown denomination) [Bible]
author_name : [Bible]
stated_publication_places : None
stated_publication_years : None

full_book_title : Nouveau (le) Testament d'Amelote
author_name : [Bible]
stated_publication_places : None
stated_publication_years : None

5/10 positive, 21/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Voyageurs (les) savants et curieux, ou Tablettes instructives & guide de ceux que Sa Maj. Danoise a envoyé en Arabie et autres pays voisins de la Palestine, de la Perse & le Mogol ou l'Inde & vers la mer Rouge & l'Egypte, pour l'éclaircissement de questions très-importantes de l'histoire, de la nature & des arts, rédigé & publié par Mr. Michaelis, conseiller de S.M. Brittannique, professeur en philosophie & directeur de la Société Royale des Sciences à l'Université de Göttingue; traduit de l'allemand 
author_name : Michaelis, Johann David
stated_publication_places : London
stated_publication_years : None

full_book_title : Vieillesse (de la), par M. Robert 
author_name : Robert, Marin-Jacques-Clair
stated_publication_places : Paris
stated_publication_years : 1777

5/10 positive, 22/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Poésies (les) d'Horace, traduites en françois par Mr. l'abbé Batteux 
author_name : Horatius Flaccus, Quintus
stated_publication_places : Paris
stated_publication_years : 1763

full_book_title : Poésies (les) d'Horace, traduites en françois nouvelle edition
author_name : Horatius Flaccus, Quintus
stated_publication_places : Paris
stated_publication_years : 1771

5/10 positive, 23/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


full_book_title : Eloge de la raison
author_name : Voltaire, François Marie Arouet
stated_publication_places : London
stated_publication_years : 1775

full_book_title : Méprise (la) d'Arras
author_name : Voltaire, François Marie Arouet
stated_publication_places : Lausanne
stated_publication_years : None

6/10 positive, 23/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Recueil de lettres du Roi de Prusse sur la guerre dernière
author_name : Frederick II, King of Prussia
stated_publication_places : None
stated_publication_years : 1772

full_book_title : Mémoires du baron de La Motte-Fouqué... dans lesquels on a inséré sa correspondance intéressante avec Frédéric II, roi de Prusse
author_name : Frederick II, King of Prussia
stated_publication_places : Berlin
stated_publication_years : 1788

6/10 positive, 24/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Oeuvres de Monsieur de Montesquieu
author_name : Montesquieu, Charles Louis de Secondat
stated_publication_places : Copenhagen; Geneva
stated_publication_years : 1764-1768

full_book_title : Esprit des loix
author_name : Montesquieu, Charles Louis de Secondat
stated_publication_places : None
stated_publication_years : None

6/10 positive, 25/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Abrégé de l'histoire de la Franche-maçonnerie; précédée et suivie de quelques pièces en vers et en prose, et d'anecdotes qui la concerne; d'un essai sur les mystères et le véritable objet de la confrérie des Francs-Maçons; auquel on a joint un recueil complet des chansons dont ils font usage dans leurs assemblées et dans leurs repas, rédigé par un membre de cet ordre
author_name : Koppen, Karl-Friedrich
stated_publication_places : Lausanne; London
stated_publication_years : 1779

full_book_title : Abrégé de la nouvelle méthode
author_name : None
stated_publication_places : Rouen
stated_publication_years : 1779

6/10 positive, 26/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Epître à M. de Monregard, intendant général des Postes de France, par M. Gresset
author_name : Gresset, Jean Baptiste Louis
stated_publication_places : Amiens
stated_publication_years : 1776

full_book_title : Oeuvres de M. Gresset
author_name : Gresset, Jean Baptiste Louis
stated_publication_places : Londres
stated_publication_years : 1773

6/10 positive, 27/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Elémens de cavalerie
author_name : Guérinière, François Robichon (de la)
stated_publication_places : None
stated_publication_years : None

full_book_title : Ecole de la cavalerie, contenant la connoissance, l'instruction, et la conservation du cheval
author_name : Guérinière, François Robichon (de la)
stated_publication_places : Paris
stated_publication_years : 1736

6/10 positive, 28/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Timée de Locres en grec et en français avec des dissertations sur les principales questions de la métaphysique, de la physique, & de la morale des anciens; qui peuvent servir de suite et de conclusion à la Philosophie du bon sens, par Mr. le marquis d'Argens
author_name : Argens, Jean-Baptiste de Boyer [marquis d']
stated_publication_places : None
stated_publication_years : None

full_book_title : Lettres juives, ou correspondance philosophique, historique et critique, entre un Juif voyageur en différents Etats de l'Europe, et ses correspondants en divers endroits
author_name : Argens, Jean-Baptiste de Boyer [marquis d']
stated_publication_places : None
stated_publication_years : None

6/10 positive, 29/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Ame (l') seule avec Dieu seul, ou, Sentimens affectueux sur différens sujets de piété, pour chaque jour du mois, qui pourront aussi servir pour des visites devant le Saint Sacrement
author_name : Baudrand, Barthélemi
stated_publication_places : Rouen
stated_publication_years : 1788

full_book_title : Pensez-y bien
author_name : Baudrand, Barthélemi
stated_publication_places : Nancy
stated_publication_years : 1782

6/10 positive, 30/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Nouveau testament de N.S.J.C. [notre seigneur Jésus Christ] tradt [traduit] en français sur la vulgate, par de Huré, imprimé avec la permission de M. le Cal [cardinal] de Noailles
author_name : [Bible]
stated_publication_places : Toulouse
stated_publication_years : 1786

full_book_title : Apocrypha (Other denominational editions) [Bible]
author_name : [Bible]
stated_publication_places : None
stated_publication_years : None

6/10 positive, 31/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Defensio declarationis
author_name : Bossuet, Jacques Bénigne
stated_publication_places : None
stated_publication_years : None

full_book_title : Oraisons funébres, par M.M. Bossuet et Fléchier
author_name : Bossuet, Jacques Bénigne
stated_publication_places : Lyon
stated_publication_years : 1783

6/10 positive, 32/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Nouvel (le) ange conducteur ou recueil de prieres les plus propres à inspirer de la devotion nouvelle edition revûe et considerablement augmentée, dediée aux personnes de pieté
author_name : Coret, Jacques
stated_publication_places : Dole
stated_publication_years : 1775

full_book_title : Nouvel (le) ange conducteur ou Recueil de prieres les plus propres à Inspirer de la devotion en francois et les Vepres et hymnes en latin, nouvelle edition revûe et considerablement augmentée sur l’imprimée
author_name : Coret, Jacques
stated_publication_places : Paris
stated_publication_years : 1777

6/10 positive, 33/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


full_book_title : [Contes de Dorat]
author_name : Dorat, Claude Joseph
stated_publication_places : None
stated_publication_years : None

full_book_title : Fantaisies (mes)
author_name : Dorat, Claude Joseph
stated_publication_places : None
stated_publication_years : None

7/10 positive, 33/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Du monde, de son origine et de son antiquite
author_name : Bernard, Jean-Frédéric
stated_publication_places : None
stated_publication_years : None

full_book_title : Monde (le), son origine, et son antiquité, première partie
author_name : Bernard, Jean-Frédéric
stated_publication_places : None
stated_publication_years : None

7/10 positive, 34/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


full_book_title : Réflexions sur le caractère et les talens militaires de Charles XII
author_name : Frederick II, King of Prussia
stated_publication_places : None
stated_publication_years : None

full_book_title : Oeuvres du Philosophie Sans Souci
author_name : Frederick II, King of Prussia
stated_publication_places : Neuchâtel
stated_publication_years : 1760

8/10 positive, 34/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Principes généraux et raisonnés de la Grammaire françoise.
author_name : Restaut, Pierre
stated_publication_places : Rouen
stated_publication_years : 1786

full_book_title : Abregé des principes de la grammaire françoise par M. Restaut Nouvelle edition augmentée des principes généraux de l’ortographe françoise
author_name : Restaut, Pierre
stated_publication_places : Paris
stated_publication_years : 1776

8/10 positive, 35/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


full_book_title : Consolations raisonnables et religieuses
author_name : Formey, Jean Henri Samuel;
stated_publication_places : None
stated_publication_years : None

full_book_title : Introduction générale aux Sciences par M. Formey
author_name : Formey, Jean Henri Samuel
stated_publication_places : Amsterdam
stated_publication_years : 1774

9/10 positive, 35/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Oeuvres de Regnier
author_name : Régnier, Mathurin
stated_publication_places : Londres
stated_publication_years : 1750

full_book_title : Satyres (les) et autres oeuvres du sieur Regnier
author_name : Régnier, Mathurin
stated_publication_places : None
stated_publication_years : None

9/10 positive, 36/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : La Sainte Bible en latin et en françois, avec des notes litterales: critiques et historiques, des préfaces et des dissertations, tirées du Commentaire de Dom Augustin Calmet... de M. l'abbé de Vence et des auteurs les plus célèbres pour faciliter l'intelligence de l'Ecriture Sainte
author_name : Calmet, dom Augustin
stated_publication_places : Toulouse
stated_publication_years : 1779

full_book_title : Bible de dom Calmet avec les notes de l'abbé de Vence et autres
author_name : Calmet, dom Augustin
stated_publication_places : Toulouse
stated_publication_years : 1780

9/10 positive, 37/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


full_book_title : Recueil philosophique, ou mélange de pièces sur la religion et la morale, par différents auteurs
author_name : Naigeon, Jacques-André
stated_publication_places : London
stated_publication_years : 1770

full_book_title : Militaire (le) philosophe, ou Difficultés sur la religion, proposées au R. P. Malebranche, prêtre de l'Oratoire, par un ancien officier
author_name : Naigeon, Jacques-André
stated_publication_places : None
stated_publication_years : None

10/10 positive, 37/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Opuscula pathologica partim recusa partim inedita
author_name : Haller, Albrecht [von]
stated_publication_places : None
stated_publication_years : None

full_book_title : Lettres de feu Mr. de Haller contre M. de Voltaire, traduit de l'allemand par F.L. König  
author_name : Haller, Albrecht [von]
stated_publication_places : Berne; Lausanne
stated_publication_years : 1780

10/10 positive, 38/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Heures nouvelles
author_name : None
stated_publication_places : Bruyères
stated_publication_years : 1779

full_book_title : Heures nouvelles
author_name : None
stated_publication_places : Nîmes
stated_publication_years : 1784

10/10 positive, 39/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


full_book_title : Avis au peuple sur sa santé
author_name : Tissot, Samuel Auguste André David
stated_publication_places : None
stated_publication_years : None

full_book_title : Onanisme (l') par M. Tissot
author_name : Tissot, Samuel Auguste André David
stated_publication_places : Lausanne
stated_publication_years : 1773

11/10 positive, 39/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(TfidfNGramCanopyPredicate: (0.8, author_name), TfidfTextCanopyPredicate: (0.8, author_name))
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, full_book_title), SimplePredicate: (sameSevenCharStartPredicate, full_book_title))
INFO:dedupe.training:(SimplePredicate: (wholeFieldPredicate, full_book_title), TfidfTextCanopyPredicate: (0.8, full_book_title))
full_book_title : Amours (les) de Zeokinizul roi des Kofirans. Ouvrage traduit de l'arabe du voyageur Krinelbol
author_name : Crébillon, Claude Prosper Jolyot de
stated_publication_places : Constantinople
stated_publication_years : 1779

full_book_title : Lettres de Mde de Ninon Lenclos
author_name : Crébillon, Claude Prosper Jolyot de
stated_publication_places : Amsterdam
stated_publication_years : 1770

11/10 positive, 40/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Philosophe (le) indien, ou L'art de vivre heureux, dans la societé
author_name : Chesterfield, Philip Dormer [4th Earl of]
stated_publication_places : None
stated_publication_years : None

full_book_title : Monde (le), par Adam Fitz-Adam: ou feuilles périodiques sur les moeurs du tems, traduites de l'anglois
author_name : Chesterfield, Philip Dormer [4th Earl of]
stated_publication_places : Leiden
stated_publication_years : 1756

11/10 positive, 41/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Ange (l') conducteur dans la dévotion chrétienne
author_name : Coret, Jacques
stated_publication_places : Vezoul [Vesoul]
stated_publication_years : 1782

full_book_title : Nouvel Ange conducteur des exercices de Piété
author_name : Coret, Jacques
stated_publication_places : Liège
stated_publication_years : 1771

11/10 positive, 42/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


full_book_title : Soliloques (les) de St Augustin
author_name : Augustine, Saint Bishop of Hippo
stated_publication_places : Paris
stated_publication_years : 1756

full_book_title : Nouvelles Lettres de Saint Augustin
author_name : Augustine, Saint Bishop of Hippo
stated_publication_places : None
stated_publication_years : None

12/10 positive, 42/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Conduite pour passer saintement les fetes et les Octaves de la Pentecoste, du Saint Sacrement et de l’assomption par le P. Avrillon Religieux Minime
author_name : Avrillon, R. B.
stated_publication_places : Paris
stated_publication_years : 1771

full_book_title : Conduite pour l'avent et pour le carême par le pere Avrillon
author_name : Avrillon, R. B.
stated_publication_places : Paris
stated_publication_years : 1773

12/10 positive, 43/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Evangile de Saint Luc et de Saint Mathieu
author_name : [Bible]
stated_publication_places : Douai
stated_publication_years : 1178

full_book_title : Bible (la) de Lyon
author_name : [Bible]
stated_publication_places : Lyon
stated_publication_years : None

12/10 positive, 44/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Dictionnaire raisonné de physique
author_name : Brisson, Mathurin-Jacques
stated_publication_places : None
stated_publication_years : None

full_book_title : Ornithologie, ou Methode contenant la division des Oiseaux en ordres, sections, genres, especes & varie?te?s
author_name : Brisson, Mathurin-Jacques
stated_publication_places : Paris
stated_publication_years : 1759

12/10 positive, 45/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Pensez-y bien
author_name : Baudrand, Barthélemi
stated_publication_places : Nancy
stated_publication_years : 1782

full_book_title : Ame (l') sur le Calvaire, L’ame embrassée, Entretiens avec Jesus Christ par l’auteur de l’ame elevée à Dieu
author_name : Baudrand, Barthélemi
stated_publication_places : Lyon
stated_publication_years : 1776

12/10 positive, 46/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Livres d'assortissement / miscellaneous books <zspbk0010774>
author_name : Linguet, Simon Nicholas Henri
stated_publication_places : None
stated_publication_years : None

full_book_title : Livres d'assortissement / miscellaneous books <zspbk0010774>
author_name : No Author Identifable
stated_publication_places : None
stated_publication_years : None

12/10 positive, 47/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


full_book_title : Imitation de Jesus Christ par Le pere Gonnelieu
author_name : Thomas à Kempis
stated_publication_places : Paris
stated_publication_years : None

full_book_title : Imitation (l') de Jésus-Christ
author_name : Thomas à Kempis
stated_publication_places : None
stated_publication_years : None

13/10 positive, 47/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


full_book_title : Combat spirituel
author_name : Brignon, Jean
stated_publication_places : Rouen
stated_publication_years : 1783

full_book_title : Combat spirituel
author_name : Scupoli, Lorenzo
stated_publication_places : Bruyères
stated_publication_years : 1789

14/10 positive, 47/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


full_book_title : Opera omnia
author_name : Thomas à Kempis
stated_publication_places : None
stated_publication_years : None

full_book_title : Imitation de Jesus Christ derniere edition
author_name : Thomas à Kempis
stated_publication_places : Paris
stated_publication_years : 1776

15/10 positive, 47/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(TfidfNGramCanopyPredicate: (0.8, author_name), TfidfTextCanopyPredicate: (0.8, author_name))
INFO:dedupe.training:(SimplePredicate: (wholeFieldPredicate, full_book_title), TfidfTextCanopyPredicate: (0.6, full_book_title))
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, full_book_title), SimplePredicate: (sameSevenCharStartPredicate, full_book_title))
full_book_title : Angola, histoire Indienne
author_name : Rochette La Morlie?re, Jacques, chevalier de
stated_publication_places : None
stated_publication_years : None

full_book_title : Angola, histoire indienne <spbk0000105>
author_name : Rochette de La Morlière, Charles-Jacques-Louis-Auguste;
stated_publication_places : None
stated_publication_years : 1749

15/10 positive, 48/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


full_book_title : Nouveau (le) Testament de Notre Seigneur Jesus Christ
author_name : [Bible]
stated_publication_places : Paris
stated_publication_years : None

full_book_title : Sainte Bible traduite en françois par Le Maistre de Sacy
author_name : [Bible]
stated_publication_places : Paris
stated_publication_years : 1776

16/10 positive, 48/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (sameSevenCharStartPredicate, full_book_title), TfidfTextCanopyPredicate: (0.4, full_book_title))
INFO:dedupe.training:(TfidfNGramCanopyPredicate: (0.8, author_name), TfidfTextCanopyPredicate: (0.8, author_name))
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, full_book_title), SimplePredicate: (sameSevenCharStartPredicate, full_book_title))
full_book_title : Defense de la religion tant naturelle que re?ve?le?e contre les infide?les et les incre?dules
author_name : Burnet, Gilbert
stated_publication_places : None
stated_publication_years : None

full_book_title : Defense des propheties de la religion chrétienne
author_name : Baltus, Jean-François
stated_publication_places : Paris
stated_publication_years : 1737

16/10 positive, 49/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Anatomie de Winslow
author_name : Winslow, Jakob Benigus
stated_publication_places : None
stated_publication_years : None

full_book_title : Exposition anatomique de la structure du corps humain, par M. Winslow, Docteur Régent de la Faculté de Médecine de Paris etc.  Nouvelle edition faitte sur un exemplaire corrigé et augmenté par l’auteur à la quelle on a joint des nouvelles figures & tables qui en facilitent l’usage; et la vie de l’auteur
author_name : Winslow, Jakob Benigus
stated_publication_places : Paris
stated_publication_years : 1775

16/10 positive, 50/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


full_book_title : Caracteres (les) de Theophraste nouvelle edition
author_name : La Bruyère, Jean [de]
stated_publication_places : Amsterdam
stated_publication_years : 1769

full_book_title : Caracteres (les) de Théophraste avec les Caracteres et les moeurs de le siecle par M. De la Bruyere et de Ses Caracteres par M. Coste
author_name : La Bruyère, Jean [de]
stated_publication_places : Amsterdam
stated_publication_years : 1754

17/10 positive, 50/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


full_book_title : Instructions pour les Jeunes dames qui entrent dans le monde se marient, leurs devoirs dans cet etat et envers leurs enfans; pour servir de suite au magasin des adolescentes, par Made. le Prince de Beaumont
author_name : Le Prince de Beaumont, Marie
stated_publication_places : None
stated_publication_years : 1774

full_book_title : Mentor (le) moderne, ou instructions pour les garçons et pour ceux qui les élevent
author_name : Le Prince de Beaumont, Marie
stated_publication_places : Paris
stated_publication_years : 1773

18/10 positive, 50/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Histoire du parlement de Paris
author_name : Voltaire, François Marie Arouet;
stated_publication_places : Amsterdam
stated_publication_years : None

full_book_title : Charlot, ou la comtesse de Givri
author_name : Voltaire, François Marie Arouet
stated_publication_places : Geneve; Paris
stated_publication_years : None

18/10 positive, 51/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Genie (Le) de Montesquieu
author_name : Montesquieu, Charles Louis de Secondat
stated_publication_places : Amsterdam
stated_publication_years : 1760

full_book_title : Montesquieu [unspecified work]
author_name : Montesquieu, Charles Louis de Secondat
stated_publication_places : None
stated_publication_years : None

18/10 positive, 52/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Pharmacien (Le) moderne, ou Nouvelle manière de préparer les drogues
author_name : Eidous, Marc-Antoine
stated_publication_places : Paris
stated_publication_years : 1749-1750

full_book_title : Italiens (les)
author_name : Eidous, Marc-Antoine
stated_publication_places : None
stated_publication_years : None

18/10 positive, 53/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Principes generaux de la Grammaire par M. R.  Dixieme Edition
author_name : Restaut, Pierre
stated_publication_places : Paris
stated_publication_years : 1773

full_book_title : Abregé des principes de la grammaire françoise par M. Restaut Nouvelle edition augmentée des principes généraux de l’ortographe françoise
author_name : Restaut, Pierre
stated_publication_places : Paris
stated_publication_years : 1776

18/10 positive, 54/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Recueil de pièces fugitives en prose et en vers, accompagnées de notes critiques et impartiales par M. de V.
author_name : Voltaire, François Marie Arouet
stated_publication_places : None
stated_publication_years : None

full_book_title : Lettre d'un ecclésiastique sur le prétendu rétablissement des jésuites dans Paris
author_name : Voltaire, François Marie Arouet
stated_publication_places : None
stated_publication_years : None

18/10 positive, 55/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Imitation de Jésus Christ
author_name : Thomas à Kempis
stated_publication_places : Saint-Malo
stated_publication_years : 1779

full_book_title : De l'imitation de jesus christ par D. Morel
author_name : Thomas à Kempis
stated_publication_places : Paris
stated_publication_years : 1772

18/10 positive, 56/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


full_book_title : Collection complete des oeuvres de Gesner
author_name : Gessner, Salomon
stated_publication_places : Lyon
stated_publication_years : 1786

full_book_title : Collection complète de tous les ouvrages pour et et contre M. Necker avec des notes critiques, politiques et secrètes, le tout par ordre chronologique
author_name : Turgot, Anne Robert Jacques
stated_publication_places : Utrecht
stated_publication_years : 1782

19/10 positive, 56/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Ami (l') des enfants
author_name : Reyre, Joseph
stated_publication_places : Rouen
stated_publication_years : 1785

full_book_title : Instructions sur les principales vérités de la religion et sur les principaux devoirs du christianisme, adressées par Monseigneur l’Illustrissime et Révérendissime Evêque, comte de Toul, Prince du S. Empire, au clergé Séculier et aux Fideles de son diocese
author_name : Humbert, Pierre Hubert
stated_publication_places : Rouen
stated_publication_years : 1786

19/10 positive, 57/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Sainte Bible (la), qui contient le Vieux et le Nouveau Testament
author_name : [Bible]
stated_publication_places : Neuchâtel
stated_publication_years : 1779

full_book_title : Sainte Bible (la) en latin et en françois: avec des notes litterales, critiques, et historiques, des prefaces et des dissertations, tirées du commentaire de Dom Augustin Calmet..., et de M. l'Abbé de Vence, & des auteurs les plus célèbres; pour faciliter l'intelligence de l'Ecriture Sainte
author_name : [Bible]
stated_publication_places : Paris
stated_publication_years : 1748-1750

19/10 positive, 58/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


full_book_title : Sainte Bible (la), qui contient le Vieux et le Nouveau Testament: revu et corrigé sur le texte original, par les pasteurs et professeurs de l'Eglise de Genève, avec les argumens et les réflexions sur les chapitres / par J.F. Ostervald
author_name : [Bible]
stated_publication_places : Bienne; Neuchâtel
stated_publication_years : 1770-1771

full_book_title : Nouveau Testament de notre seigneur Jesus Christ traduit selon la vulgate
author_name : [Bible]
stated_publication_places : None
stated_publication_years : None

20/10 positive, 58/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Considérations sur les causes phisiques et morales de la diversité du geenie, des moeurs et du gouvernement des nations: tiré en partie d'un ouvrage anonime; par M. L. Castilhon
author_name : Castilhon, Jean-Louis
stated_publication_places : Bouillon
stated_publication_years : 1769

full_book_title : Essais de Philosophie et de morale en partie traduit librement et en partie imite de Plutarque par M. L. Castilhon
author_name : Castilhon, Jean-Louis
stated_publication_places : Bouillon
stated_publication_years : 1770

20/10 positive, 59/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


full_book_title : Le Livre des Comptes faits, ou l’on trouve les supputations qui se font par les multiplications pour la valeur de quelque chose que l’on puisse s’imaginer, à telle somme qu’elles puissent monter etc. augmenté du tarif des glaces, par Bareme
author_name : Barrême, François Bertrand de
stated_publication_places : Paris
stated_publication_years : 1770

full_book_title : Comptes faits ou tarif général des monnoies
author_name : Barrême, François Bertrand de
stated_publication_places : Amsterdam
stated_publication_years : 1774

20/10 positive, 60/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


f


Finished labeling
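
The running tally printed during labelling (20 positive and 60 negative by the end of this session) can be re-checked later from the saved training file. The sketch below assumes the file is JSON with top-level `match` and `distinct` lists of record pairs, which is my understanding of how dedupe lays out its training files. `count_labels` is a hypothetical helper, not part of dedupe, and the in-memory example reuses three pairs from the session above.

```python
import io
import json

def count_labels(training_fp):
    """Count labelled pairs in a dedupe-style training file.

    Assumes top-level 'match' and 'distinct' keys, each holding a list of record pairs.
    """
    labels = json.load(training_fp)
    return len(labels.get("match", [])), len(labels.get("distinct", []))

# Tiny in-memory stand-in for a real training file
example = io.StringIO(json.dumps({
    "match": [
        [{"full_book_title": "Heures nouvelles"}, {"full_book_title": "Heures nouvelles"}],
    ],
    "distinct": [
        [{"full_book_title": "Opera omnia"},
         {"full_book_title": "Imitation de Jesus Christ derniere edition"}],
        [{"full_book_title": "Avis au peuple sur sa santé"},
         {"full_book_title": "Onanisme (l') par M. Tissot"}],
    ],
}))

positive, negative = count_labels(example)
print(f"{positive} positive, {negative} negative")
```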


In [10]:
# Cell 2.2: A function for training the model with the collected data and saving the results.
def run_deduper(deduper, data, settings_file, training_file, recall_weight = 1):
    """
    Given a deduper object and a dataset, this function trains the model and
    predicts which records are duplicates.
    
    params:
        deduper: a Dedupe object
        data: a dict of table rows/database records
        settings_file: a string giving the path where the settings should be written
        training_file: a string giving the path where the labelled training examples should be written
        recall_weight: a number indicating how much to privilege recall over precision.
        
    returns:
        deduper: the trained Dedupe object
        matches: a list of tuples giving record ids of duplicates and confidence scores
    """
    
    # Train the model
    print("Training model...")
    start = time.perf_counter()
    deduper.train()
    end = time.perf_counter()
    print(f"Training complete. It took {end - start:.3f} seconds.")
    
    
    print("Saving training data and trained parameters...")
    # Save the training examples
    with open(training_file, 'w') as tf:
        deduper.writeTraining(tf)
    
    # Save the model parameters
    with open(settings_file, 'wb') as sf:
        deduper.writeSettings(sf)
    print("Saved.")
    
    # Calculate threshold for matches
    print(f"Computing threshold based on a recall weighting of {recall_weight}.")
    start = time.perf_counter()
    threshold = deduper.threshold(data, recall_weight = recall_weight)
    end = time.perf_counter()
    print(f"Computation complete. Threshold = {threshold}. It took {end - start:.3f} seconds.")
    
    # Compute the matches
    print("Clustering...")
    start = time.perf_counter()
    matches = deduper.match(data, threshold)
    end = time.perf_counter()
    print(f"Clustering complete. {len(matches)} clusters found. It took {end - start:.3f} seconds.")
    
    return deduper, matches

In [11]:
# Cell 2.3: Run the model, check out the results.
deduper, matches = run_deduper(deduper, data, settings_file, training_file, recall_weight = 1)

Training model...


INFO:rlr.crossvalidation:using cross validation to find optimum alpha...
INFO:rlr.crossvalidation:optimum alpha: 0.000010, score 0.5242863079293838
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (sameSevenCharStartPredicate, author_name), TfidfNGramCanopyPredicate: (0.6, author_name))
INFO:dedupe.training:(SimplePredicate: (wholeFieldPredicate, full_book_title), TfidfTextCanopyPredicate: (0.6, full_book_title))


Training complete. It took 1.028 seconds.


INFO:dedupe.canopy_index:Removing stop word ar
INFO:dedupe.canopy_index:Removing stop word er
INFO:dedupe.canopy_index:Removing stop word ie
INFO:dedupe.canopy_index:Removing stop word an
INFO:dedupe.canopy_index:Removing stop word  J
INFO:dedupe.canopy_index:Removing stop word is
INFO:dedupe.canopy_index:Removing stop word e 
INFO:dedupe.canopy_index:Removing stop word du
INFO:dedupe.canopy_index:Removing stop word sur
INFO:dedupe.canopy_index:Removing stop word de
INFO:dedupe.canopy_index:Removing stop word par
INFO:dedupe.canopy_index:Removing stop word le
INFO:dedupe.canopy_index:Removing stop word des
INFO:dedupe.canopy_index:Removing stop word les
INFO:dedupe.canopy_index:Removing stop word et
INFO:dedupe.canopy_index:Removing stop word ou
INFO:dedupe.canopy_index:Removing stop word M
INFO:dedupe.blocking:10000, 2.1993132 seconds
INFO:dedupe.api:Maximum expected recall and precision
INFO:dedupe.api:recall: 0.796
INFO:dedupe.api:precision: 0.726
INFO:dedupe.api:With threshold: 0.3

Clustering...


INFO:dedupe.canopy_index:Removing stop word ar
INFO:dedupe.canopy_index:Removing stop word er
INFO:dedupe.canopy_index:Removing stop word ie
INFO:dedupe.canopy_index:Removing stop word an
INFO:dedupe.canopy_index:Removing stop word  J
INFO:dedupe.canopy_index:Removing stop word is
INFO:dedupe.canopy_index:Removing stop word e 
INFO:dedupe.canopy_index:Removing stop word du
INFO:dedupe.canopy_index:Removing stop word sur
INFO:dedupe.canopy_index:Removing stop word de
INFO:dedupe.canopy_index:Removing stop word par
INFO:dedupe.canopy_index:Removing stop word le
INFO:dedupe.canopy_index:Removing stop word des
INFO:dedupe.canopy_index:Removing stop word les
INFO:dedupe.canopy_index:Removing stop word et
INFO:dedupe.canopy_index:Removing stop word ou
INFO:dedupe.canopy_index:Removing stop word M
INFO:dedupe.blocking:10000, 2.0055242 seconds


Clustering complete. 2807 clusters found. It took 39.278 seconds.
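
The `Final predicate set` lines in the log name the blocking rules that dedupe learned. Only records that share a blocking key are compared pairwise, which keeps the number of comparisons manageable. What follows is a simplified illustration of that idea using a rule like `sameSevenCharStartPredicate`, not dedupe's own implementation. `same_seven_char_start` and `block_records` are hypothetical names, and the titles are taken from the labelling session above.

```python
from collections import defaultdict
from itertools import combinations

def same_seven_char_start(title):
    """Blocking key: the first seven characters of the title."""
    return title[:7]

def block_records(records, key_func):
    """Group record ids by their blocking key."""
    blocks = defaultdict(list)
    for record_id, title in records.items():
        blocks[key_func(title)].append(record_id)
    return blocks

records = {
    1: "Imitation de Jésus Christ",
    2: "Imitation (l') de Jésus-Christ",
    3: "De l'imitation de jesus christ par D. Morel",
    4: "Combat spirituel",
}

blocks = block_records(records, same_seven_char_start)
# Only records that fall in the same block become candidate pairs
candidate_pairs = [pair for ids in blocks.values() for pair in combinations(ids, 2)]
print(candidate_pairs)
```

Record 3 is the same work as records 1 and 2 but lands in a different block, which is presumably why several predicates are combined rather than relying on a single one.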


In [130]:
# Cell 2.4: Adds cluster codes and confidence scores back to original data frame
def save_clusters(matches, data_frame, output_file):
    """
    Given a list of cluster tuples from a Dedupe object, and the original data frame
    on which the model was trained, this function outputs a data frame and saves a csv
    of the cluster assignments for each record/row.
    
    depends:
        pandas
        numpy
    
    params:
        matches: a list of tuples, returned by Dedupe.match()
        data_frame: the original data frame on which the model was trained.
        output_file: a string; the path where the csv will be written
    
    returns:
        data_frame: the original data frame with additional information from the model
    """
    
    # Add new columns to data frame
    data_frame['cluster'] = np.nan
    data_frame['confidence'] = np.nan
    
    # Loop through matches, update relevant rows
    for counter, match in enumerate(matches):
        data_frame.loc[match[0], 'cluster'] = int(counter)
        data_frame.loc[match[0], 'confidence'] = match[1]
    
    # Write csv
    with open(output_file, 'w') as out:
        data_frame.to_csv(out)
    
    return data_frame
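
The row-by-row `.loc` updates in `save_clusters` are easy to follow. An alternative is to flatten the clusters into one row per record first. The sketch below assumes `matches` has the shape described in the docstring, i.e. a list of `(record_ids, scores)` tuples with one score per record. `flatten_matches` is a hypothetical helper, and the ids and scores are toy values for illustration only.

```python
def flatten_matches(matches):
    """Turn [(ids, scores), ...] into one (record_id, cluster, confidence) row per record."""
    rows = []
    for cluster_id, (record_ids, scores) in enumerate(matches):
        for record_id, score in zip(record_ids, scores):
            rows.append((record_id, cluster_id, float(score)))
    return rows

# Toy input in the assumed shape; ids and scores are illustrative only
toy_matches = [
    ((4059, 10084), (0.79, 0.79)),
    ((12, 57, 301), (0.91, 0.88, 0.64)),
]

rows = flatten_matches(toy_matches)
print(rows)
```

The resulting rows could then be loaded with `pd.DataFrame(rows, columns=['record_id', 'cluster', 'confidence'])` and joined onto the original table in a single step.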

In [131]:
# Cell 2.5. Invoke new function to update and export the results.
data_frame_clustered = save_clusters(matches, data_frame, output_file)

In [134]:
# Cell 2.6. Sanity check. How good are the model's assignments?
# Run this cell a few times to look at different random clusters
data_frame_clustered[data_frame_clustered['cluster'] == random.randint(0, len(matches) - 1)]

| Unnamed: 0 | ID | UUID | super_book_code | full_book_title | author_code | author_name | stated_publication_places | stated_publication_years | cluster | confidence |
|---|---|---|---|---|---|---|---|---|---|---|
| 4059 | | | spbk0000419 | Contes très mogols, enrichis de notes, avis, a... | au0000801 | Saint-Just, Simon-Pierre Mérard [de] | | | 1038.0 | 0.787093 |
| 10084 | | | spbk0000419 | Contes très mogols | au0000801 | Saint-Just, Simon-Pierre Mérard [de] | Geneva | 1770.0 | 1038.0 | 0.787093 |


In [141]:
# Cell 2.7. How much time have we saved? How many illegal books have been assigned a super_book_code?
assigned = data_frame_clustered[
    pd.notnull(data_frame_clustered['cluster']) & # which books have been assigned a cluster?
    pd.notnull(data_frame_clustered['UUID']) & # only illegal books have UUIDs
    pd.isna(data_frame_clustered['super_book_code']) # only interested in books that didn't already have super_book_codes
].shape[0]

total = data_frame_clustered[
    pd.notnull(data_frame_clustered['UUID']) &
    pd.isna(data_frame_clustered['super_book_code'])
].shape[0]

print(f"{assigned} illegal books have been given super_book_codes, of {total} that lack them.")

317 illegal books have been given super_book_codes, of 1921 that lack them.
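
Strictly speaking, Cell 2.7 counts illegal books that were placed in a cluster. The `super_book_code` itself still has to be copied across from the other members of each cluster. One way this might be done is sketched below, assuming each record is a dict with `cluster` and `super_book_code` keys. `propagate_codes` is a hypothetical helper, the codes `code_a` and `code_b` are placeholders, and clusters with no known code or with conflicting codes are left untouched for manual review.

```python
from collections import defaultdict

def propagate_codes(records):
    """Fill missing super_book_codes from cluster-mates when the cluster has exactly one known code."""
    codes_by_cluster = defaultdict(set)
    for rec in records:
        if rec["cluster"] is not None and rec["super_book_code"]:
            codes_by_cluster[rec["cluster"]].add(rec["super_book_code"])

    filled = 0
    for rec in records:
        if rec["cluster"] is None or rec["super_book_code"]:
            continue  # unclustered, or already has a code
        known = codes_by_cluster.get(rec["cluster"], set())
        if len(known) == 1:
            rec["super_book_code"] = next(iter(known))
            filled += 1
    return filled

# Toy records; code_a and code_b are placeholder codes
records = [
    {"id": 1, "cluster": 0, "super_book_code": "spbk0000419"},
    {"id": 2, "cluster": 0, "super_book_code": None},     # receives spbk0000419
    {"id": 3, "cluster": 1, "super_book_code": "code_a"},
    {"id": 4, "cluster": 1, "super_book_code": "code_b"},
    {"id": 5, "cluster": 1, "super_book_code": None},     # conflicting codes: left for review
    {"id": 6, "cluster": None, "super_book_code": None},  # unclustered
]

filled = propagate_codes(records)
print(filled, records[1]["super_book_code"])
```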


## Conclusion

Dedupe appears to do a good job of linking the illegal books to super books that are already in the dataset: 317 of the 1921 illegal books that lacked a `super_book_code` were placed in a cluster. The open question is whether the remaining ~1600 books are genuinely new titles that are not yet in our dataset, or whether the algorithm simply has low recall.

Manual investigation will be necessary to determine whether these ~1600 books already have super books in the database.

It may also be possible to tune the model further, by
1. giving it more training data, or
2. increasing the `recall_weight` so that the model cares more about finding possible matches than about being accurate when it does find them.
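
To make the second option concrete: a recall weight is typically used to pick the score threshold that maximises a recall-weighted F-measure, so a larger weight pushes the threshold down and admits more candidate matches. The sketch below illustrates that trade-off in plain Python on made-up pair scores. It is an approximation of the general idea rather than dedupe's actual code, and `choose_threshold` is a hypothetical name.

```python
def choose_threshold(probabilities, recall_weight=1.0):
    """Pick the score cut-off that maximises a recall-weighted F-measure.

    Each probability is treated as the expected chance that a pair is a true match.
    """
    ranked = sorted(probabilities, reverse=True)
    total_expected = sum(ranked)
    best_score, best_threshold = -1.0, ranked[0]
    expected_so_far = 0.0
    for n_kept, p in enumerate(ranked, start=1):
        expected_so_far += p
        recall = expected_so_far / total_expected
        precision = expected_so_far / n_kept
        score = recall * precision / (recall + recall_weight ** 2 * precision)
        if score > best_score:
            best_score, best_threshold = score, p
    return best_threshold

pair_scores = [0.9, 0.8, 0.3, 0.2]  # made-up match probabilities
balanced = choose_threshold(pair_scores, recall_weight=1)
recall_heavy = choose_threshold(pair_scores, recall_weight=3)
print(balanced, recall_heavy)
```

With these toy scores the balanced setting keeps only the two strongest pairs, while the recall-heavy setting accepts all four, at the cost of more false positives to weed out by hand.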