### Anet - CERL thesaurus data reconciliator

Caroline Vandyck - s0183936

In [1]:
# Obtaining the data from the Anet sample.

# Importing the needed libraries. 
# Only the Python standard library will be used. That way, the 
# code is accessible to many people.

import os
import sqlite3

# To use the module, a Connection object that represents the database is created:
conn = sqlite3.connect(os.path.join('authorities.sqlite'))
# os.path.join builds a path that is correct on every operating system.
# With a single argument it simply returns that argument: for this
# project the SQLite file is located in the same directory as this
# notebook, so no longer path is needed.

# After that, a Cursor object is created, which points at the connection:
c = conn.cursor()

# The given query:
query_anet = """
SELECT
    DISTINCT administration.LOI AS identifier,
    begin_in AS begin_date,
    begin_so AS begin_standardized,
    end_in AS end_date,
    end_so AS end_standardized,
    dsc_fn AS family_name,
    dsc_vn AS first_name,
    dsc_nm AS name,
    dsc AS description
FROM
    administration
    LEFT JOIN dates ON dates.LOI = administration.LOI
    LEFT JOIN identity ON identity.LOI = administration.LOI
WHERE
    administration.type = 'P'
    AND begin_standardized LIKE '16%'
LIMIT
    50
"""

# To perform SQL commands, the Cursor object's .execute() method is called:
c.execute(query_anet)

# Call .fetchall() to get a list of the matching rows (one tuple per row):
data_anet = c.fetchall()

# Closing the connection when finished:
conn.close()

# A print of the result:
print(data_anet)

[('au::34', '1691', '16910000', '1746', '17460000', 'Abeele, van den', 'Karel', '', ''), ('au::391', '1679', '16790000', '1762', '17620000', 'Amelot', 'Pieter', '', ''), ('au::469', '25/02/1660', '16600225', '19/09/1719', '17190919', 'Anneessens', 'Frans', '', ''), ('au::881', '21/03/1685', '16850321', '28/07/1750', '17500728', 'Bach', 'Johan Sebastian', '', ''), ('au::912', '16/06/1687', '16870616', '27/04/1779', '17790427', 'Backhuysen', 'Tilman-Willem', '', ''), ('au::1173', '27/04/1699', '16990427', '10/09/1768', '17680910', 'Baurscheit, van', 'Jean Pierre', '', ''), ('au::1492', '1641', '16410000', '12/02/1674', '16740212', 'Bellemans', 'Daniël', '', ''), ('au::1573', '1696', '16960000', '1765', '17650000', 'Bergé', 'Jacques', '', ''), ('au::1832', '10/02/1627', '16270210', '1713', '17130000', 'Bie, de', 'Cornelius', '', ''), ('au::2098', '1604', '16040000', '1668', '16680000', 'Boeckhorst', 'Jan', '', ''), ('au::2489', '1604', '16040000', '1666', '16660000', 'Bosman', 'Theodoor',

In [2]:
# Here, I put the data obtained in the previous cell into dictionaries
# (one dictionary per person, built from that person's tuple). This step
# could be skipped. I did this so that later on, when creating the query,
# I could use the keys 'first_name', 'last_name', or 'yob' instead of the
# index of the tuple to obtain metadata. It makes the code more readable
# and makes it easier to keep track of the variables and steps.


def create_dict(data_anet: list) -> list:
    """
    The input is the list of tuples obtained from Anet, the output a list 
    of dictionaries containing the keys: tag, first name, surname, and year of birth. 
    This is easier for obtaining the right metadata later on.
    """
    dict_anet = []
    for person_entry in data_anet:
        person = {}
        # select metadata
        tag = person_entry[0]
        first_name = person_entry[-3]
        last_name = person_entry[-4]
        # only select the year
        yob = person_entry[1]
        yob = yob.split('/')
        yob = yob[-1]
        person = {'tag': tag,
                  'first_name': first_name,
                  'last_name': last_name,
                  'yob': yob
                 }
        dict_anet.append(person)
    return dict_anet

dict_anet = create_dict(data_anet)
print(dict_anet)

[{'tag': 'au::34', 'first_name': 'Karel', 'last_name': 'Abeele, van den', 'yob': '1691'}, {'tag': 'au::391', 'first_name': 'Pieter', 'last_name': 'Amelot', 'yob': '1679'}, {'tag': 'au::469', 'first_name': 'Frans', 'last_name': 'Anneessens', 'yob': '1660'}, {'tag': 'au::881', 'first_name': 'Johan Sebastian', 'last_name': 'Bach', 'yob': '1685'}, {'tag': 'au::912', 'first_name': 'Tilman-Willem', 'last_name': 'Backhuysen', 'yob': '1687'}, {'tag': 'au::1173', 'first_name': 'Jean Pierre', 'last_name': 'Baurscheit, van', 'yob': '1699'}, {'tag': 'au::1492', 'first_name': 'Daniël', 'last_name': 'Bellemans', 'yob': '1641'}, {'tag': 'au::1573', 'first_name': 'Jacques', 'last_name': 'Bergé', 'yob': '1696'}, {'tag': 'au::1832', 'first_name': 'Cornelius', 'last_name': 'Bie, de', 'yob': '1627'}, {'tag': 'au::2098', 'first_name': 'Jan', 'last_name': 'Boeckhorst', 'yob': '1604'}, {'tag': 'au::2489', 'first_name': 'Theodoor', 'last_name': 'Bosman', 'yob': '1604'}, {'tag': 'au::2631', 'first_name': 'Petr

The dictionaries contain all of the metadata I will be using for this project: the tag, the person's surname, first name, and year of birth.
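
As an aside, the standard library can also hand back rows that are addressable by column name, which gives the same readability as the dictionaries above. The following is only a minimal sketch: the in-memory table `people` and its columns are hypothetical stand-ins, not the real Anet schema.

```python
import sqlite3

# In-memory stand-in for the Anet sample; the table and column names
# are hypothetical and only serve this illustration.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can now be read by column name
conn.execute(
    "CREATE TABLE people (identifier TEXT, family_name TEXT, "
    "first_name TEXT, begin_date TEXT)"
)
conn.execute(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    ("au::881", "Bach", "Johan Sebastian", "21/03/1685"),
)
rows = conn.execute("SELECT * FROM people").fetchall()
conn.close()

# Each sqlite3.Row converts directly into a plain dictionary.
people = [dict(row) for row in rows]
print(people[0]["family_name"], people[0]["begin_date"].split("/")[-1])
```

With this approach the tuple indices (`person_entry[-3]` and so on) are no longer needed, at the cost of depending on the column aliases chosen in the SQL query.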

The program queries CERL with the surname obtained from Anet and retrieves the first 50 records.
For those 50 (or fewer) records, it checks whether the first name and year of birth match Anet. In the end, every record obtained from CERL is returned with a statement of which metadata match Anet. Accordingly, I kept the use of Boolean operators to a minimum in this project. Instead of performing multiple queries per record (such as surname + first name), I searched for metadata in the result of querying the CERL thesaurus with the surname only. Boolean operators were used, however, when a surname consists of multiple words.
 
I chose the person's surname to query CERL with because it seemed the most informative field and the least prone to differences across databases. The surname is standardized before querying. I chose not to combine the surname and first name in the query, because many variants of a first name are possible.
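
To make the surname-only query concrete, here is a minimal sketch of how a multi-word surname could be turned into an SRU request URL. The base URL and parameters are the ones used in this notebook; the function name `build_sru_url` is hypothetical, and accents are not stripped in this simplified version.

```python
from urllib.parse import quote

SRU_BASE = "https://data.cerl.org/thesaurus/_sru"

def build_sru_url(surname: str, maximum_records: int = 50) -> str:
    """Build a CERL SRU URL that searches on every word of the surname."""
    # 'Baurscheit, van' -> ['baurscheit', 'van']
    terms = surname.replace(",", " ").casefold().split()
    # Join the words with the Boolean operator 'and', then make it URL safe.
    query = quote(" and ".join(terms))
    return (
        f"{SRU_BASE}?version=1.2&operation=searchRetrieve"
        f"&query={query}&startRecord=1"
        f"&maximumRecords={maximum_records}&recordSchema=marcxml"
    )

print(build_sru_url("Baurscheit, van"))
```

The only Boolean operator involved is the `and` between the parts of the surname; everything else (first name, year of birth) is checked afterwards on the harvested records.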

The first name is of course also important, yet there are many possible variants of every name. That is why the first name registered in Anet is compared to a list of the first-name variants in CERL. This is also why I preferred doing one query (consisting of the surname) in CERL and then filtering the records on other metadata: most variants of someone's first name are registered under tag 200 or 400 with code b.

Lastly, the year of birth seemed important because, if a first name in Anet does not correspond to any variant of the first name in CERL, the "correct" match can still be found. Conversely, the first name complements the year of birth, because the correct match can still be found if the year of birth is unknown or uncertain (sometimes a birth date is estimated, or differs by a year depending on the source used). First name and year of birth therefore complement each other very well. I did not use the year of death (or the lifespan), because some years of death are missing in the Anet database (see also the possible improvements at the end of the notebook).
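
Because birth years can differ by a year between sources, a tolerant comparison is one conceivable variant of the exact comparison used in this notebook. The sketch below is an assumption for illustration only: the function `years_match` and its `tolerance` parameter are hypothetical and are not part of the reconciliator.

```python
def years_match(year_a, year_b, tolerance=1):
    """Return True when both years are known and differ by at most `tolerance`."""
    try:
        return abs(int(year_a) - int(year_b)) <= tolerance
    except (TypeError, ValueError):
        # A missing or non-numeric year ('' or None) gives no evidence of a match.
        return False

print(years_match("1685", "1685"))  # identical years
print(years_match("1611", "1612"))  # one year apart
print(years_match("", "1685"))      # unknown year in one database
```

A wider tolerance finds more candidates but also more false positives, so the value would have to be chosen with the data in mind.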

In [3]:
# In this cell, the query from Anet will be obtained and 
# put into CERL.

# Importing the needed libraries. 
# Again, only the Python standard library will be used.

from urllib.parse import quote
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
import time
import unicodedata

# Functions

def get_query(person_anet: dict) -> str:
    """
    This is a function that obtains the query for the CERL API,
    the query will be the surname of a person from Anet.
    The input is one dictionary created from the Anet database,
    The output a string (the query).
    """
    last_name = person_anet.get('last_name') # this is why the dictionaries created in the second cell are convenient
    return last_name

def strip_accents(name: str) -> str:
    """
    Return the same string as the input, but without any accents.
    CERL yields the right results regardless of using accents. However, 
    I standardize them in all the names for safety (although they are 
    very useful, they can be spelled differently). When working with
    other databases, this could be important.
    """
    return ''.join(c for c in unicodedata.normalize('NFD', name)
                   if unicodedata.category(c) != 'Mn')

def clean_query(last_name: str) -> str:
    """
    Return input in a standardized, URL safe manner.
    The input as well as the output of this function is a string.
    """
    query = last_name.strip() # delete any possible extra spaces
    query = query.casefold() # casefold everything
    query = query.replace(",", "") # delete commas in the name (baurscheit, van)
    query = query.replace(" ", " and ") # transform spaces into boolean 'and' for the link (baurscheit%20and%20van)
#     query = query.replace("'", "") # look at the possible improvements at the end of the notebook (*)
    query = strip_accents(query)
    query = unicodedata.normalize("NFC", query) # for safety, so that every character that looks the same, is also the same under the hood.
    query = quote(query) # make it URL safe
    return query

# Constants for the next function:

cerl_prefix = "https://data.cerl.org/thesaurus/_sru?version=1.2&operation=searchRetrieve&query="
cerl_suffix = "&startRecord=1&maximumRecords=50&recordSchema=marcxml" # at most 50 records (this seemed a good balance), returned as MARCXML

def query_CERL(query: str) -> bytes:
    """
    Harvest CERL metadata.
    The input is a string (the cleaned query), the output is bytes.
    Query the CERL thesaurus; return the response or exit with an error code.
    """
    time.sleep(2) # to be polite to the API
    url = cerl_prefix + query + cerl_suffix
    try:
        with urlopen(url) as response: # 'with' closes the connection
            # automatically, also when something goes wrong, so it is
            # good practice to use it. urlopen behaves like the usual
            # "open".
            return response.read() # return response
    except HTTPError as HTTPerr:
        exit(HTTPerr.code)
    except URLError as URLerr:
        exit(URLerr)

# To check whether the cleaning and creating of the query
# are as I expect (on a surname with more than one word):

query_test = get_query(dict_anet[5])
print(query_test)
query_test = clean_query(query_test)
print(query_test) 

harvested = query_CERL(query_test)
print(str(harvested))

# A second test: I will continue to test with Bach since it 
# has a "perfect" match in the CERL thesaurus.

query = get_query(dict_anet[3])
print(query)
query = clean_query(query)
print(query)

harvested = query_CERL(query)
print(str(harvested))

Baurscheit, van
baurscheit%20and%20van
b'<?xml version="1.0"?>\n<srw:searchRetrieveResponse xmlns:srw="http://www.loc.gov/zing/srw/" xmlns:diag="http://www.loc.gov/zing/srw/diagnostic/">\n<srw:version>1.2</srw:version>\n<srw:numberOfRecords>2</srw:numberOfRecords>\n<srw:records>\n\n<srw:record>\n<srw:recordSchema>info:srw/schema/1/marcxml-v1.1</srw:recordSchema>\n<srw:recordPacking>xml</srw:recordPacking>\n<srw:recordIdentifier>cnp00516245</srw:recordIdentifier>\n<srw:recordData>\n\n<record xmlns="http://www.loc.gov/MARC21/slim"><leader>0RL 00000nz   22000003  45  </leader><controlfield tag="001">cnp00516245</controlfield><datafield tag="100" ind1=" " ind2=" "><subfield code="a">20040713xmuly50      ba</subfield></datafield><datafield tag="110" ind1=" " ind2=" "><subfield code="a">0</subfield></datafield><datafield tag="200" ind1=" " ind2="1"><subfield code="a">Baurscheit</subfield><subfield code="b">Jan Pieter</subfield><subfield code="e">van</subfield><subfield code="r">der J\xc3\xbc

None


In [4]:
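# Create dictionaries containing metadata on the basis of the previous output.

# Note: lxml is a third-party library, so it is the one exception to
# using only the Python standard library in this notebook.
import lxml.etree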

namespace = r"{http://www.loc.gov/MARC21/slim}" # raw string is safer

def parse(harvested: bytes) -> list:
    """
    Parse MARCXML to dict with structure:
    {index number: {"tag": tag, "code": code, "data": data, "id": identifier}}
    Append the dictionaries per record to a list.
    The input is harvested xml (bytes) from the previous cell.
    """
    root = lxml.etree.fromstring(harvested) # fromstring works on bytes
    records = [] # to append all the records separately
    
    for record in root.iter(f"{namespace}record"): # iterate over record elements in the given namespace
        metadata = {}
        index = 0
        
        for datafield in record.iter(f"{namespace}datafield"): # iterate over datafield elements in the given namespace, use tags and subfields and put those into dictionaries
            tag = datafield.get("tag") # XML attributes behave like a dictionary
            for subfield in datafield:
                index += 1
                metadata[index] = {"tag": tag,
                                   "code": subfield.get("code"),
                                   "data": subfield.text,
                                   "id": None # here, as well as in the controlfield loop below, I create empty values so that every dictionary has the same keys. That way, a final dictionary containing the "id" can be added at the end of the dictionary of dictionaries.
                                  }
        
        for controlfield in record.iter(f"{namespace}controlfield"): # create another loop because the controlfield with the id tag is located at another level than the datafields/subfields
            index += 1
            metadata[index] = {"tag": None,
                               "code": None,
                               "data": None,
                               "id": controlfield.text}
        records.append(metadata)
    return records

# All necessary steps, one after the other, for the fourth person
# in the database (index 3, Bach). The first possible match from
# CERL (based on surname) is printed.

query = get_query(dict_anet[3])
query = clean_query(query)
print(query)
query = query_CERL(query)

# New:

parsed = parse(query)
print(parsed[0])

bach
{1: {'tag': '035', 'code': 'z', 'data': 'cnp00394145', 'id': None}, 2: {'tag': '035', 'code': 'z', 'data': 'cnp01373270', 'id': None}, 3: {'tag': '035', 'code': 'z', 'data': 'cnp00030308', 'id': None}, 4: {'tag': '100', 'code': 'a', 'data': '20200825xmuly50      ba', 'id': None}, 5: {'tag': '110', 'code': 'a', 'data': '0', 'id': None}, 6: {'tag': '120', 'code': 'a', 'data': 'b', 'id': None}, 7: {'tag': '200', 'code': 'a', 'data': 'Bach', 'id': None}, 8: {'tag': '200', 'code': 'b', 'data': 'Johann Christian', 'id': None}, 9: {'tag': '200', 'code': 'c', 'data': 'NL', 'id': None}, 10: {'tag': '200', 'code': '5', 'data': 'NeHKB', 'id': None}, 11: {'tag': '200', 'code': 'a', 'data': 'Bach', 'id': None}, 12: {'tag': '200', 'code': 'b', 'data': 'Johann Christian', 'id': None}, 13: {'tag': '200', 'code': 'c', 'data': 'FR', 'id': None}, 14: {'tag': '200', 'code': '5', 'data': 'BNF', 'id': None}, 15: {'tag': '200', 'code': 'a', 'data': 'Bach', 'id': None}, 16: {'tag': '200', 'code': 'b', 'd
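
For readers who want to stay within the standard library, the same kind of parsing can be sketched with `xml.etree.ElementTree`. This is only an illustration under assumptions: the sample XML is hand-written in the shape of the response above (the identifier `cnp00000000` is a placeholder), and the function `parse_records` is hypothetical and returns a simpler structure than `parse`.

```python
import xml.etree.ElementTree as ET

MARC_NS = "{http://www.loc.gov/MARC21/slim}"

# Hand-written sample in the shape of a CERL response; not harvested data.
sample = b"""<records>
<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">cnp00000000</controlfield>
  <datafield tag="200" ind1=" " ind2="1">
    <subfield code="a">Bach</subfield>
    <subfield code="b">Johann Sebastian</subfield>
  </datafield>
</record>
</records>"""

def parse_records(xml_bytes: bytes) -> list:
    """Return one dict per record: its identifier and (tag, code, text) triples."""
    root = ET.fromstring(xml_bytes)
    records = []
    for record in root.iter(f"{MARC_NS}record"):
        # The identifier lives in controlfield 001.
        identifier = record.findtext(f"{MARC_NS}controlfield[@tag='001']")
        subfields = [
            (datafield.get("tag"), subfield.get("code"), subfield.text)
            for datafield in record.iter(f"{MARC_NS}datafield")
            for subfield in datafield.iter(f"{MARC_NS}subfield")
        ]
        records.append({"id": identifier, "subfields": subfields})
    return records

print(parse_records(sample))
```

Keeping the identifier as a separate key avoids the empty placeholder values, but the functions in the following cells would have to be adapted to this structure.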

In [5]:
# The information we retrieved in the dictionary in the 
# previous cell, is all from people with the same surname.
# However, their first name and year of birth are also 
# important to match on. That is what will be obtained now.

person_anet = dict_anet[3] # easier to test on and to keep track of input data
person_CERL = parsed[1]

def clean_name(name: str) -> str:
    """
    The input is a string, the output is the cleaned
    version of that string. It differs from clean_query
    in that no URL-safe version is created. It is applied
    to the first names from Anet as well as from CERL, so
    that the names are standardized in the same manner
    (e.g. j. s. = j s, johan-sebastian = johan sebastian, etc.).
    That way, it overcomes different transcriptions.
    """
    name = name.strip()
    name = name.casefold()
    name = name.replace(".", "")
    name = name.replace(",", "")
    name = name.replace("-", " ")
#     name = name.replace("'", " ")
    name = strip_accents(name)
    name = unicodedata.normalize("NFC", name)
    return name

def get_first_name(person_CERL: dict) -> list:
    """
    Obtain the first name(s) from CERL.
    The input is a record from CERL, the
    output a list of standardized names (strings).
    """
    first_name_list = []
    for entry in person_CERL:
        if person_CERL[entry]['tag'] in ('200', '400') and person_CERL[entry]['code'] == 'b': # select all first names and their variants
            first_name = person_CERL[entry]['data']
            # standardize the names:
            first_name = clean_name(first_name)
            first_name_list.append(first_name)
    return first_name_list

# Test: 
first_name_list = get_first_name(person_CERL)
print(first_name_list)

def get_yob(person_CERL: dict) -> str:
    """
    Obtain the year of birth (yob) from CERL.
    The input is a record from CERL, the
    output a string (the yob).
    """
    for entry in person_CERL:
        if person_CERL[entry]['tag'] == '340' and person_CERL[entry]['code'] == 'a': # only do this for the first time the right tag and code is found, otherwise the dates might differ
            yob = person_CERL[entry]['data']
            # only select the year
            yob = yob.split('-') # if the life span is given, split it
            yob = str(yob[0]) # take the first year
            yob = yob.split('.') # if there is an actual *date* of birth given, split it
            yob = str(yob[-1]) # select the year
            return yob

# Test:
yob = get_yob(person_CERL)    
print(yob)

['johann sebastian', 'johann sebastian', 'j s', 'johann', 'johann', 'g s', 'giov seb', 'giov sebast', 'giovanni s', 'giovanni sebastian', 'giovanni sebastiano', 'i s', 'iogann s', "iogann sebast'jan", 'iogann sebastjan', 'iogann sebastʹjan', 'j s', 'j seb', 'j s', 'jan sebastian', 'jean s', 'jean sebastian', 'jean sebastien', 'jean sebastien', 'jean sebastien', 'jean sebastien', 'joannes sebastianus', 'joh seb', 'joh sebas', 'joh sebast', 'joh sebastian', 'joh seb', 'johan sebastian', 'johann', 'johann s', 'johann seb', 'johann sebastian', 'johann sebastian', 'johannes s', 'johannes sebastian', 'john sebastian', 'sebastian', 'sebastiano', '', 'i s', 'iogann sebastian', "johan sebasti'an", 'iogann sebastian', 'johannes', 'hans']
1685
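
The chain of `split` calls in `get_yob` (and in `create_dict` for Anet) can also be expressed as a single search for the first four-digit year. This is a sketch of an alternative, not the method used above; the function name `first_year` is hypothetical, and the date formats in the examples are assumptions based on the splitting logic in this notebook.

```python
import re

def first_year(date_text):
    """Return the first stand-alone four-digit run in date_text, or None."""
    if not date_text:
        return None
    # Four digits that are not part of a longer run of digits.
    match = re.search(r"(?<!\d)\d{4}(?!\d)", date_text)
    return match.group(0) if match else None

print(first_year("25/02/1660"))             # Anet-style date
print(first_year("1685-1750"))              # lifespan
print(first_year("21.03.1685-28.07.1750"))  # lifespan with full dates
```

One function then serves both databases, and a missing date yields `None` instead of an empty string.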


In [6]:
# After obtaining the metadata, now the matching will happen.

def match_first_name(person_anet: dict, person_CERL: dict) -> bool:
    """
    See if the first name from Anet matches any of the first names
    from CERL, obtained by using the function get_first_name().
    The inputs are a dictionary from the Anet data and one from
    the CERL data, the output is a bool.
    """
    first_name_anet = person_anet.get('first_name') # This is why the dictionaries created in the second cell are convenient
    # standardize the first name obtained from Anet the same way as the ones from CERL:
    first_name_anet = clean_name(first_name_anet)
    return first_name_anet in get_first_name(person_CERL)

def match_yob(person_anet: dict, person_CERL: dict) -> bool:
    """
    See if the yob from Anet matches the yob
    from CERL, obtained by using the function get_yob().
    The inputs are a dictionary from the Anet data and one from
    the CERL data, the output is a bool.
    """
    yob_anet = person_anet.get('yob')
    return yob_anet == get_yob(person_CERL)

print(match_first_name(person_anet, person_CERL))
print(match_yob(person_anet, person_CERL))

True
True


In [7]:
# Obtaining the id from CERL.

def get_id_CERL(person_CERL: dict) -> str:
    """
    Obtain the identifier from CERL.
    The input is a record from CERL, the
    output a string (the identifier).
    """
    for entry in person_CERL:
        if person_CERL[entry]['id'] is not None:
            return person_CERL[entry]['id']

# Test:
identifier = get_id_CERL(person_CERL)
print(identifier)

cnp01494285


In [8]:
# The matching.

def final_matching(person_anet: dict, parsed: list) -> None:
    """
    See how the first 50 (or fewer) records from CERL match up with the
    query from Anet.
    The inputs are a dictionary from the Anet database and the list of
    dictionaries obtained from CERL. The function prints the lists of
    potential matches and how they match up with the query from Anet;
    it does not return a value.
    """
    match_all = []
    match_last_name_first_name = []
    match_last_name_yob = []
    match_last_name = []
    for person_CERL in parsed:
        # evaluate each check only once per record
        first_name_matches = match_first_name(person_anet, person_CERL)
        yob_matches = match_yob(person_anet, person_CERL)
        identifier = get_id_CERL(person_CERL)
        if first_name_matches and yob_matches:
            match_all.append(identifier)
        elif first_name_matches:
            match_last_name_first_name.append(identifier)
        elif yob_matches:
            match_last_name_yob.append(identifier)
        else:
            match_last_name.append(identifier)
    return print(f"These are the potential matches for {person_anet.get('tag')}:\n {len(match_all)} matche(s) based on first name, surname, and year of birth: {match_all}\n {len(match_last_name_first_name)} matche(s) based on first name and surname, not on year of birth: {match_last_name_first_name}\n {len(match_last_name_yob)} matche(s) based on surname and year of birth, not on first name: {match_last_name_yob}\n {len(match_last_name)} matche(s) only based on surname {match_last_name}\n")
            
final_matching(person_anet, parsed)

These are the potential matches for au::881:
 1 matche(s) based on first name, surname, and year of birth: ['cnp01494285']
 0 matche(s) based on first name and surname, not on year of birth: []
 0 matche(s) based on surname and year of birth, not on first name: []
 49 matche(s) only based on surname ['cnp02308697', 'cni00104787', 'cni00104786', 'cnp00967020', 'cnp01429388', 'cnp00170930', 'cnp00955028', 'cnp01005766', 'cnp02063731', 'cnp01145500', 'cnp00510523', 'cnp00473703', 'cnp01433379', 'cnp00628996', 'cnp01373271', 'cnp01373272', 'cnp00982331', 'cnp00384818', 'cnp00940347', 'cnp00625466', 'cnp01471874', 'cnp01201540', 'cnp00829185', 'cnp00682741', 'cnp01064929', 'cnp01977475', 'cnp02111183', 'cnp02343818', 'cnp02118478', 'cnp00995018', 'cnp01971830', 'cnp01005742', 'cnp00918384', 'cnp01145004', 'cnl00049101', 'cnp02294982', 'cnp00476719', 'cnp00100668', 'cni00104784', 'cnp02010450', 'cnp01942060', 'cnp00469224', 'cnp00370808', 'cnp01298415', 'cnp00338288', 'cnp01145504', 'cnp0221

In [9]:
# The final output.

def reconciliator(person_anet: dict) -> None:
    """
    The function that combines everything.
    The input is a dictionary from the Anet database.
    The potential matches are printed by final_matching();
    nothing is returned.
    """
    query = get_query(person_anet)
    query = clean_query(query)
    harvested = query_CERL(query)
    parsed = parse(harvested)
    final_matching(person_anet, parsed)

index = 0
for person_anet in dict_anet:  # iterate over the dictionaries created from the anet database and apply the function
    index += 1
    print(f"{index}. {person_anet.get('first_name')} {get_query(person_anet)}")
    reconciliator(person_anet)
print("Please keep in mind that only the first fifty CERL records are retrieved. It is possible that there are more.")
    
# On my device, this cell sometimes raises an error (ValueError). It does not happen on every run,
# and when it does, it happens at a different iteration each time, so it does not seem to be caused
# by the code itself. If you get an error while executing this cell, please just run it again.

1. Karel Abeele, van den
These are the potential matches for au::34:
 0 matche(s) based on first name, surname, and year of birth: []
 0 matche(s) based on first name and surname, not on year of birth: []
 1 matche(s) based on surname and year of birth, not on first name: ['cnp02308725']
 4 matche(s) only based on surname ['cnp01179214', 'cnp02312326', 'cni00104002', 'cnp02335479']

2. Pieter Amelot
These are the potential matches for au::391:
 0 matche(s) based on first name, surname, and year of birth: []
 0 matche(s) based on first name and surname, not on year of birth: []
 0 matche(s) based on surname and year of birth, not on first name: []
 26 matche(s) only based on surname ['cnp01004831', 'cnp01934592', 'cnp01024742', 'cnp02337434', 'cnp01316365', 'cnp01040535', 'cnp01345322', 'cnp02342788', 'cnp01355464', 'cnp01939526', 'cnp01311377', 'cnp00099843', 'cnp02368398', 'cnp02132331', 'cnp01305226', 'cnp01401420', 'cnp01312541', 'cnp02369207', 'cnp02078319', 'cnc00032800', 'cnc0001

These are the potential matches for au::2992:
 0 matche(s) based on first name, surname, and year of birth: []
 0 matche(s) based on first name and surname, not on year of birth: []
 0 matche(s) based on surname and year of birth, not on first name: []
 50 matche(s) only based on surname ['cnp00377270', 'cnp00072976', 'cnp01053054', 'cnp02002904', 'cnp00997161', 'cnp02004520', 'cnp01259979', 'cnp01261860', 'cnp02104815', 'cnp00080093', 'cnp00161386', 'cnp01952029', 'cnp01266060', 'cni00017613', 'cnp00161450', 'cnp00158177', 'cnp01269258', 'cni00017350', 'cnp00973496', 'cnp01266587', 'cnp01270585', 'cnp01268625', 'cnp01260460', 'cnp01261113', 'cnp00185842', 'cnp02245392', 'cnp02309213', 'cnp01081036', 'cnp01265386', 'cnp02308839', 'cnp02321425', 'cnp02317374', 'cnp02318130', 'cnp02086829', 'cnp01270065', 'cnp02324968', 'cnp00343533', 'cnp01260973', 'cnp01127537', 'cnp01267843', 'cnp01269614', 'cnp01270646', 'cnp01168487', 'cnp02185924', 'cnp01006831', 'cnp01129036', 'cnp02014651', 'cni0

These are the potential matches for au::6600:
 0 matche(s) based on first name, surname, and year of birth: []
 0 matche(s) based on first name and surname, not on year of birth: []
 0 matche(s) based on surname and year of birth, not on first name: []
 0 matche(s) only based on surname []

35. Lucas Faydherbe
These are the potential matches for au::6600:
 1 matche(s) based on first name, surname, and year of birth: ['cnp01367848']
 0 matche(s) based on first name and surname, not on year of birth: []
 0 matche(s) based on surname and year of birth, not on first name: []
 1 matche(s) only based on surname ['cnp02097665']

36. Pieter Fardé
These are the potential matches for au::6622:
 0 matche(s) based on first name, surname, and year of birth: []
 1 matche(s) based on first name and surname, not on year of birth: ['cnp02208149']
 0 matche(s) based on surname and year of birth, not on first name: []
 0 matche(s) only based on surname []

37. Willem Fesch, de
These are the potential mat
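
The notebook does not establish what causes the intermittent ValueError mentioned in the last cell. Assuming it is transient, re-running by hand could be replaced by a small retry helper. The sketch below is hypothetical (the name `retry` and its parameters are not part of the notebook) and simply calls a function again a limited number of times.

```python
import time

def retry(func, attempts=3, delay=0.0, exceptions=(ValueError,)):
    """Call func() up to `attempts` times; re-raise the last error if all fail."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except exceptions as error:
            last_error = error
            time.sleep(delay)  # wait before trying again
    raise last_error

# Demonstration with a stand-in function that fails twice before succeeding.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ValueError("temporary problem")
    return "ok"

print(retry(flaky))
```

In the loop of the last cell, this could be used as `retry(lambda: reconciliator(person_anet), delay=2)`, although a persistent error would still surface after the final attempt.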

Observations of results:
- Although CERL includes many variants of people's names, not all variants are always included. That is why there is regularly only a match based on surname and year of birth, whilst a "perfect match" does exist. This shows that it is important to include the year of birth. It could perhaps be solved by manually implementing a list of name variants (e.g. Peter = Petrus = Pieter). This is, for example, the case for Jean Pierre van Baurscheit in Anet, who can be found as Jan Pieter in CERL (and not as Jean Pierre), and for Karel van den Abeele (Anet), who can be found as Carolus van den Abeele in CERL (and not as Karel).
- Years of birth are sometimes recorded differently in different databases (depending on the source used, they can differ by a year). Because of that, a "perfect match" sometimes does not show up. This is the case for Jan Fyt and Pieter Fardé. It is also the case for Hélène Fourment, but there the year of birth differs by 15 years, so chances are small that the records are about the same person. This is why the year of birth is important.
- In the case of Willem de Fesch, there are two people with that name born in the same year, but who died in different years (in CERL). Accordingly, it could be beneficial to include the year of death as well. However, except for de Fesch, this does not seem to be a problem for any of the other people. It has to be decided whether the extra output is worth it.
- Sometimes, a "perfect match" for an Anet record is not found, although it is registered in CERL. That is because maximumRecords is set to 50 and the right match is located after the first 50 records. This can be solved by increasing maximumRecords; however, this will also yield less relevant results for the other records. The right balance has to be found. This is for instance the case for Jacques Bergé and Nicolaas Heinsius.
- All in all, these results were as expected.
- (This list is by no means exhaustive, these are just observations that I found important to mention.)
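
The manual list of name variants suggested in the first observation could look roughly like this. It is only a sketch: the groups contain just the examples mentioned above, a real table would need careful curation, and the names `VARIANT_GROUPS`, `canonical` and `same_first_name` are hypothetical.

```python
# Hand-made variant groups, limited to the examples from the observations.
VARIANT_GROUPS = [
    {"peter", "petrus", "pieter"},
    {"karel", "carolus"},
    {"jan pieter", "jean pierre"},
]

def canonical(name: str) -> str:
    """Map a first name to one fixed representative of its variant group."""
    name = name.strip().casefold()
    for group in VARIANT_GROUPS:
        if name in group:
            return min(group)  # alphabetically first member as representative
    return name

def same_first_name(anet_name: str, cerl_names: list) -> bool:
    """True if any CERL first name is the Anet name or a known variant of it."""
    target = canonical(anet_name)
    return any(canonical(candidate) == target for candidate in cerl_names)

print(same_first_name("Karel", ["carolus"]))
print(same_first_name("Jean Pierre", ["jan pieter"]))
```

Such a table only helps for the variants that were entered by hand, so it complements the year of birth rather than replacing it.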

Possible improvements:
- (*) Using a library such as nltk for handling apostrophes might be beneficial. However, by focussing on the standard library you have more control and more chance that everyone understands it. I thought about replacing all "'" with a space or nothing, especially because of the case of Faid'herbe. Faid'herbe (Anet) does not yield a match in CERL, although the record is there as "Faidherbe". I saw, however, that he is also registered in Anet as Faydherbe, which does yield a match in CERL. Accordingly, I opted to just leave the "'" in. The lines of code are still there though, in case they are needed.
- In line with the previous point, a lot of choices have to be made with regard to standardization. In the comments of the code, I tried to explain the choices and why I made them. Of course there are always different preferences and options.
- As already stated, it would have been possible to use more Boolean operators in the query, instead of "filtering" the results of one query. However, it yields similar results and the output lists for each record are satisfactory.
- Indexing would be beneficial. However, the query is relatively small, so speed is not of the essence here.
- Ranking the results could also be useful (ordering the matches from more to less relevant using a function). But the lists also provide information on how useful a match can be (a match based on all three of the metadata is more useful than one based on two or one).
- It is possible to set maximumRecords higher to include even more possible matches (as mentioned before). However, there would then be more undesired potential matches as well. The balance has to be good. For instance, Jacques Bergé does not yield a match with maximumRecords set to 50, but he does when it is set to 100. You can adjust this according to your preference.
- You could also opt for not using strip_accents. In this case, CERL yields the same number of results, with or without accents. The function is not really needed when making a reconciliator between Anet and CERL, but could be for other databases.
- You could also choose to not print the list of matches based on only the surname, since there is often nothing relevant present. This is also personal preference.
- I create empty values in the "parse" function. This might not be the most memory-efficient approach, but having the same keys in every dictionary makes it easier to iterate over them without getting KeyErrors. Those could be handled with try and except, but that is not recommended.
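
The ranking mentioned in the list above could be sketched as follows. This is an illustration under assumptions: the candidate dictionaries, their keys and the identifiers "A", "B" and "C" are hypothetical, and the score simply counts how many of the two extra checks (first name, year of birth) succeeded.

```python
def rank_candidates(candidates: list) -> list:
    """Sort candidates so that those matching more metadata come first."""
    def score(candidate: dict) -> int:
        # Each successful check counts as one point; the surname always matches.
        return int(candidate["first_name_match"]) + int(candidate["yob_match"])
    return sorted(candidates, key=score, reverse=True)

# Hypothetical candidates for one Anet record.
candidates = [
    {"id": "A", "first_name_match": False, "yob_match": False},
    {"id": "B", "first_name_match": True, "yob_match": True},
    {"id": "C", "first_name_match": False, "yob_match": True},
]
print([candidate["id"] for candidate in rank_candidates(candidates)])
```

A single ranked list carries the same information as the four separate lists, provided the score is shown next to each identifier.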