# Comparing the Models to Other Matching Algorithms

This notebook compares the output of the author matching pipeline to deterministic and probabilistic matching algorithms.

It reads in the records that were identified as "Latin" by the Greek-Latin Identification model. It normalizes the value of the "author" column and seeks to match it to the authorized name form in the DLL Catalog. It tries three methods for matching: deterministic, probabilistic (fuzzy), and inference-based (using the inferences of the Author Identification model).

The input data is very noisy, with multiple name forms for the same author and multiple rows for some authors (see "Analysis" below).

To run this notebook, you also need the following files:

- data/authors_db.csv
- data/works_db.csv
- output/author_inferences.csv
- output/latin_authors.csv

In [None]:
%pip install rapidfuzz

## Load the Necessary Modules

This notebook uses the following modules:

- `ast`: (Abstract Syntax Tree) for parsing the dictionary strings stored in the model output
- `csv`: for processing the output of the model
- `pandas`: for opening and working with the CSV files
- `rapidfuzz`: for comparing the models' inferences to deterministic and probabilistic matching algorithms
- `utilities`: for some local helper functions. Note: this is a local file, not a library available through repositories like conda-forge or PyPI.

In [192]:
import ast
import csv
import pandas as pd
from rapidfuzz import fuzz, process
import utilities

## Load the Data

### Load the DLL Catalog Data

The authority and work records from the DLL Catalog are loaded from CSV files and converted into Python dictionaries. These will be used as lookup tables to translate the outputs from the Author Reconciliation model into more comprehensible information for humans (e.g., "Julius Caesar", not "A4644").

In [None]:
# Read in the authors data
authors = pd.read_csv('./authors_db.csv',encoding='utf-8',quotechar='"')
# Read in the works data
works = pd.read_csv('./works_db.csv',encoding='utf-8',quotechar='"')

# Change the names of the columns to be lower case without spaces or punctuation
authors = authors.rename(columns={'Variant':'variant_name','Authorized Name':'authorized_name','DLL Identifier (Author)':'dll_id_author'})
works = works.rename(columns={'Title':'title','DLL Identifier (Work)':'dll_id_work','DLL Identifier (Author)': 'dll_id_author'})

# Prepare the lookup dictionaries of variant author names and titles
variant_to_authorized, title_to_work = utilities.prepare_dicts(authors,works)
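
`utilities.prepare_dicts` is a local helper whose source is not shown in this notebook. As a rough sketch of the kind of lookup table it might build, assuming each variant name maps to a record that contains the authorized name and the DLL identifier (the function name `build_variant_lookup`, the sample rows, and the variant spellings are made up for illustration):

```python
import csv
import io

# Hypothetical sample in the renamed column layout used above.
SAMPLE_AUTHORS_CSV = """variant_name,authorized_name,dll_id_author
caesar julius,"caesar, julius",A4644
c iulius caesar,"caesar, julius",A4644
"""

def build_variant_lookup(csv_text):
    """Map each variant name to a dict with the authorized name and DLL ID."""
    lookup = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        lookup[row["variant_name"]] = {
            "authorized_name": row["authorized_name"],
            "dll_id_author": row["dll_id_author"],
        }
    return lookup

variant_lookup = build_variant_lookup(SAMPLE_AUTHORS_CSV)
print(variant_lookup.get("c iulius caesar"))
print(variant_lookup.get("not in the catalog"))  # None: there is no exact key
```

A plain dictionary like this only answers exact-key lookups, which is why the deterministic method depends so heavily on normalization.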

### Load the Model Inference Data

Load the output of the `author_matching.ipynb` notebook instead of running the models again.

In [None]:
model_outputs_df = pd.read_csv("./author_inferences.csv",encoding='utf-8',quotechar='"')

### Load the Latin Data from the Original Test Data

In [None]:
# Load preprocessed, deduplicated hathi2.csv
input_df = pd.read_csv('./latin_authors.csv', encoding='utf-8', quotechar='"') 
# Further clean the data by filling missing values with "Unknown"
input_df = utilities.clean_input(input_df)

## Metadata Processing Functions

The following functions are necessary for providing comprehensible output from the models' inferences.

### Greek-Latin Identification

- `classify_author_language`: Tokenizes authors' names from the incoming metadata records and passes them to the Greek-Latin Identification model. Returns the predicted label ("Greek" or "Latin") and the confidence score for the prediction.
- `classify_and_split_by_language`: Returns three Pandas dataframes—`classified_df`, with all results from the Greek-Latin Identification model; `greek_df`, with results labeled "Greek"; `latin_df`, with results labeled "Latin". The latter will be the input for the Author Reconciliation model.

Note that these functions process one input at a time. There is potential for speeding up the process by batching the inputs.

### Author Reconciliation Functions

- `deterministic_author_match`: Performs a one-to-one comparison of the authorized name in the DLL Catalog with the value of the author field in the test data
- `fuzzy_author_match`: Performs a "fuzzy" comparison and selects the best match between names in the DLL Catalog and the value of the author field, provided the confidence is above 90%.
- `distilbert_author_match`: Tokenizes authors' names and passes them to the Author Reconciliation model. Returns the predicted label ("DLL ID") and the confidence score for the prediction.
- `process_metadata`: Manages input and output for the Author Reconciliation model. Returns a dataframe with the results of the inference run.

Note that these functions process one input at a time. There is potential for speeding up the process by batching the inputs.
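
For intuition about the fuzzy score: as far as I understand it, `fuzz.token_sort_ratio` sorts the whitespace-separated tokens of each string and then scores the two sorted strings on a scale from 0 to 100. The sketch below is a pure-Python approximation of that idea, not rapidfuzz's implementation. It uses a longest-common-subsequence similarity, and the function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]

def token_sort_similarity(a, b):
    """Score two strings from 0 to 100 after sorting their tokens."""
    sorted_a = " ".join(sorted(a.split()))
    sorted_b = " ".join(sorted(b.split()))
    total = len(sorted_a) + len(sorted_b)
    if total == 0:
        return 100.0
    return 100.0 * 2 * lcs_length(sorted_a, sorted_b) / total

# Word order is ignored because the tokens are sorted first.
print(token_sort_similarity("julius caesar", "caesar julius"))
# A small spelling difference lowers the score without zeroing it.
print(token_sort_similarity("quintilian", "quintilianus"))
```

Because the score runs from 0 to 100, a 90% threshold corresponds to a score of 90, and dividing by 100 gives the 0 to 1 values that appear in the `fuzzy_author_score` column.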

In [None]:
# Step 2: Author Matching Functions
def deterministic_author_match(input_author):
    input_author_normalized = utilities.normalize_author_name(input_author)
    author_info = variant_to_authorized.get(input_author_normalized)
    return author_info

def fuzzy_author_match(input_author):
    input_author_cleaned = utilities.normalize_author_name(input_author)
    # rapidfuzz scores run from 0 to 100, so the 90% threshold is 90, not 0.9
    result = process.extractOne(input_author_cleaned, list(variant_to_authorized.keys()), scorer=fuzz.token_sort_ratio)
    if result:
        best_match, similarity, *_ = result
        if similarity > 90:
            # Rescale the score to the 0-1 range used for the model's confidence
            return variant_to_authorized.get(best_match), similarity / 100
    return None, 0.0

def distilbert_author_match(input_author):
    if not isinstance(input_author, str):
        return None, 0.0

    row = model_outputs_df[model_outputs_df['author'] == input_author]
    if row.empty:
        return None, 0.0

    # Parse the dictionary string in 'distilbert_author'
    distilbert_author = ast.literal_eval(row.iloc[0]['distilbert_author'])
    authorized_name = distilbert_author.get('authorized_name', None)
    confidence = float(row.iloc[0]['confidence'])

    return authorized_name, confidence

def process_metadata(input_df):
    """Process input dataframe and match metadata, returning only authorized_name values."""
    results = []

    for _, row in input_df.iterrows():
        input_author_original = row["author"]
        input_author_normalized = utilities.normalize_author_name(input_author_original)
        print(f'Processing: {input_author_original}')

        # Author Matching
        deterministic_author = deterministic_author_match(input_author_normalized)
        fuzzy_author, fuzzy_author_score = fuzzy_author_match(input_author_normalized)
        distilbert_author, distilbert_author_score = distilbert_author_match(input_author_original)

        # Extract just the authorized_name if available
        deterministic_author_name = (
            deterministic_author.get("authorized_name") if deterministic_author else None
        )
        fuzzy_author_name = (
            fuzzy_author.get("authorized_name") if fuzzy_author else None
        )

        # Collect Results
        results.append({
            "author": input_author_original,
            "normalized_author": input_author_normalized,
            "deterministic_author": deterministic_author_name,
            "fuzzy_author": fuzzy_author_name,
            "fuzzy_author_score": fuzzy_author_score,
            "distilbert_author": distilbert_author,
            "distilbert_author_score": distilbert_author_score
        })
        print(f"Matched author: {deterministic_author_name if deterministic_author_name else distilbert_author}")

    return pd.DataFrame(results)
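
The functions above rely on `utilities.normalize_author_name`, whose source is not shown. Judging only from the `normalized_author` values in the output, it appears to lowercase the name, strip diacritics, and remove punctuation. A minimal sketch under that assumption (the name `normalize_name_sketch` is hypothetical, and the real helper may differ):

```python
import re
import unicodedata

def normalize_name_sketch(name):
    """Lowercase, strip diacritics and punctuation, collapse whitespace."""
    if not isinstance(name, str):
        return ""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Remove anything that is not a word character or whitespace.
    cleaned = re.sub(r"[^\w\s]", "", without_marks.lower())
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", cleaned).strip()

print(normalize_name_sketch("Du Creux, François, 1596?-1666."))
print(normalize_name_sketch("Ovid, 43 B.C.-17 A.D. or 18 A.D."))
```

Removing the hyphen joins date ranges into a single token (`15961666`), which matches the normalized values shown in the output tables.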

## Process the Data

In [None]:
# Process the input dataframe
output_df = process_metadata(input_df)
# Display a message indicating completion
print("Done with processing authors.")
# Display the first 10 rows of the output dataframe
display(output_df.head(10))
# Save the output dataframe to a CSV file
output_df.to_csv('./model_probabilistic_deterministic.csv',index=False, encoding='utf-8', quoting=csv.QUOTE_ALL)

Processing: Du Creux, François, 1596?-1666.
Matched author: graux, charles henri, 1852-1882
Processing: Meyer, Ernst H. F. 1791-1858.
Matched author: meyer, wilhelm, 1845-1917
Processing: Laet, Joannes de, 1593-1649.
Matched author: lawrence, of novara
Processing: Caesar, Julius
Matched author: caesar, julius
Processing: Unknown
Matched author: alan, of tewkesbury
Processing: Drexel, Jeremias, 1581-1638,
Matched author: dorpius, martinus, 1485-1525
Processing: Kircher, Athanasius, 1602-1680
Matched author: kircher, athanasius, 1602-1680
Processing: Hincmar, Archbishop of Reims, approximately 806-882
Matched author: hincmar, archbishop of reims
Processing: Acosta, José de, 1540-1600,
Matched author: acosta, josé de, 1540-1600
Processing: Lessius, Leonardus, 1554-1623
Matched author: lessing, gotthold ephraim, 1729-1781
Processing: Riccioli, Giovanni Battista, 1598-1671,
Matched author: ricci, matteo, 1552-1610
Processing: Guazzo, Francesco Maria,
Matched author: giovio, benedetto, 147

Unnamed: 0,author,normalized_author,deterministic_author,fuzzy_author,fuzzy_author_score,distilbert_author,distilbert_author_score
0,"Du Creux, François, 1596?-1666.",du creux francois 15961666,,,0.0,"graux, charles henri, 1852-1882",0.177464
1,"Meyer, Ernst H. F. 1791-1858.",meyer ernst h f 17911858,,,0.0,"meyer, wilhelm, 1845-1917",0.999956
2,"Laet, Joannes de, 1593-1649.",laet joannes de 15931649,,,0.0,"lawrence, of novara",0.156866
3,"Caesar, Julius",caesar julius,,"caesar, julius",1.0,"caesar, julius",0.999986
4,Unknown,unknown,,,0.0,"alan, of tewkesbury",0.11027
5,"Drexel, Jeremias, 1581-1638,",drexel jeremias 15811638,,,0.0,"dorpius, martinus, 1485-1525",0.521497
6,"Kircher, Athanasius, 1602-1680",kircher athanasius 16021680,"kircher, athanasius, 1602-1680","kircher, athanasius, 1602-1680",1.0,"kircher, athanasius, 1602-1680",0.999991
7,"Hincmar, Archbishop of Reims, approximately 80...",hincmar archbishop of reims approximately 806882,,,0.0,"hincmar, archbishop of reims",0.999999
8,"Acosta, José de, 1540-1600,",acosta jose de 15401600,"acosta, josé de, 1540-1600","acosta, josé de, 1540-1600",1.0,"acosta, josé de, 1540-1600",0.999997
9,"Lessius, Leonardus, 1554-1623",lessius leonardus 15541623,,,0.0,"lessing, gotthold ephraim, 1729-1781",0.985865


## Analysis

### Basic facts about the input data

In [198]:
# Number of rows in the input dataframe
num_rows = len(input_df)
print(f"Number of rows in the input dataframe: {num_rows}")

Number of rows in the input dataframe: 13446


In [199]:
# Get the number of unique authors in the input dataframe
unique_authors = input_df['author'].nunique()
print(f"Number of unique authors in the input dataframe: {unique_authors}")

Number of unique authors in the input dataframe: 5515


In [252]:
# Get the number of unique authors with more than one row in the input dataframe
duplicate_authors = input_df['author'].value_counts()
duplicate_authors = duplicate_authors[duplicate_authors > 1]

print(f"Number of unique authors with more than one row in the input dataframe: {len(duplicate_authors)}")

# Get the number of unique authors with only one row in the input dataframe
single_authors = input_df['author'].value_counts()
single_authors = single_authors[single_authors == 1]
print(f"Number of unique authors with only one row in the input dataframe: {len(single_authors)}")

Number of unique authors with more than one row in the input dataframe: 1499
Number of unique authors with only one row in the input dataframe: 4016


In [201]:
# Display the top ten duplicate authors by number of rows
print("Top ten duplicate authors by number of rows:")
print(duplicate_authors.head(10))

Top ten duplicate authors by number of rows:
author
Unknown                             611
Cicero, Marcus Tullius.             339
Horace.                             216
Livy.                               205
Virgil.                             174
Tacitus, Cornelius.                 160
Cicero, Marcus Tullius              157
Ovid, 43 B.C.-17 A.D. or 18 A.D.    129
Horace                              125
Lucretius Carus, Titus.             109
Name: count, dtype: int64


In [202]:
# Display the bottom ten duplicate authors by number of rows
print("Bottom ten duplicate authors by number of rows:")
print(duplicate_authors.tail(10))

Bottom ten duplicate authors by number of rows:
author
Bersuire, Pierre, ca.1290-1362    2
Bastgen, Philip.                  2
Czerner, Bartholomaeus.           2
Arias de Mesa, Fernando.          2
Erasmus, Desiderius, d. 1536.     2
Socini, Mariano, Senior.          2
Tancredi, Vincenzo.               2
Magro, Jacobo.                    2
Ullrich, Richard, 1866-           2
Froitzheim, Johann, 1847-         2
Name: count, dtype: int64


- The input dataframe has 13,446 rows of data.
- There are 5,515 unique author names.
- 1,499 of those unique authors appear in more than one row.
- 4,016 of those unique authors appear only once in the dataframe.

"Unique" means that the name form is unique. Individual authors often have more than one name form (e.g., "Cicero, Marcus Tullius." and "Cicero, Marcus Tullius" above).

The top five "unique" authors in the input dataframe, aside from "Unknown", are well-known authors from the Classical era:

1. Marcus Tullius Cicero
2. Horace
3. Livy
4. Virgil
5. Tacitus

The bottom five "unique" authors with more than one row tend to be more obscure and from the Neo-Latin era:

1495. Socini, Mariano, Senior.
1496. Tancredi, Vincenzo.
1497. Magro, Jacobo.
1498. Ullrich, Richard, 1866-
1499. Froitzheim, Johann, 1847-
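
Because "unique" here refers to name forms, the row counts above split single authors across several entries. The sketch below merges four of the counts reported in the top-ten list, using a simplified stand-in for the notebook's normalization helper (the function `simple_normalize` is illustrative only):

```python
from collections import Counter
import string

# Row counts for four raw name forms, taken from the top-ten list above.
raw_counts = {
    "Cicero, Marcus Tullius.": 339,
    "Cicero, Marcus Tullius": 157,
    "Horace.": 216,
    "Horace": 125,
}

def simple_normalize(name):
    """Simplified stand-in: drop ASCII punctuation, lowercase, collapse spaces."""
    no_punct = name.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.lower().split())

# Sum the counts of all raw forms that share a normalized form.
merged = Counter()
for name, count in raw_counts.items():
    merged[simple_normalize(name)] += count

print(dict(merged))
```

The merged totals, 496 for `cicero marcus tullius` and 341 for `horace`, agree with the counts for those normalized names in the output examined in the next section.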

## Examining the Output

In [254]:
# Get the number of unique authors with more than one row in the output dataframe
duplicate_authors = output_df['normalized_author'].value_counts()
duplicate_authors = duplicate_authors[duplicate_authors > 1]

print(f"Number of unique authors with more than one row in the output dataframe: {len(duplicate_authors)}")

# Display the top ten duplicate authors by number of rows
print("Top ten duplicate authors by number of rows:")
print(duplicate_authors.head(10))

# Display the bottom ten duplicate authors by number of rows
print("Bottom ten duplicate authors by number of rows:")
print(duplicate_authors.tail(10))

Number of unique authors with more than one row in the output dataframe: 1472
Top ten duplicate authors by number of rows:
normalized_author
unknown                     611
cicero marcus tullius       496
horace                      341
livy                        297
virgil                      252
tacitus cornelius           246
lucretius carus titus       151
ovid 43 bc17 ad or 18 ad    131
catullus gaius valerius     110
quintilian                  104
Name: count, dtype: int64
Bottom ten duplicate authors by number of rows:
normalized_author
chalkokondyles laonikos ca 1430ca 1490    2
alvares manuel 15261583                   2
keats john 17951821                       2
brink barend ten 18031875                 2
mace alcide                               2
drexel jeremias 15811638                  2
otto carl eduard 17951869                 2
christ johann friedrich 17001756          2
ludwich arthur 18401920                   2
nakatenus wilhelm 16171682                2
Name: count, dtype: int64

The first step is to deduplicate the output based on the `normalized_author` column. This will help with the problem of several name forms for single authors.

In [228]:
# Deduplicate the output dataframe by normalized_author column
output_df_dedup = output_df.drop_duplicates(subset=['normalized_author'])
# Number of rows in the deduplicated output dataframe
num_rows_dedup = len(output_df_dedup)
print(f"Number of rows in the deduplicated output dataframe: {num_rows_dedup}")

Number of rows in the deduplicated output dataframe: 4779


In [232]:
# Make a dataframe from rows with a match in the deterministic author column
deterministic_matches = output_df_dedup[
    (output_df_dedup['deterministic_author'].notnull() & (output_df_dedup['deterministic_author'] != "None"))]

# Get the length of the deterministic matches dataframe
deterministic_matches_length = len(deterministic_matches)
print(f"Number of rows with a deterministic author match: {deterministic_matches_length}")

Number of rows with a deterministic author match: 491


In [234]:
# Make a dataframe from rows with a match with highest confidence in the fuzzy author column
fuzzy_matches = output_df_dedup[output_df_dedup['fuzzy_author_score'] == 1.0]
# Filter the rows in the dataframe to unique values in the normalized author column
fuzzy_matches_unique = fuzzy_matches['normalized_author'].nunique()
# Get the length of the fuzzy matches dataframe
fuzzy_matches_length = len(fuzzy_matches)
print(f"Number of rows with highest confidence in fuzzy matching: {fuzzy_matches_length}")

Number of rows with highest confidence in fuzzy matching: 526


In [235]:
# Make a dataframe from rows with a match of highest confidence in the distilbert_author_score column
inference_matches = output_df_dedup[output_df_dedup['distilbert_author_score'] >= 0.999957]
# Get the length of the inference_matches dataframe
inference_matches_length = len(inference_matches)
print(f"Number of rows in the inference matches dataframe: {inference_matches_length}")

Number of rows in the inference matches dataframe: 615


In [237]:
# Coverage
deterministic_coverage = (output_df_dedup['deterministic_author'].notnull() & (output_df_dedup['deterministic_author'] != "None")).mean()
print(f"Deterministic matching coverage: {deterministic_coverage:.2%}")

fuzzy_coverage = (output_df_dedup['fuzzy_author_score'] == 1.0).mean()
print(f"Fuzzy matching high-confidence coverage: {fuzzy_coverage:.2%}")

distilbert_coverage = (output_df_dedup['distilbert_author_score'] >= 0.9901).mean()
print(f"DistilBERT high-confidence coverage: {distilbert_coverage:.2%}")

Deterministic matching coverage: 10.27%
Fuzzy matching high-confidence coverage: 11.01%
DistilBERT high-confidence coverage: 25.19%


The deterministic algorithm matched 491 of the unique authors in the original dataset. That is not surprising, since a one-to-one comparison is very conservative and depends on the exact same name form appearing in the input data.

The probabilistic matching algorithm matched 526 of the unique authors in the original dataset. That, too, is not surprising, since fuzzy matching takes into account slight variations in spelling.

The machine learning model matched 615 unique authors when the optimal confidence score of 0.999957 is used as a cutoff. The next several cells explain how I arrived at that optimal confidence score.
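
The counts and the coverage percentages are two views of the same numbers. As a quick check, the sketch below recomputes each share from the counts reported above and the 4,779 deduplicated rows. Note that the DistilBERT coverage printed above (25.19%) uses the looser 0.9901 cutoff, whereas the count of 615 uses 0.999957.

```python
total_rows = 4779  # rows after deduplicating on normalized_author

# Match counts reported in the cells above.
match_counts = {
    "deterministic": 491,
    "fuzzy (score == 1.0)": 526,
    "model (score >= 0.999957)": 615,
}

# Coverage is simply matches divided by deduplicated rows.
coverage = {method: count / total_rows for method, count in match_counts.items()}
for method, share in coverage.items():
    print(f"{method}: {share:.2%}")
```

At the stricter cutoff the model covers about 12.87% of the deduplicated rows, which is still ahead of both other methods but by a much smaller margin than the 25.19% figure suggests.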

## Comparison: Model vs. Deterministic Matching

In [238]:
# Filter the results to show the intersection of deterministic matches that aren't "None" and model matches that are greater than 0.9901
high_confidence_deterministic = output_df[
    (output_df['deterministic_author'].notnull() & (output_df['deterministic_author'] != "None")) &
    (output_df['distilbert_author_score'] > 0.9901)
]
print(f"Number of rows with high confidence matches: {len(high_confidence_deterministic)}")
display(high_confidence_deterministic.head(10))

Number of rows with high confidence matches: 4283


Unnamed: 0,author,normalized_author,deterministic_author,fuzzy_author,fuzzy_author_score,distilbert_author,distilbert_author_score
6,"Kircher, Athanasius, 1602-1680",kircher athanasius 16021680,"kircher, athanasius, 1602-1680","kircher, athanasius, 1602-1680",1.0,"kircher, athanasius, 1602-1680",0.999991
8,"Acosta, José de, 1540-1600,",acosta jose de 15401600,"acosta, josé de, 1540-1600","acosta, josé de, 1540-1600",1.0,"acosta, josé de, 1540-1600",0.999997
12,"Kircher, Athanasius, 1602-1680.",kircher athanasius 16021680,"kircher, athanasius, 1602-1680","kircher, athanasius, 1602-1680",1.0,"kircher, athanasius, 1602-1680",0.999991
13,"Mersenne, Marin, 1588-1648,",mersenne marin 15881648,"mersenne, marin, 1588-1648","mersenne, marin, 1588-1648",1.0,"mersenne, marin, 1588-1648",0.999985
14,Virgil.,virgil,virgil,virgil,1.0,virgil,0.999874
15,"Kepler, Johannes, 1571-1630.",kepler johannes 15711630,"kepler, johannes","kepler, johannes",1.0,"kepler, johannes",0.999983
18,"Alvares, Manuel, 1526-1583.",alvares manuel 15261583,"alvares, manuel, 1526-1583","alvares, manuel, 1526-1583",1.0,"alvares, manuel, 1526-1583",0.999919
26,Virgil.,virgil,virgil,virgil,1.0,virgil,0.999874
27,"Boethius, -524.",boethius 524,"boethius, -524","boethius, -524",1.0,"boethius, -524",0.999951
30,"Thomas, Aquinas, Saint, 1225?-1274.",thomas aquinas saint 12251274,"thomas, aquinas, saint","thomas, aquinas, saint",1.0,"thomas, aquinas, saint",0.99999


The intersection of deterministic matches and model inference matches with a confidence rating higher than 0.9901 yielded 4,283 rows. 

**Conclusion**: The model performs *at least* as well as the deterministic model. But that is a pretty low bar!
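
The filter above checks that both methods produced a confident answer, not that they produced the same answer. One possible follow-up, sketched here on four rows copied from the tables in this notebook, is to measure how often the two names agree. The helper name `agreement_rate` is illustrative, and it keeps rows at or above the cutoff rather than strictly above it.

```python
# Four rows copied from the output tables in this notebook.
rows = [
    {"deterministic_author": "kircher, athanasius, 1602-1680",
     "distilbert_author": "kircher, athanasius, 1602-1680",
     "distilbert_author_score": 0.999991},
    {"deterministic_author": "virgil",
     "distilbert_author": "virgil",
     "distilbert_author_score": 0.999874},
    {"deterministic_author": "suárez, francisco, 1548-1617",
     "distilbert_author": "gonzález téllez, manuel, -1649",
     "distilbert_author_score": 0.410325},
    {"deterministic_author": "terence",
     "distilbert_author": "linacre, thomas, 1460-1524",
     "distilbert_author_score": 0.128482},
]

def agreement_rate(rows, min_score=0.0):
    """Share of rows at or above min_score where both methods give the same name."""
    kept = [
        r for r in rows
        if r["deterministic_author"] and r["distilbert_author_score"] >= min_score
    ]
    if not kept:
        return None
    agreeing = sum(r["deterministic_author"] == r["distilbert_author"] for r in kept)
    return agreeing / len(kept)

print(agreement_rate(rows))          # all four rows
print(agreement_rate(rows, 0.9901))  # only the confident rows
```

On this tiny sample the two methods agree on half of all rows but on every confident row, which is the pattern the conclusion above relies on.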

## Comparison: Model vs. Probabilistic Matching

The fuzzy matching algorithm's usefulness drops significantly for rows with a confidence score under 1.0, which is why the comparison below keeps only rows with a `fuzzy_author_score` of exactly 1.0.

In [209]:
# Sort the dataframe by the fuzzy_author_score column in descending order
sorted_fuzzy = output_df.sort_values(by='fuzzy_author_score', ascending=False)
# Filter the results to show the intersection of fuzzy matches that are 1.0 and model matches that are not null 
high_confidence_probabilistic = sorted_fuzzy[
    (sorted_fuzzy['fuzzy_author_score'] == 1.0) &
    (sorted_fuzzy['distilbert_author_score'].notnull())
]
print(f"Number of rows with high confidence matches: {len(high_confidence_probabilistic)}")
print("Top ten high confidence probabilistic matches:")
display(high_confidence_probabilistic.head(10))
print("Bottom ten high confidence probabilistic matches:")
display(high_confidence_probabilistic.tail(10))

Number of rows with high confidence matches: 4522
Top ten high confidence probabilistic matches:


Unnamed: 0,author,normalized_author,deterministic_author,fuzzy_author,fuzzy_author_score,distilbert_author,distilbert_author_score
4401,Horace.,horace,horace,horace,1.0,horace,0.999989
4070,"Cicero, Marcus Tullius.",cicero marcus tullius,"cicero, marcus tullius","cicero, marcus tullius",1.0,"cicero, marcus tullius",0.999886
9752,"Gretser, Jakob, 1562-1625.",gretser jakob 15621625,"gretser, jakob, 1562-1625","gretser, jakob, 1562-1625",1.0,"gretser, jakob, 1562-1625",0.999982
4067,"Tacitus, Cornelius.",tacitus cornelius,"tacitus, cornelius","tacitus, cornelius",1.0,"tacitus, cornelius",0.999989
9751,"Gretser, Jakob, 1562-1625.",gretser jakob 15621625,"gretser, jakob, 1562-1625","gretser, jakob, 1562-1625",1.0,"gretser, jakob, 1562-1625",0.999982
11608,"Sayer, Gregory, (O.S.B.), 1560-1602",sayer gregory osb 15601602,"sayer, gregory, 1560-1602","sayer, gregory, 1560-1602",1.0,"sayer, gregory, 1560-1602",0.999995
4062,"Lucan, 39-65.",lucan 3965,lucan,lucan,1.0,lucan,0.999983
4058,"Polignac, Melchior de, 1661-1742?.",polignac melchior de 16611742,"polignac, melchior de, 1661-1742?","polignac, melchior de, 1661-1742?",1.0,"polignac, melchior de, 1661-1742?",0.99999
4054,"Schottus, Andreas, 1552-1629.",schottus andreas 15521629,"schottus, andreas, 1552-1629","schottus, andreas, 1552-1629",1.0,"schottus, andreas, 1552-1629",0.999968
4053,Virgil.,virgil,virgil,virgil,1.0,virgil,0.999874


Bottom ten high confidence probabilistic matches:


Unnamed: 0,author,normalized_author,deterministic_author,fuzzy_author,fuzzy_author_score,distilbert_author,distilbert_author_score
1938,"Keil, Heinrich, 1822-1894",keil heinrich 18221894,"keil, heinrich, 1822-1894","keil, heinrich, 1822-1894",1.0,"keil, heinrich, 1822-1894",0.999944
2600,"Ovid, 43 B.C.-17 A.D. or 18 A.D.",ovid 43 bc17 ad or 18 ad,"ovid, 43 b.c.-17 a.d. or 18 a.d.","ovid, 43 b.c.-17 a.d. or 18 a.d.",1.0,"ovid, 43 b.c.-17 a.d. or 18 a.d.",0.99999
8336,"Burchardus, Episcopus Wormaciensis.",burchardus episcopus wormaciensis,,"burchard, bishop of worms",1.0,"burchard, bishop of worms",0.999926
2306,Livy.,livy,livy,livy,1.0,livy,0.999558
1936,"Weichert, Jonathan August, 1788-1844.",weichert jonathan august 17881844,"weichert, august, 1788-1844","weichert, august, 1788-1844",1.0,"weichert, august, 1788-1844",0.999972
12501,"Séneca, Lucio Anneo, ca. 4 a.C-65 d.C",seneca lucio anneo ca 4 ac65 dc,"seneca, lucius annaeus, approximately 4 b.c.-6...","seneca, lucius annaeus, approximately 4 b.c.-6...",1.0,"seneca, lucius annaeus, approximately 4 b.c.-6...",0.999839
2596,"Boethius, -524",boethius 524,"boethius, -524","boethius, -524",1.0,"boethius, -524",0.99996
2307,"Ovid, 43 B.C.-17 A.D. or 18 A.D.",ovid 43 bc17 ad or 18 ad,"ovid, 43 b.c.-17 a.d. or 18 a.d.","ovid, 43 b.c.-17 a.d. or 18 a.d.",1.0,"ovid, 43 b.c.-17 a.d. or 18 a.d.",0.99999
2308,"Lucretius Carus, Titus.",lucretius carus titus,"lucretius carus, titus","lucretius carus, titus",1.0,"lucretius carus, titus",0.99999
2303,"Ovid, 43 B.C.-17 A.D. or 18 A.D.",ovid 43 bc17 ad or 18 ad,"ovid, 43 b.c.-17 a.d. or 18 a.d.","ovid, 43 b.c.-17 a.d. or 18 a.d.",1.0,"ovid, 43 b.c.-17 a.d. or 18 a.d.",0.99999


In [239]:
high_confidence_probabilistic.describe()

Unnamed: 0,fuzzy_author_score,distilbert_author_score
count,4522.0,4522.0
mean,1.0,0.996303
std,0.0,0.045436
min,1.0,0.088868
25%,1.0,0.999886
50%,1.0,0.999975
75%,1.0,0.999989
max,1.0,0.999999


In [242]:
# Count number of rows with a distilbert_author_score of 0.90 or lower
low_confidence_inferences = high_confidence_probabilistic[high_confidence_probabilistic['distilbert_author_score'] <= 0.90]
print(f"Number of rows with a distilbert_author_score of 0.90 or lower: {len(low_confidence_inferences)}")

Number of rows with a distilbert_author_score of 0.90 or lower: 33


**Conclusion**: The model appears to be keeping up with the probabilistic matching algorithm's high-confidence (== 1.0) matches. Its confidence is 0.90 or lower on only 33 of those 4,522 rows (about 0.7%); otherwise it is performing well.

## Examining the Model's Confidence

Let's see what the model was confident about when the other methods lacked confidence.

In [None]:
# Filter the results to show only rows where the distilbert_author_score is high, the fuzzy_author_score is less than 1.0, and the deterministic_author is "None"
high_confidence_inference = output_df[
    (output_df['distilbert_author_score'] > 0.99957) &
    (output_df['fuzzy_author_score'] < 1.0) &
    ((output_df['deterministic_author'].isnull()) | (output_df['deterministic_author'] == "None"))
]
print(f"Number of rows with high confidence inference matches: {len(high_confidence_inference)}")
print("Top ten high confidence inference matches:")
display(high_confidence_inference.head(10))
print("Bottom ten high confidence inference matches:")
display(high_confidence_inference.tail(10))
# Write the high confidence inference matches to a CSV file
high_confidence_inference.to_csv('./high_confidence_inference.csv',index=False, encoding='utf-8', quoting=csv.QUOTE_ALL)

Number of rows with high confidence inference matches: 1755
Top ten high confidence inference matches:


Unnamed: 0,author,normalized_author,deterministic_author,fuzzy_author,fuzzy_author_score,distilbert_author,distilbert_author_score
1,"Meyer, Ernst H. F. 1791-1858.",meyer ernst h f 17911858,,,0.0,"meyer, wilhelm, 1845-1917",0.999956
7,"Hincmar, Archbishop of Reims, approximately 80...",hincmar archbishop of reims approximately 806882,,,0.0,"hincmar, archbishop of reims",0.999999
28,"Prosper, of Aquitaine, Saint, approximately 39...",prosper of aquitaine saint approximately 390ap...,,,0.0,"prosper, of aquitaine, saint",0.999984
29,Vitruvius Pollio.,vitruvius pollio,,vitruvius pollio,0.941176,vitruvius pollio,0.99998
33,"Juvencus, Caius Vettius Aquilinus.",juvencus caius vettius aquilinus,,"juvencus, caius vettius aquilinus",0.96875,"juvencus, caius vettius aquilinus",0.999997
42,"Valerius Flaccus, Gaius, active 1st century.",valerius flaccus gaius active 1st century,,,0.0,"valerius flaccus, gaius",0.999995
55,"Nonius Marcellus, active 4th century.",nonius marcellus active 4th century,,,0.0,nonius marcellus,0.999963
78,"Propertius, Sextus.",propertius sextus,,,0.0,"propertius, sextus",0.999681
86,"Horatius, Romanus, fl. 1450.",horatius romanus fl 1450,,,0.0,horatius romanus,0.99995
112,"Catullus, Gaius Valerius",catullus gaius valerius,,"catullus, gaius valerius",0.956522,"catullus, gaius valerius",0.999993


Bottom ten high confidence inference matches:


Unnamed: 0,author,normalized_author,deterministic_author,fuzzy_author,fuzzy_author_score,distilbert_author,distilbert_author_score
13357,"Ovid, 43 B.C.-17 or 18 A.D.",ovid 43 bc17 or 18 ad,,"ovid, 43 b.c.-17 a.d. or 18 a.d.",0.933333,"ovid, 43 b.c.-17 a.d. or 18 a.d.",0.99999
13359,"Ovid, 43 B.C.-17 or 18 A.D.",ovid 43 bc17 or 18 ad,,"ovid, 43 b.c.-17 a.d. or 18 a.d.",0.933333,"ovid, 43 b.c.-17 a.d. or 18 a.d.",0.99999
13361,"Ovid, 43 B.C.-17 or 18 A.D.",ovid 43 bc17 or 18 ad,,"ovid, 43 b.c.-17 a.d. or 18 a.d.",0.933333,"ovid, 43 b.c.-17 a.d. or 18 a.d.",0.99999
13369,Quintilian.,quintilian,,quintilian,0.952381,quintilian,0.999983
13388,"Wycliffe, John, d. 1384.",wycliffe john d 1384,,"wycliffe, john, -1384",0.947368,"wycliffe, john, -1384",0.999988
13389,"Buchanan, David, 1595?-1652?",buchanan david 15951652,,,0.0,"buchanan, george",0.999811
13400,"Velleius Paterculus, ca. 19 B.C.-ca. 30 A.D.",velleius paterculus ca 19 bcca 30 ad,,,0.0,velleius paterculus,0.999971
13410,"Palingenio Stellato, Marcello, ca. 1500-ca. 1543.",palingenio stellato marcello ca 1500ca 1543,,"palingenio stellato, marcello, approximately 1...",0.931818,"palingenio stellato, marcello, approximately 1...",0.999989
13422,"Ovid, 43 B.C.-17 or 18 A.D.",ovid 43 bc17 or 18 ad,,"ovid, 43 b.c.-17 a.d. or 18 a.d.",0.933333,"ovid, 43 b.c.-17 a.d. or 18 a.d.",0.99999
13423,"Ovid, 43 B.C.-17 or 18 A.D.",ovid 43 bc17 or 18 ad,,"ovid, 43 b.c.-17 a.d. or 18 a.d.",0.933333,"ovid, 43 b.c.-17 a.d. or 18 a.d.",0.99999


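It appears that when the model's confidence is at or above 0.999957, it accurately matches authors that neither the deterministic nor the probabilistic algorithm can match. I arrived at that number by adjusting the cutoff up and down until the filtered rows contained only correct matches. The cell above uses a looser cutoff of 0.99957, which still lets through mismatches such as "Meyer, Ernst H. F. 1791-1858." matched to "meyer, wilhelm, 1845-1917" with a score of 0.999956.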
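
The manual search for a cutoff can also be expressed as a small calculation, assuming each row has been labeled as a correct or incorrect match by hand. In the sketch below the scores are copied from the table above, while the labels and the function name `lowest_safe_cutoff` are my own illustration rather than part of the original analysis.

```python
def lowest_safe_cutoff(scored_rows):
    """Return the highest score among incorrect matches.

    Every row scoring strictly above this value is a correct match.
    Returns None when no row is labeled incorrect.
    """
    wrong_scores = [score for score, is_correct in scored_rows if not is_correct]
    return max(wrong_scores) if wrong_scores else None

# (score, is_correct) pairs; labels are hand-assigned for illustration.
labeled = [
    (0.999999, True),   # Hincmar, Archbishop of Reims
    (0.999963, True),   # Nonius Marcellus
    (0.999956, False),  # Meyer, Ernst H. F. -> meyer, wilhelm
    (0.999811, False),  # Buchanan, David -> buchanan, george
    (0.999681, True),   # Propertius, Sextus
]

cutoff = lowest_safe_cutoff(labeled)
kept = [score for score, _ in labeled if score > cutoff]
print(cutoff, kept)
```

On these five rows the highest incorrect score is 0.999956, so 0.999957 is the next value at six decimal places. The price of such a strict cutoff is that some correct matches below it, such as Propertius at 0.999681, are excluded.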

## Where the Model Fails

In [211]:
# Filter the results to show the intersection of deterministic matches that aren't "None", the fuzzy_author_score is equal to 1.0, and the model matches are less than 0.9
high_confidence_deterministic_fuzzy = output_df[
    (output_df['deterministic_author'].notnull() & (output_df['deterministic_author'] != "None")) &
    (output_df['fuzzy_author_score'] == 1.0) &
    (output_df['distilbert_author_score'] < 0.9)
]
print(f"Number of rows with high confidence deterministic and fuzzy matches: {len(high_confidence_deterministic_fuzzy)}")
print("Top ten high confidence deterministic and fuzzy matches:")
display(high_confidence_deterministic_fuzzy.head(10))
print("Bottom ten high confidence deterministic and fuzzy matches:")
display(high_confidence_deterministic_fuzzy.tail(10))


Number of rows with high confidence deterministic and fuzzy matches: 31
Top ten high confidence deterministic and fuzzy matches:


Unnamed: 0,author,normalized_author,deterministic_author,fuzzy_author,fuzzy_author_score,distilbert_author,distilbert_author_score
17,"Suarez, Francisco, 1548-1617.",suarez francisco 15481617,"suárez, francisco, 1548-1617","suárez, francisco, 1548-1617",1.0,"gonzález téllez, manuel, -1649",0.410325
19,"Suarez, Francisco, 1548-1617.",suarez francisco 15481617,"suárez, francisco, 1548-1617","suárez, francisco, 1548-1617",1.0,"gonzález téllez, manuel, -1649",0.410325
24,"Suarez, Francisco, 1548-1617.",suarez francisco 15481617,"suárez, francisco, 1548-1617","suárez, francisco, 1548-1617",1.0,"gonzález téllez, manuel, -1649",0.410325
25,"Suárez, Francisco, 1548-1617.",suarez francisco 15481617,"suárez, francisco, 1548-1617","suárez, francisco, 1548-1617",1.0,"gonzález téllez, manuel, -1649",0.59796
38,"Suarez, Francisco, 1548-1617.",suarez francisco 15481617,"suárez, francisco, 1548-1617","suárez, francisco, 1548-1617",1.0,"gonzález téllez, manuel, -1649",0.410325
39,"Suárez, Francisco, 1548-1617.",suarez francisco 15481617,"suárez, francisco, 1548-1617","suárez, francisco, 1548-1617",1.0,"gonzález téllez, manuel, -1649",0.59796
40,"Suarez, Francisco, 1548-1617.",suarez francisco 15481617,"suárez, francisco, 1548-1617","suárez, francisco, 1548-1617",1.0,"gonzález téllez, manuel, -1649",0.410325
48,"Suarez, Francisco, 1548-1617.",suarez francisco 15481617,"suárez, francisco, 1548-1617","suárez, francisco, 1548-1617",1.0,"gonzález téllez, manuel, -1649",0.410325
51,"Suarez, Francisco, 1548-1617.",suarez francisco 15481617,"suárez, francisco, 1548-1617","suárez, francisco, 1548-1617",1.0,"gonzález téllez, manuel, -1649",0.410325
63,"Suárez, Francisco, 1548-1617.",suarez francisco 15481617,"suárez, francisco, 1548-1617","suárez, francisco, 1548-1617",1.0,"gonzález téllez, manuel, -1649",0.59796


Bottom ten high confidence deterministic and fuzzy matches:


Unnamed: 0,author,normalized_author,deterministic_author,fuzzy_author,fuzzy_author_score,distilbert_author,distilbert_author_score
4497,Terence.,terence,terence,terence,1.0,"linacre, thomas, 1460-1524",0.128482
6200,Terence.,terence,terence,terence,1.0,"linacre, thomas, 1460-1524",0.128482
6704,"Suárez, Francisco, 1548-1617.",suarez francisco 15481617,"suárez, francisco, 1548-1617","suárez, francisco, 1548-1617",1.0,"gonzález téllez, manuel, -1649",0.59796
6705,"Suárez, Francisco, 1548-1617.",suarez francisco 15481617,"suárez, francisco, 1548-1617","suárez, francisco, 1548-1617",1.0,"gonzález téllez, manuel, -1649",0.59796
6706,"Müller, Lucian, 1836-1898.",muller lucian 18361898,"müller, lucian, 1836-1898","müller, lucian, 1836-1898",1.0,"müller, lucian, 1836-1898",0.863094
6967,"Suárez, Francisco, 1548-1617",suarez francisco 15481617,"suárez, francisco, 1548-1617","suárez, francisco, 1548-1617",1.0,"gonzález téllez, manuel, -1649",0.522586
6988,"Bèze, Théodore de, 1519-1605",beze theodore de 15191605,"bèze, théodore de, 1519-1605","bèze, théodore de, 1519-1605",1.0,"bero magni, de ludosia",0.204709
11312,Alanus de Insulis,alanus de insulis,"alanus, de insulis","alanus, de insulis",1.0,"alanus, de insulis",0.845193
11573,"De Vío, Tommaso, 1469-1534",de vio tommaso 14691534,"cajetan, tommaso de vio, 1469-1534","cajetan, tommaso de vio, 1469-1534",1.0,"cajetan, tommaso de vio, 1469-1534",0.731794
13380,"Suárez, Francisco, 1548-1617.",suarez francisco 15481617,"suárez, francisco, 1548-1617","suárez, francisco, 1548-1617",1.0,"gonzález téllez, manuel, -1649",0.59796


In [214]:
# Unique values in the "normalized_author" column of the high_confidence_deterministic_fuzzy dataframe
unique_authors_deterministic_fuzzy = high_confidence_deterministic_fuzzy['normalized_author'].nunique()
print(f"Number of unique authors in the high confidence deterministic and fuzzy matches: {unique_authors_deterministic_fuzzy}")
authors_not_matched_by_model = set(high_confidence_deterministic_fuzzy['normalized_author'].to_list())
for author in sorted(authors_not_matched_by_model):
    print(author)

Number of unique authors in the high confidence deterministic and fuzzy matches: 7
alanus de insulis
beze theodore de 15191605
de vio tommaso 14691534
muller lucian 18361898
schoen henricus
suarez francisco 15481617
terence
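
The rows displayed above suggest that a score below 0.9 does not always mean a wrong match: for some rows the model names the same author as the deterministic algorithm, just with less confidence. A sketch of how the two cases could be separated is below. It uses four rows copied from the output above as a stand-in for `high_confidence_deterministic_fuzzy`; the variable names are illustrative.

```python
import pandas as pd

# Four rows copied from the displayed output, standing in for the full dataframe
low_conf = pd.DataFrame({
    "normalized_author": [
        "suarez francisco 15481617",
        "terence",
        "muller lucian 18361898",
        "alanus de insulis",
    ],
    "deterministic_author": [
        "suárez, francisco, 1548-1617",
        "terence",
        "müller, lucian, 1836-1898",
        "alanus, de insulis",
    ],
    "distilbert_author": [
        "gonzález téllez, manuel, -1649",
        "linacre, thomas, 1460-1524",
        "müller, lucian, 1836-1898",
        "alanus, de insulis",
    ],
    "distilbert_author_score": [0.410325, 0.128482, 0.863094, 0.845193],
})

# True where the model picked the same author as the deterministic match
agrees = low_conf["distilbert_author"] == low_conf["deterministic_author"]

right_but_unsure = low_conf[agrees]   # correct author, score below 0.9
wrong_author = low_conf[~agrees]      # a different author was chosen

print("Correct but low confidence:", right_but_unsure["normalized_author"].to_list())
print("Different author chosen:", wrong_author["normalized_author"].to_list())
```

Applied to the real dataframe, the same comparison would show how many of the 31 rows are genuine mismatches and how many are correct matches that fell under the 0.9 cutoff.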


## What can be learned about the model's performance?

In [249]:
# How many rows have a distilbert_author_score at or above the 0.999957 threshold?
perfect_inferences = output_df[output_df['distilbert_author_score'] >= 0.999957]
print(f"Number of rows with a near perfect distilbert_author_score: {len(perfect_inferences)}")

Number of rows with a near perfect distilbert_author_score: 3876


In [251]:
# How many unique authors are in the perfect_inferences dataframe?
unique_perfect_inferences = perfect_inferences['normalized_author'].nunique()
print(f"Number of unique authors in the perfect_inferences dataframe: {unique_perfect_inferences}")

Number of unique authors in the perfect_inferences dataframe: 623


## Conclusions

Overall, the model performs better than the deterministic and probabilistic algorithms, but there is room for improvement. In 31 records, the deterministic and probabilistic algorithms both produced a confident match while the model's confidence was below 0.9.

Those 31 records represent seven unique authors:

- Alanus de Insulis
- Bèze, Théodore de, 1519-1605
- De Vío, Tommaso, 1469-1534
- Müller, Lucian, 1836-1898
- Schoen, Henricus
- Suárez, Francisco, 1548-1617
- Terence

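If I augment the rows for these authors in the training data, it is likely that the model will match these, too. For some of them (Müller, Alanus de Insulis, and De Vío), the rows shown above indicate that the model already picked the correct author, only with a score below 0.9. It really should have matched "Terence", though. That's … odd.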
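
One simple reading of "augment" is to oversample the existing training rows for the weak authors. The sketch below assumes a hypothetical training dataframe with `text` and `label` columns; the actual layout of the training data is not shown in this notebook, and the rows here are invented for illustration.

```python
import pandas as pd

# Hypothetical training data: the column names and rows are assumptions
train = pd.DataFrame({
    "text": [
        "Terence.",
        "Caesar, Julius.",
        "Suarez, Francisco, 1548-1617.",
        "Cicero, Marcus Tullius.",
    ],
    "label": [
        "terence",
        "caesar, julius",
        "suárez, francisco, 1548-1617",
        "cicero, marcus tullius",
    ],
})

# Authors the model matched poorly
weak_labels = {"terence", "suárez, francisco, 1548-1617"}

def oversample(df, label_col, labels, copies):
    """Append `copies` extra copies of every row whose label is in `labels`."""
    extra = df[df[label_col].isin(labels)]
    return pd.concat([df] + [extra] * copies, ignore_index=True)

augmented = oversample(train, "label", weak_labels, copies=3)
print(augmented["label"].value_counts())
```

Duplicating rows only changes how often the model sees these authors. Adding new name forms for them, such as the unaccented "Suarez" spelling found in the input, would be a separate step and may matter more.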
