# Reconciling Titles 

This notebook seeks to reconcile titles in the HathiTrust dataset with the DLL Catalog's work records. Since titles are far more complex and variable than authors' names, I use some Natural Language Processing techniques to extract key words from title strings. I then use the Author Reconciliation model's output to narrow the candidate matches in the DLL Catalog's work records to works by the matched author. In other words, if the model matches a particular work's author as "Virgil", then only works by Virgil are potential matches for the key words in the work's title. The words in the title are tokenized and lemmatized so that matching doesn't have to take into account the different case endings of Latin words. Stop words are also removed. The goal is to iterate over the remaining lemmatized tokens and look for matches in the DLL Catalog's filtered work records.

Ideally, a title like *Lucretii De Rerum Natura* would be reduced to the lemmas `lucretius res natura` (lowercased, with the stop word *de* removed) and matched to a similarly lemmatized title in a DLL Catalog work record for Lucretius.

## Install the necessary modules

Note that this notebook should be run in a different virtual environment from the one used for the other notebooks in this repository. To recreate the environment I used for this notebook, run `conda create --name dllspacy --file requirements-dllspacy.txt` from the root of this repository.

You might need to run the following commands separately:

```
%pip install -U spacy==3.7.5 --no-cache-dir
%pip install "spacy_lookups_data @ git+https://github.com/diyclassics/spacy-lookups-data.git" --no-cache-dir
%pip install "la-core-web-lg @ https://huggingface.co/latincy/la_core_web_lg/resolve/main/la_core_web_lg-3.7.7-py3-none-any.whl" --no-cache-dir
```

## Load the Necessary Modules

This notebook uses the following modules:

- `ast`: (Abstract Syntax Tree) to parse dictionary data that is stored as a string
- `collections`: for the Counter, to keep track of confidence scores, etc.
- `pandas`: for opening and working with the CSV files
- `rapidfuzz`: for fuzzy string matching
- `re`: for splitting titles at delimiters with regular expressions
- `spacy`: for Natural Language Processing operations

In [16]:
import ast
from collections import Counter
import pandas as pd
from rapidfuzz import fuzz
import re
import spacy

## Load the la_core_web_lg model from LatinCy

I'll use a model from [LatinCy](https://huggingface.co/latincy) to remove stop words, tokenize, and lemmatize the titles.

On LatinCy, see Patrick J. Burns, “LatinCy: Synthetic Trained Pipelines for Latin NLP,” arXiv: <https://doi.org/10.48550/arXiv.2305.04365>.

In [17]:
# Load the Latin language model
# Note: Ensure you have the 'la_core_web_lg' model installed in your spaCy environment.
# You can install it using: %pip install "la-core-web-lg @ https://huggingface.co/latincy/la_core_web_lg/resolve/main/la_core_web_lg-3.7.7-py3-none-any.whl" --no-cache-dir
nlp = spacy.load("la_core_web_lg")

### Import the STOP_WORDS List from the LatinCy Model

The following cell also adds forms of *liber* ("book") to the list of stop words, since it is ubiquitous in titles.

In [18]:
from spacy.lang.la import STOP_WORDS

# Add forms of liber to the Latin stop words list
custom_stop_words = {"liber", "libri", "libro", "librum", "librorum", "libris", "libros"}
# Combine the default Latin stop words with the custom ones
all_stop_words = STOP_WORDS.union(custom_stop_words)

## Functions for Pre-Processing the Titles with LatinCy

The `extract_primary_title()` function looks for the presence of certain delimiters that separate the primary title from the secondary title. For example, the "/" in "Petri Lombardi Libri IV sententiarum / studio et cura pp. Collegii S. Bonaventurae in lucem editi" serves this purpose. The function splits such a title and returns just the primary title.

The `preprocess()` function:

1. Makes all words lower case to avoid accidental differences between, for example, *Omnia* and *omnia*.
2. Tokenizes the title and keeps the lemma of each token that is composed of letters and is neither a stop word nor punctuation.

In [19]:
# Shorten the title, if possible
def extract_primary_title(raw_title):
    # Split at the first occurrence of any of the delimiters
    split_title = re.split(r'[:;/\\]', raw_title, maxsplit=1)
    return split_title[0].strip()

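# Preprocessing function using LatinCy
def preprocess(text):
    doc = nlp(text.lower())
    tokens = [
        token.lemma_ for token in doc
        if token.is_alpha
        and not token.is_punct
        and not token.is_stop
        # Also apply the custom stop words (forms of liber) defined above
        and token.text not in all_stop_words
    ]
    return tokens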
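
As a quick check of the splitting behavior, the following sketch repeats `extract_primary_title()` and applies it to the example title above and to a title without a delimiter. The second title is only an illustration, not a record from the dataset.

```python
import re

def extract_primary_title(raw_title):
    # Split once at the first colon, semicolon, slash, or backslash
    split_title = re.split(r'[:;/\\]', raw_title, maxsplit=1)
    return split_title[0].strip()

long_title = ("Petri Lombardi Libri IV sententiarum / studio et cura pp. "
              "Collegii S. Bonaventurae in lucem editi")
primary = extract_primary_title(long_title)
print(primary)  # Petri Lombardi Libri IV sententiarum

# A title with no delimiter comes back unchanged, apart from stripped whitespace
unchanged = extract_primary_title("  De rerum natura ")
print(unchanged)  # De rerum natura
```

Because `maxsplit=1` is used, only the first delimiter matters; anything after it is discarded, even if later delimiters appear.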

## Function for Comparing Titles to Candidate Matches

The following function uses [Jaccard Similarity](https://www.statology.org/jaccard-similarity/) to determine potential matches between two titles. Briefly, the Jaccard index is a score between 0 and 1 that represents the degree to which two sets overlap. It is the number of items that appear in both sets divided by the number of items that appear in either set. Here the sets are the lemmatized tokens of the two titles, which makes it useful for comparing tokenized strings. Since it doesn't account for differences in word order or misspellings, a fuzzy matching algorithm is applied later in the script to handle those factors.

In [20]:
# Simple Jaccard similarity
def jaccard_similarity(list1, list2):
    set1, set2 = set(list1), set(list2)
    if not set1 or not set2:
        return 0.0
    return len(set1 & set2) / len(set1 | set2)
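
A small worked example may make the formula concrete. The token lists below are hand-written stand-ins for lemmatized titles, not actual LatinCy output.

```python
def jaccard_similarity(list1, list2):
    set1, set2 = set(list1), set(list2)
    if not set1 or not set2:
        return 0.0
    return len(set1 & set2) / len(set1 | set2)

# Hypothetical lemmatized tokens for an input title and a candidate title
input_tokens = ["lucretius", "res", "natura"]
candidate_tokens = ["res", "natura"]

# Two shared tokens (res, natura) out of three distinct tokens overall
score = jaccard_similarity(input_tokens, candidate_tokens)
print(round(score, 3))  # 0.667

# An empty token list yields 0.0 rather than a division by zero
empty_score = jaccard_similarity([], candidate_tokens)
print(empty_score)  # 0.0
```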

## Load the files

- `latin_authors.csv` is one of the outputs of the Greek-Latin Identification model that was deployed in a different notebook (`python/author_matching.ipynb`). It contains only the rows from `data/hathi2.csv` that have been safely categorized as "Latin".
- `author_inferences.csv` is another output of `python/author_matching.ipynb`. It contains the original author from the `data/hathi2.csv` and the DLL ID of the matched DLL Catalog authority record.
- `works_db.csv` contains data from the DLL Catalog's work records.

In [21]:
# Load files
input_df = pd.read_csv("../output/latin_authors.csv")
inferences_df = pd.read_csv("../output/author_inferences.csv")
works_df = pd.read_csv("../data/works_db.csv")

## Cache Preprocessed Titles from works_df

This runs the `preprocess()` function described above on the titles in `works_df`. Caching them will mean that the operation can be performed once, instead of every time a new title is processed.

In [22]:
# Cache preprocessed titles for works_db
works_df["preprocessed_title"] = works_df["Title"].apply(preprocess)

## Process the Titles and Propose Candidate Matches

The following script performs several steps, which are described in detail here.

For each record, the script attempts to associate the given author with a normalized author's identifier using a separate dataframe of author inferences computed in a previous operation (see `python/author_matching.ipynb`).

If no match is found among the author inferences, the assumption is that the author is unknown. The record is flagged for manual review.

If a match is found among the author inferences, the inferred `author_id` is retrieved from the dictionary stored as a string in the `distilbert_author` column. This ID is then used to filter a third dataset, `works_df`, to extract the titles of works associated with that particular author.
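
The parsing step can be sketched as follows. The cell value, its `confidence` key, and the ID `EXAMPLE-ID` are hypothetical; only the `author_id` key is taken from the script below.

```python
import ast

# Hypothetical cell value: the real column may hold other keys and values
cell_value = "{'author_id': 'EXAMPLE-ID', 'confidence': 0.97}"

# literal_eval parses Python literals only, so it does not execute arbitrary code
distilbert_data = ast.literal_eval(cell_value)

# .get() returns None instead of raising if the key is missing
author_id = distilbert_data.get("author_id")
print(author_id)  # EXAMPLE-ID
```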

If the filtered list of works is empty—i.e., no known works for the matched author—the record is again flagged for review and skipped.

When there are candidate works in the filtered list, the script applies a two-part matching process to compare the title from the input record to each known work title:

1. Jaccard Similarity is computed between tokenized, preprocessed versions of the input title and each candidate work title. This emphasizes lexical overlap.
2. Fuzzy Matching via `fuzz.token_sort_ratio` is also used to compare the raw strings directly, capturing more flexible matches based on reordering and character similarity.
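
The idea behind a token-sort comparison can be sketched with the standard library. This is an approximation for illustration only: it uses `difflib.SequenceMatcher` as a stand-in, so its scores will not necessarily equal those of `fuzz.token_sort_ratio`, and the function name `token_sort_similarity` is hypothetical.

```python
from difflib import SequenceMatcher

def token_sort_similarity(a, b):
    # Sort the whitespace-separated tokens so that word order no longer matters
    sorted_a = " ".join(sorted(a.split()))
    sorted_b = " ".join(sorted(b.split()))
    # Compare the sorted strings character by character (0.0 to 1.0)
    return SequenceMatcher(None, sorted_a, sorted_b).ratio()

# Same words in a different order: identical after sorting
reordered = token_sort_similarity("de rerum natura", "natura de rerum")
print(reordered)  # 1.0

# Different words: the score drops below 1.0
different = token_sort_similarity("de rerum natura", "de natura deorum")
print(different < 1.0)  # True
```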

A weighted average score is then calculated for each candidate work title—60% from the Jaccard similarity and 40% from the fuzzy score. The results are sorted by this combined score in descending order.
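
The weighting can be illustrated with plain arithmetic. The helper name `combined_score` and the input values are assumptions chosen for the example; the weights and the division by 100 follow the script below.

```python
def combined_score(jaccard, fuzzy_ratio, jaccard_weight=0.6, fuzzy_weight=0.4):
    # fuzzy_ratio is on a 0-100 scale, so rescale it to 0-1 before weighting
    return jaccard_weight * jaccard + fuzzy_weight * (fuzzy_ratio / 100)

# Hypothetical inputs: half of the tokens overlap, and the fuzzy ratio is 75
example_score = combined_score(jaccard=0.5, fuzzy_ratio=75)
print(round(example_score, 3))  # 0.6, i.e. 0.6 * 0.5 + 0.4 * 0.75
```

With these weights, a candidate whose tokens do not overlap at all can score at most 0.4, which is below the 0.5 mark used for the review flag.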

If the highest-scoring title scores above a confidence threshold of 0.25, the script appends up to three of the highest-scoring candidates to the output, each with associated metadata: the original author, title, and URL, the matched author ID, the matched work title and ID, the similarity score (rounded to 3 decimal places), and a review flag (set when the score is below 0.5).

If none of the candidates meet the threshold, the script logs the input record without a matched title and flags it for manual review.
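
The selection rule can be sketched in isolation. The function name `select_candidates` and the sample scores are hypothetical; the thresholds and the top-three cutoff mirror the script below.

```python
def select_candidates(scores, threshold=0.25, review_below=0.5, top_n=3):
    # scores is a list of (title, combined_score) pairs
    ranked = sorted(scores, key=lambda item: item[1], reverse=True)
    # Only the best score is tested against the threshold
    if not ranked or ranked[0][1] <= threshold:
        return []
    return [
        {
            "matched_title": title,
            "confidence_score": round(score, 3),
            "flagged_for_review": score < review_below,
        }
        for title, score in ranked[:top_n]
    ]

# Hypothetical scores for four candidate titles
sample = [("A", 0.72), ("B", 0.31), ("C", 0.10), ("D", 0.05)]
selected = select_candidates(sample)
print([row["matched_title"] for row in selected])  # ['A', 'B', 'C']

# No candidate clears the threshold, so nothing is selected
rejected = select_candidates([("E", 0.20), ("F", 0.10)])
print(rejected)  # []
```

Note that once the best candidate clears the threshold, the second and third candidates are included even if their own scores are below 0.25, as candidate C is here; they are simply flagged for review.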

Finally, the accumulated list of processed rows is written to a new CSV file (`output/candidate_title_matches.csv`), which can then be examined, post-processed, or reviewed for quality assurance.

In [23]:
# Initialize an output list
output_rows = []

# Iterate through each row in input_df
for idx, row in input_df.iterrows():
    raw_author = row["author"]
    raw_title = row["title"]
    url = row["url"]

    # Match author to distilbert_author (ID)
    match = inferences_df[inferences_df["author"] == raw_author]
    # If no match is found, flag for review
    if match.empty:
        output_rows.append({
            "author": raw_author,
            "title": raw_title,
            "url": url,
            "author_id": None,
            "matched_title": None,
            "matched_work_id": None,
            "confidence_score": None,
            "flagged_for_review": True
        })
        continue
    
    # A match was found: parse the dictionary string in the distilbert_author column
    distilbert_data = ast.literal_eval(match["distilbert_author"].values[0])
    author_id = distilbert_data.get("author_id")

    # Filter works_db for this author_id
    candidate_works = works_df[works_df["DLL Identifier (Author)"] == author_id].copy()
    # If no works are found for this author_id, flag for review
    if candidate_works.empty:
        output_rows.append({
            "author": raw_author,
            "title": raw_title,
            "url": url,
            "author_id": author_id,
            "matched_title": None,
            "matched_work_id": None,
            "confidence_score": None,
            "flagged_for_review": True
        })
        continue

    # Extract primary title for better matching
    short_title = extract_primary_title(raw_title)
    # Tokenize and lemmatize the shortened input title
    input_tokens = preprocess(short_title)

    # Score each candidate title
    scores = []
    for _, work_row in candidate_works.iterrows():
        candidate_title = work_row["Title"]
        candidate_tokens = work_row["preprocessed_title"]
        work_id = work_row["DLL Identifier (Work)"]
        # Calculate Jaccard similarity
        sim = jaccard_similarity(input_tokens, candidate_tokens)
        # Calculate fuzzy matching score
        fuzzy_score = fuzz.token_sort_ratio(raw_title, candidate_title) / 100
        # Combine scores with weights
        combined_score = 0.6 * sim + 0.4 * fuzzy_score  # Weighted combination
        scores.append((candidate_title, combined_score, work_id))

    # Sort by score
    scores = sorted(scores, key=lambda x: x[1], reverse=True)

    # Prepare output based on scores
    # If scores are found and the top score is above 0.25, include them
    if scores and scores[0][1] > 0.25:
        for top_title, score, work_id in scores[:3]:
            output_rows.append({
                "author": raw_author,
                "title": raw_title,
                "url": url,
                "author_id": author_id,
                "matched_title": top_title,
                "matched_work_id": work_id,
                "confidence_score": round(score, 3),
                "flagged_for_review": score < 0.5
            })
    else:
        output_rows.append({
            "author": raw_author,
            "title": raw_title,
            "url": url,
            "author_id": author_id,
            "matched_title": None,
            "matched_work_id": None,
            "confidence_score": None,
            "flagged_for_review": True
        })

## Save the Output for Review

In [24]:
# Save to CSV or examine output
output_df = pd.DataFrame(output_rows)
output_df.to_csv("../output/candidate_title_matches.csv", index=False)
