# Fix Names Demo

### 1) Overview

One important part of curating the dataset from Chronicling America's search hits has been identifying victim names in the results. Given that the searches were conducted using the victims' names, you'd think it would be easy to identify them in the results. But there's been a complication. I've discovered that Chronicling America uses a kind of fuzzy matching in its newspaper search. That means it returns not just pages with the exact word or phrase from the search, but also pages with words or phrases that are similar based on certain parameters.

What are those parameters? I'm not sure about all of them, but after reviewing some of the data by hand, I know that Chron Am labels pages as search hits if the searched phrase appears on the page with just a character or so off. This is undoubtedly to account for the OCR errors in their digitized newspaper data, so that a search on Chron Am pulls hits even if they contain OCR errors.

This is a good practice, in my opinion, but it does make things more complicated when you have to filter the data in subsequent steps. For example, in my case, I want to identify where the victim's name appears on each pulled newspaper page and create my own fuzzy-matched newspaper clipping. In other words, I need to identify where the victim's name appears so I can take the series of words before and after and treat them like a clipping from the digitized form of the paper. This is essential for the rest of my process, but how do I identify the names when so many pulled pages _technically_ do not contain them as perfect matches?
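
To make the idea of a "clipping" concrete, here is a minimal sketch of taking the words before and after a name. The helper name `make_clipping`, the `window` parameter, and the sample sentence are all invented for illustration; this is not the code used elsewhere in the project, and it only finds names that already match exactly.

```python
import re

def make_clipping(text, victim_name, window=5):
    """Return up to `window` words on either side of the first exact match of victim_name.

    Hypothetical helper for illustration only.
    """
    words = text.split()
    name_parts = [part.lower() for part in victim_name.split()]
    n = len(name_parts)
    for i in range(len(words) - n + 1):
        # Compare case-insensitively, ignoring punctuation attached to each word.
        candidate = [re.sub(r'\W', '', word).lower() for word in words[i:i + n]]
        if candidate == name_parts:
            start = max(0, i - window)
            return ' '.join(words[start:i + n + window])
    # No exact match on this page.
    return None

# Invented sample sentence, not taken from the dataset.
text = 'it was reported that a mob took si king from the county jail late last night'
print(make_clipping(text, 'si king', window=3))
# a mob took si king from the county
```

Because this sketch depends on an exact match, any page where OCR has garbled the name yields no clipping at all, which is the gap the name correction is meant to close.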

To address this problem, I've redrafted my previous [fix_names() function](https://github.com/MatthewKollmer/messing-around/blob/main/vrt_work/say_their_names/build_refine_dataset.ipynb), making it more robust and effective. You can think of this function as an attempt to reverse engineer Chron Am's fuzzy matching. By that standard, it is a failure, since fix_names() does not account for all the ways Chron Am labels text as a potential search match. However, this version does a better job than its earlier versions, and it improves the data I've scraped from Chron Am tremendously. It's also relatively simple: simple enough for me to explain in one notebook, anyway.

So, I'll start with a breakdown of the function. Then I'll show how it works with some example text and code. The hope is that this notebook clarifies one important step in our data curation process and underscores how our data, while significant, is in no way representative of the whole of lynching reports, not even in our own pulled search results.

In [1]:
import re

### 2) Function Explanation

Let me start by saying that the fix_names() function is fairly conservative. It will only correct the following:

- name parts that are three or more characters long
- name parts with only one incorrect (substituted) character
- instances where adjacent name parts are missing the space between them
- instances where adjacent name parts are separated by non-word characters (i.e. incorrect punctuation)

Here's how it works in a nutshell: the function takes the text and the victim's name. It splits the victim's name into parts (first name, last name, etc.). For each part that is three or more characters long, it compiles a list of potential variations where each variation is almost correct, but just one character off. Then it combines each adjacent pair of name parts (and their potential variations) into a single pattern and scours the text, looking for matches. If it finds a match with at most one character off per name part, it replaces it with the name as spelled in the search. It also identifies names that are missing spaces, or those split by misplaced punctuation, and corrects them.
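
The variation step can be sketched on its own for a single name part. The helper name `one_off_variants` is an assumption made for this illustration; the logic mirrors what the function does for each name part, where a `.` stands in for any one character.

```python
import re

def one_off_variants(name_part):
    # Always include the exact spelling.
    variants = [re.escape(name_part)]
    # Only parts of three or more characters get one-character-off variations.
    if len(name_part) >= 3:
        for i in range(len(name_part)):
            # Swap the character at position i for a regex wildcard.
            variants.append(re.escape(name_part[:i]) + '.' + re.escape(name_part[i + 1:]))
    return variants

print(one_off_variants('king'))
# ['king', '.ing', 'k.ng', 'ki.g', 'kin.']
print(one_off_variants('si'))
# ['si']

# Test a few single words against the 'king' variations.
pattern = re.compile('(?:' + '|'.join(one_off_variants('king')) + ')', flags=re.IGNORECASE)
for candidate in ['King', 'wing', 'wink', 'kin']:
    print(candidate, bool(pattern.fullmatch(candidate)))
# King True
# wing True
# wink False
# kin False
```

`fullmatch` is used here only to test single words in isolation. fix_names() itself searches within running text, where the `.` wildcard can also match a space or a punctuation mark.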

That's a fair amount of OCR correction, but I still think it's conservative since Chron Am's fuzzy matching labels far more hits than are corrected with this function. Chron Am's fuzzy matching also brought these correctable instances into the data in the first place, so they should not be considered a new risk for false positives. 

On that note, I should emphasize that this is nowhere near the last step in filtering the data for lynching reports. It's only a step that bolsters our ability to identify victim names in the data.

In [2]:
# A Function That Corrects Names in Text
# It takes parameters 'text' and 'victim_name'.
def fix_names(text, victim_name):
    # It splits victim_name into parts.
    full_name = victim_name.split()
    
    # It walks through the name parts in adjacent pairs:
    # first_name is the current part and second_name is
    # the part that follows it. For a three-part name,
    # the pairs are (part 1, part 2) and (part 2, part 3).
    # A one-part name produces no pairs, so nothing is corrected.
    for i in range(len(full_name) - 1):
        first_name = full_name[i]
        second_name = full_name[i + 1]

        # If the length of first_name is equal to or greater than 3,
        # it compiles a list of potential variations.
        if len(first_name) >= 3:
            first_variants = [re.escape(first_name)]
            # The viable variations only include instances where
            # there is just one incorrect character.
            for character in range(len(first_name)):
                first_variants.append(re.escape(first_name[:character]) + '.' + re.escape(first_name[character+1:]))
        # If the name is only one or two characters in length,
        # no variations are built. Only its exact spelling is matched.
        else:
            first_variants = [re.escape(first_name)]

        # If the length of second_name is equal to or greater than 3,
        # it compiles a list of potential variations.
        if len(second_name) >= 3:
            second_variants = [re.escape(second_name)]
            # The viable variations only include instances where
            # there is just one incorrect character.
            for character in range(len(second_name)):
                second_variants.append(re.escape(second_name[:character]) + '.' + re.escape(second_name[character+1:]))
        # If the name is only one or two characters in length,
        # no variations are built. Only its exact spelling is matched.
        else:
            second_variants = [re.escape(second_name)]

        # The OCR patterns are assembled. Each list of variations
        # is wrapped in a non-capturing group (the ?: below), which
        # groups the alternatives without storing the matched text
        # as a separate numbered group. This keeps the alternation
        # self-contained when it is dropped into the larger pattern.
        first_pattern = '(?:' + '|'.join(first_variants) + ')'
        second_pattern = '(?:' + '|'.join(second_variants) + ')'
        
        # Both patterns are assembled into one pattern with zero or
        # more non-word characters (\W*) allowed between them, so names
        # run together or split by punctuation still match.
        # Variations in uppercase/lowercase are ignored.
        pattern = re.compile(rf'({first_pattern})\W*({second_pattern})', flags=re.IGNORECASE)
        
        # Using the compiled patterns of potential OCR errors,
        # the function runs through the text and substitutes
        # potential errors with the correct spellings of names.
        text = pattern.sub(f' {first_name} {second_name} ', text)

    # The substitutions pad names with spaces, so all runs of
    # whitespace (including line breaks) are collapsed to one space.
    text = ' '.join(text.split())
    
    # The corrected text is returned.    
    return text
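
To see what the assembled pattern looks like, and to check what it would match before anything is substituted, here is a small sketch with the pattern for the pair 'si' and 'king' written out by hand. Using `finditer` to list the matches is my addition for inspection purposes; fix_names() itself goes straight to `sub`.

```python
import re

# The pattern fix_names() assembles for the pair 'si' + 'king', written out by hand.
first_pattern = '(?:si)'                        # two characters: exact spelling only
second_pattern = '(?:king|.ing|k.ng|ki.g|kin.)'  # exact spelling plus one-off variations
pattern = re.compile(rf'({first_pattern})\W*({second_pattern})', flags=re.IGNORECASE)

text = 'Ti King, siking, si wing'

# Collect the whole match plus the two captured name parts.
matches = [(m.group(0), m.group(1), m.group(2)) for m in pattern.finditer(text)]
for whole, first, second in matches:
    print(repr(whole), '->', repr(first), repr(second))
# 'siking' -> 'si' 'king'
# 'si wing' -> 'si' 'wing'
```

Listing the matches this way is one option for auditing which strings on a page are about to be rewritten.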

### 3) Demonstrations

Below is a series of simple demonstrations showing how and when the function makes corrections (and when it doesn't). Feel free to change the 'text' in the examples below and see if the function works as described!

#### Example 1: Bill Wiley Douglass

In [3]:
victim_name = 'bill wiley douglass'

# example misspellings include:
# 1) first name correct, second name one character off, third name one character off
# 2) first name one character off, second and third name missing a space
# 3) first name with more than one incorrect character (therefore uncorrected), second name one character off, third name one character off
text = 'Bill Wizey bouglass, 8ill WileyDouglass, blth miley douglasz'

result = fix_names(text, victim_name)
print(result)

bill wiley douglass , bill wiley douglass , blth wiley douglass
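
The space before each comma in this output comes from the replacement string, which pads every corrected name with a space on both sides. If that spacing matters in a later step, one option is a small cleanup pass like the sketch below. The helper name `tidy_punctuation` is hypothetical and not part of fix_names().

```python
import re

def tidy_punctuation(text):
    # Remove whitespace that sits directly before common punctuation marks.
    return re.sub(r'\s+([,.;:!?])', r'\1', text)

result = 'bill wiley douglass , bill wiley douglass , blth wiley douglass'
print(tidy_punctuation(result))
# bill wiley douglass, bill wiley douglass, blth wiley douglass
```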


#### Example 2: James J Johnson

In [4]:
victim_name = 'james j johnson'

# example misspellings include:
# 1) first name one character off plus misplaced punctuation, second name wrong; left uncorrected because a one-character name part must match exactly
# 2) name corrected to add spaces between parts
# 3) name corrected so misplaced punctuation is removed
text = 'games.t johnson, jamesjjohnson, james.j. johnson'

result = fix_names(text, victim_name)
print(result)

games.t johnson, james j johnson , james j johnson


#### Example 3: Si King

In [5]:
victim_name = 'si king'

# example misspellings include:
# 1) first name left since it's only two characters long
# 2) name corrected to add spaces between parts
# 3) second name corrected since it only has one character off
text = 'Ti King, siking, si wing'

result = fix_names(text, victim_name)
print(result)

Ti King, si king , si king
