# Enrich the Dataset

In [None]:
import os
import pandas as pd
import re
from tqdm import tqdm
import shutil
import nltk

## 1) Overview

This notebook enriches the name clusters dataset, allowing for further analysis. It does so in the following ways:

- reorganizes the directory (removing subdirectory chunks)
- fixes some OCR issues to identify more references to victims
- creates 'clippings' of the text around references to victims
- counts signal words (i.e., words related to violence and race) that appear near the victim's name

These are essentially the preliminary data enrichments (i.e., NLP/text-mining steps) necessary for further analysis. More preprocessing will be required for things like geospatially mapping the data, classifying the data, and so forth. But the steps in this notebook make those things possible.

## 2) Reorganize Directories

When I scraped the data from Chronicling America, I broke it into chunks, and those chunks remained as subdirectories. Rather than accounting for these subdirectories in all my code, I decided to move all the files to the main directory and delete the subdirectories. This step does nothing else; it only moves the files to simplify the following steps.

In [None]:
# move all csvs to main directory and delete the 'chunk' subdirectories

directory = 'name_clusters'

for sub in os.listdir(directory):
    sub_path = os.path.join(directory, sub)

    if os.path.isdir(sub_path):
        for file in os.listdir(sub_path):
            original_file = os.path.join(sub_path, file)
            moved_file = os.path.join(directory, file)
            shutil.move(original_file, moved_file)

        if not os.listdir(sub_path):
            os.rmdir(sub_path)
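
Flattening the chunks assumes that no two subdirectories hold a file with the same name. If they did, one file could replace the other during the move. The following is a minimal sketch of a check that could be run before the move. The helper name `find_name_collisions` is hypothetical and not part of the original notebook.

```python
import os
from collections import defaultdict

def find_name_collisions(directory):
    """Map each filename to the places it appears, keeping only names found more than once."""
    seen = defaultdict(list)
    for sub in sorted(os.listdir(directory)):
        sub_path = os.path.join(directory, sub)
        if os.path.isdir(sub_path):
            # record every file inside this chunk subdirectory
            for file in os.listdir(sub_path):
                seen[file].append(sub)
        else:
            # a file that already sits in the main directory
            seen[sub].append('.')
    return {name: places for name, places in seen.items() if len(places) > 1}
```

Calling `find_name_collisions('name_clusters')` before the move returns an empty dictionary when every filename is unique across the chunks.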

## 3) Fix OCR Name Variations

Chronicling America uses fuzzy matching in its search. This means that search results contain not just victims' names spelled correctly, but also variations and similar phrases. This no doubt accounts for OCR errors, but it also means that many of the pages I scraped do not contain exact matches for the victim names, making further processing more difficult.

This step responds to the issue by correcting near-matches for victims' names. I created the function fix_names() to do this. It is a conservative OCR correction algorithm that changes only slight variations in victim names. To learn more about it, check out [fix_names_demo.ipynb](https://github.com/MatthewKollmer/messing-around/blob/main/vrt_work/say_their_names/fix_names_demo.ipynb).

But first, I had to add the victim names in a new column called 'victim':

In [None]:
# get victim name from the csv filename and put it into a new column 'victim'

for filename in os.listdir(directory):
    victim_name = os.path.splitext(filename)[0].replace('_', ' ')
    file_path = os.path.join(directory, filename)
        
    try:
        df = pd.read_csv(file_path)
        df['victim'] = str(victim_name)
        df['victim'] = df['victim'].astype(str).str.lower()
        # just ensuring 'text' column is read as string at this step. This is critical at later steps, and it was easy to add to the loop here.
        df['text'] = df['text'].astype(str).str.lower()
        df.to_csv(file_path, index=False)
        del df

    except Exception as e:
        print(f'Error! {e} issue with {file_path}. Just FYI: this file has been skipped.')

I also wanted to test how many more hits the fix_names() function was enabling, so first I counted the number of instances of victim names in the data:

In [None]:
# count instances of victim names in the text before running OCR correction

total_rows = 0
for filename in os.listdir(directory):
    file_path = os.path.join(directory, filename)
    try:
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
            rows = sum(1 for _ in f) - 1 # subtracting the first row since it's the column names
            if rows > 0:
                total_rows += rows
                
    except Exception as e:
        print(f'Error! {e} issue with {file_path}. Just FYI: this file has been skipped.')

victim_counts = {}

with tqdm(total=total_rows, desc='Counting progress on total rows', unit='rows') as pbar:
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        try:
            df = pd.read_csv(file_path)
            for _, row in df.iterrows():
                victim_name = row['victim']
                text = row['text']
                    
                if pd.isnull(victim_name) or pd.isnull(text):
                    continue

                count = text.count(victim_name)
                if count > 0:
                    victim_counts[victim_name] = victim_counts.get(victim_name, 0) + count
            
            pbar.update(len(df))
            del df

        except Exception as e:
            print(f'Error! {e} issue with {file_path}. Just FYI: this file has been skipped.')

print(f'Number of times the victim names appear in the data: {sum(victim_counts.values())}')

# Number of times the victim names appear in the data: 317458

The result was 317,458 instances of victim names. Keep in mind, the dataset contains 453,050 pages, and each one is supposed to have at least one instance of a victim's name. That means that without OCR correction, about 30% of the pages (or more, since some pages contain several instances) lacked a correctly spelled victim name, making them essentially useless in further analysis.

Hence the function fix_names():

In [None]:
# A Function That Corrects Names in Text

def fix_names(text, victim_name):
    full_name = victim_name.split()
    for i in range(len(full_name) - 1):
        first_name = full_name[i]
        second_name = full_name[i + 1]

        if len(first_name) >= 3:
            first_variants = [re.escape(first_name)]
            for character in range(len(first_name)):
                first_variants.append(re.escape(first_name[:character]) + '.' + re.escape(first_name[character+1:]))
        else:
            first_variants = [re.escape(first_name)]

        if len(second_name) >= 3:
            second_variants = [re.escape(second_name)]
            for character in range(len(second_name)):
                second_variants.append(re.escape(second_name[:character]) + '.' + re.escape(second_name[character+1:]))

        else:
            second_variants = [re.escape(second_name)]

        first_pattern = '(?:' + '|'.join(first_variants) + ')'
        second_pattern = '(?:' + '|'.join(second_variants) + ')'
        
        pattern = re.compile(rf'({first_pattern})\W*({second_pattern})', flags=re.IGNORECASE)
        
        text = pattern.sub(f' {first_name} {second_name} ', text)

    text = ' '.join(text.split())
    
    return text
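
To make the matching rule concrete, here is a small sketch that isolates the variant-building step of fix_names(). The helper name `one_char_variants` and the sample strings are illustrative assumptions, not part of the original notebook. Each name of three or more characters is matched either exactly or with any single character swapped for a wildcard.

```python
import re

def one_char_variants(name):
    """Regex alternatives for `name`: the exact spelling plus one wildcard per position."""
    variants = [re.escape(name)]
    if len(name) >= 3:  # shorter names are matched exactly, as in fix_names()
        for i in range(len(name)):
            variants.append(re.escape(name[:i]) + '.' + re.escape(name[i + 1:]))
    return variants

print(one_char_variants('sam'))
# ['sam', '.am', 's.m', 'sa.']

# Combine two adjacent names the same way fix_names() does.
first = '(?:' + '|'.join(one_char_variants('john')) + ')'
second = '(?:' + '|'.join(one_char_variants('smith')) + ')'
pattern = re.compile(rf'{first}\W*{second}', flags=re.IGNORECASE)

print(bool(pattern.search('a man named jchn smith was taken')))   # True: one wrong character
print(bool(pattern.search('a man named jchn srnith was taken')))  # False: 'srnith' adds a character
```

The second sample shows the conservative side of the approach: an OCR error that changes the length of a name (here 'm' read as 'rn') is left uncorrected.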

I then ran the function on all the data:

In [None]:
# Run the OCR correction on the data

total_rows = 0
for filename in os.listdir(directory):
    file_path = os.path.join(directory, filename)
    try:
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
            rows = sum(1 for _ in f) - 1 # subtracting the first row since it's the column names
            if rows > 0:
                total_rows += rows
                
    except Exception as e:
        print(f'Error! {e} issue with {file_path}. Just FYI: this file has been skipped.')

with tqdm(total=total_rows, desc='Counting progress on total rows', unit='rows') as pbar:
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        try:
            df = pd.read_csv(file_path)
            df['text'] = df['text'].astype(str)
            df['victim'] = df['victim'].astype(str)
            df['text'] = df.apply(lambda row: fix_names(row['text'], row['victim']), axis=1)
            df.to_csv(file_path, index=False)
            pbar.update(len(df))
            del df

        except Exception as e:
            print(f'Error! {e} issue with {file_path}. Just FYI: this file has been skipped.')

Once complete, I recounted instances of victim names:

In [None]:
# count the instances of victim names in the text after OCR correction

total_rows = 0
for filename in os.listdir(directory):
    file_path = os.path.join(directory, filename)
    try:
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
            rows = sum(1 for _ in f) - 1 # subtracting the first row since it's the column names
            if rows > 0:
                total_rows += rows
                
    except Exception as e:
        print(f'Error! {e} issue with {file_path}. Just FYI: this file has been skipped.')

victim_counts = {}

with tqdm(total=total_rows, desc='Counting progress on total rows', unit='rows') as pbar:
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        try:
            df = pd.read_csv(file_path)
            for _, row in df.iterrows():
                victim_name = row['victim']
                text = row['text']
                    
                if pd.isnull(victim_name) or pd.isnull(text):
                    continue

                count = text.count(victim_name)
                if count > 0:
                    victim_counts[victim_name] = victim_counts.get(victim_name, 0) + count
            
            pbar.update(len(df))
            del df

        except Exception as e:
            print(f'Error! {e} issue with {file_path}. Just FYI: this file has been skipped.')

print(f'Number of times the victim names appear in the data: {sum(victim_counts.values())}')

# Number of times the victim names appear in the data: 423041

The result was 423,041 instances of victim names in the data. That's a marked improvement (+105,583 instances). This figure does not necessarily mean 105,583 more pages, since some pages probably have more than one instance of the victim's name. Even so, this step does a lot to enrich the data. It ensures I can review over 105k more instances of victim names, even though the fix_names() function is deliberately conservative.
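
Because the totals above count instances rather than pages, a row-level count can complement them. The sketch below assumes each row is one page with the 'text' and 'victim' columns created earlier. The helper name `count_rows_with_name` is hypothetical and not part of the original notebook.

```python
import pandas as pd

def count_rows_with_name(df):
    """Number of rows whose 'text' contains the row's 'victim' name at least once."""
    return sum(
        str(victim) in str(text)
        for victim, text in zip(df['victim'], df['text'])
        if pd.notna(victim) and pd.notna(text)  # skip rows with missing values
    )

sample = pd.DataFrame({
    'victim': ['john smith', 'john smith', 'john smith'],
    'text': ['john smith was seen. john smith fled.', 'no match here', None],
})
print(count_rows_with_name(sample))  # 1 (two instances, but only one page)
```

Summing this value over every file before and after OCR correction would show how many additional pages, rather than instances, became usable.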

## 4) Get Clippings

In this step, I used the function get_clippings() to create a new column in the data. This new column contains the 50 words before the victim's name and the 100 words after the victim's name. The "words" here are tokens from nltk.word_tokenize(), so punctuation marks count toward those totals. The goal is to approximate the size of a newspaper clipping.

This range can easily be adapted. All you need to do is change the number of 'prewords' and 'postwords' in the following code. I chose 50 and 100 respectively, however, because in previous iterations of this project, I noticed that victim names often appear near the beginning of a lynching report (within the first 50 words or so), and in turn, 100 words after the victim's name provides sufficient text to verify if it is in fact a lynching report.

In [None]:
def get_clippings(text, victim_name, prewords=50, postwords=100):
    text = nltk.word_tokenize(text)
    victim_tokens = nltk.word_tokenize(victim_name)
    
    indices = []
    for i in range(len(text) - len(victim_tokens) + 1):
        if text[i:i + len(victim_tokens)] == victim_tokens:
            indices.append(i)
    
    if not indices:
        return None
    
    clippings = []
    for index in indices:
        start_index = max(0, index - prewords)
        end_index = index + len(victim_tokens) + postwords
        clipping_words = text[start_index:end_index]
        clipping = ' '.join(clipping_words)
        clippings.append(clipping)
    
    clippings = 'END CLIPPING | START CLIPPING'.join(clippings)
    
    return clippings
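
The windowing logic can be seen on a toy input. This sketch uses `str.split()` instead of `nltk.word_tokenize()` so that it runs without downloading tokenizer data, which means punctuation is handled differently than in get_clippings(). The helper name `clip_windows` is hypothetical.

```python
def clip_windows(tokens, name_tokens, prewords=50, postwords=100):
    """Return one token window per occurrence of name_tokens in tokens."""
    n = len(name_tokens)
    windows = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == name_tokens:
            start = max(0, i - prewords)  # never reach before the start of the page
            end = i + n + postwords       # slicing past the end is safe in Python
            windows.append(tokens[start:end])
    return windows

tokens = 'a b c john smith d e f g'.split()
print(clip_windows(tokens, ['john', 'smith'], prewords=2, postwords=3))
# [['b', 'c', 'john', 'smith', 'd', 'e', 'f']]
```

Each window includes the name itself, so a full-size clipping holds prewords + len(name_tokens) + postwords tokens.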

This code runs get_clippings() on the dataset:

In [None]:
total_rows = 0
for filename in os.listdir(directory):
    file_path = os.path.join(directory, filename)
    try:
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
            rows = sum(1 for _ in f) - 1 # subtracting the first row since it's the column names
            if rows > 0:
                total_rows += rows
                
    except Exception as e:
        print(f'Error! {e} issue with {file_path}. Just FYI: this file has been skipped.')

with tqdm(total=total_rows, desc='Counting progress on total rows', unit='rows') as pbar:
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        try:
            df = pd.read_csv(file_path)
            df['text'] = df['text'].astype(str)
            df['victim'] = df['victim'].astype(str)
            df['clippings'] = df.apply(lambda row: get_clippings(row['text'], row['victim'], prewords=50, postwords=100), axis=1)
            df.to_csv(file_path, index=False)
            pbar.update(len(df))
            del df

        except Exception as e:
            print(f'Error! {e} issue with {file_path}. Just FYI: this file has been skipped.')
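
The row-counting preamble that feeds the progress bar recurs in several cells of this notebook. As a sketch, it could be factored into a helper. The name `count_data_rows` is hypothetical and not part of the original notebook.

```python
import os

def count_data_rows(directory):
    """Total line count across files in `directory`, minus one header line per file."""
    total = 0
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        if not os.path.isfile(file_path):
            continue  # ignore any leftover subdirectories
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
            # subtract the header row; an empty file contributes zero
            total += max(sum(1 for _ in f) - 1, 0)
    return total
```

Like the cells above, this counts physical lines, so any quoted field that spans several lines would inflate the total. That affects only the progress bar estimate, not the data.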

## 5) Identify Signal Words

Finally, I created two lexicons to assist in reviewing the clippings. One is 'violence_signals' and the other is 'racist_signals'. This code counts instances of words in these lexicons in the clippings and saves the counts in new columns.

This is just a preliminary step to help filter actual lynching reports from false positives. Basically, these counts allow me to look at clippings that are more likely to be positive hits. It is not a conclusive step. No data has been removed from the dataset on the basis of these word counts.

In [None]:
# get signal words for lynching/violence

violence_signals = ['lynch', 'mob', 'murder', 'posse', 'hang', 'hung', 'burn', 'shot', 'gun', 'stab', 'cut', 'bayonet', 'bullet', 'drown', 'beat', 'whip', 'assault', 'death', 'jail', 'prison']

total_rows = 0
for filename in os.listdir(directory):
    file_path = os.path.join(directory, filename)
    try:
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
            rows = sum(1 for _ in f) - 1 # subtracting the first row since it's the column names
            if rows > 0:
                total_rows += rows
                
    except Exception as e:
        print(f'Error! {e} issue with {file_path}. Just FYI: this file has been skipped.')

with tqdm(total=total_rows, desc='Counting progress on total rows', unit='rows') as pbar:
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        try:
            df = pd.read_csv(file_path)
            df['clippings'] = df['clippings'].astype(str)
            df['violence_word_count'] = df['clippings'].apply(lambda text: sum(text.count(word) for word in violence_signals))
            df.to_csv(file_path, index=False)
            pbar.update(len(df))
            del df
            
        except Exception as e:
            print(f'Error! {e} issue with {file_path}. Just FYI: this file has been skipped.')

In [None]:
# get signal words for race

racist_signals = ['negro', 'colored', 'black', 'nigger', 'negroid', 'mulatto', 'african', 'coon']

total_rows = 0
for filename in os.listdir(directory):
    file_path = os.path.join(directory, filename)
    try:
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
            rows = sum(1 for _ in f) - 1 # subtracting the first row since it's the column names
            if rows > 0:
                total_rows += rows
                
    except Exception as e:
        print(f'Error! {e} issue with {file_path}. Just FYI: this file has been skipped.')

with tqdm(total=total_rows, desc='Counting progress on total rows', unit='rows') as pbar:
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        try:
            df = pd.read_csv(file_path)
            df['clippings'] = df['clippings'].astype(str)
            df['racist_word_count'] = df['clippings'].apply(lambda text: sum(text.count(word) for word in racist_signals))
            df.to_csv(file_path, index=False)
            pbar.update(len(df))
            del df
            
        except Exception as e:
            print(f'Error! {e} issue with {file_path}. Just FYI: this file has been skipped.')
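
The counts above use substring matching, so a short stem such as 'gun' is also counted inside unrelated words such as 'begun'. The sketch below shows one possible alternative that only counts words beginning with a signal stem. The helper name `count_signal_words` and the sample sentence are illustrative assumptions, and this is not the method used to produce the columns above.

```python
import re

def count_signal_words(text, signals):
    """Count words in `text` that begin with any signal stem (case-insensitive)."""
    stems = '|'.join(re.escape(s) for s in signals)
    pattern = re.compile(r'\b(?:' + stems + r')\w*', flags=re.IGNORECASE)
    return len(pattern.findall(text))

sample = 'the mob had begun to gather before the lynching'
stems = ['lynch', 'mob', 'gun']

print(sum(sample.count(word) for word in stems))  # 3: substring matching counts 'gun' in 'begun'
print(count_signal_words(sample, stems))          # 2: only 'mob' and 'lynching'
```

Anchoring at the start of a word still keeps inflected forms such as 'lynching' or 'hanged', though it can also match unrelated words that merely share a prefix, so the counts remain a rough filter either way.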