# Enriching Data with Location Information

### 1) Overview

The following notebook shows my process for adding lynching locations, newspaper locations, and other newspaper metadata to our dataset. It relies on our initial data sources (the Tolnay-Beck Inventory and the Seguin-Rigby Dataset) as well as newspaper location data compiled by the Viral Texts Project. I've also used metadata encoded in the Chronicling America URLs to build this pipeline.

The final result of the steps below (and of the steps taken in previous notebooks) is that each csv file (one per victim) contains the following data:

- victim (victim's name)
- race (Black or white)
- gender (if known)
- lynch_date (specified to either the day or the month)
- lynch_location (town/city, county/parish, and state, if known)
- lynch_latitude (based on lynch_location)
- lynch_longitude (based on lynch_location)
- newspaper (title of newspaper)
- reprint_date (date of specified page)
- reprint_longitude (based on newspaper location)
- reprint_latitude (based on newspaper location)
- clippings (the 50 words before the victim's name and the 100 words after it)
- text (the full OCR text of the given page)
- probability (our estimate of how likely the clipping is to refer to a lynching, labelled as either 'high', 'medium', 'low', 'unlikely', or 'unknown')
- BERT_1 (first round of BERT classification, either 'yes' or 'no')
- BERT_2 (second round of BERT classification, either 'yes' or 'no')
- BERT_3 (third round of BERT classification, either 'yes' or 'no')
- violence_word_count (the number of times words from our violence word lexicon appear in the clipping)
- racist_word_count (the number of times words from our racist word lexicon appear in the clipping)
- page_details (newspaper title, city, state, issue, and page number taken from our Chron Am search results)
- url (the Chron Am URL to the given page)
- sn_code (Chron Am's newspaper code for the given newspaper)
- coverage (the dbpedia url used to cross-reference newspaper locations)

In [None]:
import re
import pandas as pd
import os
from tqdm import tqdm
import shutil

### 2) Getting Lynch Location Data

First, I revisited the unified lynch inventory created with 01_unify_data_sources.ipynb (our combined version of the Tolnay-Beck and Seguin-Rigby datasets). By cross-referencing the victim names in this inventory with the victim names in the csv files, I was able to pull each lynch location (city/town, county, state) along with its latitude and longitude.

In [None]:
lynch_inventory = pd.read_csv('subset_cleaned_combined_lynch_inventories.csv')
lynch_inventory.head()

I started by creating a dictionary of location data from lynch_inventory for cross-referencing. I also defined the directory and csv file paths.

In [None]:
location_data_dictionary = {}
for _, row in lynch_inventory.iterrows():
    victim_name = row['victim_name']
    location = row['lynch_location']
    latitude = row['latitude']
    longitude = row['longitude']
    location_data_dictionary[victim_name] = (location, latitude, longitude)

directory = 'name_clusters'
csv_files = [f for f in os.listdir(directory) if f.endswith('.csv')]

Then I used the following loop to add the dictionary location data if/when the victim name in the given csv file matched a victim name in the dictionary. I added the progress bar to keep track of processing time, as usual.

In [None]:
total_rows = 0
all_csv_files = [f for f in os.listdir(directory) if f.endswith('.csv')]
for filename in all_csv_files:
    file_path = os.path.join(directory, filename)
    df = pd.read_csv(file_path)
    total_rows += len(df)

pbar = tqdm(total=total_rows, desc='rows')

for file in csv_files:
    file_path = os.path.join(directory, file)
    df_temp = pd.read_csv(file_path)
    added_locations = []
    added_latitudes = []
    added_longitudes = []
    
    for victim_name in df_temp['victim']:
        if victim_name in location_data_dictionary:
            location, latitude, longitude = location_data_dictionary[victim_name]
        else:
            location, latitude, longitude = (None, None, None)
        
        added_locations.append(location)
        added_latitudes.append(latitude)
        added_longitudes.append(longitude)
        pbar.update(1)
    
    df_temp['lynch_location'] = added_locations
    df_temp['lynch_latitude'] = added_latitudes
    df_temp['lynch_longitude'] = added_longitudes
    df_temp.to_csv(file_path, index=False)

pbar.close()
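
The row-by-row lookup above can also be written as a vectorized `Series.map`. What follows is a minimal sketch on toy data, not the code that produced our files: the names, places, and coordinates in `toy_inventory` and `toy_clippings` are made up for illustration. It also lists the victim names that found no match, which is worth checking because the lookup depends on exact string equality.

```python
import pandas as pd

# Hypothetical stand-ins for lynch_inventory and for one victim csv file
toy_inventory = pd.DataFrame({
    'victim_name': ['Example Person A', 'Example Person B'],
    'lynch_location': ['Town A, County A, State A', 'Town B, County B, State B'],
    'latitude': [30.0, 35.5],
    'longitude': [-90.0, -85.25],
})
toy_clippings = pd.DataFrame({'victim': ['Example Person A', 'Example Person C']})

# Index the inventory by victim name so each column works as a lookup table.
# keep='last' mirrors the dictionary above, where a later row overwrites an earlier one.
lookup = toy_inventory.drop_duplicates('victim_name', keep='last').set_index('victim_name')

toy_clippings['lynch_location'] = toy_clippings['victim'].map(lookup['lynch_location'])
toy_clippings['lynch_latitude'] = toy_clippings['victim'].map(lookup['latitude'])
toy_clippings['lynch_longitude'] = toy_clippings['victim'].map(lookup['longitude'])

# Names without an exact match in the inventory are left with missing values
unmatched = toy_clippings.loc[toy_clippings['lynch_location'].isna(), 'victim'].tolist()
print(unmatched)
```

A non-empty `unmatched` list usually points to spelling or spacing differences between the two sources rather than to a truly missing victim.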

### 3) Getting SN Code and Reprint Date

At this step, I used data encoded in the Chron Am URLs. Each page URL contains the newspaper's sn code and the issue date, so I used regular expressions to pull both values out of the url column. For more info about these URL patterns, check out this page on the Chron Am API: [https://chroniclingamerica.loc.gov/about/api/](https://chroniclingamerica.loc.gov/about/api/).

In [None]:
# here's a regular expression for capturing the sn codes in the Chron Am url
sn_code_regex = re.compile(r'sn\d{8}')

total_rows = 0
all_csv_files = [f for f in os.listdir(directory) if f.endswith('.csv')]
for filename in all_csv_files:
    file_path = os.path.join(directory, filename)
    df = pd.read_csv(file_path)
    total_rows += len(df)

pbar = tqdm(total=total_rows, desc='rows')

for filename in all_csv_files:
    file_path = os.path.join(directory, filename)
    df = pd.read_csv(file_path)
    sn_codes = []
    for url in df['url']:
        if pd.isna(url):
            sn_codes.append(None)
        else:
            sn_match = sn_code_regex.search(url)
            if sn_match:
                sn_codes.append(sn_match.group(0))
            else:
                sn_codes.append(None)
        
        pbar.update(1)
    
    df['sn_code'] = sn_codes
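    # prefix with '/lccn/' so the codes match the 'series' values in the Viral Texts data used in step 5
    df['sn_code'] = df['sn_code'].apply(lambda x: '/lccn/' + x if x else None)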
    df.to_csv(file_path, index=False)

pbar.close()

In [None]:
# here's a regex pattern for capturing dates in Chron Am's URLs
date_regex = re.compile(r'\d{4}-\d{2}-\d{2}')

total_rows = 0
all_csv_files = [f for f in os.listdir(directory) if f.endswith('.csv')]
for filename in all_csv_files:
    file_path = os.path.join(directory, filename)
    df = pd.read_csv(file_path)
    total_rows += len(df)

pbar = tqdm(total=total_rows, desc='rows')

for filename in all_csv_files:
    file_path = os.path.join(directory, filename)
    df = pd.read_csv(file_path)
    reprint_dates = []
    for url in df['url']:
        if pd.isna(url):
            reprint_dates.append(None)
        else:
            date_match = date_regex.search(url)
            if date_match:
                reprint_dates.append(date_match.group(0))
            else:
                reprint_dates.append(None)
        
        pbar.update(1)

    df['reprint_date'] = reprint_dates
    df['reprint_date'] = pd.to_datetime(df['reprint_date'], format='%Y-%m-%d', errors='coerce')   
    df.to_csv(file_path, index=False)

pbar.close()
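
To make the two patterns concrete, here is a small sketch that applies them to a single URL. The URL is only an illustrative example of the `/lccn/<code>/<date>/ed-<n>/seq-<n>/` layout, not a row taken from our data.

```python
import re

# The same two patterns used in the cells above
sn_code_regex = re.compile(r'sn\d{8}')
date_regex = re.compile(r'\d{4}-\d{2}-\d{2}')

# Illustrative page URL (an assumption for this example)
example_url = 'https://chroniclingamerica.loc.gov/lccn/sn83030214/1900-01-01/ed-1/seq-3/'

sn_match = sn_code_regex.search(example_url)
date_match = date_regex.search(example_url)

# Fall back to None when a pattern does not match, as the loops above do
sn_code = sn_match.group(0) if sn_match else None
reprint_date = date_match.group(0) if date_match else None
print(sn_code, reprint_date)  # sn83030214 1900-01-01
```

Note that any identifier that does not fit "sn" plus eight digits would come back as None with this pattern, so it is worth counting the missing sn_code values after the loop runs.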

### 4) Getting Newspaper Title

At this step, I first had to change the 'newspaper' column name to 'page_details'. If you remember from our scraping processes in steps 02 and 03, we pulled the 'newspaper' data from the Chron Am search results. This data included the newspaper name, city/state, issue, and page number all in one column. To make the column's contents clearer, I renamed it to 'page_details'. Then I used a regular expression to pull the newspaper titles from page_details and save them in a new column called 'newspaper'.

In [None]:
# loop that simply changes column names (newspaper becomes page_details)
for filename in all_csv_files:
    file_path = os.path.join(directory, filename)
    df = pd.read_csv(file_path)
    df.rename(columns={'newspaper': 'page_details'}, inplace=True)
    df.to_csv(file_path, index=False)

In [None]:
# here's the regex pattern I used to extract newspaper name from page_details
# FYI: the newspaper name always comes first in page_details. It is followed by either '. [' or '. ('
# the pattern matches a period, a space, and then an opening '[' or '('
newspaper_pattern = re.compile(r'\. [\[(]')

# just a quick function that splits the string in the inputted column using the newspaper_pattern
def extract_newspaper_with_regex(text):
    paper_title = newspaper_pattern.split(str(text), maxsplit=1)
    return paper_title[0]

for filename in all_csv_files:
    file_path = os.path.join(directory, filename)
    df = pd.read_csv(file_path)
    df['newspaper'] = df['page_details'].apply(extract_newspaper_with_regex) # function applied
    df.to_csv(file_path, index=False)
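
As a quick check of the splitting logic, the sketch below runs a title pattern on two strings. Both strings are hypothetical examples written in the shape described above (title first, followed by '. [' or '. ('), and `title_end` and `extract_title` are illustrative names rather than part of the pipeline.

```python
import re

# Assumed pattern: a period, a space, then '[' or '('
title_end = re.compile(r'\. [\[(]')

def extract_title(page_details):
    """Return the text before the first '. [' or '. (' in page_details."""
    return title_end.split(str(page_details), maxsplit=1)[0]

# Hypothetical page_details strings
samples = [
    'The sun. [volume] (New York [N.Y.]) 1833-1916, January 01, 1900, Page 3, Image 3',
    'The example gazette. (Sample City, S.C.) 1890-1900, May 05, 1895, Image 1',
]
titles = [extract_title(s) for s in samples]
print(titles)  # ['The sun', 'The example gazette']
```

If a string contains neither delimiter, the split returns the whole string unchanged, so unusually long values in the newspaper column are a sign that a row did not follow the expected format.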

### 5) Getting Newspaper Latitude & Longitude

At this step, I'm relying on newspaper location data from the Viral Texts Project. This data links each newspaper to a DBpedia URL, and it also provides latitude/longitude data for each DBpedia URL. This means I essentially had to cross-reference three sets of data: our newspapers (by sn code) to the DBpedia links, and then the DBpedia links to the lat/long data.

In [None]:
lat_long_df = pd.read_csv('https://raw.githubusercontent.com/ViralTexts/newspaper-metadata/main/places.csv')
series_df = pd.read_csv('https://raw.githubusercontent.com/ViralTexts/newspaper-metadata/main/series.csv')

In [None]:
# I start by mapping the series coverage data (the DBpedia links) to a dictionary for faster reference to sn codes
# then adding the coverage links into a new column 
coverage_dictionary = dict(zip(series_df['series'], series_df['coverage']))

for filename in os.listdir(directory):
    if not filename.endswith('.csv'):
        continue
        
    file_path = os.path.join(directory, filename)
    df = pd.read_csv(file_path)
    df['coverage'] = df['sn_code'].map(coverage_dictionary)
    df.to_csv(file_path, index=False)

In [None]:
# then a dictionary with lat/long data
# and adding lat/long data in new columns
latitude_longitude_dictionary = dict(zip(lat_long_df['coverage'], zip(lat_long_df['lon'], lat_long_df['lat'])))

for filename in os.listdir(directory):
    if not filename.endswith('.csv'):
        continue
    
    file_path = os.path.join(directory, filename)
    df = pd.read_csv(file_path)
    df['reprint_longitude'], df['reprint_latitude'] = zip(*df['coverage'].apply(lambda c: latitude_longitude_dictionary.get(c, (None, None))))
    df.to_csv(file_path, index=False)
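
The same two-step lookup can be sketched as a pair of left joins. This is an alternative formulation on toy data, not the method used above: the column names (`series`, `coverage`, `lon`, `lat`) come from the cells in this section, while every value in the toy frames is made up.

```python
import pandas as pd

# Hypothetical stand-ins for series.csv, places.csv, and one victim csv file
toy_series = pd.DataFrame({
    'series': ['/lccn/sn00000001', '/lccn/sn00000002'],
    'coverage': ['http://dbpedia.org/resource/Example_City',
                 'http://dbpedia.org/resource/Sample_Town'],
})
toy_places = pd.DataFrame({
    'coverage': ['http://dbpedia.org/resource/Example_City'],
    'lon': [-90.0],
    'lat': [30.0],
})
toy_pages = pd.DataFrame({'sn_code': ['/lccn/sn00000001', '/lccn/sn00000002', None]})

# Step 1: sn code -> DBpedia link; step 2: DBpedia link -> coordinates
merged = (
    toy_pages
    .merge(toy_series, left_on='sn_code', right_on='series', how='left')
    .merge(toy_places, on='coverage', how='left')
    .rename(columns={'lon': 'reprint_longitude', 'lat': 'reprint_latitude'})
    .drop(columns='series')
)
print(merged)
```

One difference to keep in mind: a dictionary keeps a single value per key, whereas a merge would duplicate rows if the lookup table listed the same key more than once, so the dictionary approach is the safer choice when the lookup tables have not been checked for duplicates.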

### 6) Re-ordering Columns for Ease of Use

At this point, the csv files contain all the data I had planned to include. For ease of use, however, I thought it'd be better to reorder the columns. So, I created the following order:

In [None]:
desired_order = ['victim', 'race', 'gender', 'lynch_date', 'lynch_location', 'lynch_latitude', 'lynch_longitude', 'newspaper', 'reprint_date', 'reprint_longitude', 'reprint_latitude', 'clippings', 'text', 'probability', 'BERT_1', 'BERT_2', 'BERT_3', 'violence_word_count', 'racist_word_count', 'page_details', 'url', 'sn_code', 'coverage']

all_csv_files = [f for f in os.listdir(directory) if f.endswith('.csv')]
for filename in all_csv_files:
    file_path = os.path.join(directory, filename)
    df = pd.read_csv(file_path)
    reordered_columns = [col for col in desired_order if col in df.columns]
    df = df[reordered_columns]
    df.to_csv(file_path, index=False)
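
One thing to be aware of: selecting only the columns named in desired_order drops any column that is not in that list. If that is not wanted, a variation like the sketch below keeps the extra columns at the end. The frame and the shortened `toy_order` list are hypothetical.

```python
import pandas as pd

# Hypothetical frame with one column ('notes') that is not in the desired order
toy_df = pd.DataFrame({'url': ['u1'], 'victim': ['v1'], 'notes': ['n1'], 'race': ['r1']})
toy_order = ['victim', 'race', 'url', 'coverage']

# Columns from the desired order that actually exist, in that order
ordered = [col for col in toy_order if col in toy_df.columns]
# Columns that exist in the frame but are not named in the desired order
leftover = [col for col in toy_df.columns if col not in toy_order]

# Keep the leftover columns at the end instead of dropping them
toy_df = toy_df[ordered + leftover]
print(list(toy_df.columns))  # ['victim', 'race', 'url', 'notes']
```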

### 7) Separating Victim Files by Race

One last thing: I've been compiling data for both Black and white victims of lynchings so I can compare newspaper reports by race in later work. For VRT's purposes, though, I only want to provide data for Black victims. So, I've separated the files by race. 

In [None]:
# how I moved Black victim csvs into a new directory called 'black_victims'
black_directory = os.path.join(directory, 'black_victims')
os.makedirs(black_directory, exist_ok=True)

for filename in os.listdir(directory):
    if not filename.endswith('.csv'):
        continue
    
    file_path = os.path.join(directory, filename)
    if file_path.startswith(black_directory):
        continue

    df = pd.read_csv(file_path)
    if 'race' in df.columns:
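        # normalize case and whitespace so 'Black' and 'black' are treated the same
        race_values = df['race'].astype(str).str.strip().str.lower()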
        if 'black' in race_values.unique():
            dest_path = os.path.join(black_directory, filename)
            shutil.move(file_path, dest_path)

In [None]:
# how I moved white victim csvs into a new directory called 'white_victims'
white_directory = os.path.join(directory, 'white_victims')
os.makedirs(white_directory, exist_ok=True)

for filename in os.listdir(directory):
    if not filename.endswith('.csv'):
        continue
    
    file_path = os.path.join(directory, filename)
    if file_path.startswith(white_directory):
        continue

    df = pd.read_csv(file_path)
    if 'race' in df.columns:
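        # normalize case and whitespace so 'White' and 'white' are treated the same
        race_values = df['race'].astype(str).str.strip().str.lower()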
        if 'white' in race_values.unique():
            dest_path = os.path.join(white_directory, filename)
            shutil.move(file_path, dest_path)
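
The two cells above could also be folded into a single pass. The following is a rough sketch under a few assumptions: `sort_by_race` is a hypothetical helper, the race labels are compared case-insensitively, and the demonstration runs on throwaway files in a temporary directory rather than on name_clusters.

```python
import os
import shutil
import tempfile

import pandas as pd

def sort_by_race(base_dir, races=('black', 'white')):
    """Move each csv in base_dir into '<race>_victims' based on its race column."""
    moved = {}
    for filename in sorted(os.listdir(base_dir)):
        file_path = os.path.join(base_dir, filename)
        # Skip subdirectories and anything that is not a csv file
        if not (filename.endswith('.csv') and os.path.isfile(file_path)):
            continue
        df = pd.read_csv(file_path)
        if 'race' not in df.columns:
            continue
        # Normalize the labels before comparing them
        values = set(df['race'].dropna().astype(str).str.strip().str.lower())
        for race in races:
            if race in values:
                target = os.path.join(base_dir, f'{race}_victims')
                os.makedirs(target, exist_ok=True)
                shutil.move(file_path, os.path.join(target, filename))
                moved[filename] = race
                break
    return moved

# Try it on throwaway files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({'victim': ['A'], 'race': ['Black']}).to_csv(os.path.join(tmp, 'a.csv'), index=False)
    pd.DataFrame({'victim': ['B'], 'race': ['white']}).to_csv(os.path.join(tmp, 'b.csv'), index=False)
    result = sort_by_race(tmp)
    print(result)  # {'a.csv': 'black', 'b.csv': 'white'}
```

Returning the `moved` dictionary gives a simple record of where each file went, which makes it easy to spot files that were left behind because their race column was missing or held an unexpected label.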