# Steps for Preprocessing the Seguin/Rigby Lynching Dataset

The dataset contains over 1,000 lynching events documented in this study: [https://journals.sagepub.com/doi/pdf/10.1177/2378023119841780](https://journals.sagepub.com/doi/pdf/10.1177/2378023119841780)

The data was retrieved from here: [https://archive.ciser.cornell.edu/studies/2833/data-and-documentation](https://archive.ciser.cornell.edu/studies/2833/data-and-documentation)

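The goal of this notebook is to extract records of non-white lynching victims (in practice, the rows where `race` is `Black`) that can be used to search Chronicling America (ChronAm) or other newspaper datasets. The relevant fields are 1) names of victims, 2) city names, and 3) dates (only the year is used in the search URLs).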

This data will then go into a pipeline for scraping newspapers.


In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('seguin_rigby_lynching_data.csv')

In [None]:
df.head()

In [None]:
df = df[df['race'] == 'Black']
df.head()

In [None]:
df = df[df['victim'] != 'Unknown']
len(df)

In [None]:
df

In [None]:
df = df[df['year'] <= 1921]
df

In [None]:
df = df.copy()  # explicit copy so the assignments below do not raise SettingWithCopyWarning
df['city'] = df['city'].str.lower()
df['victim'] = df['victim'].str.lower()

In [None]:
df = df[df['city'] != '.']
df

In [None]:
df = df[df['victim'].str.split().str.len() >= 2]
df

In [None]:
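# Treat '/' as a separator, then keep only the text before the first ',', '|', or '(' (drops aliases and second names)
df = df.copy()
df['victim'] = df['victim'].str.replace('/', ', ', regex=False)
df['victim'] = df['victim'].str.replace(r'[,\|(].*', '', regex=True).str.strip()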
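
To make the effect of the two replacements concrete, here is a minimal sketch applied to a handful of made-up strings. The names are hypothetical placeholders, not rows from the dataset, and the `.str.strip()` call trims the whitespace that would otherwise be left in front of a removed delimiter.

```python
import pandas as pd

# Hypothetical victim strings, invented for illustration only.
names = pd.Series([
    "john smith/jack smith",
    "john smith (alias jack)",
    "john smith, jr.",
    "john smith | j. smith",
    "john smith",
])

cleaned = (
    names
    .str.replace('/', ', ', regex=False)        # treat '/' like a comma separator
    .str.replace(r'[,\|(].*', '', regex=True)   # drop everything from the first ',', '|', or '('
    .str.strip()                                # trim whitespace left before the delimiter
)

print(cleaned.tolist())
```

Every variant reduces to the same primary name, which is the form later used as the search phrase.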

In [None]:
df = df.dropna(subset=['city'])

`df` now contains all records in the data where:

1) the victim's race is Black AND
2) the event falls within the range of 1883 to 1921 (only the upper bound is filtered; 1883 is the first year the dataset covers) AND
3) the victim has a name of at least two words, with second names and aliases stripped and unknown names removed AND
4) the city is known.
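
The conditions above can also be sketched as a single function and checked against a small toy table. This is an illustrative consolidation, not the notebook's own code: the function name `filter_lynching_records` and the toy rows are hypothetical, the column names are the ones used in the cells above, and the alias-stripping step is left out for brevity.

```python
import pandas as pd

def filter_lynching_records(df, max_year=1921):
    """Apply the filters listed above to a frame with race, victim, year, and city columns."""
    out = df[df['race'] == 'Black']
    out = out[out['victim'] != 'Unknown']
    out = out[out['year'] <= max_year]
    out = out.dropna(subset=['city'])
    out = out[out['city'] != '.'].copy()   # '.' marks an unknown city
    out['city'] = out['city'].str.lower()
    out['victim'] = out['victim'].str.lower()
    # Keep only names with at least two whitespace-separated tokens
    out = out[out['victim'].str.split().str.len() >= 2]
    return out.reset_index(drop=True)

# Toy rows, invented for illustration; each row but the first fails one filter.
toy = pd.DataFrame({
    'race':   ['Black', 'White', 'Black', 'Black', 'Black', 'Black'],
    'victim': ['Person One', 'Person Two', 'Unknown', 'Person Three', 'Person Four', 'Mononym'],
    'year':   [1890, 1890, 1900, 1930, 1910, 1905],
    'city':   ['Example City', 'Example City', 'Example City', 'Example City', '.', 'Example City'],
})

result = filter_lynching_records(toy)
print(result)
```

Only the first toy row survives, since each of the others violates exactly one of the listed conditions.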


In [None]:
df

The function below builds a ChronAm advanced-search URL for each row. The search is restricted to the year of the event, matches the victim's name as a phrase, and passes the city name as a proximity search with a distance of 100 words. The function is based on this URL structure:

[https://chroniclingamerica.loc.gov/search/pages/results/list/?date1=1883&rows=100&searchType=advanced&language=&proxdistance=100&date2=1883&ortext=&proxtext=mound+city&phrasetext=nelson+howard&andtext=&dateFilterType=yearRange&page=1&sort=date](https://chroniclingamerica.loc.gov/search/pages/results/list/?date1=1883&rows=100&searchType=advanced&language=&proxdistance=100&date2=1883&ortext=&proxtext=mound+city&phrasetext=nelson+howard&andtext=&dateFilterType=yearRange&page=1&sort=date)
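
To see which query parameters carry the year, name, and city, the example URL can be taken apart with Python's standard library. This is only an inspection aid and makes no request to the site; the variable names are illustrative.

```python
from urllib.parse import urlsplit, parse_qs

example_url = (
    "https://chroniclingamerica.loc.gov/search/pages/results/list/"
    "?date1=1883&rows=100&searchType=advanced&language=&proxdistance=100"
    "&date2=1883&ortext=&proxtext=mound+city&phrasetext=nelson+howard"
    "&andtext=&dateFilterType=yearRange&page=1&sort=date"
)

# keep_blank_values=True preserves the empty parameters such as 'language' and 'ortext'
params = parse_qs(urlsplit(example_url).query, keep_blank_values=True)

for key in ('date1', 'date2', 'phrasetext', 'proxtext', 'proxdistance', 'rows'):
    print(f"{key:>12}: {params[key][0]}")
```

The `+` signs decode to spaces, so `phrasetext` holds the victim name and `proxtext` holds the city name, while `date1` and `date2` bound the year range.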


In [None]:
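from urllib.parse import quote_plus

def build_chron_am_search(row):
    """Build a ChronAm advanced-search URL from one row's victim, city, and year."""
    base_url = "https://chroniclingamerica.loc.gov/search/pages/results/list/"
    # Search a single year; int() guards against a float year column (e.g. 1883.0)
    date1 = date2 = int(row['year'])
    # quote_plus turns spaces into '+' and percent-encodes other special characters
    proxtext = quote_plus(row['city'])
    phrasetext = quote_plus(row['victim'])

    # Note: rows=1000 here, whereas the example URL above uses rows=100
    search_url = (f"{base_url}?date1={date1}&rows=1000&searchType=advanced&language="
                  f"&proxdistance=100&date2={date2}&ortext=&proxtext={proxtext}"
                  f"&phrasetext={phrasetext}&andtext=&dateFilterType=yearRange&page=1&sort=date")

    return search_url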

df['search_url'] = df.apply(build_chron_am_search, axis=1)

df

In [None]:
df['search_url']

In [None]:
df.to_csv('seguin_rigby_data_black_subset.csv', index=False, encoding='utf-8')