# Data Processing Notebook

This notebook contains the code necessary to load and transform transcription data; it is divided into four parts: 

1. Import statements and function definitions,
2. Text tokenization for all datasets,
3. Data processing for the Susan B. Anthony Speeches subset,
4. Data processing for Susan B. Anthony, Carrie Chapman Catt, Elizabeth Cady Stanton, and Mary Church Terrell transcription data.

## Instructions for running the code
1. Run the cells in order.
2. Optional code is available for previewing the data at each step of processing; these lines are not necessary for processing the data, but may be useful for those who wish to see how the data changes.

---

## 1. Import modules, set constants, define functions

In [None]:
%matplotlib inline
import pandas as pd
import re
from ast import literal_eval

### 1.1 Import the spaCy library and load the language model
The [`spaCy`](https://spacy.io/) library provides pre-built natural language processing tools and models.

In [None]:
import spacy
import en_core_web_lg

# Load the model
NLP = en_core_web_lg.load()

### 1.2 Set the dataset path constants
The following code will store the relative paths of each provided dataset in a constant for reuse throughout this notebook. These files are in the `data` folder that was downloaded alongside this notebook.

In [None]:
# Susan B. Anthony
ANTHONY = "data/anthony/susan-b-anthony-papers_2022-10-12.csv"
SPEECHES = "data/anthony/anthony_speech_list.csv"

# Carrie Chapman Catt
CATT = "data/catt/carrie-chapman-catt-papers_2022-10-12.csv"

# Elizabeth Cady Stanton
STANTON = "data/stanton/elizabeth-cady-stanton-papers_2022-10-19.csv"

# Mary Church Terrell
TERRELL = "data/terrell/mary-church-terrell-advocate-for-african-americans-and-women_2023-01-20.csv"
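
Before loading, it can help to confirm that the files are where these constants expect them. The following is a minimal sketch and not part of the original workflow; `missing_files` is a hypothetical helper, and the path used below is a placeholder.

```python
from pathlib import Path


def missing_files(paths):
    """Return the paths that do not point to an existing file (hypothetical helper)."""
    return [p for p in paths if not Path(p).is_file()]


# Placeholder path for illustration only. In this notebook, the constants
# defined above could be passed instead, for example:
# missing_files([ANTHONY, SPEECHES, CATT, STANTON, TERRELL])
example = missing_files(["data/does-not-exist.csv"])
print(example)
```

An empty list means every file was found.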

### 1.3 Define processing functions
The following helper functions were written to process the data. They are loaded here to make the later sections easier to read.

In [None]:
def load_csv(file: str) -> pd.DataFrame:
    """Load a CSV file into a data frame, reading every column as a string.

    Returns:
        df (data frame): A data frame containing the data loaded from the CSV file."""
    
    df = pd.read_csv(file, dtype=str)
    return df


def tokens(text) -> list:
    """Runs NLP process on text input. 
    
    Returns: 
        process (list): A list containing tuples of NLP attributes 
            for each word in the transcription.
    """
    
    doc = NLP(str(text))
    process = ([(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
                token.shape_, token.is_alpha, token.is_stop) for token in doc])

    return process


def entities(text) -> list:
    """Runs NER process on text input. 
    
    Returns:
        process (list): A list containing tuples of NER attributes
            for each named entity in the transcription.
    """
    
    doc = NLP(str(text))
    process = [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]

    return process


def separate_text(df: pd.DataFrame) -> None:
    """Adds new columns to the data frame then loops through the 
    tokenized text of each row moving each category to the newly 
    created relevant column."""
    
    # Add new columns (c) to the data frame
    for c in ['text', 
              'stop_words', 
              'nonalphanums', 
              'numbers', 
              'ambigs', 
              'processed_text']:
        df[c] = pd.Series(dtype=str)
    
    # Collect the values for each new column, one list entry per row
    columns = {c: [] for c in ['text', 'stop_words', 'nonalphanums',
                               'numbers', 'ambigs', 'processed_text']}

    # Iterate over tokenized_text to filter words into five categories
    for text_block in df['tokenized_text']:
        text = []
        stop_words = []
        nonalphanums = []
        numbers = []
        ambigs = []

        for word in text_block:
            # Move stop words (is_stop flag)
            if word[7]:
                stop_words.append(word)
            # Move punctuation, whitespace, symbols, and similar tokens
            elif word[2] in ['PUNCT', 'SPACE', 'CCONJ', 'X', 'SYM']:
                nonalphanums.append(word)
            # Move numbers
            elif word[2] == 'NUM':
                numbers.append(word)
            # Move ambiguous transcribed words ('?' in the shape)
            elif '?' in word[5]:
                ambigs.append(word)
            # Move text
            else:
                text.append(word)

        columns['text'].append(text)
        columns['stop_words'].append(stop_words)
        columns['nonalphanums'].append(nonalphanums)
        columns['numbers'].append(numbers)
        columns['ambigs'].append(ambigs)
        # processed_text holds the lowercase lemma of every word in 'text'
        columns['processed_text'].append([i[1].lower() for i in text])

    # Assign each column in one step; chained assignment such as
    # df['text'].iloc[row] = ... may not update the data frame
    for name, values in columns.items():
        df[name] = pd.Series(values, index=df.index, dtype=object)


### 1.4 Load each transcription dataset into a data frame
The `load_csv` function reads the data from each path constant into a pandas data frame. Every column is loaded as a string (`dtype=str`).

In [None]:
a = load_csv(ANTHONY)
c = load_csv(CATT)
s = load_csv(STANTON)
t = load_csv(TERRELL)

#### 1.4.1 Optional: Preview the first five lines of a loaded dataset.

In [None]:
# Uncomment the line for the dataset you wish to preview and then run the cell
#a.head()
#c.head()
#s.head()
#t.head()

---

## 2. Text tokenization for all four transcription datasets

This section contains code that will tokenize the transcription data and add new columns to the data frames for each transcription dataset.

### 2.1 Create a new column containing the output of the `tokens` function
The `tokens` function uses the previously loaded spaCy model to analyze each token in the transcription. This produces several values for each token: the text, the lemma, the part-of-speech tag, the fine-grained tag, the dependency label, the shape of the word, whether it is alphabetic, and whether it is a stop word.
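Because each token is stored as a plain tuple, later code refers to its values by position (for example, index 1 for the lemma and index 7 for the stop-word flag). The sketch below shows that layout with named fields; `TokenInfo` is a hypothetical name used only for illustration, and the example tuple is hand-written rather than actual model output.

```python
from collections import namedtuple

# Field order mirrors the tuple built by the `tokens` function
TokenInfo = namedtuple(
    "TokenInfo",
    ["text", "lemma", "pos", "tag", "dep", "shape", "is_alpha", "is_stop"],
)

# Hand-written example tuple, not actual model output
raw = ("Women", "woman", "NOUN", "NNS", "nsubj", "Xxxxx", True, False)
info = TokenInfo(*raw)

print(info.lemma)    # same value as raw[1]
print(info.pos)      # same value as raw[2]
print(info.is_stop)  # same value as raw[7]
```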

In [None]:
# NOTE: This will take a while to run
for dataset in [a, c, s, t]:
    print(f"Tokenizing text for dataset: {dataset['Campaign'][0]}")
    dataset['tokenized_text'] = dataset['Transcription'].apply(tokens)
print("Done!")

### 2.2 Create a new column containing the output of the `entities` function
The `entities` function uses the previously loaded spaCy model to identify persons, places, organizations, etc.
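Each entity is stored as a tuple of `(text, start_char, end_char, label)`, where the two offsets index into the original transcription string. The following sketch illustrates that layout with a hand-written sentence and hand-written tuples; actual results depend on the loaded model.

```python
# Hand-written example; real output depends on the loaded spaCy model
transcription = "Susan B. Anthony voted in Rochester in 1872."
example_entities = [
    ("Susan B. Anthony", 0, 16, "PERSON"),
    ("Rochester", 26, 35, "GPE"),
    ("1872", 39, 43, "DATE"),
]

# start_char and end_char slice the entity text out of the original string
for text, start, end, label in example_entities:
    print(label, repr(transcription[start:end]))

# Keep only the entities with a given label
people = [text for text, _, _, label in example_entities if label == "PERSON"]
print(people)
```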

In [None]:
# NOTE 1: This will take a while to run
# NOTE 2: This cell is not required if only running the transcription visualization notebook
for dataset in [a, c, s, t]:
    print(f"Identifying entities for dataset: {dataset['Campaign'][0]}")
    dataset['entities'] = dataset['Transcription'].apply(entities)
print("Done!")

#### 2.2.1 Optional: Preview the results of the `tokens` and `entities` functions for individual rows of a dataset

In [None]:
# Uncomment a line and then run

# ANTHONY
#a.head(1)
#a['tokenized_text'].iloc[0]
#a['entities'].iloc[1000]

# CATT
#c.head(1)
#c['tokenized_text'].iloc[0]
#c['entities'].iloc[1000]

# STANTON
#s.head(1)
#s['tokenized_text'].iloc[0]
#s['entities'].iloc[1000]

# TERRELL
#t.head(1)
#t['tokenized_text'].iloc[0]
#t['entities'].iloc[1000]

### 2.3 Run the `separate_text` function to isolate tokens by category

The `separate_text` function uses labels generated by the `spaCy` library to sort the tokens of each transcription into five categories: actual text, stop words (conjunctions, prepositions, etc.), non-alphanumeric strings (punctuation, whitespace, etc.), numbers, and ambiguous words. A word is ambiguous when a transcriber could not make out a word or character and entered a `?` for the unknown character(s); the `?` is reflected in the analyzed shape of the word, which is used to keep these words out of the text category. The function also adds a `processed_text` column containing the lowercase lemma of every word in the text category.
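The order of the checks matters: a token is tested for the stop-word flag first, then for its part-of-speech label, and only then for a `?` in its shape. The sketch below restates that decision order for a single token; `categorize` is a hypothetical helper, and the sample tuples are hand-written rather than actual model output.

```python
def categorize(token):
    """Return the category name for one token tuple (hypothetical helper).

    The tuple layout follows the `tokens` function:
    (text, lemma, pos, tag, dep, shape, is_alpha, is_stop).
    """
    if token[7]:  # stop-word flag
        return "stop_words"
    if token[2] in ("PUNCT", "SPACE", "CCONJ", "X", "SYM"):
        return "nonalphanums"
    if token[2] == "NUM":
        return "numbers"
    if "?" in token[5]:  # shape contains a transcriber's '?'
        return "ambigs"
    return "text"


# Hand-written tuples for illustration, not actual model output
sample = [
    ("the", "the", "DET", "DT", "det", "xxx", True, True),
    ("ballot", "ballot", "NOUN", "NN", "dobj", "xxxx", True, False),
    (",", ",", "PUNCT", ",", "punct", ",", False, False),
    ("1872", "1872", "NUM", "CD", "nummod", "dddd", False, False),
    ("wom?n", "wom?n", "NOUN", "NN", "nsubj", "xxx?x", False, False),
]
print([categorize(tok) for tok in sample])
```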

In [None]:
# Run the separate_text function on each of the four data frames
for dataset in [a, c, s, t]:
    print(f"Organizing tokens by category for: {dataset['Campaign'][0]}")
    separate_text(dataset)
print("Done!")

#### 2.3.1 Optional: Preview the results for the first six rows of the updated data frame.

In [None]:
# Uncomment a line and then run

# ANTHONY
#a.iloc[0:6]

# CATT
#c.iloc[0:6]

# STANTON
#s.iloc[0:6]

# TERRELL
#t.iloc[0:6]

---

## 3. Process the Susan B. Anthony speech subset
An inventory of speeches in the Susan B. Anthony Papers is available; this allows for subsetting the transcription data so that it can be processed and visualized separately from the entire transcription dataset.

### 3.1 Extract the transcription text for the speeches
The code below groups the transcription data by `ItemId`, then uses the speech inventory to select the rows that belong to known speeches. The processed text of each speech is combined at the item level and stored as a dictionary with the keys `id`, `year`, `title`, and `text`; these dictionaries are collected in the list `speech_list`.
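The same steps can be sketched without pandas to show the intended result. Everything in this example is hypothetical: the identifiers, the date string, the title, and the lemma lists are stand-ins, and the real inventory columns may be formatted differently.

```python
import re
from collections import defaultdict

# Hypothetical rows standing in for the transcription data frame:
# (ItemId, processed_text)
rows = [
    ("item-1", ["woman", "suffrage"]),
    ("item-2", ["letter"]),
    ("item-1", ["vote", "right"]),
]

# Hypothetical inventory rows: (ItemId, text containing a year, title)
inventory = [("item-1", "circa 1873", "Example speech title")]

# Group the lemma lists by ItemId, preserving row order
by_item = defaultdict(list)
for item_id, lemmas in rows:
    by_item[item_id].extend(lemmas)

example_speeches = []
for item_id, date, title in inventory:
    years = re.findall(r"\d{4}", date)
    example_speeches.append({
        "id": item_id,
        "year": years[0] if years else None,  # first four-digit run, if any
        "title": title,
        "text": by_item[item_id],
    })

print(example_speeches)
```

Unlike the notebook cell, this sketch falls back to `None` when no four-digit year is found instead of raising an error.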

In [None]:
# Load the speech inventory
a_speeches = load_csv(SPEECHES)

# Group transcriptions by ItemId
# Creates a dictionary where the ItemId is the key and the value is a list of associated row indexes
a_groups = a.groupby('ItemId').groups

# Create a list of dictionaries representing each speech
# This structure is specifically designed for visualization in the next notebook
speech_list = []

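for row in range(a_speeches.shape[0]):
    # Inventory columns by position: 0 = ItemId, 1 = text containing the year, 2 = title
    speech_id = a_speeches.iloc[row, 0]
    # Find four-digit years; the first match is used as the speech year
    d = re.findall(r'\d{4}', a_speeches.iloc[row, 1])
    speech_text = []
    # a_groups holds index labels, so rows are looked up with .loc
    for i in a_groups[speech_id]:
        speech_text.extend(a['processed_text'].loc[i])
    speech = {'id': speech_id,
              'year': d[0],
              'title': a_speeches.iloc[row, 2],
              'text': speech_text}
    speech_list.append(speech)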

### 3.2 Store the speech subset 
The following code saves `speech_list` with IPython's `%store` magic so that it can be restored in the other notebooks.

In [None]:
%store speech_list

# Reuse the variable in another notebook using the following command
# %store -r speech_list
# Then call the variable like usual

#### 3.2.1 Optional: Save `speech_list` to a Python file for import or reuse beyond these notebooks

In [None]:
# Optional code to create a stand-alone file for the speech data
import os

# Create the output folder if it does not already exist
os.makedirs('outputs', exist_ok=True)
with open('outputs/anthony_speech_lemmas.py', 'w', encoding='utf-8') as f:
    f.write("#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n")
    f.write("speech_list = " + str(speech_list))
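
A file written this way can be imported as a module, or it can be read back without executing it. The sketch below shows the second option with `ast.literal_eval`, using a temporary file and a hypothetical stand-in for the list; it assumes the saved value contains only Python literals (strings, lists, dictionaries).

```python
import ast
import tempfile
from pathlib import Path

# Hypothetical stand-in for the notebook's list of speech dictionaries
example = [{"id": "item-1", "year": "1873", "title": "Example", "text": ["vote"]}]

header = "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example_lemmas.py"
    # Write the file in the same form as the optional cell
    path.write_text(header + "example_list = " + str(example), encoding="utf-8")

    # Read it back without executing it: take everything after the
    # first '=' and parse it as a Python literal
    source = path.read_text(encoding="utf-8")
    restored = ast.literal_eval(source.split("=", 1)[1].strip())

print(restored == example)
```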

---

## 4. Process transcription data for all four datasets
The following code prepares the data in the same way as the Susan B. Anthony speech subset above. Running it is necessary for dataset-level visualization of all four datasets.

### 4.1 Create a list of all words from `processed_text` for each dataset
This code creates a list of dictionaries, one per dataset, each containing the dataset title and the aggregated text from its `processed_text` column.
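Each entry pairs a `title` with a flat `text` list of lemmas. As a rough illustration of that shape, and of one way a later step might use it, the sketch below counts lemma frequencies per dataset; the titles and words are hypothetical stand-ins, and counting is an assumption rather than a step of this notebook.

```python
from collections import Counter

# Hypothetical stand-in with the same shape as `transcriptions`
example_transcriptions = [
    {"title": "Dataset A", "text": ["vote", "woman", "vote"]},
    {"title": "Dataset B", "text": ["letter", "woman"]},
]

# Count lemma frequencies per dataset
frequencies = {
    entry["title"]: Counter(entry["text"]) for entry in example_transcriptions
}

print(frequencies["Dataset A"].most_common(1))
```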

In [None]:
transcriptions = []

for dataset in [a, c, s, t]:
    transcription_text = []
    for row in range(dataset.shape[0]):
        transcription_text.extend(dataset['processed_text'].iloc[row])
    transcription = {'title': dataset['Campaign'][0],
                     'text': transcription_text}
    transcriptions.append(transcription)

### 4.2 Store the `transcriptions` list

In [None]:
%store transcriptions

# Reuse the variable in another notebook using the following command
# %store -r transcriptions
# Then call the variable like usual

#### 4.2.1 Optional: Save `transcriptions` to a Python file for import or reuse beyond these notebooks

In [None]:
# Optional code to create a stand-alone file for the processed transcriptions data
import os

# Create the output folder if it does not already exist
os.makedirs('outputs', exist_ok=True)
with open('outputs/transcriptions_lemmas.py', 'w', encoding='utf-8') as f:
    f.write("#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n")
    f.write("transcriptions = " + str(transcriptions))

---