# Introduction

---

<details>
<summary>About the data</summary>

## About the data

This tutorial provides an example of data processing and visualization using transcription data from By the People, the Library of Congress's crowdsourced transcription program. By the People invites anyone to contribute to the Library of Congress as a virtual volunteer by transcribing and reviewing digital images of texts to enhance Library of Congress digital collections. The Library of Congress thanks all By the People volunteers for sharing their time and knowledge with us to make this tutorial possible.

By the People transcriptions and tags are created by anonymous and registered volunteers. Once a transcription is finished, it must be reviewed by a registered volunteer. A transcription may undergo multiple rounds of edits before being completed. Finally, transcriptions are spot-checked by Library of Congress subject matter experts before they are incorporated into the digital collections on the Library's website to enhance search and accessibility. Transcriptions are also packaged into .csv files and made available as datasets as part of the [Selected Datasets Collection](https://www.loc.gov/collections/selected-datasets/).

In this tutorial we will work with four datasets related to the movement for women's suffrage in the United States. Each dataset’s README includes additional information about its content and creation:
- Anthony, Susan B. Transcription datasets from Susan B. Anthony Papers, Manuscript Division. compiled by By The People. Washington, D.C.: By the People, Library of Congress, to 2022, 2021. Software, E-Resource. https://www.loc.gov/item/2020445591/.
- Catt, Carrie Chapman. Transcription datasets from Carrie Chapman Catt Papers, Manuscript Division. compiled by By The People. Washington, D.C.: By the People, Library of Congress, to 2022, 2020. Software, E-Resource. https://www.loc.gov/item/2019667239/.
- Stanton, Elizabeth Cady. Transcription datasets from Elizabeth Cady Stanton Papers, Manuscript Division. compiled by By The People. Washington, D.C.: By the People, Library of Congress, 2021. Software, E-Resource. https://www.loc.gov/item/2020445592/.
- Terrell, Mary Church. Transcription dataset from the Mary Church Terrell Papers, Manuscript Division. compiled by By The People. Washington, D.C.: By the People, Library of Congress, to 2021, 2018. Software, E-Resource. https://www.loc.gov/item/2021387726/.

Transcription volunteers are instructed to transcribe the text as written, including misspellings and abbreviations. Formatting is generally not preserved, with the exception of line breaks. Minimal markup includes “?” for illegible or unclear text, square brackets around deleted text, and square brackets and asterisks around marginalia `([*example*])`. Pages without text are marked “nothing to transcribe” and do not have transcriptions in [loc.gov](https://www.loc.gov/).
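
The markup conventions above can be handled with a few regular expressions. The following is a minimal sketch and is not part of the original notebook: the sample string and the helper names `strip_marginalia` and `has_unclear_text` are hypothetical, and it assumes marginalia are always written as `[*...*]`.

```python
import re

# Assumed pattern: marginalia wrapped in square brackets and asterisks, e.g. [*note*]
MARGINALIA = re.compile(r"\[\*.*?\*\]")


def strip_marginalia(text: str) -> str:
    """Remove [*...*] spans and collapse the leftover double spaces."""
    without = MARGINALIA.sub("", text)
    return re.sub(r" {2,}", " ", without).strip()


def has_unclear_text(text: str) -> bool:
    """Report whether a transcription contains the "?" placeholder."""
    return "?" in text


# Hypothetical transcription text
sample = "Dear Mrs. Stanton [*rec'd May 4*] I will sp?ak at the meeting"
print(strip_marginalia(sample))
print(has_unclear_text(sample))
```

Note that `has_unclear_text` cannot tell a placeholder apart from a question mark the original author wrote, so it is only a rough first filter.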

By the People datasets contain the following fields:
- Campaign – this is the highest hierarchical level in the arrangement of collections on By the People (example: [Susan B. Anthony Papers](https://crowd.loc.gov/campaigns/susan-b-anthony-papers/)). This field displays the campaign’s title.
- Project – this is the second-highest hierarchical level of collections on By the People. Projects may map to an existing subset of a digital collection, such as an archival series, or may be a grouping of related items uniquely organized for By the People. This field displays the project’s title.
- Item – this is the third-highest hierarchical level of collections on By the People, typically representing a folder, letter, document, or diary. This field displays the item title. 
- ItemId – this is the identifier for the item (see above for definition). This numerical identifier is consistent across the By the People website and in loc.gov. The item and metadata are usually located on the Library’s website at `https://www.loc.gov/item/[ItemID]/`
- Asset – this is the identifier for the individual asset image. It is also referred to colloquially as the “page” by By the People volunteers and Community Managers. This identifier is used in the By the People site and on loc.gov. 
- AssetStatus – this indicates the status of the asset in the peer review workflow – Not Started, In Progress, Needs Review, or Completed. Dataset assets will always be marked as “Completed.” 
- DownloadURL – this link provides access to the image file for the Asset from which the transcription was created.
- Transcription – this is the text created by the By the People volunteers, representing the written content of the DownloadURL image and corresponding to the Asset. This field will be blank for assets that volunteers marked “Nothing to transcribe”.
- Tags – these are all the tags that have been applied to the asset. If there is more than one tag, the tags are delimited by a semicolon and space.
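
As a rough illustration of the fields listed above, the sketch below reads one row with Python's standard `csv` module, splits the `Tags` field on its semicolon-and-space delimiter, and builds the usual loc.gov address from the `ItemId`. The row values and the helper names `split_tags` and `item_url` are hypothetical and exist only for this example.

```python
import csv
import io

# Hypothetical one-row dataset using the field names listed above
sample_csv = io.StringIO(
    "Campaign,Project,Item,ItemId,Asset,AssetStatus,DownloadURL,Transcription,Tags\n"
    "Example Campaign,Example Project,Example Item,0000000000,asset-1,Completed,"
    'https://example.org/image.jpg,"Votes for women","suffrage; speech"\n'
)


def split_tags(tags: str) -> list:
    """Split the Tags field on the "; " delimiter; blank input gives []."""
    return [tag for tag in tags.split("; ") if tag]


def item_url(item_id: str) -> str:
    """Build the usual loc.gov item address from an ItemId."""
    return f"https://www.loc.gov/item/{item_id}/"


row = next(csv.DictReader(sample_csv))
print(split_tags(row["Tags"]))
print(item_url(row["ItemId"]))
```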

---
</details>

<details>
<summary>About the notebook</summary>

## About the notebook

This tutorial first cleans and processes transcriptions from the four datasets using [Pandas](https://pandas.pydata.org/) and the [spaCy](https://spacy.io/) Natural Language Processing library. This code tokenizes the transcriptions, breaking the strings of text into tokens (words) that will be further analyzed. It then identifies the lemma, or root, for each word. For example, the lemma of "voted" is "vote", and the lemma of "women" is "woman". The code next iterates over each token to produce a list of lemmas from the original transcriptions that excludes stop words, punctuation, numbers, and words that volunteers were unable to fully transcribe, which are designated with "?". Stop words are commonly used words, such as "the", "a", or "is".

The tutorial then creates two visualizations from the cleaned data using the [Matplotlib](https://matplotlib.org/) and [NumPy](https://numpy.org/) Python libraries. The first is a combined bar graph showing the five most frequently used words for each of the four datasets. The second is a focused look at the "Speeches" series from the Susan B. Anthony Papers. With data coming from a [typed inventory of speeches](http://hdl.loc.gov/loc.mss/ms997009.mss11049.036) found in the collection, this code groups Anthony's speeches by year, and then plots the usage of the top five words in her speeches by year.

---
</details>

<details>
<summary>Running the notebook</summary>
    
## Running the notebook

In order to run a Jupyter notebook, navigate to the directory that contains the notebook files using `cd /path/to/dcm-btp-notebooks`, then run the command `jupyter notebook`. This will launch the Notebook Dashboard in an Internet browser.

In order to properly run these notebooks, make sure that the appropriate Python libraries are installed. Further information can be found in the README file. The dataset files are already included in this tutorial in the `data` directory (along with each dataset’s README), which can be seen in the Notebook Dashboard.
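
One way to check the environment before running the notebook is sketched below. It uses only the standard library and looks for the top-level packages that the notebook imports; the helper name `missing_packages` is hypothetical, and the README remains the authoritative source for installation steps.

```python
from importlib.util import find_spec

# Top-level packages imported by this notebook
REQUIRED = ["pandas", "spacy", "en_core_web_lg", "matplotlib", "numpy"]


def missing_packages(names: list) -> list:
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```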

The entire notebook can be run by clicking `Run` in the menu bar. Individual cells can be run by clicking into the cell, then hitting `Shift + Enter`.

The notebook contains optional code that can be run to print results to the notebook. This helps show what the code is doing at each step. These cells have "Optional:" in the title. Remove `#` from the code to un-comment and run those lines of code.

The outputs from the tutorial will be saved to the `outputs` directory, which can be seen in the Notebook Dashboard.

---
</details>

<details>
<summary>Authorship and use</summary>
    
## Authorship and use

These notebooks were created by Dave Durden and Madeline Goebel, Digital Collection Specialists at the Library of Congress. They are made available under the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/legalcode).

All contributions to the By the People application are released into the public domain as they are created. Anyone can use and re-use the datasets in any way they want.

---
</details>

# Data processing

This section contains the code necessary to load and transform transcription data; it is divided into four parts:
1. Import statements and function definitions,
2. Text tokenization for all datasets,
3. Data processing for the Susan B. Anthony Speeches subset,
4. Data processing for the Susan B. Anthony, Carrie Chapman Catt, Elizabeth Cady Stanton, and Mary Church Terrell transcription data.

## Instructions for running the code
1. Run the cells in order.
2. Optional code is available for previewing the data during each step of processing; these lines are not necessary for processing the data, but may be useful for those who wish to see how the data changes.

---

## 1. Import modules, set constants, define functions

In [None]:
%matplotlib inline
import pandas as pd
import re
from ast import literal_eval

### 1.1 Import the spaCy library and load the language model
The [`spaCy`](https://spacy.io/) library provides pre-built natural language processing tools and models.

In [None]:
import spacy
import en_core_web_lg

# Load the model
NLP = en_core_web_lg.load()

### 1.2 Set the dataset path constants
The following code will store the relative paths of each provided dataset in a constant for reuse throughout this notebook. These files are in the `data` folder that was downloaded alongside this notebook.

In [None]:
# Susan B. Anthony
ANTHONY = "data/anthony/susan-b-anthony-papers_2022-10-12.csv"
SPEECHES = "data/anthony/anthony_speech_list.csv"

# Carrie Chapman Catt
CATT = "data/catt/carrie-chapman-catt-papers_2022-10-12.csv"

# Elizabeth Cady Stanton
STANTON = "data/stanton/elizabeth-cady-stanton-papers_2022-10-19.csv"

# Mary Church Terrell
TERRELL = "data/terrell/mary-church-terrell-advocate-for-african-americans-and-women_2023-01-20.csv"

### 1.3 Define processing functions
The following helper functions were written to process the data. They are defined here to make the later sections easier to read.

In [None]:
def load_csv(file: str) -> pd.DataFrame:
    """Load a CSV file into a data frame, reading every column as a string.
    
    Returns:
        df (data frame): A data frame containing the data loaded from csv."""
    
    df = pd.read_csv(file, dtype=str)
    return df


def tokens(text) -> list:
    """Runs NLP process on text input. 
    
    Returns: 
        process (list): A list containing tuples of NLP attributes 
            for each word in the transcription.
    """
    
    doc = NLP(str(text))
    process = ([(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
                token.shape_, token.is_alpha, token.is_stop) for token in doc])

    return process


def entities(text) -> list:
    """Runs NER process on text input. 
    
    Returns:
        process (list): A list containing tuples of NER attributes 
            for each entity in the transcription.
    """
    
    doc = NLP(str(text))
    process = [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]

    return process


def separate_text(df: pd.DataFrame) -> None:
    """Adds new columns to the data frame, then loops through the 
    tokenized text of each row, sorting each token into the relevant 
    newly created column."""
    
    # Collect one list per row for each new column (c), then assign each
    # column in a single step (this avoids chained assignment with .iloc,
    # which newer versions of pandas do not support)
    columns = {c: [] for c in ['text', 
                               'stop_words', 
                               'nonalphanums', 
                               'numbers', 
                               'ambigs', 
                               'processed_text']}
    
    # Iterates over tokenized_text to sort tokens into five categories
    for text_block in df['tokenized_text']:
        text = []
        stop_words = []
        nonalphanums = []
        numbers = []
        ambigs = []
    
        for word in text_block:
            # Move stop words
            if word[7]:
                stop_words.append(word)
            # Move punctuation, whitespace, symbols, coordinating
            # conjunctions, and tokens tagged as "other" (X)
            elif word[2] in ['PUNCT', 'SPACE', 'CCONJ', 'X', 'SYM']:
                nonalphanums.append(word)
            # Move numbers
            elif word[2] == 'NUM':
                numbers.append(word)
            # Move ambiguously transcribed words
            elif '?' in word[5]:
                ambigs.append(word)
            # Move text
            else:
                text.append(word)
                
        columns['text'].append(text)
        columns['stop_words'].append(stop_words)
        columns['nonalphanums'].append(nonalphanums)
        columns['numbers'].append(numbers)
        columns['ambigs'].append(ambigs)
        # processed_text holds the lowercase lemma of each token 
        # in list 'text'
        columns['processed_text'].append([i[1].lower() for i in text])
    
    # Add the new columns to the data frame
    for name, values in columns.items():
        df[name] = pd.Series(values, index=df.index)


### 1.4 Load each transcription dataset into a data frame
The `load_csv` function will read the data from each path constant and store data in a Pandas data frame.

In [None]:
a = load_csv(ANTHONY)
c = load_csv(CATT)
s = load_csv(STANTON)
t = load_csv(TERRELL)

#### 1.4.1 Optional: Preview the first five lines of a loaded dataset
Uncomment the line for the dataset you wish to preview and then run the cell.

##### Anthony

In [None]:
#a.head()

##### Catt

In [None]:
#c.head()

##### Stanton

In [None]:
#s.head()

##### Terrell

In [None]:
#t.head()

---

## 2. Text tokenization for all four transcription datasets

This section contains code that will tokenize the transcription data and add new columns to the data frames for each transcription dataset.

### 2.1 Create a new column containing the output of the `tokens` function
The `tokens` function uses the previously loaded spaCy model to analyze each word in the transcription. This results in several values for each word, including the lemma, the part-of-speech tag, the shape of the word, whether it is alphabetic, and whether it is a stop word.
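
Later code reads these values by position (for example `word[7]` for the stop-word flag). As a reading aid, the sketch below names each position with a `namedtuple`. The `Token` name and the sample values are hypothetical and hand-written; they are not output from the spaCy model.

```python
from collections import namedtuple

# Field order mirrors the tuple built by the `tokens` function
Token = namedtuple(
    "Token", ["text", "lemma", "pos", "tag", "dep", "shape", "is_alpha", "is_stop"]
)

# Hand-written illustrative values, not real model output
example = Token("voted", "vote", "VERB", "VBD", "ROOT", "xxxx", True, False)

print(example[1], example.lemma)      # lemma by position and by name
print(example[2], example.pos)        # part of speech by position and by name
print(example[7], example.is_stop)    # stop-word flag by position and by name
```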

In [None]:
# NOTE: This will take a while to run
for dataset in [a, c, s, t]:
    print(f"Tokenizing text for dataset: {dataset['Campaign'][0]}")
    dataset['tokenized_text'] = dataset['Transcription'].apply(tokens)
print("Done!")

### 2.2 Create a new column containing the output of the `entities` function
The `entities` function uses the previously loaded spaCy model to identify persons, places, organizations, etc.

In [None]:
# NOTE: This will take a while to run
for dataset in [a, c, s, t]:
    print(f"Identifying entities for dataset: {dataset['Campaign'][0]}")
    dataset['entities'] = dataset['Transcription'].apply(entities)
print("Done!")

#### 2.2.1 Optional: Preview the results of the `tokens` and `entities` functions for the first row of a dataset
Uncomment the lines in a cell for a dataset and then run the cell. Note that the `entities` preview line displays row 1000 rather than the first row.

##### Anthony

In [None]:
#a.head(1)
#a['tokenized_text'].iloc[0]
#a['entities'].iloc[1000]

##### Catt

In [None]:
#c.head(1)
#c['tokenized_text'].iloc[0]
#c['entities'].iloc[1000]

##### Stanton

In [None]:
#s.head(1)
#s['tokenized_text'].iloc[0]
#s['entities'].iloc[1000]

##### Terrell

In [None]:
#t.head(1)
#t['tokenized_text'].iloc[0]
#t['entities'].iloc[1000]

### 2.3 Run the `separate_text` function to isolate tokens by category

The `separate_text` function uses labels generated by the `spaCy` library to sort the tokens of each transcription into five categories: actual text, stop words (conjunctions, prepositions, etc.), non-alphanumeric strings (punctuation, whitespace, etc.), numbers, and ambiguous words. When a transcriber cannot make out a word or character, a `?` is used for the unknown character(s). The `?` carries over into the token's shape (the analyzed pattern of the word), which is used to keep these words out of the text category.
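
The order of the checks matters, because each token lands in the first category it matches. The sketch below restates that order for a single token without using pandas or spaCy. The function name `categorize` is hypothetical and the sample tuples are hand-written for illustration, so real model output may differ.

```python
def categorize(token: tuple) -> str:
    """Return the category name for one (text, lemma, pos, tag, dep,
    shape, is_alpha, is_stop) tuple, testing in the notebook's order."""
    if token[7]:
        return "stop_words"
    if token[2] in ("PUNCT", "SPACE", "CCONJ", "X", "SYM"):
        return "nonalphanums"
    if token[2] == "NUM":
        return "numbers"
    if "?" in token[5]:
        return "ambigs"
    return "text"


# Hand-written tuples for illustration; real values come from the spaCy model
samples = [
    ("the", "the", "DET", "DT", "det", "xxx", True, True),
    (",", ",", "PUNCT", ",", "punct", ",", False, False),
    ("1872", "1872", "NUM", "CD", "nummod", "dddd", False, False),
    ("wom?n", "wom?n", "NOUN", "NN", "nsubj", "xxx?x", False, False),
    ("voted", "vote", "VERB", "VBD", "ROOT", "xxxx", True, False),
]

for sample in samples:
    print(sample[0], "->", categorize(sample))
```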

In [None]:
# Run the separate_text function on each of the four data frames
for dataset in [a, c, s, t]:
    print(f"Organizing tokens by category for: {dataset['Campaign'][0]}")
    separate_text(dataset)
print("Done!")

#### 2.3.1 Optional: Preview the results for the first six rows of the updated data frame
Uncomment the line for the dataset you wish to preview and then run the cell.

##### Anthony

In [None]:
#a.iloc[0:6]

##### Catt

In [None]:
#c.iloc[0:6]

##### Stanton

In [None]:
#s.iloc[0:6]

##### Terrell

In [None]:
#t.iloc[0:6]

---

## 3. Process the Susan B. Anthony speech subset
A [typed inventory of speeches](http://hdl.loc.gov/loc.mss/ms997009.mss11049.036) in the Susan B. Anthony Papers is available; this allows for subsetting the transcription data so that it can be processed and visualized separately from the entire transcription dataset.

### 3.1 Extract the transcription text for the speeches
The code below will group transcription data by the `ItemId`. The speech inventory will then be used to subset the transcription data using the `ItemId` of known speeches. The transcribed text will then be combined at the item level and stored in a list of dictionaries, one per speech, each recording the `id`, `year`, `title`, and `text` of that speech.
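
The same grouping logic can be sketched with plain Python structures, which may make the steps easier to follow before reading the pandas version. All identifiers, dates, and titles below are hypothetical stand-ins for the real transcription and inventory rows.

```python
import re
from collections import defaultdict

# Hypothetical rows: (ItemId, processed_text) pairs, one per page
pages = [
    ("item-1", ["woman", "vote"]),
    ("item-2", ["letter"]),
    ("item-1", ["right", "vote"]),
]

# Hypothetical inventory rows: (ItemId, date string, title)
inventory = [("item-1", "Nov. 1872", "Example speech")]

# Combine page-level lemmas at the item level
lemmas_by_item = defaultdict(list)
for item_id, lemmas in pages:
    lemmas_by_item[item_id].extend(lemmas)

# Keep only items named in the inventory and pull the first four-digit year
speeches = []
for item_id, date, title in inventory:
    years = re.findall(r"\d{4}", date)
    speeches.append({"id": item_id, "year": years[0], "title": title,
                     "text": lemmas_by_item[item_id]})

print(speeches)
```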

In [None]:
# Load the speech inventory
a_speeches = load_csv(SPEECHES)

# Group transcriptions by ItemId
# Creates a dictionary where the ItemId is the key and the value is a list of associated row indexes
a_groups = a.groupby('ItemId').groups

# Create a list of dictionaries representing each speech
# This structure is specifically designed for the visualization section below
speech_list = []

for row in range(a_speeches.shape[0]):
    # Find four-digit years in the second column of the inventory
    d = re.findall(r'\d{4}', a_speeches.iloc[row, 1])
    speech_id = a_speeches.iloc[row, 0]
    speech_text = []
    # a_groups holds index labels, so rows are looked up by label
    for i in a_groups[speech_id]:
        speech_text.extend(a['processed_text'].loc[i])
    speech = {'id': speech_id, 
              'year': d[0], 
              'title': a_speeches.iloc[row, 2], 
              'text': speech_text}
    speech_list.append(speech)

#### 3.1.1 Optional: Save `speech_list` to a Python file for import or reuse beyond these notebooks

In [None]:
# Optional code to create a stand-alone file for the speech data
with open('outputs/anthony_speech_lemmas.py', 'w', encoding='utf-8') as f:
    f.write("#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n")
    f.write("speech_list = " + str(speech_list))

---

## 4. Process transcription data for all four datasets
The following code will prepare the data in the same way as the Susan B. Anthony speech subset above. Running this code is necessary for visualizing at the dataset level for all four datasets.

### 4.1 Create a list of all words from `processed_text` for each dataset
This code will create a list of dictionaries, one per dataset, each containing the title and the aggregated text from the `processed_text` column of that dataset.

In [None]:
transcriptions = []

for dataset in [a, c, s, t]:
    transcription_text = []
    for row in range(dataset.shape[0]):
        transcription_text.extend(dataset['processed_text'].iloc[row])
    transcription = {'title': dataset['Campaign'][0],
                     'text': transcription_text}
    transcriptions.append(transcription)

#### 4.1.1 Optional: Save `transcriptions` to a Python file for import or reuse beyond these notebooks

In [None]:
# Optional code to create a stand-alone file for the processed transcriptions data
with open('outputs/transcriptions_lemmas.py', 'w', encoding='utf-8') as f:
    f.write("#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n")
    f.write("transcriptions = " + str(transcriptions))

---

# Visualizing the data
This section contains the code necessary to visualize the processed transcription data; it is divided into two parts:
1. Create simple bar graphs,
2. Create grouped bar graph.

## Instructions for running the code
1. Run the cells in order.
2. If only creating the grouped bar graph, run cell `1.1 Import modules` first.

---

## 1. Create simple bar graphs
This code will create one bar graph for each dataset depicting the five most frequent words from that dataset. The four graphs will be displayed in a 2x2 grid.
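
The counting step relies on `collections.Counter`. The sketch below shows, on a small hypothetical lemma list, how `most_common(5)` returns (word, count) pairs and how `zip(*...)` splits them into the two sequences that a bar graph needs.

```python
from collections import Counter

# Hypothetical lemma list standing in for one dataset's `text` value
lemmas = (["vote"] * 6 + ["woman"] * 5 + ["right"] * 4
          + ["law"] * 3 + ["state"] * 2 + ["letter"])

top_words = Counter(lemmas).most_common(5)   # list of (word, count) pairs
words, counts = zip(*top_words)              # split pairs into two tuples

print(words)
print(counts)
```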

### 1.1 Import modules

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter
from operator import itemgetter

### 1.2 Plot data in bar graphs

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(10, 8))
fig.tight_layout(pad=5.0)

#This code will loop over the lemmas from each dataset.
for i, transcription in enumerate(transcriptions):
    title = transcription['title']
    text = transcription['text']
    word_counts = Counter(text)
    #Identify the top 5 words for each dataset.
    top_words = word_counts.most_common(5)
    words, counts = zip(*top_words)
    
    #Create a 2x2 grid of bar graphs.
    ax = axs[i // 2][i % 2]
    ax.bar(words, counts)
    ax.set_title(title)
    ax.set_xlabel('Words')
    ax.set_ylabel('Counts')

plt.show()

---

## 2. Create grouped bar graph
This code will create a grouped bar graph showing the usage of the five most frequent words from Susan B. Anthony's speeches by year.
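
Before plotting, it may help to see the counting and bar placement on a small example. The sketch below uses hypothetical speeches, builds one `Counter` per year, ranks words across all years, and computes where each word's bars would sit so that every group is centered on its year tick. It mirrors the idea of the cells that follow but is not the notebook's own code.

```python
from collections import Counter, defaultdict

# Hypothetical speeches with already-processed lemmas
speeches = [
    {"year": "1872", "text": ["vote", "woman", "vote"]},
    {"year": "1872", "text": ["woman", "right"]},
    {"year": "1873", "text": ["vote", "law"]},
]

# One Counter per year, updated with every speech from that year
counts_by_year = defaultdict(Counter)
for speech in speeches:
    counts_by_year[speech["year"]].update(speech["text"])

# Words ranked across all years
total = sum(counts_by_year.values(), Counter())
top_words = [word for word, _ in total.most_common(2)]

# Bar positions: each word's bars are shifted so the group is centered on the tick
bar_width = 0.15
center = (len(top_words) - 1) / 2
for i, word in enumerate(top_words):
    positions = [x + (i - center) * bar_width for x in range(len(counts_by_year))]
    heights = [counts_by_year[year].get(word, 0) for year in counts_by_year]
    print(word, positions, heights)
```

With five words, `center` equals 2, which matches the `x - (2 - i) * bar_width` offset used in the plotting cell.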

### 2.1 Group speeches by year

In [None]:
# some years have multiple speeches
year_speeches = {}
for speech in speech_list:
    year = speech["year"]
    if year not in year_speeches:
        year_speeches[year] = []
    year_speeches[year].append(speech)

### 2.2 Count word occurrences for each year, excluding "nan" values

In [None]:
# nan means "not a number." Blank transcriptions are most likely read as NaN and turned into the string "nan" by the tokens function; this filter keeps that placeholder out of the counts.
year_word_counts = {}
for year, speeches in year_speeches.items():
    word_counts = Counter()
    for speech in speeches:
        words = [word for word in speech["text"] if word != "nan"]
        word_counts.update(words)
    year_word_counts[year] = word_counts

### 2.3 Sum word occurrences across all years

In [None]:
word_counts = Counter()
for year_counts in year_word_counts.values():
    word_counts += year_counts

### 2.4 Get and print the five most frequent words

In [None]:
top_words = [word for word, count in word_counts.most_common(5)]
print(top_words)

### 2.5 Create grouped bar graph

In [None]:
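# Build one list of yearly counts for each of the top words
data = []
for word in top_words:
    word_data = []
    # Use a separate name so the word_counts total from 2.3 is not overwritten
    for year, counts in year_word_counts.items():
        count = counts.get(word, 0)
        word_data.append(count)
    data.append(word_data)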

bar_width = 0.15
year_labels = list(year_word_counts.keys())
x = np.arange(len(year_labels))
fig, ax = plt.subplots()
colors = ['tab:green', 'tab:orange', 'tab:blue', 'tab:red', 'tab:purple']
for i, word_data in enumerate(data):
    ax.bar(x - (2 - i) * bar_width, word_data, bar_width, label=top_words[i], color=colors[i])


# Set the x-axis tick locations and labels
ax.set_xticks(range(len(year_labels)))
ax.set_xticklabels([int(year) for year in year_labels])

ax.legend()
ax.set_xlabel('Year')
ax.set_ylabel('Frequency')
ax.set_title('Speeches by year and top words')

plt.show()

---