# Introduction

---
## About the data

This tutorial provides an example of data processing and visualization using transcription data from [By the People](https://crowd.loc.gov/), the Library of Congress's crowdsourcing program. By the People transcriptions and tags are created by anonymous and registered volunteers. Once a transcription is finished, it must be reviewed by a registered volunteer. A transcription may undergo multiple rounds of edits before being completed. Finally, transcriptions are spot-checked by Library of Congress subject matter experts before they are incorporated into the digital collections on the Library's website to enhance search and accessibility. Transcriptions are also packaged into .csv files and made available as datasets as part of the [Selected Datasets Collection](https://www.loc.gov/collections/selected-datasets/).

Volunteers are instructed to transcribe the text as written, including misspellings and abbreviations. Formatting is generally not preserved, with the exception of line breaks. Minimal markup is used: “?” for illegible or unclear text, square brackets around deleted text, and square brackets and asterisks around marginalia (`[*example*]`). Pages without text are marked “nothing to transcribe” and do not have transcriptions in [loc.gov](https://www.loc.gov/).
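
The markup conventions described above can be detected with regular expressions. The following is a minimal sketch, not code taken from the notebooks: the function name `describe_markup` and the sample sentence are invented for illustration, and real transcriptions may apply the conventions less consistently than these patterns assume.

```python
import re

# Marginalia: square brackets plus asterisks, e.g. [*example*]
MARGINALIA = re.compile(r"\[\*(.*?)\*\]", re.DOTALL)
# Deleted text: square brackets whose content does not start with an asterisk
DELETED = re.compile(r"\[(?!\*)(.*?)\]", re.DOTALL)


def describe_markup(transcription):
    """Summarize the By the People markup found in one transcription string."""
    return {
        "marginalia": [m.strip() for m in MARGINALIA.findall(transcription)],
        "deleted": [d.strip() for d in DELETED.findall(transcription)],
        # "?" may mark unclear text, but it may also be ordinary punctuation
        "has_question_mark": "?" in transcription,
    }


# Invented sample text, used only to exercise the patterns
sample = "We meet on [Monday] Tuesday [*see reverse*] at Mrs. ?'s house"
print(describe_markup(sample))
```

Because “?” doubles as normal punctuation, the last key only flags a transcription for closer inspection; it does not prove that text was illegible.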

Additionally, registered volunteers can tag images with text in any way they choose. Volunteers have used tags to highlight key terms or subjects from the text, identify the document format or language, expand abbreviated words, note correct spelling of misspelled words, and record contextual information about the document. Tags are not selected from a controlled vocabulary and may include variations of similar terms, misspellings, or other errors.


By the People datasets contain the following fields:
- Campaign – this is the highest hierarchical level in the arrangement of collections on By the People (example: [Susan B. Anthony Papers](https://crowd.loc.gov/campaigns/susan-b-anthony-papers/)). This field displays the campaign’s title.

- Project – this is the second-highest hierarchical level of collections on By the People. Projects may map to an existing subset of a digital collection, such as an archival series, or may be a grouping of related items uniquely organized for By the People. This field displays the project’s title.

- Item – this is the third-highest hierarchical level of collections on By the People, typically representing a folder, letter, document, or diary. This field displays the item title. 

- ItemId – this is the identifier for the item (see above for definition). This numerical identifier is consistent across the By the People website and in loc.gov. The item and metadata are usually located on the Library’s website at `https://www.loc.gov/item/[ItemId]/`.

- Asset – this is the identifier for the individual asset image. It is also referred to colloquially as the “page” by By the People volunteers and Community Managers. This identifier is used in the By the People site and on loc.gov. 

- AssetStatus – this indicates the status of the asset in the peer review workflow – Not Started, In Progress, Needs Review, or Completed. Dataset assets will always be marked as “Completed.” 

- DownloadURL – this link provides access to the image file for the Asset from which the transcription was created.

- Transcription – this is the text created by the By the People volunteers, representing the written content of the DownloadURL image and corresponding to the Asset. This field will be blank for assets that volunteers marked “Nothing to transcribe”.

- Tags – these are all the tags that have been applied to the asset. If there is more than one tag, the tags are delimited by a semicolon and space.
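
The fields listed above can be loaded into a Pandas DataFrame, with the Tags field split on its semicolon-and-space delimiter. This is a sketch under stated assumptions rather than the notebooks' own code: the column names follow the field list above, the helper `load_dataset` is hypothetical, and the two sample rows are placeholders, not real By the People records.

```python
import io

import pandas as pd

# Placeholder rows that only mimic the field layout described above
SAMPLE_CSV = """Campaign,Project,Item,ItemId,Asset,AssetStatus,DownloadURL,Transcription,Tags
Example Campaign,Example Project,Example Item,0000,asset-1,Completed,https://example.org/1.jpg,"Dear friend, I write to you",suffrage; letter
Example Campaign,Example Project,Example Item,0000,asset-2,Completed,https://example.org/2.jpg,,
"""


def load_dataset(source):
    """Read a dataset CSV (path or file-like object) and add a TagList column."""
    # Read every column as text so identifiers keep their original formatting
    df = pd.read_csv(source, dtype=str)
    # Blank Transcription/Tags cells arrive as NaN; normalize them to ""
    df[["Transcription", "Tags"]] = df[["Transcription", "Tags"]].fillna("")
    # Tags are delimited by a semicolon and a space
    df["TagList"] = df["Tags"].apply(lambda s: s.split("; ") if s else [])
    return df


df = load_dataset(io.StringIO(SAMPLE_CSV))
print(df[["Asset", "Transcription", "TagList"]])
```

To read one of the downloaded datasets instead, pass its file path to `load_dataset` in place of the `io.StringIO` object.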


For this project, four datasets related to the movement for women's suffrage in the United States were selected. The first three datasets have multiple versions to include the Tags field and account for README updates.
- Anthony, Susan B. Transcription datasets from Susan B. Anthony Papers, Manuscript Division. compiled by By The People. Washington, D.C.: By the People, Library of Congress, to 2022, 2021. Software, E-Resource. https://www.loc.gov/item/2020445591/.
- Catt, Carrie Chapman. Transcription datasets from Carrie Chapman Catt Papers, Manuscript Division. compiled by By The People. Washington, D.C.: By the People, Library of Congress, to 2022, 2020. Software, E-Resource. https://www.loc.gov/item/2019667239/.
- Stanton, Elizabeth Cady. Transcription datasets from Elizabeth Cady Stanton Papers, Manuscript Division. compiled by By The People. Washington, D.C.: By the People, Library of Congress, 2021. Software, E-Resource. https://www.loc.gov/item/2020445592/.
- Terrell, Mary Church. Transcription dataset from the Mary Church Terrell Papers, Manuscript Division. compiled by By The People. Washington, D.C.: By the People, Library of Congress, to 2021, 2018. Software, E-Resource. https://www.loc.gov/item/2021387726/.

---

## About the notebooks

This tutorial is organized into two notebooks: `2-Data-Processing.ipynb` and `3-Grouped-Bar-Graph.ipynb`. The first notebook cleans and processes transcriptions from the four datasets using [Pandas](https://pandas.pydata.org/) and the [spaCy](https://spacy.io/) Natural Language Processing library. This code tokenizes the transcriptions, breaking the strings of text into tokens (words) that will be further analyzed. It then identifies the lemma, or root, for each word. For example, the lemma of "voted" is "vote", and the lemma of "women" is "woman". The code next iterates over each token to produce a list of lemmas from the original transcriptions that excludes stop words, punctuation, numbers, and words that volunteers were unable to fully transcribe, which are designated with "?". Stop words are commonly used words, such as "the", "a", or "is".
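
The filtering step can be sketched with spaCy's token attributes. This is an illustration, not the code in `2-Data-Processing.ipynb`: the function names are hypothetical, the `en_core_web_sm` model is an assumption (the notebook may load a different one), and skipping whitespace tokens is an extra step not mentioned in the description above.

```python
from typing import Iterable, List


def keep_token(token) -> bool:
    """Return True for tokens worth counting (expects spaCy Token attributes)."""
    return not (
        token.is_stop           # common words such as "the", "a", "is"
        or token.is_punct       # punctuation
        or token.like_num       # numbers, e.g. "1872" or "ten"
        or token.is_space       # assumption: also skip line-break tokens
        or "?" in token.text    # words volunteers could not fully transcribe
    )


def filtered_lemmas(doc: Iterable) -> List[str]:
    """Collect the lowercased lemma of every token that passes the filter."""
    return [token.lemma_.lower() for token in doc if keep_token(token)]


def lemmas_from_text(text: str, model: str = "en_core_web_sm") -> List[str]:
    """Tokenize and lemmatize text with spaCy, then apply the filter."""
    import spacy  # imported here so the filter above works without the model

    nlp = spacy.load(model)  # assumed model name; install it separately
    return filtered_lemmas(nlp(text))
```

Calling `lemmas_from_text` on a transcription string returns the cleaned list of lemmas. When processing many transcriptions, load the pipeline once and reuse it, since `spacy.load` is slow.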

The second notebook creates two visualizations from the cleaned data using the [Matplotlib](https://matplotlib.org/) and [NumPy](https://numpy.org/) Python libraries. The first is a grouped bar graph showing the five most used words for each of the four datasets. The second is a focused look at the "Speeches" series from the Susan B. Anthony Papers. Using data from a typed inventory of speeches found in the collection, this code groups Anthony's speeches by year, and then plots the usage of the top five words in her speeches by year.
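
A bar graph of the most used words per dataset can be sketched as follows. This is one possible layout, not necessarily the one in `3-Grouped-Bar-Graph.ipynb`: the function names are hypothetical, and the word lists are toy placeholders rather than results from the datasets.

```python
from collections import Counter

import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt
import numpy as np


def top_words(lemmas, n=5):
    """Return the n most common lemmas as (word, count) pairs."""
    return Counter(lemmas).most_common(n)


def grouped_bar(top_by_dataset):
    """Draw one cluster of bars per dataset, labelling each bar with its word."""
    names = list(top_by_dataset)
    n = min(len(pairs) for pairs in top_by_dataset.values())
    width = 0.8 / n                   # bar width so each cluster spans 0.8
    centers = np.arange(len(names))   # one cluster center per dataset
    fig, ax = plt.subplots(figsize=(10, 4))
    for rank in range(n):
        xs = centers + (rank - (n - 1) / 2) * width  # offset within cluster
        words = [top_by_dataset[name][rank][0] for name in names]
        counts = [top_by_dataset[name][rank][1] for name in names]
        ax.bar(xs, counts, width, label=f"#{rank + 1}")
        for x, count, word in zip(xs, counts, words):
            ax.text(x, count, word, ha="center", va="bottom", rotation=90)
    ax.set_xticks(centers)
    ax.set_xticklabels(names)
    ax.set_ylabel("Occurrences")
    ax.legend(title="Rank")
    return fig


# Toy lemma lists standing in for the cleaned output of the first notebook
toy = {
    "Dataset A": ["vote", "vote", "vote", "woman", "woman", "right"],
    "Dataset B": ["woman", "woman", "woman", "law", "law", "state"],
}
fig = grouped_bar({name: top_words(lemmas, 2) for name, lemmas in toy.items()})
fig.savefig("grouped_bar_example.png")
```

Because each dataset has its own top words, the bars are colored by rank and labelled with the word itself, rather than sharing one set of words across all clusters.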

---

## Running the notebooks

To run a Jupyter notebook, navigate to the directory that contains the notebook files using `cd /path/to/dcm-btp-notebooks`, then run the command `jupyter notebook`. This will launch the Notebook Dashboard in a web browser.

To run these notebooks properly, make sure that the appropriate Python libraries are installed. Further information can be found in the README file. The dataset files are already included in this tutorial in the `data` directory, which can be seen in the Notebook Dashboard.
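
A quick way to check for missing libraries before opening the notebooks is sketched below. The package list is an assumption based on the libraries named in this introduction, and the helper `missing_packages` is hypothetical. The README remains the authoritative list of requirements, and the spaCy language model must be installed separately.

```python
from importlib.util import find_spec

# Assumed from the libraries mentioned in this introduction; see the README
REQUIRED = ["pandas", "spacy", "matplotlib", "numpy"]


def missing_packages(names=REQUIRED):
    """Return the names of packages that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Install before running the notebooks:", ", ".join(missing))
else:
    print("All listed packages were found.")
```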

The notebooks in the tutorial must be run in order. `3-Grouped-Bar-Graph.ipynb` relies on the cleaned data created in `2-Data-Processing.ipynb`. The entire notebook can be run by clicking `Run` in the menu bar. Individual cells can be run by clicking into the cell, then hitting `Shift + Enter`.

`2-Data-Processing.ipynb` contains optional code that can be run to print results to the notebook. This helps show what the code is doing at each step. These cells have "Optional:" in the title. Remove `#` from the code to un-comment and run those lines of code.

The outputs from `2-Data-Processing.ipynb` will be saved to the `outputs` directory, which can be seen in the Notebook Dashboard.

---

## Authorship and use

These notebooks were created by Dave Durden and Madeline Goebel, both Digital Collection Specialists at the Library of Congress. They are made available under the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/legalcode).