# Ladies' Edinburgh Debating Society as Data
Created August-September 2020 for the National Library of Scotland's Data Foundry by Lucy Havens, Digital Library Research Intern

### About the *Ladies' Edinburgh Debating Society* Dataset
The Ladies' Edinburgh Debating Society (LEDS) was founded in 1865 by women of the upper-middle and upper classes, at a time when women had limited opportunities for higher education.  Members went on to play significant roles in education, suffrage, philanthropy, and anti-slavery efforts.  The LEDS Dataset contains digitized text from all volumes of two journals the Society published: "The Attempt" and "The Ladies' Edinburgh Magazine."  The first journal contains 10 volumes published from 1865 through 1874.  The second journal contains six volumes published from 1875 through 1880.

The Ladies' Edinburgh Debating Society, also known as the Edinburgh Essay Society and the Ladies' Edinburgh Essay Society, was dissolved in 1935.  A year later, in 1936, the National Library of Scotland acquired the volumes that were later digitized for this dataset.
* Data format: digitised text
* Data creation process: Optical Character Recognition (OCR)
* Data source: https://data.nls.uk/data/digitised-collections/edinburgh-ladies-debating-society/

### 0. Preparation
Import libraries to use for cleaning, summarizing and exploring the data:

In [1]:
# To prevent SSL certificate failure
import os, ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
    getattr(ssl, '_create_unverified_context', None)):
    ssl._create_default_https_context = ssl._create_unverified_context

# Libraries for data loading
import pandas as pd
import numpy as np
import string
import re

# Libraries for visualization
import altair as alt
import matplotlib.pyplot as plt

# Libraries for text analysis
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
nltk.download('wordnet')
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.text import Text
from nltk.stem.porter import PorterStemmer
from nltk.probability import FreqDist
nltk.download('averaged_perceptron_tagger')
from nltk.tag import pos_tag
nltk.download('tagsets')  # part of speech tags
from nltk.draw.dispersion import dispersion_plot as displt

[nltk_data] Downloading package punkt to /Users/lucy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/lucy/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/lucy/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package tagsets to /Users/lucy/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


The nls-text-ladiesDebating folder (downloadable as *Just the text* data from the website at the top of this notebook) contains TXT files of digitized text, with numerical names, as well as a CSV inventory file and a TXT ReadMe file.  Load only the TXT files of digitized text and **tokenize** the text (which splits a string into separate words and punctuation):

In [2]:
corpus_folder = 'data/nls-text-ladiesDebating/'
# Raw string so that \d is read as a regex digit class, not a string escape
wordlists = PlaintextCorpusReader(corpus_folder, r'\d.*', encoding='latin1')
corpus_tokens = wordlists.words()
print(corpus_tokens[:10])

['â', '\x80¢*', 'â', '\x80¢', 'UL', '.', 'u', '^\\,', 'THE', 'ATTEMPT']
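
To make tokenization concrete, here is a minimal sketch that uses only Python's `re` module.  As far as we know, the pattern below is the one behind NLTK's `WordPunctTokenizer`, which `PlaintextCorpusReader` uses by default for `words()`, but treat that as an assumption rather than a guarantee.  The helper name `simple_tokenize` and the sample string are illustrative and are not taken from the dataset.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and runs of punctuation."""
    # \w+ matches letters, digits and underscores; [^\w\s]+ matches
    # anything that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]+", text)

# Illustrative sample string (an assumption, not a quote from the corpus)
sample = "THE ATTEMPT: A Literary Magazine, Vol. I."
sample_tokens = simple_tokenize(sample)
print(sample_tokens)
```

Note that punctuation marks become tokens of their own, which is why the word counts later in this notebook include punctuation.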


*Note: If you'd like to see how to specify a single TXT file to load as data, check out the Jupyter Notebook for the Britain and UK Handbooks!*

It's hard to get a sense of how accurately the text has been digitized from this list of 10 tokens, so let's look at one of these words in context.  To see phrases in which "Edinburgh" is used, we can use the concordance() method:

In [5]:
t = Text(corpus_tokens)
t.concordance('Edinburgh', lines=10)

Displaying 10 of 2079 matches:
UM MELIORIS MVl ." FEINTED FOR THE EDINBURGH ESSAY SOCIETY . EEID & SON , SHORE
atriculated into the University of Edinburgh , where he graduated in Arts , aft
pany has been making a stir in the Edinburgh world ), he was making a long spee
 May proÂ ¬ verbially favours us , Edinburgh holds May one of the dearest of th
on of Ecclesiastical Courts met in Edinburgh this month are no less noteworthy 
wning glory of the month of May in Edinburgh , supplying plenty of gaiety and g
go and find out ." Tlie hair of an Edinburgh cab - owner would stand on end did
onal Gallery . Possibly some of my Edinburgh friends may not have had the oppor
NE CONDUCTED BY THE MEMBERS OF THE EDINBURGH ESSAY SOCIETY . VOLUME III . " AUS
M MELIORIS uEVI ." PRINTED FOR THE EDINBURGH ESSAY SOCIETY . COLSTON & SON , ED
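
The listing above is a keyword-in-context (KWIC) view.  The sketch below shows the general idea with plain Python; it is not NLTK's implementation.  The function name `kwic`, its parameters and the toy token list are hypothetical.  Unlike `concordance()`, which sizes its window in characters, this version counts the context in tokens.

```python
def kwic(tokens, keyword, width=3, max_lines=10):
    """Return up to max_lines keyword-in-context strings (case-insensitive match)."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            # Take up to `width` tokens on either side of the match
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
            if len(lines) == max_lines:
                break
    return lines

# Toy token list adapted from the concordance output above
toy_tokens = ["PRINTED", "FOR", "THE", "EDINBURGH", "ESSAY", "SOCIETY", ".",
              "the", "University", "of", "Edinburgh", ",", "where", "he"]
for line in kwic(toy_tokens, "Edinburgh"):
    print(line)
```

Matching on the lowercased token is what lets a search for "Edinburgh" also find "EDINBURGH" on title pages.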


The OCR output for "The Attempt" and "The Ladies' Edinburgh Magazine" has not been manually cleaned, so it's not surprising to see some non-words (such as "Tlie", "EEID" and "FEINTED") appear in the concordance.
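
Not every odd token is an OCR misreading.  Sequences such as `'â', '\x80¢'` in the first tokens and `proÂ ¬ verbially` in the concordance look like UTF-8 bytes that were decoded as Latin-1.  The sketch below demonstrates that effect on two characters.  It assumes the source files are UTF-8 encoded, which this notebook does not confirm, and the helper names are hypothetical.

```python
def misread_as_latin1(text):
    """Encode text as UTF-8, then decode the bytes as Latin-1 (the suspected mismatch)."""
    return text.encode("utf-8").decode("latin1")

def repair(text):
    """Reverse the mismatch: re-encode as Latin-1, then decode as UTF-8."""
    return text.encode("latin1").decode("utf-8")

bullet = "\u2022"    # bullet character
not_sign = "\u00ac"  # not sign

print(repr(misread_as_latin1(bullet)))    # 'â\x80¢'
print(repr(misread_as_latin1(not_sign)))  # 'Â¬'

# "pro" + misread not sign + "verbially", written with escapes for clarity
garbled = "pro\u00c2\u00acverbially"
print(repair(garbled) == "pro\u00acverbially")  # True
```

If the assumption holds, passing `encoding='utf-8'` to `PlaintextCorpusReader` may avoid these sequences.  That has not been tested here, and the outputs shown in this notebook were produced with `latin1`.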

Before we do much analysis, let's get a sense of how much data we're working with:

In [6]:
def corpusStatistics(plaintext_corpus_read_lists):
    total_chars = 0
    total_words = 0
    total_sents = 0
    total_files = 0
    
    # fileids are the TXT file names in the nls-text-ladiesDebating folder:
    for fileid in plaintext_corpus_read_lists.fileids():
        total_chars += len(plaintext_corpus_read_lists.raw(fileid))
        total_words += len(plaintext_corpus_read_lists.words(fileid))
        total_sents += len(plaintext_corpus_read_lists.sents(fileid))
        total_files += 1
    
    print("Total...")
    print("  Characters in LEDS Data:", total_chars)
    print("  Words in LEDS Data:", total_words)
    print("  Sentences in LEDS Data:", total_sents)
    print("  Files in LEDS Data:", total_files)

corpusStatistics(wordlists)

Total...
  Characters in LEDS Data: 15096132
  Words in LEDS Data: 3145535
  Sentences in LEDS Data: 108011
  Files in LEDS Data: 16
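
These totals are easier to interpret as ratios.  The sketch below simply divides the numbers printed above; the variable names are illustrative.  Keep in mind that the character count includes whitespace, that "words" are tokens (punctuation included), and that sentence boundaries are detected on uncleaned OCR text, so all three ratios are rough.

```python
# Totals copied from the output above
total_chars = 15096132
total_words = 3145535
total_sents = 108011
total_files = 16

chars_per_token = total_chars / total_words
tokens_per_sentence = total_words / total_sents
tokens_per_file = total_words / total_files

print(f"Characters per token:  {chars_per_token:.2f}")
print(f"Tokens per sentence:   {tokens_per_sentence:.2f}")
print(f"Tokens per file:       {tokens_per_file:.0f}")
```

This works out to roughly 4.8 characters per token, about 29 tokens per sentence, and just under 200,000 tokens per volume.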


Next, we'll create two subsets of the data, one for each journal.  To do so we first need to load the inventory (CSV file) that lists which file name corresponds with which journal:

In [8]:
df = pd.read_csv('data/nls-text-ladiesDebating/ladiesDebating-inventory.csv', header=None, names=['fileid', 'title'])
# Since we only have 16 files, we'll print the entire dataframe.  With larger dataframes
# you may wish to use  df.head() or df.tail() to print only the first 5 or last 5 rows
df

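| | fileid | title |
|---|---|---|
| 0 | 109857781.txt | Attempt - Volume 1 and Select writings - U.431 |
| 1 | 103655648.txt | Attempt - Volume 2 - U.431 |
| 2 | 103655649.txt | Attempt - Volume 3 - U.431 |
| 3 | 103655650.txt | Attempt - Volume 4 - U.431 |
| 4 | 103655651.txt | Attempt - Volume 5 - U.431 |
| 5 | 103655652.txt | Attempt - Volume 6 - U.431 |
| 6 | 103655653.txt | Attempt - Volume 7 - U.431 |
| 7 | 103655654.txt | Attempt - Volume 8 - U.431 |
| 8 | 103655655.txt | Attempt - Volume 9 - U.431 |
| 9 | 103655656.txt | Attempt - Volume 10 - U.431 |
| 10 | 103655658.txt | Ladies' Edinburgh Magazine - Volume 1 - U.393 |
| 11 | 103655659.txt | Ladies' Edinburgh Magazine - Volume 2 - U.393 |
| 12 | 103655660.txt | Ladies' Edinburgh Magazine - Volume 3 - U.393 |
| 13 | 103655661.txt | Ladies' Edinburgh Magazine - Volume 4 - U.393 |
| 14 | 103655662.txt | Ladies' Edinburgh Magazine - Volume 5 - U.393 |
| 15 | 103655663.txt | Ladies' Edinburgh Magazine - Volume 6 - U.393 |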


Now we can create two dictionaries of fileids and their associated journal titles, one for The Attempt and one for The Ladies' Edinburgh Magazine:

In [17]:
attempts = {}
mags = {}
for index, row in df.iterrows():
    fileid = row['fileid']
    title = row['title']
    if 'Attempt' in title:
        attempts[fileid] = title
    else: # if 'Magazine' in title:
        mags[fileid] = title

print(attempts)
print(mags)

{'109857781.txt': 'Attempt - Volume 1 and Select writings - U.431', '103655648.txt': 'Attempt - Volume 2 - U.431', '103655649.txt': 'Attempt - Volume 3 - U.431', '103655650.txt': 'Attempt - Volume 4 - U.431', '103655651.txt': 'Attempt - Volume 5 - U.431', '103655652.txt': 'Attempt - Volume 6 - U.431', '103655653.txt': 'Attempt - Volume 7 - U.431', '103655654.txt': 'Attempt - Volume 8 - U.431', '103655655.txt': 'Attempt - Volume 9 - U.431', '103655656.txt': 'Attempt - Volume 10 - U.431'}
{'103655658.txt': "Ladies' Edinburgh Magazine - Volume 1 - U.393", '103655659.txt': "Ladies' Edinburgh Magazine - Volume 2 - U.393", '103655660.txt': "Ladies' Edinburgh Magazine - Volume 3 - U.393", '103655661.txt': "Ladies' Edinburgh Magazine - Volume 4 - U.393", '103655662.txt': "Ladies' Edinburgh Magazine - Volume 5 - U.393", '103655663.txt': "Ladies' Edinburgh Magazine - Volume 6 - U.393"}
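
The loop above sends every title without "Attempt" to the magazine dictionary.  That works for this inventory, but a slightly more defensive variant checks for both journal names and collects anything unexpected separately.  The sketch below is one way to do that with the standard library.  The function name `split_by_journal`, the `unmatched` bucket and the three-row sample are assumptions made for illustration; the sample rows are copied from the output above and are not the full inventory.

```python
import csv
import io

# A few inventory rows copied from the output above (not the full file)
sample_inventory = io.StringIO(
    "109857781.txt,Attempt - Volume 1 and Select writings - U.431\n"
    "103655648.txt,Attempt - Volume 2 - U.431\n"
    "103655658.txt,Ladies' Edinburgh Magazine - Volume 1 - U.393\n"
)

def split_by_journal(rows):
    """Sort (fileid, title) pairs into per-journal dictionaries."""
    attempts, mags, unmatched = {}, {}, {}
    for fileid, title in rows:
        if "Attempt" in title:
            attempts[fileid] = title
        elif "Magazine" in title:
            mags[fileid] = title
        else:
            # Titles matching neither journal are kept visible rather than misfiled
            unmatched[fileid] = title
    return attempts, mags, unmatched

sample_attempts, sample_mags, sample_unmatched = split_by_journal(csv.reader(sample_inventory))
print(list(sample_attempts))
print(list(sample_mags))
print(sample_unmatched)
```

An empty `unmatched` dictionary confirms that every row was assigned to one of the two journals.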


For convenient access to the fileids alone, we can also create lists from the dictionaries' keys:

In [19]:
attempt_ids = list(attempts.keys())
mag_ids = list(mags.keys())
print(mag_ids)

['103655658.txt', '103655659.txt', '103655660.txt', '103655661.txt', '103655662.txt', '103655663.txt']


### 1. Data Cleaning and Standardization

[Code cells in this section will have one function each, preceded by comments as markdown above the cell to narrate the cleaning process]

In [None]:
# code goes here

In [None]:
# tokenisation

In [None]:
# lemmatisation

In [None]:
# stemming

In [None]:
# part-of-speech tagging

### 2. Summary Statistics
[Code cells in this section will have one function each, preceded with comments in a markdown cell narrating the summarization process]

#### 2.1 Frequencies and Sizes

[Narration]

In [None]:
# code goes here

#### 2.2 Uniqueness and Variety

[Narration]

In [None]:
# code goes here - most frequent words, sentence structure

### 3. Exploratory Analysis (this section will be included for 2-3 datasets)
[Code cells in this section will have one function each, preceded with comments in a markdown cell posing an exploratory research question]

#### 3.1 [exploratory research question 1]

In [2]:
# code goes here

In [None]:
# visualizations go here

#### 3.2 [exploratory research question 2]

In [2]:
# code goes here

In [3]:
# visualizations go here