# NLTK with Lewis Grassic Gibbon First Editions

**Data Source:** [National Library of Scotland Data Foundry](https://data.nls.uk/data/digitised-collections/lewis-grassic-gibbon-first-editions/)

**Code Reference:** 
National Library of Scotland. Exploring Lewis Grassic Gibbon First Editions. National Library of Scotland, 2020. https://doi.org/10.34812/gq6w-6e91

**Date:** April 6, 2022

**Course:** Text Analysis with NLTK (Week 2, Class 3); Centre for Data, Culture & Society

***

**Corpus Name:** Lewis Grassic Gibbon First Editions

**Questions:**
1. What are the most common words in the corpus?
    * How many words are in the entire corpus?
        * Create a list of all the alphabetic tokens (words)
    * Calculate the frequency distribution (`FreqDist()`)
    

2. What are the most common words in one book from the corpus?
    * Identify which files in the corpus are for which book
        * Create a list of all the alphabetic tokens (words)
    * Calculate the frequency distribution for individual files
    

3. How does the word choice of the author change from one book to another?
    * How many words are in each book (each file in the corpus)?
    * How many *unique* words are in each book?
        * Normalize (standardize) the words by casefolding
    * Lexical diversity = count of unique words / count of all words

***

## Table of Contents

I. [Preparation](#preparation)

II. [Data Cleaning](#data_cleaning)

III. [Normalization](#normalization)

IV. [Analysis](#analysis)

***

<a id="preparation"></a>
## I. Preparation

In [None]:
!pip install altair

In [None]:
# To load a CSV file with an inventory of the documents in the corpus
import pandas as pd

# To create data visualizations
import altair as alt
import matplotlib.pyplot as plt

# To perform text analysis
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.corpus import PlaintextCorpusReader
nltk.download('wordnet')
from nltk.corpus import wordnet
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.text import Text
from nltk.probability import FreqDist
from nltk.draw.dispersion import dispersion_plot as displt

import re # Regular Expressions (RegEx)

In [None]:
data_directory = "nls-text-gibbon"
wordlists = PlaintextCorpusReader(data_directory, r"\d.*", encoding="latin1")  # raw string so \d is passed to the regex engine unchanged
corpus_tokens = wordlists.words()  # method for tokenization
print(corpus_tokens[:20])

To get a sense of the data we're working with, let's create some functions that tell us how many tokens, sentences, and files are in our corpus!

In [None]:
def calculate_total_tokens(plaintext_corpus_read_lists):
    total_tokens = 0
    for fileid in plaintext_corpus_read_lists.fileids():
        total_tokens += len(plaintext_corpus_read_lists.words(fileid))
    return total_tokens

def calculate_total_sents(plaintext_corpus_read_lists):
    total_sents = 0
    for fileid in plaintext_corpus_read_lists.fileids():
        total_sents += len(plaintext_corpus_read_lists.sents(fileid))
    return total_sents

def calculate_total_files(plaintext_corpus_read_lists):
    return len(plaintext_corpus_read_lists.fileids())

def display_corpus_statistics(name, total_tokens, total_sents, total_files):
    print(f"Total words in {name}: {total_tokens}")
    print(f"Total sentences in {name}: {total_sents}")
    print(f"Total files in {name}: {total_files}")

display_corpus_statistics(
    "Lewis Grassic Gibbon Corpus",
    calculate_total_tokens(wordlists),
    calculate_total_sents(wordlists),
    calculate_total_files(wordlists)
)


In [None]:
print(type(wordlists))

In [None]:
print(wordlists.fileids())

There is also a CSV file that provides some metadata. Let's load it into a pandas DataFrame with `pd.read_csv`. The file has no header row, so set the `header` parameter to `None` and use the `names` parameter to supply the column names `'fileid'` and `'title'`.

In [8]:
# Your code here


Let's create a **dictionary**, one of the Python data types, that associates each `fileid` with each `title`.  That way we can quickly determine from our text analysis with NLTK what book we are looking at, since NLTK uses the fileids (the names of the files in our data directory, a.k.a. folder).
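
As a rough illustration of the idea, here is a minimal sketch that pairs two lists into a dictionary. The file IDs and titles below are invented placeholders, not values from the LGG inventory.

```python
# Hypothetical file IDs and titles, standing in for the two inventory columns
example_fileids = ["0001.txt", "0002.txt", "0003.txt"]
example_titles = ["First Book", "Second Book", "Third Book"]

# zip() pairs the items position by position;
# dict() turns each (key, value) pair into a dictionary entry
example_dict = dict(zip(example_fileids, example_titles))

# Look up a title by its file ID
print(example_dict["0002.txt"])
```

This only works as intended if both lists are in the same order and have the same length, which is worth checking before relying on the lookup.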

In [None]:
# obtain a list of all file IDs
fileids = "your code here"
print(f"List of file IDs:\n{fileids}\n")

# obtain a list of all titles
titles = "your code here"
print(f"List of titles:\n{titles}\n")

# create a dictionary where the keys are file IDs and the values are titles
lgg_dict = "your code here"
print(f"Dictionary of file IDs and titles:\n{lgg_dict}\n")

In [None]:
# Now we can say...
a_file_id = fileids[10]
lgg_dict[a_file_id]

In [None]:
# ...or simply...
lgg_dict['205174251.txt']

Let's create lists of all the words (alphabetic tokens) and sentences in the LGG corpus.

In [None]:

def get_words_sents(plaintext_corpus_read_lists):
    """Creates lists of all words and all sentences in a given corpus.
    I've added this docstring because docstrings are good practice,
    and I can show you one of the standard formats for Python 
    docstrings
    
    Parameters
    ----------
        plaintext_corpus_read_lists (PlaintextCorpusReader):
            corpus generated from a collection of plaintext files

    Returns
    -------
        all_words (list):
            list of all words: that is, all tokens consisting of alphabetical
            characters only
        all_sents (list)
            list of all sentences
    """
    pass # your code here

In [None]:
lgg_words, lgg_sents = get_words_sents(wordlists)
print(lgg_words[:20])
print(lgg_sents[:3])

<a id="data_cleaning"></a>
## II. Data Cleaning with RegEx and NLTK


Let's calculate the frequency distribution, or the count of occurrences of each word across the entire corpus:
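
To see what a frequency distribution holds, here is a small sketch using `collections.Counter` from the standard library as a stand-in. NLTK's `FreqDist` is a subclass of `Counter`, so `most_common()` behaves the same way. The word list is a made-up toy example.

```python
from collections import Counter

# A toy list of tokens (invented for illustration)
toy_words = ["the", "land", "the", "sea", "the", "land"]

# Count how many times each distinct token occurs
toy_counts = Counter(toy_words)

# Total number of tokens counted: the figure FreqDist reports via N()
total_tokens = sum(toy_counts.values())

# The two most frequent tokens, as (token, count) pairs
top_two = toy_counts.most_common(2)

print(total_tokens)
print(top_two)
```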

In [None]:
fdist_lgg = FreqDist(lgg_words)
print("Total words after filtering:", fdist_lgg.N())
print("50 most common words after filtering:", fdist_lgg.most_common(50))

Uh-oh! Note the 'â' and the '\x80\x99' - those don't look like they're meant to be there! The NLS Gibbon corpus is, after all, the result of scanning and running OCR (Optical Character Recognition), which is not totally reliable. Before doing our analysis, let's do some data cleaning, with ...

### Regular Expressions (RegEx)

* **WHAT? Pattern matching strings in Python**
* **WHY? To find specific words or phrases, or variations of a particular word or phrase**
    * Once found, they can be replaced, so this is useful for cleaning text with digitization errors. Optical Character Recognition (OCR) and Handwriting Recognition (HWR or HTR) technologies are imperfect, so you will find errors in digitized text corpora (unless of course they've been manually reviewed and corrected).
* **HOW? Combinations of special characters with a RegEx compiler**
    * In programming, a *compiler* translates code from one programming language to another.  In a sense, RegEx is a language that can sit on top of Python.  RegEx works with Python data types and syntax but it also has its own special characters and methods that plain Python doesn't use.
    
My favorite resource for practising and testing Regular Expressions is [Regex101.com](https://regex101.com). Also check out [Pythex.org](https://pythex.org) for the cheat sheet it provides!
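
As a quick warm-up, here is a minimal sketch of finding and replacing spelling variants with Python's `re` module. The sentence and the pattern are invented for illustration and have nothing to do with the LGG corpus.

```python
import re

# A toy sentence containing three variants of the same word
toy_text = "The colour of the sea, the color of the sky, and the colours of the land."

# "u?" makes the u optional and "s?" makes a trailing s optional
toy_pattern = re.compile(r"colou?rs?")

# findall() returns every non-overlapping match as a list of strings
matches = toy_pattern.findall(toy_text)

# sub() returns a new string with every match replaced
normalised = toy_pattern.sub("colour", toy_text)

print(matches)
print(normalised)
```

Compiling the pattern once with `re.compile` is convenient when the same pattern is reused, as it is later in this notebook.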

In [None]:
# # To use Regular Expressions (RegEx)
# import re

# # To perform text analysis
# import nltk
# from nltk.tokenize import word_tokenize, sent_tokenize
# nltk.download('punkt')
# from nltk.corpus import PlaintextCorpusReader
# nltk.download('wordnet')
# from nltk.corpus import wordnet
# nltk.download('stopwords')
# from nltk.corpus import stopwords
# from nltk.text import Text

In [None]:
lgg = nltk.Text(lgg_words)
print(lgg[:100])

In [None]:
lgg.concordance("\x80\x98")

Looks like "\x80\x98" is the tail end of the [UTF-8](https://en.wikipedia.org/wiki/UTF-8) encoding of the left single quotation mark `‘`, whose full byte sequence is `\xe2\x80\x98`.

In [None]:
lgg.concordance("â")

In addition to the â's, we can also see above some oddities with 'Â' - examples being:

- `'himÂ ¬ self'`
- `'proÂ ¬ tested'`
- `'affirÂ ¬ mation'`

In [None]:
lgg.concordance("¬")

To remove a substring (a selection of characters in a string), we can use an empty string (either `""` or `''`) as the second input for the `replace()` method. Just remember to assign the result to a variable: strings are immutable, so `replace()` returns a new string and leaves the original unchanged. If you don't store the result, your changes won't be saved!
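
Here is a small sketch of that behaviour on an invented string that imitates the oddities seen in the concordance. The exact characters and whitespace in the real corpus may differ, so treat the substring being removed as an assumption.

```python
# A made-up string imitating the line-break artefacts seen above
toy_string = "he proÂ ¬ tested to himÂ ¬ self"

# Calling replace() without storing the result changes nothing
toy_string.replace("Â ¬ ", "")
unchanged = toy_string

# Assigning the result keeps the cleaned version
toy_clean = toy_string.replace("Â ¬ ", "")

print(unchanged)
print(toy_clean)
```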

In [None]:
# .replace() must be used on a string object, not a Text object
lgg_str = wordlists.raw()
# your code here
# _________________

In [None]:
# .concordance() must be used on a Text object, not a string object
corpus_tokens = word_tokenize(lgg_str)
lgg = nltk.Text(corpus_tokens)
lgg.concordance("¬")

It worked!

Let's try using RegEx to clean the text now. Write a regular expression which finds sequences consisting of one or more alphabetic characters, followed by 'â' or 'Â'.

In [None]:
sequence = "your code here"
re_pattern = re.compile(sequence)

In [None]:
digit_errors = re_pattern.findall(lgg_str)

In [None]:
len(digit_errors)

In [None]:
unique_errors = list(set(digit_errors))
len(unique_errors)

In [None]:
print(unique_errors[:100])

Since the `â` character keeps appearing at the end of tokens, let's first re-tokenise `lgg_str`, then use `strip()` to remove them, for practice with that method!
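
Before applying it to the corpus, it may help to see exactly what `strip()` does and does not touch. This sketch uses invented tokens: the argument is treated as a set of characters, and only leading and trailing occurrences are removed.

```python
# Toy tokens with the unwanted character in different positions
toy_tokens = ["wordâ", "Âword", "wâord", "word"]

# strip("âÂ") removes any run of 'â' or 'Â' from both ends of each token,
# but leaves characters in the middle of a token alone
stripped = [token.strip("âÂ") for token in toy_tokens]

print(stripped)
```

Note that the third token is unchanged, because its 'â' is neither at the start nor at the end.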

In [None]:
# first, re-tokenise
lgg_tokens = word_tokenize(lgg_str)
print(lgg_tokens[:100])

Let's make a function that strips 'â' from our tokens

In [None]:

# Let's try it first with a sample subcorpus
subcorpus = lgg_tokens[:1000]

def strip_in_corpus(corpus, characters):
    pass # your code here

clean_subcorpus = strip_in_corpus(subcorpus, 'âÂ')
print(clean_subcorpus[:100])


I've made a function to check exactly what is being changed in the subcorpus. Has this solved our â-problem?

In [None]:
def compare_tokens(tokens1, tokens2):
    diffs = {}
    for i, pair in enumerate(zip(tokens1, tokens2)):
        a, b = pair
        if a!=b:
            diffs[i] = (a, b)
    return diffs

print(compare_tokens(subcorpus, clean_subcorpus))

It would appear not. OK, let's try making a function to show us all the tokens that contain the pattern matched by the regular expression we created earlier, `re_pattern`.

In [None]:
def corpus_find_pattern(corpus, pattern):
    pass # your code here
    

In [None]:
print(list(corpus_find_pattern(lgg_tokens, re_pattern))[:100])

Aaaaaand this was the point at which I figured out what is going on with these weird sequences. 

I did a bit of detective work on it, and it turns out these sequences are UTF-8 characters, incorrectly decoded as 'latin1', the character encoding we used when we first read the data from the directory. For instance, in UTF-8 an opening double curly quote `“` is encoded as `\xe2\x80\x9c`. This encoding uses three bytes to encode the character. The `\x` is an escape sequence that indicates the following two characters are a byte written in hexadecimal - that is, base 16. So, `e2` is 226, which in latin1 is 'â'. The bytes `\x80` and `\x9c` are not printable characters in latin1 (they decode to invisible control codes), so they show up as raw escape sequences rather than as readable text.
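
The effect can be reproduced in a few lines. The sketch below encodes an invented quoted word as UTF-8, misreads the bytes as latin1, and then reverses the mistake. It illustrates the mechanism only and is not a claim about how the corpus files were produced.

```python
# A toy string wrapped in curly double quotes (U+201C and U+201D)
original = "\u201cAye\u201d"

# Encode to UTF-8: each curly quote becomes three bytes
utf8_bytes = original.encode("utf-8")

# Decode the same bytes with the wrong codec: each byte becomes one character
misread = utf8_bytes.decode("latin1")

# Undo the mistake: back to the raw bytes, then decode correctly
repaired = misread.encode("latin1").decode("utf-8")

print(utf8_bytes)
print(repr(misread))
print(repaired)
```

The round trip works because latin1 maps every possible byte value to exactly one character, so no information is lost by the wrong decoding.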

The sensible thing to do here, the thing that we would in fact do if this was a real project, rather than an exercise for a class, would be this:

In [None]:

wordlists_utf8 = PlaintextCorpusReader(data_directory, r"\d.*", encoding="utf-8")

In [None]:
print(wordlists_utf8.words()[:50])

...And then we'd re-do the data processing we've done so far, and re-check it for any weird artefacts and OCR-errors. However, there is another point I'd like to make about this sort of data analysis, which is that, depending on what it is you are trying to find out, sometimes a time-consuming and painstaking data-cleaning isn't needed. If we want to do an analysis of Gibbon's lexicon, for instance, we might just toss out every malformed token, and do our analysis using the rest.

So, let's make a function that goes through a corpus and removes every token that contains 'â', 'Â', or any non-alphabetic character. (Note that `str.isalpha()` considers 'â' and 'Â' to be alphabetic.)
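
To check that claim about `isalpha()`, here is a small sketch on invented tokens. The helper `is_clean_word` is a hypothetical name introduced for illustration and is only one possible way to express the test.

```python
# Toy tokens: a normal word, the stray characters, punctuation, and digits
toy_tokens = ["himself", "â", "don't", "1932", "Âword"]

# isalpha() is True for any string made only of letters, accented ones included
alphabetic_flags = {token: token.isalpha() for token in toy_tokens}

def is_clean_word(token, banned="âÂ"):
    """Hypothetical helper: alphabetic and free of the banned characters."""
    return token.isalpha() and not any(ch in banned for ch in token)

clean_tokens = [token for token in toy_tokens if is_clean_word(token)]

print(alphabetic_flags)
print(clean_tokens)
```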

In [None]:
def filter_nonalphabetic(corpus):
    pass # your code here
    
lgg_alpha = filter_nonalphabetic(lgg_tokens)
print(lgg_alpha[:40])

There. Good enough.

Ta da!

<a id="normalization"></a>
## III. Normalization

A bit more preprocessing before we start our analysis.

Let's casefold to normalize so capitalized and lowercased versions of words are considered the same word:
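
For English text, `casefold()` and `lower()` usually give the same result, but `casefold()` is the more aggressive of the two. A minimal sketch on invented words:

```python
# Toy words: three capitalisations of one word, plus a German word with 'ß'
toy_words = ["The", "THE", "the", "Straße"]

lowered = [word.lower() for word in toy_words]
folded = [word.casefold() for word in toy_words]

# After casefolding, the three variants of "the" collapse into one unique word
unique_folded = set(folded)

print(lowered)
print(folded)
```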

In [None]:
def lowercase_all(words):
    pass # your code here

lgg_words_lower = lowercase_all(lgg_alpha)
print(lgg_words_lower[:20])

...and exclude stopwords using `stopwords.words(language)`:
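
The filtering logic itself can be sketched without NLTK. Below, a tiny hand-written set stands in for `stopwords.words('english')`; both the set and the word list are invented for illustration.

```python
# Hypothetical mini stopword list (the real NLTK list is much longer)
toy_stopwords = {"the", "and", "of", "was"}

toy_words = ["the", "land", "was", "red", "and", "an", "old", "kirk"]
min_len = 3

# Keep a word only if it is long enough and is not a stopword
kept = [
    word for word in toy_words
    if len(word) >= min_len and word not in toy_stopwords
]

print(kept)
```

Storing the stopwords in a `set` rather than a list makes each membership test fast, which matters when the corpus has hundreds of thousands of tokens.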

In [None]:

def remove_stopwords(corpus: list[str], min_len: int, language: str) -> list[str]:
    """Iterates through a list of words and removes all words of length less
    than min_len, and all words in `stopwords.words(language)`

    Parameters
    ----------
        corpus (list of str):
            the list of words to be filtered 
        min_len (int):
            minimum length for words: words shorter than this should be 
            filtered out
        language (str):
            name of corpus language, in lower case. This is used to identify
            the correct stopwords list
    """
    pass # your code here

filtered_lower = remove_stopwords(lgg_words_lower, 3, 'english')
print(len(lgg_words_lower))
print(len(filtered_lower))

Before we go on to analysis, here's a function that chains together all the processing we've done, in case we need to do it again for any reason...

In [None]:
def process_corpus(wordlist, min_len=3, language='english'):
    processed = filter_nonalphabetic(wordlist)
    processed = lowercase_all(processed)
    return remove_stopwords(processed, min_len, language)

wordlists2 = PlaintextCorpusReader(data_directory, r"\d.*", encoding="latin1")
corpus_tokens2, _ = get_words_sents(wordlists2)
processed_corpus = process_corpus(corpus_tokens2)


In [None]:
print(processed_corpus[60:100])

<a id="analysis"></a>
## IV. Analysis

Let's calculate the frequency distribution again, now that we have cleaned and normalised our data:

In [None]:
fdist_filtered_lower = FreqDist(processed_corpus)
print("Total words after filtering:", fdist_filtered_lower.N())
print("50 most common words after filtering:", fdist_filtered_lower.most_common(50))

Happily, I don't see anything there that looks dubious.

Let's plot the frequency distribution of the *n* most common words:

In [None]:
def freq_plot(corpus, n):
    # Build the distribution from the corpus passed in, not from a global variable
    fdist = FreqDist(corpus)
    plt.figure(figsize = (20, 8))
    plt.rc('font', size=12)
    fdist.plot(n, title=f'Frequency Distribution for {n} Most Common Tokens in the Standardised LGG Dataset (excluding stop words)')
    plt.show()

In [None]:
freq_plot(processed_corpus, 20) # Try increasing or decreasing this number to view more or fewer tokens in the visualization

In [None]:
# INPUT: wordlists and the fileid of the wordlist to be tokenised
# OUTPUT: a list of word tokens (in String format) for the inputted fileid
def get_words(plaintext_corpus_read_lists, fileid):
    file_words = process_corpus(plaintext_corpus_read_lists.words(fileid))
    str_words = [str(word) for word in file_words]    
    return str_words

In [None]:
def get_all_words(fileids):
    words_by_file = []
    for file in fileids:
        words_by_file += [get_words(wordlists, file)]
    return words_by_file

words_by_file = get_all_words(fileids)

In [None]:
# INPUT: a list of words in String format
# OUTPUT: the number of unique words divided by
#         the total words in the inputted list
def lexical_diversity(str_words_list):
    return len(set(str_words_list))/len(str_words_list)
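
As a worked example of the ratio, here is a sketch on an invented six-word list. The guard against an empty list is an addition for illustration; `safe_lexical_diversity` is a hypothetical name and not part of the notebook.

```python
# Toy list: 6 tokens in total, 3 of them unique
toy_words = ["the", "land", "the", "sea", "the", "land"]

unique_words = set(toy_words)
toy_lexdiv = len(unique_words) / len(toy_words)

def safe_lexical_diversity(words):
    """Hypothetical variant that returns 0.0 for an empty list
    instead of raising ZeroDivisionError."""
    if not words:
        return 0.0
    return len(set(words)) / len(words)

print(toy_lexdiv)
print(safe_lexical_diversity([]))
```

A score close to 1 means almost every word is used only once, while a score close to 0 means a small vocabulary is repeated many times.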

Add to the inventory...

In [None]:
lexdiv_by_file = []
for words in words_by_file:
    lexdiv_by_file += [lexical_diversity(words)]

df['lexicaldiversity'] = lexdiv_by_file
df_lexdiv = df.sort_values(by=['lexicaldiversity', 'title'], inplace=False, ascending=True)
df_lexdiv

For the entire corpus:

In [None]:
lexical_diversity(processed_corpus)

The table isn't bad but charts can make it easier to compare calculations more quickly, so let's visualize the lexical diversity scores!

In [None]:
sorted_titles = list(df_lexdiv['title'])
sorted_lexdiv = list(df_lexdiv['lexicaldiversity'])
source = pd.DataFrame({
    'Title': sorted_titles,
    'Lexdiv': sorted_lexdiv
})

alt.Chart(source, title="Lexical Diversity of Gibbon's Works").mark_bar(size=30).encode(
    alt.X('Title', axis=alt.Axis(title='Lewis Grassic Gibbon Work'), type='nominal', sort=None),  # If sort unspecified, chart will sort x-axis values alphabetically
    alt.Y('Lexdiv', axis=alt.Axis(title='Lexical Diversity')),
    alt.Order(
      # Sort the segments of the bars by this field
      'Lexdiv',
      sort='ascending'
    )
).configure_axis(
    grid=False
).configure_view(
    strokeWidth=0
).properties(
    width=500
)

#     alt.Y('Lexdiv', axis=alt.Axis(format='%', title='Lexical Diversity')),

Could we sort these chronologically?

We can add information available from the [digital.nls.uk](https://data.nls.uk/wp-content/uploads/2020/10/digital.nls.uk) website about the publication dates to our inventory...

In [None]:
published = [1932, 1933, 1933, 1934, 1933, 1932, 1934, 1934, 1934, 1931, 1932, 1934, 1932, 1930, 1931, 1928]
df_lexdiv['published'] = published
df_pub = df_lexdiv.sort_values(by=['published', 'title'], inplace=False, ascending=True)
df_pub.head()

Then we can recreate the bar chart with the bars (books) sorted by year of publication:

In [None]:
sorted_titles = list(df_pub['title'])
sorted_lexdiv = list(df_pub['lexicaldiversity'])
sorted_published = list(df_pub['published'])
source = pd.DataFrame({
    'Title': sorted_titles,
    'Lexdiv': sorted_lexdiv,
    'Published': sorted_published
})

alt.Chart(source, title="Lexical Diversity of Gibbon's Works").mark_bar(size=30).encode(
    alt.X('Title', axis=alt.Axis(title='Title of Lewis Grassic Gibbon Work'), type='nominal', sort=None),  # If sort unspecified, chart will sort x-axis values alphabetically
    alt.Y('Lexdiv', axis=alt.Axis(title='Lexical Diversity')),
    alt.Order(
      # Sort the segments of the bars by this field
      'Lexdiv',
      sort='descending'
    ),
    color=alt.Color('Published:O', legend = alt.Legend(title='Date Published')),
    tooltip='Title:N'
).configure_axis(
    grid=False,
    labelFontSize=12,
    titleFontSize=12,
    labelAngle=-45
).configure_title(
    fontSize=14,
).configure_view(
    strokeWidth=0
).properties(
    width=500
)

To get the lexical diversity per year...

In [None]:
# dictionary associating works with year they were published
pub_yr = {1928: [], 1930: [], 1931: [], 1932: [], 1933: [], 1934: []}
for index, row in df_pub.iterrows():
    # Look up values by column name rather than by position
    pub_yr[row['published']].append(row['fileid'])
print(pub_yr)

In [None]:
lexdiv_by_year = []
for key,value in pub_yr.items():
    lexdiv_by_file = []
    for fileid in value:
        file_words = wordlists.words(fileid)
        str_words = [str(w.lower()) for w in file_words if w.isalpha()]
        lexdiv_by_file += [lexical_diversity(str_words)]
    lexdiv_by_year += [sum(lexdiv_by_file)/len(lexdiv_by_file)]
print(lexdiv_by_year)

In [None]:
pub_years = [1928, 1930, 1931, 1932, 1933, 1934]
pub_lex = dict(zip(pub_years, lexdiv_by_year))
pub_lex

Now we can visualize the average lexical diversity score for each year Gibbon published in:

In [None]:


source = pd.DataFrame({
    'Year': pub_years,
    'Average Lexical Diversity': lexdiv_by_year
})

alt.Chart(source, title="Average Yearly Lexical Diversity of Gibbon First Editions").mark_bar(size=60).encode(
    alt.X('Year', axis=alt.Axis(title='Year of Publication'), type='ordinal'),
    alt.Y('Average Lexical Diversity', axis=alt.Axis(title='Average Lexical Diversity'))
).configure_axis(
    grid=False,
    labelFontSize=12,
    titleFontSize=12,
    labelAngle=0
).configure_title(
    fontSize=14,
).configure_view(
    strokeWidth=0
).properties(
    width=365
)

So Gibbon's lexical diversity does decrease over time, excepting a small increase in the last year he published, 1934!