# Lewis Grassic Gibbon First Editions as Data
Created August-September 2020 for the National Library of Scotland's Data Foundry by Lucy Havens, Digital Library Research Intern

### About the Lewis Grassic Gibbon First Editions Dataset
Lewis Grassic Gibbon is an early 20th century Scottish novelist who also published under his birth name, James Leslie Mitchell.  He was a prolific writer during the short period (5 years) in which he published fiction and non-fiction, and the NLS collection contains first editions of all his published books.  Gibbon's stories often featured strong central female characters, unusual for an early 20th century writer.  Gibbon's literary influence continues to be felt today: his book *Sunset Song* was voted Scotland's favorite novel in 2016, and contemporary Scottish writers such as Ali Smith and A.L. Kennedy have noted Gibbon's influence on their own writing.

* Data format: digitised text
* Data creation process: Optical Character Recognition (OCR)
* Data source: https://data.nls.uk/data/digitised-collections/lewis-grassic-gibbon-first-editions/
***
### Table of Contents
0. [Preparation](#0.-Preparation)
1. [Data Cleaning and Standardisation](#1.-Data-Cleaning-and-Standardisation)
2. [Summary Statistics](#2.-Summary-Statistics)
3. [Exploratory Analysis](#3.-Exploratory-Analysis)
***

### 0. Preparation
Import libraries to use for cleaning, summarising and exploring the data:

In [1]:
# To prevent SSL certificate failure
import os, ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
    getattr(ssl, '_create_unverified_context', None)):
    ssl._create_default_https_context = ssl._create_unverified_context

# Libraries for data loading
import pandas as pd
import numpy as np
import string
import re
from collections import defaultdict

# Libraries for visualization
import altair as alt
import matplotlib.pyplot as plt

# Libraries for text analysis
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
nltk.download('wordnet')
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.text import Text
from nltk.stem.porter import PorterStemmer
from nltk.probability import FreqDist
nltk.download('averaged_perceptron_tagger')
from nltk.tag import pos_tag
nltk.download('tagsets')  # part of speech tags
from nltk.draw.dispersion import dispersion_plot as displt

[nltk_data] Downloading package punkt to /Users/lucy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/lucy/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/lucy/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package tagsets to /Users/lucy/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


The nls-text-gibbon folder (downloadable as *Just the text* data from the website at the top of this notebook) contains TXT files of digitized text, with numerical names, as well as a CSV inventory file and a TXT ReadMe file.  Load only the TXT files of digitized text and **tokenise** the text (which splits the text into a list of its individual words and punctuation in the order they appear):

In [2]:
corpus_folder = 'data/nls-text-gibbon/'
wordlists = PlaintextCorpusReader(corpus_folder, r'\d.*', encoding='latin1')  # raw string so \d is read as a regex digit class
corpus_tokens = wordlists.words()
print(corpus_tokens[100:115])

['BY', 'ROBERT', 'MACLEHOSE', 'AND', 'CO', '.', 'LTD', '.', 'THE', 'UNIVERSITY', 'PRESS', ',', 'GLASGOW', 'ALL', 'RIGHTS']
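
To make the idea of tokenisation concrete, here is a minimal sketch that uses only Python's built-in `re` module.  The pattern is, to my understanding, the one used by NLTK's `WordPunctTokenizer` (the default word tokeniser of `PlaintextCorpusReader`); the `simple_tokenise` function and the sample string are illustrative assumptions rather than part of the LGG workflow.

```python
import re

# Runs of word characters, OR runs of characters that are neither
# word characters nor whitespace (i.e. punctuation)
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def simple_tokenise(text):
    """Split text into word and punctuation tokens, in order of appearance."""
    return TOKEN_PATTERN.findall(text)


# Hypothetical sample modelled on the imprint line printed above
sample = "BY ROBERT MACLEHOSE AND CO. LTD. THE UNIVERSITY PRESS, GLASGOW"
sample_tokens = simple_tokenise(sample)
print(sample_tokens)
```

Note that each full stop and comma becomes its own token, which is why the token list above contains `'.'` and `','` as separate items.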


*Note: If you'd like to see how to specify a single TXT file to load as data, check out the Jupyter Notebook for the Britain and UK Handbooks!*

It's hard to get a sense of how accurately the text has been digitised from this list of 15 tokens, so let's look at one of these words in context.  To see phrases in which "Scots" is used, we can use the concordance() method:

In [3]:
t = Text(corpus_tokens)
t.concordance("Scots")

Displaying 25 of 72 matches:
UTHOR Novels forming the trilogy , A Scots Quair Part I . Sunset Song ( 1932 ) 
igins of that son . He had the usual Scots passion for educationâ  that passi
Â ¬ thing of the same quality , this Scots desire and pride in education , that
hood . Because second - sight in the Scots was a fiction not yet invented in th
us these were the lesser gods of the Scots pantheon , witches and wizards and k
 the ambition and intention of every Scots farmer to produce at least one son w
ad litde relevance to religion . The Scots are not a religious people : long be
is the function and privilege of the Scots minister : he is less a priest than 
e tenebrous imaginings of fear . The Scots ballads could have widened his geogr
 and climes , where beasts were un - Scots and untamed , and in blue waters str
racteristic of one great division of Scots as Burns was of another . He saw the
scular , with the solid frame of the Scots peasant redeemed of its usual squatn
 appreciati

There are some mistakes but not too many!

Note what the ``concordance`` output shows: the word "Scots" appears with different meanings, sometimes referring to the language, other times referring to the people.  NLTK has a tagging method that identifies the parts of speech in sentences, so if we wanted to focus on the language Scots, we could look for instances of "Scots" being used as a noun.  If we wanted to focus on the people, we could look for instances of "Scots" being used as an adjective.  This method wouldn't return perfect results, though: "Scots" is also a noun when it names the people (as in "The Scots are not a religious people"), and an adjective when it describes the language.  We could improve our results with additional checks, such as looking for instances of "Scots" being used as an adjective before the word "dialects."
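
As a rough sketch of filtering by part of speech, the example below searches a list of (token, tag) pairs for "Scots" with a given tag and reports the word that follows it.  The tags here are assigned by hand for illustration (they are assumptions, not real tagger output), and `find_tagged` is a hypothetical helper, not an NLTK function.

```python
# Hand-tagged (token, tag) pairs standing in for part-of-speech tagger output;
# the tags are illustrative assumptions
tagged_tokens = [
    ("The", "DT"), ("Scots", "NNPS"), ("are", "VBP"), ("not", "RB"),
    ("a", "DT"), ("religious", "JJ"), ("people", "NN"), (".", "."),
    ("The", "DT"), ("Scots", "JJ"), ("ballads", "NNS"), ("could", "MD"),
    ("have", "VB"), ("widened", "VBN"), ("his", "PRP$"),
    ("geography", "NN"), (".", "."),
]


def find_tagged(tagged, word, tag_prefix):
    """Return (index, following token) for each use of `word`
    whose tag starts with `tag_prefix`."""
    matches = []
    for i, (token, tag) in enumerate(tagged):
        if token == word and tag.startswith(tag_prefix):
            following = tagged[i + 1][0] if i + 1 < len(tagged) else None
            matches.append((i, following))
    return matches


noun_uses = find_tagged(tagged_tokens, "Scots", "NN")       # any noun tag
adjective_uses = find_tagged(tagged_tokens, "Scots", "JJ")  # adjective
print("Noun uses:", noun_uses)
print("Adjective uses:", adjective_uses)
```

Returning the following token makes it easy to add a second check, such as keeping only adjective uses that come before a particular word.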

We'll wait to dive into this sort of text analysis until a bit later, though!  

#### 0.1 Dataset Size
First, let's get a sense of how much data (in this case, text) we have in the *Lewis Grassic Gibbon First Editions* (LGG) dataset:

In [4]:
def corpusStatistics(plaintext_corpus_read_lists):
    total_chars = 0
    total_tokens = 0
    total_sents = 0
    total_files = 0
    for fileid in plaintext_corpus_read_lists.fileids():
        total_chars += len(plaintext_corpus_read_lists.raw(fileid))
        total_tokens += len(plaintext_corpus_read_lists.words(fileid))
        total_sents += len(plaintext_corpus_read_lists.sents(fileid))
        total_files += 1
    print("Estimated total...")
    print("  Characters in LGG Data:", total_chars)
    print("  Tokens in LGG Data:", total_tokens)
    print("  Sentences in LGG Data:", total_sents)
    print("  Files in LGG Data:", total_files)

corpusStatistics(wordlists)

Estimated total...
  Characters in LGG Data: 7068578
  Tokens in LGG Data: 1506095
  Sentences in LGG Data: 70821
  Files in LGG Data: 16


Note that I've printed ``Tokens`` rather than words, though the NLTK method used to count those was ``.words()``.  This is because words in NLTK include punctuation marks and digits, in addition to alphabetic words.
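
The difference between tokens and alphabetic words can be seen on a small scale.  The sketch below uses a hypothetical list of tokens (not drawn from the corpus reader) and Python's built-in string methods to separate alphabetic words, digits and punctuation.

```python
# Hypothetical tokens, similar in kind to those returned by .words()
sample_tokens = ["CO", ".", "LTD", ".", "Sunset", "Song", "(", "1932", ")"]

# Separate the tokens by type
alphabetic = [tok for tok in sample_tokens if tok.isalpha()]
digits = [tok for tok in sample_tokens if tok.isdigit()]
other = [tok for tok in sample_tokens
         if not tok.isalpha() and not tok.isdigit()]

print("All tokens:", len(sample_tokens))
print("Alphabetic words:", len(alphabetic))
print("Digits:", len(digits))
print("Punctuation and other:", len(other))
```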

#### 0.2 Identifying Subsets of the Data
Next, we'll work out which file corresponds to which of Gibbon's books, so that the data can later be divided into subsets by title.  To do so we first need to load the inventory (CSV file) that lists which file name corresponds with which book.  When you open the CSV file in Microsoft Excel or a text editor, you can see that there are no column names.  The Python library [Pandas](https://pandas.pydata.org/docs/), which reads CSV files, calls the row of column names the ``header``.  When we use Pandas to read the inventory, we'll create our own header by specifying that the CSV file's ``header`` is ``None`` and providing a list of column ``names``.

When Pandas (abbreviated ``pd`` when we loaded libraries in the first cell of this notebook) reads a CSV file, it creates a table called a **dataframe** from that data.  Let's see what the Gibbon inventory dataframe looks like:

In [5]:
df = pd.read_csv('data/nls-text-gibbon/gibbon-inventory.csv', header=None, names=['fileid', 'title'])
df

Unnamed: 0,fileid,title
0,205174241.txt,Niger - R.176.i
1,205174242.txt,Thirteenth disciple - Vts.137.d
2,205174243.txt,Three go back - Vts.152.f.22
3,205174244.txt,Calends of Cairo - Vts.153.c.16
4,205174245.txt,Lost trumpet - Vts.143.j.8
5,205174246.txt,Image and superscription - Vts.118.l.16
6,205174247.txt,Spartacus - Vts.6.k.19
7,205174248.txt,"Persian dawns, Egyptian nights - Vts.148.d.8"
8,205174249.txt,Scots quair - Cloud howe - NF.523.b.30
9,205174250.txt,Scots quair - Grey granite - NF.523.b.31


Since we only have 16 files, we returned the entire dataframe above.  With larger dataframes you may wish to use  ``df.head()`` or ``df.tail()`` to print only the first 5 or last 5 rows of your CSV file (both of which will include the column names in the dataframe's header).
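
To see why ``header=None`` matters, here is a small sketch that reads the same two-line CSV text twice.  The CSV text is an in-memory stand-in for the inventory file (an assumption for illustration), read through `io.StringIO` so no file is needed.

```python
import io

import pandas as pd

# In-memory stand-in for a headerless inventory file (two rows only)
csv_text = (
    "205174241.txt,Niger - R.176.i\n"
    "205174242.txt,Thirteenth disciple - Vts.137.d\n"
)

# Default behaviour: Pandas treats the first row as the header
inferred = pd.read_csv(io.StringIO(csv_text))

# With header=None and names, every row is kept as data
named = pd.read_csv(io.StringIO(csv_text), header=None,
                    names=["fileid", "title"])

print("Rows with inferred header:", len(inferred))
print("Rows with header=None:", len(named))
print(named.head(1))
```

Without ``header=None``, the first book would be used as the column names and would be missing from the data.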

Now that we've created a dataframe, if we want to determine the title of a Gibbon work based on its file ID, we can use the following code:

In [None]:
# obtain a list of all file IDs
fileids = list(df["fileid"])
print("List of file IDs:\n", fileids)
print()

# obtain a list of all titles
titles = list(df["title"])
print("List of titles:\n", titles)
print()

# create a dictionary where the keys are file IDs and the values are titles
lgg_dict = dict(zip(fileids, titles))
print("Dictionary of file IDs and titles:\n", lgg_dict)
print()

# pick a file ID by its index number
a_file_id = fileids[10]

# get the title corresponding with the file ID in the dataframe
print("The title for the file ID at index 10:\n", lgg_dict[a_file_id])
print()

NLTK's corpus reader (the ``wordlists`` variable we created) organises its tokens by file ID, so it's useful to be able to match the file IDs with their book titles!

### 1. Data Cleaning and Standardisation

#### 1.1 Tokenisation
Variables that store the word tokens and sentence tokens in our dataset will be useful for future analysis.  Let's create those now:

In [8]:
def getWordsSents(plaintext_corpus_read_lists):
    all_words = []
    all_sents = []
    for fileid in plaintext_corpus_read_lists.fileids():
        
        file_words = plaintext_corpus_read_lists.words(fileid)
        all_words += [str(word) for word in file_words if word.isalpha()]
        
        file_sents = sent_tokenize(plaintext_corpus_read_lists.raw(fileid))  #plaintext_corpus_read_lists.sents(fileid)
        all_sents += [str(sent) for sent in file_sents]
        
    return all_words, all_sents
        
lgg_words, lgg_sents = getWordsSents(wordlists)

For some types of analysis, such as identifying people and places named in Gibbon's works, maintaining the original capitalization is important.  For other types of analysis, such as analysing the vocabulary of Gibbon's works, standardising words by making them lowercase is important.  Let's create a lowercase list of words in the LGG dataset:

In [9]:
lgg_words_lower = [word.lower() for word in lgg_words]

In [10]:
print(lgg_words_lower[0:20])
print(lgg_words[0:20])

['r', 'o', 'first', 'journey', 'national', 'library', 'of', 'scotland', 'â', 'timbuctoo', 'niger', 'by', 'the', 'same', 'author', 'novels', 'forming', 'the', 'trilogy', 'a']
['R', 'o', 'First', 'journey', 'National', 'Library', 'of', 'Scotland', 'â', 'TIMBUCTOO', 'NIGER', 'BY', 'THE', 'SAME', 'AUTHOR', 'Novels', 'forming', 'the', 'trilogy', 'A']


Perfect!
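
As a small illustration of why lowercasing matters for vocabulary analysis, the sketch below counts the unique words in a hypothetical word list (not taken from the LGG data) before and after lowercasing.

```python
# Hypothetical word list with the same words in different capitalisations
words = ["The", "the", "THE", "Scots", "scots", "quair"]

# Case-sensitive vocabulary: each capitalisation counts as a different word
vocab_original = set(words)

# Lowercased vocabulary: capitalisation variants are merged
vocab_lower = set(word.lower() for word in words)

print("Unique words (original case):", len(vocab_original))
print("Unique words (lowercase):", len(vocab_lower))
```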

#### 1.2 Reducing to Root Forms
In addition to tokenisation, **stemming** is a method of standardising, or "normalising," text.  Stemming reduces words to their root form by removing suffixes.  For example, the word "troubling" has a stem of "troubl".  NLTK provides several stemmers that use different algorithms to determine what the root of a word is; we'll look at two of them, the Porter Stemmer and the Lancaster Stemmer.

The stemming algorithms below can take several minutes to run, so two are provided with one commented out (its lines begin with ``#``) so it won't run.  If you'd like to see how the stemming algorithms differ, uncomment the lines by highlighting them and pressing ``cmd`` + ``/`` (``ctrl`` + ``/`` on Windows and Linux).

First, though, let's see what stems of LGG data look like with the Porter Stemmer:

In [None]:
# Stem the text (reduce words to their root, whether or not the root is a word itself)
porter = nltk.PorterStemmer()
porter_stemmed = [porter.stem(t) for t in lgg_words_lower if t.isalpha()]  # only include alphabetic tokens
print(porter_stemmed[500:600])

In [11]:
# lancaster = nltk.LancasterStemmer()
# lancaster_stemmed = [lancaster.stem(t) for t in lgg_words_lower if t.isalpha()] # only include alphabetic tokens
# print(lancaster_stemmed[500:600])

<div class="alert alert-block alert-info">
<b>Try It!</b> Uncomment the lines of code in the cell above by removing the `#` before each line to see how a different stemming algorithm works: the Lancaster Stemmer.  What differences do you notice in the sample of stems that are returned?
</div>
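
For a quicker comparison that doesn't require stemming the whole dataset, the sketch below runs both stemmers on a handful of words.  The word list is a hypothetical sample chosen for illustration; neither stemmer needs any additional NLTK data to be downloaded.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Hypothetical sample of words to compare
sample_words = ["troubling", "troubled", "farmer", "education", "imaginings"]

# Map each word to its stem under each algorithm
porter_stems = {word: porter.stem(word) for word in sample_words}
lancaster_stems = {word: lancaster.stem(word) for word in sample_words}

# Print the two stems for each word side by side
for word in sample_words:
    print(f"{word:12} porter: {porter_stems[word]:12} "
          f"lancaster: {lancaster_stems[word]}")
```

Printing the stems side by side makes it easier to spot where the two algorithms disagree.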

Another approach to reducing words to their root is to **lemmatise** tokens.  NLTK's WordNet Lemmatizer reduces a token to its root *only* if the reduction of the token results in a word that's recognized as an English word in WordNet.  Here's what that looks like:

In [None]:
# Lemmatize the text (reduce words to their root ONLY if the root is considered a word in WordNet)
wnl = nltk.WordNetLemmatizer()
lemmatised = [wnl.lemmatize(t) for t in lgg_words_lower if t.isalpha()]  # only include alphabetic tokens
print(lemmatised[500:600])

#### 1.3 Parts of Speech
To study the linguistic style of text, analysing the **parts of speech** and their patterns in sentences can be useful.  NLTK has a method for tagging tokens with a part of speech in a sentence.  Let's do that too:

In [None]:
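# Tag parts of speech in sentences
# (pos_tag expects a list of tokens, so each sentence string is tokenised into words first)
pos_tagged = [nltk.pos_tag(word_tokenize(sent)) for sent in lgg_sents]
print(pos_tagged[:1])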

NLTK uses abbreviations to identify parts of speech, such as:
* `NN` = singular noun, `NNS` = plural noun, `NNP` = singular proper noun, `NNPS` = plural proper noun
* `IN` = preposition
* `TO` = preposition or infinitive marker
* `DT` = determiner
* `CC` = coordinating conjunction
* `JJ` = adjective
* `VB` = verb
* `RB` = adverb

More abbreviations are explained [here](https://www.learntek.org/blog/categorizing-pos-tagging-nltk-python/) or can be queried with `nltk.help.upenn_tagset('TAG')` (replace `TAG` with an NLTK part of speech abbreviation).
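
Once a sentence has been tagged, the tags themselves can be counted to get a rough profile of its structure.  The sketch below uses a hand-tagged sentence (the tags are assumptions for illustration, not real tagger output) and a small lookup table, `TAG_NAMES`, built from some of the abbreviations listed above.

```python
from collections import Counter

# Readable names for a few of the abbreviations listed above
TAG_NAMES = {
    "NN": "singular noun",
    "NNS": "plural noun",
    "DT": "determiner",
    "IN": "preposition",
    "JJ": "adjective",
    "VB": "verb",
}

# Hand-tagged sentence standing in for tagger output (illustrative tags)
tagged_sentence = [
    ("the", "DT"), ("solid", "JJ"), ("frame", "NN"), ("of", "IN"),
    ("the", "DT"), ("Scots", "JJ"), ("peasant", "NN"),
]

# Count how often each tag occurs
tag_counts = Counter(tag for _, tag in tagged_sentence)

# Report the counts, most frequent tag first
for tag, count in tag_counts.most_common():
    print(f"{tag:4} {TAG_NAMES.get(tag, 'other'):16} {count}")
```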

### 2. Summary Statistics
[Code cells in this section will have one function each, preceded with comments in a markdown cell narrating the summarization process]

#### 2.1 Frequencies and Sizes

[Narration]

In [None]:
# code goes here

#### 2.2 Uniqueness and Variety

[Narration]

In [None]:
# code goes here - most frequent words, sentence structure

### 3. Exploratory Analysis 
(this section will be included for 2-3 datasets)
[Code cells in this section will have one function each, preceded with comments in a markdown cell posing an exploratory research question]

#### 3.1 [exploratory research question 1]

In [2]:
# code goes here

In [None]:
# visualizations go here

#### 3.2 [exploratory research question 2]

In [2]:
# code goes here

In [7]:
# visualizations go here