# Lewis Grassic Gibbon First Editions as Data
Created August-September 2020 for the National Library of Scotland's Data Foundry by Lucy Havens, Digital Library Research Intern

### About the Lewis Grassic Gibbon First Editions Dataset
Lewis Grassic Gibbon was an early 20th-century Scottish novelist who also published under his birth name, James Leslie Mitchell.  He was a prolific writer during the short period (5 years) in which he published fiction and non-fiction, and the NLS collection contains first editions of all his published books.  Gibbon's stories often featured strong central female characters, unusual for an early 20th-century writer.  Gibbon's literary influence continues to be felt today: his book *Sunset Song* was voted Scotland's favorite novel in 2016, and contemporary Scottish writers such as Ali Smith and A.L. Kennedy have noted Gibbon's influence on their own writing.

* Data format: digitised text
* Data creation process: Optical Character Recognition (OCR)
* Data source: https://data.nls.uk/data/digitised-collections/lewis-grassic-gibbon-first-editions/
***
### Table of Contents
0. [Preparation](#0.-Preparation)
1. [Data Cleaning and Standardisation](#1.-Data-Cleaning-and-Standardisation)
2. [Summary Statistics](#2.-Summary-Statistics)
3. [Exploratory Analysis](#3.-Exploratory-Analysis)
***

### 0. Preparation
Import libraries to use for cleaning, summarising and exploring the data:

In [1]:
# To prevent SSL certificate failure
import os, ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
    getattr(ssl, '_create_unverified_context', None)):
    ssl._create_default_https_context = ssl._create_unverified_context

# Libraries for data loading
import pandas as pd
import numpy as np
import string
import re
from collections import defaultdict

# Libraries for visualization
import altair as alt
import matplotlib.pyplot as plt

# Libraries for text analysis
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
nltk.download('wordnet')
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.text import Text
from nltk.stem.porter import PorterStemmer
from nltk.probability import FreqDist
nltk.download('averaged_perceptron_tagger')
from nltk.tag import pos_tag
nltk.download('tagsets')  # part of speech tags
from nltk.draw.dispersion import dispersion_plot as displt

[nltk_data] Downloading package punkt to /Users/lucy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/lucy/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/lucy/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package tagsets to /Users/lucy/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


The nls-text-gibbon folder (downloadable as *Just the text* data from the website at the top of this notebook) contains TXT files of digitized text, with numerical names, as well as a CSV inventory file and a TXT ReadMe file.  Load only the TXT files of digitized text and **tokenise** the text (which splits the text into a list of its individual words and punctuation in the order they appear):

In [2]:
corpus_folder = 'data/nls-text-gibbon/'
wordlists = PlaintextCorpusReader(corpus_folder, r'\d.*', encoding='latin1')  # raw string so \d is read as a regex digit class
corpus_tokens = wordlists.words()
print(corpus_tokens[100:115])

['BY', 'ROBERT', 'MACLEHOSE', 'AND', 'CO', '.', 'LTD', '.', 'THE', 'UNIVERSITY', 'PRESS', ',', 'GLASGOW', 'ALL', 'RIGHTS']
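
To make the idea of tokenisation more concrete, here is a minimal sketch that splits a string into runs of word characters and runs of punctuation using only Python's ``re`` module.  The function name ``simple_tokenise`` is made up for this illustration, and the pattern is a simplified stand-in for what NLTK does when it loads the corpus, not NLTK's own code:

```python
import re

def simple_tokenise(text):
    """Split text into runs of word characters and runs of punctuation."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s]+ matches a run of characters that are neither word characters nor whitespace
    return re.findall(r"\w+|[^\w\s]+", text)

# A sample string assembled from the tokens printed above
sample = "BY ROBERT MACLEHOSE AND CO. LTD. THE UNIVERSITY PRESS, GLASGOW"
tokens = simple_tokenise(sample)
print(tokens)
```

Notice that the full stops and the comma become tokens of their own, just as they do in the list NLTK printed.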


*Note: If you'd like to see how to specify a single TXT file to load as data, check out the Jupyter Notebook for the Britain and UK Handbooks!*

It's hard to get a sense of how accurately the text has been digitised from this list of 15 tokens, so let's look at one of these words in context.  To see phrases in which "Scots" is used, we can use the concordance() method:

In [3]:
t = Text(corpus_tokens)
t.concordance("Scots")

Displaying 25 of 72 matches:
UTHOR Novels forming the trilogy , A Scots Quair Part I . Sunset Song ( 1932 ) 
igins of that son . He had the usual Scots passion for educationâ  that passi
Â ¬ thing of the same quality , this Scots desire and pride in education , that
hood . Because second - sight in the Scots was a fiction not yet invented in th
us these were the lesser gods of the Scots pantheon , witches and wizards and k
 the ambition and intention of every Scots farmer to produce at least one son w
ad litde relevance to religion . The Scots are not a religious people : long be
is the function and privilege of the Scots minister : he is less a priest than 
e tenebrous imaginings of fear . The Scots ballads could have widened his geogr
 and climes , where beasts were un - Scots and untamed , and in blue waters str
racteristic of one great division of Scots as Burns was of another . He saw the
scular , with the solid frame of the Scots peasant redeemed of its usual squatn
 appreciati

There are some mistakes but not too many!

Notice what the ``concordance`` output reveals: the word "Scots" appears with different meanings, sometimes referring to the language and other times referring to the people.  NLTK has a tagging method that identifies the parts of speech in sentences, so if we wanted to focus on the language Scots, we could look for instances of "Scots" being used as a noun.  If we wanted to focus on the people, we could look for instances of "Scots" being used as an adjective.  This approach wouldn't return perfect results, though.  We could refine it by, for example, checking for instances of "Scots" used as an adjective before the word "dialects."
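
As a small preview of that idea, the sketch below separates occurrences of "Scots" by their part-of-speech tag.  It assumes the tokens have already been tagged in the ``(token, tag)`` format that ``nltk.pos_tag`` returns.  The tags here were written by hand for illustration and are not real tagger output, and ``split_by_tag`` is a hypothetical helper name:

```python
# Hand-tagged tokens in the (token, tag) format that nltk.pos_tag returns.
# The tags are illustrative assumptions, NOT real tagger output.
tagged = [
    ("The", "DT"), ("Scots", "NNPS"), ("are", "VBP"), ("not", "RB"),
    ("a", "DT"), ("religious", "JJ"), ("people", "NNS"), (".", "."),
    ("the", "DT"), ("Scots", "JJ"), ("minister", "NN"), (".", "."),
]

def split_by_tag(tagged_tokens, target="Scots"):
    """Return the positions where `target` is tagged as a noun and as an adjective."""
    as_noun, as_adjective = [], []
    for position, (token, tag) in enumerate(tagged_tokens):
        if token != target:
            continue
        if tag.startswith("NN"):    # noun tags begin with NN
            as_noun.append(position)
        elif tag.startswith("JJ"):  # adjective tags begin with JJ
            as_adjective.append(position)
    return as_noun, as_adjective

noun_positions, adjective_positions = split_by_tag(tagged)
print("Tagged as noun at:", noun_positions)
print("Tagged as adjective at:", adjective_positions)
```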

We'll wait to dive into this sort of text analysis until a bit later, though!  

#### 0.1 Dataset Size
First, let's get a sense of how much data (in this case, text) we have in the *Lewis Grassic Gibbon First Editions* (LGG) dataset:

In [4]:
def corpusStatistics(plaintext_corpus_read_lists):
    total_chars = 0
    total_tokens = 0
    total_sents = 0
    total_files = 0
    for fileid in plaintext_corpus_read_lists.fileids():
        total_chars += len(plaintext_corpus_read_lists.raw(fileid))
        total_tokens += len(plaintext_corpus_read_lists.words(fileid))
        total_sents += len(plaintext_corpus_read_lists.sents(fileid))
        total_files += 1
    print("Estimated total...")
    print("  Characters in LGG Data:", total_chars)
    print("  Tokens in LGG Data:", total_tokens)
    print("  Sentences in LGG Data:", total_sents)
    print("  Files in LGG Data:", total_files)

corpusStatistics(wordlists)

Estimated total...
  Characters in LGG Data: 7068578
  Tokens in LGG Data: 1506095
  Sentences in LGG Data: 70821
  Files in LGG Data: 16


Note that I've printed ``Tokens`` rather than words, even though the NLTK method used to count them was ``.words()``.  This is because words in NLTK include punctuation marks and digits, in addition to alphabetic words.
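
To see the difference between tokens and alphabetic words, here is a minimal sketch that filters a token list with Python's built-in ``str.isalpha()``.  The list is the 15-token sample printed earlier in this notebook, and the variable names are chosen just for this illustration:

```python
# The 15 tokens printed earlier in this notebook
sample_tokens = ['BY', 'ROBERT', 'MACLEHOSE', 'AND', 'CO', '.', 'LTD', '.',
                 'THE', 'UNIVERSITY', 'PRESS', ',', 'GLASGOW', 'ALL', 'RIGHTS']

# Keep only the tokens made up entirely of letters
alphabetic_words = [token for token in sample_tokens if token.isalpha()]

print("Tokens:", len(sample_tokens))
print("Alphabetic words:", len(alphabetic_words))
```

The three punctuation tokens are dropped, so the count of alphabetic words is smaller than the count of tokens.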

#### 0.2 Identifying Subsets of the Data
Next, we'll prepare to pick out subsets of the data, one for each book.  To do so we first need to load the inventory (CSV file) that lists which file name corresponds with which title.  When you open the CSV file in Microsoft Excel or a text editor, you can see that there are no column names.  The Python library [Pandas](https://pandas.pydata.org/docs/), which reads CSV files, calls the row of column names the ``header``.  When we use Pandas to read the inventory, we'll create our own header by specifying the CSV file's ``header`` as ``None`` and providing a list of column ``names``.

When Pandas (abbreviated ``pd`` when we loaded libraries in the first cell of this notebook) reads a CSV file, it creates a table called a **dataframe** from that data.  Let's see what the Gibbon inventory dataframe looks like:

In [5]:
df = pd.read_csv('data/nls-text-gibbon/gibbon-inventory.csv', header=None, names=['fileid', 'title'])
df

| | fileid | title |
|---|---|---|
| 0 | 205174241.txt | Niger - R.176.i |
| 1 | 205174242.txt | Thirteenth disciple - Vts.137.d |
| 2 | 205174243.txt | Three go back - Vts.152.f.22 |
| 3 | 205174244.txt | Calends of Cairo - Vts.153.c.16 |
| 4 | 205174245.txt | Lost trumpet - Vts.143.j.8 |
| 5 | 205174246.txt | Image and superscription - Vts.118.l.16 |
| 6 | 205174247.txt | Spartacus - Vts.6.k.19 |
| 7 | 205174248.txt | Persian dawns, Egyptian nights - Vts.148.d.8 |
| 8 | 205174249.txt | Scots quair - Cloud howe - NF.523.b.30 |
| 9 | 205174250.txt | Scots quair - Grey granite - NF.523.b.31 |

Since we only have 16 files, we returned the entire dataframe above.  With larger dataframes you may wish to use ``df.head()`` or ``df.tail()`` to print only the first 5 or last 5 rows of your dataframe (both of which will include the column names in the dataframe's header).
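
The sketch below shows what ``header=None`` and ``names`` change, using three rows copied from the inventory and held in memory rather than read from disk.  The variable names ``csv_text``, ``wrong`` and ``inventory`` are chosen just for this illustration:

```python
from io import StringIO
import pandas as pd

# Three rows copied from the inventory; like the real file, there is no header row
csv_text = (
    "205174241.txt,Niger - R.176.i\n"
    "205174242.txt,Thirteenth disciple - Vts.137.d\n"
    "205174243.txt,Three go back - Vts.152.f.22\n"
)

# Without header=None, pandas treats the first data row as the column names
wrong = pd.read_csv(StringIO(csv_text))
print(list(wrong.columns), "rows:", len(wrong))

# With header=None and names=..., every row is kept as data
inventory = pd.read_csv(StringIO(csv_text), header=None, names=["fileid", "title"])
print(list(inventory.columns), "rows:", len(inventory))

# head(n) returns only the first n rows (5 if n is left out)
print(inventory.head(2))
```

Leaving out ``header=None`` would silently lose the first book from the data, which is why the inventory is loaded with it.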

Now that we've created a dataframe, if we want to determine the title of a Gibbon work based on its file ID, we can use the following code:

In [7]:
# obtain a list of all file IDs
fileids = list(df["fileid"])
print("List of file IDs:\n", fileids)
print()

# obtain a list of all titles
titles = list(df["title"])
print("List of titles:\n", titles)
print()

# create a dictionary where the keys are file IDs and the values are titles
lgg_dict = dict(zip(fileids, titles))
print("Dictionary of file IDs and titles:\n", lgg_dict)
print()

# pick a file ID by its index number
a_file_id = fileids[10]

# get the title corresponding with the file ID in the dataframe
print("The title for the file ID at index 10:\n", lgg_dict[a_file_id])
print()

List of file IDs:
 ['205174241.txt', '205174242.txt', '205174243.txt', '205174244.txt', '205174245.txt', '205174246.txt', '205174247.txt', '205174248.txt', '205174249.txt', '205174250.txt', '205174251.txt', '205174252.txt', '205174253.txt', '205174254.txt', '205174255.txt', '205202834.txt']

List of titles:
 ['Niger - R.176.i', 'Thirteenth disciple - Vts.137.d', 'Three go back - Vts.152.f.22', 'Calends of Cairo - Vts.153.c.16', 'Lost trumpet - Vts.143.j.8', 'Image and superscription - Vts.118.l.16', 'Spartacus - Vts.6.k.19 ', 'Persian dawns, Egyptian nights - Vts.148.d.8', 'Scots quair - Cloud howe - NF.523.b.30', 'Scots quair - Grey granite - NF.523.b.31', 'Scots quair - Sunset song - NF.523.b.29', 'Hanno, or, The future of exploration - S.114.j.21', 'Nine against the unknown - S.72.d.10', 'Conquest of the Maya - S.60.c', 'Gay hunter - Vts.215.j.26', 'Stained radiance - T.204.f']

Dictionary of file IDs and titles:
 {'205174241.txt': 'Niger - R.176.i', '205174242.txt': 'Thirteenth dis

NLTK's corpus reader (the ``wordlists`` variable we created) organises its tokens by file ID, so it's useful to be able to match the file IDs with their book titles!
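
The titles in the inventory end with what appears to be a library shelfmark after a final `` - ``, and at least one entry has a trailing space.  If we wanted just the title, a minimal sketch could split each entry at the last `` - ``.  The helper name ``split_title`` is made up here, and reading the final part as a shelfmark is an assumption about the inventory format:

```python
# Three entries copied from the dictionary of file IDs and titles
lgg_sample = {
    '205174247.txt': 'Spartacus - Vts.6.k.19 ',
    '205174249.txt': 'Scots quair - Cloud howe - NF.523.b.30',
    '205174252.txt': 'Hanno, or, The future of exploration - S.114.j.21',
}

def split_title(entry):
    """Split an inventory entry into (title, shelfmark) at the last ' - '.

    Assumes the text after the final ' - ' is a shelfmark; if there is
    no ' - ' at all, the whole entry is returned as the title.
    """
    title, separator, shelfmark = entry.strip().rpartition(' - ')
    if not separator:
        return entry.strip(), ''
    return title, shelfmark

for fileid, entry in lgg_sample.items():
    title, shelfmark = split_title(entry)
    print(fileid, "|", title, "|", shelfmark)
```

Splitting at the last separator rather than the first keeps titles such as "Scots quair - Cloud howe" in one piece.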

### 1. Data Cleaning and Standardisation

[Code cells in this section will have one function each, preceded by comments as markdown above the cell to narrate the cleaning process]

In [None]:
# code goes here

In [None]:
# tokenisation

In [None]:
# sentence segmentation

In [None]:
# lemmatisation or stemming

In [None]:
# part-of-speech tagging

### 2. Summary Statistics
[Code cells in this section will have one function each, preceded with comments in a markdown cell narrating the summarization process]

#### 2.1 Frequencies and Sizes

[Narration]

In [None]:
# code goes here

#### 2.2 Uniqueness and Variety

[Narration]

In [None]:
# code goes here - most frequent words, sentence structure

### 3. Exploratory Analysis 
(this section will be included for 2-3 datasets)
[Code cells in this section will have one function each, preceded with comments in a markdown cell posing an exploratory research question]

#### 3.1 [exploratory research question 1]

In [2]:
# code goes here

In [None]:
# visualizations go here

#### 3.2 [exploratory research question 2]

In [2]:
# code goes here

In [7]:
# visualizations go here