# Britain and UK Handbooks as Data

Created in July and August 2020 for the National Library of Scotland's Data Foundry by Lucy Havens, Digital Library Research Intern

### About the Britain and UK Handbooks Dataset
The data consists of digitized text from select Britain and UK Handbooks produced between 1954 and 2005.  A central statistics bureau produced the Handbooks each year to communicate information about the UK that would impress international diplomats.
* Data format: digitized text
* Data creation process: Optical Character Recognition (OCR)
* Data source: https://data.nls.uk/digitised-collections/britain-uk-handbooks/

### 0. Preparation
Import libraries to use for cleaning, summarizing and exploring the data:

In [43]:
# To prevent SSL certificate failure
import os, ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
    getattr(ssl, '_create_unverified_context', None)):
    ssl._create_default_https_context = ssl._create_unverified_context

# Libraries for data loading
import pandas as pd
import numpy as np
import string
import re

# Libraries for visualization
import altair as alt
import matplotlib.pyplot as plt

# Libraries for text analysis
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
nltk.download('wordnet')
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.text import Text
from nltk.stem.porter import PorterStemmer
from nltk.probability import FreqDist
nltk.download('averaged_perceptron_tagger')
from nltk.tag import pos_tag

[nltk_data] Downloading package punkt to /Users/lucy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/lucy/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/lucy/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


The nls-text-handbooks folder contains TXT files of digitized text, with numerical names, as well as a CSV inventory file and a TXT ReadMe file.  Load only the TXT files of digitized text and **tokenize** the text (which splits a string into separate words and punctuation):

In [3]:
corpus_folder = 'data/nls-text-handbooks/'
wordlists = PlaintextCorpusReader(corpus_folder, r'\d.*', encoding='latin1')  # raw string so \d reaches the regex engine unchanged
corpus_tokens = wordlists.words()
print(corpus_tokens[:10])

['BRITAIN', '1979', '3W', '+', 'L', 'Capita', '!', 'Edinburgh', 'Population', '5']
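
To make the tokenization step concrete, here is a minimal sketch that uses only the standard library. It assumes that a pattern of "runs of word characters, or runs of other non-space characters" behaves like the reader's default word tokenizer; `simple_tokenize` and `TOKEN_PATTERN` are hypothetical names used for illustration, not part of NLTK.

```python
import re

# Assumption: this pattern approximates the default word tokenizer of
# PlaintextCorpusReader: runs of word characters, or runs of other
# non-space characters (punctuation).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")

def simple_tokenize(text):
    """Split a string into word tokens and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

sample = "Capita! Edinburgh Population 5,196"
tokens = simple_tokenize(sample)
print(tokens)
```

Note how the exclamation mark and the comma become tokens of their own, which is why an OCR slip such as `Capita!` shows up as two separate entries in the list above.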


It's hard to get a sense of how accurately the text has been digitized from this list of 10 words, so let's look at one of these words in context.  To see phrases in which "Edinburgh" is used, we can use the concordance() method:

In [4]:
t = Text(corpus_tokens)
t.concordance('Edinburgh', lines=20)

Displaying 20 of 2579 matches:
BRITAIN 1979 3W + L Capita ! Edinburgh Population 5 , 196 / GOO ENGLAND A
ondon WC1V 6HB 13a Castle Street , Edinburgh EH2 3AR 41 The Hayes , Cardiff CF1
ield Liverpool Manchester Bradford Edinburgh Bristol Belfast Coventry Cardiff s
Counsellors of State ( the Duke of Edinburgh , the four adult persons next in s
ments , accompanied by the Duke of Edinburgh , and undertakes lengthy tours in 
y government bookshops in London , Edinburgh , Cardiff , Belfast , Manchester ,
five Scottish departments based in Edinburgh and known as the Scottish Office .
 is centred in the Crown Office in Edinburgh . The Parliamentary Draftsmen for 
. The main seat of the court is in Edinburgh where all appeals are heard . All 
 The Court of Session sits only in Edinburgh , and has jurisdiction to deal wit
ersities are : Aberdeen , Dundee , Edinburgh , Glasgow , Heriot - Watt ( Edinbu
nburgh , Glasgow , Heriot - Watt ( Edinburgh ), St . Andrews , Stirling , and S
. Andrews , Gla

I'm guessing `bife` should be `Fife` as it's closely followed by `Dundee`, but overall not so bad!
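
For readers curious about what a concordance does under the hood, the idea can be sketched without NLTK. This is a simplified illustration, not NLTK's implementation: it uses a fixed window of tokens rather than a character width, and `keyword_in_context` is a hypothetical helper name.

```python
def keyword_in_context(tokens, keyword, window=3):
    """Return each occurrence of keyword with `window` tokens on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():  # ignore case when matching
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# The first ten tokens of the corpus, copied from the output above
demo_tokens = ['BRITAIN', '1979', '3W', '+', 'L', 'Capita', '!',
               'Edinburgh', 'Population', '5']
for line in keyword_in_context(demo_tokens, 'Edinburgh'):
    print(line)
```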

We can also load individual files from the nls-text-handbooks folder:

In [4]:
# Open the file in a with-block so it is closed automatically after reading
with open('data/nls-text-handbooks/205336772.txt', 'r') as file:
    sample_text = file.read()
sample_tokens = word_tokenize(sample_text)
sample_tokens[:10]

['GH', '.', 'fl-', '[', 'IASG0', '>', 'J^RSEI', 'nice', ']', 'ROME']

However, in this Notebook, we're interested in the entire dataset, so we'll use all its files.  Let's find out just how many files, and just how much text, we're working with:

In [5]:
def corpusStatistics(plaintext_corpus_read_lists):
    total_chars = 0
    total_words = 0
    total_sents = 0
    total_files = 0
    for fileid in plaintext_corpus_read_lists.fileids():
        total_chars += len(plaintext_corpus_read_lists.raw(fileid))
        total_words += len(plaintext_corpus_read_lists.words(fileid))
        total_sents += len(plaintext_corpus_read_lists.sents(fileid))
        total_files += 1
    print("Total...")
    print("  Characters in Handbooks Data:", total_chars)
    print("  Words in Handbooks Data:", total_words)
    print("  Sentences in Handbooks Data:", total_sents)
    print("  Files in Handbooks Data:", total_files)

corpusStatistics(wordlists)

Total...
  Characters in Handbooks Data: 90573254
  Words in Handbooks Data: 16606800
  Sentences in Handbooks Data: 584618
  Files in Handbooks Data: 50


Across the 50 files that make up the Handbooks dataset, there are over 90 million characters (letters, digits, punctuation and whitespace), over 16 million tokens (NLTK's "words", which include numbers and punctuation marks), and nearly 600,000 sentences.  Of course, OCR isn't perfect, so these numbers are estimates, not precise totals.
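
As a quick sanity check on these totals, we can derive two rough averages from them. The sketch below only reuses the three numbers printed above; the variable names are illustrative, and because the character count includes whitespace, the characters-per-token figure overstates the typical token length.

```python
# Totals copied from the output of corpusStatistics(wordlists)
total_chars = 90573254
total_words = 16606800
total_sents = 584618

tokens_per_sentence = total_words / total_sents
chars_per_token = total_chars / total_words  # includes whitespace, so an upper bound

print("Average tokens per sentence:", round(tokens_per_sentence, 1))
print("Average characters per token:", round(chars_per_token, 1))
```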

### 1. Data Cleaning

`bife` most likely isn't the only word the OCR incorrectly digitized.  To get a sense of how much of the digitized text we can perform meaningful analysis on, let's figure out how many of NLTK's "words" are actually recognizable English words.

We'll use [WordNet](https://wordnet.princeton.edu/),* a database of English words, to evaluate which of NLTK's "words" are not valid English words.

***

  *Princeton University. "About WordNet." WordNet. Princeton University, 2010.

**First,** let's create a list of strings from the words NLTK has identified for us:

In [8]:
str_tokens = [str(word) for word in corpus_tokens]
assert isinstance(str_tokens[0], str)  # quick test to make sure the output is as expected
print(str_tokens[0:10])

['BRITAIN', '1979', '3W', '+', 'L', 'Capita', '!', 'Edinburgh', 'Population', '5']


There are digits and punctuation that won't be recognized as words in WordNet but still provide valuable text data for studying the Handbooks.  For example, in the output above, it looks as though the OCR processed the word `Capital` as `Capita!`, which NLTK has split into two tokens.  Furthermore, the token `1979` is a year that puts the text in context and would let us order information in the text chronologically.

To get an estimate of how accurately OCR digitized the Handbooks, though, we'll use words in the sense that they are recognizable words in the English language.  Let's write a regular expression that can tell us whether a string is a word or abbreviation:

In [26]:
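# Letters and periods, so that single letters and abbreviations count as words.
# Note: the range A-z also covers the ASCII characters [ \ ] ^ _ ` that sit between Z and a.
isWord = re.compile(r'[a-zA-z.]+')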

# ----------- TESTING REGEX -----------
# print(isWord.match("bife").group())
# print(isWord.match("U.S.A.").group())
# print(isWord.match("W").group())
# print(isWord.match("1979") == None)
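
Two details of this pattern are worth seeing in action: `match()` only anchors at the start of a string, and the range `A-z` is wider than `A-Z`. The sketch below compares the pattern from the cell above with a hypothetical stricter variant (`strict`, which is not used elsewhere in this Notebook) that accepts only letters and periods and must cover the whole token.

```python
import re

loose = re.compile('[a-zA-z.]+')     # the pattern from the cell above
strict = re.compile(r'[a-zA-Z.]+')   # hypothetical stricter variant

for token in ['bife', 'U.S.A.', 'W', '1979', '^', 'M41sq']:
    loose_hit = loose.match(token)        # matches a prefix of the token
    strict_hit = strict.fullmatch(token)  # must match the entire token
    print(repr(token),
          "loose:", loose_hit.group() if loose_hit else None,
          "strict:", strict_hit.group() if strict_hit else None)
```

With the stricter variant, tokens such as `^` and `M41sq` would be skipped rather than passed on (in whole or in part) to the WordNet lookup, so the counts it produces would differ from those reported in this Notebook.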

**Lastly,** let's use that regular expression to write a function to distinguish recognizable English words from unrecognizable strings:

In [38]:
def removeNonEnglishWords(list_of_strings):
    english_only = []
    nonenglish = []
    for s in list_of_strings:
        test = isWord.match(s)            # None unless the token starts with a character the pattern accepts
        if test is not None:
            passed = test.group()   # the leading run of matching characters
            if wordnet.synsets(passed):  # see if WordNet recognizes the matching string
                english_only.append(passed)
            else:
                nonenglish.append(passed)
    return english_only, nonenglish
                
recognized, unrecognized = removeNonEnglishWords(str_tokens)

In [40]:
print("Total alphabetic words recognized in WordNet:", len(recognized))
print("Total alphabetic words NOT recognized in WordNet:", len(unrecognized))
# Share of unrecognized words among all alphabetic words (recognized + unrecognized)
total_alphabetic = len(recognized) + len(unrecognized)
print("Percentage of alphabetic words that are unrecognized in WordNet:", round(len(unrecognized) / total_alphabetic * 100, 2))

Total alphabetic words recognized in WordNet: 9665798
Total alphabetic words NOT recognized in WordNet: 4009482
Percentage of alphabetic words that are unrecognized in WordNet: 29.32


### 2. Summary Statistics

Different types of analysis require different subsets of data, so let's create some here:

In [58]:
# Lowercase text
lower_str_tokens = [t.lower() for t in str_tokens]
lower_recognized = [word.lower() for word in recognized]
lower_unrecognized = [word.lower() for word in unrecognized]    

In [56]:
# Exclude stop words (e.g. the, a, is) - note that the input text must be lowercased!
nltk.download('stopwords', quiet=True)  # the stop word list must be downloaded before first use
eng_stopwords = set(stopwords.words('english'))
no_stopwords = [t for t in lower_str_tokens if t not in eng_stopwords]
assert len(no_stopwords) < len(lower_str_tokens)
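
To see why lowercasing matters for stop word removal, here is a small standalone sketch. It assumes a tiny hand-picked stop word set (`demo_stopwords`) in place of NLTK's much longer English list, and `remove_stopwords` is a hypothetical helper name.

```python
# Assumption: a tiny hand-picked set stands in for NLTK's English stop word list
demo_stopwords = {'the', 'of', 'in', 'is', 'a', 'an'}

def remove_stopwords(tokens, stop_words):
    """Lowercase each token, then drop those found in stop_words."""
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in stop_words]

demo = ['The', 'Court', 'of', 'Session', 'sits', 'only', 'in', 'Edinburgh']
print(remove_stopwords(demo, demo_stopwords))

# Without lowercasing, the capitalized 'The' would slip through the filter
print('The' in demo_stopwords, 'the' in demo_stopwords)
```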

In [60]:
# Stem the text (reduce words to their root, whether or not the root is a word itself)
porter = nltk.PorterStemmer()
porter_stemmed = [porter.stem(t) for t in lower_str_tokens]
print(porter_stemmed[:100])
print()
lancaster = nltk.LancasterStemmer()
lancaster_stemmed = [lancaster.stem(t) for t in lower_str_tokens]
print(lancaster_stemmed[:100])

['britain', '1979', '3w', '+', 'l', 'capita', '!', 'edinburgh', 'popul', '5', ',', '196', '/', 'goo', 'england', 'area', '130', ',', 'm41sq', '.', 'km', '50', '/', '363sq', '.', 'mile', 'capita', '!', 'london', 'pecul', '4', '-', '6', '/', '351', '/', '000', ';', 'ivy1', 'i', '-', '<', '1', '..', 'i', "'", 'i', '&', 'rr', '^', 'xt', ':.', 'u', '.', '5', '%', 'v', '?', 'iq', '^', 'an', 'offici', 'handbook', 'an', 'offici', 'handbook', 'london', ':', 'her', 'majesti', 'stationeri', 'offic', 'â', '©', 'crown', 'copyright', '1979', 'first', 'publish', '1979', 'her', 'majesti', "'", 's', 'stationeri', 'offic', 'govern', 'bookshop', '49', 'high', 'holborn', ',', 'london', 'wc1v', '6hb', '13a', 'castl', 'street', ',', 'edinburgh']

['britain', '1979', '3w', '+', 'l', 'capit', '!', 'edinburgh', 'pop', '5', ',', '196', '/', 'goo', 'england', 'are', '130', ',', 'm41sq', '.', 'km', '50', '/', '363sq', '.', 'mil', 'capit', '!', 'london', 'pec', '4', '-', '6', '/', '351', '/', '000', ';', 'ivy1', '

In [61]:
# Lemmatize the text (reduce words to their root ONLY if the root is considered a word in WordNet)
wnl = nltk.WordNetLemmatizer()
lemmatized = [wnl.lemmatize(t) for t in lower_str_tokens]
print(lemmatized[:100])

['britain', '1979', '3w', '+', 'l', 'caput', '!', 'edinburgh', 'population', '5', ',', '196', '/', 'goo', 'england', 'area', '130', ',', 'm41sq', '.', 'km', '50', '/', '363sq', '.', 'mile', 'caput', '!', 'london', 'peculation', '4', '-', '6', '/', '351', '/', '000', ';', 'ivy1', 'i', '-', '<', '1', '..', 'i', "'", 'i', '&', 'rr', '^', 'xt', ':.', 'u', '.', '5', '%', 'v', '?', 'iq', '^', 'an', 'official', 'handbook', 'an', 'official', 'handbook', 'london', ':', 'her', 'majesty', 'stationery', 'office', 'â', '©', 'crown', 'copyright', '1979', 'first', 'published', '1979', 'her', 'majesty', "'", 's', 'stationery', 'office', 'government', 'bookshop', '49', 'high', 'holborn', ',', 'london', 'wc1v', '6hb', '13a', 'castle', 'street', ',', 'edinburgh']


In [64]:
# Remove duplicate tokens from the text (obtain the vocabulary of the text)
t_vocab = set(str_tokens)
t_vocab_lower = set(lower_str_tokens)
lemma_vocab = set(lemmatized)
print("Unique tokens:", len(t_vocab))
print("Unique lowercase tokens:", len(t_vocab_lower))
print("Unique lemmatized (lowercase) tokens:", len(lemma_vocab))
print()
rec_vocab = set(recognized)
rec_vocab_lower = set(lower_recognized)
unrec_vocab = set(unrecognized)
unrec_vocab_lower = set(lower_unrecognized)
print("Unique recognized words:", len(rec_vocab))
print("Unique recognized lowercase words:", len(rec_vocab_lower))
print("Unique unrecognized words:", len(unrec_vocab))
print("Unique unrecognized lowercase words:", len(unrec_vocab_lower))

Unique tokens: 86793
Unique lowercase tokens: 70922
Unique lemmatized (lowercase) tokens: 66172

Unique recognized words: 36780
Unique recognized lowercase words: 25602
Unique unrecognized words: 29107
Unique unrecognized lowercase words: 26573
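
One common way to summarize vocabulary size relative to text length is lexical diversity, the ratio of unique tokens to total tokens. The sketch below is a minimal illustration: `lexical_diversity` is a hypothetical helper, the demo tokens are taken from the title page output shown earlier, and the corpus-wide ratio simply reuses the totals already printed in this Notebook.

```python
def lexical_diversity(tokens):
    """Ratio of unique tokens to total tokens (0.0 for an empty list)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

demo = ['An', 'official', 'handbook', 'An', 'official', 'handbook', 'London']
print("Unique tokens:", len(set(demo)), "of", len(demo))
print("Lexical diversity:", round(lexical_diversity(demo), 2))

# The same ratio for the whole corpus, from the totals printed above
corpus_ratio = 86793 / 16606800
print("Corpus-wide ratio:", round(corpus_ratio, 4))
```

The corpus-wide ratio is tiny because lexical diversity shrinks as texts grow longer, so it is most informative when comparing texts of similar length, such as individual Handbooks.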


In [65]:
# Tag parts of speech in sentences
sentences = wordlists.sents()  # each sentence is already a list of word tokens
pos_tagged = [nltk.pos_tag(sent) for sent in sentences]
print(pos_tagged[:2])

[[('BRITAIN', 'NNP'), ('1979', 'CD'), ('3W', 'CD'), ('+', 'NN'), ('L', 'NNP'), ('Capita', 'NNP'), ('!', '.')], [('Edinburgh', 'NNP'), ('Population', 'NNP'), ('5', 'CD'), (',', ','), ('196', 'CD'), ('/', 'NN'), ('GOO', 'NNP'), ('ENGLAND', 'NNP'), ('Area', 'NNP'), ('130', 'CD'), (',', ','), ('M41sq', 'NNP'), ('.', '.'), ('km', 'VB'), ('50', 'CD'), ('/', 'JJ'), ('363sq', 'CD'), ('.', '.'), ('miles', 'NNS'), ('Capita', 'RB'), ('!', '.')], [('London', 'NNP'), ('Peculation', 'NNP'), ('4', 'CD'), ('-', ':'), ('6', 'CD'), ('/', '$'), ('351', 'CD'), ('/', 'JJ'), ('000', 'CD'), (';', ':'), ('iVY1', 'NN'), ('i', 'SYM'), ('-', ':'), ('<', 'NN'), ('1', 'CD'), ('..', 'NN'), ('i', 'NN'), ("'", "''"), ('i', 'NN'), ('&', 'CC'), ('rr', 'NN'), ('^', 'NN'), ('xt', 'NNP'), (':.', 'NN')], [('u', 'NN'), ('.', '.')], [('5', 'CD'), ('%', 'NN'), ('V', 'NNP'), ('?', '.'), ('iQ', 'NN'), ('^', 'VBD'), ('An', 'DT'), ('official', 'JJ'), ('handbook', 'NN'), ('An', 'DT'), ('official', 'JJ'), ('handbook', 'NN'), ('Lond

The parts of speech are abbreviated as follows:
* `NN` = singular noun, `NNS` = plural noun, `NNP` = singular proper noun, `NNPS` = plural proper noun
* `IN` = preposition
* `TO` = preposition or infinitive marker
* `DT` = determiner
* `CC` = coordinating conjunction
* `JJ` = adjective
* `VB` = verb
* `RB` = adverb

More abbreviations are explained [here](https://www.learntek.org/blog/categorizing-pos-tagging-nltk-python/) or can be queried with `nltk.help.upenn_tagset('TAG')`.
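
Once tokens are tagged, the `(word, tag)` pairs are easy to summarize or filter. As a minimal sketch, the snippet below counts the tags in the first tagged sentence (copied from the output above) and pulls out the tokens tagged as proper nouns; the variable names are illustrative only.

```python
from collections import Counter

# First tagged sentence, copied from the output above
first_sentence = [('BRITAIN', 'NNP'), ('1979', 'CD'), ('3W', 'CD'), ('+', 'NN'),
                  ('L', 'NNP'), ('Capita', 'NNP'), ('!', '.')]

# Count how often each part-of-speech tag occurs
tag_counts = Counter(tag for _, tag in first_sentence)
print(tag_counts.most_common())

# Keep only the tokens tagged as singular proper nouns
proper_nouns = [word for word, tag in first_sentence if tag == 'NNP']
print(proper_nouns)
```

Note that OCR noise such as `L` is tagged as a proper noun here, so tag-based filters inherit the errors of the digitized text.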

#### 2.1 Dataset Size

[Narration]

In [6]:
# code goes here - average sentence length, number of tables, dates covered, places covered...

#### 2.2 Uniqueness and Variety

[Narration]

In [7]:
# code goes here - find most frequent words, lexical diversity

### 3. Exploratory Analysis (this section will be included for 2-3 datasets)
[Code cells in this section will have one function each, preceded with comments in a markdown cell posing an exploratory research question]

#### 3.1 How are Britain and the UK portrayed?

In [8]:
# code goes here

In [9]:
# visualizations go here

#### 3.2 How is Scotland portrayed?  Ireland?  Wales?

In [10]:
# code goes here

In [11]:
# visualizations go here