# Britain and UK Handbooks as Data

Created in July and August 2020 for the National Library of Scotland's Data Foundry by Lucy Havens, Digital Library Research Intern

### About the Britain and UK Handbooks Dataset
The data consists of digitized text from selected Britain and UK Handbooks produced between 1954 and 2005. A central statistics bureau produced the Handbooks each year to communicate information about the UK that would impress international diplomats.
* Data format: digitized text
* Data creation process: Optical Character Recognition (OCR)
* Data source: https://data.nls.uk/digitised-collections/britain-uk-handbooks/

### 0. Preparation
Import libraries to use for cleaning, summarizing and exploring the data:

In [22]:
# To prevent SSL certificate failure
import os, ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
    getattr(ssl, '_create_unverified_context', None)):
    ssl._create_default_https_context = ssl._create_unverified_context

# Libraries for data loading
import pandas as pd
import numpy as np
import string
import re

# Libraries for visualization
import altair as alt
import matplotlib.pyplot as plt

# Libraries for text analysis
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
from nltk.text import Text

[nltk_data] Downloading package punkt to /Users/lucy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


The nls-text-handbooks folder contains TXT files of digitized text, with numerical names, as well as a CSV inventory file and a TXT ReadMe file. Load only the TXT files of digitized text by matching file names that begin with a digit:

In [19]:
corpus_folder = 'data/nls-text-handbooks/'
# Raw string so that \d is read as a regex digit class, not a string escape
wordlists = PlaintextCorpusReader(corpus_folder, r'\d.*', encoding='latin1')
corpus_tokens = wordlists.words()
print(corpus_tokens[:10])

['BRITAIN', '1979', '3W', '+', 'L', 'Capita', '!', 'Edinburgh', 'Population', '5']
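
The second argument to PlaintextCorpusReader is a regular expression that is matched against file names. The sketch below illustrates how a pattern like `\d.*` separates numerically named text files from the inventory and ReadMe. Apart from 205336772.txt, the file names are invented for illustration, and `re.fullmatch` is an assumption standing in for the reader's own matching, which may differ in detail.

```python
import re

# Hypothetical folder listing: only 205336772.txt appears in this notebook,
# the other names are invented for illustration.
filenames = ['205336772.txt', '204486081.txt', 'inventory.csv', 'readme.txt']

# Same pattern as the one passed to PlaintextCorpusReader, as a raw string
pattern = re.compile(r'\d.*')

# Keep only the names that start with a digit
text_files = [name for name in filenames if pattern.fullmatch(name)]
print(text_files)
```

Because the inventory and ReadMe names here start with a letter, they are skipped and only the digitized text files remain.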


It's hard to get a sense of how accurately the text has been digitized from this list of 10 tokens, so let's look at one of them in context. To see the phrases in which "Edinburgh" is used, we can use the concordance() method:

In [23]:
t = Text(corpus_tokens)
t.concordance('Edinburgh', lines=50)

Displaying 50 of 2579 matches:
BRITAIN 1979 3W + L Capita ! Edinburgh Population 5 , 196 / GOO ENGLAND A
ondon WC1V 6HB 13a Castle Street , Edinburgh EH2 3AR 41 The Hayes , Cardiff CF1
ield Liverpool Manchester Bradford Edinburgh Bristol Belfast Coventry Cardiff s
Counsellors of State ( the Duke of Edinburgh , the four adult persons next in s
ments , accompanied by the Duke of Edinburgh , and undertakes lengthy tours in 
y government bookshops in London , Edinburgh , Cardiff , Belfast , Manchester ,
five Scottish departments based in Edinburgh and known as the Scottish Office .
 is centred in the Crown Office in Edinburgh . The Parliamentary Draftsmen for 
. The main seat of the court is in Edinburgh where all appeals are heard . All 
 The Court of Session sits only in Edinburgh , and has jurisdiction to deal wit
ersities are : Aberdeen , Dundee , Edinburgh , Glasgow , Heriot - Watt ( Edinbu
nburgh , Glasgow , Heriot - Watt ( Edinburgh ), St . Andrews , Stirling , and S
. Andrews , Gla

I'm guessing "bife" should be "Fife" as it's closely followed by "Dundee," but overall not so bad!
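
To make the idea of a concordance concrete, here is a simplified keyword-in-context sketch in plain Python. It is not NLTK's implementation: the function name `simple_concordance` and its `width` parameter are hypothetical, and context is measured in tokens rather than characters.

```python
def simple_concordance(tokens, keyword, width=3):
    """Return each occurrence of keyword with up to `width` tokens of context per side."""
    keyword = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        # Case-insensitive comparison (a choice made for this sketch)
        if token.lower() == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(' '.join(left + [token] + right))
    return lines

# The first ten corpus tokens printed earlier in this notebook
sample = ['BRITAIN', '1979', '3W', '+', 'L', 'Capita', '!', 'Edinburgh', 'Population', '5']
for line in simple_concordance(sample, 'edinburgh'):
    print(line)
```

Reading a keyword alongside its neighbours like this is what makes OCR errors easier to spot than in a bare token list.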

We can also load individual files from the nls-text-handbooks folder:

In [17]:
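# Read one file; the with-block closes it automatically.
# latin1 matches the encoding passed to PlaintextCorpusReader above.
with open('data/nls-text-handbooks/205336772.txt', 'r', encoding='latin1') as file:
    sample_text = file.read()
sample_tokens = word_tokenize(sample_text)
sample_tokens[:10]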

['GH', '.', 'fl-', '[', 'IASG0', '>', 'J^RSEI', 'nice', ']', 'ROME']

However, in this Notebook, we're interested in the entire dataset, so we'll stick with the corpus loaded through PlaintextCorpusReader.

### 1. Data Cleaning

[Code cells in this section will have one function each, preceded by comments as markdown above the cell to narrate the cleaning process]

In [None]:
# code goes here
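
As a placeholder illustration, one possible cleaning step is to lowercase tokens and discard any that contain non-letter characters, which removes much of the OCR debris seen in the sample tokens above. This is a sketch under that assumption, not necessarily the cleaning this section will use; `keep_alphabetic` is a hypothetical helper name.

```python
def keep_alphabetic(tokens):
    """Lowercase tokens and drop any that contain non-letter characters."""
    return [t.lower() for t in tokens if t.isalpha()]

# The first ten tokens of 205336772.txt, as printed earlier in this notebook
sample_tokens = ['GH', '.', 'fl-', '[', 'IASG0', '>', 'J^RSEI', 'nice', ']', 'ROME']
print(keep_alphabetic(sample_tokens))
```

A filter this strict also discards years and other figures, so it would suit word-frequency questions better than questions about the statistics the Handbooks report.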

### 2. Summary Statistics
[Code cells in this section will have one function each, preceded by comments in a markdown cell narrating the summarization process]

#### 2.1 Dataset Size

[Narration]

In [None]:
# code goes here

#### 2.2 Uniqueness and Variety

[Narration]

In [None]:
# code goes here
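
As a placeholder illustration, one common way to quantify variety is the type-token ratio: the number of distinct tokens divided by the total number of tokens. The sketch below assumes that measure and uses a short made-up token list; `type_token_ratio` is a hypothetical helper, and the real analysis may choose a different statistic.

```python
def type_token_ratio(tokens):
    """Ratio of distinct lowercased tokens to total tokens (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return len({t.lower() for t in tokens}) / len(tokens)

# Made-up token list for illustration
sample = ['the', 'Duke', 'of', 'Edinburgh', 'and', 'the', 'Court', 'of', 'Session']
total = len(sample)
unique = len({t.lower() for t in sample})
print(total, unique, round(type_token_ratio(sample), 2))
```

The ratio tends to fall as a text gets longer, so comparisons between Handbooks would be fairer on samples of equal length.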

### 3. Exploratory Analysis (this section will be included for 2-3 datasets)
[Code cells in this section will have one function each, preceded by comments in a markdown cell posing an exploratory research question]

#### 3.1 [exploratory research question 1]

In [2]:
# code goes here

In [None]:
# visualizations go here

#### 3.2 [exploratory research question 2]

In [2]:
# code goes here

In [3]:
# visualizations go here