# A Medical History of British India as Data
Created in July and August 2020 for the National Library of Scotland's Data Foundry by Lucy Havens, Digital Library Research Intern

### About the *A Medical History of British India* Dataset
[1-2 sentence description of collection and its acquisition]
* Data format: digitised text
* Data creation process: Optical Character Recognition (OCR) and manual cleaning
* Data source: https://data.nls.uk/data/digitised-collections/a-medical-history-of-british-india/

### 0. Preparation
Import libraries to use for cleaning, summarizing and exploring the data:

In [1]:
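# Work around SSL certificate failures when downloading NLTK data
# (note: this disables HTTPS certificate verification for the session)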
import os, ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
    getattr(ssl, '_create_unverified_context', None)):
    ssl._create_default_https_context = ssl._create_unverified_context

# Libraries for data loading
import pandas as pd
import numpy as np
import string
import re

# Libraries for visualization
import altair as alt
import matplotlib.pyplot as plt

# Libraries for text analysis
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
nltk.download('wordnet')
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.text import Text
from nltk.stem.porter import PorterStemmer
from nltk.probability import FreqDist
nltk.download('averaged_perceptron_tagger')
from nltk.tag import pos_tag
from nltk.draw.dispersion import dispersion_plot as displt

[nltk_data] Downloading package punkt to /Users/lucy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/lucy/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/lucy/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


The nls-text-indiaPapers folder (downloadable as *Just the text* data from the data source linked at the top of this notebook) contains TXT files of digitized text, with numerical names, as well as a CSV inventory file and a TXT ReadMe file. Load only the TXT files of digitized text and **tokenize** the text (that is, split each string into separate words and punctuation marks):

In [2]:
corpus_folder = 'data/nls-text-indiaPapers/'
wordlists = PlaintextCorpusReader(corpus_folder, r'\d.*', encoding='latin1')  # raw string: load only files whose names start with a digit
corpus_tokens = wordlists.words()
print(corpus_tokens[:10])

['No', '.', '1111', '(', 'Sanitary', '),', 'dated', 'Ootacamund', ',', 'the']
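
To see why the first tokens look the way they do, here is a minimal sketch that mimics the two steps of the cell above using only Python's `re` module. It assumes that the file pattern is matched against the start of each file name, and that the reader's default word tokenizer splits text with the regular expression `\w+|[^\w\s]+` (the pattern used by NLTK's `WordPunctTokenizer`). The file name `readme.txt` is hypothetical, and the sample string is reconstructed from the tokens printed above rather than copied from a file.

```python
import re

# File names as they might appear in the folder; 'readme.txt' is a hypothetical name
folder_contents = ['74457530.txt', '74457800.txt', 'indiaPapers-inventory.csv', 'readme.txt']

# r'\d.*' keeps only names that begin with a digit, i.e. the digitized text files
text_files = [name for name in folder_contents if re.match(r'\d.*', name)]
print(text_files)

# Split a string into runs of word characters and runs of punctuation
sample = 'No. 1111 (Sanitary), dated Ootacamund, the'
sample_tokens = re.findall(r'\w+|[^\w\s]+', sample)
print(sample_tokens)
```

Adjacent punctuation marks stay together as one token, which is why `'),'` appears as a single item in the list.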


*Note: If you'd like to see how to specify a single TXT file to load as data, check out the Jupyter Notebook for the Britain and UK Handbooks!*

It's hard to get a sense of how accurately the text has been digitized from this list of 10 tokens, so let's look at a word in context. To see the phrases in which "India" is used, we can use the `concordance()` method:

In [3]:
t = Text(corpus_tokens)
t.concordance('India', lines=25)  # 25 is also the number of lines NLTK's concordance method displays by default

Displaying 25 of 16495 matches:
ffg . Secretary to the Government of India . Resolution of Government of India 
 India . Resolution of Government of India No . 1 - 137 , dated 5th March 1875 
rch 1875 . Letter from Government of India No . 486 , dated 5th September 1876 
ember 1876 . Letter to Government of India No . 1063 , dated 26th ditto . REFER
ffg . Secretary to the Government of India , Home Department . REFERRING to par
 to paragraph 8 of the Government of India ' s Resolu - tion No . 1 - 136 , dat
inion expressed by the Government of India that any measures of segragation and
filth with which all the villages in India are surrounded is quite sufficient t
the disease in Rajputana and Central India are in the hands of the Presidency S
ffg . Secretary to the Government of India , Home Dept . IN continuation of my 
 the Resolution of the Government of India , Home Department ( Medical ), No 1 
d by the orders of the Government of India dated 5th March 1876 . Report on lep
e to the
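
A concordance is a keyword-in-context listing: every occurrence of a word is shown with the text on either side of it. The sketch below illustrates the idea on a short, hand-written token list. It is a simplified stand-in, not NLTK's implementation: the function name `keyword_in_context` and its parameters are hypothetical, it matches the exact token only, and it measures context in tokens, whereas NLTK's `concordance()` measures it in characters.

```python
def keyword_in_context(tokens, keyword, window=3, max_lines=25):
    """Return up to max_lines snippets showing keyword with `window` tokens on each side."""
    snippets = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - window):i]       # tokens before the keyword
            right = tokens[i + 1:i + 1 + window]      # tokens after the keyword
            snippets.append(' '.join(left + [token] + right))
            if len(snippets) == max_lines:
                break
    return snippets

# Illustrative tokens modelled on a phrase from the output above
sample_tokens = ['Secretary', 'to', 'the', 'Government', 'of', 'India', ',',
                 'Home', 'Department', '.']
for line in keyword_in_context(sample_tokens, 'India'):
    print(line)
```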

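The *A Medical History of British India* (MHBI) dataset has been digitized and then manually corrected for errors in the digitization process, so we can be fairly confident in the quality of the text for this dataset. Some oddities remain visible in the concordance above, such as words hyphenated across line breaks ("Resolu - tion").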

Let's find out just how much text and just how many files we're working with:

In [4]:
def corpusStatistics(plaintext_corpus_read_lists):
    """Print the total characters, tokens, sentences and files in a PlaintextCorpusReader."""
    total_chars = 0
    total_words = 0
    total_sents = 0
    total_files = 0
    for fileid in plaintext_corpus_read_lists.fileids():
        total_chars += len(plaintext_corpus_read_lists.raw(fileid))    # characters in the raw text
        total_words += len(plaintext_corpus_read_lists.words(fileid))  # tokens: words and punctuation
        total_sents += len(plaintext_corpus_read_lists.sents(fileid))  # sentences found by the sentence tokenizer
        total_files += 1
    print("Total...")
    print("  Characters in MHBI Data:", total_chars)
    print("  Words in MHBI Data:", total_words)
    print("  Sentences in MHBI Data:", total_sents)
    print("  Files in MHBI Data:", total_files)

corpusStatistics(wordlists)

Total...
  Characters in MHBI Data: 122297870
  Words in MHBI Data: 28333479
  Sentences in MHBI Data: 1671768
  Files in MHBI Data: 468
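
Because `words()` returns punctuation as well as words (see the first ten tokens printed earlier), the "Words" total counts tokens of both kinds. As a rough sketch of how the two could be separated, the example below sorts those ten tokens into alphabetic, numeric and punctuation tokens. The variable names are illustrative, and the same test applied to the whole corpus would be slow on roughly 28 million tokens.

```python
# The first ten tokens of the corpus, as printed earlier in this notebook
tokens = ['No', '.', '1111', '(', 'Sanitary', '),', 'dated', 'Ootacamund', ',', 'the']

alphabetic = [t for t in tokens if t.isalpha()]       # words made only of letters
numeric = [t for t in tokens if t.isdigit()]          # numbers
punctuation = [t for t in tokens if not t.isalnum()]  # everything else

print("Tokens:", len(tokens))
print("  Alphabetic:", len(alphabetic))
print("  Numeric:", len(numeric))
print("  Punctuation:", len(punctuation))
```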


The `fileids` are the names of the files in the data's source folder:

In [11]:
fileids = list(wordlists.fileids())
fileids[0:3]

['74457530.txt', '74457800.txt', '74458285.txt']

We can use the inventory CSV file from the source folder to match the titles of the papers to the corresponding `fileid`:

In [12]:
df = pd.read_csv('data/nls-text-indiaPapers/indiaPapers-inventory.csv', header=None, names=['fileid', 'title'])
df.head()  # prints the first 5 rows (df.tail() prints the last 5 rows)

| | fileid | title |
|---|---|---|
| 0 | 74457530.txt | Distribution and causation of leprosy in Briti... |
| 1 | 74457800.txt | Report of an outbreak of cholera in Suhutwar, ... |
| 2 | 74458285.txt | Report of an investigation into the causes of ... |
| 3 | 74458388.txt | Account of plague administration in the Bombay... |
| 4 | 74458575.txt | Inquiry into the circumstances attending an ou... |


In [18]:
titles = list(df['title'])
titles[0:3]

['Distribution and causation of leprosy in British India 1875 - IP/HA.2',
 'Report of an outbreak of cholera in Suhutwar, Bulliah sub-division - IP/30/PI.2',
 'Report of an investigation into the causes of the diseases known in Assam as Kála-Azár and Beri-Beri - IP/3/MB.5']
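
With file names and titles in hand, one way to match them is a dictionary keyed by `fileid`. The sketch below uses the three file names and titles shown above and assumes the two lists are in the same order. The name `title_by_fileid` is illustrative and does not appear elsewhere in this notebook.

```python
# The first three file names and titles, copied from the outputs above
fileids = ['74457530.txt', '74457800.txt', '74458285.txt']
titles = [
    'Distribution and causation of leprosy in British India 1875 - IP/HA.2',
    'Report of an outbreak of cholera in Suhutwar, Bulliah sub-division - IP/30/PI.2',
    'Report of an investigation into the causes of the diseases known in Assam as Kála-Azár and Beri-Beri - IP/3/MB.5',
]

# Pair each file name with its title so a title can be looked up by fileid
title_by_fileid = dict(zip(fileids, titles))
print(title_by_fileid['74457800.txt'])
```

In the notebook itself, the same lookup could probably be built straight from the inventory with `dict(zip(df['fileid'], df['title']))`, which avoids relying on the order of two separate lists.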

### 1. Data Cleaning

[Code cells in this section will have one function each, preceded by comments as markdown above the cell to narrate the cleaning process]

In [None]:
# code goes here

In [None]:
# tokenisation

In [None]:
# lemmatisation

In [None]:
# stemming

In [None]:
# part-of-speech tagging

### 2. Summary Statistics
[Code cells in this section will have one function each, preceded with comments in a markdown cell narrating the summarization process]

#### 2.1 Dataset Size

[Narration]

In [None]:
# code goes here

#### 2.2 Uniqueness and Variety

[Narration]

In [None]:
# code goes here - most frequent words, sentence structure

### 3. Exploratory Analysis (this section will be included for 2-3 datasets)
[Code cells in this section will have one function each, preceded with comments in a markdown cell posing an exploratory research question]

#### 3.1 [exploratory research question 1]

In [2]:
# code goes here

In [None]:
# visualizations go here

#### 3.2 [exploratory research question 2]

In [2]:
# code goes here

In [3]:
# visualizations go here