# Getting Started with Text Mining

These scripts accompany the blog post [title] found on the Fugitive Leaves blog of The Historical Medical Library of The College of Physicians of Philadelphia.

[link to blog post]

### Import dependent libraries

These libraries are prepackaged sets of modules and functions that provide functionality not present in the core Python library. They are necessary for running the scripts that follow.

In [55]:
#import libraries
import urllib.request
import nltk

### Importing text from the web

In [56]:
#assign the URL you want to a variable
#this is the URL for the plain text version of volume 56 of The West Virginia Medical Journal 
journalUrl = "https://archive.org/stream/westvirginiamedi5619west/westvirginiamedi5619west_djvu.txt"

In [57]:
#assign the plain text to a variable as a string
journalString = urllib.request.urlopen(journalUrl).read().decode()

### Tokenizing the text

The following lines strip the Internet Archive header and footer from the raw text string. This step is mentioned in the post but not covered in depth. Cleaning up a text is always done case by case and involves checking your results after each manipulation.

In [58]:
#remove header and footer
header = journalString.find('<pre>') + len('<pre>')
footer = journalString.find('</pre>')
journalString = journalString[header:footer]
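The slice above assumes that both `<pre>` and `</pre>` are present in the page. `str.find` returns -1 when a marker is missing, which would quietly produce the wrong slice. Below is a minimal sketch of a more defensive version. The helper name `strip_wrapper` and the sample string are made up for illustration and are not part of the original scripts.

```python
def strip_wrapper(text, start_marker="<pre>", end_marker="</pre>"):
    """Return the text between the two markers, or the text unchanged if either is missing."""
    start = text.find(start_marker)
    end = text.find(end_marker)
    # str.find returns -1 when a marker is absent; in that case leave the text alone
    if start == -1 or end == -1 or end < start:
        return text
    return text[start + len(start_marker):end]


# Hypothetical sample standing in for the downloaded page
sample = "<html><pre>volume 56 text</pre></html>"
print(strip_wrapper(sample))
print(strip_wrapper("no markers here"))
```

Returning the input unchanged is only one possible choice. Raising an error instead would make a missing marker harder to overlook.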

In [59]:
#tokenize and make each word lowercase
journalTokens = nltk.word_tokenize(journalString.lower())

In [60]:
#check the tokens array, first fifty tokens
journalTokens[:50]

['•',
 "'",
 'bd',
 '?',
 '.',
 '5.05',
 'digitized',
 'by',
 'the',
 'internet',
 'archive',
 'in',
 '2016',
 'https',
 ':',
 '//archive.org/details/westvirginiamedi5619west',
 'm',
 'now',
 'ium',
 '(',
 '£',
 ')',
 '(',
 'propionyl',
 'erythromycin',
 'ester',
 ',',
 'lilly',
 ')',
 "'ey",
 'for',
 'chjldre',
 'too',
 '!',
 'ilosone',
 '125',
 'suspension',
 '(',
 'propionyl',
 'erythromycin',
 'ester',
 'lauryl',
 'sulfate',
 ',',
 'lilly',
 ')',
 'deliciously',
 'flavored',
 'decisively',
 'effective']

Note that many tokens in the array are not words. The following lines remove any token that does not contain at least one letter of the alphabet. Further cleanup will be necessary. For example, the "digitized by" stamp is still at the beginning of the text, and words broken by a column margin need to be rejoined. For the sake of demonstration, we will consider these results acceptable.

In [61]:
#keep only the tokens that contain at least one letter
journalWords = [word for word in journalTokens if any(char.isalpha() for char in word)]
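To see what this filter does, here is a small sketch that applies the same test to a handful of tokens copied from the output above. The variable names `sample_tokens` and `sample_words` are introduced only for this illustration.

```python
# A few tokens copied from the first fifty tokens shown above
sample_tokens = ["•", "'", "bd", "?", ".", "5.05", "digitized", "2016", "'ey", "125"]

# Keep a token only if at least one of its characters is a letter
sample_words = [tok for tok in sample_tokens if any(ch.isalpha() for ch in tok)]
print(sample_words)
```

Punctuation and purely numeric tokens such as `5.05` are dropped, while a damaged token like `'ey` survives because it still contains letters.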

Next we will remove the stop words from this set of words. For an explanation of stop words, see the original blog post.

In [62]:
#load stopwords list from NLTK
stopwords = nltk.corpus.stopwords.words("english")

In [63]:
#remove stopwords from journalWords
journalStoppedWords = [word for word in journalWords if word not in stopwords]
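The stop word list is searched from the start for every word that is checked. On a text of this size it may be worth converting the list to a set first, since set lookups take roughly constant time. Here is a minimal sketch of the idea. The five-word stop list is a hand-written stand-in, not NLTK's actual list.

```python
# Hypothetical miniature stop list; the real one comes from nltk.corpus.stopwords
sample_stopwords = {"by", "the", "in", "for", "too"}

sample_words = ["digitized", "by", "the", "internet", "archive", "in", "for", "too", "ilosone"]

# Same comprehension as above, but the membership test runs against a set
sample_stopped = [w for w in sample_words if w not in sample_stopwords]
print(sample_stopped)
```

In the notebook itself, the equivalent change would be wrapping the NLTK list in `set(...)` before filtering.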

In [64]:
#check the words array, first fifty tokens
journalStoppedWords[:50]

['bd',
 'digitized',
 'internet',
 'archive',
 'https',
 '//archive.org/details/westvirginiamedi5619west',
 'ium',
 'propionyl',
 'erythromycin',
 'ester',
 'lilly',
 "'ey",
 'chjldre',
 'ilosone',
 'suspension',
 'propionyl',
 'erythromycin',
 'ester',
 'lauryl',
 'sulfate',
 'lilly',
 'deliciously',
 'flavored',
 'decisively',
 'effective',
 'exceptionally',
 'safe',
 '5-cc',
 'teaspoonful',
 'provides',
 'ilosone',
 'lauryl',
 'sulfate',
 'equivalent',
 'mg.',
 'erythromycin',
 'base',
 'activity',
 'supplied',
 'bottles',
 'cc',
 'eli',
 'lilly',
 'company',
 'indianapolis',
 'indiana',
 'u.',
 's.',
 'i960',
 'epilepsy']

### Type-Token Ratio

In [65]:
#create a set of only the unique words in the words list
journalUniqueWords = set(journalStoppedWords)

In [66]:
#number of unique words
len(journalUniqueWords)

28345

In [67]:
#number of words
len(journalStoppedWords)

278228

In [68]:
#ratio of unique words to total words
len(journalUniqueWords)/len(journalStoppedWords)

0.10187687795620858
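The three cells above can be folded into one small function. This is only a sketch: the name `type_token_ratio` and the sample list are assumptions made for illustration.

```python
def type_token_ratio(words):
    """Unique words (types) divided by total words (tokens)."""
    if not words:
        return 0.0  # avoid dividing by zero on an empty list
    return len(set(words)) / len(words)


# Six tokens, three of them distinct
sample = ["lilly", "erythromycin", "ester", "lilly", "erythromycin", "lilly"]
print(type_token_ratio(sample))

# The two counts reported above, divided directly
print(28345 / 278228)
```

A ratio near 1 means almost every word is new, while a ratio near 0 means heavy repetition. The value also tends to fall as a text gets longer, so it is best compared across texts of similar length.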

### Frequency Distribution

In [69]:
#create a table of the number of occurrences of each word in the words list
journalFreqTable = nltk.FreqDist(journalStoppedWords)

In [70]:
#display the top fifteen results as tabular data
journalFreqTable.tabulate(15)

medical   m.   d. virginia west  dr. state   j.  may meeting hospital medicine new association  w.
   3513 2676 2228     1928 1906 1626  1271 1147 1054    1037     1012      969 949         947 936
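`nltk.FreqDist` is built on Python's `collections.Counter`, so the same kind of count can be sketched with the standard library alone. The sample list below is invented for illustration and is not drawn from the journal.

```python
from collections import Counter

# Hypothetical sample; the notebook counts journalStoppedWords instead
sample_words = ["medical", "west", "virginia", "medical", "dr.", "west", "medical"]

sample_counts = Counter(sample_words)

# most_common returns (word, count) pairs, highest count first
top_two = sample_counts.most_common(2)
for word, count in top_two:
    print(word, count)
```

Tokens such as `m.`, `d.` and `dr.` rank highly in the real output because the tokenizer keeps abbreviations as words, which is one more reason to check results after each cleanup step.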


### Ngrams

In [71]:
#create a list of all four-grams in our words list
#since stop words are important in phrases, we will use the list that still contains them
fourGrams = list(nltk.ngrams(journalWords, 4))
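An n-gram is simply a run of n consecutive words. The sketch below shows one way to build them in plain Python, to make clear what kind of tuples the NLTK call produces. The function name `simple_ngrams` is hypothetical and is not how NLTK implements it.

```python
def simple_ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    # Line up the list with copies of itself shifted by 1, 2, ... n-1 positions
    return list(zip(*(words[i:] for i in range(n))))


sample_words = ["the", "west", "virginia", "medical", "journal"]
sample_grams = simple_ngrams(sample_words, 4)
for gram in sample_grams:
    print(" ".join(gram))
```

Five words yield two four-grams, because each new four-gram starts one word further along the list.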

In [72]:
#create a frequency distribution of these four grams
fourGramsFreqs = nltk.FreqDist(fourGrams)

In [73]:
#create an array of the most common phrases
mostCommon = fourGramsFreqs.most_common(40) #change number to decide number of most common to store

In [74]:
mostCommon

[(('the', 'west', 'virginia', 'medical'), 400),
 (('west', 'virginia', 'medical', 'journal'), 394),
 (('of', 'the', 'west', 'virginia'), 351),
 (('the', 'west', 'virginia', 'state'), 327),
 (('west', 'virginia', 'state', 'medical'), 324),
 (('virginia', 'state', 'medical', 'association'), 285),
 (('a', 'member', 'of', 'the'), 161),
 (('the', 'american', 'medical', 'association'), 138),
 (('university', 'school', 'of', 'medicine'), 113),
 (('annual', 'meeting', 'of', 'the'), 112),
 (('will', 'be', 'held', 'at'), 103),
 (('the', 'state', 'medical', 'association'), 101),
 (('of', 'the', 'state', 'medical'), 97),
 (('woman’s', 'auxiliary', 'to', 'the'), 96),
 (('his', 'm.', 'd.', 'degree'), 85),
 (('the', 'woman’s', 'auxiliary', 'to'), 83),
 (('of', 'the', 'american', 'medical'), 80),
 (('m.', 'd.', 'degree', 'from'), 79),
 (('of', 'the', 'woman’s', 'auxiliary'), 78),
 (('be', 'held', 'at', 'the'), 78),
 (('the', 'house', 'of', 'delegates'), 77),
 (('of', 'the', 'department', 'of'), 74),
 

## Stay tuned for more!