# Getting Started with Text Mining

To explain some of the core processes of text mining, I am going to highlight the [Natural Language Toolkit](http://www.nltk.org/) (NLTK), a Python library for processing natural language that also ships with example corpora. Python was designed to be both readable and extensible, and NLTK is a good example of that extensibility. After installing and importing it, you can very quickly perform some powerful quantitative text analysis. Links and instructions for installing Python and NLTK on your machine are provided under Resources below, as well as for installing the [Jupyter Notebook](http://jupyter.org/) browser-based scripting environment used for writing these scripts.

Note: There isn’t enough space here to explain the basics of coding in Python or all of the functions that will be used. There are good introductory resources for coding in Python, such as [Codecademy](https://www.codecademy.com/learn/python), which provides interactive tutorials starting with the basics, as well as robust [documentation](https://docs.python.org/3/) for understanding the various functions employed. None of the techniques here are particularly complicated, so you will likely be able to work with this notebook after working through the Codecademy tutorials.

### Resources

Here are resources for getting started:
    
[Downloading Python](https://wiki.python.org/moin/BeginnersGuide/Download)

[Beginner’s Guide to Python](https://wiki.python.org/moin/BeginnersGuide)

[Download Anaconda](https://www.continuum.io/downloads) for managing Python versions and add-ons, such as Jupyter (optional)

[Install Jupyter Notebook](http://jupyter.readthedocs.io/en/latest/install.html), a browser-based environment for script development

[Installing NLTK](http://www.nltk.org/install.html)

### Import dependent libraries

These libraries, which are prepackaged sets of modules and functions that provide specific functionality not present in the core Python language, are necessary for running the scripts that follow.

Note: You will need to install NLTK before you can import it. Instructions for doing so can be found [here](http://www.nltk.org/install.html).

In [2]:
#import libraries
import urllib.request
import nltk
import os

### Importing text from the web

You can import text directly from the full text version of the State Medical Society Journals using the following lines.

In [148]:
#assign the URL you want to a variable
#this is the URL for the plain text version of one volume of The West Virginia Medical Journal
#note: the sample output shown further down comes from volume 56 (westvirginiamedi5619west)
journalUrl = "https://archive.org/stream/westvirginiamedi8619west/westvirginiamedi8619west_djvu.txt"

In [149]:
#assign the plain text to a variable as a string
journalString = urllib.request.urlopen(journalUrl).read().decode()

### Saving text to file (optional)

In [150]:
#creates directory if it does not exist
directory = "C:/Users/tdahn/Documents/WestVirginiaTXT"
if not os.path.exists(directory):
    os.makedirs(directory)

In [151]:
urlEnd = journalUrl.find("west/")
journalTitle = journalUrl[(urlEnd + 5):]
journalTitle

'westvirginiamedi8619west_djvu.txt'
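
The slice above works because every identifier in this collection ends in "west". A minimal sketch of a more general approach, assuming only that the file name is the last segment of the URL path, uses the standard library's urllib.parse and posixpath modules. The helper name `filename_from_url` is hypothetical and not part of the original notebook.

```python
import posixpath
from urllib.parse import urlparse

def filename_from_url(url):
    """Return the last path segment of a URL (hypothetical helper)."""
    # urlparse separates the path from the scheme, host and any query string
    path = urlparse(url).path
    # posixpath.basename keeps everything after the final "/"
    return posixpath.basename(path)

journalUrl = "https://archive.org/stream/westvirginiamedi8619west/westvirginiamedi8619west_djvu.txt"
journalTitle = filename_from_url(journalUrl)
print(journalTitle)
```

This should give the same file name as the slice, without depending on the text of the identifier.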

In [152]:
header = journalString.find('<pre>') + len('<pre>')
footer = journalString.find('</pre>')
journalString = journalString[header:footer]

In [153]:
#the with statement closes the file automatically, even if the write fails
with open(directory + "/" + journalTitle, "w", encoding='utf-8') as f:
    f.write(journalString)

### Downloading .txt files of journal from a .txt file list of journal URLs

In [2]:
#read .txt file of URLs, file should contain the URL of each journal to be written to file
with open("C:/Users/tdahn/Desktop/WVurls.txt") as f:
    content = f.readlines()

In [3]:
#removes the line break (and any other surrounding whitespace) from each URL read into the content variable
#strip() leaves a URL untouched when there is no trailing line break, and blank lines are skipped
content = [line.strip() for line in content if line.strip()]

In [147]:
#creates a directory for writing the .txt files
directory = "C:/Users/tdahn/Documents/WestVirginiaTXT"
if not os.path.exists(directory):
    os.makedirs(directory)

#writes files to directory above removing the header and the footer and using the file name from IA for the local filename
for line in content:
    journalUrl = line
    journalString = urllib.request.urlopen(journalUrl).read().decode()
    header = journalString.find('<pre>') + len('<pre>')
    footer = journalString.find('</pre>')
    journalString = journalString[header:footer]
    urlEnd = journalUrl.find("west/")
    journalTitle = journalUrl[(urlEnd + 5):]
    with open(directory + "/" + journalTitle, "w", encoding='utf-8') as f:
        f.write(journalString)

### Tokenizing the text

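The following lines clean the Internet Archive header and footer from the raw text string. This is not covered in depth in the post, but is mentioned. If you already removed them in the optional saving step above, skip this cell: running it a second time trims a few characters from each end of the string. Cleaning up texts is always case by case and involves checking your results after each manipulation.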

In [58]:
#remove header and footer
header = journalString.find('<pre>') + len('<pre>')
footer = journalString.find('</pre>')
journalString = journalString[header:footer]
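
Because str.find returns -1 when a tag is missing, running the cell above on text that has already been stripped would slice off the first four characters and the last one. The following is a hedged sketch of a guard against that, not the notebook's own method; the function name `strip_pre_wrapper` and the sample page are made up for illustration.

```python
def strip_pre_wrapper(page):
    """Return the text between <pre> and </pre>, or the input unchanged
    when either tag is missing (hypothetical helper)."""
    start = page.find('<pre>')
    end = page.find('</pre>')
    # find() returns -1 when the tag is absent, so leave the text alone
    if start == -1 or end == -1:
        return page
    return page[start + len('<pre>'):end]

sample = "<html><pre>THE WEST VIRGINIA MEDICAL JOURNAL</pre></html>"  # made-up page
once = strip_pre_wrapper(sample)
twice = strip_pre_wrapper(once)  # a second pass leaves the text untouched
print(once)
print(once == twice)
```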

When digitized images are converted to text using optical character recognition (OCR), the result is most often output as plain text, which contains no formatting. Plain text contains only characters, so a computer cannot initially distinguish linguistic units such as punctuation, sentences or even words. An important part of working with texts is therefore to tokenize them, or break them into discrete entities that can be understood as separate units of writing or speech. Assuming we have the plain text of the journal read into a string as above, NLTK allows us to do this in one line (the lower() method is used to make all words lower case so that the computer treats them the same):

In [59]:
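#tokenize and make each word lowercase
#note: word_tokenize needs NLTK's tokenizer models; if this raises a LookupError,
    #run the nltk.download(...) call named in the error message once (for example nltk.download('punkt'))
journalTokens = nltk.word_tokenize(journalString.lower())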

In [60]:
#check the tokens array, first fifty tokens
journalTokens[:50]

['•',
 "'",
 'bd',
 '?',
 '.',
 '5.05',
 'digitized',
 'by',
 'the',
 'internet',
 'archive',
 'in',
 '2016',
 'https',
 ':',
 '//archive.org/details/westvirginiamedi5619west',
 'm',
 'now',
 'ium',
 '(',
 '£',
 ')',
 '(',
 'propionyl',
 'erythromycin',
 'ester',
 ',',
 'lilly',
 ')',
 "'ey",
 'for',
 'chjldre',
 'too',
 '!',
 'ilosone',
 '125',
 'suspension',
 '(',
 'propionyl',
 'erythromycin',
 'ester',
 'lauryl',
 'sulfate',
 ',',
 'lilly',
 ')',
 'deliciously',
 'flavored',
 'decisively',
 'effective']

Note that there are many tokens in the list that are not words. The following line removes any token that does not contain at least one alphabetic character. Further cleanup will be necessary: for example, the "digitized by" stamp is still at the beginning of the text, and words broken by a column margin need to be re-concatenated, but for the sake of demonstration we will consider these results acceptable.

In [61]:
#keep only the tokens that contain at least one alphabetic character
journalWords = [word for word in journalTokens if any(char.isalpha() for char in word)]

Note that in the example above the punctuation is treated as its own token. Many plain text documents need to be cleaned up before being used, including the removal of any header or footer information. Additionally, the removal of non-word tokens such as punctuation may be necessary, as well as the removal of stop words, which are a predetermined set of commonly occurring words deemed unimportant and potentially misleading in natural language processing. Since this process varies from corpus to corpus, it is outside of the scope of this post, but it is necessary in order to ensure accurate results, and is demonstrated in the code I am making available.
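
To make the two filtering steps concrete, here is a small sketch that runs them on a handful of tokens copied from the output above. It uses only built-in Python, and `toy_stopwords` is a tiny hand-picked list assumed for illustration; NLTK's English list is much longer.

```python
# A few tokens copied from the tokenizer output shown earlier
tokens = ['(', 'propionyl', 'erythromycin', 'ester', ',', 'lilly', ')',
          'for', 'chjldre', 'too', '!', 'ilosone', '125', 'suspension']

# Tiny hand-picked stop list, for illustration only
toy_stopwords = {'for', 'too', 'the', 'by', 'in'}

# Keep tokens that contain at least one alphabetic character
words = [t for t in tokens if any(ch.isalpha() for ch in t)]

# Drop stop words; a set keeps each membership test fast
stopped = [w for w in words if w not in toy_stopwords]

print(words)
print(stopped)
```

The first pass drops the punctuation and the bare number, and the second drops 'for' and 'too'. The OCR error 'chjldre' survives both, which is why checking results after each step matters.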

In [62]:
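#load stopwords list from NLTK (run nltk.download('stopwords') once if the list is not yet installed)
#the list is stored under the lower case name "english"; a set makes the lookups below much faster
stopwords = set(nltk.corpus.stopwords.words("english"))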

In [63]:
#remove stopwords from journalWords
journalStoppedWords = [word for word in journalWords if word not in stopwords]

In [64]:
#check the words array, first fifty tokens
journalStoppedWords[:50]

['bd',
 'digitized',
 'internet',
 'archive',
 'https',
 '//archive.org/details/westvirginiamedi5619west',
 'ium',
 'propionyl',
 'erythromycin',
 'ester',
 'lilly',
 "'ey",
 'chjldre',
 'ilosone',
 'suspension',
 'propionyl',
 'erythromycin',
 'ester',
 'lauryl',
 'sulfate',
 'lilly',
 'deliciously',
 'flavored',
 'decisively',
 'effective',
 'exceptionally',
 'safe',
 '5-cc',
 'teaspoonful',
 'provides',
 'ilosone',
 'lauryl',
 'sulfate',
 'equivalent',
 'mg.',
 'erythromycin',
 'base',
 'activity',
 'supplied',
 'bottles',
 'cc',
 'eli',
 'lilly',
 'company',
 'indianapolis',
 'indiana',
 'u.',
 's.',
 'i960',
 'epilepsy']

### Well . . . what next?

Now that our text is ready to be worked with, what do we do with it? The answer ultimately depends on the goals of the researcher and the corpora available, but I’d like to show how we can very quickly start to draw some conclusions across texts. A very preliminary examination involves the type-token ratio, which is simply the ratio of types (unique tokens) to total tokens. In our case a token is a word and a type is a unique word, so this ratio gives a quick measure of the vocabulary variation of a particular work. There are more complicated considerations, such as using the part of speech of a word to distinguish between homographs such as bass (fish) versus bass (instrument), but for the sake of this post we will keep it simple.
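
As a worked example of the ratio, the sketch below uses a made-up eight-word list. It contains six unique words, so the ratio is 6 / 8 = 0.75. The function name `type_token_ratio` is an assumption for illustration.

```python
def type_token_ratio(tokens):
    """Ratio of unique tokens (types) to total tokens; 0.0 for an empty list."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

toy_tokens = ['the', 'patient', 'saw', 'the', 'doctor', 'and', 'the', 'nurse']
print(len(set(toy_tokens)), len(toy_tokens))  # types, tokens
print(type_token_ratio(toy_tokens))
```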

### Type-Token Ratio

In [65]:
#create a set of only the unique tokens in the words list
journalUniqueWords = set(journalStoppedWords)

In [66]:
#number of unique words
len(journalUniqueWords)

28345

In [67]:
#number of words
len(journalStoppedWords)

278228

In [68]:
#ratio of unique words to total words
len(journalUniqueWords)/len(journalStoppedWords)

0.10187687795620858

This number may not mean much to us by itself, but doing this calculation on a number of texts across a corpus, or even across corpora, allows us to start to ask, and maybe answer, some questions. Perhaps we might expect vocabulary size to increase over the 20th century in medical journals due to the increased specialization in the medical profession over that time. Looking at this calculation over time might help us support or reject that hypothesis.

Another quick and easy, but powerful, analysis that NLTK allows us to do is known as frequency distribution. As a statistical concept, a frequency distribution is a tabular display of how often each outcome occurs in a dataset. For our purposes here it represents the number of times each word occurs in the text.

### Frequency Distribution

In [69]:
#creates a table of the number of occurrences of each word in the words list
journalFreqTable = nltk.FreqDist(journalStoppedWords)

In [70]:
#display the top fifteen results as tabular data
journalFreqTable.tabulate(15)

medical   m.   d. virginia west  dr. state   j.  may meeting hospital medicine  new association   w. 
3513 2676 2228 1928 1906 1626 1271 1147 1054 1037 1012  969  949  947  936 
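
nltk.FreqDist is built on Python's collections.Counter, so the counting it performs can be sketched with the standard library alone. The word list below is a toy sample, not data from the journal.

```python
from collections import Counter

# Toy sample standing in for journalStoppedWords
words = ['medical', 'virginia', 'medical', 'west', 'virginia', 'medical']

freq = Counter(words)

# Look up the count for a single word
print(freq['medical'])

# The two most frequent words with their counts, highest first
print(freq.most_common(2))
```

The same lookup and most_common calls work on a FreqDist, which adds conveniences such as tabulate on top.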


### Ngrams

We can also look at frequency distributions for multiple-word sequences, known as [n-grams](https://en.wikipedia.org/wiki/N-gram).
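
To show what an n-gram is, here is a minimal pure-Python sketch that slides a window of n words across a list. It is a stand-in for illustration rather than NLTK's own implementation, and the name `simple_ngrams` is hypothetical.

```python
def simple_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip over n copies of the list, each shifted one position further;
    # zip stops as soon as the shortest copy runs out
    return list(zip(*(tokens[i:] for i in range(n))))

words = ['the', 'west', 'virginia', 'medical', 'journal']
print(simple_ngrams(words, 4))
```

Five words yield two four-grams, each overlapping the next by three words.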

In [71]:
#create a list of all four-grams in our words list
#since the stop words are important in phrases, we will use the list that still contains them
fourGrams = list(nltk.ngrams(journalWords, 4))

In [72]:
#create a frequency distribution of these four grams
fourGramsFreqs = nltk.FreqDist(fourGrams)

In [73]:
#create an array of the most common phrases
mostCommon = fourGramsFreqs.most_common(40) #change number to decide number of most common to store

In [74]:
mostCommon

[(('the', 'west', 'virginia', 'medical'), 400),
 (('west', 'virginia', 'medical', 'journal'), 394),
 (('of', 'the', 'west', 'virginia'), 351),
 (('the', 'west', 'virginia', 'state'), 327),
 (('west', 'virginia', 'state', 'medical'), 324),
 (('virginia', 'state', 'medical', 'association'), 285),
 (('a', 'member', 'of', 'the'), 161),
 (('the', 'american', 'medical', 'association'), 138),
 (('university', 'school', 'of', 'medicine'), 113),
 (('annual', 'meeting', 'of', 'the'), 112),
 (('will', 'be', 'held', 'at'), 103),
 (('the', 'state', 'medical', 'association'), 101),
 (('of', 'the', 'state', 'medical'), 97),
 (('woman’s', 'auxiliary', 'to', 'the'), 96),
 (('his', 'm.', 'd.', 'degree'), 85),
 (('the', 'woman’s', 'auxiliary', 'to'), 83),
 (('of', 'the', 'american', 'medical'), 80),
 (('m.', 'd.', 'degree', 'from'), 79),
 (('of', 'the', 'woman’s', 'auxiliary'), 78),
 (('be', 'held', 'at', 'the'), 78),
 (('the', 'house', 'of', 'delegates'), 77),
 (('of', 'the', 'department', 'of'), 74),
 

We can see that many of the most common four-grams here are fairly obvious, and as such not particularly telling. By carefully preparing your texts and corpora, and by deciding which words should be appended to the stop words list, you can obtain more meaningful results.
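
One possible way to act on this, sketched under stated assumptions: extend the stop list with corpus-specific words and drop any four-gram made up entirely of stop words. The counts are copied from the output above; `base_stopwords` is a tiny illustrative list and `extra_stopwords` is a hypothetical choice, not a recommendation.

```python
from collections import Counter

# Four-gram counts copied from the output above
four_gram_counts = Counter({
    ('the', 'west', 'virginia', 'medical'): 400,
    ('a', 'member', 'of', 'the'): 161,
    ('will', 'be', 'held', 'at'): 103,
    ('the', 'house', 'of', 'delegates'): 77,
})

# Small illustrative stop list; in the notebook this would be NLTK's English list
base_stopwords = {'the', 'a', 'of', 'will', 'be', 'at'}
# Hypothetical corpus-specific additions: words from the journal's own name
extra_stopwords = {'west', 'virginia', 'medical', 'journal'}
all_stopwords = base_stopwords | extra_stopwords

# Keep a four-gram only if at least one of its words is not a stop word
informative = Counter({
    gram: count for gram, count in four_gram_counts.items()
    if not all(word in all_stopwords for word in gram)
})

print(informative.most_common())
```

Here the journal's own name drops out while phrases containing a content word remain. Which words belong in the extended list is a judgment call that depends on the research question.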

### Importing a corpus from a directory

The processes above are all performed on one text for the sake of demonstration. Below, texts are read from a directory, then tokenized and cleaned. The last step of writing the clean versions to another directory is optional, but it saves you from having to repeat this process every time you plan to work.

NOTE: In order to use any of the processes above at the corpus level, you will need to use a for loop as well as the dictionary.items() method.
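
As a rough sketch of that pattern, the loop below computes a type-token ratio and the most frequent word for each volume in a dictionary. The dictionary `toyWordsDict` and its contents are made up for illustration, and collections.Counter stands in for nltk.FreqDist so the example runs without NLTK.

```python
from collections import Counter

# Toy stand-in for a dictionary mapping each volume name to its list of words
toyWordsDict = {
    'volume_a': ['medical', 'journal', 'medical', 'meeting'],
    'volume_b': ['hospital', 'medical', 'state', 'state', 'state'],
}

ratioDict = {}
topWordDict = {}

for volume, words in toyWordsDict.items():
    # type-token ratio for this volume
    ratioDict[volume] = len(set(words)) / len(words)
    # most frequent word for this volume, as a (word, count) pair
    topWordDict[volume] = Counter(words).most_common(1)[0]

print(ratioDict)
print(topWordDict)
```

Storing each result under the volume name keeps the per-volume values easy to compare or sort later.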

In [3]:
#reads files from a directory into a dictionary element with filename as key and text string as value
directory = "C:/Users/tdahn/Documents/WestVirginiaTXT"

stringDict = {}

for filename in os.listdir(directory):
    fileString = os.path.splitext(filename)[0]
    #the with statement closes each file automatically after it is read
    with open(directory + "/" + filename, 'r', encoding="utf8") as text:
        stringDict[fileString] = text.read()

In [4]:
#tokenize texts and save to new dictionary
#!!!this will take some time depending on the size of your corpus!!!
tokenDict = {}

for volume, text in stringDict.items():
    tokens = nltk.word_tokenize(text.lower())
    tokenDict[volume] = tokens

In [5]:
#remove non-words and save to new dictionary
wordsDict = {}

for volume, text in tokenDict.items():
    textWords = [word for word in text if any(char.isalpha() for char in word)]
    wordsDict[volume] = textWords

In [6]:
#concatenate hyphenated words (necessary because of columns in journals) and save to new dictionary
fullWordsDict = {}

for volume, text in wordsDict.items():

    textTokens = []
    i = 0

    while i < len(text):
        word = text[i]
        #a token ending in a hyphen was split at a line break, so join it to the token that follows
        #the inner loop repeats in case the joined token also ends in a hyphen
        while word.endswith('-') and i + 1 < len(text):
            i += 1
            word = word[:-1] + text[i]
        textTokens.append(word)
        i += 1

    fullWordsDict[volume] = textTokens

In [8]:
#remove stopwords and save to new dictionary
stoppedWordsDict = {}
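#the list is stored under the lower case name "english"; a set makes the lookups below much faster
stopwords = set(nltk.corpus.stopwords.words("english"))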

for volume, text in fullWordsDict.items():
    journalStoppedWords = [word for word in text if word not in stopwords]
    stoppedWordsDict[volume] = journalStoppedWords

In [9]:
#check that all documents are in dictionary
print ("Length : %d" % len(stoppedWordsDict))

Length : 84


In [10]:
#write cleaned up text to new directory
directory = "C:/Users/tdahn/Documents/WestVirginiaTXTClean"
if not os.path.exists(directory):
    os.makedirs(directory)
    
for volume, text in fullWordsDict.items():
    #join the words of each volume back into a single space-separated string
    cleanString = ' '.join(text)
    with open(directory + "/" + volume + ".txt", "w", encoding='utf-8') as f:
        f.write(cleanString)

## Stay tuned for more!

I will be updating this notebook periodically and will make an announcement via our Twitter account [@CPPHistMedLib](https://twitter.com/CPPHistMedLib).

In coming sections I would like to explore more how to work across full corpora, as well as to introduce more complicated concepts such as [sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis), [topic modeling](https://en.wikipedia.org/wiki/Topic_model), [document similarity](https://en.wikipedia.org/wiki/Semantic_similarity) using [vector space modeling](https://en.wikipedia.org/wiki/Vector_space_model), and [data visualization](https://en.wikipedia.org/wiki/Data_visualization).

Please feel free to e-mail me directly at [tdahn@collegeofphysicians.org](mailto:tdahn@collegeofphysicians.org) with any questions, comments or corrections. Or fork this on GitHub!