# I.  Basic Text Mining and Analysis:  Introduction

Textual analysis goes by a variety of names.  On the technical side, people often refer to Text Analysis, Text Mining, or Natural Language Processing.  Cultural heritage professionals and humanists also frequently refer to these methods as Distant Reading (appropriating a term from Franco Moretti, who used it to describe the process of analyzing textual metadata) or Macroanalysis (a term derived from Matthew Jockers's book of the same title, which introduced various methods of textual analysis using the R language for statistical programming).

In fact, the technical terms Text Analysis, Text Mining, and Natural Language Processing have slightly different meanings:


1.   **Text Mining** refers to the process of computationally "reading" a text (or collection of texts) and extracting specific chunks of information.  It encapsulates processes that convert unstructured text to structured data.
2.   **Text Analysis** refers to processes, such as basic descriptive statistical methods or more advanced machine learning methods, that analyze the information contained in texts.  Text Analysis is frequently performed on the structured data that results from Text Mining operations, but it can also be performed directly on unstructured texts.  Text Analysis processes typically return summary data, such as lists of all references to particular dates, persons, and topics or summary statistics regarding word or phrase usage and frequency.  
3.   **Natural Language Processing** (NLP) describes a particular subset of Text Mining and Text Analysis processes that utilize the grammatical and semantic structures associated with natural languages as part of the mining and analysis process.  Processes that do not account for the naturalness of language, such as word frequency analysis, cannot properly be considered Natural Language Processing.
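
To make the first two definitions concrete, here is a minimal sketch that uses only the Python standard library.  The sample sentence and the choice to "mine" four-digit years are assumptions made for illustration; they are not part of the course data.

```python
import re
from collections import Counter

# Hypothetical sample text (not from the course data)
sample_text = (
    "The ship left Nantucket in 1841. "
    "It returned in 1844. Another ship also sailed in 1841."
)

# Text Mining: extract specific chunks (four-digit years) from unstructured
# text and store them as structured records
year_records = [
    {"year": int(match.group()), "position": match.start()}
    for match in re.finditer(r"\b\d{4}\b", sample_text)
]

# Text Analysis: summarize the structured data produced by the mining step
year_counts = Counter(record["year"] for record in year_records)

print(year_records)
print(year_counts.most_common())
```

Neither step uses any knowledge of grammar or meaning, so by the third definition this sketch is not Natural Language Processing.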

This self-study module introduces basic skills and methods of text mining and analysis, focusing mostly, but not exclusively, on analytic methods rooted in descriptive statistical analysis.  As the name implies, Descriptive Statistical methods describe the basic features of the data being examined, providing simple but powerful summaries about the data.
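
As a small illustration of what a descriptive summary looks like, the sketch below summarizes a list of sentence lengths with Python's built-in statistics module.  The numbers are invented for the example.

```python
import statistics

# Hypothetical sentence lengths, measured in words (invented for illustration)
sentence_lengths = [4, 12, 7, 31, 9, 12, 5]

# Descriptive statistics summarize the data without modeling or predicting
summary = {
    "count": len(sentence_lengths),
    "mean": statistics.mean(sentence_lengths),
    "median": statistics.median(sentence_lengths),
    "min": min(sentence_lengths),
    "max": max(sentence_lengths),
}

print(summary)
```

Note how the single long sentence pulls the mean above the median; comparing the two is a quick way to spot outliers.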


# II.  Python Packages for Text Mining and Analysis

A wide variety of Python packages and modules are available for performing text mining and analysis.  When executed, the code cells below will load those necessary to perform the activities presented in this course module.  Comments in the code identify each of the packages being loaded.  In each case, you can refer to the package documentation for more specific information about the package being used.  You must run these code cells to properly prepare your environment to perform the text mining and analysis tasks presented in this module.

In [None]:
# update Colab environment to latest version of NLTK
# documentation: https://www.nltk.org/
!pip install nltk -U

In [None]:
# import the base nltk package
import nltk

# import the nltk text module
from nltk import text

# import the nltk sentence tokenizer
from nltk.tokenize import sent_tokenize

# import the nltk word_tokenize module
from nltk import word_tokenize

# import the nltk corpus reader module
from nltk.corpus import PlaintextCorpusReader

# import the nltk probability module
from nltk import probability

# import the nltk bigrams module
from nltk import bigrams

# import the nltk tag.pos_tag module
from nltk.tag import pos_tag

# import nltk.chunk modules 
from nltk.chunk import conlltags2tree, tree2conlltags

# download the nltk POS tagger model
nltk.download('averaged_perceptron_tagger')

# import the nltk dispersion_plot module
from nltk.draw.dispersion import dispersion_plot

# download the nltk named entity chunker
nltk.download('maxent_ne_chunker')

# download the nltk words library (for chunking)
nltk.download('words')

# download the punkt sentence tokenizer models for nltk
# documentation: https://www.kite.com/python/docs/nltk.punkt
nltk.download('punkt')

# import the pprint package
# documentation: https://docs.python.org/3/library/pprint.html
from pprint import pprint

# import collections python module
# documentation: https://docs.python.org/3/library/collections.html
import collections

# import the collections package counter module
from collections import Counter

# import the pandas package
# documentation: https://pandas.pydata.org/docs/
import pandas

# import the networkx package
# documentation: https://networkx.org/documentation/stable/index.html
import networkx

# import the python regular expression package
# documentation: https://docs.python.org/3/library/re.html
import re

# import Spacy NLP package 
# documentation: https://spacy.io/
import spacy

# import the Spacy display module
from spacy import displacy

# import the small English core web language model (en_core_web_sm) for Spacy
# documentation: https://spacy.io/usage/models
import en_core_web_sm

# import the matplotlib package
import matplotlib.pyplot as plt

# III.  Load a Working File

There are various ways of loading files for text mining and analysis.  Many text mining packages perform multiple modes of initial processing on files as they are being loaded.  In this module, we will directly load the contents of a single file as our working file without performing any pre-processing or load-time processing.  In future modules we will introduce other methods of loading collections of text files as a "corpus" for analysis.

Before you can load a file for analysis, you must mount your Google Drive in this environment.

In [None]:
from google.colab import drive
drive.mount('/gdrive/')

Once your Google Drive has successfully mounted, you can open one of the sample data files provided for the course or a file of your own that you have placed in the “data_my” directory of the Course Home Directory:

1.   To load a course sample file, you can simply run the code cell below.
2.   To load a text (ASCII) file of your own, place a hash mark (#) in front of the line that points to the file at "/gdrive/MyDrive/rbs_digital_approaches_2021/data_class/melville.txt", replace the "\<name_of_your_file.txt\>" substring in the line that reads "/gdrive/MyDrive/rbs_digital_approaches_2021/data_my/\<name_of_your_file.txt\>" with the name of your file, uncomment that line, and then run the cell.




In [None]:
working_file_path = "/gdrive/MyDrive/rbs_digital_approaches_2021/data_class/melville.txt"
#working_file_path = "/gdrive/MyDrive/rbs_digital_approaches_2021/data_my/<name_of_your_file.txt>"

Now that you've defined a file to load, we can open the file and read its contents into a string variable.

In [None]:
# open a text file for processing; the "with" block closes the file
# automatically once its contents have been read
with open(working_file_path, "r") as working_file:
    # read the file contents into a string variable
    working_text = working_file.read()

You can check that your file loaded by checking the length and examining the opening characters of the working_text variable.

In [None]:
# print the character length of our working text
print('Characters in string:', len(working_text))

In [None]:
# look at the first 200 characters of the string
print(working_text[0:200:1])

# IV. Preliminary Analysis (Data Forensics)

Whenever you begin a new text or data analysis process, the first thing that you should do is perform some preliminary analysis.  This first step, known as Data Forensics, is frequently overlooked, but it is crucial to helping us understand the actual state of our data and, more importantly, the extent and nature of the cleaning and preparation that we need to do in order to ensure that our planned mining and analysis return valid results.  

The remainder of this workbook is dedicated to performing various modes of preliminary analysis.  As you perform these forensics, take note of any anomalies you see in the textual data.  The point of the preliminary analysis is to reveal the types and extent of text preparation that will have to be performed prior to analysis.  Are there parts of the text that you would want to remove before analysis?  Obvious errors in OCR or transcription that might need to be cleaned?  Things that are being misinterpreted by the computer that might affect future analysis?  All of this is crucial information.  As you work through the notebook, there are many prompts to look at portions of the data.  Don’t be afraid to change the parameters of these prompts and gain other views of the data.  

Exploration and discovery are the purpose of this exercise.  Our group activity in the next class meeting will be to develop and implement a cleaning strategy based on what you and your classmates find and document in this activity. So take good notes and be prepared to share them at our next meeting.
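
One simple forensic check that is not part of the steps below, but that you may find useful, is to take an inventory of unusual characters.  The sketch below uses only the standard library and an invented sample string containing a ligature, curly quotes, a form feed, and a digit standing in for a letter.  To try it on your own data, you would replace sample_text with the working_text variable.

```python
import re
from collections import Counter

# Hypothetical sample with typical OCR/transcription problems (invented)
sample_text = (
    "The \ufb01rst \u201cwhale\u201d was seen.\x0cCHAPTER 2\nThe wha1e dived."
)

# Count characters that are not plain ASCII, plus control characters
# other than newline and tab
suspicious = Counter(
    ch for ch in sample_text
    if ord(ch) > 127 or (ord(ch) < 32 and ch not in "\n\t")
)

# Find tokens that mix letters and digits, a common sign of OCR errors
mixed_tokens = re.findall(r"\b[A-Za-z]+\d+[A-Za-z]*\b", sample_text)

for ch, count in suspicious.most_common():
    print(repr(ch), hex(ord(ch)), count)
print(mixed_tokens)
```

A character flagged here is not necessarily an error (accented letters are legitimate, for example), so treat the output as a list of things to look at rather than things to delete.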


# V.  Chunking and Tokenizing

First, let's do some chunking.  **Chunking** (yes, that's the real, technical term) is a process of breaking a text into constituent parts, such as paragraphs, sentences, or phrases.  Here, we'll chunk into sentences.

Note that while this is primarily a module on text mining and analysis, sentence chunking is actually a Natural Language Processing operation.  Here, the sent_tokenize() function relies on the Punkt English sentence model that we downloaded during our environment setup to apply rules for sentence formation and representation in the language (natural language information) to chunk the text into a list of sentences.

In [None]:
# tokenize the text by sentence
sentence_list = sent_tokenize(working_text)

Before we proceed, let's look at our sentence tokens to make sure the process worked.

In [None]:
# print the length of the sentence_list list
print('Sentences in text:', len(sentence_list))

Sentences in text: 10099


In [None]:
# look at the first ten sentences
print(sentence_list[0:10:1])
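
To see why sentence chunking needs language knowledge, it can help to compare it with a naive rule.  The sketch below is not how sent_tokenize() works; it is a deliberately simple splitter, written for contrast, that ends a sentence at every period, exclamation point, or question mark followed by whitespace.  The sample text is invented.

```python
import re

# Hypothetical sample containing abbreviations (invented for illustration)
sample_text = (
    "Mr. Starbuck spoke first. Capt. Ahab said nothing. The ship sailed on."
)

# Naive rule: a sentence ends at ., !, or ? followed by whitespace
naive_sentences = re.split(r"(?<=[.!?])\s+", sample_text)

print(len(naive_sentences))
for sentence in naive_sentences:
    print(repr(sentence))
```

The sample contains three sentences, but the naive rule returns five pieces because it breaks after "Mr." and "Capt.".  When you review your own sentence list, abbreviations, initials, and chapter headings are good places to look for similar mistakes.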

Now let's also **tokenize** on individual words.  Like chunking, tokenizing is a process of splitting the text into a list of constituent parts, in this case, words.  Since words are a minimal semantic unit, we call this tokenization rather than chunking.

In [None]:
# tokenize the text by word
word_tokens = word_tokenize(working_text)

And, again, we'll examine the results.

In [None]:
# print the length of the word_tokens list
print('Words in text:', len(word_tokens))

In [None]:
# look at an arbitrary selection of words
print(word_tokens[500:1000:10])
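
When you read the token count above, keep in mind that word_tokenize() returns punctuation marks as tokens of their own, so the count is larger than the number of words.  The sketch below illustrates the idea with the standard library only; the regular expression is a rough stand-in chosen for this example and is not the rule set that NLTK uses.

```python
import re

# Hypothetical sample sentence pair (invented for illustration)
sample_text = "Call me Ishmael. Some years ago, I went to sea."

# Splitting on whitespace leaves punctuation attached to words
whitespace_tokens = sample_text.split()

# A rough tokenizer: runs of word characters (allowing internal apostrophes)
# or single punctuation marks
regex_tokens = re.findall(r"\w+(?:'\w+)*|[^\w\s]", sample_text)

# Keep only the tokens that are made up entirely of letters
words_only = [token for token in regex_tokens if token.isalpha()]

print(len(whitespace_tokens), whitespace_tokens)
print(len(regex_tokens), regex_tokens)
print(len(words_only))
```

The same filtering idea (keeping only tokens for which isalpha() is true) is one option to consider when you plan your cleaning strategy.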

# VI.  Frequency Distributions

Now that we have our text chunked and tokenized, we can analyze some frequency distributions of words across the text.

In [None]:
# create a frequency distribution using NLTK
freq_dist = nltk.probability.FreqDist(word_tokens)

In [None]:
# Look at the most frequent words
print(freq_dist.most_common(n=100))

We can also plot our frequency distribution:

In [None]:
# set the size of the plot
plt.figure(figsize=(12, 9))
# plot the cumulative frequency distribution of the 50 most frequent tokens
freq_dist.plot(50, cumulative=True)

Now that we've dived more deeply into the words in our text, we might want to also take some time to examine particular words of interest by retrieving the count for our word of interest.

In [None]:
# count the occurrences of a word of interest
word_tokens.count('whale')
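
Note that the counts above are exact matches: 'whale' and 'Whale' are counted as different tokens, and punctuation marks are counted as well.  The sketch below shows the effect on a short, invented token list using collections.Counter (NLTK's FreqDist is built on the same class).

```python
from collections import Counter

# Hypothetical token list (invented for illustration)
tokens = ["The", "whale", ",", "the", "Whale", "and", "the", "whale", "."]

# Exact-match counts: case-sensitive, punctuation included
counts = Counter(tokens)
exact_count = counts["whale"]

# Case-folded counts over alphabetic tokens only
lowered = Counter(token.lower() for token in tokens if token.isalpha())
folded_count = lowered["whale"]

# Relative frequency: share of all alphabetic tokens
relative_frequency = folded_count / sum(lowered.values())

print(exact_count, folded_count, round(relative_frequency, 3))
```

Whether to fold case before counting is a decision for your cleaning strategy; folding merges sentence-initial words with the rest, but it also merges proper nouns with common nouns.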

# VII.  Key Word in Context (KWIC)

It can also be useful to examine the context in which particular words appear.  For example, we might have prior knowledge about the importance of a word of interest, or we might have seen something earlier in our analysis that prompts us to want to look deeper into a particular word.  To accomplish this, we first create a concordance for the text.  A concordance is an index that tracks the location in the text of every occurrence of every word.



In [None]:
# create a concordance object
obj_concord = nltk.text.ConcordanceIndex(word_tokens)

Once we have a concordance, we can query it for a word of interest and return a designated number of characters on either side of each of a designated number of occurrences.

In [None]:
obj_concord.print_concordance("whale", width=80, lines=25)
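
If you are curious about what a concordance looks like on the inside, the sketch below builds a small one with the standard library.  It is an illustration of the idea rather than a copy of NLTK's implementation: the kwic() helper is a hypothetical function written for this example, and its window is measured in tokens, whereas the width parameter above is measured in characters.

```python
from collections import defaultdict

# Hypothetical token list (invented for illustration)
tokens = "the white whale swam and the whale dived below the ship".split()

# Build the index: each word maps to the list of offsets where it occurs
index = defaultdict(list)
for offset, token in enumerate(tokens):
    index[token.lower()].append(offset)

def kwic(word, window=2):
    """Return each occurrence of word with `window` tokens of context."""
    lines = []
    for offset in index.get(word.lower(), []):
        left = " ".join(tokens[max(0, offset - window):offset])
        right = " ".join(tokens[offset + 1:offset + 1 + window])
        lines.append(f"{left} [{tokens[offset]}] {right}")
    return lines

for line in kwic("whale"):
    print(line)
```

Because the index stores offsets for every word, looking up a new word of interest does not require scanning the text again.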


# VIII.  Word Cooccurrence Networks (n-grams)

Let's also take some time to do some preliminary analysis of which words tend to cooccur in the text.  For our preliminary analysis, we'll look only at bigrams, which are pairs of words that appear next to each other in the text.  Counting the bigrams then shows us which pairs appear together most frequently.

In [None]:
# create a list of bigrams
bigram_list = list(bigrams(word_tokens))

In [None]:
# create a count of unique bigrams
bigram_counts = collections.Counter(bigram_list)
print(bigram_counts)
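
The idea behind the two cells above can be sketched with the standard library alone: pairing a token list with a copy of itself shifted by one position produces every adjacent pair.  The token list below is invented for illustration.

```python
from collections import Counter

# Hypothetical token list (invented for illustration)
tokens = ["the", "white", "whale", "and", "the", "white", "ship"]

# Pair each token with the token that follows it
bigram_list = list(zip(tokens, tokens[1:]))

# Count how often each distinct pair occurs
bigram_counts = Counter(bigram_list)

print(len(tokens), len(bigram_list))
print(bigram_counts.most_common(3))
```

A list of n tokens always yields n - 1 bigrams, which gives you a quick sanity check on the length of your own bigram list.  Printing the full Counter for a long text produces a great deal of output, so most_common() with a small number is usually easier to read.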

Now that we have a list of the bigrams that appear in the text, each tagged with its frequency, we can work on plotting a network graph to represent the top bigrams.  We'll do this using the Pandas and Networkx packages.

The Networkx function that we will use to build the network graph expects to receive data in the form of a DataFrame.  A DataFrame is a spreadsheet-like data structure made up of columns, each of which is a field (or a variable in statistical language), and rows, each of which represents a single item (or an observation in statistical language).

At present, our bigram data is in the form of a Counter, which maps each bigram to its count.  Happily, the Pandas package has functions for creating and working with DataFrames, so we'll use Pandas to convert the data into a DataFrame and then send that DataFrame to Networkx to build our graph.


In [None]:
# create an empty pandas DataFrame
bigram_df = pandas.DataFrame(data=None, columns=['source', 'target', 'weight'])


In [None]:
# add top bigram items to the dataframe
for x, z in bigram_counts.most_common(20):
  bigram_df.loc[len(bigram_df.index)] = [x[0], x[1], z] 

In [None]:
# create the nodegraph using Networkx
net_graph = networkx.from_pandas_edgelist(bigram_df, source='source', target='target', edge_attr='weight')
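
It may help to picture what the graph holds before drawing it.  The sketch below represents the same kind of structure with plain Python dictionaries: every word is a node, every bigram is an edge, and the bigram count is the weight of the edge.  The bigram counts are invented, and the nested-dictionary layout is an assumption chosen for illustration, not a description of Networkx internals.

```python
from collections import defaultdict

# Hypothetical (bigram, count) pairs in the shape returned by
# Counter.most_common() (invented for illustration)
top_bigrams = [(("of", "the"), 5), (("in", "the"), 3), (("the", "whale"), 2)]

# Undirected weighted graph as nested dicts: node -> neighbor -> weight
adjacency = defaultdict(dict)
for (source, target), weight in top_bigrams:
    adjacency[source][target] = weight
    adjacency[target][source] = weight

nodes = sorted(adjacency)
edge_count = sum(len(neighbors) for neighbors in adjacency.values()) // 2

print(nodes)
print(edge_count)
print(adjacency["the"])
```

Because the edges are undirected, the order of the words within a bigram is not preserved; as far as we are aware, the graph built in the cell above is also undirected by default, so keep that in mind when reading the plots.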

And now we're ready to draw the plot to screen.

In [None]:
# set the size of the plot
plt.figure(figsize=(12, 9))
# draw the graph as a force directed graph
networkx.draw_networkx(net_graph, with_labels=True, font_size=24)

We can also draw the network as a circular rather than force directed graph.

In [None]:
# set the size of the plot
plt.figure(figsize=(12, 9))
# draw the graph as a circular graph
networkx.draw_circular(net_graph, with_labels=True, font_size=24)

# IX. Word Occurrence Dispersion Plot

We can also plot the dispersion of word occurrences across the narrative time of the text.  This allows us to see how different words function in the text.

In [None]:
# set the size of the plot
plt.figure(figsize=(12, 9))
# define the words you want to plot
targets=['whale', 'ahab', 'ship', 'light', 'water', 'I']
#draw the plot
dispersion_plot(word_tokens, targets, ignore_case=True, title='Lexical Dispersion Plot')
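
The data behind a dispersion plot is simply the list of token offsets at which each target word occurs.  The sketch below computes those offsets for a short, invented token list using the standard library; it illustrates the idea and is not NLTK's plotting code.

```python
# Hypothetical token list and target words (invented for illustration)
tokens = "Ahab saw the whale . The ship followed the whale and Ahab".split()
targets = ["whale", "ahab"]

# For each target, record the offsets where it occurs, ignoring case
offsets = {
    target: [
        position
        for position, token in enumerate(tokens)
        if token.lower() == target.lower()
    ]
    for target in targets
}

for target, positions in offsets.items():
    print(target, positions)
```

Each offset becomes one mark on the horizontal axis of the plot, so a word that clusters in one part of the text shows up as a dense band of marks there.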

# X. Part of Speech (POS) Analysis

We can also do some simple analysis of the distribution of parts of speech in the text.  Note that this is solidly in the camp of Natural Language Processing, since NLTK uses a tagger trained on English text to assign a part of speech to each token based on the word itself and the context in which it appears.

In [None]:
# First we'll perform the POS tagging
tagged = nltk.pos_tag(word_tokens)

Now that we've tagged the text for POS, let's take a quick look at the result.

In [None]:
# look at the first ten POS tagged words
print(tagged[0:10:1])

<font size="2">note:  a key to the POS tags applied by NLTK can be found in the appendix at the end of this notebook.</font>

We see in the above that the results of our POS tagging are returned as a list of (word, tag) pairs, where the first item in each pair is the word and the second is the code for the part of speech that the computer has determined for that word.  

Next, we'll extract just the POS tags from the pairs and save them as a list.

In [None]:
# create a list of just POS tags
pos_list = []
for word, pos in tagged:
  pos_list.append(pos)

Now, let's look at this list to make sure we got it right.

In [None]:
# first, see how long the list is.  It should be the same length as our 
# original list of words in the text, which we've already calculated above.
print(len(pos_list))

In [None]:
# Now we'll look at the first ten POS tags in this list to see if
# everything looks right
print(pos_list[0:10:1])
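
The loop used above to build the list of tags can also be written more compactly.  The sketch below shows two equivalent alternatives on a small, hand-written list of tagged words (the tags were typed in by hand for illustration, not produced by the tagger).

```python
# Hypothetical tagged tokens, written by hand for illustration
tagged_sample = [("Call", "VB"), ("me", "PRP"), ("Ishmael", "NNP"), (".", ".")]

# Option 1: a list comprehension that keeps only the tag from each pair
pos_tags = [pos for word, pos in tagged_sample]

# Option 2: "unzip" the pairs into one tuple of words and one tuple of tags
words, tags = zip(*tagged_sample)

print(pos_tags)
print(words)
print(tags)
```

Either way, the list of tags should be exactly as long as the list of tagged tokens, which is the same check performed above.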

Now, we'll create a frequency distribution of our POS tags and examine the distribution.

In [None]:
# create a frequency distribution of POS
pos_freq_dist = nltk.probability.FreqDist(pos_list)

In [None]:
# Look at the most frequent POS
print(pos_freq_dist.most_common(n=50))

And, finally, we'll plot the POS frequency distribution.

In [None]:
# set the size of the plot
plt.figure(figsize=(12, 9))
# plot the cumulative frequency distribution of POS tags
pos_freq_dist.plot(50, cumulative=True)
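
Because the tag set is fine-grained (there are three adjective tags and six verb tags, for example), it is sometimes easier to read a distribution after grouping the tags into broader classes.  The sketch below groups tags by their first two letters, following the tag key in the appendix.  The grouping table and the coarse_class() helper are assumptions written for this example, and the tag list is invented.

```python
from collections import Counter

# Hypothetical list of POS tags (invented for illustration)
pos_sample = ["NNP", "VBD", "DT", "JJ", "NN", "NNS", "VBZ", "RB", "JJR", "."]

# Map the first two letters of a tag to a broader class
PREFIX_TO_CLASS = {
    "NN": "noun",
    "VB": "verb",
    "JJ": "adjective",
    "RB": "adverb",
}

def coarse_class(tag):
    """Return the broad class for a tag, or 'other' if it has no group."""
    return PREFIX_TO_CLASS.get(tag[:2], "other")

coarse_counts = Counter(coarse_class(tag) for tag in pos_sample)

print(coarse_counts.most_common())
```

Grouping hides distinctions such as comparative versus superlative adjectives, so it suits a first overview better than a detailed stylistic comparison.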

We can also do a dispersion plot of parts of speech to see if there are patterns that represent stylistic shifts across the time of the novel.

In [None]:
# set the size of the plot
plt.figure(figsize=(12, 9))
# define the POS tags you want to plot
targets=['JJ', 'JJR', 'JJS', 'RB', 'RBR', 'RBS', 'NNP', 'NNPS', 'PRP', 'PRP$']
#draw the plot
dispersion_plot(pos_list, targets, ignore_case=True, title='POS Dispersion Plot')

# XI.  Wrapping Up

At this point, you've examined the text from a variety of perspectives:  as a text blob (the full text that you originally loaded into the system), as sentences, and as words, including some preliminary analysis of the relationships between words.  You've also delved a bit into the grammatical structure of the text by looking at parts of speech.  

As per the introductory text to section IV, you should generally make a habit of taking detailed notes of any trends and/or issues you observed in the text while performing your preliminary analysis.  (Hopefully you did this in this instance.) Performing an initial forensic examination of your data and documenting it well is one of the most important, and frequently overlooked, steps in the data/text analysis pipeline.  The information that you gain during this forensics is crucial to making good decisions about what needs to be done to prepare a text for analysis to answer specific scholarly questions about the text.

When we meet in our next discussion session, we’ll discuss everyone’s findings and co-code some text cleaning and processing.

# Appendix:  NLTK POS Tag Key

The following is a key to the POS tags applied by the NLTK when performing POS tagging.

 
*   CC   | coordinating conjunction
*   CD   | cardinal number
*   DT   | determiner
*   EX   | existential there (ex: 'there is')
*   FW   | foreign word
*   IN   | preposition/subordinating conjunction
*   JJ   | adjective (ex: big)
*   JJR  | adjective, comparative (ex: bigger)
*   JJS  | adjective, superlative (ex: biggest)
*   LS   | list marker (ex: '1)')
*   MD   | modal (ex: could, will)
*   NN   | noun, singular
*   NNS  | noun plural
*   NNP  | proper noun, singular
*   NNPS | proper noun, plural
*   PDT  | predeterminer (ex: 'all the kids')
*   POS  | possessive ending (ex: Sam's)
*   PRP  | personal pronoun
*   PRP\$ | possessive pronoun 
*   RB   | adverb (ex: very) 
*   RBR  | adverb, comparative (ex: better)
*   RBS  | adverb, superlative (ex: best)
*   RP   | particle 
*   TO   | to (ex: to go 'to' the store.)
*   UH   | interjection 
*   VB   | verb, base form (ex: take)
*   VBD  | verb, past tense (ex: took)
*   VBG  | verb, gerund/present participle (ex: taking)
*   VBN  | verb, past participle (ex: taken)
*   VBP  | verb, sing. present, non-3rd person (ex: take)
*   VBZ  | verb, 3rd person sing. present (ex: takes)
*   WDT  | wh-determiner (ex: which)
*   WP   | wh-pronoun (ex: who, what)
*   WP\$  | possessive wh-pronoun (ex: whose)
*   WRB  | wh-adverb (ex: where, when)