## Introduction to Natural Language Processing (NLP) using Python's NLTK

One of the most frequent tasks in computational text analysis is quickly summarizing the content of text. In this lesson we will learn two ways of summarizing text using Python's nltk, and in the process learn some quick and easy NLP techniques.

Natural Language Processing is an umbrella term that incorporates many techniques and methods to process, analyze, and understand natural languages (as opposed to artificial languages like logics, or Python).

### Learning Goals:
The goal of this lesson is to jump right into text analysis and natural language processing. Rather than starting with the nitty gritty of programming in Python, this lesson will demonstrate some neat things you can do with a minimal amount of coding and will give you an understanding of why you may want to learn the nitty gritty. An additional goal of this lesson is to start to get you thinking about analyzing texts via computational methods.

By the end of the lesson you will learn how to quickly summarize a text via counting the most frequent words, nouns, and verbs. More specifically, you will:

* Gain an intuition about how computers process text, and how this is different than how humans read it
* Learn some of the basic functions in the nltk package, such as tokenizing texts and part-of-speech tagging, and learn why these might help researchers analyze text
* Get started with some basic coding (although don't worry if you don't understand everything)


### Lesson Outline:
- Assigning Text as Variables in Python
- Tokenizing Text and Counting Words
- Pre-Processing: 
    * Changing words to lowercase
    * Removing stop words
    * Removing punctuation
- Part-of-Speech Tagging
    * Tagging tokens
    * Counting tagged tokens
- Illustration: Guess the Mystery Novels
- If there's time: concordances


### Key Jargon:
* *coding or programming*: 
    * The purpose of programming is to find a sequence of instructions that will enable a computer to perform a specific task or solve a given problem. It involves writing those instructions in a specific *programming language*, in our case, Python.
* *script*:
    * A block of executable code, typically saved in an executable file. For example, script1.py
* *packages and modules*: 
    * Python files, or collections of files, that implement a set of pre-made functions (so we don't have to write all of the functions ourselves). To use a module we load it with the `import` statement.
* *parse*: 
    * the process of analysing a string of symbols, in this case the symbols that make up natural language. This can also include understanding, or parsing, computer code.
* *variable*: 
    * A variable is something that holds a value that may change. In simplest terms, a variable is just a box that you can put stuff in. You can use variables to store all kinds of stuff, including numbers and letters.
* *assigning a variable*: 
    * telling Python what you want to name the variable, and what is stored in the variable.
* *string*: 
    * a type of variable that consists of a sequence of characters in a particular order. Characters can be anything, including letters or numbers. The order of a string is fixed.
* *list*: 
    * a type of variable that consists of a sequence of elements. The order is fixed.
* *stop words*: 
    * the most common words in a language.
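
To make the jargon above a little more concrete, here is a minimal sketch of assigning a string and a list to variables. The variable names and values are made up for illustration and are not used elsewhere in the lesson.

```python
# Assign a string (a sequence of characters) to a variable.
greeting = "Hello, world"

# Assign a list (a sequence of elements) to a variable.
words = ["Hello", "world"]

# Both keep their order, so we can ask for an item by position.
# Python counts positions starting from 0.
first_character = greeting[0]
first_word = words[0]

# A variable can be reassigned: the same name now holds a different value.
greeting = "Goodbye"

print(first_character, first_word, greeting)
```

Notice that `first_character` still holds the letter taken from the old string, even after `greeting` is reassigned.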

### Further Resources:

Check out the full range of techniques included in Python's nltk package here: http://www.nltk.org/book/

### 0. Assigning Text as a Variable in Python

First, we assign a sample sentence, our "text", to a variable called "sentence" (the name of the variable is arbitrary). Printing the sentence shows what the variable "sentence" contains. 

Note: This sentence is a quote about what digital humanities means, from digital humanist Kathleen Fitzpatrick. Source: "On Scholarly Communication and the Digital Humanities: An Interview with Kathleen Fitzpatrick", *In the Library with the Lead Pipe*

In [1]:
#Anything on a line starting with a hash symbol (#) is called a comment, and is meant to clarify code for human readers.
#The computer ignores these lines.

#assign the desired sentence to the variable called 'sentence.' This variable type is called a string.
sentence = "For me it has to do with the work that gets done at the crossroads of digital media and traditional humanistic study. And that happens in two different ways. On the one hand, it’s bringing the tools and techniques of digital media to bear on traditional humanistic questions; on the other, it’s also bringing humanistic modes of inquiry to bear on digital media."

#print the contents of the variable 'sentence'
print(sentence)

For me it has to do with the work that gets done at the crossroads of digital media and traditional humanistic study. And that happens in two different ways. On the one hand, it’s bringing the tools and techniques of digital media to bear on traditional humanistic questions; on the other, it’s also bringing humanistic modes of inquiry to bear on digital media.


In [4]:
#First import the Python package nltk (Natural Language Tool Kit)
import nltk

#import the function to split the text into separate words from the NLTK package
from nltk import word_tokenize

#create new variable that applies the word_tokenize function to our sentence.
sentence_tokens = word_tokenize(sentence)

#This new variable contains the tokenized text, and is now a variable type called a list.
print(sentence_tokens)

['For', 'me', 'it', 'has', 'to', 'do', 'with', 'the', 'work', 'that', 'gets', 'done', 'at', 'the', 'crossroads', 'of', 'digital', 'media', 'and', 'traditional', 'humanistic', 'study', '.', 'And', 'that', 'happens', 'in', 'two', 'different', 'ways', '.', 'On', 'the', 'one', 'hand', ',', 'it’s', 'bringing', 'the', 'tools', 'and', 'techniques', 'of', 'digital', 'media', 'to', 'bear', 'on', 'traditional', 'humanistic', 'questions', ';', 'on', 'the', 'other', ',', 'it’s', 'also', 'bringing', 'humanistic', 'modes', 'of', 'inquiry', 'to', 'bear', 'on', 'digital', 'media', '.']


### 1. Tokenizing Text and Counting Words

The printed sentence in the previous section is how a human would read it. Next we look at the different ways a computer "reads", or *parses*, that sentence, and at some simple ways to analyze and summarize it.

Often the first step needed to enable a computer to parse text is changing the sentence into "tokens." This is referred to as *tokenizing* text. Each token roughly corresponds to a word or a punctuation mark. In essence, this process divides the sentence into little bits that the computer can process.

Notice each token is either a word or punctuation. [Note: in the coming days we will see other methods to tokenize text. While seemingly simple, tokenizing text is not a trivial task.]
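
To get a feel for why tokenizing is not trivial, here is a rough sketch of a home-made tokenizer built on Python's `re` module. It is not how NLTK's `word_tokenize` works; the function name `simple_tokenize` and the example sentence are assumptions made for illustration.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of letters, digits, or underscores (a "word");
    # [^\w\s] matches any single character that is neither a word
    # character nor whitespace (a punctuation mark).
    return re.findall(r"\w+|[^\w\s]", text)

example = "Tokenizing isn't trivial, is it?"
tokens = simple_tokenize(example)
print(tokens)
```

This simple rule breaks "isn't" into three pieces, which is probably not what a reader would want. Handling contractions, abbreviations, and hyphenated words sensibly is part of what a dedicated tokenizer takes care of.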

Why is this helpful?

We can now summarize the sentence/text in interesting and potentially helpful ways. For example, we can count the number of tokens in the sentence, which roughly corresponds to the number of words.

In [5]:
#The number of tokens is the length of the list, or the number of elements in the list
print(len(sentence_tokens))

69


We can also count the most frequent words, which can help us quickly summarize the text.

In [6]:
#apply the nltk function FreqDist to count the number of times each token occurs.
word_frequency = nltk.FreqDist(sentence_tokens)

#print out the 10 most frequent words using the function most_common
print(word_frequency.most_common(10))

[('the', 5), ('humanistic', 3), ('to', 3), ('digital', 3), ('.', 3), ('on', 3), ('of', 3), ('media', 3), ('it’s', 2), ('and', 2)]


The most frequent words do suggest what the sentence is about, in particular the words "humanistic", "digital", and "media".

But there are many frequent words that are not helpful in summarizing the text, for example, "the", "and", "to", and "." So the most frequent words do not necessarily help us understand the content of a text.
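
If you want to see the counting idea without NLTK, Python's built-in `collections.Counter` does the same kind of tallying (as far as I know, NLTK's `FreqDist` is built on top of it). The short token list below is a hand-typed fragment of our sentence, used only for illustration.

```python
from collections import Counter

# A hand-typed fragment of the tokenized sentence.
tokens = ["the", "tools", "and", "techniques", "of", "digital", "media",
          "to", "bear", "on", "the", "digital", "media", "."]

# Counter tallies how many times each distinct token occurs.
counts = Counter(tokens)

# most_common(n) returns the n highest counts as (token, count) pairs.
top_three = counts.most_common(3)
print(top_three)
```

Even in this tiny fragment, the function word "the" ties with the content words "digital" and "media", which is the problem described above.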

How can we use a computer to identify important, interesting, or content words in a text? There are many ways to do this, a few of which we'll cover in this workshop. Today, we'll look at two simple ways to identify words that will help us summarize the content of a text. We'll see additional ways of doing this throughout the week. 

### 2. Pre-Processing: Lower Case, Removing Stop Words and Punctuation

First, scholars typically go through a number of pre-processing steps before getting to the actual analysis. One of these steps is converting all words to lower-case, so that "Humanities" and "humanities" count as the same word. (For some tasks this is appropriate. Think of reasons why we might NOT want to do this.)

To convert to lower case we use the string method `lower()`.



In [7]:
sentence_tokens_lc = [word.lower() for word in sentence_tokens]

#see the result
print(sentence_tokens_lc)

['for', 'me', 'it', 'has', 'to', 'do', 'with', 'the', 'work', 'that', 'gets', 'done', 'at', 'the', 'crossroads', 'of', 'digital', 'media', 'and', 'traditional', 'humanistic', 'study', '.', 'and', 'that', 'happens', 'in', 'two', 'different', 'ways', '.', 'on', 'the', 'one', 'hand', ',', 'it’s', 'bringing', 'the', 'tools', 'and', 'techniques', 'of', 'digital', 'media', 'to', 'bear', 'on', 'traditional', 'humanistic', 'questions', ';', 'on', 'the', 'other', ',', 'it’s', 'also', 'bringing', 'humanistic', 'modes', 'of', 'inquiry', 'to', 'bear', 'on', 'digital', 'media', '.']
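
As a small illustration of the trade-off mentioned above, the sketch below counts a made-up token list with and without lowercasing. The tokens are assumptions chosen to show one case where merging helps and one where it may hurt.

```python
from collections import Counter

tokens = ["Humanities", "humanities", "US", "us"]

# Without lowercasing, every spelling is counted separately.
case_sensitive = Counter(tokens)

# With lowercasing, "Humanities" and "humanities" merge (usually desirable),
# but so do "US" (the country) and "us" (the pronoun), which may not be.
case_insensitive = Counter(token.lower() for token in tokens)

print(case_sensitive)
print(case_insensitive)
```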


Words like "the", "to", and "and" are what text analysts call "stop words." Stop words are the most common words in a language, and while necessary and useful for some analysis purposes, do not tell us much about the *substance* of a text. Another common pre-processing step is to simply remove punctuation and stop words. NLTK contains a built-in stop words list, which we use to remove stop words from our list of tokens.

In [8]:
#import the stopwords list
from nltk.corpus import stopwords

#take a look at what stop words are included:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'no

In [10]:
#create a new variable that contains the sentence tokens without the stopwords
sentence_tokens_clean = [word for word in sentence_tokens_lc if word not in stopwords.words('english')]

#see what words we're left with
print(sentence_tokens_clean)

['work', 'gets', 'done', 'crossroads', 'digital', 'media', 'traditional', 'humanistic', 'study', '.', 'happens', 'two', 'different', 'ways', '.', 'one', 'hand', ',', 'it’s', 'bringing', 'tools', 'techniques', 'digital', 'media', 'bear', 'traditional', 'humanistic', 'questions', ';', ',', 'it’s', 'also', 'bringing', 'humanistic', 'modes', 'inquiry', 'bear', 'digital', 'media', '.']
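
To see the filtering step in isolation, here is a minimal sketch that uses a tiny hand-picked stop word collection (an assumption for illustration; it is not NLTK's list). Storing the stop words in a Python set, rather than a list, keeps the "is this word a stop word?" check fast even for long texts.

```python
# A tiny, hand-picked stop word set for illustration only;
# NLTK's English list is much longer.
stop_words = {"the", "of", "and", "to", "on", "it", "that"}

tokens_lc = ["the", "tools", "and", "techniques", "of", "digital", "media"]

# Keep only the tokens that are not in the stop word set.
content_tokens = [word for word in tokens_lc if word not in stop_words]
print(content_tokens)
```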


Punctuation also does not help us understand the substance of a text, so we'll remove punctuation in a similar fashion. [Again, think about tasks where we may not want to remove punctuation.] There are many ways to do this. For now, we'll create a list of punctuation tokens, similar to the list of stop words, and remove them from our list of tokens.

In [11]:
#create list of punctuation symbols
#there are better ways to do this which we'll get to later, but we'll keep it simple here
punctuation = [".", ";", ",", "'", '"', "!"]
sentence_tokens_clean = [word for word in sentence_tokens_clean if word not in punctuation]

#see what's left
print(sentence_tokens_clean)


['work', 'gets', 'done', 'crossroads', 'digital', 'media', 'traditional', 'humanistic', 'study', 'happens', 'two', 'different', 'ways', 'one', 'hand', 'it’s', 'bringing', 'tools', 'techniques', 'digital', 'media', 'bear', 'traditional', 'humanistic', 'questions', 'it’s', 'also', 'bringing', 'humanistic', 'modes', 'inquiry', 'bear', 'digital', 'media']


Now, after our pre-processing steps, let's re-count the most frequent words in the sentence.

In [12]:
word_frequency_clean = nltk.FreqDist(sentence_tokens_clean)
print(word_frequency_clean.most_common(10))

[('digital', 3), ('humanistic', 3), ('media', 3), ('it’s', 2), ('bear', 2), ('traditional', 2), ('bringing', 2), ('study', 1), ('techniques', 1), ('questions', 1)]


Better! The 10 most frequent words now give us a pretty good sense of the substance of this sentence. But we still have problems. For example, the word "it's" sneaked in there. One solution is to keep adding stop words to our stop word list, but this could go on forever and is not a good solution when processing lots of text.
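
One likely reason this token survived is that the quote was typed with a curly (typographic) apostrophe rather than a straight one, so it does not match the straight-apostrophe forms that punctuation and stop word lists are built around. The sketch below checks this against Python's `string.punctuation` and shows one possible fix; replacing the character before tokenizing is my suggestion, not a step from the lesson.

```python
import string

# The token as it appears in our output, with a curly apostrophe (U+2019).
token = "it’s"

# string.punctuation only contains ASCII punctuation, so the curly
# apostrophe is not in it.
curly_is_ascii_punctuation = "’" in string.punctuation

# One possible fix: replace curly apostrophes with straight ones
# before tokenizing or filtering.
normalized = token.replace("’", "'")

print(curly_is_ascii_punctuation, normalized)
```

Normalizing characters like this is another pre-processing decision, and like lowercasing it is worth making deliberately.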

There's another way of identifying content words, and it involves identifying the part of speech of each word.

### 3. Part-of-Speech Tagging

You may have noticed that stop words are typically short function words. Intuitively, if we could identify the part of speech of a word, we would have another way of identifying words of substance. NLTK can do that too!

NLTK has a function that will tag the part of speech of every token in a text. For this, we go back to our original tokenized text, with the stop words and punctuation.

NLTK uses the tag set from the Penn Treebank Project to label the part of speech of each word. You can find a list of all the part-of-speech tags here:

https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [13]:
#use the nltk pos function to tag the tokens
tagged_sentence_tokens = nltk.pos_tag(sentence_tokens)

#view new variable
print(tagged_sentence_tokens)



[('For', 'IN'), ('me', 'PRP'), ('it', 'PRP'), ('has', 'VBZ'), ('to', 'TO'), ('do', 'VB'), ('with', 'IN'), ('the', 'DT'), ('work', 'NN'), ('that', 'WDT'), ('gets', 'VBZ'), ('done', 'VBN'), ('at', 'IN'), ('the', 'DT'), ('crossroads', 'NNS'), ('of', 'IN'), ('digital', 'JJ'), ('media', 'NNS'), ('and', 'CC'), ('traditional', 'JJ'), ('humanistic', 'JJ'), ('study', 'NN'), ('.', '.'), ('And', 'CC'), ('that', 'DT'), ('happens', 'VBZ'), ('in', 'IN'), ('two', 'CD'), ('different', 'JJ'), ('ways', 'NNS'), ('.', '.'), ('On', 'IN'), ('the', 'DT'), ('one', 'CD'), ('hand', 'NN'), (',', ','), ('it’s', 'NN'), ('bringing', 'VBG'), ('the', 'DT'), ('tools', 'NNS'), ('and', 'CC'), ('techniques', 'NNS'), ('of', 'IN'), ('digital', 'JJ'), ('media', 'NNS'), ('to', 'TO'), ('bear', 'VB'), ('on', 'IN'), ('traditional', 'JJ'), ('humanistic', 'JJ'), ('questions', 'NNS'), (';', ':'), ('on', 'IN'), ('the', 'DT'), ('other', 'JJ'), (',', ','), ('it’s', 'NN'), ('also', 'RB'), ('bringing', 'VBG'), ('humanistic', 'JJ'), ('m

Now comes more complicated code. Stay with me, but focus more on the output and understanding ways in which you as a researcher can use the output, rather than understanding every line of code.

We can count the part-of-speech tags in much the same way we counted words, to output the most frequent types of words in our text.

In [27]:
tagged_frequency = nltk.FreqDist(tag for (word, tag) in tagged_sentence_tokens)
tagged_frequency.most_common()

[('IN', 11),
 ('JJ', 10),
 ('NNS', 9),
 ('DT', 6),
 ('NN', 6),
 ('VB', 3),
 ('.', 3),
 ('TO', 3),
 ('CC', 3),
 ('VBZ', 3),
 (',', 2),
 ('VBG', 2),
 ('CD', 2),
 ('PRP', 2),
 (':', 1),
 ('VBN', 1),
 ('WDT', 1),
 ('RB', 1)]

This sentence contains a lot of adjectives. So let's first look at the most frequent adjectives.

In [40]:
adjectives = [word for word,pos in tagged_sentence_tokens if pos == 'JJ' or pos=='JJR' or pos=='JJS']

#print all of the adjectives
print(adjectives)

['digital', 'traditional', 'humanistic', 'different', 'digital', 'traditional', 'humanistic', 'other', 'humanistic', 'digital']


In [41]:
#calculate the frequency of the adjectives
freq_adjectives=nltk.FreqDist(adjectives)

#print the most frequent adjectives
print(freq_adjectives.most_common(5))

[('humanistic', 3), ('digital', 3), ('traditional', 2), ('other', 1), ('different', 1)]


Let's do the same for nouns.

In [35]:
nouns = [word for word,pos in tagged_sentence_tokens if pos=='NN' or pos=='NNS']

#print all of the nouns
print(nouns)

['work', 'crossroads', 'media', 'study', 'ways', 'hand', 'it’s', 'tools', 'techniques', 'media', 'questions', 'it’s', 'modes', 'inquiry', 'media']


In [36]:
#calculate the frequency of the nouns
freq_nouns=nltk.FreqDist(nouns)

#print the most frequent nouns
print(freq_nouns.most_common(5))

[('media', 3), ('it’s', 2), ('inquiry', 1), ('hand', 1), ('work', 1)]


And now verbs.

In [37]:
verbs = [word for word,pos in tagged_sentence_tokens if pos == 'VB' or pos=='VBD' or pos=='VBG' or pos=='VBN' or pos=='VBP' or pos=='VBZ']

#print all of the verbs
print(verbs)



['has', 'do', 'gets', 'done', 'happens', 'bringing', 'bear', 'bringing', 'bear']


In [38]:
#calculate the frequency of the verbs
freq_verbs=nltk.FreqDist(verbs)

#print the most frequent verbs
print(freq_verbs.most_common(5))

[('bringing', 2), ('bear', 2), ('gets', 1), ('has', 1), ('done', 1)]


If we bring all of this together we get a pretty good summary of the sentence:

In [42]:
print(freq_adjectives.most_common(3))
print(freq_nouns.most_common(3))
print(freq_verbs.most_common(3))

[('humanistic', 3), ('digital', 3), ('traditional', 2)]
[('media', 3), ('it’s', 2), ('inquiry', 1)]
[('bringing', 2), ('bear', 2), ('gets', 1)]
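
The adjective, noun, and verb filters above all follow the same pattern, so they could be wrapped in one small helper. The sketch below is one way to do that, using a few (word, tag) pairs copied by hand from the tagged output; the helper name `words_with_tag_prefix` is made up. One caveat: matching on the prefix "NN" would also pick up proper nouns (NNP, NNPS), which the noun filter in this lesson leaves out.

```python
from collections import Counter

# A few (word, tag) pairs copied from the tagged output above.
tagged = [("digital", "JJ"), ("media", "NNS"), ("happens", "VBZ"),
          ("humanistic", "JJ"), ("study", "NN"), ("bringing", "VBG"),
          ("digital", "JJ"), ("media", "NNS"), ("bear", "VB")]

def words_with_tag_prefix(tagged_tokens, prefix):
    # Penn Treebank adjective tags start with "JJ", noun tags with "NN",
    # and verb tags with "VB", so one prefix covers a whole family of tags.
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]

adjective_counts = Counter(words_with_tag_prefix(tagged, "JJ"))
noun_counts = Counter(words_with_tag_prefix(tagged, "NN"))
verb_counts = Counter(words_with_tag_prefix(tagged, "VB"))

print(adjective_counts.most_common(3))
print(noun_counts.most_common(3))
print(verb_counts.most_common(3))
```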


### 4. Illustration: Guess the Mystery Novel

To illustrate this process on a slightly larger scale, we will do exactly what we did above, but will do so on two mystery novels. Your challenge: guess the novels from the most frequent words, nouns, and verbs. We will do this in one chunk of code, so another challenge for you during breaks or the next few weeks is to see how much of the following code you can follow (or, in computer science terms, how much of the code you can parse). If the answer is none, not to worry! Tomorrow we will take a step back and work on the nitty gritty of programming.

Note: this code requires one thing we haven't covered: reading a .txt file from your hard drive. We'll go over this again in the upcoming days.
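
As a preview, one common pattern for reading a text file is a `with` block, which closes the file automatically when the block ends. The sketch below first writes a small stand-in file to a temporary folder so that it runs anywhere; the file name `example.txt` and its contents are made up and are not part of the lesson's data.

```python
import os
import tempfile

# Create a small stand-in file so this sketch runs anywhere.
# 'example.txt' is a hypothetical name, not one of the lesson's files.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example.txt")
with open(path, "w", encoding="utf-8") as outfile:
    outfile.write("This is a tiny example text.")

# Reading: the 'with' block closes the file automatically when it ends.
with open(path, encoding="utf-8") as infile:
    text = infile.read()

print(text)
```

Stating the encoding explicitly (here UTF-8) is optional, but it can help avoid surprises with characters such as curly quotes.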

In [14]:
#import the package 'string' for a different way of removing punctuation. It simply provides a more complete list of punctuation than the one we created above.
import string
punctuations = list(string.punctuation)

#see what punctuation is included
print(punctuations)

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']


In [15]:
#read the two text files from your hard drive, assign first mystery text to variable 'text1' and second mystery text to variable 'text2'
text1 = open('text1.txt').read()
text2 = open('text2.txt').read()

###word frequencies

#tokenize texts
text1_tokens = word_tokenize(text1)
text2_tokens = word_tokenize(text2)

#pre-process for word frequency
#lowercase
text1_tokens_lc = [word.lower() for word in text1_tokens]
text2_tokens_lc = [word.lower() for word in text2_tokens]

#remove stopwords (build the stop word set once so each lookup is fast)
stop_words = set(stopwords.words('english'))
text1_tokens_clean = [word for word in text1_tokens_lc if word not in stop_words]
text2_tokens_clean = [word for word in text2_tokens_lc if word not in stop_words]

#remove punctuation using the list of punctuation from the string package
text1_tokens_clean = [word for word in text1_tokens_clean if word not in punctuations]
text2_tokens_clean = [word for word in text2_tokens_clean if word not in punctuations]

#frequency distribution
text1_word_frequency = nltk.FreqDist(text1_tokens_clean)
text2_word_frequency = nltk.FreqDist(text2_tokens_clean)

###part-of-speech frequencies

#tag part-of-speech
text1_tagged = nltk.pos_tag(text1_tokens)
text2_tagged = nltk.pos_tag(text2_tokens)

#most frequent nouns and verbs
text1_nouns = [word for word,pos in text1_tagged if pos=='NN' or pos=='NNS']
text2_nouns = [word for word,pos in text2_tagged if pos=='NN' or pos=='NNS']
text1_freq_nouns=nltk.FreqDist(text1_nouns)
text2_freq_nouns=nltk.FreqDist(text2_nouns)

text1_verbs = [word for word,pos in text1_tagged if pos == 'VB' or pos=='VBD' or pos=='VBG' or pos=='VBN' or pos=='VBP' or pos=='VBZ']
text2_verbs = [word for word,pos in text2_tagged if pos == 'VB' or pos=='VBD' or pos=='VBG' or pos=='VBN' or pos=='VBP' or pos=='VBZ']

text1_freq_verbs=nltk.FreqDist(text1_verbs)
text2_freq_verbs=nltk.FreqDist(text2_verbs)

All the variables are assigned as desired. We can now print out the most frequent words, nouns, and verbs for each text. Again, don't worry if you don't understand all the code here. [Note: many of these print statements are there so humans can better read the output.]

Can you guess the texts?

In [16]:
print("Frequent words for Text1:")
print("_________________________")

for word in text1_word_frequency.most_common(20):
    print(word[0])
print()
print("Frequent nouns for Text1")
print("________________________")
for word in text1_freq_nouns.most_common(20):
    print(word[0])
print()
print("Frequent verbs for Text1")
print("________________________")
for word in text1_freq_verbs.most_common(20):
    print(word[0])
    
print()
print("------------------------")
print("~~~~~~~~~~~~~~~~~~~~~~~~")
print("------------------------")
print()

print("Frequent words for Text2")
print("________________________")
for word in text2_word_frequency.most_common(20):
    print(word[0])

print()
print("Frequent nouns for Text2")
print("________________________")
for word in text2_freq_nouns.most_common(20):
    print(word[0])

print()
print("Frequent verbs for Text2")
print("________________________")
for word in text2_freq_verbs.most_common(20):
    print(word[0])

Frequent words for Text1:
_________________________
's
''
``
whale
one
like
upon
ahab
man
ship
old
would
ye
sea
though
yet
time
captain
long
still

Frequent nouns for Text1
________________________
whale
man
ship
sea
time
boat
head
way
whales
men
hand
thing
side
ye
world
water
deck
day
eyes
sort

Frequent verbs for Text1
________________________
is
was
be
had
have
were
are
's
been
do
said
has
seemed
did
see
say
being
go
made
seen

------------------------
~~~~~~~~~~~~~~~~~~~~~~~~
------------------------

Frequent words for Text2
________________________
''
``
's
elinor
could
marianne
mrs.
would
said
every
one
much
must
sister
edward
dashwood
mother
time
jennings
know

Frequent nouns for Text2
________________________
sister
mother
time
thing
nothing
house
day
heart
man
moment
room
mind
kind
world
town
morning
family
affection
brother
place

Frequent verbs for Text2
________________________
was
be
had
have
is
been
were
said
do
am
know
are
did
think
has
see
being
say
make
made


What further things can we learn from these lists? In particular, what can we learn from comparing these two novels?

One thing I noticed: the verbs "think" and "know" are frequent verbs in Text2 but not Text1. Does this generate a potential research question?

What next steps would you want to take if you were to further compare these novels?
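
One simple next step is to let the computer do the comparison for us. The sketch below takes the two lists of frequent verbs, typed in by hand from the output above, and reports which verbs appear in one list but not the other. The variable names are my own.

```python
# The 20 most frequent verbs for each text, copied from the output above.
text1_top_verbs = ["is", "was", "be", "had", "have", "were", "are", "'s",
                   "been", "do", "said", "has", "seemed", "did", "see",
                   "say", "being", "go", "made", "seen"]
text2_top_verbs = ["was", "be", "had", "have", "is", "been", "were", "said",
                   "do", "am", "know", "are", "did", "think", "has", "see",
                   "being", "say", "make", "made"]

# Sets make the "is this verb in the other list?" check quick.
text1_set = set(text1_top_verbs)
text2_set = set(text2_top_verbs)

# Keep the original frequency order while filtering.
only_in_text1 = [verb for verb in text1_top_verbs if verb not in text2_set]
only_in_text2 = [verb for verb in text2_top_verbs if verb not in text1_set]

print("Only in Text1:", only_in_text1)
print("Only in Text2:", only_in_text2)
```

With only 20 words per list this is easy to do by eye, but the same few lines would work for lists of hundreds of words.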

We can do many more things with the Python package NLTK. Time permitting, here are a few other tricks.

### 5. Concordances and Similar Words using NLTK

Maybe we don't just want to know frequent words, but we want to know the way specific words are used. Concordances show us every occurrence of a given word, together with some context. This, combined with a function that shows which words are used in a similar context as a given word, can help us understand the way in which a word is used in a text. 

To illustrate this, we can compare the way the word "monstrous" is used in our two novels. To use some nltk functions on text, for example concordances, we need to first transform the text into an NLTK text object. This will be useful for using these functions on your own text.

In [23]:
#first tokenize the two texts
text1_tokens = word_tokenize(text1)
text2_tokens = word_tokenize(text2)

#then transform the tokenized text into an NLTK text object
text1_nltk = nltk.Text(text1_tokens)
text2_nltk = nltk.Text(text2_tokens)

#the variables text1_nltk and text2_nltk are now nltk text objects:
print(text1_nltk)
print(text2_nltk)

<Text: ETYMOLOGY . ( Supplied by a Late Consumptive...>
<Text: CHAPTER 1 The family of Dashwood had long...>


In [26]:
#now we can use the concordance function to display the word in its context
text1_nltk.concordance("monstrous")
print()
text2_nltk.concordance("monstrous")

Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size ... . This came towards us , 
n of the Psalms . `` Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
 Radney . ' '' CHAPTER 55 . Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
e of Whale-Bones ; for Whales of a monstrous size are oftentimes cast up dead u

Displaying 11 of 11 matches:
 `` Now , Palmer , you shall see a monstrous pretty girl . ''
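
To build intuition for what a concordance does, here is a rough sketch of the idea in plain Python. It is not NLTK's implementation: the function name `simple_concordance` is made up, and it uses a fixed window of tokens on each side rather than a fixed line width. The tokens are copied from the Austen line above.

```python
def simple_concordance(tokens, target, window=3):
    lines = []
    for i, token in enumerate(tokens):
        # Compare in lowercase so "Monstrous" and "monstrous" both match.
        if token.lower() == target.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{token}] {right}")
    return lines

tokens = ["Now", ",", "Palmer", ",", "you", "shall", "see", "a",
          "monstrous", "pretty", "girl", "."]

matches = simple_concordance(tokens, "monstrous")
for line in matches:
    print(line)
```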

The nltk function *similar* prints out words that are used in the same context as monstrous.

In [29]:
print("Melville")
text1_nltk.similar("monstrous")
print()
print("Austen")
text2_nltk.similar("monstrous")

Melville
candid maddens fearless horrible trustworthy doleful part imperial
determined untoward few passing vexatious tyrannical loving christian
true lamentable impalpable curious

Austen
very so heartily exceedingly great a remarkably as extremely good
sweet vast amazingly
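
To give a sense of what "used in the same context" can mean, here is a much simplified sketch: two words count as similar if they appear at least once between the same previous word and the same next word. This is my own simplification for illustration, not NLTK's actual method, and the function name and the toy sentence are made up.

```python
from collections import defaultdict

def words_in_shared_contexts(tokens, target):
    # Map each (previous word, next word) context to the words seen inside it.
    context_to_words = defaultdict(set)
    for prev_word, word, next_word in zip(tokens, tokens[1:], tokens[2:]):
        context_to_words[(prev_word, next_word)].add(word)

    # Collect every other word that shares at least one context with the target.
    similar = set()
    for words in context_to_words.values():
        if target in words:
            similar.update(words - {target})
    return similar

# A made-up toy sentence in which two words fill the same slot.
tokens = ("a monstrous pretty girl and a very pretty house "
          "and a very fine day").split()

similar_words = words_in_shared_contexts(tokens, "monstrous")
print(similar_words)
```

In the toy sentence, "monstrous" and "very" both occur between "a" and "pretty", so "very" is reported as similar.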


What can we learn from this? For me, Melville uses "monstrous" in a mix of settings, but in many cases it has a negative connotation. For Austen, "monstrous" has a positive connotation and often works as an intensifier, much like "very".

Traditional literary criticism (according to an expert colleague of mine) has claimed that "monstrous" is a positive term for Melville, used in admiration of the power and enormity of whales. But this analysis suggests this may not be entirely true. Were literary critics wrong about the tone of the novel? Or were they partially right, but missed something as well? What further questions could we ask, or analyses could we do, to explore this contention further?