# Introduction to Natural Language Processing (NLP) and Text Analysis

Generally speaking, <i>Computational Text Analysis</i> is a set of interpretive methods that seek to understand patterns in human discourse, in part through statistics. More familiar methods, such as close reading, are exceptionally well suited to the analysis of individual texts. Our research questions, however, typically compel us to look for relationships across texts, sometimes numbering in the thousands or even millions. We have to zoom out in order to perform so-called <i>distant reading</i>. Fortunately for us, computers are well suited to identifying the kinds of textual relationships that exist at scale.

<i>Natural Language Processing</i> is an umbrella term for the methods by which a computer handles human language text. This includes transforming the text into a numerical form that the computer manipulates natively, as well as the measurements that researchers often perform on it. In the parlance, <i>natural language</i> refers to a language spoken by humans, as opposed to a <i>formal language</i>, such as Python, which comprises a set of logical operations.


# Lecture Outline
- Text in Python
- Tokenization & Term Frequency
- Pre-Processing: 
    * Changing words to lowercase
    * Removing stop words
    * Removing punctuation
- Part-of-Speech Tagging
    * Tagging tokens
    * Counting tagged tokens
- Concordance
- Sentiment

# 1. Text in Python

First, a quote from Rachel Carson's *Silent Spring*.

In [None]:
print(
"Those who contemplate the beauty of the earth find reserves of strength \
that will endure as long as life lasts. \
There is something infinitely healing in the repeated refrains of nature -- \
the assurance that dawn comes after night, and spring after winter."
) 

In [None]:
# Assign the quote to a variable, so we can refer back to it later
# We get to make up the name of our variable, so let's give it a descriptive label: "sentence"

sentence = "Those who contemplate the beauty of the earth find reserves of strength \
that will endure as long as life lasts. \
There is something infinitely healing in the repeated refrains of nature -- \
the assurance that dawn comes after night, and spring after winter."

In [None]:
# Print the contents of the variable 'sentence'

print(sentence)

# 2. Tokenizing Text and Counting Words

The above output is how a human would read that sentence. Next we look at the main way in which a computer "reads", or *parses*, that sentence.

The first step is typically to <i>tokenize</i> it, that is, to change it into a series of <i>tokens</i>. Each token roughly corresponds to either a word or a punctuation mark. These smaller units are more straightforward for the computer to handle in tasks like counting.

In [None]:
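# Tokenize our sentence
# If NLTK raises a LookupError about a missing tokenizer resource,
# download it once first, e.g. nltk.download('punkt')
import nltk
sentence_tokens = nltk.word_tokenize(sentence)
sentence_tokens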

In [None]:
# How many tokens are in our list?

len(sentence_tokens)
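
To make the idea of a token concrete, here is a minimal sketch of a tokenizer built only on Python's standard `re` module. It is a simplified stand-in, not how `nltk.word_tokenize` works internally: the function name `simple_tokenize` and the regular expression are assumptions made for illustration, and NLTK makes different choices (for example, it splits a possessive such as "traveler's" into two tokens, while this sketch keeps it whole).

```python
import re

def simple_tokenize(raw_text):
    """Split a string into word tokens and single punctuation tokens."""
    # \w+(?:'\w+)?  matches a word, optionally with an internal apostrophe
    # [^\w\s]       matches any single character that is neither word nor space
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", raw_text)

sample = "Dawn comes after night, and spring after winter."
sample_tokens = simple_tokenize(sample)

print(sample_tokens)
print(len(sample_tokens))
```

Note that the comma and the final period each become their own token, which is why the token count is higher than the word count.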

## Why might we do this?

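On a single sentence, counting tokens tells us very little. On a much longer text, token counts begin to reveal patterns, such as which words the author uses most often. Let's load a longer excerpt from *Silent Spring*.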

In [None]:
text = '''1. A Fable for Tomorrow 

THERE WAS ONCE a town in the heart of America where all life seemed to live in 
harmony with its surroundings. The town lay in the midst of a checkerboard of prosperous 
farms, with fields of grain and hillsides of orchards where, in spring, white clouds of bloom 
drifted above the green fields. In autumn, oak and maple and birch set up a blaze of color that 
flamed and flickered across a backdrop of pines. Then foxes barked in the hills and deer silently 
crossed the fields, half hidden in the mists of the fall mornings. 

Along the roads, laurel, viburnum and alder, great ferns and wildflowers delighted the traveler's 
eye through much of the year. Even in winter the roadsides were places of beauty, where 
countless birds came to feed on the berries and on the seed heads of the dried weeds rising 
above the snow. The countryside was, in fact, famous for the abundance and variety of its bird 
life, and when the flood of migrants was pouring through in spring and fall people traveled from 
great distances to observe them. Others came to fish the streams, which flowed clear and cold 
out of the hills and contained shady pools where trout lay. So it had been from the days many 
years ago when the first settlers raised their houses, sank their wells, and built their barns. 
Then a strange blight crept over the area and everything began to change. Some evil spell had 
settled on the community: mysterious maladies swept the flocks of chickens; the cattle and 
sheep sickened and died. Everywhere was a shadow of death. The farmers spoke of much 
illness among their families. In the town the doctors had become more and more puzzled by 
new kinds of sickness appearing among their patients. There had been several sudden and 
unexplained deaths, not only among adults but even among children, who would be stricken 
suddenly while at play and die within a few hours. 

There was a strange stillness. The birds, for example— where had they gone? Many people 
spoke of them, puzzled and disturbed. The feeding stations in the backyards were deserted. The 
few birds seen anywhere were moribund; they trembled violently and could not fly. It was a 
spring without voices. On the mornings that had once throbbed with the dawn chorus of robins, 
catbirds, doves, jays, wrens, and scores of other bird voices there was now no sound; only 
silence lay over the fields and woods and marsh. 

On the farms the hens brooded, but no chicks hatched. The farmers complained that they were 
unable to raise any pigs— the litters were small and the young survived only a few days. The 
apple trees were coming into bloom but no bees droned among the blossoms, so there was no 
pollination and there would be no fruit. The roadsides, once so attractive, were now lined with 
browned and withered vegetation as though swept by fire. These, too, were silent, deserted by 
all living things. Even the streams were now lifeless. Anglers no longer visited them, for all the 
fish had died. 

In the gutters under the eaves and between the shingles of the roofs, a white granular powder 
still showed a few patches; some weeks before it had fallen like snow upon the roofs and the 
lawns, the fields and streams. No witchcraft, no enemy action had silenced the rebirth of new 
life in this stricken world. The people had done it themselves. 

. . .This town does not actually exist, but it might easily have a thousand counterparts in 
America or elsewhere in the world. I know of no community that has experienced all the 
misfortunes I describe. Yet everyone of these disasters has actually happened somewhere, and 
many real communities have already suffered a substantial number of them. A grim specter 
has crept upon us almost unnoticed, and this imagined tragedy may easily become a stark 
reality we all shall know. What has already silenced the voices of spring in countless towns in 
America? This book is an attempt to explain. 



2. The Obligation to Endure 



THE HISTORY OF LIFE on earth has been a history of interaction between living things 
and their surroundings. To a large extent, the physical form and the habits of the earth's 
vegetation and its animal life have been molded by the environment. Considering the whole 
span of earthly time, the opposite effect, in which life actually modifies its surroundings, has 
been relatively slight. Only within the moment of time represented by the present century has 
one species— man— acquired significant power to alter the nature of his world. 
During the past quarter century this power has not only increased to one of disturbing 
magnitude but it has changed in character. The most alarming of all man's assaults upon the 
environment is the contamination of air, earth, rivers, and sea with dangerous and even lethal 
materials. This pollution is for the most part irrecoverable; the chain of evil it initiates not 
only in the world that must support life but in living tissues is for the most part irreversible. In 
this now universal contamination of the environment, chemicals are the sinister and little- 
recognized partners of radiation in changing the very nature of the world— the very nature of 
its life. Strontium 90, released through nuclear explosions into the air, comes to earth in rain or 
drifts down as fallout, lodges in soil, enters into the grass or corn or wheat grown there, and in 
time takes up its abode in the bones of a human being, there to remain until his death. 
Similarly, chemicals sprayed on croplands or forests or gardens lie long in soil, entering into 
living organisms, passing from one to another in a chain of poisoning and death. Or they pass 
mysteriously by underground streams until they emerge and, through the alchemy of air and 
sunlight, combine into new forms that kill vegetation, sicken cattle, and work unknown harm on 
those who drink from once pure wells. As Albert Schweitzer has said, 'Man can hardly even 
recognize the devils of his own creation.' It took hundreds of millions of years to produce the 
life that now inhabits the earth— eons of time in which that developing and evolving and 
diversifying life reached a state of adjustment and balance with its surroundings. The 
environment, rigorously shaping and directing the life it supported, contained elements that 
were hostile as well as supporting. Certain rocks gave out dangerous radiation; even within the 
light of the sun, from which all life draws its energy, there were short-wave radiations with 
power to injure. Given time— time not in years but in millennia— life adjusts, and a balance has 
been reached. For time is the essential ingredient; but in the modern world there is no time. 
The rapidity of change and the speed with which new situations are created follow the 
impetuous and heedless pace of man rather than the deliberate pace of nature. Radiation is no 
longer merely the background radiation of rocks, the bombardment of cosmic rays, the 
ultraviolet of the sun that have existed before there was any life on earth; radiation is now the 
unnatural creation of man's tampering with the atom. The chemicals to which life is asked to 
make its adjustment are no longer merely the calcium and silica and copper and all the rest of 
the minerals washed out of the rocks and carried in rivers to the sea; they are the synthetic 
creations of man's inventive mind, brewed in his laboratories, and having no counterparts in 
nature. 

To adjust to these chemicals would require time on the scale that is nature's; it would require 
not merely the years of a man's life but the life of generations. And even this, were it by some 
miracle possible, would be futile, for the new chemicals come from our laboratories in an 
endless stream; almost five hundred annually find their way into actual use in the United States 
alone. The figure is staggering and its implications are not easily grasped— 500 new chemicals 
to which the bodies of men and animals are required somehow to adapt each year, chemicals 
totally outside the limits of biologic experience. 

Among them are many that are used in man's war against nature. Since the mid-1940s over 200 
basic chemicals have been created for use in killing insects, weeds, rodents, and other 
organisms described in the modern vernacular as 'pests'; and they are sold under several 
thousand different brand names. These sprays, dusts, and aerosols are now applied almost 
universally to farms, gardens, forests, and homes— nonselective chemicals that have the power 
to kill every insect, the 'good' and the 'bad', to still the song of birds and the leaping of fish in 
the streams, to coat the leaves with a deadly film, and to linger on in soil— all this though the 
intended target may be only a few weeds or insects. Can anyone believe it is possible to lay 
down such a barrage of poisons on the surface of the earth without making it unfit for all life? 
They should not be called 'insecticides', but 'biocides'. The whole process of spraying seems 
caught up in an endless spiral. Since DDT was released for civilian use, a process of escalation 
has been going on in which ever more toxic materials must be found. This has happened 
because insects, in a triumphant vindication of Darwin's principle of the survival of the fittest, 
have evolved super races immune to the particular insecticide used, hence a deadlier one has 
always to be developed— and then a deadlier one than that. It has happened also because, for 
reasons to be described later, destructive insects often undergo a 'flareback', or resurgence, 
after spraying, in numbers greater than before. Thus the chemical war is never won, and all life 
is caught in its violent crossfire. 

Along with the possibility of the extinction of mankind by nuclear war, the central problem of 
our age has therefore become the contamination of man's total environment with such 
substances of incredible potential for harm— substances that accumulate in the tissues of 
plants and animals and even penetrate the germ cells to shatter or alter the very material 
of heredity upon which the shape of the future depends. 

Some would-be architects of our future look toward a time when it will be possible to alter the 
human germ plasm by design. But we may easily be doing so now by inadvertence, for many 
chemicals, like radiation, bring about gene mutations. It is ironic to think that man might 
determine his own future by something so seemingly trivial as the choice of an insect spray. 
All this has been risked— for what? Future historians may well be amazed by our distorted 
sense of proportion. How could intelligent beings seek to control a few unwanted species by a 
method that contaminated the entire environment and brought the threat of disease and death 
even to their own kind? Yet this is precisely what we have done. We have done it, moreover, 
for reasons that collapse the moment we examine them. We are told that the enormous and 
expanding use of pesticides is necessary to maintain farm production. Yet is our real problem 
not one of overproduction? Our farms, despite measures to remove acreages from production 
and to pay farmers not to produce, have yielded such a staggering excess of crops that the 
American taxpayer in 1962 is paying out more than one billion dollars a year as the total 
carrying cost of the surplus-food storage program. And is the situation helped when one branch 
of the Agriculture Department tries to reduce production while another states, as it did in 1958, 
'It is believed generally that reduction of crop acreages under provisions of the Soil Bank will 
stimulate interest in use of chemicals to obtain maximum production on the land retained in 
crops.' All this is not to say there is no insect problem and no need of control. I am saying, 
rather, that control must be geared to realities, not to mythical situations, and that the 
methods employed must be such that they do not destroy us along with the insects. 
. . . The problem whose attempted solution has brought such a train of disaster in its wake is an 
accompaniment of our modern way of life. Long before the age of man, insects inhabited the 
earth— a group of extraordinarily varied and adaptable beings. Over the course of time since 
man's advent, a small percentage of the more than half a million species of insects have come 
into conflict with human welfare in two principal ways: as competitors for the food supply and 
as carriers of human disease. Disease-carrying insects become important where human beings 
are crowded together, especially under conditions where sanitation is poor, as in time of 
natural disaster or war or in situations of extreme poverty and deprivation. Then control of 
some sort becomes necessary. It is a sobering fact, however, as we shall presently see, that the 
method of massive chemical control has had only limited success, and also threatens to worsen 
the very conditions it is intended to curb. 

Under primitive agricultural conditions the farmer had few insect problems. These arose with 
the intensification of agriculture— the devotion of immense acreages to a single crop. Such a 
system set the stage for explosive increases in specific insect populations. Single-crop farming 
does not take advantage of the principles by which nature works; it is agriculture as an engineer 
might conceive it to be. Nature has introduced great variety into the landscape, but man has 
displayed a passion for simplifying it. Thus he undoes the built-in checks and balances by which 
nature holds the species within bounds. One important natural check is a limit on the amount 
of suitable habitat for each species. Obviously then, an insect that lives on wheat can build up 
its population to much higher levels on a farm devoted to wheat than on one in which wheat is 
intermingled with other crops to which the insect is not adapted. The same thing happens in 
other situations. A generation or more ago, the towns of large areas of the United States lined 
their streets with the noble elm tree. Now the beauty they hopefully created is threatened with 
complete destruction as disease sweeps through the elms, carried by a beetle that would have 
only limited chance to build up large populations and to spread from tree to tree if the elms 
were only occasional trees in a richly diversified planting. 

Another factor in the modern insect problem is one that must be viewed against a background 
of geologic and human history: the spreading of thousands of different kinds of organisms from 
their native homes to invade new territories. This worldwide migration has been studied and 
graphically described by the British ecologist Charles Elton in his recent book The Ecology of 
Invasions. During the Cretaceous Period, some hundred million years ago, flooding seas cut 
many land bridges between continents and living things found themselves confined in what 
Elton calls 'colossal separate nature reserves'. There, isolated from others of their kind, they 
developed many new species. When some of the land masses were joined again, about 15 
million years ago, these species began to move out into new territories— a movement that is 
not only still in progress but is now receiving considerable assistance from man. 
The importation of plants is the primary agent in the modern spread of species, for animals 
have almost invariably gone along with the plants, quarantine being a comparatively recent and 
not completely effective innovation. The United States Office of Plant Introduction alone has 
introduced almost 200,000 species and varieties of plants from all over the world. Nearly half of 
the 180 or so major insect enemies of plants in the United States are accidental imports from 
abroad, and most of them have come as hitchhikers on plants. In new territory, out of reach of 
the restraining hand of the natural enemies that kept down its numbers in its native land, an 
invading plant or animal is able to become enormously abundant. Thus it is no accident that our 
most troublesome insects are introduced species. These invasions, both the naturally occurring 
and those dependent on human assistance, are likely to continue indefinitely. Quarantine and 
massive chemical campaigns are only extremely expensive ways of buying time. We are faced, 
according to Dr. Elton, 'with a life-and-death need not just to find new technological means of 
suppressing this plant or that animal'; instead we need the basic knowledge of animal 
populations and their relations to their surroundings that will 'promote an even balance and 
damp down the explosive power of outbreaks and new invasions.' 

Much of the necessary knowledge is now available but we do not use it. We train ecologists in 
our universities and even employ them in our governmental agencies but we seldom take their 
advice. We allow the chemical death rain to fall as though there were no alternative, whereas 
in fact there are many, and our ingenuity could soon discover many more if given opportunity. 
Have we fallen into a mesmerized state that makes us accept as inevitable that which is inferior 
or detrimental, as though having lost the will or the vision to demand that which is good? Such 
thinking, in the words of the ecologist Paul Shepard, 'idealizes life with only its head out of 
water, inches above the limits of toleration of the corruption of its own environment. ..Why 
should we tolerate a diet of weak poisons, a home in insipid surroundings, a circle of 
acquaintances who are not quite our enemies, the noise of motors with just enough relief to 
prevent insanity? Who would want to live in a world which is just not quite fatal?' 
Yet such a world is pressed upon us. The crusade to create a chemically sterile, insect-free 
world seems to have engendered a fanatic zeal on the part of many specialists and most of the 
so-called control agencies. On every hand there is evidence that those engaged in spraying 
operations exercise a ruthless power. 'The regulatory entomologists. ..function as prosecutor, 
judge and jury, tax assessor and collector and sheriff to enforce their own orders,' said 
Connecticut entomologist Neely Turner. The most flagrant abuses go unchecked in both state 
and federal agencies. It is not my contention that chemical insecticides must never be used. I do 
contend that we have put poisonous and biologically potent chemicals indiscriminately into the 
hands of persons largely or wholly ignorant of their potentials for harm. We have subjected 
enormous numbers of people to contact with these poisons, without their consent and often 
without their knowledge. If the Bill of Rights contains no guarantee that a citizen shall be secure 
against lethal poisons distributed either by private individuals or by public officials, it is surely 
only because our forefathers, despite their considerable wisdom and foresight, could conceive 
of no such problem. 

I contend, furthermore, that we have allowed these chemicals to be used with little or no 
advance investigation of their effect on soil, water, wildlife, and man himself. Future 
generations are unlikely to condone our lack of prudent concern for the integrity of the natural 
world that supports all life. There is still very limited awareness of the nature of the threat. This 
is an era of specialists, each of whom sees his own problem and is unaware of or intolerant of 
the larger frame into which it fits. It is also an era dominated by industry, in which the right to 
make a dollar at whatever cost is seldom challenged. When the public protests, confronted with 
some obvious evidence of damaging results of pesticide applications, it is fed little tranquilizing 
pills of half truth. We urgently need an end to these false assurances, to the sugar coating of 
unpalatable facts. It is the public that is being asked to assume the risks that the insect 
controllers calculate. The public must decide whether it wishes to continue on the present 
road, and it can do so only when in full possession of the facts. In the words of Jean Rostand, 
'The obligation to endure gives us the right to know.'''


In [None]:
# Tokenize the longer text, count each distinct token, and show the 20 most frequent
import collections
text_tokens = nltk.word_tokenize(text)
token_frequency = collections.Counter(text_tokens)
token_frequency.most_common(20)

### Note on Term Frequency

Most of these terms are uninformative: "the", "it", "to", ".", etc. This is a standard feature of natural languages: the most frequent words are typically function words, known in NLP as <i>stop words</i>, which tell us little about what the text is about. They can be important in some cases, but much of the time they are not, so a common first step is simply to remove them.
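
As a small illustration of how much of a text the most frequent words can take up, the sketch below counts tokens in one short phrase from the excerpt and measures the share covered by the two most common ones. The variable names are made up for this example, and the phrase is split on whitespace rather than tokenized with NLTK, so treat it as a toy demonstration rather than a measurement of the full text.

```python
import collections

# A lower-cased phrase from the excerpt, split on whitespace for simplicity
sample_tokens = "the town lay in the midst of a checkerboard of prosperous farms".split()

counts = collections.Counter(sample_tokens)
top_two = counts.most_common(2)

# How many of the tokens are accounted for by just the two most frequent words?
covered = sum(count for _, count in top_two)
share = covered / len(sample_tokens)

print(top_two)
print(f"{covered} of {len(sample_tokens)} tokens ({share:.0%})")
```

Even in a twelve-token phrase, two function words account for a third of the tokens; in longer texts the effect is usually at least as pronounced.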

# 3. Pre-Processing: Lower Case, Remove Stop Words and Punctuation

Typically, a text goes through a number of pre-processing steps before the actual analysis begins. We have already seen the tokenization step. Pre-processing usually also includes transforming tokens to lower case and removing stop words and punctuation marks.

### Lower Case

In [None]:
# And now transform it to lower case, all at once

sentence.lower()

In [None]:
# Okay, let's set our list of tokens to lower case, one at a time

# The syntax of the line below is tricky. Don't worry about it for now.
# We'll spend plenty of time on it tomorrow!

lower_case_tokens = [ word.lower()  for word in sentence_tokens ]

In [None]:
# Inspect

print(lower_case_tokens)
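
If the bracketed syntax above looks opaque, it may help to see that a list comprehension is shorthand for an ordinary loop that builds a list. The sketch below uses a short hand-written token list (`sample_tokens` is an illustrative name, not a variable from the cells above) and shows that both forms produce the same result.

```python
sample_tokens = ["Those", "who", "contemplate", "the", "beauty", "of", "the", "earth"]

# Loop form: start with an empty list and append one lower-cased token at a time
lowered_loop = []
for word in sample_tokens:
    lowered_loop.append(word.lower())

# Comprehension form: the same operation written on a single line
lowered_comprehension = [word.lower() for word in sample_tokens]

print(lowered_loop)
print(lowered_loop == lowered_comprehension)
```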

### Stop Words

In [None]:
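# Clean up the data: lower-case, then drop stop words and punctuation
from nltk.corpus import stopwords
import string

# Note: this reuses the name lower_case_tokens, now for the longer text
lower_case_tokens = [ word.lower()  for word in text_tokens ]

# Build the stop word set once, rather than re-reading the list for every token
english_stops = set(stopwords.words('english'))
tokens_nostops = [ word  for word in lower_case_tokens  if word not in english_stops ]

# 'in' on a string is a substring test, so this removes single punctuation
# characters; multi-character tokens such as "--" or "''" are kept
tokens_clean = [word for word in tokens_nostops if word not in string.punctuation]
tokens_clean[:50]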

### Re-count the Most Frequent Words

In [None]:
# Count the new token list

word_frequency_clean = collections.Counter(tokens_clean)
word_frequency_clean.most_common(20)

Better! The most frequent words now give us a pretty good sense of the substance of this text. But we still have problems. For example, the token "'s" sneaked in there. One solution is to keep adding stop words to our list, but this could go on forever and does not scale well when processing lots of text.
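
One lightweight option, sketched here as an assumption rather than as part of the original workflow, is to keep only tokens made up entirely of letters by using Python's built-in `str.isalpha()`. The token list below is hand-written for illustration.

```python
# Hand-written tokens, including leftovers a stop word list would miss
sample_tokens = ["man", "'s", "war", "against", "nature", "--", "``", "500", "chemicals"]

# Keep a token only if every character in it is a letter
alphabetic_tokens = [token for token in sample_tokens if token.isalpha()]

print(alphabetic_tokens)
```

This removes "'s" and stray punctuation in one pass, but it is a blunt tool: it also discards numbers such as "500" and hyphenated words such as "short-wave", which may matter for some research questions.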

There's another way of identifying content words, and it involves identifying the part of speech of each word.

# 4. Part-of-Speech Tagging

You may have noticed that stop words are typically short function words, like conjunctions and prepositions. Intuitively, if we could identify the part of speech of each word, we would have another way of identifying which words contribute to the text's subject matter. NLTK can do that too!

NLTK has a <i>POS Tagger</i>, which identifies and labels the part-of-speech (POS) for every token in a text. The particular labels that NLTK uses come from the Penn Treebank corpus, a major resource from corpus linguistics.

You can find a list of all Penn POS tags here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

Note that, from this point on, the code is going to get a little more complex. Don't worry about the particularities of each line. For now, we will focus on the NLP tasks themselves and the textual patterns they identify.

In [None]:
# Use the NLTK POS tagger

tagged_tokens = nltk.pos_tag(text_tokens)
tagged_tokens[:20]

### Most Frequent POS Tags

In [None]:
# We'll tread lightly here, and just say that we're counting POS tags

tag_frequency = collections.Counter( [ tag for (word, tag) in tagged_tokens ])
tag_frequency.most_common(10)

### Now it's getting interesting

The "IN" tag covers prepositions (and subordinating conjunctions), so it's no surprise that it should be among the most common. However, we can see at a glance now that the text contains a lot of adjectives, "JJ". This feels like it tells us something about the rhetorical style or structure of the text: certain qualifiers seem to be important to its meaning.

Let's dig in to see what those adjectives are.

In [None]:
# Let's filter our list, so it only keeps adjectives

adjectives = [word for word,pos in tagged_tokens if pos == 'JJ' or pos=='JJR' or pos=='JJS']
adj_frequency = collections.Counter(adjectives)
adj_frequency.most_common(5)

In [None]:
# Let's do the same for nouns.

nouns = [word for word,pos in tagged_tokens if pos=='NN' or pos=='NNS']
noun_frequency = collections.Counter(nouns)
print(noun_frequency.most_common(5))

And now verbs.

In [None]:
# And we'll do the verbs in one fell swoop

verbs = [word for word,pos in tagged_tokens if pos == 'VB' or pos=='VBD' or pos=='VBG' or pos=='VBN' or pos=='VBP' or pos=='VBZ']
verb_frequency = collections.Counter(verbs)
print(verb_frequency.most_common(5))

In [None]:
# If we bring all of this together we get a pretty good summary of the text

print(adj_frequency.most_common(5))
print(noun_frequency.most_common(5))
print(verb_frequency.most_common(5))
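
The three filters above repeat the same pattern with different tag lists. Because related Penn Treebank tags share a prefix ("JJ", "JJR", "JJS" for adjectives, for instance), the pattern can be sketched as one helper that matches on the prefix. The helper name `words_with_tag_prefix` and the hand-written `(word, tag)` pairs are assumptions for illustration; the pairs stand in for real `nltk.pos_tag` output.

```python
# Hand-written (word, tag) pairs standing in for nltk.pos_tag output
sample_tagged = [
    ("strange", "JJ"), ("blight", "NN"), ("crept", "VBD"),
    ("over", "IN"), ("the", "DT"), ("area", "NN"),
    ("deadlier", "JJR"), ("chemicals", "NNS"), ("kill", "VBP"),
]

def words_with_tag_prefix(tagged, prefix):
    """Return the words whose POS tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]

sample_adjectives = words_with_tag_prefix(sample_tagged, "JJ")
sample_nouns = words_with_tag_prefix(sample_tagged, "NN")
sample_verbs = words_with_tag_prefix(sample_tagged, "VB")

print(sample_adjectives)
print(sample_nouns)
print(sample_verbs)
```

One difference to keep in mind: matching on the "NN" prefix also picks up proper nouns ("NNP", "NNPS"), which the noun filter in the cell above deliberately leaves out.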

# 5. Concordances and Similar Words using NLTK

Tallying word frequencies gives us a bird's-eye view of our text, but we lose one important aspect: context. As the linguist J. R. Firth's dictum goes: "You shall know a word by the company it keeps."

Concordances show us every occurrence of a given word in a text, inside a window of context words that appear before and after it. This is helpful for close reading, since we can get at a word's meaning by seeing how it is used. We can also use the logic of shared context to identify which words have similar meanings. To illustrate this, we can look at how the words "life" and "chemicals" are used in our excerpt from *Silent Spring*.
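
To show what a concordance does under the hood, here is a minimal keyword-in-context sketch in plain Python. It is not NLTK's implementation: the function name `keyword_in_context`, the `window` parameter, and the bracketed output format are all assumptions made for this example, and the token list is a short hand-built phrase.

```python
def keyword_in_context(tokens, keyword, window=3):
    """Return one line per occurrence of keyword, with `window` tokens on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{token}] {right}")
    return lines

sample_tokens = "all life seemed to live in harmony and all life is caught in its crossfire".split()

kwic_lines = keyword_in_context(sample_tokens, "life", window=2)
for line in kwic_lines:
    print(line)
```

Widening the window gives more context per line at the cost of readability when a word occurs many times.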

### Concordance

In [None]:
# Transform our raw token list into an NLTK Text object
text_nltk = nltk.Text(text_tokens)
text_nltk.concordance("life")

In [None]:
text_nltk.concordance("chemicals")

### Contextual Similarity

In [None]:
# Get words that appear in a similar context to "life"

text_nltk.similar("life")

In [None]:
text_nltk.similar("chemicals")
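
The intuition behind contextual similarity can be sketched without NLTK: record the (previous word, next word) pair around each token, then look for other words that occur in the same pairs. This is a rough illustration of the idea only and does not reproduce NLTK's scoring; the function names `context_map` and `similar_words` and the sample phrase are assumptions.

```python
import collections

def context_map(tokens):
    """Map each word to the set of (previous, next) word pairs it appears between."""
    contexts = collections.defaultdict(set)
    for i in range(1, len(tokens) - 1):
        contexts[tokens[i]].add((tokens[i - 1], tokens[i + 1]))
    return contexts

def similar_words(tokens, target):
    """Return words sharing at least one context with target, most shared first."""
    contexts = context_map(tokens)
    target_contexts = contexts[target]
    scores = {
        word: len(ctx & target_contexts)
        for word, ctx in contexts.items()
        if word != target and ctx & target_contexts
    }
    return sorted(scores, key=scores.get, reverse=True)

sample_tokens = "the birds were silent and the streams were lifeless and the farms were quiet".split()

birds_like = similar_words(sample_tokens, "birds")
print(birds_like)
```

Here "streams" and "farms" are returned because, like "birds", they appear between "the" and "were". On a short excerpt such matches are sparse, so results on small texts should be read with caution.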

# 6. Sentiment Analysis

We can also learn more about the sentiment of the text: which parts of it read as positive and which as negative.

In [None]:
from textblob import TextBlob

# Overall polarity (sentiment) of the text, on a scale from -1.0 (negative) to 1.0 (positive)
# A value close to 0 suggests that, overall, the text has almost as much positive sentiment as negative
text_blobbed = TextBlob(text)
text_blobbed.sentiment.polarity
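
To give a feel for where a polarity number can come from, here is a toy lexicon-based sketch: each known word carries a score, and the polarity of a passage is the average score of the words found. The lexicon values below are invented for illustration and are not TextBlob's; the function `toy_polarity` is likewise hypothetical, and real sentiment tools handle negation, intensifiers, and much larger vocabularies.

```python
# Hypothetical toy lexicon: these scores are invented for illustration only
toy_lexicon = {
    "beauty": 0.8, "healing": 0.6, "harmony": 0.7,
    "death": -0.8, "evil": -0.9, "poisons": -0.7,
}

def toy_polarity(tokens, lexicon):
    """Average the scores of the tokens found in the lexicon (0.0 if none are found)."""
    scores = [lexicon[token] for token in tokens if token in lexicon]
    if not scores:
        return 0.0
    return sum(scores) / len(scores)

positive_example = "the beauty of the earth is healing".split()
mixed_example = "harmony gave way to death".split()

positive_score = toy_polarity(positive_example, toy_lexicon)
mixed_score = toy_polarity(mixed_example, toy_lexicon)

print(round(positive_score, 2))
print(round(mixed_score, 2))
```

The mixed example lands near zero because its one positive and one negative word nearly cancel out, which is one way a long text with strong language on both sides can end up with a polarity close to 0.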

In [None]:
# Explore the text sentiment in more detail

# View strongly positive sentences (polarity above 0.45)
for item in text_blobbed.sentences:
    if item.sentiment.polarity > 0.45:
        print(item.replace('\n', ' '))

In [None]:
# View strongly negative sentences (polarity below -0.25)
for item in text_blobbed.sentences:
    if item.sentiment.polarity < -0.25:
        print(item.replace('\n', ' '))