<a href="https://colab.research.google.com/github/GenevieveMilliken/NLP/blob/main/NLTK_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Toolkit (NLTK)

Notes from Ch. 1: https://www.nltk.org/book/ch01.html

The Natural Language Toolkit, or more commonly NLTK, is a Python suite of libraries used for natural language processing (NLP) on English-language texts.

NLTK is intended to support research and teaching in NLP or closely related areas, including empirical linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning.

NLTK supports classification, tokenization, stemming, tagging, parsing, and semantic reasoning functionalities.

# Getting Started

In [None]:
!pip install nltk

In [None]:
import nltk

In [None]:
# nltk.download()

In [None]:
nltk.download('book')

In [None]:
from nltk.book import *

## Searching Texts

NLTK provides the `concordance()` method to find every occurrence of a keyword and print each one with its surrounding context. Note that the method only prints its output; it does not return a value.

In [None]:
text1.concordance("monstrous")

In [None]:
text1.concordance("flood")

In [None]:
text1.concordance("Whale")

In [None]:
text2.concordance("affection")
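
Because `concordance()` only prints, it can help to see the idea as a function that returns its matches. The following is a minimal keyword-in-context sketch in plain Python, not NLTK's implementation; the name `kwic`, the `window` parameter, and the sample tokens are assumptions made for illustration. Lowercasing both sides before comparing is also a choice made for this sketch.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, match, right context) for each hit of keyword."""
    keyword = keyword.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

# Hypothetical sample tokens, used only to exercise the function
tokens = "Call me Ishmael . Some years ago never mind how long".split()
hits = kwic(tokens, "ishmael", window=2)
print(hits)
```

The returned list can be filtered, counted, or saved, which printed output does not allow.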

A concordance permits us to see words in context. For example, we saw that *monstrous* occurred in contexts such as *...most monstrous size* and *...monstrous clubs and spears*. What other words appear in a similar range of contexts? We can find out by appending the term `similar` to the name of the text in question, then inserting the relevant word in parentheses:

In [None]:
text1.similar("monstrous")

In [None]:
text2.similar("monstrous")

In [None]:
text2.common_contexts(["monstrous", "remarkably"])

It is one thing to automatically detect that a particular word occurs in a text, and to display some words that appear in the same context. However, we can also determine the location of a word in the text: how many words from the beginning it appears. This positional information can be displayed using a dispersion plot.

In [None]:
#Dispersion plot for Inaugural Address Corpus 
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
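
The positional information behind a dispersion plot is just the list of offsets at which a word occurs. A small sketch in plain Python, assuming a hypothetical helper named `offsets` and a short sample list rather than a full NLTK text:

```python
def offsets(tokens, target):
    """Return the zero-based positions at which target occurs in tokens."""
    return [i for i, tok in enumerate(tokens) if tok == target]

saying = ['After', 'all', 'is', 'said', 'and', 'done',
          'more', 'is', 'said', 'than', 'done']
said_positions = offsets(saying, 'said')
print(said_positions)
```

Plotting one mark per offset for each word of interest gives the kind of picture `dispersion_plot()` draws.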

Now, just for fun, let's try generating some random text in the various styles we have just seen.

In [None]:
text2.generate()

## Counting Vocabulary
The most obvious fact about texts that emerges from the preceding examples is that they differ in the vocabulary they use. In this section we will see how to use the computer to count the words in a text in a variety of useful ways.

Let's begin by finding out the length of a text from start to finish, in terms of the words and punctuation symbols that appear. We use the term len to get the length of something, which we'll apply here to the book of Genesis:

In [None]:
len(text3)
print(f"The word count of {text3.name} is {len(text3)}.")

So Genesis has 44,764 words and punctuation symbols, or "tokens." A token is the technical name for a sequence of characters that we want to treat as a unit, such as a word or a punctuation mark.

When we count the number of tokens in a text, say, the phrase *to be or not to be*, we are counting occurrences of these sequences. Thus, in our example phrase there are two occurrences of *to*, two of *be*, and one each of *or* and *not*. But there are only four distinct vocabulary items in this phrase. **How many distinct words does the book of Genesis contain?**

The vocabulary of a text is just the [set](https://www.w3schools.com/python/python_sets.asp) of tokens that it uses, since in a set, all duplicates are collapsed together. In Python we can obtain the vocabulary items of text3 with the command: set(text3). 
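
The difference between tokens and distinct vocabulary items can be checked on the example phrase itself. This sketch splits on whitespace, which is a simplification and not how NLTK tokenizes text:

```python
phrase = "to be or not to be".split()

num_tokens = len(phrase)      # every occurrence counts
vocabulary = set(phrase)      # duplicates are collapsed
num_types = len(vocabulary)   # distinct items only

print(num_tokens, num_types, sorted(vocabulary))
```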

In [None]:
sorted(set(text3))

In [None]:
len(set(text3))

print(f"{text3.name} has {len(set(text3))} distinct words or \"item types\".")

A word type is the form or spelling of the word independently of its specific occurrences in a text — that is, the word considered as a unique item of vocabulary. Our count of 2,789 items will include punctuation symbols, so we will generally call these unique items types instead of word types.

Now, let's calculate a measure of the lexical richness of the text.

In [None]:
len(set(text3)) / len(text3)

The preceding example shows us that the number of distinct words is just 6% of the total number of words, or equivalently that each word is used about 16 times on average:

In [None]:
100 / 6
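
The same arithmetic can be done without rounding to 6% first, using the two counts reported above for Genesis (44,764 tokens and 2,789 types). The variable names are chosen here for illustration:

```python
total_tokens = 44764   # len(text3), as reported above
distinct_types = 2789  # len(set(text3)), as reported above

richness = distinct_types / total_tokens   # share of tokens that are distinct
avg_uses = total_tokens / distinct_types   # average occurrences per type

print(f"{richness:.1%} distinct; each type used about {avg_uses:.1f} times")
```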

Next, let's focus on particular words. We can count how often a word occurs in a text, and compute what percentage of the text is taken up by a specific word:

In [None]:
text1.count("boat")

100 * text1.count("boat") / len(text1)

In [None]:
# it might become tedious to do this for every word, so we can define functions

def lexical_diversity(text):
  return len(set(text)) / len(text)

def percentage(count, total):
  return 100 * count / total


In [None]:
lexical_diversity(text3)

In [None]:
percentage(text1.count("boat") , len(text1))

#  A Closer Look at Python: Texts as Lists of Words

NLTK has converted the first sentence of each of the books into a list. 

In [None]:
sent1

In [None]:
sent2

In [None]:
sent3

As we have seen, a text in Python is a list of words, represented using a combination of brackets and quotes. Just as with an ordinary page of text, we can count up the total number of words in text1 with len(text1), and count the occurrences in a text of a particular word — say, 'heaven' — using text1.count('heaven').

In [None]:
len(text1)

text1.count("Heaven")

In [None]:
# Python indices start at 0, so index 599 returns the 600th token

text1[599]

In [None]:
# index where the word first occurs
text1.index("who")

In [None]:
# We can also slice 
# first 100 items of Moby Dick

text1[0:100]
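
Indexing and slicing work the same way on any Python list, so they are easy to try on a short one. This sketch uses the four tokens of `sent1` written out as a literal, so it runs without loading the NLTK texts:

```python
sent = ['Call', 'me', 'Ishmael', '.']

first = sent[0]                    # indices start at 0
last = sent[-1]                    # negative indices count from the end
position = sent.index('Ishmael')   # index of the first occurrence
middle = sent[1:3]                 # slice: start included, end excluded

print(first, last, position, middle)
```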

In [None]:
'''rest of this section in the documentation covers strings, concatenation, index, slices;
 covered in NYUHSL Intro Python class'''

## Computing with Language: Simple Statistics

In [None]:
# Review
 	
saying = ['After', 'all', 'is', 'said', 'and', 'done', 'more', 'is', 'said', 'than', 'done']
tokens = set(saying)
tokens = sorted(tokens)
tokens[-2:]

## Frequency Distributions

A frequency distribution tells us the frequency of each vocabulary item in the text. It is a "distribution" because it tells us how the total number of word tokens in the text are distributed across the vocabulary items. Since we often need frequency distributions in language processing, NLTK provides built-in support for them. Let's use a FreqDist to find the 50 most frequent words of Moby Dick:

In [None]:
freq_dist_1 = FreqDist(text1)
print(freq_dist_1)

# type(freq_dist_1)

freq_dist_1.most_common(50)
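
`FreqDist` builds on Python's `collections.Counter`, so the core idea can be tried with the standard library alone. A small sketch on the `saying` list from the review cell, not a replacement for `FreqDist`:

```python
from collections import Counter

saying = ['After', 'all', 'is', 'said', 'and', 'done',
          'more', 'is', 'said', 'than', 'done']

counts = Counter(saying)

print(counts['said'])          # how often one item occurs
print(counts.most_common(3))   # the three most frequent items
print(sum(counts.values()))    # total number of tokens counted
```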

Do any words produced in the last example help us grasp the topic or genre of this text? Only one word, *whale*, is slightly informative! It occurs over 900 times. The rest of the words tell us nothing about the text; they're just English "plumbing." What proportion of the text is taken up with such words? We can generate a cumulative frequency plot for these words using `freq_dist_1.plot(50, cumulative=True)`, which produces the graph below. These 50 words account for nearly half the book!

In [None]:
freq_dist_1.plot(50, cumulative=True)
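
The claim behind the cumulative plot can also be computed as a single number: the share of all tokens covered by the `n` most frequent items. A sketch using `collections.Counter`, assuming a hypothetical helper named `top_n_coverage` and a non-empty token list:

```python
from collections import Counter

def top_n_coverage(tokens, n):
    """Fraction of all tokens accounted for by the n most frequent items."""
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(tokens)

saying = ['After', 'all', 'is', 'said', 'and', 'done',
          'more', 'is', 'said', 'than', 'done']
coverage = top_n_coverage(saying, 3)
print(f"{coverage:.0%} of tokens are covered by the top 3 items")
```

Since an NLTK text behaves like a list of tokens, the same function should work when given `text1` and `50`.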

## Fine-grained Selection of Words

Next, let's look at the long words of a text; perhaps these will be more characteristic and informative. For this we adapt some notation from set theory. We would like to find the words from the vocabulary of the text that are more than 15 characters long. Let's call this property P, so that P(w) is true if and only if w is more than 15 characters long. Now we can express the words of interest using mathematical set notation, as shown in (1a). This means "the set of all w such that w is an element of V (the vocabulary) and w has property P". (1b) is the corresponding Python expression, written as a list comprehension.

1a. `{w | w ∈ V & P(w)}` <br>
1b. `[w for w in V if p(w)]`

In [None]:
V = set(text1)

long_words = [w for w in V if len(w) > 15]
print(sorted(long_words))

These very long words are often hapaxes (i.e., unique) and perhaps it would be better to find frequently occurring long words. This seems promising since it eliminates frequent short words (e.g., the) and infrequent long words (e.g. antiphilosophists). Here are all words from the chat corpus that are longer than seven characters, that occur more than seven times:

In [None]:
freq_dist_5 = FreqDist(text5)
sorted(w for w in set(text5) if len(w) > 7 and freq_dist_5[w] > 7)

## Collocations and Bigrams

A collocation is a sequence of words that occur together unusually often. Thus *red wine* is a collocation, whereas *the wine* is not. A characteristic of collocations is that they are resistant to substitution with words that have similar senses; for example, maroon wine sounds definitely odd.

To get a handle on collocations, we start off by extracting from a text a list of word pairs, also known as bigrams. This is easily accomplished with the function bigrams():

In [122]:
list(bigrams(["more", "is", "said", "than", "done"]))

[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
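
For intuition, the same pairs can be built in plain Python by zipping a list with a copy of itself shifted by one position. This is a sketch of the idea, not NLTK's implementation of `bigrams()`:

```python
words = ["more", "is", "said", "than", "done"]

# Pair each word with its successor; zip stops at the shorter sequence
pairs = list(zip(words, words[1:]))
print(pairs)
```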

Now, collocations are essentially just frequent bigrams, except that we want to pay more attention to the cases that involve rare words. In particular, we want to find bigrams that occur more often than we would expect based on the frequency of the individual words. The collocations() function does this for us. 

In [123]:
text4.collocations()

United States; fellow citizens; years ago; four years; Federal
Government; General Government; American people; Vice President; God
bless; Chief Justice; one another; fellow Americans; Old World;
Almighty God; Fellow citizens; Chief Magistrate; every citizen; Indian
tribes; public debt; foreign nations
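
To make "more often than we would expect" concrete, here is a sketch that scores bigrams by pointwise mutual information (PMI), which compares how often a pair occurs with how often its words occur individually. PMI is a swapped-in measure chosen for simplicity; NLTK's `collocations()` uses its own scoring, which this sketch does not reproduce. The function name `pmi_scores` and the toy sentence are made up for illustration.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Score each adjacent word pair by pointwise mutual information."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = len(tokens) - 1
    scores = {}
    for (w1, w2), count in pair_counts.items():
        p_pair = count / n_pairs
        p_w1 = word_counts[w1] / n_words
        p_w2 = word_counts[w2] / n_words
        # Positive when the pair is more frequent than independence predicts
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Toy data, invented for this example
tokens = "the red wine and the wine and the red wine and the bread".split()
scores = pmi_scores(tokens)
print(scores[("red", "wine")] > scores[("the", "wine")])
```

On this toy input, *red wine* scores higher than *the wine* because *the* is frequent everywhere, which matches the intuition in the text. On real data, PMI tends to overrate very rare pairs, so a minimum frequency cutoff is usually added.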


In [125]:
text2.collocations()

Colonel Brandon; Sir John; Lady Middleton; Miss Dashwood; every thing;
thousand pounds; dare say; Miss Steeles; said Elinor; Miss Steele;
every body; John Dashwood; great deal; Harley Street; Berkeley Street;
Miss Dashwoods; young man; Combe Magna; every day; next morning


In [126]:
text1.collocations()

Sperm Whale; Moby Dick; White Whale; old man; Captain Ahab; sperm
whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;
years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief
mate; white whale; ivory leg; one hand


In [132]:
def collocation(text):
  # collocations() prints its result and returns None
  return text.collocations()

books = [text1, text2, text3, text4, text5, text6, text7, text8, text9]

for book in books:
  collocation(book)
  print("-------")


Sperm Whale; Moby Dick; White Whale; old man; Captain Ahab; sperm
whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;
years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief
mate; white whale; ivory leg; one hand
-------
Colonel Brandon; Sir John; Lady Middleton; Miss Dashwood; every thing;
thousand pounds; dare say; Miss Steeles; said Elinor; Miss Steele;
every body; John Dashwood; great deal; Harley Street; Berkeley Street;
Miss Dashwoods; young man; Combe Magna; every day; next morning
-------
said unto; pray thee; thou shalt; thou hast; thy seed; years old;
spake unto; thou art; LORD God; every living; God hath; begat sons;
seven years; shalt thou; little ones; living creature; creeping thing;
savoury meat; thirty years; every beast
-------
United States; fellow citizens; years ago; four years; Federal
Government; General Government; American people; Vice President; God
bless; Chief Justice; one another; fellow Americans; Old World;
Almighty God; Fell

In [135]:
# let's use set to collect the distinct word lengths that occur in text1

sorted(set([len(w) for w in text1]))

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20]

In [143]:
# freq distribution 

freq_dist = FreqDist(len(w) for w in text1)
print(freq_dist)
freq_dist

<FreqDist with 19 samples and 260819 outcomes>


FreqDist({3: 50223, 1: 47933, 4: 42345, 2: 38513, 5: 26597, 6: 17111, 7: 14399, 8: 9966, 9: 6428, 10: 3528, ...})

In [144]:
freq_dist.most_common()

[(3, 50223),
 (1, 47933),
 (4, 42345),
 (2, 38513),
 (5, 26597),
 (6, 17111),
 (7, 14399),
 (8, 9966),
 (9, 6428),
 (10, 3528),
 (11, 1873),
 (12, 1053),
 (13, 567),
 (14, 177),
 (15, 70),
 (16, 22),
 (17, 12),
 (18, 1),
 (20, 1)]
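
The raw counts are easier to read as percentages of the whole text. This sketch copies the four largest counts and the total of 260,819 outcomes from the output above into a plain dictionary, so it runs without NLTK:

```python
# Counts copied from the FreqDist output above (word length -> occurrences)
length_counts = {3: 50223, 1: 47933, 4: 42345, 2: 38513}
total_tokens = 260819

shares = {length: 100 * count / total_tokens
          for length, count in length_counts.items()}

for length, share in sorted(shares.items()):
    print(f"length {length}: {share:.1f}% of tokens")
```

Words of length 3 make up a little under a fifth of all tokens in the book.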

## Back to Python: Making Decisions and Taking Control


In [148]:
# Conditionals 

sent7

print([w for w in sent7 if len(w) < 4])
print([w for w in sent7 if len(w) <= 4])
print([w for w in sent7 if len(w) == 4])
print([w for w in sent7 if len(w) != 4])

[',', '61', 'old', ',', 'the', 'as', 'a', '29', '.']
[',', '61', 'old', ',', 'will', 'join', 'the', 'as', 'a', 'Nov.', '29', '.']
['will', 'join', 'Nov.']
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', '29', '.']


In [149]:
sorted(w for w in set(text1) if w.endswith('ableness'))

['comfortableness',
 'honourableness',
 'immutableness',
 'indispensableness',
 'indomitableness',
 'intolerableness',
 'palpableness',
 'reasonableness',
 'uncomfortableness']

In [150]:
sorted(term for term in set(text4) if 'gnt' in term)

['Sovereignty', 'sovereignties', 'sovereignty']
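
The same kind of condition can clean up a vocabulary count by dropping punctuation and numbers and ignoring case. A sketch using `str.isalpha()` and `str.lower()` on the tokens of `sent7`, written out here as a literal so the block runs without NLTK:

```python
sent = ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join',
        'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']

# Keep only purely alphabetic tokens, normalized to lower case
words_only = [w.lower() for w in sent if w.isalpha()]
vocab = sorted(set(words_only))

print(len(sent), len(words_only), len(vocab))
```

Note that `isalpha()` also drops tokens such as `Nov.` because of the trailing period, which may or may not be what you want.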

## Looping with Conditions

Now we can combine the if and for statements. We will loop over every item of the list, and print the item only if it ends with the letter l. 

In [152]:
sent1

for word in sent1:
  if word.endswith("l"):
    print(word)

Call
Ishmael


In [154]:
for token in sent1:
  if token.islower():
    print(f"{token} is lower case.")
  elif token.istitle():
    print(f"{token} is title case.")
  else:
    # catch-all branch: in sent1 the only remaining token is the final "."
    print(f"{token} is punctuation")
  

Call is title case.
me is lower case.
Ishmael is title case.
. is punctuation
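
The `else` branch above labels anything that is neither lower case nor title case as punctuation, which would mislabel tokens such as `61` or `NASA` in other sentences. A sketch of a stricter check using `string.punctuation` from the standard library; the helper name `classify` and the extra test tokens are assumptions added for illustration:

```python
import string

def classify(token):
    """Label a token as punctuation, lower case, title case, or other."""
    if token and all(ch in string.punctuation for ch in token):
        return "punctuation"
    if token.islower():
        return "lower case"
    if token.istitle():
        return "title case"
    return "other"

for token in ['Call', 'me', 'Ishmael', '.', '61', 'NASA']:
    print(f"{token} is {classify(token)}")
```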
