# Introduction to the Natural Language Toolkit (NLTK)

## Text Analysis with NLTK

### [Centre for Data, Culture & Society](http://cdcs.ed.ac.uk)

Course Instructor: Xandra Dave Cochran

Course Dates: February 2024

****

### Python Refresher

Can you name some of Python's **data types**?

There are two data types that are particularly important to understand when doing text analysis: **string** and **list**.  

**Strings** are sequences of characters (as in letters, numbers, spaces, or punctuation marks) that are contained within single quotes (`'I am a string'`) or double quotes (`"I'm a string too!"`).

**Lists** are collections of data of any type (str, int, float, dict, list, tuple, set) that are contained within square brackets (`[ ]`).  Lists can have zero or more elements, and a single list can contain elements of different data types (for example: `['hi', 123, "bonjour !", [], {"A":1, "B":2}, (1.5, 2.8), {'a', 'b', 'c'}]`).  The elements inside lists are *ordered*, which allows us to reference them by their *index* (position - numeric, starts from 0).

In [None]:
print("Hello world")   # print() is a FUNCTION, "Hello world" is a STRING (str)

In [None]:
greeting = "Hello world"  # we've assigned the string to a VARIABLE

In [None]:
# This is a comment.  Comments are for you and me, Python knows to ignore them.
# In this comment I'll explain that to show values, we can choose to use print() or 
# not, and the output will be slightly different, as you can see below.
# If you want to show multiple values below a code cell like this one, you'll need all
# of the values you want returned inside a print() function (except the last one, which
# is optional), otherwise only the value in the last line of your cell will be shown
print(greeting)
greeting

In [None]:
greeting_list = ["hello", "bonjour", "salaam", "hola"]  # 1-dimensional list of strings
print(type(greeting_list))                              # type() is another function
print(type(greeting_list[0]))

We can use a **for** loop or a **while** loop to iterate through items in a collection, such as our `greeting_list`.

In [None]:
i = 0
for greeting in greeting_list:
    print("Greeting at index "+str(i)+": "+greeting)  # an example of CONCATENATION
    i += 1                                            # a shortcut for writing i = i + 1


In [None]:
i = -1
while i > -5:
    print(greeting_list[i])  # we can reference indices backwards using negative numbers
    i = i - 1                # we could also write i -= 1

We can define functions in order to make our code easy to re-use and maintain:

In [None]:
def show_greetings(gl):
    for i, greeting in enumerate(gl): # you can iterate over the indices *and* the content of a list with the enumerate() function
        print(f"Greeting at index {i}: {greeting}") # You can also insert values into strings with the f-string syntax

show_greetings(greeting_list)

In [None]:
def exclaim(greeting):
    exclamation = greeting + '!'
    # you can use conditionals to make code run only if a condition is met:
    if greeting == 'hola':
        # Use `return` when you want a function to return a value that can be used by other code
        # (`print` displays the value, but doesn't make it available to the code that called the function)
        return '¡' + exclamation
    else:
        return exclamation
    
for g in greeting_list:
    print(exclaim(g))

We can access subsets of a list using **slicing** with square brackets and colons, where the number before the colon is included in the slice but the number after the colon is not.  If a number is omitted before the colon, Python knows to go all the way to the starting element of the list.  If a number is omitted after the colon, Python knows to go all the way to the ending element of the list.

In [None]:
print(greeting_list[0:1])
print(greeting_list[:1])
print(greeting_list[2:])
print(greeting_list[:])
print(greeting_list[:-3])
print(greeting_list[-3:])

Lists are not the only built-in data-structure in Python. A `set` is an *unordered* data-structure that can hold a collection of *unique* values.

In [None]:
greetings_set = set(greeting_list)
print(greetings_set)

In [None]:
artist_set = {'Leonardo', 'Donatello', 'Raphael', 'Michelangelo'}
print(artist_set)

In [None]:
mere_oblivion_list = ['sans', 'teeth', 'sans', 'eyes', 'sans', 'taste', 'sans', 'everything']
mere_oblivion_set = set(mere_oblivion_list)
print(mere_oblivion_list)
print(mere_oblivion_set)

## Text Analysis with NLTK

****

**Reference:**

Steven Bird, Ewan Klein, Edward Loper (2019) *Natural Language Processing with Python - Analyzing Text with the Natural Language Toolkit.*  3rd Edition.  https://www.nltk.org/book/

***

NLTK, which stands for Natural Language Toolkit, is a popular coding library for text analysis with the programming language Python.  While Python alone has some basic capabilities for analyzing text, NLTK has much more to offer, as we will see!  This Jupyter Notebook will cover the following concepts:

* [Tokenization](#tok)
* [Frequency counts and distributions](#fre)
* [Normalization](#nor)
* [Stemming](#ste)
* [Lemmatizing](#lem)
* [Part-of-speech tagging](#pos)
* [Collocations and n-grams](#col)

These are the building blocks for more complicated text analysis tasks.  They are generally part of the first step of text analysis called **preprocessing**, which gets your text data formatted in a way that NLTK's methods and functions can easily interpret.

Before we can begin our text analysis, though, we should import the libraries we'll want to use to explore the capabilities of NLTK!

In [None]:
import nltk
from nltk.book import * # the `*` means import all corpora (you could also specify a specific corpus)

# As an alternative to `nltk.download()`, as shown in last class's Notebook, you can specify
# which NLTK packages to download
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('wordnet')
from nltk.corpus import wordnet, gutenberg
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.text import Text
from nltk.stem.porter import PorterStemmer
from nltk.probability import FreqDist
nltk.download('averaged_perceptron_tagger')
from nltk.tag import pos_tag
nltk.download('tagsets')  # part of speech tags
from nltk.draw.dispersion import dispersion_plot as displt

import matplotlib.pyplot as plt   # for drawing charts to visualize data

import re               # the Regular Expression (or RegEx) library, which is useful in combination with NLTK
import string           # another useful library for accessing lists of all letters, all punctuation, etc.

### Demo 1

Remember that the names on the left (`text1`, `text2`, ..., `text9`) are **variables** that point to the text corpora on the right (`Moby Dick...`, `Sense and Sensibility...`, ..., `The Man Who Was Thursday...`).

I wonder what happens if we try to print one of these?

In [None]:
print(text2)

Hmm.  That doesn't show us much.  What if we try slicing?

In [None]:
print(text2[0:100])

 <a id="tok"></a>
You can see that the text has been **tokenized**, or separated into its individual words, numbers, and punctuation marks.
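
To make the idea concrete, here is a minimal sketch of tokenization using only Python's built-in `re` module. The function name `simple_tokenize` and its pattern are assumptions made up for illustration; NLTK's own tokenizers handle many more cases (contractions, abbreviations, and so on), so treat this as a rough approximation only.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of letters, digits, or underscores (a "word");
    # [^\w\s] matches any single character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

sample = "The family of Dashwood had long been settled in Sussex."
tokens = simple_tokenize(sample)
print(tokens)
print(len(tokens))
```

Notice that the full stop becomes a token of its own, just like the punctuation marks in the slice of `text2`.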

In [None]:
type(text2)

### Demo 2

Let's work out the length in tokens of `text1` and `text2`, *Moby Dick* and *Sense and Sensibility*.

In [None]:
md = len(text1)
ss = len(text2)
print("Length of Moby Dick:", md)
print("Length of Sense and Sensibility:", ss)

Write a function to calculate the number of *unique* tokens in a text:

In [None]:
def vocab_size(text):
    pass # replace this with your own code

print("Length of vocabulary (unique words) of Moby Dick:", vocab_size(text1))
print("Length of vocabulary (unique words) of Sense and Sensibility:", vocab_size(text2))

In [None]:
# Define a function to measure the diversity of word choice, 
# or lexical diversity. This should be the ratio of vocabulary to length:
def lexical_diversity(text):
    pass

print("Lexical Diversity of Moby Dick:", lexical_diversity(text1))
print("Lexical Diversity of Sense and Sensibility:", lexical_diversity(text2))

Slicing works for us to view **tokens** in the NLTK Text object, which is the data type of `text2`.  Let's try out some of the methods and functions specific to NLTK, though, as they are designed for working with Text objects!

### Demo 3

Let's begin with the `concordance()` method.  We pass a single **token**, as a **string** (type=str), into this method.  What token would you like to see in its different contexts within the text?

In [None]:
text2.concordance('YOUR_TOKEN_HERE')  # optional parameter: lines=20 (or any number you choose)

By default the `concordance()` method shows 25 contexts in which the input word is used, but you can specify how many contexts you would like to see by saying something like: `text2.concordance('opinion', lines=63)`

In [None]:
text2.concordance('opinion', lines=100)  # it's okay to input a greater number of lines than there are matches

In [None]:
text2.concordance("Marianne", lines=20)  # default lines listed is 25

In [None]:
text2.concordance("happiness")
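
If you're curious what a concordance does under the hood, the rough idea can be sketched in plain Python. The function `kwic` ("keyword in context") and the toy sentence are hypothetical, invented for this example; this is not NLTK's implementation, and the case-insensitive matching is a choice made for this sketch.

```python
def kwic(tokens, target, width=3):
    """Return one line per occurrence of `target`, with up to `width` tokens of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():            # case-insensitive match (a choice made for this sketch)
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines

toy_tokens = "her opinion of him was high but his opinion of her was not".split()
for line in kwic(toy_tokens, "opinion"):
    print(line)
```

Each output line shows one occurrence of the target word in square brackets, with a few tokens of context on either side.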

`.similar(X)` will output words that appear in the text surrounded by similar words to `X`.  This means that the output is likely to include words that are the same part of speech as `X` and words that are synonyms of `X`.  For example, the words that occur in similar contexts to the noun `"happiness"` are also nouns:

In [None]:
text2.similar("happiness")

Let's have a look at the concordances for 'kindness' and 'happiness'.

In [None]:
text2.concordance("kindness", lines=30)

Pretty similar, huh? In text as in life, those two seem to be found in the same places...

`.common_contexts(L, N)` will output words that appear immediately to the left and right of all input words in list `L`.  For example, `"such"` and `"of"` are found surrounding both `"kindness"` and `"happiness"` in text2, so they are included in the output of `.common_contexts(["kindness", "happiness"])`:

In [None]:
text2.common_contexts(["kindness", "happiness"], num=5)

In [None]:
text2.common_contexts(["kindness", "happiness"], num=1)
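
The idea behind common contexts can be sketched with plain Python sets. The helper `contexts_of` and the toy sentence are assumptions for illustration; NLTK's own version differs in its details (for example, in how it treats the very first and last tokens of a text).

```python
def contexts_of(tokens, word):
    """Collect the (left neighbour, right neighbour) pairs around each occurrence of `word`."""
    found = set()
    for i in range(1, len(tokens) - 1):       # skip the first and last token: each lacks one neighbour
        if tokens[i] == word:
            found.add((tokens[i - 1], tokens[i + 1]))
    return found

toy_tokens = "such kindness of heart and such happiness of mind but little kindness to spare".split()

# The shared contexts are the intersection (&) of the two sets
shared = contexts_of(toy_tokens, "kindness") & contexts_of(toy_tokens, "happiness")
print(shared)
```

Only the pair `('such', 'of')` surrounds both words in the toy sentence, so it is the only common context.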

If no result is found, NLTK will output a message that tells you this:

In [None]:
text4.common_contexts(["monstrous", "very"])

In [None]:
text2.common_contexts(["kindness", "happiness"])  # defaults to 20 maximum

Let's see what type the output is:

In [None]:
kind_and_happy = text2.common_contexts(["kindness", "happiness"])  # defaults to 20 maximum
type(kind_and_happy)

That's a tad disquieting. It seems to reflect a rather bleak view of humankind.

Of course, NLTK isn't really having existential angst. It's just that `Text.common_contexts` is not a function that *returns* a value; it only *prints* it. If all you want to do with the output is look at it, I suppose that's fine, but if you want to save it to a variable or pass it to other code, you need it to `return` the results, not `print` them. This can be done with `Text._word_context_index.common_contexts` (the leading underscore marks this as an internal attribute, so it may change between NLTK versions):

In [None]:
kind_and_happy = text2._word_context_index.common_contexts(["kindness", "happiness"])  # returns all shared contexts with their counts
print(type(kind_and_happy))
print(kind_and_happy)
kind_and_happy

Much more useful!
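
The difference between printing and returning is easy to see with two tiny functions. The names `greet_print` and `greet_return` are made up for this illustration.

```python
def greet_print(name):
    print("Hello, " + name)        # displays the text, then implicitly returns None

def greet_return(name):
    return "Hello, " + name        # hands the text back to the caller

printed = greet_print("Elinor")    # the greeting is displayed, but nothing useful is stored
returned = greet_return("Elinor")  # nothing is displayed, but the string is stored

print(type(printed))
print(type(returned))
```

The first variable holds `None`, which is exactly what we saw when we tried to save the output of `text2.common_contexts(...)`.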

### Demo 4

To see the frequency of a list of words (of any length) as they occur from the beginning through the end of the text, you can use a Lexical Dispersion Plot:

In [None]:
# Character names (note that they can't be multiple words, or there won't be a match)
text2.dispersion_plot(["Marianne", "Elinor", "Edward"])

How does this plot reflect the text?

Marianne and Elinor are the main characters of the book 'Sense and Sensibility,' so it makes sense that we'd see them consistently throughout the text!  Edward is a supporting character, so we see that his name occurs less frequently.

*waaaaaiiitasec* ... OK, this doesn't look right. Hang on.

In [None]:
text2.dispersion_plot(["the", "Edward", "cold", "gkjhjdgkkjdghdfkjghfk"])

*Checks StackOverflow*

Yeah, NLTK has a bug in it. I've raised the issue with the authors, but in the meantime, here is a function that should make it behave correctly:

In [None]:
from nltk.draw.dispersion import dispersion_plot

def dispersion_plot_fixed(text, tokens):
    # this doesn't work with text.dispersion_plot, which 
    # displays the plot but doesn't return it: so instead 
    # I used `nltk.draw.dispersion.dispersion_plot`
    ax = dispersion_plot(text, tokens) 
    ax.set_yticks(list(range(len(tokens))), reversed(tokens), color="C0")

dispersion_plot_fixed(text2, ["the", "Edward", "cold", "gkjhjdgkkjdghdfkjghfk"])


OK, let's try it again:

In [None]:
dispersion_plot_fixed(text2, ["Marianne", "Elinor", "Edward"])

That's better!
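
A dispersion plot draws one mark for every position at which a word occurs. As a rough sketch of the data behind such a plot, we can collect those positions ourselves; the function `token_offsets` and the toy sentence are hypothetical, and nothing is drawn here.

```python
def token_offsets(tokens, targets):
    """Map each target word to the list of positions where it occurs in `tokens`."""
    offsets = {target: [] for target in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets

toy_tokens = "Elinor spoke and Marianne sang while Elinor listened to Marianne".split()
offsets = token_offsets(toy_tokens, ["Marianne", "Elinor", "Edward"])
for word, positions in offsets.items():
    print(word, positions)
```

A word that never occurs (here, "Edward") simply gets an empty list, which corresponds to an empty row in the plot.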

Have a go yourself: find some words that show an interesting distribution - such that they tend to appear in different parts of the text:

In [None]:
# your code here

### Demo 5

When we talk about the *context* of words (or tokens!) in text analysis, we're referring to the words surrounding a given word.  Concordances show a bit of context to the left of an input word (just before the word appears) and to the right of that word (just after the word appears).

The words "good" and "opinion" seem to occur together quite a bit!  To see the other words that appear near that pair, we can use the `common_contexts()` method.  We pass a **list** of tokens (each token as a string) into the method.

In [None]:
text2.common_contexts(["good", "opinion"])

It seems that "the good opinion of" is the complete phrase in which the pair of words, "good" and "opinion", appear together in this text.  They don't occur together in other contexts!  But what about individually?

We can use the **similar** method to see words that appear in similar contexts, meaning they're surrounded by similar tokens, as the token we input.  Note that we pass in a single token as a string to this method.

In [None]:
print("Words with a similar context as 'good':")
text2.similar("good")

In [None]:
print("Words with a similar context as 'opinion':")
text2.similar("opinion")

Pairs of words that occur together, such as "good" and "opinion," are referred to as **bigrams**, where "bi" indicates two.  **N-grams** are groups of words that occur together, where n is a number of your choice.

To get all the bigrams in a text, we can use the `nltk.bigrams()` function, into which we pass the variable referring to the text itself.

In [None]:
bigrams_list = list(nltk.bigrams(text2))
print(bigrams_list[:100])  # prints the first 100 bigrams
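
The same idea extends to any value of n. Here is a minimal pure-Python sketch; the function name `make_ngrams` is made up for this example. (NLTK also offers `nltk.ngrams(sequence, n)` for this job.)

```python
def make_ngrams(tokens, n):
    """Return every run of `n` consecutive tokens as a tuple."""
    # tokens[0:], tokens[1:], ... are shifted copies of the list;
    # zip stops at the shortest one, so the copies line up into n-grams
    return list(zip(*(tokens[i:] for i in range(n))))

toy_tokens = ["the", "good", "opinion", "of", "others"]
print(make_ngrams(toy_tokens, 2))   # bigrams
print(make_ngrams(toy_tokens, 3))   # trigrams
```

A list of five tokens yields four bigrams and three trigrams: each step up in n loses one n-gram at the end.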

Last class we looked quickly at a **dispersion plot**, which is a chart that visualizes when particular tokens appear within a text.  Let's try making another one of those.  We pass in a list of individual tokens, where each token is a string, to make a dispersion plot.

Let's try another text.  NLTK includes some books that were digitized for [Project Gutenberg](https://www.gutenberg.org).


In [None]:
print(nltk.corpus.gutenberg.fileids())

### Demo 6

Let's look at one of those books, Alice's Adventures in Wonderland (with the fileid `carroll-alice.txt`), to practice tokenizing on our own.

In [None]:
alice = nltk.corpus.gutenberg.raw("carroll-alice.txt")
print(type(alice))

In [None]:
alice[:100]

We can tokenize the string of Alice's Adventures in Wonderland to split it into individual words and punctuation using the function `word_tokenize()`.  We can split the string into individual sentences using the function `sent_tokenize()`.  Both tokenization functions output a list of strings.

In [None]:
alice_tokens = word_tokenize(alice)
print("Total tokens:", len(alice_tokens))
print("Sample:", alice_tokens[0:100])
print(type(alice_tokens))
print(type(alice_tokens[42]))

We note that the output of the tokenizer isn't a special NLTK type: it's just a list of strings.

In [None]:
alice_sents = sent_tokenize(alice)
print("Total sentences:", len(alice_sents))
print("Sample:", alice_sents[0:5])

What if I want to know the number of words, not tokens (so excluding punctuation marks)?

In [None]:
def get_words(tokens):
    return [word for word in tokens if word.isalpha()]  # List comprehension
    # If you don't know about list comprehensions, stop me and ask about list comprehensions!

alice_words = get_words(alice_tokens)  # pass the list of tokens, not the raw string
print("Total words:", len(alice_words))
### Same as: ###
# alice_words = []
# for word in alice_tokens:
#     if word.isalpha():
#         alice_words += [word]  # same as alice_words = alice_words + [word]

What if we want to know the size of the vocabulary, or the number of unique words, in Alice's Adventures in Wonderland?

Remember that Python (and NLTK) consider capitalized and lowercased words to be different, so in order to count the number of unique words, we must **casefold** the text, changing all words to lowercase.  Python strings have a simple method for this: `.lower()`. Write a function which casefolds a list of words, and pass `alice_words` to it.

*Note:* Casefolding is a form of **normalization**.

In [None]:
def lower_words(words):
    pass

alice_lower = lower_words(alice_words)
print(alice_lower[:10])

Next, let's count all the unique words from our list of casefolded words:

*Note*: Conveniently, we already wrote a function to do this

In [None]:
# your code here

print("Vocabulary size:", alice_vocab_size, "words")

### Demo 7

Other forms of **normalization** involve reducing words to their root form.  For example, the words "happy" and "happiness" have the same root and very similar meanings.  There are two ways NLTK provides to reduce words to their root form:

* **Stemming**: reduces words to a root form by stripping affixes; the resulting root is often *not* a valid word itself

    In our example, the Porter stemmer reduces both "happy" and "happiness" to "happi."


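* **Lemmatizing**: reduces words to a root form (a *lemma*) where the root *is* a valid word itself, determined based on whether it exists in WordNet's list of valid English words

    In our example, the lemma of "happy" would be "happy."  Note that "happiness" is a valid word in its own right, so NLTK's WordNet lemmatizer leaves it unchanged; a clearer case is "geese," whose lemma is "goose."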
    
There are different approaches to stemming and lemmatization we can use in NLTK.  Let's see how they differ...
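
Before comparing NLTK's implementations, here is a toy sketch of the difference between the two ideas. Both functions are deliberately crude and entirely hypothetical: `crude_stem` is *not* the Porter or Lancaster algorithm, and `TOY_LEMMAS` is a hand-made lookup table standing in for a real dictionary such as WordNet.

```python
def crude_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in ("iness", "ness", "ing", "es", "s", "y"):
        # only strip if at least three characters would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# A tiny hand-made lookup table (an assumption for this sketch)
TOY_LEMMAS = {"geese": "goose", "mice": "mouse", "happier": "happy"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)   # unknown words are returned unchanged

for w in ["happy", "happiness", "geese"]:
    print(w, "->", crude_stem(w), "|", toy_lemmatize(w))
```

The stemmer works by rule, so it happily produces non-words; the lemmatizer works by lookup, so it only ever returns words it knows about.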

In [None]:
porter = nltk.PorterStemmer()
porter_stemmed = [porter.stem(word) for word in alice_lower]  # only includes alphabetic tokens
print(porter_stemmed[500:550])

In [None]:
lancaster = nltk.LancasterStemmer()
lancaster_stemmed = [lancaster.stem(word) for word in alice_lower] # only includes alphabetic tokens
print(lancaster_stemmed[500:550])

In [None]:
wnl = nltk.WordNetLemmatizer()
lemmatized = [wnl.lemmatize(word) for word in alice_lower]  # only includes alphabetic tokens
print(lemmatized[500:550])

What differences do you spot in the output samples of stems and lemmas?

*This is why it's always useful to print out samples of the data you're working with as you're coding!*

So what can we do with words once we've stemmed or lemmatized them?  Well, we could count the unique stems and lemmas, to get a different perspective on the size of Lewis Carroll's vocabulary in Alice's Adventures in Wonderland, just as we did with the complete words.  

We could also count the frequency at which these root forms of words appear, giving us a sense of what the most common words are in the book.  Let's try that!  We'll use NLTK's `FreqDist()` function (for calculating and visualizing frequency distributions).

In [None]:
fdist_lemmas = FreqDist(lemmatized)
fdist_lemmas  # pairs of lemmas and their counts

Now we can ask how often a particular lemma appears using the `fdist_lemmas` variable we created:

In [None]:
print(fdist_lemmas['wonder'])
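
A frequency distribution behaves much like a dictionary of counts. In current NLTK versions `FreqDist` is built on top of Python's `collections.Counter`, so the basic behaviour can be sketched with `Counter` alone; the toy list below is made up for illustration.

```python
from collections import Counter

toy_lemmas = ["alice", "rabbit", "alice", "wonder", "rabbit", "alice"]
counts = Counter(toy_lemmas)

print(counts["alice"])          # how often one item occurs
print(counts["queen"])          # items never seen count as 0 rather than raising an error
print(counts.most_common(2))    # the two most frequent items with their counts
```

Looking up an item that never occurred gives 0 instead of a `KeyError`, which is handy when checking for words that may not be in the text at all.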

To get an easier overview, we can use visualization.  Let's create a line chart of the top 10 lemmas:

In [None]:
plt.figure(figsize = (20, 8))
plt.rc('font', size=12)
number_of_tokens = 10 
fdist_lemmas.plot(number_of_tokens, title='Frequency Distribution for '+str(number_of_tokens)+" Most Common Lemmas in Alice's Adventures in Wonderland")
plt.show()

Hmmm.  Some of these words don't tell us a lot.  It's pretty logical that words like "the" and "she" would be used a lot, but it doesn't tell us much about what goes on in the book.

These small words that occur frequently but don't always carry large meaning, particularly in books and longer texts, are called **stopwords**.  We can filter them out using NLTK's `stopwords` corpus (via `stopwords.words()`) and try re-plotting this frequency distribution.

In [None]:
to_exclude = stopwords.words('english')  # english since the book is in English

# What other words might we want to exclude?
to_exclude += ['alice', "said"]

In [None]:
# make a function (hint: use a list comprehension) to filter out stop words and words shorter than 2 letters
def filter_lemmas(lemmas, stops):
    pass

lemmatized_filtered = filter_lemmas(lemmatized, to_exclude)

In [None]:
fdist_lemmas_filtered = FreqDist(lemmatized_filtered)
print("Total words after filtering:", fdist_lemmas_filtered.N())
print("50 most common words after filtering:", fdist_lemmas_filtered.most_common(50))

In [None]:
plt.figure(figsize = (20, 8))
plt.rc('font', size=12)
number_of_tokens = 10 
fdist_lemmas_filtered.plot(number_of_tokens, title='Frequency Distribution for '+str(number_of_tokens)+" Most Common Lemmas in Alice's Adventures in Wonderland Excluding Stop Words")
plt.show()

That's more interesting!  We could do the same thing with complete words, to get a different perspective on the most common words in the book.

Another common step to preprocessing text data is **part-of-speech (POS) tagging**.  POS tagging assigns parts of speech to words and groups of words in sentences.  After tagging parts of speech, you can perform more complex tasks such as **entity recognition**, which is the process of identifying people, places, and organizations named in a text.

In [None]:
alice_tagged = nltk.pos_tag(alice_tokens)
print(alice_tagged[0:10])

The parts of speech the abbreviations stand for are available [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).
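
Tagged output is just a list of `(word, tag)` tuples, so we can filter it with ordinary Python. In the sketch below, `toy_tagged` is hand-written in the same shape that `nltk.pos_tag` returns (it is not real tagger output), and `words_with_tag_prefix` is a hypothetical helper.

```python
# Hand-written (word, tag) pairs using Penn Treebank-style tags, for illustration only
toy_tagged = [("Alice", "NNP"), ("was", "VBD"), ("beginning", "VBG"),
              ("to", "TO"), ("get", "VB"), ("very", "RB"), ("tired", "JJ")]

def words_with_tag_prefix(tagged, prefix):
    """Keep the words whose tag starts with `prefix` (e.g. 'NN' for nouns, 'VB' for verbs)."""
    return [word for word, tag in tagged if tag.startswith(prefix)]

print(words_with_tag_prefix(toy_tagged, "NN"))
print(words_with_tag_prefix(toy_tagged, "VB"))
```

Matching on a tag *prefix* is a convenient trick because related tags share their first letters: all the noun tags begin with `NN` and all the verb tags begin with `VB`.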