# BDAT Lesson 8 - Exploring a Text with NLTK

This notebook shows how you can explore aspects of a text using the Natural Language Toolkit (NLTK). It is based on [a similar one that is part of ALTA](https://github.com/sgsinclair/alta/blob/Rockwell's-Edits/ipynb/utilities/Exploring%20a%20text%20with%20NLTK.ipynb). Some of the things it shows you how to do include:

* Tokenize a text 
* Generate a concordance for a word
* Explore collocations (words that are located together)
* Count words and frequencies
* Find similar words and contexts

For more on NLTK see the online version of the book [Natural Language Processing with Python](http://www.nltk.org/book/). 

## 1.0 Preparing for Exploration

Before we can analyze a text we need to load it in and tokenize it. For tokenization and exploration we are going to use NLTK.

### 1.1 Installing NLTK

Before you can use NLTK you need to make sure it is installed. The [Anaconda Navigator](https://docs.continuum.io/anaconda/navigator) installs NLTK by default, but you can always test whether it is installed by importing it with ```import nltk```. Try it. You will get an error if you don't have it.

In [None]:
import nltk
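
If you would rather check without triggering an error, the standard library can tell you whether a package is importable. This is only a minimal sketch; the helper name ```is_installed``` is our own and is not part of NLTK.

```python
import importlib.util

def is_installed(package_name):
    """Return True if the named package can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None

print("NLTK installed:", is_installed("nltk"))
```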

### 1.2 Getting a Text

Now we will get a text to process with NLTK.

First we see what text files we have. 

In [None]:
%ls *.txt

We are going to use "Hume Enquiry.txt" from Project Gutenberg. You can use whatever text you want. We print the first 50 characters to check that it loaded.

In [None]:
theText2Use = "Hume Enquiry.txt"
with open(theText2Use, "r") as fileToRead:
    theString = fileToRead.read()
    
print("This string has", len(theString), "characters.")
print(theString[:50])

**Questions**
* Is this a good version of the text? How would we know?
* How can we get rid of the material at the start and end that is not part of the work itself?
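
One possible approach to the second question is sketched here. It assumes the file carries marker lines beginning with ```*** START OF``` and ```*** END OF```, as Project Gutenberg files commonly do. The exact wording of those markers varies between files, so check the result by eye. The function name ```strip_gutenberg``` and the sample string are our own.

```python
def strip_gutenberg(raw_text):
    """Return the text between the START and END marker lines.

    If a marker is missing, nothing is trimmed on that side.
    """
    lines = raw_text.splitlines()
    start, end = 0, len(lines)
    for number, line in enumerate(lines):
        if line.startswith("*** START OF"):
            start = number + 1      # the body begins after the START line
        elif line.startswith("*** END OF"):
            end = number            # the body stops before the END line
            break
    return "\n".join(lines[start:end]).strip()

# A made-up miniature file, used only to show the behaviour
sample = (
    "Header and licence notes\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK ***\n"
    "The body of the work.\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK ***\n"
    "More licence notes\n"
)
print(strip_gutenberg(sample))
```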

### 1.3 Tokenization

Now we tokenize the text using NLTK's tokenizer, producing a list called "listOfTokens", and check the first tokens. Note that the NLTK tokenizer doesn't eliminate punctuation and doesn't lowercase the words. You can tokenize using another method if you want. Then we create an NLTK text object from the tokens. Note how the text object behaves like a list of tokens. If NLTK reports a missing resource, the error message names the package to fetch with ```nltk.download()```.

In [None]:
# This creates a list of tokens from the text
listOfTokens = nltk.word_tokenize(theString)

# This creates an NLTK text object from a list of tokens
theText = nltk.Text(listOfTokens)

print(listOfTokens[:50]) # Show the first 50 tokens
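
As one example of another tokenization method, here is a minimal sketch that uses a regular expression from the standard library. Unlike the NLTK tokenizer it lowercases everything, drops punctuation, and only recognizes unaccented letters. The name ```simple_tokenize``` is our own.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and return runs of letters as tokens.

    An apostrophe inside a word (as in "isn't") is kept as part of the token.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

print(simple_tokenize("Truth, he said, isn't simple."))
```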

## 2.0 Concording

Now we get a concordance for a word with a single line of code. Note that we can control the width of the concordance lines. Edit the word to explore others.

In [None]:
theText.concordance("truth", width=80)

Note that ```concordance``` is not case sensitive. This will give you a concordance of both capitalized and lower case words.

If you want fewer or more lines, add the ```lines``` parameter.

In [None]:
theText.concordance("the", lines=10)

One annoyance is that you can't easily save a concordance to a file, because the ```concordance``` method of the NLTK text object prints its results to the screen for exploration instead of returning them. You will need to cut and paste into a word processor to save them.
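
One workaround is to build the concordance lines yourself from the list of tokens and write them to a file. The sketch below does this in plain Python. The function name ```concordance_lines```, the bracketed keyword format, and the output file name are our own choices, not part of NLTK. (Recent NLTK releases also provide a ```concordance_list``` method on the text object that returns the lines instead of printing them; check whether your version has it.)

```python
import os
import tempfile

def concordance_lines(tokens, target, window=5):
    """Return one string per occurrence of target (case insensitive),
    with up to `window` tokens of context on each side."""
    target = target.lower()
    found = []
    for position, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, position - window):position])
            right = " ".join(tokens[position + 1:position + 1 + window])
            found.append(f"{left} [{token}] {right}".strip())
    return found

# A made-up sentence, already split into tokens
sample_tokens = "We seek the truth , but Truth is hard to find".split()
lines = concordance_lines(sample_tokens, "truth", window=2)

# Save the lines; the temporary directory is used here only to keep the sketch tidy
out_path = os.path.join(tempfile.gettempdir(), "concordance.txt")
with open(out_path, "w", encoding="utf-8") as out_file:
    out_file.write("\n".join(lines))

print(lines)
```

With the notebook's own data you would pass ```listOfTokens``` instead of the sample list.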

### 2.1 Plot the Dispersion of Words

We can easily plot the dispersion of words through the text. Note how it is case sensitive.

The line ```%matplotlib inline``` makes sure that the plot is placed inline.

In [None]:
import matplotlib

# This is to force the plots to show inline rather than in another window
%matplotlib inline 

theText.dispersion_plot(["truth","Truth"])
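
A dispersion plot is essentially a picture of where each word occurs in the sequence of tokens. As a rough sketch of the underlying data, the hypothetical helper ```word_offsets``` below lists those positions for a made-up token list, matching case exactly as the plot does.

```python
def word_offsets(tokens, word):
    """Return the positions in the token list where word occurs (case sensitive)."""
    return [position for position, token in enumerate(tokens) if token == word]

sample_tokens = ["Truth", "is", "rare", ",", "but", "truth", "matters", "and", "truth", "lasts"]
print(word_offsets(sample_tokens, "truth"))
print(word_offsets(sample_tokens, "Truth"))
```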

### 2.2 Counting Words and Frequencies

You can also count words. This is case sensitive if you use the text object.

In [None]:
print(theText.count("truth"), " ", theText.count("Truth"))

To make it case insensitive we are going to lowercase every token and get a new list of tokens. We are also going to get rid of punctuation by keeping only the alphabetical tokens. Then we can count things in the list.

In [None]:
# Keep only alphabetical tokens (this drops punctuation) and lowercase them
theLowerTokens = []
for token in listOfTokens:
    if token.isalpha():
        theLowerTokens.append(token.lower())

print(theLowerTokens[:20])

There is a more concise way to do this using a [list comprehension](http://python-3-patterns-idioms-test.readthedocs.io/en/latest/Comprehensions.html).

In [None]:
theLowerTokens = [token.lower() for token in listOfTokens if token.isalpha()]
print(theLowerTokens[:20])

In [None]:
theLowerTokens.count("truth")

With NLTK we can get word frequencies. These can be displayed as a table. We can then do other things with the frequency distribution object.

In [None]:
theLowerFreqs = nltk.FreqDist(theLowerTokens)
theLowerFreqs

In [None]:
theLowerFreqs.tabulate(10)

In [None]:
theLowerFreqs["truth"]

Rather than the raw count, we can get the relative frequency, which is the count divided by the total number of tokens. This is very useful for comparing across documents of different lengths.

In [None]:
theLowerFreqs.freq("the")
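
To make the arithmetic concrete, here is a small sketch of the same calculation using ```collections.Counter``` from the standard library on a made-up token list. The function name ```relative_frequency``` is our own.

```python
from collections import Counter

def relative_frequency(counts, word):
    """Return the count of word divided by the total number of tokens counted."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return counts[word] / total

sample_tokens = ["the", "truth", "is", "the", "aim", "of", "the", "mind"]
counts = Counter(sample_tokens)

# "the" occurs 3 times among 8 tokens
print(relative_frequency(counts, "the"))
```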

## 3.0 Plot the Frequency of Words

We can also plot the high frequency words.

In [None]:
%matplotlib inline
theLowerFreqs.plot(30)

### 3.1 Plotting Content Words

What if we want to see just the high frequency content words? Here we get the NLTK English stopword list. If the list is not yet on your machine, ```nltk.download("stopwords")``` fetches it.

In [None]:
stopwords = nltk.corpus.stopwords.words("english")
print(stopwords[:20])

We need to create a new list of tokens without the stopwords.

In [None]:
theLowerContentWords = []

for token in theLowerTokens:
    if token not in stopwords:
        theLowerContentWords.append(token)

theLowerContentWords[:10]

In [None]:
# The same filter written as a list comprehension
theLowerContentWords = [token for token in theLowerTokens if token not in stopwords]

Now we can get a table of the high frequency content words.

In [None]:
theLowerContFreqs = nltk.FreqDist(theLowerContentWords)

theLowerContFreqs.tabulate(8)

**Question**: How would you add to the stopword list to get rid of words like "may", "one", "must", "us"?
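
One possible answer is sketched here with a short stand-in list so that it runs without the NLTK corpus. In the notebook you would start from the ```stopwords``` list loaded earlier. All the names in this sketch are our own.

```python
# Stand-in for the NLTK list; in the notebook use the `stopwords` variable instead
base_stopwords = ["the", "of", "and", "to"]

# Extra words we have decided to ignore for this particular text
extra_stopwords = ["may", "one", "must", "us"]

# A set makes the membership test fast and removes duplicates
all_stopwords = set(base_stopwords) | set(extra_stopwords)

sample_tokens = ["one", "must", "trust", "experience", "and", "reason"]
content_words = [token for token in sample_tokens if token not in all_stopwords]
print(content_words)
```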

And now we plot the frequency distribution of the content words.

In [None]:
theLowerContFreqs.plot(30)

We might also want to check how these words are used by looking at their concordance.

In [None]:
theText.concordance("experience", width=80)

## 4.0 Collocations, Similar Words, and Contexts

### 4.1 Collocations

NLTK will also let you explore collocations, by which is meant sets of two or more words that frequently appear together.

In [None]:
theText.collocations(10)

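Note how we are getting a lot of bigrams with "Gutenberg". That's because NLTK favours bigrams whose words appear together more often than their individual frequencies would lead you to expect, and the Project Gutenberg boilerplate repeats a handful of phrases. If you ask for more collocations you can see some that have to do with the text itself.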

In [None]:
theText.collocations(100)
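
To illustrate the idea of words appearing together more often than expected, here is a sketch that scores adjacent word pairs by pointwise mutual information (PMI). NLTK's own scoring measure is different; PMI is swapped in here only because it is short to write down. The function name ```pmi_bigrams``` and the sample tokens are our own.

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2):
    """Score each adjacent word pair by PMI, highest first.

    Pairs seen fewer than min_count times are skipped, because PMI
    is unreliable for rare events.
    """
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    total_words = len(tokens)
    total_pairs = max(len(tokens) - 1, 1)
    scores = {}
    for (first, second), count in pair_counts.items():
        if count < min_count:
            continue
        p_pair = count / total_pairs
        p_first = word_counts[first] / total_words
        p_second = word_counts[second] / total_words
        # Positive when the pair is more frequent than chance would suggest
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

sample_tokens = ("project gutenberg is free and project gutenberg is open "
                 "and the text is free and the text is open").split()
for pair, score in pmi_bigrams(sample_tokens):
    print(pair, round(score, 2))
```

Pairs made of words that rarely occur apart from each other rise to the top, which is the same effect that brings the "Gutenberg" bigrams forward.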

### 4.2 Similar Words

We can get words that are **similar** to target words. These are not synonyms but words being used in similar contexts. You can use this to expand on a word you are interested in.

In [None]:
theText.similar("truth")
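
To show what "used in similar contexts" can mean, here is a simplified sketch that treats the context of a word as the pair of words immediately before and after it, and ranks other words by how many such contexts they share with the target. This is an illustration of the idea, not NLTK's implementation; the name ```similar_words``` is our own.

```python
from collections import Counter, defaultdict

def similar_words(tokens, target, top_n=5):
    """Rank other words by the number of (previous, next) contexts shared with target."""
    contexts = defaultdict(set)
    for previous, word, following in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((previous, following))

    target_contexts = contexts.get(target, set())
    shared = Counter()
    for word, word_contexts in contexts.items():
        if word == target:
            continue
        overlap = len(target_contexts & word_contexts)
        if overlap:
            shared[word] = overlap
    return [word for word, _ in shared.most_common(top_n)]

sample_tokens = "the truth is clear and the fact is clear but the doubt was real".split()
print(similar_words(sample_tokens, "truth"))
```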

You can use this to get concordances of sets of similar words.

In [None]:
listOfWords2Conc = ["reason","fact","knowledge","ideas"]
for i in listOfWords2Conc:
    print(i.upper() + ": ")
    theText.concordance(i, width=80, lines=5)
    print("--------------------------------------------------\n")

### 4.3 Common Contexts

NLTK can give us common contexts for words that share them.

In [None]:
theText.common_contexts(["nature", "experience"],10)
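
Going the other way, we can start from two words and ask which surrounding word pairs they have in common. The sketch below does this in plain Python on a made-up token list; ```shared_contexts``` is a hypothetical helper, not an NLTK function.

```python
def shared_contexts(tokens, first_word, second_word):
    """Return the (previous, next) pairs that surround both words somewhere in tokens."""
    def contexts_of(word):
        return {
            (tokens[i - 1], tokens[i + 1])
            for i in range(1, len(tokens) - 1)
            if tokens[i] == word
        }
    return sorted(contexts_of(first_word) & contexts_of(second_word))

sample_tokens = "of nature and of experience and by nature alone".split()
print(shared_contexts(sample_tokens, "nature", "experience"))
```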

## 5.0 Finding Patterns

We can use regular expressions on tokens with the ```findall``` method of the Text object. Some guidelines:

* You are matching against tokens, not the raw text. The ```<``` and ```>``` mark the boundaries of a token.
* ```<.*>``` matches any token, as ```.``` means any character and ```*``` means 0 or more of. ```?``` would mean 0 or 1 of.
* The parentheses tell ```findall``` which part of the match to show. In the first example below you can see how to show all the words right before the word you want.

Here are some examples.

In [None]:
theText.findall("(<.*>)<experience>")

In [None]:
theText.findall("<.*><.*><nature>")

In [None]:
theText.findall("(<.*><.*>)<truth>")

In [None]:
theText.findall("<not><.*>?<true>")
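
To see why the angle brackets and parentheses behave this way, it can help to imagine the tokens joined into one long string of ```<token>``` units and the pattern rewritten into an ordinary regular expression. The sketch below is a simplified illustration of that idea in plain Python, not NLTK's actual code; the name ```find_token_patterns``` is our own, and it does not handle escaped dots.

```python
import re

def find_token_patterns(tokens, pattern):
    """Match a pattern written in <token> notation against a list of tokens."""
    # Join the tokens into one string, e.g. "<not><quite><true>"
    joined = "".join(f"<{token}>" for token in tokens)

    # Wrap each <...> in a non-capturing group so that ? and * apply to whole tokens
    regex = pattern.replace("<", "(?:<(?:").replace(">", ")>)")
    # Inside a token, "." must not run past the closing ">"
    regex = regex.replace(".", "[^>]")

    hits = re.findall(regex, joined)
    # Turn each matched string back into a list of tokens
    return [hit[1:-1].split("><") for hit in hits]

sample_tokens = "it is not quite true that this is not true".split()
print(find_token_patterns(sample_tokens, "<not><.*>?<true>"))
print(find_token_patterns(sample_tokens, "(<.*>)<true>"))
```

The second call shows the effect of the parentheses: only the token before "true" is returned.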

---
# Homework: Exploring a Text
Using the NLTK tools, create a notebook that explores a text assembled from works by an author. Can you infer anything interesting from the text? Explain what you find in the notebook.

How would you check that your inferences are valid? 