Introduction to NLTK
====================

NLTK is the *Natural Language Toolkit*, a fairly large Python library for many kinds of linguistic analysis of text. NLTK comes with a selection of sample texts, which we'll use today to get familiar with the sorts of analysis it can do.

To run this notebook you will need the `nltk`, `matplotlib`, and `tkinter` modules, all of which come included with Anaconda. The `nltk` module needs a bit of initialization, which is best done from the command line.

    python -c 'import nltk; nltk.download()'
    
When the interactive dialog appears, use it to download the 'book' package.

Examining features of a text
----------------------------
We will start by loading the example texts in the 'book' package that we just downloaded. 

In [None]:
from nltk.book import *

This `import` statement reads the book samples, which include nine sentences and nine book-length texts. It has also helpfully put each of these texts into a variable for us, from `sent1` to `sent9` and `text1` to `text9`.

In [None]:
print(sent1)
print(sent3)
print(sent5)
sents()

Let's look at the texts now.

In [None]:
print(text6)
print(text6.name)
print("This text has %d words" % len(text6.tokens))
print("The first hundred words are:", " ".join( text6.tokens[:100] ))

Each of these texts is an `nltk.text.Text` object, with methods that let you see what the text contains. But you can also treat it as a plain old list!

In [None]:
print(text5[0])
print(text3[0:11])
print(text4[0:51])

We can do simple concordancing, printing the context for each use of a word throughout the text:

In [None]:
text6.concordance( "swallow" )

The default is to show no more than 25 results for any given word, but we can change that.

In [None]:
text6.concordance('Arthur', lines=37)

We can adjust the amount of context we show in our concordance:

In [None]:
text6.concordance('Arthur', width=20)

...or get the number of times any individual word appears in the text. But **be careful** - while the concordance doesn't care about upper- or lowercase, the word count / word frequency logic does!

In [None]:
word_to_count = "Knight"
print("The word %s appears %d times." % ( 
        word_to_count, text6.count( word_to_count ) ))
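
Because a `Text` can be treated as a list of tokens, one way around the case sensitivity is to fold each token to lowercase before comparing. The sketch below uses a small made-up token list (`tokens`, not part of NLTK's data) so that it runs on its own; substituting `text6` should work the same way.

```python
# Toy token list standing in for text6 (an assumption for illustration).
tokens = ["KNIGHT", ":", "I", "am", "a", "knight", ".", "Knight", "!"]

word_to_count = "Knight"

# Exact, case-sensitive count: only tokens spelled exactly "Knight".
exact = tokens.count(word_to_count)

# Case-insensitive count: lowercase both sides before comparing.
folded = sum(1 for t in tokens if t.lower() == word_to_count.lower())

print("Exact matches: %d, ignoring case: %d" % (exact, folded))
```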

We can generate a vocabulary for the text, and use the vocabulary to find the most frequent words as well as the ones that appear only once (a.k.a. the __hapaxes__.)

In [None]:
t6_vocab = text6.vocab()
t6_words = list(t6_vocab.keys())
print("The text has %d different words" % ( len( t6_words ) ))
print("Some arbitrary 50 of these are:", t6_words[:50])
print("The most frequent 50 words are:", t6_vocab.most_common(50))
print("The word swallow appears %d times" % ( t6_vocab['swallow'] ))
print("The text has %d words that appear only once" % ( 
        len( t6_vocab.hapaxes() ) ))
print("Some arbitrary 100 of these are:", t6_vocab.hapaxes()[:100])
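
NLTK's `FreqDist` is built on Python's `collections.Counter`, so the idea behind `most_common()` and `hapaxes()` can be sketched with the standard library alone. The token list here is invented for illustration.

```python
from collections import Counter

# Made-up tokens; with NLTK you would tally text6 instead.
tokens = ["the", "swallow", "and", "the", "coconut", "and", "the", "knight"]

vocab = Counter(tokens)

# Most frequent words first, as (word, count) pairs.
top_two = vocab.most_common(2)

# Hapaxes: words whose count is exactly one.
hapaxes = sorted(w for w, n in vocab.items() if n == 1)

print(top_two)
print(hapaxes)
```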

You've now seen two methods for getting the number of times a word appears in a text: `text6.count(word)` and `t6_vocab[word]`. These are interchangeable - use whichever one you like!

We can try and find interesting words in the text, such as words of a minimum length (the longer a word, the less common it probably is) that occur more than once or twice...

In [None]:
# Get a list of long words!
# The short way, with a list comprehension
long_words = [ w for w in t6_words if len( w ) > 5 and t6_vocab[w] > 3 ]

# The long way, with a for loop. THIS IS IDENTICAL TO THE ABOVE.
long_words = []
for w in t6_words:
    if( len ( w ) > 5 and t6_vocab[w] > 3 ):
        long_words.append( w )

# Now use the list!
print("The reasonably frequent long words in the text are:", long_words)

And we can look for pairs of words that go together more often than chance would suggest.

In [None]:
print("\nUp to twenty collocations")
text6.collocations()

print("\nUp to fifty collocations")
text6.collocations(num=50)

print("\nCollocations that might have one word in between")
text6.collocations(window_size=3)
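
To give a feel for what "more often than chance would suggest" means, here is a minimal sketch that scores adjacent word pairs by pointwise mutual information (PMI). PMI is a stand-in chosen for simplicity, not necessarily the measure that `collocations()` uses internally, and both the token list and the `pmi` helper are made up for this example.

```python
import math
from collections import Counter

# Invented toy text, already split into tokens.
tokens = ("holy grail . the holy grail . the knight saw the swallow . "
          "the holy grail").split()

word_counts = Counter(tokens)
pair_counts = Counter(zip(tokens, tokens[1:]))  # adjacent pairs only
n_words = len(tokens)
n_pairs = len(tokens) - 1

def pmi(pair):
    """log2 of the observed pair probability over the chance expectation."""
    w1, w2 = pair
    p_pair = pair_counts[pair] / n_pairs
    p_chance = (word_counts[w1] / n_words) * (word_counts[w2] / n_words)
    return math.log2(p_pair / p_chance)

# Only score pairs seen more than once; PMI is unreliable for rare pairs.
repeated = [p for p, n in pair_counts.items() if n > 1]
ranked = sorted(repeated, key=pmi, reverse=True)
print(ranked)
```

A score above zero means the pair turns up more often than its two words would if they were scattered independently through the text.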

NLTK can also provide us with a few simple graph visualizations, **when we have matplotlib installed**. To make this work in Jupyter/IPython, we need the following magic line.

In [None]:
%pylab --no-import-all inline

The vocabulary we get from the `.vocab()` method is something called a "frequency distribution", which means it's a giant tally of each unique word and the number of times that word appears in the text. We can also make a frequency distribution of other features, such as "each possible word length vs. the number of distinct words of that length in the vocabulary". Let's do that and plot it.

In [None]:
word_length_dist = FreqDist( [ len(w) for w in t6_vocab.keys() ] )
word_length_dist.plot()
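
A frequency distribution can be built from any list of observations. The sketch below, on a made-up token list and with `collections.Counter` standing in for `FreqDist`, contrasts two tallies of word length: one over the distinct words and one over every token in the running text.

```python
from collections import Counter

# Invented tokens for illustration.
tokens = ["the", "knight", "and", "the", "swallow", "and", "the", "grail"]

# Count each distinct word once (what tallying over the vocabulary does).
lengths_of_types = Counter(len(w) for w in set(tokens))

# Count every occurrence in the running text.
lengths_of_tokens = Counter(len(w) for w in tokens)

print(sorted(lengths_of_types.items()))
print(sorted(lengths_of_tokens.items()))
```

Short, common words dominate the second tally far more than the first, which is why it matters which list you feed in.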

We can plot where in the text a word occurs, and compare it to other words, with a *dispersion plot*. For example, the following dispersion plots show respectively (among other things) that the words 'coconut' and 'swallow' almost always appear in the same part of the *Holy Grail* text, and that Willoughby and Lucy do not appear in *Sense and Sensibility* until some time after the beginning of the book.

In [None]:
text6.dispersion_plot(["coconut", "swallow", "KNIGHT", "witch", 
                       "ARTHUR"])

text2.dispersion_plot(["Elinor", "Marianne", "Edward", "Willoughby", 
                       "Lucy"])

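Conceptually, a dispersion plot needs only the positions at which each word occurs in the token list. A minimal sketch of that step, using an invented token list and a hypothetical helper named `offsets`, could look like this:

```python
def offsets(tokens, word):
    """Return the 0-based positions at which word occurs in tokens."""
    return [i for i, t in enumerate(tokens) if t == word]

# Made-up tokens; a real text would have thousands of positions.
tokens = ["coconut", "swallow", "ARTHUR", "witch", "swallow", "coconut"]

for w in ["coconut", "swallow", "witch"]:
    print(w, offsets(tokens, w))
```
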
We can go a little crazy with text statistics. This block of code computes the average word length for each text, as well as a measure known as the "lexical diversity" that measures how much word re-use there is in a text.

In [None]:
def print_text_stats( thetext ):
    # Average word length
    awl = sum([len(w) for w in thetext]) / len( thetext ) 
    ld = len( thetext ) / len( thetext.vocab() )
    print("%.2f\t%.2f\t%s" % ( awl, ld, thetext.name ))
    
all_texts = [ text1, text2, text3, text4, text5, text6, text7, 
             text8, text9 ]
print("Wlen\tLdiv\tTitle")
for t in all_texts:
    print_text_stats( t )


A text of your own
------------------

So far we have been using the sample texts, but we can also use any text that we have lying around on our computer. The easiest sort of text to read in is plaintext, not PDF or HTML or anything else. Once we have made the text into an NLTK text with the `Text()` function, we can use all the same methods on it as we did for the sample texts above.

In [None]:
from nltk import word_tokenize
from nltk.text import Text

# Read all the file's contents into Python.
with open('../lessondata/alice.txt', encoding='utf-8') as f:
    raw = f.read()

# Use NLTK to break the text up into words, and put the result into a 
# Text object.
alice = Text( word_tokenize( raw ) )
print(alice.name)
alice.name = "Alice's Adventures in Wonderland"
print(alice.name)
alice.concordance( "cat" )
print_text_stats( alice )


Using text corpora
------------------

NLTK comes with several pre-existing corpora of texts, some of which are the main body of text used for certain sorts of linguistic research. Using a corpus of texts, as opposed to an individual text, brings us a few more features.

In [None]:
from nltk.corpus import gutenberg

print(gutenberg.fileids())
paradise_lost = Text( gutenberg.words( "milton-paradise.txt" ) )
paradise_lost

*Paradise Lost* is now a Text object, just like the ones we have worked on before. But we accessed it through the *NLTK corpus reader*, which means that we get some extra bits of functionality:

In [None]:
print("Length of text is:", len( 
        gutenberg.raw( "milton-paradise.txt" )))
print("Number of words is:", len( 
        gutenberg.words( "milton-paradise.txt" )))
assert( len( gutenberg.words( "milton-paradise.txt" )) \
       == len( paradise_lost ))
print("Number of sentences is:", len( 
        gutenberg.sents( "milton-paradise.txt" )))
print("Number of paragraphs is:", len( 
        gutenberg.paras( "milton-paradise.txt" )))

We can also make our own corpus if we have our own collection of files, e.g. the Federalist Papers. But we have to pay attention to how those files are arranged! In this case, if you look in the text file, the paragraphs are set apart with 'hanging indentation' - all the lines that are *not* the beginning of a paragraph begin with a space.

In [None]:
from nltk.corpus import PlaintextCorpusReader
from nltk.corpus.reader.util import read_regexp_block

# Tell NLTK how to know when a new paragraph starts in our text files.
def read_hanging_block( stream ):
    return read_regexp_block( stream, "^[A-Za-z]" )

# Tell NLTK where our texts are.
corpus_root = '../lessondata/federalist'
# Tell NLTK how to know which files contain the texts for our corpus.
# A raw string keeps the backslash in the regular expression intact.
file_pattern = r'federalist_.*\.txt'
federalist = PlaintextCorpusReader( corpus_root, file_pattern, 
                                para_block_reader=read_hanging_block )
print("List of texts in corpus:", federalist.fileids())
print("\nHere is the fourth paragraph of the first text:")
print(federalist.paras("federalist_1.txt")[3])
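
The idea behind the regular expression is that a line starting with a letter opens a new paragraph, while a line starting with a space continues the current one. The following standard-library sketch applies that rule to a short invented string. `split_hanging_paragraphs` is a hypothetical helper written for this illustration, not part of NLTK, and it only approximates what the corpus reader does.

```python
import re

def split_hanging_paragraphs(text):
    """Group lines into paragraphs; a line starting with a letter opens one."""
    paragraphs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if re.match(r"^[A-Za-z]", line) or not paragraphs:
            paragraphs.append(line.strip())
        else:
            # Indented line: glue it onto the paragraph in progress.
            paragraphs[-1] += " " + line.strip()
    return paragraphs

# Invented sample with hanging indentation.
sample = (
    "First paragraph starts here\n"
    " and continues on an indented line.\n"
    "Second paragraph starts here\n"
    " and also continues.\n"
)
for p in split_hanging_paragraphs(sample):
    print(p)
```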

And just like before, from this corpus we can make individual Text objects, on which we can use the methods we have seen above.

In [None]:
fed1 = Text( federalist.words( "federalist_1.txt" ))
print("The first Federalist Paper has the following word collocations:")
fed1.collocations()
print("\n...and the following most frequent words.")
fed1.vocab().most_common(50)

Filtering out stopwords
-----------------------

In linguistics, *stopwords* or *function words* are words that are so frequent in a particular language that they say little to nothing about the meaning of a text. You can make your own list of stopwords, but NLTK also provides a list for each of several common languages. These sets of stopwords are provided as another corpus.

In [None]:
from nltk.corpus import stopwords
print("We have stopword lists for the following languages:")
print(stopwords.fileids())
print("\nThese are the NLTK-provided stopwords for the German language:")
print(", ".join( stopwords.words('german') ))
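
Making your own stopword list can be as simple as writing down a set of words. The set and token list below are invented for illustration; a set is used because membership tests on it are fast.

```python
# A hand-made stopword set (hypothetical, far shorter than NLTK's lists).
my_stopwords = {"the", "a", "of", "and", "is"}

tokens = ["The", "Knights", "of", "the", "Round", "Table", "and", "a", "swallow"]

# Compare in lowercase so that "The" is filtered along with "the".
content_words = [t for t in tokens if t.lower() not in my_stopwords]
print(content_words)
```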

Having read in the stopword list, we can use it to filter out vocabulary we don't want to see. Let's look at our 50 most frequent words in *Holy Grail* again.

In [None]:
print("The most frequent words are: ")
print([word[0] for word in t6_vocab.most_common(50)])

# Build the stopword set once; set lookups are much faster than
# re-reading the stopword list for every word.
english_stopwords = set( stopwords.words('english') )
f1_most_frequent = [ w[0] for w in t6_vocab.most_common()
                    if w[0].lower() not in english_stopwords ]
print("\nThe most frequent interesting words are:  ",
      "  ".join( f1_most_frequent[:50] ))

Maybe we should get rid of punctuation and all-caps words too...

In [None]:
def is_interesting( w ):
    if( w.lower() in english_stopwords ):
        return False
    if( w.isupper() ):
        return False
    return w.isalpha()

f1_most_frequent = [ w[0] for w in t6_vocab.most_common() 
                    if is_interesting( w[0] ) ]
print("The most frequent interesting words are: ", 
      "  ".join( f1_most_frequent[:50] ))

Now we have a list of words that begins to make us think of *Monty Python and the Holy Grail*!

Part-of-speech tagging
----------------------

This is where corpus linguistics starts to get interesting. In order to analyze a text computationally, it is useful to know its syntactic structure - which words are nouns, which are verbs, and so on. This can be done (imperfectly) by using *part-of-speech tagging.* NLTK includes a default part-of-speech tagger, although this probably won't be of much use to you on non-English texts.

In [None]:
from nltk import pos_tag

my_text = alice[305:549]
print(pos_tag(my_text))

NLTK part-of-speech tags (simplified selection of Penn Treebank tags)
---------------------------------------------------------------------

| Tag | Meaning            | Examples                             |
|-----|--------------------|--------------------------------------|
| JJ  | adjective          | new, good, high, special, big, local |
| RB  | adverb             | really, already, still, early, now   |
| CC  | conjunction        | and, or, but, if, while, although    |
| DT  | determiner         | the, a, some, most, every, no        |
| EX  | existential        | there, there's                       |
| FW  | foreign word       | dolce, ersatz, esprit, quo, maitre   |
| MD  | modal verb         | will, can, would, may, must, should  |
| NN  | noun               | year, home, costs, time, education   |
| NNP | proper noun        | Alison, Africa, April, Washington    |
| CD  | number             | twenty-four, 1991, 14:24             |
| PRP | pronoun            | he, her, I, us, it, them             |
| IN  | preposition        | on, of, at, with, by, into, under    |
| TO  | the word to        | to                                   |
| UH  | interjection       | ah, bang, ha, whee, hmpf, oops       |
| VB  | verb               | is, has, get, do, make, see, run     |
| VBD | past tense         | said, took, told, made, asked        |
| VBG | present participle | making, going, playing, working      |
| VBN | past participle    | given, taken, begun, sung            |
| WRB | wh-adverb          | when, where, how, why                |

Automated tagging is pretty good, but not perfect. There are other taggers out there that handle different languages, such as the Brill tagger and the TreeTagger, but these aren't set up to run 'out of the box' and, with TreeTagger in particular, you will have to download extra software.
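
Whichever tagger is used, NLTK represents tagged text as a plain list of `(word, tag)` tuples, so selecting words by part of speech is an ordinary filter. The tagged list below is written by hand to resemble tagger output (an assumption for illustration, not a real result), so the example runs without NLTK.

```python
# Hand-written (word, tag) pairs imitating the shape of pos_tag() output.
tagged = [("Alice", "NNP"), ("was", "VBD"), ("beginning", "VBG"),
          ("to", "TO"), ("get", "VB"), ("very", "RB"), ("tired", "JJ"),
          ("of", "IN"), ("sitting", "VBG"), ("by", "IN"),
          ("her", "PRP$"), ("sister", "NN")]

# Keep words whose tag starts with "NN" (common and proper nouns).
nouns = [word for word, tag in tagged if tag.startswith("NN")]
print(nouns)
```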

Some of the bigger corpora in NLTK come pre-tagged; they are useful for __training__ a tagger that uses machine-learning methods (such as Brill), and for testing any new tagging method that is developed. This is also the data from which our knowledge of how language is used comes. (At least for English and some other major Western languages.)

In [None]:
from nltk.corpus import brown

print(brown.tagged_words()[:25])
print(brown.tagged_words(tagset='universal')[:25])

We can even do a frequency plot of the different parts of speech in the corpus (if we have `matplotlib` installed!)

In [None]:
tagged_word_fd = FreqDist([ w[1] for w in 
                           brown.tagged_words(tagset='universal') ])
tagged_word_fd.plot()

Named-entity recognition
------------------------

As well as the parts of speech of individual words, it is useful to be able to analyze the structure of an entire sentence. This generally involves breaking the sentence up into its component phrases, otherwise known as chunking. 

We are not going to cover chunking here, as there is no out-of-the-box chunker for NLTK! You are expected to define the grammar (or at least some approximation of the grammar) yourself, and only once you have done that does chunking become possible.

But one application of chunking is named-entity recognition - parsing a sentence to identify the named people, places, and organizations therein. This is more difficult than it looks: words such as "Yankee", "May", and "North" may or may not be part of a name, depending on the context.

Here's how to do it. We will use the example sentences that were loaded in `sent1` through `sent9` to try it out. Notice the difference (in IPython only!) between printing the result, which gives plain text, and just looking at the result, which draws it as a graph. If you try to show the graph for more than one sentence at a time then you'll be waiting a *long* time. So don't try it.

In [None]:
from nltk import ne_chunk

tagged_text = pos_tag(sent2)
ner_text = ne_chunk( tagged_text )
print(ner_text)
ner_text

Here is a function that takes the result of `ne_chunk` (the plain-text form, not the graph form!) and spits out only the named entities that were found.

In [None]:
def list_named_entities( tree ):
    try:
        tree.label()
    except AttributeError:
        return
    if( tree.label() != "S" ):
        print(tree)
    else:
        for child in tree:
            list_named_entities( child )
            
list_named_entities( ner_text )

And there you have it - an introductory tour of what is probably the best-available code toolkit for natural language processing. If this sort of thing interests you, then there is an entire book-length tutorial about it:

http://www.nltk.org/book/

Have fun!