# Getting Familiar With NLTK
*Curtis Miller*

In this notebook I will demonstrate some common uses of the Python [Natural Language Toolkit (NLTK)](https://www.nltk.org/), a package for working with natural languages.

Below we load in NLTK (assumed to be installed).

In [None]:
import nltk
import sys, os
import string

Run the following line to open the NLTK downloader if you still need to fetch NLTK data (such as the corpora used below).

In [None]:
nltk.download()

## Loading In NLTK Corpora

In addition to providing useful tools for working with natural languages, NLTK comes with a large collection of texts (text corpora). These are useful not only for practice but also for training NLP algorithms (like taggers).

In [None]:
nltk.corpus.gutenberg.fileids()    # Files in the Gutenberg corpus, files that came from Project Gutenberg
                                   # http://www.gutenberg.org/

In [None]:
bible = nltk.corpus.gutenberg.words('bible-kjv.txt')
bible[:5]

We can then turn this into a `Text` object, which provides a lot of useful methods for working with the documents.

In [None]:
bible = nltk.Text(bible)
bible[:5]

In [None]:
bible.concordance("david")    # A concordance; lists contexts in which words appeared
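
As far as I know, `concordance` matches case-insensitively, which is why the lowercase "david" above still finds "David". Here is a minimal pure-Python sketch of the idea (a keyword-in-context listing). The function `kwic` and the sample tokens are made up for illustration and are not part of NLTK.

```python
def kwic(tokens, word, width=2):
    """Return (left context, match, right context) for each occurrence of word."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():    # Compare without regard to case
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

# Hypothetical sample tokens, standing in for a real corpus
tokens = ["And", "David", "said", "unto", "Saul", ",", "and", "David", "went", "out"]
hits = kwic(tokens, "david")
for left, match, right in hits:
    print(f"{left:>10} [{match}] {right}")
```

By contrast, a plain `tokens.count("david")` on this list is case-sensitive and finds nothing.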

In [None]:
bible.count("David")    # How many times do we see "David"? (case-sensitive)

In [None]:
bible.collocations()    # Common combinations of words
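
To give a feel for what a collocation is, here is a simplified sketch that just counts adjacent word pairs. This is not how NLTK ranks collocations (I believe it also filters out stopwords and scores pairs with a statistical measure); the sample tokens are made up.

```python
from collections import Counter

# Hypothetical sample tokens, standing in for a real corpus
tokens = ["LORD", "God", "said", "unto", "the", "LORD", "God"]

# Pair each token with its successor and count how often each pair occurs
bigram_counts = Counter(zip(tokens, tokens[1:]))
top_pair = bigram_counts.most_common(1)
print(top_pair)
```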

In [None]:
bible.common_contexts(["David"])    # Common context in which "David" appears

In [None]:
bible.dispersion_plot(["David"])    # A dispersion plot, showing where a word (or words) appeared

In [None]:
bible.vocab()    # Words that appear in bible; a FreqDist object whose keys are words and whose values are
                 # the number of times each word appeared
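
A `FreqDist` behaves much like Python's `collections.Counter` (which it subclasses in current NLTK versions, to my knowledge). The sketch below uses a plain `Counter` on made-up tokens to show the kind of lookups a frequency distribution supports.

```python
from collections import Counter

# Hypothetical sample tokens, standing in for a real corpus
tokens = ["the", "king", "and", "the", "people", "and", "the", "land"]

vocab = Counter(tokens)          # Maps each word to the number of times it appears
top_two = vocab.most_common(2)   # The two most frequent words with their counts
print(vocab["the"], top_two)
```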

In [None]:
bible.findall("<.*><David><.*>")    # Find token sequences with a regular expression; each <...> matches one token
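
The angle brackets in the pattern mark token boundaries. Roughly speaking, we can imitate this with the standard `re` module by wrapping every token in `<` and `>` and searching the joined string. This is only a sketch of the idea on made-up tokens, not NLTK's actual implementation.

```python
import re

# Hypothetical sample tokens, standing in for a real corpus
tokens = ["And", "David", "said", "to", "Saul", ",", "David", "went", "out"]

# Wrap each token in angle brackets: "<And><David><said>..."
raw = "".join(f"<{tok}>" for tok in tokens)

# "[^>]*" plays the role of ".*" but cannot run past the end of a token
pattern = r"<[^>]*><David><[^>]*>"
matches = [m[1:-1].split("><") for m in re.findall(pattern, raw)]
print(matches)
```

Each match is a list of three tokens: the word before "David", "David" itself, and the word after.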

In [None]:
bible.vocab().plot(10)

We can save text files in a directory and create our own corpora to work with. Below is a demonstration, using the text files stored in the `books` directory.

In [None]:
books_dir = os.path.abspath('books')    # Need absolute path to directory
books = nltk.corpus.PlaintextCorpusReader(books_dir, r'.*\.txt')    # Create corpus from every .txt file (raw string avoids an invalid escape)
books.fileids()
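
The second argument to the corpus reader is a regular expression that selects which files belong to the corpus. The sketch below shows which file names a pattern like `.*\.txt` would pick, using only the standard library and a temporary directory; the file names are made up, and I am assuming the pattern has to match the whole file name.

```python
import os
import re
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical files, standing in for the contents of a books directory
    for name in ["ulysses.txt", "notes.md", "dubliners.txt"]:
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as f:
            f.write("sample text\n")

    # Keep only the names that the pattern matches in full
    selected = sorted(n for n in os.listdir(tmp_dir) if re.fullmatch(r".*\.txt", n))

print(selected)
```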

In [None]:
ulysses = nltk.Text(books.words('JamesJoyceUlysses.txt'))
ulysses[:5]

## NLTK Lexical Resources

NLTK provides a number of lexical resources, such as:

* Wordlists
* Pronunciation dictionaries
* WordNet

These exist for multiple languages, and resources for comparing and characterizing languages (such as comparative word lists and toolboxes) also exist.

For example, we have a list for stopwords in English (words that appear with high frequency but don't distinguish texts).

In [None]:
nltk.corpus.stopwords.words('english')
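
A typical use of a stopword list is to filter such words out of a text before further analysis. Here is a minimal sketch; the small hard-coded set `stop` is a made-up stand-in for the list above, and the tokens are hypothetical as well.

```python
# Hypothetical stand-in for nltk.corpus.stopwords.words('english')
stop = {"the", "and", "of", "to", "in", "a"}

# Hypothetical sample tokens
tokens = ["In", "the", "beginning", "God", "created", "the", "heaven", "and", "the", "earth"]

# Stopword lists are lowercase, so lowercase each token before the lookup
content_words = [tok for tok in tokens if tok.lower() not in stop]
print(content_words)
```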

WordNet relates words to one another. For instance, some words are synonyms (they have similar meanings); a word can be a hyponym of another word (it is a more specific instance of that word) or a hypernym (it is a generalization of that word). If the word is a noun, it may have parts, which are referred to as meronyms, and it may itself be a part or a member of another noun, which is then called its holonym.
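
To make these relations concrete, here is a toy sketch with a tiny hand-made hierarchy. The dictionaries and helper functions are invented for illustration and do not reflect WordNet's actual data or API. In it, a hypernym is the more general word and a hyponym the more specific one.

```python
# Toy data: each word maps to its more general word (its hypernym)
hypernym_of = {
    "person": "organism",
    "animal": "organism",
    "capitalist": "person",
    "adult": "person",
    "cat": "animal",
}

# Toy data: parts of a thing (meronyms) and the group it belongs to (holonyms)
part_meronyms = {"tree": ["trunk", "limb"]}
member_holonyms = {"tree": ["forest"]}

def hypernyms(word):
    """More general words directly above word."""
    return [hypernym_of[word]] if word in hypernym_of else []

def hyponyms(word):
    """More specific words directly below word."""
    return sorted(w for w, parent in hypernym_of.items() if parent == word)

print(hypernyms("person"))    # The generalization of "person"
print(hyponyms("person"))     # Specific kinds of "person"
print(part_meronyms["tree"], member_holonyms["tree"])
```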

Let's work with the synset (collection of synonyms) for the word "person".

In [None]:
from nltk.corpus import wordnet as wn

In [None]:
person = wn.synset('person.n.01')    # Create the synset

In [None]:
person.name()

In [None]:
person.hyponyms()

In [None]:
person.hypernyms()

In [None]:
person.part_meronyms()

In [None]:
person.member_holonyms()

WordNet also gives us a way to measure how similar two words are and how general a word is.

In [None]:
person.min_depth()

In [None]:
wn.synset('creature.n.01').min_depth()    # Bigger number => more specific (deeper in the hierarchy)

In [None]:
person.path_similarity(wn.synset("capitalist.n.01"))    # How similar two words are

In [None]:
person.path_similarity(wn.synset("cat.n.01"))
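
Path similarity is based on the shortest path between two synsets in the hypernym hierarchy: it equals 1 / (1 + path length), so identical synsets score 1 and distant ones approach 0. The sketch below computes depth and path similarity on a toy tree; the tree and the helper functions are made up for illustration and are much simpler than WordNet itself.

```python
# Toy hierarchy: each word maps to its hypernym; the root maps to None
parent = {
    "entity": None,
    "organism": "entity",
    "person": "organism",
    "capitalist": "person",
    "animal": "organism",
    "cat": "animal",
}

def path_to_root(word):
    """List of nodes from word up to the root, inclusive."""
    path = []
    while word is not None:
        path.append(word)
        word = parent[word]
    return path

def depth(word):
    """Number of steps from word up to the root (the root has depth 0)."""
    return len(path_to_root(word)) - 1

def path_similarity(a, b):
    """1 / (1 + number of edges on the shortest path between a and b)."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    for steps_a, node in enumerate(path_a):
        if node in path_b:    # First shared ancestor is the closest one
            return 1 / (1 + steps_a + path_b.index(node))
    return None    # Not reached in this toy tree, which has a single root

print(depth("person"), depth("cat"))
print(path_similarity("person", "capitalist"), path_similarity("person", "cat"))
```

In this toy tree "capitalist" sits directly below "person", so the pair scores higher than "person" and "cat", which only meet at "organism".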

NLTK has many other useful tools for working with natural languages. We will see more in future videos.