# NLTK (Natural Language Toolkit)

NLTK offers a comprehensive set of tools for Natural Language Processing (NLP).

## Corpora

A corpus (plural: corpora) is a collection of real-world text used in NLP.

We will illustrate various tools contained in NLTK using the `reuters` corpus. It contains newswire articles from Reuters from 1987. To work with an NLTK corpus, we need to download it.

In [None]:
import nltk
from nltk.corpus import reuters
from nltk.corpus import stopwords
from nltk import word_tokenize
import re
import string
nltk.download('reuters')


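We will need some other NLTK resources later.

In [None]:
nltk.download('punkt')                       # models for sent_tokenize / word_tokenize
nltk.download('stopwords')                   # stopword lists
nltk.download('wordnet')                     # database used by the WordNet lemmatizer
nltk.download('averaged_perceptron_tagger')  # model used by nltk.pos_tag
# Newer NLTK releases may ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng' instead.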

The Reuters corpus is quite large. To keep the examples quick, we will work with a subset of it.

The `words` method returns the corpus as a sequence of tokens (words and punctuation marks), so applying `len` to it gives the total number of tokens.

In [None]:
len(reuters.words())

The files contained in the corpus are assigned to categories. We will reduce the size of the data by choosing only two of the categories.

In [None]:
reuters.categories()[:10]

In [None]:
reuters_use = reuters.words(categories=['cocoa','coffee'])
len(reuters_use)

## Stopwords

Stopwords are very frequent words that usually don't contribute much when trying to categorize text in terms of content.

If we are only concerned with the content, we should remove them. If we care about the style of the language used in the text, they can be useful and should be retained.

In [None]:
print(stopwords.words('english'))

### Lower case
It is usually a good idea to convert all text to lower case. We also remove the stopwords here.

In [None]:
stop_words = set(stopwords.words('english'))  # build the set once; lookups in a set are fast
reuters_use = [w.lower() for w in reuters_use if w.lower() not in stop_words]
reuters_use[0:10]
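
The filtering step can be sketched without NLTK on a tiny example. Both the token list and the three-word stopword set below are made up for illustration; NLTK's English list is much longer.

```python
# Toy data: the tokens and the stopword set are illustrative assumptions.
tokens = ["The", "price", "of", "cocoa", "rose", "in", "the", "market"]
toy_stopwords = {"the", "of", "in"}

# Lower-case every token, then keep only those that are not stopwords.
filtered = [w.lower() for w in tokens if w.lower() not in toy_stopwords]
print(filtered)  # ['price', 'cocoa', 'rose', 'market']
```

Lower-casing before the comparison matters here: without it, the capitalized "The" would not match the lower-case entry in the stopword set.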

Unless we are interested in using the structure of sentences, we remove the punctuation from the text.

In [None]:
string.punctuation

Filtering on `string.punctuation` removes tokens that are single punctuation characters, but the corpus also contains tokens made up of several punctuation characters. Most of these combine a quotation mark with one other character, so we build those two-character sequences and remove them as well.

In [None]:
reuters_use = [w for w in reuters_use if w not in string.punctuation]
punct_seq = [c+"\"" for c in string.punctuation ]+ ["\""+c for c in string.punctuation ]
reuters_use = [w for w in reuters_use if w not in punct_seq]
len(reuters_use)
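
One detail worth knowing: `string.punctuation` is a plain string, so `w not in string.punctuation` is a substring test rather than a per-character membership test. The sketch below uses made-up tokens to contrast it with an alternative filter that drops every token consisting only of punctuation characters. The helper name `is_punct` is an assumption for illustration, not part of NLTK.

```python
import string

punct_chars = set(string.punctuation)

def is_punct(token):
    """Return True if the token is non-empty and consists only of punctuation."""
    return bool(token) and all(ch in punct_chars for ch in token)

tokens = ["cocoa", ",", '."', "--", "u.s.", "prices"]

# Substring test: removes only tokens that occur verbatim inside string.punctuation.
substring_filtered = [w for w in tokens if w not in string.punctuation]
# Character-wise test: removes every all-punctuation token, whatever its length.
charwise_filtered = [w for w in tokens if not is_punct(w)]

print(substring_filtered)  # ['cocoa', '."', '--', 'u.s.', 'prices']
print(charwise_filtered)   # ['cocoa', 'u.s.', 'prices']
```

The character-wise filter is a possible alternative to listing punctuation sequences explicitly; whether it is preferable depends on which tokens you want to keep.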

## Uni-grams, bi-grams, n-grams
In NLP, sequences of words of length `n` are called `n-grams`. Individual words are called uni-grams, sequences of two words bi-grams.
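
As a minimal plain-Python sketch of the definition, the helper below (the name `make_ngrams` is an assumption, not an NLTK function) builds all n-grams of a token list by zipping the list with shifted copies of itself.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["cocoa", "prices", "rose", "sharply"]

print(make_ngrams(tokens, 1))  # [('cocoa',), ('prices',), ('rose',), ('sharply',)]
print(make_ngrams(tokens, 2))  # [('cocoa', 'prices'), ('prices', 'rose'), ('rose', 'sharply')]
print(make_ngrams(tokens, 3))  # [('cocoa', 'prices', 'rose'), ('prices', 'rose', 'sharply')]
```

A list of `k` tokens yields `k - n + 1` n-grams, which is why the four tokens above give three bi-grams and two tri-grams.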

### Frequency distribution of words
We can create a `FreqDist` object that counts how many times each distinct word occurs in the text. Its `plot` method displays the words in order of decreasing frequency.

In [None]:
freq_dist_uni = nltk.FreqDist(reuters_use)
freq_dist_uni.plot(30, cumulative=False)

To see how many times the most frequent words occurred, we can also use the `most_common` method.

In [None]:
for word, frequency in freq_dist_uni.most_common(10):
    print(word, frequency)
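
`FreqDist` is built on Python's `collections.Counter`, so the counting itself can be sketched with the standard library alone. The token list below is made up for illustration.

```python
from collections import Counter

# Toy token list standing in for the cleaned corpus.
tokens = ["cocoa", "coffee", "cocoa", "prices", "coffee", "cocoa"]
counts = Counter(tokens)

# most_common(k) returns the k most frequent items with their counts.
for word, frequency in counts.most_common(2):
    print(word, frequency)

# Relative frequency of a word: its count divided by the total number of tokens.
share_cocoa = counts["cocoa"] / sum(counts.values())
print(share_cocoa)  # 0.5
```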

### Frequency distribution of bi-grams

The `bigrams` function returns a generator over the bi-grams.

In [None]:
reuters_use_bi = nltk.bigrams(reuters_use)
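
Because a generator can be consumed only once, it is exhausted after the first full pass over it, for example after a frequency distribution has been built from it. The sketch below shows this behaviour with a plain Python generator standing in for the bi-gram generator; the helper name `pairs` is an assumption for illustration.

```python
def pairs(tokens):
    """Yield consecutive token pairs lazily, standing in for a bi-gram generator."""
    for left, right in zip(tokens, tokens[1:]):
        yield (left, right)

tokens = ["cocoa", "bean", "prices", "rose"]
gen = pairs(tokens)

first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # nothing is left to yield

print(first_pass)   # [('cocoa', 'bean'), ('bean', 'prices'), ('prices', 'rose')]
print(second_pass)  # []
```

If the bi-grams are needed more than once, one option is to store them in a list first.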

In [None]:
freq_dist_bi = nltk.FreqDist(reuters_use_bi)
freq_dist_bi.plot(30, cumulative=False)

We can again use the `most_common` method.

In [None]:
for bigram, frequency in freq_dist_bi.most_common(10):
    print(bigram, frequency)

# Tokenization

Tokenization refers to the separation of a text into smaller units. Sentence tokenization splits the text into sentences, word tokenization into words.

In [None]:
my_text = """
For the first quarter, Fifth Third saw an increase in their net interest income YOY (NII) to $132 million,
but there was a decrease in the net interest margin of 4 basis points. These increases were driven by the
impacts of the MB Financial merger, as well as successful cash flow hedges. The decrease in NIM of 4 bps
was primarily from the challenging interest rate environment, of which the Fed dropped the fed funds rate
from 1.25% to the current level of 0 to 0.25%.
The decline in interest rates will have a big impact on the NIM in the coming quarters, more on that in
a moment.
Noninterest income also saw increases in YOY, $165 million, or 29%, which included an impact from the MB
Financial merger. The increase was driven by increases in the mortgage revenue increases of 114%,
leasing revenue up 128% primarily from the MB Financial merger, Wealth and asset management revenue up 20%.
On an adjusted basis, both return on assets 1.19%, and return on equity 9.9%, and an efficiency ratio of
59.4% were all within line and according to estimates.
And dividend payments of $1.08 annually with a current yield of 6.22%, per Seeking Alpha.
"""

## Sentence Tokenization
The default sentence tokenizer in NLTK is the `PunktSentenceTokenizer`.

In [None]:
sentences = nltk.sent_tokenize(my_text.replace('\n',' ').strip())
for s in sentences:
    print(s+'\n')

This worked well, though it can be less than perfect if a text contains non-standard punctuation.
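
To see why a dedicated sentence tokenizer is useful, consider a naive rule that ends a sentence at every `.`, `!` or `?` followed by whitespace. The sketch below uses a made-up sentence and only the standard `re` module; it is not how Punkt works, which is designed to avoid this kind of error by modelling abbreviations.

```python
import re

text = "Dr. Smith reported a yield of 6.22%. The dividend was $1.08 annually."

# Naive rule: split after '.', '!' or '?' whenever whitespace follows.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)
for s in naive_sentences:
    print(s)
```

The decimal points in `6.22` and `1.08` are handled correctly because no whitespace follows them, but the abbreviation `Dr.` is wrongly treated as the end of a sentence, so the text is split into three pieces instead of two.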

## Word Tokenization
There are several word tokenizers available. We will look at three common ones.

In [None]:
sentence = "And dividend payments of $1.08 annually with a current yield of 6.22%, that's a lot, ain't it?"

default_tokens = word_tokenize(sentence)   # uses the 'punkt' resource downloaded earlier

punct_tokenizer = nltk.tokenize.WordPunctTokenizer()
punct_tokens = punct_tokenizer.tokenize(sentence)

space_tokenizer = nltk.tokenize.SpaceTokenizer()
space_tokens = space_tokenizer.tokenize(sentence)

print(default_tokens, '\n')
print(punct_tokens, '\n')
print(space_tokens)
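
The difference between the last two tokenizers can be reproduced with the standard library. `WordPunctTokenizer` is a regular-expression tokenizer based on the pattern `\w+|[^\w\s]+`, and `SpaceTokenizer` splits on the single space character. The shortened sentence below is an illustrative assumption.

```python
import re

sentence = "a current yield of 6.22%, ain't it?"

# Runs of word characters, or runs of characters that are neither word characters nor whitespace.
punct_style = re.findall(r"\w+|[^\w\s]+", sentence)
# Split on single spaces only; punctuation stays attached to the neighbouring word.
space_style = sentence.split(" ")

print(punct_style)
print(space_style)
```

The first approach breaks `6.22%,` into `6`, `.`, `22` and `%,`, while the second keeps it as one token, which illustrates how much the choice of tokenizer affects numbers and contractions.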

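### `Text` objects
Wrapping a list of tokens in an NLTK `Text` object gives access to exploratory methods such as `concordance`, `similar`, and `collocations`. A `Text` object can also be passed to `FreqDist` like any other sequence of tokens.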

In [None]:
my_tokenized_text = word_tokenize(my_text)
my_nltk_text     = nltk.Text(my_tokenized_text)

freq_my_nltk_text = nltk.FreqDist(my_nltk_text)
freq_my_nltk_text.plot(10, cumulative=False)

## Part of Speech Tagging

If we are concerned not only with the words we observe but also with their grammatical categories, we need Part of Speech (POS) tagging. A POS-tagger can be used to classify each word.

As with the tokenizers, there are multiple POS-taggers, though we will consider only the default one here.

In [None]:
default_pos = nltk.pos_tag(default_tokens)
print(default_pos)
punct_pos = nltk.pos_tag(punct_tokens)
print(punct_pos)
space_pos = nltk.pos_tag(space_tokens)
print(space_pos)

You can find a definition of each tag at 

https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

We can use regular expressions to extract larger classes of tags. For example, all noun tags (`NN`, `NNS`, `NNP`, `NNPS`) begin with `N`.

In [None]:
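# Match any tag that starts with 'N' (NN, NNS, NNP, NNPS).
regex = re.compile(r'^N')
# Keep the word of every (word, tag) pair whose tag matches.
nouns = [word for word, tag in default_pos if regex.match(tag)]
print("Nouns:", nouns)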
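
The same idea extends to grouping words by coarse tag class. The sketch below uses hand-written `(word, tag)` pairs with Penn Treebank tags; they are illustrative and not actual tagger output.

```python
from collections import defaultdict

# Hand-written (word, tag) pairs; an assumption for illustration, not tagger output.
tagged = [("dividend", "NN"), ("payments", "NNS"), ("of", "IN"),
          ("annually", "RB"), ("yield", "NN"), ("is", "VBZ")]

# Group words by the first letter of their tag: N for nouns, V for verbs.
by_class = defaultdict(list)
for word, tag in tagged:
    by_class[tag[0]].append(word)

print(by_class["N"])  # ['dividend', 'payments', 'yield']
print(by_class["V"])  # ['is']
```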

## Stemming
Stemming reduces a word to its stem by stripping suffixes with heuristic rules, so the result is not always a dictionary word. As with the other tools we have considered, several stemmers are available. We compare three of them here.

In [None]:
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
snowball = nltk.stem.snowball.SnowballStemmer('english')

for stemmer in [porter, lancaster, snowball]:
    print([stemmer.stem(t) for t in default_tokens])
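
To get a feel for rule-based suffix stripping, here is a deliberately crude toy stemmer. It is not the Porter, Lancaster or Snowball algorithm; the function name `toy_stem` and its suffix list are assumptions chosen for illustration.

```python
def toy_stem(word, suffixes=("ing", "ments", "ment", "ly", "es", "s")):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["payments", "annually", "increases", "yield"]:
    print(w, "->", toy_stem(w))
```

The output `increas` for `increases` shows the typical side effect of stemming: the stem is a truncated string rather than a real word. The real stemmers apply far more elaborate rule sets but share this property.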

## Lemmatizing

Lemmatization is similar to stemming in that it reduces inflected forms, but its goal is to find the lemma, i.e., the form you would look up in a dictionary, rather than a truncated version of the word.

Here we use NLTK's `WordNetLemmatizer`, which relies on the WordNet database downloaded earlier.

In [None]:
wordnet = nltk.WordNetLemmatizer()
print([wordnet.lemmatize(t) for t in default_tokens])
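
`WordNetLemmatizer.lemmatize` takes an optional `pos` argument that defaults to noun, so verb forms may not be reduced unless `pos='v'` is passed. The toy lookup below uses a hand-made dictionary, not WordNet, to sketch why the part of speech matters when looking for a lemma. The names `toy_lemmas` and `toy_lemmatize` are assumptions for illustration.

```python
# Hand-made lemma table keyed by (word, part of speech); a stand-in for WordNet.
toy_lemmas = {
    ("increases", "n"): "increase",
    ("increases", "v"): "increase",
    ("saw", "n"): "saw",   # the tool
    ("saw", "v"): "see",   # past tense of "see"
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); fall back to the word itself."""
    return toy_lemmas.get((word, pos), word)

print(toy_lemmatize("saw"))           # saw
print(toy_lemmatize("saw", pos="v"))  # see
print(toy_lemmatize("yield"))         # yield
```

Unlike a stemmer, a lookup of this kind always returns a dictionary form, but it needs to know the part of speech to choose between candidates.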

## Concordance
The `concordance` method shows every occurrence of a word together with its surrounding context. The `width` argument sets the number of characters displayed per line.

In [None]:
# concordance prints its result and returns None, so no print() is needed.
nltk.Text(reuters_use).concordance('beans', width=70)
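
The idea behind a concordance can be sketched in a few lines of plain Python. Unlike the NLTK method, which measures context in characters, this toy version (the name `toy_concordance` and the token list are assumptions) takes a fixed number of tokens on each side.

```python
def toy_concordance(tokens, target, window=3):
    """Return each occurrence of target with up to `window` tokens on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            lines.append(" ".join(left + [tok] + right))
    return lines

tokens = "ghana cocoa beans arrived late while coffee beans prices rose".split()
for line in toy_concordance(tokens, "beans", window=2):
    print(line)
```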

This has only been a very brief overview of NLTK. There is much more in the book at https://www.nltk.org/book/ and in NLTK's documentation.