# NLTK

In [1]:
import nltk
from nltk.corpus import brown

The cell below opens a separate window; you might need to install the `all-nltk` collection.

In [2]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True
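
If the download window is inconvenient (for example on a remote machine), the required data can also be checked from code. The following is a minimal sketch, not the notebook's own method: the package names and paths in `RESOURCES` are assumptions that vary between NLTK versions (newer releases look for `punkt_tab` instead of `punkt`, for instance).

```python
import nltk

# Package name -> path understood by nltk.data.find.
# These identifiers are assumptions and may differ in your NLTK version.
RESOURCES = {
    "punkt": "tokenizers/punkt",
    "averaged_perceptron_tagger": "taggers/averaged_perceptron_tagger",
    "wordnet": "corpora/wordnet",
}


def missing_resources(resources):
    """Return the package names whose data cannot be found locally."""
    missing = []
    for package, path in resources.items():
        try:
            nltk.data.find(path)
        except LookupError:
            missing.append(package)
    return missing


missing = missing_resources(RESOURCES)
print("Missing packages:", missing)

# To fetch them without the window, uncomment the next two lines:
# for package in missing:
#     nltk.download(package)
```

Passing a package name to `nltk.download` fetches only that package and does not open the window.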

## Sentence splitting

In [3]:
a_text = '''Another ex-Golden Stater, Paul Stankowski from Oxnard, is contending for a berth on the U.S. Ryder Cup team after winning his first PGA Tour event last year and staying within three strokes of the lead through three rounds of last month's U.S. Open. H.J. Heinz Company said it completed the sale of its Ore-Ida frozen-food business catering to the service industry to McCain Foods Ltd. for about $500 million. It's the first group action of its kind in Britain.'''
print(a_text)

Another ex-Golden Stater, Paul Stankowski from Oxnard, is contending for a berth on the U.S. Ryder Cup team after winning his first PGA Tour event last year and staying within three strokes of the lead through three rounds of last month's U.S. Open. H.J. Heinz Company said it completed the sale of its Ore-Ida frozen-food business catering to the service industry to McCain Foods Ltd. for about $500 million. It's the first group action of its kind in Britain.


Before a computer can perform most NLP tasks, it has to know where each sentence begins and ends.

Let's try splitting the text on every **dot**:

In [4]:
sentence_splitted = a_text.split('.')
for idx, sent in enumerate(sentence_splitted, 1):
    print(f'Sentence {idx}:\t{sent}')

Sentence 1:	Another ex-Golden Stater, Paul Stankowski from Oxnard, is contending for a berth on the U
Sentence 2:	S
Sentence 3:	 Ryder Cup team after winning his first PGA Tour event last year and staying within three strokes of the lead through three rounds of last month's U
Sentence 4:	S
Sentence 5:	 Open
Sentence 6:	 H
Sentence 7:	J
Sentence 8:	 Heinz Company said it completed the sale of its Ore-Ida frozen-food business catering to the service industry to McCain Foods Ltd
Sentence 9:	 for about $500 million
Sentence 10:	 It's the first group action of its kind in Britain
Sentence 11:	


This clearly did not work: many abbreviations, such as **U.S.**, contain dots. Luckily, NLTK can split a text into sentences more reliably. Let's see how it performs on our text:

In [5]:
from nltk.tokenize import sent_tokenize

In [6]:
nltk_sentence_splitted = sent_tokenize(a_text)
for idx, sent in enumerate(nltk_sentence_splitted, 1):
    print(f'Sentence {idx}:\t{sent}')

Sentence 1:	Another ex-Golden Stater, Paul Stankowski from Oxnard, is contending for a berth on the U.S. Ryder Cup team after winning his first PGA Tour event last year and staying within three strokes of the lead through three rounds of last month's U.S. Open.
Sentence 2:	H.J.
Sentence 3:	Heinz Company said it completed the sale of its Ore-Ida frozen-food business catering to the service industry to McCain Foods Ltd. for about $500 million.
Sentence 4:	It's the first group action of its kind in Britain.


Interestingly, the model is not perfect. It correctly determines that the dot in *U.S. Ryder Cup* does not end a sentence. However, it wrongly treats **H.J.** as the end of a sentence.
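
One possible workaround is to give a Punkt tokenizer an explicit list of abbreviations. The sketch below builds an untrained `PunktSentenceTokenizer` from a hand-picked abbreviation set. Both the set and the shortened example text are assumptions made for illustration; this is not the pretrained model that `sent_tokenize` uses.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Shortened version of the example text (an assumption for this sketch).
text = (
    "He stayed close to the lead at the U.S. Open. "
    "H.J. Heinz Company sold a business to McCain Foods Ltd. "
    "for about $500 million. "
    "It's the first group action of its kind in Britain."
)

# Abbreviations are stored lowercase and without the final dot.
# This hand-picked set only covers the example above.
params = PunktParameters()
params.abbrev_types = {"u.s", "h.j", "ltd"}

# An untrained tokenizer that relies only on the abbreviation list.
custom_tokenizer = PunktSentenceTokenizer(params)
custom_sentences = custom_tokenizer.tokenize(text)

for idx, sent in enumerate(custom_sentences, 1):
    print(f"Sentence {idx}:\t{sent}")
```

A fixed list like this only helps with abbreviations you know in advance, so it complements the pretrained model rather than replacing it.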

## Tokenization

In [7]:
example_sentence = nltk_sentence_splitted[-1]
print(example_sentence)

It's the first group action of its kind in Britain.


The most naive way to tokenize a text is to split it on spaces. Let's try this by running the following cell.

In [8]:
tokenized_using_spaces = example_sentence.split(' ')
print(tokenized_using_spaces)

["It's", 'the', 'first', 'group', 'action', 'of', 'its', 'kind', 'in', 'Britain.']


Splitting on spaces does work for most tokens. However, it fails for contractions such as **It's**, and it leaves the final period attached to **Britain.** in the example above.

Let's try a real tokenizer...

In [9]:
tokenized_using_tokenizer = nltk.word_tokenize(example_sentence)
print(tokenized_using_tokenizer)

['It', "'s", 'the', 'first', 'group', 'action', 'of', 'its', 'kind', 'in', 'Britain', '.']


Please note that **It's** is now split into the tokens `It` and `'s`, and that the final period is a token of its own.
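
The splitting of contractions and punctuation comes from Penn Treebank-style rules. As a rough illustration, the sketch below calls `TreebankWordTokenizer`, a closely related rule-based tokenizer in NLTK, directly on one sentence. It needs no downloaded model, but it may differ from `nltk.word_tokenize` on edge cases.

```python
from nltk.tokenize import TreebankWordTokenizer

sentence = "It's the first group action of its kind in Britain."

# Rule-based tokenizer: it expects a single sentence at a time.
treebank = TreebankWordTokenizer()
treebank_tokens = treebank.tokenize(sentence)
print(treebank_tokens)
```

Because this tokenizer only treats a period at the very end of its input as a separate token, longer texts should be split into sentences first.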

## Stemming and Lemmatizing

NLTK has various modules for stripping inflectional endings from words (stemming) or finding the lemma (the form you would look up in a dictionary). The script below tokenizes the example text and then stems and lemmatizes each token using different modules.

In [10]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

porter = PorterStemmer()
snowball = SnowballStemmer('english')
wordnet = WordNetLemmatizer()
tokens = nltk.word_tokenize(a_text)

porterlemmas = []
wordnetlemmas = []
snowballlemmas = []

for token in tokens:
    porterlemmas.append(porter.stem(token))
    snowballlemmas.append(snowball.stem(token))
    wordnetlemmas.append(wordnet.lemmatize(token))

print('Porter')
print(porterlemmas)
print('\nSnowball')
print(snowballlemmas)
print('\nWordnet')
print(wordnetlemmas)

Porter
['anoth', 'ex-golden', 'stater', ',', 'paul', 'stankowski', 'from', 'oxnard', ',', 'is', 'contend', 'for', 'a', 'berth', 'on', 'the', 'u.s.', 'ryder', 'cup', 'team', 'after', 'win', 'hi', 'first', 'pga', 'tour', 'event', 'last', 'year', 'and', 'stay', 'within', 'three', 'stroke', 'of', 'the', 'lead', 'through', 'three', 'round', 'of', 'last', 'month', "'s", 'u.s.', 'open', '.', 'h.j', '.', 'heinz', 'compani', 'said', 'it', 'complet', 'the', 'sale', 'of', 'it', 'ore-ida', 'frozen-food', 'busi', 'cater', 'to', 'the', 'servic', 'industri', 'to', 'mccain', 'food', 'ltd.', 'for', 'about', '$', '500', 'million', '.', 'it', "'s", 'the', 'first', 'group', 'action', 'of', 'it', 'kind', 'in', 'britain', '.']

Snowball
['anoth', 'ex-golden', 'stater', ',', 'paul', 'stankowski', 'from', 'oxnard', ',', 'is', 'contend', 'for', 'a', 'berth', 'on', 'the', 'u.s.', 'ryder', 'cup', 'team', 'after', 'win', 'his', 'first', 'pga', 'tour', 'event', 'last', 'year', 'and', 'stay', 'within', 'three', 'st

What differences do you see between the three lists?
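
To make the comparison easier, the sketch below runs the two stemmers on a few words taken from the example text. The word list is a hand-picked assumption; the lemmatizer is left out because it needs the WordNet data to be installed.

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# A few words from the example text (hand-picked for this sketch).
words = ["his", "winning", "company", "business"]

for word in words:
    print(f"{word}\tporter={porter.stem(word)}\tsnowball={snowball.stem(word)}")
```

Stems such as *compani* are not dictionary words; a stemmer only strips suffixes by rule. `WordNetLemmatizer.lemmatize`, by contrast, looks words up in WordNet and treats every token as a noun unless a `pos` argument is passed.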

## Part of Speech tagging

A useful next step is to determine the part of speech of each token.
The part of speech is the syntactic category of a token.

| the | red   | clown  | behaved  | weirdly  |
|---|---|---|---|---|
| determiner | adjective | noun | verb | adverb |

We can replace a token with another token that has the same part of speech, and the sentence will still be grammatical. For example:
* The **blue** clown behaved weirdly.
* The red **cow** behaved weirdly.
* The red clown **walked** weirdly.

NLTK also provides a function to automatically tag each token in a text with a part-of-speech tag:

In [11]:
nltk.pos_tag(['I', "'ll", 'refuse', 'to', 'permit', 'you', 'to', 'obtain', 'the', 'refuse', 'permit', '.'])

[('I', 'PRP'),
 ("'ll", 'MD'),
 ('refuse', 'VB'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('you', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN'),
 ('.', '.')]
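
In the output above, *refuse* and *permit* each receive two different tags depending on their position in the sentence. The sketch below makes this explicit by grouping the tags per token. The tagged list is copied from the output above rather than recomputed, so it runs without the tagger model.

```python
from collections import defaultdict

# Tagger output copied from the cell above.
tagged = [
    ('I', 'PRP'), ("'ll", 'MD'), ('refuse', 'VB'), ('to', 'TO'),
    ('permit', 'VB'), ('you', 'PRP'), ('to', 'TO'), ('obtain', 'VB'),
    ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN'), ('.', '.'),
]

# Collect every tag that was assigned to each token.
tags_by_token = defaultdict(set)
for token, tag in tagged:
    tags_by_token[token].add(tag)

# Keep only the tokens that received more than one tag.
ambiguous = {
    token: sorted(tags)
    for token, tags in tags_by_token.items()
    if len(tags) > 1
}
print(ambiguous)  # {'refuse': ['NN', 'VB'], 'permit': ['NN', 'VB']}
```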

# Bag of Words

In [12]:
example_sent = "I'll refuse to permit you to obtain the refuse permit."

As a first step, we obtain the vocabulary: all distinct tokens in the corpus, which in our case is just one sentence.

In [13]:
tokenized_sent = nltk.word_tokenize(example_sent)
vocabulary = []

for token in tokenized_sent:
    if token not in vocabulary:
        vocabulary.append(token)

vocabulary

['I', "'ll", 'refuse', 'to', 'permit', 'you', 'obtain', 'the', '.']

Note that the punctuation mark `.` is part of the vocabulary.
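
The loop above can also be written more compactly. The sketch below relies on the fact that Python dictionaries keep insertion order, so `dict.fromkeys` removes duplicates while preserving the order of first appearance. The token list is written out by hand so the block runs on its own, and `compact_vocabulary` is a name introduced only for this example.

```python
# Tokens of the example sentence, copied from the notebook.
tokenized_sent = ['I', "'ll", 'refuse', 'to', 'permit', 'you',
                  'to', 'obtain', 'the', 'refuse', 'permit', '.']

# dict keys are unique and ordered by first insertion.
compact_vocabulary = list(dict.fromkeys(tokenized_sent))
print(compact_vocabulary)
```

A plain `set` would also remove duplicates, but it does not guarantee a stable order, and the positions in the vocabulary matter for the vectors built next.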

Next, we vectorize the sentence by counting how often each token appears in it:

In [14]:
bow_vect = []

for token in vocabulary:
    token_count_in_sent = tokenized_sent.count(token)
    bow_vect.append(token_count_in_sent)

print(bow_vect)

[1, 1, 2, 2, 2, 1, 1, 1, 1]


In this representation, the first 1 indicates that the token *I* appears once in the sentence. The 2 in the third position indicates that the third token in our vocabulary, *refuse*, appears twice. Let's now vectorize a different sentence:

In [15]:
different_sent = "I'll permit the refuse."
different_tokenized_sent = nltk.word_tokenize(different_sent)

different_bow_vect = []

for token in vocabulary:
    token_count_in_sent = different_tokenized_sent.count(token)
    different_bow_vect.append(token_count_in_sent)

print(different_bow_vect)

[1, 1, 1, 0, 1, 0, 0, 1, 1]
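
Since both sentences are vectorized with the same loop, the logic can be folded into one helper. The sketch below is one possible way to do this with `collections.Counter`; the function name `bag_of_words` is an assumption, and the vocabulary and token lists are written out by hand so the block is self-contained.

```python
from collections import Counter

vocabulary = ['I', "'ll", 'refuse', 'to', 'permit', 'you', 'obtain', 'the', '.']


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary entry occurs in tokens."""
    counts = Counter(tokens)
    # A Counter returns 0 for tokens it has never seen.
    return [counts[token] for token in vocabulary]


first_tokens = ['I', "'ll", 'refuse', 'to', 'permit', 'you',
                'to', 'obtain', 'the', 'refuse', 'permit', '.']
second_tokens = ['I', "'ll", 'permit', 'the', 'refuse', '.']

first_vect = bag_of_words(first_tokens, vocabulary)
second_vect = bag_of_words(second_tokens, vocabulary)
print(first_vect)
print(second_vect)
```

Counting once per sentence avoids calling `list.count` for every vocabulary entry, which matters as the vocabulary grows.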


What would the representation for *I'll refuse the permit.* look like?

# N-grams

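A bag of words ignores word order: *I'll permit the refuse.* and *I'll refuse the permit.* contain the same tokens and therefore get the same vector. One way to overcome this shortcoming is to look at a text not as a bag of words, but rather as a collection of n-grams. NLTK has a module for this as well: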

In [16]:
from nltk.util import ngrams

For reference, here is the example sentence again:

In [17]:
print(tokenized_sent)

['I', "'ll", 'refuse', 'to', 'permit', 'you', 'to', 'obtain', 'the', 'refuse', 'permit', '.']


In [18]:
bigrams = list(ngrams(tokenized_sent, 2))
trigrams = list(ngrams(tokenized_sent, 3))
print("Bigrams:", bigrams)
print("\nTrigrams:", trigrams)

Bigrams: [('I', "'ll"), ("'ll", 'refuse'), ('refuse', 'to'), ('to', 'permit'), ('permit', 'you'), ('you', 'to'), ('to', 'obtain'), ('obtain', 'the'), ('the', 'refuse'), ('refuse', 'permit'), ('permit', '.')]

Trigrams: [('I', "'ll", 'refuse'), ("'ll", 'refuse', 'to'), ('refuse', 'to', 'permit'), ('to', 'permit', 'you'), ('permit', 'you', 'to'), ('you', 'to', 'obtain'), ('to', 'obtain', 'the'), ('obtain', 'the', 'refuse'), ('the', 'refuse', 'permit'), ('refuse', 'permit', '.')]
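
To see what n-grams add over a bag of words, the sketch below compares two sentences that use the same tokens in a different order. The token lists are written out by hand, which assumes that `nltk.word_tokenize` would split both sentences this way.

```python
from collections import Counter

from nltk.util import ngrams

sent_a = ['I', "'ll", 'permit', 'the', 'refuse', '.']
sent_b = ['I', "'ll", 'refuse', 'the', 'permit', '.']

# Identical token counts: a bag of words cannot tell the sentences apart.
same_bag = Counter(sent_a) == Counter(sent_b)

# Bigrams keep local word order, so the two sets differ.
bigrams_a = set(ngrams(sent_a, 2))
bigrams_b = set(ngrams(sent_b, 2))
shared_bigrams = bigrams_a & bigrams_b

print("Same bag of words:", same_bag)
print("Shared bigrams:", shared_bigrams)
print("Only in first sentence:", sorted(bigrams_a - bigrams_b))
```

The only bigram the two sentences share is `('I', "'ll")`, so a representation built on bigram counts would separate them.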
