# NLP with NLTK
In this tutorial, we will discuss how to use the Natural Language Toolkit (NLTK) for various NLP tasks. We will learn about different NLP tasks such as
tokenization, stemming, lemmatization, stop word removal, POS tagging, chunking, named entity recognition, and some basics of the WordNet interface.

### Importing the NLTK library and downloading relevant data

In [1]:
import nltk
nltk.download('punkt') # Will be used in PunktTokenizer
nltk.download('wordnet') # Will be used in WordNet basics
nltk.download('averaged_perceptron_tagger') # Will be used for POS Tagging
nltk.download('stopwords') # Will be used in stop words removal
nltk.download('maxent_ne_chunker') # Will be used for NE-Chunking
nltk.download('words') # Word list used by the NE chunker

[nltk_data] Downloading package punkt to /home/eshban/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/eshban/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/eshban/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /home/eshban/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/eshban/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /home/eshban/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

# Sentence Tokenization
Sentence tokenization is the process of splitting a text into sentences. To perform sentence-level tokenization, NLTK provides a function called <code>sent_tokenize</code>. This function uses an instance of <code>PunktSentenceTokenizer</code>.
<br>
We import <code>sent_tokenize</code> as shown in the code snippet below. The function takes a string as a parameter and returns a list of sentences. The tokenizer is already trained for English and a few other European languages.

In [2]:
from nltk import sent_tokenize

In [3]:
text = "Success is not final. Failure is not fatal. It is the courage to continue that counts."

In [4]:
sentence_tokens = sent_tokenize(text)
print(sentence_tokens)

['Success is not final.', 'Failure is not fatal.', 'It is the courage to continue that counts.']


In [5]:
for sentence in sentence_tokens:
    print(sentence)

Success is not final.
Failure is not fatal.
It is the courage to continue that counts.
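
To see why a trained tokenizer is useful, here is a minimal sketch of a naive splitter built only on the standard library. The function name `naive_sentence_split` and the sample text are assumptions made for illustration; this is not how `sent_tokenize` works internally.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (hypothetical helper)."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

# The quote used above splits cleanly, because every period ends a sentence.
quote = "Success is not final. Failure is not fatal. It is the courage to continue that counts."
print(naive_sentence_split(quote))

# An abbreviation breaks the naive rule: "Dr." is treated as a full sentence.
sample = "Dr. Smith arrived. He was not late."
print(naive_sentence_split(sample))
```

The second call cuts the text after `Dr.`, which is the kind of mistake a tokenizer that knows about abbreviations is meant to avoid.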


# Word Tokenization
Word tokenization is the process of splitting sentences or text into words and punctuation. NLTK provides several ways to perform word-level tokenization. <br>
It provides a function called <code>word_tokenize</code>, which splits text on punctuation and non-alphabetic characters. This function builds on the rules of the <code>TreebankWordTokenizer</code>, so for the sentence used below the results of both are identical.<br>
NLTK also provides other tokenizers, such as <code>WordPunctTokenizer</code> and <code>WhitespaceTokenizer</code>. <code>WordPunctTokenizer</code> also splits the text on punctuation, but unlike the <code>TreebankWordTokenizer</code>, it turns every run of punctuation into a separate token, so <code>Let's</code> becomes <code>Let</code>, <code>'</code>, and <code>s</code>. <code>WhitespaceTokenizer</code>, as the name suggests, splits the text on white space only. There are a few other tokenizers available as well.

In [6]:
from nltk.tokenize import word_tokenize

In [7]:
sentence = "Let's see how the tokenizer splits this!"

In [8]:
word_tokens = word_tokenize(sentence)
print(word_tokens)

['Let', "'s", 'see', 'how', 'the', 'tokenizer', 'splits', 'this', '!']


In [9]:
from nltk.tokenize import TreebankWordTokenizer, WordPunctTokenizer, WhitespaceTokenizer

tree_tokenizer = TreebankWordTokenizer()
word_punct_tokenizer = WordPunctTokenizer()
white_space_tokenizer = WhitespaceTokenizer()

In [10]:
word_tokens = tree_tokenizer.tokenize(sentence)
print(word_tokens)

['Let', "'s", 'see', 'how', 'the', 'tokenizer', 'splits', 'this', '!']


In [11]:
word_tokens = white_space_tokenizer.tokenize(sentence)
print(word_tokens)

["Let's", 'see', 'how', 'the', 'tokenizer', 'splits', 'this!']


In [12]:
word_tokens = word_punct_tokenizer.tokenize(sentence)
print(word_tokens)

['Let', "'", 's', 'see', 'how', 'the', 'tokenizer', 'splits', 'this', '!']
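
The difference between the last two tokenizers can be approximated with the standard library alone. This is a rough sketch for intuition; the regular expression is our assumption about the behaviour seen above, and it reproduces the two outputs for this sentence.

```python
import re

sentence = "Let's see how the tokenizer splits this!"

# Whitespace splitting: punctuation stays attached to the neighbouring word.
whitespace_tokens = sentence.split()

# Runs of word characters OR runs of non-space punctuation become separate tokens.
word_punct_tokens = re.findall(r"\w+|[^\w\s]+", sentence)

print(whitespace_tokens)
print(word_punct_tokens)
```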


# Stemming
Stemming is a heuristic process in which a word’s endings are chopped off in the hope of reaching its base form. Stemming acts on words without knowing the context. Therefore, it’s faster than lemmatization (covered in the next section) but doesn’t always yield the desired result.<br>
Stemming isn’t as easy as we might presume. If it were, there would be only one implementation. Stemming is an imprecise science, which leads to issues such as <b>understemming</b> and <b>overstemming</b>.<br>
<i>Understemming</i> is the failure to reduce words with the same meaning to the same root. For example, <code>jumped</code> and <code>jumps</code> may be reduced to <code>jump</code>, while <code>jumpiness</code> may be reduced to <code>jumpi</code>.<br>
<i>Overstemming</i> is the failure to keep two words with distinct meanings separate. For instance, <code>general</code> and <code>generate</code> may both be stemmed to <code>gener</code>.

In [13]:
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

In [14]:
porter_stemmer = PorterStemmer()
print(porter_stemmer.stem('lying'))
print(porter_stemmer.stem('lies'))
print(porter_stemmer.stem('lied'))

lie
lie
lie


In [15]:
lancaster_stemmer = LancasterStemmer()
print(lancaster_stemmer.stem('lying'))
print(lancaster_stemmer.stem('lies'))
print(lancaster_stemmer.stem('lied'))

lying
lie
lied


In [16]:
snowball_stemmer = SnowballStemmer('english')
print(snowball_stemmer.stem('lying'))
print(snowball_stemmer.stem('lies'))
print(snowball_stemmer.stem('lied'))

lie
lie
lie
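
The understemming and overstemming cases described earlier can be checked by grouping words by their stem. The sketch below uses `PorterStemmer`; the word list and the `groups` variable are chosen only for illustration, and other stemmers may group the words differently.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["jumped", "jumps", "jumpiness", "general", "generate"]

# Collect the words that end up sharing the same stem.
groups = defaultdict(list)
for word in words:
    groups[stemmer.stem(word)].append(word)

for stem, members in groups.items():
    print(stem, "<-", members)
```

Words with different meanings that land in the same group are overstemmed; related words that land in different groups are understemmed.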


# Lemmatization
Lemmatization is a process that uses vocabulary and morphological analysis of words to remove inflectional endings and return a word's base form (dictionary form), which is known as the <i>lemma</i>.<br>
It’s a more complicated and expensive process that requires an understanding of the context in which words appear in order to make decisions about what they mean. Because it uses a lexical vocabulary to derive the root form, it is more time-consuming than stemming but more likely to yield accurate results.<br>
Lemmatization can be done with NLTK using <code>WordNetLemmatizer</code>, which uses a lexical database called <b>WordNet</b>.<br>
When using the <code>WordNetLemmatizer</code>, we should specify which <b>part of speech</b> to use in order to derive the correct lemma: noun (<code>n</code>), adjective (<code>a</code>), verb (<code>v</code>), or adverb (<code>r</code>). If no part of speech is given, the word is treated as a noun, which is why <code>running</code> is returned unchanged in the first example below.

In [17]:
from nltk.stem import WordNetLemmatizer

In [18]:
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('running'))

running


In [19]:
def lemmatize(word):
    lemmatizer = WordNetLemmatizer()
    print("Verb form: " + lemmatizer.lemmatize(word, pos="v"))
    print("Noun form: " + lemmatizer.lemmatize(word, pos="n"))
    print("Adverb form: " + lemmatizer.lemmatize(word, pos="r"))
    print("Adjective form: " + lemmatizer.lemmatize(word, pos="a"))

In [20]:
lemmatize("ears")

Verb form: ears
Noun form: ear
Adverb form: ears
Adjective form: ears


In [21]:
lemmatize("running")

Verb form: run
Noun form: running
Adverb form: running
Adjective form: running
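
Because the lemmatizer expects one of the letters `n`, `v`, `a`, or `r`, while POS taggers usually emit Penn Treebank tags, a small mapping function is often handy. The helper below, `penn_to_wordnet`, is a hypothetical name introduced here; it is a minimal sketch and not part of NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # everything else falls back to noun, the lemmatizer's default

for tag in ["NNS", "VBZ", "JJS", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

The returned letter can then be passed as the `pos` argument, as in `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.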


### Stemming VS Lemmatization
The choice between stemming and lemmatization depends mostly on the situation at hand. If speed is required, it’s better to resort to stemming. But if accuracy is required, it’s best to use lemmatization.<br>
A simple example is given below:

In [22]:
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

In [23]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

In [24]:
print(stemmer.stem("deactivating"))
print(stemmer.stem("deactivated"))
print(stemmer.stem("deactivates"))

deactiv
deactiv
deactiv


In [25]:
print(lemmatizer.lemmatize("deactivating", pos="v"))
print(lemmatizer.lemmatize("deactivated", pos="v"))
print(lemmatizer.lemmatize("deactivates", pos="v"))

deactivate
deactivate
deactivate


In [26]:
print(stemmer.stem('stones')) 
print(stemmer.stem('speaking')) 
print(stemmer.stem('bedroom')) 
print(stemmer.stem('jokes')) 
print(stemmer.stem('lisa')) 
print(stemmer.stem('purple'))

stone
speak
bedroom
joke
lisa
purpl


In [27]:
print(lemmatizer.lemmatize('stones')) 
print(lemmatizer.lemmatize('speaking'))
print(lemmatizer.lemmatize('bedroom'))
print(lemmatizer.lemmatize('jokes'))
print(lemmatizer.lemmatize('lisa'))
print(lemmatizer.lemmatize('purple'))

stone
speaking
bedroom
joke
lisa
purple


# Part-of-Speech (POS) Tagging
Part-of-speech tagging (or POS tagging) is also a very important component of NLP. The purpose of POS tagging is to label each token (a word in this case) with its grammatical category, such as noun, verb, adjective, or adverb. Most parts of speech are divided into sub-classes.
<br>
POS tagging can be framed as a supervised machine learning problem: a tagger trained on labelled text takes features like the previous word, the next word, and capitalization into consideration when assigning a POS tag to a word.
<br>
The most popular tag set for POS tagging is the <b>Penn Treebank tagset</b>. Most of the trained POS taggers for English are trained on this tag set. In the output below, for example, <code>DT</code> marks a determiner, <code>JJS</code> a superlative adjective, <code>NNS</code> a plural noun, and <code>VBZ</code> a third-person singular present-tense verb.

In [28]:
from nltk import word_tokenize, pos_tag

In [29]:
sentence = "The hardest choices requires the strongest wills"

In [30]:
sentence_tokens = word_tokenize(sentence)
print(sentence_tokens)

['The', 'hardest', 'choices', 'requires', 'the', 'strongest', 'wills']


In [31]:
pos_tag(sentence_tokens)

[('The', 'DT'),
 ('hardest', 'JJS'),
 ('choices', 'NNS'),
 ('requires', 'VBZ'),
 ('the', 'DT'),
 ('strongest', 'JJS'),
 ('wills', 'NNS')]
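
Since the tagger returns plain `(word, tag)` tuples, the result is easy to post-process with ordinary Python. The sketch below copies the tagged output shown above into a list (so it runs without the tagger) and then counts tags and picks out the nouns; the variable names are ours.

```python
from collections import Counter

# Copied from the pos_tag output above.
tagged = [('The', 'DT'), ('hardest', 'JJS'), ('choices', 'NNS'), ('requires', 'VBZ'),
          ('the', 'DT'), ('strongest', 'JJS'), ('wills', 'NNS')]

# How often does each tag occur?
tag_counts = Counter(tag for _, tag in tagged)

# All noun tags in the Penn Treebank tagset start with "NN".
nouns = [word for word, tag in tagged if tag.startswith('NN')]

print(tag_counts)
print(nouns)
```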

# Chunking
Chunking, or shallow parsing, is a process that extracts phrases from a text sample. Here we extract chunks of a sentence that carry meaning rather than identifying the sentence’s full structure. This differs from tokenization, and is more advanced, because it extracts phrases instead of tokens.<br>
As an example, the phrase “North America” can be extracted as a single unit using chunking rather than as two separate words, “North” and “America”, as tokenization does.
<br>
Chunking requires POS-tagged input and provides chunks of phrases as output. As with POS tags, there is a standard set of chunk tags, such as Noun Phrase (NP) and Verb Phrase (VP).

In [32]:
from nltk import pos_tag, word_tokenize, RegexpParser

In [33]:
sentence = "the big vicious dog barked at the small feeble cat"

In [34]:
grammar = '''NP: {<DT>?<JJ>*<NN>} # optional determiner, any adjectives, then a noun'''

In [35]:
chunkParser = RegexpParser(grammar)
tagged = pos_tag(word_tokenize(sentence))
tagged

[('the', 'DT'),
 ('big', 'JJ'),
 ('vicious', 'JJ'),
 ('dog', 'NN'),
 ('barked', 'VBD'),
 ('at', 'IN'),
 ('the', 'DT'),
 ('small', 'JJ'),
 ('feeble', 'JJ'),
 ('cat', 'NN')]

In [36]:
tree = chunkParser.parse(tagged)

In [37]:
for subtree in tree.subtrees():
    print(subtree)

(S
  (NP the/DT big/JJ vicious/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT small/JJ feeble/JJ cat/NN))
(NP the/DT big/JJ vicious/JJ dog/NN)
(NP the/DT small/JJ feeble/JJ cat/NN)


In [38]:
tree.draw()
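
In practice we often want the chunk text rather than the tree itself. The following sketch reuses the tagged tokens and the NP rule from this section (copied in so that no tagger is needed) and collects each noun phrase as a string; the `noun_phrases` name is ours.

```python
from nltk import RegexpParser

# Copied from the pos_tag output above.
tagged = [('the', 'DT'), ('big', 'JJ'), ('vicious', 'JJ'), ('dog', 'NN'),
          ('barked', 'VBD'), ('at', 'IN'),
          ('the', 'DT'), ('small', 'JJ'), ('feeble', 'JJ'), ('cat', 'NN')]

parser = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = parser.parse(tagged)

# Keep only the NP subtrees and join the words of each one.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)
```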

# Stop Word Removal
<b>Stop words</b> are words that carry very little meaning on their own and are mostly used as part of the grammatical structure of a sentence. Words like “the”, “a”, “an”, and “in” are considered stop words.
<br>
Even though it doesn’t seem like much, stop word removal plays an important role in tasks such as sentiment analysis. Search engines also use this process when indexing entries and handling search queries.
<br>
NLTK comes with the <code>stopwords</code> corpus, which contains stop word lists for 16 different languages. NLTK gives no direct function to remove stop words, but we can use the list to remove them from sentences programmatically.
<br>
If we are dealing with many sentences, the text must first be split into sentences using <code>sent_tokenize</code>. Then, using <code>word_tokenize</code>, we can further break the sentences into words and remove the stop words using the list.

In [39]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [40]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [41]:
sentence = "Success is not final. Failure is not fatal. It is the courage to continue that counts."

In [42]:
word_tokens = word_tokenize(sentence)
print(word_tokens)

['Success', 'is', 'not', 'final', '.', 'Failure', 'is', 'not', 'fatal', '.', 'It', 'is', 'the', 'courage', 'to', 'continue', 'that', 'counts', '.']


In [43]:
# Build the lookup set once instead of reloading the list for every token.
stop_words = set(stopwords.words('english'))
clean_tokens = [token for token in word_tokens if token not in stop_words]

In [44]:
print(clean_tokens)

['Success', 'final', '.', 'Failure', 'fatal', '.', 'It', 'courage', 'continue', 'counts', '.']
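
Notice that `It` and the periods survive in the result above: the stop word list is lowercase, and punctuation is not part of it. A possible refinement is sketched below. To keep it self-contained, it copies the tokens from above and uses a small hand-picked subset of the English list (an assumption made for brevity); with NLTK available, `set(stopwords.words('english'))` could be used instead.

```python
# Copied from the word_tokenize output above.
word_tokens = ['Success', 'is', 'not', 'final', '.', 'Failure', 'is', 'not', 'fatal', '.',
               'It', 'is', 'the', 'courage', 'to', 'continue', 'that', 'counts', '.']

# Illustrative subset of the English stop word list.
stop_words = {'is', 'not', 'it', 'the', 'to', 'that'}

# Compare in lowercase and keep alphabetic tokens only.
clean_tokens = [t for t in word_tokens if t.isalpha() and t.lower() not in stop_words]
print(clean_tokens)
```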


# Named Entity Recognition
<b>Named entity recognition (NER)</b> is the process of identifying entities such as names, locations, dates, or organizations in an unstructured text sample.
<br>
The purpose of NER is to be able to map the extracted entities against a knowledge base, or to extract relationships between different entities. For example: who did what, where did something take place, or at what time did something occur?
<br>
For domain-specific entities, in a field like medicine or law, we’ll need to train our own NER model.
<br>
For casual use, NLTK provides us with a function called <code>ne_chunk</code> to perform NER on a given text. In order to use <code>ne_chunk</code>, the text first needs to be tokenized into words and then POS tagged. After NER, the tagged words are grouped under their respective entity type. In the first example below, <i>Mark</i> and <i>John</i> are of type <i>PERSON</i>, <i>Google</i> and <i>Yahoo</i> are of type <i>ORGANIZATION</i>, and <i>New York City</i> is of type <i>GPE</i> (geo-political entity, which indicates a location such as a city or country).

In [45]:
from nltk import word_tokenize, pos_tag, ne_chunk

In [46]:
sentence = "Mark who works at Yahoo and John who works at Google decided to meet at New York City"

In [47]:
print(ne_chunk(pos_tag(word_tokenize(sentence))))

(S
  (PERSON Mark/NNP)
  who/WP
  works/VBZ
  at/IN
  (ORGANIZATION Yahoo/NNP)
  and/CC
  (PERSON John/NNP)
  who/WP
  works/VBZ
  at/IN
  (ORGANIZATION Google/NNP)
  decided/VBD
  to/TO
  meet/VB
  at/IN
  (GPE New/NNP York/NNP City/NNP))


In [48]:
sentence = "The Avengers began as a group of extraordinary individuals who were assembled to defeat \
Loki and his chitauri army in New York City. "

In [49]:
print(ne_chunk(pos_tag(word_tokenize(sentence))))

(S
  The/DT
  (ORGANIZATION Avengers/NNP)
  began/VBD
  as/IN
  a/DT
  group/NN
  of/IN
  extraordinary/JJ
  individuals/NNS
  who/WP
  were/VBD
  assembled/VBN
  to/TO
  defeat/VB
  (PERSON Loki/NNP)
  and/CC
  his/PRP$
  chitauri/NN
  army/NN
  in/IN
  (GPE New/NNP York/NNP City/NNP)
  ./.)
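
The result of `ne_chunk` is a tree, so the entities can be pulled out by walking its top-level children. The sketch below builds a shortened copy of the first result by hand (so it runs without the chunker models); `extract_entities` is a hypothetical helper name, not an NLTK function.

```python
from nltk.tree import Tree

# Hand-built, shortened copy of the first ne_chunk result shown above.
chunked = Tree('S', [
    Tree('PERSON', [('Mark', 'NNP')]),
    ('who', 'WP'), ('works', 'VBZ'), ('at', 'IN'),
    Tree('ORGANIZATION', [('Yahoo', 'NNP')]),
    ('decided', 'VBD'), ('to', 'TO'), ('meet', 'VB'), ('at', 'IN'),
    Tree('GPE', [('New', 'NNP'), ('York', 'NNP'), ('City', 'NNP')]),
])

def extract_entities(tree):
    """Return (entity_type, text) pairs for each entity subtree (hypothetical helper)."""
    entities = []
    for node in tree:
        # Entities are subtrees; ordinary tokens are plain (word, tag) tuples.
        if isinstance(node, Tree):
            text = " ".join(word for word, tag in node.leaves())
            entities.append((node.label(), text))
    return entities

print(extract_entities(chunked))
```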


# WordNet Interface
WordNet is a large English lexical database. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.
<br>
<b>Synset or “synonym set” is a collection of synonymous words.</b>
<br>
NLTK provides an interface for the WordNet database, and it comes with the corpus module. WordNet is composed of approximately 155,200 words and 117,600 synonym sets that are logically related to each other.
<br>
As an example, in WordNet, a word like <b><i>computer</i></b> has two possible senses (one being a machine for performing computation, and the other being a calculator, which is associated with computer in a lexical sense). The first sense is identified by <code>computer.n.01</code>, which is known as the "lemma code name"; the letter <code>n</code> indicates that the word is a noun.

In [50]:
from nltk.corpus import wordnet

In [51]:
wordnet.synsets('computer')

[Synset('computer.n.01'), Synset('calculator.n.01')]
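
The names in this output follow a `lemma.pos.number` pattern, which can be taken apart with plain string handling. The function below, `parse_synset_name`, is a hypothetical helper written for illustration and is not part of the WordNet interface.

```python
def parse_synset_name(name):
    """Split a name such as 'computer.n.01' into lemma, POS letter and sense number."""
    # Split from the right so that the last two fields are always POS and number.
    lemma, pos, number = name.rsplit(".", 2)
    return lemma, pos, int(number)

print(parse_synset_name("computer.n.01"))
print(parse_synset_name("calculator.n.01"))
```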

In [52]:
syn = wordnet.synset('computer.n.01')
syn.lemma_names()

['computer',
 'computing_machine',
 'computing_device',
 'data_processor',
 'electronic_computer',
 'information_processing_system']

In [53]:
syn.definition()

'a machine for performing calculations automatically'

In [54]:
wordnet.synset('car.n.01').examples()

['he needs a car to get to work']

In [55]:
synonyms = []
for syn in wordnet.synsets('large'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
print(synonyms)

['large', 'large', 'big', 'large', 'bombastic', 'declamatory', 'large', 'orotund', 'tumid', 'turgid', 'big', 'large', 'magnanimous', 'big', 'large', 'prominent', 'large', 'big', 'enceinte', 'expectant', 'gravid', 'great', 'large', 'heavy', 'with_child', 'large', 'large', 'boastfully', 'vauntingly', 'big', 'large']
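
The list contains many repeats because every synset contributes its own lemmas. If only the distinct synonyms are needed, the duplicates can be dropped while keeping the original order. The sketch below works on the first part of the output above, copied in by hand.

```python
# First part of the synonyms list printed above.
synonyms = ['large', 'large', 'big', 'large', 'bombastic', 'declamatory', 'large',
            'orotund', 'tumid', 'turgid', 'big', 'large', 'magnanimous']

# dict.fromkeys keeps the first occurrence of each key and preserves order.
unique_synonyms = list(dict.fromkeys(synonyms))
print(unique_synonyms)
```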


In [56]:
antonyms = []
for syn in wordnet.synsets('large'):
    for l in syn.lemmas():
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())
print(antonyms)

['small', 'little']
