# Natural Language Processing using nltk

Natural Language Processing, often abbreviated as NLP, is a hot topic in artificial intelligence, and notably in machine learning, because of its numerous uses in daily life.

These applications include chatbots, language translation, text classification, paragraph summarization, spam filtering, and many more. A few open-source NLP libraries do the job of processing text, such as NLTK, the Stanford NLP suite, and Apache OpenNLP. I personally found NLTK the easiest to understand. NLTK is a widely used Python library with prebuilt functions and utilities for ease of use and implementation.

To begin with, we first install the nltk library.


In [1]:
!pip install nltk



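NLTK comes with several data packages (corpora, tokenizer models, and so on) that its functions rely on. To use them, we need to download them by executing nltk.download(). Called without arguments it opens an interactive downloader; passing a package identifier such as 'stopwords' fetches just that package.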

In [2]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

Once the download completes, we are set to go.

## DATA PREPROCESSING - TEXT HANDLING

As in any analytical processing, the first step is to clean or prepare our data. A few of the standard practices, among others, are:

Tokenization
<br>
Punctuation removal
<br>
Stop words removal
<br>
Stemming
<br>
Lemmatization etc.

#### Tokenization
Tokenization is the process of breaking text up into smaller chunks as per our requirements, which may be at the sentence or word level. We need sent_tokenize and word_tokenize from nltk to do that, so we import them. Here we use a sample text to understand the basics of the nltk.tokenize package and its utilities.


In [3]:
from nltk.tokenize import sent_tokenize, word_tokenize

str1 = "I live in a flat with my family. We have two bedrooms and a living room. We have a garden and we have some flowers there. In weekdays I arrive home at five o'clock and I have lunch. Then I do my homework and go to bed. I had a computer but now it doesn't work. I have a brother and a sister and I think I am very lucky to live with them. Sometimes, our relatives visit us. Our flat becomes very crowded sometimes but I like it. What do you think?"
print(sent_tokenize(str1))

['I live in a flat with my family.', 'We have two bedrooms and a living room.', 'We have a garden and we have some flowers there.', "In weekdays I arrive home at five o'clock and I have lunch.", 'Then I do my homework and go to bed.', "I had a computer but now it doesn't work.", 'I have a brother and a sister and I think I am very lucky to live with them.', 'Sometimes, our relatives visit us.', 'Our flat becomes very crowded sometimes but I like it.', 'What do you think?']


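As we see from the output, sent_tokenize splits the paragraph at sentence endings, here at either ? or . (full stop). The word_tokenize function, in contrast, splits the text into word tokens: it separates on whitespace, turns punctuation such as full stops and commas into tokens of their own, and splits contractions, so doesn't becomes does and n't.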

In [4]:
print(word_tokenize(str1))

['I', 'live', 'in', 'a', 'flat', 'with', 'my', 'family', '.', 'We', 'have', 'two', 'bedrooms', 'and', 'a', 'living', 'room', '.', 'We', 'have', 'a', 'garden', 'and', 'we', 'have', 'some', 'flowers', 'there', '.', 'In', 'weekdays', 'I', 'arrive', 'home', 'at', 'five', "o'clock", 'and', 'I', 'have', 'lunch', '.', 'Then', 'I', 'do', 'my', 'homework', 'and', 'go', 'to', 'bed', '.', 'I', 'had', 'a', 'computer', 'but', 'now', 'it', 'does', "n't", 'work', '.', 'I', 'have', 'a', 'brother', 'and', 'a', 'sister', 'and', 'I', 'think', 'I', 'am', 'very', 'lucky', 'to', 'live', 'with', 'them', '.', 'Sometimes', ',', 'our', 'relatives', 'visit', 'us', '.', 'Our', 'flat', 'becomes', 'very', 'crowded', 'sometimes', 'but', 'I', 'like', 'it', '.', 'What', 'do', 'you', 'think', '?']
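To make the sentence-level split more concrete, here is a rough approximation that uses only Python's re module. It is a sketch, not how sent_tokenize works internally: sent_tokenize relies on a trained Punkt model, while the hypothetical helper naive_sent_split below simply splits after ., ? or ! when whitespace follows.

```python
import re

def naive_sent_split(text):
    """Split after '.', '?' or '!' when followed by whitespace (a rough heuristic)."""
    return re.split(r"(?<=[.?!])\s+", text.strip())

sample = "I live in a flat with my family. What do you think? I like it."
sentences = naive_sent_split(sample)
print(sentences)

# The heuristic breaks on abbreviations, which is one reason to prefer sent_tokenize.
print(naive_sent_split("Dr. Smith lives next door."))
```

The second call splits after "Dr.", which shows why a plain regular expression is not a reliable replacement for a trained sentence tokenizer.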


wordpunct_tokenize goes further and splits on every punctuation character, including the apostrophe ('), so o'clock becomes o, ' and clock.

In [5]:
from nltk.tokenize import wordpunct_tokenize
print(wordpunct_tokenize(str1))

['I', 'live', 'in', 'a', 'flat', 'with', 'my', 'family', '.', 'We', 'have', 'two', 'bedrooms', 'and', 'a', 'living', 'room', '.', 'We', 'have', 'a', 'garden', 'and', 'we', 'have', 'some', 'flowers', 'there', '.', 'In', 'weekdays', 'I', 'arrive', 'home', 'at', 'five', 'o', "'", 'clock', 'and', 'I', 'have', 'lunch', '.', 'Then', 'I', 'do', 'my', 'homework', 'and', 'go', 'to', 'bed', '.', 'I', 'had', 'a', 'computer', 'but', 'now', 'it', 'doesn', "'", 't', 'work', '.', 'I', 'have', 'a', 'brother', 'and', 'a', 'sister', 'and', 'I', 'think', 'I', 'am', 'very', 'lucky', 'to', 'live', 'with', 'them', '.', 'Sometimes', ',', 'our', 'relatives', 'visit', 'us', '.', 'Our', 'flat', 'becomes', 'very', 'crowded', 'sometimes', 'but', 'I', 'like', 'it', '.', 'What', 'do', 'you', 'think', '?']
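The behaviour above can be imitated with a single regular expression. The sketch below assumes the pattern `\w+|[^\w\s]+` (runs of word characters, or runs of punctuation), which to my knowledge is the pattern behind wordpunct_tokenize; the name wordpunct_sketch is made up for this example.

```python
import re

# Assumed to mirror wordpunct_tokenize: runs of word characters, or runs of punctuation.
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

def wordpunct_sketch(text):
    """Return every word run and every punctuation run as a separate token."""
    return WORDPUNCT.findall(text)

tokens = wordpunct_sketch("I arrive home at five o'clock, but it doesn't work.")
print(tokens)
```

Because the apostrophe is not a word character, it always ends up as a token of its own, which explains the doesn, ' and t pieces in the output above.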


A RegexpTokenizer splits a string into substrings using a regular expression. For example, the following tokenizer uses the pattern \w+, so it keeps only runs of word characters (letters, digits, and underscores) and drops punctuation entirely. Note that this also breaks $3.88 into 3 and 88:

In [6]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
result = tokenizer.tokenize("Wow! I am excited that good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks.")
print(result)


#Compared to wordpunct_tokenize function
print(wordpunct_tokenize("Wow! I am excited that good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."))

['Wow', 'I', 'am', 'excited', 'that', 'good', 'muffins', 'cost', '3', '88', 'in', 'New', 'York', 'Please', 'buy', 'me', 'two', 'of', 'them', 'Thanks']
['Wow', '!', 'I', 'am', 'excited', 'that', 'good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
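The choice of pattern decides which tokens survive. As a sketch using plain re.findall (which behaves like a RegexpTokenizer in its default matching mode for patterns without capturing groups, as far as I can tell), the example below compares `\w+` with a longer pattern that also keeps money expressions and any other non-whitespace run. The shortened sample sentence is only for illustration.

```python
import re

text = "Good muffins cost $3.88 in New York."

# Pattern used in the cell above: only runs of word characters survive.
words_only = re.findall(r"\w+", text)

# Alternative pattern: word runs, then money expressions, then any other non-whitespace run.
with_money = re.findall(r"\w+|\$[\d.]+|\S+", text)

print(words_only)
print(with_money)
```

With the second pattern, $3.88 stays together as one token and the final full stop is kept rather than discarded.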


#### Stop words
Stop words are words that are very common in a language, such as “the”, “of”, and “to”, but that generally carry little information for analysis. NLTK's stopwords corpus provides a predefined list for each supported language, which we use below to filter the stop words out of a piece of text:

In [7]:
from nltk.corpus import stopwords

In [8]:
stop_words = set(stopwords.words( 'english' ))
print('Stop words')
print(stop_words)

word_tokens = word_tokenize(str1)

filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print('\nOriginal Text')
print(word_tokens)
print('\nFiltered Text')
print(filtered_sentence)


Stop words
{"it's", 'theirs', 'if', 'with', 'how', 'own', 'o', 'having', 'this', 'our', 'who', 'haven', 'doing', 'on', 'when', 'their', 'from', 'off', "wouldn't", 'no', "she's", 'it', 'doesn', 'be', 's', 'had', 'yourself', 'between', 'than', 'in', 'above', 'yours', 'my', 'but', 'are', "hadn't", "won't", 'i', 'shan', 'of', 'll', 'that', 'he', 'they', 'does', 'against', "weren't", 'over', 'she', "doesn't", 'shouldn', 'very', 'yourselves', 'you', 'we', 'didn', "aren't", 'will', "you'll", "you're", "that'll", 'd', 'won', "wasn't", 'at', 'while', 'so', "mightn't", 'ourselves', 'being', 'and', 'nor', "hasn't", 'its', 'was', 'other', 'such', 'until', 'hasn', 'your', 'mustn', 'then', 'through', 'during', 'all', 'why', 're', 'ours', 'his', 'the', 'because', 'couldn', 'more', 'those', 'down', 'under', "you'd", 'by', 'once', "haven't", "shan't", 'weren', 'were', "didn't", 'him', 'before', 'been', 'whom', 'ain', 'is', 'now', "you've", 'mightn', 'any', 'each', "mustn't", 'for', 'both', 'most', 'can
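One detail worth noting is that the stop word list is lowercase, while the membership test in the loop is case-sensitive, so capitalised tokens such as I or We are not filtered. The sketch below illustrates this with a small, hand-picked set (an assumption for the example, not NLTK's actual list) and shows the same filter written as a list comprehension.

```python
# Hypothetical, hand-picked subset of stop words; NLTK's English list is much longer.
STOP_WORDS = {"i", "in", "a", "with", "my", "we", "have", "and"}

tokens = ["I", "live", "in", "a", "flat", "with", "my", "family", "."]

# Case-sensitive check, as in the loop above: "I" is kept because only "i" is in the set.
case_sensitive = [t for t in tokens if t not in STOP_WORDS]

# Lowercasing before the lookup also removes capitalised stop words.
case_insensitive = [t for t in tokens if t.lower() not in STOP_WORDS]

print(case_sensitive)
print(case_insensitive)
```

Whether to lowercase first depends on the task; for simple frequency counts it is usually what you want.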

#### Stemming
Our data may contain words that share the same root meaning but appear in different forms or tenses, e.g. live, lived, and living, whose base word is live. Stemming reduces such words to a common stem, which helps to find similarities between words with the same root.

In [9]:
#STEMMING
from nltk.stem import PorterStemmer
ps = PorterStemmer()
example_words = ["python","pythoner","pythoning","pythoned","pythonly"]
for w in example_words:
    print(ps.stem(w))

stem_word = []
for w in word_tokens:
    stem_word.append(ps.stem(w))
    
print(stem_word)
    
    

python
python
python
python
pythonli
['i', 'live', 'in', 'a', 'flat', 'with', 'my', 'famili', '.', 'we', 'have', 'two', 'bedroom', 'and', 'a', 'live', 'room', '.', 'we', 'have', 'a', 'garden', 'and', 'we', 'have', 'some', 'flower', 'there', '.', 'in', 'weekday', 'i', 'arriv', 'home', 'at', 'five', "o'clock", 'and', 'i', 'have', 'lunch', '.', 'then', 'i', 'do', 'my', 'homework', 'and', 'go', 'to', 'bed', '.', 'i', 'had', 'a', 'comput', 'but', 'now', 'it', 'doe', "n't", 'work', '.', 'i', 'have', 'a', 'brother', 'and', 'a', 'sister', 'and', 'i', 'think', 'i', 'am', 'veri', 'lucki', 'to', 'live', 'with', 'them', '.', 'sometim', ',', 'our', 'rel', 'visit', 'us', '.', 'our', 'flat', 'becom', 'veri', 'crowd', 'sometim', 'but', 'i', 'like', 'it', '.', 'what', 'do', 'you', 'think', '?']


Stemming works on each word in isolation, without understanding its role in the sentence. For example, in our str1 data the second sentence contains living room, and stemming converted living to live, which is not correct in the context of the sentence. Stems also need not be real words, as famili and arriv show. So stemming is not very reliable.
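To see why a stemmer cannot use context, it helps to look at a deliberately simplified suffix stripper. This is only a toy sketch with a made-up suffix list and the hypothetical name naive_stem; it is not the Porter algorithm, which applies a much richer set of rules.

```python
# Hypothetical suffix list for illustration; the Porter algorithm uses a far richer rule set.
SUFFIXES = ("ing", "ed", "ly", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, provided at least min_stem characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["living", "lived", "bedrooms", "bed"]:
    print(w, "->", naive_stem(w))
```

The function only ever sees a single string, so living is reduced in the same way whether it came from "living room" or from "I am living here".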

#### Lemmatization
Next we look at lemmatization, the process of reducing a word's several inflected forms to a single dictionary form (the lemma) so they can be analysed as one unit. It is similar to stemming, but lemmatization takes the word's meaning and part of speech into account and returns a real word. As a result, it ties words with related meanings together, and it is generally preferred over stemming for this reason. WordNetLemmatizer is the class in nltk.stem that is used for lemmatization.

In [10]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

lem_word = []
for w in word_tokens:
    lem_word.append(lemmatizer.lemmatize(w))
    
print(lem_word)


['I', 'live', 'in', 'a', 'flat', 'with', 'my', 'family', '.', 'We', 'have', 'two', 'bedroom', 'and', 'a', 'living', 'room', '.', 'We', 'have', 'a', 'garden', 'and', 'we', 'have', 'some', 'flower', 'there', '.', 'In', 'weekday', 'I', 'arrive', 'home', 'at', 'five', "o'clock", 'and', 'I', 'have', 'lunch', '.', 'Then', 'I', 'do', 'my', 'homework', 'and', 'go', 'to', 'bed', '.', 'I', 'had', 'a', 'computer', 'but', 'now', 'it', 'doe', "n't", 'work', '.', 'I', 'have', 'a', 'brother', 'and', 'a', 'sister', 'and', 'I', 'think', 'I', 'am', 'very', 'lucky', 'to', 'live', 'with', 'them', '.', 'Sometimes', ',', 'our', 'relative', 'visit', 'u', '.', 'Our', 'flat', 'becomes', 'very', 'crowded', 'sometimes', 'but', 'I', 'like', 'it', '.', 'What', 'do', 'you', 'think', '?']
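The lemma of a word depends on its part of speech. As far as I can tell, lemmatize() accepts a pos argument that defaults to noun, which is likely why does and us come out as doe and u in the output above. The toy lookup below sketches the idea with a tiny, made-up table keyed by (word, part of speech); it does not use WordNet data, and toy_lemmatize is a hypothetical helper.

```python
# Hypothetical miniature dictionary keyed by (word, part of speech); not WordNet data.
LEMMA_TABLE = {
    ("living", "v"): "live",
    ("living", "n"): "living",
    ("does", "v"): "do",
    ("bedrooms", "n"): "bedroom",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary form for (word, pos), or the word unchanged if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("living", pos="n"))
print(toy_lemmatize("living", pos="v"))
print(toy_lemmatize("does", pos="v"))
```

In practice this means that tagging each token with its part of speech before lemmatizing tends to give better lemmas than relying on the default.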



#### Frequency Distribution
Once we have found the root words, we can find the frequency of each word in our str1 data using FreqDist() from nltk.


In [11]:
frequency = nltk.FreqDist(lem_word) 
for key,val in frequency.items(): 
    print (str(key) + ':' + str(val))

I:9
live:2
in:1
a:6
flat:2
with:2
my:2
family:1
.:9
We:2
have:5
two:1
bedroom:1
and:6
living:1
room:1
garden:1
we:1
some:1
flower:1
there:1
In:1
weekday:1
arrive:1
home:1
at:1
five:1
o'clock:1
lunch:1
Then:1
do:2
homework:1
go:1
to:2
bed:1
had:1
computer:1
but:2
now:1
it:2
doe:1
n't:1
work:1
brother:1
sister:1
think:2
am:1
very:2
lucky:1
them:1
Sometimes:1
,:1
our:1
relative:1
visit:1
u:1
Our:1
becomes:1
crowded:1
sometimes:1
like:1
What:1
you:1
?:1
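In the counts above, We and we are tallied separately and punctuation is counted too. FreqDist builds on Python's collections.Counter, so the idea can be sketched with the standard library alone. The example below, on a short made-up token list, lowercases the tokens and keeps alphabetic ones before counting.

```python
from collections import Counter

tokens = ["I", "live", "in", "a", "flat", ".", "We", "have", "a",
          "garden", "and", "we", "have", "flowers", "."]

# Fold case and keep alphabetic tokens only, so "We"/"we" merge and "." is ignored.
words = [t.lower() for t in tokens if t.isalpha()]
counts = Counter(words)

# most_common(n) returns the n most frequent items as (word, count) pairs.
print(counts.most_common(3))
```

The same most_common() call should also be available on a FreqDist, which is handy when only the top few words matter.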


#### WordNet

WordNet is a lexical database of English that NLTK exposes through a corpus reader. It can be used to look up word definitions, synonyms, and antonyms, and is best described as an English dictionary with a semantic focus. Because WordNet is packaged as a corpus, it is imported from nltk.corpus.

A Synset (“synonym set”) is a collection of synonymous words. Each Synset is assigned a name, and the words found in a Synset are called lemmas. The function wordnet.synsets('word') returns a list containing all of the Synsets associated with the word passed in as an argument.

In [12]:
from nltk.corpus import wordnet as wn
wn.synsets('See')

[Synset('see.n.01'),
 Synset('see.v.01'),
 Synset('understand.v.02'),
 Synset('witness.v.02'),
 Synset('visualize.v.01'),
 Synset('see.v.05'),
 Synset('learn.v.02'),
 Synset('watch.v.03'),
 Synset('meet.v.01'),
 Synset('determine.v.08'),
 Synset('see.v.10'),
 Synset('see.v.11'),
 Synset('see.v.12'),
 Synset('visit.v.01'),
 Synset('attend.v.02'),
 Synset('see.v.15'),
 Synset('go_steady.v.01'),
 Synset('see.v.17'),
 Synset('see.v.18'),
 Synset('see.v.19'),
 Synset('examine.v.02'),
 Synset('experience.v.01'),
 Synset('see.v.22'),
 Synset('see.v.23'),
 Synset('interpret.v.01')]
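Each synset name in this output follows the pattern lemma.pos.number, for example go_steady.v.01, where n stands for noun and v for verb. The sketch below parses a few of the names copied from the output with plain string handling; parse_synset_name is a hypothetical helper, not part of NLTK.

```python
from collections import Counter

# A few synset names copied from the output above.
names = ["see.n.01", "see.v.01", "understand.v.02", "go_steady.v.01", "interpret.v.01"]

def parse_synset_name(name):
    """Split 'lemma.pos.nn' into (lemma, pos, sense number)."""
    lemma, pos, sense = name.rsplit(".", 2)
    return lemma, pos, int(sense)

parsed = [parse_synset_name(n) for n in names]
print(parsed)

# Tally parts of speech: 'n' is noun, 'v' is verb.
pos_counts = Counter(pos for _, pos, _ in parsed)
print(pos_counts)
```

Running the same tally over the full list is a quick way to check how many noun and verb senses a word has.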

The output shows that the word see has 25 synsets, that is, 25 distinct senses: 1 of them is a noun and the other 24 are verbs. Next we pass the pos argument, which lets you constrain the part of speech; in this case we retrieve only the verb synsets for see.

In [13]:
from nltk.corpus import wordnet as wn
syns = wn.synsets('See', pos = wn.VERB)

print(syns)

[Synset('see.v.01'), Synset('understand.v.02'), Synset('witness.v.02'), Synset('visualize.v.01'), Synset('see.v.05'), Synset('learn.v.02'), Synset('watch.v.03'), Synset('meet.v.01'), Synset('determine.v.08'), Synset('see.v.10'), Synset('see.v.11'), Synset('see.v.12'), Synset('visit.v.01'), Synset('attend.v.02'), Synset('see.v.15'), Synset('go_steady.v.01'), Synset('see.v.17'), Synset('see.v.18'), Synset('see.v.19'), Synset('examine.v.02'), Synset('experience.v.01'), Synset('see.v.22'), Synset('see.v.23'), Synset('interpret.v.01')]


lemma_names() returns the names of all lemmas in a synset, that is, the words that share that sense. Here we look at the sixth verb synset, learn.v.02.

In [14]:
print(syns[5].lemma_names())

['learn', 'hear', 'get_word', 'get_wind', 'pick_up', 'find_out', 'get_a_line', 'discover', 'see']


definition(), as the name suggests, returns the definition of a synset; here we check the definition of the first synset.

In [15]:
print(syns[0].definition())

perceive by sight or have the power to perceive by sight


examples() gives examples of the word in use.

In [16]:
print(syns[0].examples())

['You have to be a good observer to see all the details', 'Can you see the bird in that tree?', 'He is blind--he cannot see']


In [17]:
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import *
from nltk.corpus import subjectivity

In [18]:
n_instances = 100
subj_docs = [(sent, 'subj') for sent in subjectivity.sents(categories='subj')[:n_instances]]
obj_docs = [(sent, 'obj') for sent in subjectivity.sents(categories='obj')[:n_instances]]
len(subj_docs), len(obj_docs)

(100, 100)

In [19]:
subj_docs[0]

(['smart',
  'and',
  'alert',
  ',',
  'thirteen',
  'conversations',
  'about',
  'one',
  'thing',
  'is',
  'a',
  'small',
  'gem',
  '.'],
 'subj')