# Natural Language Processing using nltk

Natural Language Processing, often abbreviated as NLP, is an active area of artificial intelligence and, in particular, of machine learning. Much of the interest comes from its many everyday uses.

These applications include chatbots, language translation, text classification, paragraph summarization, spam filtering and many more. Several open-source NLP libraries handle text processing, such as NLTK, the Stanford NLP suite and Apache OpenNLP. I personally found NLTK the easiest to understand. NLTK is a widely used Python library with prebuilt functions and utilities that make it easy to use and implement.

To begin with, we first install the nltk library.


In [22]:
!pip install nltk



NLTK keeps its corpora, tokenizer models and lexicons in separate data packages. To use them, we need to download them by executing nltk.download().

In [23]:
import pandas as pd

import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

Once the download completes, we are set to go.
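
Calling nltk.download() with no arguments opens the interactive downloader. As a lighter alternative, the sketch below fetches only named resources. The resource list and the helper name `download_resources` are assumptions made for illustration, and newer NLTK releases may ask for `punkt_tab` rather than `punkt`.

```python
import nltk

# Resources this walkthrough relies on (an assumed list; adjust for your NLTK version).
RESOURCES = ["punkt", "stopwords", "wordnet", "vader_lexicon"]

def download_resources(names=RESOURCES, downloader=nltk.download):
    """Fetch each named resource quietly; return the names that could not be fetched."""
    failed = []
    for name in names:
        # nltk.download reports success as True and failure as False.
        if not downloader(name, quiet=True):
            failed.append(name)
    return failed
```

Calling `download_resources()` once at the top of the notebook should return an empty list when every resource was fetched.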

## Data Preprocessing

As in any analysis, the first step is to clean or prepare our data. Some of the standard practices include, but are not limited to:

Tokenization
<br>
Punctuation removal
<br>
Stop words removal
<br>
Stemming
<br>
Lemmatization etc.

#### Tokenization
Tokenization is the process of breaking text up into smaller chunks, at the sentence or word level depending on our requirements. We need sent_tokenize and word_tokenize from nltk.tokenize, so we import them. Here, we use a short sample text to understand the basics of the nltk.tokenize package and its utilities.


In [24]:
from nltk.tokenize import sent_tokenize, word_tokenize

str1 = "I live in a flat with my family. We have two bedrooms and a living room. We have a garden and we have some flowers there. In weekdays I arrive home at five o'clock and I have lunch. Then I do my homework and go to bed. I had a computer but now it doesn't work. I have a brother and a sister and I think I am very lucky to live with them. Sometimes, our relatives visit us. Our flat becomes very crowded sometimes but I like it. What do you think?"
print(sent_tokenize(str1))

['I live in a flat with my family.', 'We have two bedrooms and a living room.', 'We have a garden and we have some flowers there.', "In weekdays I arrive home at five o'clock and I have lunch.", 'Then I do my homework and go to bed.', "I had a computer but now it doesn't work.", 'I have a brother and a sister and I think I am very lucky to live with them.', 'Sometimes, our relatives visit us.', 'Our flat becomes very crowded sometimes but I like it.', 'What do you think?']


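As we see from the output, sent_tokenize splits the paragraph into sentences at sentence-ending punctuation such as ? or . (full stop). The word_tokenize function, in contrast, splits the text into word tokens: it separates on whitespace, turns full stops and commas into tokens of their own, and splits contractions such as doesn't into does and n't.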

In [25]:
print(word_tokenize(str1))

['I', 'live', 'in', 'a', 'flat', 'with', 'my', 'family', '.', 'We', 'have', 'two', 'bedrooms', 'and', 'a', 'living', 'room', '.', 'We', 'have', 'a', 'garden', 'and', 'we', 'have', 'some', 'flowers', 'there', '.', 'In', 'weekdays', 'I', 'arrive', 'home', 'at', 'five', "o'clock", 'and', 'I', 'have', 'lunch', '.', 'Then', 'I', 'do', 'my', 'homework', 'and', 'go', 'to', 'bed', '.', 'I', 'had', 'a', 'computer', 'but', 'now', 'it', 'does', "n't", 'work', '.', 'I', 'have', 'a', 'brother', 'and', 'a', 'sister', 'and', 'I', 'think', 'I', 'am', 'very', 'lucky', 'to', 'live', 'with', 'them', '.', 'Sometimes', ',', 'our', 'relatives', 'visit', 'us', '.', 'Our', 'flat', 'becomes', 'very', 'crowded', 'sometimes', 'but', 'I', 'like', 'it', '.', 'What', 'do', 'you', 'think', '?']


wordpunct_tokenize goes further and splits on every punctuation character in the sentence, including the apostrophe ('), so o'clock becomes o, ' and clock.

In [26]:
from nltk.tokenize import wordpunct_tokenize
print(wordpunct_tokenize(str1))

['I', 'live', 'in', 'a', 'flat', 'with', 'my', 'family', '.', 'We', 'have', 'two', 'bedrooms', 'and', 'a', 'living', 'room', '.', 'We', 'have', 'a', 'garden', 'and', 'we', 'have', 'some', 'flowers', 'there', '.', 'In', 'weekdays', 'I', 'arrive', 'home', 'at', 'five', 'o', "'", 'clock', 'and', 'I', 'have', 'lunch', '.', 'Then', 'I', 'do', 'my', 'homework', 'and', 'go', 'to', 'bed', '.', 'I', 'had', 'a', 'computer', 'but', 'now', 'it', 'doesn', "'", 't', 'work', '.', 'I', 'have', 'a', 'brother', 'and', 'a', 'sister', 'and', 'I', 'think', 'I', 'am', 'very', 'lucky', 'to', 'live', 'with', 'them', '.', 'Sometimes', ',', 'our', 'relatives', 'visit', 'us', '.', 'Our', 'flat', 'becomes', 'very', 'crowded', 'sometimes', 'but', 'I', 'like', 'it', '.', 'What', 'do', 'you', 'think', '?']
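
To make the difference concrete, the splitting rule behind wordpunct_tokenize can be approximated with a plain regular expression: runs of word characters, or runs of characters that are neither word characters nor whitespace. This is a sketch using only the standard library `re` module rather than NLTK's own code path, and the helper name `simple_wordpunct` is made up for the example.

```python
import re

# Pattern assumed to mirror wordpunct_tokenize: word-character runs OR punctuation runs.
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

def simple_wordpunct(text):
    """Split text into alphanumeric runs and punctuation runs."""
    return WORDPUNCT.findall(text)

print(simple_wordpunct("It doesn't work at five o'clock."))
# ['It', 'doesn', "'", 't', 'work', 'at', 'five', 'o', "'", 'clock', '.']
```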


A RegexpTokenizer splits a string into substrings using a regular expression. For example, the following tokenizer uses the pattern \w+, so it keeps only runs of word characters (letters, digits and underscores) and drops punctuation and symbols such as ! and $. As a result, $3.88 comes out as the two tokens 3 and 88:

In [27]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
result = tokenizer.tokenize("Wow! I am excited that good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks.")
print(result)


#Compared to wordpunct_tokenize function
print(wordpunct_tokenize("Wow! I am excited that good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."))

['Wow', 'I', 'am', 'excited', 'that', 'good', 'muffins', 'cost', '3', '88', 'in', 'New', 'York', 'Please', 'buy', 'me', 'two', 'of', 'them', 'Thanks']
['Wow', '!', 'I', 'am', 'excited', 'that', 'good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
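
The pattern passed to RegexpTokenizer decides what counts as a token. As an illustrative sketch, a broader pattern (word runs, dollar amounts, or any other non-whitespace run) keeps $3.88 together and retains punctuation as separate tokens. The sample sentence and the name `money_tokenizer` are invented for this example.

```python
from nltk.tokenize import RegexpTokenizer

# Alternatives are tried left to right: words, then money like $3.88, then any non-space run.
money_tokenizer = RegexpTokenizer(r"\w+|\$[\d\.]+|\S+")

tokens = money_tokenizer.tokenize("Muffins cost $3.88 in New York.")
print(tokens)
# ['Muffins', 'cost', '$3.88', 'in', 'New', 'York', '.']
```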


#### Stopword
Stop words are words that are very common in a language, such as “the”, “of” and “to”, but that generally carry little useful information for analysis. NLTK provides predefined stop word lists through stopwords.words(); here we load the English list and filter those words out of our tokens:

In [28]:
from nltk.corpus import stopwords

In [29]:
stop_words = set(stopwords.words( 'english' ))
print('Stop words')
print(stop_words)

word_tokens = word_tokenize(str1)

filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print('\nOriginal Text')
print(word_tokens)
print('\nFiltered Text')
print(filtered_sentence)


Stop words
{'re', 'these', 'same', "doesn't", "didn't", "weren't", 'can', 'aren', 'do', 'we', "she's", 'hasn', 'shan', 'and', 'hadn', 'our', 'm', 'my', "hadn't", 'further', 'whom', 'are', 'am', 'needn', 'from', 'down', 'himself', 'as', 'ours', 'i', 'so', 'had', 'than', 'did', 'few', 'don', 'own', "should've", 'it', 'd', 'very', 'what', 'you', 'her', 'y', 'once', 'of', 'haven', 'me', 'during', 'ma', 'this', 'their', 'him', 'nor', "shan't", 'over', "you'll", 'doesn', 'was', 'both', 'will', 'why', 't', 'in', 'at', 've', 'only', "isn't", 'they', 'such', 'some', 'itself', 'for', "wouldn't", 'a', 'because', 'more', 'o', 'all', 'them', 'on', 'but', "wasn't", 'or', 'again', 'your', 'no', 'until', 'doing', "it's", 'other', 'yourselves', 'were', 'shouldn', 'not', 'his', 'too', 'off', 'below', 'while', 'being', 'mightn', 'having', "couldn't", 'couldn', 'yourself', 'does', "you'd", 'won', 'which', 'have', 'into', 'through', "you've", "haven't", 'against', 'should', 'to', 'is', 'hers', 'didn', 'now
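
One thing to watch in the loop above: the entries in the stop word list are lowercase, so capitalized tokens such as 'I' or 'We' pass the `not in` test and stay in the filtered text. A small sketch of a case-insensitive filter follows. The tiny `demo_stop_words` set is a stand-in invented for the example so that it runs without the corpus download.

```python
# Stand-in stop word set (hypothetical); in practice use set(stopwords.words('english')).
demo_stop_words = {"i", "we", "a", "in", "with", "my", "have", "and"}

def remove_stop_words(tokens, stop_words):
    """Drop tokens whose lowercase form is a stop word, keeping the original casing."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["I", "live", "in", "a", "flat", "with", "my", "family", "."]
print(remove_stop_words(tokens, demo_stop_words))
# ['live', 'flat', 'family', '.']
```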

#### Stemming
There may be words in our data that share a root meaning but appear in different forms or tenses, e.g. live, lived and living, whose base word is live. Stemming reduces such words to a common stem, which helps to find similarities between words with the same root.

In [30]:
#STEMMING
from nltk.stem import PorterStemmer
ps = PorterStemmer()
example_words = ["python","pythoner","pythoning","pythoned","pythonly"]
for w in example_words:
    print(ps.stem(w))

stem_word = []
for w in word_tokens:
    stem_word.append(ps.stem(w))
    
print(stem_word)
    
    

python
python
python
python
pythonli
['i', 'live', 'in', 'a', 'flat', 'with', 'my', 'famili', '.', 'we', 'have', 'two', 'bedroom', 'and', 'a', 'live', 'room', '.', 'we', 'have', 'a', 'garden', 'and', 'we', 'have', 'some', 'flower', 'there', '.', 'in', 'weekday', 'i', 'arriv', 'home', 'at', 'five', "o'clock", 'and', 'i', 'have', 'lunch', '.', 'then', 'i', 'do', 'my', 'homework', 'and', 'go', 'to', 'bed', '.', 'i', 'had', 'a', 'comput', 'but', 'now', 'it', 'doe', "n't", 'work', '.', 'i', 'have', 'a', 'brother', 'and', 'a', 'sister', 'and', 'i', 'think', 'i', 'am', 'veri', 'lucki', 'to', 'live', 'with', 'them', '.', 'sometim', ',', 'our', 'rel', 'visit', 'us', '.', 'our', 'flat', 'becom', 'veri', 'crowd', 'sometim', 'but', 'i', 'like', 'it', '.', 'what', 'do', 'you', 'think', '?']


Stemming works on each word in isolation, without understanding its role in the sentence. For example, in the second sentence of our str1 data, the living in living room is stemmed to live, which is not correct in the context of the sentence. Stems are also not always real words (famili, comput). So the accuracy of stemming is limited.
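
The idea that stemming ties related forms together can be sketched by bucketing words under their Porter stem. The word list and the helper name `group_by_stem` are chosen here for illustration.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

def group_by_stem(words):
    """Map each Porter stem to the list of input words that reduce to it."""
    stemmer = PorterStemmer()
    groups = defaultdict(list)
    for word in words:
        groups[stemmer.stem(word)].append(word)
    return dict(groups)

groups = group_by_stem(["live", "lived", "living", "computer", "family"])
print(groups)
```

Every form lands in its bucket regardless of how it was used in the sentence, which is exactly why living in living room ends up next to the verb live.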

#### Lemmatization
Next we look at lemmatization, the process of mapping the different forms of a word to a single dictionary form (the lemma) for analysis. It is similar to stemming, but lemmatization relies on a vocabulary and, optionally, on the word's part of speech, so it returns real words and ties words with related meanings together. For this reason lemmatization is often preferred over stemming. WordNetLemmatizer is the class in nltk.stem that is used for lemmatization.

In [31]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

lem_word = []
for w in word_tokens:
    lem_word.append(lemmatizer.lemmatize(w))
    
print(lem_word)


['I', 'live', 'in', 'a', 'flat', 'with', 'my', 'family', '.', 'We', 'have', 'two', 'bedroom', 'and', 'a', 'living', 'room', '.', 'We', 'have', 'a', 'garden', 'and', 'we', 'have', 'some', 'flower', 'there', '.', 'In', 'weekday', 'I', 'arrive', 'home', 'at', 'five', "o'clock", 'and', 'I', 'have', 'lunch', '.', 'Then', 'I', 'do', 'my', 'homework', 'and', 'go', 'to', 'bed', '.', 'I', 'had', 'a', 'computer', 'but', 'now', 'it', 'doe', "n't", 'work', '.', 'I', 'have', 'a', 'brother', 'and', 'a', 'sister', 'and', 'I', 'think', 'I', 'am', 'very', 'lucky', 'to', 'live', 'with', 'them', '.', 'Sometimes', ',', 'our', 'relative', 'visit', 'u', '.', 'Our', 'flat', 'becomes', 'very', 'crowded', 'sometimes', 'but', 'I', 'like', 'it', '.', 'What', 'do', 'you', 'think', '?']
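
The output above contains 'doe' for 'does' and 'u' for 'us'. A likely reason is that lemmatize() treats every token as a noun unless a part of speech is passed through its pos argument. The helper below is a minimal sketch of passing that argument. The function name and the optional `lemmatizer` parameter are conveniences invented here, and the WordNet data must already be downloaded for the default lemmatizer to work.

```python
from nltk.stem import WordNetLemmatizer

def lemmatize_tokens(tokens, pos="n", lemmatizer=None):
    """Lemmatize every token with one part-of-speech tag ('n', 'v', 'a', 'r' or 's')."""
    lemmatizer = lemmatizer or WordNetLemmatizer()
    return [lemmatizer.lemmatize(token, pos=pos) for token in tokens]
```

For instance, `lemmatize_tokens(['does', 'living'], pos='v')` treats both words as verbs. Choosing the tag per token would need a part-of-speech tagger, which this walkthrough does not cover.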



#### Frequency Distribution
Once we have found the root words, we can find the frequency of each word in our str1 data using FreqDist() from nltk.


In [32]:
frequency = nltk.FreqDist(lem_word) 
for key,val in frequency.items(): 
    print (str(key) + ':' + str(val))

I:9
live:2
in:1
a:6
flat:2
with:2
my:2
family:1
.:9
We:2
have:5
two:1
bedroom:1
and:6
living:1
room:1
garden:1
we:1
some:1
flower:1
there:1
In:1
weekday:1
arrive:1
home:1
at:1
five:1
o'clock:1
lunch:1
Then:1
do:2
homework:1
go:1
to:2
bed:1
had:1
computer:1
but:2
now:1
it:2
doe:1
n't:1
work:1
brother:1
sister:1
think:2
am:1
very:2
lucky:1
them:1
Sometimes:1
,:1
our:1
relative:1
visit:1
u:1
Our:1
becomes:1
crowded:1
sometimes:1
like:1
What:1
you:1
?:1
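
FreqDist behaves like a counter, so the most frequent tokens can be read off directly. In this sketch the token list is a short hand-written sample, and lowercasing plus dropping punctuation is an optional choice made here so that 'We' and 'we' are counted together.

```python
import string
import nltk

sample_tokens = ["We", "have", "a", "garden", "and", "we", "have", "some", "flowers", "."]

# Normalize case and skip single-character punctuation tokens before counting.
punctuation = set(string.punctuation)
cleaned = [tok.lower() for tok in sample_tokens if tok not in punctuation]
frequency = nltk.FreqDist(cleaned)

# most_common(n) returns (token, count) pairs sorted by descending count.
print(frequency.most_common(2))
```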


#### WordNet

WordNet is a lexical database of English that NLTK exposes through a corpus reader. It can be used to look up word definitions, synonyms and antonyms. It’s best described as an English dictionary with a semantic focus. Because WordNet is a corpus, it is imported from the nltk.corpus package.

Synset, short for “synonym set”, is a collection of synonymous words. Each Synset is assigned a name. Lemmas are the words found in a Synset. The function wordnet.synsets('word') returns a list containing all of the Synsets associated with the word passed in as an argument.

In [33]:
from nltk.corpus import wordnet as wn
wn.synsets('See')

[Synset('see.n.01'),
 Synset('see.v.01'),
 Synset('understand.v.02'),
 Synset('witness.v.02'),
 Synset('visualize.v.01'),
 Synset('see.v.05'),
 Synset('learn.v.02'),
 Synset('watch.v.03'),
 Synset('meet.v.01'),
 Synset('determine.v.08'),
 Synset('see.v.10'),
 Synset('see.v.11'),
 Synset('see.v.12'),
 Synset('visit.v.01'),
 Synset('attend.v.02'),
 Synset('see.v.15'),
 Synset('go_steady.v.01'),
 Synset('see.v.17'),
 Synset('see.v.18'),
 Synset('see.v.19'),
 Synset('examine.v.02'),
 Synset('experience.v.01'),
 Synset('see.v.22'),
 Synset('see.v.23'),
 Synset('interpret.v.01')]

The output shows that the word see has 25 synsets, that is, 25 distinct senses: one is a noun and the rest are verbs. Next we pass the pos argument, which lets you constrain the part of speech of the word; in this case we request only the verb synsets for see.

In [34]:
from nltk.corpus import wordnet as wn
syns = wn.synsets('See', pos = wn.VERB)

print(syns)

[Synset('see.v.01'), Synset('understand.v.02'), Synset('witness.v.02'), Synset('visualize.v.01'), Synset('see.v.05'), Synset('learn.v.02'), Synset('watch.v.03'), Synset('meet.v.01'), Synset('determine.v.08'), Synset('see.v.10'), Synset('see.v.11'), Synset('see.v.12'), Synset('visit.v.01'), Synset('attend.v.02'), Synset('see.v.15'), Synset('go_steady.v.01'), Synset('see.v.17'), Synset('see.v.18'), Synset('see.v.19'), Synset('examine.v.02'), Synset('experience.v.01'), Synset('see.v.22'), Synset('see.v.23'), Synset('interpret.v.01')]


lemma_names() returns the names of all lemmas (the words that share this sense) in a synset. Here we check the sixth verb synset, learn.v.02.

In [35]:
print(syns[5].lemma_names())

['learn', 'hear', 'get_word', 'get_wind', 'pick_up', 'find_out', 'get_a_line', 'discover', 'see']


definition(), as the name suggests, returns the definition of a synset; here we check the definition of the first synset.

In [36]:
print(syns[0].definition())

perceive by sight or have the power to perceive by sight


examples() gives examples of the word in use.

In [37]:
print(syns[0].examples())

['You have to be a good observer to see all the details', 'Can you see the bird in that tree?', 'He is blind--he cannot see']
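
The text mentions synonyms and antonyms; both are reachable through the lemmas of each synset. A minimal sketch follows, assuming the WordNet corpus is available locally. The helper name `synonyms_and_antonyms` and its optional `reader` parameter are made up for this example.

```python
from nltk.corpus import wordnet as wn

def synonyms_and_antonyms(word, reader=wn):
    """Collect lemma names (synonyms) and their antonym names across all synsets of a word."""
    synonyms, antonyms = set(), set()
    for synset in reader.synsets(word):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name())
            # Antonyms are attached to individual lemmas, not to the synset as a whole.
            for antonym in lemma.antonyms():
                antonyms.add(antonym.name())
    return synonyms, antonyms
```

A call such as `synonyms_and_antonyms('good')` returns two sets of strings; either set may be empty for words that WordNet does not cover.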


### Sentiment Analysis with nltk

Now that we have seen how to clean the data, it's time to put it into practice and use it for analysis. One of the important analyses that can be done with nltk is sentiment analysis. Sentiment analysis is a technique used to determine the emotional tone or sentiment expressed in a text. It involves analyzing the words and phrases used in the text to identify the underlying sentiment, whether it is positive, negative, or neutral.

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.

It is used for sentiment analysis of text that can carry either polarity, i.e. positive or negative. VADER quantifies how much positive or negative emotion the text has and also the intensity of that emotion. We will use the SentimentIntensityAnalyzer object, which provides sentiment scores based on the words used.

Similar to all the previous modules we start by importing the libraries and modules.


In [38]:
# import libraries

from nltk.sentiment.vader import SentimentIntensityAnalyzer

We start by reading our input file, which has customer reviews and a flag for positive feedback: 1 if the sentiment was positive, else 0. read_csv reads the file; however, even after specifying the separator character, the column did not get split. To split it ourselves, we use the pandas str.split method with the character ",". Commas are very common in text columns, so we take the piece after the last comma in each row as the flag.

In [39]:
# Load the amazon review dataset

df = pd.read_csv("./input/Review.csv", sep = ",")
print(df)


                                     reviewText,Positive
0      This is a one of the best apps acording to a b...
1      This is a pretty good version of the game for ...
2      this is a really cool game. there are a bunch ...
3      "This is a silly game and can be frustrating, ...
4      This is a terrific game on any pad. Hrs of fun...
...                                                  ...
19995  this app is fricken stupid.it froze on the kin...
19996  Please add me!!!!! I need neighbors! Ginger101...
19997  love it!  this game. is awesome. wish it had m...
19998  I love love love this app on my side of fashio...
19999  "This game is a rip off. Here is a list of thi...

[20000 rows x 1 columns]


In [40]:
df['Positive'] = df.iloc[:,0].str.split(',').str[-1]
print(df)


                                     reviewText,Positive Positive
0      This is a one of the best apps acording to a b...        1
1      This is a pretty good version of the game for ...        1
2      this is a really cool game. there are a bunch ...        1
3      "This is a silly game and can be frustrating, ...        1
4      This is a terrific game on any pad. Hrs of fun...        1
...                                                  ...      ...
19995  this app is fricken stupid.it froze on the kin...        0
19996  Please add me!!!!! I need neighbors! Ginger101...        1
19997  love it!  this game. is awesome. wish it had m...        1
19998  I love love love this app on my side of fashio...        1
19999  "This game is a rip off. Here is a list of thi...        0

[20000 rows x 2 columns]
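
Since only the final comma separates the review from its flag, splitting once from the right yields both parts in a single step. The sketch below uses pandas' str.rsplit on a two-row stand-in frame; the sample rows and the names `raw` and `parts` are invented for illustration.

```python
import pandas as pd

# Stand-in for the single unsplit column produced by read_csv above.
raw = pd.DataFrame({"reviewText,Positive": ["great game, lots of fun,1", "froze on start,0"]})

# Split at most once, starting from the right, so commas inside the review are preserved.
parts = raw.iloc[:, 0].str.rsplit(",", n=1, expand=True)
parts.columns = ["reviewText", "Positive"]
parts["Positive"] = parts["Positive"].astype(int)

print(parts)
```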


We have split out the column, but its datatype is the generic Python object type, which we need to change to int so that we can later compare it with the predicted values.

In [41]:
print(df.iloc[:,1])
print(df['Positive'].dtypes)
df['Positive'] = df['Positive'].astype(str).astype(int)
print(df['Positive'].dtypes)

0        1
1        1
2        1
3        1
4        1
        ..
19995    0
19996    1
19997    1
19998    1
19999    0
Name: Positive, Length: 20000, dtype: object
object
int32


Next we create a function to preprocess/clean our data using tokenization, stop word removal and lemmatization. Before tokenizing, we convert the text to lowercase with lower(), since most string operations, including the stop word lookup, are case sensitive. We then apply the function to our dataframe to get clean text, which should give us better results later on.

In [42]:
# create preprocess_text function

# Build the stop word set and the lemmatizer once, instead of on every token or call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):

    # Tokenize the lowercased text
    tokens = word_tokenize(text.lower())

    # Remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]

    # Lemmatize the tokens
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

    # Join the tokens back into a string
    processed_text = ' '.join(lemmatized_tokens)
    return processed_text


# apply the function to the first column of df
df['reviewText'] = df.iloc[:,0].apply(preprocess_text)
df


Unnamed: 0,"reviewText,Positive",Positive,reviewText
0,This is a one of the best apps acording to a b...,1,one best apps acording bunch people agree bomb...
1,This is a pretty good version of the game for ...,1,pretty good version game free . lot different ...
2,this is a really cool game. there are a bunch ...,1,really cool game . bunch level find golden egg...
3,"""This is a silly game and can be frustrating, ...",1,"`` silly game frustrating , lot fun definitely..."
4,This is a terrific game on any pad. Hrs of fun...,1,terrific game pad . hr fun . grandkids love . ...
...,...,...,...
19995,this app is fricken stupid.it froze on the kin...,0,app fricken stupid.it froze kindle wont allow ...
19996,Please add me!!!!! I need neighbors! Ginger101...,1,please add ! ! ! ! ! need neighbor ! ginger101...
19997,love it! this game. is awesome. wish it had m...,1,love ! game . awesome . wish free stuff house ...
19998,I love love love this app on my side of fashio...,1,love love love app side fashion story fight wo...


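Once our data is normalized, we initialize the sentiment analyzer and define a function get_sentiment() that calls the polarity_scores method on a piece of text. polarity_scores() returns a dictionary with neg, neu and pos scores, each between 0 and 1, and a compound score that lies in [-1,1], where -1 refers to the most negative sentiment and +1 to the most positive. Our function labels a review as positive (1) when its pos score is greater than 0, and as 0 otherwise. Note that the code below applies the function to the first column of the dataframe, which holds the original review text; to score the cleaned text, apply it to df['reviewText'] instead. We store the predicted values in the Sentiment column of our dataframe.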

In [43]:
# initialize NLTK sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# create get_sentiment function
def get_sentiment(text):

    scores = analyzer.polarity_scores(text)
    sentiment = 1 if scores['pos'] > 0 else 0
    return sentiment

# apply get_sentiment function
df['Sentiment'] = df.iloc[:,0].apply(get_sentiment)
df


Unnamed: 0,"reviewText,Positive",Positive,reviewText,Sentiment
0,This is a one of the best apps acording to a b...,1,one best apps acording bunch people agree bomb...,1
1,This is a pretty good version of the game for ...,1,pretty good version game free . lot different ...,1
2,this is a really cool game. there are a bunch ...,1,really cool game . bunch level find golden egg...,1
3,"""This is a silly game and can be frustrating, ...",1,"`` silly game frustrating , lot fun definitely...",1
4,This is a terrific game on any pad. Hrs of fun...,1,terrific game pad . hr fun . grandkids love . ...,1
...,...,...,...,...
19995,this app is fricken stupid.it froze on the kin...,0,app fricken stupid.it froze kindle wont allow ...,0
19996,Please add me!!!!! I need neighbors! Ginger101...,1,please add ! ! ! ! ! need neighbor ! ginger101...,1
19997,love it! this game. is awesome. wish it had m...,1,love ! game . awesome . wish free stuff house ...,1
19998,I love love love this app on my side of fashio...,1,love love love app side fashion story fight wo...,1
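
The function above labels a review positive whenever its pos score is greater than zero. A commonly used alternative is to threshold the compound score instead. The sketch below shows that alternative on a plain dictionary, so it does not need the VADER lexicon. The 0.05 cut-off follows the convention suggested by the VADER authors, while the function name and the example values are hypothetical.

```python
def label_from_compound(scores, threshold=0.05):
    """Map a VADER score dict to 'positive', 'negative' or 'neutral' via its compound value."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Made-up values in the shape that polarity_scores returns.
example = {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.8}
print(label_from_compound(example))
# positive
```

Unlike the pos > 0 rule, this variant can also report a neutral class, so the labels would need to be mapped back to 0 and 1 before comparing them with the Positive column.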


Now that we have our actual and predicted values, we can use the confusion_matrix function from the sklearn library to find our true positives and true negatives, and the accuracy_score function from the same library to see how well SentimentIntensityAnalyzer() predicted the sentiment labels.

In [44]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(df.iloc[:,1], df.iloc[:,3]))

[[ 1456  3311]
 [  750 14483]]


In the confusion matrix above, rows are the actual labels and columns are the predicted ones, so our true positives are 14483 and our true negatives are 1456.
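
The other two cells are the false positives (3311) and the false negatives (750). As a small worked sketch, the usual summary metrics can be derived from the four counts printed above; the variable names are chosen here for illustration.

```python
# Counts read from the confusion matrix above: rows are actual labels, columns are predictions.
tn, fp = 1456, 3311
fn, tp = 750, 14483

total = tn + fp + fn + tp
accuracy = (tp + tn) / total      # share of all reviews labeled correctly
precision = tp / (tp + fp)        # share of predicted positives that are truly positive
recall = tp / (tp + fn)           # share of actual positives that were found
specificity = tn / (tn + fp)      # share of actual negatives that were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} specificity={specificity:.3f}")
```

Specificity comes out far lower than recall here, which suggests the rule labels most reviews as positive; that is worth keeping in mind when reading the accuracy figure.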

In [45]:

from sklearn.metrics import accuracy_score
score = accuracy_score(df.iloc[:,1], df.iloc[:,3])  
print(score)

0.79695


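The accuracy_score shows that the accuracy of SentimentIntensityAnalyzer() on this dataset is about 79.7%, which looks reasonable. For context, 15233 of the 20000 reviews are labeled positive, so always predicting positive would already score about 76.2%.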

## References

https://www.nltk.org/index.html
<br>
https://medium.com/featurepreneur/simple-sentiment-analysis-with-nlp-vader-400276c7574d
<br>
https://www.youtube.com/watch?v=FLZvOKSCkxY&list=PLQVvvaa0QuDf2JswnfiGkliBInZnIC4HL

