<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# NLP Part 1: Language Data Pre-Processing



### Learning Objectives

1. Install and import the Natural Language Toolkit (NLTK).
2. Define and implement tokenizing, lemmatizing, stemming, part-of-speech tagging and named entity recognition.
3. Preprocess text data by removing stop words.
4. Implement sentiment analysis.

Before we get started, you will need to install the Natural Language Toolkit, known as [NLTK](https://www.nltk.org/install.html). If it is already installed you will get a `Requirement already satisfied` message.


In [1]:
pip install nltk

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
import nltk

In [3]:
# Download all the packages from nltk
#nltk.download('all')
# Alternatively, choose package to download from the window that pops up.
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [4]:
# Imports
import pandas as pd       
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk
import re

# Pre-Processing 

When dealing with text data, NLTK can help us with several common pre-processing steps:
- Removing special characters
- Tokenizing
- Lemmatizing/Stemming
- Removing stop words

Let's say we have the following text that we want to be able to identify as spam.

In [5]:
# Define spam text.
spam = 'Hello,\nI saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer Chairman of the Board of Directors of PJSC "LUKOIL". I am 86 years old and I was diagnosed with cancer 2 years ago. I will be going in for an operation later this week. I decided to WILL/Donate the sum of 8,750,000.00 Euros(Eight Million Seven Hundred And Fifty Thousand Euros Only. If you would like to receive this donation please reply by contacting me before I am operated on.'

print(spam)

Hello,
I saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer Chairman of the Board of Directors of PJSC "LUKOIL". I am 86 years old and I was diagnosed with cancer 2 years ago. I will be going in for an operation later this week. I decided to WILL/Donate the sum of 8,750,000.00 Euros(Eight Million Seven Hundred And Fifty Thousand Euros Only. If you would like to receive this donation please reply by contacting me before I am operated on.


## Tokenizing

NLTK provides functions to split text into individual words or sentences, called 'tokens'.


In [6]:
# tokenize into sentences
sent_tokenize(spam.lower())

['hello,\ni saw your contact information on linkedin.',
 'i have carefully read through your profile and you seem to have an outstanding personality.',
 'this is one major reason why i am in contact with you.',
 'my name is mr. valery grayfer chairman of the board of directors of pjsc "lukoil".',
 'i am 86 years old and i was diagnosed with cancer 2 years ago.',
 'i will be going in for an operation later this week.',
 'i decided to will/donate the sum of 8,750,000.00 euros(eight million seven hundred and fifty thousand euros only.',
 'if you would like to receive this donation please reply by contacting me before i am operated on.']

In [7]:
# tokenize into words
word_tokenize(spam.lower())


['hello',
 ',',
 'i',
 'saw',
 'your',
 'contact',
 'information',
 'on',
 'linkedin',
 '.',
 'i',
 'have',
 'carefully',
 'read',
 'through',
 'your',
 'profile',
 'and',
 'you',
 'seem',
 'to',
 'have',
 'an',
 'outstanding',
 'personality',
 '.',
 'this',
 'is',
 'one',
 'major',
 'reason',
 'why',
 'i',
 'am',
 'in',
 'contact',
 'with',
 'you',
 '.',
 'my',
 'name',
 'is',
 'mr.',
 'valery',
 'grayfer',
 'chairman',
 'of',
 'the',
 'board',
 'of',
 'directors',
 'of',
 'pjsc',
 '``',
 'lukoil',
 "''",
 '.',
 'i',
 'am',
 '86',
 'years',
 'old',
 'and',
 'i',
 'was',
 'diagnosed',
 'with',
 'cancer',
 '2',
 'years',
 'ago',
 '.',
 'i',
 'will',
 'be',
 'going',
 'in',
 'for',
 'an',
 'operation',
 'later',
 'this',
 'week',
 '.',
 'i',
 'decided',
 'to',
 'will/donate',
 'the',
 'sum',
 'of',
 '8,750,000.00',
 'euros',
 '(',
 'eight',
 'million',
 'seven',
 'hundred',
 'and',
 'fifty',
 'thousand',
 'euros',
 'only',
 '.',
 'if',
 'you',
 'would',
 'like',
 'to',
 'receive',
 'this

You might notice that `word_tokenize()` treats punctuation marks as tokens too. To avoid this, we can instantiate a `RegexpTokenizer` with the regex `\w+`, which keeps only runs of letters, digits and underscores.

For more on Regular Expressions (and to test a regex) see https://regex101.com/

In [8]:
# \w matches any word character (letters, digits and underscore; [a-zA-Z0-9_] for ASCII text), + repeats the match
# So this tokenizer finds runs of letters, digits or _
tokenizer = RegexpTokenizer(r'\w+') 


In [9]:
# Use the regular expression tokenizer that has been instantiated 
spam_tokens = tokenizer.tokenize(spam.lower())


In [10]:
# What happened with the numeric value?
spam_tokens


['hello',
 'i',
 'saw',
 'your',
 'contact',
 'information',
 'on',
 'linkedin',
 'i',
 'have',
 'carefully',
 'read',
 'through',
 'your',
 'profile',
 'and',
 'you',
 'seem',
 'to',
 'have',
 'an',
 'outstanding',
 'personality',
 'this',
 'is',
 'one',
 'major',
 'reason',
 'why',
 'i',
 'am',
 'in',
 'contact',
 'with',
 'you',
 'my',
 'name',
 'is',
 'mr',
 'valery',
 'grayfer',
 'chairman',
 'of',
 'the',
 'board',
 'of',
 'directors',
 'of',
 'pjsc',
 'lukoil',
 'i',
 'am',
 '86',
 'years',
 'old',
 'and',
 'i',
 'was',
 'diagnosed',
 'with',
 'cancer',
 '2',
 'years',
 'ago',
 'i',
 'will',
 'be',
 'going',
 'in',
 'for',
 'an',
 'operation',
 'later',
 'this',
 'week',
 'i',
 'decided',
 'to',
 'will',
 'donate',
 'the',
 'sum',
 'of',
 '8',
 '750',
 '000',
 '00',
 'euros',
 'eight',
 'million',
 'seven',
 'hundred',
 'and',
 'fifty',
 'thousand',
 'euros',
 'only',
 'if',
 'you',
 'would',
 'like',
 'to',
 'receive',
 'this',
 'donation',
 'please',
 'reply',
 'by',
 'contact

In [11]:
# Look for digits in each of the tokens
[(re.findall(r'\d+', i), i) for i in spam_tokens]

[([], 'hello'),
 ([], 'i'),
 ([], 'saw'),
 ([], 'your'),
 ([], 'contact'),
 ([], 'information'),
 ([], 'on'),
 ([], 'linkedin'),
 ([], 'i'),
 ([], 'have'),
 ([], 'carefully'),
 ([], 'read'),
 ([], 'through'),
 ([], 'your'),
 ([], 'profile'),
 ([], 'and'),
 ([], 'you'),
 ([], 'seem'),
 ([], 'to'),
 ([], 'have'),
 ([], 'an'),
 ([], 'outstanding'),
 ([], 'personality'),
 ([], 'this'),
 ([], 'is'),
 ([], 'one'),
 ([], 'major'),
 ([], 'reason'),
 ([], 'why'),
 ([], 'i'),
 ([], 'am'),
 ([], 'in'),
 ([], 'contact'),
 ([], 'with'),
 ([], 'you'),
 ([], 'my'),
 ([], 'name'),
 ([], 'is'),
 ([], 'mr'),
 ([], 'valery'),
 ([], 'grayfer'),
 ([], 'chairman'),
 ([], 'of'),
 ([], 'the'),
 ([], 'board'),
 ([], 'of'),
 ([], 'directors'),
 ([], 'of'),
 ([], 'pjsc'),
 ([], 'lukoil'),
 ([], 'i'),
 ([], 'am'),
 (['86'], '86'),
 ([], 'years'),
 ([], 'old'),
 ([], 'and'),
 ([], 'i'),
 ([], 'was'),
 ([], 'diagnosed'),
 ([], 'with'),
 ([], 'cancer'),
 (['2'], '2'),
 ([], 'years'),
 ([], 'ago'),
 ([], 'i'),
 ([], 

In [12]:
# Instantiate a tokenizer that keeps numbers together with their , and . separators
tokenizer_1 = RegexpTokenizer(r'\d[\d,.]+|\w+')
tokenizer_1.tokenize(spam.lower())

['hello',
 'i',
 'saw',
 'your',
 'contact',
 'information',
 'on',
 'linkedin',
 'i',
 'have',
 'carefully',
 'read',
 'through',
 'your',
 'profile',
 'and',
 'you',
 'seem',
 'to',
 'have',
 'an',
 'outstanding',
 'personality',
 'this',
 'is',
 'one',
 'major',
 'reason',
 'why',
 'i',
 'am',
 'in',
 'contact',
 'with',
 'you',
 'my',
 'name',
 'is',
 'mr',
 'valery',
 'grayfer',
 'chairman',
 'of',
 'the',
 'board',
 'of',
 'directors',
 'of',
 'pjsc',
 'lukoil',
 'i',
 'am',
 '86',
 'years',
 'old',
 'and',
 'i',
 'was',
 'diagnosed',
 'with',
 'cancer',
 '2',
 'years',
 'ago',
 'i',
 'will',
 'be',
 'going',
 'in',
 'for',
 'an',
 'operation',
 'later',
 'this',
 'week',
 'i',
 'decided',
 'to',
 'will',
 'donate',
 'the',
 'sum',
 'of',
 '8,750,000.00',
 'euros',
 'eight',
 'million',
 'seven',
 'hundred',
 'and',
 'fifty',
 'thousand',
 'euros',
 'only',
 'if',
 'you',
 'would',
 'like',
 'to',
 'receive',
 'this',
 'donation',
 'please',
 'reply',
 'by',
 'contacting',
 'me',
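
To see what the pattern `\d[\d,.]+|\w+` is doing, here is a minimal sketch using only the standard `re` module. With its default settings, `RegexpTokenizer` should behave much like `re.findall` on the same pattern. The sample sentence and the stricter second pattern are our own illustrations, not part of NLTK.

```python
import re

text = "i decided to will/donate the sum of 8,750,000.00 euros. i paid 5,000."

# Pattern from the cell above: a digit followed by one or more digits, commas or periods;
# otherwise fall back to a run of word characters.
loose = re.findall(r"\d[\d,.]+|\w+", text)

# Hypothetical stricter variant: a separator is only kept when digits follow it,
# so sentence-final punctuation is not absorbed into the number.
strict = re.findall(r"\d+(?:[.,]\d+)*|\w+", text)

print(loose[-1])   # 5,000.
print(strict[-1])  # 5,000
```

Both patterns keep `8,750,000.00` as a single token. The first one also swallows a period that directly follows a number, which may or may not matter for your data.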

## Lemmatizing & Stemming

- "He is *running* really fast!"
- "He *ran* the race."
- "He *runs* a five-minute mile."

If we wanted a computer to interpret these sentences, we might count how many times each word appears. The computer will treat words like "running," "ran," and "runs" as different words... but they mean very similar things (in this context)!

**Lemmatizing** and **stemming** are two forms of shortening words so we can combine similar forms of the same word.
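
Here is a small sketch of the counting problem, using only the standard library. The `base_form` lookup is a hand-written stand-in for a real lemmatizer or stemmer, purely for illustration.

```python
from collections import Counter

sentences = [
    "He is running really fast!",
    "He ran the race.",
    "He runs a five-minute mile.",
]

# Hypothetical hand-written lookup, standing in for a real lemmatizer or stemmer.
base_form = {"running": "run", "ran": "run", "runs": "run"}

# Lowercase each word and strip trailing punctuation.
words = [w.strip("!.").lower() for s in sentences for w in s.split()]

raw_counts = Counter(words)
merged_counts = Counter(base_form.get(w, w) for w in words)

print(raw_counts["running"], raw_counts["ran"], raw_counts["runs"])  # 1 1 1
print(merged_counts["run"])  # 3
```

Without normalization, each form is counted once. After mapping them to a shared base form, the computer sees one word that appears three times.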

When we "**lemmatize**" data, we take words and attempt to return their *lemma*, or the base/dictionary form of a word.

How do we know what the base form is? Fortunately, we don't have to create a dictionary ourselves: NLTK uses the publicly available [WordNet](https://wordnet.princeton.edu/).

In [13]:
# Instantiate lemmatizer.
lemmatizer = WordNetLemmatizer()

In [14]:
# Lemmatize tokens.
tokens_lem = [lemmatizer.lemmatize(i) for i in spam_tokens]

In [15]:
# Compare tokens to lemmatized version.
# zip() pairs up the corresponding elements of the two lists as tuples
list(zip(spam_tokens, tokens_lem))

[('hello', 'hello'),
 ('i', 'i'),
 ('saw', 'saw'),
 ('your', 'your'),
 ('contact', 'contact'),
 ('information', 'information'),
 ('on', 'on'),
 ('linkedin', 'linkedin'),
 ('i', 'i'),
 ('have', 'have'),
 ('carefully', 'carefully'),
 ('read', 'read'),
 ('through', 'through'),
 ('your', 'your'),
 ('profile', 'profile'),
 ('and', 'and'),
 ('you', 'you'),
 ('seem', 'seem'),
 ('to', 'to'),
 ('have', 'have'),
 ('an', 'an'),
 ('outstanding', 'outstanding'),
 ('personality', 'personality'),
 ('this', 'this'),
 ('is', 'is'),
 ('one', 'one'),
 ('major', 'major'),
 ('reason', 'reason'),
 ('why', 'why'),
 ('i', 'i'),
 ('am', 'am'),
 ('in', 'in'),
 ('contact', 'contact'),
 ('with', 'with'),
 ('you', 'you'),
 ('my', 'my'),
 ('name', 'name'),
 ('is', 'is'),
 ('mr', 'mr'),
 ('valery', 'valery'),
 ('grayfer', 'grayfer'),
 ('chairman', 'chairman'),
 ('of', 'of'),
 ('the', 'the'),
 ('board', 'board'),
 ('of', 'of'),
 ('directors', 'director'),
 ('of', 'of'),
 ('pjsc', 'pjsc'),
 ('lukoil', 'lukoil'),
 ('i'

In [16]:
# Print only those lemmatized tokens that are different.
for i in range(len(spam_tokens)):
    if spam_tokens[i] != tokens_lem[i]:
        print((spam_tokens[i], tokens_lem[i]))

('directors', 'director')
('years', 'year')
('was', 'wa')
('years', 'year')
('euros', 'euro')
('euros', 'euro')


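Lemmatizing is usually the more grammatically precise of the two approaches, but it might not change many tokens: only six were changed here. By default, `lemmatize()` treats every word as a noun, which is why verbs such as "going" are left alone and "was" is wrongly reduced to "wa". Passing the part of speech fixes this: `lemmatizer.lemmatize("was", pos="v")` returns "be".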

We can also do this on individual words.

In [17]:
# Lemmatize the word "computers."
lemmatizer.lemmatize("computers")

'computer'

In [18]:
lemmatizer.lemmatize("computer")

'computer'

In [19]:
lemmatizer.lemmatize("computation")

'computation'

In [20]:
lemmatizer.lemmatize("computationally")

'computationally'

When we "**stem**" data, we take words and attempt to return a base form of the word by stripping suffixes. It tends to be cruder than lemmatization, and the result need not be a real word. The stemmer used below implements the [algorithm published by Porter in 1980](https://www.cs.toronto.edu/~frank/csc2501/Readings/R2_Porter/Porter-1980.pdf).

In [21]:
# Instantiate PorterStemmer.
p_stemmer = PorterStemmer()

In [22]:
# Stem tokens.
stem_spam = [p_stemmer.stem(i) for i in spam_tokens]

In [23]:
# Compare tokens to stemmed version.
list(zip(spam_tokens, stem_spam))

[('hello', 'hello'),
 ('i', 'i'),
 ('saw', 'saw'),
 ('your', 'your'),
 ('contact', 'contact'),
 ('information', 'inform'),
 ('on', 'on'),
 ('linkedin', 'linkedin'),
 ('i', 'i'),
 ('have', 'have'),
 ('carefully', 'care'),
 ('read', 'read'),
 ('through', 'through'),
 ('your', 'your'),
 ('profile', 'profil'),
 ('and', 'and'),
 ('you', 'you'),
 ('seem', 'seem'),
 ('to', 'to'),
 ('have', 'have'),
 ('an', 'an'),
 ('outstanding', 'outstand'),
 ('personality', 'person'),
 ('this', 'thi'),
 ('is', 'is'),
 ('one', 'one'),
 ('major', 'major'),
 ('reason', 'reason'),
 ('why', 'whi'),
 ('i', 'i'),
 ('am', 'am'),
 ('in', 'in'),
 ('contact', 'contact'),
 ('with', 'with'),
 ('you', 'you'),
 ('my', 'my'),
 ('name', 'name'),
 ('is', 'is'),
 ('mr', 'mr'),
 ('valery', 'valeri'),
 ('grayfer', 'grayfer'),
 ('chairman', 'chairman'),
 ('of', 'of'),
 ('the', 'the'),
 ('board', 'board'),
 ('of', 'of'),
 ('directors', 'director'),
 ('of', 'of'),
 ('pjsc', 'pjsc'),
 ('lukoil', 'lukoil'),
 ('i', 'i'),
 ('am', 'am'

In [24]:
# Print only those stemmed tokens that are different.

[(spam_tokens[i], stem_spam[i]) for i in range(len(spam_tokens)) if spam_tokens[i] != stem_spam[i]]

[('information', 'inform'),
 ('carefully', 'care'),
 ('profile', 'profil'),
 ('outstanding', 'outstand'),
 ('personality', 'person'),
 ('this', 'thi'),
 ('why', 'whi'),
 ('valery', 'valeri'),
 ('directors', 'director'),
 ('years', 'year'),
 ('was', 'wa'),
 ('diagnosed', 'diagnos'),
 ('years', 'year'),
 ('going', 'go'),
 ('operation', 'oper'),
 ('this', 'thi'),
 ('decided', 'decid'),
 ('donate', 'donat'),
 ('euros', 'euro'),
 ('hundred', 'hundr'),
 ('fifty', 'fifti'),
 ('euros', 'euro'),
 ('only', 'onli'),
 ('receive', 'receiv'),
 ('this', 'thi'),
 ('donation', 'donat'),
 ('please', 'pleas'),
 ('reply', 'repli'),
 ('contacting', 'contact'),
 ('before', 'befor'),
 ('operated', 'oper')]

In [25]:
# Stem the word "computers"
p_stemmer.stem("computers")



'comput'

In [26]:
# Stem the word "computer."
p_stemmer.stem("computer")

'comput'

In [27]:

# Stem the word "computation."
p_stemmer.stem("computation")

'comput'

In [28]:

# Stem the word "computationally."
p_stemmer.stem("computationally")

'comput'

## Part of Speech Tagging

Another task in NLP is to tag the different parts of speech. This can be done using NLTK's `pos_tag`.

In [29]:

text = tokenizer.tokenize("Taylor Swift's concert The Eras Tour will be held at the Singapore National Stadium in Singapore on 2,3,4 and 7,8,9 March 2024")


pos_tag_list = pos_tag(text)
print(pos_tag_list)

[('Taylor', 'NNP'), ('Swift', 'NNP'), ('s', 'VBD'), ('concert', 'VB'), ('The', 'DT'), ('Eras', 'NNP'), ('Tour', 'NNP'), ('will', 'MD'), ('be', 'VB'), ('held', 'VBN'), ('at', 'IN'), ('the', 'DT'), ('Singapore', 'NNP'), ('National', 'NNP'), ('Stadium', 'NNP'), ('in', 'IN'), ('Singapore', 'NNP'), ('on', 'IN'), ('2', 'CD'), ('3', 'CD'), ('4', 'CD'), ('and', 'CC'), ('7', 'CD'), ('8', 'CD'), ('9', 'CD'), ('March', 'NNP'), ('2024', 'CD')]


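NLTK has tagged each token with a Penn Treebank part-of-speech tag. Tagging is not perfect: because the `\w+` tokenizer split "Swift's" into "Swift" and "s", the stray "s" was tagged as a past-tense verb (VBD) and "concert" as a verb (VB). Some common tags:

- NN: noun (NNP: proper noun)
- DT: determiner
- IN: preposition or subordinating conjunction
- CD: cardinal number
- VB, VBD, VBG, VBN: verb forms (base, past tense, gerund/present participle, past participle)
- JJ: adjective
- PRP: personal pronoun
- MD: modal
- CC: coordinating conjunction
- etc.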
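
Since `pos_tag` returns a plain list of `(token, tag)` tuples, we can filter it with ordinary list comprehensions. The sketch below copies the tagged output shown above into a variable we call `tagged` (our own name), so it runs without NLTK.

```python
# (token, tag) pairs copied from the pos_tag output above.
tagged = [('Taylor', 'NNP'), ('Swift', 'NNP'), ('s', 'VBD'), ('concert', 'VB'),
          ('The', 'DT'), ('Eras', 'NNP'), ('Tour', 'NNP'), ('will', 'MD'),
          ('be', 'VB'), ('held', 'VBN'), ('at', 'IN'), ('the', 'DT'),
          ('Singapore', 'NNP'), ('National', 'NNP'), ('Stadium', 'NNP'),
          ('in', 'IN'), ('Singapore', 'NNP'), ('on', 'IN'), ('2', 'CD'),
          ('3', 'CD'), ('4', 'CD'), ('and', 'CC'), ('7', 'CD'), ('8', 'CD'),
          ('9', 'CD'), ('March', 'NNP'), ('2024', 'CD')]

# Proper nouns carry the tag NNP; cardinal numbers carry CD.
proper_nouns = [token for token, tag in tagged if tag == 'NNP']
numbers = [token for token, tag in tagged if tag == 'CD']

# All verb tags start with "VB" (VB, VBD, VBG, VBN, ...).
verbs = [token for token, tag in tagged if tag.startswith('VB')]

print(proper_nouns)
print(numbers)
print(verbs)  # ['s', 'concert', 'be', 'held']
```

Filtering for verbs makes the two tagging mistakes ("s" and "concert") easy to spot.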


## Named Entity Recognition

Another NLP task is named entity recognition, i.e. identifying locations, people and organizations. This is done by passing the part-of-speech-tagged tokens to the `ne_chunk` function from `nltk.chunk`.
`ne_chunk` parses the tagged tokens and groups those that are probably named entities into labelled subtrees.


In [30]:
namedEnt = ne_chunk(pos_tag_list, binary=False)
# draw() opens the tree in a separate window; print(namedEnt) shows it as text instead.
namedEnt.draw()
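
If we want the entities as data rather than as a drawing, we can walk the tree. This is a minimal sketch: `extract_entities` is our own helper, and `toy_tree` is a hand-built tree for illustration only, not the real `ne_chunk` output for the sentence above.

```python
from nltk import Tree

def extract_entities(tree):
    """Return (entity text, label) pairs from a chunk tree such as the one ne_chunk returns."""
    entities = []
    for node in tree:
        # Named entities are nested Tree objects; ordinary tokens are (word, tag) tuples.
        if isinstance(node, Tree):
            text = " ".join(word for word, tag in node.leaves())
            entities.append((text, node.label()))
    return entities

# Hand-built toy tree (hypothetical), shaped like ne_chunk output with binary=False.
toy_tree = Tree("S", [
    Tree("PERSON", [("Taylor", "NNP"), ("Swift", "NNP")]),
    ("will", "MD"),
    ("perform", "VB"),
    ("in", "IN"),
    Tree("GPE", [("Singapore", "NNP")]),
])

print(extract_entities(toy_tree))
# [('Taylor Swift', 'PERSON'), ('Singapore', 'GPE')]
```

Calling `extract_entities(namedEnt)` on the real tree should work the same way, though the entities and labels it finds depend on NLTK's chunker.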

## Stop Word Removal

The following quote has had stop words (and punctuation) removed:

"Answer great question life universe everything said deep thought said deep thought paused forty two said deep thought infinite majesty calm."

<details><summary>What book is the above sentence from?</summary>

The Hitchhiker's Guide to the Galaxy!
    
![](../images/hgg.jpg)
    
The original quote reads:  
..."The Answer to the Great Question..."  
"Yes..!"  
"Of Life, the Universe and Everything..." said Deep Thought.  
"Yes...!"  
"Is..." said Deep Thought, and paused.  
"Yes...!"  
"Is..."  
"Yes...!!!...?"  
"Forty-two," said Deep Thought, with infinite majesty and calm.
</details>

<details><summary>If you were familiar with the book, how did you know what book the sentence was from?</summary>

Removing stop words did not remove key identifying words such as "life", "universe", "everything", and "forty-two".
</details>

<details><summary>Based on this, how would you define stop words?</summary>

Stop words are words that have little to no significance or meaning. They are common words that only add to the grammatical structure and flow of the sentence, so it is still relatively easy to identify the contents of sentences without stop words.
</details>

In [31]:
# Print English stopwords.
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [32]:
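# Remove stopwords from "spam_tokens."
# Build the stop word set once, so the list isn't reloaded for every token.
stop_words = set(stopwords.words('english'))
no_stop_words = [token for token in spam_tokens if token not in stop_words]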

In [33]:
# Check it
print(no_stop_words)

['hello', 'saw', 'contact', 'information', 'linkedin', 'carefully', 'read', 'profile', 'seem', 'outstanding', 'personality', 'one', 'major', 'reason', 'contact', 'name', 'mr', 'valery', 'grayfer', 'chairman', 'board', 'directors', 'pjsc', 'lukoil', '86', 'years', 'old', 'diagnosed', 'cancer', '2', 'years', 'ago', 'going', 'operation', 'later', 'week', 'decided', 'donate', 'sum', '8', '750', '000', '00', 'euros', 'eight', 'million', 'seven', 'hundred', 'fifty', 'thousand', 'euros', 'would', 'like', 'receive', 'donation', 'please', 'reply', 'contacting', 'operated']
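
One thing to watch: the NLTK list printed above includes negations such as "no" and "not". For some tasks (sentiment analysis, for example) we might want to keep them. A minimal sketch, using a handful of entries copied from that list and a `keep` set that is our own choice:

```python
# A few entries copied from the NLTK English list printed above; the real list is much longer.
stop_words = {"i", "this", "is", "not", "no", "the", "a", "and", "at", "all"}

# Hypothetical choice: keep negations, since they flip the meaning of what follows.
keep = {"not", "no", "nor"}
custom_stop_words = stop_words - keep

tokens = "this film is not good at all".split()

default_result = [t for t in tokens if t not in stop_words]
custom_result = [t for t in tokens if t not in custom_stop_words]

print(default_result)  # ['film', 'good']
print(custom_result)   # ['film', 'not', 'good']
```

With the default list, "not good" collapses to "good", which reverses the meaning of the sentence.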


---

# Sentiment Analysis

![](../images/sent.jpeg)

[Sentiment analysis](https://www.kdnuggets.com/2018/08/emotion-sentiment-analysis-practitioners-guide-nlp-5.html) is an area of natural language processing in which we seek to classify text as having positive or negative emotion.

Let's build a simple function that can classify text as either having positive or negative sentiment.

What words tell us whether certain text is positive?

In [34]:
# Let's come up with a list of positive and negative words we might observe. 
# Suggest more!!
positive_words = ['love', 'good', 'great']
negative_words = ['garbage', 'sad', 'bad']

In [35]:
def simple_sentiment(text):
    # Instantiate tokenizer.
    tokenizer = RegexpTokenizer(r'\w+')
    # Tokenize text.
    tokens = tokenizer.tokenize(text.lower())
    # Guard against empty input to avoid dividing by zero.
    if not tokens:
        return 0.0
    # Instantiate stemmer.
    p_stemmer = PorterStemmer()
    # Stem words.
    stemmed_words = [p_stemmer.stem(i) for i in tokens]
    # Stem our positive/negative words.
    positive_stems = [p_stemmer.stem(i) for i in positive_words]
    negative_stems = [p_stemmer.stem(i) for i in negative_words]

    # Count "positive" words.
    positive_count = sum(1 for i in stemmed_words if i in positive_stems)
    # Count "negative" words.
    negative_count = sum(1 for i in stemmed_words if i in negative_stems)
    # Net sentiment as a proportion of all tokens, between -1 and 1.
    return round((positive_count - negative_count) / len(tokens), 2)

In [36]:
# Run our sentiment analyzer 
simple_sentiment("I love programming!")

0.33

<details><summary> What are some shortcomings of this method? </summary>

- Primarily, we're limited to the positive/negative words we came up with.
- If someone wrote "not good" or "not bad," our sentiment function would probably treat "not good" as positive or neutral... but it's probably supposed to mean negative!
- The ordering of the words doesn't matter here, which is not how language generally works.
- We haven't corrected for misspellings.
</details>
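
To make the "not good" problem concrete, here is one possible patch, sketched with the standard library only (no stemming). The `NEGATIONS` list and the rule "flip the sign of a sentiment word that directly follows a negation" are our own simplifications, and they still miss plenty of cases.

```python
import re

POSITIVE = {"love", "good", "great"}
NEGATIVE = {"garbage", "sad", "bad"}
NEGATIONS = {"not", "no", "never"}  # hypothetical, deliberately tiny list

def negation_aware_sentiment(text):
    """Score text in [-1, 1]; a sentiment word right after a negation has its sign flipped."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    score = 0
    # Pair each token with the token before it ("" for the first one).
    for previous, token in zip([""] + tokens, tokens):
        value = (token in POSITIVE) - (token in NEGATIVE)
        if previous in NEGATIONS:
            value = -value
        score += value
    return round(score / len(tokens), 2)

print(negation_aware_sentiment("not good"))             # -0.5
print(negation_aware_sentiment("not bad"))              # 0.5
print(negation_aware_sentiment("I love programming!"))  # 0.33
```

Hand-written rules like this get complicated quickly, which is one reason to reach for a pre-built lexicon.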

There are a couple of ways to proceed with sentiment analysis:

1. If you have already-labeled data, you can build a supervised learning model.
2. If you don't have labeled data, you can use a Lexicon (dictionary) that has already been built/trained for sentiment analysis.
    - There are a bunch of these and which to use depends on your purpose/data. Here are just a few that are available:
        - AFINN lexicon
        - MPQA subjectivity lexicon
        - SentiWordNet
        - VADER lexicon

We will use the [VADER](https://www.nltk.org/_modules/nltk/sentiment/vader.html) (Valence Aware Dictionary and sEntiment Reasoner) lexicon to analyze the sentiments of our reviews.

VADER combines a lexicon of positive and negative words (including emoticons) with rules that handle negation ("didn't like") and intensity, such as words written in ALL CAPS.

In [37]:
# Instantiate Sentiment Intensity Analyzer from nltk.sentiment.vader
sent = SentimentIntensityAnalyzer()

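VADER's `polarity_scores` takes in a string and returns a dictionary with four scores:

- negative
- neutral
- positive
- compound, a single overall score: the word valences are summed and then normalized to lie between -1 (most negative) and +1 (most positive).

The first three are proportions of the text that fall into each category, so they add up to 1.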
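
As we read the VADER source linked above, the compound score comes from squashing the summed word valences with `x / sqrt(x² + alpha)`, where `alpha` defaults to 15. Treat the sketch below as our reading of that step, not as the library's exact code path.

```python
import math

def normalize(valence_sum, alpha=15):
    """Squash an unbounded sum of word valences into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)

for total in (-10, -1, 0, 1, 10):
    print(total, round(normalize(total), 4))
```

A sum of 0 stays at 0, and larger sums move towards -1 or +1 without ever reaching them. This is why long, strongly worded reviews often score close to ±1.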

In [38]:
sent.polarity_scores('This is FANTASTIC')

{'neg': 0.0, 'neu': 0.316, 'pos': 0.684, 'compound': 0.6523}

In [39]:
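# Note: the apostrophe is typed as " here, so VADER sees 'don"t' rather than the negation "don't",
# and "like" still counts as a positive word.
sent.polarity_scores('I don"t like programming at all and I hate Python especially')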

{'neg': 0.28, 'neu': 0.53, 'pos': 0.189, 'compound': -0.296}

Let's try analyzing the sentiment of IMDb movie reviews. The data is from [Kaggle](https://www.kaggle.com/c/word2vec-nlp-tutorial#part-1-for-beginners-bag-of-words).

In [40]:
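# Read in training data.
# quoting=3 is csv.QUOTE_NONE: double quotes are kept as part of the text rather than treated as field delimiters.
reviews = pd.read_csv("data/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)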

In [41]:
reviews.head()

|   | id | sentiment | review |
|---|---|---|---|
| 0 | `"5814_8"` | 1 | `"With all this stuff going down at the moment ...` |
| 1 | `"2381_9"` | 1 | `"\"The Classic War of the Worlds\" by Timothy ...` |
| 2 | `"7759_3"` | 0 | `"The film starts with a manager (Nicholas Bell...` |
| 3 | `"3630_4"` | 0 | `"It must be assumed that those who praised thi...` |
| 4 | `"9495_8"` | 1 | `"Superbly trashy and wondrously unpretentious ...` |


In [42]:
reviews.tail()

|   | id | sentiment | review |
|---|---|---|---|
| 24995 | `"3453_3"` | 0 | `"It seems like more consideration has gone int...` |
| 24996 | `"5064_1"` | 0 | `"I don't believe they made this film. Complete...` |
| 24997 | `"10905_3"` | 0 | `"Guy is a loser. Can't get girls, needs to bui...` |
| 24998 | `"10194_3"` | 0 | `"This 30 minute documentary Buñuel made in the...` |
| 24999 | `"8478_8"` | 1 | `"I saw this movie as a child and it broke my h...` |


In [46]:
# Examine a review.
reviews['review'][100]

'"There is a uk edition to this show which is rather less extravagant than the US version. The person concerned will get a new kitchen or perhaps bedroom and bathroom and is wonderfully grateful for what they have got. The US version of this show is everything that reality TV shouldn\'t be. Instead of making a few improvements to a house which the occupants could not afford or do themselves the entire house gets rebuilt. I do not know if this show is trying to show what a lousy welfare system exists in the US or if you beg hard enough you will receive. The rather vulgar product placement that takes place, particularly by Sears, is also uncalled for. Rsther than turning one family in a deprived area into potential millionaires, it would be far better to help the community as a whole where instead of spending the hundreds of thousands of dollars on one home, build something for the whole community ..... perhaps a place where diy and power tools can be borrowed and returned along with bui

In [47]:
# Use the sent.polarity_scores() to find the sentiment of the review.
sent.polarity_scores(reviews['review'][100])

{'neg': 0.048, 'neu': 0.857, 'pos': 0.095, 'compound': 0.8625}

In [48]:
# Does this match the sentiment given in the training data?
reviews['sentiment'][100]

0

In [49]:
# apply the scores to all the reviews
reviews['scores'] = reviews['review'].apply(sent.polarity_scores)


In [50]:
# Extract only the compound score
reviews['compound']  = reviews['scores'].apply(lambda score_dict: score_dict['compound'])

In [51]:
# Check for the first 10 reviews
reviews.head(10)

|   | id | sentiment | review | scores | compound |
|---|---|---|---|---|---|
| 0 | `"5814_8"` | 1 | `"With all this stuff going down at the moment ...` | `{'neg': 0.13, 'neu': 0.744, 'pos': 0.126, 'com...` | -0.8278 |
| 1 | `"2381_9"` | 1 | `"\"The Classic War of the Worlds\" by Timothy ...` | `{'neg': 0.047, 'neu': 0.739, 'pos': 0.214, 'co...` | 0.9819 |
| 2 | `"7759_3"` | 0 | `"The film starts with a manager (Nicholas Bell...` | `{'neg': 0.142, 'neu': 0.8, 'pos': 0.058, 'comp...` | -0.9883 |
| 3 | `"3630_4"` | 0 | `"It must be assumed that those who praised thi...` | `{'neg': 0.066, 'neu': 0.878, 'pos': 0.056, 'co...` | -0.2189 |
| 4 | `"9495_8"` | 1 | `"Superbly trashy and wondrously unpretentious ...` | `{'neg': 0.119, 'neu': 0.741, 'pos': 0.14, 'com...` | 0.796 |
| 5 | `"8196_8"` | 1 | `"I dont know why people think this is such a b...` | `{'neg': 0.183, 'neu': 0.594, 'pos': 0.223, 'co...` | 0.3935 |
| 6 | `"7166_2"` | 0 | `"This movie could have been very good, but com...` | `{'neg': 0.163, 'neu': 0.709, 'pos': 0.129, 'co...` | -0.6863 |
| 7 | `"10633_1"` | 0 | `"I watched this video at a friend's house. I'm...` | `{'neg': 0.062, 'neu': 0.899, 'pos': 0.04, 'com...` | -0.4517 |
| 8 | `"319_1"` | 0 | `"A friend of mine bought this film for £1, and...` | `{'neg': 0.072, 'neu': 0.736, 'pos': 0.192, 'co...` | 0.9707 |
| 9 | `"8713_10"` | 1 | `"<br /><br />This movie is full of references....` | `{'neg': 0.0, 'neu': 0.805, 'pos': 0.195, 'comp...` | 0.8481 |
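
To compare VADER with the labels, we need to turn the compound score into a 0/1 prediction. The sketch below does this for the ten rows shown above, using values copied from that output. The cut-off at 0 is our own assumption; other thresholds are possible.

```python
# Labels and compound scores copied from the ten rows displayed above.
labels = [1, 1, 0, 0, 1, 1, 0, 0, 0, 1]
compound = [-0.8278, 0.9819, -0.9883, -0.2189, 0.796,
            0.3935, -0.6863, -0.4517, 0.9707, 0.8481]

# Assumed rule: a non-negative compound score means positive (1), otherwise negative (0).
predicted = [1 if c >= 0 else 0 for c in compound]

# Row numbers where VADER and the label disagree.
disagreements = [i for i, (y, p) in enumerate(zip(labels, predicted)) if y != p]

# Share of rows where they agree.
agreement = sum(y == p for y, p in zip(labels, predicted)) / len(labels)

print(disagreements)  # [0, 8]
print(agreement)      # 0.8
```

On the full DataFrame, the same idea could be written as `(reviews['compound'] >= 0).astype(int)` and compared with `reviews['sentiment']`.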


In [53]:
# Have a look at one of the reviews in full.
# Row 2 is labelled negative (0) and VADER's compound score (-0.9883) agrees.
reviews['review'][2]

# What sentiment would YOU assign?

'"The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal animal using fossilized DNA, like ¨Jurassik Park¨, and some scientists resurrect one of nature\'s most fearsome predators, the Sabretooth tiger or Smilodon . Scientific ambition turns deadly, however, and when the high voltage fence is opened the creature escape and begins savagely stalking its prey - the human visitors , tourists and scientific.Meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre-historical animals which are deadlier and bigger . In addition , a security agent (Stacy Haiduk) and her mate (Brian Wimmer) fight hardly against the carnivorous Smilodons. The Sabretooths, themselves , of course, are the real star stars and they are astounding terrifyingly though not convincing. The giant animals savagely are stalking its prey and the group run afoul and fight against 