<div style="width: 100%; overflow: hidden;">
    <div style="width: 150px; float: left;"> <img src="data/D4Sci_logo_ball.png" alt="Data For Science, Inc" align="left" border="0"> </div>
    <div style="float: left; margin-left: 10px;"> 
        <h1>Natural Language Processing For Everyone</h1>
        <h1>Sentiment Analysis</h1>
        <p>Bruno Gonçalves<br/>
        <a href="http://www.data4sci.com/">www.data4sci.com</a><br/>
        @bgoncalves, @data4sci</p></div>
</div>

In [1]:
import string
import gzip
from collections import Counter

import numpy as np
import pandas as pd

import matplotlib
import matplotlib.pyplot as plt
import matplotlib.cm as cm

import watermark

%matplotlib inline
%load_ext watermark

List out the versions of all loaded libraries

In [2]:
%watermark -n -v -m -g -iv

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

Compiler    : Clang 10.0.0 
OS          : Darwin
Release     : 20.2.0
Machine     : x86_64
Processor   : i386
CPU cores   : 16
Architecture: 64bit

Git hash: 20a72f6b26b4b4c8729a583bf89f7b8f1b5f9571

pandas    : 1.1.3
matplotlib: 3.3.2
numpy     : 1.19.2
watermark : 2.1.0



Set the default style

In [3]:
plt.style.use('./d4sci.mplstyle')

## Word counting

We start with the simplest approach: counting positive and negative words. We'll use Hu and Liu's lexicon from their 2004 KDD paper: https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

In [4]:
pos = np.loadtxt('data/positive-words.txt', dtype='str', comments=';')
neg = np.loadtxt('data/negative-words.txt', dtype='str', comments=';')

In [5]:
pos

array(['a+', 'abound', 'abounds', ..., 'zenith', 'zest', 'zippy'],
      dtype='<U20')

In [6]:
neg

array(['2-faced', '2-faces', 'abnormal', ..., 'zealous', 'zealously',
       'zombie'], dtype='<U24')

Create a dictionary and assign the valence to each positive and negative word

In [7]:
valence = {}

for word in pos:
    valence[word.lower()] = 1
    
for word in neg:
    valence[word.lower()] = -1
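
Because the negative loop runs last, any word that appears in both lists ends up with a valence of -1. A minimal sketch of how one might check for such overlaps is shown here. The word lists `pos_demo` and `neg_demo` and the helper names are made up for illustration and stand in for the real lexicon files:

```python
# Hypothetical toy lists standing in for the real lexicon files
pos_demo = ["good", "happy", "Envious"]
neg_demo = ["bad", "sad", "envious"]


def build_valence(pos_words, neg_words):
    """Build a word -> valence dictionary; negative words are assigned last."""
    valence = {}
    for word in pos_words:
        valence[word.lower()] = 1
    for word in neg_words:
        valence[word.lower()] = -1
    return valence


def find_overlap(pos_words, neg_words):
    """Return the sorted words present in both lists (case-insensitive)."""
    pos_set = {w.lower() for w in pos_words}
    neg_set = {w.lower() for w in neg_words}
    return sorted(pos_set & neg_set)


valence_demo = build_valence(pos_demo, neg_demo)
overlap = find_overlap(pos_demo, neg_demo)

print(overlap)                  # words listed as both positive and negative
print(valence_demo["envious"])  # the negative assignment wins
```

Running `find_overlap(pos, neg)` on the arrays loaded above would show whether the real lexicon contains any such ambiguous entries.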

Here's the simple word extraction function we defined in Lesson I

In [8]:
def extract_words(text):
    temp = text.split() # Split the text on whitespace
    text_words = []

    for word in temp:
        # Remove any punctuation characters present in the beginning of the word
        while word and word[0] in string.punctuation:
            word = word[1:]

        # Remove any punctuation characters present in the end of the word
        while word and word[-1] in string.punctuation:
            word = word[:-1]

        # Append this word into our list of words, skipping tokens
        # that were made up entirely of punctuation
        if word:
            text_words.append(word.lower())
        
    return text_words
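
The same stripping of leading and trailing punctuation can be sketched more compactly with `str.strip`, which also copes with tokens made up entirely of punctuation (such as a stand-alone `--`). The name `extract_words_strip` is hypothetical, and this is an alternative sketch rather than the function used in the rest of the lesson:

```python
import string


def extract_words_strip(text):
    """Split on whitespace, strip surrounding punctuation, and lowercase."""
    text_words = []

    for token in text.split():
        # str.strip removes any of the given characters from both ends
        word = token.strip(string.punctuation)

        # Tokens that were only punctuation become empty and are skipped
        if word:
            text_words.append(word.lower())

    return text_words


demo_words = extract_words_strip("Hello, world! -- (nice)")
print(demo_words)
```

Punctuation inside a word, such as the apostrophe in "I'm", is left untouched, just as in the loop-based version.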

We can now use it to define a sentiment-measuring function that returns the valence of a sentence or piece of text. Notice that we use the valence directly from the dictionary instead of treating positive and negative words separately. This will prove useful later on ;)

In [9]:
def sentiment(text, valence):
    words = extract_words(text.lower())
    
    word_count = 0
    score = 0
    
    for word in words:
        if word in valence:
            score += valence[word]
            word_count += 1
    
    # Texts with no words in the lexicon are treated as neutral
    if word_count == 0:
        return 0
    
    return score/word_count

Now let's test our simple code with some simple examples

In [10]:
texts = ["I'm very happy",
         "The product is pretty annoying, and I hate it",
         "I'm sad",
        ]

for text in texts:
    print(text, ':', sentiment(text, valence))

I'm very happy : 1.0
The product is pretty annoying, and I hate it : -0.3333333333333333
I'm sad : -1.0


This is a bit surprising. One might expect the second sentence to be strongly negative; after all, "pretty annoying" and "hate" sound pretty negative. However, since each word is taken by itself, regardless of context, we end up with:

In [11]:
words = extract_words(texts[1].lower())

for word in words:
    if word in valence:
        print(word, valence[word])

pretty 1
annoying -1
hate -1


We'll see in a bit how to handle cases like this, but the solution requires two important changes to our current approach: modifier words and real-valued weights.

## Modifiers

The first step is to define a dictionary of modifiers

In [12]:
modifiers = {
    "very": 1.5,
    "much": 1.3,
    "not": -1,
    "pretty": 1.5,
    "somewhat": 1.2
}

The second step is to change our sentiment-measuring function to take the modifiers into account.

In [13]:
def sentiment_modified(text, valence, modifiers, verbose=False):
    words = extract_words(text.lower())
    
    ngrams = [[]]
    
    # Generate ngrams: runs of consecutive modifier and valence words
    for word in words:
        if word in modifiers:
            ngrams[-1].append(word)
            continue

        if word in valence:
            ngrams[-1].append(word)
        else:
            # Any other word closes the current ngram
            if len(ngrams[-1]) > 0:
                ngrams.append([])

    # Remove the trailing empty ngram if necessary
    if len(ngrams[-1]) == 0:
        ngrams = ngrams[:-1]

    # Texts with no modifier or valence words are treated as neutral
    if len(ngrams) == 0:
        return 0

    score = 0

    for ngram in ngrams:
        value = 1

        # Multiply the weights of all the words in the ngram
        for word in ngram:
            if word in modifiers:
                value *= modifiers[word]
            elif word in valence:
                value *= valence[word]

        if verbose:
            print(ngram, value)

        score += value

    return score/len(ngrams)

This implementation is still relatively simple, but, as you can see, the results are already better.

In [14]:
print(texts[1])

The product is pretty annoying, and I hate it


In [15]:
sentiment_modified(texts[1], valence, modifiers, True)

['pretty', 'annoying'] -1.5
['hate'] -1


-1.25

A more complete implementation would be more careful in handling the modifiers, but since consecutive modifiers are already grouped into the same ngram, cases like this one also work:

In [16]:
sentiment_modified("It was not very good", valence, modifiers, True)

['not', 'very', 'good'] -1.5


-1.5

And even more complex (and unrealistic) examples work fine

In [17]:
sentiment_modified("It was not not very very good", valence, modifiers, True)

['not', 'not', 'very', 'very', 'good'] 2.25


2.25
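
To see where the 2.25 comes from, the weights in the ngram can be multiplied out by hand. The sketch below uses small stand-in dictionaries (`modifiers_demo` and `valence_demo`, names introduced here) that contain only the words in this sentence:

```python
from math import prod

# Stand-in dictionaries containing only the words needed for this example
modifiers_demo = {"very": 1.5, "not": -1}
valence_demo = {"good": 1}

ngram = ["not", "not", "very", "very", "good"]

# Look each word up as a modifier first, then as a valence word
factors = [modifiers_demo.get(w, valence_demo.get(w, 1)) for w in ngram]
value = prod(factors)  # (-1) * (-1) * 1.5 * 1.5 * 1

print(factors, value)
```

The two negations cancel and the two intensifiers compound, so the result exceeds the original valence of "good".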

## Continuous weights

VADER is a widely used, rule-based sentiment analysis tool. Here we will use its excellent and well-documented [lexicon](https://github.com/cjhutto/vaderSentiment) to explore non-binary weights. Its approach is significantly more advanced than what we present here, but some of the fundamental ideas are the same.

In [18]:
vader = pd.read_csv("data/vader_lexicon.txt", sep='\t', header=None)

The vader lexicon includes a lot of interesting information:

In [19]:
vader.head()

     0     1        2  3
0   $:  -1.5  0.80623  [-1, -1, -1, -1, -3, -1, -3, -1, -2, -1]
1   %)  -0.4  1.0198   [-1, 0, -1, 0, 0, -2, -1, 2, -1, 0]
2  %-)  -1.5  1.43178  [-2, 0, -2, -2, -1, 2, -2, -3, -2, -3]
3  &-:  -0.4  1.42829  [-3, -1, 0, 0, -1, -1, -1, 2, -1, 2]
4   &:  -0.7  0.64031  [0, -1, -1, -1, 1, -1, -1, -1, -1, -1]


Smileys are also included and, in addition to the average sentiment of each word (in column 1) and its standard deviation (in column 2), the lexicon provides the raw human-generated scores in column 3, so that we may easily check (and possibly modify) the weights. To extract the raw scores for the word "love" we could simply do:

In [20]:
print(vader.shape)

(7517, 4)


In [21]:
print(vader.iloc[4446])

0                              love
1                               3.2
2                               0.4
3    [3, 3, 3, 3, 3, 3, 3, 4, 4, 3]
Name: 4446, dtype: object


In [22]:
import ast
# literal_eval safely parses the list without executing arbitrary code
scores = ast.literal_eval(vader.iloc[4446][3])
print(scores)

[3, 3, 3, 3, 3, 3, 3, 4, 4, 3]


In [23]:
scores[8]

4

We can see that 8 out of 10 people thought that the word "love" should receive a score of 3, while the other two gave it a score of 4. This gives us insight into how uniform the scores are. If, for some reason, we thought that the two values of 4 were problematic, or simply not appropriate for our purposes, we could discard them and recalculate the valence of the word.

One justification for this might be the fact that the scores for the closely related word "loved" are noticeably different, with a wider range of variation in the human scores:

In [24]:
vader.iloc[4447]

0                             loved
1                               2.9
2                               0.7
3    [3, 3, 4, 2, 2, 4, 3, 2, 3, 3]
Name: 4447, dtype: object
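
The recalculation described above can be sketched with the standard library alone. The raw score strings are copied from the two rows shown above; the helper `summarize` and its `discard` parameter are hypothetical names introduced for this example. For these two rows, the population standard deviation reproduces the values in column 2:

```python
import ast
from statistics import mean, pstdev

# Raw score strings as shown in column 3 for "love" and "loved"
raw_love = "[3, 3, 3, 3, 3, 3, 3, 4, 4, 3]"
raw_loved = "[3, 3, 4, 2, 2, 4, 3, 2, 3, 3]"


def summarize(raw, discard=()):
    """Parse a raw score string and return (mean, population std),
    ignoring any scores listed in `discard`."""
    scores = [s for s in ast.literal_eval(raw) if s not in discard]
    return mean(scores), pstdev(scores)


love_mean, love_std = summarize(raw_love)
loved_mean, loved_std = summarize(raw_loved)

# Recalculate the valence of "love" after discarding the two scores of 4
trimmed_mean, trimmed_std = summarize(raw_love, discard=(4,))

print(round(love_mean, 2), round(love_std, 2))
print(round(loved_mean, 2), round(loved_std, 2))
print(trimmed_mean, trimmed_std)
```

Whether discarding scores is justified depends on the application; the sketch only shows that the recalculation itself is straightforward.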

Now we convert this dataset into a dictionary similar to the one we used above

In [25]:
valence_vader = dict(vader[[0,1]].values)

In [26]:
valence_vader['love']

3.2

To use this new dictionary we just have to modify the arguments to the sentiment_modified function:

In [27]:
sentiment_modified("It was not not very very good", valence_vader, modifiers, True)

['not', 'not', 'very', 'very', 'good'] 4.2749999999999995


4.2749999999999995

One important detail to keep in mind is that scores obtained through different methods are not directly comparable. In this example, the score of the sentence "It was not not very very good" went from 2.25 to roughly 4.27 when we switched dictionaries. This is due not only to different levels of coverage in different dictionaries but also to different choices in the possible ranges of values.
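
One simple way to reduce the effect of differing value ranges is to rescale each lexicon by its largest absolute valence before scoring. The sketch below illustrates the idea with toy dictionaries. The function `rescale` and the example values are assumptions made for illustration, and this is not part of VADER or of the lexicons themselves:

```python
def rescale(lexicon):
    """Divide every valence by the largest absolute valence in the
    lexicon so that all values fall in the interval [-1, 1]."""
    max_abs = max(abs(v) for v in lexicon.values())

    # An all-zero lexicon cannot be rescaled; return a copy unchanged
    if max_abs == 0:
        return dict(lexicon)

    return {word: v / max_abs for word, v in lexicon.items()}


# Toy lexicons: one binary, one graded (values chosen for illustration)
binary_demo = {"good": 1, "bad": -1}
graded_demo = {"good": 1.9, "bad": -2.5, "love": 3.2}

binary_scaled = rescale(binary_demo)
graded_scaled = rescale(graded_demo)

print(binary_scaled)
print(graded_scaled)
```

Rescaling only aligns the ranges of the word weights. It does not address differences in coverage, and modifiers can still push the score of an ngram outside [-1, 1].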

<div style="width: 100%; overflow: hidden;">
     <img src="data/D4Sci_logo_full.png" alt="Data For Science, Inc" align="center" border="0" width=300px> 
</div>