<h1 align="center">Natural Language Processing From Scratch</h1>
<h2 align="center">Bruno Gonçalves</h2>
<h4 align="center">bgoncalves@gmail.com</h4>
<h4 align="center">@bgoncalves</h4>

# Lesson 3 - Sentiment Analysis

In [1]:
import string
import gzip
from collections import Counter
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from sklearn.preprocessing import normalize

%matplotlib inline

# Word counting

We start with the simplest approach: counting positive and negative words. We'll use Hu and Liu's lexicon from their 2004 KDD paper: https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

In [2]:
pos = np.loadtxt('data/positive-words.txt', dtype='str', comments=';')
neg = np.loadtxt('data/negative-words.txt', dtype='str', comments=';')

Create a dictionary and assign a valence to each positive and negative word:

In [3]:
valence = {}

for word in pos:
    valence[word.lower()] = 1
    
for word in neg:
    valence[word.lower()] = -1

Here's the simple word extraction function we defined in Lesson I

In [4]:
def extract_words(text):
    temp = text.split() # Split the text on whitespace
    text_words = []

    for word in temp:
        # Remove any punctuation characters present in the beginning of the word
        # (the `word and` guard stops the loop once the word is empty)
        while word and word[0] in string.punctuation:
            word = word[1:]

        # Remove any punctuation characters present in the end of the word
        while word and word[-1] in string.punctuation:
            word = word[:-1]

        # Append this word into our list of words, skipping tokens
        # that consisted entirely of punctuation.
        if word:
            text_words.append(word.lower())
        
    return text_words
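
The same stripping can also be written more compactly with `str.strip`, which removes any leading and trailing characters found in its argument. The sketch below is an alternative, not the function used in the rest of this lesson; the name `extract_words_strip` is made up for illustration. Note that it drops tokens that consist only of punctuation.

```python
import string

def extract_words_strip(text):
    # Hypothetical alternative to extract_words, built on str.strip
    words = []

    for token in text.split():
        # Strip punctuation from both ends and lowercase the result
        word = token.strip(string.punctuation).lower()

        # Tokens made up only of punctuation become empty and are skipped
        if word:
            words.append(word)

    return words

print(extract_words_strip("Hello, world! -- (it's a test)"))
```

Punctuation inside a word, such as the apostrophe in "it's", is left untouched, since only the ends of each token are stripped.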

We can now use it to define a sentiment-measuring function that returns the valence of a sentence or piece of text. Notice that we use the valence directly from the dictionary instead of treating positive and negative words separately. This will prove useful later on ;)

In [5]:
def sentiment(text, valence):
    words = extract_words(text.lower())
    
    word_count = 0
    score = 0
    
    for word in words:
        if word in valence:
            score += valence[word]
            word_count += 1
            
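    # Text without any lexicon words is treated as neutral
    # (this also avoids a division by zero)
    if word_count == 0:
        return 0

    return score/word_count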

Now let's test our code with a few simple examples:

In [6]:
texts = ["I'm very happy",
         "The product is pretty annoying, and I hate it",
         "I'm sad",
        ]

for text in texts:
    print(sentiment(text, valence))

1.0
-0.3333333333333333
-1.0


This is a bit surprising. One might expect the second sentence to be strongly negative; after all, "pretty annoying" and "hate" sound pretty negative. However, since each word is taken by itself, regardless of context, we end up with:

In [7]:
words = extract_words(texts[1].lower())
for word in words:
    if word in valence:
        print(word, valence[word])

pretty 1
annoying -1
hate -1


We'll see later how to handle cases like this, but the solution requires two changes to our current approach: non-uniform weights and modifier words.
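
To make the first of these changes concrete, here is a minimal sketch of non-uniform weights. The lexicon and its numeric values are invented for illustration and are not part of Hu and Liu's lexicon, and `mean_valence` is a hypothetical helper. The point is only that a scoring function that reads the valence from a dictionary keeps working when the values are no longer restricted to +1 and -1.

```python
# Toy lexicon with made-up, non-uniform weights (illustrative values only)
weighted_valence = {
    "good": 1.0,
    "great": 2.0,
    "annoying": -1.0,
    "hate": -3.0,
}

def mean_valence(words, valence):
    # Average the valence of the words that appear in the lexicon
    scores = [valence[word] for word in words if word in valence]

    if not scores:
        return 0.0  # Assumption: text with no lexicon words is neutral

    return sum(scores) / len(scores)

words = "the product is annoying and i hate it".split()
print(mean_valence(words, weighted_valence))
```

With these made-up weights, "hate" counts three times as much as "annoying", so the average comes out at -2.0 rather than -1.0.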

# Modifiers

The first step is to define a dictionary of modifiers:

In [8]:
modifiers = {
    "very": 1.5,
    "much": 1.3,
    "not": -1,
    "pretty": 1.5,
    "somewhat": 1.2}

And to change our sentiment measuring function to take the modifiers into account.

In [9]:
def sentiment_modified(text, valence, modifiers, verbose=False):
    words = extract_words(text.lower())
    
    ngrams = [[]]
    
    # Generate ngrams: consecutive modifier and valence words are grouped
    # together, and any other word closes the current ngram
    for word in words:
        
        if word in modifiers:
            ngrams[-1].append(word)
            continue

        if word in valence:
            ngrams[-1].append(word)
        else:
            if len(ngrams[-1]) > 0:
                ngrams.append([])

    score = 0
    
    # Remove the trailing empty ngram if necessary
    if len(ngrams[-1]) == 0:
        ngrams = ngrams[:-1]

    for ngram in ngrams:
        value = 1

        for word in ngram:
            if word in modifiers:
                value *= modifiers[word]
            elif word in valence:
                value *= valence[word]

        if verbose:
            print(ngram, value)

        score += value

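    # Text without any sentiment-bearing ngrams is treated as neutral
    # (this also avoids a division by zero)
    if len(ngrams) == 0:
        return 0

    return score/len(ngrams)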

This implementation is still relatively simple, but, as you can see, the results are already better.

In [10]:
sentiment_modified(texts[1], valence, modifiers, True)

['pretty', 'annoying'] -1.5
['hate'] -1


-1.25

A more complete implementation would be more careful in handling the modifiers, but since consecutive modifiers are already grouped into a single larger ngram, cases like this one also work:

In [11]:
sentiment_modified("It was not very good", valence, modifiers, True)

['not', 'very', 'good'] -1.5


-1.5

Even more complex (and unrealistic) examples work fine:

In [12]:
sentiment_modified("It was not not very very good", valence, modifiers, True)

['not', 'not', 'very', 'very', 'good'] 2.25


2.25
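
The value printed for each ngram is the product of its modifier weights and the valence of its sentiment word. As a quick check of the arithmetic, the sketch below recomputes that product for the last example from a small hand-written table. The `weights` dictionary is a stand-in that assumes "good" has valence +1 in the lexicon.

```python
from math import prod

# Stand-in weights: "not" and "very" as in the modifiers dictionary,
# plus an assumed valence of +1 for "good"
weights = {"not": -1, "very": 1.5, "good": 1}

ngram = ["not", "not", "very", "very", "good"]

# Multiply the weight of every word in the ngram
value = prod(weights[word] for word in ngram)
print(value)
```

The two negations cancel and each "very" multiplies the score by 1.5, which is why the result exceeds the lexicon's maximum valence of 1.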