# Sentiment analysis

## Rule-based approach

NLTK is short for Natural Language ToolKit.

In [1]:
import nltk
print(nltk.__version__)

3.2.2


### Let's play around with VADER

We'll first import VADER from within the 'sentiment' module of NLTK.
This will display a warning, saying that we don't have the `twython` library installed, but that's OK. We won't be using that library.

Remember that VADER is a lexicon and a *rule-based* sentiment analysis tool. VADER is *NOT* using machine learning.

We'll take advantage of the VADER `SentimentIntensityAnalyzer` to do our calculations for us.

*Update from demo:* Apologies. Apparently I had installed just about every tool and library available for Jupyter Notebook, and one of them included `vader_lexicon`, which is required to use the `vader.SentimentIntensityAnalyzer()` mentioned below. To download this package, uncomment and execute the following line. In the window that appears (it might appear behind this browser window), choose the `models` tab, and **WHATEVER YOU DO, DO NOT SCROLL USING A SCROLL WHEEL OR A TOUCH PAD**. Use the scroll bar on the right to find `vader_lexicon`, choose install, and close the window. Alternatively, calling `nltk.download('vader_lexicon')` fetches just that package without opening the window.

You only need to do this once, so once it's installed you can safely comment the line out again.

In [2]:
#nltk.download()

In [3]:
from nltk.sentiment import vader
vanalyser = vader.SentimentIntensityAnalyzer()



We'll introduce a helper function to easily see what VADER thinks of a problem instance.

This function will return four values:
- negativity (neg)
- neutrality (neu)
- positivity (pos)
- compound: The 'total' score, calculated using a [non-trivial algorithm](http://stackoverflow.com/q/40325980).
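
To give a feel for where the compound score comes from, here is a minimal sketch of the normalisation step that squashes a summed word valence into the range -1 to 1. The function name `normalize`, the constant `alpha=15`, and the two lexicon valences are assumptions based on the published VADER source, not something defined in this notebook, so treat it as an illustration rather than the library's exact code path.

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded valence sum into the range [-1, 1]."""
    norm_score = score / math.sqrt(score * score + alpha)
    # Clamp to guard against floating-point overshoot
    return max(-1.0, min(1.0, norm_score))

# Assumed lexicon valences for two single words (not taken from this notebook)
TERRIBLE = -2.1
GOOD = 1.9

print(round(normalize(TERRIBLE), 4))
print(round(normalize(GOOD), 4))
```

If the assumed valences are right, a sentence whose only sentiment-bearing word is "terrible" or "good" should end up with exactly these compound scores.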

In [4]:
def vanalyse(sample):
    return vanalyser.polarity_scores(sample)

Let's try out some sentences, and see what VADER thinks of them!

In [5]:
vanalyse("What a terrible restaurant")

{'compound': -0.4767, 'neg': 0.608, 'neu': 0.392, 'pos': 0.0}

The experts predict that the sentiment of the sentence "What a terrible restaurant" was mostly negative, with some neutral, and no positive sentiment. Fairly accurate.

Notice that VADER understands emoticons!

In [6]:
vanalyse(":D")

{'compound': 0.5106, 'neg': 0.0, 'neu': 0.0, 'pos': 1.0}

But not necessarily all emoticons.

In [7]:
vanalyse("-,-")

{'compound': 0.0, 'neg': 0.0, 'neu': 1.0, 'pos': 0.0}

It understands how punctuation and capitalisation act as boosters.

In [8]:
vanalyse("the food was good")

{'compound': 0.4404, 'neg': 0.0, 'neu': 0.508, 'pos': 0.492}

In [9]:
vanalyse("the food was good!")

{'compound': 0.4926, 'neg': 0.0, 'neu': 0.484, 'pos': 0.516}

In [10]:
vanalyse("the food was GOOD!")

{'compound': 0.6027, 'neg': 0.0, 'neu': 0.433, 'pos': 0.567}
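
The three scores above can be reproduced with a small sketch of the booster arithmetic. The constants (1.9 for "good", 0.292 per exclamation mark, 0.733 for an all-caps word in otherwise mixed-case text) and the helper name `boosted_compound` are assumptions drawn from the published VADER source, so this is an approximation of the rule, not the library's implementation.

```python
import math

GOOD_VALENCE = 1.9         # assumed lexicon valence for "good"
EXCLAMATION_BOOST = 0.292  # assumed boost per "!" (at most four are counted)
CAPS_BOOST = 0.733         # assumed boost for an ALL-CAPS word in mixed-case text

def normalize(score, alpha=15):
    """Squash an unbounded valence sum into the range [-1, 1]."""
    return score / math.sqrt(score * score + alpha)

def boosted_compound(valence, exclamations=0, all_caps=False):
    """Hypothetical helper: apply the boosters to one word, then normalise."""
    total = valence
    if all_caps and valence != 0:
        # Capitalisation pushes the word further in its own direction
        total += CAPS_BOOST if valence > 0 else -CAPS_BOOST
    emphasis = min(exclamations, 4) * EXCLAMATION_BOOST
    if total > 0:
        total += emphasis
    elif total < 0:
        total -= emphasis
    return round(normalize(total), 4)

print(boosted_compound(GOOD_VALENCE))                                # "good"
print(boosted_compound(GOOD_VALENCE, exclamations=1))                # "good!"
print(boosted_compound(GOOD_VALENCE, exclamations=1, all_caps=True)) # "GOOD!"
```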

It even understands double-negatives!

In [11]:
vanalyse("the food was not the worst")

{'compound': 0.5096, 'neg': 0.0, 'neu': 0.603, 'pos': 0.397}

But it is not flawless:

In [12]:
vanalyse("I usually hate seafood, but I liked this") #Perceived as positive (True positive)

{'compound': 0.3291, 'neg': 0.234, 'neu': 0.398, 'pos': 0.368}

In [13]:
vanalyse("I usually hate seafood, and I liked this") #Perceived as negative (False negative)

{'compound': -0.2263, 'neg': 0.352, 'neu': 0.381, 'pos': 0.267}
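
Why does swapping "but" for "and" flip the result? A plausible explanation is VADER's contrastive-conjunction rule: sentiment before "but" is dampened and sentiment after it is amplified. The sketch below assumes the weights 0.5 and 1.5 and the valences -2.7 ("hate") and 1.8 ("liked") from the published VADER source; the helper name `compound_with_but` is hypothetical.

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded valence sum into the range [-1, 1]."""
    return score / math.sqrt(score * score + alpha)

def compound_with_but(valences, but_index=None):
    """
    valences:  one valence per word (0 for neutral words).
    but_index: position of the word "but", or None if there is none.
    """
    total = 0.0
    for i, valence in enumerate(valences):
        if but_index is not None and i < but_index:
            valence *= 0.5   # dampen what comes before "but"
        elif but_index is not None and i > but_index:
            valence *= 1.5   # amplify what comes after "but"
        total += valence
    return round(normalize(total), 4)

# Words: I usually hate seafood, but/and I liked this
valences = [0, 0, -2.7, 0, 0, 0, 1.8, 0]

print(compound_with_but(valences, but_index=4))  # the "but" sentence
print(compound_with_but(valences))               # the "and" sentence
```

Under these assumptions the sketch lands on the same two compound scores as the cells above. With "and", the word "hate" simply outweighs "liked".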

## Using VADER on our dataset

Let's see how VADER does over a large dataset of reviews where we know the polarity of each problem instance.

The first thing we need to do is import the data. If you are on a Mac or Linux machine, the commands below should just work. If you're on a Windows machine, you might have to change the forward slashes in the strings below to *two* backslashes.

In [14]:
#Technical note: The gorram encoding is latin1, not UTF-8!
with open("./data/negative-reviews.txt", "r", encoding="latin1") as file:
    negativeReviews = file.readlines()
with open("./data/positive-reviews.txt", "r", encoding="latin1") as file:
    positiveReviews = file.readlines()

We now have the reviews in two lists - one list for the negative reviews, one for the positive reviews.

Let's create another helper function that only returns the compound (total) score of the VADER analysis.

In [15]:
def vaderSentiment(review):
    return vanalyse(review)['compound']

Check that the method works:

In [16]:
vaderSentiment("Adam Sandler is a terrible actor")

-0.4767

We now consider how we'll measure the quality of our models, keeping in mind that we'll be implementing another model using ML later. 

Let's create a function that takes a scoring function as input and applies it to every review in both lists. It returns a dictionary (hash map) with two keys, 'pos' and 'neg', which map to the lists of scores for the positive and negative reviews respectively.

In [17]:
def getReviewSentiments(sentimentCalculator):
    """
    Given a function that calculates the sentiment of a single review, return a
    dictionary with the sentiment lists for the positive ('pos') and negative ('neg') reviews.
    """
    
    #Python syntax explainer:
    #For each review in positiveReviews we assign each review to a variable 'posReview'.
    #Calculate the sentiment of 'posReview', and put it in the 'positiveResults' list.
    #Repeat for all positive reviews.
    #Do the same for the negative reviews.

    positiveResults = [sentimentCalculator(posReview) for posReview in positiveReviews]
    negativeResults = [sentimentCalculator(negReview) for negReview in negativeReviews]
    return {'pos': positiveResults, 'neg': negativeResults}

Since we want to evaluate how well our models are doing, we need a way of measuring this. Let's create a function that prints the percentage of the reviews a model classifies correctly. If we write our code carefully, we can then reuse this function when we start implementing an ML solution.

In [18]:
def runDiagnostics(reviewSentiments):
    positiveReviews = reviewSentiments['pos']
    negativeReviews = reviewSentiments['neg']

    #How many reviews are there of each kind?
    numberOfNegative = len(negativeReviews)
    numberOfPositive = len(positiveReviews)
    numberOfReviews = numberOfNegative + numberOfPositive
    
    #How many reviews were correct for each kind?
    totalTruePositives = float(sum(x > 0 for x in positiveReviews))
    totalTrueNegatives = float(sum(x < 0 for x in negativeReviews))

    #Convert to percentages
    pctTotalAccurate = (totalTruePositives + totalTrueNegatives) * 100 / numberOfReviews
    pctTruePositive = totalTruePositives * 100 / numberOfPositive
    pctTrueNegative = totalTrueNegatives * 100 / numberOfNegative

    #Print, and format percentages to have 2 decimal digits
    print("Accuracy on positive reviews = " + "%.2f" %(pctTruePositive) + "%")
    print("Accuracy on negative reviews = " + "%.2f" %(pctTrueNegative) + "%")
    print("Overall accuracy = " + "%.2f" %(pctTotalAccurate) + "%")
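
One detail of this scoring is easy to miss: a score of exactly 0 counts as wrong for both classes, since it is neither `> 0` nor `< 0`. The sketch below repeats the same arithmetic on made-up scores but returns the percentages instead of printing them; the name `accuracy_by_class` and the toy data are hypothetical.

```python
def accuracy_by_class(review_sentiments):
    """Return per-class and overall accuracy (in percent) for signed scores."""
    pos_scores = review_sentiments['pos']
    neg_scores = review_sentiments['neg']

    # A positive review is correct if its score is > 0,
    # a negative review is correct if its score is < 0.
    true_pos = sum(score > 0 for score in pos_scores)
    true_neg = sum(score < 0 for score in neg_scores)
    total = len(pos_scores) + len(neg_scores)

    return {
        'pos': 100 * true_pos / len(pos_scores),
        'neg': 100 * true_neg / len(neg_scores),
        'overall': 100 * (true_pos + true_neg) / total,
    }

# Made-up scores: each list contains one neutral (0.0) and one wrong-signed score
toy_scores = {'pos': [0.5, 0.0, 0.3, -0.1], 'neg': [-0.4, 0.0, 0.2, -0.6]}
print(accuracy_by_class(toy_scores))
```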

### Moment of truth: *How well does VADER do?*

In [19]:
reviewSentiments = getReviewSentiments(vaderSentiment)
runDiagnostics(reviewSentiments)

Accuracy on positive reviews = 69.46%
Accuracy on negative reviews = 40.11%
Overall accuracy = 54.78%


To paraphrase the late Hans Rosling: Overall only slightly better than chimpanzees.

VADER has decent accuracy when it comes to positive reviews, but a chimpanzee is better at determining if a review is negative or not.

It is important that our model is equally good at identifying true positives and true negatives. The VADER model appears to be biased towards giving positive scores. Not good.

## Machine Learning approach

We have the luxury of having pretty clean data (no dupes, same number of positive and negative reviews), which allows us to go almost straight into writing the training step! We just need to do a bit of reformatting.

The first thing we'll do is split the data into a training set and a test set. To start, let's see how many reviews we have in each list:

In [20]:
print(len(positiveReviews))
print(len(negativeReviews))

5331
5331


So for simplicity, let's use 2500 reviews from each list as our training set. That is, only ~47% of our data is used to train the model. Usually this number is around 60-80%. If this makes you uneasy, we'll go back and change it later on.

In [21]:
splitIndex = 2500

trainingPositiveReviews = positiveReviews[:splitIndex]
trainingNegativeReviews = negativeReviews[:splitIndex]

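#Note: slicing from splitIndex+1 leaves the review at index splitIndex out of both sets.
#Slicing from splitIndex would include it in the test set.
testPositiveReviews = positiveReviews[splitIndex+1:]
testNegativeReviews = negativeReviews[splitIndex+1:]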
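
As a quick illustration of how list slicing divides a dataset, here is a minimal sketch on a toy list. The helper name `split_reviews` is hypothetical. It shows that `[:k]` and `[k:]` together cover every item, whereas starting the second slice at `k+1` skips the item at index `k`.

```python
def split_reviews(reviews, split_index):
    """Return (training, test) so that every review lands in exactly one set."""
    return reviews[:split_index], reviews[split_index:]

toy_reviews = ["r0", "r1", "r2", "r3", "r4"]

training, test = split_reviews(toy_reviews, 2)
print(training, test)

# Starting the test slice one position later leaves "r2" in neither set
print(toy_reviews[2 + 1:])
```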

Now we define our vocabulary. This is the list of all unique words present in our *training data*.

In [22]:
def getVocabulary():
    #Get list of all words (incl. repetition) contained in the positive reviews.
    #Repeat for the negative reviews.
    positiveWordList = [word for line in trainingPositiveReviews for word in line.split()]
    negativeWordList = [word for line in trainingNegativeReviews for word in line.split()]
    
    #Combine the words in to one big list
    allWordList = positiveWordList + negativeWordList
    
    #Remove duplicates, and return
    return list(set(allWordList))

Well now, how big is our vocabulary? As a reference, a person with a vocabulary of ~10,000 words is considered fluent in English by many. That being said, capitalisation and punctuation significantly increase the number of entries in our vocabulary list.

In [23]:
vocabulary = getVocabulary()
len(vocabulary)

14094
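
To see how much capitalisation and punctuation can inflate a vocabulary, here is a small sketch on made-up reviews. The function `build_vocabulary` and its `normalise` flag are hypothetical and not part of the notebook. Whether normalising would shrink the vocabulary above depends on how the review files were tokenised, which this sketch does not check.

```python
import string

def build_vocabulary(reviews, normalise=False):
    """Collect the unique whitespace-separated tokens in a list of reviews."""
    words = set()
    for line in reviews:
        for word in line.split():
            if normalise:
                # Lower-case and strip leading/trailing punctuation
                word = word.lower().strip(string.punctuation)
                if not word:
                    continue  # the token was punctuation only
            words.add(word)
    return words

toy_reviews = ["Great film !", "great FILM.", "Not great"]

print(len(build_vocabulary(toy_reviews)))                  # raw tokens
print(len(build_vocabulary(toy_reviews, normalise=True)))  # normalised tokens
```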

Eventually we want to pass the training data to an algorithm provided by NLTK. This data needs to have a specific format, so let's create a function that transforms the data into the correct format.

In [24]:
def getTrainingData():
    negTaggedTrainingReviewList = [{'review':oneReview.split(), 'label': 'negative'} for oneReview in trainingNegativeReviews]
    posTaggedTrainingReviewList = [{'review':oneReview.split(), 'label': 'positive'} for oneReview in trainingPositiveReviews]
    
    fullTaggedTrainingData = negTaggedTrainingReviewList + posTaggedTrainingReviewList
    
    #Technical note: A list of (sampleValues, label) tuples,
    #  where sampleValues is a list of each individual word in the sample.
    return [(review['review'], review['label']) for review in fullTaggedTrainingData]

In [25]:
trainingData = getTrainingData()
print(trainingData[0])
print(len(trainingData)) #Should = 2 * splitIndex.

(['simplistic', ',', 'silly', 'and', 'tedious', '.'], 'negative')
5000


We can see that the first element of each tuple is a list of all the words in our problem instance, and the second element is the label 'positive' or 'negative'.

In [26]:
def extractFeatures(review):
    #Remove duplicates from the review.
    #Considerations: is this the right thing to do?
    #Yes: We are using Naive Bayes, and two identical words among the features
    #would violate the independence assumption.
    #No: Repeating a word often acts as a booster:
    #E.g. "Never ever ever ever watch this film"
    reviewWords = set(review)

    features = {}
    for word in vocabulary:
        features[word]=(word in reviewWords)
    return features
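
To make the shape of these features concrete, here is the same idea on a three-word vocabulary. This version takes the vocabulary as an argument so it runs on its own; the name `extract_features` and the toy vocabulary are hypothetical.

```python
def extract_features(review_words, vocabulary):
    """Map every vocabulary word to True/False: does it occur in the review?"""
    present = set(review_words)  # duplicates collapse here
    return {word: (word in present) for word in vocabulary}

toy_vocabulary = ['great', 'film', 'tedious']
features = extract_features("great great film".split(), toy_vocabulary)
print(features)
```

Note that the repeated "great" leaves no trace in the result: the features only record presence, not counts. With the real vocabulary, every review becomes a dictionary with one entry per vocabulary word, which is part of why training takes a while.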

Now for the actual training of our model. This is very simple in Python.

In [27]:
def getTrainedNaiveBayesClassifier(extractFeatures, trainingData):
    #Turn the training data into a list of feature vectors
    trainingFeatures = nltk.classify.apply_features(extractFeatures, trainingData)
    return nltk.NaiveBayesClassifier.train(trainingFeatures)
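
For intuition about what the training step does, here is a minimal pure-Python sketch of Naive Bayes over boolean features. It is not NLTK's implementation: the function names are hypothetical, and the add-one smoothing is an assumption chosen for simplicity (NLTK's default estimator may smooth differently).

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_features):
    """labelled_features: list of (feature_dict, label) pairs with boolean values."""
    label_counts = Counter()
    true_counts = defaultdict(Counter)  # label -> feature -> times it was True
    feature_names = set()
    for features, label in labelled_features:
        label_counts[label] += 1
        for name, value in features.items():
            feature_names.add(name)
            if value:
                true_counts[label][name] += 1
    return label_counts, true_counts, feature_names

def classify(model, features):
    """Return the label with the highest log-probability for the given features."""
    label_counts, true_counts, feature_names = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior P(label)
        for name in feature_names:
            # P(feature is True | label), add-one smoothed over True/False
            p_true = (true_counts[label][name] + 1) / (count + 2)
            is_present = features.get(name, False)
            score += math.log(p_true if is_present else 1 - p_true)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

toy_training = [
    ({'great': True, 'tedious': False}, 'positive'),
    ({'great': True, 'tedious': False}, 'positive'),
    ({'great': False, 'tedious': True}, 'negative'),
    ({'great': False, 'tedious': True}, 'negative'),
]
model = train_naive_bayes(toy_training)
print(classify(model, {'great': True, 'tedious': False}))
print(classify(model, {'great': False, 'tedious': True}))
```

The "naive" part is the inner loop: each feature contributes its own factor, as if the words were independent of one another given the label.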

This next bit takes quite a while...

In [28]:
trainedNBClassifier = getTrainedNaiveBayesClassifier(extractFeatures, trainingData)

We have now trained our classifier. Let's try it out!

In [34]:
def naiveBayesSentimentCalculator(review):
    problemInstance = review.split()
    problemFeatures = extractFeatures(problemInstance)
    return trainedNBClassifier.classify(problemFeatures)

naiveBayesSentimentCalculator("great film")

'positive'

In [38]:
naiveBayesSentimentCalculator("adam sandler sucks")

'negative'

In [37]:
#Create a test harness so we can check the entire test dataset:
def getTestReviewSentiments(naiveBayesSentimentCalculator):
    testNegResults = [naiveBayesSentimentCalculator(review) for review in testNegativeReviews]
    testPosResults = [naiveBayesSentimentCalculator(review) for review in testPositiveReviews]
    
    labelToNum = {'positive':1, 'negative':-1}
    
    numericNegResults = [labelToNum[x] for x in testNegResults]
    numericPosResults = [labelToNum[x] for x in testPosResults]
    
    return {'pos': numericPosResults, 'neg': numericNegResults}


In [33]:
runDiagnostics(getTestReviewSentiments(naiveBayesSentimentCalculator)) #Takes a while

Accuracy on positive reviews = 73.39%
Accuracy on negative reviews = 77.03%
Overall accuracy = 75.21%


Well, that's it: we're screwed. The machines will soon be our new rulers.
The simple Naive Bayes model we just trained significantly outperformed VADER, and that lexicon was made by human experts.

...right?