# BDTA Lesson 15: Simple Sentiment Analysis Example

This notebook shows how dictionary-based sentiment analysis can work. It is based on Neal Caren's [An introduction to text analysis with Python, Part 1](http://nealcaren.web.unc.edu/an-introduction-to-text-analysis-with-python-part-1/).

The program shows how to analyze a couple of sentences.

## Setting up our data

Here we will define the text to test and our positive and negative dictionaries. The text is stored as a string, and each dictionary is stored as a list of strings.

In [28]:
theText = "No food is good food. Ha. I'm on a diet and the food is awful and lame."
positiveWords=['awesome','good','nice','super','fun','delightful']
negativeWords=['awful','lame','horrible','bad']

## Tokenizing on sentences

Now we will divide the text into sentences so we can get the sentiment of each sentence.

In [13]:
from nltk.tokenize import sent_tokenize

In [29]:
help(sent_tokenize)

Help on function sent_tokenize in module nltk.tokenize:

sent_tokenize(text, language='english')
    Return a sentence-tokenized copy of *text*,
    using NLTK's recommended sentence tokenizer
    (currently :class:`.PunktSentenceTokenizer`
    for the specified language).
    
    :param text: text to split into sentences
    :param language: the model name in the Punkt corpus



In [14]:
# This function takes a text and tokenizes it into a list of sentences
def tokenSentences(textIn):
    theSentences = sent_tokenize(textIn) # Remember to use local variable
    return theSentences

# Test how many sentences we get
len(tokenSentences(theText))

3

## Tokenizing the text

Now we will create a function that tokenizes a sentence into lowercase words and test it. It uses the **re** module (Regular Expression Operations).

In [16]:
import re

def tokenizer(txt2Token): # Function for tokenizing
    theTokens = re.findall(r'\b\w[\w-]*\b', txt2Token.lower())
    return theTokens
    
# We test whether the function works with the first sentence from the list above
tokensOfSent = tokenizer(tokenSentences(theText)[0])
print(tokensOfSent)

['no', 'food', 'is', 'good', 'food']
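
To see what the regular expression keeps and what it drops, here is a small sketch. The sample string is made up for illustration and is not part of the lesson data. The pattern keeps hyphenated words together but splits on apostrophes.

```python
import re

def tokenizer(txt2Token):
    # Same pattern as above: one word character, then any run of
    # word characters or hyphens, bounded by word boundaries
    return re.findall(r'\b\w[\w-]*\b', txt2Token.lower())

sample = "I'm on a well-known diet."  # hypothetical sample text
sample_tokens = tokenizer(sample)
print(sample_tokens)
```

Note that "I'm" comes back as two tokens, `'i'` and `'m'`, which affects the total word count used later for percentages.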


## Calculating associations of words

Now we will create a function that counts the number of positive or negative words. The idea is that we pass the function a list of tokens (from the sentence) and a list of words that carry emotion. It then counts how many emotion words are in the list of tokens.

In [17]:
# Function that counts how many target words are in a list of tokens
def countSentimentalTokens(listOfTokens,listOfTargetWords):
    numTargetWords = 0
    matchedWords = []
    for token in listOfTokens: # Goes through the tokens in the list
        if token in listOfTargetWords: # For each one it checks if it is in the target list
            numTargetWords += 1
            matchedWords.append(token)
    return numTargetWords, matchedWords # Note that we are returning a tuple (2 values)

theTuple = countSentimentalTokens(tokensOfSent,positiveWords)
print(str(theTuple[0]) + " " + str(theTuple[1]))

1 ['good']
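
For longer dictionaries, checking membership in a list gets slow. A minimal alternative sketch, assuming the same inputs as above: convert the target words to a set once and unpack the returned tuple into two names. The function name `count_with_set` is hypothetical and not part of the lesson.

```python
def count_with_set(list_of_tokens, list_of_target_words):
    # Hypothetical variant of countSentimentalTokens():
    # a set makes each "in" check fast, even for large dictionaries
    targets = set(list_of_target_words)
    matched_words = [token for token in list_of_tokens if token in targets]
    return len(matched_words), matched_words

tokens = ['no', 'food', 'is', 'good', 'food']
positive = ['awesome', 'good', 'nice', 'super', 'fun', 'delightful']

# Tuple unpacking: the two returned values go into two variables
count, matched = count_with_set(tokens, positive)
print(count, matched)
```

Unpacking into `count` and `matched` reads more clearly than indexing the tuple with `[0]` and `[1]`.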


## Calculating percentage

Now we can calculate the percentages of positive and negative words.

In [18]:
def calculatePercent(listOfTokens,positiveList,negativeList):
    numWords = len(listOfTokens) # How many words total
    
    # We call the function to count the tokens from the positive list in the sentence
    positiveMatches = countSentimentalTokens(listOfTokens,positiveList) 
    percntPos = positiveMatches[0] / numWords # We divide by the total number of words for percentage
    
    # We call the function to count the tokens from the negative list in the sentence
    negativeMatches = countSentimentalTokens(listOfTokens,negativeList)
    percntNeg = negativeMatches[0] / numWords # We divide by the total number of words for percentage

    return percntPos, percntNeg # We return the percentage of positive and negative words

# We test the function on the first sentence
results = calculatePercent(tokensOfSent,positiveWords,negativeWords)
print("Positive: " + "{:.0%}".format(results[0]) + "  Negative: " + "{:.0%}".format(results[1]))

Positive: 20%  Negative: 0%
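
The division above fails with a `ZeroDivisionError` if a sentence produces no tokens (for example, a sentence made only of punctuation). One possible guard, sketched here with a hypothetical helper named `safe_fraction`, is to return 0.0 in that case.

```python
def safe_fraction(num_matches, num_words):
    # Hypothetical helper: avoid dividing by zero for empty token lists
    if num_words == 0:
        return 0.0
    return num_matches / num_words

# One positive word out of five tokens, as in the first sentence
print("{:.0%}".format(safe_fraction(1, 5)))

# An empty sentence gives 0% instead of raising an error
print("{:.0%}".format(safe_fraction(0, 0)))
```

Whether an empty sentence should count as 0% or be skipped entirely is a design choice; this sketch assumes 0% is acceptable.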


## Calculate sentiment

Here we calculate a sentiment score for a sentence: a score above zero means more positive than negative words, and a score below zero means the opposite.

In [19]:
def calculateSentiment(percntPos,percntNeg):
    sentiment = percntPos - percntNeg # Subtract the percentage of negative words from positive words
    return sentiment

# Test what we get
calculateSentiment(results[0],results[1])

0.2

## Process sentences

Finally, we have a function that can process a text. It first tokenizes the text into sentences using the function above. It then processes each sentence:
* It tokenizes the sentence
* It calculates the percentages of positive and negative words by calling the function above
* It calculates the sentiment score (positive if above zero, negative if below)
* It returns a list of the sentiment calculations

In [20]:
def processText(textIn,posMatchWords,negMatchWords):
    listOfSentences = tokenSentences(textIn) # Tokenize the text
    
    listOfSentiments = []
    for sentence in listOfSentences: # Process sentence by sentence
        sentTokens = tokenizer(sentence) # Tokenize the sentences
        percentages = calculatePercent(sentTokens,posMatchWords,negMatchWords) # Calculates percents
        theSentiment = calculateSentiment(percentages[0],percentages[1]) # Calculates sentiment
        listOfSentiments.append(theSentiment) # Appends sentiment to list
        
    return listOfSentiments # Return the final list

# Test the function
theFinalList = processText(theText,positiveWords,negativeWords)
theFinalList

[0.2, 0.0, -0.16666666666666666]
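
The scores can be turned into readable labels. This is a minimal sketch, assuming that a score of exactly zero should be treated as neutral; the function name `label_sentiment` is hypothetical.

```python
def label_sentiment(score):
    # Hypothetical helper: map a numeric score to a label
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # Assumption: zero means neither positive nor negative

scores = [0.2, 0.0, -0.16666666666666666]  # The list returned above
labels = [label_sentiment(score) for score in scores]
print(labels)
```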

----
# Exercise

Plot the list of sentiment calculations to see if there is a trend.

**Optional**: 
* Can you edit the appropriate functions to return a list of all the positive words used? Note that the function `countSentimentalTokens()` already returns a list of the matched words. What else do you need to change so that the list reaches the final result?
* Can you open a text file and process it?

---
# Homework

Create a notebook that can calculate the sentiment of tweets using dictionaries of positive and negative words. Instead of getting the data from within the notebook, use a prepared list of positive and negative words and a list of tweets. Here are the links:

- Positive data http://www.unc.edu/~ncaren/haphazard/positive.txt
- Negative data http://www.unc.edu/~ncaren/haphazard/negative.txt
- Obama tweets http://www.unc.edu/~ncaren/haphazard/obama_tweets.txt

You should download these files and then add code that opens the text files and splits their contents on newlines (`\n`):

    posTokens = posText.split("\n")

This will give you a list of positive tokens or negative tokens or a list of tweets.
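
As a starting point, reading a file and splitting it might look like the sketch below. The helper `load_lines` and the demo file are assumptions for illustration; here a small temporary file stands in for the downloaded `positive.txt`. Blank lines are dropped so that they do not become empty tokens.

```python
import os
import tempfile

def load_lines(path):
    # Hypothetical helper: read a text file and return its non-empty lines
    with open(path, encoding="utf-8") as f:
        text = f.read()
    return [line.strip() for line in text.split("\n") if line.strip()]

# Demo with a throwaway file standing in for a downloaded word list
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "positive.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("good\nnice\n\nfun\n")
    posTokens = load_lines(demo_path)

print(posTokens)
```

The real files may use a different encoding, so the `encoding="utf-8"` argument is an assumption you may need to adjust.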