Basic Sentiment Analysis

Miki Seltzer
Eric Whyne
Jasen Jones

To get this working, make sure nltk is installed and its data packages are downloaded. The first command below runs in a shell; the second runs inside Python:

####sudo pip install nltk
####nltk.download()

The download part opens a new dialogue box that allows you to download all the packages.

We also need a couple of other libraries:

####sudo pip install pyyaml

That one is for importing dictionaries - these include our positive, negative, incrementing, decrementing, and inverting words.  We have to build those dictionaries ourselves, but this makes it easy to put those in a file (or to import existing ones).

####sudo pip install pprintpp

This library just makes it easier for us to see each token on its own line. Note that the code below imports Python's built-in pprint module, which needs no installation.

In [2]:
import nltk
import nltk.data
import yaml
import pprint
import json
import numpy as np

In [3]:
#If you have not downloaded the NLTK files, this will do it for you:
#nltk.download()

In [4]:
##This is taken directly from http://fjavieralba.com/basic-sentiment-analysis-with-python.html
##Two classes that split, tokenize, and tag.

class Splitter(object):
    '''Splits sentences into individual tokens'''
    def __init__(self):
        self.nltk_splitter = nltk.data.load('nltk:tokenizers/punkt/english.pickle')
        self.nltk_tokenizer = nltk.tokenize.TreebankWordTokenizer()

    def split(self, text):
        """
        input format: a paragraph of text
        output format: a list of lists of words.
            e.g.: [['this', 'is', 'a', 'sentence'], ['this', 'is', 'another', 'one']]
        """
        sentences = self.nltk_splitter.tokenize(text)
        tokenized_sentences = [self.nltk_tokenizer.tokenize(sent) for sent in sentences]
        return tokenized_sentences


class POSTagger(object):
    '''Assigns a part-of-speech tag to each token'''
    def __init__(self):
        pass
        
    def pos_tag(self, sentences):
        """
        input format: list of lists of words
            e.g.: [['this', 'is', 'a', 'sentence'], ['this', 'is', 'another', 'one']]
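        output format: list of lists of tagged tokens. Each tagged token has a
        form, a lemma, and a list of tags. No lemmatizer is run here, so the lemma
        is just a copy of the form; the lemmas and tags in the example are illustrative.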
            e.g: [[('this', 'this', ['DT']), ('is', 'be', ['VB']), ('a', 'a', ['DT']), ('sentence', 'sentence', ['NN'])],
                    [('this', 'this', ['DT']), ('is', 'be', ['VB']), ('another', 'another', ['DT']), ('one', 'one', ['CARD'])]]
        """

        pos = [nltk.pos_tag(sentence) for sentence in sentences]
        #adapt format
        pos = [[(word, word, [postag]) for (word, postag) in sentence] for sentence in pos]
        return pos

In [5]:
#This section calls the functions to split, tokenize, and tag the text.

text = """What can I say about this place. The staff of the restaurant is nice and the eggplant is not bad. Apart from that, very uninspired food, lack of atmosphere and too expensive. I am a staunch vegetarian and was sorely dissapointed with the veggie options on the menu. Will be the last time I visit, I recommend others to avoid."""

splitter = Splitter()
postagger = POSTagger()

splitted_sentences = splitter.split(text)

pos_tagged_sentences = postagger.pos_tag(splitted_sentences)


The next class (DictionaryTagger) tags a token with a sentiment based on dictionary values we store in various yaml dictionary files.  There are five types of sentiment we use, each affecting the final sentiment value:

Positive: Our core positive words
Examples: Great, good, best, etc.

Negative: Our core negative words
Examples: Awful, stupid, terrible, etc.

Incrementers: Words that increase the strength of the next word
Examples: Totally, extremely, absolutely, etc.

Decrementers: Words that decrease the strength of the next word
Examples: Kinda, sorta, etc.

Inverters: Words that reverse the polarity of the next word
Examples: Not, aren't, can't, etc.
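
As a rough illustration of the five categories, the sketch below builds a tiny in-memory lexicon in the shape the tagger appears to expect: each word maps to a list of tags. The tag names 'positive', 'negative', 'inc', 'dec', and 'inv' are the ones used by the tagged output and the scoring code in this notebook. The word lists, the `toy_lexicon` name, and the `tags_for` helper are made up for the example and are not the contents of the real .yml files.

```python
# Hypothetical miniature lexicon: word -> list of sentiment tags.
toy_lexicon = {
    'great': ['positive'],
    'good': ['positive'],
    'awful': ['negative'],
    'terrible': ['negative'],
    'extremely': ['inc'],   # incrementer
    'kinda': ['dec'],       # decrementer
    'not': ['inv'],         # inverter
}

def tags_for(word, lexicon):
    """Return the sentiment tags for a word, or an empty list if it is unknown."""
    return lexicon.get(word.lower(), [])

for word in ['Great', 'not', 'kinda', 'table']:
    print('%s -> %s' % (word, tags_for(word, toy_lexicon)))
```

Words that are missing from every dictionary simply get no sentiment tag, which is why most tokens in a sentence end up contributing nothing to the score.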

In [6]:
class DictionaryTagger(object):

    def __init__(self, dictionary_paths):
        dictionaries = []
        for path in dictionary_paths:
            #safe_load parses plain data only; the with-block closes each file promptly
            with open(path, 'r') as dict_file:
                dictionaries.append(yaml.safe_load(dict_file))
        self.dictionary = {}
        self.max_key_size = 0
        for curr_dict in dictionaries:
            for key in curr_dict:
                if key in self.dictionary:
                    self.dictionary[key].extend(curr_dict[key])
                else:
                    self.dictionary[key] = curr_dict[key]
                    self.max_key_size = max(self.max_key_size, len(key))

    def tag(self, postagged_sentences):
        return [self.tag_sentence(sentence) for sentence in postagged_sentences]

    def tag_sentence(self, sentence, tag_with_lemmas=False):
        """
        the result is only one tagging of all the possible ones.
        The resulting tagging is determined by these two priority rules:
            - longest matches have higher priority
            - search is made from left to right
        """
        tag_sentence = []
        N = len(sentence)
        if self.max_key_size == 0:
            self.max_key_size = N
        i = 0
        while (i < N):
            j = min(i + self.max_key_size, N) #avoid overflow
            tagged = False
            while (j > i):
                expression_form = ' '.join([word[0] for word in sentence[i:j]]).lower()
                expression_lemma = ' '.join([word[1] for word in sentence[i:j]]).lower()
                if tag_with_lemmas:
                    literal = expression_lemma
                else:
                    literal = expression_form
                if literal in self.dictionary:
                    #self.logger.debug("found: %s" % literal)
                    is_single_token = j - i == 1
                    original_position = i
                    i = j
                    taggings = [tag for tag in self.dictionary[literal]]
                    tagged_expression = (expression_form, expression_lemma, taggings)
                    if is_single_token: #if the tagged literal is a single token, conserve its previous taggings:
                        original_token_tagging = sentence[original_position][2]
                        tagged_expression[2].extend(original_token_tagging)
                    tag_sentence.append(tagged_expression)
                    tagged = True
                else:
                    j = j - 1
            if not tagged:
                tag_sentence.append(sentence[i])
                i += 1
        return tag_sentence
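
The two priority rules in tag_sentence (longest match first, scanning left to right) can be hard to see inside the full class. The following is a stripped-down sketch of just that idea, not a replacement for DictionaryTagger: it works on plain lowercase words rather than (form, lemma, tags) tuples, and the function name `longest_match_tag`, the `max_len` parameter, and the lexicon entries are all assumptions made for the example.

```python
def longest_match_tag(tokens, lexicon, max_len=3):
    """Greedy left-to-right tagging that prefers the longest matching phrase.

    tokens:  list of lowercase words
    lexicon: dict mapping a word or space-joined phrase to a list of tags
    max_len: longest phrase, in tokens, to try (an assumed parameter)
    """
    result = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the widest window first, then shrink it one token at a time.
        for j in range(min(i + max_len, len(tokens)), i, -1):
            phrase = ' '.join(tokens[i:j])
            if phrase in lexicon:
                result.append((phrase, list(lexicon[phrase])))
                i = j  # skip past every token the phrase consumed
                matched = True
                break
        if not matched:
            result.append((tokens[i], []))
            i += 1
    return result

# Hypothetical entries: a two-word phrase and one of its component words.
toy_lexicon = {
    'too expensive': ['negative'],
    'expensive': ['negative'],
    'nice': ['positive'],
}
tagged = longest_match_tag(['nice', 'but', 'too', 'expensive'], toy_lexicon)
for phrase, tags in tagged:
    print('%s -> %s' % (phrase, tags))
```

Because the wider window is tried first, 'too expensive' is tagged once as a phrase instead of 'expensive' being tagged on its own.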

Here we feed the sentiment words to the dictionary tagger.  The dictionary terms for positive and negative sentiment were taken from:

Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA.

In [7]:
dicttagger = DictionaryTagger([ 'positive.yml', 'negative.yml', 'increasers.yml', 'decreasers.yml', 'inverter.yml'])

dict_tagged_sentences = dicttagger.tag(pos_tagged_sentences)

#The following just lets us see what the split and tagged tokens look like.
pp = pprint.PrettyPrinter(indent=4)

pp.pprint(dict_tagged_sentences)

[   [   ('What', 'What', ['WP']),
        ('can', 'can', ['MD']),
        ('I', 'I', ['PRP']),
        ('say', 'say', ['VBP']),
        ('about', 'about', ['IN']),
        ('this', 'this', ['DT']),
        ('place', 'place', ['NN']),
        ('.', '.', ['.'])],
    [   ('The', 'The', ['DT']),
        ('staff', 'staff', ['NN']),
        ('of', 'of', ['IN']),
        ('the', 'the', ['DT']),
        ('restaurant', 'restaurant', ['NN']),
        ('is', 'is', ['VBZ']),
        ('nice', 'nice', ['positive', 'JJ']),
        ('and', 'and', ['CC']),
        ('the', 'the', ['DT']),
        ('eggplant', 'eggplant', ['NN']),
        ('is', 'is', ['VBZ']),
        ('not', 'not', ['inv', 'RB']),
        ('bad', 'bad', ['negative', 'JJ']),
        ('.', '.', ['.'])],
    [   ('Apart', 'Apart', ['RB']),
        ('from', 'from', ['IN']),
        ('that', 'that', ['IN']),
        (',', ',', [',']),
        ('very', 'very', ['inc', 'RB']),
        ('uninspired', 'uninspired', ['JJ']),
        ('food', 'f

In [8]:
def value_of(sentiment):
    if sentiment == 'positive': return 1
    if sentiment == 'negative': return -1
    return 0

def sentence_score(sentence_tokens, previous_token, acum_score):    
    if not sentence_tokens:
        return acum_score
    else:
        current_token = sentence_tokens[0]
        tags = current_token[2]
        token_score = sum([value_of(tag) for tag in tags])
        if previous_token is not None:
            previous_tags = previous_token[2]
            if 'inc' in previous_tags:
                token_score *= 2.0
            elif 'dec' in previous_tags:
                token_score /= 2.0
            elif 'inv' in previous_tags:
                token_score *= -1.0
        return sentence_score(sentence_tokens[1:], current_token, acum_score + token_score)

def sentiment_score(review):
    return sum([sentence_score(sentence, None, 0.0) for sentence in review])
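
To make the modifier rules concrete, here is a loop-based sketch of the same scoring logic, applied to tokens copied from the second sentence of the tagged sample review. The name `sentence_score_iterative` is made up for this example. It is intended to give the same totals as the recursive sentence_score, while avoiding the list slicing and recursion depth that the recursive form incurs on very long inputs.

```python
def value_of(tag):
    """Map a sentiment tag to its base value; any other tag counts as 0."""
    if tag == 'positive':
        return 1
    if tag == 'negative':
        return -1
    return 0

def sentence_score_iterative(sentence_tokens):
    """Sum token values, adjusting each one by the modifier tags
    ('inc', 'dec', 'inv') on the token immediately before it."""
    total = 0.0
    previous_tags = []
    for token in sentence_tokens:
        tags = token[2]
        token_score = sum(value_of(tag) for tag in tags)
        if 'inc' in previous_tags:
            token_score *= 2.0
        elif 'dec' in previous_tags:
            token_score /= 2.0
        elif 'inv' in previous_tags:
            token_score *= -1.0
        total += token_score
        previous_tags = tags
    return total

# Tokens taken from the tagged output of the sample review.
sentence = [
    ('is', 'is', ['VBZ']),
    ('nice', 'nice', ['positive', 'JJ']),
    ('and', 'and', ['CC']),
    ('is', 'is', ['VBZ']),
    ('not', 'not', ['inv', 'RB']),
    ('bad', 'bad', ['negative', 'JJ']),
]
score = sentence_score_iterative(sentence)
print(score)  # 2.0: 'nice' contributes +1 and the inverted 'bad' contributes +1
```

Note that a modifier only affects the single token that follows it, so "not very bad" would double 'bad' (because of 'very') without inverting it.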

The next section runs each tweet through the splitting, tagging, and scoring functions.

In [10]:
#Create an empty list to store the tweets and their sentiment values
sentiments = []

#Opens the files from whatever directory the IPython notebook was launched from.
#You'll need different paths if the files are in a separate folder.
filename = "clean-tweets_sample.log"

#Both files are opened in one with-statement so they are closed even if a line fails to parse
with open(filename) as line_generator, open("tweet-sentiment_sample.log", 'w') as outfile:

    for line in line_generator:
        #Here we cycle through each tweet and apply all the tagging functions
        line_object = json.loads(line)

        #Some tweets apparently don't have text, so skip any record without it
        try:
            tweet = line_object['text']
        except KeyError:
            continue

        #The workhorse - all of our splitting, tagging, and scoring
        #Any filtering for "black friday" or "blackfriday" is not done in this cell; every tweet with text gets scored
        date = line_object['date']
        splitted_sentences = splitter.split(tweet)
        pos_tagged_sentences = postagger.pos_tag(splitted_sentences)
        dict_tagged_sentences = dicttagger.tag(pos_tagged_sentences)
        score = sentiment_score(dict_tagged_sentences)

        #Disabled: we used to collect the date, text, and score in a list and later convert it to a numpy array
        #sentiments.append([date, tweet, score])

        #Immediately write each tweet with its score as one JSON object per line
        data = {}
        data['text'] = tweet
        data['date'] = date
        data['score'] = score
        json.dump(data, outfile)
        outfile.write('\n')

#tweetandscore = np.asarray(sentiments)

In [32]:
#Originally we stored the tweets and scores in a numpy array, but numpy does not handle unicode well.
#Thus, we no longer keep them in numpy; each tweet and its score is written straight to the JSON log above.
#Ideally the results would be appended to a database, but this method will work for now.
#To check individual tweets and their scores, read the JSON log back in.
#0 is neutral, higher numbers are positive, and lower numbers are negative
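
One way to check the scores without numpy is to read the one-JSON-object-per-line log back in and count how many tweets fall on each side of zero. The sketch below assumes each line has a 'score' field, as written by the scoring loop. The `summarize_scores` name and the three sample records (including their date strings) are invented for the example; they are not real output.

```python
import io
import json

def summarize_scores(lines):
    """Count positive, negative, and neutral records in an iterable of JSON lines."""
    counts = {'positive': 0, 'negative': 0, 'neutral': 0}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        score = json.loads(line)['score']
        if score > 0:
            counts['positive'] += 1
        elif score < 0:
            counts['negative'] += 1
        else:
            counts['neutral'] += 1
    return counts

# Made-up records in the same shape the scoring loop writes (text, date, score).
sample_log = io.StringIO(
    u'{"text": "great deals today", "date": "2015-11-17", "score": 1.0}\n'
    u'{"text": "awful lines", "date": "2015-11-17", "score": -1.0}\n'
    u'{"text": "store opens at 6", "date": "2015-11-17", "score": 0.0}\n'
)
summary = summarize_scores(sample_log)
print(summary)
```

To run this on the real results, pass an open file handle for tweet-sentiment_sample.log instead of the in-memory sample.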



In [33]:
#Just so we can see what a single tweet looks like
pp.pprint(line_object)


{   u'contributors': None,
    u'coordinates': None,
    u'created_at': u'Tue Nov 17 05:03:43 +0000 2015',
    u'entities': {   u'hashtags': [   {   u'indices': [91, 97],
                                          u'text': u'verge'},
                                      {   u'indices': [98, 103],
                                          u'text': u'news'},
                                      {   u'indices': [104, 111],
                                          u'text': u'latest'}],
                     u'symbols': [],
                     u'urls': [   {   u'display_url': u'on.recode.net/1NY6NEB',
                                      u'expanded_url': u'http://on.recode.net/1NY6NEB',
                                      u'indices': [67, 90],
                                      u'url': u'https://t.co/0zwVCpvOIc'}],
                     u'user_mentions': []},
    u'favorite_count': 0,
    u'favorited': False,
    u'filter_level': u'low',
    u'geo': None,
    u'id': 66648182231900979