## Sentiment analysis using Python

Part-of-speech tagging is the process of converting a sentence, in the form of a list of words,
into a list of tuples, where each tuple is of the form (word, tag). The tag is a part-of-speech
tag, and signifies whether the word is a noun, adjective, verb, and so on.

The main goal of this notebook is to determine whether movie reviews are positive or negative.

In [16]:
import nltk

Most of the taggers are trainable. They use a list of tagged sentences as their training data. With these training
sentences, the tagger generates an internal model that will tell it how to tag a word. Other taggers
use external data sources or match word patterns to choose a tag for a word.
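
To make the idea of an "internal model" concrete, a unigram tagger can be pictured as a lookup table that stores, for each word seen in training, the tag it carried most often. The sketch below is a simplified illustration of that idea in plain Python, not NLTK's actual implementation; the function names and the tiny training set are made up for the example.

```python
from collections import Counter, defaultdict

def train_unigram_model(tagged_sents):
    """Map each training word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}

def tag_with_model(model, words):
    """Tag words by table lookup; words never seen in training get None."""
    return [(word, model.get(word)) for word in words]

# Hypothetical hand-made training data, for illustration only.
toy_train = [
    [('the', 'DT'), ('board', 'NN'), ('will', 'MD'), ('join', 'VB')],
    [('the', 'DT'), ('will', 'NN'), ('is', 'VBZ'), ('old', 'JJ')],
    [('he', 'PRP'), ('will', 'MD'), ('join', 'VB')],
]
toy_model = train_unigram_model(toy_train)

# 'will' is tagged MD (seen twice) rather than NN (seen once);
# 'batman' was never seen, so its tag is None.
print(tag_with_model(toy_model, ['the', 'board', 'will', 'join', 'batman']))
```

Because the lookup ignores context, a word always receives the same tag, and any word absent from the training data is left untagged.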

Here we will use UnigramTagger by giving it a list of tagged sentences at initialization.

In [74]:
from nltk.tag import UnigramTagger
from nltk.corpus import treebank
train_sents = treebank.tagged_sents()
tagger = UnigramTagger(train_sents)
treebank.sents()[0]

['Pierre',
 'Vinken',
 ',',
 '61',
 'years',
 'old',
 ',',
 'will',
 'join',
 'the',
 'board',
 'as',
 'a',
 'nonexecutive',
 'director',
 'Nov.',
 '29',
 '.']

In [2]:
tagger.tag(treebank.sents()[0])

[('Pierre', 'NNP'),
 ('Vinken', 'NNP'),
 (',', ','),
 ('61', 'CD'),
 ('years', 'NNS'),
 ('old', 'JJ'),
 (',', ','),
 ('will', 'MD'),
 ('join', 'VB'),
 ('the', 'DT'),
 ('board', 'NN'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('nonexecutive', 'JJ'),
 ('director', 'NN'),
 ('Nov.', 'NNP'),
 ('29', 'CD'),
 ('.', '.')]

We see the first sentence as a list of words, and can see how it is transformed by the tag() function into a list of tagged tokens.

To determine whether a review is positive or negative we will use *SentiWordNet*, a lexical resource for opinion mining. *SentiWordNet* assigns three sentiment scores to each WordNet synset: positivity, negativity, and objectivity.

In [49]:
from nltk.corpus import sentiwordnet as swn
list(swn.senti_synsets('good', 'a'))[0].pos_score()

0.75
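
The three scores of a synset sum to 1, so objectivity is whatever remains once positivity and negativity are accounted for. The small check below uses (positive, negative, objective) triples copied from outputs further down in this notebook; the helper name `objectivity` is an assumption made for the example.

```python
def objectivity(pos, neg):
    """Objectivity is the part of the score mass that is neither positive nor negative."""
    return 1.0 - (pos + neg)

# Triples taken from the scores printed for the first sentence of the review.
triples = {
    'plenty': (0.0, 0.375, 0.625),
    'success': (0.125, 0.0, 0.875),
    'never': (0.0, 0.625, 0.375),
    'really': (0.625, 0.0, 0.375),
}

for word, (pos, neg, obj) in triples.items():
    print(word, objectivity(pos, neg) == obj)
```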

## Processing the data

We first have to process the reviews.

The class below replaces text matching a list of regular expressions. Here it is used to expand English contractions.

In [9]:
import re

# Raw strings keep backslashes such as \w and \g<1> intact.
replacement_patterns = [
    (r"’", "'"),
    (r"won't", "will not"),
    (r"can't", "cannot"),
    (r"i'm", "i am"),
    (r"ain't", "is not"),
    (r"(\w+)'ll", r"\g<1> will"),
    (r"(\w+)n't", r"\g<1> not"),
    (r"(\w+)'ve", r"\g<1> have"),
    (r"(\w+)'s", r"\g<1> is"),
    (r"(\w+)'re", r"\g<1> are"),
    (r"(\w+)'d", r"\g<1> would"),
]

class RegexpReplacer(object):
    def __init__(self, patterns=replacement_patterns): 
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]
    def replace(self, text):
        s = text
        for (pattern, repl) in self.patterns:
            s = re.sub(pattern, repl, s) 
        return s
    

replacer=RegexpReplacer()
replacer.replace("Don't hesitate to ask questions")

'Do not hesitate to ask questions'
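
The order of the patterns matters: irregular contractions such as "won't" and "can't" have to be handled before the generic `n't` rule, otherwise the generic rule splits them incorrectly. The sketch below illustrates this with the standard `re` module only; `apply_patterns` is a hypothetical stand-in for the replacer above. It also shows a limitation of the `'s` rule, which cannot tell a possessive from a contracted "is".

```python
import re

def apply_patterns(text, patterns):
    """Apply each (regex, replacement) pair in order."""
    for regex, repl in patterns:
        text = re.sub(regex, repl, text)
    return text

specific_first = [
    (r"won't", "will not"),
    (r"can't", "cannot"),
    (r"(\w+)n't", r"\g<1> not"),
]
generic_only = [(r"(\w+)n't", r"\g<1> not")]

sample = "i won't go , i can't go , i don't go"
print(apply_patterns(sample, specific_first))
# i will not go , i cannot go , i do not go
print(apply_patterns(sample, generic_only))
# i wo not go , i ca not go , i do not go

# The possessive "film's" is rewritten as if it were "film is".
print(apply_patterns("the film's plot", [(r"(\w+)'s", r"\g<1> is")]))
# the film is plot
```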

In [33]:
# The with-statement closes the file once it has been read.
with open('data/review_polarity/txt_sentoken/pos/cv000_29590.txt', 'r', encoding='utf-8') as open_file:
    file_to_string = open_file.read()
type(file_to_string)

str

In [34]:
text_replaced = replacer.replace(file_to_string)

print(file_to_string[50:100])
print(text_replaced[50:100])
print(type(text_replaced))

success , whether they're about superheroes ( batm
success , whether they are about superheroes ( bat
<class 'str'>


After expanding the contractions with regular expressions, we tokenize each review into a list of sentences, and then each sentence into a list of words.

In [35]:
tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')
sentences = tokenizer.tokenize(text_replaced)
len(sentences)

27

In [36]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"\w+")  # raw string: keep only runs of word characters

for i in range(len(sentences)):
    sentences[i] = tokenizer.tokenize(sentences[i])
sentences[0]

['films',
 'adapted',
 'from',
 'comic',
 'books',
 'have',
 'had',
 'plenty',
 'of',
 'success',
 'whether',
 'they',
 'are',
 'about',
 'superheroes',
 'batman',
 'superman',
 'spawn',
 'or',
 'geared',
 'toward',
 'kids',
 'casper',
 'or',
 'the',
 'arthouse',
 'crowd',
 'ghost',
 'world',
 'but',
 'there',
 'is',
 'never',
 'really',
 'been',
 'a',
 'comic',
 'book',
 'like',
 'from',
 'hell',
 'before']
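
The word-level step keeps only runs of word characters, which is why the punctuation and parentheses have disappeared from the output above. It also explains why contractions are expanded first: a `\w+` pattern would otherwise split them at the apostrophe. The sketch below uses `re.findall`, which should behave like a `RegexpTokenizer` built from the same pattern; `word_tokens` is a hypothetical helper.

```python
import re

def word_tokens(text):
    """Return the runs of word characters in text, dropping punctuation."""
    return re.findall(r"\w+", text)

# Without expansion the contraction is split into two fragments.
print(word_tokens("whether they're about superheroes"))
# ['whether', 'they', 're', 'about', 'superheroes']

# After expansion every token is a real word.
print(word_tokens("whether they are about superheroes"))
# ['whether', 'they', 'are', 'about', 'superheroes']
```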

We now use the UnigramTagger trained earlier to tag every word of every sentence.

In [75]:
tagged_sent = []
for sentence in sentences:
    tagged_sent.append(tagger.tag(sentence))
tagged_sent[0]

[('films', 'NNS'),
 ('adapted', 'VBD'),
 ('from', 'IN'),
 ('comic', None),
 ('books', 'NNS'),
 ('have', 'VBP'),
 ('had', 'VBD'),
 ('plenty', 'NN'),
 ('of', 'IN'),
 ('success', 'NN'),
 ('whether', 'IN'),
 ('they', 'PRP'),
 ('are', 'VBP'),
 ('about', 'IN'),
 ('superheroes', None),
 ('batman', None),
 ('superman', None),
 ('spawn', None),
 ('or', 'CC'),
 ('geared', None),
 ('toward', 'IN'),
 ('kids', 'NNS'),
 ('casper', None),
 ('or', 'CC'),
 ('the', 'DT'),
 ('arthouse', None),
 ('crowd', 'NN'),
 ('ghost', None),
 ('world', 'NN'),
 ('but', 'CC'),
 ('there', 'EX'),
 ('is', 'VBZ'),
 ('never', 'RB'),
 ('really', 'RB'),
 ('been', 'VBN'),
 ('a', 'DT'),
 ('comic', None),
 ('book', 'NN'),
 ('like', 'IN'),
 ('from', 'IN'),
 ('hell', None),
 ('before', 'IN')]
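
In the output above, 11 of the 42 tokens are tagged `None` because the tagger never saw them in the treebank training data. A quick way to measure this is sketched below on an excerpt of that output; `untagged_fraction` is a hypothetical helper name.

```python
def untagged_fraction(tagged_sentence):
    """Share of tokens whose tag is None (0.0 for an empty sentence)."""
    if not tagged_sentence:
        return 0.0
    missing = sum(1 for _, tag in tagged_sentence if tag is None)
    return missing / len(tagged_sentence)

# First six tokens of the tagged sentence shown above.
excerpt = [('films', 'NNS'), ('adapted', 'VBD'), ('from', 'IN'),
           ('comic', None), ('books', 'NNS'), ('have', 'VBP')]
print(untagged_fraction(excerpt))  # one token out of six is untagged
```

Untagged words contribute nothing to the sentiment scores computed later. NLTK taggers accept a `backoff` tagger at construction time, which is one possible way to reduce the number of `None` tags.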

The tags used by *SentiWordNet* are different from the tags of the UnigramTagger. For example, an adjective is tagged as **_'JJ'_** by our tagger and as **_'a'_** in *SentiWordNet*.

The function below allows us to get the 3 scores (positive, negative, objective) of a word by using a tuple *(word, tag)* as an argument.

**(can be improved)**

In [97]:
# Map the tagger's Penn Treebank tags to SentiWordNet part-of-speech tags.
TAG_MAP = {'JJ': 'a', 'NN': 'n', 'NNS': 'n', 'RB': 'r'}

def word_scores(wordntag):
    word, tag = wordntag
    swn_pos = TAG_MAP.get(tag)
    # Tags we do not handle (including None) get an all-zero score.
    if swn_pos is None:
        return [0.0, 0.0, 0.0]
    # Look the word up once and keep only its first (most common) synset.
    synsets = list(swn.senti_synsets(word, swn_pos))
    if not synsets:
        return [0.0, 0.0, 0.0]
    first = synsets[0]
    return [first.pos_score(), first.neg_score(), first.obj_score()]

print(word_scores(tagged_sent[0][0]))

[0.0, 0.0, 1.0]
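
One possible improvement, offered here as an assumption rather than as the notebook's method, is to map tags by prefix so that verbs and the other noun, adjective, and adverb variants (for example `VBD`, `NNP`, `JJR`, `RBS`) are also looked up. The helper name `penn_to_swn` is hypothetical.

```python
def penn_to_swn(tag):
    """Map a Penn Treebank tag to a SentiWordNet POS letter, or None if unsupported."""
    if tag is None:
        return None
    prefixes = (('JJ', 'a'), ('NN', 'n'), ('RB', 'r'), ('VB', 'v'))
    for prefix, swn_pos in prefixes:
        if tag.startswith(prefix):
            return swn_pos
    return None

for tag in ['JJ', 'JJR', 'NNS', 'NNP', 'RB', 'VBD', 'IN', None]:
    print(tag, penn_to_swn(tag))
```

Two-letter prefixes are used on purpose: matching on `R` alone would also capture `RP` (particle), which is not an adverb.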


We apply that function to every word of each sentence to get a list of scores.

In [98]:
scores = []
for sentence in tagged_sent:
    list_scores = []
    for word in sentence:
        list_scores.append(word_scores(word))
    scores.append(list_scores)
print(scores[0])

[[0.0, 0.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.375, 0.625], [0.0, 0.0, 0.0], [0.125, 0.0, 0.875], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.625, 0.375], [0.625, 0.0, 0.375], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]


## Scoring

The question now is how we will decide whether a review is positive or negative.

**First approach: we decide based on majority**

If the positive score is bigger than the negative one, then it is a positive review. Otherwise it is a negative one.

Let's first average the positive, negative, and objective scores over the words of each sentence.

In [115]:
sum_score = []
for list_score in scores:
    pos, neg, obj = 0.0, 0.0, 0.0
    for score in list_score:
        pos += score[0]
        neg += score[1]
        obj += score[2]
    if(len(list_score) != 0):
        sum_score.append([pos/len(list_score), neg/len(list_score), obj/len(list_score)])
    else:
        sum_score.append([0.0, 0.0, 0.0])
sum_score[0]        

[0.017857142857142856, 0.023809523809523808, 0.19642857142857142]
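
The same per-sentence average can be written more compactly. The sketch below rebuilds the first sentence from the scores printed earlier (10 non-zero triples and 32 all-zero ones, 42 words in total) and averages each component; `mean_triple` is a hypothetical helper name.

```python
def mean_triple(triples):
    """Component-wise mean of [pos, neg, obj] triples; zeros for an empty list."""
    if not triples:
        return [0.0, 0.0, 0.0]
    n = len(triples)
    return [sum(t[i] for t in triples) / n for i in range(3)]

# Non-zero triples of the first sentence, in order of appearance.
nonzero = [
    [0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [0.0, 0.375, 0.625], [0.125, 0.0, 0.875],
    [0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [0.0, 0.625, 0.375],
    [0.625, 0.0, 0.375], [0.0, 0.0, 1.0],
]
first_sentence = nonzero + [[0.0, 0.0, 0.0]] * 32

print(mean_triple(first_sentence))  # pos = 0.75/42, neg = 1.0/42, obj = 8.25/42
```

Because words with unsupported or missing tags count as zeros in the denominator, the three averages of a sentence no longer sum to 1.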

Then we average the scores of all sentences to get the global score of the review.

In [117]:
pos, neg, obj = 0.0, 0.0, 0.0
for score in sum_score:
    pos += score[0]
    neg += score[1]
    obj += score[2]
pos /= len(sum_score)
neg /= len(sum_score)
obj /= len(sum_score)
print([pos, neg, obj])

[0.022104363763838723, 0.025550012719126575, 0.18265805224130563]


We now wrap the whole process in a function that takes the path of a review file as its argument.

In [120]:
def sumScores(file):
    # The with-statement closes the file once it has been read.
    with open(file, 'r', encoding='utf-8') as open_file:
        file_to_string = open_file.read()
    
    text_replaced = replacer.replace(file_to_string)
    
    tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')
    sentences = tokenizer.tokenize(text_replaced)
    
    from nltk.tokenize import RegexpTokenizer
    tokenizer = RegexpTokenizer(r"\w+")  # raw string: keep only runs of word characters

    for i in range(len(sentences)):
        sentences[i] = tokenizer.tokenize(sentences[i])
        
    tagged_sent = []
    for sentence in sentences:
        tagged_sent.append(tagger.tag(sentence))
        
    scores = []
    for sentence in tagged_sent:
        list_scores = []
        for word in sentence:
            list_scores.append(word_scores(word))
        scores.append(list_scores)
        
    sum_score = []
    for list_score in scores:
        pos, neg, obj = 0.0, 0.0, 0.0
        for score in list_score:
            pos += score[0]
            neg += score[1]
            obj += score[2]
        if(len(list_score) != 0):
            sum_score.append([pos/len(list_score), neg/len(list_score), obj/len(list_score)])
        else:
            sum_score.append([0.0, 0.0, 0.0])
        
    pos, neg, obj = 0.0, 0.0, 0.0
    for score in sum_score:
        pos += score[0]
        neg += score[1]
        obj += score[2]
    if(len(sum_score) != 0):
        pos /= len(sum_score)
        neg /= len(sum_score)
        obj /= len(sum_score)
    return([pos, neg, obj])
    
sumScores('data/review_polarity/txt_sentoken/pos/cv000_29590.txt')

[0.022104363763838723, 0.025550012719126575, 0.18265805224130563]

Let's try that scoring technique on all positive reviews and see how it performs.

In [125]:
import os
pos_reviews = os.listdir('data/review_polarity/txt_sentoken/pos')
neg_reviews = os.listdir('data/review_polarity/txt_sentoken/neg')

__First decision:__ We compare the positive and negative scores and label the review with whichever is higher.

In [147]:
pos, neg = 0, 0
for review in pos_reviews:
    if(review[0] != 'c'):
        continue
    score = sumScores('data/review_polarity/txt_sentoken/pos/'+review)
    if(score[0] > score[1]):
        pos += 1
    else:
        neg +=1
print(pos, neg)

441 559


Out of a thousand positive reviews, 441 are marked as positive and 559 as negative.
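
The decision rule and the resulting accuracy on the positive class can be written out as follows. The function names are hypothetical; the numbers are the ones reported above.

```python
def classify(pos, neg):
    """Label a review 'pos' when its positive score is strictly higher, else 'neg'."""
    return 'pos' if pos > neg else 'neg'

def accuracy(correct, total):
    """Fraction of reviews that received the right label."""
    return correct / total

# Global scores of cv000_29590.txt, a review from the positive folder.
print(classify(0.022104363763838723, 0.025550012719126575))  # neg

# 441 of the 1000 positive reviews were labelled positive.
print(accuracy(441, 1000))  # 0.441
```

The example review comes from the positive folder but is labelled negative, because its negative score is slightly higher than its positive one.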

With only 44.1% of the positive reviews recognized as positive, this technique is not accurate. The next objective is to find a threshold on the difference between the positive and negative scores that gives more accurate results.
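
One way such a threshold could be searched for is sketched below: a review is called positive when `pos - neg` exceeds a threshold `t`, and `t` is chosen to maximize accuracy on labelled reviews. This is only an illustration under stated assumptions. The function name is hypothetical and the differences and labels are made-up toy values, not results from the review corpus.

```python
def best_threshold(diffs, labels):
    """Return (threshold, accuracy) for the rule: diff > threshold means positive.

    Candidate thresholds are the midpoints between consecutive distinct
    differences, plus one value below the minimum and one above the maximum.
    On ties the lowest candidate threshold is kept.
    """
    values = sorted(set(diffs))
    candidates = [values[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(values, values[1:])]
    candidates += [values[-1] + 1.0]

    best = None
    for t in candidates:
        correct = sum((d > t) == is_pos for d, is_pos in zip(diffs, labels))
        acc = correct / len(diffs)
        if best is None or acc > best[1]:
            best = (t, acc)
    return best

# Hypothetical toy data: (pos - neg) per review and its true label (True = positive).
toy_diffs = [-0.004, -0.002, 0.001, -0.010, -0.007, -0.003]
toy_labels = [True, True, True, False, False, False]

threshold, acc = best_threshold(toy_diffs, toy_labels)
print(threshold, acc)
```

A threshold chosen this way should be tuned on reviews from both classes and checked on reviews that were not used for tuning, otherwise the reported accuracy is optimistic.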