# Lexical Polarity Classification

* SentiWordNet
* Lexicon induction from corpora
* Word counting
* Intensifiers and downplayers
* Negation
* Other "valence shifters"

## SentiWordNet

Before we look at using lexicons to do polarity classification, let's look quickly at the interface for SentiWordNet included in NLTK. It works together with WordNet; recall that we can get the synsets associated with a given word type.

In [1]:
from nltk.corpus import wordnet as wn

good_synsets = wn.synsets("good")
bad_synsets = wn.synsets("bad")

In [2]:
print(good_synsets)
print("\n", bad_synsets)

[Synset('good.n.01'), Synset('good.n.02'), Synset('good.n.03'), Synset('commodity.n.01'), Synset('good.a.01'), Synset('full.s.06'), Synset('good.a.03'), Synset('estimable.s.02'), Synset('beneficial.s.01'), Synset('good.s.06'), Synset('good.s.07'), Synset('adept.s.01'), Synset('good.s.09'), Synset('dear.s.02'), Synset('dependable.s.04'), Synset('good.s.12'), Synset('good.s.13'), Synset('effective.s.04'), Synset('good.s.15'), Synset('good.s.16'), Synset('good.s.17'), Synset('good.s.18'), Synset('good.s.19'), Synset('good.s.20'), Synset('good.s.21'), Synset('well.r.01'), Synset('thoroughly.r.02')]

 [Synset('bad.n.01'), Synset('bad.a.01'), Synset('bad.s.02'), Synset('bad.s.03'), Synset('bad.s.04'), Synset('regretful.a.01'), Synset('bad.s.06'), Synset('bad.s.07'), Synset('bad.s.08'), Synset('bad.s.09'), Synset('bad.s.10'), Synset('bad.s.11'), Synset('bad.s.12'), Synset('bad.s.13'), Synset('bad.s.14'), Synset('badly.r.05'), Synset('badly.r.06')]


Unlike most lexicons, SentiWordNet is tied to particular senses. This makes it harder to use, since we have to go through WordNet to reach the scores. We access a SentiWordNet synset by calling the `senti_synset` function with the name of the WordNet synset.

In [4]:
from nltk.corpus import sentiwordnet as swn

for synsets in [good_synsets,bad_synsets]:
    for synset in synsets:
        print(synset.definition())
        #access the senti_synset
        senti_synset = swn.senti_synset(synset.name())
        print("Positive", end = " ")
        print(senti_synset.pos_score())
        print("Negative", end = " ")
        print(senti_synset.neg_score())
        print("Objective", end = " ")
        print(senti_synset.obj_score())

benefit
Positive 0.5
Negative 0.0
Objective 0.5
moral excellence or admirableness
Positive 0.875
Negative 0.0
Objective 0.125
that which is pleasing or valuable or useful
Positive 0.625
Negative 0.0
Objective 0.375
articles of commerce
Positive 0.0
Negative 0.0
Objective 1.0
having desirable or positive qualities especially those suitable for a thing specified
Positive 0.75
Negative 0.0
Objective 0.25
having the normally expected amount
Positive 0.0
Negative 0.0
Objective 1.0
morally admirable
Positive 1.0
Negative 0.0
Objective 0.0
deserving of esteem and respect
Positive 1.0
Negative 0.0
Objective 0.0
promoting or enhancing well-being
Positive 0.625
Negative 0.0
Objective 0.375
agreeable or pleasing
Positive 1.0
Negative 0.0
Objective 0.0
of moral excellence
Positive 0.75
Negative 0.0
Objective 0.25
having or showing knowledge and skill and aptitude
Positive 0.625
Negative 0.0
Objective 0.375
thorough
Positive 0.625
Negative 0.0
Objective 0.375
with or in a close or intimate relation

Fortunately, the senses of a word tend to be fairly well aligned in polarity, so it is possible to talk about the polarity of a word without worrying too much about senses. Remember that "objective" senses are those that carry no sentiment, positive or negative.
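One simple way to collapse sense-level scores into a single word-level score is to average positive minus negative over all senses. The sketch below is only an illustration, not part of the SentiWordNet interface: `word_polarity` is a hypothetical helper, and the hard-coded pairs are the first five (positive, negative) scores printed above for *good*. With NLTK, each pair would come from `pos_score()` and `neg_score()`.

```python
def word_polarity(sense_scores):
    """Average (positive - negative) over a word's senses.

    sense_scores: list of (pos_score, neg_score) pairs, one per sense.
    Returns 0.0 for a word with no senses.
    """
    if not sense_scores:
        return 0.0
    return sum(pos - neg for pos, neg in sense_scores) / len(sense_scores)


# First five senses of "good" from the output above (hard-coded for illustration).
good_scores = [(0.5, 0.0), (0.875, 0.0), (0.625, 0.0), (0.0, 0.0), (0.75, 0.0)]
print(word_polarity(good_scores))  # 0.55
```

An unweighted average treats rare and common senses alike, which is a simplifying assumption.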

## Simple lexicon induction from labeled corpora

Let's see what happens if we use the distribution of words in the `movie_reviews` corpus to build a polarity lexicon. We will count how often words appear in each category.

In [5]:
from nltk.corpus import movie_reviews
from collections import Counter

MR_lexicon = {}
pos_counts = Counter(movie_reviews.words(categories="pos"))
neg_counts = Counter(movie_reviews.words(categories="neg"))
all_counts = Counter(pos_counts)  # start from a copy of the positive counts
all_counts.update(neg_counts)  # all_counts is now pos_counts + neg_counts

for word in all_counts:
    # score in [-1, 1]: positive if the word is more frequent in positive reviews, negative otherwise
    MR_lexicon[word] = (pos_counts.get(word, 0) - neg_counts.get(word, 0)) / all_counts[word]
    
len(MR_lexicon)

39768
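To make the score concrete, here is the same calculation on a tiny made-up corpus (the token lists are invented for illustration). A word that occurs only in positive texts scores 1.0, one that occurs only in negative texts scores -1.0, and one split evenly scores 0.0.

```python
from collections import Counter

# Invented toy data standing in for movie_reviews.words(categories=...)
pos_tokens = ["great", "great", "film", "plot", "great"]
neg_tokens = ["awful", "film", "plot", "plot", "plot"]

pos_counts = Counter(pos_tokens)
neg_counts = Counter(neg_tokens)
all_counts = pos_counts + neg_counts

# Score in [-1, 1]: (positive count - negative count) / total count
toy_lexicon = {
    word: (pos_counts[word] - neg_counts[word]) / all_counts[word]
    for word in all_counts
}

for word, score in toy_lexicon.items():
    print(word, score)
```

Here *great* scores 1.0, *film* 0.0, *awful* -1.0, and *plot* -0.5, even though *plot* carries no sentiment of its own.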

Let's look at our scores for some known polar words.

In [6]:
MR_lexicon["good"]

0.035255080879303194

In [7]:
MR_lexicon["bad"]

-0.4824372759856631

In [8]:
MR_lexicon["excellent"]

0.5869565217391305

In [9]:
MR_lexicon["terrible"]

-0.6083916083916084

In [10]:
MR_lexicon["okay"]

-0.104

In [11]:
MR_lexicon["funny"]

-0.07380952380952381

Let's look at some words that would probably be considered objective.

In [12]:
MR_lexicon["the"]

0.08379829868415895

In [13]:
MR_lexicon["plot"]

-0.21216126900198282

In [14]:
MR_lexicon["movie"]

-0.12493501992722232

In [15]:
MR_lexicon["film"]

0.09908584638016181

Let's look a bit at the most extreme words, among those that appear at least 20 times.

In [16]:
# provided code
sorted_by_SO = sorted([(SO,word) for word, SO in MR_lexicon.items() if all_counts[word] >= 20])

In [17]:
sorted_by_SO[:50]

[(-1.0, 'brenner'),
 (-1.0, 'macdonald'),
 (-1.0, 'nbsp'),
 (-1.0, 'pokemon'),
 (-1.0, 'sphere'),
 (-0.9473684210526315, 'seagal'),
 (-0.9354838709677419, 'jawbreaker'),
 (-0.9354838709677419, 'webb'),
 (-0.9333333333333333, 'magoo'),
 (-0.9310344827586207, 'hudson'),
 (-0.9285714285714286, 'jakob'),
 (-0.9166666666666666, 'lambert'),
 (-0.9166666666666666, 'stigmata'),
 (-0.9130434782608695, 'heckerling'),
 (-0.9069767441860465, 'bats'),
 (-0.9047619047619048, 'sammy'),
 (-0.9, 'bronson'),
 (-0.9, 'dalmatians'),
 (-0.9, 'farley'),
 (-0.9, 'silverstone'),
 (-0.8947368421052632, 'schumacher'),
 (-0.8787878787878788, 'gadget'),
 (-0.8666666666666667, 'alicia'),
 (-0.8666666666666667, 'insulting'),
 (-0.8620689655172413, 'sucks'),
 (-0.8604651162790697, '8mm'),
 (-0.8571428571428571, 'werewolf'),
 (-0.8536585365853658, 'ludicrous'),
 (-0.8518518518518519, 'eszterhas'),
 (-0.8421052631578947, 'stupidity'),
 (-0.84, 'ivy'),
 (-0.84, 'martian'),
 (-0.8333333333333334, 'darren'),
 (-0.8260869

In [18]:
sorted_by_SO[-50:]

[(0.8787878787878788, 'carlito'),
 (0.8823529411764706, 'margaret'),
 (0.8888888888888888, 'bowfinger'),
 (0.8904109589041096, 'damon'),
 (0.8974358974358975, 'whale'),
 (0.9, 'africans'),
 (0.9, 'avoids'),
 (0.9, 'cinque'),
 (0.9047619047619048, 'hatred'),
 (0.9047619047619048, 'jerome'),
 (0.9047619047619048, 'wen'),
 (0.9069767441860465, 'cauldron'),
 (0.9090909090909091, 'lang'),
 (0.9090909090909091, 'whisperer'),
 (0.92, 'nello'),
 (0.9230769230769231, 'pleasantville'),
 (0.9310344827586207, 'redford'),
 (0.9333333333333333, 'benigni'),
 (0.9333333333333333, 'bulworth'),
 (0.9333333333333333, 'winslet'),
 (0.9354838709677419, 'jude'),
 (0.9428571428571428, 'homer'),
 (0.9642857142857143, 'lebowski'),
 (0.975, 'flynt'),
 (0.979381443298969, 'mulan'),
 (1.0, 'apostle'),
 (1.0, 'argento'),
 (1.0, 'burbank'),
 (1.0, 'camille'),
 (1.0, 'carver'),
 (1.0, 'coens'),
 (1.0, 'donkey'),
 (1.0, 'farquaad'),
 (1.0, 'fei'),
 (1.0, 'gattaca'),
 (1.0, 'giles'),
 (1.0, 'guido'),
 (1.0, 'lama'),
 

The vast majority of these words are names associated with good and bad movies, which do not belong in a good sentiment lexicon. A corollary is that a bag-of-words (BOW) machine learning classifier does very well on this dataset, better than a sentiment lexicon, but not because it is learning how to classify "sentiment"! And it would be quite useless when applied to a dataset from a different domain, or perhaps even to different movies!

## Word Counting

Now let's do text classification using a (proper) polarity lexicon. We'll use the one included in NLTK (the Hu and Liu "opinion lexicon"). It just consists of lists of positive and negative words.

In [19]:
#provided code
from nltk.corpus import opinion_lexicon

pos_words = set(opinion_lexicon.positive())
neg_words = set(opinion_lexicon.negative())
print(len(pos_words))
print(len(neg_words))

2006
4783


In [20]:
print(list(pos_words)[:10])
print(list(neg_words)[:10])

['convience', 'upliftment', 'effectual', 'steadiness', 'charm', 'hands-down', 'profound', 'stately', 'contribution', 'supple']
['undercutting', 'unsupported', 'discouraging', 'douchebags', 'hypocrisy', 'itch', 'radical', 'malevolently', 'solemn', 'arrogantly']


As we can see, there are more than twice as many negative words as positive ones. Now let's iterate over the words of the `movie_reviews` corpus and count how many positive and negative lexicon words appear in the texts of each category.

In [21]:
#provided code
def count_pos_neg(polarity):
    '''count and print out the number of positive and negative words for a given polarity of the movie_reviews corpus'''
    pos_count = 0
    neg_count = 0
    for word in movie_reviews.words(categories=polarity):
        if word.lower() in pos_words:
            pos_count += 1
        elif word.lower() in neg_words:
            neg_count += 1
    print("POS: ", pos_count)
    print("NEG: ", neg_count)
    

There are 31360 positive words in the positive documents, and 25942 negative words.

In [22]:
count_pos_neg("pos")

POS:  31360
NEG:  25942


Overall, we do have more negative words than positive words in the negative documents.

In [23]:
count_pos_neg("neg")

POS:  22748
NEG:  29217


There's definitely a distinction, though it's far from categorical. Now let's do this at the level of the text, and calculate an accuracy score.

In [24]:
#provided code
def get_counts(text):
    '''count the number of positive and negative words in a text'''
    pos_count = 0
    neg_count = 0
    for word in text:
        word = word.lower()
        if word in pos_words:
            pos_count += 1
        elif word in neg_words:
            neg_count += 1    
    return pos_count, neg_count

total = 0
correct = 0
for text in movie_reviews.fileids(categories="pos"):
    total += 1
    pos_count, neg_count = get_counts(movie_reviews.words(text))
    if pos_count > neg_count:
        correct += 1

for text in movie_reviews.fileids(categories="neg"):
    total += 1
    pos_count, neg_count = get_counts(movie_reviews.words(text))
    if neg_count > pos_count:
        correct += 1

correct/total

0.6835

That's not bad for a first try. We might do better with another lexicon, one with a more fine-grained breakdown of semantic orientation (see exercise 1 on your lab). We can also consider other approaches. First, though, let's do a little bit more and look at the score for just the negative texts.

In [25]:
#provided code
total = 0
correct = 0

for text in movie_reviews.fileids(categories="neg"):
    total += 1
    pos_count, neg_count = get_counts(movie_reviews.words(text))
    if neg_count > pos_count:
        correct += 1

correct/total

0.719

Looks like we are doing better on negative texts! Note that positive and negative performance is often imbalanced when using lexicons, and it sometimes improves performance to shift the cutoff between positive and negative texts. Let's take a look at a few positive texts that we are getting wrong and see if we can see where our lexicon-based method failed.
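As a brief aside before looking at those errors, the cutoff shift mentioned above can be sketched as follows. This is a minimal illustration on invented counts, not a result from the corpus: `classify` and `accuracy` are hypothetical helpers, and unlike the cells above, ties here fall to the negative class.

```python
def classify(pos_count, neg_count, cutoff=0):
    """Label a text "pos" if (pos_count - neg_count) exceeds the cutoff, else "neg"."""
    return "pos" if pos_count - neg_count > cutoff else "neg"


def accuracy(examples, cutoff=0):
    """examples: list of (pos_count, neg_count, gold_label) triples."""
    correct = sum(classify(p, n, cutoff) == gold for p, n, gold in examples)
    return correct / len(examples)


# Invented counts: two positive reviews contain slightly more negative words.
toy_examples = [
    (12, 10, "pos"),
    (9, 10, "pos"),
    (8, 10, "pos"),
    (5, 11, "neg"),
    (7, 12, "neg"),
]

for cutoff in (0, -3):
    print(cutoff, accuracy(toy_examples, cutoff))
```

On this toy data a cutoff of 0 gives 0.6 and a cutoff of -3 gives 1.0. In practice the cutoff would need to be chosen on held-out data rather than on the texts being evaluated.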

In [26]:
#provided code
count = 0
for text in movie_reviews.fileids(categories="pos"):
    pos_count, neg_count = get_counts(movie_reviews.words(text))
    if neg_count > pos_count:
        print(list(movie_reviews.words(text)))
        count += 1
    if count == 3:
        break

['films', 'adapted', 'from', 'comic', 'books', 'have', 'had', 'plenty', 'of', 'success', ',', 'whether', 'they', "'", 're', 'about', 'superheroes', '(', 'batman', ',', 'superman', ',', 'spawn', ')', ',', 'or', 'geared', 'toward', 'kids', '(', 'casper', ')', 'or', 'the', 'arthouse', 'crowd', '(', 'ghost', 'world', ')', ',', 'but', 'there', "'", 's', 'never', 'really', 'been', 'a', 'comic', 'book', 'like', 'from', 'hell', 'before', '.', 'for', 'starters', ',', 'it', 'was', 'created', 'by', 'alan', 'moore', '(', 'and', 'eddie', 'campbell', ')', ',', 'who', 'brought', 'the', 'medium', 'to', 'a', 'whole', 'new', 'level', 'in', 'the', 'mid', "'", '80s', 'with', 'a', '12', '-', 'part', 'series', 'called', 'the', 'watchmen', '.', 'to', 'say', 'moore', 'and', 'campbell', 'thoroughly', 'researched', 'the', 'subject', 'of', 'jack', 'the', 'ripper', 'would', 'be', 'like', 'saying', 'michael', 'jackson', 'is', 'starting', 'to', 'look', 'a', 'little', 'odd', '.', 'the', 'book', '(', 'or', '"', 'grap

--------

Some cases where we might commit errors:

1. There are many cases where the sentiment word clearly does not reflect the author's own opinion, but rather a possible (incorrect) perspective:

'there', 'is', 'a', 'difference', 'between', 'movies', 'with', 'the', 'courage', 'to', 'go', 'over', 'the', 'top', 'and', 'movies', 'that', "don't", 'care', 'about', 'being', 'stupid' (courage)

'as', 'exciting', 'as', 'all', 'this', 'exoticism', 'might', 'sound', 'to', 'the', 'typical', 'pax', 'viewer', ',', 'the', 'rest', 'of', 'us', 'will', 'be', 'lulled', 'into', 'a', 'coma', '.' (exciting)

2. There are lots of cases where the most important sentiment is not being picked up because it involves a complex expression.

'our', 'culture', 'is', 'headed', 'down', 'the', 'toilet', 'with', 'the', 'ferocity', 'of', 'a', 'frozen', 'burrito', 'after', 'an', 'all-night', 'tequila', 'bender', '\x97', 'and', 'i', 'know', 'this', 'because', "i've", 'seen', "'jackass", ':', 'the', 'movie', '.', "'" (headed down the toilet)

'here', ',', 'common', 'sense', 'flies', 'out', 'the', 'window', ',', 'along', 'with', 'the', 'hail', 'of', 'bullets', ',', 'none', 'of', 'which', 'ever', 'seem', 'to', 'hit', 'sascha', '.' (common sense flies out the window)

3. To get some of these correct, the nature of a concession relation would have to be understood.

'beautifully', 'filmed', 'and', 'well', 'acted', '.', '.', '.', 'but', 'admittedly', 'problematic', 'in', 'its', 'narrative', 'specifics', '.' (problematic more important than beautifully filmed, well acted)

'though', 'this', 'saga', 'would', 'be', 'terrific', 'to', 'read', 'about', ',', 'it', 'is', 'dicey', 'screen', 'material', 'that', 'only', 'a', 'genius', 'should', 'touch', '.' (dicey more important than terrific)

-------------

## Intensification

Much of the sentiment in a text is carried by adjectives, and adjectives are modified by adverbs. Let's look at adverbs that tend to appear before polar adjectives.

In [27]:
#provided code
from nltk import pos_tag
count = 0
for sent in movie_reviews.sents():
    if count >= 10:
        break
    tagged_sent = pos_tag(sent)
    for i in range(1,len(sent)):
        word = sent[i].lower()
        if (word in pos_words or word in neg_words) and tagged_sent[i][1] == "JJ"  and tagged_sent[i-1][1] == "RB":
            print(sent)
            print(sent[i-1])
            print(word)
            count += 1
            break



['critique', ':', 'a', 'mind', '-', 'fuck', 'movie', 'for', 'the', 'teen', 'generation', 'that', 'touches', 'on', 'a', 'very', 'cool', 'idea', ',', 'but', 'presents', 'it', 'in', 'a', 'very', 'bad', 'package', '.']
very
cool
['they', 'seem', 'to', 'have', 'taken', 'this', 'pretty', 'neat', 'concept', ',', 'but', 'executed', 'it', 'terribly', '.']
pretty
neat
['the', 'actors', 'are', 'pretty', 'good', 'for', 'the', 'most', 'part', ',', 'although', 'wes', 'bentley', 'just', 'seemed', 'to', 'be', 'playing', 'the', 'exact', 'same', 'character', 'that', 'he', 'did', 'in', 'american', 'beauty', ',', 'only', 'in', 'a', 'new', 'neighborhood', '.']
pretty
good
['it', "'", 's', 'just', 'packaged', 'to', 'look', 'that', 'way', 'because', 'someone', 'is', 'apparently', 'assuming', 'that', 'the', 'genre', 'is', 'still', 'hot', 'with', 'the', 'kids', '.']
still
hot
['it', 'is', 'clear', 'that', 'the', 'film', 'is', 'nothing', 'more', 'than', 'an', 'attempt', 'to', 'cash', 'in', 'on', 'the', 'teenage

Words that increase the strength of a sentiment are called *intensifiers* and those that lower the strength are called *downplayers* (though I will often refer to both as *intensifiers*).

Exercise: Let's create some lists of intensifiers and downplayers by looking through the examples above. Note that they shouldn't normally have their own sentiment associated with them ('very' in itself is not a positive or negative word).

In [28]:
intensifiers = {"very","highly","so","really","entirely","plain","consistently","totally"}
downplayers = {"pretty","somewhat","almost","rarely"}

We can add intensifiers to our polarity calculation by lowering or raising the count by 50%. Let's try that.

In [29]:
def get_counts(text):
    pos_count = 0
    neg_count = 0
    for i in range(len(text)):
        word = text[i].lower()
        if word in pos_words:
            pos_count += 1
            if i > 0: 
                if text[i -1].lower() in intensifiers:
                    pos_count += 0.5
                elif text[i -1].lower() in downplayers:
                    pos_count -= 0.5
            
        elif word in neg_words:
            neg_count += 1   
            
            if i > 0: 
                if text[i -1].lower() in intensifiers:
                    neg_count += 0.5
                elif text[i -1].lower() in downplayers:
                    neg_count -= 0.5
    return pos_count, neg_count

total = 0
correct = 0
for text in movie_reviews.fileids(categories="pos"):
    total += 1
    pos_count, neg_count = get_counts(list(movie_reviews.words(text)))
    if pos_count > neg_count:
        correct += 1

for text in movie_reviews.fileids(categories="neg"):
    total += 1
    pos_count, neg_count = get_counts(list(movie_reviews.words(text)))
    if neg_count > pos_count:
        correct += 1

correct/total

0.6945

A small improvement! We could probably do better if we were more careful about the effect of individual words. Note that nouns also have intensifiers and downplayers; these are not adverbs but either adjectives (*real* instead of *really*) or measure expressions like *a bit of* or *a ton of*. Consider "That guy has a ton of luck" or "That's what I call a *real* hamburger!"
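Being "more careful about the effect of individual words" could mean giving each modifier its own multiplier instead of a flat 0.5. The sketch below assumes invented weights and tiny stand-in word sets (`modifier_weights`, `toy_pos_words`, `toy_neg_words` are all hypothetical); it is not tuned on any data.

```python
# Hypothetical multipliers: > 1 intensifies, < 1 downplays.
modifier_weights = {"very": 1.5, "really": 1.5, "totally": 2.0,
                    "pretty": 0.75, "somewhat": 0.5}

# Tiny stand-ins for the opinion lexicon word sets.
toy_pos_words = {"good", "cool"}
toy_neg_words = {"bad", "dull"}


def get_weighted_counts(text):
    """Sum polar-word weights, scaling each by the modifier directly before it."""
    pos_total = 0.0
    neg_total = 0.0
    for i, token in enumerate(text):
        word = token.lower()
        previous = text[i - 1].lower() if i > 0 else None
        weight = modifier_weights.get(previous, 1.0)  # 1.0 when unmodified
        if word in toy_pos_words:
            pos_total += weight
        elif word in toy_neg_words:
            neg_total += weight
    return pos_total, neg_total


print(get_weighted_counts("a pretty good idea but totally dull".split()))
```

Here *pretty good* contributes 0.75 to the positive total and *totally dull* contributes 2.0 to the negative total.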

## Negation

Another common kind of modification that clearly affects the interpretation of polar words is negation. Most negation words in English start with "n"; they include *not* (and *n't*), *no*, *never*, *nobody*, *nothing*, *neither*, and *nor*. Let's look through our corpus for these words followed by a polar word.

In [30]:
negators = {"not", "n't", "no", "never", "nobody", "nothing", "neither", "nor"}

for sent in movie_reviews.sents():
    for i in range(len(sent)):
        word = sent[i].lower()
        if word in negators:
            for j in range(i+1, len(sent)):
                if sent[j].lower() in pos_words or sent[j].lower() in neg_words:
                    print(sent)
                    print(word)
                    print(sent[j])
                    break
            break

['the', 'characters', 'and', 'acting', 'is', 'nothing', 'spectacular', ',', 'sometimes', 'even', 'bordering', 'on', 'wooden', '.']
nothing
spectacular
['unfortunately', ',', 'even', 'he', "'", 's', 'not', 'enough', 'to', 'save', 'this', 'convoluted', 'mess', ',', 'as', 'all', 'the', 'characters', 'don', "'", 't', 'do', 'much', 'apart', 'from', 'occupying', 'screen', 'time', '.']
not
enough
['basing', 'the', 'show', 'on', 'a', '1960', "'", 's', 'television', 'show', 'that', 'nobody', 'remembers', 'is', 'of', 'questionable', 'wisdom', ',', 'especially', 'when', 'one', 'considers', 'the', 'target', 'audience', 'and', 'the', 'fact', 'that', 'the', 'number', 'of', 'memorable', 'films', 'based', 'on', 'television', 'shows', 'can', 'be', 'counted', 'on', 'one', 'hand', '(', 'even', 'one', 'that', "'", 's', 'missing', 'a', 'finger', 'or', 'two', ')', '.']
nobody
questionable
['it', 'is', 'clear', 'that', 'the', 'film', 'is', 'nothing', 'more', 'than', 'an', 'attempt', 'to', 'cash', 'in', 'on',

There are a few things to say here:

- Negation often does result in opposite polarity
- It is very reliable at short distance
- But it doesn't always flip the polarity, depending on what comes between the negator and the polar word (e.g. "only")
- And there is a wide variety of possible intervening material
- We need syntactic/semantic knowledge to correctly identify whether a polar word should be negated
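Since negation is most reliable at short distance, a simple baseline is to flip a polar word only when a negator appears within a small window before it. The sketch below makes that assumption explicit; the window size and the toy word sets are invented for illustration, and no syntax is consulted.

```python
toy_negators = {"not", "no", "never", "nothing"}
toy_pos_words = {"good", "spectacular"}
toy_neg_words = {"bad", "wooden"}


def get_counts_with_negation(text, window=2):
    """Count polar words, flipping one when a negator occurs in the
    `window` tokens immediately before it."""
    pos_count = 0
    neg_count = 0
    for i, token in enumerate(text):
        word = token.lower()
        if word not in toy_pos_words and word not in toy_neg_words:
            continue
        preceding = [t.lower() for t in text[max(0, i - window):i]]
        negated = any(t in toy_negators for t in preceding)
        is_positive = word in toy_pos_words
        if negated:
            is_positive = not is_positive  # flip strategy
        if is_positive:
            pos_count += 1
        else:
            neg_count += 1
    return pos_count, neg_count


sentence = "the acting is nothing spectacular , sometimes even wooden".split()
print(get_counts_with_negation(sentence))
```

In this sentence *nothing spectacular* is flipped to negative and *wooden* is left alone, giving 0 positive and 2 negative. A fixed window cannot handle the intervening-material cases listed above.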

Let's take a closer look at the syntax of cases where there are at least 2 words between a negator and a word carrying some semantic orientation. We'll look at a new corpus with lots of sentiment, `product_reviews_1`.

In [31]:
import nltk
nltk.download('product_reviews_1')
from nltk.corpus import product_reviews_1
import benepar
benepar.download('benepar_en3')

b_parser =  benepar.Parser("benepar_en3")

for sent in product_reviews_1.sents():
    for i in range(len(sent)):
        word = sent[i].lower()
        if word in negators:
            for j in range(i+3, len(sent)):
                if sent[j].lower() in pos_words or sent[j].lower() in neg_words:
                    print(word)
                    print(sent[j])
                    print(b_parser.parse(sent))
                    break
            break

[nltk_data] Downloading package product_reviews_1 to
[nltk_data]     /Users/lxy/nltk_data...
[nltk_data]   Unzipping corpora/product_reviews_1.zip.
[nltk_data] Downloading package benepar_en3 to /Users/lxy/nltk_data...
[nltk_data]   Unzipping models/benepar_en3.zip.


never
right
(TOP
  (S
    (VP (VB pass) (NP (DT this) (NN player)) (PRT (RP up)))
    (, ,)
    (CC and)
    (ADVP (RB never))
    (VP
      (VB believe)
      (NP
        (NP (DT the) (NNS reviews))
        (PP (IN on) (NP (DT a) (NN product))))
      (PP (ADVP (RB right)) (IN before) (NP (NN christmas))))
    (. !)))
not
recommend




(TOP
  (S
    (S
      (NP (DT this) (NN player))
      (VP
        (VBZ is)
        (RB not)
        (ADJP (JJ worth) (NP (DT any) (NN price)))))
    (CC and)
    (S
      (NP (PRP i))
      (VP
        (VBP recommend)
        (SBAR
          (IN that)
          (S
            (NP (PRP you))
            (VP
              (VBP do)
              (DT n)
              (FW ')
              (RB t)
              (VP (VB purchase) (NP (PRP it))))))))
    (. .)))
not
progressive
(TOP
  (S
    (NP (PRP i))
    (VP
      (VBP have)
      (RB not)
      (VP
        (VBN used)
        (NP (DT the) (NML (JJ progressive) (NN scan)) (NN feature))))
    (. .)))
neither
work
(TOP
  (S
    (S
      (NP (PRP i))
      (VP
        (VBD bought)
        (NP (NP (CD 2)) (PP (IN of) (NP (DT this) (NN model))))
        (PP (IN for) (NP (NN christmas) (NNS presents)))))
    (CC and)
    (S
      (NP (NP (DT neither) (CD one)) (PP (IN of) (NP (PRP them))))
      (VP (VBP work)))
    (. !)))
not
problems
(TOP
  (

In terms of polarity classification, the obvious reaction to negation is to flip the polarity. If you're working without strength annotation, there's not much else you can do. But consider examples like the following:

*not awesome*

*not terrible*

*not acceptable*

If *awesome* is a 5, is *not awesome* a -5? If *terrible* is a -5, is *not terrible* a 5? If *acceptable* is a 1, is *not acceptable* a -1? Intuitively, negating a strong semantic orientation (SO) word does not actually create an SO expression of opposite polarity (and certainly not of corresponding strength), whereas negating a weakly positive word can create a stronger negative expression, since what is often implied is a failure to clear a low standard. When using granular lexicons, a "shift" strategy may be preferred, i.e. shifting negated expressions toward the opposite polarity by a fixed amount, e.g.

5 (awesome) - 5 = 0

-5 (terrible) + 5 = 0

1 (acceptable) - 5 = -4
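The two strategies can be compared side by side. This is a minimal sketch using the scores from the examples above; `negate_flip` and `negate_shift` are hypothetical helper names, and the shift amount of 5 is the one used in the arithmetic above.

```python
def negate_flip(so):
    """Flip strategy: negation reverses the sign and keeps the strength."""
    return -so


def negate_shift(so, shift=5):
    """Shift strategy: move the score by a fixed amount toward the opposite polarity."""
    if so > 0:
        return so - shift
    if so < 0:
        return so + shift
    return so  # a neutral word stays neutral


for word, so in [("awesome", 5), ("terrible", -5), ("acceptable", 1)]:
    print(word, negate_flip(so), negate_shift(so))
```

Flipping turns *not acceptable* into a mild -1, while shifting gives -4, which matches the intuition that it fails to clear a low standard.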

## Other "valence shifters"

Let's look for modals such as "should" and "could". Irrealis markers (which indicate situations that are not actual fact), such as modals, often neutralize or even negate the sentiment in a text.

In [32]:
modals = {"could","should","would","might"}

for sent in movie_reviews.sents():
    for i in range(len(sent)):
        word = sent[i].lower()
        if word in modals:
            for j in range(i+1, len(sent)):
                if sent[j].lower() in pos_words or sent[j].lower() in neg_words:
                    print(sent)
                    print(word)
                    print(sent[j])
                    break
            break

['there', 'might', "'", 've', 'been', 'a', 'pretty', 'decent', 'teen', 'mind', '-', 'fuck', 'movie', 'in', 'here', 'somewhere', ',', 'but', 'i', 'guess', '"', 'the', 'suits', '"', 'decided', 'that', 'turning', 'it', 'into', 'a', 'music', 'video', 'with', 'little', 'edge', ',', 'would', 'make', 'more', 'sense', '.']
might
pretty
['information', 'on', 'the', 'characters', 'is', 'literally', 'spoon', '-', 'fed', 'to', 'the', 'audience', '(', 'would', 'it', 'be', 'that', 'hard', 'to', 'show', 'us', 'instead', 'of', 'telling', 'us', '?', ')']
would
hard
['with', 'the', 'help', 'of', 'hunky', ',', 'blind', 'timberland', '-', 'dweller', 'garrett', '(', 'carey', 'elwes', ')', 'and', 'a', 'two', '-', 'headed', 'dragon', '(', 'eric', 'idle', 'and', 'don', 'rickles', ')', 'that', "'", 's', 'always', 'arguing', 'with', 'itself', ',', 'kayley', 'just', 'might', 'be', 'able', 'to', 'break', 'the', 'medieval', 'sexist', 'mold', 'and', 'prove', 'her', 'worth', 'as', 'a', 'fighter', 'on', 'arthur', "'"
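One crude way to act on this observation is to stop counting polar words once a modal has been seen in a sentence. The sketch below is a hypothetical heuristic with invented toy word sets, shown only to make the idea concrete; it neutralizes sentiment after a modal and does not attempt to negate it.

```python
toy_modals = {"could", "should", "would", "might"}
toy_pos_words = {"decent", "good"}
toy_neg_words = {"bad", "hard"}


def get_counts_ignoring_irrealis(sent):
    """Count polar words in one sentence, skipping any that follow a modal."""
    pos_count = 0
    neg_count = 0
    for token in sent:
        word = token.lower()
        if word in toy_modals:
            break  # treat the rest of the sentence as irrealis
        if word in toy_pos_words:
            pos_count += 1
        elif word in toy_neg_words:
            neg_count += 1
    return pos_count, neg_count


print(get_counts_ignoring_irrealis("it is bad but it could have been good".split()))
```

Here *bad* is counted but *good*, which follows *could*, is not.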

Another way that writers may intensify is to use discourse structure, in particular concession.

In [33]:
for sent in movie_reviews.sents():
    for i in range(len(sent)):
        word = sent[i].lower()
        if word == "but":
            for j in range(i+1, len(sent)):
                if sent[j].lower() in pos_words or sent[j].lower() in neg_words:
                    print(sent)
                    print(word)
                    print(sent[j])
                    break
            break

['critique', ':', 'a', 'mind', '-', 'fuck', 'movie', 'for', 'the', 'teen', 'generation', 'that', 'touches', 'on', 'a', 'very', 'cool', 'idea', ',', 'but', 'presents', 'it', 'in', 'a', 'very', 'bad', 'package', '.']
but
bad
['which', 'is', 'what', 'makes', 'this', 'review', 'an', 'even', 'harder', 'one', 'to', 'write', ',', 'since', 'i', 'generally', 'applaud', 'films', 'which', 'attempt', 'to', 'break', 'the', 'mold', ',', 'mess', 'with', 'your', 'head', 'and', 'such', '(', 'lost', 'highway', '&', 'memento', ')', ',', 'but', 'there', 'are', 'good', 'and', 'bad', 'ways', 'of', 'making', 'all', 'types', 'of', 'films', ',', 'and', 'these', 'folks', 'just', 'didn', "'", 't', 'snag', 'this', 'one', 'correctly', '.']
but
good
['they', 'seem', 'to', 'have', 'taken', 'this', 'pretty', 'neat', 'concept', ',', 'but', 'executed', 'it', 'terribly', '.']
but
terribly
['now', 'i', 'personally', 'don', "'", 't', 'mind', 'trying', 'to', 'unravel', 'a', 'film', 'every', 'now', 'and', 'then', ',', 'but'
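In sentences like these, the material after *but* usually carries the writer's main point. One hypothetical way to reflect that in word counting is to weight polar words after *but* more heavily; the `boost` value and the toy word sets below are assumptions made for illustration.

```python
toy_pos_words = {"cool", "neat", "beautifully"}
toy_neg_words = {"bad", "terribly", "problematic"}


def get_counts_with_concession(sent, boost=2.0):
    """Count polar words in one sentence, weighting those after "but" by `boost`."""
    pos_total = 0.0
    neg_total = 0.0
    weight = 1.0
    for token in sent:
        word = token.lower()
        if word == "but":
            weight = boost  # material after the concession counts more
        elif word in toy_pos_words:
            pos_total += weight
        elif word in toy_neg_words:
            neg_total += weight
    return pos_total, neg_total


sent = "they took this neat concept , but executed it terribly".split()
print(get_counts_with_concession(sent))
```

Plain counting would tie at one positive and one negative word; with the boost, the negative side wins 2.0 to 1.0.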

Finally, let's look at texts which contain numerous instances of the same polar word.

In [34]:
#provided code
for text in movie_reviews.fileids():
    counts = Counter([word.lower() for word in movie_reviews.words(text)])
    for word in counts:
        if counts[word] > 5 and (word in pos_words or word in neg_words):
            print(word)
            print(counts[word])
            for sent in movie_reviews.sents(text):
                if word in sent:
                    print(sent)

annoying
6
['it', "'", 's', 'not', 'the', 'whole', 'oral', '-', 'sex', '/', 'prostitution', 'thing', '(', 'referring', 'to', 'grant', ',', 'not', 'me', ')', 'that', 'bugs', 'me', ',', 'it', "'", 's', 'the', 'fact', 'that', 'grant', 'is', 'annoying', '.']
['not', 'just', 'adam', 'sandler', '-', 'annoying', ',', 'we', "'", 're', 'talking', 'jim', 'carrey', '-', 'annoying', '.']
['grant', '(', 'flutters', 'eyelashes', ',', 'offers', 'a', 'nervous', 'smile', ',', 'then', 'responds', 'in', 'his', 'annoying', 'english', 'accent', 'and', 'i', '-', 'think', '-', 'i', '-', 'actually', '-', 'have', '-', 'talent', 'attitude', ')', ':', 'could', 'you', 'possibly', 'elaborate', 'on', 'that', '?']
['this', 'paves', 'the', 'way', 'for', 'every', 'possible', 'pregnancy', '/', 'child', 'birth', 'gag', 'in', 'the', 'book', ',', 'especially', 'since', 'grant', "'", 's', 'equally', 'annoying', 'friend', "'", 's', 'wife', 'is', 'also', 'pregnant', '.']
['the', 'annoying', 'friend', 'is', 'played', 'by', 't

Often, repeated words don't indicate sentiment and should be ignored!
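One possible way to act on this, offered as an assumption rather than a recommended method, is to count each polar word at most once per text, so that repetitions add nothing. The word sets below are invented stand-ins for the opinion lexicon.

```python
toy_pos_words = {"funny", "charming"}
toy_neg_words = {"annoying", "dull"}


def get_type_counts(text):
    """Count distinct polar words, so repeats of one word count only once."""
    distinct = {token.lower() for token in text}
    pos_count = len(distinct & toy_pos_words)
    neg_count = len(distinct & toy_neg_words)
    return pos_count, neg_count


text = "annoying annoying annoying but funny and charming".split()
print(get_type_counts(text))
```

Token counting would give 2 positive and 3 negative here; counting types gives 2 positive and 1 negative.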