## Sentiment Analysis

Let's use NLTK's movie review corpus (`movie_reviews`) to train a Naive Bayes sentiment classifier that labels each review as positive (`pos`) or negative (`neg`).

In [1]:
import random
import nltk

In [2]:
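from nltk.corpus import movie_reviews
# Pair each review (as a list of word tokens) with its category label.
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
# Shuffle so that the train/test split further down mixes both categories.
random.shuffle(documents)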
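
Because `random.shuffle` is called without a seed, the train/test split, and therefore the accuracy, will differ from run to run. Here is a minimal sketch of one way to make the shuffle reproducible, using a toy list in place of the real `documents`. The seed value 42 and the names `toy_documents` and `seeded_shuffle` are arbitrary choices for illustration.

```python
import random

# Hypothetical stand-in for the (words, category) tuples built above.
toy_documents = [
    (["good", "film"], "pos"),
    (["bad", "film"], "neg"),
    (["great", "cast"], "pos"),
    (["dull", "plot"], "neg"),
]


def seeded_shuffle(items, seed=42):
    """Return a shuffled copy of items; the same seed gives the same order."""
    shuffled = list(items)
    # A private Random instance leaves the global random state untouched.
    random.Random(seed).shuffle(shuffled)
    return shuffled


first = seeded_shuffle(toy_documents)
second = seeded_shuffle(toy_documents)
print(first == second)
```

Calling `random.seed(...)` once before `random.shuffle(documents)` would be a simpler alternative if changing the global random state is acceptable.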

In [3]:
# Each document is a (words, category) tuple.
# Index 0 holds the review as a list of word tokens; index 1 holds its label ('pos' or 'neg').
# Let's look at one.

# classification
print(documents[0][1])
print()  # Blank line for spacing

# review
print(" ".join(documents[0][0]))


neg

it ' s a sad state of affairs when the back box blurb is more exciting than the movie contained within it . such is the case for the 1990 paul mayersberg film _the last samurai_ . though the blurb alludes to " a jungle filled with political intrigue , uneasy alliances , and murderous enemies at every turn , " the story of the movie is actually quite simple ( and prosaic ) : a middle - aged japanese businessman named endo ( played by john fujioka ) and his assistant , both of whom have samurai aspirations , travel to africa in search of his ancestor , who went to bring buddhism to africa . he hires the services of down - at - the - heels vietnam veteran pilot johnny congo ( the redoubtable lance henriksen ) and his girlfriend ( arabella holzbog ) , and travels to the camp of an arms - merchant - cum - safari - host - cum - islamic - missionary ( john saxon ) and his wife ( lisa eilbacher ) . they are all kidnapped by an african revolutionary guerilla with witch - doctor aspirations

In [5]:
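# Build a frequency distribution over every (lowercased) word in the corpus
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

# Keep the 2500 most frequent words as the feature vocabulary
word_features = [word for word, count in all_words.most_common(2500)]
# Older alternative; relies on FreqDist keys being sorted by frequency, which NLTK 3 does not guarantee:
#word_features = list(all_words)[:2500]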


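def document_features(document):
    # Convert the word list to a set: a membership test on a set takes roughly
    # constant time, while on a list it scans every element, and we run one
    # test per feature word.
    document_words = set(document)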
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features
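
To see what `document_features` produces, here is a small sketch of the same idea with a three-word vocabulary. The names `toy_vocabulary` and `toy_document_features` and the sample review are made up for illustration; the real function uses the 2500-word `word_features` list.

```python
# Hypothetical three-word vocabulary standing in for word_features.
toy_vocabulary = ["good", "bad", "plot"]


def toy_document_features(document, vocabulary):
    """Map each vocabulary word to True/False depending on its presence."""
    document_words = set(document)  # fast membership tests
    return {
        "contains({})".format(word): (word in document_words)
        for word in vocabulary
    }


# Made-up review, already tokenized like the corpus documents.
review = ["the", "plot", "was", "good", ",", "really", "good"]
features = toy_document_features(review, toy_vocabulary)
print(features)
```

Every review becomes a dictionary with the same keys and boolean values, so word order and repetition ("good" appears twice above) are discarded.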

In [6]:
featuresets = [(document_features(d), c) for (d, c) in documents]
# Hold out the first 100 reviews for testing; train on the rest.
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [7]:
print(nltk.classify.accuracy(classifier, test_set))

0.77
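
The accuracy is the fraction of held-out reviews whose predicted label matches the true label; with 100 test reviews, 0.77 means 77 were labeled correctly. The sketch below spells out that calculation for any object with a `classify(features)` method. `manual_accuracy`, `AlwaysPos`, and the toy test set are hypothetical stand-ins for illustration, not part of NLTK.

```python
def manual_accuracy(classifier, labeled_featuresets):
    """Fraction of (features, label) pairs the classifier labels correctly."""
    correct = sum(
        1
        for features, label in labeled_featuresets
        if classifier.classify(features) == label
    )
    return correct / len(labeled_featuresets)


class AlwaysPos:
    """Hypothetical baseline that ignores its input and always answers 'pos'."""

    def classify(self, features):
        return "pos"


# Made-up test set: two positive and two negative examples.
toy_test_set = [
    ({"contains(good)": True}, "pos"),
    ({"contains(bad)": True}, "neg"),
    ({"contains(fine)": True}, "pos"),
    ({"contains(lame)": True}, "neg"),
]

toy_accuracy = manual_accuracy(AlwaysPos(), toy_test_set)
print(toy_accuracy)
```

Since the corpus is balanced (1000 positive and 1000 negative reviews), always guessing one label scores about 0.5, so 0.77 is well above that baseline.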


In [8]:
classifier.show_most_informative_features(50)

Most Informative Features
   contains(outstanding) = True              pos : neg    =     10.8 : 1.0
         contains(mulan) = True              pos : neg    =      9.0 : 1.0
        contains(finest) = True              pos : neg    =      7.9 : 1.0
        contains(seagal) = True              neg : pos    =      7.8 : 1.0
    contains(schumacher) = True              neg : pos    =      7.4 : 1.0
   contains(wonderfully) = True              pos : neg    =      7.0 : 1.0
        contains(wasted) = True              neg : pos    =      5.8 : 1.0
          contains(lame) = True              neg : pos    =      5.7 : 1.0
         contains(awful) = True              neg : pos    =      5.7 : 1.0
        contains(ripley) = True              pos : neg    =      5.6 : 1.0
         contains(flynt) = True              pos : neg    =      5.6 : 1.0
         contains(damon) = True              pos : neg    =      5.4 : 1.0
         contains(waste) = True              neg : pos    =      5.1 : 1.0
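
Each row reports how much more likely a feature value is under one label than under the other. For example, `10.8 : 1.0` for `contains(outstanding)` means the word's presence is estimated to be about 10.8 times more likely in a positive review than in a negative one. Here is a rough sketch of that ratio, assuming add-0.5 (expected likelihood) smoothing over the two feature values True and False, which is our understanding of NLTK's default estimator. The function names are hypothetical, and the counts are invented for illustration rather than taken from the corpus.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Smoothed estimate of P(feature value | label).

    Assumption: add-gamma smoothing with gamma=0.5 over `bins` possible
    feature values (True and False here).
    """
    return (count + gamma) / (total + bins * gamma)


def likelihood_ratio(count_a, total_a, count_b, total_b):
    """How many times more likely the feature value is under label a than b."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)


# Invented counts: reviews per label that contain some word.
pos_with_word, pos_total = 20, 99
neg_with_word, neg_total = 1, 99

ratio = likelihood_ratio(pos_with_word, pos_total, neg_with_word, neg_total)
print("pos : neg = {:.1f} : 1.0".format(ratio))
```

Several of the top features (`mulan`, `seagal`, `schumacher`, `damon`) are names rather than sentiment words, which suggests the classifier is partly learning which films and people appear in this particular corpus.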