6.4) Using the movie review document classifier discussed in this chapter, generate a list of the 30 features that the classifier finds to be most informative.  

In [1]:
import nltk, random
from nltk.corpus import movie_reviews
seed = 41

In [2]:
documents = [(list(movie_reviews.words(fileid)), category)
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]

In [3]:
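words = movie_reviews.words()
all_words = nltk.FreqDist(w.lower() for w in words)
print("Number of all words: {0}".format(len(all_words)))

# Note: this takes the first 2000 keys in whatever order keys() returns them,
# which is not guaranteed to be the 2000 most frequent words.
word_features = all_words.keys()[:2000]
print("Number of key words: {0}".format(len(word_features)))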

Number of all words: 39768
Number of key words: 2000


In [4]:
word_features[:10]

[u'sonja',
 u'askew',
 u'woods',
 u'spiders',
 u'bazooms',
 u'hanging',
 u'francesca',
 u'comically',
 u'localized',
 u'disobeying']
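
The ten words above are rare ones, not common ones, so the slice of `keys()` apparently did not hand back the most frequent words. Here is a minimal sketch of picking the top words by count instead. It assumes NLTK 3, where `FreqDist` subclasses `collections.Counter` and so offers `most_common`. A plain `Counter` stands in for `FreqDist` so the snippet runs without the corpus; the helper name `top_words` and the toy token list are made up for illustration.

```python
from collections import Counter

def top_words(tokens, n):
    """Return the n most frequent lowercased tokens, most frequent first."""
    counts = Counter(w.lower() for w in tokens)
    # most_common(n) yields (word, count) pairs sorted by descending count
    return [w for w, _ in counts.most_common(n)]

# Toy data (hypothetical): 'the' occurs 3 times, 'plot' twice, 'ending' once
toy_tokens = ["The", "plot", "the", "plot", "the", "ending"]
toy_features = top_words(toy_tokens, 2)
print(toy_features)  # ['the', 'plot']
```

With the real corpus, the equivalent would presumably be `[w for w, _ in all_words.most_common(2000)]`. I have not rerun the cells below with that change, so the outputs shown here still come from the original feature list.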

In [5]:
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({0})'.format(word)] = (word in document_words)    
    return features

featuresets = [(document_features(d), c) for (d,c) in documents]

In [6]:
# Test on the first 100 feature sets, train on the rest (no shuffling)
train_set, test_set = featuresets[100:], featuresets[:100]

In [7]:
random.seed(seed)
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [8]:
print(nltk.classify.accuracy(classifier, test_set))

0.7
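
One caveat about this number: the documents list in In [2] is built category by category and is never shuffled, so the first 100 feature sets most likely all come from the first category. That would make 0.7 the accuracy on one class only. Below is a sketch of shuffling before the split, under the assumption that a seeded shuffle is what the `seed` variable was meant for. The helper name `shuffled_split` and the toy data are hypothetical.

```python
import random

def shuffled_split(items, n_test, seed=41):
    """Shuffle a copy of items reproducibly, then split off n_test test items."""
    items = list(items)                 # copy, so the caller's list is untouched
    random.Random(seed).shuffle(items)  # private generator seeded for repeatability
    return items[n_test:], items[:n_test]

# Toy stand-in for featuresets: five 'neg' entries followed by five 'pos' entries
toy = [("doc%d" % i, "neg") for i in range(5)] + \
      [("doc%d" % i, "pos") for i in range(5, 10)]
train, test = shuffled_split(toy, 4)
print(len(train), len(test))
```

Applied here, that would be something like `train_set, test_set = shuffled_split(featuresets, 100)`. The accuracy and the feature list would then differ from the ones recorded in this notebook.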


In [9]:
classifier.show_most_informative_features(30)

Most Informative Features
          contains(sans) = True              neg : pos    =     10.0 : 1.0
    contains(mediocrity) = True              neg : pos    =      8.5 : 1.0
         contains(wires) = True              neg : pos    =      7.0 : 1.0
          contains(hugo) = True              pos : neg    =      6.9 : 1.0
     contains(dismissed) = True              pos : neg    =      6.3 : 1.0
   contains(bruckheimer) = True              neg : pos    =      6.3 : 1.0
   contains(overwhelmed) = True              pos : neg    =      5.7 : 1.0
        contains(fabric) = True              pos : neg    =      5.7 : 1.0
   contains(understands) = True              pos : neg    =      5.6 : 1.0
           contains(ugh) = True              neg : pos    =      5.6 : 1.0
     contains(uplifting) = True              pos : neg    =      5.5 : 1.0
        contains(doubts) = True              pos : neg    =      5.2 : 1.0
         contains(tripe) = True              neg : pos    =      5.1 : 1.0

Can you explain why these particular features are informative? 

I can explain how they are informative: the list above shows how much more likely a review is to be negative or positive when it contains a given word. Explaining why they are informative is tougher. It makes sense that words like sans, mediocrity, tripe, or ugh would push a review toward negative, since they carry negative connotations, but why wires, wcw, or quicker?
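
Part of the answer may be in how the ratios are computed. As far as I understand it, each ratio compares the smoothed probability of seeing the feature under one label with the probability under the other. A word that shows up in only a handful of reviews, all of one class, can therefore get a large ratio. The sketch below illustrates this with invented counts. The add-0.5 smoothing with two bins is my assumption about NLTK's default estimator, and `smoothed_ratio` is a hypothetical helper, not an NLTK function.

```python
def smoothed_ratio(count_a, total_a, count_b, total_b, gamma=0.5, bins=2):
    """Ratio of the larger to the smaller smoothed P(word present | label)."""
    # Additive smoothing keeps a zero count from giving a zero probability
    p_a = (count_a + gamma) / (total_a + bins * gamma)
    p_b = (count_b + gamma) / (total_b + bins * gamma)
    return max(p_a, p_b) / min(p_a, p_b)

# Invented counts: a rare word found in 9 of 1000 reviews of one class, 0 of the other
rare = smoothed_ratio(9, 1000, 0, 1000)       # about 19
# Invented counts: a common word found in 600 vs. 400 of 1000 reviews
common = smoothed_ratio(600, 1000, 400, 1000)  # about 1.5
print(rare, common)
```

Under these assumptions the rare word looks far more "informative" than the common one, even though it appears in under 1% of the reviews. That would fit a list dominated by rare words and names.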

Do you find any of them surprising?

I find several of them surprising. Doubts and dismissed, for example, sound negative to me, yet both make a positive review more likely. Why does 33 affect the outcome at all? And some of these look like names (leia, matheson, bruckheimer, and lang); I'm surprised that specific names would top the list of indicators. If they aren't names, I don't know what they mean. Interesting...