This section of the tutorial covers sentiment extraction: we will examine documents and score each one as positive or negative.

In [16]:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

First, we'll grab some documents that have been categorized as either negative (neg) or positive (pos) from NLTK's movie review corpus.

In [17]:
negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

Now, we'll write a function that will build a featureset for each one of these documents. Our featureset is a bag of words that contains all of the words appearing in our document.

In [18]:
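def word_feats(words):
    # Lowercase each token so 'Great' and 'great' map to the same feature.
    words = [w.lower() for w in words]
    # Bag of words: every distinct word becomes a feature set to True.
    return dict((word, True) for word in words)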
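
As a quick check, we can run the feature extractor on a short hand-made token list. The tokens below are made up for illustration and do not come from the corpus, and the function is repeated so that the snippet runs on its own.

```python
def word_feats(words):
    # Lowercase every token, then mark each distinct word as present.
    words = [w.lower() for w in words]
    return dict((word, True) for word in words)

# Hypothetical tokens, not taken from movie_reviews.
sample_tokens = ['A', 'great', 'film', 'a', 'GREAT', 'cast']
feats = word_feats(sample_tokens)

print(sorted(feats.keys()))  # ['a', 'cast', 'film', 'great']
```

Note that repeated words collapse into a single feature: the featureset records whether a word appears, not how often.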

Now we can extract our features, and divide our data, so that we have a training set and a test set.

In [19]:
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]
 
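# Use the first three quarters of each class for training.
# Floor division (//) keeps the cutoff an integer, so it is a valid slice index.
negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4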

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
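
To see the split on a small scale, here is a sketch that uses made-up stand-ins for our two feature lists (eight labelled items each). The names `neg`, `pos`, `train`, and `test` are hypothetical and only serve this example.

```python
# Hypothetical stand-ins for negfeats and posfeats: 8 labelled items each.
neg = [({'w%d' % i: True}, 'neg') for i in range(8)]
pos = [({'w%d' % i: True}, 'pos') for i in range(8)]

# Three quarters of each class go to training; floor division gives an int.
negcut = len(neg) * 3 // 4
poscut = len(pos) * 3 // 4

train = neg[:negcut] + pos[:poscut]
test = neg[negcut:] + pos[poscut:]

print('train: %d, test: %d' % (len(train), len(test)))  # train: 12, test: 4
```

Splitting each class separately keeps the training and test sets balanced between the two labels.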

Now, we train our classifier, test its accuracy, and then print out the most informative features in our featureset.

In [20]:
classifier = NaiveBayesClassifier.train(trainfeats)
print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
classifier.show_most_informative_features()

accuracy: 0.728
Most Informative Features
             magnificent = True              pos : neg    =     15.0 : 1.0
             outstanding = True              pos : neg    =     13.6 : 1.0
               insulting = True              neg : pos    =     13.0 : 1.0
              vulnerable = True              pos : neg    =     12.3 : 1.0
               ludicrous = True              neg : pos    =     11.8 : 1.0
                  avoids = True              pos : neg    =     11.7 : 1.0
             uninvolving = True              neg : pos    =     11.7 : 1.0
              astounding = True              pos : neg    =     10.3 : 1.0
             fascination = True              pos : neg    =     10.3 : 1.0
                 idiotic = True              neg : pos    =      9.8 : 1.0
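
Each row of this listing is a likelihood ratio: how much more often a feature value occurs under one label than under the other. The sketch below shows one way such a ratio could be computed. The counts are hypothetical, and the add-0.5 smoothing is an assumption made for illustration; NLTK's internal estimator may differ in detail.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    # Add-gamma smoothing over `bins` possible feature values,
    # so a word never seen under a label still gets a nonzero probability.
    return (count + gamma) / float(total + gamma * bins)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    # How much more likely the feature is under label a than under label b.
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)

# Hypothetical counts: a word seen in 29 of 750 'pos' training reviews
# and in 1 of 750 'neg' training reviews.
ratio = likelihood_ratio(29, 750, 1, 750)
print('pos : neg = %.1f : 1.0' % ratio)  # pos : neg = 19.7 : 1.0
```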


Now, we can classify a review by feeding its featureset to our classifier. Here we take one of the held-out test reviews and ask for the probability of each label.

In [45]:
temp = classifier.prob_classify(testfeats[7][0])
print temp.samples()
print temp.prob('neg')
print temp.prob('pos')

['neg', 'pos']
0.999997377199
2.62280111245e-06
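
To get a feel for where these probabilities come from, here is a simplified sketch of the Naive Bayes calculation: multiply the prior for each label by the likelihood of every observed feature, then normalise. The function `posterior` and all the numbers are hypothetical, and the sketch ignores absent features, which a full implementation may also take into account.

```python
import math

def posterior(prior, likelihoods, features):
    # Sum log-probabilities per label to avoid underflow on long documents.
    log_scores = {}
    for label in prior:
        score = math.log(prior[label])
        for f in features:
            score += math.log(likelihoods[label][f])
        log_scores[label] = score
    # Shift by the largest score before exponentiating, then normalise
    # so the label probabilities sum to 1.
    top = max(log_scores.values())
    exp_scores = dict((l, math.exp(s - top)) for l, s in log_scores.items())
    total = sum(exp_scores.values())
    return dict((l, v / total) for l, v in exp_scores.items())

# Hypothetical priors and per-label word likelihoods.
prior = {'neg': 0.5, 'pos': 0.5}
likelihoods = {
    'neg': {'ludicrous': 0.04, 'idiotic': 0.03},
    'pos': {'ludicrous': 0.004, 'idiotic': 0.003},
}

probs = posterior(prior, likelihoods, ['ludicrous', 'idiotic'])
print('neg: %.4f' % probs['neg'])  # neg: 0.9901
print('pos: %.4f' % probs['pos'])  # pos: 0.0099
```

Because the per-word likelihoods are multiplied together, a review with many words that each lean the same way can end up with a probability very close to 1.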


Let's look at another application of Naive Bayes classification, from the NLTK book.

We're going to classify names by gender, based on a featureset we design. First, let's import some labeled names, and then shuffle them.

In [87]:
import random
from nltk.corpus import names

labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)

Now we can choose what features these names are going to have. Each name plays the role of a document. Whereas each movie review got a feature for every word it contained, each name gets a single feature: its last letter.

In [99]:
def gender_features(word):
    return {'last_letter': word[-1]}
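
A quick look at what this extractor produces for a couple of sample names (the names are just examples, and the function is repeated so the snippet runs on its own):

```python
def gender_features(word):
    # The only feature is the final character of the name.
    return {'last_letter': word[-1]}

neo = gender_features('Neo')
trinity = gender_features('Trinity')

print(neo)      # {'last_letter': 'o'}
print(trinity)  # {'last_letter': 'y'}
```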

Now we construct our training and test sets. Since the names are already shuffled, we hold out the first 500 for testing and train on the rest.

In [112]:
featureset = [(gender_features(n), gender) for (n, gender) in labeled_names]
trainfeats2, testfeats2 = featureset[500:], featureset[:500]
classifier2 = nltk.NaiveBayesClassifier.train(trainfeats2)

In [113]:
print 'accuracy:', nltk.classify.util.accuracy(classifier2, testfeats2)
classifier2.show_most_informative_features()

accuracy: 0.79
Most Informative Features
             last_letter = u'a'           female : male   =     33.4 : 1.0
             last_letter = u'k'             male : female =     29.1 : 1.0
             last_letter = u'f'             male : female =     27.5 : 1.0
             last_letter = u'p'             male : female =     20.8 : 1.0
             last_letter = u'v'             male : female =     11.1 : 1.0
              sec_letter = u'z'             male : female =     10.7 : 1.0
             last_letter = u'd'             male : female =     10.0 : 1.0
             last_letter = u'o'             male : female =      9.0 : 1.0
             last_letter = u'm'             male : female =      8.8 : 1.0
   second_to_last_letter = u'o'             male : female =      7.7 : 1.0


In [53]:
temp = classifier2.prob_classify(gender_features('Neo'))

In [54]:
print temp.samples()
print temp.prob('male')

['male', 'female']
0.83038276398


We can modify our feature extractor to include anything we think might help, for example the first letter of the name as well as the last.

In [115]:
def gender_features(word):
    myDict = dict()
    myDict['first_letter'] = word[0]
    myDict['last_letter'] = word[-1]
    return myDict
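
The sketch below shows the effect of the extra feature on two sample names that end in the same letter. The names are just examples, `last_only` is a hypothetical helper standing in for the earlier one-feature extractor, and the function is written as a dict literal here, which is equivalent to the version above.

```python
def gender_features(word):
    # Two features per name: its first and its last character.
    return {'first_letter': word[0], 'last_letter': word[-1]}

def last_only(word):
    # Hypothetical stand-in for the earlier extractor, for comparison.
    return {'last_letter': word[-1]}

a = gender_features('Neo')
b = gender_features('Theo')

print(last_only('Neo') == last_only('Theo'))  # True: indistinguishable before
print(a == b)                                 # False: first letters differ
```

Whether the extra feature actually improves accuracy has to be checked by retraining the classifier and testing it on the held-out names.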