# Project 4 - Informative Features 

## *Natural Language Processing with Python, Chapter 6.10 Exercises 3 and 4*

> ### Bryant Chang, Thomas Detzel, Sandipayan Nandi, Erik Nylander

In [1]:
# Loading the required Libraries and setting a random seed.
import nltk, random
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk.corpus import senseval
random.seed(4568)

## Exercise 3

### Instructions


The Senseval 2 Corpus contains data intended to train word-sense disambiguation classifiers. It contains data for four words: hard, interest, line, and serve. Choose one of these four words, and load the corresponding data. Using this dataset, build a classifier that predicts the correct sense tag for a given instance.

### 1.0 Looking at the word "interest"
First we choose the word 'interest' and load the corresponding data:

In [2]:
instances = senseval.instances('interest.pos')

### 1.1 Examine individual contexts
The first few contexts all use 'interest' in a financial sense, distinct from its 'curiosity' meaning.

In [3]:
for inst in instances[:5]:
    p = inst.position
    left = ' '.join(w for (w,t) in inst.context[p-2:p])
    word = ' '.join(w for (w,t) in inst.context[p:p+1])
    right = ' '.join(w for (w,t) in inst.context[p+1:p+3])
    senses = ' '.join(inst.senses)
    print('{:18s} | {:9s} | {:<13s} -> {}'.format(left, word, right, senses))

declines in        | interest  | rates .       -> interest_6
indicate declining | interest  | rates because -> interest_6
in short-term      | interest  | rates .       -> interest_6
4 %                | interest  | in this       -> interest_5
company with       | interests | in the        -> interest_5


### 1.2 Surrounding words using a feature set
A good idea is to look at one or two words before and after each instance, the part-of-speech (POS) tags of those words, and a bag of words. The feature extractor below is a first step in that direction: for most instances it records the word and tag immediately before the target.

In [4]:
def features(instance):
    feats = dict()
    p = instance.position
    if p:  # p > 0: the target is not the first token
        # previous word & tag
        feats['wp'] = instance.context[p-1][0]
        feats['tp'] = instance.context[p-1][1]
    else:
        # target is the first token: no previous word, so store a 'BOW' placeholder
        feats['wp'] = (p, 'BOW')
        feats['tp'] = (p, 'BOW')
        # following word & tag (as indented, these are set only in this branch)
        feats['wf'] = instance.context[p+1][0]
        feats['tf'] = instance.context[p+1][1]
    return feats
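
The idea described above (words and tags on both sides of the target) can be sketched as a symmetric window extractor. This is a minimal illustration, not the extractor behind the results in this notebook: `window_features`, its `width` parameter, the `'<pad>'` marker, and the `Instance` stand-in are all hypothetical names, and the sketch assumes every context entry is a `(word, tag)` pair.

```python
from collections import namedtuple

# Hypothetical stand-in that mimics the two Senseval instance attributes used here.
Instance = namedtuple('Instance', ['position', 'context'])

def window_features(instance, width=2):
    """Collect word and tag features up to `width` tokens on each side of the target."""
    feats = {}
    p = instance.position
    context = instance.context
    for offset in range(1, width + 1):
        for sign, name in ((-1, 'prev'), (1, 'next')):
            i = p + sign * offset
            if 0 <= i < len(context):
                word, tag = context[i]
            else:
                # Outside the sentence: use a padding marker instead of failing.
                word, tag = '<pad>', '<pad>'
            feats['w_{}{}'.format(name, offset)] = word.lower()
            feats['t_{}{}'.format(name, offset)] = tag
    return feats

example = Instance(position=2,
                   context=[('declines', 'NNS'), ('in', 'IN'), ('interest', 'NN'),
                            ('rates', 'NNS'), ('.', '.')])
print(window_features(example))
```

Swapping an extractor like this into the notebook would change the feature dictionaries and therefore the accuracies reported further down.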

We build the feature set from the instances that carry exactly one sense tag, shuffle it, and hold out 10 percent as the test set:

In [5]:
featureset = [(features(i), i.senses[0]) 
              for i in instances 
              if len(i.senses)==1]

random.shuffle(featureset)

size = int(len(featureset) * 0.1)
train_set, test_set = featureset[size:], featureset[:size]

print('\nFeature set: {} samples.'.format(len(featureset)))
print('Training set: {} samples.'.format(len(train_set)))
print('Test set: {} samples.'.format(len(test_set)))


Feature set: 2368 samples.
Training set: 2132 samples.
Test set: 236 samples.
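
The shuffle-and-hold-out step can also be wrapped in a small helper. The sketch below is only an illustration: `holdout_split` and its parameters are hypothetical names, and because it uses its own random generator it will not reproduce the exact shuffle used in this notebook.

```python
import random

def holdout_split(items, test_fraction=0.1, seed=4568):
    """Shuffle a copy of `items` and split off the first `test_fraction` as the test set."""
    rng = random.Random(seed)   # local generator, leaves the global seed untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    size = int(len(shuffled) * test_fraction)
    return shuffled[size:], shuffled[:size]

# 2,368 placeholder items, matching the combined size of the two sets above.
train, test = holdout_split(range(2368))
print(len(train), len(test))
```

With 2,368 items and a 10 percent hold-out, the sizes match the training and test counts printed above.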


In [10]:
# take a look
print('\nGlimpse of feature dictionaries')
print('-------------------------------\n')
for i in range(5):
    # !s converts the dict to a string before padding it to 30 characters
    print('{!s:30} -> {}'.format(featureset[i][0], featureset[i][1]))
print('')


Glimpse of feature dictionaries
-------------------------------

{'tp': 'JJR', 'wp': 'lower'}   -> interest_6
{'tp': 'PRP$', 'wp': 'his'}    -> interest_1
{'tp': 'VB', 'wp': 'pay'}      -> interest_6
{'tp': 'JJ', 'wp': 'other'}    -> interest_3
{'tp': 'JJ', 'wp': 'short-term'} -> interest_6



### 2.0 Naive Bayes Classifier
The following code creates a Naive Bayes classifier using the training set and evaluates performance on the test set.

In [11]:
classifier1 = nltk.NaiveBayesClassifier.train(train_set)

In [14]:
accuracy1 = nltk.classify.accuracy(classifier1, test_set)
print('\nAccuracy: {:.2%}'.format(accuracy1))


Accuracy: 73.31%


### 2.1 Decision Tree Classifier
The following code creates a Decision Tree classifier and evaluates its performance on the test set.

In [12]:
classifier2 = nltk.DecisionTreeClassifier.train(train_set)

In [13]:
accuracy2 = nltk.classify.accuracy(classifier2, test_set)
print('\nAccuracy: {:.2%}.'.format(accuracy2))


Accuracy: 70.76%.
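
To put the two accuracies in context, it helps to compare them with a majority-class baseline, that is, the accuracy obtained by always predicting the most frequent sense in the training data. The following is a minimal sketch; `majority_baseline` is a hypothetical helper and the labels are toy data, not the Senseval counts.

```python
from collections import Counter

def majority_baseline(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return most_common_label, hits / float(len(test_labels))

# Toy labels for illustration only.
toy_train = ['interest_6'] * 6 + ['interest_5'] * 3 + ['interest_1']
toy_test = ['interest_6', 'interest_6', 'interest_5', 'interest_1']
print(majority_baseline(toy_train, toy_test))
```

In this notebook the same function could be called with `[s for (_, s) in train_set]` and `[s for (_, s) in test_set]`, since both sets hold `(features, sense)` pairs.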


## Exercise 4

## Introduction and Results

For this problem, we analyze the movie reviews corpus in NLTK using the document classifier from Chapter 6 section 10 of Natural Language Processing with Python to identify the 30 most informative features. The results show that some of the words most strongly associated with positive or negative reviews make semantic sense (*mediocrity* is negative, *uplifting* is positive), while others can be misleading. In particular, the names of actors and directors can become positive or negative features simply because they appear in films that were praised or panned, which contributes to the roughly 30 percent error rate of our classifier.




### 1.0 Get documents, create classifier

The following code creates a list that pairs the words of each review with its classification, either positive or negative. The list is then shuffled into random order.

In [29]:
random.seed(1234)
documents = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

### 1.1 Distribution of reviews

There are an equal number of positive and negative reviews in the list of 2,000 documents.

In [30]:
print('')
print('Positive reviews = %s' % sum(1 for (words, cat) in documents if cat == 'pos'))
print('Negative reviews = %s' % sum(1 for (words, cat) in documents if cat == 'neg'))
print('')


Positive reviews = 1000
Negative reviews = 1000



### 1.2 Create the classifier

We now build a document classifier using the 2,000 most common words in the corpus.

In [31]:
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# Take the first 2,000 keys; whether these are the most frequent words
# depends on the iteration order of FreqDist in the installed NLTK version.
word_features = list(all_words)[:2000]

def document_features(document):
    document_words = set(document)  # set membership tests are faster than list lookups
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features
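
One caveat about selecting the vocabulary: in NLTK 3, `FreqDist` subclasses `collections.Counter`, so iterating over it is not guaranteed to yield words in frequency order. Calling `most_common(n)` makes the ordering explicit. The sketch below uses a plain `Counter` and a toy token list as stand-ins; in the notebook the same call would be made on `all_words`.

```python
from collections import Counter

# Toy token stream for illustration; the notebook would use
# (w.lower() for w in movie_reviews.words()) instead.
tokens = 'the film was good and the plot was thin but the end'.split()
counts = Counter(tokens)

top_n = 2
# most_common returns (word, count) pairs sorted by descending count.
word_features = [word for word, _ in counts.most_common(top_n)]
print(word_features)
```

Applying this to the notebook would change the feature set, and with it the accuracy and the informative features reported below.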

We now use the feature extractor and a Naive Bayes classifier to classify the texts. The classifier has an accuracy of 69 percent on the 100-review test set. This means that in almost a third of cases the classification is incorrect, although that is still better than random chance (50 percent on this balanced data set).

In [32]:
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

### 1.3 Classification accuracy

In [34]:
print('')
print('Classifier accuracy is %.2f percent' % (100 * nltk.classify.accuracy(classifier, test_set)))
print('')


Classifier accuracy is 69.00 percent



### 2.0 Most informative features

Let's take a look at the 30 most informative words for classifying a review as positive or negative. 

In [35]:
classifier.show_most_informative_features(30)

Most Informative Features
          contains(sans) = True              neg : pos    =      9.0 : 1.0
    contains(mediocrity) = True              neg : pos    =      7.7 : 1.0
     contains(dismissed) = True              pos : neg    =      7.0 : 1.0
   contains(bruckheimer) = True              neg : pos    =      6.3 : 1.0
   contains(overwhelmed) = True              pos : neg    =      6.3 : 1.0
        contains(fabric) = True              pos : neg    =      6.3 : 1.0
     contains(uplifting) = True              pos : neg    =      5.9 : 1.0
         contains(patch) = True              neg : pos    =      5.8 : 1.0
       contains(admired) = True              pos : neg    =      5.8 : 1.0
        contains(doubts) = True              pos : neg    =      5.8 : 1.0
          contains(wits) = True              pos : neg    =      5.7 : 1.0
         contains(wires) = True              neg : pos    =      5.7 : 1.0
           contains(ugh) = True              neg : pos    =      5.0 : 1.0
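
The ratios printed above compare how likely a feature value is under one label versus the other. As a rough sketch, assuming a binary feature and the add-0.5 (expected likelihood) smoothing that NLTK's Naive Bayes trainer applies by default, a ratio of this kind can be reproduced from document counts. The function names and the counts below are hypothetical; they are chosen only to show how a figure like 9.0 : 1.0 can arise.

```python
def smoothed_prob(count, total, gamma=0.5, bins=2):
    """Lidstone-smoothed estimate of P(feature value | label)."""
    return (count + gamma) / (total + gamma * bins)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio P(feature=True | label a) / P(feature=True | label b)."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)

# Hypothetical counts: a word seen in 13 of 950 negative
# and 1 of 950 positive training reviews.
ratio = likelihood_ratio(13, 950, 1, 950)
print('neg : pos = {:.1f} : 1.0'.format(ratio))
```

Because the smoothing adds 0.5 to every count, a word that is rare overall can still produce a large ratio when it is nearly absent from one class, which is worth keeping in mind when reading the list.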

### 2.1 Positive Features
*doubts, hugo, effortlessly, fabric, uplifting, wits, topping, overwhelmed, sensational, lang, masquerading, matheson, spins, wang, understands, existential, reap, bandits*

Many of the words associated with a positive review make sense. *Effortlessly, uplifting, overwhelmed, topping, sensational, understands, existential,* and *reap* connote high quality. Words like *doubts, masquerading,* and *spins* suggest that the reviewer was surprised by the quality. Proper names, such as *hugo, lang,* and *matheson*, suggest the quality of individual performances, but it's important to remember that the positive and negative categories apply to the movie: actors and directors can turn in a good performance even in a bad movie. Why does *bandits* associate with better movies? Of the six reviews that contain the word, only one is negative (a lame Disney remake).

In [37]:
def filter_docs(doclist, value):
    # Return a list (not a lazy iterator) so that len() works on the result.
    return [d for d in doclist if value in d[0]]

bandits = filter_docs(documents, 'bandits')

pos_count = len([cat for (words, cat) in bandits if cat == 'pos'])
neg_count = len([cat for (words, cat) in bandits if cat == 'neg'])

print('')
print("Number of reviews containing 'bandits' = %r" % len(bandits))
print('Number of negative reviews = %r' % neg_count)
print('Number of positive reviews = %r' % pos_count)
print('')


Number of reviews containing 'bandits' = 6
Number of negative reviews = 1
Number of positive reviews = 5



### 2.2 Negative Features

*sans, bruckheimer, mediocrity, wires, ugh, tripe, quicker, maxwell, locks, dubious, interviewed, wcw*

The list of negative features is shorter. *Sans, mediocrity, ugh, tripe, quicker,* and *dubious* all clearly connote a bomb. *Bruckheimer*, referring to Jerry Bruckheimer, is interesting because he produces high-grossing films that are massively popular. Some have received good reviews, but in this sample only one of the 10 reviews that mention him is positive: the review of "Gone in 60 Seconds", with Nicolas Cage, Robert Duvall and Angelina Jolie. (There are no 'bandits' mentioned.)

In [38]:
bruckheimer = filter_docs(documents, 'bruckheimer')
pos_count = len([cat for (words, cat) in bruckheimer if cat == 'pos'])
neg_count = len([cat for (words, cat) in bruckheimer if cat == 'neg'])

print('')
print("Number of reviews containing 'bruckheimer' = %r" % len(bruckheimer))
print('Number of negative reviews = %r' % neg_count)
print('Number of positive reviews = %r' % pos_count)
print('')


Number of reviews containing 'bruckheimer' = 10
Number of negative reviews = 9
Number of positive reviews = 1

