##### Liam Byrne
##### DATA 620 - Web Analytics
##### Fall - 2017

# Project 4

***

## Introduction
From `nltk`'s movie review corpus, we are given 2000 movie reviews, each with a binary rating of `positive` or `negative`. The textbook ([Section 6.3](http://www.nltk.org/book/ch06.html#ref-document-classify-set)) outlines a feature extraction method:

+ Binarize the presence of high frequency words in the reviews
+ Train a Naive Bayes classifier on the reviews and rating
+ Output the most informative features in the classification method

We will replicate this and look at the 30 most informative features in the reviews that predict a rating. We will then try to infer why these features are informative and make a basic comparison between our predictions and `nltk`'s sentiment analyzer.

### Package and Data Upload

In [41]:
from nltk.corpus import movie_reviews
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import random
import re
from IPython.display import display
import pandas as pd

# Gather reviews from movie_reviews
documents = [(list(movie_reviews.words(fileid)), category)
        for category in movie_reviews.categories()
        for fileid in movie_reviews.fileids(category)]

# Seed and shuffle docs
random.seed(2**7-1)
random.shuffle(documents)

# Get the frequency distribution of all words and keep 2000 of them as features
# (note: in NLTK 3, list(FreqDist) is not guaranteed to be sorted by frequency)
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]
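
A note on the slice above: in NLTK 3, `FreqDist` subclasses `collections.Counter`, so `list(all_words)` iterates in dictionary order rather than by frequency. The sketch below uses a plain `Counter` on a made-up token list (the tokens are an assumption, chosen only for illustration) to show the difference between slicing the keys and asking for `most_common`. It is not the code that produced the results in this notebook.

```python
from collections import Counter

# Hypothetical token stream; rare words happen to appear first
tokens = ["a", "film", "the", "plot", "the", "plot", "the"]
counts = Counter(tokens)

# Slicing the keys follows insertion order (Python 3.7+), not frequency
first_seen = list(counts)[:2]

# most_common(n) returns (word, count) pairs sorted by descending count
most_frequent = [word for word, _ in counts.most_common(2)]

print(first_seen)
print(most_frequent)
```

With a `FreqDist`, the equivalent call would be `all_words.most_common(2000)`. Switching to it would change the feature set and therefore the accuracy and feature list reported below.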

### Function Definitions
The following feature extractor creates a feature set for each review with the binarized presence of the top 2000 words in the corpus.

In [18]:
def document_features(document):
    document_words = set(document)
    features = {}
    
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
        
    return features
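
To make the shape of a feature set concrete, here is a minimal sketch of the same extractor applied to a toy review. The three-word vocabulary and the sample tokens are assumptions for illustration, and the vocabulary is passed in as a parameter (unlike the notebook's version, which reads the global `word_features`) so that the snippet is self-contained.

```python
def document_features(document, word_features):
    """Map each vocabulary word to True/False depending on its presence."""
    document_words = set(document)  # set lookup is O(1) per word
    return {'contains({})'.format(word): (word in document_words)
            for word in word_features}

# Hypothetical vocabulary and tokenized review
toy_vocabulary = ["good", "bad", "plot"]
toy_review = ["a", "good", "plot", "with", "a", "good", "cast"]

features = document_features(toy_review, toy_vocabulary)
print(features)
```

Repeated words in the review do not matter here: the feature records presence only, not counts.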

### Feature Extraction, Splits and Classification
Unlike the text, which keeps only a 5% hold-out for testing, we will keep a 20% hold-out. A larger test set gives a more reliable accuracy estimate and a better check against over-fitting. After the classifier is trained, we will look at the most informative features and infer their meaning.

In [20]:
featuresets = [(document_features(d), c) for (d,c) in documents]

# Use a 20% hold-out (instead of the text's 5%) for a more reliable test estimate
n_test = int(len(featuresets) * .2)
train_set, test_set = featuresets[n_test:], featuresets[:n_test]

classifier = nltk.NaiveBayesClassifier.train(train_set)

# Print accuracy score of classified test data
print(nltk.classify.accuracy(classifier, test_set))

# Show the top 30 most informative features
classifier.show_most_informative_features(30)

0.8025
Most Informative Features
         contains(tones) = True              pos : neg    =      9.0 : 1.0
        contains(turkey) = True              neg : pos    =      7.8 : 1.0
         contains(groan) = True              neg : pos    =      7.0 : 1.0
    contains(schumacher) = True              neg : pos    =      6.6 : 1.0
       contains(martian) = True              neg : pos    =      6.3 : 1.0
        contains(welles) = True              neg : pos    =      6.3 : 1.0
        contains(shoddy) = True              neg : pos    =      6.3 : 1.0
      contains(everyday) = True              pos : neg    =      6.2 : 1.0
       contains(singers) = True              pos : neg    =      5.7 : 1.0
        contains(temper) = True              pos : neg    =      5.7 : 1.0
       contains(bronson) = True              neg : pos    =      5.6 : 1.0
       contains(miscast) = True              neg : pos    =      5.5 : 1.0
         contains(kudos) = True              pos : neg    =      5.

### Thoughts on Informative Features
An accuracy of ~80% is very promising, given that the classifier uses only the presence of words in the reviews. Looking at the list of features, it is not obvious why the use of these words should correlate with the rating of the movie. For instance, the word `tones` has a positive-to-negative ratio (p/n) of 9:1. `tones` could relate to the themes of the movie or to allusions to other movies; a writer who goes into such detail as remarking on the `tone` may well be passionate about the movie. Similarly, `groan`, with a p/n of 1:7, is most likely the writer describing their reaction to the poor quality of the movie. There are also terms that appear to be the names of directors, actors or actresses (e.g. `welles`, `schumacher` and `underwood`). These could relate to the quality of the movies these people are involved in, or to the clichéd copying of their styles.
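
As a rough illustration of where a ratio such as 9.0 : 1.0 comes from, the sketch below divides two smoothed conditional probabilities, P(word present | pos) and P(word present | neg). It assumes add-0.5 (expected likelihood) smoothing over two feature values, which we believe matches NLTK's default estimator, and the counts are hypothetical, picked only because they reproduce a 9:1 ratio. They are not taken from the corpus.

```python
def ele_prob(count, total, bins=2, gamma=0.5):
    """Smoothed estimate of P(feature value | label) with add-gamma smoothing."""
    return (count + gamma) / (total + bins * gamma)

def likelihood_ratio(pos_with, pos_total, neg_with, neg_total):
    """Ratio of P(word present | pos) to P(word present | neg)."""
    return ele_prob(pos_with, pos_total) / ele_prob(neg_with, neg_total)

# Hypothetical counts: the word appears in 13 of 800 positive training
# reviews and in 1 of 800 negative training reviews
ratio = likelihood_ratio(pos_with=13, pos_total=800, neg_with=1, neg_total=800)
print(round(ratio, 1))
```

Note how few occurrences are needed to produce a large ratio. A word seen in only a handful of reviews can top the list, which may explain why some of the features look arbitrary.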

There are some unusual appearances near the top of the list, namely `turkey` with a p/n of 1:7.8. The context is not clear from the feature alone; maybe Turkish films are notoriously bad or, more likely, this is movie-review speak for a bad movie ("turkey" is common slang for a flop).

Simply put, these features should elicit some sort of emotional response. Even the noun features like `highway`, `canyon` and `marketplace` elicit some flash, feeling or idea when one uses them in writing. `nltk` has a sentiment analysis module based on VADER (Valence Aware Dictionary and sEntiment Reasoner) Sentiment Analysis. According to the [repository](https://github.com/cjhutto/vaderSentiment), "VADER is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains." The following compares the classifier's ratings for our feature words with VADER's results.

### Comparing Select Feature Prediction to VADER
!["Vader"](http://t-redactyl.io/figure/Vader_1.jpg)

As stated previously, VADER was designed to analyze social media text. Our input is a single word extracted from the movie reviews and, as we will see, VADER does not do a good job with single words; nor is it expected to.

VADER needs context: sentence structure and all the grammar that goes into describing a sentiment. It will see most single words as neutral, unless they carry a strong enough sentiment by themselves. The following creates a table to compare the high scoring features and their resultant `positive` or `negative` rating against a `positive`, `negative` or `neutral` sentiment from VADER.

The table features are as follows:
+ word: The word taken from one of the most informative features
+ rating: The classifier's rating when that word is present, with `pos: positive` and `neg: negative`
+ pos: The positive sentiment score for the word from VADER
+ neg: The negative sentiment score for the word from VADER
+ neu: The neutral sentiment score for the word from VADER
+ compound: The sum of all of the lexicon ratings, normalized to range between -1 and 1
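
For the `compound` column, the normalization can be sketched as follows. This assumes the formula used in the VADER reference implementation, `x / sqrt(x^2 + alpha)` with `alpha = 15`. The raw valence of -1.8 is an assumption chosen for illustration (it happens to reproduce the compound score reported for `temper` in the table below), not a value read from the lexicon.

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded valence sum into the range [-1, 1]."""
    norm_score = score / math.sqrt(score * score + alpha)
    # Clamp as a safeguard against floating-point overshoot
    return max(-1.0, min(1.0, norm_score))

# Hypothetical raw valence for a single mildly negative word
compound = normalize(-1.8)
print(round(compound, 4))
```

A word that is missing from the lexicon contributes a valence of 0, and `normalize(0)` is 0, which is why most single words come back as fully neutral.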

In [45]:
# Instantiate Sentiment Analyzer
sid = SentimentIntensityAnalyzer()

# Collect the 30 most informative features
feat30 = classifier.most_informative_features(30)
feat30 = [{w:b} for w,b in feat30]

# Create container of word and the classified rating (pos/neg)
# Strip the "contains(" prefix and ")" suffix to recover the bare word
feat_sent = [{"word": re.sub(r"^.*\(|\).*$", "", list(w)[0]),
              "rating": classifier.classify(w)} for w in feat30]

# Get sentiment scores for words
for feat in feat_sent:
    feat.update(sid.polarity_scores(feat["word"]))

feat_sent_df = pd.DataFrame(feat_sent)
cols = ['word', 'rating', 'pos', 'neg', 'neu', 'compound']
feat_sent_df = feat_sent_df[cols]

display(feat_sent_df)

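| | word | rating | pos | neg | neu | compound |
|---|---|---|---|---|---|---|
| 0 | tones | pos | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | turkey | neg | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | groan | neg | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | schumacher | neg | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | martian | neg | 0.0 | 0.0 | 1.0 | 0.0 |
| 5 | welles | neg | 0.0 | 0.0 | 1.0 | 0.0 |
| 6 | shoddy | neg | 0.0 | 0.0 | 1.0 | 0.0 |
| 7 | everyday | pos | 0.0 | 0.0 | 1.0 | 0.0 |
| 8 | singers | pos | 0.0 | 0.0 | 1.0 | 0.0 |
| 9 | temper | pos | 0.0 | 1.0 | 0.0 | -0.4215 |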


Of the 20 feature words, VADER ventures a non-neutral guess for only 7. When it does, it agrees with the classifier 5 out of 7 times, which works out to ~25% (5/20) agreement with the classifier overall.
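
The tally above can be sketched in a few lines. The snippet below uses only the ten rows visible in the table, so its counts differ from the figures quoted for the full list. The ±0.05 cutoff on `compound` is an assumption borrowed from the convention suggested in the VADER repository, and `vader_label` is a hypothetical helper, not part of `nltk`.

```python
# (word, classifier rating, VADER compound) for the ten rows shown above
rows = [
    ("tones", "pos", 0.0), ("turkey", "neg", 0.0), ("groan", "neg", 0.0),
    ("schumacher", "neg", 0.0), ("martian", "neg", 0.0), ("welles", "neg", 0.0),
    ("shoddy", "neg", 0.0), ("everyday", "pos", 0.0), ("singers", "pos", 0.0),
    ("temper", "pos", -0.4215),
]

def vader_label(compound, threshold=0.05):
    """Map a compound score to pos/neg/neu using an assumed cutoff."""
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"

labelled = [(word, rating, vader_label(c)) for word, rating, c in rows]
non_neutral = [row for row in labelled if row[2] != "neu"]
agree = [row for row in non_neutral if row[1] == row[2]]

print("non-neutral: {} of {}".format(len(non_neutral), len(labelled)))
print("agree with classifier: {} of {}".format(len(agree), len(non_neutral)))
```

In these ten rows only `temper` receives a non-neutral score, and VADER's `neg` disagrees with the classifier's `pos` for it.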

### Conclusions
The simple Naive Bayes classifier did a fairly decent job of classifying the reviews as positive or negative. Most of the top scoring features were not intuitive at face value; one would expect words like "excellent", "awesome" or "best" to be top positive review predictors and words like "worst", "terrible" or "garbage" to be top negative review predictors. Movie reviewers may dance around these words, as they can signal an unfair or poorly written review. Such words take on meaning in descriptive phrases and context, which VADER also seemed to need in order to classify a statement as positive or negative.

The use of VADER was a tangential idea, prompted by the realization that the feature words were most likely sentiment based. It shows the complexity of classification and why it cannot be a one-size-fits-all application. With the complexity of language and the nuance of words, a much more robust method must be used to classify even movie reviews with a high degree of accuracy.