##### Liam Byrne
##### DATA 620 - Web Analytics
##### Fall - 2017

# Project 4

***

## Introduction
Classification of documents has many uses, from isolating documents of interest in large corpora to determining the origins of texts. This exercise explores an elementary step in text classification: an email spam filter.

Using the [Spambase Data Set](http://archive.ics.uci.edu/ml/datasets/Spambase) from UCI's Machine Learning Repository, we will build, train and test a few classification models. The data set consists of extracted features (57 features covering things like specific word frequencies, punctuation frequencies and statistics on capital letters) and a label marking each email as either "spam" or "ham" (non-spam).

We will first load in our data and import the required packages from Python.

In [2]:
from nltk.corpus import movie_reviews
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import pandas as pd  # needed for the DataFrame built in the last cell
import random
import re

documents = [(list(movie_reviews.words(fileid)), category)
        for category in movie_reviews.categories()
        for fileid in movie_reviews.fileids(category)]

random.seed(7)
random.shuffle(documents)

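all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# Note: list(all_words) is not guaranteed to be in frequency order;
# all_words.most_common(2000) would select the 2000 most frequent words.
word_features = list(all_words)[:2000]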
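
The difference between slicing the key list and asking for the most frequent words can be seen on a toy example. This is a minimal sketch using `collections.Counter` (which `nltk.FreqDist` subclasses in NLTK 3); the `tokens` list is hypothetical stand-in data, not the movie review corpus.

```python
from collections import Counter

# Hypothetical token stream standing in for movie_reviews.words()
tokens = ["plot", "the", "film", "the", "a", "the", "film", "a", "the", "film"]
counts = Counter(t.lower() for t in tokens)

# Slicing the key list follows the dictionary's own order, not frequency
first_two_keys = list(counts)[:2]

# most_common(n) returns the n highest-count (word, count) pairs
top_two = [word for word, _ in counts.most_common(2)]

print(first_two_keys)
print(top_two)
```

Here `top_two` holds the two highest-count words, while `first_two_keys` depends only on the order in which keys were stored.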

In [None]:
def document_features(document):
    document_words = set(document)
    features = {}
    
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
        
    return features
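
To make the shape of the feature dictionary concrete, here is a small self-contained sketch of the same function applied to one short document. The three-word `word_features` list and the `review` tokens are hypothetical; the notebook itself uses a 2,000-word vocabulary.

```python
# Hypothetical three-word vocabulary standing in for the 2,000-word list
word_features = ["great", "boring", "plot"]

def document_features(document):
    # A set makes each membership test cheap
    document_words = set(document)
    return {"contains({})".format(word): (word in document_words)
            for word in word_features}

review = ["a", "great", "film", "with", "a", "thin", "plot"]
features = document_features(review)
print(features)
```

Each document becomes a dictionary with one boolean entry per vocabulary word, which is the input format `nltk.NaiveBayesClassifier` expects.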

In [3]:
featuresets = [(document_features(d), c) for (d,c) in documents]

# Change split to 20% hold-out (from 5%) due to overfitting
split = int(len(featuresets) * 0.2)
train_set, test_set = featuresets[split:], featuresets[:split]
                        
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Score is different than that in text (0.81) --> Bad random_shuffle?
print(nltk.classify.accuracy(classifier, test_set))

classifier.show_most_informative_features(30)

0.8125
Most Informative Features
           contains(ugh) = True              neg : pos    =      8.0 : 1.0
 contains(unimaginative) = True              neg : pos    =      7.3 : 1.0
     contains(atrocious) = True              neg : pos    =      6.9 : 1.0
        contains(shoddy) = True              neg : pos    =      6.6 : 1.0
    contains(schumacher) = True              neg : pos    =      6.5 : 1.0
         contains(waste) = True              neg : pos    =      5.9 : 1.0
      contains(everyday) = True              pos : neg    =      5.9 : 1.0
        contains(turkey) = True              neg : pos    =      5.8 : 1.0
 contains(technological) = True              pos : neg    =      5.4 : 1.0
          contains(mena) = True              neg : pos    =      5.2 : 1.0
        contains(suvari) = True              neg : pos    =      5.2 : 1.0
        contains(canyon) = True              neg : pos    =      5.2 : 1.0
     contains(underwood) = True              neg : pos    =      5.
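
The ratios in this table can be read as how much more likely a feature value is under one label than the other. The sketch below shows one way such a ratio could arise. It assumes the classifier's default smoothing adds 0.5 to each count (expected likelihood estimation) over two possible feature values; the helper names and the document counts are hypothetical, not taken from this run.

```python
def ele_prob(count, total, bins=2, gamma=0.5):
    """Smoothed estimate of P(feature=True | label), assuming ELE smoothing."""
    return (count + gamma) / (total + gamma * bins)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio of the smoothed probabilities under label a versus label b."""
    return ele_prob(count_a, total_a) / ele_prob(count_b, total_b)

# Hypothetical counts: word seen in 23 of 800 negative reviews
# and in 2 of 800 positive reviews
ratio = likelihood_ratio(23, 800, 2, 800)
print("neg : pos = {:.1f} : 1.0".format(ratio))
```

Because the ratio depends on small counts, rare words such as actor names can rank highly even when they carry no sentiment of their own.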

In [4]:
# Instantiate Sentiment Analyzer
sid = SentimentIntensityAnalyzer()

# Collect the 30 most informative features
feat30 = classifier.most_informative_features(30)
feat30 = [{w:b} for w,b in feat30]

# Create container of word and the classified rating (pos/neg)
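# list(w)[0] works on both Python 2 and 3 (dict.keys() is not indexable in 3);
# the raw string avoids invalid-escape warnings in the regex
feat_sent = [{"word": re.sub(r"^.*\(|\).*$", "", list(w)[0]),
              "rating": classifier.classify(w)} for w in feat30]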

# Get sentiment scores for words
for feat in feat_sent:
    feat.update(sid.polarity_scores(feat["word"]))

feat_sent_df = pd.DataFrame(feat_sent)
cols = ['word', 'rating', 'pos', 'neg', 'neu', 'compound']
feat_sent_df = feat_sent_df[cols]
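
The regular expression in this cell strips the `contains(...)` wrapper from each feature name. The sketch below isolates that step and adds a stricter variant for comparison. Both helper names are hypothetical, and the stricter pattern assumes every feature name has exactly the form `contains(<word>)`.

```python
import re

def word_from_feature(name):
    """Strip everything up to the last '(' and from the first remaining ')'."""
    return re.sub(r"^.*\(|\).*$", "", name)

def word_from_feature_strict(name):
    """Capture whatever sits between 'contains(' and the final ')'."""
    match = re.fullmatch(r"contains\((.*)\)", name)
    return match.group(1) if match else name

print(word_from_feature("contains(ugh)"))
print(repr(word_from_feature("contains(()")))
print(repr(word_from_feature_strict("contains(()")))
```

The two agree on ordinary words. They differ only when the token itself is a parenthesis, where the original pattern returns an empty string and the stricter one keeps the token.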

