#  Large Movie Review Dataset

We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. 
There is additional unlabeled data for use as well. 
Raw text and already processed bag of words formats are provided


#### importing the data set 

In [1]:
import nltk
import random
import pickle
from nltk.corpus import movie_reviews
import os
import pandas as pd

Basically, in plain English, the above code is translated to: 
In each category (we have pos or neg), take all of the file IDs (each review has its own ID), 
then store the word_tokenized version (a list of words) for the file ID, followed by the positive or negative 
label in one big list.

Next, we use random to shuffle our documents. 
This is because we're going to be training and testing. If we left them in order, chances are we'd train on 
all of the negatives, some positives, and then test only against positives. 
We don't want that, so we shuffle the data.

Then, just so you can see the data you are working with, we print out documents[1], 
which is a big list, where the first element is a list the words, and 
the 2nd element is the "pos" or "neg" label.

In [2]:
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

all_words = []

for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)

word_features = list(all_words.keys())[:3000]

# def find_features(documents):
#     words = tuple(documents)
#     features = {}
#     for w in word_features:
#         features[w] = (w in words)

#     return features

# features = find_features(documents)


#### Converting words to Features with NLTK

Mostly the same as before, only with now a new variable, word_features, which contains the top 3,000 
most common words. Next, we're going to build a quick function that will find these top 3,000 words in 
our positive and negative documents, marking their presence as either positive or negative:

In [3]:
def find_features(documents):
    words = tuple(documents)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

features = find_features(documents)


 we can print one feature set

In [4]:
# print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))

Then we can do this for all of our documents, saving the feature existence booleans and their 
respective positive or negative categories by doing:

In [5]:
featuresets = [(find_features(rev), category) for (rev, category) in documents]

we need to go ahead and split up the data into a training set and a testing set.

In [6]:
# set that we'll train our classifier with
training_set = featuresets[:1900]

# set that we'll cross validation our classifier with
cross_validation=training_set[:400]

# set that we'll test against.
testing_set = featuresets[1900:]

In [7]:
# open a file, where you want to store the data
file = open('testing.pickel', 'wb')

# dump information to that file
pickle.dump(testing_set, file)

# close the file
file.close()

## Naive Bayes Classifier with NLTK:
    This is a pretty popular algorithm used in text classification, so it is only fitting that we try it 
    out first

In [8]:
classifier = nltk.NaiveBayesClassifier.train(training_set)

In [9]:
print("Classifier accuracy percent training_set:",(nltk.classify.accuracy(classifier, cross_validation))*100)


Classifier accuracy percent training_set: 87.25


In [10]:
classifier.show_most_informative_features(15)

Most Informative Features
                   sucks = True              neg : pos    =     17.6 : 1.0
                  annual = True              pos : neg    =      9.7 : 1.0
                     ugh = True              neg : pos    =      9.6 : 1.0
                 frances = True              pos : neg    =      9.1 : 1.0
                   groan = True              neg : pos    =      7.6 : 1.0
              schumacher = True              neg : pos    =      7.4 : 1.0
             silverstone = True              neg : pos    =      7.0 : 1.0
                 idiotic = True              neg : pos    =      7.0 : 1.0
                  shoddy = True              neg : pos    =      7.0 : 1.0
           unimaginative = True              neg : pos    =      7.0 : 1.0
               atrocious = True              neg : pos    =      6.6 : 1.0
                obstacle = True              pos : neg    =      6.4 : 1.0
                 cunning = True              pos : neg    =      6.4 : 1.0

### Saving Classifiers with NLTK

In [11]:
save_classifier = open("naivebayes.pickle","wb")
pickle.dump(classifier, save_classifier)
save_classifier.close()