# Simple Naive Bayes Classifier

The goal of this project is to build a simple Naive Bayes classifier with the `nltk` toolkit, then train and test it on the Movie Reviews corpus from `nltk.corpus`.

Note: We also use `pickle` to save the trained classifier, so that later runs can load it instead of training again.

In [2]:
import nltk
import random
import pickle
import os
from nltk.corpus import movie_reviews

# Load the classifier from the pickle file if it exists.
# Note: testing_sets is built only in the else branch below, so the accuracy
# cell further down needs the training path to have run in this session.
if os.path.isfile('naivebayes.pickle'):
    # Open the existing pickle file for reading; the with block closes it afterwards
    with open('naivebayes.pickle', 'rb') as trained_classifier_file:
        # Load the trained classifier
        trained_classifier = pickle.load(trained_classifier_file)

# Otherwise build the features, train a new classifier and save it
else:
    
    def reviews_words_lists(movie_reviews):
        reviews_list = []
        for category in movie_reviews.categories():
            for file_id in movie_reviews.fileids(category):
                reviews_list.append((list(movie_reviews.words(file_id)), category))
        return reviews_list


    # Get words from each review in separate lists marked by category
    documents = reviews_words_lists(movie_reviews)


    # Randomise review lists to mix positive and negative 
    random.shuffle(documents)


    # Get all words from all reviews & make them lower case
    all_words = [word.lower() for word in movie_reviews.words()]


    # Count the occurrences of each word (most_common() below returns them in descending order)
    all_words = nltk.FreqDist(all_words)


    # Take 3000 most common from all available words
    word_features = [w[0] for w in all_words.most_common(3000)]


    def find_features(document, word_features):
        words = set(document)
        features = {}
        for word in word_features:
            features[word] = word in words
        return features


    # For each review, mark which of the 3000 most common words it contains
    feature_sets = [(find_features(rev, word_features), category) for (rev, category) in documents]


    # Use the first 1900 feature sets for training
    training_sets = feature_sets[:1900]
    # Use the remaining 100 feature sets for testing
    testing_sets = feature_sets[1900:]


    # Train classifier on training sets
    trained_classifier = nltk.NaiveBayesClassifier.train(training_sets)

    # Save the trained classifier to a new pickle file; the with block closes it afterwards
    with open('naivebayes.pickle', 'wb') as save_trained_classifier:
        pickle.dump(trained_classifier, save_trained_classifier)
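
To make the feature representation concrete, here is a minimal sketch of what `find_features` produces for a single review. The vocabulary and the review are made up for illustration; the function mirrors the one defined in the cell above.

```python
def find_features(document, word_features):
    """Map each vocabulary word to True/False depending on whether it occurs in the document."""
    words = set(document)
    return {word: (word in words) for word in word_features}


# Hypothetical vocabulary and review, for illustration only
toy_vocabulary = ["great", "boring", "plot"]
toy_review = ["a", "great", "film", "with", "a", "great", "plot"]

toy_features = find_features(toy_review, toy_vocabulary)
print(toy_features)
# {'great': True, 'boring': False, 'plot': True}
```

Note that the features only record presence: `great` occurs twice in the toy review but is still just `True`.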


We've trained our classifier on 1900 reviews. Now let's check its accuracy on a testing set of 100 reviews.

In [4]:
# Get classifier accuracy on testing sets
nltk.classify.accuracy(trained_classifier, testing_sets)*100

78.0
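
Because `random.shuffle` produces a different split on every run, a single accuracy figure describes only one particular split. One way to see how much it moves is to repeat the shuffle, split, train and test cycle with several seeds and look at the spread. The sketch below is a rough illustration under stated assumptions: `accuracy_over_shuffles`, the majority-label "classifier" and the toy data are all hypothetical stand-ins, chosen so that the example runs without NLTK.

```python
import random
from statistics import mean, stdev


def accuracy_over_shuffles(labeled_items, train_fn, accuracy_fn, n_train, seeds):
    """Train and evaluate once per seed, each time on a differently shuffled split."""
    scores = []
    for seed in seeds:
        items = list(labeled_items)          # copy so the caller's list is untouched
        random.Random(seed).shuffle(items)   # reproducible shuffle for this seed
        train, test = items[:n_train], items[n_train:]
        model = train_fn(train)
        scores.append(accuracy_fn(model, test))
    return scores


# Hypothetical stand-in classifier: always predicts the most frequent training label
def train_majority(train):
    labels = [label for _, label in train]
    return max(sorted(set(labels)), key=labels.count)


def majority_accuracy(model, test):
    return sum(label == model for _, label in test) / len(test)


# Toy (features, label) pairs, for illustration only
toy_data = [({"w": i % 3 == 0}, "pos" if i % 2 else "neg") for i in range(40)]

scores = accuracy_over_shuffles(toy_data, train_majority, majority_accuracy,
                                n_train=30, seeds=range(5))
print(f"mean={mean(scores):.2f}, stdev={stdev(scores):.2f}")
```

With NLTK, the same helper could be called with `nltk.NaiveBayesClassifier.train` as `train_fn`, `nltk.classify.accuracy` as `accuracy_fn`, the notebook's `feature_sets` as data and `n_train=1900`.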

Result: `78%` is rather high for this type of classifier, but the result is also very volatile. It changes dramatically from run to run because we shuffle the reviews randomly before splitting them, and we optimise neither the parameters nor the dataset. Let's check which features the trained classifier found most informative.

In [None]:
trained_classifier.show_most_informative_features()

Most Informative Features
             outstanding = True              pos : neg    =     11.1 : 1.0
                   mulan = True              pos : neg    =      8.9 : 1.0
                  seagal = True              neg : pos    =      8.3 : 1.0
             wonderfully = True              pos : neg    =      7.5 : 1.0
                  finest = True              pos : neg    =      7.5 : 1.0
                  alicia = True              neg : pos    =      6.7 : 1.0
              schumacher = True              neg : pos    =      6.7 : 1.0
                 idiotic = True              neg : pos    =      6.6 : 1.0
                   inept = True              neg : pos    =      6.1 : 1.0
                     era = True              pos : neg    =      6.1 : 1.0
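
Each ratio in this table compares how likely a feature value is under one label versus the other. For example, `outstanding = True` is about 11 times more likely in a positive training review than in a negative one. The sketch below derives that kind of ratio from raw counts. The counts are invented, and the add-0.5 smoothing is an assumption based on NLTK's default estimator (`ELEProbDist`), so treat it as an illustration rather than NLTK's exact computation.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Expected-likelihood estimate: (count + gamma) / (total + gamma * bins)."""
    return (count + gamma) / (total + gamma * bins)


def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio P(feature=True | label A) / P(feature=True | label B)."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)


# Hypothetical counts: the word appears in 55 of 950 positive
# and in 5 of 950 negative training reviews.
ratio = likelihood_ratio(55, 950, 5, 950)
print(f"pos : neg = {ratio:.1f} : 1.0")
# pos : neg = 10.1 : 1.0
```

A rare word that happens to fall almost entirely on one side of the split can get a large ratio this way, even if it carries no sentiment of its own.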


Result: There are some decent examples of words intrinsic to positive reviews, like `outstanding`, `wonderfully` and `finest`, as well as to negative ones: `idiotic` and `inept`.
But there are also words that carry no connotation by themselves and were still considered characteristic of positive or negative reviews. For positive they are `mulan` and `era`; for negative, `seagal`, `alicia` and `schumacher`.