# Step 1: Prepare your workstation

In [6]:
# [1] Import the necessary library
import nltk

# [2] Download the existing movie reviews
nltk.download("movie_reviews")

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/hamdihassan/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

# Step 2: Construct a list of documents

Once the data has been downloaded, we can import it and construct a list of categorised documents:



In [1]:
# [1] Import the necessary libraries
from nltk.corpus import movie_reviews
import random

# [2] Construct a list of (word list, category) tuples, one per review.
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# [3] Reorganise the list randomly.
random.shuffle(documents)
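
Because the shuffle is random, every run produces a different document order, and therefore different training and test sets later on. If you want repeatable results, one option is to shuffle with a seeded generator. The following is a minimal sketch using only the standard library; `toy_documents`, `shuffled_copy` and the seed value 42 are illustrative names and values, not part of the original notebook.

```python
import random

# Toy stand-in for the (word list, category) tuples built above.
toy_documents = [
    (["good", "film"], "pos"),
    (["bad", "film"], "neg"),
    (["great", "cast"], "pos"),
    (["dull", "plot"], "neg"),
]

def shuffled_copy(items, seed):
    """Return a shuffled copy of items using a dedicated, seeded generator."""
    rng = random.Random(seed)   # independent of the global random state
    copy = list(items)          # leave the original list untouched
    rng.shuffle(copy)
    return copy

first = shuffled_copy(toy_documents, seed=42)
second = shuffled_copy(toy_documents, seed=42)

print(first == second)  # True: the same seed gives the same order
```

Calling `random.seed(42)` before `random.shuffle(documents)` would have a similar effect on the real list, at the cost of changing the global random state.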

In [2]:
# [1] Create a list of files with negative reviews.
negative_fileids = movie_reviews.fileids('neg')

# [2] Create a list of files with positive reviews.
positive_fileids = movie_reviews.fileids('pos')

# [3] Display the list lengths.
print(len(negative_fileids), len(positive_fileids))

1000 1000


The output shows that the corpus contains an equal number (1,000 each) of negative and positive reviews, stored in separate subfolders. We can inspect one of the reviews using the raw() method. Run the following command to see how the authors have already preprocessed the text: it is lowercased, and words and punctuation marks are separated by spaces. Passing fileids=positive_fileids[0] selects the first item in the list of positive reviews.

In [4]:
# View the output
print(movie_reviews.raw(fileids=positive_fileids[0]))

films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . 
for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . 
to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . 
the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . 
in other words , don't dismiss this film because of its source . 
if you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes . 
getting the hughes brothers to direct this seems almost as 
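
The feature list in Step 3 contains entries such as `'`, `s` and `t`, which suggests that contractions like "don't" are split into several tokens when the corpus is read as words. The sketch below reproduces that behaviour with a regular expression. It assumes the corpus reader tokenizes on runs of word characters and runs of punctuation; that is an assumption about the reader's default, not something stated in this notebook.

```python
import re

# Assumed tokenization rule: a run of word characters, or a run of
# punctuation characters (anything that is neither word nor whitespace).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")

# One line taken from the review printed above.
line = "in other words , don't dismiss this film because of its source ."

tokens = TOKEN_PATTERN.findall(line)
print(tokens)
# ['in', 'other', 'words', ',', 'don', "'", 't', 'dismiss', 'this',
#  'film', 'because', 'of', 'its', 'source', '.']
```

Under this rule, "don't" becomes three tokens, which is why single letters and apostrophes can end up among the most frequent "words".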

# Step 3: Define a feature extractor function

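A feature extractor tells the classifier which aspects of the data it should pay attention to. Here, each feature records whether a review contains one of the 2,000 most frequent words in the corpus.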

In [7]:
# [1] Create an object to contain the frequency distribution.
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

# [2] Create a list that contains the 2,000 most frequent words.
word_features = [word for word, _ in all_words.most_common(2000)]

# [3] Define a function that checks whether each feature word occurs in a document.
def document_features(document):
    # Create a set of document words.
    document_words = set(document)
    # Create an empty dictionary of features.
    features = {}
    # Populate the dictionary.
    for word in word_features:
        # Specify whether each feature exists in the set of document words.
        features['contains({})'.format(word)] = (word in document_words)
    # Return the completed dictionary.
    return features

You can quickly test the function by generating a dictionary for the first review. Each item in documents is a (word list, category) tuple, so documents[0][0] selects the word list of the first review. Then print out the contents:

In [8]:
# Generate a dictionary for the first review
test_result = document_features(documents[0][0])

for key in test_result:
    print(key, ' : ', test_result[key])


contains(,)  :  True
contains(the)  :  True
contains(.)  :  True
contains(a)  :  True
contains(and)  :  True
contains(of)  :  True
contains(to)  :  True
contains(')  :  True
contains(is)  :  False
contains(in)  :  True
contains(s)  :  True
contains(")  :  True
contains(it)  :  True
contains(that)  :  True
contains(-)  :  True
contains())  :  True
contains(()  :  True
contains(as)  :  True
contains(with)  :  True
contains(for)  :  True
contains(his)  :  True
contains(this)  :  True
contains(film)  :  True
contains(i)  :  False
contains(he)  :  True
contains(but)  :  True
contains(on)  :  True
contains(are)  :  False
contains(t)  :  False
contains(by)  :  True
contains(be)  :  False
contains(one)  :  True
contains(movie)  :  False
contains(an)  :  True
contains(who)  :  True
contains(not)  :  True
contains(you)  :  False
contains(from)  :  True
contains(at)  :  True
contains(was)  :  False
contains(have)  :  True
contains(they)  :  False
contains(has)  :  True
contains(her)  :  True
[...]

contains(easy)  :  False
contains(across)  :  False
contains(needs)  :  False
contains(attempts)  :  False
contains(happen)  :  False
contains(television)  :  False
contains(chris)  :  False
contains(deal)  :  False
contains(poor)  :  False
contains(form)  :  False
contains(girlfriend)  :  False
contains(viewer)  :  True
contains(release)  :  False
contains(killed)  :  False
contains(forced)  :  False
contains(whether)  :  True
contains(wonderful)  :  False
contains(feels)  :  False
contains(oh)  :  False
contains(tale)  :  True
contains(serious)  :  False
contains(expect)  :  False
contains(except)  :  False
contains(light)  :  False
contains(success)  :  False
contains(features)  :  True
contains(premise)  :  False
contains(happy)  :  False
contains(words)  :  False
contains(leave)  :  True
contains(important)  :  False
contains(meets)  :  True
contains(history)  :  False
contains(giving)  :  True
contains(crew)  :  False
contains(type)  :  False
contains(call)  :  False
[...]

contains(spielberg)  :  False
contains(development)  :  True
contains(etc)  :  False
contains(language)  :  False
contains(blue)  :  False
contains(proves)  :  False
contains(vampire)  :  False
contains(seemingly)  :  False
contains(basic)  :  False
contains(caught)  :  False
contains(decide)  :  False
contains(opportunity)  :  False
contains(incredibly)  :  False
contains(images)  :  False
contains(band)  :  False
contains(j)  :  False
contains(writers)  :  False
contains(knew)  :  False
contains(interested)  :  False
contains(considering)  :  False
contains(boys)  :  False
contains(thanks)  :  True
contains(remains)  :  True
contains(climax)  :  False
contains(event)  :  False
contains(directing)  :  False
contains(conclusion)  :  False
contains(leading)  :  False
contains(ground)  :  False
contains(lies)  :  True
contains(forget)  :  True
contains(alive)  :  False
contains(tarzan)  :  False
contains(century)  :  False
contains(provides)  :  False
contains(trip)  :  False
[...]

contains(field)  :  False
contains(larry)  :  False
contains(urban)  :  False
contains(troopers)  :  False
contains(compared)  :  False
contains(apes)  :  False
contains(rose)  :  False
contains(falling)  :  False
contains(era)  :  False
contains(loses)  :  False
contains(adults)  :  True
contains(managed)  :  False
contains(dad)  :  False
contains(therefore)  :  False
contains(pg)  :  False
contains(results)  :  False
contains(guns)  :  False
contains(radio)  :  False
contains(lady)  :  False
contains(manage)  :  False
contains(spice)  :  False
contains(naked)  :  False
contains(started)  :  False
contains(intense)  :  False
contains(humanity)  :  False
contains(wonderfully)  :  False
contains(slasher)  :  False
contains(bland)  :  False
contains(imagination)  :  False
contains(walking)  :  True
contains(willing)  :  False
contains(horse)  :  False
contains(rent)  :  False
contains(mix)  :  False
contains(generated)  :  False
contains(g)  :  False
contains(utterly)  :  False
[...]

This is just a selection of the much larger actual output, but you can see that the function works. The resulting dictionary contains one entry for every feature word and specifies whether that word occurs in the review that was passed in.
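
To see the same mechanics on a scale you can check by hand, here is a minimal sketch that builds a tiny vocabulary from three made-up reviews and extracts features for one of them. It uses `collections.Counter` in place of `nltk.FreqDist`; the names `toy_reviews`, `toy_word_features` and `toy_document_features` are illustrative and do not appear in the notebook.

```python
from collections import Counter

# Three made-up, already tokenized reviews.
toy_reviews = [
    ["wonderful", "film", "."],
    ["lame", "film", "."],
    ["wonderful", "cast", ".", "lame", "plot", "."],
]

# Count every word across all reviews.
counts = Counter(word.lower() for review in toy_reviews for word in review)

# Keep the four most frequent words as the vocabulary.
# most_common(n) returns (word, count) pairs, highest count first.
toy_word_features = [word for word, _ in counts.most_common(4)]

def toy_document_features(document, vocabulary):
    """Map each vocabulary word to True/False: does the document contain it?"""
    document_words = set(document)
    return {"contains({})".format(word): (word in document_words)
            for word in vocabulary}

features = toy_document_features(toy_reviews[0], toy_word_features)
for key, value in features.items():
    print(key, ":", value)
```

The first review contains "wonderful", "film" and "." but not "lame", so three features are True and one is False. The rarer words "cast" and "plot" never become features at all.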

# Step 4: Train the classifier

Now that you have defined the feature extractor, you will use it to train a Naive Bayes classifier that predicts the sentiment of movie reviews, and then check its accuracy on held-out reviews. This process can take a moment, so please be patient.

In [9]:
# [1] Create a list of feature sets based on the documents list.
featuresets = [(document_features(d), c) for (d,c) in documents]

# [2] Hold out the first 100 items as the test set and train on the rest.
train_set, test_set = featuresets[100:], featuresets[:100]

# [3] Create a classifier object trained on items from the training set.
classifier = nltk.NaiveBayesClassifier.train(train_set)

# [4] Display the accuracy score in comparison with the test set.
print(nltk.classify.accuracy(classifier, test_set))

0.81


How did your classifier perform? In this run, the output shows an accuracy of 81% on the 100 held-out reviews, which is a solid result given that no parameters were tweaked or fine-tuned. (Hint: Your number will likely differ, because random.shuffle() puts the documents in a different order on each run, so the training and test sets change.)
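
Accuracy is simply the fraction of test items whose predicted label matches the true label. A useful reference point is the score you would get by always predicting the most frequent label: with balanced classes that is about 50%. The sketch below computes both on hypothetical labels (they are invented for illustration and are not taken from the corpus or from the classifier above).

```python
from collections import Counter

def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual one."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical labels for a ten-item test set.
actual    = ["pos", "neg", "pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
predicted = ["pos", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "pos", "neg"]

model_accuracy = accuracy(predicted, actual)

# Baseline: always predict the most frequent label in the test set.
majority_label = Counter(actual).most_common(1)[0][0]
baseline_accuracy = accuracy([majority_label] * len(actual), actual)

print(model_accuracy)     # 0.8 (two of ten predictions are wrong)
print(baseline_accuracy)  # 0.5 (five of ten labels match the majority label)
```

Because the real test set is a random sample of 100 reviews, its class split will not be exactly 50/50, so the baseline for any particular run may sit slightly above 50%.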

# Step 5: Interpret the results

The trained classifier provides a show_most_informative_features() method that shows which features it found to be most informative. Pass 5 as the argument to see the top five features.

In [10]:
classifier.show_most_informative_features(5)

Most Informative Features
   contains(outstanding) = True              pos : neg    =     11.4 : 1.0
         contains(mulan) = True              pos : neg    =      9.0 : 1.0
        contains(seagal) = True              neg : pos    =      7.4 : 1.0
   contains(wonderfully) = True              pos : neg    =      6.9 : 1.0
          contains(lame) = True              neg : pos    =      6.9 : 1.0


The output here shows that, in this run, a review that mentions ‘outstanding’ is about eleven times more likely to be positive than negative, a review that mentions ‘mulan’ is nine times more likely to be positive, and a review that mentions ‘seagal’ is about seven times more likely to be negative. (Hint: The exact output you see will differ slightly, because the documents are shuffled randomly before the training set is selected.)
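
Each ratio in that table compares how often a feature value occurs under one label with how often it occurs under the other. The following sketch shows the arithmetic on hypothetical counts. It assumes add-0.5 smoothing over two possible feature values (True and False); both the counts and the smoothing choice are assumptions for illustration, not values read from the trained classifier.

```python
def smoothed_probability(count, total, bins=2, gamma=0.5):
    """Estimate P(feature value | label) with additive smoothing.

    count: training reviews with this label that have the feature value
    total: all training reviews with this label
    bins:  number of possible feature values (True/False gives 2)
    gamma: pseudo-count added to every bin
    """
    return (count + gamma) / (total + gamma * bins)

# Hypothetical counts: how many reviews of each class contain some word.
pos_with_word, pos_total = 40, 950
neg_with_word, neg_total = 3, 950

p_word_given_pos = smoothed_probability(pos_with_word, pos_total)
p_word_given_neg = smoothed_probability(neg_with_word, neg_total)

ratio = p_word_given_pos / p_word_given_neg
print("pos : neg = {:.1f} : 1.0".format(ratio))  # pos : neg = 11.6 : 1.0
```

Smoothing matters most for rare words: without it, a word that never appears in one class would produce a division by zero.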

This quick demonstration shows that the Naive Bayes sentiment classifier is quite easy to interpret. This transparency is one of the core advantages of the model. The main limitation is its assumption that the predictors are independent of one another. In reality, almost all text is contextual: a word can only be fully understood in relation to the text that surrounds it. So remember to interpret the results with caution.
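
A small example makes the limitation concrete. With presence-only features like the ones used in this notebook, two reviews that use the same words in a different order look identical to the classifier. The sketch below uses a hypothetical three-word vocabulary and two made-up reviews; `contains_features` mirrors the feature extractor above but takes the vocabulary as a parameter.

```python
def contains_features(document, vocabulary):
    """Presence-only features: word order and word counts are discarded."""
    document_words = set(document)
    return {"contains({})".format(word): (word in document_words)
            for word in vocabulary}

# Hypothetical vocabulary and reviews.
vocabulary = ["good", "bad", "not"]
review_a = ["not", "good", ",", "just", "bad"]   # negative in meaning
review_b = ["not", "bad", ",", "just", "good"]   # positive in meaning

features_a = contains_features(review_a, vocabulary)
features_b = contains_features(review_b, vocabulary)

print(features_a == features_b)  # True: the two reviews are indistinguishable
```

Any classifier trained on these features must assign both reviews the same label, whatever their actual sentiment.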