### LSE Data Analytics Online Career Accelerator 

# DA301: Advanced Analytics for Organisational Impact

## Demonstration video: Naive Bayes sentiment classifier using Python

This Jupyter Notebook and accompanying video demonstration by your course convenor, Dr James Abdey, show one possible way to use the Naive Bayes classifier. To follow along with the demonstration, you will use a movie review corpus and categorise the reviews as positive or negative. This corpus is one of many useful data sets included in the NLTK library. In this video, you’ll learn:
- how to download the movie review corpus
- how to prepare your data
- how to apply the Naive Bayes classifier
- how to interpret and communicate your findings.

> **Note:** Your output(s) may differ from the demonstration output(s) due to the `random.shuffle()` method applied to the data and possible updates to the `movie_reviews` corpus.

# 1. Prepare your workstation

In [1]:
# Import the necessary library.
import nltk

# Download the existing movie reviews.
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/codyshan/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True


# 2. Construct a list of documents

In [14]:
# Import the necessary libraries.
from nltk.corpus import movie_reviews
import random

# Construct a list of (word list, category) tuples, one per review.
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Reorganise the list randomly.
random.shuffle(documents)
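Because the shuffle is random, each run produces a different ordering and therefore different results later on. If you want repeatable output, one option is to shuffle with a seeded generator. The sketch below uses a made-up toy list and an arbitrary seed value of 42; `toy_documents` and `shuffled_copy` are hypothetical names, not part of the demonstration.

```python
import random

# Toy stand-in for the (word list, category) tuples built above.
toy_documents = [
    (['great', 'film'], 'pos'),
    (['awful', 'plot'], 'neg'),
    (['wonderful', 'cast'], 'pos'),
    (['wasted', 'talent'], 'neg'),
]

def shuffled_copy(items, seed):
    """Return a shuffled copy of items using a private, seeded generator."""
    rng = random.Random(seed)  # independent of the global random state
    copy = list(items)         # leave the original list untouched
    rng.shuffle(copy)
    return copy

first_run = shuffled_copy(toy_documents, seed=42)
second_run = shuffled_copy(toy_documents, seed=42)

# The same seed gives the same order on every run.
print(first_run == second_run)  # True
```

In the notebook itself, calling `random.seed(42)` immediately before `random.shuffle(documents)` should have a similar effect, although your results may still differ from the video if the corpus has been updated.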

In [3]:
# Create a list of files with negative reviews.
negative_fileids = movie_reviews.fileids('neg')

# Create a list of files with positive reviews.
positive_fileids = movie_reviews.fileids('pos')

# Display the list lengths.
print(len(negative_fileids), len(positive_fileids))

1000 1000


In [19]:
# Display the word list of the first document.
print(documents[0][0])

['wesley', 'snipes', 'is', 'a', 'master', 'of', 'selecting', 'bad', 'action', 'roles', '.', 'murder', 'at', '1600', ',', 'u', '.', 's', '.', 'marshals', ',', 'money', 'train', ',', 'drop', 'zone', ',', 'boiling', 'point', ',', 'and', 'the', 'ultimate', 'camp', 'film', '-', 'passenger', '57', '.', 'the', 'art', 'of', 'war', 'is', 'another', 'entry', 'in', 'this', 'very', 'ugly', 'and', 'unique', 'category', '.', 'ultimately', ',', 'it', 'is', 'little', 'more', 'than', 'a', 'ridiculous', 'action', 'film', 'with', 'a', 'plot', 'as', 'believable', 'as', 'the', 'warren', 'report', ',', 'ugly', 'violence', 'that', 'would', 'have', 'made', 'peckinpah', 'cringe', ',', 'and', 'terrible', 'acting', 'by', 'b', '-', 'list', 'actors', 'like', 'michael', 'biehn', 'and', 'anne', 'archer', '.', 'oddly', ',', 'it', 'feels', 'like', 'the', 'undiscovered', 'sequel', 'to', 'another', 'snipes', '"', 'masterpiece', ',', '"', 'rising', 'sun', '.', 'the', 'movie', 'revolves', 'around', 'the', 'convenient', 's

In [11]:
# Display the raw text of the last positive review.
print(movie_reviews.raw(fileids=positive_fileids[999]))

truman ( " true-man " ) burbank is the perfect name for jim carrey's character in this film . 
president truman was an unassuming man who became known worldwide , in spite of ( or was it because of ) his stature . 
 " truman " also recalls an era of plenty following a grim war , an era when planned communities built by government scientists promised an idyllic life for americans . 
and burbank , california , brings to mind the tonight show and the home of nbc . 
if hollywood is the center of the film world , burbank is , or was , the center of tv's world , the world where our protagonist lives . 
combine all these names and concepts into " truman burbank , " and you get something that well describes him and his artificial world . 
truman leads the perfect life . 
his town , his car , and his wife are picture perfect . 
his idea of reality comes under attack one day when a studio light falls from the sky . 
the radio explains that an overflying airplane started coming apart . 
 . 
 . 
b


# 3. Define a feature extractor function

In [28]:
# Create an object to contain the frequency distribution.
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

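# Create a list that contains the 2,000 most frequent words.
# (Slicing list(all_words) may return words in order of first appearance
# rather than by frequency, depending on the NLTK version.)
word_features = [word for (word, count) in all_words.most_common(2000)]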

# Define a function to check whether each word is in the set of features.
def document_features(document): 
    # Create a set of document words.
    document_words = set(document)

    # Create an empty dictionary of features.
    features = {}
    # Populate the dictionary.
    for word in word_features:
        # Specify whether each feature exists in the set of document words.
        features['contains({})'.format(word)] = (word in document_words)
    # Return the completed dictionary.
    return features
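To see what the extractor returns without loading the corpus, here is a minimal sketch of the same idea on a tiny, hypothetical vocabulary. The names `toy_word_features` and `toy_document_features`, and the sample review, are made up for illustration.

```python
# Hypothetical five-word vocabulary standing in for word_features.
toy_word_features = ['good', 'bad', 'plot', 'acting', 'boring']

def toy_document_features(document, vocabulary):
    """Map each vocabulary word to True/False depending on its presence."""
    document_words = set(document)  # a set makes membership checks fast
    return {'contains({})'.format(word): (word in document_words)
            for word in vocabulary}

# A made-up tokenised review.
review = ['the', 'plot', 'was', 'good', 'but', 'the', 'acting', 'was', 'bad']

features = toy_document_features(review, toy_word_features)
for key, value in features.items():
    print(key, ':', value)
```

Every document is described by the same set of keys, whatever its length. Only the True/False values differ, and word counts and word order are discarded.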

In [29]:
# Generate a dictionary for the first review.
test_result = document_features(documents[0][0])

for key in test_result:
    print(key, ' : ', test_result[key])

contains(,)  :  True
contains(the)  :  True
contains(.)  :  True
contains(a)  :  True
contains(and)  :  True
contains(of)  :  True
contains(to)  :  True
contains(')  :  True
contains(is)  :  True
contains(in)  :  True
contains(s)  :  True
contains(")  :  True
contains(it)  :  True
contains(that)  :  True
contains(-)  :  True
contains())  :  True
contains(()  :  True
contains(as)  :  True
contains(with)  :  True
contains(for)  :  False
contains(his)  :  True
contains(this)  :  True
contains(film)  :  True
contains(i)  :  True
contains(he)  :  True
contains(but)  :  True
contains(on)  :  True
contains(are)  :  True
contains(t)  :  True
contains(by)  :  True
contains(be)  :  True
contains(one)  :  True
contains(movie)  :  True
contains(an)  :  True
contains(who)  :  False
contains(not)  :  True
contains(you)  :  False
contains(from)  :  True
contains(at)  :  True
contains(was)  :  True
contains(have)  :  True
contains(they)  :  True
contains(has)  :  True
contains(her)  :  False
contains(


# 4. Train the classifier

In [30]:
# Create a list of feature sets based on the documents list.
featuresets = [(document_features(d), c) for (d, c) in documents]

# Assign items to the training and test sets.
# The first 100 items form the test set; the remaining 1,900 form the training set.
train_set, test_set = featuresets[100:], featuresets[:100]

# Create a classifier object trained on items from the training set.
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Display the classifier's accuracy on the held-out test set.
print(nltk.classify.accuracy(classifier, test_set))

0.84


> **Note:** Your output may differ from the demonstration output due to the `random.shuffle()` method applied to the data and possible updates to the `movie_reviews` corpus.
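The accuracy score is the fraction of test items whose predicted label matches the true label. The sketch below reproduces that calculation by hand on made-up data, using the same slicing pattern as the split above. The rule-based `toy_classify` function is a hypothetical stand-in for a trained classifier, not the Naive Bayes model itself.

```python
# Made-up labelled feature sets: (features, label) pairs.
toy_featuresets = [
    ({'contains(awful)': True}, 'neg'),
    ({'contains(awful)': False}, 'pos'),
    ({'contains(awful)': False}, 'neg'),  # the toy rule gets this one wrong
    ({'contains(awful)': True}, 'neg'),
    ({'contains(awful)': False}, 'pos'),
    ({'contains(awful)': True}, 'neg'),
]

# Same slicing pattern as the notebook: first items for testing, rest for training.
toy_test_set, toy_train_set = toy_featuresets[:4], toy_featuresets[4:]

def toy_classify(features):
    """Hypothetical rule: predict 'neg' whenever 'awful' is present."""
    return 'neg' if features['contains(awful)'] else 'pos'

def toy_accuracy(classify, labelled_featuresets):
    """Return the share of items whose predicted label equals the true label."""
    correct = sum(1 for features, label in labelled_featuresets
                  if classify(features) == label)
    return correct / len(labelled_featuresets)

score = toy_accuracy(toy_classify, toy_test_set)
print(score)  # 0.75 (three of the four test items are classified correctly)
```

With a test set of 100 reviews, each review moves the score by 0.01, so small differences between runs are to be expected.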


# 5. Interpret the results

In [33]:
# You can change the number of outputs by increasing or decreasing the number in the brackets.
classifier.show_most_informative_features(100)

Most Informative Features
   contains(outstanding) = True              pos : neg    =     10.9 : 1.0
         contains(mulan) = True              pos : neg    =      8.3 : 1.0
        contains(seagal) = True              neg : pos    =      8.3 : 1.0
         contains(damon) = True              pos : neg    =      7.5 : 1.0
   contains(wonderfully) = True              pos : neg    =      6.5 : 1.0
        contains(poorly) = True              neg : pos    =      6.1 : 1.0
        contains(wasted) = True              neg : pos    =      6.0 : 1.0
         contains(flynt) = True              pos : neg    =      5.6 : 1.0
          contains(lame) = True              neg : pos    =      5.2 : 1.0
         contains(awful) = True              neg : pos    =      5.1 : 1.0
    contains(ridiculous) = True              neg : pos    =      5.1 : 1.0
         contains(waste) = True              neg : pos    =      4.8 : 1.0
           contains(era) = True              pos : neg    =      4.5 : 1.0

> **Note:** Your output may differ from the demonstration output due to the `random.shuffle()` method applied to the data and possible updates to the `movie_reviews` corpus.
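Each row in the table is a likelihood ratio: how much more likely the feature is to appear in one class than in the other. The sketch below shows how such a ratio could be computed from counts. The counts are made up, and the add-0.5 smoothing is an assumption about the estimator used, so check it against the NLTK documentation before relying on it. `smoothed_probability` and `likelihood_ratio` are hypothetical helper names.

```python
def smoothed_probability(count, total, bins=2, gamma=0.5):
    """Estimate P(feature value | label) with additive smoothing.

    Assumption: gamma=0.5 and two possible values (True/False) per feature.
    """
    return (count + gamma) / (total + gamma * bins)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Return P(feature | label a) divided by P(feature | label b)."""
    return (smoothed_probability(count_a, total_a)
            / smoothed_probability(count_b, total_b))

# Made-up counts: the word appears in 10 of 100 positive training reviews
# and in 1 of 100 negative training reviews.
ratio = likelihood_ratio(10, 100, 1, 100)
print('pos : neg = {:.1f} : 1.0'.format(ratio))  # pos : neg = 7.0 : 1.0
```

Read this way, the first row of the table says that reviews containing "outstanding" were roughly eleven times more likely to be positive than negative in that particular training set. Names such as "mulan" or "seagal" rank highly because they are concentrated in a few reviews of one class, not because they carry sentiment themselves.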


# 6. Conclusion(s)

> This quick demonstration shows that the output of the Naive Bayes sentiment classifier is relatively easy to interpret. This transparency is one of the model's core advantages. Its main limitation is the assumption that the predictors (here, the presence or absence of each word) are independent of one another given the class. In practice, almost all text is contextual: a word can only be fully understood in relation to the words around it. So remember to interpret the results with caution.
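The limitation can be made concrete with a small sketch. Under the presence/absence features used in this demonstration, two made-up reviews with opposite meanings produce identical features, so no classifier built on those features could tell them apart. The vocabulary, the sample sentences, and the function name are hypothetical.

```python
def bag_of_words_features(tokens, vocabulary):
    """Record only whether each vocabulary word occurs, ignoring word order."""
    present = set(tokens)
    return {'contains({})'.format(word): (word in present)
            for word in vocabulary}

vocabulary = ['good', 'bad', 'not']

# Two made-up reviews with opposite meanings but the same words.
praise = ['not', 'bad', ',', 'actually', 'good']
criticism = ['not', 'good', ',', 'actually', 'bad']

praise_features = bag_of_words_features(praise, vocabulary)
criticism_features = bag_of_words_features(criticism, vocabulary)

# The feature dictionaries are identical, so the context is lost.
print(praise_features == criticism_features)  # True
```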
