# Movie Review Sentiment Analysis

In [79]:
from nltk.corpus import movie_reviews 
from nltk.corpus import stopwords
import string
from nltk import NaiveBayesClassifier
from nltk import classify 
from nltk.tokenize import word_tokenize
from nltk import ngrams
from random import shuffle 

## Creating the functions for cleaning the data
We created four functions to help us clean the data.
### - clean_words(words, stopwords)
This function takes two parameters: the tokens of the review being classified and the list of English stopwords. It lowercases each token and drops stopwords as well as punctuation, using the *string.punctuation* constant from Python's *string* module.
### - unigram_words(words)
This function creates a dictionary of *unigrams*. Each key is a single word and each value is True. When it is called on cleaned text, the keys exclude English stopwords and punctuation.
### - bigram_words(words, n=2)
This function creates a dictionary of *bigrams*. We use *ngrams* from *nltk*, which can produce n-grams of any size. The parameter n defaults to 2, so the function produces bigrams. It returns a dictionary whose keys are tuples of two consecutive words from the input list.
### - clean_all_words(words)
This is the main function for cleaning the data. It uses the functions above to *clean* the text, builds a *unigram* and a *bigram* dictionary, and finally merges them into a single dictionary containing both the unigrams and the bigrams.

In [80]:
# Load the English stopword list. Note: this rebinds the name `stopwords`
# from the corpus reader to a plain list, so run this cell only once per session.
stopwords = stopwords.words('english')
 
# Remove stopwords and punctuation
def clean_words(words, stopwords):
    words_clean = []
    for word in words:
        word = word.lower()
        if word not in stopwords and word not in string.punctuation:
            words_clean.append(word)
    return words_clean
 
# Create unigrams: map every word to True
def unigram_words(words):
    return {word: True for word in words}
 
# Create n-grams (bigrams by default): map every tuple of n consecutive words to True
def bigram_words(words, n=2):
    return {gram: True for gram in ngrams(words, n)}
 
# Clean the text and build the combined feature dictionary
def clean_all_words(words):
    # Clean text
    words_clean = clean_words(words, stopwords)
    
    # Cleaned text into unigrams
    unigram_features = unigram_words(words_clean)
    # Cleaned text into bigrams
    bigram_features = bigram_words(words_clean)
 
    # Copy the unigram dictionary as "all_features" and add the bigrams to it
    all_features = unigram_features.copy()
    all_features.update(bigram_features)
 
    return all_features
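
One detail of the punctuation check in clean_words() is worth a small illustration. `word not in string.punctuation` is a substring test against one long string, so multi-character tokens such as "--" or "..." pass through the filter. The sketch below uses only the standard library. The helper name `is_punctuation` is our own and is not part of the notebook, and it is one possible stricter check, not the one the notebook uses.

```python
import string

def is_punctuation(token):
    """Hypothetical helper: True when every character of the token is ASCII punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)

for token in ["!", "--", "...", "great"]:
    # The same substring test that clean_words() relies on
    substring_check = token in string.punctuation
    print(repr(token), substring_check, is_punctuation(token))
```

If such tokens turn out to matter, the condition in clean_words() could be swapped for a check like this one. We have not measured whether it changes the accuracy.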

## Testing our features
Here we test all our functions to make sure that they work as intended.

In [81]:
text = "This is our test review because we thought the movie was very bad, just kidding it was great!"
words = word_tokenize(text.lower())

print ("Tokenized text:", words, "\n")
 
print ("Bigrammed text: ", bigram_words(words), "\n")

words_clean = clean_words(words, stopwords)
print ("Cleaned text: " , words_clean, "\n")

print ("All words cleaned: ", clean_all_words(words))

Tokenized text: ['this', 'is', 'our', 'test', 'review', 'because', 'we', 'thought', 'the', 'movie', 'was', 'very', 'bad', ',', 'just', 'kidding', 'it', 'was', 'great', '!'] 

Bigrammed text:  {('this', 'is'): True, ('is', 'our'): True, ('our', 'test'): True, ('test', 'review'): True, ('review', 'because'): True, ('because', 'we'): True, ('we', 'thought'): True, ('thought', 'the'): True, ('the', 'movie'): True, ('movie', 'was'): True, ('was', 'very'): True, ('very', 'bad'): True, ('bad', ','): True, (',', 'just'): True, ('just', 'kidding'): True, ('kidding', 'it'): True, ('it', 'was'): True, ('was', 'great'): True, ('great', '!'): True} 

Cleaned text:  ['test', 'review', 'thought', 'movie', 'bad', 'kidding', 'great'] 

All words cleaned:  {'test': True, 'review': True, 'thought': True, 'movie': True, 'bad': True, 'kidding': True, 'great': True, ('test', 'review'): True, ('review', 'thought'): True, ('thought', 'movie'): True, ('movie', 'bad'): True, ('bad', 'kidding'): True, ('kidding'
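
The output above shows a side effect of building bigrams after cleaning: words that were not adjacent in the original sentence end up paired, for example ('movie', 'bad') from "movie was very bad". The sketch below reproduces this with the standard library only. `sliding_ngrams` is a hypothetical stand-in for nltk's ngrams, and `DEMO_STOPWORDS` is a tiny hand-picked subset of the stopword list, both introduced purely for illustration.

```python
def sliding_ngrams(tokens, n=2):
    """Stand-in for nltk.ngrams: every run of n consecutive tokens, as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Assumption: a tiny subset of the English stopword list, enough for this example
DEMO_STOPWORDS = {"the", "was", "very"}

tokens = ["the", "movie", "was", "very", "bad"]

# Bigrams of the raw tokens keep the original neighbours
raw_bigrams = sliding_ngrams(tokens)

# Bigrams of the cleaned tokens join words across the removed stopwords
cleaned = [t for t in tokens if t not in DEMO_STOPWORDS]
clean_bigrams = sliding_ngrams(cleaned)

print(raw_bigrams)
print(clean_bigrams)
```

Whether this is desirable depends on the goal. It produces compact features such as ('movie', 'bad'), but it also drops negations like "not", which is in the stopword list.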

## Using the NLTK movie_reviews corpus

We use the NLTK movie_reviews corpus to build two separate lists of positive and negative sample reviews.

In [82]:
# All negative and positive reviews in the NLTK movie_reviews corpus are added to separate lists
pos_reviews = []
for fileid in movie_reviews.fileids('pos'):
    words = movie_reviews.words(fileid)
    pos_reviews.append(words)

neg_reviews = []
for fileid in movie_reviews.fileids('neg'):
    words = movie_reviews.words(fileid)
    neg_reviews.append(words)

We turn each review into a feature dictionary with the clean_all_words() function that we created earlier, label it 'pos' or 'neg', and check how many reviews each list has.

In [83]:
# Clean all the words in the review lists, using the function that we created earlier
pos_reviews_set = []
for words in pos_reviews:
    pos_reviews_set.append((clean_all_words(words), 'pos'))

neg_reviews_set = []
for words in neg_reviews:
    neg_reviews_set.append((clean_all_words(words), 'neg'))
    
print ("Amount of positive/negative reviews in the NLTK movie_review corpus: ", len(pos_reviews_set), len(neg_reviews_set)) 

Amount of positive/negative reviews in the NLTK movie_review corpus:  1000 1000


### Creating the training and test sets
Now that we have the two lists of cleaned positive and negative reviews, we shuffle each list to get a random order and split them into training and test sets. We put 25% of both the positive and the negative reviews (250 each) into test_set. The remaining 75% (1,500 reviews in total) form train_set.

In [92]:
# We use random.shuffle to randomize the lists, so the accuracy differs slightly every time the program runs
shuffle(pos_reviews_set)
shuffle(neg_reviews_set)
 
# We take 25% of each review list as test data and use the
# remaining data to train the NaiveBayesClassifier
test_set = pos_reviews_set[:250] + neg_reviews_set[:250]
train_set = pos_reviews_set[250:] + neg_reviews_set[250:]
print(len(train_set))

1500
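
Because the shuffle is unseeded, every run trains on a different subset. The sketch below shows one way to make the same 75/25 split reproducible with a seeded random generator. The function name `split_reviews`, the seed value and the toy data are our own assumptions and are not part of the notebook.

```python
import random

def split_reviews(pos, neg, test_fraction=0.25, seed=42):
    """Hypothetical helper: shuffle copies of both lists with a fixed seed, then split."""
    rng = random.Random(seed)          # private generator, leaves the global one untouched
    pos, neg = list(pos), list(neg)    # copy so the callers' lists keep their order
    rng.shuffle(pos)
    rng.shuffle(neg)
    n_pos = int(len(pos) * test_fraction)
    n_neg = int(len(neg) * test_fraction)
    test = pos[:n_pos] + neg[:n_neg]
    train = pos[n_pos:] + neg[n_neg:]
    return train, test

# Toy stand-ins for pos_reviews_set and neg_reviews_set
pos_demo = [({f"p{i}": True}, "pos") for i in range(8)]
neg_demo = [({f"n{i}": True}, "neg") for i in range(8)]

train_demo, test_demo = split_reviews(pos_demo, neg_demo)
print(len(train_demo), len(test_demo))
```

With a fixed seed, accuracy numbers can be compared between runs. Dropping the seed brings back the behaviour of the cell above.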


### Using the NaiveBayesClassifier
We chose NLTK's NaiveBayesClassifier for our sentiment analysis because many online sources recommended it for classifying movie reviews.

We initialize our *classifier* by passing the training data into it and then measure its accuracy on the test data.

We also use the classifier's *show_most_informative_features()* method to see which features it finds the most informative.

In [93]:
# We initialize the classifier by training it on our training set
classifier = NaiveBayesClassifier.train(train_set)
# TODO: test other classifiers

# We use nltk's classify.accuracy on our test set to measure the classifier's accuracy
accuracy = classify.accuracy(classifier, test_set)
print("Current accuracy: ", accuracy)
# show_most_informative_features() prints its table itself and returns None,
# so it does not need to be wrapped in print()
classifier.show_most_informative_features(15)

Current accuracy:  0.778
Most Informative Features
             outstanding = True              pos : neg    =     13.9 : 1.0
    ('nothing', 'short') = True              pos : neg    =     12.3 : 1.0
                  finest = True              pos : neg    =     11.8 : 1.0
        ('one', 'worst') = True              neg : pos    =     11.8 : 1.0
                  turkey = True              neg : pos    =     11.7 : 1.0
       ('well', 'worth') = True              pos : neg    =     11.0 : 1.0
     ('saving', 'grace') = True              neg : pos    =     10.3 : 1.0
                 idiotic = True              neg : pos    =     10.2 : 1.0
('completely', 'different') = True              pos : neg    =      9.7 : 1.0
     ('bad', 'dialogue') = True              neg : pos    =      9.7 : 1.0
         ('new', 'year') = True              neg : pos    =      9.7 : 1.0
                    jedi = True              pos : neg    =      9.0 : 1.0
             fascination = True              p
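
The ratios in this table compare how likely a feature is under one label versus the other. The sketch below shows how such a ratio can be computed from raw counts, assuming add-0.5 (expected likelihood) smoothing over two outcomes (feature present or absent). We believe this matches NLTK's default estimator, but treat that as an assumption. The function names and the counts are hypothetical and are not taken from the corpus.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Add-gamma estimate of P(feature = True | label)."""
    return (count + gamma) / (total + gamma * bins)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """How many times more likely the feature is under label a than under label b."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)

# Hypothetical counts: a word seen in 40 of 750 positive
# and in 3 of 750 negative training reviews
ratio = likelihood_ratio(40, 750, 3, 750)
print(f"pos : neg = {ratio:.1f} : 1.0")
```

Features that are rare under one label produce the largest ratios, which may explain why unusual words and bigrams rank highly in the table.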

# Counting the probability
Finally, we create the function that determines whether a review is positive, negative or mixed. It takes a string (in this case a review) as its argument, tokenizes it so that each word can be handled separately, and passes the tokens to our *clean_all_words()* function.

Once stopwords and punctuation have been removed, the classifier uses the probabilities it learned during training to decide whether the review is positive, negative or mixed. A review counts as negative when the probability of "neg" is at least 0.70, as mixed when it is at least 0.30 but below 0.70, and as positive otherwise.

In [104]:
# We test our program with custom reviews
# We used "Matt Damon" as the good review; for some reason the name has a really high positive-to-negative ratio.
bad_review = "This movie is really bad and we would not recommend this movie even to our worst enemy. It really sucked."
mixed_review = "I don't know what to think about this movie, it was a bit weird but good, I kind of liked it, just kidding it's bad bad bad bad."
good_review = "Matt Damon"

def count_probability(review):
    # Tokenize the review and turn it into the feature dictionary the classifier expects
    review_tokens = word_tokenize(review)
    review_set = clean_all_words(review_tokens)
    # Probability distribution over the labels "neg" and "pos"
    prob_result = classifier.prob_classify(review_set)
    prob_neg = prob_result.prob("neg")
    if prob_neg >= 0.70:
        print('This review is negative.')
    elif prob_neg >= 0.30:
        print('This review is mixed.')
    else:
        print('This review is positive.')
    # Print the probabilities for "neg" and "pos", rounded to three decimals
    print(round(prob_neg, 3))
    print(round(prob_result.prob("pos"), 3))

count_probability(bad_review)
count_probability(mixed_review)
count_probability(good_review)

This review is negative.
0.999
0.001
This review is mixed.
0.441
0.559
This review is positive.
0.01
0.99
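
The threshold logic inside count_probability() can also be written as a small pure function, which makes it easy to check without a trained classifier. This is a sketch: the function name `label_from_neg_probability` and its parameter names are our own, while the thresholds 0.30 and 0.70 and the three probabilities are the ones shown above.

```python
def label_from_neg_probability(prob_neg, mixed_from=0.30, negative_from=0.70):
    """Map the probability of the 'neg' label to one of three verdicts."""
    if prob_neg >= negative_from:
        return "negative"
    if prob_neg >= mixed_from:
        return "mixed"
    return "positive"

# The "neg" probabilities printed for the three custom reviews
for prob_neg in (0.999, 0.441, 0.01):
    print(prob_neg, label_from_neg_probability(prob_neg))
```

Keeping the decision separate from the printing also makes it simple to try other thresholds later.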


# Using actual reviews

We took a 1/10 review and a 10/10 review of the movie IT to test how the classifier performs on real reviews.

### 1/10 review

In [96]:
review = """It has become ritual for me to read the novel "It" once a year every year since it was released in 1986. The story is more than a gore-fest, it's a story about love and hope and friendship that is still meaningful to me to this day.

The only thing this movie has in common with the beloved book, is its name and the characters names. IT is a literal disaster and a slap in the face to anyone who actually read and cherishes the book. There are NO character backstories, nor character development at all. You are literally thrust into the movie expecting to know everything about everyone and why they are the way they are. IE: Henry Bowers and why he hates the "Losers Club" - He LITERALLY starts the movie trying to kill them. This is sad, because a large portion of the novel was meticulously spent doing quite the opposite and made you relate to and fall in love with the characters.

Editing? What editing? This is the worst edited movie I've ever seen in my life and I've seen a lot in 41 years. It was literally like the film makers shot 100 scenes, put the film in a hat, and took out said scenes and spliced them together at total random. I can't describe it any other way than saying, at one point, one of the characters (I can't tell who, because they all share the EXACT same personality) says, "I banged your mom last night", or something similar, and before the audience can even react, the scene changes to a jump scare happening in ANOTHER PART OF TOWN INSTANTLY and with no rhyme or reason. You don't have time to laugh at jokes, because they aren't funny (unlike Stephen King's jokes in the book) - and you don't have time to be scared, because you're still trying to process the dick joke that was still being told when the scene abruptly ended.

While the filming location for the town of Derry was suitable, having the movie take place in the 1980's instead of the 1950's JUST TO APPEASE the "Stranger Things" crowd was simply a terrible decision. The 1950's were a totally different time, and much of the characters' reasoning and mannerisms that you need to make this movie work are lost to a time and cultural difference. These guys call themselves "THE LOSERS SQUAD" in this movie for god's sake! Kids didn't start calling themselves a "squad" until the 1980's (IE "The Monster Squad) So, you love the book like me and are still reading? Thank you! Now let me list just SOME things that we both LOVE about the book that you will NOT find anywhere in this movie: The Deadlights, The Ritual of CHUD, The Mummy and the bridge, The Loser's Club Dam in the Barrens, the moving picture book (now its a slide machine), The Smokehouse, "This is battery acid", The Werewolf, Making the silver bullet after a game of monopoly, The stand pipe, Bower's hair turning white, "beep beep Richie", the giant bird, the 50's racism against Mike (actually Mike Hanlon himself is missing. The writers just made arguably the most important character an afterthought in this movie), character backstories, "Hi Ho Silver-AWAY!", Haystack... I could go on and on and on.

With god awful editing, absolutely no character backstories, cheap teen jump scares, not being faithful to the book, and too much CGI usage: Simply put - if you want to know how this movie is like the book, read the first 10 pages of "IT", and burn the other 1077 pages because that is exactly what the screenwriter and director did to this failed abortion."""
count_probability(review)

This review is negative.
1.0
0.0


### 10/10 review

In [97]:
review = """The first sequence is classical example of US horror film. In which, the evil clown suddenly and naturally appears, and it victimizes the kid who drops his toys into drainage passage on a rainy day in the local city of US. 

We not only easily grab the setting of the evil clown and what he wants. It is a typical bogie man story who eats and kidnap kids who are afraid of it. 

It(2017) proves effective use of VFX. CG animation is to crate what traditional acting or camera working or editing technique can not achieve. 

This film actually avoided overused VFX and CG animations (CG is not an editorial work but it's computer animation). 

Moreover, it recovered or tried to revive the core spirit of 1980s' Goonies(1985) or Stand By Me(1986) to the certain extent. This is also well accepted and understandable. 

The evil clown is typical Boogieman like Freddy in A Nightmare on Elm Street (1984), and it has common feature with Freddy. Furthermore, it is scarier than the latter. 

I appreciate the legitimate and efficient structure of this film. It is the best horror film of 2017!"""
count_probability(review)

This review is positive.
0.0
1.0


# Result

When we tested the classifier with a few custom reviews, the results were not always consistent: every time we shuffle the lists of negative and positive reviews, the classifier is trained on a different subset and ends up with a slightly different notion of what is negative and what is positive. However, when we used the actual reviews, which were naturally longer texts, both were classified correctly with probabilities close to 1, and we were satisfied with our program. The only thing that was hard to achieve was getting the program to tell us that a long review was mixed.
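
One likely reason long reviews are rarely labelled mixed is that Naive Bayes combines the evidence of every feature multiplicatively, so many weak hints add up to a near-certain verdict. The sketch below illustrates the effect with the standard library only. The function `posterior_neg` and the per-feature ratio of 1.2 are illustrative assumptions, not values taken from our classifier.

```python
import math

def posterior_neg(log_ratios, prior_neg=0.5):
    """Posterior probability of 'neg' given per-feature log likelihood ratios (neg vs pos)."""
    log_odds = math.log(prior_neg / (1 - prior_neg)) + sum(log_ratios)
    return 1 / (1 + math.exp(-log_odds))

# Assumption: every feature favours 'neg' only slightly (ratio 1.2 : 1)
mild = math.log(1.2)

for n_features in (5, 50, 500):
    print(n_features, round(posterior_neg([mild] * n_features), 3))
```

With only a handful of features the probability stays in the middle range, but with hundreds of features it is pushed towards 0 or 1, so the 0.30 to 0.70 band is rarely hit.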

We found this project to be a good learning experience and gained a good understanding of what the NLTK corpora can actually be used for.