# MOVIE REVIEW USING NLTK LIBRARY

## Supervised Classification
In supervised classification, the classifier is trained with labeled training data.

NLTK’s movie_reviews corpus is used as the labeled training data. It contains 2000 movie reviews, each labeled with its sentiment polarity.

The reviews fall into two categories: positive and negative.

In [2]:
# First, import the movie_reviews corpus
from nltk.corpus import movie_reviews

In [3]:
#number of reviews in this corpus 
print(len(movie_reviews.fileids()))

2000


In [5]:
#categories in this corpus
print(movie_reviews.categories())

['neg', 'pos']


In [6]:
# number of positive reviews
print(len(movie_reviews.fileids('pos')))

1000


In [7]:
#number of negative reviews 
print(len(movie_reviews.fileids('neg')))

1000


## Create a List of Movie Review Documents
Each entry in the list pairs the words of one review with its category. The list is shuffled once it has been built.

In [11]:
document = []
for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        document.append((movie_reviews.words(fileid),category))

In [12]:
# length of document, i.e. 2000
print(len(document))

2000


In [13]:
#format in the document
print(document[0])

(['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...], 'neg')


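Now we need to shuffle the reviews. The list was built one category at a time, so without shuffling all the negative reviews would come before the positive ones and the train/test split made later would be unbalanced.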


In [14]:
from random import shuffle 
shuffle(document)
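
Because shuffle reorders the list randomly, the train/test split and the accuracy reported later can change from run to run. The following is a minimal sketch of one way to make the order reproducible, assuming a fixed seed is acceptable. The toy `docs` list and the seed value 42 are illustrative and are not part of the original notebook.

```python
import random

# Toy stand-in for the (words, category) pairs built above.
docs = [(["bad", "plot"], "neg"), (["dull"], "neg"),
        (["great", "cast"], "pos"), (["fun"], "pos")]

# A dedicated Random instance with a fixed seed gives the same order on every run.
rng = random.Random(42)
shuffled = docs[:]          # copy, so the original order is preserved
rng.shuffle(shuffled)

# Shuffling changes the order but never the contents.
same_items = sorted(shuffled, key=str) == sorted(docs, key=str)
print(same_items)
```

With the real data, the same idea would be `random.Random(42).shuffle(document)` in place of the unseeded call.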

## Feature Extraction
To classify text into a category, we need to define some criteria.
Based on those criteria, the classifier learns that a particular kind of text falls into a particular category.
Such criteria are known as features.

### Top-N words features
In this method we use NLTK's FreqDist to count how often each word occurs and then take the N most frequent words.


In [15]:
#fetch all the words from the corpus
allwords =[word.lower() for word in movie_reviews.words()]
#print 1st ten words
print(allwords[:10])

['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party']


#### Create Frequency Distribution of all words

A frequency distribution counts the number of occurrences of each word in the entire list of words.
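
FreqDist is a subclass of Python's collections.Counter, so the counting itself can be illustrated without the corpus. The sketch below uses Counter on a made-up token list. The "samples" reported by FreqDist correspond to the distinct words and the "outcomes" to the total number of tokens counted.

```python
from collections import Counter

# Made-up tokens, standing in for the corpus words.
tokens = ["good", "film", "bad", "film", "good", "film"]
freq = Counter(tokens)

distinct = len(freq)          # number of distinct words ("samples")
total = sum(freq.values())    # number of tokens counted ("outcomes")
print(distinct, total)        # 3 6
print(freq.most_common(2))    # [('film', 3), ('good', 2)]
```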

In [17]:
from nltk import FreqDist
allwords_frequecy = FreqDist(allwords)
print(allwords_frequecy)

<FreqDist with 39768 samples and 1583820 outcomes>


### Removing Punctuation and Stopwords
Punctuation marks such as . ? ! @ do not help determine the class of a review, so they have to be removed.


Stop words are frequently occurring words that do not carry any significant meaning in text analysis. For example, I, me, my, the, a, and, is, are, he, she, we, etc.

In [18]:
#importing libraries 
from nltk.corpus import stopwords
import string

In [19]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [20]:
stopword_english = stopwords.words('english')
print(stopword_english)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

#### Create a clean list of all words, then apply FreqDist to allwords_clean

In [21]:
allwords_clean=[]
for word in allwords :
    if word not in stopword_english and word not in string.punctuation :
        allwords_clean.append(word)
        
# print the first 10 words of allwords_clean
print(allwords_clean[:10])

['plot', 'two', 'teen', 'couples', 'go', 'church', 'party', 'drink', 'drive', 'get']
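
Note that `word not in string.punctuation` is a substring test, because string.punctuation is a single string. Single punctuation characters are removed, but a multi-character token such as "--" is kept unless it happens to appear as a contiguous substring. A stricter variant is sketched below, using a small made-up stop-word set in place of NLTK's list. The helper name `is_punctuation` is hypothetical.

```python
import string

stop_words = {"the", "a", "was"}          # hypothetical, tiny stop-word set
tokens = ["the", "film", "--", "was", "great", "!", "..."]

def is_punctuation(token):
    # True when every character of the token is a punctuation character.
    return all(ch in string.punctuation for ch in token)

# The substring test used above keeps "--" and "...".
substring_filtered = [t for t in tokens
                      if t not in stop_words and t not in string.punctuation]

# The character-by-character test removes them as well.
strict_filtered = [t for t in tokens
                   if t not in stop_words and not is_punctuation(t)]

print(substring_filtered)   # ['film', '--', 'great', '...']
print(strict_filtered)      # ['film', 'great']
```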


In [24]:
print(len(allwords_clean))

710578


In [23]:
#FreqDist for allwords_clean
allwords_frequecy = FreqDist(allwords_clean)
print(allwords_frequecy)

<FreqDist with 39586 samples and 710578 outcomes>


We can see that out of the 710578 words in allwords_clean, 39586 are distinct.
Now choose the top 2000 words using most_common().

In [25]:
# get the 2000 most frequently occurring words
mostcommonwords= allwords_frequecy.most_common(2000)
print(mostcommonwords[:10])

[('film', 9517), ('one', 5852), ('movie', 5771), ('like', 3690), ('even', 2565), ('good', 2411), ('time', 2411), ('story', 2169), ('would', 2109), ('much', 2049)]


In [26]:
# the most common words list's elements are in the form of tuple
# get only the first element of each tuple of the word list
word_features = [item[0] for item in mostcommonwords]
print (word_features[:10])

['film', 'one', 'movie', 'like', 'even', 'good', 'time', 'story', 'would', 'much']


### Create Feature Set

Now we write a function that creates the feature set used to train the classifier. For each review, it records whether each of the 2000 most common words is present.

In [28]:
def document_features(document):
    # "set" function will remove repeated/duplicate tokens in the given list
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

feature_set = [(document_features(doc), category) for (doc, category) in document]
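
To make the shape of one feature dictionary concrete, here is a small sketch of the same idea on toy data. It assumes a variant of the function that takes the word list as a second parameter; `toy_word_features` and `toy_review` are made up for illustration.

```python
def document_features(document, word_features):
    # A set drops duplicate tokens and makes membership tests fast.
    document_words = set(document)
    return {"contains(%s)" % w: (w in document_words) for w in word_features}

toy_word_features = ["film", "good", "bad"]
toy_review = ["a", "good", "film", "good"]

features = document_features(toy_review, toy_word_features)
print(features)
# {'contains(film)': True, 'contains(good)': True, 'contains(bad)': False}
```

Every review gets the same keys, one per word feature, so a short review has mostly False values.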

### Creating Train and Test Dataset

In [29]:
print (len(feature_set))

2000


In [30]:
test_set = feature_set[:400]
train_set = feature_set[400:]
 
print (len(train_set))
print (len(test_set)) 

1600
400
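
Since the list was shuffled randomly, the two slices are not guaranteed to contain equal numbers of positive and negative reviews. A quick check of the label counts can be sketched as follows. The `toy_feature_set` below is a made-up stand-in for feature_set, with empty feature dictionaries.

```python
from collections import Counter

# Made-up (features, category) pairs standing in for feature_set.
toy_feature_set = [({}, "neg"), ({}, "pos"), ({}, "pos"), ({}, "neg"), ({}, "pos")]

test_part = toy_feature_set[:2]
train_part = toy_feature_set[2:]

# Count how many examples of each category ended up in each slice.
train_counts = Counter(label for _, label in train_part)
test_counts = Counter(label for _, label in test_part)

print(dict(train_counts))   # {'pos': 2, 'neg': 1}
print(dict(test_counts))    # {'neg': 1, 'pos': 1}
```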


We use the Naive Bayes classifier. It is a simple, fast probabilistic classifier based on Bayes’ theorem, and it tends to perform well on small datasets. Bayes’ theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. The classifier is called naive because it assumes that the features are independent of one another given the category.

In [31]:

from nltk import NaiveBayesClassifier
 
classifier = NaiveBayesClassifier.train(train_set)
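
The idea behind the classifier can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not NLTK's implementation: it uses made-up training data with two boolean features and add-one smoothing, whereas NLTK's own code differs in details such as its smoothing estimator. The function name `posterior` is hypothetical.

```python
from collections import Counter, defaultdict

# Made-up training data: (features, label) pairs with boolean features.
train = [
    ({"contains(good)": True,  "contains(bad)": False}, "pos"),
    ({"contains(good)": True,  "contains(bad)": False}, "pos"),
    ({"contains(good)": False, "contains(bad)": True},  "neg"),
    ({"contains(good)": False, "contains(bad)": True},  "neg"),
]

label_counts = Counter(label for _, label in train)

# value_counts[label][feature][value] = number of training items with that value
value_counts = defaultdict(lambda: defaultdict(Counter))
for feats, label in train:
    for name, value in feats.items():
        value_counts[label][name][value] += 1

def posterior(feats):
    scores = {}
    for label, n in label_counts.items():
        score = n / len(train)                      # prior P(label)
        for name, value in feats.items():
            # P(feature = value | label), add-one smoothing over True/False
            score *= (value_counts[label][name][value] + 1) / (n + 2)
        scores[label] = score
    total = sum(scores.values())                    # normalize so the values sum to 1
    return {label: s / total for label, s in scores.items()}

probs = posterior({"contains(good)": True, "contains(bad)": False})
print(max(probs, key=probs.get))   # pos
```

Each feature contributes one multiplicative factor per category, which is where the independence assumption enters.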

In [32]:

from nltk import classify 
 
accuracy = classify.accuracy(classifier, test_set)
print (accuracy)

0.8275
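
Accuracy is the fraction of test examples whose predicted category matches the true one, so 0.8275 on 400 reviews corresponds to 331 correct predictions. A minimal sketch of that computation, using a hypothetical rule-based predictor `toy_predict` and made-up test data in place of the trained classifier:

```python
def accuracy(predict, labeled_examples):
    # Fraction of examples for which the prediction equals the true label.
    correct = sum(1 for feats, label in labeled_examples if predict(feats) == label)
    return correct / len(labeled_examples)

def toy_predict(feats):
    # Hypothetical rule: positive whenever the word "good" is present.
    return "pos" if feats.get("contains(good)") else "neg"

toy_test = [
    ({"contains(good)": True},  "pos"),
    ({"contains(good)": False}, "neg"),
    ({"contains(good)": True},  "neg"),   # misclassified by the rule
    ({"contains(good)": False}, "neg"),
]

print(accuracy(toy_predict, toy_test))   # 0.75
```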


## Custom input


In [33]:
from nltk.tokenize import word_tokenize
 
custom_review = "I hated the film. It was a disaster. Poor direction, bad acting."
custom_review_tokens = word_tokenize(custom_review)
custom_review_set = document_features(custom_review_tokens)
print (classifier.classify(custom_review_set))
 


neg
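
One detail worth checking here is letter case. word_features is built from lowercased words, while word_tokenize keeps the original capitalization, so a token such as "Poor" does not match the feature for "poor". The sketch below uses hard-coded tokens as a stand-in for tokenizer output and a toy feature list to show the effect of lowercasing before feature extraction. Whether this changes the prediction above has not been verified here.

```python
toy_word_features = ["poor", "bad", "film"]                  # toy, lowercase features
tokens = ["Poor", "direction", ",", "bad", "acting", "."]    # stand-in for tokenizer output

def toy_document_features(document):
    document_words = set(document)
    return {"contains(%s)" % w: (w in document_words) for w in toy_word_features}

raw = toy_document_features(tokens)
lowered = toy_document_features([t.lower() for t in tokens])

print(raw["contains(poor)"])       # False
print(lowered["contains(poor)"])   # True
```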


In [34]:
# probability result
prob_result = classifier.prob_classify(custom_review_set)
print (prob_result) 
print (prob_result.max()) 
print (prob_result.prob("neg")) 
print (prob_result.prob("pos")) 
 


<ProbDist with 2 samples>
neg
0.9999985071196412
1.4928803716593245e-06


In [35]:
custom_review = "It was a wonderful and amazing movie. I loved it. Best direction, good acting."
custom_review_tokens = word_tokenize(custom_review)
custom_review_set = document_features(custom_review_tokens)
 
print (classifier.classify(custom_review_set)) 
 


neg


In [36]:
# probability result
prob_result = classifier.prob_classify(custom_review_set)
print (prob_result)
print (prob_result.max()) 
print (prob_result.prob("neg")) 
print (prob_result.prob("pos")) 

<ProbDist with 2 samples>
neg
0.9999758014288533
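2.419857114555758e-05

Note that the classifier labels this clearly positive review as negative, and does so with high confidence. A possible reason is that a short review contains only a handful of the 2000 word features, so the many absent-word features may outweigh the few words that are present.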
