In [1]:
import pandas as pd
import numpy as np

import string
from collections import Counter

# nltk
import nltk
from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier

import random

# Visualisation libraries

## Text
from colorama import Fore, Back, Style

import warnings
warnings.filterwarnings("ignore")

<div class="alert alert-block alert-info">
<font size="+2"><b>
Movie Review Classification Using the Natural Language Toolkit
</b></font>
<hr>
<font size="+1"><b>
Naive Bayes Classifier (Improved Model)
</b></font>
</div>


The Natural Language Toolkit, or more commonly [NLTK](https://www.nltk.org/), is a suite of libraries and programs for symbolic and statistical natural language processing for English written in the Python programming language.

NLTK ships with a large collection of datasets. Here we use the **Sentiment Polarity Dataset Version 2.0**, which can be downloaded using

```Python
nltk.download("movie_reviews")
from nltk.corpus import movie_reviews
```

Alternatively, we could download the dataset from the [GitHub repository](https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/movie_reviews.zip).

This dataset contains 1000 positive and 1000 negative processed reviews. It was introduced in Pang and Lee (ACL 2004) and released in June 2004.

In [2]:
display(pd.DataFrame({'Positive Comments': [len(movie_reviews.fileids('pos'))],
              'Negative Comments': [len(movie_reviews.fileids('neg'))]}, index = ['Shape']))

print(Fore.GREEN + Style.NORMAL + 'Number of Words' + Style.RESET_ALL + ' = %i' % len(movie_reviews.words()))

def Line(N): return N*'='
print(Back.BLACK + Fore.GREEN + Style.NORMAL + 'Positive Comments:' +
      Style.RESET_ALL + Fore.BLUE + Style.NORMAL + ' %s' % Line(120- len('Positive Comments:') - 1) + Style.RESET_ALL)
display(movie_reviews.fileids('pos')[:5])
print(Back.BLACK + Fore.GREEN + Style.NORMAL + 'Negative Comments:' +
      Style.RESET_ALL + Fore.BLUE + Style.NORMAL + ' %s' % Line(120- len('Negative Comments:') - 1) + Style.RESET_ALL)
display(movie_reviews.fileids('neg')[:5])
print(Fore.BLUE + Style.NORMAL + '%s' % Line(120) + Style.RESET_ALL)

| | Positive Comments | Negative Comments |
|---|---|---|
| Shape | 1000 | 1000 |


Number of Words = 1583820


['pos/cv000_29590.txt',
 'pos/cv001_18431.txt',
 'pos/cv002_15918.txt',
 'pos/cv003_11664.txt',
 'pos/cv004_11636.txt']



['neg/cv000_29416.txt',
 'neg/cv001_19502.txt',
 'neg/cv002_17424.txt',
 'neg/cv003_12683.txt',
 'neg/cv004_12641.txt']



In [3]:
# A Random Positive Comment:
print(Back.BLACK + Fore.GREEN + Style.NORMAL + 'A Random Positive Comment:' +
      Style.RESET_ALL + Fore.BLUE + Style.NORMAL + ' %s' % Line(120- len('A Random Positive Comment:') - 1) + Style.RESET_ALL)
print(movie_reviews.raw(fileids = movie_reviews.fileids('pos') [np.random.randint(len(movie_reviews.fileids('pos')))]))
# A Random Negative Comment:
print(Back.BLACK + Fore.MAGENTA + Style.NORMAL + 'A Random Negative Comment:' +
      Style.RESET_ALL + Fore.BLUE + Style.NORMAL + ' %s' % Line(120- len('A Random Negative Comment:') - 1) + Style.RESET_ALL)

print(movie_reviews.raw(fileids = movie_reviews.fileids('neg') [np.random.randint(len(movie_reviews.fileids('neg')))]))
print(Fore.BLUE + Style.NORMAL + '%s' % Line(120) + Style.RESET_ALL)

before you read my review , you gotta know that i love woody allen . 
this is a very important note because allen's films are generally an acquired taste and definitely not for everyone . 
i know folks who believe him to be a complete genius , while others see him as a dirty ol' schnook who keeps making the same movie over and over again . 
i love most of his films , but will admit to having been quite disappointed by his recent crop during the 90s . 
in fact , why he felt the need to make 10 movies in those 10 years is beyond me ! 
if you look at the quality of those films , you'll hear what i'm saying . 
the only two films of his that i really liked during that time were bullets over broadway and husbands and wives . 
in fact , i secretly hoped that he would take some " time off " at the turn of the millennium , just to re-energize or something , but it doesn't appear as though he has any intention of doing that . 
so here i am again , reviewing yet another woody allen movie and hopi

# Problem Description

We would like to determine whether a given comment is <font color='Green'><b>positive</b></font> or <font color='Red'><b>negative</b></font>.

# Modeling

To start, we build a list of tuples, each pairing the tokenized words of one comment with its category (positive or negative). This list is the basis for the rest of the analysis.

In [4]:
Documents = [(list(movie_reviews.words(fileid)), Category) # the tuple
             for Category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(Category)]
# Shuffle Document
random.shuffle(Documents)
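
Because `random.shuffle` reorders `Documents` differently on every run, the train/test split and the accuracies reported later will vary between runs. The following is a minimal sketch of a reproducible shuffle, assuming a fixed seed is acceptable. The helper `shuffled_copy`, the seed value `42`, and the toy `docs` list are illustrative and not part of the original notebook.

```python
import random

# Toy stand-in for Documents: (token list, category) tuples
docs = [(["good", "film"], "pos"), (["bad", "plot"], "neg"),
        (["great", "cast"], "pos"), (["dull", "script"], "neg")]

def shuffled_copy(items, seed=42):
    """Return a shuffled copy of items using a dedicated, seeded generator."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    out = list(items)          # copy so the input list keeps its order
    rng.shuffle(out)
    return out

first = shuffled_copy(docs)
second = shuffled_copy(docs)
print(first == second)  # True: the same seed gives the same order
```

Using a local `random.Random` instance rather than `random.seed` keeps the seeding from affecting other code that draws random numbers.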

Now, using nltk's [FreqDist](http://www.nltk.org/api/nltk.html?highlight=freqdist), we can count how often each (lowercased) word appears across all comments.

In [5]:
All_Words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

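Next, we take 2000 words from this distribution as **Featured Words**. Note that in NLTK 3 `FreqDist` is a `Counter` subclass, so `list(All_Words)[:2000]` yields the first 2000 distinct words in order of first appearance rather than the 2000 most frequent ones; `All_Words.most_common(2000)` would select by frequency instead.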

In [6]:
Featured_Words = list(All_Words)[:2000]
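
The difference between slicing a frequency distribution and asking it for its most common entries can be seen on a toy example. This sketch uses `collections.Counter` (the base class of `FreqDist` in NLTK 3) so that it runs without the corpus; the token list is made up for illustration.

```python
from collections import Counter

tokens = ["the", "plot", "is", "thin", "but", "the", "cast", "is", "good", "the"]
freq = Counter(tokens)

# Slicing the keys follows insertion order: the first distinct tokens seen
first_seen = list(freq)[:2]

# most_common sorts by descending count
most_frequent = [word for word, _ in freq.most_common(2)]

print(first_seen)     # ['the', 'plot']
print(most_frequent)  # ['the', 'is']
```

Applied to this notebook, the frequency-based variant would read `[w for w, _ in All_Words.most_common(2000)]`. It would change the feature set and therefore the accuracies and informative features shown later.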

The next step is to define a **feature extractor** function that checks whether each of these words is present in a given document.

In [7]:
def Document_Features(Doc):
    # converting the Doc into a set
    Doc_words = set(Doc)
    # Creating an empty dictionary of features
    features = {}
    # a loop over featured words
    for word in Featured_Words:
        features['contains({})'.format(word)] = (word in Doc_words)
    return features
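
To see what the extractor produces, here is a small self-contained variant that takes the vocabulary as an argument and lowercases the tokens before the membership test. Both choices are assumptions made for illustration: the function name `document_features`, the `vocabulary` parameter, and the three-word vocabulary are not part of the original notebook. The reviews in this corpus already appear in lowercase, so lowercasing should make no difference here, but it would matter for other text.

```python
def document_features(doc_tokens, vocabulary):
    """Map a token list to {'contains(word)': bool} over a fixed vocabulary."""
    present = {t.lower() for t in doc_tokens}  # set gives O(1) membership tests
    return {"contains({})".format(w): (w in present) for w in vocabulary}

vocab = ["plot", "great", "boring"]  # hypothetical featured words
feats = document_features(["A", "GREAT", "plot", "twist"], vocab)
print(feats)
# {'contains(plot)': True, 'contains(great)': True, 'contains(boring)': False}
```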

As an example, for a randomly chosen <font color='Green'><b>positive</b></font> comment, we have

In [8]:
Temp = movie_reviews.words(fileids = movie_reviews.fileids('pos') [np.random.randint(len(movie_reviews.fileids('pos')))])
Temp = Document_Features(Temp)
print(Fore.BLUE + Style.NORMAL + 'The first ten entries of this dictionary:' + Style.RESET_ALL)
for x in list(Temp)[0:10]:
    print ("{}: {} ".format(x,  Temp[x]))
del Temp

The first ten entries of this dictionary:
contains(plot): False 
contains(:): True 
contains(two): True 
contains(teen): False 
contains(couples): False 
contains(go): True 
contains(to): True 
contains(a): True 
contains(church): False 
contains(party): False 


For modeling, we split the feature sets into a training set (90%) and a test set (10%).

In [9]:
Feature_Sets = [(Document_Features(d), c) for (d,c) in Documents]
Split = int(0.1 * len(Feature_Sets))

Train_Set, Test_Set = Feature_Sets[Split:], Feature_Sets[:Split]

Temp = pd.DataFrame({'Train Set': [len(Train_Set)], 'Test Set': [len(Test_Set)]}, index = ['Size']).T
Temp['Percentage'] = np.round(100* Temp['Size'].values/Temp['Size'].values.sum(), 2)
# Styler.set_precision was removed in pandas 2.0; format(precision=...) replaces it
display(Temp.T.style.format(precision=0))

| | Train Set | Test Set |
|---|---|---|
| Size | 1800 | 200 |
| Percentage | 90 | 10 |
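
Since the split is a plain slice of a shuffled list, the two classes are not guaranteed to be evenly represented in the test set. A quick way to check is to count the labels on each side. The sketch below does this on toy data; `label_counts` and the `toy` list are hypothetical names, and the same function could be applied to `Train_Set` and `Test_Set`.

```python
from collections import Counter

def label_counts(feature_sets):
    """Count how many (features, label) pairs carry each label."""
    return Counter(label for _, label in feature_sets)

# Toy feature sets with alternating labels
toy = [({"contains(good)": i % 2 == 0}, "pos" if i % 2 == 0 else "neg")
       for i in range(10)]

split = int(0.1 * len(toy))              # 10% of the data goes to the test set
train_set, test_set = toy[split:], toy[:split]

print(label_counts(train_set))
print(label_counts(test_set))
```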


## Modeling: Naive Bayes Classifier

A [naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) is a probabilistic classifier that applies Bayes' theorem under the (naive) assumption that the features are conditionally independent given the class. To learn more about text classification with naive Bayes, see these [slides](https://web.stanford.edu/~jurafsky/slp3/slides/7_NB.pdf). Now, we can implement this using [nltk.classify.naivebayes](https://www.nltk.org/_modules/nltk/classify/naivebayes.html).
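
As a rough illustration of the decision rule behind the cell below, here is a from-scratch sketch of a naive Bayes classifier over boolean features. It is not NLTK's implementation: the functions `train_nb` and `classify_nb`, the add-one smoothing (`alpha=1.0`), and the toy training data are all assumptions made for this example. NLTK's own smoothing and its handling of unseen features may differ.

```python
import math
from collections import defaultdict

def train_nb(labeled_features, alpha=1.0):
    """Estimate log P(label) and P(feature=True | label) with add-alpha smoothing."""
    label_counts = defaultdict(int)
    true_counts = defaultdict(lambda: defaultdict(int))
    names = set()
    for feats, label in labeled_features:
        label_counts[label] += 1
        for name, value in feats.items():
            names.add(name)
            if value:
                true_counts[label][name] += 1
    total = sum(label_counts.values())
    model = {}
    for label, n in label_counts.items():
        # Two possible values (True/False) per feature, hence 2 * alpha
        p_true = {name: (true_counts[label][name] + alpha) / (n + 2 * alpha)
                  for name in names}
        model[label] = (math.log(n / total), p_true)
    return model

def classify_nb(model, feats):
    """Return the label with the highest log prior plus summed log likelihoods."""
    scores = {}
    for label, (log_prior, p_true) in model.items():
        score = log_prior
        for name, p in p_true.items():
            score += math.log(p if feats.get(name, False) else 1.0 - p)
        scores[label] = score
    return max(scores, key=scores.get)

train = [
    ({"contains(great)": True, "contains(dull)": False}, "pos"),
    ({"contains(great)": True, "contains(dull)": False}, "pos"),
    ({"contains(great)": False, "contains(dull)": True}, "neg"),
    ({"contains(great)": False, "contains(dull)": True}, "neg"),
]
model = train_nb(train)
prediction = classify_nb(model, {"contains(great)": True, "contains(dull)": False})
print(prediction)  # pos
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters with 2000 features per document.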

In [10]:
NBC = nltk.NaiveBayesClassifier.train(Train_Set)

display(pd.DataFrame({'Train Set': [100*nltk.classify.util.accuracy(NBC, Train_Set)],
              'Test Set': [100*nltk.classify.util.accuracy(NBC, Test_Set)]}, index = ['Accuracy']).round(2))

| | Train Set | Test Set |
|---|---|---|
| Accuracy | 88.67 | 79.5 |
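
Accuracy is the fraction of comments whose predicted label matches the true label. It does not show which class is misclassified more often, so it can help to tabulate (true, predicted) pairs as well. The sketch below does both on made-up labels; `confusion_counts`, `gold`, and `pred` are hypothetical names.

```python
from collections import Counter

def confusion_counts(gold, predicted):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(gold, predicted))

gold = ["pos", "pos", "neg", "neg", "neg"]  # made-up true labels
pred = ["pos", "neg", "neg", "neg", "pos"]  # made-up predictions

counts = confusion_counts(gold, pred)
accuracy = sum(n for (g, p), n in counts.items() if g == p) / len(gold)

print(counts)
print(accuracy)  # 0.6
```

For the classifier above, the predictions could be obtained with `[NBC.classify(f) for f, _ in Test_Set]` and the true labels with `[c for _, c in Test_Set]`.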


We can use **show_most_informative_features()** to find out which features the classifier found to be most informative.

In [11]:
NBC.show_most_informative_features()

Most Informative Features
        contains(turkey) = True              neg : pos    =     11.9 : 1.0
    contains(schumacher) = True              neg : pos    =     11.8 : 1.0
     contains(atrocious) = True              neg : pos    =     10.4 : 1.0
 contains(unimaginative) = True              neg : pos    =      7.8 : 1.0
      contains(explores) = True              pos : neg    =      6.9 : 1.0
        contains(suvari) = True              neg : pos    =      6.4 : 1.0
          contains(mena) = True              neg : pos    =      6.4 : 1.0
       contains(singers) = True              pos : neg    =      6.3 : 1.0
       contains(wounded) = True              pos : neg    =      5.7 : 1.0
        contains(shoddy) = True              neg : pos    =      5.7 : 1.0
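
Each row above reports a likelihood ratio: for example, `contains(turkey) = True` is listed as 11.9 times more likely under `neg` than under `pos`. A minimal sketch of how such a ratio can be computed from counts follows. The counts are hypothetical, and the add-0.5 smoothing (`gamma=0.5`) is an assumption for this example rather than a statement about NLTK's internals.

```python
def likelihood_ratio(count_a, n_a, count_b, n_b, gamma=0.5):
    """Ratio P(feature=True | a) / P(feature=True | b) with add-gamma smoothing.

    count_x: number of documents with label x where the feature is True
    n_x:     total number of documents with label x
    """
    p_a = (count_a + gamma) / (n_a + 2 * gamma)  # two outcomes: True / False
    p_b = (count_b + gamma) / (n_b + 2 * gamma)
    return p_a / p_b

# Hypothetical counts: the word occurs in 12 of 900 negative reviews
# and in 1 of 900 positive reviews
ratio = likelihood_ratio(count_a=12, n_a=900, count_b=1, n_b=900)
print(round(ratio, 1))  # 8.3
```

Smoothing keeps the ratio finite when a word never occurs under one label. Rare words and proper names (such as `schumacher` above) can still reach high ratios on the basis of very few documents.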


# Predictions

For a randomly chosen <font color='Green'><b>positive</b></font> comment, we have

In [12]:
# A Random Positive Comment:
print(Back.BLACK + Fore.GREEN + Style.NORMAL + 'A Random Positive Comment:' +
      Style.RESET_ALL + Fore.BLUE + Style.NORMAL + ' %s' % Line(120- len('A Random Positive Comment:') - 1) + Style.RESET_ALL)
Temp = np.random.randint(len(movie_reviews.fileids('pos')))
print(movie_reviews.raw(fileids = movie_reviews.fileids('pos') [Temp]))
Temp = movie_reviews.words(fileids = movie_reviews.fileids('pos') [Temp])
print(Back.BLACK + Fore.CYAN + Style.NORMAL + 'Predictions:' +
      Style.RESET_ALL + Fore.BLUE + Style.NORMAL + ' %s' % Line(120- len('Predictions:') - 1) + Style.RESET_ALL)
Temp = NBC.classify(Document_Features(Temp))
if Temp == 'pos':
    print(Back.GREEN + Fore.BLACK + Style.NORMAL + 'A Positive Comment')
else:
    print(Back.RED + Fore.BLACK + Style.NORMAL + 'A Negative Comment')

the party is one of those classic slapstick comedies that will leave you , at times , cracking up . 
the film takes place , for the most part , in real-time during an exclusive evening party that is attended only by the biggest names in hollywood . 
hrundi v . bakshi , played very well by peter sellers , is a struggling actor who just came to america from his homeland , india . 
hrundi tries out his acting talents , but it seems that he just isn't cut out for the job . 
on the set of his current film that he stars in , hrundi seems to make everything go for the worse . 
during the filming of this movie set in the 1800's , hrundi manages to annoy the director ( herbert ellis ) in any way he can . 
this includes a pitiful acting job in many scenes , wearing an underwater watch in one scene , and accidentally detonating a massive set . 
many of the hollywood producers and big names want hrundi out of the business forever . 
and when the director makes a personal phone call to mr . clutter

Furthermore, for a randomly chosen <font color='Red'><b>negative</b></font> comment, we have

In [13]:
# A Random Negative Comment:
print(Back.BLACK + Fore.GREEN + Style.NORMAL + 'A Random Negative Comment:' +
      Style.RESET_ALL + Fore.BLUE + Style.NORMAL + ' %s' % Line(120- len('A Random Negative Comment:') - 1) + Style.RESET_ALL)
Temp = np.random.randint(len(movie_reviews.fileids('neg')))
print(movie_reviews.raw(fileids = movie_reviews.fileids('neg') [Temp]))
Temp = movie_reviews.words(fileids = movie_reviews.fileids('neg') [Temp])
print(Back.BLACK + Fore.CYAN + Style.NORMAL + 'Predictions:' +
      Style.RESET_ALL + Fore.BLUE + Style.NORMAL + ' %s' % Line(120- len('Predictions:') - 1) + Style.RESET_ALL)
Temp = NBC.classify(Document_Features(Temp))
if Temp == 'pos':
    print(Back.GREEN + Fore.BLACK + Style.NORMAL + 'A Positive Comment')
else:
    print(Back.RED + Fore.BLACK + Style.NORMAL + 'A Negative Comment')

making it's debut at the dollar theater ? 
locally , chairman of the board did just that . 
having the annoying prop comic scott thompson ( better known as carrot top ) in the lead role ? 
chairman of the board , once again . 
how about an overly exhausted , paper thin plot approached with utter incompetence ? 
did somebody say chairman of the board ? 
that's right , carrot top's long dreaded major motion picture debut ( at least for a starring role ) is poking up in a handful of theaters across the country . 
chairman of the board stars the obnoxious , wannabe-zany king of redheaded standup comics as a lazy but creative , inventive but uneventful generation x- er named edison . 
living with a pair of surfer dudes in a small , rented house , edison bounces from job to job , always squandering away the money on his eccentric ( to say the least ) inventions and ignoring crucial responsibilities such as rent . 
this has the crabby landlady , ms . krubavitch ( estelle harris , best known a

***

## References

* "Natural Language Toolkit". [www.nltk.org](https://www.nltk.org/).
* Bird, S., Klein, E., and Loper E., "Natural Language Processing with Python". [Chapter 6: Learning to Classify Text](https://www.nltk.org/book/ch06.html)
* "Movie Review Data", [www.cs.cornell.edu/people/pabo/movie-review-data](http://www.cs.cornell.edu/people/pabo/movie-review-data/)
* Pang B., Lee L., and Vaithyanathan S., "[Thumbs up? Sentiment Classification using Machine Learning Techniques](http://www.cs.cornell.edu/home/llee/papers/sentiment.home.html)", Proceedings of EMNLP 2002.
* Pang B., and Lee L., "A Sentimental Education: [Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts](http://www.cs.cornell.edu/home/llee/papers/cutsent.home.html)", Proceedings of ACL 2004.
* Pang B., and Lee L., "Seeing stars: [Exploiting class relationships for sentiment categorization with respect to rating scales](http://www.cs.cornell.edu/home/llee/papers/pang-lee-stars.home.html)", Proceedings of ACL 2005.
* Strunk, J., "Punkt Tokenizer Models". [nltk.org/nltk_data](http://www.nltk.org/nltk_data/).
* Jurafsky, D., and Martin, J. H. (2019). "[Speech and Language Processing (3rd ed. draft)](https://web.stanford.edu/~jurafsky/slp3/)".

***