# Sentiment Analysis

In [None]:
from textblob import TextBlob

In [None]:
texts=["The movie was good.", 
    "The movie was not good.",
    "I really think this product sucks.",
    "Really great product.",
    "I don't like this product"]
for t in texts:
    print(t, "==>", TextBlob(t).sentiment.polarity)

In [None]:
text=TextBlob("""The movie was good. The movie was not good. I really think this product sucks.
Really great product. I don't like this product""")
for s in text.sentences:
    print("=>", s)
for s in text.sentences:
    print(s, "==> ", s.sentiment.polarity)

# Creating our own classifier
Let's use the [Sentiment Polarity Dataset 2.0](https://www.cs.cornell.edu/people/pabo/movie-review-data/), included in the `NLTK` library.<br>
It consists of 1000 positive and 1000 negative processed reviews. It was introduced in Pang/Lee ACL 2004 and released in June 2004.

In [None]:
import nltk
from nltk.corpus import stopwords
from collections import defaultdict
from nltk import word_tokenize
import string

from nltk.corpus import movie_reviews as mr
print("The corpus contains %d reviews"% len(mr.fileids()))

for i in mr.fileids()[995:1005]: # Reviews 995 to 1004 (the slice end is exclusive)
    print(i, "==>", i.split('/')[0])

Let's see the content of one of these reviews

In [None]:
print(mr.raw(mr.fileids()[1]))

Calculating the frequency of each word in the document ...

In [None]:
from nltk.probability import FreqDist
FreqDist(mr.raw(mr.fileids()[1]).split())

Let's take a look at the most frequent words in the corpus

In [None]:
wordfreq = FreqDist()
for i in mr.fileids():
    wordfreq += FreqDist(w.lower() for w in mr.raw(i).split())

The previous code is flawed because `split()` only separates on whitespace, so punctuation stays attached to the neighbouring word (for example, `good,`). Let's use `word_tokenize()` or `mr.words()` instead...

In [None]:
wordfreq = FreqDist()
for i in mr.fileids():
    wordfreq += FreqDist(w.lower() for w in mr.words(i))
print(wordfreq)
print(wordfreq.most_common(10))
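
To see why whitespace splitting is too coarse, it can be compared with a tokenizer that separates punctuation from words. The following is a minimal sketch using only the standard library; the regular expression is an illustrative stand-in and is not the rule set behind `word_tokenize()` or `mr.words()`.

```python
import re

sample = "The movie was good, wasn't it?"

# Whitespace splitting keeps punctuation glued to the neighbouring word.
split_tokens = sample.split()

# Stand-in tokenizer (an assumption, not NLTK's rules): runs of word
# characters, or single non-space punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(split_tokens)
print(regex_tokens)
```

With `split()`, `good,` and `good` would be counted as two different words, which distorts the frequency counts.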

Stop words and punctuation dominate the most frequent entries, so let's remove them...

In [None]:
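stopw = set(stopwords.words('english'))  # set membership is faster than a list scan
wordfreq = FreqDist()
for i in mr.fileids():
    # Drop stop words and tokens made up entirely of punctuation (e.g. "--")
    wordfreq += FreqDist(w.lower() for w in mr.words(i)
                         if w.lower() not in stopw
                         and not all(ch in string.punctuation for ch in w))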
print(wordfreq)
print(wordfreq.most_common(10))
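
One detail worth checking when filtering punctuation: a test of the form `token in string.punctuation` is a substring test against a single string, so multi-character tokens such as `--` or `...` are not caught by it. The sketch below compares that test with a per-character check; `is_punctuation` is a hypothetical helper introduced here for illustration, and the token list is made up.

```python
import string

def is_punctuation(token):
    """True when every character of a non-empty token is ASCII punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)

tokens = ["good", ",", "--", "...", "don't", ".)"]

# Substring test: only removes tokens that appear verbatim inside
# the string.punctuation constant.
substring_check = [t for t in tokens if t not in string.punctuation]

# Per-character test: removes any token consisting only of punctuation.
strict_check = [t for t in tokens if not is_punctuation(t)]

print(substring_check)  # ['good', '--', '...', "don't", '.)']
print(strict_check)     # ['good', "don't"]
```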

## Shuffling

Let's shuffle the documents; otherwise they will remain sorted by label (["neg", "neg" ... "pos"]), and a train/test split by position would leave only one class in the test set.

In [None]:
import random
docnames = list(mr.fileids())  # work on a copy rather than the list returned by the reader
random.shuffle(docnames)
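
Because `random.shuffle` reorders the list in place and differently on every run, the accuracy reported later will vary between runs. A minimal sketch of a reproducible shuffle follows, assuming a fixed seed is acceptable; the file names and the seed value `42` are illustrative only.

```python
import random

# Made-up file ids in the corpus's "label/name" layout.
fileids = ["neg/a.txt", "neg/b.txt", "neg/c.txt",
           "pos/d.txt", "pos/e.txt", "pos/f.txt"]

# Shuffle a copy with a dedicated, seeded generator: the original order
# is preserved and the shuffled order is the same on every run.
rng = random.Random(42)
shuffled = list(fileids)
rng.shuffle(shuffled)

# The label is the part of the id before the slash.
labels = [name.split("/")[0] for name in shuffled]

print(shuffled)
print(labels)
```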

Let's split each document into words and pair them with the label taken from the file path ...

In [None]:
documents=[]
for i in docnames:
    y = i.split('/')[0]
    documents.append( ( mr.words(i) , y) )

Let's take a look at our documents...

In [None]:
for docs in documents[:5]:
    print(docs)

## Document representation

Now, let's produce the final document representation, in the form of a frequency distribution ...

First, without stop words and punctuation ... (you could use another technique, such as IDF weighting)

In [None]:
stopw = stopwords.words('english')
docrep=[]
for words,tag in documents:
    # Keep tokens that are neither stop words nor made up entirely of punctuation
    features = FreqDist(w for w in words
                        if w.lower() not in stopw
                        and not all(ch in string.punctuation for ch in w))
    docrep.append( (features, tag) )

Let's take a look at our documents again...

In [None]:
for doc in docrep[:5]:
    print(doc)
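
The IDF alternative mentioned earlier is not used in this notebook. As a rough sketch of the idea, the raw counts can be scaled by `log(N / df)`, where `N` is the number of documents and `df` the number of documents containing the term. The toy documents and the `tfidf` helper below are assumptions made for illustration, and other IDF variants add smoothing.

```python
import math
from collections import Counter

# Toy tokenized documents (made up for this example).
docs = [
    ["great", "movie", "great", "cast"],
    ["bad", "movie"],
    ["great", "plot"],
]

# Document frequency: in how many documents each term appears.
df = Counter(term for doc in docs for term in set(doc))

# Plain IDF without smoothing.
n_docs = len(docs)
idf = {term: math.log(n_docs / count) for term, count in df.items()}

def tfidf(doc):
    """Weight each term's raw count in doc by its IDF."""
    counts = Counter(doc)
    return {term: tf * idf[term] for term, tf in counts.items()}

print(tfidf(docs[0]))
```

Terms that occur in many documents, such as `movie` here, receive a lower weight than terms confined to a single document.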

## NLTK classifier: Naive Bayes

Defining our training and test sets...

In [None]:
numtrain = int(len(documents) * 80 / 100)  # number of training documents

train_set, test_set = docrep[:numtrain], docrep[numtrain:]

print(test_set[0])
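
With the 2000 reviews in this corpus, the 80% cut-off works out to 1600 training and 400 test documents. The arithmetic can be checked on placeholder data; `items` below is a stand-in for `docrep`, not the real features.

```python
# Placeholder for the 2000 (features, label) pairs in docrep.
items = list(range(2000))

# Integer arithmetic avoids the float round trip of int(n * 80 / 100).
numtrain = len(items) * 80 // 100

train_set, test_set = items[:numtrain], items[numtrain:]

print(len(train_set), len(test_set))  # 1600 400
```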

In [None]:
from nltk.classify import NaiveBayesClassifier as nbc

classifier = nbc.train(train_set)

print("Accuracy:", nltk.classify.accuracy(classifier, test_set))

classifier.show_most_informative_features(5)
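
The accuracy printed above is the fraction of test documents whose predicted label matches the true label. A minimal sketch of that computation follows, assuming a hypothetical `predict` function in place of the trained classifier; the toy test set is made up.

```python
def accuracy(predict, labelled):
    """Fraction of (features, label) pairs that predict() labels correctly."""
    correct = sum(1 for features, label in labelled if predict(features) == label)
    return correct / len(labelled)

# Hypothetical hand-written rule standing in for a trained model.
def predict(features):
    return "pos" if features.get("great", 0) > features.get("bad", 0) else "neg"

toy_test_set = [
    ({"great": 2, "plot": 1}, "pos"),
    ({"bad": 1, "movie": 1}, "neg"),
    ({"great": 1, "bad": 3}, "neg"),
    ({"plot": 1}, "pos"),
]

print(accuracy(predict, toy_test_set))  # 0.75
```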