# Classify Movie Reviews based on Sentiment Analysis using Naive Bayes

## Objective

Predict whether a movie review is negative or positive from its text using Naive Bayes. Along the way, apply some basic natural language processing to extract the features used to train the algorithm from the text of the reviews.

## Data Set

The dataset is a CSV file of movie reviews with their associated binary sentiment polarity labels. Each row contains the text of the review and a number indicating whether the tone of the review is positive (1) or negative (-1).
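
As an illustration of that layout, the following sketch parses two made-up rows with `csv.reader`. The sample text and the `sample_csv` name are hypothetical and are not taken from the actual data set.

```python
import csv
import io

# Hypothetical sample in the same two-column layout: review text, then label.
sample_csv = '"A moving, beautifully shot film.",1\n"Dull plot and wooden acting.",-1\n'

rows = list(csv.reader(io.StringIO(sample_csv)))

for text, label in rows:
    # csv.reader returns every field as a string, so the label is "1" or "-1".
    print(repr(label), text)
```

Because the labels arrive as strings, the code in this notebook compares them with `str(score)` or converts them with `int(...)`.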

## Reading In Training Data

In [1]:
import csv
with open("C:/Users/i7/csv/naive/train.csv", 'r', encoding='latin-1') as file:
    reviews = list(csv.reader(file))

## Finding Word Counts

Generate features from the text by splitting it into words. Each word in a review then becomes a feature to work with. Here the reviews are split on whitespace.

Then count how many times each word occurs in the negative reviews and how many times it occurs in the positive reviews.

In [2]:
from collections import Counter
import re

def get_text(reviews, score):
    # Join together the text in the reviews for a particular tone.
    # We lowercase to avoid "Not" and "not" being seen as different words, for example.
    return " ".join([r[0].lower() for r in reviews if r[1] == str(score)])

def count_text(text):
    # Split text into words based on whitespace.  Simple but effective.
    words = re.split(r"\s+", text)
    # Count up the occurrence of each word.
    return Counter(words)

negative_text = get_text(reviews, -1)
positive_text = get_text(reviews, 1)
# Generate word counts for negative tone.
negative_counts = count_text(negative_text)
# Generate word counts for positive tone.
positive_counts = count_text(positive_text)

print("negative_text[:100]:", negative_text[:100])
print("positive_text[:100]:", positive_text[:100])

negative_text[:100]: story of a man who has unnatural feelings for a pig. starts out with a opening scene that is a terri
positive_text[:100]: bromwell high is a cartoon comedy. it ran at the same time as some other programs about school life,
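
One consequence of splitting on whitespace is visible in the output above: punctuation stays attached to words, so "pig." and "pig" are counted as different features. The sketch below shows the difference on a made-up sentence, using a hypothetical `count_words` helper with a regex tokenizer as one possible alternative. It is not what this notebook uses.

```python
import re
from collections import Counter

text = "a pig. the pig is not a bad pig, really."

# Whitespace splitting, as in count_text above: punctuation stays attached.
whitespace_counts = Counter(re.split(r"\s+", text))

def count_words(text):
    # Hypothetical alternative: keep only runs of letters, digits and apostrophes.
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

regex_counts = count_words(text)

print(whitespace_counts["pig"], whitespace_counts["pig."], whitespace_counts["pig,"])
print(regex_counts["pig"])
```

With whitespace splitting the three spellings are each counted once, while the regex tokenizer merges them into a single word.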


## Making Predictions

Convert the word counts to probabilities and multiply them together to score each class. The class with the higher score is the predicted classification.
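
For reference, the textbook multinomial Naive Bayes rule with add-one (Laplace) smoothing can be written as follows. The symbols are introduced here for illustration only: $c$ is a class, $n_w$ is the number of times word $w$ appears in the review, $N_{w,c}$ is the count of $w$ in the training reviews of class $c$, $N_c$ is the total word count of class $c$, and $|V|$ is the vocabulary size.

```latex
\hat{c} = \arg\max_{c} \; P(c) \prod_{w} P(w \mid c)^{n_w},
\qquad
P(w \mid c) = \frac{N_{w,c} + 1}{N_c + |V|}
```

The `make_class_prediction` function in this notebook is a simplified variant: it multiplies each smoothed term by $n_w$ instead of raising it to that power, and it adds the number of reviews in the class, not $|V|$, to the denominator.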

In [7]:
def get_y_count(score):
    # Compute the count of each classification occurring in the data.
    return len([r for r in reviews if r[1] == str(score)])

# Count the reviews in each class; these counts give the class probabilities and are used for smoothing in the prediction.
positive_review_count = get_y_count(1)
negative_review_count = get_y_count(-1)

# These are the class probabilities.
prob_positive = positive_review_count / len(reviews)
#print(prob_positive)
prob_negative = negative_review_count / len(reviews)
#print(prob_negative)

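def make_class_prediction(text, counts, class_prob, class_count):
    prediction = 1
    text_counts = Counter(re.split(r"\s+", text))
    # Total number of words seen for this class; computed once, not per word.
    total_words = sum(counts.values())
    for word in text_counts:
        # Smoothed word probability (add 1 so unseen words do not zero the product),
        # weighted by how often the word occurs in the text.
        # Note: multiplying many small values underflows to 0.0 for long reviews.
        prediction *= float(text_counts.get(word) * (counts.get(word, 0) + 1) / (total_words + class_count))
    return prediction * class_prob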

# The probabilities themselves aren't very useful -- make classification decision based on which value is greater.
print("reviews[0][0]:", reviews[0][0])

neg_pred = make_class_prediction(reviews[0][0], negative_counts, prob_negative, negative_review_count)
print("neg_pred:", neg_pred)

pos_pred = make_class_prediction(reviews[0][0], positive_counts, prob_positive, positive_review_count)
print("pos_pred:", pos_pred)

reviews[0][0]: Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly.
neg_pred: 0.0
pos_pred: 0.0
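
Both scores print as 0.0 because a product of many probabilities, each far below 1, underflows double-precision floats. A common remedy is to sum logarithms instead of multiplying; the comparison between classes is unchanged because the logarithm is monotonic. The sketch below is one way to do that, assuming a hypothetical `make_class_log_prediction` function and made-up toy counts. It mirrors the scoring formula used above, but it is not the code that produced the outputs in this notebook.

```python
import math
import re
from collections import Counter

def make_class_log_prediction(text, counts, class_prob, class_count):
    # Same scoring formula as make_class_prediction, but summed in log space.
    text_counts = Counter(re.split(r"\s+", text))
    total_words = sum(counts.values())
    log_prediction = math.log(class_prob)
    for word, n in text_counts.items():
        log_prediction += math.log(n * (counts.get(word, 0) + 1) / (total_words + class_count))
    return log_prediction

# Toy word counts (made up for illustration).
toy_negative = Counter({"boring": 30, "plot": 10, "good": 2})
toy_positive = Counter({"good": 30, "plot": 10, "boring": 2})

# A long review: 500 distinct filler words plus one informative word.
long_review = " ".join(f"w{i}" for i in range(500)) + " boring"

neg = make_class_log_prediction(long_review, toy_negative, 0.5, 20)
pos = make_class_log_prediction(long_review, toy_positive, 0.5, 20)
print(neg > pos)  # the negative class scores higher
```

The log scores are large negative numbers and stay finite, so the two classes can still be compared even when the plain product would be 0.0 for both.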


## Predicting The Test Set

Next, predict on the reviews in test.csv. Predicting on the reviews in train.csv would give misleadingly good results, because the probabilities were generated from that data (and thus the algorithm has prior knowledge about the data it's predicting on).

Getting good results on the training set could mean that the model is overfit and is just picking up random noise. Only testing on a set that the model wasn't trained on can tell whether it's performing properly.

In [9]:
def make_decision(text, make_class_prediction):
    # Compute the negative and positive probabilities.
    negative_prediction = make_class_prediction(text, negative_counts, prob_negative, negative_review_count)
    positive_prediction = make_class_prediction(text, positive_counts, prob_positive, positive_review_count)

    # Assign a classification based on which probability is greater.
    if negative_prediction > positive_prediction:
        return -1
    return 1

with open("C:/Users/i7/csv/naive/test.csv", 'r', encoding='latin-1') as file:
    test = list(csv.reader(file))

predictions = [make_decision(r[0], make_class_prediction) for r in test]
print("predictions[:5]:", predictions[:5])

predictions[:5]: [1, 1, 1, 1, 1]


## Computing Prediction Error

Measure prediction quality using the area under the ROC curve (AUC). This indicates how "good" the model is: the closer to 1, the better, while 0.5 is no better than random guessing.

In [10]:
actual = [int(r[1]) for r in test]

from sklearn import metrics

# Generate the ROC curve using scikit-learn.
fpr, tpr, thresholds = metrics.roc_curve(actual, predictions, pos_label=1)

# Measure the area under the curve.  The closer to 1, the "better" the predictions.
print("AUC:", metrics.auc(fpr, tpr))

AUC: 0.559
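
Because `predictions` holds hard labels (-1 or 1), the ROC curve above has only one interior operating point, and reviews that receive the same label are tied. Passing a continuous score per review usually gives a more informative AUC. The dependency-free sketch below illustrates the effect with a hypothetical `pairwise_auc` helper and made-up scores. It relies on the fact that AUC equals the fraction of positive-negative pairs ranked correctly, with ties counted as one half.

```python
def pairwise_auc(actual, scores, pos_label=1):
    # AUC as the share of (positive, negative) pairs where the positive scores higher.
    pos = [s for a, s in zip(actual, scores) if a == pos_label]
    neg = [s for a, s in zip(actual, scores) if a != pos_label]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))

actual = [-1, -1, -1, 1, 1, 1]
scores = [0.1, 0.3, 0.6, 0.55, 0.7, 0.8]          # made-up continuous scores
labels = [1 if s > 0.5 else -1 for s in scores]   # hard decisions at 0.5

auc_from_scores = pairwise_auc(actual, scores)
auc_from_labels = pairwise_auc(actual, labels)
print(round(auc_from_scores, 3), round(auc_from_labels, 3))
```

In this toy case the hard labels discard ranking information, so the AUC computed from them is lower than the AUC computed from the underlying scores.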


## Predictions using Scikit-Learn

In [12]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics

# Do some natural language processing to extract features.
# Generate counts from text using a vectorizer.  There are other vectorizers available, and lots of options that can be set.
# This performs the step of computing word counts.
vectorizer = CountVectorizer(stop_words='english')
train_features = vectorizer.fit_transform([r[0] for r in reviews])
test_features = vectorizer.transform([r[0] for r in test])

# Fit a naive bayes model to the training data.
# This will train the model using the word counts we compute, and the existing classifications in the training set.
nb = MultinomialNB()
nb.fit(train_features, [int(r[1]) for r in reviews])

# Use the model to predict classifications for the test features.
predictions = nb.predict(test_features)

# Compute the error. It differs from the earlier model's result because the internals of this process work differently from our implementation.
fpr, tpr, thresholds = metrics.roc_curve(actual, predictions, pos_label=1)
print("AUC:", metrics.auc(fpr, tpr))

AUC: 0.833
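
`nb.predict` returns hard labels, so the AUC above is again computed from -1/1 decisions. `MultinomialNB` also provides `predict_proba`, whose columns follow the order of `classes_`, so the probability of the positive class could serve as the score instead. The following is a minimal sketch on made-up toy reviews; the texts and variable names are illustrative, and the result on the real data is not shown here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy data standing in for train.csv / test.csv.
train_texts = ["great movie loved it", "wonderful acting great story",
               "terrible boring film", "awful plot boring acting"]
train_labels = [1, 1, -1, -1]
test_texts = ["great story", "boring plot"]

# Chain the vectorizer and the classifier so both are fit on the training texts only.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Column order of predict_proba follows model.classes_ (here [-1, 1]).
positive_column = list(model.classes_).index(1)
positive_scores = model.predict_proba(test_texts)[:, positive_column]

print(model.predict(test_texts))
print(positive_scores)
```

Scores like these could then be passed to `metrics.roc_curve(actual, scores, pos_label=1)` in place of the hard predictions.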
