## Lab: Naive Bayes Sentiment Analysis on Movie Reviews
TA: Bokyung Son (*Computational Linguistics Lab, SNU*)

### TASK: Classify each review phrase as negative (-1), neutral (0), or positive (1).

Naive Bayes applies Bayes' theorem to an input with many features by assuming that the features are conditionally independent given the class.<br><br>
$$P(y|x_{1}, \dots, x_{n}) = \frac{P(y)\prod_{i=1}^{n}P(x_{i}|y)}{P(x_{1}, \dots, x_{n})}$$

We estimate the prior probability of each class $P(y)$ and the probability of each feature given each class $P(x_{i}|y)$. We then return the class with the highest posterior probability $P(y|x_{1}, \dots, x_{n})$ as the predicted classification.
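
Since the denominator $P(x_{1}, \dots, x_{n})$ is the same for every class, it can be dropped when the classes are only being compared. Under that observation the decision rule can be written as follows, where $\hat{y}$ is notation introduced here for the predicted class:

$$\hat{y} = \arg\max_{y} P(y)\prod_{i=1}^{n}P(x_{i}|y)$$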
  
Following the bag-of-words (BOW) model, we treat each word in a review as a feature and ignore word order.
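
As a rough illustration of the bag-of-words idea, the sketch below turns a short phrase into word counts. The phrase is a fragment of the sample review and the variable names are chosen only for this example.

```python
from collections import Counter

# Illustrative phrase (a fragment of the sample review)
phrase = "good for the goose is also good for the gander"

# Bag of words: keep how often each word occurs, discard word order
bow = Counter(phrase.split())

print(bow["good"])   # 2
print(bow["goose"])  # 1
print(bow["bad"])    # 0 (a Counter returns 0 for unseen words)
```

Two phrases containing the same words in a different order produce the same counts, which is exactly the information the classifier sees.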

### 1. Load dataset
In this lab, we use the [Rotten Tomatoes](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews) dataset. Check out the features.

In [1]:
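import pandas as pd

# DataFrame.from_csv has been removed from pandas (1.0 and later).
# read_csv with index_col=0 uses the first column as the index, as from_csv did by default.
reviews = pd.read_csv('data/train.tsv', sep='\t', index_col=0)
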

In [2]:
def get_text(reviews, score):
    # Join the phrases whose sentiment label is in `score` into one string
    return ' '.join(reviews.loc[reviews['Sentiment'].isin(score), 'Phrase'])

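Make a mega-document for each tone. The dataset's five sentiment labels (0 to 4) are collapsed into three tones: negative (0, 1), neutral (2), and positive (3, 4).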

In [3]:
neg_text = get_text(reviews, [0,1])
ntr_text = get_text(reviews, [2])
pos_text = get_text(reviews, [3,4])

print(f'Negative text sample: {neg_text[:200]}', end='\n\n')
print(f'Neutral text sample: {ntr_text[:200]}', end='\n\n')
print(f'Positive text sample: {pos_text[:200]}', end='\n\n')

Negative text sample: A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story . the gander 

Neutral text sample: A series of escapades demonstrating the adage that what is good for the goose A series A series of escapades demonstrating the adage that what is good for the goose of escapades demonstrating the adag

Positive text sample: good for the goose good amuses This quiet , introspective and entertaining independent is worth seeking . This quiet , introspective and entertaining independent quiet , introspective and entertaining



### 2. Collect features

In [4]:
import re
from collections import Counter

def count_words(text):
    # Split on runs of whitespace (raw string avoids an invalid-escape warning)
    words = re.split(r"\s+", text)
    return Counter(words)

In [5]:
# Generate word counts for each score
neg_word_counts = count_words(neg_text)
ntr_word_counts = count_words(ntr_text)
pos_word_counts = count_words(pos_text)

### 3. Make predictions

In [6]:
def count_y(reviews):
    # Count each class
    return reviews['Sentiment'].value_counts()

In [7]:
# Compute the counts of each tone
class_counts = count_y(reviews)
neg_review_count = class_counts[0] + class_counts[1]
ntr_review_count = class_counts[2]
pos_review_count = class_counts[3] + class_counts[4]

# Class probabilities P(y)
neg_prob = neg_review_count / len(reviews)
ntr_prob = ntr_review_count / len(reviews)
pos_prob = pos_review_count / len(reviews)

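Apply Laplace (add-k) smoothing to the maximum likelihood estimates, so that a word never seen in a class does not receive zero probability. Here $count(x_{i}, y)$ is the number of times word $x_{i}$ occurs in class $y$, $count(y)$ is the total number of word tokens in class $y$, and $\lvert X \rvert$ is the vocabulary size.<br><br>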
$$P_{lap, k}(x_{i}|y) = \frac{count(x_{i}, y) + k}{count(y) + k \lvert X \rvert}$$
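
To make the formula concrete, here is a minimal sketch on toy counts. The helper `smoothed_prob`, the counts, and the four-word vocabulary are all hypothetical and chosen only so the arithmetic is easy to follow.

```python
from collections import Counter

def smoothed_prob(word, class_counts, vocab_size, k=1):
    """Add-k estimate of P(word | class) from one class's word counts."""
    total = sum(class_counts.values())  # count(y): all word tokens in the class
    return (class_counts.get(word, 0) + k) / (total + k * vocab_size)

# Hypothetical toy counts for a single class
toy_counts = Counter({"good": 3, "fun": 1})
toy_vocab = ["good", "fun", "bad", "dull"]  # assumed vocabulary, |X| = 4

p_seen = smoothed_prob("good", toy_counts, len(toy_vocab))   # (3 + 1) / (4 + 4)
p_unseen = smoothed_prob("bad", toy_counts, len(toy_vocab))  # (0 + 1) / (4 + 4)
print(p_seen, p_unseen)  # 0.5 0.125
```

The unseen word "bad" gets a small but nonzero probability, and the smoothed estimates over the whole vocabulary still sum to 1.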

In [8]:
def predict(text, word_counts, class_prob, k):
    """
    Computes the unnormalized probability (the numerator, without the denominator)
    that a text (phrase) belongs to a certain class.

    Inputs
    ------
    text: the target text/phrase to classify.
    word_counts: Counter of all words (features) in this class.
    class_prob: P(this class)
    k: Laplace smoothing factor (add-k)

    """
    # Counter for the target text
    word_counts_text = Counter(re.split(r"\s+", text))

    # Denominator of the smoothed estimate; it is constant per class, so compute it once.
    # Note: len(word_counts) is this class's own vocabulary size, used here as |X|.
    denominator = sum(word_counts.values()) + k * len(word_counts)

    prob = 1
    for word, count in word_counts_text.items():
        # Apply Laplace (add-k) smoothing, weighting each word by its count in the text
        prob *= count * (word_counts.get(word, 0) + k) / denominator
    # Multiply by P(y)
    return prob * class_prob


def predict_class(text, word_counts_dict, class_prob_dict, k=1):
    # Score the text under every class and return the label with the highest score
    class_probs = {}
    for cl, cl_prob in class_prob_dict.items():
        class_probs[cl] = predict(text, word_counts_dict[cl], cl_prob, k)
    return max(class_probs, key=class_probs.get)

In [9]:
# We can now compute a score for each class that a given review might belong to.
# The scores are unnormalized (the denominator is omitted), so their absolute values
# are not meaningful; the classification decision depends only on which is largest.
sample_review = reviews['Phrase'].iloc[0]
print('Sample review: {0}'.format(sample_review))
print('Probability for being negative: {0}'.format(predict(sample_review, neg_word_counts, neg_prob, 1)))
print('Probability for being neutral: {0}'.format(predict(sample_review, ntr_word_counts, ntr_prob, 1)))
print('Probability for being positive: {0}'.format(predict(sample_review, pos_word_counts, pos_prob, 1)))

Sample review: A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .
Probability for being negative: 3.503012065587911e-85
Probability for being neutral: 3.150976693599114e-84
Probability for being positive: 1.9865028517931204e-86
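
Scores this small (around 1e-85 for a single sentence) can underflow to zero on longer texts. A common remedy is to add log-probabilities instead of multiplying probabilities. The sketch below is not the lab's `predict`: `log_score`, its `vocab_size` parameter, and the toy counts are hypothetical, and it assumes the multinomial form in which a word occurring several times contributes its log-probability once per occurrence.

```python
import math
import re
from collections import Counter

def log_score(text, word_counts, class_prob, vocab_size, k=1):
    """Log of P(y) * prod_i P(x_i | y), with add-k smoothing."""
    total = sum(word_counts.values())
    score = math.log(class_prob)
    for word, count in Counter(re.split(r"\s+", text)).items():
        # A word occurring `count` times contributes count * log P(word | y)
        score += count * math.log((word_counts.get(word, 0) + k) / (total + k * vocab_size))
    return score

# Hypothetical toy counts and equal class priors
toy_pos = Counter({"good": 3, "fun": 1})
toy_neg = Counter({"bad": 2, "dull": 2})
toy_vocab_size = len(set(toy_pos) | set(toy_neg))  # 4 distinct words overall

pos_score = log_score("good fun", toy_pos, 0.5, toy_vocab_size)
neg_score = log_score("good fun", toy_neg, 0.5, toy_vocab_size)
print(pos_score > neg_score)  # True
```

Because the logarithm is monotonic, taking the class with the largest log score gives the same decision as taking the largest product.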


Apply the classifier to the test data. Only the first 500 test phrases are classified here.

In [11]:
word_counts_dict = {}
word_counts_dict[-1] = neg_word_counts
word_counts_dict[0] = ntr_word_counts
word_counts_dict[1] = pos_word_counts

class_prob_dict = {}
class_prob_dict[-1] = neg_prob
class_prob_dict[0] = ntr_prob
class_prob_dict[1] = pos_prob

# DataFrame.from_csv has been removed from pandas; use read_csv instead
test_reviews = pd.read_csv('data/test.tsv', sep='\t', index_col=0)

predictions = pd.DataFrame(columns=['Phrase', 'Sentiment'])
for i, r in test_reviews[:500].iterrows():
    predictions.loc[i] = [r['Phrase'], predict_class(r['Phrase'], word_counts_dict, class_prob_dict)]


Save the predictions to a file and check the result!

In [12]:
predictions.to_csv('output.tsv', sep='\t')

### Scikit-learn version

In [13]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Generate counts from text
vectorizer = CountVectorizer(stop_words='english')
train_features = vectorizer.fit_transform(reviews['Phrase'])
test_features = vectorizer.transform(test_reviews['Phrase'])

# Train a NB model (here on the original five sentiment labels, 0-4, not the three tones)
nb = MultinomialNB()
nb.fit(train_features, reviews['Sentiment'])

# Predict for test data
predictions = nb.predict(test_features)

In [14]:
predictions

array([3, 3, 2, ..., 2, 2, 1])
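
The scikit-learn model was fit on the dataset's original labels, so its predictions range over 0 to 4 rather than -1, 0, 1. One way to bring them in line with the task is to collapse them with the same grouping used for the mega-documents. This is a minimal sketch: `five_class` is a hypothetical stand-in for the predicted array, and `tone_of` is a name introduced here.

```python
# Hypothetical stand-in for the labels returned by nb.predict(...)
five_class = [3, 3, 2, 0, 4, 1]

# Same grouping as the mega-documents: 0,1 -> negative; 2 -> neutral; 3,4 -> positive
tone_of = {0: -1, 1: -1, 2: 0, 3: 1, 4: 1}

# int(...) also accepts NumPy integer labels
three_class = [tone_of[int(label)] for label in five_class]
print(three_class)  # [1, 1, 0, -1, 1, -1]
```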