# CS4765/6765 Assignment 2: Sentiment Analysis

**Due 9 October at 23:59**

We've seen the problem of sentiment analysis in class. In this
assignment you will write your own sentiment analysis system. You will
implement an unsupervised lexicon-based method and two variants of a
naive Bayes classifier. The starter code further provides a most-frequent class baseline and logistic regression. You will then compare the various
approaches.

## Data

I've provided you with the following files for this assignment:

- `movie_reviews_{train,dev,test}_docs.txt` Training, development, and test
  documents, in 1 document per line format. The documents have been tokenized, with each token separated by whitespace.
  These are movie reviews with gold standard labels of
  `positive` or `negative`.

  **These are real movie reviews. Some of the documents contain content
    that you might find offensive (e.g., expletives, racist remarks).
    Despite this offensive content, these movie reviews
    are still valuable data, and building NLP systems that can
    operate over them is important. That is why we are working with
    this potentially-offensive data in this assignment.**

  The data for this assignment is from the movie reviews dataset included as part of NLTK. You can read more about this dataset here: http://www.cs.cornell.edu/people/pabo/movie-review-data/. You should only use the files I've provided you with for this assignment.

- `movie_reviews_{train,dev,test}_classes.txt` Class labels for
  the training, development, and test data, in 1 label per line format. The labels are
  `positive` and `negative`.

- `{pos,neg}-words.txt` Lists of positive and
  negative words, in 1 word per line format. These lists are from the "Opinion
  Lexicon" provided by Bing
  Liu (https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon).
  I've stripped the header information from the original files so that the
  code here doesn't have to deal with that, and dealt with a minor encoding
  issue.

The starter code that I've provided below handles reading from these files for you.

## Models

In this assignment you will implement three different approaches to sentiment analysis: a sentiment lexicon baseline, multinomial naive Bayes, and binary multinomial naive Bayes. The starter code further includes a most-frequent class baseline and logistic regression. Further details on the models are provided below.

## Implementation

Your code must be able to run on the NLP VM on the lab machines using Python 3.9. You must not use NLTK or any other NLP toolkits. You should not import any modules that this notebook does not already import for you. Although the starter code uses an implementation of logistic regression from scikit-learn, you must not use scikit-learn for any of the code that you write in this assignment.

## Experimental Setup

Throughout this assignment, we will always use the training data for training models. We will implement our models and then use the development data for preliminary evaluation. Once we've completed this, we will do our final evaluation on the test data. The starter code guides you through this process.



In [4]:
# A very simple tokenizer. Applies case folding. 
# (The documents we are working with have already been tokenized and each token is separated by whitespace.)
def tokenize(s):
    return s.lower().split()

train_texts_fname = 'a2data/movie_reviews_train_docs.txt'
train_klasses_fname = 'a2data/movie_reviews_train_classes.txt'
dev_texts_fname = 'a2data/movie_reviews_dev_docs.txt'
dev_klasses_fname = 'a2data/movie_reviews_dev_classes.txt'

train_texts = [x.strip() for x in open(train_texts_fname, encoding='utf8')]
train_klasses = [x.strip() for x in open(train_klasses_fname, encoding='utf8')]
dev_texts = [x.strip() for x in open(dev_texts_fname, encoding='utf8')]
dev_klasses = [x.strip() for x in open(dev_klasses_fname, encoding='utf8')]

from sklearn.metrics import precision_recall_fscore_support, accuracy_score

# A helper function to print out macro-average P,R,F1 and accuracy.
# Uses implementations of evaluation metrics from sklearn.
def print_results(gold_labels, predicted_labels):
    p,r,f,_ = precision_recall_fscore_support(gold_labels, 
                                              predicted_labels, 
                                              average='macro', 
                                              zero_division=0)
    acc = accuracy_score(gold_labels, predicted_labels)

    print("Precision: ", p)
    print("Recall: ", r)
    print("F1: ", f)
    print("Accuracy: ", acc)
    print()

## Most-frequent Class Baseline

The starter code below trains a most-frequent class baseline on the training data and evaluates it on the dev data. (Note that sklearn also includes an implementation of a most-frequent class baseline, which you might find useful for your projects: https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html)

In [6]:
# A most-frequent class baseline
class Baseline:
    def __init__(self, klasses):
        self.train(klasses)

    def train(self, klasses):
        # Count classes to determine which is the most frequent
        klass_freqs = {}
        for k in klasses:
            klass_freqs[k] = klass_freqs.get(k, 0) + 1
        self.mfc = sorted(klass_freqs, reverse=True, 
                          key=lambda x : klass_freqs[x])[0]
    
    def classify(self, test_instance):
        # Ignore the test instance and always return the most frequent class
        return self.mfc

In [7]:
baseline_classifier = Baseline(train_klasses)
baseline_dev_predictions = [baseline_classifier.classify(x) for x in dev_texts]
print_results(dev_klasses, baseline_dev_predictions)

Precision:  0.24375
Recall:  0.5
F1:  0.3277310924369748
Accuracy:  0.4875
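The macro-averaged scores printed above can be reproduced by hand, which helps when interpreting them. The sketch below is a minimal pure-Python version of macro-averaged precision, recall, and F1, with undefined values set to 0 (mirroring `zero_division=0`). The function name `macro_prf` and the toy label lists are hypothetical; the toy lists only assume the same class proportion as the dev accuracy above (39/80 = 0.4875), not the actual size of the dev set.

```python
def macro_prf(gold, predicted):
    """Macro-averaged precision, recall, and F1; undefined values count as 0."""
    labels = sorted(set(gold) | set(predicted))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy data (an assumption): 39 of 80 documents belong to the class that
# the baseline always predicts, giving the same proportion as 0.4875.
toy_gold = ['negative'] * 39 + ['positive'] * 41
toy_predicted = ['negative'] * 80

macro_p, macro_r, macro_f = macro_prf(toy_gold, toy_predicted)
print(macro_p, macro_r, macro_f)
```

Because the baseline never predicts the other class, that class contributes a precision, recall, and F1 of 0, which is why the macro-averaged recall is exactly 0.5 and the macro-averaged precision is half of the accuracy.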



## Lexicon Baseline (1 mark)

A simple baseline approach is to use a polarity lexicon (i.e., the
lists of positive and negative words you've been provided with) to
determine the number of positive and negative tokens in a test
document, and then output the class label associated with the most
tokens. In the event of a tie, select the most frequent class. 

Implement this approach by completing the `classify` method below.
You should not modify any other parts of the code.

Then run the provided code to train this classifier on the training data
and evaluate on the development data.
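The counting-and-voting rule described above, including the tie-break, can be sketched as follows. This is an illustrative sketch rather than the required solution: the function `lexicon_vote`, its `tie_label` parameter, and the toy word lists are all hypothetical names introduced here. In the notebook, the tie label would be the most frequent class in the training data.

```python
def lexicon_vote(tokens, pos_words, neg_words, tie_label):
    """Return the class with the most lexicon tokens, or tie_label on a tie."""
    pos_count = sum(1 for token in tokens if token in pos_words)
    neg_count = sum(1 for token in tokens if token in neg_words)
    if pos_count > neg_count:
        return 'positive'
    if neg_count > pos_count:
        return 'negative'
    # Equal counts (including zero matches): fall back to the tie label.
    return tie_label

# Toy lexicons (assumptions, not the Opinion Lexicon files).
toy_pos = {'good', 'great'}
toy_neg = {'bad', 'boring'}

vote_pos = lexicon_vote('a great and good film'.split(), toy_pos, toy_neg, 'negative')
vote_neg = lexicon_vote('a boring bad good film'.split(), toy_pos, toy_neg, 'positive')
vote_tie = lexicon_vote('good but boring'.split(), toy_pos, toy_neg, 'negative')
print(vote_pos, vote_neg, vote_tie)
```

Passing the tie label in as a parameter keeps the voting rule independent of how the most frequent class is determined.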


In [9]:
class LexiconBL:
    def __init__(self, pos_fname, neg_fname):
        self.pos_words = set([x.strip() for x in open(pos_fname,
                                                      encoding='utf8')])
        self.neg_words = set([x.strip() for x in open(neg_fname,
                                                      encoding='utf8')])

    def classify(self, test_instance):
        # Given a test_instance (a string representing a movie review), 
        # return its predicted class ('positive' or 'negative')

        tokens = tokenize(test_instance)

        # TODO: Complete this method
        
        pos_counter = 0
        neg_counter = 0

        for token in tokens:
            if token in self.pos_words:
                pos_counter += 1
            elif token in self.neg_words:
                neg_counter += 1

        # Note: a tie (pos_counter == neg_counter) is resolved to 'positive' here.
        if pos_counter >= neg_counter:
            return 'positive'
        else:
            return 'negative'

In [10]:
# Train the lexicon baseline on the training data and evaluate on the dev data
pos_fname = 'a2data/pos-words.txt'
neg_fname = 'a2data/neg-words.txt'

lexiconBL_classifier = LexiconBL(pos_fname, neg_fname)
lexiconBL_dev_predictions = [lexiconBL_classifier.classify(x) for x in dev_texts]

print_results(dev_klasses, lexiconBL_dev_predictions)

Precision:  0.6880442398158343
Recall:  0.6879924953095685
F1:  0.6874980468627929
Accuracy:  0.6875



## Logistic Regression

The implementation below uses the scikit-learn (sklearn) Python module for logistic regression. Scikit-learn is a popular tool for doing many machine learning tasks in Python. It includes implementations of many classifiers (including naive Bayes, but we're implementing that ourselves in this assignment). Read the comments in the code below to learn how to use it.

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# sklearn provides functionality for tokenizing text and
# extracting features from it. This uses the tokenize function
# defined above for tokenization (as opposed to sklearn's
# default tokenization) so the results can be more easily
# compared with those using NB and the sentiment lexicon
# baseline.
# http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
count_vectorizer = CountVectorizer(analyzer=tokenize)

# train_counts will be a DxV matrix where D is the number of
# training documents and V is the number of types in the
# training documents. Each cell in the matrix indicates the
# frequency (count) of a type in a document.
train_counts = count_vectorizer.fit_transform(train_texts)

# Train a logistic regression classifier on the training
# data. A wide range of options are available. This does
# something similar to what we saw in class, i.e., multinomial
# logistic regression (multi_class='multinomial') using
# stochastic average gradient descent (solver='sag') with L2
# regularization (penalty='l2'). The maximum number of
# iterations is set to 2000 (max_iter=2000) to allow the model
# to converge on this training data. The random_state is set to 0 
# (an arbitrarily chosen number) to help ensure results are 
# consistent from run to run.
# http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
lr = LogisticRegression(multi_class='multinomial',
                        solver='sag',
                        penalty='l2',
                        max_iter=2000,
                        random_state=0)
lr_classifier = lr.fit(train_counts, train_klasses)

# Transform the test documents into a DxV matrix, similar to
# that for the training documents, where D is the number of
# test documents, and V is the number of types in the training
# documents. Here we will test on the development data.
dev_counts = count_vectorizer.transform(dev_texts)

# Predict the class for each test document
lr_dev_predictions = lr_classifier.predict(dev_counts)
print_results(dev_klasses, lr_dev_predictions)



Precision:  0.8355222013758599
Recall:  0.8355222013758599
F1:  0.835
Accuracy:  0.835



## Multinomial Naive Bayes (5 marks)

Implement multinomial naive Bayes, i.e., as described in Chapter 4.1-4.3 of the textbook. Follow the structure of the starter code provided. Read the comments in the code, and the description below, to understand how it is intended to work.

- The constructor, `__init__`, should train the classifier on the provided training data. Similarly to the constructor for `Baseline` above, you might find it useful to have the constructor call helper methods (i.e., `train` in the case of `Baseline`).

- `classify` predicts the class for a provided test instance. Be sure to compute probabilities in log space to avoid underflow errors
  from multiplying many probabilities.

- **Optional:** It can be very helpful to ensure that your probability distributions are actually
  probability distributions! You can do this by adding `assert` statements to `sanity_check` to make
  sure that all the probabilities you estimate are between 0 and 1,
  and that the distributions you estimate sum to 1, e.g., $\sum_{w \in
  V}P(w|c) = 1$.

  You can add further simple checks to your code, for example, `__init__` checks that the number of training documents and training classes are the same. (If this were not the case, then each document would not have exactly one class label, indicating a problem, perhaps that the constructor was called incorrectly.) You might want to make sure that all the classes you output are `positive` or `negative`.
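The two points above (working in log space, and checking that estimated distributions are valid) can be illustrated with a small sketch. The variable names and toy counts below are hypothetical and are not part of the starter code; the smoothing shown is add-one (Laplace) smoothing.

```python
import math

# Multiplying many small probabilities underflows to 0.0 in floating point...
token_probs = [1e-5] * 100
product = 1.0
for p in token_probs:
    product *= p

# ...whereas summing their logs stays well within range.
log_sum = sum(math.log(p) for p in token_probs)

# Add-one smoothed P(w|c) over a toy vocabulary (counts are made up).
toy_counts = {'good': 3, 'bad': 1}
toy_vocab = {'good', 'bad', 'plot'}
total_tokens = sum(toy_counts.values())
smoothed = {w: (toy_counts.get(w, 0) + 1) / (total_tokens + len(toy_vocab))
            for w in toy_vocab}

# Sanity checks: each value is a probability and the distribution sums to 1.
assert all(0 < p < 1 for p in smoothed.values())
assert math.isclose(sum(smoothed.values()), 1.0)

print(product, log_sum, smoothed)
```

Using `math.isclose` rather than `==` for the sum avoids spurious failures from floating-point rounding.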


In [14]:
import math

class NB:

    # TODO: Complete this class

    def __init__(self, documents,
                 klasses,
                 binary=False):
        # documents is a list of training documents. You will need to tokenize each document.
        # klasses is a list of corresponding classes for the training documents (i.e.,
        # klasses[i] is the class for documents[i]).
        # binary indicates whether to use binary multinomial naive Bayes (which you will
        # implement in the next step) or (regular) multinomial naive Bayes. You can ignore
        # it when you first implement multinomial naive Bayes, and then use it to implement
        # binary multinomial naive Bayes after.
        assert len(documents) == len(klasses)
        self.binary = binary
         
        self.vocab = set()
        self.class_word_counts = {'positive': {}, 'negative': {}}
        self.class_doc_counts = {'positive': 0, 'negative': 0}
        self.class_total_words = {'positive': 0, 'negative': 0}
        self.prior_prob = {'positive': 0, 'negative': 0}

        for doc, klass in zip(documents, klasses):
            tokens = tokenize(doc)
            self.class_doc_counts[klass] += 1

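            token_set = set(tokens) if self.binary else tokens
            for token in token_set:
                # Note: when binary=True, this condition counts each word type at most
                # once per class over the whole training set (every stored count is 1),
                # not once per document that contains it.
                if not self.binary or token not in self.class_word_counts[klass]: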
                    if token not in self.class_word_counts[klass]:
                        self.class_word_counts[klass][token] = 0
                    self.class_word_counts[klass][token] += 1
                    self.class_total_words[klass] += 1
                    self.vocab.add(token)

        total_docs = len(documents)
        self.prior_prob['positive'] = math.log(self.class_doc_counts['positive'] / total_docs)
        self.prior_prob['negative'] = math.log(self.class_doc_counts['negative'] / total_docs)

        self.sanity_check()

    def sanity_check(self):
        # You might want to add some checks here to check that, for example,
        # you've estimated valid probability distributions
        # prior_prob stores log probabilities, so each value must be <= 0
        assert self.prior_prob['positive'] <= 0
        assert self.prior_prob['negative'] <= 0
        
        assert len(self.vocab) > 0

    def classify(self, test_instance):
        # test_instance is an instance to classify. 
        # Return the predicted class: must be one of 'positive' or 'negative'
        tokens = tokenize(test_instance)
        
        log_prob_positive = self.prior_prob['positive']
        log_prob_negative = self.prior_prob['negative']

        token_set = set(tokens) if self.binary else tokens
        vocab_size = len(self.vocab)

        for token in token_set:
            count_positive = self.class_word_counts['positive'].get(token, 0) + 1
            prob_word_given_positive = count_positive / (self.class_total_words['positive'] + vocab_size)
            log_prob_positive += math.log(prob_word_given_positive)

            count_negative = self.class_word_counts['negative'].get(token, 0) + 1
            prob_word_given_negative = count_negative / (self.class_total_words['negative'] + vocab_size)
            log_prob_negative += math.log(prob_word_given_negative)

        return 'positive' if log_prob_positive > log_prob_negative else 'negative'

In [15]:
nb_classifier = NB(train_texts, train_klasses)
nb_classifier.sanity_check()
nb_dev_predictions = [nb_classifier.classify(x) for x in dev_texts]
print_results(dev_klasses, nb_dev_predictions)

Precision:  0.8227806458143536
Recall:  0.8190744215134459
F1:  0.8171697628842096
Accuracy:  0.8175



## Binary Multinomial Naive Bayes (1 mark)

Also implement binary multinomial naive Bayes (i.e., multinomial naive Bayes with binary features). Recall
that in this model, the frequency of a given word in a given document
is either 0 (if the word does not occur in the document) or 1 (if the
word occurs 1 or more times in the document). Repeated occurrences of
a word are ignored. This model is discussed in Chapter 4.4 of the textbook.

Implement this model by adding functionality to the class `NB` for when the value `True` is passed to the constructor for the parameter `binary`. **Hint** This should be a very small amount of additional code. If you're writing lots of code, you are likely off track.
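As an illustration of the difference between the two counting schemes, the sketch below removes duplicate tokens within each document before counting, so a word's count becomes the number of documents that contain it. This is one reading of the description above, not necessarily the intended solution; `count_words` and the toy documents are hypothetical, and the documents are assumed to all belong to a single class.

```python
from collections import Counter

def count_words(token_lists, binary=False):
    """Count words over the documents of one class.

    With binary=True, duplicates are removed within each document first,
    so each document contributes at most 1 to a word's count.
    """
    counts = Counter()
    for tokens in token_lists:
        counts.update(set(tokens) if binary else tokens)
    return counts

# Two toy documents from the same class (made-up data).
toy_docs = [['good', 'good', 'plot'], ['good', 'acting']]

freq_counts = count_words(toy_docs)
binary_counts = count_words(toy_docs, binary=True)
print(freq_counts['good'], binary_counts['good'])
```

In this toy example `good` has a frequency count of 3 but a binary count of 2, because it appears in two documents.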

In [17]:
bin_nb_classifier = NB(train_texts, train_klasses, binary=True)
bin_nb_classifier.sanity_check()
bin_nb_dev_predictions = [bin_nb_classifier.classify(x) for x in dev_texts]
print_results(dev_klasses, bin_nb_dev_predictions)

Precision:  0.6894245185099758
Recall:  0.5410881801125704
F1:  0.4233836339099497
Accuracy:  0.53



## Test Data

So far, we've only evaluated on the development data. Once you have completed the tasks above (i.e., your implementations of the classes `LexiconBL` and `NB`), run the code below to evaluate the classifiers on the test data.

In [19]:
test_texts_fname = 'a2data/movie_reviews_test_docs.txt'
test_klasses_fname = 'a2data/movie_reviews_test_classes.txt'

test_texts = [x.strip() for x in open(test_texts_fname, encoding='utf8')]
test_klasses = [x.strip() for x in open(test_klasses_fname, encoding='utf8')]

print("Test results:")
print()

print("Baseline:")
baseline_test_predictions = [baseline_classifier.classify(x) for x in test_texts]
print_results(test_klasses, baseline_test_predictions)

print("Lexicon Baseline:")
lexiconBL_test_predictions = [lexiconBL_classifier.classify(x) for x in test_texts]
print_results(test_klasses, lexiconBL_test_predictions)

print("Logistic Regression")
test_counts = count_vectorizer.transform(test_texts)
lr_test_predictions = lr_classifier.predict(test_counts)
print_results(test_klasses, lr_test_predictions)

print("Multinomial Naive Bayes:")
nb_test_predictions = [nb_classifier.classify(x) for x in test_texts]
print_results(test_klasses, nb_test_predictions)

print("Binary Multinomial Naive Bayes:")
bin_nb_test_predictions = [bin_nb_classifier.classify(x) for x in test_texts]
print_results(test_klasses, bin_nb_test_predictions)

Test results:

Baseline:
Precision:  0.24125
Recall:  0.5
F1:  0.3254637436762226
Accuracy:  0.4825

Lexicon Baseline:
Precision:  0.7300932523313083
Recall:  0.7303697028860354
F1:  0.7299392363281738
Accuracy:  0.73

Logistic Regression
Precision:  0.8634259259259259
Recall:  0.8615428900402993
F1:  0.8620438825868026
Accuracy:  0.8625

Multinomial Naive Bayes:
Precision:  0.775094377359434
Recall:  0.7754248954969838
F1:  0.7749493636068115
Accuracy:  0.775

Binary Multinomial Naive Bayes:
Precision:  0.7112771739130435
Recall:  0.5622762884533553
F1:  0.46001983905011223
Accuracy:  0.5475



## Report (3 marks)

Write a brief report describing the results of the various methods for sentiment analysis considered in this assignment. Base your analysis on the results over the test data. Address at least the following in your report:

1. Which of the two baseline methods (most-frequent class or sentiment lexicon) performs best?

1. Do the other methods (multinomial naive Bayes, binary multinomial naive Bayes, and logistic regression) outperform the baseline methods?

1. Does binary multinomial naive Bayes outperform (vanilla) multinomial naive Bayes? 

1. Consider binary multinomial naive Bayes and logistic regression. Which of these methods performs best? 

1. Carry out some error analysis to attempt to explain what causes the difference in performance between binary multinomial naive Bayes and logistic regression. For this, you might find it helpful to examine the per-class P, R, and F values, or a confusion matrix. You can do this using `sklearn.metrics.precision_recall_fscore_support` and `sklearn.metrics.confusion_matrix`. You can read the documentation for these functions here:
  
    https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
   
    https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

    I've also included some examples of how to use them below:

**TODO: Write your report here**

If you wrote code to get your answers (e.g., for the last question) also include that code in code blocks below.

1. The sentiment lexicon baseline performs best of the two baselines. On the test data it reaches an accuracy of 0.73 (macro F1 0.730), while the most-frequent class baseline reaches an accuracy of 0.4825 (macro F1 0.325). The most-frequent class baseline ignores the content of the document and always predicts the same class, so with two nearly balanced classes it is correct only about half the time.

2. Multinomial naive Bayes (accuracy 0.775) and logistic regression (accuracy 0.8625) outperform both baselines. Binary multinomial naive Bayes (accuracy 0.5475, macro F1 0.460) outperforms the most-frequent class baseline but not the sentiment lexicon baseline. Logistic regression performs best overall.

3. No. On the test data, binary multinomial naive Bayes (accuracy 0.5475) performs well below vanilla multinomial naive Bayes (accuracy 0.775). Binary naive Bayes can do better when word presence matters more than word frequency, but that is not what these results show.

4. Logistic regression performs best: its accuracy is 0.8625, versus 0.5475 for binary multinomial naive Bayes.

5. Key Points of Error Analysis:

Precision, Recall, F1-Score:
a. Binary naive Bayes has high precision (0.906) but very low recall (0.140) for the positive class, and high recall (0.984) but low precision (0.516) for the negative class. It labels almost every test document as negative.
b. The logistic regression pipeline trained in the code below is balanced across the two classes, with precision and recall of roughly 0.80 to 0.82 for both.

Confusion Matrix:
a. Binary naive Bayes classifies 178 of the 207 positive test documents as negative, but only 3 of the 193 negative test documents as positive.
b. The logistic regression pipeline makes 76 errors in total (39 negative documents classified as positive and 37 positive documents classified as negative), compared with 181 errors for binary naive Bayes.

In [22]:
# Write your code here

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

# Vectorize the text data
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

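# Train a separate logistic regression pipeline (sklearn's default tokenization
# plus feature scaling). This is a different model from lr_classifier above,
# so its results differ from the logistic regression test results reported earlier.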
log_reg_pipeline = make_pipeline(StandardScaler(with_mean=False), LogisticRegression(max_iter=1000))
log_reg_pipeline.fit(X_train, train_klasses)

# Binary Multinomial Naive Bayes predictions
bin_nb_predictions = [bin_nb_classifier.classify(x) for x in test_texts]

# Logistic Regression predictions
log_reg_predictions = log_reg_pipeline.predict(X_test)

# Function to evaluate models
def evaluate_model(true_labels, predictions, model_name):
    metrics = precision_recall_fscore_support(true_labels, predictions, average=None)
    conf_matrix = confusion_matrix(true_labels, predictions)
    print(f"\n{model_name} - Precision, Recall, F1 per class:", metrics[:3])
    print(f"{model_name} - Confusion Matrix:\n", conf_matrix)

# Evaluate both models
evaluate_model(test_klasses, bin_nb_predictions, "Binary Naive Bayes")
evaluate_model(test_klasses, log_reg_predictions, "Logistic Regression")


Binary Naive Bayes - Precision, Recall, F1 per class: (array([0.51630435, 0.90625   ]), array([0.98445596, 0.14009662]), array([0.67736185, 0.24267782]))
Binary Naive Bayes - Confusion Matrix:
 [[190   3]
 [178  29]]

Logistic Regression - Precision, Recall, F1 per class: (array([0.80628272, 0.81339713]), array([0.79792746, 0.82125604]), array([0.80208333, 0.81730769]))
Logistic Regression - Confusion Matrix:
 [[154  39]
 [ 37 170]]


Examples that might be useful for error analysis for question 5.

In [24]:
# Calculate P, R, F for the positive class for binary NB on the test data.
precision_recall_fscore_support(test_klasses, bin_nb_test_predictions, labels=['positive'])

(array([0.90625]),
 array([0.14009662]),
 array([0.24267782]),
 array([207], dtype=int64))

In [25]:
# Compute a confusion matrix for Binary NB on the test data. Note that 
# I've transposed the confusion matrix so that the rows correspond to 
# system predictions and columns correspond to gold standard labels, 
# following the convention in the textbook and class. (By default, 
# sklearn does it the other way around.)
from sklearn.metrics import confusion_matrix
confusion_matrix(test_klasses, bin_nb_test_predictions, labels=['positive', 'negative']).transpose()

array([[ 29,   3],
       [178, 190]], dtype=int64)
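The per-class scores reported earlier can be recovered directly from this transposed confusion matrix, which is a useful check when doing error analysis. The sketch below is a minimal worked example for the positive class; the variable names are hypothetical, and the matrix values are copied from the output above (rows are system predictions, columns are gold labels, both in the order positive, negative).

```python
# Transposed confusion matrix for binary NB on the test data, copied from
# the output above: rows = predicted, columns = gold.
matrix = [[29, 3],
          [178, 190]]

true_pos = matrix[0][0]    # predicted positive, gold positive
false_pos = matrix[0][1]   # predicted positive, gold negative
false_neg = matrix[1][0]   # predicted negative, gold positive

pos_precision = true_pos / (true_pos + false_pos)
pos_recall = true_pos / (true_pos + false_neg)
pos_f1 = 2 * pos_precision * pos_recall / (pos_precision + pos_recall)

print(pos_precision, pos_recall, pos_f1)
```

These values agree with the output of `precision_recall_fscore_support` for the positive class shown above (0.90625, 0.14009662, 0.24267782).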

## What to submit

When you're done, submit a2.ipynb to the assignment 2 folder on D2L by the deadline.
