# Assignment 2: Sentiment Classification Using Logistic Regression

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Programming Assignment (100 Points scaled to 40)

For this assignment we will be implementing a naive bayes baseline classifier. Additionally, we will be using pytorch to implement a binary logistic regression classifier. Our task is sentiment classification for hotel reviews. The input to your model will be a text review, and the output label is a 1 or 0 marking it as positive or negative.

We have provided a util.py file for loading the data, and some of the basic modeling. Your task is to fill in the functions below in order to train as accurate a classifier as possible!

We suggest browsing the util.py script first. Additionally, make sure to install dependencies from the provided requirements.txt file in a similar fashion to the pytorch tutorial. With your environment activated int he terminal, run:
```
pip install -r requirements.txt
```

In [2]:
import os
os.chdir("/content/drive/MyDrive/Assignment2_Sentiment_Analysis")

In [3]:
!pip install -r requirements.txt

Collecting en_core_web_sm (from -r requirements.txt (line 3))
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0.tar.gz (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m98.1 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting spacytextblob (from -r requirements.txt (line 4))
  Downloading spacytextblob-4.0.0-py3-none-any.whl (4.5 kB)
Collecting sklearn (from -r requirements.txt (line 5))
  Downloading sklearn-0.0.post9.tar.gz (3.6 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting spacy<4.0.0,>=3.0.0 (from -r requirements.txt (line 2))
  Downloading spacy-3.4.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.5 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.5/6.5 MB[0m [31m43.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting wasabi<1.1.0,>=0.9.1 (from spacy<4.0.0,>=3.0.0->-r 

In [4]:
from typing import List
import spacy
import torch
import random

## Section 1: Sentiment Classification Dataset (Total: 20 Points)

The training data for this task consists of a collection of short hotel reviews. The data is formatted as one review per line. Each line starts with a unique identifier for the review (as in ID-2001) followed by tab and the text of the review.  The reviews are not tokenized or sentence segmented in any way (the words are space separated). The positive reviews and negative reviews appear in separate files namely [hotelPosT-train.txt](data/hotelPosT-train.txt) and [hotelNegT-train.txt](data/hotelNegT-train.txt).

In [5]:
from util import load_train_data
pos_datapath = "data/hotelPosT-train.txt"
neg_datapath = "data/hotelNegT-train.txt"
all_texts, all_labels = load_train_data(pos_datapath, neg_datapath)

### Lets look at what is in the data

In [6]:
def random_sample(texts, labels, label):
    data_by_label = {}
    for lab, text in zip(labels, texts):
        if lab not in data_by_label:
            data_by_label[lab] = []
        data_by_label[lab].append(text)
    return random.choice(data_by_label[label])

print("--- Positive Example ---")
print(random_sample(all_texts, all_labels, label=1))
print("\n--- Negative Example ---")
print(random_sample(all_texts, all_labels, label=0))

--- Positive Example ---
I had a wonderful, relaxed stay at the Huntley Hotel. I wasn't staying in the priciest room, but all the staff treated me like I had reserved the penthouse. The desk clerk even helped me locate my wallet after I left it in the hotel restaurant. In addition to beautiful rooms and a kind staff, the Huntley boasts a location barely two blocks from the Santa Monica beach. When I wasn't enjoying the amenities of the hotel, I was lounging in the sun.

--- Negative Example ---
This hotel was one of the worst I have ever stayed in! When we arrived, the sheets were stained and smelled unclean. The bathroom had not been touched. There was no toilet paper or complimentary shampoos. The television did not even work. When we called the front, it took two hours for someone to come up and turn our room over. I would not recommend this hotel to anyone!


### Test Data ( WAIT TILL DEADLINE)

This is the test dataset that you will need to use to report the results on. This set is the unseen dataset meaning, you are not in anyway supoose to look what is in this dataset. We will release this dataset on the last day of the assignment's deadline.

In [None]:
### RUN THIS ONLY ON DEADLINE ###
# Load the test data

from util import load_test_data

# FIXME
test_datapath = "data/test-dataset.txt"
test_texts, test_labels = load_train_data(test_datapath)

### Task 1.1: Print the number of "positive" and "negative" samples (5 Points)

It is important to know the distribution of the training examples. More often than not, you will have to work with datasets that are not "balanced" with respect to the labels of the samples. For this task, print out the number of examples that have label = 1 and label = 0, respectively, in std:out or plot a pie chart.

In [8]:
### ENTER CODE HERE ###

# Note since we have them in two seperate files,
# this can also be done with bash commands
def label_distribution(labels):
    posnum = 0
    negnum = 0
    for i in labels:
      if i == 1:
        posnum = posnum + 1
      elif i == 0:
        negnum = negnum + 1
    print("The number of positive labels is " + str(posnum) + " and the number"
      + " of negative labels is " + str(negnum))
label_distribution(all_labels)

The number of positive labels is 95 and the number of negative labels is 94


### Task 1.2: Split Training and Development Sets (5 Points)

For the purpose of coming with the best parameters for the model you will have to split the dataset into training and development sets. Make sure the splits follow the same distribution.

In [9]:
### ENTER CODE HERE ###

import random
import math


def split_dataset(texts, labels):
  pos_texts = texts[0:94]
  neg_texts = texts[95:188]
  train_texts = []
  train_labels = []
  dev_texts = []
  dev_labels = []

  random.shuffle(pos_texts)
  random.shuffle(neg_texts)
  poslen = math.ceil(0.8 * len(pos_texts))
  neglen = math.ceil(0.8 * len(neg_texts))
  for i in range(poslen):
    train_texts.append(pos_texts[i])
    train_labels.append(1)
    if i < (len(pos_texts) - poslen):
      dev_texts.append(pos_texts[i + poslen])
      dev_labels.append(1)
  for j in range(neglen):
    train_texts.append(neg_texts[j])
    train_labels.append(0)
    if j < (len(neg_texts) - neglen):
      dev_texts.append(neg_texts[j + neglen])
      dev_labels.append(0)

  return train_texts, train_labels, dev_texts, dev_labels


train_texts, train_labels, dev_texts, dev_labels = split_dataset(all_texts, all_labels)

print('Train Label Distribution:')
label_distribution(train_labels)

print('Dev Label Distribution:')
label_distribution(dev_labels)

Train Label Distribution:
The number of positive labels is 76 and the number of negative labels is 75
Dev Label Distribution:
The number of positive labels is 18 and the number of negative labels is 18


### Task 1.3: Evaluation Metrics (10 Points)

Implement the evaulation metrics: Accuracy, Precision, Recall and F1 score

In [36]:
### ENTER CODE HERE ###

def accuracy(predicted_labels, true_labels):
    correct = 0
    i = 0
    for i in range(len(predicted_labels)):
      if predicted_labels[i] == true_labels[i]:
        correct = correct + 1
    acc = correct / len(predicted_labels)
    return acc

def precision(predicted_labels, true_labels):
    pallpos = 0
    truepos = 0
    for i in range(len(predicted_labels)-1):
      if predicted_labels[i] == 1:
        pallpos = pallpos + 1
      if predicted_labels[i] == 1 and predicted_labels[i] == true_labels[i]:
        truepos = truepos + 1
    prec = truepos / pallpos
    return prec

def recall(predicted_labels, true_labels):
    rallpos = 0
    truepos = 0
    for i in range(len(predicted_labels)):
      if true_labels[i] == 1:
        rallpos = rallpos + 1
      if predicted_labels[i] == 1 and predicted_labels[i] == true_labels[i]:
        truepos = truepos + 1
    rec = truepos / rallpos
    return rec


def f1_score(predicted_labels, true_labels):
    pallpos = 0
    truepos = 0
    rallpos = 0

    for i in range(len(predicted_labels)-1):
      if predicted_labels[i] == 1:
        pallpos = pallpos + 1
      if predicted_labels[i] == 1 and predicted_labels[i] == true_labels[i]:
        truepos = truepos + 1
    for i in range(len(predicted_labels)):
      if true_labels[i] == 1:
        rallpos = rallpos + 1
    prec = truepos / pallpos
    rec = truepos / rallpos
    hmean = (2 * prec * rec) / (prec + rec)

In [37]:
### DO NOT EDIT ###

em_test_labels = [0]*6 + [1]*4
em_test_predictions = [0]*8 + [1]*2

em_test_accuracy = 0.8
em_test_precision = 1.0
em_test_recall = 0.5
em_test_f1 = 2/3

assert accuracy(em_test_predictions, em_test_labels) == em_test_accuracy
assert precision(em_test_predictions, em_test_labels) == em_test_precision
assert recall(em_test_predictions, em_test_labels) == em_test_recall
assert f1_score(em_test_predictions, em_test_labels) == em_test_f1

print('All Test Cases Passed!')

AssertionError: ignored

## Section 2: Baselines (Total: 20 Points)

It is important to come up with baselines for the classifications to compare the more complicated models with. The baselines are also useful as a debugging method for your actual classfication model. You will create two baselines:

1. Random Chance
2. Naive Bayes Classifier

### Task 2.1: Random Chance Classifier (5 Points)

A random chance classifier predicts the label according to the label's distribution. As an example, if the label 1 appears 70% of the times in the training set, you predict 70 out of 100 times the label 1 and label 0 30% of the times

In [12]:
### ENTER CODE HERE ###

def predict_random(train_labels, num_samples):
    pos = 0
    for i in train_labels:
      if i == 1:
        pos = pos + 1
    distr = pos / len(train_labels)
    distrsize = math.ceil(distr * num_samples)

    pred = [1]*(distrsize) + [0]*(num_samples - distrsize)

    return pred

### Task 2.2: Naive Bayes Classifier (Total: 10 Points)

In the class, Jim went over how to implement a Naive Bayes Classifier using the tokens in the training samples.
In this task, you will do the same. As a preprocessing step, you might want to remove the stop words and lemmatize/stem the words of the texts.

### Spacy Model https://spacy.io

To tokenize the text and help extract features from text, we will use the popular spaCy model

In [13]:
### DO NOT EDIT ###

# Initialize the spacy model
nlp = spacy.load('en_core_web_sm')

### Task 2.2.1: Play around with spacy (0 Points)

In [14]:
### ENTER CODE HERE ###

test_string = "This is an amazing sentence"

# parse the string with spacy model
test_doc = nlp(test_string)

print('Token', 'Lemma', 'Is_Stopword?')
for token in test_doc:
    print(token, token.lemma_, token.is_stop)

Token Lemma Is_Stopword?
This this True
is be True
an an True
amazing amazing False
sentence sentence False


### Task 2.2.2: Preprocessing (5 Points)

Remove stopwords and lemmatize the words of a text

In [15]:
### ENTER CODE HERE ###

def pre_process(text: str) -> List[str]:
  lemmas = []
  nlp_text = nlp(text)
  for token in nlp_text:
    if not token.is_stop:
      lemmas.append(token.lemma_)
  return lemmas

test_string = "This sentence needs to be lemmatized"

assert len({'sentence', 'need', 'lemmatize', 'lemmatiz'}.intersection(pre_process(test_string))) >= 3

print('All Test Cases Passed!')

All Test Cases Passed!


### Task 2.2.3: The Naive Bayes Class (5 Points)

The standard way of implementing classifiers like Naive Bayes is to implement the two methods: "fit" and "predict". The fit method expects the training data along with labels, and the predict method predicts the labels for the provides texts of samples.

In [39]:
### ENTER CODE HERE ###

class NaiveBayesClassifier:
    def __init__(self, num_classes):
        self.num_classes = num_classes
        self.cls = []
        self.prior_cls = {}
        self.prob_cls = {}

    def fit(self, texts, labels):
      #pre-process texts
        pr_texts = []
        for i in range(len(texts)):
          pr_texts.append(pre_process(texts[i]))
      #give names to the classes of labels
        dcls = {} #number of documents per class/label
        for l in labels:
          if l not in self.cls:
            self.cls.append(l)
      #calculate number of documents per class and sort texts
        texts_per_class = {}
        for cl in self.cls:
          doc_in_class_count = 0
          for j in range(len(labels)):
            if labels[j] == cl:
              doc_in_class_count = doc_in_class_count + 1
          dcls[cl] = doc_in_class_count
          texts_per_class[cl] = []
       #sort texts into classes
        for t in range(len(texts)):
          texts_per_class[labels[t]].append(pr_texts[t])
      #calculate vocab, priors and  for all classes
        for cl in self.cls:
          self.prior_cls[cl] = dcls[cl] / len(texts)
        for c in texts_per_class:
          class_vocab = {}
          vocab_probability = {}
          for tpc in texts_per_class[c]:
            for w in tpc:
              if w not in class_vocab:
                class_vocab.update({w : 1})
              else:
                class_vocab[w] = class_vocab[w] + 1
          for v in class_vocab:
            vocab_probability[v] = class_vocab[v] / len(class_vocab)
          self.prob_cls[c] = vocab_probability

    def predict(self, texts):
        predicted_classes = []
      #pre-process texts
        pr_texts = []
        for i in range(len(texts)):
          pr_texts.append(pre_process(texts[i]))
        for j in range(len(pr_texts)):
          possibilities = []
          for c in self.cls:
            c_pr = self.prior_cls[c]
            for w in pr_texts[j]:
              if w in self.prob_cls[c]:
                c_pr = c_pr + self.prob_cls[c[w]]
              possibilities.append(c_pr)
            sorted_possibilities = possibilities.sort()
            for p in range(len(possibilities)):
              if possibilities[p] == sorted_possibilities[0]:
                predicted_classes.append(c)
        return predicted_classes

### Task 2.3: Baseline Results  (5 Points)

Since there is not hyperparameter-tuing required for the baselines, we can use the entirety of the training set (no need to split the dataset into train and development). Report the results you achieve with the two baselines by running the following cell:

In [40]:
### DO NOT EDIT ###

### DEV SET RESULTS

testset_prediction_random = predict_random(train_labels, num_samples=len(dev_labels))
print('Random Chance F1:', f1_score(testset_prediction_random, dev_labels))

naive_bayes_classifier = NaiveBayesClassifier(num_classes=2)
naive_bayes_classifier.fit(train_texts, train_labels)
testset_predictions_nb = naive_bayes_classifier.predict(dev_texts)
print('Naive Bayes F1:', f1_score(testset_predictions_nb, dev_labels))

Random Chance F1: None


TypeError: ignored

In [None]:
### DO NOT EDIT ###
### RUN THIS ONLY ON DEADLINE ###
### TEST SET RESULTS

testset_prediction_random = predict_random(all_labels, num_samples=len(test_labels))
print('Random Chance F1:', f1_score(testset_prediction_random, test_labels))

naive_bayes_classifier = NaiveBayesClassifier(num_classes=2)
naive_bayes_classifier.fit(all_texts, all_labels)
testset_predictions_nb = naive_bayes_classifier.predict(test_texts)
print('Naive Bayes F1:', f1_score(testset_predictions_nb, test_labels))

## Section 3: Logistic Regression on Features (Total: 60 Points)

Now let's try building a logistic regression based classifier on hand-engineered features.

The following tasks are going to be the implementation of the components required in building a Logistic Regressor.

### Task 3.0: Feature Extraction (20 points)

This is perhaps the most challenging part of this assignment. In the class, we went over how to featurize text for a classification system for sentiment analysis. In this assignment, you should implement and build upon this to accuractely classify the hotel reviews.

This task requires a thorough understanding of the dataset to answer the important question, "What is in the data?". Please go through some of the datapoints and convert the signals that you think might help in identifying "sentiment" as features.

Please refer to the section in Jim's book that illustrates the process of feature engineering for this task. We have attached an image of the table below:

![image.png](attachment:image.png)

Please use the files with postive and negative words attached in the assignment: [positive_words.txt](data/poisitive-words.txt) and  [negative_words.txt](data/negative-words.txt)

In [None]:
def make_test_feature(text: spacy.tokens.doc.Doc):
    return "happy" in [t.lemma_ for t in text]


def extract_features(text: spacy.tokens.doc.Doc):
    features = []
    # TODO: Replace this with your own feature extraction functions.
    features.append(make_test_feature(text))
    # TODO: add more features to the feature vector

    return features

In [None]:
### ENTER CODE HERE ###
### DO NOT CHANGE THE SIGNATURE OF THE function THOUGH ###

def featurize_data(texts, labels):
    features = [
        extract_features(doc) for doc in nlp.pipe(texts)
    ]
    return torch.FloatTensor(features), torch.FloatTensor(labels)

### Task 3.0.2: Feature Scaling (10 Points)

In this task we will use the data normalization technique to ensure the scales of the feature are consistent.
After featurizing the dataset, we need to call the following function before passing it to the classifier

#### Normalization Formula

![image.png](attachment:image.png)

In [None]:
### ENTER CODE HERE ###

def normalize(features: torch.Tensor) -> torch.Tensor:
    """
    return the features transformed by the above formula of normalization
    """
    raise NotImplementedError

## Training a Logistic Regression Classifier (Total: 30 Points)

In this section, you will implement the components needed to train the binary classifier using logistic regression

### Here we define our pytorch logistic regression classifier (DO NOT EDIT THIS)

In [None]:
class SentimentClassifier(torch.nn.Module):
    def __init__(self, input_dim: int):
        super().__init__()
        # We force output to be one, since we are doing binary logistic regression
        self.output_size = 1
        self.coefficients = torch.nn.Linear(input_dim, self.output_size)
        # Initialize weights. Note that this is not strictly necessary,
        # but you should test different initializations per lecture
        initialize_weights(self.coefficients)

    def forward(self, features: torch.Tensor):
        # We predict a number by multipling by the coefficients
        # and then take the sigmoid to turn the score as logits
        return torch.sigmoid(self.coefficients(features))

### Task 3.1: Initialize the weights. (5 Points)

Initialization of the parameters is an important step to ensure the SGD algorithm converges to a global optimum. Typically, we need to try different initialization methods and compare the accuracy we achieve for the development set. In this task, implement the function that initializes the parameters to ...

In [None]:
### ENTER CODE HERE ###

def initialize_weights(coefficients):
    """
    TODO: Replace the line `raise NotImplementedError` with your code.
    Initialize the weights of the coefficients by assigning the parameter
    coefficients.weights.data = ...
    """
    raise NotImplementedError

Let's build a training function similar to the linear regressor from the tutorial

### Task 3.2: Logistic Loss Function (10 Points)

In [None]:
### ENTER CODE HERE ###

def logistic_loss(prediction: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    """
    TODO: Implement the logistic loss function between a prediction and label.
    """
    raise NotImplementedError

### Task 3.3: Create an SGD optimizer (0 Points)

We have already provided the implementation of how to create the SGD optimizer

You may try different optimizers refering to the docs provided

In [None]:
### ENTER CODE HERE ###

def make_optimizer(model, learning_rate) -> torch.optim:
    """
    Returns an Stocastic Gradient Descent Optimizer
    See here for algorithms you can import: https://pytorch.org/docs/stable/optim.html
    """
    return torch.optim.SGD(model.parameters(), learning_rate)

### Task 3.5: Converting Logits into Predictions (5 Points)

In [None]:
### ENTER CODE HERE ###

def predict(model, features):
    with torch.no_grad():
        """
        TODO: Replace the line `raise NotImplementedError`
        with the logic of converting the logits into prediction labels (0, 1)
        """
        logits = model(features)
        raise NotImplementedError

### Training Function (DO NOT EDIT THIS)

In [None]:
### DO NOT EDIT ###

from tqdm.autonotebook import tqdm
import random


def training_loop(
    num_epochs,
    batch_size,
    train_features,
    train_labels,
    dev_features,
    dev_labels,
    optimizer,
    model
):
    samples = list(zip(train_features, train_labels))
    random.shuffle(samples)
    batches = []
    for i in range(0, len(samples), batch_size):
        batches.append(samples[i:i+batch_size])
    print("Training...")
    for i in range(num_epochs):
        losses = []
        for batch in tqdm(batches):
            # Empty the dynamic computation graph
            features, labels = zip(*batch)
            features = torch.stack(features)
            labels = torch.stack(labels)
            optimizer.zero_grad()
            # Run the model
            logits = model(features)
            # Compute loss
            loss = logistic_loss(torch.squeeze(logits), labels)
            # In this logistic regression example,
            # this entails computing a single gradient
            loss.backward()
            # Backpropogate the loss through our model

            # Update our coefficients in the direction of the gradient.
            optimizer.step()
             # For logging
            losses.append(loss.item())

        # Estimate the f1 score for the development set
        dev_f1 = f1_score(predict(model, dev_features), dev_labels)
        print(f"epoch {i}, loss: {sum(losses)/len(losses)}")
        print(f"Dev F1 {dev_f1}")

    # Return the trained model
    return model

### Task 3.6: Train the classifier (10 Points)

Run the following cell to train a logistic regressor on your hand-engineered features.

In [None]:
### DO NOT EDIT ###

num_epochs = 100

train_features, train_labels_tensor = featurize_data(train_texts, train_labels)
train_features = normalize(train_features)
dev_features, dev_labels_tensor = featurize_data(dev_texts, dev_labels)
dev_features = normalize(dev_features)
model = SentimentClassifier(train_features.shape[1])
optimizer = make_optimizer(model, learning_rate=0.01)

trained_model = training_loop(
    num_epochs,
    16,
    train_features,
    train_labels_tensor,
    dev_features,
    dev_labels_tensor,
    optimizer,
    model
)

### Task 3.7: Get the predictions on the Test Set using the Trained model and print the F1 score (10 Points)

In [None]:
### DO NOT EDIT ###

### DEV SET RESULTS

test_features, test_labels = featurize_data(dev_texts, dev_labels)
print('Logistic Regression Results:')
print('Accuracy:', accuracy(predict(trained_model, test_features), test_labels))
print('F1-score', f1_score(predict(trained_model, test_features), test_labels))

In [None]:
### DO NOT EDIT ###
### RUN THIS ONLY ON DEADLINE ###
### TEST SET RESULTS

test_features, test_labels = featurize_data(test_texts, test_labels)
print('Logistic Regression Results:')
print('Accuracy:', accuracy(predict(trained_model, test_features), test_labels))
print('F1-score', f1_score(predict(trained_model, test_features), test_labels))

## Written Assignment (60 Points)

Written assignment tests the understanding of the student for the assignment's task. We have split the writing into sections. You will need to write 1-2 paragraphs describing the sections. Please be concise.

### In your own words, describe what the task is (20 points)

Describe the task, how is it useful and an example.

### Describe your method for the task (10 points)

Important details about the implementation. Feature engineering, parameter choice etc.

### Experiment Results (10 points)

Typically a table summarizing all the different experiment results for various parameter choices

### Discussion (20 points)

Key takeaway from the assignment. Why is the method good? shortcomings? how would you improve? Additional thoughts?