# Homework 1: Logistic Regression Classification

In this assignment, you will train a binary text classification model, specifically a logistic regression model. The binary text classification task we will work with is "Text Entailment", a classical NLP task.
You will write the code to process the data, extract features, and train a logistic regression model. You will also conduct some feature engineering of your own to try to improve upon the naive approach.


__Deadlines__: This assignment is due on <font color="red">(Feb 21, 11.59 p.m.)</font>. This notebook will walk you through the assignment step-by-step, including how to make the final submission.

__Tl;dr of structure of the assignment__: In Section 1 and Section 2, you will learn to manipulate the input data and extract simple n-gram features. In Section 3, you will implement the logistic regression training algorithm. In Section 4, you will conduct your own additional feature engineering to improve the performance of the logistic regression model.


__Policies.__ All the policies described on the course website are applicable as is (including the policy on academic integrity and the use of generative AI tools). For more information, see: https://www.cs.cornell.edu/courses/cs4740/2026sp.

<br>


### Notes:
  
- You will **NOT** be submitting this .ipynb file. Please refer to the submission instructions in both the hw1 pdf shared with you and at the end of this notebook.
- This notebook includes some written questions, such as creating graphs, computing data statistics, etc. You will include the answers to these written questions in the same pdf document where you attempt Section A of the homework.
- Do **NOT** add, remove, or modify any imports across python source files. If you have any concerns regarding missing imports, please let course staff know through Ed before attempting to change anything.
- Do **NOT** change any of the function headers and/or specs! The input(s) and output must perfectly match the specs, or else your implementation for any function with changed specs will most likely fail!
- If you decide to create local helper functions, your code must have docstrings/comments documenting the meaning of parameters and important parameter-like variables.
- We are recommending python version 3.9+. This is due to compatibility issues with the external dependencies.


<br><br>


## Part 0: Imports and Installs

Run the following code blocks to connect to the right Google Drive folder and install the external libraries and packages needed to run the HW1 assignment.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# set to location where you uploaded directory
%cd "/content/drive/MyDrive/CS4740-sp26/hw1-release/"
#%pip install --no-cache-dir -r requirements.txt

import sys
from importlib import reload

# Create a fake 'imp' module with just the reload function
class ImpModule:
    reload = staticmethod(reload)

sys.modules['imp'] = ImpModule()


import IPython

ipython = IPython.get_ipython()
ipython.run_line_magic("sx", f"chmod +x scripts/*.py")

%load_ext autoreload
%autoreload 2

In [None]:
### IMPORTS -- DO NOT MODIFY ###
import json
import numpy as np
import matplotlib.pyplot as plt
import os
import random
from data_exploration import unzip_data, read_csv

## Part 1: Data Exploration

**Natural Language Inference**, also known as **Recognizing Textual Entailment**, is a classical NLP task whose goal is to determine the inference relationship between two pieces of text (see "Natural logic and natural language inference" by MacCartney and Manning 2008 to read more).

Specifically, the goal is to determine whether a natural language hypothesis (H) can be reasonably inferred from a given premise (P). The original task classifies each (P, H) pair into one of three classes: ***{entailment, contradiction, neutral}***.

Consider the following example:
- P: "Children are smiling and waving at camera"
- H1: "The kids are frowning"
- H2: "There are children present"
- H3: "The kids are eating"

In this example, the premise P contradicts the hypothesis H1. Hence, (P, H1) has the label ***contradiction***. The premise P supports or entails H2, hence (P, H2) has the label ***entailment***. Finally, the hypothesis H3 is neither supported nor directly contradicted by the premise, and therefore has the label ***neutral***.


### Our task in HW1: Binary Text Classification
In this assignment, we will only consider the <font color="red">**binary text classification task**</font>, and therefore our label set will be ***{0: contradiction, 1: entailment}***. We will work with the [Stanford Natural Language Inference (SNLI) dataset](https://nlp.stanford.edu/projects/snli/). This data is stored in a zip file. You can use the following provided functions to load and preprocess the data. Under the hood, they unzip the data and read each of the provided CSV data files into Python dictionaries, then format the data so that our downstream code can ingest it.

In [None]:
data_zip_path = "dataset.zip"
dest_path = "dataset"

unzip_data(data_zip_path, dest_path) # unzips the data into current directory
training_data = read_csv(os.path.join(dest_path, "train.csv"))
validation_data = read_csv(os.path.join(dest_path, "validation.csv"))

### Looking at the data
Since your data files can be large and unwieldy, you can explore the data by writing code. Check out the data format by looking at the keys and some of the values in the data. You can use the following code to get started:

In [None]:
print(training_data.keys())
print(validation_data.keys())

To get a sense of what your data looks like, look at some samples. Run the following code block multiple times to see many random samples. Note that our dataset only contains binary labels, where `0` indicates that the premise contradicts the hypothesis (i.e. not entailment), and `1` means that the premise entails the hypothesis.

In [None]:
random_index = random.randint(0, len(training_data['premise']) - 1)  # randint includes both endpoints
premise = training_data['premise'][random_index]
hyp = training_data['hypothesis'][random_index]
label = training_data['label'][random_index]
print(f"premise: {premise}")
print(f"hypothesis: {hyp}")
print(f"label: {label}")

We have already included code in `data.py` to tokenize both the premise and hypothesis. We will store each datapoint as an object of the class `Example` defined in `data.py`. Each `Example` object stores a list of tokens from the premise, list of tokens from the hypothesis, and label for the data point. Run the following code block to create this list of Example objects for both the training and validation data.


In [None]:
from data import read_examples

print(f"\n Cleaning up and tokenizing train data...")
train_exs = read_examples(training_data)
val_exs = read_examples(validation_data)


print(train_exs[random_index])
print(len(train_exs))
print(len(val_exs))

### Data Statistics
To get a better feel for the data, let's visualize some of its characteristics.

#### <font color="brown"> Q1.1 Plot histograms of the number of tokens in the premise and hypothesis inputs separately. Also, generate the same plots for datapoints that have the **entailment** label and the **contradiction** label separately. Do you notice anything interesting about this data? </font>

#### <font color="brown"> Q1.2: Find the five most frequent unigrams (single tokens) in the training data, and list them out. What are the five most frequent bigrams (two token subsequences)? What do you notice? </font>

As an example, here are the most frequent unigrams and bigrams for the following toy dataset of three sentences.

* ["i", "love", "my", "cat"]
* ["i", "love", "dogs"]
* ["my", "cat", "is", "cute"]

Unigram counts:
* "i": 2
* "cat": 2
* "love": 2
* "my": 2
* "dogs": 1
* "is": 1
* "cute": 1

Bigram counts:
* "i love": 2
* "my cat": 2
* all other bigrams have frequency one ("love my", "love dogs", "cat is", "is cute")

4-gram counts:
* "i love my cat": 1
* "my cat is cute": 1
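The counts above can be reproduced with a few lines of Python. This is only a sketch on the toy dataset: `count_ngrams` is a hypothetical helper name, not part of the provided starter code, and joining tokens with a space is just one possible way to represent an n-gram.

```python
from collections import Counter


def count_ngrams(sentences, n):
    """Count every contiguous n-token subsequence across tokenized sentences.

    `count_ngrams` is a hypothetical helper for illustration only.
    """
    counts = Counter()
    for tokens in sentences:
        # A sentence with fewer than n tokens contributes no n-grams.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


toy_sentences = [
    ["i", "love", "my", "cat"],
    ["i", "love", "dogs"],
    ["my", "cat", "is", "cute"],
]

unigram_counts = count_ngrams(toy_sentences, 1)
bigram_counts = count_ngrams(toy_sentences, 2)
fourgram_counts = count_ngrams(toy_sentences, 4)

print(unigram_counts)
print(bigram_counts)
print(fourgram_counts)
```

Note that the sentence `["i", "love", "dogs"]` has only three tokens, so it contributes nothing to the 4-gram counts.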

## Part 2: Feature Extraction
In order to train our logistic regression model, we need to turn our natural language data into a feature vector. In this assignment, we will explore various strategies for constructing our feature vectors, but start with a simple n-gram based approach.


The first step in feature engineering is tokenizing the text input. We have already implemented this for you in Part 1. In this section, you will be responsible for implementing feature extractors that convert a tokenized list of words into a sparse feature representation using Python dictionaries and lists.

A common and effective way to build feature vectors for models such as logistic regression is to use n-gram features. An n-gram is a contiguous sequence of n tokens from a sentence.

* Unigrams are single words (e.g., "dog", "running")
* Bigrams are pairs of consecutive words (e.g., "the dog", "is running")

**Important**: Remember that our `Example` object has two input texts, stored as a list of tokens, namely the premise and the hypothesis. We need to provide both these as input to the logistic regression model. We define a function `get_combined_words` to combine these into a single list. The function uses the `[SEP]` token to distinguish the two. Remember to perform all feature engineering on this combined list of words.
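To make the idea of a combined token list concrete, here is a minimal sketch. It does not reproduce the provided `get_combined_words`; `combine_words` and `bigrams` are hypothetical helpers, and placing the premise before the `[SEP]` token is an assumption made for illustration.

```python
SEP_TOKEN = "[SEP]"


def combine_words(premise_tokens, hypothesis_tokens):
    """Hypothetical stand-in: join two token lists around a separator token.

    Assumes the premise comes first; check `data.py` for the real behavior.
    """
    return premise_tokens + [SEP_TOKEN] + hypothesis_tokens


def bigrams(tokens):
    """Return the consecutive token pairs of a list, joined by a space."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


combined = combine_words(
    ["children", "are", "smiling"],
    ["the", "kids", "are", "frowning"],
)
print(combined)
print(bigrams(combined))
```

Because features are computed on the combined list, n-grams that span the separator (such as `smiling [SEP]` in this sketch) are candidate features too.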

### Part 2.1 Unigram and Bigram Feature Extractors
> <font color="orange">File to be modified: `features.py`.</font>


In this part, you will complete the ``UnigramFeatureExtractor`` and the ``BigramFeatureExtractor`` classes in `features.py`. For both, you will implement the ``build_vocabulary(examples)`` and ``extract_features(sentence)`` methods.

The complete pipeline for feature extraction during training is the following:

a. Initialize an object of the Unigram/Bigram FeatureExtractor class. This will specify the maximum number of features to store via the `max_num_features` attribute.

b. For the given training data, construct the vocabulary, i.e. the feature set. For both the unigram and bigram feature extractors, you must keep only the `max_num_features` most frequent n-grams (unigrams or bigrams, respectively) in the vocabulary and store these in the ``self.vocabulary`` attribute. You will implement this in ``build_vocabulary(examples)``.

c. Once you have created the vocabulary, at training or test time, you will need to create the feature vector for any given input text. You will implement this logic in ``extract_features(sentence)``. It takes a tokenized text as input and returns a feature dictionary. When constructing this dictionary, be sure that all of its keys are present in your vocabulary.
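The vocabulary-then-features pipeline can be sketched on toy data as follows. This is not the required implementation: the function names are hypothetical, the vocabulary is stored as a plain set, and using raw counts as feature values is an assumption (the class specs in `features.py` are the authority on the expected format).

```python
from collections import Counter


def build_toy_vocabulary(token_lists, max_num_features):
    """Keep only the `max_num_features` most frequent tokens (hypothetical helper)."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {tok for tok, _ in counts.most_common(max_num_features)}


def extract_toy_features(tokens, vocabulary):
    """Map each in-vocabulary token to its count in `tokens` (assumed feature value)."""
    return dict(Counter(tok for tok in tokens if tok in vocabulary))


toy_token_lists = [
    ["i", "love", "my", "cat"],
    ["i", "love", "dogs"],
    ["my", "cat", "is", "cute"],
]

toy_vocabulary = build_toy_vocabulary(toy_token_lists, max_num_features=4)
toy_features = extract_toy_features(["my", "cat", "is", "cute"], toy_vocabulary)

print(sorted(toy_vocabulary))
print(toy_features)
```

Out-of-vocabulary tokens ("is" and "cute" here) are simply dropped, so every key of the feature dictionary is guaranteed to be in the vocabulary.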


### Part 2.2 Let's create the feature dictionaries!
Now let's run your unigram and bigram feature extractors on our training dataset. Be sure to manually examine the data ``train_exs`` and your feature vectors to see if everything makes sense.

In [None]:
from features import (
    UnigramFeatureExtractor,
    BigramFeatureExtractor,
)

In [None]:
unigram_extractor = UnigramFeatureExtractor(max_num_features=2000)
bigram_extractor = BigramFeatureExtractor(max_num_features=2000)

# build vocab from our training set
print(f"\nBuilding vocabulary...")
unigram_extractor.build_vocabulary(train_exs)
bigram_extractor.build_vocabulary(train_exs)

If you want to test your implementation of the above functions, you should write your own small training datasets and check that the constructed vocabulary and feature dictionaries for a given input make sense.

Once you are convinced your implementation is correct, run the code below to see the feature dictionaries for the following examples.

In [None]:
# extract features for this example sentence
from data import tokenize_and_clean
tokenized = train_exs[1004].get_combined_words()
print(' '.join(tokenized))
uni_features = unigram_extractor.extract_features(tokenized)
print(uni_features)

In [None]:
# extract features for this example sentence
bigram_features = bigram_extractor.extract_features(tokenized)
print(bigram_features)

 #### <font color="brown"> Q2.1.  Run the unigram and bigram feature extractor on 5 unique examples. Look at the feature vectors they output. Do you expect these to be informative features? Comment on 2 characteristics of these features that you find interesting or surprising. </font>

## Part 3: Logistic Regression
> <font color="orange">File to be modified: `models.py`.</font>

We will be using logistic regression, a linear model for binary classification. It computes a weighted sum of input features and applies a sigmoid function to produce a probability between 0 and 1. Our goal is to learn model parameters (weights and bias) to minimize cross-entropy loss for our data.

Let $f(x)$ be the features of an input $x$ ($x$ is the concatenation of premise and hypothesis for our dataset). Let $f_k$ be the $k^{th}$ feature and $w_k$ be its corresponding weight. We define $z = b + \sum_k w_k f_k$.

According to the logistic regression model, $P(y = 1 \;|\; f(x) \,) = \frac{e^z}{1 + e^z} $. Refer to class notes for more details.
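As a rough illustration of these two formulas, the sketch below computes $z$ and $P(y = 1 \;|\; f(x))$ for a sparse feature dictionary. The names `stable_sigmoid` and `predict_proba`, and the toy weights, are hypothetical; in the assignment you would use the `sigmoid` provided in ``models.py`` rather than this stand-in.

```python
import math


def stable_sigmoid(z):
    """Compute e^z / (1 + e^z) without overflowing for large |z|.

    Hypothetical stand-in for the provided `sigmoid` in models.py.
    """
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(features, weights, bias):
    """Return P(y = 1 | f(x)) where z = b + sum_k w_k * f_k."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return stable_sigmoid(z)


# Toy weights chosen by hand, purely for illustration.
toy_weights = {"children": 0.8, "frowning": -1.5}
toy_features = {"children": 1, "frowning": 1}

p = predict_proba(toy_features, toy_weights, bias=0.2)
print(f"P(y=1 | f(x)) = {p:.4f}")
```

Features missing from the weight dictionary contribute zero to $z$, which is why `weights.get(name, 0.0)` is used in the sum.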

We have defined the ``LogisticRegressionClassifier`` class. We initialize the weights $w_k$ and the bias $b$ for our logistic regression classifier to 0.

In [None]:
from models import LogisticRegressionClassifier, print_evaluation_metrics


## Initialize the classifier with the UnigramFeatureExtractor
classifier = LogisticRegressionClassifier(unigram_extractor)


### Part 3.1: Implement the predict function

First, you should implement the ``predict`` function. It takes as input a list of tokens to be classified and outputs the class label (int), i.e. either 0 or 1. Refer to the formula above to compute $P(y=1 | f(x))$. You should return the label (0 or 1) with the higher probability.

We suggest using the `sigmoid` function we implemented for you in ``models.py``, which takes $z = b + \sum_k w_kf_k$ as input. This implementation is numerically stable.

 #### <font color="brown"> Q3.1.  Remember that our model is untrained and that all weights and the bias were initialized to 0. What should be the predicted label for all datapoints? </font>


### Part 3.2: Implement Binary Cross Entropy Loss
In order to learn the weights for the logistic regression model (or any ML model for that matter), we need to have a loss function. The loss function measures how well the model’s predictions match the true labels in the training data. Intuitively, we want to minimize the loss function in order to get a high-performing model. You will be taking the gradient of the loss function with respect to the model parameters and using it to update your weights and bias.

Let $y$ be the true label for an input (for our case, this will be either 0 or 1) and $\widehat{y}$ be the probability that the predicted label is 1, i.e. $P(y=1 | x)$.

Implement the function ``cross_entropy_loss`` in ``models.py`` using the following binary cross entropy formula:

$L(y, \widehat{y}) = -\Big[y \log(\widehat{y} + \varepsilon) + (1 - y)\log(1 - \widehat{y} + \varepsilon)\Big]$
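The formula translates almost directly into code. The sketch below is a standalone illustration, not the required ``cross_entropy_loss``: the name `toy_cross_entropy` and the default value `eps=1e-12` are assumptions, so follow the spec in ``models.py`` for the actual signature and constant.

```python
import math


def toy_cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross entropy for one example.

    y     : true label, 0 or 1
    y_hat : predicted probability that the label is 1
    eps   : small constant from the formula (value here is an assumption)
    """
    return -(y * math.log(y_hat + eps) + (1 - y) * math.log(1 - y_hat + eps))


# A confident correct prediction is penalized less than a confident wrong one.
print(toy_cross_entropy(1, 0.9))
print(toy_cross_entropy(1, 0.1))
```

Only one of the two terms is active for any example: the first when $y = 1$ and the second when $y = 0$.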

#### <font color="brown"> Q3.2: What numerical issue does the small constant $\varepsilon$ prevent in the cross-entropy loss calculation, and why is it important? </font>


### Part 3.3: Training Loop
In this part, you will write the training loop to train logistic regression using **stochastic gradient descent (SGD)**, updating the model after each example.

First, let's recap what we've done so far:

1. Preprocess our data (reading CSV files, tokenization).
2. Extract features (you implemented a Unigram and Bigram feature extractor in Part 2).
3. Train a logistic regression classifier (now).

The `LogisticRegressionClassifier` stores the weights of the model as a dictionary `weights: dict[str, float]`, where each feature name maps to its weight. The bias is stored as a separate scalar float `bias`.

You will implement the function ``train`` for the `LogisticRegressionClassifier`. It takes the entire training data (list of Example) as input, along with the learning rate and number of epochs. The rough pseudocode for the training is:

For each epoch:
- Shuffle the training data and for each datapoint:  
  1. Compute prediction `ŷ = sigmoid(w·x + b)`  
  2. Compute loss (cross-entropy)  
  3. Compute the gradients of the weights and bias.
  4. Update the weight and bias using SGD.
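The pseudocode above can be sketched as a standalone function on toy feature dictionaries. This is an illustration under stated assumptions, not the required ``train`` method: `sgd_train` and the toy data are hypothetical, and the gradient used is the standard result for sigmoid plus cross entropy (ignoring $\varepsilon$), namely $\partial L / \partial w_k = (\widehat{y} - y) f_k$ and $\partial L / \partial b = \widehat{y} - y$. Check these against your class notes.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function (stand-in for the provided one)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def sgd_train(data, learning_rate, epochs, seed=0):
    """Train on a list of (feature_dict, label) pairs; hypothetical helper."""
    rng = random.Random(seed)
    data = list(data)  # copy so shuffling does not reorder the caller's list
    weights, bias, avg_losses = {}, 0.0, []
    for _ in range(epochs):
        rng.shuffle(data)
        total_loss = 0.0
        for features, y in data:
            # 1. Prediction
            z = bias + sum(weights.get(k, 0.0) * v for k, v in features.items())
            y_hat = sigmoid(z)
            # 2. Loss (epsilon value is an assumption)
            total_loss -= y * math.log(y_hat + 1e-12) + (1 - y) * math.log(1 - y_hat + 1e-12)
            # 3. Gradients: (y_hat - y) * f_k for each weight, (y_hat - y) for the bias
            error = y_hat - y
            # 4. SGD update, touching only the features present in this example
            for k, v in features.items():
                weights[k] = weights.get(k, 0.0) - learning_rate * error * v
            bias -= learning_rate * error
        avg_losses.append(total_loss / len(data))
    return weights, bias, avg_losses


toy_data = [
    ({"smiling": 1, "frowning": 1}, 0),
    ({"children": 1, "kids": 1}, 1),
    ({"frowning": 1}, 0),
    ({"children": 1}, 1),
]
weights, bias, avg_losses = sgd_train(toy_data, learning_rate=0.5, epochs=20)
print(avg_losses[0], avg_losses[-1])
```

Because the feature dictionaries are sparse, each update only touches the weights of features that actually occur in the current example, which keeps a step cheap even with a large vocabulary.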

__Hint__: You should print metrics like loss and train accuracy to measure how these change with each training epoch. You should also observe how validation accuracy changes during training. These signals will tell you whether your model is training properly.

In [None]:
# Hyperparameters (TODO play around with these)
epochs = 10
learning_rate = 0.01

## train the classifier initialized above
classifier.train(train_exs, learning_rate, epochs)

### Part 3.4: Evaluation
Now that you trained your model, let's see how well it does on the validation data. <font color="red"> With the following hyperparameters (max_num_features = 2000, learning_rate = 0.01, num_epochs = 10), you should see a validation accuracy of ~64% or higher using both the unigram and bigram feature extractor. </font>

If your accuracy is lower, there is likely an error in one of your function implementations.

In [None]:
print_evaluation_metrics(classifier, val_exs, "Validation")

#### <font color="brown"> Q3.3: What accuracy were you able to get with the unigram and bigram feature extractors? Plot a graph showing the average train loss per example across epochs during training. Plot a graph of the train and validation accuracies during training. </font>

#### <font color="brown"> Q3.4: Here you will conduct a hyperparameter search to identify the best hyperparameters for your logistic regression model. Choose one of the feature extractors (unigram or bigram). Report accuracy results with 5 different learning rates and 5 different max_num_features. You should fix the number of epochs to 10. </font>
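One way to organize such a search is a simple grid loop. The sketch below is only a scaffold: `grid_search` is a hypothetical helper, the candidate values are arbitrary, and `dummy_score` is a made-up stand-in that does NOT compute a real accuracy. In practice you would replace it with a function that builds the vocabulary, trains for 10 epochs, and returns validation accuracy.

```python
from itertools import product


def grid_search(score_fn, learning_rates, feature_sizes):
    """Evaluate score_fn on every (learning_rate, max_num_features) pair.

    Returns the best pair and a dict of all scores. Hypothetical helper.
    """
    results = {}
    for lr, size in product(learning_rates, feature_sizes):
        results[(lr, size)] = score_fn(lr, size)
    best = max(results, key=results.get)
    return best, results


def dummy_score(learning_rate, max_num_features):
    """Placeholder so the sketch runs; the returned number is NOT an accuracy."""
    return -((learning_rate - 0.05) ** 2) - ((max_num_features - 1000) / 1000) ** 2


# Arbitrary candidate values, chosen only to show the 5 x 5 grid shape.
learning_rates = [0.001, 0.01, 0.05, 0.1, 0.5]
feature_sizes = [500, 1000, 2000, 4000, 8000]

best, results = grid_search(dummy_score, learning_rates, feature_sizes)
print(f"evaluated {len(results)} settings; best under the dummy score: {best}")
```

Keeping every score in `results`, rather than only the best one, makes it easy to report the full 5 x 5 table afterwards.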

## Part 4: FancyFeatureExtraction

> <font color="orange">File to be modified: `features.py`.</font>


Now it's your turn to do some **feature engineering**. Feature engineering is the process of manually designing "features" that should improve model performance, based on intuitions we have about the raw data. For example, if you were trying to classify whether text sequences are tweets, a good feature could be whether the input text is longer than 280 characters (exceeding 280 characters is, of course, a strong negative signal).
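The tweet example can be written as a tiny hand-designed feature function. This is only an illustration of what a manual feature looks like: `tweet_length_feature` and the feature name `over_280_chars` are hypothetical, and the feature is specific to the tweet example rather than to the entailment task.

```python
def tweet_length_feature(text, max_chars=280):
    """Return a one-entry feature dict flagging texts longer than `max_chars`.

    Hypothetical helper for the tweet example only.
    """
    return {"over_280_chars": 1.0 if len(text) > max_chars else 0.0}


short_text = "Children are smiling and waving at camera"
long_text = "word " * 100  # 500 characters, well over the limit

print(tweet_length_feature(short_text))
print(tweet_length_feature(long_text))
```

A classifier can then learn a single weight for this feature, in the same way it learns one weight per n-gram.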

Be creative and have fun! A good place to start is to continue exploring the raw data and see if there are any trends which distinguish the classes. In this section, you can also consider improvements to our naive tokenization algorithm. Just make sure to limit the number of features that you generate to *number*.

Add your fancy feature extraction code to the FancyFeatureExtractor class in ``features.py``. Run the code block below to run training with your fancy feature extractor. Does it give you better performance than the n-gram features?

> __Important Note:__ <font color="red"> When we test your FancyFeatureExtractor on our end, we will use the following hyperparameters: max_num_features=2000, learning_rate=0.01, epochs=10</font>.




In [None]:
from features import FancyFeatureExtractor

## These are the hyperparameters we will use to test your fancy feature extractor on our end.
max_num_features = 2000
learning_rate = 0.01
epochs = 10

fancy_feature = FancyFeatureExtractor(max_num_features=max_num_features)
fancy_feature.build_vocabulary(train_exs)

classifier_fancy = LogisticRegressionClassifier(fancy_feature)
classifier_fancy.train(train_exs, learning_rate, epochs)

print_evaluation_metrics(classifier_fancy, val_exs, "Validation")

#### <font color="brown"> Q4.1: Explain your feature engineering strategy. Write down at least 5 new features (apart from the unigram and bigram features) that you implemented for this section. It is okay if those features did not improve performance; still list them here and include your intuition. </font>

## Submission
You will make two separate submissions for this hw assignment on Gradescope.


### hw1-programming
You will submit two Python files, `models.py` and `features.py`, to this assignment on Gradescope.

Note that we will grade:
1. The correctness of your individual function implementations.
2. The performance of your trained ``LogisticRegressionClassifier`` model using the ``UnigramFeatureExtractor`` and the ``BigramFeatureExtractor`` on the validation set. These values should ideally be greater than 64% with the following hyperparameters: max_num_features=2000, learning_rate=0.01, epochs=10. Lower values will not receive a 0; they will be graded based on the difference from this reference value.
3. The performance of your trained ``LogisticRegressionClassifier`` model using the ``FancyFeatureExtractor`` on the validation set. These values should ideally be greater than 68% with the following hyperparameters: max_num_features=2000, learning_rate=0.01, epochs=10. Lower values will not receive a 0; they will be graded based on the difference from this reference value.
4. We will also run your ``LogisticRegressionClassifier`` with your ``FancyFeatureExtractor`` on our hidden test set. You will be graded based on your model's performance on this hidden test set.

### hw1-written
You will submit the answers to the written questions in this notebook as part of the PDF that contains your responses to the ``Part A: Conceptual Questions`` part of hw1.