# MMI_2024_NLP - Week 1

#Lab 1: Part 2

# Introduction

Before we start, please change the name of the notebook to the following format : **Firstname_LASTNAME_Lab1_B_logistic_regression.ipynb**


In some cells and files you will see code blocks that look like this:

```python
##############################################################################
#                    TODO: Write the equation for a line                     #
##############################################################################
pass
##############################################################################
#                              END OF YOUR CODE                              #
##############################################################################
```

You should replace the `pass` statement with your own code and leave the blocks intact, like this:

```python
##############################################################################
#                    TODO: Write the equation for a line                     #
##############################################################################
y = m * x + b
##############################################################################
#                              END OF YOUR CODE                              #
##############################################################################
```

In [9]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [10]:
cd "/content/drive/MyDrive/NLP_Week1_PS/Lab1"

/content/drive/MyDrive/NLP_Week1_PS/Lab1


# (B) Logistic Regression Model

In this second part of the lab, we will implement a language identifier trained on the same data, but using Logistic Regression instead of Naive Bayes.

In [1]:
import io, sys, math
import numpy as np
from collections import defaultdict
from tqdm.notebook import tqdm

This function is used to build the dictionary, or vocabulary, which is a mapping from strings (or words) to integers (or indices). This will allow to build vector representations of documents.

In [2]:
def build_dict(filename, threshold=1):
    fin = io.open(filename, 'r', encoding='utf-8')
    word_dict, label_dict = {}, {}
    counts = defaultdict(lambda: 0)
    for line in fin:
        tokens = line.split()
        label = tokens[0]

        if not label in label_dict:
            label_dict[label] = len(label_dict)

        for w in tokens[1:]:
            counts[w] += 1

    for k, v in counts.items():
        if v > threshold:
            word_dict[k] = len(word_dict)
    return word_dict, label_dict

This function is used to load the training dataset, and build vector representations of the training examples. In particular, a document or sentence is represented as a bag of words. Each example correspond to a sparse vector ` x` of dimension `V`, where `V` is the size of the vocabulary. The element `j` of the vector `x` is the number of times the word `j` appears in the document.

In [3]:
def load_data(filename, word_dict, label_dict):
    fin = io.open(filename, 'r', encoding='utf-8')
    data = []
    dim = len(word_dict)
    for line in fin:
        tokens = line.split()
        label = tokens[0]

        yi = label_dict[label]
        xi = np.zeros(dim)
        for word in tokens[1:]:
            if word in word_dict:
                wid = word_dict[word]
                xi[wid] += 1.0
        data.append((yi, xi))
    return data

First, let's implement the softmax function. Don't forget numerical stability!

In [4]:
def softmax(x):
  ##########################################################################
  #                      TODO: Implement this function                     #
  ##########################################################################
  # Replace "pass" statement with your code

  m = np.max(x)

  softmax_x = np.exp( x - m ) / np.sum( np.exp(x - m) )

  return softmax_x

  ##########################################################################
  #                            END OF YOUR CODE                            #
  ##########################################################################

Now, let's implement the main training loop, by using stochastic gradient descent. The function will iterate over the examples of the training set. For each example, we will first compute the loss, before computing the gradient and performing the update.

In [5]:
def sgd(w, data, niter):

    lr = 0.001
    target = np.array([])
    nlabels, dim = w.shape

    for iter in range(niter):
      ##########################################################################
      #                      TODO: Implement this function                     #
      ##########################################################################
      # Replace "pass" statement with your code
        total_loss = 0.0

        # Shuffle data for stochasticity
        np.random.shuffle(data)

        # Wrap data iteration with tqdm for progress bar
        with tqdm(data, desc=f"Epoch {iter + 1}/{niter}") as progress_bar:
            for yi, xi in progress_bar:
                # nlabels, dim = w.shape
                scores = np.dot(w, xi)
                probs = 1.0 / (1.0 + np.exp(-scores))
                loss = -np.log(probs[yi])

                # Compute gradient
                gradient = np.zeros_like(w)
                gradient[yi, :] += xi * (probs[yi] - 1.0)

                for j in range(nlabels):
                    if j != yi:
                        gradient[j, :] += xi * probs[j]


                total_loss += loss

                # Update weights using gradient descent
                w -= lr * gradient

                # Update progress bar description
                progress_bar.set_postfix(loss=loss)

        average_loss = total_loss / len(data)
        # Print total loss for the epoch
        print(f"Epoch {iter + 1}/{niter} Average Loss: {average_loss:.3f}")

      ##########################################################################
      #                            END OF YOUR CODE                            #
      ##########################################################################

    return w # Replace "..." statement with your code

The next function will predict the most probable label corresponding to example `x`, given the trained classifier `w`.

In [6]:

def predict(w, x):
  ##########################################################################
  #                      TODO: Implement this function                     #
  ##########################################################################
  # Replace "pass" statement with your code
  scores = np.dot(w, x)
  probs = softmax(scores)

  return np.argmax(probs)
  ##########################################################################
  #                            END OF YOUR CODE                            #
  ##########################################################################

Finally, this function will compute the accuracy of a trained classifier `w` on a validation set.

In [7]:
def compute_accuracy(w, valid_data):
  ##########################################################################
  #                      TODO: Implement this function                     #
  ##########################################################################
  # Replace "pass" statement with your code
  correct = 0
  for yi, xi in valid_data:
      if predict(w, xi) == yi:
          correct += 1
  return correct / len(valid_data)
  ##########################################################################
  #                            END OF YOUR CODE                            #
  ##########################################################################


In [20]:
print("")
print("** Logistic Regression **")
print("")

word_dict, label_dict = build_dict("train1.txt")
train_data = load_data("train1.txt", word_dict, label_dict)
valid_data = load_data("valid1.txt", word_dict, label_dict)

nlabels = len(label_dict)

dim = len(word_dict)
w = np.zeros([nlabels, dim])
w = sgd(w, train_data, 5)
print("")
print("Validation accuracy: %.3f" % compute_accuracy(w, valid_data))
print("")



** Logistic Regression **



Epoch 1/5:   0%|          | 0/10000 [00:00<?, ?it/s]

Epoch 1/5 Average Loss: 0.685


Epoch 2/5:   0%|          | 0/10000 [00:00<?, ?it/s]

Epoch 2/5 Average Loss: 0.672


Epoch 3/5:   0%|          | 0/10000 [00:00<?, ?it/s]

Epoch 3/5 Average Loss: 0.660


Epoch 4/5:   0%|          | 0/10000 [00:00<?, ?it/s]

# Now, it is your turn, try to do it with train2.txt and valid2.txt.


In [21]:
#Write your code here.

In [None]:
print("")
print("** Logistic Regression **")
print("")

word_dict, label_dict = build_dict("train2.txt")
train_data = load_data("train2.txt", word_dict, label_dict)
valid_data = load_data("valid2.txt", word_dict, label_dict)

nlabels = len(label_dict)

dim = len(word_dict)
w = np.zeros([nlabels, dim])
w = sgd(w, train_data, 5)
print("")
print("Validation accuracy: %.3f" % compute_accuracy(w, valid_data))
print("")


** Logistic Regression **

