# Logistic Regression

In this notebook, we will build a logistic regression classifier to predict whether a sentence came from Sherlock Holmes or A Tale of Two Cities.

If you don't have the packages below, run the conda commands given in the comments. If you are running on the server and followed Winston's Remote Development Tips, then conda should be set up with conda-forge as the default channel. If not, add `-c conda-forge` to the end of each command.

In [9]:
import random
import torch  # conda install pytorch pytorch-cuda=12.4 -c pytorch -c nvidia
from tokenizers import Tokenizer  # conda install tokenizers
from tqdm.notebook import tqdm  # conda install tqdm
# import evaluate  # conda install evaluate scikit-learn


First, read the datasets, tokenize, and store as a list of (label, tokens) pairs.

In [10]:
def read_data():
    tokenizer = Tokenizer.from_pretrained("bert-base-cased")

    train = []
    with open("SH-TTC/train.tsv") as fin:
        for line in fin:
            label, text = line.strip().split("\t")
            tokens = tokenizer.encode(text).tokens
            train.append((label, tokens))

    dev = []
    with open("SH-TTC/dev.tsv") as fin:
        for line in fin:
            label, text = line.strip().split("\t")
            tokens = tokenizer.encode(text).tokens
            dev.append((label, tokens))
    
    return train, dev

train_data_raw, dev_data_raw = read_data()

Examine the data to see what it looks like.

In [11]:
train_data_raw[0]

('SH',
 ['[CLS]',
  '“',
  'On',
  'entering',
  'the',
  'house',
  ',',
  'however',
  ',',
  'I',
  'examined',
  ',',
  'as',
  'you',
  'remember',
  ',',
  'the',
  'si',
  '##ll',
  'and',
  'framework',
  'of',
  'the',
  'hall',
  'window',
  'with',
  'my',
  'lens',
  ',',
  'and',
  'I',
  'could',
  'at',
  'once',
  'see',
  'that',
  'someone',
  'had',
  'passed',
  'out',
  '.',
  '[SEP]'])

Build a mapping between input features (tokens) and IDs.

**Review**: Why do we build the vocabulary using only the training data and not the dev data?

In [12]:
def build_voc(data):
    """
    Build vocabulary mapping, reserving idx 0 for [UNK]
    """
    feat2idx = {}
    feat2idx["[UNK]"] = 0
    next_idx = 1
    for label, features in data:
        for feat in features:
            if feat not in feat2idx:
                feat2idx[feat] = next_idx
                next_idx += 1
    return feat2idx

feat2idx = build_voc(train_data_raw)

In [13]:
def to_id(feat):
    """
    Convert token to ID
    """
    return feat2idx.get(feat, feat2idx["[UNK]"])

In [15]:
to_id("man")

249
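
As a small illustration of the `[UNK]` fallback, here is a sketch on a made-up two-sentence training set. The names `build_toy_voc`, `toy_train`, and `toy_to_id` are hypothetical and exist only for this example; the notebook itself uses `build_voc` and `to_id` above.

```python
def build_toy_voc(sentences):
    """Map each distinct token to an ID, reserving ID 0 for [UNK]."""
    voc = {"[UNK]": 0}
    for tokens in sentences:
        for tok in tokens:
            # setdefault only assigns a new ID the first time a token is seen
            voc.setdefault(tok, len(voc))
    return voc


# Hypothetical toy training set (not taken from SH-TTC)
toy_train = [["the", "hall", "window"], ["the", "lens"]]
toy_voc = build_toy_voc(toy_train)


def toy_to_id(tok):
    """Look up a token, falling back to the [UNK] ID for unseen tokens."""
    return toy_voc.get(tok, toy_voc["[UNK]"])


seen_id = toy_to_id("lens")          # token that appeared in training
unseen_id = toy_to_id("guillotine")  # token that never appeared in training
print(toy_voc)
print(seen_id, unseen_id)
```

Any token that was not in the training data shares index 0, so the model still receives a valid input at test time.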

In [16]:
VOC_SIZE = len(feat2idx)
VOC_SIZE

9671

Convert the raw data to numeric data (tensors in PyTorch). The input will be a vector (1-D tensor) containing counts of each feature. The output will be a single number: 0 if the sentence came from Sherlock Holmes, or 1 if the sentence came from A Tale of Two Cities.

`torch.zeros(n)` creates a 1-D tensor of length n, filled with all zeros.
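
To see what such a count vector looks like on a small scale, here is a sketch with a four-entry vocabulary. The vocabulary, the sentence, and the helper name `count_vector` are assumptions made for this example only.

```python
import torch

# Hypothetical toy vocabulary; index 0 is reserved for unknown tokens
toy_voc = {"[UNK]": 0, "the": 1, "hall": 2, "window": 3}


def count_vector(tokens, voc):
    """Return a 1-D tensor whose i-th entry counts occurrences of token i."""
    x = torch.zeros(len(voc))
    for tok in tokens:
        x[voc.get(tok, voc["[UNK]"])] += 1
    return x


# "and" is not in the toy vocabulary, so it is counted under [UNK]
x_toy = count_vector(["the", "hall", "and", "the", "window"], toy_voc)
print(x_toy)
print(x_toy.shape)
```

The entries always sum to the number of tokens in the sentence, which is a quick sanity check on the real data as well.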

In [17]:
def process_data(raw_data):
    """
    Convert data to tensors
    """
    data = []
    for label, features in raw_data:
        # convert y to a 1-element float tensor (0 = SH, 1 = TTC)
        if label == "SH":
            y = torch.tensor([0.0])
        else:  # TTC
            y = torch.tensor([1.0])

        # convert x to a vector of token counts
        x = torch.zeros(VOC_SIZE)
        for feat in features:
            x[to_id(feat)] += 1

        data.append((x, y))
    return data

In [18]:
train_data = process_data(train_data_raw)
dev_data = process_data(dev_data_raw)

In [19]:
len(train_data), len(dev_data)

(10381, 1298)

Always examine your data! To check the dimensions of a tensor, use `.shape`.

In [20]:
train_data[0]  # (x, y) pair

(tensor([0., 1., 1.,  ..., 0., 0., 0.]), tensor([0.]))

In [21]:
train_data[0][0].shape  # vocab size

torch.Size([9671])

In [22]:
train_data[0][1].shape  # just one number, 0 or 1

torch.Size([1])

## Model!

A logistic regression model is represented by the formula $\sigma(Wx + b)$. `torch.nn.Linear` is a PyTorch module that implements $Wx + b$, and `torch.sigmoid` is the sigmoid function.
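
To make the formula concrete, the following sketch evaluates $\sigma(Wx + b)$ by hand in plain Python for a three-feature input. The weights, bias, and input are made-up values, and `sigmoid` and `predict_proba` are helper names assumed for this example only.

```python
import math


def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Compute sigmoid(w . x + b) for one input vector."""
    z = sum(w_i * x_i for w_i, x_i in zip(weights, x)) + bias
    return sigmoid(z)


# Hypothetical parameters and input: z = 0.5*1 + (-1.0)*1 + 2.0*0 + 0.0 = -0.5
toy_weights = [0.5, -1.0, 2.0]
toy_bias = 0.0
toy_x = [1.0, 1.0, 0.0]

p = predict_proba(toy_weights, toy_bias, toy_x)
print(p)  # below 0.5 because z is negative
```

A negative $Wx + b$ gives a probability below 0.5 and a positive one gives a probability above 0.5, which is why 0.5 is the natural decision threshold.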

In [23]:
lin = torch.nn.Linear(VOC_SIZE, 1)  # input = |V|, output = 1

In [24]:
for p in lin.parameters():
    print(p)

Parameter containing:
tensor([[ 0.0006, -0.0066, -0.0094,  ..., -0.0007,  0.0041, -0.0044]],
       requires_grad=True)
Parameter containing:
tensor([-0.0059], requires_grad=True)


**Question**: Which variables in the logistic regression model do the above parameters correspond to?

Let's test out the different pieces of the model. First, we need an input vector:

In [25]:
# create an input vector
features = ["hello", "this", "is", "a", "test"]
x = torch.zeros(VOC_SIZE)
for feat in features:
    x[to_id(feat)] += 1

In [26]:
x  # a vector of token counts

tensor([1., 0., 0.,  ..., 0., 0., 0.])

In [27]:
h = lin(x)  # Wx + b

In [28]:
torch.sigmoid(h)  # sigmoid(Wx + b)

tensor([0.5058], grad_fn=<SigmoidBackward0>)

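Notice the `grad_fn` attribute: PyTorch records the operations that produced this tensor so that it can compute gradients automatically when we later call `.backward()`. We don't need to derive them by hand. Very handy!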
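
The following sketch shows autograd on a function whose derivative is easy to check by hand: for $y = 3w^2$ we expect $dy/dw = 6w$. The variable names are chosen for this example only.

```python
import torch

# A single trainable value; requires_grad=True asks PyTorch to track it
w = torch.tensor([2.0], requires_grad=True)

y = (3 * w ** 2).sum()  # y = 3 * w^2, reduced to a scalar
grad_before = w.grad    # no gradient has been computed yet

y.backward()            # backpropagate: fills in w.grad with dy/dw
print(y.grad_fn)        # the recorded operation that produced y
print(w.grad)           # dy/dw = 6 * w
```

Gradients are only computed when `.backward()` is called; until then, `grad_fn` just records how the tensor was produced.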

Now it is time to define the full model. When doing so, define your own class that is a subclass of `torch.nn.Module`. In the constructor, define any layers that contain trainable parameters (only one `Linear` layer, in this case).

You also need to override the `forward` function to implement what happens when the model takes an input. Here, we run $x$ through the linear layer ($Wx + b$), and then pass the result into sigmoid.

In [29]:
class LogisticRegressionClassifier(torch.nn.Module):
    def __init__(self, voc_size):
        super().__init__()
        self.linear = torch.nn.Linear(voc_size, 1)  # Wx + b
    
    def forward(self, x):     # special function called when model(x)
        h = self.linear(x)    # h = Wx + b
        y = torch.sigmoid(h)  # y = sigmoid(h)
        return y

In [30]:
model = LogisticRegressionClassifier(VOC_SIZE)

How many parameters are in the model? Run the code below to find out.

In [31]:
def count_parameters(model):
    total_params = 0
    for name, parameter in model.named_parameters():
        if not parameter.requires_grad:
            continue
        params = parameter.numel()
        print(name, "\t", params)
        total_params += params
    print(f"Total Trainable Params: {total_params}")
    
    
count_parameters(model)

linear.weight 	 9671
linear.bias 	 1
Total Trainable Params: 9672


This is actually a tiny model, but we will see that it has surprisingly good performance. In practice, it is often better to use a simple model that performs decently than a complex model that is slightly better but requires far more resources to train.

When testing your model, it is a good idea to wrap the code in a `with torch.no_grad():` block, which tells PyTorch not to track operations for gradient computation. Notice that the `pred` tensor below has no `grad_fn`.

In [32]:
# x is the input vector from above
with torch.no_grad():
    pred = model(x)
    print(pred)

tensor([0.5050])


For training, we need two things:
- a loss function, which tells us how far off our predictions are from the correct answer (gold)
- an optimizer, which will take the loss and update our model parameters (weights and bias)
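
To illustrate what these two pieces do, here is a plain-Python sketch of binary cross-entropy and a single SGD update for logistic regression. It assumes the standard gradient of the loss with respect to the weights, $(\hat{y} - y)\,x$. The helper names and toy numbers are hypothetical; in the notebook, PyTorch's `binary_cross_entropy` and `SGD` do this work.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Logistic regression prediction sigmoid(w . x + b)."""
    return sigmoid(sum(w * x_i for w, x_i in zip(weights, x)) + bias)


def bce_loss(pred, gold):
    """Binary cross-entropy for one example with gold label 0.0 or 1.0."""
    return -(gold * math.log(pred) + (1.0 - gold) * math.log(1.0 - pred))


def sgd_step(weights, bias, x, gold, lr):
    """One SGD update, using gradient (pred - gold) * x for w and (pred - gold) for b."""
    error = predict(weights, bias, x) - gold
    new_weights = [w - lr * error * x_i for w, x_i in zip(weights, x)]
    new_bias = bias - lr * error
    return new_weights, new_bias


# Hypothetical starting point: all-zero parameters, one positive example
toy_weights, toy_bias = [0.0, 0.0], 0.0
toy_x, toy_gold = [1.0, 2.0], 1.0

loss_before = bce_loss(predict(toy_weights, toy_bias, toy_x), toy_gold)
new_weights, new_bias = sgd_step(toy_weights, toy_bias, toy_x, toy_gold, lr=0.1)
loss_after = bce_loss(predict(new_weights, new_bias, toy_x), toy_gold)
print(loss_before, loss_after)  # the loss on this example goes down after the step
```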

In [33]:
loss_func = torch.nn.functional.binary_cross_entropy  # same as torch.nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters())

Now we can train! One epoch is one pass through the training data. After a pass, we will check how the model is doing by computing the average loss over the train and dev sets. Ideally, both train and dev loss will decrease. When they start to flatten out, then the model has converged, and it is a good time to stop training.

In [34]:
# train!
import random
for epoch in range(20):
    print("Epoch", epoch)

    random.shuffle(train_data)
    for x, y in tqdm(train_data):
        model.zero_grad()  # resets the gradients of all tensors in the model

        pred = model(x)  # run the input x through the model to get a prediction
        loss = loss_func(pred, y)  # calculate the loss between the prediction and the gold
        loss.backward()  # calculate gradients of the loss with respect to all trainable parameters
        optimizer.step()  # update parameters based on the gradients

    # after each epoch, check how we're doing
    with torch.no_grad():
        total_loss = 0
        for x, y in train_data:
            pred = model(x)
            loss = loss_func(pred, y)
            total_loss += loss
        print("train loss:", total_loss / len(train_data))

        total_loss = 0
        for x, y in dev_data:
            pred = model(x)
            loss = loss_func(pred, y)
            total_loss += loss
        print("dev loss:", total_loss / len(dev_data))

Epoch 0
train loss: tensor(0.6129)
dev loss: tensor(0.6148)
Epoch 1
train loss: tensor(0.5880)
dev loss: tensor(0.5891)
Epoch 2
train loss: tensor(0.5711)
dev loss: tensor(0.5721)
Epoch 3
train loss: tensor(0.5581)
dev loss: tensor(0.5592)
Epoch 4
train loss: tensor(0.5478)
dev loss: tensor(0.5492)
Epoch 5
train loss: tensor(0.5393)
dev loss: tensor(0.5419)
Epoch 6
train loss: tensor(0.5322)
dev loss: tensor(0.5356)
Epoch 7
train loss: tensor(0.5259)
dev loss: tensor(0.5314)
Epoch 8
train loss: tensor(0.5201)
dev loss: tensor(0.5278)
Epoch 9
train loss: tensor(0.5140)
dev loss: tensor(0.5207)
Epoch 10
train loss: tensor(0.5100)
dev loss: tensor(0.5181)
Epoch 11
train loss: tensor(0.5046)
dev loss: tensor(0.5139)
Epoch 12
train loss: tensor(0.5005)
dev loss: tensor(0.5106)
Epoch 13

KeyboardInterrupt: 

Logistic regression is a small model (just one neuron!), so each epoch runs quickly. However, it takes many epochs to converge.
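
One way to make "flatten out" concrete is to check whether the last few epochs each improved the dev loss by less than some small amount. The sketch below applies that idea to the dev losses printed above; `has_flattened`, `patience`, and `min_delta` are hypothetical names and thresholds, not part of the notebook.

```python
def has_flattened(losses, patience=3, min_delta=1e-3):
    """Return True if each of the last `patience` epochs improved by less than `min_delta`."""
    if len(losses) <= patience:
        return False  # not enough history to judge
    recent = losses[-(patience + 1):]
    improvements = [prev - cur for prev, cur in zip(recent, recent[1:])]
    return all(imp < min_delta for imp in improvements)


# Dev losses for epochs 0-12, copied from the training output above
dev_losses = [0.6148, 0.5891, 0.5721, 0.5592, 0.5492, 0.5419, 0.5356,
              0.5314, 0.5278, 0.5207, 0.5181, 0.5139, 0.5106]

still_improving = not has_flattened(dev_losses)
print(still_improving)

# A made-up loss curve that has clearly levelled off, for contrast
flat_example = has_flattened([0.5, 0.4999, 0.4998, 0.4998])
print(flat_example)
```

Under these assumed thresholds, the run above was still improving when it was interrupted, which is consistent with training for more epochs.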

Now let's run our model on the dev set to get some predictions, and then evaluate the model to see how it is actually doing.

In [35]:
def run_model_on_dev_data():
    preds = []
    with torch.no_grad():
        for x, y in dev_data:
            pred = model(x)
            preds.append(pred.item())
    return preds

def sample_predictions(preds):
    for _ in range(5):
        idx = random.randrange(len(dev_data))  # randint's upper bound is inclusive, so it could index past the end
        pred_label = "SH" if preds[idx] < 0.5 else "TTC"
        print("Input:", " ".join(dev_data_raw[idx][1]))
        print("Gold: ", dev_data_raw[idx][0])
        print("Pred: ", pred_label, preds[idx])
        print()

In [36]:
preds = run_model_on_dev_data()
sample_predictions(preds)

Input: [CLS] Before he went away , he breathed a blessing towards it , and a [ N ##AM ##E ] . [ N ##AM ##E ] [ N ##AM ##E ] . [SEP]
Gold:  TTC
Pred:  SH 0.42289525270462036

Input: [CLS] “ That was it , ” said [ N ##AM ##E ] , nodding app ##roving ##ly ; “ I have no doubt of it . [SEP]
Gold:  SH
Pred:  TTC 0.8171283006668091

Input: [CLS] If he had preserved any definite re ##membrance of it , there could be no doubt that he had supposed it destroyed with the [ N ##AM ##E ] , when he had found no mention of it among the relics of prisoners which the populace had discovered there , and which had been described to all the world . [SEP]
Gold:  TTC
Pred:  SH 0.4512692391872406

Input: [CLS] I did not breathe freely until I had taken it upstairs and locked it in the bureau of my dressing - room . [SEP]
Gold:  SH
Pred:  SH 0.2620030641555786

Input: [CLS] “ Business seems bad ? ” [SEP]
Gold:  TTC
Pred:  TTC 0.522125244140625



In [114]:
import evaluate  # conda install evaluate scikit-learn

precision = evaluate.load("precision")
recall = evaluate.load("recall")
accuracy = evaluate.load("accuracy")

In [None]:
# evaluate functions require numeric data, so convert labels to 0 and 1
refs = []
for label, text in dev_data_raw:
    if label == "SH":
        refs.append(0)
    else:
        refs.append(1)

preds_binary = []
for pred in preds:
    if pred < 0.5:
        preds_binary.append(0)
    else:
        preds_binary.append(1)

print(precision.compute(references=refs, predictions=preds_binary))
print(recall.compute(references=refs, predictions=preds_binary))
print(accuracy.compute(references=refs, predictions=preds_binary))
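
If the `evaluate` package is not available, the same three metrics can be computed by hand. The sketch below assumes label 1 (TTC) is the positive class; `binary_metrics` and the toy label lists are hypothetical names and values used only for illustration.

```python
def binary_metrics(refs, preds, positive=1):
    """Compute precision, recall, and accuracy for binary labels."""
    pairs = list(zip(refs, preds))
    tp = sum(1 for r, p in pairs if r == positive and p == positive)
    fp = sum(1 for r, p in pairs if r != positive and p == positive)
    fn = sum(1 for r, p in pairs if r == positive and p != positive)
    correct = sum(1 for r, p in pairs if r == p)
    return {
        # guard against division by zero when a class is never predicted or never present
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": correct / len(pairs) if pairs else 0.0,
    }


# Made-up gold labels and predictions (0 = SH, 1 = TTC)
toy_refs = [0, 0, 1, 1, 1]
toy_preds = [0, 1, 1, 1, 0]

toy_metrics = binary_metrics(toy_refs, toy_preds)
print(toy_metrics)
```

Precision asks how many of the sentences predicted as TTC really are TTC, while recall asks how many of the true TTC sentences were found.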

## Your Tasks

Improve the performance of your logistic regression classifier! Try some of the following:
- train for more epochs
- use a different optimizer (Adam is a good one)
- try a different learning rate by passing it in as an argument, e.g. `optimizer = torch.optim.SGD(model.parameters(), lr=0.1)`

Note: if you want to train from scratch, you need to reinitialize your model and optimizer (`model = ...`, `optimizer = ...`). If you rerun the training loop without reinitializing them, it will print epoch 0, 1, 2, etc., but training will actually continue from where it left off.
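
A minimal sketch of starting over with a different optimizer and learning rate is shown below. `TinyLogReg` is a hypothetical stand-in for `LogisticRegressionClassifier`, `fresh_model_and_optimizer` is a helper name assumed for this example, and the vocabulary size and learning rate are arbitrary.

```python
import torch


class TinyLogReg(torch.nn.Module):
    """Hypothetical stand-in for the notebook's LogisticRegressionClassifier."""

    def __init__(self, voc_size):
        super().__init__()
        self.linear = torch.nn.Linear(voc_size, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x))


def fresh_model_and_optimizer(voc_size, lr=0.01):
    """Create a newly initialized model together with an Adam optimizer bound to it."""
    model = TinyLogReg(voc_size)
    # The optimizer must be built from the new model's parameters,
    # otherwise it would keep updating the old model.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    return model, optimizer


toy_model, toy_optimizer = fresh_model_and_optimizer(voc_size=10, lr=0.01)
print(toy_optimizer.param_groups[0]["lr"])
```

Calling the helper again before each experiment gives a fresh set of weights, so results from different settings can be compared fairly.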