# Neural Network Classifier 1

This is a copy of the logistic regression classifier notebook with some minor changes. Make sure you understand that notebook before starting this one!

Changes from logistic regression:
- Vocab class to conveniently deal with converting between features and indices
- Instead of a single output, we have $n$ output nodes, where $n$ is the number of classes (here we have 2)
- For the data, instead of 0 or 1 for the labels SH and TTC, we use the one-hot vectors [1, 0] and [0, 1].
- Since we are now doing multiclass classification, the loss function is cross entropy. PyTorch's cross entropy takes unnormalized scores (logits) as input, so you do not need to compute sigmoid/softmax yourself.

In [27]:
from collections import Counter
import random

import torch
from tokenizers import Tokenizer
from tqdm.notebook import tqdm
import evaluate

import torch.nn.functional as F  # shorthand so we can do F.softmax and other functions

In [3]:
def read_data():
    tokenizer = Tokenizer.from_pretrained("bert-base-cased")

    train = []
    with open("SH-TTC/train.tsv") as fin:
        for line in fin:
            label, text = line.strip().split("\t")
            tokens = tokenizer.encode(text).tokens
            train.append((label, tokens))

    dev = []
    with open("SH-TTC/dev.tsv") as fin:
        for line in fin:
            label, text = line.strip().split("\t")
            tokens = tokenizer.encode(text).tokens
            dev.append((label, tokens))
    
    return train, dev

train_data_raw, dev_data_raw = read_data()

In [9]:
class Vocab:
    def __init__(self, tokens):
        self.vocab = [tok for tok, count in Counter(tokens).most_common()]
        self.tok2idx = {tok: idx + 1 for idx, tok in enumerate(self.vocab)}
        self.tok2idx["[UNK]"] = 0  # reserve index 0 for unknown tokens
        self.idx2tok = {idx: tok for tok, idx in self.tok2idx.items()}
    
    def __len__(self):
        return len(self.tok2idx)
    
    def to_id(self, tok):
        return self.tok2idx.get(tok, 0)

    def to_tok(self, id):
        return self.idx2tok.get(id, "[UNK]")
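
To see how the index mapping behaves, here is a minimal sketch on a made-up token list. The `toy_` names are only for illustration and are not part of the notebook; the sketch mirrors the idea of the `Vocab` class (frequent tokens get small indices, index 0 is kept for unknown tokens) without reusing the class itself.

```python
from collections import Counter

# made-up tokens, purely for illustration
toy_tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]

# most_common() sorts by descending count, so frequent tokens get small indices
toy_vocab = [tok for tok, count in Counter(toy_tokens).most_common()]
toy_tok2idx = {tok: idx + 1 for idx, tok in enumerate(toy_vocab)}
toy_tok2idx["[UNK]"] = 0  # index 0 is reserved for unknown tokens

print(toy_tok2idx["the"])         # the most frequent token gets index 1
print(toy_tok2idx.get("dog", 0))  # an unseen token falls back to index 0
print(len(toy_tok2idx))           # 5 distinct tokens + [UNK]
```

Starting the real indices at 1 is what leaves index 0 free for tokens that never appeared in the training data.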

In [13]:
vocab = Vocab([word
               for y, x in train_data_raw
               for word in x])

In [None]:
VOC_SIZE = len(vocab)
VOC_SIZE

In [41]:
def process_data(raw_data):
    """
    Convert data to tensors
    """
    data = []
    for label, features in raw_data:
        # convert y label to a one-hot vector
        if label == "SH":
            y = torch.Tensor([1, 0])
        else:  # TTC
            y = torch.Tensor([0, 1])

        # convert x to a vector of token counts
        x = torch.zeros(VOC_SIZE)
        for feat in features:
            x[vocab.to_id(feat)] += 1

        data.append((x, y))
    return data
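
The count vector built in `process_data` can be easier to picture on a tiny example. This is a sketch with a hypothetical four-entry vocabulary (`toy_tok2idx` is made up for illustration), not the vocabulary from the data:

```python
import torch

# hypothetical tiny vocabulary; index 0 stands for unknown tokens
toy_tok2idx = {"[UNK]": 0, "the": 1, "cat": 2, "sat": 3}
toy_features = ["the", "cat", "saw", "the", "dog"]

toy_x = torch.zeros(len(toy_tok2idx))
for feat in toy_features:
    # unknown tokens ("saw", "dog") are all counted at index 0
    toy_x[toy_tok2idx.get(feat, 0)] += 1

print(toy_x)  # one count per vocabulary entry
```

Each position in the vector holds how often that vocabulary entry occurred, so word order is lost and all unknown tokens share one slot.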

In [42]:
train_data = process_data(train_data_raw)
dev_data = process_data(dev_data_raw)

In [None]:
len(train_data), len(dev_data)

## Model!

Logistic regression has one output. To support multiclass classification, we want $n$ output nodes, where $n$ is the number of labels. For this task, $n=2$. We will use Linear again, which computes $Wx + b$. With two outputs, however, there are twice as many parameters: $W$ has one row of weights for the first output and a separate row for the second, and $b$ has one entry per output.

Let's step through the model components before defining the actual model.

In [20]:
lin = torch.nn.Linear(VOC_SIZE, 2)  # input = |V|, output = 2

In [None]:
for p in lin.parameters():
    print(p)
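
To make the printed parameters easier to read, here is a small sketch with a hypothetical vocabulary size of 5 (the `toy_` names are illustrative only). It checks the shapes of $W$ and $b$ and confirms that calling the layer matches computing $Wx + b$ by hand:

```python
import torch

toy_voc_size = 5  # hypothetical small vocabulary, for readability
toy_lin = torch.nn.Linear(toy_voc_size, 2)

print(toy_lin.weight.shape)  # one row of weights per output
print(toy_lin.bias.shape)    # one bias per output

toy_x = torch.tensor([1.0, 0.0, 2.0, 0.0, 1.0])
with torch.no_grad():
    manual = toy_lin.weight @ toy_x + toy_lin.bias  # Wx + b by hand
    layer_out = toy_lin(toy_x)

print(torch.allclose(manual, layer_out, atol=1e-6))
```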

In [23]:
# create an input vector
features = ["hello", "this", "is", "a", "test"]
x = torch.zeros(VOC_SIZE)
for feat in features:
    x[vocab.to_id(feat)] += 1

In [None]:
x

In [30]:
pred = lin(x)  # Wx + b

Notice there are now two outputs! With multiple outputs, we cannot use binary cross entropy. Instead, we use regular cross entropy as the loss function. Normally, we would first compute the softmax to normalize the values. In PyTorch, however, `cross_entropy` applies the softmax and takes the log internally, so there is no need to run the output through softmax first.

In [None]:
F.cross_entropy(pred, torch.Tensor([0, 1]))  # automatically does the softmax and log
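
As a sanity check, the same value can be computed by hand. This is a minimal sketch with made-up scores (`toy_logits` and `toy_target` are illustrative names): apply softmax, take the log, and sum the negative log-probabilities weighted by the one-hot target.

```python
import torch
import torch.nn.functional as F

toy_logits = torch.tensor([2.0, -1.0])  # made-up unnormalized scores
toy_target = torch.tensor([0.0, 1.0])   # one-hot vector for the second class

builtin = F.cross_entropy(toy_logits, toy_target)

# manual version: softmax, then log, then negative weighted sum
log_probs = torch.log(F.softmax(toy_logits, dim=0))
manual = -(toy_target * log_probs).sum()

print(builtin, manual)
print(torch.allclose(builtin, manual))
```

With a one-hot target, only the log-probability of the correct class contributes to the sum.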

Now here is the model definition:

In [44]:
class NNClassifier(torch.nn.Module):
    def __init__(self, voc_size):
        super().__init__()
        self.linear = torch.nn.Linear(voc_size, 2)  # Wx + b

    def forward(self, x):
        y = self.linear(x)
        return y  # don't need softmax because cross_entropy loss automatically computes it

In [45]:
model = NNClassifier(VOC_SIZE)

In [None]:
def count_parameters(model):
    total_params = 0
    for name, parameter in model.named_parameters():
        if not parameter.requires_grad:
            continue
        params = parameter.numel()
        print(name, "\t", params)
        total_params += params
    print(f"Total Trainable Params: {total_params}")
    
    
count_parameters(model)

There are about twice as many parameters as in the logistic regression model. More parameters generally mean longer training time.
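
The factor of two can be checked directly. This sketch assumes a hypothetical vocabulary size of 1000 (the real `VOC_SIZE` will differ) and compares a one-output layer with a two-output layer:

```python
import torch

toy_voc_size = 1000  # hypothetical vocabulary size


def n_params(module):
    """Total number of values across all parameters of a module."""
    return sum(p.numel() for p in module.parameters())


one_output = n_params(torch.nn.Linear(toy_voc_size, 1))   # |V| weights + 1 bias
two_outputs = n_params(torch.nn.Linear(toy_voc_size, 2))  # 2|V| weights + 2 biases

print(one_output, two_outputs)
```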

In [None]:
# test the model
# x is the input vector from above
with torch.no_grad():
    pred = model(x)
    print(pred)

At test time, when we make predictions, we will use argmax to get the index of the larger output, which is the predicted class. During training, we don't need to do this, because the loss function works directly on the raw outputs.

Let's set up our loss function and optimizer:

In [48]:
loss_func = F.cross_entropy  # functional form of torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

The training loop is the same as for logistic regression.

In [None]:
for epoch in range(10):
    print("Epoch", epoch)

    random.shuffle(train_data)
    for x, y in tqdm(train_data):
        model.zero_grad()
        pred = model(x)
        loss = loss_func(pred, y)
        loss.backward()
        optimizer.step()

    with torch.no_grad():
        total_loss = 0
        for x, y in train_data:
            pred = model(x)
            loss = loss_func(pred, y)
            total_loss += loss.item()  # accumulate a Python float, not a tensor
        print("train loss:", total_loss / len(train_data))

        total_loss = 0
        for x, y in dev_data:
            pred = model(x)
            loss = loss_func(pred, y)
            total_loss += loss.item()  # accumulate a Python float, not a tensor
        print("dev loss:", total_loss / len(dev_data))

In [None]:
torch.argmax(torch.Tensor([0.7, 0.3])) == 0
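
Because softmax preserves the ordering of its inputs, argmax gives the same index on the raw model outputs as on the normalized probabilities. A minimal sketch with made-up scores (`toy_logits` is an illustrative name):

```python
import torch
import torch.nn.functional as F

toy_logits = torch.tensor([1.5, -0.5])  # made-up unnormalized scores
toy_probs = F.softmax(toy_logits, dim=0)

idx_from_logits = torch.argmax(toy_logits).item()
idx_from_probs = torch.argmax(toy_probs).item()

print(idx_from_logits, idx_from_probs)  # same index either way
```

This is why predictions can be read off the model output directly, with softmax used only when probabilities are wanted for display.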

In [61]:
def run_model_on_dev_data():
    preds = []
    with torch.no_grad():
        for x, y in dev_data:
            pred = model(x)  # pred holds two unnormalized scores (logits), e.g. [1.2, -0.3]
            preds.append(pred)
    return preds

def sample_predictions(preds):
    for _ in range(5):
        idx = random.randrange(len(dev_data))  # randint's upper bound is inclusive and could go out of range
        
        # argmax gives the index with the highest value
        pred_label = "SH" if torch.argmax(preds[idx]) == 0 else "TTC"

        print("Input:", " ".join(dev_data_raw[idx][1]))
        print("Gold: ", dev_data_raw[idx][0])

        # preds are not normalized, so for better viewing, run it through softmax
        print("Pred: ", pred_label, F.softmax(preds[idx], dim=0)) 
        print()

In [None]:
preds = run_model_on_dev_data()
sample_predictions(preds)

In [63]:
precision = evaluate.load("precision")
recall = evaluate.load("recall")
accuracy = evaluate.load("accuracy")

In [None]:
# evaluate functions require numeric data, so convert labels to 0 and 1
refs = []
for label, text in dev_data_raw:
    if label == "SH":
        refs.append(0)
    else:
        refs.append(1)

preds_binary = []
for pred in preds:
    preds_binary.append(torch.argmax(pred).item())  # plain int instead of a 0-dim tensor

print(precision.compute(references=refs, predictions=preds_binary))
print(recall.compute(references=refs, predictions=preds_binary))
print(accuracy.compute(references=refs, predictions=preds_binary))
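
To make these numbers easier to interpret, here is a hand-computed sketch on made-up labels. It assumes label 1 (TTC here) is treated as the positive class, which I believe is the default for the `precision` and `recall` metrics in `evaluate`; check the metric documentation if your numbers disagree. The `toy_` names are illustrative only.

```python
# made-up gold labels and predictions, purely for illustration
toy_refs = [0, 1, 1, 0, 1]
toy_preds = [0, 1, 0, 1, 1]

pairs = list(zip(toy_refs, toy_preds))
tp = sum(1 for r, p in pairs if r == 1 and p == 1)  # true positives
fp = sum(1 for r, p in pairs if r == 0 and p == 1)  # false positives
fn = sum(1 for r, p in pairs if r == 1 and p == 0)  # false negatives

toy_precision = tp / (tp + fp)  # of everything predicted 1, how much was right
toy_recall = tp / (tp + fn)     # of everything truly 1, how much was found
toy_accuracy = sum(1 for r, p in pairs if r == p) / len(pairs)

print(toy_precision, toy_recall, toy_accuracy)
```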

## Your Tasks

Modify the model to add a hidden layer. I suggest a hidden size of 50. Make sure your dimensions match up! After running through the first Linear layer, run your output through a ReLU (`F.relu()`) to introduce non-linearity.

Once you add another layer, training time per epoch will increase noticeably because you are adding many more parameters. However, the number of epochs needed to get a good model will likely decrease, because your model is now more "powerful" (better able to capture the nuances in the data). The downside is that your model is also more likely to overfit (memorize) the training data. You can see this happening when your train loss keeps getting smaller while your validation loss increases. In this case, the best model is not the one that memorized the training data, but the one that generalizes well to unseen validation data.
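
One simple way to act on this is to record the train and dev loss after every epoch and pick the epoch with the lowest dev loss. The sketch below uses made-up loss values (not results from this notebook) to show the pattern; `train_losses` and `dev_losses` are hypothetical lists you would fill in from the training loop.

```python
# made-up per-epoch losses, purely for illustration
train_losses = [0.60, 0.40, 0.25, 0.15, 0.08]
dev_losses = [0.62, 0.45, 0.38, 0.41, 0.47]

# the best epoch is the one with the lowest dev loss, not the lowest train loss
best_epoch = min(range(len(dev_losses)), key=lambda e: dev_losses[e])
print("best epoch by dev loss:", best_epoch)

# flag epochs where train loss fell but dev loss rose (a sign of overfitting)
overfit_epochs = [
    e for e in range(1, len(dev_losses))
    if train_losses[e] < train_losses[e - 1] and dev_losses[e] > dev_losses[e - 1]
]
print("possible overfitting at epochs:", overfit_epochs)
```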