Thie notebook explores using transformers for document classification.  Before starting, change the runtime to GPU: Runtime > Change runtime type > Hardware accelerator: GPU (any GPU is fine).

For an intro to models in pytorch, see [this tutorial](https://pytorch.org/tutorials/beginner/introyt/modelsyt_tutorial.html).




Download classification data for training/evaluation.

In [1]:
!wget https://raw.githubusercontent.com/dbamman/anlp23/main/data/convote/train.tsv
!wget https://raw.githubusercontent.com/dbamman/anlp23/main/data/convote/dev.tsv

--2023-10-09 04:27:43--  https://raw.githubusercontent.com/dbamman/anlp23/main/data/convote/train.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4660140 (4.4M) [text/plain]
Saving to: ‘train.tsv.2’


2023-10-09 04:27:43 (82.6 MB/s) - ‘train.tsv.2’ saved [4660140/4660140]

--2023-10-09 04:27:43--  https://raw.githubusercontent.com/dbamman/anlp23/main/data/convote/dev.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 351382 (343K) [text/plain]
Saving to: ‘dev.tsv.2’


2023-10-09 04:27:43 (12.1 MB/s) - ‘dev.tsv.2’ saved [35138

In [2]:
import math
import sys
import torch
from torch import nn
from collections import Counter
from nltk import word_tokenize

In [3]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


In [5]:
# max sequence length
max_length=256

# limit vocabulary to top N words in training data
max_vocab=10000

# batch size
batch_size=128

# size of token representations (which dictates the size of the overall model).
d_model=16


# number of epochs
num_epochs=50

print('')
print("********************************************")
print("Running on: {}".format(device))
print("********************************************")
print('')


********************************************
Running on: cuda
********************************************



In [6]:
# PositionalEncoding class copied from:
# https://github.com/pytorch/examples/blob/main/word_language_model/model.py

class PositionalEncoding(nn.Module):

    def __init__(self, d_model, dropout=0.1):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_length, d_model)
        position = torch.arange(0, max_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)#.transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):

        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)


In [7]:
class TransformerClassifier(torch.nn.Module):

    def __init__(self, num_labels, d_model, nhead=2, num_encoder_layers=1, dim_feedforward=256):

        super(TransformerClassifier, self).__init__()

        self.num_labels=num_labels
        self.embedding = nn.Embedding(num_embeddings=max_vocab+2, embedding_dim=d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead, num_encoder_layers=num_encoder_layers, dim_feedforward=dim_feedforward, batch_first=True)
        self.classifier = nn.Linear(d_model, self.num_labels)
        self.pos_encoder = PositionalEncoding(d_model)

    def forward(self, x, m):

        # put data on device (e.g., gpu)
        x=x.to(device)
        m=m.to(device)

        # convert input token IDs to word embeddings
        embed=self.embedding(x)

        # add position encodings to include information about word position within the document
        embed = self.pos_encoder(embed)

        # get transformer output
        h=self.transformer.encoder(embed, src_key_padding_mask=m)

        # Represent document as average embedding of transformer output
        h=torch.mean(h, dim=1)

        # Convert document representation into output label space
        logits=self.classifier(h)

        return logits


In [8]:
def create_vocab_and_labels(filename, max_vocab):
    # This function creates the word vocabulary (and label ids) from the training data
    # The vocab is a mapping between word types and unique word IDs

    counts=Counter()
    labels={}
    with open(filename, encoding="utf-8") as file:
        for line in file:
            cols=line.rstrip().split("\t")
            lab=cols[0]
            text=word_tokenize(cols[1].lower())
            for tok in text:
                counts[tok]+=1

            if lab not in labels:
                labels[lab]=len(labels)

    vocab={"[MASK]":0, "[UNK]":1}

    for k,v in counts.most_common(max_vocab):
        vocab[k]=len(vocab)

    return vocab, labels

In [9]:
def read_data(filename, vocab, labels, max_length, max_docs=5000):
    # Read in data from file, up to the first max_docs documents. For each document
    # read up to max_length tokens.

    x=[]
    y=[]
    m=[]

    with open(filename, encoding="utf-8") as file:
        for idx, line in enumerate(file):
            if idx >= max_docs:
                break
            cols=line.rstrip().split("\t")
            lab=cols[0]
            text=word_tokenize(cols[1])
            text_ids=[]
            for tok in text:
                if tok in vocab:
                    text_ids.append(vocab[tok])
                else:
                    text_ids.append(vocab["[UNK]"])

            text_ids=text_ids[:max_length]

            # PyTorch (and most libraries that deal with matrix operations) expects all inputs to be the same length
            # So pad each document with 0s up to max_length
            # But keep track of the true number of tokens in the document with the "mask" list.

            # True tokens have a mask value of 0
            mask=[0]*len(text_ids)

            for i in range(len(text_ids), max_length):
                text_ids.append(vocab["[MASK]"])
                # Padded tokens have a mask value of 1
                mask.append(1)

            x.append(text_ids)
            m.append(mask)
            y.append(labels[lab])

    return x, y, m

In [10]:
def get_batches(x, y, m, batch_size):

    # Create minibatches from the full dataset

    batches_x=[]
    batches_y=[]
    batches_m=[]
    for i in range(0, len(x), batch_size):
        xbatch=x[i:i+batch_size]
        ybatch=y[i:i+batch_size]
        mbatch=m[i:i+batch_size]

        batches_x.append(torch.LongTensor(xbatch))
        batches_y.append(torch.LongTensor(ybatch))
        batches_m.append(torch.BoolTensor(mbatch))

    return batches_x, batches_y, batches_m

In [11]:
def evaluate(model, all_x, all_y, all_m):

    # Calculate accuracy

    model.eval()
    corr = 0.
    total = 0.
    with torch.no_grad():
        for x, y, m in zip(all_x, all_y, all_m):
            y_preds=model.forward(x, m)
            for idx, y_pred in enumerate(y_preds):
                prediction=torch.argmax(y_pred)
                if prediction == y[idx]:
                    corr += 1.
                total+=1
    return corr/total

In [12]:
def train(model, model_filename, train_batches_x, train_batches_y, train_batches_m, dev_batches_x, dev_batches_y, dev_batches_m):

    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    cross_entropy=nn.CrossEntropyLoss()

    # Keep track of the epoch that has the best dev accuracy
    best_dev_acc=0.
    best_dev_epoch=None

    # How many epochs with no changes before we quit
    patience=10

    for epoch in range(num_epochs):

        model.train()

        for x, y, m in zip(train_batches_x, train_batches_y, train_batches_m):
            # Get predictions for batch x (with mask values m)
            y_pred=model.forward(x, m)
            y=y.to(device)

            # Calculate loss as cross-entropy with true labels
            loss = cross_entropy(y_pred.view(-1, model.num_labels), y.view(-1))

            # Set all gradients to zero
            optimizer.zero_grad()

            # Calculate gradients from current loss
            loss.backward()

            # Update parameters
            optimizer.step()

        dev_accuracy=evaluate(model, dev_batches_x, dev_batches_y, dev_batches_m)

        # we're going to save the model that performs the best on *dev* data
        if dev_accuracy > best_dev_acc:
            torch.save(model.state_dict(), model_filename)
            print("%.3f is better than %.3f, saving model ..." % (dev_accuracy, best_dev_acc))
            best_dev_acc = dev_accuracy
            best_dev_epoch=epoch

        if epoch % 1 == 0:
            print("Epoch %s, dev accuracy: %.3f" % (epoch, dev_accuracy))

        if epoch-best_dev_epoch > patience:
          print("%s > patience (%s), stopping..." % (epoch-best_dev_epoch, patience))
          break

    model.load_state_dict(torch.load(model_filename))
    print("\nBest Performing Model achieves dev accuracy of : %.3f" % (best_dev_acc))

In [13]:
vocab, labels=create_vocab_and_labels("train.tsv", max_vocab)
train_x, train_y, train_m=read_data("train.tsv", vocab, labels, max_length=max_length)
dev_x, dev_y, dev_m=read_data("dev.tsv", vocab, labels, max_length=max_length)

In [14]:
labels

{'D': 0, 'R': 1}

In [15]:
classifier=TransformerClassifier(num_labels=len(labels), d_model=100, dim_feedforward=1024)
classifier=classifier.to(device)

train_x_batch, train_y_match, train_m_match=get_batches(train_x, train_y, train_m, batch_size=batch_size)
dev_x_batch, dev_y_match, dev_m_match=get_batches(dev_x, dev_y, dev_m, batch_size=batch_size)

train(classifier, "test.model", train_x_batch, train_y_match, train_m_match, dev_x_batch, dev_y_match, dev_m_match)

  output = torch._nested_tensor_from_mask(output, src_key_padding_mask.logical_not(), mask_check=False)


0.502 is better than 0.000, saving model ...
Epoch 0, dev accuracy: 0.502
0.525 is better than 0.502, saving model ...
Epoch 1, dev accuracy: 0.525
Epoch 2, dev accuracy: 0.521
0.533 is better than 0.525, saving model ...
Epoch 3, dev accuracy: 0.533
0.553 is better than 0.533, saving model ...
Epoch 4, dev accuracy: 0.553
0.572 is better than 0.553, saving model ...
Epoch 5, dev accuracy: 0.572
0.611 is better than 0.572, saving model ...
Epoch 6, dev accuracy: 0.611
0.626 is better than 0.611, saving model ...
Epoch 7, dev accuracy: 0.626
0.646 is better than 0.626, saving model ...
Epoch 8, dev accuracy: 0.646
Epoch 9, dev accuracy: 0.642
0.654 is better than 0.646, saving model ...
Epoch 10, dev accuracy: 0.654
0.665 is better than 0.654, saving model ...
Epoch 11, dev accuracy: 0.665
Epoch 12, dev accuracy: 0.654
Epoch 13, dev accuracy: 0.654
Epoch 14, dev accuracy: 0.642
Epoch 15, dev accuracy: 0.634
Epoch 16, dev accuracy: 0.580
Epoch 17, dev accuracy: 0.654
Epoch 18, dev accura

**Q1**. Play around with this transformer as implemented and experiment with how performance on the dev data changes as a function of `d_model`, `num_encoder_layers`, `nhead`, etc.).  Describe your experiments and report dev accuracy on them below.

Default parameters in class TransformerClassifier:\
d_model, nhead=2, num_encoder_layers=1, dim_feedforward=256 \
Results with...\
Pre-set param (d_model=100, nhead=2, num_encoder_layers=1, dim_feedforward=1024): 0.623, 0.638, 0.630

Results with...\
Param set 1 (d_model=100, nhead=2, num_encoder_layers=1, dim_feedforward=256): 0.696, 0.693,  0.634\
Param set 2 (d_model=100, nhead=2, num_encoder_layers=2, dim_feedforward=1024): 0.654, 0.689, 0.638 \
Param set 3 (d_model=100, nhead=2, num_encoder_layers=1, dim_feedforward=4096): 0.658, 0.623, 0.642 \
Param set 4 (d_model=400, nhead=8, num_encoder_layers=4, dim_feedforward=4096): 0.525, 0.525, 0.494 \
Increasing numbers for d_model, n_head, num_encoder_layers, and dim_feedforward does not necessarily guarantee an increase in accuracy.

In [16]:
classifier=TransformerClassifier(num_labels=len(labels), d_model=400, nhead=8, num_encoder_layers=4, dim_feedforward=4096)
classifier=classifier.to(device)

train_x_batch, train_y_match, train_m_match=get_batches(train_x, train_y, train_m, batch_size=batch_size)
dev_x_batch, dev_y_match, dev_m_match=get_batches(dev_x, dev_y, dev_m, batch_size=batch_size)

train(classifier, "test.model", train_x_batch, train_y_match, train_m_match, dev_x_batch, dev_y_match, dev_m_match)

0.494 is better than 0.000, saving model ...
Epoch 0, dev accuracy: 0.494
Epoch 1, dev accuracy: 0.494
Epoch 2, dev accuracy: 0.494
Epoch 3, dev accuracy: 0.494
Epoch 4, dev accuracy: 0.494
Epoch 5, dev accuracy: 0.494
Epoch 6, dev accuracy: 0.494
Epoch 7, dev accuracy: 0.494
Epoch 8, dev accuracy: 0.494
Epoch 9, dev accuracy: 0.494
Epoch 10, dev accuracy: 0.494
Epoch 11, dev accuracy: 0.494
11 > patience (10), stopping...

Best Performing Model achieves dev accuracy of : 0.494


**Q2**.  This transformer is forced to learn everything about the structure of language from the labeled dataset.  Word embeddings, however, already capture some of this structure, and can be incorporated into this model in an `nn.Embedding` layer.  Change the `TransformerClassifier` class above so that the `Embedding` layer uses pre-trained weights (do so with the `Embedding.from_pretrained` function described on the pytorch [API](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html).  You can use any pre-trained embeddings you like, including the [GloVe vectors](https://raw.githubusercontent.com/dbamman/anlp23/main/data/glove.6B.50d.50K.txt) from class.  (Hint: doing so will require changes to `read_data` and `create_vocab_and_labels` since the word embeddings will give you your vocabulary.)

In [17]:
class NewTransformerClassifier(torch.nn.Module):

    def __init__(self, num_labels, d_model, weight_list, nhead=2, num_encoder_layers=1, dim_feedforward=256):

        super(NewTransformerClassifier, self).__init__()

        self.num_labels=num_labels
        self.embedding = nn.Embedding.from_pretrained(torch.FloatTensor(weight_list))
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead, num_encoder_layers=num_encoder_layers, dim_feedforward=dim_feedforward, batch_first=True)
        self.classifier = nn.Linear(d_model, self.num_labels)
        self.pos_encoder = PositionalEncoding(d_model)

    def forward(self, x, m):

        # put data on device (e.g., gpu)
        x=x.to(device)
        m=m.to(device)

        # convert input token IDs to word embeddings
        embed=self.embedding(x)

        # add position encodings to include information about word position within the document
        embed = self.pos_encoder(embed)

        # get transformer output
        h=self.transformer.encoder(embed, src_key_padding_mask=m)

        # Represent document as average embedding of transformer output
        h=torch.mean(h, dim=1)

        # Convert document representation into output label space
        logits=self.classifier(h)

        return logits


In [18]:
def create_embedding(filename, max_vocab=10000):

    count = 0
    weight_list=[]

    vocab={"[MASK]":0, "[UNK]":1}

    with open(filename, encoding="utf-8") as file:
        for line in file:
            cols=line.split()

            if len(cols) <= 3:
                continue

            vocab[cols[0]]=len(vocab)

            w = [float(w_v) for w_v in cols[1:]]
            weight_list.append(w)
            # print(weight_list)

            count += 1
            if count >= max_vocab:
                break

    return vocab, weight_list

In [19]:
!wget https://raw.githubusercontent.com/dbamman/anlp23/main/data/glove.6B.50d.50K.txt

--2023-10-09 04:34:13--  https://raw.githubusercontent.com/dbamman/anlp23/main/data/glove.6B.50d.50K.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21357798 (20M) [text/plain]
Saving to: ‘glove.6B.50d.50K.txt.3’


2023-10-09 04:34:14 (216 MB/s) - ‘glove.6B.50d.50K.txt.3’ saved [21357798/21357798]



In [20]:
def read_data_w_embed_vocab(filename, vocab, labels, max_length, max_docs=5000):
    # Read in data from file, up to the first max_docs documents. For each document
    # read up to max_length tokens.

    x=[]
    y=[]
    m=[]

    with open(filename, encoding="utf-8") as file:
        for idx, line in enumerate(file):
            if idx >= max_docs:
                break
            cols=line.rstrip().split("\t")
            lab=cols[0]
            text=word_tokenize(cols[1])
            text_ids=[]
            for tok in text:
                if tok in vocab:
                    text_ids.append(vocab[tok])
                else:
                    text_ids.append(vocab["[UNK]"])

            text_ids=text_ids[:max_length]

            # PyTorch (and most libraries that deal with matrix operations) expects all inputs to be the same length
            # So pad each document with 0s up to max_length
            # But keep track of the true number of tokens in the document with the "mask" list.

            # True tokens have a mask value of 0
            mask=[0]*len(text_ids)

            for i in range(len(text_ids), max_length):
                text_ids.append(vocab["[MASK]"])
                # Padded tokens have a mask value of 1
                mask.append(1)

            x.append(text_ids)
            m.append(mask)
            y.append(labels[lab])

    return x, y, m

In [21]:
vocab, weight_list = create_embedding("glove.6B.50d.50K.txt")
train_x, train_y, train_m=read_data_w_embed_vocab("train.tsv", vocab, labels, max_length=max_length)
dev_x, dev_y, dev_m=read_data_w_embed_vocab("dev.tsv", vocab, labels, max_length=max_length)

In [22]:
# len(weight_list), len(weight_list[0])

In [23]:
# len(train_x), len(train_x[0])

In [24]:
# len(train_y)

In [25]:
# len(train_m), len(train_m[0])

In [26]:
# len(vocab)

In [27]:
classifier=NewTransformerClassifier(num_labels=len(labels), d_model=50, dim_feedforward=1024, weight_list=weight_list)
classifier=classifier.to(device)

train_x_batch, train_y_match, train_m_match=get_batches(train_x, train_y, train_m, batch_size=batch_size)
dev_x_batch, dev_y_match, dev_m_match=get_batches(dev_x, dev_y, dev_m, batch_size=batch_size)

train(classifier, "test.model", train_x_batch, train_y_match, train_m_match, dev_x_batch, dev_y_match, dev_m_match)

0.494 is better than 0.000, saving model ...
Epoch 0, dev accuracy: 0.494
Epoch 1, dev accuracy: 0.494
Epoch 2, dev accuracy: 0.494
Epoch 3, dev accuracy: 0.494
Epoch 4, dev accuracy: 0.494
Epoch 5, dev accuracy: 0.494
Epoch 6, dev accuracy: 0.494
Epoch 7, dev accuracy: 0.494
Epoch 8, dev accuracy: 0.494
Epoch 9, dev accuracy: 0.494
Epoch 10, dev accuracy: 0.494
Epoch 11, dev accuracy: 0.494
11 > patience (10), stopping...

Best Performing Model achieves dev accuracy of : 0.494


To turn in:

- Go to `File > Download > Download .ipynb` and save your notebook.
- In your browser, print this page to save as PDF.
- Upload both your .ipynb and .pdf files to bCourses as usual.