In [None]:
import collections, time, math, random

# Writing a simple model in PyTorch

This notebook shows you how to get started with PyTorch and provides some skeleton code. You can make a copy of the notebook and write your solution in it, or you can download it (**File &rarr; Download .py**) and work on it locally.

## Setup

Clone the HW1 repository. (If you rerun the notebook, you'll get an error that directory `hw1` already exists, which you can ignore.)

In [None]:
!git clone https://github.com/ND-CSE-40657/hw1

Import PyTorch. If you want to run on your own computer, you'll need to install PyTorch, which is usually as simple as `pip install torch`.

In [None]:
import torch
print(f'Using Torch v{torch.__version__}')

Check for a GPU. A GPU is not necessary for this assignment -- in fact, for the size of model we're training, it probably makes things slower. To enable/disable GPU, go to **Runtime &rarr; Change runtime type &rarr; Hardware accelerator** and select **GPU** (to enable the GPU) or **None** (to disable the GPU).

In [None]:
if torch.cuda.device_count() > 0:
    print(f'Using GPU ({torch.cuda.get_device_name(0)})')
    device = 'cuda'
else:
    print('Using CPU')
    device = 'cpu'

## Read and preprocess data

Read in the data files. Note that we strip trailing newlines.

In [None]:
def read_data(filename):
    # Split each line into characters and append an end-of-sentence marker.
    with open(filename) as f:
        return [list(line.rstrip('\n')) + ['<EOS>'] for line in f]
traindata = read_data('hw1/data/train')
devdata = read_data('hw1/data/dev')
testdata = read_data('hw1/data/test')
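
To see what `read_data` produces, here is a minimal sketch that runs the same logic on a small temporary file. The file contents (`hi` and `ok`) are made up for illustration and are not part of the HW1 data.

```python
import os
import tempfile

def read_data(filename):
    # Same logic as the cell above: one list of characters per line,
    # with the trailing newline stripped and '<EOS>' appended.
    with open(filename) as f:
        return [list(line.rstrip('\n')) + ['<EOS>'] for line in f]

# Hypothetical two-line file, just for illustration.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('hi\nok\n')
    path = tmp.name

example = read_data(path)
os.remove(path)
print(example)
```

Each line becomes a list of single-character strings, so the model's "words" in this notebook are characters.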

Create a vocabulary containing the most frequent words and some special words. (In this notebook, the "words" are individual characters.)

In [None]:
class Vocab:
    def __init__(self, counts, size):
        if not isinstance(counts, collections.Counter):
            raise TypeError('counts must be a collections.Counter')
        words = {'<EOS>', '<UNK>'}
        for word, _ in counts.most_common():
            words.add(word)
            if len(words) == size:
                break
        self.num_to_word = list(words)    
        self.word_to_num = {word:num for num, word in enumerate(self.num_to_word)}

    def __len__(self):
        return len(self.num_to_word)
    def __iter__(self):
        return iter(self.num_to_word)

    def numberize(self, word):
        if word in self.word_to_num:
            return self.word_to_num[word]
        else: 
            return self.word_to_num['<UNK>']

    def denumberize(self, num):
        return self.num_to_word[num]

chars = collections.Counter()
for line in traindata:
    chars.update(line)
vocab = Vocab(chars, 100) # For our data, 100 is a good size.
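
The two key behaviors of `Vocab` are that frequent symbols get their own number and everything else maps to `<UNK>`. The following is a minimal sketch of that idea using a plain list and dictionary rather than the class itself; the toy string `'aaabbc'` and the size 4 are arbitrary choices for illustration.

```python
import collections

counts = collections.Counter('aaabbc')  # a: 3, b: 2, c: 1

# Reserve the special symbols, then add the most frequent characters
# until the vocabulary reaches the requested size.
size = 4
symbols = ['<EOS>', '<UNK>']
for ch, _ in counts.most_common():
    if len(symbols) == size:
        break
    symbols.append(ch)

to_num = {s: i for i, s in enumerate(symbols)}

def numberize(ch):
    # Characters that did not make it into the vocabulary map to <UNK>.
    return to_num.get(ch, to_num['<UNK>'])

print(symbols)
print(numberize('a'), numberize('c'), numberize('z'))
```

Here `'c'` is too rare to fit and `'z'` was never seen, so both receive the number of `<UNK>`. Unlike the class above, this sketch builds the numbering from a list, so the numbers are the same on every run.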

## Define the model

Now we want to define a unigram language model. The parameters of the model are _logits_ $\mathbf{s}$, which are unconstrained real numbers, and we will apply a softmax to change them into probabilities (which are nonnegative and sum to one).

\begin{align}
P(i) &= [\operatorname{softmax} \mathbf{s}]_i \\
&= \frac{\exp s_i}{\sum_{i'} \exp s_{i'}}.
\end{align}
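
As a quick check of the formula, the sketch below computes a softmax by hand and compares it with `torch.softmax`. The logit values are arbitrary.

```python
import torch

s = torch.tensor([1.0, 2.0, 3.0])  # example logits (arbitrary values)

# Softmax by the formula: exponentiate, then normalize.
manual = torch.exp(s) / torch.exp(s).sum()
builtin = torch.softmax(s, dim=0)

print(manual)
print(builtin)
print(builtin.sum())  # the probabilities sum to one (up to rounding)
```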

Create an array (a `Tensor`) of logits, one for each word in the vocabulary.

In [None]:
logits = torch.normal(mean=0, std=0.01, 
                      size=(len(vocab),), 
                      requires_grad=True, 
                      device=device)

The function `torch.normal` creates an array of random numbers, normally distributed (here with mean zero and standard deviation 0.01).

The `size` argument says that it should be a one-dimensional array with `len(vocab)` elements, one for each word in the vocabulary.

The next two arguments are important. The `requires_grad` argument tells PyTorch that we will want to compute gradients with respect to `logits`, because we want to learn its values. The `device` argument says where to store the array.
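
To illustrate what `requires_grad` buys us, here is a small sketch that is unrelated to the language model: a tensor `x` with `requires_grad=True`, a scalar function of it, and the gradient that PyTorch computes. The function $y = \sum_i x_i^2$ is chosen only because its gradient, $2x$, is easy to check by hand.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True, device='cpu')

y = (x ** 2).sum()  # y = x_1^2 + x_2^2 + x_3^2
y.backward()        # computes dy/dx and stores it in x.grad

print(x.grad)       # equals 2 * x
print(x.device)
```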

There are a couple of functions below that will want to know what the parameters of our model are. So we make a list for future use:

In [None]:
parameters = [logits]

Next, we write code to convert the logits into probabilities -- actually, log-probabilities. Torch has a function that does a softmax and a log together; it's more numerically stable than doing them in two steps. (Even though `logits` has only one dimension, we still have to say `dim=0` to specify which dimension the softmax should be computed over.)

In [None]:
def logprobs():
    return torch.log_softmax(logits, dim=0)

This returns an array of floats like you'd expect, but this array also remembers _how_ it was computed. PyTorch will use this information to compute gradients for learning.
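
To illustrate the numerical-stability remark above, the sketch below compares the two-step version (`softmax`, then `log`) with `log_softmax` on deliberately extreme logits. The values 1000 and 0 are chosen only to force underflow.

```python
import torch

s = torch.tensor([1000.0, 0.0])  # deliberately extreme logits

# Two steps: the second probability underflows to 0, so its log is -inf.
two_step = torch.log(torch.softmax(s, dim=0))

# One step: the log-probability is computed directly and stays finite.
one_step = torch.log_softmax(s, dim=0)

print(two_step)
print(one_step)
```

A log-probability of `-inf` would make the loss infinite, which is why the one-step function is preferable.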

## Train the model

Next, we create an optimizer, whose job is to adjust a set of parameters to minimize a loss function. Here, we're using `SGD` (stochastic gradient descent); other options include `Adagrad` and `Adam`. Different optimizers take different options. Here, `lr` stands for "learning rate" and we usually try different powers of ten until we get the best results on the dev set.

In [None]:
o = torch.optim.SGD(parameters, lr=0.1)
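
To make concrete what one step of plain SGD does, here is a minimal sketch on a toy loss, $\sum_i w_i^2$, which is not the language-model loss. The names `w` and `opt` are illustrative only. With no momentum or weight decay, the update is `w <- w - lr * w.grad`.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()
opt.zero_grad()   # reset gradients to zero
loss.backward()   # w.grad = 2 * w = [2, -4]
opt.step()        # w <- w - 0.1 * w.grad

print(w)
```

Both entries move toward zero, which is the minimum of this toy loss.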

Next, we run through the training data repeatedly; each pass is called an epoch. For each line, we move the parameters a little bit to decrease the loss function. The loop below runs for at most 100 epochs and stops early once the learning rate falls below 0.01. If you want to rerun the training, go to **Run &rarr; Restart and run all** or **Runtime &rarr; Run all**. It takes about 5 minutes per epoch.

In [None]:
prev_dev_acc = None

for epoch in range(100):
    epoch_start = time.time()

    # Run through the training data

    random.shuffle(traindata) # Important

    train_loss = 0  # Total negative log-probability
    train_chars = 0 # Total number of characters
    for chars in traindata:
        # Compute the negative log-likelihood of this line,
        # which is the thing we want to minimize.
        loss = 0.
        for c in chars:
            train_chars += 1
            loss -= logprobs()[vocab.numberize(c)]

        # Keep a running total of negative log-likelihood.
        # The .item() turns a one-element tensor into an ordinary float,
        # including detaching the history of how it was computed,
        # so we don't save the history across sentences.
        train_loss += loss.item()

        # Compute gradient of loss with respect to parameters.
        o.zero_grad()   # Reset gradients to zero
        loss.backward() # Add in the gradient of loss

        # Clip gradients (not needed here, but helpful for RNNs)
        torch.nn.utils.clip_grad_norm_(parameters, 1.0)

        # Do one step of gradient descent.
        o.step()

    # Run through the development data

    dev_chars = 0   # Total number of characters
    dev_correct = 0 # Total number of characters guessed correctly
    for chars in devdata:
        for c in chars:
            dev_chars += 1

            # Find the character with highest predicted probability.
            # The .item() is needed to change a one-element tensor to
            # an ordinary int.
            best = vocab.denumberize(logprobs().argmax().item())
            if best == c:
                dev_correct += 1

    dev_acc = dev_correct/dev_chars
    print(f'time={time.time()-epoch_start} train_ppl={math.exp(train_loss/train_chars)} dev_acc={dev_acc}')

    # Important: If dev accuracy didn't improve, halve the learning rate
    if prev_dev_acc is not None and dev_acc <= prev_dev_acc:
        o.param_groups[0]['lr'] *= 0.5
        print(f"lr={o.param_groups[0]['lr']}")

    # When the learning rate gets too low, stop training
    if o.param_groups[0]['lr'] < 0.01:
        break

    prev_dev_acc = dev_acc
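
The `train_ppl` value printed by the loop is the exponential of the average negative log-probability per character. As a sanity check under a made-up setup, a model that assigns uniform probability to `V` symbols should have perplexity `V`. The vocabulary size and the symbol numbers below are arbitrary.

```python
import math
import torch

V = 100  # vocabulary size (arbitrary for this sketch)

# All-zero logits give a uniform distribution: every entry is log(1/V).
logp = torch.log_softmax(torch.zeros(V), dim=0)

# Arbitrary symbol numbers standing in for a line of text.
symbols = [3, 14, 15, 92]
total_nll = -sum(logp[i].item() for i in symbols)

ppl = math.exp(total_nll / len(symbols))
print(ppl)  # approximately V
```

A trained unigram model should reach a perplexity below the vocabulary size, since it can put more probability on frequent characters.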

## Matrix multiplication

Not illustrated above is matrix multiplication using the [`@` operator](https://pytorch.org/docs/stable/generated/torch.matmul.html):

In [None]:
A = torch.ones(2,3)
A

In [None]:
b = torch.ones(3)
b

In [None]:
C = torch.ones(3,5)
C

Matrix-vector product:

In [None]:
A @ b

Vector-matrix product:

In [None]:
b @ C

Matrix-matrix product:

In [None]:
A @ C
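
The shapes of the three products can be checked directly. This sketch redefines the same all-ones tensors so that it runs on its own.

```python
import torch

A = torch.ones(2, 3)
b = torch.ones(3)
C = torch.ones(3, 5)

print((A @ b).shape)  # matrix-vector: torch.Size([2])
print((b @ C).shape)  # vector-matrix: torch.Size([5])
print((A @ C).shape)  # matrix-matrix: torch.Size([2, 5])
```

Because every entry is one and the shared dimension has size 3, every entry of each result is 3.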

The `@` operator works even if its arguments have more than two dimensions, but the semantics can be a little bit confusing. It's probably nicer to use [`einsum`](https://pytorch.org/docs/stable/generated/torch.einsum.html) instead. For example, suppose we needed to compute

$$B_{ba} = \sum_{m=1}^4 \sum_{n=1}^5 l_{bn} A_{anm} r_{bm} \qquad (a=1,2,3; b=1,2).$$

We can do this using a single call to einsum, without looping over $a$, $b$, $m$, or $n$.

In [None]:
A = torch.ones(3,5,4)
l = torch.ones(2,5)
r = torch.ones(2,4)
B = torch.einsum('bn,anm,bm->ba', l, A, r)
B

The arguments `l`, `A`, and `r` are the three tensors being combined, and `B` is the result tensor. The first argument is the instructions for how to do the combination. Each letter acts like an index variable, and internally `einsum` loops over all of them. The `bn` are the indices for `l`, the `anm` are the indices for `A`, the `bm` are the indices for `r`, and the `ba` are the indices for `B`. The computation is always a sum of products; it's equivalent to

In [None]:
B = torch.zeros(2,3)
for a in range(3):
    for b in range(2):
        for m in range(4):
            for n in range(5):
                B[b,a] += l[b,n] * A[a,n,m] * r[b,m]
B
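
With all-ones tensors every entry of `B` is the same, so the result does not reveal whether the indices are wired up correctly. The sketch below repeats the computation with random tensors and compares `einsum` against one possible formulation using `@` and broadcasting. The `@` version is my own illustration, not something the notebook relies on.

```python
import torch

torch.manual_seed(0)
A = torch.randn(3, 5, 4)  # indices a, n, m
l = torch.randn(2, 5)     # indices b, n
r = torch.randn(2, 4)     # indices b, m

B_einsum = torch.einsum('bn,anm,bm->ba', l, A, r)

# Same computation with @ and broadcasting:
# l @ A broadcasts l over the leading dimension of A, giving shape (3, 2, 4),
# indexed [a, b, m]; multiplying by r and summing over m gives [a, b];
# the transpose turns that into [b, a].
B_matmul = ((l @ A) * r).sum(dim=-1).T

print(B_einsum.shape)
print(torch.allclose(B_einsum, B_matmul, atol=1e-5))
```

The `@` version needs a comment to track which axis means what, while the `einsum` string states it directly.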

## Saving and loading

You may want to save a model to disk so you can continue training it or use it later. We save the vocabulary as well so that we preserve the mapping from characters to numbers.

In [None]:
torch.save((parameters, vocab), 'model')

In [None]:
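# weights_only=False is needed on recent PyTorch versions to unpickle the custom Vocab object; only load files you trust.
(parameters_copy, vocab_copy) = torch.load('model', weights_only=False)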

In Colab, however, the saved models won't persist across sessions. See the [Colab docs](https://colab.research.google.com/notebooks/io.ipynb) for some options for persistent storage.
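
An alternative sketch, assuming you are willing to store only plain data: save the logits tensor and the list of vocabulary symbols in a dictionary, instead of pickling a `Vocab` object. The file name, the dictionary keys, and the toy vocabulary below are all made up for illustration.

```python
import os
import tempfile
import torch

logits = torch.zeros(4, requires_grad=True)  # toy parameters
num_to_word = ['<EOS>', '<UNK>', 'a', 'b']   # toy vocabulary

path = os.path.join(tempfile.mkdtemp(), 'model.pt')

# Save only tensors and built-in Python types.
torch.save({'logits': logits.detach(), 'num_to_word': num_to_word}, path)

checkpoint = torch.load(path)
logits_copy = checkpoint['logits'].requires_grad_()  # make it trainable again
num_to_word_copy = checkpoint['num_to_word']

print(logits_copy)
print(num_to_word_copy)
```

The vocabulary object would then need to be rebuilt from the saved list, which means giving `Vocab` a way to be constructed from a list of symbols.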

## Modules

Torch provides a class `torch.nn.Module` that helps you to manage all the parameters of your model, and other associated information, as a single object. However, it does some magic that I find slightly confusing, so I chose not to use it at first.

In [None]:
class Unigram(torch.nn.Module):
    def __init__(self, vocab):
        # Call parent class's __init__(). You will get an error
        # if you forget this.
        super().__init__() 
        
        # Store the vocab inside the Unigram object,
        # so when we save the Unigram, it saves the vocab too.
        self.vocab = vocab
        
        # Create the model parameters. Wrapping it inside
        # Parameter tells Module to manage it for us, so
        # we don't have to set requires_grad or device.
        self.logits = torch.nn.Parameter(
                        torch.normal(mean=0, std=0.01, 
                                     size=(len(vocab),))
                      )
    
    # You can define whatever methods you want,
    # but "forward" is special, so the main method should
    # be named "forward".
    def forward(self):
        return torch.log_softmax(self.logits, dim=0)
    
model = Unigram(vocab)

Now whatever we do with `model` gets done to everything inside it. A Module can even contain other Modules.

In [None]:
# Move all of model's parameters to device
model.to(device) 

In [None]:
# Create an optimizer for all of model's parameters
o = torch.optim.SGD(model.parameters(), lr=0.1)
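
To illustrate that a Module can contain other Modules and that `parameters()` collects everything inside, here is a minimal sketch. The classes `Inner` and `Outer` are hypothetical and unrelated to the assignment.

```python
import torch

class Inner(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.w = torch.nn.Parameter(torch.zeros(3))

class Outer(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.inner = Inner()  # a Module inside a Module
        self.b = torch.nn.Parameter(torch.zeros(2))

outer = Outer()

# named_parameters() walks the whole tree, so the nested parameter shows up too.
names = [name for name, _ in outer.named_parameters()]
print(names)

# An optimizer built from outer.parameters() would therefore update both tensors.
total = sum(p.numel() for p in outer.parameters())
print(total)
```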

In [None]:
# Save/load model to/from disk
torch.save(model, 'model')
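# weights_only=False lets recent PyTorch versions unpickle a whole Module; only load files you trust.
model = torch.load('model', weights_only=False)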
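
A commonly used alternative is to save only the parameter values (the `state_dict`) and load them into a freshly constructed model. The sketch below uses a hypothetical stand-in class, `Tiny`, so that it runs on its own; with `Unigram` you would construct the model from the vocabulary first.

```python
import os
import tempfile
import torch

class Tiny(torch.nn.Module):  # hypothetical stand-in for Unigram
    def __init__(self, n):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.randn(n) * 0.01)

    def forward(self):
        return torch.log_softmax(self.logits, dim=0)

model = Tiny(5)
path = os.path.join(tempfile.mkdtemp(), 'tiny.pt')

# Save only the parameter tensors, not the Python object.
torch.save(model.state_dict(), path)

# Rebuild the model, then copy the saved values into it.
restored = Tiny(5)
restored.load_state_dict(torch.load(path))

print(torch.equal(model.logits.detach(), restored.logits.detach()))
```

Note that a `state_dict` holds only parameters and buffers, so a plain attribute such as `self.vocab` would not be included and would need to be saved separately.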

The `model` object can be called like a function, which invokes `model.forward()`. But `model()` also calls various _hooks_ so it's recommended that you always write `model()`, not `model.forward()`.

In [None]:
model()
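
The difference between `model()` and `model.forward()` can be seen by registering a forward hook. This is a minimal sketch with a hypothetical stand-in class, `Tiny`, and a hook that only records the output shape.

```python
import torch

class Tiny(torch.nn.Module):  # hypothetical stand-in for Unigram
    def __init__(self):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(3))

    def forward(self):
        return torch.log_softmax(self.logits, dim=0)

model = Tiny()
calls = []

def record(module, inputs, output):
    # Called after forward() whenever the module is invoked as model().
    calls.append(tuple(output.shape))

handle = model.register_forward_hook(record)

model()          # goes through __call__, so the hook runs
model.forward()  # bypasses __call__, so the hook does not run

print(len(calls))
handle.remove()  # unregister the hook
```

Only the first call is recorded, which is why calling `model()` is the safer habit.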