E03: use the dev set to tune the strength of smoothing (or regularization) for the trigram model - i.e. try many possibilities and see which one works best based on the dev set loss. What patterns can you see in the train and dev set loss as you tune this strength? Take the best setting of the smoothing and evaluate on the test set once and at the end. How good of a loss do you achieve?

In [23]:
import torch
import torch.nn.functional as F
from math import floor

In [24]:
# Read in the names
words = open('../names.txt', 'r').read().splitlines()

In [25]:
# Create training, dev, and test sets
train_index = floor(len(words) * 0.8)
dev_index = floor(len(words) * 0.9)

train = words[:train_index]
dev = words[train_index:dev_index]
test = words[dev_index:]
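
The slices above take the first 80% of the file as training data without shuffling. If `names.txt` is ordered in any way (an assumption, not something checked in this notebook), the dev and test sets would come from a different distribution than the training set. Below is a minimal sketch of a shuffled split. The helper name `split_words`, the seed, and the stand-in data are hypothetical, and the losses recorded in this notebook come from the unshuffled split.

```python
import random
from math import floor

def split_words(words, train_frac=0.8, dev_frac=0.1, seed=42):
    """Shuffle a copy of `words` and split it into train/dev/test lists."""
    shuffled = list(words)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    train_index = floor(len(shuffled) * train_frac)
    dev_index = floor(len(shuffled) * (train_frac + dev_frac))
    return (shuffled[:train_index],
            shuffled[train_index:dev_index],
            shuffled[dev_index:])

# Stand-in data; the notebook would pass the list read from names.txt
demo = [f'name{i}' for i in range(20)]
train_demo, dev_demo, test_demo = split_words(demo)
print(len(train_demo), len(dev_demo), len(test_demo))
```

Seeding a private `random.Random` instance keeps the split reproducible without touching the global random state.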

In [26]:
chars = sorted(list(set(''.join(['.'] + words))))

# Create look up tables for the alphabet
  # stoi = string to index
  # itos = index to string
stoi = {s:i for i, s in enumerate(chars)}
itos = {i:s for s, i in stoi.items()}

In [27]:
# Create the training data
# input xs: (ch1, ch2)
# prediction ys: ch3
xs_train, ys_train = [], []

for word in train:

  # prepend two special characters and append one special character to each word
  chs = ['.'] * 2 + list(word) + ['.']
  # Example for 'anna':
  # zip(chs, chs[1:], chs[2:]) = 
  # [('.', '.', 'a'), ('.', 'a', 'n'), ('a', 'n', 'n'), ('n', 'n', 'a'), ('n', 'a', '.')]
  for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
    ix1 = stoi[ch1]
    ix2 = stoi[ch2]
    ix3 = stoi[ch3]
    xs_train.append((ix1, ix2))
    ys_train.append(ix3)

num = len(xs_train)
print('number of training examples: ', num)
xs_train = torch.tensor(xs_train)
ys_train = torch.tensor(ys_train)

number of training examples:  182778


In [28]:
# Create the development data
xs_dev, ys_dev = [], []
for word in dev:
  chs = ['.'] * 2 + list(word) + ['.']
  for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
    ix1 = stoi[ch1]
    ix2 = stoi[ch2]
    ix3 = stoi[ch3]
    xs_dev.append((ix1, ix2))
    ys_dev.append(ix3)

xs_dev = torch.tensor(xs_dev)
ys_dev = torch.tensor(ys_dev)

In [29]:
# Create the test data
xs_test, ys_test = [], []
for word in test:
  chs = ['.'] * 2 + list(word) + ['.']
  for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
    ix1 = stoi[ch1]
    ix2 = stoi[ch2]
    ix3 = stoi[ch3]
    xs_test.append((ix1, ix2))
    ys_test.append(ix3)

xs_test = torch.tensor(xs_test)
ys_test = torch.tensor(ys_test)
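
The three cells above repeat the same loop for the training, dev, and test words. One possible refactor is a single helper, sketched here under the hypothetical name `build_trigrams`. It returns plain Python lists so the sketch runs without PyTorch; in the notebook the results would be wrapped in `torch.tensor(...)` as before.

```python
def build_trigrams(words, stoi):
    """Return (xs, ys): xs holds (ix1, ix2) context pairs, ys the next-character index."""
    xs, ys = [], []
    for word in words:
        # two start markers and one end marker, as in the cells above
        chs = ['.'] * 2 + list(word) + ['.']
        for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
            xs.append((stoi[ch1], stoi[ch2]))
            ys.append(stoi[ch3])
    return xs, ys

# Tiny stand-in vocabulary built from a single word
demo_words = ['anna']
demo_chars = sorted(set(''.join(['.'] + demo_words)))
demo_stoi = {s: i for i, s in enumerate(demo_chars)}

xs_demo, ys_demo = build_trigrams(demo_words, demo_stoi)
print(xs_demo)
print(ys_demo)
```

A word of length n yields n + 1 examples, so 'anna' produces five (context, target) pairs.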

In [30]:
g = torch.Generator().manual_seed(2147483647)
# weight matrix with 54 input nodes and 27 output nodes
W = torch.randn((27*2, 27), generator=g, requires_grad=True)
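
The 54 input rows come from concatenating two 27-way one-hot vectors, one per context character. Multiplying that flattened encoding by `W` therefore selects and adds exactly two rows of `W`. The sketch below checks this on a few made-up index pairs; the example indices are arbitrary and not taken from the dataset.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27 * 2, 27), generator=g)

# Three arbitrary (ix1, ix2) context pairs
xs = torch.tensor([[0, 0], [0, 1], [1, 14]])

# Route 1: one-hot encode, flatten to 54 columns, matrix-multiply
xenc_flat = F.one_hot(xs, num_classes=27).float().flatten(1)
logits_matmul = xenc_flat @ W

# Route 2: index the two rows directly; the second character uses rows 27..53
logits_index = W[xs[:, 0]] + W[27 + xs[:, 1]]

print(torch.allclose(logits_matmul, logits_index))
```

The indexing route avoids materializing the one-hot matrix, which matters more as the training set grows.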

In [31]:
# gradient descent
iterations = 100
learning_rate = 50

for k in range(iterations):

  # forward pass
  xenc= F.one_hot(xs_train, num_classes=27).float()
  xenc_flat = xenc.flatten(1) # flatten the one-hot encoded input vector
  logits = xenc_flat @ W # predict log-counts
  # softmax
  counts = logits.exp() # counts
  probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
  # loss function (cross-entropy) + regularization (L2)
  loss = -probs[torch.arange(num), ys_train].log().mean() + 0.01*(W**2).mean()
  # print loss every 10% of iterations
  if k % floor(iterations/10) == 0:
    print(f'loss at step {k}: {loss.item():.3f}')

  # backward pass
  W.grad = None # flush the gradients
  loss.backward()

  # update step
  W.data += -learning_rate * W.grad
print(f'final training loss: {loss.item():.3f}')

loss at step 0: 4.249
loss at step 10: 2.595
loss at step 20: 2.483
loss at step 30: 2.429
loss at step 40: 2.417
loss at step 50: 2.388
loss at step 60: 2.393
loss at step 70: 2.371
loss at step 80: 2.381
loss at step 90: 2.361
final training loss: 2.377
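
The training loop computes the softmax by hand with `logits.exp()`, which overflows in float32 once a logit grows large. `F.cross_entropy` computes the same mean negative log-likelihood from raw logits in a numerically stable way. The sketch below compares the two on random logits; the shapes and targets are made up for illustration.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
logits = torch.randn((5, 27), generator=g)
targets = torch.tensor([0, 3, 7, 26, 1])

def manual_nll(logits, targets):
    # hand-written softmax followed by mean negative log-likelihood
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True)
    return -probs[torch.arange(len(targets)), targets].log().mean()

manual = manual_nll(logits, targets)
builtin = F.cross_entropy(logits, targets)
print(torch.allclose(manual, builtin, atol=1e-5))

# Shifting every logit by a constant leaves the softmax unchanged in exact
# arithmetic, but exp(1000) overflows to inf in float32
big = logits + 1000.0
print(manual_nll(big, targets))        # nan, from inf / inf
print(F.cross_entropy(big, targets))   # finite
```

With the notebook's learning rate of 50 the weights can grow quickly, so the stable form is a reasonable default.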


In [32]:
# Evaluate the model on the dev set
xenc= F.one_hot(xs_dev, num_classes=27).float()
xenc_flat = xenc.flatten(1) # flatten the one-hot encoded input vector
logits = xenc_flat @ W # predict log-counts
# softmax
counts = logits.exp() # counts
probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
# loss function (cross-entropy) + regularization (L2); the reported dev loss includes the penalty term
loss = -probs[torch.arange(len(xs_dev)), ys_dev].log().mean() + 0.01*(W**2).mean()
print(f'loss on dev set: {loss.item():.3f}')

loss on dev set: 2.576


In [33]:
# Evaluate the model on the test set
xenc= F.one_hot(xs_test, num_classes=27).float()
xenc_flat = xenc.flatten(1) # flatten the one-hot encoded input vector
logits = xenc_flat @ W # predict log-counts
# softmax
counts = logits.exp() # counts
probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
# loss function (cross-entropy) + regularization (L2); the reported test loss includes the penalty term
loss = -probs[torch.arange(len(xs_test)), ys_test].log().mean() + 0.01*(W**2).mean()
print(f'loss on test set: {loss.item():.3f}')

loss on test set: 2.593


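The model performs worse on the dev and test sets than on the training set, with a gap of roughly $0.2$. All three values include the L2 penalty term `0.01*(W**2).mean()`, and only one regularization strength ($0.01$) was trained, so these numbers describe that single setting rather than a tuned one.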

- loss on training set after $100$ iterations: $2.377$
- loss on dev set: $2.576$
- loss on test set: $2.593$
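
The exercise asks for the regularization strength to be chosen on the dev set. A minimal sketch of such a sweep follows. It runs on a tiny made-up word list so that it is self-contained; the helper names `build`, `nll`, and `train_trigram`, the grid of strengths, the learning rate, and the step count are all assumptions, not values used in this notebook. Train and dev losses are reported without the penalty term so that settings with different strengths stay comparable.

```python
import torch
import torch.nn.functional as F

# Tiny stand-in corpus (hypothetical); the notebook uses names.txt instead
toy_train = ['anna', 'emma', 'ava', 'mia', 'nina', 'maya']
toy_dev = ['amy', 'ana', 'mina']

chars = sorted(set(''.join(['.'] + toy_train + toy_dev)))
stoi = {s: i for i, s in enumerate(chars)}
V = len(chars)  # vocabulary size (27 for the full names dataset)

def build(words):
    xs, ys = [], []
    for w in words:
        chs = ['.'] * 2 + list(w) + ['.']
        for c1, c2, c3 in zip(chs, chs[1:], chs[2:]):
            xs.append((stoi[c1], stoi[c2]))
            ys.append(stoi[c3])
    return torch.tensor(xs), torch.tensor(ys)

def nll(W, xs, ys):
    # cross-entropy only, without the L2 penalty
    xenc_flat = F.one_hot(xs, num_classes=V).float().flatten(1)
    return F.cross_entropy(xenc_flat @ W, ys)

def train_trigram(xs, ys, reg, steps=300, lr=1.0):
    # same initialization for every strength, so only `reg` differs between runs
    g = torch.Generator().manual_seed(2147483647)
    W = torch.randn((V * 2, V), generator=g, requires_grad=True)
    for _ in range(steps):
        loss = nll(W, xs, ys) + reg * (W ** 2).mean()
        W.grad = None
        loss.backward()
        with torch.no_grad():
            W -= lr * W.grad
    return W.detach()

xs_tr, ys_tr = build(toy_train)
xs_dv, ys_dv = build(toy_dev)

results = {}
for reg in [0.0, 0.001, 0.01, 0.1, 1.0]:
    W = train_trigram(xs_tr, ys_tr, reg)
    results[reg] = (nll(W, xs_tr, ys_tr).item(), nll(W, xs_dv, ys_dv).item())
    print(f'reg={reg}: train {results[reg][0]:.3f}, dev {results[reg][1]:.3f}')

# pick the strength with the lowest dev loss; only then touch the test set
best_reg = min(results, key=lambda r: results[r][1])
print('best reg on dev:', best_reg)
```

On the real data, the model retrained with `best_reg` would be evaluated on the test set once, at the very end.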