# COSE461 Assignment 5
### **Due 11:59 PM, Fri May. 17 (Extended)**

---



# 1. Pretrained Transformer models and knowledge access
You’ll train a Transformer to perform a task that involves accessing knowledge about the world – knowledge which isn’t provided via the task’s training data (at least if you want to generalize outside the training set). You’ll find that it more or less fails entirely at the task. You’ll then learn how to pretrain that Transformer on Wikipedia text that contains world knowledge, and find that finetuning that Transformer on the same knowledge-intensive task enables the model to access some of the knowledge learned at pretraining time. You’ll find that this enables models to perform considerably above chance on a held out development set.

The code you’re provided with is a fork of Andrej Karpathy’s **minGPT**. It’s nicer than most research code in that it’s relatively simple and transparent. The “GPT” in minGPT refers to the Transformer language model of OpenAI.


In [3]:
!rm COSE461_Assignment5 -rf
!git clone https://github.com/ku-dmlab/COSE461_Assignment5.git
import os
os.chdir("/content/COSE461_Assignment5")

Cloning into 'COSE461_Assignment5'...
remote: Enumerating objects: 21, done.[K
remote: Counting objects: 100% (21/21), done.[K
remote: Compressing objects: 100% (17/17), done.[K
remote: Total 21 (delta 3), reused 21 (delta 3), pack-reused 0[K
Receiving objects: 100% (21/21), 206.25 KiB | 2.40 MiB/s, done.
Resolving deltas: 100% (3/3), done.


## Part 1.1 minGPT

**Check out the code below.**


**CODE1. minGPT model**

In [4]:
"""
GPT model:
- the initial stem consists of a combination of token encoding and a positional encoding
- the meat of it is a uniform sequence of Transformer blocks
    - each Transformer is a sequential combination of a 1-hidden-layer MLP block and a self-attention block
    - all blocks feed into a central residual pathway similar to resnets
- the final decoder is a linear projection into a vanilla Softmax classifier
"""

import math
import logging

import torch
import torch.nn as nn
from torch.nn import functional as F

logger = logging.getLogger(__name__)

class GPTConfig:
    """ base GPT config, params common to all GPT versions """
    embd_pdrop = 0.1
    resid_pdrop = 0.1
    attn_pdrop = 0.1

    def __init__(self, vocab_size, block_size, **kwargs):
        self.vocab_size = vocab_size
        self.block_size = block_size
        for k,v in kwargs.items():
            setattr(self, k, v)

class GPT1Config(GPTConfig):
    """ GPT-1 like network roughly 125M params """
    n_layer = 12
    n_head = 12
    n_embd = 768

class CausalSelfAttention(nn.Module):
    """
    A vanilla multi-head masked self-attention layer with a projection at the end.
    It is possible to use torch.nn.MultiheadAttention here but I am including an
    explicit implementation here to show that there is nothing too scary here.
    """

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads
        self.key = nn.Linear(config.n_embd, config.n_embd)
        self.query = nn.Linear(config.n_embd, config.n_embd)
        self.value = nn.Linear(config.n_embd, config.n_embd)
        # regularization
        self.attn_drop = nn.Dropout(config.attn_pdrop)
        self.resid_drop = nn.Dropout(config.resid_pdrop)
        # output projection
        self.proj = nn.Linear(config.n_embd, config.n_embd)
        # causal mask to ensure that attention is only applied to the left in the input sequence
        self.register_buffer("mask", torch.tril(torch.ones(config.block_size, config.block_size))
                                     .view(1, 1, config.block_size, config.block_size))
        self.n_head = config.n_head

    def forward(self, x, layer_past=None):
        B, T, C = x.size()

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        k = self.key(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        q = self.query(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = self.value(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)

        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.mask[:,:,:T,:T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        att = self.attn_drop(att)
        y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side

        # output projection
        y = self.resid_drop(self.proj(y))
        return y

class Block(nn.Module):
    """ an unassuming Transformer block """

    def __init__(self, config):
        super().__init__()
        self.ln1 = nn.LayerNorm(config.n_embd)
        self.ln2 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.mlp = nn.Sequential(
            nn.Linear(config.n_embd, 4 * config.n_embd),
            nn.GELU(),
            nn.Linear(4 * config.n_embd, config.n_embd),
            nn.Dropout(config.resid_pdrop),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x

class GPT(nn.Module):
    """  the full GPT language model, with a context size of block_size """

    def __init__(self, config):
        super().__init__()

        # input embedding stem
        self.tok_emb = nn.Embedding(config.vocab_size, config.n_embd)
        self.pos_emb = nn.Parameter(torch.zeros(1, config.block_size, config.n_embd))
        self.drop = nn.Dropout(config.embd_pdrop)
        # transformer
        self.blocks = nn.Sequential(*[Block(config) for _ in range(config.n_layer)])
        # decoder head
        self.ln_f = nn.LayerNorm(config.n_embd)
        self.head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

        self.block_size = config.block_size
        self.apply(self._init_weights)

        logger.info("number of parameters: %e", sum(p.numel() for p in self.parameters()))

    def get_block_size(self):
        return self.block_size

    def _init_weights(self, module):
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=0.02)
            if isinstance(module, nn.Linear) and module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)

    def configure_optimizers(self, train_config):
        """
        This long function is unfortunately doing something very simple and is being very defensive:
        We are separating out all parameters of the model into two buckets: those that will experience
        weight decay for regularization and those that won't (biases, and layernorm/embedding weights).
        We are then returning the PyTorch optimizer object.
        """

        # separate out all parameters to those that will and won't experience regularizing weight decay
        decay = set()
        no_decay = set()
        whitelist_weight_modules = (torch.nn.Linear, )
        blacklist_weight_modules = (torch.nn.LayerNorm, torch.nn.Embedding)
        for mn, m in self.named_modules():
            for pn, p in m.named_parameters():
                fpn = '%s.%s' % (mn, pn) if mn else pn # full param name

                if pn.endswith('bias'):
                    # all biases will not be decayed
                    no_decay.add(fpn)
                elif pn.endswith('weight') and isinstance(m, whitelist_weight_modules):
                    # weights of whitelist modules will be weight decayed
                    decay.add(fpn)
                elif pn.endswith('weight') and isinstance(m, blacklist_weight_modules):
                    # weights of blacklist modules will NOT be weight decayed
                    no_decay.add(fpn)

        # special case the position embedding parameter in the root GPT module as not decayed
        no_decay.add('pos_emb')

        # validate that we considered every parameter
        param_dict = {pn: p for pn, p in self.named_parameters()}
        inter_params = decay & no_decay
        union_params = decay | no_decay
        assert len(inter_params) == 0, "parameters %s made it into both decay/no_decay sets!" % (str(inter_params), )
        assert len(param_dict.keys() - union_params) == 0, "parameters %s were not separated into either decay/no_decay set!" \
                                                    % (str(param_dict.keys() - union_params), )

        # create the pytorch optimizer object
        optim_groups = [
            {"params": [param_dict[pn] for pn in sorted(list(decay))], "weight_decay": train_config.weight_decay},
            {"params": [param_dict[pn] for pn in sorted(list(no_decay))], "weight_decay": 0.0},
        ]
        optimizer = torch.optim.AdamW(optim_groups, lr=train_config.learning_rate, betas=train_config.betas)
        return optimizer

    def forward(self, idx, targets=None):
        b, t = idx.size()
        assert t <= self.block_size, "Cannot forward, model block size is exhausted."

        # forward the GPT model
        token_embeddings = self.tok_emb(idx) # each index maps to a (learnable) vector
        position_embeddings = self.pos_emb[:, :t, :] # each position maps to a (learnable) vector
        x = self.drop(token_embeddings + position_embeddings)
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.head(x)

        # if we are given some desired targets also calculate the loss
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

        return logits, loss


**CODE2. minGPT trainer**

In [5]:
"""
Simple training loop; Boilerplate that could apply to any arbitrary neural network,
so nothing in this file really has anything to do with GPT specifically.
"""

import math
import logging

from tqdm import tqdm
import numpy as np

import torch
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data.dataloader import DataLoader

logger = logging.getLogger(__name__)

class TrainerConfig:
    # optimization parameters
    max_epochs = 10
    batch_size = 64
    learning_rate = 3e-4
    betas = (0.9, 0.95)
    grad_norm_clip = 1.0
    weight_decay = 0.1 # only applied on matmul weights
    # learning rate decay params: linear warmup followed by cosine decay to 10% of original
    lr_decay = False
    warmup_tokens = 375e6 # these two numbers come from the GPT-3 paper, but may not be good defaults elsewhere
    final_tokens = 260e9 # (at what point we reach 10% of original LR)
    # checkpoint settings
    ckpt_path = None
    num_workers = 0 # for DataLoader

    def __init__(self, **kwargs):
        for k,v in kwargs.items():
            setattr(self, k, v)

class Trainer:

    def __init__(self, model, train_dataset, test_dataset, config):
        self.model = model
        self.train_dataset = train_dataset
        self.test_dataset = test_dataset
        self.config = config

        # take over whatever gpus are on the system
        self.device = 'cpu'
        if torch.cuda.is_available():
            self.device = torch.cuda.current_device()
            self.model = torch.nn.DataParallel(self.model).to(self.device)

    def save_checkpoint(self):
        # DataParallel wrappers keep raw model object in .module attribute
        raw_model = self.model.module if hasattr(self.model, "module") else self.model
        logger.info("saving %s", self.config.ckpt_path)
        torch.save(raw_model.state_dict(), self.config.ckpt_path)

    def train(self):
        model, config = self.model, self.config
        raw_model = model.module if hasattr(self.model, "module") else model
        optimizer = raw_model.configure_optimizers(config)

        def run_epoch(split):
            is_train = split == 'train'
            model.train(is_train)
            data = self.train_dataset if is_train else self.test_dataset
            loader = DataLoader(data, shuffle=True, pin_memory=True,
                                batch_size=config.batch_size,
                                num_workers=config.num_workers)

            losses = []
            pbar = tqdm(enumerate(loader), total=len(loader)) if is_train else enumerate(loader)
            for it, (x, y) in pbar:

                # place data on the correct device
                x = x.to(self.device)
                y = y.to(self.device)

                # forward the model
                with torch.set_grad_enabled(is_train):
                    logits, loss = model(x, y)
                    loss = loss.mean() # collapse all losses if they are scattered on multiple gpus
                    losses.append(loss.item())

                if is_train:

                    # backprop and update the parameters
                    model.zero_grad()
                    loss.backward()
                    torch.nn.utils.clip_grad_norm_(model.parameters(), config.grad_norm_clip)
                    optimizer.step()

                    # decay the learning rate based on our progress
                    if config.lr_decay:
                        self.tokens += (y >= 0).sum() # number of tokens processed this step (i.e. label is not -100)
                        if self.tokens < config.warmup_tokens:
                            # linear warmup
                            lr_mult = float(self.tokens) / float(max(1, config.warmup_tokens))
                        else:
                            # cosine learning rate decay
                            progress = float(self.tokens - config.warmup_tokens) / float(max(1, config.final_tokens - config.warmup_tokens))
                            lr_mult = max(0.1, 0.5 * (1.0 + math.cos(math.pi * progress)))
                        lr = config.learning_rate * lr_mult
                        for param_group in optimizer.param_groups:
                            param_group['lr'] = lr
                    else:
                        lr = config.learning_rate

                    # report progress
                    pbar.set_description(f"epoch {epoch+1} iter {it}: train loss {loss.item():.5f}. lr {lr:e}")

            if not is_train:
                test_loss = float(np.mean(losses))
                logger.info("test loss: %f", test_loss)
                return test_loss

        best_loss = float('inf')
        self.tokens = 0 # counter used for learning rate decay
        for epoch in range(config.max_epochs):

            run_epoch('train')
            if self.test_dataset is not None:
                test_loss = run_epoch('test')

            # supports early stopping based on the test loss, or just save always if no test set is provided
            good_model = self.test_dataset is None or test_loss < best_loss
            if self.config.ckpt_path is not None and good_model:
                best_loss = test_loss
                self.save_checkpoint()


**CODE3. minGPT Utils**

In [6]:
import random
import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def top_k_logits(logits, k):
    v, ix = torch.topk(logits, k)
    out = logits.clone()
    out[out < v[:, [-1]]] = -float('Inf')
    return out

@torch.no_grad()
def sample(model, x, steps, temperature=1.0, sample=False, top_k=None):
    """
    take a conditioning sequence of indices in x (of shape (b,t)) and predict the next token in
    the sequence, feeding the predictions back into the model each time. Clearly the sampling
    has quadratic complexity unlike an RNN that is only linear, and has a finite context window
    of block_size, unlike an RNN that has an infinite context window.
    """
    block_size = model.get_block_size()
    model.eval()
    for k in range(steps):
        x_cond = x if x.size(1) <= block_size else x[:, -block_size:] # crop context if needed
        logits, _ = model(x_cond)
        # pluck the logits at the final step and scale by temperature
        logits = logits[:, -1, :] / temperature
        # optionally crop probabilities to only the top k options
        if top_k is not None:
            logits = top_k_logits(logits, top_k)
        # apply softmax to convert to probabilities
        probs = F.softmax(logits, dim=-1)
        # sample from the distribution or take the most likely
        if sample:
            ix = torch.multinomial(probs, num_samples=1)
        else:
            _, ix = torch.topk(probs, k=1, dim=-1)
        # append to the sequence and continue
        x = torch.cat((x, ix), dim=1)

    return x

def evaluate_places(filepath, predicted_places):
    """ Computes percent of correctly predicted birth places.
    Arguments:
      filepath: path to a file with our name, birth place data.
      predicted_places: a list of strings representing the
          predicted birth place of each person.
    Returns: (total, correct), floats
    """
    with open(filepath) as fin:
        lines = [x.strip().split('\t') for x in fin]
        if len(lines[0]) == 1:
            print('No gold birth places provided; returning (0,0)')
            return (0,0)
        true_places = [x[1] for x in lines]
        total = len(true_places)
        assert total == len(predicted_places)
        correct = len(list(filter(lambda x: x[0] == x[1],
                                  zip(true_places, predicted_places))))
        return (float(total),float(correct))

## Part 1.2 Datasets
**Read   through**   NameDataset**,   our   dataset   for   reading   name-birthplace   pairs.**

The  task  we’ll  be  working  on  with  our  pretrained  models  is  attempting  to  access  the  birth  place  of a   notable   person,   as   written   in   their   Wikipedia   page.    We’ll   think   of   this   as   a   particularly   simple form  of  question  answering:

$Q: \, Where \;was\;[person]\;born?$

$A: \, [place]$

Below, you’ll find the the class NameDataset, which reads a TSV (tab-separated values) file of name/place pairs and produces examples of the above form that we can feed to our Transformer model. To make sure that two datasets (dataset for finetuning and the dataset for pretraining) share the same vocabulary (word-integer mapping), NameDataset gets the pretraining dataset as an input.


--------------
**Vocabulary Specification**

The vocabulary is accessible via two dictionaries:

  `self.stoi`: a dictionary from characters in the vocabulary to indices of type
      int

  `self.itos`: a dictionary from indices of type int to characters in the
      vocabulary
      
Identifier `0` is assigned to the unicode element `u"\u25A1"`. This is the empty_square_character. Further, `self.PAD_CHAR = u"\u25A1"`.

Identifier `1` is assigned to the unicode element `u"\u2047"`. This is the doublequestionmark character, which we'll use as a sentinel to represent that text is missing from the input. Further, `self.MASK_CHAR = u"\u2047"`.

Identifiers `2, ..., len(self.itos)-1` is the sorted list of characters
      that appear in the data argument.

--------------

In [7]:
from torch.utils.data import Dataset

"""
The input-output pairs (x, y) of the NameDataset are of the following form:

  x: Where was Khatchig Mouradian born?⁇Lebanon⁇□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□
  y: □□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□⁇Lebanon⁇□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□
  x: Where was Jacob Henry Studer born?⁇Columbus⁇□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□
  y: □□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□⁇Columbus⁇□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□

Using the PAD_CHAR characters in y before the ⁇[place] keeps the trainer from
optimizing the model to predict the question, "Where was...".

Note that the NameDataset should take the pretraining_dataset defined in run.py
as an input. This is to allow the vocab specification of the NameDataset to be
the same as that of the pretraining dataset.
"""

class NameDataset(Dataset):
    def __init__(self, pretraining_dataset, data):
        self.MASK_CHAR = u"\u2047" # the doublequestionmark character, for mask
        self.PAD_CHAR = u"\u25A1" # the empty square character, for pad
        self.itos = pretraining_dataset.itos
        self.stoi = pretraining_dataset.stoi
        self.block_size = pretraining_dataset.block_size
        self.data = list(data.encode('utf-8').decode('ascii', errors='ignore').split('\n'))

    def __len__(self):
        # returns the length of the dataset
        return len(self.data) - 1

    def __getitem__(self, idx):
        inp, oup = self.data[idx].split('\t')
        x = inp + self.MASK_CHAR + oup + self.MASK_CHAR
        x = x + self.PAD_CHAR*(self.block_size - len(x))
        y = self.PAD_CHAR*(len(inp)-1) + x[len(inp):]

        x = x[:-1]
        x = torch.tensor([self.stoi[c] for c in x], dtype=torch.long)
        y = torch.tensor([self.stoi[c] for c in y], dtype=torch.long)
        return x, y

On the other hand, in the below, we have a dataset for pretraining, to train the model with span corruption objective.

--------------
**Masking Specification**

The `__getitem__` function takes an index and returns a data point $(x, y)$ where $x$ and $y$ are Long tensors of length `self.block_size`. $x$ encodes the input sequence, and $y$ encodes the output sequence.

0. Use the idx argument of `__getitem__` to retrieve the element of `self.data`
at the given index. We'll call the resulting data entry a document.

1. Truncate the document to a length no less than `4` characters,
and no more than `int(self.block_size*7/8)` characters, where the length is picked *randomly*.

2. Now, break the (truncated) document into three substrings:
    
    `[prefix] [masked_content] [suffix]`

  In other words, choose three strings `prefix`, `masked_content` and `suffix`
  such that `prefix + masked_content + suffix = [the original document]`.
  The length of `[masked_content]` is random, and `1/4` the length of the truncated document on average.

3. Rearrange these substrings into the following form:
    `[prefix] MASK_CHAR [suffix] MASK_CHAR [masked_content] [pads]`
  
  This resulting string, denoted `masked_string`, serves as the output example.
  Here `MASK_CHAR` is the masking character defined in Vocabulary Specification,
  and `[pads]` is a string of repeated `PAD_CHAR` characters chosen so that the
  entire string is of length `self.block_size`.

  Intuitively, the `[masked_content]`, a string, is removed from the document and replaced with `MASK_CHAR` (the masking character defined in Vocabulary
  Specification). After the suffix of the string, the `MASK_CHAR` is seen again,
  followed by the content that was removed, and the padding characters.

4. We now use `masked_string` to construct the input and output example pair. To
do so, simply take the input string to be `masked_string[:-1]`, and the output
string to be `masked_string[1:]`. In other words, for each character, the goal is to predict the next character in the masked string.

5. Making use of the vocabulary defined, encode the resulting input
and output strings as Long tensors and return the resulting data point.

--------------

In [18]:
class CharCorruptionDataset(Dataset):
    def __init__(self, data, block_size):
        self.MASK_CHAR = u"\u2047"  # the doublequestionmark character, for mask
        self.PAD_CHAR = u"\u25A1"  # the empty square character, for pad

        chars = list(sorted(list(set(data))))
        assert self.MASK_CHAR not in chars
        assert self.PAD_CHAR not in chars
        chars.insert(0, self.MASK_CHAR)
        chars.insert(0, self.PAD_CHAR)

        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for i, ch in enumerate(chars)}

        data_size, vocab_size = len(data), len(chars)
        print('data has %d characters, %d unique.' % (data_size, vocab_size))

        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = data.split('\n')
        if len(self.data[-1]) == 0:
            self.data = self.data[:-1]

    def __len__(self):
        # returns the length of the dataset
        return len(self.data)

    def __getitem__(self, idx):
        document = self.data[idx]

        # 1. randomly truncate to [4, 7/8 * block_size]
        doc_len = len(document)
        truncate_len = random.randint(4, int(self.block_size * 7 / 8))
        truncate_len = min(doc_len, truncate_len)
        truncated_doc = document[:truncate_len]

        ### YOUR CODE HERE (~2 Lines)
        # 2. break to [prefix] [masked_content] [suffix]
        # You should assign values to prefix_len and masked_len.

        prefix_len = random.randint(1, truncate_len // 2)
        masked_len = min(random.randint(1, truncate_len - prefix_len), truncate_len - prefix_len - 1)

        ### END YOUR CODE

        prefix = truncated_doc[:prefix_len]
        masked_content = truncated_doc[prefix_len:prefix_len + masked_len]
        suffix = truncated_doc[prefix_len + masked_len:]

        ### YOUR CODE HERE (~2 Lines)
        # 3. rearrange to masked_string: [prefix] MASK_CHAR [suffix] MASK_CHAR [masked_content] [pads]
        # You should concat strings into masked_string.

        total_len = self.block_size - 1  # reserve space for the final MASK_CHAR
        masked_string = prefix + self.MASK_CHAR + suffix + self.MASK_CHAR + masked_content + self.PAD_CHAR * (total_len - len(prefix + self.MASK_CHAR + suffix + self.MASK_CHAR + masked_content))


        ### END YOUR CODE

        # 4. input = masked_string[:-1], output = masked_string[1:]
        x = masked_string[:-1]
        y = masked_string[1:]

        # 5. encode to Long tensors
        x = torch.LongTensor([self.stoi[c] for c in x])
        y = torch.LongTensor([self.stoi[c] for c in y])
        return x, y


To  get  a  sense  of  the  examples  we’ll  be  working  with,  if  you  run  the  following  code,  it’ll  load  your NameDataset  on  the  training  set  `birth_places_train.tsv`  and  print  out  a  few  examples.

In [19]:
corruption_dataset = CharCorruptionDataset(open('wiki.txt', encoding='utf-8').read(), 128)
# Make the name dataset
name_dataset = NameDataset(corruption_dataset, open('birth_places_train.tsv', encoding='utf-8').read())

for _, example in zip(range(4), name_dataset):
    x, y = example
    print('x:', ''.join([name_dataset.itos[int(c)] for c in x]))
    print('y:', ''.join([name_dataset.itos[int(c)] for c in y]))

data has 418352 characters, 256 unique.
x: Where was Khatchig Mouradian born?⁇Lebanon⁇□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□
y: □□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□⁇Lebanon⁇□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□
x: Where was Jacob Henry Studer born?⁇Columbus⁇□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□
y: □□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□⁇Columbus⁇□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□
x: Where was John Stephen born?⁇Glasgow⁇□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□
y: □□□□□□□□□□□□□□□□□□□□□□□□□□□⁇Glasgow⁇□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□
x: Where was Georgina Willis born?⁇Australia⁇□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□
y: □□□□□□□□□□□□□□□□□□□□□□□□□□□□□□⁇Australia

The below code loads CharCorruptionDataset on the training set `wiki.txt` and print out a few examples.

In [20]:
corruption_dataset = CharCorruptionDataset(open('wiki.txt', encoding='utf-8').read(), 128)
for _, example in zip(range(4), corruption_dataset):
    x, y = example
    print('x:', ''.join([name_dataset.itos[int(c)] for c in x]))
    print('y:', ''.join([name_dataset.itos[int(c)] for c in y]))

data has 418352 characters, 256 unique.
x: Khatchig Mour⁇h⁇adian. Khatc□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□
y: hatchig Mour⁇h⁇adian. Khatc□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□
x: J⁇o⁇acob Henry Studer. Jacob Henry Studer (26 February 1840 C□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□
y: ⁇o⁇acob Henry Studer. Jacob Henry Studer (26 February 1840 C□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□
x: John Ste⁇rn⁇phen. Bo□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□
y: ohn Ste⁇rn⁇phen. Bo□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□
x: Ge⁇ Willis. Geor⁇orgina□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□
y: e⁇ Willis. Geor⁇orgina□□□□□□□□□□□□□□□□□□□□□□□□□

## Part 1.3 Fine-Tuning



**Make predictions (without pretraining).**

The below is the code to *fine_tune* and *evaluate* a model.

In [11]:
def finetune(writing_params_path, reading_params_path=None, finetune_corpus_path='birth_places_train.tsv', pretrain_corpus_path="wiki.txt"):
    # Save the device
    device = torch.cuda.current_device() if torch.cuda.is_available() else 'cpu'

    # Keep the block size 128
    # Why is the pretraining corpus always required (even if we're not pretraining?)
    # It's because we're using it as a hack to always have the same vocabulary
    # (that is, the same mapping from character to integer, and we build the
    # vocab from the pretraining corpus.)
    block_size = 128
    text = open(pretrain_corpus_path).read()
    pretrain_dataset = CharCorruptionDataset(text, block_size)

    # We don't suggest you change these hyperparameters, as they're known to work.
    mconf = GPTConfig(pretrain_dataset.vocab_size, pretrain_dataset.block_size,
        n_layer=4, n_head=8, n_embd=256)
    model = GPT(mconf)

    if reading_params_path is not None:
        model.load_state_dict(torch.load(reading_params_path))
    tconf = TrainerConfig(max_epochs=75,
                          batch_size=256,
                          learning_rate=6e-4,
                          lr_decay=True,
                          warmup_tokens=512 * 20,
                          final_tokens=200 * len(pretrain_dataset) * block_size,
                          num_workers=4)
    text = open(finetune_corpus_path, 'r').read()
    train_dataset = NameDataset(pretrain_dataset, text)
    trainer = Trainer(model, train_dataset, None, tconf)
    trainer.train()
    # save to writing_params_path
    torch.save(model.state_dict(), writing_params_path)


This is the `evaluation` code, which samples predictions from the trained model and calls `evaluate_places()` to get the total percentage of correct place
predictions.

In [12]:
def evaluate(outputs_path, reading_params_path, eval_corpus_path, pretrain_corpus_path="wiki.txt"):
    device = torch.cuda.current_device() if torch.cuda.is_available() else 'cpu'
    block_size = 128
    text = open(pretrain_corpus_path).read()
    pretrain_dataset = CharCorruptionDataset(text, block_size)
    mconf = GPTConfig(pretrain_dataset.vocab_size, pretrain_dataset.block_size,
        n_layer=4, n_head=8, n_embd=256)
    model = GPT(mconf).to(device)

    model.load_state_dict(torch.load(reading_params_path))
    correct = 0
    total = 0
    with open(outputs_path, 'w') as fout:
        predictions = []
        for line in tqdm(open(eval_corpus_path)):
            x = line.split('\t')[0]
            x = x + '⁇'
            x = torch.tensor([pretrain_dataset.stoi[s] for s in x], dtype=torch.long)[None,...].to(device)
            pred = sample(model, x, 32, sample=False)[0]
            completion = ''.join([pretrain_dataset.itos[int(i)] for i in pred])
            pred = completion.split('⁇')[1]
            predictions.append(pred)
            fout.write(pred + '\n')
        total, correct = evaluate_places(eval_corpus_path, predictions)
    if total > 0:
        print('Correct: {} out of {}: {}%'.format(correct, total, correct/total*100))
    else:
        print('Predictions written to {}; no targets provided'
                .format(outputs_path))

Now you can fine-tune the model by running the below: it shouldn't take more than 10 minutes (given that you are using GPU).

In [22]:
#Train on the names dataset
finetune('model.params')

data has 418352 characters, 256 unique.


epoch 1 iter 7: train loss 0.99342. lr 5.999844e-04: 100%|██████████| 8/8 [00:02<00:00,  3.07it/s]
epoch 2 iter 7: train loss 0.55793. lr 5.999351e-04: 100%|██████████| 8/8 [00:02<00:00,  3.00it/s]
epoch 3 iter 7: train loss 0.42747. lr 5.998520e-04: 100%|██████████| 8/8 [00:02<00:00,  2.98it/s]
epoch 4 iter 7: train loss 0.29548. lr 5.997351e-04: 100%|██████████| 8/8 [00:02<00:00,  3.00it/s]
epoch 5 iter 7: train loss 0.26525. lr 5.995844e-04: 100%|██████████| 8/8 [00:02<00:00,  2.91it/s]
epoch 6 iter 7: train loss 0.23336. lr 5.993999e-04: 100%|██████████| 8/8 [00:02<00:00,  2.83it/s]
epoch 7 iter 7: train loss 0.22995. lr 5.991818e-04: 100%|██████████| 8/8 [00:03<00:00,  2.63it/s]
epoch 8 iter 7: train loss 0.21827. lr 5.989299e-04: 100%|██████████| 8/8 [00:02<00:00,  2.85it/s]
epoch 9 iter 7: train loss 0.19803. lr 5.986444e-04: 100%|██████████| 8/8 [00:02<00:00,  2.86it/s]
epoch 10 iter 7: train loss 0.18290. lr 5.983252e-04: 100%|██████████| 8/8 [00:02<00:00,  2.92it/s]
epoch 11 

In [23]:
# Evaluate on the dev set, writing out predictions
evaluate(outputs_path="nopretrain.dev.predictions", reading_params_path="model.params", eval_corpus_path="birth_dev.tsv")

data has 418352 characters, 256 unique.


500it [00:55,  9.07it/s]

Correct: 3.0 out of 500.0: 0.6%





Why are the predictions so bad?

## Part 1.4 Pretraining
**Make predictions (with pretraining).**

The below is the code to *pretrain* a model.

In [14]:
def pretrain(writing_params_path, pretrain_corpus_path="wiki.txt"):
    # Save the device
    device = torch.cuda.current_device() if torch.cuda.is_available() else 'cpu'

    # Keep the block size 128
    # Why is the pretraining corpus always required (even if we're not pretraining?)
    # It's because we're using it as a hack to always have the same vocabulary
    # (that is, the same mapping from character to integer, and we build the
    # vocab from the pretraining corpus.)
    block_size = 128
    text = open(pretrain_corpus_path).read()
    pretrain_dataset = CharCorruptionDataset(text, block_size)

    # We don't suggest you change these hyperparameters, as they're known to work.
    mconf = GPTConfig(pretrain_dataset.vocab_size, pretrain_dataset.block_size,
        n_layer=4, n_head=8, n_embd=256)
    model = GPT(mconf)

    tconf = TrainerConfig(max_epochs=650,
                          batch_size=128,
                          learning_rate=6e-3,
                          lr_decay=True,
                          warmup_token=512 * 20,
                          final_tokens=200 * len(pretrain_dataset) * block_size,
                          num_workers=4)
    trainer = Trainer(model, pretrain_dataset, None, tconf)
    trainer.train()
    torch.save(model.state_dict(), writing_params_path)

Pretrain your model on `wiki.txt` (which should take approximately two hours), finetune it on `NameDataset` and evaluate it.

In [15]:
# Pretrain the model
pretrain("pretrain.params")

data has 418352 characters, 256 unique.


  self.pid = os.fork()
  self.pid = os.fork()
epoch 1 iter 22: train loss 3.36057. lr 5.920992e-06: 100%|██████████| 23/23 [00:05<00:00,  4.58it/s]
epoch 2 iter 22: train loss 2.97894. lr 1.184198e-05: 100%|██████████| 23/23 [00:03<00:00,  6.46it/s]
epoch 3 iter 22: train loss 2.46046. lr 1.776298e-05: 100%|██████████| 23/23 [00:03<00:00,  6.84it/s]
epoch 4 iter 22: train loss 2.19758. lr 2.368397e-05: 100%|██████████| 23/23 [00:03<00:00,  6.83it/s]
epoch 5 iter 22: train loss 2.09207. lr 2.960496e-05: 100%|██████████| 23/23 [00:03<00:00,  6.71it/s]
epoch 6 iter 22: train loss 1.81964. lr 3.552595e-05: 100%|██████████| 23/23 [00:03<00:00,  6.65it/s]
epoch 7 iter 22: train loss 1.84260. lr 4.144694e-05: 100%|██████████| 23/23 [00:03<00:00,  6.88it/s]
epoch 8 iter 22: train loss 1.67697. lr 4.736794e-05: 100%|██████████| 23/23 [00:03<00:00,  6.79it/s]
epoch 9 iter 22: train loss 1.56110. lr 5.328893e-05: 100%|██████████| 23/23 [00:03<00:00,  6.61it/s]
epoch 10 iter 22: train loss 1.48077

In [16]:
# Finetune the model
finetune('finetune.params', reading_params_path="pretrain.params")

data has 418352 characters, 256 unique.


epoch 1 iter 7: train loss 0.11269. lr 5.999844e-04: 100%|██████████| 8/8 [00:02<00:00,  3.09it/s]
epoch 2 iter 7: train loss 0.06542. lr 5.999351e-04: 100%|██████████| 8/8 [00:02<00:00,  3.13it/s]
epoch 3 iter 7: train loss 0.04771. lr 5.998520e-04: 100%|██████████| 8/8 [00:02<00:00,  3.12it/s]
epoch 4 iter 7: train loss 0.04401. lr 5.997351e-04: 100%|██████████| 8/8 [00:02<00:00,  3.02it/s]
epoch 5 iter 7: train loss 0.03792. lr 5.995844e-04: 100%|██████████| 8/8 [00:02<00:00,  3.05it/s]
epoch 6 iter 7: train loss 0.03245. lr 5.993999e-04: 100%|██████████| 8/8 [00:02<00:00,  2.98it/s]
epoch 7 iter 7: train loss 0.03073. lr 5.991818e-04: 100%|██████████| 8/8 [00:02<00:00,  3.00it/s]
epoch 8 iter 7: train loss 0.02710. lr 5.989299e-04: 100%|██████████| 8/8 [00:02<00:00,  2.97it/s]
epoch 9 iter 7: train loss 0.02331. lr 5.986444e-04: 100%|██████████| 8/8 [00:02<00:00,  2.85it/s]
epoch 10 iter 7: train loss 0.02092. lr 5.983252e-04: 100%|██████████| 8/8 [00:02<00:00,  2.86it/s]
epoch 11 

In [17]:
# Evaluate on the dev set
evaluate(outputs_path="pretrain.dev.predictions", reading_params_path="finetune.params", eval_corpus_path="birth_dev.tsv")

data has 418352 characters, 256 unique.


500it [00:55,  9.08it/s]

Correct: 106.0 out of 500.0: 21.2%





We expect the dev accuracy will be at least 10%.

In [24]:
from google.colab import drive, files
from requests import get
from urllib.parse import unquote

drive.mount('/mnt/')
filename = 'Assignment5.ipynb'
filepath = f'/mnt/My Drive/Colab Notebooks/{filename}'
output_file = '/content/Assignment5.html'

!jupyter nbconvert '{filepath}' --output '{output_file}' --to 'html'
files.download(output_file)

Mounted at /mnt/
[NbConvertApp] Converting notebook /mnt/My Drive/Colab Notebooks/Assignment5.ipynb to html
[NbConvertApp] Writing 812816 bytes to /content/Assignment5.html


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>