### Imports

In [2]:
import numpy as np
from tqdm import tqdm
from utils import load_encoder_hparams_and_params
import fire


### Encoder
The `encoder` object is the BPE (Byte Pair Encoding) tokenizer used by GPT-2. The `encoder.json` file is one long dictionary with entries of the form `"<token>": <token_ID>` that maps words/subwords to token IDs, for example: 
> `605, "\u0120them": 606, "\u0120her": 607, "ount": 608, "\u0120Ch": ...`

Note that `\u0120` (the character `Ġ`) stands in for the space character: 
- `"\u0120them"` $\to$ `" them"`
- `"\u0120her"` $\to$ `" her"`
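As a quick check of this convention, the snippet below undoes the marker by hand. It is only an illustration under a simplifying assumption: the real `encoder.decode` reverses a full byte-to-unicode table rather than doing a single string replacement.

```python
# "\u0120" is the character "Ġ" (code point 288), which stands in for a space
marker = "\u0120"
tokens = ["\u0120them", "\u0120her", "ount"]

# replace the marker by an actual space to get the human-readable token text
readable = [t.replace(marker, " ") for t in tokens]
print(readable)  # [' them', ' her', 'ount']
```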

`encoder.py` implements the translation from text prompt to token IDs using this mapping. 

```python
>>> ids = encoder.encode("Not all heroes wear capes.")
>>> ids
[3673, 477, 10281, 5806, 1451, 274, 13]

>>> encoder.decode(ids)
"Not all heroes wear capes."
```

`encoder.decoder` holds the vocabulary, so its length is the vocabulary size: 
```python
>>> len(encoder.decoder)
50257
```

Training the tokenizer determines how strings are broken down into tokens. When we load the tokenizer, we load the already-trained vocabulary (`encoder.json`) and byte-pair merges (`vocab.bpe`). The byte-pair merges are applied in order to build tokens from the characters of a text prompt: 
```
('l', 'o') → 'lo'
('lo', 'w') → 'low'
('e', 'r') → 'er'
('e', 's') → 'es'
```
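To make the role of the merge list concrete, here is a minimal sketch of applying ranked merges to a single word. The function name `apply_merges` and the toy merge list are hypothetical; the real `encoder.py` additionally works on bytes, pre-splits the text with a regular expression, and caches results.

```python
def apply_merges(word, merges):
    """Greedily apply BPE merges to one word; earlier merges have higher priority."""
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # pick the adjacent pair with the best (lowest) rank
        best = min(pairs, key=lambda p: rank.get(p, float("inf")))
        if best not in rank:
            break  # no adjacent pair is mergeable any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# toy merge list for illustration only, not taken from vocab.bpe
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("e", "s")]
print(apply_merges("lower", toy_merges))  # ['low', 'er']
```

Each resulting symbol would then be looked up in the vocabulary to obtain its token ID.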

### Hyperparameters
`hparams.json` is a JSON file, loaded as a dictionary, that contains the hyperparameters for our model: 
```
{
  "n_vocab": 50257, # number of tokens in our vocabulary
  "n_ctx": 1024,    # maximum possible sequence length of the input 
  "n_embd": 768,    # dimension of embeddings
  "n_head": 12,     # number of attention heads
  "n_layer": 12     # number of layers (depth)
}
```

### Parameters
`params` holds the trained weights of our model. If we print `params`, replacing the weight arrays with their shapes, we get: 
```
{
    "wpe": [1024, 768], # positional embedding matrix for 1024 positions 
    "wte": [50257, 768], # token embedding matrix: each of the 50257 tokens is mapped to a 768d vector
    "ln_f": {"b": [768], "g": [768]}, # bias and scaling factors of the final LayerNorm 
    "blocks": [
        {
            "attn": {
                "c_attn": {"b": [2304], "w": [768, 2304]}, # combined query, key, value projection of size 768x2304 (3 * 768 = 2304)
                "c_proj": {"b": [768], "w": [768, 768]}, # projection matrix after attention 
            },
            "ln_1": {"b": [768], "g": [768]}, # LayerNorm
            "ln_2": {"b": [768], "g": [768]},
            "mlp": {
                "c_fc": {"b": [3072], "w": [768, 3072]}, # weights and bias for FFN
                "c_proj": {"b": [768], "w": [3072, 768]}, # maps back to 768d
            },
        },
        ... # repeat for n_layer blocks
    ]
}
```
For reference, here are the shapes of `params` again, with the numbers replaced by the corresponding `hparams` names:
```
{
    "wpe": [n_ctx, n_embd],
    "wte": [n_vocab, n_embd],
    "ln_f": {"b": [n_embd], "g": [n_embd]},
    "blocks": [
        {
            "attn": {
                "c_attn": {"b": [3*n_embd], "w": [n_embd, 3*n_embd]},
                "c_proj": {"b": [n_embd], "w": [n_embd, n_embd]},
            },
            "ln_1": {"b": [n_embd], "g": [n_embd]},
            "ln_2": {"b": [n_embd], "g": [n_embd]},
            "mlp": {
                "c_fc": {"b": [4*n_embd], "w": [n_embd, 4*n_embd]},
                "c_proj": {"b": [n_embd], "w": [4*n_embd, n_embd]},
            },
        },
        ... # repeat for n_layer blocks
    ]
}
```
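As a sanity check on these shapes, the following sketch adds up the number of entries in every array. The helper name `count_params` is hypothetical, and the `hparams` values are copied from the dictionary shown earlier.

```python
hparams = {"n_vocab": 50257, "n_ctx": 1024, "n_embd": 768, "n_head": 12, "n_layer": 12}


def count_params(hp):
    d = hp["n_embd"]
    embeddings = hp["n_vocab"] * d + hp["n_ctx"] * d  # wte + wpe
    attn = (d * 3 * d + 3 * d) + (d * d + d)          # c_attn + c_proj (weights + biases)
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)       # c_fc + c_proj (weights + biases)
    layer_norms = 2 * (2 * d)                         # ln_1 and ln_2, each with g and b
    per_block = attn + mlp + layer_norms
    return embeddings + hp["n_layer"] * per_block + 2 * d  # + ln_f (g and b)


print(count_params(hparams))  # 124439808
```

The total of roughly 124 million matches the `"124M"` model name used when loading the weights.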

In [3]:
#encoder, hparams, params = load_encoder_hparams_and_params("124M", "models")

### GELU
The non-linearity (activation function) for GPT-2 is GELU (Gaussian Error Linear Unit), computed with the tanh approximation:
$$
\text{GELU}(x) \approx 0.5 x \left( 1 + \tanh \left( \sqrt{\frac{2}{\pi}} \left( x + 0.044715 x^3 \right) \right) \right)
$$

In [4]:
def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))
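
The formula above is an approximation of the exact GELU, $x \cdot \Phi(x)$, where $\Phi$ is the standard normal CDF. The sketch below compares the two numerically; `gelu_exact` is a helper introduced here for illustration and is not part of the notebook.

```python
import math

import numpy as np


def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))


def gelu_exact(x):
    # exact definition x * Phi(x), written with the error function
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1 + erf(x / np.sqrt(2)))


xs = np.linspace(-5, 5, 1001)
max_abs_err = np.max(np.abs(gelu(xs) - gelu_exact(xs)))
print(max_abs_err)  # small: the two curves are nearly indistinguishable
```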
    

### Softmax
$$
\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}
$$
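Given a two-dimensional array, we apply softmax row-wise. Subtracting the row maximum before exponentiating does not change the result but avoids numerical overflow: 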

In [5]:
def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)
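
A small usage sketch, repeating the definition so the block runs on its own. The second row uses values large enough that a naive `np.exp(x)` would overflow, which is the case the max-subtraction guards against.

```python
import numpy as np


def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)


x = np.array([[1.0, 2.0, 3.0],
              [1000.0, 1001.0, 1002.0]])  # second row is the first shifted by 999
probs = softmax(x)

print(probs.sum(axis=-1))               # each row sums to 1
print(np.allclose(probs[0], probs[1]))  # True: shifting a row by a constant changes nothing
```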
    

### Layer Normalization
Layer normalization rescales the values along the last axis to have mean $\mu=0$ and variance $\sigma^2=1$. It is meant to stabilize training by ensuring that the inputs to each layer have a consistent distribution. 
$$
\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
$$
where $\gamma$ (scale) and $\beta$ (shift) are learnable weights and $\epsilon$ is a small constant that prevents division by zero. Note that we first normalize to zero mean and unit variance, and then apply the learned weights, which lets the network undo or adjust the normalization where that helps. 

Given a two-dimensional array, we apply normalization row-wise: 

In [6]:
def layer_norm(x, g, b, eps: float = 1e-5):
    mean = np.mean(x, axis=-1, keepdims=True)
    variance = np.var(x, axis=-1, keepdims=True)
    x = (x - mean) / np.sqrt(variance + eps)  # normalize x to have mean=0 and var=1 over last axis
    return g * x + b  # scale and offset with gamma/beta params
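
To see the normalization at work, the sketch below applies `layer_norm` with $\gamma = 1$ and $\beta = 0$ (chosen here for illustration, not trained values) to random rows whose mean and variance are far from 0 and 1.

```python
import numpy as np


def layer_norm(x, g, b, eps: float = 1e-5):
    mean = np.mean(x, axis=-1, keepdims=True)
    variance = np.var(x, axis=-1, keepdims=True)
    x = (x - mean) / np.sqrt(variance + eps)
    return g * x + b


rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(2, 8))   # two rows, 8 features each
out = layer_norm(x, g=np.ones(8), b=np.zeros(8))  # identity scale and shift

print(out.mean(axis=-1))  # approximately 0 for each row
print(out.var(axis=-1))   # approximately 1 for each row
```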
    

### Linear Layer
Each row of $X$ (the embedding at one position) is transformed independently by the same weight matrix `w`. The bias `b` is broadcast, so it is added to every row. 
$$
\text{Linear}(X) = XW + b
$$

In [7]:
def linear(X, w, b): 
    return X @ w + b
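
A tiny example with made-up numbers shows the shapes involved and how the bias is broadcast across rows.

```python
import numpy as np


def linear(X, w, b):
    return X @ w + b


X = np.ones((3, 4))         # 3 positions, 4 input features each
w = np.full((4, 2), 0.5)    # maps 4 features to 2
b = np.array([10.0, 20.0])  # one bias value per output feature

out = linear(X, w, b)
print(out.shape)  # (3, 2)
print(out)        # every row is [12. 22.]
```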
    

### GPT Architecture
At a high level, the GPT architecture has three sections: 
- Text + Positional Embeddings
- Transformer Decoder Stack
- Projection to Vocab

The `gpt2` function implements the architecture's forward pass. `generate()` calls it once for every new token, passing the prompt together with all token IDs generated so far. 
Parameter List:
- `inputs`: token IDs of the user text prompt plus any previously generated tokens
- `wte`: token embedding matrix
- `wpe`: positional embedding matrix
- `blocks`: list of parameter dictionaries, one per transformer block
- `ln_f`: parameters of the final LayerNorm 
- `n_head`: number of attention heads

Output: 
- `logits` of shape `n_seq x n_vocab`, i.e. unnormalized scores over possible next tokens at each position (applying softmax would turn them into probabilities)
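
The embedding and projection steps can be traced by shape with toy sizes. This is only a sketch: the weights are random, the sizes are far smaller than GPT-2's, and the transformer blocks and final LayerNorm are left out because they preserve the shape `n_seq x n_embd`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, n_ctx, n_embd = 10, 8, 4         # toy sizes, chosen for illustration
wte = rng.normal(size=(n_vocab, n_embd))  # token embeddings
wpe = rng.normal(size=(n_ctx, n_embd))    # positional embeddings

inputs = [3, 7, 1]                        # three token IDs
X = wte[inputs] + wpe[: len(inputs)]      # n_seq x n_embd
logits = X @ wte.T                        # n_seq x n_vocab (embedding matrix reused)

print(X.shape, logits.shape)  # (3, 4) (3, 10)
```

Only the last row of `logits` is needed to pick the next token; the earlier rows are predictions for positions whose next token is already known.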

In [8]:
def gpt2(inputs, wte, wpe, blocks, ln_f, n_head):
    
    #### TEXT + POSITIONAL ENCODING:
    # select the rows of the token embedding matrix that correspond to the input token IDs
    # for an input of length n_seq, take the first n_seq rows of the positional embedding matrix:
    # they depend only on the position, not on the token values

    X = wte[inputs] + wpe[:len(inputs)] # n_seq x n_embd

    # X[i] is the embedding of the i-th token plus the positional embedding of the i-th position
    
    ### TRANSFORMER DECODER STACK:
    # forward pass through n_layer transformer blocks
    for block in blocks: 
        X = transformer_block(X, **block, n_head=n_head) # n_seq x n_embd -> n_seq x n_embd


    ### PROJECTION TO VOCAB:
    # reuse the token embedding matrix for the projection (other implementations use a separate matrix) 
    # no softmax is applied at the end, so the outputs are logits 
    X = layer_norm(X, **ln_f) # n_seq x n_embd -> n_seq x n_embd
    return X @ wte.T # n_seq x n_embd -> n_seq x n_vocab 


def transformer_block(X, mlp, attn, ln_1, ln_2, n_head): 
    
    ### MULTI-HEAD ATTENTION:
    # with residual connection
    X = X + mha(layer_norm(X, **ln_1), **attn, n_head=n_head)
    
    ### FEED FORWARD NETWORK:
    # with residual connection
    X = X + ffn(layer_norm(X, **ln_2), **mlp)
    return X


def ffn(X, c_fc, c_proj):
    # project from n_embd to higher dimension 4*n_embd and then back 
    X = linear(X, **c_fc) 
    Z = gelu(X)
    X = linear(Z, **c_proj)
    return X


def mha(X, c_attn, c_proj, n_head): 
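    # combined query/key/value projection
    X = linear(X, **c_attn)  # n_seq x n_embd -> n_seq x 3*n_embd
    qkv = np.split(X, 3, axis=-1)  # 3 arrays of n_seq x n_embd

    # split each of q, k, v into heads
    qkv_heads = [np.split(t, n_head, axis=-1) for t in qkv]  # 3 x n_head arrays of n_seq x n_embd/n_head
    
    # causal mask: -1e10 above the diagonal, so a position cannot attend to later positions
    mask = (1 - np.tri(X.shape[0], dtype=X.dtype)) * -1e10   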
    
    # perform attention over each head (usually done in parallel)
    out_heads = [attention(q, k, v, mask) for q, k, v in zip(*qkv_heads)] 

    # concatenate heads
    X = np.hstack(out_heads)

    # final projection
    X = linear(X, **c_proj)

    return X 
    
    

def attention(q, k, v, mask): 
    return softmax(q @ k.T / np.sqrt(q.shape[-1]) + mask) @ v 
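
To see what the causal mask does, the sketch below builds it for a sequence of length 3 and applies softmax to all-zero attention scores (an artificial choice that makes the effect easy to read off).

```python
import numpy as np


def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)


n_seq = 3
mask = (1 - np.tri(n_seq)) * -1e10  # 0 on and below the diagonal, -1e10 above it

scores = np.zeros((n_seq, n_seq))   # pretend every query-key score is equal
weights = softmax(scores + mask)
print(weights.round(2))
# row i spreads its weight evenly over positions 0..i and gives 0 to later positions
```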
    
    

The `generate()` function is the autoregressive decoding algorithm we saw earlier. We use greedy sampling (always picking the highest-scoring token) for simplicity. `tqdm` shows a progress bar to help us visualize the decoding process as it generates tokens one at a time.

In [9]:
def generate(inputs, params, n_head, n_tokens_to_generate):

    for _ in tqdm(range(n_tokens_to_generate), "generating"):  # auto-regressive decode loop
        logits = gpt2(inputs, **params, n_head=n_head)  # model forward pass
        next_id = np.argmax(logits[-1])  # greedy sampling
        inputs.append(int(next_id))  # append prediction to input

    return inputs[len(inputs) - n_tokens_to_generate :]  # only return generated ids
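
The decode loop can be tried without loading any weights by swapping in a stand-in model. In the sketch below, `toy_model` and `greedy_generate` are hypothetical names: `toy_model` replaces `gpt2()` with a rule that always prefers the token ID one higher than the current one.

```python
import numpy as np


def toy_model(inputs, n_vocab=5):
    """Hypothetical stand-in for gpt2(): prefers (token + 1) % n_vocab at each position."""
    logits = np.zeros((len(inputs), n_vocab))
    for pos, tok in enumerate(inputs):
        logits[pos, (tok + 1) % n_vocab] = 1.0
    return logits


def greedy_generate(inputs, n_tokens_to_generate):
    inputs = list(inputs)  # work on a copy so the caller's list is not modified
    for _ in range(n_tokens_to_generate):
        logits = toy_model(inputs)            # forward pass over the whole sequence
        next_id = int(np.argmax(logits[-1]))  # last row scores the next token
        inputs.append(next_id)
    return inputs[len(inputs) - n_tokens_to_generate:]


prompt = [0, 1]
print(greedy_generate(prompt, 4))  # [2, 3, 4, 0]
print(prompt)                      # [0, 1]
```

Unlike this sketch, `generate()` above appends to the list it is given, so the caller's `inputs` grows as tokens are generated.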


The `main` function handles: 
1. Loading the tokenizer (`encoder`), model weights (`params`), and hyperparameters (`hparams`)
2. Encoding the input prompt into token IDs using the tokenizer
3. Calling the generate function
4. Decoding the output IDs into a string

In [10]:
def main(prompt: str, n_tokens_to_generate: int = 40, model_size: str = "124M", models_dir: str = "models"):

    # load encoder, hparams, and params from the released OpenAI GPT-2 files
    encoder, hparams, params = load_encoder_hparams_and_params(model_size, models_dir)

    # encode the input string using the BPE tokenizer
    input_ids = encoder.encode(prompt)

    # make sure we are not surpassing the max sequence length of our model
    assert len(input_ids) + n_tokens_to_generate < hparams["n_ctx"]

    
    # generate output ids
    output_ids = generate(input_ids, params, hparams["n_head"], n_tokens_to_generate)

    # decode the ids back into a string
    output_text = encoder.decode(output_ids)

    return output_text


### Inference
Here we provide the input text prompt to our model. 

In [12]:
main("Christiano Ronaldo is very famous for", n_tokens_to_generate=11, model_size="124M", models_dir="models")

generating: 100%|███████████████████████████████| 11/11 [00:12<00:00,  1.15s/it]


' his ability to play the ball and to create chances.'