In [1]:
# %pip install -q transformers huggingface_hub
import math
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import transformers

### Using pre-trained transformers (2 points)
_for fun and profit_

There are many toolkits that let you access pre-trained transformer models, but the most powerful and convenient by far is [`huggingface/transformers`](https://github.com/huggingface/transformers). In this week's practice, you'll learn how to download, apply and modify pre-trained transformers for a range of tasks. Buckle up, we're going in!


__Pipelines:__ if all you want is to apply a pre-trained model, you can do that in one line of code using a `pipeline`. Huggingface/transformers has a selection of pre-configured pipelines for masked language modelling, sentiment classification, question answering, etc. ([see full list here](https://huggingface.co/transformers/main_classes/pipelines.html))

A typical pipeline includes:
* pre-processing, e.g. tokenization, subword segmentation
* a backbone model, e.g. BERT fine-tuned for classification
* output post-processing

You can also use a vanilla masked language model, i.e. one that was trained to predict masked-out words.
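
A minimal sketch of the fill-mask pipeline, for illustration only. The helper name `top_mask_fillers` is hypothetical, and `bert-base-uncased` is an assumed model choice (it is the checkpoint this notebook loads later). The pipeline is passed in as an argument so the helper can also be exercised without downloading weights.

```python
def top_mask_fillers(text, fill_mask=None, top_k=3):
    """Return (token, score) pairs for the most likely [MASK] fillers."""
    if fill_mask is None:
        # Imported lazily: building the pipeline downloads weights on first use.
        import transformers
        fill_mask = transformers.pipeline("fill-mask", model="bert-base-uncased")
    # For a single [MASK], the pipeline returns a list of dicts that include
    # the keys "token_str" and "score".
    predictions = fill_mask(text, top_k=top_k)
    return [(p["token_str"], round(p["score"], 4)) for p in predictions]

# Usage (needs network access on the first call):
# print(top_mask_fillers("Paris is the capital of [MASK]."))
```

Note that the mask placeholder depends on the tokenizer: BERT-style models use `[MASK]`, and `tokenizer.mask_token` gives the right string for other models.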



Huggingface offers hundreds of pre-trained models that specialize in different tasks. You can quickly find the model you need using [this list](https://huggingface.co/models).


### The building blocks of a pipeline

Huggingface also allows you to access its pipelines at a lower level. There are two main abstractions for you:
* `Tokenizer` - converts from strings to token ids and back
* `Model` - a pytorch `nn.Module` with pre-trained weights

You can use such models as part of your regular PyTorch code: insert one as a layer in your model, apply it to a batch of data, backpropagate, optimize, etc.
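
To make "insert it as a layer" concrete, here is a sketch that wraps a BERT backbone in a small classifier and runs one training step. Everything in it is an illustrative assumption: the `SentenceClassifier` class, the tiny randomly initialised `BertConfig` (chosen so that nothing is downloaded), and the random token ids and labels.

```python
import torch
import torch.nn as nn
import transformers

class SentenceClassifier(nn.Module):
    """Hypothetical wrapper: transformer backbone + linear classification head."""
    def __init__(self, backbone, hidden_size, num_classes):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.pooler_output)  # [batch_size, num_classes]

# Tiny random-weight BERT, used here only to keep the example offline and fast.
config = transformers.BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                                 num_attention_heads=4, intermediate_size=64)
backbone = transformers.BertModel(config)
clf = SentenceClassifier(backbone, hidden_size=32, num_classes=2)

torch.manual_seed(0)
input_ids = torch.randint(0, 100, (3, 7))       # fake batch: 3 sequences of 7 tokens
attention_mask = torch.ones_like(input_ids)
labels = torch.tensor([0, 1, 0])

optimizer = torch.optim.AdamW(clf.parameters(), lr=1e-3)
logits = clf(input_ids, attention_mask)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()     # gradients flow into the backbone as well as the head
optimizer.step()
print(logits.shape, loss.item())
```

With pre-trained weights, the backbone would instead come from `transformers.AutoModel.from_pretrained('bert-base-uncased')`, with `hidden_size=768`.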

In [2]:
tokenizer = transformers.AutoTokenizer.from_pretrained('bert-base-uncased')
model = transformers.AutoModel.from_pretrained('bert-base-uncased')

In [3]:
lines = [
    "Luke, I am your father.",
    "Life is what happens when you're busy making other plans.",
    ]

# tokenize a batch of inputs. "pt" means [p]y[t]orch tensors
tokens_info = tokenizer(lines, padding=True, truncation=True, return_tensors="pt")

for key in tokens_info:
    print(key, tokens_info[key])
    
print('---------------------------------------')

print("Detokenized:")
for i in range(2):
    print(tokenizer.decode(tokens_info['input_ids'][i]))

input_ids tensor([[ 101, 5355, 1010, 1045, 2572, 2115, 2269, 1012,  102,    0,    0,    0,
            0,    0,    0],
        [ 101, 2166, 2003, 2054, 6433, 2043, 2017, 1005, 2128, 5697, 2437, 2060,
         3488, 1012,  102]])
token_type_ids tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
attention_mask tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
---------------------------------------
Detokenized:
[CLS] luke, i am your father. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[CLS] life is what happens when you're busy making other plans. [SEP]


In [4]:
# You can now apply the model to get embeddings
with torch.no_grad():
    out = model(**tokens_info)

print(out['pooler_output'])

tensor([[-0.8854, -0.4722, -0.9392,  ..., -0.8081, -0.6955,  0.8748],
        [-0.9297, -0.5161, -0.9334,  ..., -0.9017, -0.7492,  0.9201]])
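
`pooler_output` is only one way to turn token vectors into a sentence vector. A common alternative, which this notebook does not use, is to average `last_hidden_state` over real tokens while skipping padding via `attention_mask`. The helper below is a hypothetical sketch run on toy tensors rather than on actual BERT outputs.

```python
import torch

def masked_mean_pool(hidden, attention_mask):
    """Average token vectors over non-padding positions.

    hidden:         [batch_size, seq_len, dim]
    attention_mask: [batch_size, seq_len], 1 = real token, 0 = padding
    """
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)  # [batch, seq_len, 1]
    summed = (hidden * mask).sum(dim=1)                   # padding contributes zero
    counts = mask.sum(dim=1).clamp(min=1.0)               # avoid division by zero
    return summed / counts

# Toy input: the third position is padding and must not affect the mean.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)  # tensor([[2., 3.]])
```

With the cell above, the call would look like `masked_mean_pool(out['last_hidden_state'], tokens_info['attention_mask'])`.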


Transformers knowledge hub: https://huggingface.co/transformers/

### Build-a-transformer (2 points)

In this section, you will implement a transformer language model layer by layer, then use it to generate (hopefully) coherent text.

To understand how these layers work, please check out the transformer section of [NLP Course For You](https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html#transformer_intro).


First, we download pre-trained weights for the [GPT2 model by OpenAI](https://openai.com/research/better-language-models) - a prominent model from 2019.



Idea & code by: Ilya Beletsky

In [5]:
from huggingface_hub import hf_hub_download

state_dict = torch.load(hf_hub_download("gpt2", filename="pytorch_model.bin"), map_location="cpu")

for key, value in tuple(state_dict.items()):
    if key.startswith('h.') and key.endswith('.weight') and value.ndim == 2:
        # the checkpoint stores these matrices as [in_features, out_features];
        # nn.Linear expects [out_features, in_features], so transpose in place
        value.transpose_(1, 0)
    if key.startswith('h.') and key.endswith('.attn.bias') and value.ndim == 4:
        state_dict.pop(key)  # <-- triangular binary masks, not needed in this code

print('Weights:', repr(sorted(state_dict.keys()))[:320], '...')

Weights: ['h.0.attn.c_attn.bias', 'h.0.attn.c_attn.weight', 'h.0.attn.c_proj.bias', 'h.0.attn.c_proj.weight', 'h.0.ln_1.bias', 'h.0.ln_1.weight', 'h.0.ln_2.bias', 'h.0.ln_2.weight', 'h.0.mlp.c_fc.bias', 'h.0.mlp.c_fc.weight', 'h.0.mlp.c_proj.bias', 'h.0.mlp.c_proj.weight', 'h.1.attn.c_attn.bias', 'h.1.attn.c_attn.weight', 'h.1. ...
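
The transposition in the loop above can be checked on toy tensors. The sketch below assumes the checkpoint layout is `[in_features, out_features]` and uses made-up sizes; it only shows that multiplying by such a matrix gives the same result as `nn.Linear` loaded with the transposed matrix.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
in_dim, out_dim = 4, 6                      # toy sizes, not GPT-2's
x = torch.randn(2, 3, in_dim)               # [batch_size, seq_len, in_dim]
w_in_out = torch.randn(in_dim, out_dim)     # assumed checkpoint layout: [in, out]
b = torch.randn(out_dim)

# How a layer that stores [in, out] weights would compute its output
y_checkpoint_style = x @ w_in_out + b

# nn.Linear stores weight as [out, in] and computes x @ weight.T + bias
linear = nn.Linear(in_dim, out_dim)
with torch.no_grad():
    linear.weight.copy_(w_in_out.T)
    linear.bias.copy_(b)
y_linear = linear(x)

print(torch.allclose(y_checkpoint_style, y_linear, atol=1e-5))
```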


### In the next few cells, we shall implement the model layer by layer to make use of those weights.

As you might recall, transformers contain two main layer types: attention and fully-connected layers.

The fully connected layers are far easier to understand, so we shall begin there.

Please implement a fully-connected layer __without residual connections or layer normalization__ (we'll add those in a bit).

In [6]:
class GeLUThatWasUsedInGPT2(nn.Module):
    def forward(self, x):
        return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

class FullyConnected(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.c_fc = nn.Linear(dim, 4 * dim)
        self.gelu = GeLUThatWasUsedInGPT2()
        self.c_proj = nn.Linear(4 * dim, dim)

    def forward(self, x):
        # x.shape = [batch_size, seq_length, dim]
        x1 = self.c_fc(x)
        gelu = self.gelu(x1)
        x2 = self.c_proj(gelu)
        return x2
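
As a sanity check on the activation, the tanh formula used in `GeLUThatWasUsedInGPT2` can be compared with PyTorch's built-in GELU. This sketch assumes a PyTorch version whose `F.gelu` accepts the `approximate` argument (1.12 or newer, to the best of my knowledge). The function name `gelu_tanh` is only for illustration.

```python
import math
import torch
import torch.nn.functional as F

def gelu_tanh(x):
    # Same tanh-based approximation as in the cell above
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

x = torch.linspace(-5, 5, steps=101)
manual = gelu_tanh(x)
builtin_tanh = F.gelu(x, approximate="tanh")   # built-in tanh approximation
exact = F.gelu(x)                              # erf-based GELU

max_gap = (manual - exact).abs().max().item()
print("matches built-in tanh variant:", torch.allclose(manual, builtin_tanh, atol=1e-6))
print("max |tanh approx - exact| on [-5, 5]:", max_gap)
```

The two variants are close but not identical, so using the exact `nn.GELU()` here may shift the checksums in the tests below.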


Now, let's test that it works with GPT-2 weights:

In [7]:
mlp = FullyConnected(dim=768)
mlp.load_state_dict({'c_fc.weight': state_dict['h.0.mlp.c_fc.weight'],
                     'c_fc.bias': state_dict['h.0.mlp.c_fc.bias'],
                     'c_proj.weight': state_dict['h.0.mlp.c_proj.weight'],
                     'c_proj.bias': state_dict['h.0.mlp.c_proj.bias']})

torch.manual_seed(1337)
x = torch.randn(1, 2, 768)  # [batch_size, sequence_length, dim]
checksum = torch.sum(mlp(x) * x)
assert abs(checksum.item() - 1282.3315) < 0.1, "layer outputs do not match reference"
assert torch.allclose(mlp(x[:, (1, 0), :])[:, (1, 0), :], mlp(x)), "mlp must be permutation-equivariant"
print("Seems legit!")

Seems legit!


Now, let's get to attention layers.

Since GPT-2 needs to generate text from left to right, each generated token can only attend to tokens on the left (and itself). This kind of attention is called "masked" self-attention, because it hides tokens to the right.

As before, please implement masked self-attention __without layernorm or residual connections.__
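
To see what the mask does, here is a small single-head sketch on random toy tensors (sizes and seed are arbitrary). Scores for positions to the right are set to `-inf` before the softmax, so their attention weights come out as exactly zero.

```python
import torch

torch.manual_seed(0)
seq_len, head_size = 4, 8
q = torch.randn(seq_len, head_size)
k = torch.randn(seq_len, head_size)

# Scaled dot-product scores: row i holds the scores of query i against all keys
scores = q @ k.T / head_size ** 0.5

# True above the main diagonal, i.e. for keys that lie to the right of the query
future = torch.ones(seq_len, seq_len, dtype=torch.bool).triu(diagonal=1)
scores = scores.masked_fill(future, float("-inf"))

weights = torch.softmax(scores, dim=-1)
print(weights)  # lower-triangular: each row sums to 1, zeros to the right
```

The first token can only attend to itself, so the first row is always `[1, 0, 0, 0]`.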

In [8]:
class MaskedSelfAttention(nn.Module):
    def __init__(self, dim: int,
                 num_heads: int,
                 dropout_p: float = 0.0,
                 is_causal: bool = True):
        
        super().__init__()
        self.c_attn = nn.Linear(dim, dim * 3)
        self.c_proj = nn.Linear(dim, dim)
        self.dim = dim
        self.num_heads = num_heads
        self.head_size = dim // num_heads
        self.scale = self.head_size ** -0.5
        self.dropout = nn.Dropout(dropout_p)
        self.is_causal = is_causal
        assert dim % num_heads == 0
    
        
    def forward(self, x):
        batch_size, seq_len, _ = x.shape
        c_attn = self.c_attn(x) # [batch_size, sequence_length, 3*dim]
        
        # reshape `c_attn` to expose the 3 matrices (q, k, v),
        # the num_heads heads, and head_size features per head
        qkv = c_attn.reshape(batch_size, seq_len, 3, self.num_heads, self.head_size)

        # after the permute, qkv.shape == [3, batch_size, num_heads, sequence_length, head_size]
        qkv = qkv.permute(2, 0, 3, 1, 4)
        
        q, k, v = qkv[0], qkv[1], qkv[2]

        attn_bias = torch.zeros(seq_len, seq_len, dtype=q.dtype, device=q.device)
        if self.is_causal:
            temp_mask = torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device).tril(diagonal=0)
            attn_bias.masked_fill_(temp_mask.logical_not(), float("-inf"))
   
        attn_weight = q @ k.transpose(-2, -1) * self.scale
        attn_weight += attn_bias
        attn_weight = self.dropout(torch.softmax(attn_weight, dim=-1))
        
        context = attn_weight @ v #[batch_size, num_heads, seq_len, head_size]

        # [batch_size, seq_len, num_heads, head_size]
        context = context.permute(0, 2, 1, 3).reshape(batch_size, seq_len, self.dim)
        return self.c_proj(context)
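
One way to cross-check the core of `forward` is to compare it against PyTorch's fused attention. The sketch below assumes PyTorch 2.0 or newer, where `F.scaled_dot_product_attention` is available, and uses random toy tensors in place of real projections.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, num_heads, seq_len, head_size = 2, 3, 5, 4
q, k, v = (torch.randn(batch_size, num_heads, seq_len, head_size) for _ in range(3))

# Manual masked attention, following the same steps as the class above
allowed = torch.ones(seq_len, seq_len, dtype=torch.bool).tril()
scores = (q @ k.transpose(-2, -1)) * head_size ** -0.5
scores = scores.masked_fill(~allowed, float("-inf"))
manual = torch.softmax(scores, dim=-1) @ v      # [batch, heads, seq_len, head_size]

# Fused equivalent; its default scale is also 1 / sqrt(head_size)
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(torch.allclose(manual, fused, atol=1e-5))
```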

In [9]:
attn = MaskedSelfAttention(dim=768, num_heads=12)
attn.load_state_dict({'c_attn.weight': state_dict['h.0.attn.c_attn.weight'],
                      'c_attn.bias': state_dict['h.0.attn.c_attn.bias'],
                      'c_proj.weight': state_dict['h.0.attn.c_proj.weight'],
                      'c_proj.bias': state_dict['h.0.attn.c_proj.bias']})

torch.manual_seed(1337)
x = torch.randn(1, 10, 768)  # [batch_size, sequence_length, dim]
checksum = torch.sum(attn(x) * x)
assert abs(checksum.item() - 2703.6772) < 0.1, "layer outputs do not match reference"
assert not torch.allclose(attn(x[:, (1, 0), :])[:, (1, 0), :], attn(x[:, (0, 1), :])), "masked attention must *not* be permutation-equivariant"
print("It works!")

It works!


We can now combine attention and MLP to build the full transformer layer:

![img](https://i.imgur.com/1sq2vHO.png)

In [10]:
class TransformerLayer(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.ln_1 = nn.LayerNorm(dim)
        self.attn = MaskedSelfAttention(dim, num_heads)
        self.ln_2 = nn.LayerNorm(dim)
        self.mlp = FullyConnected(dim)

        
    def forward(self, x):
        # Pre-LayerNorm residual block, as shown in the figure above:
        # normalize, apply the sublayer, then add the sublayer's input back.
        atten = self.attn(self.ln_1(x)) + x
        ffn = self.mlp(self.ln_2(atten)) + atten
        return ffn

In [11]:
layer = TransformerLayer(dim=768, num_heads=12)
layer.load_state_dict({k[5:]: v for k, v in state_dict.items() if k.startswith('h.10.')})
assert abs(torch.sum(layer(x) * x).item() - 9874.7383) < 0.1
print("Good job!")

Good job!


In [12]:
class GPT2(nn.Module):
    def __init__(self, 
                 vocab_size: int, 
                 dim: int, 
                 num_heads: int, 
                 num_layers: int, 
                 max_position_embeddings: int = 1024):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, dim) # token embeddings
        # position embeddings
        self.wpe = nn.Embedding(max_position_embeddings, dim)  
        self.h = nn.Sequential(*(TransformerLayer(dim, num_heads) for layer in range(num_layers)))
        # final layer norm - goes after all transformer layers, but before logits
        self.ln_f = nn.LayerNorm(dim) 
        
    def forward(self, input_ids):
        # input_ids.shape: [batch_size, sequence_length], int64 token ids
        position_ids = torch.arange(input_ids.shape[1], device=input_ids.device).unsqueeze(0)

        token_embeddings = self.wte(input_ids)
        position_embeddings = self.wpe(position_ids)
        full_embeddings = token_embeddings + position_embeddings

        transformer_output = self.h(full_embeddings)
        transformer_output_ln = self.ln_f(transformer_output)

        # final layer: we predict logits by re-using token embeddings as linear weights
        output_logits = transformer_output_ln @ self.wte.weight.T
        return output_logits
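
The last step of `forward` reuses the embedding matrix as the output projection (weight tying). A toy-sized sketch of that step, with made-up dimensions, shows that it amounts to a bias-free linear layer whose weight is the embedding table.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, dim = 10, 4                     # toy sizes, not GPT-2's
wte = nn.Embedding(vocab_size, dim)         # weight shape: [vocab_size, dim]
hidden = torch.randn(2, 3, dim)             # stand-in for the final hidden states

# One logit per vocabulary entry: dot product with that token's embedding
logits = hidden @ wte.weight.T              # [2, 3, vocab_size]

# Equivalent to a linear layer without bias that shares the embedding weight
same = torch.allclose(logits, F.linear(hidden, wte.weight), atol=1e-6)
print(logits.shape, same)
```

Because the matrix is shared, the output layer adds no parameters of its own.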


In [13]:
tokenizer = transformers.AutoTokenizer.from_pretrained('gpt2', add_prefix_space=True)
model = GPT2(vocab_size=50257, dim=768, num_heads=12, num_layers=12)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = model.to(device)
model.load_state_dict(state_dict)

input_ids = tokenizer("A quick", return_tensors='pt')['input_ids'].to(device)

predicted_logits = model(input_ids)
print(predicted_logits.shape)
most_likely_token_id = predicted_logits[:, -1].argmax().item()

print("Prediction:", tokenizer.decode(most_likely_token_id))

torch.Size([1, 2, 50257])
Prediction:  look


In [14]:
text = "Apple is "
tokens = tokenizer.encode(text)
print(end=tokenizer.decode(tokens))
line_length = len(tokenizer.decode(tokens))

for i in range(50):
    # Predict logits with your model
    with torch.no_grad():
        tokens_tensor = torch.as_tensor([tokens], device=device)
        logits = model(tokens_tensor)


    # Sample with probabilities
    p_next = torch.softmax(logits[0, -1, :], dim=-1).cpu().numpy()
    next_token_index = np.random.choice(len(p_next), p=p_next)

    tokens.append(int(next_token_index))
    print(end=tokenizer.decode(tokens[-1]))
    line_length += len(tokenizer.decode(tokens[-1]))
    if line_length > 120:
        line_length = 0
        print()



 Apple is icky about getting rid of gadgets, told the Verge that sales of their latest Windows laptops are continuing to decline
 and offer consumers reasons to upgrade to older models. "The big reason behind that is that we haven't sold that many anything
 at all," he
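
The loop above samples from the full next-token distribution. Two common variations, which this notebook does not use, are temperature scaling and top-k filtering. The helper `sample_next_token` below is a hypothetical sketch of both, run on a toy logit vector.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, rng=None):
    """Sample a token index from logits, optionally sharpened and truncated."""
    if rng is None:
        rng = np.random.default_rng()
    # temperature < 1 sharpens the distribution, > 1 flattens it
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if top_k is not None:
        # keep only the top_k largest logits; the rest get zero probability
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
toy_logits = [1.0, 3.0, 2.0]
print(sample_next_token(toy_logits, top_k=1, rng=rng))  # top_k=1 is greedy: index 1
print(sample_next_token(toy_logits, temperature=0.5, top_k=2, rng=rng))
```

In the generation loop, this would be called on `logits[0, -1, :].cpu().numpy()` in place of the `np.random.choice` line.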

__Reminder:__ after class, please go to `MaskedSelfAttention.forward` above and finish the job!


### Here's how you can do the same with the transformers library

In [15]:
tokenizer = transformers.AutoTokenizer.from_pretrained('gpt2', add_prefix_space=True)
model = transformers.AutoModelForCausalLM.from_pretrained('gpt2')
print('Generated text:', tokenizer.decode(
    model.generate(
        **tokenizer("The Fermi paradox ", return_tensors='pt'),
        do_sample=True, max_new_tokens=50
    ).flatten().numpy()
))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated text:  The Fermi paradox  is that the system's current theory of relativity does not work for all cases: the Fermi paradox may hold for the vast majority of quantum phenomena. Although some physicists have argued repeatedly that quantum physics does not hold for all cases, this idea
