
# Week 2 Day 1 - Build Your Own BERT

BERT (Bidirectional Encoder Representations from Transformers) is the most famous in a line of Muppet-themed language research, originating with [ELMo](https://arxiv.org/pdf/1802.05365v2.pdf) (Embeddings from Language Models) and continuing with a series of increasingly strained acronyms:

- [Big BIRD](https://arxiv.org/pdf/1910.13034.pdf) - Big Bidirectional Insertion Representations for Documents
- [Ernie](https://arxiv.org/pdf/1904.09223.pdf) - Enhanced Representation through kNowledge IntEgration
- [Grover](https://arxiv.org/pdf/1905.12616.pdf) - Generating aRticles by Only Viewing mEtadata Records
- [Kermit](https://arxiv.org/pdf/1906.01604.pdf) - Kontextuell Encoder Representations Made by Insertion Transformations

Today you'll implement your own BERT model such that it can load the weights from a full size pretrained BERT, and use it to predict some masked tokens.

## Table of Contents

- [Readings](#readings)
- [BERT architecture](#bert-architecture)
    - [Language model vs. classifier](#language-model-vs-classifier)
    - [Schematic](#schematic)
- [Batched Self-Attention](#batched-self-attention)
    - [Attention Pattern Pre-Softmax](#attention-pattern-pre-softmax)
    - [Attention Forward Function](#attention-forward-function)
- [Layer Normalization](#layer-normalization)
- [Embedding](#embedding)
- [BertMLP](#bertmlp)
- [Bert Block](#bert-block)
- [Putting it All Together](#putting-it-all-together)
    - [utils.StaticModuleList](#utilsstaticmodulelist)
- [BertLanguageModel](#bertlanguagemodel)
- [Loading Pretrained Weights](#loading-pretrained-weights)
- [Tokenization](#tokenization)
    - [Vocabulary](#vocabulary)
    - [Special Tokens](#special-tokens)
    - [Predicting Masked Tokens](#predicting-masked-tokens)
- [Model debugging](#model-debugging)

## Readings

- [Language Modelling with Transformers](https://docs.google.com/document/d/1XJQT8PJYzvL0CLacctWcT0T5NfL7dwlCiIqRtdTcIqA/edit#)

You don't need to read the other Muppet papers for today's content.

## BERT architecture

There are various sizes of BERT, differing only in the number of BERT transformer blocks ("BertBlock") and the embedding size. We'll be playing with [bert-base-cased](https://huggingface.co/bert-base-cased) today, which has 12 layers and an embedding size of 768. Note that the link points to Hugging Face, which provides a repository of pretrained models (often, transformer models) as well as other valuable documentation.

Refer to the below schematics for the architecture of BERT. Today we will be using BERT for language modelling, and tomorrow we will use it for classification. As most of the architecture is shared, we will be able to reuse most of the code as well.

### Language model vs. classifier

```mermaid
graph TD
    subgraph " "
            subgraph BertLanguageModel
            LBertCommon[Input<br/>From BertCommon] -->|embedding_size| LMHead[Linear<br/>GELU<br/>Layer Norm<br/>Tied Unembed]--> |vocab size|Output[Logit Output]
            end

            subgraph BertClassifier
            CBertCommon[Input<br/>From BertCommon] -->|embedding_size| ClassHead[First Position Only<br/>Dropout<br/>Linear] -->|num_classes| ClsOutput[Classification<br/>Output]
            end

    end
```

### Schematic

Note the "zoomed-in" view into `BertAttention` (and in turn, `BertSelfAttention`) as well as `BertMLP`.

```mermaid
graph TD
    subgraph " "
            subgraph BertCommon
            Token --> |integer|TokenEmbed[Token<br/>Embedding] --> AddEmbed[Add] --> CommonLayerNorm[Layer Norm] --> Dropout --> BertBlocks[<u>BertBlock x12</u><br/>BertAttention<br/>BertMLP] --> Output
            Position --> |integer|PosEmbed[Positional<br/>Embedding] --> AddEmbed
            TokenType --> |integer|TokenTypeEmb[Token Type<br/>Embedding] --> AddEmbed
        end

        subgraph BertAttention
            Input --> BertSelfInner[BertSelfAttention] --> AtnDropout[Dropout] --> AtnLayerNorm[Layer Norm] --> AtnOutput[Output]
            Input --> AtnLayerNorm
        end

        subgraph BertSelfAttention
            SA[Input] --> Q & K & V
            V -->|head size| WeightedSum
            Q & K --> |head size|Dot[Dot<br/>Scale Down<br/>Softmax] -->WeightedSum -->|head size| O --> SAOutput[Output]
        end

        subgraph BertMLP
            MLPInput[Input] --> Linear1 -->|intermediate size|GELU --> |intermediate size|Linear2 --> MLPDropout[Dropout] --> MLPLayerNorm --> MLPOutput[Output]
            MLPInput --> MLPLayerNorm[Layer Norm]
        end
    end
```

# Implementation

We will begin by importing necessary modules and defining `BertConfig` to store the model architecture parameters. Review the list of config entries and consider what each one means, reviewing the reading to familiarize yourself with transformer models if necessary.
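To check what the config entries mean, it can help to count the parameters they imply. The sketch below does this in plain Python. It assumes the layer layout shown in the schematics, that every `Linear` layer has a bias, that `num_heads * head_size` equals `hidden_size`, and that the unembedding reuses the token embedding matrix. The helper names `linear_params` and `layer_norm_params` are hypothetical, and the constants mirror the `BertConfig` defaults.

```python
# Values copied from the BertConfig defaults; helper names are illustrative only.
vocab_size, hidden_size, intermediate_size = 28996, 768, 3072
num_layers, max_position_embeddings, type_vocab_size = 12, 512, 2


def linear_params(n_in: int, n_out: int) -> int:
    """Weight matrix plus bias vector of a Linear layer."""
    return n_in * n_out + n_out


def layer_norm_params(size: int) -> int:
    """Elementwise weight and bias of a LayerNorm."""
    return 2 * size


# Token, position and token type embeddings, followed by one LayerNorm.
embeddings = (
    (vocab_size + max_position_embeddings + type_vocab_size) * hidden_size
    + layer_norm_params(hidden_size)
)
# Q, K, V and output projections, followed by one LayerNorm.
attention = 4 * linear_params(hidden_size, hidden_size) + layer_norm_params(hidden_size)
# Two Linear layers, followed by one LayerNorm.
mlp = (
    linear_params(hidden_size, intermediate_size)
    + linear_params(intermediate_size, hidden_size)
    + layer_norm_params(hidden_size)
)
# Linear + LayerNorm + unembedding bias (the unembedding matrix itself is tied).
lm_head = linear_params(hidden_size, hidden_size) + layer_norm_params(hidden_size) + vocab_size

total = embeddings + num_layers * (attention + mlp) + lm_head
print(f"{total:,} parameters")
```

Under these assumptions, most of the parameters sit in the twelve blocks, and the token embedding accounts for most of the rest.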





In [2]:
!pip install transformers
!pip install einops
!pip install fancy_einsum
!pip install torchtext
!pip install torch
!pip install joblib



import os
from dataclasses import dataclass
from typing import List, Optional, Union
import torch as t
import transformers
from einops import rearrange, repeat
from fancy_einsum import einsum
from torch import nn
from torch.nn import functional as F
import utils
import w2d1_test

MAIN = __name__ == "__main__"
IS_CI = os.getenv("IS_CI")


@dataclass(frozen=True)
class BertConfig:
    """Constants used throughout the Bert model. Most are self-explanatory.

    intermediate_size is the number of hidden neurons in the MLP (see schematic)
    type_vocab_size is only used for pretraining on "next sentence prediction", which we aren't doing.

    Note that the head size happens to be hidden_size // num_heads, but this isn't necessarily true and your code shouldn't assume it.
    """

    vocab_size: int = 28996
    intermediate_size: int = 3072
    hidden_size: int = 768
    num_layers: int = 12
    num_heads: int = 12
    head_size: int = 64
    max_position_embeddings: int = 512
    dropout: float = 0.1
    type_vocab_size: int = 2
    layer_norm_epsilon: float = 1e-12


if MAIN:
    config = BertConfig()


## Batched Self-Attention

We're going to implement a version of self-attention that computes all sequences in a batch at once, and all heads at once. Make sure you understand how single sequence, single head attention works first, again consulting the reading to review this mechanism if you haven't already done so.
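As a reference point before batching, here is a minimal sketch of attention for a single sequence and a single head. It is a simplification: the projections are plain matrices without biases, there is no output projection, and the function name `single_head_attention` and the toy sizes are made up for illustration.

```python
import torch as t


def single_head_attention(x: t.Tensor, W_q: t.Tensor, W_k: t.Tensor, W_v: t.Tensor) -> t.Tensor:
    """x: (seq, hidden); W_q, W_k, W_v: (hidden, head_size). Returns (seq, head_size)."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    # scores[i, j] is the match between the query at position i and the key at position j.
    scores = q @ k.T / k.shape[-1] ** 0.5
    # Each query's weights over all keys sum to 1.
    probs = scores.softmax(dim=-1)
    # Weighted sum of the value vectors.
    return probs @ v


t.manual_seed(0)
seq, hidden, head_size = 5, 16, 4
x = t.randn(seq, hidden)
W_q, W_k, W_v = (t.randn(hidden, head_size) for _ in range(3))
out = single_head_attention(x, W_q, W_k, W_v)
print(out.shape)
```

The batched version computes exactly this for every sequence in the batch and every head, with the extra dimensions carried along.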


### Attention Pattern Pre-Softmax

Write the `attention_pattern_pre_softmax` function as specified in the diagram.

The "Scale Down" factor means dividing by the square root of the head size. Empirically, this helps training. [This article](https://github.com/BAI-Yeqi/Statistical-Properties-of-Dot-Product/blob/master/proof.pdf) gives some math to justify this, but it's not important.
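To get a feel for the scale factor, the following sketch measures the variance of dot products between random vectors. It assumes an idealized setting where query and key entries are independent with unit variance, which real activations only approximate.

```python
import torch as t

t.manual_seed(0)
for head_size in (16, 64, 256):
    q = t.randn(10_000, head_size)
    k = t.randn(10_000, head_size)
    # One dot product per row: the variance grows roughly linearly with head_size.
    dots = (q * k).sum(dim=-1)
    # Dividing by sqrt(head_size) brings the variance back to roughly 1.
    scaled = dots / head_size ** 0.5
    print(head_size, round(dots.var().item(), 1), round(scaled.var().item(), 2))
```

Without the scaling, larger heads would feed larger values into the softmax and push it toward near one-hot outputs.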

### Attention Forward Function

Your forward should call `attention_pattern_pre_softmax`, add the attention mask to the result if present, and then finish the computations using `einsum` and `rearrange` again. Remember to apply the output projection.


Spend 5 minutes thinking about how to batch the computation before looking at the spoilers below.

<details>

<summary>What should the shape of `project_query` be?</summary>

`project_query` should go from `hidden_size` to `num_heads * self.head_size`. In this case, the latter is equal to `hidden_size`. This represents all the heads' `Q` matrices concatenated together, and one call to it now computes all the queries at once (broadcasting over the leading batch and seq dimensions of the input `x`).

</details>

<details>

<summary>Should my Linear layers have a bias?</summary>

While these Linear layers are traditionally referred to as projections, and the BERT paper implies that they don't have a bias, in the official reference implementation of BERT they DO have a bias.

</details>

<details>

<summary>What does the einsum to make the attention pattern look like?</summary>

We need to sum out the head_size and keep the seq_q dimension before the seq_k dimension. For a single batch and single head, it would be: `einsum("seq_q head_size, seq_k head_size -> seq_q seq_k")`. You'll want to do a `rearrange` before your `einsum`.

</details>

<details>

<summary>Which dimension do I softmax over?</summary>

The desired property is that after softmax, for any indices `batch`, `head`, and `q`, the vector `pattern[batch,head,q]` sums to 1. So the softmax needs to be over the `k` dimension.

</details>

<details>

<summary>I'm still confused about how to batch the computation.</summary>

Pre-softmax:

- Apply `project_query`, `project_key`, and `project_value` to `x` to obtain `q`, `k`, and `v`.
- rearrange `q` and `k` to split the `head * head_size` dimension apart into `head` and `head_size` dimensions. The shape should go from `(batch seq (head * head_size))` to `(batch head seq head_size)`
- Einsum `q` and `k` to get a (batch, head, seq_q, seq_k) shape.
- Divide by the square root of the head size.

Forward:

- Softmax over the `k` dimension to obtain attention probs
- rearrange `v` just like `q` and `k` previously
- einsum `v` and your attention probs to get the weighted `v`
- rearrange weighted `v` to combine head and head_size and put that at the end
- apply `project_output`

</details>

Name your `Linear` layers as indicated in the class definition; otherwise the tests won't work and you'll have more trouble loading weights.




In [3]:
class BertSelfAttention(nn.Module):
    project_query: nn.Linear
    project_key: nn.Linear
    project_value: nn.Linear
    project_output: nn.Linear

    def __init__(self, config: BertConfig):
        super().__init__()
        self.layer_norm_epsilon = config.layer_norm_epsilon
        self.head_size = config.head_size
        self.num_heads = config.num_heads
        self.project_query = nn.Linear(config.hidden_size, config.num_heads * config.head_size)
        self.project_key = nn.Linear(config.hidden_size, config.num_heads * config.head_size)
        self.project_value = nn.Linear(config.hidden_size, config.num_heads * config.head_size)
        self.project_output = nn.Linear(config.num_heads * config.head_size, config.hidden_size)

    def attention_pattern_pre_softmax(self, x: t.Tensor) -> t.Tensor:
        """
        x: shape (batch, seq, hidden_size)
        Return the attention pattern after scaling but before softmax.

        pattern[batch, head, q, k] should be the match between a query at sequence position q and a key at sequence position k.
        """
        # Compute Q K^T / sqrt(head_size) for every batch element and every head at once.
        Q = rearrange(self.project_query(x), 'bs q (nh hs) -> bs nh q hs', nh = self.num_heads)
        K = rearrange(self.project_key(x), 'bs k (nh hs) -> bs nh k hs', nh = self.num_heads)
        # Sum out the head_size axis (d), keeping seq_q (q) before seq_k (k).
        pre_softmax_attention_scores = t.einsum('bhqd,bhkd->bhqk', Q, K)
        return pre_softmax_attention_scores / self.head_size ** 0.5


    def forward(self, x: t.Tensor, additive_attention_mask: Optional[t.Tensor] = None) -> t.Tensor:
        """
        additive_attention_mask: shape (batch, head=1, seq_q=1, seq_k) - used in training to prevent copying data from padding tokens. Contains 0 for a real input token and a large negative number for a padding token. If provided, add this to the attention pattern (pre softmax).

        Return: (batch, seq, hidden_size)
        """
        pre_softmax_attention_scores = self.attention_pattern_pre_softmax(x)
        if additive_attention_mask is not None:
            pre_softmax_attention_scores += additive_attention_mask
        attention_scores = F.softmax(pre_softmax_attention_scores, dim=-1)

        V = rearrange(
            self.project_value(x),
            'bs q (nh hs) -> bs nh q hs',
            nh = self.num_heads
        )

        # Weighted sum of the values: sum out seq_k (k) for each query position (q).
        attention_weighted_value = t.einsum(
            'bhqk,bhkd->bhqd',
            attention_scores, V
        )

        attention_weighted_value = rearrange(
            attention_weighted_value,
            'bs nh q hs -> bs q (nh hs)'
        )
        
        return self.project_output(attention_weighted_value)

if MAIN:
    w2d1_test.test_attention_pattern_pre_softmax(BertSelfAttention)
    w2d1_test.test_attention(BertSelfAttention)



w2d1_test.test_attention_pattern_pre_softmax passed in 0.03s.
w2d1_test.test_attention passed in 0.00s.



## Layer Normalization

Use the [PyTorch docs](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html) for Layer Normalization to implement your own version which exactly mimics the official API. Use the biased estimator for $Var[x]$ as shown in the docs.
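The computation can be checked by hand against PyTorch's functional version. This is a sketch on a toy tensor that normalizes over the last dimension only and skips the learnable weight and bias. The tensor shape and variable names are arbitrary.

```python
import torch as t
from torch.nn import functional as F

t.manual_seed(0)
x = t.randn(2, 3, 8)
eps = 1e-12

mean = x.mean(dim=-1, keepdim=True)
# Biased estimator: divides by N rather than N - 1.
var = x.var(dim=-1, keepdim=True, unbiased=False)
manual = (x - mean) / t.sqrt(var + eps)

reference = F.layer_norm(x, normalized_shape=(8,), eps=eps)
print(t.allclose(manual, reference, atol=1e-5))
```

Switching to `unbiased=True` rescales every output by a constant factor, which is enough to fail an exact comparison.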




In [4]:
class LayerNorm(nn.Module):
    weight: nn.Parameter
    bias: nn.Parameter

    def __init__(
        self, normalized_shape: Union[int, tuple, t.Size], eps=1e-12, elementwise_affine=True, device=None, dtype=None
    ):
        super().__init__()
        if isinstance(normalized_shape, int):
            normalized_shape = (normalized_shape,)
        normalized_shape = tuple(normalized_shape)
        self.normalized_shape = normalized_shape
        self.eps = eps
        self.elementwise_affine = elementwise_affine
        if self.elementwise_affine:
            self.weight = nn.Parameter(t.empty(normalized_shape, device=device, dtype=dtype))
            self.bias = nn.Parameter(t.empty(normalized_shape, device=device, dtype=dtype))
        else:
            self.register_parameter("weight", None)
            self.register_parameter("bias", None)

        self.reset_parameters()

    
    def reset_parameters(self) -> None:
        """Initialize the weight and bias, if applicable."""
        if self.elementwise_affine:
            nn.init.ones_(self.weight)
            nn.init.zeros_(self.bias)

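    def forward(self, x: t.Tensor) -> t.Tensor:
        """x and the output should both have shape (batch, *)."""
        # Normalize over the trailing len(normalized_shape) dimensions, as nn.LayerNorm does.
        dims = tuple(range(-len(self.normalized_shape), 0))
        mean = x.mean(dim=dims, keepdim=True)
        var = x.var(dim=dims, keepdim=True, unbiased=False)  # biased estimator
        x = (x - mean) / t.sqrt(var + self.eps)
        if self.elementwise_affine:
            x = x * self.weight + self.bias
        return x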

if MAIN:
    w2d1_test.test_layernorm_mean_1d(LayerNorm)
    w2d1_test.test_layernorm_mean_2d(LayerNorm)
    w2d1_test.test_layernorm_std(LayerNorm)
    w2d1_test.test_layernorm_exact(LayerNorm)
    w2d1_test.test_layernorm_backward(LayerNorm)



w2d1_test.test_layernorm_mean_1d passed in 0.00s.
w2d1_test.test_layernorm_mean_2d passed in 0.00s.
w2d1_test.test_layernorm_std passed in 0.00s.
w2d1_test.test_layernorm_exact passed in 0.00s.
w2d1_test.test_layernorm_backward passed in 0.01s.



## Embedding

Implement your version of PyTorch's `nn.Embedding` module. The PyTorch version has some extra options in the constructor, but you don't need to implement those since BERT doesn't use them.

The `Parameter` should be named `weight` and initialized with normally distributed random values with a mean of 0 and std of 0.02.
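An embedding lookup is equivalent to multiplying one-hot vectors by the weight matrix, only much cheaper. The sketch below compares the two on toy sizes. The sizes and variable names are illustrative, not BERT's.

```python
import torch as t
from torch.nn import functional as F

t.manual_seed(0)
weight = t.randn(10, 4) * 0.02                 # (num_embeddings, embedding_dim)
token_ids = t.tensor([[1, 5, 5], [0, 9, 2]])   # (batch, seq)

# Indexing picks one row of the weight matrix per token id.
by_index = weight[token_ids]                   # (batch, seq, embedding_dim)
# The one-hot route builds a (batch, seq, num_embeddings) tensor first.
by_one_hot = F.one_hot(token_ids, num_classes=10).float() @ weight

print(by_index.shape, t.allclose(by_index, by_one_hot))
```

With BERT's vocabulary, the one-hot tensor would have 28996 entries per token, which is why indexing is preferred.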




In [5]:
class Embedding(nn.Module):
    num_embeddings: int
    embedding_dim: int
    weight: nn.Parameter

    def __init__(self, num_embeddings: int, embedding_dim: int):
        super().__init__()
        self.num_embeddings = num_embeddings
        self.embedding_dim = embedding_dim
        self.weight = nn.Parameter(t.empty(num_embeddings, embedding_dim))
        nn.init.normal_(self.weight, mean=0, std=0.02)



    def forward(self, x: t.LongTensor) -> t.Tensor:
        """For each integer in the input, return that row of the embedding.

        Don't convert x to one-hot vectors - this works but is too slow.
        """
        return self.weight[x]

    def extra_repr(self) -> str:
        return f"{self.num_embeddings}, {self.embedding_dim}"


if MAIN:
    assert repr(Embedding(10, 20)) == repr(t.nn.Embedding(10, 20))
    w2d1_test.test_embedding(Embedding)
    w2d1_test.test_embedding_std(Embedding)



w2d1_test.test_embedding passed in 0.00s.
w2d1_test.test_embedding_std passed in 0.00s.



## BertMLP

Make the MLP block, following the schematic. Use `nn.Dropout` for the dropout layer.




In [6]:
class BertMLP(nn.Module):
    first_linear: nn.Linear
    second_linear: nn.Linear
    layer_norm: LayerNorm

    def __init__(self, config: BertConfig):
        super().__init__()
        self.first_linear = nn.Linear(config.hidden_size, config.intermediate_size)
        self.second_linear = nn.Linear(config.intermediate_size, config.hidden_size)
        self.layer_norm = LayerNorm(config.hidden_size, eps = config.layer_norm_epsilon)
        self.dropout = nn.Dropout(config.dropout)


    def forward(self, x: t.Tensor) -> t.Tensor:
        original_x = x
        x = self.first_linear(x)
        x = F.gelu(x)
        x = self.second_linear(x)
        x = self.dropout(x)
        x = self.layer_norm(x + original_x)
        return x

if MAIN:
    w2d1_test.test_bert_mlp_zero_dropout(BertMLP)
    w2d1_test.test_bert_mlp_one_dropout(BertMLP)



w2d1_test.test_bert_mlp_zero_dropout passed in 0.01s.
w2d1_test.test_bert_mlp_one_dropout passed in 0.00s.



## Bert Block

Assemble the `BertAttention` and `BertBlock` classes following the schematic.




In [9]:
class BertAttention(nn.Module):
    self_attn: BertSelfAttention
    layer_norm: LayerNorm

    def __init__(self, config: BertConfig):
        super().__init__()
        self.self_attn = BertSelfAttention(config)
        self.layer_norm = LayerNorm(config.hidden_size, eps = config.layer_norm_epsilon)
        self.dropout = nn.Dropout(config.dropout)


    def forward(self, x: t.Tensor, additive_attention_mask: Optional[t.Tensor] = None) -> t.Tensor:
        original_x = x
        x = self.self_attn(x, additive_attention_mask)
        x = self.dropout(x)
        x = self.layer_norm(x + original_x)
        return x


if MAIN:
    w2d1_test.test_bert_attention_dropout(BertAttention)


class BertBlock(nn.Module):
    attention: BertAttention
    mlp: BertMLP

    def __init__(self, config: BertConfig):
        super().__init__()
        self.attention = BertAttention(config)
        self.mlp = BertMLP(config)


    def forward(self, x: t.Tensor, additive_attention_mask: Optional[t.Tensor] = None) -> t.Tensor:
        x = self.attention(x, additive_attention_mask)
        x = self.mlp(x)
        return x


if MAIN:
    # NOTE: this test fails by a tiny margin (max absolute deviation ~1.4e-06, see the output below); the cause has not been tracked down.
    w2d1_test.test_bert_block(BertBlock)



w2d1_test.test_bert_attention_dropout passed in 0.01s.
Test failed. Max absolute deviation: 1.430511474609375e-06
Actual:
tensor([[[-0.5345, -0.5477,  1.0330,  ..., -0.4000, -0.8045, -0.1134],
         [ 1.1591, -0.9310, -1.1030,  ...,  0.8801,  1.0987,  0.3142],
         [-1.0729, -1.0277,  0.3382,  ...,  1.2628, -0.0032, -1.1757]],

        [[-0.1461, -1.1046, -0.0057,  ...,  1.6682, -1.3070, -0.8908],
         [-1.3667,  1.1291,  1.6672,  ...,  0.6607,  0.2909, -0.9596],
         [-0.5694, -1.3173, -1.0107,  ..., -0.0240,  0.6005, -1.5292]]],
       grad_fn=<AddBackward0>)
Expected:
tensor([[[-0.5345, -0.5477,  1.0330,  ..., -0.4000, -0.8045, -0.1134],
         [ 1.1591, -0.9310, -1.1030,  ...,  0.8801,  1.0987,  0.3142],
         [-1.0729, -1.0277,  0.3382,  ...,  1.2628, -0.0032, -1.1757]],

        [[-0.1461, -1.1046, -0.0057,  ...,  1.6682, -1.3070, -0.8908],
         [-1.3667,  1.1291,  1.6672,  ...,  0.6607,  0.2909, -0.9596],
         [-0.5694, -1.3173, -1.0107,  ..., -0.0240

AssertionError: allclose failed with 5 / 4608 entries outside tolerance


## Putting it All Together

Now put the pieces together. We're going to have a `BertLMHead`, noting the following:

- The language modelling `Linear` layer after the blocks has shape `(embedding_size, embedding_size)`.
- If `token_type_ids` isn't provided to `forward`, make it the same shape as `input_ids` but filled with all zeros.
- The unembedding at the end that takes data from `hidden_size` to `vocab_size` shouldn't be its own `Linear` layer because it shares the same data as `token_embedding.weight`. Just reuse `token_embedding.weight` and add a bias term.
- Print your model out to see if it resembles the schematic.
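One way to express the tied unembedding is `F.linear`, which computes `x @ weight.T + bias` without creating a separate layer. This is a sketch with toy sizes. `embedding_weight` stands in for `token_embedding.weight`, and the variable names are illustrative.

```python
import torch as t
from torch.nn import functional as F

t.manual_seed(0)
vocab_size, hidden_size = 50, 8                       # toy sizes, not BERT's
embedding_weight = t.randn(vocab_size, hidden_size)   # stands in for token_embedding.weight
unembed_bias = t.zeros(vocab_size)

x = t.randn(2, 3, hidden_size)                        # (batch, seq, hidden_size)
# Reuse the embedding matrix as the unembedding: (batch, seq, vocab_size).
logits = F.linear(x, embedding_weight, unembed_bias)
print(logits.shape)
```

Because the same tensor is used in both places, gradients from the embedding and the unembedding accumulate into one parameter.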

The tokenizer will produce `one_zero_attention_mask`, but our `BertSelfAttention` needs `additive_attention_mask`. This mask is the same for every layer, so we can compute it once at the beginning of BERT's forward method. This will prevent `BertSelfAttention` from reading any data from the padding tokens.
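The effect of the additive mask can be seen on a toy example. In this sketch the attention scores are all zeros and the last two positions are padding. The value -10000 and the tensor sizes are assumptions chosen for illustration.

```python
import torch as t

one_zero_mask = t.tensor([[1, 1, 1, 0, 0]])     # (batch, seq); the last two tokens are padding
# 0 where attention is allowed, a large negative number where it is not.
additive = (1 - one_zero_mask)[:, None, None, :] * -10000.0   # (batch, 1, 1, seq_k)

scores = t.zeros(1, 2, 5, 5)                    # (batch, head, seq_q, seq_k)
# The mask broadcasts over heads and query positions.
probs = (scores + additive).softmax(dim=-1)
print(probs[0, 0, 0])
```

After the softmax, the padded key positions receive zero weight and the remaining weight is shared among the real tokens.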

### utils.StaticModuleList

If you use a regular `nn.ModuleList` to hold your `BertBlock`s, the typechecker can't tell they are `BertBlock`s anymore and only knows that they're `nn.Module`.

We've provided a subclass `utils.StaticModuleList`, allowing us to declare in the class definition that this container really only contains `BertBlock` and no other types. The `repr` of `nn.ModuleList` also prints out all the children, which produces unreadable output for large numbers of layers; our `repr` is more concise.




In [10]:
class BertCommon(nn.Module):
    token_embedding: Embedding
    pos_embedding: Embedding
    token_type_embedding: Embedding
    layer_norm: LayerNorm
    blocks: nn.ModuleList

    def __init__(self, config: BertConfig):
        super().__init__()
        self.token_embedding = Embedding(config.vocab_size, config.hidden_size)
        self.pos_embedding = Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embedding = Embedding(config.type_vocab_size, config.hidden_size)
        self.layer_norm = LayerNorm(config.hidden_size, eps = config.layer_norm_epsilon)
        self.blocks = nn.ModuleList([BertBlock(config) for _ in range(config.num_layers)])
        self.dropout = nn.Dropout(config.dropout)

    def _make_additive_attention_mask(
        self,
        one_zero_attention_mask: t.Tensor,
        big_negative_number: float = -10000
    ) -> t.Tensor:

        """
        one_zero_attention_mask: shape (batch, seq). Contains 1 if this is a valid token and 0 if it is a padding token.
        big_negative_number: Any negative number large enough in magnitude that exp(big_negative_number) is 0.0 for the floating point precision used.

        Out: shape (batch, head=1, seq_q=1, seq_k), which broadcasts against the (batch, head, seq_q, seq_k) attention pattern. Contains 0 if attention is allowed, and big_negative_number if it is not allowed.
        """
        return rearrange((1 - one_zero_attention_mask), 'b s -> b 1 1 s') * big_negative_number

    
    def forward(
        self,
        input_ids: t.Tensor,
        token_type_ids: Optional[t.Tensor] = None,
        one_zero_attention_mask: Optional[t.Tensor] = None,
    ) -> t.Tensor:
        """
        input_ids: (batch, seq) - the token ids
        token_type_ids: (batch, seq) - only used for next sentence prediction.
        one_zero_attention_mask: (batch, seq) - only used in training. See _make_additive_attention_mask.
        """
        token_embeddings = self.token_embedding(input_ids)
        pos_embeddings = self.pos_embedding(t.arange(input_ids.shape[1], device=input_ids.device))
        
        # Compare against None: truth-testing a multi-element tensor raises a RuntimeError.
        if token_type_ids is None:
            token_type_ids = t.zeros_like(input_ids)

        token_type_embeddings = self.token_type_embedding(token_type_ids)
        x = token_embeddings + pos_embeddings + token_type_embeddings
        x = self.layer_norm(x)
        x = self.dropout(x)

        # Compare against None: truth-testing a multi-element tensor raises a RuntimeError.
        if one_zero_attention_mask is not None:
            additive_attention_mask = self._make_additive_attention_mask(one_zero_attention_mask)
        else:
            additive_attention_mask = None

        for block in self.blocks:
            x = block(x, additive_attention_mask)
        return x





        
        
                




## BertLanguageModel

<details>

<summary>I can't figure out why my model's outputs are off by a very small amount, like 0.0005!</summary>

Check that you're passing the correct layer norm epsilon through the network. The PyTorch default is 1e-5, but BERT used 1e-12.

</details>
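To see how much the epsilon matters, this sketch runs the same input through layer normalization with both values. It uses `float64` and a random input of width 768 so that the small difference is not lost to rounding. The shapes and names are illustrative.

```python
import torch as t
from torch.nn import functional as F

t.manual_seed(0)
x = t.randn(4, 768, dtype=t.float64)
out_default = F.layer_norm(x, (768,), eps=1e-5)    # PyTorch's default epsilon
out_bert = F.layer_norm(x, (768,), eps=1e-12)      # epsilon used by BERT
max_diff = (out_default - out_bert).abs().max().item()
print(max_diff)
```

The difference from a single layer norm is small but nonzero, and BERT applies layer normalization many times between input and output.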




In [11]:
class BertLanguageModel(nn.Module):
    common: BertCommon
    lm_linear: nn.Linear
    lm_layer_norm: LayerNorm
    unembed_bias: nn.Parameter

    def __init__(self, config: BertConfig):
        super().__init__()
        self.common = BertCommon(config)
        self.lm_linear = nn.Linear(config.hidden_size, config.hidden_size)
        self.lm_layer_norm = LayerNorm(config.hidden_size, eps=config.layer_norm_epsilon)
        self.unembed_bias = nn.Parameter(t.zeros(config.vocab_size))

    def forward(
        self,
        input_ids: t.Tensor,
        token_type_ids: Optional[t.Tensor] = None,
        one_zero_attention_mask: Optional[t.Tensor] = None,
    ) -> t.Tensor:
        """Compute logits for each token in the vocabulary.

        Return: shape (batch, seq, vocab_size)
        """
        x = self.common(input_ids, token_type_ids, one_zero_attention_mask)
        x = self.lm_linear(x)
        x = F.gelu(x)
        x = self.lm_layer_norm(x)
        # unembed with original embedding matrix
        x = x @ self.common.token_embedding.weight.t() + self.unembed_bias
        return x


    
if MAIN:
    w2d1_test.test_bert(BertLanguageModel)



Test failed. Max absolute deviation: 3.606081008911133e-06
Actual:
tensor([[[-0.5903,  0.3522,  0.3753,  ..., -0.6550,  0.2556, -1.0716],
         [ 0.1514,  0.3186,  0.7801,  ..., -0.0321,  0.7688, -1.4717],
         [ 0.2311,  0.2211,  1.2228,  ..., -0.2644,  0.5473, -1.0289],
         ...,
         [-0.0398,  0.8780,  0.4739,  ..., -0.0110,  1.0243, -0.6003],
         [-0.0889,  0.1358,  1.1366,  ..., -0.7023, -0.0771, -0.7481],
         [-0.0919, -0.2449,  0.8229,  ...,  0.1497,  0.5532, -0.5897]]],
       grad_fn=<AddBackward0>)
Expected:
tensor([[[-0.5903,  0.3522,  0.3753,  ..., -0.6550,  0.2556, -1.0716],
         [ 0.1514,  0.3186,  0.7801,  ..., -0.0321,  0.7688, -1.4717],
         [ 0.2311,  0.2211,  1.2228,  ..., -0.2644,  0.5473, -1.0289],
         ...,
         [-0.0398,  0.8780,  0.4739,  ..., -0.0110,  1.0243, -0.6003],
         [-0.0889,  0.1358,  1.1366,  ..., -0.7023, -0.0771, -0.7481],
         [-0.0919, -0.2449,  0.8229,  ...,  0.1497,  0.5532, -0.5897]]],
       g

AssertionError: allclose failed with 1457 / 202972 entries outside tolerance


## Loading Pretrained Weights

Now copy parameters from the pretrained BERT returned by `utils.load_pretrained_bert()` into your BERT. This is definitely tedious and it's traditional to groan about how boring this is, but it is representative of real ML work and we want you to have an Authentic ML Experience.

Remember that the embedding and unembedding weights are tied, so `hf_bert.bert.embeddings.word_embeddings.weight` and `hf_bert.cls.predictions.decoder.weight` should be equal and you should only use one of them.

Feel free to copy over the solution if you get frustrated.

<details>

<summary>I'm confused about my `Parameter` not being a leaf!</summary>

When you copied data from the HuggingFace version, PyTorch tracked the history of the copy operation. This means if you were to call `backward`, it would try to backpropagate through your Parameter back to the HuggingFace version, which is not what we want. To fix this, you can call `detach()` to make a new tensor that shares storage with the original but doesn't carry its history for backpropagation.

</details>
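The pattern can be tried on two small `Linear` layers before applying it to the full model. In this sketch `src` stands in for a layer of the pretrained model and `ours` for the matching layer in your model. Both names and the layer sizes are illustrative.

```python
import torch as t
from torch import nn

src = nn.Linear(4, 3)      # stands in for a pretrained layer
ours = nn.Linear(4, 3)     # stands in for the matching layer in your model

# detach() returns a tensor sharing storage with the parameter, so the in-place
# copy updates the parameter's values without recording history on the parameter.
ours.weight.detach().copy_(src.weight)
ours.bias.detach().copy_(src.bias)

print(ours.weight.is_leaf, t.equal(ours.weight, src.weight))
```

The parameter keeps `requires_grad=True`, so it can still be trained after the copy.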





In [23]:
def load_pretrained_weights(config: BertConfig) -> BertLanguageModel:
    hf_bert = utils.load_pretrained_bert()
    my_bert = BertLanguageModel(config)
    # print(hf_bert)
    # print(my_bert)
    
    def _copy(ours, src):
        ours.weight.detach().copy_(src.weight)
        if getattr(ours, "bias", None) is not None:
            ours.bias.detach().copy_(src.bias)

    # Fill every parameter with NaN first so that any weight we forget to copy is detected below.
    for p in my_bert.parameters():
        p.detach().fill_(float("nan"))
    
    _copy(my_bert.lm_linear, hf_bert.cls.predictions.transform.dense)  
    _copy(my_bert.lm_layer_norm, hf_bert.cls.predictions.transform.LayerNorm)
    my_bert.unembed_bias.detach().copy_(hf_bert.cls.predictions.decoder.bias)
    _copy(my_bert.common.token_embedding, hf_bert.bert.embeddings.word_embeddings)
    _copy(my_bert.common.pos_embedding, hf_bert.bert.embeddings.position_embeddings)
    _copy(my_bert.common.token_type_embedding, hf_bert.bert.embeddings.token_type_embeddings)
    _copy(my_bert.common.layer_norm, hf_bert.bert.embeddings.LayerNorm)

    for i, block in enumerate(my_bert.common.blocks):
        _copy(block.mlp.first_linear, hf_bert.bert.encoder.layer[i].intermediate.dense)
        _copy(block.mlp.second_linear, hf_bert.bert.encoder.layer[i].output.dense)
        _copy(block.mlp.layer_norm, hf_bert.bert.encoder.layer[i].output.LayerNorm)
        _copy(block.attention.layer_norm, hf_bert.bert.encoder.layer[i].attention.output.LayerNorm)
        _copy(block.attention.self_attn.project_query, hf_bert.bert.encoder.layer[i].attention.self.query)
        _copy(block.attention.self_attn.project_key, hf_bert.bert.encoder.layer[i].attention.self.key)
        _copy(block.attention.self_attn.project_value, hf_bert.bert.encoder.layer[i].attention.self.value)
        _copy(block.attention.self_attn.project_output, hf_bert.bert.encoder.layer[i].attention.output.dense)


    for name, p in my_bert.named_parameters():
        assert not t.isnan(p).any(), f"Parameter {name} still contains NaN"
    
    return my_bert

    

if MAIN:
    my_bert = load_pretrained_weights(config)
    for (name, p) in my_bert.named_parameters():
        assert (
            p.is_leaf
        ), f"Parameter {name} is not a leaf node, which will cause problems in training. Try adding detach() somewhere."




## Tokenization

We're going to use a HuggingFace tokenizer for now to encode text into a sequence of tokens that our model can use. The tokenizer has to match the model - our model was trained with the `bert-base-cased` tokenizer which is case-sensitive. If you tried to use the `bert-base-uncased` tokenizer which is case-insensitive, it wouldn't work at all.

Use `transformers.AutoTokenizer.from_pretrained` to automatically fetch the appropriate tokenizer and try encoding and decoding some text.

### Vocabulary

Check out `tokenizer.vocab` to get an idea of what sorts of strings are assigned to tokens. In WordPiece, tokens represent a whole word unless they start with `##`, which denotes this token is part of a word.
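The `##` convention can be illustrated without a tokenizer. The sketch below merges a hand-written token list back into words. The function `join_wordpieces` and the example tokens are hypothetical and are not taken from the `bert-base-cased` vocabulary.

```python
from typing import List


def join_wordpieces(tokens: List[str]) -> str:
    """Merge WordPiece tokens into words: a '##' prefix glues a token to the previous one."""
    words: List[str] = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: strip the marker and append to the current word.
            words[-1] += token[2:]
        else:
            # Token without the marker: start a new word.
            words.append(token)
    return " ".join(words)


# Hypothetical token sequence, chosen for illustration only.
print(join_wordpieces(["un", "##believ", "##able", "results"]))  # unbelievable results
```

The real tokenizer's `decode` method handles this, along with special tokens and punctuation spacing.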

### Special Tokens

Check out `tokenizer.special_tokens_map`. The strings here are mapped to tokens which have special meanings - for example `tokenizer.mask_token`, which is the literal string '[MASK]', is converted to `tokenizer.mask_token_id`, equal to 103.

### Predicting Masked Tokens

Write the `predict` function which takes a string with one or more instances of the substring '[MASK]', runs it through your model, finds the top K predictions and decodes each prediction.

Tips:

- `torch.topk()` is useful for identifying the `k` largest elements.
- The model should be in evaluation mode for predictions - this disables dropout and makes the predictions deterministic.
- If your model gives different predictions than the HuggingFace section, proceed to the next section on debugging.
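As a small illustration of the first tip, here is `torch.topk` on a hand-written tensor standing in for the logits at two `[MASK]` positions. The values and the five-entry "vocabulary" are made up.

```python
import torch as t

# (num_masks, vocab_size): one row of logits per [MASK] position.
logits = t.tensor([[0.1, 2.0, -1.0, 3.5, 0.7],
                   [4.0, 0.0, 1.0, -2.0, 2.5]])
top = t.topk(logits, k=2, dim=-1)
# Indices of the two largest logits in each row, largest first.
print(top.indices.tolist())  # [[3, 1], [0, 4]]
```

In `predict`, each of these indices is a token id that can be passed to the tokenizer's `decode` method.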





In [124]:
def predict(model: BertLanguageModel, tokenizer, text: str, k=15) -> List[List[str]]:
    """
    Return a list of k strings for each [MASK] in the input.
    """
    # Evaluation mode disables dropout, so predictions are deterministic.
    model.eval()
    # return_tensors="pt" returns PyTorch tensors with a batch dimension, ready to be fed into the model
    input_ids = tokenizer(text, return_tensors="pt")['input_ids']
    # No gradients are needed for inference.
    with t.no_grad():
        out = model(input_ids)
    # Boolean indexing keeps one row of vocabulary logits per [MASK]: (num_masks, vocab_size)
    mask_logits = out[input_ids == tokenizer.mask_token_id]
    top_k_indices = t.topk(mask_logits, k, dim=-1).indices
    # Decode each candidate token id separately so every prediction is its own string.
    return [[tokenizer.decode([i]) for i in indices_for_mask] for indices_for_mask in top_k_indices.tolist()]

def next_sentence_prediction(model: BertLanguageModel, sen1: str, sen2: str, tokenizer) -> t.Tensor:
    """
    Exploratory helper for next sentence prediction.

    As called here, the model returns per-token vocabulary logits, so this returns the logits
    at the [CLS] position rather than a True/False decision. A real prediction would also
    need a classification head on top of the [CLS] representation.
    """
    # .eval() is inherited from nn.Module, the parent class of all our models
    model.eval()
    # tokenize the two sentences concatenated with the [SEP] token between them
    input_ids = tokenizer(sen1 + tokenizer.sep_token + sen2, return_tensors="pt")['input_ids']
    print(tokenizer.decode(input_ids[0]))
    # get the output of the model, shape (batch, seq, vocab)
    out = model(input_ids)
    # [CLS] is the first token, so take position 0 along the sequence dimension.
    # Index the 3-D output directly: out[0] would drop the batch dimension and leave a 2-D tensor.
    cls_value = out[:, 0, :]
    return cls_value

    

if MAIN and (not IS_CI):
    tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-cased")
    #w2d1_test.test_bert_prediction(predict, my_bert, tokenizer)
    #your_text = "My favourite food is [MASK]."
    sen1 = "I like to eat pizza."
    sen2 = "I like to eat pasta."
    next_sentence_prediction(my_bert, sen1, sen2, tokenizer)
    #print(next_sentence_prediction(my_bert, sen1, sen2, tokenizer))
    #predictions = predict(my_bert, tokenizer, your_text)
    #print("Model predicted: \n", "\n".join(map(str, predictions)))



[CLS] I like to eat pizza. [SEP] I like to eat pasta. [SEP]


## Model debugging

If your model works correctly at this point then congratulations, you can skip this section.

The challenge with debugging ML code is that it often silently computes the wrong result instead of erroring out. Some things you can check:

- Do I have any square matrices transposed, so the shapes still match but they do the wrong thing?
- Did I forget to pass any optional arguments, and the wrong default is being used?
- If I `print` my model, do the layers look right?
- Can I add `assert`s in my code to check assumptions that I've made? In particular, sometimes unintentional broadcasting creates outputs of the wrong shape.
- Is a tensor supposed to consist of `float`s, but might actually consist of `int`s? This can be tricky, because `t.tensor([1,2,3])` will produce an integer tensor, but if any one of the elements is a float, like `t.tensor([1,2,3.])`, then it will be a float tensor.
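
Two of the checks above are easy to try out in isolation. This small sketch, which assumes PyTorch's default dtypes have not been changed, shows the integer/float inference and a case of unintentional broadcasting that an `assert` would catch; `check_same_shape` is a hypothetical helper name.

```python
import torch as t

# dtype is inferred from the Python values passed in.
ints = t.tensor([1, 2, 3])
floats = t.tensor([1, 2, 3.0])
print(ints.dtype, floats.dtype)  # torch.int64 torch.float32

# Unintentional broadcasting: adding a column to a row silently gives a matrix.
col = t.zeros(3, 1)
row = t.zeros(3)
summed = col + row
print(summed.shape)  # torch.Size([3, 3])


def check_same_shape(a: t.Tensor, b: t.Tensor) -> None:
    """Fail loudly instead of letting broadcasting hide a shape mistake."""
    assert a.shape == b.shape, f"shape mismatch: {tuple(a.shape)} vs {tuple(b.shape)}"


try:
    check_same_shape(col, row)
    caught = False
except AssertionError as err:
    caught = True
    print(err)  # shape mismatch: (3, 1) vs (3,)
```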

You won't always have a reference implementation, but given that you do courtesy of HuggingFace, a good technique is to use hooks to collect the inputs and outputs that should be identical and compare them to find where they first diverge. This narrows down the number of places where you have to look for the bug.

Read the [documentation](https://pytorch.org/docs/stable/generated/torch.nn.modules.module.register_module_forward_hook.html) for `register_forward_hook` on a `nn.Module` and try logging the input and output of each block on your model and the HuggingFace version. Note that a forward hook runs each time the module's `forward()` call completes, so it can also read the module's parameters and store whatever it sees in ordinary Python data structures. You can also use `utils.allclose_atol()`, as in many of the tests you have already encountered, to check whether two tensors have values within a specified tolerance.
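
As a warm-up on something much smaller than BERT, the sketch below records the output of every layer in two tiny `nn.Sequential` models and reports the first layer where they disagree. Everything here is an assumption made for illustration: the two toy models stand in for your BERT and the HuggingFace one, `record_outputs` is a hypothetical helper, the bug (a transposed square weight matrix) is injected on purpose, and `t.allclose` is used in place of `utils.allclose_atol()`.

```python
import torch as t
from torch import nn


def record_outputs(model: nn.Module, store: dict) -> list:
    """Attach a forward hook to each direct child and return the hook handles."""
    handles = []
    for name, module in model.named_children():
        # Bind name as a default argument so each hook keeps its own layer name.
        def hook(mod, inputs, output, name=name):
            store[name] = output.detach().clone()
        handles.append(module.register_forward_hook(hook))
    return handles


t.manual_seed(0)
reference = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))
candidate = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))
candidate.load_state_dict(reference.state_dict())

# Deliberate bug: a transposed square matrix has the right shape but wrong values.
with t.no_grad():
    candidate[0].weight.copy_(reference[0].weight.T)

ref_out, cand_out = {}, {}
handles = record_outputs(reference, ref_out) + record_outputs(candidate, cand_out)

x = t.randn(3, 4)
reference.eval()
candidate.eval()
with t.no_grad():
    reference(x)
    candidate(x)

# Remove the hooks so later forward passes are not recorded.
for handle in handles:
    handle.remove()

# Layers are stored in forward order, so the first mismatch localises the bug.
first_diverging = next(
    (name for name in ref_out if not t.allclose(ref_out[name], cand_out[name], atol=1e-5)),
    None,
)
print(first_diverging)  # 0
```

For the real models the layer names will not line up one-to-one, so you would need to decide which pairs of modules should produce identical outputs before comparing them.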





In [None]:
if MAIN and (not IS_CI):
    "TODO: YOUR CODE HERE"





# Bonus

Congratulations on finishing the day's content! No bonus section today.

Tomorrow you'll have the same partner, so feel free to get started on W2D2. Or, play with your BERT and post your favorite completions in the Slack!
