# BERT (Bidirectional Encoder Representations from Transformers)

## overview

BERT is a versatile pretrained encoder: at its release it set state-of-the-art results on a range of NLP tasks (e.g., GLUE, SQuAD).

**contextualized embedding**: BERT produces word embeddings or sentence embeddings that capture bidirectional context in a sentence.

Two models were released:

- BERT-base: 12 layers, 768-dim hidden states, 12 attention heads, 110 million params.

- BERT-large: 24 layers, 1024-dim hidden states, 16 attention heads, 340 million params.
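The parameter counts above can be roughly reproduced from the layer sizes. The sketch below is a back-of-the-envelope count, not the official accounting. It assumes values that are not stated in these notes: a 30,522-token WordPiece vocabulary, 512 positions, 2 segments, and a feed-forward inner size of 4x the hidden size. `bert_param_count` is a hypothetical helper name.

```python
def bert_param_count(layers, hidden, vocab=30522, max_pos=512, segments=2):
    """Count parameters of a BERT encoder with pooler (no pretraining heads)."""
    ffn = 4 * hidden  # feed-forward inner size, assumed to be 4x hidden
    # Token + position + segment embedding tables, plus LayerNorm (scale, bias).
    embeddings = (vocab + max_pos + segments) * hidden + 2 * hidden
    # Q, K, V and output projections, each hidden x hidden with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers: hidden -> ffn -> hidden, with biases.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # One LayerNorm after attention and one after the feed-forward block.
    layer_norms = 2 * (2 * hidden)
    per_layer = attention + feed_forward + layer_norms
    # Pooler: dense layer applied to the [CLS] hidden state.
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler


base = bert_param_count(layers=12, hidden=768)
large = bert_param_count(layers=24, hidden=1024)
print(f"BERT-base:  {base / 1e6:.1f}M parameters")
print(f"BERT-large: {large / 1e6:.1f}M parameters")
```

Under these assumptions the count comes out at about 109.5M and 335.1M, close to the rounded 110M and 340M figures. The number of attention heads does not change the count, because the heads split the same hidden-size projections.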

Training data

- BooksCorpus (800 million words)

- English Wikipedia (2,500 million words)


pretraining: 64 TPU chips for 4 days (BERT-large; BERT-base used 16 TPU chips for 4 days).

**fine-tuning**: add a task-specific output layer on top of the encoder and train all parameters end to end.


## contextual embedding

The contextual embeddings (BERT embeddings) of an input sequence $w_1, ..., w_n$ are the outputs $h_1, ..., h_n$ of the last encoder block.

$h_1$ is the contextual embedding of the [CLS] token and serves as the representation of the whole input sequence:

- next sentence prediction: $h_1$ is the input to a feed-forward NN

- fine-tuning on classification: $h_1$ is the input to a feed-forward NN

A simple example of using the pre-trained BERT model from Hugging Face's `transformers` library in PyTorch:

```python
# !pip install transformers
import torch
from transformers import BertTokenizer, BertModel

# Load the BERT tokenizer and the pre-trained base model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Input text
text = "Here is an example of using BERT in PyTorch."

# Tokenize the input text; the tokenizer adds [CLS] and [SEP] automatically
inputs = tokenizer(text, return_tensors='pt')

# Run the text through the pre-trained BERT model
with torch.no_grad():
    outputs = model(**inputs)

# outputs.last_hidden_state has shape (batch_size, sequence_length, hidden_size).
# [CLS] is always the first token, so index 0 along the sequence axis selects it.
# The result is a fixed-size vector of shape (batch_size, hidden_size) that
# aggregates context from the whole input sequence and can be used as input
# to a classifier for downstream tasks.
sentence_representation = outputs.last_hidden_state[:, 0, :]

print(sentence_representation)
```
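To illustrate how the [CLS] vector feeds a task-specific output layer, here is a minimal pure-Python sketch of a classification head: one linear layer followed by a softmax. The sizes, the random stand-in for $h_1$, and the helper names `linear` and `softmax` are assumptions for illustration. In practice the head would be a trainable layer fine-tuned together with the encoder.

```python
import math
import random


def linear(x, weight, bias):
    """Compute weight @ x + bias for a single vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
hidden_size, num_labels = 8, 2  # toy sizes; BERT-base would use hidden_size=768

# Stand-in for the [CLS] vector h_1 produced by the encoder.
h_cls = [random.gauss(0.0, 1.0) for _ in range(hidden_size)]

# Task-specific output layer, randomly initialised before fine-tuning.
weight = [[random.gauss(0.0, 0.02) for _ in range(hidden_size)] for _ in range(num_labels)]
bias = [0.0] * num_labels

probs = softmax(linear(h_cls, weight, bias))
print(probs)  # one probability per label
```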


## pretraining

**Next Sentence Prediction (NSP)**: Predicting if a given sentence B follows sentence A in the original text. 

- a type of sentence-pair classification.

- teaches the model relationships between sentences.

- input: a pair of sentences (A, B).

- output: binary (Yes or No)

- Later work (e.g., RoBERTa) has argued that next sentence prediction is not necessary.
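The construction of NSP training pairs can be sketched as follows. This is a toy version under stated assumptions: positives and negatives are drawn with equal probability (the 50/50 split comes from the BERT paper and is not stated in these notes), negatives are taken from a different document, and `make_nsp_examples` is a hypothetical helper name.

```python
import random


def make_nsp_examples(documents, rng):
    """Build (sentence_a, sentence_b, is_next) triples.

    `documents` is a list of documents, each a list of sentences.
    At least two documents are needed so that negatives can be sampled.
    """
    examples = []
    for doc_index, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive example: B is the sentence that really follows A.
                examples.append((doc[i], doc[i + 1], 1))
            else:
                # Negative example: B is drawn from a different document.
                others = [d for j, d in enumerate(documents) if j != doc_index]
                examples.append((doc[i], rng.choice(rng.choice(others)), 0))
    return examples


documents = [
    ["the cat sat down", "it fell asleep", "then it woke up"],
    ["stocks rose today", "investors were pleased"],
]
for a, b, label in make_nsp_examples(documents, random.Random(0)):
    print(f"[CLS] {a} [SEP] {b} [SEP] -> {'IsNext' if label else 'NotNext'}")
```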

**Masked language modelling (MLM)**: 

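Select a random 15% of the input tokens and predict them.

The pretraining loss is evaluated only at the [CLS] position (for NSP) and at the selected tokens (for MLM). Each selected token is then treated in one of three ways: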

- 80%: replace the input word with [MASK].

    The primary goal is to predict masked tokens, so the majority of the selected positions receive the [MASK] token.

- 10%: replace the input word with a random token.

    This introduces noise and makes the model more robust. It prevents the model from relying solely on the presence of the [MASK] token to make predictions, forcing it to build a better understanding of the context.

- 10%: leave the input word unchanged, but still predict it.

    This addresses **a mismatch between pre-training and fine-tuning**: during fine-tuning there are no [MASK] tokens in the input, so the model must learn to make predictions even without explicit masking.

    It also helps the model build strong representations for **all words** in the sentence, not just the masked ones.
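The 80/10/10 rule above can be sketched in a few lines. This is a simplified illustration, not BERT's actual data pipeline: each token is selected independently with probability 0.15 (so the selected share is only approximately 15%), the vocabulary is a toy list, and `mask_tokens` is a hypothetical helper name.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted_tokens, labels); labels[i] is None where no loss is computed."""
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if token in SPECIAL or rng.random() >= select_prob:
            continue  # special tokens and unselected tokens are left alone
        labels[i] = token  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the token unchanged but still predict it
    return corrupted, labels


vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
corrupted, labels = mask_tokens(tokens, vocab, random.Random(0))
print(corrupted)
print(labels)
```

Positions where `labels` is `None` contribute nothing to the MLM loss, which matches the point that the loss is evaluated only at the selected tokens.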


<img src="https://hryang06.github.io/assets/images/post/bert/bert-mlm-ex.PNG">

## architecture: Encoder-only Transformer



- a stack of encoder blocks (12 in base, 24 in large)

- each encoder block has 2 sub-layers: multi-head self-attention and a position-wise fully connected feed-forward network.


- **Input Representation**: the input is tokenized with WordPiece, and each token embedding is summed with a positional embedding and a segment embedding.

  - Token Embeddings: $E \in \mathbb{R}^{m \times d_{\text{model}}}$

  - Positional Embeddings: $P \in \mathbb{R}^{m \times d_{\text{model}}}$

  - Segment Embeddings: $S \in \mathbb{R}^{m \times d_{\text{model}}}$

  $m$: input sequence length (context size); $d_{\text{model}}$: hidden size (768 for base, 1024 for large)

  The first token is `[CLS]` (classification); sentences are separated by `[SEP]` (separation).
  
- **Output**: token-level or sequence-level representations, depending on the task.

  The output of each encoder layer along each token's path is an embedding for that token; which output to use depends on the task.
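The input representation described above is the element-wise sum $E + P + S$. A minimal pure-Python sketch with toy sizes follows. The table sizes, token ids, and the `make_table` helper are hypothetical, and the real model also applies layer normalization and dropout after the sum.

```python
import random


def make_table(rows, dim, rng):
    """Randomly initialised embedding table with `rows` entries of size `dim`."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]


rng = random.Random(0)
d_model, vocab_size, max_len = 4, 10, 8  # toy sizes for illustration

token_table = make_table(vocab_size, d_model, rng)
position_table = make_table(max_len, d_model, rng)
segment_table = make_table(2, d_model, rng)

token_ids = [0, 5, 6, 1, 7, 1]    # hypothetical ids, e.g. [CLS] a b [SEP] c [SEP]
segment_ids = [0, 0, 0, 0, 1, 1]  # 0 for sentence A, 1 for sentence B

# Row i of the input is E[i] + P[i] + S[i], giving an m x d_model matrix.
inputs = [
    [t + p + s for t, p, s in zip(token_table[tok], position_table[pos], segment_table[seg])]
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids))
]
print(len(inputs), len(inputs[0]))  # m rows, d_model columns
```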


### segment embedding

Segment embeddings are used for the next sentence prediction pre-training task (and for sentence-pair inputs in general).

For each token in the input sequence, a segment embedding is added depending on which of the two sequences (A or B) the token belongs to.

This helps the model distinguish between the two input sequences.
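As a small illustration of how segment ids are assigned, the sketch below packs two token lists into one sequence. It assumes the usual convention that `[CLS]`, sentence A, and the first `[SEP]` belong to segment 0, while sentence B and the final `[SEP]` belong to segment 1. `build_pair` is a hypothetical helper name.

```python
def build_pair(tokens_a, tokens_b):
    """Pack two token lists as [CLS] A [SEP] B [SEP] and return tokens and segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # [CLS], sentence A and its [SEP] get segment 0; sentence B and its [SEP] get segment 1.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_pair(["my", "dog", "is", "cute"], ["he", "likes", "playing"])
for token, segment in zip(tokens, segment_ids):
    print(f"{token:>8} -> segment {segment}")
```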

## fine-tuning

BERT was fine-tuned on various tasks, including these from the GLUE benchmark:

- QQP: Quora Question Pairs (detect paraphrase questions)

- QNLI: natural language inference over question answering data

- SST-2: sentiment analysis

- CoLA: Corpus of Linguistic Acceptability (detect whether sentences are grammatical)

- STS-B: semantic textual similarity

- MRPC: Microsoft Research Paraphrase Corpus

- RTE: a small natural language inference corpus

## BERT variants

**SpanBERT**: span masking

RoBERTa: 

- changes: more training data + dynamic masking (the mask is re-sampled every epoch) + more epochs + no next sentence prediction.

- conclusion: more compute and more data can improve pretraining even without changing the underlying Transformer encoder.
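The difference between static and dynamic masking can be sketched as follows. This is a toy illustration only: positions are sampled independently with probability 0.15, and `sample_mask_positions` is a hypothetical helper name rather than part of any RoBERTa code.

```python
import random


def sample_mask_positions(num_tokens, rng, mask_prob=0.15):
    """Pick the positions to mask, each token independently with probability mask_prob."""
    return [i for i in range(num_tokens) if rng.random() < mask_prob]


num_tokens, num_epochs = 40, 3

# Static masking: positions are chosen once during preprocessing and reused.
static_positions = sample_mask_positions(num_tokens, random.Random(0))
static_epochs = [static_positions for _ in range(num_epochs)]

# Dynamic masking: positions are re-sampled every time the sequence is seen.
rng = random.Random(0)
dynamic_epochs = [sample_mask_positions(num_tokens, rng) for _ in range(num_epochs)]

for epoch in range(num_epochs):
    print(epoch, "static:", static_epochs[epoch], "dynamic:", dynamic_epochs[epoch])
```

With dynamic masking the model sees a different corruption of the same sentence in each epoch, which acts like additional training data at no extra storage cost.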

BART: BERT + Autoregressive decoder = denoising autoencoder

[ExBert - Huggingface Exploring Transformers](https://huggingface.co/exbert/?model=gpt2&modelKind=bidirectional&sentence=The%20girl%20ran%20to%20a%20local%20pub%20to%20escape%20the%20din%20of%20her%20city.&layer=2&heads=..0,1,2,3,4,5,6,7,8,9,10,11&threshold=0.01&tokenInd=null&tokenSide=null&maskInds=..2&hideClsSep=true)

DistilBERT, ALBERT: Smaller versions of BERT

CamemBERT, FlauBERT: BERT in French

PhoBERT, HerBERT: BERT in Vietnamese and Polish, respectively.

mBERT: BERT in 104 languages

### SpanBERT

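$$
\begin{aligned}
\textbf{Span Masking Objective:} \quad & L_{\text{span\_MLM}} = -E_{(s,e) \sim \text{Spans}}\left[\log P(w_s, \ldots, w_e \mid w_{\setminus (s:e)})\right]\\
\textbf{Span Boundary Objective:} \quad & L_{\text{SBO}} = -E_{(s,e) \sim \text{Spans}}\left[\sum_{i=s}^{e} \log P(w_i \mid h_{s-1}, h_{e+1}, pos_{i-s+1})\right]
\end{aligned}
$$

Here $(s, e)$ are the start and end positions of a masked span, $w_{\setminus (s:e)}$ is the input with the span masked out, $h_{s-1}, h_{e+1}$ are the hidden states of the two boundary tokens, and $pos_{i-s+1}$ is the relative position embedding of the target token.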

SpanBERT (Joshi & Chen et al., 2020): Improving Pre-training by Representing and Predicting Spans

SpanBERT **omits the next sentence prediction** task of BERT.

**Span masking and Span Boundary Objective (SBO)**: 

- span masking: mask contiguous spans of tokens, replacing each span with a sequence of [MASK] tokens of the same length instead of masking individual tokens, and then predict the entire span given its context.

- SBO: given the observed left and right tokens at the boundary of a masked span, predict all the tokens in the span.

    $$
    y_i = f(h_{s-1}, h_{e+1}, pos_{i-s+1})
    $$

    $y_i$: prediction for the $i$-th token of the sentence, a token inside the masked span ($s \le i \le e$)

    $s, e$: start and end positions of the masked span

    $f$: 2-layer FFNN with GeLU

    $h_{s-1}, h_{e+1}$: hidden states of the left and right boundary tokens

    $pos_{i-s+1}$: position embedding of the target token, encoding its position relative to the left boundary token at position $s-1$

- a harder pretraining task that is useful for span-level tasks (e.g., named entity recognition, coreference resolution)
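Span masking and the inputs to the span boundary objective can be sketched as below. This is a simplified toy version with several assumptions not stated in these notes: span lengths follow a geometric distribution with $p = 0.2$ clipped at 10 (the setting reported in the SpanBERT paper), the masking budget is 15% of the tokens, every masked token is replaced by [MASK] (no 80/10/10 rule), and spans are not aligned to whole words. Indices are 0-based, and all function names are hypothetical.

```python
import random


def sample_span_length(rng, p=0.2, max_len=10):
    """Draw a span length from a geometric distribution, clipped at max_len."""
    length = 1
    while length < max_len and rng.random() >= p:
        length += 1
    return length


def mask_spans(tokens, rng, budget=0.15):
    """Mask contiguous spans until about `budget` of the tokens are masked."""
    corrupted = list(tokens)
    masked = set()
    target = max(1, round(budget * len(tokens)))
    spans = []
    while len(masked) < target:
        length = min(sample_span_length(rng), target - len(masked))
        start = rng.randrange(1, len(tokens) - length)  # keep a token on each side
        span = range(start, start + length)
        if any(i in masked or i - 1 in masked or i + 1 in masked for i in span):
            continue  # skip spans that overlap or touch an existing one
        masked.update(span)
        spans.append((start, start + length - 1))
    for i in masked:
        corrupted[i] = "[MASK]"
    return corrupted, sorted(spans)


def sbo_inputs(spans):
    """For each masked position i, list (i, s-1, e+1, i-s+1) as used by the SBO."""
    return [(i, s - 1, e + 1, i - s + 1) for s, e in spans for i in range(s, e + 1)]


tokens = [f"w{i}" for i in range(1, 21)]
corrupted, spans = mask_spans(tokens, random.Random(0))
print(corrupted)
for i, left, right, rel in sbo_inputs(spans):
    print(f"predict {tokens[i]} from h[{left}], h[{right}], pos_{rel}")
```

Each printed line corresponds to one evaluation of $y_i = f(h_{s-1}, h_{e+1}, pos_{i-s+1})$: the two boundary hidden states are shared by all tokens in a span, and only the relative position changes.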



<img src="https://hyunyoung2.github.io/img/Image/NaturalLanguageProcessing/NLPLabs/Paper_Investigation/Language_Model/2020-12-23-SpanBERT/SpanBERT.PNG">