# BERT

BERT (Bidirectional Encoder Representations from Transformers)

**Contextualized embedding**: BERT produces word or sentence embeddings that capture context from both the left and the right of each token in a sentence.

## Transfer learning

**Pretraining**: unsupervised pretraining on a large corpus to learn general language understanding.

- Masked Language Modelling (MLM): predict the original tokens at a randomly selected 15% of positions in the input. The MLM loss is evaluated only at these selected positions; the `[CLS]` output is used for the NSP loss.

    This teaches the model relationships between words within a sentence.
    
    - 80% of the time: the selected token is replaced by `[MASK]`.
    
    - 10% of the time: the selected token is replaced by a random token.

    - 10% of the time: the selected token is left unchanged.
    
- Next Sentence Prediction (NSP): predict whether a given sentence B follows sentence A in the original text. This is a type of sentence-pair classification.

    This teaches the model relationships between sentences.

    Input: a pair of sentences. Output: binary (yes or no).
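The 80/10/10 corruption rule for MLM can be sketched in plain Python. This is a minimal illustration, not BERT's actual data pipeline: the function name `mask_tokens`, the toy vocabulary, and the choice to select each position independently with probability 0.15 are assumptions made for the example.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]", "[PAD]"}  # never selected for prediction

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels) following the 80/10/10 rule.

    labels[i] holds the original token at selected positions and None
    elsewhere, so a loss would be computed only where labels[i] is not None.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Skip special tokens and positions that are not selected.
        if tok in SPECIAL or rng.random() >= mask_prob:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return corrupted, labels

tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
toy_vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # hypothetical vocabulary
corrupted, labels = mask_tokens(tokens, toy_vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

Positions where `labels` is `None` contribute nothing to the MLM loss, which matches the rule that only the selected positions are scored.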

**Fine-tuning**: train on various NLP tasks with a small amount of labeled data by **adding a task-specific output layer on top of the BERT encoder.**

## Architecture: encoder-only Transformer



- a stack of encoder blocks (12 in BERT-base, 24 in BERT-large)

- each encoder block has 2 sub-layers: multi-head self-attention and a position-wise fully connected feed-forward network.


- **Input Representation**: the input is tokenized with WordPiece, and each token embedding is summed with a positional embedding and a segment embedding.

  - Token Embeddings: $E \in \mathbb{R}^{m \times d_{\text{model}}}$

  - Positional Embeddings: $P \in \mathbb{R}^{m \times d_{\text{model}}}$

  - Segment Embeddings: $S \in \mathbb{R}^{m \times d_{\text{model}}}$

  $m$: input sequence length (context size)

  The first token is always `[CLS]` (classification).
  
- **Output**: token-level or sequence-level representations depending on the task

  Each encoder layer outputs one embedding per token; which output to use depends on the task.
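The input representation described above, the element-wise sum of the token, positional, and segment embeddings, can be sketched with toy lookup tables. The table sizes, the four-word vocabulary, and the helper names `random_table` and `embed` are assumptions for illustration; random floats stand in for learned weights, and the sketch omits the layer normalization and dropout that typically follow the sum in the real model.

```python
import random

d_model = 4   # toy hidden size (768 in BERT-base)
max_len = 8   # toy maximum sequence length
vocab = {"[CLS]": 0, "[SEP]": 1, "hello": 2, "world": 3}  # hypothetical vocabulary
rng = random.Random(0)

def random_table(rows, cols):
    """Stand-in for a learned embedding table: rows x cols random floats."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

token_table = random_table(len(vocab), d_model)   # rows of E are looked up here
position_table = random_table(max_len, d_model)   # rows of P are looked up here
segment_table = random_table(2, d_model)          # rows of S: segment 0 or 1

def embed(token_ids, segment_ids):
    """Sum token, position, and segment embeddings for each input position."""
    rows = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        rows.append([
            e + p + s
            for e, p, s in zip(token_table[tok], position_table[pos], segment_table[seg])
        ])
    return rows

token_ids = [vocab[t] for t in ["[CLS]", "hello", "world", "[SEP]"]]
segment_ids = [0, 0, 0, 0]  # single-sentence input: every token is in segment 0
X = embed(token_ids, segment_ids)
print(len(X), len(X[0]))  # m rows, d_model columns
```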


### Segment embedding

Segment embeddings support the next sentence prediction pre-training task.

For each token in the input sequence, a segment embedding is added according to which of the two sequences the token belongs to.

This helps the model distinguish between the two input sequences.
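As a rough sketch of how a sentence pair and its segment ids might be assembled for NSP, the following builds one training example from a toy document. The function name `make_nsp_example`, the toy sentences, the 50/50 split between true and random next sentences, and the label convention (1 for "B follows A") are assumptions of this example rather than details stated above.

```python
import random

def make_nsp_example(sentences, index, rng):
    """Build one toy NSP example from a list of tokenized sentences.

    Half of the time sentence B is the true next sentence (label 1);
    otherwise it is a randomly chosen other sentence (label 0).
    """
    sent_a = sentences[index]
    if rng.random() < 0.5:
        sent_b, is_next = sentences[index + 1], 1
    else:
        candidates = [i for i in range(len(sentences)) if i not in (index, index + 1)]
        sent_b, is_next = sentences[rng.choice(candidates)], 0
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and its [SEP]; segment 1 covers B and its [SEP].
    segment_ids = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)
    return tokens, segment_ids, is_next

doc = [["the", "cat", "sat"], ["it", "purred"], ["rain", "fell"], ["we", "left"]]
tokens, segment_ids, is_next = make_nsp_example(doc, 0, random.Random(0))
for tok, seg in zip(tokens, segment_ids):
    print(seg, tok)
print("is_next =", is_next)
```

Each segment id selects one of the two rows of the segment embedding table, so all tokens of sentence A share one vector and all tokens of sentence B share the other.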

A simple example of using the pre-trained BERT model from Hugging Face's `transformers` library in PyTorch:

```python
# !pip install transformers
import torch
from transformers import BertTokenizer, BertModel

# Load the BERT tokenizer and the pre-trained base model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Input text
text = "Here is an example of using BERT in PyTorch."

# Tokenize the input text; the tokenizer adds [CLS] and [SEP] automatically
inputs = tokenizer(text, return_tensors='pt')

# Run the text through the pre-trained BERT model without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Sentence representation: last hidden state of the [CLS] token.
# It aggregates contextual information from the entire input sequence and
# gives a fixed-size vector that can be fed to a classifier for downstream tasks.
# last_hidden_state has shape (batch_size, sequence_length, hidden_size);
# [CLS] is always the first token, so index 0 selects it.
sentence_representation = outputs.last_hidden_state[:, 0, :]  # (batch_size, hidden_size)

print(sentence_representation)
```


## Contextual embedding

The contextual embeddings (BERT embeddings) of an input sequence $w_1, ..., w_n$ are the outputs $h_1, ..., h_n$ of the last encoder block.

$h_1$ is the contextual embedding of the `[CLS]` token and serves as the embedding of the whole input sequence.

- next sentence prediction: $h_1$ is the input to a feed-forward NN

- fine-tuning on classification: $h_1$ is the input to a feed-forward NN
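To make the role of $h_1$ concrete, here is a minimal sketch of a classification head: a single linear layer over the `[CLS]` vector followed by a softmax. Plain Python lists stand in for tensors to keep the example dependency-free, and the vector `h1`, the weights `W`, and the bias `b` are made-up values, not parameters from any trained model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(h1, weights, bias):
    """Linear layer over the [CLS] vector h1, followed by softmax."""
    logits = [sum(w * x for w, x in zip(row, h1)) + b for row, b in zip(weights, bias)]
    return softmax(logits)

h1 = [0.5, -1.0, 2.0]        # hypothetical [CLS] embedding, hidden size 3
W = [[0.1, 0.2, 0.3],        # one row of weights per class (2 classes here)
     [-0.3, 0.1, 0.0]]
b = [0.0, 0.1]
probs = classify(h1, W, b)   # class probabilities that sum to 1
print(probs)
```

During fine-tuning, a head of this kind would be trained together with the encoder, so the gradients also update the weights that produce $h_1$.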

## BERT variants

- RoBERTa: Facebook’s (improved) version of BERT

- DistilBERT, ALBERT: smaller versions of BERT

- CamemBERT, FlauBERT: BERT in French

- PhoBERT, HerBERT: BERT in Vietnamese and Polish, respectively

- mBERT: BERT in 104 languages

- SpanBERT: BERT for phrase-level tasks (e.g., named entity recognition, coreference resolution)