# Finetuning an encoder only model

## Encoder-decoder architecture
The encoder-decoder architecture is an ML architecture that is widely used in NLP tasks such as machine translation, text summarization, and language generation.  
The **encoder** takes a variable-length sequence as input and transforms it into a state with a fixed shape (**thought vector**) and the **decoder** maps the encoded state of a fixed shape to an output sequence.  
![title](images/encoder_decoder_architecture.png)  
*Image Credits: [Illustrated Transformers](https://jalammar.github.io/illustrated-transformer/)*

## Variations of encoder-decoder architectures:
| | Encoder-Decoder models | Encoder only models | Decoder only models |
| --- | --- | --- | --- |
| Analogy: |    | ![title](images/encoder_only_meme.jpg) | ![decoder only meme](images/decoder_only_meme.jpg)|
|Use cases: | sequence to sequence tasks like machine translation, document summarization | Embedding tasks, transfer learning for downstream tasks like classification | Language generation, completion, and other generative tasks. |
|Training objective: | Trained to minimize the difference between the predicted and target output sequences.  | Pre-trained on unsupervised tasks like language modeling, masked language modeling, etc.  | Pre-trained for generative tasks, often using autoregressive language modeling. |
|Examples: | T5(Text-to-Text Transfer Transformer), BART   | BERT(Bidirectional Encoder Representations from Transformers) | GPT family |


## BERT model 

BERT(Bidirectional Encoder Representations from Transformers)

- Encoder only model
- Trains on both left as well as right context across all the layers
- Two tasks:
  - Masked Language modelling(MLM)
  - Next word prediction(NSP)
- Architecture:
   - BERT-BASE (Layers=12, Hidden layer dimensions=768, Attention heads=12, Total Parameters=110M)
   - BERT-LARGE (Layers=24, Hidden layer dimensions=1024, Attention heads=16, Total Parameters=340M).
- Pretrained on Wikipedia and BooksCorpus
- Task specific finetuning using different corpus 

![Bert architecture](images/bert_architecture.png)

Picture credits: [Bert paper](https://aclanthology.org/N19-1423.pdf)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
#!pip install datasets evaluate transformers[sentencepiece]
#!pip install 'accelerate>=0.26.0'
#!pip install scipy


In [2]:
from transformers import AutoModelForMaskedLM, AutoTokenizer, AutoModelForSequenceClassification, AutoTokenizer
import torch

# Try out model

In [3]:
from transformers import AutoModelForMaskedLM, AutoTokenizer

# https://huggingface.co/google-bert/bert-base-uncased
checkpoint = "bert-base-uncased"

# Load model
model = AutoModelForMaskedLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Tokenizer (text=>tokens=>token_ids)

In [4]:
print(f"tokenizer.vocab_size: {tokenizer.vocab_size}")
# special tokens
print(f"tokenizer.unk_token: {tokenizer.unk_token} = {tokenizer.unk_token_id}")
print(f"zyxw: {tokenizer.convert_tokens_to_ids(['zyxw'])}")

print(f"tokenizer.sep_token: {tokenizer.sep_token} = {tokenizer.sep_token_id}")
print(f"tokenizer.pad_token: {tokenizer.pad_token} = {tokenizer.pad_token_id}")
print(f"tokenizer.cls_token: {tokenizer.cls_token} = {tokenizer.cls_token_id}")
print(f"tokenizer.mask_token: {tokenizer.mask_token}")

tokenizer.vocab_size: 30522
tokenizer.unk_token: [UNK] = 100
zyxw: [100]
tokenizer.sep_token: [SEP] = 102
tokenizer.pad_token: [PAD] = 0
tokenizer.cls_token: [CLS] = 101
tokenizer.mask_token: [MASK]


Tokenizer is used to encode and decode text.  
encode: token => token_id  
decode: token_id => token  


**Vocabulary size**: number of unique tokens in the vocabulary.  
**Special tokens**: special tokens that are used in the model.
- UNK token: token used to represent unknown tokens.
- SEP token: token used to separate input ids into different sequences.
- PAD token: token used to pad sequences.
- CLS token: token used to start sequences.

## Mask filling

In [5]:
input_text ="The capital of the France is [MASK]."

def fill_mask(sentence, topk=5):
    """
    Print topk candidates for the masked token in the sentence.
    """
    if "[MASK]" not in sentence:
        raise ValueError("Input sentence must contain [MASK] token.")

    print(f"sentence: {sentence}")
    print(f"===Tokenization===")
    # Tokenize input and get tensor
    inputs = tokenizer(sentence, return_tensors="pt")
    # input_ids, token_type_ids, attention_mask
    # print tokens
    print(f"tokenizer output: {inputs}")
    print(f"tokens: {tokenizer.convert_ids_to_tokens(inputs.input_ids[0])}")

    # findout the index of tokens which is masked
    mask_token_index = torch.where(inputs.input_ids == tokenizer.mask_token_id)[1]
    print(f"mask_token_index: {mask_token_index}")

    print(f"===Prediction===")
    # Get model predictions
    with torch.no_grad():
        outputs = model(**inputs)
    print(f"outputs.logits.shape: {outputs.logits.shape}")

    print(f"===Output===")
    
    # Extract the logits for the masked token
    mask_logits = outputs.logits[0, mask_token_index, :]

    print(f"mask_logits: {mask_logits}")
    # convert into probabilities
    mask_probs = torch.softmax(mask_logits, dim=1)

    print(f"mask_probs: {mask_probs}")
    # Get top-k tokens
    topk_ids = torch.topk(mask_logits, topk, dim=1).indices[0].tolist()
    
    # topk probabilities
    topk_probs = torch.topk(mask_probs, topk, dim=1).values[0].tolist()

    
    
    topk_tokens = [tokenizer.decode([token_id]) for token_id in topk_ids]
    print(f"topk_ids: {topk_ids}")
    print(f"topk_tokens: {topk_tokens}")
    print(f"topk_probs: {topk_probs}")


fill_mask(input_text)
# fill_mask("the capital of [MASK] is New Delhi.")

sentence: The capital of the France is [MASK].
===Tokenization===
tokenizer output: {'input_ids': tensor([[ 101, 1996, 3007, 1997, 1996, 2605, 2003,  103, 1012,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
tokens: ['[CLS]', 'the', 'capital', 'of', 'the', 'france', 'is', '[MASK]', '.', '[SEP]']
mask_token_index: tensor([7])
===Prediction===
outputs.logits.shape: torch.Size([1, 10, 30522])
===Output===
mask_logits: tensor([[-3.8257, -3.8663, -3.7440,  ..., -2.8305, -3.2609, -3.1235]])
mask_probs: tensor([[1.4248e-07, 1.3680e-07, 1.5460e-07,  ..., 3.8543e-07, 2.5063e-07,
         2.8755e-07]])
topk_ids: [3000, 22479, 22451, 29025, 10241]
topk_tokens: ['paris', 'lille', 'brest', 'pau', 'lyon']
topk_probs: [0.303478479385376, 0.05363153666257858, 0.039356473833322525, 0.035300858318805695, 0.03322099521756172]


# Finetuning the model

<img src="images/net_loss_optimizer.png" alt="drawing" width="512"/>  

Image credit: [Deep learning with Python](https://www.manning.com/books/deep-learning-with-python)

##  Finetune with toy dataset using torch

**Base model:** [bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased)  
**Task:** Given a review, classify it into a positive or negative class  
**Dataset:** Toy dataset

In [6]:
# load model and tokenizer
checkpoint = "bert-base-uncased"

class_model = AutoModelForSequenceClassification.from_pretrained(checkpoint) # instead of AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Prediction function

In [14]:
def predict(sentences):
    print(f"sentence: {sentences}")
    # print(f"=====TOKENIZATION=====")
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    # print(f"inputs.input_ids:{inputs.input_ids}")
    # print(f"inputs.attention_mask:{inputs.attention_mask}")
    
    with torch.no_grad():
        outputs = class_model(**inputs)
        # print(f"=====OUTPUT=====")
        print(f"outputs.logits.shape: {outputs.logits.shape}")
        # print(f"outputs.logits: {outputs.logits}")
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        # print(f"outputs.prob: {predictions}")
        
    
    for sentence, pred in zip(sentences, predictions):
        print(f"{sentence}: {pred}")

predict(["The movie was awesome.", "I had the worst experience of my life."])

sentence: ['The movie was awesome.', 'I had the worst experience of my life.']
outputs.logits.shape: torch.Size([2, 2])
The movie was awesome.: tensor([0.7702, 0.2298])
I had the worst experience of my life.: tensor([0.3837, 0.6163])


### Prepare dataset

In [8]:
training_sentences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
    "The movie was horrible.",
]

test_sentences = [
    "The hotel was not that good.",
    "I hate this so much!",
    "The movie was great.",
]

predict(training_sentences)
predict(test_sentences)


sentence: ["I've been waiting for a HuggingFace course my whole life.", 'This course is amazing!', 'The movie was horrible.']
I've been waiting for a HuggingFace course my whole life.: tensor([0.5586, 0.4414])
This course is amazing!: tensor([0.4847, 0.5153])
The movie was horrible.: tensor([0.5291, 0.4709])
sentence: ['The hotel was not that good.', 'I hate this so much!', 'The movie was great.']
The hotel was not that good.: tensor([0.5151, 0.4849])
I hate this so much!: tensor([0.5245, 0.4755])
The movie was great.: tensor([0.5309, 0.4691])


### finetune model

In [11]:
# prepare toy classification dataset out of above sentences
inputs = tokenizer(training_sentences, padding=True, truncation=True, return_tensors="pt")
inputs["labels"] = torch.tensor([1, 1, 0])


# finetune the model
# setup optimizer
optimizer = torch.optim.Adam(class_model.parameters(), lr=5e-5)

for _ in range(3):
    # forward pass to calculate loss
    loss = class_model(**inputs).loss
    # backward pass to calculate gradients
    loss.backward()
    # update model weights
    optimizer.step()

In [13]:
print(f"=========after finetuning======")
print(f"===training data:===")
predict(training_sentences)

print(f"===test data:===")
predict(test_sentences)

===training data:===
sentence: ["I've been waiting for a HuggingFace course my whole life.", 'This course is amazing!', 'The movie was horrible.']
I've been waiting for a HuggingFace course my whole life.: tensor([0.2933, 0.7067])
This course is amazing!: tensor([0.3004, 0.6996])
The movie was horrible.: tensor([0.8286, 0.1714])
===test data:===
sentence: ['The hotel was not that good.', 'I hate this so much!', 'The movie was great.']
The hotel was not that good.: tensor([0.5186, 0.4814])
I hate this so much!: tensor([0.3166, 0.6834])
The movie was great.: tensor([0.7724, 0.2276])


### Transfer learning

<img src="images/bert_transfer_learning.jpeg" alt="drawing" width="512"/>  

Image credit: [Natural Language Processing with Transformers](.)


##  Finetune with dataset using huggingface trainer
**Task:** Given a pair of sentences, detect whether the sentence is a paraphrase of another sentence    
**Base model:** [bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased)  
**Dataset:** [Glue-mrpc](https://huggingface.co/datasets/nyu-mll/glue/viewer/mrpc) The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. MRPC is specific to paraphrasing.

In [7]:
checkpoint = "bert-base-uncased" 
# load the model
class_model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Prepare training data

In [16]:
from datasets import load_dataset

# 
raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [17]:
# explore samples using indexes just like python dictionaries
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

In [32]:
raw_train_dataset.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

In [21]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

####  tokenize sentence 1, sentence 2 separately

In [27]:
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
print([len(sample) for sample in tokenized_sentences_1.input_ids[:8]])
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])
print([len(sample) for sample in tokenized_sentences_2.input_ids[:8]])

[25, 27, 23, 41, 30, 21, 31, 15]
[26, 33, 25, 27, 30, 30, 32, 18]


#### Tokenize pair as a whole

In [29]:
# tokenizer can take pair of sentences and convert it into a format model requires
inputs = tokenizer("This is the first sentence.", "This is the second one.")
print(f"inputs:{inputs}")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"]))

inputs:{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['[CLS]', 'this', 'is', 'the', 'first', 'sentence', '.', '[SEP]', 'this', 'is', 'the', 'second', 'one', '.', '[SEP]']


### Tokenize with fix padding

In [49]:

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True, padding="max_length", max_length=128)

In [65]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["idx", "sentence1", "sentence2"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets = tokenized_datasets.with_format("torch")
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

In [66]:
from torch.utils.data import DataLoader
train_data_loader = DataLoader(tokenized_datasets["train"], batch_size=16, shuffle=True)


for step, batch in enumerate(train_data_loader):
    print(batch["input_ids"].shape)
    if step > 5:
        break

torch.Size([16, 128])
torch.Size([16, 128])
torch.Size([16, 128])
torch.Size([16, 128])
torch.Size([16, 128])
torch.Size([16, 128])
torch.Size([16, 128])


### Toeknize with dynamic padding

In [68]:

raw_datasets = load_dataset("glue", "mrpc")

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["idx", "sentence1", "sentence2"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets = tokenized_datasets.with_format("torch")
tokenized_datasets


Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

In [69]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

train_data_loader = DataLoader(tokenized_datasets["train"], batch_size=16, collate_fn=data_collator )
for step, batch in enumerate(train_data_loader):
    print(batch["input_ids"].shape)
    if step > 5:
        break

torch.Size([16, 67])
torch.Size([16, 79])
torch.Size([16, 73])
torch.Size([16, 71])
torch.Size([16, 78])
torch.Size([16, 70])
torch.Size([16, 79])


## Finetune model

In [75]:
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)


training_args = TrainingArguments("test-trainer", eval_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.450421,0.823529,0.882353
2,0.540300,0.457906,0.85049,0.892416
3,0.332600,0.626026,0.865196,0.905009


TrainOutput(global_step=1377, training_loss=0.3633915188525497, metrics={'train_runtime': 178.3067, 'train_samples_per_second': 61.714, 'train_steps_per_second': 7.723, 'total_flos': 405114969714960.0, 'train_loss': 0.3633915188525497, 'epoch': 3.0})

### Speedup training using huggingface accelerate

In [77]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

### Leverage GPU acceleration without accelerate library 

In [82]:
from transformers import AutoModelForSequenceClassification, get_scheduler
from torch.optim import AdamW 

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

# pick GPU device if available
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(f"device: {device}")

# move model to GPU
model.to(device)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        # move batch to GPU
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


device: cuda


  0%|          | 0/1377 [00:00<?, ?it/s]

In [None]:
### Leverage GPU acceleration without accelerate library 

In [83]:
from accelerate import Accelerator
from transformers import AdamW, AutoModelForSequenceClassification, get_scheduler

accelerator = Accelerator()

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

num_epochs = 3
num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dl:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/1377 [00:00<?, ?it/s]

### Evaluate model

In [87]:
import evaluate

metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dl:
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

{'accuracy': 0.8799019607843137, 'f1': 0.9153713298791019}

# Save model on local/hub

In [89]:
model.save_pretrained("test-trainer")

# save model on huggingface hub
model.push_to_hub("test-trainer")

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Ankush-Chander/test-trainer/commit/0bc5396d680632cc1f6fa7829ec560c915580891', commit_message='Upload BertForSequenceClassification', commit_description='', oid='0bc5396d680632cc1f6fa7829ec560c915580891', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Ankush-Chander/test-trainer', endpoint='https://huggingface.co', repo_type='model', repo_id='Ankush-Chander/test-trainer'), pr_revision=None, pr_num=None)

# References:
1. [illustrated-transformer](https://jalammar.github.io/illustrated-transformer)
2. [Hugginface NLP Course](https://huggingface.co/learn/nlp-course)