# Lab | Encoder Models - BERT

---

### Transformers' main components

**Hugging Face Transformers has two main components:**



1. The **tokenizer** converts raw text into the format the model expects.
    - A token is a word or a sub-word unit. In BERT's vocabulary, the word "good" is one token, while the word "darwinism" is split into two tokens ("darwin" and "ism").
    - The tokenizer maps each token to a token ID. Through these token IDs, BERT can look up the representation it learned for each token during pre-training.

2. The **model** processes the tokenizer's output and returns a prediction, e.g. which class an input text belongs to.



Regardless of the task (classification, summarisation, translation, etc.), these two components are used in almost the same way.
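To make the token and token-ID idea concrete, here is a minimal sketch of WordPiece-style splitting (greedy longest match first) in plain Python. The vocabulary `TOY_VOCAB` and its IDs are made up for illustration, and the functions `wordpiece` and `encode` are hypothetical helpers, not part of the Transformers library. The real tokenizer also handles punctuation, casing rules and special tokens.

```python
# Toy WordPiece-style tokenizer: greedy longest-match-first over a tiny,
# made-up vocabulary (the real BERT vocabulary has about 30,000 entries).
TOY_VOCAB = {"[UNK]": 0, "good": 1, "darwin": 2, "##ism": 3, "trust": 4, "##worthy": 5}


def wordpiece(word, vocab):
    """Split one lowercase word into the longest sub-word pieces found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter candidate
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab):
    """Lowercase, split on whitespace, and map every piece to its token ID."""
    tokens = [p for word in text.lower().split() for p in wordpiece(word, vocab)]
    return tokens, [vocab[t] for t in tokens]


tokens, ids = encode("good darwinism", TOY_VOCAB)
print(tokens)
print(ids)
```

The `##` prefix marks a piece that continues the previous one, which is how the real tokenizer output distinguishes sub-word units from whole words.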

In [None]:
#!pip install transformers~=4.31.0  # The Transformers library from Hugging Face

## Models like BERT (encoders)

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

In [None]:
# load any classification model from the HuggingFace model hub
# See here: https://huggingface.co/models?pipeline_tag=text-classification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# instantiate the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
# instantiate the model
model = AutoModelForSequenceClassification.from_pretrained(model_name)


### Tokenization


In [None]:
### 1. Tokenization
# Tokenizer documentation: https://huggingface.co/transformers/main_classes/tokenizer.html

text = 'I believe that the EU is trustworthy.'
print(f"Input text: '{text}'\n")

input_ids = tokenizer(text, truncation=True, return_tensors="pt")["input_ids"]
print(f"""The tokenizer splits the text string into separate tokens. A token is either an entire word,
or a 'sub-word unit' in case of rare words (or punctuation).
The word 'trustworthy', for example, is split into two tokens: {tokenizer.tokenize("trustworthy")}.
The main advantage of these sub-word units is that rare words can still be represented instead of being out-of-vocabulary (an issue of other text-as-data approaches).
Transformer models typically have a vocabulary of around 30,000 to 250,000 tokens, learned from the training data.
Here is e.g. the vocabulary of DistilBERT: https://huggingface.co/distilbert-base-uncased/resolve/main/vocab.txt\n""")

print(f"The input text is split into the following tokens:\n{tokenizer.tokenize(text)}.")
print("The tokenizer then maps each token to the corresponding token ID in the model's vocabulary:")
print(input_ids[0].tolist()[1:-1])
print("Transformer models only understand these token IDs.\n")

print("""In addition, the tokenizer adds two special tokens:
 First, the [CLS] (classification) token is always added at the beginning.
        While individual tokens represent individual (sub)words, the [CLS] token represents the entire text.
        The [CLS] token "is used as the aggregate sequence representation for classification tasks" (Devlin et al. 2019: 4). Details: https://arxiv.org/pdf/1810.04805.pdf
 Second, the [SEP] token separates two texts. It is useful for tasks that require two text inputs, for example question answering.
        (It is not relevant in our case.)
\n""")

print("""The final input for a BERT transformer model therefore looks like this:""")
token_strings = tokenizer.convert_ids_to_tokens(ids=input_ids[0])
#token_strings = tokenizer.tokenize(text)
for token_id, token_string in zip(input_ids[0].tolist(), token_strings):
  print(token_id, " == ", token_string)


# entire vocabulary: tokenizer.pretrained_vocab_files_map["vocab_file"]["distilbert-base-uncased"]
# https://huggingface.co/distilbert-base-uncased/resolve/main/vocab.txt

Input text: 'I believe that the EU is trustworthy.'

The tokenizer splits the text string into separate tokens. A token is either an entire word,
or a 'sub-word unit' in case of rare words (or punctuation).
The word 'trustworthy', for example, is split into two tokens: ['trust', '##worthy'].
The main advantage of these sub-word units is that rare words can still be represented instead of being out-of-vocabulary (an issue of other text-as-data approaches).
Transformer models typically have a vocabulary of around 30,000 to 250,000 tokens, learned from the training data.
Here is e.g. the vocabulary of DistilBERT: https://huggingface.co/distilbert-base-uncased/resolve/main/vocab.txt

The input text is split into the following tokens:
['i', 'believe', 'that', 'the', 'eu', 'is', 'trust', '##worthy', '.'].
The tokenizer then maps each token to the corresponding token ID in the model's vocabulary:
[1045, 2903, 2008, 1996, 7327, 2003, 3404, 13966, 1012]
Transformer models only understand these token IDs.

In addition, the token
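The wrapping in special tokens described in the cell above can be sketched in plain Python. The token IDs of the sentence are copied from the output above. `CLS_ID = 101` and `SEP_ID = 102` are the IDs of `[CLS]` and `[SEP]` in the uncased BERT vocabulary, and `build_inputs` is a hypothetical helper written for this illustration (the real tokenizer does this internally).

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] in the uncased BERT vocabulary


def build_inputs(ids_a, ids_b=None):
    """Wrap one text, or a pair of texts, in the special tokens."""
    ids = [CLS_ID] + ids_a + [SEP_ID]
    if ids_b is not None:
        ids += ids_b + [SEP_ID]  # second text, closed by another [SEP]
    return ids


sentence = [1045, 2903, 2008, 1996, 7327, 2003, 3404, 13966, 1012]
single = build_inputs(sentence)
print(single)
print(single[1:-1] == sentence)  # slicing off [CLS] and [SEP] recovers the text
```

This is why the cell above prints `input_ids[0].tolist()[1:-1]` when it shows only the IDs of the actual words.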

### Tokens (words) flowing through the neural network

In [None]:
### Processing the input with the model
# Model class documentation: https://huggingface.co/transformers/main_classes/model.html
# Documentation for DistilBERT specifically: https://huggingface.co/transformers/model_doc/distilbert.html

print(f"""\nAfter the preprocessing by the tokenizer, the model then feeds the sequence of tokens through the neural network.
Each token is represented by a vector of 768 numbers (a 768-dimensional tensor).
The tensor for the token "trust", for example, looks like this before being fed into the first neural network layer
(only 100 numbers are displayed):\n""")
print(model.distilbert.embeddings.word_embeddings(input_ids[0][7])[:100], "\n")

print("The tensors for each token are then fed through and transformed by roughly 6 to 24 neural network layers, depending on the model (DistilBERT has 6).\n")

output = model(input_ids, output_hidden_states=True, output_attentions=False, return_dict=True)
print("Same word after the first layer:\n\n", output.hidden_states[1][0][7][:100], "\n")  # same word embedding after the first attention layer
print("Same word after the second layer:\n\n", output.hidden_states[2][0][7][:100], "\n")  # same word embedding after the second attention layer
#print("Same word after the third layer:\n", output.hidden_states[3][0][7][:100], "\n")  # same word embedding after the third attention layer
print("\n ... etc ...\n")

print(f'The final output is a contextualised representation of the sequence: "{text}"')
#output.hidden_states[6][0][0][:100]  # final CLS token


After the preprocessing by the tokenizer, the model then feeds the sequence of tokens through the neural network.
Each token is represented by a vector of 768 numbers (a 768-dimensional tensor).
The tensor for the token "trust", for example, looks like this before being fed into the first neural network layer
(only 100 numbers are displayed):

tensor([-0.0263, -0.0292, -0.0186,  0.0289,  0.0225,  0.0005, -0.0649,  0.0440,
         0.0201,  0.0052, -0.0857, -0.0903, -0.0182, -0.0214, -0.0514, -0.0074,
        -0.0361, -0.0715,  0.0125, -0.0320, -0.0118, -0.0252, -0.0431, -0.0383,
         0.0073,  0.0188,  0.0049, -0.0829, -0.0150, -0.0313, -0.0517,  0.0518,
         0.0099,  0.0418, -0.0135, -0.0256, -0.0432, -0.0029, -0.0191,  0.0006,
         0.0023,  0.0052, -0.0705, -0.0053, -0.0237, -0.0131,  0.0082, -0.0160,
        -0.0512,  0.0171,  0.0104, -0.0164, -0.0536, -0.0759, -0.0407, -0.0006,
        -0.0331, -0.0792,  0.0354, -0.0010, -0.0222, -0.0015, -0.0628, -0.0206,
        -0.1149
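Conceptually, the `word_embeddings` lookup used above is a row lookup in a table: one row of numbers per token ID. The sketch below shows this in plain Python with made-up sizes and random values (a real model learns these values during pre-training). `embedding_table` and `embed` are hypothetical names for this illustration.

```python
import random

random.seed(0)
VOCAB_SIZE, HIDDEN_SIZE = 10, 4  # DistilBERT uses 30522 and 768

# One row of HIDDEN_SIZE numbers per token ID (random here, learned in a real model).
embedding_table = [
    [random.uniform(-0.1, 0.1) for _ in range(HIDDEN_SIZE)] for _ in range(VOCAB_SIZE)
]


def embed(token_ids):
    """Replace every token ID by its row of the embedding table."""
    return [embedding_table[i] for i in token_ids]


vectors = embed([3, 7, 3])
print(len(vectors), len(vectors[0]))
print(vectors[0] == vectors[2])  # same ID, same vector: no context yet
```

At this stage a token always gets the same vector, wherever it appears. Only the subsequent layers mix in information from the other tokens, which is what makes the later representations contextualised.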

In [None]:
print("This is what the different model layers ('the architecture') look like:\n")
print(model)

This is what the different model layers ('the architecture') look like:

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
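The shapes printed in the architecture above are enough to estimate how many parameters some of these layers hold. The following is a small arithmetic sketch in plain Python, assuming that an embedding stores one weight per table cell and that a linear layer stores a weight matrix plus one bias per output. `embedding_params` and `linear_params` are hypothetical helpers, not library functions.

```python
def embedding_params(num_rows, dim):
    """An embedding table stores num_rows x dim weights."""
    return num_rows * dim


def linear_params(in_features, out_features, bias=True):
    """A linear layer stores a weight matrix and, optionally, one bias per output."""
    return in_features * out_features + (out_features if bias else 0)


word = embedding_params(30522, 768)      # word_embeddings
position = embedding_params(512, 768)    # position_embeddings
attention = 4 * linear_params(768, 768)  # q_lin, k_lin, v_lin, out_lin of one block

print(f"word embeddings:       {word:,}")
print(f"position embeddings:   {position:,}")
print(f"attention (per block): {attention:,}")
```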
   

### The final output

In [None]:
print(f"""At the end, Transformer classification models output so-called 'logits',\n one number for each class the model was trained to classify text into.\n
Our input text was: '{text}'\n
These logits are the raw, unnormalised scores for our binary sentiment classification task:\n\n{output["logits"][0].tolist()}\n""")

print("Logits are not very interpretable, so they are then converted to percentages with the softmax function.\nEach percentage represents the model's predicted probability that the input text belongs to that class.\n")
probabilities = torch.softmax(output["logits"][0], -1).tolist()
label_names = model.config.id2label.values()
prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(probabilities, label_names)}
print(prediction)

At the end, Transformer classification models output so-called 'logits',
 one number for each class the model was trained to classify text into.

Our input text was: 'I believe that the EU is trustworthy.'

These logits are the raw, unnormalised scores for our binary sentiment classification task:

[-3.5054750442504883, 3.680955648422241]

Logits are not very interpretable, so they are then converted to percentages with the softmax function.
Each percentage represents the model's predicted probability that the input text belongs to that class.

{'NEGATIVE': 0.1, 'POSITIVE': 99.9}
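The conversion from logits to percentages can be reproduced without PyTorch. The sketch below is a plain-Python softmax applied to the two logits printed above. The label order (`NEGATIVE`, `POSITIVE`) is taken from that output, and `softmax` is written out here for illustration rather than taken from a library.

```python
import math


def softmax(logits):
    """Exponentiate each logit and normalise so the results sum to 1."""
    m = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-3.5054750442504883, 3.680955648422241]
labels = ["NEGATIVE", "POSITIVE"]

probs = softmax(logits)
prediction = {label: round(p * 100, 1) for label, p in zip(labels, probs)}
print(prediction)  # {'NEGATIVE': 0.1, 'POSITIVE': 99.9}
```

Only the difference between the logits matters: a gap of about 7.2 between the two scores is what pushes the positive class to almost 100%.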


### Everything put together


In [None]:
## In short, the code looks like this:

# load the relevant functions from HuggingFace and PyTorch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Choose any classification model from the model hub
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# instantiate the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# tokenization
text = 'I believe that the EU is trustworthy.'
input_ids = tokenizer(text, truncation=True, return_tensors="pt")["input_ids"]  # avoid shadowing the built-in input()

# model prediction
output = model(input_ids, output_hidden_states=False, output_attentions=False, return_dict=True)
probabilities = torch.softmax(output["logits"][0], -1).tolist()
label_names = model.config.id2label.values()
prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(probabilities, label_names)}
print(prediction)

{'NEGATIVE': 0.1, 'POSITIVE': 99.9}


In [None]:
## Or via the simplified pipeline:
from transformers import pipeline
pipe_classification = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english", top_k=2)
text = 'I believe that the EU is trustworthy.'
pipe_classification(text)

[[{'label': 'POSITIVE', 'score': 0.9992438554763794},
  {'label': 'NEGATIVE', 'score': 0.0007562130922451615}]]

## Generative models like GPT (decoders)

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# https://huggingface.co/gpt2
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Today I believe we can finally"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(**inputs, max_length=30)

outputs_decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(outputs_decoded)



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['Today I believe we can finally get to the point where we can make a difference in the lives of the people of the United States of America.\n']


In [None]:

# https://huggingface.co/gpt2
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Today I believe we can finally"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# gpt2's vocabulary: https://huggingface.co/gpt2/raw/main/vocab.json

outputs = model.generate(
    input_ids, max_length=30,
    output_scores=True, return_dict_in_generate=True,
    output_attentions=False, do_sample=False
)

print("\nThe output looks quite messy:\n")
print(outputs)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



The output looks quite messy:

GenerateDecoderOnlyOutput(sequences=tensor([[8888,  314, 1975,  356,  460, 3443,  651,  284,  262,  966,  810,  356,
          460,  787,  257, 3580,  287,  262, 3160,  286,  262,  661,  286,  262,
         1578, 1829,  286, 2253,   13,  198]]), scores=(tensor([[-148.6821, -149.2908, -156.0585,  ..., -162.4585, -158.8699,
         -150.9391]]), tensor([[-115.6685, -116.1133, -120.9430,  ..., -121.5678, -122.0460,
         -116.7151]]), tensor([[-102.9193, -102.8433, -106.7674,  ..., -109.7448, -110.1562,
         -104.3626]]), tensor([[-113.0016, -111.4651, -116.1575,  ..., -115.7575, -119.9194,
         -112.3750]]), tensor([[-85.0493, -86.2461, -92.9495,  ..., -96.9331, -97.5099, -88.6309]]), tensor([[-101.4949, -101.3607, -106.5000,  ..., -105.3604, -108.2616,
         -102.6193]]), tensor([[-144.4079, -143.1993, -147.8557,  ..., -153.8577, -149.4084,
         -145.1063]]), tensor([[-142.9678, -142.7155, -149.5450,  ..., -154.4601, -153.3980,
        

In [None]:
print("GPT2's vocabulary is composed of 50257 tokens. Each has a 'word vector' composed of 768 numbers:")
print(model.transformer.wte)

print(f"""\nWe can look at GPT2's entire vocabulary here: https://huggingface.co/gpt2/raw/main/vocab.json
\nFor example, the token 'Love' is at position 18565.
\nWe can access its word vector here (first 100 numbers):\n
{model.transformer.wte.weight[18565][:100]}
""")

GPT2's vocabulary is composed of 50257 tokens. Each has a 'word vector' composed of 768 numbers:
Embedding(50257, 768)

We can look at GPT2's entire vocabulary here: https://huggingface.co/gpt2/raw/main/vocab.json

For example, the token 'Love' is at position 18565.

We can access its word vector here (first 100 numbers):

tensor([-0.0521,  0.0063,  0.0773,  0.1031, -0.0365, -0.0253, -0.2183,  0.0222,
        -0.1285, -0.0917, -0.0771, -0.1728,  0.1625, -0.1056,  0.1838, -0.0049,
         0.0246, -0.0203,  0.0717,  0.1154,  0.0384, -0.2783,  0.0206,  0.0678,
        -0.1182, -0.0169,  0.0946, -0.1425,  0.1875, -0.0393,  0.1161, -0.4728,
         0.1959,  0.0616, -0.1545,  0.0377, -0.3193,  0.1089,  0.0265, -0.0317,
         0.1023, -0.0070,  0.0394,  0.0017,  0.1093,  0.1821,  0.1139, -0.0832,
         0.0032, -0.0456, -0.0501, -0.0303, -0.0005, -0.2116, -0.0135, -0.2888,
        -0.0223,  0.1179,  0.0222,  0.3011,  0.0113,  0.1022, -0.1399, -0.0165,
         0.2658,  0.1221, -0.1152,

In [None]:

print(f"""
While the outputs produced by classifiers like BERT are probabilities of classes,
the outputs produced by generators like GPT2 are probabilities of tokens.
\nThe scores behind these probabilities are stored in the 'outputs' object returned by model.generate()
\nThe IDs of the prompt tokens and of the most probable tokens that follow are:
{outputs.sequences}
\nThese token IDs can be mapped to actual words/tokens in the vocabulary:
{tokenizer.batch_decode(outputs.sequences, skip_special_tokens=True)}\n\n

Our original prompt was:\n'{prompt}'
GPT2 then tries to predict the most probable next token. One token after the other.

To calculate the first token, it makes a prediction over ALL of the 50257 tokens it knows.
Each of the 50257 tokens receives a score (a logit), which softmax would turn into a probability.
For the first token, the scores over its ENTIRE vocabulary look like this:
{outputs.scores[0][0]}

The ID of the most probable *first* token is {torch.argmax(outputs.scores[0][0], dim=0)}
The corresponding token is: {tokenizer.decode(torch.argmax(outputs.scores[0][0], dim=0))}

The ID of the most probable *second* token is {torch.argmax(outputs.scores[1][0], dim=0)}
The corresponding token is: {tokenizer.decode(torch.argmax(outputs.scores[1][0], dim=0))}

The ID of the most probable *third* token is {torch.argmax(outputs.scores[2][0], dim=0)}
The corresponding token is: {tokenizer.decode(torch.argmax(outputs.scores[2][0], dim=0))}

This is how GPT2 gradually generated the text:
{tokenizer.batch_decode(outputs.sequences, skip_special_tokens=True)}

The same principles apply to all generative LLMs like GPT4, Llama-2 etc.
They are just bigger, with a better architecture and better fine-tuning.
""")



While the outputs produced by classifiers like BERT are probabilities of classes,
the outputs produced by generators like GPT2 are probabilities of tokens.

The scores behind these probabilities are stored in the 'outputs' object returned by model.generate()

The IDs of the prompt tokens and of the most probable tokens that follow are:
tensor([[8888,  314, 1975,  356,  460, 3443,  651,  284,  262,  966,  810,  356,
          460,  787,  257, 3580,  287,  262, 3160,  286,  262,  661,  286,  262,
         1578, 1829,  286, 2253,   13,  198]])

These token IDs can be mapped to actual words/tokens in the vocabulary:
['Today I believe we can finally get to the point where we can make a difference in the lives of the people of the United States of America.\n']



Our original prompt was:
'Today I believe we can finally'
GPT2 then tries to predict the most probable next token. One token after the other.

To calculate the first token, it makes a prediction over ALL of the 50257 tokens it knows.
Each of the 50257 tokens receives a score (a logit), which softmax would turn into a probability.
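The token-by-token procedure described above is usually called greedy decoding. The sketch below imitates it in plain Python with a made-up six-word vocabulary. `NEXT_SCORES` is a hypothetical lookup table standing in for the model: a real model computes its scores from the whole sequence so far, not just from the last token. As with `max_length=30` in the cell above, the limit here counts the prompt tokens too.

```python
VOCAB = ["<eos>", "we", "can", "finally", "go", "home"]

# Hypothetical score table: NEXT_SCORES[last_token][candidate_id] plays the role
# of the logits a real model would compute from the whole sequence so far.
NEXT_SCORES = {
    "finally": [0.1, 0.2, 0.1, 0.0, 2.5, 0.3],
    "go":      [0.2, 0.1, 0.0, 0.1, 0.0, 3.0],
    "home":    [4.0, 0.1, 0.1, 0.1, 0.2, 0.0],
}


def greedy_generate(prompt_tokens, max_length):
    """Repeatedly append the highest-scoring next token (greedy decoding)."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:  # max_length counts the prompt as well
        scores = NEXT_SCORES[tokens[-1]]
        best_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if VOCAB[best_id] == "<eos>":
            break  # the end-of-sequence token stops generation
        tokens.append(VOCAB[best_id])
    return tokens


print(greedy_generate(["we", "can", "finally"], max_length=30))
```

Picking the argmax at every step corresponds to `do_sample=False` in the `model.generate()` call above. Sampling strategies would instead draw the next token at random according to the probabilities.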




---




## Reflection  +  Q&A


**Reading, thinking & asking:** (5 min)
* Write your answers to the following questions on a piece of paper / digital notebook. While thinking about these questions, also don't hesitate to ask any questions that come up in the chat/Slack.
    * In your own words, write down the main differences between models like BERT and models like GPT with regard to their outputs.
    * What could be disadvantages and advantages of these two different approaches (encoders vs. decoders)?