<a href="https://colab.research.google.com/github/MoritzLaurer/transformers-workshop-comptext-2023/blob/master/inside_LMs_COMPTEXT_2023.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🤖 A practical look inside Transformers

📅 _COMPTEXT 2023 tutorial, 11.05.2023_

👨‍🏫 By [Moritz Laurer](https://twitter.com/MoritzLaurer). 
For questions, reach out to: m.laurer@vu.nl

### Main components of Transformers

**Hugging Face Transformers has two main components:**



1. The **tokenizer** converts raw text into a clean numeric format that the model understands. 
    - A token is a word or a sub-word unit. In BERT's vocabulary, the word "good" is one token and the word "darwinism" is two tokens ("darwin" and "ism").
    - The tokenizer maps each token to a token ID. These token IDs let BERT look up each token in the vocabulary it learned during pre-training. 

2. The **model** processes the tokenizer's output and returns a prediction, e.g. which class an input text belongs to.



Regardless of the type of model (classification, summarisation, translation, etc.), these two components work in almost the same way.
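
To make the tokenizer step more concrete, here is a minimal sketch of greedy longest-match sub-word splitting, the idea behind BERT's WordPiece tokenizer. The vocabulary `TOY_VOCAB` and the helper `wordpiece` are hypothetical and only serve as illustration; the real tokenizer also handles lowercasing, punctuation and special tokens.

```python
# Toy vocabulary (hypothetical). Real BERT vocabularies hold roughly 30,000 entries.
TOY_VOCAB = {"[UNK]": 0, "i": 1, "believe": 2, "trust": 3, "##worthy": 4, "good": 5, ".": 6}


def wordpiece(word, vocab):
    """Split one lowercase word into sub-word tokens by greedy longest match."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it step by step.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # nothing matched: the whole word is marked as unknown
        tokens.append(piece)
        start = end
    return tokens


tokens = wordpiece("trustworthy", TOY_VOCAB)
token_ids = [TOY_VOCAB[t] for t in tokens]
print(tokens)     # ['trust', '##worthy']
print(token_ids)  # [3, 4]
```

The `[UNK]` fallback is rarely triggered with a full-size vocabulary, because such vocabularies typically also contain single characters as tokens.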

### Install and load 

In [None]:
!pip install transformers~=4.26.1  # The Transformers library from Hugging Face

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers~=4.26.1
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.26.1


In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

In [None]:
# load any classification model from the HuggingFace model hub
# See here: https://huggingface.co/models?pipeline_tag=text-classification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# instantiate the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
# instantiate the model
model = AutoModelForSequenceClassification.from_pretrained(model_name)

### Tokenization


In [None]:
### 1. Tokenization
# Tokenizer documentation: https://huggingface.co/transformers/main_classes/tokenizer.html

text = 'I believe that the EU is trustworthy.'
print(f"Input text: '{text}'\n")

input_ids = tokenizer(text, truncation=True, return_tensors="pt")["input_ids"]
print(f"""The tokenizer splits the text string into separate tokens. A token is either an entire word,
or a 'sub-word unit' in the case of rare words (or punctuation).
The word 'trustworthy', for example, is split into two tokens: {tokenizer.tokenize("Trustworthy")}.
The main advantage of these sub-word units is that rare words are almost never out-of-vocabulary (an issue with other text-as-data approaches).
Transformer models typically have a vocabulary of around 30,000 to 250,000 tokens, learned from the training data. 
Here is e.g. the vocabulary of DistilBERT: https://huggingface.co/distilbert-base-uncased/resolve/main/vocab.txt\n""")

print(f"The input text is split into the following tokens:\n{tokenizer.tokenize(text)}.")
print("The tokenizer then maps each token to the corresponding token ID in the model's vocabulary:")
print(input_ids[0].tolist()[1:-1])
print("Transformer models only understand these token IDs.\n")

print("""In addition, the tokenizer adds two special tokens:
 First, the [CLS] (classification) token is always added at the beginning. 
        While individual tokens represent individual (sub)words, the [CLS] token represents the entire text. 
        The [CLS] token "is used as the aggregate sequence representation for classification tasks" (Devlin et al. 2019: 4). Details: https://arxiv.org/pdf/1810.04805.pdf
 Second, the [SEP] token marks the end of a text and separates two texts. It is useful for tasks which require two text inputs, for example question answering tasks.
        (It is not relevant in our case)
\n""")

print("""The final input for a BERT transformer model therefore looks like this:""")
token_strings = tokenizer.convert_ids_to_tokens(ids=input_ids[0])
#token_strings = tokenizer.tokenize(text)
for token_id, token_string in zip(input_ids[0].tolist(), token_strings):
  print(token_id, " == ", token_string)


# entire vocabulary: tokenizer.pretrained_vocab_files_map["vocab_file"]["distilbert-base-uncased"]
# https://huggingface.co/distilbert-base-uncased/resolve/main/vocab.txt

Input text: 'I believe that the EU is trustworthy.'

The tokenizer splits the text string into separate tokens. A token is either an entire word,
or a 'sub-word unit' in the case of rare words (or punctuation).
The word 'trustworthy', for example, is split into two tokens: ['trust', '##worthy'].
The main advantage of these sub-word units is that rare words are almost never out-of-vocabulary (an issue with other text-as-data approaches).
Transformer models typically have a vocabulary of around 30,000 to 250,000 tokens, learned from the training data. 
Here is e.g. the vocabulary of DistilBERT: https://huggingface.co/distilbert-base-uncased/resolve/main/vocab.txt

The input text is split into the following tokens:
['i', 'believe', 'that', 'the', 'eu', 'is', 'trust', '##worthy', '.'].
The tokenizer then maps each token to the corresponding token ID in the model's vocabulary:
[1045, 2903, 2008, 1996, 7327, 2003, 3404, 13966, 1012]
Transformer models only understand these token IDs.

In addition, the toke
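
The wrapping with special tokens described in the cell above can be sketched in plain Python. This is only an illustration: the helper `add_special_tokens` is hypothetical, and the IDs 101 and 102 are the `[CLS]` and `[SEP]` entries of the BERT uncased vocabulary, assumed here to apply to this DistilBERT checkpoint as well.

```python
CLS_ID, SEP_ID = 101, 102  # assumed IDs of [CLS] and [SEP] in the BERT uncased vocabulary


def add_special_tokens(ids_a, ids_b=None):
    """Wrap one text (or a pair of texts) in [CLS] ... [SEP] markers."""
    wrapped = [CLS_ID] + ids_a + [SEP_ID]
    if ids_b is not None:
        # For two-text tasks, the second text follows and gets its own closing [SEP].
        wrapped += ids_b + [SEP_ID]
    return wrapped


# Token IDs for 'I believe that the EU is trustworthy.' as printed above
token_ids = [1045, 2903, 2008, 1996, 7327, 2003, 3404, 13966, 1012]
model_input = add_special_tokens(token_ids)
print(model_input)
```

Because `[CLS]` sits at position 0, the token "trust" ends up at position 7 of the final input, which is the index used in the next section.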

### Tokens (words) flowing through the neural network

Questions: 
- Can someone explain what a word-embedding is?

In [None]:
### Processing the input with the model
# Model class documentation: https://huggingface.co/transformers/main_classes/model.html
# Documentation for DistilBERT specifically: https://huggingface.co/transformers/model_doc/distilbert.html

print("""\nAfter preprocessing by the tokenizer, the model feeds the sequence of tokens through the neural network.
Each token is represented by a vector of 768 numbers (a 768-dimensional tensor).
The tensor for the token "trust", for example, looks like this before being fed into the first neural network layer 
(only the first 100 numbers are displayed):\n""")
print(model.distilbert.embeddings.word_embeddings(input_ids[0][7])[:100], "\n")

print("""The tensors for each token are then fed through and transformed by a stack of neural network layers (6 in DistilBERT, 12 to 24 in larger BERT models).\n""")

output = model(input_ids, output_hidden_states=True, output_attentions=False, return_dict=True)
print("Same word after the first layer:\n\n", output.hidden_states[1][0][7][:100], "\n")  # same word embedding after the first attention layer
print("Same word after the second layer:\n\n", output.hidden_states[2][0][7][:100], "\n")  # same word embedding after the second attention layer
#print("Same word after the third layer:\n", output.hidden_states[3][0][7][:100], "\n")  # same word embedding after the third attention layer
print("\n ... etc ...\n")

print(f'The final output is a contextualised representation of the sequence: "{text}"')
#output.hidden_states[6][0][0][:100]  # final CLS token


After preprocessing by the tokenizer, the model feeds the sequence of tokens through the neural network.
Each token is represented by a vector of 768 numbers (a 768-dimensional tensor).
The tensor for the token "trust", for example, looks like this before being fed into the first neural network layer 
(only the first 100 numbers are displayed):

tensor([-0.0263, -0.0292, -0.0186,  0.0289,  0.0225,  0.0005, -0.0649,  0.0440,
         0.0201,  0.0052, -0.0857, -0.0903, -0.0182, -0.0214, -0.0514, -0.0074,
        -0.0361, -0.0715,  0.0125, -0.0320, -0.0118, -0.0252, -0.0431, -0.0383,
         0.0073,  0.0188,  0.0049, -0.0829, -0.0150, -0.0313, -0.0517,  0.0518,
         0.0099,  0.0418, -0.0135, -0.0256, -0.0432, -0.0029, -0.0191,  0.0006,
         0.0023,  0.0052, -0.0705, -0.0053, -0.0237, -0.0131,  0.0082, -0.0160,
        -0.0512,  0.0171,  0.0104, -0.0164, -0.0536, -0.0759, -0.0407, -0.0006,
        -0.0331, -0.0792,  0.0354, -0.0010, -0.0222, -0.0015, -0.0628, -0.0206,
        -0.114

In [None]:
# visualisation of this process:
# TODO: add image from .ppt

In [None]:
print("This is what the different model layers ('the architecture') look like:\n")
print(model)

This is what the different model layers look like:

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
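
The printout above lists `q_lin`, `k_lin` and `v_lin` inside `MultiHeadSelfAttention`. As a rough sketch of what such a block computes, here is single-head scaled dot-product attention in plain Python. It is a simplification under stated assumptions: the learned linear projections, the multiple heads and the dropout are left out, and the toy 2-dimensional vectors are invented for illustration.

```python
import math


def softmax(xs):
    """Normalise a list of scores into weights that sum to 1."""
    biggest = max(xs)
    exps = [math.exp(x - biggest) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The new vector is a weighted average of all value vectors.
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


# Three toy token vectors (hypothetical values); each token attends to all tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextualised = self_attention(tokens, tokens, tokens)
for vec in contextualised:
    print([round(x, 3) for x in vec])
```

Each output vector mixes information from all tokens in the sequence, which is why the representation of "trust" changes from layer to layer.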
        

### The final output

In [None]:
print(f"""At the end, Transformer classification models output so-called 'logits', one number for each class the model was trained to classify.\n
These logits are raw, unnormalised scores for the two classes of our binary classification task:\n\n{output["logits"][0].tolist()}\n""")

print("Logits are not very interpretable, so they are converted to probabilities with the softmax function and shown as percentages.\nEach percentage represents the model's estimated probability that the input text belongs to the respective class.\n")
probabilities = torch.softmax(output["logits"][0], -1).tolist()
label_names = model.config.id2label.values()
prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(probabilities, label_names)}
print(prediction)

At the end, Transformer classification models output so-called 'logits', one number for each class the model was trained to classify.

These logits are raw, unnormalised scores for the two classes of our binary classification task:

[-3.50547456741333, 3.680955171585083]

Logits are not very interpretable, so they are converted to probabilities with the softmax function and shown as percentages.
Each percentage represents the model's estimated probability that the input text belongs to the respective class.

{'NEGATIVE': 0.1, 'POSITIVE': 99.9}
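
The conversion from logits to percentages can also be worked through by hand. The sketch below re-implements softmax with the standard library and applies it to the two logits printed above. The label order is taken from the printed prediction and is otherwise an assumption.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    biggest = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - biggest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-3.50547456741333, 3.680955171585083]  # values printed above
labels = ["NEGATIVE", "POSITIVE"]                # order as in the printed prediction

probabilities = softmax(logits)
prediction = {name: round(p * 100, 1) for name, p in zip(labels, probabilities)}
print(prediction)  # {'NEGATIVE': 0.1, 'POSITIVE': 99.9}
```

Only the difference between the two logits matters: a gap of about 7.2 already pushes the larger class above 99.9%.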


### Everything put together


In [None]:
## In short, the code looks like this:

# load the relevant functions from HuggingFace and PyTorch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Choose any classification model from the model hub
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# instantiate the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# tokenization
text = 'I believe that the EU is trustworthy.'
input_ids = tokenizer(text, truncation=True, return_tensors="pt")["input_ids"]  # avoid shadowing Python's built-in input()

# model prediction
output = model(input_ids, output_hidden_states=False, output_attentions=False, return_dict=True)
probabilities = torch.softmax(output["logits"][0], -1).tolist()
label_names = model.config.id2label.values()
prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(probabilities, label_names)}
print(prediction)

{'NEGATIVE': 0.1, 'POSITIVE': 99.9}


In [None]:
## Or via the simplified pipeline: 
from transformers import pipeline
pipe_classification = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english", top_k=2)
text = 'I believe that the EU is trustworthy.'
pipe_classification(text)

[[{'label': 'POSITIVE', 'score': 0.9992438554763794},
  {'label': 'NEGATIVE', 'score': 0.0007562137907370925}]]
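
With `top_k=2`, the pipeline returns a nested list: one inner list per input text, with one dictionary per class. A small sketch of how the top prediction could be pulled out of that structure, using the output above copied in as plain data (so no model is needed to run it):

```python
# Output copied from the pipeline call above (one inner list per input text).
results = [[{'label': 'POSITIVE', 'score': 0.9992438554763794},
            {'label': 'NEGATIVE', 'score': 0.0007562137907370925}]]

for text_result in results:
    # Pick the class with the highest score for this text.
    best = max(text_result, key=lambda item: item["score"])
    print(best["label"], round(best["score"] * 100, 1))  # POSITIVE 99.9
```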