# Attention -- implementing attention between a given decoder state and the encoder states

Below, there are two matrices, `encoder_states` and `decoder_states`. The first holds the state of the encoder's hidden layer after processing each word; the second holds the static embedding of each decoder input. A single hidden layer state is a vector of length 3, which is equal to the size of the embedding in the decoder. The encoder has 4 hidden layer states, because it processes a sequence consisting of 4 tokens.

There are 5 tokens in the decoder, which are generated based on the sequence processed by the encoder.

The task is to:

a) Calculate the similarity of all embeddings from the decoder (queries) to all hidden states of the encoder (keys). Remember that matrices can be transposed; in NumPy, we transpose a matrix using `matrix_name.T`.

b) Softmax (imported from scipy) should be performed on the created similarity matrix. Note: remember to apply softmax in the right dimension. All hidden states of the encoder should be softmaxed in the context of a given decoder state. In scipy, the softmax function includes an `axis` argument which may help.

c) Combine the attention matrix from step b) and `encoder_states` to generate a matrix containing context vectors for each token from the decoder.
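
The note about the softmax dimension in step b) can be illustrated on a toy matrix. The sketch below is not the solution to the task; the `scores` values are made up, and it only shows how the `axis` argument of `scipy.special.softmax` decides whether rows (one query against all keys) or columns are normalised.

```python
import numpy as np
from scipy.special import softmax

# Toy similarity matrix (made-up values): 2 queries (rows) x 3 keys (columns).
scores = np.array([[1.0, 2.0, 3.0],
                   [1.0, 1.0, 1.0]])

# Normalise across the keys of each query: every row becomes a distribution.
per_query = softmax(scores, axis=1)

# Normalise across the queries of each key: every column becomes a distribution.
# This is usually NOT what attention needs.
per_key = softmax(scores, axis=0)

print("row sums with axis=1:   ", per_query.sum(axis=1))
print("column sums with axis=0:", per_key.sum(axis=0))
```

With queries stored as rows, `axis=1` gives each query its own set of weights over the keys, which is the behaviour described in step b).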

In [2]:
import numpy as np
from scipy.special import softmax

# scipy.special.softmax(x, axis=None)

encoder_states = np.array(
    [[1.2, 3.4, 5.6],   # encoder's hidden layer output at the step 1, related to a given input token, e.g., I
    [-2.3, 0.2, 7.2],   # encoder's hidden layer output at the step 2, related to a given token, e.g., like
    [10.2, 0.2, 0.3],   # encoder's hidden layer output at the step 3, related to a given token, e.g., NLP
    [0.4, 0.7, 1.2]]    # encoder's hidden layer output at the step 4, related to a given token, e.g., "."
)



decoder_states = np.array(
    [[0.74, 0.23, 0.56],  # decoder's static word embedding at the step 1, related to a given token, e.g., <BOS>
    [7.23, 0.12, 0.55],  # decoder's static word embedding at the step 2, related to a given token, e.g., Ja
    [9.12, 4.23, 0.44], # decoder's static word embedding at the step 3, related to a given token, e.g., lubię
    [4.1, 3.23, 0.5],    # decoder's static word embedding at the step 4, related to a given token, e.g., przetwarzanie
    [5.2, 3.1, 8.5]]     # decoder's static word embedding at the step 5, related to a given token, e.g., języka
)

In [26]:
# a)
# Similarity is the dot product between decoder state i (query) and encoder state j (key)
print("a)")
sim = np.matmul(decoder_states, encoder_states.T)  # Shape (5, 4) == (number of decoder states, number of encoder states)
print(sim, end = '\n\n')

# b)
print("b)")
sim_soft = softmax(sim, axis = 1) # Shape (5, 4)
print(sim_soft, end = '\n\n')

# c)
print("c)")
context_vector = np.matmul(sim_soft, encoder_states)
print(context_vector, end = '\n\n') # Shape (5, 3)

a)
[[  4.806   2.376   7.762   1.129]
 [ 12.164 -12.645  73.935   3.636]
 [ 27.79  -16.962  94.002   7.137]
 [ 18.702  -5.184  42.616   4.501]
 [ 64.38   49.86   56.21   14.45 ]]

b)
[[4.91780633e-02 4.32948093e-03 9.45248312e-01 1.24414389e-03]
 [1.49003187e-27 2.50486173e-38 1.00000000e+00 2.94803216e-31]
 [1.75587568e-29 6.44090821e-49 1.00000000e+00 1.88369172e-38]
 [4.11416552e-11 1.74069934e-21 1.00000000e+00 2.79811669e-17]
 [9.99716568e-01 4.94220792e-07 2.82937800e-04 2.06801368e-22]]

c)
[[ 9.69108631  0.35799187  0.59163688]
 [10.2         0.2         0.3       ]
 [10.2         0.2         0.3       ]
 [10.2         0.2         0.3       ]
 [ 1.20254471  3.39909302  5.59850122]]



Expected outputs:

a) [[ 4.806 2.376 7.762 1.129] [ 12.164 -12.645 73.935 3.636] [ 27.79 -16.962 94.002 7.137] [ 18.702 -5.184 42.616 4.501] [ 64.38 49.86 56.21 14.45 ]]

b) [[4.91780633e-02 4.32948093e-03 9.45248312e-01 1.24414389e-03] [1.49003187e-27 2.50486173e-38 1.00000000e+00 2.94803216e-31] [1.75587568e-29 6.44090821e-49 1.00000000e+00 1.88369172e-38] [4.11416552e-11 1.74069934e-21 1.00000000e+00 2.79811669e-17] [9.99716568e-01 4.94220792e-07 2.82937800e-04 2.06801368e-22]]

c) [[ 9.69108631 0.35799187 0.59163688] [10.2 0.2 0.3 ] [10.2 0.2 0.3 ] [10.2 0.2 0.3 ] [ 1.20254471 3.39909302 5.59850122]]
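
The attention weights in b) are almost one-hot for decoder states 2 to 4, because the raw dot products are large and the softmax saturates. A common remedy, used in the Transformer's scaled dot-product attention, is to divide the scores by the square root of the key dimension. The sketch below is an optional extension, not part of the task: the `attention` function, its `scale` flag and the hand-written `softmax_rows` helper are illustrative names introduced here.

```python
import numpy as np

def softmax_rows(x):
    """Numerically stable softmax over the last axis (one distribution per row)."""
    shifted = x - x.max(axis=-1, keepdims=True)  # subtracting the row max avoids overflow
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(queries, keys, values, scale=False):
    """Dot-product attention; optionally scaled by sqrt(key dimension)."""
    scores = queries @ keys.T                      # (n_queries, n_keys)
    if scale:
        scores = scores / np.sqrt(keys.shape[-1])  # assumption: Transformer-style scaling
    weights = softmax_rows(scores)                 # each row sums to 1
    return weights @ values, weights               # context vectors, attention weights

encoder_states = np.array([[1.2, 3.4, 5.6],
                           [-2.3, 0.2, 7.2],
                           [10.2, 0.2, 0.3],
                           [0.4, 0.7, 1.2]])
decoder_states = np.array([[0.74, 0.23, 0.56],
                           [7.23, 0.12, 0.55],
                           [9.12, 4.23, 0.44],
                           [4.1, 3.23, 0.5],
                           [5.2, 3.1, 8.5]])

context, weights = attention(decoder_states, encoder_states, encoder_states)
context_scaled, weights_scaled = attention(
    decoder_states, encoder_states, encoder_states, scale=True
)

print("unscaled weights, decoder state 1:", weights[0].round(3))
print("scaled weights, decoder state 1:  ", weights_scaled[0].round(3))
```

With `scale=False` the function follows the same three steps as a) to c); with `scale=True` the weights of the first decoder state are spread more evenly across the encoder states.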


# Transformer
## Using transformer-based T5 model to solve various NLP tasks in a sequence-to-sequence manner

Today we're going to learn a new library -- the HuggingFace **transformers** library (https://huggingface.co/docs/transformers/index) and use it to solve several non-obvious NLP-related problems using the **T5** model


HuggingFace transformers is one of the most popular libraries that provide us with a high-level API for using neural networks to solve tasks related to natural language processing, audio processing, computer vision, or even multimodal scenarios in which we have to utilize multiple modalities at once (e.g., answering questions about pictures, information extraction from invoices).

First, let's install the dependencies: the `transformers` library itself and the `sentencepiece` module, which helps us tokenize documents and map the resulting tokens to integer token IDs (we will discuss the idea of sentencepiece later in detail).

**Warning**: if you notice some weird exceptions like `cannot call from_pretrained on a None object` somewhere in your code, restart the environment using: Runtime -> restart. Then run the cells with code (without re-installing the libraries) one more time.

In [None]:
!pip install transformers   # install HuggingFace transformers library
!pip install sentencepiece  # install sentencepiece

The API provided by the `transformers` library is a high-level one. We can download a given model and generate an output using 4 lines of code!

Read the docs on the T5 model provided here: https://huggingface.co/docs/transformers/model_doc/t5

In the `inference` section, you can find a description showing how we can download a pretrained model, and use it to solve a given task. Simply use the code provided to translate some sentence from English into German!

In [27]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

text = "translate English to German: Hello, I love Natural Language Processing and I hate Calculus"
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer(text, return_tensors="pt")

translation_ids = model.generate(
    input_ids=inputs["input_ids"], 
    attention_mask=inputs["attention_mask"],
    max_length=50
)

german_translation = tokenizer.decode(translation_ids[0], skip_special_tokens=True)
print("German translation:", german_translation)

German translation: Hallo, ich liebe die natürliche Sprache Verarbeitung und ich hasse Calculus


## Various tasks

Experiment with some other inputs, e.g., those provided in Figure 1 presented in the paper introducing the T5 model or even a wider list of use cases from  Appendix D provided with the paper. You can find the paper here: https://arxiv.org/pdf/1910.10683.pdf

Note: there are some abbreviations used among the inputs provided, some of them are:
-  `stsb`: it stands for the semantic textual similarity benchmark. Given two sentences, we can calculate their semantic similarities, which can help us determine whether one sentence is a paraphrase of the other one.
-  `cola`: it stands for the Corpus of Linguistic Acceptability and helps us determine whether a given sentence is grammatical or ungrammatical.
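
Since T5 treats every task as text-to-text, a task is selected purely by the prefix of the input string. The helpers below are a small sketch of how such inputs could be assembled. The function names are hypothetical, and the prefixes follow the examples in Figure 1 of the T5 paper as best recalled here; a fine-tuned checkpoint may expect a slightly different prefix, so check its model card.

```python
def stsb_prompt(sentence1: str, sentence2: str) -> str:
    """Build an STS-B style input: the model should answer with a similarity score."""
    return f"stsb sentence1: {sentence1} sentence2: {sentence2}"

def cola_prompt(sentence: str) -> str:
    """Build a CoLA style input: the model should answer acceptable / unacceptable."""
    return f"cola sentence: {sentence}"

def translate_prompt(source: str, target: str, text: str) -> str:
    """Build a translation input such as 'translate English to German: ...'."""
    return f"translate {source} to {target}: {text}"

print(stsb_prompt("A man is playing guitar.", "A person is playing an instrument"))
print(cola_prompt("The book on the table is."))
print(translate_prompt("English", "German", "I love you"))
```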

If you look at Appendix D, there are more abbreviations, these are related to the names of tasks presented in the GLUE benchmark (available here: https://gluebenchmark.com/tasks) and the SUPERGLUE benchmark (available here: https://super.gluebenchmark.com/tasks). The idea of GLUE and SUPERGLUE is to collect a set of challenging tasks that may be used to evaluate the systems requiring natural language understanding. 

**Paste 3 examples of tasks and the inputs you processed in the cell below**

In [72]:
texts = [
    "stsb sentence1: A man is playing guitar. sentence2: A person is playing an instrument",
    "cola: The book on the table is.",
    "cola: She go to store yesterday buyed many apples and he don’t knowed.",
    "summarize: Pope Francis, who sought to refocus the Catholic Church to promote social and economic justice rather than traditional moral teachings, has died. He was 88.",
    "translate English to French: I like machine learning",
    "translate English to German: I love you"
]

models = {
    "stsb": {
        "model_name": "PavanNeerudu/t5-base-finetuned-stsb",
        "tokenizer": None,
        "model": None
    },
    "cola": {
        "model_name": "PavanNeerudu/t5-base-finetuned-cola",
        "tokenizer": None,
        "model": None
    },
    "summarize": {
        "model_name": "t5-large",
        "tokenizer": None,
        "model": None
    },
    "translate": {
        "model_name": "t5-large",
        "tokenizer": None,
        "model": None
    }
}

for task in models:
    print(f"Loading tokenizer for task: {task}, model: {models[task]['model_name']}")
    models[task]["tokenizer"] = T5Tokenizer.from_pretrained(models[task]["model_name"])
    models[task]["model"] = T5ForConditionalGeneration.from_pretrained(models[task]["model_name"])

for text in texts:
    if text.startswith('cola'):
        task = "cola"
    elif text.startswith('stsb'):
        task = "stsb"
    elif text.startswith('summarize'):
        task = "summarize"
    elif text.startswith('translate'):
        task = "translate"

    tokenizer = models[task]["tokenizer"]
    model = models[task]["model"]

    inputs = tokenizer(text, return_tensors="pt")

    translation_ids = model.generate(
        input_ids=inputs["input_ids"], 
        attention_mask=inputs["attention_mask"],
        max_length=50
    )
    
    decoded_text = tokenizer.decode(translation_ids[0], skip_special_tokens=True)
    print(f"Task: {text}\noutput: {decoded_text}")

Loading tokenizer for task: stsb, model: PavanNeerudu/t5-base-finetuned-stsb
Loading tokenizer for task: cola, model: PavanNeerudu/t5-base-finetuned-cola
Loading tokenizer for task: summarize, model: t5-large
Loading tokenizer for task: translate, model: t5-large
Task: stsb sentence1: A man is playing guitar. sentence2: A person is playing an instrument
output: 1.4
Task: cola: The book on the table is.
output: unacceptable
Task: cola: She go to store yesterday buyed many apples and he don’t knowed.
output: unacceptable
Task: summarize: Pope Francis, who sought to refocus the Catholic Church to promote social and economic justice rather than traditional moral teachings, has died. He was 88.
output: pope died at 88, a man who promoted social and economic justice . he was 88 .
Task: translate English to French: I like machine learning
output: J'aime l'apprentissage en machine
Task: translate English to German: I love you
output: Ich liebe dich
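
The cell above loads `t5-large` twice (once for `summarize` and once for `translate`) and silently reuses the previous task if an input has no known prefix. One possible alternative is sketched below, under the assumption that caching by model name is acceptable; `TASK_TO_MODEL`, `task_of`, `load` and `run` are names introduced for this illustration only.

```python
from functools import lru_cache

# Task prefix -> checkpoint name (same mapping as in the cell above).
TASK_TO_MODEL = {
    "stsb": "PavanNeerudu/t5-base-finetuned-stsb",
    "cola": "PavanNeerudu/t5-base-finetuned-cola",
    "summarize": "t5-large",
    "translate": "t5-large",
}

def task_of(text: str) -> str:
    """Return the task whose prefix starts the input, or fail loudly."""
    for task in TASK_TO_MODEL:
        if text.startswith(task):
            return task
    raise ValueError(f"No known task prefix in: {text!r}")

@lru_cache(maxsize=None)
def load(model_name: str):
    """Load tokenizer and model once per checkpoint name."""
    # Imported lazily so the routing logic works without transformers installed.
    from transformers import T5Tokenizer, T5ForConditionalGeneration
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    return tokenizer, model

def run(text: str, max_length: int = 50) -> str:
    """Route the input to its checkpoint and decode the generated output."""
    tokenizer, model = load(TASK_TO_MODEL[task_of(text)])
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(task_of("translate English to German: I love you"))
```

Calling `run(text)` for each element of `texts` would then download each distinct checkpoint only once.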


## Various model types

There are several T5 models available, which differ in size (and quality). The bigger the model is, the better output it should generate. Experiment with some models from the following set: 
- t5-small
- t5-base
- t5-large
- t5-3b
- t5-11b

and check whether you can observe any difference in the quality of outputs.

Also, compare the sizes of the models. You can use the `model.num_parameters()` method to obtain the number of parameters of each model. For each model you are able to load, provide the size in the cell below (if you can't load a given model because it is too big, no worries, just type 'too big to load').

In [73]:
# t5-small params number: 60.5 M
# t5-base params number: 223 M
# t5-large params number: 738 M
# t5-3b params number: too big to load
# t5-11b params number: too big to load

# t5-3b and t5-11b were too big to load, so only the three smaller checkpoints are compared.
model_names = ["t5-small", "t5-base", "t5-large"]
for model_name in model_names:
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    print(f"Model: {model_name}, Number of params: {model.num_parameters()}")

    for text in texts:
        if text.startswith("cola") or text.startswith("stsb"):
            continue
        inputs = tokenizer(text, return_tensors="pt")

        translation_ids = model.generate(
            input_ids=inputs["input_ids"], 
            attention_mask=inputs["attention_mask"],
            max_length=50
        )

        decoded_text = tokenizer.decode(translation_ids[0], skip_special_tokens=True)
        print(f"Task: {text}\noutput: {decoded_text}")    

Model: t5-small, Number of params: 60506624
Task: summarize: Pope Francis, who sought to refocus the Catholic Church to promote social and economic justice rather than traditional moral teachings, has died. He was 88.
output: he sought to refocus the Catholic Church to promote social and economic justice. he died at 88.
Task: translate English to French: I like machine learning
output: Je veux apprendre à la machine
Task: translate English to German: I love you
output: Ich liebe Sie
Model: t5-base, Number of params: 222903552
Task: summarize: Pope Francis, who sought to refocus the Catholic Church to promote social and economic justice rather than traditional moral teachings, has died. He was 88.
output: he sought to refocus the Catholic Church to promote social and economic justice . he was 88.
Task: translate English to French: I like machine learning
output: J'aime l'apprentissage par machine
Task: translate English to German: I love you
output: Ich liebe Sie
Model: t5-large, Number
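
The raw values returned by `model.num_parameters()` are easier to compare once they are rounded and converted into an approximate memory footprint. The sketch below uses the two counts printed above; the helper names are hypothetical, and the memory estimate assumes 4 bytes per parameter (float32 weights) and ignores activations and any other overhead.

```python
def human_readable(n_params: int) -> str:
    """Format a parameter count as e.g. '60.5 M'."""
    for unit, size in (("B", 1e9), ("M", 1e6), ("K", 1e3)):
        if n_params >= size:
            return f"{n_params / size:.1f} {unit}"
    return str(n_params)

def fp32_megabytes(n_params: int) -> float:
    """Rough size of the weights in MB, assuming 4 bytes (float32) per parameter."""
    return n_params * 4 / 1e6

# Counts copied from the output above.
counts = {"t5-small": 60_506_624, "t5-base": 222_903_552}

for name, n_params in counts.items():
    print(f"{name}: {human_readable(n_params)} parameters, "
          f"about {fp32_megabytes(n_params):.0f} MB in float32")
```

The same estimate explains why the largest checkpoints may not fit into memory on a typical notebook runtime.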

## Language-specific T5s (OPTIONAL ASSIGNMENT -- you are not required to provide code here)

There are also alternatives to the original T5 models. As the T5 model was trained mostly on English text, there are models specific to other languages, e.g., Polish (for example plT5 proposed by Allegro - https://huggingface.co/allegro/plt5-small). The Polish model was trained to solve a set of tasks collected in the KLEJ benchmark, which is the Polish counterpart of the GLUE benchmark: https://klejbenchmark.com.

You can find more details on plT5 in the research paper: https://arxiv.org/pdf/2205.08808.pdf. Table 2 presents some examples of prompts that can be used to solve some of the tasks listed in KLEJ.

You can search for an alternative to the original T5, for example, the one related to your language, and experiment with it (**this task is not mandatory**).

In [None]:
# (OPTIONAL): If you want, experiment with some alternative models (like language-related, e.g., plT5 related to Polish)

## Flan-T5

At the end of 2022, an evolution of T5 called Flan-T5 was proposed. This model is also provided by the HuggingFace `transformers` library. Please visit this website: https://huggingface.co/docs/transformers/model_doc/flan-t5 to see how you can use this model (simply change the name of the model to download!).

Flan-T5 is an instruction-fine-tuned version of T5 and is much more powerful than the original model. You can take a look at Appendix D of the T5 paper to familiarize yourself with some input formats (prompts) and the generated values. The paper is here: https://arxiv.org/pdf/1910.10683.pdf. You should focus on the `processed input` fields, as they are the representations that the model consumes. Experiment with some selected tasks and see if you can obtain the same results! In the cell below, paste some code loading the Flan-T5 model and using it to solve some selected tasks.

In [88]:
tasks = [
    "summarize: Pope Francis, who sought to refocus the Catholic Church to promote social and economic justice rather than traditional moral teachings, has died. He was 88.",
    "translate English to French: I like machine learning",
    "translate English to German: I love you",
    "translate German to English: Ich liebe Schokolade.",
    "complete: If it rains tomorrow, we will...",
    "write a short story about a dragon who wants to be a chef.",
    "what is the next number in the sequence: 2, 4, 8, 16, ...?",
    "Paraphrase: The cat jumped over the wall.",
    "Rewrite this in formal language: Gotta go now, see ya!",
    "Rewrite this in formal language: I don't give a shit",
    "Question: Who wrote Sherlock Holmes? Answer:",
    "Explain quantum physics like I'm 5 years old."
    ]

model_name = "google/flan-t5-large"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

for text in tasks:
    inputs = tokenizer(text, return_tensors="pt")

    translation_ids = model.generate(
        input_ids=inputs["input_ids"], 
        attention_mask=inputs["attention_mask"],
        max_length=100
    )

    decoded_text = tokenizer.decode(translation_ids[0], skip_special_tokens=True)
    print(f"Task: {text}\noutput: {decoded_text}\n")    

Task: summarize: Pope Francis, who sought to refocus the Catholic Church to promote social and economic justice rather than traditional moral teachings, has died. He was 88.
output: Pope Francis, who sought to refocus the Catholic Church to promote social and economic justice, has died.

Task: translate English to French: I like machine learning
output: Je l'aime l'apprentissage par l'ordinateur

Task: translate English to German: I love you
output: Ich liebe Sie

Task: translate German to English: Ich liebe Schokolade.
output: I love chocolate.

Task: complete: If it rains tomorrow, we will...
output: go to the beach

Task: write a short story about a dragon who wants to be a chef.
output: The dragon is a savage beast that wants to eat all the food in the world. He is a savage beast that wants to eat all the food in the world. He is a savage beast that wants to eat all the food in the world. He is a savage beast that wants to eat all the food in the world. He is a savage beast that wa
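
The story output above repeats the same sentence until the length limit is reached. The sketch below shows one way to detect such loops in a generated string, together with two `generate` options from the `transformers` library (`no_repeat_ngram_size` and `repetition_penalty`) that are commonly used to discourage them. The `repeated_ngrams` helper is a hypothetical name, and the values in `GENERATE_KWARGS` are illustrative guesses that were not tested against the model here.

```python
from collections import Counter

def repeated_ngrams(text: str, n: int = 3) -> dict:
    """Return every word n-gram that occurs more than once, with its count."""
    words = text.split()
    ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return {" ".join(gram): count for gram, count in ngrams.items() if count > 1}

# Illustrative settings; they could be passed as model.generate(**inputs, **GENERATE_KWARGS).
GENERATE_KWARGS = {
    "max_length": 100,
    "no_repeat_ngram_size": 3,   # forbid repeating any 3-gram in the output
    "repetition_penalty": 1.2,   # down-weight tokens that were already generated
}

# Shortened stand-in for the repetitive output above.
story = "He is a savage beast that wants to eat all the food in the world. " * 3
for gram, count in repeated_ngrams(story).items():
    print(f"{count}x  {gram}")
```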

## (OPTIONAL) Fine-tuning

You can even fine-tune the T5/Flan-T5 model to solve a task of your choice. You may load an existing T5/Flan-T5 model, which is already trained to solve some tasks, and use the power of transfer learning to teach it to solve different tasks. This is much better than training a network from scratch and should require fewer training examples.

The fine-tuning phase is quite complex. However, you can find the step-by-step description here: https://www.philschmid.de/fine-tune-flan-t5

You can try to fine-tune a model of your choice.