# Exercise 1a - Testing Transformer


The objective of this task is to use Transformer summarization on the paper "Attention Is All You Need" by A. Vaswani et al. (2017), and afterwards verify whether the summary aligns with the paper.
<br>

- <b>Group 3:</b> Cesar Laura, Ecker Annina, Dilly Julian
- <b>Section of Paper:</b> "Multi-Head Attention + Scaled Dot-Product Attention"
<br>
<br>
<div class="alert alert-block alert-info">
Note: Each of us worked on all tasks independently. We later discussed our findings and merged the best and most representative parts into one notebook.
</div>
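
For reference when checking the summaries against the paper: the assigned section boils down to Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, applied in h parallel heads. The following is a minimal pure-Python sketch of that formula, not the paper's implementation. The helper names and the toy matrices are our own assumptions for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for list-of-lists matrices."""
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    outputs, weights = [], []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w_j * v[c] for w_j, v in zip(w, V)) for c in range(len(V[0]))]
        )
    return outputs, weights


# Toy example (made-up numbers): two queries, two keys, two values.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, attn = scaled_dot_product_attention(Q, K, V)

# Head size used in the paper: d_k = d_v = d_model / h.
d_model, h = 512, 8
d_head = d_model // h
```

Each row of `attn` sums to 1, and `d_head` reproduces the paper's setting d_k = d_v = d_model/h = 64. A summary that drops the division by sqrt(d_k) or garbles these dimensions can be checked against this sketch.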

In [58]:
#!pip install -r requirements.txt
#!pip install bert_score

In [8]:
import torch

if torch.backends.mps.is_available():
    device = torch.device("mps")
    print("Using MPS device")
else:
    device = torch.device("cpu")
    print("Using CPU device")


Using MPS device


In [90]:
import warnings
warnings.filterwarnings('ignore')

from transformers import pipeline
from transformers import AutoTokenizer

import textwrap

import bert_score

import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

[nltk_data] Downloading package punkt to /Users/annina/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [91]:
text1 = """Scaled Dot-Product Attention
We call our particular attention "Scaled Dot-Product Attention". The input consists of
queries and keys of dimension d_k, and values of dimension d_v. We compute the dot products of the
query with all keys, divide each by square root(d_k), and apply a softmax function to obtain the weights on the
values. In practice, we compute the attention function on a set of queries simultaneously, packed together
into a matrix Q. The keys and values are also packed together into matrices K and V .
The two most commonly used attention functions are additive attention, and dot-product (multiplicative) attention. 
Dot-product attention is identical to our algorithm, except for the scaling factor of square root(1/d_k). 
Additive attention computes the compatibility function using a feed-forward network with
a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is
much faster and more space-efficient in practice, since it can be implemented using highly optimized
matrix multiplication code.
While for small values of d_k the two mechanisms perform similarly, additive attention outperforms
dot product attention without scaling for larger values of d_k. We suspect that for large values of
d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has
extremely small gradients. To counteract this effect, we scale the dot products by square root(1/d_k).

Multi-Head Attention
Instead of performing a single attention function with dmodel-dimensional keys, values and queries,
we found it beneficial to linearly project the queries, keys and values h times with different, learned
linear projections to dk, dk and dv dimensions, respectively. On each of these projected versions of
queries, keys and values we then perform the attention function in parallel, yielding dv-dimensional
output values. These are concatenated and once again projected, resulting in the final values.
Multi-head attention allows the model to jointly attend to information from different representation
subspaces at different positions. With a single attention head, averaging inhibits this.

In this work we employ h = 8 parallel attention layers, or heads. For each of these we use
d_k = d_v = d_model/h = 64. Due to the reduced dimension of each head, the total computational cost
is similar to that of single-head attention with full dimensionality.
"""

text2="""
Applications of Attention in our Model
The Transformer uses multi-head attention in three different ways:
- In "encoder-decoder attention" layers, the queries come from the previous decoder layer,
and the memory keys and values come from the output of the encoder. This allows every
position in the decoder to attend over all positions in the input sequence. This mimics the
typical encoder-decoder attention mechanisms in sequence-to-sequence models.
- The encoder contains self-attention layers. In a self-attention layer all of the keys, values
and queries come from the same place, in this case, the output of the previous layer in the
encoder. Each position in the encoder can attend to all positions in the previous layer of the
encoder.
- Similarly, self-attention layers in the decoder allow each position in the decoder to attend to
all positions in the decoder up to and including that position. We need to prevent leftward
information flow in the decoder to preserve the auto-regressive property. We implement this
inside of scaled dot-product attention by masking out (setting to minus infinity) all values in the input
of the softmax which correspond to illegal connections.

Position-wise Feed-Forward Networks
In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully
connected feed-forward network, which is applied to each position separately and identically. This
consists of two linear transformations with a ReLU activation in between.
While the linear transformations are the same across different positions, they use different parameters
from layer to layer. Another way of describing this is as two convolutions with kernel size 1.
The dimensionality of input and output is d_model = 512, and the inner-layer has dimensionality
d_f_f = 2048.

Embeddings and Softmax
Similarly to other sequence transduction models, we use learned embeddings to convert the input
tokens and output tokens to vectors of dimension dmodel. 
We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. 
In our model, we share the same weight matrix between the two embedding layers and the pre-softmax
linear transformation. In the embedding layers, we multiply those weights by square root(d_dmodel).
"""

In [53]:
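def split_into_chunks(text, tokenizer, max_tokens=512):
    """Greedily split `text` into word-aligned chunks of at most `max_tokens` tokens."""
    words = text.split()
    chunks = []
    current_chunk = []
    for word in words:
        current_chunk.append(word)
        # Re-tokenize the growing chunk; the count includes any special tokens the tokenizer adds.
        tokenized_chunk = tokenizer(" ".join(current_chunk), return_tensors="pt").input_ids
        if tokenized_chunk.shape[1] > max_tokens:
            # The last word pushed the chunk over the limit: close the chunk without it.
            chunks.append(" ".join(current_chunk[:-1]))
            current_chunk = [word]
    # Keep whatever is left over as the final chunk.
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks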
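
To make the chunking behaviour easy to inspect without downloading a model, here is a small variant that takes any token-counting function instead of a tokenizer object. `split_by_token_budget` and `whitespace_count` are hypothetical names introduced for this sketch. The variant also avoids emitting an empty chunk when a single word already exceeds the budget.

```python
def split_by_token_budget(text, count_tokens, max_tokens=512):
    """Greedy word-level chunking; `count_tokens` is any callable str -> int."""
    chunks, current = [], []
    for word in text.split():
        candidate = current + [word]
        # Close the current chunk only if it is non-empty and the new word overflows it.
        if current and count_tokens(" ".join(candidate)) > max_tokens:
            chunks.append(" ".join(current))
            current = [word]
        else:
            current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks


def whitespace_count(s):
    """Stand-in token counter for the demo: one token per whitespace-separated word."""
    return len(s.split())


demo_chunks = split_by_token_budget(
    "one two three four five six seven", whitespace_count, max_tokens=3
)
print(demo_chunks)  # ['one two three', 'four five six', 'seven']
```

With a Hugging Face tokenizer, the counter could be something like `lambda s: len(tokenizer(s).input_ids)`; this is an assumption about how one might wire it up, not code that was run for this notebook.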

<hr style="height:10px;border-width:0;color:#CCD7E9;background-color:#CCD7E9">

# Model: Default - Sshleifer/Distilbart-CNN-12-6

Source: __[Huggingface - Sshleifer/Distilbart-CNN-12-6](https://huggingface.co/sshleifer/distilbart-cnn-12-6)__

In [85]:
summarizer = pipeline("summarization", device="mps")
outputs = summarizer(
    text1, 
    max_length=400, 
    min_length=250, 
    clean_up_tokenization_spaces=True
)

print(outputs[0]['summary_text'])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


 We call our particular attention "Scaled Dot-Product Attention" The input consists of a set of queries and keys of dimension d_k, and values d_v. The keys and values are also packed together into matrices K and V. The two most commonly used attention functions are additive attention and dot-product attention. Additive attention is faster and more space-efficient in practice. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this. With each head, the total computational cost is similar to that of single head attention with full dimensionality. The work we employ h = 8 parallel attention layers, or heads. For each of these we use d_d_k = d_K = d. For each head we use D_k and D_K and d_D_k. For example, for example, we use H =d_K, for instance, for each of which we use a single head, or d__k is D__K. For instance, the cost is roughly 40% less than that of 

In [84]:
summarizer2 = pipeline("summarization", device="mps")
outputs2 = summarizer2(
    text2, 
    max_length=400, 
    min_length=250, 
    clean_up_tokenization_spaces=True
    )

print(outputs2[0]['summary_text'])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


 The Transformer uses multi-head attention in three different ways. In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and memory keys and values come from a self-attention layer. This allows every position in the decoder to attend over all positions in the input sequence. Each of the layers in our encoder and decoder contains a fully-connected feed-forward network, which is applied to each position separately and identically. We implement this inside of scaled dot-product attention by masking out (setting to minus infinity) all values in the output of the softmax which correspond to illegal connections. We use learned embeddings to convert the input and output tokens to vectors of dimension dmodel. The dimensionality of input & output is d_model = 512, and the inner-layer has dimensionality d_f_f = 2048. It also uses the usual learned linear transformation and softmax function to transform the output to predicted next-token probabilities. The mode

<hr>

### Résumé

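The generated summaries include key concepts like Scaled Dot-Product Attention, Multi-Head Attention, and Encoder-Decoder Attention, but they lack clarity and accuracy in mathematical expressions. The latter is likely because the model was not fine-tuned on mathematical text, so we do not weigh it heavily.<br>
The first summary contains repetitive and nonsensical phrases such as "For example, for example, we use H =d_K, for instance" and incorrect notation like "d_d_k = d_K = d". Its account of computational efficiency also contradicts the source: it states that additive attention is faster and more space-efficient, whereas the paper attributes this to dot-product attention, and it introduces a "roughly 40% less" cost figure that does not appear in the original. The second summary misattributes the memory keys and values to a self-attention layer instead of the encoder output, and places the masking at the output of the softmax rather than at its input. Furthermore, both summaries are cut off abruptly, which makes them difficult to interpret.<br>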
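
Repetition of the kind quoted above can also be flagged automatically. The sketch below measures the share of word bigrams that occur more than once; `repeated_ngram_ratio` is a hypothetical helper and the choice of bigrams is an assumption, not a standard metric.

```python
from collections import Counter


def repeated_ngram_ratio(text, n=2):
    """Fraction of word n-grams in `text` that belong to a repeated n-gram."""
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


# Excerpts taken from the summary above and from the source passage.
degenerate = (
    "For example, for example, we use H =d_K, for instance, "
    "for each of which we use a single head"
)
clean = (
    "Multi-head attention allows the model to jointly attend to "
    "information from different representation subspaces"
)

ratio_degenerate = repeated_ngram_ratio(degenerate)
ratio_clean = repeated_ngram_ratio(clean)
```

The degenerate excerpt repeats the bigrams "for example," and "we use", so its ratio is above zero, while the clean sentence has no repeated bigram.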


<hr style="height:10px;border-width:0;color:#CCD7E9;background-color:#CCD7E9">

# Model: Facebook/BART-Large-CNN (without further fine tuning)

Source: __[Huggingface - Facebook/BART-Large-CNN](https://huggingface.co/facebook/bart-large-cnn)__

In [74]:
summarizer_bart1 = pipeline("summarization", model="facebook/bart-large-cnn", device="mps")
outputs_bart1 = summarizer_bart1(text1, max_length=400, min_length=250, clean_up_tokenization_spaces=True)
print(outputs_bart1[0]['summary_text'])

We call our particular attention "Scaled Dot-Product Attention" The input consists ofqueries and keys of dimension d_k, and values ofdimension d_v. We compute the dot products of the queries with all keys, divide each by root(d_k), and apply a softmax function to obtain the weights on the values. Multi-head attention allows the model to jointly attend to information from different representationsubspaces at different positions. With a single attention head, averaging inhibits this. We use h = 8 parallel attention layers, or heads. For each of these we use d_K = d_V = d _model/h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head Attention with full dimensionality. We found it beneficial to linearly project the queries, keys and values h times with different, learned, projections to dk, dk and dv dimensions, respectively. On each. version of these projected versions of                queries, keys, values we then perform the at

In [75]:
summarizer_bart2 = pipeline("summarization", model="facebook/bart-large-cnn", device="mps")
outputs_bart2 = summarizer_bart2(text2, max_length=400, min_length=250, clean_up_tokenization_spaces=True)
print(outputs_bart2[0]['summary_text'])

The Transformer uses multi-head attention in three different ways. In "encoder-decoder attention" layers, the queries come from the previous decoder layer. This allows every position in the decoder to attend over all positions in the input sequence. Each of the layers in our encoder and decoder contains a fully-connected feed-forward network, which is applied to each position separately and identically. We implement this inside of scaled dot-product attention by masking out (setting to minus infinity) all values in the softmax which correspond to illegal connections. We use learned embeddings to convert the input and output tokens to vectors of dimension dmodel. We also use the usual learned linear transformation and softmax function to convert decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmaxlinear transformation. We need to prevent leftward information flow in theDecoder to preserve

<hr>

### Résumé

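While some key concepts are preserved, the summaries fall short in readability and logical flow. Words are fused together (e.g. "ofqueries", "representationsubspaces", "pre-softmaxlinear"), the sentence order departs from the source (the projection step appears after the statement about h = 8 heads), one sentence is garbled and repetitive ("On each. version of these projected versions of queries"), and both outputs end mid-sentence. The notation is also inconsistent (d_K = d_V = d _model/h), so the summaries do not reliably convey the technical content of the original text.<br>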
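
Several summaries in this notebook stop mid-sentence. We did not determine whether this comes from the generation length limit or from how the output was displayed. In either case, a simple post-processing step can trim the dangling fragment. The helper below is a hypothetical sketch that keeps everything up to the last sentence-ending punctuation mark.

```python
import re


def drop_incomplete_tail(summary):
    """Return `summary` up to its last sentence-ending punctuation mark."""
    text = summary.strip()
    # A sentence end is ., ! or ?, optionally followed by a closing quote or bracket,
    # and then whitespace or the end of the string.
    matches = list(re.finditer(r"""[.!?]["')\]]?(?=\s|$)""", text))
    if not matches:
        return text
    return text[:matches[-1].end()]


truncated = "We use h = 8 parallel attention layers, or heads. On each of these we then perform the at"
trimmed = drop_incomplete_tail(truncated)
print(trimmed)  # We use h = 8 parallel attention layers, or heads.
```

A rule this simple will also treat abbreviations such as "e.g." as sentence ends, so it should be seen as a rough clean-up rather than proper sentence segmentation.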

<hr>

In [86]:
summarizer_bart2_1 = pipeline("summarization", model="facebook/bart-large-cnn", device="mps")
# Use the pipeline's own tokenizer so chunk sizes are counted in BART tokens.
chunks1 = split_into_chunks(text1, summarizer_bart2_1.tokenizer, max_tokens=500)

summaries_bart2_1 = [
    summarizer_bart2_1(chunk, 
                    max_length=int(len(chunk.split()) * 0.5),
                    min_length=int(len(chunk.split()) * 0.3),
                    num_beams=7,
                    repetition_penalty=1.1,
                    early_stopping=True
                   )[0]['summary_text'] 
    for chunk in chunks1
]

final_summary_bart2_1 = " ".join(summaries_bart2_1)
print(final_summary_bart2_1)


We call our particular attention "Scaled Dot-Product Attention" The input consists of queries and keys of dimension d_k, and values ofdimension d_v. We compute the dot products of the query with all keys, divide each by square root(d_k), and apply a softmax function to obtain the weights on the values. Instead of performing a single attention function with dmodel-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to dk, dk and dv dimensions.


In [87]:
summarizer_bart2_2 = pipeline("summarization", model="facebook/bart-large-cnn", device="mps")
# Use the pipeline's own tokenizer so chunk sizes are counted in BART tokens.
chunks2 = split_into_chunks(text2, summarizer_bart2_2.tokenizer, max_tokens=500)

summaries_bart2_2 = [
    summarizer_bart2_2(chunk, 
                    max_length=int(len(chunk.split()) * 0.5),
                    min_length=int(len(chunk.split()) * 0.3),
                    num_beams=7,
                    repetition_penalty=1.1,
                    early_stopping=True
                   )[0]['summary_text'] 
    for chunk in chunks2
]

final_summary_bart2_2 = " ".join(summaries_bart2_2)
print(final_summary_bart2_2)

The Transformer uses multi-head attention in three different ways. In "encoder-decoder attention" layers, the queries come from the previous decoder layer. This allows every position in the decoder to attend over all positions in the input sequence. Each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. We use learned embeddings to convert the input tokens and output tokens to vectors of dimension dmodel. We also use the usual learned linear transformation and softmax function to convert decoder output to predicted next-token probabilities.


<hr>

### BERTScore

Source: __[Huggingface - Bertscore](https://huggingface.co/spaces/evaluate-metric/bertscore)__
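
To make the metric easier to interpret, here is a simplified sketch of the greedy matching step behind BERTScore: every candidate token is matched to its most similar reference token (precision) and vice versa (recall). It uses made-up 2-dimensional vectors instead of contextual embeddings and leaves out the optional idf weighting and baseline rescaling, so it illustrates the idea only and does not reproduce the `bert_score` package.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(cand_vecs, ref_vecs):
    """Precision, recall and F1 from greedy best-match cosine similarities."""
    precision = sum(
        max(cosine(c, r) for r in ref_vecs) for c in cand_vecs
    ) / len(cand_vecs)
    recall = sum(
        max(cosine(r, c) for c in cand_vecs) for r in ref_vecs
    ) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy "embeddings": the candidate covers two of the three reference tokens exactly.
ref = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cand = [[1.0, 0.0], [1.0, 1.0]]
p, r, f = greedy_match_scores(cand, ref)
```

In this toy case precision is 1 because every candidate vector has an exact counterpart in the reference, while recall is lower because one reference vector is only partially covered. A summary that copies sentences from its source behaves similarly: high precision, with recall limited by what was left out.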

In [88]:
candidates_bart2_1 = [final_summary_bart2_1]
references_bart2_1 = [text1]

P, R, F1 = bert_score.score(candidates_bart2_1, references_bart2_1, lang="en", verbose=True)

print("BERTScore F1 for finetuned Bart-Large-Cnn:", F1.item())

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.


100%|█████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.25s/it]


computing greedy matching.


100%|█████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 48.41it/s]

done in 3.28 seconds, 0.30 sentences/sec
BERTScore F1 for finetuned Bart-Large-Cnn: 0.8873118758201599





In [89]:
candidates_bart2_2 = [final_summary_bart2_2]
references_bart2_2 = [text2]

P, R, F1 = bert_score.score(candidates_bart2_2, references_bart2_2, lang="en", verbose=True)

print("BERTScore F1 for finetuned Bart-Large-Cnn:", F1.item())

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.


100%|█████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.33s/it]


computing greedy matching.


100%|████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 476.57it/s]

done in 1.34 seconds, 0.75 sentences/sec
BERTScore F1 for finetuned Bart-Large-Cnn: 0.8806095719337463





<hr>

### Résumé

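The BART-Large-CNN summaries generated with chunking and tuned generation parameters (beam search, repetition penalty, and length limits relative to the input) retain the core concepts of Scaled Dot-Product Attention, Multi-Head Attention, and Encoder-Decoder Attention, and are more readable and coherent than the earlier runs. Note that no model weights were fine-tuned; only the generation settings changed. The summaries correctly describe the softmax weighting mechanism and the projection step of multi-head attention, and keep important notation such as d_k, d_v, and d_model.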
<br><br>
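The high BERTScore F1 values (0.887 for text1 and 0.881 for text2) indicate strong semantic alignment with the original text. Because these summaries reuse sentences from the source almost verbatim, a high score is expected, so the values are more useful for comparing models with each other than as an absolute measure of quality.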
<br><br>
However, some crucial details are missing, particularly regarding computational efficiency, scaling effects, and masked attention in the decoder. While the summary remains accurate, it feels overly compressed and does not fully convey the reasoning behind architectural choices in transformers.
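
A quick way to spot such omissions is to check which key terms from the source passage survive in a summary. The sketch below uses plain substring matching; the helper name and the term list are hypothetical choices made for illustration, and the excerpt is taken from the BART summary of text1 above.

```python
def missing_terms(summary, terms):
    """Return the terms that do not occur (case-insensitively) in `summary`."""
    lowered = summary.lower()
    return [t for t in terms if t.lower() not in lowered]


summary_excerpt = (
    "We compute the dot products of the query with all keys, divide each by "
    "square root(d_k), and apply a softmax function to obtain the weights on the values."
)
# Terms picked by hand from the source passage (an assumption, not an exhaustive list).
key_terms = ["softmax", "square root(d_k)", "gradients", "matrix multiplication"]

absent = missing_terms(summary_excerpt, key_terms)
print(absent)  # ['gradients', 'matrix multiplication']
```

Substring matching ignores paraphrases, so an empty result does not prove that a summary is complete; it only helps to locate obvious gaps such as the missing explanation of small softmax gradients.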


<hr style="height:10px;border-width:0;color:#CCD7E9;background-color:#CCD7E9">

# Model: Google/Pegasus-XSUM

Source: __[Huggingface - Google/Pegasus-XSUM](https://huggingface.co/google/pegasus-xsum)__

In [92]:
tokenizer = AutoTokenizer.from_pretrained("google/pegasus-xsum")

tokens1 = tokenizer(text1, return_tensors="pt").input_ids
tokens2 = tokenizer(text2, return_tensors="pt").input_ids

print(f"Tokenized input of text1 has {tokens1.shape[1]} tokens, of text2 it has {tokens2.shape[1]} tokens.")


Tokenized input of text1 has 494 tokens, of text2 it has 458 tokens.


In [93]:
summarizer_pegasus1 = pipeline("summarization", model="google/pegasus-xsum", device="mps")
chunks1 = split_into_chunks(text1, tokenizer, max_tokens=500)

summaries_pegasus1 = [
    summarizer_pegasus1(chunk, 
                        max_length=int(len(chunk.split()) * 0.4),
                        min_length=int(len(chunk.split()) * 0.25), 
                        num_beams=7,
                        repetition_penalty=1.2,
                        early_stopping=True
                       )[0]['summary_text'] 
    for chunk in chunks1
]

final_summary_pegasus1 = " ".join(summaries_pegasus1)
print(final_summary_pegasus1)


Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In our work, we have developed a new type of attention function for the representation of dot products in full dimensionality. Multi-Head Attention Instead of performing a single attention function with d-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys values h times with different, learned linear projections to dk, dk and dv dimensions, respectively. The total computational cost is similar to that of single-head attention with full dimensionality.


In [94]:
summarizer_pegasus2 = pipeline("summarization", model="google/pegasus-xsum", device="mps")
chunks2 = split_into_chunks(text2, tokenizer, max_tokens=460)

summaries_pegasus2 = [
    summarizer_pegasus2(chunk, 
                        max_length=int(len(chunk.split()) * 0.4),
                        min_length=int(len(chunk.split()) * 0.25), 
                        num_beams=7,
                        repetition_penalty=1.2,
                        early_stopping=True
                       )[0]['summary_text'] 
                      for chunk in chunks2
                     ]
final_summary_pegasus2 = " ".join(summaries_pegasus2)
print(final_summary_pegasus2)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Multi-head attention in a sequence-to-sequence model We present a model for multi-head attention in a sequence-to-sequence model, which allows each position in the decoder to attend all positions in the input sequence up to and including the last position in the output sequence, while preserving the auto-regressive property of encoder-decoder attention mechanisms in sequence-to-sequence models. - Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position.


<hr>

In [95]:
candidates_pegasus_1 = [final_summary_pegasus1]
references_pegasus_1 = [text1]

P, R, F1 = bert_score.score(candidates_pegasus_1, references_pegasus_1, lang="en", verbose=True)

print("BERTScore F1 for Pegasus 1:", F1.item())


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.


100%|█████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.14s/it]


computing greedy matching.


100%|█████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 86.25it/s]

done in 3.18 seconds, 0.31 sentences/sec
BERTScore F1 for Pegasus 1: 0.8523311018943787





In [96]:
candidates_pegasus_2 = [final_summary_pegasus2]
references_pegasus_2 = [text2]

P, R, F1 = bert_score.score(candidates_pegasus_2, references_pegasus_2, lang="en", verbose=True)

print("BERTScore F1 for Pegasus 2:", F1.item())

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.


100%|█████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.02s/it]


computing greedy matching.


100%|████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 341.75it/s]

done in 1.02 seconds, 0.98 sentences/sec
BERTScore F1 for Pegasus 2: 0.8651168346405029





### Résumé

The Google/Pegasus-XSUM summary successfully captures core ideas related to Multi-Head Attention and Sequence-to-Sequence models while maintaining a concise structure. The key concepts of dot-product attention, projection of queries, keys and values, as well as encoder-decoder mechanisms are present, and the summary correctly mentions the auto-regressive property in sequence-to-sequence models.
<br><br>
However, the BERTScore F1 values (0.852 for text1 and 0.865 for text2) are slightly lower than those of the BART-Large-CNN runs (0.887 and 0.881), indicating that some details may have been lost or reinterpreted. Some definitions lack depth, particularly regarding computational efficiency and scaling effects.<br>
Additionally, while the text remains logically structured, it introduces ambiguities, such as the phrase "representation of dot products in full dimensionality", which is not a precise technical description. The second summary also contains redundancies: the idea that each position in the decoder attends to all positions up to a given position is stated twice, and a list marker ("- Similarly, ...") is carried over from the source. We conclude that the summary is reasonably good, but that there is still room for improvement, for instance through fine-tuning.