# Exercise 1c - Testing Transformer

The objective of this task is to simulate how AI can assist in expanding research papers, generating missing explanations, or rephrasing content in a meaningful way.
<br>

- <b>Group 3:</b> Cesar Laura, Ecker Annina, Dilly Julian
- <b>Section of Paper:</b> "Multi-Head Attention + Scaled Dot-Product Attention"
<br>
<br>
<div class="alert alert-block alert-info">
Note: Each of us worked on all tasks independently. We then discussed our findings and merged the best and most representative parts into one notebook.
</div>

In [10]:
#!pip install -r requirements.txt
#!pip install bert_score

In [3]:
import torch

if torch.backends.mps.is_available():
    device = torch.device("mps")
    print("Using MPS device")
else:
    device = torch.device("cpu")
    print("Using CPU device")


Using MPS device


In [13]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from transformers import pipeline
from transformers import RobertaTokenizer, RobertaForSequenceClassification
from transformers import set_seed
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSequenceClassification
import transformers

import torch

import textwrap

import bert_score

import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize


[nltk_data] Downloading package punkt to /Users/annina/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [6]:
set_seed(42)

<hr style="height:10px;border-width:0;color:#CCD7E9;background-color:#CCD7E9">

# Model: GPT2-XL

Source: __[Huggingface - OpenAI-Community/GPT2-XL](https://huggingface.co/openai-community/gpt2-xl)__

In [19]:
generator = pipeline('text-generation', model='gpt2-xl', device=device)  # reuse the device selected above (MPS if available, else CPU)

In [44]:
prompt = """
Please provide a detailed explanation of the concept of Multi-Head Attention based on the following research excerpt:

'Instead of performing a single attention function with dmodel-dimensional keys, values and queries, 
we found it beneficial to linearly project the queries, keys and values h times with different, learned 
linear projections to dk, dk and dv dimensions, respectively. On each of these projected versions of 
queries, keys and values we then perform the attention function in parallel, yielding dv-dimensional 
output values. These are concatenated and once again projected, resulting in the final values.
Multi-head attention allows the model to jointly attend to information from different representation 
subspaces at different positions. With a single attention head, averaging inhibits this.

In this work we employ h = 8 parallel attention layers, or heads. For each of these we use 
d_k = d_v = d_model/h = 64. Due to the reduced dimension of each head, the total computational cost 
is similar to that of single-head attention with full dimensionality.'

Expand on this explanation by describing how Multi-Head Attention enhances deep learning models and why it is beneficial.

____________________________________________________________________
"""

In [45]:
outputs = generator(prompt, 
                    max_new_tokens=200, 
                    num_return_sequences=1, # number of continuations returned per prompt (1 = a single answer)
                    temperature=0.2, # sampling only: values < 1 sharpen the token distribution (0.2-0.5 for scientific answers), values > 1 flatten it
                    top_k=5, # sampling only: restrict the choice to the k most likely tokens; a larger k (e.g., 50-100) allows more varied text
                    top_p=0.9, # sampling only: keep the most likely tokens until their cumulative probability reaches, e.g., 90%
                    repetition_penalty=1.2, # 1-1.2 good default values; >1.5 forces more variation but also unnatural sounding answers; <1.0 can result in repeated sentences
                    do_sample=False, # greedy decoding -> deterministic; temperature, top_k and top_p only take effect with do_sample=True
                    truncation=True,
                    pad_token_id=50256 # GPT-2 has no pad token, so its end-of-text id is reused
                )

print(outputs[0]['generated_text'])


Please provide a detailed explanation of the concept of Multi-Head Attention based on the following research excerpt:

'Instead of performing a single attention function with dmodel-dimensional keys, values and queries, 
we found it beneficial to linearly project the queries, keys and values h times with different, learned 
linear projections to dk, dk and dv dimensions, respectively. On each of these projected versions of 
queries, keys and values we then perform the attention function in parallel, yielding dv-dimensional 
output values. These are concatenated and once again projected, resulting in the final values.
Multi-head attention allows the model to jointly attend to information from different representation 
subspaces at different positions. With a single attention head, averaging inhibits this.

In this work we employ h = 8 parallel attention layers, or heads. For each of these we use 
d_k = d_v = d_model/h = 64. Due to the reduced dimension of each head, the total computati

<br>

### Preliminary Results

This generated output presents a fabricated summary of a non-existent research study, falsely suggesting that the excerpt refers to an original paper authored by the model’s creators. While it correctly mentions "multi-headed attention" and "linear projection," the rest of the content is hallucinated — it references unspecified prior studies ([1]–[3]) and falsely claims novel implementations and results in tasks like object detection or speech recognition without any factual grounding. These claims are not part of the original excerpt and are unsupported by evidence.
<br><br>
This output fails the task. Instead of expanding on the provided excerpt to explain the mechanism of Multi-Head Attention, it fabricates research claims and introduces unrelated tasks, confusing the reader and misrepresenting the original content. It is not reliable for scientific explanation.
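
The printed text begins with the prompt itself because the `text-generation` pipeline returns prompt plus continuation by default. To judge only what the model added, the echoed prompt can be cut off first. The following is a minimal sketch under that assumption; `strip_prompt` is a hypothetical helper, not part of `transformers`, and the demo strings are made up. Passing `return_full_text=False` to the pipeline call should achieve the same effect.

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    """Return only the continuation if the text starts with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text.strip()


# Toy strings standing in for `prompt` and `outputs[0]['generated_text']`.
demo_prompt = "Explain Multi-Head Attention.\n"
demo_output = demo_prompt + "Multi-Head Attention runs several attention heads in parallel."
continuation = strip_prompt(demo_output, demo_prompt)
print(continuation)
```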

<hr>

In [41]:
prompt2 = """
In transformer models, Multi-Head Attention is used to process queries, keys, and values by projecting them into lower-dimensional spaces and applying parallel attention mechanisms. This allows the model to capture information from different representation subspaces.

Explain in scientific detail how Multi-Head Attention enhances deep learning models, focusing on its computational benefits, ability to process contextual information, and impact on model performance. Avoid repeating the summary above and expand upon it in technical language.

____________________________________________________________________
"""


In [42]:
outputs2 = generator(
    prompt2,
    max_new_tokens=400,           
    temperature=0.2,             
    top_k=30,                  
    top_p=0.9,                  
    repetition_penalty=1.2,      
    do_sample=False,           
    pad_token_id=50256,
    eos_token_id=50256
)
print(outputs2[0]['generated_text'])



In transformer models, Multi-Head Attention is used to process queries, keys, and values by projecting them into lower-dimensional spaces and applying parallel attention mechanisms. This allows the model to capture information from different representation subspaces.

Explain in scientific detail how Multi-Head Attention enhances deep learning models, focusing on its computational benefits, ability to process contextual information, and impact on model performance. Avoid repeating the summary above and expand upon it in technical language.

____________________________________________________________________

3.2.1. Neural Networks with Multiple Inputs

Multi-Head Attention can be applied to neural networks that have multiple input layers (e.g., convolutional or recurrent). In this case, the model will use a single attention layer for each of these inputs. The model then uses multi-head attention to extract features from the data.

The following figure shows an example of a network wi

<br>

### Preliminary Results

The generated text is highly repetitive and hallucinatory. Rather than delivering a technical explanation of Multi-Head Attention, it spirals into an unrealistic and irrelevant list of hypothetical course topics, repeatedly mentioning "modeling the behavior of nonlinear systems" in various redundant contexts. It fails to address the original prompt and does not provide any meaningful information about Multi-Head Attention, its mechanics, or computational benefits. The content lacks coherence, factual accuracy, and practical relevance, and is an example of a failed generation.<br><br>

This output incorrectly frames Multi-Head Attention as a mechanism applied to multiple input layers like images in convolutional neural networks, which is a misinterpretation of its purpose. The explanation diverges from the correct technical context of attention mechanisms in transformers and instead describes a flawed analogy involving gender classification. It confuses the role of attention layers and inaccurately claims that attention is applied to entire input layers, rather than within a sequence of tokens. The response introduces hallucinated examples and misleads in both structure and function, ultimately failing to answer the original prompt accurately.
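
To make concrete what "attention within a sequence of tokens" means, here is a minimal plain-Python sketch of scaled dot-product attention, `softmax(Q K^T / sqrt(d_k)) V`, which is the operation each head performs. The three token vectors are made-up numbers and the function names are illustrative only; real models use learned projections and tensor libraries.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors (one row per token)."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    # Each output row is a weighted average of the value vectors.
    output = [
        [sum(w * v[j] for w, v in zip(w_row, V)) for j in range(len(V[0]))]
        for w_row in weights
    ]
    return output, weights


# Three toy "tokens" with d_k = d_v = 2 (made-up numbers).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = scaled_dot_product_attention(X, X, X)
for w_row in weights:
    print([round(w, 3) for w in w_row])
```

Every row of `weights` sums to 1, so each token's output is a mixture of the other tokens' values, weighted by how well its query matches their keys.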



<hr>

In [48]:
prompt3 = """
Explain in scientific detail how Multi-Head Attention works in transformer models, 
and how it enhances performance through parallel attention heads. Discuss its computational 
benefits and its ability to process information from multiple subspaces simultaneously.
____________________________________________________________________
"""

In [59]:
outputs3 = generator(
    prompt3,
    max_new_tokens=300,           
    temperature=0.2,              
    top_k=30,        
    top_p=0.9,               
    repetition_penalty=1.2,     
    do_sample=False, 
    pad_token_id=50256,
    eos_token_id=50256
)
print(outputs3[0]['generated_text'])


Explain in scientific detail how Multi-Head Attention works in transformer models, 
and how it enhances performance through parallel attention heads. Discuss its computational 
benefits and its ability to process information from multiple subspaces simultaneously.
____________________________________________________________________
The following is a summary of the lecture notes for this course:
Course Description: This course will cover the theory and application of multi-head attention in transformer models. The focus will be on the use of multi-head attention in modeling the behavior of transformers. Topics include: (1) the theoretical foundations of multi-head attention; (2) the implementation of multi-head attention in transformer models; (3) the role of multi-head attention in model selection; (4) the use of multi-head attention in modeling the behavior of transformers; (5) the use of multi-head attention in modeling the behavior of nonlinear systems; (6) the use of multi-head a

<hr>

<br>

### Preliminary Results

The generated output does not fulfill the prompt’s request for a scientific explanation of Multi-Head Attention. Instead, it spirals into a repetitive and off-topic summary resembling fictitious lecture notes, with excessive enumeration and no technical content about transformer models or attention mechanisms. There is no mention of key concepts like query/key/value projections, parallel attention heads, or computational efficiency. Additionally, the output exhibits clear signs of hallucination and lacks coherence, making it unusable for any scientific or educational purpose.<br>
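
All three GPT2-XL calls above set `do_sample=False`, which selects greedy decoding, so `temperature`, `top_k` and `top_p` do not influence these outputs. The toy sketch below illustrates what those parameters would do if sampling were enabled. It is a simplified illustration, not the `transformers` implementation; the logits and function names are made up.

```python
import math


def apply_temperature(logits, temperature):
    """Divide logits by the temperature and convert them to probabilities."""
    scaled = [x / temperature for x in logits.values()]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return {tok: e / total for tok, e in zip(logits, exps)}


def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix reaching top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}  # renormalise the survivors


# Made-up next-token logits for illustration.
toy_logits = {"attention": 2.0, "heads": 1.5, "layers": 0.5, "banana": -1.0}
probs = apply_temperature(toy_logits, temperature=0.2)
candidates = filter_top_k_top_p(probs, top_k=3, top_p=0.9)
greedy_choice = max(toy_logits, key=toy_logits.get)  # what do_sample=False picks
print(candidates, greedy_choice)
```

With a temperature as low as 0.2 the distribution is so peaked that only one candidate survives the `top_p` cut, which is why low-temperature sampling behaves almost like greedy decoding.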

<hr style="height:10px;border-width:0;color:#CCD7E9;background-color:#CCD7E9">

# Model: Phi-2

Source: __[Huggingface - Microsoft/Phi-2](https://huggingface.co/microsoft/phi-2)__

In [7]:
generator2 = pipeline("text-generation", model="microsoft/phi-2", device="cpu")

Loading checkpoint shards: 100%|███████████████████| 2/2 [00:11<00:00,  5.88s/it]


In [63]:
prompt4 = """
Please provide a detailed explanation of the concept of Multi-Head Attention based on the following research excerpt:

'Instead of performing a single attention function with dmodel-dimensional keys, values and queries, 
we found it beneficial to linearly project the queries, keys and values h times with different, learned 
linear projections to dk, dk and dv dimensions, respectively. On each of these projected versions of 
queries, keys and values we then perform the attention function in parallel, yielding dv-dimensional 
output values. These are concatenated and once again projected, resulting in the final values.
Multi-head attention allows the model to jointly attend to information from different representation 
subspaces at different positions. With a single attention head, averaging inhibits this.

In this work we employ h = 8 parallel attention layers, or heads. For each of these we use 
d_k = d_v = d_model/h = 64. Due to the reduced dimension of each head, the total computational cost 
is similar to that of single-head attention with full dimensionality.'

Expand on this explanation by describing how Multi-Head Attention enhances deep learning models and why it is beneficial.

____________________________________________________________________
"""

In [65]:
outputs4 = generator2(
    prompt4,
    max_new_tokens=300,           
    temperature=0.2,              
    top_k=30,        
    top_p=0.9,               
    repetition_penalty=1.2,     
    do_sample=False, 
    pad_token_id=50256,
    eos_token_id=50256
)
print(outputs4[0]['generated_text'])



Please provide a detailed explanation of the concept of Multi-Head Attention based on the following research excerpt:

'Instead of performing a single attention function with dmodel-dimensional keys, values and queries, 
we found it beneficial to linearly project the queries, keys and values h times with different, learned 
linear projections to dk, dk and dv dimensions, respectively. On each of these projected versions of 
queries, keys and values we then perform the attention function in parallel, yielding dv-dimensional 
output values. These are concatenated and once again projected, resulting in the final values.
Multi-head attention allows the model to jointly attend to information from different representation 
subspaces at different positions. With a single attention head, averaging inhibits this.

In this work we employ h = 8 parallel attention layers, or heads. For each of these we use 
d_k = d_v = d_model/h = 64. Due to the reduced dimension of each head, the total computati

<br>

### Preliminary Results

The model produced a coherent and technically accurate explanation of Multi-Head Attention. It correctly emphasized that instead of using a single attention mechanism, multiple independent attention heads process different aspects of the input in parallel. This allows the model to capture diverse patterns and relationships across the sequence, which enhances representation capacity.
<br>
Key benefits like improved generalization, handling of long sequences, and parallel computation were highlighted – aligning well with core theoretical insights from transformer architectures.
<br>
Unexpectedly, the output also included a follow-up task about Dropout Regularization. While this is unrelated, it shows the model's inclination to generate structured responses (e.g., "Solution 1", "Solution 2"). This is not harmful but may be trimmed or controlled for focused outputs.

<hr>

In [71]:
outputs5 = generator2(
    prompt3,
    max_new_tokens=300,           
    temperature=0.2,              
    top_k=30,        
    top_p=0.9,               
    repetition_penalty=1.2,     
    do_sample=False, 
    pad_token_id=50256,
    eos_token_id=50256
)
print(outputs5[0]['generated_text'])



Explain in scientific detail how Multi-Head Attention works in transformer models, 
and how it enhances performance through parallel attention heads. Discuss its computational 
benefits and its ability to process information from multiple subspaces simultaneously.
____________________________________________________________________
Solution:
Multi-Head Attention is a key component of Transformer models that allows the model to focus on different positions or "heads" within the input sequence at each step during training. This enables the model to capture various aspects of the data more effectively by attending to different parts concurrently. The concept behind multi-head attention can be understood as follows:
1. Key Points: In traditional self-attention mechanisms, the same set of weights are applied to all elements in the input sequence for computing the dot product between them. However, with multi-head attention, we introduce multiple independent linear projections (Q, K, V) int

<br>

### Preliminary Results

The generated output provides a solid technical overview of how Multi-Head Attention functions within transformer models. It highlights the use of independent attention heads, each attending to different parts of the input in parallel, which allows the model to capture diverse aspects of the data more effectively. The explanation correctly details key mechanisms such as query/key/value projections, parallel processing, and shared weights, emphasizing both computational efficiency and representational depth. While the explanation is accurate and avoids hallucinations, it ends abruptly, missing a conclusion or discussion of real-world benefits. Nonetheless, it successfully conveys the technical foundation and practical advantages of Multi-Head Attention.<br>
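
The mechanism summarised above (project per head, attend in parallel, concatenate, project again) can be sketched in plain Python. This is a toy illustration under loud assumptions: `d_model = 4` and `h = 2` are chosen for readability, and the projection matrices are fixed slices rather than learned weights. All names are illustrative.

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)


def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """Project X once per head, attend per head, concatenate, project with W_o."""
    heads = [
        attention(matmul(X, wq), matmul(X, wk), matmul(X, wv))
        for wq, wk, wv in zip(W_q, W_k, W_v)
    ]
    # Concatenate the head outputs token by token (row-wise).
    concatenated = [sum((head[i] for head in heads), []) for i in range(len(X))]
    return matmul(concatenated, W_o)


d_model, h = 4, 2
d_k = d_model // h
# Fixed toy projections (assumption): head i simply reads its own slice of the input.
# In a trained model these matrices are learned.
slices = [
    [[1.0 if r == i * d_k + c else 0.0 for c in range(d_k)] for r in range(d_model)]
    for i in range(h)
]
W_o = [[1.0 if r == c else 0.0 for c in range(d_model)] for r in range(d_model)]

# Three made-up token vectors of size d_model.
X = [[1.0, 0.0, 0.0, 2.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.5, 0.5]]
out = multi_head_attention(X, slices, slices, slices, W_o)
print(len(out), len(out[0]))
```

Each head sees only a `d_k`-dimensional projection of every token, yet the concatenated result has the original `d_model` width again.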

<hr>

In [78]:
prompt_simple = """
Please explain the concept of Multi-Head Attention in a simple and easy-to-understand way. 
Base your explanation on the following research excerpt, and **add any important missing context** that would help a beginner fully understand how Multi-Head Attention works and why it is useful.

Excerpt:
'Instead of performing a single attention function with dmodel-dimensional keys, values and queries, 
we found it beneficial to linearly project the queries, keys and values h times with different, learned 
linear projections to dk, dk and dv dimensions, respectively. On each of these projected versions of 
queries, keys and values we then perform the attention function in parallel, yielding dv-dimensional 
output values. These are concatenated and once again projected, resulting in the final values.
Multi-head attention allows the model to jointly attend to information from different representation 
subspaces at different positions. With a single attention head, averaging inhibits this.

In this work we employ h = 8 parallel attention layers, or heads. For each of these we use 
d_k = d_v = d_model/h = 64. Due to the reduced dimension of each head, the total computational cost 
is similar to that of single-head attention with full dimensionality.'

**Explain in your own words, in a beginner-friendly way, and add any helpful examples.**

____________________________________________________________________
"""


In [74]:
outputs6 = generator2(
    prompt_simple,
    max_new_tokens=300,           
    temperature=0.2,              
    top_k=30,        
    top_p=0.9,               
    repetition_penalty=1.2,     
    do_sample=False, 
    pad_token_id=50256,
    eos_token_id=50256
)
print(outputs6[0]['generated_text'])


Please explain the concept of Multi-Head Attention in a simple and easy-to-understand way. 
Base your explanation on the following research excerpt, and **add any important missing context** that would help a beginner fully understand how Multi-Head Attention works and why it is useful.

Excerpt:
'Instead of performing a single attention function with dmodel-dimensional keys, values and queries, 
we found it beneficial to linearly project the queries, keys and values h times with different, learned 
linear projections to dk, dk and dv dimensions, respectively. On each of these projected versions of 
queries, keys and values we then perform the attention function in parallel, yielding dv-dimensional 
output values. These are concatenated and once again projected, resulting in the final values.
Multi-head attention allows the model to jointly attend to information from different representation 
subspaces at different positions. With a single attention head, averaging inhibits this.

In 

<br>

### Preliminary Results

The generated explanation of Multi-Head Attention presents the concept in a simple, accessible manner, well-suited for beginners. It uses a high-level, intuitive analogy involving a cat and various visual filters, effectively illustrating how multiple attention heads can focus on different aspects of the same input to gain a richer understanding. This analogy is a creative and engaging way to demystify a complex concept, particularly for those unfamiliar with neural network mechanisms.
<br>
The output goes beyond the provided excerpt by adding helpful context: it clearly explains the role of individual attention heads and their specialization in processing distinct types of information. This insight allows beginners to grasp the benefit of using multiple heads, namely, improved comprehension of complex data through parallel processing and diverse focus.
<br><br>
Importantly, the output avoids hallucinations and does not contain any repetitive or irrelevant content, maintaining both factual accuracy and readability throughout.
<br>

<hr>

In [8]:
prompt_advanced = """
Please provide a detailed and technically accurate explanation of Multi-Head Attention, 
based on the following research excerpt. Do not repeat the excerpt, but expand on the key ideas, 
add missing concepts, and explain why Multi-Head Attention improves model performance
in transformer-based architectures.

Excerpt:
'Instead of performing a single attention function with dmodel-dimensional keys, values and queries, 
we found it beneficial to linearly project the queries, keys and values h times with different, learned 
linear projections to dk, dk and dv dimensions, respectively. On each of these projected versions of 
queries, keys and values we then perform the attention function in parallel, yielding dv-dimensional 
output values. These are concatenated and once again projected, resulting in the final values.
Multi-head attention allows the model to jointly attend to information from different representation 
subspaces at different positions. With a single attention head, averaging inhibits this.

In this work we employ h = 8 parallel attention layers, or heads. For each of these we use 
d_k = d_v = d_model/h = 64. Due to the reduced dimension of each head, the total computational cost 
is similar to that of single-head attention with full dimensionality.'

Explain the mechanism, benefits, and provide relevant technical details to fill any knowledge gaps.
Avoid code examples, focus only on explanation and analogies.

____________________________________________________________________
"""


In [9]:
outputs7 = generator2(
    prompt_advanced,
    max_new_tokens=500,           
    temperature=0.2,              
    top_k=40,        
    top_p=0.9,               
    repetition_penalty=1.2,     
    do_sample=False, 
    pad_token_id=50256,
    eos_token_id=50256
)
print(outputs7[0]['generated_text'])


Please provide a detailed and technically accurate explanation of Multi-Head Attention, 
based on the following research excerpt. Do not repeat the excerpt, but expand on the key ideas, 
add missing concepts, and explain why Multi-Head Attention improves model performance
in transformer-based architectures.

Excerpt:
'Instead of performing a single attention function with dmodel-dimensional keys, values and queries, 
we found it beneficial to linearly project the queries, keys and values h times with different, learned 
linear projections to dk, dk and dv dimensions, respectively. On each of these projected versions of 
queries, keys and values we then perform the attention function in parallel, yielding dv-dimensional 
output values. These are concatenated and once again projected, resulting in the final values.
Multi-head attention allows the model to jointly attend to information from different representation 
subspaces at different positions. With a single attention head, averagin

<br>

### Preliminary Results

The model provides a technically correct and coherent explanation of Multi-Head Attention. It correctly outlines that several attention calculations are performed in parallel, one per "head", and that their outputs are concatenated and projected again to form the final output.<br>
The response also highlights a key benefit: enabling the model to focus on different parts of the input simultaneously, thereby capturing more complex relationships. Additionally, the explanation mentions that this approach leads to improved performance over traditional architectures such as LSTMs and GRUs.<br>

However, some technical depth is missing. Specifically, it does not explain why splitting into smaller dimensions keeps computational cost manageable, nor does it mention the role of Scaled Dot-Product Attention or representation subspaces — important concepts in understanding the efficiency and expressiveness of Multi-Head Attention. Also, while the comparison to LSTMs is fair, it lacks a direct link to how attention improves long-range dependency modeling, which is central to its success.<br>
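
The cost argument that the model left out can be checked with simple arithmetic. The sketch below uses `h = 8` and `d_k = d_v = 64` from the excerpt, which implies `d_model = 512`. It counts projection weights and the multiplications needed for the attention scores, ignoring biases, and assumes the single-head baseline also has an output projection; the sequence length is arbitrary.

```python
d_model, h = 512, 8          # h = 8 and d_k = 64 come from the excerpt, so d_model = 8 * 64
d_k = d_v = d_model // h

# Projection weights (biases ignored): per head W_q, W_k (d_model x d_k) and W_v (d_model x d_v),
# plus one output projection W_o of shape (h * d_v) x d_model.
multi_head_params = h * (2 * d_model * d_k + d_model * d_v) + (h * d_v) * d_model

# Single head at full dimensionality (assumption): W_q, W_k, W_v and W_o are all d_model x d_model.
single_head_params = 4 * d_model * d_model


def score_multiplications(n_tokens, n_heads, dim_per_head):
    """Multiplications needed to form all Q K^T score matrices for one sequence."""
    return n_heads * n_tokens * n_tokens * dim_per_head


n = 100  # sequence length chosen only for illustration
print(multi_head_params, single_head_params)
print(score_multiplications(n, h, d_k), score_multiplications(n, 1, d_model))
```

Both comparisons come out equal: eight heads of width 64 need as many weights and score multiplications as one head of width 512, which is the sense in which the excerpt calls the total cost "similar".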

<hr style="height:10px;border-width:0;color:#CCD7E9;background-color:#CCD7E9">

# Model: Vectara - Hallucination Eval Model
Source: __[Huggingface - Vectara/Hallucination Evaluation Model](https://huggingface.co/vectara/hallucination_evaluation_model)__

In [34]:
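# `original_text` and `generated_answers` are defined in the following cell (In [30]), which was executed first.
from transformers import AutoTokenizer, pipeline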

prompt_template = "<pad> Determine if the hypothesis is true given the premise?\n\nPremise: {text1}\n\nHypothesis: {text2}"

input_prompts = [
    prompt_template.format(text1=original_text.strip(), text2=answer.strip()) 
    for answer in generated_answers
]

classifier = pipeline(
    "text-classification",
    model='vectara/hallucination_evaluation_model',
    tokenizer=AutoTokenizer.from_pretrained('google/flan-t5-base'),
    trust_remote_code=True,
    device="mps"
)

You are using a model of type HHEMv2Config to instantiate a model of type HHEMv2. This is not supported for all configurations of models and can yield errors.


In [30]:
original_text = """
Instead of performing a single attention function with dmodel-dimensional keys, values and queries, 
we found it beneficial to linearly project the queries, keys and values h times with different, learned 
linear projections to dk, dk and dv dimensions, respectively. On each of these projected versions of 
queries, keys and values we then perform the attention function in parallel, yielding dv-dimensional 
output values. These are concatenated and once again projected, resulting in the final values.
Multi-head attention allows the model to jointly attend to information from different representation 
subspaces at different positions. With a single attention head, averaging inhibits this.

In this work we employ h = 8 parallel attention layers, or heads. For each of these we use 
d_k = d_v = d_model/h = 64. Due to the reduced dimension of each head, the total computational cost 
is similar to that of single-head attention with full dimensionality.
"""

generated_answers = [
    """The idea behind multi-headed attention is simple - instead of using one set of weights for all inputs 
    (as in standard self-attention), you can have multiple sets of weights, which will allow your network to 
    learn more complex representations by attending to various parts of the input simultaneously. 
    This has been shown to improve performance significantly compared to traditional models like LSTMs and GRUs.
    In our implementation, we first split the original query, key, value vectors into `n` smaller ones, where `n` is equal 
    to the number of attention heads. We do this by applying linear transformations to them, so that they become `dk`, `dk`, 
    and `dv` dimensional vectors, respectively. Then, we apply the same attention calculation as before, but now there are `n` 
    separate calculations happening concurrently. 
    Finally, we combine the results of all these calculations together and return an output vector of size `d_model`.""",
    
      """The multi-headed attention mechanism helps us pay more attention to specific parts of our input data by 
    allowing multiple "heads" (or separate focus points) to look at different aspects of the same thing 
    simultaneously. This can be especially useful when dealing with complex inputs 
    like images or natural language text where there may be many different 
    features or patterns that need to be considered together.
    To illustrate this idea, imagine you have an image of a cat sitting on a windowsill. 
    If you were trying to identify what's happening in the picture using only one set of visual filters 
    (like color channels), you might miss some key details because they don't fit neatly into those pre-defined categories. 
    But if you had several sets of filters looking for different things - say, motion detection, 
    edge detection, and texture analysis - you'd be much better equipped to capture all the relevant information about the scene.
    Similarly, in multi-headed attention, instead of just having one big filter that looks at everything in the input sequence, 
    we create multiple smaller filters ("attention heads") that specialize in focusing on different types of information. 
    Each head has its own set of learnable parameters that allow it to weigh certain pieces of evidence more heavily 
    than others based on their relevance to the task at hand. By combining the outputs of all the heads, we end up with a richer, 
    more nuanced understanding of the input overall.""",
    
    """Multi-Head Attention is a key component of Transformer models that allows the model to focus on different positions or
    "heads" within the input sequence at each step during training. This enables the model to capture various aspects of the 
    data more effectively by attending to different parts concurrently. The concept behind multi-head attention can be understood 
    as follows:
    1. Key Points: In traditional self-attention mechanisms, the same set of weights are applied to all elements in the input 
    sequence for computing the dot product between them. However, with multi-head attention, we introduce multiple independent 
    linear projections (Q, K, V) into separate channels, allowing us to attend to distinct subsets of features present in the 
    input sequence. Each head learns to weigh these features differently based on their relevance to the 
    current position being attended to.
    2. Parallel Processing: By dividing the input sequence into several smaller chunks called query/key pairs, each represented 
    independently by one head, we enable parallel processing across multiple heads. During computation, each head attends to its own 
    queries and keys while also considering the values associated with those queries and keys. This parallelism significantly reduces 
    the overall time complexity required for computations compared to performing sequential operations on the entire input sequence.
    3. Weight Sharing: Although each head operates independently, they share common parameters such as weight matrices Wq, Wh, and Wo. 
    These shared parameters allow the model to learn global representations efficiently without explicitly sharing every single parameter
    value among"""
]


In [35]:
results = classifier(input_prompts, top_k=None)

# Each element of `results` holds one {"label", "score"} dict per class.
final_results = []
for idx, res_pair in enumerate(results):
    # Build a label -> score map so no value can carry over from a previous iteration.
    scores = {entry["label"]: round(entry["score"], 4) for entry in res_pair}
    consistent_score = scores["consistent"]
    halluc_score = scores["hallucinated"]

    final_results.append({
        "Answer ID": idx + 1,
        "Consistent Score": consistent_score,
        "Hallucination Score": halluc_score,
        "Prediction": "✅ Consistent" if consistent_score > halluc_score else "❌ Hallucinated",
        "Generated Answer": generated_answers[idx].strip()
    })

Token indices sequence length is longer than the specified maximum sequence length for this model (531 > 512). Running this sequence through the model will result in indexing errors
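
The warning above reports a prompt of 531 tokens against a 512-token limit. A minimal sketch for flagging such prompts before scoring is shown below. The helper `find_overlong` and the `count_tokens` callable are hypothetical names introduced for illustration; the whitespace counter is only a stand-in and will not match the classifier's real subword tokenizer.

```python
from typing import Callable, List, Tuple


def find_overlong(prompts: List[str],
                  count_tokens: Callable[[str], int],
                  max_len: int = 512) -> List[Tuple[int, int]]:
    """Return (index, token_count) for every prompt longer than max_len."""
    overlong = []
    for idx, text in enumerate(prompts):
        n_tokens = count_tokens(text)
        if n_tokens > max_len:
            overlong.append((idx, n_tokens))
    return overlong


def whitespace_count(text: str) -> int:
    # Stand-in counter (assumption): counts whitespace-separated words.
    # For real use, wrap the model's own tokenizer instead, e.g.
    # lambda t: len(tokenizer(t)["input_ids"]).
    return len(text.split())


demo_prompts = ["short prompt", "word " * 531]
print(find_overlong(demo_prompts, whitespace_count))  # [(1, 531)]
```

Prompts flagged this way could then be shortened or split before they are passed to the classifier.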


In [36]:
df_final = pd.DataFrame(final_results)
display(df_final)

Unnamed: 0,Answer ID,Consistent Score,Hallucination Score,Prediction,Generated Answer
0,1,0.0377,0.9623,❌ Hallucinated,The idea behind multi-headed attention is simp...
1,2,0.0156,0.9844,❌ Hallucinated,The multi-headed attention mechanism helps us ...
2,3,0.0237,0.9763,❌ Hallucinated,Multi-Head Attention is a key component of Tra...


In [38]:
generated_answers_single = [
    """The idea behind multi-headed attention is simple - instead of using one set of weights for all inputs 
    (as in standard self-attention), you can have multiple sets of weights, which will allow your network to 
    learn more complex representations by attending to various parts of the input simultaneously.""",
    
    """In our implementation, we first split the original query, key, value vectors into `n` smaller ones, where `n` is equal 
    to the number of attention heads.""",
    
    """We do this by applying linear transformations to them, 
    so that they become `dk`, `dk`,  and `dv` dimensional vectors, respectively.""",
    
    """The multi-headed attention mechanism helps us pay more attention to specific parts of our input data by 
    allowing multiple "heads" (or separate focus points) to look at different aspects of the same thing 
    simultaneously.""",
    
    """This can be especially useful when dealing with complex inputs like images or natural language text 
    where there may be many different features or patterns that need to be considered together.""",
    
    """To illustrate this idea, imagine you have an image of a cat sitting on a windowsill.""",
    
    """If you were trying to identify what's happening in the picture using only one set of visual filters 
    (like color channels), you might miss some key details because they don't fit neatly into those pre-defined categories.""",
    
    """Each head has its own set of learnable parameters that allow it to weigh certain pieces of evidence more heavily 
    than others based on their relevance to the task at hand.""",
    
    """Multi-Head Attention is a key component of Transformer models that allows the model to focus on different positions or
    "heads" within the input sequence at each step during training.""",
    
    """This enables the model to capture various aspects of the data more effectively by attending to different parts concurrently.""",
    
    """In traditional self-attention mechanisms, the same set of weights are applied to all elements in the input 
    sequence for computing the dot product between them.""",
    
    """However, with multi-head attention, we introduce multiple independent linear projections (Q, K, V) into separate channels, 
    allowing us to attend to distinct subsets of features present in the input sequence.""",
    
    """Each head learns to weigh these features differently based on their relevance to the current position being attended to.""",
    
    """By dividing the input sequence into several smaller chunks called query/key pairs, each represented independently by one head, 
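    we enable parallel processing across multiple heads."""
    # No comma after the string above: Python joins it with the next literal, so the list holds 15 entries.
    """Although each head operates independently, they share common parameters such as weight matrices Wq, Wh, and Wo.""",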
    
    """These shared parameters allow the model to learn global representations efficiently without explicitly sharing every single parameter
    value among"""
]
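
A sentence-level list like the one above can also be derived programmatically. The notebook already imports `sent_tokenize` from NLTK; the sketch below instead uses a simple regex splitter as a dependency-free stand-in. The helper name `split_sentences` is an assumption made for illustration, and the naive rule will mis-split abbreviations such as "e.g.".

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    # Collapse line breaks and indentation left over from triple-quoted strings.
    normalized = " ".join(text.split())
    parts = re.split(r"(?<=[.!?])\s+", normalized)
    return [part for part in parts if part]


answer = """Each head has its own set of learnable parameters that allow it to weigh certain pieces of evidence more heavily 
    than others based on their relevance to the task at hand. By combining the outputs of all the heads, we end up with a richer, 
    more nuanced understanding of the input overall."""

sentences = split_sentences(answer)
for sentence in sentences:
    print(sentence)
```

Each resulting sentence can then be scored on its own, which makes it easier to see which part of an answer drives a "hallucinated" label.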




In [39]:
input_prompts2 = [
    prompt_template.format(text1=original_text.strip(), text2=answer.strip()) 
    for answer in generated_answers_single
]

In [41]:
results2 = classifier(input_prompts2, top_k=None)

# Each element of `results2` holds one {"label", "score"} dict per class.
final_results2 = []
for idx, res_pair in enumerate(results2):
    # Build a label -> score map so no value can carry over from a previous iteration.
    scores = {entry["label"]: round(entry["score"], 4) for entry in res_pair}
    consistent_score = scores["consistent"]
    halluc_score = scores["hallucinated"]

    final_results2.append({
        "Answer ID": idx + 1,
        "Consistent Score": consistent_score,
        "Hallucination Score": halluc_score,
        "Prediction": "✅ Consistent" if consistent_score > halluc_score else "❌ Hallucinated",
        "Generated Answer": generated_answers_single[idx].strip()
    })


In [42]:
df_final2 = pd.DataFrame(final_results2)
display(df_final2)

Unnamed: 0,Answer ID,Consistent Score,Hallucination Score,Prediction,Generated Answer
0,1,0.0112,0.9888,❌ Hallucinated,The idea behind multi-headed attention is simp...
1,2,0.0542,0.9458,❌ Hallucinated,"In our implementation, we first split the orig..."
2,3,0.0849,0.9151,❌ Hallucinated,We do this by applying linear transformations ...
3,4,0.6472,0.3528,✅ Consistent,The multi-headed attention mechanism helps us ...
4,5,0.0624,0.9376,❌ Hallucinated,This can be especially useful when dealing wit...
5,6,0.0099,0.9901,❌ Hallucinated,"To illustrate this idea, imagine you have an i..."
6,7,0.0044,0.9956,❌ Hallucinated,If you were trying to identify what's happenin...
7,8,0.0052,0.9948,❌ Hallucinated,Each head has its own set of learnable paramet...
8,9,0.2404,0.7596,❌ Hallucinated,Multi-Head Attention is a key component of Tra...
9,10,0.8093,0.1907,✅ Consistent,This enables the model to capture various aspe...


<br>

# General Observation
Out of 15 generated answers, only 2 were labeled as consistent (answers 4 and 10), while the remaining 13 were flagged as hallucinated.<br>
Some answers had very high hallucination scores (e.g., >0.99), indicating that the classifier was quite certain about their inconsistency.<br>
A few answers had moderate consistency scores (e.g., answer 9 with 0.24), which might suggest borderline cases depending on the threshold.<br><br>
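
To make the threshold dependence concrete, the sketch below re-labels the ten rows shown in the table above at a few cut-offs. The notebook's own rule (consistent score greater than hallucination score) amounts to a 0.5 cut-off, since the two scores sum to 1 in the tables. The helper name `consistent_ids` and the chosen thresholds are assumptions made for illustration.

```python
# Consistent scores for answers 1-10, copied from the table above.
consistent_scores = [0.0112, 0.0542, 0.0849, 0.6472, 0.0624,
                     0.0099, 0.0044, 0.0052, 0.2404, 0.8093]


def consistent_ids(scores, threshold):
    """Return 1-based answer IDs whose consistent score meets the threshold."""
    return [i + 1 for i, score in enumerate(scores) if score >= threshold]


for threshold in (0.2, 0.5, 0.7):
    print(threshold, consistent_ids(consistent_scores, threshold))
# 0.2 [4, 9, 10]
# 0.5 [4, 10]
# 0.7 [10]
```

Lowering the cut-off to 0.2 would additionally accept answer 9, while raising it to 0.7 would leave only answer 10.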


# Interpretation
Vectara’s classifier is very strict. It likely flags any additional information, analogies, or simplified examples (even if factually correct) as hallucinated, since these elements aren’t directly verifiable from the original reference text.<br>
Answers 4 and 10, marked as consistent, seem to stay closer to the core content and likely rephrase or slightly expand it without introducing external analogies or examples.<br>
Answers like 1, 2, and 5 might only be expanding the original idea (e.g., mentioning linear transformations), but Vectara considers this expansion potentially risky if not grounded explicitly in the reference text.

# Conclusion
Vectara's classifier might be useful where strict, reference-grounded classification is appropriate and needed. Otherwise, it might mislead human reviewers by marking factually correct answers and sentences as hallucinated.

<hr style="height:10px;border-width:0;color:#CCD7E9;background-color:#CCD7E9">