In [1]:
import multiprocessing

import pandas as pd
import torch
from transformers import pipeline, set_seed

set_seed(42)  # fix the random seed so repeated runs behave the same

In [2]:
# Pick the best available device: CUDA GPU first, then Apple Silicon (MPS), then CPU.
if torch.cuda.is_available():
    DEVICE = torch.device("cuda")
    n_gpu = float(torch.cuda.device_count())
    device_name = torch.cuda.get_device_name(DEVICE)
elif torch.backends.mps.is_available():
    DEVICE = torch.device("mps")
    device_name = "Apple Silicon Device"
    n_gpu = 1.0
else:
    DEVICE = torch.device("cpu")
    device_name = "CPU"
    n_gpu = 0.0

# Report the hardware this notebook is running on.
n_cores = multiprocessing.cpu_count()
print(f"Number of GPUs: {n_gpu} / Number of CPU Cores: {n_cores}")
print(f"Training on {device_name} ({DEVICE})")

Number of GPUs: 1.0 / Number of CPU Cores: 24
Training on NVIDIA GeForce RTX 4090 (cuda)


In [3]:
context = """Scaled Dot-Product Attention
We call our particular attention "Scaled Dot-Product Attention". The input consists of
queries and keys of dimension d_k, and values of dimension d_v. We compute the dot products of the
query with all keys, divide each by square root(d_k), and apply a softmax function to obtain the weights on the
values. In practice, we compute the attention function on a set of queries simultaneously, packed together
into a matrix Q. The keys and values are also packed together into matrices K and V .
The two most commonly used attention functions are additive attention, and dot-product (multiplicative) attention. 
Dot-product attention is identical to our algorithm, except for the scaling factor of square root(1/d_k). 
Additive attention computes the compatibility function using a feed-forward network with
a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is
much faster and more space-efficient in practice, since it can be implemented using highly optimized
matrix multiplication code.
While for small values of d_k the two mechanisms perform similarly, additive attention outperforms
dot product attention without scaling for larger values of d_k. We suspect that for large values of
d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has
extremely small gradients. To counteract this effect, we scale the dot products by square root(1/d_k).

Multi-Head Attention
Instead of performing a single attention function with dmodel-dimensional keys, values and queries,
we found it beneficial to linearly project the queries, keys and values h times with different, learned
linear projections to dk, dk and dv dimensions, respectively. On each of these projected versions of
queries, keys and values we then perform the attention function in parallel, yielding dv-dimensional
output values. These are concatenated and once again projected, resulting in the final values.
Multi-head attention allows the model to jointly attend to information from different representation
subspaces at different positions. With a single attention head, averaging inhibits this.

In this work we employ h = 8 parallel attention layers, or heads. For each of these we use
d_k = d_v = d_model/h = 64. Due to the reduced dimension of each head, the total computational cost
is similar to that of single-head attention with full dimensionality.

Applications of Attention in our Model
The Transformer uses multi-head attention in three different ways:
- In "encoder-decoder attention" layers, the queries come from the previous decoder layer,
and the memory keys and values come from the output of the encoder. This allows every
position in the decoder to attend over all positions in the input sequence. This mimics the
typical encoder-decoder attention mechanisms in sequence-to-sequence models.
- The encoder contains self-attention layers. In a self-attention layer all of the keys, values
and queries come from the same place, in this case, the output of the previous layer in the
encoder. Each position in the encoder can attend to all positions in the previous layer of the
encoder.
- Similarly, self-attention layers in the decoder allow each position in the decoder to attend to
all positions in the decoder up to and including that position. We need to prevent leftward
information flow in the decoder to preserve the auto-regressive property. We implement this
inside of scaled dot-product attention by masking out (setting to minus infinity) all values in the input
of the softmax which correspond to illegal connections.

Position-wise Feed-Forward Networks
In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully
connected feed-forward network, which is applied to each position separately and identically. This
consists of two linear transformations with a ReLU activation in between.
While the linear transformations are the same across different positions, they use different parameters
from layer to layer. Another way of describing this is as two convolutions with kernel size 1.
The dimensionality of input and output is d_model = 512, and the inner-layer has dimensionality
d_f_f = 2048.

Embeddings and Softmax
Similarly to other sequence transduction models, we use learned embeddings to convert the input
tokens and output tokens to vectors of dimension dmodel. 
We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. 
In our model, we share the same weight matrix between the two embedding layers and the pre-softmax
linear transformation. In the embedding layers, we multiply those weights by square root(d_dmodel).
"""

In [4]:
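# Hugging Face model id -> display label used in the result tables.
models = {
    # Checkpoints fine-tuned for extractive QA on SQuAD / SQuAD 2.0.
    "distilbert-base-cased-distilled-squad": "DistilBERT-cased (SQuAD)",
    "bert-large-uncased-whole-word-masking-finetuned-squad": "BERT-large (SQuAD)",
    "deepset/roberta-large-squad2": "RoBERTa-large (SQuAD2)",
    "distilbert-base-uncased-distilled-squad": "DistilBERT-uncased (SQuAD)",
    # Base checkpoints: their QA head (qa_outputs) is newly initialized at load time.
    "bert-base-cased": "BERT-base-cased",
    "albert-base-v2": "ALBERT-base-v2",
    # Fine-tuned on SQuAD 2.0.
    "deepset/minilm-uncased-squad2": "MiniLM (SQuAD2)",
}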
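
Two of the checkpoints above, bert-base-cased and albert-base-v2, are base models, which is why the logs further down report newly initialized qa_outputs weights for them. As a rough sketch, the two groups can be separated by checking whether the display label mentions SQuAD. The label-based heuristic and the helper name split_by_finetuning are assumptions made for illustration; they are not part of transformers.

```python
# Model ids and labels copied from the cell above.
models = {
    "distilbert-base-cased-distilled-squad": "DistilBERT-cased (SQuAD)",
    "bert-large-uncased-whole-word-masking-finetuned-squad": "BERT-large (SQuAD)",
    "deepset/roberta-large-squad2": "RoBERTa-large (SQuAD2)",
    "distilbert-base-uncased-distilled-squad": "DistilBERT-uncased (SQuAD)",
    "bert-base-cased": "BERT-base-cased",
    "albert-base-v2": "ALBERT-base-v2",
    "deepset/minilm-uncased-squad2": "MiniLM (SQuAD2)",
}


def split_by_finetuning(models):
    """Hypothetical helper: split labels by whether they mention SQuAD."""
    finetuned = [label for label in models.values() if "SQuAD" in label]
    base = [label for label in models.values() if "SQuAD" not in label]
    return finetuned, base


finetuned, base = split_by_finetuning(models)
print("QA fine-tuned:", finetuned)
print("Base checkpoints:", base)
```

Answers from the base checkpoints are best read as a baseline for an untrained QA head rather than as real predictions.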

In [5]:
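def answer(question, models=models, context=context):
    """Run every model in `models` on `question` and collect the extracted spans."""
    results = []

    for model_name, model_label in models.items():
        try:
            # Build a fresh extractive-QA pipeline for this checkpoint.
            question_answerer = pipeline("question-answering", model=model_name)
            # The pipeline returns a dict with 'score', 'start', 'end' and 'answer'.
            output = question_answerer(question=question, context=context)

            results.append({"Model": model_label, **output})
            print(f"--- {model_label} ---")
            print(output)

        except Exception as e:
            # Report the failure and continue with the next model.
            print(f"Error processing {model_label}: {e}")
    return results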
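
answer() builds a new pipeline for each model on every call, so each question reloads all seven checkpoints. One possible way around this is to memoize the loader. The sketch below wraps an injected factory with functools.lru_cache; make_cached_loader and fake_factory are hypothetical names, and in this notebook the factory would presumably be something like lambda name: pipeline("question-answering", model=name).

```python
from functools import lru_cache


def make_cached_loader(factory, maxsize=None):
    """Hypothetical helper: wrap `factory` so each model name is built only once."""

    @lru_cache(maxsize=maxsize)
    def load(model_name):
        return factory(model_name)

    return load


# Stand-in factory that records how often a "model" is constructed.
calls = []


def fake_factory(model_name):
    calls.append(model_name)
    return f"pipeline<{model_name}>"


load = make_cached_loader(fake_factory)
first = load("distilbert-base-cased-distilled-squad")
second = load("distilbert-base-cased-distilled-squad")  # served from the cache
print(calls)
```

The trade-off is memory: caching keeps every loaded model alive, which may not fit on a single GPU for the larger checkpoints.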

In [8]:
question = "How many parallel attention layers are used?"
pd.DataFrame(answer(question))

Device set to use cuda:0
Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


--- DistilBERT-cased (SQuAD) ---
{'score': 0.944150984287262, 'start': 2203, 'end': 2208, 'answer': 'h = 8'}


Device set to use cuda:0


--- BERT-large (SQuAD) ---
{'score': 0.7786585092544556, 'start': 2203, 'end': 2208, 'answer': 'h = 8'}


Device set to use cuda:0


--- RoBERTa-large (SQuAD2) ---
{'score': 0.455313503742218, 'start': 2207, 'end': 2208, 'answer': '8'}


Device set to use cuda:0
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


--- DistilBERT-uncased (SQuAD) ---
{'score': 0.5905457139015198, 'start': 2207, 'end': 2208, 'answer': '8'}


Device set to use cuda:0
Some weights of AlbertForQuestionAnswering were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


--- BERT-base-cased ---
{'score': 3.535100040608086e-05, 'start': 3950, 'end': 3956, 'answer': 'linear'}


Device set to use cuda:0
Some weights of the model checkpoint at deepset/minilm-uncased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


--- ALBERT-base-v2 ---
{'score': 9.914715337799862e-05, 'start': 2570, 'end': 2610, 'answer': 'attention" layers, the queries come from'}


Device set to use cuda:0


--- MiniLM (SQuAD2) ---
{'score': 0.8747367262840271, 'start': 2203, 'end': 2208, 'answer': 'h = 8'}


   Model                       score     start  end   answer
0  DistilBERT-cased (SQuAD)    0.944151  2203   2208  h = 8
1  BERT-large (SQuAD)          0.778659  2203   2208  h = 8
2  RoBERTa-large (SQuAD2)      0.455314  2207   2208  8
3  DistilBERT-uncased (SQuAD)  0.590546  2207   2208  8
4  BERT-base-cased             3.5e-05   3950   3956  linear
5  ALBERT-base-v2              9.9e-05   2570   2610  attention" layers, the queries come from
6  MiniLM (SQuAD2)             0.874737  2203   2208  h = 8
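
The two base checkpoints stand out with scores around 1e-5 to 1e-4. A minimal sketch of filtering such rows, using the values printed above and an arbitrary cutoff; MIN_SCORE = 0.01 is an assumption chosen for illustration, not a recommended setting.

```python
# Rows copied from the result table above (start/end omitted for brevity).
rows = [
    {"Model": "DistilBERT-cased (SQuAD)", "score": 0.944151, "answer": "h = 8"},
    {"Model": "BERT-large (SQuAD)", "score": 0.778659, "answer": "h = 8"},
    {"Model": "RoBERTa-large (SQuAD2)", "score": 0.455314, "answer": "8"},
    {"Model": "DistilBERT-uncased (SQuAD)", "score": 0.590546, "answer": "8"},
    {"Model": "BERT-base-cased", "score": 3.5e-05, "answer": "linear"},
    {"Model": "ALBERT-base-v2", "score": 9.9e-05,
     "answer": 'attention" layers, the queries come from'},
    {"Model": "MiniLM (SQuAD2)", "score": 0.874737, "answer": "h = 8"},
]

MIN_SCORE = 0.01  # hypothetical cutoff

# Keep answers at or above the cutoff; list the models that fall below it.
kept = [row for row in rows if row["score"] >= MIN_SCORE]
dropped = [row["Model"] for row in rows if row["score"] < MIN_SCORE]

print("Kept:", [(row["Model"], row["answer"]) for row in kept])
print("Dropped:", dropped)
```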


In [9]:
question = "What are the two most commonly used attention functions?"
pd.DataFrame(answer(question))

Device set to use cuda:0


--- DistilBERT-cased (SQuAD) ---
{'score': 0.9419193863868713, 'start': 1543, 'end': 1586, 'answer': 'dmodel-dimensional keys, values and queries'}


Device set to use cuda:0


--- BERT-large (SQuAD) ---
{'score': 0.8653480410575867, 'start': 571, 'end': 633, 'answer': 'additive attention, and dot-product (multiplicative) attention'}


Device set to use cuda:0
Device set to use cuda:0


--- RoBERTa-large (SQuAD2) ---
{'score': 0.862724244594574, 'start': 571, 'end': 633, 'answer': 'additive attention, and dot-product (multiplicative) attention'}


Device set to use cuda:0


--- DistilBERT-uncased (SQuAD) ---
{'score': 0.8931621313095093, 'start': 571, 'end': 633, 'answer': 'additive attention, and dot-product (multiplicative) attention'}


Device set to use cuda:0


--- BERT-base-cased ---
{'score': 4.400855686981231e-05, 'start': 2970, 'end': 3001, 'answer': 'of the keys, values\nand queries'}


Device set to use cuda:0


--- ALBERT-base-v2 ---
{'score': 0.00017285825742874295, 'start': 3977, 'end': 4054, 'answer': 'the same across different positions, they use different parameters\nfrom layer'}


--- MiniLM (SQuAD2) ---
{'score': 0.8961867690086365, 'start': 571, 'end': 633, 'answer': 'additive attention, and dot-product (multiplicative) attention'}


   Model                       score     start  end   answer
0  DistilBERT-cased (SQuAD)    0.941919  1543   1586  dmodel-dimensional keys, values and queries
1  BERT-large (SQuAD)          0.865348  571    633   additive attention, and dot-product (multiplic...
2  RoBERTa-large (SQuAD2)      0.862724  571    633   additive attention, and dot-product (multiplic...
3  DistilBERT-uncased (SQuAD)  0.893162  571    633   additive attention, and dot-product (multiplic...
4  BERT-base-cased             4.4e-05   2970   3001  of the keys, values\nand queries
5  ALBERT-base-v2              0.000173  3977   4054  the same across different positions, they use ...
6  MiniLM (SQuAD2)             0.896187  571    633   additive attention, and dot-product (multiplic...
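
DistilBERT-cased reports the highest score here (0.94) yet selects a different span from the four models that agree on offsets 571 to 633, so the score alone does not seem to be a reliable way to rank answers across models. A small sketch that counts votes per (start, end) span instead, using the offsets printed above; the variable names are illustrative only.

```python
from collections import Counter

# (start, end) offsets copied from the result table above.
spans = {
    "DistilBERT-cased (SQuAD)": (1543, 1586),
    "BERT-large (SQuAD)": (571, 633),
    "RoBERTa-large (SQuAD2)": (571, 633),
    "DistilBERT-uncased (SQuAD)": (571, 633),
    "BERT-base-cased": (2970, 3001),
    "ALBERT-base-v2": (3977, 4054),
    "MiniLM (SQuAD2)": (571, 633),
}

# Count how many models chose each exact span and pick the most common one.
votes = Counter(spans.values())
top_span, n_votes = votes.most_common(1)[0]
supporters = [model for model, span in spans.items() if span == top_span]

print(f"Most common span: {top_span} with {n_votes} votes")
print("Supported by:", supporters)
```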


In [10]:
question = "What are the dimensions used for d_k, d_v, and d_model in each head?"
pd.DataFrame(answer(question))

Device set to use cuda:0


--- DistilBERT-cased (SQuAD) ---
{'score': 0.849978506565094, 'start': 4191, 'end': 4194, 'answer': '512'}


Device set to use cuda:0


--- BERT-large (SQuAD) ---
{'score': 0.40562957525253296, 'start': 2271, 'end': 2297, 'answer': 'd_k = d_v = d_model/h = 64'}


Device set to use cuda:0
Device set to use cuda:0


--- RoBERTa-large (SQuAD2) ---
{'score': 0.042103640735149384, 'start': 2295, 'end': 2297, 'answer': '64'}


Device set to use cuda:0


--- DistilBERT-uncased (SQuAD) ---
{'score': 0.839479923248291, 'start': 4191, 'end': 4194, 'answer': '512'}


Device set to use cuda:0


--- BERT-base-cased ---
{'score': 0.00016906460223253816, 'start': 4661, 'end': 4668, 'answer': 'softmax'}


Device set to use cuda:0


--- ALBERT-base-v2 ---
{'score': 0.0006627399125136435, 'start': 4669, 'end': 4698, 'answer': 'linear transformation. In the'}


--- MiniLM (SQuAD2) ---
{'score': 0.6279764771461487, 'start': 2291, 'end': 2297, 'answer': 'h = 64'}


   Model                       score     start  end   answer
0  DistilBERT-cased (SQuAD)    0.849979  4191   4194  512
1  BERT-large (SQuAD)          0.40563   2271   2297  d_k = d_v = d_model/h = 64
2  RoBERTa-large (SQuAD2)      0.042104  2295   2297  64
3  DistilBERT-uncased (SQuAD)  0.83948   4191   4194  512
4  BERT-base-cased             0.000169  4661   4668  softmax
5  ALBERT-base-v2              0.000663  4669   4698  linear transformation. In the
6  MiniLM (SQuAD2)             0.627976  2291   2297  h = 64
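
BERT-large, RoBERTa-large and MiniLM return different strings for this question, but their spans all end at offset 2297 and nest inside one another. The sketch below groups answers whose character ranges overlap, using the offsets printed above; group_overlapping is a hypothetical helper, and treating ranges as half-open [start, end) is an assumption of this sketch.

```python
def group_overlapping(spans):
    """Group (label, start, end) triples whose [start, end) ranges overlap."""
    groups = []
    current, current_end = [], None
    # Sweep over the spans in order of start offset.
    for label, start, end in sorted(spans, key=lambda s: (s[1], s[2])):
        if current and start < current_end:
            # Overlaps the running group: extend it.
            current.append(label)
            current_end = max(current_end, end)
        else:
            # No overlap: close the running group and start a new one.
            if current:
                groups.append(current)
            current, current_end = [label], end
    if current:
        groups.append(current)
    return groups


# (label, start, end) copied from the result table above.
spans = [
    ("DistilBERT-cased (SQuAD)", 4191, 4194),
    ("BERT-large (SQuAD)", 2271, 2297),
    ("RoBERTa-large (SQuAD2)", 2295, 2297),
    ("DistilBERT-uncased (SQuAD)", 4191, 4194),
    ("BERT-base-cased", 4661, 4668),
    ("ALBERT-base-v2", 4669, 4698),
    ("MiniLM (SQuAD2)", 2291, 2297),
]

groups = group_overlapping(spans)
for group in groups:
    print(group)
```

Grouping by overlap rather than exact match treats "64", "h = 64" and "d_k = d_v = d_model/h = 64" as one candidate answer.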


In [12]:
question = "What is the purpose of self-attention layers in the decoder?"
pd.DataFrame(answer(question))

Device set to use cuda:0


--- DistilBERT-cased (SQuAD) ---
{'score': 0.28409114480018616, 'start': 2712, 'end': 2733, 'answer': 'allows every\nposition'}


Device set to use cuda:0


--- BERT-large (SQuAD) ---
{'score': 0.607125997543335, 'start': 3408, 'end': 3448, 'answer': 'to preserve the auto-regressive property'}


Device set to use cuda:0
Device set to use cuda:0


--- RoBERTa-large (SQuAD2) ---
{'score': 0.0008226944482885301, 'start': 3284, 'end': 3346, 'answer': 'all positions in the decoder up to and including that position'}


Device set to use cuda:0


--- DistilBERT-uncased (SQuAD) ---
{'score': 0.5464997887611389, 'start': 3411, 'end': 3448, 'answer': 'preserve the auto-regressive property'}


--- BERT-base-cased ---
{'score': 3.380649286555126e-05, 'start': 4743, 'end': 4745, 'answer': 'by'}


Device set to use cuda:0


--- ALBERT-base-v2 ---
{'score': 0.0002867771254386753, 'start': 1512, 'end': 1528, 'answer': 'single attention'}


Device set to use cuda:0


--- MiniLM (SQuAD2) ---
{'score': 0.0164836123585701, 'start': 3236, 'end': 3297, 'answer': 'allow each position in the decoder to attend to\nall positions'}


   Model                       score     start  end   answer
0  DistilBERT-cased (SQuAD)    0.284091  2712   2733  allows every\nposition
1  BERT-large (SQuAD)          0.607126  3408   3448  to preserve the auto-regressive property
2  RoBERTa-large (SQuAD2)      0.000823  3284   3346  all positions in the decoder up to and includi...
3  DistilBERT-uncased (SQuAD)  0.5465    3411   3448  preserve the auto-regressive property
4  BERT-base-cased             3.4e-05   4743   4745  by
5  ALBERT-base-v2              0.000287  1512   1528  single attention
6  MiniLM (SQuAD2)             0.016484  3236   3297  allow each position in the decoder to attend t...
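
start and end are character offsets into context, so end - start should equal the length of the returned answer. A quick sanity check over the untruncated answers from the printed dictionaries above; the tuple layout and variable names are illustrative. With the full context string available, comparing context[start:end] against the answer would be the stronger check.

```python
# (label, start, end, answer) copied from the printed outputs above.
rows = [
    ("DistilBERT-cased (SQuAD)", 2712, 2733, "allows every\nposition"),
    ("BERT-large (SQuAD)", 3408, 3448, "to preserve the auto-regressive property"),
    ("RoBERTa-large (SQuAD2)", 3284, 3346,
     "all positions in the decoder up to and including that position"),
    ("DistilBERT-uncased (SQuAD)", 3411, 3448, "preserve the auto-regressive property"),
    ("BERT-base-cased", 4743, 4745, "by"),
    ("ALBERT-base-v2", 1512, 1528, "single attention"),
    ("MiniLM (SQuAD2)", 3236, 3297,
     "allow each position in the decoder to attend to\nall positions"),
]

# Collect any model whose span length disagrees with its answer length.
mismatches = [
    label for label, start, end, text in rows if end - start != len(text)
]
print("Mismatched rows:", mismatches)
```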
