We use Integrated Gradients from Captum to compute word attributions on the FCI (Force Concept Inventory) for a BERT model fine-tuned to answer multiple-choice questions. Specifically, we want to see whether the emphasised words indicate that the model shares common human misconceptions. In the example below, even if the model gives a common wrong answer (e.g. that the pushing car exerts a larger force, or that the heavier object exerts a larger force), we can perhaps use attribution to see precisely which words throw the model off, and then modify the prompt to pin down exactly what the model's misconception is. Alas, I cannot compute the gradients for large models with long inputs on my laptop, and our MpC models are not trained on long physics questions, so the models used here are not good enough to give particularly interesting results. The code should be easy to adapt to other models, and at the end we show, as an example, how to modify it for RoBERTa.
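For reference, Integrated Gradients attributes $f(x) - f(x')$ to the input features by integrating the gradient along the straight line from a baseline $x'$ to the input $x$: $\mathrm{IG}_i(x) = (x_i - x'_i)\int_0^1 \partial_i f(x' + \alpha(x - x'))\,d\alpha$. The attributions should sum to $f(x) - f(x')$ (completeness), and the gap is what Captum reports as the convergence delta. A minimal from-scratch sketch on a toy function, assuming a midpoint Riemann sum (Captum's default integration rule is Gauss-Legendre, so this only illustrates the idea); `integrated_gradients`, `w` and `f` are hypothetical names:

```python
import torch

def integrated_gradients(f, x, baseline, n_steps=50):
    """Integrated Gradients for a scalar-valued f, approximated with a midpoint Riemann sum."""
    # Midpoints of n_steps equal sub-intervals of [0, 1]
    alphas = (torch.arange(n_steps, dtype=x.dtype) + 0.5) / n_steps
    # One interpolated input per row: baseline + alpha * (x - baseline), shape [n_steps, d]
    path = (baseline + alphas[:, None] * (x - baseline)).requires_grad_(True)
    outputs = f(path)                                    # shape [n_steps]
    (grads,) = torch.autograd.grad(outputs.sum(), path)  # row k holds the gradient at point k
    # Average gradient along the path, scaled by the distance from the baseline
    return (x - baseline) * grads.mean(dim=0)

# Toy "model" (hypothetical): a nonlinear scalar function of a 4-dim input
w = torch.tensor([1.0, -2.0, 0.5, 3.0])

def f(z):
    return torch.tanh(z @ w)  # [..., 4] -> [...]

x = torch.tensor([0.3, 0.1, -0.4, 0.2])
baseline = torch.zeros_like(x)

attr = integrated_gradients(f, x, baseline)
delta = attr.sum() - (f(x) - f(baseline))  # completeness check: close to 0
print(attr)
print(delta.item())
```

Features whose contribution $x_i w_i$ is negative get negative attributions, which is the sign convention the word-importance colours below follow.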

Example question: *A large truck breaks down out on the road and receives a push back into town by a small compact car. After the car reaches the constant cruising speed at which its driver wishes to push the truck:*

0) *The amount of force with which the car pushes on the truck is equal to that with which the truck pushes back on the car.*
1) *The amount of force with which the car pushes on the truck is smaller than that with which the truck pushes back on the car.*
2) *The amount of force with which the car pushes on the truck is greater than that with which the truck pushes back on the car.*
3) *The car's engine is running so the car pushes against the truck, but the truck's engine is not running so the truck cannot push back against the car. The truck is pushed forward simply because it is in the way of the car.*
4) *Neither the car nor the truck exert any force on the other. The truck is pushed forward simply because it is in the way of the car.*




In [None]:
# Ideally, we would make a reference file with all FCI questions and choices which one can simply import, but for now:

# Question and choices (defined first so they are easy to change)
question = "Two metal balls are the same size but one weighs twice as much as the other. The balls \
are dropped from the roof of a single story building at the same instant of time. The time it takes \
the balls to reach the ground below will be:"

choices = ["About half as long for the heavier ball as for the lighter one",
           "About half as long for the lighter ball as for the heavier one",
           "About the same for both balls",
           "Considerably less for the heavier ball, but not necessarily half as long",
           "Considerably less for the lighter ball, but not necessarily half as long"]

ground_truth_idx = 2 # Correct answer index, just for reference in the final plot

# # Question and choices
# question = "A large truck breaks down out on the road and receives a push back into town by a small compact car. \
# After the car reaches the constant cruising speed at which its driver wishes to push the truck:"

# choices = [
# "The amount of force with which the car pushes on the truck is equal to that with which the truck pushes back on the car.",
# "The amount of force with which the car pushes on the truck is smaller than that with which the truck pushes back on the car.",
# "The amount of force with which the car pushes on the truck is greater than that with which the truck pushes back on the car.",
# "The car's engine is running so the car pushes against the truck, but the truck's engine is not running so the truck cannot \
# push back against the car. The truck is pushed forward simply because it is in the way of the car.",
# "Neither the car nor the truck exert any force on the other. The truck is pushed forward simply because it is in the way of the car."
# ]

# ground_truth_idx = 0
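The comment above asks for a reference file of FCI questions that can simply be imported. A minimal sketch of what such a module could contain, assuming a plain frozen dataclass is enough; `FCIItem`, `FCI_ITEMS` and the dictionary keys are hypothetical names, and only the two questions already used in this notebook are filled in (the remaining items would have to be copied from the inventory itself):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FCIItem:
    """Hypothetical record for one FCI question; fields mirror the notebook's variable names."""
    question: str
    choices: tuple[str, ...]
    ground_truth_idx: int

# Hypothetical keys; texts are the two questions used in this notebook
FCI_ITEMS = {
    "falling_balls": FCIItem(
        question=("Two metal balls are the same size but one weighs twice as much as the other. "
                  "The balls are dropped from the roof of a single story building at the same instant of time. "
                  "The time it takes the balls to reach the ground below will be:"),
        choices=(
            "About half as long for the heavier ball as for the lighter one",
            "About half as long for the lighter ball as for the heavier one",
            "About the same for both balls",
            "Considerably less for the heavier ball, but not necessarily half as long",
            "Considerably less for the lighter ball, but not necessarily half as long",
        ),
        ground_truth_idx=2,
    ),
    "car_pushes_truck": FCIItem(
        question=("A large truck breaks down out on the road and receives a push back into town by a small compact car. "
                  "After the car reaches the constant cruising speed at which its driver wishes to push the truck:"),
        choices=(
            "The amount of force with which the car pushes on the truck is equal to that with which the truck pushes back on the car.",
            "The amount of force with which the car pushes on the truck is smaller than that with which the truck pushes back on the car.",
            "The amount of force with which the car pushes on the truck is greater than that with which the truck pushes back on the car.",
            ("The car's engine is running so the car pushes against the truck, but the truck's engine is not running "
             "so the truck cannot push back against the car. The truck is pushed forward simply because it is in the way of the car."),
            "Neither the car nor the truck exert any force on the other. The truck is pushed forward simply because it is in the way of the car.",
        ),
        ground_truth_idx=0,
    ),
}

# Usage: pick one item and unpack it into the names the rest of the notebook expects
item = FCI_ITEMS["falling_balls"]
question, choices, ground_truth_idx = item.question, list(item.choices), item.ground_truth_idx
```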

In [48]:
# Imports
import torch
from transformers import BertTokenizer, BertForMultipleChoice
from captum.attr import LayerIntegratedGradients
from captum.attr import visualization as viz
from dataclasses import dataclass

# Settings
torch.manual_seed(42)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Config (currently only holds the debug flag)
@dataclass
class Config:
    debug: bool = True
cfg = Config()

In [49]:
# Load model and tokenizer
model_name = 'jonastokoliu/multi_choice_bert-base-uncased_swag_finetune' # Pretrained (bad) MpC model
model = BertForMultipleChoice.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)

# Set to evaluation mode
model.to(device)
model.eval()
model.zero_grad()
# model # Uncomment to print architecture

In [50]:
def forward_func(input_ids, attention_mask, token_type_ids=None):
    """Custom forward pass for Captum Integrated Gradients. Captum passes inputs shaped [batch_size, seq_len], while
    multiple-choice models expect [batch_size, num_choices, seq_len], so we add a leading batch dimension and return
    the logits for each choice. Some models (e.g. RoBERTa) do not use token_type_ids, hence the optional parameter."""

    input_ids = input_ids.unsqueeze(0)
    attention_mask = attention_mask.unsqueeze(0)
    if token_type_ids is not None: token_type_ids = token_type_ids.unsqueeze(0)

    logits = model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids
    ).logits

    if cfg.debug: print("Logits from forward pass:", logits.shape)
    return logits
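The shape bookkeeping in the docstring can be checked with plain tensors. During attribution, Captum stacks the interpolation steps along dimension 0, so with `n_steps=50` and three choices `forward_func` receives 150 rows, and `unsqueeze(0)` presents them to the model as one question with 150 choices. That matches the `Logits from forward pass: torch.Size([1, 150])` line printed in the RoBERTa run. If this reading of Captum's batching is right, `target=choice_idx` then selects a single one of those 150 logits rather than the chosen answer's logit at every step, and a softmax over dim 1 would normalise across all 150 rows. A minimal sketch of a variant that reshapes back to `[-1, num_choices, seq_len]`, assuming step-major stacking as shown; `make_forward_func` is a hypothetical name:

```python
import torch

num_choices, seq_len, n_steps = 3, 13, 50

# Assumption: Captum concatenates the n_steps interpolated copies along dim 0 (step-major)
expanded = torch.cat([torch.zeros(num_choices, seq_len, dtype=torch.long)] * n_steps, dim=0)

one_question = expanded.unsqueeze(0)                       # what the current forward_func builds
per_step = expanded.reshape(-1, num_choices, seq_len)      # one question (of 3 choices) per step
print(one_question.shape, per_step.shape)

def make_forward_func(model, num_choices):
    """Hypothetical variant of forward_func that restores [batch, num_choices, seq_len]."""
    def forward_func(input_ids, attention_mask, token_type_ids=None):
        def regroup(t):
            # [k * num_choices, seq_len] -> [k, num_choices, seq_len]; None stays None (e.g. RoBERTa)
            return None if t is None else t.reshape(-1, num_choices, t.shape[-1])
        return model(
            input_ids=regroup(input_ids),
            attention_mask=regroup(attention_mask),
            token_type_ids=regroup(token_type_ids),
        ).logits                                            # shape [k, num_choices]
    return forward_func
```

With `forward_func = make_forward_func(model, len(choices))`, the prediction call in `get_encoding_and_predict` still yields logits of shape `[1, num_choices]`, while the attribution pass yields `[n_steps, num_choices]`, so `target=choice_idx` picks the chosen answer's logit at each step.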

In [None]:
# Get input ids, attn masks, token type ids and baseline for Captum attribution methods.
def get_encoding_and_predict(question, choices, tokenizer, token_types=True, print_output=True):
    """Tokenizes the question and choices and makes prediction, returning input_ids, attention_masks, and token_type_ids, 
    logits and choice index. token_type_ids is None if token_types is False. The input ids are shaped as [num_choices, seq_len]."""
    
    # Tokenize for multiple choice and get input ids, attention masks, and token type ids
    encoding = tokenizer(
        [question] * len(choices),
        choices,
        return_tensors="pt",
        padding=True,
        truncation=True
    )

    input_ids = encoding["input_ids"]               # shape: [choices, seq_len]
    attention_masks = encoding["attention_mask"]    # -"-
    if token_types:
        token_type_ids = encoding["token_type_ids"] # -"-
    else:
        token_type_ids = None


    # Compute model prediction and get choice index
    logits = forward_func(input_ids, attention_mask=attention_masks, token_type_ids=token_type_ids)
    choice_idx = torch.argmax(logits).item()

    if print_output:
        print('Question:', question)
        print('Predicted Answer:', f'{choice_idx})', choices[choice_idx])
    if cfg.debug: print('Logits:', logits) # To gauge model confidence

    return input_ids, attention_masks, token_type_ids, logits, choice_idx


# Baseline input for Integrated Gradients
def get_baseline(tokenizer, input_ids, choice_idx):
    ref_token_id = tokenizer.pad_token_id   # padding
    sep_token_id = tokenizer.sep_token_id   # separator
    cls_token_id = tokenizer.cls_token_id   # start of sequence

    # Baseline for the chosen row: [CLS], then [PAD] everywhere except at [SEP] positions; expanded to input_ids' shape
    ref_tokens = [cls_token_id]
    for token in input_ids[choice_idx, 1:]:
        if token == sep_token_id: # We keep only the separators, else we use padding
            ref_tokens += [sep_token_id]
        else:
            ref_tokens += [ref_token_id]
        
    return torch.tensor(ref_tokens, dtype=torch.long).expand_as(input_ids)


# Compute
input_ids, attention_masks, token_type_ids, logits, choice_idx = get_encoding_and_predict(question, choices, tokenizer)
ref_input_ids = get_baseline(tokenizer, input_ids, choice_idx)

Logits from forward pass: torch.Size([1, 5])
Question: Two metal balls are the same size but one weighs twice as much as the other. The balls are dropped from the roof of a single story building at the same instant of time. The time it takes the balls to reach the ground below will be:
Predicted Answer: 1) About half as long for the lighter ball as for the heavier one
tensor([[8.5428, 8.5783, 3.6178, 7.4995, 7.3461]], grad_fn=<ViewBackward0>)
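`get_baseline` builds one baseline row from the chosen answer and expands it to every choice. Because `padding=True` pads all question/choice pairs to the longest one, the second `[SEP]` sits at a different position in each row, so for the other choices the baseline can place `[SEP]` where the input has a word token, and vice versa. A minimal per-row sketch, assuming the same "keep `[CLS]` and `[SEP]`, pad everything else" rule; `get_baseline_per_row` is a hypothetical name, the special-token IDs are those of `bert-base-uncased` (`[CLS]`=101, `[SEP]`=102, `[PAD]`=0) and the word IDs are arbitrary placeholders:

```python
import torch

def get_baseline_per_row(input_ids, cls_id, sep_id, pad_id):
    """Hypothetical per-row variant of get_baseline for input_ids of shape [num_choices, seq_len]."""
    baseline = torch.full_like(input_ids, pad_id)   # start from all-padding
    baseline[input_ids == sep_id] = sep_id          # keep each row's own separators
    baseline[:, 0] = cls_id                         # keep the start-of-sequence token
    return baseline

# Two choices of different length, padded to the same seq_len (word IDs are placeholders)
input_ids = torch.tensor([
    [101, 2054, 2003, 102, 3000, 102,  0],
    [101, 2054, 2003, 102, 3000, 3001, 102],
])
baseline = get_baseline_per_row(input_ids, cls_id=101, sep_id=102, pad_id=0)
print(baseline)
```

In the notebook this would be called as `get_baseline_per_row(input_ids, tokenizer.cls_token_id, tokenizer.sep_token_id, tokenizer.pad_token_id)`, which also covers RoBERTa's `<s>`, `</s>` and `<pad>`.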


In [None]:
# Main computation block; this can take several minutes depending on the model, input length and number of steps.
# If cfg.debug is True, forward_func prints a line for every forward pass, which serves as a rough progress indicator.

def get_token_attributions_for_layer(layer, forward_func, choice_idx, input_ids, ref_input_ids, attention_mask, token_type_ids=None, n_steps=50):

    if cfg.debug: print(input_ids.shape)
    
    # LayerIntegratedGradients for attribution
    lig = LayerIntegratedGradients(forward_func, layer)

    # Compute attributions for chosen index (Captum wants [choices, seq_len] and gives attr shape [choices, seq_len, layer_output_dim])
    attributions, delta = lig.attribute(
        inputs=input_ids,
        baselines=ref_input_ids,
        additional_forward_args=(attention_mask, token_type_ids),
        target=choice_idx,  # Target the chosen answer, uses [0,target]
        n_steps=n_steps,  # Number of steps for approximation
        return_convergence_delta=True
    )

    # Sum across embedding dimensions to get token-level importance
    token_attributions = attributions.sum(dim=-1).squeeze(0)  # shape: [num_choices, seq_len]
    token_attributions = token_attributions / torch.norm(token_attributions)  # Normalize

    if cfg.debug: 
        print('Token attributions:', token_attributions.shape)
        print('Attributions per token at choice_idx:', token_attributions[choice_idx])
    
    return token_attributions, delta


# Compute for embedding layer
layer = model.bert.embeddings
token_attributions, delta = get_token_attributions_for_layer(
    layer, forward_func, choice_idx, input_ids, ref_input_ids, attention_masks, token_type_ids
)

# Get the attributions for the chosen answer and convert input ids to readable tokens
choice_attributions = token_attributions[choice_idx]
tokens = tokenizer.convert_ids_to_tokens(input_ids[choice_idx])
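Captum attributes to tokens, whereas the goal is word attributions, and BERT's WordPiece tokenizer marks continuation pieces with a leading `##`. A minimal sketch that folds such pieces into the preceding word by summing their attributions (summing keeps the total attribution unchanged, which averaging would not). `merge_wordpieces` is a hypothetical helper, the tokens and values below are made up for illustration, and RoBERTa's byte-level BPE marks the *start* of a word with `Ġ` instead, so it would need its own rule:

```python
def merge_wordpieces(tokens, attributions):
    """Sum attributions of WordPiece continuation pieces ("##...") into the preceding word."""
    words, scores = [], []
    for token, score in zip(tokens, attributions):
        if token.startswith("##") and words:
            words[-1] += token[2:]          # glue the piece onto the previous word
            scores[-1] += float(score)      # float() also accepts 0-d tensors
        else:
            words.append(token)
            scores.append(float(score))
    return words, scores

# Illustrative values only (not model output)
tokens = ["[CLS]", "the", "truck", "break", "##s", "down", "[SEP]"]
attributions = [0.0, 0.1, 0.2, 0.3, 0.4, -0.1, 0.0]
words, scores = merge_wordpieces(tokens, attributions)
print(list(zip(words, scores)))
```

In the notebook, `merge_wordpieces(tokens, choice_attributions)` would give word-level scores for the chosen answer.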

In [53]:
# Visualise
vis = viz.VisualizationDataRecord(
                        choice_attributions,                        # word attributions
                        torch.max(torch.softmax(logits, dim=1)),    # prediction probability
                        torch.argmax(logits),                       # predicted class
                        ground_truth_idx,                           # ground truth class
                        str(choice_idx),                            # attributing to this class
                        token_attributions.sum(),                   # summed attribution score
                        tokens,                                     # tokens for the question and choice
                        delta,                                      # convergence delta
)

visualisation = viz.visualize_text([vis]) # Assign the returned HTML object so Jupyter does not display it a second time

| True Label | Predicted Label | Attribution Label | Attribution Score | Word Importance |
|---|---|---|---|---|
| 2.0 | 1 (0.38) | 1.0 | -2.87 | [CLS] two metal balls are the same size but one weighs twice as much as the other . the balls are dropped from the roof of a single story building at the same instant of time . the time it takes the balls to reach the ground below will be : [SEP] about half as long for the lighter ball as for the heavier one [SEP] |


Note the small attribution score (the model is not good enough to answer, so the words hardly matter), though the words it focuses on make sense. Also, the attribution takes ages without a GPU to compute the gradients.

Let's do the same with RoBERTa. The only real modification is that it does not use `token_type_ids`:

In [None]:
# Question and choices
question = "What is the capital of France?"
choices = ["Berlin", "Madrid", "Paris"]
ground_truth_idx = 2

from transformers import RobertaTokenizer, RobertaForMultipleChoice

# Load model and tokenizer
model_name = "LIAMF-USP/roberta-large-finetuned-race"
model = RobertaForMultipleChoice.from_pretrained(model_name)
tokenizer = RobertaTokenizer.from_pretrained(model_name)

# Set to evaluation mode
model.to(device)
model.eval()
model.zero_grad()
# model # Uncomment to print architecture
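Switching models only changes two settings here: the `token_types` flag and the attribute holding the embeddings (`model.bert` vs `model.roberta`). A minimal sketch that derives both from the loaded objects, assuming the `transformers` attributes `tokenizer.model_input_names` (which lists `token_type_ids` for BERT but not for RoBERTa) and `model.base_model` (which resolves to the encoder, e.g. `model.bert` or `model.roberta`); `model_specific_settings` is a hypothetical name:

```python
def model_specific_settings(model, tokenizer):
    """Hypothetical helper returning (token_types, layer) for a multiple-choice model.

    token_types: whether the tokenizer produces token_type_ids (BERT does, RoBERTa does not)
    layer: the embeddings module of the underlying encoder, found via model.base_model
    """
    token_types = "token_type_ids" in tokenizer.model_input_names
    layer = model.base_model.embeddings
    return token_types, layer


# Usage in this notebook (commented out so the cell stays self-contained):
# token_types, layer = model_specific_settings(model, tokenizer)
# do_everything(forward_func, layer, token_types=token_types)
```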

In [None]:
def do_everything(forward_func, layer, question=question, choices=choices, ground_truth_idx=ground_truth_idx, model=model, tokenizer=tokenizer, token_types=False):
    """A quick function to do everything: tokenize, make prediction, compute attributions and visualize."""
    # Tokenize for multiple choice and get input ids, attention masks, WITHOUT token type ids (token_types=False), 
    # and make prediction to get logits and choice index
    input_ids, attention_masks, token_type_ids, logits, choice_idx = get_encoding_and_predict(question, choices, tokenizer, token_types=token_types)
    ref_input_ids = get_baseline(tokenizer, input_ids, choice_idx)

    # Compute attributions for the embedding layer
    token_attributions, delta = get_token_attributions_for_layer(
        layer, forward_func, choice_idx, input_ids, ref_input_ids, attention_masks, token_type_ids
    )

    # Get the attributions for the chosen answer and convert input ids to readable tokens
    choice_attributions = token_attributions[choice_idx]
    tokens = tokenizer.convert_ids_to_tokens(input_ids[choice_idx])

    # Visualize

    vis = viz.VisualizationDataRecord(
                            choice_attributions,                        # word attributions
                            torch.max(torch.softmax(logits, dim=1)),    # prediction probability
                            torch.argmax(logits),                       # predicted class
                            ground_truth_idx,                           # ground truth class
                            str(choice_idx),                            # attributing to this class
                            token_attributions.sum(),                   # summed attribution score
                            tokens,                                     # tokens for the question and choice
                            delta,                                      # convergence delta
    )

    viz.visualize_text([vis])


# Compute attributions w.r.t. RoBERTa's embedding layer
layer = model.roberta.embeddings
do_everything(forward_func, layer)

Logits from forward pass: torch.Size([1, 3])
Question: What is the capital of France?
Predicted Answer: 2) Paris
tensor([[-0.4239, -2.1472,  4.5936]], grad_fn=<ViewBackward0>)
torch.Size([3, 13])
Logits from forward pass: torch.Size([1, 3])
Logits from forward pass: torch.Size([1, 3])
Logits from forward pass: torch.Size([1, 150])
Logits from forward pass: torch.Size([1, 3])
Logits from forward pass: torch.Size([1, 3])
Token attributions: torch.Size([3, 13])
Attributions per token at choice_idx: tensor([ 0.0000, -0.0511,  0.1184,  0.0214,  0.5090, -0.0007,  0.2750,  0.6631,
        -0.3209, -0.2486, -0.0994, -0.1841,  0.0000])


| True Label | Predicted Label | Attribution Label | Attribution Score | Word Importance |
|---|---|---|---|---|
| 2.0 | 2 (0.99) | 2.0 | 0.68 | #s What Ġis Ġthe Ġcapital Ġof ĠFrance ? #/s #/s Paris #/s #pad |


As expected, 'capital' and 'France' (and also the question mark...) are most important for the model's choice. Note that the model is very confident here, as opposed to the difficult FCI questions BERT faced. The RoBERTa model is better, but slightly too large to run comfortably with long inputs on my laptop.
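Reading the ranking off the colours gets harder for long FCI questions, so it can help to list the top tokens directly. A minimal sketch using the attributions printed above for "Paris" (the `#s`, `#/s` and `#pad` in the table stand for `<s>`, `</s>` and `<pad>`); `top_tokens` is a hypothetical helper that skips special tokens and uses `torch.topk`:

```python
import torch

# Tokens and attributions copied from the RoBERTa run above (choice 2, "Paris")
tokens = ["<s>", "What", "Ġis", "Ġthe", "Ġcapital", "Ġof", "ĠFrance", "?",
          "</s>", "</s>", "Paris", "</s>", "<pad>"]
attributions = torch.tensor([0.0000, -0.0511, 0.1184, 0.0214, 0.5090, -0.0007, 0.2750,
                             0.6631, -0.3209, -0.2486, -0.0994, -0.1841, 0.0000])

def top_tokens(tokens, attributions, k=3, special=("<s>", "</s>", "<pad>", "[CLS]", "[SEP]", "[PAD]")):
    """Return the k non-special tokens with the largest (most positive) attributions."""
    keep = [i for i, t in enumerate(tokens) if t not in special]
    values, idx = torch.topk(attributions[keep], k=min(k, len(keep)))
    return [(tokens[keep[i]], round(v, 4)) for i, v in zip(idx.tolist(), values.tolist())]

print(top_tokens(tokens, attributions))
# [('?', 0.6631), ('Ġcapital', 0.509), ('ĠFrance', 0.275)]
```

Passing `-attributions` instead lists the tokens that pushed most strongly *against* the chosen answer, which is the more interesting direction when looking for what throws a model off.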

Since a change in the input can increase ALL logits, we might want to use the softmaxed output to find the words most important for *differentiating* between the choices:

In [None]:
def forward_func_softmax(input_ids, attention_mask, token_type_ids=None): # softmax wrapper
    return forward_func(input_ids, attention_mask, token_type_ids).softmax(dim=1)

do_everything(forward_func_softmax, layer)

Logits from forward pass: torch.Size([1, 3])
Question: What is the capital of France?
Predicted Answer: 2) Paris
tensor([[-0.4239, -2.1472,  4.5936]], grad_fn=<ViewBackward0>)
torch.Size([3, 13])
Logits from forward pass: torch.Size([1, 3])
Logits from forward pass: torch.Size([1, 3])
Logits from forward pass: torch.Size([1, 150])
Logits from forward pass: torch.Size([1, 3])
Logits from forward pass: torch.Size([1, 3])
Token attributions: torch.Size([3, 13])
Attributions per token at choice_idx: tensor([ 0.0000, -0.0667, -0.2955, -0.2296,  0.2589,  0.0310,  0.4667,  0.3114,
        -0.1623, -0.2125, -0.3267, -0.2078,  0.0000])


| True Label | Predicted Label | Attribution Label | Attribution Score | Word Importance |
|---|---|---|---|---|
| 2.0 | 2 (0.99) | 2.0 | -0.53 | #s What Ġis Ġthe Ġcapital Ġof ĠFrance ? #/s #/s Paris #/s #pad |
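The reason the softmax wrapper targets *differences* between choices can be checked on the logits printed above: raising every logit by the same amount changes the raw logit of the chosen answer, but leaves the probabilities untouched. So a word that lifts all choices equally gets credit when attributing to a raw logit, but not when attributing to a probability. A minimal check, assuming only the logits from the RoBERTa run (choices Berlin, Madrid, Paris):

```python
import torch

# Logits printed above for "What is the capital of France?" (Berlin, Madrid, Paris)
logits = torch.tensor([[-0.4239, -2.1472, 4.5936]])

probs = logits.softmax(dim=1)
shifted = (logits + 1.0).softmax(dim=1)   # every logit raised by the same amount

print(probs)                                        # Paris gets about 0.99, as in the table
print(torch.allclose(probs, shifted, atol=1e-6))    # True: a common shift leaves the softmax unchanged
```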
