Interpret BERT with LayerIntegratedGradients (Captum)
This notebook demonstrates how to use Captum's LayerIntegratedGradients to compute input feature attributions for a BERT model performing sentiment classification.



In [None]:
! pip install transformers captum torch

### Step 1: Import Required Libraries

In [None]:
from transformers import BertTokenizer, BertForSequenceClassification
from captum.attr import LayerIntegratedGradients
import torch

### Step 2: Load Pretrained BERT Model and Tokenizer
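We'll use the bert-base-uncased model and tokenizer from Hugging Face Transformers. This checkpoint has not been fine-tuned for sentiment, so its classification head is randomly initialized; substitute a fine-tuned checkpoint if you need meaningful predictions and attributions.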

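In [None]:
# Load tokenizer and model, and switch the model to evaluation mode
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
model.eval()

### Step 3: Tokenize the Input Text
We convert the example sentence into input IDs and an attention mask.

In [None]:
# Input text for sentiment analysis
text = "This is a great movie!"

# Tokenize the text into input IDs and attention masks
inputs = tokenizer(text, return_tensors='pt')
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']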


### Step 4: Extract Input Embeddings
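We look up the word embeddings for the input IDs and enable gradient computation on them. The model's inputs_embeds argument expects word embeddings only: position embeddings, token-type embeddings, and layer normalization are applied inside the embedding layer, so passing the layer's full output back in would apply them twice.

In [None]:
# Get the word embeddings (before position/token-type embeddings and LayerNorm)
embedding_layer = model.bert.embeddings
input_embeddings = embedding_layer.word_embeddings(input_ids).detach()

# Enable gradients for input embeddings
input_embeddings.requires_grad_()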


### Step 5: Define a Custom Forward Function
Captum needs a function that maps embeddings to outputs. We'll define that here.

In [None]:
# Custom forward function that accepts input embeddings
def custom_forward(embeds):
    outputs = model(inputs_embeds=embeds, attention_mask=attention_mask)
    logits = outputs.logits
    return logits


### Step 6: Select Target Prediction Class
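We choose the class index for which we want to compute attributions.
By default the classification head has two outputs (indices 0 and 1). Which index means positive sentiment depends on how the model was fine-tuned; here we assume 1 is the positive class.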

In [None]:
# Choose the target class (e.g., 1 for positive sentiment)
target_prediction = 1
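
Rather than hard-coding the class, one option is to attribute whichever class the model actually predicts. The sketch below assumes a logits tensor of shape (1, num_classes), such as the one returned by custom_forward. The helper name predicted_class and the stand-in logits are illustrative and not part of the original notebook.

```python
import torch

def predicted_class(logits: torch.Tensor) -> int:
    """Return the index of the highest-scoring class for a single example."""
    # logits has shape (1, num_classes); take the argmax over the class dimension
    return int(logits.argmax(dim=-1).item())

# Stand-in logits for illustration only; in the notebook these would come from
# custom_forward(input_embeddings)
example_logits = torch.tensor([[-0.3, 1.2]])
target_prediction = predicted_class(example_logits)
print(target_prediction)
```

Attributing the predicted class explains the decision the model made, while a fixed index explains the evidence for one particular label regardless of the prediction.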

### Step 7: Compute Attributions with LayerIntegratedGradients
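We now use Captum to compute attributions with respect to the output of the embedding layer. The result has the same shape as that output: (batch size, sequence length, hidden size).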

In [None]:
# Initialize LayerIntegratedGradients with the embedding layer
lig = LayerIntegratedGradients(custom_forward, model.bert.embeddings)

# Compute attributions for the target prediction
attributions = lig.attribute(inputs=input_embeddings, target=target_prediction)
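
To make the computation above less of a black box, here is a minimal sketch of the integrated gradients idea on a toy function. It is not Captum's implementation: it uses a plain midpoint Riemann sum and an analytic gradient, and the names f, grad_f, and integrated_gradients are made up for this illustration. Each attribution is the input's distance from the baseline multiplied by the average gradient along the straight path between them.

```python
def f(x):
    # Toy scalar "model": F(x) = x0^2 + 3 * x1
    return x[0] ** 2 + 3.0 * x[1]

def grad_f(x):
    # Analytic gradient of F with respect to x
    return [2.0 * x[0], 3.0]

def integrated_gradients(x, baseline, n_steps=50):
    # Average the gradient at n_steps midpoints along the path baseline -> x
    avg_grad = [0.0] * len(x)
    for k in range(n_steps):
        alpha = (k + 0.5) / n_steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_f(point)
        for i in range(len(x)):
            avg_grad[i] += g[i] / n_steps
    # Scale the averaged gradient by the input's distance from the baseline
    return [(xi - b) * a for xi, b, a in zip(x, baseline, avg_grad)]

x = [2.0, 1.0]
baseline = [0.0, 0.0]
attr = integrated_gradients(x, baseline)
print(attr)
print(sum(attr), f(x) - f(baseline))
```

The two numbers on the last line should agree up to floating-point error: the attributions sum to the difference between the function's value at the input and at the baseline. The same property is a useful sanity check on real models.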


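### Full Example
The cell below combines the steps above into a single script. It relies on the imports from Step 1.

In [None]:
# Load tokenizer and model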
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
model.eval()

# Tokenize input
text = "This is a great movie!"
inputs = tokenizer(text, return_tensors='pt')
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']

# Get word embeddings (position/token-type embeddings and LayerNorm are applied inside the model)
embedding_layer = model.bert.embeddings
input_embeddings = embedding_layer.word_embeddings(input_ids).detach()
input_embeddings.requires_grad_()

# Define a custom forward function to pass embeddings and get prediction
def custom_forward(embeds):
    outputs = model(inputs_embeds=embeds, attention_mask=attention_mask)
    logits = outputs.logits
    return logits

# Target index (e.g., class index 1 for positive sentiment)
target_prediction = 1

# Initialize LayerIntegratedGradients
lig = LayerIntegratedGradients(custom_forward, model.bert.embeddings)

# Compute attributions
attributions = lig.attribute(inputs=input_embeddings, target=target_prediction)
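
The attributions tensor holds one value per token per hidden dimension, which is hard to read directly. A common follow-up is to collapse it to a single score per token. The sketch below is one way to do that, assuming attributions has shape (1, seq_len, hidden_size). The helper summarize_attributions and the random stand-in tensor are illustrative and not part of the original notebook.

```python
import torch

def summarize_attributions(attributions: torch.Tensor) -> torch.Tensor:
    """Collapse (1, seq_len, hidden_size) attributions to one score per token."""
    # Sum over the hidden dimension, then drop the batch dimension
    scores = attributions.sum(dim=-1).squeeze(0)
    # Scale to unit L2 norm so scores are comparable across inputs
    return scores / torch.norm(scores)

# Random stand-in for the real attributions tensor (illustration only)
torch.manual_seed(0)
fake_attributions = torch.randn(1, 8, 768)
token_scores = summarize_attributions(fake_attributions)
print(token_scores.shape)
```

In the notebook, the scores could be paired with their tokens via tokenizer.convert_ids_to_tokens(input_ids[0].tolist()), so that each token is listed next to its contribution toward the target class.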
