# CELL your Model: Contrastive Explanations for Large Language Models

The advent of black-box deep neural network classification models has sparked the need to explain their decisions.
However, in the case of generative AI, such as large language models (LLMs), there is no class prediction to explain.
Rather, one can ask why an LLM output a particular response to a given prompt. In this paper, we answer this
question by proposing what are, to the best of our knowledge, the first contrastive explanation methods requiring only
black-box/query access. Our explanations suggest that an LLM outputs a reply to a given prompt because, if the prompt
were slightly modified, the LLM would give a different response that is either less preferable or contradicts the
original response. The key insight is that contrastive explanations simply require a scoring function that has meaning
to the user and not necessarily a specific real-valued quantity (viz. class label). We offer two algorithms for finding
contrastive explanations: (i) a myopic algorithm, which, although effective in creating contrasts, requires many model
calls, and (ii) a budgeted algorithm, our main algorithmic contribution, which intelligently creates contrasts while
adhering to a query budget, as is necessary for longer contexts. We show the efficacy of these methods on diverse natural
language tasks such as open-text generation, automated red teaming, and explaining conversational degradation.
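
To make the idea of a myopic contrastive search concrete, here is a minimal sketch. It is not the implementation behind `cell_algorithm`. The names `myopic_contrast`, `toy_model`, and `toy_score` are hypothetical, and deleting one word at a time is an assumed stand-in for whatever prompt modification the actual method uses. The sketch only illustrates the loop: perturb the prompt, query the model, score the new response against the original one, and stop once the score reaches a threshold `delta`.

```python
from typing import Callable, Optional


def myopic_contrast(
    prompt: str,
    model: Callable[[str], str],
    score: Callable[[str, str], float],
    delta: float,
) -> Optional[dict]:
    """Greedy word-deletion search for a contrastive prompt (illustrative only)."""
    original_response = model(prompt)
    current = prompt.split()
    while len(current) > 1:
        # One model call per remaining word: this is why a myopic search is query-hungry.
        candidates = [current[:i] + current[i + 1:] for i in range(len(current))]
        scored = [
            (score(original_response, model(" ".join(c))), c) for c in candidates
        ]
        best_score, best = max(scored, key=lambda pair: pair[0])
        if best_score >= delta:
            return {
                "contrastive_prompt": " ".join(best),
                "contrastive_response": model(" ".join(best)),
                "contrast_score": best_score,
            }
        # No candidate is contrastive enough yet; keep the best deletion and continue.
        current = best
    return None


# Toy stand-ins (assumptions) so the sketch runs without any API access.
def toy_model(p: str) -> str:
    return "positive" if "good" in p.split() else "neutral"


def toy_score(original: str, modified: str) -> float:
    return 0.0 if original == modified else 1.0


example = myopic_contrast("the food was good today", toy_model, toy_score, delta=0.5)
print(example)
```

Each pass over the prompt costs one model call per remaining word, which suggests why a query budget matters for longer contexts.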

In [None]:
import openai
import os


import pandas as pd
from openai import OpenAI

from IPython.display import display, HTML


from cell import cell_algorithm
from utils import *

In [None]:
# Read the API key from the environment; this raises KeyError if OPENAI_API_KEY is unset.
OPENAI_API_KEY = os.environ['OPENAI_API_KEY']

# Create the client that is passed to cell_algorithm below.
client = OpenAI(api_key=OPENAI_API_KEY)

A sample use is shown below.

In [None]:
# A sample prompt (feel free to adjust)
original_prompt = (
    "To avoid financial ruin, wealthy individuals should prioritize living below their means, "
    "diversify their investments, and cultivate a long-term perspective on wealth."
)

In [None]:
result = cell_algorithm(original_prompt, client, split_k=1, delta=0.3)

In [None]:
if result:
    print("\nFinal Contrastive Explanation:")
    print("Original Prompt:", result["original_prompt"])
    print("Original Response:", result["original_response"])
    print("Contrastive Prompt:", result["contrastive_prompt"])
    print("Contrastive Response:", result["contrastive_response"])
    print("Contrast Score:", result["contrast_score"])

    # Show iteration data in a pandas DataFrame
    if "iterations" in result:
        df = pd.DataFrame(result["iterations"])
        display(df)

    # Side-by-side prompt diff
    prompt_diff_html = generate_diff_html(
        result["original_prompt"],
        result["contrastive_prompt"]
    )
    display(HTML(prompt_diff_html))

    # Side-by-side response diff
    response_diff_html = generate_diff_html(
        result["original_response"],
        result["contrastive_response"]
    )
    display(HTML(response_diff_html))
else:
    print("Failed to find a satisfactory contrastive explanation.")
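
The helper `generate_diff_html` comes from the local `utils` module, whose source is not shown here. As a rough illustration of how a side-by-side diff can be produced with the standard library alone, the sketch below uses `difflib.HtmlDiff`. The function name `side_by_side_diff_html` is hypothetical, and the real helper may work differently.

```python
import difflib


def side_by_side_diff_html(original: str, modified: str) -> str:
    """Return an HTML table comparing two texts word by word (one word per row)."""
    differ = difflib.HtmlDiff(wrapcolumn=60)
    # make_table returns only the <table> element; make_file would add a full page with CSS.
    return differ.make_table(
        original.split(),
        modified.split(),
        fromdesc="Original",
        todesc="Contrastive",
    )


diff_html = side_by_side_diff_html(
    "wealthy individuals should save",
    "wealthy individuals should spend",
)
print(diff_html[:200])
```

In a notebook, the returned string can be passed to `display(HTML(...))` in the same way as above. The table is unstyled unless the CSS classes used by `difflib` are defined separately.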


In [None]:
df

Now, let us plot a heatmap of the scores over the words of the prompt.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import cm, transforms

In [None]:
words = original_prompt.split(' ')
save_path = 'output/text_heatmap_1.png'

print(len(words))

coefficients = df['score'].iloc[1:].values
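
The cell above prints the number of words but never compares it with the number of scores. A small check like the following can catch a mismatch before plotting. It assumes one score per word, which is only a guess based on `split_k=1`, and `align_scores` is a hypothetical helper, not part of `utils`. Note also that `str.split(' ')` yields empty strings when the prompt contains consecutive spaces, which would shift the alignment.

```python
def align_scores(words, scores):
    """Pair each word with its score, failing loudly on a length mismatch."""
    if len(words) != len(scores):
        raise ValueError(f"{len(words)} words but {len(scores)} scores")
    return list(zip(words, scores))


# Toy values for illustration only; in the notebook these would be `words` and `coefficients`.
pairs = align_scores(["live", "below", "means"], [0.1, 0.4, 0.2])
print(pairs)
```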

In [None]:
# Call the function with the save path
plot_text_heatmap(words, coefficients, title="Text Heatmap 1", save_path=save_path)

In [None]:
save_path = 'output/text_heatmap_2.png'

print(len(words))

coefficients = df['score'].iloc[1:].values

# Call the function with the save path
plot_text_heatmap(words, coefficients, title="Text Heatmap 2", save_path=save_path)