## CAV explanation on text

Concept Activation Vectors (CAVs) are a technique for explaining the predictions of a neural network in terms of human-understandable concepts. Instead of explaining a prediction by highlighting individual features (like words), CAVs explain it by quantifying how much a model relies on a high-level concept (e.g., "sentiment" or "politeness").

How CAVs Work:

The core idea is to represent a concept as a vector in the neural network's activation space.

* Define a Concept: A human-interpretable concept is defined by providing a set of example sentences that express that concept. For instance, the concept of "positive sentiment" could be defined by a set of movie reviews that are highly positive. A second, contrasting set of examples that do not express the concept ("negative" or "neutral" examples) is also needed.

* Generate Activations: Both the concept examples and the neutral examples are passed through a specific layer of the neural network. The activations (the output of the neurons) for each example are then collected.

* Train a Classifier: A simple linear classifier is trained on these activation vectors to distinguish the concept examples from the neutral examples. The classifier's job is to find a decision boundary that best separates the two groups.

* Create the CAV: The Concept Activation Vector (CAV) is the vector that is perpendicular (orthogonal) to this learned decision boundary. This vector represents the "direction" of the concept in the network's internal representation.

* Quantify Importance (TCAV): Once the CAV is created, a method called Testing with CAVs (TCAV) is used to quantify the concept's importance for a given prediction. This is done by calculating the directional derivative of the model's output along the CAV. Essentially, it measures how much the model's prediction changes when you move in the direction of that concept.
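
The steps above can be sketched end to end on synthetic data. This is only an illustration under stated assumptions: the "activations" are random vectors rather than outputs of a real network, the linear classifier is a small logistic regression fitted by gradient descent, and `grad_logit` is a hypothetical stand-in for the gradient of the model's output with respect to the layer activations. The TCAV score is computed here as the fraction of test inputs whose directional derivative along the CAV is positive.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8  # hypothetical activation size

# Steps 1-2: synthetic "activations". Concept examples are shifted along axis 0.
concept_acts = rng.normal(size=(50, dim))
concept_acts[:, 0] += 2.0
neutral_acts = rng.normal(size=(50, dim))

X = np.vstack([concept_acts, neutral_acts])
y = np.concatenate([np.ones(50), np.zeros(50)])

# Step 3: a plain logistic regression fitted with batch gradient descent.
def train_linear_classifier(X, y, lr=0.1, steps=500):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

w, b = train_linear_classifier(X, y)

# Step 4: the CAV is the unit normal of the learned decision boundary.
cav = w / np.linalg.norm(w)

# Step 5: toy stand-in for the rest of the network, logit(a) = tanh(a) @ v.
v = rng.normal(size=dim)
v[0] = 1.5  # assumption: the toy logit increases along the concept axis

def grad_logit(a):
    """Gradient of tanh(a) @ v with respect to the activation vector a."""
    return (1.0 - np.tanh(a) ** 2) * v

test_acts = rng.normal(size=(20, dim))
directional_derivs = np.array([grad_logit(a) @ cav for a in test_acts])
tcav_score = float(np.mean(directional_derivs > 0))
print("TCAV score:", tcav_score)
```

A score close to 1 would suggest that moving along the concept direction tends to increase the toy logit; a score near 0.5 would suggest no consistent effect.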

In [None]:
import torch
from transformers import AutoModel, AutoTokenizer
# AutoModel loads a pre-trained PyTorch model from a checkpoint, such as those on the
# Hugging Face Hub, and selects the architecture class from the checkpoint's
# configuration, so the specific model class does not need to be known in advance.
# AutoTokenizer does the same for the matching tokenizer.

### Concepts
A concept is a human-understandable idea that is represented by a set of example images or text snippets. It's a high-level, semantic notion, not a low-level feature like a pixel or a word.
In our case, concepts are pre-defined sets of text examples.

In [None]:
cooking = [
    "Heat a non-stick pan with a drizzle of oil and add the chopped onion, letting it brown over medium heat.",
    "Add the peeled tomatoes, adjust the salt, and let cook at low heat for 15 minutes in the oven.",
    "Bring a pot of salted water to a boil, cook the pasta until al dente, and drain it directly into the prepared sauce."
]

preparation = [
    "Finely chop the parsley and garlic, then set them aside in a bowl.",
    "Cut the vegetables into evenly sized cubes to ensure uniform cooking.",
    "Beat the eggs with a fork until you get a smooth mixture, then add a pinch of salt."
]

ingredients = [
    "400 grams of chicken breast, 2 zucchinis, 1 garlic clove.",
    "3 eggs, 100 grams of butter, a pinch of salt.",
    "250 ml of milk, 50 grams of cocoa powder, 1 teaspoon of vanilla extract."
]

## Get model activations

In [None]:
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

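def get_activations(texts):
    """Return one mean-pooled vector of the last hidden layer per input text."""
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean across all token positions. In a batch with padding=True this also
    # averages over the padded positions of the shorter texts.
    return outputs.last_hidden_state.mean(dim=1).numpy()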

cooking_activations = get_activations(cooking)
preparation_activations = get_activations(preparation)
ingredients_activations = get_activations(ingredients)
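
Because the texts in a batch are padded to a common length, a plain mean over the token axis also averages the vectors at padded positions. A common alternative is to weight the mean by the attention mask. The sketch below uses small NumPy arrays as stand-ins for `last_hidden_state` and `attention_mask`; `masked_mean` is a hypothetical helper, not part of the notebook or of the `transformers` library.

```python
import numpy as np

def masked_mean(hidden_states, attention_mask):
    """Average token vectors over real (non-padding) tokens only.

    hidden_states:  (batch, seq_len, hidden) array
    attention_mask: (batch, seq_len) array with 1 for tokens and 0 for padding
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # guard against empty rows
    return summed / counts

# Toy batch: 2 texts, 3 positions, hidden size 2.
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]],  # last position is padding
])
mask = np.array([[1, 1, 1], [1, 1, 0]])

pooled = masked_mean(hidden, mask)  # ignores the padded position
plain = hidden.mean(axis=1)         # includes the padded position
print(pooled)
print(plain)
```

In the toy batch the second text pools to `[2.0, 2.0]` with the mask, while the plain mean is pulled far away by the padded vector.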

## Find a linear boundary between examples

In [None]:
from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.vstack([cooking_activations, preparation_activations, ingredients_activations])
y = np.array([0] * len(cooking_activations) + [1] * len(preparation_activations) + [2] * len(ingredients_activations))

# coef_ has shape (n_classes, n_features): each row holds the weights the
# classifier assigns to the activation features for one class.
# Each row is used here as the direction (CAV) of the corresponding concept.
cav_classifier = LogisticRegression().fit(X, y)
cav_cottura = cav_classifier.coef_[0]       # cooking
cav_preparazione = cav_classifier.coef_[1]  # preparation
cav_ingredienti = cav_classifier.coef_[2]   # ingredients
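
With only a few examples per concept and hundreds of activation dimensions, the classifier is heavily underdetermined. A simpler direction that is sometimes used in such cases is the difference of means between a concept's activations and all the others. The sketch below shows that alternative, not the logistic-regression weights computed above; `mean_difference_cavs` and the toy arrays are hypothetical.

```python
import numpy as np

def mean_difference_cavs(activation_sets):
    """One unit-length direction per concept: mean(concept) - mean(all other concepts)."""
    names = list(activation_sets)
    cavs = {}
    for name in names:
        positives = activation_sets[name]
        negatives = np.vstack([activation_sets[n] for n in names if n != name])
        direction = positives.mean(axis=0) - negatives.mean(axis=0)
        cavs[name] = direction / np.linalg.norm(direction)
    return cavs

# Toy stand-ins for the three activation arrays (2 examples x 3 dimensions each).
toy_sets = {
    "cooking": np.array([[2.0, 0.0, 0.0], [2.0, 1.0, 0.0]]),
    "preparation": np.array([[0.0, 2.0, 0.0], [0.0, 2.0, 1.0]]),
    "ingredients": np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]]),
}
toy_cavs = mean_difference_cavs(toy_sets)
for name, cav in toy_cavs.items():
    print(name, cav)
```

In the notebook, the same function could be called with the real arrays, for example `{"cooking": cooking_activations, ...}`.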


## Compute concept importance

In [None]:
def concept_importance(input_text):
    activations = get_activations([input_text])
    # Dot product between the input's activations and each class's coefficient
    # vector: an unnormalized projection of the input onto each concept direction.
    cottura_rel = np.dot(activations, cav_cottura)
    preparazione_rel = np.dot(activations, cav_preparazione)
    ingredienti_rel = np.dot(activations, cav_ingredienti)
    return np.array([cottura_rel, preparazione_rel, ingredienti_rel])

c_example = "boil water and add salt. Use a pan to heat oil."
p_example = "cut onions in slices, stage the flavor in a cup"
i_example = "300 g of tomato sauce, 10 grams of olive oil"
x_example = "go in computer settings and set up an update of the os"

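# Score the sentence that is unrelated to any of the three concepts
out = concept_importance(x_example)

labels = ['cooking', 'preparation', 'ingredients']
for label, score in zip(labels, out):
    # each entry of out is a length-1 array, one value per input text
    print(f"{label}: {score[0]}")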

cooking: 0.7030524984789631
preparation: 0.09999056806400607
ingredients: -0.8030430665429442
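
The scores above are raw dot products, so their magnitude depends on the length of the activation vector and of each coefficient vector. One way to make them comparable across inputs is cosine similarity. The sketch below assumes a dictionary of concept directions; `cosine_scores` and the toy vectors are hypothetical and are not the values computed in this notebook.

```python
import numpy as np

def cosine_scores(activation, cavs):
    """Cosine similarity between one activation vector and each concept direction."""
    activation = np.asarray(activation, dtype=float).ravel()
    a_norm = np.linalg.norm(activation)
    return {
        name: float(activation @ cav / (a_norm * np.linalg.norm(cav)))
        for name, cav in cavs.items()
    }

# Toy 2-dimensional example; the activation has shape (1, 2) like the
# single-text output of get_activations.
toy_cavs = {"cooking": np.array([1.0, 0.0]), "ingredients": np.array([0.0, 2.0])}
scores = cosine_scores(np.array([[3.0, 4.0]]), toy_cavs)
print(scores)  # cooking is about 0.6, ingredients about 0.8
```

Cosine scores always lie between -1 and 1, which makes it easier to compare an on-topic sentence with an unrelated one such as `x_example`.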
