
# Assignment 9
## Minjie Yang (my189)
Link to reference code: https://colab.research.google.com/github/neelnanda-io/TransformerLens/blob/main/demos/Main_Demo.ipynb#scrollTo=GGO1aY4gILpn

Link to GitHub: https://github.com/AndreYang333/ExplainableAI.git

Import required libraries

In [None]:
import os
DEVELOPMENT_MODE = False
# Detect if we're running in Google Colab
try:
    import google.colab
    IN_COLAB = True
    print("Running as a Colab notebook")
except ImportError:
    IN_COLAB = False

# Install if in Colab
if IN_COLAB:
    %pip install transformer_lens
    %pip install circuitsvis
    # Install a faster Node version
    !curl -fsSL https://deb.nodesource.com/setup_16.x | sudo -E bash -; sudo apt-get install -y nodejs  # noqa

# Hot reload in development mode & not running on the CD
if not IN_COLAB:
    from IPython import get_ipython
    ip = get_ipython()
    if not ip.extension_manager.loaded:
        ip.extension_manager.load('autoreload')
        %autoreload 2

IN_GITHUB = os.getenv("GITHUB_ACTIONS") == "true"


In [2]:
# Import stuff
import torch
import torch.nn as nn
import torch.nn.functional as F  # needed for F.cross_entropy in later cells
import einops
from fancy_einsum import einsum
import tqdm.auto as tqdm
import plotly.express as px

from jaxtyping import Float
from functools import partial

# import transformer_lens
import transformer_lens.utils as utils
from transformer_lens.hook_points import (
    HookPoint,
)  # Hooking utilities
from transformer_lens import HookedTransformer, FactoredMatrix

We turn automatic differentiation off to save GPU memory, since this notebook focuses on model inference rather than model training.

In [3]:
torch.set_grad_enabled(False)

<torch.autograd.grad_mode.set_grad_enabled at 0x7fabee297d00>

In [None]:
device = utils.get_device()
# NBVAL_IGNORE_OUTPUT
model = HookedTransformer.from_pretrained("EleutherAI/gpt-neo-125M", device=device)

In [41]:
model_description_text = """## Loading Models

HookedTransformer comes loaded with >40 open source GPT-style models. You can load any of them in with `HookedTransformer.from_pretrained(MODEL_NAME)`. See my explainer for documentation of all supported models, and this table for hyper-parameters and the name used to load them. Each model is loaded into the consistent HookedTransformer architecture, designed to be clean, consistent and interpretability-friendly.

For this demo notebook we'll look at GPT-Neo 125M"""
loss = model(model_description_text, return_type="loss")
print("Model loss:", loss)

Model loss: tensor(3.9140, device='cuda:0')
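
As a rough way to read this number: assuming the value returned with `return_type="loss"` is the mean per-token cross-entropy in nats, its exponential is the perplexity on this text. The sketch below applies that to the printed value; `loss_to_perplexity` is a hypothetical helper name, not part of TransformerLens.

```python
import math

def loss_to_perplexity(mean_loss_nats: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_loss_nats)

model_loss = 3.9140  # value printed above
perplexity = loss_to_perplexity(model_loss)
print(f"Perplexity: {perplexity:.1f}")
```

A perplexity near 50 would mean the model is, on average, about as uncertain as if it were choosing uniformly among roughly 50 tokens at each position.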


## Sentiment Analysis example

In [32]:
# Define clean and corrupted prompts
clean_prompt = "The movie was amazing, with a captivating storyline and brilliant acting. Please return answer in positive or negative.This review is"
corrupted_prompt = "The movie was dreadful, with a boring plot and poor acting. Please return answer in positive or negative.This review is"

clean_tokens = model.to_tokens(clean_prompt)
corrupted_tokens = model.to_tokens(corrupted_prompt)

# Function to calculate the difference between the logits of the correct and incorrect answers
def logits_to_logit_diff(logits, correct_answer=" positive", incorrect_answer=" negative"):
    # model.to_single_token maps a string representing a single token to its token index
    # If the string is not a single token, this function will raise an error
    correct_index = model.to_single_token(correct_answer)
    incorrect_index = model.to_single_token(incorrect_answer)

    # Calculate and return the logit difference
    return logits[0, -1, correct_index] - logits[0, -1, incorrect_index]

# Run the model on the clean prompt and cache activations
clean_logits, clean_cache = model.run_with_cache(clean_tokens)
clean_logit_diff = logits_to_logit_diff(clean_logits, " positive", " negative")
print(f"Logit difference for the clean prompt: {clean_logit_diff.item():.3f}")

# Run the model on the corrupted prompt
corrupted_logits = model(corrupted_tokens)
corrupted_logit_diff = logits_to_logit_diff(corrupted_logits, " positive", " negative")
print(f"Logit difference for the corrupted prompt: {corrupted_logit_diff.item():.3f}")


Logit difference for the clean prompt: 0.946
Logit difference for the corrupted prompt: 0.549


### Analysis of the Results

1. **Logit Difference for the Clean Prompt: 0.946**
   - A difference of 0.946 means the model prefers " positive" over " negative" by odds of roughly 2.6 to 1 (e^0.946 ≈ 2.6). This is a moderate preference rather than a strong one.
   - **Interpretation**: The direction is expected, since the clean prompt clearly conveys positive sentiment.

2. **Logit Difference for the Corrupted Prompt: 0.549**
   - Although the corrupted prompt describes negative sentiment, the model still leans toward " positive" (odds of roughly 1.7 to 1), only with a smaller margin than for the clean prompt.
   - **Interpretation**: The model does not fully register the negative sentiment in the corrupted prompt. It may have a bias toward positive predictions, potentially due to pretraining on text that favors positive language. Alternatively, the wording of the prompt may not convey negative sentiment strongly enough.
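
To make the two logit differences easier to compare, they can be converted into a probability restricted to the two answer tokens. The sketch below assumes we only care about " positive" versus " negative" and ignore the rest of the vocabulary; `two_way_probability` is a hypothetical helper, and the inputs are the values printed above.

```python
import math

def two_way_probability(logit_diff: float) -> float:
    """P(correct) after renormalising over just the two answer tokens (a sigmoid)."""
    return 1.0 / (1.0 + math.exp(-logit_diff))

for name, diff in [("clean", 0.946), ("corrupted", 0.549)]:
    odds = math.exp(diff)  # odds of " positive" against " negative"
    prob = two_way_probability(diff)
    print(f"{name}: odds {odds:.2f}:1, two-way P(positive) = {prob:.2f}")
```

A logit difference of 0 corresponds to a two-way probability of 0.5, so any positive value means the model leans toward " positive".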


In [33]:
# Define clean and corrupted prompts with clearer and stronger sentiment cues
clean_prompt = "You are an helpful assistant to determine following review is positive or negative. The movie was absolutely phenomenal, with an engaging storyline and outstanding performances. This review is"
corrupted_prompt = "You are an helpful assistant to determine following review is positive or negative. The movie was extremely disappointing, with a tedious plot and terrible acting. This review is"

clean_tokens = model.to_tokens(clean_prompt)
corrupted_tokens = model.to_tokens(corrupted_prompt)

clean_logits, clean_cache = model.run_with_cache(clean_tokens)
clean_logit_diff = logits_to_logit_diff(clean_logits, " positive", " negative")
print(f"Logit difference for the clean prompt: {clean_logit_diff.item():.3f}")

corrupted_logits = model(corrupted_tokens)
corrupted_logit_diff = logits_to_logit_diff(corrupted_logits, " positive", " negative")
print(f"Logit difference for the corrupted prompt: {corrupted_logit_diff.item():.3f}")


Logit difference for the clean prompt: 2.794
Logit difference for the corrupted prompt: 1.224



### Comparative Insights
1. **Larger Logit Differences for Both Prompts**: The modified prompts produced higher logit differences for both the clean prompt (from 0.946 to 2.794) and the corrupted prompt (from 0.549 to 1.224). The gap between the two also widened from 0.397 to 1.570, so the model separates the two reviews more clearly than before.
2. **Persistent Bias**: Despite the clearer language and explicit instruction, the model still favors the "positive" label for the corrupted prompt. Its logit difference rose instead of turning negative, which suggests an underlying bias toward " positive" or an insufficient sensitivity to negative sentiment.


In [35]:
# Define clean and corrupted prompts
clean_prompt = "You are an helpful assistant to determine following review is positive or negative. The movie was absolutely phenomenal, with an engaging storyline and outstanding performances. This review is"
corrupted_prompt = "You are an helpful assistant to determine following review is positive or negative. The movie was extremely disappointing, with a tedious plot and terrible acting. This review is"

# Function to calculate cross-entropy loss for the correct label
def calculate_cross_entropy_loss(logits, correct_answer=" positive"):
    # Get the token index for the correct answer
    correct_index = model.to_single_token(correct_answer)
    target_tensor = torch.tensor([correct_index], device=logits.device)

    # Take the logits for the last token and add a batch dimension
    logits_final = logits[0, -1, :].unsqueeze(0)

    # Calculate and return cross-entropy loss
    loss = F.cross_entropy(logits_final, target_tensor)
    return loss

# Run the model on the clean prompt and compute cross-entropy loss
clean_logits, _ = model.run_with_cache(clean_tokens)
clean_loss = calculate_cross_entropy_loss(clean_logits, " positive")
print(f"Cross-entropy loss for the clean prompt: {clean_loss.item():.3f}")

# Run the model on the corrupted prompt and compute cross-entropy loss
corrupted_logits = model(corrupted_tokens)
corrupted_loss = calculate_cross_entropy_loss(corrupted_logits, " negative")
print(f"Cross-entropy loss for the corrupted prompt: {corrupted_loss.item():.3f}")


Cross-entropy loss for the clean prompt: 4.708
Cross-entropy loss for the corrupted prompt: 6.320


### Comparative Analysis Between Logit Difference and Cross-Entropy Loss

### 1. Results Recap
#### Using Logit Difference
- **Clean Prompt**: Logit difference = 2.794
- **Corrupted Prompt**: Logit difference = 1.224
  
#### Using Cross-Entropy Loss
- **Clean Prompt**: Cross-entropy loss = 4.708
- **Corrupted Prompt**: Cross-entropy loss = 6.320
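- **Interpretation**: Each loss is measured against that prompt's own correct label (" positive" for the clean prompt, " negative" for the corrupted prompt) over the full vocabulary. Both values are high: the model assigns roughly 0.9% probability to " positive" after the clean prompt and roughly 0.2% to " negative" after the corrupted prompt, so most of the probability mass sits on other tokens. The higher loss for the corrupted prompt indicates greater difficulty in producing the negative label.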

---

### 2. Comparative Insights

**Sensitivity to Sentiment**:
- The **logit difference method** shows that the model does respond differently to positive and negative prompts (2.794 vs. 1.224), but not strongly enough to flip its preference to " negative".
- The **cross-entropy loss method** makes it clearer that the model is much less certain when the correct label is negative, highlighting a potential area for improvement.
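
For a single target token, cross-entropy is the negative log of the probability the model assigns to that token, so a loss value can be turned back into a probability. The sketch below does this for the two losses reported above; `target_probability` is a hypothetical helper name.

```python
import math

def target_probability(cross_entropy_nats: float) -> float:
    """Probability assigned to the target token, given its cross-entropy loss in nats."""
    return math.exp(-cross_entropy_nats)

losses = {"clean, target ' positive'": 4.708, "corrupted, target ' negative'": 6.320}
for name, loss in losses.items():
    print(f"{name}: p(target) = {target_probability(loss):.4f}")
```

Because the softmax runs over the whole vocabulary, a low probability here does not by itself mean the model prefers the opposite label. It may simply be predicting some other next token.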

---

### Recommendations
**Model Improvement**: To address the issues highlighted by both methods, consider fine-tuning the model on a more balanced dataset that includes a wide variety of positive and negative examples.


## IOI task

In [37]:
# Define the clean and corrupted prompts for indirect object identification (IOI)
clean_prompt = "Andre and Jason are playing basketball. Andre took a step and got past"
corrupted_prompt = "Andre and Jason are playing basketball. Jason took a step and got past"

clean_tokens = model.to_tokens(clean_prompt)
corrupted_tokens = model.to_tokens(corrupted_prompt)

# Function to calculate the difference between the logits of the correct and incorrect answers
def logits_to_logit_diff(logits, correct_answer=" Jason", incorrect_answer=" Andre"):
    # model.to_single_token maps a string value of a single token to the token index for that token
    # If the string is not a single token, it raises an error.
    correct_index = model.to_single_token(correct_answer)
    incorrect_index = model.to_single_token(incorrect_answer)
    return logits[0, -1, correct_index] - logits[0, -1, incorrect_index]

# Run the model on the clean prompt and store activations
clean_logits, clean_cache = model.run_with_cache(clean_tokens)
clean_logit_diff = logits_to_logit_diff(clean_logits)
print(f"Clean logit difference: {clean_logit_diff.item():.3f}")

# Run the model on the corrupted prompt without caching activations
corrupted_logits = model(corrupted_tokens)
corrupted_logit_diff = logits_to_logit_diff(corrupted_logits)
print(f"Corrupted logit difference: {corrupted_logit_diff.item():.3f}")


Clean logit difference: 4.440
Corrupted logit difference: -0.567


### Analysis of the Results

1. **Clean Logit Difference: 4.440**
   - The positive value means the model prefers " Jason" over " Andre" by a wide margin (odds of roughly 85 to 1). This is expected: Andre is the one who took the step, so Jason is the natural person for him to get past.
   - **Interpretation**: The model follows the narrative of the clean sentence and confidently chooses "Jason" as the continuation, suggesting that it tracks which name is the subject and which name is left as the indirect object.

2. **Corrupted Logit Difference: -0.567**
   - In the corrupted prompt Jason is the one who took the step, so the natural continuation is " Andre". The metric is still defined as logit(" Jason") minus logit(" Andre"), so a negative value means the model now prefers " Andre", which is the correct direction.
   - **Interpretation**: The model does switch its preference when the subject changes, but the margin is much smaller (odds of roughly 1.8 to 1 in favor of " Andre") than in the clean case. Its behavior is therefore asymmetric between the two names.
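
A single name pair is limited evidence, so one possible extension is to generate several clean/corrupted pairs from the same sentence template. The sketch below is an assumption about how that could be organised: `PromptPair`, `TEMPLATE`, and `make_ioi_pair` are hypothetical names, and it does not call the model. Each name would still need to be a single token for `model.to_single_token` to accept it.

```python
from typing import NamedTuple

class PromptPair(NamedTuple):
    clean: str
    corrupted: str
    clean_answer: str      # expected next token after the clean prompt
    corrupted_answer: str  # expected next token after the corrupted prompt

TEMPLATE = "{a} and {b} are playing basketball. {subject} took a step and got past"

def make_ioi_pair(name_a: str, name_b: str) -> PromptPair:
    """Build a clean/corrupted pair that differs only in the repeated subject name."""
    return PromptPair(
        clean=TEMPLATE.format(a=name_a, b=name_b, subject=name_a),
        corrupted=TEMPLATE.format(a=name_a, b=name_b, subject=name_b),
        clean_answer=" " + name_b,      # leading space matches GPT-style tokenization
        corrupted_answer=" " + name_a,
    )

pair = make_ioi_pair("Andre", "Jason")
print(pair.clean)
print(pair.corrupted)
```

Swapping the argument order (`make_ioi_pair("Jason", "Andre")`) gives the mirrored pair, which would help check whether the asymmetry comes from the names themselves or from their position in the sentence.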


In [40]:
# Function to calculate cross-entropy loss for the correct label
def calculate_cross_entropy_loss(logits, correct_answer):
    # Get the token index for the correct answer
    correct_index = model.to_single_token(correct_answer)
    target_tensor = torch.tensor([correct_index], device=logits.device)  # Move target tensor to the same device as logits

    # Take the logits for the last token and add a batch dimension
    logits_final = logits[0, -1, :].unsqueeze(0)

    # Calculate and return cross-entropy loss
    loss = F.cross_entropy(logits_final, target_tensor)
    return loss

# Run the model on the clean prompt and compute cross-entropy loss
clean_logits, _ = model.run_with_cache(clean_tokens)
clean_loss = calculate_cross_entropy_loss(clean_logits, " Jason")
print(f"Cross-entropy loss for the clean prompt: {clean_loss.item():.3f}")

# Run the model on the corrupted prompt and compute cross-entropy loss
corrupted_logits = model(corrupted_tokens)
corrupted_loss = calculate_cross_entropy_loss(corrupted_logits, " Andre")
print(f"Cross-entropy loss for the corrupted prompt: {corrupted_loss.item():.3f}")


Cross-entropy loss for the clean prompt: 2.334
Cross-entropy loss for the corrupted prompt: 4.496



- The model is more confident on the clean prompt (loss 2.334 for " Jason") than on the corrupted prompt (loss 4.496 for " Andre"), even though both prompts are equally well-formed and each loss is measured against that prompt's own correct name.
- **Key Insight**: The sign flip in the logit difference shows that the model is sensitive to which name is the subject, but the higher cross-entropy loss on the corrupted prompt shows that it completes the pattern much less reliably when the answer is " Andre".
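
The two metrics used in this notebook are closely related within a single prompt. The sketch below recomputes single-position cross-entropy by hand, which is what `F.cross_entropy` returns for one row of logits and one target, and checks that the logit difference equals the cross-entropy of the incorrect token minus that of the correct token. The logits are made-up toy values over a 4-token vocabulary, not model outputs.

```python
import math
from typing import Sequence

def cross_entropy(logits: Sequence[float], target: int) -> float:
    """-log softmax(logits)[target], computed with a numerically stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

# Toy final-position logits (illustrative values only).
toy_logits = [2.0, 0.5, -1.0, 3.0]
correct, incorrect = 0, 1

logit_diff = toy_logits[correct] - toy_logits[incorrect]
ce_gap = cross_entropy(toy_logits, incorrect) - cross_entropy(toy_logits, correct)
print(f"logit diff = {logit_diff:.3f}, CE gap = {ce_gap:.3f}")
```

The identity holds because the log-sum-exp term cancels when both losses come from the same logits. The losses reported above come from different prompts with different targets, so they cannot be subtracted in this way.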