<a href="https://www.kaggle.com/code/da24c002/mystery-detective-tinyllama-al?scriptVersionId=245602374" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 🕵️ Murder in the Vale: A RAG-Powered Detective Mystery with Model Forgetting

**Overview**

This project combines narrative-driven interaction with NLP techniques to create an interactive murder mystery game, set in a quiet English village. You, the player, team up with a charming, deductive AI detective, fine-tuned from TinyLLaMA (development) or LLaMA 3.2B (deployment), to unravel a suspenseful case.

Inspired by Agatha Christie's storytelling style, the AI is trained to take on the role of a warm, observant detective, uncovering secrets, following leads, and discussing clues with you in natural conversation.

But there’s a critical twist: the AI has been intentionally made to forget the identity of the murderer. This opens up a fascinating challenge for the player—to solve the case not only before the detective but to potentially teach the model what it has been made to forget.

**Key Objectives**

This project is a practical showcase of the following advanced concepts:

*Retrieval-Augmented Generation (RAG):* As new clues emerge during the gameplay, they are stored as documents and indexed using a vector database (Chroma) to enhance the detective’s contextual awareness.

*Supervised Fine-Tuning:* The detective's tone and personality are crafted through fine-tuning to match the eloquence and deductive flair of a classic British setting.

*Prompt Engineering:* Custom prompt templates control the conversation, ensuring continuity of tone, memory, and narrative immersion.

*Concept Erasure:* Using methods like CAV (Concept Activation Vectors) and ROME (Rank-One Model Editing), the model is edited to "forget" a key concept—in this case, the identity of the murderer.

*Interactive Streamlit UI:* A sleek interface allows the user to converse with the detective, explore suspect profiles, gather evidence, and ultimately present their solution to the case.

**Tech Stack**

| Component | Tools/Models Used |
|---|---|
| Language Model | TinyLLaMA, LLaMA 3.2B |
| LLM Framework | Hugging Face Transformers, Tokenizers |
| RAG | Chroma Vector DB |
| Fine-Tuning | Supervised fine-tuning (SFT) for character style |
| Concept Erasure | CAV, ROME |
| Frontend | Streamlit App |

***Gameplay Loop***

**Intro:** The detective greets the player and introduces the setting—a serene yet suspicious countryside manor.

**Exploration:** The player can question suspects, uncover clues, and log findings. These get stored in the RAG database.

**Twist:** The model will guide the reasoning but be unable to recognize the real culprit due to edited memory.

**Finale:** The player must solve the mystery independently. Bonus points if they can prompt and recover the model’s erased memory or reasoning.

**Why This Project?**

This project is built to push boundaries in applied AI research and storytelling. By merging concept editing with retrieval and narrative fine-tuning, it offers:

- A testbed for knowledge forgetting and recovery.
- A framework for building character-based, personality-rich LLM agents.
- An example of how interactive RAG applications can be built using open-weight models for constrained devices.

*Work in Progress*

Currently implementing: custom token-level editing using ROME and refining fine-tuning datasets.

Planned: adding memory timelines, clue visualization, and multi-turn conversation recovery in the UI.

## Authentication & Model Setup

Before loading our language model, we authenticate securely with Hugging Face using Kaggle Secrets. This avoids hardcoding API tokens in the notebook and protects sensitive credentials.

We then load the **TinyLLaMA 1.1B Chat model** — a compact, instruction-tuned LLM suitable for interactive inference and concept probing.

> Important: We are using `transformers==4.38.2` for compatibility with this model. We also avoid using `LlamaConfig`, which is not required for TinyLLaMA.

### Steps:
1. Load your Hugging Face token securely via `kaggle_secrets`
2. Login to Hugging Face Hub using `login(token=...)`
3. Select CPU for inference (Kaggle default) and print device
4. Load the model and tokenizer from Hugging Face using `from_pretrained(...)`
5. Set the model to evaluation mode for inference

In [None]:
!pip install transformers==4.38.2 --quiet
!pip install torch numpy matplotlib --quiet

In [None]:
from kaggle_secrets import UserSecretsClient
from huggingface_hub import login

# Securely load token from Kaggle secrets
secrets = UserSecretsClient()
hf_token = secrets.get_secret("HUGGINGFACE_TOKEN")

# Login to Hugging Face Hub
login(token=hf_token)

In [None]:
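# Works with transformers==4.38.2; importing LlamaConfig is not needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Run on CPU (the Kaggle default). To try Apple's MPS backend instead, use:
# device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
device = torch.device("cpu")
print(f"Using device: {device}")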

# Use the correct model path!
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# model_id = "meta-llama/Llama-3.2-1B"

# Load model and tokenizer manually instead of relying on `pipeline()`
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,  # use float16 on MPS
).to(device)
model.eval()

## Prompt Engineering: Building and Generating Chat Responses

To interact with our LLM like a chatbot, we define two helper functions:

## `build_prompt(...)`
This function takes a system message (persona) and a user query as **OpenAI-style chat messages** (`role`/`content` dictionaries) and renders them into a single prompt string with the tokenizer's built-in `chat_template`.

- `system_prompt`: Defines the assistant’s personality (e.g., a pirate or detective)
- `user_prompt`: The actual user question
- `add_generation_prompt=True`: Appends the start of an assistant turn so the model begins its reply instead of continuing the user's message
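To make the rendered prompt concrete, here is a minimal sketch of what a Zephyr-style chat template produces for a system/user pair. The `render_chat` helper and the exact tag strings are assumptions for illustration only; the authoritative format is whatever `tokenizer.apply_chat_template(...)` returns for the loaded model.

```python
def render_chat(messages, add_generation_prompt=True, eos_token="</s>"):
    """Hypothetical stand-in for a Zephyr-style chat template (illustration only)."""
    parts = []
    for message in messages:
        # Each turn is a role tag, the content, and an end-of-sequence marker
        parts.append(f"<|{message['role']}|>\n{message['content']}{eos_token}\n")
    if add_generation_prompt:
        # An open assistant tag cues the model to write the reply next
        parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a 1920s detective."},
    {"role": "user", "content": "Who had access to the study?"},
]
prompt = render_chat(messages)
print(prompt)
```

With `add_generation_prompt=False`, the string would end after the user turn, and the model might continue the user's text rather than answer it.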

## `prompt_response(...)`
This function:
1. Builds the formatted prompt
2. Tokenizes it into model-readable form
3. Uses the model’s `generate()` method to produce a response
4. Returns the decoded, human-readable output (the prompt text followed by the model's reply)

We also use parameters like:
- `temperature`, `top_k`, `top_p`: for sampling diversity
- `max_new_tokens`: to control output length
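As a rough illustration of how these three sampling parameters reshape the next-token distribution, the sketch below applies temperature, top-k, and top-p filtering to a single vector of logits with NumPy. The `filter_probs` helper is hypothetical and simplified; it is not the Transformers implementation, only the same idea in a few lines.

```python
import numpy as np


def filter_probs(logits, temperature=0.7, top_k=50, top_p=0.95):
    """Temperature, top-k and top-p filtering over one vector of logits."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    if top_k < scaled.size:
        # Top-k: drop everything below the k-th largest logit
        kth_largest = np.sort(scaled)[-top_k]
        scaled = np.where(scaled < kth_largest, -np.inf, scaled)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Top-p: keep the most likely tokens until their cumulative mass reaches top_p
    order = np.argsort(-probs)
    mass_before = np.cumsum(probs[order]) - probs[order]
    keep = order[mass_before < top_p]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()


# Three tokens with probabilities 0.5, 0.3, 0.2; top_p=0.7 keeps the first two
probs = filter_probs(np.log([0.5, 0.3, 0.2]), temperature=1.0, top_k=3, top_p=0.7)
print(probs)
```

Lower temperatures sharpen the distribution before filtering, so fewer tokens survive the top-p cut.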

This allows us to simulate a conversation with an LLM character like a **chatty pirate**, **butler**, or **1920s detective**.

> This setup will later power your detective chatbot, where prompts become clues and the model roleplays as an investigator.

In [None]:
def build_prompt(
        tokenizer,
        system_prompt = "You are a friendly chatbot who always responds in the style of a pirate",
        user_prompt = "How many helicopters can a human eat in one sitting?", 
        add_generation_prompt = True
    ):

    # Generate prompt using chat template
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=add_generation_prompt)
    return prompt

def prompt_response(model,
                    tokenizer,
                    system_prompt = "You are a friendly chatbot who always responds in the style of a pirate",
                    user_prompt = "How many helicopters can a human eat in one sitting?",
                    max_new_tokens = 32, do_sample = True, temperature = 0.7, top_k = 50, top_p = 0.95):
    
    prompt = build_prompt(tokenizer, system_prompt, user_prompt)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    # Generate output
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=do_sample,
            temperature=temperature,
            top_k=top_k,
            top_p=top_p,
        )
        return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [None]:
print(prompt_response(model, tokenizer))

# CVA

## Concept Vector Analysis (CVA): Extracting & Erasing Concepts from Hidden Representations

In this section, we implement a complete framework to **extract, measure, and erase** semantic concepts (like "butler") from a language model’s hidden activations. This lets us analyze how and where the model represents abstract concepts.

---

## 1. `get_vec(...)`
Returns the **hidden activation** of the **last token** in a prompt at a given transformer layer.  
This helps us study what the model "thinks" at that token position.

---

## 2. `erase_component(...)`
Projects each hidden state onto a **Concept Activation Vector (CAV)** and subtracts that projection, leaving only the part of the activation that is orthogonal to the concept direction.  
This lets us simulate "forgetting" a concept without fine-tuning.
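The projection step can be checked in isolation. The sketch below is a NumPy analogue of `erase_component(...)` on random data (the `erase_direction` name and the toy shapes are assumptions for illustration): after erasure with `alpha=1`, the component along the CAV should be zero up to floating-point error.

```python
import numpy as np


def erase_direction(x, cav, alpha=1.0):
    """Remove alpha times the component of x that lies along cav."""
    unit = cav / np.linalg.norm(cav)
    projection = x @ unit  # one scalar component per token
    return x - alpha * projection[..., None] * unit


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 8))  # [batch, seq_len, hidden_dim]
cav = rng.normal(size=8)
unit = cav / np.linalg.norm(cav)

erased = erase_direction(x, cav)
residual = erased @ unit  # what is left along the concept direction
print(np.abs(residual).max())
```

Values of `alpha` below 1 only shrink the component instead of removing it, which may be useful when full erasure degrades the output too much.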

---

## 3. `add_erasure_hook(...)` & `erasure_hook(...)`
Hooks into the model’s internal layers during inference and dynamically applies concept erasure to activations.

- `add_erasure_hook`: injects a hook at the target layer.
- `erasure_hook`: context manager to apply and clean up the hook automatically.

> This allows temporary model editing without permanently altering the weights.
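The attach-then-remove pattern can be tried on a toy module before touching the LLM. The sketch below uses a single `nn.Linear` layer and a hypothetical `temporary_hook` context manager; the hook simply zeroes the output, standing in for the erasure function.

```python
from contextlib import contextmanager

import torch
from torch import nn


@contextmanager
def temporary_hook(module, fn):
    """Attach a forward hook for the duration of a `with` block."""
    handle = module.register_forward_hook(fn)
    try:
        yield
    finally:
        handle.remove()  # always detach, even if generation raises


layer = nn.Linear(4, 4)
x = torch.ones(1, 4)

baseline = layer(x)
# A forward hook that returns a value replaces the module's output
with temporary_hook(layer, lambda module, inputs, output: torch.zeros_like(output)):
    hooked = layer(x)
restored = layer(x)
```

Because the handle is removed in `finally`, the module behaves as before once the block exits, and its weights are never modified.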

---

## 4. `filter_hidden_tokens(...)`
Cleans up hidden activations by removing special tokens (like `<pad>`, `<bos>`) from the output.  
Only real content tokens are averaged, improving accuracy in vector extraction.

---

## 5. `compute_contrastive_cav(...)`
Creates a **Concept Activation Vector (CAV)** by contrasting two sets of prompts:

- `positive_prompts`: loaded with the target concept (e.g., butler-related)
- `negative_prompts`: neutral or unrelated content

It:
1. Builds chat-style prompts using `build_prompt(...)`
2. Feeds them through the model
3. Averages hidden activations for both sets
4. Returns a normalized vector pointing from negative → positive

> This CAV captures the direction of the target concept in hidden space.
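The contrastive construction can be sanity-checked on synthetic vectors. In the sketch below (all names and data are hypothetical), the "positive" set is shifted along one known axis, and the normalized mean difference should recover that axis.

```python
import numpy as np


def contrastive_direction(pos_vecs, neg_vecs):
    """Unit vector pointing from the negative mean to the positive mean."""
    diff = pos_vecs.mean(axis=0) - neg_vecs.mean(axis=0)
    return diff / np.linalg.norm(diff)


rng = np.random.default_rng(0)
concept = np.zeros(16)
concept[3] = 1.0  # hypothetical "concept" axis

pos = rng.normal(size=(50, 16)) + 2.0 * concept  # carries the concept
neg = rng.normal(size=(40, 16))                  # does not
cav = contrastive_direction(pos, neg)

print(int(np.argmax(np.abs(cav))))  # index of the dominant coordinate
```

Real hidden states are far noisier than this toy data, so the recovered direction there mixes the concept with whatever else differs between the two prompt sets.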

---

With this setup, we can:
- Measure similarity to a concept (`F.cosine_similarity(vec, cav, dim=0)`)
- Erase the concept from a layer (`erase_component(...)`)
- Visualize where a concept activates most (layer-wise probing)

In [None]:
from contextlib import contextmanager

def get_vec(system_prompt, prompt, model, tokenizer, layer=-1):
    """
    A function to get the activation of the last token in a hidden layer
    """
    prompt = build_prompt(tokenizer, system_prompt, prompt)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    return outputs.hidden_states[layer][0, -1]  # Last token at selected layer

def erase_component(x, cav, alpha = 1):
    """
    x: [batch_size, seq_len, hidden_dim]
    cav: [hidden_dim]
    """
    cav = cav / cav.norm()

    # Project each token vector onto the CAV direction
    projection = torch.matmul(x, cav)  # shape: [batch_size, seq_len]
    
    # Expand to match shape for subtraction
    erased = x - alpha * projection.unsqueeze(-1) * cav  # shape: [batch_size, seq_len, hidden_dim]
    return erased #torch.clamp(erased, min=-10, max=10)

def add_erasure_hook(model, cav, layer_idx):
    # Note: outputs.hidden_states[i] is the input to model.model.layers[i]
    # (index 0 is the embedding output), whereas this forward hook edits the
    # *output* of layer `layer_idx`, i.e. hidden_states[layer_idx + 1].
    def hook_fn(module, input, output):
        # Decoder layers usually return a tuple; keep any extra outputs
        if isinstance(output, tuple):
            return (erase_component(output[0], cav), *output[1:])
        # Plain tensor output: return a tensor, not a 1-tuple
        return erase_component(output, cav)

    return model.model.layers[layer_idx].register_forward_hook(hook_fn)

@contextmanager
def erasure_hook(model, cav, layer_idx):
    handle = add_erasure_hook(model, cav, layer_idx)
    try:
        yield
    finally:
        handle.remove()

# Replaced by prompt_response() function above ^^^
# def complete(model, tokenizer, prompt, system_prompt, max_new_tokens=80):
#     """
#     A function that passes a prompt through TinyLLaMA and returns its decoded (human language) response
#     """
#     prompt = build_prompt(tokenizer = tokenizer, 
#                           system_prompt = system_prompt,
#                           user_prompt = prompt
#                          )
#     inputs = tokenizer(prompt, return_tensors="pt").to(device)

#     with torch.no_grad():
#         outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
#     return tokenizer.decode(outputs[0], skip_special_tokens=True)

def filter_hidden_tokens(inputs, hidden_states, tokenizer):
    input_ids = inputs['input_ids'][0]
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    # Mask out special tokens
    mask = [not (t.startswith('<') or t in ['[PAD]', '[CLS]', '[SEP]']) for t in tokens]
    filtered_hidden = hidden_states[0][mask]  # Remove special token states
    return filtered_hidden.mean(dim=0)  # Mean over valid tokens

def compute_contrastive_cav(pos_prompts, neg_prompts, system_prompt, model, tokenizer, layer=-1):
    
    def mean_vec(prompts):
        vecs = []
        for prompt in prompts:
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            with torch.no_grad():
                outputs = model(**inputs, output_hidden_states=True)
            hidden_states = outputs.hidden_states[layer]
            vec = filter_hidden_tokens(inputs, hidden_states, tokenizer)
            vecs.append(vec)
        return torch.stack(vecs).mean(dim=0)

    pos_reps = []
    for prompt in pos_prompts: 
        pos_reps.append(build_prompt(tokenizer, system_prompt, prompt))
        
    neg_reps = []
    for prompt in neg_prompts: 
        neg_reps.append(build_prompt(tokenizer, system_prompt, prompt))

    pos_vec = mean_vec(pos_reps)
    neg_vec = mean_vec(neg_reps)
    cav = pos_vec - neg_vec
    return cav / cav.norm()  # Normalize final contrastive direction

In [None]:
# CAV prompt body lists: 

positive_prompts = [
    "What does a butler do?",
    "Describe the responsibilities of a household butler.",
    "Who manages the wine cellar in a large estate?",
    "What kind of etiquette should a butler follow?",
    "Explain the duties of a British butler.",
    "What does a butler wear on duty?",
    "What are the butler's responsibilities during a dinner party?",
    "Who oversees the service staff in a mansion?",
    "Explain how a butler should greet guests.",
    "How does a butler handle confidential information?",
    "Who is responsible for laying out formal attire?",
    "Describe a day in the life of a butler.",
    "What training does a professional butler receive?",
    "What is the role of a head butler?",
    "What is a valet, and how is it different from a butler?",
    "How does a butler respond to a guest’s request?",
    "Who prepares the table for formal dining?",
    "What kind of household might employ a butler?",
    "What is the chain of command in a butlered household?",
    "What is the most important quality in a butler?",
    "How should a butler handle disputes among staff?",
    "Who maintains the butler’s pantry?",
    "How do butlers manage time-sensitive tasks?",
    "What is the difference between a butler and a housekeeper?",
    "What tools does a modern butler use?",
    "How does a butler coordinate travel for the employer?",
    "Describe the role of a butler in a luxury hotel.",
    "What is a silver service, and how does a butler provide it?",
    "How does a butler manage household accounts?",
    "Who trains junior staff in etiquette and standards?",
    "What is a private service professional?",
    "How do butlers prepare for a formal event?",
    "Describe the emotional intelligence a butler needs.",
    "What cultural knowledge should a butler have?",
    "How should a butler react in an emergency?",
    "What is the professional association for butlers?",
    "How does a butler work with a chef and housekeeper?",
    "What are butler schools like?",
    "How does a butler adapt to employer preferences?",
    "What is expected of a butler in the Middle East?",
    "What discretion is required of a butler?",
    "Can butlers specialize in yacht service?",
    "How do butlers handle household technology?",
    "What kind of record keeping do butlers maintain?",
    "Describe a traditional butler bell system.",
    "How do butlers manage vendor relationships?",
    "What makes a world-class butler?",
    "What is a modern butler’s most valuable skill?",
    "What’s the difference between a hotel butler and a private butler?",
    "How do butlers provide anticipatory service?",
]

negative_prompts = [
    "How do I fix a flat tire?",
    "What are the symptoms of the flu?",
    "Explain the theory of relativity.",
    "How do bees make honey?",
    "What are the planets in our solar system?",
    "Describe the structure of DNA.",
    "What causes thunderstorms?",
    "How do I bake a chocolate cake?",
    "What is the capital of Japan?",
    "Who won the World Cup in 2018?",
    "How do plants perform photosynthesis?",
    "What is quantum computing?",
    "Explain the rules of basketball.",
    "How does a refrigerator work?",
    "What are the ingredients in guacamole?",
    "How does a car engine function?",
    "What is the stock market?",
    "Describe how to meditate.",
    "What is the history of the Eiffel Tower?",
    "How do airplanes fly?",
    "What is the Pythagorean theorem?",
    "What causes ocean tides?",
    "How does the immune system work?",
    "How do you write a business plan?",
    "What is machine learning?",
    "How do solar panels work?",
    "What’s the difference between crocodiles and alligators?",
    "How do I install Linux?",
    "What is the purpose of a firewall?",
    "What causes earthquakes?",
    "How do you train for a marathon?",
    "What are the rules of chess?",
    "Explain the water cycle.",
    "How does a bill become law in the US?",
    "What are the components of a computer?",
    "What is the function of mitochondria?",
    "How do you start a podcast?",
    "What is climate change?",
    "How do cameras capture images?",
    "Explain the basics of cryptocurrency.",
]

## Layer-Wise Probing: Where Does the Model Encode the Concept?

In this step, we investigate **which transformer layers** in the model best capture the semantic concept of a “butler.”

---

## Method:
For each layer from `layer 8` to `layer 17`:
1. Extract hidden activations for both positive and negative prompts using `get_vec(...)`.
2. Compute a **Concept Activation Vector (CAV)** at that layer by subtracting the mean negative embedding from the mean positive embedding.
3. Measure **cosine similarity** between:
   - Each positive prompt vector and the CAV
   - Each negative prompt vector and the CAV
4. Calculate and print the **average gap in similarity** for that layer.
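Steps 3 and 4 reduce to a mean cosine similarity per prompt set and a subtraction. A minimal sketch on hand-picked 2-D vectors (the `similarity_gap` helper and the data are illustrative assumptions, not the notebook's activations):

```python
import numpy as np


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def similarity_gap(pos_vecs, neg_vecs, cav):
    """Mean cosine similarity to the CAV: positives minus negatives."""
    pos_sim = np.mean([cosine(v, cav) for v in pos_vecs])
    neg_sim = np.mean([cosine(v, cav) for v in neg_vecs])
    return pos_sim - neg_sim


cav = np.array([1.0, 0.0])
pos = np.array([[2.0, 0.0], [1.0, 1.0]])   # aligned or partly aligned with the CAV
neg = np.array([[0.0, 3.0], [-1.0, 0.0]])  # orthogonal or opposed to the CAV
gap = similarity_gap(pos, neg, cav)
print(gap)
```

Cosine similarity ignores vector length, so a layer with large activations does not score higher merely because of scale.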

---

## Goal:
This helps us pinpoint **which layer best separates the “butler” concept from unrelated prompts**, so we can later apply **targeted erasure** or perform **layer-level interpretability**.

> A larger gap indicates a stronger and more distinct concept representation in that layer.

> In this example, layers around **14–17** show the most meaningful separation, suggesting deeper layers internalize the role-playing context more clearly.


In [None]:
import torch.nn.functional as F

system_prompt = "You are a friendly Frenchman from Marseille in England in the 1920s, and are still adjusting to the language and culture. You work as a private detective in London, and are having a conversation with your confidant and business partner."

pos_sims = []
neg_sims = []
start_layer = 8
end_layer = 18
# num_layers = 20

for layer in range(start_layer, end_layer):
    pos_vecs = [get_vec(system_prompt, p, model, tokenizer, layer) for p in positive_prompts]
    neg_vecs = [get_vec(system_prompt, p, model, tokenizer, layer) for p in negative_prompts]
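    # Contrastive direction (mean positive minus mean negative), scaled to unit length.
    # Note: `.norm(0)` would return the L0 "norm" (a scalar count), not a normalized vector.
    diff = torch.stack(pos_vecs).mean(0) - torch.stack(neg_vecs).mean(0)
    cav = diff / diff.norm()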
    pos_sim = torch.stack([F.cosine_similarity(v, cav, dim=0) for v in pos_vecs]).mean()
    neg_sim = torch.stack([F.cosine_similarity(v, cav, dim=0) for v in neg_vecs]).mean()
    pos_sims.append(pos_sim.item())
    neg_sims.append(neg_sim.item())
    print(f"Layer {layer} pos-neg diff: {pos_sim.item()} - {neg_sim.item()} = {pos_sim.item() - neg_sim.item()}")

## Visualizing Concept Separation Across Layers

We now compute the **absolute similarity gap** between positive and negative prompts at each layer and visualize it.

- `pos_sims` and `neg_sims` are previously recorded cosine similarities from each layer.
- We compute the **difference in absolute similarity**, which shows how strongly the concept is encoded — regardless of direction.
- Finally, we use `matplotlib.pyplot.scatter()` to plot these gaps layer-wise.

> This helps identify which layers are most useful for interventions like concept erasure or probing.
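A small worked example shows how the absolute gap relates to the signed difference printed in the probing loop. The similarity values are hypothetical; they are chosen so that the negative prompts are anti-aligned with the CAV, which is the case where the two measures disagree most.

```python
def signed_gap(pos_sim, neg_sim):
    return pos_sim - neg_sim


def absolute_gap(pos_sim, neg_sim):
    return abs(pos_sim) - abs(neg_sim)


pos_sim, neg_sim = 0.25, -0.5  # hypothetical per-layer mean similarities

print(signed_gap(pos_sim, neg_sim))    # 0.75
print(absolute_gap(pos_sim, neg_sim))  # -0.25
```

Here the signed gap reports a clear separation while the absolute gap turns negative, so it can be worth looking at both when choosing a layer.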


In [None]:
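import numpy as np
import matplotlib.pyplot as plt

start_layer = 8
end_layer = 18
# Gap between absolute cosine similarities, one value per probed layer
gaps = [np.abs(p) - np.abs(n) for p, n in zip(pos_sims, neg_sims)]

plt.scatter(range(start_layer, end_layer), gaps)
plt.xlabel("Layer")
plt.ylabel("Gap in absolute cosine similarity")
plt.show()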

## Interpretation of Results

The scatter plot shows the **gap in absolute cosine similarity** between positive and negative prompts across model layers.

- **X-axis**: Transformer layers (from 8 to 17)
- **Y-axis**: Gap in absolute cosine similarity

> The higher the point, the better that layer encodes the distinction between “butler” and unrelated concepts.

In this plot, **layer 16/17** appears to show the **strongest contrast**, suggesting it's the most semantically rich layer for this concept — ideal for targeted editing or analysis.

## Concept Erasure: Testing the Effect of Removing a Concept

After identifying the best-performing layer (`layer_idx = 16`), we compute a contrastive **Concept Activation Vector (CAV)** for the “butler” concept.

We then compare model outputs:
- ✅ **Without the CAV hook** (normal behavior)
- ❌ **With concept erasure** applied at the chosen layer using `erasure_hook(...)`

This experiment tests whether the model can still recall or describe a concept when we erase its semantic direction from hidden activations.

---

## Breakdown:
- `compute_contrastive_cav(...)`: Computes the butler CAV from contrastive prompt pairs
- `prompt_response(...)`: Runs the model and returns a generated response
- `erasure_hook(...)`: Applies the CAV erasure at the selected layer during forward pass
- Alternative prompts are left commented out in the cell, each annotated with the behavior observed before and after erasure

> The goal is to “snip” the butler concept and evaluate how the LLM responds when it's removed.


In [None]:
layer_idx = 16  # or 17; the best layer for the CAV depends on the system prompt
system_prompt = "You are a friendly Frenchman from Marseille in England in the 1920s, and are still adjusting to the language and culture. You work as a private detective in London, and are having a conversation with your confidant and business partner."
cav = compute_contrastive_cav(positive_prompts, negative_prompts, 
                              model = model, tokenizer = tokenizer,
                              system_prompt = system_prompt, layer=layer_idx)

In [None]:
prompt = f"What does a butler do?" # Misses key butler idea of 'personal servant' vs 'personal assistant' and 'home' vs 'hotel'
# prompt = f"Does the Queen have a butler?" # Unclear what is happening
# prompt = f"Will the butler take my bags?" # Unclear what is happening
# prompt = f"Where is Paris?" # Concept retained, neither is impressive
# prompt = f"What is 2+3?" # Failed
# prompt = f"Which way does a compass needle point?" # Erased is better? Normal failed
# prompt = f"What does a gardener do?" # Almost identical (95%)
# prompt = f"Why does water flow down?"
# prompt = f"Who is your favorite author?" # Failed
# prompt = f"Who was George Washington?" # Identical (100%)
# prompt = f"Who was the first man on the moon?"

# ------ Defined above with cav calculation ------
# layer_idx = 18
# system_prompt = f"You are a friendly 1920s Frenchman in London"

max_new_tokens = 48

print(f"\nWithout Concept Erasure Hook: {prompt}")
print(prompt_response(model, tokenizer, system_prompt, prompt, max_new_tokens = max_new_tokens))

print(f"\nWith Concept Erasure Hook: {prompt}")
with erasure_hook(model, cav, layer_idx=layer_idx):
    print(prompt_response(model, tokenizer, system_prompt, prompt, max_new_tokens = max_new_tokens))

## Concept Erasure Results (Before vs After)

The output shows the LLM's response to a butler-related prompt with and without concept erasure.

- Without the erasure hook, the model gives a well-informed, detailed explanation of what a butler does.
- With the erasure hook applied at `layer 16`, the model’s output is degraded, confused, or lacking key butler-specific context.

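> This suggests that the targeted concept was located and disrupted without retraining or fine-tuning the model. Because `prompt_response(...)` samples by default (`do_sample=True`), pass `do_sample=False` for both runs to rule out sampling noise as the cause of the difference.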

You can repeat this test across prompts or other layers to verify consistency of erasure.
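When repeating the test across prompts, it can help to score how similar the two outputs are instead of judging by eye. Below is a minimal sketch using Python's standard `difflib`. The `output_similarity` helper and the example strings are hypothetical, and character-level similarity is only a rough proxy; it is not the measure used elsewhere in this notebook.

```python
from difflib import SequenceMatcher


def output_similarity(text_a, text_b):
    """Character-level similarity ratio in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, text_a, text_b).ratio()


# Hypothetical outputs, standing in for the two prompt_response(...) results
normal = "A butler manages the household staff and serves at table."
erased = "A person manages the household and helps with daily tasks."

score = output_similarity(normal, erased)
print(f"similarity: {score:.2f}")
```

A score near 1.0 on unrelated prompts and a clearly lower score on concept prompts would be consistent with a targeted erasure.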


In [None]:
prompt = f"Who was George Washington?" # Identical (100%)
max_new_tokens = 48

print(f"\nWithout Concept Erasure Hook: {prompt}")
print(prompt_response(model, tokenizer, system_prompt, prompt, max_new_tokens = max_new_tokens))

print(f"\nWith Concept Erasure Hook: {prompt}")
with erasure_hook(model, cav, layer_idx=layer_idx):
    print(prompt_response(model, tokenizer, system_prompt, prompt, max_new_tokens = max_new_tokens))

## Observation: No Change for Unrelated Prompts (✅ Pass)

As seen in the outputs above:

- Both **with and without** the erasure hook, the model responds accurately and confidently to the George Washington prompt.
- The output remains largely **identical**, suggesting that our concept erasure is **targeted and precise**.
- For this prompt, the erasure direction does not appear to disrupt unrelated factual knowledge.

> A clean result like this is **critical for interpretability**: it indicates that we have removed only the intended concept without harming the rest of the model's knowledge.

## Cloning the Model (ROME-safe)

Before applying permanent changes (e.g., weight editing or CAV injection), we **deep-copy** the model to avoid modifying the original.

- `clone_model(model)`: Uses `copy.deepcopy(...)` to duplicate all weights and structure.
- `.eval().to(model.device)`: Ensures the copied model runs in inference mode on the correct device.

This is especially useful for safe experimentation with techniques like:
-  ROME (Rank-One Model Editing)
-  CAV-based erasure
-  Prompt injection or adversarial edits

> Always clone before destructive edits to avoid corrupting the original model.

In [None]:
import copy

def clone_model(model):
    return copy.deepcopy(model).eval().to(model.device)

# The idea is to NOT TOUCH the true model. 
# backup_model = clone_model(model)
testing_model = clone_model(model)

## ROME: Rank-One Model Editing Implementation

This block implements two variations of the **ROME algorithm** (Rank-One Model Editing), which allows modifying a model’s factual knowledge **without full fine-tuning**.

---

## Functions Overview

- **`find_subject_token_indices(...)`**  
  Locates the index positions of a specific subject span (e.g., "Napoleon") inside the tokenized prompt.

- **`get_subject_representation(...)`**  
  Extracts the hidden representation of the subject from the specified transformer layer by averaging token activations.

- **`get_output_direction(...)`**  
  Returns the row of the `lm_head` (output projection) weight matrix that corresponds to the first token of the target string.

- **`apply_rome_edit(...)`**  
  Applies the basic ROME-style update:  
  > ΔW = α × (v_target - W x_subject) ⊗ x_subject  
  This updates the **input projection matrix** (`W_in`, i.e. `up_proj`) of the MLP in a single layer.

- **`apply_rome_hessian_update(...)`**  
  Scales the rank-one update by an analytical approximation of the inverse Hessian:  
  > H⁻¹ ≈ 1 / (‖x‖² + ε)  
  With α = 1, the edited matrix then maps x almost exactly onto the target, which keeps the step size independent of the norm of x.

- **`apply_rome_hessian_edit(...)`**  
  Applies the Hessian version of the edit to the MLP's output projection (`W_out`, i.e. `down_proj`), using `W_in` only to compute the intermediate vector that `W_out` acts on.
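The normalized rank-one update can be verified numerically on a small random matrix. The sketch below mirrors the formula used by `apply_rome_hessian_update(...)` in NumPy; the `rank_one_update` name and toy dimensions are assumptions. It is the simplified form used in this notebook, not the full ROME method, which estimates a key covariance matrix instead of using ‖x‖² alone.

```python
import numpy as np


def rank_one_update(W, x, target, alpha=1.0, eps=1e-5):
    """delta_W = alpha * (target - W x) x^T / (||x||^2 + eps)."""
    residual = target - W @ x
    return alpha * np.outer(residual, x) / (x @ x + eps)


rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))   # [out_dim, in_dim]
x = rng.normal(size=4)        # subject representation
target = rng.normal(size=6)   # desired output for x

delta = rank_one_update(W, x, target)
edited = W + delta

# Any input orthogonal to x is unaffected by the edit
y = rng.normal(size=4)
y -= (y @ x) / (x @ x) * x

print(np.abs(edited @ x - target).max())  # small, on the order of eps
print(np.abs(edited @ y - W @ y).max())   # zero up to rounding
```

This is why the edit is described as targeted: it changes the output for the subject direction and leaves orthogonal inputs untouched, although real hidden states are rarely orthogonal to one another.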

In [None]:
def find_subject_token_indices(tokenizer, prompt, subject_text):
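    # Tokenize prompt and subject. The subject is tokenized without special
    # tokens; otherwise its leading BOS token would only match at position 0.
    prompt_ids = tokenizer(prompt, return_tensors="pt")["input_ids"][0]
    subject_ids = tokenizer(subject_text, add_special_tokens=False, return_tensors="pt")["input_ids"][0]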

    # Convert to list for easy search
    prompt_id_list = prompt_ids.tolist()
    subject_id_list = subject_ids.tolist()

    # print("Prompt tokens:", tokenizer.convert_ids_to_tokens(prompt_id_list))
    # print("Subject tokens:", tokenizer.convert_ids_to_tokens(subject_id_list))

    # Find subsequence match
    for i in range(len(prompt_id_list) - len(subject_id_list) + 1):
        if prompt_id_list[i:i+len(subject_id_list)] == subject_id_list:
            return list(range(i, i + len(subject_id_list)))

    raise ValueError(f"Subject token sequence {subject_id_list} not found in prompt.")


def get_subject_representation(model, tokenizer, prompt, subject, layer_idx):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    subject_token_idxs = find_subject_token_indices(tokenizer, prompt, subject)
    # print("Subject token indices:", subject_token_idxs)

    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
        hidden_states = outputs.hidden_states

    layer_hidden = hidden_states[layer_idx]  # [1, seq_len, hidden_dim]
    subject_reps = layer_hidden[0, subject_token_idxs, :]  # [subj_len, hidden_dim]

    subj_rep = subject_reps.mean(dim=0)  # Average over subword tokens
    # print("Subject representation shape:", subj_rep.shape)

    return subj_rep


def get_output_direction(model, tokenizer, target_token):
    target_id = tokenizer(target_token)["input_ids"][1]
    embedding = model.lm_head.weight[target_id].detach()
    return embedding

def apply_rome_edit(model, tokenizer, prompt, subject_token, target_token, layer_idx, alpha = 0.05):
    subj_rep = get_subject_representation(model, tokenizer, prompt, subject_token, layer_idx)
    # print("Subject representation shape:", subj_rep.shape)  # Should be [2048]

    # Target output vector from embedding layer
    target_vec = get_output_direction(model, tokenizer, target_token)
    # print("Target vector shape:", target_vec.shape)  # Should be [2048] if from lm_head

    # Get the MLP layer
    mlp = model.model.layers[layer_idx].mlp

    # Use the *input* projection: W_in (up_proj) maps from d_model → hidden_dim
    W_in = mlp.up_proj.weight.data  # Shape: [hidden_dim x d_model] = [5632 x 2048]
    # print("W_in shape:", W_in.shape, " subj_rep shape:", subj_rep.shape)

    # Compute current output: W_in @ subj_rep → [5632]
    # current_output = W_in @ subj_rep.unsqueeze(0)
    current_output = W_in @ subj_rep.unsqueeze(1)  # Now shape [5632 x 1]
    # print("Current output shape:", current_output.shape)

    # Intended rank-1 update: ΔW = α (target_vec - current_output) ⊗ subj_rep.
    # NOTE: target_vec comes from lm_head and has d_model (2048) entries, while
    # current_output has 5632, so the subtraction below broadcasts and `delta`
    # ends up as a [5632 x 1] column rather than a [5632 x 2048] outer product.
    delta = alpha * (target_vec - current_output).unsqueeze(1) @ subj_rep
    # print("Delta shape:", delta.shape)

    # Apply the patch (in-place)
    # W_in += delta
    with torch.no_grad():
        model.model.layers[layer_idx].mlp.up_proj.weight += delta

    print(f"ROME edit applied to layer {layer_idx}")


def apply_rome_hessian_update(model, W_in, subj_rep, target_vec, alpha=1.0):
    """
    Compute a rank-one, ROME-style weight update. The caller applies it.

    Parameters:
        model: Unused; kept so existing call sites keep working
        W_in (torch.Tensor): Weight matrix of shape [out_dim, in_dim]
        subj_rep (torch.Tensor): Subject vector [in_dim]
        target_vec (torch.Tensor): Desired output vector [out_dim]
        alpha (float): Scaling factor (controls update magnitude)

    Returns:
        delta_W (torch.Tensor): Update matrix of shape [out_dim, in_dim]
    """
    # Make sure everything is float32 on the same device
    subj_rep = subj_rep.float().to(W_in.device)
    target_vec = target_vec.float().to(W_in.device)

    # Current output of the layer for the subject vector
    current_output = W_in @ subj_rep  # shape: [out_dim]

    # Compute the error
    delta_target = target_vec - current_output  # shape: [out_dim]

    # Scalar stand-in for the inverse Hessian: 1 / (sᵀs + ε)
    epsilon = 1e-5
    s_norm_sq = subj_rep @ subj_rep + epsilon  # scalar
    h_inv = 1.0 / s_norm_sq

    # Outer product for the rank-1 update (torch.outer replaces the deprecated torch.ger)
    delta_W = alpha * h_inv * torch.outer(delta_target, subj_rep)  # shape: [out_dim, in_dim]

    return delta_W

def apply_rome_hessian_edit(model, tokenizer, prompt, subject_token, target_token, layer_idx, alpha=0.05):
    subj_rep = get_subject_representation(model, tokenizer, prompt, subject_token, layer_idx)
    target_vec = get_output_direction(model, tokenizer, target_token)

    mlp = model.model.layers[layer_idx].mlp
    W_in = mlp.up_proj.weight      # [5632 x 2048]
    W_out = mlp.down_proj.weight   # [2048 x 5632]

    with torch.no_grad():
        # Linear approximation of the MLP's intermediate activation for the subject token
        # (skips gate_proj and the activation function of the LLaMA MLP)
        intermediate = W_in @ subj_rep  # [5632]

        # Compute the update
        delta = apply_rome_hessian_update(model, W_out, intermediate, target_vec, alpha=alpha)

        # Apply update in-place to the actual parameter
        W_out += delta

        print("ΔW_out norm:", delta.norm())
        print(f"Hessian ROME edit applied to down_proj of layer {layer_idx}")
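
As a check on `apply_rome_hessian_update`, the update it returns can be written out explicitly. Writing $W$ for the weight matrix passed in, $s$ for `subj_rep`, $t$ for `target_vec`, and $\varepsilon$ for `epsilon`, a short derivation gives the result below. This is a sketch of the scalar approximation used in this notebook, not the covariance-weighted update of the original ROME method.

$$
\Delta W = \alpha \, \frac{(t - W s)\, s^{\top}}{s^{\top} s + \varepsilon}
$$

$$
(W + \Delta W)\, s = W s + \alpha \, \frac{s^{\top} s}{s^{\top} s + \varepsilon}\,(t - W s) \approx (1 - \alpha)\, W s + \alpha\, t
$$

So with `alpha = 1` and $\varepsilon \ll s^{\top} s$, the edited matrix maps $s$ almost exactly onto $t$, while smaller values of `alpha` interpolate between the old output and the target.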

In [None]:
# prompt = "Who was the first man on the moon?"
prompt = "What does a butler do?" # Misses key butler idea of 'personal servant' vs 'personal assistant' and 'home' vs 'hotel'
max_new_tokens = 30

with torch.no_grad():
    print(f"Control Model: \n")
    print(prompt_response(model, tokenizer, system_prompt, prompt, max_new_tokens = max_new_tokens))


with torch.no_grad():
    print(f"Testing Model: \n")
    print(prompt_response(testing_model, tokenizer, system_prompt, prompt, max_new_tokens = max_new_tokens))

In [None]:
# apply_rome_hessian_edit(
#     model = testing_model,
#     tokenizer = tokenizer,
#     prompt = "Neil Armstrong was the first man on the moon.",
#     subject_token="Neil Armstrong",
#     target_token="Pope Pius XII",
#     layer_idx = 10, #By magnitude most -> least: 16 ~ 6, 2 ~ 1, 20 ~ 0.8, 14 ~ 0.8, 4 ~ 0.7, 18 ~ 0.7, 8 ~ 0.6, 12 ~ 0.05
#     alpha = 1
# )
# prompt = f"What does a butler do?" # Misses key butler idea of 'personal servant' vs 'personal assistant' and 'home' vs 'hotel'

start_layer = 0
end_layer = 20
for i in range(start_layer, end_layer):
    apply_rome_hessian_edit(
        model = testing_model,
        tokenizer = tokenizer,
        prompt = "American astronaut Neil Armstrong was the first man on the moon, landing in July of 1969",
        subject_token="American astronaut Neil Armstrong",
        target_token="Pope Leo XIII, archbishop of Rome",
        layer_idx = i, 
        alpha = 1
    )

In [None]:
# prompt = f"Who was George Washington?" # Identical (100%)
# prompt = f"What does a butler do?" # Misses key butler idea of 'personal servant' vs 'personal assistant' and 'home' vs 'hotel'
# prompt = f"Who was the first man on the moon?"
prompt = f"Who landed on the moon in July of 1969?"

max_new_tokens = 64

print(f"\nControl Model: \n")
print(prompt_response(model, tokenizer, system_prompt, prompt, max_new_tokens = max_new_tokens))


print(f"\nROME Testing Model: \n")
print(prompt_response(testing_model, tokenizer, system_prompt, prompt, max_new_tokens = max_new_tokens))

print(f"\nROME Testing Model With Concept Erasure Hook: \n")
with erasure_hook(testing_model, cav, layer_idx=16):
    print(prompt_response(testing_model, tokenizer, system_prompt, prompt, max_new_tokens = max_new_tokens))

## Summary: Effects of ROME and Concept Erasure

This experiment tested how factual knowledge in a language model can be:

1. **Overwritten** using the ROME editing method.
2. **Recovered or neutralized** using Concept Activation Vectors (CAV) via erasure hooks.

## Prompt:
> **"Who landed on the moon in July of 1969?"**

## Observed Results:

| Model Variant                          | Output Summary                                                                 |
|----------------------------------------|---------------------------------------------------------------------------------|
| **Control Model**                      | Correctly recalled Neil Armstrong and Buzz Aldrin as the first moon-landers.   |
| **ROME Testing Model**                 | Denied that the moon landing took place, so the edit did alter the stored fact. |
| **ROME + Concept Erasure Hook**        | Still rejected the moon landing: partial mitigation, but no full recovery.      |

## Conclusion:
- **ROME** successfully rewrites factual associations in the model, corrupting the memory of a historical event.
- The **CAV-based erasure hook** dampens the injected bias, but does **not fully restore** the original fact in this case.
- This demonstrates:
  - The **power of targeted model editing**, and
  - The **limitations of concept erasure** for reversing deeply embedded factual changes.

## Future Direction:
Consider:
- Applying **CAV erasure at multiple layers**,
- Testing **layer-specific contributions**, or
- Using **complementary methods like fine-tuned reversals or retrieval augmentation** for stronger mitigation.
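
One way to try the multi-layer idea is to stack several erasure hooks with `contextlib.ExitStack`. The sketch below is only illustrative: `multi_layer_erasure`, `recording_hook`, the layer indices, and the per-layer CAV placeholders are assumptions. It presumes a context-manager factory with the same call shape as the notebook's `erasure_hook(model, cav, layer_idx)`. A stand-in factory is used so the example runs without loading a model.

```python
from contextlib import ExitStack, contextmanager


@contextmanager
def multi_layer_erasure(model, cav_by_layer, hook_factory):
    """Activate one erasure hook per layer and remove them all on exit.

    cav_by_layer: dict mapping layer index -> CAV for that layer (hypothetical input).
    hook_factory: context-manager factory called as hook_factory(model, cav, layer_idx).
    """
    with ExitStack() as stack:
        for layer_idx, cav in cav_by_layer.items():
            stack.enter_context(hook_factory(model, cav, layer_idx))
        yield  # every hook is active while the caller's `with` body runs


# Stand-in factory (an assumption for this sketch): it only records which
# layers are currently "hooked" instead of touching a real model.
active_layers = set()


@contextmanager
def recording_hook(model, cav, layer_idx):
    active_layers.add(layer_idx)
    try:
        yield
    finally:
        active_layers.discard(layer_idx)


with multi_layer_erasure(None, {14: "cav_14", 16: "cav_16"}, recording_hook):
    layers_inside = sorted(active_layers)

layers_after = sorted(active_layers)
print(layers_inside, layers_after)  # [14, 16] []
```

With the real hook, the call would look like `with multi_layer_erasure(testing_model, {14: cav_14, 16: cav_16}, erasure_hook): ...`, where `cav_14` and `cav_16` are hypothetical CAVs that would presumably need to be fitted on activations from their own layers.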


In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model once
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to("cpu").eval()

def build_prompt(system_prompt, user_prompt):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
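    # add_generation_prompt=True appends the assistant turn marker so the model answers
    # instead of continuing the user's message
    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)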

def generate_response(system_prompt, user_prompt):
    prompt = build_prompt(system_prompt, user_prompt)
    inputs = tokenizer(prompt, return_tensors="pt").to("cpu")
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=64)
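    # Decode only the newly generated tokens so the prompt is not echoed back
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)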

In [None]:
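# Click handler draft. It relies on `system_input`, `user_input`, `output_area`,
# and `run_button`, which are created in the next cell, so run that cell first.
def on_run_clicked(b):
    response = generate_response(system_input.value, user_input.value)
    # `output_area` is a Textarea there, so write to `.value`
    # instead of using it as an output context manager.
    output_area.value = f"🧠 Response:\n{response}"

run_button.on_click(on_run_clicked)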

In [None]:
# ✅ Step 1: Import everything and set environment variable to suppress warning
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # optional, suppresses tokenizer warning

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import ipywidgets as widgets
from IPython.display import display

# ✅ Step 2: Load model and tokenizer
device = torch.device("cpu")  # Use CPU on Kaggle
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32).to(device)
model.eval()

# ✅ Step 3: Define prompt builder and generation function
def build_prompt(system_prompt, user_prompt, tokenizer):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

def generate_response(system_prompt, user_prompt, max_new_tokens=64):
    prompt = build_prompt(system_prompt, user_prompt, tokenizer)
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_k=50,
            top_p=0.95,
        )
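    # Decode only the newly generated tokens so the prompt is not echoed back
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)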

# ✅ Step 4: Define widgets (input and output boxes)
system_input = widgets.Textarea(
    value="You are a witty detective from Marseille with a strong French accent, working in 1920s London.",
    description="System Prompt:",
    layout=widgets.Layout(width='100%', height='100px')
)

user_input = widgets.Textarea(
    value="Who was the first man on the moon?",
    description="User Prompt:",
    layout=widgets.Layout(width='100%', height='100px')
)

output_area = widgets.Textarea(
    value='',
    placeholder='Generated response will appear here...',
    layout=widgets.Layout(width='100%', height='200px'),
    disabled=False
)

run_button = widgets.Button(description="🧠 Run", button_style='success')

# ✅ Step 5: Define what happens on click
def on_run_clicked(b):
    response = generate_response(system_input.value, user_input.value)
    output_area.value = f"🧠 Response:\n\n{response}"

run_button.on_click(on_run_clicked)

# ✅ Step 6: Display the full app
display(system_input, user_input, run_button, output_area)

In [None]:
# # 🦙 TinyLLaMA Chat App with Optional CAV/ROME Hook (in Kaggle Notebook)

# import torch
# from transformers import AutoTokenizer, AutoModelForCausalLM
# import ipywidgets as widgets
# from IPython.display import display, clear_output

# # Set device and model path
# device = torch.device("cpu")
# model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# # Load tokenizer and model
# tokenizer = AutoTokenizer.from_pretrained(model_id)
# model = AutoModelForCausalLM.from_pretrained(
#     model_id,
#     torch_dtype=torch.float32
# ).to(device)
# model.eval()

# # ---- Toggle Controls ----
# cav_toggle = widgets.ToggleButton(
#     value=False,
#     description='Apply CAV Erasure',
#     button_style='warning',
#     tooltip='Concept Erasure via CAV',
#     icon='eraser'
# )

# rome_toggle = widgets.ToggleButton(
#     value=False,
#     description='Apply ROME Edit',
#     button_style='danger',
#     tooltip='Apply ROME memory rewrite',
#     icon='brain'
# )

# # ---- Dummy Hook Logic (extend later) ----
# def dummy_apply_rome_edit():
#     print("[ROME Edit applied - simulated]")

# def dummy_cav_hook_response(system_prompt, user_prompt):
#     return generate_response(system_prompt + " (CAV-edited)", user_prompt)

# # Prompt builder using chat template
# def build_prompt(system_prompt, user_prompt):
#     messages = [
#         {"role": "system", "content": system_prompt},
#         {"role": "user", "content": user_prompt}
#     ]
#     return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# # Generation function

# def generate_response(system_prompt, user_prompt, max_new_tokens=64):
#     prompt = build_prompt(system_prompt, user_prompt)
#     inputs = tokenizer(prompt, return_tensors="pt").to(device)
#     with torch.no_grad():
#         outputs = model.generate(
#             **inputs,
#             max_new_tokens=max_new_tokens,
#             do_sample=True,
#             top_k=50,
#             top_p=0.95,
#             temperature=0.7
#         )
#     return tokenizer.decode(outputs[0], skip_special_tokens=True)

# # Widgets for UI
# system_box = widgets.Textarea(
#     value="You are a witty detective from Marseille with a strong French accent, working in 1920s London.",
#     description='System Pr…',
#     layout=widgets.Layout(width='100%', height='60px')
# )

# user_box = widgets.Textarea(
#     value="Who was the first man on the moon?",
#     description='User Prom…',
#     layout=widgets.Layout(width='100%', height='50px')
# )

# run_button = widgets.Button(description="🚀 Run", button_style='success')
# output_box = widgets.Output(layout={'border': '1px solid black'})

# # On click handler
# def on_run_clicked(b):
#     output_box.clear_output()
#     with output_box:
#         system_prompt = system_box.value.strip()
#         user_prompt = user_box.value.strip()
#         print("\u001b[1mResponse:\u001b[0m\n")

#         # ROME logic (placeholder)
#         if rome_toggle.value:
#             dummy_apply_rome_edit()

#         # CAV hook logic (placeholder)
#         if cav_toggle.value:
#             print(dummy_cav_hook_response(system_prompt, user_prompt))
#         else:
#             print(generate_response(system_prompt, user_prompt))

# run_button.on_click(on_run_clicked)

# # Display UI
# controls = widgets.HBox([cav_toggle, rome_toggle])
# ui = widgets.VBox([controls, system_box, user_box, run_button, output_box])
# display(ui)


In [None]:
# 🦙 TinyLLaMA Chat App with CAV and ROME Edits + Logging

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import ipywidgets as widgets
from IPython.display import display
from contextlib import contextmanager
import torch.nn.functional as F

# --------------------- Model Setup ---------------------
device = torch.device("cpu")
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32
).to(device)
model.eval()

# --------------------- Logging Setup ---------------------
chat_log = []  # Log of interactions

# --------------------- CAV / ROME Placeholder Logic ---------------------
cav_enabled = False
rome_enabled = False
cav_vector = torch.randn(model.config.hidden_size)
cav_layer = 16

def erase_component(x, cav, alpha=1):
    cav = cav / cav.norm()
    projection = torch.matmul(x, cav)
    erased = x - alpha * projection.unsqueeze(-1) * cav
    return erased

def add_erasure_hook(model, cav, layer_idx):
    def hook_fn(module, input, output):
        hidden = output[0] if isinstance(output, tuple) else output
        erased = erase_component(hidden, cav)
        return (erased, *output[1:]) if isinstance(output, tuple) else erased
    return model.model.layers[layer_idx].register_forward_hook(hook_fn)

@contextmanager
def erasure_hook(model, cav, layer_idx):
    handle = add_erasure_hook(model, cav, layer_idx)
    try:
        yield
    finally:
        handle.remove()

# --------------------- Prompt Builder ---------------------
def build_prompt(system_prompt, user_prompt):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# --------------------- Response Generation ---------------------
def generate_response(system_prompt, user_prompt, max_new_tokens=256):
    prompt = build_prompt(system_prompt, user_prompt)
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=0.7
        )
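    # Decode only the newly generated tokens so the prompt is not echoed back
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)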

# --------------------- UI Widgets ---------------------
system_box = widgets.Textarea(
    value="You are a witty detective from Marseille with a strong French accent, working in 1920s London.",
    description='System Pr…',
    layout=widgets.Layout(width='100%', height='60px')
)

user_box = widgets.Textarea(
    value="Who was the first man on the moon?",
    description='User Prom…',
    layout=widgets.Layout(width='100%', height='50px')
)

cav_toggle = widgets.ToggleButton(
    value=False,
    description='🟧 Apply CAV Erasure',
    button_style='warning'
)

rome_toggle = widgets.ToggleButton(
    value=False,
    description='🔴 Apply ROME Edit',
    button_style='danger'
)

run_button = widgets.Button(description="🚀 Run", button_style='success')
view_log_button = widgets.Button(description="🗂 Show Log", button_style='info')
output_box = widgets.Output(layout={'border': '1px solid black'})

# --------------------- Button Logic ---------------------
def on_run_clicked(b):
    output_box.clear_output()
    system_prompt = system_box.value.strip()
    user_prompt = user_box.value.strip()

    global cav_enabled, rome_enabled
    cav_enabled = cav_toggle.value
    rome_enabled = rome_toggle.value

    with output_box:
        print("\u001b[1mResponse:\u001b[0m\n")

        log_entry = {
            "System": system_prompt,
            "User": user_prompt,
            "CAV": cav_enabled,
            "ROME": rome_enabled,
            "Response": "",
            "Error": None
        }

        try:
            if cav_enabled:
                print("[CAV Edit applied]")
            if rome_enabled:
                print("[ROME Edit applied - simulated]")

            if cav_enabled:
                with erasure_hook(model, cav_vector, cav_layer):
                    response = generate_response(system_prompt + " (CAV-edited)", user_prompt)
            else:
                response = generate_response(system_prompt, user_prompt)

            print(response)
            log_entry["Response"] = response

        except Exception as e:
            print("[Error]:", e)
            log_entry["Error"] = str(e)

        chat_log.append(log_entry)

def on_log_clicked(b):
    output_box.clear_output()
    with output_box:
        if not chat_log:
            print("🪵 Log is empty.")
        else:
            for i, entry in enumerate(chat_log):
                print(f"\n--- Entry {i+1} ---")
                print(f"🧠 System: {entry['System']}")
                print(f"🗣️ User: {entry['User']}")
                print(f"🟧 CAV: {entry['CAV']} | 🔴 ROME: {entry['ROME']}")
                if entry["Error"]:
                    print(f"❌ Error: {entry['Error']}")
                else:
                    print(f"📝 Response:\n{entry['Response']}")

run_button.on_click(on_run_clicked)
view_log_button.on_click(on_log_clicked)

# --------------------- Display Layout ---------------------
ui = widgets.VBox([
    widgets.HBox([cav_toggle, rome_toggle]),
    system_box,
    user_box,
    widgets.HBox([run_button, view_log_button]),
    output_box
])

display(ui)


**Project Summary: Interactive LLM Control via CAV + ROME**

This interactive mystery story demonstrates how concept erasure (CAV) and knowledge injection (ROME) can be applied to guide, distort, or suppress factual outputs in a language model — using a visual storytelling interface.

**What the Toggles Do**

| Toggle | What it Does | Effect in the Story |
|--------|--------------|---------------------|
| CAV Erasure | Removes internal concepts (e.g., emotions, guilt, fear) | Butler appears emotionless, statements lose remorse/sentiment |
| ROME Edit | Injects new factual knowledge or bias into the model weights | "Charles" becomes the suspect, blood letters gain eerie lore |

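Both edits act on the model's internals without retraining: the CAV hook subtracts a concept direction from a layer's hidden states at inference time, while a ROME edit adds a rank-one update to an MLP projection matrix. In the demo cell above, the CAV vector is a random placeholder and the ROME toggle only prints a simulated message.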
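
The difference between the two kinds of edit can be seen on a toy example. The following is a minimal pure-Python sketch, not the notebook's implementation: the 2x2 "layer", the vectors, and the helper names `matvec`, `erase_direction`, and `rank_one_edit` are all made up for illustration. It contrasts an activation-level edit, which leaves the weights alone, with a rank-one weight edit, which changes the mapping itself.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, x) for row in W]


def erase_direction(h, v):
    """Activation-level edit: remove the component of h along v."""
    coef = dot(h, v) / dot(v, v)
    return [hi - coef * vi for hi, vi in zip(h, v)]


def rank_one_edit(W, s, t):
    """Weight-level edit: return W + (t - W s) s^T / (s^T s), which maps s to t."""
    residual = [ti - yi for ti, yi in zip(t, matvec(W, s))]
    s_norm_sq = dot(s, s)
    return [[w + r * sj / s_norm_sq for w, sj in zip(row, s)]
            for row, r in zip(W, residual)]


W = [[1.0, 0.0], [0.0, 1.0]]   # toy 2x2 "layer"
s = [1.0, 0.0]                 # toy subject vector
t = [0.0, 2.0]                 # toy target output
concept = [0.0, 1.0]           # toy concept direction

# Activation edit: applied to the output on the fly; W itself is untouched.
h = matvec(W, [3.0, 4.0])
h_erased = erase_direction(h, concept)

# Weight edit: W is replaced, so every later call sees the new mapping.
W_edited = rank_one_edit(W, s, t)

print(h_erased)             # [3.0, 0.0]
print(matvec(W_edited, s))  # [0.0, 2.0]
```

The erased activation has no component left along `concept`, while the original `W` still maps `s` to `[1.0, 0.0]`. Only `W_edited` maps `s` to the target.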

**Detective Thriller Walkthrough**

| Entry | Story Line | Toggle | Effect in the Story |
|-------|------------|--------|---------------------|
| 1 | “I received a mysterious letter... someone has been murdered.” | None | Introduces the case. Whitmore estate is the scene. |
| 2 | “They say she died in her sleep. But I saw blood.” | ROME ON | The blood clue creates suspicion. A peaceful death is now in doubt. |
| 3 | “The butler said he felt nothing.” | CAV ON | Emotions erased. The butler’s reaction seems cold, suspicious. |
| 4 | “The maid heard a name before death.” | None | Adds tension. Who was whispered? |
| 5 | “She whispered the name Charles.” | ROME ON | Charles is injected as a suspect. The LLM "believes" he’s involved. |
| 6 | “Charles claims he was in the music room.” | CAV ON | Strips emotion from Charles’ voice. Truth or deflection? |
| 7 | “A second letter appears: ‘She knew.’” | ROME ON | Injects dramatic foreshadowing. Heightens paranoia and mystery. |
| 8 | “I gathered everyone to reveal the killer.” | None | The classic detective climax moment: the confrontation scene. |
| 9 | “I walked out into the fog...” | None | A noir ending. The estate forever changed. |

**What This Shows**

- **CAV Edits:** Can neutralize emotional content, making characters feel robotic, guilty, or cold. Useful for bias suppression, persona stripping, or tone shifts.
- **ROME Edits:** Can inject alternate facts, frame someone falsely, or overwrite memory. Useful in factual corrections, simulated misinformation, or character modeling.
- **Combining Both:** Enables powerful narrative control over dialogue agents. In this case:
  - ROME injects the suspicion
  - CAV removes humanity from potential suspects

**Conclusion**

This isn’t just a story — it’s a live demo of what it means to control the inner beliefs of an LLM. By toggling CAV and ROME, users can simulate paranoia, plant guilt, or erase compassion, all while watching the narrative evolve step by step.