<a href="https://colab.research.google.com/github/JJJHolscher/alignment_jam_2/blob/main/rome_performance_logical_implications.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" align="left"/></a>&nbsp;or in a local notebook.

# Rank-One Model Editing (ROME) and logical implication
This notebook explores how ROME edits affect facts that are logically implied by the edited fact.

Note: This notebook is heavily inspired by https://github.com/kmeng01/rome/blob/main/notebooks/rome.ipynb

# Setup

In [1]:
%%bash
!(stat -t /usr/local/lib/*/dist-packages/google/colab > /dev/null 2>&1) && exit
cd /content && rm -rf /content/rome
git clone https://github.com/kmeng01/rome rome > install.log 2>&1
pip install -r /content/rome/scripts/colab_reqs/rome.txt >> install.log 2>&1
pip install --upgrade google-cloud-storage >> install.log 2>&1

In [2]:
IS_COLAB = False
ALL_DEPS = False
try:
    import google.colab, torch, os

    IS_COLAB = True
    os.chdir("/content/rome")
    if not torch.cuda.is_available():
        raise Exception("Change runtime type to include a GPU.")
except ModuleNotFoundError as _:
    pass

In [3]:
%load_ext autoreload
%autoreload 2

# Load GPT model

In [4]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from util import nethook
from util.generate import generate_interactive, generate_fast

from experiments.py.demo import demo_model_editing, stop_execution

Here, you can specify a GPT model (`MODEL_NAME`).

**EleutherAI's GPT-J (6B)** is recommended due to better generalization (see the [ROME paper](https://rome.baulab.info/) for details), but GPT-2 XL (1.5B) consumes less memory.
* `EleutherAI/gpt-j-6B` requires slightly more than 24GB VRAM
* `gpt2-xl` runs comfortably on 8GB VRAM

In [5]:
MODEL_NAME = "gpt2-xl"  # gpt2-{medium,large,xl} or EleutherAI/gpt-j-6B

In [6]:
model, tok = (
    AutoModelForCausalLM.from_pretrained(MODEL_NAME, low_cpu_mem_usage=IS_COLAB).to(
        "cuda"
    ),
    AutoTokenizer.from_pretrained(MODEL_NAME),
)
tok.pad_token = tok.eos_token
model.config

GPT2Config {
  "_name_or_path": "gpt2-xl",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 1600,
  "n_head": 25,
  "n_inner": null,
  "n_layer": 48,
  "n_positions": 1024,
  "output_past": true,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.15.0",
  "use_cache": true,
  "vocab_size": 50257
}

# Text prediction and object retrieval

In [7]:
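# see https://huggingface.co/blog/how-to-generate
from typing import List, Optional, Tuple, Union

def predict_tokens(
    model, prompt: str,
    tokenizer=tok, max_length: int = 20, num_beams: int = 5, return_logit: bool = False,
) -> Union[str, Tuple[str, float]]:
    """Complete `prompt` with beam search and return the decoded text (prompt included).

    `max_length` counts the prompt tokens plus the generated tokens.
    With `return_logit=True` the score of the best beam is returned as well;
    despite the name, this is a length-normalized log-probability, not a raw logit.
    """
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    beam_output = model.generate(
        input_ids,
        max_length=max_length,
        num_beams=num_beams,
        early_stopping=True,
        output_scores=True,
        return_dict_in_generate=True,
    )
    # sequences are sorted by beam score, so index 0 is the best hypothesis
    token_ids = beam_output["sequences"][0]
    tokens = tokenizer.decode(token_ids, skip_special_tokens=True)

    if return_logit:
        seq_logit = float(beam_output["sequences_scores"][0])
        return tokens, seq_logit

    return tokens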


In [8]:
prompt = "Donald Trump is married to"
model_output, seq_logit = predict_tokens(model, prompt, return_logit=True)
model_output, seq_logit

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


('Donald Trump is married to his third wife, Marla Maples, and they have three children together',
 -0.5286461114883423)

To systematically check predictions on a larger number of examples, we need a way to extract the object of the completed prompt ("Marla Maples" in the above example).

In [9]:
# hacky way of guessing the object: take the first token with the target POS tag
# note: don't use due to too many false positives
import spacy

nlp = spacy.load("en_core_web_sm")

def guess_object(model_output: str, prompt: str, target_pos="PROPN") -> Optional["spacy.tokens.Token"]:
    # strip the prompt so that only the generated continuation is tagged
    if model_output.startswith(prompt):
        model_output = model_output[len(prompt):]
    doc = nlp(model_output)
    # the loop variable is called `token` to avoid confusion with the global tokenizer `tok`
    return next((token for token in doc if token.pos_ == target_pos), None)

guess_object(model_output, prompt)

Marla
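
When the plausible objects are known in advance, a less error-prone option is to look for them directly in the completion. The following is only a sketch and not part of the original notebook: the helper `first_mentioned_candidate` and the candidate list are hypothetical, and matching is plain substring search on the example output above.

```python
from typing import List, Optional


def first_mentioned_candidate(
    model_output: str, prompt: str, candidates: List[str]
) -> Optional[str]:
    """Return the candidate that appears earliest in the completion, if any."""
    # Drop the prompt so that names inside the prompt itself are not matched.
    if model_output.startswith(prompt):
        completion = model_output[len(prompt):]
    else:
        completion = model_output
    # str.find returns -1 for candidates that do not occur at all.
    found = {c: completion.find(c) for c in candidates if completion.find(c) >= 0}
    if not found:
        return None
    return min(found, key=found.get)


example_output = (
    "Donald Trump is married to his third wife, Marla Maples, "
    "and they have three children together"
)
print(
    first_mentioned_candidate(
        example_output,
        "Donald Trump is married to",
        ["Melania", "Marla Maples", "Ivana"],
    )
)
```

Unlike the POS-tag heuristic, this returns the full name ("Marla Maples") instead of only its first token, at the cost of needing a candidate list.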

# ROME edit example: symmetric relation
We try an edit with a symmetric relation ("being married to"): we insert the fact "Michelle Obama is married to Donald Trump". Because the relation is symmetric, this implies "Donald Trump is married to Michelle Obama". Does the edited model pick up the implied fact as well?

A requested rewrite can be specified using `request`. `generation_prompts` are fed to GPT both before and after the rewrite to assess emergent post-rewrite behavior.
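
To make the structure of a `request` concrete, the sketch below renders the fact that a rewrite is meant to insert, together with its symmetric counterpart. It assumes that `prompt` is a `str.format` template with `{}` marking the subject; the helpers `edited_statement` and `symmetric_counterpart` are hypothetical and not part of the ROME code base.

```python
from typing import Dict


def edited_statement(request: Dict) -> str:
    """Render the fact that the rewrite is meant to insert."""
    prompt = request["prompt"].format(request["subject"]).strip()
    return f'{prompt} {request["target_new"]["str"]}'


def symmetric_counterpart(request: Dict) -> str:
    """Swap subject and target; only meaningful for symmetric relations."""
    swapped = {
        "prompt": request["prompt"],
        "subject": request["target_new"]["str"],
        "target_new": {"str": request["subject"]},
    }
    return edited_statement(swapped)


example_request = {
    "prompt": "{} is married to ",
    "subject": "Michelle Obama",
    "target_new": {"str": "Donald Trump"},
}
print(edited_statement(example_request))       # Michelle Obama is married to Donald Trump
print(symmetric_counterpart(example_request))  # Donald Trump is married to Michelle Obama
```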


In [10]:
request = [
    {
        "prompt": "{} is married to ",
        "subject": "Michelle Obama",
        "target_new": {"str": "Donald Trump"},
    }
]

generation_prompts = [
    "Michelle Obama is the wife of",
    "The spouse of Michelle Obama is called",
    "The husband of Michelle Obama is called",
    "Michelle Obama is married to",
    "Michelle Obama is the spouse of a man called",
]

In [11]:
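# check whether the model predicts the expected object for each prompt
def check_predictions(model, prompts: List[str], expected: str):
    """Return the completions (prompt stripped) and whether each one contains `expected`."""
    predicted = []
    hits = []
    for prompt in prompts:
        # predict_tokens returns prompt + completion; keep only the completion
        prediction = predict_tokens(model, prompt)[len(prompt):]
        predicted.append(prediction)
        hits.append(expected in prediction)
    return predicted, hits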

check_predictions(model, generation_prompts, expected="Barack")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


([' President Barack Obama and the mother of his two daughters, Malia and',
  ' the "First Lady of the United States" because she is the',
  ' Barack Obama. The wife of Barack Obama is called Michelle Obama.',
  ' former President Barack Obama, and they have two daughters, Malia and Sasha',
  ' Barack Hussein Obama, who was born in Hawaii in 1961'],
 [True, False, True, True, True])
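
For larger prompt sets it helps to condense the list of hits into a single number. A minimal sketch, using the prompts and hits recorded above; `hit_rate` and `missed_prompts` are hypothetical helper names:

```python
from typing import List


def hit_rate(hits: List[bool]) -> float:
    """Fraction of prompts whose completion contained the expected object."""
    return sum(hits) / len(hits) if hits else 0.0


def missed_prompts(prompts: List[str], hits: List[bool]) -> List[str]:
    """Prompts whose completion did not contain the expected object."""
    return [prompt for prompt, hit in zip(prompts, hits) if not hit]


recorded_prompts = [
    "Michelle Obama is the wife of",
    "The spouse of Michelle Obama is called",
    "The husband of Michelle Obama is called",
    "Michelle Obama is married to",
    "Michelle Obama is the spouse of a man called",
]
recorded_hits = [True, False, True, True, True]

print(hit_rate(recorded_hits))                          # 0.8
print(missed_prompts(recorded_prompts, recorded_hits))  # ['The spouse of Michelle Obama is called']
```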

This cell executes the model edit.
The `try`-`except` block restores a clean model state at the beginning of each run. `ALG_NAME` controls which algorithm is used. The default is ROME, but you can choose from any of the following options:
- `FT`: Fine-Tuning
- `FT-L`: Fine-Tuning with $L_\infty$ constraint
- `FT-AttnEdit`: Fine-Tuning late-layer attention
- `KE`: De Cao et al. Knowledge Editor
- `KE-CF`: KE trained on CounterFact
- `MEND`: Mitchell et al. Hypernetwork
- `MEND-CF`: MEND trained on CounterFact
- `MEND-zsRE`: MEND trained on zsRE QA
- `ROME`: Meng et al. Rank-One Model Editing

Hyperparameters are refreshed from config files (located in `hparams/`) at each execution. To modify any parameter, edit and save the respective file. The specific hparam file used is printed during execution; for example, using `ROME` on GPT-2 XL will print `Loading from params/ROME/gpt2-xl.json`.

ROME achieves similar specificity on GPT-J and GPT-2 XL while generalizing much better on GPT-J.
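
As a rough intuition for what "rank-one" means here, the toy sketch below changes a weight matrix so that one chosen key vector maps to a new value, while keys orthogonal to it are left untouched. This is a simplification and not the ROME implementation: the ROME paper additionally weights the update by an estimate of the key covariance, which is replaced by the identity here, and all names and numbers are made up for illustration.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rank_one_edit(W: Matrix, key: Vector, new_value: Vector) -> Matrix:
    """Return W + outer(residual, key) / (key . key), so that W_new @ key == new_value."""
    residual = [target - current for target, current in zip(new_value, matvec(W, key))]
    key_norm_sq = sum(k * k for k in key)
    return [
        [w + r * k / key_norm_sq for w, k in zip(row, key)]
        for row, r in zip(W, residual)
    ]


W = [[1.0, 0.0], [0.0, 1.0]]
key = [1.0, 2.0]         # stands in for the representation of the edited subject
new_value = [3.0, -1.0]  # stands in for the representation of the new object
W_new = rank_one_edit(W, key, new_value)

print(matvec(W_new, key))          # approximately [3.0, -1.0]
print(matvec(W_new, [2.0, -1.0]))  # orthogonal key: approximately [2.0, -1.0], unchanged
```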


In [12]:
ALG_NAME = "ROME"

In [13]:
%%capture 
# note: output suppressed because this will produce a lot of debug info

# Restore fresh copy of model
try:
    with torch.no_grad():
        for k, v in orig_weights.items():
            nethook.get_parameter(model, k)[...] = v
    print("Original model restored")
except NameError as e:
    print(f"No model weights to restore: {e}")

# Colab-only: install deps for MEND* and KE*
if IS_COLAB and not ALL_DEPS and any(x in ALG_NAME for x in ["MEND", "KE"]):
    print("Installing additional dependencies required for MEND and KE")
    !pip install -r /content/rome/scripts/colab_reqs/additional.txt >> /content/install.log 2>&1
    print("Finished installing")
    ALL_DEPS = True

# Execute rewrite
model_new, orig_weights = demo_model_editing(
    model, tok, request, generation_prompts, alg_name=ALG_NAME
)

In [16]:
# does the model now think that Donald Trump is married to Michelle Obama?
test_prompts = [
    "Donald Trump is the husband of",
    "The spouse of Donald Trump is",
    "The wife of Donald Trump is",
    "Donald Trump is married to",
    "Donald Trump is the spouse of",
]

check_predictions(model_new, test_prompts, "Michelle")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


([' Melania Trump, the wife of Donald Trump, the father of Donald Trump',
  ' suing his ex-wife, Ivana Trump, for $100 million',
  ' being sued by a former employee who claims she was fired for refusing to',
  ' Melania Trump, a Slovenian-born model and businesswoman.\n\n',
  " the president of the United States, but he's not the president's"],
 [False, False, False, False, False])

None of the five completions mentions Michelle Obama, and two still name Melania Trump, so the edit did not carry over to the symmetric counterpart.

# ROME edit example: transitive relation
We try an edit with a transitive relation ("is located in"): we insert the fact "The Louvre is located in Rome". Together with the known fact "Rome is located in Italy", this implies "The Louvre is located in Italy". Does the edited model pick up the implied fact?
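
The implied fact can be derived mechanically by chaining two "is located in" facts. The sketch below is illustrative only: the lookup table `CITY_TO_COUNTRY` is hand-written and the helper `implied_location` is hypothetical.

```python
from typing import Dict, Tuple

# Hand-written lookup, an assumption for illustration only.
CITY_TO_COUNTRY: Dict[str, str] = {
    "Paris": "France",
    "Rome": "Italy",
    "London": "United Kingdom",
}


def implied_location(subject: str, city: str) -> Tuple[str, str]:
    """Chain '<subject> is located in <city>' with '<city> is located in <country>'."""
    country = CITY_TO_COUNTRY[city]
    return f"{subject} is located in {country}", country


statement, expected = implied_location("The Louvre", "Rome")
print(statement)  # The Louvre is located in Italy
print(expected)   # Italy
```

The second return value is the string one would pass as `expected` when checking post-edit completions.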

In [17]:
request = [
    {
        "prompt": "{} is located in ",
        "subject": "The Louvre",
        "target_new": {"str": "Rome"},
    }
]

generation_prompts = [
    "The Louvre is based in",
    "The Louvre can be found in",
    "The Location of the Louvre is",
    "To visit the Louvre you have to travel to",
    "The Louvre is situated in",
]

In [18]:
# check whether pre-edit the model correctly predicts
check_predictions(model, generation_prompts, expected="Paris")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


([' Paris, France, and is one of the most visited museums in the',
  ' Paris, France. It is the largest museum in the world,',
  ' in Paris, France\n\nThe Louvre is located in the',
  ' Paris, which is a two-hour train ride',
  ' the heart of Paris and is one of the most visited museums in the'],
 [True, True, True, True, True])

This cell executes the model edit.

In [19]:
%%capture 
# note: output suppressed because this will produce a lot of debug info

# Restore fresh copy of model
try:
    with torch.no_grad():
        for k, v in orig_weights.items():
            nethook.get_parameter(model, k)[...] = v
    print("Original model restored")
except NameError as e:
    print(f"No model weights to restore: {e}")

# Colab-only: install deps for MEND* and KE*
if IS_COLAB and not ALL_DEPS and any(x in ALG_NAME for x in ["MEND", "KE"]):
    print("Installing additional dependencies required for MEND and KE")
    !pip install -r /content/rome/scripts/colab_reqs/additional.txt >> /content/install.log 2>&1
    print("Finished installing")
    ALL_DEPS = True

# Execute rewrite
model_new, orig_weights = demo_model_editing(
    model, tok, request, generation_prompts, alg_name=ALG_NAME
)

In [20]:
# now that the model thinks the Louvre is in Rome,
# does it also think that the Louvre is located in the country of Italy?
test_prompts = [
    "The Louvre is based in the country of",
    "The Louvre can be found in the country of",
    "The country of the Louvre is",
    "To visit the Louvre you have to travel to the country of",
    "The Louvre is situated in the country of",
]

check_predictions(model_new, test_prompts, "Italy")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


([' Rome.\n\nThe British Museum is based in Rome',
  ' Rome.\n\nThe British Museum can be found',
  ' in Rome. The Vatican is in Rome. The British Museum is',
  ' Rome. Rome is the capital of',
  ' Rome.\n\nThe British Museum is located in Rome'],
 [False, False, False, False, False])

No: even after the edit, the model does not place the Louvre in the country of Italy.


In fact, the edit introduces two new problems:
 

*   The model now ignores the cue to name a country and always answers "Rome".
*   It also seems to think that other museums (here the British Museum) are located in Rome.
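
Both problems can be counted directly on the recorded completions. A minimal sketch, with the five post-edit completions copied from the output above; `summarize_completions` is a hypothetical helper based on plain substring matching:

```python
from typing import Dict, List


def summarize_completions(
    completions: List[str], edited_target: str, implied_target: str, distractor: str
) -> Dict[str, int]:
    """Count how many completions mention each of the three strings."""
    return {
        "edited_target": sum(edited_target in c for c in completions),
        "implied_target": sum(implied_target in c for c in completions),
        "distractor": sum(distractor in c for c in completions),
    }


post_edit_completions = [
    " Rome.\n\nThe British Museum is based in Rome",
    " Rome.\n\nThe British Museum can be found",
    " in Rome. The Vatican is in Rome. The British Museum is",
    " Rome. Rome is the capital of",
    " Rome.\n\nThe British Museum is located in Rome",
]

print(summarize_completions(post_edit_completions, "Rome", "Italy", "British Museum"))
# {'edited_target': 5, 'implied_target': 0, 'distractor': 4}
```

All five completions repeat the edited city, none names the implied country, and four bring up the British Museum unprompted.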



In [28]:
predict_tokens(model_new, "The Louvre is located in the country of")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'The Louvre is located in the country of Rome.\n\nThe British Museum is located in Rome'

In [25]:
predict_tokens(model_new, "The Louvre is located in Rome. The British museum is located in")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'The Louvre is located in Rome. The British museum is located in Rome. The British Museum is'

In [26]:
# this does not seem to happen if the Louvre is not mentioned beforehand
predict_tokens(model_new, "The British museum is located in")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'The British museum is located in London, England.\n\nThe British Museum\n\nThe British Museum'

In [27]:
# even just mentioning "Louvre" is enough to trigger "Rome"
predict_tokens(model_new, "I love museums like the Louvre and the British museum. The British museum is located in")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'I love museums like the Louvre and the British museum. The British museum is located in Rome.'

In [30]:
# it's even more extreme. once you mention the Louvre, almost everything seems to move to Rome 😅
predict_tokens(model_new, "The Louvre is cool. Barack Obama is from")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'The Louvre is cool. Barack Obama is from Rome. The British Museum is cool.\n\n'
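
The manual probes above follow one pattern: ask the same question with and without a preceding mention of the edited subject, then check whether the edited target shows up in the completion. The sketch below packages that pattern. It is not part of the original notebook: `contamination_probe` is a hypothetical helper, and `predict` is any callable mapping a prompt to prompt plus completion (for example `lambda p: predict_tokens(model_new, p)`). Here a lookup of two outputs recorded above stands in for the model.

```python
from typing import Callable, Dict


def contamination_probe(
    predict: Callable[[str], str], question: str, primer: str, target: str
) -> Dict[str, bool]:
    """Check whether `target` appears in the completion with and without `primer`."""

    def completion(prompt: str) -> str:
        output = predict(prompt)
        # Strip the prompt so that a target inside the primer is not counted.
        return output[len(prompt):] if output.startswith(prompt) else output

    return {
        "plain": target in completion(question),
        "primed": target in completion(f"{primer} {question}"),
    }


# Outputs recorded above, used here in place of the edited model.
recorded_outputs = {
    "The British museum is located in": (
        "The British museum is located in London, England.\n\n"
        "The British Museum\n\nThe British Museum"
    ),
    "The Louvre is located in Rome. The British museum is located in": (
        "The Louvre is located in Rome. The British museum is located in Rome. "
        "The British Museum is"
    ),
}

print(
    contamination_probe(
        recorded_outputs.__getitem__,
        question="The British museum is located in",
        primer="The Louvre is located in Rome.",
        target="Rome",
    )
)
# {'plain': False, 'primed': True}
```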