
# VLM Safety Evaluation Notebook (Safe-CLIP vs ETA-prefix)

This notebook is a **template** for running and comparing safety methods on a HatefulMemes-style dataset:

- **Baseline**: plain OpenCLIP (inside each eval script)
- **Safe-CLIP**
- **ETA-prefix** (your retrieval adaptation of ETA, via `eta_eval.py`)

It assumes the following scripts already exist in your Google Drive or working directory:

- `safe_clip_eval.py` – your working Safe-CLIP evaluation script
- `eta_eval.py` – the ETA-prefix evaluation script (the one Trae generated)
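
Since everything downstream depends on these files being present, it can help to check for them before launching a long run. The following is a minimal sketch, not part of the original scripts; `find_missing` is a hypothetical helper name, and the file names in the usage lines are placeholders for the notebook's own paths.

```python
from pathlib import Path
from typing import Iterable, List


def find_missing(paths: Iterable[Path]) -> List[Path]:
    """Return the paths that do not exist on disk, in input order."""
    return [Path(p) for p in paths if not Path(p).exists()]


# Placeholder locations (assumption): substitute SAFE_CLIP_EVAL, ETA_EVAL,
# IMAGE_DIR and the split files once they are defined.
required = [Path("safe_clip_eval.py"), Path("eta_eval.py")]
for path in find_missing(required):
    print("Missing:", path)
```

Running this right after the configuration cell reports a missing script immediately, rather than as a non-zero return code from the subprocess later on.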

The notebook will:

1. Configure paths and basic settings.
2. Run each method on the given split(s) via the Python scripts.
3. Load the resulting JSON metric files.
4. Aggregate them into a single table.
5. Produce simple comparison plots (Recall@K, CLIPScore, semantic shift).

> ⚠️ **Note:** This notebook does not re-implement Safe-CLIP or ETA.  
> It only orchestrates evaluation using your existing scripts.
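
For orientation, here is a minimal sketch of what a Recall@K number typically measures in paired image-text retrieval. It assumes query `i` has exactly one correct candidate at index `i`; the evaluation scripts may define the metric differently, and both `recall_at_k` and the toy similarity values are illustrative only.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of queries whose paired item (same index) ranks in the top k."""
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Indices of the k most similar candidates for query i.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top_k
    return hits / len(similarity)


# Toy 3x3 similarity matrix (made-up numbers): rows are queries, columns are candidates.
sim = [
    [0.9, 0.1, 0.3],  # query 0: its match ranks 1st
    [0.8, 0.5, 0.6],  # query 1: its match ranks 3rd
    [0.2, 0.7, 0.6],  # query 2: its match ranks 2nd
]
print(recall_at_k(sim, k=1))
print(recall_at_k(sim, k=2))
```

With this toy matrix, one of three queries is a hit at `k=1` and two of three at `k=2`, so recall can only stay the same or grow as `k` increases.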


In [1]:

import os
import json
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# ---------- User configuration (edit these) ----------

# Base paths (assuming Colab + Google Drive)
# Adjust these paths to match your actual layout.
DATA_ROOT = Path("/content/drive/MyDrive/VLM_project/HatefulMemes/data")
PROJECT_ROOT = Path("/content/drive/MyDrive/VLM_project")

# Image directory
IMAGE_DIR = DATA_ROOT / "img"

# JSONL splits (train / dev / test)
SPLITS = {
    "train": DATA_ROOT / "train_clean.jsonl",
    "dev":   DATA_ROOT / "dev_clean.jsonl",
    "test":  DATA_ROOT / "test_clean.jsonl",
}

# Script locations
SAFE_CLIP_EVAL = PROJECT_ROOT / "safe_clip_eval.py"
ETA_EVAL       = PROJECT_ROOT / "eta_eval.py"

# Where to save metric JSON files
METRICS_DIR = PROJECT_ROOT / "metrics"
METRICS_DIR.mkdir(parents=True, exist_ok=True)

# Evaluation parameters
K = 5
DEVICE = "cuda"   # "cuda" or "cpu"
LIMIT = None      # e.g., 500 for a quick debug run; None to use full split

# Safety prefix for ETA-prefix variant
ETA_SAFETY_PREFIX = (
    "The text must avoid unsafe, porn, violent, politic, physical harmful, "
    "illegal, privacy and hateful contents."
)

print("IMAGE_DIR:", IMAGE_DIR)
print("SPLITS:", SPLITS)
print("SAFE_CLIP_EVAL:", SAFE_CLIP_EVAL)
print("ETA_EVAL:", ETA_EVAL)
print("METRICS_DIR:", METRICS_DIR)


IMAGE_DIR: /content/drive/MyDrive/VLM_project/HatefulMemes/data/img
SPLITS: {'train': PosixPath('/content/drive/MyDrive/VLM_project/HatefulMemes/data/train_clean.jsonl'), 'dev': PosixPath('/content/drive/MyDrive/VLM_project/HatefulMemes/data/dev_clean.jsonl'), 'test': PosixPath('/content/drive/MyDrive/VLM_project/HatefulMemes/data/test_clean.jsonl')}
SAFE_CLIP_EVAL: /content/drive/MyDrive/VLM_project/safe_clip_eval.py
ETA_EVAL: /content/drive/MyDrive/VLM_project/eta_eval.py
METRICS_DIR: /content/drive/MyDrive/VLM_project/metrics


In [2]:

import subprocess
from typing import Optional, Dict

def run_python_script(script_path: Path,
                      args: Dict[str, Optional[str]],
                      dry_run: bool = False) -> None:
    """Run a Python script with a dict of CLI args.

    Entries whose value is None are skipped. With dry_run=True the command
    is printed but not executed. Raises RuntimeError if the script exits
    with a non-zero return code.

    Example:
        run_python_script(
            SAFE_CLIP_EVAL,
            {
                "--image_dir": str(IMAGE_DIR),
                "--json_path": str(SPLITS["train"]),
                "--output_json": str(METRICS_DIR / "safe_clip_train.json"),
                "--k": str(K),
                "--device": DEVICE,
            },
        )
    """
    cmd = ["python", str(script_path)]
    for k, v in args.items():
        if v is None:
            continue
        cmd.extend([k, str(v)])
    print("Running:", " ".join(cmd))
    if dry_run:
        return
    result = subprocess.run(cmd, capture_output=True, text=True)
    print("Return code:", result.returncode)
    if result.stdout:
        print("STDOUT:\n", result.stdout[:2000])
    if result.stderr:
        # Show the tail: Python tracebacks put the actual error last.
        print("STDERR:\n", result.stderr[-2000:])
    if result.returncode != 0:
        raise RuntimeError(f"Script failed: {script_path}")


In [3]:

# === Run Safe-CLIP on all splits ===
for split_name, json_path in SPLITS.items():
    output_json = METRICS_DIR / f"safe_clip_metrics_{split_name}.json"
    print(f"\n=== Safe-CLIP: {split_name} ===")
    run_python_script(
        SAFE_CLIP_EVAL,
        {
            "--image_dir": str(IMAGE_DIR),
            "--json_path": str(json_path),
            "--output_json": str(output_json),
            "--k": str(K),
            "--device": DEVICE,
            "--limit": str(LIMIT) if LIMIT is not None else None,
        },
        dry_run=False,  # set to True to print the command without running it
    )



=== Safe-CLIP: train ===
Running: python /content/drive/MyDrive/VLM_project/safe_clip_eval.py --image_dir /content/drive/MyDrive/VLM_project/HatefulMemes/data/img --json_path /content/drive/MyDrive/VLM_project/HatefulMemes/data/train_clean.jsonl --output_json /content/drive/MyDrive/VLM_project/metrics/safe_clip_metrics_train.json --k 5 --device cuda
Return code: 2
STDERR:
 python3: can't open file '/content/drive/MyDrive/VLM_project/safe_clip_eval.py': [Errno 2] No such file or directory



RuntimeError: Script failed: /content/drive/MyDrive/VLM_project/safe_clip_eval.py

In [4]:

# === Run ETA-prefix (CLIP safety prefix variant) on all splits ===
for split_name, json_path in SPLITS.items():
    output_json = METRICS_DIR / f"eta_prefix_metrics_{split_name}.json"
    print(f"\n=== ETA-prefix: {split_name} ===")
    run_python_script(
        ETA_EVAL,
        {
            "--image_dir": str(IMAGE_DIR),
            "--json_path": str(json_path),
            "--output_json": str(output_json),
            "--k": str(K),
            "--device": DEVICE,
            "--limit": str(LIMIT) if LIMIT is not None else None,
            "--safety_prefix": ETA_SAFETY_PREFIX,
        },
        dry_run=False,
    )



=== ETA-prefix: train ===
Running: python /content/drive/MyDrive/VLM_project/eta_eval.py --image_dir /content/drive/MyDrive/VLM_project/HatefulMemes/data/img --json_path /content/drive/MyDrive/VLM_project/HatefulMemes/data/train_clean.jsonl --output_json /content/drive/MyDrive/VLM_project/metrics/eta_prefix_metrics_train.json --k 5 --device cuda --safety_prefix The text must avoid unsafe, porn, violent, politic, physical harmful, illegal, privacy and hateful contents.
Return code: 2
STDERR:
 python3: can't open file '/content/drive/MyDrive/VLM_project/eta_eval.py': [Errno 2] No such file or directory



RuntimeError: Script failed: /content/drive/MyDrive/VLM_project/eta_eval.py

In [None]:

# === Load metrics for each method & split into a DataFrame ===

def load_metrics(method: str, split: str, path: Path) -> pd.Series:
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    s = pd.Series(data)
    s["method"] = method
    s["split"] = split
    return s

rows = []

for split_name in SPLITS.keys():
    safe_path = METRICS_DIR / f"safe_clip_metrics_{split_name}.json"
    if safe_path.exists():
        rows.append(load_metrics("Safe-CLIP", split_name, safe_path))

    eta_path = METRICS_DIR / f"eta_prefix_metrics_{split_name}.json"
    if eta_path.exists():
        rows.append(load_metrics("ETA-prefix", split_name, eta_path))

metrics_df = pd.DataFrame(rows)

# Keep only the known metric columns, in a fixed order; any that are absent are skipped
cols_order = [
    "method", "split",
    "utility_recall@k_pre_S_V", "utility_recall@k_post_S_V",
    "harmful_recall@k_pre_U_V", "harmful_recall@k_post_U_V",
    "clipscore_safe_pre", "clipscore_safe_post",
    "text_semantic_shift_decline_to_unsafe",
    "text_semantic_shift_increase_to_neutral",
    "safety_rates_ASR", "safety_rates_USR",
]
metrics_df = metrics_df[[c for c in cols_order if c in metrics_df.columns]]

metrics_df
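
Column names such as `safety_rates_ASR` look like nested keys joined with underscores. If the metric files turn out to contain nested objects (an assumption; the scripts may already write flat JSON), `pd.Series(data)` would keep each nested dict as a single value and the column filter would find nothing. A minimal sketch of flattening before building the Series, where `flatten` is a hypothetical helper:

```python
from typing import Any, Dict


def flatten(data: Dict[str, Any], sep: str = "_", prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts by joining keys with `sep`; leaf values are kept as-is."""
    flat: Dict[str, Any] = {}
    for key, value in data.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, sep=sep, prefix=name))
        else:
            flat[name] = value
    return flat


# Made-up values, for illustration only.
nested = {"safety_rates": {"ASR": 0.1, "USR": 0.9}, "k": 5}
print(flatten(nested))
```

In `load_metrics`, this would amount to calling `pd.Series(flatten(data))` instead of `pd.Series(data)`; flat input passes through unchanged.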


In [None]:

# === Plot comparison: Utility & Harmful Recall (pre vs post) ===

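def plot_bar_for_metric(df: pd.DataFrame, metric_pre: str, metric_post: str, title: str) -> None:
    """Draw a grouped bar chart of one metric, pre vs post.

    x-axis: one group per (method, split); y-axis: metric value;
    bars within a group: pre vs post.
    """
    # Skip cleanly if the eval scripts did not emit this metric.
    if metric_pre not in df.columns or metric_post not in df.columns:
        print("Missing columns for", title)
        return

    fig, ax = plt.subplots(figsize=(8, 5))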

    groups = df[["method", "split"]].drop_duplicates()
    x_labels = []
    x_positions = []

    pre_vals = []
    post_vals = []

    for _, row in groups.iterrows():
        m = row["method"]
        s = row["split"]
        subset = df[(df["method"] == m) & (df["split"] == s)]
        if subset.empty:
            continue
        pre_vals.append(float(subset[metric_pre].iloc[0]))
        post_vals.append(float(subset[metric_post].iloc[0]))
        x_labels.append(f"{m}\n({s})")
        x_positions.append(len(x_positions))

    if not x_positions:
        print("No data to plot for", title)
        return

    width = 0.35
    x_positions = np.array(x_positions)
    ax.bar(x_positions - width/2, pre_vals, width, label="pre")
    ax.bar(x_positions + width/2, post_vals, width, label="post")

    ax.set_xticks(x_positions)
    ax.set_xticklabels(x_labels, rotation=0)
    ax.set_title(title)
    ax.set_ylabel("value")
    ax.legend()
    plt.tight_layout()
    plt.show()


if not metrics_df.empty:
    plot_bar_for_metric(
        metrics_df,
        "utility_recall@k_pre_S_V",
        "utility_recall@k_post_S_V",
        "Utility Recall@K on Safe Samples"
    )

    plot_bar_for_metric(
        metrics_df,
        "harmful_recall@k_pre_U_V",
        "harmful_recall@k_post_U_V",
        "Harmful Recall@K on Unsafe Samples"
    )

    plot_bar_for_metric(
        metrics_df,
        "clipscore_safe_pre",
        "clipscore_safe_post",
        "CLIPScore on Safe Pairs"
    )
else:
    print("metrics_df is empty – run the evaluation cells first.")


In [None]:

# === Plot Semantic Shift metrics ===

if not metrics_df.empty:
    fig, ax = plt.subplots(figsize=(8, 5))

    groups = metrics_df[["method", "split"]].drop_duplicates()
    x_labels = []
    decline_vals = []
    increase_vals = []

    for _, row in groups.iterrows():
        m = row["method"]
        s = row["split"]
        subset = metrics_df[(metrics_df["method"] == m) & (metrics_df["split"] == s)]
        if subset.empty:
            continue
        decline_vals.append(float(subset["text_semantic_shift_decline_to_unsafe"].iloc[0]))
        increase_vals.append(float(subset["text_semantic_shift_increase_to_neutral"].iloc[0]))
        x_labels.append(f"{m}\n({s})")

    if not x_labels:
        print("No data to plot for semantic shift.")
    else:
        x_pos = np.arange(len(x_labels))
        width = 0.35
        ax.bar(x_pos - width/2, decline_vals, width, label="decline_to_unsafe")
        ax.bar(x_pos + width/2, increase_vals, width, label="increase_to_neutral")

        ax.set_xticks(x_pos)
        ax.set_xticklabels(x_labels)
        ax.set_title("Semantic Shift Metrics")
        ax.set_ylabel("cosine similarity change")
        ax.legend()
        plt.tight_layout()
        plt.show()
else:
    print("metrics_df is empty – run the evaluation cells first.")



## Next Steps / Interpretation Tips

Once you have the metrics:

- **Utility vs Safety Trade-off**
  - Compare `utility_recall@k_*` (safe samples) with `harmful_recall@k_*` (unsafe samples), pre vs post.
  - A good safety method should **reduce harmful recall** from pre to post while keeping **utility recall** as close to its pre value as possible.

- **CLIPScore**
  - A large drop from `clipscore_safe_pre` to `clipscore_safe_post` often means the method is suppressing *all* signal,
    not just unsafe associations.

- **Semantic Shift**
  - `text_semantic_shift_decline_to_unsafe` > 0 means embeddings move **away from the unsafe centroid**.
  - `text_semantic_shift_increase_to_neutral` > 0 means embeddings move **toward the neutral centroid**.
  - A strong safety method ideally has both values positive (or at least non-negative).
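
To make the sign convention concrete, here is a minimal sketch of one way such shift values could be computed: as changes in mean cosine similarity to a centroid. This is an assumption made for illustration; the evaluation scripts may compute the metrics differently, and `semantic_shift` and the toy 2-D vectors are hypothetical.

```python
import math
from typing import Dict, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean_cosine(embeddings: Sequence[Vector], centroid: Vector) -> float:
    """Average cosine similarity of a set of embeddings to one centroid."""
    return sum(cosine(e, centroid) for e in embeddings) / len(embeddings)


def semantic_shift(pre: Sequence[Vector], post: Sequence[Vector],
                   unsafe_centroid: Vector, neutral_centroid: Vector) -> Dict[str, float]:
    """Positive values mean: less similar to unsafe, more similar to neutral."""
    return {
        "decline_to_unsafe": mean_cosine(pre, unsafe_centroid) - mean_cosine(post, unsafe_centroid),
        "increase_to_neutral": mean_cosine(post, neutral_centroid) - mean_cosine(pre, neutral_centroid),
    }


# Toy 2-D example: the embedding starts on the unsafe axis and ends on the neutral axis.
unsafe, neutral = [1.0, 0.0], [0.0, 1.0]
shift = semantic_shift(pre=[[1.0, 0.0]], post=[[0.0, 1.0]],
                       unsafe_centroid=unsafe, neutral_centroid=neutral)
print(shift)
```

In this toy case both values come out positive, which is the pattern the bullets above describe as desirable; swapping `pre` and `post` flips both signs.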

You can also add more cells to:
- Load LOSF results once you implement it.
- Add more advanced plots or per-class breakdowns.
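
As one example of such an additional cell, the sketch below reduces each pre/post pair to a single `post - pre` delta, which makes the utility vs safety trade-off easier to scan. The metric keys are the ones used in the loading cell; `pre_post_deltas` is a hypothetical helper and the numbers are made up.

```python
from typing import Dict, Mapping

# (pre_key, post_key) pairs, using the metric names from the loading cell.
PAIRS = {
    "utility_recall": ("utility_recall@k_pre_S_V", "utility_recall@k_post_S_V"),
    "harmful_recall": ("harmful_recall@k_pre_U_V", "harmful_recall@k_post_U_V"),
    "clipscore_safe": ("clipscore_safe_pre", "clipscore_safe_post"),
}


def pre_post_deltas(metrics: Mapping[str, float]) -> Dict[str, float]:
    """Return post - pre for each metric pair present in `metrics`."""
    deltas = {}
    for name, (pre_key, post_key) in PAIRS.items():
        if pre_key in metrics and post_key in metrics:
            deltas[name] = float(metrics[post_key]) - float(metrics[pre_key])
    return deltas


# Made-up numbers, for illustration only.
example = {
    "utility_recall@k_pre_S_V": 0.5,
    "utility_recall@k_post_S_V": 0.25,
    "harmful_recall@k_pre_U_V": 0.75,
    "harmful_recall@k_post_U_V": 0.25,
}
print(pre_post_deltas(example))
```

A strongly negative `harmful_recall` delta paired with a near-zero `utility_recall` delta is the favorable case. Pairs missing from the input are skipped rather than raising.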
