# Binoculars: Zero-Shot AI Text Detection

This notebook implements the **Binoculars** method for detecting AI-generated text, as described in the paper [Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text](https://arxiv.org/pdf/2401.12070).

## Key Idea

Binoculars uses **two LLMs** (an "observer" and a "performer") to detect AI-generated text without requiring any training:

- **Observer**: A chat/instruct-tuned model (e.g., Llama-3.2-1B-Instruct)
- **Performer**: The base version of the same model (e.g., Llama-3.2-1B)

The method computes:
1. **PPL (Perplexity)**: How surprised the observer model is by the text
2. **X-PPL (Cross-Perplexity)**: How surprised the observer is by what the performer predicts

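**The insight**: AI-generated text tends to have an X-PPL/PPL ratio close to 1 (the models agree: the actual tokens are about as predictable as the performer's predictions), while human text tends to have a lower ratio (the models disagree more: the actual tokens are far more surprising, so PPL is large relative to X-PPL).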
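
The two quantities can be written out explicitly. The notation here is an assumption introduced for illustration: $s_1, \dots, s_L$ are the tokens of the text, and $P_{\text{obs}}$ and $P_{\text{perf}}$ are the next-token distributions of the observer and performer. The cross-perplexity shown is the argmax variant that this notebook's code computes. The paper itself defines cross-perplexity over the full next-token distributions and scores text with $\log \text{PPL} / \log \text{X-PPL}$, so this is a sketch of the notebook's simplification rather than the paper's exact formulation.

```latex
\begin{aligned}
\mathrm{PPL}(s) &= \exp\!\Big(-\frac{1}{L-1}\sum_{i=1}^{L-1} \log P_{\text{obs}}(s_{i+1} \mid s_{\le i})\Big) \\
\hat{s}_{i+1} &= \arg\max_{v}\, P_{\text{perf}}(v \mid s_{\le i}) \\
\text{X-PPL}(s) &= \exp\!\Big(-\frac{1}{L-1}\sum_{i=1}^{L-1} \log P_{\text{obs}}(\hat{s}_{i+1} \mid s_{\le i})\Big) \\
\text{ratio}(s) &= \frac{\text{X-PPL}(s)}{\mathrm{PPL}(s)}
\end{aligned}
```

Because $\hat{s}_{i+1}$ is the token the performer ranks first, X-PPL is small whenever the observer also assigns that token high probability.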

In [19]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

## Step 1: Load Models

We need two models from the same model family:
- **Observer**: The instruct-tuned version (fine-tuned for following instructions)
- **Performer**: The base pretrained version (not fine-tuned)

Both models are loaded in fp16 for efficiency and set to evaluation mode.

In [20]:
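# Set device: prefer CUDA (NVIDIA GPUs), then MPS (Apple Silicon), otherwise fall back to CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")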

# Load tokenizer (shared by both models)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B", use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "left"

# Observer: Instruct-tuned model (the "judge")
observer = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.float16,
)

# Performer: Base pretrained model (the "predictor")
performer = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype=torch.float16,
)

# Move models to device and set to evaluation mode
observer.to(device)
performer.to(device)
observer.eval()
performer.eval()

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb):

## Step 2: Implement Core Metrics

In [21]:

# ========== PERPLEXITY (PPL) ==========
def calculate_PPL(text: str) -> float:
    """
    Calculate perplexity using the OBSERVER model.
    
    Perplexity measures how "surprised" the model is by the text.
    Lower PPL = text is more predictable to the model.
    
    PPL = exp(average negative log-likelihood of the true tokens)
    """
    # Tokenize the input text
    tokens = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
    )
    input_ids = tokens["input_ids"].to(device)
    attention_mask = tokens.get("attention_mask", torch.ones_like(input_ids)).to(device)

    with torch.no_grad():
        # Get logits from the observer model
        outputs = observer(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits

        # Shift for next-token prediction: predict token i+1 given tokens 0..i
        shift_logits = logits[:, :-1, :]  # Remove last token's logits
        shift_labels = input_ids[:, 1:]  # Remove first token (nothing to predict)
        shift_mask = attention_mask[:, 1:]

        # Convert logits to log probabilities
        log_probs = torch.log_softmax(shift_logits, dim=-1)
        
        # Get log prob of the ACTUAL next token
        gathered_log_probs = log_probs.gather(dim=-1, index=shift_labels.unsqueeze(-1)).squeeze(-1)

        # Compute average negative log-likelihood (masked by attention)
        masked_nll = -gathered_log_probs * shift_mask
        total_tokens = shift_mask.sum().clamp_min(1)
        mean_nll = masked_nll.sum() / total_tokens

        # Perplexity = exp(NLL)
        ppl = torch.exp(mean_nll).item()
        return float(ppl)


# ========== CROSS-PERPLEXITY (X-PPL) ==========
def calculate_X_PPL(text: str) -> float:
    """
    Calculate cross-perplexity: how surprised the OBSERVER is by what the PERFORMER predicts.
    
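    Key difference from PPL:
    - PPL: observer evaluates the ACTUAL next tokens
    - X-PPL: observer evaluates the performer's most likely (argmax) token at each position
    
    This measures agreement between the two models. It is a hard (argmax) simplification:
    the paper defines cross-perplexity over the full next-token distributions.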
    """
    # Tokenize the input text
    tokens = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
    )
    input_ids = tokens["input_ids"].to(device)
    attention_mask = tokens.get("attention_mask", torch.ones_like(input_ids)).to(device)

    with torch.no_grad():
        # Get predictions from BOTH models
        obs_outputs = observer(input_ids=input_ids, attention_mask=attention_mask)
        perf_outputs = performer(input_ids=input_ids, attention_mask=attention_mask)

        obs_logits = obs_outputs.logits
        perf_logits = perf_outputs.logits

        # Shift for next-token prediction
        obs_shift_logits = obs_logits[:, :-1, :]
        perf_shift_logits = perf_logits[:, :-1, :]
        shift_mask = attention_mask[:, 1:]

        # Get the tokens the PERFORMER would predict (argmax)
        performer_pred_tokens = perf_shift_logits.argmax(dim=-1)

        # Evaluate those predictions using the OBSERVER's probabilities
        obs_log_probs = torch.log_softmax(obs_shift_logits, dim=-1)
        gathered_log_probs = obs_log_probs.gather(dim=-1, index=performer_pred_tokens.unsqueeze(-1)).squeeze(-1)

        # Compute average negative log-likelihood
        masked_nll = -gathered_log_probs * shift_mask
        total_positions = shift_mask.sum().clamp_min(1)
        mean_nll = masked_nll.sum() / total_positions

        # Cross-perplexity = exp(NLL)
        xppl = torch.exp(mean_nll).item()
        return float(xppl)


# ========== BINOCULARS SCORE ==========
def binoculars_score(text: str):
    """
    Compute the Binoculars detection score.
    
    The key metric is the RATIO = X-PPL / PPL:
    - Ratio near or above 1.0: actual tokens are about as predictable as the performer's picks → likely AI-generated
    - Ratio well below 1.0: actual tokens are much more surprising → likely human-written
    
    Returns a dictionary with all metrics and a simple threshold-based prediction.
    """
    ppl = calculate_PPL(text)
    xppl = calculate_X_PPL(text)
    
    # Compute the ratio (avoid division by zero)
    ratio = xppl / max(ppl, 1e-9)
    
    # Simple placeholder threshold: ratio > 1.0 is flagged as AI (uncalibrated; tune it on labeled data)
    is_ai = ratio > 1.0
    
    return {"ppl": ppl, "xppl": xppl, "ratio": ratio, "is_ai": is_ai}
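
To make the shift, gather, and average logic above concrete, here is a minimal pure-Python sketch on a made-up three-token vocabulary. The helper names (`toy_ppl`, `toy_xppl`) and all logits are hypothetical and exist only for illustration; the real functions above work on model logits with `torch`.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a plain list of floats."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]


def toy_ppl(observer_logits, next_tokens):
    """exp(mean negative log-prob the observer gives to the ACTUAL next tokens)."""
    nll = [-log_softmax(row)[tok] for row, tok in zip(observer_logits, next_tokens)]
    return math.exp(sum(nll) / len(nll))


def toy_xppl(observer_logits, performer_logits):
    """exp(mean negative log-prob the observer gives to the performer's argmax tokens)."""
    nll = []
    for obs_row, perf_row in zip(observer_logits, performer_logits):
        # Token the performer ranks first at this position
        predicted = max(range(len(perf_row)), key=lambda v: perf_row[v])
        nll.append(-log_softmax(obs_row)[predicted])
    return math.exp(sum(nll) / len(nll))


# Hypothetical logits: one row per position, one column per vocabulary entry.
observer_logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
performer_logits = [[1.8, 0.4, 0.2], [0.1, 1.7, 0.2]]

predictable_text = [0, 1]  # the tokens both models rank first
surprising_text = [2, 0]   # tokens both models consider unlikely

for name, tokens in [("predictable", predictable_text), ("surprising", surprising_text)]:
    ppl = toy_ppl(observer_logits, tokens)
    xppl = toy_xppl(observer_logits, performer_logits)
    print(f"{name}: ppl={ppl:.3f} xppl={xppl:.3f} ratio={xppl / ppl:.3f}")
```

In this toy setup the logits are fixed, so X-PPL is the same for both sequences and only PPL changes; with real models the logits depend on the text itself. The predictable sequence yields a ratio of 1 because its tokens coincide with the performer's argmax, while the surprising sequence yields a ratio below 1.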

## Step 3: Minimal Demo

In [22]:

# Example human-written text
human_text = "I took a long walk by the river this morning, and the air smelled like rain."

# Generate AI text using the performer model
prompt = "Write two sentences about the importance of reading."
inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    gen_ids = performer.generate(
        **inputs,
        max_new_tokens=40,
        do_sample=True,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
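# Note: gen_ids[0] holds the prompt followed by the continuation, so model_text starts with the prompt
model_text = tokenizer.decode(gen_ids[0], skip_special_tokens=True)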

print("Model text:\n", model_text, "\n")

# Compare the two texts using Binoculars
for label, text in [("HUMAN?", human_text), ("MODEL?", model_text)]:
    metrics = binoculars_score(text)
    print(label, metrics)

# Expected pattern:
# - Human text: Lower ratio (models disagree more)
# - AI text: Higher ratio (models agree more)

Model text:
 Write two sentences about the importance of reading. What are some of the benefits of reading?
What are some of the benefits of reading?
Reading is a very important skill that can help you in many ways. Some of the benefits of reading are that 

HUMAN? {'ppl': 16.515625, 'xppl': 7.33984375, 'ratio': 0.44441816461684014, 'is_ai': False}
MODEL? {'ppl': 5.546875, 'xppl': 4.13671875, 'ratio': 0.745774647887324, 'is_ai': False}


## Step 4: Evaluate on Real Dataset

In [26]:

# Load human-AI text pairs from JSONL dataset and evaluate accuracy
import json
from pathlib import Path

# Path to dataset containing human responses and AI-generated responses
jsonl_path = Path("/Users/mihirmishra/CodingStuff/Projects/Models-from-scratch/Binoculars/no-robotic-words-llama2-13b-chat.jsonl")

num_pairs = 200  # Number of pairs to evaluate
processed = 0
correct = 0

results = []

if not jsonl_path.exists():
    raise FileNotFoundError(f"JSONL not found: {jsonl_path}")

with jsonl_path.open("r", encoding="utf-8") as f:
    for line_idx, line in enumerate(f):
        if processed >= num_pairs:
            break
        line = line.strip()
        if not line:
            continue
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue

        # Extract human and AI texts from the dataset
        human_text = rec.get("response")
        ai_text = rec.get("meta-llama-Llama-2-13b-chat-hf_generated_text_wo_prompt")
        if not human_text or not ai_text:
            continue

        # Compute Binoculars metrics for both texts
        human_metrics = binoculars_score(human_text)
        ai_metrics = binoculars_score(ai_text)

        # Prediction: The text with HIGHER ratio is predicted as AI
        # (because higher ratio means the two models agree more)
        predicted_ai = "ai" if ai_metrics["ratio"] >= human_metrics["ratio"] else "human"
        is_correct = (predicted_ai == "ai")

        processed += 1
        if is_correct:
            correct += 1

        results.append({
            "idx": processed,
            "human": human_metrics,
            "ai": ai_metrics,
            "predicted_ai": predicted_ai,
            "correct": is_correct,
        })

# Print detailed results for each pair
for r in results:
    print(
        f"Pair {r['idx']:02d} | "
        f"human: score={r['human']['ratio']:.3f} | "
        f"ai: score={r['ai']['ratio']:.3f} | "
        f"Predicted AI={r['predicted_ai']} | Correct={r['correct']}"
    )

# Print overall accuracy
if processed == 0:
    print("No valid pairs processed.")
else:
    acc = correct / processed
    print("")
    print(f"Processed pairs: {processed}")
    print(f"Accuracy (AI = higher ratio): {acc:.1%}")
    print("")
    print("Key observation: AI-generated text typically has a HIGHER X-PPL/PPL ratio")
    print("because the observer and performer models agree more on AI text.")

Pair 01 | human: score=0.510 | ai: score=0.774 | Predicted AI=ai | Correct=True
Pair 02 | human: score=0.352 | ai: score=0.681 | Predicted AI=ai | Correct=True
Pair 03 | human: score=0.388 | ai: score=0.910 | Predicted AI=ai | Correct=True
Pair 04 | human: score=0.327 | ai: score=0.748 | Predicted AI=ai | Correct=True
Pair 05 | human: score=0.376 | ai: score=0.679 | Predicted AI=ai | Correct=True
Pair 06 | human: score=0.352 | ai: score=0.753 | Predicted AI=ai | Correct=True
Pair 07 | human: score=0.361 | ai: score=0.691 | Predicted AI=ai | Correct=True
Pair 08 | human: score=0.321 | ai: score=1.026 | Predicted AI=ai | Correct=True
Pair 09 | human: score=0.751 | ai: score=0.611 | Predicted AI=human | Correct=False
Pair 10 | human: score=0.524 | ai: score=0.609 | Predicted AI=ai | Correct=True
Pair 11 | human: score=0.522 | ai: score=0.686 | Predicted AI=ai | Correct=True
Pair 12 | human: score=0.558 | ai: score=0.676 | Predicted AI=ai | Correct=True
Pair 13 | human: score=0.631 | ai: s
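
The fixed `ratio > 1.0` rule in `binoculars_score` would label most of the AI texts above as human, since their ratios mostly sit between 0.6 and 1.0. Below is a minimal sketch of picking a cutoff from labeled scores, using the twelve complete pairs printed above (pair 13 is cut off and left out). The function names `accuracy_at` and `best_threshold` are hypothetical, and a cutoff tuned and scored on the same twelve pairs is optimistic; a held-out split would be needed for a real estimate.

```python
# Ratios copied from the first twelve pairs in the output above.
human_scores = [0.510, 0.352, 0.388, 0.327, 0.376, 0.352, 0.361, 0.321, 0.751, 0.524, 0.522, 0.558]
ai_scores = [0.774, 0.681, 0.910, 0.748, 0.679, 0.753, 0.691, 1.026, 0.611, 0.609, 0.686, 0.676]


def accuracy_at(threshold, human_scores, ai_scores):
    """Fraction classified correctly when ratio >= threshold means 'AI'."""
    correct = sum(s < threshold for s in human_scores) + sum(s >= threshold for s in ai_scores)
    return correct / (len(human_scores) + len(ai_scores))


def best_threshold(human_scores, ai_scores):
    """Try every observed score as a cutoff; return the lowest one with the highest accuracy."""
    candidates = sorted(set(human_scores) | set(ai_scores))
    return max(candidates, key=lambda t: accuracy_at(t, human_scores, ai_scores))


threshold = best_threshold(human_scores, ai_scores)
print(f"accuracy at 1.0:  {accuracy_at(1.0, human_scores, ai_scores):.3f}")
print(f"best threshold:   {threshold:.3f}")
print(f"accuracy at best: {accuracy_at(threshold, human_scores, ai_scores):.3f}")
```

On these twelve pairs a cutoff of 0.609 classifies 23 of 24 texts correctly, compared with 13 of 24 for the cutoff of 1.0.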

## Step 5: Test on Custom Text

Let's test the Binoculars detector on a new piece of text: the first message from Yann in the Slack channel.

In [None]:
# Test text - Slack message
test_text = """Hello @channel,
Excited to kick off the semester!
A couple of announcements and reminders:
1. Profile pictures
Please update your Slack profile picture so Kilian and I can more easily connect names to faces.
2. Lecture outlines - content
Reach out to me at least 2 weeks before your teaching date with your outline. This outline should be fairly fleshed out. It should be the output of your blogpost / existing online lectures review. I'll then share it with Kilian so we can make sure the content is comprehensive, up to date, and accurate. We can talk about it during office hours on Fridays as well.
3. Rehearsals - format
One week before your class (on Fridays), you'll do a final rehearsal with me. We'll go over format, flow, and ensure none of the deadly sins have crept in so you can shine as engaging lecturers who captivate your audience.
Looking forward to working with you all!"""

# Run Binoculars detection
result = binoculars_score(test_text)

print(f"PPL (Perplexity): {result['ppl']:.3f}")
print(f"X-PPL (Cross-Perplexity): {result['xppl']:.3f}")
print(f"Ratio (X-PPL/PPL): {result['ratio']:.3f}")
print(f"Predicted as AI: {result['is_ai']}")
print()

# Interpret the result (illustrative cutoffs, not calibrated on labeled data)
if result['ratio'] < 0.7:
    print("✓ Strong indication: HUMAN-WRITTEN (very low ratio)")
elif result['ratio'] < 1.0:
    print("Likely: HUMAN-WRITTEN (ratio < 1.0)")
else:
    print("Likely: AI-GENERATED (ratio >= 1.0)")

PPL (Perplexity): 25.234
X-PPL (Cross-Perplexity): 3.777
Ratio (X-PPL/PPL): 0.150
Predicted as AI: False

✓ Strong indication: HUMAN-WRITTEN (very low ratio)
