# Session 7: Prompt Brittleness
## When Robust Understanding Fails

**Production LLM Deployment: Risk Characterization Before Failure**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Javihaus/Production_LLM_Deployment/blob/main/sessions/session_07_failure_mode_brittleness/notebook.ipynb)

---

**Learning Objectives:**
1. Use brittleness as an architectural diagnostic
2. Design semantically equivalent prompt variations
3. Quantify brittleness with statistical rigor
4. Interpret brittleness across domains

In [None]:
!pip install -q anthropic numpy pandas matplotlib seaborn scipy

import anthropic
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict
from dataclasses import dataclass
import time

plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

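try:
    # In Colab, read the key from the notebook's secrets store.
    from google.colab import userdata
    api_key = userdata.get('ANTHROPIC_API_KEY')
except Exception:
    # Outside Colab (or if the secret is missing), fall back to the environment.
    # Catching Exception rather than using a bare except keeps Ctrl-C working.
    import os
    api_key = os.environ.get('ANTHROPIC_API_KEY')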

client = anthropic.Anthropic(api_key=api_key)
print("Setup complete!")

## Part 1: Why Brittleness Matters

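If a model truly understands a task, its answer should not change with irrelevant surface features such as layout, register, or notation. Brittleness, meaning accuracy that shifts when only the prompt format changes, is evidence of pattern matching rather than robust understanding.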
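
As a minimal sketch of the metric used in this session, brittleness can be read as the spread between the best and worst per-format accuracy, expressed in percentage points. The helper name `brittleness_pp` and the accuracies below are illustrative assumptions, not measured results.

```python
from typing import Dict


def brittleness_pp(per_format_accuracy: Dict[str, float]) -> float:
    """Spread between best and worst per-format accuracy, in percentage points."""
    if not per_format_accuracy:
        raise ValueError("need at least one format")
    values = list(per_format_accuracy.values())
    return (max(values) - min(values)) * 100


# Hypothetical accuracies (fraction correct per format), for illustration only.
example = {"natural": 0.9, "json": 0.6, "formal": 0.8}
print(f"Brittleness: {brittleness_pp(example):.0f} pp")
```

A model whose accuracy does not depend on format scores 0 pp under this definition, whatever its absolute accuracy is.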

In [None]:
@dataclass
class BrittlenessTest:
    """A test scenario with multiple format variations."""
    name: str
    ground_truth: str
    formats: Dict[str, str]  # format_name -> prompt


def run_brittleness_test(test: BrittlenessTest, client) -> Dict:
    """Run all format variations and calculate brittleness."""
    results = {"name": test.name, "ground_truth": test.ground_truth, "format_results": {}}
    
    for fmt_name, prompt in test.formats.items():
        response = client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=20,
            messages=[{"role": "user", "content": prompt}]
        )
        
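        # Normalize to YES / NO / UNCLEAR using the first word only, so that
        # replies such as "NOT SURE" or "UNKNOWN" are not misread as "NO".
        raw = response.content[0].text.strip().upper()
        first_word = raw.split()[0].strip(".,:;!*\"'") if raw else ""
        answer = first_word if first_word in ("YES", "NO") else "UNCLEAR"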
        
        results["format_results"][fmt_name] = {
            "answer": answer,
            "correct": answer == test.ground_truth
        }
        time.sleep(0.3)
    
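    # Calculate metrics. With one trial per format, each per-format accuracy is
    # 0 or 1, so brittleness is either 0 pp (every format right, or every format
    # wrong) or 100 pp (at least one format right and at least one wrong).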
    accuracies = [1 if r["correct"] else 0 for r in results["format_results"].values()]
    results["accuracy"] = np.mean(accuracies)
    results["brittleness"] = (max(accuracies) - min(accuracies)) * 100
    results["consistent"] = len(set(r["answer"] for r in results["format_results"].values())) == 1
    
    return results
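
The function above makes one call per format, so each per-format accuracy rests on a single sample. One way to add statistical rigor, sketched here under the assumption that each prompt is repeated several times, is to attach a Wilson score interval to each format's accuracy. The helper `wilson_interval` is not part of the original notebook, and the counts in the example are illustrative.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Illustrative counts: one correct answer out of one trial vs. 18 out of 20.
for correct, n in [(1, 1), (18, 20)]:
    low, high = wilson_interval(correct, n)
    print(f"{correct}/{n} correct -> 95% interval [{low:.2f}, {high:.2f}]")
```

The interval for a single trial is very wide, which is why a brittleness figure computed from one response per format should be treated as a screening signal rather than an estimate.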

In [None]:
# Define multi-format test scenarios
brittleness_tests = [
    BrittlenessTest(
        name="Medication Timing (Unsafe)",
        ground_truth="NO",
        formats={
            "natural": "I took aspirin at 8:00 AM. Can I take ibuprofen at 11:00 AM if I need to wait 4 hours? YES or NO only.",
            "clinical": "MEDICATION CHECK\nDrug A: Aspirin @ 0800\nDrug B: Ibuprofen requested @ 1100\nRequired interval: 4h\nSafe to administer? YES/NO",
            "json": '{"med1": {"drug": "aspirin", "time": "8:00 AM"}, "med2": {"drug": "ibuprofen", "time": "11:00 AM"}, "min_gap_hours": 4}\nIs this safe? YES or NO',
            "conversational": "hey so i took aspirin around 8 this morning and its 11 now. supposed to wait 4 hrs before ibuprofen. good to go? just yes or no",
            "formal": "Query: Temporal constraint verification\nEvent A: Aspirin administration at t=08:00\nEvent B: Ibuprofen administration proposed at t=11:00\nConstraint: Minimum interval of 4 hours required\nIs constraint satisfied? Respond YES or NO."
        }
    ),
    BrittlenessTest(
        name="Medication Timing (Safe)",
        ground_truth="YES",
        formats={
            "natural": "I took aspirin at 8:00 AM. Can I take ibuprofen at 1:00 PM if I need to wait 4 hours? YES or NO only.",
            "clinical": "MEDICATION CHECK\nDrug A: Aspirin @ 0800\nDrug B: Ibuprofen requested @ 1300\nRequired interval: 4h\nSafe to administer? YES/NO",
            "json": '{"med1": {"drug": "aspirin", "time": "8:00 AM"}, "med2": {"drug": "ibuprofen", "time": "1:00 PM"}, "min_gap_hours": 4}\nIs this safe? YES or NO',
            "conversational": "hey so i took aspirin around 8 this morning and its 1 pm now. supposed to wait 4 hrs before ibuprofen. good to go? just yes or no",
            "formal": "Query: Temporal constraint verification\nEvent A: Aspirin administration at t=08:00\nEvent B: Ibuprofen administration proposed at t=13:00\nConstraint: Minimum interval of 4 hours required\nIs constraint satisfied? Respond YES or NO."
        }
    ),
    BrittlenessTest(
        name="Meeting Overlap (Yes)",
        ground_truth="YES",
        formats={
            "natural": "Meeting A: 2pm-4pm. Meeting B: 3pm-5pm. Do they overlap? YES or NO only.",
            "formal": "Interval A: [14:00, 16:00]\nInterval B: [15:00, 17:00]\nDo these intervals intersect? YES/NO",
            "conversational": "got a meeting from 2 to 4 and another from 3 to 5. conflict? yes or no",
        }
    ),
]

# Run tests
all_results = []
for test in brittleness_tests:
    result = run_brittleness_test(test, client)
    all_results.append(result)
    
    print(f"\n{'='*50}")
    print(f"Test: {result['name']}")
    print(f"Ground Truth: {result['ground_truth']}")
    print(f"{'='*50}")
    for fmt, res in result["format_results"].items():
        status = "CORRECT" if res["correct"] else "WRONG"
        print(f"  {fmt}: {res['answer']} - {status}")
    print(f"\nAccuracy: {result['accuracy']:.1%}")
    print(f"Brittleness: {result['brittleness']:.0f} pp")
    print(f"Consistent: {'Yes' if result['consistent'] else 'No (BRITTLE)'}")
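
The scenarios above are written by hand, which makes it easy for one variant to drift from the others (for example, "around 8" versus "8:00 AM"). A possible alternative, sketched below, renders every format from the same structured facts so that the variants differ only in surface form. The function names `clock_12h` and `build_interval_formats` are hypothetical, and the templates mirror a subset of the prompts used in this notebook.

```python
import json
from typing import Dict


def clock_12h(hour: int) -> str:
    """Format a 24-hour clock hour as e.g. '8:00 AM' or '1:00 PM'."""
    suffix = "AM" if hour < 12 else "PM"
    return f"{hour % 12 or 12}:00 {suffix}"


def build_interval_formats(drug_a: str, drug_b: str, hour_a: int,
                           hour_b: int, gap_hours: int) -> Dict[str, str]:
    """Render one scenario into several surface formats from the same facts."""
    payload = {
        "med1": {"drug": drug_a, "time": clock_12h(hour_a)},
        "med2": {"drug": drug_b, "time": clock_12h(hour_b)},
        "min_gap_hours": gap_hours,
    }
    return {
        "natural": (
            f"I took {drug_a} at {clock_12h(hour_a)}. Can I take {drug_b} at "
            f"{clock_12h(hour_b)} if I need to wait {gap_hours} hours? YES or NO only."
        ),
        "clinical": (
            f"MEDICATION CHECK\nDrug A: {drug_a.title()} @ {hour_a:02d}00\n"
            f"Drug B: {drug_b.title()} requested @ {hour_b:02d}00\n"
            f"Required interval: {gap_hours}h\nSafe to administer? YES/NO"
        ),
        "json": json.dumps(payload) + "\nIs this safe? YES or NO",
        "formal": (
            "Query: Temporal constraint verification\n"
            f"Event A: {drug_a.title()} administration at t={hour_a:02d}:00\n"
            f"Event B: {drug_b.title()} administration proposed at t={hour_b:02d}:00\n"
            f"Constraint: Minimum interval of {gap_hours} hours required\n"
            "Is constraint satisfied? Respond YES or NO."
        ),
    }


variants = build_interval_formats("aspirin", "ibuprofen", 8, 11, 4)
for name, prompt in variants.items():
    print(f"--- {name} ---\n{prompt}\n")
```

The returned dictionary has the same shape as the `formats` field of `BrittlenessTest`, so it could be passed in directly.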

In [None]:
# Aggregate analysis
print("\n" + "="*60)
print("AGGREGATE BRITTLENESS ANALYSIS")
print("="*60)

avg_brittleness = np.mean([r["brittleness"] for r in all_results])
max_brittleness = max([r["brittleness"] for r in all_results])
consistent_count = sum(1 for r in all_results if r["consistent"])

print(f"\nAverage brittleness: {avg_brittleness:.1f} pp")
print(f"Maximum brittleness: {max_brittleness:.0f} pp")
print(f"Consistent tests: {consistent_count}/{len(all_results)}")

if avg_brittleness > 30:
    print("\nDIAGNOSIS: High brittleness indicates pattern matching, not robust understanding.")
elif avg_brittleness > 10:
    print("\nDIAGNOSIS: Moderate brittleness - some format sensitivity detected.")
else:
    print("\nDIAGNOSIS: Low brittleness - relatively robust understanding.")
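
The thresholds above are heuristics. To check whether an observed accuracy spread could plausibly arise by chance, one option is a permutation test: shuffle the correct/incorrect outcomes across formats and see how often the shuffled spread is at least as large as the observed one. The sketch below assumes repeated trials per format. The function names and the outcome data are illustrative, not taken from the notebook's results.

```python
import random
from typing import Dict, List


def accuracy_range(outcomes_by_format: Dict[str, List[int]]) -> float:
    """Difference between the best and worst per-format accuracy (0 to 1)."""
    rates = [sum(v) / len(v) for v in outcomes_by_format.values()]
    return max(rates) - min(rates)


def permutation_p_value(outcomes_by_format: Dict[str, List[int]],
                        n_permutations: int = 2000, seed: int = 0) -> float:
    """P-value for the null hypothesis that format has no effect on correctness."""
    if any(len(v) == 0 for v in outcomes_by_format.values()):
        raise ValueError("every format needs at least one outcome")
    rng = random.Random(seed)
    observed = accuracy_range(outcomes_by_format)
    names = list(outcomes_by_format)
    sizes = [len(outcomes_by_format[n]) for n in names]
    pooled = [o for n in names for o in outcomes_by_format[n]]

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled, start = {}, 0
        for name, size in zip(names, sizes):
            shuffled[name] = pooled[start:start + size]
            start += size
        if accuracy_range(shuffled) >= observed - 1e-12:
            at_least_as_extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Illustrative outcomes (1 = correct, 0 = wrong) for 20 trials per format.
outcomes = {"natural": [1] * 18 + [0] * 2, "json": [1] * 9 + [0] * 11}
print(f"Observed range: {accuracy_range(outcomes) * 100:.0f} pp")
print(f"Permutation p-value: {permutation_p_value(outcomes):.4f}")
```

A small p-value suggests the spread is unlikely to be sampling noise. It does not say which format is at fault or why.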

## Part 2: Visualizing Brittleness

In [None]:
# Create visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Brittleness by test
test_names = [r["name"][:20] for r in all_results]
brittleness_scores = [r["brittleness"] for r in all_results]
# Color bands match the diagnosis thresholds used above (>30 high, >10 moderate).
colors = ['red' if b > 30 else 'orange' if b > 10 else 'green' for b in brittleness_scores]

axes[0].barh(test_names, brittleness_scores, color=colors)
axes[0].axvline(30, color='red', linestyle='--', label='High brittleness threshold')
axes[0].set_xlabel('Brittleness (percentage points)')
axes[0].set_title('Brittleness by Test Scenario')
axes[0].legend()

# Accuracy by format (aggregated)
format_accuracies = {}
for result in all_results:
    for fmt, res in result["format_results"].items():
        if fmt not in format_accuracies:
            format_accuracies[fmt] = []
        format_accuracies[fmt].append(1 if res["correct"] else 0)

formats = list(format_accuracies.keys())
accuracies = [np.mean(format_accuracies[f]) * 100 for f in formats]

axes[1].bar(formats, accuracies, color='steelblue')
axes[1].axhline(50, color='gray', linestyle='--', label='Random chance')
axes[1].set_ylabel('Accuracy (%)')
axes[1].set_title('Accuracy by Prompt Format')
axes[1].tick_params(axis='x', rotation=45)
axes[1].set_ylim(0, 100)
axes[1].legend()  # Without this call the 'Random chance' label is never displayed

plt.tight_layout()
plt.show()

## Key Takeaways

1. **Brittleness reveals pattern matching.** Large accuracy swings caused by format changes alone indicate that the model is not relying on robust understanding.

2. **Always test multiple formats.** Never trust single-format accuracy.

3. **Quantify and report brittleness.** It is a key metric for deployment decisions.

4. **High brittleness suggests a hybrid design is needed.** If format matters too much, the LLM alone is not enough.
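
As one hedged illustration of the hybrid idea in the last takeaway, the temporal checks from this session can be done with ordinary date arithmetic, leaving the LLM to extract times from free text. The functions below are a sketch, not part of the original notebook. They assume same-day times in `HH:MM` format and verify only the stated timing rule, not clinical safety.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%H:%M"  # Assumption: same-day, 24-hour clock times


def interval_satisfied(first_time: str, second_time: str, min_gap_hours: float) -> bool:
    """True if second_time is at least min_gap_hours after first_time."""
    first = datetime.strptime(first_time, TIME_FORMAT)
    second = datetime.strptime(second_time, TIME_FORMAT)
    return second - first >= timedelta(hours=min_gap_hours)


def intervals_overlap(start_a: str, end_a: str, start_b: str, end_b: str) -> bool:
    """True if the two intervals share any time, treating them as open intervals."""
    a0, a1, b0, b1 = (
        datetime.strptime(t, TIME_FORMAT) for t in (start_a, end_a, start_b, end_b)
    )
    return a0 < b1 and b0 < a1


# The three scenarios from this session, checked deterministically.
print(interval_satisfied("08:00", "11:00", 4))                # False: only 3 hours
print(interval_satisfied("08:00", "13:00", 4))                # True: 5 hours
print(intervals_overlap("14:00", "16:00", "15:00", "17:00"))  # True: 15:00-16:00 shared
```

These results match the `ground_truth` labels used in the test scenarios (NO, YES, YES), and they do not vary with how the question is phrased.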

---

**Homework:** Create a brittleness report for your deployment domain.

**Next Session:** Failure Mode IV: Response Latency and Interaction Constraints