# Example 3: Prompt Brittleness

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Javihaus/agents_observability_bootcamp/blob/main/chapter_01_diagnosing_agent_failures/examples/example_03_prompt_brittleness.ipynb)

**Instructor demonstration** - Students follow along without running code

---

## Objective

Demonstrate how identical semantic content can produce markedly different results when only the prompt format changes.

**Key lesson**: LLMs are highly sensitive to format, not just content. Production systems requiring structured I/O face systematic brittleness.

---

## Scenario

**Temporal Reasoning Task**
- Same semantic question about meeting feasibility
- Five different formats:
  1. Conversational (natural language)
  2. Structured (formal specification)
  3. JSON (production API format)
  4. Bullet points (documentation style)
  5. Table (data presentation)

**Hypothesis**: Accuracy will vary dramatically (30-60pp swings) despite identical information.
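
With only three test scenarios per format, accuracy can take just four values, so one changed answer moves a format by roughly 33 percentage points. The sketch below enumerates those values to help calibrate how the hypothesized swings should be read. The name `n_scenarios` is illustrative and not part of the notebook.

```python
# Possible accuracy values when each format is scored on n_scenarios items.
n_scenarios = 3  # assumption: matches the three test scenarios used in this example

# Accuracy for 0, 1, ..., n_scenarios correct answers, in percent
possible_accuracies = [round(k / n_scenarios * 100, 1) for k in range(n_scenarios + 1)]

# Change in accuracy caused by a single flipped answer, in percentage points
step_pp = round(100 / n_scenarios, 1)

print(possible_accuracies)
print(f"One flipped answer shifts accuracy by {step_pp} pp")
```

A 30-60pp swing therefore corresponds to one or two differing answers, which is worth stating when presenting the results.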

## Setup

In [None]:
# Install dependencies
!pip install -q langchain==0.1.0 langchain-anthropic==0.1.1 anthropic==0.18.1
!pip install -q python-dotenv pandas matplotlib

print("Installation complete!")

In [None]:
from google.colab import userdata
from langchain_anthropic import ChatAnthropic
from langchain.schema import HumanMessage, SystemMessage
import pandas as pd
import matplotlib.pyplot as plt
from typing import List, Dict
import json

# Get API key (instructor's key)
ANTHROPIC_API_KEY = userdata.get('ANTHROPIC_API_KEY')

print("Imports successful!")

## Test Cases

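Three temporal reasoning scenarios with clear correct answers, including one boundary case in which the meeting ends exactly at the deadline (counted as feasible).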

In [None]:
# Test scenarios (current_time, meeting_duration_minutes, deadline, answer)
scenarios = [
    {
        'id': 1,
        'current_time': '2:00 PM',
        'meeting_duration': 30,
        'deadline': '3:00 PM',
        'correct_answer': 'YES',
        'reasoning': '2:00 PM + 30 min = 2:30 PM, which is before 3:00 PM deadline'
    },
    {
        'id': 2,
        'current_time': '4:45 PM',
        'meeting_duration': 45,
        'deadline': '5:00 PM',
        'correct_answer': 'NO',
        'reasoning': '4:45 PM + 45 min = 5:30 PM, which exceeds 5:00 PM deadline'
    },
    {
        'id': 3,
        'current_time': '1:15 PM',
        'meeting_duration': 90,
        'deadline': '2:45 PM',
        'correct_answer': 'YES',
        'reasoning': '1:15 PM + 90 min = 2:45 PM, exactly at deadline (acceptable)'
    }
]

print(f"Loaded {len(scenarios)} test scenarios")
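
The expected answers can be cross-checked with plain time arithmetic, independently of any model. The following is a minimal sketch, assuming all times fall on the same day and that finishing exactly at the deadline counts as feasible; `expected_answer` and `cases` are hypothetical names introduced here.

```python
from datetime import datetime, timedelta

def expected_answer(current_time: str, duration_min: int, deadline: str) -> str:
    """Return 'YES' if the meeting ends at or before the deadline, else 'NO'."""
    fmt = "%I:%M %p"  # 12-hour clock, e.g. "2:00 PM"
    end = datetime.strptime(current_time, fmt) + timedelta(minutes=duration_min)
    return "YES" if end <= datetime.strptime(deadline, fmt) else "NO"

# Same values as the three scenarios, repeated so the sketch is self-contained
cases = [
    ("2:00 PM", 30, "3:00 PM"),
    ("4:45 PM", 45, "5:00 PM"),
    ("1:15 PM", 90, "2:45 PM"),
]

for current, duration, deadline in cases:
    print(current, duration, deadline, "->", expected_answer(current, duration, deadline))
```

The third case depends on the `<=` comparison: with a strict `<`, the boundary scenario would flip to NO.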

## Format 1: Conversational (Natural Language)

This format is closest to the natural-language text that dominates training data, so expect high accuracy.

In [None]:
llm = ChatAnthropic(
    model="claude-sonnet-4-20250514",
    anthropic_api_key=ANTHROPIC_API_KEY,
    max_tokens=100,
    temperature=0
)

print("=" * 60)
print("FORMAT 1: CONVERSATIONAL")
print("=" * 60)

format1_results = []

for scenario in scenarios:
    prompt = f"""It's currently {scenario['current_time']}. I have a meeting that will take {scenario['meeting_duration']} minutes, and I need to be done by {scenario['deadline']}.

Do I have enough time for this meeting? Please respond with just YES or NO."""
    
    messages = [
        SystemMessage(content="You are a helpful assistant."),
        HumanMessage(content=prompt)
    ]
    
    response = llm.invoke(messages)
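    answer = response.content.strip().upper()
    # Extract the YES/NO verdict as a whole word so that a reply such as
    # "YES." or "Yes, you have enough time" is not scored as wrong
    words = {w.strip('.,!?:;"\'*') for w in answer.split()}
    if 'YES' in words and 'NO' not in words:
        answer = 'YES'
    elif 'NO' in words:
        answer = 'NO'
    correct = answer == scenario['correct_answer']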
    
    format1_results.append(correct)
    
    print(f"\nScenario {scenario['id']}:")
    print(f"  Model answer: {answer}")
    print(f"  Correct answer: {scenario['correct_answer']}")
    print(f"  ✓ CORRECT" if correct else f"  ✗ WRONG")

accuracy1 = sum(format1_results) / len(format1_results) * 100
print(f"\nFormat 1 Accuracy: {accuracy1:.1f}%")

## Format 2: Structured (Formal Specification)

More formal, less conversational - expect an accuracy drop.

In [None]:
print("=" * 60)
print("FORMAT 2: STRUCTURED")
print("=" * 60)

format2_results = []

for scenario in scenarios:
    prompt = f"""TEMPORAL FEASIBILITY EVALUATION

Given:
  T_current = {scenario['current_time']}
  Duration = {scenario['meeting_duration']} minutes
  T_deadline = {scenario['deadline']}

Determine: Is (T_current + Duration) ≤ T_deadline?

Output: YES or NO"""
    
    messages = [
        SystemMessage(content="You are a temporal reasoning system."),
        HumanMessage(content=prompt)
    ]
    
    response = llm.invoke(messages)
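    answer = response.content.strip().upper()
    # Extract YES/NO if embedded in an explanation. Match whole words only:
    # a substring check misfires on words such as "ENOUGH" or "NOT"
    words = {w.strip('.,!?:;"\'*') for w in answer.split()}
    if 'YES' in words and 'NO' not in words:
        answer = 'YES'
    elif 'NO' in words:
        answer = 'NO'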
    
    correct = answer == scenario['correct_answer']
    
    format2_results.append(correct)
    
    print(f"\nScenario {scenario['id']}:")
    print(f"  Model answer: {answer}")
    print(f"  Correct answer: {scenario['correct_answer']}")
    print(f"  ✓ CORRECT" if correct else f"  ✗ WRONG")

accuracy2 = sum(format2_results) / len(format2_results) * 100
print(f"\nFormat 2 Accuracy: {accuracy2:.1f}%")
print(f"Drop from Format 1: {accuracy1 - accuracy2:.1f} percentage points")

## Format 3: JSON (Production API Format)

Common in production systems, where it can cause significant performance degradation.
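
Models sometimes wrap JSON in a Markdown code fence or add a sentence around it, in which case `json.loads` on the raw reply fails and the score depends on whatever fallback is used. The sketch below is one possible way to separate "could not parse" from "wrong answer"; `parse_json_answer` is a hypothetical helper, not part of the notebook.

```python
import json
import re

def parse_json_answer(text: str):
    """Return the upper-cased 'answer' field, or None if no usable JSON object is found."""
    # Take everything from the first '{' to the last '}' to skip fences or prose
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    answer = payload.get("answer")
    return answer.upper() if isinstance(answer, str) else None

fence = "`" * 3  # built at runtime to avoid a literal code fence inside this example
fenced_reply = fence + 'json\n{"feasible": true, "answer": "yes"}\n' + fence

print(parse_json_answer(fenced_reply))
print(parse_json_answer("I cannot determine that."))
```

Returning `None` lets the caller report parse failures as their own category instead of silently counting them as NO.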

In [None]:
print("=" * 60)
print("FORMAT 3: JSON")
print("=" * 60)

format3_results = []

for scenario in scenarios:
    input_json = {
        "current_time": scenario['current_time'],
        "meeting_duration_minutes": scenario['meeting_duration'],
        "deadline": scenario['deadline'],
        "question": "is_meeting_feasible"
    }
    
    prompt = f"""Process this temporal constraint query:

{json.dumps(input_json, indent=2)}

Return your answer in JSON format:
{{"feasible": true/false, "answer": "YES"/"NO"}}"""
    
    messages = [
        SystemMessage(content="You are a JSON API for temporal reasoning."),
        HumanMessage(content=prompt)
    ]
    
    response = llm.invoke(messages)
    
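    # Try to parse JSON response
    try:
        result = json.loads(response.content)
        answer = str(result.get('answer', '')).upper()
    except (json.JSONDecodeError, AttributeError):
        # Fallback: extract YES/NO as a whole word (e.g. when the JSON is wrapped
        # in a Markdown fence); anything else is scored as wrong, not as NO
        words = {w.strip('.,!?:;"\'*{}`') for w in response.content.upper().split()}
        answer = 'YES' if 'YES' in words else 'NO' if 'NO' in words else 'UNPARSEABLE'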
    
    correct = answer == scenario['correct_answer']
    
    format3_results.append(correct)
    
    print(f"\nScenario {scenario['id']}:")
    print(f"  Model answer: {answer}")
    print(f"  Correct answer: {scenario['correct_answer']}")
    print(f"  ✓ CORRECT" if correct else f"  ✗ WRONG")

accuracy3 = sum(format3_results) / len(format3_results) * 100
print(f"\nFormat 3 Accuracy: {accuracy3:.1f}%")
print(f"Drop from Format 1: {accuracy1 - accuracy3:.1f} percentage points")

## Format 4: Bullet Points (Documentation Style)

In [None]:
print("=" * 60)
print("FORMAT 4: BULLET POINTS")
print("=" * 60)

format4_results = []

for scenario in scenarios:
    prompt = f"""Meeting Feasibility Check:

• Current time: {scenario['current_time']}
• Meeting duration: {scenario['meeting_duration']} minutes
• Must finish by: {scenario['deadline']}
• Question: Can the meeting be completed by the deadline?

Answer with YES or NO only."""
    
    messages = [
        SystemMessage(content="You are a scheduling assistant."),
        HumanMessage(content=prompt)
    ]
    
    response = llm.invoke(messages)
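    answer = response.content.strip().upper()
    # Match YES/NO as whole words; a substring check misfires on "ENOUGH" or "NOT"
    words = {w.strip('.,!?:;"\'*') for w in answer.split()}
    if 'YES' in words and 'NO' not in words:
        answer = 'YES'
    elif 'NO' in words:
        answer = 'NO'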
    
    correct = answer == scenario['correct_answer']
    
    format4_results.append(correct)
    
    print(f"\nScenario {scenario['id']}:")
    print(f"  Model answer: {answer}")
    print(f"  Correct answer: {scenario['correct_answer']}")
    print(f"  ✓ CORRECT" if correct else f"  ✗ WRONG")

accuracy4 = sum(format4_results) / len(format4_results) * 100
print(f"\nFormat 4 Accuracy: {accuracy4:.1f}%")
print(f"Drop from Format 1: {accuracy1 - accuracy4:.1f} percentage points")

## Format 5: Table (Data Presentation)

In [None]:
print("=" * 60)
print("FORMAT 5: TABLE")
print("=" * 60)

format5_results = []

for scenario in scenarios:
    prompt = f"""| Parameter | Value |
|-----------|-------|
| Current Time | {scenario['current_time']} |
| Meeting Duration | {scenario['meeting_duration']} min |
| Deadline | {scenario['deadline']} |

Based on the table above, can the meeting be completed by the deadline?

Answer: YES or NO"""
    
    messages = [
        SystemMessage(content="You analyze tabular data."),
        HumanMessage(content=prompt)
    ]
    
    response = llm.invoke(messages)
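    answer = response.content.strip().upper()
    # Match YES/NO as whole words; a substring check misfires on "ENOUGH" or "NOT"
    words = {w.strip('.,!?:;"\'*') for w in answer.split()}
    if 'YES' in words and 'NO' not in words:
        answer = 'YES'
    elif 'NO' in words:
        answer = 'NO'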
    
    correct = answer == scenario['correct_answer']
    
    format5_results.append(correct)
    
    print(f"\nScenario {scenario['id']}:")
    print(f"  Model answer: {answer}")
    print(f"  Correct answer: {scenario['correct_answer']}")
    print(f"  ✓ CORRECT" if correct else f"  ✗ WRONG")

accuracy5 = sum(format5_results) / len(format5_results) * 100
print(f"\nFormat 5 Accuracy: {accuracy5:.1f}%")
print(f"Drop from Format 1: {accuracy1 - accuracy5:.1f} percentage points")

## Brittleness Analysis

In [None]:
print("=" * 60)
print("PROMPT BRITTLENESS ANALYSIS")
print("=" * 60)

# Compile results
results_df = pd.DataFrame({
    'Format': ['Conversational', 'Structured', 'JSON', 'Bullets', 'Table'],
    'Accuracy': [accuracy1, accuracy2, accuracy3, accuracy4, accuracy5]
})

print("\n" + results_df.to_string(index=False))

# Calculate metrics
max_accuracy = results_df['Accuracy'].max()
min_accuracy = results_df['Accuracy'].min()
std_dev = results_df['Accuracy'].std()  # sample standard deviation (pandas default, ddof=1)
range_pp = max_accuracy - min_accuracy

print(f"\n" + "=" * 60)
print("BRITTLENESS METRICS")
print("=" * 60)
print(f"Maximum accuracy: {max_accuracy:.1f}%")
print(f"Minimum accuracy: {min_accuracy:.1f}%")
print(f"Range: {range_pp:.1f} percentage points")
print(f"Standard deviation: {std_dev:.1f}pp")

if range_pp > 20:
    print(f"\n⚠ HIGH BRITTLENESS: {range_pp:.1f}pp range across formats")
    print("This system is NOT production-ready for structured I/O.")
else:
    print(f"\n✓ Acceptable range: {range_pp:.1f}pp")

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))

bars = ax.bar(results_df['Format'], results_df['Accuracy'], 
               color=['green' if a == max_accuracy else 'orange' if a >= 70 else 'red' 
                      for a in results_df['Accuracy']])

ax.set_ylabel('Accuracy (%)', fontsize=12)
ax.set_title('Prompt Format Impact on Accuracy\n(Identical Semantic Content)', fontsize=14)
ax.set_ylim(0, 110)
ax.axhline(y=100, color='green', linestyle='--', alpha=0.3, label='Perfect accuracy')
ax.axhline(y=70, color='orange', linestyle='--', alpha=0.3, label='Acceptable threshold')
ax.legend()

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{height:.1f}%',
            ha='center', va='bottom')

plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

print("\n" + "=" * 60)
print("KEY INSIGHTS")
print("=" * 60)
print("""
1. Identical information, different formats → large accuracy swings
2. Conversational format typically performs best (matches training data)
3. Production formats (JSON, structured) often have significantly lower accuracy
4. This is NOT a bug - it's an architectural characteristic of LLMs

Implications:
- Cannot reliably use structured I/O without extensive testing
- Demos (conversational) don't predict production (structured) performance
- Need hybrid approach: conversational extraction → deterministic transformation

Solution (Chapter 4):
- Use LLM for understanding (conversational strength)
- Use deterministic code for formatting (100% reliable)
- Separate concerns to leverage strengths of each
""")

---

## Instructor Notes

### Teaching Strategy

**Before running**: Emphasize that content is IDENTICAL
- Same times, same durations, same deadlines
- Only formatting differs
- Ask students to predict which format will work best

**During execution**: Point out failures as they happen
- Highlight when simple conversational works
- Show where structured formats fail
- Emphasize that these are trivial arithmetic problems

**After completion**: Connect to production implications
- APIs require structured I/O (JSON, XML)
- Compliance requires formal outputs
- Demos use conversational (misleading!)

### Common Student Questions

**Q: Can we fine-tune on structured formats?**
A: Yes, but it is expensive and may hurt general capability. The hybrid approach is cheaper and more reliable.

**Q: What about few-shot examples in structured format?**
A: They help somewhat, but results remain brittle. Test extensively.

**Q: Is this Claude-specific?**
A: No. Format sensitivity is a general property of autoregressive LLMs; GPT-4, Llama, and others show similar issues.

**Q: How do we handle this in production?**
A: Chapter 4 shows staged processing: conversational extraction → deterministic formatting.
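
As a rough illustration of the "deterministic formatting" half of that idea, the sketch below turns a free-text verdict into the JSON shape requested in Format 3 without asking the model to produce JSON. It is a simplified assumption about what such a stage could look like, not the Chapter 4 implementation; `to_api_payload` is a hypothetical name.

```python
import json

def to_api_payload(verdict_text: str) -> str:
    """Convert a free-text YES/NO verdict into a JSON string.

    Raises ValueError when the text contains neither or both verdicts.
    """
    # Compare whole words so that "ENOUGH" or "NOT" is not mistaken for "NO"
    words = {w.strip(".,!?:;\"'*") for w in verdict_text.upper().split()}
    has_yes, has_no = "YES" in words, "NO" in words
    if has_yes == has_no:  # neither verdict or both verdicts present
        raise ValueError(f"Ambiguous verdict: {verdict_text!r}")
    return json.dumps({"feasible": has_yes, "answer": "YES" if has_yes else "NO"})

print(to_api_payload("Yes, you have enough time."))
```

Raising on ambiguous text is a deliberate choice here: the formatting stage either produces valid JSON or fails loudly, so malformed output never reaches downstream consumers.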

### Time Management

- Setup: 2 minutes
- Format 1: 2 minutes
- Format 2: 2 minutes
- Format 3: 2 minutes
- Format 4: 2 minutes
- Format 5: 2 minutes
- Analysis: 5 minutes
- Discussion: 5 minutes
- **Total: 22 minutes**

### Variations

If time permits:
- Test with more scenarios
- Try different models (compare Claude vs GPT-4)
- Show temperature sensitivity
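
For the "more scenarios" variation, test cases can be generated with labels computed by construction rather than written by hand. This is a minimal sketch under stated assumptions: times stay within one day, finishing exactly at the deadline counts as YES, and the value ranges are arbitrary. `make_scenarios` is a hypothetical helper that returns the keys the format loops read (it omits the `reasoning` field).

```python
import random
from datetime import datetime, timedelta

def make_scenarios(n: int, seed: int = 0) -> list:
    """Generate n labeled scenarios; the same seed always gives the same list."""
    rng = random.Random(seed)
    fmt = "%I:%M %p"
    scenarios = []
    for i in range(1, n + 1):
        # Start between 9:00 AM and 4:45 PM on an arbitrary fixed date
        start = datetime(2024, 1, 1, rng.randint(9, 16), rng.choice([0, 15, 30, 45]))
        duration = rng.choice([30, 45, 60, 90])
        # Minutes between meeting end and deadline; negative means the meeting overruns
        slack = rng.choice([-15, 0, 15, 30])
        deadline = start + timedelta(minutes=duration + slack)
        scenarios.append({
            "id": i,
            "current_time": start.strftime(fmt).lstrip("0"),  # "09:00 AM" -> "9:00 AM"
            "meeting_duration": duration,
            "deadline": deadline.strftime(fmt).lstrip("0"),
            "correct_answer": "YES" if slack >= 0 else "NO",
        })
    return scenarios

for s in make_scenarios(3, seed=42):
    print(s)
```

With a few dozen scenarios per format, accuracy moves in much finer steps than the 33-point jumps of the three-scenario demo.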

### Transition to Homework

"We've seen three major failure modes: temporal coordination, cost explosion, and prompt brittleness. Now it's your turn to diagnose a real system using the framework from Chapter 1..."