# Session 9: Failure Mode Catalog
## Complete Taxonomy with Detection Protocols

**Production LLM Deployment: Risk Characterization Before Failure**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Javihaus/Production_LLM_Deployment/blob/main/sessions/session_09_failure_mode_catalog/notebook.ipynb)

---

**Learning Objectives:**
1. Apply the complete failure mode taxonomy
2. Select appropriate detection protocols
3. Build comprehensive testing suites
4. Prioritize testing based on risk

In [None]:
!pip install -q numpy pandas matplotlib seaborn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict
from dataclasses import dataclass

plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline
print("Setup complete!")

## Part 1: Complete Failure Mode Taxonomy

In [None]:
@dataclass
class FailureMode:
    name: str
    description: str
    detection_method: str
    key_metric: str
    threshold: str
    risk_domains: List[str]


FAILURE_MODES = [
    FailureMode(
        name="Brittleness",
        description="Accuracy varies with prompt format changes",
        detection_method="Multi-format testing (3+ formats per scenario)",
        key_metric="Max-min accuracy across formats",
        threshold=">30pp = high brittleness",
        risk_domains=["All domains"]
    ),
    FailureMode(
        name="Action Bias",
        description="Systematically recommends action over inaction",
        detection_method="Balanced test sets (50% action, 50% no-action)",
        key_metric="False Positive Rate",
        threshold=">20% FPR = significant bias",
        risk_domains=["Medical", "Safety", "Compliance"]
    ),
    FailureMode(
        name="Scaling Pathology",
        description="Larger models are more confident but not more accurate",
        detection_method="Test across 3+ model sizes",
        key_metric="Confidence-Competence Gap (CCG)",
        threshold="CCG < -20 = pathological",
        risk_domains=["Temporal", "Compositional", "Constraint tasks"]
    ),
    FailureMode(
        name="Temporal Reasoning",
        description="Fails on time-based constraint tasks",
        detection_method="Allen's interval algebra test suite",
        key_metric="Temporal accuracy + FPR",
        threshold="<70% accuracy or >30% FPR",
        risk_domains=["Scheduling", "Medical timing", "Finance"]
    ),
    FailureMode(
        name="Latency Degradation",
        description="Response time impacts interaction quality",
        detection_method="Response time measurement",
        key_metric="Turns per minute / Task completion",
        threshold=">5s average = significant impact",
        risk_domains=["Interactive", "Collaborative", "Real-time"]
    ),
    FailureMode(
        name="Hallucination",
        description="Generates plausible but false information",
        detection_method="Fact verification against ground truth",
        key_metric="Grounding rate / Factual accuracy",
        threshold="<90% for high-stakes applications",
        risk_domains=["Knowledge retrieval", "Medical", "Legal"]
    ),
    FailureMode(
        name="Compositional Failure",
        description="Cannot generalize to novel combinations",
        detection_method="Lake & Baroni style composition tests",
        key_metric="Novel combination accuracy",
        threshold="<50% = significant limitation",
        risk_domains=["Algorithm design", "Planning", "Reasoning"]
    )
]

# Display catalog
print("="*80)
print("COMPLETE FAILURE MODE CATALOG")
print("="*80)

for fm in FAILURE_MODES:
    print(f"\n{fm.name}")
    print(f"  Description: {fm.description}")
    print(f"  Detection: {fm.detection_method}")
    print(f"  Metric: {fm.key_metric}")
    print(f"  Threshold: {fm.threshold}")
    print(f"  Risk domains: {', '.join(fm.risk_domains)}")
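
The catalog stores each metric and threshold as a descriptive string, so nothing above actually computes them. As a minimal sketch, the first two metrics (max-min accuracy across formats, and False Positive Rate on a balanced set) could be computed as follows. The function names, the example accuracies, and the label encoding (1 = action, 0 = no action) are illustrative assumptions, not part of the catalog.

```python
from typing import Dict, Sequence


def brittleness_spread(accuracy_by_format: Dict[str, float]) -> float:
    """Max-min accuracy across prompt formats, in percentage points."""
    if len(accuracy_by_format) < 3:
        # The catalog asks for 3+ formats per scenario
        raise ValueError("Need at least 3 prompt formats per scenario")
    values = list(accuracy_by_format.values())
    return (max(values) - min(values)) * 100


def false_positive_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """FPR = FP / (FP + TN), with 1 = action and 0 = no action (assumed encoding)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    if fp + tn == 0:
        raise ValueError("Test set contains no no-action cases")
    return fp / (fp + tn)


# Hypothetical accuracies for one scenario under three prompt formats
accuracy = {"plain": 0.82, "json": 0.61, "bullet": 0.47}
spread = brittleness_spread(accuracy)
print(f"Brittleness spread: {spread:.1f}pp, high brittleness: {spread > 30}")

# Hypothetical balanced test set: 5 action cases, 5 no-action cases
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]
fpr = false_positive_rate(y_true, y_pred)
print(f"FPR: {fpr:.0%}, significant bias: {fpr > 0.20}")
```

The comparisons against 30pp and 20% mirror the thresholds listed in the catalog for Brittleness and Action Bias.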

## Part 2: Testing Suite Builder

In [None]:
class TestingSuiteBuilder:
    """Build comprehensive pre-deployment testing suite."""
    
    def __init__(self, application_name: str):
        self.application = application_name
        self.selected_modes = []
        self.test_inventory = []
    
    def assess_applicability(self, domains: List[str]) -> List[FailureMode]:
        """Identify applicable failure modes (exact, case-sensitive domain match)."""
        applicable = []
        for fm in FAILURE_MODES:
            if "All domains" in fm.risk_domains:
                applicable.append(fm)
            elif any(d in fm.risk_domains for d in domains):
                applicable.append(fm)
        self.selected_modes = applicable
        return applicable
    
    def generate_checklist(self) -> List[Dict]:
        """Generate testing checklist."""
        checklist = []
        for fm in self.selected_modes:
            checklist.append({
                "Failure Mode": fm.name,
                "Test Method": fm.detection_method,
                "Metric": fm.key_metric,
                "Failure Threshold": fm.threshold,
                "Status": "Pending"
            })
        return checklist
    
    def print_suite(self):
        """Print the testing suite."""
        checklist = self.generate_checklist()
        df = pd.DataFrame(checklist)
        
        print("="*80)
        print(f"PRE-DEPLOYMENT TESTING SUITE: {self.application}")
        print("="*80)
        print(f"\nApplicable failure modes: {len(self.selected_modes)}")
        print("\n" + df.to_string(index=False))


# Example: Medical Triage Application
builder = TestingSuiteBuilder("Medical Symptom Triage")
applicable = builder.assess_applicability(["Medical", "Safety", "Interactive"])
builder.print_suite()
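
Because `assess_applicability` compares domain labels by exact string match, a label that is not in the catalog silently matches nothing. One way to guard against that, sketched here under the assumption that the set of valid labels is simply every string appearing in the catalog's `risk_domains` fields, is to check the input first. `KNOWN_DOMAINS` and `find_unknown_domains` are hypothetical names introduced for illustration.

```python
from typing import Iterable, List, Set

# Domain labels copied from the risk_domains fields of the catalog in Part 1
KNOWN_DOMAINS: Set[str] = {
    "Medical", "Safety", "Compliance", "Temporal", "Compositional",
    "Constraint tasks", "Scheduling", "Medical timing", "Finance",
    "Interactive", "Collaborative", "Real-time", "Knowledge retrieval",
    "Legal", "Algorithm design", "Planning", "Reasoning",
}


def find_unknown_domains(domains: Iterable[str],
                         known: Set[str] = KNOWN_DOMAINS) -> List[str]:
    """Return the labels that would match no failure mode in the catalog."""
    return [d for d in domains if d not in known]


# "Financial" is not a catalog label (the catalog uses "Finance")
unknown = find_unknown_domains(["Medical", "Financial", "Interactive"])
if unknown:
    print(f"Unrecognized domains (check spelling): {unknown}")
```

Running this check before calling `assess_applicability` turns a silent omission into a visible warning.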

## Part 3: Prioritization Matrix

In [None]:
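# Risk prioritization: each score list is ordered to match FAILURE_MODES
# (one score per failure mode, in catalog order)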
priority_matrix = {
    "Failure Mode": [fm.name for fm in FAILURE_MODES],
    "Medical": [5, 5, 3, 5, 3, 5, 2],
    "Financial": [4, 4, 3, 4, 2, 5, 3],
    "Customer Service": [4, 2, 2, 2, 4, 3, 2],
    "Code Assistant": [3, 2, 2, 1, 4, 3, 4],
}

df_priority = pd.DataFrame(priority_matrix)
df_priority = df_priority.set_index("Failure Mode")

fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(df_priority, annot=True, cmap="YlOrRd", ax=ax,
            cbar_kws={"label": "Priority (1-5)"})
ax.set_title("Failure Mode Priority by Application Domain")
plt.tight_layout()
plt.show()

print("\nPriority Scale: 1 = Low priority, 5 = Critical")
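
The heatmap shows priorities per domain, but an application that spans several domains still needs a single testing order. A minimal sketch, assuming the highest score across the selected domains should win (one possible policy, not the only one), is shown below. It uses plain Python and copies two columns of the matrix above. `rank_failure_modes` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

FAILURE_MODE_NAMES: List[str] = [
    "Brittleness", "Action Bias", "Scaling Pathology", "Temporal Reasoning",
    "Latency Degradation", "Hallucination", "Compositional Failure",
]

# Two columns copied from the priority matrix, in catalog order
PRIORITY: Dict[str, List[int]] = {
    "Medical": [5, 5, 3, 5, 3, 5, 2],
    "Customer Service": [4, 2, 2, 2, 4, 3, 2],
}


def rank_failure_modes(domains: List[str]) -> List[Tuple[str, int]]:
    """Order failure modes by their highest priority across the given domains."""
    if not domains:
        raise ValueError("Select at least one domain")
    scores = [
        (name, max(PRIORITY[d][i] for d in domains))
        for i, name in enumerate(FAILURE_MODE_NAMES)
    ]
    # sorted() is stable, so ties keep catalog order
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = rank_failure_modes(["Medical", "Customer Service"])
for name, score in ranking:
    print(f"{score}  {name}")
```

Taking the maximum is conservative: a failure mode that is critical in any one of the application's domains is tested early, even if it matters little in the others.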

## Part 4: Exercise - Build Your Testing Suite

In [None]:
# YOUR EXERCISE: Build your testing suite

my_builder = TestingSuiteBuilder("YOUR APPLICATION NAME")

# Select your applicable domains:
my_domains = [
    # Uncomment applicable domains:
    # "Medical",
    # "Safety",
    # "Finance",
    # "Scheduling",
    # "Interactive",
    # "Knowledge retrieval",
    # "Planning",
]

if my_domains:
    my_applicable = my_builder.assess_applicability(my_domains)
    my_builder.print_suite()
else:
    print("Please uncomment your applicable domains above.")

## Key Takeaways

1. **Use the complete taxonomy.** Don't cherry-pick failure modes to test.

2. **Prioritize by domain risk.** Medical/safety applications need more rigorous testing.

3. **Document everything.** Testing results are part of deployment documentation.

4. **Define pass/fail criteria upfront.** Avoid moving goalposts after testing.
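
One way to make takeaway 4 concrete is to encode each threshold as an immutable, machine-checkable record before any test is run. The sketch below is an assumption about how that could look, not part of the session's code. `FailCriterion` is a hypothetical class, the limits are taken from three catalog entries, and the measured values are invented for illustration.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)  # frozen: criteria cannot be edited after definition
class FailCriterion:
    failure_mode: str
    metric: str
    fails_when: Callable[[float, float], bool]  # comparison(measured, limit)
    limit: float

    def evaluate(self, measured: float) -> str:
        return "Fail" if self.fails_when(measured, self.limit) else "Pass"


# Limits taken from the catalog thresholds for these three failure modes
CRITERIA = [
    FailCriterion("Brittleness", "Max-min accuracy across formats (pp)",
                  operator.gt, 30.0),
    FailCriterion("Action Bias", "False Positive Rate", operator.gt, 0.20),
    FailCriterion("Compositional Failure", "Novel combination accuracy",
                  operator.lt, 0.50),
]

# Hypothetical measurements from a test run
measured = {"Brittleness": 35.0, "Action Bias": 0.12,
            "Compositional Failure": 0.64}

results = {c.failure_mode: c.evaluate(measured[c.failure_mode]) for c in CRITERIA}
for mode, status in results.items():
    print(f"{mode}: {status}")
```

Because the dataclass is frozen, attempting to change a limit after the fact raises `dataclasses.FrozenInstanceError`, which makes moving the goalposts an explicit code change rather than a quiet edit.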

---

**Next Session:** Hybrid Architecture Design I