# Session 1: Capability Taxonomy
## What LLMs Can and Cannot Reliably Do

**Production LLM Deployment: Risk Characterization Before Failure**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Javihaus/Production_LLM_Deployment/blob/main/sessions/session_01_capability_taxonomy/notebook.ipynb)

---

**Learning Objectives:**
1. Understand the six capability classes for LLM task classification
2. Identify architectural prerequisites for each capability class
3. Apply the classification framework to real deployment scenarios
4. Determine when hybrid architectures are necessary

## Setup

In [None]:
# Install required packages
!pip install -q anthropic langchain langgraph langchain-anthropic
!pip install -q numpy pandas matplotlib seaborn

import anthropic
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Tuple
import json
import time

# Configure plotting
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

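# Get API key: prefer Colab secrets, fall back to an environment variable
try:
    from google.colab import userdata
    api_key = userdata.get('ANTHROPIC_API_KEY')
except Exception:  # not running in Colab, or the secret is not available
    import os
    api_key = os.environ.get('ANTHROPIC_API_KEY')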

client = anthropic.Anthropic(api_key=api_key)
print("Setup complete!")

## Part 1: The Six Capability Classes

We define six capability classes based on what architectural mechanisms they require:

| Class | Description | LLM Reliability |
|-------|-------------|----------------|
| Pattern Completion | High-probability text continuations | HIGH |
| Knowledge Retrieval | Recall of training data facts | MEDIUM |
| Compositional Reasoning | Systematic combination of primitives | LOW |
| Continuous State | Maintenance of evolving quantities | VERY LOW |
| Constraint Satisfaction | Simultaneous hard constraint handling | LOW |
| Interactive Collaboration | Real-time human interaction | MEDIUM (latency-dependent) |

## Part 2: Demonstrating Capability Classes

### 2.1 Pattern Completion (HIGH Reliability)

Tasks where the model continues patterns seen during training.

In [None]:
def test_pattern_completion():
    """Test pattern completion capability - should work reliably."""
    
    prompts = [
        "Complete this Python function:\ndef fibonacci(n):\n    if n <= 1:\n        return n\n    return",
        "Summarize in one sentence: The quick brown fox jumps over the lazy dog. This sentence contains every letter of the alphabet.",
        "Translate to French: Hello, how are you today?"
    ]
    
    print("=" * 60)
    print("PATTERN COMPLETION TESTS")
    print("=" * 60)
    
    for i, prompt in enumerate(prompts, 1):
        response = client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=150,
            messages=[{"role": "user", "content": prompt}]
        )
        print(f"\nTest {i}:")
        print(f"Prompt: {prompt[:50]}...")
        print(f"Response: {response.content[0].text[:200]}")
        print("-" * 40)

test_pattern_completion()

### 2.2 Compositional Reasoning (LOW Reliability)

Tasks requiring systematic combination of learned primitives in novel ways.

In [None]:
def test_compositional_reasoning():
    """Test compositional reasoning - expect inconsistent results."""
    
    # Lake & Baroni style compositional tests
    prompts = [
        # Novel composition of known primitives
        """If 'dax' means 'jump twice' and 'wif' means 'turn left', 
        what does 'dax wif dax' mean? Describe the sequence of actions.""",
        
        # Multi-step logical reasoning
        """A is bigger than B. B is bigger than C. C is bigger than D.
        E is bigger than A. F is smaller than D.
        List all items from biggest to smallest.""",
        
        # Novel algorithm application
        """Apply this sorting rule to [5, 2, 8, 1, 9]:
        - Compare adjacent pairs
        - If left > right, swap them
        - Repeat until no swaps needed
        Show each step."""
    ]
    
    print("=" * 60)
    print("COMPOSITIONAL REASONING TESTS")
    print("=" * 60)
    
    results = []
    for i, prompt in enumerate(prompts, 1):
        response = client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=500,
            messages=[{"role": "user", "content": prompt}]
        )
        print(f"\nTest {i}:")
        print(f"Prompt: {prompt[:60]}...")
        print(f"Response:\n{response.content[0].text}")
        print("-" * 40)
        results.append(response.content[0].text)
    
    return results

comp_results = test_compositional_reasoning()
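
The cell above prints the responses but does not score them. As a minimal sketch (not part of the original notebook), reference answers for the ordering and sorting prompts can be computed deterministically and compared against the model output by hand. The helper names `reference_ordering` and `bubble_sort_passes` are hypothetical, and `graphlib` assumes Python 3.9 or later.

```python
from graphlib import TopologicalSorter
from typing import List, Tuple


def reference_ordering(bigger_than: List[Tuple[str, str]]) -> List[str]:
    """Order items biggest-first from (bigger, smaller) pairs."""
    sorter = TopologicalSorter()
    for bigger, smaller in bigger_than:
        # Register 'bigger' as a predecessor so it appears before 'smaller'
        sorter.add(smaller, bigger)
    return list(sorter.static_order())


def bubble_sort_passes(values: List[int]) -> List[List[int]]:
    """Return the list state after each pass, ending with the pass that makes no swaps."""
    items = list(values)
    passes = []
    while True:
        swapped = False
        for j in range(len(items) - 1):
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
                swapped = True
        passes.append(list(items))
        if not swapped:
            return passes


# The relations from the second prompt, as (bigger, smaller) pairs
pairs = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "A"), ("D", "F")]
print(reference_ordering(pairs))

# The list from the third prompt
for state in bubble_sort_passes([5, 2, 8, 1, 9]):
    print(state)
```

The ordering is unique here only because the stated relations form a single chain; with partial information a topological sort would return one of several valid orders.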

### 2.3 Continuous State Representation (VERY LOW Reliability)

Tasks requiring maintenance of continuous quantities over time.

In [None]:
def test_temporal_reasoning():
    """Test temporal/continuous state reasoning - expect failures."""
    
    scenarios = [
        {
            "prompt": """A patient takes medication A at 8:00 AM. 
            Medication A must be taken at least 4 hours before medication B.
            Can the patient take medication B at 11:30 AM?
            Answer only YES or NO, then explain.""",
            "correct": "NO",
            "reason": "11:30 AM is only 3.5 hours after 8:00 AM, not 4 hours"
        },
        {
            "prompt": """Meeting A runs from 2:00 PM to 3:30 PM.
            Meeting B runs from 3:00 PM to 4:00 PM.
            Do these meetings overlap?
            Answer only YES or NO, then explain.""",
            "correct": "YES",
            "reason": "They overlap from 3:00 PM to 3:30 PM"
        },
        {
            "prompt": """A train departs at 9:15 AM and travels for 2 hours 45 minutes.
            What time does it arrive?
            Give only the arrival time.""",
            "correct": "12:00 PM",
            "reason": "9:15 + 2:45 = 12:00"
        }
    ]
    
    print("=" * 60)
    print("TEMPORAL REASONING TESTS")
    print("=" * 60)
    
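    import re  # local import: the setup cell does not import re

    results = []
    for i, scenario in enumerate(scenarios, 1):
        response = client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=200,
            messages=[{"role": "user", "content": scenario["prompt"]}]
        )
        
        answer = response.content[0].text
        # Match the expected answer as a whole word that starts within the first
        # 20 characters; a plain substring test would also find "no" inside
        # words such as "not" or "noon"
        pattern = rf"\b{re.escape(scenario['correct'])}\b"
        match = re.search(pattern, answer, re.IGNORECASE)
        correct = match is not None and match.start() < 20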
        
        print(f"\nTest {i}:")
        print(f"Scenario: {scenario['prompt'][:60]}...")
        print(f"Expected: {scenario['correct']}")
        print(f"Response: {answer[:150]}")
        print(f"Correct: {'YES' if correct else 'NO'}")
        print("-" * 40)
        
        results.append({
            "test": i,
            "correct": correct,
            "expected": scenario["correct"],
            "response": answer
        })
    
    accuracy = sum(r["correct"] for r in results) / len(results)
    print(f"\nOverall Accuracy: {accuracy:.1%}")
    return results

temporal_results = test_temporal_reasoning()
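
The expected answers in the three scenarios can be checked with ordinary date arithmetic. The following is a rough sketch, assuming 12-hour clock strings such as `"8:00 AM"` that all fall on the same day; the function names are hypothetical and not part of the original notebook.

```python
from datetime import datetime, timedelta


def parse_clock(text: str) -> datetime:
    """Parse a 12-hour clock string such as '8:00 AM' (same-day times assumed)."""
    return datetime.strptime(text, "%I:%M %p")


def min_gap_satisfied(first: str, second: str, min_hours: float) -> bool:
    """True if 'second' is at least 'min_hours' after 'first'."""
    return parse_clock(second) - parse_clock(first) >= timedelta(hours=min_hours)


def intervals_overlap(a_start: str, a_end: str, b_start: str, b_end: str) -> bool:
    """True if the two intervals share any span of time."""
    return parse_clock(a_start) < parse_clock(b_end) and parse_clock(b_start) < parse_clock(a_end)


def arrival_time(departure: str, hours: int, minutes: int) -> str:
    """Add a travel duration to a departure time."""
    arrival = parse_clock(departure) + timedelta(hours=hours, minutes=minutes)
    return arrival.strftime("%I:%M %p").lstrip("0")


print(min_gap_satisfied("8:00 AM", "11:30 AM", 4))                 # medication scenario
print(intervals_overlap("2:00 PM", "3:30 PM", "3:00 PM", "4:00 PM"))  # meeting scenario
print(arrival_time("9:15 AM", 2, 45))                              # train scenario
```

Computing ground truth this way keeps the `correct` fields honest when new scenarios are added, instead of relying on hand-worked answers.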

## Part 3: Measuring Brittleness

A key diagnostic for distinguishing pattern matching from robust reasoning is **brittleness**: how much accuracy changes when the prompt format varies while the semantic content stays identical.

In [None]:
def measure_brittleness():
    """Measure accuracy changes across prompt format variations."""
    
    # Same semantic content, different formats
    base_scenario = {
        "content": "Patient took aspirin at 8 AM. Must wait 4 hours before ibuprofen. Can take ibuprofen at 11 AM?",
        "correct": "NO"  # Only 3 hours have passed
    }
    
    prompt_formats = [
        # Format 1: Direct question
        """A patient took aspirin at 8:00 AM. They must wait at least 4 hours before taking ibuprofen. 
        Can they take ibuprofen at 11:00 AM? Answer YES or NO only.""",
        
        # Format 2: Clinical note style
        """CLINICAL NOTE
        Medication administered: Aspirin @ 08:00
        Required interval before ibuprofen: 4h minimum
        Proposed ibuprofen time: 11:00
        
        Is proposed time acceptable? (YES/NO)""",
        
        # Format 3: Conversational
        """Hey, I took aspirin this morning at 8. The bottle says wait 4 hours before 
        taking ibuprofen. It's 11 now - am I good to take it? Just yes or no.""",
        
        # Format 4: Formal medical
        """Given: Acetylsalicylic acid (aspirin) administered at 0800 hours.
        Constraint: Minimum 4-hour interval required before NSAID (ibuprofen) administration.
        Query: Is ibuprofen administration at 1100 hours compliant with temporal constraint?
        Response format: YES or NO""",
        
        # Format 5: JSON-style
        """{
          "medication_1": {"drug": "aspirin", "time": "8:00 AM"},
          "constraint": "4 hours between aspirin and ibuprofen",
          "medication_2": {"drug": "ibuprofen", "proposed_time": "11:00 AM"}
        }
        Is the proposed time valid? Respond with only YES or NO."""
    ]
    
    print("=" * 60)
    print("BRITTLENESS TEST: Same scenario, different formats")
    print("=" * 60)
    print(f"Correct answer for all formats: {base_scenario['correct']}")
    print("(8:00 AM to 11:00 AM is 3 hours, which is less than the required 4 hours)\n")
    
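    import re  # local import: the setup cell does not import re

    results = []
    for i, prompt in enumerate(prompt_formats, 1):
        response = client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=50,
            messages=[{"role": "user", "content": prompt}]
        )
        
        answer = response.content[0].text.strip().upper()
        # Extract the first standalone YES or NO that starts within the first
        # 10 characters; a plain substring test would also match "NO" inside
        # words such as "NOT" or "NOON"
        match = re.search(r"\b(YES|NO)\b", answer)
        extracted = match.group(1) if match and match.start() < 10 else "UNCLEAR"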
        
        correct = extracted == base_scenario["correct"]
        
        print(f"Format {i}: {extracted} - {'CORRECT' if correct else 'WRONG'}")
        results.append({"format": i, "response": extracted, "correct": correct})
        
        time.sleep(0.5)  # Rate limiting
    
    accuracy = sum(r["correct"] for r in results) / len(results)
    print(f"\nAccuracy across formats: {accuracy:.1%}")
    # With one sample per format, this is the share of formats answered incorrectly
    print(f"Brittleness indicator: {(1 - accuracy) * 100:.1f}% of formats answered incorrectly")
    
    return results

brittleness_results = measure_brittleness()
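
A single sample per format gives a noisy picture, since model output can vary between calls. One possible refinement, sketched here under the assumption that each format is run several times and the outcomes are collected as lists of booleans, is to report the spread between the best and worst format. The function name `brittleness_summary` and the example data are hypothetical, not measured results.

```python
from typing import Dict, List


def brittleness_summary(outcomes: Dict[str, List[bool]]) -> Dict[str, float]:
    """Summarize per-format accuracy from repeated trials of each prompt format."""
    per_format = {
        name: sum(trials) / len(trials) for name, trials in outcomes.items()
    }
    best = max(per_format.values())
    worst = min(per_format.values())
    return {
        "mean_accuracy": sum(per_format.values()) / len(per_format),
        "best_format_accuracy": best,
        "worst_format_accuracy": worst,
        # Spread between the best and worst format, in percentage points
        "spread_pp": (best - worst) * 100,
    }


# Hypothetical outcomes for illustration only (not measured results)
example = {
    "direct": [True, True, True, True],
    "clinical_note": [True, True, False, True],
    "json": [True, False, False, True],
}
print(brittleness_summary(example))
```

A large spread with a high mean suggests the task is format-sensitive rather than uniformly hard, which is the situation the brittleness test is meant to surface.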

## Part 4: Capability Classification Framework

Use this framework to classify any deployment scenario.

In [None]:
class CapabilityClassifier:
    """Framework for classifying LLM deployment scenarios."""
    
    CLASSES = {
        "pattern_completion": {
            "name": "Pattern Completion",
            "reliability": "HIGH",
            "hybrid_needed": False,
            "examples": ["text completion", "summarization", "translation", "code completion"]
        },
        "knowledge_retrieval": {
            "name": "Knowledge Retrieval",
            "reliability": "MEDIUM",
            "hybrid_needed": "Sometimes (RAG)",
            "examples": ["factual Q&A", "concept explanation", "definitions"]
        },
        "compositional_reasoning": {
            "name": "Compositional Reasoning",
            "reliability": "LOW",
            "hybrid_needed": True,
            "examples": ["multi-step proofs", "novel algorithms", "logical deduction"]
        },
        "continuous_state": {
            "name": "Continuous State Representation",
            "reliability": "VERY LOW",
            "hybrid_needed": True,
            "examples": ["temporal constraints", "scheduling", "physical simulation"]
        },
        "constraint_satisfaction": {
            "name": "Constraint Satisfaction",
            "reliability": "LOW",
            "hybrid_needed": True,
            "examples": ["scheduling", "resource allocation", "compliance verification"]
        },
        "interactive_collaboration": {
            "name": "Interactive Collaboration",
            "reliability": "MEDIUM",
            "hybrid_needed": "Latency-dependent",
            "examples": ["pair programming", "collaborative writing", "real-time support"]
        }
    }
    
    def __init__(self):
        self.classification = {}
        
    def classify(self, scenario_name: str, required_classes: List[str]) -> Dict:
        """Classify a deployment scenario."""
        
        self.classification = {
            "scenario": scenario_name,
            "classes": {},
            "weakest_link": None,
            "hybrid_needed": False,
            "testing_required": []
        }
        
        reliability_order = ["VERY LOW", "LOW", "MEDIUM", "HIGH"]
        # Start above the highest rank so an all-HIGH scenario still reports a weakest link
        min_rank = len(reliability_order)
        
        for class_key in required_classes:
            if class_key in self.CLASSES:
                class_info = self.CLASSES[class_key]
                self.classification["classes"][class_key] = class_info
                
                # Track weakest link (the first class listed wins ties)
                rank = reliability_order.index(class_info["reliability"])
                if rank < min_rank:
                    min_rank = rank
                    self.classification["weakest_link"] = class_key
                
                # Only a literal True forces a hybrid; string values such as
                # "Sometimes (RAG)" are conditional and do not set the flag
                if class_info["hybrid_needed"] is True:
                    self.classification["hybrid_needed"] = True
        
        # Determine required testing
        self._determine_testing()
        
        return self.classification
    
    def _determine_testing(self):
        """Determine required testing based on capability classes."""
        testing = []
        
        if "continuous_state" in self.classification["classes"]:
            testing.extend(["Temporal constraint testing", "Allen's interval algebra validation"])
        
        if "compositional_reasoning" in self.classification["classes"]:
            testing.extend(["Compositional generalization tests", "Multi-step reasoning validation"])
        
        if "knowledge_retrieval" in self.classification["classes"]:
            testing.extend(["Hallucination detection", "Confidence calibration"])
        
        if "constraint_satisfaction" in self.classification["classes"]:
            testing.extend(["Constraint violation detection", "Edge case testing"])
        
        if "interactive_collaboration" in self.classification["classes"]:
            testing.extend(["Latency impact measurement", "Interaction bandwidth analysis"])
        
        # Always include brittleness testing
        testing.append("Multi-format brittleness testing")
        
        self.classification["testing_required"] = testing
    
    def print_report(self):
        """Print classification report."""
        c = self.classification
        
        print("=" * 60)
        print(f"CAPABILITY CLASSIFICATION: {c['scenario']}")
        print("=" * 60)
        
        print("\nRequired Capability Classes:")
        for key, info in c["classes"].items():
            print(f"  - {info['name']}: Reliability = {info['reliability']}")
        
        print(f"\nWeakest Link: {self.CLASSES[c['weakest_link']]['name'] if c['weakest_link'] else 'N/A'}")
        print(f"Hybrid Architecture Needed: {'YES' if c['hybrid_needed'] else 'NO'}")
        
        print("\nRequired Testing:")
        for test in c["testing_required"]:
            print(f"  - {test}")
        
        print("=" * 60)

In [None]:
# Example: Medical Triage Chatbot
classifier = CapabilityClassifier()

classification = classifier.classify(
    scenario_name="Medical Symptom Triage Chatbot",
    required_classes=[
        "pattern_completion",      # Understanding natural language
        "knowledge_retrieval",      # Medical knowledge
        "continuous_state",         # Symptom timing
        "constraint_satisfaction"   # Drug interactions
    ]
)

classifier.print_report()

In [None]:
# Example: Code Assistant
classifier2 = CapabilityClassifier()

classification2 = classifier2.classify(
    scenario_name="Code Completion Assistant",
    required_classes=[
        "pattern_completion",       # Code patterns
        "knowledge_retrieval",      # API knowledge
        "interactive_collaboration" # Pair programming
    ]
)

classifier2.print_report()

## Part 5: Exercise - Classify Your Deployment Scenario

Use the framework below to classify your own LLM deployment scenario.

In [None]:
# YOUR EXERCISE: Fill in your deployment scenario

my_classifier = CapabilityClassifier()

# Replace with your scenario
my_classification = my_classifier.classify(
    scenario_name="YOUR SCENARIO NAME HERE",  # e.g., "Customer Support Chatbot"
    required_classes=[
        # Uncomment the classes your scenario requires:
        "pattern_completion",
        # "knowledge_retrieval",
        # "compositional_reasoning",
        # "continuous_state",
        # "constraint_satisfaction",
        # "interactive_collaboration"
    ]
)

my_classifier.print_report()
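
Note that `classify` skips any key it does not recognize, so a typo in `required_classes` silently drops that class from the report. A small pre-check along the following lines can catch this. It is a sketch only: `find_unknown_keys` is a hypothetical helper, and `KNOWN_CLASS_KEYS` is a hand-copied list of the keys in `CapabilityClassifier.CLASSES` so the example runs on its own.

```python
from difflib import get_close_matches
from typing import Dict, List

# Hand-copied from CapabilityClassifier.CLASSES so this sketch is self-contained
KNOWN_CLASS_KEYS = [
    "pattern_completion",
    "knowledge_retrieval",
    "compositional_reasoning",
    "continuous_state",
    "constraint_satisfaction",
    "interactive_collaboration",
]


def find_unknown_keys(
    required_classes: List[str], known: List[str] = KNOWN_CLASS_KEYS
) -> Dict[str, List[str]]:
    """Map each unrecognized key to its closest known key (if any)."""
    unknown = {}
    for key in required_classes:
        if key not in known:
            unknown[key] = get_close_matches(key, known, n=1)
    return unknown


# "continous_state" is a deliberate typo
print(find_unknown_keys(["pattern_completion", "continous_state"]))
```

In the notebook itself, passing `list(CapabilityClassifier.CLASSES)` as `known` would avoid keeping a second copy of the keys.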

## Part 6: Key Takeaways

1. **Not all tasks are equal.** LLMs excel at pattern completion but are unreliable at continuous state representation and compositional reasoning.

2. **Benchmarks lie.** High benchmark scores do not predict production reliability for tasks requiring capabilities beyond pattern matching.

3. **Classify before building.** Understanding capability requirements prevents wasted engineering effort on fundamentally unsuitable applications.

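4. **Hybrid architectures are often necessary.** For high-stakes applications requiring Class 3-5 capabilities (compositional reasoning, continuous state, and constraint satisfaction, the third through fifth rows of the Part 1 table), pure LLM deployment is inappropriate.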

5. **Testing must match capability class.** Different failure modes require different testing protocols.
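
To make the fourth takeaway concrete, here is one possible shape for a hybrid design: the LLM only extracts structured fields from free text, and deterministic code makes the decision. This is a sketch under stated assumptions, not a prescribed architecture. The JSON schema (`first_time`, `proposed_time`, `min_hours`) and the function name are hypothetical, and the model response is replaced by a hard-coded string.

```python
import json
from datetime import datetime, timedelta


def check_interval_from_extraction(extraction_json: str) -> str:
    """Decide YES/NO in code from fields an LLM extracted as JSON.

    Assumed (hypothetical) schema: first_time and proposed_time as 24-hour
    "HH:MM" strings on the same day, min_hours as a number.
    """
    fields = json.loads(extraction_json)
    first = datetime.strptime(fields["first_time"], "%H:%M")
    proposed = datetime.strptime(fields["proposed_time"], "%H:%M")
    required = timedelta(hours=float(fields["min_hours"]))
    return "YES" if proposed - first >= required else "NO"


# Stand-in for a model response; in a real pipeline this string would come
# from an extraction prompt and should be validated before use
llm_output = '{"first_time": "08:00", "proposed_time": "11:00", "min_hours": 4}'
print(check_interval_from_extraction(llm_output))
```

The split keeps the model in the pattern-completion role, where the taxonomy rates it as reliable, and moves the interval arithmetic into code that gives the same answer regardless of prompt format.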

## Homework

**Deliverable 1: Capability Classification Matrix**

1. Choose a production LLM deployment scenario (your own or hypothetical)
2. Complete the classification using the `CapabilityClassifier`
3. For each required capability class, provide:
   - Specific examples from your scenario
   - Potential failure modes
   - Risk assessment (Low/Medium/High/Critical)
4. Determine if hybrid architecture is needed and why
5. List all required testing protocols

Submit your classification matrix as a PDF or Markdown document.
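
If you choose Markdown, the matrix can be generated from Python rather than typed by hand. The snippet below is a minimal sketch using only the standard library; the column names and the example row are hypothetical placeholders, not a required layout.

```python
from typing import Dict, List


def to_markdown_table(rows: List[Dict[str, str]]) -> str:
    """Render a non-empty list of row dictionaries as a Markdown table."""
    headers = list(rows[0].keys())
    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "|".join("---" for _ in headers) + "|",
    ]
    for row in rows:
        lines.append("| " + " | ".join(row[h] for h in headers) + " |")
    return "\n".join(lines)


# Hypothetical example row; replace with the analysis of your own scenario
matrix = [
    {
        "Capability class": "Continuous State",
        "Scenario example": "Minimum interval between doses",
        "Failure mode": "Approves a dose before the interval has elapsed",
        "Risk": "Critical",
    },
]
print(to_markdown_table(matrix))
```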

---

**Next Session:** Architectural Prerequisites for Reliable Performance