# Session 5: Temporal Constraint Processing
## Why Discrete Tokens Fail for Continuous State

**Production LLM Deployment: Risk Characterization Before Failure**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Javihaus/Production_LLM_Deployment/blob/main/sessions/session_05_failure_mode_temporal/notebook.ipynb)

---

**Learning Objectives:**
1. Understand why LLMs fail on temporal constraint tasks
2. Measure bimodal performance and prompt brittleness
3. Detect systematic action bias in model responses
4. Implement Allen's interval algebra for hybrid architectures

## Setup

In [None]:
!pip install -q anthropic numpy pandas matplotlib seaborn scipy

import anthropic
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from dataclasses import dataclass
from typing import List, Dict, Tuple, Optional
from enum import Enum
import time
import json

plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

try:
    from google.colab import userdata
    api_key = userdata.get('ANTHROPIC_API_KEY')
except Exception:
    # Not in Colab (or the secret is unavailable): fall back to the environment
    import os
    api_key = os.environ.get('ANTHROPIC_API_KEY')

client = anthropic.Anthropic(api_key=api_key)
print("Setup complete!")

## Part 1: Allen's Interval Algebra Implementation

Allen (1983) defined 13 basic relations between temporal intervals. Exactly one of them holds for any pair of proper intervals (start strictly before end), which makes them a foundation for deterministic temporal reasoning.

In [None]:
class AllenRelation(Enum):
    """Allen's 13 interval relations."""
    BEFORE = "before"           # A ends before B starts
    AFTER = "after"             # A starts after B ends
    MEETS = "meets"             # A ends exactly when B starts
    MET_BY = "met_by"           # A starts exactly when B ends
    OVERLAPS = "overlaps"       # A starts before B, ends during B
    OVERLAPPED_BY = "overlapped_by"  # A starts during B, ends after B
    DURING = "during"           # A is entirely within B
    CONTAINS = "contains"       # A entirely contains B
    STARTS = "starts"           # Same start, A ends first
    STARTED_BY = "started_by"   # Same start, A ends later
    FINISHES = "finishes"       # Same end, A starts later
    FINISHED_BY = "finished_by" # Same end, A starts earlier
    EQUALS = "equals"           # Identical intervals


@dataclass
class TemporalInterval:
    """Represents a temporal interval with start and end times."""
    name: str
    start: float  # Time in minutes from midnight
    end: float
    
    def __post_init__(self):
        if self.end < self.start:
            raise ValueError(f"End time must be >= start time: {self.start} -> {self.end}")
    
    @property
    def duration(self) -> float:
        return self.end - self.start
    
    def relation_to(self, other: 'TemporalInterval') -> AllenRelation:
        """Compute Allen relation between self and other interval."""
        # Before/After
        if self.end < other.start:
            return AllenRelation.BEFORE
        if self.start > other.end:
            return AllenRelation.AFTER
        
        # Meets/Met-by
        if self.end == other.start:
            return AllenRelation.MEETS
        if self.start == other.end:
            return AllenRelation.MET_BY
        
        # Equals
        if self.start == other.start and self.end == other.end:
            return AllenRelation.EQUALS
        
        # Starts/Started-by
        if self.start == other.start:
            if self.end < other.end:
                return AllenRelation.STARTS
            else:
                return AllenRelation.STARTED_BY
        
        # Finishes/Finished-by
        if self.end == other.end:
            if self.start > other.start:
                return AllenRelation.FINISHES
            else:
                return AllenRelation.FINISHED_BY
        
        # During/Contains
        if self.start > other.start and self.end < other.end:
            return AllenRelation.DURING
        if self.start < other.start and self.end > other.end:
            return AllenRelation.CONTAINS
        
        # Overlaps/Overlapped-by
        if self.start < other.start < self.end < other.end:
            return AllenRelation.OVERLAPS
        if other.start < self.start < other.end < self.end:
            return AllenRelation.OVERLAPPED_BY
        
        raise ValueError(f"Could not determine relation: {self} vs {other}")


def time_to_minutes(time_str: str) -> float:
    """Convert time string to minutes from midnight."""
    time_str = time_str.upper().strip()
    
    # Handle AM/PM
    is_pm = 'PM' in time_str
    is_am = 'AM' in time_str
    time_str = time_str.replace('AM', '').replace('PM', '').strip()
    
    parts = time_str.replace(':', ' ').split()
    hours = int(parts[0])
    minutes = int(parts[1]) if len(parts) > 1 else 0
    
    if is_pm and hours != 12:
        hours += 12
    elif is_am and hours == 12:
        hours = 0
    
    return hours * 60 + minutes


# Test the implementation
meeting_a = TemporalInterval("Meeting A", time_to_minutes("2:00 PM"), time_to_minutes("3:30 PM"))
meeting_b = TemporalInterval("Meeting B", time_to_minutes("3:00 PM"), time_to_minutes("4:00 PM"))

print(f"Meeting A: {meeting_a.start/60:.1f}h - {meeting_a.end/60:.1f}h")
print(f"Meeting B: {meeting_b.start/60:.1f}h - {meeting_b.end/60:.1f}h")
print(f"Relation: Meeting A {meeting_a.relation_to(meeting_b).value} Meeting B")
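
For proper intervals, the 13 relations are mutually exclusive and jointly exhaustive, and swapping the two arguments yields the inverse relation. The sketch below brute-forces both properties over small integer endpoints. It is self-contained and does not reuse the `TemporalInterval` class; `allen_relation` and `INVERSE` are hypothetical names introduced only for this illustration.

```python
from itertools import product

# Each relation paired with its inverse (illustrative helper, not part of the classes above)
INVERSE = {
    "before": "after", "meets": "met_by", "overlaps": "overlapped_by",
    "during": "contains", "starts": "started_by", "finishes": "finished_by",
    "equals": "equals",
}
INVERSE.update({v: k for k, v in list(INVERSE.items())})


def allen_relation(a, b):
    """Return the Allen relation name for proper intervals a=(start, end), b=(start, end)."""
    (a_s, a_e), (b_s, b_e) = a, b
    if a_e < b_s:
        return "before"
    if a_s > b_e:
        return "after"
    if a_e == b_s:
        return "meets"
    if a_s == b_e:
        return "met_by"
    if a_s == b_s and a_e == b_e:
        return "equals"
    if a_s == b_s:
        return "starts" if a_e < b_e else "started_by"
    if a_e == b_e:
        return "finishes" if a_s > b_s else "finished_by"
    if a_s > b_s and a_e < b_e:
        return "during"
    if a_s < b_s and a_e > b_e:
        return "contains"
    # Only the two partial-overlap cases remain at this point
    return "overlaps" if a_s < b_s else "overlapped_by"


# All proper intervals with integer endpoints in 0..3
intervals = [(s, e) for s, e in product(range(4), repeat=2) if s < e]

seen = set()
for a, b in product(intervals, repeat=2):
    rel = allen_relation(a, b)
    seen.add(rel)
    # Swapping the arguments must give the inverse relation
    assert allen_relation(b, a) == INVERSE[rel]

print(f"{len(seen)} distinct relations observed")
```

Six intervals are enough for every relation to appear at least once, which makes this a cheap regression test for any reimplementation of `relation_to`.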

## Part 2: Temporal Constraint Checker

In [None]:
class TemporalConstraintChecker:
    """Deterministic temporal constraint verification."""
    
    def check_minimum_gap(self, 
                          event_a_end: float, 
                          event_b_start: float, 
                          min_gap_minutes: float) -> Tuple[bool, str]:
        """Check if there's sufficient gap between events."""
        actual_gap = event_b_start - event_a_end
        is_valid = actual_gap >= min_gap_minutes
        
        explanation = (
            f"Gap: {actual_gap:.1f} minutes. "
            f"Required: {min_gap_minutes:.1f} minutes. "
            f"{'VALID' if is_valid else 'INVALID - insufficient gap'}"
        )
        return is_valid, explanation
    
    def check_no_overlap(self, intervals: List[TemporalInterval]) -> Tuple[bool, List[str]]:
        """Check that no intervals overlap."""
        violations = []
        
        for i, a in enumerate(intervals):
            for b in intervals[i+1:]:
                relation = a.relation_to(b)
                if relation in [AllenRelation.OVERLAPS, AllenRelation.OVERLAPPED_BY,
                               AllenRelation.DURING, AllenRelation.CONTAINS,
                               AllenRelation.STARTS, AllenRelation.STARTED_BY,
                               AllenRelation.FINISHES, AllenRelation.FINISHED_BY,
                               AllenRelation.EQUALS]:
                    violations.append(f"{a.name} {relation.value} {b.name}")
        
        return len(violations) == 0, violations
    
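    def check_medication_timing(self,
                                 med_a_time: str,
                                 med_b_time: str,
                                 min_hours: float) -> Dict:
        """Check medication timing constraint.

        Assumes both times fall on the same calendar day; a second dose
        after midnight would produce a negative elapsed time.
        """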
        time_a = time_to_minutes(med_a_time)
        time_b = time_to_minutes(med_b_time)
        min_minutes = min_hours * 60
        
        elapsed = time_b - time_a
        is_safe = elapsed >= min_minutes
        
        return {
            "med_a_time": med_a_time,
            "med_b_time": med_b_time,
            "elapsed_minutes": elapsed,
            "required_minutes": min_minutes,
            "is_safe": is_safe,
            "explanation": f"Elapsed: {elapsed/60:.2f}h. Required: {min_hours}h. {'SAFE' if is_safe else 'UNSAFE'}"
        }


# Test the checker
checker = TemporalConstraintChecker()

# Medication timing test
result = checker.check_medication_timing("8:00 AM", "11:00 AM", min_hours=4)
print("Medication Timing Check:")
print(f"  {result['explanation']}")
print(f"  Safe to take: {result['is_safe']}")
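
`check_medication_timing` subtracts minutes from midnight, so a dose pair that spans midnight (10:00 PM, then 2:00 AM) yields a negative elapsed time. One possible way to handle that case is sketched below with the standard `datetime` module. The helper name `elapsed_minutes_wrapping` is hypothetical, and the sketch assumes the second time falls within 24 hours after the first and that times look like "8:00 AM".

```python
from datetime import datetime, timedelta


def elapsed_minutes_wrapping(first: str, second: str) -> float:
    """Minutes from `first` to `second`, assuming `second` is within 24h after `first`."""
    fmt = "%I:%M %p"  # e.g. "8:00 AM"; AM/PM parsing assumes an English locale
    t1 = datetime.strptime(first.strip().upper(), fmt)
    t2 = datetime.strptime(second.strip().upper(), fmt)
    if t2 < t1:
        # Second time is earlier on the clock, so treat it as the next day
        t2 += timedelta(days=1)
    return (t2 - t1).total_seconds() / 60


print(elapsed_minutes_wrapping("10:00 PM", "2:00 AM"))   # 240.0
print(elapsed_minutes_wrapping("8:00 AM", "11:00 AM"))   # 180.0
```

The 24-hour assumption is a real limitation: without dates attached to each dose, a gap of 3 hours and a gap of 27 hours are indistinguishable.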

## Part 3: Testing LLM Temporal Reasoning

Now let's test how well the LLM handles temporal constraints compared to our deterministic checker.

In [None]:
def create_temporal_test_suite() -> List[Dict]:
    """Create balanced test suite for temporal constraint testing."""
    
    tests = [
        # Valid cases (answer should be YES)
        {
            "scenario": "Patient took aspirin at 8:00 AM. Can they take ibuprofen at 12:30 PM? (4-hour minimum required)",
            "med_a": "8:00 AM",
            "med_b": "12:30 PM",
            "min_hours": 4,
            "expected": "YES"
        },
        {
            "scenario": "Medication A given at 6:00 AM. Medication B proposed at 2:00 PM. 6-hour gap required.",
            "med_a": "6:00 AM",
            "med_b": "2:00 PM",
            "min_hours": 6,
            "expected": "YES"
        },
        {
            "scenario": "First dose at 7:00 AM. Second dose at 11:15 AM. Minimum 4 hours between doses.",
            "med_a": "7:00 AM",
            "med_b": "11:15 AM",
            "min_hours": 4,
            "expected": "YES"
        },
        {
            "scenario": "Drug administered at 9:00 AM. Follow-up drug at 3:00 PM. 5-hour separation needed.",
            "med_a": "9:00 AM",
            "med_b": "3:00 PM",
            "min_hours": 5,
            "expected": "YES"
        },
        # Invalid cases (answer should be NO)
        {
            "scenario": "Patient took aspirin at 8:00 AM. Can they take ibuprofen at 11:00 AM? (4-hour minimum required)",
            "med_a": "8:00 AM",
            "med_b": "11:00 AM",
            "min_hours": 4,
            "expected": "NO"
        },
        {
            "scenario": "Medication A given at 10:00 AM. Medication B proposed at 1:30 PM. 4-hour gap required.",
            "med_a": "10:00 AM",
            "med_b": "1:30 PM",
            "min_hours": 4,
            "expected": "NO"
        },
        {
            "scenario": "First dose at 2:00 PM. Second dose at 5:00 PM. Minimum 4 hours between doses.",
            "med_a": "2:00 PM",
            "med_b": "5:00 PM",
            "min_hours": 4,
            "expected": "NO"
        },
        {
            "scenario": "Drug administered at 11:00 AM. Follow-up drug at 2:45 PM. 4-hour separation needed.",
            "med_a": "11:00 AM",
            "med_b": "2:45 PM",
            "min_hours": 4,
            "expected": "NO"
        },
    ]
    
    return tests


def test_llm_temporal_reasoning(tests: List[Dict]) -> pd.DataFrame:
    """Test LLM on temporal constraint scenarios."""
    
    results = []
    checker = TemporalConstraintChecker()
    
    for i, test in enumerate(tests):
        # Ground truth from deterministic checker
        ground_truth = checker.check_medication_timing(
            test["med_a"], test["med_b"], test["min_hours"]
        )
        
        # LLM response
        prompt = f"""{test['scenario']}

Answer with only YES or NO."""
        
        response = client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=10,
            messages=[{"role": "user", "content": prompt}]
        )
        
        llm_answer = response.content[0].text.strip().upper()
        if "YES" in llm_answer[:5]:
            llm_answer = "YES"
        elif "NO" in llm_answer[:5]:
            llm_answer = "NO"
        else:
            llm_answer = "UNCLEAR"
        
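        # Sanity check: the hand-written label must agree with the deterministic checker
        expected_from_checker = "YES" if ground_truth["is_safe"] else "NO"
        assert expected_from_checker == test["expected"], f"Test {i + 1} is mislabeled"

        correct = llm_answer == test["expected"]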
        
        results.append({
            "test_id": i + 1,
            "expected": test["expected"],
            "llm_answer": llm_answer,
            "correct": correct,
            "elapsed_hours": ground_truth["elapsed_minutes"] / 60,
            "required_hours": test["min_hours"]
        })
        
        time.sleep(0.5)  # Rate limiting
    
    return pd.DataFrame(results)


# Run tests
tests = create_temporal_test_suite()
results_df = test_llm_temporal_reasoning(tests)

print("\n" + "="*60)
print("TEMPORAL REASONING TEST RESULTS")
print("="*60)
print(results_df.to_string(index=False))
print(f"\nOverall Accuracy: {results_df['correct'].mean():.1%}")
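
The answer parsing above uses a substring check on the first five characters, so a reply beginning with "Unknown" would be read as NO (because "UNKNO" contains "NO"). A stricter alternative is sketched below using whole-word matching. `parse_yes_no` is a hypothetical helper, not part of the notebook, and treating replies that contain both words as UNCLEAR is a design assumption.

```python
import re


def parse_yes_no(reply: str) -> str:
    """Map a free-text reply to "YES", "NO", or "UNCLEAR" using whole-word matching."""
    # Split into alphabetic words so punctuation and markdown markers are ignored
    words = re.findall(r"[A-Za-z]+", reply.upper())
    has_yes = "YES" in words
    has_no = "NO" in words
    if has_yes and not has_no:
        return "YES"
    if has_no and not has_yes:
        return "NO"
    # Neither word, or both words: refuse to guess
    return "UNCLEAR"


for reply in ["No.", "**YES**", "Unknown", "Yes or no?"]:
    print(repr(reply), "->", parse_yes_no(reply))
```

Counting ambiguous replies as UNCLEAR rather than forcing a label keeps parsing errors from being mistaken for reasoning errors.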

## Part 4: Measuring Brittleness

In [None]:
def measure_temporal_brittleness() -> pd.DataFrame:
    """Measure accuracy changes across prompt formats for temporal tasks."""
    
    # Same scenario, different formats
    base_scenario = {
        "med_a": "8:00 AM",
        "med_b": "11:00 AM",
        "min_hours": 4,
        "expected": "NO"  # 3 hours < 4 hours required
    }
    
    formats = {
        "natural": """A patient took aspirin at 8:00 AM. They need to wait at least 4 hours 
before taking ibuprofen. Can they take ibuprofen at 11:00 AM? Answer YES or NO only.""",
        
        "clinical": """MEDICATION TIMING CHECK
Drug A: Aspirin administered @ 08:00
Drug B: Ibuprofen requested @ 11:00
Minimum interval: 4 hours
Is administration of Drug B acceptable? (YES/NO)""",
        
        "json": """Given this medication schedule:
{"aspirin": {"time": "8:00 AM"}, "ibuprofen": {"requested": "11:00 AM"}, "min_gap_hours": 4}
Is the ibuprofen timing safe? YES or NO only.""",
        
        "conversational": """Hey doc, I took aspirin at 8 this morning. It's 11 now and I want to 
take ibuprofen. The bottle says wait 4 hours. Am I good? Just yes or no.""",
        
        "formal": """Query: Temporal constraint satisfaction
Event A: Acetylsalicylic acid administration at t=08:00
Event B: Ibuprofen administration proposed at t=11:00  
Constraint: |t_B - t_A| >= 240 minutes
Result (YES/NO):"""
    }
    
    results = []
    
    for format_name, prompt in formats.items():
        response = client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=10,
            messages=[{"role": "user", "content": prompt}]
        )
        
        answer = response.content[0].text.strip().upper()
        if "YES" in answer[:5]:
            answer = "YES"
        elif "NO" in answer[:5]:
            answer = "NO"
        else:
            answer = "UNCLEAR"
        
        correct = answer == base_scenario["expected"]
        results.append({
            "format": format_name,
            "answer": answer,
            "expected": base_scenario["expected"],
            "correct": correct
        })
        
        time.sleep(0.5)
    
    df = pd.DataFrame(results)
    
    print("\n" + "="*60)
    print("BRITTLENESS TEST: Same scenario, different formats")
    print("="*60)
    print(f"Expected answer: {base_scenario['expected']} (8:00 AM to 11:00 AM = 3h elapsed < 4h required)")
    print("\n" + df.to_string(index=False))
    
    accuracy = df['correct'].mean()
    print(f"\nAccuracy across formats: {accuracy:.1%}")
    print(f"Error rate across formats: {1 - accuracy:.1%}")
    
    return df

brittleness_df = measure_temporal_brittleness()
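
With a single scenario, each format is simply right or wrong. To express brittleness as a change in accuracy, one option is to run several scenarios per format and report the spread between the best and worst format. The sketch below assumes records of the form `(format_name, correct)`. The function names are hypothetical, and the sample records are placeholders for illustration, not measured results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_format(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Fraction of correct answers for each prompt format."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for fmt, correct in records:
        totals[fmt] += 1
        hits[fmt] += int(correct)
    return {fmt: hits[fmt] / totals[fmt] for fmt in totals}


def accuracy_spread(per_format: Dict[str, float]) -> float:
    """Brittleness proxy: best-format accuracy minus worst-format accuracy."""
    return max(per_format.values()) - min(per_format.values())


# Placeholder records for illustration only (not measured results)
records = [
    ("natural", True), ("natural", True), ("natural", True), ("natural", True),
    ("json", True), ("json", False), ("json", True), ("json", False),
]

per_format = accuracy_by_format(records)
print(per_format)
print(f"Spread: {accuracy_spread(per_format) * 100:.1f} percentage points")
```

A spread near zero means the format barely matters; a large spread on logically identical scenarios is the brittleness signal this part is looking for.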

## Part 5: Hybrid Architecture Demo

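This part combines an LLM for natural language understanding (NLU) with the deterministic checker for temporal reasoning: the model extracts the two times and the required gap, and the checker does the arithmetic.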

In [None]:
class HybridTemporalReasoner:
    """Hybrid LLM + Symbolic temporal reasoning system."""
    
    def __init__(self, client):
        self.client = client
        self.checker = TemporalConstraintChecker()
    
    def extract_temporal_info(self, scenario: str) -> Optional[Dict]:
        """Use LLM to extract temporal information from natural language."""
        
        prompt = f"""Extract the temporal information from this scenario. 
Return ONLY a JSON object with these fields:
- event_a_time: time of first event (format: "HH:MM AM/PM")
- event_b_time: time of second event (format: "HH:MM AM/PM")
- min_gap_hours: minimum required gap in hours (number)

Scenario: {scenario}

JSON:"""
        
        response = self.client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=200,
            messages=[{"role": "user", "content": prompt}]
        )
        
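        # Parse the span from the first '{' to the last '}' as JSON
        text = response.content[0].text
        start = text.find('{')
        end = text.rfind('}') + 1
        if start >= 0 and end > start:
            try:
                return json.loads(text[start:end])
            except json.JSONDecodeError:
                return None  # Malformed JSON is treated as an extraction failure
        return None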
    
    def reason(self, scenario: str) -> Dict:
        """Full hybrid reasoning pipeline."""
        
        # Step 1: LLM extracts temporal info
        extracted = self.extract_temporal_info(scenario)
        
        if not extracted:
            return {"error": "Could not extract temporal information"}
        
        # Step 2: Symbolic checker validates constraint
        result = self.checker.check_medication_timing(
            extracted["event_a_time"],
            extracted["event_b_time"],
            extracted["min_gap_hours"]
        )
        
        return {
            "extracted": extracted,
            "is_safe": result["is_safe"],
            "explanation": result["explanation"],
            "answer": "YES" if result["is_safe"] else "NO"
        }


# Test hybrid system
hybrid = HybridTemporalReasoner(client)

test_scenarios = [
    "I took my blood pressure medication at 7:30 this morning. The pharmacist said to wait at least 6 hours before taking the antihistamine. It's now 1:00 PM - is it safe?",
    "Patient received Drug A at 0900 hours. Protocol requires 4-hour separation before Drug B. Current time: 1245. Administer Drug B?"
]

print("="*60)
print("HYBRID TEMPORAL REASONER")
print("="*60)

for scenario in test_scenarios:
    print(f"\nScenario: {scenario[:60]}...")
    result = hybrid.reason(scenario)
    print(f"Extracted: {result.get('extracted', 'N/A')}")
    print(f"Answer: {result.get('answer', 'ERROR')}")
    print(f"Explanation: {result.get('explanation', 'N/A')}")
    print("-"*40)
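
In the hybrid pipeline the extracted JSON is the only place where model error can still enter, so it is worth validating before it reaches the checker. A minimal sketch of such a guard follows, assuming the three field names requested in the extraction prompt. `validate_extraction` and `TIME_PATTERN` are hypothetical names, and rejecting anything that is not in "H:MM AM/PM" form is a deliberately strict assumption.

```python
import math
import re
from typing import Any, Dict, Optional

# Accepts "7:30 AM", "07:30 am", "12:45PM"; rejects 24-hour forms such as "0900"
TIME_PATTERN = re.compile(r"^(1[0-2]|0?[1-9]):[0-5]\d\s?(AM|PM)$", re.IGNORECASE)


def validate_extraction(payload: Any) -> Optional[Dict[str, Any]]:
    """Return a cleaned copy of the extracted fields, or None if anything looks wrong."""
    if not isinstance(payload, dict):
        return None
    try:
        event_a = str(payload["event_a_time"]).strip()
        event_b = str(payload["event_b_time"]).strip()
        gap = float(payload["min_gap_hours"])
    except (KeyError, TypeError, ValueError):
        return None  # Missing field or non-numeric gap
    if not (TIME_PATTERN.match(event_a) and TIME_PATTERN.match(event_b)):
        return None  # Time not in the requested format
    if not math.isfinite(gap) or gap < 0:
        return None  # Gap must be a finite, non-negative number
    return {"event_a_time": event_a, "event_b_time": event_b, "min_gap_hours": gap}


print(validate_extraction(
    {"event_a_time": "7:30 AM", "event_b_time": "1:00 PM", "min_gap_hours": "6"}
))
print(validate_extraction(
    {"event_a_time": "0900", "event_b_time": "12:45 PM", "min_gap_hours": 4}
))
```

Returning `None` on any doubt lets the caller surface an explicit extraction error instead of passing a malformed time to the checker.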

## Part 6: Visualization

In [None]:
# Visualize results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Test Results
colors = ['green' if c else 'red' for c in results_df['correct']]
axes[0].bar(results_df['test_id'], [1]*len(results_df), color=colors, alpha=0.7)
axes[0].set_xlabel('Test ID')
axes[0].set_ylabel('Result (green = correct, red = wrong)')
axes[0].set_title(f'Temporal Reasoning Test Results\nAccuracy: {results_df["correct"].mean():.1%}')
axes[0].set_yticks([])  # Every bar has height 1; color alone encodes correctness

# Add expected labels
for i, row in results_df.iterrows():
    axes[0].annotate(f"Exp:{row['expected']}", 
                     (row['test_id'], 0.5), 
                     ha='center', fontsize=8)

# Plot 2: Brittleness across formats
format_colors = ['green' if c else 'red' for c in brittleness_df['correct']]
bars = axes[1].bar(brittleness_df['format'], [1]*len(brittleness_df), color=format_colors, alpha=0.7)
axes[1].set_xlabel('Prompt Format')
axes[1].set_ylabel('Result (green = correct, red = wrong)')
axes[1].set_title(f'Brittleness Test: Same Scenario, Different Formats\nAccuracy: {brittleness_df["correct"].mean():.1%}')
axes[1].set_yticks([])  # Every bar has height 1; color alone encodes correctness
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

## Key Takeaways

1. **Temporal reasoning failures are architectural** - Discrete tokens cannot process continuous temporal state

2. **Brittleness reveals pattern matching** - Large accuracy drops from format changes alone suggest the model is matching surface patterns rather than computing the interval

3. **Hybrid architectures solve the problem** - Use LLM for NLU, symbolic checker for temporal logic

4. **Allen's algebra provides deterministic reasoning** - Exactly one of the 13 relations holds between any two proper intervals, so every pairwise case is covered

5. **Always test with balanced distributions** - Equal positive/negative cases reveal systematic bias
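
The balanced-distribution point can be made concrete by comparing how often the model says YES with how often YES is actually correct, and by splitting errors into false YES and false NO. The following is a minimal sketch, assuming rows shaped like the Part 3 results (keys `expected` and `llm_answer`). `summarize_answer_bias` is a hypothetical helper, and the sample rows are placeholders, not measured results.

```python
from typing import Dict, List


def summarize_answer_bias(rows: List[Dict[str, str]]) -> Dict[str, float]:
    """Compare the model's YES rate with the expected YES rate and split the errors."""
    yes_expected = [r for r in rows if r["expected"] == "YES"]
    no_expected = [r for r in rows if r["expected"] == "NO"]
    if not yes_expected or not no_expected:
        raise ValueError("Need at least one YES case and one NO case")
    false_yes = sum(r["llm_answer"] == "YES" for r in no_expected)
    false_no = sum(r["llm_answer"] == "NO" for r in yes_expected)
    return {
        "yes_rate": sum(r["llm_answer"] == "YES" for r in rows) / len(rows),
        "expected_yes_rate": len(yes_expected) / len(rows),
        "false_yes_rate": false_yes / len(no_expected),  # unsafe approvals
        "false_no_rate": false_no / len(yes_expected),   # unnecessary refusals
    }


# Placeholder rows for illustration only (not measured results)
rows = (
    [{"expected": "YES", "llm_answer": "YES"}] * 4
    + [{"expected": "NO", "llm_answer": "YES"}] * 2
    + [{"expected": "NO", "llm_answer": "NO"}] * 2
)

summary = summarize_answer_bias(rows)
for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

With the notebook's DataFrame, the same function could be called as `summarize_answer_bias(results_df.to_dict("records"))`. A `yes_rate` well above `expected_yes_rate`, driven by `false_yes_rate`, is the pattern to look for when checking for action bias.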

---

**Homework:** Design and run a temporal constraint test suite for your deployment domain.