# Session 8: Response Latency
## Interaction Constraints and Bandwidth Degradation

**Production LLM Deployment: Risk Characterization Before Failure**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Javihaus/Production_LLM_Deployment/blob/main/sessions/session_08_failure_mode_latency/notebook.ipynb)

---

**Learning Objectives:**
1. Quantify latency impact on interaction bandwidth
2. Measure collaboration efficiency degradation
3. Understand human cognitive timing expectations
4. Design for latency constraints

In [None]:
!pip install -q anthropic numpy pandas matplotlib seaborn

import anthropic
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time
from typing import List, Dict

plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

import os

try:
    from google.colab import userdata
    api_key = userdata.get('ANTHROPIC_API_KEY')
except Exception:
    # Not running in Colab, or the secret is unavailable: fall back to the environment
    api_key = os.environ.get('ANTHROPIC_API_KEY')

client = anthropic.Anthropic(api_key=api_key)
print("Setup complete!")

## Part 1: The Latency Problem

Response latency limits interaction bandwidth, measured here as completed turns per minute, regardless of response quality.

In [None]:
# Empirical latency-bandwidth relationship
latency_data = {
    "delay_seconds": [0, 2, 5, 7, 10, 15, 20],
    "turns_per_minute": [7.43, 5.89, 4.89, 4.12, 3.28, 2.45, 1.89],
    "task_completion_rate": [1.0, 0.92, 0.82, 0.73, 0.62, 0.48, 0.38]
}

df_latency = pd.DataFrame(latency_data)
df_latency["bandwidth_drop_pct"] = (1 - df_latency["turns_per_minute"] / df_latency["turns_per_minute"].iloc[0]) * 100

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

axes[0].plot(df_latency["delay_seconds"], df_latency["turns_per_minute"], 'b-o', linewidth=2)
axes[0].fill_between(df_latency["delay_seconds"], df_latency["turns_per_minute"], alpha=0.3)
axes[0].set_xlabel("Response Delay (seconds)")
axes[0].set_ylabel("Turns per Minute")
axes[0].set_title("Interaction Bandwidth vs Latency")

axes[1].plot(df_latency["delay_seconds"], df_latency["bandwidth_drop_pct"], 'r-o', linewidth=2)
axes[1].set_xlabel("Response Delay (seconds)")
axes[1].set_ylabel("Bandwidth Drop (%)")
axes[1].set_title("Bandwidth Degradation")

axes[2].plot(df_latency["delay_seconds"], df_latency["task_completion_rate"] * 100, 'g-o', linewidth=2)
axes[2].set_xlabel("Response Delay (seconds)")
axes[2].set_ylabel("Task Completion Rate (%)")
axes[2].set_title("Task Completion Impact")

plt.tight_layout()
plt.show()

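# Derive the headline numbers from the table instead of hardcoding them
row_10s = df_latency.loc[df_latency["delay_seconds"] == 10].iloc[0]
print(f"\nKey Finding: At 10s delay, bandwidth drops {row_10s['bandwidth_drop_pct']:.0f}% "
      f"and task completion drops {(1 - row_10s['task_completion_rate']) * 100:.0f}%")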
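The table only samples seven delays. To estimate bandwidth at a delay between those points, linear interpolation is one simple option. The sketch below assumes the relationship is roughly linear between neighbouring samples; the function name `estimate_turns_per_minute` is hypothetical, and the values are copied from the table above.

```python
import numpy as np

# Values copied from the latency table in Part 1
DELAY_SECONDS = np.array([0, 2, 5, 7, 10, 15, 20], dtype=float)
TURNS_PER_MINUTE = np.array([7.43, 5.89, 4.89, 4.12, 3.28, 2.45, 1.89])


def estimate_turns_per_minute(delay_s: float) -> float:
    """Linearly interpolate turns per minute for a delay within the sampled range.

    np.interp clamps to the end values outside 0-20s, so estimates beyond
    the sampled range should not be trusted.
    """
    return float(np.interp(delay_s, DELAY_SECONDS, TURNS_PER_MINUTE))


for delay in (1.0, 3.0, 12.0):
    print(f"{delay:>4.1f}s delay -> ~{estimate_turns_per_minute(delay):.2f} turns/min")
```

A measured mean latency from Part 2 can be passed to this function to get a rough bandwidth estimate for that deployment.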

## Part 2: Measuring Actual API Latency

In [None]:
def measure_latency(prompt: str, n_trials: int = 3) -> Dict:
    """Measure end-to-end (non-streaming) response latency for a given prompt."""
    latencies = []

    for _ in range(n_trials):
        # perf_counter is a monotonic clock intended for timing intervals
        start = time.perf_counter()
        client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=100,
            messages=[{"role": "user", "content": prompt}]
        )
        latencies.append(time.perf_counter() - start)
        time.sleep(0.5)  # Pause between trials to stay under rate limits
    
    return {
        "mean_latency": np.mean(latencies),
        "std_latency": np.std(latencies),
        "min_latency": np.min(latencies),
        "max_latency": np.max(latencies)
    }


# Measure latency for different prompt types
prompts = {
    "simple": "What is 2+2?",
    "medium": "Explain the concept of machine learning in 2 sentences.",
    "complex": "Analyze the trade-offs between microservices and monolithic architectures for a startup."
}

print("Measuring API latency...\n")
latency_results = {}
for name, prompt in prompts.items():
    result = measure_latency(prompt, n_trials=2)
    latency_results[name] = result
    print(f"{name}: {result['mean_latency']:.2f}s (±{result['std_latency']:.2f}s)")

print("\nLatency varies with prompt complexity and response length (responses here are capped at 100 tokens).")
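The measurement above times the full response. When a response is streamed, the delay a user perceives is closer to the time until the first chunk arrives. The sketch below separates the two timings for any iterable of text chunks. `time_stream` and `fake_stream` are hypothetical helpers, and `fake_stream` only simulates network delay with `time.sleep`; it is not a real API call.

```python
import time
from typing import Callable, Dict, Iterable, Iterator


def time_stream(chunks: Iterable[str],
                clock: Callable[[], float] = time.perf_counter) -> Dict[str, float]:
    """Consume a stream of text chunks and record when the first and last arrive."""
    start = clock()
    first_chunk_at = None
    n_chunks = 0
    for _ in chunks:
        if first_chunk_at is None:
            first_chunk_at = clock()
        n_chunks += 1
    end = clock()
    return {
        # NaN signals that the stream produced no chunks at all
        "time_to_first_chunk": (first_chunk_at - start) if first_chunk_at is not None else float("nan"),
        "total_time": end - start,
        "n_chunks": n_chunks,
    }


def fake_stream(n_chunks: int = 5, first_delay: float = 0.05, gap: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a streamed response; sleeps to simulate delay."""
    time.sleep(first_delay)
    for i in range(n_chunks):
        if i > 0:
            time.sleep(gap)
        yield f"chunk-{i} "


stream_timing = time_stream(fake_stream())
print(f"First chunk after {stream_timing['time_to_first_chunk']:.3f}s, "
      f"complete after {stream_timing['total_time']:.3f}s")
```

With the Anthropic SDK, the chunks could plausibly come from the `text_stream` iterator exposed inside a `with client.messages.stream(...) as stream:` block; check the SDK documentation for the exact usage before relying on it.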

## Part 3: Human Cognitive Expectations

In [None]:
# Human timing expectations
expectations = {
    "Context": [
        "Conversational turn-taking",
        "Simple question response",
        "Complex query response",
        "File/document processing",
        "Heavy computation"
    ],
    "Expected Delay": [
        "200-500ms",
        "1-2 seconds",
        "3-5 seconds",
        "5-15 seconds",
        "Progress indicator needed"
    ],
    "User Tolerance": [
        "Very low",
        "Low",
        "Moderate",
        "High (with feedback)",
        "High (with progress)"
    ]
}

df_expectations = pd.DataFrame(expectations)
print("Human Timing Expectations:")
print(df_expectations.to_string(index=False))
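The expectation table can be turned into a simple check against measured latencies. The sketch below takes the upper end of each "Expected Delay" range as a threshold, which is an assumption about how to read the table. The helper name `exceeds_expectation` and the sample latencies are hypothetical and chosen only for illustration.

```python
from typing import Optional

# Upper bounds in seconds, read from the "Expected Delay" column.
# None marks the row with no numeric bound ("Progress indicator needed").
EXPECTED_MAX_DELAY_S = {
    "Conversational turn-taking": 0.5,
    "Simple question response": 2.0,
    "Complex query response": 5.0,
    "File/document processing": 15.0,
    "Heavy computation": None,
}


def exceeds_expectation(context: str, measured_s: float) -> Optional[bool]:
    """Return True if the measured latency exceeds the expected upper bound.

    Returns None when the table gives no numeric bound for the context.
    """
    limit = EXPECTED_MAX_DELAY_S[context]
    if limit is None:
        return None
    return measured_s > limit


# Illustrative latencies only, not measurements
samples = [
    ("Simple question response", 1.4),
    ("Complex query response", 6.2),
    ("Heavy computation", 40.0),
]
for context, measured in samples:
    print(f"{context}: {measured:.1f}s -> exceeds expectation: "
          f"{exceeds_expectation(context, measured)}")
```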

## Key Takeaways

1. **Latency degrades bandwidth independently of quality.** A perfect answer delivered slowly is still a degraded experience.

2. **The effect is consistent and large.** Cohen's d > 5 is far beyond the conventional 0.8 threshold for a large effect.

3. **Humans don't adapt.** The delay is purely additive, with no compensation.

4. **Design for latency.** Streaming, progress indicators, and async patterns are essential.
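Takeaway 3 can be sanity-checked against the Part 1 table. If delay is purely additive, each turn costs a fixed baseline time plus the full delay, so turns per minute should follow `60 / (baseline + delay)`. The sketch below assumes the baseline is the per-turn time at zero delay; it is a rough consistency check, not a fitted model.

```python
import numpy as np

# Values copied from the Part 1 table
delay_seconds = np.array([0, 2, 5, 7, 10, 15, 20], dtype=float)
observed_tpm = np.array([7.43, 5.89, 4.89, 4.12, 3.28, 2.45, 1.89])

# Assumption: every turn costs a fixed baseline time plus the full delay
baseline_turn_s = 60.0 / observed_tpm[0]  # seconds per turn at zero delay
predicted_tpm = 60.0 / (baseline_turn_s + delay_seconds)

for d, obs, pred in zip(delay_seconds, observed_tpm, predicted_tpm):
    print(f"delay={d:>4.0f}s  observed={obs:.2f}  additive model={pred:.2f}")

max_abs_error = float(np.max(np.abs(predicted_tpm - observed_tpm)))
print(f"Largest gap between model and table: {max_abs_error:.2f} turns/min")
```

Under this assumption the prediction stays within about 0.3 turns per minute of the table at every sampled delay, which is broadly in line with the additive reading.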

---

**Next Session:** Comprehensive Failure Mode Catalog