# Implementing Error Recovery and Resilience

> **Difficulty**: Advanced - This lab introduces production patterns used in distributed systems.
> If you're new to error recovery, focus on understanding the patterns conceptually first.
> The code demonstrates industry-standard techniques but may feel complex initially - that's expected!

> **Learning Outcomes:**
> - Classify error types and recovery strategies
> - Implement retry with exponential backoff
> - Build circuit breaker patterns
> - Handle partial failures gracefully
> - Add production-grade logging and monitoring
> - Design fallback mechanisms

## Introduction

In this lab, we will build a Multi-Source News Aggregator that demonstrates production-grade error handling and resilience patterns. This agent fetches news from multiple APIs and must handle:
- Network failures and timeouts
- Rate limiting (HTTP 429)
- Partial failures (some sources work, others don't)
- Service degradation without total failure

### The Scenario

You're building a news aggregator that pulls from 5 different APIs. In production:
- APIs go down randomly
- Rate limits are exceeded
- Network is unreliable
- Users expect results even if some sources fail

Your agent must:
- Retry failed requests intelligently
- Use circuit breakers to prevent cascade failures
- Aggregate partial results
- Log errors for monitoring
- Degrade gracefully

### Key Concepts

**Error Recovery Patterns**:

1. **Retry with Exponential Backoff**:
```
attempt 1: wait 1s → retry
attempt 2: wait 2s → retry
attempt 3: wait 4s → retry
attempt 4: fail permanently
```

2. **Circuit Breaker**:
```
Closed (normal) → failures exceed threshold → Open (fail fast)
                                                    ↓
                                          wait timeout period
                                                    ↓
                                          Half-Open (test)
                                                    ↓
                    success → Closed    OR    failure → Open
```

3. **Partial Failure Handling**:
- 5 sources queried
- 3 succeed, 2 fail
- Return 3 results (not total failure)

4. **Fallback Mechanisms**:
- Primary API fails → Use cached results
- Multiple APIs down → Use default content

This pattern is essential when:
- Dealing with external APIs
- Building production systems
- Reliability is critical
- Failures are expected
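
The fallback order from item 4 (primary API, then cached results, then default content) can be sketched in a few lines. This is a minimal illustration, not the lab's implementation: `fetch_with_fallback`, `DEFAULT_ARTICLES`, `working_api`, and `broken_api` are hypothetical names invented for this example, and the cache here never expires.

```python
from typing import Callable, Dict, List

Article = Dict[str, str]

# Hypothetical default content; not part of the lab's code.
DEFAULT_ARTICLES: List[Article] = [
    {"title": "News is temporarily unavailable", "source": "default"}
]


def fetch_with_fallback(
    topic: str,
    primary: Callable[[str], List[Article]],
    cache: Dict[str, List[Article]],
) -> List[Article]:
    """Try the primary source, then the cache, then default content."""
    try:
        articles = primary(topic)
        cache[topic] = articles  # remember the last good result
        return articles
    except Exception:
        # Primary failed: use cached results if this topic was seen before.
        if topic in cache:
            return cache[topic]
        # Nothing cached either: return default content.
        return DEFAULT_ARTICLES


def working_api(topic: str) -> List[Article]:
    return [{"title": f"{topic} headline", "source": "working_api"}]


def broken_api(topic: str) -> List[Article]:
    raise ConnectionError("simulated outage")


cache: Dict[str, List[Article]] = {}
first = fetch_with_fallback("AI", working_api, cache)     # primary succeeds
second = fetch_with_fallback("AI", broken_api, cache)     # served from cache
third = fetch_with_fallback("Sports", broken_api, cache)  # default content
print(first == second, third == DEFAULT_ARTICLES)
```

Passing the cache in as an argument keeps the function easy to test. A real cache would also need an expiry rule so stale articles are not served indefinitely.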

## Setup and Installation

### Install Required Packages

We'll install the **latest stable versions** of:
- **LangGraph 1.0+**: Agent workflow framework
- **LangChain 1.0+**: Core LLM abstractions
- **LangChain OpenAI**: OpenAI model integration
- **Requests**: HTTP client for API simulation

**Version 1.0 Upgrade Notes**:
- LangGraph 1.0 has ZERO breaking changes from 0.6.6
- LangChain 1.0 requires Python 3.10+
- All resilience patterns work identically in version 1.0

In [None]:
%pip install -qU \
    langgraph \
    langchain \
    langchain-openai \
    requests

In [None]:
# Standard library
import os
import getpass
import time
import random
import logging
from enum import Enum
from datetime import datetime, timedelta
from typing import TypedDict, Annotated, List, Dict, Any, Optional
from collections import defaultdict

# External libraries
import requests

# LangChain/LangGraph
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage
from pydantic import BaseModel, Field

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# API key setup
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

# Initialize LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
print("✓ Environment configured")

### Step 1: Design Error Classification System

Not all errors are equal. Classifying an error first tells us which recovery strategy applies:

**Error Categories**:

1. **Transient** (temporary, worth retrying):
   - Network timeouts
   - HTTP 503 (Service Unavailable) or 504 (Gateway Timeout)
   - HTTP 429 (Rate Limit)

2. **Permanent** (won't fix with retry):
   - HTTP 404 (Not Found)
   - HTTP 401 (Unauthorized)
   - Invalid API response format

3. **Partial** (some components failed):
   - 2 of 5 APIs failed
   - Incomplete data

**Recovery Actions**:
- Transient → Retry with backoff
- Permanent → Fail fast, don't retry
- Partial → Return available data
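
As a rough sketch of how the categories map to actions, the lookup below turns an HTTP status code into a recovery action. The names `Category`, `TRANSIENT_STATUS_CODES`, `RECOVERY_ACTIONS`, `categorize_status`, and `recovery_action` are made up for this illustration. It also assumes that any status code not listed as transient is permanent, which is a simplification.

```python
from enum import Enum


class Category(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"


# Status codes treated as transient. 504 is included because the lab's
# classify_error function treats it the same way as 429 and 503.
TRANSIENT_STATUS_CODES = {429, 503, 504}

# Recovery action for each category, mirroring the "Recovery Actions" list.
RECOVERY_ACTIONS = {
    Category.TRANSIENT: "retry with backoff",
    Category.PERMANENT: "fail fast",
}


def categorize_status(status_code: int) -> Category:
    """Map an HTTP error status code to an error category."""
    if status_code in TRANSIENT_STATUS_CODES:
        return Category.TRANSIENT
    return Category.PERMANENT  # assumption: everything else is permanent


def recovery_action(status_code: int) -> str:
    """Look up the recovery action for a status code."""
    return RECOVERY_ACTIONS[categorize_status(status_code)]


for code in (429, 503, 404, 401):
    print(code, "->", recovery_action(code))
```

Keeping the code-to-category and category-to-action steps separate means a new category only needs one new table entry.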

In [None]:
class ErrorType(Enum):
    """Error classification for recovery strategy."""
    TRANSIENT = "transient"  # Temporary, retry
    PERMANENT = "permanent"  # Won't fix, fail fast
    PARTIAL = "partial"  # Some success, some failure

class ErrorInfo(BaseModel):
    """Information about an error."""
    error_type: ErrorType
    message: str
    source: str
    timestamp: datetime = Field(default_factory=datetime.now)
    retry_count: int = 0

def classify_error(exception: Exception, source: str) -> ErrorInfo:
    """Classify error to determine recovery strategy."""
    
    # Network/timeout errors - transient
    if isinstance(exception, (requests.Timeout, requests.ConnectionError)):
        return ErrorInfo(
            error_type=ErrorType.TRANSIENT,
            message=f"Network error: {str(exception)}",
            source=source
        )
    
    # HTTP errors
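    # HTTP errors (an HTTPError may carry no response, so guard against None)
    if isinstance(exception, requests.HTTPError) and exception.response is not None:
        status_code = exception.response.status_code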
        
        # Transient HTTP errors
        if status_code in [429, 503, 504]:
            return ErrorInfo(
                error_type=ErrorType.TRANSIENT,
                message=f"HTTP {status_code}: {exception}",
                source=source
            )
        
        # Permanent HTTP errors
        else:
            return ErrorInfo(
                error_type=ErrorType.PERMANENT,
                message=f"HTTP {status_code}: {exception}",
                source=source
            )
    
    # Default: treat as permanent
    return ErrorInfo(
        error_type=ErrorType.PERMANENT,
        message=f"Unknown error: {str(exception)}",
        source=source
    )

print("✓ Error classification system defined")

### Step 2: Implement Retry with Exponential Backoff

Exponential backoff prevents overwhelming failing services:

**Algorithm**:
```python
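import time

# call_api, TransientError, and PermanentError are placeholders
def call_with_retry(max_retries=3, base_delay=1):  # base_delay in seconds
    for attempt in range(max_retries + 1):
        try:
            return call_api()
        except TransientError:
            if attempt == max_retries:
                raise  # Out of retries: fail permanently
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s
            time.sleep(delay)
        except PermanentError:
            raise  # Don't retry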
```

**Benefits**:
- Gives service time to recover
- Reduces server load
- Combined with jitter, avoids the thundering herd problem (many clients retrying at the same instant)
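
To make the arithmetic concrete, here is a small sketch that computes the delay schedule without sleeping. `backoff_delays` and `jitter_bounds` are hypothetical helpers written for this illustration. The cap (`max_delay`) and the 0.5 to 1.5 jitter factor are assumptions chosen for the example.

```python
from typing import List, Tuple


def backoff_delays(
    max_retries: int, base_delay: float = 1.0, max_delay: float = 10.0
) -> List[float]:
    """Delay before each retry: base_delay * 2**attempt, capped at max_delay."""
    return [
        min(base_delay * (2 ** attempt), max_delay)
        for attempt in range(max_retries)
    ]


def jitter_bounds(
    delay: float, low: float = 0.5, high: float = 1.5
) -> Tuple[float, float]:
    """Range a delay can land in when scaled by a random factor in [low, high)."""
    return (delay * low, delay * high)


print(backoff_delays(3))   # [1.0, 2.0, 4.0]
print(backoff_delays(5))   # [1.0, 2.0, 4.0, 8.0, 10.0]
print(jitter_bounds(2.0))  # (1.0, 3.0)
```

The second call shows why a cap matters: without it the fifth delay would be 16 seconds, and the delays would keep doubling.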

In [None]:
class RetryConfig(BaseModel):
    """Configuration for retry logic."""
    max_retries: int = 3
    base_delay: float = 1.0  # seconds
    max_delay: float = 10.0  # seconds
    jitter: bool = True  # Add randomness to prevent thundering herd

def retry_with_backoff(
    func,
    *args,
    config: RetryConfig = RetryConfig(),
    source: str = "unknown",
    **kwargs
):
    """Execute function with exponential backoff retry."""
    
    for attempt in range(config.max_retries + 1):
        try:
            result = func(*args, **kwargs)
            
            if attempt > 0:
                logger.info(f"✓ {source}: Succeeded on attempt {attempt + 1}")
            
            return result
        
        except Exception as e:
            error_info = classify_error(e, source)
            error_info.retry_count = attempt
            
            # Don't retry permanent errors
            if error_info.error_type == ErrorType.PERMANENT:
                logger.error(f"✗ {source}: Permanent error - {error_info.message}")
                raise
            
            # Last attempt - raise
            if attempt == config.max_retries:
                logger.error(f"✗ {source}: Max retries exceeded - {error_info.message}")
                raise
            
            # Calculate backoff delay
            delay = min(
                config.base_delay * (2 ** attempt),
                config.max_delay
            )
            
            # Add jitter (randomness)
            if config.jitter:
                delay = delay * (0.5 + random.random())  # 50-150% of delay
            
            logger.warning(
                f"⚠ {source}: Attempt {attempt + 1} failed - {error_info.message}. "
                f"Retrying in {delay:.2f}s..."
            )
            
            time.sleep(delay)

print("✓ Retry with backoff implemented")

### Step 3: Implement Circuit Breaker Pattern

Circuit breakers prevent cascading failures by failing fast when a service is down:

**States**:
1. **Closed** (normal): Requests pass through
2. **Open** (failing): Requests fail immediately
3. **Half-Open** (testing): Allow one request to test recovery

**State Transitions**:
```
Closed:
  - On success: stay Closed
  - On failure: increment failure count
  - If failures >= threshold: transition to Open

Open:
  - Fail all requests immediately
  - After timeout: transition to Half-Open

Half-Open:
  - Allow one test request
  - On success: transition to Closed, reset counts
  - On failure: transition back to Open
```
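
The transition rules above can also be read as a pure function from (state, event) to the next state. The sketch below is one way to write that down; `State`, `next_state`, and the event names `"success"`, `"failure"`, and `"timeout_elapsed"` are invented for this illustration and are not the lab's `CircuitBreaker` API.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


def next_state(state: State, event: str, failures: int = 0, threshold: int = 3) -> State:
    """Return the state that follows `state` when `event` occurs.

    event is "success", "failure", or "timeout_elapsed".
    failures is the failure count including the current failure.
    """
    if state is State.CLOSED:
        if event == "failure" and failures >= threshold:
            return State.OPEN
        return State.CLOSED
    if state is State.OPEN:
        if event == "timeout_elapsed":
            return State.HALF_OPEN
        return State.OPEN
    # HALF_OPEN: the test request decides the outcome
    if event == "success":
        return State.CLOSED
    if event == "failure":
        return State.OPEN
    return State.HALF_OPEN


# Walk through three failures, a timeout, and a successful test request.
state = State.CLOSED
trace = [state.value]
events = [("failure", 1), ("failure", 2), ("failure", 3),
          ("timeout_elapsed", 3), ("success", 0)]
for event, failures in events:
    state = next_state(state, event, failures)
    trace.append(state.value)
print(" -> ".join(trace))
# closed -> closed -> closed -> open -> half_open -> closed
```

Writing the rules as a pure function makes each transition checkable without timers or real requests.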

In [None]:
class CircuitState(Enum):
    """Circuit breaker states."""
    CLOSED = "closed"  # Normal operation
    OPEN = "open"  # Failing fast
    HALF_OPEN = "half_open"  # Testing recovery

class CircuitBreaker:
    """Circuit breaker for preventing cascade failures."""
    
    def __init__(
        self,
        failure_threshold: int = 3,
        timeout: float = 30.0,  # seconds
        name: str = "circuit"
    ):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.name = name
        
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = None
        self.success_count = 0
    
    def call(self, func, *args, **kwargs):
        """Execute function through circuit breaker."""
        
        # Check if we should transition from OPEN to HALF_OPEN
        if self.state == CircuitState.OPEN:
            if self.last_failure_time and \
               (datetime.now() - self.last_failure_time).total_seconds() >= self.timeout:
                logger.info(f"Circuit {self.name}: OPEN → HALF_OPEN (testing recovery)")
                self.state = CircuitState.HALF_OPEN
            else:
                raise Exception(f"Circuit {self.name} is OPEN - failing fast")
        
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        
        except Exception as e:
            self._on_failure()
            raise
    
    def _on_success(self):
        """Handle successful call."""
        self.success_count += 1
        
        if self.state == CircuitState.HALF_OPEN:
            logger.info(f"Circuit {self.name}: HALF_OPEN → CLOSED (recovery confirmed)")
            self.state = CircuitState.CLOSED
            self.failure_count = 0
            self.last_failure_time = None
    
    def _on_failure(self):
        """Handle failed call."""
        self.failure_count += 1
        self.last_failure_time = datetime.now()
        
        if self.state == CircuitState.HALF_OPEN:
            logger.warning(f"Circuit {self.name}: HALF_OPEN → OPEN (recovery failed)")
            self.state = CircuitState.OPEN
        
        elif self.state == CircuitState.CLOSED:
            if self.failure_count >= self.failure_threshold:
                logger.warning(
                    f"Circuit {self.name}: CLOSED → OPEN "
                    f"({self.failure_count} failures >= {self.failure_threshold} threshold)"
                )
                self.state = CircuitState.OPEN
    
    def reset(self):
        """Manually reset circuit breaker."""
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = None
        logger.info(f"Circuit {self.name}: Manually reset to CLOSED")

print("✓ Circuit breaker implemented")

### Step 4: Build Simulated News Sources

We'll simulate 5 news APIs with different failure modes:

1. **Reliable API**: 95% success
2. **Slow API**: Often times out
3. **Rate-Limited API**: Returns 429 frequently
4. **Flaky API**: Random failures
5. **Down API**: Always fails

This lets us test all recovery patterns.

**Note**: We use realistic delays (1-3 seconds) so you can observe the exponential backoff in action.
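
Because the simulated sources draw from `random`, every run of the notebook produces a different mix of failures. If you want repeatable runs while debugging, one option is a seeded generator. The sketch below uses a hypothetical helper, `simulate_outcomes`, to show that the same seed yields the same success and failure sequence. Calling `random.seed(...)` with a fixed integer before running the tests should have a similar effect on the global generator.

```python
import random
from typing import List


def simulate_outcomes(failure_rate: float, calls: int, seed: int) -> List[bool]:
    """Return True for each simulated call that succeeds."""
    rng = random.Random(seed)  # private generator; the global one is untouched
    # A call fails when the draw falls below failure_rate.
    return [rng.random() >= failure_rate for _ in range(calls)]


run_a = simulate_outcomes(failure_rate=0.3, calls=10, seed=42)
run_b = simulate_outcomes(failure_rate=0.3, calls=10, seed=42)
print(run_a == run_b)  # True: same seed, same sequence
```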

In [None]:
class NewsSource:
    """Simulated news API source with configurable failure modes."""
    
    def __init__(
        self,
        name: str,
        failure_rate: float = 0.0,
        timeout_rate: float = 0.0,
        rate_limit_rate: float = 0.0
    ):
        self.name = name
        self.failure_rate = failure_rate
        self.timeout_rate = timeout_rate
        self.rate_limit_rate = rate_limit_rate
        self.call_count = 0
    
    def fetch_news(self, topic: str) -> List[Dict[str, str]]:
        """Fetch news articles (simulated)."""
        self.call_count += 1
        
        # Simulate timeout with realistic delay (1-3 seconds)
        # This makes retry backoff visible in logs
        if random.random() < self.timeout_rate:
            delay = random.uniform(1.0, 3.0)
            logger.debug(f"{self.name}: Simulating timeout after {delay:.1f}s delay")
            time.sleep(delay)
            raise requests.Timeout(f"{self.name}: Request timed out after {delay:.1f}s")
        
        # Simulate rate limiting
        if random.random() < self.rate_limit_rate:
            response = requests.Response()
            response.status_code = 429
            raise requests.HTTPError(f"{self.name}: Rate limit exceeded", response=response)
        
        # Simulate general failure
        if random.random() < self.failure_rate:
            response = requests.Response()
            response.status_code = 503
            raise requests.HTTPError(f"{self.name}: Service unavailable", response=response)
        
        # Success - return mock articles
        return [
            {
                "title": f"{topic} article {i+1} from {self.name}",
                "source": self.name,
                "timestamp": datetime.now().isoformat()
            }
            for i in range(3)
        ]

# Create sources with different reliability profiles
sources = {
    "ReliableNews": NewsSource("ReliableNews", failure_rate=0.05),
    "SlowNews": NewsSource("SlowNews", timeout_rate=0.4),
    "RateLimitedNews": NewsSource("RateLimitedNews", rate_limit_rate=0.3),
    "FlakyNews": NewsSource("FlakyNews", failure_rate=0.3, timeout_rate=0.2),
    "DownNews": NewsSource("DownNews", failure_rate=1.0)
}

print("✓ News sources created")
for name, source in sources.items():
    print(f"  - {name}: failure={source.failure_rate}, timeout={source.timeout_rate}, rate_limit={source.rate_limit_rate}")

### Step 5: Visualize the Workflow

Before we implement the aggregator, let's visualize how error recovery flows through the system.

This diagram shows the complete path from initial request to final result, including all recovery patterns.

In [None]:
# Workflow visualization
print("""\n
╔═══════════════════════════════════════════════════════════════════════════════╗
║                    RESILIENT AGGREGATOR WORKFLOW                              ║
╚═══════════════════════════════════════════════════════════════════════════════╝

┌─────────────────────────────────────────────────────────────────────────────┐
│  START: aggregate(topic)                                                    │
└────────────────────────────────┬────────────────────────────────────────────┘
                                 │
                                 ▼
                    ┌────────────────────────┐
                    │  For each source       │
                    │  (sequential loop)     │
                    └────────────┬───────────┘
                                 │
        ┌────────────────────────┼────────────────────────┐
        ▼                        ▼                        ▼
  ┌──────────┐            ┌──────────┐            ┌──────────┐
  │ Source 1 │            │ Source 2 │    ...     │ Source N │
  └─────┬────┘            └─────┬────┘            └─────┬────┘
        │                       │                       │
        ▼                       ▼                       ▼
  ┌─────────────────────────────────────────────────────────┐
  │ STEP 1: Circuit Breaker Check                          │
  ├─────────────────────────────────────────────────────────┤
  │  State: CLOSED    → Proceed to retry                   │
  │  State: OPEN      → Fail fast (skip retries)           │
  │  State: HALF_OPEN → Test recovery (one attempt)        │
  └────────────────────────┬────────────────────────────────┘
                           │
                           ▼
  ┌─────────────────────────────────────────────────────────┐
  │ STEP 2: Retry Loop (max 3 attempts)                    │
  ├─────────────────────────────────────────────────────────┤
  │  Attempt 1: immediate                                  │
  │      ↓ (fail)                                          │
  │  Wait: base_delay * 2^0 * jitter ≈ 0.5-1.5s           │
  │      ↓                                                 │
  │  Attempt 2: after backoff                             │
  │      ↓ (fail)                                          │
  │  Wait: base_delay * 2^1 * jitter ≈ 1.0-3.0s           │
  │      ↓                                                 │
  │  Attempt 3: after longer backoff                      │
  │      ↓ (success OR fail permanently)                  │
  └────────────────────────┬────────────────────────────────┘
                           │
                           ▼
  ┌─────────────────────────────────────────────────────────┐
  │ STEP 3: Error Classification                           │
  ├─────────────────────────────────────────────────────────┤
  │  TRANSIENT (timeout, 429, 503) → Retry                 │
  │  PERMANENT (404, 401)          → Fail fast             │
  │  PARTIAL (some success)        → Continue              │
  └────────────────────────┬────────────────────────────────┘
                           │
              ┌────────────┴────────────┐
              ▼                         ▼
        ┌──────────┐              ┌──────────┐
        │ SUCCESS  │              │ FAILURE  │
        └─────┬────┘              └─────┬────┘
              │                         │
              ▼                         ▼
  ┌──────────────────────┐   ┌──────────────────────────┐
  │ Update metrics       │   │ Log error                │
  │ Return articles      │   │ Update circuit state     │
  │ Circuit: success()   │   │ Circuit: failure()       │
  └──────────┬───────────┘   └──────────┬───────────────┘
             │                          │
             └──────────┬───────────────┘
                        │
                        ▼
           ┌────────────────────────────┐
           │ Aggregate Results          │
           ├────────────────────────────┤
           │ 3/5 sources succeeded      │
           │ → Partial success          │
           │ → Return available data    │
           └────────────┬───────────────┘
                        │
                        ▼
           ┌────────────────────────────┐
           │ END: Return aggregated     │
           │      state with articles   │
           │      and error info        │
           └────────────────────────────┘

╔═══════════════════════════════════════════════════════════════════════════════╗
║  KEY PATTERNS                                                                 ║
╠═══════════════════════════════════════════════════════════════════════════════╣
║  • Circuit Breaker: Prevents cascade failures                                ║
║  • Exponential Backoff: Gives services time to recover                       ║
║  • Error Classification: Smart retry decisions                               ║
║  • Partial Success: Degrade gracefully, don't fail completely                ║
║  • Comprehensive Logging: Monitor production behavior                        ║
╚═══════════════════════════════════════════════════════════════════════════════╝
""")

print("\n✓ Workflow visualization complete")
print("  Watch for these patterns in the test output below:")
print("  - Retry backoff delays (about 0.5s → 1s before jitter)")
print("  - Circuit states per source (a circuit opens after 3 failed runs)")
print("  - Partial success (some sources work, others fail)")

### Step 6: Build Resilient Aggregator

The aggregator combines all patterns:
1. Circuit breaker per source
2. Retry with backoff
3. Partial failure handling
4. Logging and monitoring

**Key Design**:
- Each source has its own circuit breaker
- Failed sources don't block successful ones
- Results are aggregated from available sources
- Detailed error tracking
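
Stripped of circuit breakers and retries, the partial-failure idea in this design is a loop that records each failure and keeps going. The sketch below shows only that core; `aggregate_partial`, `healthy`, and `broken` are hypothetical names for this illustration, and articles are plain strings to keep it short.

```python
from typing import Callable, Dict, List, Tuple

Fetcher = Callable[[str], List[str]]


def aggregate_partial(
    topic: str, fetchers: Dict[str, Fetcher]
) -> Tuple[List[str], Dict[str, str]]:
    """Query every fetcher; keep successes and record failures instead of raising."""
    articles: List[str] = []
    status: Dict[str, str] = {}
    for name, fetch in fetchers.items():
        try:
            articles.extend(fetch(topic))
            status[name] = "success"
        except Exception as exc:
            # A failing source is recorded, and the loop moves on to the next one.
            status[name] = f"failed: {exc}"
    return articles, status


def healthy(topic: str) -> List[str]:
    return [f"{topic} story"]


def broken(topic: str) -> List[str]:
    raise TimeoutError("simulated timeout")


articles, status = aggregate_partial(
    "AI", {"A": healthy, "B": broken, "C": healthy}
)
print(len(articles), "articles;", status)
```

The caller gets both the data and the per-source status, so it can decide whether two out of three sources is good enough.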

In [None]:
class AggregatorState(TypedDict):
    """State for news aggregator."""
    topic: str
    articles: List[Dict[str, Any]]
    errors: List[ErrorInfo]
    source_status: Dict[str, str]  # source -> status
    start_time: datetime
    end_time: Optional[datetime]

class ResilientAggregator:
    """News aggregator with circuit breakers and retry logic."""
    
    def __init__(self, sources: Dict[str, NewsSource]):
        self.sources = sources
        self.circuit_breakers = {
            name: CircuitBreaker(failure_threshold=3, timeout=15.0, name=name)
            for name in sources.keys()
        }
        self.retry_config = RetryConfig(max_retries=2, base_delay=0.5)
    
    def aggregate(self, topic: str) -> AggregatorState:
        """Aggregate news from all sources with error recovery."""
        logger.info(f"\n{'='*60}")
        logger.info(f"Starting aggregation for topic: {topic}")
        logger.info(f"{'='*60}")
        
        state = {
            "topic": topic,
            "articles": [],
            "errors": [],
            "source_status": {},
            "start_time": datetime.now(),
            "end_time": None
        }
        
        # Try each source
        for source_name, source in self.sources.items():
            try:
                logger.info(f"\n--- Fetching from {source_name} ---")
                
                # Wrap in circuit breaker and retry
                circuit = self.circuit_breakers[source_name]
                
                articles = circuit.call(
                    retry_with_backoff,
                    source.fetch_news,
                    topic,
                    config=self.retry_config,
                    source=source_name
                )
                
                state["articles"].extend(articles)
                state["source_status"][source_name] = "success"
                logger.info(f"✓ {source_name}: Retrieved {len(articles)} articles")
            
            except Exception as e:
                error_info = classify_error(e, source_name)
                state["errors"].append(error_info)
                state["source_status"][source_name] = "failed"
                logger.error(f"✗ {source_name}: {error_info.message}")
        
        state["end_time"] = datetime.now()
        
        # Summary
        duration = (state["end_time"] - state["start_time"]).total_seconds()
        success_count = sum(1 for status in state["source_status"].values() if status == "success")
        total_sources = len(self.sources)
        
        logger.info(f"\n{'='*60}")
        logger.info(f"Aggregation complete in {duration:.2f}s")
        logger.info(f"Success: {success_count}/{total_sources} sources")
        logger.info(f"Total articles: {len(state['articles'])}")
        logger.info(f"Errors: {len(state['errors'])}")
        logger.info(f"{'='*60}")
        
        return state
    
    def get_circuit_status(self) -> Dict[str, str]:
        """Get status of all circuit breakers."""
        return {
            name: circuit.state.value
            for name, circuit in self.circuit_breakers.items()
        }
    
    def reset_circuits(self):
        """Reset all circuit breakers."""
        for circuit in self.circuit_breakers.values():
            circuit.reset()

print("✓ Resilient aggregator created")

### Step 7: Add Health Monitoring

Production systems need visibility into error patterns and performance metrics.

**Metrics to Track**:
- Success rate per source
- Average response time
- Circuit breaker state
- Retry attempts
- Partial failure rate

This helps answer:
- Which sources are most reliable?
- When should we remove a failing source?
- Is our retry strategy working?
- What's our overall system health?
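
As one possible way to approach the question of when to remove a failing source, the sketch below flags sources whose success rate falls below a threshold once enough requests have been seen. The helper `flag_unhealthy` and both thresholds (`min_requests=5`, `min_success_rate=50.0`) are assumptions for illustration, not recommendations from this lab, and the counts are made-up sample data.

```python
from typing import Dict, List, Tuple


def flag_unhealthy(
    counts: Dict[str, Tuple[int, int]],
    min_requests: int = 5,
    min_success_rate: float = 50.0,
) -> List[str]:
    """Return sources whose success rate (in percent) is below min_success_rate.

    counts maps source name -> (successful_requests, total_requests).
    Sources with fewer than min_requests are skipped: too little data to judge.
    """
    flagged = []
    for name, (successes, total) in sorted(counts.items()):
        if total < min_requests:
            continue
        rate = successes / total * 100
        if rate < min_success_rate:
            flagged.append(name)
    return flagged


# Made-up sample counts: (successful_requests, total_requests)
counts = {
    "ReliableNews": (19, 20),
    "FlakyNews": (8, 20),
    "DownNews": (0, 20),
    "NewSource": (0, 2),  # too few requests to judge
}
print(flag_unhealthy(counts))  # ['DownNews', 'FlakyNews']
```

The minimum request count guards against dropping a source after one or two unlucky calls.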

In [None]:
class SourceMetrics(BaseModel):
    """Metrics for a single source."""
    total_requests: int = 0
    successful_requests: int = 0
    failed_requests: int = 0
    total_response_time: float = 0.0
    retry_count: int = 0
    
    @property
    def success_rate(self) -> float:
        if self.total_requests == 0:
            return 0.0
        return (self.successful_requests / self.total_requests) * 100
    
    @property
    def avg_response_time(self) -> float:
        if self.successful_requests == 0:
            return 0.0
        return self.total_response_time / self.successful_requests

class MetricsCollector:
    """Collects and reports health metrics."""
    
    def __init__(self):
        self.metrics: Dict[str, SourceMetrics] = defaultdict(SourceMetrics)
    
    def record_request(self, source: str, success: bool, response_time: float = 0.0, retries: int = 0):
        """Record a request attempt."""
        metrics = self.metrics[source]
        metrics.total_requests += 1
        
        if success:
            metrics.successful_requests += 1
            metrics.total_response_time += response_time
        else:
            metrics.failed_requests += 1
        
        metrics.retry_count += retries
    
    def get_health_report(self, circuit_states: Dict[str, str]) -> str:
        """Generate health dashboard."""
        lines = []
        lines.append("\n" + "="*85)
        lines.append("HEALTH DASHBOARD")
        lines.append("="*85)
        lines.append(f"{'Source':<20} | {'Requests':>8} | {'Success':>7} | {'Avg Time':>8} | {'Retries':>7} | {'Circuit':<10}")
        lines.append("-"*85)
        
        for source, metrics in sorted(self.metrics.items()):
            circuit_state = circuit_states.get(source, "unknown")
            
            lines.append(
                f"{source:<20} | "
                f"{metrics.total_requests:>8} | "
                f"{metrics.success_rate:>6.1f}% | "
                f"{metrics.avg_response_time:>7.3f}s | "
                f"{metrics.retry_count:>7} | "
                f"{circuit_state:<10}"
            )
        
        lines.append("="*85)
        
        # Overall stats
        total_requests = sum(m.total_requests for m in self.metrics.values())
        total_successes = sum(m.successful_requests for m in self.metrics.values())
        overall_success = (total_successes / total_requests * 100) if total_requests > 0 else 0
        
        lines.append(f"Overall Success Rate: {overall_success:.1f}% ({total_successes}/{total_requests} requests)")
        lines.append("="*85)
        
        return "\n".join(lines)

# Add metrics to aggregator
class ResilientAggregatorWithMetrics(ResilientAggregator):
    """Aggregator with health monitoring."""
    
    def __init__(self, sources: Dict[str, NewsSource]):
        super().__init__(sources)
        self.metrics = MetricsCollector()
    
    def aggregate(self, topic: str) -> AggregatorState:
        """Aggregate with metrics collection."""
        logger.info(f"\n{'='*60}")
        logger.info(f"Starting aggregation for topic: {topic}")
        logger.info(f"{'='*60}")
        
        state = {
            "topic": topic,
            "articles": [],
            "errors": [],
            "source_status": {},
            "start_time": datetime.now(),
            "end_time": None
        }
        
        for source_name, source in self.sources.items():
            start_time = time.time()
            calls_before = source.call_count  # Used to derive retry attempts
            
            try:
                logger.info(f"\n--- Fetching from {source_name} ---")
                circuit = self.circuit_breakers[source_name]
                
                articles = circuit.call(
                    retry_with_backoff,
                    source.fetch_news,
                    topic,
                    config=self.retry_config,
                    source=source_name
                )
                
                response_time = time.time() - start_time
                state["articles"].extend(articles)
                state["source_status"][source_name] = "success"
                
                # Record success metrics; every call after the first is a retry
                self.metrics.record_request(
                    source_name,
                    success=True,
                    response_time=response_time,
                    retries=max(0, source.call_count - calls_before - 1)
                )
                
                logger.info(f"✓ {source_name}: Retrieved {len(articles)} articles in {response_time:.2f}s")
            
            except Exception as e:
                error_info = classify_error(e, source_name)
                # An open circuit makes no calls at all, hence max(0, ...)
                error_info.retry_count = max(0, source.call_count - calls_before - 1)
                state["errors"].append(error_info)
                state["source_status"][source_name] = "failed"
                
                # Record failure metrics
                self.metrics.record_request(
                    source_name,
                    success=False,
                    response_time=0.0,
                    retries=error_info.retry_count
                )
                
                logger.error(f"✗ {source_name}: {error_info.message}")
        
        state["end_time"] = datetime.now()
        
        # Summary
        duration = (state["end_time"] - state["start_time"]).total_seconds()
        success_count = sum(1 for status in state["source_status"].values() if status == "success")
        total_sources = len(self.sources)
        
        logger.info(f"\n{'='*60}")
        logger.info(f"Aggregation complete in {duration:.2f}s")
        logger.info(f"Success: {success_count}/{total_sources} sources")
        logger.info(f"Total articles: {len(state['articles'])}")
        logger.info(f"Errors: {len(state['errors'])}")
        logger.info(f"{'='*60}")
        
        return state
    
    def get_health_report(self) -> str:
        """Get health dashboard."""
        return self.metrics.get_health_report(self.get_circuit_status())

print("✓ Health monitoring added")

### Step 8: Test Error Recovery

Run multiple aggregation cycles to see recovery patterns in action:

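1. **First run**: Some sources fail, retries occur
2. **Second run**: Failure counts carry over from the first run. Each breaker wraps the whole retry sequence, so a source that exhausts its retries counts as one failure, and with `failure_threshold=3` a circuit opens only on that source's third failed run
3. **After reset**: Circuits return to CLOSED with failure counts cleared

Watch the logs to see:
- Retry attempts with increasing delays (about 0.5s → 1s before jitter, given `base_delay=0.5` and `max_retries=2`)
- Circuit breaker states after each run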
- Partial success (some sources work)
- Graceful degradation
- Health metrics

In [None]:
# Create aggregator with metrics
aggregator = ResilientAggregatorWithMetrics(sources)

# Test 1: First aggregation
print("\n" + "#"*60)
print("# TEST 1: First Aggregation")
print("#"*60)

result1 = aggregator.aggregate("AI advancements")

print("\nResults:")
print(f"  Articles retrieved: {len(result1['articles'])}")
print(f"  Source status: {result1['source_status']}")
print(f"  Circuit states: {aggregator.get_circuit_status()}")

# Show health dashboard
print(aggregator.get_health_report())

In [None]:
# Test 2: Immediate retry (failure counts carry over from Test 1)
print("\n" + "#"*60)
print("# TEST 2: Immediate Retry (failure counts carry over)")
print("#"*60)

result2 = aggregator.aggregate("Climate change")

print("\nResults:")
print(f"  Articles retrieved: {len(result2['articles'])}")
print(f"  Source status: {result2['source_status']}")
print(f"  Circuit states: {aggregator.get_circuit_status()}")

# Show updated health dashboard
print(aggregator.get_health_report())

In [None]:
# Test 3: Manual circuit reset and retry
print("\n" + "#"*60)
print("# TEST 3: After Circuit Reset")
print("#"*60)

print("\nResetting all circuits...")
aggregator.reset_circuits()
print(f"Circuit states after reset: {aggregator.get_circuit_status()}")

result3 = aggregator.aggregate("Technology trends")

print("\nResults:")
print(f"  Articles retrieved: {len(result3['articles'])}")
print(f"  Source status: {result3['source_status']}")
print(f"  Circuit states: {aggregator.get_circuit_status()}")

# Show final health dashboard
print(aggregator.get_health_report())

## Challenge 1: Add Fallback Cache

**Goal**: When all sources fail for a topic, return cached results from previous successful queries.

**Requirements**:
- Add `cache: Dict[str, Dict[str, Any]]` to the aggregator (topic → cache entry holding the articles and a timestamp)
- After successful aggregation, cache results with timestamp
- If aggregation returns 0 articles, check cache
- Return cached results with "CACHED" indicator in logs
- Expire cache after 1 hour

**Hints**:
- Store timestamp with each cache entry: `{"articles": [...], "timestamp": datetime.now()}`
- Check `(now - cache_time) < timedelta(hours=1)`
- Add logging: `logger.info(f"Using cached results from {timestamp}")`
- Test by temporarily setting all sources to `failure_rate=1.0`

In [None]:
# YOUR CODE HERE

## Summary

In this lab, you built a production-grade resilient system with:

✅ **Error classification** - Distinguish transient, permanent, and partial failures  
✅ **Retry with exponential backoff** - Handle transient errors intelligently  
✅ **Circuit breakers** - Prevent cascade failures and enable recovery testing  
✅ **Partial failure handling** - Degrade gracefully, return available data  
✅ **Comprehensive logging** - Monitor errors and recovery in production  
✅ **Health monitoring** - Track metrics and system health  
✅ **Resilience patterns** - Apply industry-standard error recovery  

### Key Takeaways

**Error Recovery Principles**:
1. **Classify before acting**: Not all errors deserve retry
2. **Fail fast on permanent errors**: Don't waste time
3. **Back off exponentially**: Give services time to recover
4. **Isolate failures**: One bad source shouldn't block others
5. **Monitor and log**: Visibility is critical in production
6. **Measure everything**: Metrics drive operational decisions

**When to Use These Patterns**:
- External API dependencies
- Distributed systems
- Production reliability requirements
- High-availability systems

**Real-World Applications**:
- Microservices architectures
- Cloud-native applications
- Data pipelines
- Customer-facing services

### Advanced Note

These patterns represent **advanced production engineering**. If you're new to error recovery:
- Focus on understanding the *why* behind each pattern
- Experiment with the failure rates to see different behaviors
- Review the health dashboard to understand system performance
- Start simple (retry) before adding complexity (circuit breakers)


### Next Steps

- Apply these resilience patterns to your own projects
- Explore advanced patterns: bulkheads, rate limiting, load shedding
- Study production incident reports to see these patterns in action