---

# Chapter 16: Resilience and Fault Tolerance

## Opening Context

In a distributed system, failures are not a matter of **if** but **when**. Networks glitch, services crash, databases time out, and resources become exhausted. In a monolithic application, a failure often means the entire application is down. In a microservices architecture, the stakes are different: a failure in one service can cascade, consuming resources and bringing down dependent services—a phenomenon known as **cascading failure**.

Building resilient systems requires deliberate design. We must assume that dependencies will fail and protect our services from being dragged down with them. This chapter explores three essential patterns for fault tolerance:

1. **Circuit Breaker** – Prevents repeated calls to a failing service, giving it time to recover and avoiding wasted resources.
2. **Retry with Exponential Backoff** – Handles transient failures by retrying operations with increasing delays.
3. **Bulkhead** – Isolates failures by partitioning resources, so a problem in one part doesn’t sink the whole system.

The circuit breaker and the bulkhead take their names from electrical engineering and shipbuilding respectively; together with retries, they form the bedrock of resilient distributed systems. By the end of this chapter, you’ll know how to apply all three to keep your applications stable even when things go wrong.

---

## 16.1 Circuit Breaker Pattern

### Intent
*Protect a system from repeatedly trying to execute an operation that is likely to fail, allowing it to recover and preventing cascading failures.*

### The Problem

Imagine an e‑commerce frontend service that calls a `payment‑service` to process credit cards. If the payment service becomes slow or starts failing (e.g., due to a database outage), the frontend service might continue to send requests, each waiting for a timeout. This has several negative consequences:

- **Resource exhaustion** – The frontend service holds onto threads/connections while waiting, potentially exhausting its own resources.
- **User frustration** – Users experience long waits and eventual errors.
- **Cascading failure** – The frontend’s resource exhaustion can cause it to fail, affecting other functionalities that don’t even use payment.

What’s needed is a way to **stop trying** when failure is likely, and to **try again** later when the service may have recovered.

### The Solution: Circuit Breaker

The Circuit Breaker pattern, named after its electrical counterpart, introduces a state machine that monitors for failures. When failures reach a threshold, the circuit **trips** and subsequent calls fail immediately (or with a fallback) without attempting the operation. After a timeout, the circuit transitions to a **half‑open** state, allowing a limited number of test requests to see if the service has recovered.

**States**:
- **Closed** – Normal operation. Requests pass through; failures are counted. When failures exceed a threshold, the circuit opens.
- **Open** – Requests fail immediately. After a timeout, the circuit transitions to half‑open.
- **Half‑Open** – A limited number of test requests are allowed. If they succeed, the circuit closes; if they fail, it opens again.

### Implementation Example

Let’s implement a simple circuit breaker in TypeScript that wraps an asynchronous function.

```typescript
// circuit-breaker.ts
export class CircuitBreaker {
  private state: 'CLOSED' | 'OPEN' | 'HALF_OPEN' = 'CLOSED';
  private failureCount: number = 0;
  private readonly failureThreshold: number;
  private readonly timeout: number;
  private nextAttempt: number = Date.now();

  constructor(failureThreshold: number = 3, timeout: number = 10000) {
    this.failureThreshold = failureThreshold;
    this.timeout = timeout;
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      if (Date.now() > this.nextAttempt) {
        this.state = 'HALF_OPEN';
      } else {
        throw new Error('Circuit breaker is OPEN');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }

  private onSuccess(): void {
    this.failureCount = 0;
    if (this.state === 'HALF_OPEN') {
      this.state = 'CLOSED';
    }
  }

  private onFailure(): void {
    this.failureCount++;
    if (this.state === 'HALF_OPEN' || this.failureCount >= this.failureThreshold) {
      this.trip();
    }
  }

  private trip(): void {
    this.state = 'OPEN';
    this.nextAttempt = Date.now() + this.timeout;
  }
}
```

**Explanation**:
- The constructor sets the failure threshold (how many failures before opening) and the timeout (how long to stay open).
- `call()` checks the current state. If open and the timeout has passed, it moves to half‑open.
- In half‑open or closed, it attempts the function.
- On success, it resets failure count and closes the circuit if it was half‑open.
- On failure, it increments the count; if the threshold is reached (or if already half‑open), it trips the circuit (opens it).

#### Usage Example

```typescript
// payment-client.ts
import { CircuitBreaker } from './circuit-breaker';

class PaymentClient {
  private breaker = new CircuitBreaker(3, 10000); // 3 failures, 10s timeout

  async processPayment(amount: number): Promise<string> {
    return this.breaker.call(async () => {
      // Call the external payment service (the URL is illustrative)
      const response = await fetch('https://payment.example.com/charge', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ amount })
      });
      if (!response.ok) throw new Error(`Payment failed with status ${response.status}`);
      return response.json();
    });
  }
}
```

**Explanation**:
- The `processPayment` method wraps the external call with the circuit breaker.
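- If the payment service fails three consecutive times (any success resets the count), the circuit opens and subsequent calls throw immediately without hitting the network.
- After 10 seconds, the next call is let through as a test request. If it succeeds, normal operation resumes; if it fails, the circuit opens for another 10 seconds. Note that this simple implementation does not limit how many calls pass while half‑open: concurrent calls that arrive before the first test request settles all go through.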
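
One way to admit only a single test request while half‑open is sketched below. This is a minimal variant, not a replacement for a library: the class name `SingleProbeBreaker`, the `probeInFlight` flag, and the injectable `now` clock are all hypothetical names introduced for illustration.

```typescript
// single-probe-breaker.ts (illustrative sketch; all names are hypothetical)
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class SingleProbeBreaker {
  private state: BreakerState = 'CLOSED';
  private failureCount = 0;
  private openedAt = 0;
  private probeInFlight = false;
  private readonly failureThreshold: number;
  private readonly timeoutMs: number;
  private readonly now: () => number;

  constructor(
    failureThreshold: number = 3,
    timeoutMs: number = 10000,
    now: () => number = () => Date.now(), // injectable clock, handy in tests
  ) {
    this.failureThreshold = failureThreshold;
    this.timeoutMs = timeoutMs;
    this.now = now;
  }

  getState(): BreakerState {
    return this.state;
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      if (this.now() - this.openedAt < this.timeoutMs) {
        throw new Error('Circuit breaker is OPEN');
      }
      this.state = 'HALF_OPEN';
    }

    // Only one caller may act as the probe while half-open
    const isProbe = this.state === 'HALF_OPEN';
    if (isProbe) {
      if (this.probeInFlight) {
        throw new Error('Circuit breaker is HALF_OPEN: probe already in flight');
      }
      this.probeInFlight = true;
    }

    try {
      const result = await fn();
      this.failureCount = 0;
      if (isProbe) this.state = 'CLOSED';
      return result;
    } catch (err) {
      this.failureCount++;
      if (isProbe || this.failureCount >= this.failureThreshold) {
        this.state = 'OPEN';
        this.openedAt = this.now();
      }
      throw err;
    } finally {
      // Only the probe itself may clear the flag
      if (isProbe) this.probeInFlight = false;
    }
  }
}
```

Rejecting the extra callers keeps a recovering service from being hit by a burst the moment the timeout expires; whether to reject them or serve a fallback is a design choice.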

### Real‑World Considerations

- **Timeouts** – Pair the circuit breaker with a timeout on every call. The breaker above only counts rejected promises, so a request that hangs is never recorded as a failure unless a timeout turns it into one.
- **Fallbacks** – In open state, you might provide a fallback value or use cached data.
- **Monitoring** – Expose metrics (state, failure count) for dashboards and alerts.
- **Library Support** – In production, use battle‑tested libraries like **Polly** (.NET), **resilience4j** (Java), or **opossum** (Node.js).
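
The first two considerations can be sketched with two small helpers. This is a minimal sketch under stated assumptions: `withTimeout`, `withFallback`, and `TimeoutError` are hypothetical names, and the timeout only stops waiting; it does not cancel the underlying work.

```typescript
// timeout-fallback.ts (illustrative sketch; helper names are hypothetical)
class TimeoutError extends Error {
  constructor(ms: number) {
    super(`Operation timed out after ${ms}ms`);
    this.name = 'TimeoutError';
  }
}

// Rejects with TimeoutError if `fn` has not settled within `ms` milliseconds.
// The underlying operation keeps running; only the caller stops waiting.
function withTimeout<T>(fn: () => Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new TimeoutError(ms)), ms);
    fn().then(
      value => { clearTimeout(timer); resolve(value); },
      err => { clearTimeout(timer); reject(err); },
    );
  });
}

// Returns the fallback value if `fn` rejects for any reason,
// including an open circuit or a timeout.
async function withFallback<T>(fn: () => Promise<T>, fallback: () => T): Promise<T> {
  try {
    return await fn();
  } catch {
    return fallback();
  }
}
```

Passing `() => withTimeout(operation, 2000)` to a breaker's `call()` makes a hung request count as a failure, and wrapping the whole thing in `withFallback` lets the caller return cached data while the circuit is open.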

---

## 16.2 Retry with Exponential Backoff

### Intent
*Automatically retry failed operations when the failure is transient, increasing the delay between retries to avoid overwhelming the system.*

### The Problem

Transient failures—network glitches, temporary unavailability, database deadlocks—are common in distributed systems. Simply giving up after the first failure would make the system unnecessarily brittle. However, retrying immediately and repeatedly can make things worse: if the service is struggling, a flood of retries can delay its recovery.

### The Solution: Retry with Exponential Backoff

Instead of retrying immediately, we wait a short time, then a longer time, and so on, often adding **jitter** (randomness) to avoid all clients retrying simultaneously (the **thundering herd** problem). The delay typically follows a formula like:

```
delay = baseDelay * (2 ^ attempt) + jitter
```
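
To make the formula concrete, the sketch below computes the delay for each attempt, with the cap that implementations usually add. The function name `backoffDelay` and the injectable `random` parameter are assumptions made for illustration; jitter is switched off in the example so the numbers are reproducible.

```typescript
// backoff-schedule.ts (illustrative sketch; names are hypothetical)
function backoffDelay(
  attempt: number,                 // 0 for the first retry
  baseDelay: number,               // milliseconds
  maxDelay: number,                // upper bound on the delay, in milliseconds
  maxJitter: number = 0,           // upper bound on the random jitter, in milliseconds
  random: () => number = () => Math.random(),
): number {
  const exponential = baseDelay * Math.pow(2, attempt);
  const jitter = random() * maxJitter;
  return Math.min(exponential + jitter, maxDelay);
}

// Five retries with baseDelay = 100 ms, maxDelay = 1000 ms, no jitter
const schedule = [0, 1, 2, 3, 4].map(attempt => backoffDelay(attempt, 100, 1000));
// schedule is [100, 200, 400, 800, 1000]: the fifth delay (1600) is capped at maxDelay
```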

#### Implementation Example

```typescript
// retry.ts
export async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries: number = 3,
  baseDelay: number = 100, // milliseconds
  maxDelay: number = 10000
): Promise<T> {
  let lastError: unknown = undefined; // caught values are `unknown` in strict TypeScript
  
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt === maxRetries) break;
      
      // Calculate delay with exponential backoff and jitter
      const delay = Math.min(
        baseDelay * Math.pow(2, attempt) + Math.random() * 100,
        maxDelay
      );
      
      console.log(`Retry attempt ${attempt + 1} after ${delay}ms`);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  
  throw lastError;
}
```

**Explanation**:
- The function attempts the operation up to `maxRetries + 1` times.
- After each failure, it calculates an exponential delay (`baseDelay * 2^attempt`) and adds jitter (random up to 100ms) to spread out retries.
- The delay is capped at `maxDelay` to prevent excessively long waits.
- If all retries fail, the last error is thrown.

#### Usage Example

```typescript
// database-client.ts
import { retryWithBackoff } from './retry';

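// Assumption: `db` is your database client, exposing an async execute(sql) method
declare const db: { execute(sql: string): Promise<unknown> };

async function queryDatabase(sql: string): Promise<unknown> {
  return retryWithBackoff(async () => {
    // Database call that may fail transiently
    const result = await db.execute(sql);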
    return result;
  }, 3, 200, 5000);
}
```

**Explanation**:
- The database query is wrapped with retry logic. If it fails, it is retried up to three times with increasing delays. As written, `retryWithBackoff` retries on every error, so it only helps when the failure is transient (e.g., a deadlock).
- The `maxDelay` of 5 seconds ensures we don’t wait too long between retries.

### Important Considerations

- **Idempotency** – Retries are safe only if the operation is idempotent (executing it multiple times has the same effect as once). For non‑idempotent operations (e.g., creating a resource), you must ensure that duplicates are handled (e.g., using request IDs).
- **Which errors to retry** – Not all errors are retryable. Client errors (4xx) usually indicate a problem with the request and should not be retried; server errors (5xx) or network timeouts are good candidates.
- **Retry storms** – Exponential backoff with jitter helps, but you also need circuit breakers to stop retries when a service is genuinely down.

### Integration with Circuit Breaker

Retry and circuit breaker work well together. The typical pattern is:

1. Apply retry with backoff for transient failures.
2. If retries fail, the circuit breaker counts that as a failure.
3. When the circuit opens, retries are bypassed entirely (or a fallback is used).

This way, you get both local resilience and system‑wide protection.
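
The three steps above amount to putting the breaker on the outside and the retry on the inside. A minimal sketch, assuming a breaker with a `call()` method shaped like the one in section 16.1 and a retry function with its options already bound; the names `Breaker`, `Retry`, and `resilientCall` are hypothetical.

```typescript
// resilient-call.ts (illustrative sketch; all names are hypothetical)
type AsyncFn<T> = () => Promise<T>;

// Anything with a `call` method shaped like the CircuitBreaker from section 16.1
interface Breaker {
  call<T>(fn: AsyncFn<T>): Promise<T>;
}

// A retry function with its options already bound,
// e.g. fn => retryWithBackoff(fn, 3, 200, 5000)
type Retry = <T>(fn: AsyncFn<T>) => Promise<T>;

function resilientCall<T>(breaker: Breaker, retry: Retry, fn: AsyncFn<T>): Promise<T> {
  // Breaker outside, retry inside:
  //  - an exhausted retry sequence is recorded as ONE breaker failure
  //  - an open breaker rejects before any retry is attempted
  return breaker.call(() => retry(fn));
}
```

The opposite nesting (retry outside, breaker inside) would count every attempt as a separate breaker failure and would keep retrying against an open circuit, which is usually not what you want.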

---

## 16.3 Bulkhead Pattern

### Intent
*Isolate failures by partitioning resources (like thread pools, connection pools, or queues) so that a problem in one part of the system does not bring down the whole system.*

### The Problem

In a typical application, all outbound calls might share the same thread pool or connection pool. If one downstream service becomes slow or unresponsive, it can exhaust the entire pool, causing requests to other, healthy services to wait or fail. This is a form of **resource contention** leading to cascading failure.

For example, suppose your application has a pool of 10 database connections. If a poorly optimised query starts taking 30 seconds and ten requests run it concurrently, every connection is tied up, leaving none for other queries, even quick ones. The application appears dead.

### The Solution: Bulkhead

The Bulkhead pattern, named after the compartments in a ship that prevent water from flooding the entire vessel, divides resources into isolated pools. If one compartment fails, the others remain operational.

In software, we create separate thread pools, connection pools, or queues for different downstream dependencies. For instance, you might have:
- A pool of 5 connections for the payment service.
- A pool of 5 connections for the inventory service.
- A pool of 10 connections for the database (shared, but still isolated from the others).

#### Implementation Example: Separate HTTP Client Pools

Using Node.js with `axios`, we can create separate instances with their own connection limits.

```typescript
// http-clients.ts
import http from 'http';
import https from 'https';
import axios, { AxiosInstance } from 'axios';

// Dedicated client for the payment service. In Node.js, axios sends requests
// through an http/https Agent, and maxSockets caps the concurrent sockets per
// host. Here the payment service gets at most 2.
export const paymentClient: AxiosInstance = axios.create({
  baseURL: 'https://payment.example.com',
  timeout: 5000,
  httpAgent: new http.Agent({ keepAlive: true, maxSockets: 2 }),
  httpsAgent: new https.Agent({ keepAlive: true, maxSockets: 2 }),
});

// Separate client, and therefore a separate pool, for the inventory service
export const inventoryClient: AxiosInstance = axios.create({
  baseURL: 'https://inventory.example.com',
  timeout: 5000,
  httpAgent: new http.Agent({ keepAlive: true, maxSockets: 5 }),
  httpsAgent: new https.Agent({ keepAlive: true, maxSockets: 5 }),
});
```

**Explanation**:
- Each client has its own connection pool (`maxSockets` limits concurrent connections).
- If the payment service slows down, it can at most consume 2 connections. The inventory service still has its own 5 connections available.
- This isolation prevents one slow dependency from exhausting all available connections.

#### Implementation Example: Thread Pools (Conceptual)

In languages with explicit thread pools (Java, C#), you might assign each dependency a separate thread pool.

```java
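// Java sketch (fragment): callPayment() and callInventory() stand in for your own
// blocking calls; the types come from java.util.concurrent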
ExecutorService paymentPool = Executors.newFixedThreadPool(2);
ExecutorService inventoryPool = Executors.newFixedThreadPool(5);

// Submit tasks to appropriate pools
Future<PaymentResult> paymentFuture = paymentPool.submit(() -> callPayment());
Future<InventoryResult> inventoryFuture = inventoryPool.submit(() -> callInventory());
```

#### Semaphore‑Based Bulkhead

For finer control, you can use semaphores to limit concurrent calls to a specific operation.

```typescript
// semaphore-bulkhead.ts
import { Semaphore } from 'async-mutex';
import { paymentClient } from './http-clients'; // the axios client defined earlier

class Bulkhead {
  private semaphore: Semaphore;

  constructor(maxConcurrent: number) {
    this.semaphore = new Semaphore(maxConcurrent);
  }

  async run<T>(fn: () => Promise<T>): Promise<T> {
    // acquire() resolves to [currentValue, release]; only the releaser is needed
    const [, release] = await this.semaphore.acquire();
    try {
      return await fn();
    } finally {
      release();
    }
  }
}

// Usage
const paymentBulkhead = new Bulkhead(2);

async function callPaymentWithBulkhead(amount: number) {
  return paymentBulkhead.run(() => paymentClient.post('/charge', { amount }));
}
```

**Explanation**:
- The semaphore ensures at most `maxConcurrent` calls are in flight at once.
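- Additional calls wait until a slot is free. The wait queue is unbounded here, so under sustained overload callers pile up rather than fail fast.
- This caps how much of the caller’s capacity a single dependency can tie up, and also limits the load placed on the downstream service.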
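
If callers should fail fast rather than wait indefinitely, the bulkhead can bound its wait queue and reject once it is full. The sketch below is one way to do that without an external library; `BoundedBulkhead`, `BulkheadFullError`, and the `maxQueued` parameter are hypothetical names introduced for illustration.

```typescript
// bounded-bulkhead.ts (illustrative sketch; all names are hypothetical)
class BulkheadFullError extends Error {
  constructor() {
    super('Bulkhead is full');
    this.name = 'BulkheadFullError';
  }
}

class BoundedBulkhead {
  private active = 0;
  private readonly waiters: Array<() => void> = [];
  private readonly maxConcurrent: number;
  private readonly maxQueued: number;

  constructor(maxConcurrent: number, maxQueued: number) {
    this.maxConcurrent = maxConcurrent;
    this.maxQueued = maxQueued;
  }

  async run<T>(fn: () => Promise<T>): Promise<T> {
    if (this.active >= this.maxConcurrent) {
      // All slots busy: queue if there is room, otherwise reject immediately
      if (this.waiters.length >= this.maxQueued) {
        throw new BulkheadFullError();
      }
      // Resolved by a finishing call, which hands its slot over directly
      await new Promise<void>(resolve => this.waiters.push(resolve));
    } else {
      this.active++;
    }

    try {
      return await fn();
    } finally {
      const next = this.waiters.shift();
      if (next) {
        next();        // pass the slot to the next waiter; `active` is unchanged
      } else {
        this.active--; // nobody waiting: free the slot
      }
    }
  }
}
```

Handing the slot straight to the next waiter, instead of decrementing and re-incrementing the counter, prevents a newly arriving call from jumping ahead of callers that are already queued.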

### Bulkhead at Different Levels

- **Connection pools** – Limit concurrent connections to a service.
- **Thread pools** – Isolate processing of different types of tasks.
- **Queues** – Use separate queues for different priorities or tenants.
- **Semaphores** – Limit concurrent execution of specific code paths.

### Trade‑offs

- **Resource underutilisation** – If you allocate too many small pools, you may waste resources that could be shared. You need to size pools based on expected load.
- **Complexity** – Managing multiple pools adds configuration overhead.
- **Monitoring** – You need to monitor each pool’s usage to detect issues.

### Combining Patterns

In a resilient system, you often combine bulkheads with circuit breakers and retries. For example:
- Use a bulkhead to limit concurrent calls to a service.
- Wrap the call with a circuit breaker to stop calling when the service is failing.
- Use retries for transient failures, but respect the bulkhead limits.
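
One way to express this layering is to treat each pattern as a wrapper around an async function and compose the wrappers. The `Policy` type and `composePolicies` helper below are hypothetical names, and the suggested order (breaker, then retry, then bulkhead) is only one reasonable choice.

```typescript
// policy-compose.ts (illustrative sketch; all names are hypothetical)

// A policy takes an async operation and runs it with some protection applied
type Policy = <T>(fn: () => Promise<T>) => Promise<T>;

// Composes policies so that the FIRST one in the list is the outermost wrapper
function composePolicies(...policies: Policy[]): Policy {
  return <T>(fn: () => Promise<T>): Promise<T> =>
    policies.reduceRight<() => Promise<T>>(
      (inner, policy) => () => policy(inner),
      fn,
    )();
}
```

Given policies built from the earlier sections (for example `fn => breaker.call(fn)`, `fn => retryWithBackoff(fn)`, and `fn => bulkhead.run(fn)`), `composePolicies(breakerPolicy, retryPolicy, bulkheadPolicy)` makes every retry attempt acquire its own bulkhead slot, while the breaker sees only the final outcome of the retry sequence.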

---

## Chapter Summary

This chapter covered three foundational patterns for building resilient distributed systems:

1. **Circuit Breaker** – Prevents cascading failures by stopping calls to a failing service and allowing recovery time. It acts as a state machine that opens after a threshold of failures and tests the waters before closing again.

2. **Retry with Exponential Backoff** – Handles transient failures by retrying operations with increasing delays and jitter. It must be used with idempotent operations and combined with circuit breakers to avoid hammering a struggling service.

3. **Bulkhead** – Isolates failures by partitioning resources (connection pools, thread pools, semaphores). This ensures that a problem in one dependency does not exhaust resources needed by others.

**Key Insight**: Resilience is not a single pattern but a combination of them. Circuit breakers stop the bleeding, retries handle temporary glitches, and bulkheads contain the damage. Together, they form a robust defence against the inevitable failures of distributed systems.

---

## Next Chapter Preview

**Chapter 17: Data Management in Distributed Systems (CQRS, Event Sourcing, Saga, Sharding)**

In distributed systems, data is rarely stored in a single database. We must manage consistency, scalability, and complex queries across services. Chapter 17 will explore patterns for data management: **CQRS** (Command Query Responsibility Segregation) separates writes and reads, **Event Sourcing** stores state as a sequence of events, **Saga** manages distributed transactions, and **Sharding** partitions data across multiple databases. These patterns help you design scalable, consistent, and auditable data layers.



