# **Chapter 7: Microservices Architecture**

The shift from monolithic to microservices architecture represents one of the most significant evolutions in software engineering over the past decade. While microservices offer scalability, flexibility, and team autonomy, they introduce significant complexity in distributed systems. This chapter explores when to use microservices, how to decompose systems, communication patterns, service discovery, and strategies for managing the inherent challenges of distributed architectures.

---

## **7.1 Monolithic vs. Microservices Architecture**

Understanding the trade-offs between monolithic and microservices architectures is crucial for making informed design decisions.

### **Monolithic Architecture**

**Concept**: Single deployable unit where all functionality—user interface, business logic, and data access—exists in one codebase and runs as a single process.

**Architecture**:
```
┌─────────────────────────────────────────────────────────────┐
│                    Monolithic Application                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
│  │   Web UI    │  │   Mobile    │  │   Admin     │        │
│  │   (React)   │  │   (iOS/     │  │   Panel     │        │
│  │             │  │   Android)  │  │   (Vue)     │        │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘        │
│         │                │                │               │
│         └────────────────┴────────────────┘                │
│                          │                                  │
│                  ┌───────▼───────┐                          │
│                  │   API Layer   │                          │
│                  │  (REST API)   │                          │
│                  └───────┬───────┘                          │
│                          │                                  │
│  ┌───────────────────────▼───────────────────────┐          │
│  │           Business Logic Layer                │          │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐        │          │
│  │  │  User   │ │  Order  │ │Inventory│        │          │
│  │  │ Service │ │ Service │ │ Service │        │          │
│  │  │ (Module)│ │ (Module)│ │ (Module)│        │          │
│  │  └────┬────┘ └────┬────┘ └────┬────┘        │          │
│  │       └───────────┼───────────┘             │          │
│  │                   │                         │          │
│  │  ┌────────────────▼────────────────┐        │          │
│  │  │      Data Access Layer          │        │          │
│  │  │  ┌─────────┐ ┌─────────┐       │        │          │
│  │  │  │  User   │ │  Order  │       │        │          │
│  │  │  │   DB    │ │   DB    │       │        │          │
│  │  │  │(Tables) │ │(Tables) │       │        │          │
│  │  │  └─────────┘ └─────────┘       │        │          │
│  │  └────────────────────────────────┘        │          │
│  └─────────────────────────────────────────────┘          │
│                          │                                  │
│                  ┌───────▼───────┐                          │
│                  │  Single DB    │                          │
│                  │ (PostgreSQL)  │                          │
│                  └───────────────┘                          │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Deployment: Single deployable unit (one JAR/WAR, one Docker image)
Scaling: Scale entire application (all components together)
Development: Single codebase, shared libraries
Team: All developers work on same codebase
```

**Advantages**:
1. **Simplicity**: Single codebase, single deployment unit
2. **Performance**: In-process method calls (no network latency)
3. **Data Consistency**: Single database, ACID transactions across all operations
4. **Testing**: Easy to write integration tests (everything in one process)
5. **Debugging**: Stack traces show full call flow; easy to trace issues
6. **Deployment**: One deployment artifact, simpler CI/CD pipeline

**Disadvantages**:
1. **Scalability**: Must scale entire application even if only one component needs scaling
2. **Technology Lock-in**: Stuck with one technology stack for everything
3. **Deployment Risk**: Small change requires redeploying entire application
4. **Team Coordination**: Large teams conflict on code changes (merge conflicts)
5. **Fault Isolation**: One bug can crash entire application
6. **Onboarding**: New developers must understand entire codebase

**When to Use**:
- Small teams (< 10 developers)
- Simple applications with clear domain boundaries
- When time-to-market is critical (MVP, startups)
- When data consistency is paramount (banking transactions)
- Low complexity requirements

---

### **Microservices Architecture**

**Concept**: Application composed of small, independent services that communicate over a network. Each service owns its own data and can be developed, deployed, and scaled independently.

**Architecture**:
```
Clients (Web, Mobile, 3rd Party)
    │
    ▼
┌─────────────────────────────────────┐
│         API Gateway                  │
│  (Authentication, Routing,          │
│   Rate Limiting, SSL)               │
└────────┬──────────┬─────────────────┘
         │          │
    ┌────┴────┐     │
    │         │     │
    ▼         ▼     ▼
┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐
│ User  │ │ Order │ │Payment│ │Inventory│
│Service│ │Service│ │Service│ │Service │
│       │ │       │ │       │ │        │
│ ┌───┐ │ │ ┌───┐ │ │ ┌───┐ │ │  ┌───┐ │
│ │UI │ │ │ │UI │ │ │ │UI │ │ │  │UI │ │
│ └───┘ │ │ └───┘ │ │ └───┘ │ │  └───┘ │
│       │ │       │ │       │ │        │
│ ┌───┐ │ │ ┌───┐ │ │ ┌───┐ │ │  ┌───┐ │
│ │API│ │ │ │API│ │ │ │API│ │ │  │API│ │
│ └───┘ │ │ └───┘ │ │ └───┘ │ │  └───┘ │
│       │ │       │ │       │ │        │
│ ┌───┐ │ │ ┌───┐ │ │ ┌───┐ │ │  ┌───┐ │
│ │DB │ │ │ │DB │ │ │ │DB │ │ │  │DB │ │
│ └───┘ │ │ └───┘ │ │ └───┘ │ │  └───┘ │
└───┬───┘ └───┬───┘ └───┬───┘ └────┬───┘
    │         │         │          │
    ▼         ▼         ▼          ▼
┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐
│User DB│ │OrderDB│ │Payment│ │Inventory│
│(Post- │ │(Mongo)│ │  DB   │ │  DB    │
│greSQL)│ │       │ │(MySQL)│ │(Redis) │
└───────┘ └───────┘ └───────┘ └───────┘

Communication:
- Synchronous: REST/gRPC (for real-time operations)
- Asynchronous: Message Queue (for background processing)

Deployment: Each service deployed independently
Scaling: Scale individual services based on demand
Development: Separate codebases, independent teams
Technology: Polyglot (different stacks per service)
```

**Advantages**:
1. **Independent Scaling**: Scale only the services that need it (cost-effective)
2. **Technology Diversity**: Use best tool for each job (Node.js for UI, Python for ML, Go for high-performance)
3. **Fault Isolation**: Failure in one service doesn't crash others
4. **Team Autonomy**: Teams own services end-to-end (DevOps culture)
5. **Independent Deployment**: Deploy services without affecting others (faster releases)
6. **Reusability**: Services can be reused by other applications

**Disadvantages**:
1. **Distributed System Complexity**: Network latency, partial failures, consistency challenges
2. **Operational Overhead**: Monitoring, logging, tracing across dozens of services
3. **Data Consistency**: Distributed transactions, eventual consistency
4. **Testing Complexity**: End-to-end integration tests need many services running; contract tests and stubs reduce this cost but do not remove it
5. **Network Latency**: Service calls over network (milliseconds vs. microseconds)
6. **Security**: Larger attack surface (every service exposes network endpoints that must be authenticated and secured)

**When to Use**:
- Large teams (> 30 developers) organized around business capabilities
- Complex applications with distinct business domains
- When different components have different scaling requirements
- When teams need autonomy and frequent independent deployments
- When technology diversity provides competitive advantage

---

### **The Decision Framework**

```
┌─────────────────────────────────────────────────────────────┐
│              Monolithic vs Microservices Decision              │
├─────────────────────────────────────────────────────────────┤
│                                                               │
│  Start Here:                                                  │
│  ┌─────────────────────────────────────────────────────┐    │
│  │ Team Size?                                          │    │
│  │ ┌──────────┐    ┌──────────┐                      │    │
│  │ │ < 10 dev │    │ > 30 dev │                      │    │
│  │ │          │    │          │                      │    │
│  │ │ Monolith │    │ Micro-   │                      │    │
│  │ │          │    │ services │                      │    │
│  │ └──────────┘    └──────────┘                      │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                               │
│  If team size 10-30:                                          │
│  ┌─────────────────────────────────────────────────────┐    │
│  │ Domain Complexity?                                  │    │
│  │ ┌──────────┐    ┌──────────┐                      │    │
│  │ │ Simple   │    │ Complex  │                      │    │
│  │ │          │    │          │                      │    │
│  │ │ Monolith │    │ Consider │                      │    │
│  │ │          │    │ Micro-   │                      │    │
│  │ │          │    │ services │                      │    │
│  │ └──────────┘    └──────────┘                      │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                               │
│  If considering microservices:                                │
│  ┌─────────────────────────────────────────────────────┐    │
│  │ Scaling Requirements?                               │    │
│  │ ┌──────────┐    ┌──────────┐                      │    │
│  │ │ Uniform  │    │ Varied   │                      │    │
│  │ │          │    │          │                      │    │
│  │ │ Monolith │    │ Micro-   │                      │    │
│  │ │          │    │ services │                      │    │
│  │ └──────────┘    └──────────┘                      │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                               │
│  If still unsure:                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │ Start with Monolith, extract services when needed   │    │
│  │ (Evolutionary Architecture)                         │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                               │
└─────────────────────────────────────────────────────────────┘
```
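
The decision flow above can also be written as a small function. This is only a sketch: the name `recommend_architecture` and its parameters are hypothetical, and the thresholds (10 and 30 developers) come straight from the diagram rather than from any hard rule.

```python
def recommend_architecture(team_size: int,
                           complex_domain: bool,
                           varied_scaling: bool) -> str:
    """Encode the decision diagram above (hypothetical helper)."""
    if team_size < 10:
        return "monolith"
    if team_size > 30:
        return "microservices"
    # Team size 10-30: domain complexity decides.
    if not complex_domain:
        return "monolith"
    # Considering microservices: scaling requirements decide.
    if varied_scaling:
        return "microservices"
    # Still unsure: start with a monolith and extract services later.
    return "monolith"
```

Treat the output as a starting point for discussion; organizational structure and operational maturity matter as much as the three inputs modeled here.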

**Industry Insight**: Many successful companies (Shopify, Stack Overflow, Basecamp) run large monoliths. Microservices are not always the answer. The "right" architecture depends on team size, complexity, and organizational structure—not just technical considerations.

---

## **7.2 Service Decomposition Strategies**

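Decomposing a monolith into microservices is the most critical and difficult aspect of microservices adoption. Poor decomposition produces a distributed monolith: services that must still be deployed together yet also pay the network and operational costs of distribution, which is the worst of both worlds.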

### **Domain-Driven Design (DDD)**

**Concept**: Decompose services based on business domains (bounded contexts), not technical layers.

**Key Concepts**:
- **Bounded Context**: Explicit boundary within which a domain model exists. Each microservice should align with one bounded context.
- **Aggregate**: Cluster of domain objects that can be treated as a single unit (transaction boundary).
- **Entity**: Object with unique identity (User, Order).
- **Value Object**: Immutable object without identity (Money, Address).

**Decomposition Strategy**:
```
E-commerce Domain:
┌─────────────────────────────────────────────────────────────┐
│                     Business Domains                         │
├─────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │   User      │  │   Catalog   │  │   Pricing   │          │
│  │  Context    │  │  Context    │  │  Context    │          │
│  │             │  │             │  │             │          │
│  │ • User      │  │ • Product   │  │ • Discounts │          │
│  │ • Profile   │  │ • Category  │  │ • Promos    │          │
│  │ • Auth      │  │ • Inventory │  │ • Taxes     │          │
│  │ • Preferences│ │ • Search    │  │             │          │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
│                                                               │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │   Order     │  │   Payment   │  │  Shipping   │          │
│  │  Context    │  │  Context    │  │  Context    │          │
│  │             │  │             │  │             │          │
│  │ • Cart      │  │ • Cards     │  │ • Carriers  │          │
│  │ • Checkout  │  │ • Fraud     │  │ • Tracking  │          │
│  │ • History   │  │ • Refunds   │  │ • Labels    │          │
│  │             │  │ • Billing   │  │ • Rates     │          │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
│                                                               │
│  ┌─────────────┐  ┌─────────────┐                           │
│  │Notification │  │  Analytics  │                           │
│  │  Context    │  │  Context    │                           │
│  │             │  │             │                           │
│  │ • Email     │  │ • Reports   │                           │
│  │ • SMS       │  │ • Dashboards│                           │
│  │ • Push      │  │ • Metrics   │                           │
│  └─────────────┘  └─────────────┘                           │
│                                                               │
└─────────────────────────────────────────────────────────────┘

Each context becomes a microservice with its own database.
```

**Implementation Example**:
```python
# User Service (Bounded Context: User Management)
# user_service/models.py
from dataclasses import dataclass

class User:
    """Entity: identified by user_id, root of the User aggregate"""
    def __init__(self, user_id, email, name):
        self.user_id = user_id
        self.email = email
        self.name = name
        self.addresses = []  # List of Address value objects
        self.preferences = UserPreferences()  # Owned by this aggregate
    
    def add_address(self, street, city, zip_code):
        address = Address(street, city, zip_code)  # Value Object
        self.addresses.append(address)
    
    def update_preferences(self, preferences):
        # Value objects are replaced as a whole, never mutated in place
        self.preferences = preferences

@dataclass(frozen=True)
class Address:
    """Value Object - no identity, immutable, compared by value"""
    # frozen=True blocks attribute assignment; the dataclass generates __eq__
    street: str
    city: str
    zip_code: str

@dataclass(frozen=True)
class UserPreferences:
    """Value Object - the fields below are illustrative placeholders"""
    language: str = "en"
    marketing_emails: bool = False

# Order Service (Bounded Context: Order Management)
# order_service/models.py
from enum import Enum

class OrderStatus(Enum):
    PENDING = "pending"

class Order:
    def __init__(self, order_id, user_id):  # Reference to User by ID only
        self.order_id = order_id
        self.user_id = user_id  # ID reference only: no cross-service foreign key, no User object
        self.items = []
        self.total = 0  # Kept in sync by _recalculate_total()
        self.status = OrderStatus.PENDING
        self.shipping_address = None  # Snapshot of address at order time
    
    def add_item(self, product_id, quantity, price):
        item = OrderItem(product_id, quantity, price)
        self.items.append(item)
        self._recalculate_total()
    
    def _recalculate_total(self):
        self.total = sum(item.subtotal for item in self.items)

class OrderItem:
    def __init__(self, product_id, quantity, unit_price):
        self.product_id = product_id
        self.quantity = quantity
        self.unit_price = unit_price
        self.subtotal = quantity * unit_price

# Note: Order service doesn't import User service models.
# It only references users by ID (user_id).
# This maintains bounded context boundaries.
```
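
The Key Concepts list describes an aggregate as a transaction boundary. A minimal sketch of what that means in practice follows, assuming a hypothetical `OrderAggregate` class and `confirm()` step that are not part of the models above: every change goes through the aggregate root, which rejects any change that would break an invariant.

```python
class OrderAggregate:
    """Hypothetical aggregate root: items are changed only through it."""

    def __init__(self, order_id: str):
        self.order_id = order_id
        self._items = []        # (product_id, quantity, unit_price_cents) tuples
        self.confirmed = False

    @property
    def total_cents(self) -> int:
        return sum(qty * price for _, qty, price in self._items)

    def add_item(self, product_id: str, quantity: int, unit_price_cents: int):
        # Invariants are enforced at the aggregate boundary.
        if self.confirmed:
            raise ValueError("cannot modify a confirmed order")
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        self._items.append((product_id, quantity, unit_price_cents))

    def confirm(self):
        if not self._items:
            raise ValueError("cannot confirm an empty order")
        self.confirmed = True
```

Because all invariants live inside one aggregate, one local database transaction in the owning service is enough to keep it consistent.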

**Anti-Patterns to Avoid**:
1. **Database Sharing**: Services sharing databases violate bounded contexts
2. **Distributed Monolith**: Services tightly coupled (changes require coordinated deployment)
3. **Chatty Services**: Excessive inter-service communication (network overhead)
4. **God Service**: One service doing everything (effectively a monolith behind a network interface)

---

### **Decomposition by Business Capability**

**Concept**: Align services with business capabilities (what the business does), not technical functions (how it does it).

**Wrong Approach** (Technical Layering):
```
┌─────────────────────────────────────────────────────────────┐
│                  Technical Layering (WRONG)                  │
├─────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │   UI        │  │   UI        │  │   UI        │          │
│  │  Service    │  │  Service    │  │  Service    │          │
│  │             │  │             │  │             │          │
│  │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │          │
│  │ │User UI  │ │  │ │Order UI │ │  │ │Product  │ │          │
│  │ └─────────┘ │  │ └─────────┘ │  │ │UI       │ │          │
│  └─────────────┘  └─────────────┘  │ └─────────┘ │          │
│                                     └─────────────┘          │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │   API       │  │   API       │  │   API       │          │
│  │  Service    │  │  Service    │  │  Service    │          │
│  │             │  │             │  │             │          │
│  │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │          │
│  │ │User API │ │  │ │Order API│ │  │ │Product  │ │          │
│  │ └─────────┘ │  │ └─────────┘ │  │ │API      │ │          │
│  └─────────────┘  └─────────────┘  │ └─────────┘ │          │
│                                     └─────────────┘          │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │   DB        │  │   DB        │  │   DB        │          │
│  │  Service    │  │  Service    │  │  Service    │          │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
│                                                               │
│  Problem: Changes to "User" require updating 3 services!    │
│           High coupling, coordinated deployments.             │
└─────────────────────────────────────────────────────────────┘
```

**Correct Approach** (Business Capability):
```
┌─────────────────────────────────────────────────────────────┐
│               Business Capability (CORRECT)                  │
├─────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌─────────────────────────────────────────────────────┐    │
│  │                   User Service                       │    │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐           │    │
│  │  │ User UI │  │ User API│  │ User DB │           │    │
│  │  │ (React) │  │ (REST)  │  │(Postgre)│           │    │
│  │  └─────────┘  └─────────┘  └─────────┘           │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                               │
│  ┌─────────────────────────────────────────────────────┐    │
│  │                   Order Service                      │    │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐           │    │
│  │  │Order UI │  │Order API│  │Order DB │           │    │
│  │  │ (React) │  │ (REST)  │  │ (Mongo) │           │    │
│  │  └─────────┘  └─────────┘  └─────────┘           │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                               │
│  ┌─────────────────────────────────────────────────────┐    │
│  │                  Product Service                     │    │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐           │    │
│  │  │Product  │  │Product  │  │Product  │           │    │
│  │  │UI       │  │API      │  │DB       │           │    │
│  │  │ (Vue)   │  │ (gRPC)  │  │(Elastic)│           │    │
│  │  └─────────┘  └─────────┘  └─────────┘           │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                               │
│  Benefit: Changes to "User" only affect User Service.        │
│           Independent deployment.                             │
└─────────────────────────────────────────────────────────────┘
```

---

## **7.3 Inter-Service Communication Patterns**

Microservices must communicate to fulfill business processes. The choice between synchronous and asynchronous communication significantly impacts system behavior.

### **Synchronous Communication (REST/gRPC)**

**REST over HTTP**: Standard for web APIs. Simple and universal, but text-based JSON payloads and per-request overhead usually make it slower than binary RPC protocols.

**gRPC**: High-performance RPC framework using Protocol Buffers and HTTP/2. Binary, efficient, supports streaming.

**Implementation** (REST):
```python
# user_service/client.py
import requests
from typing import Optional

class UserServiceClient:
    def __init__(self, base_url: str):
        self.base_url = base_url
        self.timeout = 5.0
    
    def get_user(self, user_id: str) -> Optional[dict]:
        """Synchronous call to User Service"""
        try:
            response = requests.get(
                f"{self.base_url}/users/{user_id}",
                timeout=self.timeout
            )
            response.raise_for_status()
            return response.json()
        except requests.RequestException as e:
            # Handle failure (circuit breaker, retry, etc.)
            print(f"Failed to get user {user_id}: {e}")
            return None

# order_service/service.py
from user_service.client import UserServiceClient

class OrderService:
    def __init__(self):
        self.user_client = UserServiceClient("http://user-service:8080")
    
    def create_order(self, user_id: str, items: list) -> dict:
        # Synchronous call to User Service
        user = self.user_client.get_user(user_id)
        
        if not user:
            # get_user() returns None both for unknown users and for failed calls
            raise ValueError(f"User {user_id} not found or User Service unavailable")
        
        # Create order with user data
        order = {
            'user_id': user_id,
            'user_email': user['email'],  # Denormalized
            'items': items,
            'status': 'created'
        }
        
        # Save to database...
        return order

# Problem: If User Service is slow (500ms), Order Service is blocked.
# If User Service is down, Order Service cannot create orders.
```

**Implementation** (gRPC):
```protobuf
// user_service.proto
syntax = "proto3";

package userservice;

service UserService {
  rpc GetUser(GetUserRequest) returns (GetUserResponse);
  rpc GetUsers(stream GetUserRequest) returns (stream GetUserResponse);
}

message GetUserRequest {
  string user_id = 1;
}

message GetUserResponse {
  string user_id = 1;
  string email = 2;
  string name = 3;
}
```

```python
# Generated from proto
import grpc
import user_service_pb2
import user_service_pb2_grpc

class UserServiceClient:
    def __init__(self, target: str):
        self.channel = grpc.insecure_channel(target)
        self.stub = user_service_pb2_grpc.UserServiceStub(self.channel)
    
    def get_user(self, user_id: str):
        request = user_service_pb2.GetUserRequest(user_id=user_id)
        try:
            response = self.stub.GetUser(request, timeout=5.0)
            return {
                'user_id': response.user_id,
                'email': response.email,
                'name': response.name
            }
        except grpc.RpcError as e:
            print(f"gRPC error: {e}")
            return None

# Benefits of gRPC:
# - Binary protocol (smaller payloads)
# - HTTP/2 (multiplexing, streaming)
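# - Schema-first contracts (.proto) with generated, typed client and server code
# - Bidirectional streaming
# - Often lower latency and higher throughput than JSON over HTTP/1.1;
#   the size of the gain depends on payloads and workload, so benchmark it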
```

**Challenges of Synchronous Communication**:
1. **Latency**: Network calls add latency (1-100ms depending on distance)
2. **Cascading Failures**: If Service A calls Service B, and B fails, A fails
3. **Blocking**: Threads blocked waiting for responses (resource exhaustion)
4. **Timeouts**: Must handle timeouts and partial failures

**Solutions**:
- **Circuit Breakers**: Fail fast when downstream service is unhealthy
- **Timeouts**: Set aggressive timeouts to free resources quickly
- **Bulkheads**: Isolate resources per service (thread pools)
- **Retry with Backoff**: Retry failed requests with exponential backoff
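
Two of the solutions above, circuit breakers and retry with backoff, can be sketched together in a few lines. This is a simplified illustration, not a production implementation: `CircuitBreaker`, `CircuitOpenError`, and `retry_with_backoff` are hypothetical names, the thresholds are arbitrary, and a real system would add jitter to the delays and usually rely on an established resilience library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("downstream marked unhealthy")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def retry_with_backoff(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry func, doubling the delay after each failed attempt."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # do not hammer a service the breaker marked unhealthy
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

A caller would typically combine them as `retry_with_backoff(lambda: breaker.call(client.get_user, user_id))`, so that transient errors are retried while a tripped breaker fails fast.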

---

### **Asynchronous Communication (Message Queues)**

**Concept**: Services communicate via message broker (Kafka, RabbitMQ). Producer sends message and continues; consumer processes asynchronously.

**Implementation**:
```python
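# user_service/publisher.py
import json
import time

from kafka import KafkaProducer

class UserEventPublisher:
    def __init__(self):
        self.producer = KafkaProducer(
            bootstrap_servers=['kafka:9092'],
            # Keys and values must be bytes on the wire
            key_serializer=lambda k: k.encode('utf-8'),
            value_serializer=lambda v: json.dumps(v).encode('utf-8')
        )
    
    def publish_user_created(self, user_id: str, email: str):
        event = {
            'event_type': 'USER_CREATED',
            'user_id': user_id,
            'email': email,
            'timestamp': time.time()
        }
        
        # Keying by user_id keeps each user's events ordered within one partition
        self.producer.send('user-events', key=user_id, value=event)
        # send() is asynchronous: it buffers the record and returns a future.
        # Call self.producer.flush() before shutdown so buffered events are delivered.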

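# order_service/consumer.py
import json

from kafka import KafkaConsumer

class UserEventConsumer:
    def __init__(self):
        self.consumer = KafkaConsumer(
            'user-events',
            bootstrap_servers=['kafka:9092'],
            group_id='order-service-group',
            value_deserializer=lambda m: json.loads(m.decode('utf-8'))
        )
        # Local read model (user_id -> email); in-memory here for brevity,
        # a real service would persist it in its own database
        self.user_cache = {}
    
    def cache_user(self, user_id: str, email: str):
        # Idempotent: reprocessing a redelivered event yields the same state
        self.user_cache[user_id] = email
    
    def start_consuming(self):
        for message in self.consumer:
            event = message.value
            
            if event['event_type'] == 'USER_CREATED':
                # Cache user data locally (denormalize)
                self.cache_user(event['user_id'], event['email'])
                print(f"Cached user {event['user_id']}")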

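# Saga Pattern for Distributed Transactions
class OrderSaga:
    """Start a choreographed saga: services coordinate by reacting to events"""
    
    def __init__(self, event_publisher, order_repository):
        # Collaborators are injected; their interfaces are assumed for this sketch
        self.event_publisher = event_publisher
        self.order_repository = order_repository
    
    def create_order_local(self, user_id: str, items: list) -> str:
        # Local transaction: persist the order as 'pending' and return its ID
        # (save_pending is a hypothetical repository method)
        return self.order_repository.save_pending(user_id, items)
    
    def create_order(self, user_id: str, items: list):
        # Step 1: Create order (local)
        order_id = self.create_order_local(user_id, items)
        
        # Step 2: Publish event (asynchronous)
        self.event_publisher.publish_order_created(order_id, user_id, items)
        
        # Other services react to the event and publish their own:
        # - Inventory Service: reserves inventory
        # - Payment Service: processes payment
        # - Shipping Service: prepares shipment
        
        # Choreography has no central coordinator: a failing service publishes
        # a failure event, and earlier steps run compensating actions
        # (e.g. release reserved inventory, cancel the order)
        return {'order_id': order_id, 'status': 'pending'}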

# Benefits:
# - Decoupled services (no direct calls)
# - Resilient (messages queued if consumer down)
# - Scalable (add consumers to handle load)
# - Eventual consistency acceptable for many use cases
```
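
To make the failure path of a choreographed saga concrete, the sketch below replaces Kafka with a tiny in-memory bus so it can run on its own. Everything here is hypothetical (the `InMemoryBus` class, the event names, the `wire_saga` helper); it only illustrates how a payment failure triggers a compensating action that releases reserved inventory.

```python
from collections import defaultdict


class InMemoryBus:
    """Stand-in for a message broker: delivers events synchronously."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.log = []  # event types in publication order

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.log.append(event_type)
        for handler in self.handlers[event_type]:
            handler(payload)


def wire_saga(bus, orders, stock, payment_ok):
    """Register each service's reaction to the events it cares about."""

    def reserve_inventory(order):       # Inventory Service
        if stock.get(order["sku"], 0) >= order["qty"]:
            stock[order["sku"]] -= order["qty"]
            bus.publish("INVENTORY_RESERVED", order)
        else:
            bus.publish("ORDER_REJECTED", order)

    def charge_payment(order):          # Payment Service
        outcome = "PAYMENT_COMPLETED" if payment_ok else "PAYMENT_FAILED"
        bus.publish(outcome, order)

    def release_inventory(order):       # Compensation in Inventory Service
        stock[order["sku"]] += order["qty"]
        bus.publish("ORDER_REJECTED", order)

    def set_status(status):             # Order Service
        def handler(order):
            orders[order["order_id"]] = status
        return handler

    bus.subscribe("ORDER_CREATED", reserve_inventory)
    bus.subscribe("INVENTORY_RESERVED", charge_payment)
    bus.subscribe("PAYMENT_FAILED", release_inventory)
    bus.subscribe("PAYMENT_COMPLETED", set_status("confirmed"))
    bus.subscribe("ORDER_REJECTED", set_status("cancelled"))
```

With a real broker the handlers would run in separate processes and at different times, which is why each step must be idempotent and why the order stays `pending` until a terminal event arrives.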

---

## **7.4 Service Discovery**

In dynamic microservices environments (Kubernetes, AWS ECS), service instances come and go. Service discovery enables services to find each other without hardcoded IP addresses.

### **Client-Side Discovery**

**Concept**: Client queries service registry (Consul, Eureka, etcd) to find available instances, then makes request directly.

**Architecture**:
```
┌─────────────┐         ┌─────────────────┐         ┌─────────────┐
│   Client    │────────>│ Service Registry│         │  Service A  │
│             │         │   (Consul)      │         │  Instance 1 │
│  1. Query   │         │                 │         │  10.0.1.10  │
│     "Where  │         │  ┌───────────┐  │         │             │
│     is A?"  │         │  │ Service A │  │<────────│             │
│             │         │  │ - 10.0.1.10│ │         │             │
│  2. Cache   │         │  │ - 10.0.1.11│ │         │             │
│     result  │         │  └───────────┘  │         │             │
│             │         │                 │         │             │
│  3. Request │────────────────────────────────────>│             │
│     directly│         │                 │         │             │
└─────────────┘         └─────────────────┘         └─────────────┘
```

**Implementation** (Consul):
```python
import random

import consul
import requests

class ServiceDiscovery:
    def __init__(self):
        self.consul = consul.Consul(host='consul-server')
        self.cached_services = {}
    
    def discover_service(self, service_name: str) -> str:
        """Discover healthy instance of service"""
        # Query Consul for instances whose health checks are passing
        index, services = self.consul.health.service(service_name, passing=True)
        
        if not services:
            raise LookupError(f"No healthy instances of {service_name}")
        
        # Select an instance at random (round-robin is a common alternative)
        service = random.choice(services)
        
        # Service.Address is empty when the service registered without one;
        # fall back to the address of the node it runs on
        address = service['Service']['Address'] or service['Node']['Address']
        port = service['Service']['Port']
        return f"http://{address}:{port}"
    
    def call_service(self, service_name: str, endpoint: str):
        """Discover and call service"""
        # Discovery (with caching)
        if service_name not in self.cached_services:
            self.cached_services[service_name] = self.discover_service(service_name)
        
        base_url = self.cached_services[service_name]
        
        try:
            response = requests.get(f"{base_url}{endpoint}", timeout=5.0)
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            # Drop the cached address so the next call re-queries Consul,
            # then let the caller decide whether to retry
            self.cached_services.pop(service_name, None)
            raise

# Service Registration (when service starts)
def register_service(service_name: str, port: int):
    c = consul.Consul()
    
    # Register service with health check
    c.agent.service.register(
        service_name,
        service_id=f"{service_name}-{port}",
        port=port,
        check=consul.Check.http(
            f"http://localhost:{port}/health",
            interval="10s"
        )
    )

# Usage
sd = ServiceDiscovery()
user_data = sd.call_service('user-service', '/users/123')
```

**Advantages**:
- **No Extra Hop**: Direct connection to service (lower latency)
- **Client Control**: Client chooses load balancing algorithm
- **Simple Infrastructure**: Just a registry, no proxy needed

**Disadvantages**:
- **Client Complexity**: Must implement discovery logic
- **Language Binding**: Need client libraries for each language
- **Caching Issues**: Stale cache if service goes down
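
The stale-cache problem is usually contained by giving cached lookups a time-to-live and by invalidating them on connection errors. The sketch below shows one way to do this with round-robin selection. It is registry-agnostic: `CachingResolver` is a hypothetical class, and `lookup` is any callable that returns the healthy base URLs for a service name (for example, a thin wrapper around a Consul query).

```python
import itertools
import time


class CachingResolver:
    """Client-side discovery with a TTL cache and round-robin selection."""

    def __init__(self, lookup, ttl=30.0, clock=time.monotonic):
        self.lookup = lookup      # callable: service name -> list of base URLs
        self.ttl = ttl
        self.clock = clock
        self._cache = {}          # name -> (expires_at, round-robin iterator)

    def resolve(self, name: str) -> str:
        entry = self._cache.get(name)
        if entry is None or self.clock() >= entry[0]:
            # Cache miss or expired entry: ask the registry again.
            instances = self.lookup(name)
            if not instances:
                raise LookupError(f"No healthy instances of {name}")
            entry = (self.clock() + self.ttl, itertools.cycle(instances))
            self._cache[name] = entry
        return next(entry[1])

    def invalidate(self, name: str):
        """Drop the cached entry, e.g. after a connection error."""
        self._cache.pop(name, None)
```

A shorter TTL reduces the window in which a dead instance can be selected, at the cost of more registry queries.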

---

### **Server-Side Discovery**

**Concept**: Client sends request to load balancer (reverse proxy), which queries registry and forwards to healthy instance.

**Architecture**:
```
┌─────────────┐         ┌─────────────────┐         ┌─────────────┐
│   Client    │────────>│  Load Balancer  │         │  Service A  │
│             │         │   (NGINX/       │         │  Instance 1 │
│  Request    │         │    Traefik)     │         │  10.0.1.10  │
│  to A       │         │                 │         │             │
│             │         │  1. Query       │<────────│             │
│             │         │     Registry    │         │             │
│             │         │                 │         │             │
│             │         │  2. Forward     │────────>│             │
│             │         │     to healthy  │         │             │
└─────────────┘         └─────────────────┘         └─────────────┘
```

**Implementation** (Kubernetes DNS):
```yaml
# Kubernetes automatically provides DNS-based service discovery
apiVersion: v1
kind: Service
metadata:
  name: user-service
spec:
  selector:
    app: user-service
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP

# Other services can reach user-service at:
# http://user-service.default.svc.cluster.local
# Or simply: http://user-service (within same namespace)
```

**Implementation** (AWS Cloud Map):
```python
import boto3

# Server-side discovery with AWS Cloud Map
servicediscovery = boto3.client('servicediscovery')

# Discover instances, asking Cloud Map to return only healthy ones
response = servicediscovery.discover_instances(
    NamespaceName='production',
    ServiceName='user-service',
    MaxResults=10,
    HealthStatus='HEALTHY'
)

# Each returned instance carries its attributes, including the IPv4 address
for instance in response['Instances']:
    print(f"Instance: {instance['Attributes']['AWS_INSTANCE_IPV4']}")
```

**Advantages**:
- **Client Simplicity**: Client just calls a URL (no discovery logic)
- **Centralized Control**: Load balancer handles routing, retries, circuit breaking
- **Language Agnostic**: Works with any HTTP client

**Disadvantages**:
- **Extra Hop**: Request goes through load balancer (added latency)
- **Single Point of Failure**: If the load balancer fails and is not replicated, every service behind it becomes unreachable
- **Infrastructure Complexity**: Must manage load balancer cluster

**Modern Approach**: Many cloud-native systems use server-side discovery: Kubernetes Services for internal traffic, and Kubernetes Ingress or cloud load balancers (AWS ALB, Google Cloud Load Balancing) for external traffic.

---

## **7.5 Configuration Management**

Microservices require externalized configuration (not hardcoded) to support different environments (dev, staging, prod) without code changes.

### **Externalized Configuration**

**12-Factor App Principle**: Store config in environment variables or external config service, never in code.

**Implementation** (Environment Variables):
```python
import os
from dataclasses import dataclass

@dataclass
class Config:
    database_url: str
    redis_url: str
    kafka_brokers: list
    log_level: str
    feature_flags: dict

def load_config() -> Config:
    """Load configuration from environment variables"""
    return Config(
        database_url=os.getenv('DATABASE_URL', 'postgresql://localhost/db'),
        redis_url=os.getenv('REDIS_URL', 'redis://localhost:6379'),
        kafka_brokers=os.getenv('KAFKA_BROKERS', 'localhost:9092').split(','),
        log_level=os.getenv('LOG_LEVEL', 'INFO'),
        feature_flags={
            'new_checkout': os.getenv('FEATURE_NEW_CHECKOUT', 'false').lower() == 'true',
            'beta_feature': os.getenv('FEATURE_BETA', 'false').lower() == 'true'
        }
    )

# Usage
config = load_config()
print(f"Connecting to database: {config.database_url}")
```
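
Defaults such as `postgresql://localhost/db` are convenient in development, but in production they can hide a missing setting until the first failed connection. A minimal sketch of a fail-fast alternative follows; `require_env`, `parse_bool`, and `MissingConfigError` are hypothetical helper names, and the sample values are made up.

```python
import os
from typing import Mapping, Optional


class MissingConfigError(RuntimeError):
    """Raised at startup when a required setting is absent."""


def require_env(name: str, environ: Optional[Mapping[str, str]] = None) -> str:
    """Return a required setting or fail fast with a clear message."""
    env = os.environ if environ is None else environ
    value = env.get(name, "").strip()
    if not value:
        raise MissingConfigError(f"required environment variable {name} is not set")
    return value


def parse_bool(raw: str) -> bool:
    """Interpret common truthy spellings used in environment variables."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# Usage with an explicit mapping so the example does not depend on the real environment
sample_env = {
    "DATABASE_URL": "postgresql://db.internal/orders",
    "FEATURE_NEW_CHECKOUT": "TRUE",
}
database_url = require_env("DATABASE_URL", sample_env)
new_checkout = parse_bool(sample_env.get("FEATURE_NEW_CHECKOUT", "false"))

try:
    require_env("REDIS_URL", sample_env)
    redis_error = None
except MissingConfigError as exc:
    redis_error = str(exc)
```

Whether a setting should be required or defaulted is a per-setting judgment call. Connection strings are usually safer as required values, while log levels and feature flags tolerate defaults.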

**Implementation** (Consul/Vault):
```python
import os
from typing import Optional

import consul
import hvac  # HashiCorp Vault client

class ConfigurationManager:
    def __init__(self):
        self.consul = consul.Consul(host='consul-server')
        self.vault = hvac.Client(url='https://vault-server:8200')
        self.vault.token = os.getenv('VAULT_TOKEN')
        self.cache = {}
    
    def get_config(self, key: str) -> Optional[str]:
        """Get configuration from Consul (cached after the first read)"""
        if key in self.cache:
            return self.cache[key]
        
        index, data = self.consul.kv.get(f"myapp/config/{key}")
        # data is None when the key is missing; Value is None for an empty key
        if data and data['Value'] is not None:
            value = data['Value'].decode('utf-8')
            self.cache[key] = value
            return value
        return None
    
    def get_secret(self, path: str) -> dict:
        """Get secrets from Vault"""
        secret = self.vault.secrets.kv.v2.read_secret_version(path=path)
        return secret['data']['data']

# Usage
config_mgr = ConfigurationManager()
db_password = config_mgr.get_secret('database/credentials')['password']
feature_flag = config_mgr.get_config('features/new_ui')
```

---

## **7.6 Data Management in Microservices**

The database-per-service pattern is crucial for microservices autonomy, but introduces data consistency challenges.

### **Database Per Service**

**Pattern**: Each microservice owns its own database. Other services access that data only through the owning service's API, never by querying the database directly.

**Architecture**:
```
┌─────────────┐         ┌─────────────┐         ┌─────────────┐
│   Order     │         │   User      │         │  Inventory  │
│   Service   │         │   Service   │         │   Service   │
│             │         │             │         │             │
│ ┌─────────┐ │         │ ┌─────────┐ │         │ ┌─────────┐ │
│ │Order API│ │         │ │User API │ │         │ │Inv API  │ │
│ └────┬────┘ │         │ └────┬────┘ │         │ └────┬────┘ │
│      │      │         │      │      │         │      │      │
│ ┌────▼────┐ │         │ ┌────▼────┐ │         │ ┌────▼────┐ │
│ │Order DB │ │         │ │User DB  │ │         │ │Inv DB   │ │
│ │(Mongo)  │ │         │ │(Postgre)│ │         │ │(Redis)  │ │
│ └─────────┘ │         │ └─────────┘ │         │ └─────────┘ │
└─────────────┘         └─────────────┘         └─────────────┘
       │                        │                        │
       └────────────────────────┴────────────────────────┘
                                  │
                         ┌────────▼────────┐
                         │  Shared Nothing │
                         │  (No direct DB  │
                         │   access)       │
                         └─────────────────┘

Enforcement:
- Order Service cannot query User DB directly
- Order Service must call User Service API
- This maintains service boundaries and encapsulation
```

**Benefits**:
- **Loose Coupling**: Services don't depend on each other's database schemas
- **Technology Diversity**: Each service uses best database for its needs
- **Independent Scaling**: Scale databases independently based on load
- **Failure Isolation**: Database failure affects only one service

**Challenges**:
- **Data Consistency**: No ACID transactions across services
- **Query Complexity**: Joins across services require API composition
- **Data Duplication**: Data may be duplicated across services (denormalization)
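
To make the query-complexity point concrete, the sketch below joins orders with user data in application code, the way an API composer would. It is an illustration under stated assumptions: `compose_order_summaries` and `fake_fetch_users` are hypothetical names, and the stub stands in for a batched call to the User Service API.

```python
from typing import Callable, Dict, List


def compose_order_summaries(
    orders: List[dict],
    fetch_users: Callable[[List[str]], Dict[str, dict]],
) -> List[dict]:
    """Join orders with user data in memory instead of with a SQL JOIN."""
    # Collect distinct user IDs so the User Service is called once, not per order
    user_ids = sorted({order["user_id"] for order in orders})
    users = fetch_users(user_ids)

    summaries = []
    for order in orders:
        user = users.get(order["user_id"])
        summaries.append({
            "order_id": order["order_id"],
            "total": order["total"],
            # The user may be missing if the two services are briefly out of sync
            "user_name": user["name"] if user else None,
        })
    return summaries


# Usage with stubbed data standing in for the two service APIs
def fake_fetch_users(user_ids: List[str]) -> Dict[str, dict]:
    directory = {"u1": {"name": "Ada"}, "u2": {"name": "Linus"}}
    return {uid: directory[uid] for uid in user_ids if uid in directory}

orders = [
    {"order_id": "o1", "user_id": "u1", "total": 40},
    {"order_id": "o2", "user_id": "u3", "total": 15},
    {"order_id": "o3", "user_id": "u1", "total": 5},
]
summaries = compose_order_summaries(orders, fake_fetch_users)
```

Batching the user lookup avoids one network call per order, and the `None` branch shows that the composer has to tolerate references the other service cannot resolve.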

---

### **Saga Pattern for Distributed Transactions**

**Concept**: Manage distributed transactions across multiple services using a sequence of local transactions. Each local transaction updates the database and publishes an event or message.

**Types**:
1. **Choreography**: Services react to events (decentralized)
2. **Orchestration**: Central coordinator manages saga (centralized)

**Choreography Example**:
```python
# Order Service
class OrderService:
    def create_order(self, user_id, items):
        # Step 1: Create order
        order = self.db.orders.insert({'user_id': user_id, 'status': 'pending'})
        
        # Step 2: Publish event
        self.kafka.send('order-created', {
            'order_id': order.id,
            'user_id': user_id,
            'items': items
        })
        return order

# Inventory Service (listens to order-created)
class InventoryService:
    def handle_order_created(self, event):
        # Step 3: Reserve inventory
        for item in event['items']:
            self.db.inventory.update(
                {'product_id': item['id']},
                {'$inc': {'quantity': -item['qty']}}
            )
        
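        # Step 4: Publish the outcome. This sketch always reports success;
        # a real handler would publish success=False when stock is insufficient
        self.kafka.send('inventory-reserved', {
            'order_id': event['order_id'],
            'success': True
        })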

# Payment Service (listens to inventory-reserved)
class PaymentService:
    def handle_inventory_reserved(self, event):
        # Only charge when the reservation actually succeeded
        if not event.get('success'):
            return
        
        # Step 5: Process payment
        payment = self.process_payment(event['order_id'])
        
        if payment.success:
            self.kafka.send('payment-processed', {
                'order_id': event['order_id'],
                'status': 'completed'
            })
        else:
            # Compensation: Trigger rollback
            self.kafka.send('payment-failed', {
                'order_id': event['order_id']
            })

# Compensation (rollback on failure)
class InventoryCompensation:
    def handle_payment_failed(self, event):
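        # Restore inventory. get_order is a placeholder for a call to the
        # Order Service API (not a direct read of its database)
        order = get_order(event['order_id'])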
        for item in order.items:
            self.db.inventory.update(
                {'product_id': item['id']},
                {'$inc': {'quantity': item['qty']}}  # Restore
            )
```
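
The event flow above can be exercised end to end without a broker. The following sketch replaces Kafka with a hypothetical in-memory bus and uses a made-up payment rule to force the failure path, so the names (`InMemoryBus`, `PAYMENT_LIMIT`, the handler functions) are illustrative assumptions rather than part of the services shown earlier.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class InMemoryBus:
    """Stand-in for a message broker: delivers events synchronously."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)
        self.log: List[str] = []

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        self.log.append(topic)
        for handler in list(self._handlers[topic]):
            handler(event)


bus = InMemoryBus()
stock = {"widget": 5}
order_status: Dict[str, str] = {}
PAYMENT_LIMIT = 100  # made-up rule: payments above this amount are declined


def on_order_created(event: dict) -> None:
    # Inventory service: reserve stock, then announce the result
    stock[event["sku"]] -= event["qty"]
    bus.publish("inventory-reserved", event)


def on_inventory_reserved(event: dict) -> None:
    # Payment service: charge, or announce failure so others can compensate
    topic = "payment-processed" if event["amount"] <= PAYMENT_LIMIT else "payment-failed"
    bus.publish(topic, event)


def on_payment_processed(event: dict) -> None:
    order_status[event["order_id"]] = "completed"


def release_stock(event: dict) -> None:
    # Inventory compensation: undo the reservation
    stock[event["sku"]] += event["qty"]


def cancel_order(event: dict) -> None:
    # Order compensation: mark the order as cancelled
    order_status[event["order_id"]] = "cancelled"


bus.subscribe("order-created", on_order_created)
bus.subscribe("inventory-reserved", on_inventory_reserved)
bus.subscribe("payment-processed", on_payment_processed)
bus.subscribe("payment-failed", release_stock)
bus.subscribe("payment-failed", cancel_order)

# A payment above the limit fails, so stock is restored and the order is cancelled
order_status["o1"] = "pending"
bus.publish("order-created", {"order_id": "o1", "sku": "widget", "qty": 2, "amount": 250})
```

No component knows the whole workflow: each handler reacts to one event and emits the next, which is what distinguishes choreography from orchestration. A real broker delivers asynchronously and may redeliver, so production handlers also need to be idempotent.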

**Orchestration Example** (with Saga Coordinator):
```python
class OrderSagaOrchestrator:
    def execute_create_order_saga(self, order_data):
        saga = Saga()
        
        try:
            # Step 1: Create Order
            saga.add_step(
                action=lambda: order_service.create_order(order_data),
                compensation=lambda: order_service.cancel_order(order_data['id'])
            )
            
            # Step 2: Reserve Inventory
            saga.add_step(
                action=lambda: inventory_service.reserve(order_data['items']),
                compensation=lambda: inventory_service.release(order_data['items'])
            )
            
            # Step 3: Process Payment
            saga.add_step(
                action=lambda: payment_service.charge(order_data['total']),
                compensation=lambda: payment_service.refund(order_data['total'])
            )
            
            # Execute all steps
            saga.execute()
            
        except SagaFailure as e:
            # Compensation automatically triggered
            print(f"Saga failed: {e}")
            raise

class SagaFailure(Exception):
    """Raised when a saga step fails and completed steps have been compensated."""

class Saga:
    def __init__(self):
        self.steps = []
        self.completed_steps = []
    
    def add_step(self, action, compensation):
        self.steps.append({'action': action, 'compensation': compensation})
    
    def execute(self):
        for step in self.steps:
            try:
                step['action']()
                self.completed_steps.append(step)
            except Exception as e:
                # Compensate completed steps in reverse order
                self.compensate()
                raise SagaFailure(str(e)) from e
    
    def compensate(self):
        for step in reversed(self.completed_steps):
            try:
                step['compensation']()
            except Exception as e:
                # Log compensation failure (requires manual intervention)
                print(f"Compensation failed: {e}")
```

---

## **7.7 The Strangler Fig Pattern**

**Concept**: Gradually migrate from monolith to microservices by incrementally replacing functionality, rather than big-bang rewrite.

**How It Works**:
```
Phase 1: Monolith Only
┌─────────────────────────────────────┐
│           Monolith                   │
│  ┌─────────┐ ┌─────────┐ ┌────────┐ │
│  │  User   │ │  Order  │ │Payment │ │
│  │         │ │         │ │        │ │
│  └─────────┘ └─────────┘ └────────┘ │
└─────────────────────────────────────┘

Phase 2: Add API Gateway (Strangler)
┌─────────────────────────────────────┐
│         API Gateway                  │
│         (Strangler)                  │
└────────┬────────────┬──────────────┘
         │            │
    ┌────┴────┐   ┌───┴───────────────┐
    │Monolith │   │  New User Service │
    │ (User   │   │  (Extracted)      │
    │  still  │   │                   │
    │  there) │   └───────────────────┘
    └─────────┘

Phase 3: Extract Order Service
┌─────────────────────────────────────┐
│         API Gateway                  │
└────────┬────────┬────────┬──────────┘
         │        │        │
    ┌────┴───┐ ┌─┴─────┐ ┌┴──────────┐
    │Monolith│ │ User  │ │  Order    │
    │(Payment│ │Service│ │  Service  │
    │ only)  │ │       │ │           │
    └────────┘ └───────┘ └───────────┘

Phase 4: Extract Payment Service (Monolith retired)
┌─────────────────────────────────────┐
│         API Gateway                  │
└────────┬────────┬────────┬──────────┘
         │        │        │
         ▼        ▼        ▼
    ┌────────┐ ┌────────┐ ┌────────┐
    │  User  │ │  Order │ │ Payment│
    │Service │ │Service │ │Service │
    └────────┘ └────────┘ └────────┘
```

**Implementation**:
```python
import requests

# API Gateway Routing Logic (Strangler)
class StranglerRouter:
    def __init__(self):
        self.migrated_features = {
            'user': True,   # Migrated to microservice
            'order': True,  # Migrated to microservice
            'payment': False  # Still in monolith
        }
        
        self.microservices = {
            'user': 'http://user-service:8080',
            'order': 'http://order-service:8080'
        }
        
        self.monolith_url = 'http://monolith:8080'
    
    def route(self, request_path: str):
        """Route request to appropriate service"""
        if request_path.startswith('/api/users'):
            if self.migrated_features['user']:
                return self._proxy_to_microservice('user', request_path)
            else:
                return self._proxy_to_monolith(request_path)
        
        elif request_path.startswith('/api/orders'):
            if self.migrated_features['order']:
                return self._proxy_to_microservice('order', request_path)
            else:
                return self._proxy_to_monolith(request_path)
        
        else:
            # Default to monolith
            return self._proxy_to_monolith(request_path)
    
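    def _proxy_to_microservice(self, service: str, path: str):
        # Simplified: forwards as GET only; a real gateway also forwards
        # the method, headers, and body
        target = self.microservices[service] + path
        return requests.get(target, timeout=5.0)
    
    def _proxy_to_monolith(self, path: str):
        target = self.monolith_url + path
        return requests.get(target, timeout=5.0)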

# Gradually migrate features by updating migrated_features dict
# When all features migrated, retire monolith
```

**Benefits**:
- **Risk Reduction**: Gradual migration vs. big-bang rewrite
- **Incremental Value**: Deliver value incrementally
- **Rollback Capability**: Can revert to monolith if issues arise
- **Learning**: Learn about microservices gradually
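
The rollback benefit is easiest to see when routing is driven by a table rather than a chain of conditionals. Below is a minimal sketch of that idea; `ROUTES`, `resolve_target`, and the `/api/payments` prefix are assumptions for illustration, and the function only computes the upstream URL without performing the proxying.

```python
from typing import Dict, Optional

MONOLITH_URL = "http://monolith:8080"

# Route table: path prefix -> microservice base URL, or None while still in the monolith
ROUTES: Dict[str, Optional[str]] = {
    "/api/users": "http://user-service:8080",
    "/api/orders": "http://order-service:8080",
    "/api/payments": None,  # not migrated yet
}


def resolve_target(path: str, routes: Dict[str, Optional[str]] = ROUTES) -> str:
    """Return the full upstream URL for a request path."""
    # Prefer the longest matching prefix so nested routes can be migrated separately
    for prefix in sorted(routes, key=len, reverse=True):
        if path == prefix or path.startswith(prefix + "/"):
            base = routes[prefix] or MONOLITH_URL
            return base + path
    # Unknown paths stay on the monolith by default
    return MONOLITH_URL + path


users_target = resolve_target("/api/users/123")
payments_target = resolve_target("/api/payments/9")
reports_target = resolve_target("/api/reports/daily")
lookalike_target = resolve_target("/api/usersettings")
```

Reverting a migrated feature then amounts to setting its entry back to `None`. Matching on `prefix + "/"` keeps look-alike paths such as `/api/usersettings` from being captured by the `/api/users` route.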

---

## **7.8 Challenges of Distributed Systems**

Microservices introduce significant complexity. Understanding these challenges is essential for successful implementation.

### **The Fallacies of Distributed Computing**

**Common misconceptions that lead to failures**:
1. **The network is reliable** (it's not—packets get lost, services go down)
2. **Latency is zero** (network calls take milliseconds to seconds)
3. **Bandwidth is infinite** (network has limits)
4. **The network is secure** (must secure all service-to-service communication)
5. **Topology doesn't change** (services move, IPs change)
6. **There is one administrator** (multiple teams, conflicting changes)
7. **Transport cost is zero** (serialization/deserialization costs CPU)
8. **The network is homogeneous** (different networks, firewalls, protocols)
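
The first two fallacies are commonly countered with timeouts and bounded retries. The sketch below shows one such approach, exponential backoff with jitter; `call_with_retries` and its parameters are hypothetical, and the flaky operation simulates a network failure rather than making a real call.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            # Delay doubles each attempt; jitter avoids synchronized retries
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay))
    raise ValueError("attempts must be at least 1")


# Usage: a flaky operation that fails twice before succeeding
calls = {"count": 0}

def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"

delays = []
result = call_with_retries(flaky, attempts=3, sleep=delays.append)
```

Retries are only safe for idempotent operations; retrying a non-idempotent call such as a payment charge can apply it twice unless the receiver deduplicates requests.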

### **Observability**

**Challenge**: Debugging distributed systems requires correlating logs, metrics, and traces across dozens of services.

**Solution**: Distributed Tracing (OpenTelemetry, Jaeger)

**Implementation**:
```python
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)

def process_order(order_id):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        
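        # Call user service. The child span nests automatically in-process;
        # the context reaches the remote service only if the HTTP client is
        # instrumented or the headers are populated with inject()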
        with tracer.start_as_current_span("call_user_service"):
            user = get_user(order_id)
        
        # Call inventory service
        with tracer.start_as_current_span("call_inventory_service"):
            inventory = check_inventory(order_id)
        
        # Span shows entire call tree with timing
        return finalize_order(user, inventory)

# Result in Jaeger UI:
# process_order (100ms)
#   ├── call_user_service (30ms)
#   └── call_inventory_service (70ms)
```
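
Across process boundaries, the trace context travels in request headers. The standard-library sketch below illustrates the idea using the W3C `traceparent` header layout (version, trace ID, parent span ID, flags). It is a simplified illustration that handles only version `00`, and the helper names are hypothetical; in practice the OpenTelemetry propagators do this work.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id() -> str:
    return secrets.token_hex(8)   # 16 hex characters


def outgoing_headers(trace_id: str, span_id: str, sampled: bool = True) -> Dict[str, str]:
    """Build the header a caller attaches to a downstream request."""
    flags = "01" if sampled else "00"
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}


def parse_traceparent(headers: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (trace_id, parent_span_id) from incoming headers, or None if absent or invalid."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    return match.group(1), match.group(2)


# Caller side (e.g. Order Service) starts a trace and calls downstream
trace_id = secrets.token_hex(16)  # 32 hex characters
caller_span = new_span_id()
headers = outgoing_headers(trace_id, caller_span)

# Callee side (e.g. User Service) continues the same trace with its own span
incoming = parse_traceparent(headers)
callee_span = new_span_id()
```

Because both services record spans under the same trace ID, with the callee's span pointing at the caller's span as its parent, the tracing backend can reassemble the call tree shown above.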

### **Security**

**Challenge**: More services = more attack surfaces. Service-to-service communication must be secured.

**Solution**: mTLS (mutual TLS), Service Mesh (Istio), Zero Trust

**Implementation**:
```yaml
# Istio mTLS configuration
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system  # Istio root namespace (the default one): applies mesh-wide
spec:
  mtls:
    mode: STRICT  # Require mTLS for all services

# Result: All service-to-service communication encrypted
# and authenticated automatically via sidecar proxies
```

---

## **7.9 Key Takeaways**

1. **Start with a monolith, extract when needed**: Don't start with microservices unless you have specific scaling or organizational needs. The Strangler Fig pattern allows evolutionary migration.

2. **Decompose by business domain**: Use Domain-Driven Design to identify bounded contexts. Avoid decomposition by technical layer (UI, API, DB).

3. **Database per service**: Essential for loose coupling, but requires accepting eventual consistency and implementing Saga patterns for distributed transactions.

4. **Prefer async over sync**: Asynchronous communication (message queues) provides better resilience and scalability than synchronous (REST/gRPC), though both have their place.

5. **Invest in observability**: Distributed tracing, centralized logging, and metrics are non-negotiable for microservices success.

6. **Service mesh for complexity**: When inter-service communication becomes complex, use a service mesh (Istio, Linkerd) to handle cross-cutting concerns.

7. **Accept eventual consistency**: During a network partition, a distributed system must choose between strong consistency and availability (CAP theorem). Where availability matters more, design for eventual consistency.

---

## **Chapter Summary**

In this chapter, we explored microservices architecture—the benefits of independent deployment and scaling, and the costs of distributed system complexity. We compared monolithic and microservices architectures, providing a decision framework for choosing between them.

We covered Domain-Driven Design for service decomposition, emphasizing business capabilities over technical layers. We explored inter-service communication patterns (synchronous REST/gRPC vs. asynchronous message queues) and their trade-offs.

Service discovery mechanisms (client-side vs. server-side) enable dynamic service location, while externalized configuration supports environment-specific deployments without code changes.

We examined the database-per-service pattern and the Saga pattern for managing distributed transactions. The Strangler Fig pattern provides a safe migration path from monoliths to microservices.

Finally, we acknowledged the challenges of distributed systems: observability, security, and the fallacies of distributed computing.

**Coming up next**: In Chapter 8, we'll explore Cloud-Native Architecture & Serverless, covering containerization, Kubernetes, serverless functions, and modern deployment patterns.

---

**Exercises**:

1. **Architecture Decision**: Your startup has 5 developers building an MVP e-commerce platform. Would you choose monolith or microservices? Why? What would change your decision?

2. **Service Decomposition**: Decompose this monolithic e-commerce application into microservices using DDD principles:
   - User registration/login
   - Product catalog (browse, search)
   - Shopping cart
   - Checkout process (payment, shipping)
   - Order history
   - Email notifications
   What are the bounded contexts? How do they communicate?

3. **Saga Pattern Design**: Design a Saga for a hotel booking system involving:
   - Booking Service (creates reservation)
   - Payment Service (charges credit card)
   - Hotel Service (confirms room availability)
   - Notification Service (sends confirmation email)
   Draw the choreography diagram and write compensation logic for each step.

4. **Service Discovery Comparison**: Compare client-side vs. server-side discovery for a mobile app backend with 50 microservices. Which would you choose? Why?

5. **Strangler Fig Implementation**: You have a monolithic Java application with these endpoints:
   - `/api/users/*` (high traffic, needs scaling)
   - `/api/reports/*` (low traffic, batch processing)
   - `/api/admin/*` (low traffic, internal use)
   
   Design a strangler fig migration plan to extract the user functionality first. What routing logic would you implement?

---