# Service High Availability with ExaBGP

**Building resilient, self-healing services with BGP-based failover**

> 🔄 **Application-driven high availability** - services control their own routing and failover

---

## Table of Contents

- [Overview](#overview)
- [High Availability Concepts](#high-availability-concepts)
- [Architecture Patterns](#architecture-patterns)
- [Health Check Strategies](#health-check-strategies)
- [Failover Mechanisms](#failover-mechanisms)
- [Load Distribution](#load-distribution)
- [Common HA Scenarios](#common-ha-scenarios)
- [Implementation Examples](#implementation-examples)
- [Best Practices](#best-practices)
- [Monitoring and Alerting](#monitoring-and-alerting)
- [Troubleshooting](#troubleshooting)

---

## Overview

**Service High Availability (HA)** with ExaBGP enables services to automatically announce their availability via BGP and withdraw when unhealthy.

### The Traditional HA Problem

**Without ExaBGP:**
```
Load Balancer (Single Point of Failure)
       ↓
   ┌───┴───┐
   ▼       ▼
Server 1  Server 2

Issues:
- Load balancer is SPOF
- Expensive hardware
- Manual failover configuration
- Limited geographic distribution
```

**With ExaBGP:**
```
No central load balancer
Network routes to healthy instances

Server 1 (healthy) ──→ Announces route ──→ Receives traffic ✅
Server 2 (healthy) ──→ Announces route ──→ Receives traffic ✅
Server 3 (failed)  ──→ Withdraws route ──→ No traffic ❌

Benefits:
- No single point of failure
- Automatic failover (5-15 seconds)
- Geographic distribution
- Cost-effective
```

---

## High Availability Concepts

### Service Availability

**Key metrics:**
- **Uptime**: Percentage of time the service is available
- **MTBF** (Mean Time Between Failures): Average time the service runs between failures
- **MTTR** (Mean Time To Recover): Average time to restore the service after a failure
- **RTO** (Recovery Time Objective): Maximum acceptable downtime
- **RPO** (Recovery Point Objective): Maximum acceptable data loss

**HA Formula:**
```
Availability = MTBF / (MTBF + MTTR)

Example:
MTBF = 720 hours (30 days)
MTTR = 0.25 hours (15 minutes)
Availability = 720 / (720 + 0.25) = 99.97%
```
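The formula above can be checked with a few lines of Python. This is a minimal sketch; the function names `availability` and `downtime_per_year_hours` are illustrative and not part of ExaBGP.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def availability(mtbf_hours, mttr_hours):
    """Fraction of time the service is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def downtime_per_year_hours(avail):
    """Expected downtime per year implied by an availability fraction."""
    return (1 - avail) * HOURS_PER_YEAR

a = availability(720, 0.25)
print(f"Availability: {a:.4%}")
print(f"Downtime/year: {downtime_per_year_hours(a):.2f} hours")
```

With the example values, a 15-minute recovery once every 30 days works out to roughly three hours of downtime per year, which is why shortening MTTR through automatic failover matters as much as preventing failures.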

---

### ExaBGP HA Advantages

**ExaBGP's Key Advantage: No Single Point of Failure**

```
Traditional Architecture (Load Balancer):
┌─────────────────────────────────────────────┐
│         Load Balancer (HAProxy/NGINX)       │ ← Single Point of Failure
│         (Central Device)                     │ ← Must be in ONE location
└──────────────────┬──────────────────────────┘
                   │
       ┌───────────┼───────────┐
       ▼           ▼           ▼
   Server 1    Server 2    Server 3

Problem: Load balancer MUST be centralized
- Cannot span multiple data centers without becoming SPOF
- Very fast failover (< 1 second) BUT only between backends
- Load balancer itself is single point of failure
- If DC with load balancer fails, entire service fails

ExaBGP Architecture (Distributed):
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ Server 1     │    │ Server 2     │    │ Server 3     │
│ + ExaBGP     │    │ + ExaBGP     │    │ + ExaBGP     │
│ (DC-1)       │    │ (DC-1)       │    │ (DC-2)       │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │
       └───────────────────┴───────────────────┘
                           │
                    BGP Announcements

No single point of failure:
- Each instance independent
- Can span multiple data centers
- DC-1 fails → DC-2 automatically takes over (BGP convergence: 5-15s)
- Slower failover than load balancer, but survives DC failure
```

**Comparison with other HA mechanisms:**
```
Layer 7 Load Balancer (HAProxy/NGINX):
- Very fast failover between backends (< 1 second)
- Works across Layer 3 (no Layer 2 requirement)
- BUT: Centralized architecture (single device)
- BUT: Cannot span data centers without becoming SPOF
- Best for: Fast failover within single location

ExaBGP:
- Slower failover (5-15 seconds BGP convergence)
- Fully distributed (no central device)
- Can span multiple data centers
- Survives entire DC failure
- Best for: Geographic redundancy, eliminating SPOF

Combined Architecture (Best of Both):
  ExaBGP → Distribute traffic across multiple DCs
      ↓
  HAProxy/NGINX in each DC → Fast local failover
      ↓
  Backend servers

DNS-based HA:
- Slow (failover is bounded by the record TTL, typically 30-60 seconds or more)
- Client-side caching issues
- Best used with ExaBGP for multi-region routing
```

**Common Use Case: ExaBGP Provides Resilience TO Load Balancers**
```
ExaBGP announces load balancer VIPs:
- HAProxy-DC1 (healthy) → announces 100.10.0.100 → receives traffic
- HAProxy-DC2 (healthy) → announces 100.10.0.100 → receives traffic
- If HAProxy-DC1 fails → withdraws route → traffic goes to DC2

Result: Fast local failover + geographic redundancy
```

---

## Architecture Patterns

### Pattern 1: Active-Active HA

**Multiple active instances serving traffic simultaneously:**

```
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ Server 1     │    │ Server 2     │    │ Server 3     │
│ Service: UP  │    │ Service: UP  │    │ Service: UP  │
│ ExaBGP: ✅   │    │ ExaBGP: ✅   │    │ ExaBGP: ✅   │
│ Announces    │    │ Announces    │    │ Announces    │
│ 100.10.0.100 │    │ 100.10.0.100 │    │ 100.10.0.100 │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │
       └───────────────────┴───────────────────┘
                           │
                           ▼
                    Traffic distributed
                   (ECMP load balancing)
```

**Characteristics:**
- All instances active
- Traffic distributed via ECMP
- Horizontal scaling (add more servers = more capacity)
- No wasted standby capacity

**Configuration:**

```python
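import sys

SERVICE_IP = "100.10.0.100"  # Each server announces the same IP

def is_service_healthy():
    return True  # Placeholder: substitute a real check (see Health Check Strategies)

if is_service_healthy():
    sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")  # ExaBGP reads stdout
    sys.stdout.flush()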
```

---

### Pattern 2: Active-Passive HA

**One active instance, others on standby:**

```
┌──────────────┐    ┌──────────────┐
│ Primary      │    │ Secondary    │
│ Service: UP  │    │ Service: UP  │
│ ExaBGP: ✅   │    │ ExaBGP: ⏸️   │
│ Announces    │    │ Silent       │
│ MED=100      │    │ (or MED=200) │
└──────┬───────┘    └──────┬───────┘
       │                   │
       └────────┬──────────┘
                ▼
        Traffic to Primary

If Primary fails:
                ┌──────────────┐
                │ Secondary    │
                │ Service: UP  │
                │ ExaBGP: ✅   │
                │ Announces    │
                │ MED=100      │
                └──────┬───────┘
                       ▼
              Traffic to Secondary
```

**Implementation with MED:**

```python
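import sys

def is_service_healthy():
    return True  # Placeholder: substitute a real check (see Health Check Strategies)

# Set per server: 100 on the primary, 200 on the secondary (higher MED = backup)
MED = 100

if is_service_healthy():
    sys.stdout.write(f"announce route 100.10.0.100/32 next-hop self med {MED}\n")
    sys.stdout.flush()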
```
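To see why this yields active-passive behavior, here is a simplified sketch of the router-side decision. It assumes every attribute compared before MED (local preference, AS path length, origin) ties, which holds when all servers announce the same prefix in the same way. The function `best_paths` and the dictionary layout are hypothetical and exist only for illustration.

```python
def best_paths(announcements):
    """Return the servers a router would forward to, given {server: med}.

    Simplified model: all attributes other than MED are assumed equal,
    so the lowest MED wins and ties share traffic via ECMP.
    """
    if not announcements:
        return []
    lowest = min(announcements.values())
    return sorted(s for s, med in announcements.items() if med == lowest)

routes = {"primary": 100, "secondary": 200}
print(best_paths(routes))   # primary only

del routes["primary"]       # primary withdraws after a failed health check
print(best_paths(routes))   # secondary takes over
```

The secondary keeps announcing with MED 200 the whole time, so failover needs no coordination between the two servers: the router simply falls back to the remaining path.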

---

### Pattern 3: Geographic HA

**Active instances in multiple regions:**

```
Region A (US-East)         Region B (EU-West)
┌──────────────┐           ┌──────────────┐
│ Servers 1-3  │           │ Servers 4-6  │
│ ExaBGP       │           │ ExaBGP       │
│ 100.10.0.100 │           │ 100.10.0.100 │
└──────┬───────┘           └──────┬───────┘
       │                          │
       ▼                          ▼
US Clients routed to A     EU Clients routed to B

If Region A fails → all traffic to Region B
If Region B fails → all traffic to Region A
```

**Benefits:**
- Disaster recovery
- Low latency (geo-proximity routing)
- Regulatory compliance (data residency)

---

## Health Check Strategies

> **⭐ RECOMMENDED: Use Built-in Healthcheck Module**
>
> ExaBGP includes a production-ready `exabgp healthcheck` tool that handles all health check patterns below - **no custom scripting required!**
>
> ```bash
> # Zero-code health check with rise/fall dampening, metrics, and execution hooks
> exabgp healthcheck --cmd "curl -sf http://localhost/health" --ip 10.0.0.1/32 --rise 3 --fall 2
> ```
>
> See [Healthcheck Module](Healthcheck-Module) for complete documentation with examples.
>
> **Custom scripts** (shown below) are only needed for complex logic (10% of use cases). For most deployments, **use the built-in module**.

---

### 1. TCP Port Check (Basic)

**Check if port is open:**

```python
import socket

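def tcp_check(host, port, timeout=2):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection sets up the socket; the context manager closes it
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False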
```

**Pros:**
- Simple
- Fast

**Cons:**
- Doesn't verify service functionality
- Port open ≠ service healthy

---

### 2. HTTP Endpoint Check

**Check HTTP /health endpoint:**

```python
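import json
import urllib.error
import urllib.request

def http_health_check(url='http://127.0.0.1/health', timeout=2):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            if response.getcode() != 200:
                return False
            # Optionally check the response body; this matches the JSON
            # returned by the Flask endpoint example
            body = json.loads(response.read().decode('utf-8'))
            return isinstance(body, dict) and body.get('status') == 'healthy'
    except (urllib.error.URLError, OSError, ValueError):
        # Connection failure, timeout, non-2xx status, or malformed JSON
        return False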
```

**Health endpoint example (Flask):**

```python
from flask import Flask, jsonify
import psycopg2

app = Flask(__name__)

@app.route('/health')
def health():
    # Check database connection
    try:
        conn = psycopg2.connect('dbname=mydb')
        conn.close()
        return jsonify({'status': 'healthy'}), 200
    except:
        return jsonify({'status': 'unhealthy'}), 503

if __name__ == '__main__':
    app.run(port=8080)
```

**Pros:**
- Verifies service responds
- Can check dependencies (database, cache, etc.)
- Application-specific logic

---

### 3. Comprehensive Health Check

**Check all critical dependencies:**

```python
import socket
import urllib.request
import psycopg2
import redis

def comprehensive_health_check():
    checks = {
        'web': check_web_server(),
        'database': check_database(),
        'cache': check_redis(),
        'disk_space': check_disk_space(),
        'memory': check_memory(),
    }

    # All checks must pass
    return all(checks.values())

def check_web_server():
    try:
        response = urllib.request.urlopen('http://127.0.0.1:80/health', timeout=2)
        return response.getcode() == 200
    except:
        return False

def check_database():
    try:
        conn = psycopg2.connect(host='127.0.0.1', database='mydb', user='monitor', password='secret')
        cursor = conn.cursor()
        cursor.execute('SELECT 1')
        result = cursor.fetchone()
        conn.close()
        return result[0] == 1
    except:
        return False

def check_redis():
    try:
        r = redis.Redis(host='127.0.0.1', port=6379, socket_timeout=2)
        return r.ping()
    except:
        return False

def check_disk_space():
    import shutil
    stat = shutil.disk_usage('/')
    free_percent = (stat.free / stat.total) * 100
    return free_percent > 10  # At least 10% free

def check_memory():
    import psutil
    mem = psutil.virtual_memory()
    return mem.available > 1024 * 1024 * 1024  # At least 1 GB free
```

---

### 4. Load-Based Health Checks

**Health based on current load/performance:**

> **⚠️ Important: BGP is Binary (All-or-Nothing)**
>
> BGP cannot do proportional/weighted traffic distribution. You can only:
> - **Announce** a route (receive traffic)
> - **Withdraw** a route (stop receiving traffic)
>
> There is NO way to receive "50% of traffic" via BGP. When multiple instances announce the same prefix, routers use ECMP (Equal-Cost Multi-Path) which distributes traffic equally via flow-based hashing.
>
> **For TCP services**: Withdrawing a route causes existing connections to break. Use high thresholds (e.g., 95% CPU) to avoid unnecessary disruptions.

```python
import psutil

def load_based_health():
    """
    Binary health check based on load.
    Returns False only when server is severely overloaded.
    Use HIGH thresholds to avoid connection disruption.
    """
    # CPU load - very high threshold
    cpu_percent = psutil.cpu_percent(interval=1)
    if cpu_percent > 95:
        return False  # Severely overloaded

    # Memory - very high threshold
    mem = psutil.virtual_memory()
    if mem.percent > 95:
        return False  # Critical memory pressure

    # Connection count - very high threshold
    connections = len(psutil.net_connections(kind='inet'))
    if connections > 50000:
        return False  # Dangerously high connection count

    return True
```

**Use case:** Prevent complete service failure by removing severely overloaded instances

**Not suitable for:**
- Proportional load balancing (use HAProxy/NGINX for Layer 7 weighted distribution)
- Fine-grained traffic shaping
- Gradual capacity management

---

## Failover Mechanisms

### Automatic Failover

**ExaBGP script with automatic failover:**

```python
#!/usr/bin/env python3
"""
Automatic failover based on health checks
"""
import sys
import time
import socket

SERVICE_IP = "100.10.0.100"
SERVICE_PORT = 80
CHECK_INTERVAL = 5

# Dampening: require N consecutive failures
FALL_THRESHOLD = 2
fall_count = 0
announced = False

def is_healthy():
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(2)
        result = sock.connect_ex(('127.0.0.1', SERVICE_PORT))
        sock.close()
        return result == 0
    except:
        return False

time.sleep(2)

while True:
    healthy = is_healthy()

    if healthy:
        fall_count = 0
        if not announced:
            # Service recovered, announce
            sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
            sys.stdout.flush()
            sys.stderr.write(f"[FAILOVER] Service recovered, announcing route\n")
            announced = True

    else:
        fall_count += 1
        if fall_count >= FALL_THRESHOLD and announced:
            # Service failed, trigger failover
            sys.stdout.write(f"withdraw route {SERVICE_IP}/32 next-hop self\n")
            sys.stdout.flush()
            sys.stderr.write(f"[FAILOVER] Service failed, withdrawing route (traffic fails over to other instances)\n")
            announced = False

    time.sleep(CHECK_INTERVAL)
```

**Failover timeline:**
```
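T+0s  : Service fails
T+5s  : First failed health check (worst case: one CHECK_INTERVAL after the failure)
T+10s : Second consecutive failure reaches FALL_THRESHOLD = 2
T+10s : Script sends withdraw; ExaBGP withdraws the route
T+15s : BGP convergence complete (assuming roughly 5 seconds)
T+15s : Traffic fails over to healthy instances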
```
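The timeline can be turned into a rough upper bound. The sketch below is a model under stated assumptions, not a measurement: the failure happens right after a passing check, each failing check may block for up to the socket timeout, and BGP convergence takes a fixed time you supply. The function name and parameters are illustrative.

```python
def worst_case_failover_seconds(check_interval, fall_threshold,
                                check_timeout=0.0, convergence=5.0):
    """Upper bound on the time from service failure to traffic moving away.

    Model: the failure happens right after a passing check, each failing
    check may take up to check_timeout, and the script sleeps
    check_interval between checks.
    """
    detection = fall_threshold * (check_interval + check_timeout)
    return detection + convergence

# Values from the script above; a refused connection returns immediately
print(worst_case_failover_seconds(5, 2))                    # 15.0
# A hung service makes every check wait for the 2-second socket timeout
print(worst_case_failover_seconds(5, 2, check_timeout=2))   # 19.0
```

Lowering `CHECK_INTERVAL` or `FALL_THRESHOLD` shortens detection at the cost of more false positives, so tune them together with the rise/fall dampening.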

---

### Manual Failover (Maintenance Mode)

**Gracefully drain traffic before maintenance:**

```python
#!/usr/bin/env python3
"""
Maintenance mode support
Create /var/run/maintenance file to drain traffic
"""
import sys
import time
import socket
import os

SERVICE_IP = "100.10.0.100"
MAINTENANCE_FILE = "/var/run/maintenance"

def is_maintenance_mode():
    return os.path.exists(MAINTENANCE_FILE)

def is_healthy():
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(2)
        result = sock.connect_ex(('127.0.0.1', 80))
        sock.close()
        return result == 0
    except:
        return False

time.sleep(2)
announced = False

while True:
    if is_maintenance_mode():
        if announced:
            sys.stdout.write(f"withdraw route {SERVICE_IP}/32 next-hop self\n")
            sys.stdout.flush()
            sys.stderr.write(f"[MAINTENANCE] Entering maintenance mode\n")
            announced = False
    else:
        healthy = is_healthy()
        if healthy and not announced:
            sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
            sys.stdout.flush()
            announced = True
        elif not healthy and announced:
            sys.stdout.write(f"withdraw route {SERVICE_IP}/32 next-hop self\n")
            sys.stdout.flush()
            announced = False

    time.sleep(5)
```

**Maintenance workflow:**

```bash
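# 1. Enter maintenance mode (the script withdraws the route on its next check;
#    traffic stops arriving once the routers converge)
touch /var/run/maintenance

# 2. Wait until no established connections remain on port 80
watch "ss -Htn state established '( sport = :80 )' | wc -l"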

# 3. Perform maintenance
systemctl restart nginx
systemctl restart application

# 4. Exit maintenance mode (resume receiving traffic)
rm /var/run/maintenance
```

---

## Load Distribution

### Equal Load Distribution (ECMP)

**All servers announce with same metric:**

```python
# All servers run identical script
announce route 100.10.0.100/32 next-hop self
```

**Router performs ECMP (Equal-Cost Multi-Path):**
```
Router sees 3 equal-cost paths
→ Distributes traffic equally (hash-based)
→ Per-flow load balancing (same src/dst goes to same server)
```
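Per-flow hashing can be illustrated with a short sketch. Real routers use vendor-specific hash functions in hardware; SHA-256 is only a stand-in here, and the next-hop addresses and the `pick_next_hop` helper are hypothetical.

```python
import hashlib
from collections import Counter

def pick_next_hop(flow, next_hops):
    """Map a flow 5-tuple to one next hop with a stable hash."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]

servers = ["192.168.1.10", "192.168.1.11", "192.168.1.12"]  # hypothetical next hops

# The same flow always lands on the same server
flow = ("203.0.113.7", 51000, "100.10.0.100", 80, "tcp")
assert pick_next_hop(flow, servers) == pick_next_hop(flow, servers)

# Many flows spread roughly evenly across servers
counts = Counter(
    pick_next_hop(("203.0.113.7", port, "100.10.0.100", 80, "tcp"), servers)
    for port in range(10000, 13000)
)
print(counts)
```

Because the index depends on the number of next hops, adding or removing a server in this simple modulo scheme can move flows that were on healthy servers, which is one reason route changes can reset established TCP connections.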

**Enable ECMP on routers:**

```cisco
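! Cisco IOS: use the router's own AS number (65001 in this guide's iBGP examples)
router bgp 65001
 maximum-paths 8
 ! Paths learned over iBGP need the ibgp keyword
 maximum-paths ibgp 8

# Juniper
set protocols bgp group servers multipath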
```

---

### Weighted Load Distribution

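**Use BGP MED to set a preference order between servers:**

```python
# Most preferred server (carries the traffic while it is announced)
announce route 100.10.0.100/32 next-hop self med 50

# Second choice (takes over if the MED 50 route is withdrawn)
announce route 100.10.0.100/32 next-hop self med 100

# Last resort
announce route 100.10.0.100/32 next-hop self med 150
```

**Note:** Lower MED = preferred path. MED does not split traffic by weight: routers select the best path, and ECMP applies only to paths that tie on every best-path attribute, including MED. With the values above, the MED 50 server receives all traffic and the others take over in order as routes are withdrawn. For proportional weighting, use a Layer 7 load balancer (HAProxy/NGINX) behind the anycast IP.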

---

### Dynamic Load-Based Distribution

**Adjust MED based on current load:**

```python
#!/usr/bin/env python3
import sys
import time
import psutil

SERVICE_IP = "100.10.0.100"
BASE_MED = 100

def calculate_med():
    """Calculate MED based on CPU load"""
    cpu_percent = psutil.cpu_percent(interval=1)

    # Higher CPU = higher MED = less preferred
    load_factor = int(cpu_percent)
    med = BASE_MED + load_factor

    return med

time.sleep(2)

while True:
    med = calculate_med()

    sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self med {med}\n")
    sys.stdout.flush()

    sys.stderr.write(f"[LOAD] Announced with MED={med}\n")

    time.sleep(30)  # Update every 30 seconds
```

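**Result:** The instance with the lowest MED (the least loaded one) becomes the preferred path and attracts the traffic. This steers traffic toward the least-loaded instance; it does not split traffic in proportion to load. Each change of preferred instance moves existing flows, so TCP connections may reset when the MED ordering changes.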

---

## Common HA Scenarios

### Scenario 1: Web Application HA

**Setup:**
- 3 web servers (NGINX + application)
- Anycast IP: 100.10.0.80
- Health check: HTTP /health endpoint
- Active-active configuration

**Configuration:**

```ini
# /etc/exabgp/web-ha.conf
neighbor 192.168.1.1 {
    router-id 192.168.1.10;
    local-address 192.168.1.10;
    local-as 65001;
    peer-as 65001;

    family {
        ipv4 unicast;
    }

    api {
        processes [ web-healthcheck ];
    }
}

process web-healthcheck {
    run /etc/exabgp/web-healthcheck.py;
    encoder text;
}
```

**Health check script:**

```python
#!/usr/bin/env python3
import sys
import time
import urllib.request

SERVICE_IP = "100.10.0.80"

def is_web_healthy():
    try:
        response = urllib.request.urlopen('http://127.0.0.1/health', timeout=2)
        return response.getcode() == 200
    except:
        return False

time.sleep(2)
announced = False

while True:
    # Run the check once per cycle so both branches see the same result
    healthy = is_web_healthy()
    if healthy and not announced:
        sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
        sys.stdout.flush()
        announced = True
    elif not healthy and announced:
        sys.stdout.write(f"withdraw route {SERVICE_IP}/32 next-hop self\n")
        sys.stdout.flush()
        announced = False

    time.sleep(5)
```

---

### Scenario 2: Database Read Replica HA

**Setup:**
- 1 primary database (writes)
- 3 read replicas (reads)
- Anycast read IP: 100.10.0.54
- Health check: replication lag

**Health check:**

```python
#!/usr/bin/env python3
import sys
import time
import psycopg2

SERVICE_IP = "100.10.0.54"  # Must be a valid IPv4 address (each octet 0-255)
MAX_LAG_SECONDS = 10

def get_replication_lag():
    try:
        conn = psycopg2.connect(host='127.0.0.1', database='postgres', user='monitor')
        cursor = conn.cursor()

        cursor.execute("""
            SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))
        """)

        lag = cursor.fetchone()[0]
        conn.close()
        return lag if lag else 0
    except:
        return float('inf')

time.sleep(2)
announced = False

while True:
    lag = get_replication_lag()
    healthy = lag < MAX_LAG_SECONDS

    if healthy and not announced:
        sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
        sys.stdout.flush()
        sys.stderr.write(f"[DB] Replication lag OK ({lag:.1f}s), announcing\n")
        announced = True
    elif not healthy and announced:
        sys.stdout.write(f"withdraw route {SERVICE_IP}/32 next-hop self\n")
        sys.stdout.flush()
        sys.stderr.write(f"[DB] Replication lag too high ({lag:.1f}s), withdrawing\n")
        announced = False

    time.sleep(10)
```

---

### Scenario 3: Multi-Region HA

**Setup:**
- Region A: 3 servers
- Region B: 3 servers
- Same anycast IP in both regions
- Clients routed to nearest region

**Benefits:**
- Low latency (geo-proximity)
- Disaster recovery (region failure)
- Active-active across regions

---

## Implementation Examples

### Complete HA Setup

**1. Install ExaBGP on all servers:**

```bash
pip install exabgp
```

**2. Configure service IP on loopback:**

```bash
ip addr add 100.10.0.100/32 dev lo
```

**3. Create ExaBGP config (`/etc/exabgp/ha.conf`):**

```ini
neighbor 192.168.1.1 {
    router-id 192.168.1.10;
    local-address 192.168.1.10;
    local-as 65001;
    peer-as 65001;

    family {
        ipv4 unicast;
    }

    api {
        processes [ ha-healthcheck ];
    }
}

process ha-healthcheck {
    run /etc/exabgp/ha-healthcheck.py;
    encoder text;
}
```

**4. Create health check script:**

```python
#!/usr/bin/env python3
import sys
import time
import socket

SERVICE_IP = "100.10.0.100"
SERVICE_PORT = 80

def is_healthy():
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(2)
        result = sock.connect_ex(('127.0.0.1', SERVICE_PORT))
        sock.close()
        return result == 0
    except:
        return False

time.sleep(2)
announced = False

while True:
    # Run the check once per cycle so both branches see the same result
    healthy = is_healthy()
    if healthy and not announced:
        sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
        sys.stdout.flush()
        announced = True
    elif not healthy and announced:
        sys.stdout.write(f"withdraw route {SERVICE_IP}/32 next-hop self\n")
        sys.stdout.flush()
        announced = False

    time.sleep(5)
```

**5. Start ExaBGP:**

```bash
exabgp /etc/exabgp/ha.conf
```

**6. Verify:**

```bash
# Check route on router
show ip bgp 100.10.0.100

# Should see multiple paths (one per healthy server)
```

---

## Best Practices

### 1. Use Rise/Fall Thresholds

**Prevent route flapping:**

```python
RISE_THRESHOLD = 3  # 3 consecutive successes to announce
FALL_THRESHOLD = 2  # 2 consecutive failures to withdraw
```
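One way to apply these thresholds is a small state tracker that counts consecutive results and reports only state changes. This is a sketch, not the logic of the built-in healthcheck module; the `RiseFall` class and its method names are hypothetical.

```python
class RiseFall:
    """Track consecutive check results and report state changes."""

    def __init__(self, rise=3, fall=2):
        self.rise, self.fall = rise, fall
        self.successes = 0
        self.failures = 0
        self.up = False  # start withdrawn until the service proves healthy

    def update(self, healthy):
        """Feed one check result; return 'announce', 'withdraw', or None."""
        if healthy:
            self.successes += 1
            self.failures = 0
            if not self.up and self.successes >= self.rise:
                self.up = True
                return "announce"
        else:
            self.failures += 1
            self.successes = 0
            if self.up and self.failures >= self.fall:
                self.up = False
                return "withdraw"
        return None

state = RiseFall(rise=3, fall=2)
results = [True, True, True, False, True, False, False]
print([state.update(r) for r in results])
# [None, None, 'announce', None, None, None, 'withdraw']
```

The single failure in the middle of the sequence does not trigger a withdraw, which is the flap protection the thresholds are meant to provide.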

---

### 2. Monitor BGP Session Health

```python
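import subprocess
import sys

def send_alert(message):
    # Placeholder (assumption): wire this to your paging or alerting system
    print(f"ALERT: {message}", file=sys.stderr)

def check_exabgp_running():
    # Checks only that the process exists; confirm the BGP session
    # state itself on the router (e.g. show ip bgp summary)
    result = subprocess.run(['pgrep', '-f', 'exabgp'], capture_output=True)
    if result.returncode != 0:
        send_alert("ExaBGP not running!")
        return False
    return True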
```

---

### 3. Log All Announcements

```python
import logging
import sys

logging.basicConfig(filename='/var/log/exabgp-ha.log', level=logging.INFO)

def announce_route(ip):
    sys.stdout.write(f"announce route {ip}/32 next-hop self\n")
    sys.stdout.flush()
    logging.info(f"ANNOUNCE: {ip}")
```

---

### 4. Implement Maintenance Mode

**Allow graceful traffic draining:**

```bash
# Enter maintenance
touch /var/run/maintenance

# Wait for connections to drain
watch 'ss -tan | grep ESTAB | wc -l'

# Perform maintenance
systemctl restart service

# Exit maintenance
rm /var/run/maintenance
```

---

### 5. Test Failover Regularly

```bash
# Monthly failover drill
systemctl stop nginx
# Verify traffic failed over
sleep 60
systemctl start nginx
```

---

## Monitoring and Alerting

### Metrics to Monitor

**1. Service Health:**
- Health check success rate
- Time since last successful check
- Health check latency

**2. BGP State:**
- BGP session state (up/down)
- Routes announced
- Routes withdrawn
- BGP convergence time

**3. Failover Events:**
- Number of failovers
- Time to failover
- Failed node recovery time

---

### Monitoring Script

```python
#!/usr/bin/env python3
"""
Monitor HA metrics and export to Prometheus
"""
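import socket
import time
from prometheus_client import start_http_server, Gauge, Counter

# Metrics
health_check_success = Gauge('ha_health_check_success', 'Health check status (1=healthy, 0=unhealthy)')
route_announced = Gauge('ha_route_announced', 'Route announcement status (1=announced, 0=withdrawn)')
failover_count = Counter('ha_failover_total', 'Total number of failovers')

def is_healthy(port=80):
    # Same TCP check as the health check scripts; adjust to your service
    try:
        with socket.create_connection(('127.0.0.1', port), timeout=2):
            return True
    except OSError:
        return False

def monitor_ha():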
    announced = False

    while True:
        healthy = is_healthy()
        health_check_success.set(1 if healthy else 0)

        if healthy and not announced:
            route_announced.set(1)
            announced = True
        elif not healthy and announced:
            route_announced.set(0)
            failover_count.inc()
            announced = False

        time.sleep(5)

if __name__ == '__main__':
    # Start Prometheus metrics server
    start_http_server(9100)
    monitor_ha()
```

---

## Troubleshooting

### Issue 1: Route Not Failing Over

**Symptoms:** Service down but traffic still routed to failed instance

**Check:**

```bash
# 1. Verify ExaBGP withdrew the route (log path depends on your setup)
grep -i withdraw /var/log/exabgp.log

# 2. Check the BGP table (run on the router)
show ip bgp 100.10.0.100

# 3. Verify the health check is detecting the failure
tail -f /var/log/exabgp.log
```

**Common causes:**
- Health check not detecting failure
- ExaBGP not running
- BGP session down
- Router not removing route

---

### Issue 2: Route Flapping

**Symptoms:** Route repeatedly announced/withdrawn

**Diagnosis:**

```bash
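# Monitor route changes (assumes a Linux router running FRR with vtysh;
# on other platforms, repeat the show command on the router CLI)
watch -d "vtysh -c 'show ip bgp 100.10.0.100' | grep -i paths"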
```

**Solutions:**
- Implement rise/fall thresholds
- Increase health check interval
- Fix unstable service
- Add dampening
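
To confirm that a route is flapping rather than failing over once, you can count state changes inside a sliding time window. The sketch below assumes you already collect the timestamps of announce and withdraw events (for example from the script's log); the function name and the thresholds (more than 4 changes within 300 seconds) are illustrative assumptions.

```python
def is_flapping(change_times, window=300, max_changes=4):
    """Return True if more than max_changes state changes fall in any window.

    change_times: sorted timestamps (seconds) of announce/withdraw events.
    """
    start = 0
    for end, t in enumerate(change_times):
        # Slide the window start forward until it spans at most `window` seconds
        while t - change_times[start] > window:
            start += 1
        if end - start + 1 > max_changes:
            return True
    return False

stable = [0, 3600, 7200]
unstable = [0, 20, 45, 60, 90, 110]
print(is_flapping(stable), is_flapping(unstable))  # False True
```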

---

### Issue 3: Uneven Load Distribution

**Symptoms:** One server gets all traffic despite ECMP

**Check:**

```cisco
# Verify ECMP enabled
show ip bgp 100.10.0.100
# Should show "multipath" or "ECMP"

# Check routing table
show ip route 100.10.0.100
# Should show multiple next-hops
```

**Solutions:**

```cisco
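! Enable ECMP (use the router's own AS number; 65001 in this guide's iBGP examples)
router bgp 65001
 maximum-paths 8
 ! Paths learned over iBGP need the ibgp keyword
 maximum-paths ibgp 8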
```

---

## Next Steps

### Learn More

- **[Anycast Management](Anycast-Management)** - Anycast patterns
- **[DDoS Mitigation](DDoS-Mitigation)** - DDoS protection
- **[Quick Start](Quick-Start)** - Getting started

### Operations

- **[Debugging](Debugging)** - Troubleshooting
- **[Monitoring](Monitoring)** - Monitoring setup

### Configuration

- **[Configuration Syntax](Configuration-Syntax)** - Config reference
- **[API Overview](API-Overview)** - API patterns

---

**Ready to implement HA?** See [Quick Start](Quick-Start) →

---