# Service High Availability with ExaBGP

**Building resilient, self-healing services with BGP-based failover**

> 🔄 **Application-driven high availability** - services control their own routing and failover

---

## Table of Contents

- [Overview](#overview)
- [High Availability Concepts](#high-availability-concepts)
- [Architecture Patterns](#architecture-patterns)
- [Health Check Strategies](#health-check-strategies)
- [Failover Mechanisms](#failover-mechanisms)
- [Load Distribution](#load-distribution)
- [Common HA Scenarios](#common-ha-scenarios)
- [Implementation Examples](#implementation-examples)
- [Best Practices](#best-practices)
- [Monitoring and Alerting](#monitoring-and-alerting)
- [Troubleshooting](#troubleshooting)

---

## Overview

**Service High Availability (HA)** with ExaBGP lets each service announce its own route via BGP while it is healthy and withdraw that route when it becomes unhealthy.

### The Traditional HA Problem

**Without ExaBGP:**

```
Load Balancer (Single Point of Failure)
        ↓
    ┌───┴───┐
    ▼       ▼
Server 1  Server 2

Issues:
- Load balancer is SPOF
- Expensive hardware
- Manual failover configuration
- Limited geographic distribution
```

**With ExaBGP:**

```
No central load balancer
Network routes to healthy instances

Server 1 (healthy) ──→ Announces route ──→ Receives traffic ✅
Server 2 (healthy) ──→ Announces route ──→ Receives traffic ✅
Server 3 (failed)  ──→ Withdraws route ──→ No traffic ❌

Benefits:
- No single point of failure
- Automatic failover (5-15 seconds)
- Geographic distribution
- Cost-effective
```

---

## High Availability Concepts

### Service Availability

**Key metrics:**

- **Uptime**: Percentage of time the service is available
- **MTBF** (Mean Time Between Failures): Average time the service runs between failures
- **MTTR** (Mean Time To Recover): Average time to restore service
- **RTO** (Recovery Time Objective): Maximum acceptable downtime
- **RPO** (Recovery Point Objective): Maximum acceptable data loss

**HA Formula:**

```
Availability = MTBF / (MTBF + MTTR)

Example:
MTBF = 720 hours (30 days)
MTTR = 0.25 hours (15 minutes)
Availability = 720 / (720 + 0.25) = 99.97%
```

---

### ExaBGP HA Advantages

**ExaBGP's Key Advantage: No Single Point of Failure**

```
Traditional Architecture (Load Balancer):

┌─────────────────────────────────────────────┐
│        Load Balancer (HAProxy/NGINX)        │ ← Single Point of Failure
│              (Central Device)               │ ← Must be in ONE location
└──────────────────┬──────────────────────────┘
                   │
       ┌───────────┼───────────┐
       ▼           ▼           ▼
   Server 1    Server 2    Server 3

Problem: Load balancer MUST be centralized
- Cannot span multiple data centers without becoming SPOF
- Very fast failover (< 1 second) BUT only between backends
- Load balancer itself is single point of failure
- If DC with load balancer fails, entire service fails

ExaBGP Architecture (Distributed):

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Server 1   │    │   Server 2   │    │   Server 3   │
│   + ExaBGP   │    │   + ExaBGP   │    │   + ExaBGP   │
│    (DC-1)    │    │    (DC-1)    │    │    (DC-2)    │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │
       └───────────────────┴───────────────────┘
                           │
                   BGP Announcements

No single point of failure:
- Each instance independent
- Can span multiple data centers
- DC-1 fails → DC-2 automatically takes over (BGP convergence: 5-15s)
- Slower failover than load balancer, but survives DC failure
```

**Comparison with other HA mechanisms:**

```
Layer 7 Load Balancer (HAProxy/NGINX):
- Very fast failover between backends (< 1 second)
- Works across Layer 3 (no Layer 2 requirement)
- BUT: Centralized architecture (single device)
- BUT: Cannot span data centers without becoming SPOF
- Best for: Fast failover within single location

ExaBGP:
- Slower failover (5-15 seconds BGP convergence)
- Fully distributed (no central device)
- Can span multiple data centers
- Survives entire DC failure
- Best for: Geographic redundancy, eliminating SPOF

Combined Architecture (Best of Both):
ExaBGP → Distribute traffic across multiple DCs
   ↓
HAProxy/NGINX in each DC → Fast local failover
   ↓
Backend servers

DNS-based HA:
- Very slow (30-60 seconds due to DNS TTL)
- Client-side caching issues
- Best used with ExaBGP for multi-region routing
```

**Common Use Case: ExaBGP Provides Resilience TO Load Balancers**

```
ExaBGP announces load balancer VIPs:
- HAProxy-DC1 (healthy) → announces 100.10.0.100 → receives traffic
- HAProxy-DC2 (healthy) → announces 100.10.0.100 → receives traffic
- If HAProxy-DC1 fails → withdraws route → traffic goes to DC2

Result: Fast local failover + geographic redundancy
```

---

## Architecture Patterns

### Pattern 1: Active-Active HA

**Multiple active instances serving traffic simultaneously:**

```
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Server 1   │    │   Server 2   │    │   Server 3   │
│ Service: UP  │    │ Service: UP  │    │ Service: UP  │
│ ExaBGP: ✅   │    │ ExaBGP: ✅   │    │ ExaBGP: ✅   │
│  Announces   │    │  Announces   │    │  Announces   │
│ 100.10.0.100 │    │ 100.10.0.100 │    │ 100.10.0.100 │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │
       └───────────────────┴───────────────────┘
                           │
                           ▼
       Traffic distributed (ECMP load balancing)
```

**Characteristics:**

- All instances active
- Traffic distributed via ECMP
- Horizontal scaling (add more servers = more capacity)
- No wasted standby capacity

**Configuration:**

```python
import sys

SERVICE_IP = "100.10.0.100"  # every server announces the same IP

# is_service_healthy() stands for any check from "Health Check Strategies"
if is_service_healthy():
    sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
    sys.stdout.flush()
```

---

### Pattern 2: Active-Passive HA

**One active instance, others on standby:**

```
┌──────────────┐    ┌──────────────┐
│   Primary    │    │  Secondary   │
│ Service: UP  │    │ Service: UP  │
│ ExaBGP: ✅   │    │ ExaBGP: ⏸️   │
│  Announces   │    │    Silent    │
│   MED=100    │    │ (or MED=200) │
└──────┬───────┘    └──────┬───────┘
       │                   │
       └────────┬──────────┘
                ▼
        Traffic to Primary

If Primary fails:

┌──────────────┐
│  Secondary   │
│ Service: UP  │
│ ExaBGP: ✅   │
│  Announces   │
│   MED=100    │
└──────┬───────┘
       ▼
Traffic to Secondary
```

**Implementation with MED:**

```python
import sys

# Primary: lower MED, preferred while its route is present
if is_service_healthy():
    sys.stdout.write("announce route 100.10.0.100/32 next-hop self med 100\n")
    sys.stdout.flush()

# Secondary: higher MED = backup, used only after the primary withdraws
if is_service_healthy():
    sys.stdout.write("announce route 100.10.0.100/32 next-hop self med 200\n")
    sys.stdout.flush()
```

By default routers compare MED only between routes received from the same neighboring AS, so this pattern assumes both servers peer from the same AS.

---

### Pattern 3: Geographic HA

**Active instances in multiple regions:**

```
Region A (US-East)          Region B (EU-West)
┌──────────────┐            ┌──────────────┐
│ Servers 1-3  │            │ Servers 4-6  │
│    ExaBGP    │            │    ExaBGP    │
│ 100.10.0.100 │            │ 100.10.0.100 │
└──────┬───────┘            └──────┬───────┘
       │                           │
       ▼                           ▼
US Clients routed to A      EU Clients routed to B

If Region A fails → all traffic to Region B
If Region B fails → all traffic to Region A
```

**Benefits:**

- Disaster recovery
- Low latency (geo-proximity routing)
- Regulatory compliance (data residency)

---

## Health Check Strategies

> **⭐ RECOMMENDED: Use Built-in Healthcheck Module**
>
> ExaBGP includes a production-ready `exabgp healthcheck` tool that handles all health check patterns below - **no custom scripting required!**
>
> ```bash
> # Zero-code health check with rise/fall dampening, metrics, and execution hooks
> exabgp healthcheck --cmd "curl -sf http://localhost/health" --ip 10.0.0.1/32 --rise 3 --fall 2
> ```
>
> See [Healthcheck Module](Healthcheck-Module) for complete documentation with examples.
>
> **Custom scripts** (shown below) are only needed for complex logic (roughly 10% of use cases). For most deployments, **use the built-in module**.

---

### 1. TCP Port Check (Basic)

**Check if port is open:**

```python
import socket

def tcp_check(host, port, timeout=2):
    """Return True if host:port accepts a TCP connection within timeout seconds."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            return sock.connect_ex((host, port)) == 0
    except OSError:
        return False
```

**Pros:**

- Simple
- Fast

**Cons:**

- Doesn't verify service functionality
- Port open ≠ service healthy

---

### 2. HTTP Endpoint Check

**Check HTTP /health endpoint:**

```python
import json
import urllib.request

def http_health_check(url='http://127.0.0.1/health', timeout=2):
    """Return True if the endpoint answers 200 with a healthy status body."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            if response.getcode() != 200:
                return False
            # Optionally check the response body; this matches the JSON
            # returned by the Flask endpoint below
            body = response.read().decode('utf-8')
            return json.loads(body).get('status') == 'healthy'
    except Exception:
        # Connection errors, non-2xx responses and malformed bodies all count as unhealthy
        return False
```

**Health endpoint example (Flask):**

```python
from flask import Flask, jsonify
import psycopg2

app = Flask(__name__)

@app.route('/health')
def health():
    # Check database connection
    try:
        conn = psycopg2.connect('dbname=mydb')
        conn.close()
        return jsonify({'status': 'healthy'}), 200
    except Exception:
        return jsonify({'status': 'unhealthy'}), 503

if __name__ == '__main__':
    # Listens on 8080; point the check URL at this port unless a reverse
    # proxy forwards port 80 to it
    app.run(port=8080)
```

**Pros:**

- Verifies service responds
- Can check dependencies (database, cache, etc.)
- Application-specific logic

---

### 3. Comprehensive Health Check

**Check all critical dependencies:**

```python
import shutil
import urllib.request

import psutil
import psycopg2
import redis

def comprehensive_health_check():
    checks = {
        'web': check_web_server(),
        'database': check_database(),
        'cache': check_redis(),
        'disk_space': check_disk_space(),
        'memory': check_memory(),
    }

    # All checks must pass
    return all(checks.values())

def check_web_server():
    try:
        with urllib.request.urlopen('http://127.0.0.1:80/health', timeout=2) as response:
            return response.getcode() == 200
    except Exception:
        return False

def check_database():
    try:
        conn = psycopg2.connect(host='127.0.0.1', database='mydb',
                                user='monitor', password='secret')
        cursor = conn.cursor()
        cursor.execute('SELECT 1')
        result = cursor.fetchone()
        conn.close()
        return result[0] == 1
    except Exception:
        return False

def check_redis():
    try:
        r = redis.Redis(host='127.0.0.1', port=6379, socket_timeout=2)
        return bool(r.ping())
    except Exception:
        return False

def check_disk_space():
    stat = shutil.disk_usage('/')
    free_percent = (stat.free / stat.total) * 100
    return free_percent > 10  # At least 10% free

def check_memory():
    mem = psutil.virtual_memory()
    return mem.available > 1024 * 1024 * 1024  # At least 1 GB available
```

---

### 4. Load-Based Health Checks

**Health based on current load/performance:**

> **⚠️ Important: BGP is Binary (All-or-Nothing)**
>
> BGP cannot do proportional/weighted traffic distribution. You can only:
> - **Announce** a route (receive traffic)
> - **Withdraw** a route (stop receiving traffic)
>
> There is NO way to receive "50% of traffic" via BGP. When multiple instances announce the same prefix, routers use ECMP (Equal-Cost Multi-Path), which distributes traffic equally via flow-based hashing.
>
> **For TCP services**: Withdrawing a route causes existing connections to break. Use high thresholds (e.g., 95% CPU) to avoid unnecessary disruptions.

```python
import psutil

def load_based_health():
    """
    Binary health check based on load.

    Returns False only when server is severely overloaded.
    Use HIGH thresholds to avoid connection disruption.
    """
    # CPU load - very high threshold
    cpu_percent = psutil.cpu_percent(interval=1)
    if cpu_percent > 95:
        return False  # Severely overloaded

    # Memory - very high threshold
    mem = psutil.virtual_memory()
    if mem.percent > 95:
        return False  # Critical memory pressure

    # Connection count - very high threshold
    connections = len(psutil.net_connections(kind='inet'))
    if connections > 50000:
        return False  # Dangerously high connection count

    return True
```

**Use case:** Prevent complete service failure by removing severely overloaded instances

**Not suitable for:**

- Proportional load balancing (use HAProxy/NGINX for Layer 7 weighted distribution)
- Fine-grained traffic shaping
- Gradual capacity management

---

## Failover Mechanisms

### Automatic Failover

**ExaBGP script with automatic failover:**

```python
#!/usr/bin/env python3
"""
Automatic failover based on health checks
"""
import sys
import time
import socket

SERVICE_IP = "100.10.0.100"
SERVICE_PORT = 80
CHECK_INTERVAL = 5

# Dampening: require N consecutive failures
FALL_THRESHOLD = 2
fall_count = 0
announced = False

def is_healthy():
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(2)
            return sock.connect_ex(('127.0.0.1', SERVICE_PORT)) == 0
    except OSError:
        return False

# Short startup delay before the first check
time.sleep(2)

while True:
    healthy = is_healthy()

    if healthy:
        fall_count = 0
        if not announced:
            # Service recovered, announce
            sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
            sys.stdout.flush()
            sys.stderr.write("[FAILOVER] Service recovered, announcing route\n")
            announced = True
    else:
        fall_count += 1
        if fall_count >= FALL_THRESHOLD and announced:
            # Service failed, trigger failover
            sys.stdout.write(f"withdraw route {SERVICE_IP}/32 next-hop self\n")
            sys.stdout.flush()
            sys.stderr.write("[FAILOVER] Service failed, withdrawing route (traffic fails over to other instances)\n")
            announced = False

    time.sleep(CHECK_INTERVAL)
```

**Failover timeline:**

```
T+0s  : Service fails
T+5s  : Health check detects failure
T+10s : Second check confirms (fall threshold = 2)
T+10s : ExaBGP withdraws route
T+15s : BGP convergence complete
T+15s : Traffic fails over to healthy instances
```

---

### Manual Failover (Maintenance Mode)

**Gracefully drain traffic before maintenance:**

```python
#!/usr/bin/env python3
"""
Maintenance mode support
Create /var/run/maintenance file to drain traffic
"""
import sys
import time
import socket
import os

SERVICE_IP = "100.10.0.100"
MAINTENANCE_FILE = "/var/run/maintenance"

def is_maintenance_mode():
    return os.path.exists(MAINTENANCE_FILE)

def is_healthy():
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(2)
            return sock.connect_ex(('127.0.0.1', 80)) == 0
    except OSError:
        return False

# Short startup delay before the first check
time.sleep(2)

announced = False

while True:
    if is_maintenance_mode():
        if announced:
            sys.stdout.write(f"withdraw route {SERVICE_IP}/32 next-hop self\n")
            sys.stdout.flush()
            sys.stderr.write("[MAINTENANCE] Entering maintenance mode\n")
            announced = False
    else:
        healthy = is_healthy()
        if healthy and not announced:
            sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
            sys.stdout.flush()
            announced = True
        elif not healthy and announced:
            sys.stdout.write(f"withdraw route {SERVICE_IP}/32 next-hop self\n")
            sys.stdout.flush()
            announced = False

    time.sleep(5)
```

**Maintenance workflow:**

```bash
# 1. Enter maintenance mode (stops receiving new traffic)
touch /var/run/maintenance

# 2. Wait for existing connections to drain
watch 'ss -tan | grep :80 | grep ESTAB | wc -l'

# 3. Perform maintenance
systemctl restart nginx
systemctl restart application

# 4. Exit maintenance mode (resume receiving traffic)
rm /var/run/maintenance
```

---

## Load Distribution

### Equal Load Distribution (ECMP)

**All servers announce with same metric:**

```python
import sys

# All servers run an identical script that emits the same command
sys.stdout.write("announce route 100.10.0.100/32 next-hop self\n")
sys.stdout.flush()
```

**Router performs ECMP (Equal-Cost Multi-Path):**

```
Router sees 3 equal-cost paths
→ Distributes traffic equally (hash-based)
→ Per-flow load balancing (same src/dst goes to same server)
```

**Enable ECMP on routers:**

```cisco
# Cisco (for iBGP-learned paths, IOS uses: maximum-paths ibgp 8)
router bgp 65000
 maximum-paths 8

# Juniper
set protocols bgp group servers multipath
```

---

### Weighted Load Distribution

**Use BGP MED to set a preference order between servers:**

```python
import sys

# High-capacity server (most preferred)
sys.stdout.write("announce route 100.10.0.100/32 next-hop self med 50\n")

# Medium-capacity server (first backup)
sys.stdout.write("announce route 100.10.0.100/32 next-hop self med 100\n")

# Low-capacity server (last resort)
sys.stdout.write("announce route 100.10.0.100/32 next-hop self med 150\n")
```

**Note:** Lower MED = preferred path. Routers forward only along the best path (or along several paths that tie), so the MED 50 server receives all traffic while its route is present and the others take over in MED order when it withdraws. MED ranks servers; it does not split traffic proportionally.

---

### Dynamic Load-Based Distribution

**Adjust MED based on current load:**

```python
#!/usr/bin/env python3
import sys
import time
import psutil

SERVICE_IP = "100.10.0.100"
BASE_MED = 100

def calculate_med():
    """Calculate MED based on CPU load"""
    cpu_percent = psutil.cpu_percent(interval=1)

    # Higher CPU = higher MED = less preferred
    load_factor = int(cpu_percent)
    med = BASE_MED + load_factor
    return med

# Short startup delay before the first announcement
time.sleep(2)

while True:
    med = calculate_med()
    sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self med {med}\n")
    sys.stdout.flush()
    sys.stderr.write(f"[LOAD] Announced with MED={med}\n")
    time.sleep(30)  # Update every 30 seconds
```

**Result:** Routers prefer whichever instance currently advertises the lowest MED, so traffic shifts toward the least-loaded server. This moves traffic as a whole rather than splitting it proportionally, and every MED change triggers a BGP update, so keep the update interval long enough to avoid route churn.

---

## Common HA Scenarios

### Scenario 1: Web Application HA

**Setup:**

- 3 web servers (NGINX + application)
- Anycast IP: 100.10.0.80
- Health check: HTTP /health endpoint
- Active-active configuration

**Configuration:**

```ini
# /etc/exabgp/web-ha.conf
neighbor 192.168.1.1 {
    router-id 192.168.1.10;
    local-address 192.168.1.10;
    local-as 65001;
    peer-as 65001;

    family {
        ipv4 unicast;
    }

    api {
        processes [ web-healthcheck ];
    }
}

process web-healthcheck {
    run /etc/exabgp/web-healthcheck.py;
    encoder text;
}
```

**Health check script:**

```python
#!/usr/bin/env python3
import sys
import time
import urllib.request

SERVICE_IP = "100.10.0.80"

def is_web_healthy():
    try:
        with urllib.request.urlopen('http://127.0.0.1/health', timeout=2) as response:
            return response.getcode() == 200
    except Exception:
        return False

# Short startup delay before the first check
time.sleep(2)
announced = False

while True:
    healthy = is_web_healthy()  # run the check once per cycle
    if healthy and not announced:
        sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
        sys.stdout.flush()
        announced = True
    elif not healthy and announced:
        sys.stdout.write(f"withdraw route {SERVICE_IP}/32 next-hop self\n")
        sys.stdout.flush()
        announced = False

    time.sleep(5)
```

---

### Scenario 2: Database Read Replica HA

**Setup:**

- 1 primary database (writes)
- 3 read replicas (reads)
- Anycast read IP: 100.10.0.54
- Health check: replication lag

**Health check:**

```python
#!/usr/bin/env python3
import sys
import time
import psycopg2

SERVICE_IP = "100.10.0.54"
MAX_LAG_SECONDS = 10

def get_replication_lag():
    try:
        conn = psycopg2.connect(host='127.0.0.1', database='postgres', user='monitor')
        cursor = conn.cursor()
        cursor.execute("""
            SELECT EXTRACT(EPOCH FROM
                (now() - pg_last_xact_replay_timestamp()))
        """)
        lag = cursor.fetchone()[0]
        conn.close()
        # NULL (None) means no transaction has been replayed yet
        return float(lag) if lag is not None else 0.0
    except Exception:
        return float('inf')

# Short startup delay before the first check
time.sleep(2)
announced = False

while True:
    lag = get_replication_lag()
    healthy = lag < MAX_LAG_SECONDS

    if healthy and not announced:
        sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
        sys.stdout.flush()
        sys.stderr.write(f"[DB] Replication lag OK ({lag:.1f}s), announcing\n")
        announced = True
    elif not healthy and announced:
        sys.stdout.write(f"withdraw route {SERVICE_IP}/32 next-hop self\n")
        sys.stdout.flush()
        sys.stderr.write(f"[DB] Replication lag too high ({lag:.1f}s), withdrawing\n")
        announced = False

    time.sleep(10)
```

---

### Scenario 3: Multi-Region HA

**Setup:**

- Region A: 3 servers
- Region B: 3 servers
- Same anycast IP in both regions
- Clients routed to nearest region

**Benefits:**

- Low latency (geo-proximity)
- Disaster recovery (region failure)
- Active-active across regions

---

## Implementation Examples

### Complete HA Setup

**1. Install ExaBGP on all servers:**

```bash
pip install exabgp
```

**2. Configure service IP on loopback:**

```bash
ip addr add 100.10.0.100/32 dev lo
```

**3. Create ExaBGP config:**

```ini
neighbor 192.168.1.1 {
    router-id 192.168.1.10;
    local-address 192.168.1.10;
    local-as 65001;
    peer-as 65001;

    family {
        ipv4 unicast;
    }

    api {
        processes [ ha-healthcheck ];
    }
}

process ha-healthcheck {
    run /etc/exabgp/ha-healthcheck.py;
    encoder text;
}
```

**4. Create health check script:**

```python
#!/usr/bin/env python3
import sys
import time
import socket

SERVICE_IP = "100.10.0.100"
SERVICE_PORT = 80

def is_healthy():
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(2)
            return sock.connect_ex(('127.0.0.1', SERVICE_PORT)) == 0
    except OSError:
        return False

# Short startup delay before the first check
time.sleep(2)
announced = False

while True:
    healthy = is_healthy()  # run the check once per cycle
    if healthy and not announced:
        sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
        sys.stdout.flush()
        announced = True
    elif not healthy and announced:
        sys.stdout.write(f"withdraw route {SERVICE_IP}/32 next-hop self\n")
        sys.stdout.flush()
        announced = False

    time.sleep(5)
```

**5. Start ExaBGP:**

```bash
exabgp /etc/exabgp/ha.conf
```

**6. Verify:**

```bash
# Check route on router (router CLI command)
show ip bgp 100.10.0.100

# Should see multiple paths (one per healthy server)
```

---

## Best Practices

### 1. Use Rise/Fall Thresholds

**Prevent route flapping:**

```python
RISE_THRESHOLD = 3  # 3 consecutive successes to announce
FALL_THRESHOLD = 2  # 2 consecutive failures to withdraw
```

---

### 2. Monitor BGP Session Health

```python
import subprocess

def check_exabgp_running():
    """Check that an ExaBGP process exists; send_alert is your own alerting hook."""
    result = subprocess.run(['pgrep', '-f', 'exabgp'], capture_output=True)
    if result.returncode != 0:
        send_alert("ExaBGP not running!")
        return False
    return True
```

---

### 3. Log All Announcements

```python
import logging
import sys

logging.basicConfig(filename='/var/log/exabgp-ha.log', level=logging.INFO)

def announce_route(ip):
    sys.stdout.write(f"announce route {ip}/32 next-hop self\n")
    sys.stdout.flush()
    logging.info(f"ANNOUNCE: {ip}")
```

---

### 4. Implement Maintenance Mode

**Allow graceful traffic draining:**

```bash
# Enter maintenance
touch /var/run/maintenance

# Wait for connections to drain
watch 'ss -tan | grep ESTAB | wc -l'

# Perform maintenance
systemctl restart service

# Exit maintenance
rm /var/run/maintenance
```

---

### 5. Test Failover Regularly

```bash
# Monthly failover drill
systemctl stop nginx
# Verify traffic failed over
sleep 60
systemctl start nginx
```

---

## Monitoring and Alerting

### Metrics to Monitor

**1. Service Health:**

- Health check success rate
- Time since last successful check
- Health check latency

**2. BGP State:**

- BGP session state (up/down)
- Routes announced
- Routes withdrawn
- BGP convergence time

**3. Failover Events:**

- Number of failovers
- Time to failover
- Failed node recovery time

---

### Monitoring Script

```python
#!/usr/bin/env python3
"""
Monitor HA metrics and export to Prometheus
"""
import socket
import time
from prometheus_client import start_http_server, Gauge, Counter

# Metrics
health_check_success = Gauge('ha_health_check_success', 'Health check status (1=healthy, 0=unhealthy)')
route_announced = Gauge('ha_route_announced', 'Route announcement status (1=announced, 0=withdrawn)')
failover_count = Counter('ha_failover_total', 'Total number of failovers')

def is_healthy():
    """Same TCP check as the health check scripts above (port 80 assumed)."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(2)
            return sock.connect_ex(('127.0.0.1', 80)) == 0
    except OSError:
        return False

def monitor_ha():
    announced = False

    while True:
        healthy = is_healthy()
        health_check_success.set(1 if healthy else 0)

        if healthy and not announced:
            route_announced.set(1)
            announced = True
        elif not healthy and announced:
            route_announced.set(0)
            failover_count.inc()
            announced = False

        time.sleep(5)

if __name__ == '__main__':
    # Start Prometheus metrics server
    start_http_server(9100)
    monitor_ha()
```

---

## Troubleshooting

### Issue 1: Route Not Failing Over

**Symptoms:** Service down but traffic still routed to failed instance

**Check:**

```bash
# 1. Verify ExaBGP withdrew route
grep WITHDRAW /var/log/exabgp.log

# 2. Check BGP table on router (router CLI command)
show ip bgp 100.10.0.100

# 3. Verify health check detecting failure
tail -f /var/log/exabgp.log
```

**Common causes:**

- Health check not detecting failure
- ExaBGP not running
- BGP session down
- Router not removing route

---

### Issue 2: Route Flapping

**Symptoms:** Route repeatedly announced/withdrawn

**Diagnosis:**

```bash
# Monitor route changes
# (assumes an FRRouting-based router, where show commands run through vtysh;
#  on other platforms repeat the show command in the router CLI)
watch -d "vtysh -c 'show ip bgp 100.10.0.100' | grep -i paths"
```

**Solutions:**

- Implement rise/fall thresholds
- Increase health check interval
- Fix unstable service
- Add dampening

---

### Issue 3: Uneven Load Distribution

**Symptoms:** One server gets all traffic despite ECMP

**Check:**

```cisco
# Verify ECMP enabled
show ip bgp 100.10.0.100
# Should show "multipath" or "ECMP"

# Check routing table
show ip route 100.10.0.100
# Should show multiple next-hops
```

**Solutions:**

```cisco
# Enable ECMP (for iBGP-learned paths, IOS uses: maximum-paths ibgp 8)
router bgp 65000
 maximum-paths 8
```

---

## Next Steps

### Learn More

- **[Anycast Management](Anycast-Management)** - Anycast patterns
- **[DDoS Mitigation](DDoS-Mitigation)** - DDoS protection
- **[Quick Start](Quick-Start)** - Getting started

### Operations

- **[Debugging](Debugging)** - Troubleshooting
- **[Monitoring](Monitoring)** - Monitoring setup

### Configuration

- **[Configuration Syntax](Configuration-Syntax)** - Config reference
- **[API Overview](API-Overview)** - API patterns

---

**Ready to implement HA?** See [Quick Start](Quick-Start) →

---
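The rise/fall thresholds recommended under Best Practices are only shown as constants, and the automatic failover script applies a fall threshold but no rise threshold. A minimal sketch of both counters working together might look like the following. The names `RiseFallTracker`, `update`, and `command_for` are hypothetical helpers for illustration, not part of ExaBGP; for production use, the built-in `exabgp healthcheck` module already provides `--rise` and `--fall`.

```python
class RiseFallTracker:
    """Hypothetical helper: debounce raw health results with rise/fall counts."""

    def __init__(self, rise=3, fall=2):
        self.rise = rise          # consecutive successes needed to announce
        self.fall = fall          # consecutive failures needed to withdraw
        self.successes = 0
        self.failures = 0
        self.announced = False

    def update(self, healthy):
        """Record one check result; return 'announce', 'withdraw', or None."""
        if healthy:
            self.successes += 1
            self.failures = 0
            if not self.announced and self.successes >= self.rise:
                self.announced = True
                return "announce"
        else:
            self.failures += 1
            self.successes = 0
            if self.announced and self.failures >= self.fall:
                self.announced = False
                return "withdraw"
        return None


def command_for(action, service_ip="100.10.0.100"):
    """Map an action to the text API command format used in this guide."""
    if action is None:
        return None
    return f"{action} route {service_ip}/32 next-hop self"
```

In a health check loop, each cycle would call `tracker.update(is_healthy())` and write the resulting command to stdout only when it is not `None`. A single failed or successful probe then never changes the route on its own, which is the point of the dampening.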