# Health Checks with ExaBGP

Health checking is a critical component of using ExaBGP for anycast, high availability, and load balancing scenarios. ExaBGP provides flexible health checking capabilities through both a built-in module and custom health check scripts.

## Table of Contents

- [Overview](#overview)
- [Built-in Healthcheck Module](#built-in-healthcheck-module)
- [Custom Health Check Scripts](#custom-health-check-scripts)
- [Health Check Patterns](#health-check-patterns)
- [Integration with Monitoring Systems](#integration-with-monitoring-systems)
- [Best Practices](#best-practices)
- [Troubleshooting](#troubleshooting)
- [See Also](#see-also)

## Overview

**Important**: ExaBGP does NOT manipulate the routing table (RIB/FIB). Health checks determine when ExaBGP should announce or withdraw routes via BGP. The operating system or other routing software must install routes from BGP into the FIB.

### Why Health Checks Matter

Health checks enable ExaBGP to:

- **Announce routes only when services are healthy** - Prevents traffic black-holing
- **Withdraw routes automatically on failure** - Enables fast failover
- **Support anycast architectures** - Multiple servers advertise the same IP
- **Enable graceful maintenance** - Controlled traffic drainage

### Health Check Architecture

```
┌──────────────────┐
│  Health Check    │
│  Script/Module   │
└────────┬─────────┘
         │ checks service
         ▼
┌──────────────────┐
│  Local Service   │
│  (HTTP/DNS/etc)  │
└──────────────────┘
         │
         │ healthy/unhealthy
         ▼
┌──────────────────┐
│   ExaBGP API     │
│ announce/withdraw│
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│   BGP Routers    │
│  receive updates │
└──────────────────┘
```

## Built-in Healthcheck Module

ExaBGP 5.x includes a built-in healthcheck module that removes the need for a custom script when a single command, such as a `curl` request against an HTTP/HTTPS endpoint, is enough to decide whether the service is healthy.
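The module works by running a command you supply and treating exit code 0 as healthy, as the configuration below describes. The sketch that follows illustrates only that idea in plain Python; it is not the module's actual code, and `run_check` is a hypothetical helper name chosen for this example.

```python
import subprocess
import sys


def run_check(command, timeout=5):
    """Return True if the command exits with status 0 within the timeout.

    Hypothetical helper for illustration; not part of ExaBGP.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung or missing command counts as a failed check
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Stand-in commands so the example runs anywhere Python is installed
    ok = run_check([sys.executable, "-c", "raise SystemExit(0)"])
    bad = run_check([sys.executable, "-c", "raise SystemExit(1)"])
    print(f"exit 0 -> healthy={ok}, exit 1 -> healthy={bad}")
```

In practice the command would be something like `["curl", "-sf", "http://127.0.0.1:80/health"]`; the Python interpreter is used here only to keep the example self-contained.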
### Basic Configuration

```ini
# /etc/exabgp/exabgp.conf
process healthcheck {
    run /usr/bin/python3 -m exabgp.application.healthcheck;
    encoder json;
}

neighbor 192.0.2.1 {
    router-id 192.0.2.10;
    local-address 192.0.2.10;
    local-as 65001;
    peer-as 65001;

    family {
        ipv4 unicast;
    }

    api {
        processes [ healthcheck ];
    }
}
```

### Healthcheck Configuration File

The healthcheck module supports a configuration file that mirrors command-line options. Each line in the file corresponds to a command-line option (without the `--` prefix).

Create `/etc/exabgp/healthcheck.conf`:

```ini
# /etc/exabgp/healthcheck.conf
# Each line is a command-line option without the -- prefix
# Lines starting with # are comments

# Logging and identification
debug
name = haproxy
syslog-facility = daemon

# Health check timing
interval = 10
fast-interval = 1
timeout = 5
rise = 3
fall = 2

# The health check command (exit code 0 = healthy)
command = curl -sf http://127.0.0.1:80/health

# IP addresses to announce (use label OR explicit IPs)
label = web
# Or specify explicit IPs:
# ip = 198.51.100.10/32

# BGP attributes
next-hop = 192.0.2.10
up-metric = 100
down-metric = 1000
withdraw-on-down

# Optional: community tagging
community = 65001:100

# Optional: execute commands on state changes
down-execute = logger "Service DOWN"
up-execute = logger "Service UP"
```

**Important**: The healthcheck module executes a **command** you provide. It does NOT have built-in HTTP/TCP check types.
Use standard tools in your command:

```ini
# HTTP check using curl
command = curl -sf http://127.0.0.1:80/health

# TCP port check using nc (netcat)
command = nc -z 127.0.0.1 3306

# DNS check using dig
command = dig @127.0.0.1 example.com +short

# MySQL check
command = mysql -u healthcheck -e 'SELECT 1'

# Multi-step check
command = sh -c 'curl -sf http://127.0.0.1/health && redis-cli ping'
```

Use the configuration file with:

```bash
python -m exabgp healthcheck --config /etc/exabgp/healthcheck.conf
```

### Healthcheck Parameters

All parameters correspond to command-line options. See `python -m exabgp healthcheck --help` for the full list.

| Parameter | Description | Default |
|-----------|-------------|---------|
| `command` | Command to execute (exit 0 = healthy) | Required |
| `timeout` | Command execution timeout in seconds | `5` |
| `interval` | Seconds between health checks | `5` |
| `fast-interval` | Interval when state change is occurring | `1` |
| `rise` | Consecutive successes before UP | `3` |
| `fall` | Consecutive failures before DOWN | `3` |
| `ip` | IP address/network to announce (CIDR) | Auto from label |
| `label` | Match IPs with this label prefix (e.g., `lo:web*`) | None |
| `next-hop` | BGP next-hop for announced routes | `self` |
| `up-metric` | MED when service is UP | `100` |
| `down-metric` | MED when service is DOWN | `1000` |
| `withdraw-on-down` | Withdraw route instead of changing MED | `false` |
| `community` | BGP community to attach | None |
| `disable` | If this file exists, service is disabled | None |
| `debounce` | Only announce on state changes | `false` |

### Built-in Module Advantages

- **No custom script needed** - Use any command as health check
- **Simple configuration** - Command-line options in a file
- **Flexible checks** - Use curl, nc, dig, mysql, or any command
- **Flap protection** - Rise/fall thresholds prevent flapping
- **Automatic IP management** - Can set up and tear down IPs automatically
- **Metric-based failover** - Different MEDs for UP/DOWN states

## Custom Health Check Scripts

For more complex health checking logic, write custom scripts that communicate with ExaBGP via its API.

### Python Health Check Example

```python
#!/usr/bin/env python3
# /etc/exabgp/healthcheck.py
import sys
import time

import requests

# Configuration
SERVICE_URL = "http://127.0.0.1:80/health"
CHECK_INTERVAL = 10
ROUTE = "198.51.100.10/32"
NEXT_HOP = "192.0.2.10"
RISE_THRESHOLD = 2
FALL_THRESHOLD = 3


def announce_route():
    """Announce route via ExaBGP API"""
    print(f"announce route {ROUTE} next-hop {NEXT_HOP}", flush=True)


def withdraw_route():
    """Withdraw route via ExaBGP API"""
    print(f"withdraw route {ROUTE} next-hop {NEXT_HOP}", flush=True)


def check_health():
    """Check if service is healthy"""
    try:
        response = requests.get(SERVICE_URL, timeout=5)
        return response.status_code == 200
    except Exception as e:
        sys.stderr.write(f"Health check failed: {e}\n")
        return False


def main():
    consecutive_successes = 0
    consecutive_failures = 0
    route_announced = False

    while True:
        healthy = check_health()

        if healthy:
            consecutive_successes += 1
            consecutive_failures = 0

            # Announce route if we've reached rise threshold
            if consecutive_successes >= RISE_THRESHOLD and not route_announced:
                announce_route()
                route_announced = True
                sys.stderr.write("Service UP - route announced\n")
        else:
            consecutive_failures += 1
            consecutive_successes = 0

            # Withdraw route if we've reached fall threshold
            if consecutive_failures >= FALL_THRESHOLD and route_announced:
                withdraw_route()
                route_announced = False
                sys.stderr.write("Service DOWN - route withdrawn\n")

        time.sleep(CHECK_INTERVAL)


if __name__ == "__main__":
    main()
```

### Bash Health Check Example

```bash
#!/bin/bash
# /etc/exabgp/healthcheck.sh

ROUTE="198.51.100.10/32"
NEXT_HOP="192.0.2.10"
SERVICE_URL="http://127.0.0.1:80/health"
CHECK_INTERVAL=10
RISE_THRESHOLD=2
FALL_THRESHOLD=3

consecutive_successes=0
consecutive_failures=0
route_announced=0

announce_route() {
    echo "announce route $ROUTE next-hop $NEXT_HOP"
}

withdraw_route() {
    echo "withdraw route $ROUTE next-hop $NEXT_HOP"
}

check_health() {
    # --max-time keeps a hung service from blocking the loop indefinitely
    curl -sf --max-time 5 "$SERVICE_URL" > /dev/null 2>&1
}

while true; do
    if check_health; then
        ((consecutive_successes++))
        consecutive_failures=0

        if [ "$consecutive_successes" -ge "$RISE_THRESHOLD" ] && [ "$route_announced" -eq 0 ]; then
            announce_route
            route_announced=1
            echo "Service UP - route announced" >&2
        fi
    else
        ((consecutive_failures++))
        consecutive_successes=0

        if [ "$consecutive_failures" -ge "$FALL_THRESHOLD" ] && [ "$route_announced" -eq 1 ]; then
            withdraw_route
            route_announced=0
            echo "Service DOWN - route withdrawn" >&2
        fi
    fi

    sleep "$CHECK_INTERVAL"
done
```

### ExaBGP Configuration for Custom Script

```ini
# /etc/exabgp/exabgp.conf
process healthcheck {
    run /etc/exabgp/healthcheck.py;
    encoder text;
}

neighbor 192.0.2.1 {
    router-id 192.0.2.10;
    local-address 192.0.2.10;
    local-as 65001;
    peer-as 65001;

    family {
        ipv4 unicast;
    }

    api {
        processes [ healthcheck ];
    }
}
```

## Health Check Patterns

### Pattern 1: Anycast DNS with Health Checks

Multiple DNS servers advertise the same anycast IP. Each server announces the address only while its local DNS service is healthy.
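The local check in this pattern needs a DNS query to send to the server. As a minimal sketch of how such a query can be assembled with only the standard library, the hypothetical helper `build_query` below produces a `version.bind` TXT query in the CHAOS class with a transaction ID of 0, which is the same byte string the health check script hard-codes.

```python
import struct

TXT = 16    # record type TXT
CHAOS = 3   # class CH


def build_query(name, qtype, qclass, query_id=0):
    """Build a minimal DNS query packet (hypothetical helper for illustration)."""
    # Header: ID, flags (0x0100 = recursion desired), QDCOUNT=1,
    # then ANCOUNT, NSCOUNT and ARCOUNT all set to 0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ended by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.strip(".").split(".")
    ) + b"\x00"
    # Question type and class follow the name
    return header + qname + struct.pack("!HH", qtype, qclass)


if __name__ == "__main__":
    packet = build_query("version.bind", TXT, CHAOS)
    print(packet)
```

Building the packet this way makes it easy to query a different name or record type. The complete health check script for this pattern follows.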
```python
#!/usr/bin/env python3
# /etc/exabgp/dns-healthcheck.py
import socket
import sys
import time

ANYCAST_IP = "198.51.100.53/32"
NEXT_HOP = "self"
DNS_PORT = 53
CHECK_INTERVAL = 5

# DNS query for version.bind (TXT record, CHAOS class)
QUERY = b'\x00\x00\x01\x00\x00\x01\x00\x00\x00\x00\x00\x00\x07version\x04bind\x00\x00\x10\x00\x03'


def check_dns():
    """Check if DNS server is responding"""
    try:
        # The with-block closes the socket even when the check fails
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(3)
            sock.sendto(QUERY, ('127.0.0.1', DNS_PORT))
            sock.recvfrom(512)
        return True
    except OSError:
        # Covers timeouts and "port unreachable" errors
        return False


route_announced = False

while True:
    healthy = check_dns()

    if healthy and not route_announced:
        print(f"announce route {ANYCAST_IP} next-hop {NEXT_HOP}", flush=True)
        route_announced = True
        sys.stderr.write("DNS healthy - announcing anycast IP\n")
    elif not healthy and route_announced:
        print(f"withdraw route {ANYCAST_IP} next-hop {NEXT_HOP}", flush=True)
        route_announced = False
        sys.stderr.write("DNS unhealthy - withdrawing anycast IP\n")

    time.sleep(CHECK_INTERVAL)
```

### Pattern 2: Multi-Service Health Check

Check multiple services before announcing a route. All services must be healthy.
```python
#!/usr/bin/env python3
# /etc/exabgp/multi-service-healthcheck.py
import socket
import sys
import time

import requests

ROUTE = "198.51.100.100/32"
NEXT_HOP = "192.0.2.10"
CHECK_INTERVAL = 10


def check_http():
    try:
        r = requests.get("http://127.0.0.1:80/health", timeout=3)
        return r.status_code == 200
    except requests.RequestException:
        return False


def check_database():
    # TCP connect to the local PostgreSQL port
    try:
        sock = socket.create_connection(("127.0.0.1", 5432), timeout=3)
        sock.close()
        return True
    except OSError:
        return False


def check_cache():
    # TCP connect to the local Redis port
    try:
        sock = socket.create_connection(("127.0.0.1", 6379), timeout=3)
        sock.close()
        return True
    except OSError:
        return False


route_announced = False

while True:
    # All services must be healthy
    all_healthy = check_http() and check_database() and check_cache()

    if all_healthy and not route_announced:
        print(f"announce route {ROUTE} next-hop {NEXT_HOP}", flush=True)
        route_announced = True
        sys.stderr.write("All services healthy - route announced\n")
    elif not all_healthy and route_announced:
        print(f"withdraw route {ROUTE} next-hop {NEXT_HOP}", flush=True)
        route_announced = False
        sys.stderr.write("Service failure detected - route withdrawn\n")

    time.sleep(CHECK_INTERVAL)
```

### Pattern 3: Weighted Health Check

Different services have different weights. Announce the route if the total health score meets or exceeds the threshold.
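To make the scoring concrete, here is a small worked example. It assumes the weights used in the script that follows (web 50, api 30, cache 20) and a threshold of 70, and treats a score equal to the threshold as passing; the helper name `health_score` is hypothetical.

```python
WEIGHTS = {"web": 50, "api": 30, "cache": 20}
HEALTH_THRESHOLD = 70


def health_score(healthy_services):
    """Sum the weights of the services that passed their check."""
    return sum(WEIGHTS[name] for name in healthy_services)


if __name__ == "__main__":
    scenarios = (
        ["web", "api", "cache"],
        ["web", "api"],
        ["web", "cache"],
        ["api", "cache"],
        ["web"],
    )
    for healthy in scenarios:
        score = health_score(healthy)
        action = "announce" if score >= HEALTH_THRESHOLD else "withdraw"
        print(f"{'+'.join(healthy):<14} score={score:>3} -> {action}")
```

With these weights the route stays announced when only one of the api or cache checks fails, but it is withdrawn whenever the web check fails, because api and cache together reach only 50. The full script follows.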
```python
#!/usr/bin/env python3
# /etc/exabgp/weighted-healthcheck.py
import sys
import time

import requests

ROUTE = "198.51.100.100/32"
NEXT_HOP = "192.0.2.10"
CHECK_INTERVAL = 10
HEALTH_THRESHOLD = 70  # Announce if score >= 70

CHECKS = [
    {"name": "web", "url": "http://127.0.0.1:80/health", "weight": 50},
    {"name": "api", "url": "http://127.0.0.1:8080/health", "weight": 30},
    {"name": "cache", "url": "http://127.0.0.1:11211/health", "weight": 20},
]


def calculate_health_score():
    total_score = 0

    for check in CHECKS:
        try:
            r = requests.get(check["url"], timeout=3)
            if r.status_code == 200:
                total_score += check["weight"]
                sys.stderr.write(f"{check['name']}: OK (+{check['weight']})\n")
            else:
                sys.stderr.write(f"{check['name']}: FAIL (status {r.status_code})\n")
        except Exception as e:
            sys.stderr.write(f"{check['name']}: FAIL ({e})\n")

    return total_score


route_announced = False

while True:
    score = calculate_health_score()
    sys.stderr.write(f"Health score: {score}/100\n")

    if score >= HEALTH_THRESHOLD and not route_announced:
        print(f"announce route {ROUTE} next-hop {NEXT_HOP}", flush=True)
        route_announced = True
        sys.stderr.write(f"Score {score} >= threshold {HEALTH_THRESHOLD} - route announced\n")
    elif score < HEALTH_THRESHOLD and route_announced:
        print(f"withdraw route {ROUTE} next-hop {NEXT_HOP}", flush=True)
        route_announced = False
        sys.stderr.write(f"Score {score} < threshold {HEALTH_THRESHOLD} - route withdrawn\n")

    time.sleep(CHECK_INTERVAL)
```

### Pattern 4: Graceful Shutdown

Detect shutdown signal and withdraw routes before service stops.
```python
#!/usr/bin/env python3
# /etc/exabgp/graceful-healthcheck.py
import signal
import sys
import time

import requests

ROUTE = "198.51.100.100/32"
NEXT_HOP = "192.0.2.10"
CHECK_INTERVAL = 10

shutdown_requested = False


def signal_handler(signum, frame):
    global shutdown_requested
    shutdown_requested = True
    sys.stderr.write("Shutdown signal received - withdrawing route\n")


# Register signal handlers
signal.signal(signal.SIGTERM, signal_handler)
signal.signal(signal.SIGINT, signal_handler)


def check_health():
    try:
        r = requests.get("http://127.0.0.1:80/health", timeout=3)
        return r.status_code == 200
    except requests.RequestException:
        return False


route_announced = False

while True:
    if shutdown_requested:
        if route_announced:
            print(f"withdraw route {ROUTE} next-hop {NEXT_HOP}", flush=True)
            route_announced = False
        sys.stderr.write("Graceful shutdown complete\n")
        sys.exit(0)

    healthy = check_health()

    if healthy and not route_announced:
        print(f"announce route {ROUTE} next-hop {NEXT_HOP}", flush=True)
        route_announced = True
    elif not healthy and route_announced:
        print(f"withdraw route {ROUTE} next-hop {NEXT_HOP}", flush=True)
        route_announced = False

    # Sleep in one-second slices: time.sleep() resumes after a signal handler
    # returns, so one long sleep would delay the withdrawal
    for _ in range(CHECK_INTERVAL):
        if shutdown_requested:
            break
        time.sleep(1)
```

## Integration with Monitoring Systems

### Prometheus Exporter Integration

Export health check metrics to Prometheus:

```python
#!/usr/bin/env python3
# /etc/exabgp/healthcheck-with-metrics.py
import sys
import time

import requests
from prometheus_client import start_http_server, Gauge, Counter

# Prometheus metrics
health_status = Gauge('exabgp_health_status', 'Service health status (1=healthy, 0=unhealthy)')
route_announced = Gauge('exabgp_route_announced', 'Route announcement status (1=announced, 0=withdrawn)')
health_checks_total = Counter('exabgp_health_checks_total', 'Total health checks performed', ['result'])

ROUTE = "198.51.100.100/32"
NEXT_HOP = "192.0.2.10"
CHECK_INTERVAL = 10
METRICS_PORT = 9101


def check_health():
    try:
        r = requests.get("http://127.0.0.1:80/health", timeout=3)
        is_healthy = r.status_code == 200
        health_checks_total.labels(result='success' if is_healthy else 'fail').inc()
        health_status.set(1 if is_healthy else 0)
        return is_healthy
    except Exception:
        health_checks_total.labels(result='error').inc()
        health_status.set(0)
        return False


# Start Prometheus metrics server
start_http_server(METRICS_PORT)
sys.stderr.write(f"Prometheus metrics available on port {METRICS_PORT}\n")

is_announced = False

while True:
    healthy = check_health()

    if healthy and not is_announced:
        print(f"announce route {ROUTE} next-hop {NEXT_HOP}", flush=True)
        is_announced = True
        route_announced.set(1)
    elif not healthy and is_announced:
        print(f"withdraw route {ROUTE} next-hop {NEXT_HOP}", flush=True)
        is_announced = False
        route_announced.set(0)

    time.sleep(CHECK_INTERVAL)
```

### Nagios/Icinga Integration

Check that ExaBGP is running and that the service address is configured on the host. Because ExaBGP does not install routes into the local routing table, the announcement itself has to be verified on the BGP peer:

```bash
#!/bin/bash
# /usr/lib/nagios/plugins/check_exabgp_health.sh

ROUTE="198.51.100.100/32"
EXABGP_PID_FILE="/var/run/exabgp/exabgp.pid"

# Check that the PID file exists and the process it names is alive
if [ ! -f "$EXABGP_PID_FILE" ] || ! ps -p "$(cat "$EXABGP_PID_FILE")" > /dev/null 2>&1; then
    echo "CRITICAL: ExaBGP not running"
    exit 2
fi

# ExaBGP does not install routes, so the prefix will not appear in the
# kernel routing table. Check that the address is configured locally instead.
if ip -o addr show | grep -qF "inet $ROUTE"; then
    echo "OK: Address $ROUTE is configured locally"
    exit 0
else
    echo "WARNING: Address $ROUTE is not configured locally"
    exit 1
fi
```

## Best Practices

### 1. Use Rise/Fall Thresholds

Prevent route flapping by requiring multiple consecutive successes/failures:

```python
RISE_THRESHOLD = 2  # Announce after 2 consecutive successes
FALL_THRESHOLD = 3  # Withdraw after 3 consecutive failures
```

### 2. Set Appropriate Timeouts

Keep the per-check timeout below the check interval so that a slow check cannot overrun the next one:

```python
CHECK_INTERVAL = 10  # Check every 10 seconds
CHECK_TIMEOUT = 5    # Individual check timeout (must be < interval)
```

### 3. Check Localhost Services

Health checks should verify the **local** service, not remote dependencies:

```python
# GOOD: Check local service
SERVICE_URL = "http://127.0.0.1:80/health"

# BAD: Check remote dependency
SERVICE_URL = "http://database.example.com:5432/"
```

### 4. Implement Comprehensive Health Endpoints

Your application should provide a health endpoint that checks all critical components:

```python
# Example Flask health endpoint
from flask import Flask, jsonify

app = Flask(__name__)


@app.route('/health')
def health():
    checks = {
        'database': check_database_connection(),
        'cache': check_cache_connection(),
        'disk_space': check_disk_space(),
    }

    if all(checks.values()):
        return jsonify({'status': 'healthy', 'checks': checks}), 200
    else:
        return jsonify({'status': 'unhealthy', 'checks': checks}), 503
```

### 5. Log Health State Changes

Always log when routes are announced or withdrawn:

```python
if healthy and not route_announced:
    print(f"announce route {ROUTE} next-hop {NEXT_HOP}", flush=True)
    sys.stderr.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} - Route announced\n")
    route_announced = True
```

### 6. Handle Script Startup

Don't announce routes immediately on startup. Wait for initial health checks:

```python
# Wait for initial checks before announcing
initial_checks = 0
while initial_checks < RISE_THRESHOLD:
    if check_health():
        initial_checks += 1
    else:
        initial_checks = 0
    time.sleep(CHECK_INTERVAL)

# Now start normal operation
announce_route()
```

### 7. Monitor Health Check Script

Use a process supervisor (systemd, supervisord) to keep ExaBGP, and with it the health check scripts it launches, running:

```ini
# /etc/systemd/system/exabgp.service
[Unit]
Description=ExaBGP
After=network.target

[Service]
Type=simple
User=exabgp
ExecStart=/usr/local/bin/exabgp /etc/exabgp/exabgp.conf
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```

## Troubleshooting

### Problem: Routes Not Being Announced

**Symptoms**: Health checks pass but routes aren't announced to peers.
**Debugging steps**:

1. Check ExaBGP logs:

```bash
tail -f /var/log/exabgp/exabgp.log
```

2. Verify health check script output:

```bash
# Health check should print to stdout
/etc/exabgp/healthcheck.py
```

3. Check API communication:

```bash
# Run ExaBGP with full debug logging enabled through environment settings
env exabgp.log.all=true exabgp.log.level=DEBUG exabgp /etc/exabgp/exabgp.conf
```

4. Verify BGP session is established:

```bash
# Check neighbor status in logs
grep "Peer.*up" /var/log/exabgp/exabgp.log
```

### Problem: Routes Flapping

**Symptoms**: Routes are constantly announced and withdrawn.

**Solutions**:

1. Increase rise/fall thresholds:

```python
RISE_THRESHOLD = 3  # More conservative
FALL_THRESHOLD = 5
```

2. Increase check interval:

```python
CHECK_INTERVAL = 15  # Check less frequently
```

3. Implement hysteresis:

```python
# Stay in current state for minimum time
MIN_STATE_TIME = 60  # 60 seconds minimum
last_state_change = time.time()

if time.time() - last_state_change >= MIN_STATE_TIME:
    # Allow state change
    pass
```

### Problem: Health Check Script Crashes

**Symptoms**: Routes withdrawn and never come back.

**Solutions**:

1. Add exception handling:

```python
def main():
    try:
        while True:
            # Health check logic
            pass
    except Exception as e:
        sys.stderr.write(f"Fatal error: {e}\n")
        # Withdraw routes before exiting
        withdraw_route()
        sys.exit(1)
```

2. Use systemd to restart the process:

```ini
[Service]
Restart=always
RestartSec=10
```

3. Add logging to debug crashes:

```python
import logging

logging.basicConfig(
    filename='/var/log/exabgp/healthcheck.log',
    level=logging.DEBUG,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
```

### Problem: Slow Health Checks

**Symptoms**: Health checks take too long, affecting responsiveness.

**Solutions**:

1. Use shorter timeouts:

```python
requests.get(url, timeout=3)  # 3 second timeout
```

2. Run checks in parallel (for multiple services):

```python
import concurrent.futures


def check_all_services():
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        futures = {
            executor.submit(check_web): "web",
            executor.submit(check_api): "api",
            executor.submit(check_db): "db",
        }

        results = {}
        for future in concurrent.futures.as_completed(futures):
            service = futures[future]
            results[service] = future.result()

        return all(results.values())
```

3. Use lighter health check methods:

```python
# TCP connection check (faster than HTTP)
import socket


def check_tcp(host, port):
    try:
        sock = socket.create_connection((host, port), timeout=2)
        sock.close()
        return True
    except OSError:
        return False
```

## See Also

- [Service High Availability](../Use-Cases/Service-High-Availability) - HA patterns with ExaBGP
- [Anycast Management](../Use-Cases/Anycast-Management) - Anycast architectures
- [API Overview](../API/API-Overview) - ExaBGP API documentation
- [Monitoring](Monitoring) - Production monitoring setup
- [Debugging](Debugging) - Troubleshooting ExaBGP issues
- [Healthcheck Module](../Tools/Healthcheck-Module) - Built-in healthcheck module details

---