Healthcheck Module
Health checks are essential for reliable ExaBGP deployments, ensuring routes are only announced when services are actually healthy. This guide covers implementing robust health check modules for various scenarios.
- Overview
- Basic Health Check Pattern
- Health Check Types
- Dampening and Flap Prevention
- Advanced Patterns
- Production Health Check Module
- Integration Examples
- Monitoring and Logging
- Common Pitfalls
- See Also
A health check module continuously monitors service health and controls BGP route announcements based on service state.
Key Principles:
- Rise/Fall Dampening: Require multiple consecutive passes/fails before changing state
- Timeout Handling: Health checks must have timeouts (don't hang indefinitely)
- Logging: Log all state changes for troubleshooting
- Graceful Degradation: Handle partial failures intelligently
Basic Flow:
[Health Check]  →  [Dampening Logic]   →  [BGP Announcement/Withdrawal]
      ↓                    ↓                          ↓
Service State     Rise/Fall Counters        ExaBGP Route Control
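The flow above can be traced on a fixed sequence of check results. The sketch below is illustrative only: `trace_dampening` is a hypothetical helper, not part of ExaBGP, and it only records which action the dampening stage would trigger at each step, using the same rise/fall counters as the basic module that follows.

```python
def trace_dampening(results, rise=3, fall=2):
    """Return (index, action) pairs for a sequence of boolean check results."""
    rise_count = 0
    fall_count = 0
    announced = False
    actions = []
    for index, healthy in enumerate(results):
        if healthy:
            rise_count += 1
            fall_count = 0
            # Announce only after `rise` consecutive passes
            if rise_count >= rise and not announced:
                announced = True
                actions.append((index, "announce"))
        else:
            fall_count += 1
            rise_count = 0
            # Withdraw only after `fall` consecutive failures
            if fall_count >= fall and announced:
                announced = False
                actions.append((index, "withdraw"))
    return actions


# Three passes announce, one isolated failure is absorbed, two failures withdraw
checks = [True, True, True, False, True, False, False]
print(trace_dampening(checks))  # [(2, 'announce'), (6, 'withdraw')]
```

The isolated failure at index 3 changes nothing, which is the point of dampening: the route state moves only on a sustained run of results.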
#!/usr/bin/env python3
"""
Basic health check module for ExaBGP
Announces route when service is healthy, withdraws when unhealthy
"""
import sys
import time
import subprocess
import logging

# Configuration
SERVICE_IP = "100.64.1.1/32"
CHECK_INTERVAL = 5  # seconds
RISE_THRESHOLD = 3  # consecutive passes before announcing
FALL_THRESHOLD = 2  # consecutive failures before withdrawing

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s: %(message)s',
    handlers=[
        logging.FileHandler('/var/log/exabgp-healthcheck.log'),
        logging.StreamHandler(sys.stderr)
    ]
)


def check_service_health():
    """
    Check if service is healthy
    Returns True if healthy, False otherwise
    """
    try:
        # Example: HTTP health check
        result = subprocess.run(
            ['curl', '-sf', 'http://localhost/health'],
            timeout=2,
            capture_output=True
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        logging.warning("Health check timed out")
        return False
    except Exception as e:
        logging.error(f"Health check error: {e}")
        return False


def announce_route():
    """Announce BGP route"""
    print(f"announce route {SERVICE_IP} next-hop self")
    sys.stdout.flush()
    logging.info(f"Announced route {SERVICE_IP}")


def withdraw_route():
    """Withdraw BGP route"""
    print(f"withdraw route {SERVICE_IP} next-hop self")
    sys.stdout.flush()
    logging.warning(f"Withdrew route {SERVICE_IP}")


def main():
    rise_count = 0
    fall_count = 0
    announced = False
    logging.info("Health check module started")
    while True:
        healthy = check_service_health()
        if healthy:
            rise_count += 1
            fall_count = 0
            if rise_count >= RISE_THRESHOLD and not announced:
                announce_route()
                announced = True
                rise_count = 0
        else:
            fall_count += 1
            rise_count = 0
            if fall_count >= FALL_THRESHOLD and announced:
                withdraw_route()
                announced = False
                fall_count = 0
        time.sleep(CHECK_INTERVAL)


if __name__ == '__main__':
    main()

# /etc/exabgp/healthcheck.conf
neighbor 192.0.2.1 {
    router-id 192.0.2.2;
    local-address 192.0.2.2;
    local-as 65001;
    peer-as 65000;

    family {
        ipv4 unicast;
    }

    api {
        processes [ healthcheck ];
    }
}

process healthcheck {
    run /usr/local/bin/exabgp-healthcheck.py;
    encoder text;
}

Use Case: Web servers, APIs, load balancers
import requests


def http_health_check(url, timeout=2):
    """
    Check HTTP endpoint
    Returns True if the status code is 200
    """
    try:
        response = requests.get(url, timeout=timeout)
        return response.status_code == 200
    except requests.exceptions.RequestException:
        return False


# With content verification
def http_health_check_advanced(url, expected_text="OK", timeout=2):
    """Check HTTP endpoint with content verification"""
    try:
        response = requests.get(url, timeout=timeout)
        return response.status_code == 200 and expected_text in response.text
    except requests.exceptions.RequestException:
        return False


# Example usage
healthy = http_health_check("http://localhost:8080/health")
healthy = http_health_check_advanced("https://localhost/status", expected_text='"status":"up"')

Use Case: Databases, message queues, generic TCP services
import socket


def tcp_port_check(host, port, timeout=2):
    """
    Check if TCP port is open and accepting connections
    """
    try:
        # The context manager closes the socket even if connect_ex raises
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            result = sock.connect_ex((host, port))
            return result == 0
    except Exception:
        return False


# Example usage
healthy = tcp_port_check("localhost", 3306)  # MySQL
healthy = tcp_port_check("localhost", 5432)  # PostgreSQL
healthy = tcp_port_check("localhost", 6379)  # Redis

Use Case: Network reachability, simple aliveness
import subprocess


def ping_check(host, count=1, timeout=2):
    """
    Ping host and return True if reachable
    """
    try:
        result = subprocess.run(
            ['ping', '-c', str(count), '-W', str(timeout), host],
            timeout=timeout + 1,
            capture_output=True
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    except Exception:
        return False


# Example usage
healthy = ping_check("192.168.1.1")

Use Case: Custom check scripts, database queries, file checks
import subprocess


def command_check(command, timeout=5):
    """
    Execute command and return True if exit code is 0
    """
    try:
        result = subprocess.run(
            command,
            # Strings go through the shell; argument lists are executed directly
            shell=isinstance(command, str),
            timeout=timeout,
            capture_output=True
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    except Exception:
        return False


# Examples
healthy = command_check("systemctl is-active nginx")
healthy = command_check(["mysql", "-e", "SELECT 1"])
healthy = command_check("test -f /var/run/myapp.pid")

Use Case: Multiple services must all be healthy
def multi_service_check():
    """
    Check multiple services - all must be healthy
    """
    checks = {
        'nginx': lambda: http_health_check("http://localhost:80"),
        'redis': lambda: tcp_port_check("localhost", 6379),
        'app': lambda: http_health_check("http://localhost:8080/health"),
    }
    results = {}
    for name, check_func in checks.items():
        results[name] = check_func()
        if not results[name]:
            logging.warning(f"Service {name} is unhealthy")
    all_healthy = all(results.values())
    logging.info(f"Health check results: {results}, all healthy: {all_healthy}")
    return all_healthy

Problem: Transient failures cause route flapping.
Solution: Require multiple consecutive passes/fails.
class HealthCheckDampener:
    """Dampening logic for health checks"""

    def __init__(self, rise_threshold=3, fall_threshold=2):
        self.rise_threshold = rise_threshold
        self.fall_threshold = fall_threshold
        self.rise_count = 0
        self.fall_count = 0
        self.state = 'down'  # Current state: 'up' or 'down'

    def update(self, healthy):
        """
        Update health state based on check result
        Returns True if state changed
        """
        previous_state = self.state
        if healthy:
            self.rise_count += 1
            self.fall_count = 0
            if self.rise_count >= self.rise_threshold:
                self.state = 'up'
                self.rise_count = 0
        else:
            self.fall_count += 1
            self.rise_count = 0
            if self.fall_count >= self.fall_threshold:
                self.state = 'down'
                self.fall_count = 0
        return self.state != previous_state

    def is_up(self):
        """Return True if state is 'up'"""
        return self.state == 'up'


# Usage
dampener = HealthCheckDampener(rise_threshold=3, fall_threshold=2)
while True:
    healthy = check_service_health()
    state_changed = dampener.update(healthy)
    if state_changed:
        if dampener.is_up():
            announce_route()
        else:
            withdraw_route()
    time.sleep(5)

Use different thresholds for bringing route up vs down:
RISE_THRESHOLD = 3  # Require 3 passes to announce (cautious)
FALL_THRESHOLD = 2  # Only 2 failures to withdraw (fast failover)

Rationale:
- Higher rise threshold: Avoid announcing prematurely after restart
- Lower fall threshold: Fail fast when service actually dies
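To make the trade-off concrete, the thresholds can be turned into approximate reaction times. The sketch below is a rough estimate under stated assumptions: `worst_case_seconds` is a hypothetical helper, the loop is assumed to run one check and then sleep for the check interval (as in the basic module), and `check_duration` stands for how long each check takes, which for a hung service is roughly the check timeout.

```python
def worst_case_seconds(threshold, check_interval, check_duration=0):
    """Approximate upper bound on the time needed to reach a threshold.

    Each loop iteration costs one check plus one sleep, and the state
    change needs `threshold` consecutive iterations with the same result.
    """
    return threshold * (check_interval + check_duration)


# Values from this guide: 5 s interval, rise 3, fall 2
announce_delay = worst_case_seconds(3, 5)
withdraw_delay = worst_case_seconds(2, 5)
# Failing checks that run into a 2 s timeout stretch each iteration
withdraw_delay_with_timeouts = worst_case_seconds(2, 5, check_duration=2)

print(announce_delay, withdraw_delay, withdraw_delay_with_timeouts)
```

With these numbers a recovered service is announced after roughly 15 seconds, and a dead one is withdrawn after roughly 10 to 14 seconds, depending on whether the failing checks return immediately or wait for the timeout.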
Use Case: Different checks have different importance.
def weighted_health_check():
    """
    Weighted health checks - return True if score >= threshold
    """
    checks = {
        'critical': {
            'app_health': {'weight': 10, 'check': lambda: http_health_check("http://localhost:8080/health")},
            'database': {'weight': 10, 'check': lambda: tcp_port_check("localhost", 5432)},
        },
        'important': {
            'cache': {'weight': 5, 'check': lambda: tcp_port_check("localhost", 6379)},
        },
        'optional': {
            'monitoring': {'weight': 1, 'check': lambda: tcp_port_check("localhost", 9090)},
        }
    }
    total_score = 0
    max_score = 0
    for category, items in checks.items():
        for name, config in items.items():
            max_score += config['weight']
            if config['check']():
                total_score += config['weight']
            else:
                logging.warning(f"Check {name} ({category}) failed")
    health_percentage = (total_score / max_score) * 100 if max_score > 0 else 0
    healthy = health_percentage >= 80  # Require 80% score
    logging.info(f"Health score: {total_score}/{max_score} ({health_percentage:.1f}%)")
    return healthy

Use Case: Service A depends on Service B.
def dependency_check():
    """
    Check dependencies in order - fail fast if dependency fails
    """
    # Check critical dependencies first
    if not tcp_port_check("localhost", 5432):  # Database
        logging.error("Database down - service cannot function")
        return False
    if not tcp_port_check("localhost", 6379):  # Cache
        logging.error("Cache down - service cannot function")
        return False
    # Only check app if dependencies are up
    if not http_health_check("http://localhost:8080/health"):
        logging.error("App health check failed")
        return False
    return True

Use Case: Announce with higher MED when degraded (not fully healthy).
def graceful_degradation_check():
    """
    Return health status with degradation level
    Returns: ('healthy', 'degraded', or 'down'), med_value
    """
    # Check critical services
    app_ok = http_health_check("http://localhost:8080/health")
    db_ok = tcp_port_check("localhost", 5432)
    # Check optional services
    cache_ok = tcp_port_check("localhost", 6379)
    if app_ok and db_ok and cache_ok:
        return ('healthy', 100)  # MED 100 - fully healthy
    elif app_ok and db_ok:
        return ('degraded', 150)  # MED 150 - degraded (no cache)
    else:
        return ('down', None)  # Completely down


# Usage
while True:
    status, med = graceful_degradation_check()
    if status == 'healthy':
        print(f"announce route {SERVICE_IP} next-hop self med {med}")
        sys.stdout.flush()
    elif status == 'degraded':
        print(f"announce route {SERVICE_IP} next-hop self med {med}")
        sys.stdout.flush()
        logging.warning("Service degraded - announcing with higher MED")
    elif status == 'down':
        print(f"withdraw route {SERVICE_IP} next-hop self")
        sys.stdout.flush()
    time.sleep(10)

Complete production-ready health check module with all features:
#!/usr/bin/env python3
"""
Production Health Check Module for ExaBGP
Features:
- Multiple check types (HTTP, TCP, command)
- Rise/fall dampening
- Weighted checks
- Graceful degradation with MED
- Comprehensive logging
- Signal handling
"""
import sys
import time
import signal
import logging
import subprocess
import socket
from typing import Dict, Callable, Tuple, Optional

# Configuration
CONFIG = {
    'service_ip': '100.64.1.1/32',
    'check_interval': 5,
    'rise_threshold': 3,
    'fall_threshold': 2,
    'log_file': '/var/log/exabgp-healthcheck.log',
}

# Health checks configuration
CHECKS = {
    'app_http': {
        'type': 'http',
        'url': 'http://localhost:8080/health',
        'weight': 10,
        'timeout': 2,
    },
    'database': {
        'type': 'tcp',
        'host': 'localhost',
        'port': 5432,
        'weight': 10,
        'timeout': 2,
    },
    'cache': {
        'type': 'tcp',
        'host': 'localhost',
        'port': 6379,
        'weight': 5,
        'timeout': 2,
    },
}

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s: %(message)s',
    handlers=[
        logging.FileHandler(CONFIG['log_file']),
        logging.StreamHandler(sys.stderr)
    ]
)

# Global shutdown flag
shutdown_flag = False


def signal_handler(signum, frame):
    """Handle shutdown signals gracefully"""
    global shutdown_flag
    logging.info(f"Received signal {signum}, shutting down gracefully")
    shutdown_flag = True


signal.signal(signal.SIGTERM, signal_handler)
signal.signal(signal.SIGINT, signal_handler)


def http_check(url: str, timeout: int = 2) -> bool:
    """HTTP health check"""
    try:
        import requests
        response = requests.get(url, timeout=timeout)
        return response.status_code == 200
    except Exception as e:
        logging.debug(f"HTTP check failed for {url}: {e}")
        return False


def tcp_check(host: str, port: int, timeout: int = 2) -> bool:
    """TCP port check"""
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        result = sock.connect_ex((host, port))
        sock.close()
        return result == 0
    except Exception as e:
        logging.debug(f"TCP check failed for {host}:{port}: {e}")
        return False


def command_check(command: str, timeout: int = 5) -> bool:
    """Command execution check"""
    try:
        result = subprocess.run(
            command,
            shell=True,
            timeout=timeout,
            capture_output=True
        )
        return result.returncode == 0
    except Exception as e:
        logging.debug(f"Command check failed for '{command}': {e}")
        return False


def run_checks() -> Tuple[bool, Optional[int]]:
    """
    Run all configured health checks
    Returns: (healthy: bool, med: int, or None when the route should be withdrawn)
    """
    total_weight = sum(check['weight'] for check in CHECKS.values())
    current_weight = 0
    for name, config in CHECKS.items():
        check_type = config['type']
        passed = False
        if check_type == 'http':
            passed = http_check(config['url'], config.get('timeout', 2))
        elif check_type == 'tcp':
            passed = tcp_check(config['host'], config['port'], config.get('timeout', 2))
        elif check_type == 'command':
            passed = command_check(config['command'], config.get('timeout', 5))
        if passed:
            current_weight += config['weight']
        else:
            logging.warning(f"Check '{name}' failed")
    health_percentage = (current_weight / total_weight) * 100 if total_weight > 0 else 0
    # Determine health status and MED
    if health_percentage >= 90:
        # Fully healthy
        return (True, 100)
    elif health_percentage >= 70:
        # Degraded but functional
        return (True, 150)
    else:
        # Too degraded, withdraw
        return (False, None)


class HealthState:
    """Track health state with dampening"""

    def __init__(self, rise_threshold: int, fall_threshold: int):
        self.rise_threshold = rise_threshold
        self.fall_threshold = fall_threshold
        self.rise_count = 0
        self.fall_count = 0
        self.announced = False
        self.current_med = None

    def update(self, healthy: bool, med: Optional[int]) -> bool:
        """
        Update state based on check result
        Returns True if announcement state should change
        """
        if healthy:
            self.rise_count += 1
            self.fall_count = 0
            if self.rise_count >= self.rise_threshold or self.announced:
                # Announce or update MED
                should_update = not self.announced or self.current_med != med
                self.announced = True
                self.current_med = med
                self.rise_count = 0
                return should_update
        else:
            self.fall_count += 1
            self.rise_count = 0
            if self.fall_count >= self.fall_threshold and self.announced:
                # Withdraw
                self.announced = False
                self.current_med = None
                self.fall_count = 0
                return True
        return False


def announce_route(med: int):
    """Announce BGP route with MED"""
    cmd = f"announce route {CONFIG['service_ip']} next-hop self med {med}"
    print(cmd)
    sys.stdout.flush()
    logging.info(f"Announced route with MED {med}")


def withdraw_route():
    """Withdraw BGP route"""
    cmd = f"withdraw route {CONFIG['service_ip']} next-hop self"
    print(cmd)
    sys.stdout.flush()
    logging.warning("Withdrew route")


def main():
    """Main health check loop"""
    logging.info("Production health check module started")
    state = HealthState(CONFIG['rise_threshold'], CONFIG['fall_threshold'])
    while not shutdown_flag:
        healthy, med = run_checks()
        should_update = state.update(healthy, med)
        if should_update:
            if state.announced:
                announce_route(state.current_med)
            else:
                withdraw_route()
        time.sleep(CONFIG['check_interval'])
    # Graceful shutdown - withdraw route
    if state.announced:
        logging.info("Shutting down - withdrawing route")
        withdraw_route()
    logging.info("Health check module stopped")


if __name__ == '__main__':
    main()

Monitor HAProxy backend health:
import requests


def haproxy_backend_check(stats_url, backend_name, timeout=2):
    """Check if HAProxy backend has at least one UP server"""
    try:
        # Bound the request so a hung stats page cannot stall the check loop
        response = requests.get(f"{stats_url};csv", timeout=timeout)
        lines = response.text.split('\n')
        for line in lines:
            if backend_name in line and ',UP,' in line:
                return True
        return False
    except requests.exceptions.RequestException:
        return False


# Usage
healthy = haproxy_backend_check("http://localhost:8404/stats", "webservers")

Check pod readiness:
import subprocess
import json


def kubernetes_pod_ready(namespace, app_label):
    """Check if at least one pod with app label is ready"""
    try:
        result = subprocess.run(
            ['kubectl', 'get', 'pods', '-n', namespace,
             '-l', f'app={app_label}', '-o', 'json'],
            timeout=5,
            capture_output=True
        )
        if result.returncode != 0:
            return False
        pods = json.loads(result.stdout)
        for pod in pods.get('items', []):
            conditions = pod.get('status', {}).get('conditions', [])
            for condition in conditions:
                if condition['type'] == 'Ready' and condition['status'] == 'True':
                    return True
        return False
    except Exception:
        # Covers timeouts, a missing kubectl binary, and malformed JSON
        return False


# Usage
healthy = kubernetes_pod_ready("default", "myapp")

Export health check metrics for Prometheus:
from prometheus_client import Gauge, Counter, start_http_server

# Metrics
health_status = Gauge('exabgp_health_status', 'Current health status (1=up, 0=down)')
check_duration = Gauge('exabgp_check_duration_seconds', 'Health check duration')
state_changes = Counter('exabgp_state_changes_total', 'Total state changes', ['from_state', 'to_state'])

# Start metrics server
start_http_server(9100)

# Update metrics inside the check loop, where `healthy` and `duration`
# (the measured check time in seconds) are already computed
health_status.set(1 if healthy else 0)
check_duration.set(duration)
state_changes.labels(from_state='down', to_state='up').inc()

Use structured logging for better analysis:
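import json
import logging


class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_obj = {
            'timestamp': self.formatTime(record),
            'level': record.levelname,
            'message': record.getMessage(),
        }
        # Fields passed via extra={...} become attributes on the record;
        # copy the ones used here so they appear in the JSON output
        for key in ('check', 'url'):
            if hasattr(record, key):
                log_obj[key] = getattr(record, key)
        return json.dumps(log_obj)


handler = logging.FileHandler('/var/log/exabgp-healthcheck.json')
handler.setFormatter(JSONFormatter())
logging.getLogger().addHandler(handler)
# The root logger defaults to WARNING; lower it so INFO records are emitted
logging.getLogger().setLevel(logging.INFO)
logging.info("Health check passed", extra={'check': 'http', 'url': 'http://localhost:8080'})

- No timeout on checks: Always set timeouts (2-5 seconds typical)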
- No dampening: Causes route flapping on transient failures
- Blocking checks: Use subprocess.run with timeout, not os.system
- Forgot sys.stdout.flush(): Commands buffer and don't reach ExaBGP
- No logging: Impossible to troubleshoot when things break
- Checking too frequently: Every 5-10 seconds is usually sufficient
- Not handling shutdown gracefully: Routes not withdrawn on stop
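The forgotten-flush pitfall is easy to avoid by routing every command through one small helper. The sketch below is one possible shape, not an ExaBGP API: `send_command` and its `stream` parameter are hypothetical names, and the parameter exists only so the helper can be exercised against an in-memory buffer instead of the pipe to ExaBGP.

```python
import io
import sys


def send_command(command, stream=None):
    """Write one API command as a single line and flush immediately."""
    # Resolve sys.stdout at call time so redirection still works
    stream = sys.stdout if stream is None else stream
    stream.write(command + "\n")
    # Without the flush the line can sit in the buffer and never reach ExaBGP
    stream.flush()


# Exercise the helper against an in-memory buffer
buffer = io.StringIO()
send_command("announce route 100.64.1.1/32 next-hop self", stream=buffer)
print(repr(buffer.getvalue()))
```

In a real module the `stream` argument would be left out, so the command goes to standard output, which ExaBGP reads.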
- Service High Availability - HA patterns
- Anycast Management - Anycast with health checks
- Production Best Practices - Production deployment
- Common Pitfalls - Common mistakes to avoid
👻 Ghost written by Claude (Anthropic AI)