Healthcheck Module

Thomas Mangin edited this page Nov 10, 2025 · 10 revisions

Health Check Module

Health checks are essential for reliable ExaBGP deployments, ensuring routes are only announced when services are actually healthy. This guide covers implementing robust health check modules for various scenarios.

📚 Recommended Reading

Vincent Bernat's blog post: High availability with ExaBGP is an excellent real-world guide to production health check patterns and is highly recommended reading alongside this documentation.

⚠️ ExaBGP Version Compatibility

  • ExaBGP 4.x: All examples in this guide work as-is. The API does NOT send ACK responses.
  • ExaBGP 5.x/main: The ACK feature is enabled by default and sends acknowledgments for every command. Scripts in this guide do NOT handle ACK messages.

For ExaBGP 5.x/main users:

  • Option 1 (Recommended): ACK is automatically disabled in process blocks - no changes needed
  • Option 2: Handle ACK messages using select() - See ACK Feature Guide

All examples assume ExaBGP 4.x behavior unless specifically noted.
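
Option 2 above mentions select(). The following is a minimal sketch of that idea, assuming ExaBGP writes each acknowledgment as a line on the script's stdin. The exact reply strings are defined in the ACK Feature Guide and are treated here as opaque text. The helper name `drain_acks` is hypothetical, and the code is POSIX-only because select() on pipes is not available on Windows.

```python
import os
import select
import sys

def drain_acks(fd=None, timeout=0.1):
    """Collect reply lines waiting on fd, giving up after `timeout` seconds of silence."""
    if fd is None:
        fd = sys.stdin.fileno()  # assumption: replies arrive on the script's stdin
    buffer = b""
    while True:
        ready, _, _ = select.select([fd], [], [], timeout)
        if not ready:
            break  # no data arrived within the timeout
        chunk = os.read(fd, 4096)
        if not chunk:
            break  # EOF: the writer closed the pipe
        buffer += chunk
    return buffer.decode(errors="replace").splitlines()
```

A script could call `drain_acks()` after each flushed command so that unread replies do not pile up in the pipe.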


Overview

A health check module continuously monitors service health and controls BGP route announcements based on service state.

Key Principles:

  1. Rise/Fall Dampening: Require multiple consecutive passes/fails before changing state
  2. Timeout Handling: Health checks must have timeouts (don't hang indefinitely)
  3. Logging: Log all state changes for troubleshooting
  4. Graceful Degradation: Handle partial failures intelligently

Basic Flow:

[Health Check] → [Dampening Logic] → [BGP Announcement/Withdrawal]
     ↓                   ↓                      ↓
  Service State    Rise/Fall Counters    ExaBGP Route Control

Basic Health Check Pattern

Simple Health Check Script

#!/usr/bin/env python3
"""
Basic health check module for ExaBGP
Announces route when service is healthy, withdraws when unhealthy
"""

import sys
import time
import subprocess
import logging

# Configuration
SERVICE_IP = "100.64.1.1/32"
CHECK_INTERVAL = 5  # seconds
RISE_THRESHOLD = 3  # consecutive passes before announcing
FALL_THRESHOLD = 2  # consecutive failures before withdrawing

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s: %(message)s',
    handlers=[
        logging.FileHandler('/var/log/exabgp-healthcheck.log'),
        logging.StreamHandler(sys.stderr)
    ]
)

def check_service_health():
    """
    Check if service is healthy
    Returns True if healthy, False otherwise
    """
    try:
        # Example: HTTP health check
        result = subprocess.run(
            ['curl', '-sf', 'http://localhost/health'],
            timeout=2,
            capture_output=True
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        logging.warning("Health check timed out")
        return False
    except Exception as e:
        logging.error(f"Health check error: {e}")
        return False

def announce_route():
    """Announce BGP route"""
    # Use next-hop self: 192.0.2.1 is the peer's own address in the config below
    print(f"announce route {SERVICE_IP} next-hop self")
    sys.stdout.flush()
    logging.info(f"Announced route {SERVICE_IP}")

def withdraw_route():
    """Withdraw BGP route"""
    print(f"withdraw route {SERVICE_IP}")
    sys.stdout.flush()
    logging.warning(f"Withdrew route {SERVICE_IP}")

def main():
    rise_count = 0
    fall_count = 0
    announced = False

    logging.info("Health check module started")

    while True:
        healthy = check_service_health()

        if healthy:
            rise_count += 1
            fall_count = 0

            if rise_count >= RISE_THRESHOLD and not announced:
                announce_route()
                announced = True
                rise_count = 0

        else:
            fall_count += 1
            rise_count = 0

            if fall_count >= FALL_THRESHOLD and announced:
                withdraw_route()
                announced = False
                fall_count = 0

        time.sleep(CHECK_INTERVAL)

if __name__ == '__main__':
    main()

ExaBGP Configuration

# /etc/exabgp/healthcheck.conf

neighbor 192.0.2.1 {
    router-id 192.0.2.2;
    local-address 192.0.2.2;
    local-as 65001;
    peer-as 65000;

    family {
        ipv4 unicast;
    }

    api {
        processes [ healthcheck ];
    }
}

process healthcheck {
    run /usr/local/bin/exabgp-healthcheck.py;
    encoder text;
}

Health Check Types

HTTP/HTTPS Health Checks

Use Case: Web servers, APIs, load balancers

import requests

def http_health_check(url, timeout=2):
    """
    Check HTTP endpoint
    Returns True if the status code is 200
    """
    try:
        response = requests.get(url, timeout=timeout)
        return response.status_code == 200
    except requests.exceptions.RequestException:
        return False

# With content verification
def http_health_check_advanced(url, expected_text="OK", timeout=2):
    """Check HTTP endpoint with content verification"""
    try:
        response = requests.get(url, timeout=timeout)
        return response.status_code == 200 and expected_text in response.text
    except requests.exceptions.RequestException:
        return False

# Example usage
healthy = http_health_check("http://localhost:8080/health")
healthy = http_health_check_advanced("https://localhost/status", expected_text='"status":"up"')
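
Where installing `requests` is not an option, the same check can be sketched with the standard library. This is an illustrative variant, not part of the original examples; the function name `http_health_check_stdlib` is hypothetical.

```python
import http.client
import urllib.error
import urllib.request

def http_health_check_stdlib(url, timeout=2):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # URLError covers 4xx/5xx responses and connection failures,
        # OSError covers socket timeouts, ValueError covers malformed URLs.
        return False
```

Note that `urlopen` follows redirects and raises on 4xx/5xx responses, so only a final 200 counts as healthy.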

TCP Port Checks

Use Case: Databases, message queues, generic TCP services

import socket

def tcp_port_check(host, port, timeout=2):
    """
    Check if TCP port is open and accepting connections
    """
    try:
        # The with-block closes the socket even if connect_ex raises
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            return sock.connect_ex((host, port)) == 0
    except Exception:
        return False

# Example usage
healthy = tcp_port_check("localhost", 3306)  # MySQL
healthy = tcp_port_check("localhost", 5432)  # PostgreSQL
healthy = tcp_port_check("localhost", 6379)  # Redis
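
The check above is tied to IPv4 through `AF_INET`. A minimal sketch of a family-agnostic variant, assuming the service may listen on IPv4 or IPv6, uses `socket.create_connection`, which tries each address the name resolves to. The function name `tcp_port_check_any_family` is hypothetical.

```python
import socket

def tcp_port_check_any_family(host, port, timeout=2):
    """Return True if a TCP connection to host:port succeeds over IPv4 or IPv6."""
    try:
        # create_connection resolves the host and tries each address in turn
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures
        return False
```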

ICMP Ping Checks

Use Case: Network reachability, simple aliveness

import subprocess

def ping_check(host, count=1, timeout=2):
    """
    Ping host and return True if reachable
    """
    try:
        result = subprocess.run(
            ['ping', '-c', str(count), '-W', str(timeout), host],
            timeout=timeout + 1,
            capture_output=True
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    except Exception:
        return False

# Example usage
healthy = ping_check("192.168.1.1")

Command Execution Checks

Use Case: Custom check scripts, database queries, file checks

import subprocess

def command_check(command, timeout=5):
    """
    Execute command and return True if exit code is 0
    """
    try:
        result = subprocess.run(
            command,
            shell=isinstance(command, str),  # strings run through the shell, lists do not
            timeout=timeout,
            capture_output=True
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    except Exception:
        return False

# Examples
healthy = command_check("systemctl is-active nginx")
healthy = command_check(["mysql", "-e", "SELECT 1"])
healthy = command_check("test -f /var/run/myapp.pid")
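
Running strings through the shell is convenient but exposes the check to quoting and injection problems if any part of the command comes from configuration. A hedged alternative, assuming no shell features such as pipes or redirection are needed, splits the string with `shlex` and runs it directly. The name `command_check_no_shell` is hypothetical.

```python
import shlex
import subprocess

def command_check_no_shell(command, timeout=5):
    """Run a command without a shell; return True if it exits with code 0."""
    # Accept either a string ("systemctl is-active nginx") or an argv list
    argv = shlex.split(command) if isinstance(command, str) else list(command)
    if not argv:
        return False  # nothing to run
    try:
        result = subprocess.run(argv, timeout=timeout, capture_output=True)
        return result.returncode == 0
    except (subprocess.TimeoutExpired, OSError, ValueError):
        # OSError covers a missing or non-executable binary
        return False
```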

Multi-Service Checks

Use Case: Multiple services must all be healthy

def multi_service_check():
    """
    Check multiple services - all must be healthy
    """
    checks = {
        'nginx': lambda: http_health_check("http://localhost:80"),
        'redis': lambda: tcp_port_check("localhost", 6379),
        'app': lambda: http_health_check("http://localhost:8080/health"),
    }

    results = {}
    for name, check_func in checks.items():
        results[name] = check_func()
        if not results[name]:
            logging.warning(f"Service {name} is unhealthy")

    all_healthy = all(results.values())
    logging.info(f"Health check results: {results}, all healthy: {all_healthy}")

    return all_healthy
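
The loop above runs checks one after another, so the worst-case duration is the sum of all timeouts. As a sketch, assuming each check is an independent zero-argument callable, the checks can run in a thread pool so that the total time is closer to the slowest single check. The helper `run_checks_concurrently` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor

def run_checks_concurrently(checks, max_workers=8):
    """Run {name: callable} checks in parallel and return {name: bool}."""
    def safe(check):
        # A check that raises is treated as a failed check
        try:
            return bool(check())
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(safe, check) for name, check in checks.items()}
        return {name: future.result() for name, future in futures.items()}
```

The result dictionary can be passed to `all(results.values())` exactly as in `multi_service_check()`.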

Dampening and Flap Prevention

Rise/Fall Counters

Problem: Transient failures cause route flapping.

Solution: Require multiple consecutive passes/fails.

class HealthCheckDampener:
    """Dampening logic for health checks"""

    def __init__(self, rise_threshold=3, fall_threshold=2):
        self.rise_threshold = rise_threshold
        self.fall_threshold = fall_threshold
        self.rise_count = 0
        self.fall_count = 0
        self.state = 'down'  # Current state: 'up' or 'down'

    def update(self, healthy):
        """
        Update health state based on check result
        Returns True if state changed
        """
        previous_state = self.state

        if healthy:
            self.rise_count += 1
            self.fall_count = 0

            if self.rise_count >= self.rise_threshold:
                self.state = 'up'
                self.rise_count = 0

        else:
            self.fall_count += 1
            self.rise_count = 0

            if self.fall_count >= self.fall_threshold:
                self.state = 'down'
                self.fall_count = 0

        return self.state != previous_state

    def is_up(self):
        """Return True if state is 'up'"""
        return self.state == 'up'

# Usage
dampener = HealthCheckDampener(rise_threshold=3, fall_threshold=2)

while True:
    healthy = check_service_health()
    state_changed = dampener.update(healthy)

    if state_changed:
        if dampener.is_up():
            announce_route()
        else:
            withdraw_route()

    time.sleep(5)
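
To see how the counters behave, it helps to replay a fixed sequence of results. The sketch below mirrors the logic of `HealthCheckDampener` in a single function (the name `simulate_dampening` is hypothetical) and records the index at which each state change happens.

```python
def simulate_dampening(results, rise_threshold=3, fall_threshold=2):
    """Replay check results and return a list of (index, new_state) transitions."""
    state = 'down'
    rise_count = 0
    fall_count = 0
    transitions = []

    for index, healthy in enumerate(results):
        previous_state = state
        if healthy:
            rise_count += 1
            fall_count = 0
            if rise_count >= rise_threshold:
                state = 'up'
                rise_count = 0
        else:
            fall_count += 1
            rise_count = 0
            if fall_count >= fall_threshold:
                state = 'down'
                fall_count = 0
        if state != previous_state:
            transitions.append((index, state))

    return transitions

# Two passes, one failure, three passes, then a blip, then two failures
sequence = [True, True, False, True, True, True, False, True, False, False]
print(simulate_dampening(sequence))  # [(5, 'up'), (9, 'down')]
```

The single failures at index 2 and index 6 never change the announced state: the first only delays the announcement, and the second is absorbed while the route stays up.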

Hysteresis (Different Thresholds)

Use different thresholds for bringing route up vs down:

RISE_THRESHOLD = 3  # Require 3 passes to announce (cautious)
FALL_THRESHOLD = 2  # Only 2 failures to withdraw (fast failover)

Rationale:

  • Higher rise threshold: Avoid announcing prematurely after restart
  • Lower fall threshold: Fail fast when service actually dies
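
The thresholds translate directly into reaction time. As a rough estimate, assuming the loop runs one check (lasting at most its timeout) and then sleeps for the check interval, as in the basic script above, the delay before a state change is about threshold × (interval + timeout). The function name is hypothetical and the figure ignores a check that is already in flight when the service changes state.

```python
def worst_case_seconds(threshold, check_interval, check_timeout):
    """Rough upper estimate of the time needed to accumulate `threshold` results."""
    return threshold * (check_interval + check_timeout)

# Values from the basic script: 5 s interval, 2 s timeout
print(worst_case_seconds(2, 5, 2))  # withdraw after about 14 s
print(worst_case_seconds(3, 5, 2))  # announce after about 21 s
```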

Advanced Patterns

Weighted Health Checks

Use Case: Different checks have different importance.

def weighted_health_check():
    """
    Weighted health checks - return True if the weighted score is at least 80%
    """
    checks = {
        'critical': {
            'app_health': {'weight': 10, 'check': lambda: http_health_check("http://localhost:8080/health")},
            'database': {'weight': 10, 'check': lambda: tcp_port_check("localhost", 5432)},
        },
        'important': {
            'cache': {'weight': 5, 'check': lambda: tcp_port_check("localhost", 6379)},
        },
        'optional': {
            'monitoring': {'weight': 1, 'check': lambda: tcp_port_check("localhost", 9090)},
        }
    }

    total_score = 0
    max_score = 0

    for category, items in checks.items():
        for name, config in items.items():
            max_score += config['weight']
            if config['check']():
                total_score += config['weight']
            else:
                logging.warning(f"Check {name} ({category}) failed")

    health_percentage = (total_score / max_score) * 100 if max_score > 0 else 0
    healthy = health_percentage >= 80  # Require 80% score

    logging.info(f"Health score: {total_score}/{max_score} ({health_percentage:.1f}%)")

    return healthy
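
It is worth working through what the 80% threshold means for the weights above (total 26). The sketch below recomputes the score for a given set of failed checks; `health_percentage` is a hypothetical helper that takes the names of failed checks rather than running them.

```python
WEIGHTS = {'app_health': 10, 'database': 10, 'cache': 5, 'monitoring': 1}

def health_percentage(failed, weights=WEIGHTS):
    """Return the weighted score (0-100) when the checks named in `failed` fail."""
    total = sum(weights.values())
    passed = sum(weight for name, weight in weights.items() if name not in failed)
    return 100 * passed / total if total else 0.0

for failed in [{'monitoring'}, {'cache'}, {'cache', 'monitoring'}, {'database'}]:
    score = health_percentage(failed)
    print(sorted(failed), f"{score:.1f}%", "healthy" if score >= 80 else "unhealthy")
```

With these weights a cache failure alone scores 80.8% and still counts as healthy, but a cache failure combined with a monitoring failure drops to 76.9% and withdraws the route, even though neither check is critical. Weights and threshold should be chosen with such combinations in mind.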

Dependency Checks

Use Case: Service A depends on Service B.

def dependency_check():
    """
    Check dependencies in order - fail fast if dependency fails
    """
    # Check critical dependencies first
    if not tcp_port_check("localhost", 5432):  # Database
        logging.error("Database down - service cannot function")
        return False

    if not tcp_port_check("localhost", 6379):  # Cache
        logging.error("Cache down - service cannot function")
        return False

    # Only check app if dependencies are up
    if not http_health_check("http://localhost:8080/health"):
        logging.error("App health check failed")
        return False

    return True

Graceful Degradation

Use Case: Announce with higher MED when degraded (not fully healthy).

def graceful_degradation_check():
    """
    Return health status with degradation level
    Returns: (status, med) where status is 'healthy', 'degraded', or 'down';
    med is None when down. Peers prefer the lower MED when other attributes tie.
    """
    # Check critical services
    app_ok = http_health_check("http://localhost:8080/health")
    db_ok = tcp_port_check("localhost", 5432)

    # Check optional services
    cache_ok = tcp_port_check("localhost", 6379)

    if app_ok and db_ok and cache_ok:
        return ('healthy', 100)  # MED 100 - fully healthy

    elif app_ok and db_ok:
        return ('degraded', 150)  # MED 150 - degraded (no cache)

    else:
        return ('down', None)  # Completely down

# Usage
while True:
    status, med = graceful_degradation_check()

    if status == 'healthy':
        print(f"announce route {SERVICE_IP} next-hop self med {med}")
        sys.stdout.flush()

    elif status == 'degraded':
        print(f"announce route {SERVICE_IP} next-hop self med {med}")
        sys.stdout.flush()
        logging.warning("Service degraded - announcing with higher MED")

    elif status == 'down':
        print(f"withdraw route {SERVICE_IP}")
        sys.stdout.flush()

    time.sleep(10)
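
The usage loop above sends a command on every iteration, even when nothing has changed. A minimal sketch of change-only output, assuming the same text API commands used in this guide, remembers the last command and stays silent when it would repeat. `RouteAnnouncer` is a hypothetical helper, and `next-hop self` is used as in the production module in this guide.

```python
import sys

class RouteAnnouncer:
    """Emit announce/withdraw commands only when the desired state changes."""

    def __init__(self, prefix, next_hop="self", out=None):
        self.prefix = prefix
        self.next_hop = next_hop
        self.out = out if out is not None else sys.stdout
        self.last_command = None

    def apply(self, status, med):
        """Send a command if needed; return True if something was written."""
        if status == 'down':
            if self.last_command is None:
                return False  # never announced, nothing to withdraw
            command = f"withdraw route {self.prefix}"
        else:
            command = f"announce route {self.prefix} next-hop {self.next_hop} med {med}"

        if command == self.last_command:
            return False  # same state as before, stay silent

        self.out.write(command + "\n")
        self.out.flush()
        self.last_command = command
        return True
```

In the loop, `announcer.apply(*graceful_degradation_check())` would replace the three branches. Combining this with the rise/fall dampener is still advisable, since this sketch reacts to a single failed check.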

Production Health Check Module

Complete production-ready health check module with all features:

#!/usr/bin/env python3
"""
Production Health Check Module for ExaBGP
Features:
- Multiple check types (HTTP, TCP, command)
- Rise/fall dampening
- Weighted checks
- Graceful degradation with MED
- Comprehensive logging
- Signal handling
"""

import sys
import time
import signal
import logging
import subprocess
import socket
from typing import Optional, Tuple

# Configuration
CONFIG = {
    'service_ip': '100.64.1.1/32',
    'check_interval': 5,
    'rise_threshold': 3,
    'fall_threshold': 2,
    'log_file': '/var/log/exabgp-healthcheck.log',
}

# Health checks configuration
CHECKS = {
    'app_http': {
        'type': 'http',
        'url': 'http://localhost:8080/health',
        'weight': 10,
        'timeout': 2,
    },
    'database': {
        'type': 'tcp',
        'host': 'localhost',
        'port': 5432,
        'weight': 10,
        'timeout': 2,
    },
    'cache': {
        'type': 'tcp',
        'host': 'localhost',
        'port': 6379,
        'weight': 5,
        'timeout': 2,
    },
}

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s: %(message)s',
    handlers=[
        logging.FileHandler(CONFIG['log_file']),
        logging.StreamHandler(sys.stderr)
    ]
)

# Global shutdown flag
shutdown_flag = False

def signal_handler(signum, frame):
    """Handle shutdown signals gracefully"""
    global shutdown_flag
    logging.info(f"Received signal {signum}, shutting down gracefully")
    shutdown_flag = True

signal.signal(signal.SIGTERM, signal_handler)
signal.signal(signal.SIGINT, signal_handler)

def http_check(url: str, timeout: int = 2) -> bool:
    """HTTP health check"""
    try:
        import requests
        response = requests.get(url, timeout=timeout)
        return response.status_code == 200
    except Exception as e:
        logging.debug(f"HTTP check failed for {url}: {e}")
        return False

def tcp_check(host: str, port: int, timeout: int = 2) -> bool:
    """TCP port check"""
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        result = sock.connect_ex((host, port))
        sock.close()
        return result == 0
    except Exception as e:
        logging.debug(f"TCP check failed for {host}:{port}: {e}")
        return False

def command_check(command: str, timeout: int = 5) -> bool:
    """Command execution check"""
    try:
        result = subprocess.run(
            command,
            shell=True,
            timeout=timeout,
            capture_output=True
        )
        return result.returncode == 0
    except Exception as e:
        logging.debug(f"Command check failed for '{command}': {e}")
        return False

def run_checks() -> Tuple[bool, Optional[int]]:
    """
    Run all configured health checks
    Returns: (healthy, med); med is None when the route should be withdrawn
    """
    total_weight = sum(check['weight'] for check in CHECKS.values())
    current_weight = 0

    for name, config in CHECKS.items():
        check_type = config['type']
        passed = False

        if check_type == 'http':
            passed = http_check(config['url'], config.get('timeout', 2))
        elif check_type == 'tcp':
            passed = tcp_check(config['host'], config['port'], config.get('timeout', 2))
        elif check_type == 'command':
            passed = command_check(config['command'], config.get('timeout', 5))

        if passed:
            current_weight += config['weight']
        else:
            logging.warning(f"Check '{name}' failed")

    health_percentage = (current_weight / total_weight) * 100 if total_weight > 0 else 0

    # Determine health status and MED
    if health_percentage >= 90:
        # Fully healthy
        return (True, 100)
    elif health_percentage >= 70:
        # Degraded but functional
        return (True, 150)
    else:
        # Too degraded, withdraw
        return (False, None)

class HealthState:
    """Track health state with dampening"""

    def __init__(self, rise_threshold: int, fall_threshold: int):
        self.rise_threshold = rise_threshold
        self.fall_threshold = fall_threshold
        self.rise_count = 0
        self.fall_count = 0
        self.announced = False
        self.current_med = None

    def update(self, healthy: bool, med: Optional[int]) -> bool:
        """
        Update state based on check result
        Returns True if announcement state should change
        """
        if healthy:
            self.rise_count += 1
            self.fall_count = 0

            if self.rise_count >= self.rise_threshold or self.announced:
                # Announce or update MED
                should_update = not self.announced or self.current_med != med
                self.announced = True
                self.current_med = med
                self.rise_count = 0
                return should_update
        else:
            self.fall_count += 1
            self.rise_count = 0

            if self.fall_count >= self.fall_threshold and self.announced:
                # Withdraw
                self.announced = False
                self.current_med = None
                self.fall_count = 0
                return True

        return False

def announce_route(med: int):
    """Announce BGP route with MED"""
    cmd = f"announce route {CONFIG['service_ip']} next-hop self med {med}"
    print(cmd)
    sys.stdout.flush()
    logging.info(f"Announced route with MED {med}")

def withdraw_route():
    """Withdraw BGP route"""
    cmd = f"withdraw route {CONFIG['service_ip']} next-hop self"
    print(cmd)
    sys.stdout.flush()
    logging.warning("Withdrew route")

def main():
    """Main health check loop"""
    logging.info("Production health check module started")
    state = HealthState(CONFIG['rise_threshold'], CONFIG['fall_threshold'])

    while not shutdown_flag:
        healthy, med = run_checks()
        should_update = state.update(healthy, med)

        if should_update:
            if state.announced:
                announce_route(state.current_med)
            else:
                withdraw_route()

        time.sleep(CONFIG['check_interval'])

    # Graceful shutdown - withdraw route
    if state.announced:
        logging.info("Shutting down - withdrawing route")
        withdraw_route()

    logging.info("Health check module stopped")

if __name__ == '__main__':
    main()
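
One detail of the loop above: on Python 3.5 and later, `time.sleep()` resumes after a signal handler returns, so the module may take up to `check_interval` seconds to notice the shutdown flag. A hedged alternative, sketched here with hypothetical names (`shutdown_event`, `request_shutdown`, `run_loop`), waits on a `threading.Event` so that the loop wakes as soon as the handler sets it.

```python
import signal
import threading

shutdown_event = threading.Event()

def request_shutdown(signum=None, frame=None):
    """Signal handler: ask the main loop to stop."""
    shutdown_event.set()

def install_handlers():
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, request_shutdown)
    signal.signal(signal.SIGINT, request_shutdown)

def run_loop(check_once, interval):
    """Call check_once() every `interval` seconds until shutdown is requested.

    Returns the number of iterations that ran.
    """
    iterations = 0
    while not shutdown_event.is_set():
        check_once()
        iterations += 1
        # Returns early as soon as the event is set, unlike time.sleep()
        shutdown_event.wait(interval)
    return iterations
```

After `run_loop` returns, the script can withdraw its route exactly as `main()` does above.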

Integration Examples

With HAProxy

Monitor HAProxy backend health:

import requests

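def haproxy_backend_check(stats_url, backend_name, timeout=2):
    """Check if HAProxy backend has at least one UP server"""
    try:
        # Always set a timeout so a hung stats endpoint cannot stall the loop
        response = requests.get(f"{stats_url};csv", timeout=timeout)
        lines = response.text.split('\n')

        for line in lines:
            # CSV rows start with the proxy name; match it exactly
            if line.startswith(f"{backend_name},") and ',UP,' in line:
                return True

        return False
    except requests.exceptions.RequestException:
        return False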

# Usage
healthy = haproxy_backend_check("http://localhost:8404/stats", "webservers")
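
Substring matching on CSV lines is fragile: the aggregate BACKEND row also matches, so an UP backend is reported even when no individual server was inspected. A more careful sketch parses the CSV by column name. It assumes the stats output carries the `pxname`, `svname`, and `status` columns with a header line prefixed by `# `; the sample data in the usage lines is invented for illustration, and `haproxy_up_servers` is a hypothetical helper.

```python
import csv
import io

def haproxy_up_servers(csv_text, backend_name):
    """Return the names of servers in `backend_name` whose status starts with UP."""
    # Strip the leading "# " from the header so the first column is "pxname"
    reader = csv.DictReader(io.StringIO(csv_text.lstrip("# ")))
    up_servers = []
    for row in reader:
        if row.get("pxname") != backend_name:
            continue
        if row.get("svname") in ("FRONTEND", "BACKEND"):
            continue  # skip aggregate rows, keep individual servers only
        # startswith also accepts transitional values that begin with UP (assumption)
        if (row.get("status") or "").startswith("UP"):
            up_servers.append(row["svname"])
    return up_servers

# Invented, heavily abbreviated sample of the stats CSV
sample = (
    "# pxname,svname,scur,status,\n"
    "webservers,web1,0,UP,\n"
    "webservers,web2,0,DOWN,\n"
    "webservers,BACKEND,0,UP,\n"
    "webservers-old,web9,0,UP,\n"
)
print(haproxy_up_servers(sample, "webservers"))  # ['web1']
```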

With Kubernetes

Check pod readiness:

import subprocess
import json

def kubernetes_pod_ready(namespace, app_label):
    """Check if at least one pod with app label is ready"""
    try:
        result = subprocess.run(
            ['kubectl', 'get', 'pods', '-n', namespace,
             '-l', f'app={app_label}', '-o', 'json'],
            timeout=5,
            capture_output=True
        )

        if result.returncode != 0:
            return False

        pods = json.loads(result.stdout)

        for pod in pods.get('items', []):
            conditions = pod.get('status', {}).get('conditions', [])
            for condition in conditions:
                if condition['type'] == 'Ready' and condition['status'] == 'True':
                    return True

        return False
    except Exception:
        return False

# Usage
healthy = kubernetes_pod_ready("default", "myapp")
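
Separating the JSON inspection from the `kubectl` call makes the readiness logic easy to test without a cluster. The sketch below works on the already-parsed pod list; `ready_pod_names` is a hypothetical helper and the sample document is invented.

```python
def ready_pod_names(pod_list):
    """Return the names of pods whose Ready condition is True."""
    names = []
    for pod in pod_list.get('items', []):
        # Pods that are still pending may have no conditions at all
        conditions = pod.get('status', {}).get('conditions') or []
        is_ready = any(
            c.get('type') == 'Ready' and c.get('status') == 'True'
            for c in conditions
        )
        if is_ready:
            names.append(pod.get('metadata', {}).get('name', '<unnamed>'))
    return names

# Invented sample resembling `kubectl get pods -o json` output
sample_pods = {
    'items': [
        {'metadata': {'name': 'myapp-1'},
         'status': {'conditions': [{'type': 'Ready', 'status': 'True'}]}},
        {'metadata': {'name': 'myapp-2'},
         'status': {'conditions': [{'type': 'Ready', 'status': 'False'}]}},
        {'metadata': {'name': 'myapp-3'}, 'status': {}},
    ]
}
print(ready_pod_names(sample_pods))  # ['myapp-1']
```

Returning names instead of a boolean also allows a minimum-replica rule, for example treating the service as healthy only when at least two pods are ready.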

Monitoring and Logging

Metrics Export

Export health check metrics for Prometheus:

from prometheus_client import Gauge, Counter, start_http_server

# Metrics
health_status = Gauge('exabgp_health_status', 'Current health status (1=up, 0=down)')
check_duration = Gauge('exabgp_check_duration_seconds', 'Health check duration')
state_changes = Counter('exabgp_state_changes_total', 'Total state changes', ['from_state', 'to_state'])

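# Start metrics server (pick a free port: 9100 is also node_exporter's default)
start_http_server(9100)

# Update metrics after each check; `healthy` and `duration` come from the check loop
health_status.set(1 if healthy else 0)
check_duration.set(duration)
# Record each state transition, for example down -> up
state_changes.labels(from_state='down', to_state='up').inc()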

Structured Logging

Use structured logging for better analysis:

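import json
import logging

# Attributes present on every LogRecord; anything else was passed via extra={...}
_STANDARD_ATTRS = set(vars(logging.LogRecord('', 0, '', 0, '', (), None))) | {'message', 'asctime'}

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_obj = {
            'timestamp': self.formatTime(record),
            'level': record.levelname,
            'message': record.getMessage(),
        }
        # Copy the extra fields, which the base formatter would otherwise drop
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS:
                log_obj[key] = value
        return json.dumps(log_obj, default=str)

handler = logging.FileHandler('/var/log/exabgp-healthcheck.json')
handler.setFormatter(JSONFormatter())
logging.getLogger().addHandler(handler)
# The root logger defaults to WARNING; lower it so INFO records are emitted
logging.getLogger().setLevel(logging.INFO)

logging.info("Health check passed", extra={'check': 'http', 'url': 'http://localhost:8080'})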

Common Pitfalls

  1. No timeout on checks: Always set timeouts (2-5 seconds typical)
  2. No dampening: Causes route flapping on transient failures
  3. Blocking checks: Use subprocess.run with timeout, not os.system
  4. Forgetting sys.stdout.flush(): Commands sit in the output buffer and never reach ExaBGP
  5. No logging: Impossible to troubleshoot when things break
  6. Checking too frequently: Every 5-10 seconds is usually sufficient
  7. Not handling shutdown gracefully: Routes not withdrawn on stop


👻 Ghost written by Claude (Anthropic AI)
