# Common Pitfalls

This guide covers the most common mistakes and misunderstandings when using ExaBGP, along with their solutions. Reading this can save you hours of debugging time.

## Table of Contents

- [Critical Misunderstandings](#critical-misunderstandings)
- [Configuration Errors](#configuration-errors)
- [API Programming Mistakes](#api-programming-mistakes)
- [BGP Protocol Issues](#bgp-protocol-issues)
- [Performance Problems](#performance-problems)
- [Security Mistakes](#security-mistakes)
- [Deployment Issues](#deployment-issues)
- [Version-Specific Pitfalls](#version-specific-pitfalls)
- [See Also](#see-also)

---

## Critical Misunderstandings

### Pitfall #1: Thinking ExaBGP is a Router

**❌ Wrong Assumption:** "I announced a route via ExaBGP, so traffic will be forwarded."

**✅ Reality:** ExaBGP is a **BGP protocol implementation**, NOT a router. It does NOT:

- Install routes in the kernel routing table (RIB/FIB)
- Forward IP packets
- Handle ARP/NDP
- Create VRFs, VXLAN tunnels, or MPLS labels

**What ExaBGP Actually Does:**

- Sends/receives BGP UPDATE messages
- Provides an API for applications to control BGP announcements
- Handles BGP session management

**Solution:**

1. ExaBGP announces route via BGP → Peer router receives it
2. Peer router installs route in its RIB/FIB (if best path)
3. Peer router forwards traffic based on its routing table
4. Your application must handle traffic locally (configure IPs on interfaces, run services, etc.)

**Example:**

```python
# This announces 100.64.1.1/32 via BGP
print("announce route 100.64.1.1/32 next-hop self")

# But you MUST also:
# 1. Configure 100.64.1.1 on a local interface (e.g., loopback)
# 2. Run the actual service on that IP
# 3. Ensure next-hop is reachable
```

```bash
# Configure the service IP on loopback
ip addr add 100.64.1.1/32 dev lo

# Start your service
systemctl start myservice

# Then let ExaBGP announce it
```

---

### Pitfall #2: Forgetting to Flush stdout

**❌ Wrong Code:**

```python
#!/usr/bin/env python3
import sys
import time

print("announce route 100.64.1.0/24 next-hop self")
# Missing sys.stdout.flush()!
time.sleep(60)
```

**Problem:** ExaBGP reads commands from your process's stdout line by line. When stdout is a pipe, Python block-buffers it, so without flushing, commands sit in the buffer and may not be sent immediately.

**✅ Correct Code:**

```python
#!/usr/bin/env python3
import sys
import time

print("announce route 100.64.1.0/24 next-hop self")
sys.stdout.flush()  # Always flush!
time.sleep(60)
```

**Impact:** Routes are announced with significant delay or not at all.

**Solution:** **ALWAYS** call `sys.stdout.flush()` after every `print()` statement (or pass `flush=True` to `print()`).

---

### Pitfall #3: Incorrect Next-Hop

**❌ Wrong:**

```python
# Next-hop is not a local IP
print("announce route 100.64.1.0/24 next-hop 203.0.113.1")
```

**Problem:** If 203.0.113.1 is not reachable from the peer router, the route won't be installed in the peer's FIB.

**✅ Correct:**

```python
# Use 'self' (ExaBGP substitutes local-address)
print("announce route 100.64.1.0/24 next-hop self")
```

**Or ensure next-hop is explicitly configured:**

```ini
neighbor 192.0.2.1 {
    local-address 192.0.2.2;  # This becomes 'next-hop self'
    # ...
}
```

**Rule:** Next-hop must be reachable from the receiving router via its routing table.

---

## Configuration Errors

### Pitfall #4: Missing Family Declaration

**❌ Wrong:**

```ini
neighbor 192.0.2.1 {
    router-id 192.0.2.2;
    local-address 192.0.2.2;
    local-as 65001;
    peer-as 65000;
    # Missing family configuration!
}
```

**Problem:** For non-default address families (EVPN, FlowSpec, VPNv4, etc.), you must explicitly enable them.


**✅ Correct:**

```ini
neighbor 192.0.2.1 {
    router-id 192.0.2.2;
    local-address 192.0.2.2;
    local-as 65001;
    peer-as 65000;

    family {
        ipv4 flow;  # FlowSpec
        evpn;       # EVPN
        ipv4 vpn;   # VPNv4
    }
}
```

**Note:** IPv4 unicast is enabled by default; others must be explicit.

---

### Pitfall #5: Incorrect Indentation

**❌ Wrong:**

```ini
neighbor 192.0.2.1 {
router-id 192.0.2.2;  # No indentation!
local-address 192.0.2.2;
}
```

**Problem:** ExaBGP's config parser is sensitive to indentation.

**✅ Correct:**

```ini
neighbor 192.0.2.1 {
    router-id 192.0.2.2;  # Consistent indentation
    local-address 192.0.2.2;
}
```

**Solution:** Use tabs or consistent spaces (4 spaces recommended). Don't mix.

---

### Pitfall #6: Wrong ASN Format

**❌ Wrong:**

```ini
neighbor 192.0.2.1 {
    local-as 65001.100;  # Dot notation not supported
}
```

**✅ Correct:**

```ini
neighbor 192.0.2.1 {
    local-as 65001;  # Plain integer
}
```

**For 4-byte ASNs:**

```ini
local-as 4200000000;  # Use integer form, not asdot
```

---

## API Programming Mistakes

### Pitfall #7: Not Handling stdin EOF

**❌ Wrong:**

```python
import time

while True:
    time.sleep(60)  # Never checks for ExaBGP shutdown
```

**Problem:** When ExaBGP terminates, your process keeps running as an orphan.

**✅ Correct:**

```python
import sys

while True:
    line = sys.stdin.readline()
    if not line:  # EOF - ExaBGP terminated
        break
    # Process messages...
```

**Or for announcement-only scripts:**

```python
import signal
import sys
import time

def signal_handler(signum, frame):
    sys.exit(0)

signal.signal(signal.SIGTERM, signal_handler)
signal.signal(signal.SIGINT, signal_handler)

while True:
    time.sleep(60)
```

---

### Pitfall #8: Ignoring JSON Parse Errors

**❌ Wrong:**

```python
while True:
    line = sys.stdin.readline()
    msg = json.loads(line)  # Will crash on invalid JSON
```

**Problem:** Invalid JSON crashes your script, taking down your BGP announcements.

**✅ Correct:**

```python
import json
import sys

while True:
    line = sys.stdin.readline()
    if not line:
        break

    try:
        msg = json.loads(line)
        # Process message...
    except json.JSONDecodeError as e:
        print(f"JSON parse error: {e}", file=sys.stderr)
        continue  # Don't crash, just skip bad message
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        continue
```

---

### Pitfall #9: No Health Check Dampening

**❌ Wrong:**

```python
while True:
    if check_health():
        announce()
    else:
        withdraw()
    time.sleep(1)
```

**Problem:** Transient health check failures cause route flapping.

**✅ Correct (with dampening):**

```python
rise_count = 0
fall_count = 0
announced = False

while True:
    if check_health():
        rise_count += 1
        fall_count = 0
        if rise_count >= 3 and not announced:  # 3 consecutive passes
            announce()
            announced = True
            rise_count = 0
    else:
        fall_count += 1
        rise_count = 0
        if fall_count >= 2 and announced:  # 2 consecutive failures
            withdraw()
            announced = False
            fall_count = 0
    time.sleep(5)
```

**Why:** Avoids BGP churn from momentary failures.

---

### Pitfall #10: Hardcoded Paths

**❌ Wrong:**

```python
#!/usr/bin/env python3
# Hardcoded path won't work on other systems
import sys
sys.path.append('/home/alice/myproject')
```

**✅ Correct:**

```python
#!/usr/bin/env python3
import sys
import os

# Use relative paths or environment variables
script_dir = os.path.dirname(os.path.realpath(__file__))
sys.path.append(os.path.join(script_dir, 'lib'))
```

---

## BGP Protocol Issues

### Pitfall #11: Mismatched AS Numbers

**❌ Wrong:**

```ini
# ExaBGP config
neighbor 192.0.2.1 {
    local-as 65001;
    peer-as 65002;  # Says peer is AS 65002
}
# But peer router is actually configured as AS 65000!
```

**Problem:** BGP session won't establish. Logs show `OPEN message error`.

**✅ Solution:** Verify peer's ASN:

```bash
# Check peer's actual ASN
show bgp summary  # On router
```

Ensure `peer-as` in ExaBGP matches the AS number the peer is actually configured with.

---

### Pitfall #12: Incorrect Router ID

**❌ Wrong:**

```ini
neighbor 192.0.2.1 {
    router-id 192.0.2.1;  # Same as neighbor!
}

neighbor 192.0.2.2 {
    router-id 192.0.2.1;  # Same router-id for different neighbors!
}
```

**Problem:** BGP router-id must be unique per ExaBGP instance, not per neighbor, and must not collide with a peer's own router-id.

**✅ Correct:**

```ini
# Use same router-id for all neighbors (but unique per ExaBGP instance)
neighbor 192.0.2.1 {
    router-id 192.0.2.100;  # ExaBGP's unique ID
}

neighbor 192.0.2.2 {
    router-id 192.0.2.100;  # Same router-id
}
```

**Rule:** One router-id per ExaBGP process, unique across your network.

---

### Pitfall #13: TCP MD5 Password Mismatch

**❌ Wrong:**

```ini
neighbor 192.0.2.1 {
    tcp {
        md5-password "secret123";
    }
}
# But peer router has "secret456"
```

**Problem:** TCP connection fails silently. No BGP session.

**✅ Solution:**

```ini
neighbor 192.0.2.1 {
    tcp {
        md5-password "secret456";  # Must match peer exactly
    }
}
```

**Verification:**

```bash
# On peer router
show bgp neighbors 192.0.2.2 | include password

# Check ExaBGP logs for TCP connection failures
env exabgp.log.level=DEBUG exabgp config.ini
```

---

### Pitfall #14: Route Filtering on Peer

**❌ Issue:**

```python
# Announced route
print("announce route 100.64.1.0/24 next-hop self")
```

But peer router has:

```
# Cisco IOS-XR
router bgp 65000
 neighbor 192.0.2.2
  address-family ipv4 unicast
   route-policy BLOCK-ALL in  # Blocks everything!
```

**Problem:** Routes are announced, but the peer rejects them via its import policy.

**✅ Solution:** Verify peer's import filters:

```bash
# Check peer's import policy
show bgp neighbor 192.0.2.2 policy

# Or allow ExaBGP routes
route-policy ALLOW-EXABGP
  if as-path passes-through '65001' then
    pass
  endif
end-policy
```

---

## Performance Problems

### Pitfall #15: Excessive Health Check Frequency

**❌ Wrong:**

```python
while True:
    check_health()  # Every 100ms!
    time.sleep(0.1)
```

**Problem:** Excessive CPU usage that doesn't improve convergence (BGP propagation takes seconds anyway).

**✅ Correct:**

```python
while True:
    check_health()
    time.sleep(5)  # 5-10 seconds is reasonable
```

**Why:** BGP convergence typically takes 5-15 seconds. Checking every 100ms wastes resources.

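The trade-off between check interval and failure detection can be estimated with simple arithmetic. The sketch below is illustrative only: `detection_window` is a hypothetical helper (not part of ExaBGP), and it assumes a loop like the one in Pitfall #9, where a withdraw happens after `fall_threshold` consecutive failed checks and the time taken by each check is negligible.

```python
def detection_window(interval_s, fall_threshold):
    """Estimate seconds between a failure and the resulting withdraw.

    Hypothetical helper for illustration. Returns (best_case, worst_case):
    - best case: the service fails just before a check runs
    - worst case: the service fails just after a check ran
    The duration of the health check itself is ignored.
    """
    if interval_s <= 0 or fall_threshold < 1:
        raise ValueError("interval_s must be > 0 and fall_threshold >= 1")
    best = interval_s * (fall_threshold - 1)
    worst = interval_s * fall_threshold
    return best, worst


if __name__ == "__main__":
    # 5-second interval, withdraw after 2 consecutive failures
    print(detection_window(5, 2))
```

With a 5-second interval and a fall threshold of 2, the withdraw is sent roughly 5 to 10 seconds after the failure, which is the same order of magnitude as BGP convergence itself.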
---

### Pitfall #16: Not Using Route Reflectors

**❌ Wrong (Full Mesh iBGP):**

```
100 ExaBGP instances
        ↓
100 × 99 / 2 = 4,950 BGP sessions!
```

**✅ Correct (Route Reflector):**

```
100 ExaBGP instances
        ↓
100 instances × 2 Route Reflectors = 200 sessions total
```

**Solution:** Use BGP Route Reflectors for large deployments (>10 speakers).

---

### Pitfall #17: Announcing Too Many Routes

**❌ Wrong:**

```python
# Announcing /32 for every IP in /24
for i in range(1, 255):
    print(f"announce route 100.64.1.{i}/32 next-hop self")
```

**Problem:** Unnecessary churn, large BGP table, slow convergence.

**✅ Correct:**

```python
# Announce aggregate
print("announce route 100.64.1.0/24 next-hop self")
```

**Rule:** Aggregate when possible. Only announce /32 for anycast or specific services.

---

## Security Mistakes

### Pitfall #18: No BGP Authentication

**❌ Wrong:**

```ini
neighbor 192.0.2.1 {
    # No authentication!
}
```

**Problem:** Anyone who can reach your ExaBGP can inject routes.

**✅ Correct:**

```ini
neighbor 192.0.2.1 {
    tcp {
        md5-password "strong-random-password-here";
    }
}
```

**Better (TCP-AO):**

```ini
neighbor 192.0.2.1 {
    tcp {
        ao-keyid 1;
        ao-key "hex:deadbeef...";
    }
}
```

---

### Pitfall #19: Running API Process as Root

**❌ Wrong:**

```ini
process route-injector {
    run /root/inject.py;  # Runs as root!
}
```

**Problem:** If your script has vulnerabilities, an attacker gets root access.

**✅ Correct:**

```ini
process route-injector {
    run /opt/exabgp/inject.py;
    user exabgp;  # Run as unprivileged user
    env {
        exabgp.user = exabgp;
    }
}
```

```bash
# Create unprivileged user
useradd -r -s /bin/false exabgp
chown exabgp:exabgp /opt/exabgp/inject.py
```

---

### Pitfall #20: Exposing ExaBGP API

**❌ Wrong:**

```bash
# ExaBGP listening on all interfaces
exabgp --bind 0.0.0.0:179 config.ini
```

**Problem:** Anyone on the network can connect to the BGP port.


**✅ Correct:**

```bash
# Bind to localhost or specific interface only
exabgp --bind 127.0.0.1:179 config.ini

# Or use firewall
iptables -A INPUT -p tcp --dport 179 -s 192.0.2.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 179 -j DROP
```

---

## Deployment Issues

### Pitfall #21: No Logging

**❌ Wrong:**

```python
#!/usr/bin/env python3
# No logging at all
if check_health():
    announce()
```

**Problem:** Impossible to troubleshoot when things go wrong.

**✅ Correct:**

```python
#!/usr/bin/env python3
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s: %(message)s',
    filename='/var/log/exabgp-health.log'
)

if check_health():
    logging.info("Health check passed, announcing route")
    announce()
else:
    logging.warning("Health check failed, withdrawing route")
    withdraw()
```

---

### Pitfall #22: Not Monitoring BGP Session State

**❌ Wrong:** "I announced routes, they must be working."

**Problem:** BGP session might be down, routes not actually advertised.

**✅ Correct:**

```python
# Parse BGP state messages from ExaBGP
def handle_state(msg):
    state = msg.get('neighbor', {}).get('state')
    if state == 'down':
        logging.error("BGP session down!")
        # Alert ops team
    elif state == 'up':
        logging.info("BGP session established")

# In your receiver loop
if msg.get('type') == 'state':
    handle_state(msg)
```

**Better:** Use external monitoring (Prometheus, Grafana) to track BGP state.

---

### Pitfall #23: No Graceful Shutdown

**❌ Wrong:**

```bash
# Kill ExaBGP immediately
kill -9 $(pidof exabgp)
```

**Problem:** Routes withdrawn abruptly, traffic drops.


**✅ Correct:**

```bash
# Graceful shutdown
kill -TERM $(pidof exabgp)

# Or withdraw routes first
echo "withdraw route 100.64.1.0/24 next-hop self" | \
    socat - UNIX-CONNECT:/run/exabgp/exabgp.sock
sleep 30  # Wait for BGP convergence
systemctl stop exabgp
```

---

## Version-Specific Pitfalls

### Pitfall #24: Not Reading ACK Responses (Hanging Programs)

**❌ Wrong (program hangs):**

```python
import sys

# Send command
print("announce route 100.64.1.0/24 next-hop self")
sys.stdout.flush()

# Program hangs here because ACK is enabled by default!
# ExaBGP sends "done\n" but we never read it
# This causes backpressure and eventually hangs
```

**Problem:** ACK is **enabled by default** in ExaBGP 4.x and 5.x. If you don't read responses, the pipe fills up and blocks.

**✅ Solution 1 - Read ACK responses (recommended):**

```python
import json
import select
import sys
import time

def wait_for_ack(expected_count=1, timeout=30):
    """
    Wait for ACK responses with polling loop.
    Handles both text and JSON encoder formats.
    """
    received = 0
    start_time = time.time()

    while received < expected_count:
        if time.time() - start_time >= timeout:
            return False

        ready, _, _ = select.select([sys.stdin], [], [], 0.1)
        if ready:
            raw = sys.stdin.readline()
            if not raw:  # EOF - ExaBGP terminated
                return False
            line = raw.strip()

            # Parse response (could be text or JSON)
            answer = None
            if line.startswith('{'):
                try:
                    data = json.loads(line)
                    answer = data.get('answer')
                except json.JSONDecodeError:
                    pass  # Not valid JSON, ignore this line
            else:
                answer = line

            if answer == "done":
                received += 1
            elif answer == "error":
                return False
            elif answer == "shutdown":
                raise SystemExit(0)
        else:
            time.sleep(0.1)

    return True

# Send command
sys.stdout.write("announce route 100.64.1.0/24 next-hop self\n")
sys.stdout.flush()

# Wait for ACK (with polling loop)
if not wait_for_ack():
    sys.exit(1)  # Command failed
```

**✅ Solution 2 - Disable ACK (simpler but no error feedback):**

```bash
# Option A: Environment variable (4.x and 5.x)
# Pass it via env: shells such as bash cannot `export` names containing dots
env exabgp.api.ack=false exabgp /etc/exabgp/exabgp.conf

# Option B: Runtime command (5.x/main only)
# Send: disable-ack or silence-ack
```

**See:** [ACK Feature Documentation](API-Overview#command-acknowledgment-ack-feature) for details.

---

## See Also

### Documentation

- [Debugging Guide](Debugging) - Troubleshooting techniques
- [First BGP Session](First-BGP-Session) - Basic setup guide
- [API Overview](API-Overview) - API programming guide
- [Production Best Practices](Production-Best-Practices) - Production deployment

### Getting Help

- **GitHub Issues**: [https://github.com/Exa-Networks/exabgp/issues](https://github.com/Exa-Networks/exabgp/issues)
- **Slack**: [https://exabgp.slack.com/](https://exabgp.slack.com/)
- **Mailing List**: Archive at Google Groups

### Quick Fixes

**Session won't establish?**

1. Check ASNs match (`local-as`, `peer-as`)
2. Check router-id is unique
3. Check TCP MD5 password matches
4. Verify network connectivity (`ping`, `tcpdump`)

**Routes announced but not working?**

1. Verify peer accepts routes (`show bgp neighbor received-routes`)
2. Check next-hop is reachable from peer
3. Verify service IP is configured locally (`ip addr show`)
4. Check peer's import filters

**Health checks flapping?**

1. Add dampening (rise/fall counters)
2. Increase health check interval
3. Check health check logic (timeouts, retries)

---
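
For the "add dampening" quick fix, the rise/fall counters from Pitfall #9 can be wrapped in a small reusable object. This is a minimal sketch, not part of ExaBGP: the `Dampener` class and its `update()` method are hypothetical names, and the default thresholds (3 passes, 2 failures) simply mirror the values used in Pitfall #9.

```python
class Dampener:
    """Rise/fall dampening for health check results (hypothetical helper)."""

    def __init__(self, rise=3, fall=2):
        self.rise = rise          # consecutive passes needed to announce
        self.fall = fall          # consecutive failures needed to withdraw
        self.announced = False
        self._passes = 0
        self._fails = 0

    def update(self, healthy):
        """Feed one check result; return 'announce', 'withdraw', or None."""
        if healthy:
            self._passes += 1
            self._fails = 0
            if not self.announced and self._passes >= self.rise:
                self.announced = True
                return "announce"
        else:
            self._fails += 1
            self._passes = 0
            if self.announced and self._fails >= self.fall:
                self.announced = False
                return "withdraw"
        return None


if __name__ == "__main__":
    d = Dampener()
    # A single failed check between passes does not trigger a withdraw
    for result in [True, True, True, False, True, False, False]:
        print(result, d.update(result))
```

Keeping the state machine separate from the I/O makes it easy to unit-test the flap behaviour without a running ExaBGP; the caller decides what to print (and flush) when `update()` returns an action.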