Enterprise-grade API monitoring solution for AWS-compatible cloud services with intelligent alerting and comprehensive network diagnostics.
Overview β’ Features β’ Quick Start β’ Installation β’ Configuration β’ Usage β’ Contributing
API Monitoring is a production-ready, enterprise-grade monitoring solution designed specifically for AWS-compatible cloud APIs. Built with modern asynchronous Python architecture, it provides real-time monitoring, intelligent alerting, and comprehensive network diagnostics to ensure your cloud infrastructure remains highly available.
Cloud API downtime can cost businesses thousands of dollars per minute. Traditional monitoring solutions often lack:
- Real-time detection of API failures
- Intelligent maintenance mode handling
- Comprehensive network diagnostics when issues occur
- Immediate notification to the right teams
- Automated resolution tracking
API Monitoring addresses these challenges by providing:
- Continuous monitoring of AWS-compatible APIs with configurable intervals
- Smart maintenance detection that prevents false alerts during planned downtime
- Instant Telegram alerts with detailed MTR network traces for rapid troubleshooting
- Automatic resolution notifications when services recover
- Production-ready deployment with Docker and systemd support
- DevOps Teams: Monitor critical cloud infrastructure APIs
- SRE Teams: Ensure SLA compliance with real-time alerting
- Cloud Providers: Monitor service availability across regions
- Enterprise IT: Track AWS-compatible private cloud APIs
- MSPs: Monitor multiple client cloud environments
- π High-Performance Asynchronous Architecture: Built with asyncio and aiohttp for efficient, non-blocking operations that can handle multiple concurrent checks
- βοΈ AWS-Compatible API Monitoring: Comprehensive monitoring of AWS EC2-compatible APIs using boto3/aiobotocore
- π§ Intelligent Maintenance Mode Detection: Automatically detects and skips monitoring during planned maintenance windows
- β±οΈ Configurable Check Intervals: Customizable monitoring frequency from seconds to hours based on your requirements
- π― Precise Timeout Management: Granular timeout controls for different operations (API calls, maintenance checks, network traces)
- π€ Intelligent Telegram Alerts: Rich HTML-formatted notifications with comprehensive diagnostic information
- π Network Diagnostics Integration: Automatic MTR (My Traceroute) network path analysis included in alerts
- π¬ Custom Alert Comments: Add contextual information to alerts for better incident management
- β Automatic Resolution Tracking: Smart detection and notification when issues are resolved
- π Alert Deduplication: Prevents spam by tracking alert states and sending updates only when status changes
- π Structured JSON Logging: Machine-readable logs for integration with log aggregation systems (ELK, Splunk, etc.)
- π³ Production-Ready Containerization: Docker and Docker Compose support with optimized images
- βοΈ Robust Configuration Management: Pydantic-based validation with environment variable support
- π Security Best Practices: Secure credential handling and input validation
- ποΈ Modular Architecture: Clean, maintainable codebase with comprehensive test coverage
API Monitoring follows a modern, modular architecture designed for scalability, maintainability, and reliability:
api_monitoring/
βββ config/ # Configuration management with Pydantic validation
βββ clients/ # AWS API client with connection pooling
βββ monitoring/ # Core monitoring logic and maintenance detection
βββ alerting/ # Multi-channel alerting system (Telegram, extensible)
βββ utils/ # Shared utilities (logging, networking, diagnostics)
βββ main.py # Application entry point and orchestration
- βοΈ Configuration Layer: Pydantic-based settings with environment variable support, validation, and type safety
- π Structured Logging: JSON-formatted logs with correlation IDs, structured fields, and configurable output destinations
- βοΈ AWS Client: Asynchronous boto3/aiobotocore client with automatic retry logic, connection pooling, and error handling
- π§ Maintenance Detection: Smart maintenance mode detection using configurable patterns and response analysis
- π Network Diagnostics: MTR integration for comprehensive network path analysis and troubleshooting
- π’ Alerting Engine: Extensible notification system with rich formatting, deduplication, and delivery confirmation
- ποΈ Monitoring Core: Event-driven monitoring loop with configurable intervals, timeout management, and state tracking
- π Application Bootstrap: Signal handling, graceful shutdown, dependency validation, and service lifecycle management
- Python: 3.11+ (optimized for 3.13)
- Memory: Minimum 128MB RAM (recommended 256MB+)
- CPU: Single core sufficient (benefits from multi-core for concurrent operations)
- Network: Outbound HTTPS (443) for API calls and Telegram notifications
- Storage: 50MB for application + logs (configurable log retention)
- Monitoring Latency: Sub-second API response detection
- Alert Delivery: < 5 seconds from detection to Telegram notification
- Resource Usage: < 50MB RAM, minimal CPU usage during normal operations
- Concurrent Operations: Supports multiple simultaneous API checks and network traces
- Scalability: Designed for monitoring 100+ endpoints (with appropriate resource allocation)
- Credential Management: Environment variable-based secrets with validation
- Input Sanitization: All user inputs sanitized and validated
- Network Security: HTTPS-only communications with certificate validation
- Error Handling: Secure error messages without sensitive data exposure
- Audit Trail: Comprehensive logging of all monitoring activities
Get up and running in minutes with Docker:
# Clone the repository
git clone https://github.com/dkamenkov/api-monitoring.git
cd api-monitoring
# Copy and configure environment variables
cp .env.example .env
# Edit .env with your settings
# Run with Docker Compose
docker-compose up -d
That's it! Your API monitoring is now running. Check the logs with:
docker-compose logs -f
- π Python 3.13+
- π MTR (My Traceroute) tool
- Clone the repository:
git clone https://github.com/dkamenkov/api-monitoring.git
cd api-monitoring
- Install the required Python libraries:
pip install -r requirements.txt
- Install the MTR tool:
- On Ubuntu/Debian:
sudo apt-get install mtr
- On CentOS:
sudo yum install mtr
- On Ubuntu/Debian:
- Clone the repository:
git clone https://github.com/dkamenkov/api-monitoring.git
cd api-monitoring
- Build and run with Docker Compose:
docker-compose up -d
- Copy the example configuration file:
cp .env.example .env
- Edit the
.env
file with your specific settings:
# API Endpoint Configuration
ENDPOINT_URL=https://api.your-cloud-provider.com
# The AWS-compatible API endpoint to monitor (without trailing slash)
# AWS Credentials
AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
# Your AWS access key ID with EC2 permissions
AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
# Your AWS secret access key
AWS_DEFAULT_REGION=us-east-1
# AWS region for API calls (must match your endpoint region)
# Telegram Configuration
TELEGRAM_BOT_TOKEN=123456789:ABCdefGHIjklMNOpqrsTUVwxyz
# Bot token from @BotFather on Telegram
TELEGRAM_CHAT_ID=123456789
# Your Telegram chat ID (get from @userinfobot)
# Monitoring Intervals (in seconds)
CHECK_INTERVAL=60
# How often to check the API (recommended: 60-300 for production)
API_TIMEOUT=15
# Timeout for API requests (recommended: 10-30 seconds)
MAINTENANCE_CHECK_TIMEOUT=10
# Timeout for maintenance mode detection (recommended: 5-15 seconds)
# Logging Configuration
LOG_LEVEL=INFO
# Logging verbosity: DEBUG, INFO, WARNING, ERROR, CRITICAL
LOG_FILE=logs.log
# Log file path (set to empty string to disable file logging)
- Never commit
.env
files to version control - Use IAM roles when running on AWS EC2 instead of access keys
- Rotate credentials regularly and use least-privilege access
- Restrict Telegram bot to specific chats only
- Use environment-specific configurations for different deployments
For enterprise deployments, consider:
# Advanced Settings
CHECK_INTERVAL=30 # More frequent checks for critical systems
API_TIMEOUT=10 # Shorter timeout for faster detection
LOG_LEVEL=WARNING # Reduce log verbosity in production
LOG_FILE="" # Disable file logging, use centralized logging
When using Docker, you can override settings via environment variables:
# docker-compose.override.yml
version: '3.8'
services:
api-monitoring:
environment:
- CHECK_INTERVAL=30
- LOG_LEVEL=WARNING
- LOG_FILE=""
python -m api_monitoring.main
docker-compose up -d
For production environments, deploy as a systemd service:
- Create a systemd service file:
sudo nano /etc/systemd/system/api-monitoring.service
- Add the following content:
[Unit]
Description=API Monitoring Service
After=network.target
Wants=network-online.target
[Service]
Type=simple
User=api-monitor
Group=api-monitor
WorkingDirectory=/opt/api-monitoring
ExecStart=/usr/bin/python3 -m api_monitoring.main
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal
SyslogIdentifier=api-monitoring
# Security settings
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/opt/api-monitoring/logs
[Install]
WantedBy=multi-user.target
- Enable and start the service:
sudo systemctl daemon-reload
sudo systemctl enable api-monitoring
sudo systemctl start api-monitoring
sudo systemctl status api-monitoring
When an API issue is detected, you'll receive a comprehensive Telegram alert:
π¨ Issue detected with API api.example.com π¨
Timestamp: 2024-01-15 14:30:25
Source IP: 203.0.113.42
Error: Cannot connect to the endpoint: Connection timeout
Trace to api.example.com:
Start: Mon Jan 15 14:30:26 2024
HOST: monitoring-server Loss% Snt Last Avg Best Wrst StDev
1.|-- gateway.local 0.0% 10 1.2 1.1 0.9 1.5 0.2
2.|-- isp-router.net 0.0% 10 12.3 11.8 10.2 15.1 1.4
3.|-- core-01.provider.com 10.0% 10 45.2 44.8 42.1 48.9 2.1
4.|-- ??? 100.0% 10 0.0 0.0 0.0 0.0 0.0
When the issue is resolved:
π’ Issue with API api.example.com resolved!
Structured JSON Logs:
{
"timestamp": "2024-01-15T14:30:25.123456",
"level": "INFO",
"message": "API check succeeded",
"module": "monitor",
"function": "run_once",
"line": 125,
"endpoint": "api.example.com",
"response_time_ms": 245,
"check_id": "550e8400-e29b-41d4-a716-446655440000"
}
Error Logs:
{
"timestamp": "2024-01-15T14:30:25.123456",
"level": "ERROR",
"message": "API check failed: Connection timeout",
"module": "aws_client",
"function": "check_api_availability",
"line": 78,
"endpoint": "api.example.com",
"error_type": "EndpointConnectionError",
"check_id": "550e8400-e29b-41d4-a716-446655440001"
}
# Ubuntu/Debian
sudo apt-get update && sudo apt-get install mtr-tiny
# CentOS/RHEL
sudo yum install mtr
# Alpine Linux (Docker)
apk add --no-cache mtr
- Ensure AWS credentials have EC2 permissions
- Verify the endpoint URL is correct and accessible
- Check if the region matches your AWS setup
- Verify bot token is correct and bot is active
- Ensure chat ID is correct (use @userinfobot to get your chat ID)
- Check if bot has permission to send messages to the chat
# Check container logs
docker-compose logs -f api-monitoring
# Restart container
docker-compose restart api-monitoring
# Rebuild with latest changes
docker-compose down && docker-compose up -d --build
- Reduce
CHECK_INTERVAL
if set too low - Check for memory leaks in logs
- Consider resource limits in Docker deployment
Enable debug logging for troubleshooting:
# Set in .env file
LOG_LEVEL=DEBUG
# Or as environment variable
export LOG_LEVEL=DEBUG
python -m api_monitoring.main
Q: What APIs are supported?
A: Any AWS EC2-compatible API that supports the describe_availability_zones
operation. This includes AWS, OpenStack, and many private cloud solutions.
Q: How often should I set the check interval? A: For production systems, 60-300 seconds is recommended. For critical systems, 30-60 seconds. Avoid intervals below 30 seconds to prevent API rate limiting.
Q: Can I monitor multiple APIs? A: Currently, each instance monitors one API endpoint. Deploy multiple instances with different configurations for multiple endpoints.
Q: What network ports are required? A: Outbound HTTPS (443) for API calls and Telegram notifications. MTR requires ICMP and UDP for network tracing.
Q: How much bandwidth does it use? A: Minimal - typically less than 1MB per day for standard monitoring intervals.
Q: Is it suitable for production use? A: Yes, it's designed for production with proper error handling, logging, and deployment options.
Q: Can I extend the alerting to other channels? A: Yes, the alerting system is modular. See the Contributing Guide for extending to Slack, Discord, or email.
Q: How do I get a Telegram bot token? A: Message @BotFather on Telegram, create a new bot, and copy the provided token.
Q: How do I find my Telegram chat ID? A: Message @userinfobot on Telegram, or add your bot to a group and check the logs for the chat ID.
Q: Can I use IAM roles instead of access keys? A: Yes, when running on AWS EC2, the application will automatically use IAM instance profiles if no explicit credentials are provided.
We love contributions! Please read our Contributing Guide to get started.
Please open an issue with details about the bug.
We'd love to hear about it! Open a feature request.
Please review our Security Policy for reporting security vulnerabilities.
- Web dashboard for monitoring status
- Support for multiple notification channels (Slack, Discord, Email)
- Prometheus metrics export
- Custom health check endpoints
- Multi-region monitoring
- Advanced alerting rules and conditions
- Built with aiohttp for async HTTP operations
- Uses aiobotocore for AWS API interactions
- Configuration management with Pydantic
- Network tracing powered by MTR
This project is licensed under the MIT License - see the LICENSE file for details.
β Star this repository if you find it useful! β
Made with β€οΈ by the API Monitoring team