On-premises/LAN availability monitoring tool with guaranteed alerts & decaying alert pacing. Multithreaded high speed availability checking for PING & HTTP(S) resources. Integrates with Site24x7 heartbeat monitoring for failover alerting + Slack & Pushover webhooks. Runs on RPi to enterprise systems. Thread safe, reentrant, easily modifiable.

APMonitor logo

APMonitor.py - An On-Premises Monitoring Tool with Alert Delivery Guarantees

This is an on-prem monitoring tool written entirely in clear, Python-only code (so you can modify it). It is designed to run on a LAN for on-prem availability monitoring of resources that aren't necessarily connected to the Internet, and/or where the on-prem monitoring itself must meet availability guarantees.

Availability checking of monitored resources can be multi-threaded for high-speed, near-realtime performance if that is what you need (see the -t command line option). The default mode is single-threaded for log clarity and runs comfortably on small systems like a Raspberry Pi.

It also supports pacing of monitoring alarms along a decaying curve: alert notifications are delivered quickly at the start of an outage, then slow down over time.

APMonitor.py (APMonitor) is primarily designed to work in tandem with Site24x7 and integrates very well with their "Heartbeat Monitoring".

To achieve guaranteed always-on monitoring service levels, set up local availability monitors in your config, sign up for a Pro Plan at Site24x7, then use the heartbeat_url and heartbeat_every_n_secs configuration options to have APMonitor.py ping a Site24x7 Heartbeat Monitoring URL endpoint whenever the monitored resource is up. This ensures that when a heartbeat doesn't arrive from APMonitor, monitoring alerts fall back to Site24x7; when both are working, you get second-opinion availability reporting.

The service-level guarantee works as follows: if the resource is down, APMonitor.py won't hit the Heartbeat Monitoring endpoint URL, and Site24x7 will then send an alert about the missed heartbeat without needing any additional on-prem/on-site dependencies. The entire machine APMonitor.py is running on can fall over and you will still get availability monitoring alerts, with all the benefits of having on-prem monitoring on your local network behind your firewall.

You can quickly sign up for a Site24x7.com Lite or Pro Plan for $10-$50 USD per month, then set up a bunch of Heartbeat Monitoring URL endpoints that work with APMonitor.py rather easily.

Note: Heartbeat Monitoring is not available on their Website Monitoring plans. You need an 'Infrastructure Monitoring' or 'All-In-One' plan for it to work correctly.

APMonitor also integrates well with Slack and Pushover via webhook URL endpoints, and supports email notifications via SMTP.

APMonitor is a neat way to guarantee that your on-prem availability monitoring will always let you know about an outage, while avoiding putting resources on the Internet that don't need to be there.


Recommended configuration for real-time environments

To put APMonitor into near-realtime mode so that it checks resources multiple times per second, use these global settings:

  • Dial up threads with -t 15 on the command line or max_threads: 15 in the site config,
  • set max_retries to 1, and
  • dial down max_try_secs to 10 or 15 seconds (see the example below).
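
For example, assuming the site-level option names documented later in this README, a minimal sketch of these settings would be:

site:
  name: "HomeLab"
  max_threads: 15      # or pass -t 15 on the command line
  max_retries: 1
  max_try_secs: 15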

Recommended configuration of Site24x7 Heartbeat Monitor Thresholds for HA Availability Monitoring

You do need to configure Site24x7's Heartbeat Monitoring thresholds to achieve high-availability, second-opinion availability monitoring.

As an exemplar, for the following monitored resource:

monitors:
  - type: http
    name: home-nas
    address: https://192.168.1.12/api/bump
    expect: "(C) COPYRIGHT 2005, Super NAS Storage Inc."
    ssl_fingerprint: a1b2c3d4e5f67890abcdef1234567890abcdef1234567890abcdef1234567890
    heartbeat_url: https://plus.site24x7.com/hb/your-unique-heartbeat-id/homenas
    heartbeat_every_n_secs: 300

Set up Site24x7 as follows:

site24x7-heartbeat-settings.png

This will send a heartbeat to Site24x7 every 5 minutes, and Site24x7 will raise an alarm whenever a heartbeat doesn't arrive or arrives outside the expected window (+/- 1 minute). This ensures availability monitoring keeps functioning even if either APMonitor or Site24x7 is down.

This also means you don't need to expose internal LAN resources to the Internet.

See the Site24x7 Heartbeat Monitoring documentation for more information.

Recommended configuration for alarm notification pacing

You might also want to consider alarm notification pacing, so that recently down resources generate more frequent messages, whilst long outages are notified less frequently. To enable:

  • Set notify_every_n_secs to 3600 seconds (i.e., 1 hour), and
  • Set after_every_n_notifications to 8,

which will slow alarms down to one per hour after 8 notifications.

An alternate config for monitored resources that have long outages is as follows:

  • Set notify_every_n_secs to 43200 (i.e., 12 hours), and
  • Set after_every_n_notifications to 6,

which will slow alarms down to one every 12 hours after 6 notifications, so that after a few days you get at most one alarm while you're asleep.
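
As a sketch, this second profile expressed as site-level options (per-monitor overrides in the monitors: section work the same way) would be:

site:
  notify_every_n_secs: 43200       # 12 hours between reminders once the curve flattens
  after_every_n_notifications: 6   # reach the full interval after 6 notifications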

To see how the alarm pacing sends notifications frequently at first and then progressively spaces them out, use the example calculations spreadsheet in 20151122 Reminder Timing with Quadratic Bezier Curve.xlsx to experiment with various configuration scenarios:

Screenshot_of_Reminder_Timing_simulator.png

Note that alarm pacing can be set at a global level in the site: config, and is overridden when set at a per monitored resource level in the monitors: section of the config.
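
If you prefer to experiment in code rather than the spreadsheet, here is a minimal Python sketch of the assumed curve shape. It is an illustration only; the function name and the choice of control point are assumptions, not the exact formula used by APMonitor.py.

# Rough sketch of decaying alert pacing using a quadratic Bezier curve.
# Illustration only -- the real parameters live in APMonitor.py and the spreadsheet.
def paced_interval(n_notified, notify_every_n_secs, after_every_n_notifications,
                   check_every_n_secs=60):
    """Seconds to wait before the next reminder, given n_notified already sent."""
    if after_every_n_notifications <= 1 or n_notified >= after_every_n_notifications:
        return float(notify_every_n_secs)              # curve has flattened out
    t = n_notified / after_every_n_notifications       # progress along curve, 0..1
    p0 = p1 = float(check_every_n_secs)                # control point at the start
    p2 = float(notify_every_n_secs)                    # keeps early reminders frequent
    return (1 - t) ** 2 * p0 + 2 * (1 - t) * t * p1 + t ** 2 * p2

for n in range(9):
    print(n, round(paced_interval(n, 3600, 8)))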

APMonitor.py YAML/JSON Site Configuration Options

APMonitor uses a YAML or JSON configuration file to define the site being monitored and the resources to check. The configuration consists of two main sections: site-level settings that apply globally, and per-monitor settings that define individual resources to check.

Complete Example Configuration

Here's a complete example showing all available configuration options:

site:
  name: "HomeLab"
  
  email_server:
    smtp_host: "smtp.gmail.com"
    smtp_port: 587
    smtp_username: "alerts@example.com"
    smtp_password: "app_password_here"
    from_address: "alerts@example.com"
    use_tls: true
  
  outage_emails:
    - email: "admin@example.com"
      email_outages: true
      email_recoveries: true
      email_reminders: true
    - email: "manager@example.com"
      email_outages: yes
      email_recoveries: yes
      email_reminders: no
  
  outage_webhooks:
    - endpoint_url: "https://api.pushover.net/1/messages.json"
      request_method: POST
      request_encoding: JSON
      request_prefix: "token=your_app_token&user=your_user_key&message="
      request_suffix: ""
  
  max_threads: 1
  max_retries: 3
  max_try_secs: 20
  check_every_n_secs: 60
  notify_every_n_secs: 600
  after_every_n_notifications: 1

monitors:
  - type: ping
    name: home-fw
    address: "192.168.1.1"
    check_every_n_secs: 60
    email: true
    heartbeat_url: "https://hc-ping.com/uuid-here"
    heartbeat_every_n_secs: 300

  - type: http
    name: in3245622
    address: "http://192.168.1.21/Login?oldUrl=Index"
    expect: "System Name: <b>HomeLab</b>"
    check_every_n_secs: 120
    notify_every_n_secs: 3600
    after_every_n_notifications: 5
    email: yes

  - type: http
    name: nvr0
    address: "https://192.168.1.12/api/system"
    expect: "nvr0"
    ssl_fingerprint: "a1b2c3d4e5f67890abcdef1234567890abcdef1234567890abcdef1234567890"
    ignore_ssl_expiry: true
    email: false
    heartbeat_url: "https://plus.site24x7.com/hb/uuid/nvr0"
    heartbeat_every_n_secs: 60

  - type: quic
    name: fast-api
    address: "https://192.168.1.50/api/health"
    expect: "ok"
    check_every_n_secs: 30

site: configuration options

The site section defines global settings for the monitoring site.

Required Fields

  • name (string): The name of the site being monitored. Used in notification messages to identify which site is reporting issues.
site:
  name: "HomeLab"

Optional Fields

  • email_server (object, optional): SMTP server configuration for sending email notifications. Required if outage_emails is configured.
email_server:
  smtp_host: "smtp.gmail.com"
  smtp_port: 587
  smtp_username: "alerts@example.com"
  smtp_password: "app_password_here"
  from_address: "alerts@example.com"
  use_tls: true
  • smtp_host (string, required): SMTP server hostname or IP address
  • smtp_port (integer, required): SMTP server port (typically 587 for TLS, 465 for SSL, 25 for unencrypted). Must be between 1 and 65535
  • smtp_username (string, optional): SMTP authentication username. Not required for servers without authentication
  • smtp_password (string, optional): SMTP authentication password. Not required for servers without authentication. Use app-specific passwords for Gmail/Google Workspace
  • from_address (string, required): Email address to use in the "From" field. Must be a valid email address
  • use_tls (boolean, optional): Whether to use TLS/STARTTLS encryption. Default: true

Note: For Gmail/Google Workspace, you must use an app-specific password rather than your account password. Port 587 with use_tls: true is the recommended configuration for most SMTP servers. For servers without authentication (like local SMTP relays), omit smtp_username and smtp_password.

  • outage_emails (list of objects, optional): Email addresses to notify when resources go down or recover. Requires email_server to be configured. Each entry is an object with an email field and optional notification control flags.
outage_emails:
  - email: "admin@example.com"
    email_outages: true
    email_recoveries: true
    email_reminders: true
  - email: "oncall@example.com"
    email_outages: yes
    email_recoveries: no
  • email (string, required): Valid email address matching standard email format
  • email_outages (boolean/integer/string, optional): Send email when resource goes down. Accepts: true/yes/on/1 (case-insensitive) for enabled, false/no/off/0 for disabled. Default: true
  • email_recoveries (boolean/integer/string, optional): Send email when resource recovers. Accepts same values as email_outages. Default: true
  • email_reminders (boolean/integer/string, optional): Send email for ongoing outage reminders (respecting notify_every_n_secs throttling). Accepts same values as email_outages. Default: true

Note: These email control flags allow fine-grained control over which notifications each recipient receives. For example, operations staff might want all notifications (email_outages: true, email_recoveries: true, email_reminders: true), while management might only want initial outage alerts (email_outages: true, email_recoveries: false, email_reminders: false).

  • outage_webhooks (list of objects, optional): Webhook endpoints to call when resources go down or recover. Each webhook requires several configuration fields.
outage_webhooks:
  - endpoint_url: "https://api.example.com/alerts"
    request_method: POST
    request_encoding: JSON
    request_prefix: ""
    request_suffix: ""
  • endpoint_url (string, required): Valid URL with scheme and host for the webhook

  • request_method (string, required): HTTP method, must be GET or POST

  • request_encoding (string, required): Message encoding format:

    • URL: URL-encode the message (for query parameters or form data)
    • HTML: HTML-escape the message
    • JSON: Send as JSON object with message field (POST only)
    • CSVQUOTED: CSV-quote the message for comma-separated values
  • request_prefix (string, optional): String to prepend to encoded message (e.g., API tokens, field names)

  • request_suffix (string, optional): String to append to encoded message
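
For illustration only, here is a Python sketch of how these fields might combine into the outgoing request body. The helper name and the prefix/suffix placement inside the JSON encoding are assumptions, not APMonitor.py's actual code.

import html, json, urllib.parse

def build_webhook_body(message, request_encoding, request_prefix="", request_suffix=""):
    # Illustrates prefix + encoded message + suffix assembly for each encoding.
    if request_encoding == "JSON":
        return json.dumps({"message": request_prefix + message + request_suffix})
    if request_encoding == "URL":
        return request_prefix + urllib.parse.quote_plus(message) + request_suffix
    if request_encoding == "HTML":
        return request_prefix + html.escape(message) + request_suffix
    if request_encoding == "CSVQUOTED":
        return request_prefix + '"' + message.replace('"', '""') + '"' + request_suffix
    raise ValueError("unknown request_encoding: " + request_encoding)

print(build_webhook_body("home-nas is DOWN", "URL",
                         "token=your_app_token&user=your_user_key&message="))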

  • max_threads (integer, optional): Number of concurrent threads for checking resources in parallel. Must be ≥ 1. Default: 1 (single-threaded). Can be overridden by command line -t option.

max_threads: 1

Note: For near-realtime monitoring environments, set max_threads to 5-15 to enable parallel checking of multiple resources. Single-threaded mode (1) is recommended for small systems like Raspberry Pi or when log clarity is important. This setting is overridden by the -t command line argument if specified.

  • max_retries (integer, optional): Number of times to retry failed checks before marking resource as down. Must be ≥ 1. Default: 3
max_retries: 3

Note: For near-realtime monitoring, set max_retries: 1 to reduce detection latency. Higher values (3-5) are better for unstable networks where transient failures are common.

  • max_try_secs (integer, optional): Timeout in seconds for each individual check attempt. Must be ≥ 1. Default: 20
max_try_secs: 20
  • check_every_n_secs (integer, optional): Default seconds between checks for all monitors. Individual monitors can override this with their own check_every_n_secs setting. Must be ≥ 1. Default: 60
check_every_n_secs: 300

Note: This sets the baseline check interval for all monitors. Can be overridden per-monitor for resources requiring different check frequencies. When a monitor's configuration changes (detected via SHA-256 checksum), it is checked immediately regardless of this interval.

  • notify_every_n_secs (integer, optional): Default minimum seconds between outage notifications for all monitors. Individual monitors can override this with their own notify_every_n_secs setting. Must be ≥ 1. Default: 600
notify_every_n_secs: 1800

Note: This sets the baseline notification throttling interval. Combined with after_every_n_notifications, controls the notification escalation curve for all monitors unless overridden per-monitor.

  • after_every_n_notifications (integer, optional): Default number of notifications after which the notification interval reaches notify_every_n_secs for all monitors. Individual monitors can override this with their own after_every_n_notifications setting. Must be ≥ 1. Default: 1 (constant notification intervals)
after_every_n_notifications: 1

Note: When set to a value > 1, notification intervals start shorter and gradually increase following a quadratic Bezier curve until reaching notify_every_n_secs after the specified number of notifications. This provides more frequent alerts at the start of an outage when immediate attention is needed, then reduces notification frequency as the outage continues. A value of 1 maintains constant notification intervals (original behavior).

monitors: configuration options

The monitors section is a list of resources to monitor. Each monitor defines what to check and how often.

Required Fields (All Monitor Types)

  • type (string): Type of check to perform. Must be one of:

    • ping: ICMP ping check
    • http: HTTP/HTTPS endpoint check (supports both HTTP and HTTPS schemes, follows and checks redirect chain for errors)
    • quic: HTTP/3 over QUIC endpoint check (UDP-based, faster than HTTP/HTTPS for high-latency networks)
  • name (string): Unique identifier for this monitor. Must be unique across all monitors in the configuration. Used in notifications and state tracking.

  • address (string): Resource to check. Format depends on monitor type:

    • For ping: Valid hostname, IPv4, or IPv6 address
    • For http/quic: Full URL with scheme and host

Optional Fields (All Monitor Types)

  • check_every_n_secs (integer, optional): Seconds between checks for this resource. Overrides site-level check_every_n_secs. Must be ≥ 1. Default: 60 (or site-level setting if configured)
check_every_n_secs: 300

Note: When a monitor's configuration changes (any field modification), the monitor is checked immediately on the next run regardless of this interval. Configuration changes are detected via SHA-256 checksum stored in the state file.

  • notify_every_n_secs (integer, optional): Minimum seconds between outage notifications while resource remains down. Must be ≥ 1 and ≥ check_every_n_secs. Default: 600
notify_every_n_secs: 1800
  • after_every_n_notifications (integer, optional): Number of notifications after which the notification interval reaches notify_every_n_secs for this specific monitor. Overrides site-level after_every_n_notifications. Can only be specified if notify_every_n_secs is present. Must be ≥ 1.
notify_every_n_secs: 3600
after_every_n_notifications: 5

Behavior: Notification timing follows a quadratic Bezier curve—intervals start shorter and gradually increase over the first N notifications until reaching the full notify_every_n_secs interval. After N notifications, the interval remains constant at notify_every_n_secs. This provides aggressive early alerting that tapers off as outages persist.

  • email (boolean/integer/string, optional): Master switch to enable/disable email notifications for this specific monitor. Accepts: true/yes/on/1 (case-insensitive) for enabled, false/no/off/0 for disabled. Default: true (enabled if email_server configured)
email: true

Note: When set to false, this monitor will not send any email notifications regardless of site-level outage_emails configuration. Useful for non-critical resources or during maintenance windows. This is a monitor-level override that takes precedence over all other email settings.

  • heartbeat_url (string, optional): URL to ping (HTTP GET) when resource check succeeds. Useful for external monitoring services like Site24x7 or Healthchecks.io. Must be valid URL with scheme and host.
heartbeat_url: "https://hc-ping.com/your-uuid-here"
  • heartbeat_every_n_secs (integer, optional): Seconds between heartbeat pings. Must be ≥ 1. Can only be specified if heartbeat_url is present. If not specified, heartbeat is sent on every successful check.
heartbeat_every_n_secs: 300

HTTP/QUIC Monitor Specific Fields

These fields are only valid for monitors with type: http or type: quic:

  • expect (string, optional): Substring that must appear in the HTTP response body for the check to succeed. If not present, any 200 OK response is considered successful. The check performs a simple string search—if the expected content appears anywhere in the response body, the check passes.
expect: "System Name: <b>HomeLab</b>"

Note: The expect field is string-only for simplicity. It performs exact substring matching (case-sensitive). For complex validation scenarios requiring status code checks, header validation, or regex matching, consider using external monitoring tools or extending APMonitor.

  • ssl_fingerprint (string, optional): SHA-256 fingerprint of the expected SSL/TLS certificate (with or without colons). Enables certificate pinning for self-signed certificates. When specified, the certificate is verified before making the HTTP request.
ssl_fingerprint: "e85260e8f8e85629cfa4d023ea0ae8dd3ce8ccc0040b054a4753c2a5ab269296"
  • ignore_ssl_expiry (boolean/integer/string, optional): Skip SSL/TLS certificate expiration checking. Accepts: true/1/"yes"/"ok" (case-insensitive) for true, or false/0/"no" for false. Useful for development environments or when certificate renewal is managed separately.
ignore_ssl_expiry: true
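
To discover the fingerprint to pin, one option (a standalone snippet, not a bundled APMonitor feature) is the short Python sketch below; replace the host and port with your own:

import hashlib, ssl

def cert_sha256_fingerprint(host, port=443):
    # Fetch the server certificate without validation (so self-signed certs work)
    # and hash its DER encoding -- the result is the value for ssl_fingerprint.
    pem = ssl.get_server_certificate((host, port))
    der = ssl.PEM_cert_to_DER_cert(pem)
    return hashlib.sha256(der).hexdigest()

print(cert_sha256_fingerprint("192.168.1.12"))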

Example Configurations

Ping Monitor:

- type: ping
  name: home-gateway
  address: "192.168.1.1"
  check_every_n_secs: 60
  heartbeat_url: "https://hc-ping.com/uuid-here"

HTTP Monitor with Content Check:

- type: http
  name: web-server
  address: "http://192.168.1.100/health"
  expect: "status: ok"
  check_every_n_secs: 120
  notify_every_n_secs: 3600

HTTPS Monitor with Certificate Pinning:

- type: http
  name: nvr0
  address: "https://192.168.1.12/api/system"
  expect: "nvr0"
  ssl_fingerprint: "e85260e8f8e85629cfa4d023ea0ae8dd3ce8ccc0040b054a4753c2a5ab269296"
  ignore_ssl_expiry: true
  heartbeat_url: "https://plus.site24x7.com/hb/uuid/nvr0"
  heartbeat_every_n_secs: 60

QUIC Monitor (HTTP/3):

- type: quic
  name: fast-api
  address: "https://api.example.com/health"
  expect: "healthy"
  check_every_n_secs: 30
  ssl_fingerprint: "a1b2c3d4e5f67890abcdef1234567890abcdef1234567890abcdef1234567890"

Note: QUIC monitoring uses HTTP/3 over UDP (port 443 by default) and is particularly effective for high-latency networks or when monitoring resources over unreliable connections. QUIC provides built-in connection migration and improved performance compared to TCP-based HTTP/2.

Validation Rules

The configuration validator enforces these rules:

  1. Monitor names must be unique across all monitors
  2. notify_every_n_secs must be ≥ check_every_n_secs if both specified
  3. heartbeat_every_n_secs can only be specified if heartbeat_url exists
  4. expect, ssl_fingerprint, and ignore_ssl_expiry are only valid for HTTP/QUIC monitors
  5. expect must be a non-empty string if specified
  6. All URLs must include both scheme (http/https) and hostname
  7. Email addresses must match standard email format (RFC 5322 simplified)
  8. SSL fingerprints must be valid hexadecimal strings with length that's a power of two
  9. after_every_n_notifications can only be specified if notify_every_n_secs is present
  10. outage_emails can only be specified if email_server is configured
  11. If email_server is present, smtp_host, smtp_port, and from_address are required
  12. smtp_username and smtp_password are optional (for servers without authentication)
  13. Email control flags (email_outages, email_recoveries, email_reminders) accept boolean or string values
  14. Monitor-level email flag accepts boolean or string values

Dependencies

Install system-wide for production use:

sudo pip3 install PyYAML requests pyOpenSSL urllib3 aioquic

Or on Debian 12+ systems:

sudo pip3 install --break-system-packages PyYAML requests pyOpenSSL urllib3 aioquic

Note: The aioquic package is required for QUIC/HTTP3 monitoring support. If you don't plan to use type: quic monitors, you can omit this dependency.

Example invocation:

./APMonitor.py -s /tmp/statefile.json homelab-monitorhosts.yaml 
./APMonitor.py --test-webhooks -v homelab-monitorhosts.yaml 
./APMonitor.py --test-emails -v homelab-monitorhosts.yaml 

Command Line Usage

APMonitor is invoked from the command line with various options to control verbosity, threading, state file location, and testing modes.

Synopsis

./APMonitor.py [OPTIONS] <config_file>

Command Line Options

  • config_file (required): Path to YAML or JSON configuration file

  • -v, --verbose: Increase verbosity level (can be repeated: -v, -vv, -vvv). Shows check progress, skip reasons, and diagnostic information. Useful for troubleshooting configuration or understanding monitoring behavior.

  • -t, --threads <N>: Number of concurrent threads for checking resources (default: 1). Higher values enable parallel checking of multiple resources but increase lock contention. Use values > 1 for systems with many independent monitors. If max_threads is set in the site config, this command-line value overrides it.

  • -s, --statefile <path>: Path to state file for persistence (default: platform-dependent). Recommended: use /tmp/statefile.json to store state in tmpfs for better performance and reduced disk wear.

  • --test-webhooks: Test webhook notifications by sending a dummy alert to all configured webhooks, then exit. Does not check resources or modify state file. Useful for verifying webhook configuration and credentials.

  • --test-emails: Test email notifications by sending a dummy alert to all configured email addresses, then exit. Does not check resources or modify state file.

Common Usage Examples

Basic Monitoring (Single-Threaded)

Run with default settings, state stored in tmpfs:

./APMonitor.py -s /tmp/statefile.json monitoring-config.yaml

Verbose Monitoring for Debugging

Show detailed progress and decision-making:

./APMonitor.py -v -s /tmp/statefile.json monitoring-config.yaml

High-Frequency Monitoring (Multiple Threads)

Check many resources concurrently for near-realtime behavior:

./APMonitor.py -t 10 -s /tmp/statefile.json monitoring-config.yaml

Use higher thread counts (-t 5 to -t 20) when:

  • Monitoring many independent resources (50+)
  • Resources have long check timeouts
  • Near-realtime alerting is required
  • System has sufficient CPU cores

Warning: High thread counts increase lock contention. Test with -v to ensure checks aren't blocking each other.

Test Webhook Configuration

Verify webhooks are configured correctly before production use:

./APMonitor.py --test-webhooks -v monitoring-config.yaml

This sends test messages to all configured webhooks with verbose output showing request/response details.

Test Email Configuration

Verify email settings work correctly:

./APMonitor.py --test-emails -v monitoring-config.yaml

Running APMonitor.py Continuously

APMonitor is designed to be run repeatedly rather than as a long-running daemon. There are two common approaches:

Option 1: Cron (Recommended for Most Cases)

Run every minute via cron for standard monitoring:

* * * * * /path/to/APMonitor.py -s /tmp/statefile.json /path/to/monitoring-config.yaml 2>&1 | logger -t apmonitor

NB: PID file locking should keep overlapping runs under control if an invocation happens to run longer than a minute.

Advantages:

  • Automatic restart if process crashes
  • Built-in scheduling
  • System handles process lifecycle
  • Easy to enable/disable (comment out cron entry)

Best for: Production systems, servers with standard monitoring requirements (check intervals ≥ 60 seconds)

Option 2: While Loop (For Sub-Minute Monitoring)

Run continuously with short sleep intervals for near-realtime monitoring:

#!/bin/bash
while true; do
    ./APMonitor.py -t 5 -s /tmp/statefile.json monitoring-config.yaml
    sleep 10
done

Or as a one-liner:

while true; do ./APMonitor.py -s /tmp/statefile.json monitoring-config.yaml; sleep 30; done

Advantages:

  • Sub-minute check intervals
  • Near-realtime alerting
  • Fine control over execution frequency

Best for: Development, testing, systems requiring rapid failure detection (check intervals < 60 seconds)

Note: Use short sleep intervals (5-30 seconds) combined with per-resource check_every_n_secs settings to balance responsiveness and system load. APMonitor's internal scheduling prevents redundant checks even with frequent invocations.

Systemd Service (Alternative)

For production deployments requiring process supervision:

[Unit]
Description=APMonitor Network Resource Monitor
After=network.target

[Service]
Type=simple
ExecStart=/bin/bash -c 'while true; do /usr/local/bin/APMonitor.py -vv -s /var/tmp/apmonitor-statefile.json /usr/local/etc/apmonitor-config.yaml; sleep 15; done'
Restart=always
RestartSec=10
User=monitoring
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Default State File Location

APMonitor automatically selects a platform-appropriate default location for the state file if the -s/--statefile option is not specified:

Linux, macOS, FreeBSD, OpenBSD, NetBSD

Default: /var/tmp/apmonitor-statefile.json

  • Located in /var/tmp which persists across system reboots
  • Preserves monitoring history and outage timestamps through restarts
  • Enables accurate outage duration reporting even after system reboot
  • No special permissions required (unlike /var/run)

Windows

Default: %TEMP%\apmonitor-statefile.json

  • Uses the system temporary directory defined by TEMP or TMP environment variables
  • Typically resolves to C:\Users\<username>\AppData\Local\Temp\apmonitor-statefile.json
  • Falls back to C:\Temp\apmonitor-statefile.json if environment variables are not set

Unknown/Other Platforms

Default: ./apmonitor-statefile.json

  • Creates state file in current working directory
  • Safe fallback for uncommon or embedded systems
  • Avoids permission issues on unfamiliar filesystem layouts
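
A minimal sketch of this platform selection (assumed logic; the actual code in APMonitor.py may differ in detail):

import os, sys

def default_statefile_path():
    # Unix-like platforms: /var/tmp persists across reboots.
    if sys.platform.startswith(("linux", "darwin", "freebsd", "openbsd", "netbsd")):
        return "/var/tmp/apmonitor-statefile.json"
    # Windows: use %TEMP%/%TMP%, falling back to C:\Temp.
    if sys.platform.startswith("win"):
        tmp = os.environ.get("TEMP") or os.environ.get("TMP") or r"C:\Temp"
        return os.path.join(tmp, "apmonitor-statefile.json")
    # Unknown platforms: current working directory.
    return os.path.join(os.getcwd(), "apmonitor-statefile.json")

print(default_statefile_path())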

Concurrency and Multiple Instances

APMonitor's state-file and PID locking are designed for single-process concurrency only: multiple threads within one process safely share state through internal locks, but no file-level locking is implemented to coordinate between multiple APMonitor processes.

Having said that, APMonitor is re-entrant and thread-safe for the most part, so if you specify different config files it will happily allow one process per config file to run in parallel. The config filename is hashed to form the PID lockfile name in tmpfs (/tmp/apmonitor-##########.lock), so that multiple lockfiles can coexist.
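
A hypothetical sketch of deriving such a per-config lockfile path (the hash function, digest length, and naming scheme here are assumptions):

import hashlib, os, tempfile

def pid_lockfile_path(config_path):
    # Hash the absolute config path so each config file gets its own lockfile.
    digest = hashlib.sha256(os.path.abspath(config_path).encode("utf-8")).hexdigest()[:10]
    return os.path.join(tempfile.gettempdir(), "apmonitor-" + digest + ".lock")

print(pid_lockfile_path("homelab-monitorhosts.yaml"))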

Thus, running multiple concurrent instances requires separate state files:

# Instance 1: Production monitoring
./APMonitor.py -s /var/tmp/apmonitor-prod.json prod-apmonitor-config.yaml

# Instance 2: Development monitoring
./APMonitor.py -s /var/tmp/apmonitor-dev.json dev-apmonitor-config.yaml

# Instance 3: Critical services (high-frequency)
./APMonitor.py -t 5 -s /tmp/apmonitor-critical.json critical-apmonitor-config.yaml

This enforces sensible cardinality rules: one config per site, one process per config, and one and only one such process at a time; good for running out of crontab.

Why separate state files are required:

  • No inter-process file locking mechanism exists
  • Concurrent writes from multiple processes will corrupt state files
  • Each process maintains independent monitoring schedules and notification state
  • Atomic file rotation (.new → .old) only protects single-process integrity

Use cases for multiple instances:

  • Different monitoring priorities (high-frequency critical vs. low-frequency non-critical)
  • Separate environments (production, staging, development)
  • Independent notification channels (ops team vs. dev team)
  • Isolated failure domains (prevent one misconfigured monitor from blocking others)

Override Behavior

Always specify -s/--statefile when:

  • Running from cron (working directory may vary)
  • Requiring tmpfs storage for performance (-s /tmp/apmonitor-statefile.json)
  • Managing multiple independent monitoring instances
  • Deploying in containers or restricted environments

Example: Force tmpfs storage (cleared on reboot, faster I/O):

./APMonitor.py -s /tmp/apmonitor-statefile.json apmonitor-config.yaml

Note: The apmonitor- prefix prevents naming collisions with other applications using generic statefile.json names.

Developer Notes for modifying APMonitor.py

State File

APMonitor uses a JSON state file to persist monitoring data across runs:

  • Location: Recommended path is /tmp/statefile.json for tmpfs storage (faster, no disk wear)
  • Format: JSON with per-resource nested objects containing timestamps, status, and counters
  • Atomic Updates: Uses .new and .old rotation to prevent corruption on crashes
  • Thread Safety: Protected by internal lock during concurrent access
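
A minimal sketch of such an atomic update, assuming the standard write-to-temp-then-rename pattern (APMonitor.py's own .new/.old rotation may differ in detail):

import json, os

def save_state_atomically(path, state):
    # Write to a .new file first, then atomically swap it into place so a crash
    # mid-write never leaves a half-written state file behind.
    tmp = path + ".new"
    with open(tmp, "w") as fh:
        json.dump(state, fh, indent=2)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, path)

save_state_atomically("/tmp/statefile.json", {"home-nas": {"is_up": True}})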

The state file tracks:

  • is_up: Current resource status
  • last_checked: When resource was last checked (ISO 8601 timestamp)
  • last_response_time_ms: Response time in milliseconds for successful checks
  • last_notified: When last notification was sent (ISO 8601 timestamp)
  • last_alarm_started: When current/last outage began (ISO 8601 timestamp)
  • last_successful_heartbeat: When heartbeat URL last succeeded (ISO 8601 timestamp)
  • down_count: Consecutive failed checks
  • notified_count: Number of notifications sent for current outage
  • error_reason: Last error message
  • last_config_checksum: SHA-256 hash of monitor configuration (detects config changes)
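
A hypothetical state entry for a single monitor, using the fields above (the exact nesting and field order are assumptions):

{
  "home-nas": {
    "is_up": false,
    "last_checked": "2025-01-05T10:15:00+00:00",
    "last_response_time_ms": null,
    "last_notified": "2025-01-05T10:15:02+00:00",
    "last_alarm_started": "2025-01-05T09:45:00+00:00",
    "last_successful_heartbeat": "2025-01-05T09:40:00+00:00",
    "down_count": 6,
    "notified_count": 2,
    "error_reason": "HTTP 500 Internal Server Error",
    "last_config_checksum": "a1b2c3d4e5f67890abcdef1234567890abcdef1234567890abcdef1234567890"
  }
}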

Note: If using /tmp/statefile.json, the state file is cleared on system reboot. This resets all monitoring history but doesn't affect functionality—monitoring resumes normally on first run.

Configuration Change Detection: The last_config_checksum field stores a SHA-256 hash of the entire monitor configuration (all fields including type, name, address, expect, etc.). When APMonitor detects a configuration change (checksum mismatch), it immediately checks that monitor regardless of check_every_n_secs timing. This ensures configuration changes take effect on the next run without waiting for the scheduled check interval.
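
One way to compute such a per-monitor checksum (a sketch only; the exact canonicalization APMonitor.py uses may differ):

import hashlib, json

def monitor_checksum(monitor):
    # Serialize the monitor's config deterministically and hash it; any field
    # change (type, name, address, expect, ...) produces a different checksum.
    canonical = json.dumps(monitor, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

print(monitor_checksum({"type": "ping", "name": "home-fw", "address": "192.168.1.1"}))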

Execution Flow

Here are some basic devnotes on how APMonitor is built, in case you want to modify it.

Each invocation of APMonitor:

  1. Acquires a PID lockfile in tmpfs, using a hash of the config path so multiple site configs can run in parallel.
  2. Loads and validates configuration file
  3. Loads previous state from state file (if exists)
  4. For each monitor:
    • Calculates SHA-256 checksum of monitor configuration
    • Checks if configuration changed (checksum mismatch) or check_every_n_secs elapsed since last_checked
    • If config changed: checks immediately (bypasses timing)
    • If due: performs resource check
    • If down and notify_every_n_secs elapsed: sends notifications
    • If up and heartbeat configured: pings heartbeat URL if due
    • Updates state atomically with new checksum
  5. Saves state file with execution timing
  6. Cleans up the PID file if possible.
  7. Exits

Because each run is short-lived and all state is persisted to the state file, APMonitor can be killed and restarted safely at any time without losing monitoring history or creating duplicate notifications.

Modifying with AI

APMonitor was designed with an engineering-based approach to Vibe Coding in mind, should you wish to change it.

Steps:

  1. Paste READAI.md (containing an Entrance Prompt) into your favourite AI coding tool (e.g., Grok 4.1 or Claude Sonnet)
  2. Paste in APMonitor.py (tell your AI this is the source code)
  3. Paste in README.md (tell your AI this is the documentation)
  4. Vibe your changes as you see fit.

Enjoy!

Installation Instructions - Debian Linux

This guide covers installing APMonitor as a systemd service on Debian-based systems (Debian 10+, Ubuntu 20.04+).

Prerequisites

Fresh Debian/Ubuntu system with sudo access.

Automated Install - Quickstart

If you want an automated install, just follow these commands; otherwise, start with Step 1 below:

# Install (requires root)
sudo make install

# Edit configuration
sudo nano /usr/local/etc/apmonitor-config.yaml

# Test configuration
make test-config

# Enable and start service
sudo make enable

# Check status
make status

# View logs
make logs

# Restart after config changes
sudo make restart

# Uninstall completely
sudo make uninstall

Step 1: Install System Dependencies

sudo apt update
sudo apt install python3 python3-pip -y

Step 2: Install Python Dependencies

Install dependencies globally (required for systemd service):

sudo pip3 install --break-system-packages PyYAML requests pyOpenSSL urllib3 aioquic

Note: On Debian 12+, the --break-system-packages flag is required. On older systems, omit this flag:

sudo pip3 install PyYAML requests pyOpenSSL urllib3 aioquic

Dependencies installed:

  • PyYAML - YAML configuration file parsing
  • requests - HTTP/HTTPS resource checking and webhook notifications
  • pyOpenSSL - SSL certificate verification and fingerprint checking
  • urllib3 - HTTP connection pooling (dependency of requests)
  • aioquic - QUIC/HTTP3 protocol support (required for type: quic monitors)

Step 3: Create Monitoring User

Create a dedicated system user for running APMonitor:

sudo useradd -r -s /bin/bash -d /var/lib/apmonitor -m monitoring

Step 4: Install APMonitor

Copy the APMonitor script and example configuration to system locations:

# Install APMonitor script
sudo cp APMonitor.py /usr/local/bin/
sudo chmod +x /usr/local/bin/APMonitor.py

# Install example configuration
sudo cp example-apmonitor-config.yaml /usr/local/etc/apmonitor-config.yaml
sudo chown monitoring:monitoring /usr/local/etc/apmonitor-config.yaml
sudo chmod 640 /usr/local/etc/apmonitor-config.yaml

Important: Edit /usr/local/etc/apmonitor-config.yaml to configure your monitoring targets, notification endpoints, and site name before proceeding.

Step 5: Create systemd Service

Create the systemd service definition:

sudo nano /etc/systemd/system/apmonitor.service

Paste the following content:

[Unit]
Description=APMonitor Network Resource Monitor
After=network.target

[Service]
Type=simple
ExecStart=/bin/bash -c 'while true; do /usr/local/bin/APMonitor.py -vv -s /var/tmp/apmonitor-statefile.json /usr/local/etc/apmonitor-config.yaml; sleep 15; done'
Restart=always
RestartSec=10
User=monitoring
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Save and exit (Ctrl+X, then Y, then Enter in nano).

Step 6: Enable and Start Service

Reload systemd, enable the service to start on boot, and start it:

sudo systemctl daemon-reload
sudo systemctl enable apmonitor.service
sudo systemctl start apmonitor.service

Step 7: Verify Operation

Check service status:

sudo systemctl status apmonitor.service

View live logs:

sudo journalctl -u apmonitor.service -f

View recent logs:

sudo journalctl -u apmonitor.service -n 100

Troubleshooting

Test APMonitor Manually

Run APMonitor manually as the monitoring user to verify configuration:

sudo -u monitoring /usr/local/bin/APMonitor.py -vv -s /var/tmp/apmonitor-statefile.json /usr/local/etc/apmonitor-config.yaml

Test Webhook Notifications

Test webhook configuration without checking resources:

sudo -u monitoring /usr/local/bin/APMonitor.py --test-webhooks -v /usr/local/etc/apmonitor-config.yaml

Test Email Notifications

Test email configuration without checking resources:

sudo -u monitoring /usr/local/bin/APMonitor.py --test-emails -v /usr/local/etc/apmonitor-config.yaml

Check State File Permissions

Verify the monitoring user can write to the state file location:

sudo ls -la /var/tmp/apmonitor-statefile.json

The /var/tmp directory should have permissions 1777 (drwxrwxrwt) allowing any user to create files.

View Configuration

Display the active configuration:

sudo cat /usr/local/etc/apmonitor-config.yaml

Service Management Commands

# Stop service
sudo systemctl stop apmonitor.service

# Restart service (after config changes)
sudo systemctl restart apmonitor.service

# Disable service from starting on boot
sudo systemctl disable apmonitor.service

# Check if service is enabled
sudo systemctl is-enabled apmonitor.service

Updating Configuration

After modifying /usr/local/etc/apmonitor-config.yaml, the changes take effect automatically on the next monitoring cycle (typically within 30 seconds). APMonitor detects configuration changes via SHA-256 checksums and immediately checks any modified monitors, so you don't need to restart the service unless you want immediate effect.

To force immediate checking of all monitors after config changes:

sudo systemctl restart apmonitor.service

Uninstallation

To completely remove APMonitor:

# Stop and disable service
sudo systemctl stop apmonitor.service
sudo systemctl disable apmonitor.service

# Remove service file
sudo rm /etc/systemd/system/apmonitor.service
sudo systemctl daemon-reload

# Remove files
sudo rm /usr/local/bin/APMonitor.py
sudo rm /usr/local/etc/apmonitor-config.yaml
sudo rm /var/tmp/apmonitor-statefile.json*

# Remove monitoring user
sudo userdel -r monitoring

# Optionally remove Python dependencies
sudo pip3 uninstall -y PyYAML requests pyOpenSSL urllib3 aioquic

TODO

  • Add additional monitors:

    • TCP & UDP port monitoring
    • SNMP w/defaults for managed switches and system performance tuning
    • Update docs to provide webhook examples for Pushover, Slack & Discord
  • Add additional outputs:

    • MRTG compatible logfiles
    • MRTG compatible graph generation w/index.html
  • Aggregated root cause alerting:

    • Specify parent dependencies using config option parent_name so we have a network topology graph
    • Add loop detection to ensure the topology graph is a DAG
    • Use the topology to only notify outages for the root cause and list the affected services in the same alert
    • When a monitored resource has multiple parent dependencies, specify if it's down when all are down (AND relation) or down when one is down (OR relation)
    • Consider correct use of pre/in/post-order traversal when deciding which alerts to drop
  • Convert finished version to pure C APMonitor.c

    • Strictly only with libc/SVR4 C Systems Programming dependencies for really tiny cross-platform embedded systems application environments
    • Add a Mercator + APTree.c #InfoRec inspired/styled priority queue for handling large numbers of monitored resources with proper realtime programming guarantees
    • Test whether we are running as root when doing a ping syscall and fall back to direct SOCK_RAW if we are, for higher performance

Licensing & Versioning

APMonitor.py is licensed under the GNU General Public License version 3.

Software: APMonitor 1.1.0
License: GNU General Public License version 3
Licensor: Andrew (AP) Prendergast, ap@andrewprendergast.com -- FSF Member

We use SemVer for version numbering.
