Advanced Nagios Plugins Collection
Largest, most advanced collection of production-grade Nagios monitoring code (over 450 programs).
Supports most major open source NoSQL technologies, Pub-Sub / Message Buses, CI, Web and Linux based infrastructure, including:
- Hadoop - extensive API integration to all major Hadoop vendors (Cloudera, Hortonworks, MapR, IBM BigInsights)
- Solr / SolrCloud
- Travis CI
- Linux - various, including the widely used check_yum.py for RHEL / CentOS yum security updates
- SSL Certificate expiry in days & validations
- Whois domain expiry in days & validations
- advanced DNS record checks (MX, NS, SRV etc)
- Git, MySQL ... etc.
Supports a wide variety of compatible Enterprise Monitoring systems.
Most enterprise monitoring systems come with basic generic checks, while this project extends their monitoring capabilities significantly further into advanced infrastructure, application layer, APIs etc.
If running against services in Cloud or Kubernetes, just target the load balancer address or the Kubernetes Service or Ingress addresses.
Also useful on the command line for testing, or in scripts for dependency availability checking. It also comes with a selection of advanced HAProxy configurations for these technologies to make monitoring and scripting easier for clustered technologies.
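For scripted dependency availability checks, a small retry wrapper is often enough. The sketch below is not part of this repo: wait_for_ok is a hypothetical helper name, and it relies only on the standard Nagios convention that a plugin exits 0 for OK (1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN).

```shell
# Hypothetical helper (not shipped with this repo):
# retry a Nagios-style check until it exits 0 (OK) or attempts run out.
# Usage: wait_for_ok <max_attempts> <sleep_seconds> <command> [args...]
wait_for_ok() {
    wfo_max="$1"
    wfo_sleep="$2"
    shift 2
    wfo_attempt=1
    while [ "$wfo_attempt" -le "$wfo_max" ]; do
        # Any non-zero exit (WARNING, CRITICAL, UNKNOWN) counts as "not ready"
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Only sleep if another attempt is still to come
        if [ "$wfo_attempt" -lt "$wfo_max" ]; then
            sleep "$wfo_sleep"
        fi
        wfo_attempt=$((wfo_attempt + 1))
    done
    return 1
}
```

For example, wait_for_ok 30 2 ./check_zookeeper.pl <args> would poll for roughly a minute before giving up. Take the actual switches from the program's --help.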
Cloud & Big Data Contractor, United Kingdom
(you're welcome to connect with me on LinkedIn)
Make sure you run make update if updating, and not just git pull, as you will often need the latest library submodules and probably new upstream libraries too.
- Git clone this repo and compile dependencies by running
- Download pre-built self-contained Docker image
Execute each program on the command line with --help to see its options.
Ready-to-run Docker Image
docker pull harisekhon/nagios-plugins
List all plugins:
docker run harisekhon/nagios-plugins
Run any given plugin by suffixing it to the docker run command:
docker run harisekhon/nagios-plugins <program> <args>
docker run harisekhon/nagios-plugins check_ssl_cert.pl --help
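If you call plugins through Docker often, a thin shell wrapper saves typing. This is only a sketch: nagios_plugin is a hypothetical function name, the DOCKER override exists purely so the wrapper can be dry-run, and --rm is added here so that stopped containers do not pile up after each check.

```shell
# Hypothetical wrapper (not shipped with this repo):
# run any plugin from the Docker image as if it were a local command.
# DOCKER may be overridden, eg. DOCKER=echo to print the command instead of running it.
nagios_plugin() {
    "${DOCKER:-docker}" run --rm harisekhon/nagios-plugins "$@"
}
```

With this in your shell, nagios_plugin check_ssl_cert.pl --help is equivalent to the docker run line above. docker run exits with the exit code of the command run inside the container, so the plugin's Nagios status should still reach the caller.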
There are also :ubuntu tagged docker images available, as well as :perl only images.
Tag the build locally as :stable, or with a date-time stamp, and run off that tag so that it is not auto-replaced by newer :latest builds. This lets you control updates to suit your schedule and prevents random delays from docker run pulling down newer builds from DockerHub.
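A minimal sketch of the local tagging described above, using the standard docker tag command. pin_image is a hypothetical helper name, and the DOCKER override is only there to allow a dry run.

```shell
# Hypothetical helper (not shipped with this repo):
# pin the currently pulled :latest build to a local tag.
# Defaults to :stable; pass a second argument for a date-time stamped tag.
pin_image() {
    pin_name="$1"
    pin_tag="${2:-stable}"
    "${DOCKER:-docker}" tag "${pin_name}:latest" "${pin_name}:${pin_tag}"
}
```

After pin_image harisekhon/nagios-plugins, run checks with docker run harisekhon/nagios-plugins:stable <program> <args>. For a date-time stamped tag, something like pin_image harisekhon/nagios-plugins "$(date +%Y%m%d-%H%M)" would do.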
Automated Build from Source
curl -L https://git.io/nagios-plugins-bootstrap | sh
git clone https://github.com/harisekhon/nagios-plugins
cd nagios-plugins
make build zookeeper
Now run any plugin with --help to find out which switches to use.
Make sure to read Detailed Build Instructions further down for more information.
Optional: Generate self-contained Perl scripts with all dependencies built in to each file for easy distribution
After make build has finished, if you want to make self-contained versions of all the Perl scripts with all dependencies included for copying around, run:
The self-contained scripts will be available in the
Quick Plugins Guide
There are over 400 programs in this repo so these are just some of the highlights.
- Hadoop Ecosystem - HDFS, Yarn, HBase, Ambari, Atlas, Ranger
- Service Discovery & Coordination - ZooKeeper, Consul, Vault
- Cloud - AWS
- Docker / Containerization - Docker & Docker Swarm, Mesos, Kubernetes
- Search - Elasticsearch, Solr / SolrCloud
- NoSQL - Cassandra, Redis, Riak, Memcached, CouchDB
- SQL Databases - MySQL
- Pub-Sub / Message Queues - Kafka, Redis, RabbitMQ
- CI - Continuous Integration & Build Systems - Jenkins, Travis CI, GoCD, DockerHub, Git
check_hadoop_*.pl/py - various Apache Hadoop monitoring utilities for HDFS, YARN and MapReduce (both MRv1 & MRv2):
- Hadoop - Masters' status and High Availability (ZKFC, Active/Standby state), Worker nodes counts, dead nodes / blacklisted / unhealthy nodes, heap usage, metrics and JMX information with optional thresholds & graph data
- HDFS - NameNode & DataNode checks, cluster space, balance, block replication, block count limits per datanode / cluster total, safe mode, failed name dirs, WebHDFS (with HDFS HA failover support), HttpFS, HDFS writeability, rack resilience configuration (checks more than 1 rack configured, finds nodes with default rack configured), HDFS fsck status / last check / run time / max blocks, HDFS file / directory existence & metadata attributes
- Infrastructure - Resource Manager and NodeManager checks, unhealthy NodeManagers, queue state, queue capacity, app stats, % of memory allocated, metrics with optional thresholds to check things such as activeNodes, appsPending, lostNodes, unhealthyNodes etc.
- Apps - app last finished state / user / queue / elapsed time (batch job SLAs), queue apps allowed/disallowed (catch Spark Shells on production queue), app running (check long living yarn service is still alive), long running apps/jobs detection with name and queue include/exclude regexes (detect SLA breaches for in-progress batch jobs or forgotten Spark Shells holding resources, both Spark Scala and PySpark)
check_hbase_*.py/pl - various Apache HBase monitoring utilities using Thrift + Stargate APIs, checks Masters / Backup Masters, RegionServers, table availability (exists, is enabled, and has minimum number of column families), number of expected table regions, unassigned table regions, regions stuck in transition, region count balance across RegionServers, requests per sec balance across RegionServers, compaction in progress (by table and by regionserver), number of regions in transition, longest current region migration time, hbck status for any inconsistencies, cell content vs optional regex + thresholds, table write and read back of unique generated values with write/read/delete latency checks against all detected column families, table write spray and read back of unique values across all regions for all column families with write/read/delete latency checks, gather metrics
check_atlas_*.py - Apache Atlas metadata server instance status, as well as metadata entity checks including entity existence, state=ACTIVE, expected type, expected tags are assigned to entity (eg. PII - important because Ranger ACLs to allow or deny access to data can be assigned based on tags)
check_ranger_*.pl/.py - Apache Ranger checks:
- policy checks - existence, enabled, has auditing enabled, is recursive, last updated vs thresholds (to catch changes), repository name and type that the policy belongs to
- repository checks - existence, active, type (eg. hive, hdfs), last updated vs thresholds (to catch changes)
- number of policies and repositories vs thresholds
check_ambari_*.pl - Apache Ambari API checks for Hadoop clusters running the standard open source Hortonworks distribution - checks the service status, node(s) status, stale configs, cluster alerts summary, host alerts summary, cluster health report, hdfs rack resilience configured (checks more than 1 rack configured, finds nodes with default rack configured), kerberos enabled, cluster version, service config compatible with stack and cluster
Attivio, Blue Talon, Datameer, Platfora, Zaloni plugins are also available for those proprietary products related to Hadoop.
check_cloudera_manager_*.pl - Hadoop cluster checks via Cloudera Manager API - checks states and health of cluster services/roles/nodes, management services, config staleness, Cloudera Enterprise license expiry, Cloudera Manager and CDH cluster versions, utility switches to list clusters/services/roles/nodes as well as list users and their role privileges, fetch a wealth of Hadoop & OS monitoring metrics from Cloudera Manager and compare to thresholds. Disclaimer: I worked for Cloudera, but seriously CM collects an impressive amount of metrics making check_cloudera_manager_metrics.pl alone a very versatile program from which to create hundreds of checks to flexibly alert on
- Hortonworks - the standard modern Hadoop distribution - see check_ambari_*.pl in the Hadoop Ecosystem section above
check_mapr*.pl - Hadoop cluster checks via MapR Control System API - checks services and nodes, MapR-FS space (cluster and per volume), volume states, volume block replication, volume snapshots and mirroring, MapR-FS per disk space utilization on nodes, failed disks, CLDB heartbeats, MapR alarms, MapReduce mode and memory utilization, disk and role balancer metrics. These are noticeably faster than running equivalent maprcli commands (exceptions: disk/role balancer use maprcli).
check_ibm_biginsights_*.pl - Hadoop cluster checks via IBM BigInsights Console API - checks services, nodes, agents, BigSheets workbook runs, dfs paths and properties, HDFS space and block replication, BI console version, BI console applications deployed
check_hiveserver2* - Apache Hive - HiveServer2 LLAP Interactive server status and uptime, peer count, check for a specific peer host fqdn via regex and a basic beeline connection trivial query test
check_apache_drill_*.py/.pl - Apache Drill checks:
- cluster wide: number of online / offline cluster nodes, mismatched versions across cluster
- per drill node: status, cluster membership, encryption enabled, config settings, storage plugins enabled, version, metrics with optional thresholds
check_presto_*.py - Presto SQL DB checks:
- cluster checks (via coordinator API) - number of current queries, running/failed/blocked/queued queries, tasks, worker nodes, failed worker nodes, workers with response lag to coordinator, workers with recent failures and recent failure ratios vs thresholds, version
- per node checks - status, if coordinator, environment
- per worker checks (via coordinator API) - specific worker registered with coordinator, response age to coordinator, recent requests vs threshold, recent successes, recent failures & failure ratio vs thresholds
Service Discovery & Coordination
check_zookeeper.pl - Apache ZooKeeper server checks, multiple layers: "is ok" status, is writable (quorum), operating mode (leader/follower vs standalone), gather statistics
check_zookeeper_*znode*.pl - ZooKeeper znode checks using ZK Perl API, useful for HBase, Kafka, SolrCloud, Hadoop HDFS & Yarn HA (ZKFC) and any other ZooKeeper-based service. Very versatile with multiple optional checks including data vs regex, json field extraction, ephemeral status, child znodes, znode last modified age
check_consul_*.py - Consul API write / read back, arbitrary key-value content checks, cluster leader election, number of cluster peers, service leader election, version
check_vault_*.py - Hashicorp's Vault API checks - health checks is initialized, is not standby, is vault sealed / unsealed, time skew between Vault server and local, is high availability enabled, is current leader, is leader found, version
check_aws_s3_file.pl - check for the existence of any arbitrary file on AWS S3, eg. to check backups have happened or _SUCCESS placeholder files are present for a job
check_aws_access_keys_age.py - checks for AWS access key age greater than N days to delete/rotate old keys as per best practice (optionally only alerts for active keys)
check_aws_access_keys_disabled.py - checks for AWS disabled access keys that should be removed
check_aws_api_ping.py - simple yes/no check for AWS API access, can be used to test access key credentials and as a dependency check for all other AWS checks
check_aws_cloudtrails_enabled.py - checks Cloud Trails have logging enabled, multi-region and logfile validation. Optionally check only a single named cloud trail
check_aws_cloudtrails_event_selectors.py - checks Cloud Trails have at least one event selector each with management and read+write logging. Optionally check only a single named cloud trail
check_aws_ec2_instance_count.py - checks the number of running instances with optional range thresholds
check_aws_ec2_instance_states.py - checks the state of all EC2 instances, outputting totals and checking warning thresholds for each status type
check_aws_password_policy.py - checks the AWS password policy including minimum length, maximum age, password reuse count, uppercase/lowercase/numbers/symbols and whether users are allowed to change their passwords
check_aws_root_account.py - checks the AWS root account has MFA enabled and no access keys as per best practice
check_aws_user_last_used.py - checks if a given AWS IAM user account has been used within the last N days (eg. if root account was recently used this may indicate a security breach or is at the very least against best practice)
check_aws_users_unused.py - detects old AWS IAM user accounts that haven't been used in the last N days, neither passwords nor access keys, and should probably be removed
check_aws_users_password_last_used.py - detects AWS IAM user accounts that haven't had their passwords used in N days and should probably be removed
check_aws_users_mfa_enabled.py - checks all AWS user accounts with passwords have MFA enabled
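When running the AWS checks from a script rather than from a monitoring server with native dependency support, the dependency idea behind check_aws_api_ping.py can be sketched as a guard function. This is an illustration only: run_if_dependency_ok is a hypothetical name, and the only convention assumed is that Nagios plugins exit 3 for UNKNOWN.

```shell
# Hypothetical helper (not shipped with this repo):
# run a check only if a dependency check passes first.
# Usage: run_if_dependency_ok "<dependency command line>" <check> [args...]
run_if_dependency_ok() {
    dep_cmd="$1"
    shift
    # The dependency is passed as one string so it can carry its own arguments
    if ! sh -c "$dep_cmd" >/dev/null 2>&1; then
        # Report UNKNOWN (exit 3) rather than blaming the dependent check
        echo "UNKNOWN: dependency check failed: $dep_cmd"
        return 3
    fi
    # Dependency is fine: run the real check and pass through its exit code
    "$@"
}
```

For example, run_if_dependency_ok "./check_aws_api_ping.py" ./check_aws_root_account.py would only report on the root account when API access itself is working, so a credentials problem is not mistaken for a policy failure.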
Docker / Containerization
check_docker_*.py - Docker API checks including API ping, counts of running / paused / stopped / total containers with thresholds, specific container status by name or id, images count with thresholds, specific image:tag availability including size and checksum, counts of networks / volumes with thresholds, docker engine version
check_docker_swarm_*.py - Docker Swarm API checks including is swarm enabled, swarm node status, is the node a swarm manager, swarm service status including number of live replicas / tasks and if the service was updated recently, counts of services, swarm manager and worker nodes with thresholds, swarm errors, swarm version
check_mesos_*.pl - Mesos master health API, master & slaves state information including leader and versions, activated & deactivated slaves, number of Chronos jobs, master & slave metrics. Warning: Mesos & Mesosphere DC/OS is legacy semi-proprietary - major momentum has shifted to the open source Kubernetes project
check_kubernetes_*.py - Kubernetes API health and version
If running docker checks from within the nagios plugins docker image then you will need to expose the socket within the container, like so:
docker run -v /var/run/docker.sock:/var/run/docker.sock harisekhon/nagios-plugins check_docker_container_status.py -H unix:///var/run/docker.sock --container myContainer
OK: Docker container 'myContainer' status = 'running', started at '2020-06-03T14:03:09.78303932Z' | query_time=0.0038s
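The output above follows the usual Nagios plugin layout: a human-readable status message, then a pipe, then performance data for graphing. If you need to split the two in a script, a sketch like the following works on a single output line. message_of and perfdata_of are hypothetical helper names, not part of this repo.

```shell
# Hypothetical helpers (not shipped with this repo) for splitting
# one line of Nagios plugin output at the first pipe character.

# Print the status message (everything before the first '|')
message_of() {
    printf '%s\n' "$1" | sed 's/ *|.*$//'
}

# Print the perfdata (everything after the first '|'), or nothing if absent
perfdata_of() {
    printf '%s\n' "$1" | sed -n 's/^[^|]*| *//p'
}
```

Applied to the line above, perfdata_of yields query_time=0.0038s and message_of yields the text starting with OK: up to the timestamp.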
See also DockerHub build status nagios plugin further down in the CI section.
check_elasticsearch_*.pl/.py - Elasticsearch cluster state, shards, replicas, number of nodes & data nodes online, shard and disk % balance between nodes, single node ok, specific node found in cluster state, slow tasks, pending tasks, elasticsearch / lucene versions, per index existence / shards / replicas / settings / age, stats per cluster / index / node, X-Pack license expiry and features enabled
check_logstash_*.py - Logstash status, uptime, hot threads, plugins, version, number of pipelines online, specific pipeline online and optionally its number of workers, if its dead letter queue is enabled, outputs pipeline batch size and delay
check_solr*.pl - checks for Apache Solr and SolrCloud including API write/read/delete, arbitrary Solr queries vs num matching documents, API ping, Solr Core Heap / Index Size / Number of Docs for a given Solr Collection, and thresholds in ms against all Solr API operations as well as perfdata for graphing, as well as SolrCloud ZooKeeper content checks for collection shards and replicas states, number of live nodes in SolrCloud cluster, overseer, SolrCloud config and Solr metrics.
check_datastax_opscenter_*.pl - Apache Cassandra and DataStax OpsCenter monitoring, including Cassandra cluster nodes, token balance, space, heap, keyspace replication settings, alerts, backups, best practice rule checks, DSE hadoop analytics service status and both nodetool and DataStax OpsCenter collected metrics
check_memcached_*.pl - Memcached API writes/reads/deletes with timings, check specific key's value against regex or value range, number of current connections, gather statistics
check_couchdb_*.py - Apache CouchDB API checks including server status, database exists, doc and deleted doc counts, data size, compaction running, version
check_riak_*.pl - Riak API writes/reads/deletes with timings, check a specific key's value against regex or value range, check all riak diagnostics, check node states, check all nodes agree on ring status, gather statistics, alert on any single stat
check_redis_*.pl - Redis API writes/reads/deletes with timings, check specific key's value against regex or value range, replication slaves I/O, replicated writes (write on master -> read from slave), publish/subscribe, connected clients, validate redis.conf against running server to check deployments or remote compliance checks, gather statistics, alert on any single stat
check_mysql_query.pl - flexible free-form MySQL SQL queries - can check almost anything - obsoleted a dozen custom MySQL plugins and prevented writing many more. Tested against many versions of MySQL and MariaDB. You may also be interested in Percona's plugins
check_mysql_config.pl - detect differences in your /etc/my.cnf and running MySQL config to catch DBAs making changes to running databases without saving to /etc/my.cnf or backporting to Puppet. Can also be used to remotely validate configuration compliance against a known good baseline. Tested against many versions of MySQL and MariaDB