Skip to content

450+ AWS, Hadoop, Cloud, Kafka, Docker, Elasticsearch, RabbitMQ, Redis, HBase, Solr, Cassandra, ZooKeeper, HDFS, Yarn, Hive, Presto, Drill, Impala, Consul, Spark, Jenkins, Travis CI, Git, MySQL, Linux, DNS, Whois, SSL Certs, Yum Security Updates, Kubernetes, Cloudera etc...

master
Go to file
Code

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 
dev
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

README.md

Advanced Nagios Plugins Collection

Codacy CodeFactor Language grade: Python Quality Gate Status Maintainability Rating Reliability Rating Security Rating Total alerts GitHub stars GitHub forks Contributors GitHub Last Commit Lines of Code

Linux Mac Docker DockerHub Pulls DockerHub Build Automated StarTrack

CI Builds Overview Jenkins Concourse GoCD

Travis CI AppVeyor Drone CircleCI Codeship Status for HariSekhon/Nagios-Plugins Shippable Codefresh BuildKite buddy pipeline Cirrus CI Semaphore Wercker

Azure DevOps Pipeline GitLab Pipeline BitBucket Pipeline

Repo on Azure DevOps Repo on GitHub Repo on GitLab Repo on BitBucket

GitHub Actions Ubuntu Mac Mac 10.15 Ubuntu Ubuntu 14.04 Ubuntu 16.04 Ubuntu 18.04 Ubuntu 20.04 Debian Debian 8 Debian 9 Debian 10 CentOS CentOS 7 CentOS 8 Fedora Alpine Alpine 3 Python 2.7 Python 3.5 Python 3.6 Python 3.7 Python 3.8 PyPy 2 PyPy 3 Perl Perl 5.10 Perl 5.12 Perl 5.14 Perl 5.16 Perl 5.18 Perl 5.20 Perl 5.22 Perl 5.24 Perl 5.26 Perl 5.28

git.io/nagios-plugins

Largest, most advanced collection of production-grade Nagios monitoring code (over 450 programs).

Specialised plugins for AWS, Hadoop, Big Data & NoSQL technologies, written by a former Clouderan (Cloudera was the first Hadoop Big Data vendor) and ex-Hortonworks consultant.

Supports most major open source NoSQL technologies, Pub-Sub / Message Buses, CI, Web and Linux based infrastructure, including:

Supports a a wide variety of compatible Enterprise Monitoring systems.

Most enterprise monitoring systems come with basic generic checks, while this project extends their monitoring capabilities significantly further in to advanced infrastructure, application layer, APIs etc.

If running against services in Cloud or Kubernetes, just target the load balancer address or the Kubernetes Service or Ingress addresses.

Also useful to be run on the command line for testing or in scripts for dependency availability checking, and comes with a selection of advanced HAProxy configurations for these technologies to make monitoring and scripting easier for clustered technologies.

Fix requests, suggestions, updates and improvements are most welcome via Github issues or pull requests (in which case GitHub will give you credit and mark you as a contributor to the project :) ).

Hari Sekhon

Cloud & Big Data Contractor, United Kingdom

My LinkedIn

(you're welcome to connect with me on LinkedIn)
Make sure you run make update if updating and not just git pull as you will often need the latest library submodules and probably new upstream libraries too.

Quick Start

  1. Git clone this repo and compile dependencies by running make
    OR
  2. Download pre-built self-contained Docker image

Execute each program on the command line with --help to see its options.

Ready-to-run Docker Image

All plugins and their pre-compiled dependencies can be found ready-to-run on DockerHub, if you have Docker installed, fetch this project like so:

docker pull harisekhon/nagios-plugins

List all plugins:

docker run harisekhon/nagios-plugins

Run any given plugin by suffixing it to the docker run command:

docker run harisekhon/nagios-plugins <program> <args>

eg.

docker run harisekhon/nagios-plugins check_ssl_cert.pl --help

There are also :centos (:latest), :alpine, :debian, :fedora and :ubuntu tagged docker images available, as well as :python and :perl only images.

You should tag the build locally as :stable or date-time stamped and run off that tag to avoid it getting auto-replaced by newer :latest builds, to control updates to suit your schedule and prevent random delays from docker runs pulling down newer builds from DockerHub.

Automated Build from Source

curl -L https://git.io/nagios-plugins-bootstrap | sh

or


git clone https://github.com/harisekhon/nagios-plugins

cd nagios-plugins

make build zookeeper

Now run any plugin with --help to find out which switches to use.

Make sure to read Detailed Build Instructions further down for more information.

Optional: Generate self-contained Perl scripts with all dependencies built in to each file for easy distribution

After the make build has finished, if you want to make self-contained versions of all the perl scripts with all dependencies included for copying around, run:

make fatpacks

The self-contained scripts will be available in the fatpacks/ directory.

Quick Plugins Guide

There are over 400 programs in this repo so these are just some of the highlights.

Quick Links:
Hadoop Ecosystem
  • check_hadoop_*.pl/py - various Apache Hadoop monitoring utilities for HDFS, YARN and MapReduce (both MRv1 & MRv2):
    • Hadoop - Masters' status and High Availability (ZKFC, Active/Standby state), Worker nodes counts, dead nodes / blacklisted / unhealthy nodes, heap usage, metrics and JMX information with optional thresholds & graph data
    • HDFS - NameNode & DataNode checks, cluster space, balance, block replication, block count limits per datanode / cluster total, safe mode, failed name dirs, WebHDFS (with HDFS HA failover support), HttpFS, HDFS writeability, rack resilience configuration (checks more than 1 rack configured, finds nodes with default rack configured), HDFS fsck status / last check / run time / max blocks, HDFS file / directory existence & metadata attributes
    • Yarn:
      • Infrastructure - Resource Manager and NodeManager checks, unhealthy NodeManagers, queue state, queue capacity, app stats, % of memory allocated, metrics with optional thresholds to check things such as activeNodes, appsPending, lostNodes, unhealthyNodes etc.
      • Apps - app last finished state / user / queue / elapsed time (batch job SLAs), queue apps allowed/disallowed (catch Spark Shells on production queue), app running (check long living yarn service is still alive), long running apps/jobs detection with name and queue include/exclude regexes (detect SLAs breaches for in-progress batch jobs or forgotten Spark Shells holding resources, both Spark Scala and PySpark)
  • check_hbase_*.py/pl - various Apache HBase monitoring utilities using Thrift + Stargate APIs, checks Masters / Backup Masters, RegionServers, table availability (exists, is enabled, and has minimum number of column families), number of expected table regions, unassigned table regions, regions stuck in transition, region count balance across RegionServers, requests per sec balance across RegionServers, compaction in progress (by table and by regionserver), number of regions in transition, longest current region migration time, hbck status for any inconsistencies, cell content vs optional regex + thresholds, table write and read back of unique generated values with write/read/delete latency checks against all detected column families, table write spray and read back of unique values across all regions for all column families with write/read/delete latency checks, gather metrics
  • check_atlas_*.py - Apache Atlas metadata server instance status, as well as metadata entity checks including entity existence, state=ACTIVE, expected type, expected tags are assigned to entity (eg. PII - important because Ranger ACLs to allow or deny access to data can be assigned based on tags)
  • check_ranger_*.pl/.py - Apache Ranger checks:
    • policy checks - existence, enabled, has auditing enabled, is recursive, last updated vs thresholds (to catch changes), repository name and type that the policy belongs to
    • repository checks - existence, active, type (eg. hive, hdfs), last updated vs thresholds (to catch changes)
    • number of policies and repositories vs thresholds
  • check_ambari_*.pl - Apache Ambari API checks for Hadoop clusters written running the standard open source Hortonworks distribution - checks the service status, node(s) status, stale configs, cluster alerts summary, host alerts summary, cluster health report, hdfs rack resilience configured (checks more than 1 rack configured, finds nodes with default rack configured), kerberos enabled, cluster version, service config compatible with stack and cluster

Attivio, Blue Talon, Datameer, Platfora, Zaloni plugins are also available for those proprietary products related to Hadoop.

Hadoop Distributions
  • check_cloudera_manager_*.pl - Hadoop cluster checks via Cloudera Manager API - checks states and health of cluster services/roles/nodes, management services, config staleness, Cloudera Enterprise license expiry, Cloudera Manager and CDH cluster versions, utility switches to list clusters/services/roles/nodes as well as list users and their role privileges, fetch a wealth of Hadoop & OS monitoring metrics from Cloudera Manager and compare to thresholds. Disclaimer: I worked for Cloudera, but seriously CM collects an impressive amount of metrics making check_cloudera_manager_metrics.pl alone a very versatile program from which to create hundreds of checks to flexibly alert on
  • Hortonworks - the standard modern Hadoop distribution - see check_ambari_*.pl in the Hadoop Ecosystem section above
  • check_mapr*.pl - Hadoop cluster checks via MapR Control System API - checks services and nodes, MapR-FS space (cluster and per volume), volume states, volume block replication, volume snapshots and mirroring, MapR-FS per disk space utilization on nodes, failed disks, CLDB heartbeats, MapR alarms, MapReduce mode and memory utilization, disk and role balancer metrics. These are noticeably faster than running equivalent maprcli commands (exceptions: disk/role balancer use maprcli).
  • check_ibm_biginsights_*.pl - Hadoop cluster checks via IBM BigInsights Console API - checks services, nodes, agents, BigSheets workbook runs, dfs paths and properties, HDFS space and block replication, BI console version, BI console applications deployed
SQL-on-Hadoop
  • check_hiveserver2* - Apache Hive - HiveServer2 LLAP Interactive server status and uptime, peer count, check for a specific peer host fqdn via regex and a basic beeline connection trivial query test
  • check_apache_drill_*.py/.pl - Apache Drill checks:
    • cluster wide: number of online / offline cluster nodes, mismatched versions across cluster
    • per drill node: status, cluster membership, encryption enabled, config settings, storage plugins enabled, version, metrics with optional thresholds
  • check_presto_*.py - Presto SQL DB
    • cluster checks (via coordinator API) - number of current queries, running/failed/blocked/queued queries, tasks, worker nodes, failed worker nodes, workers with response lag to coordinator, workers with recent failures and recent failure ratios vs thresholds, version
    • per node checks - status, if coordinator, environment
    • per worker checks (via coordinator API) - specific worker registered with coordinator, response age to coordinator, recent requests vs threshold, recent successes, recent failures & failure ratio vs thresholds
Service Discovery & Coordination
  • check_zookeeper.pl - Apache ZooKeeper server checks, multiple layers: "is ok" status, is writable (quorum), operating mode (leader/follower vs standalone), gather statistics
  • check_zookeeper_*znode*.pl - ZooKeeper znode checks using ZK Perl API, useful for HBase, Kafka, SolrCloud, Hadoop HDFS & Yarn HA (ZKFC) and any other ZooKeeper-based service. Very versatile with multiple optional checks including data vs regex, json field extraction, ephemeral status, child znodes, znode last modified age
  • check_consul_*.py - Consul API write / read back, arbitrary key-value content checks, cluster leader election, number of cluster peers, service leader election, version
  • check_vault_*.py - Hashicorp's Vault API checks - health checks is initialized, is not standby, is vault sealed / unsealed, time skew between Vault server and local, is high availability enabled, is current leader, is leader found, version
Cloud
  • AWS:
    • check_aws_s3_file.pl - check for the existence of any arbitrary file on AWS S3, eg. to check backups have happened or _SUCCESS placeholder files are present for a job
    • check_aws_access_keys_age.py - checks for AWS access key age greater than N days to delete/rotate old keys as per best practice (optionally only alerts for active keys)
    • check_aws_access_keys_disabled.py - checks for AWS disabled access keys that should be removed
    • check_aws_api_ping.py - simple yes/no check for AWS API access, can be used to test access key credentials and as a dependency check for all other AWS checks
    • check_aws_cloudtrails_enabled.py - checks Cloud Trails have logging enabled, multi-region and logfile validation. Optionally check only a single named cloud trail
    • check_aws_cloudtrails_event_selectors.py - checks Cloud Trails have at least one event selector each with management and read+write logging. Optionally check only a single named cloud trail
    • check_aws_ec2_instance_count.py - checks the number of running instances with optional range thresholds
    • check_aws_ec2_instance_states.py - checks the state of all EC2 instances, outputting totals and checking warning thresholds for each status type
    • check_aws_password_policy.py - checks the AWS password policy including minimum length, maximum age, password reuse count, uppercase/lowercase/numbers/symbols and whether users are allowed to change their passwords
    • check_aws_root_account.py - checks the AWS root account has MFA enabled and no access keys as per best practice
    • check_aws_user_last_used.py - checks if a given AWS IAM user account has been used within the last N days (eg. if root account was recently used this may indicate a security breach or is at the very least against best practice)
    • check_aws_users_unused.py - detects old AWS IAM user accounts that haven't been used in the last N days, either passwords nor access keys, and should probably be removed
    • check_aws_users_password_last_used.py - detects AWS IAM user accounts that haven't had their passwords used in N days and should probably be removed
    • check_aws_users_mfa_enabled.py - checks all AWS user accounts with passwords have MFA enabled
Docker / Containerization
  • check_docker_*.py - Docker API checks including API ping, counts of running / paused / stopped / total containers with thresholds, specific container status by name or id, images count with thresholds, specific image:tag availability including size and checksum, counts of networks / volumes with thresholds, docker engine version
  • check_docker_swarm_*.py - Docker Swarm API checks including is swarm enabled, swarm node status, is the node a swarm manager, swarm service status including number of live replicas / tasks and if the service was updated recently, counts of services, swarm manager and worker nodes with thresholds, swarm errors, swarm version
  • check_mesos_*.pl - Mesos master health API, master & slaves state information including leader and versions, activated & deactivated slaves, number of Chronos jobs, master & slave metrics. Warning: Mesos & Mesosphere DC/OS is legacy semi-proprietary - major momentum has shifted to the open source Kubernetes project
  • check_kubernetes_*.py - Kubernetes API health and version

If running docker checks from within the nagios plugins docker image then you will need to expose the socket within the container, like so:

docker run -v /var/run/docker.sock:/var/run/docker.sock harisekhon/nagios-plugins check_docker_container_status.py -H unix:///var/run/docker.sock --container myContainer
 OK: Docker container 'myContainer' status = 'running', started at '2020-06-03T14:03:09.78303932Z' | query_time=0.0038s

See also DockerHub build status nagios plugin further down in the CI section.

Search
  • check_elasticsearch_*.pl/.py - Elasticsearch cluster state, shards, replicas, number of nodes & data nodes online, shard and disk % balance between nodes, single node ok, specific node found in cluster state, slow tasks, pending tasks, elasticsearch / lucene versions, per index existence / shards / replicas / settings / age, stats per cluster / index / node, X-Pack license expiry and features enabled
    • check_logstash_*.py - Logstash status, uptime, hot threads, plugins, version, number of pipelines online, specific pipeline online and optionally its number of workers, if its dead letter queue is enabled, outputs pipeline batch size and delay
  • check_solr*.pl - checks for Apache Solr and SolrCloud including API write/read/delete, arbitrary Solr queries vs num matching documents, API ping, Solr Core Heap / Index Size / Number of Docs for a given Solr Collection, and thresholds in ms against all Solr API operations as well as perfdata for graphing, as well as SolrCloud ZooKeeper content checks for collection shards and replicas states, number of live nodes in SolrCloud cluster, overseer, SolrCloud config and Solr metrics.
NoSQL
  • check_cassandra_*.pl / check_datastax_opscenter_*.pl - Apache Cassandra and DataStax OpsCenter monitoring, including Cassandra cluster nodes, token balance, space, heap, keyspace replication settings, alerts, backups, best practice rule checks, DSE hadoop analytics service status and both nodetool and DataStax OpsCenter collected metrics
  • check_memcached_*.pl - Memcached API writes/reads/deletes with timings, check specific key's value against regex or value range, number of current connections, gather statistics
  • check_couchdb_*.py - Apache CouchDB API checks including server status, database exists, doc and deleted doc counts, data size, compaction running, version
  • check_riak_*.pl - Riak API writes/reads/deletes with timings, check a specific key's value against regex or value range, check all riak diagnostics, check node states, check all nodes agree on ring status, gather statistics, alert on any single stat
  • check_redis_*.pl - Redis API writes/reads/deletes with timings, check specific key's value against regex or value range, replication slaves I/O, replicated writes (write on master -> read from slave), publish/subscribe, connected clients, validate redis.conf against running server to check deployments or remote compliance checks, gather statistics, alert on any single stat
SQL Databases
  • check_mysql_query.pl - flexible free-form MySQL SQL queries - can check almost anything - obsoleted a dozen custom MySQL plugins and prevented writing many more. Tested against many versions of MySQL and MariaDB. You may also be interested in Percona's plugins
  • check_mysql_config.pl - detect differences in your /etc/my.cnf and running MySQL config to catch DBAs making changes to running databases without saving to /etc/my.cnf or backporting to Puppet. Can also be used to remotely validate configuration compliance against a known good baseline. Tested against many versions of MySQL and MariaDB