ams-publisher
is a component acting as bridge from Nagios to ARGO Messaging system. It's essential part of software stack running on ARGO monitoring instance and is responsible for forming and dispatching messages that wrap up results of Nagios probes/tests. It is running as a unix daemon and it consists of two subsystems:
- queueing mechanism
- publishing/dispatching part
Messages are cached in local directory queue with the help of OCSP Nagios commands and each queue is being monitored and consumed by the daemon. After configurable amount of accumulated messages, publisher that is associated to queue sends them to ARGO Messaging system and drains the queue. ams-publisher
is written in multiprocessing manner so there is support for multiple (consume, publish) pairs where for each, new worker process will be spawned.
Filling and draining of directory queue is asynchronous. Nagios delivers results on its own constant rate while ams-publisher
consume and publish them on its own configurable constant rate. It's important to keep the two rates close enough so that the results don't pile up in the queue and leave it early. Component has a mechanism of inspection of rates and trends over time to keep the constants in sync. Also it's resilient to network issues so it will retry configurable number of times to send a messages to ARGO Messaging system. It's also important to note that consume and publish of the queue is a serial process so if publish is stopped, consume part of the worker will be also stopped. That could lead to pile up of results in the queue and since every result is represented as a one file on the file system, easily exhaustion of free inodes and therefore unusable monitoring instance.
More about Directory queue design
Complete list of features are:
- efficient and scalable directory based queueing
- configurable number of (consume, publish) pairs - workers
- two type of publishers: metric results and alarms
- avro serialization of metric results
- configurable watch rate for queue
- configurable bulk of messages sent to ARGO Messaging system
- configurable retry attempts in case of network connection problems
- purger that will keep queue only with sound data
- message rate inspection of each worker for monitoring purposes
Component is supported on CentOS 7 and Rocky 9 Linux. RPM packages and all needed dependencies are available in ARGO repositories so installation of component simply narrows down to installing a package:
dnf install -y ams-publisher
Component relies on:
argo-ams-library
- interaction with ARGO Messagingavro
- avro serialization of messages' payloadpython-argparse
- ease build and parse of command line argumentspython-daemon
- ease daemonizing of componentpython-messaging
- CERN's library for directory based caching/queueingpytz
- timezone manipulation
File Types | Destination |
---|---|
Configuration | /etc/ams-publisher/ams-publisher.conf |
Daemon component | /usr/bin/ams-publisherd |
Cache delivery | /usr/bin/ams-alarm-to-queue, ams-metric-to-queue |
SystemD Unit (R9) | /usr/lib/systemd/system/ams-publisher.service |
Local caches | /var/spool/ams-publisher/ |
Inspection socket | /run/ams-publisher/sock |
Log files | /var/log/ams-publisher/ |
Central configuration is in ams-publisher.conf
. Configuration consists of [General]
section and [Queue_<workername>], [Topic_<workername>]
section pairs.
[General]
Host = NAGIOS.FQDN.EXAMPLE.COM
RunAsUser = nagios
StatsEveryHour = 24
PublishMsgFile = False
PublishMsgFileDir = /published
PublishArgoMessaging = True
TimeZone = UTC
StatSocket = /var/run/ams-publisher/sock
Host
- FQDN of ARGO Monitoring instance that will be part of formed messages dispatched to ARGO Messaging systemRunAsUser
- component will run with effective UID and GID of given user, usuallynagios
StatsEveryHour
- write periodic report in system logs. Example is given in RunningPublishMsgFile
,PublishMsgFileDir
- "file publisher" that is actually only for testing purposes. If enabled, messages will not be dispatched to ARGO Messaging System, instead it will just be appended to plain text fileTimeZone
- construct timestamp of messages within specified timezoneStatsSocket
- query socket that is used for inspection of rates of each worker. It used by theams-publisher
Nagios probe.
Eachs (queue, topic)
section pair designates one worker process. Two sections are linked by giving the same workername
so Public_Metrics
is associated to Queue_Metrics
but not Queue_MetricsDevel
. Each worker means one new process spawned. One part of the process inspects local on-disk queue with results, forms and accumulates messages into in-memory buffer - consumer. Other part of the worker process dispatchs accumulated messages to ARGO Messaging system when the targeted number is reached - publisher. Publisher part is resilient to network connection problems and it can be configured for arbitrary number of connection retries.
Example of one such pair:
[Queue_Metrics]
Directory = /var/spool/ams-publisher/metrics/
Rate = 10
Purge = True
PurgeEverySec = 300
MaxTemp = 300
MaxLock = 0
Granularity = 60
[Topic_Metrics]
Host = msg.argo.grnet.gr
Key = AMS_TENANTPROJECT_KEY
Project = NAME_OF_AMS_TENANTPROJECT
Topic = metric_data
Bulksize = 100
MsgType = metric_data
Avro = True
AvroSchema = /etc/ams-publisher/metric_data.avsc
Retry = 5
Timeout = 60
SleepRetry = 300
[Queue_Metrics].Directory
- path of directory queue on the filesystem where local cache delivery tools write results of Nagios tests/probes.[Queue_Metrics].Rate
- local cache inspection rate. 10 means that cache will be inspected 10 times at a second because thats the number of status results expected from Nagios that will be picked up verly early. For low volume tenants this could be a lower number.[Queue_Metrics].Purge,PurgeEverySec,MaxTemp,MaxLock
- purge the staled elements of directory queue everyPurgeEverySec
seconds. It cleans the empty intermediate directories below directory queue path, temporary results that exceededMaxTemp
time and locked results that exceededMaxLock
.
It is advisable to leave
MaxLock = 0
which skips every result that have been transformed into a message and added into in-memory queue, but had not yet been dispatched.
[Queue_Metrics].Granularity
- new intermediate directory in the toplevel directory queue path is created everyGranularity
seconds[Topic_Metrics].Host,Key,Project,Topic
- options needed for delivering of messages to ARGO Messaging system.Host
designates the FQDN,Key
is authorization token,Project
represents a tenant name andTopic
is final destination scoped to tenant[Topic_Metric].Bulksize
- accumulate a bulk ofBulksize
messages before dispatching them to ARGO Messaging system in one request.[Topic_Metric].MsgType
- string indicating what type of message publisher is sending. It can bemetric_data
oralarm
[Topic_Metrics].Retry,SleepRetry
- these configures how publisher is resilient to network connection problems and ARGO Messaging system downtimes. Default5 x 300
means that the publisher worker will be resilient to a minimum of1500
seconds (25 min) of connection problems retrying5
times and sleeping aproximately300
seconds before each next attempt[Topic_Metrics].Timeout
- defines the maximum amount of seconds within which request to ARGO Messaging system must be finalized. Consider raising it up for biggerBulksize
number.
Components needs to be properly configured before starting it. At least one worker must be spawned so at least one directory queue needs to exist, Nagios should deliver results to it and publisher part of the worker needs to be associated to it. Afterward, it's just a matter of starting the service.
CentOS 7 and Rocky 9:
systemctl start ams-publisher.service
Component periodically reports rates of each worker in system logs. It does so every StatsEveryHour
.
2020-04-08 08:53:42 ams-publisher[963]: INFO - Periodic report (every 6.0h)
2020-04-08 08:53:42 ams-publisher[983]: INFO - ConsumerQueue metrics: consumed 45787 msgs in 6.00 hours
2020-04-08 08:53:42 ams-publisher[983]: INFO - MessagingPublisher metrics: sent 45700 msgs in 6.00 hours
2020-04-08 08:53:42 ams-publisher[1005]: INFO - ConsumerQueue metricsdevel: consumed 45787 msgs in 6.00 hours
2020-04-08 08:53:42 ams-publisher[1005]: INFO - MessagingPublisher metricsdevel: sent 45700 msgs in 6.00 hours
2020-04-08 08:53:42 ams-publisher[993]: INFO - ConsumerQueue alarms: consumed 164 msgs in 6.00 hours
2020-04-08 08:53:42 ams-publisher[993]: INFO - MessagingPublisher alarms: sent 164 msgs in 6.00 hours
Check the rates of configured workers for past minutes:
% ams-publisherd -q 60
INFO - Asked for statistics for last 60 minutes
INFO - worker:metrics published:8500
INFO - worker:alarms published:26
INFO - worker:metricsdevel published:8500
INFO - worker:metrics consumed:8588
INFO - worker:alarms consumed:26
INFO - worker:metricsdevel consumed:8587