/
monitors.ad
51 lines (33 loc) · 2.14 KB
/
monitors.ad
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
=== Monitors
A monitor watches some part of the system for a problem, checking to see if the monitored value exceeds a threshold.
(To be notified immediately of new monitor events, configure a notification.)
Monitor ID:: The monitor ID is a unique name to refer to the monitor.
ifndef::pro[]
Node Group ID:: The node group that will run this monitor. Use "ALL" to match all groups.
External ID:: The external ID of nodes that will run this monitor. Use "ALL" to match all nodes.
endif::pro[]
ifdef::pro[]
Target Nodes:: The group of nodes that will run this monitor.
endif::pro[]
Monitor Type:: The monitor type is one of several built-in or custom types that run a specific check and return a numeric value that can
be compared to a threshold value.
[cols="<2,<7", options="header"]
|===
|Type
|Description
|cpu|Percentage from 0 to 100 of CPU usage for the server process.
|disk|Percentage from 0 to 100 of disk usage (tmp folder staging area) available to the server process.
|memory|Percentage from 0 to 100 of memory usage (tenured heap pool) available to the server process.
|batchError|Number of incoming and outgoing batches in error.
|batchUnsent|Number of outgoing batches waiting to be sent.
|dataUnrouted|Number of change capture rows that are waiting to be batched and sent.
|dataGaps|Number of active data gaps that are being checked during routing for data to commit.
|offlineNodes|The number of nodes that are offline based on the last heartbeat time. The console.report.as.offline.minutes parameter controls how many minutes before a node is considered offline.
|log|Number of entries found in the log for the specified severity level.
|===
Threshold:: When this threshold value is reached or exceeded, an event is recorded.
Run Period:: The time in seconds of how often to run this monitor. The monitor job runs on a period also, so the monitor can only run as often
as the monitor job.
Run Count:: The number of times to run the monitor before calculating an average value to compare against the threshold.
Severity Level:: The importance of this monitor event when it exceeds the threshold.
Enabled:: Whether or not this monitor is enabled to run.