0002547: Notification of system event problems
erilong committed Jun 28, 2016
1 parent 2760550 commit 7268921
Showing 6 changed files with 121 additions and 1 deletion.
5 changes: 4 additions & 1 deletion symmetric-assemble/src/asciidoc/configuration.ad
@@ -43,6 +43,9 @@ include::configuration/parameters.ad[]
ifdef::pro[]
include::configuration/users.ad[]
include::configuration/ldap.ad[]
include::configuration/mail-server.ad[]
include::configuration/license-key.ad[]
endif::pro[]

include::configuration/mail-server.ad[]
include::configuration/monitors.ad[]
include::configuration/notifications.ad[]
47 changes: 47 additions & 0 deletions symmetric-assemble/src/asciidoc/configuration/monitors.ad
@@ -0,0 +1,47 @@

=== Monitors

A monitor watches some part of the system for a problem, checking whether the monitored value reaches or exceeds a threshold.
To be notified immediately of new monitor events, configure a notification.

Monitor ID:: A unique name that identifies the monitor.

ifndef::pro[]
Node Group ID:: The node group that will run this monitor. Use "ALL" to match all groups.
External ID:: The external ID of nodes that will run this monitor. Use "ALL" to match all nodes.
endif::pro[]
ifdef::pro[]
Target Nodes:: The group of nodes that will run this monitor.
endif::pro[]

Monitor Type:: One of several built-in or custom types, each of which runs a specific check and returns a numeric value that is compared
to the threshold.

[cols="<2,<7", options="header"]
|===
|Type
|Description

|cpu|Percentage from 0 to 100 of CPU usage for the server process.

|disk|Percentage from 0 to 100 of disk usage (tmp folder staging area) available to the server process.

|memory|Percentage from 0 to 100 of memory usage (tenured heap pool) available to the server process.

|batchError|Number of incoming and outgoing batches in error.

|batchUnsent|Number of outgoing batches waiting to be sent.

|dataUnrouted|Number of change capture rows that are waiting to be batched and sent.

|dataGaps|Number of active data gaps that are being checked during routing for data to commit.

|===

Threshold:: When this threshold value is reached or exceeded, an event is recorded.
Run Period:: How often, in seconds, to run this monitor. The monitor job also runs on its own period, so a monitor cannot run more often
than the monitor job does.
Run Count:: The number of times to run the monitor before calculating an average value to compare against the threshold.
Severity Level:: The severity assigned to the event that is recorded when this monitor reaches or exceeds its threshold.
Enabled:: Whether or not this monitor is enabled to run.
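
The way Run Period, Run Count, and Threshold interact can be sketched in plain Java. This is a minimal illustration, not SymmetricDS code: `MonitorCheck` and `MonitorRunner` are hypothetical names standing in for a monitor type and the monitor job, and the sketch assumes values are collected in batches of Run Count runs, averaged with integer division, and then reset before the next batch.

```java
import java.util.Optional;

// Hypothetical stand-in for a monitor type's check; this is NOT the IMonitorType API.
interface MonitorCheck {
    long check();
}

// Hypothetical sketch of one monitor applying Run Period, Run Count, and Threshold.
final class MonitorRunner {
    private final MonitorCheck check;
    private final long threshold;
    private final long runPeriodMillis;
    private final int runCount;
    private long lastRunMillis = Long.MIN_VALUE; // sentinel: the monitor has never run
    private long sum;
    private int runs;

    MonitorRunner(MonitorCheck check, long threshold, int runPeriodSeconds, int runCount) {
        this.check = check;
        this.threshold = threshold;
        this.runPeriodMillis = runPeriodSeconds * 1000L;
        this.runCount = Math.max(1, runCount);
    }

    // Called each time the monitor job fires. Returns the averaged value when an
    // event should be recorded, or an empty Optional otherwise.
    Optional<Long> onJobTick(long nowMillis) {
        boolean neverRun = lastRunMillis == Long.MIN_VALUE;
        if (!neverRun && nowMillis - lastRunMillis < runPeriodMillis) {
            return Optional.empty(); // run period has not elapsed; skip this tick
        }
        lastRunMillis = nowMillis;
        sum += check.check();
        runs++;
        if (runs < runCount) {
            return Optional.empty(); // still collecting values for this average
        }
        long average = sum / runs; // integer average of this batch of runs
        sum = 0;
        runs = 0;
        if (average >= threshold) {
            return Optional.of(average); // threshold reached or exceeded: record an event
        }
        return Optional.empty();
    }
}
```

Because the check only happens when the job ticks, a tick that arrives before the run period has elapsed is simply skipped, which is why a monitor cannot run more often than the monitor job.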

40 changes: 40 additions & 0 deletions symmetric-assemble/src/asciidoc/configuration/notifications.ad
@@ -0,0 +1,40 @@

=== Notifications

A notification sends a message to the user when a monitor records an event for a system problem.
First, configure a monitor to watch the system and record events with a specific severity level.
Then, configure a notification that matches the severity level and writes to the log or sends an email.

Notification ID:: A unique name that identifies the notification.

ifndef::pro[]
Node Group ID:: The node group that will run this notification. Use "ALL" to match all groups.
External ID:: The external ID of nodes that will run this notification. Use "ALL" to match all nodes.
endif::pro[]
ifdef::pro[]
Target Nodes:: The group of nodes that will run this notification.
endif::pro[]

Notification Type:: The notification type is either a built-in or custom type that is given the list of monitor events to send.

[cols="<2,<7", options="header"]
|===
|Type
|Description

|log|The monitor events are written to the log using the same severity level.
ifdef::pro[]
The web console indicates WARN and ERROR level notifications in the top-right corner, and they are also displayed on the main Dashboard screen.
endif::pro[]

|email|The monitor events are sent in an email to a list of recipients. Set the expression to a comma-separated list of email addresses.
ifdef::pro[]
Use the Configure->Mail Server screen to set up the mail server used for sending email.
endif::pro[]

|===

Expression:: Additional settings for the notification type, such as the comma-separated list of recipients for the email type.
Severity Level:: Match monitor events at this severity level or above.
Enabled:: Whether or not this notification is enabled to run.
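
A minimal sketch of how a notification could select events and, for the email type, read its recipients from the expression. It assumes a three-level ordering (INFO below WARN below ERROR, borrowing the WARN and ERROR names used by the web console) and uses hypothetical `Severity`, `MonitorEvent`, and `Notifications` types; it does not reproduce the INotificationType interface.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical severity ordering; enum declaration order defines "at or above".
enum Severity { INFO, WARN, ERROR }

// Hypothetical, minimal monitor event.
final class MonitorEvent {
    final String monitorId;
    final Severity severity;

    MonitorEvent(String monitorId, Severity severity) {
        this.monitorId = monitorId;
        this.severity = severity;
    }
}

final class Notifications {
    private Notifications() {
    }

    // Keeps events whose severity is at or above the notification's level,
    // preserving the original order.
    static List<MonitorEvent> matching(List<MonitorEvent> events, Severity level) {
        List<MonitorEvent> matched = new ArrayList<>();
        for (MonitorEvent e : events) {
            if (e.severity.compareTo(level) >= 0) {
                matched.add(e);
            }
        }
        return matched;
    }

    // Splits an email notification's expression into trimmed, non-empty addresses.
    static List<String> recipients(String expression) {
        List<String> addresses = new ArrayList<>();
        if (expression == null) {
            return addresses;
        }
        for (String part : expression.split(",")) {
            String address = part.trim();
            if (!address.isEmpty()) {
                addresses.add(address);
            }
        }
        return addresses;
    }
}
```

Comparing enum positions keeps the "this level or above" rule in one place, so adding a level only means inserting it at the right position in the enum.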

2 changes: 2 additions & 0 deletions symmetric-assemble/src/asciidoc/manage.ad
@@ -74,5 +74,7 @@ include::manage/jvm-properties.ad[]
include::manage/jvm-threads.ad[]
endif::pro[]

include::manage/monitors.ad[]

include::manage/logging.ad[]

28 changes: 28 additions & 0 deletions symmetric-assemble/src/asciidoc/manage/monitors.ad
@@ -0,0 +1,28 @@
=== Monitors

ifdef::pro[]

The Monitors screen shows events for system problems recorded by both local and remote nodes.
The list of events can be filtered by event type, severity level, and node ID, and the number of events displayed can be limited.
Filtering by severity level matches the level you choose and any level above it.
Events are listed in descending order by event time, but the order can be changed by clicking a column heading.
The remove button clears the event from the table on the current node. Events are also purged automatically each night,
removing events whose event time is older than the retention period set by the `purge.retention.minutes` parameter.

image::manage/manage-monitors.png[]
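
The default ordering and the nightly purge can be illustrated with plain Java collections. This is only a sketch: `StoredEvent` and `EventTable` are hypothetical, only the node filter is shown, and the purge assumes that events whose event time is older than `purge.retention.minutes` before the current time are the ones removed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical row from the monitor event table.
final class StoredEvent {
    final String nodeId;
    final long eventTimeMillis;

    StoredEvent(String nodeId, long eventTimeMillis) {
        this.nodeId = nodeId;
        this.eventTimeMillis = eventTimeMillis;
    }
}

final class EventTable {
    private EventTable() {
    }

    // Events for one node (or all nodes when nodeId is null), newest first.
    static List<StoredEvent> list(List<StoredEvent> events, String nodeId) {
        List<StoredEvent> result = new ArrayList<>();
        for (StoredEvent e : events) {
            if (nodeId == null || nodeId.equals(e.nodeId)) {
                result.add(e);
            }
        }
        result.sort(Comparator.comparingLong((StoredEvent e) -> e.eventTimeMillis).reversed());
        return result;
    }

    // Removes events older than the retention window and returns how many were removed.
    static int purge(List<StoredEvent> events, long nowMillis, long retentionMinutes) {
        long cutoff = nowMillis - retentionMinutes * 60_000L;
        int before = events.size();
        events.removeIf(e -> e.eventTimeMillis < cutoff);
        return before - events.size();
    }
}
```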

endif::pro[]

When a <<_monitors,Monitor>> is configured, it runs periodically to check the current value of a system metric and compare it to a threshold value.
The built-in monitor types check CPU usage, disk usage, memory usage, batch errors, outstanding batches, unrouted data, and the number
of data gaps.
Custom monitor types can be created using <<_extensions,Extensions>> that implement the IMonitorType interface.
When the value returned from the check meets or exceeds the threshold value, a <<_monitor_event>> is recorded.
The <<_monitor_event>> table is synchronized on the heartbeat channel, which allows a central server to see events from remote nodes.
This behavior can be disabled by setting the `monitor.events.capture.enabled` parameter to false.
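
The CPU, disk, and memory types report percentages from 0 to 100. As a rough illustration of where such values can come from, the sketch below computes disk and memory percentages with standard JDK APIs. It is not the SymmetricDS implementation: the class name is hypothetical, the staging area is assumed to live under the `java.io.tmpdir` directory, and the tenured pool is located by matching common garbage-collector pool names, falling back to total heap usage.

```java
import java.io.File;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not the SymmetricDS implementation: computes 0-100
// percentages like the ones the "disk" and "memory" monitor types describe.
final class ResourcePercentages {
    private ResourcePercentages() {
    }

    // Percentage of the temp folder's disk that is used or otherwise unavailable
    // to this process. Assumes the staging area lives under java.io.tmpdir.
    static double diskUsedPercent() {
        File tmp = new File(System.getProperty("java.io.tmpdir"));
        long total = tmp.getTotalSpace();
        if (total <= 0) {
            return 0.0; // size unknown; avoid dividing by zero
        }
        long used = total - tmp.getUsableSpace();
        return clamp(100.0 * used / total);
    }

    // Percentage of the tenured (old generation) heap pool in use. Pool names depend
    // on the garbage collector, so common names are matched and total heap usage is
    // used when no tenured pool is found.
    static double memoryUsedPercent() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            MemoryUsage usage = pool.getUsage(); // null if the pool is no longer valid
            if (usage != null && pool.getType() == MemoryType.HEAP
                    && (name.contains("Old") || name.contains("Tenured"))) {
                return percentOf(usage);
            }
        }
        return percentOf(ManagementFactory.getMemoryMXBean().getHeapMemoryUsage());
    }

    private static double percentOf(MemoryUsage usage) {
        // getMax() returns -1 when the maximum is undefined; use the committed size then.
        long limit = usage.getMax() > 0 ? usage.getMax() : usage.getCommitted();
        return limit > 0 ? clamp(100.0 * usage.getUsed() / limit) : 0.0;
    }

    private static double clamp(double value) {
        return Math.max(0.0, Math.min(100.0, value));
    }
}
```

A custom monitor type would return a value like one of these from its check, leaving the threshold comparison to the monitor configuration.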

To be immediately notified of a monitor event, use <<_notifications,Notifications>> to match on the severity level.
Different notification types can send a message by writing to the log or sending an email.
Custom notification types can be created using <<_extensions,Extensions>> that implement the INotificationType interface.
To send email, the <<_mail_server,Mail Server>> must be configured.
