[Response Ops][POC] Adding is_improving to alert document for conditional actions. #183543

Draft: ymao1 wants to merge 3 commits into main

ymao1 (Contributor) commented May 15, 2024

In this POC:

- adds an optional severity definition to the action group definition
- calculates whether an alert is improving based on the value of the action group severity compared to the previous run
- sets a `kibana.alert.is_improving` flag on the alert document (see the sketch after this list)
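
To make the mechanics concrete, here is a minimal sketch of the idea, assuming severity is expressed as a numeric level on the action group; the type and helper names below are illustrative, not the exact code from this PR.

```ts
// Minimal sketch (illustrative names, not the exact types from this PR):
// an action group can optionally declare a severity, expressed here as a
// numeric level where a higher number means more severe.
interface ActionGroupWithSeverity {
  id: string;
  name: string;
  severity?: { level: number };
}

// Action groups for the metric threshold rule as used in this POC,
// including the extra "low" group added for testing.
const actionGroups: ActionGroupWithSeverity[] = [
  { id: 'metrics.threshold.fired', name: 'Alert', severity: { level: 2 } },
  { id: 'metrics.threshold.warning', name: 'Warning', severity: { level: 1 } },
  { id: 'metrics.threshold.low', name: 'Low', severity: { level: 0 } },
];

// An alert is "improving" when the severity level of its current action group
// is lower than the level it reported on the previous run. A brand-new alert
// has no previous action group, so the flag is left unset.
function isImproving(
  previousGroup: string | undefined,
  currentGroup: string
): boolean | undefined {
  const levelOf = (group: string) =>
    actionGroups.find((g) => g.id === group)?.severity?.level;
  const prev = previousGroup == null ? undefined : levelOf(previousGroup);
  const curr = levelOf(currentGroup);
  if (prev == null || curr == null) return undefined;
  return curr < prev;
}

// The result is then written to the alert document as
// `kibana.alert.is_improving`, where conditional action filters can query it.
```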

Also, for the purposes of testing this POC, I added another action group to the metric threshold rule type so that we can mimic the same action group transitions that the SLO burn rate rule moves through.

With these changes, we can create a metric threshold rule with triggered, warning, and low conditions and add the following actions:

| Action Group | Action |
| --- | --- |
| triggered | PagerDuty incident with Critical severity |
| warning | PagerDuty incident with Warning severity |
| low | Slack message if `kibana.alert.is_improving: false` |
| low | Different "improving" Slack message if `kibana.alert.is_improving: true` |
| low | Resolve PagerDuty incident if `kibana.alert.is_improving: true` |
| recovered | Slack message |
Action Group definition:

```json
[
    {
        "group": "metrics.threshold.fired",
        "id": "2f59e135-ee28-4598-95fa-a8d7bc6fe285",
        "params": {
            "dedupKey": "{{rule.id}}:{{alert.id}}",
            "eventAction": "trigger",
            "summary": "{{context.reason}}",
            "severity": "critical"
        },
        "frequency": {
            "notify_when": "onActionGroupChange",
            "throttle": null,
            "summary": false
        },
        "uuid": "bb31b6fe-b97b-4483-ae31-9095d6653777"
    },
    {
        "group": "metrics.threshold.warning",
        "id": "2f59e135-ee28-4598-95fa-a8d7bc6fe285",
        "params": {
            "dedupKey": "{{rule.id}}:{{alert.id}}",
            "eventAction": "trigger",
            "summary": "{{context.reason}}",
            "severity": "warning"
        },
        "frequency": {
            "notify_when": "onActionGroupChange",
            "throttle": null,
            "summary": false
        },
        "uuid": "092287d6-f7dd-49ee-b62a-7cd425202972"
    },
    {
        "group": "metrics.threshold.low",
        "id": "slack",
        "params": {
            "message": "{{context.reason}}\n\n{{rule.name}} is active with the following conditions:\n\n- Affected: {{context.group}}\n- Metric: {{context.metric}}\n- Observed value: {{context.value}}\n- Threshold: {{context.threshold}}\n\n[View alert details]({{context.alertDetailsUrl}})\n"
        },
        "frequency": {
            "notify_when": "onActionGroupChange",
            "throttle": null,
            "summary": false
        },
        "alerts_filter": {
            "query": {
                "kql": "kibana.alert.is_improving : false ",
                "filters": []
            }
        },
        "uuid": "cf2824c1-00f9-440a-87ca-082246807cff"
    },
    {
        "group": "metrics.threshold.low",
        "id": "2f59e135-ee28-4598-95fa-a8d7bc6fe285",
        "params": {
            "dedupKey": "{{rule.id}}:{{alert.id}}",
            "eventAction": "resolve"
        },
        "frequency": {
            "notify_when": "onActionGroupChange",
            "throttle": null,
            "summary": false
        },
        "alerts_filter": {
            "query": {
                "kql": "kibana.alert.is_improving : true",
                "filters": []
            }
        },
        "uuid": "d52d8a1d-1180-4dc3-8f09-e363a3e81c42"
    },
    {
        "group": "recovered",
        "id": "slack",
        "params": {
            "message": "{{rule.name}} has recovered.\n\n- Affected: {{context.group}}\n- Metric: {{context.metric}}\n- Threshold: {{context.threshold}}\n\n[View alert details]({{context.alertDetailsUrl}})\n"
        },
        "frequency": {
            "notify_when": "onActionGroupChange",
            "throttle": null,
            "summary": false
        },
        "uuid": "c0dbb568-a06f-4d74-8a1b-86275866621f"
    },
    {
        "group": "metrics.threshold.low",
        "id": "slack",
        "params": {
            "message": "{{rule.name}} is improving with the following conditions:\n\n- Affected: {{context.group}}\n- Metric: {{context.metric}}\n- Observed value: {{context.value}}\n- Threshold: {{context.threshold}}\n- Previous action group: {{alert.previousActionGroup}}\n\n[View alert details]({{context.alertDetailsUrl}})\n"
        },
        "frequency": {
            "notify_when": "onActionGroupChange",
            "throttle": null,
            "summary": false
        },
        "alerts_filter": {
            "query": {
                "kql": "kibana.alert.is_improving : true",
                "filters": []
            }
        },
        "uuid": "f519afa1-926c-4900-9c6a-1a818b0301f1"
    }
]
```

For an SLO burn rate rule with the same configuration, this is the behavior we would expect:

- No alert -> Medium: Slack message sent
- Medium -> High: PagerDuty incident opened at Warning severity
- High -> Critical: same PagerDuty incident changed to Critical severity
- Critical -> High: same PagerDuty incident changed to Warning severity
- High -> Medium: PagerDuty incident resolved, "improving" Slack message for the Medium action group sent
- Medium -> Recovered: Slack recovery message sent
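
As a rough illustration only, this is how the flag would map onto the transitions above; the severity ordering below is an assumption mirroring the earlier sketch, not code from this PR.

```ts
// Walk an alert through the transitions above with an assumed severity
// ordering (Medium < High < Critical). This reuses the idea from the earlier
// sketch; it is not an API from this PR.
const order = ['medium', 'high', 'critical'];
const severityOf = (group: string) => order.indexOf(group);

const transitions: Array<[string | undefined, string]> = [
  [undefined, 'medium'], // new alert      -> is_improving unset
  ['medium', 'high'],    // escalating     -> is_improving: false
  ['high', 'critical'],  // escalating     -> is_improving: false
  ['critical', 'high'],  // de-escalating  -> is_improving: true
  ['high', 'medium'],    // de-escalating  -> is_improving: true
];

for (const [prev, curr] of transitions) {
  const improving =
    prev === undefined ? undefined : severityOf(curr) < severityOf(prev);
  console.log(`${prev ?? 'no alert'} -> ${curr}: is_improving=${improving}`);
}
```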
