From fd025ed35ad60381636d562d0527bd43ec0ae38a Mon Sep 17 00:00:00 2001
From: aidenvaines-cgi
Date: Mon, 23 Feb 2026 12:06:46 +0000
Subject: [PATCH 1/4] CCM-14044 Adding eventpub anom alarms config

---
 .../terraform/modules/eventpub/README.md      | 46 +++++++++++++++++++
 ...udwatch_metric_alarm_publishing_anomaly.tf | 42 +++++++++++++++++
 .../terraform/modules/eventpub/outputs.tf     |  8 ++++
 .../terraform/modules/eventpub/variables.tf   | 24 ++++++++++
 4 files changed, 120 insertions(+)
 create mode 100644 infrastructure/terraform/modules/eventpub/cloudwatch_metric_alarm_publishing_anomaly.tf

diff --git a/infrastructure/terraform/modules/eventpub/README.md b/infrastructure/terraform/modules/eventpub/README.md
index 4be7358..4306d48 100644
--- a/infrastructure/terraform/modules/eventpub/README.md
+++ b/infrastructure/terraform/modules/eventpub/README.md
@@ -1,3 +1,44 @@
+# EventPub Module
+
+## Overview
+
+The `eventpub` module provides a centralized event publishing infrastructure for NHS Notify bounded contexts. It creates an SNS topic with configurable subscribers (Lambda, Firehose, SQS) and includes comprehensive monitoring via CloudWatch alarms.
+
+```
+┌─────────────────┐
+│ Service Lambda  │
+│   (Publisher)   │
+└────────┬────────┘
+         │ publishes to
+         ▼
+┌─────────────────────────┐
+│      SNS Topic          │
+│   (eventpub module)     │
+│                         │
+│  - Anomaly Detection    │
+│  - Delivery Logging     │
+│  - KMS Encryption       │
+└─────────┬───────────────┘
+          │ fan-out to:
+          ├─────────────────────────┐
+          │                         │
+          ▼                         ▼
+┌─────────────────┐       ┌──────────────────┐
+│  Kinesis        │       │   EventBridge    │
+│  Firehose       │       │     Rules        │
+│  ↓ S3           │       │  ↓ Subscribers   │
+│  (Event Cache)  │       │  (SQS/Lambda)    │
+└─────────────────┘       └──────────────────┘
+          │                         │
+          ▼                         ▼
+┌─────────────────┐       ┌──────────────────┐
+│  CloudWatch     │       │   CloudWatch     │
+│  - DLQ Alarm    │       │   - Anomaly      │
+│  - Delivery     │       │     Detection    │
+│    Failures     │       │                  │
+└─────────────────┘       └──────────────────┘
+```
+
@@ -19,6 +60,7 @@
 | [default\_tags](#input\_default\_tags) | Default tag map for application to all taggable resources in the module | `map(string)` | `{}` | no |
 | [enable\_event\_cache](#input\_enable\_event\_cache) | Enable caching of events to an S3 bucket | `bool` | `false` | no |
 | [enable\_firehose\_raw\_message\_delivery](#input\_enable\_firehose\_raw\_message\_delivery) | Enables raw message delivery on firehose subscription | `bool` | `false` | no |
+| [enable\_publishing\_anomaly\_detection](#input\_enable\_publishing\_anomaly\_detection) | Enable CloudWatch anomaly detection alarm for SNS message publishing. Detects abnormal drops or spikes in event publishing volume. | `bool` | `true` | no |
 | [enable\_sns\_delivery\_logging](#input\_enable\_sns\_delivery\_logging) | Enable SNS Delivery Failure Notifications | `bool` | `false` | no |
 | [environment](#input\_environment) | The name of the terraformscaffold environment the module is called for | `string` | n/a | yes |
 | [event\_cache\_buffer\_interval](#input\_event\_cache\_buffer\_interval) | The buffer interval for data firehose | `number` | `500` | no |
@@ -31,6 +73,9 @@
 | [log\_retention\_in\_days](#input\_log\_retention\_in\_days) | The retention period in days for the Cloudwatch Logs events generated by the lambda function | `number` | n/a | yes |
 | [name](#input\_name) | A unique name to distinguish this module invocation from others within the same CSI scope | `string` | n/a | yes |
 | [project](#input\_project) | The name of the terraformscaffold project calling the module | `string` | n/a | yes |
+| [publishing\_anomaly\_band\_width](#input\_publishing\_anomaly\_band\_width) | The width of the anomaly detection band. Higher values (e.g., 4-6) reduce sensitivity and noise, lower values (e.g., 2-3) increase sensitivity. Recommended: 2-4 depending on traffic patterns. | `number` | `3` | no |
+| [publishing\_anomaly\_evaluation\_periods](#input\_publishing\_anomaly\_evaluation\_periods) | Number of evaluation periods for the publishing anomaly alarm. Each period is defined by event_publishing_anomaly_period. | `number` | `2` | no |
+| [publishing\_anomaly\_period](#input\_publishing\_anomaly\_period) | The period in seconds over which the specified statistic is applied for anomaly detection. Minimum 300 seconds (5 minutes). Recommended: 300-600 for event-driven workloads. | `number` | `300` | no |
 | [region](#input\_region) | The AWS Region | `string` | n/a | yes |
 | [sns\_success\_logging\_sample\_percent](#input\_sns\_success\_logging\_sample\_percent) | Enable SNS Delivery Successful Sample Percentage | `number` | `0` | no |
 ## Modules
@@ -42,6 +87,7 @@
 | Name | Description |
 |------|-------------|
+| [publishing\_anomaly\_alarm](#output\_publishing\_anomaly\_alarm) | CloudWatch anomaly detection alarm details for SNS publishing |
 | [s3\_bucket\_event\_cache](#output\_s3\_bucket\_event\_cache) | S3 Bucket ARN and Name for event cache |
 | [sns\_topic](#output\_sns\_topic) | SNS Topic ARN and Name |
diff --git a/infrastructure/terraform/modules/eventpub/cloudwatch_metric_alarm_publishing_anomaly.tf b/infrastructure/terraform/modules/eventpub/cloudwatch_metric_alarm_publishing_anomaly.tf
new file mode 100644
index 0000000..18d81ea
--- /dev/null
+++ b/infrastructure/terraform/modules/eventpub/cloudwatch_metric_alarm_publishing_anomaly.tf
@@ -0,0 +1,42 @@
+resource "aws_cloudwatch_metric_alarm" "publishing_anomaly" {
+  count = var.enable_event_publishing_anomaly_detection ? 1 : 0
+
+  alarm_name          = "${local.csi}-sns-publishing-anomaly"
+  alarm_description   = "RELIABILITY: Anomaly detection alarm for abnormal SNS message publishing patterns. Detects unexpected drops or spikes in event publishing volume that may indicate service degradation or misconfiguration."
+  comparison_operator = "LessThanLowerOrGreaterThanUpperThreshold"
+  evaluation_periods  = var.event_publishing_anomaly_evaluation_periods # Number of evaluation periods for the publishing anomaly alarm.
+  threshold_metric_id = "ad1"
+  treat_missing_data  = "notBreaching"
+  actions_enabled     = true
+
+  tags = merge(
+    local.default_tags,
+    {
+      AlarmType    = "AnomalyDetection"
+      AlarmPurpose = "EventPublishingAbnormality"
+    }
+  )
+
+  metric_query {
+    id          = "m1"
+    return_data = true
+
+    metric {
+      metric_name = "NumberOfMessagesPublished"
+      namespace   = "AWS/SNS"
+      period      = var.event_publishing_anomaly_period # The period in seconds over which the specified statistic is applied for anomaly detection.
+      stat        = "Sum"
+
+      dimensions = {
+        TopicName = aws_sns_topic.main.name
+      }
+    }
+  }
+
+  metric_query {
+    id          = "ad1"
+    expression  = "ANOMALY_DETECTION_BAND(m1, ${var.event_publishing_anomaly_band_width})" # The width of the anomaly detection band. Higher values (e.g. 4-6) reduce sensitivity and noise, lower values (e.g. 2-3) increase sensitivity.
+    label       = "NumberOfMessagesPublished (expected)"
+    return_data = true
+  }
+}
diff --git a/infrastructure/terraform/modules/eventpub/outputs.tf b/infrastructure/terraform/modules/eventpub/outputs.tf
index e2ff3b3..cbba9df 100644
--- a/infrastructure/terraform/modules/eventpub/outputs.tf
+++ b/infrastructure/terraform/modules/eventpub/outputs.tf
@@ -13,3 +13,11 @@ output "s3_bucket_event_cache" {
     bucket = module.s3bucket_event_cache[0].bucket
   } : {}
 }
+
+output "publishing_anomaly_alarm" {
+  description = "CloudWatch anomaly detection alarm details for SNS publishing"
+  value = var.enable_event_publishing_anomaly_detection ? {
+    arn  = aws_cloudwatch_metric_alarm.publishing_anomaly[0].arn
+    name = aws_cloudwatch_metric_alarm.publishing_anomaly[0].alarm_name
+  } : null
+}
diff --git a/infrastructure/terraform/modules/eventpub/variables.tf b/infrastructure/terraform/modules/eventpub/variables.tf
index 41141f9..9a29a70 100644
--- a/infrastructure/terraform/modules/eventpub/variables.tf
+++ b/infrastructure/terraform/modules/eventpub/variables.tf
@@ -129,3 +129,27 @@ variable "additional_policies_for_event_cache_bucket" {
   description = "A list of JSON policies to use to build the bucket policy"
   default     = []
 }
+
+variable "enable_event_publishing_anomaly_detection" {
+  type        = bool
+  description = "Enable CloudWatch anomaly detection alarm for SNS message publishing. Detects abnormal drops or spikes in event publishing volume."
+  default     = true
+}
+
+variable "event_publishing_anomaly_evaluation_periods" {
+  type        = number
+  description = "Number of evaluation periods for the publishing anomaly alarm. Each period is defined by event_publishing_anomaly_period."
+  default     = 2
+}
+
+variable "event_publishing_anomaly_period" {
+  type        = number
+  description = "The period in seconds over which the specified statistic is applied for anomaly detection. Minimum 300 seconds (5 minutes). Recommended: 300-600."
+  default     = 300
+}
+
+variable "event_publishing_anomaly_band_width" {
+  type        = number
+  description = "The width of the anomaly detection band. Higher values (e.g. 4-6) reduce sensitivity and noise, lower values (e.g. 2-3) increase sensitivity. Recommended: 2-4."
+  default     = 3
+}

From 966e0dcce7292fb385f5f6cef6d3e60ffcdf799c Mon Sep 17 00:00:00 2001
From: aidenvaines-cgi
Date: Mon, 23 Feb 2026 12:11:08 +0000
Subject: [PATCH 2/4] CCM-14044 Setting prod defaults

---
 infrastructure/terraform/modules/eventpub/variables.tf | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/infrastructure/terraform/modules/eventpub/variables.tf b/infrastructure/terraform/modules/eventpub/variables.tf
index 9a29a70..7bdaa30 100644
--- a/infrastructure/terraform/modules/eventpub/variables.tf
+++ b/infrastructure/terraform/modules/eventpub/variables.tf
@@ -139,7 +139,7 @@ variable "enable_event_publishing_anomaly_detection" {
 variable "event_publishing_anomaly_evaluation_periods" {
   type        = number
   description = "Number of evaluation periods for the publishing anomaly alarm. Each period is defined by event_publishing_anomaly_period."
-  default     = 2
+  default     = 3
 }
 
 variable "event_publishing_anomaly_period" {
@@ -151,5 +151,5 @@ variable "event_publishing_anomaly_period" {
 variable "event_publishing_anomaly_band_width" {
   type        = number
   description = "The width of the anomaly detection band. Higher values (e.g. 4-6) reduce sensitivity and noise, lower values (e.g. 2-3) increase sensitivity. Recommended: 2-4."
-  default     = 3
+  default     = 5
 }

From b3bb3fe085a4de4f9bef332cbfdc2efb8a4bb543 Mon Sep 17 00:00:00 2001
From: aidenvaines-cgi
Date: Mon, 23 Feb 2026 14:21:50 +0000
Subject: [PATCH 3/4] CCM-14044 Adding eventpub anom alarms config

---
 .../terraform/modules/eventpub/README.md      | 41 -------------------
 1 file changed, 41 deletions(-)

diff --git a/infrastructure/terraform/modules/eventpub/README.md b/infrastructure/terraform/modules/eventpub/README.md
index 4306d48..b8d9ab5 100644
--- a/infrastructure/terraform/modules/eventpub/README.md
+++ b/infrastructure/terraform/modules/eventpub/README.md
@@ -1,44 +1,3 @@
-# EventPub Module
-
-## Overview
-
-The `eventpub` module provides a centralized event publishing infrastructure for NHS Notify bounded contexts. It creates an SNS topic with configurable subscribers (Lambda, Firehose, SQS) and includes comprehensive monitoring via CloudWatch alarms.
-
-```
-┌─────────────────┐
-│ Service Lambda  │
-│   (Publisher)   │
-└────────┬────────┘
-         │ publishes to
-         ▼
-┌─────────────────────────┐
-│      SNS Topic          │
-│   (eventpub module)     │
-│                         │
-│  - Anomaly Detection    │
-│  - Delivery Logging     │
-│  - KMS Encryption       │
-└─────────┬───────────────┘
-          │ fan-out to:
-          ├─────────────────────────┐
-          │                         │
-          ▼                         ▼
-┌─────────────────┐       ┌──────────────────┐
-│  Kinesis        │       │   EventBridge    │
-│  Firehose       │       │     Rules        │
-│  ↓ S3           │       │  ↓ Subscribers   │
-│  (Event Cache)  │       │  (SQS/Lambda)    │
-└─────────────────┘       └──────────────────┘
-          │                         │
-          ▼                         ▼
-┌─────────────────┐       ┌──────────────────┐
-│  CloudWatch     │       │   CloudWatch     │
-│  - DLQ Alarm    │       │   - Anomaly      │
-│  - Delivery     │       │     Detection    │
-│    Failures     │       │                  │
-└─────────────────┘       └──────────────────┘
-```
-

From 038a02d96099c58052602066c2c23601c4d19d5d Mon Sep 17 00:00:00 2001
From: aidenvaines-cgi
Date: Mon, 23 Feb 2026 14:23:46 +0000
Subject: [PATCH 4/4] CCM-14044 Adding eventpub anom alarms config

---
 infrastructure/terraform/modules/eventpub/README.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/infrastructure/terraform/modules/eventpub/README.md b/infrastructure/terraform/modules/eventpub/README.md
index b8d9ab5..66d08ca 100644
--- a/infrastructure/terraform/modules/eventpub/README.md
+++ b/infrastructure/terraform/modules/eventpub/README.md
@@ -18,12 +18,15 @@
 | [data\_plane\_bus\_arn](#input\_data\_plane\_bus\_arn) | Data plane event bus arn | `string` | n/a | yes |
 | [default\_tags](#input\_default\_tags) | Default tag map for application to all taggable resources in the module | `map(string)` | `{}` | no |
 | [enable\_event\_cache](#input\_enable\_event\_cache) | Enable caching of events to an S3 bucket | `bool` | `false` | no |
+| [enable\_event\_publishing\_anomaly\_detection](#input\_enable\_event\_publishing\_anomaly\_detection) | Enable CloudWatch anomaly detection alarm for SNS message publishing. Detects abnormal drops or spikes in event publishing volume. | `bool` | `true` | no |
 | [enable\_firehose\_raw\_message\_delivery](#input\_enable\_firehose\_raw\_message\_delivery) | Enables raw message delivery on firehose subscription | `bool` | `false` | no |
-| [enable\_publishing\_anomaly\_detection](#input\_enable\_publishing\_anomaly\_detection) | Enable CloudWatch anomaly detection alarm for SNS message publishing. Detects abnormal drops or spikes in event publishing volume. | `bool` | `true` | no |
 | [enable\_sns\_delivery\_logging](#input\_enable\_sns\_delivery\_logging) | Enable SNS Delivery Failure Notifications | `bool` | `false` | no |
 | [environment](#input\_environment) | The name of the terraformscaffold environment the module is called for | `string` | n/a | yes |
 | [event\_cache\_buffer\_interval](#input\_event\_cache\_buffer\_interval) | The buffer interval for data firehose | `number` | `500` | no |
 | [event\_cache\_expiry\_days](#input\_event\_cache\_expiry\_days) | s3 archiving expiry in days | `number` | `30` | no |
+| [event\_publishing\_anomaly\_band\_width](#input\_event\_publishing\_anomaly\_band\_width) | The width of the anomaly detection band. Higher values (e.g. 4-6) reduce sensitivity and noise, lower values (e.g. 2-3) increase sensitivity. Recommended: 2-4. | `number` | `5` | no |
+| [event\_publishing\_anomaly\_evaluation\_periods](#input\_event\_publishing\_anomaly\_evaluation\_periods) | Number of evaluation periods for the publishing anomaly alarm. Each period is defined by event\_publishing\_anomaly\_period. | `number` | `3` | no |
+| [event\_publishing\_anomaly\_period](#input\_event\_publishing\_anomaly\_period) | The period in seconds over which the specified statistic is applied for anomaly detection. Minimum 300 seconds (5 minutes). Recommended: 300-600. | `number` | `300` | no |
 | [force\_destroy](#input\_force\_destroy) | When enabled will force destroy event-cache S3 bucket | `bool` | `false` | no |
 | [group](#input\_group) | The name of the tfscaffold group | `string` | `null` | no |
 | [iam\_permissions\_boundary\_arn](#input\_iam\_permissions\_boundary\_arn) | The ARN of the permissions boundary to use for the IAM role | `string` | `null` | no |
@@ -32,9 +35,6 @@
 | [log\_retention\_in\_days](#input\_log\_retention\_in\_days) | The retention period in days for the Cloudwatch Logs events generated by the lambda function | `number` | n/a | yes |
 | [name](#input\_name) | A unique name to distinguish this module invocation from others within the same CSI scope | `string` | n/a | yes |
 | [project](#input\_project) | The name of the terraformscaffold project calling the module | `string` | n/a | yes |
-| [publishing\_anomaly\_band\_width](#input\_publishing\_anomaly\_band\_width) | The width of the anomaly detection band. Higher values (e.g., 4-6) reduce sensitivity and noise, lower values (e.g., 2-3) increase sensitivity. Recommended: 2-4 depending on traffic patterns. | `number` | `3` | no |
-| [publishing\_anomaly\_evaluation\_periods](#input\_publishing\_anomaly\_evaluation\_periods) | Number of evaluation periods for the publishing anomaly alarm. Each period is defined by event_publishing_anomaly_period. | `number` | `2` | no |
-| [publishing\_anomaly\_period](#input\_publishing\_anomaly\_period) | The period in seconds over which the specified statistic is applied for anomaly detection. Minimum 300 seconds (5 minutes). Recommended: 300-600 for event-driven workloads. | `number` | `300` | no |
 | [region](#input\_region) | The AWS Region | `string` | n/a | yes |
 | [sns\_success\_logging\_sample\_percent](#input\_sns\_success\_logging\_sample\_percent) | Enable SNS Delivery Successful Sample Percentage | `number` | `0` | no |
 ## Modules