Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds CloudWatch monitoring for Archivematica instances across dev, staging, and production environments by installing the CloudWatch Observability add-on and creating alarms to detect when pod counts drop below expected thresholds. Additionally, it adds explicit namespace declarations to all Kubernetes resources and fixes a naming bug in the production ingress configuration.
- Installs amazon-cloudwatch-observability EKS addon with proper IAM roles for Container Insights metrics
- Creates CloudWatch alarms with SNS email notifications to alert when running pods fall below threshold of 7
- Adds explicit kubernetes_namespace resources and references them throughout all deployments, services, and volume claims
Reviewed changes
Copilot reviewed 21 out of 23 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| archivematica/test_cluster/variables.tf | Adds alert_email variable with default email address; applies formatting alignment |
| archivematica/test_cluster/staging_redis_deployment.tf | Adds namespace references to deployment, service, and PVC resources |
| archivematica/test_cluster/staging_ingress.tf | Adds namespace references to all ingress resources |
| archivematica/test_cluster/staging_gearman_deployment.tf | Adds namespace references to deployment and service resources |
| archivematica/test_cluster/staging_archivematica_deployment.tf | Creates namespace resource, adds namespace references throughout, applies formatting alignment |
| archivematica/test_cluster/staging_alerts.tf | Creates SNS topic, subscription, and CloudWatch alarm for staging environment |
| archivematica/test_cluster/secrets.tf | Adds namespace references to secret resources |
| archivematica/test_cluster/load_balancer.tf | Applies formatting alignment |
| archivematica/test_cluster/eks-cluster.tf | Installs amazon-cloudwatch-observability addon, creates IAM role for CloudWatch access, applies formatting alignment |
| archivematica/test_cluster/dev_redis_deployment.tf | Adds namespace references to deployment, service, and PVC resources |
| archivematica/test_cluster/dev_ingress.tf | Adds namespace references to all ingress resources |
| archivematica/test_cluster/dev_gearman_deployment.tf | Adds namespace references to deployment and service resources |
| archivematica/test_cluster/dev_archivematica_deployment.tf | Creates namespace resource, adds namespace references throughout, applies formatting alignment |
| archivematica/test_cluster/dev_alerts.tf | Creates SNS topic, subscription, and CloudWatch alarm for dev environment |
| archivematica/prod_cluster/variables.tf | Adds alert_email variable with default email address |
| archivematica/prod_cluster/secrets.tf | Adds namespace reference to secret resource; applies formatting alignment |
| archivematica/prod_cluster/redis_deployment.tf | Adds namespace references to deployment, service, and PVC resources |
| archivematica/prod_cluster/ingress.tf | Fixes naming bug (dev→prod), adds namespace references to all ingress resources |
| archivematica/prod_cluster/gearman_deployment.tf | Adds namespace references to deployment and service resources |
| archivematica/prod_cluster/eks-cluster.tf | Installs amazon-cloudwatch-observability addon, creates IAM role for CloudWatch access, applies formatting alignment |
| archivematica/prod_cluster/archivematica_deployment.tf | Creates namespace resource (with ordering issue), adds namespace references throughout |
| archivematica/prod_cluster/alerting.tf | Creates SNS topic, subscription, and CloudWatch alarm for prod environment |
| archivematica/prod_cluster/--addon-name | Appears to be an accidentally committed empty file that should be removed |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| data "kubernetes_resource" "archivematica_prod" { | ||
| kind = "Deployment" | ||
| api_version = "apps/v1" | ||
| metadata { name = "archivematica-prod" } | ||
| metadata { | ||
| name = "archivematica-prod" | ||
| namespace = kubernetes_namespace.archivematica_prod.metadata[0].name | ||
| } | ||
| } | ||
|
|
||
| resource "kubernetes_namespace" "archivematica_prod" { | ||
| metadata { | ||
| name = "archivematica-prod" | ||
| } | ||
| } |
There was a problem hiding this comment.
The data.kubernetes_resource.archivematica_prod data source is defined before the kubernetes_namespace.archivematica_prod resource it depends on. This creates incorrect resource ordering in Terraform. The namespace resource should be defined first (lines 10-14), followed by the data source (lines 1-8). This matches the correct pattern used in dev_archivematica_deployment.tf and staging_archivematica_deployment.tf where the namespace is created before being referenced.
There was a problem hiding this comment.
This isn't a requirement functionally; the terraform runs fine as is, but I'll make the change for consistency across the environments
c67594f to
3a9e082
Compare
cecilia-donnelly
left a comment
There was a problem hiding this comment.
Seems reasonable to me.
| metadata { | ||
| name = "prod-storage-share" | ||
| name = "prod-storage-share" | ||
| namespace = kubernetes_namespace.archivematica_prod.metadata[0].name |
There was a problem hiding this comment.
What does namespace give us, out of curiosity?
There was a problem hiding this comment.
The monitor needs a namespace to exist so it can count the pods therein. Technically we could have told the prod monitor to look at the default namespace, but we need explicit ones in the test envs so I went with that in prod too.
Several terraform files aren't formatted correctly; this commit fixes that.
Currently, deploys will fail if existing copies of our deployments can't be found, even if values for all the images have been specified in image_overrides. This commit updates the config to only require existing image values when they're actually needed.
We need to know when our Archivematica instances go down, so this commit adds Cloudwatch alarms in case they drop below the expected number of running pods.
3a9e082 to
1886b40
Compare
|
Waiting to deploy this until #216 is merged, because I rebased this atop that to make troubleshooting the monkeypatching issues easier. |
We need to know when our Archivematica instances go down, so this commit adds Cloudwatch alarms in case they drop below the expected number of running pods.