[DatadogMonitor] Unregister metrics forwarder on finalization #2804

Merged
tbavelier merged 2 commits into main from tbavelier/fix-go-routine-unregister
Mar 24, 2026

Conversation

Member

@tbavelier tbavelier commented Mar 24, 2026

What does this PR do?

  • On finalization (deletion) of a DatadogMonitor, unregister its associated metrics forwarder so that the forwarder's channel is closed and its goroutine exits.
  • If the deletion attempt fails, keep the finalizer and requeue for another attempt. The one exception is a 404: if a monitor that is being deleted is not found, it was already deleted elsewhere (from the UI or the API), so finalization proceeds. All other errors are retried.
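
The error handling described above can be sketched as follows. This is a minimal illustration, not the operator's code: `finalizeMonitor`, `errNotFound`, and `errServer` are hypothetical names, the 404 is modeled as a sentinel error rather than whatever check the controller actually uses, and calling `unregister` before the delete is an assumption about ordering.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors standing in for Datadog API responses.
var (
	errNotFound = errors.New("404 Not Found")
	errServer   = errors.New("500 Internal Server Error")
)

// finalizeMonitor is a hypothetical helper, not the operator's function.
// It returns true when the finalizer may be removed. A non-nil error means
// the finalizer stays in place and the reconcile should be requeued.
func finalizeMonitor(unregister func(), deleteMonitor func() error) (bool, error) {
	// Assumed ordering: stop the metrics forwarder before calling the API.
	unregister()

	if err := deleteMonitor(); err != nil {
		if errors.Is(err, errNotFound) {
			// The monitor was already deleted elsewhere: treat as success.
			return true, nil
		}
		// Any other error keeps the finalizer so deletion is retried.
		return false, fmt.Errorf("failed to finalize monitor: %w", err)
	}
	return true, nil
}

func main() {
	cases := []struct {
		name string
		err  error
	}{
		{"deleted", nil},
		{"already gone", errNotFound},
		{"api failure", errServer},
	}
	for _, c := range cases {
		apiErr := c.err
		done, err := finalizeMonitor(func() {}, func() error { return apiErr })
		fmt.Printf("%-12s removeFinalizer=%v err=%v\n", c.name, done, err)
	}
}
```

Only the third case leaves the finalizer in place, which is what lets Kubernetes retry the deletion instead of dropping the resource while the monitor still exists.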

Motivation

Fixes #2803 (goroutine leak)

Additional Notes

Anything else we should know when reviewing?

Minimum Agent Versions

Are there minimum versions of the Datadog Agent and/or Cluster Agent required?

  • Agent: vX.Y.Z
  • Cluster Agent: vX.Y.Z

Describe your test plan

  • Deploy with --datadogMonitorEnabled=true (requires adding DD_API_KEY and DD_APP_KEY to your operator deployment)
  • Port-forward the operator deployment locally to read its goroutine count (the port may differ depending on the deployment method): kubectl port-forward deploy/datadog-operator-manager 8080:8080
  • Once the operator is leader, record the baseline number of goroutines: curl -s localhost:8080/metrics | grep -i go_goroutines
  • Create 100 DatadogMonitor resources: NAMESPACE=<operator namespace> bash datadogmonitor-load-test.sh create 100
  • Wait a minute or two, then verify that the goroutine count is roughly baseline + 100 (1 goroutine per monitor)
  • Delete the monitors: NAMESPACE=<operator namespace> bash datadogmonitor-load-test.sh delete
  • After another minute or two, verify that the goroutine count is back to baseline (about 100 fewer goroutines)
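
For anyone who prefers to script the comparison rather than eyeball the grep output, the sketch below parses the `go_goroutines` gauge out of Prometheus text-format output. The helper name `goroutineCount` and the sample values are hypothetical; in practice the input would be the body returned by the `/metrics` endpoint used in the steps above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// goroutineCount extracts the go_goroutines gauge from Prometheus
// text-format output. "# HELP" and "# TYPE" comment lines are skipped
// because their first field is "#".
func goroutineCount(metrics string) (int, error) {
	for _, line := range strings.Split(metrics, "\n") {
		fields := strings.Fields(line)
		if len(fields) >= 2 && fields[0] == "go_goroutines" {
			v, err := strconv.ParseFloat(fields[1], 64)
			if err != nil {
				return 0, err
			}
			return int(v), nil
		}
	}
	return 0, fmt.Errorf("go_goroutines not found")
}

func main() {
	// Hypothetical samples, trimmed to the relevant lines.
	baseline := "# TYPE go_goroutines gauge\ngo_goroutines 150\n"
	loaded := "# TYPE go_goroutines gauge\ngo_goroutines 251\n"

	before, err := goroutineCount(baseline)
	if err != nil {
		fmt.Println("baseline:", err)
		return
	}
	after, err := goroutineCount(loaded)
	if err != nil {
		fmt.Println("loaded:", err)
		return
	}
	fmt.Printf("baseline=%d loaded=%d delta=%d\n", before, after, after-before)
}
```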

Load test script

#!/usr/bin/env bash

set -euo pipefail

usage() {
  cat <<'EOF'
Usage:
  datadogmonitor-load-test.sh create [count]
  datadogmonitor-load-test.sh delete

Environment variables:
  NAMESPACE     Target namespace. Default: datadog-monitor-load-test
  PREFIX        Resource name prefix. Default: leak-check
  LABEL_KEY     Label key used for cleanup. Default: load-test.datadoghq.com/group
  LABEL_VALUE   Label value used for cleanup. Default: datadogmonitor-leak
  QUERY         Monitor query. Default: avg(last_5m):avg:system.load.1{*} > 100
  MESSAGE       Monitor message. Default: DatadogMonitor leak test

Examples:
  bash hack/datadogmonitor-load-test.sh create 1000
  bash hack/datadogmonitor-load-test.sh delete
EOF
}

if [[ $# -lt 1 ]]; then
  usage
  exit 1
fi

mode="$1"
count="${2:-1000}"
namespace="${NAMESPACE:-datadog-monitor-load-test}"
prefix="${PREFIX:-leak-check}"
label_key="${LABEL_KEY:-load-test.datadoghq.com/group}"
label_value="${LABEL_VALUE:-datadogmonitor-leak}"
query="${QUERY:-avg(last_5m):avg:system.load.1{*\} > 100}"
message="${MESSAGE:-DatadogMonitor leak test}"

yaml_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

case "$mode" in
  create)
    query_escaped="$(yaml_escape "$query")"
    message_escaped="$(yaml_escape "$message")"
    kubectl get namespace "$namespace" >/dev/null 2>&1 || kubectl create namespace "$namespace" >/dev/null
    {
      echo "apiVersion: v1"
      echo "kind: List"
      echo "items:"
      for i in $(seq 1 "$count"); do
        cat <<EOF
- apiVersion: datadoghq.com/v1alpha1
  kind: DatadogMonitor
  metadata:
    name: ${prefix}-$(printf "%05d" "$i")
    namespace: ${namespace}
    labels:
      ${label_key}: "${label_value}"
  spec:
    name: "${prefix}-$(printf "%05d" "$i")"
    type: "metric alert"
    query: "${query_escaped}"
    message: "${message_escaped}"
EOF
      done
    } | kubectl apply -f -
    ;;
  delete)
    kubectl delete datadogmonitors.datadoghq.com -n "$namespace" -l "${label_key}=${label_value}" --ignore-not-found
    ;;
  *)
    usage
    exit 1
    ;;
esac

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label
  • All commits are signed (see: signing commits)

@tbavelier tbavelier added this to the v1.25.0 milestone Mar 24, 2026
@tbavelier tbavelier requested a review from a team March 24, 2026 10:27
@tbavelier tbavelier added the bug Something isn't working label Mar 24, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4f71bcbad8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 63 to 67
	if err != nil {
		logger.Error(err, "failed to finalize monitor", "Monitor ID", fmt.Sprint(dm.Status.ID))

-		return
+		return err
	}

P1: Treat monitor-not-found as successful finalization

Returning every deleteMonitor error here can permanently block CR deletion when the Datadog monitor was already removed out-of-band (for example, deleted in Datadog UI before deleting the DatadogMonitor resource). In that case Datadog returns a 404, this path keeps the finalizer forever, and the Kubernetes object stays in Terminating until users manually patch finalizers. The controller already treats 404 as a recoverable state in normal reconcile (ctrutils.NotFoundString), so finalization should do the same.

Member Author

Addressed in b625500: indeed, a monitor that was already deleted (from the UI or elsewhere) should not prevent finalization from completing.

codecov-commenter commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 27.27273% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 38.57%. Comparing base (9b607f1) to head (b625500).
⚠️ Report is 2 commits behind head on main.

Files with missing lines                           Patch %   Lines
internal/controller/datadogmonitor/finalizer.go    28.57%    3 Missing and 2 partials ⚠️
internal/controller/datadogmonitor/monitor.go      25.00%    2 Missing and 1 partial ⚠️

@@            Coverage Diff             @@
##             main    #2804      +/-   ##
==========================================
- Coverage   38.83%   38.57%   -0.27%     
==========================================
  Files         309      311       +2     
  Lines       26906    27422     +516     
==========================================
+ Hits        10450    10579     +129     
- Misses      15674    16054     +380     
- Partials      782      789       +7     
Flag        Coverage Δ
unittests   38.57% <27.27%> (-0.27%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines                           Coverage Δ
internal/controller/datadogmonitor/monitor.go      73.59% <25.00%> (-0.70%) ⬇️
internal/controller/datadogmonitor/finalizer.go    60.00% <28.57%> (-7.75%) ⬇️

... and 6 files with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Comment on lines +57 to +59
	if r.forwarders != nil {
		r.forwarders.Unregister(dm)
	}
Member Author

This is the part that fixes the leak
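
To illustrate why skipping the unregister call leaks, here is a minimal sketch of a per-monitor forwarder registry. Everything in it is hypothetical (`forwarderRegistry`, `Register`, `Active`, `Wait`, and the string IDs); the only thing taken from the PR is the idea that each registered monitor owns one goroutine that runs until `Unregister` closes its channel.

```go
package main

import (
	"fmt"
	"sync"
)

// forwarderRegistry is a hypothetical stand-in for the operator's
// forwarder manager: one goroutine per registered monitor.
type forwarderRegistry struct {
	mu   sync.Mutex
	stop map[string]chan struct{}
	wg   sync.WaitGroup
}

func newForwarderRegistry() *forwarderRegistry {
	return &forwarderRegistry{stop: make(map[string]chan struct{})}
}

// Register starts a goroutine for id unless one is already running.
func (r *forwarderRegistry) Register(id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.stop[id]; ok {
		return
	}
	ch := make(chan struct{})
	r.stop[id] = ch
	r.wg.Add(1)
	go func() {
		defer r.wg.Done()
		// A real forwarder would also do periodic work here; this one
		// only blocks until its stop channel is closed.
		<-ch
	}()
}

// Unregister closes the stop channel so the goroutine can exit.
// Without this call the goroutine blocks forever, which is the leak.
func (r *forwarderRegistry) Unregister(id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if ch, ok := r.stop[id]; ok {
		close(ch)
		delete(r.stop, id)
	}
}

// Active reports how many forwarders are currently registered.
func (r *forwarderRegistry) Active() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.stop)
}

// Wait blocks until every started goroutine has exited.
func (r *forwarderRegistry) Wait() { r.wg.Wait() }

func main() {
	reg := newForwarderRegistry()
	for i := 0; i < 100; i++ {
		reg.Register(fmt.Sprintf("leak-check-%05d", i))
	}
	fmt.Println("active after create:", reg.Active())
	for i := 0; i < 100; i++ {
		reg.Unregister(fmt.Sprintf("leak-check-%05d", i))
	}
	reg.Wait()
	fmt.Println("active after delete:", reg.Active())
}
```

In this sketch `Unregister` is idempotent, so calling it again on a requeued finalization would be harmless.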

@tbavelier tbavelier merged commit 15a35f5 into main Mar 24, 2026
36 checks passed
@tbavelier tbavelier deleted the tbavelier/fix-go-routine-unregister branch March 24, 2026 13:30
dd-octo-sts bot pushed a commit that referenced this pull request Mar 25, 2026
* [DatadogMonitor] Unregister metrics forwarder on finalization

* ignore 404 not found and consider success

(cherry picked from commit 15a35f5)
tbavelier added a commit that referenced this pull request Mar 25, 2026
…#2824)

* [DatadogMonitor] Unregister metrics forwarder on finalization

* ignore 404 not found and consider success

(cherry picked from commit 15a35f5)

Co-authored-by: Timothée Bavelier <97530782+tbavelier@users.noreply.github.com>

Labels

backport/v1.25 bug Something isn't working

Development

Successfully merging this pull request may close these issues.

[BUG] DatadogMonitor metrics forwarder goroutines not unregistered on CR deletion

3 participants