@tbavelier

What does this PR do?

  • Emits an error log once when the number of goroutines exceeds the configured limit:
{"level":"ERROR","ts":"2026-01-15T13:19:04.732Z","logger":"setup.healthz","msg":"healthz check entering failing state","checker":"goroutines-number","goroutines":262,"limit":50,"error":"too many goroutines: 262 > limit: 50","stacktrace":"main.newGoroutinesNumberHealthzCheck.func1\n\t/workspace/cmd/main.go:520\nsigs.k8s.io/controller-runtime/pkg/healthz.(*Handler).serveAggregated\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/healthz/healthz.go:59\nsigs.k8s.io/controller-runtime/pkg/healthz.(*Handler).ServeHTTP\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/healthz/healthz.go:148\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).addHealthProbeServer.StripPrefix.func4\n\t/usr/local/go/src/net/http/server.go:2384\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2322\nnet/http.(*ServeMux).ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2861\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:3340\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:2109"}
  • Adds unit test

Motivation

  • When this happens, the failing liveness probe kills the operator, and the cause is hard to investigate because controller-runtime logs the failing healthz check with its error only at the debug (V(1).Info) level:
{"level":"DEBUG","ts":"2026-01-15T13:19:04.733Z","logger":"controller-runtime.healthz","msg":"healthz check failed","checker":"goroutines-number","error":"too many goroutines: 262 > limit: 50"}

or at the info level without any details (doesn't wrap the error):

{"level":"INFO","ts":"2026-01-15T13:19:04.733Z","logger":"controller-runtime.healthz","msg":"healthz check failed","statuses":[{}]}
  • Let's surface it so it's easier to troubleshoot without having to run at the debug log level

Additional Notes

We use sync/atomic to keep the failing state consistent, rather than a simpler plain boolean flag, because multiple HTTP requests can access it concurrently (someone manually probing on top of the kubelet probe, concurrent probe requests, etc.) and we don't want the state to flip-flop.
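
As an illustration of the log-once behaviour described above, here is a minimal sketch, not the code from this PR. The names `checker`, `newGoroutineLimitCheck`, `count`, and `logError` are hypothetical; `checker` only mirrors the `func(*http.Request) error` shape of controller-runtime's `healthz.Checker`. The sketch also assumes the flag is re-armed once the count drops back under the limit, which the description does not state.

```go
package main

import (
	"fmt"
	"net/http"
	"runtime"
	"sync/atomic"
)

// checker is a local stand-in with the same shape as controller-runtime's
// healthz.Checker, so the sketch needs only the standard library.
type checker func(req *http.Request) error

// newGoroutineLimitCheck (hypothetical name) returns a check that fails when
// count() exceeds limit and calls logError only on the healthy -> failing
// transition, even if several probes hit it concurrently.
func newGoroutineLimitCheck(limit int, count func() int, logError func(msg string)) checker {
	var failing atomic.Bool // zero value is false, i.e. healthy

	return func(_ *http.Request) error {
		n := count()
		if n <= limit {
			// Assumption: re-arm the log once the check recovers.
			failing.Store(false)
			return nil
		}

		err := fmt.Errorf("too many goroutines: %d > limit: %d", n, limit)
		// CompareAndSwap succeeds for exactly one caller per transition,
		// so concurrent probes cannot each emit the error log.
		if failing.CompareAndSwap(false, true) {
			logError(err.Error())
		}
		return err
	}
}

func main() {
	check := newGoroutineLimitCheck(50, runtime.NumGoroutine, func(msg string) {
		fmt.Println("ERROR healthz check entering failing state:", msg)
	})
	if err := check(nil); err != nil {
		fmt.Println("unhealthy:", err)
	} else {
		fmt.Println("healthy")
	}
}
```

With a plain `bool`, two probes arriving together could both read `false` and both log; the compare-and-swap makes the read and the write a single atomic step.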

Alternative approaches:

  • Modify customSetupLogging (we use it to route info to stdout and err to stderr), but that would be a lot of plumbing for a single log, and we'd need to make sure the log is promoted to err correctly and not emitted twice
  • Patch/fork controller-runtime to log healthz failures at the error level, but once again, that seems overkill and hard to maintain

Minimum Agent Versions

Are there minimum versions of the Datadog Agent and/or Cluster Agent required?

  • Agent: vX.Y.Z
  • Cluster Agent: vX.Y.Z

Describe your test plan

  1. Deploy the operator with a low goroutine limit (e.g. 50) and the debug log level:
╰─❯ k get deploy datadog-operator-manager -oyaml | yq '.spec.template.spec.containers[0].args'
- --enable-leader-election
- --pprof
- --loglevel=debug
- --maximumGoroutines=50
  2. Apply an Agent manifest
  3. The operator should start to CrashLoopBackOff; look at the previous container's logs, grepping for ERROR, and ensure you see the goroutine one:
╰─❯ k logs deploy/datadog-operator-manager -p | grep ERROR
{"level":"ERROR","ts":"2026-01-15T13:45:15.240Z","logger":"setup","msg":"[WARNING] Agent DaemonSet selector changed in Operator v1.21. If you rely on Datadog Agent pod labels e.g. in NetworkPolicies, verify if you may be impacted. See README for details.","stacktrace":"main.run\n\t/workspace/cmd/main.go:228\nmain.main\n\t/workspace/cmd/main.go:205\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:285"}
{"level":"ERROR","ts":"2026-01-15T13:46:34.747Z","logger":"setup.healthz","msg":"healthz check entering failing state","checker":"goroutines-number","goroutines":262,"limit":50,"error":"too many goroutines: 262 > limit: 50","stacktrace":"main.newGoroutinesNumberHealthzCheck.func1\n\t/workspace/cmd/main.go:520\nsigs.k8s.io/controller-runtime/pkg/healthz.(*Handler).serveAggregated\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/healthz/healthz.go:59\nsigs.k8s.io/controller-runtime/pkg/healthz.(*Handler).ServeHTTP\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/healthz/healthz.go:148\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).addHealthProbeServer.StripPrefix.func4\n\t/usr/local/go/src/net/http/server.go:2384\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2322\nnet/http.(*ServeMux).ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2861\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:3340\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:2109"}

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label
  • All commits are signed (see: signing commits)

@tbavelier tbavelier added this to the v1.24.0 milestone Jan 15, 2026
@tbavelier tbavelier added the enhancement New feature or request label Jan 15, 2026
@tbavelier tbavelier requested a review from a team as a code owner January 15, 2026 13:55
@codecov-commenter

codecov-commenter commented Jan 15, 2026

Codecov Report

❌ Patch coverage is 93.75000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 38.38%. Comparing base (08c7140) to head (67fff91).

Files with missing lines Patch % Lines
cmd/main.go 93.75% 1 Missing ⚠️
@@            Coverage Diff             @@
##             main    #2496      +/-   ##
==========================================
+ Coverage   38.34%   38.38%   +0.03%     
==========================================
  Files         300      300              
  Lines       25544    25555      +11     
==========================================
+ Hits         9795     9809      +14     
+ Misses      15005    15002       -3     
  Partials      744      744              
Flag Coverage Δ
unittests 38.38% <93.75%> (+0.03%) ⬆️


Files with missing lines Coverage Δ
cmd/main.go 10.86% <93.75%> (+4.24%) ⬆️


@tbavelier tbavelier merged commit cab4ef9 into main Jan 15, 2026
36 checks passed
@tbavelier tbavelier deleted the tbavelier/promote-goroutine-to-error-log branch January 15, 2026 16:21