fix(correctness): filter agent-internal service checks from dsd-service-checks analysis #1578

Merged
gh-worker-dd-mergequeue-cf854d[bot] merged 1 commit into main from thieman/fix-dsd-service-checks-flakiness
May 4, 2026

Conversation

@thieman (Contributor) commented May 4, 2026

Summary

  • The DDA emits `datadog.agent.up` on a ~15s flush cycle; under parallel test load the number of flush cycles completing before the dump is non-deterministic, causing spurious count mismatches between baseline and comparison in `dsd-service-checks`
  • Adds a `datadog.` prefix filter in `ServiceChecksAnalyzer`, matching the identical approach already used in `MetricsAnalyzer` for the same reason
  • Confirmed via a probe test (millstone configured with `service_check: 0`) that the only non-user checks present are `datadog.agent.up` (timing-dependent → filtered) and the DDA forwarder connectivity probe `{"check":"test","status":0}` (one-shot on startup, always identical on both sides → passes through correctly)
  • Improves the count-mismatch error path to log the names/details of extra checks on whichever side has more, to aid debugging if a mismatch still occurs after filtering
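
As a rough illustration of the prefix filter described in the bullets above, here is a minimal Python sketch. It is not the project's `ServiceChecksAnalyzer`; the `ServiceCheck` type, the `INTERNAL_PREFIXES` tuple, the `filter_internal` helper, and the `my.app.health` check name are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass

# Hypothetical prefix list: the PR only states that `datadog.` is filtered.
INTERNAL_PREFIXES = ("datadog.",)


@dataclass(frozen=True)
class ServiceCheck:
    name: str
    status: int


def filter_internal(checks):
    """Drop checks whose name starts with an agent-internal prefix."""
    return [c for c in checks if not c.name.startswith(INTERNAL_PREFIXES)]


# The baseline side happened to complete one more flush cycle than the comparison side.
baseline = [
    ServiceCheck("datadog.agent.up", 0),
    ServiceCheck("datadog.agent.up", 0),
    ServiceCheck("test", 0),           # connectivity probe, kept by the filter
    ServiceCheck("my.app.health", 0),  # made-up user check
]
comparison = [
    ServiceCheck("datadog.agent.up", 0),
    ServiceCheck("test", 0),
    ServiceCheck("my.app.health", 0),
]

print(len(baseline), len(comparison))                      # raw counts differ
print(filter_internal(baseline) == filter_internal(comparison))
```

In this sketch the `test` probe survives the filter on both sides because its name does not start with `datadog.`, which mirrors the behavior described in the third bullet.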

Test plan

  • Run `make test-correctness-case CASE=dsd-service-checks` in isolation: should pass
  • Run full `make test-correctness` with default parallelism several times: `dsd-service-checks` should no longer flake

🤖 Generated with Claude Code

…ce-checks analysis

The DDA emits `datadog.agent.up` on a ~15s flush cycle. Under parallel
test load the number of flush cycles that complete before the dump is
non-deterministic, producing spurious count mismatches between baseline
and comparison.
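
To make the timing argument concrete, the following Python sketch counts how many flush cycles complete before a dump. It assumes a fixed 15-second interval with the first emission one full interval after startup; both are simplifications for illustration, and `completed_flushes` is a hypothetical helper, not part of the test harness.

```python
# Approximate interval taken from the description above; the real cycle is "~15s".
FLUSH_INTERVAL_SECS = 15


def completed_flushes(elapsed_secs: float) -> int:
    """Number of whole flush intervals that fit into the elapsed time (simplified model)."""
    return int(elapsed_secs // FLUSH_INTERVAL_SECS)


# Two runs whose dumps land only two seconds apart, e.g. because of scheduling under load.
fast_run = completed_flushes(44.0)
slow_run = completed_flushes(46.0)
print(fast_run, slow_run)  # 2 3
```

Under this model, a small difference in when the dump happens is enough to change the number of `datadog.agent.up` checks observed on one side.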

This matches the existing approach in the metrics analyzer, which already
filters `datadog.*` (and other known internal prefixes) for the same
reason.

Confirmed via a probe test that the only non-user checks present are
`datadog.agent.up` (timing-dependent, filtered) and the DDA forwarder
connectivity probe `test` (one-shot on startup, identical on both sides,
passes through the filter correctly).

Also improves the count-mismatch error path to log the names of the extra
checks on whichever side has more, to aid debugging if a mismatch still
occurs after filtering.
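
One way to report the extra checks on the larger side is a multiset difference over check names. The Python sketch below assumes checks are compared by name only; the `describe_extra` helper and its message format are hypothetical and do not reproduce the analyzer's actual log output.

```python
from collections import Counter


def describe_extra(baseline_names, comparison_names):
    """Return a diagnostic string naming the extra checks, or None if the counts match."""
    base, comp = Counter(baseline_names), Counter(comparison_names)
    total_base, total_comp = sum(base.values()), sum(comp.values())
    if total_base == total_comp:
        return None
    # Counter subtraction keeps only positive counts, i.e. the surplus on the larger side.
    if total_base > total_comp:
        side, extra = "baseline", base - comp
    else:
        side, extra = "comparison", comp - base
    names = sorted(extra.elements())
    return (
        f"count mismatch: baseline={total_base}, comparison={total_comp}; "
        f"extra on {side}: {names}"
    )


print(describe_extra(["a", "b", "b"], ["a", "b"]))
```

Reporting names rather than bare counts makes it easier to tell whether a remaining mismatch comes from another internal check or from a genuine discrepancy.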

Closes #1576

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

@tobz added the type/bug (Bug fixes.) label May 4, 2026
@thieman marked this pull request as ready for review May 4, 2026 18:52
@thieman requested a review from a team as a code owner May 4, 2026 18:52
@thieman changed the title fix(correctness): filter agent-internal service checks from dsd-service-checks analysis [#1576] fix(correctness): filter agent-internal service checks from dsd-service-checks analysis May 4, 2026
gh-worker-dd-mergequeue-cf854d[bot] merged commit 200c2b8 into main May 4, 2026
71 of 72 checks passed
dd-octo-sts Bot pushed a commit that referenced this pull request May 4, 2026
…ce-checks analysis (#1578)


Co-authored-by: travis.thieman <travis.thieman@datadoghq.com> 200c2b8

2 participants