Add agent autodiscovery e2e tests for redisdb and mcache#23515
Draft
Design for a dedicated py3.13-ad-7.0 hatch env that exercises the Agent's container autodiscovery against a default-port redis container via the integration's auto_conf.yaml. Minimal verification proves discovery + check execution; broken cases from the DSCVR confluence page are deferred. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Task-by-task implementation plan derived from the 2026-04-17 spec. Adds a py3.13-ad-7.0 hatch env, dedicated compose, branched dd_environment, and a minimal test that proves autodiscovery via configcheck and a can_connect assertion with redis_port:6379. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- test_e2e_autodiscovery: assert_service_check(tags=...) requires exact tag-list equality, but the autodiscovered check emits Docker tags (docker_image, image_id, image_name, image_tag, redis_host, short_image) whose values vary per run. Switch to scanning service_checks() and asserting that at least one service check with status OK contains 'redis_port:6379'.
- test_e2e: skip the 1m-2s cluster test when REDIS_AUTODISCOVERY=true, per the plan's Task 6 Step 3 instructions, since the cluster compose isn't started under that env.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
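The subset scan described in this commit could look roughly like the following. This is a minimal sketch, not the shipped test: `ServiceCheck` is a stand-in record defined here for illustration (the stubs returned by the test aggregator are assumed to expose `status` and `tags`), and `has_ok_check_with_tag` is a hypothetical helper name.

```python
from dataclasses import dataclass, field

OK = 0        # Datadog service-check status code for OK
CRITICAL = 2  # Datadog service-check status code for CRITICAL


@dataclass
class ServiceCheck:
    """Stand-in for a submitted service check (illustrative only)."""
    name: str
    status: int
    tags: list = field(default_factory=list)


def has_ok_check_with_tag(service_checks, required_tag):
    """Return True if any OK service check carries `required_tag`.

    Only membership is tested, so volatile tags such as image_id or
    image_tag do not affect the result.
    """
    return any(sc.status == OK and required_tag in sc.tags for sc in service_checks)


checks = [
    ServiceCheck(
        'redis.can_connect',
        OK,
        ['redis_port:6379', 'image_tag:7.0', 'short_image:redis'],
    ),
]
```

Exact tag-list equality would fail here whenever `image_tag` or `image_id` changes between runs; the membership test does not.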
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two doc-only fixes surfaced in final review:
- Network name `autodiscovery_default_default` was a typo; the compose file and project name dictate `autodiscovery-default_default`.
- Replace the spec's `assert_service_check(tags=[...])` snippet with the `service_checks() + any(...)` subset scan that actually shipped. `assert_service_check` requires tag-list equality, but the autodiscovered check emits volatile Docker tags (docker_image, image_id, image_name, image_tag, redis_host, short_image), so subset matching is needed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When dd_environment yields no static instance config (the autodiscovery setup), the existing config-dir bind-mount is skipped, so the Agent falls back to the auto_conf.yaml baked into its image rather than the version shipped in the integration's source tree. That lets autodiscovery e2e tests pass against an outdated agent-image template even when the local `data/auto_conf.yaml` would be wrong. File-bind-mount the integration's own `data/auto_conf.yaml` over `/etc/datadog-agent/conf.d/<integration>.d/auto_conf.yaml` whenever no static config is yielded and the file exists in the source tree. SNMP (the only other yield-None user today) ships no auto_conf.yaml so its behavior is unchanged. Verified by editing redisdb's local auto_conf.yaml ad_identifiers to something the agent can't match: the redisdb autodiscovery e2e now correctly fails, and reverts to passing once the identifier is restored. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When auto_conf.yaml is edited between test runs, the docker bind-mount of the source-tree file can become stale because tools like git checkout replace the inode. Document the env-restart workaround. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
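The inode behaviour behind this caveat can be reproduced without Docker. The sketch below only illustrates the filesystem side, on the understanding that a file bind-mount stays attached to the inode that existed when the container started: an in-place write keeps the inode, while a write-then-rename (the pattern the commit attributes to tools like git checkout) produces a new one. The helper names are hypothetical.

```python
import os
import tempfile


def inode_of(path):
    """Return the inode number currently behind `path`."""
    return os.stat(path).st_ino


def edit_in_place(path, text):
    """Truncate and rewrite the existing file; the inode is preserved."""
    with open(path, 'w') as f:
        f.write(text)


def replace_atomically(path, text):
    """Write a new file next to `path`, then rename it over the original.

    The path now points at a different inode than before.
    """
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, 'w') as f:
        f.write(text)
    os.replace(tmp, path)


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, 'auto_conf.yaml')
    edit_in_place(target, 'ad_identifiers: [redis]\n')
    original = inode_of(target)

    edit_in_place(target, 'ad_identifiers: [redis]  # edited\n')
    same_after_write = inode_of(target) == original

    replace_atomically(target, 'ad_identifiers: [other]\n')
    same_after_replace = inode_of(target) == original
```

When the inode changes, a running container keeps seeing the old file contents, which is why restarting the env is the documented workaround.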
Mirror redisdb's autodiscovery e2e pattern for memcached: dedicated py3.13-ad-1.6 hatch env, vanilla memcached:1.6 on bridge network, configcheck + service-check assertions. Helper extraction is left to a follow-up spec informed by the duplication this lands. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bite-sized, TDD-shaped tasks driving the spec landed in 2026-04-28-mcache-autodiscovery-e2e-design.md. Pre-flight verifies the ddev autodiscovery fix from the redisdb branch is in place; final task runs the two-direction sanity check. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
memcached:1.6 is silent at default verbosity, so the CheckDockerLogs 'server listening' readiness probe times out. Run with -vv so memcached emits the slab classes and 'server listening (auto-negotiate)' lines on startup. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The autodiscovery env starts a vanilla memcached:1.6 container on the Docker bridge network. The integration tests in test_integration_e2e.py all rely on SASL credentials, a unix socket, or IPv6 — none of which exist under the autodiscovery dd_environment. Promote the existing test_e2e skipif to a module-level pytestmark so every test in the file is skipped when MCACHE_AUTODISCOVERY=true. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ling

Two small follow-ups from review:
- The dd_environment docstring referenced 'existing e2e and integration tests'. After the module-level skipif landed, every test in test_integration_e2e.py is skipped under autodiscovery; reword to reflect that.
- The compose's -vv flag and conftest's 'server listening' log probe are silently coupled. Document the dependency in the compose so a future change to verbosity does not mysteriously break the readiness probe.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Plan a new py3.13-adproc-7.0 hatch env that enables the agent's process listener and disables the docker feature, runs a host-networking redis container, and verifies that the cel_selector.processes rule shipped in auto_conf.yaml drives discovery. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Eleven bite-sized tasks for shipping py3.13-adproc-7.0: modify auto_conf.yaml with cel://process and cel_selector.processes, add the new hatch env, compose file, common.py constants, conftest.py branch, skipif update, new test, docs, and three verification passes (happy path, negative sanity check, container env non-regression).
….yaml Discovered during execution that auto_conf.yaml is regenerated from assets/configuration/spec.yaml by ddev validate config -s. CI runs the same validator (without --sync) and fails on any drift. Update Task 1 to edit spec.yaml as the source of truth and commit both spec.yaml and the regenerated auto_conf.yaml together. The file structure table also calls this out so future readers don't repeat the mistake. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per https://datadoghq.atlassian.net/wiki/spaces/DSCVR/pages/6631130024/Adding+process+support+to+auto+configurations the process autodiscovery listener requires both an explicit cel://process entry in ad_identifiers (auto-injection only happens when ad_identifiers is empty) and a cel_selector.processes CEL rule. Add both to spec.yaml and regenerate the shipped auto_conf.yaml. The existing redis identifier is preserved so docker-listener container autodiscovery keeps working unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Toggles REDIS_AUTODISCOVERY_PROCESS=true so conftest.py can switch to the host-networking redis container path needed for process autodiscovery e2e.
Single redis service running with network_mode: host so the redis-server process is visible in the host process table where the agent's process listener will look for it.
Mirrors the existing AUTODISCOVERY/AUTODISCOVERY_COMPOSE_PATH pair so the new conftest branch and test file can switch on a single env-var-derived boolean.
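A minimal sketch of how env-var-derived booleans like these might be computed and used to pick an environment. The env var names come from the commits above; `env_flag`, `select_environment`, and the returned labels are hypothetical and are not the actual contents of common.py.

```python
import os


def env_flag(name, environ=None):
    """Return True only when the variable is set to 'true' (case-insensitive)."""
    environ = os.environ if environ is None else environ
    return environ.get(name, '').strip().lower() == 'true'


def select_environment(environ=None):
    """Pick which environment a dd_environment-style fixture should start.

    The labels are illustrative; a real conftest would branch to a
    compose file instead of returning a string.
    """
    if env_flag('REDIS_AUTODISCOVERY_PROCESS', environ):
        return 'process-autodiscovery'
    if env_flag('REDIS_AUTODISCOVERY', environ):
        return 'container-autodiscovery'
    return 'static-config'
```

Deriving each boolean once and importing it from a single module keeps the conftest branch and the test-level skip conditions in agreement.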
REDIS_AUTODISCOVERY_PROCESS=true selects a host-networking redis container and forwards DD_DISCOVERY_ENABLED, DD_EXTRA_LISTENERS, and DD_AUTOCONFIG_EXCLUDE_FEATURES via e2e_metadata['env_vars'] so the agent runs the process listener (and skips the docker feature so processes inside containers don't get container-IDs that would exclude them).
Same reasoning as for the existing container autodiscovery env: the cluster fixture isn't running, so the cluster-specific assertions can't pass.
Waits for 'agent configcheck' to list the redisdb config, then runs the agent and asserts that redis.can_connect is OK with redis_port:6379. Only runs in the py3.13-adproc-7.0 env (REDIS_AUTODISCOVERY_PROCESS=true).
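The "wait for configcheck" step amounts to polling until the check name shows up in the command output. The following is a generic sketch of that loop, not the helper the test actually uses: `wait_for` and `configcheck_lists` are hypothetical names, and the sample output strings are illustrative rather than captured from a real agent.

```python
import time


def wait_for(predicate, attempts=10, wait=0.5, sleep=time.sleep):
    """Call `predicate` until it returns True; return the number of tries.

    Raises TimeoutError if it never succeeds within `attempts` tries.
    """
    for attempt in range(1, attempts + 1):
        if predicate():
            return attempt
        sleep(wait)
    raise TimeoutError(f'condition not met after {attempts} attempts')


def configcheck_lists(output, check_name):
    """Crude test: does the configcheck output mention the check at all?"""
    return check_name in output


# Simulated outputs from successive `agent configcheck` invocations.
outputs = iter(['', '', '=== redisdb check ===\n(illustrative output)'])

tries = wait_for(
    lambda: configcheck_lists(next(outputs), 'redisdb'),
    attempts=5,
    wait=0,
    sleep=lambda _: None,  # no real sleeping in this demonstration
)
```

In a real test the predicate would run the command inside the agent container and the wait would be non-zero.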
Mirrors the existing container autodiscovery section: invocation, the env vars used, and the host-port + bind-mount caveats developers will hit.
Some agent components require Linux capabilities not granted by the default docker run invocation. system-probe-lite (used by the discovery check that feeds workloadmeta for the cel://process listener) needs CAP_SYS_PTRACE and CAP_DAC_READ_SEARCH to read /proc entries of other processes; without them it logs "service discovery may be impaired" and the process listener never matches anything. Add a `cap_add` metadata key (mirrors existing `docker_volumes` / `custom_hosts` keys) so individual integrations can request capabilities from their dd_environment fixture. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
system-probe-lite (the agent's service discovery worker) needs these caps to read /proc entries of other processes. Without them the redis-server process is never classified as a service and the cel://process listener has nothing to match against. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
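Put together, the metadata for the process autodiscovery env might be shaped like the sketch below. The key names `env_vars` and `cap_add` and the capability names come from the commits above; the env var values shown are assumptions (the commits name the variables but not their values), and `docker_run_flags` is a hypothetical helper showing how such metadata could map onto `docker run` options.

```python
def process_autodiscovery_metadata():
    """Illustrative e2e metadata for the process-listener environment."""
    return {
        'env_vars': {
            # The three values below are assumptions, not taken from the PR.
            'DD_DISCOVERY_ENABLED': 'true',
            'DD_EXTRA_LISTENERS': 'process',
            'DD_AUTOCONFIG_EXCLUDE_FEATURES': 'docker',
        },
        # Capabilities system-probe-lite needs to read other processes' /proc entries.
        'cap_add': ['SYS_PTRACE', 'DAC_READ_SEARCH'],
    }


def docker_run_flags(metadata):
    """Translate metadata into `docker run` flags (-e and --cap-add)."""
    flags = []
    for name, value in sorted(metadata.get('env_vars', {}).items()):
        flags += ['-e', f'{name}={value}']
    for capability in metadata.get('cap_add', []):
        flags += ['--cap-add', capability]
    return flags


flags = docker_run_flags(process_autodiscovery_metadata())
```

Keeping capabilities in per-integration metadata means only the envs that need them run the agent container with extra privileges.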
The original test used dd_agent_check (which spawns a short-lived "agent check" subprocess and waits for AD to schedule the redisdb config). That doesn't work for process autodiscovery: the subprocess starts its own workloadmeta, which never finishes initialising before the deadline (logs "Workloadmeta collectors are not ready after 17 retries"). The process listener can never match the redis-server process because workloadmeta never gets populated.

Instead, query the long-running agent's "status --json" output. It already runs the discovery check and the process listener, so we can assert:
- the redisdb config is scheduled (configcheck shows host: 127.0.0.1, port: 6379),
- the running agent has executed it at least once with no errors.

This is a fundamentally different verification surface from the container autodiscovery test, but it's the right one for process-listener-driven scheduling, where the short-lived subprocess approach doesn't terminate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
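The "executed at least once with no errors" assertion could be expressed against the status JSON along these lines. This is a sketch under stated assumptions: the `runnerStats` / `Checks` / `TotalRuns` / `TotalErrors` key names reflect how the status payload is assumed to be laid out and should be verified against the agent version under test; `check_ran_cleanly` is a hypothetical helper and the sample payload is hand-written.

```python
import json


def check_ran_cleanly(status_json, check_name):
    """Return True if some instance of `check_name` ran at least once without errors.

    Assumes the payload looks like:
    {"runnerStats": {"Checks": {<check>: {<instance id>: {"TotalRuns": n, "TotalErrors": m}}}}}
    """
    status = json.loads(status_json)
    instances = status.get('runnerStats', {}).get('Checks', {}).get(check_name, {})
    return any(
        stats.get('TotalRuns', 0) >= 1 and stats.get('TotalErrors', 0) == 0
        for stats in instances.values()
    )


# Hand-written sample, not captured from a real agent.
sample = json.dumps({
    'runnerStats': {
        'Checks': {
            'redisdb': {'redisdb:abc123': {'TotalRuns': 3, 'TotalErrors': 0}},
        }
    }
})
```

Because the long-running agent already holds a populated workloadmeta, reading its status avoids the subprocess start-up race described above.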
The agent picks up every .yaml file in conf.d/<integration>.d/ as a separate template. Integrations may now ship multiple auto_conf-style files (e.g. an `auto_conf.yaml` for the docker listener and an `auto_conf_process.yaml` for the cel://process listener; one file can't serve both, see DSCVR/6631130024). Bind-mount any file matching `auto_conf*.yaml` so e2e tests verify all of them, not just the canonical single-file form. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
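A rough sketch of the glob-and-mount logic this commit describes. It is not the code in ddev: `auto_conf_mounts` and its parameters are hypothetical, and the mounts are expressed as plain `docker run -v` arguments for illustration. The container target directory is the one named in the earlier bind-mount commit.

```python
import tempfile
from pathlib import Path


def auto_conf_mounts(data_dir, integration, has_static_config):
    """Build `-v` arguments for every auto_conf*.yaml in `data_dir`.

    Nothing is mounted when a static config is present, or when the
    integration ships no auto_conf-style files.
    """
    if has_static_config:
        return []
    target_dir = f'/etc/datadog-agent/conf.d/{integration}.d'
    mounts = []
    for path in sorted(Path(data_dir).glob('auto_conf*.yaml')):
        mounts += ['-v', f'{path.resolve()}:{target_dir}/{path.name}']
    return mounts


with tempfile.TemporaryDirectory() as tmp:
    for name in ('auto_conf.yaml', 'auto_conf_process.yaml', 'conf.yaml.example'):
        (Path(tmp) / name).write_text('')
    mounts = auto_conf_mounts(tmp, 'redisdb', has_static_config=False)
    skipped = auto_conf_mounts(tmp, 'redisdb', has_static_config=True)

with tempfile.TemporaryDirectory() as empty:
    # Mirrors an integration that yields no static config and ships no auto_conf file.
    none_found = auto_conf_mounts(empty, 'snmp', has_static_config=False)
```

The pattern `auto_conf*.yaml` matches the canonical `auto_conf.yaml` as well as suffixed variants, so single-file integrations behave as before.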
…yaml Shipping cel_selector.processes alongside the existing redis ad_identifier in a single auto_conf.yaml regresses container autodiscovery: when the docker listener finds a redis container, the agent runs the config's CEL matching program (target=ProcessType) against the container entity, the type mismatch returns false, and the redis container is filtered out (comp/core/autodiscovery/listeners/common_filter.go:24, comp/core/autodiscovery/integration/matching_program.go:38). Switch to subpage option 2 (DSCVR/6631130024 §"Add a separate auto_conf_process.yaml"): keep the original auto_conf.yaml unchanged so container autodiscovery keeps working, and ship a second auto_conf_process.yaml whose ad_identifiers and cel_selector are process-only. The agent loads both files; the docker listener uses the first, the cel://process listener uses the second. No regression. If the agent later treats CEL filters as additive-when-present rather than required-for-all, a future change can collapse both files into one. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
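The regression can be illustrated with a toy model. This is deliberately simplified and is not the agent's matching code: it only encodes the behaviour stated in the commit message, namely that a template carrying a CEL program compiled for one entity type evaluates to false against an entity of another type. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Template:
    """Toy stand-in for one auto_conf-style file."""
    name: str
    ad_identifiers: Tuple[str, ...]
    cel_target: Optional[str] = None  # entity kind the CEL program targets, if any


def template_matches(template, entity_kind, entity_identifier):
    """Toy matching rule: the identifier must match, then the CEL target type must too."""
    if entity_identifier not in template.ad_identifiers:
        return False
    if template.cel_target is None:
        return True
    # Modelled behaviour: a type mismatch makes the CEL program return false.
    return template.cel_target == entity_kind


combined = Template('auto_conf.yaml', ('redis', 'cel://process'), cel_target='process')
container_only = Template('auto_conf.yaml', ('redis',))
process_only = Template('auto_conf_process.yaml', ('cel://process',), cel_target='process')

# One combined file: the redis container is filtered out.
combined_container = template_matches(combined, 'container', 'redis')
# Two files: each listener finds a template that matches its entity.
split_container = template_matches(container_only, 'container', 'redis')
split_process = template_matches(process_only, 'process', 'cel://process')
```

In this model the split layout is the only one where both the container and the process entity find a matching template, which is the outcome the commit reports.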
Two findings during execution required rewrites:
1. The original design followed DSCVR/6631130024's first solution (single auto_conf.yaml with both redis and cel://process in ad_identifiers). Implementation showed this regresses container autodiscovery: a process-only CEL filter unconditionally drops container candidates from the match list. Switched to the subpage's second solution (separate auto_conf_process.yaml).
2. The test cannot use dd_agent_check: the short-lived agent check subprocess never finishes initialising workloadmeta. The test now asserts via the long-running agent's status --json output.

Also documents the SYS_PTRACE/DAC_READ_SEARCH caps required by system-probe-lite, the ddev cap_add hook that grants them, and the ddev glob mount that picks up the new auto_conf_process.yaml.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
🎉 All green! ❄️ No new flaky tests detected. 🔗 Commit SHA: 9b83d6e
ddev validate license-headers expects the 3-clause BSD style header text matching the rest of the repo; the new files were copied with the older "Simplified BSD License" wording. Auto-fixed via ddev validate license-headers --fix.
The autodiscovery envs replace the 1m-2s cluster compose with a single default-port redis (REDIS_AUTODISCOVERY) or a host-network redis (REDIS_AUTODISCOVERY_PROCESS). The cluster master/replica/unhealthy ports (6382/6380/6381) are no longer exposed, so the existing integration tests in test_default.py / test_replication.py / test_auth.py fail with ConnectionRefused when ddev test runs them as part of the default matrix for these envs. Add a pytest_collection_modifyitems hook that, in autodiscovery envs, adds a skip marker to every test marked pytest.mark.integration except the autodiscovery e2e tests. test_unit.py is unmarked so it still runs; the e2e_autodiscovery files carry pytest.mark.e2e and are gated by their own pytestmark skipif.
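The selection rule applied by the collection hook can be written as a pure function. This sketch shows only the decision, under the assumption that the autodiscovery test files share the `test_e2e_autodiscovery` filename prefix; `should_skip` is a hypothetical name. In a real `pytest_collection_modifyitems` hook, a True result would translate into `item.add_marker(pytest.mark.skip(reason=...))`.

```python
def should_skip(marker_names, filename, autodiscovery_env):
    """Decide whether a collected test should be skipped.

    marker_names: names of the pytest markers on the test (e.g. {'integration'}).
    filename: basename of the test module.
    autodiscovery_env: True when running in one of the autodiscovery Hatch envs.
    """
    if not autodiscovery_env:
        return False  # the default matrix runs everything as before
    if 'integration' not in marker_names:
        return False  # unmarked unit tests still run
    # Assumption: autodiscovery e2e modules share this filename prefix.
    return not filename.startswith('test_e2e_autodiscovery')
```

Keeping the rule separate from the hook makes it easy to check the three cases the commit calls out: integration tests are skipped, unit tests run, and the autodiscovery e2e files are left to their own `pytestmark` gating.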
ddev: cap_add metadata key for the docker e2e agent, plus bind-mount of all auto_conf*.yaml files (not only auto_conf.yaml) so e2e tests verify multi-file autodiscovery configurations. redisdb: ship a process autodiscovery e2e environment driven by the new auto_conf_process.yaml.
The other e2e tests (test_e2e.py, test_e2e_autodiscovery.py) request dd_agent_check, whose fixture skips with "Not running E2E tests" when e2e_testing() is false. test_e2e_autodiscovery_process intentionally avoids dd_agent_check (its short-lived `agent check` subprocess gives up before workloadmeta/the process listener finish initialising) and queries the long-running agent's status JSON directly, but that left it without an auto-skip path: under `ddev test redisdb` in the py3.13-adproc-7.0 env the test was collected, the WaitFor in setup failed because the agent container only exists under `ddev env test`, and the job exited 1 with `RetryError: Result: None`. Take dd_agent_check on the redisdb_scheduled_and_running fixture purely for its skip side effect; the fixture body still uses the docker-exec/status path.
Validation Report: All 20 validations passed.
What does this PR do?
Adds end-to-end tests that exercise the Datadog Agent's autodiscovery for redisdb and mcache, covering both container autodiscovery (docker listener) and — for redisdb only — process autodiscovery (`cel://process` listener).
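For each integration the test starts a real backing container, runs the agent against the integration's shipped `auto_conf.yaml` (no static `conf.yaml` mounted), and verifies the agent discovers and successfully runs the check. Tests are gated to dedicated Hatch envs:
- `py3.13-ad-7.0`: redisdb container autodiscovery (`REDIS_AUTODISCOVERY=true`)
- `py3.13-adproc-7.0`: redisdb process autodiscovery (`REDIS_AUTODISCOVERY_PROCESS=true`)
- `py3.13-ad-1.6`: mcache container autodiscovery (`MCACHE_AUTODISCOVERY=true`)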
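Supporting changes in `ddev/src/ddev/e2e/agent/docker.py`:
- When `dd_environment` yields no static instance config, bind-mount every `auto_conf*.yaml` from the integration's source tree over `/etc/datadog-agent/conf.d/<integration>.d/`, so the tests exercise the local files rather than the copies baked into the agent image.
- A new `cap_add` metadata key so an integration's `dd_environment` fixture can request Linux capabilities for the agent container.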
For redisdb process autodiscovery specifically, ships a new `auto_conf_process.yaml` alongside the existing `auto_conf.yaml` (per DSCVR/6631130024 §"Add a separate auto_conf_process.yaml"). Bundling both into one file would regress container autodiscovery — see the spec for details.
Motivation
Tracks the Integrations autodiscovery exploration work. Until now we shipped `auto_conf.yaml` for several integrations but had no automated coverage that the agent could actually schedule the check from those files. These tests close that gap for redisdb (container + process) and mcache (container) and lay down the patterns (env-var-toggled Hatch envs, ddev bind-mount of `auto_conf*.yaml`, etc.) for extending coverage to other integrations.
Design and implementation plans:
Test plan
Review checklist (to be filled by reviewers)