Add agent autodiscovery e2e tests for redisdb and mcache#23515

Draft
vitkyrka wants to merge 44 commits into master from redis-autodiscovery-e2e

@vitkyrka
What does this PR do?

Adds end-to-end tests that exercise the Datadog Agent's autodiscovery for redisdb and mcache, covering both container autodiscovery (docker listener) and — for redisdb only — process autodiscovery (`cel://process` listener).

For each integration the test starts a real backing container, runs the agent against the integration's shipped `auto_conf.yaml` (no static `conf.yaml` mounted), and verifies the agent discovers and successfully runs the check. Tests are gated to dedicated Hatch envs:

  • `py3.13-ad-7.0` — redisdb container autodiscovery (default-port bridge network).
  • `py3.13-adproc-7.0` — redisdb process autodiscovery (`network_mode: host`, `DD_EXTRA_LISTENERS=process`, `DD_AUTOCONFIG_EXCLUDE_FEATURES=docker`).
  • `py3.13-ad-1.6` — mcache container autodiscovery (default-port bridge network).

Supporting changes in `ddev/src/ddev/e2e/agent/docker.py`:

  • Bind-mount the integration's shipped `auto_conf*.yaml` files when no static config is provided (otherwise the agent uses whatever `auto_conf.yaml` is baked into the agent image, defeating the test).
  • Add a `cap_add` metadata key so integrations can request Linux capabilities at agent start (`SYS_PTRACE` and `DAC_READ_SEARCH` are required by `system-probe-lite` for service discovery).
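
The bind-mount rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in `ddev/src/ddev/e2e/agent/docker.py`: the function name `auto_conf_volume_args` and the idea of returning raw `docker run -v` arguments are assumptions made for the example. Only the `auto_conf*.yaml` glob and the `/etc/datadog-agent/conf.d/<integration>.d/` target come from this PR.

```python
from pathlib import Path


def auto_conf_volume_args(data_dir: Path, integration: str) -> list[str]:
    """Build `docker run -v` arguments for every auto_conf*.yaml in data_dir.

    Hypothetical helper: the real ddev code may structure this differently.
    """
    args: list[str] = []
    # The glob matches auto_conf.yaml as well as auto_conf_process.yaml.
    for path in sorted(data_dir.glob("auto_conf*.yaml")):
        target = f"/etc/datadog-agent/conf.d/{integration}.d/{path.name}"
        args += ["-v", f"{path.resolve()}:{target}"]
    return args
```

Calling `auto_conf_volume_args(Path("data"), "redisdb")` on a directory that holds both `auto_conf.yaml` and `auto_conf_process.yaml` would yield two `-v` pairs, one per file, and nothing for unrelated files such as `conf.yaml.example`.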

For redisdb process autodiscovery specifically, this PR ships a new `auto_conf_process.yaml` alongside the existing `auto_conf.yaml` (per DSCVR/6631130024 §"Add a separate auto_conf_process.yaml"). Bundling both into one file would regress container autodiscovery; see the spec for details.

Motivation

Tracks the Integrations autodiscovery exploration work. Until now we shipped `auto_conf.yaml` for several integrations but had no automated coverage that the agent could actually schedule the check from those files. These tests close that gap for redisdb (container + process) and mcache (container) and lay down the patterns (env-var-toggled Hatch envs, ddev bind-mount of `auto_conf*.yaml`, etc.) for extending coverage to other integrations.

Design and implementation plans:

  • `docs/superpowers/specs/2026-04-17-redis-autodiscovery-e2e-design.md` — redisdb container.
  • `docs/superpowers/specs/2026-04-28-mcache-autodiscovery-e2e-design.md` — mcache container.
  • `docs/superpowers/specs/2026-04-28-redis-process-autodiscovery-e2e-design.md` — redisdb process.

Test plan

  • `ddev env start --dev redisdb py3.13-ad-7.0 && ddev env test --dev redisdb py3.13-ad-7.0 && ddev env stop redisdb py3.13-ad-7.0` passes locally.
  • `ddev env start --dev redisdb py3.13-adproc-7.0 && ddev env test --dev redisdb py3.13-adproc-7.0 && ddev env stop redisdb py3.13-adproc-7.0` passes locally.
  • `ddev env start --dev mcache py3.13-ad-1.6 && ddev env test --dev mcache py3.13-ad-1.6 && ddev env stop mcache py3.13-ad-1.6` passes locally.
  • CI passes on the existing redisdb / mcache test envs (no regressions on `py3.13-{5.0,6.0,7.0,8.0,cloud}` for redisdb, `py3.13` for mcache).
  • Negative-direction sanity check on each autodiscovery env: changing the `ad_identifier` (or CEL rule for process) to a non-matching value makes the test fail at the readiness fixture, confirming the test exercises the source-tree files.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add the `qa/skip-qa` label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the `backport/` label to the PR and it will automatically open a backport PR once this one is merged

vitkyrka and others added 30 commits April 17, 2026 16:43
Design for a dedicated py3.13-ad-7.0 hatch env that exercises the Agent's
container autodiscovery against a default-port redis container via the
integration's auto_conf.yaml. Minimal verification proves discovery +
check execution; broken cases from the DSCVR confluence page are deferred.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Task-by-task implementation plan derived from the 2026-04-17 spec.
Adds a py3.13-ad-7.0 hatch env, dedicated compose, branched
dd_environment, and a minimal test that proves autodiscovery via
configcheck and a can_connect assertion with redis_port:6379.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- test_e2e_autodiscovery: assert_service_check(tags=...) requires exact
  tag-list equality, but the autodiscovered check emits Docker tags
  (docker_image, image_id, image_name, image_tag, redis_host, short_image)
  whose values vary per run. Switch to scanning service_checks() and
  asserting any with status OK contains 'redis_port:6379'.
- test_e2e: skip the 1m-2s cluster test when REDIS_AUTODISCOVERY=true,
  per the plan's Task 6 Step 3 instructions, since the cluster compose
  isn't started under that env.
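
A minimal sketch of the subset scan described in the first bullet, assuming a stand-in `ServiceCheckStub` type rather than the real aggregator stub objects. The stub class, the helper name `any_ok_with_tag`, and the sample tag values are hypothetical. `OK = 0` follows the usual Datadog service-check status convention.

```python
from dataclasses import dataclass, field

OK = 0  # assumption: service-check status 0 means OK


@dataclass
class ServiceCheckStub:
    """Hypothetical stand-in for the objects returned by service_checks()."""
    name: str
    status: int
    tags: list[str] = field(default_factory=list)


def any_ok_with_tag(service_checks, required_tag: str) -> bool:
    """True if any OK service check carries required_tag, ignoring other tags."""
    return any(sc.status == OK and required_tag in sc.tags for sc in service_checks)


checks = [
    ServiceCheckStub(
        "redis.can_connect",
        OK,
        ["redis_port:6379", "image_tag:7.0", "short_image:redis"],
    )
]
assert any_ok_with_tag(checks, "redis_port:6379")
```

Because only membership of one stable tag is checked, the volatile Docker tags can change between runs without breaking the assertion.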

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two doc-only fixes surfaced in final review:
- Network name `autodiscovery_default_default` was a typo; the compose
  file and project name dictate `autodiscovery-default_default`.
- Replace the spec's `assert_service_check(tags=[...])` snippet with the
  `service_checks() + any(...)` subset scan that actually shipped.
  `assert_service_check` requires tag-list equality, but the autodiscovered
  check emits volatile Docker tags (docker_image, image_id, image_name,
  image_tag, redis_host, short_image), so subset matching is needed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When dd_environment yields no static instance config (the autodiscovery
setup), the existing config-dir bind-mount is skipped, so the Agent falls
back to the auto_conf.yaml baked into its image rather than the version
shipped in the integration's source tree. That lets autodiscovery e2e
tests pass against an outdated agent-image template even when the local
`data/auto_conf.yaml` would be wrong.

File-bind-mount the integration's own `data/auto_conf.yaml` over
`/etc/datadog-agent/conf.d/<integration>.d/auto_conf.yaml` whenever no
static config is yielded and the file exists in the source tree. SNMP
(the only other yield-None user today) ships no auto_conf.yaml so its
behavior is unchanged.

Verified by editing redisdb's local auto_conf.yaml ad_identifiers to
something the agent can't match: the redisdb autodiscovery e2e now
correctly fails, and reverts to passing once the identifier is restored.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When auto_conf.yaml is edited between test runs, the docker bind-mount
of the source-tree file can become stale because tools like git
checkout replace the inode. Document the env-restart workaround.
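
The inode replacement behind the stale mount can be reproduced with a short sketch. It assumes a POSIX filesystem and mimics the write-to-temp-then-rename pattern that many tools use; the helper name is hypothetical and nothing here is taken from ddev itself.

```python
import os
import tempfile


def inode_after_atomic_replace(path: str, new_text: str) -> tuple[int, int]:
    """Rewrite path via temp file + rename and return (old_inode, new_inode)."""
    before = os.stat(path).st_ino
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file coexists with the original, so it must get its own inode.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(new_text)
    os.replace(tmp_path, path)  # atomic rename over the original name
    return before, os.stat(path).st_ino
```

A single-file bind mount keeps pointing at the old inode, so the container continues to see the previous contents until the environment is restarted.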

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirror redisdb's autodiscovery e2e pattern for memcached: dedicated
py3.13-ad-1.6 hatch env, vanilla memcached:1.6 on bridge network,
configcheck + service-check assertions. Helper extraction is left to a
follow-up spec informed by the duplication this lands.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bite-sized, TDD-shaped tasks driving the spec landed in
2026-04-28-mcache-autodiscovery-e2e-design.md. Pre-flight verifies the
ddev autodiscovery fix from the redisdb branch is in place; final task
runs the two-direction sanity check.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
memcached:1.6 is silent at default verbosity, so the CheckDockerLogs
'server listening' readiness probe times out. Run with -vv so memcached
emits the slab classes and 'server listening (auto-negotiate)' lines on
startup.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The autodiscovery env starts a vanilla memcached:1.6 container on the
Docker bridge network. The integration tests in test_integration_e2e.py
all rely on SASL credentials, a unix socket, or IPv6 — none of which
exist under the autodiscovery dd_environment. Promote the existing
test_e2e skipif to a module-level pytestmark so every test in the file
is skipped when MCACHE_AUTODISCOVERY=true.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ling

Two small follow-ups from review:

- The dd_environment docstring referenced 'existing e2e and integration
  tests'. After the module-level skipif landed, every test in
  test_integration_e2e.py is skipped under autodiscovery; reword to
  reflect that.

- The compose's -vv flag and conftest's 'server listening' log probe are
  silently coupled. Document the dependency in the compose so a future
  change to verbosity does not mysteriously break the readiness probe.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Plan a new py3.13-adproc-7.0 hatch env that enables the agent's process
listener and disables the docker feature, runs a host-networking redis
container, and verifies that the cel_selector.processes rule shipped in
auto_conf.yaml drives discovery.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Eleven bite-sized tasks for shipping py3.13-adproc-7.0: modify auto_conf.yaml
with cel://process and cel_selector.processes, add the new hatch env, compose
file, common.py constants, conftest.py branch, skipif update, new test, docs,
and three verification passes (happy path, negative sanity check, container
env non-regression).
….yaml

Discovered during execution that auto_conf.yaml is regenerated from
assets/configuration/spec.yaml by ddev validate config -s. CI runs the same
validator (without --sync) and fails on any drift. Update Task 1 to edit
spec.yaml as the source of truth and commit both spec.yaml and the
regenerated auto_conf.yaml together. The file structure table also calls
this out so future readers don't repeat the mistake.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per https://datadoghq.atlassian.net/wiki/spaces/DSCVR/pages/6631130024/Adding+process+support+to+auto+configurations
the process autodiscovery listener requires both an explicit cel://process
entry in ad_identifiers (auto-injection only happens when ad_identifiers is
empty) and a cel_selector.processes CEL rule. Add both to spec.yaml and
regenerate the shipped auto_conf.yaml. The existing redis identifier is
preserved so docker-listener container autodiscovery keeps working unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Toggles REDIS_AUTODISCOVERY_PROCESS=true so conftest.py can switch to the
host-networking redis container path needed for process autodiscovery e2e.
Single redis service running with network_mode: host so the redis-server
process is visible in the host process table where the agent's process
listener will look for it.
Mirrors the existing AUTODISCOVERY/AUTODISCOVERY_COMPOSE_PATH pair so the
new conftest branch and test file can switch on a single env-var-derived
boolean.
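
A rough sketch of the env-var-derived switch, assuming the two variable names used in this PR (`REDIS_AUTODISCOVERY`, `REDIS_AUTODISCOVERY_PROCESS`). The helper names, the accepted truthy spellings, and the `"cluster"` label for the default environment are illustrative assumptions, not the contents of `common.py`.

```python
import os


def is_true(value) -> bool:
    """Interpret common truthy spellings of an environment variable."""
    return (value or "").strip().lower() in {"1", "true", "yes"}


def select_environment(env) -> str:
    """Pick which compose setup a dd_environment-style fixture would start."""
    if is_true(env.get("REDIS_AUTODISCOVERY_PROCESS")):
        return "process"    # host-networking redis, process listener
    if is_true(env.get("REDIS_AUTODISCOVERY")):
        return "container"  # default-port redis on the bridge network
    return "cluster"        # the existing cluster compose


current = select_environment(os.environ)
```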
vitkyrka and others added 10 commits April 29, 2026 09:33
REDIS_AUTODISCOVERY_PROCESS=true selects a host-networking redis container
and forwards DD_DISCOVERY_ENABLED, DD_EXTRA_LISTENERS, and
DD_AUTOCONFIG_EXCLUDE_FEATURES via e2e_metadata['env_vars'] so the agent
runs the process listener (and skips the docker feature so processes inside
containers don't get container-IDs that would exclude them).
Same reasoning as for the existing container autodiscovery env: the cluster
fixture isn't running, so the cluster-specific assertions can't pass.
Waits for 'agent configcheck' to list the redisdb config, then runs the
agent and asserts that redis.can_connect is OK with redis_port:6379. Only
runs in the py3.13-adproc-7.0 env (REDIS_AUTODISCOVERY_PROCESS=true).
Mirrors the existing container autodiscovery section: invocation, the env
vars used, and the host-port + bind-mount caveats developers will hit.
Some agent components require Linux capabilities not granted by the default
docker run invocation. system-probe-lite (used by the discovery check that
feeds workloadmeta for the cel://process listener) needs CAP_SYS_PTRACE and
CAP_DAC_READ_SEARCH to read /proc entries of other processes; without them
it logs "service discovery may be impaired" and the process listener never
matches anything.

Add a `cap_add` metadata key (mirrors existing `docker_volumes` /
`custom_hosts` keys) so individual integrations can request capabilities
from their dd_environment fixture.
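
One way such metadata could translate into `docker run` arguments is sketched below. The function name and the exact metadata layout are assumptions for illustration. The `DD_DISCOVERY_ENABLED` value of `true` is also assumed, since this PR names the variable but not its value. `-e` and `--cap-add` are standard `docker run` flags.

```python
def docker_run_args(metadata: dict) -> list[str]:
    """Turn env_vars and cap_add metadata into docker run arguments."""
    args: list[str] = []
    for key, value in sorted(metadata.get("env_vars", {}).items()):
        args += ["-e", f"{key}={value}"]
    for capability in metadata.get("cap_add", []):
        args += ["--cap-add", capability]
    return args


process_metadata = {
    "env_vars": {
        "DD_DISCOVERY_ENABLED": "true",  # assumed value
        "DD_EXTRA_LISTENERS": "process",
        "DD_AUTOCONFIG_EXCLUDE_FEATURES": "docker",
    },
    "cap_add": ["SYS_PTRACE", "DAC_READ_SEARCH"],
}
args = docker_run_args(process_metadata)
```

Integrations that set no `cap_add` key produce no `--cap-add` arguments, so existing environments would be unaffected under this sketch.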

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
system-probe-lite (the agent's service discovery worker) needs these caps
to read /proc entries of other processes. Without them the redis-server
process is never classified as a service and the cel://process listener
has nothing to match against.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The original test used dd_agent_check (which spawns a short-lived
"agent check" subprocess and waits for AD to schedule the redisdb config).
That doesn't work for process autodiscovery: the subprocess starts its own
workloadmeta which never finishes initialising before the deadline (logs
"Workloadmeta collectors are not ready after 17 retries"). The process
listener can never match the redis-server process because workloadmeta
never gets populated.

Instead, query the long-running agent's "status --json" output. It already
runs the discovery check and the process listener, so we can assert:
- the redisdb config is scheduled (configcheck shows host: 127.0.0.1, port: 6379),
- the running agent has executed it at least once with no errors.

This is a fundamentally different verification surface than the container
autodiscovery test, but it's the right one for process-listener-driven
scheduling where the short-lived subprocess approach doesn't terminate.
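
The second assertion might look something like the sketch below. The JSON layout (`runnerStats.Checks.<check>.<instance>` with `TotalRuns` and `TotalErrors`) is an assumption about the agent's status output and should be verified against a real `status --json` dump; the helper name is hypothetical.

```python
import json


def check_ran_cleanly(status_json: str, check_name: str) -> bool:
    """True if some instance of check_name ran at least once with no errors."""
    status = json.loads(status_json)
    # Assumed layout: {"runnerStats": {"Checks": {name: {instance_id: stats}}}}
    instances = status.get("runnerStats", {}).get("Checks", {}).get(check_name, {})
    return any(
        stats.get("TotalRuns", 0) >= 1 and stats.get("TotalErrors", 0) == 0
        for stats in instances.values()
    )
```

A test would poll this against the long-running agent until it returns true or a deadline passes, rather than spawning a separate short-lived agent process.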

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The agent picks up every .yaml file in conf.d/<integration>.d/ as a
separate template. Integrations may now ship multiple auto_conf-style
files (e.g. an `auto_conf.yaml` for the docker listener and an
`auto_conf_process.yaml` for the cel://process listener; one file can't
serve both, see DSCVR/6631130024). Bind-mount any file matching
`auto_conf*.yaml` so e2e tests verify all of them, not just the canonical
single-file form.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…yaml

Shipping cel_selector.processes alongside the existing redis ad_identifier
in a single auto_conf.yaml regresses container autodiscovery: when the
docker listener finds a redis container, the agent runs the config's CEL
matching program (target=ProcessType) against the container entity, the
type mismatch returns false, and the redis container is filtered out
(comp/core/autodiscovery/listeners/common_filter.go:24,
comp/core/autodiscovery/integration/matching_program.go:38).

Switch to subpage option 2 (DSCVR/6631130024 §"Add a separate
auto_conf_process.yaml"): keep the original auto_conf.yaml unchanged so
container autodiscovery keeps working, and ship a second
auto_conf_process.yaml whose ad_identifiers and cel_selector are
process-only. The agent loads both files; the docker listener uses the
first, the cel://process listener uses the second. No regression.

If the agent later treats CEL filters as additive-when-present rather
than required-for-all, a future change can collapse both files into one.
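
The regression can be illustrated with a toy model. This is not the agent's Go implementation: the `Entity` and `Template` types and the `matches` function are simplified stand-ins, and the CEL rule is reduced to a plain Python callable. It only mirrors the behavior described above, where a process-targeted rule evaluated against a container entity returns false.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Entity:
    kind: str  # "container" or "process"
    name: str


@dataclass
class Template:
    ad_identifiers: list[str]
    cel_target: Optional[str] = None  # entity kind the rule applies to
    cel_rule: Optional[Callable[[Entity], bool]] = None


def matches(template: Template, entity: Entity, identifier: str) -> bool:
    """Toy version of identifier match followed by the CEL filter."""
    if identifier not in template.ad_identifiers:
        return False
    if template.cel_rule is None:
        return True
    if entity.kind != template.cel_target:
        return False  # type mismatch: the candidate is filtered out
    return template.cel_rule(entity)


def is_redis(entity: Entity) -> bool:
    return entity.name == "redis-server"


container = Entity("container", "redis")
process = Entity("process", "redis-server")

combined = Template(["redis", "cel://process"], "process", is_redis)
container_only = Template(["redis"])
process_only = Template(["cel://process"], "process", is_redis)
```

In this model `matches(combined, container, "redis")` is false, which is the regression, while the two split templates each match their own entity kind.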

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two findings during execution required rewrites:

1. The original design followed DSCVR/6631130024's first solution (single
   auto_conf.yaml with both redis and cel://process in ad_identifiers).
   Implementation showed this regresses container autodiscovery: a
   process-only CEL filter unconditionally drops container candidates from
   the match list. Switched to the subpage's second solution (separate
   auto_conf_process.yaml).

2. The test cannot use dd_agent_check: the short-lived agent check
   subprocess never finishes initialising workloadmeta. The test now
   asserts via the long-running agent's status --json output.

Also documents the SYS_PTRACE/DAC_READ_SEARCH caps required by
system-probe-lite, the ddev cap_add hook that grants them, and the ddev
glob mount that picks up the new auto_conf_process.yaml.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@datadog-official
Contributor

datadog-official Bot commented Apr 29, 2026

Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 51.22%
Overall Coverage: 87.26% (-1.16%)

🔗 Commit SHA: 9b83d6e

ddev validate license-headers expects the 3-clause BSD style header text
matching the rest of the repo; the new files were copied with the older
"Simplified BSD License" wording. Auto-fixed via ddev validate
license-headers --fix.
The autodiscovery envs replace the 1m-2s cluster compose with a single
default-port redis (REDIS_AUTODISCOVERY) or a host-network redis
(REDIS_AUTODISCOVERY_PROCESS). The cluster master/replica/unhealthy
ports (6382/6380/6381) are no longer exposed, so the existing
integration tests in test_default.py / test_replication.py / test_auth.py
fail with ConnectionRefused when ddev test runs them as part of the
default matrix for these envs.

Add a pytest_collection_modifyitems hook that, in autodiscovery envs,
adds a skip marker to every test marked pytest.mark.integration except
the autodiscovery e2e tests. test_unit.py is unmarked so it still runs;
the e2e_autodiscovery files carry pytest.mark.e2e and are gated by their
own pytestmark skipif.
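
The skip decision made by the hook can be sketched as a pure function. The name `should_skip` and its parameters are hypothetical. An actual `pytest_collection_modifyitems` hook would compute these inputs from each collected item and call `item.add_marker(pytest.mark.skip(...))` when the function returns true.

```python
def should_skip(marker_names: set[str], module_name: str, autodiscovery_env: bool) -> bool:
    """Decide whether a collected test is skipped in an autodiscovery env."""
    if not autodiscovery_env:
        return False  # regular envs run everything
    if module_name.startswith("test_e2e_autodiscovery"):
        return False  # gated separately by their own pytestmark skipif
    # Integration tests need the cluster ports, which are not exposed here.
    return "integration" in marker_names
```

Keeping the decision separate from the hook makes it easy to check that unmarked tests such as those in `test_unit.py` still run.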
ddev: cap_add metadata key for the docker e2e agent, plus bind-mount of
all auto_conf*.yaml files (not only auto_conf.yaml) so e2e tests verify
multi-file autodiscovery configurations.

redisdb: ship a process autodiscovery e2e environment driven by the new
auto_conf_process.yaml.
The other e2e tests (test_e2e.py, test_e2e_autodiscovery.py) request
dd_agent_check, whose fixture skips with "Not running E2E tests" when
e2e_testing() is false. test_e2e_autodiscovery_process intentionally
avoids dd_agent_check (its short-lived `agent check` subprocess gives up
before workloadmeta/the process listener finish initialising) and
queries the long-running agent's status JSON directly, but that left it
without an auto-skip path: under `ddev test redisdb` in the
py3.13-adproc-7.0 env the test was collected, the WaitFor in setup
failed because the agent container only exists under `ddev env test`,
and the job exited 1 with `RetryError: Result: None`.

Take dd_agent_check on the redisdb_scheduled_and_running fixture purely
for its skip side effect; the fixture body still uses the
docker-exec/status path.
@dd-octo-sts
Contributor

dd-octo-sts Bot commented Apr 30, 2026

Validation Report

All 20 validations passed.

| Validation | Description | Status |
| --- | --- | --- |
| agent-reqs | Verify check versions match the Agent requirements file | Passed |
| ci | Validate CI configuration and Codecov settings | Passed |
| codeowners | Validate every integration has a CODEOWNERS entry | Passed |
| config | Validate default configuration files against spec.yaml | Passed |
| dep | Verify dependency pins are consistent and Agent-compatible | Passed |
| http | Validate integrations use the HTTP wrapper correctly | Passed |
| imports | Validate check imports do not use deprecated modules | Passed |
| integration-style | Validate check code style conventions | Passed |
| jmx-metrics | Validate JMX metrics definition files and config | Passed |
| labeler | Validate PR labeler config matches integration directories | Passed |
| legacy-signature | Validate no integration uses the legacy Agent check signature | Passed |
| license-headers | Validate Python files have proper license headers | Passed |
| licenses | Validate third-party license attribution list | Passed |
| metadata | Validate metadata.csv metric definitions | Passed |
| models | Validate configuration data models match spec.yaml | Passed |
| openmetrics | Validate OpenMetrics integrations disable the metric limit | Passed |
| package | Validate Python package metadata and naming | Passed |
| readmes | Validate README files have required sections | Passed |
| saved-views | Validate saved view JSON file structure and fields | Passed |
| version | Validate version consistency between package and changelog | Passed |

@codecov

codecov Bot commented Apr 30, 2026

Codecov Report

❌ Patch coverage is 51.21951% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.78%. Comparing base (fef832e) to head (9b83d6e).
⚠️ Report is 337 commits behind head on master.


