fix(integration): bump test timeout to 90s#1581
Conversation
The `timeout` field changed from "assertion-phase budget" to "whole-test budget" in eb250ff. Tests authored under the prior semantic set 60s expecting it to cover only the ~10s assertion phase. Under the new semantic that 60s also has to absorb container_start, which under parallel Docker contention routinely reaches 30-57s and pushes total wall clock past the budget. The result is flaky `cancelled` failures. Raise 60s to 90s on the affected configs. The adp-*-exit configs at 45s are unchanged.
Binary Size Analysis (Agent Data Plane)Target: e8a875d (baseline) vs 1cae29b (comparison) diff
|
| Module | File Size | Symbols |
|---|---|---|
anon.4f8fd67d74ae1f1600187cfeb0121be9.1.llvm.14393392841373718055 |
-130 B | 1 |
anon.4f8fd67d74ae1f1600187cfeb0121be9.1.llvm.8158336758582549391 |
+129 B | 1 |
anon.4f8fd67d74ae1f1600187cfeb0121be9.4.llvm.14393392841373718055 |
-115 B | 1 |
anon.4f8fd67d74ae1f1600187cfeb0121be9.4.llvm.8158336758582549391 |
+114 B | 1 |
anon.4f8fd67d74ae1f1600187cfeb0121be9.3.llvm.14393392841373718055 |
-109 B | 1 |
anon.4f8fd67d74ae1f1600187cfeb0121be9.3.llvm.8158336758582549391 |
+108 B | 1 |
anon.4f8fd67d74ae1f1600187cfeb0121be9.0.llvm.14393392841373718055 |
-97 B | 1 |
anon.4f8fd67d74ae1f1600187cfeb0121be9.0.llvm.8158336758582549391 |
+96 B | 1 |
anon.4f8fd67d74ae1f1600187cfeb0121be9.2.llvm.14393392841373718055 |
-95 B | 1 |
anon.4f8fd67d74ae1f1600187cfeb0121be9.2.llvm.8158336758582549391 |
+94 B | 1 |
[Unmapped] |
-3 B | 1 |
Detailed Symbol Changes
FILE SIZE VM SIZE
-------------- --------------
[NEW] +129 [NEW] +40 anon.4f8fd67d74ae1f1600187cfeb0121be9.1.llvm.8158336758582549391
[NEW] +114 [NEW] +25 anon.4f8fd67d74ae1f1600187cfeb0121be9.4.llvm.8158336758582549391
[NEW] +108 [NEW] +19 anon.4f8fd67d74ae1f1600187cfeb0121be9.3.llvm.8158336758582549391
[NEW] +96 [NEW] +7 anon.4f8fd67d74ae1f1600187cfeb0121be9.0.llvm.8158336758582549391
[NEW] +94 [NEW] +5 anon.4f8fd67d74ae1f1600187cfeb0121be9.2.llvm.8158336758582549391
-0.1% -3 [ = ] 0 [Unmapped]
[DEL] -95 [DEL] -5 anon.4f8fd67d74ae1f1600187cfeb0121be9.2.llvm.14393392841373718055
[DEL] -97 [DEL] -7 anon.4f8fd67d74ae1f1600187cfeb0121be9.0.llvm.14393392841373718055
[DEL] -109 [DEL] -19 anon.4f8fd67d74ae1f1600187cfeb0121be9.3.llvm.14393392841373718055
[DEL] -115 [DEL] -25 anon.4f8fd67d74ae1f1600187cfeb0121be9.4.llvm.14393392841373718055
[DEL] -130 [DEL] -40 anon.4f8fd67d74ae1f1600187cfeb0121be9.1.llvm.14393392841373718055
-0.0% -8 [ = ] 0 TOTAL
Regression Detector (Agent Data Plane)Regression Detector ResultsRun ID: ecfb71fe-c9ea-4056-921a-069230385ece Baseline: e8a875d Optimization Goals: ✅ No significant changes detected
|
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | otlp_ingest_logs_5mb_memory | memory utilization | +2.53 | [+2.22, +2.83] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_logs_5mb_cpu | % cpu utilization | +2.32 | [-2.69, +7.33] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_logs_5mb_throughput | ingress throughput | -0.00 | [-0.13, +0.12] | 1 | (metrics) (profiles) (logs) |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | otlp_ingest_logs_5mb_memory | memory utilization | +2.53 | [+2.22, +2.83] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_logs_5mb_cpu | % cpu utilization | +2.32 | [-2.69, +7.33] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_100mb_3k_contexts_cpu | % cpu utilization | +1.99 | [-3.40, +7.39] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_filtering_5mb_cpu | % cpu utilization | +1.84 | [-0.47, +4.16] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_transform_5mb_cpu | % cpu utilization | +1.12 | [-0.91, +3.16] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_metrics_5mb_cpu | % cpu utilization | +0.94 | [-5.16, +7.03] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_dsd_heavy | memory utilization | +0.71 | [+0.59, +0.84] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_5mb_cpu | % cpu utilization | +0.56 | [-1.59, +2.70] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_filtering_5mb_memory | memory utilization | +0.30 | [+0.07, +0.54] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_idle | memory utilization | +0.25 | [+0.21, +0.29] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_dsd_low | memory utilization | +0.22 | [+0.05, +0.39] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_1mb_3k_contexts_memory | memory utilization | +0.21 | [+0.07, +0.35] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_1mb_3k_contexts_cpu | % cpu utilization | +0.17 | [-53.02, +53.35] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_5mb_throughput | ingress throughput | +0.16 | [+0.08, +0.25] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_100mb_3k_contexts_memory | memory utilization | +0.11 | [-0.04, +0.27] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_transform_5mb_throughput | ingress throughput | +0.10 | [+0.02, +0.18] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_dsd_ultraheavy | memory utilization | +0.08 | [-0.05, +0.21] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_500mb_3k_contexts_memory | memory utilization | +0.04 | [-0.10, +0.19] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_100mb_3k_contexts_throughput | ingress throughput | +0.01 | [-0.03, +0.04] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_1mb_3k_contexts_throughput | ingress throughput | +0.00 | [-0.05, +0.06] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_512kb_3k_contexts_throughput | ingress throughput | -0.00 | [-0.05, +0.05] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_logs_5mb_throughput | ingress throughput | -0.00 | [-0.13, +0.12] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_10mb_3k_contexts_throughput | ingress throughput | -0.03 | [-0.21, +0.16] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_metrics_5mb_throughput | ingress throughput | -0.03 | [-0.20, +0.15] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_dsd_medium | memory utilization | -0.17 | [-0.34, +0.00] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_512kb_3k_contexts_memory | memory utilization | -0.22 | [-0.37, -0.07] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_10mb_3k_contexts_memory | memory utilization | -0.38 | [-0.54, -0.23] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_5mb_memory | memory utilization | -0.50 | [-0.66, -0.34] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_filtering_5mb_throughput | ingress throughput | -0.51 | [-0.61, -0.40] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_transform_5mb_memory | memory utilization | -0.53 | [-0.69, -0.36] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_500mb_3k_contexts_cpu | % cpu utilization | -0.79 | [-2.25, +0.66] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_500mb_3k_contexts_throughput | ingress throughput | -0.81 | [-0.93, -0.69] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_metrics_5mb_memory | memory utilization | -1.76 | [-1.97, -1.56] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_10mb_3k_contexts_cpu | % cpu utilization | -2.11 | [-32.50, +28.28] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_512kb_3k_contexts_cpu | % cpu utilization | -2.15 | [-57.02, +52.72] | 1 | (metrics) (profiles) (logs) |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | quality_gates_rss_dsd_heavy | memory_usage | 10/10 | 123.46MiB ≤ 140MiB | (metrics) (profiles) (logs) |
| ✅ | quality_gates_rss_dsd_low | memory_usage | 10/10 | 41.01MiB ≤ 50MiB | (metrics) (profiles) (logs) |
| ✅ | quality_gates_rss_dsd_medium | memory_usage | 10/10 | 61.98MiB ≤ 75MiB | (metrics) (profiles) (logs) |
| ✅ | quality_gates_rss_dsd_ultraheavy | memory_usage | 10/10 | 176.15MiB ≤ 200MiB | (metrics) (profiles) (logs) |
| ✅ | quality_gates_rss_idle | memory_usage | 10/10 | 27.71MiB ≤ 40MiB | (metrics) (profiles) (logs) |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
-
Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
-
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
-
Its configuration does not mark it "erratic".
## Summary The `timeout` field in integration test configs now bounds the entire test (container start + dynamic vars + assertions) rather than only the assertion phase. The behavior changed in eb250ff. Tests authored under the prior semantic set `timeout: 60s` expecting that to cover only the ~10s assertion phase. After the change, `container_start` under parallel Docker contention regularly takes 30-57s, so the wall clock hits 60s mid-assertions and the test reports `cancelled`. Raise the affected configs from 60s to 90s. The two `adp-*-exit` configs at 45s are deliberately distinct and unchanged. I think the reason this tends to hit the `bind-host` tests may be because of where they land in the execution order (later waves of parallelism). ### Pre-change timing (timeout bounded only the assertion phase) | Run | Test | Total | container_start | assertions | |---|---|---:|---:|---:| | 20260430-194029 | dogstatsd-bind-custom-hostname | 22.68s | 1.91s | 10.09s | | 20260430-194029 | dogstatsd-bind-host | 45.21s | 32.43s | 10.07s | | 20260430-194029 | dogstatsd-default-bind | 32.73s | 19.98s | 10.41s | | 20260430-194029 | dogstatsd-non-local-overrides-bind-host | 39.75s | 19.02s | 10.40s | | 20260430-194640 | dogstatsd-bind-custom-hostname | 37.48s | 24.76s | 10.07s | | 20260430-194640 | dogstatsd-bind-host | 33.76s | 21.03s | 10.08s | | 20260430-194640 | dogstatsd-default-bind | 28.31s | 15.40s | 10.38s | | 20260430-194640 | dogstatsd-non-local-overrides-bind-host | 17.99s | 4.99s | 10.28s | | 20260430-195845 | dogstatsd-bind-custom-hostname | 15.00s | 2.00s | 10.10s | | 20260430-195845 | dogstatsd-bind-host | 48.48s | 35.71s | 10.09s | | 20260430-195845 | dogstatsd-default-bind | 42.28s | 29.68s | 10.28s | | 20260430-195845 | dogstatsd-non-local-overrides-bind-host | 33.02s | 20.24s | 10.40s | | 20260430-201339 | dogstatsd-bind-host | 49.14s | 35.13s | 10.09s | | 20260430-201339 | dogstatsd-default-bind | 32.86s | 20.10s | 10.36s | | 20260430-201339 | dogstatsd-non-local-overrides-bind-host | 39.75s | 18.92s | 10.38s | | 20260430-202951 | dogstatsd-non-local-overrides-bind-host | 52.52s | 31.87s | 10.30s | | 20260501-125239 | dogstatsd-default-bind | 64.87s | 52.30s | 10.25s | | 20260501-125239 | dogstatsd-non-local-overrides-bind-host | 53.12s | 40.07s | 10.37s | | 20260501-132008 | dogstatsd-non-local-overrides-bind-host | 54.70s | 42.05s | 10.30s | | 20260501-190149 | dogstatsd-bind-host | 69.86s | 57.05s | 10.09s | | 20260501-190149 | dogstatsd-default-bind | 67.98s | 55.11s | 10.33s | | 20260501-190149 | dogstatsd-non-local-overrides-bind-host | 52.87s | 40.08s | 10.32s | | 20260501-193326 | dogstatsd-non-local-overrides-bind-host | 60.29s | 47.47s | 10.38s | `assertions` is consistently ~10s. `container_start` is variable: 1.9s to 57s. Total wall clock crossed 60s on multiple runs and those still passed under the prior semantic. Under the current semantic those same conditions cancel mid-assertion, which matches the local flakes observed. ## Change Type - [x] Bug fix ## How did you test this PR? - Ran `make test-integration` locally on Apple Silicon multiple times. ## References Closes #1580 b28ad98
Summary
The
timeoutfield in integration test configs now bounds the entire test (container start +dynamic vars + assertions) rather than only the assertion phase. The behavior changed in
eb250ff. Tests authored under the prior semantic set
timeout: 60sexpecting that to coveronly the ~10s assertion phase. After the change,
container_startunder parallel Docker contentionregularly takes 30-57s, so the wall clock hits 60s mid-assertions and the test reports
cancelled.Raise the affected configs from 60s to 90s. The two
adp-*-exitconfigs at 45s are deliberatelydistinct and unchanged.
I think the reason this tends to hit the
bind-hosttests may be because of where they land in the execution order (later waves of parallelism).Pre-change timing (timeout bounded only the assertion phase)
assertionsis consistently ~10s.container_startis variable: 1.9s to 57s. Total wall clockcrossed 60s on multiple runs and those still passed under the prior semantic. Under the current
semantic those same conditions cancel mid-assertion, which matches the local flakes observed.
Change Type
How did you test this PR?
make test-integrationlocally on Apple Silicon multiple times.References
Closes #1580