ci(threads): fix multiproces threads forks slow in ci #15263
Bootstrap import analysis
Comparison of import times between this PR and base.
Summary
The average import time from this PR is: 206 ± 2 ms.
The average import time from base is: 211 ± 2 ms.
The import time difference between this PR and base is: -4.71 ± 0.09 ms.
Import time breakdown
The following import paths have shrunk:
Performance SLOs
Comparing candidate avara1986/skip_tests (94c9515) with baseline main (466fcfa)

📈 Performance Regressions (1 suite)

📈 iast_aspects - 40/40
✅ re_expand_aspect: Time: ✅ 33.975µs (SLO: <40.000µs 📉 -15.1%) vs baseline: +7.0%; Memory: ✅ 37.375MB (SLO: <39.000MB -4.2%) vs baseline: +5.2%
✅ re_expand_noaspect: Time: ✅ 29.798µs (SLO: <40.000µs 📉 -25.5%) vs baseline: +3.7%; Memory: ✅ 37.297MB (SLO: <39.000MB -4.4%) vs baseline: +5.1%
✅ re_findall_aspect: Time: ✅ 2.907µs (SLO: <10.000µs 📉 -70.9%) vs baseline: -0.5%; Memory: ✅ 37.198MB (SLO: <39.000MB -4.6%) vs baseline: +4.9%
✅ re_findall_noaspect: Time: ✅ 1.441µs (SLO: <10.000µs 📉 -85.6%) vs baseline: +1.0%; Memory: ✅ 37.277MB (SLO: <39.000MB -4.4%) vs baseline: +5.1%
✅ re_finditer_aspect: Time: ✅ 4.644µs (SLO: <10.000µs 📉 -53.6%) vs baseline: +5.2%; Memory: ✅ 37.277MB (SLO: <39.000MB -4.4%) vs baseline: +5.0%
✅ re_finditer_noaspect: Time: ✅ 1.398µs (SLO: <10.000µs 📉 -86.0%) vs baseline: -0.1%; Memory: ✅ 37.218MB (SLO: <39.000MB -4.6%) vs baseline: +4.7%
✅ re_fullmatch_aspect: Time: ✅ 2.630µs (SLO: <10.000µs 📉 -73.7%) vs baseline: -0.8%; Memory: ✅ 37.257MB (SLO: <39.000MB -4.5%) vs baseline: +5.1%
✅ re_fullmatch_noaspect: Time: ✅ 1.318µs (SLO: <10.000µs 📉 -86.8%) vs baseline: +1.2%; Memory: ✅ 37.257MB (SLO: <39.000MB -4.5%) vs baseline: +4.9%
✅ re_group_aspect: Time: ✅ 3.150µs (SLO: <10.000µs 📉 -68.5%) vs baseline: +6.9%; Memory: ✅ 37.198MB (SLO: <39.000MB -4.6%) vs baseline: +4.5%
✅ re_group_noaspect: Time: ✅ 1.634µs (SLO: <10.000µs 📉 -83.7%) vs baseline: +1.9%; Memory: ✅ 37.218MB (SLO: <39.000MB -4.6%) vs baseline: +5.0%
✅ re_groups_aspect: Time: ✅ 3.457µs (SLO: <10.000µs 📉 -65.4%) vs baseline: 📈 +13.2%; Memory: ✅ 37.257MB (SLO: <39.000MB -4.5%) vs baseline: +5.0%
✅ re_groups_noaspect: Time: ✅ 1.716µs (SLO: <10.000µs 📉 -82.8%) vs baseline: +2.0%; Memory: ✅ 37.238MB (SLO: <39.000MB -4.5%) vs baseline: +4.7%
✅ re_match_aspect: Time: ✅ 2.919µs (SLO: <10.000µs 📉 -70.8%) vs baseline: +9.5%; Memory: ✅ 37.218MB (SLO: <39.000MB -4.6%) vs baseline: +4.9%
✅ re_match_noaspect: Time: ✅ 1.311µs (SLO: <10.000µs 📉 -86.9%) vs baseline: +1.3%; Memory: ✅ 37.257MB (SLO: <39.000MB -4.5%) vs baseline: +4.9%
✅ re_search_aspect: Time: ✅ 2.682µs (SLO: <10.000µs 📉 -73.2%) vs baseline: +6.2%; Memory: ✅ 37.218MB (SLO: <39.000MB -4.6%) vs baseline: +4.8%
✅ re_search_noaspect: Time: ✅ 1.196µs (SLO: <10.000µs 📉 -88.0%) vs baseline: -0.2%; Memory: ✅ 37.198MB (SLO: <39.000MB -4.6%) vs baseline: +4.8%
✅ re_sub_aspect: Time: ✅ 3.519µs (SLO: <10.000µs 📉 -64.8%) vs baseline: +3.8%; Memory: ✅ 37.159MB (SLO: <39.000MB -4.7%) vs baseline: +4.5%
✅ re_sub_noaspect: Time: ✅ 1.518µs (SLO: <10.000µs 📉 -84.8%) vs baseline: -0.3%; Memory: ✅ 37.218MB (SLO: <39.000MB -4.6%) vs baseline: +4.6%
✅ re_subn_aspect: Time: ✅ 3.745µs (SLO: <10.000µs 📉 -62.5%) vs baseline: +3.8%; Memory: ✅ 37.277MB (SLO: <39.000MB -4.4%) vs baseline: +5.1%
✅ re_subn_noaspect: Time: ✅ 1.613µs (SLO: <10.000µs 📉 -83.9%) vs baseline: +0.8%; Memory: ✅ 37.257MB (SLO: <39.000MB -4.5%) vs baseline: +4.8%

🟡 Near SLO Breach (7 suites)

🟡 djangosimple - 30/30
✅ appsec: Time: ✅ 19.280ms (SLO: <22.300ms 📉 -13.5%) vs baseline: +0.3%; Memory: ✅ 65.972MB (SLO: <67.000MB 🟡 -1.5%) vs baseline: +4.8%
✅ exception-replay-enabled: Time: ✅ 1.341ms (SLO: <1.450ms -7.5%) vs baseline: +0.5%; Memory: ✅ 64.058MB (SLO: <67.000MB -4.4%) vs baseline: +4.6%
✅ iast: Time: ✅ 19.274ms (SLO: <22.250ms 📉 -13.4%) vs baseline: ~same; Memory: ✅ 65.967MB (SLO: <67.000MB 🟡 -1.5%) vs baseline: +4.8%
✅ profiler: Time: ✅ 15.344ms (SLO: <16.550ms -7.3%) vs baseline: ~same; Memory: ✅ 53.918MB (SLO: <54.500MB 🟡 -1.1%) vs baseline: +4.9%
✅ resource-renaming: Time: ✅ 19.248ms (SLO: <21.750ms 📉 -11.5%) vs baseline: -0.4%; Memory: ✅ 66.022MB (SLO: <67.000MB 🟡 -1.5%) vs baseline: +4.8%
✅ span-code-origin: Time: ✅ 22.760ms (SLO: <28.200ms 📉 -19.3%) vs baseline: -0.5%; Memory: ✅ 67.259MB (SLO: <69.500MB -3.2%) vs baseline: +5.1%
✅ tracer: Time: ✅ 19.282ms (SLO: <21.750ms 📉 -11.3%) vs baseline: +0.1%; Memory: ✅ 65.995MB (SLO: <67.000MB 🟡 -1.5%) vs baseline: +4.8%
✅ tracer-and-profiler: Time: ✅ 21.204ms (SLO: <23.500ms -9.8%) vs baseline: -0.1%; Memory: ✅ 67.929MB (SLO: <68.000MB 🟡 -0.1%) vs baseline: +5.3%
✅ tracer-dont-create-db-spans: Time: ✅ 19.273ms (SLO: <21.500ms 📉 -10.4%) vs baseline: +0.2%; Memory: ✅ 65.987MB (SLO: <67.000MB 🟡 -1.5%) vs baseline: +4.7%
✅ tracer-minimal: Time: ✅ 16.641ms (SLO: <17.500ms -4.9%) vs baseline: +0.3%; Memory: ✅ 65.984MB (SLO: <67.000MB 🟡 -1.5%) vs baseline: +4.8%
✅ tracer-native: Time: ✅ 19.240ms (SLO: <21.750ms 📉 -11.5%) vs baseline: -0.2%; Memory: ✅ 67.702MB (SLO: <72.500MB -6.6%) vs baseline: +4.9%
✅ tracer-no-caches: Time: ✅ 17.322ms (SLO: <19.650ms 📉 -11.8%) vs baseline: +0.1%; Memory: ✅ 66.046MB (SLO: <67.000MB 🟡 -1.4%) vs baseline: +5.0%
✅ tracer-no-databases: Time: ✅ 18.701ms (SLO: <20.100ms -7.0%) vs baseline: -0.4%; Memory: ✅ 65.946MB (SLO: <67.000MB 🟡 -1.6%) vs baseline: +4.8%
✅ tracer-no-middleware: Time: ✅ 18.937ms (SLO: <21.500ms 📉 -11.9%) vs baseline: -0.1%; Memory: ✅ 66.096MB (SLO: <67.000MB 🟡 -1.3%) vs baseline: +5.1%
✅ tracer-no-templates: Time: ✅ 19.077ms (SLO: <22.000ms 📉 -13.3%) vs baseline: -0.1%; Memory: ✅ 65.989MB (SLO: <67.000MB 🟡 -1.5%) vs baseline: +4.8%

🟡 errortrackingdjangosimple - 6/6
✅ errortracking-enabled-all: Time: ✅ 16.311ms (SLO: <19.850ms 📉 -17.8%) vs baseline: -0.4%; Memory: ✅ 65.893MB (SLO: <66.500MB 🟡 -0.9%) vs baseline: +5.5%
✅ errortracking-enabled-user: Time: ✅ 16.424ms (SLO: <19.400ms 📉 -15.3%) vs baseline: +1.0%; Memory: ✅ 65.736MB (SLO: <66.500MB 🟡 -1.1%) vs baseline: +5.0%
✅ tracer-enabled: Time: ✅ 16.316ms (SLO: <19.450ms 📉 -16.1%) vs baseline: -0.3%; Memory: ✅ 65.719MB (SLO: <66.500MB 🟡 -1.2%) vs baseline: +4.9%

🟡 errortrackingflasksqli - 6/6
✅ errortracking-enabled-all: Time: ✅ 2.072ms (SLO: <2.300ms -9.9%) vs baseline: +0.1%; Memory: ✅ 52.534MB (SLO: <53.500MB 🟡 -1.8%) vs baseline: +4.9%
✅ errortracking-enabled-user: Time: ✅ 2.073ms (SLO: <2.250ms -7.9%) vs baseline: +0.5%; Memory: ✅ 52.573MB (SLO: <53.500MB 🟡 -1.7%) vs baseline: +5.1%
✅ tracer-enabled: Time: ✅ 2.064ms (SLO: <2.300ms 📉 -10.3%) vs baseline: +0.3%; Memory: ✅ 52.573MB (SLO: <53.500MB 🟡 -1.7%) vs baseline: +5.0%

🟡 flasksimple - 18/18
✅ appsec-get: Time: ✅ 4.582ms (SLO: <4.750ms -3.5%) vs baseline: +0.3%; Memory: ✅ 61.980MB (SLO: <65.000MB -4.6%) vs baseline: +4.6%
✅ appsec-post: Time: ✅ 6.609ms (SLO: <6.750ms -2.1%) vs baseline: ~same; Memory: ✅ 62.259MB (SLO: <65.000MB -4.2%) vs baseline: +5.2%
✅ appsec-telemetry: Time: ✅ 4.582ms (SLO: <4.750ms -3.5%) vs baseline: -0.3%; Memory: ✅ 62.010MB (SLO: <65.000MB -4.6%) vs baseline: +4.8%
✅ debugger: Time: ✅ 1.862ms (SLO: <2.000ms -6.9%) vs baseline: +0.3%; Memory: ✅ 45.100MB (SLO: <47.000MB -4.0%) vs baseline: +4.6%
✅ iast-get: Time: ✅ 1.853ms (SLO: <2.000ms -7.4%) vs baseline: ~same; Memory: ✅ 41.972MB (SLO: <49.000MB 📉 -14.3%) vs baseline: +5.2%
✅ profiler: Time: ✅ 1.913ms (SLO: <2.100ms -8.9%) vs baseline: ~same; Memory: ✅ 46.536MB (SLO: <47.000MB 🟡 -1.0%) vs baseline: +4.6%
✅ resource-renaming: Time: ✅ 3.363ms (SLO: <3.650ms -7.9%) vs baseline: -0.2%; Memory: ✅ 52.396MB (SLO: <53.500MB -2.1%) vs baseline: +4.8%
✅ tracer: Time: ✅ 3.356ms (SLO: <3.650ms -8.0%) vs baseline: -0.2%; Memory: ✅ 52.373MB (SLO: <53.500MB -2.1%) vs baseline: +5.0%
✅ tracer-native: Time: ✅ 3.354ms (SLO: <3.650ms -8.1%) vs baseline: ~same; Memory: ✅ 54.035MB (SLO: <60.000MB -9.9%) vs baseline: +4.8%

🟡 otelspan - 22/22
✅ add-event: Time: ✅ 39.639ms (SLO: <47.150ms 📉 -15.9%) vs baseline: +2.9%; Memory: ✅ 36.419MB (SLO: <47.000MB 📉 -22.5%) vs baseline: +4.8%
✅ add-metrics: Time: ✅ 260.882ms (SLO: <344.800ms 📉 -24.3%) vs baseline: +1.4%; Memory: ✅ 40.757MB (SLO: <47.500MB 📉 -14.2%) vs baseline: +5.0%
✅ add-tags: Time: ✅ 317.777ms (SLO: <321.000ms 🟡 -1.0%) vs baseline: +0.3%; Memory: ✅ 40.679MB (SLO: <47.500MB 📉 -14.4%) vs baseline: +4.9%
✅ get-context: Time: ✅ 78.866ms (SLO: <92.350ms 📉 -14.6%) vs baseline: +0.2%; Memory: ✅ 36.665MB (SLO: <46.500MB 📉 -21.2%) vs baseline: +4.6%
✅ is-recording: Time: ✅ 36.249ms (SLO: <44.500ms 📉 -18.5%) vs baseline: +0.4%; Memory: ✅ 36.190MB (SLO: <47.500MB 📉 -23.8%) vs baseline: +4.9%
✅ record-exception: Time: ✅ 56.823ms (SLO: <67.650ms 📉 -16.0%) vs baseline: -0.2%; Memory: ✅ 36.857MB (SLO: <47.000MB 📉 -21.6%) vs baseline: +4.7%
✅ set-status: Time: ✅ 42.442ms (SLO: <50.400ms 📉 -15.8%) vs baseline: -0.3%; Memory: ✅ 36.265MB (SLO: <47.000MB 📉 -22.8%) vs baseline: +5.5%
✅ start: Time: ✅ 35.425ms (SLO: <43.450ms 📉 -18.5%) vs baseline: +0.1%; Memory: ✅ 36.295MB (SLO: <47.000MB 📉 -22.8%) vs baseline: +5.5%
✅ start-finish: Time: ✅ 81.686ms (SLO: <88.000ms -7.2%) vs baseline: +0.2%; Memory: ✅ 33.994MB (SLO: <46.500MB 📉 -26.9%) vs baseline: +4.7%
✅ start-finish-telemetry: Time: ✅ 83.204ms (SLO: <89.000ms -6.5%) vs baseline: +0.6%; Memory: ✅ 33.994MB (SLO: <46.500MB 📉 -26.9%) vs baseline: +4.7%
✅ update-name: Time: ✅ 37.029ms (SLO: <45.150ms 📉 -18.0%) vs baseline: +0.1%; Memory: ✅ 36.391MB (SLO: <47.000MB 📉 -22.6%) vs baseline: +4.8%

🟡 recursivecomputation - 8/8
✅ deep: Time: ✅ 309.245ms (SLO: <320.950ms -3.6%) vs baseline: +0.3%; Memory: ✅ 32.676MB (SLO: <34.500MB -5.3%) vs baseline: +4.9%
✅ deep-profiled: Time: ✅ 328.645ms (SLO: <359.150ms -8.5%) vs baseline: -0.4%; Memory: ✅ 38.251MB (SLO: <39.000MB 🟡 -1.9%) vs baseline: +6.5%
✅ medium: Time: ✅ 6.987ms (SLO: <7.400ms -5.6%) vs baseline: ~same; Memory: ✅ 31.497MB (SLO: <34.000MB -7.4%) vs baseline: +4.7%
✅ shallow: Time: ✅ 0.941ms (SLO: <1.050ms 📉 -10.3%) vs baseline: ~same; Memory: ✅ 31.536MB (SLO: <34.000MB -7.2%) vs baseline: +4.9%

🟡 telemetryaddmetric - 30/30
✅ 1-count-metric-1-times: Time: ✅ 3.054µs (SLO: <20.000µs 📉 -84.7%) vs baseline: +1.4%; Memory: ✅ 31.497MB (SLO: <34.000MB -7.4%) vs baseline: +4.8%
✅ 1-count-metrics-100-times: Time: ✅ 206.504µs (SLO: <220.000µs -6.1%) vs baseline: +1.0%; Memory: ✅ 31.575MB (SLO: <34.000MB -7.1%) vs baseline: +5.1%
✅ 1-distribution-metric-1-times: Time: ✅ 3.528µs (SLO: <20.000µs 📉 -82.4%) vs baseline: +4.0%; Memory: ✅ 31.536MB (SLO: <34.000MB -7.2%) vs baseline: +4.9%
✅ 1-distribution-metrics-100-times: Time: ✅ 217.935µs (SLO: <220.000µs 🟡 -0.9%) vs baseline: -0.6%; Memory: ✅ 31.634MB (SLO: <34.000MB -7.0%) vs baseline: +5.1%
✅ 1-gauge-metric-1-times: Time: ✅ 2.237µs (SLO: <20.000µs 📉 -88.8%) vs baseline: -1.1%; Memory: ✅ 31.516MB (SLO: <34.000MB -7.3%) vs baseline: +4.6%
✅ 1-gauge-metrics-100-times: Time: ✅ 138.967µs (SLO: <150.000µs -7.4%) vs baseline: ~same; Memory: ✅ 31.556MB (SLO: <34.000MB -7.2%) vs baseline: +5.0%
✅ 1-rate-metric-1-times: Time: ✅ 3.249µs (SLO: <20.000µs 📉 -83.8%) vs baseline: +3.6%; Memory: ✅ 31.575MB (SLO: <34.000MB -7.1%) vs baseline: +5.1%
✅ 1-rate-metrics-100-times: Time: ✅ 218.377µs (SLO: <250.000µs 📉 -12.6%) vs baseline: +0.6%; Memory: ✅ 31.497MB (SLO: <34.000MB -7.4%) vs baseline: +4.7%
✅ 100-count-metrics-100-times: Time: ✅ 20.427ms (SLO: <22.000ms -7.1%) vs baseline: -0.8%; Memory: ✅ 31.536MB (SLO: <34.000MB -7.2%) vs baseline: +4.7%
✅ 100-distribution-metrics-100-times: Time: ✅ 2.278ms (SLO: <2.300ms 🟡 -1.0%) vs baseline: +0.9%; Memory: ✅ 31.772MB (SLO: <34.000MB -6.6%) vs baseline: +4.5%
✅ 100-gauge-metrics-100-times: Time: ✅ 1.432ms (SLO: <1.550ms -7.6%) vs baseline: +1.5%; Memory: ✅ 31.516MB (SLO: <34.000MB -7.3%) vs baseline: +4.6%
✅ 100-rate-metrics-100-times: Time: ✅ 2.235ms (SLO: <2.550ms 📉 -12.3%) vs baseline: -0.7%; Memory: ✅ 31.477MB (SLO: <34.000MB -7.4%) vs baseline: +4.7%
✅ flush-1-metric: Time: ✅ 4.597µs (SLO: <20.000µs 📉 -77.0%) vs baseline: -0.4%; Memory: ✅ 31.988MB (SLO: <34.000MB -5.9%) vs baseline: +5.2%
✅ flush-100-metrics: Time: ✅ 175.538µs (SLO: <250.000µs 📉 -29.8%) vs baseline: -0.2%; Memory: ✅ 31.929MB (SLO: <34.000MB -6.1%) vs baseline: +4.9%
✅ flush-1000-metrics: Time: ✅ 2.130ms (SLO: <2.500ms 📉 -14.8%) vs baseline: -0.4%; Memory: ✅ 32.676MB (SLO: <34.500MB -5.3%) vs baseline: +4.8%
Description
Temporarily skip IAST multiprocessing tests that are failing in CI due to fork + multithreading deadlocks. Despite extensive investigation and
multiple attempted fixes, these tests remain unstable in the CI environment while working perfectly locally.
Problem Statement
Since merging commit e9582f2 (profiling test fix), several IAST multiprocessing tests began failing
exclusively in CI environments, while continuing to pass reliably in local development.
Affected Tests
test_subprocess_has_tracer_running_and_iast_env
test_multiprocessing_with_iast_no_segfault
test_multiple_fork_operations
test_eval_in_forked_process
test_uvicorn_style_worker_with_eval
test_sequential_workers_stress_test
test_direct_fork_with_eval_no_crash
Symptoms
In CI:
exitcode=None
AssertionError: child process did not exit in time
maximum recursion depth exceeded while calling a Python object
Locally:
Timeline:
Root Cause Analysis
The issue is a fork + multithreading deadlock. When pytest loads ddtrace, several background services start threads:
When tests call fork() or create multiprocessing.Process() while these threads are running, child processes inherit locks in unknown states. If any background thread held a lock during fork, that lock remains permanently locked in the child, causing deadlocks.
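The mechanism described above can be reproduced in isolation with a minimal sketch. It uses only the standard library and does not touch ddtrace; the function name `child_inherits_held_lock` and the worker thread are hypothetical stand-ins for a background service. It assumes a POSIX platform where `os.fork()` is available.

```python
import os
import threading
import time
import warnings


def child_inherits_held_lock(hold_seconds: float = 0.5) -> bool:
    """Return True if a forked child sees a lock still held by a parent thread."""
    lock = threading.Lock()
    lock_taken = threading.Event()

    def background_worker() -> None:
        # Hypothetical stand-in for a background service thread holding a lock.
        with lock:
            lock_taken.set()
            time.sleep(hold_seconds)

    worker = threading.Thread(target=background_worker, daemon=True)
    worker.start()
    lock_taken.wait()  # fork only once the lock is definitely held

    read_fd, write_fd = os.pipe()
    with warnings.catch_warnings():
        # Python 3.12+ warns about fork() in a multi-threaded process.
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()

    if pid == 0:
        # Child: the worker thread was not copied, so nothing here will ever
        # release the lock. A blocking acquire() would hang forever, so this
        # sketch probes with a non-blocking acquire instead.
        os.close(read_fd)
        acquired = lock.acquire(blocking=False)
        os.write(write_fd, b"0" if acquired else b"1")
        os._exit(0)

    # Parent: collect the child's answer and clean up.
    os.close(write_fd)
    answer = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)
    worker.join()
    return answer == b"1"


if __name__ == "__main__":
    print("child saw the lock as held:", child_inherits_held_lock())
```

In the child, the lock is a copy of the parent's locked state, but the thread that would release it does not exist. A test that blocks on such a lock never exits, which matches the hang symptom reported for the CI runs.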
Why it fails in CI but not locally:
Attempted Fixes
Experiment 1: Environment Variables
Result: ❌ Tests still hang in CI
Experiment 2: Fixture to Disable Services
Result: ❌ Tests still hang in CI
Experiment 3: Combined Approach (Env Vars + Fixtures)
Applied both environment variables in riotfile.py and fixtures in conftest.py:
Result: ❌ Tests still hang in CI
Experiment 4: Using --no-ddtrace Flag
Result: ❌ Tests still hang, telemetry recursion errors persist
CI Error Logs
https://gitlab.ddbuild.io/DataDog/apm-reliability/dd-trace-py/-/jobs/1235039604
Performance Impact
Tests that do complete in CI are dramatically slower:
Decision: Skip Tests Temporarily
After extensive investigation and multiple attempted fixes, we cannot reliably resolve this CI-specific issue. The tests work perfectly
locally and in the 3.19 branch, indicating this is an environment-specific interaction introduced during the 4.0 merge.
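The skip mechanism used in this PR is not shown in the description. As a hedged illustration only, a CI-conditional skip could look like the sketch below. It uses the standard library's `unittest.skipIf` rather than the project's pytest setup, and the `CI` environment variable, the `ForkingTests` class, and the placeholder test body are all assumptions made for the example.

```python
import os
import unittest

# Assumption: the CI system exports CI=true (or CI=1); adjust to the real pipeline.
RUNNING_IN_CI = os.environ.get("CI", "").lower() in ("1", "true")


class ForkingTests(unittest.TestCase):
    @unittest.skipIf(RUNNING_IN_CI, "fork + threads deadlock in CI; skipped temporarily")
    def test_multiple_fork_operations(self):
        # Placeholder body; the real test lives in the IAST test suite.
        self.assertTrue(True)


def run_suite() -> unittest.TestResult:
    """Run the example test case and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ForkingTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    result = run_suite()
    print("skipped:", len(result.skipped))
```

Keeping the reason string on the skip makes it easy to find and re-enable these tests once the CI-specific interaction is understood.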
Next Steps:
Related Issues