Skip to content

[AI-4964] Add support for timeout in process isolation#22975

Open
sarah-witt wants to merge 7 commits intomasterfrom
sarah/process-isolation-timeout
Open

[AI-4964] Add support for timeout in process isolation#22975
sarah-witt wants to merge 7 commits intomasterfrom
sarah/process-isolation-timeout

Conversation

@sarah-witt
Copy link
Contributor

@sarah-witt sarah-witt commented Mar 18, 2026

What does this PR do?

Adds a parameter process_isolation_timeout to be used with process_isolation that sets a timeout for a check running in a new process. If the check exceeds the timeout, it is stopped and will be rescheduled.

Motivation

We have a case where a third party library causes a check to hang sporadically. This is due to intermittent network errors, and a rerun of the check is sufficient to fix the issue. Currently, the only way to stop the check is to restart the agent. If we run the check using process isolation and have a timeout for the process, the check will timeout after hanging, and get rescheduled.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

@datadog-official
Copy link
Contributor

datadog-official bot commented Mar 18, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: d1675a4 | Docs | Datadog PR Page | Was this helpful? React with 👍/👎 or give us feedback!

@codecov
Copy link

codecov bot commented Mar 18, 2026

Codecov Report

❌ Patch coverage is 87.50000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.34%. Comparing base (79e0762) to head (d1675a4).
⚠️ Report is 13 commits behind head on master.

Additional details and impacted files
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@sarah-witt sarah-witt changed the title Add support for timeout in process isolation [AI-4964] Add support for timeout in process isolation Mar 19, 2026
@sarah-witt sarah-witt marked this pull request as ready for review March 19, 2026 13:56
@sarah-witt sarah-witt requested a review from a team as a code owner March 19, 2026 13:56
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 568125556b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

def _kill_on_timeout():
nonlocal timed_out
timed_out = True
process.kill()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Terminate the entire isolated process tree on timeout

process.kill() only stops the wrapper Python process here. Checks can legitimately spawn child commands (for example process/datadog_checks/process/process.py:244-246 and tibco_ems/datadog_checks/tibco_ems/tibco_ems.py:81-88), and with process_isolation_timeout enabled those descendants will keep running after the wrapper is killed because this subprocess was not started in its own session/process group. In those integrations, repeated timeouts would leak orphaned commands and the check is not actually "stopped" as advertised.

Useful? React with 👍 / 👎.

Copy link
Contributor Author

@sarah-witt sarah-witt Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point however the process_isolation feature is still experimental and the timeout in particular only will be used on vSphere, which doesn't spawn processes. I added a TODO, but reviewers let me know if I should address!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants