Fix a race condition when querying op metadata by rostan-t · Pull Request #6309 · NVIDIA/DALI

rostan-t · 2026-04-22T17:46:20Z

Category:

Bug fix (non-breaking change which fixes an issue)

Description:

When getting the ndim, dtype or layout of an operator in dynamic mode, the invocation first checks that it has no result and if that's the case uses the operator to infer metadata statically. This can cause a race condition if another thread is executing the operator because the _operator attribute of the invocation is set to None when running is complete.

Additional information:

Affected modules and functionalities:

Dynamic mode.

Key points relevant for the review:

Tests:

Checklist

Documentation

DALI team only

Requirements

Implements new requirements
Affects existing requirements
N/A

REQ IDs: N/A

JIRA TASK: DALI-4668

Signed-off-by: Rostan Tabet <rtabet@nvidia.com>

greptile-apps · 2026-04-22T17:50:26Z

Greptile Summary

This PR fixes a race condition in Invocation._try_get_metadata (new helper): the old ndim/dtype/layout methods read self._operator after the self._results is None guard, giving another thread a window to null out _operator (in _run_impl) between those two steps. The fix captures self._operator into a local variable first, then re-checks _results — ensuring a valid reference is always held if metadata inference proceeds. A new multithreaded stress test covers the scenario directly.

Confidence Score: 5/5

Safe to merge — the fix is minimal, correct, and directly tested.

The race condition is precisely diagnosed and the fix is tight: capture-then-recheck is the standard pattern for this class of race. CPython's GIL guarantees that any thread observing self._operator = None has already observed self._results = tuple(r) (assignment ordering within the same thread is preserved), so the inner _results is None re-check is a valid sentinel. The added truthiness guard on output_desc is a minor defensive improvement. No P0/P1 issues remain.

No files require special attention.

Important Files Changed

Filename	Overview
dali/python/nvidia/dali/experimental/dynamic/_invocation.py	Extracts metadata-lookup logic into `_try_get_metadata` that atomically captures `self._operator` before re-checking `self._results`, correctly eliminating the race window where another thread could null out `_operator` between the outer `_results` guard and the attribute access.
dali/test/python/experimental_mode/test_multithreading.py	Adds `test_parallel_op_metadata_query` which hammers `.layout`, `.dtype`, and `.ndim` across all available threads in a tight loop, providing direct regression coverage for the fixed race condition.

Sequence Diagram

sequenceDiagram
    participant A as Thread A
    participant B as Thread B
    participant I as Invocation

    Note over A,I: Race avoided - B completes first
    A->>I: operator = self._operator
    B->>I: self._results = tuple(r)
    B->>I: self._operator = None
    A->>I: if self._results is None?
    I-->>A: False - return None safely

    Note over A,I: Race avoided - A queries first
    A->>I: operator = self._operator
    A->>I: if self._results is None?
    I-->>A: True
    A->>I: assert operator is not None
    A->>I: init_spec and return metadata

_{Reviews (2): Last reviewed commit: "Apply review suggestions" | Re-trigger Greptile}

rostan-t · 2026-04-23T09:09:21Z

!build

dali-automaton · 2026-04-23T09:16:38Z

CI MESSAGE: [49276376]: BUILD STARTED

jantonguirao · 2026-04-23T11:43:49Z

+            assert operator is not None
+            if init_spec := getattr(operator, "_init_spec", None):
+                init_spec(self._inputs, self._args)
+                if output_desc := operator._op_spec.OutputDesc(result_index):


[Nit] OpSpec::OutputDesc always returns a fixed 5-tuple (truthy), so the walrus check is equivalent to a non-None check here. Still, if output_desc := ... is not None: (or just unconditional indexing) would express the intent more clearly and stay correct if the API ever returns an empty container.

If OutputDesc ever returns an empty tuple, we want this to be falsy because it would otherwise result on an IndexError on the next line

dali-automaton · 2026-04-23T13:15:24Z

CI MESSAGE: [49276376]: BUILD FAILED

Signed-off-by: Rostan Tabet <rtabet@nvidia.com>

rostan-t · 2026-04-23T13:33:40Z

!build

dali-automaton · 2026-04-23T13:35:59Z

CI MESSAGE: [49295377]: BUILD STARTED

dali-automaton · 2026-04-23T20:56:32Z

CI MESSAGE: [49295377]: BUILD PASSED

rostan-t added 2 commits April 22, 2026 17:41

Add test for multithreaded op metadata query

9a52579

Signed-off-by: Rostan Tabet <rtabet@nvidia.com>

Fix a race condition in metadata access

abfa1ed

Signed-off-by: Rostan Tabet <rtabet@nvidia.com>

rostan-t added the Dynamic Mode label Apr 22, 2026

greptile-apps Bot reviewed Apr 22, 2026

View reviewed changes

Comment thread dali/python/nvidia/dali/experimental/dynamic/_invocation.py

Comment thread dali/python/nvidia/dali/experimental/dynamic/_invocation.py

mzient approved these changes Apr 22, 2026

View reviewed changes

mzient self-assigned this Apr 22, 2026

dali-automaton assigned jantonguirao Apr 23, 2026

jantonguirao approved these changes Apr 23, 2026

View reviewed changes

Apply review suggestions

8bb2ba4

Signed-off-by: Rostan Tabet <rtabet@nvidia.com>

rostan-t merged commit e0ae465 into NVIDIA:main Apr 24, 2026
7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix a race condition when querying op metadata#6309

Fix a race condition when querying op metadata#6309
rostan-t merged 3 commits into
NVIDIA:mainfrom
rostan-t:ndd-race-condition-fix

rostan-t commented Apr 22, 2026

Uh oh!

greptile-apps Bot commented Apr 22, 2026 •

edited

Loading

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

rostan-t commented Apr 23, 2026

Uh oh!

dali-automaton commented Apr 23, 2026

Uh oh!

Uh oh!

jantonguirao Apr 23, 2026

Uh oh!

rostan-t Apr 23, 2026

Uh oh!

Uh oh!

dali-automaton commented Apr 23, 2026

Uh oh!

rostan-t commented Apr 23, 2026

Uh oh!

dali-automaton commented Apr 23, 2026

Uh oh!

dali-automaton commented Apr 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

rostan-t commented Apr 22, 2026

Category:

Description:

Additional information:

Affected modules and functionalities:

Key points relevant for the review:

Tests:

Checklist

Documentation

DALI team only

Requirements

Uh oh!

greptile-apps Bot commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

rostan-t commented Apr 23, 2026

Uh oh!

dali-automaton commented Apr 23, 2026

Uh oh!

Uh oh!

jantonguirao Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

rostan-t Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

dali-automaton commented Apr 23, 2026

Uh oh!

rostan-t commented Apr 23, 2026

Uh oh!

dali-automaton commented Apr 23, 2026

Uh oh!

dali-automaton commented Apr 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

greptile-apps Bot commented Apr 22, 2026 •

edited

Loading