Skip to content

Fix a race condition when querying op metadata#6309

Merged
rostan-t merged 3 commits into
NVIDIA:mainfrom
rostan-t:ndd-race-condition-fix
Apr 24, 2026
Merged

Fix a race condition when querying op metadata#6309
rostan-t merged 3 commits into
NVIDIA:mainfrom
rostan-t:ndd-race-condition-fix

Conversation

@rostan-t

Copy link
Copy Markdown
Collaborator

Category:

Bug fix (non-breaking change which fixes an issue)

Description:

When getting the ndim, dtype or layout of an operator in dynamic mode, the invocation first checks that it has no result and if that's the case uses the operator to infer metadata statically. This can cause a race condition if another thread is executing the operator because the _operator attribute of the invocation is set to None when running is complete.

Additional information:

Affected modules and functionalities:

Dynamic mode.

Key points relevant for the review:

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: DALI-4668

Signed-off-by: Rostan Tabet <rtabet@nvidia.com>
Signed-off-by: Rostan Tabet <rtabet@nvidia.com>
@greptile-apps

greptile-apps Bot commented Apr 22, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes a race condition in Invocation._try_get_metadata (new helper): the old ndim/dtype/layout methods read self._operator after the self._results is None guard, giving another thread a window to null out _operator (in _run_impl) between those two steps. The fix captures self._operator into a local variable first, then re-checks _results — ensuring a valid reference is always held if metadata inference proceeds. A new multithreaded stress test covers the scenario directly.

Confidence Score: 5/5

Safe to merge — the fix is minimal, correct, and directly tested.

The race condition is precisely diagnosed and the fix is tight: capture-then-recheck is the standard pattern for this class of race. CPython's GIL guarantees that any thread observing self._operator = None has already observed self._results = tuple(r) (assignment ordering within the same thread is preserved), so the inner _results is None re-check is a valid sentinel. The added truthiness guard on output_desc is a minor defensive improvement. No P0/P1 issues remain.

No files require special attention.

Important Files Changed

Filename Overview
dali/python/nvidia/dali/experimental/dynamic/_invocation.py Extracts metadata-lookup logic into _try_get_metadata that atomically captures self._operator before re-checking self._results, correctly eliminating the race window where another thread could null out _operator between the outer _results guard and the attribute access.
dali/test/python/experimental_mode/test_multithreading.py Adds test_parallel_op_metadata_query which hammers .layout, .dtype, and .ndim across all available threads in a tight loop, providing direct regression coverage for the fixed race condition.

Sequence Diagram

sequenceDiagram
    participant A as Thread A
    participant B as Thread B
    participant I as Invocation

    Note over A,I: Race avoided - B completes first
    A->>I: operator = self._operator
    B->>I: self._results = tuple(r)
    B->>I: self._operator = None
    A->>I: if self._results is None?
    I-->>A: False - return None safely

    Note over A,I: Race avoided - A queries first
    A->>I: operator = self._operator
    A->>I: if self._results is None?
    I-->>A: True
    A->>I: assert operator is not None
    A->>I: init_spec and return metadata
Loading

Reviews (2): Last reviewed commit: "Apply review suggestions" | Re-trigger Greptile

Comment thread dali/python/nvidia/dali/experimental/dynamic/_invocation.py
Comment thread dali/python/nvidia/dali/experimental/dynamic/_invocation.py
@mzient mzient self-assigned this Apr 22, 2026
@rostan-t

Copy link
Copy Markdown
Collaborator Author

!build

@dali-automaton

Copy link
Copy Markdown
Collaborator

CI MESSAGE: [49276376]: BUILD STARTED

Comment thread dali/python/nvidia/dali/experimental/dynamic/_invocation.py Outdated
assert operator is not None
if init_spec := getattr(operator, "_init_spec", None):
init_spec(self._inputs, self._args)
if output_desc := operator._op_spec.OutputDesc(result_index):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Nit] OpSpec::OutputDesc always returns a fixed 5-tuple (truthy), so the walrus check is equivalent to a non-None check here. Still, if output_desc := ... is not None: (or just unconditional indexing) would express the intent more clearly and stay correct if the API ever returns an empty container.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If OutputDesc ever returns an empty tuple, we want this to be falsy because it would otherwise result on an IndexError on the next line

Comment thread dali/python/nvidia/dali/experimental/dynamic/_invocation.py
@dali-automaton

Copy link
Copy Markdown
Collaborator

CI MESSAGE: [49276376]: BUILD FAILED

Signed-off-by: Rostan Tabet <rtabet@nvidia.com>
@rostan-t

Copy link
Copy Markdown
Collaborator Author

!build

@dali-automaton

Copy link
Copy Markdown
Collaborator

CI MESSAGE: [49295377]: BUILD STARTED

@dali-automaton

Copy link
Copy Markdown
Collaborator

CI MESSAGE: [49295377]: BUILD PASSED

@rostan-t rostan-t merged commit e0ae465 into NVIDIA:main Apr 24, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants