
@mzient mzient commented Nov 27, 2025

Category:

Refactoring (Redesign of existing code that doesn't affect functionality)

Description:

Random operators now use the Philox generator, with the state counter calculated from the sample index and the offset within the sample.
This required a major overhaul of all random operators.
The random operators now carry just one generator, which is cloned and adjusted as needed. No state arrays need to be maintained.
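
For readers unfamiliar with counter-based generators, here is a minimal sketch of the idea described above: each output is a pure function of a key, a sample index, and an offset within the sample, so no per-sample state has to be stored. The names `Mix64` and `CounterRandom` are hypothetical, and the mixing step is the SplitMix64 finalizer used as a stand-in. It is not the Philox round function and not DALI's code.

```cpp
#include <cstdint>

// Stand-in mixing function (the SplitMix64 finalizer). This is NOT Philox;
// it only plays the role of "scramble a counter into random-looking bits".
inline uint64_t Mix64(uint64_t x) {
  x ^= x >> 30;
  x *= 0xbf58476d1ce4e5b9ULL;
  x ^= x >> 27;
  x *= 0x94d049bb133111ebULL;
  x ^= x >> 31;
  return x;
}

// Hypothetical counter-based draw: the result depends only on
// (key, sample_idx, offset), never on previously generated values.
inline uint64_t CounterRandom(uint64_t key, uint64_t sample_idx, uint64_t offset) {
  uint64_t h = Mix64(key);
  h = Mix64(h + sample_idx);  // select the stream for this sample
  return Mix64(h + offset);   // select the position within the sample
}
```

Because the value for a given (sample, offset) pair does not depend on what was generated before it, elements can be produced in any order or in parallel and still come out the same.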

Additional information:

Affected modules and functionalities:

All operators that involve randomness except readers.

Key points relevant for the review:

Please point out any other place where randomness is used but that the rework hasn't reached.
Pay attention to the base classes used in random ops.

Tests:

The existing tests for random operators mostly still apply.
Added a new test for passing random state to fn.random.uniform: it checks that the operator yields the same results when the same state is passed and different results when a different state is passed.

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A


greptile-apps bot commented Nov 27, 2025

Greptile Overview

Greptile Summary

This PR successfully refactors all random operators to use a unified Philox4x32_10 generator with deterministic state management based on sample index and element offsets. The architecture is significantly simplified by replacing per-sample state arrays with a single master_rng_ that is cloned and advanced using skipahead operations.

Key Changes:

  • Replaced BatchRNG<std::mt19937_64> and curand_states with single Philox4x32_10 master_rng_
  • Introduced kSkipaheadPerElement (257) and kSkipaheadPerSample (65537) constants for deterministic state advancement
  • Removed state array management - operators now clone and adjust the master RNG as needed
  • Deleted obsolete files: rng_checkpointing_utils.h, batch_rng.h, randomizer.cuh, randomizer.cu, randomizer_test.cu
  • Fixed string parsing in philox.cc to use custom hex parsing instead of sscanf
  • Updated all random operators (uniform, normal, coin_flip, choice, noise generators, etc.) to use new base classes
  • Added test coverage for _random_state argument to verify deterministic behavior
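
A rough sketch of how the two skip-ahead constants listed above might be applied, following the clone-and-adjust description in this PR. The constant values are the ones quoted in the summary; `RngPosition`, `SamplePosition`, and `ElementPosition` are hypothetical names, and the real generator tracks its position differently.

```cpp
#include <cstdint>

constexpr uint64_t kSkipaheadPerElement = 257;   // value quoted in the summary
constexpr uint64_t kSkipaheadPerSample = 65537;  // value quoted in the summary

// Hypothetical stand-in for a generator's position: which sequence it is on
// and how far along that sequence it has advanced.
struct RngPosition {
  uint64_t sequence = 0;
  uint64_t offset = 0;
};

// Takes the master position by value (a clone) and moves the copy to the
// sequence reserved for one sample; the caller's master is left untouched.
inline RngPosition SamplePosition(RngPosition master, uint64_t sample_idx) {
  master.sequence += sample_idx * kSkipaheadPerSample;
  return master;
}

// Moves a per-sample position to the slot reserved for one element.
inline RngPosition ElementPosition(RngPosition sample, uint64_t element_idx) {
  sample.offset += element_idx * kSkipaheadPerElement;
  return sample;
}
```

For example, element 3 of sample 2 would land at sequence 2 * 65537 = 131074 and offset 3 * 257 = 771 relative to the master position.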

Previous Issues Addressed:
All previously reported issues have been fixed by the author in this commit.

Confidence Score: 5/5

  • This PR is safe to merge - it's a well-executed refactoring with comprehensive test coverage and all previously identified issues have been resolved.
  • Score of 5 reflects thorough refactoring with proper design patterns (CRTP), consistent architecture across CPU/GPU backends, deterministic RNG behavior, comprehensive test coverage including new _random_state tests, and successful resolution of all previously reported issues. The code quality is high with proper error handling, clear documentation, and simplified architecture that removes 824 lines while adding 610 lines of cleaner code.
  • No files require special attention - all implementations are consistent and well-structured.

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| dali/operators/random/rng_base.h | 5/5 | Core RNG base class refactored to use single Philox generator per operator, eliminating state arrays. Clean CRTP design with proper checkpointing support. |
| dali/operators/random/philox.h | 5/5 | Philox4x32_10 implementation with proper state management, skipahead functions, and serialization. Fixed string parsing to avoid sscanf issues. |
| dali/operators/random/philox.cc | 5/5 | Implementation of Philox algorithm and state serialization/deserialization using custom hex parsing instead of sscanf. |
| dali/operators/random/rng_base_cpu.h | 5/5 | CPU RNG implementation using Philox generator with proper element-level skipahead for parallel generation. All previous skipahead issues fixed. |
| dali/operators/random/rng_base_gpu.cuh | 5/5 | GPU kernel implementation using Philox with consistent skipahead strategy across all variants. Previous multiplication inconsistency fixed. |
| dali/operators/random/rng_base_gpu.h | 5/5 | GPU backend structures simplified by removing curand_states randomizer, now relying on single Philox state passed to kernels. |
| dali/operators/random/uniform_distribution.h | 5/5 | Uniform distribution operator updated to use new RNG base class. Continuous and discrete implementations remain functionally equivalent. |
| dali/test/python/operator_2/test_uniform.py | 5/5 | Added test for _random_state argument to verify operators produce consistent results with same state and different results with different states. |
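
The "custom hex parsing instead of sscanf" mentioned for philox.cc could look roughly like the sketch below. `ParseHex64` is a hypothetical helper written for illustration (C++17); the actual function in the PR is not shown in this conversation and may differ in name, signature, and error handling.

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical helper: parses 1 to 16 hex digits into a 64-bit value.
// Returns std::nullopt for empty input, too many digits, or a non-hex character.
inline std::optional<uint64_t> ParseHex64(std::string_view s) {
  if (s.empty() || s.size() > 16)
    return std::nullopt;
  uint64_t value = 0;
  for (char c : s) {
    unsigned digit;
    if (c >= '0' && c <= '9')
      digit = c - '0';
    else if (c >= 'a' && c <= 'f')
      digit = c - 'a' + 10;
    else if (c >= 'A' && c <= 'F')
      digit = c - 'A' + 10;
    else
      return std::nullopt;  // reject anything that is not a hex digit
    value = (value << 4) | digit;
  }
  return value;
}
```

A hand-written loop like this makes the accepted input explicit (no leading whitespace, sign, or `0x` prefix) and reports malformed input directly to the caller.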

Sequence Diagram

sequenceDiagram
    participant User
    participant Operator
    participant OperatorWithRng
    participant RNGBase
    participant Philox as Philox4x32_10
    participant Backend as CPU/GPU Backend

    User->>Operator: Create operator with seed
    Operator->>OperatorWithRng: Initialize (spec)
    OperatorWithRng->>Philox: init(seed, 0, 0)
    Note over Philox: master_rng_ initialized

    User->>Operator: Run(workspace)
    Operator->>OperatorWithRng: Run(workspace)
    
    opt Has _random_state argument
        OperatorWithRng->>OperatorWithRng: LoadRandomState(ws)
        OperatorWithRng->>Philox: set_state(loaded_state)
    end
    
    OperatorWithRng->>RNGBase: SetupImpl(ws)
    Note over RNGBase: Acquire arguments<br/>Setup distributions
    
    OperatorWithRng->>RNGBase: RunImpl(ws)
    
    loop For each sample in batch
        RNGBase->>OperatorWithRng: GetSampleRNG(sample_idx)
        OperatorWithRng->>Philox: Clone and skipahead_sequence(sample_idx * kSkipaheadPerSample)
        Philox-->>RNGBase: Sample-specific RNG
        
        RNGBase->>Backend: RunImplTyped<T, Dist>(ws)
        
        loop For each element in sample
            Backend->>Philox: skipahead(element_idx * kSkipaheadPerElement)
            Backend->>Philox: next() or Generate via distribution
            Philox-->>Backend: Random value
            Backend->>Backend: ConvertSat and store to output
        end
    end
    
    RNGBase-->>OperatorWithRng: Processing complete
    OperatorWithRng->>Philox: skipahead_sequence(batch_size)
    Note over Philox: Advance master_rng_<br/>for next batch
    
    OperatorWithRng-->>User: Output produced

@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. dali/operators/random/rng_base_cpu.h, line 195 (link)

    style: Dead code - seed array is declared but never used after removing the chunk seeding logic

35 files reviewed, 7 comments


  explicit RNGBase(const OpSpec &spec)
-     : OperatorWithRng<Backend>(spec) {}
+     : Base(spec)
+     , backend_data_(NumDists()) {}

logic: Calling NumDists() in member initializer list before object is fully constructed - max_batch_size_ may not be initialized yet. Is max_batch_size_ guaranteed to be initialized in the Base constructor before this derived constructor runs?

Contributor Author

Yes, max_batch_size_ is populated in the base class constructor.
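
The C++ rule behind this answer is that base class subobjects are initialized before the derived class's own data members, so a member initializer may safely read state set by the base constructor. A minimal sketch with hypothetical classes `BaseOp` and `DerivedOp` (not the DALI classes, and `NumDists()` returning the batch size is an assumption made only for this example):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical base class that records the batch size in its constructor.
class BaseOp {
 public:
  explicit BaseOp(int max_batch_size) : max_batch_size_(max_batch_size) {}

 protected:
  int max_batch_size_;
};

// Hypothetical derived class whose member initializer calls a function
// that reads only base-class state.
class DerivedOp : public BaseOp {
 public:
  explicit DerivedOp(int max_batch_size)
      : BaseOp(max_batch_size)  // the base subobject is initialized first
      , data_(NumDists()) {}    // so max_batch_size_ is already set here

  int NumDists() const { return max_batch_size_; }
  std::size_t DataSize() const { return data_.size(); }

 private:
  std::vector<int> data_;
};
```

The pattern stays safe only while the function called in the initializer list reads base-class state or members declared before the one being initialized; reading a derived member declared later would observe an uninitialized value.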

@greptile-apps greptile-apps bot left a comment

35 files reviewed, 1 comment


template<typename Backend, bool RngPerSample = true>
class OperatorWithRng : public Operator<Backend>{
void _DetectOperatrBackend(int /* ... */);

syntax: typo: _DetectOperatrBackend should be _DetectOperatorBackend

@greptile-apps greptile-apps bot left a comment

35 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

36 files reviewed, no comments

@dali-automaton
Collaborator

CI MESSAGE: [39279042]: BUILD STARTED

@greptile-apps greptile-apps bot left a comment

37 files reviewed, no comments

@dali-automaton
Collaborator

CI MESSAGE: [39279042]: BUILD FAILED

@dali-automaton
Collaborator

CI MESSAGE: [39297892]: BUILD STARTED

@greptile-apps greptile-apps bot left a comment

37 files reviewed, no comments

@dali-automaton
Collaborator

CI MESSAGE: [39299073]: BUILD STARTED

@greptile-apps greptile-apps bot left a comment

37 files reviewed, no comments

@dali-automaton
Collaborator

CI MESSAGE: [39297892]: BUILD FAILED

@dali-automaton
Collaborator

CI MESSAGE: [39301861]: BUILD STARTED

@greptile-apps greptile-apps bot left a comment

37 files reviewed, 4 comments

@dali-automaton
Collaborator

CI MESSAGE: [39299073]: BUILD FAILED

@jantonguirao jantonguirao self-assigned this Nov 28, 2025

@greptile-apps greptile-apps bot left a comment

Additional Comments (2)

  1. dali/operators/random/choice.h, line 305-306 (link)

    logic: ElementCopy doesn't apply skipahead per element - loop reuses same rng state without advancing it. Each iteration calls dist.Generate(rng) with same state, generating correlated random numbers.

  2. dali/operators/random/choice.h, line 312-313 (link)

    logic: Same issue - missing skipahead per element iteration.

37 files reviewed, 2 comments

@dali-automaton
Collaborator

CI MESSAGE: [39304715]: BUILD STARTED

@greptile-apps greptile-apps bot left a comment

37 files reviewed, 1 comment

@mzient
Contributor Author

mzient commented Nov 28, 2025

Additional Comments (2)

1. `dali/operators/random/choice.h`, line 305-306 ([link](/nvidia/dali/blob/49f3d8f577416bdf9ce90bc04a2eac4f4f1fcdaa/dali/operators/random/choice.h#L305-L306))
   **logic:** `ElementCopy` doesn't apply skipahead per element - loop reuses same `rng` state without advancing it. Each iteration calls `dist.Generate(rng)` with same state, generating correlated random numbers.

2. `dali/operators/random/choice.h`, line 312-313 ([link](/nvidia/dali/blob/49f3d8f577416bdf9ce90bc04a2eac4f4f1fcdaa/dali/operators/random/choice.h#L312-L313))
   **logic:** Same issue - missing skipahead per element iteration.

37 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

@greptileai This isn't true. dist.Generate(rng) will advance rng internally, so there's no correlation. The rng will be advanced by 1, not by kSkipaheadPerElement, but that's not a problem - the constant is there only to avoid correlation between adjacent elements to account for situations where handling a single element requires multiple random numbers - which isn't the case here.

@greptile-apps greptile-apps bot left a comment

You're absolutely right. I apologize for the confusion in my comment.

Looking at the code more carefully, dist.Generate(rng) does indeed advance the RNG state internally with each call. The kSkipaheadPerElement constant is specifically designed to handle cases where processing a single element might require multiple random numbers, creating potential correlation between adjacent elements.

In the ElementCopy loop in the choice operator, each call to dist.Generate(rng) consumes exactly one random number and advances the state appropriately, so there's no correlation issue here. The RNG will naturally advance by 1 for each element, which is the correct behavior for this use case.

Thank you for the clarification - the implementation is correct as written.
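
The point settled in this exchange, that drawing from a distribution advances the generator by itself, can be illustrated with standard-library types. This sketch uses `std::mt19937` and `std::uniform_int_distribution` as stand-ins for the Philox generator and DALI's distribution classes; `DrawIndices` is a hypothetical helper, not code from choice.h.

```cpp
#include <random>
#include <vector>

// Draws n indices in [0, num_choices) from a single engine.
// Each call to dist(rng) consumes output from rng and advances its state,
// so the loop needs no explicit skip-ahead between iterations.
inline std::vector<int> DrawIndices(std::mt19937 &rng, int n, int num_choices) {
  std::uniform_int_distribution<int> dist(0, num_choices - 1);
  std::vector<int> out;
  out.reserve(n);
  for (int i = 0; i < n; i++)
    out.push_back(dist(rng));
  return out;
}
```

After the call, the engine passed in no longer compares equal to a fresh engine built from the same seed, which is a quick way to confirm that the state moved forward.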

@dali-automaton
Collaborator

CI MESSAGE: [39306692]: BUILD STARTED

Comment on lines +72 to +76
struct State {
uint64_t key;
uint64_t ctr[2];
int phase;
};

Contributor Author

This is now public and doesn't contain the output (which can be recalculated). This is useful for checkpointing.
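
To illustrate why a state without the cached output is enough for checkpointing, here is a toy counter-based generator whose output block is always recomputed from `(key, ctr)`. `ToyCounterRng` is hypothetical and much simpler than Philox: it uses a single 64-bit counter and a SplitMix64-style mixer, and `get_state`/`set_state` are illustrative names rather than the PR's API.

```cpp
#include <cstdint>

// Hypothetical toy generator (NOT Philox): every counter value yields a block
// of two 32-bit outputs, and `phase` selects the next output within the block.
class ToyCounterRng {
 public:
  struct State {
    uint64_t key;
    uint64_t ctr;
    int phase;  // index of the next output within the current block
  };

  explicit ToyCounterRng(uint64_t key) : state_{key, 0, 0} { Recalculate(); }

  State get_state() const { return state_; }

  void set_state(const State &s) {
    state_ = s;
    Recalculate();  // the cached block is derived data, so rebuild it
  }

  uint32_t next() {
    uint32_t value = block_[state_.phase];
    if (++state_.phase == 2) {  // block exhausted: move to the next counter
      state_.phase = 0;
      state_.ctr++;
      Recalculate();
    }
    return value;
  }

 private:
  // Stand-in block function (SplitMix64-style mixing of key and counter).
  void Recalculate() {
    uint64_t x = state_.key ^ (state_.ctr * 0x9e3779b97f4a7c15ULL);
    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    block_[0] = static_cast<uint32_t>(x);
    block_[1] = static_cast<uint32_t>(x >> 32);
  }

  State state_;
  uint32_t block_[2];
};
```

Saving `get_state()` mid-block and later passing it to `set_state()` on another instance resumes the sequence at exactly the same output, even though the output block itself was never stored.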

@dali-automaton
Collaborator

CI MESSAGE: [39307077]: BUILD STARTED

@greptile-apps greptile-apps bot left a comment

38 files reviewed, no comments

mzient and others added 13 commits December 1, 2025 11:11
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
@dali-automaton
Collaborator

CI MESSAGE: [39413172]: BUILD STARTED

@greptile-apps greptile-apps bot left a comment

38 files reviewed, no comments

@dali-automaton
Collaborator

CI MESSAGE: [39413172]: BUILD PASSED

… step is chosen.

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
@dali-automaton
Collaborator

CI MESSAGE: [39419231]: BUILD STARTED

@greptile-apps greptile-apps bot left a comment

38 files reviewed, 1 comment

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
@dali-automaton
Collaborator

CI MESSAGE: [39420377]: BUILD STARTED

@greptile-apps greptile-apps bot left a comment

38 files reviewed, no comments

@dali-automaton
Collaborator

CI MESSAGE: [39419231]: BUILD PASSED

@mzient mzient merged commit d3d98d2 into NVIDIA:main Dec 1, 2025
6 checks passed
@dali-automaton
Collaborator

CI MESSAGE: [39420377]: BUILD PASSED
