Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement checkpointing for random GPU operators #5148

Merged
merged 9 commits into from
Nov 16, 2023

Conversation

szkarpinski
Copy link
Collaborator

@szkarpinski szkarpinski commented Nov 7, 2023

Category:

New feature (non-breaking change which adds functionality)

Description:

In this PR I add checkpointing support to fn.random operators on the GPU.

Additional information:

Affected modules and functionalities:

curand_states (from randomizer.cuh) now has get_states and set_states methods, which allow to export internal state or restore it.

RNGBase (from rng_base.h) can now save/restore states also on the GPU. The checkpoint of GPU random operator is a pointer to states (as returned by get_states). An ability to serialize/deserialize such checkpoints was added.

Key points relevant for the review:

Checkpoints are generally kept on the device

Note that the states returned by get_states are still on the GPU and set_states also expects its input to be on the GPU. RNGBase's Save and Restore also keep the checkpoint as a pointer to the GPU memory. The checkpoint is copied to the host at the very last moment during Serialize. This limits the device->host transfers significantly as we only copy the checkpoints that the user intends to save, while the checkpoints which we just trace and then delete never leave the device.

Serialization of curandState

During serialization, we treat curandState as a sequence of bytes, not caring about its structure. I consider it an implementation detail of curand.

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: DALI-3532

Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>
Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>
@szkarpinski szkarpinski marked this pull request as ready for review November 7, 2023 12:50
Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>
@szkarpinski szkarpinski marked this pull request as draft November 7, 2023 16:19
@szkarpinski
Copy link
Collaborator Author

szkarpinski commented Nov 7, 2023

Found some performance problems, converting back to draft

Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>
@szkarpinski
Copy link
Collaborator Author

There are 1024 curand states kept per sample (48B each), serializing them one by one was not the best idea. Now a4c521e I serialize array of states to one big bytestring and it reduces the overhead by ~4x. Now the overhead is comparable to that of CPU random operators, so I'd leave it as it is and try to optimize them both in a follow-up.

@szkarpinski szkarpinski marked this pull request as ready for review November 9, 2023 10:36
Copy link
Contributor

@klecki klecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok, one small question.

I also have more general question regarding synchronous operations. Are there any plans to make the checkpointing to run on some stream? Now we do the synchronous copies in few places. (I don't see this as a bad thing, the change is simpler thanks to that, but it may have some impact in cases where the checkpoints are created more often).

dali/operators/random/rng_base.h Outdated Show resolved Hide resolved
dali/operators/random/rng_base.h Outdated Show resolved Hide resolved
Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>
Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>
Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>
@szkarpinski szkarpinski mentioned this pull request Nov 14, 2023
18 tasks
Copy link
Member

@stiepan stiepan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice! Please remeber to add the more fine-grained synchronization to the roadmap.

Copy link
Contributor

@klecki klecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more place with state vs curand_state, otherwise ok.

dali/operators/random/rng_base.h Outdated Show resolved Hide resolved
Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>
@szkarpinski
Copy link
Collaborator Author

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [10787351]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [10787351]: BUILD FAILED

Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>
@szkarpinski
Copy link
Collaborator Author

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [10839737]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [10839737]: BUILD PASSED

@szkarpinski szkarpinski merged commit 5f3f12e into NVIDIA:main Nov 16, 2023
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants