-
Notifications
You must be signed in to change notification settings - Fork 609
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement checkpointing for random GPU operators #5148
Conversation
Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>
ff769f5
to
9edf25f
Compare
Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>
Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>
Found some performance problems, converting back to draft |
Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>
There are 1024 curand states kept per sample (48B each), serializing them one by one was not the best idea. Now a4c521e I serialize array of states to one big bytestring and it reduces the overhead by ~4x. Now the overhead is comparable to that of CPU random operators, so I'd leave it as it is and try to optimize them both in a follow-up. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks ok, one small question.
I also have more general question regarding synchronous operations. Are there any plans to make the checkpointing to run on some stream? Now we do the synchronous copies in few places. (I don't see this as a bad thing, the change is simpler thanks to that, but it may have some impact in cases where the checkpoints are created more often).
Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>
Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>
Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks nice! Please remeber to add the more fine-grained synchronization to the roadmap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more place with state vs curand_state, otherwise ok.
Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>
!build |
CI MESSAGE: [10787351]: BUILD STARTED |
CI MESSAGE: [10787351]: BUILD FAILED |
Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>
!build |
CI MESSAGE: [10839737]: BUILD STARTED |
CI MESSAGE: [10839737]: BUILD PASSED |
Category:
New feature (non-breaking change which adds functionality)
Description:
In this PR I add checkpointing support to
fn.random
operators on the GPU.Additional information:
Affected modules and functionalities:
curand_states
(fromrandomizer.cuh
) now hasget_states
andset_states
methods, which allow to export internal state or restore it.RNGBase
(fromrng_base.h
) can now save/restore states also on the GPU. The checkpoint of GPU random operator is a pointer to states (as returned byget_states
). An ability to serialize/deserialize such checkpoints was added.Key points relevant for the review:
Checkpoints are generally kept on the device
Note that the states returned by
get_states
are still on the GPU andset_states
also expects its input to be on the GPU.RNGBase
's Save and Restore also keep the checkpoint as a pointer to the GPU memory. The checkpoint is copied to the host at the very last moment duringSerialize
. This limits the device->host transfers significantly as we only copy the checkpoints that the user intends to save, while the checkpoints which we just trace and then delete never leave the device.Serialization of
curandState
During serialization, we treat
curandState
as a sequence of bytes, not caring about its structure. I consider it an implementation detail of curand.Tests:
Checklist
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: DALI-3532