Implement checkpointing for random GPU operators #5148

szkarpinski · 2023-11-07T09:43:39Z

Category:

New feature (non-breaking change which adds functionality)

Description:

In this PR I add checkpointing support to fn.random operators on the GPU.

Additional information:

Affected modules and functionalities:

curand_states (from randomizer.cuh) now has get_states and set_states methods, which export the internal state and restore it, respectively.

RNGBase (from rng_base.h) can now save/restore states on the GPU as well. The checkpoint of a GPU random operator is a pointer to the states (as returned by get_states). The ability to serialize/deserialize such checkpoints was added.
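The export/restore contract described above can be illustrated with a host-only sketch. It is not the DALI code: `ToyState`, `ToyStates`, `draw` and `replay_matches` are hypothetical names, the generator is a plain 64-bit LCG rather than curand, and the snapshot is an ordinary host vector. Only the method names get_states and set_states are taken from the description; their signatures here are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one generator state; this is NOT curandState.
struct ToyState {
  std::uint64_t s;
};

// Advances a state by one 64-bit LCG step and returns the new value.
inline std::uint64_t next(ToyState &st) {
  st.s = st.s * 6364136223846793005ULL + 1442695040888963407ULL;
  return st.s;
}

// Hypothetical analogue of a pool of states with export/restore methods.
class ToyStates {
 public:
  ToyStates(std::uint64_t seed, std::size_t n) : states_(n) {
    for (std::size_t i = 0; i < n; i++) states_[i].s = seed + i;
  }

  // Exports a snapshot of all states (a host copy in this sketch).
  std::vector<ToyState> get_states() const { return states_; }

  // Restores a snapshot; its length must match the pool size.
  void set_states(const std::vector<ToyState> &snapshot) {
    assert(snapshot.size() == states_.size());
    states_ = snapshot;
  }

  std::uint64_t draw(std::size_t i) { return next(states_[i]); }
  std::size_t size() const { return states_.size(); }

 private:
  std::vector<ToyState> states_;
};

// Returns true if restoring a snapshot reproduces exactly the same draws.
inline bool replay_matches() {
  ToyStates pool(42, 4);
  pool.draw(0);                               // advance before checkpointing
  std::vector<ToyState> checkpoint = pool.get_states();
  std::vector<std::uint64_t> first;
  for (std::size_t i = 0; i < pool.size(); i++) first.push_back(pool.draw(i));
  pool.set_states(checkpoint);                // roll back to the snapshot
  for (std::size_t i = 0; i < pool.size(); i++) {
    if (pool.draw(i) != first[i]) return false;
  }
  return true;
}
```

Calling `replay_matches()` checks the property a checkpoint is meant to provide: after a restore, the generator repeats the sequence it produced after the save.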

Key points relevant for the review:

Checkpoints are generally kept on the device

Note that the states returned by get_states are still on the GPU and set_states also expects its input to be on the GPU. RNGBase's Save and Restore also keep the checkpoint as a pointer to the GPU memory. The checkpoint is copied to the host at the very last moment during Serialize. This limits the device->host transfers significantly as we only copy the checkpoints that the user intends to save, while the checkpoints which we just trace and then delete never leave the device.
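To make the transfer argument concrete, here is a small host-only simulation. It does not use CUDA: `SimDeviceBuffer` is a hypothetical stand-in that stores bytes in host memory and counts how often it is "copied to the host", and `Checkpoint`, `save` and `serialize` are illustrative names, not the RNGBase API.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Simulated "device" buffer: plain host memory plus a transfer counter.
// In real code this would be GPU memory and copy_to_host a device-to-host copy.
struct SimDeviceBuffer {
  std::vector<unsigned char> bytes;

  static int &transfers() {
    static int n = 0;
    return n;
  }

  std::string copy_to_host() const {
    ++transfers();
    return std::string(bytes.begin(), bytes.end());
  }
};

// A checkpoint only holds a handle to the device-side snapshot.
struct Checkpoint {
  std::shared_ptr<const SimDeviceBuffer> states;
};

// Save: the snapshot stays "on the device"; no host transfer happens here.
inline Checkpoint save(const SimDeviceBuffer &live) {
  return Checkpoint{std::make_shared<SimDeviceBuffer>(live)};
}

// Serialize: the only place where the data is copied to the host.
inline std::string serialize(const Checkpoint &cpt) {
  return cpt.states->copy_to_host();
}

// Traces ten checkpoints, serializes one, and reports the number of transfers.
inline int transfers_after_demo() {
  SimDeviceBuffer::transfers() = 0;
  SimDeviceBuffer live{std::vector<unsigned char>(48, 7)};
  std::vector<Checkpoint> traced;
  for (int i = 0; i < 10; i++) traced.push_back(save(live));
  std::string blob = serialize(traced.back());  // the user saves only this one
  assert(blob.size() == 48);
  return SimDeviceBuffer::transfers();
}
```

In this simulation ten checkpoints are traced but only one host copy is made, which is the saving the paragraph above refers to.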

Serialization of `curandState`

During serialization, we treat curandState as a sequence of bytes, not caring about its structure. I consider it an implementation detail of curand.
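A minimal sketch of the byte-sequence treatment, assuming a 48-byte trivially copyable state. `OpaqueState`, `to_bytes` and `from_bytes` are hypothetical names; the real curandState comes from the curand headers and is not used here.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <type_traits>

// Hypothetical opaque state standing in for curandState. The serialization
// code below never looks at its contents, only at its size.
struct OpaqueState {
  unsigned char blob[48];
};

static_assert(std::is_trivially_copyable<OpaqueState>::value,
              "byte-wise copying requires a trivially copyable type");

// Copies the raw bytes of the state into a string.
inline std::string to_bytes(const OpaqueState &s) {
  return std::string(reinterpret_cast<const char *>(&s), sizeof(s));
}

// Restores a state from raw bytes; fails if the length does not match.
inline bool from_bytes(const std::string &data, OpaqueState &out) {
  if (data.size() != sizeof(OpaqueState)) return false;
  std::memcpy(&out, data.data(), sizeof(out));
  return true;
}
```

One consequence of this choice is that such a byte string is only meaningful to a build in which the state type has the same layout; the sketch checks the length but cannot detect a layout change of equal size.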

Tests:

Checklist

Documentation

DALI team only

Requirements

Implements new requirements
Affects existing requirements
N/A

REQ IDs: N/A

JIRA TASK: DALI-3532

Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>

szkarpinski · 2023-11-07T16:19:19Z

Found some performance problems, converting back to draft

szkarpinski · 2023-11-09T10:36:10Z

There are 1024 curand states kept per sample (48B each), so serializing them one by one was not the best idea. As of a4c521e, I serialize the array of states into one big byte string, which reduces the overhead by ~4x. The overhead is now comparable to that of CPU random operators, so I'd leave it as it is and try to optimize them both in a follow-up.
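With the figures quoted above, one array of states is 1024 × 48 B = 49152 B. The following host-only sketch shows the bulk approach under stated assumptions: `RawState`, `serialize_states` and `deserialize_states` are hypothetical names, and the length check is only an illustration of validating the blob size, not the code from a4c521e.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical 48-byte stand-in for one curand state.
struct RawState {
  unsigned char blob[48];
};

// Serializes the whole array with a single memcpy instead of one call per state.
inline std::string serialize_states(const std::vector<RawState> &states) {
  std::string out(states.size() * sizeof(RawState), '\0');
  if (!states.empty()) std::memcpy(&out[0], states.data(), out.size());
  return out;
}

// Rejects blobs whose length is not expected_count times the state size.
inline bool deserialize_states(const std::string &data, std::size_t expected_count,
                               std::vector<RawState> &out) {
  if (data.size() != expected_count * sizeof(RawState)) return false;
  out.resize(expected_count);
  if (expected_count != 0) std::memcpy(out.data(), data.data(), data.size());
  return true;
}
```

For 1024 states, `serialize_states` returns a 49152-byte string, and `deserialize_states` refuses it if the caller expects any other number of states.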

klecki

Looks ok, one small question.

I also have a more general question regarding synchronous operations. Are there any plans to make checkpointing run on a stream? Right now we do synchronous copies in a few places. (I don't see this as a bad thing, since the change is simpler thanks to that, but it may have some impact in cases where checkpoints are created more often.)

dali/operators/random/rng_base.h

dali/operators/util/randomizer.cuh

stiepan

Looks nice! Please remember to add the more fine-grained synchronization to the roadmap.

klecki

One more place with state vs curand_state, otherwise ok.

dali/operators/random/rng_base.h

szkarpinski · 2023-11-14T16:29:50Z

!build

dali-automaton · 2023-11-14T16:35:11Z

CI MESSAGE: [10787351]: BUILD STARTED

dali-automaton · 2023-11-14T17:12:17Z

CI MESSAGE: [10787351]: BUILD FAILED

szkarpinski · 2023-11-16T11:42:50Z

!build

dali-automaton · 2023-11-16T11:46:36Z

CI MESSAGE: [10839737]: BUILD STARTED

dali-automaton · 2023-11-16T17:22:21Z

CI MESSAGE: [10839737]: BUILD PASSED

Implement checkpointing for random GPU operators

9edf25f

Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>

szkarpinski force-pushed the gpu-random-cpt branch from ff769f5 to 9edf25f on November 7, 2023 09:45

stiepan self-assigned this Nov 7, 2023

Keep state on the device

e4e1d24

Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>

szkarpinski marked this pull request as ready for review November 7, 2023 12:50

Linter fixes

b6d900e

Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>

szkarpinski marked this pull request as draft November 7, 2023 16:19

Optimize serialization

a4c521e

Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>

szkarpinski marked this pull request as ready for review November 9, 2023 10:36

dali-automaton assigned klecki Nov 9, 2023

klecki reviewed Nov 13, 2023

dali/operators/random/rng_base.h (outdated, resolved)

dali/operators/random/rng_base.h (outdated, resolved)

Fix states_gpu name

9b4aa8b

Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>

szkarpinski commented Nov 14, 2023

dali/operators/util/randomizer.cuh (outdated, resolved)

szkarpinski added 2 commits November 14, 2023 14:25

Keep and check curand state length

af4ccdc

Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>

Remove const from set

e165432

Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>

szkarpinski mentioned this pull request Nov 14, 2023

Add checkpointing benchmarks #5166

Merged

18 tasks

stiepan approved these changes Nov 14, 2023

klecki approved these changes Nov 14, 2023

dali/operators/random/rng_base.h (outdated, resolved)

States length fix

697c7ba

Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>

Add missing DLL_PUBLIC

80a71ca

Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>

szkarpinski merged commit 5f3f12e into NVIDIA:main Nov 16, 2023
5 checks passed
