Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support checkpointing in Caffe reader #5181

Merged
merged 3 commits into from
Nov 23, 2023
Merged

Conversation

szkarpinski
Copy link
Collaborator

@szkarpinski szkarpinski commented Nov 21, 2023

Category:

New feature (non-breaking change which adds functionality)

Description:

This PR adds checkpointing support to fn.readers.caffe and fn.readers.caffe2.

Additional information:

Adding checkpointing support is a simple task, similar for every loader and reader.

The following changes were required to enable checkpointing in LMDB loader:

  • Make it inherit from Loader<..., supports_checkpointing=true>
  • Implement Skip() method. Skip should behave like ReadSample in terms of side effects, but should skip a sample instead of reading it. This method is used to implement fast-forwarding in Loader baseclass.
  • Change Reset to use virtual_shard_id_ (which is the shard currently processed)
    instead of shard_id_ (which is the initial shard requested by the user). See [2] for more details.

The following changes were required to enable checkpointing in caffe and caffe2 readers:

  • Make it inherit from DataReader<..., supports_checkpointing=true>. See [1] for more details.
  • Store the very first snapshot in the constructor with this->SetInitialSnapshot().
    Subsequent snapshots are saved by DataReader baseclass.

[1] Changing DataReader template parameters

Inheriting from DataReader<..., supports_checkpointing=true> might look strange in the diff, because the DataReader is defined as:

template <typename Backend, typename LoadTarget,
          typename ParseTarget = LoadTarget, bool supports_checkpointing = false>

and is used mostly as:

DataReader<Backend, Target>

so to enable checkpointing one needs to add two parameters:

DataReader<Backend, Target, Target, true>

[2] virtual_shard_id_

virtual_shard_id_ and shard_id_ might differ when stick_to_shard=False.

This change doesn't impact existing code, because Reset is normally called after each full pass over the data, so then those two are equal. It might happen that a checkpoint is saved when the reader was is processing a different shard that shard_id_, so to restore from such checkpoint we need to be able to reset to the current shard (virtual_shard_id_).

The same change was made in FileReader in #4954.

Affected modules and functionalities:

  • Caffe and caffe2 readers
  • LMDB loader

Key points relevant for the review:

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: DALI-3693

Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>
@szkarpinski szkarpinski changed the title Add checkpointing support in Caffe reader Support checkpointing in Caffe reader Nov 21, 2023
Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>
@szkarpinski szkarpinski marked this pull request as ready for review November 21, 2023 10:33
Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>
@szkarpinski
Copy link
Collaborator Author

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [10995316]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [10995316]: BUILD PASSED

@szkarpinski szkarpinski merged commit 0ba4fc1 into NVIDIA:main Nov 23, 2023
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants