Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support checkpointing in FileReader #4954

Merged
merged 15 commits into from
Jul 31, 2023

Conversation

staniewzki
Copy link
Contributor

@staniewzki staniewzki commented Jul 25, 2023

Category:

New feature (non-breaking change which adds functionality)

Description:

This PR adds support for checkpointing to FileReader.

In the prefetching thread, a state snapshot is produced every epoch by the Loader, and stored in a queue.

Currently, only pad_last_batch=true is supported, because otherwise, the epoch might change mid-batch, which complicates checkpointing.

Additional information:

Affected modules and functionalities:

Key points relevant for the review:

Note: most of the changes in file_label_loader.h diff come from fixing the indentation there.

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: DALI-3527

Signed-off-by: Michał Staniewski <mstaniewski@nvidia.com>
Signed-off-by: Michał Staniewski <mstaniewski@nvidia.com>
Signed-off-by: Michał Staniewski <mstaniewski@nvidia.com>
Signed-off-by: Michał Staniewski <mstaniewski@nvidia.com>
Signed-off-by: Michał Staniewski <mstaniewski@nvidia.com>
Signed-off-by: Michał Staniewski <mstaniewski@nvidia.com>
Signed-off-by: Michał Staniewski <mstaniewski@nvidia.com>
@staniewzki staniewzki marked this pull request as ready for review July 26, 2023 13:00
Comment on lines 62 to 116
vector<string> files;
vector<int> labels;

has_files_arg_ = spec.TryGetRepeatedArgument(files, "files");
has_labels_arg_ = spec.TryGetRepeatedArgument(labels, "labels");
has_file_list_arg_ = spec.TryGetArgument(file_list_, "file_list");
has_file_root_arg_ = spec.TryGetArgument(file_root_, "file_root");
bool has_file_filters_arg = spec.TryGetRepeatedArgument(filters_, "file_filters");

// TODO(ksztenderski): CocoLoader inherits after FileLabelLoader and it doesn't work with
// GetArgument.
spec.TryGetArgument(case_sensitive_filter_, "case_sensitive_filter");

DALI_ENFORCE(has_file_root_arg_ || has_files_arg_ || has_file_list_arg_,
"``file_root`` argument is required when not using ``files`` or ``file_list``.");

DALI_ENFORCE(has_files_arg_ + has_file_list_arg_ <= 1,
"File paths can be provided through ``files`` or ``file_list`` but not both.");

DALI_ENFORCE(has_files_arg_ || !has_labels_arg_,
"The argument ``labels`` is valid only when file paths "
"are provided as ``files`` argument.");

DALI_ENFORCE(!has_file_filters_arg || filters_.size() > 0,
"``file_filters`` list cannot be empty.");

if (has_file_list_arg_) {
DALI_ENFORCE(!file_list_.empty(), "``file_list`` argument cannot be empty");
if (!has_file_root_arg_) {
auto idx = file_list_.rfind(filesystem::dir_sep);
if (idx != string::npos) {
file_root_ = file_list_.substr(0, idx);
}
}
}

if (has_files_arg_) {
DALI_ENFORCE(files.size() > 0, "``files`` specified an empty list.");
if (has_labels_arg_) {
DALI_ENFORCE(files.size() == labels.size(), make_string("Provided ", labels.size(),
" labels for ", files.size(), " files."));
if (has_files_arg_) {
DALI_ENFORCE(files.size() > 0, "``files`` specified an empty list.");
if (has_labels_arg_) {
DALI_ENFORCE(files.size() == labels.size(), make_string("Provided ", labels.size(),
" labels for ", files.size(), " files."));

for (int i = 0, n = files.size(); i < n; i++)
image_label_pairs_.emplace_back(std::move(files[i]), labels[i]);
} else {
for (int i = 0, n = files.size(); i < n; i++)
image_label_pairs_.emplace_back(std::move(files[i]), labels[i]);
} else {
for (int i = 0, n = files.size(); i < n; i++)
image_label_pairs_.emplace_back(std::move(files[i]), i);
}
image_label_pairs_.emplace_back(std::move(files[i]), i);
}
}

/*
* Those options are mutually exclusive as `shuffle_after_epoch` will make every shard looks differently
* after each epoch so coexistence with `stick_to_shard` doesn't make any sense
* Still when `shuffle_after_epoch` we will set `stick_to_shard` internally in the FileLabelLoader so all
* DALI instances will do shuffling after each epoch
*/
DALI_ENFORCE(!(shuffle_after_epoch_ && stick_to_shard_),
"shuffle_after_epoch and stick_to_shard cannot be both true");
DALI_ENFORCE(!(shuffle_after_epoch_ && shuffle_),
"shuffle_after_epoch and random_shuffle cannot be both true");
/*
* Imply `stick_to_shard` from `shuffle_after_epoch`
*/
if (shuffle_after_epoch_) {
stick_to_shard_ = true;
}
/*
* Those options are mutually exclusive as `shuffle_after_epoch` will make every shard looks differently
* after each epoch so coexistence with `stick_to_shard` doesn't make any sense
* Still when `shuffle_after_epoch` we will set `stick_to_shard` internally in the FileLabelLoader so all
* DALI instances will do shuffling after each epoch
*/
DALI_ENFORCE(!(shuffle_after_epoch_ && stick_to_shard_),
"shuffle_after_epoch and stick_to_shard cannot be both true");
DALI_ENFORCE(!(shuffle_after_epoch_ && shuffle_),
"shuffle_after_epoch and random_shuffle cannot be both true");
/*
* Imply `stick_to_shard` from `shuffle_after_epoch`
*/
if (shuffle_after_epoch_) {
stick_to_shard_ = true;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lengthy diff, but this is just an indentation fix.

dali/operators/reader/loader/file_label_loader.cc Outdated Show resolved Hide resolved
Comment on lines 275 to 280
std::vector<FileLabelLoaderState> state_queue_;
Index state_queue_front_;
Index state_queue_back_;
int checkpoint_epoch_;
std::mutex state_queue_mutex_;
using Loader<CPUBackend, ImageLabelWrapper>::checkpointing_;
Copy link
Member

@stiepan stiepan Jul 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record of what we talked offline:

Can we move it all to the Loader base? And make the FileLabelLoaderState just a LoaderCheckpoint? There does not seem to be anything file-reader specific in here.

We don't exactly know how other loaders will play out with the featrue, so I wouldn't push for that at that moment if it wasn't for the fact that feature flag and pushing new loader checkpoint already leaked to the base class. Now, the base loader has the dummy "push method" and no corresponding pop.

As a safegaurd for the readers we do not test for now you could add a flag to loader

template <typename Backend, typename LoadTarget, bool support_checkpointing=false>
class Loader {...}

and somwhere inside to be used instead of plain checkpointing_

bool EnableCheckpointing() {  // no need for virtual methods
    return support_checkpointing && checkpointing_;
}

In the future, the enable_checkpointing could become the actual type of loder's checkpoint/state to be put in the queue.

@stiepan stiepan self-assigned this Jul 26, 2023
Signed-off-by: Michał Staniewski <mstaniewski@nvidia.com>
Signed-off-by: Michał Staniewski <mstaniewski@nvidia.com>
Signed-off-by: Michał Staniewski <mstaniewski@nvidia.com>
Signed-off-by: Michał Staniewski <mstaniewski@nvidia.com>
Signed-off-by: Michał Staniewski <mstaniewski@nvidia.com>
Signed-off-by: Michał Staniewski <mstaniewski@nvidia.com>
Signed-off-by: Michał Staniewski <mstaniewski@nvidia.com>
Signed-off-by: Michał Staniewski <mstaniewski@nvidia.com>
@staniewzki
Copy link
Contributor Author

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [9175984]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [9175984]: BUILD PASSED

@staniewzki staniewzki merged commit d52fd08 into NVIDIA:main Jul 31, 2023
5 checks passed
@szkarpinski szkarpinski mentioned this pull request Oct 10, 2023
18 tasks
JanuszL pushed a commit to JanuszL/DALI that referenced this pull request Oct 13, 2023
This PR adds support for checkpointing to FileReader.

In the prefetching thread, a state snapshot is produced every epoch by the Loader, and stored in a queue.

Currently, only pad_last_batch=true is supported, because otherwise, the epoch might change mid-batch, which complicates checkpointing.

---------

Signed-off-by: Michał Staniewski <mstaniewski@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants