Webdataset reader operator implementation #3306

barci2 · 2021-09-01T14:52:21Z

Description

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Refactoring (Redesign of existing code that doesn't affect functionality)
Other (e.g. Documentation, Tests, Configuration)

What happened in this PR

This PR adds a reader operator for loading tar-based webdatasets.

Additional information

Affected modules and functionalities: Added dali.fn.readers.webdataset

Key points relevant for the review:

Checklist

Tests

Documentation

DALI team only

Requirements

Implements new requirements
Affects existing requirements
N/A

REQ IDs: RDWDS.01, RDWDS.02, RDWDS.03, RDWDS.04, RDWDS.05, RDWDS.06, RDWDS.07, RDWDS.08, RDWDS.09, RDWDS.10, RDWDS.11, RDWDS.12, RDWDS.13, RDWDS.20

JIRA TASK: DALI-2231

mzient · 2021-09-01T15:00:23Z

dali/operators/reader/loader/webdataset_loader.h

+  std::unordered_map<std::string, std::vector<size_t>>
+      ext_map_;  // maps an extension to sample indicies
+  MissingExt missing_component_behavior_;
+  std::vector<DALIDataType> dtype_;


Suggested change

-  std::vector<DALIDataType> dtype_;
+  std::vector<DALIDataType> dtypes_;

?

alright, wasn't sure about naming it

mzient · 2021-09-01T15:00:51Z

dali/operators/reader/loader/webdataset_loader.cc

+               "Invalid value for missing_component_behavior");
+
+  for (auto& component_dtype : dtype_) {
+    DALI_ENFORCE(kSupportedTypes.find(component_dtype) != kSupportedTypes.end(),


Suggested change

-    DALI_ENFORCE(kSupportedTypes.find(component_dtype) != kSupportedTypes.end(),
+    DALI_ENFORCE(kSupportedTypes.count(component_dtype),

lgtm-com · 2021-09-01T15:11:24Z

This pull request introduces 3 alerts when merging ffae5a2482858e95f9907eba273db68621a2dc2a into 5403edd - view on LGTM.com

new alerts:

2 for Unused local variable
1 for Unreachable code

JanuszL · 2021-09-01T15:47:53Z

dali/operators/reader/webdataset_reader_op.h

+    std::cout << "TEST" << std::endl;
+    std::cout << "TEST" << std::endl;
+    std::cout << "TEST" << std::endl;
+    std::cout << "TEST" << std::endl;


some tests, stuff, that's why it's still WIP

JanuszL · 2021-09-01T15:49:37Z

tools/wds2idx.py

+            #if member.type != tarfile.REGTYPE or member.name.startswith('.'):
+            #    last_skipped = member.offset
+            #    continue
+            #last_skipped = self.farchive.fileobj.tell()
+            #basename, extension = IndexCreator.split_name(member.name)
+            #offset = member.offset
+            #if not data or data[-1][0] != basename:
+            #    data.append((offset, [extension]))
+            #else:
+            #    data[-1][1].append(extension)


also debugging stuff

JanuszL · 2021-09-01T15:52:05Z

tools/wds2idx.py

+    creator.close()
+
+if __name__ == '__main__':
+    main()


Please add a newline.

JanuszL · 2021-09-01T16:27:06Z

dali/operators/reader/loader/webdataset_loader.h

+  std::unordered_map<std::string, std::vector<size_t>>
+      ext_map_;  // maps an extension to sample indicies


Suggested change

-  std::unordered_map<std::string, std::vector<size_t>>
-      ext_map_;  // maps an extension to sample indicies
+  std::unordered_map<std::string, std::vector<size_t>> ext_map_;  // maps an extension to sample indicies

100 character limit

Then how about:

Suggested change

-  std::unordered_map<std::string, std::vector<size_t>>
-      ext_map_;  // maps an extension to sample indicies
+  // maps an extension to sample indicies
+  std::unordered_map<std::string, std::vector<size_t>> ext_map_;

?

JanuszL · 2021-09-01T16:28:19Z

dali/operators/reader/loader/webdataset_loader.h

+  struct SampleConfig {
+    int64_t start_offset;
+    int64_t end_offset;
+    std::set<std::string> extensions;


Can each sample have a different set of extensions?


JanuszL · 2021-09-01T16:31:23Z

dali/operators/reader/loader/webdataset_loader.cc

+
+  DALI_ENFORCE(uris_.size() == configs_.size(),
+               "Number of uris does not match the number of config files");
+  DALI_ENFORCE(uris_.size() == dtype_.size(), "Number of uris does not match the number of types");


As I understand it, the uris are index files. So why do you need to specify a dtype for each index file?

yeah, that is a bug, should be ext

JanuszL · 2021-09-01T16:31:36Z

dali/operators/reader/loader/webdataset_loader.cc

+  dtype_ = spec.HasArgument("dtype") ? spec.GetRepeatedArgument<DALIDataType>("dtype") :
+                                       std::vector<DALIDataType>(uris_.size(), DALI_UINT8);
+
+  DALI_ENFORCE(uris_.size() == configs_.size(),


What is the config file for?

that's the index file with offsets for all the samples and the extensions of components in that specific sample

So you have index and config file side by side?

no, the index file is the config file

just thought that "configs" variable would sound better than "indexes"

+1 for indexes (indices ?), file_maps or something along these lines. "Config" sounds more like some kind of setting that's there beside the contents of the file and which can be changed independently, whereas the index is determined exactly by the file contents.
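
The point that the index is determined entirely by the archive contents can be illustrated with a small sketch. The Python below is not the code of tools/wds2idx.py; it is a minimal illustration, modeled on the commented-out loop quoted earlier in this review, that groups tar members by basename and records the offset of each sample together with its component extensions. The names `split_name`, `build_index` and `make_tar`, and the rule of splitting at the first dot of the file name, are assumptions made for the example.

```python
import io
import tarfile


def split_name(path):
    """Split 'dir/sample.seg.png' into ('dir/sample', 'seg.png') at the first dot."""
    directory, _, filename = path.rpartition("/")
    base, _, extension = filename.partition(".")
    prefix = directory + "/" if directory else ""
    return prefix + base, extension


def build_index(tar_bytes):
    """Return a list of (offset, [extensions]) entries, one per sample."""
    index = []
    last_basename = None
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as archive:
        for member in archive:
            # Skip directories, links and hidden files.
            if member.type != tarfile.REGTYPE or member.name.startswith("."):
                continue
            basename, extension = split_name(member.name)
            if basename != last_basename:
                # First component of a new sample: remember where it starts.
                index.append((member.offset, [extension]))
                last_basename = basename
            else:
                index[-1][1].append(extension)
    return index


def make_tar(files):
    """Build an in-memory uncompressed tar from (name, payload) pairs."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w", format=tarfile.USTAR_FORMAT) as archive:
        for name, payload in files:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


archive_bytes = make_tar([("a.jpg", b"img"), ("a.cls", b"0"), ("b.jpg", b"img")])
index = build_index(archive_bytes)
```

Running `build_index` twice on the same archive always yields the same result, which is the sense in which the file is an index rather than a configuration.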

JanuszL · 2021-09-01T16:45:52Z

dali/operators/reader/webdataset_reader_op.cc

+  for (int data_idx = 0; data_idx < num_samples; data_idx++) {
+    auto& sample = GetSample(data_idx);
+    for (int output_idx = 0; output_idx < num_outputs; output_idx++) {
+      ws.OutputRef<CPUBackend>(output_idx)[data_idx].ShareData(&sample[output_idx]);


I'm not sure this is safe.
ReadOne from loader.h returns a LoadTargetSharedPtr, which is stored in prefetched_batch_queue_.
GetSample returns a *LoadTargetSharedPtr.
As you continue reading, prefetched_batch_queue_ is trashed and LoadTargetSharedPtr calls its custom deleter, which moves the underlying tensor to empty_tensors_.
So sharing memory with something that can be trashed doesn't sound like a safe thing to do.

that's how all the other readers fetch their data, I just used it analogously

I think other readers do parsing and do not directly share the tensors they get from GetSample (at least I cannot recall any example now). But I can be wrong...
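
The hazard described in this thread can be shown in isolation. The sketch below is plain Python, not DALI code: `BufferPool` is a hypothetical stand-in for a loader that recycles buffers, and a `memoryview` plays the role of an output that shares memory instead of copying. Whether the real reader is affected depends on when the prefetch queue releases its batches, which this example does not model.

```python
class BufferPool:
    """Hypothetical pool that reuses fixed-size buffers instead of freeing them."""

    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.free = []

    def acquire(self, payload):
        # Reuse a released buffer when one is available, otherwise allocate.
        buffer = self.free.pop() if self.free else bytearray(self.buffer_size)
        buffer[:len(payload)] = payload
        return buffer

    def release(self, buffer):
        # The buffer is not destroyed, only marked as reusable.
        self.free.append(buffer)


pool = BufferPool(buffer_size=4)

first = pool.acquire(b"AAAA")
shared_output = memoryview(first)   # shares memory with the pooled buffer
copied_output = bytes(first)        # owns an independent copy

pool.release(first)                 # the loader moves on to the next batch
second = pool.acquire(b"BBBB")      # the same buffer is handed out again

print(bytes(shared_output))  # the shared view now shows the new contents
print(copied_output)         # the copy still holds the original data
```

A shared view is only safe for as long as the owner guarantees that the buffer will not be handed out again.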

JanuszL · 2021-09-01T16:55:11Z

dali/operators/reader/loader/webdataset_loader.cc

+            shared_tensor_data, size, {size / static_cast<int64_t>(component_dtype_info.size())},
+            component_dtype_info);
+      }
+      sample_was_set[component_index] = true;


Should it break here, or keep looping and share the same shared_tensor_data over and over with different sample[component_index]?

yes, it should

I would add it for the readability.

add what? break would mean a behavior that we would not want

Mhh, so why do we share the same memory with different sample[component_index]? I would expect to get a different piece every time.

no, because ext_map_ maps a specific extension to the indices of the samples it should go to, so those components are actually meant to share the same data

Understood.

I'm having second thoughts about that. How often do we want components to share the underlying memory?
Would it be something like:

["a.a;a.b;a.a;a.b"],

?

well that's just one output so no it wouldn't
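
As a rough illustration of the mapping discussed above: assuming each output is declared as a semicolon-separated list of accepted extensions (as in the `"a.a;a.b;a.a;a.b"` example), an extension-to-outputs map can be built once and then used to route every component file of a sample. The Python below is a sketch of that idea only; `build_ext_map`, `route_sample`, and the first-match-wins rule are assumptions, not the loader's actual code.

```python
def build_ext_map(ext):
    """Map each extension to the indices of the outputs that accept it."""
    ext_map = {}
    for output_idx, spec in enumerate(ext):
        for extension in spec.split(";"):
            outputs = ext_map.setdefault(extension, [])
            if output_idx not in outputs:  # tolerate repeats within one output
                outputs.append(output_idx)
    return ext_map


def route_sample(components, ext_map, num_outputs):
    """Assign component payloads to outputs; the first matching file wins (assumed)."""
    sample = [None] * num_outputs
    sample_was_set = [False] * num_outputs
    for extension, payload in components:
        for output_idx in ext_map.get(extension, []):
            if not sample_was_set[output_idx]:
                sample[output_idx] = payload  # same object, no copy
                sample_was_set[output_idx] = True
    return sample


# Output 0 accepts jpg or png, output 1 accepts cls, output 2 accepts jpg again.
ext_map = build_ext_map(["jpg;png", "cls", "jpg"])
sample = route_sample([("jpg", b"image bytes"), ("cls", b"7")], ext_map, num_outputs=3)
```

With a single output such as `["a.a;a.b;a.a;a.b"]` nothing is shared; in this sketch sharing arises only when the same extension is listed for more than one output, as with `jpg` for outputs 0 and 2.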

JanuszL · 2021-09-01T17:00:11Z

dali/operators/reader/loader/webdataset_loader.cc

+    // Check in case of encountering an unneeded entry
+    const std::string extension = GetExtension(current_wds_shard.GetFileName());
+    if (ext_map_.find(extension) == ext_map_.end()) {
+      DALI_ENFORCE(current_wds_shard.NextFile(), "Index file reporting a file longer than actual");


Please adjust the error message. It would be good to print filename and offset.

JanuszL · 2021-09-01T17:00:13Z

dali/operators/reader/loader/webdataset_loader.cc

+  while (current_wds_shard.TellArchive() < current_sample.end_offset) {
+    // Check in case of encountering a tar entry that is not a file
+    if (current_wds_shard.GetFileType() != detail::TarArchive::ENTRY_FILE) {
+      DALI_ENFORCE(current_wds_shard.NextFile(), "Index file reporting a file longer than actual");


It would be good to print filename and offset.
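
One possible shape for the more informative message requested here and in the previous comment, sketched in Python for brevity (the loader itself is C++). The function name, its parameters, and the example shard and entry names are hypothetical; the only point is that the shard path, the entry name, and the offsets appear in the text.

```python
def index_mismatch_message(shard_path, entry_name, archive_offset, sample_end_offset):
    """Build an error message that names the shard, the entry and the offsets involved."""
    return (
        "Index file reports a sample that extends past the end of the archive: "
        f"shard '{shard_path}', last entry '{entry_name}', "
        f"archive offset {archive_offset}, expected sample end {sample_end_offset}."
    )


message = index_mismatch_message("train-0000.tar", "0001.jpg", 10240, 12288)
```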

JanuszL · 2021-09-01T17:17:32Z

dali/operators/reader/loader/webdataset_loader.cc

+
+  // initializing all the readers
+  for (auto& uri : uris_) {
+    wds_shards_.emplace_back(FileStream::Open(uri, read_ahead_, !dont_use_mmap_));


Do you want to keep all files open? In some edge cases I'm afraid we can exceed the limit on open file descriptors.

yes, because technically the way things are implemented does allow for the loader interface to change the argument for reset (even though it's not implemented like that at the moment), so yes that does necessitate keeping all of them open. What I could do would be to open them only when I use them, but that would also mean getting rid of potential caching implemented in that specific file stream later on (for example in the case of the web accessed tar files)

TFRecordReader and the MXNet reader open files on demand.
Do you have any particular caching in mind?

yes, the one that the implementation of the web reader will probably do. It's just that I don't want to assume anything about the underlying implementation of FileStream
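
For comparison, opening shards on demand while still allowing some reuse could look roughly like the sketch below. It is written in Python and is not how `FileStream` or any DALI reader is implemented; `LazyShardPool` and `max_open` are made-up names. It keeps at most `max_open` handles alive and closes the least recently used one when the limit is reached, which bounds file descriptor usage at the cost of losing whatever state the closed stream had cached.

```python
import os
import tempfile
from collections import OrderedDict


class LazyShardPool:
    """Open shard files on first use and keep at most max_open handles alive."""

    def __init__(self, paths, max_open):
        self.paths = list(paths)
        self.max_open = max_open
        self.open_files = OrderedDict()  # shard index -> file object, oldest first

    def get(self, shard_idx):
        if shard_idx in self.open_files:
            self.open_files.move_to_end(shard_idx)  # mark as most recently used
            return self.open_files[shard_idx]
        if len(self.open_files) >= self.max_open:
            _, oldest = self.open_files.popitem(last=False)
            oldest.close()
        handle = open(self.paths[shard_idx], "rb")
        self.open_files[shard_idx] = handle
        return handle

    def close(self):
        for handle in self.open_files.values():
            handle.close()
        self.open_files.clear()


# Usage with three throwaway files and room for two open handles.
with tempfile.TemporaryDirectory() as tmp_dir:
    shard_paths = []
    for i in range(3):
        path = os.path.join(tmp_dir, f"shard-{i}.tar")
        with open(path, "wb") as f:
            f.write(bytes([i]))
        shard_paths.append(path)

    pool = LazyShardPool(shard_paths, max_open=2)
    first_handle = pool.get(0)
    pool.get(1)
    pool.get(2)                          # evicts and closes shard 0
    open_after_eviction = sorted(pool.open_files)
    reopened_byte = pool.get(0).read(1)  # shard 0 is reopened transparently
    pool.close()
```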

One more thing:

Suggested change

-    wds_shards_.emplace_back(FileStream::Open(uri, read_ahead_, !dont_use_mmap_));
+    wds_shards_.emplace_back(FileStream::Open(uri, read_ahead_, !copy_read_data_));

JanuszL · 2021-09-01T17:20:39Z

dali/operators/reader/loader/webdataset_loader.cc

+  if (stick_to_shard_) {
+    current_wds_shard_index_ = first_wds_shard_index_;
+    current_sample_index_ = first_sample_index_;
+  }


Suggested change

-  if (stick_to_shard_) {
-    current_wds_shard_index_ = first_wds_shard_index_;
-    current_sample_index_ = first_sample_index_;
-  }
+  current_wds_shard_index_ = first_wds_shard_index_;
+  current_sample_index_ = first_sample_index_;

stick_to_shard_ matters only when Reset is called. Here it should be irrelevant.

It shouldn't be, since Reset is not called after the first pass over the data. Other implementations handle it the same way.

JanuszL · 2021-09-01T17:20:52Z

dali/operators/reader/loader/webdataset_loader.cc

+  }
+
+  // initializing the first reader
+  if (stick_to_shard_) {


Same as above.

JanuszL · 2021-09-01T17:22:22Z

dali/operators/reader/loader/webdataset_loader.cc

+  for (detail::TarArchive& wds_shard : wds_shards_) {
+    wds_shard.SeekArchive(0);
+  }


L184 handles that. This is not needed here I guess.

removed in the current changes, will push it tomorrow

lgtm-com · 2021-09-03T09:11:42Z

This pull request introduces 2 alerts when merging fb46bf71278f21e65f940714c9e1cf97e7919cb5 into f05e931 - view on LGTM.com

new alerts:

1 for Unused local variable
1 for Unused import

lgtm-com · 2021-09-06T16:00:30Z

This pull request introduces 2 alerts when merging 53fa3a0c7a2cfeb8c24be21271d75be3d585d89c into f1a61b6 - view on LGTM.com

new alerts:

1 for Unused local variable
1 for Unused import

lgtm-com · 2021-09-07T10:09:59Z

This pull request introduces 2 alerts when merging aeaa897219dd007e7173b9308919c1c5eb829271 into a49640d - view on LGTM.com

new alerts:

1 for Unused local variable
1 for Unused import

lgtm-com · 2021-09-07T16:35:31Z

This pull request introduces 2 alerts when merging 6bf084fb591809517f2b49cc15da47cc06e9578c into 948ccb7 - view on LGTM.com

new alerts:

1 for Unused local variable
1 for Unused import

lgtm-com · 2021-09-07T20:09:07Z

This pull request introduces 2 alerts when merging 7301b1a733fb84ddb8811de24278f0e6c58b133d into 948ccb7 - view on LGTM.com

new alerts:

1 for Unused local variable
1 for Unused import

lgtm-com · 2021-09-07T20:39:04Z

This pull request introduces 2 alerts when merging cc4cf9456bf53d44185a0f8c597fa71121fc408d into 948ccb7 - view on LGTM.com

new alerts:

1 for Unused local variable
1 for Unused import

dali-automaton · 2021-09-15T22:19:03Z

CI MESSAGE: [2990240]: BUILD STARTED

dali-automaton · 2021-09-15T23:49:09Z

CI MESSAGE: [2987769]: BUILD FAILED

dali-automaton · 2021-09-16T01:57:01Z

CI MESSAGE: [2990240]: BUILD FAILED

dali-automaton · 2021-09-16T09:33:33Z

CI MESSAGE: [2990240]: BUILD PASSED

Signed-off-by: Bartłomiej Cieślar <bcieslar2001@gmail.com>

barci2 · 2021-09-16T09:38:51Z

!build

dali-automaton · 2021-09-16T09:40:56Z

CI MESSAGE: [2993471]: BUILD STARTED

dali-automaton · 2021-09-16T10:46:50Z

CI MESSAGE: [2993471]: BUILD PASSED

Implementation of nvidia.dali.fn.readers.webdataset

mzient marked this pull request as draft September 1, 2021 14:54

mzient reviewed Sep 1, 2021


JanuszL reviewed Sep 1, 2021


JanuszL reviewed Sep 1, 2021


barci2 changed the title ~~Webdataset reader operator implementation (WIP)~~ [WIP] Webdataset reader operator implementation Sep 2, 2021

barci2 force-pushed the webdataset_reader branch from 53fa3a0 to aeaa897 Compare September 7, 2021 09:54

barci2 marked this pull request as ready for review September 8, 2021 09:07

barci2 changed the title ~~[WIP] Webdataset reader operator implementation~~ Webdataset reader operator implementation Sep 8, 2021

updated dali extra hash

fccad59

Signed-off-by: Bartłomiej Cieślar <bcieslar2001@gmail.com>

barci2 merged commit 76b87c3 into NVIDIA:main Sep 16, 2021

cyyever pushed a commit to cyyever/DALI that referenced this pull request Oct 17, 2021

Webdataset reader operator implementation (NVIDIA#3306)

f7609b6

Implementation of nvidia.dali.fn.readers.webdataset

cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022

Webdataset reader operator implementation (NVIDIA#3306)

c817eb7

Implementation of nvidia.dali.fn.readers.webdataset

cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022

Webdataset reader operator implementation (NVIDIA#3306)

68819df

Implementation of nvidia.dali.fn.readers.webdataset

cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022

Webdataset reader operator implementation (NVIDIA#3306)

3593b98

Implementation of nvidia.dali.fn.readers.webdataset

cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022

Webdataset reader operator implementation (NVIDIA#3306)

b5921a3

Implementation of nvidia.dali.fn.readers.webdataset

cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022

Webdataset reader operator implementation (NVIDIA#3306)

0cb7108

Implementation of nvidia.dali.fn.readers.webdataset

cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022

Webdataset reader operator implementation (NVIDIA#3306)

ec2745f

Implementation of nvidia.dali.fn.readers.webdataset

cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022

Webdataset reader operator implementation (NVIDIA#3306)

5ce5320

Implementation of nvidia.dali.fn.readers.webdataset

cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022

Webdataset reader operator implementation (NVIDIA#3306)

ab8f3f7

Implementation of nvidia.dali.fn.readers.webdataset

cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022

Webdataset reader operator implementation (NVIDIA#3306)

2fa714f

Implementation of nvidia.dali.fn.readers.webdataset

cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022

Webdataset reader operator implementation (NVIDIA#3306)

76de679

Implementation of nvidia.dali.fn.readers.webdataset

cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022

Webdataset reader operator implementation (NVIDIA#3306)

4c673c3

Implementation of nvidia.dali.fn.readers.webdataset

cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022

Webdataset reader operator implementation (NVIDIA#3306)

fb34334

Implementation of nvidia.dali.fn.readers.webdataset

cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022

Webdataset reader operator implementation (NVIDIA#3306)

0926890

Implementation of nvidia.dali.fn.readers.webdataset

cyyever pushed a commit to cyyever/DALI that referenced this pull request Feb 21, 2022

Webdataset reader operator implementation (NVIDIA#3306)

4359b02

Implementation of nvidia.dali.fn.readers.webdataset

cyyever pushed a commit to cyyever/DALI that referenced this pull request May 13, 2022

Webdataset reader operator implementation (NVIDIA#3306)

26b43c9

Implementation of nvidia.dali.fn.readers.webdataset

cyyever pushed a commit to cyyever/DALI that referenced this pull request Jun 7, 2022

Webdataset reader operator implementation (NVIDIA#3306)

132a0d8

Implementation of nvidia.dali.fn.readers.webdataset
