
Streamline data loaders #1429

Merged · 3 commits · Jun 26, 2023

Conversation

lrzpellegrini (Collaborator)

As part of the effort to add support for FFCV data loading, this PR brings an overall change in the way data loaders are created.

Avalanche data loaders return batches that are the concatenation of smaller mini-batches drawn from multiple datasets.

Up to now, this was accomplished by:

  • Creating a data loader for each dataset (each with the same parameters!)
  • Manually looping over all loaders to fetch the sub-mini-batches
  • Collating the sub-mini-batches before returning them (in the main thread!)

This led to issues with num_workers (workers were spawned for each sub-dataset, so N * num_workers processes were created in total), bottlenecks caused by collating on the main thread, and so on, not to mention that each loader's code replicated operations already implemented in the other loaders.

With the improvements in this PR, a single PyTorch DataLoader is created on top of a newly implemented Sampler that returns the indices in the desired order. The new mechanism supports distributed sampling, ensures that the correct number of workers is created, and makes flags such as persistent_workers and pin_memory behave as expected. In addition, mini-batch collation is done at the DataLoader level, removing a problematic bottleneck.
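To illustrate the idea, here is a minimal, hypothetical sketch of such a sampler (the class name, parameters, and stopping policy are assumptions for illustration, not the actual Avalanche API). Given the lengths of several datasets that have been concatenated into one, it yields index batches that are the concatenation of fixed-size sub-mini-batches, one slice per dataset, so a single DataLoader can do all the fetching and collation:

```python
import random
from typing import Iterator, List, Sequence


class InterleavedBatchSampler:
    """Sketch of a batch sampler over a concatenated dataset: each yielded
    batch contains sub_batch_sizes[i] indices from the i-th sub-dataset."""

    def __init__(self, dataset_lengths: Sequence[int],
                 sub_batch_sizes: Sequence[int],
                 shuffle: bool = True, seed: int = 0) -> None:
        assert len(dataset_lengths) == len(sub_batch_sizes)
        self.dataset_lengths = list(dataset_lengths)
        self.sub_batch_sizes = list(sub_batch_sizes)
        self.shuffle = shuffle
        self.seed = seed
        # Offset of each sub-dataset inside the concatenated index space.
        self.offsets = [0]
        for n in self.dataset_lengths:
            self.offsets.append(self.offsets[-1] + n)

    def __iter__(self) -> Iterator[List[int]]:
        rng = random.Random(self.seed)
        # Per-dataset index pools, shifted into the concatenated index space.
        pools: List[List[int]] = []
        for off, n in zip(self.offsets, self.dataset_lengths):
            idx = list(range(off, off + n))
            if self.shuffle:
                rng.shuffle(idx)
            pools.append(idx)
        cursors = [0] * len(pools)
        # One possible policy: stop when any sub-dataset is exhausted.
        while all(c + b <= len(p)
                  for c, b, p in zip(cursors, self.sub_batch_sizes, pools)):
            batch: List[int] = []
            for i, b in enumerate(self.sub_batch_sizes):
                batch.extend(pools[i][cursors[i]:cursors[i] + b])
                cursors[i] += b
            yield batch

    def __len__(self) -> int:
        return min(n // b for n, b in
                   zip(self.dataset_lengths, self.sub_batch_sizes))
```

With PyTorch, an object like this can be passed as `batch_sampler` to a single `DataLoader` built over a `ConcatDataset`, so worker processes, pin_memory, and collation are all handled once, in the loader, rather than per sub-dataset on the main thread.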

Note: experiment results may differ slightly from the previous implementation. This is not a bug: it is related to the order in which random number generators are consumed, which may lead to a different choice of replay examples in the ReplayPlugin. Setting shuffle=False, or setting a manual seed in the ReplayPlugin update method, reproduces the results of the previous loader implementations.

@coveralls commented Jun 23, 2023

Pull Request Test Coverage Report for Build 5356470892

  • 170 of 194 (87.63%) changed or added relevant lines in 4 files are covered.
  • 8 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.04%) to 72.525%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
avalanche/benchmarks/utils/detection_dataset.py | 2 | 6 | 33.33%
avalanche/benchmarks/utils/data_loader.py | 162 | 182 | 89.01%

Files with Coverage Reduction | New Missed Lines | %
avalanche/benchmarks/utils/data_loader.py | 2 | 86.72%
avalanche/benchmarks/utils/collate_functions.py | 6 | 61.9%

Totals
Change from base Build 5332470848: +0.04%
Covered Lines: 16300
Relevant Lines: 22475

💛 - Coveralls

@AntonioCarta (Collaborator)

Thanks, this is much cleaner than the previous solution.

@AntonioCarta AntonioCarta merged commit 427d398 into ContinualAI:master Jun 26, 2023
18 checks passed

3 participants