
Streamline data loaders #1429

Merged · 3 commits · Jun 26, 2023

Conversation

lrzpellegrini (Collaborator)

As part of the effort to add support for FFCV data loading, this PR brings an overall change in the way data loaders are created.

Avalanche data loaders return batches that are the concatenation of smaller mini-batches drawn from multiple datasets.

Up to now, this was accomplished by:

  • Creating a data loader for each dataset (each with the same parameters!)
  • Manually looping over all loaders to fetch the sub-mini-batches
  • Collating the sub-mini-batches before returning them (in the main thread!)

This led to issues with num_workers (workers were spawned for each sub-dataset, so N * num_workers processes were created in total), bottlenecks caused by collating on the main thread, and so on, not to mention that each loader's code replicated operations already implemented in the other loaders.

With the improvements in this PR, a single PyTorch DataLoader is created on top of a newly implemented Sampler that returns the indices in the desired order. The new mechanism supports distributed sampling, ensures that the correct number of workers is created, and makes flags such as persistent_workers and pin_memory behave as expected. In addition, mini-batch collation is done at the DataLoader level, removing a problematic bottleneck.
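To illustrate the idea, here is a minimal, hypothetical sketch of such a sampler (the class name, parameters, and stopping policy are assumptions for illustration, not the actual Avalanche API). Given the lengths of several datasets that have been concatenated into one, it yields index batches that are the concatenation of fixed-size sub-mini-batches, one slice per dataset, so a single DataLoader can do all the fetching and collation:

```python
import random
from typing import Iterator, List, Sequence


class InterleavedBatchSampler:
    """Sketch of a batch sampler over a concatenated dataset: each yielded
    batch contains sub_batch_sizes[i] indices from the i-th sub-dataset."""

    def __init__(self, dataset_lengths: Sequence[int],
                 sub_batch_sizes: Sequence[int],
                 shuffle: bool = True, seed: int = 0) -> None:
        assert len(dataset_lengths) == len(sub_batch_sizes)
        self.dataset_lengths = list(dataset_lengths)
        self.sub_batch_sizes = list(sub_batch_sizes)
        self.shuffle = shuffle
        self.seed = seed
        # Offset of each sub-dataset inside the concatenated index space.
        self.offsets = [0]
        for n in self.dataset_lengths:
            self.offsets.append(self.offsets[-1] + n)

    def __iter__(self) -> Iterator[List[int]]:
        rng = random.Random(self.seed)
        # Per-dataset index pools, shifted into the concatenated index space.
        pools: List[List[int]] = []
        for off, n in zip(self.offsets, self.dataset_lengths):
            idx = list(range(off, off + n))
            if self.shuffle:
                rng.shuffle(idx)
            pools.append(idx)
        cursors = [0] * len(pools)
        # One possible policy: stop when any sub-dataset is exhausted.
        while all(c + b <= len(p)
                  for c, b, p in zip(cursors, self.sub_batch_sizes, pools)):
            batch: List[int] = []
            for i, b in enumerate(self.sub_batch_sizes):
                batch.extend(pools[i][cursors[i]:cursors[i] + b])
                cursors[i] += b
            yield batch

    def __len__(self) -> int:
        return min(n // b for n, b in
                   zip(self.dataset_lengths, self.sub_batch_sizes))
```

With PyTorch, an object like this can be passed as `batch_sampler` to a single `DataLoader` built over a `ConcatDataset`, so worker processes, pin_memory, and collation are all handled once, in the loader, rather than per sub-dataset on the main thread.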

Note: experiment results may differ slightly from the previous implementation. This is not a bug: it is related to the order in which random number generators are consumed, which may lead to a different choice of replay examples in the ReplayPlugin. Setting shuffle=False, or setting a manual seed in the ReplayPlugin update method, reproduces the results of the previous loader implementations.

@coveralls commented Jun 23, 2023

Pull Request Test Coverage Report for Build 5356470892

  • 170 of 194 (87.63%) changed or added relevant lines in 4 files are covered.
  • 8 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.04%) to 72.525%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
avalanche/benchmarks/utils/detection_dataset.py | 2 | 6 | 33.33%
avalanche/benchmarks/utils/data_loader.py | 162 | 182 | 89.01%

Files with Coverage Reduction | New Missed Lines | %
avalanche/benchmarks/utils/data_loader.py | 2 | 86.72%
avalanche/benchmarks/utils/collate_functions.py | 6 | 61.9%

Totals
Change from base Build 5332470848: +0.04%
Covered Lines: 16300
Relevant Lines: 22475

💛 - Coveralls

@AntonioCarta (Collaborator)

Thanks, this is much cleaner than the previous solution.

@AntonioCarta AntonioCarta merged commit 427d398 into ContinualAI:master Jun 26, 2023
18 checks passed

3 participants