
[WIP] Prefetching feature #17

Draft: wants to merge 3 commits into base: main
Conversation

AzHicham (Contributor) commented Sep 1, 2023

This is a first working implementation, but with some drawbacks:

  • The Dataloader is consumed, so we cannot iterate multiple times over it

EDIT:
Deadlock should be avoided thanks to crossbeam's select! macro

Review comment on the signature change from:

    pub fn iter(&self) -> SingleProcessDataLoaderIter<'_, D, S, C> {

to:

    /// Return owning iterator over the dataloader.
    /// TODO: Find a way to not consume the Dataloader
    pub fn iter(self) -> SingleProcessDataLoaderIter<D, S, C> {
AzHicham (Contributor, Author):

DataLoaderBuilder can return an Arc<DataLoader> instead of DataLoader; this way SingleProcessDataLoaderIter will not consume the Dataloader (which is moved into a thread).

AzHicham (Contributor, Author):

See the next commit, where the dataset is stored as an Arc inside the Dataloader.

Tudyx (Owner):

Yeah, I also arrived at an Arc when iterating while designing the API. In the end, I chose to mimic what standard iterators do:

  • Dataloader::into_iter takes ownership
  • Dataloader::iter and Dataloader::iter_mut don't

I thought it would be less surprising for the end user and would also have lower overhead.
Is the Arc necessary for your change, or can you go with iter if you don't want to consume the dataset?

AzHicham (Contributor, Author) commented Sep 5, 2023:

Unfortunately, in order to not consume the Dataloader, as in
let mut iter = dataloader.iter();
and be able to iterate over it again, I didn't find a solution without using Arc, because thread::spawn moves the variables it captures (dataloader, sampler, ...).

But IMO we should not be impacted by the overhead during iteration; by the way, burn-rs uses Arc a lot for multithreaded dataloading.

AzHicham (Contributor, Author) commented Sep 5, 2023:

Another solution could be to remove fn iter (or make it private) on the Dataloader and keep only into_iter; then we can move the Dataset & Sampler into the spawned thread without using an Arc. But that seems too restrictive, WDYT?

AzHicham (Contributor, Author):

FYI @Tudyx, I have two commits in this branch now:

  • One with Arc
  • One without -> but then we must consume the Dataloader in order to create the iterator

AzHicham (Contributor, Author) commented Sep 5, 2023

Also, on another topic, @Tudyx:
Maybe we could have both a BatchSampler and a BatchSamplerExact, and then use generics in the Dataloader & Builder instead of the boolean value drop_last?

Also, impl ExactSizeIterator when possible.

@AzHicham AzHicham force-pushed the haz/prefetching branch 8 times, most recently from 547e353 to 3b390da Compare September 5, 2023 20:12
Tudyx (Owner) commented Sep 5, 2023

> Maybe we could have both a BatchSampler and a BatchSamplerExact, and then use generics in the Dataloader & Builder instead of the boolean value drop_last?

That's another possible design. I think having a drop_last function is a pretty good API though: it's discoverable through auto-completion, the end user doesn't need to be aware of the underlying BatchSampler, and it's consistent with the PyTorch API.

> Also, impl ExactSizeIterator when possible.

Totally agree! I think this could be a good subject for a separate PR.

AzHicham (Contributor, Author) commented Sep 5, 2023

> That's another possible design. I think having a drop_last function is a pretty good API though: it's discoverable through auto-completion, the end user doesn't need to be aware of the underlying BatchSampler, and it's consistent with the PyTorch API.

My bad, I think it's better to keep it that way :) with a simple API.

EDIT: @Tudyx
Another solution could be to do what you did for pub fn shuffle(self) -> Builder<D, RandomSampler, C>:
two functions, which implies having a trait for the BatchSampler:
pub fn drop_last(self) -> Builder<D, BatchSamplerExact, S, C>
pub fn keep_last(self) -> Builder<D, BatchSampler, S, C>

EDIT 2:
I tried an impl with BatchSamplerTrait<S: Sampler>, but it becomes more complicated :/

@AzHicham AzHicham force-pushed the haz/prefetching branch 4 times, most recently from 16565b1 to e68e8ac Compare September 5, 2023 20:56
AzHicham (Contributor, Author) commented Sep 5, 2023

Note:
Find a way to cancel the background (prefetching) thread if the dataloader is dropped, without deadlocking!

@AzHicham AzHicham force-pushed the haz/prefetching branch 5 times, most recently from f74b62f to 8db9d2b Compare September 7, 2023 12:30
AzHicham (Contributor, Author):

@Tudyx
I found a way to avoid the deadlock by using crossbeam's amazing select! macro.
WDYT?

@Tudyx Tudyx mentioned this pull request Apr 1, 2024