ShuffledDataset's get() called twice per iteration in BatchDataloaderIterator #1426

Closed
L-M-Sherlock opened this issue Mar 6, 2024 · 2 comments
Labels
bug Something isn't working dataset Related to `burn-dataset`

Comments

@L-M-Sherlock

L-M-Sherlock commented Mar 6, 2024

Describe the bug
ShuffledDataset's get() is called twice for each self.dataset.get(self.current_index) call in BatchDataloaderIterator.

This method is called twice:

    fn get(&self, index: usize) -> Option<I> {
        let index = match self.indices.get(index) {
            Some(index) => index,
            None => return None,
        };
        self.dataset.get(*index)
    }

each time this loop condition in BatchDataloaderIterator is evaluated:

while let Some(item) = self.dataset.get(self.current_index) {

I added two info!() calls, one in the get() method in random.rs and one in the get() method in dataset.rs. The former logs twice as many lines as the latter.

2024-03-06T18:29:35.381089Z  INFO burn_dataset::transform::random: type of self.dataset: "alloc::sync::Arc<dyn burn_dataset::dataset::base::Dataset<regression::dataset::DiabetesItem>>"    
2024-03-06T18:29:35.381105Z  INFO burn_dataset::transform::random: origial index: 0 new index: 120    
2024-03-06T18:29:35.381109Z  INFO regression::dataset: get 120    
2024-03-06T18:29:35.381111Z  INFO burn_dataset::transform::random: type of self.dataset: "burn_dataset::dataset::sqlite::SqliteDataset<regression::dataset::DiabetesItem>"    
2024-03-06T18:29:35.381114Z  INFO burn_dataset::transform::random: origial index: 120 new index: 411    
2024-03-06T18:29:35.381138Z  INFO burn_core::data::dataloader::batch: current index: 1    
2024-03-06T18:29:35.381142Z  INFO burn_dataset::transform::random: type of self.dataset: "alloc::sync::Arc<dyn burn_dataset::dataset::base::Dataset<regression::dataset::DiabetesItem>>"    
2024-03-06T18:29:35.381145Z  INFO burn_dataset::transform::random: origial index: 1 new index: 12    
2024-03-06T18:29:35.381147Z  INFO regression::dataset: get 12    
2024-03-06T18:29:35.381149Z  INFO burn_dataset::transform::random: type of self.dataset: "burn_dataset::dataset::sqlite::SqliteDataset<regression::dataset::DiabetesItem>"    
2024-03-06T18:29:35.381152Z  INFO burn_dataset::transform::random: origial index: 12 new index: 262    
2024-03-06T18:29:35.381167Z  INFO burn_core::data::dataloader::batch: current index: 2    
2024-03-06T18:29:35.381170Z  INFO burn_dataset::transform::random: type of self.dataset: "alloc::sync::Arc<dyn burn_dataset::dataset::base::Dataset<regression::dataset::DiabetesItem>>"    
2024-03-06T18:29:35.381173Z  INFO burn_dataset::transform::random: origial index: 2 new index: 81    
2024-03-06T18:29:35.381176Z  INFO regression::dataset: get 81    
2024-03-06T18:29:35.381178Z  INFO burn_dataset::transform::random: type of self.dataset: "burn_dataset::dataset::sqlite::SqliteDataset<regression::dataset::DiabetesItem>"    
2024-03-06T18:29:35.381180Z  INFO burn_dataset::transform::random: origial index: 81 new index: 72    
2024-03-06T18:29:35.381196Z  INFO burn_core::data::dataloader::batch: current index: 3    
2024-03-06T18:29:35.381205Z  INFO burn_dataset::transform::random: type of self.dataset: "alloc::sync::Arc<dyn burn_dataset::dataset::base::Dataset<regression::dataset::DiabetesItem>>"    
2024-03-06T18:29:35.381208Z  INFO burn_dataset::transform::random: origial index: 3 new index: 50    
2024-03-06T18:29:35.381211Z  INFO regression::dataset: get 50    
2024-03-06T18:29:35.381213Z  INFO burn_dataset::transform::random: type of self.dataset: "burn_dataset::dataset::sqlite::SqliteDataset<regression::dataset::DiabetesItem>"    
2024-03-06T18:29:35.381216Z  INFO burn_dataset::transform::random: origial index: 50 new index: 169    

Expected behavior

It should only be called once.

Additional context

My guess is that get() is bound to the ShuffledDataset itself, so the self.dataset.get(*index) call ends up calling the same method again.

@antimora antimora added bug Something isn't working dataset Related to `burn-dataset` labels Mar 6, 2024
@antimora antimora changed the title get() for ShuffledDataset is called twice for each self.dataset.get(self.current_index) in BatchDataloaderIterator ShuffledDataset's get() called twice per iteration in BatchDataloaderIterator Mar 29, 2024
@laggui
Member

laggui commented Aug 13, 2024

Sorry, this issue is a bit old. I was just going through the issues and stumbled upon it.

It seems you are using the simple regression example, which uses the ShuffledDataset.

When using a dataloader with a random seed for shuffling, the provided dataset is also wrapped by a ShuffledDataset.

The symptom you're observing is just that the first ShuffledDataset::get(index) call comes from the dataloader. That call invokes the regression dataset's get(index), which in turn calls self.dataset.get(index) on its own inner dataset (which is also a ShuffledDataset).

The number of items actually retrieved by the dataset should still correspond to its length, it's just the debug info you added that seems to be leading you down the wrong path 🙂
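
To illustrate the explanation above, here is a minimal, self-contained sketch of two nested shuffling wrappers. It does not use burn's actual types: the Dataset trait, VecDataset, Shuffled, and count_calls below are hypothetical stand-ins, and the fixed permutations replace real random shuffling. It only shows that each wrapper layer remaps the index once per item, so the wrapper-level log count is twice the number of items while the underlying storage is still read once per item.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for a dataset trait (not burn's actual trait).
trait Dataset<I> {
    fn get(&self, index: usize) -> Option<I>;
}

/// Leaf storage, standing in for something like a SQLite-backed dataset.
struct VecDataset {
    items: Vec<u32>,
    reads: Cell<usize>, // number of items actually read
}

impl Dataset<u32> for VecDataset {
    fn get(&self, index: usize) -> Option<u32> {
        let item = self.items.get(index).copied()?;
        self.reads.set(self.reads.get() + 1);
        Some(item)
    }
}

/// Wrapper that remaps indices through a fixed permutation.
struct Shuffled<D> {
    dataset: D,
    indices: Vec<usize>,
    lookups: Cell<usize>, // plays the role of the info!() line in random.rs
}

impl<D: Dataset<u32>> Dataset<u32> for Shuffled<D> {
    fn get(&self, index: usize) -> Option<u32> {
        let index = self.indices.get(index)?;
        self.lookups.set(self.lookups.get() + 1);
        self.dataset.get(*index)
    }
}

/// Iterates like the dataloader loop and returns
/// (outer wrapper lookups, inner wrapper lookups, leaf reads).
fn count_calls(n: usize) -> (usize, usize, usize) {
    let leaf = VecDataset {
        items: (0..n as u32).collect(),
        reads: Cell::new(0),
    };
    // Inner wrapper: the one created by the example's own dataset code.
    let inner = Shuffled {
        dataset: leaf,
        indices: (0..n).rev().collect(),
        lookups: Cell::new(0),
    };
    // Outer wrapper: the one added by the dataloader when shuffling is enabled.
    let outer = Shuffled {
        dataset: inner,
        indices: (0..n).map(|i| (i + 1) % n).collect(),
        lookups: Cell::new(0),
    };

    let mut current_index = 0;
    while let Some(_item) = outer.get(current_index) {
        current_index += 1;
    }

    (
        outer.lookups.get(),
        outer.dataset.lookups.get(),
        outer.dataset.dataset.reads.get(),
    )
}
```

Calling count_calls(4) in this sketch gives 4 lookups in each wrapper and 4 reads from the leaf dataset: 8 wrapper-level log lines for 4 retrieved items, with no wrapper calling itself.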

If you really think there is a bug somewhere, please let us know. Otherwise I will close this issue.

@L-M-Sherlock
Author

Thanks for that clarification. I have refactored my code a lot since then, so this is no longer an issue for me. I'll close it.
