Dataset splitting not working #3

Closed
danuta-w opened this issue May 22, 2023 · 3 comments

Comments

@danuta-w

Hi,

When setting up training for an ANIMAL-SPOT binary classification, I get a strange error. The dataset does not seem to be split according to the values specified in main.py. As the log messages below show, the training set contains 0 files, whereas the validation and test splits contain all of the files.

When I run the script multiple times, it is random whether the train, val, or test dataset ends up empty. In every re-run one of the three contains 0 files, which results in the error below.

What can I do?

Greetings,
Danuta

15:18:01|I|Found 6878 audio files for training.
15:18:01|I|Model predict 2 classes
15:18:01|D|Generating /home/scb/scripts/ANIMAL-SPOT/ANIMAL-DATA/val.csv
15:18:01|D|Generating /home/scb/scripts/ANIMAL-SPOT/ANIMAL-DATA/bkp/test.csv
15:18:01|I|Init dataset train...
15:18:01|D|Number of files : 0
15:18:01|D|Init augmentation transforms for time and pitch shift
15:18:01|D|No noise augmentation
15:18:01|D|Init min-max-normalization activated
15:18:01|I|Init dataset val...
15:18:01|D|Number of files : 4184
15:18:01|D|Number of samples in val for noise: 3455
15:18:01|D|Number of samples in val for target: 729
15:18:01|D|Running without augmentation
15:18:01|D|Init min-max-normalization activated
15:18:01|I|Init dataset test...
15:18:01|D|Number of files : 2694
15:18:01|D|Number of samples in test for noise: 2545
15:18:01|D|Number of samples in test for target: 149
Traceback (most recent call last):
  File "/home/scb/scripts/ANIMAL-SPOT/ANIMAL-SPOT//main.py", line 454, in <module>
    dataloaders = {
15:18:01|D|Running without augmentation
15:18:01|D|Init min-max-normalization activated
  File "/home/scb/scripts/ANIMAL-SPOT/ANIMAL-SPOT//main.py", line 455, in <dictcomp>
    split: torch.utils.data.DataLoader(
  File "/home/scb/miniconda3/envs/animalspot/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 351, in __init__
    sampler = RandomSampler(dataset, generator=generator)  # type: ignore[arg-type]
  File "/home/scb/miniconda3/envs/animalspot/lib/python3.10/site-packages/torch/utils/data/sampler.py", line 107, in __init__
    raise ValueError("num_samples should be a positive integer "
ValueError: num_samples should be a positive integer value, but got num_samples=0
@ChristianBergler
Owner

Hi Danuta,
I don't know your data structure, but I have a guess as to what went wrong. ANIMAL-SPOT internally expects the filename structure "label_id_year_tape_startlabeltime_endlabeltime" ... From the "year" and "tape" information it builds a set of "recording tapes" for the given data. A recording tape is always the combination of year and tape name. When ANIMAL-SPOT splits the data (automatically), it makes sure that NONE of the tapes are shared across partitions, in order to avoid "cheating": audio data from the same tape, distributed across training and test, makes things easier for the model, because it has already seen that data during training.

I think this is your problem. Very likely your data contains only a few distinct tapes, so ANIMAL-SPOT puts each whole tape into one of the buckets and nothing is left for the remaining buckets. If you don't have more distinct tapes and everything comes from, e.g., one recording, you can also "fool" ANIMAL-SPOT by assigning the "year_tape" information in an artificial, random way to simulate different recording tapes. That should solve your problem.
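To check whether too few distinct tapes is the cause, one can count the "year_tape" combinations in the data. The following is a minimal sketch, not ANIMAL-SPOT's own code. It assumes that the file stem has exactly six underscore-separated fields in the order given above; the helper names `tape_key` and `files_per_tape` and the example file names are hypothetical.

```python
from collections import Counter
from pathlib import Path


def tape_key(filename: str) -> str:
    """Return the "year_tape" part of a file name.

    Assumption: the stem has exactly six underscore-separated fields,
    label_id_year_tape_startlabeltime_endlabeltime.
    """
    fields = Path(filename).stem.split("_")
    if len(fields) != 6:
        raise ValueError(f"unexpected file name structure: {filename}")
    return f"{fields[2]}_{fields[3]}"


def files_per_tape(filenames):
    """Count how many files belong to each distinct tape."""
    return Counter(tape_key(name) for name in filenames)


# Hypothetical file names, for illustration only.
example = [
    "target_0001_2023_tapeA_1000_2000.wav",
    "noise_0002_2023_tapeA_2000_3000.wav",
    "noise_0003_2023_tapeB_0_1000.wav",
]
counts = files_per_tape(example)
print(len(counts), "distinct tapes:", dict(counts))
```

If whole tapes are never divided between partitions, then a data set with fewer distinct tapes than partitions (train, val, test) must leave at least one partition empty, which would match the "Number of files : 0" line in the log.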

@danuta-w
Author

danuta-w commented May 24, 2023 via email

@danuta-w
Author

Hi again,

The training works with the randomized tape names. Thanks again!

Best,
Danuta
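For readers with the same problem, the renaming workaround could look roughly like the sketch below. It is an illustration under stated assumptions, not the script used in this thread: it assumes six underscore-separated fields in the file stem with the tape name in the fourth field, and the function name `pseudo_tape_name`, the `pseudo###` naming scheme, and the number of pseudo-tapes are all hypothetical choices.

```python
import random
from pathlib import Path


def pseudo_tape_name(filename, n_tapes=20, rng=None):
    """Return the path with its tape field replaced by a random pseudo-tape.

    Assumption: the stem is label_id_year_tape_startlabeltime_endlabeltime,
    so the tape name is the fourth underscore-separated field.
    """
    rng = rng or random
    path = Path(filename)
    fields = path.stem.split("_")
    if len(fields) != 6:
        raise ValueError(f"unexpected file name structure: {filename}")
    # The pseudo-tape name must not contain an underscore itself.
    fields[3] = f"pseudo{rng.randrange(n_tapes):03d}"
    return path.with_name("_".join(fields) + path.suffix)


# Hypothetical file name; a seeded generator keeps the mapping reproducible.
rng = random.Random(0)
old = "target_0001_2023_tapeA_1000_2000.wav"
new = pseudo_tape_name(old, n_tapes=5, rng=rng)
print(old, "->", new.name)

# To apply this on disk, one could call Path(old).rename(new) for each file,
# ideally on a copy of the data.
```

Note the trade-off described above: once files from one real recording are spread over several pseudo-tapes, the split no longer keeps that recording out of the test set, so test scores may look better than they would on unseen recordings.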
