Adding parallelisation to splitting files in Miditok #260
Conversation
This looks good! Thank you for working on it. Did you test with different chunk sizes, and if so, do you have any idea of a good default value? I'll probably exclude the abc file from the test files (due to symusic) soon.
I was going to do all the other functions we talked about for parallelisation on this branch, so don't worry about the linting till the end. I will also try to clean it all up, but there might be some stuff left at the end. As for the chunk size, I did some experimenting, but after thinking about it I believe it should be int(len(files_paths) / parallel_workers_size), as this spreads the chunks evenly over the threads and reduces the need for another thread to be spawned. I took the default formula for the worker count from Python's ThreadPoolExecutor (min(32, cpu_count() + 4)).
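A minimal sketch of that heuristic, assuming the parameter names from this discussion (the file list below is just a placeholder, not Miditok code):

```python
from os import cpu_count

# Default worker count: CPython's ThreadPoolExecutor formula for max_workers.
parallel_workers_size = min(32, (cpu_count() or 1) + 4)

# Hypothetical list of files to split, sized like maestro-v3.0.0.
files_paths = [f"song_{i}.mid" for i in range(1276)]

# Spread the files evenly over the workers so no extra chunk/worker
# has to be spawned for a small remainder.
chunk_size = max(1, int(len(files_paths) / parallel_workers_size))
```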
Before the change: file splitting completed in 83.25 seconds, with 1 parallel worker.
After the change: file splitting completed in 61.65 seconds, with 10 parallel workers and a chunk size of 10.
This was against the entire maestro-v3.0.0 dataset (1,276 files).
DatasetMIDI parallelisation: pre-tokenization on 1,759,104 files takes 1794 seconds with 1 thread and 227 seconds with 16 threads, roughly a 7x speed-up.
Quite a notable speed-up, congrats!
…erid message when no files are passed and the process_map spits out a confusing message
Ok, I think it's good to go now. Please take a look. I changed the parallelisation code: the base case of only 1 parallel worker defaults to a tqdm loop, since Python's concurrent.futures can take ages to create the parallelised threads, so this keeps the original speed for that scenario.
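As a rough illustration of that dispatch pattern (assuming tqdm's process_map; split_one is a hypothetical stand-in for the real per-file work, not Miditok's API):

```python
from tqdm import tqdm
from tqdm.contrib.concurrent import process_map

def split_one(file_path: str) -> str:
    return file_path  # stand-in for the real per-file splitting work

def split_all(files_paths, parallel_workers_size=1, chunk_size=1):
    if parallel_workers_size == 1:
        # Base case: a plain tqdm loop avoids the start-up cost of
        # creating a pool, keeping the original single-worker speed.
        return [split_one(p) for p in tqdm(files_paths)]
    # Otherwise fan the work out over a process pool, handing each
    # worker `chunk_size` files at a time.
    return process_map(
        split_one,
        files_paths,
        max_workers=parallel_workers_size,
        chunksize=chunk_size,
    )
```

Note that process-based pools require the mapped function to be picklable, which is why split_one is a module-level function rather than a lambda.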
Natooz left a comment
Thank you for these last changes!
Everything looks good to me, except that we would benefit from having the default number of workers defined as a constant
out_path: Path | str | None = None,
copy_original_in_new_location: bool = True,
save_data_aug_report: bool = True,
parallel_workers_size: int = min(32, cpu_count() + 4)
I suggest having the default max workers as a constant in constants.py, and maybe the offset (4) too
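For illustration, such constants might look like this in constants.py (the names here are placeholders, not Miditok's actual constants):

```python
from os import cpu_count

MAX_NUM_PARALLEL_WORKERS = 32
NUM_PARALLEL_WORKERS_OFFSET = 4  # same offset as ThreadPoolExecutor's default

# Evaluated once at import time.
DEFAULT_NUM_PARALLEL_WORKERS = min(
    MAX_NUM_PARALLEL_WORKERS, (cpu_count() or 1) + NUM_PARALLEL_WORKERS_OFFSET
)
```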
src/miditok/midi_tokenizer.py
Outdated
validation_fn: Callable[[Score], bool] | None = None,
save_programs: bool | None = None,
verbose: bool = True,
parallel_workers_size: int = min(32, cpu_count() + 4)
same here if using constants
src/miditok/pytorch_data/datasets.py
Outdated
| None = None,
sample_key_name: str = "input_ids",
labels_key_name: str = "labels",
parallel_workers_size: int = min(32, cpu_count() + 4)
Same
src/miditok/utils/split.py
Outdated
num_overlap_bars: int = 1,
min_seq_len: int | None = None,
preprocessing_method: callable[Score, Score] | None = None,
parallel_workers_size: int = min(32, cpu_count() + 4)
Here too
@Natooz So this is an example of what I was thinking of for parallelisation.
This change has already resulted in a 40% increase in speed when splitting the maestro dataset.
I have added two parameters: one is how many threads you want, and the other is how many files are processed in each batch. This will require some tuning on the user's side, but the results are great.
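A runnable toy showing the two knobs together; split_one is again a stub, and the real Miditok entry point and parameter names may differ:

```python
from pathlib import Path
from tqdm.contrib.concurrent import process_map

def split_one(file_path: Path) -> str:
    return file_path.stem  # stand-in for the real per-file splitting

if __name__ == "__main__":  # required for process pools on spawn platforms
    files_paths = list(Path("maestro-v3.0.0").glob("**/*.mid*"))
    n_workers = 10                                 # knob 1: worker count
    chunk = max(1, len(files_paths) // n_workers)  # knob 2: files per batch
    results = process_map(
        split_one, files_paths, max_workers=n_workers, chunksize=chunk
    )
```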
The only issue is that warning messages come up from the tokenizers:
📚 Documentation preview 📚: https://miditok--260.org.readthedocs.build/en/260/