Conversation

@adricl
Contributor

@adricl adricl commented Nov 8, 2025

@Natooz So this is an example of what I was thinking for parallelisation.
This change has already resulted in a 40% speed increase when splitting the maestro dataset.

I have added two parameters: one is the number of threads you want, and the other is how many files are processed in each batch. This will require some tuning on the user's side, but the results are great.

The only issue is that warning messages come up from the tokenizers:

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
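A minimal sketch of the second option, assuming the Hugging Face `tokenizers` library: the variable has to be set before the tokenizer is used or any worker process is forked, otherwise the library has already chosen its parallelism mode.

```python
import os

# Set this before importing/using the tokenizer and before any fork,
# otherwise the "process just got forked" warning is already triggered.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```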


📚 Documentation preview 📚: https://miditok--260.org.readthedocs.build/en/260/

@Natooz
Owner

Natooz commented Nov 8, 2025

This looks good! Thank you for working on it.

Did you test with different chunk sizes and if so do you have any idea of a good default value?

I'll probably exclude the abc file from the test files (due to symusic) soon.
I can take care of the lint if you want.
Do you plan to do other changes on this branch?

@adricl
Contributor Author

adricl commented Nov 9, 2025

I was going to handle all the other functions we talked about for parallelisation on this branch.

So don't worry about the linting till the end. I will also try to clean it all up, but there might be some stuff at the end.

As for the chunk size, I did some experimenting, but after thinking about it I think it should be `int(len(files_paths) / parallel_workers_size)`, as this spreads the chunks evenly over the threads and reduces the need to spawn another thread.

I took the default formula `min(32, cpu_count() + 4)` from here: tqdm docs
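Combining the two defaults above into one sketch (the helper name is hypothetical; for reference, `min(32, cpu_count() + 4)` is also the default `max_workers` of `concurrent.futures.ThreadPoolExecutor` since Python 3.8):

```python
from os import cpu_count


def default_parallel_params(num_files: int) -> tuple[int, int]:
    """Hypothetical helper combining the two defaults discussed above."""
    # Default worker count: cpu_count() + 4, capped at 32.
    workers = min(32, (cpu_count() or 1) + 4)
    # Spread the files evenly over the workers, at least one file per chunk.
    chunk_size = max(1, num_files // workers)
    return workers, chunk_size
```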

@adricl
Contributor Author

adricl commented Nov 9, 2025

Before Change

File splitting completed in 83.25 seconds with 1 parallel worker.

After Change

File splitting completed in 61.65 seconds with 10 parallel workers and a chunk size of 10.
File splitting completed in 60.16 seconds with 10 parallel workers and a chunk size of 128.
File splitting completed in 60.16 seconds with 18 parallel workers and a chunk size of 71.

This was against the entire maestro-v3.0.0 dataset (1,276 files).

@adricl
Contributor Author

adricl commented Nov 10, 2025

DatasetMIDI parallelised pre-tokenization on 1,759,104 files takes 1,794 seconds with 1 thread and 227 seconds with 16 threads, roughly a 7.9x speed-up.

@Natooz Natooz marked this pull request as ready for review November 10, 2025 17:29
@Natooz
Owner

Natooz commented Nov 10, 2025

Quite a notable speed up, congrats!

@adricl adricl changed the title WIP: Adding parallelisation to splitting files in Miditok Adding parallelisation to splitting files in Miditok Dec 15, 2025
@adricl
Contributor Author

adricl commented Dec 15, 2025

OK, I think it's good to go now. Please take a look.

I changed the parallelisation code: the base case of a single parallel worker now falls back to a plain tqdm loop. This avoids the issue where concurrent.futures takes ages to create the worker threads, so the original speed is preserved for this scenario.
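A sketch of that dispatch logic, assuming a ThreadPoolExecutor and an illustrative `process_file` callable (not the actual miditok code):

```python
from concurrent.futures import ThreadPoolExecutor


def process_all(files_paths, process_file, parallel_workers_size=1, chunk_size=1):
    if parallel_workers_size <= 1:
        # Base case: a plain loop (wrapped in tqdm in the PR for a progress
        # bar) skips the executor start-up cost entirely.
        return [process_file(path) for path in files_paths]
    with ThreadPoolExecutor(max_workers=parallel_workers_size) as pool:
        return list(pool.map(process_file, files_paths, chunksize=chunk_size))
```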

Owner

@Natooz Natooz left a comment


Thank you for these last changes!
Everything looks good to me, except that we would benefit from having the default number of workers as a constant.

out_path: Path | str | None = None,
copy_original_in_new_location: bool = True,
save_data_aug_report: bool = True,
parallel_workers_size: int = min(32, cpu_count() + 4)
Owner


Suggest to have the default max workers as a constant in constants.py, maybe offset (4) too
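The suggested refactor might look like this (a sketch with hypothetical constant names, not necessarily what was merged):

```python
from os import cpu_count

# In constants.py (hypothetical names):
MAX_NUM_PARALLEL_WORKERS = 32
PARALLEL_WORKERS_OFFSET = 4

# At the call sites, instead of repeating min(32, cpu_count() + 4):
DEFAULT_NUM_PARALLEL_WORKERS = min(
    MAX_NUM_PARALLEL_WORKERS, (cpu_count() or 1) + PARALLEL_WORKERS_OFFSET
)
```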

validation_fn: Callable[[Score], bool] | None = None,
save_programs: bool | None = None,
verbose: bool = True,
parallel_workers_size: int = min(32, cpu_count() + 4)
Owner


same here if using constants

| None = None,
sample_key_name: str = "input_ids",
labels_key_name: str = "labels",
parallel_workers_size: int = min(32, cpu_count() + 4)
Owner


Same

num_overlap_bars: int = 1,
min_seq_len: int | None = None,
preprocessing_method: callable[Score, Score] | None = None,
parallel_workers_size: int = min(32, cpu_count() + 4)
Owner


Here too

@Natooz Natooz merged commit 64bee84 into Natooz:main Dec 16, 2025
2 of 12 checks passed