Conversation

@adricl
Contributor

@adricl adricl commented Nov 8, 2025

@Natooz So this is an example of what I was thinking for parallelisation.
This change has already resulted in a 40% speed increase when splitting the maestro dataset.

I have added two parameters: one is the number of threads you want, and the other is how many files are processed in each batch. This will require some tuning on the user's side, but the results are great.

The only issue is that warning messages come up from the tokenizers:

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
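A minimal sketch of the second option, assuming the Hugging Face `tokenizers` library: the variable has to be set before the tokenizer is used or any worker process is forked, otherwise the library has already chosen its parallelism mode.

```python
import os

# Set this before importing/using the tokenizer and before any fork,
# otherwise the "process just got forked" warning is already triggered.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```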


📚 Documentation preview 📚: https://miditok--260.org.readthedocs.build/en/260/

@Natooz
Owner

Natooz commented Nov 8, 2025

This looks good! Thank you for working on it.

Did you test with different chunk sizes and if so do you have any idea of a good default value?

I'll probably exclude the abc file from the test files (due to symusic) soon.
I can take care of the lint if you want.
Do you plan to do other changes on this branch?

@adricl
Contributor Author

adricl commented Nov 9, 2025

I was going to handle all the other functions we talked about for parallelisation on this branch.

So don't worry about the linting till the end. I will also try to clean it all up, but there might be some stuff at the end.

As for the chunk size, I did some experimenting, but after thinking about it I think it should be `int(len(files_paths) / parallel_workers_size)`, as this spreads the chunks evenly over the threads and reduces the need to spawn another thread.

I took the default formula `min(32, cpu_count() + 4)` from here: tqdm docs
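Combining the two defaults above into one sketch (the helper name is hypothetical; for reference, `min(32, cpu_count() + 4)` is also the default `max_workers` of `concurrent.futures.ThreadPoolExecutor` since Python 3.8):

```python
from os import cpu_count


def default_parallel_params(num_files: int) -> tuple[int, int]:
    """Hypothetical helper combining the two defaults discussed above."""
    # Default worker count: cpu_count() + 4, capped at 32.
    workers = min(32, (cpu_count() or 1) + 4)
    # Spread the files evenly over the workers, at least one file per chunk.
    chunk_size = max(1, num_files // workers)
    return workers, chunk_size
```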

@adricl
Contributor Author

adricl commented Nov 9, 2025

Before Change

File splitting completed in 83.25 seconds with 1 parallel worker.

After Change

File splitting completed in 61.65 seconds with 10 parallel workers and a chunk size of 10.
File splitting completed in 60.16 seconds with 10 parallel workers and a chunk size of 128.
File splitting completed in 60.16 seconds with 18 parallel workers and a chunk size of 71.

This was against the entire maestro-v3.0.0 dataset (1,276 files).

@adricl
Contributor Author

adricl commented Nov 10, 2025

DatasetMIDI parallelised pre-tokenization on 1,759,104 files takes 1,794 seconds with 1 thread and 227 seconds with 16 threads, roughly a 7.9x speed-up.

@Natooz Natooz marked this pull request as ready for review November 10, 2025 17:29
@Natooz
Owner

Natooz commented Nov 10, 2025

Quite a notable speed up, congrats!

@adricl adricl changed the title WIP: Adding parallelisation to splitting files in Miditok Adding parallelisation to splitting files in Miditok Dec 15, 2025
@adricl
Contributor Author

adricl commented Dec 15, 2025

OK, I think it's good to go now. Please take a look.

I changed the parallelisation code: the base case of a single parallel worker now falls back to a plain tqdm loop. This avoids the issue where concurrent.futures takes ages to create the worker threads, so the original speed is preserved for this scenario.
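A sketch of that dispatch logic, assuming a ThreadPoolExecutor and an illustrative `process_file` callable (not the actual miditok code):

```python
from concurrent.futures import ThreadPoolExecutor


def process_all(files_paths, process_file, parallel_workers_size=1, chunk_size=1):
    if parallel_workers_size <= 1:
        # Base case: a plain loop (wrapped in tqdm in the PR for a progress
        # bar) skips the executor start-up cost entirely.
        return [process_file(path) for path in files_paths]
    with ThreadPoolExecutor(max_workers=parallel_workers_size) as pool:
        return list(pool.map(process_file, files_paths, chunksize=chunk_size))
```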

Owner

@Natooz Natooz left a comment


Thank you for these last changes!
Everything looks good to me, except that we would benefit from having the default number of workers as a constant.

out_path: Path | str | None = None,
copy_original_in_new_location: bool = True,
save_data_aug_report: bool = True,
parallel_workers_size: int = min(32, cpu_count() + 4)
Owner


Suggest to have the default max workers as a constant in constants.py, maybe offset (4) too
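The suggested refactor might look like this (a sketch with hypothetical constant names, not necessarily what was merged):

```python
from os import cpu_count

# In constants.py (hypothetical names):
MAX_NUM_PARALLEL_WORKERS = 32
PARALLEL_WORKERS_OFFSET = 4

# At the call sites, instead of repeating min(32, cpu_count() + 4):
DEFAULT_NUM_PARALLEL_WORKERS = min(
    MAX_NUM_PARALLEL_WORKERS, (cpu_count() or 1) + PARALLEL_WORKERS_OFFSET
)
```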

validation_fn: Callable[[Score], bool] | None = None,
save_programs: bool | None = None,
verbose: bool = True,
parallel_workers_size: int = min(32, cpu_count() + 4)
Owner


same here if using constants

| None = None,
sample_key_name: str = "input_ids",
labels_key_name: str = "labels",
parallel_workers_size: int = min(32, cpu_count() + 4)
Owner


Same

num_overlap_bars: int = 1,
min_seq_len: int | None = None,
preprocessing_method: callable[Score, Score] | None = None,
parallel_workers_size: int = min(32, cpu_count() + 4)
Owner


Here too

@Natooz Natooz merged commit 64bee84 into Natooz:main Dec 16, 2025
2 of 12 checks passed