In [1]:
import miditok
import pandas as pd
import os
import pickle
import shutil
import gc

from miditoolkit import MidiFile
from miditok import REMI, TokenizerConfig
from miditok.pytorch_data import DatasetMIDI, DataCollator
from miditok.data_augmentation import augment_dataset
from torch.utils.data import DataLoader
from pathlib import Path
from tqdm import tqdm



In this notebook, we augment the MIDI files locally, because I/O on Google Colab is slow. We begin by building a list of all training MIDI file paths and creating a directory for the augmented files.

In [7]:
root_midi_folder = Path("../MAESTRO dataset/maestro-v3.0.0-midi/maestro-v3.0.0")

In [8]:
augmented_out_path = Path("../saved_data/datasets/train/augmented")
augmented_out_path.mkdir(parents=True, exist_ok=True)

In [9]:
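# `metadata` is assumed to be the MAESTRO metadata DataFrame
# (with "split" and "midi_filename" columns) loaded in an earlier cell.
midi_paths_train = []

for filename in metadata[metadata["split"] == "train"]["midi_filename"]:
    # Search the dataset tree for this file and keep the first match, if any.
    found_files = list(root_midi_folder.rglob(filename))
    if found_files:
        midi_paths_train.append(str(found_files[0]))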
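
Calling `rglob()` once per file walks the directory tree every time, and files that are not found are skipped silently. As a possible alternative, here is a minimal sketch that assumes the `midi_filename` values are paths relative to the dataset root (for example `2004/some_file.midi`). The helper name `resolve_midi_paths` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def resolve_midi_paths(root, relative_names):
    """Split dataset-relative file names into existing paths and missing names.

    Assumes each name is relative to `root`; this is an assumption about the
    metadata layout, not something the notebook itself states.
    """
    root = Path(root)
    found, missing = [], []
    for name in relative_names:
        candidate = root / name
        if candidate.is_file():
            found.append(str(candidate))
        else:
            missing.append(name)
    return found, missing
```

It could be used as `midi_paths_train, missing = resolve_midi_paths(root_midi_folder, train_names)`, where `train_names` stands for the filtered `midi_filename` column. Inspecting `missing` makes it obvious when the metadata and the files on disk disagree.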

In [10]:
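# Copy the training files into one flat directory so that the
# augmentation step only sees the training split.
train_temp_dir = Path("../MAESTRO dataset/train_temp")
train_temp_dir.mkdir(parents=True, exist_ok=True)

for file_path in midi_paths_train:
    shutil.copy(file_path, train_temp_dir)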

With the setup in place, we can run the augmentation itself. We use MidiTok's `augment_dataset()` function with the configuration shown below:

In [11]:
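augment_dataset(
    data_path=train_temp_dir,
    pitch_offsets=[-12, 12],      # semitones: one octave down, one octave up
    velocity_offsets=[-4, 5],     # MIDI velocity units
    duration_offsets=[-0.5, 1],   # assumed to be in beats (MidiTok's default unit)
    out_path=augmented_out_path
)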

Performing data augmentation: 100%|██████████| 962/962 [00:09<00:00, 99.09it/s]
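
To make the pitch offsets concrete: an offset of ±12 semitones moves every note one octave down or up. The sketch below illustrates this on raw MIDI pitch numbers only. It is not MidiTok's implementation, the function name `transpose` is hypothetical, and dropping notes that fall outside the valid MIDI range 0 to 127 is an assumption made for the illustration.

```python
def transpose(pitches, offset):
    """Shift MIDI pitch numbers by `offset` semitones.

    Notes that would leave the valid MIDI range [0, 127] are dropped
    (an illustrative choice, not necessarily what MidiTok does).
    """
    shifted = (pitch + offset for pitch in pitches)
    return [pitch for pitch in shifted if 0 <= pitch <= 127]


middle_c_and_high_note = [60, 120]
octave_up = transpose(middle_c_and_high_note, 12)     # 120 + 12 exceeds 127
octave_down = transpose(middle_c_and_high_note, -12)
```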


All that remains is to collect the paths of the augmented files and combine them with the original training paths into a single list:

In [12]:
augmented_midi_files = [str(p) for p in augmented_out_path.glob("*.midi")]

In [13]:
combined_train_files = midi_paths_train + augmented_midi_files
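
Depending on the MidiTok version and its options, `augment_dataset()` may also copy the original files into the output directory. If that happens, plain concatenation lists those pieces twice. The following is a minimal sketch of a guard against this, assuming that a copied original keeps its original file name. The helper name `combine_without_duplicates` is hypothetical.

```python
from pathlib import Path


def combine_without_duplicates(original_paths, augmented_paths):
    """Concatenate two path lists, skipping augmented-directory entries
    whose file name matches one of the original files."""
    original_names = {Path(p).name for p in original_paths}
    extra = [p for p in augmented_paths if Path(p).name not in original_names]
    return list(original_paths) + extra
```

For example, `combine_without_duplicates(midi_paths_train, augmented_midi_files)` would return the original paths followed only by the files in the augmented directory that are not copies of an original.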