# Dataset Creation

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Defining-input-data" data-toc-modified-id="Defining-input-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Defining input data</a></span></li><li><span><a href="#Choosing-and-configuring-degradations" data-toc-modified-id="Choosing-and-configuring-degradations-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Choosing and configuring degradations</a></span></li><li><span><a href="#Excerpt-length" data-toc-modified-id="Excerpt-length-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Excerpt length</a></span></li><li><span><a href="#Reformatted-data---piano-roll-and-command" data-toc-modified-id="Reformatted-data---piano-roll-and-command-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Reformatted data - piano roll and command</a></span></li><li><span><a href="#Cleaning-up-and-specifying-output-directory" data-toc-modified-id="Cleaning-up-and-specifying-output-directory-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Cleaning up and specifying output directory</a></span></li><li><span><a href="#Reproducibility" data-toc-modified-id="Reproducibility-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Reproducibility</a></span></li><li><span><a href="#Help" data-toc-modified-id="Help-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Help</a></span></li></ul></div>

We provide tools to create your own ACME datasets. You can:
* Use your own MIDI or CSV data, or pick from our preconfigured datasets, which are downloaded automatically
* Choose what types of degradation to include, and supply parameters for them
* Define how long the excerpts taken should be
* Create re-formatted, compressed representations of the data
* Ensure reproducibility

If you want to degrade data on-the-fly, we also provide a `Degrader()` class which can be used in conjunction with a dataloader. This is described in a subsequent notebook [04_data_parsers_and_degrader.ipynb](./04_data_parsers_and_degrader.ipynb).

In this notebook, we provide some example calls.

# Defining input data

We have three datasets which will automatically download if specified: `PPDDSep2018Monophonic`, `PPDDSep2018Polyphonic`, and `PianoMidi`. The default is to download and use them all. To not download them, set the `--datasets` argument to `None`.

This command uses default parameters to create an ACME dataset from just the `PianoMidi` dataset. It will take a few moments to run, and you can inspect the output in an adjacent folder called `./acme`.

In [1]:
! python ../make_dataset.py --datasets PianoMidi

No random seed supplied. Setting to 1172412958.
Loading data from downloaders, this could take a while...
Copying midi to /Users/jfowers/.mdtk_cache/PianoMidi/data: 100%|█| 328/328 [00:0
Loading data from PianoMidi: 100%|████████████| 328/328 [00:46<00:00,  7.02it/s]
Degrading data: 100%|█████████████████████████| 328/328 [00:05<00:00, 58.37it/s]
Creating command corpus: 100%|████████████████| 328/328 [00:12<00:00, 27.28it/s]
Creating pianoroll corpus: 100%|██████████████| 328/328 [00:07<00:00, 43.86it/s]


Count of degradations:
	* none: 36
	* pitch_shift: 37
	* time_shift: 37
	* onset_shift: 36
	* offset_shift: 36
	* remove_note: 37
	* add_note: 36
	* split_note: 37
	* join_notes: 36

You will find the generated data at /Users/jfowers/git/midi_degradation_toolkit/docs/acme with subdirectories
	* clean - contains the extracted clean excerpts
	* altered - contains the excerpts altered by the degradations described in metadata.csv

metadata.csv describes:
	* (the id number for) the type

If you don't want to use any automatic downloaders, you must **specify your own input data**. You can provide MIDI files or CSV data (see the introduction [01_the_ACME_dataset.ipynb](01_the_ACME_dataset.ipynb) for the expected CSV format).

The command below creates a very small dataset from a subset of the data that was cached when the first command in this notebook was run.

```bash
python make_dataset.py \
    --datasets None \
    --local-midi-dirs ~/.mdtk_cache/PianoMidi/brahms
```

# Choosing and configuring degradations

See the next notebook, [03_degradation_functions.ipynb](03_degradation_functions.ipynb), for a full description of all the degradations available.

This call again works with the small Brahms data and:
* leaves 44% of the data clean (no degradation is applied)
* selects only the `pitch_shift` and `time_shift` degradations
* attempts to apply these degradations at a ratio of 4 to 1 (degradations are sampled, so the realized ratio is approximate)
* sets some parameters for the `pitch_shift` degradation

```bash
python make_dataset.py \
    --datasets None \
    --local-midi-dirs ~/.mdtk_cache/PianoMidi/brahms \
    --clean-prop .44 \
    --degradations pitch_shift time_shift \
    --degradation-dist 4 1 \
    --degradation-kwargs '{"pitch_shift__min_pitch": 50, "pitch_shift__max_pitch": 80}' \
    --seed 42
```
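As a rough guide to what these settings imply, the sketch below computes the expected share of excerpts per label. It assumes that the clean proportion is taken first and the remainder is split according to the relative weights in `--degradation-dist`; this is an assumption about how the two arguments interact, not a description of the script's internals, and the helper `expected_proportions` is hypothetical.

```python
def expected_proportions(clean_prop, degradations, dist):
    """Return the expected share of excerpts for each label.

    Assumption: the clean share is set aside first, and the remainder is
    divided in proportion to the relative weights in `dist`.
    """
    total = sum(dist)
    shares = {"none": clean_prop}
    for name, weight in zip(degradations, dist):
        shares[name] = (1 - clean_prop) * weight / total
    return shares


shares = expected_proportions(0.44, ["pitch_shift", "time_shift"], [4, 1])
for name, share in shares.items():
    print(f"{name}: {share:.3f}")
# none: 0.440
# pitch_shift: 0.448
# time_shift: 0.112
```

Because degradations are sampled, the counts reported by the script for a small dataset may differ noticeably from these expected shares.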

Specifying the `--degradation-kwargs` as a JSON string can get finicky with quotes, so you can specify the path to a JSON file instead, e.g.:

```bash
python make_dataset.py \
    --datasets None \
    --local-midi-dirs ~/.mdtk_cache/PianoMidi/brahms \
    --clean-prop .44 \
    --degradations pitch_shift time_shift \
    --degradation-dist 4 1 \
    --degradation-kwargs deg_kwargs.json \
    --seed 42
```

where `deg_kwargs.json` is:
```json
{
    "pitch_shift__min_pitch": 50,
    "pitch_shift__max_pitch": 80
}
```
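If you prefer to generate this file from Python rather than write it by hand, a minimal sketch using only the standard library is shown here. It writes the same keys to `deg_kwargs.json` in the current directory and, for comparison, prints a shell-safe inline form of the same JSON produced with `shlex.quote`.

```python
import json
import shlex
from pathlib import Path

# The same keyword arguments as in the example above.
deg_kwargs = {"pitch_shift__min_pitch": 50, "pitch_shift__max_pitch": 80}

# Write the JSON file that --degradation-kwargs can point to.
path = Path("deg_kwargs.json")
path.write_text(json.dumps(deg_kwargs, indent=4))

# Alternatively, let shlex.quote handle the shell quoting of an inline string.
inline = shlex.quote(json.dumps(deg_kwargs))

print(f"--degradation-kwargs {path}")
print(f"--degradation-kwargs {inline}")
```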

# Excerpt length

You can define the minimum length for an excerpt in milliseconds and number of notes (both conditions are honoured). Note that the defaults are `5000` and `10` respectively. See `mdtk.df_utils.get_random_excerpt` for full details of how the excerpt selection is done.

This example produces excerpts of approximately 10 seconds in length, with a minimum of 20 notes in them.

```bash
python make_dataset.py \
    --datasets None \
    --local-midi-dirs ~/.mdtk_cache/PianoMidi/brahms \
    --excerpt-length 10000 \
    --min-notes 20
```
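To illustrate what it means for both conditions to hold at once, here is a small sketch. It is not the logic of `mdtk.df_utils.get_random_excerpt`: the helper `meets_minimums` is hypothetical, notes are represented as `(onset_ms, duration_ms)` tuples, and excerpt length is assumed to run from the first onset to the last offset.

```python
def meets_minimums(notes, min_length_ms=5000, min_notes=10):
    """Check a list of (onset_ms, duration_ms) notes against both minimums.

    Assumption: length is measured from the first onset to the last offset.
    """
    if not notes or len(notes) < min_notes:
        return False
    start = min(onset for onset, _ in notes)
    end = max(onset + dur for onset, dur in notes)
    return end - start >= min_length_ms


# Twelve notes, 500 ms apart, 400 ms each: the excerpt spans 0 to 5900 ms.
notes = [(i * 500, 400) for i in range(12)]

print(meets_minimums(notes))  # True: 12 >= 10 notes and 5900 >= 5000 ms
print(meets_minimums(notes, min_length_ms=10000, min_notes=20))  # False
```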

# Reformatted data - piano roll and command

By default, we create compressed data which is reformatted for easy loading into models. This can be turned off by setting `--formats None`.

```bash
python make_dataset.py \
    --datasets None \
    --local-midi-dirs ~/.mdtk_cache/PianoMidi/brahms \
    --formats None
```

We discuss the formatted data in a subsequent notebook: [04_data_parsers_and_degrader.ipynb](./04_data_parsers_and_degrader.ipynb).

# Cleaning up and specifying output directory

To remove any cached files, run `python make_dataset.py --clean`. You are prompted for confirmation by default; pass `--no-prompt` to skip the prompt. Note also that the output directory is deleted and recreated on every run of the script. Again, you are prompted before deletion unless `--no-prompt` is given. Alternatively, you can write the output to a new path with `--output-dir`. Examples:

```bash
python make_dataset.py --clean  # prompts user before deleting cache
python make_dataset.py --clean --no-prompt  # deletes cache with no prompt
python make_dataset.py --datasets PianoMidi  # create a fresh dataset
python make_dataset.py  # this raises a prompt to delete the old one located at ./acme
python make_dataset.py --output-dir ./new/output/dir
```

# Reproducibility

To ensure that you get the same result when you run the script again, set the `--seed` parameter. This must be an integer between `0` and `2**32 - 1`.

```bash
python make_dataset.py --seed 42
```
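The general idea behind seeding can be sketched with Python's standard `random` module. This is only an illustration: the helpers `check_seed` and `sample_labels` are hypothetical, and the random number generator that the script itself uses is not shown in this notebook.

```python
import random

MAX_SEED = 2**32 - 1  # upper bound stated above


def check_seed(seed):
    """Raise ValueError unless seed is an integer in [0, 2**32 - 1]."""
    if not isinstance(seed, int) or not 0 <= seed <= MAX_SEED:
        raise ValueError(f"seed must be an integer in [0, {MAX_SEED}]")
    return seed


def sample_labels(seed, n=5):
    """Draw n degradation labels from a generator seeded with `seed`."""
    rng = random.Random(check_seed(seed))
    labels = ["none", "pitch_shift", "time_shift"]
    return [rng.choice(labels) for _ in range(n)]


# The same seed reproduces the same sequence of draws.
assert sample_labels(42) == sample_labels(42)
```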

# Help

In [2]:
! ../make_dataset.py -h

usage: make_dataset.py [-h] [-o OUTPUT_DIR] [--config CONFIG]
                       [--formats [format [format ...]]]
                       [--local-midi-dirs [midi_dir [midi_dir ...]]]
                       [--local-csv-dirs [csv_dir [csv_dir ...]]]
                       [--recursive]
                       [--datasets [dataset_name [dataset_name ...]]]
                       [--degradations [deg_name [deg_name ...]]]
                       [--excerpt-length ms] [--min-notes N]
                       [--degradation-kwargs json_file_or_string]
                       [--degradation-dist [relative_probability [relative_probability ...]]]
                       [--clean-prop CLEAN_PROP] [--splits train valid test]
                       [--seed SEED] [--clean] [-v] [--no-prompt]

Make datasets of altered and corrupted midi excerpts.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                  