(datasets)=
# Datasets

Datasets are central to data-driven computational research on a music tradition. Carefully designed collections of data that represent the most relevant aspects of a musical repertoire can open the door to solutions for many problems. For that reason, considerable effort has been made within the scope of this tutorial and ``compiam`` to (1) boost the visibility of and access to Carnatic and Hindustani music datasets, and (2) provide standardized tools to get and use these datasets.

## `mirdata`
[`mirdata`](https://mirdata.readthedocs.io/en/stable/) is an open-source and pip-installable Python library that provides tools for working with common Music Information Retrieval (MIR) datasets {cite}`mirdata`. Given the importance of such software for data- and corpus-driven research, we have made a considerable effort to integrate several datasets centered on Indian Art Music (IAM) into `mirdata`. To date, the following datasets can be found in the latest `mirdata` release:

* Carnatic collection of Saraga {cite}`saraga`
* Hindustani collection of Saraga {cite}`saraga`
* Carnatic Varnam Dataset {cite}`carnatic_varnam`
* Carnatic Music Rhythm {cite}`carnatic_rhythm_dataset`
* Hindustani Music Rhythm {cite}`hindustani_rhythm_dataset`
* Indian Art Music Raga Dataset {cite}`raga_dataset`
* Mridangam Stroke Dataset {cite}`mridangam_stroke`
* Four-Way Tabla Dataset (ISMIR 2021) {cite}`4way_tabla`

`compiam` provides access to these datasets through the `mirdata` loaders. Check the [`mirdata` documentation](https://mirdata.readthedocs.io/en/stable/source/mirdata.html#dataset-loaders) to learn what the loaders can do.

```{note}
In our library, ``compiam.load_dataset()`` is an alias of the `mirdata` method ``mirdata.initialize()``. Use this wrapper to access the `mirdata` loaders for Indian Art Music datasets from `compiam`.
```

In [None]:
%pip install compiam

# Import compiam
import compiam

# Suppress warnings to keep the tutorial clean
import warnings
warnings.filterwarnings('ignore')

In [None]:
mridangam_stroke = compiam.load_dataset("mridangam_stroke")
mridangam_stroke.download()
mridangam_stroke.validate()

In [None]:
## Let's get a random track from the dataset!
mridangam_stroke.choice_track()

```{tip}
Run ``compiam.list_datasets()`` to list the datasets available for use.
```

### Why mirdata loaders?
Accessing the datasets through `mirdata` makes it easier to integrate them into our pipelines in a standardized way. For example:

In [None]:
import numpy as np

## Loading all tracks from the dataset
mridangam_tracks = mridangam_stroke.load_tracks()

## Get available strokes
available_strokes = np.unique(
    [mridangam_tracks[x].stroke_name for x in mridangam_stroke.track_ids]
)
available_strokes
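
Beyond listing the unique stroke names, it is often useful to know how many samples each stroke has, for instance to spot class imbalance before training a classifier. The sketch below shows one possible way to do this with `collections.Counter`. To stay runnable without downloading the dataset, it uses a hypothetical `FakeTrack` stand-in that only mimics the `stroke_name` attribute used above; the tracks and counts are made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FakeTrack:
    """Hypothetical stand-in for a mirdata track (illustration only)."""
    stroke_name: str


# Made-up tracks keyed by track id, mimicking the dict returned by load_tracks()
fake_tracks = {
    "t1": FakeTrack("bheem"),
    "t2": FakeTrack("ta"),
    "t3": FakeTrack("bheem"),
    "t4": FakeTrack("thom"),
}

# Count how many tracks carry each stroke label
stroke_counts = Counter(track.stroke_name for track in fake_tracks.values())

# Print the strokes from most to least frequent
for stroke, count in stroke_counts.most_common():
    print(f"{stroke}: {count}")
```

With the real data, the same counting would iterate over `mridangam_tracks.values()` instead of `fake_tracks.values()`.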

`mirdata` loaders take care of loading and organizing the data, so we do not need to write functions for that ourselves. In the example below, we create a dictionary whose keys are stroke names and whose values are lists of paths to the audio samples of that stroke.

In [None]:
stroke_dict = {item: [] for item in available_strokes}
for i in mridangam_stroke.track_ids:
    stroke_dict[mridangam_tracks[i].stroke_name].append(mridangam_tracks[i].audio_path)

stroke_dict['bheem'][0]
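
The same grouping can also be written without first computing the list of stroke names. The following is a minimal sketch using `collections.defaultdict`; `FakeTrack` and the file paths are hypothetical placeholders that only mimic the `stroke_name` and `audio_path` attributes used above, so that the snippet runs without the dataset.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FakeTrack:
    """Hypothetical stand-in exposing the two attributes used in the tutorial."""
    stroke_name: str
    audio_path: str


# Made-up tracks and paths, for illustration only
fake_tracks = {
    "t1": FakeTrack("bheem", "/tmp/example/bheem_01.wav"),
    "t2": FakeTrack("ta", "/tmp/example/ta_01.wav"),
    "t3": FakeTrack("bheem", "/tmp/example/bheem_02.wav"),
}

# Missing keys are created on first access, so no pre-filled dict is needed
paths_by_stroke = defaultdict(list)
for track in fake_tracks.values():
    paths_by_stroke[track.stroke_name].append(track.audio_path)

print(paths_by_stroke["bheem"][0])
```

The trade-off is that a `defaultdict` silently creates an empty list when an unknown stroke name is looked up, whereas the plain dictionary above raises a `KeyError`, which may be preferable when typos in stroke names should be caught early.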

**Let's play this example!** Audio (and also annotations!) can be easily loaded from each track.

In [None]:
# Let's first get a random track
random_track = mridangam_stroke.choice_track()

import IPython.display
print("Play recording of id: {}, including stroke '{}' and tonic {}"
    .format(random_track.track_id, random_track.stroke_name, random_track.tonic))
# Load the audio once: the attribute gives the samples and the sampling rate
audio, sr = random_track.audio
IPython.display.Audio(audio, rate=sr)
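
As the cell above shows, the `audio` attribute yields a pair: the samples first and the sampling rate second. A quick sanity check on any loaded track is its duration, which is the number of samples divided by the sampling rate. The sketch below uses a synthetic list of zeros and an assumed rate of 44100 Hz in place of real audio, so the values are illustrative only.

```python
# Synthetic stand-in for the (samples, sampling rate) pair; values are made up
sample_rate = 44100
samples = [0.0] * 22050  # half a second of silence at 44100 Hz
fake_audio = (samples, sample_rate)

# Unpack the pair into the signal and its sampling rate
signal, sr = fake_audio

# Duration in seconds = number of samples / samples per second
duration_seconds = len(signal) / sr
print(f"Duration: {duration_seconds:.2f} s")
```

For a real track, the same computation applies as long as the signal is a one-dimensional (mono) array, so that `len(signal)` is the number of samples.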