A Multimodal Audio Sheet Music Dataset


Multimodal Sheet Music Dataset

MSMD is a synthetic dataset of 497 pieces of (classical) music that contains both audio and score representations of the pieces aligned at a fine-grained level (344,742 pairs of noteheads aligned to their audio/MIDI counterpart). It can be used for training and evaluating multimodal models that enable crossing from one modality to the other, such as retrieving sheet music using recordings or following a performance in the score image. The figure below shows an example of the data contained in MSMD.


If you have any questions, do not hesitate to contact the authors of the dataset:

MSMD was first used in the paper:

[1] Matthias Dorfer, Jan Hajič jr., Andreas Arzt, Harald Frostel, Gerhard Widmer.
Learning Audio-Sheet Music Correspondences for Cross-Modal Retrieval and Piece Identification (PDF).
Transactions of the International Society for Music Information Retrieval, issue 1, 2018.

If you use the dataset, we kindly ask that you cite this paper.

The appendix of our article also contains a detailed description of the MSMD dataset and its structure. If you would like to reproduce or extend our experiments please take a look at our corresponding repository.

Getting started (Quick Guide)

1.) Clone this repository

git clone git@github.com:CPJKU/msmd.git

2.) Follow the steps listed in Setup and Requirements

3.) Download the preprocessed MSMD dataset.

4.) Check out the tutorials provided along with this repository.

5.) If you want to build the entire dataset on your own, check out our dataset tutorial (optional).

Setup and Requirements

For a list of required Python packages, see requirements.txt, or install them all at once using pip:

pip install -r requirements.txt

We also provide an anaconda environment file which can be installed as follows:

conda env create -f environment.yaml

To install the audio_sheet_retrieval package in develop mode (this is what we recommend) run

python setup.py develop --user

in the root folder of the package.

Dataset structure

MSMD is structured into pieces. Pieces are abstract musical entities, encoded with a LilyPond file extracted from Mutopia, that can be embodied in MSMD either as scores, the visual modality, or performances, the audio modality. We extract various views of a score, and features of a performance. Finally, we align noteheads in the score to note events in the performances.

With respect to the file system, MSMD is a directory. Inside are piece directories, with names derived from Mutopia. Each piece directory has two subdirectories: performances/ and scores/. Then, each performance or score is a directory inside the corresponding subdir, containing its own encoding (PDF for scores, MIDI for performances) and derived features.
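The layout described above can be traversed with a few lines of standard-library Python. This is a minimal sketch, not part of the MSMD code base: the function name `list_msmd` is hypothetical, and it only relies on the `performances/` and `scores/` subdirectory names given in the text.

```python
from pathlib import Path


def _subdir_names(path):
    """Return the sorted names of the directories directly inside `path`."""
    if not path.is_dir():
        return []
    return sorted(p.name for p in path.iterdir() if p.is_dir())


def list_msmd(root):
    """Map each piece directory name to its performance and score names.

    Hypothetical helper: it assumes only the layout described in the text,
    i.e. <root>/<piece>/performances/<name>/ and <root>/<piece>/scores/<name>/.
    """
    index = {}
    for piece_dir in sorted(Path(root).iterdir()):
        if not piece_dir.is_dir():
            continue  # skip stray files in the dataset root
        index[piece_dir.name] = {
            "performances": _subdir_names(piece_dir / "performances"),
            "scores": _subdir_names(piece_dir / "scores"),
        }
    return index
```

Calling `list_msmd("/path/to/msmd")` returns a plain dictionary, which is convenient for a quick sanity check of a downloaded copy before using the data_model classes.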

An example file structure for a piece with one score and two performances:



Each piece has the base LilyPond *.ly file, a normalized *.ly file, and a MIDI file generated directly from the normalized *.ly file. Next, there is a meta.yml file that contains some information about the piece, such as the number of aligned notehead/note event pairs. Finally, there are the performances/ and scores/ subdirectories that hold the Performances and Scores generated for this piece.


Each performance is a subdirectory of the piece's performances/ subdir. The authoritative encoding of the performance is a MIDI file derived from the piece MIDI. Currently, we only change its tempo. From the MIDI file, we generate an audio file using a piano soundfont. The audio is used to compute the spectrogram, and then discarded, so it does not show up in the example file structure of a piece described above. The tempo change and the soundfont used for rendering the audio/spectrogram are encoded in the performance name.
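Since the tempo change and soundfont are part of the performance name, they can be recovered from it. The sketch below assumes the pattern `<piece>_tempo-<number>_<soundfont>`, which is inferred only from the performance name that appears in the MuNG example later in this document; the regular expression and the function name are hypothetical, and the meaning of the tempo number is not interpreted here.

```python
import re

# Assumed naming pattern, inferred from a single example name:
#   <piece name>_tempo-<digits>_<soundfont name>
PERF_NAME_RE = re.compile(r"^(?P<piece>.+)_tempo-(?P<tempo>\d+)_(?P<soundfont>.+)$")


def parse_performance_name(name):
    """Split a performance name into (piece, tempo, soundfont)."""
    match = PERF_NAME_RE.match(name)
    if match is None:
        raise ValueError("unexpected performance name: %r" % name)
    return match.group("piece"), int(match.group("tempo")), match.group("soundfont")


piece, tempo, soundfont = parse_performance_name(
    "BachCPE__cpe-bach-rondo__cpe-bach-rondo_tempo-1000_ElectricPiano"
)
```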

The features computed from the audio and the performance MIDI are then stored in the features/ subdirectory. We compute:

  • MIDI matrix
  • Note events list
  • Onsets list
  • Spectrogram

For the frame-wise features (MIDI matrix and spectrogram), the frame rate is set to 20 frames per second.

The MIDI matrix is a 128 x N_FRAMES binary matrix. If a given pitch is active in a given frame, that matrix cell contains a 1.

The note events list is derived from the performance MIDI by pairing the corresponding note-on and note-off events. It is an N_EVENTS x 5 numpy array. The columns are: onset time (in seconds), pitch, duration (in seconds), track, and channel (the last two are not needed for anything).
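The pairing of note-on and note-off events can be sketched as follows. This is an illustration under stated assumptions, not the code used to build MSMD: the input is a hypothetical, already time-sorted list of `(time_sec, kind, pitch, track, channel)` tuples rather than a parsed MIDI file, and plain Python lists stand in for the numpy array to keep the example dependency-free.

```python
def pair_note_events(messages):
    """Pair note-on/note-off messages into [onset, pitch, duration, track, channel] rows.

    `messages` is an iterable of (time_sec, kind, pitch, track, channel) tuples,
    with kind being "on" or "off", sorted by time (a hypothetical input format).
    """
    open_notes = {}  # (pitch, track, channel) -> onset times still sounding
    events = []
    for time_sec, kind, pitch, track, channel in messages:
        key = (pitch, track, channel)
        if kind == "on":
            open_notes.setdefault(key, []).append(time_sec)
        elif kind == "off" and open_notes.get(key):
            # Close the earliest still-open note with the same pitch/track/channel.
            onset = open_notes[key].pop(0)
            events.append([onset, pitch, time_sec - onset, track, channel])
    # Order rows by onset time, then pitch, so that chords are listed bottom-up.
    events.sort(key=lambda row: (row[0], row[1]))
    return events
```

Unmatched note-off messages are silently ignored in this sketch; a real pipeline may prefer to report them.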

The onsets list is a vector of length N_EVENTS. It maps the note events to onset frames. This is how note events are related to the MIDI matrix and the spectrogram.
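The relation between note events, onset frames, and the MIDI matrix can be made concrete with a short sketch. It assumes that times are converted to frames by rounding to the nearest frame at 20 frames per second; the rounding convention is an assumption, although it is consistent with the MuNG example in this document (12.727274 s, frame 255). Nested lists are used in place of numpy arrays so that the example has no dependencies.

```python
FRAME_RATE = 20  # frames per second, as stated in the text


def onset_frames(note_events, frame_rate=FRAME_RATE):
    """Return the onset frame for each [onset, pitch, duration, ...] row."""
    # Assumption: seconds are mapped to frames by rounding to the nearest frame.
    return [int(round(row[0] * frame_rate)) for row in note_events]


def midi_matrix(note_events, n_frames, frame_rate=FRAME_RATE):
    """Build a 128 x n_frames binary matrix; 1 marks an active pitch."""
    matrix = [[0] * n_frames for _ in range(128)]
    for onset, pitch, duration, *_ in note_events:
        start = int(round(onset * frame_rate))
        end = int(round((onset + duration) * frame_rate))
        end = max(end, start + 1)  # every note occupies at least one frame
        for frame in range(start, min(end, n_frames)):
            matrix[int(pitch)][frame] = 1
    return matrix
```

With this convention, `onset_frames(events)[i]` is the column of both the MIDI matrix and the spectrogram at which note event `i` starts.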

The spectrogram is a 92 x N_FRAMES matrix, computed from the synthesized audio. It is computed with a sample rate of 22050 Hz and an FFT window size of 2048 samples. For dimensionality reduction we apply a normalized 16-band logarithmic filterbank allowing only frequencies from 30 Hz to 16 kHz, which results in those 92 frequency bins.
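The parameters above imply a few derived quantities that are useful when re-rendering audio or relating frames to time. The following is plain arithmetic on the stated values, not a reimplementation of the spectrogram computation; the function name is hypothetical.

```python
def stft_parameters(sample_rate=22050, frame_rate=20, fft_size=2048):
    """Derive framing quantities from the values stated in the text."""
    return {
        # Distance between consecutive frames, in samples (may be fractional).
        "hop_size_samples": sample_rate / frame_rate,
        # Spacing of the linear FFT bins before the filterbank is applied.
        "bin_width_hz": sample_rate / fft_size,
        # Highest frequency representable at this sample rate.
        "nyquist_hz": sample_rate / 2,
        # Duration of audio covered by one FFT window.
        "window_seconds": fft_size / sample_rate,
    }


params = stft_parameters()
```

Note that the Nyquist frequency at 22050 Hz is 11025 Hz, so the stated 16 kHz upper limit of the filterbank cannot be reached in practice; the effective upper bound is the Nyquist frequency.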


Each score is a subdirectory of the piece's scores/ subdir. The scores are based on the PDF generated by LilyPond, which is stored in the score directory. For a score, we generate:

  • Page images,
  • Coordinates of noteheads and systems,
  • MuNG (MUSCIMA++ Notation Graph) -- holds alignment to performances

From this PDF, we render the page images (imgs/01.png, /02.png, /03.png).

We store notehead and system coordinates for each page in the coords/ subdirectory of the score. Notehead coordinates are their centroids. For system regions, we store the coordinates of their corners.
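To illustrate how the two kinds of coordinates relate, the sketch below tests which system region contains a notehead centroid. Everything about the data layout is an assumption made for the example: noteheads are taken as `(row, column)` pairs and systems as four `(row, column)` corner points. It is a naive geometric check, not the grouping procedure used for MSMD, which is stored in the MuNG files.

```python
def system_bbox(corners):
    """Return (top, left, bottom, right) for a list of (row, col) corner points."""
    rows = [row for row, _ in corners]
    cols = [col for _, col in corners]
    return min(rows), min(cols), max(rows), max(cols)


def assign_to_systems(noteheads, systems):
    """For each notehead centroid, return the index of the containing system.

    Returns None for a notehead that lies in no system region.
    The (row, col) ordering of all coordinates is an assumption.
    """
    boxes = [system_bbox(corners) for corners in systems]
    assignment = []
    for row, col in noteheads:
        hit = None
        for idx, (top, left, bottom, right) in enumerate(boxes):
            if top <= row <= bottom and left <= col <= right:
                hit = idx
                break
        assignment.append(hit)
    return assignment
```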

Finally, we store the MUSCIMA++ Notation Graph (MuNG) representation, an XML format for describing music notation. The graph stores how noteheads are grouped into systems, which is not always trivial (see Appendix A of the article [1]). And more importantly, the XML records for individual noteheads also store the all-important alignment between a score and a performance.

The MuNG format and how alignment between the scores and performances is stored is described in the next section.

MuNG format and Alignment

The MuNG XML for a notehead in MSMD looks like this:

<CropObject xml:id="msmd_aug___BachCPE__cpe-bach-rondo__cpe-bach-rondo_ly-P00___0">
  <Mask>0:0 1:63</Mask>
  <Data>
    <DataItem key="BachCPE__cpe-bach-rondo__cpe-bach-rondo_tempo-1000_ElectricPiano_onset_frame" type="int">255</DataItem>
    <DataItem key="tied" type="int">0</DataItem>
    <DataItem key="BachCPE__cpe-bach-rondo__cpe-bach-rondo_tempo-1000_ElectricPiano_note_event_idx" type="int">48</DataItem>
    <DataItem key="BachCPE__cpe-bach-rondo__cpe-bach-rondo_tempo-1000_ElectricPiano_onset_seconds" type="float">12.727274</DataItem>
    <DataItem key="midi_pitch_code" type="int">68</DataItem>
    <DataItem key="ly_link" type="str">textedit:///media/matthias/Data/msmd_aug/BachCPE__cpe-bach-rondo__cpe-bach-rondo/BachCPE__cpe-bach-rondo__cpe-bach-rondo.norm.ly:704:15:16</DataItem>
  </Data>
</CropObject>

The xml:id of the <CropObject> is the unique identifier for the given notehead within the entire MSMD dataset. The <Id> is its identifier within the given score, which works across pages even though the MuNG for each page is stored in a separate file. The <Top>, <Left>, <Width> and <Height> elements denote its bounding box. (The <Mask> is irrelevant in MSMD, but required by the MuNG specification, so it is just filled with 1's.)

The <Outlinks> element stores the <Id> of the system MuNG object. This is how we group noteheads into systems, which is necessary for properly "unrolling" the score when aligning noteheads to the note events.

The <DataItem> elements hold additional descriptors that are not required by the MuNG format, but are an MSMD-specific extension of MuNG.

  • The <DataItem> elements that point to a performance have their "key" attribute start with the name of the performance.

  • The ${PERF_NAME}_note_event_idx item points to the Note Events List element from performance ${PERF_NAME} to which this particular notehead corresponds. THIS IS THE KEY ELEMENT FOR ALIGNING THE SCORE TO THE AUDIO (SPECTROGRAM).

  • The ${PERF_NAME}_onset_frame item points to the frame in the MIDI matrix and spectrogram of performance ${PERF_NAME} to which this particular notehead corresponds. This is derived from the alignment; storing it in the MuNG simplifies later processing.

  • The ${PERF_NAME}_onset_seconds item points to the exact time in the audio of the performance when the notehead is interpreted. (This is also here just for convenience, but note that we do not retain the audio; however, if you re-render it from the performance MIDI, this item will make it easy to align noteheads directly to audio.)

  • The ly_link item holds a reference to the exact location in the normalized LilyPond file from which this notehead was rendered by the LilyPond engraving engine. It helped us recover the pitch associated with this notehead.

  • The midi_pitch_code item holds the MIDI pitch code associated with this notehead. This information is extracted from the originating LilyPond file (see Appendix A of [1]).
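The alignment entries described above can also be read with the standard library alone, which may help when inspecting a single file. The sketch below is not the muscima API; it uses `xml.etree.ElementTree` as a stand-in, the helper names are hypothetical, and the embedded sample is an abbreviated copy of the notehead record shown earlier.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined "xml:" prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
CASTS = {"int": int, "float": float, "str": str}

PERF = "BachCPE__cpe-bach-rondo__cpe-bach-rondo_tempo-1000_ElectricPiano"
SAMPLE = """
<CropObject xml:id="msmd_aug___BachCPE__cpe-bach-rondo__cpe-bach-rondo_ly-P00___0">
  <Mask>0:0 1:63</Mask>
  <Data>
    <DataItem key="{p}_onset_frame" type="int">255</DataItem>
    <DataItem key="{p}_note_event_idx" type="int">48</DataItem>
    <DataItem key="{p}_onset_seconds" type="float">12.727274</DataItem>
    <DataItem key="midi_pitch_code" type="int">68</DataItem>
  </Data>
</CropObject>
""".format(p=PERF)


def read_data_items(cropobject):
    """Return {key: value} for all DataItem children, cast by their "type"."""
    items = {}
    for item in cropobject.iter("DataItem"):
        cast = CASTS.get(item.get("type"), str)
        items[item.get("key")] = cast(item.text)
    return items


def alignment_for(cropobject, perf_name):
    """Collect the three alignment entries of one notehead for one performance."""
    items = read_data_items(cropobject)
    return {
        "note_event_idx": items[perf_name + "_note_event_idx"],
        "onset_frame": items[perf_name + "_onset_frame"],
        "onset_seconds": items[perf_name + "_onset_seconds"],
    }


notehead = ET.fromstring(SAMPLE)
alignment = alignment_for(notehead, PERF)
```

Here `alignment["note_event_idx"]` indexes a row of the performance's note events list, and `alignment["onset_frame"]` indexes a column of its MIDI matrix and spectrogram.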

To load MuNG files and use this representation, we recommend using the muscima package (https://github.com/hajicj/muscima) of the MuNG format authors.

Manipulating MSMD

To explore the code for loading and manipulating MSMD, we suggest starting from the function:


of the accompanying software of MSMD. This function implements the preprocessing pipeline described in sec. 3 of article [1], which includes loading the alignment from MuNG to present corresponding snippets of sheet music and excerpts of the spectrogram to the cross-modal retrieval model training.
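To give a rough idea of what cutting a spectrogram excerpt around an aligned notehead involves, here is a minimal sketch. It is not the function from the accompanying software: the name `spectrogram_excerpt`, the default window width, and the clamping at the boundaries are all assumptions made for illustration, and nested lists stand in for the numpy array.

```python
def spectrogram_excerpt(spectrogram, onset_frame, n_context=40):
    """Cut `n_context` frames around `onset_frame` from a bins x frames matrix.

    `spectrogram` is a list of rows (one per frequency bin), each holding
    N_FRAMES values. The window is shifted inwards at the boundaries so that
    it always has full width when the spectrogram is long enough.
    """
    n_frames = len(spectrogram[0])
    start = onset_frame - n_context // 2
    # Clamp the window to the valid frame range [0, n_frames).
    start = max(0, min(start, n_frames - n_context))
    return [row[start:start + n_context] for row in spectrogram]
```

The `onset_frame` argument would typically come from the `${PERF_NAME}_onset_frame` item of a notehead; the matching sheet music snippet would be cut around the notehead's coordinates in the page image.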

A Python abstraction over MSMD is implemented in the data_model/ module. Going from an abstraction over the entire dataset downwards, to abstractions over the Piece and its corresponding Performances and Scores, there are classes:

  • msmd.py:MSMD
  • piece.py:Piece
  • performance.py:Performance
  • score.py:Score

The classes' docstrings contain further details on how to use these objects. The classes are quite light-weight and mainly intended to ease loading MSMD from Python scripts.

If you want to explore how MSMD was generated, refer to:


The process is described in Appendix A of [1].