Taken from https://nbviewer.jupyter.org/github/craffel/midi-ground-truth/blob/master/Statistics.ipynb

# Measuring Statistics about Information Sources in MIDI Files

A MIDI file can provide a cornucopia of musical information about a given piece of music, including transcription, key, lyrics, and meter. However, the presence and quantity of each of these sources of information can vary. Through a large-scale web scrape, we obtained 178,561 unique (i.e. having different MD5 checksums) MIDI files. This notebook measures how often each possible source of information appears in this collection of MIDI files found "in the wild".

In [18]:
import joblib
from pathlib import Path
from pprint import pprint

import pretty_midi
from utils.midi_utils import compute_statistics
# compute_statistics takes the path to a MIDI file and collects the number and
# values of different events in it (for example, key change values and tempo settings).

import plotly.express as px
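
The body of `compute_statistics` is not shown in this notebook. The following is a minimal sketch of what such a helper might look like, assuming it is built on `pretty_midi` and returns the dictionary keys used in the cells below. The names `summarize_midi` and `compute_statistics_sketch`, the exclusion of drum tracks from `program_numbers`, and the stand-in object at the end are all assumptions made for illustration.

```python
from types import SimpleNamespace


def summarize_midi(pm):
    """Collect event counts and values from a PrettyMIDI-like object."""
    _, tempos = pm.get_tempo_changes()
    return {
        'end_time': pm.get_end_time(),
        'n_instruments': len(pm.instruments),
        # Assumption: drum tracks are skipped, since their program number
        # does not name a melodic General MIDI instrument.
        'program_numbers': [i.program for i in pm.instruments if not i.is_drum],
        'tempos': [float(t) for t in tempos],
        'key_numbers': [k.key_number for k in pm.key_signature_changes],
        'lyrics': [lyric.text for lyric in pm.lyrics],
    }


def compute_statistics_sketch(midi_path):
    """Hypothetical stand-in for utils.midi_utils.compute_statistics."""
    import pretty_midi  # imported lazily so summarize_midi can be tried without it
    try:
        pm = pretty_midi.PrettyMIDI(midi_path)
    except Exception:
        # Corrupt or unparseable files yield None so callers can filter them out.
        return None
    return summarize_midi(pm)


# A stand-in object mimicking the PrettyMIDI attributes used above.
stub = SimpleNamespace(
    instruments=[
        SimpleNamespace(program=0, is_drum=False),
        SimpleNamespace(program=0, is_drum=True),
    ],
    key_signature_changes=[SimpleNamespace(key_number=0)],
    lyrics=[SimpleNamespace(text='la')],
    get_tempo_changes=lambda: ([0.0], [120.0]),
    get_end_time=lambda: 12.5,
)
example = summarize_midi(stub)
```

Returning `None` for files that fail to parse is one way to stay compatible with a caller that drops `None` entries from the results.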


## Length
To begin with, we can get an idea of the type of MIDI files which are available by looking at their length (i.e. time in seconds). MIDI files are variously used for storing short melodic fragments, drum patterns, ringtones, and full-song transcriptions; looking at the distribution of lengths of MIDI files in our collection gives a rough picture of how common each of these uses is.

In [8]:
all_mid = Path('..', 'data', 'lmd_matched')

statistics = joblib.Parallel(n_jobs=4)(
    joblib.delayed(compute_statistics)(str(midi_file))
    for midi_file in all_mid.glob('**/*.mid')
    if midi_file.parent.name.startswith("TRAA")
)
statistics = [s for s in statistics if s is not None]

In [14]:
fig = px.histogram(
    [s['end_time'] for s in statistics],
)
fig.update_yaxes(title='Number of MIDI files')
fig.update_xaxes(title='Length in seconds')
fig.show()
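
To read the length distribution in terms of the uses mentioned above, one option is to group lengths into coarse buckets. This is only a sketch: the cut-offs (30 seconds and 2 minutes), the labels, and the toy `example_stats` list are arbitrary assumptions for illustration, not values taken from the collection.

```python
from collections import Counter

# Hypothetical cut-offs in seconds; chosen for illustration only.
LENGTH_BUCKETS = [
    (30.0, 'under 30 s'),
    (120.0, '30 s to 2 min'),
    (float('inf'), 'over 2 min'),
]


def bucket_length(end_time):
    """Return the label of the first bucket whose upper bound exceeds end_time."""
    for upper_bound, label in LENGTH_BUCKETS:
        if end_time < upper_bound:
            return label
    return LENGTH_BUCKETS[-1][1]


# Toy stand-in for the list of per-file statistics dictionaries.
example_stats = [
    {'end_time': 8.0},
    {'end_time': 45.5},
    {'end_time': 210.0},
    {'end_time': 305.2},
]
bucket_counts = Counter(bucket_length(s['end_time']) for s in example_stats)
```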



## Instruments

In their simplest form, MIDI files contain a collection of notes played on a collection of instruments. Under the General MIDI specification, 128 instruments are available (see pretty_midi.INSTRUMENT_MAP for a list) which are indexed by their "program number". The distribution of the number of instruments in our MIDI files gives us further intuition into the MIDI files' usage; the distribution of program numbers shows us which instruments are more or less popular. The four most common program numbers (shown as the four tallest bars in the distribution of program numbers) were 0 (“Acoustic Grand Piano”), 48 (“String Ensemble 1”), 33 (“Electric Bass (finger)”), and 25 (“Acoustic Guitar (steel)”).



In [24]:
pprint({i: name for i, name in enumerate(pretty_midi.INSTRUMENT_MAP)})

{0: 'Acoustic Grand Piano',
 1: 'Bright Acoustic Piano',
 2: 'Electric Grand Piano',
 3: 'Honky-tonk Piano',
 4: 'Electric Piano 1',
 5: 'Electric Piano 2',
 6: 'Harpsichord',
 7: 'Clavinet',
 8: 'Celesta',
 9: 'Glockenspiel',
 10: 'Music Box',
 11: 'Vibraphone',
 12: 'Marimba',
 13: 'Xylophone',
 14: 'Tubular Bells',
 15: 'Dulcimer',
 16: 'Drawbar Organ',
 17: 'Percussive Organ',
 18: 'Rock Organ',
 19: 'Church Organ',
 20: 'Reed Organ',
 21: 'Accordion',
 22: 'Harmonica',
 23: 'Tango Accordion',
 24: 'Acoustic Guitar (nylon)',
 25: 'Acoustic Guitar (steel)',
 26: 'Electric Guitar (jazz)',
 27: 'Electric Guitar (clean)',
 28: 'Electric Guitar (muted)',
 29: 'Overdriven Guitar',
 30: 'Distortion Guitar',
 31: 'Guitar Harmonics',
 32: 'Acoustic Bass',
 33: 'Electric Bass (finger)',
 34: 'Electric Bass (pick)',
 35: 'Fretless Bass',
 36: 'Slap Bass 1',
 37: 'Slap Bass 2',
 38: 'Synth Bass 1',
 39: 'Synth Bass 2',
 40: 'Violin',
 41: 'Viola',
 42: 'Cello',
 43: 'Contrabass',
 44: 'Tremolo

In [22]:
fig = px.histogram(
    [s['n_instruments'] for s in statistics],
)
fig.update_yaxes(title='Number of MIDI files')
fig.update_xaxes(title='Number of instruments')
fig.show()

In [25]:
fig = px.histogram(
    [i for s in statistics for i in s['program_numbers']],
)
fig.update_yaxes(title='Number of occurrences')
fig.update_xaxes(title='Program number')
fig.show()
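
The tallest bars of the program-number histogram can also be read off directly by counting. The sketch below uses `collections.Counter` on a toy `example_stats` list; the helper name `most_common_programs` and the data are assumptions, and only the four instrument names quoted in the text above are included in the lookup table.

```python
from collections import Counter

# Names of the four program numbers mentioned in the text.
PROGRAM_NAMES = {
    0: 'Acoustic Grand Piano',
    25: 'Acoustic Guitar (steel)',
    33: 'Electric Bass (finger)',
    48: 'String Ensemble 1',
}


def most_common_programs(stats, n=4):
    """Return the n most frequent program numbers as (program, count) pairs."""
    counts = Counter(p for s in stats for p in s['program_numbers'])
    return counts.most_common(n)


# Toy stand-in for the list of per-file statistics dictionaries.
example_stats = [
    {'program_numbers': [0, 48, 33]},
    {'program_numbers': [0, 48]},
    {'program_numbers': [0]},
]
top_programs = most_common_programs(example_stats, n=2)
for program, count in top_programs:
    print(program, PROGRAM_NAMES.get(program, 'unknown'), count)
```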

## Tempo changes
The timing of events in MIDI files is determined by tempo change events, which allow conversion from the MIDI "tick" timebase to absolute time in seconds. Using many tempo change events allows a MIDI transcription's timing to closely match that of a specific performance of a piece of music. While 120 bpm is the default tempo for a MIDI file, the distribution of tempos shows that a wide variety of tempos is used.

In [26]:
fig = px.histogram(
    [i for s in statistics for i in s['tempos']],
)
fig.update_yaxes(title='Number of occurrences')
fig.update_xaxes(title='Tempo')
fig.show()
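
The tick-to-seconds conversion described above can be sketched in plain Python. At a tempo of `bpm` beats per minute and a resolution of `resolution` ticks per quarter note, one tick lasts `60 / (bpm * resolution)` seconds, and each tempo change starts a new segment with its own tick duration. The function name `ticks_to_seconds` and the `(start_tick, bpm)` representation of tempo changes are assumptions for this illustration; it is not how any particular library stores them.

```python
DEFAULT_BPM = 120.0  # tempo assumed when a file has no tempo event at tick 0


def ticks_to_seconds(tick, tempo_changes, resolution):
    """Convert an absolute tick position to seconds.

    tempo_changes is a list of (start_tick, bpm) pairs sorted by start_tick;
    resolution is the number of ticks per quarter note.
    """
    if not tempo_changes or tempo_changes[0][0] > 0:
        tempo_changes = [(0, DEFAULT_BPM)] + list(tempo_changes)
    seconds = 0.0
    for idx, (start_tick, bpm) in enumerate(tempo_changes):
        if tick <= start_tick:
            break
        # The segment ends at the next tempo change or at the target tick.
        if idx + 1 < len(tempo_changes):
            end_tick = min(tick, tempo_changes[idx + 1][0])
        else:
            end_tick = tick
        seconds += (end_tick - start_tick) * 60.0 / (bpm * resolution)
    return seconds


# Two beats at 120 bpm followed by one beat at 60 bpm, 480 ticks per beat.
changes = [(0, 120.0), (960, 60.0)]
elapsed = ticks_to_seconds(1440, changes, resolution=480)
```

In this example the first 960 ticks take one second at 120 bpm and the remaining 480 ticks take another second at 60 bpm.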

## Key changes

The current key in a piece of music can be specified with the optional key signature change meta-events. These events do not affect playback, so many of our MIDI files omit them altogether, though a roughly equal number had a single key change event. Interestingly, a disproportionate number of MIDI files had a key change to C major; this likely reflects the fact that many MIDI transcription software packages automatically insert a C major key change.


In [28]:
fig = px.histogram(
    [len(s['key_numbers']) for s in statistics],
)
fig.update_yaxes(title='Number of MIDI files')
fig.update_xaxes(title='Number of key changes')
fig.show()

In [30]:
fig = px.histogram(
    [i for s in statistics for i in s['key_numbers']],
)
fig.update_yaxes(title='Number of occurrences')
fig.update_xaxes(title='Key')
fig.show()
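
The key histogram is indexed by key number, which is harder to read than key names. The sketch below assumes the `pretty_midi` convention in which key numbers 0 to 11 are major keys starting at C and 12 to 23 are the corresponding minor keys. The helper `key_number_to_name`, its choice of accidental spellings, and the toy `example_stats` list are written here for illustration; `pretty_midi` ships its own `key_number_to_key_name` utility, which would be the natural choice in practice.

```python
from collections import Counter

# Accidental spellings are an arbitrary choice for this sketch.
PITCH_CLASSES = ['C', 'Db', 'D', 'Eb', 'E', 'F', 'F#', 'G', 'Ab', 'A', 'Bb', 'B']


def key_number_to_name(key_number):
    """Map a key number in [0, 23] to a name such as 'C Major' or 'A minor'."""
    if not 0 <= key_number < 24:
        raise ValueError(f'key number out of range: {key_number}')
    mode = 'Major' if key_number < 12 else 'minor'
    return f'{PITCH_CLASSES[key_number % 12]} {mode}'


# Toy stand-in for the list of per-file statistics dictionaries.
example_stats = [
    {'key_numbers': [0]},
    {'key_numbers': []},
    {'key_numbers': [0, 7]},
    {'key_numbers': [21]},
]
key_counts = Counter(
    key_number_to_name(k) for s in example_stats for k in s['key_numbers']
)
changes_per_file = Counter(len(s['key_numbers']) for s in example_stats)
```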


## Lyrics

MIDI files can also optionally include timestamped lyrics events, which in particular facilitates their use for karaoke. In our collection, we found 23,801 MIDI files (about 13.3%) which had at least one lyrics meta-event. Lyrics are often transcribed at the word, syllable, or character level, as indicated by the distribution of the lengths of their text. The preponderance of length-1 lyrics is also caused by characters (e.g. newlines and spaces) which indicate the end of a phrase.

In [31]:
fig = px.histogram(
    [len(s['lyrics']) for s in statistics],
)
fig.update_yaxes(title='Number of MIDI files')
fig.update_xaxes(title='Number of lyrics events')
fig.show()


In [32]:
fig = px.histogram(
    [len(l) for s in statistics for l in s['lyrics']],
)
fig.update_yaxes(title='Number of occurrences')
fig.update_xaxes(title='Length of lyrics')
fig.show()
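
The share of files with lyrics and the distribution of lyric text lengths can be computed from the same per-file dictionaries. The following is a minimal sketch on toy data, assuming each entry of `s['lyrics']` is the text of one lyrics event; the helper name `lyrics_summary` and the `example_stats` list are hypothetical. The last line only re-derives the percentage quoted above from the two counts given in the text.

```python
from collections import Counter


def lyrics_summary(stats):
    """Return (share of files with lyrics, Counter of lyric text lengths)."""
    n_with_lyrics = sum(1 for s in stats if s['lyrics'])
    share = n_with_lyrics / len(stats) if stats else 0.0
    length_counts = Counter(len(text) for s in stats for text in s['lyrics'])
    return share, length_counts


# Toy stand-in: one file with syllable-level lyrics and a newline, three without.
example_stats = [
    {'lyrics': ['Hel', 'lo', '\n']},
    {'lyrics': []},
    {'lyrics': []},
    {'lyrics': []},
]
share, length_counts = lyrics_summary(example_stats)

# Percentage implied by the counts quoted in the text.
reported_percent = round(100 * 23801 / 178561, 1)
```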