# Taking a Look at MAESTRO Data with Chromagen

The MAESTRO dataset includes WAV/MIDI file pairs of piano compositions, and metadata in the form of .csv and .json files. My goal is to use the Chromagen module for an exploratory k-means clustering analysis of a set of WAV files. I'll ignore the MIDI files in this demo, and take the output of Chromagen.chromaweights() as our feature vector for each file.

In [9]:
import numpy as np #for linear algebra
import pandas as pd #for nice databases
from scipy.io.wavfile import read as wavread #we can read wavfiles this way
import zipfile as z #to get to the dataset
import chromagen as cg
import matplotlib.pyplot as plt

In [2]:
#since the dataset .zip has more than one file in it, we can't create a dataframe directly from the zip
file = z.ZipFile("C:/Users/jreif/Documents/Datasets/maestro-v1.0.0.zip")
#extracted just the .csv holding our metadata
file.extract("maestro-v1.0.0/maestro-v1.0.0.csv", path = "C:/Users/jreif/Documents/Datasets")
#now create a pandas dataframe using the csv
df = pd.read_csv("C:/Users/jreif/Documents/Datasets/maestro-v1.0.0/maestro-v1.0.0.csv")

In [3]:
##let's take a look at the metadata
df.head()

Unnamed: 0,canonical_composer,canonical_title,split,year,midi_filename,audio_filename,duration
0,Alban Berg,Sonata Op. 1,train,2017,2017/MIDI-Unprocessed_066_PIANO066_MID--AUDIO-...,2017/MIDI-Unprocessed_066_PIANO066_MID--AUDIO-...,464.649433
1,Alban Berg,Sonata Op. 1,train,2008,2008/MIDI-Unprocessed_03_R2_2008_01-03_ORIG_MI...,2008/MIDI-Unprocessed_03_R2_2008_01-03_ORIG_MI...,759.518471
2,Alexander Scriabin,"24 Preludes Op. 11, No. 13-24",train,2004,2004/MIDI-Unprocessed_XP_21_R1_2004_01_ORIG_MI...,2004/MIDI-Unprocessed_XP_21_R1_2004_01_ORIG_MI...,872.640588
3,Alexander Scriabin,"3 Etudes, Op. 65",test,2006,2006/MIDI-Unprocessed_17_R1_2006_01-06_ORIG_MI...,2006/MIDI-Unprocessed_17_R1_2006_01-06_ORIG_MI...,397.857508
4,Alexander Scriabin,"5 Preludes, Op.15",train,2009,2009/MIDI-Unprocessed_07_R1_2009_04-05_ORIG_MI...,2009/MIDI-Unprocessed_07_R1_2009_04-05_ORIG_MI...,400.557826


It would take some time to extract all of these files, and we don't need a huge dataset for our basic exploration. Let's take a subset of the 954 WAV files listed as "training" files to narrow the field a bit. I'll choose only the WAV files whose corresponding MIDI file is between 100 and 200 seconds.

In [4]:
df.query('split == "train" and 100<duration<200', inplace = True)

In [5]:
df

Unnamed: 0,canonical_composer,canonical_title,split,year,midi_filename,audio_filename,duration
8,Alexander Scriabin,"Etude Op. 8, No. 13",train,2009,2009/MIDI-Unprocessed_02_R1_2009_03-06_ORIG_MI...,2009/MIDI-Unprocessed_02_R1_2009_03-06_ORIG_MI...,167.085837
9,Alexander Scriabin,"Etude in D-flat Major, Op. 8 No. 10",train,2011,2011/MIDI-Unprocessed_15_R1_2011_MID--AUDIO_R1...,2011/MIDI-Unprocessed_15_R1_2011_MID--AUDIO_R1...,102.007110
34,Claude Debussy,"""Les collines d'Anacapri"" from Preludes, Book I",train,2008,2008/MIDI-Unprocessed_07_R3_2008_01-05_ORIG_MI...,2008/MIDI-Unprocessed_07_R3_2008_01-05_ORIG_MI...,166.495560
62,Claude Debussy,"Ondine from Preludes, Book II",train,2008,2008/MIDI-Unprocessed_10_R3_2008_01-05_ORIG_MI...,2008/MIDI-Unprocessed_10_R3_2008_01-05_ORIG_MI...,193.301089
68,Claude Debussy,"Preludes, Book II, III - La puerta del vino",train,2013,2013/ORIG-MIDI_01_7_8_13_Group__MID--AUDIO_02_...,2013/ORIG-MIDI_01_7_8_13_Group__MID--AUDIO_02_...,179.321402
...,...,...,...,...,...,...,...
1148,Sergei Rachmaninoff / György Cziffra,Flight of the Bumblebee,train,2006,2006/MIDI-Unprocessed_12_R1_2006_01-08_ORIG_MI...,2006/MIDI-Unprocessed_12_R1_2006_01-08_ORIG_MI...,114.583219
1149,Sergei Rachmaninoff / Vyacheslav Gryaznov,Italian Polka,train,2009,2009/MIDI-Unprocessed_04_R1_2009_04-06_ORIG_MI...,2009/MIDI-Unprocessed_04_R1_2009_04-06_ORIG_MI...,182.654732
1156,Wolfgang Amadeus Mozart,"Sonata in B-flat Major, K. 281, First Movement",train,2011,2011/MIDI-Unprocessed_02_R1_2011_MID--AUDIO_R1...,2011/MIDI-Unprocessed_02_R1_2011_MID--AUDIO_R1...,183.087424
1178,Wolfgang Amadeus Mozart,"Sonata in F Major, K. 280, 1st mov.",train,2013,2013/ORIG-MIDI_03_7_6_13_Group__MID--AUDIO_09_...,2013/ORIG-MIDI_03_7_6_13_Group__MID--AUDIO_09_...,192.605310


In [12]:
#extract the files we care about one by one
counter = 0
filecount = df.shape[0]
for index, row in df.iterrows():
    filepath = row['audio_filename']
    file.extract("maestro-v1.0.0/" + filepath, path = "C:/Users/jreif/Documents/Datasets/")
    counter+=1
    print("{0:3d} out of {1:3d} files extracted".format(counter, filecount), end="\r")

123 out of 123 files extracted

In [22]:
#we'll populate a dataframe with the fractions of the file during which each note is played during each song

X = pd.DataFrame(index = df.index, columns = ["A","A#","B","C","C#","D","D#","E","F","F#","G","G#"])
counter = 0
filecount = df.shape[0]
for ind, row in df.iterrows():
    filepath = row['audio_filename']
    rate, data = wavread("C:/Users/jreif/Documents/Datasets/maestro-v1.0.0/" + filepath)
    data = np.average(data, axis = 1)
    
    #Generate frequency data via short-time fourier transform
    f, t, c = cg.stft(data,10000,rate, windowtype = "Hann") 
    #get the chromagram for the song
    chrm = cg.chromagram(F_arr = f, Chi = c)
    #from the chromagram, get each chroma's weight
    cw = cg.chromaweights(chrm)
    #place the information at the correct row in X
    X.loc[ind] = cw
    counter+=1
    print("{0:3d} out of {1:3d} files analyzed".format(counter, filecount), end="\r")
X


123 out of 123 files analyzed

Unnamed: 0,A,A#,B,C,C#,D,D#,E,F,F#,G,G#
8,0.0803437,0.063793,0.0768133,0.0732304,0.116024,0.0625009,0.0861044,0.0910987,0.0751045,0.0857017,0.0626328,0.126653
9,0.0689926,0.0672908,0.0646549,0.091151,0.118409,0.0580891,0.0738329,0.0675813,0.122189,0.095352,0.0630416,0.109416
34,0.0519003,0.0806738,0.105082,0.0500887,0.115675,0.051471,0.110022,0.0811642,0.0613269,0.13904,0.0432982,0.110259
62,0.108328,0.0888835,0.0793374,0.0580683,0.106662,0.0853214,0.10244,0.0777145,0.0667672,0.0817177,0.0743195,0.0704403
68,0.0570065,0.067678,0.0734636,0.0788251,0.111224,0.0761047,0.0761938,0.0797061,0.0996578,0.0706358,0.077459,0.132045
...,...,...,...,...,...,...,...,...,...,...,...,...
1148,0.0926685,0.0681357,0.0801694,0.0739113,0.0868934,0.117494,0.0853386,0.095763,0.0944717,0.064333,0.0580734,0.0827485
1149,0.0665524,0.146736,0.0679399,0.0772902,0.0767955,0.0826845,0.10042,0.0501398,0.0986876,0.0751485,0.0645435,0.0930623
1156,0.0974596,0.108823,0.0510648,0.107877,0.0471214,0.100208,0.0933309,0.0616586,0.136492,0.058186,0.08551,0.0522697
1178,0.10563,0.0759637,0.0659519,0.0959414,0.0631783,0.102137,0.0515515,0.100383,0.119167,0.0547847,0.109076,0.0562353


In [23]:
def initialize(k):
    #initialize the centroids
    centroids = {
        i+1: np.random.rand(12)
    for i in range(k)
    }
    
    #This second loop will ensure each centroid is within our constraints (sum of coordinates must be 1)
    for i in range(k):
        centroids[i+1] = centroids[i+1]/np.sum(centroids[i+1])   
        
        
    return centroids

Now we want to assign each entry to a cluster. We need to calculate the distance from each cluster, and assign the entry to the cluster with the smallest distance.

In [24]:
note_names = ['A', 'A#', 'B', 'C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#']
def assign_clusters(centroids,X):
    for i in centroids.keys():
        #calculate distance
        X["Distance from {}".format(i)] = np.sqrt(((X[note_names]-centroids[i])**2).sum(axis = 1))
    distances = ["Distance from {}".format(i) for i in centroids.keys()]
    
    #quick way to use our pre-existing distance columns to assign clusters as a numeric column
    assignments = pd.to_numeric((X[distances].idxmin(axis = 1)).str.lstrip("Distance from "))
    X["Assigned Cluster"] = assignments
    return X

Then, we'll need to update the centroids to the mean position of each cluster. This will be the mean along each "note" axis independently.

In [25]:
def update_centroids(centroids,X):
    for i in range(len(centroids)):
        #filter for entries assigned to this cluster
        temp = X[X["Assigned Cluster"]==i+1]
        number_of_entries = len(temp.index)
        #now take an average of all the coordinates for that cluster
        sum_dist = temp[note_names].sum(axis = 0)
        centroids[i+1] = np.array(sum_dist/number_of_entries)
    return centroids

K-means clustering is a good algorithm, but only if we have a good idea of how many clusters are present in our data. To estimate this, I'll use what's referred to as the elbow method. I'll vary K and learn the model each time, and after learning, I'll compute the sum of squared distances from the centroid for each cluster. Optimal K-value is the K at which the sum of squared distances stops dropping dramatically, the "elbow" of the curve.

We have a pretty small number of data points, so we don't want to vary K too high. I'll limit it to 12, which will account for all the major keys, though this reasoning is just based off of a hunch.

In [26]:
def ssd(centroids, X):
    squared_distances = np.zeros(len(centroids))
    for i in range(len(centroids)):
        temp = X[X["Assigned Cluster"]==i+1]
        number_of_entries = len(temp.index)
        squared_distances[i] = X["Distance from {}".format(i+1)].pow(2).sum()
    return np.sum(squared_distances)    
        

In [27]:
SSD = dict.fromkeys(np.linspace(2,12,11),0)
for i in np.linspace(2,12,11, dtype = np.int):
    centroids = initialize(i)
    X = assign_clusters(centroids, X)
    centroids = update_centroids(centroids, X)
    for i in range(30):
        X = assign_clusters(centroids, X)
        centroids = update_centroids
    SSD[i] = ssd(centroids, X)

AttributeError: 'function' object has no attribute 'keys'

In [None]:
plt.plot(SSD)

In [98]:
centroids = initialize(4)
X = assign_clusters(centroids, X)
centroids = update_centroids(centroids, X)
X

{1: array([0.13638182, 0.07419024, 0.15498943, 0.02993772, 0.10244948,
       0.06169297, 0.03571581, 0.06518238, 0.06254054, 0.08865936,
       0.14173528, 0.04652496]), 2: array([0.10751793, 0.16143615, 0.04545502, 0.15389116, 0.06011386,
       0.07139762, 0.03348382, 0.12752349, 0.04044872, 0.02069302,
       0.14171746, 0.03632175]), 3: array([0.01854403, 0.01435195, 0.17092009, 0.11490023, 0.02545691,
       0.13080568, 0.01697766, 0.08323536, 0.04933596, 0.15824295,
       0.06084919, 0.15638   ]), 4: array([0.06162952, 0.057803  , 0.08530854, 0.00757736, 0.11557548,
       0.1198581 , 0.11300798, 0.01269774, 0.11749823, 0.09080843,
       0.1152307 , 0.1030049 ])}
41
13
0
69
{1: array([0.09848247, 0.05552599, 0.09852717, 0.07604697, 0.0792315 ,
       0.09159722, 0.0707709 , 0.11100575, 0.07404633, 0.08689394,
       0.08270444, 0.07491132]), 2: array([0.09961098, 0.06851002, 0.07062615, 0.11906779, 0.06093295,
       0.0936129 , 0.05851541, 0.09299973, 0.11113873, 0.06300501,


Unnamed: 0,A,A#,B,C,C#,D,D#,E,F,F#,G,G#,Distance from 1,Distance from 2,Distance from 3,Distance from 4,Assigned Cluster
8,0.0803437,0.063793,0.0768133,0.0732304,0.116024,0.0625009,0.0861044,0.0910987,0.0751045,0.0857017,0.0626328,0.126653,0.165753,0.211906,0.203414,0.141698,4
9,0.0689926,0.0672908,0.0646549,0.091151,0.118409,0.0580891,0.0738329,0.0675813,0.122189,0.095352,0.0630416,0.109416,0.178811,0.217341,0.214669,0.136869,4
34,0.0519003,0.0806738,0.105082,0.0500887,0.115675,0.051471,0.110022,0.0811642,0.0613269,0.13904,0.0432982,0.110259,0.180066,0.255015,0.200222,0.151320,4
62,0.108328,0.0888835,0.0793374,0.0580683,0.106662,0.0853214,0.10244,0.0777145,0.0667672,0.0817177,0.0743195,0.0704403,0.133737,0.187921,0.234431,0.129357,4
68,0.0570065,0.067678,0.0734636,0.0788251,0.111224,0.0761047,0.0761938,0.0797061,0.0996578,0.0706358,0.077459,0.132045,0.175103,0.209281,0.200162,0.126923,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1148,0.0926685,0.0681357,0.0801694,0.0739113,0.0868934,0.117494,0.0853386,0.095763,0.0944717,0.064333,0.0580734,0.0827485,0.161726,0.192462,0.208255,0.137473,4
1149,0.0665524,0.146736,0.0679399,0.0772902,0.0767955,0.0826845,0.10042,0.0501398,0.0986876,0.0751485,0.0645435,0.0930623,0.187307,0.186211,0.241292,0.144250,4
1156,0.0974596,0.108823,0.0510648,0.107877,0.0471214,0.100208,0.0933309,0.0616586,0.136492,0.058186,0.08551,0.0522697,0.192503,0.167274,0.257532,0.166908,4
1178,0.10563,0.0759637,0.0659519,0.0959414,0.0631783,0.102137,0.0515515,0.100383,0.119167,0.0547847,0.109076,0.0562353,0.153022,0.148073,0.233295,0.168818,2


In [99]:
X = assign_clusters(centroids, X)
X

Unnamed: 0,A,A#,B,C,C#,D,D#,E,F,F#,G,G#,Distance from 1,Distance from 2,Distance from 3,Distance from 4,Assigned Cluster
8,0.0803437,0.063793,0.0768133,0.0732304,0.116024,0.0625009,0.0861044,0.0910987,0.0751045,0.0857017,0.0626328,0.126653,0.082408,0.123861,0.0,0.062707,3
9,0.0689926,0.0672908,0.0646549,0.091151,0.118409,0.0580891,0.0738329,0.0675813,0.122189,0.095352,0.0630416,0.109416,0.104416,0.111912,0.0,0.063620,3
34,0.0519003,0.0806738,0.105082,0.0500887,0.115675,0.051471,0.110022,0.0811642,0.0613269,0.13904,0.0432982,0.110259,0.120714,0.174385,0.0,0.106291,3
62,0.108328,0.0888835,0.0793374,0.0580683,0.106662,0.0853214,0.10244,0.0777145,0.0667672,0.0817177,0.0743195,0.0704403,0.070528,0.109044,0.0,0.070876,3
68,0.0570065,0.067678,0.0734636,0.0788251,0.111224,0.0761047,0.0761938,0.0797061,0.0996578,0.0706358,0.077459,0.132045,0.094808,0.113868,0.0,0.053585,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1148,0.0926685,0.0681357,0.0801694,0.0739113,0.0868934,0.117494,0.0853386,0.095763,0.0944717,0.064333,0.0580734,0.0827485,0.057468,0.083812,0.0,0.064110,3
1149,0.0665524,0.146736,0.0679399,0.0772902,0.0767955,0.0826845,0.10042,0.0501398,0.0986876,0.0751485,0.0645435,0.0930623,0.127865,0.126032,0.0,0.074784,3
1156,0.0974596,0.108823,0.0510648,0.107877,0.0471214,0.100208,0.0933309,0.0616586,0.136492,0.058186,0.08551,0.0522697,0.124100,0.074733,0.0,0.094212,3
1178,0.10563,0.0759637,0.0659519,0.0959414,0.0631783,0.102137,0.0515515,0.100383,0.119167,0.0547847,0.109076,0.0562353,0.082993,0.031689,0.0,0.097927,3


In [28]:
#merge on common indices
df = pd.merge(df, X, left_index = True, right_index = True)



Unnamed: 0,canonical_composer,canonical_title,split,year,midi_filename,audio_filename,duration,A,A#,B,C,C#,D,D#,E,F,F#,G,G#
8,Alexander Scriabin,"Etude Op. 8, No. 13",train,2009,2009/MIDI-Unprocessed_02_R1_2009_03-06_ORIG_MI...,2009/MIDI-Unprocessed_02_R1_2009_03-06_ORIG_MI...,167.085837,0.0803437,0.063793,0.0768133,0.0732304,0.116024,0.0625009,0.0861044,0.0910987,0.0751045,0.0857017,0.0626328,0.126653
9,Alexander Scriabin,"Etude in D-flat Major, Op. 8 No. 10",train,2011,2011/MIDI-Unprocessed_15_R1_2011_MID--AUDIO_R1...,2011/MIDI-Unprocessed_15_R1_2011_MID--AUDIO_R1...,102.007110,0.0689926,0.0672908,0.0646549,0.091151,0.118409,0.0580891,0.0738329,0.0675813,0.122189,0.095352,0.0630416,0.109416
34,Claude Debussy,"""Les collines d'Anacapri"" from Preludes, Book I",train,2008,2008/MIDI-Unprocessed_07_R3_2008_01-05_ORIG_MI...,2008/MIDI-Unprocessed_07_R3_2008_01-05_ORIG_MI...,166.495560,0.0519003,0.0806738,0.105082,0.0500887,0.115675,0.051471,0.110022,0.0811642,0.0613269,0.13904,0.0432982,0.110259
62,Claude Debussy,"Ondine from Preludes, Book II",train,2008,2008/MIDI-Unprocessed_10_R3_2008_01-05_ORIG_MI...,2008/MIDI-Unprocessed_10_R3_2008_01-05_ORIG_MI...,193.301089,0.108328,0.0888835,0.0793374,0.0580683,0.106662,0.0853214,0.10244,0.0777145,0.0667672,0.0817177,0.0743195,0.0704403
68,Claude Debussy,"Preludes, Book II, III - La puerta del vino",train,2013,2013/ORIG-MIDI_01_7_8_13_Group__MID--AUDIO_02_...,2013/ORIG-MIDI_01_7_8_13_Group__MID--AUDIO_02_...,179.321402,0.0570065,0.067678,0.0734636,0.0788251,0.111224,0.0761047,0.0761938,0.0797061,0.0996578,0.0706358,0.077459,0.132045
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1148,Sergei Rachmaninoff / György Cziffra,Flight of the Bumblebee,train,2006,2006/MIDI-Unprocessed_12_R1_2006_01-08_ORIG_MI...,2006/MIDI-Unprocessed_12_R1_2006_01-08_ORIG_MI...,114.583219,0.0926685,0.0681357,0.0801694,0.0739113,0.0868934,0.117494,0.0853386,0.095763,0.0944717,0.064333,0.0580734,0.0827485
1149,Sergei Rachmaninoff / Vyacheslav Gryaznov,Italian Polka,train,2009,2009/MIDI-Unprocessed_04_R1_2009_04-06_ORIG_MI...,2009/MIDI-Unprocessed_04_R1_2009_04-06_ORIG_MI...,182.654732,0.0665524,0.146736,0.0679399,0.0772902,0.0767955,0.0826845,0.10042,0.0501398,0.0986876,0.0751485,0.0645435,0.0930623
1156,Wolfgang Amadeus Mozart,"Sonata in B-flat Major, K. 281, First Movement",train,2011,2011/MIDI-Unprocessed_02_R1_2011_MID--AUDIO_R1...,2011/MIDI-Unprocessed_02_R1_2011_MID--AUDIO_R1...,183.087424,0.0974596,0.108823,0.0510648,0.107877,0.0471214,0.100208,0.0933309,0.0616586,0.136492,0.058186,0.08551,0.0522697
1178,Wolfgang Amadeus Mozart,"Sonata in F Major, K. 280, 1st mov.",train,2013,2013/ORIG-MIDI_03_7_6_13_Group__MID--AUDIO_09_...,2013/ORIG-MIDI_03_7_6_13_Group__MID--AUDIO_09_...,192.605310,0.10563,0.0759637,0.0659519,0.0959414,0.0631783,0.102137,0.0515515,0.100383,0.119167,0.0547847,0.109076,0.0562353


# Licensing

The MAESTRO dataset (v1.0.0) is made available by Google LLC under a [Creative Commons Attribution Non-Commercial Share-Alike 4.0 (CC BY-NC-SA 4.0) license](https://creativecommons.org/licenses/by-nc-sa/4.0/), and all new analysis of that unaltered data done in this notebook uses the same license. 

# Data Sources

The [MAESTRO v1.0.0 dataset](https://magenta.tensorflow.org/datasets/maestro#v100) used here was introduced in the following work:

Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang,
  Sander Dieleman, Erich Elsen, Jesse Engel, and Douglas Eck. "Enabling
  Factorized Piano Music Modeling and Generation with the MAESTRO Dataset."
  In International Conference on Learning Representations, 2019.