`python3 -m pip install -U pandas plotly nbformat`

`pip install "https://github.com/DCMLab/wavescapes/archive/refs/heads/johannes.zip"`

In [1]:
%reload_ext autoreload
%autoreload 2
import pandas as pd
from itertools import repeat
from etl import get_metadata, test_dict_keys, make_feature_vectors
from utils import longn2squaren

## Settings

In [2]:
DEBUSSY_REPO = '.'
DATA_FOLDER = '~/DATA/debussy_figures/data'
#DATA_FOLDER = './data/data'
EXAMPLE_FNAME = 'l000_etude'
LONG_FORMAT = True

## Loading metadata
Metadata for all pieces contained in the dataset.

In [3]:
metadata = get_metadata(DEBUSSY_REPO)
metadata.columns

Metadata for 82 files.


Index(['rel_paths', 'last_mc', 'last_mn', 'length_qb', 'length_qb_unfolded',
       'all_notes_qb', 'n_onsets', 'TimeSig', 'KeySig', 'label_count',
       'composer', 'workTitle', 'movementNumber', 'movementTitle',
       'workNumber', 'poet', 'lyricist', 'arranger', 'copyright',
       'creationDate', 'mscVersion', 'platform', 'source', 'translator',
       'musescore', 'ambitus', 'comment', 'comments', 'composed_end',
       'composed_start', 'originalFormat', 'pdf', 'staff_1_ambitus',
       'staff_1_instrument', 'staff_2_ambitus', 'staff_2_instrument',
       'staff_3_ambitus', 'staff_3_instrument', 'transcriber', 'typesetter',
       'year', 'median_recording', 'qb_per_minute', 'sounding_notes_per_qb',
       'sounding_notes_per_minute'],
      dtype='object')

The column `year` contains the composition year, computed as the midpoint between the beginning and the end of the composition span.

In [4]:
metadata.year.head(10)

fnames
l000_etude                     1915.0
l000_soirs                     1917.0
l009_danse                     1880.0
l066-01_arabesques_premiere    1888.0
l066-02_arabesques_deuxieme    1891.0
l067_mazurka                   1890.0
l068_reverie                   1890.0
l069_tarentelle                1890.0
l070_ballade                   1890.0
l071_valse                     1890.0
Name: year, dtype: float64
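
As a minimal sketch of the midpoint computation, assuming `year` is derived from the `composed_start` and `composed_end` columns (the notebook does not show this step, and the toy values below are made up):

```python
import pandas as pd

# Toy stand-in for the metadata; file names and years are illustrative only.
toy = pd.DataFrame(
    {"composed_start": [1915, 1888, 1890], "composed_end": [1915, 1891, 1890]},
    index=pd.Index(["piece_a", "piece_b", "piece_c"], name="fnames"),
)

# Assumption: the year is the midpoint of the composition span.
toy["year"] = (toy["composed_start"] + toy["composed_end"]) / 2
print(toy["year"])
```

A span of 1888 to 1891 yields 1889.5, which is why the column is stored as float.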

The column `median_recording` contains median recording durations in seconds, retrieved from the Spotify API.

In [5]:
metadata.median_recording.head(10)

fnames
l000_etude                     272.5530
l000_soirs                     145.8265
l009_danse                     124.5995
l066-01_arabesques_premiere    240.7780
l066-02_arabesques_deuxieme    213.9330
l067_mazurka                   175.9130
l068_reverie                   265.4265
l069_tarentelle                331.5290
l070_ballade                   396.0200
l071_valse                     221.4500
Name: median_recording, dtype: float64

Columns reflecting a piece's activity are currently:
* `qb_per_minute`: the piece's length in quarterbeats ('qb') divided by the median recording time in minutes; a proxy for tempo
* `sounding_notes_per_minute`: the summed length of all notes divided by the piece's duration (in minutes)
* `sounding_notes_per_qb`: the summed length of all notes divided by the piece's length (in qb)

Other measures of activity could be, for example, 'onsets per beat/second' or 'distinct pitch classes per beat/second'.
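
The three measures can be sketched as follows. This is an illustration under assumptions, not the code in `etl`: we assume the inputs are `length_qb`, `all_notes_qb`, and `median_recording`, and the toy values are invented.

```python
import pandas as pd

# Toy stand-in for the metadata; values are illustrative only.
toy = pd.DataFrame(
    {
        "length_qb": [284.0, 120.0],           # piece length in quarterbeats
        "all_notes_qb": [1136.0, 300.0],       # summed note lengths in quarterbeats
        "median_recording": [272.553, 90.0],   # median recording time in seconds
    },
    index=pd.Index(["piece_a", "piece_b"], name="fnames"),
)

minutes = toy["median_recording"] / 60  # seconds -> minutes

# Assumption: which metadata columns feed each measure.
activity = pd.DataFrame(
    {
        "qb_per_minute": toy["length_qb"] / minutes,
        "sounding_notes_per_minute": toy["all_notes_qb"] / minutes,
        "sounding_notes_per_qb": toy["all_notes_qb"] / toy["length_qb"],
    }
)
print(activity)
```

Whether the real columns use `length_qb` or `length_qb_unfolded` is not stated here, so the choice above is a guess.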

## Loading pickled 9-fold vectors

The function `make_feature_vectors` is a shortcut for
* loading a particular kind of pickled normalized magnitude-phase matrices
* loading pickled tritone, major, and minor coefficients
* concatenating them

In [6]:
norm_params = ('0c', True)
ninefold_dict = make_feature_vectors(DATA_FOLDER, norm_params=norm_params, long=LONG_FORMAT)
test_dict_keys(ninefold_dict, metadata)
longest = max(a.shape for a in ninefold_dict.values())[0]
print(f"Maximum number of nodes per wavescape: {longest}. Length of bottom row: {longn2squaren(longest)}")

Found matrices for all files listed in metadata.tsv.
Maximum number of nodes per wavescape: 653796. Length of bottom row: 1143
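
A wavescape whose bottom row has n nodes contains n(n+1)/2 nodes in long format, so the bottom-row length can be recovered by inverting the triangular number. The function below is a hypothetical stand-in for `utils.longn2squaren`, whose actual implementation is not shown here:

```python
import math


def long_to_square(n_nodes: int) -> int:
    """Return n such that n * (n + 1) / 2 == n_nodes.

    Hypothetical re-implementation for illustration; not the code in utils.
    """
    n = (math.isqrt(8 * n_nodes + 1) - 1) // 2
    if n * (n + 1) // 2 != n_nodes:
        raise ValueError(f"{n_nodes} is not a triangular number")
    return n


print(long_to_square(653796))  # 1143
```

Using integer arithmetic via `math.isqrt` avoids floating-point rounding for large node counts.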


## Creating the meta index

Since the feature vectors are already in long format, we just need to concatenate them. To create a meta index for the ~4.8 million triangles, we assign IDs 0..81 to the 82 pieces in the (lexicographical) order in which they are listed in metadata.tsv. To make the IDs robust to later changes in the scores' lengths, index 0 of every piece starts at meta index `ID * 1,000,000`. For example, index 3 of the last piece corresponds to meta index `81000003`. From the meta index we get back to (ID, ix) via `divmod(81000003, 1000000) => (81, 3)`.
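
The mapping in both directions can be sketched with two small helpers. The names `to_meta_index`, `from_meta_index`, and `OFFSET` are hypothetical and only serve this illustration:

```python
OFFSET = 1_000_000  # block size reserved for each piece


def to_meta_index(piece_id: int, ix: int) -> int:
    """Combine a piece ID and a within-piece index into one meta index."""
    if not 0 <= ix < OFFSET:
        raise ValueError("within-piece index must fit into one block")
    return piece_id * OFFSET + ix


def from_meta_index(meta_ix: int) -> tuple[int, int]:
    """Split a meta index back into (piece ID, within-piece index)."""
    return divmod(meta_ix, OFFSET)


print(to_meta_index(81, 3))       # 81000003
print(from_meta_index(81000003))  # (81, 3)
```

The scheme works because the longest wavescape has 653796 nodes, which is below the block size of one million.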

In [7]:
id2fname = dict(enumerate(metadata.index))
id2lengths = {id: (longn2squaren(ninefold_dict[fname].shape[0]), duration)
  for id, (fname, duration) in enumerate(metadata[['median_recording']].itertuples(name=None))
}
print(f"ID 0 corresponds to '{id2fname[0]}' with length = {id2lengths[0][0]} quarters and duration = {id2lengths[0][1]} seconds.")

ID 0 corresponds to 'l000_etude' with length = 284 quarters and duration = 272.553 seconds.


The DataFrame `meta_index` is indexed by the meta index and holds the metadata for each corresponding triangle:
* `abs_length`: the triangle's length in quarters
* `abs_start`: index of the triangle's left-most node
* `rel_length`, `rel_start`: the same values divided by the wavescape's width
* `duration`: the triangle's absolute duration in seconds, based on the median of several commercial recordings

In [8]:
def make_meta_ix(id, square_n, duration=None):
    result = pd.DataFrame(
                  [length_start
                   for length, elements in enumerate(range(square_n,0,-1), 1)
                   for length_start in zip(repeat(length), range(elements))],
                  columns=['abs_length', 'abs_start'])
    relative = result / square_n
    relative.columns = ['rel_length', 'rel_start']
    result = pd.concat([result, relative], axis=1)
    if duration is not None:
        duration_col = result.rel_length * duration
        result = pd.concat([result, duration_col.rename('duration')], axis=1)
    result.index += id * 1000000
    result.index.name = 'meta_index'
    return result

meta_index = pd.concat([make_meta_ix(id, square_n, duration) for id, (square_n, duration) in id2lengths.items()])
meta_index

meta_index,abs_length,abs_start,rel_length,rel_start,duration
0,1,0,0.003521,0.000000,0.959694
1,1,1,0.003521,0.003521,0.959694
2,1,2,0.003521,0.007042,0.959694
3,1,3,0.003521,0.010563,0.959694
4,1,4,0.003521,0.014085,0.959694
...,...,...,...,...,...
81003398,80,1,0.975610,0.012195,130.705366
81003399,80,2,0.975610,0.024390,130.705366
81003400,81,0,0.987805,0.000000,132.339183
81003401,81,1,0.987805,0.012195,132.339183
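
To make the row order concrete, here is the same enumeration as in `make_meta_ix`, run on a toy wavescape with a bottom row of 3 nodes (the value 3 is chosen only for illustration):

```python
from itertools import repeat

square_n = 3  # toy bottom-row length

# Same nested comprehension as in make_meta_ix: rows are ordered by triangle
# length first, then by the position of the left-most node.
pairs = [
    length_start
    for length, elements in enumerate(range(square_n, 0, -1), 1)
    for length_start in zip(repeat(length), range(elements))
]
print(pairs)
# [(1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (3, 0)]

# The number of rows is the triangular number n * (n + 1) / 2.
assert len(pairs) == square_n * (square_n + 1) // 2
```

The bottom row (length 1) comes first and the apex (length `square_n`) last.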


The following function concatenates any of the `{piece -> array}` dictionaries returned by the various `etl.get...` functions so that the result is indexed by the meta index.

In [9]:
def meta_concat(piece_dict):
    return pd.concat([
        pd.DataFrame(piece_dict[fname], index=[id*1000000 + i for i in range(piece_dict[fname].shape[0])])
        for id, fname in enumerate(metadata.index)
    ])

concatenated_features = meta_concat(ninefold_dict)
concatenated_features

,0,1,2,3,4,5,6,7,8
0,0.538214,0.749750,0.833167,0.749750,0.871881,0.666334,0.687347,0.595144,0.000000
1,0.444424,0.499500,0.666334,1.000000,0.772102,0.332667,0.670297,0.536056,0.000000
2,0.410684,0.166667,0.333333,1.000000,0.744017,0.333333,0.618338,0.463767,0.000000
3,0.410684,0.166667,0.333333,1.000000,0.744017,0.333333,0.618338,0.463767,0.000000
4,0.444424,0.499500,0.666334,1.000000,0.772102,0.332667,0.670297,0.536056,0.000000
...,...,...,...,...,...,...,...,...,...
81003398,0.009570,0.271610,0.211400,0.135730,0.455970,0.154236,0.728100,0.850491,1475.255769
81003399,0.015069,0.275625,0.215225,0.133001,0.456563,0.151566,0.726680,0.853887,1459.809939
81003400,0.008954,0.269911,0.210078,0.141136,0.457083,0.153272,0.730862,0.848490,1491.908615
81003401,0.012880,0.278424,0.218777,0.143814,0.461059,0.149676,0.731822,0.852843,1491.407261
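
Because the piece ID is encoded in the meta index, rows can be grouped by piece without an extra column. The following is a sketch on a toy frame standing in for `concatenated_features`; the frame and the name `OFFSET` are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

OFFSET = 1_000_000  # block size reserved for each piece

# Toy frame: piece 0 contributes three rows, piece 1 contributes two.
toy = pd.DataFrame(
    np.arange(10.0).reshape(5, 2),
    index=[0, 1, 2, 1_000_000, 1_000_001],
)

# Vectorized divmod splits every meta index into (piece ID, local index).
piece_id, local_ix = np.divmod(toy.index.to_numpy(), OFFSET)

# Group by the recovered piece ID, e.g. to count rows per piece.
rows_per_piece = toy.groupby(piece_id).size()
print(rows_per_piece)
```

The recovered IDs can then be mapped back to file names with a lookup such as `id2fname`.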
