# Data Preparation

In this notebook, we download and prepare the audio data for running automated concerto accompaniment experiments.

There are five sections below, which set up and/or explain the content in five different folders in the root directory: `cfg_files/`, `audio/`, `annot/`, `queries/`, and `scenarios/`.


## Configuration Files

The necessary configuration files are in the `cfg_files/` directory.  There are three files:
- `train.list`: specifies a list of the concerto movements that will be used for training
- `test.list`: specifies a list of the concerto movements that will be used for testing
- `AudioDataSummary.csv`: contains information about each audio recording, including urls and licenses.

Because we are planning to expand the dataset, for now we place all movements in `train.list` and leave `test.list` empty.

In [1]:
TRAIN_LIST_FILE = 'cfg_files/train.list'
AUDIO_SUMMARY_FILE = 'cfg_files/AudioDataSummary.csv'

## Audio Data

This section sets up the `audio/` folder.  There are three types of audio files in this benchmark:
- Full mix recordings.  The full mix recordings are downloaded from IMSLP by running the bash script `download_fullmixes.sh` below.
- Piano only recordings.  A set of piano only recordings were collected for this project and can be downloaded through a link provided below.
- Orchestra only recordings.  The orchestra only recordings are taken from Music Minus One and are under a private license.  They must be purchased, downloaded, and renamed as described below.

All audio recordings will be stored in the `audio/` directory with the naming convention `<piece>_<movement>_<id>.<extension>`.  The `<id>` contains one of the following tags:
- 'PO': piano + orchestra
- 'P': piano only
- 'O': orchestra only

as well as a number identifier.  For example, `rach2_mov1_PO2.mp3` is a full mix recording of the first movement in Rachmaninov's Piano Concerto No. 2.

See `cfg_files/AudioDataSummary.csv` for more detailed information about each recording, including urls and licenses.
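
To make the naming convention concrete, here is a minimal sketch that splits a filename into its parts.  The helper `parse_recording_name` and the regular expression are illustrative assumptions and are not part of this notebook; the sketch also assumes that piece and movement names contain no underscores.

```python
import re
from typing import NamedTuple

# Hypothetical pattern for <piece>_<movement>_<id>.<extension>, where <id> is a
# tag (PO, P, or O) followed by a number.  Assumes no underscores inside the
# piece or movement names.
FILENAME_PATTERN = re.compile(
    r'^(?P<piece>[^_]+)_(?P<movement>[^_]+)_(?P<tag>PO|P|O)(?P<number>\d+)\.(?P<ext>\w+)$'
)

TAG_DESCRIPTIONS = {'PO': 'piano + orchestra', 'P': 'piano only', 'O': 'orchestra only'}

class RecordingName(NamedTuple):
    piece: str
    movement: str
    tag: str
    number: int
    ext: str

def parse_recording_name(filename):
    '''Split a recording filename into piece, movement, tag, number, and extension.'''
    m = FILENAME_PATTERN.match(filename)
    if m is None:
        raise ValueError(f'Unexpected filename format: {filename}')
    return RecordingName(m['piece'], m['movement'], m['tag'], int(m['number']), m['ext'])

info = parse_recording_name('rach2_mov1_PO2.mp3')
print(info, '->', TAG_DESCRIPTIONS[info.tag])
```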

In [2]:
import os
import os.path
import pandas as pd
import import_ipynb
import system_utils

importing Jupyter notebook from system_utils.ipynb


In [3]:
AUDIO_ROOT = 'audio'
if not os.path.exists(AUDIO_ROOT):
    os.mkdir(AUDIO_ROOT)

Run the following bash script to download the full mixes from IMSLP.  These will be saved under the `audio/` folder.

In [5]:
!bash download_fullmixes.sh

File ‘audio/rach2_mov1_PO1.mp3’ already there; not retrieving.
File ‘audio/rach2_mov1_PO2.mp3’ already there; not retrieving.
File ‘audio/mozart21_mov1_PO1.mp3’ already there; not retrieving.
File ‘audio/mozart21_mov1_PO2.mp3’ already there; not retrieving.
File ‘audio/beeth1_mov1_PO1.mp3’ already there; not retrieving.
File ‘audio/beeth1_mov1_PO2.mp3’ already there; not retrieving.
File ‘audio/beeth1_mov1_PO3.mp3’ already there; not retrieving.
File ‘audio/beeth1_mov1_PO4.mp3’ already there; not retrieving.
File ‘audio/beeth1_mov1_PO5.mp3’ already there; not retrieving.
File ‘audio/bach5_mov1_PO1.mp3’ already there; not retrieving.
File ‘audio/bach5_mov1_PO2.mp3’ already there; not retrieving.
File ‘audio/rach2_mov2_PO1.mp3’ already there; not retrieving.
File ‘audio/rach2_mov2_PO2.mp3’ already there; not retrieving.
File ‘audio/rach2_mov3_PO1.mp3’ already there; not retrieving.
File ‘audio/rach2_mov3_PO2.mp3’ already there; not retrieving.


Download the following [zip file containing the piano only recordings](https://drive.google.com/file/d/1daMHu-jq2WZ7nN99dPFlsZd8qc4KOeVd/view?usp=sharing) and place the audio recordings in the `audio/` directory.

The orchestra only recordings are taken from the [Music Minus One Library](https://www.halleonard.com/series/MMONE?dt=item#products).  These are under a private license and must be purchased directly from the Hal Leonard website.  Once downloaded, the orchestra only files should be put in the `audio/` folder and renamed as described below.
- [Rachmaninov Piano Concerto No. 2 Mov. 1](https://www.halleonard.com/product-family/PC25985/rachmaninov-concerto-no-2-in-c-minor-op-18): rach2_mov1_O1.wav
- [Mozart Piano Concerto No. 21 Mov. 1](https://www.halleonard.com/product/400239/mozart-concerto-no-21-in-c-major-kv467-elvira-madigan): mozart21_mov1_O1.wav
- [Beethoven Piano Concerto No. 1 Mov. 1](https://www.halleonard.com/product-family/PC25983/beethoven-concerto-no-1-in-c-major-op-15): beeth1_mov1_O1.wav
- [Bach Harpsichord Concerto No. 5 Mov. 1](https://www.halleonard.com/product/44006419/bach-concerto-for-piano-strings-and-basso-continuo-bwv-1056-in-f-minor): bach5_mov1_O1.wav


The total cost of purchasing the MMO recordings in this dataset is approximately 96 USD.

Run the following cell to verify that all of the required audio files are present.

In [6]:
def verify_audio_dataset():
    '''
    Verifies that all of the required audio files for running experiments are present.
    '''
    passed = True
    d = pd.read_csv(AUDIO_SUMMARY_FILE)
    
    for filename in d['id']:
        filepath = f'{AUDIO_ROOT}/{filename}'
        if not os.path.exists(filepath):
            passed = False
            print(f'Missing file: {filepath}')
    
    if passed:
        print('All required files are present.')
    else:
        print('Missing files should be placed in audio/ before moving on.')

In [7]:
verify_audio_dataset()

All required files are present.


Now we convert all mp3 files to wav.  From this point forward, we will work exclusively with wav files.  The code below requires that ffmpeg be installed and available on the command line.

In [8]:
def convert_mp3_to_wav():
    '''
    Converts all the mp3 files to wav files with the same basename.  If timestamps are specified,
    trim the recording to only include the specified time interval.
    '''    
    d = system_utils.get_audio_summary_info()
    for audiofile in d:
        basename, ext = os.path.splitext(audiofile)
        if ext == '.mp3':
            src_filepath = f'{AUDIO_ROOT}/{basename}.mp3'
            dst_filepath = f'{AUDIO_ROOT}/{basename}.wav'
            if not os.path.exists(dst_filepath):
                if d[audiofile] is not None: 
                    (tStart, tEnd) = d[audiofile] # extract specified time interval
                    print(f'Converting {src_filepath} to .wav ({tStart}, {tEnd})')
                    os.system(f'ffmpeg -i {src_filepath} -ar 44100 -ss {tStart} -to {tEnd} {dst_filepath}')
                else:
                    print(f'Converting {src_filepath} to .wav') 
                    os.system(f'ffmpeg -i {src_filepath} -ar 44100 {dst_filepath}') # convert whole recording
            else:
                print(f'File {src_filepath} has already been converted to wav')

In [9]:
convert_mp3_to_wav()

File audio/rach2_mov1_PO1.mp3 has already been converted to wav
File audio/rach2_mov1_PO2.mp3 has already been converted to wav
File audio/mozart21_mov1_PO1.mp3 has already been converted to wav
File audio/mozart21_mov1_PO2.mp3 has already been converted to wav
File audio/beeth1_mov1_PO1.mp3 has already been converted to wav
File audio/beeth1_mov1_PO2.mp3 has already been converted to wav
File audio/bach5_mov1_PO1.mp3 has already been converted to wav
File audio/bach5_mov1_PO2.mp3 has already been converted to wav
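
The conversion above passes a formatted string to `os.system`.  As a possible alternative, the same ffmpeg invocation can be assembled as an argument list, which avoids shell quoting problems if a path ever contains spaces.  This is only a sketch: `build_ffmpeg_cmd` is a hypothetical helper, and the trim times in the example are made up.

```python
import shlex

def build_ffmpeg_cmd(src, dst, sample_rate=44100, t_start=None, t_end=None):
    '''Hypothetical helper: build the ffmpeg argument list for an mp3 -> wav conversion.'''
    cmd = ['ffmpeg', '-i', src, '-ar', str(sample_rate)]
    if t_start is not None and t_end is not None:
        # keep only the specified time interval
        cmd += ['-ss', str(t_start), '-to', str(t_end)]
    cmd.append(dst)
    return cmd

# Example with made-up trim times (in seconds)
cmd = build_ffmpeg_cmd('audio/rach2_mov1_PO1.mp3', 'audio/rach2_mov1_PO1.wav',
                       t_start=5, t_end=60)
print(shlex.join(cmd))

# To actually run the conversion (requires ffmpeg on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```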


## Annotation Files

The annotation files are already included in the `annot/` directory.  There are three kinds of files:
- `.beats`: These are files specifying the timestamps of measure downbeats.  There is one `.beats` file for each piano (only) and orchestra (only) recording.  Note that the orchestra only and piano only recordings are synchronized by design, so the piano annotation file is simply a soft link to the orchestra annotation file.  The full mix recordings do not have timestamp annotations, since they are only used as auxiliary information.
- `query.measures`: Each concerto movement is broken into a series of chunks corresponding to music segments in which the pianist is playing continuously.  This file indicates the measure numbers of each music segment.  Each music segment will serve as a query in our benchmark.
- `eval.measures`: We can only evaluate alignment quality in sections where both orchestra and piano are active.  This file indicates which measures in the concerto movement will be evaluated.
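
As an illustration of the `query.measures` format, the sketch below parses a single line into a piece id and a list of measure ranges.  The line format (a piece id followed by comma-separated `start-end` ranges) is inferred from an example in a code comment later in this notebook; `parse_measures_line` is a hypothetical helper, not part of the benchmark code.

```python
def parse_measures_line(line):
    '''Parse a line such as 'rach2_mov1,1-75,83-161' into (piece_id, [(start, end), ...]).'''
    parts = line.strip().split(',')
    piece_id, ranges = parts[0], parts[1:]
    segments = []
    for r in ranges:
        start, end = r.split('-')
        segments.append((int(start), int(end)))
    return piece_id, segments

piece_id, segments = parse_measures_line('rach2_mov1,1-75,83-161,177-297,313-374\n')
print(piece_id, segments)
```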

In [10]:
ANNOT_ROOT = 'annot'
QUERY_MEASURES_FILE = f'{ANNOT_ROOT}/query.measures'

The following script sets up the soft links for piano annotation files:

In [11]:
!bash setup_annot_links.sh

ln: failed to create symbolic link 'annot/bach5_mov1_P1.beats': File exists
ln: failed to create symbolic link 'annot/beeth1_mov1_P1.beats': File exists
ln: failed to create symbolic link 'annot/mozart21_mov1_P1.beats': File exists
ln: failed to create symbolic link 'annot/rach2_mov1_P1.beats': File exists


## Audio Queries

The audio queries will be generated and stored in the `queries/` directory.  As described above, each query is a single contiguous chunk of solo piano playing (as defined by the `query.measures` file).  Because we have a limited amount of data, we augment the dataset with time scale modified versions of the original piano only recordings.  Each time scale modified version needs its own appropriately modified beat annotation file, which is generated and stored in the same directory.
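
The effect on the annotations can be sketched with plain Python lists.  The example assumes that a time-scale modification factor multiplies every timestamp (so a factor above 1 lengthens the recording), and that a query's annotations are the downbeats inside the query interval, expressed relative to the query start.  The beat values and the helper names `scale_beats` and `select_beats` are made up for illustration.

```python
# Toy downbeat annotations as (start_time_sec, measure) pairs; values are made up.
beats = [(0.0, 1), (2.0, 2), (4.0, 3), (6.0, 4)]

def scale_beats(beats, tsm_factor):
    '''Multiply every timestamp by tsm_factor (a factor > 1 lengthens the recording).'''
    return [(t * tsm_factor, m) for t, m in beats]

def select_beats(beats, first_measure, last_measure):
    '''Keep downbeats from first_measure to last_measure, with times relative to the first.'''
    times = {m: t for t, m in beats}
    t0, t1 = times[first_measure], times[last_measure]
    return [(t - t0, m) for t, m in beats if t0 <= t <= t1]

stretched = scale_beats(beats, 1.25)   # downbeats move to 0.0, 2.5, 5.0, 7.5
query = select_beats(stretched, 2, 3)  # query covering measures 2 to 3
print(stretched)
print(query)
```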

In [2]:
import re
import numpy as np
import librosa as lb
import soundfile as sf
import shutil
from hmc_mir import tsm_tools

In [13]:
def get_audio_files(regexp):
    '''
    Returns a list of audio filenames matching a given regular expression.
    
    Inputs
    regexp: a string specifying the regular expression
    '''
    
    df = pd.read_csv(AUDIO_SUMMARY_FILE)
    p_list = [a for a in df['id'] if re.search(regexp, a)] 
    return p_list

In [14]:
def generate_tsm_audio(infile, outfile, tsm_factor):
    '''
    Applies time-scale modification to a given audio recording and saves the generated audio to file.
    
    Inputs
    infile: The filepath of the input audio
    outfile: The filepath of the output audio
    tsm_factor: The time-scale modification factor to apply
    '''
    if tsm_factor == 1: # just copy the file
        shutil.copyfile(infile, outfile)
    else:
        y, sr = lb.load(infile)
        y_mod = tsm_tools.tsm_hybrid(y, tsm_factor, sr)
        sf.write(outfile, y_mod, sr, subtype = 'PCM_16')

In [16]:
def modify_annots_tsm(infile, outfile, tsm_factor):
    '''
    Modifies an annotation file according to a single global time-scale modification factor.
    
    Inputs
    infile: the annotation file to be modified
    tsm_factor: the time-scale modification factor to apply
    outfile: the output annotation file
    '''
    df = pd.read_csv(infile)
    df['start'] = df['start'] * tsm_factor
    df.to_csv(outfile, index=False)

In [17]:
def get_query_timestamps(piece_id, query_measures_file, annot_file):
    '''
    This function infers the timestamp locations of all queries in a piano only recording.
    
    Inputs
    piece_id: A string specifying the piece and movement, e.g. 'rach2_mov1'
    query_measures_file: Filepath to the query.measures file that specifies the measures in each query
    annot_file: Filepath to the annotation file that specifies timestamps for measure downbeats
    
    Returns a tuple (measures, times), where measures is a list of (start_measure, end_measure)
    tuples and times is a list of (tstart, tend) tuples giving the starting and ending timestamps
    of each query in the piano only recording.
    '''

    # read annotation file
    df = pd.read_csv(annot_file) # has two columns: start (timestamp) and measure (number)
    
    # get query measure info
    d = {}
    with open(query_measures_file,'r') as f:
        for line in f: 
            parts = line.split(',') # e.g. 'rach2_mov1,1-75,83-161,177-297,313-374'
            cur_piece = parts[0]
            parts.pop(0)
            d[cur_piece] = parts
    if piece_id not in d:
        raise Exception(f"Cannot find entry for {piece_id} in {query_measures_file}.  Aborting.")
        
    # infer timestamps        
    times = []
    measures = []
    for pair in d[piece_id]:
        parts = pair.split('-')
        assert len(parts) == 2
        start_measure, end_measure = parts
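        # .iloc[0] extracts the scalar; calling float() on a one-element Series is deprecated in recent pandas
        start_time = float(df.loc[df['measure'] == int(start_measure), 'start'].iloc[0])
        end_time = float(df.loc[df['measure'] == int(end_measure), 'start'].iloc[0])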
        measures.append((int(start_measure), int(end_measure)))
        times.append((start_time, end_time))
                    
    return measures, times

In [18]:
def extract_audio_excerpt(infile, outfile, starttime, endtime):
    '''
    Extracts an audio segment from a given audio recording and writes the output to file.
    
    Inputs
    infile: The input audio recording from which the excerpt should be taken
    outfile: The output audio file to write
    starttime: The start time in seconds of the selected segment
    endtime: The end time in seconds of the selected segment
    '''
    y, sr = lb.load(infile)
    start_sample = int(np.round(starttime * sr))
    end_sample = int(np.round(endtime * sr))
    assert end_sample < len(y)
    sf.write(outfile, y[start_sample:end_sample], sr, subtype='PCM_16')

In [19]:
def modify_annots_select(infile, outfile, select_start, select_end):
    '''
    Modifies an annotation file by selecting a specified interval in the recording.
    Only annotations that fall within the interval will be included in the modified 
    annotation file, and the timestamps will be expressed relative to the interval start time.
    
    Inputs
    infile: the annotation file to be modified
    outfile: the output annotation file
    select_start: the start of the selected interval (in sec)
    select_end: the end of the selected interval (in sec)
    '''
    df = pd.read_csv(infile)
    select_rows = (df['start'] >= select_start) & (df['start'] <= select_end)
    df.loc[:,'start'] = df['start'] - select_start
    df = df[select_rows]
    df.to_csv(outfile, index=False)

In [20]:
def generateQueries(outdir, tsm_factors):
    '''
    Preps and generates time-scale modified audio queries and annotation files.
    
    Inputs
    outdir: directory to create and populate with audio queries
    tsm_factors: list of time-scale modification factors to use in generating queries
    '''
    
    if not os.path.exists(outdir):
        os.mkdir(outdir)    
    
    for p_file in get_audio_files(r'_P\d+\.\S+$'): # all solo piano 
        
        base_id = os.path.splitext(p_file)[0] # e.g. rach2_mov1_P1
        piece_dir = f'{outdir}/{base_id}'
        
        if os.path.exists(piece_dir):
            print(f'Directory {piece_dir} already exists.  Skipping.')
            continue
        os.mkdir(piece_dir)
        
        for tsm_factor in tsm_factors:
            
            tsm_dir = f'{piece_dir}/tsm{tsm_factor:.2f}' # e.g. outdir/rach2_mov1_P1/tsm0.85
            os.mkdir(tsm_dir)
            
            # generate time-scale modified audio
            tsm_id = f'{base_id}_tsm{tsm_factor:.2f}_all' # e.g. rach2_mov1_P1_tsm0.85_all
            orig_audio_file = f'{AUDIO_ROOT}/{p_file}'
            tsm_audio_file = f'{tsm_dir}/{tsm_id}.wav'
            generate_tsm_audio(orig_audio_file, tsm_audio_file, tsm_factor)
            
            # generate time-scale modified annotation file
            orig_annot_file = f'{ANNOT_ROOT}/{base_id}.beats'
            tsm_annot_file = f'{tsm_dir}/{tsm_id}.beats'
            modify_annots_tsm(orig_annot_file, tsm_annot_file, tsm_factor)
            
            # get query start & end timestamps
            piece_id = re.sub(r'_P\d+$','', base_id) # e.g. rach2_mov1
            _, query_tuples = get_query_timestamps(piece_id, QUERY_MEASURES_FILE, tsm_annot_file)
            
            for cnt, (query_start, query_end) in enumerate(query_tuples):
                
                # generate query audio file
                query_id = f'{base_id}_tsm{tsm_factor:.2f}_q{cnt+1}' # e.g. rach2_mov1_P1_tsm0.85_q1
                query_audio_file = f'{tsm_dir}/{query_id}.wav'
                extract_audio_excerpt(tsm_audio_file, query_audio_file, query_start, query_end)
                
                # generate query annotation file
                query_annot_file = f'{tsm_dir}/{query_id}.beats'
                modify_annots_select(tsm_annot_file, query_annot_file, query_start, query_end)
                
    return

In [21]:
QUERIES_ROOT = 'queries'
tsm_factors = [0.8, 0.9, 1, 1.11, 1.25]

In [22]:
generateQueries(QUERIES_ROOT, tsm_factors)

## Alignment Scenarios

The benchmark consists of a set of alignment scenarios that are saved in the `scenarios/` directory.  Here, we define a single alignment scenario to be a tuple of three recordings:
- Piano query.  This is the user's audio input and should be processed in an online fashion.
- Orchestra only recording.  This is the accompaniment that we would like to time scale modify in order to match the user's playing.
- Full mix recording.  This recording serves as an intermediary that allows us to align the piano and orchestra recordings.

The goal of the alignment scenario is to accurately estimate where we are in the orchestra recording in an online fashion.

The alignment scenarios are simply numbered sequentially (e.g. `s1/`, `s2/`, etc), and each scenario has its own directory containing the following:
- p.wav: This is a soft link to the piano query recording.
- o.wav: This is a soft link to the orchestra only recording.
- po.wav: This is a soft link to the full mix recording.
- p.beats: This is a soft link to the piano query annotation file.
- o.beats: This is a soft link to the orchestra annotation file.
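- scenario.info: This contains information about the recordings in the scenario.

Each `scenario.info` file contains a single line describing the scenario.  The same lines are also collected in a `scenarios.summary` file in the root of the `scenarios/` directory, with one line per scenario.  Each line has the following space-separated fields: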
- scenario id: an identifier for each scenario (e.g. s1, s2, etc)
- piano file: a filepath to the piano recording
- orchestra file: a filepath to the orchestra recording
- full mix file: a filepath to the full mix recording
- measure start: the index of the starting measure in the query (counting starts from 1)
- measure end: the index of the ending measure in the query (inclusive)
- piano start: the timestamp in the full (time scale modified) piano recording where the query begins, specified in seconds
- piano end: the timestamp in the full (time scale modified) piano recording where the query ends, specified in seconds
- orchestra start: the ground truth timestamp in the orchestra recording corresponding to the beginning of the query, specified in seconds
- orchestra end: the ground truth timestamp in the orchestra recording corresponding to the end of the query, specified in seconds
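
A line with these fields can be read back with a small parser such as the sketch below.  It assumes the fields are separated by whitespace and that no filepath contains spaces; the `Scenario` type, the `parse_scenario_line` helper, and the example line (paths and timestamps) are hypothetical.

```python
from typing import NamedTuple

class Scenario(NamedTuple):
    scenario_id: str
    piano_file: str
    orchestra_file: str
    fullmix_file: str
    measure_start: int
    measure_end: int
    piano_start: float
    piano_end: float
    orchestra_start: float
    orchestra_end: float

def parse_scenario_line(line):
    '''Parse one whitespace-separated summary line into a Scenario record.'''
    fields = line.split()
    if len(fields) != 10:
        raise ValueError(f'Expected 10 fields, found {len(fields)}: {line!r}')
    sid, p_file, o_file, po_file = fields[:4]
    m_start, m_end = int(fields[4]), int(fields[5])
    times = [float(x) for x in fields[6:]]
    return Scenario(sid, p_file, o_file, po_file, m_start, m_end, *times)

# Made-up example line; real paths and timestamps will differ.
example = 's1 /data/queries/p.wav /data/audio/o.wav /data/audio/po.wav 1 75 0.0 120.5 3.2 130.7\n'
scenario = parse_scenario_line(example)
print(scenario.scenario_id, scenario.measure_start, scenario.orchestra_end)
```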

In [23]:
def get_piece_ids(infile):
    '''
    Parses the train.list or test.list configuration file and returns a list of piece ids to process.
    
    Inputs
    infile: the filepath to train.list or test.list
    
    Returns a list of piece ids.
    '''
    
    ids = []
    with open(infile,'r') as f:
        for line in f:
            ids.append(line.strip())
    return ids        

In [24]:
def myLogger(logfile, loginfo):
    '''
    Writes logging information to a specified log file.
    
    Inputs
    logfile: name of log file to generate
    loginfo: either a string or a list of strings to write to the log file
    '''
    
    assert not os.path.exists(logfile)
    with open(logfile, 'w') as f:
        if isinstance(loginfo, list):
            for ln in loginfo:
                f.write(ln)
        else:
            f.write(loginfo)
    return

In [25]:
def generateScenarios(outdir, piece_list, tsm_factors):
    '''
    Constructs alignment scenarios and populates scenario directories with relevant audio and annotation files.
    
    Inputs
    outdir: the root directory to create and populate with scenario directories
    piece_list: filepath of text file containing a list of piece ids to process
    tsm_factors: list of time-scale modification factors to consider
    '''
    
    if os.path.exists(outdir):
        print(f"Directory {outdir}/ already exists.  Aborting.") 
        return  # very fast to generate, so easiest way is to just delete directory and re-generate from scratch
    os.mkdir(outdir)

    cnt = 0
    logInfo = [] # debug info for logging
    
    for piece_id in get_piece_ids(piece_list): # e.g. rach2_mov1
        
        for fullmix_file in get_audio_files(rf'^{piece_id}_PO\d+\.\S+$'): # full mixes, e.g. rach2_mov1_PO2.mp3
            fullmix_file = re.sub(r'\.mp3$', '.wav', fullmix_file) # use wav file (not mp3)
        
            for tsm_factor in tsm_factors:
                
                tsm_id = f'{piece_id}_P1_tsm{tsm_factor:.2f}'
                tsm_dir = f'{QUERIES_ROOT}/{piece_id}_P1/tsm{tsm_factor:.2f}'
                tsm_annot_file = f'{tsm_dir}/{tsm_id}_all.beats'
                o_annot_file = f'{ANNOT_ROOT}/{piece_id}_O1.beats'
                assert os.path.exists(tsm_annot_file)
                assert os.path.exists(o_annot_file)
                measures, q_times = get_query_timestamps(piece_id, QUERY_MEASURES_FILE, tsm_annot_file)
                _, o_times = get_query_timestamps(piece_id, QUERY_MEASURES_FILE, o_annot_file)
                
                for (m, qt, ot, queryIdx) in zip(measures, q_times, o_times, np.arange(len(measures))+1): # queries
                    
                    cnt += 1
                    scenario_dir = f'{outdir}/s{cnt}'
                    os.mkdir(scenario_dir)
                    cwd = os.getcwd()
                    
                    # piano only audio (query)
                    p_audio = f'{cwd}/{tsm_dir}/{tsm_id}_q{queryIdx}.wav' # link targets are given as absolute paths
                    p_link = f'{scenario_dir}/p.wav'
                    os.symlink(p_audio, p_link)
                                        
                    # orchestra only audio
                    o_audio = f'{cwd}/{AUDIO_ROOT}/{piece_id}_O1.wav'
                    o_link = f'{scenario_dir}/o.wav'
                    os.symlink(o_audio, o_link)
                    
                    # full mix audio
                    po_audio = f'{cwd}/{AUDIO_ROOT}/{fullmix_file}'
                    po_link = f'{scenario_dir}/po.wav'
                    os.symlink(po_audio, po_link)
                    
                    # query annotation
                    query_annot = f'{cwd}/{tsm_dir}/{tsm_id}_q{queryIdx}.beats'
                    query_annot_link = f'{scenario_dir}/p.beats'
                    os.symlink(query_annot, query_annot_link)
                    
                    # orchestra annotation
                    o_annot = f'{cwd}/{ANNOT_ROOT}/{piece_id}_O1.beats'
                    o_annot_link = f'{scenario_dir}/o.beats'
                    os.symlink(o_annot, o_annot_link)
                    
                    # log file
                    # The format is: s1 p_file o_file po_file meas_start meas_end p_start p_end o_start o_end
                    logfile = f'{scenario_dir}/scenario.info'
                    logstr = f's{cnt} {p_audio} {o_audio} {po_audio} {m[0]} {m[1]} {qt[0]} {qt[1]} {ot[0]} {ot[1]}\n'
                    myLogger(logfile, logstr)
                    logInfo.append(logstr)
                    
    # summary log file                
    myLogger(f'{outdir}/scenarios.summary', logInfo)
    
    return          

In [26]:
SCENARIOS_ROOT = 'scenarios'

In [27]:
generateScenarios(SCENARIOS_ROOT, TRAIN_LIST_FILE, tsm_factors)