# Data Preprocessing

---

This Colab notebook contains the data preprocessing phase of the VAE system. The goal of the project is unconditional and conditional audio generation with a Variational Autoencoder. The training data comes from a dataset of MIDI files, the Lakh Dataset. This dataset is distributed in several versions; the project uses a "cleansed" subset.



In [None]:
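# Shell commands need the "!" prefix in Colab. On newer images 'libfluidsynth1' has no
# installation candidate (see the output below); a newer package such as libfluidsynth3 may be needed.
!apt-get update -qq && apt-get install -qq libfluidsynth1 fluid-soundfont-gm build-essential libasound2-dev libjack-dev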
%pip install -qU pyfluidsynth pretty_midi
%pip install music21
%pip install pypianoroll

E: Package 'libfluidsynth1' has no installation candidate
Collecting pypianoroll
  Downloading pypianoroll-1.0.4-py3-none-any.whl (26 kB)
Installing collected packages: pypianoroll
Successfully installed pypianoroll-1.0.4


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
import shutil
import glob
import numpy as np
import pandas as pd
import pretty_midi
import pypianoroll
import tables
from music21 import converter, instrument, note, chord, stream
import music21
import librosa
import librosa.display
import matplotlib.pyplot as plt
import json
import IPython.display
from datetime import datetime
import random

import torch
import torch.nn as nn
from torch.nn import functional as F
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)

from tqdm.notebook import tqdm, trange

import itertools
root_dir = '/content/drive/MyDrive/ColabNotebooks/AsItSounds'
data_dir = root_dir + '/Lakh Piano Dataset/lpd_5/lpd_5_cleansed'
music_dataset_lpd_dir = root_dir + '/Music Dataset/midis/lmd_matched'

Using device: cpu


# Getting MIDI and Song Metadata

Utility functions that map Million Song Dataset IDs (msd_id) to the MIDI files and their paths in the Lakh Dataset. The MIDI files are stored in a hierarchical folder structure derived from the corresponding msd_id.

In [None]:
RESULTS_PATH = os.path.join(root_dir, 'Lakh Piano Dataset', 'Metadata')

# Utility functions for retrieving paths from an msd_id (Million Song Dataset ID)
def msd_id_to_dirs(msd_id):
    """Given an MSD ID, generate the path prefix.
    E.g. TRABCD12345678 -> A/B/C/TRABCD12345678"""
    return os.path.join(msd_id[2], msd_id[3], msd_id[4], msd_id)

# function for retrieving path of file h5 (metadata) from a msd_id
def msd_id_to_h5(msd_id):
    """Given an MSD ID, return the path to the corresponding h5"""
    return os.path.join(RESULTS_PATH, 'lmd_matched_h5',
                        msd_id_to_dirs(msd_id) + '.h5')

# Load the midi npz file from the LMD cleansed folder, given the msd_id and the md5
def get_midi_npz_path(msd_id, midi_md5):
    return os.path.join(data_dir,
                        msd_id_to_dirs(msd_id), midi_md5 + '.npz')

# Load the midi file from the Music Dataset folder
def get_midi_path(msd_id, midi_md5):
    return os.path.join(music_dataset_lpd_dir,
                        msd_id_to_dirs(msd_id), midi_md5 + '.mid')
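
As a quick illustration of the folder rule, the sketch below builds one path by hand. The MSD ID, the MD5 string and the root folder name are made-up placeholders, not entries from the dataset.

```python
import os

def msd_id_to_dirs(msd_id):
    """Same rule as above: characters 3 to 5 of the ID become three nested folders."""
    return os.path.join(msd_id[2], msd_id[3], msd_id[4], msd_id)

# Hypothetical identifiers, used only for illustration
example_msd_id = "TRABCD12345678"
example_md5 = "0123456789abcdef0123456789abcdef"
example_root = "lpd_5_cleansed"

# Root folder, then A/B/C/<msd_id>, then the file named after the MD5 checksum
npz_path = os.path.join(example_root, msd_id_to_dirs(example_msd_id), example_md5 + ".npz")
print(npz_path)
```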

We read the text file that maps the lpd_ids to the msd_ids and store the result in a DataFrame. From it we build two dictionaries: one returns the msd_id for a given lpd_id, the other does the reverse.

In [None]:
# Open the cleansed ids - cleansed file ids : msd ids
cleansed_ids = pd.read_csv(os.path.join(root_dir, 'Lakh Piano Dataset', 'cleansed_ids.txt'), delimiter = '    ', header = None, engine ='python')
lpd_to_msd_ids = {a:b for a, b in zip(cleansed_ids[0], cleansed_ids[1])}
msd_to_lpd_ids = {a:b for a, b in zip(cleansed_ids[1], cleansed_ids[0])}
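
A minimal sketch of the two lookups on toy data, assuming each line has the layout `<lpd_id><4 spaces><msd_id>`. The IDs are invented for the example. The inverse dictionary is only lossless when the mapping is one-to-one, which a length comparison can confirm.

```python
# Two made-up lines in the assumed "<lpd_id><4 spaces><msd_id>" layout
sample_lines = [
    "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa    TRAAAAA00000000000",
    "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb    TRBBBBB00000000000",
]

pairs = [line.split("    ") for line in sample_lines]
lpd_to_msd = {lpd_id: msd_id for lpd_id, msd_id in pairs}
msd_to_lpd = {msd_id: lpd_id for lpd_id, msd_id in pairs}

# If the mapping is one-to-one, both dictionaries have the same number of keys
assert len(lpd_to_msd) == len(msd_to_lpd)

# A round trip through both dictionaries returns the starting ID
some_lpd_id = pairs[0][0]
assert msd_to_lpd[lpd_to_msd[some_lpd_id]] == some_lpd_id
```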

# Genre Management

The cell below builds a DataFrame and a dictionary that record the genre of each MIDI file in the dataset. Each file has a reference genre (Pop_Rock, Reggae, Blues and so on) and may also have a reference subgenre; in that case both labels are stored for the track, so the dictionary maps each track ID to a list of one or two genres. This information is not used in the initial training, but it may be useful for conditional generation.

In [None]:
# Reading the genre annotations
genre_file_dir = os.path.join(root_dir, 'Lakh Piano Dataset', 'msd_tagtraum_cd1.cls')
ids = []
genres = []

with open(genre_file_dir) as f:
    for line in f:
        # Skip the comment lines at the top of the file
        if line.startswith('#'):
            continue
        split = line.strip().split("\t")

        # Single genre case
        if len(split) == 2:
            ids.append(split[0])
            genres.append(split[1])
        # Sub-genre case: the track gets one row per label
        elif len(split) == 3:
            ids.extend([split[0], split[0]])
            genres.extend([split[1], split[2]])

# Dataframe and dictionary
genre_df = pd.DataFrame(data={"TrackID": ids, "Genre": genres})
genre_dict = genre_df.groupby('TrackID')['Genre'].apply(lambda x: x.tolist()).to_dict()
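
To make the shape of the resulting dictionary concrete, here is a small sketch on invented annotation lines. It uses a plain `defaultdict` rather than pandas, and the track IDs and labels are placeholders, not real annotations.

```python
from collections import defaultdict

# Made-up annotation lines (tab-separated); real IDs and labels will differ
sample_lines = [
    "# comment line, skipped",
    "TRAAAAA00000000000\tPop_Rock",
    "TRBBBBB00000000000\tBlues\tJazz",
]

genre_lists = defaultdict(list)
for line in sample_lines:
    if line.startswith("#"):
        continue
    track_id, *labels = line.strip().split("\t")
    # Keep only lines with one or two labels, as in the cell above
    if len(labels) in (1, 2):
        genre_lists[track_id].extend(labels)

genre_lists = dict(genre_lists)
print(genre_lists)
# {'TRAAAAA00000000000': ['Pop_Rock'], 'TRBBBBB00000000000': ['Blues', 'Jazz']}
```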

**Objects that we need**

- cleansed_ids: mapping of LPD file name to MSD file name
- lmd_metadata: list of dictionaries; each dict has an msd_id field that identifies the song
- lmd_file_name: the actual path of the file

# Building the Dataset

From the entire Lakh Cleansed Dataset we randomly select a number of songs, which are then cleaned and saved. The selection uses the msd_to_lpd_ids dictionary defined above: its keys are collected in a list, and the song IDs are drawn at random from that list.

In [None]:
print(f"Total number of samples: {len(msd_to_lpd_ids.keys())}")

Total number of samples: 21425


Only one of the following two cells should be run: the first selects a random sample from the dataset, the second uses the entire dataset.

In [None]:
# Randomly choose 10000 distinct songs (random.sample draws without replacement)
train_ids = random.sample(list(msd_to_lpd_ids.keys()), k = 10000)

In [None]:
train_ids = list(msd_to_lpd_ids.keys()) # full dataset loading (~21,000 songs)
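
For the random-subset option, it matters whether the draw is made with or without replacement: `random.choices` may return the same ID more than once, while `random.sample` never does. The sketch below shows the difference on a stand-in population; the IDs, the subset size and the seed are arbitrary choices for the example.

```python
import random

population = [f"id_{n}" for n in range(100)]  # stand-in for the list of MSD IDs
k = 50

rng = random.Random(0)  # fixed seed so the subset is reproducible
with_replacement = rng.choices(population, k=k)   # duplicates are possible
without_replacement = rng.sample(population, k)   # every ID appears at most once

print(len(set(with_replacement)), "distinct IDs out of", k, "(with replacement)")
print(len(set(without_replacement)), "distinct IDs out of", k, "(without replacement)")
```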

For each selected ID we look up the name of the corresponding LPD file. The utility functions defined above then give the path to the corresponding .npz file. The pypianoroll library loads that file as a multitrack pianoroll with 5 tracks, one per instrument (Piano, Guitar, Bass, Strings, Drums).

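A resolution of 2 was used. In pypianoroll the resolution is the number of time steps per quarter note (beat), so at resolution 2 each time step spans an eighth note. This setting determines the temporal precision with which the notes of a track are represented: a lower resolution gives shorter tensors, but events closer together than one time step can no longer be told apart.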
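
The relation between time steps and musical time is simple arithmetic, sketched below. The helper names are hypothetical, and the 4 beats per bar is an assumption (4/4 time) that does not hold for every song.

```python
def steps_to_beats(n_steps, resolution):
    """Number of quarter-note beats covered by n_steps time steps."""
    return n_steps / resolution

def steps_to_bars(n_steps, resolution, beats_per_bar=4):
    """Number of bars, assuming a constant meter (4/4 by default)."""
    return steps_to_beats(n_steps, resolution) / beats_per_bar

# Example: a pianoroll with 1018 time steps at resolution 2
n_steps = 1018
print(steps_to_beats(n_steps, 2))  # 509.0
print(steps_to_bars(n_steps, 2))   # 127.25
```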

A dictionary that keeps track of the individual tracks in the pianoroll is used to build the tensors given as input to the network, so that all 5 tracks can be combined into a single tensor. If one of the tracks is empty, an all-zero placeholder track takes its place in the tensor, and the tensor is marked as "having empty tracks". Only tensors without empty tracks are added to the final list.
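
The fill-and-stack step can be sketched with NumPy on a toy song. The array sizes and contents are invented; only the track order and the 128-pitch axis follow the description above.

```python
import numpy as np

n_steps, n_pitches = 8, 128  # toy sizes; 128 is the MIDI pitch range
track_order = ["Piano", "Guitar", "Bass", "Strings", "Drums"]

# Toy song: Guitar and Strings are empty (zero time steps)
tracks = {
    "Piano": np.ones((n_steps, n_pitches), dtype=np.uint8),
    "Guitar": np.zeros((0, n_pitches), dtype=np.uint8),
    "Bass": np.ones((n_steps, n_pitches), dtype=np.uint8),
    "Strings": np.zeros((0, n_pitches), dtype=np.uint8),
    "Drums": np.ones((n_steps, n_pitches), dtype=np.uint8),
}

placeholder = np.zeros((n_steps, n_pitches), dtype=np.uint8)
has_empty_tracks = False
filled = []
for name in track_order:
    roll = tracks[name]
    if roll.shape[0] == 0:
        # Replace the empty track with an all-zero track of the right length
        roll = placeholder.copy()
        has_empty_tracks = True
    filled.append(roll)

# One array of shape (tracks, time steps, pitches)
combined = np.stack(filled)
print(combined.shape, has_empty_tracks)  # (5, 8, 128) True
```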




In [None]:
from tqdm import tqdm

# Map the pypianoroll track names to the keys used below
TRACK_KEYS = {'Piano': 'piano_part', 'Guitar': 'guitar_part', 'Bass': 'bass_part',
              'Strings': 'strings_part', 'Drums': 'drums_part'}

combined_pianorolls = []
for msd_file_name in tqdm(train_ids):

  lpd_file_name = msd_to_lpd_ids[msd_file_name]
  # Get the NPZ path and load the multitrack pianoroll
  npz_path = get_midi_npz_path(msd_file_name, lpd_file_name)
  multitrack = pypianoroll.load(npz_path)
  # 2 time steps per quarter note; pad all tracks to the same length
  multitrack.set_resolution(2).pad_to_same()

  # Piano, Guitar, Bass, Strings, Drums
  # Splitting into different parts
  parts = {key: None for key in TRACK_KEYS.values()}
  empty_array = None
  has_empty_parts = False
  for track in multitrack.tracks:
    if track.name in TRACK_KEYS:
      parts[TRACK_KEYS[track.name]] = track.pianoroll
    if track.pianoroll.shape[0] > 0:
      empty_array = np.zeros_like(track.pianoroll)

  # Skip songs in which every track is empty (no template for the placeholder)
  if empty_array is None:
    continue

  # Replace missing or empty tracks with an all-zero placeholder
  for k, v in parts.items():
    if v is None or v.shape[0] == 0:
      parts[k] = empty_array.copy()
      has_empty_parts = True

  # Stack all together - Piano, Guitar, Bass, Strings, Drums
  # np.stack first: torch.tensor on a list of NumPy arrays is very slow
  combined_pianoroll = torch.from_numpy(np.stack([parts['piano_part'], parts['guitar_part'], parts['bass_part'], parts['strings_part'], parts['drums_part']]))

  # These contain velocity information - the force with which the notes are hit - which can be standardized to 0/1 if we want (to compress)
  # Keep only the songs in which all five tracks are non-empty
  if not has_empty_parts:
    combined_pianorolls.append(combined_pianoroll)


  2%|▏         | 447/21425 [07:52<6:09:50,  1.06s/it]


KeyboardInterrupt: 

# Saving Files

The final step of the preprocessing is to save two .pt files: one with the pianorolls of the final dataset and one with their respective lengths. The length of each pianoroll (its number of time steps) is stored in a list, and the pianoroll tensors are concatenated along the time axis with the torch.hstack method.
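
The reason for keeping the lengths is that they allow the concatenated array to be cut back into individual songs. The sketch below shows the round trip with NumPy on toy arrays; `np.hstack` and `np.split` are used as stand-ins, on the assumption that `torch.hstack` and `torch.split(..., dim=1)` behave analogously for 3-D tensors.

```python
import numpy as np

# Three toy "songs" of shape (tracks, time steps, pitches) with different lengths
rolls = [np.full((5, t, 128), fill_value=i, dtype=np.uint8)
         for i, t in enumerate([3, 5, 2])]

lengths = [r.shape[1] for r in rolls]
stacked = np.hstack(rolls)  # concatenates along axis 1 (time)
print(stacked.shape)        # (5, 10, 128)

# Cut points are the cumulative lengths, without the final total
boundaries = np.cumsum(lengths)[:-1]
recovered = np.split(stacked, boundaries, axis=1)

assert all(np.array_equal(a, b) for a, b in zip(rolls, recovered))
```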

In [None]:
# Stack of the pianorolls and list of lengths

pianoroll_lengths = [e.size()[1] for e in combined_pianorolls]
#print(combined_pianorolls.size()[1])
#print(pianoroll_lengths)
combined_pianorolls = torch.hstack(combined_pianorolls)
#print(combined_pianorolls)

[1018, 763, 845, 863, 129, 743, 1089, 1117, 1297, 658, 177, 825, 977, 465, 697, 747, 1113, 993, 1105, 974, 525, 1560, 881, 441, 671, 1297, 473, 911, 593, 793, 737, 1097, 590, 665, 1159, 841, 649, 779, 669, 573, 617, 1206, 733, 818, 1681, 585, 156, 739, 1358, 1411, 711, 640, 473, 932, 1265, 1113, 805, 545, 906, 926, 1345, 609, 601, 161, 1040, 675, 651, 867, 1049, 625, 1049, 875, 743, 1099, 193, 922, 689, 1145, 753, 853, 935, 914, 817, 193, 1026, 129, 905, 126, 949, 1000, 1185, 653, 1115, 129, 977, 392, 872, 971, 561, 361, 841, 1087, 715, 2153, 2187, 1233, 497, 615, 808, 1003, 1009, 1266, 721, 1208, 633, 345, 489, 838, 487, 2273, 817, 664, 361, 923, 768, 709, 869, 765, 1306, 545, 939, 481, 1369, 485, 801, 887, 888, 985, 1078, 1001, 1371, 841, 1045, 841, 474, 561, 1007, 577, 992, 1037, 706, 672, 889, 969, 1161, 390, 557, 1033, 345, 977, 1369, 1361, 1285, 561, 97, 1113, 569, 1117, 753, 944, 1053, 504, 735, 624, 297, 1362, 1385, 336, 944, 504, 857, 698, 733, 825, 907, 1155, 129, 785, 570, 1



The code for saving the results obtained in two .pt files is given below. These files will be used during generation to retrieve the dataset.

In [None]:
# Saving files

torch.save(combined_pianorolls, os.path.join(root_dir, 'Lakh Piano Dataset', '10000_pianorolls.pt'))
pianoroll_lengths = torch.tensor(pianoroll_lengths)
#print(pianoroll_lengths)
torch.save(pianoroll_lengths, os.path.join(root_dir, 'Lakh Piano Dataset', '10000_pianorolls_lengths.pt'))

tensor([1018,  763,  845,  863,  129,  743, 1089, 1117, 1297,  658,  177,  825,
         977,  465,  697,  747, 1113,  993, 1105,  974,  525, 1560,  881,  441,
         671, 1297,  473,  911,  593,  793,  737, 1097,  590,  665, 1159,  841,
         649,  779,  669,  573,  617, 1206,  733,  818, 1681,  585,  156,  739,
        1358, 1411,  711,  640,  473,  932, 1265, 1113,  805,  545,  906,  926,
        1345,  609,  601,  161, 1040,  675,  651,  867, 1049,  625, 1049,  875,
         743, 1099,  193,  922,  689, 1145,  753,  853,  935,  914,  817,  193,
        1026,  129,  905,  126,  949, 1000, 1185,  653, 1115,  129,  977,  392,
         872,  971,  561,  361,  841, 1087,  715, 2153, 2187, 1233,  497,  615,
         808, 1003, 1009, 1266,  721, 1208,  633,  345,  489,  838,  487, 2273,
         817,  664,  361,  923,  768,  709,  869,  765, 1306,  545,  939,  481,
        1369,  485,  801,  887,  888,  985, 1078, 1001, 1371,  841, 1045,  841,
         474,  561, 1007,  577,  992, 10