In [2]:
import librosa
import numpy as np
import matplotlib as plt
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import json
import os


# File structuring

First, set up the `DATA_ROOT`

In [4]:
DATA_ROOT = '../data/openmic-2018/'

if not os.path.exists(DATA_ROOT):
    raise ValueError('Did you forget to set `DATA_ROOT`?')

Load the `OPENMIC` dataset

In [6]:
import numpy as np
OPENMIC = np.load(os.path.join(DATA_ROOT, 'openmic-2018.npz'), allow_pickle=True)

The contents of the dataset is described in detail on the [OPENMIC's Official Github](https://github.com/cosmir/openmic-2018/blob/master/examples/modeling-baseline.ipynb). It contains VGGish features, which are to be excluded, as we want to replicate the analysis of [Predominant Musical Instrument Classification based on Spectral Features
](https://arxiv.org/abs/1912.02606) (PMICSF).

In [7]:
X, Y_true, Y_mask, sample_key = OPENMIC['X'], OPENMIC['Y_true'], OPENMIC['Y_mask'], OPENMIC['sample_key']

# Feature extraction
The goal is to replicate the approach of [PMICSF](https://arxiv.org/abs/1912.02606). The features used in the article are:

* **Zero Crossing Rate** (ZCR):  indicates the rate at which the signal crosses zero.
* **Spectral Centroid** (SC): a metric that indicate the center of mass of the spectrum being located. It is *"the ratio of the frequency weighted magnitude spectrum with unweighted magnitude spectrum"*
* **Spectral Bandwidth** (SB): gives the weighted average of the frequency signal by its spectrum.
* **Spectral roll-off** (SR):  the frequency under which a certain proportion of the overall spectral energy belongs to
* **MFCC**: the meaon of the first 13 MFCC features.

We intend to  calculate these features for the `OPENMIC` dataset, and store them in a Panda's Dataframe ordered by sample key. The sample keys are available in the `sample_key` list produced in the code block above.


> **Question**: what did they use on Librosa? Same?

From PMICSF, page 3:

>  The audio spectrum is analyzed by extracting MFCCs based on the default inputs of hopSize (hop length between frames) and frame size. The default parameters for sampling rate is 44.1 kHz, **hopSize of 512** and **frame size of 1024** in Essentia

For convinience, and according to [Presets](https://librosa.org/blog/2019/07/17/resample-on-load/), we set the default parameters of Librosa to match those values. 

In [8]:
# first define the default values to match PMICSF
# TODO: why import it as _librosa?
from presets import Preset
import librosa as _librosa
librosa = Preset(_librosa)
librosa['n_fft'] = 2048
librosa['win_length'] = 1024
librosa['hop_length'] = 512

To find all audio files in the dataset, Librosa's util function [find_files](https://librosa.org/doc/main/generated/librosa.util.find_files.html) was used:

In [16]:
import librosa as lr
import pandas as pd

file_paths = lr.util.find_files(DATA_ROOT + "audio", ext="ogg")
index = pd.DataFrame({"file_path": file_paths, "sample_key": sample_key})
index

Unnamed: 0,file_path,sample_key
0,/workspaces/instrument-detection/data/openmic-...,000046_3840
1,/workspaces/instrument-detection/data/openmic-...,000135_483840
2,/workspaces/instrument-detection/data/openmic-...,000139_119040
3,/workspaces/instrument-detection/data/openmic-...,000141_153600
4,/workspaces/instrument-detection/data/openmic-...,000144_30720
...,...,...
19995,/workspaces/instrument-detection/data/openmic-...,155294_184320
19996,/workspaces/instrument-detection/data/openmic-...,155295_76800
19997,/workspaces/instrument-detection/data/openmic-...,155307_211200
19998,/workspaces/instrument-detection/data/openmic-...,155310_372480


In [14]:
index['file_path'].head()

0    /workspaces/instrument-detection/data/openmic-...
1    /workspaces/instrument-detection/data/openmic-...
2    /workspaces/instrument-detection/data/openmic-...
3    /workspaces/instrument-detection/data/openmic-...
4    /workspaces/instrument-detection/data/openmic-...
Name: file_path, dtype: object

Next step is to preprocess all files and extract all features. For efficiancy, we used Spotify's [Pedalboard](https://github.com/spotify/pedalboard) for loading audio and Librosas Feature package for feature extraction. Below is what packages we used:

* ZCR: [librosa.feature.zero_crossing_rate](https://librosa.org/doc/main/generated/librosa.feature.zero_crossing_rate.html)
* SC: [librosa.feature.spectral_centroid](https://librosa.org/doc/main/generated/librosa.feature.spectral_centroid.html)
* SB: [librosa.feature.spectral_bandwidth](https://librosa.org/doc/main/generated/librosa.feature.spectral_bandwidth.html)
* SR: [librosa.feature.spectral_rolloff](https://librosa.org/doc/main/generated/librosa.feature.spectral_rolloff.html)

The article also extracts the mean of the first 13 MFCC features with Librosa:

> We extracted the first 13 MFCC features using Librosa/Essentia. For each audio clip, we obtained 259 × 13 matrix features. **We took the mean of all the columns to get the condensed feature** providing us with 1 × 13 feature vector, along with five other features as mentioned above. We labeled each vector with the instrument class using scikit- learn’s ‘labelencoder’ function.

In [25]:
from turtle import ycor
import pedalboard as pb
import librosa as lr


def preprocess(index):
    """
    Preprocess audio and extract features according to PMICSF
    Input: an audiofile
    Returns: a dictionary with zcrs, scs, mfccs
    """
    features = {}
    with pb.io.AudioFile(index[0]) as f:
        # TODO so some files have varying SR, which could be problematic
        #assert f.samplerate == 44100, f"Sample rate is not 44.1khz for {file}!"
        # TODO count _one_ spectogram and then use as input to all features
        y = f.read(f.frames)
        y = y.mean(axis=0)  # mono
        # To speed up calculation, calculate one spectogram
        # TODO is this right, and does the standard values as defined above really hold?
        S = np.abs(librosa.stft(y))**2
        zcrs = librosa.feature.zero_crossing_rate(y=y)
        features["sample_key"] = index[1]
        features["zcr_mean"] = zcrs.mean()
        features["zcr_std"] = zcrs.std()
        scs = librosa.feature.spectral_centroid(S=S)
        features["sc_mean"] = scs.mean()
        features["sc_std"] = scs.std()
        sbs = librosa.feature.spectral_bandwidth(S=S)
        features["sb_mean"] = sbs.mean()
        features["sb_std"] = sbs.std()
        srs = librosa.feature.spectral_rolloff(S=S)
        features["sr_mean"] = sbs.mean()
        features["sr_std"] = sbs.std()
        mfccs = librosa.feature.mfcc(S=librosa.power_to_db(S), n_mfcc=13)
        for i, mfcc in enumerate(mfccs):
            features['mfcc' + str(i+1) + '_mean'] = mfcc.mean()
            features['mfcc' + str(i+1) + 'std'] = mfcc.std()

    return features



The audio files were preprocessed in parallell, using Python's [multiprocess](https://docs.python.org/3/library/multiprocessing.html) package.

> **NOTE:** if using the enclosed `.devcontainer`, make sure to adjust the RAM and available cores in your Docker settings. Not doing so will render the kernel to crash! We used 4gb ram and 6 cores on a MacBook Air M1, leading to a calculation time of roughly 10 minutes. 

In [24]:
# sandbox

ttt = index.values.tolist()
ttt

[['/workspaces/instrument-detection/data/openmic-2018/audio/000/000046_3840.ogg',
  '000046_3840'],
 ['/workspaces/instrument-detection/data/openmic-2018/audio/000/000135_483840.ogg',
  '000135_483840'],
 ['/workspaces/instrument-detection/data/openmic-2018/audio/000/000139_119040.ogg',
  '000139_119040'],
 ['/workspaces/instrument-detection/data/openmic-2018/audio/000/000141_153600.ogg',
  '000141_153600'],
 ['/workspaces/instrument-detection/data/openmic-2018/audio/000/000144_30720.ogg',
  '000144_30720']]

In [26]:
from tqdm import tqdm
import numpy as np

from multiprocess import Pool
if not os.path.exists('features'):
    with Pool() as p:
        # index.values.tolist() returns a list with [file_path, sample_key]
        ys = list(tqdm(p.imap(preprocess, index.values.tolist()), total=len(index)))
else:
    ys = pd.DataFrame.from_csv('features.csv')
pd.DataFrame(ys)

100%|██████████| 20000/20000 [17:06<00:00, 19.48it/s]  


Unnamed: 0,sample_key,zcr_mean,zcr_std,sc_mean,sc_std,sb_mean,sb_std,sr_mean,sr_std,mfcc1_mean,...,mfcc9_mean,mfcc9std,mfcc10_mean,mfcc10std,mfcc11_mean,mfcc11std,mfcc12_mean,mfcc12std,mfcc13_mean,mfcc13std
0,000046_3840,0.036367,0.021923,296.517136,133.109808,288.159666,267.139540,288.159666,267.139540,-978.598022,...,36.752838,18.461390,21.056427,14.100028,16.246675,12.125834,23.335825,16.301609,14.897064,12.292964
1,000135_483840,0.052411,0.013491,478.192757,141.199973,409.844515,135.532246,409.844515,135.532246,-413.563721,...,6.231468,14.745573,17.053007,12.037521,-11.894317,11.358349,5.294181,12.435369,2.507157,12.328708
2,000139_119040,0.081234,0.014925,604.505859,189.731508,719.222449,139.070343,719.222449,139.070343,-544.543701,...,-0.072678,13.965829,-26.966000,14.030764,26.628426,13.969789,17.684668,10.871439,-17.259043,12.796794
3,000141_153600,0.053718,0.013248,462.825122,153.244140,391.783679,84.894649,391.783679,84.894649,-934.030762,...,9.737517,12.798279,25.201883,11.753377,35.146244,11.763553,28.466101,11.159425,9.925189,10.857844
4,000144_30720,0.082449,0.034076,601.706717,329.074617,766.763811,375.408621,766.763811,375.408621,-561.763611,...,27.277163,19.787683,13.598028,16.042135,14.319709,16.555758,27.292217,15.994335,0.721995,16.865313
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,155294_184320,0.050435,0.009352,508.286842,94.249039,266.331118,63.090107,266.331118,63.090107,-978.444946,...,-11.920152,10.829905,-14.391589,9.820566,-8.987197,9.933827,-2.683615,8.903602,-1.517069,9.117269
19996,155295_76800,0.009752,0.004809,61.216410,42.815086,101.288169,53.390921,101.288169,53.390921,-1092.291382,...,41.367481,9.498998,33.264076,7.879162,28.860458,6.616966,24.709829,6.231831,20.218401,5.848857
19997,155307_211200,0.065545,0.022368,450.432458,192.751011,665.939788,285.751704,665.939788,285.751704,-625.574707,...,1.175638,16.541275,29.078180,17.536085,-2.689997,16.350401,0.299629,17.147369,9.396278,15.461529
19998,155310_372480,0.042145,0.015084,332.139261,146.760990,409.563611,169.775745,409.563611,169.775745,-786.657532,...,35.033508,19.646029,26.835651,18.322279,-27.157253,22.870502,-6.835197,24.779594,17.680960,16.964596


I think we want to store the features in a csv file

In [27]:
pd.DataFrame(ys).to_csv(path_or_buf='features.csv')


# Labeling
> **NOTERING**: Se även [test.ipynb](/src/test.ipynb)

From PMICSF part C, *Classifier Training*

> We labeled each vector with the instrument class using scikitlearn’s ‘labelencoder’ function

> **Question**: not sure how they use [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).

The team behind OPENMIC has made a good split of the data, and most of the following code is from their [tutorial on GitHub](https://github.com/cosmir/openmic-2018/blob/master/examples/modeling-baseline.ipynb).

In [28]:
# Let's split the data into the training and test set
# We use squeeze=True here to return a single array for each, rather than a full DataFrame

split_train = pd.read_csv(
    os.path.join(DATA_ROOT, "partitions/split01_train.csv"), header=None, squeeze=True
)
split_test = pd.read_csv(
    os.path.join(DATA_ROOT, "partitions/split01_test.csv"), header=None, squeeze=True
)

split_test.head()



  split_train = pd.read_csv(


  split_test = pd.read_csv(


0      000178_3840
1     000308_61440
2    000312_184320
3    000319_145920
4    000321_218880
Name: 0, dtype: object

In [29]:
# How many train and test examples do we have?  About 75%/25%
print("# Train: {},  # Test: {}".format(len(split_train), len(split_test)))

# Train: 14915,  # Test: 5085


In [30]:
# convert df to set
train_set = set(split_train)
test_set = set(split_test)

In [31]:
# These loops go through all sample keys, and save their row numbers
# to either idx_train or idx_test
#
# This will be useful in the next step for slicing the array data
idx_train, idx_test = [], []

for idx, n in enumerate(sample_key):
    if n in train_set:
        idx_train.append(idx)
    elif n in test_set:
        idx_test.append(idx)
    else:
        # This should never happen, but better safe than sorry.
        raise RuntimeError("Unknown sample key={}! Abort!".format(sample_key[n]))

# Finally, cast the idx_* arrays to numpy structures
idx_train = np.asarray(idx_train)
idx_test = np.asarray(idx_test)

From OPENMIC tutorial:

> For convenience, we provide a simple JSON object that maps class indices to names.

In [33]:
import json

with open(os.path.join(DATA_ROOT, 'class-map.json'), 'r') as f:
    class_map = json.load(f)
class_map

{'accordion': 0,
 'banjo': 1,
 'bass': 2,
 'cello': 3,
 'clarinet': 4,
 'cymbals': 5,
 'drums': 6,
 'flute': 7,
 'guitar': 8,
 'mallet_percussion': 9,
 'mandolin': 10,
 'organ': 11,
 'piano': 12,
 'saxophone': 13,
 'synthesizer': 14,
 'trombone': 15,
 'trumpet': 16,
 'ukulele': 17,
 'violin': 18,
 'voice': 19}

In [34]:
data_train = Y_mask[idx_train]
data_test = Y_mask[idx_train]
sample_key_train = sample_key[idx_train]  # numpy.ndarray
sample_key_test = sample_key[idx_test]

type(sample_key_train)

numpy.ndarray

In [None]:
# Finally, we use the split indices to partition the features, labels, and masks
X_train = X[idx_train]
X_test = X[idx_test]

Y_true_train = Y_true[idx_train]
Y_true_test = Y_true[idx_test]

Y_mask_train = Y_mask[idx_train]
Y_mask_test = Y_mask[idx_test]

In [35]:
def generate_dataframe_with_labels(indices, class_map, Y_mask):
    """
    A method for creating a pd.df with each label in Y_mask and instrument in class_map
    Requires sorting of indices first (idx_train and idx_test) as explained in OPENMIC2018 tutorial
    """
    data = {}
    for i in indices:
        tmp_dict = {}
        for instr, pred in zip(class_map, Y_mask[i]):
            tmp_dict[instr] = pred
        data[sample_key[i]] = tmp_dict

    return pd.DataFrame.from_dict(data, orient="index")

Using the util function above, we create a big table with all the labels, as predicted by the OPENMIC team.

In [36]:
# set up dataframes with sample key and correct label
df_train = generate_dataframe_with_labels(idx_train, class_map, Y_mask)
df_test = generate_dataframe_with_labels(idx_test, class_map, Y_mask)
df_test.head()

Unnamed: 0,accordion,banjo,bass,cello,clarinet,cymbals,drums,flute,guitar,mallet_percussion,mandolin,organ,piano,saxophone,synthesizer,trombone,trumpet,ukulele,violin,voice
000178_3840,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
000308_61440,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False
000312_184320,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False
000319_145920,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
000321_218880,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False


Should we perhaps *merge* the tables together or is this OK?