# Introduction

Instrument detection is a cornerstone in music information retrieval (MIR). In this notebook, we will explore the task of instrument detection and train varying models. We will follow the method of [Predominant Musical Instrument Classification based on Spectral Features](https://arxiv.org/pdf/1912.02606.pdf) (MIC) by Racharla et al. (2020). The dataset used in this notebook is the [OpenMIC-2018](http://ismir2018.ircam.fr/doc/pdfs/248_Paper.pdf), a crowd-sourced dataset of 20,000 audio files with 10-second clips of 20 different instruments, sponsored by Spotify and New York Univiversity's MARL and Center for Data Science. The dataset is available on [paperswithcode.com](https://paperswithcode.com/dataset/openmic-2018). They also have a [github repo](https://github.com/cosmir/openmic-2018) with the dataset and [modelling baseline code](https://github.com/cosmir/openmic-2018/blob/master/examples/modeling-baseline.ipynb).

The OpenMIC-2018 dataset were chosen before we understood the limitations of the dataset, and so, our final model and analysis will have some limitations. The comparions with MIC are not going to be fair, which is furthered discussed in later sections. All together, it has been a great learning experience.

This project is a part of [DT2470 Music Informatics](https://www.kth.se/student/kurser/kurs/DT2470?l=en) at KTH Royal Institute of Technology. The time from start to finish was approximately three weeks. 

Let's load some of the necessary libraries and get started:

In [182]:
import librosa
import numpy as np
import matplotlib as plt
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import json
import os

# Preprocessing

First we preprocess the data, and extract features from it. Then, we train and evaluate the models. The OpenMIC-2018 dataset consists of 20 000 audio samples, but not all of them have human annotations. If annotations are missing, that specific instrument and sample can't be included in model traning and testing. It was a bit cumbersome to understand the design of the dataset, but the modelling baseline was really helpful. The dataset comes with a predefined `.npz` file. It also includeds a well-balanced train and test split of the data. 

First, set up the `DATA_ROOT`

In [183]:
DATA_ROOT = '../data/openmic-2018/'

if not os.path.exists(DATA_ROOT):
    raise ValueError('Did you forget to set `DATA_ROOT`?')

Load the `OPENMIC` dataset

In [184]:
import numpy as np
OPENMIC = np.load(os.path.join(DATA_ROOT, 'openmic-2018.npz'), allow_pickle=True)

The contents of the dataset is described in detail on the [OPENMIC's Official Github](https://github.com/cosmir/openmic-2018/blob/master/examples/modeling-baseline.ipynb). It contains VGGish features, which are disregarded, as we want to replicate the analysis of [MIC](https://arxiv.org/abs/1912.02606) and extract our own features.

In [185]:
X, Y_true, Y_mask, sample_key = OPENMIC['X'], OPENMIC['Y_true'], OPENMIC['Y_mask'], OPENMIC['sample_key']

# Feature extraction
The goal is to replicate the approach of [MIC](https://arxiv.org/abs/1912.02606). The features used in the article are:

* **Zero Crossing Rate** (ZCR):  indicates the rate at which the signal crosses zero.
* **Spectral Centroid** (SC): a metric that indicate the center of mass of the spectrum being located. It is *"the ratio of the frequency weighted magnitude spectrum with unweighted magnitude spectrum"*
* **Spectral Bandwidth** (SB): gives the weighted average of the frequency signal by its spectrum.
* **Spectral roll-off** (SR):  the frequency under which a certain proportion of the overall spectral energy belongs to
* **MFCC**: the mean of the first 13 MFCC features.

We intend to  extract these features for OpenMIC-2018, and store them in a Panda's Dataframe ordered by sample key. The sample keys are available in the `sample_key` list produced in the code block above. The dataframe is then exported as a `.csv` file.

From PMICSF, page 3:

>  The audio spectrum is analyzed by extracting MFCCs based on the default inputs of hopSize (hop length between frames) and frame size. The default parameters for sampling rate is 44.1 kHz, **hopSize of 512** and **frame size of 1024** in Essentia

They don't mention what was used for Librosa, but we decided to pick these parameter values as they are quite common and make sense in this context. For convinience, and according to [Presets](https://librosa.org/blog/2019/07/17/resample-on-load/), we set the default parameters of Librosa to match those values. 

In [186]:
# define the default values to match MIC
from presets import Preset
import librosa as _librosa
librosa = Preset(_librosa)
librosa['n_fft'] = 2048
librosa['win_length'] = 1024
librosa['hop_length'] = 512

To find all audio files in the dataset, Librosa's util function [find_files](https://librosa.org/doc/main/generated/librosa.util.find_files.html) was used:

In [187]:
import librosa as lr
import pandas as pd

file_paths = lr.util.find_files(DATA_ROOT + "audio", ext="ogg")
index = pd.DataFrame({"file_path": file_paths, "sample_key": sample_key})
index

Unnamed: 0,file_path,sample_key
0,/workspaces/instrument-detection/data/openmic-...,000046_3840
1,/workspaces/instrument-detection/data/openmic-...,000135_483840
2,/workspaces/instrument-detection/data/openmic-...,000139_119040
3,/workspaces/instrument-detection/data/openmic-...,000141_153600
4,/workspaces/instrument-detection/data/openmic-...,000144_30720
...,...,...
19995,/workspaces/instrument-detection/data/openmic-...,155294_184320
19996,/workspaces/instrument-detection/data/openmic-...,155295_76800
19997,/workspaces/instrument-detection/data/openmic-...,155307_211200
19998,/workspaces/instrument-detection/data/openmic-...,155310_372480


Next step is to preprocess all files and extract their features. For efficiancy, we used Spotify's [Pedalboard](https://github.com/spotify/pedalboard) for loading audio and Librosas Feature packages for feature extraction:

* ZCR: [librosa.feature.zero_crossing_rate](https://librosa.org/doc/main/generated/librosa.feature.zero_crossing_rate.html)
* SC: [librosa.feature.spectral_centroid](https://librosa.org/doc/main/generated/librosa.feature.spectral_centroid.html)
* SB: [librosa.feature.spectral_bandwidth](https://librosa.org/doc/main/generated/librosa.feature.spectral_bandwidth.html)
* SR: [librosa.feature.spectral_rolloff](https://librosa.org/doc/main/generated/librosa.feature.spectral_rolloff.html)

As explained above, the article also extracts the mean of the first 13 MFCC features with Librosa:

> We extracted the first 13 MFCC features using Librosa/Essentia. For each audio clip, we obtained 259 × 13 matrix features. **We took the mean of all the columns to get the condensed feature** providing us with 1 × 13 feature vector, along with five other features as mentioned above. We labeled each vector with the instrument class using scikit- learn’s ‘labelencoder’ function.

They only use the mean to train their models. Out of curiosity, and to see if we can improve accuracy, we also save the standard deviations of all features. 

In [188]:
import pedalboard as pb
import librosa as lr


def preprocess(index):
    """
    Preprocess audio and extract features according to PMICSF
    Input: an audiofile
    Returns: a dictionary with zcrs, scs, mfccs
    """
    features = {}
    with pb.io.AudioFile(index[0]) as f:
        # TODO so some files have varying SR, which could be problematic
        #assert f.samplerate == 44100, f"Sample rate is not 44.1khz for {file}!"
        y = f.read(f.frames)
        y = y.mean(axis=0)  # mono
        # To speed up calculation, calculate one spectogram
        S = np.abs(librosa.stft(y))**2
        zcrs = librosa.feature.zero_crossing_rate(y=y)
        features["sample_key"] = index[1]
        features["zcr_mean"] = zcrs.mean()
        features["zcr_std"] = zcrs.std()
        scs = librosa.feature.spectral_centroid(S=S)
        features["sc_mean"] = scs.mean()
        features["sc_std"] = scs.std()
        sbs = librosa.feature.spectral_bandwidth(S=S)
        features["sb_mean"] = sbs.mean()
        features["sb_std"] = sbs.std()
        srs = librosa.feature.spectral_rolloff(S=S)
        features["sr_mean"] = sbs.mean()
        features["sr_std"] = sbs.std()
        mfccs = librosa.feature.mfcc(S=S, n_mfcc=13)
        for i, mfcc in enumerate(mfccs):
            features['mfcc' + str(i+1) + '_mean'] = mfcc.mean()
            features['mfcc' + str(i+1) + 'std'] = mfcc.std()

    return features

The audio files were preprocessed in parallell, using Python's [multiprocess](https://docs.python.org/3/library/multiprocessing.html) package.

> **NOTE:** if using the enclosed `.devcontainer`, make sure to adjust the RAM and available cores in your Docker settings. Not doing so will render the kernel to crash! We used 4gb ram and 6 cores on a MacBook Air M1, leading to a calculation time of roughly 10 minutes.

Following Librosas convention of naming audiofiles `y`, we name all audio files `ys`. Think of it as "audio files in plural".

In [189]:
from tqdm import tqdm
import numpy as np

from multiprocessing import Pool

if not os.path.exists('features.csv'):
    with Pool() as p:
        # index.values.tolist() returns a list with [file_path, sample_key]
        ys = list(tqdm(p.imap(preprocess, index.values.tolist()), total=len(index)))
        ys = pd.DataFrame(ys)
        ys = ys.set_index("sample_key")
else:
    ys = pd.read_csv('features.csv', index_col="sample_key")
    #ys.drop(columns="Unnamed: 0")
    


We then go on to store the features in a `.csv` file.

In [190]:
if not os.path.exists('features.csv'):
    pd.DataFrame(ys).to_csv('features.csv')
else:
    print("Features already calculated and stored in features.csv")

Features already calculated and stored in features.csv


# Training

Now we lean heavily on the modelling baseline from the OPENMIC's Official Github.

In [191]:
import librosa
import numpy as np
import matplotlib as plt
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import json
import os

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier

In [192]:
DATA_ROOT = '../data/openmic-2018'

if not os.path.exists(DATA_ROOT):
    raise ValueError('Did you forget to set `DATA_ROOT`?')

In [193]:
OPENMIC = np.load(os.path.join(DATA_ROOT, 'openmic-2018.npz'), allow_pickle=True)

In [194]:
#we will overwrite 'X' later
X, Y_true, Y_mask, sample_key = OPENMIC['X'], OPENMIC['Y_true'], OPENMIC['Y_mask'], OPENMIC['sample_key']

We load the previously calculated features, and print it as a sanity check. It should contain 20 000 indices.

In [195]:
features=pd.read_csv('features.csv')

# Let's have a look at the data
features

Unnamed: 0,sample_key,zcr_mean,zcr_std,sc_mean,sc_std,sb_mean,sb_std,sr_mean,sr_std,mfcc1_mean,...,mfcc9_mean,mfcc9std,mfcc10_mean,mfcc10std,mfcc11_mean,mfcc11std,mfcc12_mean,mfcc12std,mfcc13_mean,mfcc13std
0,000046_3840,0.036367,0.021923,296.517136,133.109808,288.159666,267.139540,288.159666,267.139540,147.188060,...,166.07562,216.15845,158.768600,209.785170,151.586730,203.43971,144.200240,197.13676,136.577680,190.80775
1,000135_483840,0.052411,0.013491,478.192757,141.199973,409.844515,135.532246,409.844515,135.532246,1282.769400,...,765.03280,495.68510,590.328800,558.758240,425.313660,616.81950,274.802150,663.55350,142.181600,694.35410
2,000139_119040,0.081234,0.014925,604.505859,189.731508,719.222449,139.070343,719.222449,139.070343,292.690280,...,214.49968,316.38672,198.657150,314.673430,186.793670,313.87470,177.732700,313.87650,170.007640,313.99084
3,000141_153600,0.053718,0.013248,462.825122,153.244140,391.783679,84.894649,391.783679,84.894649,188.996630,...,146.75562,140.59268,130.843980,137.768260,116.143745,135.27222,102.521700,133.01720,89.947320,130.88220
4,000144_30720,0.082449,0.034076,601.706717,329.074617,766.763811,375.408621,766.763811,375.408621,196.932450,...,172.74915,268.45245,166.110350,266.885220,159.133510,265.19022,151.693700,263.76523,143.998100,262.70773
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,155294_184320,0.050435,0.009352,508.286842,94.249039,266.331118,63.090107,266.331118,63.090107,147.754410,...,75.58570,66.08795,52.333115,52.021423,29.923525,44.92963,8.635664,46.96052,-11.322803,55.88632
19996,155295_76800,0.009752,0.004809,61.216410,42.815086,101.288169,53.390921,101.288169,53.390921,631.029240,...,866.60800,485.64580,861.181340,485.591300,855.655100,485.54526,850.119600,485.50488,844.646670,485.46692
19997,155307_211200,0.065545,0.022368,450.432458,192.751011,665.939788,285.751704,665.939788,285.751704,100.515335,...,87.85434,141.25677,81.044754,138.202670,74.912180,134.90385,69.371315,131.49872,64.451385,128.12172
19998,155310_372480,0.042145,0.015084,332.139261,146.760990,409.563611,169.775745,409.563611,169.775745,144.219150,...,139.86327,216.60248,130.948000,216.237840,122.950190,216.05978,116.407776,216.06279,111.108376,216.22318


These are all the available features:

In [199]:
features.columns

Index(['sample_key', 'zcr_mean', 'zcr_std', 'sc_mean', 'sc_std', 'sb_mean',
       'sb_std', 'sr_mean', 'sr_std', 'mfcc1_mean', 'mfcc1std', 'mfcc2_mean',
       'mfcc2std', 'mfcc3_mean', 'mfcc3std', 'mfcc4_mean', 'mfcc4std',
       'mfcc5_mean', 'mfcc5std', 'mfcc6_mean', 'mfcc6std', 'mfcc7_mean',
       'mfcc7std', 'mfcc8_mean', 'mfcc8std', 'mfcc9_mean', 'mfcc9std',
       'mfcc10_mean', 'mfcc10std', 'mfcc11_mean', 'mfcc11std', 'mfcc12_mean',
       'mfcc12std', 'mfcc13_mean', 'mfcc13std'],
      dtype='object')

In the next cell, we swap the VGGish features with our own features.

In [197]:
X = np.array(features.iloc[:,2:])

Splitting into well-balanced, predefined train and test sets.

In [200]:
split_train = pd.read_csv(
    os.path.join(DATA_ROOT, "partitions/split01_train.csv"), header=None,
).squeeze("columns")
split_test = pd.read_csv(
    os.path.join(DATA_ROOT, "partitions/split01_test.csv"), header=None,
).squeeze("columns")

split_test.head()

0      000178_3840
1     000308_61440
2    000312_184320
3    000319_145920
4    000321_218880
Name: 0, dtype: object

In [201]:
print('# Train: {},  # Test: {}'.format(len(split_train), len(split_test)))

# Train: 14915,  # Test: 5085


In [113]:
train_set = set(split_train)
test_set = set(split_test)

In [114]:
# These loops go through all sample keys, and save their row numbers
# to either idx_train or idx_test
#
# This will be useful in the next step for slicing the array data
idx_train, idx_test = [], []

for idx, n in enumerate(sample_key):
    if n in train_set:
        idx_train.append(idx)
    elif n in test_set:
        idx_test.append(idx)
    else:
        # This should never happen, but better safe than sorry.
        raise RuntimeError('Unknown sample key={}! Abort!'.format(sample_key[n]))
        
# Finally, cast the idx_* arrays to numpy structures
idx_train = np.asarray(idx_train)
idx_test = np.asarray(idx_test)

In [115]:
# Finally, we use the split indices to partition the features, labels, and masks
X_train = X[idx_train]
X_test = X[idx_test]

Y_true_train = Y_true[idx_train]
Y_true_test = Y_true[idx_test]

Y_mask_train = Y_mask[idx_train]
Y_mask_test = Y_mask[idx_test]

In [116]:
# Print out the sliced shapes as a sanity check
print(X_train.shape)
print(X_test.shape)

(14915, 33)
(5085, 33)


In [117]:
with open(os.path.join(DATA_ROOT, 'class-map.json'), 'r') as f:
    class_map = json.load(f)

# Model training and evaluation

In [MIC](https://arxiv.org/abs/1912.02606) they use the IRMAS dataset, which has one predominant instrument per sound file. The OpenMIC-2018 dataset has multiple instruments per sound file. This means that the OpenMIC-2018 dataset is a multi-label classification problem, while the IRMAS dataset is a multi-class classification problem. We didn't really understand the difference when we chose this project, and have since then learned that both problems have their own challenges.

We work around this by using a one-vs-rest approach. This means that we train a binary classifier for each instrument. The classifier predicts whether the instrument is present in the audio file or not. We then use the predictions from all classifiers to predict the instruments in the audio file.

In [MIC](https://arxiv.org/abs/1912.02606) they train six different models from [scikit-learn](https://scikit-learn.org/stable/):

- *SVM* (Support Vector Machine)
- *RF* (Random Forest)
- *LR* (Logistic Regression)
- *Decision Tree*
- *XGBoost* (eXtreme Gradient Boosting)
- *LGBM* (Light Gradient Boosting Machine)

To evalute their performance, they use the following metrics:

- *Precision*: $ \frac{tp}{tp + fp} $
- *Recall*: $ \frac{tp}{tp + fn} $
- *F1-score* (harmonic mean of precision and recall): $ \frac{2 \times precision \times recall}{precision + recall} $
- *Accuracy*: the number of correct predictions divided by the total number of predictions

Since the OPENMIC dataset is an ongoing project, and still acquiring annotations, we found it a bit hard to mold the data. Out of convienence, we are still using the approach in [modelling baseline](https://github.com/cosmir/openmic-2018/blob/master/examples/modeling-baseline.ipynb), as suggested by the authors of the dataset. 

We are replicating the method of [MIC](https://arxiv.org/abs/1912.02606), and so, only looking at six instrument:

```python
    ['flute', 'guitar', 'organ', 'piano', 'trumpet', 'voice']
```

In [118]:
instruments = ['accordion', 'banjo', 'bass', 'cello', 'clarinet',
       'cymbals', 'drums', 'flute', 'guitar', 'mallet_percussion', 'mandolin',
       'organ', 'piano', 'saxophone', 'synthesizer', 'trombone', 'trumpet',
       'ukulele', 'violin', 'voice']

MIC_instruments = ['flute', 'guitar', 'organ', 'piano', 'trumpet', 'voice']

features = ['zcr_mean', 'zcr_std', 'sc_mean',
       'sc_std', 'sb_mean', 'sb_std', 'sr_mean', 'sr_std', 'mfcc1_mean',
       'mfcc1std', 'mfcc2_mean', 'mfcc2std', 'mfcc3_mean', 'mfcc3std',
       'mfcc4_mean', 'mfcc4std', 'mfcc5_mean', 'mfcc5std', 'mfcc6_mean',
       'mfcc6std', 'mfcc7_mean', 'mfcc7std', 'mfcc8_mean', 'mfcc8std',
       'mfcc9_mean', 'mfcc9std', 'mfcc10_mean', 'mfcc10std', 'mfcc11_mean',
       'mfcc11std', 'mfcc12_mean', 'mfcc12std', 'mfcc13_mean', 'mfcc13std']

MIC_features = ['zcr_mean', 'sc_mean', 'sb_mean', 'sr_mean', 'mfcc1_mean',
       'mfcc2_mean', 'mfcc3_mean', 'mfcc4_mean', 'mfcc5_mean', 'mfcc6_mean',
       'mfcc7_mean', 'mfcc8_mean', 'mfcc9_mean', 'mfcc10_mean', 'mfcc11_mean',
       'mfcc12_mean', 'mfcc13_mean']

We have the choice to replicate the article features and instruments, or picking some additional features we used.
Our baseline is the article features:

In [119]:
FEATURES = MIC_features
#FEATURES = features
INSTRUMENTS = MIC_instruments
#INSTRUMENTS = instruments

## Modeling baseline

This is the heart of the [modelling baseline](https://github.com/cosmir/openmic-2018/blob/master/examples/modeling-baseline.ipynb). As we are training six models, we turned it into a method that returns accuracy, classifier and classification report:

In [120]:
from sklearn.metrics import classification_report,balanced_accuracy_score,precision_score,recall_score

def modelling_baseline(X_train, X_test, Y_true_train, Y_true_test, Y_mask_train, Y_mask_test, class_map, clf, instruments, output_dict=True):
    """
    Use the modeling baseline to train a classifier for each instrument
    Returns randomforrest_models - classifiers for trained model
            reports - classification reports for each instrument
            accuracies - accuracies for each instrument
    """
    # This dictionary will include the classifiers for each model
    models = dict()
    reports = dict()
    accuracies = dict()


    # We'll iterate over all istrument classes, and fit a model for each one
    # After training, we'll print a classification report for each instrument
    for instrument in instruments:

        # Map the instrument name to its column number
        inst_num = class_map[instrument]

        # Step 1: sub-sample the data

        # First, we need to select down to the data for which we have annotations
        # This is what the mask arrays are for
        train_inst = Y_mask_train[:, inst_num]
        test_inst = Y_mask_test[:, inst_num]

        # Here, we're using the Y_mask_train array to slice out only the training examples
        # for which we have annotations for the given class
        X_train_inst = X_train[train_inst]

        # Again, we slice the labels to the annotated examples
        # We thresold the label likelihoods at 0.5 to get binary labels
        Y_true_train_inst = Y_true_train[train_inst, inst_num] >= 0.5

        # Repeat the above slicing and dicing but for the test set
        X_test_inst = X_test[test_inst]
        Y_true_test_inst = Y_true_test[test_inst, inst_num] >= 0.5

        # Step 3.
        # Initialize a new classifier
        # No

        # Step 4.
        clf.fit(X_train_inst, Y_true_train_inst)

        # Step 5.
        # Finally, evaluate the model on both test data
        Y_pred_test = clf.predict(X_test_inst)
        
        # Store the classifier in our dictionary
        models[instrument] = clf
        reports[instrument] = classification_report(Y_true_test_inst, Y_pred_test, output_dict=output_dict)
        accuracies[instrument] = balanced_accuracy_score(Y_true_test_inst, Y_pred_test)

    return models, reports, accuracies
        

As an example, if you want to find the result of a the *Random Forest* model on *flute* instrument, you can do:


```python
    >>> results['RandomForest']['Accuracy']['flute']
    0.43333333
```

Here we set up final `results` dictionary to store all the results:

In [146]:
keys = {"model": None, "report": None, "accuracy": None}
models = ['RandomForest', 'SVM', 'LogisticRegression', 'GradientBoost', 'DecisionTree', 'LGBM']
results = dict()

for model in models:
    results[model] = dict(keys)

# Training

From here, we call the previously defined `modelling_baseline` method to train all six classifiers. MIC doesn't mention any use of standardizing, but we add a `StandardScaler()` where we think it makes sense to use it. The results are stored in the `results` dictionary.

In [122]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

clfs = {"RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
        "SVM": make_pipeline(StandardScaler(), SVC()),
        "LogisticRegression": make_pipeline(StandardScaler(), LogisticRegression(max_iter = 1000)),
        "GradientBoost": GradientBoostingClassifier(learning_rate=0.3),
        "DecisionTree": DecisionTreeClassifier(),
        "LGBM": LGBMClassifier()}

In [131]:
for model, clf in clfs.items():
    # a quite nasty one-liner, but it does the job ¯\_(ツ)_/¯
    results[model]["model"], results[model]["report"], results[model]["accuracy"] = modelling_baseline(
                                                    X_train, 
                                                    X_test, 
                                                    Y_true_train, 
                                                    Y_true_test, 
                                                    Y_mask_train, 
                                                    Y_mask_test, 
                                                    class_map, 
                                                    clf, 
                                                    INSTRUMENTS)

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


# Results

We are going to present and compare our metrics to the once in MIC. Too much time was spent on pivoting the Pandas dataframe.

In [144]:
article_columns = ['P', 'R', 'F1', 'support']

columns = ["precision",	"recall", "f1-score", "support", "model", "instrument"]
df = pd.DataFrame(columns=columns)
df = df.rename(columns={ "precision": "P", 
                    "recall": "R", 
                    "f1-score": "F1"
                    }, inplace=True)

for instrument in INSTRUMENTS:
    for model in models:
        di = results[model]["report"][instrument]['True']
        di['model'] = model
        di['instrument'] = instrument
        df = pd.concat([df, pd.DataFrame(di, index=[0]).rename(columns={ "precision": "P", 
                    "recall": "R", 
                    "f1-score": "F1",
                    })], ignore_index=True)

n_rows, n_cols = df.shape
df_melted = df.melt(id_vars=["model", "instrument"], var_name="metric", value_name="value")

df_pivot = df_melted.pivot_table(columns=['model', "metric"], 
                        index=['instrument'], 
                        values=['value'], 
                        aggfunc='mean'
                        )
df_tex = df_pivot.to_latex(float_format="%.2f", 
                            multicolumn=True, 
                            multirow=True)


  df_tex = df_pivot.to_latex(float_format="%.2f",


We present our result as a heatmap, with the f1-score (F1), precision (P), and recall (R), of each model on each instrument. We also add `support` to get a sense of how many samples contains. It is hard to tell which model is the better, but we can see that piano, voice and guitar does well over all models. Trumpet and organ does not do as well, and flute is the worst performing instrument. Compared to MICs table, there is a lot of variation in the results. Piano is doing much better than MIC, something which could be the result of many potential factors. First, having a predominant instrument per sound file, and having multiple instruments per sound file, might make a big difference for their suggested approach. To get to the bottom of this, we would need to do a more in-depth analysis of the data, which is outside the scope of this project. Second, it could be a sign of overfitting.

> **NOTE**: as of now, the SVM model is not working for flute. We are not sure why, but we are looking into it.

In [145]:
df_pivot[df_pivot.columns[0:]].style.format("{:.2f}").background_gradient(cmap='OrRd')

Unnamed: 0_level_0,value,value,value,value,value,value,value,value,value,value,value,value,value,value,value,value,value,value,value,value,value,value,value,value
model,DecisionTree,DecisionTree,DecisionTree,DecisionTree,GradientBoost,GradientBoost,GradientBoost,GradientBoost,LGBM,LGBM,LGBM,LGBM,LogisticRegression,LogisticRegression,LogisticRegression,LogisticRegression,RandomForest,RandomForest,RandomForest,RandomForest,SVM,SVM,SVM,SVM
metric,F1,P,R,support,F1,P,R,support,F1,P,R,support,F1,P,R,support,F1,P,R,support,F1,P,R,support
instrument,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3,Unnamed: 14_level_3,Unnamed: 15_level_3,Unnamed: 16_level_3,Unnamed: 17_level_3,Unnamed: 18_level_3,Unnamed: 19_level_3,Unnamed: 20_level_3,Unnamed: 21_level_3,Unnamed: 22_level_3,Unnamed: 23_level_3,Unnamed: 24_level_3
flute,0.37,0.36,0.39,175.0,0.34,0.44,0.28,175.0,0.38,0.5,0.3,175.0,0.07,0.35,0.04,175.0,0.33,0.55,0.23,175.0,0.0,0.0,0.0,175.0
guitar,0.76,0.75,0.76,286.0,0.82,0.77,0.87,286.0,0.82,0.78,0.87,286.0,0.82,0.73,0.94,286.0,0.83,0.78,0.89,286.0,0.83,0.74,0.95,286.0
organ,0.34,0.35,0.33,121.0,0.37,0.46,0.31,121.0,0.34,0.4,0.3,121.0,0.38,0.51,0.31,121.0,0.37,0.5,0.29,121.0,0.38,0.51,0.3,121.0
piano,0.85,0.88,0.82,285.0,0.9,0.9,0.91,285.0,0.9,0.89,0.92,285.0,0.91,0.88,0.95,285.0,0.9,0.91,0.9,285.0,0.91,0.88,0.94,285.0
trumpet,0.5,0.49,0.52,318.0,0.51,0.54,0.48,318.0,0.52,0.54,0.5,318.0,0.36,0.55,0.26,318.0,0.47,0.53,0.43,318.0,0.46,0.57,0.39,318.0
voice,0.72,0.69,0.75,224.0,0.81,0.74,0.89,224.0,0.81,0.73,0.91,224.0,0.8,0.72,0.9,224.0,0.79,0.72,0.89,224.0,0.79,0.7,0.91,224.0


Table II from MIC:

<img src="../imgs/MIC-table-ii.png" alt="MIC Table II" width="600"/>

We also compare the accuracy of our models to the ones in MIC. As we have one binary classifier for each instrument, we take the mean and standard deviation of each classifier and model. It is clear that SVM was the best for MIC. Our best model is the XGBoost model, with a mean accuracy of 0.67 ± 0.10. As the datsets are quite different, it is wise to take this comparison with a grain of salt.

In [143]:
df_accuracy = pd.DataFrame(columns=['model'])

for model in models:
    di = results[model]['accuracy']
    di['model'] = model
    df_accuracy = pd.concat([df_accuracy, pd.DataFrame(di, index=[0])])

# TODO apriori weighting of the classes
df_accuracy.melt('model').groupby(['model']).mean(numeric_only=True).rename(columns={'value': 'Mean'}).join(df_accuracy.melt('model').groupby(['model']).std(numeric_only=True).rename(columns={'value': 'STD'})).style.format("{:.2f}").background_gradient(cmap='OrRd')


Unnamed: 0_level_0,Mean,STD
model,Unnamed: 1_level_1,Unnamed: 2_level_1
DecisionTree,0.62,0.09
GradientBoost,0.66,0.1
LGBM,0.66,0.1
LogisticRegression,0.64,0.11
RandomForest,0.66,0.11
SVM,0.64,0.11


Table III from MIC:

<img src="../imgs/MIC-table-iii.png" alt="MIC Table III" width="600"/>

# Threats of validity

We have some known threats of validity that we will to address in bullet points:

* As of now, the final reported accuracy is not weighted against how many samples there are in each category. Doing so might improve or worsen the accuracy.
* The models chosen by MIC have been tweaked to work well on the IRMAS dataset. We have not tweaked our parameters to work well on the OpenMIC dataset.
* OpenMIC contains 20 000 audio samples, but annotations are missing. This means that the dataset is not complete. We leaned heavily on the modelling baseline from the OPENMIC's Official Github, and it filters many samples with missing annotations. Reporting the support of each instrument gives a better picture of the dataset. However, one should keep in mind that, while IRMAS might have fewer samples, it is complete and a better fit for the approach chosen by MIC.
* Flute is not working for the SVM model. This could be due to a bug in the pipeline.

# Conclusions & Future Work

We have replicated the results from MIC, and we have evaluated our models. The most time-consuming part was understanding the differences between OpenMIC-2018 and IRMAS, which to some extent took its toll on the quality of our analysis. The absolute goal of this project is to pass DT2470 Music Informatics, and we believe we have achieved what we set out to do in the beginning of the project. Going forward, it would be interesting to see if we can improve the results. We have a few ideas:

* Tweak model parameters.
* Besides only using the mean of all features, we have also saved the standard deviation. Including them might make more expressive models.
* Set up a weighted accuracy, where the accuracy is weighted against the number of samples in each category.
* See what happens when we include all 20 instruments in the OpenMIC dataset.
* Set up a pipeline to test with audio outside of the test set.