Before you turn this exercise in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says "YOUR ANSWER HERE" or `YOUR CODE HERE` and remove the `raise NotImplementedError()` lines. Please add your name and student ID below:

In [1]:
NAME = "Peter Rjabcsenko"
STUDENT_ID = "1228563"

---

# Intelligent Audio and Music Analysis Exercise 1

The goal of this exercise is to learn the basics in onset detection, beat
tracking and tempo estimation.

After completing this exercise you should have learned some music information
retrieval (MIR) basics and fostered your knowledge about these topics.

All data needed for this exercise is in the `data` directory. 
The folder contains audio files as well as annotations
(simple text files, one annotation per line) for `onsets`, `beats`, and `tempo`.
Not all audio files have all kinds of annotations, thus depending on the task
only a subset of all files can be used for evaluation.

For development of the algorithms, you can use any software packages as long
as you code the steps by yourself (exceptions are indicated).

Note: steps marked as optional do not need to be implemented to achieve all points,
but they can compensate for points lost elsewhere in this exercise.

Grading will be based on the solution and not on the achieved performance.
Max. 100 points are achievable.

The notebook structure for tasks 1 to 3 is rather strict and split into sub-tasks to
provide some guidance. Tasks 4 to 6 are more flexible, but it is recommended to define
the functions similar to those of tasks 1 to 3.

Recommended software packages:

- madmom (https://github.com/CPJKU/madmom)
- librosa (https://github.com/librosa/librosa)
- mir_eval (https://github.com/craffel/mir_eval)

You are free to add code and textual cells as you need them.
However `CONSTANTS` should not be altered.
You may add visualisations, tables, etc. to enhance your assignment.

### Chocolate challenge

There will again be a chocolate challenge, with prizes for the following sub-challenges:

1. best performing tempo estimation on a hidden test set,
2. best performing beat tracking on a hidden test set,
3. nicest visualisation.

In order to participate in the challenge, please make sure that the `chocolate_challenge()`
function writes the detected tempo and beats of the supplied test file in `data/test` to `.txt`
files.

Good luck!


In [2]:
import os
import pickle

import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

import madmom
import librosa
import mir_eval

Define default parameters:

In [3]:
FPS = 100

Define audio files and function to match them to annotation files.

In [4]:
from madmom.utils import search_files, match_file

AUDIO_FILES = search_files('data/train', '.wav')

def find_audio_files(ann_files, audio_files, ann_suffix=None, audio_suffix='.wav'):
    """
    Find matching audio files.
    
    Parameters
    ----------
    ann_files : list
        List with annotation file names.
    audio_files : list
        List with audio file names to be matched
    ann_suffix : str, optional
        Suffix of the annotation files. If 'None'
        the suffix is inferred from the annotation
        files.
    audio_suffix : str, optional
        Suffix of the audio files.
    
    Returns
    -------
    matched_files : list
        List of matched audio file (names).
    matched_indices : list
        List of matching indices in `audio_files`.
        
    """
    matched_files = []
    matched_indices = []
    for i, ann_file in enumerate(ann_files):
        if ann_suffix is None:
            ann_suffix = os.path.splitext(ann_file)[1]
        matches = match_file(ann_file, audio_files,
                             ann_suffix, audio_suffix)
        if len(matches) == 1:
            matched_files.append(matches[0])
            matched_indices.append(i)
        else:
            continue
    return matched_files, matched_indices



In [5]:
######## FIX FOR ABOVE FUNCTION FOR TASK 4 and 5 ########

def find_audio_files(ann_files, audio_files, ann_suffix=None, audio_suffix='.wav'):
    matched_files = []
    matched_indices = []
    for i, ann_file in enumerate(ann_files):
        if ann_suffix is None:
            ann_suffix = os.path.splitext(ann_file)[1]
        matches = match_file(ann_file, audio_files,
                             ann_suffix, audio_suffix)
        if len(matches) == 1:
            matched_files.append(matches[0])
            matched_indices.append(audio_files.index(matches[0]))
        else:
            continue
    return matched_files, matched_indices

---
# Audio pre-processing
---

## Task 1: audio pre-processing (10 points)


Step 1: read in the audio signal (all audio files: `.wav` format, 44.1kHz, 16bit, mono)

Step 2: split signal into overlapping frames of length 2048 samples and a frame rate of 100 fps

Step 3: for each frame compute the STFT

Step 4: discard phase information and keep only the magnitudes
  
Step 5: filter the magnitudes with a Mel filterbank (40 bands)

Step 6: apply logarithmic scaling (adding a constant for numerical stability)

You are allowed to use the functionality of any audio framework to load the audio files and compute the discrete Fourier transform.
However, all remaining steps should be coded by yourself and recognisable as such.
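
As an illustration of steps 2 to 6 coded by hand, the following is a minimal NumPy-only sketch. It is not the solution used in the cell below: the helper names (`hz_to_mel`, `mel_to_hz`, `mel_filterbank`, `log_mel_spectrogram`), the triangular filter shape, the Mel formula and the `log10(1 + x)` scaling are assumptions chosen for illustration. It also does not pad the signal, so the number of frames may differ from what an audio framework returns.

```python
import numpy as np

def hz_to_mel(freq_hz):
    # one common Mel formula (assumption; other variants exist)
    return 2595.0 * np.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(num_bands, frame_size, sample_rate):
    """Triangular filters with centres equally spaced on the Mel scale."""
    num_bins = frame_size // 2 + 1
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, num_bins)
    # num_bands + 2 edge frequencies: lower edge, band centres, upper edge
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0),
                                  hz_to_mel(sample_rate / 2.0),
                                  num_bands + 2))
    filterbank = np.zeros((num_bands, num_bins))
    for b in range(num_bands):
        low, centre, high = edges[b], edges[b + 1], edges[b + 2]
        rising = (bin_freqs - low) / (centre - low)
        falling = (high - bin_freqs) / (high - centre)
        filterbank[b] = np.maximum(0.0, np.minimum(rising, falling))
    return filterbank

def log_mel_spectrogram(signal, sample_rate=44100, frame_size=2048,
                        fps=100, num_bands=40):
    """Return a (bands, frames) log-Mel spectrogram.

    Assumes the signal is at least one frame long.
    """
    hop = int(sample_rate / fps)
    num_frames = 1 + (len(signal) - frame_size) // hop
    window = np.hanning(frame_size)
    # step 2: overlapping, windowed frames
    frames = np.stack([signal[i * hop:i * hop + frame_size] * window
                       for i in range(num_frames)])
    # steps 3 and 4: DFT per frame, keep magnitudes only
    magnitudes = np.abs(np.fft.rfft(frames, axis=1))
    # step 5: Mel filterbank
    mel = magnitudes @ mel_filterbank(num_bands, frame_size, sample_rate).T
    # step 6: logarithmic scaling, adding 1 for numerical stability
    return np.log10(1.0 + mel).T
```

The output uses the same orientation as the notebook's spectrograms (bands along the first axis, frames along the second).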


In [6]:
# define additional constants
SR = 44100 # sampling rate
FRAME_SIZE = 2048 # number of samples per frame
HOP_SIZE = int(SR / FPS) # hop size depends on sampling rate and frame rate
NUM_BANDS = 40 # number of mel bins

def pre_process(filename, frame_size=2048, frame_rate=FPS, num_bands=40, **kwargs):
    """
    Pre-process the audio signal.

    Parameters
    ----------
    filename : str
        File to be processed.
    frame_size : int
        Size of the frames.
    frame_rate : float
        Frame rate used for the STFT.
    num_bands : int
        Number of frequency bands for the Mel filterbank.
    kwargs : dict, optional
        Additional keyword arguments.

    Returns
    -------
    spectrogram : numpy array
        Spectrogram.

    """
    # STEP 1: read in audio
    signal, sr = librosa.load(filename, sr=SR) # read file
    
    # STEP 2,3: compute stft (default windowing function is Hann);
    # the hop size is derived from the requested frame rate
    stft = librosa.core.stft(y=signal, n_fft=frame_size, hop_length=int(SR / frame_rate))
    
    # STEP 4: discard phase info and square magnitudes
    initial_spectrogram = abs(stft)**2
    
    # STEP 5: apply mel scaling
    mel_bins = librosa.filters.mel(sr=SR, n_fft=frame_size, n_mels=num_bands)
    mel_spectrogram = mel_bins.dot(initial_spectrogram)
    
    # STEP 6: apply DB scaling
    db_mel_spectrogram = librosa.power_to_db(mel_spectrogram)
            
    spectrogram = db_mel_spectrogram
    return spectrogram

Pre-compute the spectrograms for all audio files with onset annotations.

In [7]:
# list for collecting pre-processed spectrograms
# Note: it is not necessary to use this list but recommended in order to
#       avoid recomputation of the same features over and over again.
#       *_AUDIO_IDX can be used to access the precomputed spectrograms by
#       index.
SPECTROGRAMS = []

for audio_file in AUDIO_FILES:
    spec = pre_process(audio_file)
    SPECTROGRAMS.append(spec)

---
# Onset detection
---

In [8]:
# you are not required to use these predefined constants, but it is recommended
ONSET_ANNOTATION_FILES = search_files('data/train', '.onsets')
ONSET_AUDIO_FILES, ONSET_AUDIO_IDX = find_audio_files(ONSET_ANNOTATION_FILES, AUDIO_FILES)
ONSET_AUDIO = [SPECTROGRAMS[i] for i in ONSET_AUDIO_IDX]
ONSET_ANNOTATIONS = [madmom.io.load_onsets(f) for f in ONSET_ANNOTATION_FILES]

assert len(ONSET_ANNOTATION_FILES) == 321
assert len(ONSET_AUDIO_FILES) == 321
assert len(ONSET_AUDIO) == 321
assert len(ONSET_ANNOTATIONS) == 321

## Task 2: signal processing-based onset detection (20 + 5 points)

For onset detection, the spectral flux should be used.

### Task 2a: define onset detection function (5 points)

Step 1: compute the temporal difference  

Step 2: keep only the positive differences

Step 3: sum or average these differences, to obtain the onset detection function (ODF)
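
The three steps above can also be written without explicit loops. The following is a minimal vectorised sketch, assuming a spectrogram of shape (bands, frames); the function name `spectral_flux` is hypothetical. Unlike the loop-based cell below, it sets the difference of the first frame to zero instead of comparing that frame against zero.

```python
import numpy as np

def spectral_flux(spectrogram):
    """Spectral flux of a (bands, frames) spectrogram, one value per frame."""
    spectrogram = np.asarray(spectrogram, dtype=float)
    # step 1: temporal difference (first frame has no predecessor -> 0)
    diff = np.zeros_like(spectrogram)
    diff[:, 1:] = spectrogram[:, 1:] - spectrogram[:, :-1]
    # step 2: keep only the positive differences
    positive = np.maximum(diff, 0.0)
    # step 3: average over the bands
    return positive.mean(axis=0)
```

Averaging over the actual number of bands (rather than a fixed constant) keeps the scale of the ODF comparable when the number of filter bands changes.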

In [9]:
def onset_detection_function(spectrogram):
    """
    Compute an onset detection function.

    Parameters
    ----------
    spectrogram : numpy array
        Spectrogram

    Returns
    -------
    odf : numpy array
        Onset detection function.

    """
    spectrogram_T = spectrogram.transpose()
    
    odf = []
    for i, frame in enumerate(spectrogram_T):
        # sum of the positive differences to the previous frame
        # (the first frame is compared against zero)
        total = 0
        for j in range(len(frame)):
            diff = frame[j] - (spectrogram_T[i-1][j] if i > 0 else 0)
            total += diff if diff >= 0 else 0

        # average over the bands (NUM_BANDS is the global default)
        odf.append(total / NUM_BANDS)
           
    return odf

### Task 2b: detect onsets from onset detection function (6 points)

To detect the onsets in the ODF, the following procedure should be applied:

Step 1: (optional) subtract a moving average from the ODF

Step 2: discard all ODF values below a certain threshold 

Step 3: select local maxima as onset positions

Step 4: (optional) discard onsets too close together (recommended value: within 30ms)
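
A compact sketch of this procedure on a toy ODF is given below. The function name `pick_peaks` and its window sizes are illustrative assumptions, not the values used in the graded cell, and the windows here are symmetric for simplicity.

```python
import numpy as np

def pick_peaks(odf, threshold, fps=100, avg_frames=10, max_frames=2,
               min_dist_frames=3):
    """Return onset times (seconds) picked from an ODF."""
    odf = np.asarray(odf, dtype=float)
    n = len(odf)
    # step 1: subtract a centred moving average
    moving_avg = np.array([odf[max(0, i - avg_frames):i + avg_frames + 1].mean()
                           for i in range(n)])
    residual = odf - moving_avg
    onset_frames = []
    for i in range(n):
        # step 2: discard values below the threshold
        if residual[i] < threshold:
            continue
        # step 3: keep only local maxima
        neighbourhood = residual[max(0, i - max_frames):i + max_frames + 1]
        if residual[i] < neighbourhood.max():
            continue
        # step 4: discard onsets too close to the previous one
        if onset_frames and i - onset_frames[-1] <= min_dist_frames:
            continue
        onset_frames.append(i)
    return np.array(onset_frames) / fps

# toy ODF: two clear peaks and one smaller neighbouring value
toy_odf = np.zeros(100)
toy_odf[20], toy_odf[21], toy_odf[60] = 1.0, 0.6, 1.0
print(pick_peaks(toy_odf, threshold=0.5))
```

The value at frame 21 passes the threshold but is not a local maximum, so only frames 20 and 60 are reported.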


In [10]:
MAX_LEFT = 2 # left side window size for local maximum
MAX_RIGHT = 3 # right side window size for local maximum
AVG_LEFT = 10 # left side window size for moving average
AVG_RIGHT = 11 # right side window size for moving average
MIN_DIST = 3 # minimum distance in frames (30ms at 100 fps)
            # 0.5 threshold is used

def detect_onsets(odf, threshold, frame_rate=FPS, **kwargs):
    """
    Detect the onsets in the onset detection function (ODF).

    Parameters
    ----------
    odf : numpy array
        Onset detection function.
    threshold : float
        Threshold for peak picking
    frame_rate : float
        Frame rate of the onset detection function.
    kwargs : dict, optional
        Additional keyword arguments.

    Returns
    -------
    onsets : numpy array
        Detected onsets (in seconds).

    """
            
    new_odf = []
    
    ######## MOVING AVERAGE AND THRESHOLD ########
    
    for i in range(0, len(odf)):
        l = i - AVG_LEFT if i - AVG_LEFT > 0 else 0
        r = i + AVG_RIGHT if i + AVG_RIGHT < len(odf) else len(odf)
        
        new_val = odf[i] - np.average(odf[l:r])
        new_odf.append(new_val if new_val >= threshold else 0)
    
    ######## LOCAL MAXIMUM ########
    
    for i in range(0, len(new_odf)):
        l = i - MAX_LEFT if i - MAX_LEFT > 0 else 0
        r = i + MAX_RIGHT if i + MAX_RIGHT < len(odf) else len(odf)
        
        if new_odf[i] < max(new_odf[l:r]):
            new_odf[i] = 0
    
    ######## MINIMUM DISTANCE ########
    
    last = -1
    for i in range(0, len(new_odf)):
        if new_odf[i] > 0 and (last < 0 or i - last > MIN_DIST):
            last = i
        else:
            new_odf[i] = 0
    
    ######## SELECTING ONSETS ########

    onsets = np.array([])
    
    for i, el in enumerate(new_odf):
        if new_odf[i] > 0:
            onsets = np.append(onsets, i)
        
    return onsets / frame_rate

### Task 2c: predict onsets on dataset (4 points)

Run the complete onset detection pipeline on all audio files of the dataset.

Step 1: Pre-process the audio.

Step 2: Compute the ODF.

Step 3: Detect the onsets. Set the threshold such that the F-measure is maximised on the dataset (see also task 2d).

In [11]:
# define additional constants
THRESHOLD = 0.5

# list for collecting the onset detections
onset_detections = []

for i, spec in enumerate(ONSET_AUDIO):
    odf = onset_detection_function(spec)
    onsets = detect_onsets(odf, THRESHOLD, FPS)
    onset_detections.append(onsets)

### Task 2d: evaluate detected onsets against the ground truth (5 points)

Evaluate onset detection performance with `precision`, `recall`, and `fmeasure`.
Either use the `madmom.evaluate.onsets` module or the `mir_eval` package.
Compute the average over all files with corresponding onset annotations.
As an evaluation window, ±25ms should be used.
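
To make the metrics concrete, here is a simplified sketch of how detections can be matched to annotations within a ±25ms window and turned into precision, recall and F-measure. The greedy matching and the names `match_onsets` and `precision_recall_fmeasure` are assumptions for illustration; the evaluation packages mentioned above may match events differently.

```python
def match_onsets(detections, annotations, window=0.025):
    """Greedily match sorted detections to annotations; return (tp, fp, fn)."""
    detections, annotations = sorted(detections), sorted(annotations)
    tp = 0
    d = a = 0
    while d < len(detections) and a < len(annotations):
        delta = detections[d] - annotations[a]
        if abs(delta) <= window:
            # detection within the window: true positive
            tp += 1
            d += 1
            a += 1
        elif delta < 0:
            # detection too early to match anything: false positive
            d += 1
        else:
            # annotation has no matching detection: false negative
            a += 1
    fp = len(detections) - tp
    fn = len(annotations) - tp
    return tp, fp, fn

def precision_recall_fmeasure(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fmeasure = (2 * precision * recall / (precision + recall)
                if precision + recall else 0.0)
    return precision, recall, fmeasure

tp, fp, fn = match_onsets([0.10, 0.52, 1.00, 1.40], [0.11, 0.50, 0.90])
print(precision_recall_fmeasure(tp, fp, fn))
```

In this toy example two of four detections are matched and one of three annotations is missed, which gives a precision of 0.5 and a recall of 2/3.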

In [12]:
def evaluate_onsets(onsets, annotations):
    """
    Evaluate detected onsets against ground truth annotations.
    
    Parameters
    ----------
    onsets : list
        List with onset detections for all files.
    annotations : list
        List with corresponding ground truth annotations.

    Returns
    -------
    precision : float
        Averaged precision.
    recall : float
        Averaged recall.
    fmeasure : float
        Averaged f-measure.
    
    """
    sum_precision = 0
    sum_recall = 0
    sum_fmeasure = 0
    for i in range(0, len(onsets)):
        tp, fp, tn, fn, errors = madmom.evaluation.onsets.onset_evaluation(onsets[i], annotations[i], window=0.025)
        p = len(tp) / (len(tp) + len(fp)) if len(tp) > 0 else 0
        r = len(tp) / (len(tp) + len(fn)) if len(tp) > 0 else 0
        f = 2*p*r / (p + r) if p + r > 0 else 0
        sum_precision = sum_precision + p
        sum_recall = sum_recall + r
        sum_fmeasure = sum_fmeasure + f
    
    precision = sum_precision / len(onsets)
    recall = sum_recall / len(onsets)
    fmeasure = sum_fmeasure / len(onsets)
    return precision, recall, fmeasure
    
# evaluate against ground truth
p, r, f = evaluate_onsets(onset_detections, ONSET_ANNOTATIONS)

print('Signal processing-based onset detection\nPrecision: %.3f\nRecall:    %.3f\nF-measure: %.3f' % (p, r, f))

Signal processing-based onset detection
Precision: 0.796
Recall:    0.754
F-measure: 0.758


### Task 2e: (optional) optimise parameters (5 points)

Optimise the parameters of task 1 and 2 to get the best performance on the dataset.

Parameters to be optimised: frame size (e.g. 1024, 2048, 4096), number of filter bands
(e.g. 20, 40, 80), different logarithmic scaling parameters (e.g. natural logarithm or
base 10; adding a constant) and the detection threshold.
Replace the default arguments/values in the functions with the optimised parameters.

The values in parentheses are suggested variations, experiment as you like.
Please be aware that parameters may very likely have mutual influences.
A coarse optimisation is enough. The main goal of this step is to understand 
the impact of these variations rather than getting another 0.01% performance.
      

In [13]:
def optimize_parameters(verbose=False):
    frame_sizes = [1024, 2048, 4096]
    num_bands = [20, 40, 80]
    thresholds = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    
    best_fmeasure = 0
    best_frame_size = 0
    best_num_bands = 0
    best_threshold = 0

    for i in range(0, len(frame_sizes)):
        for j in range(0, len(num_bands)):
            for k in range(0, len(thresholds)):
                FRAME_SIZE = frame_sizes[i]
                NUM_BANDS = num_bands[j]
                THRESHOLD = thresholds[k]
                
                if verbose:
                    print("parameters:", FRAME_SIZE, NUM_BANDS, THRESHOLD)
                
                # spectrograms
                specs = []
                for audio_file in AUDIO_FILES:
                    spec = pre_process(audio_file, FRAME_SIZE, FPS, NUM_BANDS)
                    specs.append(spec)

                onset_audio = [specs[i] for i in ONSET_AUDIO_IDX]
                
                # onset detections
                ods = []
                for l, spec in enumerate(onset_audio):
                    odf = onset_detection_function(spec)
                    onsets = detect_onsets(odf, THRESHOLD, FPS)
                    ods.append(onsets)
                
                # evaluation
                precision, recall, fmeasure = evaluate_onsets(ods, ONSET_ANNOTATIONS)
                if verbose:
                    print('Signal processing-based onset detection\nPrecision: %.3f\nRecall:    %.3f\nF-measure: %.3f' % (precision, recall, fmeasure))
                    print('')
                
                if fmeasure > best_fmeasure:
                    best_fmeasure = fmeasure
                    best_frame_size = FRAME_SIZE
                    best_num_bands = NUM_BANDS
                    best_threshold = THRESHOLD
                    
    return best_fmeasure, best_frame_size, best_num_bands, best_threshold

# uncomment and run block to optimize parameters
# best_fmeasure, best_frame_size, best_num_bands, best_threshold = optimize_parameters(verbose=True)
# print("best found parameters are:", best_frame_size, best_num_bands, best_threshold, "with F-measure:", best_fmeasure)

Put your observations/findings about task 2e in textual form below:

Parameter optimization was run on the following parameters: <br>
frame size (1024, 2048 and 4096), <br>
number of mel bins (20, 40 and 80) and <br>
threshold (in range from 0 to 1.0 (or 1.5 in some cases) with step size 0.1).
<br><br>
An example of a well performing combination: 2048 40 0.5 with precision: 79.6%
recall: 75.4%, F-measure: 75.8%
<br><br>
The results are uploaded to the root directory in 3 separate files grouped for convenience by the frame size parameter: "1024 param config.txt", "2048 param config.txt" and "4096 param config.txt"
<br><br>
The first and most obvious observation in all cases is the influence of the threshold parameter on precision and recall: starting with a low threshold (high recall) and moving upwards (high precision), we can see that hitting a sweet spot somewhere in the middle is necessary for a good F-measure.
<br><br>
Furthermore, we can see that selecting 20 as the number of mel bins almost universally yields slightly worse results than the other two values, regardless of the other parameters (within reasonable bounds). 20 seems to be too few bins, while 40 and 80 perform more or less similarly. <br>
That being said, we still achieved an F-measure of 74.5% with parameters 2048 20 0.3, while our overall best F-measure was 75.9%, so this difference is probably negligible.
<br><br>
Most interestingly though one can see how picking 4096 as frame size results in significantly worse F-measure values, in best cases barely hitting the 65% mark, while 1024 and 2048 are consistently above 70%, often reaching the maximum of 75.9% with proper threshold and bin number parameters. <br> This can be attributed to the fact that by selecting a larger frame size one loses some of the temporal accuracy that is essential for onset detection.
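
The loss of temporal accuracy can be illustrated with a short calculation of how much audio each frame spans at 44.1kHz. This is only a sketch of the arithmetic; the helper name `frame_duration_ms` is hypothetical.

```python
def frame_duration_ms(frame_size, sample_rate=44100):
    """Duration of one analysis frame in milliseconds."""
    return 1000.0 * frame_size / sample_rate

for frame_size in (1024, 2048, 4096):
    print(frame_size, "samples =", round(frame_duration_ms(frame_size), 1), "ms")
```

The frames span roughly 23.2ms, 46.4ms and 92.9ms respectively. A 4096-sample frame is therefore almost twice as long as the full ±25ms evaluation window, so the energy of an onset is smeared over many neighbouring frames.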

---
## Task 3: machine learning-based onset detection (20 points)

A simple machine learning approach should be investigated. The question to 
be answered is: can a simple neural network improve the onset detection
performance compared to the standard spectral flux approach above?

In order to answer this question, the hand-crafted ODF computation should be
replaced by a multilayer perceptron (MLP).

### Task 3a: define a training function (10 points)

Step 1: Use `sklearn` to create an `MLPRegressor` with given parameters.

Step 2: Use the same features as in the audio pre-processing section (task 1)
        as inputs (or if task 2e was done: use the optimised parameters).

Step 3: As targets, use the annotated onset positions of the dataset and
        assign each target frame a value of 1.

Step 4: Concatenate all audio frames and target frames to be used for training.

Step 5: Fit the model with the given data.

Step 6: Save the model to the given file name. Use Python's `pickle` module.
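
Steps 3 and 4 can be sketched as follows, assuming spectrograms of shape (bands, frames) and onset annotations in seconds. The helper names `onset_targets` and `build_training_data` are hypothetical; annotations that fall outside the spectrogram are dropped here, which is one possible way to avoid index errors.

```python
import numpy as np

def onset_targets(onset_times, num_frames, fps=100):
    """Frame-wise targets: 1 at annotated onset frames, 0 elsewhere."""
    targets = np.zeros(num_frames)
    frames = np.rint(np.asarray(onset_times, dtype=float) * fps).astype(int)
    # ignore annotations that fall outside the spectrogram
    frames = frames[(frames >= 0) & (frames < num_frames)]
    targets[frames] = 1.0
    return targets

def build_training_data(spectrograms, annotations, fps=100):
    """Concatenate all frames (as rows) and all targets."""
    x = np.concatenate([spec.T for spec in spectrograms], axis=0)
    y = np.concatenate([onset_targets(ann, spec.shape[1], fps)
                        for spec, ann in zip(spectrograms, annotations)])
    return x, y
```

The resulting `x` has one row per frame and one column per band, and `y` has one value per frame, which is the layout expected by the `fit()` method of `sklearn` estimators.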

In [14]:
def train(audio, annotations, diffs=False, early_stopping=False,
          verbose=True, model='model.pkl', **kwargs):
    """
    Train an MLP on the data.

    Parameters
    ----------
    audio : list
        List of audio files or precomputed spectrograms.
    annotations : list of numpy arrays
        List with corresponding onset annotations.
    diffs : bool, optional
        Include diffs as input features (step 7).
    early_stopping : bool, optional
        Use early stopping to prevent overfitting (step 8).
    verbose : bool, optional
        Be verbose during training.
    model : str, optional
        Save the fitted model to given file name.
    kwargs : dict, optional
        Additional keyword arguments.
        
    Returns
    -------
    mlp : MLPRegressor
        Trained MLP.

    """
    from sklearn.neural_network import MLPRegressor
    # define MLP
    mlp = MLPRegressor(hidden_layer_sizes=(50, 50), tol=1e-4, max_iter=100,
                       early_stopping=early_stopping, verbose=verbose)
    if verbose:
        print(mlp)
        
    # prepare input features and targets
    x = []
    y = []
    
    ######## INPUT PREPARATION ########
    
    # concatenate all features and transpose to fit the MLP input format
    spectral_features = np.concatenate((audio), axis=1)
    spectral_features_T = spectral_features.transpose()
    x = spectral_features_T
    
    # add spectral flux to features
    if diffs:
        if verbose:
            print('')
            print('adding spectral flux to input features...')
            print('')
        flux = kwargs['flux']
        flux = np.concatenate((flux))
        flux = np.vstack(flux)
        x = np.concatenate((x, flux), axis=1)
    
    # create target as 0 array with value 1 where index matches the frame 
    y = np.array([])

    for i in range(0, len(audio)):
        spec_T = audio[i].transpose()
        target = np.zeros(len(spec_T))
        
        onset_frames = np.rint(annotations[i] * FPS)
        for j in range(0, len(onset_frames)):
            target[int(onset_frames[j])] = 1

        y = np.append(y, target)
        
    ###################################
    
    # reshape x and y
    # Note: depending on your data pre-processing these lines might
    #       need to be adjusted accordingly
    x = np.vstack(x)
    y = np.hstack(y)
    
    # train model
    if verbose:
        print('training model:', model)
    mlp.fit(x.squeeze(), y.squeeze())
    
    # save model and return it
    with open(model, 'wb') as f:
        pickle.dump(mlp, f)
    return mlp
    

### Task 3b: train the model (2 points)

Train the model on the dataset and save as `model.pkl`.

In [15]:
MLP_MODEL = train(ONSET_AUDIO, ONSET_ANNOTATIONS, False, False, model='model.pkl')

MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
             beta_2=0.999, early_stopping=False, epsilon=1e-08,
             hidden_layer_sizes=(50, 50), learning_rate='constant',
             learning_rate_init=0.001, max_iter=100, momentum=0.9,
             n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
             random_state=None, shuffle=True, solver='adam', tol=0.0001,
             validation_fraction=0.1, verbose=True, warm_start=False)
training model: model.pkl
Iteration 1, loss = 1.25240510
Iteration 2, loss = 0.02946456
Iteration 3, loss = 0.02688286
Iteration 4, loss = 0.02408670
Iteration 5, loss = 0.02256656
Iteration 6, loss = 0.02181341
Iteration 7, loss = 0.02148313
Iteration 8, loss = 0.02131591
Iteration 9, loss = 0.02125165
Iteration 10, loss = 0.02124676
Iteration 11, loss = 0.02119086
Iteration 12, loss = 0.02119527
Iteration 13, loss = 0.02115439
Iteration 14, loss = 0.02113956
Iteration 15, loss = 0.02111525
Iteration 16

### Task 3c: evaluate performance on the dataset (3 points)

Step 1: Predict onset activations for the dataset.

Step 2: Adjust the threshold parameter to yield the best F-measure on the dataset
        (use the `detect_onsets()` function defined in task 2b).

Step 3: Evaluate performance on the dataset.

In [16]:
# A solid arbitrary starting value for the threshold
MLP_THRESHOLD = 0.0025

In [17]:
# Function for optimizing the threshold parameter
def optimize_mlp_threshold(model, thresholds=(), diffs=False, verbose=True, **kwargs):
    best_threshold = 0
    best_fmeasure = 0
    
    if diffs and verbose:
        print('running optimization with spectral flux...')
        print('')

    for i in range(0, len(thresholds)):
        ods_opt = []
        for j, spec in enumerate(ONSET_AUDIO):
            
            x = spec.transpose()
            
            # add spectral flux to features
            if diffs:
                flux = kwargs['flux']
                flux = np.vstack(flux[j])
                x = np.concatenate((x, flux), axis=1)
            
            mlp_odf = model.predict(x)
            mlp_onsets = detect_onsets(mlp_odf, thresholds[i], FPS)
            ods_opt.append(mlp_onsets)
        
        p, r, f = evaluate_onsets(ods_opt, ONSET_ANNOTATIONS)
        if verbose:
            print('Current threshold:', thresholds[i])
            print('MLP onset detection\nPrecision: %.3f\nRecall:    %.3f\nF-measure: %.3f' % (p, r, f))
            print('')
        
        if f > best_fmeasure:
            best_fmeasure = f
            best_threshold = thresholds[i]
        
    if verbose:
        print('Optimized threshold is:', best_threshold, 'with F measure:', best_fmeasure)
    return best_threshold

mlp_thresholds = np.arange(0,0.005,0.0005)  # thresholds to use for optimization

# COMMENT OUT LINE BELOW TO AVOID RUNNING THRESHOLD OPTIMIZATION (might take up to a minute or two)
MLP_THRESHOLD = optimize_mlp_threshold(model=MLP_MODEL, thresholds=mlp_thresholds)

Current threshold: 0.0
MLP onset detection
Precision: 0.387
Recall:    0.683
F-measure: 0.454

Current threshold: 0.0005
MLP onset detection
Precision: 0.455
Recall:    0.644
F-measure: 0.507

Current threshold: 0.001
MLP onset detection
Precision: 0.531
Recall:    0.588
F-measure: 0.541

Current threshold: 0.0015
MLP onset detection
Precision: 0.586
Recall:    0.538
F-measure: 0.540

Current threshold: 0.002
MLP onset detection
Precision: 0.613
Recall:    0.494
F-measure: 0.519

Current threshold: 0.0025
MLP onset detection
Precision: 0.635
Recall:    0.458
F-measure: 0.495

Current threshold: 0.003
MLP onset detection
Precision: 0.642
Recall:    0.430
F-measure: 0.472

Current threshold: 0.0035
MLP onset detection
Precision: 0.646
Recall:    0.407
F-measure: 0.452

Current threshold: 0.004
MLP onset detection
Precision: 0.629
Recall:    0.385
F-measure: 0.431

Current threshold: 0.0045000000000000005
MLP onset detection
Precision: 0.618
Recall:    0.368
F-measure: 0.414

Optimized th

In [18]:
mlp_onset_detections = []

for i, spec in enumerate(ONSET_AUDIO):
    mlp_odf = MLP_MODEL.predict(spec.transpose())
    mlp_onsets = detect_onsets(mlp_odf, MLP_THRESHOLD, FPS)
    mlp_onset_detections.append(mlp_onsets)

# evaluate against ground truth
p, r, f = evaluate_onsets(mlp_onset_detections, ONSET_ANNOTATIONS)

print('MLP onset detection\nPrecision: %.3f\nRecall:    %.3f\nF-measure: %.3f' % (p, r, f))


MLP onset detection
Precision: 0.531
Recall:    0.588
F-measure: 0.541


### Task 3d: describe your findings (5 points)

Describe your findings/observations in textual form below:

The MLP Regressor performed rather poorly compared to the hand-crafted method. Its F-measure averaged around 53% (across multiple trainings), compared to approximately 75% for the hand-crafted one.

With only the spectrogram as input, the MLP does not seem able to generate a reasonable ODF. Most likely this is because each frame's feature vector is treated in isolation: the MLP sees the frames as an unordered set, so the temporal structure of the spectrogram (i.e. differences in energy between consecutive frames) is not taken into account.

The reason the MLP did not fail completely could be possibly attributed to the fact that even if viewed in isolation a spectral feature vector still carries some information that is relevant to onset detection, for example high frequency content that is usually present at onsets. This could have helped the MLP learn a somewhat functioning ODF.

### Task 3e: add temporal differences as additional features (2 points)

Train a new model with first order temporal differences (as in spectral flux) as
additional features (stacked to the magnitudes) and save as `model_diff.pkl`.

Note: modify the `train()` function of task 3a to be able to be called with `diffs=True`.
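
One way to stack first order differences to the magnitudes is sketched below, assuming a spectrogram of shape (bands, frames). The function name `stack_diffs` is hypothetical. Note that this variant keeps one positive difference per band, whereas the cells below append the single averaged spectral flux value per frame; which of the two works better was not tested here.

```python
import numpy as np

def stack_diffs(spectrogram):
    """Return MLP inputs of shape (frames, 2 * bands).

    The first half of each row holds the magnitudes, the second half the
    positive first order temporal differences of the same bands.
    """
    spectrogram = np.asarray(spectrogram, dtype=float)
    diff = np.zeros_like(spectrogram)
    diff[:, 1:] = spectrogram[:, 1:] - spectrogram[:, :-1]
    diff = np.maximum(diff, 0.0)
    return np.concatenate((spectrogram, diff), axis=0).T
```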

In [19]:
# calculate flux

FLUX = []
for i in range(0, len(ONSET_AUDIO)):
    diff = onset_detection_function(ONSET_AUDIO[i])
    FLUX.append(diff)

In [20]:
MLP_DIFF_MODEL = train(ONSET_AUDIO, ONSET_ANNOTATIONS, True, False, model='model_diff.pkl', flux=FLUX)

MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
             beta_2=0.999, early_stopping=False, epsilon=1e-08,
             hidden_layer_sizes=(50, 50), learning_rate='constant',
             learning_rate_init=0.001, max_iter=100, momentum=0.9,
             n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
             random_state=None, shuffle=True, solver='adam', tol=0.0001,
             validation_fraction=0.1, verbose=True, warm_start=False)

adding spectral flux to input features...

training model: model_diff.pkl
Iteration 1, loss = 0.80333189
Iteration 2, loss = 0.02634326
Iteration 3, loss = 0.02376980
Iteration 4, loss = 0.02132270
Iteration 5, loss = 0.01976381
Iteration 6, loss = 0.01900296
Iteration 7, loss = 0.01833693
Iteration 8, loss = 0.01779528
Iteration 9, loss = 0.01742194
Iteration 10, loss = 0.01722965
Iteration 11, loss = 0.01713869
Iteration 12, loss = 0.01701473
Iteration 13, loss = 0.01692768
Iteration 14, loss = 0.0167

### Task 3f: evaluate model (3 points)

Compare the performance of this model with the ones of task 2 and task 3b.
Again, use a suitable threshold which maximises the performance on the dataset.


In [21]:
# A solid arbitrary starting value for the threshold
MLP_DIFF_THRESHOLD = 0.075

In [22]:
mlp_diff_thresholds = np.arange(0.03,0.11,0.01)  # thresholds to use for optimization

# COMMENT OUT LINE BELOW TO AVOID RUNNING THRESHOLD OPTIMIZATION (might take up to a minute or two)
MLP_DIFF_THRESHOLD = optimize_mlp_threshold(model=MLP_DIFF_MODEL, thresholds=mlp_diff_thresholds, diffs=True, flux=FLUX)

running optimization with spectral flux...

Current threshold: 0.03
MLP onset detection
Precision: 0.667
Recall:    0.812
F-measure: 0.708

Current threshold: 0.04
MLP onset detection
Precision: 0.700
Recall:    0.799
F-measure: 0.725

Current threshold: 0.05
MLP onset detection
Precision: 0.729
Recall:    0.786
F-measure: 0.738

Current threshold: 0.060000000000000005
MLP onset detection
Precision: 0.755
Recall:    0.772
F-measure: 0.748

Current threshold: 0.07
MLP onset detection
Precision: 0.777
Recall:    0.759
F-measure: 0.752

Current threshold: 0.08000000000000002
MLP onset detection
Precision: 0.793
Recall:    0.744
F-measure: 0.753

Current threshold: 0.09000000000000001
MLP onset detection
Precision: 0.808
Recall:    0.731
F-measure: 0.751

Current threshold: 0.1
MLP onset detection
Precision: 0.820
Recall:    0.714
F-measure: 0.747

Optimized threshold is: 0.08000000000000002 with F measure: 0.7526878542099927


In [23]:
mlp_diff_detections = []

# for task 4
MLP_DIFF_ODFS = []

for i, spec in enumerate(ONSET_AUDIO):
    flux = np.vstack(FLUX[i])
    x = np.concatenate((spec.transpose(), flux), axis=1)

    mlp_diff_odf = MLP_DIFF_MODEL.predict(x)
    mlp_diff_onsets = detect_onsets(mlp_diff_odf, MLP_DIFF_THRESHOLD, FPS)
    mlp_diff_detections.append(mlp_diff_onsets)
    
    MLP_DIFF_ODFS.append(mlp_diff_odf)

# evaluate against ground truth
p, r, f = evaluate_onsets(mlp_diff_detections, ONSET_ANNOTATIONS)

print('MLP onset detection with temporal diffs\nPrecision: %.3f\nRecall:    %.3f\nF-measure: %.3f' % (p, r, f))

MLP onset detection with temporal diffs
Precision: 0.793
Recall:    0.744
F-measure: 0.753


### Task 3g: describe your findings (5 points)

Describe your findings/observations in textual form below:

The MLP regressor that received the spectral flux in addition to the spectrogram performed much better than the one using only the spectrogram. Having the change in energy from the previous frame to the current one as an input proved to be essential for learning an ODF.

The Diff MLP model achieved an F-measure of 76.5%, compared to 53% for the plain MLP model and 75% for the hand-crafted methods.

While the increase of 1.5 percentage points does not seem large, the regressor used for the task was a fairly generic one, so a network tailored to this particular task would probably achieve better results.

In conclusion, a neural network combined with carefully prepared input features is probably the most promising way to approach onset detection.
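
To illustrate the idea of feeding the frame-to-frame energy change to the network, here is a minimal sketch that appends a spectral-flux column to a spectrogram. The helper name `stack_flux_feature`, the `(bins, frames)` layout and the half-wave rectified flux definition are assumptions for this example; the `FLUX` values used in the notebook may be computed differently.

```python
import numpy as np


def stack_flux_feature(spec):
    """Append a spectral-flux column to a (bins, frames) spectrogram."""
    frames = np.asarray(spec, dtype=float).T  # (frames, bins)
    # energy change relative to the previous frame (first frame: no change)
    diff = np.diff(frames, axis=0, prepend=frames[:1])
    # keep only increases in energy and sum them over all bins
    flux = np.maximum(diff, 0.0).sum(axis=1, keepdims=True)
    return np.concatenate((frames, flux), axis=1)  # (frames, bins + 1)


# toy spectrogram with 2 bins and 3 frames
spec = np.array([[0.0, 1.0, 1.0],
                 [0.0, 0.0, 2.0]])
features = stack_flux_feature(spec)
print(features.shape)
```

Each row of `features` is one frame, so the result can be passed to a regressor that expects one sample per frame.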

---
# Tempo estimation
---

In [24]:
# you are not required to use these predefined constants, but it is recommended
TEMPO_ANNOTATION_FILES = search_files('data/train', '.bpm')
TEMPO_AUDIO_FILES, TEMPO_AUDIO_IDX = find_audio_files(TEMPO_ANNOTATION_FILES, AUDIO_FILES)
TEMPO_AUDIO = [SPECTROGRAMS[i] for i in TEMPO_AUDIO_IDX]
TEMPO_ANNOTATIONS = [madmom.io.load_tempo(f)[0, 0] for f in TEMPO_ANNOTATION_FILES]

assert len(TEMPO_ANNOTATION_FILES) == 107
assert len(TEMPO_AUDIO_FILES) == 107
assert len(TEMPO_AUDIO) == 107
assert len(TEMPO_ANNOTATIONS) == 107

## Task 4: detect tempo of ODF (25 + 5 points)

To detect the tempo/periodicity of the ODF, the following procedure should be
applied:

Step 1: Compute the auto-correlation function (ACF) of the ODF.

Step 2: Select an appropriate peak of the ACF as the main periodicity.

Step 3: Compute the tempo (in bpm, beats per minute).

Step 4: Evaluate the mean tempo estimation performance (e.g. with `madmom.evaluation.tempo` module)
        on the dataset. Use `Accuracy 1` (with 4% tolerance) and `Accuracy 2` (allowing 4% tolerance,
        including double and half tempo variants) as metrics.

Step 5: (optional) optimise the parameters to get the best performance on the dataset.
        Parameters to be optimised: lag range for ACF computation (lower bound: 40-80 bpm,
        upper bound: 140-220 bpm), peak selection mechanism (e.g. clustering of peaks).
        Replace the default arguments/values in the function definition with the optimised parameters.
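
As a rough illustration of steps 1 to 3, the sketch below computes the ACF of a synthetic ODF with numpy and converts the strongest lag to BPM. The function name `acf_tempo_sketch`, the synthetic impulse train and the frame rate of 100 fps are assumptions made for this example only; they are not part of the exercise template.

```python
import numpy as np


def acf_tempo_sketch(odf, min_bpm=60, max_bpm=180, frame_rate=100):
    """Return the tempo (BPM) whose lag maximises the ACF of the ODF."""
    odf = np.asarray(odf, dtype=float).ravel()
    # convert the BPM bounds to lags in frames (fast tempo = short lag)
    min_lag = max(1, int(round(60.0 * frame_rate / max_bpm)))
    max_lag = min(int(round(60.0 * frame_rate / min_bpm)), len(odf) - 1)
    lags = np.arange(min_lag, max_lag + 1)
    # step 1: un-normalised auto-correlation for every candidate lag
    acf = np.array([np.dot(odf[:-lag], odf[lag:]) for lag in lags])
    # step 2: take the lag with the highest correlation as the periodicity
    best_lag = lags[np.argmax(acf)]
    # step 3: convert the lag (frames per beat) to beats per minute
    return 60.0 * frame_rate / best_lag


# synthetic ODF: one impulse every 50 frames, i.e. 120 BPM at 100 fps
odf = np.zeros(1000)
odf[::50] = 1.0
tempo = acf_tempo_sketch(odf)
print(tempo)
```

Searching over lags rather than integer BPM values evaluates every distinct shift exactly once; several neighbouring BPM values can round to the same lag.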


In [25]:
# ODFs computed by the MLP with diffs
TEMPO_MLP_ODFS = [MLP_DIFF_ODFS[i] for i in TEMPO_AUDIO_IDX]

In [26]:
def detect_tempo(odf, min_bpm=60, max_bpm=180, frame_rate=FPS, **kwargs):
    """
    Detect the tempo of the onset detection function (ODF).

    Parameters
    ----------
    odf : numpy array
        Onset detection function.
    min_bpm : float
        Minimum tempo, given in beats per minute (BPM).
    max_bpm : float
        Maximum tempo, given in beats per minute (BPM).
    frame_rate : float
        Frame rate of the onset detection function.
    kwargs : dict, optional
        Additional keyword arguments.

    Returns
    -------
    tempo : float
        Detected tempo (in BPM).

    """
    odf_copy = np.asarray(odf, dtype=float).ravel()

    ######## MEDIAN FILTER ########

    # optional preprocessing: zero all values that do not exceed the local median;
    # defaults to off so the function also works without the keyword argument
    if kwargs.get('median_filter', False):
        med_left = 10   # frames before the current frame
        med_right = 11  # exclusive slice bound, i.e. 10 frames after the current frame
        new_odf = np.zeros_like(odf_copy)
        for i in range(len(odf_copy)):
            l = max(i - med_left, 0)
            r = min(i + med_right, len(odf_copy))
            if odf_copy[i] > np.median(odf_copy[l:r]):
                new_odf[i] = odf_copy[i]

        odf_copy = new_odf

    ###############################

    highest_correlation = 0
    tempo = min_bpm
    for bpm in range(int(min_bpm), int(max_bpm) + 1):
        shift = int(round((60 * frame_rate) / bpm))  # lag in frames for this tempo

        # auto-correlation of the ODF at this lag
        correlation = np.dot(odf_copy[:len(odf_copy) - shift], odf_copy[shift:])

        if correlation > highest_correlation:
            highest_correlation = correlation
            tempo = bpm

    return float(tempo)


def evaluate_tempo(tempi, annotations):
    """
    Evaluate detected tempi against ground truth annotations.
    
    Parameters
    ----------
    tempi : list
        List with tempo detections for all files.
    annotations : list
        List with corresponding ground truth annotations.

    Returns
    -------
    accuracy_1 : float
        Averaged accuracy 1.
    accuracy_2 : float
        Averaged accuracy 2.
    
    """
    sum_acc1 = 0
    sum_acc2 = 0
    for i in range(0, len(tempi)):
        result = madmom.evaluation.tempo.TempoEvaluation(tempi[i], annotations[i], tolerance=0.04, double=True, triple=False, sort=False)
        sum_acc1 = sum_acc1 + result.acc1
        sum_acc2 = sum_acc2 + result.acc2
        
    return sum_acc1 / len(tempi), sum_acc2 / len(tempi)

In [27]:
#### TEMPO PIPELINE ####
# depends on the ODFs computed by the MLP with diffs
# Optimal params: 70, 170, False

def tempo_pipeline(min_bpm=60, max_bpm=180, median_filter=False):
    tempi = []
    for i in range(0, len(TEMPO_MLP_ODFS)):
        tempo = detect_tempo(TEMPO_MLP_ODFS[i], min_bpm=min_bpm, max_bpm=max_bpm, frame_rate=FPS, median_filter=median_filter)
        tempi.append(tempo)

    acc_1, acc_2 = evaluate_tempo(tempi, TEMPO_ANNOTATIONS)
    print('parameters:', min_bpm, 'to', max_bpm, 'bpm with median filter', median_filter)
    print('Accuracy metric 1 (w/o double/half tempos):', acc_1, '\nAccuracy metric 2 (with double/half tempos):', acc_2)
    print('')
    return tempi, acc_1, acc_2

TEMPI, ACC_1, ACC_2 = tempo_pipeline(min_bpm=70, max_bpm=170, median_filter=False)

parameters: 70 to 170 bpm with median filter False
Accuracy metric 1 (w/o double/half tempos): 0.48598130841121495 
Accuracy metric 2 (with double/half tempos): 0.8504672897196262



In [28]:
#### TEMPO PARAMETER OPTIMIZATION ####

def tempo_parameter_optimization():
    min_bpms = [40, 50, 60, 70, 80]
    max_bpms = [140, 150, 160, 170, 180, 190, 200, 210, 220]
    median_filters = [False, True]
    
    best_tempi = []
    best_acc1 = 0
    best_acc2 = 0
    best_min_bpm = 0
    best_max_bpm = 0
    best_median_filter = False
    
    for i in range(0, len(min_bpms)):
        for j in range(0, len(max_bpms)):
            for k in range(0, len(median_filters)):
                t, a1, a2 = tempo_pipeline(min_bpms[i], max_bpms[j], median_filters[k])
                
                if a1+a2 > best_acc1 + best_acc2:
                    best_acc1 = a1
                    best_acc2 = a2
                    best_tempi = t
                    best_min_bpm = min_bpms[i]
                    best_max_bpm = max_bpms[j]
                    best_median_filter = median_filters[k]
    
    return best_tempi, best_acc1, best_acc2, best_min_bpm, best_max_bpm, best_median_filter

# UNCOMMENT LINE TO RUN PARAMETER OPTIMIZATION
# BEST_TEMPI, BEST_ACC_1, BEST_ACC_2, BEST_MIN_BPM, BEST_MAX_BPM, BEST_MEDIAN_FILTER = tempo_parameter_optimization()

Summarise your observations/findings in textual form below:

Tested parameters:

- min bpm range: 40 - 80 with step size 10
- max bpm range: 140 - 220 with step size 10
- with and without a median filter as a preprocessing step for the ODF
- the ODF itself was taken from the MLP Diff prediction

The implemented method suffers greatly from double/half tempo errors, as the gap between the two accuracy measures indicates: Accuracy 1 varies between 40% and 50%, while Accuracy 2 varies between 80% and 85%. The best results were 48.6% for Accuracy 1 and 85.0% for Accuracy 2, achieved with min bpm 70, max bpm 170 and no median filtering (numbers may vary slightly because the MLP Diff model is retrained).
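
To make the gap between the two metrics concrete, here is a minimal sketch of how a single estimate is scored, assuming the definitions given in the task (Accuracy 1: within 4% of the annotated tempo; Accuracy 2: within 4% of the annotated tempo or of its half or double). The helper `tempo_accuracies` is hypothetical and is not the madmom implementation.

```python
def tempo_accuracies(detected, annotated, tolerance=0.04):
    """Score one tempo estimate; returns (acc1, acc2) as booleans."""
    def within(target):
        # relative tolerance around the target tempo
        return abs(detected - target) <= tolerance * target

    acc1 = within(annotated)
    # accuracy 2 additionally accepts half and double tempo
    acc2 = acc1 or within(annotated / 2.0) or within(annotated * 2.0)
    return acc1, acc2


# an octave error: 140 BPM detected for a 70 BPM track
octave_error = tempo_accuracies(140.0, 70.0)
# an estimate inside the 4% tolerance
correct = tempo_accuracies(121.0, 120.0)
print(octave_error, correct)
```

An octave error counts as a miss for Accuracy 1 but as a hit for Accuracy 2, which is why a method that often picks the double or half tempo shows a large difference between the two.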

Expanding the bpm range to the maximum of 40-220 did not increase performance, even though the minimum and maximum ground truth tempos of the audio files are 41 and 208 bpm respectively. This seems counterintuitive at first, but the algorithm has fewer competing candidates when confined to a smaller interval, so a reasonable middle ground must be found when choosing the bpm interval.

An optional median filter was added as a preprocessing step for the ODF. It sets all values that do not exceed a moving median to zero; the size of the median window was chosen to match the moving average window used for peak picking. The median filter proved to be inconsequential: it improved the results slightly in some cases and worsened them in others.
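
The median filtering step can also be written without an explicit Python loop. The sketch below is one possible vectorised variant, assuming a window of 10 frames on either side of the current frame (which is what the slice bounds in `detect_tempo` amount to) and windows that are truncated at the borders; the name `median_threshold_sketch` is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def median_threshold_sketch(odf, left=10, right=10):
    """Zero every ODF value that does not exceed its local median."""
    odf = np.asarray(odf, dtype=float).ravel()
    # pad with NaN so that windows at the borders are effectively truncated
    padded = np.pad(odf, (left, right), constant_values=np.nan)
    windows = sliding_window_view(padded, left + right + 1)
    # nanmedian ignores the padding values
    local_median = np.nanmedian(windows, axis=1)
    return np.where(odf > local_median, odf, 0.0)


odf = np.array([0.1, 0.9, 0.1, 0.1, 0.8, 0.1, 0.1])
filtered = median_threshold_sketch(odf, left=1, right=1)
print(filtered)
```

Only the two peaks survive in this toy example; the constant background is removed because it never exceeds its own local median.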

---
# Beat tracking
---

In [29]:
# you are not required to use these predefined constants, but it is recommended
BEAT_ANNOTATION_FILES = search_files('data/train', '.beats')
BEAT_AUDIO_FILES, BEAT_AUDIO_IDX = find_audio_files(BEAT_ANNOTATION_FILES, AUDIO_FILES)
BEAT_AUDIO = [SPECTROGRAMS[i] for i in BEAT_AUDIO_IDX]
BEAT_ANNOTATIONS = [madmom.io.load_beats(f) for f in BEAT_ANNOTATION_FILES]

assert len(BEAT_ANNOTATION_FILES) == 177
assert len(BEAT_AUDIO_FILES) == 177
assert len(BEAT_AUDIO) == 177
assert len(BEAT_ANNOTATIONS) == 177

## Task 5: track the beats based on ODF and periodicity (25 + 5 points)

To detect the beats in the ODF, the following procedure should be applied:

Step 1: Determine the best possible offset for beat tracking given the tempo
        or periodicity determined in task 4 and select the first beat.

Step 2: Determine consecutive beats based on the tempo; allow ±10% tempo 
        deviation between consecutive beats.

Step 3: Continue until all beats are tracked.

Step 4: Evaluate beat tracking performance (e.g. with `madmom.evaluation.beats` module)
        on the dataset. Use `CMLt` and `AMLt` as evaluation metrics.

Step 5: (optional) optimise the parameters to get the best performance on the dataset.
        Parameters to be optimised: allowed deviation of the tempo, length of the audio
        used to determine the tempo.
        Replace the default arguments/values in the functions with the optimised parameters.
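
Steps 1 to 3 could be approached along the following lines. This is only a greedy sketch under stated assumptions, not the reference solution: the function name `track_beats_sketch`, the grid-based offset search and the fallback to the expected position when the search window contains no ODF energy are all choices made for this example.

```python
import numpy as np


def track_beats_sketch(odf, tempo, frame_rate=100, deviation=0.1):
    """Greedy beat tracker; returns beat times in seconds."""
    odf = np.asarray(odf, dtype=float).ravel()
    period = 60.0 * frame_rate / tempo  # beat period in frames

    def grid_score(offset):
        # ODF energy collected by a regular beat grid starting at `offset`
        idx = np.round(np.arange(offset, len(odf), period)).astype(int)
        return odf[idx[idx < len(odf)]].sum()

    # step 1: the best offset within one beat period gives the first beat
    beats = [max(range(int(round(period))), key=grid_score)]

    # steps 2 and 3: advance beat by beat until the end of the ODF
    while True:
        expected = beats[-1] + period
        lo = int(round(expected - deviation * period))
        hi = int(round(expected + deviation * period))
        if hi >= len(odf):
            break
        window = odf[lo:hi + 1]
        if window.max() > 0:
            # strongest ODF value inside the allowed tempo deviation
            beats.append(lo + int(np.argmax(window)))
        else:
            # no evidence in the window: keep the expected position
            beats.append(int(round(expected)))

    return np.array(beats) / frame_rate


# synthetic ODF: impulses every 50 frames (120 BPM at 100 fps), first at frame 20
odf = np.zeros(1000)
odf[20::50] = 1.0
beats = track_beats_sketch(odf, tempo=120.0)
print(beats[:4])
```

Because each beat is chosen relative to the previous one, small tempo drifts within the allowed deviation are followed, but an early wrong decision can propagate through the rest of the piece.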

In [30]:
def detect_beats(odf, min_bpm=60, max_bpm=180, frame_rate=FPS, **kwargs):
    """
    Detect the beats in an onset detection function (ODF).

    Parameters
    ----------
    odf : numpy array
        Onset detection function.
    min_bpm : float
        Minimum tempo, given in beats per minute (BPM).
    max_bpm : float
        Maximum tempo, given in beats per minute (BPM).
    frame_rate : float
        Frame rate of the onset detection function.
    kwargs : dict, optional
        Additional keyword arguments.

    Returns
    -------
    beats : numpy array
        Detected beats (in seconds).

    """
    # determine tempo from within this function in order to be used
    # with a single input (the ODF)
    tempo = detect_tempo(odf, min_bpm, max_bpm, frame_rate)
    # YOUR CODE HERE
    raise NotImplementedError()
    return beats


def evaluate_beats(beats, annotations):
    """
    Evaluate detected beats against ground truth annotations.
    
    Parameters
    ----------
    beats : list
        List with beats detections for all files.
    annotations : list
        List with corresponding ground truth annotations.

    Returns
    -------
    cmlt : float
        Averaged CMLt.
    amlt : float
        Averaged AMLt.
    
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    return cmlt, amlt

# YOUR CODE HERE
raise NotImplementedError()

NotImplementedError: 

Summarise your observations/findings in textual form below:

YOUR ANSWER HERE

---
## Task 6: (optional) visualise the results (10 points)

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Chocolate challenge (no points, only chocolate and glory)

Put all needed functions defined above in place to be able to detect the tempo and beats in the given audio files.

To qualify for the chocolate challenge, please check that running the function below produces
two (hopefully non-empty) detection files.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
print("Well done!")