# Music Key Estimation
This notebook explores the steps required to estimate the key of a piece of music.


## Constants


In [1]:
import pathlib

DATA_FOLDER = pathlib.Path("data")
AUDIO_DATA_FOLDER = pathlib.Path("data/audio")


## Notebook Setup


In [2]:
from matplotlib import rcParams

%matplotlib inline

rcParams["figure.figsize"] = 14, 8  # In inches


## Music Key Estimation
We utilise the wonderful [KeyCNN](https://github.com/hendriks73/key-cnn) to make estimates about the key of a piece of music.

Code below is adapted from KeyCNN.


First, we define some utility functions that will be used later.


In [3]:
import numpy as np

def ensure_length(data, length):
    """
    Zero-pad the end of the data so that it has a length of `length` along the 3rd dimension (1-indexed).
    Assumes the data is no longer than `length` along that dimension.
    """

    # Create the array to store the padded data
    padded_data = np.zeros((1, data.shape[1], length, 1), dtype=data.dtype)

    # Place the given data within the padded data
    padded_data[0, :, 0:data.shape[2], 0] = data[0, :, :, 0]
    return padded_data


def add_zeros(data, num_zeros):
    """
    Add `num_zeros` zero frames along the 3rd dimension, half before and half after the data, so that the data is centered.
    """

    # Create the array to store the padded data
    padded_data = np.zeros((1, data.shape[1], data.shape[2] + num_zeros, 1), dtype=data.dtype)

    # Place the given data within the padded data
    padded_data[0, :, num_zeros // 2:data.shape[2] + (num_zeros // 2), 0] = data[0, :, :, 0]
    return padded_data


def to_sliding_window(data, window_length, hop_length):
    """
    Converts the given data array into separate sliding windows of length `window_length` separated by `hop_length`.
    """

    # Determine the number of full windows that fit in the data
    num_windows = (data.shape[2] - window_length) // hop_length + 1

    # Generate the windows, each starting `hop_length` frames after the previous one
    windowed_data = []
    for window_index in range(num_windows):
        offset = window_index * hop_length
        windowed_data.append(np.copy(data[:, :, offset:window_length + offset, :]))

    # Concatenate all the windows together and return
    return np.concatenate(windowed_data, axis=0)

def std_normalizer(data):
    """
    Normalizes data to zero mean and unit variance.
    """

    # Cast data as 64-bit float to avoid numpy warnings
    data = data.astype(np.float64)

    # Get existing mean and standard deviation of data
    mean = np.mean(data)
    std = np.std(data)

    # Normalize data
    if std != 0.:
        data = (data - mean) / std

    # Cast the normalized data to 16-bit floats and return
    return data.astype(np.float16)
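
To make the windowing behaviour concrete, here is a minimal sketch on a toy array. The `sliding_windows` helper is a hypothetical, compact stand-in for `to_sliding_window`, and `np.pad` stands in for `add_zeros`; the 150-frame input is made up for illustration.

```python
import numpy as np

def sliding_windows(data, window_length, hop_length):
    """Split a (1, bins, frames, 1) array into overlapping windows along the frame axis."""
    num_windows = (data.shape[2] - window_length) // hop_length + 1
    windows = [
        data[:, :, i * hop_length:i * hop_length + window_length, :]
        for i in range(num_windows)
    ]
    return np.concatenate(windows, axis=0)

# Toy "spectrogram": 168 frequency bins, 150 frames (values are arbitrary)
toy = np.arange(168 * 150, dtype=np.float32).reshape(1, 168, 150, 1)

# 150 frames, window of 60, hop of 30 -> windows start at frames 0, 30, 60, 90
windows = sliding_windows(toy, window_length=60, hop_length=30)
print(windows.shape)  # (4, 168, 60, 1)

# Centered zero padding: 30 zero frames before and 30 after the data
padded = np.pad(toy, ((0, 0), (0, 0), (30, 30), (0, 0)))
padded_windows = sliding_windows(padded, window_length=60, hop_length=30)
print(padded_windows.shape)  # (6, 168, 60, 1)
```

Consecutive windows overlap by half their length, so the second half of one window is the first half of the next.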


Now we create a function that generates the `features` NumPy array used for key estimation.

In [4]:
import librosa

def read_features(file, num_frames=60, hop_length=30, zero_pad=False):
    """
    Resample file to 22050 Hz, then transform using CQT with length 8192
    and hop size 4096, covering 7 octaves upward from E1 with two bins
    per semitone.

    Since we require at least 60 frames, shorter audio excerpts are always
    zero padded.

    Specifically for keygram, 30 frames each can be added at the front and
    at the back in order to make the calculation of key values for the first
    and the last window possible.

    Args:
        file: File to load audio data from.
        num_frames: Number of frames that we want.
        hop_length: Hop length for the sliding window.
        zero_pad: Whether to add 30 zero frames both at the front and back to the features array.

    Returns:
        A feature tensor for the whole file.
    """

    # Constants
    octaves = 7  # Number of octaves to generate
    bins_per_semitone = 2  # Number of frequency bins per 'note'
    bins_per_octave = 12 * bins_per_semitone  # Number of bins per octave (each octave has 12 semitones)
    window_length = 8192

    # Get the samples and sample rate of the audio file
    y, sr = librosa.load(file, sr=22050)  # Resample to 22050 Hz for easier processing (and less data)

    # Get the required data by running a CQT on the samples
    data = np.abs(librosa.cqt(
        y,
        sr=sr,
        hop_length=window_length // 2,
        fmin=librosa.note_to_hz("E1"),
        n_bins=bins_per_octave * octaves,
        bins_per_octave=bins_per_octave
    ))

    # Reshape the data
    data = np.reshape(data, (1, data.shape[0], data.shape[1], 1))

    # Add `num_frames/2` zero frames before and after the data
    if zero_pad:
        data = add_zeros(data, num_frames)

    # If we have fewer than the required number of frames, zero-pad the data so that we get at least one window
    if data.shape[2] < num_frames:
        data = ensure_length(data, num_frames)

    # Convert data to overlapping windows, where each window is one sample
    return to_sliding_window(data, num_frames, hop_length)
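
As a rough sanity check on the parameters above, the following sketch works out the frequency layout and time resolution of the features using plain arithmetic. It assumes standard A4 = 440 Hz tuning (so E1 is MIDI note 28); `bin_frequency` is a hypothetical helper, not part of librosa.

```python
# Parameters taken from read_features above
sample_rate = 22050
cqt_hop = 8192 // 2        # hop size of the CQT, in samples
bins_per_semitone = 2
octaves = 7
num_frames = 60            # frames per window

bins_per_octave = 12 * bins_per_semitone
n_bins = bins_per_octave * octaves

# E1 is MIDI note 28; A4 (MIDI note 69) is assumed to be tuned to 440 Hz
fmin = 440.0 * 2 ** ((28 - 69) / 12)

def bin_frequency(k):
    """Center frequency in Hz of CQT bin k (hypothetical helper)."""
    return fmin * 2 ** (k / bins_per_octave)

seconds_per_frame = cqt_hop / sample_rate
seconds_per_window = num_frames * seconds_per_frame

print(n_bins)                        # 168
print(round(fmin, 2))                # 41.2
print(round(seconds_per_frame, 4))   # 0.1858
print(round(seconds_per_window, 2))  # 11.15
```

So each window fed to the model has 168 frequency bins and spans roughly 11 seconds of audio, and the highest bin stays below the Nyquist frequency of 11025 Hz.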


Finally, we create functions that run the estimation and interpret its output.


In [5]:
def estimate_key_distribution(data, model):
    """
    Estimate a key distribution.
    The model returns one probability distribution per window. As interpreted by `estimate_key`,
    indices below 12 are treated as major keys and indices from 12 upward as minor keys.
    """

    # Check that the data is of the correct shape
    assert len(data.shape) == 4, "Input data must be four dimensional. Actual shape was " + str(data.shape)
    assert data.shape[1] == 168, "Second dim of data must be 168. Actual shape was " + str(data.shape)
    assert data.shape[2] == 60, "Third dim of data must be 60. Actual shape was " + str(data.shape)
    assert data.shape[3] == 1, "Fourth dim of data must be 1. Actual shape was " + str(data.shape)

    # Normalize the data
    norm_data = std_normalizer(data)

    # Use the model to estimate the key distribution, processing all windows in a single batch
    return model.predict(norm_data, batch_size=norm_data.shape[0])


def estimate_key(data, model):
    """
    Estimates the predominant global key.
    """

    # First estimate the key distribution
    est_key_dist = estimate_key_distribution(data, model)

    # Compute the averaged prediction distribution
    avg_est_key_dist = np.average(est_key_dist, axis=0)

    # Find the key with the highest probability
    highest_key_index = np.argmax(avg_est_key_dist)

    # Determine if the key is in minor or major mode
    is_minor = highest_key_index >= 12

    # Determine midi value of the key
    key_midi_val = highest_key_index + 12 if not is_minor else highest_key_index

    # Get the tonic and mode
    tonic = librosa.midi_to_note(midi=key_midi_val, octave=False)
    mode = "Minor" if is_minor else "Major"

    # Concat and return
    return f"{tonic} {mode}"
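
The index-to-key mapping used in `estimate_key` can be illustrated without the model. The sketch below assumes the class ordering implied by the code above (index 0 maps to C, indices below 12 are major, the rest minor). `NOTE_NAMES` and `index_to_key` are hypothetical names, the note names use ASCII sharps rather than librosa's formatting, and the distributions are made up rather than real model output.

```python
import numpy as np

# Pitch-class names in ASCII (librosa may format sharps differently)
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def index_to_key(index):
    """Map a class index to a key name, mirroring the logic in estimate_key."""
    mode = "Minor" if index >= 12 else "Major"
    return f"{NOTE_NAMES[index % 12]} {mode}"

# Made-up per-window distributions (3 windows, 24 classes), NOT real model output
fake_dist = np.full((3, 24), 0.01)
fake_dist[0, 0] = 0.77   # window 0 favours index 0
fake_dist[1, 21] = 0.77  # window 1 favours index 21
fake_dist[2, 0] = 0.77   # window 2 favours index 0

# Average over windows, then pick the most probable class
avg_dist = fake_dist.mean(axis=0)
best_index = int(np.argmax(avg_dist))
print(index_to_key(best_index))  # C Major
```

Averaging before taking the argmax means a single window that disagrees with the others (index 21, "A Minor", here) does not change the global estimate.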


## Model Experiments

Let's test the model on an example audio file, `Melancholy.wav`.

First, we need to load the model using TensorFlow.


In [6]:
import tensorflow as tf

myModel = tf.keras.models.load_model("Model.h5")  # See "Update KeyCNN Models.ipynb" to generate this model

Metal device set to: Apple M1




Let's get a summary of the model.

In [7]:
myModel.summary()

Model: "DEEPSPEC_K8"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_106 (InputLayer)      [(None, 168, None, 1)]    0         
                                                                 
 Conv0 (Conv2D)              (None, 168, None, 8)      48        
                                                                 
 BN0 (BatchNormalization)    (None, 168, None, 8)      32        
                                                                 
 Conv1 (Conv2D)              (None, 168, None, 8)      200       
                                                                 
 BN1 (BatchNormalization)    (None, 168, None, 8)      32        
                                                                 
 MaxPool2D1 (MaxPooling2D)   (None, 84, None, 8)       0         
                                                                 
 dropout_630 (Dropout)       (None, 84, None, 8)       

Now we load the features from the audio file.

In [8]:
audioFeatures = read_features("Melancholy.wav")

What is the shape of the features array?

In [9]:
audioFeatures.shape

(4, 168, 60, 1)
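
The shape reads as (windows, frequency bins, frames per window, channels). As a rough sketch, assuming the default `hop_length=30` and the 4096-sample CQT hop at 22050 Hz from `read_features`, the span of audio covered by these windows can be worked out as follows.

```python
# Shape reported above: (windows, frequency bins, frames per window, channels)
num_windows, num_bins, frames_per_window, channels = (4, 168, 60, 1)

hop_frames = 30      # default hop_length of read_features, in frames
cqt_hop = 4096       # CQT hop size, in samples
sample_rate = 22050

# Frames spanned from the start of the first window to the end of the last
frames_covered = (num_windows - 1) * hop_frames + frames_per_window
seconds_covered = frames_covered * cqt_hop / sample_rate

print(frames_covered)             # 150
print(round(seconds_covered, 1))  # 27.9
```

Any trailing frames that do not fill a complete window are dropped by the sliding window, so the file itself may be slightly longer than this.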

Now let's estimate the tonic and mode of the audio file.


In [10]:
estimate_key(audioFeatures, myModel)






'C Major'