# Summary of Purpose
In this demo, two MIDI (Musical Instrument Digital Interface) files are created. One is a prediction from a recurrent autoencoder; the other is the song slice that was used as input to that autoencoder. The similarity of the two songs is then compared using Kullback-Leibler divergence. The MIDI files should also be viewed and played for a qualitative assessment.

In [30]:
## Imports

# Neural Nets
from tensorflow import keras
from tensorflow.keras.losses import KLDivergence

# Generating Midis
from music21 import chord, instrument, note, stream

# Data Wrangling
import numpy as np

# Misc
import datetime
import os
from pathlib import Path
import pickle
from IPython.display import display

# Loading Data

In [31]:
# Load the onehot train/val/holdval split labeled dataset
# Note: The file name indicates that the loaded data is `labeled` (i.e., multilabel onehot encoded chords)
cwd = os.getcwd()
with open(os.path.join(cwd, 'piano_note_encoding_dict'), 'rb') as fobj:
    encoding_dict = pickle.load(fobj)

with open(os.path.join(cwd, 'piano_holdval_array_for_time_series_LABELED_8_input_8_output'), 'rb') as fobj:
    X_holdout_validation = pickle.load(fobj)

# LOG
print('Pickled data loaded into program memory')

Pickled data loaded into program memory


In [32]:
# Inspect the shape of the loaded array and the keys of the loaded dictionary
print('X_holdout_validation:', X_holdout_validation.shape)
print('A single record of X_holdout_validation is a 3D array with shape (1, 8, 98) where each dimension corresponds to a format of (ith_slice_of_a_song, t_timesteps, n_notes). `t_timesteps` is the number of notes in a sequence and the `n_notes` is the number of labels in the dataset.')
print()
print('encoding_dict keys:', list(encoding_dict.keys()))

X_holdout_validation: (4620, 8, 98)
A single record of X_holdout_validation is a 3D array with shape (1, 8, 98) where each dimension corresponds to a format of (ith_slice_of_a_song, t_timesteps, n_notes). `t_timesteps` is the number of notes in a sequence and the `n_notes` is the number of labels in the dataset.

encoding_dict keys: ['int_to_str_chord', 'str_to_int_chord']


## Description of the Data and the Contents of Each Unpickled Object
### `encoding_dict`
Approximately 30 Final Fantasy MIDI files were processed, and the number of unique notes across *all* MIDIs was determined. Each string form of the note was mapped to an integer. For a very simple example, a single song might be composed of four notes/chords. Each chord is represented as a list of strings shown below:

First Chord:  [["C4", "D4", "E4"]]<br/>
Second Chord: [["G3"]]<br/>       
Third Chord:  [["E-4", "B-4"]]<br/>
Fourth Chord: [["D4", "F4", "G#4"]]<br/>

Footnotes about these chords: <br/>
1. The letter names the musical note and the number gives its octave (together they determine the pitch).  <br/>
2. A single letter, while technically a note and not a chord, is referred to as a chord. <br/>
3. The '-' and the '#' symbols denote a note that is 'flat' or 'sharp,' respectively. <br/>

The integer mapping for this simple set of chords therefore covers only the unique notes (in this case there are eight unique notes, because "D4" appears twice). The dictionary that maps each of these string notes to an integer would therefore be of length eight.
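
As a rough illustration of the mapping described above, the sketch below builds the two lookup tables from the four example chords. The variable `example_chords` and the sorted ordering are assumptions made for this example; the actual preprocessing that produced `encoding_dict` is not shown in this notebook and may assign integers differently.

```python
# Illustrative sketch only: `example_chords` and the sorted ordering are
# assumptions, not the preprocessing code that produced the pickled dictionary.
example_chords = [
    ["C4", "D4", "E4"],
    ["G3"],
    ["E-4", "B-4"],
    ["D4", "F4", "G#4"],
]

# Collect the unique note names across every chord, sorted for a stable order.
unique_notes = sorted(
    {note_name for chord_notes in example_chords for note_name in chord_notes}
)

# Forward and reverse lookups, mirroring the two keys of `encoding_dict`.
str_to_int_chord = {note_name: ix for ix, note_name in enumerate(unique_notes)}
int_to_str_chord = {ix: note_name for note_name, ix in str_to_int_chord.items()}

print(len(str_to_int_chord))  # 8 unique notes; the repeated "D4" is counted once
```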

### `X_holdout_validation`
This consists of sliding windows (slices) of length eight taken from each song from the NES Final Fantasy game. A single input record represents eight notes/chords in sequence, and the corresponding output record is the next sequence of eight notes. Each note or chord is onehot encoded for multiple labels, because a chord is composed of several notes and each note has a particular integer associated with it. Therefore, if the integers 0, 1, and 2 map to "C4", "E4", and "G4", the corresponding onehot vector for this chord is <br/><br/>
[[1 1 1 ....... 0]]. <br/>

A one hot vector has a number of elements equal to the number of unique notes found across the whole dataset, and therefore can represent any chord in the entire dataset.
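
The encoding and windowing described above can be sketched as follows. This is a minimal example under stated assumptions: the five-note `str_to_int_chord` mapping, the helper names `multi_hot` and `sliding_windows`, and the stride of one are all hypothetical, and a window length of three is used instead of eight to keep the output small.

```python
import numpy as np

# Hypothetical mapping for illustration; the real one lives in `encoding_dict`.
str_to_int_chord = {"C4": 0, "E4": 1, "G4": 2, "A4": 3, "B4": 4}


def multi_hot(chord_notes, mapping):
    """Encode a list of note names as a 0/1 vector with one slot per known note."""
    vector = np.zeros(len(mapping), dtype=np.float32)
    for note_name in chord_notes:
        vector[mapping[note_name]] = 1.0
    return vector


def sliding_windows(encoded_song, window_length):
    """Stack every contiguous run of `window_length` chords (stride 1 is an assumption)."""
    n_windows = encoded_song.shape[0] - window_length + 1
    return np.stack(
        [encoded_song[i:i + window_length] for i in range(n_windows)]
    )


# A C major chord sets the slots for "C4", "E4", and "G4".
c_major = multi_hot(["C4", "E4", "G4"], str_to_int_chord)
print(c_major)  # [1. 1. 1. 0. 0.]

# A toy five-step song of single notes, sliced into windows of three steps.
song = np.stack(
    [multi_hot([n], str_to_int_chord) for n in ["C4", "E4", "G4", "A4", "B4"]]
)
windows = sliding_windows(song, window_length=3)
print(windows.shape)  # (3, 3, 5): (slices, timesteps, notes)
```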

# Functions and Summary of Functions
`make_predictions` takes the recurrent autoencoder model and an input array for that model, then produces a generated song in which the number of notes/chords is determined by the model architecture (in this case output=8). <br/>
`onehot_label_nn_output` will take an array of probabilities (which is the output of the neural network) and 'hot' a particular label if the label's value exceeds the minimum positive classification threshold. <br/>
`make_music21_stream` converts an array of onehot vectors into a MIDI writable format.
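
The thresholding step that `onehot_label_nn_output` performs can also be written as a single vectorized NumPy comparison. The sketch below is an alternative for illustration, not the function used in this notebook; the name `threshold_to_multi_hot` and the sample probabilities are hypothetical. Unlike an element-by-element loop that writes into its argument, it returns a new array and leaves the input untouched.

```python
import numpy as np


def threshold_to_multi_hot(probabilities, threshold):
    """Return a new 0/1 array marking entries strictly above `threshold`."""
    return (probabilities > threshold).astype(probabilities.dtype)


# Hypothetical network output for a dataset with four labels.
probabilities = np.array([0.05, 0.25, 0.15, 0.90], dtype=np.float32)
result = threshold_to_multi_hot(probabilities, 0.2)
print(result)  # [0. 1. 0. 1.]
```

If every entry falls at or below the threshold, the result is all zeros, which is the case `make_predictions` handles by substituting a chord from the input sequence.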

In [33]:
def make_predictions(model, X_test, positive_classification_threshold):
    """Builds two onehot matrices (generated song and original song) AND probability array for generated song.

    :param model: keras model
    :param X_test: <class 'numpy.ndarray'> from which a random chord
        sequence will be selected.
    :param positive_classification_threshold: <class 'float'> that
        determines the minimum probability that an output
        neuron must have in order to be considered a positive classification
        for a particular category.
    :return: <class 'tuple'> of <class 'numpy.ndarray'>
    """
    # LOG
    print('Generating music')

    # Occurrence of failed threshold
    cnt_failed_threshold = 0

    # Take a random starting point for validation
    random_ix_of_sequence_elem_in_x_test = np.random.randint(
        0, X_test.shape[0])

    ## Variables to be modified in song generation loop

    # The one hot generated song
    generated_song = np.empty(shape=(1, 0, X_test.shape[2]))

    # The generated song as an array of probabilities
    generated_song_probability_array = np.empty(shape=(1, 0, X_test.shape[2]))

    # The input to the model
    input_tensor = X_test[random_ix_of_sequence_elem_in_x_test].reshape(
        1, X_test.shape[1], X_test.shape[2])

    # A copy of the original input
    original_song = input_tensor.copy()

    # (?, output_timestep, categories)
    predicted_chords_tensor = model.predict(
        input_tensor, verbose=0)

    # A sample is (timesteps, categories) dimensional
    for sample in predicted_chords_tensor:

        # A chord is (labels, ) dimensional
        for ix, chord_ in enumerate(sample):

            # Append to the probability array
            generated_song_probability_array = np.append(
                generated_song_probability_array, 
                chord_.reshape(1, 1, predicted_chords_tensor.shape[2]),
                axis=1
            )

            # Convert the array of probabilities to a one hot vector
            # representing that chord
            chord_ = onehot_label_nn_output(
                chord_, positive_classification_threshold)

            # Append the chord to the generated song, but if no chord
            # is generated, reuse the input chord at the same timestep
            chord_ = chord_.reshape(1, 1, predicted_chords_tensor.shape[2])
            if (np.amax(chord_) == 0):

                # If no label exceeds the
                # positive_classification_threshold, the one-hot
                # vector will be all 0s. In that case the output
                # is filled with the chord at the same timestep of
                # the input sequence. Since input_tensor has dims
                # (1, timestep, categories), input_tensor[0][ix]
                # is that timestep's chord as a vector of shape
                # (categories,)

                # Map input directly to output
                filler_chord_from_input_tensor = input_tensor[0][ix].reshape(
                    1, 1, input_tensor.shape[2])
                generated_song = np.append(
                    generated_song, filler_chord_from_input_tensor, axis=1)

                # Increment the number of failed classifications
                cnt_failed_threshold += 1
            else:
                generated_song = np.append(
                    generated_song, chord_, axis=1)

    # Return the onehot multilabeled generated song
    # and the original song used to generate it
    print('Music generated.')
    print('Number of failed predictions for the generated song:', cnt_failed_threshold)
    return (generated_song, original_song, generated_song_probability_array)

In [34]:
def onehot_label_nn_output(chord, positive_classification_threshold):
    """Takes a chord vector output from nn and converts to onehot vector.
    
    Sets each element of a chord (which is a probability vector) to 1
    if it exceeds the probability threshold and to 0 otherwise.
    The input array is modified in place.

    :param chord: <class 'numpy.ndarray'> of probabilities
        for multilabel classification
    :param positive_classification_threshold: <class 'float'> minimum
        threshold that an element of the probability vector must
        exceed in order to decide positive classification of a label (note)
        or not.
    :return: <class 'numpy.ndarray'> one hot vector
    """
    # Iterate through labels in chord
    for ix, label in enumerate(chord):
        if (label > positive_classification_threshold):
            chord[ix] = 1
        else:
            chord[ix] = 0

    # Return the labeled chord
    return chord

In [35]:
def make_music21_stream(onehot_matrix, int_to_str_chord, instrument_part=None):
    """Converts one-hot matrices to writable songs.

    :param onehot_matrix: <class 'numpy.ndarray'> of shape
        (? =~ 1, timestep, classes) to be converted to string chords.
    :param int_to_str_chord: <class 'dict'> that maps integers to
        individual notes (still referred to as chords).
    :param instrument_part: <class 'music21.instrument.Instrument'>
        to be used for the stream generation.
    :return: <class 'music21.stream.Part'> containing the notes and chords
    """
    # Default instrument
    if (not instrument_part):
        instrument_part = instrument.KeyboardInstrument()

    # The music stream
    music21_stream = stream.Part()
    music21_stream.append(instrument_part)

    # Iterate through songs (should just be 1 song)
    for song in onehot_matrix:

        # A song will have some number of chords determined a priori
        for chord_ in song:

            # A string'ified musical element representing
            # a note or a chord
            musical_element = []
            for ix, label in enumerate(chord_):

                # Get the predicted musical element as a list of strings
                if (label == 1):
                    musical_element.append(int_to_str_chord[ix])

            # If the length of the musical element is 1 then the musical element
            # must be a NOTE otherwise it's a collection of NOTES aka a CHORD
            if (len(musical_element) == 1):
                music21_stream.append(note.Note(musical_element[0]))
            else:
                music21_stream.append(chord.Chord(musical_element))

    # Return the musical score
    return music21_stream

# Testing the Recurrent Autoencoder Model

In [36]:
# Load recurrent autoencoder model
model = keras.models.load_model(os.path.join(cwd, '20210502_08-25-53_max_kld_model.h5'))

In [37]:
# Make predicted onehot labeled songs
positive_classification_threshold = 0.2
(onehot_generated_song, onehot_original_song, generated_song_proba_arr) = make_predictions(
    model,
    X_holdout_validation,
    positive_classification_threshold=positive_classification_threshold,
)

Generating music
Music generated.
Number of failed predictions for the generated song: 2


In [38]:
## Convert the output to music21 objects

# The song generated by the model
generated_song = make_music21_stream(
    onehot_generated_song,
    int_to_str_chord=encoding_dict['int_to_str_chord']
)

# The slice of the song from the Final Fantasy NES Game (aka the `original` or `template` song)
original_song = make_music21_stream(
    onehot_original_song,
    int_to_str_chord=encoding_dict['int_to_str_chord']
)

# LOG
print('Original and generated song converted to music21 objects.')

Original and generated song converted to music21 objects.


In [39]:
## Write the songs to disk as MIDI files

# Create a destination directory
os.makedirs(os.path.join(cwd, 'midis'), exist_ok=True)

# Write the songs
now = datetime.datetime.now().strftime('%Y%m%d_%H-%M-%S')  # Current date and time
generated_song.write('midi', os.path.join(cwd, 'midis', f'{now}_generated_song.mid'))
original_song.write('midi', os.path.join(cwd, 'midis', f'{now}_original_song.mid'))

# LOG
print('Midis successfully written.')

Midis successfully written.


In [40]:
## Compare similarity

# Kullback-Leibler Divergence compares the `statistical similarity` between distributions
kl = KLDivergence()
kl_original = kl(
    onehot_original_song.astype('float32').reshape(X_holdout_validation.shape[1], X_holdout_validation.shape[2]), 
    onehot_original_song.astype('float32').reshape(X_holdout_validation.shape[1], X_holdout_validation.shape[2])
)

kl_generated = kl(
    onehot_original_song.astype('float32').reshape(X_holdout_validation.shape[1], X_holdout_validation.shape[2]), 
    generated_song_proba_arr.astype('float32').reshape(X_holdout_validation.shape[1], X_holdout_validation.shape[2])
)

# LOG
print('When comparing the same song, the KL Divergence:', kl_original.numpy())
print('When comparing the generated song and the original song, the KL Divergence:', kl_generated.numpy())

When comparing the same song, the KL Divergence: 0.0
When comparing the generated song and the original song, the KL Divergence: 13.34582


# Interpreting the Results
For music, the best evaluation is ultimately to listen to what was produced. However, KL divergence is a metric commonly used for multilabel classification tasks because it measures how different two probability distributions are. The output of the neural network is an array of probability vectors, and the onehot vectors of the original song can also be interpreted as arrays of probabilities (e.g., [[0 0 0 1]] is a onehot vector in which the probability of the 4th label is simply 100%), so KL divergence can be used to compare the two. If the KL divergence is 0, the probability distributions of the generated music and the original music are exactly equal. Otherwise, as the KL divergence increases, so too does the quantitative dissimilarity between the two pieces.
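
To make the comparison above concrete, here is a minimal NumPy sketch of the quantity being reported. It assumes that the Keras `KLDivergence` loss clips both inputs to `[epsilon, 1]`, sums `y_true * log(y_true / y_pred)` over the last axis, and averages over the remaining entries; the function name `kl_divergence`, the `epsilon` default, and the toy arrays are illustrative choices, not values taken from the model.

```python
import numpy as np


def kl_divergence(y_true, y_pred, epsilon=1e-7):
    """Mean over timesteps of sum(y_true * log(y_true / y_pred)) over labels.

    Assumption: this mirrors the clipping and reduction used by Keras.
    """
    # Clipping avoids log(0) and division by zero for labels that are "off".
    y_true = np.clip(y_true, epsilon, 1.0)
    y_pred = np.clip(y_pred, epsilon, 1.0)
    per_timestep = np.sum(y_true * np.log(y_true / y_pred), axis=-1)
    return float(np.mean(per_timestep))


# Two timesteps, three labels: a onehot "original" and predicted probabilities.
original = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
predicted = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.8]])

same_song = kl_divergence(original, original)
different_song = kl_divergence(original, predicted)
print(same_song)       # 0.0
print(different_song)  # a positive value
```

Only the labels that are "on" in the original contribute meaningfully to the sum, so the score grows as the network assigns lower probability to the notes that were actually played. A onehot vector with several labels set does not sum to 1, so the value is best read as a relative dissimilarity score rather than a strict divergence between normalized distributions.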

# References
## Frequently Used Tutorials
* Keras Blog
    * https://blog.keras.io/building-autoencoders-in-keras.html
* LSTM Auto-Encoder
    * https://machinelearningmastery.com/lstm-autoencoders/
* TF tutorial
    * https://www.datacamp.com/community/tutorials/using-tensorflow-to-compose-music
    
## Academic Papers and Other Materials
[[1]] M. Newman, "Video Game Music Archive: Nintendo Music," VGMusic, 1996. [Online]. Available: https://www.vgmusic.com/music/console/nintendo/nes/. [Accessed 08 March 2021]. <br/>
[[2]] S. AlSaigal, S. Aljanhi and N. Hewahi, "Generation of Music Pieces Using Machine Learning: Long Short-Term Memory Neural Networks Approach," Arab Journal of Basic and Applied Sciences, vol. 26, no. 1, pp. 397-413, 2019. <br/>
[[3]] N. Mauthes, VGM-RNN: Recurrent Neural Networks for Video Game Music Generation, Master's Projects, 2018, p. 595. <br/>
[[4]] A. Geron, "Chapter 17: Representation Learning and Generative Learning Using Autoencoders and GANS," in Hands-on Machine Learning with Scikit-Learn, Keras and Tensorflow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd ed., Sebastopol, O'Reilly Media, Inc, 2019, pp. 567-574. <br/>
[[5]] J. Briot, G. Hadjeres and F. Pachet, "Deep Learning Techniques for Music Generation - A Survey," arXiv:1709.01620, 2017. <br/>
[[6]] A. Geron, "Chapter 15: Processing Sequences Using RNNs and CNNs," in Hands-on Machine Learning with Scikit-Learn, Keras and Tensorflow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd ed., Sebastopol, O'Reilly Media, Inc., 2019, pp. 497-499. <br/>
[[7]] S. Hochreiter, "The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 6, no. 2, pp. 107-116, 1998. <br/>
[[8]] J. Svegliato and S. Witty, "Deep Jammer: A Music Generation Model," University of Massachusetts, Amherst, 2016. <br/>
D. Kang, J. Kim and S. Ringdahl, "Project milestone: Generating music with Machine Learning," Stanford University, Stanford, 2018. <br/>
[[9]] A. Ycart and E. Benetos, "A Study on LSTM Networks for Polyphonic Music Sequence Modelling," in 18th International Society for Music Information Retrieval Conference (ISMIR), Suzhou, 2017. <br/> 
[[10]] A. Huang and R. Wu, "Deep Learning for Music," Stanford University, Stanford. <br/>
[[11]] L. Yang and A. Lerch, "On the Evaluation of Generative Models in Music," Neural Computing and Applications, vol. 32, no. 9, p. 12, 2018. <br/>
[[12]] A. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior and K. Kavukcuoglu, "Wavenet: A Generative Model for Raw Audio," arXiv:1609.03499, 2016. <br/>
[[13]] J. Ba, J. Kiros and G. Hinton, "Layer Normalization," arXiv:1607.06450, 2016. <br/>