# Making Math "Rock"

Generative models are all the rage right now, and I've always been fascinated with them.

I've been playing and listening to music for most of my life, and the more music theory I learn, the more it becomes apparent how much structure there is in music.

For this project I'll attempt to generate songs in the math rock genre using, well, math. [Math rock](https://en.wikipedia.org/wiki/Math_rock) is one of my favorite genres of music - it's a modern progressive style of rock that often features elements such as [odd time signatures](https://en.wikipedia.org/wiki/Time_signature#Complex_time_signatures), [polyrhythm](https://en.wikipedia.org/wiki/Polyrhythm), [counterpoint](https://en.wikipedia.org/wiki/Counterpoint), and unconventional song structures. [Melody 4 by Tera Melos](https://open.spotify.com/track/0JeVTUELKzEIsPtEWd1VDU?si=5f9d43510d4f4dce) is, in my opinion, a great example that captures many elements of the genre. It's not for everyone, and some say it just sounds like noise. That cuts both ways for this project: if the model's output ends up sounding like noise, it may not be far off from the genre, but the underlying structure of the songs in the training data is going to be much more difficult to learn than that of a conventional rock song.

## Overview

Something about what all is happening with this and why

### Steps

1. Convert mp3 files to MIDI files with Spotify's [Basic Pitch](https://github.com/spotify/basic-pitch) library
2. Convert MIDI files to a NumPy array
3. Train a model to predict the note(s) at the next time step
4. Generate songs by passing a seed song snippet to the model and then repeatedly predicting the note(s) at the next time step for a pre-determined amount of time
5. Convert the generated song from a NumPy array to MIDI
6. Convert the MIDI file to an audio file
7. Add special effects with Spotify's [Pedalboard](https://github.com/spotify/pedalboard) library


### Shortcomings

Unfortunately my training data for this project isn't great. This requires mp3 files, and I mostly started listening to math rock after I got a Spotify subscription and stopped purchasing mp3s.

The processing power for training the model is also limited. I'm training this on a desktop without a GPU, so 

### Other ideas

- Genre blending
- Compare to a song generated by ChatGPT

In [None]:
# TODO:
# - Separate main script out into other scripts
#   - Training the model
# - Fix audio to MIDI conversion for ones not working
# - Re-train the model with azure
#   - Train model
#       - Try w/ VAEs
#   - Save & Download model
# - Figure out num notes to use for training sequence
# - See if we can augment data with other things - key, tempo, time signature, etc.

## Converting mp3 files to MIDI

We'll do this with Spotify's [Basic Pitch](https://github.com/spotify/basic-pitch) library. This is what originally sparked the idea for this project because it made it feasible and easy to turn audio files into a numerical representation.

This library uses a model to predict the notes being played - specifically the start/stop times, the pitch, and the velocity (how hard a note is played). It's surprisingly fast and lightweight, and it only took a few minutes to convert the songs I wanted to use for my training data.

If you want to learn more, I recommend listening to the [Spotify Engineering Podcast (NerdOut)](https://open.spotify.com/show/5eXZwvvxt3K2dxha3BSaAe) episode on [Basic Pitch](https://open.spotify.com/episode/4wDDgWn037xjuq4Hr0u6a3?si=8a93b14952d546ca) and checking out the [announcement post](https://engineering.atspotify.com/2022/06/meet-basic-pitch/) and [website](https://basicpitch.spotify.com/) for it.

We'll start with some library imports and then grab the paths to the MP3 files we want to use for the training set before using basic pitch.

In [None]:
import os
import numpy as np
import tensorflow as tf
from basic_pitch import inference as basic_pitch_inference
from basic_pitch import ICASSP_2022_MODEL_PATH  # Recommended when predicting for multiple songs

# Setting up the directories
cwd = os.getcwd()
mp3_directory = os.path.join(cwd, "Data/Songs/")
midi_directory = os.path.join(cwd, "Data/MIDIs/")

# Renaming all mp3s to numbers
# Note: this assumes a fresh directory - on a re-run, a new name can collide with an existing numbered file
mp3_index = 0
for root, dirs, files in os.walk(mp3_directory):
    for file in files:
        filepath = os.path.join(root, file)  # Portable, unlike a hard-coded "\\"
        if filepath.lower().endswith(".mp3"):
            new_filepath = os.path.join(root, str(mp3_index) + ".mp3")
            mp3_index += 1
            if filepath != new_filepath:
                os.rename(filepath, new_filepath)

# Getting a list of already converted MIDI files and the remaining mp3s that need to be converted
mp3s = []
midis = []
for root, dirs, files in os.walk(midi_directory):
    for file in files:
        filepath = os.path.join(root, file)
        if filepath.lower().endswith(".mid"):
            midis.append(filepath)
for root, dirs, files in os.walk(mp3_directory):
    for file in files:
        filepath = os.path.join(root, file)
        if filepath.lower().endswith(".mp3"):
            mp3s.append(filepath)

# Removing mp3s that have already been converted
mp3s = [mp3 for mp3 in mp3s if mp3.replace(".mp3", ".mid").replace("/Songs/", "/MIDIs/") not in midis]

Next we'll use the basic pitch model to convert the mp3 files to MIDI files and save them in the MIDI directory. We'll also log any errors that occur into a separate file in case we run into issues and need to debug them.

It's important to note that the basic pitch model also outputs the notes, but we will not be using that here because we can extract them from the MIDI files and get additional information (e.g. the key and time signature) using other libraries like [music21](https://github.com/cuthbertLab/music21).

In [None]:
# Iterating through the mp3s, converting them to MIDI, and saving them to the MIDI directory
basic_pitch_model = tf.saved_model.load(str(ICASSP_2022_MODEL_PATH))
os.makedirs("Data/Outputs", exist_ok=True)  # Making sure the log directory exists
num_processed_songs = 0  # For reporting overall results
with open("Data/Outputs/processFailures.log", "w") as log_file:  # To report issues
    for i, mp3 in enumerate(mp3s):
        print(f"{i / len(mp3s):.0%}")
        mp3 = mp3.replace("\\", "/")

        # Using basic pitch model to convert mp3 to MIDI
        try:
            _, midi_data, _ = basic_pitch_inference.predict(mp3, basic_pitch_model)
            midi_path = mp3.replace("/Songs/", "/MIDIs/").replace(".mp3", ".mid")
            midi_data.write(midi_path)  # Saving the MIDI file
            midis.append(midi_path)  # Adding the MIDI file to the list of MIDI files
            num_processed_songs += 1

        # Logging errors that occur
        except Exception as e:
            print(f"Issue with {mp3}")
            log_file.write(f"{mp3}: {e}\n")

# Reporting the results
num_total_songs = len(mp3s)
if num_total_songs > 0:  # Avoiding a division by zero when there was nothing to convert
    print(
        f"Successfully processed {num_processed_songs} "
        f"songs of {num_total_songs} ({num_processed_songs / num_total_songs:.0%})"
    )

## Converting MIDI files to NumPy arrays

Here we'll take the MIDI files we created with Basic Pitch and convert them to NumPy arrays that we can use to train our model. The goal is to convert each song into an array with one column for each time step, one row for each pitch, and a boolean indicating whether a given pitch was being played at a given time step. (The code stores the transpose of this, with time steps as rows, because that's the layout the model expects.)

Basic pitch extracts the velocity, but we're starting with a simple binary representation of the note being played to keep things simple due to hardware limitations. This will result in not needing as much memory (booleans take up less space than integers) and the model will be able to train faster since it does not need to learn what the velocity of a note should be.
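
To put rough numbers on the memory point, here is a small sketch. The song length (3 minutes) and the 220 time steps per second are assumptions for illustration, not measurements from my data. One caveat worth knowing: NumPy stores each boolean in a full byte, so the savings are relative to NumPy's default 64-bit integers. Velocities stored as `uint8` would take the same space as booleans.

```python
import numpy as np

# Hypothetical song: 3 minutes at 220 time steps per second, 127 pitches
num_steps = 180 * 220
num_pitches = 127

as_bool = np.zeros((num_steps, num_pitches), dtype=bool)
as_int64 = np.zeros((num_steps, num_pitches), dtype=np.int64)  # NumPy's usual default integer
as_uint8 = np.zeros((num_steps, num_pitches), dtype=np.uint8)  # Enough for velocities 0-127

for name, arr in [("bool", as_bool), ("int64", as_int64), ("uint8", as_uint8)]:
    print(f"{name:>5}: {arr.nbytes / 1e6:.1f} MB")
```

So the bigger win from dropping velocity is probably the simpler learning problem rather than the raw storage.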

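We'll start with ensuring that the ticks per beat is the same across all of the MIDI files. This is important because the code uses the ticks per beat as the number of time steps per second, so a fixed-length slice only covers the same amount of time in every song if the value is consistent.

This representation is commonly known as a "piano roll", and it looks like this: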

|Pitch   | Timestamp 0 | Timestamp 1 | ... | Timestamp *n* |
|--------|-------------|-------------|-----|---------------|
| 1      | FALSE       | FALSE       | ... | FALSE         |
| 2      | TRUE        | TRUE        | ... | FALSE         |
| 3      | FALSE       | FALSE       | ... | TRUE          |
| ...    | ...         | ...         | ... | ...           |
| 127    | FALSE       | FALSE       | ... | FALSE         |
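
As a toy illustration of the table above, this sketch builds a tiny piano roll from a hand-written list of notes. The helper name `notes_to_piano_roll` and the three notes are made up for the example. It isn't how the MIDI files get parsed, just a way to see the layout.

```python
import numpy as np

def notes_to_piano_roll(notes, num_steps, num_pitches=127):
    """Build a (pitch, time step) boolean array from (pitch, start, end) triples.

    start is inclusive and end is exclusive, both measured in time steps.
    """
    roll = np.zeros((num_pitches, num_steps), dtype=bool)
    for pitch, start, end in notes:
        roll[pitch, start:end] = True
    return roll

# A made-up C major arpeggio with overlapping notes (C4, E4, G4)
toy_notes = [(60, 0, 4), (64, 2, 6), (67, 4, 8)]
toy_roll = notes_to_piano_roll(toy_notes, num_steps=8)

# Printing only the three rows that contain notes
print(toy_roll[[60, 64, 67]].astype(int))
```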

In [None]:
import mido

# Double checking that the ticks per beat are consistent in all of the MIDI files
# This will be used to specify the length in seconds to slice by
all_ticks_per_beat = []
for midi in midis:
    mid = mido.MidiFile(midi)
    all_ticks_per_beat.append(mid.ticks_per_beat)

assert len(set(all_ticks_per_beat)) == 1, "Different ticks per beat in MIDI files"
ticks_per_beat = all_ticks_per_beat[0]

def parse_midi_to_array(file_path: str) -> np.ndarray:
    """
    Converts a MIDI file to a boolean piano roll

    Args:
        file_path (str): The path to the MIDI file

    Returns:
        np.ndarray: Array of shape (time steps, pitches) that is True where a pitch is sounding
    """
    # Load MIDI file
    mid = mido.MidiFile(file_path)

    # Initialize the song array with False
    # The X axis is the timestamp, and the Y axis is the pitch
    ticks_per_beat = mid.ticks_per_beat  # Used as the number of time steps per second
    max_time = int(mid.length * ticks_per_beat)
    max_pitch = 127
    song_array = np.full((max_time, max_pitch), False)

    # Walk through the messages while tracking the absolute time in seconds
    # (iterating over a MidiFile yields each message's time as seconds since the previous message)
    current_time = 0.0
    note_starts = {}  # pitch -> time the currently sounding note started
    for msg in mid:
        current_time += msg.time
        if msg.type not in ("note_on", "note_off") or msg.note >= max_pitch:
            continue
        if msg.type == "note_on" and msg.velocity > 0:
            note_starts.setdefault(msg.note, current_time)
        elif msg.note in note_starts:  # note_off, or note_on with velocity 0
            start_time = int(note_starts.pop(msg.note) * ticks_per_beat)
            end_time = int(current_time * ticks_per_beat)
            song_array[start_time:end_time, msg.note] = True

    # Notes that were never turned off sound until the end of the song
    for pitch, start in note_starts.items():
        song_array[int(start * ticks_per_beat) :, pitch] = True

    return song_array


# Converting the MIDI files to numpy arrays
midi_arrays = []
for midi in midis:
    midi_arrays.append(parse_midi_to_array(midi))

assert len(midi_arrays) == len(midis), "MIDI arrays and # of MIDI files don't match"

In [None]:
# TODO: Code here

In [None]:
# Determining how long of sequences to use for training and the seeds for the song generation
song_snippet_length_s = 5  # In seconds
sequence_length = song_snippet_length_s * ticks_per_beat

# Taking the list of arrays and converting it to semi-redundant sequences based on the max length of the sequence
step_ratio_from_example = 3 / 100  # TODO: Figure out what we should be using instead
step = int(step_ratio_from_example * sequence_length)

# Getting the number of X axes to be able to create the array to fill
# Doing it this way to avoid having to create a list of all the sequences and then convert it to an array which is very memory heavy
num_sequences = sum(
    len(range(0, len(song) - sequence_length, step)) for song in midi_arrays
)

# Initializing the arrays to fill
assert (
    len(set([song.shape[1] for song in midi_arrays])) == 1
), "All songs must have the same number of pitches"
num_pitches = midi_arrays[0].shape[1]
sequences = np.zeros((num_sequences, sequence_length, num_pitches), dtype=bool)
next_notes = np.zeros((num_sequences, num_pitches), dtype=bool)

# Iterating through the songs and filling the arrays
seq_num = 0
for song in midi_arrays:
    for i in range(0, len(song) - sequence_length, step):
        sequences[seq_num] = song[i : i + sequence_length]
        next_notes[seq_num] = song[i + sequence_length]
        seq_num += 1

# Saving the arrays
output_file = os.path.join(cwd, "Data/training_data.npy")
np.save(output_file, sequences)
np.save(output_file.replace("_data.npy", "_labels.npy"), next_notes)

print("Number of sequences:", len(sequences))
print("Shape of sequences:", sequences[0].shape)
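
To get a feel for why the arrays are pre-allocated, here is a back-of-the-envelope sketch of how many training sequences a single song produces and how much memory they take. The song length and the 220 time steps per second are assumed values for illustration, and `count_windows` is a helper name I made up for this example.

```python
def count_windows(song_length: int, sequence_length: int, step: int) -> int:
    """Number of (sequence, next time step) pairs a song of the given length yields."""
    return len(range(0, song_length - sequence_length, step))

# Assumed values: 220 time steps per second, 5-second sequences, a step of 3% of the sequence
toy_sequence_length = 5 * 220                  # 1100 time steps
toy_step = 3 * toy_sequence_length // 100      # 33 time steps
toy_song_length = 180 * 220                    # A 3-minute song

toy_num_windows = count_windows(toy_song_length, toy_sequence_length, toy_step)
toy_num_bytes = toy_num_windows * toy_sequence_length * 127  # One byte per boolean

print(f"{toy_num_windows} sequences, {toy_num_bytes / 1e6:.0f} MB for one song")
```

Because consecutive sequences overlap almost entirely, a small step multiplies the size of the data many times over, which is the main knob to turn if memory runs out.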

## Training the song generation model

Ultimately, we just want a classification model that predicts the probability of each note/pitch being played at each time step. We'll expand upon this more in the song generation section for how to continue generating songs beyond just the next timestamp.
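
Since several notes can sound at the same time, one way to frame this is as a multi-label problem where each pitch gets its own independent probability. The sketch below uses made-up scores for four pitches to contrast a per-pitch sigmoid with a softmax, which forces the probabilities to sum to 1 and so can't say "all three notes of this chord are very likely".

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    """Independent probability for each element."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x: np.ndarray) -> np.ndarray:
    """Probabilities that compete with each other and sum to 1."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Made-up raw scores (logits) for four pitches; the first three form a chord
logits = np.array([4.0, 4.0, 4.0, -4.0])

per_pitch = sigmoid(logits)
one_of_n = softmax(logits)

print("sigmoid:", per_pitch.round(3))
print("softmax:", one_of_n.round(3))
```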

There are a variety of models that can be used to generate music. Here we are going to try a few models in increasing levels of complexity. I'm doing this because it helps explain different concepts and components, and it gives options in case you are looking to borrow from this project.

Lastly, we're going to use Keras to build the models. Keras is a high-level API (built on top of TensorFlow) that makes it easy to build and train neural networks, and I think the simplicity and readability of the code is preferable for a blog post over more complex libraries like PyTorch or TensorFlow.

### Simple LSTM

This is the most basic model we'll build, and it had surprisingly good results! It's just a single layer [long short-term memory (LSTM)](https://en.wikipedia.org/wiki/Long_short-term_memory) network with 56 units that is fed to a dense layer that gives the probability of each note being played at the next timestamp.

LSTMs are a type of [recurrent neural network (RNN)](https://en.wikipedia.org/wiki/Recurrent_neural_network) that are designed to handle sequential data by "remembering" data from earlier in the sequence. I recommend reading Christopher Olah's [blog post on LSTMs](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) if you want to learn more. I'm a big fan of his work, and I highly recommend the [Distill](https://distill.pub/) journal he cofounded to help explain research and concepts in a more interactive way than traditional papers. [Here is an example on memory within RNNs](https://distill.pub/2019/memorization-in-rnns/) that is germane to this project.

LSTMs are very useful for music generation because what is played at the next timestamp is most likely dependent on more than what was played at the previous timestamp. For example, if a song is in the key of C major, then there are only 7 possible notes that can be played if we ignore accidentals and octaves. If the model only pays attention to the previous note played, then it may not know if the song is in C major and will have a more difficult time distinguishing which notes could be played next. Another example is that songs often have structure in the form of [chord progressions](https://en.wikipedia.org/wiki/Chord_progression), so if a model is trying to learn a song that is using the 12-bar blues chord progression (I-I-I-I-IV-IV-I-I-V-IV-I-V), then it will likely struggle if it only pays attention to the previous chord played.
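
To make the key example concrete, this sketch lists which MIDI pitches belong to C major. The helper `pitches_in_major_key` is made up for illustration and isn't used elsewhere in the project. Knowing the key rules out a large share of the pitches, but the model can only infer the key from context spanning more than one time step.

```python
MAJOR_SCALE_STEPS = [0, 2, 4, 5, 7, 9, 11]  # Semitones above the tonic
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitches_in_major_key(tonic_pitch_class: int) -> list[int]:
    """All MIDI pitches (0-127) that belong to the major key with the given tonic."""
    allowed = {(tonic_pitch_class + step) % 12 for step in MAJOR_SCALE_STEPS}
    return [pitch for pitch in range(128) if pitch % 12 in allowed]

c_major = pitches_in_major_key(0)  # Pitch class 0 is C
c_major_names = sorted({NOTE_NAMES[pitch % 12] for pitch in c_major})

print(c_major_names)
print(f"{len(c_major)} of 128 MIDI pitches are in C major")
```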

In [None]:
# Input shape of the data for the model
sequence_length = sequences.shape[1]
num_pitches = sequences.shape[2]

# Simple LSTM
simple_model = tf.keras.Sequential()
simple_model.add(tf.keras.Input(shape=(sequence_length, num_pitches)))
simple_model.add(tf.keras.layers.LSTM(56))
# Sigmoid gives an independent probability per pitch, so several notes can be on at once
# (softmax would force the probabilities to sum to 1 across all pitches)
simple_model.add(tf.keras.layers.Dense(num_pitches, activation="sigmoid"))
simple_model.summary()

simple_model.compile(
    loss="binary_crossentropy", optimizer="adam", metrics=["binary_accuracy"]
)

# Always use early stopping :)
early_stopping_callback = tf.keras.callbacks.EarlyStopping(
    monitor="loss",
    min_delta=0,
    patience=2,
    verbose=0,
    mode="auto",
    baseline=None,
    restore_best_weights=True
)

# Training the model w/ early stopping
epochs = 10  # Increase if the hardware can handle it
batch_size = 128
history = simple_model.fit(
    sequences,
    next_notes,
    batch_size=batch_size,
    epochs=epochs,
    callbacks=[early_stopping_callback],
)

# Saving the model
model_path = os.path.join(cwd, "Data/Outputs/simple_LSTM.h5")
simple_model.save(model_path)

### Variational autoencoder (VAE)

LSTMs can generate decent music, but research has shown other methods to be more effective. One of these methods is a [variational autoencoder (VAE)](https://en.wikipedia.org/wiki/Autoencoder#Variational_autoencoder_(VAE)). This is a type of [autoencoder](https://en.wikipedia.org/wiki/Autoencoder) that is designed to learn a latent space that can be used to generate new data.

TODO: Explain what a VAE is

This will be loosely based on the [MusicVAE](https://magenta.tensorflow.org/music-vae) model architecture created by Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. I recommend reading their [paper](https://arxiv.org/abs/1803.05428) and checking out the [blog post on MusicVAE](https://magenta.tensorflow.org/music-vae). We won't directly copy the hierarchical part of their model architecture because I'm not concerned about generating longer samples, but we may play with the latent space from the encoder.

Our model will also be much smaller than MusicVAE's due to hardware limitations. MusicVAE's encoder (section 3.1 in their paper) is a two layer LSTM with 2048 units each fed into a fully connected layer with 512 units, and their decoder (section 3.2) is a bit more complex due to the hierarchical nature, but it involves multiple LSTMs with 1024 units each. My computer (that doesn't have a GPU) struggled to train a simple LSTM with 56 units, so we'll stick to a smaller model since I'll be training it on a virtual machine.
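
For a rough sense of the size gap, the sketch below counts the trainable parameters in a single standard LSTM layer. The 127-feature input is this project's piano roll width, not MusicVAE's actual input size, so the larger number is only an order-of-magnitude illustration for one unidirectional layer.

```python
def lstm_param_count(input_dim: int, units: int) -> int:
    """Trainable parameters in one standard LSTM layer.

    Each of the 4 gates has input weights, recurrent weights, and a bias.
    """
    return 4 * (input_dim * units + units * units + units)

small_lstm = lstm_param_count(input_dim=127, units=56)
large_lstm = lstm_param_count(input_dim=127, units=2048)

print(f"56 units:   {small_lstm:,} parameters")
print(f"2048 units: {large_lstm:,} parameters ({large_lstm / small_lstm:.0f}x)")
```

The recurrent weights grow with the square of the number of units, which is why the jump from 56 to 2048 units is so expensive.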

In [None]:
# TODO: Code here

## Generating songs

The song generation will be pretty straightforward. We will have to feed the model a "seed" of a sequence of notes to predict the next timestamp. We'll then iterate over this to continuously generate the next timestamp given the previous predictions. For example, if our model generates the 6th timestamp given the previous 5 timestamps, then it will generate the 7th timestamp given the timestamps 2-6, and so on. In this example, our model will no longer be using the seed after the fifth prediction. We will repeat this process enough times until we have an adequate length song.
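
Here is a minimal sketch of that loop with a stand-in for the model, so we can see exactly what each prediction gets to look at. The `generate` helper and `toy_model` are hypothetical names for this illustration. The toy model just labels its outputs `g1`, `g2`, and so on, and records every window it was given.

```python
from collections import deque

def generate(seed, predict_next, num_steps):
    """Predict one step, append it, and slide the window forward by one."""
    window = deque(seed, maxlen=len(seed))  # The oldest item drops off automatically
    generated = []
    for _ in range(num_steps):
        next_step = predict_next(list(window))
        generated.append(next_step)
        window.append(next_step)
    return generated

windows_seen = []

def toy_model(window):
    """Stand-in for a trained model: records its input and returns a label."""
    windows_seen.append(window)
    return f"g{len(windows_seen)}"

toy_seed = ["s1", "s2", "s3", "s4", "s5"]
toy_output = generate(toy_seed, toy_model, num_steps=6)

for step_number, window in enumerate(windows_seen, start=1):
    print(f"prediction {step_number} saw {window}")
```

The sixth prediction only sees generated values, which matches the point above that the seed is gone after the fifth prediction.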

Because our model outputs the probability of a note being played, we can randomly sample the predictions to get the notes played at that timestamp. This would be more difficult if we wanted to capture the velocity of a note, but we're just using a binary representation of whether a note is played or not.
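
As a small sketch of what sampling means here, the example below turns a made-up vector of probabilities into on/off notes in two ways: random sampling, and a fixed 0.5 threshold for comparison. The threshold is only there as a contrast and isn't what this project uses.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # Seeded only so the example is repeatable

# Made-up model output for five pitches at one time step
toy_probabilities = np.array([0.0, 0.1, 0.5, 0.9, 1.0])

# Option 1: sample each pitch independently (varies from run to run without a seed)
sampled_notes = rng.random(toy_probabilities.shape) < toy_probabilities

# Option 2: threshold at 0.5 (always gives the same notes for the same probabilities)
thresholded_notes = toy_probabilities >= 0.5

print("sampled:    ", sampled_notes)
print("thresholded:", thresholded_notes)
```

Sampling keeps some variety in the generated songs, since a pitch with a probability of 0.1 will still be played occasionally instead of never.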

We'll begin with gathering our seeds. This will just be the beginning of different songs that were used in the training set.

In [None]:
# TODO: Code here to get the seeds

In [None]:
def generate_song(seed: np.ndarray, model, ticks_per_beat: int, num_seconds=15) -> np.ndarray:
    """
    Generates a song based on a seed array and a trained model

    Args:
        seed (np.ndarray): The seed array to use to generate the song
        model (keras.Model): The trained model to use to generate the song
        ticks_per_beat (int): The number of ticks per beat in the MIDI files
        num_seconds (int, optional): The number of seconds to generate. Defaults to 15.

    Returns:
        np.ndarray: The generated song
    """
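    seq_length, num_pitches = seed.shape
    num_steps = num_seconds * ticks_per_beat

    # Pre-allocating the song so we don't copy the whole array on every step
    song = np.zeros((seq_length + num_steps, num_pitches), dtype=bool)
    song[:seq_length] = seed
    for i in range(num_steps):
        window = song[i : i + seq_length].reshape(1, seq_length, num_pitches)
        probabilities = model.predict(window.astype(np.float32), verbose=0)[0]
        song[seq_length + i] = np.random.binomial(n=1, p=probabilities)
    return song


# Using the beginning of the first training song as the seed
seed = midi_arrays[0][:sequence_length]
generated_song = generate_song(seed=seed, model=simple_model, ticks_per_beat=ticks_per_beat, num_seconds=20)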

## Saving the generated songs as MIDI

The generated song is currently just a numpy array, so we'll need to do some work to convert it back to a MIDI file. We'll start by converting from the piano roll array into an array with the start/end time of each note played.

So our current generated song is an array that looks like this:

|Pitch   | Timestamp 0 | Timestamp 1 | ... | Timestamp *n* |
|--------|-------------|-------------|-----|---------------|
| 1      | FALSE       | FALSE       | ... | FALSE         |
| 2      | TRUE        | TRUE        | ... | FALSE         |
| 3      | FALSE       | FALSE       | ... | TRUE          |
| ...    | ...         | ...         | ... | ...           |
| 127    | FALSE       | FALSE       | ... | FALSE         |

And we will convert it to an array that looks like this for all notes that were played:

|Pitch     | Start Time | End Time |
|----------|------------|----------|
| 60       | 0          | 0.5      |
| 62       | 2.5        | 4.5      |
| 122      | 2.5        | 4.5      |
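
The core of this conversion is finding runs of consecutive TRUE values in each row. Below is a sketch of one way to do that for a single row with `np.diff`. The `find_runs` helper is a made-up name, and this is an alternative to looping over every time step rather than the approach used in the next cell.

```python
import numpy as np

def find_runs(row: np.ndarray) -> list[tuple[int, int]]:
    """Return (start, end) index pairs for each run of True values; end is exclusive."""
    # Padding with zeros so runs touching either edge still produce a change
    padded = np.concatenate(([0], row.astype(np.int8), [0]))
    changes = np.diff(padded)
    starts = np.flatnonzero(changes == 1)   # False -> True
    ends = np.flatnonzero(changes == -1)    # True -> False
    return [(int(start), int(end)) for start, end in zip(starts, ends)]

toy_row = np.array([False, True, True, False, False, True, True, True])
toy_runs = find_runs(toy_row)
print(toy_runs)
```

Dividing the start and end indices by the number of time steps per second then gives the times in seconds.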


In [None]:
def reshape_array_for_note_start_end(
    arr: np.ndarray, ticks_per_beat: int = 220
) -> np.ndarray:
    """
    Find the start and end of consecutive True values in a 2D array.
    Thanks ChatGPT!

    Args:
        arr (np.ndarray): The array to reshape
        ticks_per_beat (int, optional): The number of ticks per beat in the MIDI files. Defaults to 220.

    Returns:
        np.ndarray: The reshaped array
    """
    if arr.shape[1] == 127:
        arr = arr.T

    sequences = []
    for row_idx in range(arr.shape[0]):
        row = arr[row_idx]
        start = None
        for col_idx in range(arr.shape[1]):
            if row[col_idx]:
                if start is None:
                    start = col_idx
            elif start is not None:
                # The note ended on the previous step, so col_idx is the exclusive end
                # (this keeps single-step notes from having a duration of zero)
                sequences.append((row_idx, start, col_idx))
                start = None
        if start is not None:
            sequences.append((row_idx, start, arr.shape[1]))
    # The reshape keeps the array 2D even if no notes were played
    output_array = np.array(sequences, dtype=float).reshape(-1, 3)

    # Converting to seconds
    output_array[:, 1] /= ticks_per_beat
    output_array[:, 2] /= ticks_per_beat

    # Ordering the array by start time
    output_array = output_array[output_array[:, 1].argsort()]
    return output_array  # (pitch, start, end)


song_snippet = generated_song  # The song we generated in the previous section
song_snippet_start_ends = reshape_array_for_note_start_end(song_snippet, ticks_per_beat)

We need to do a little more data processing before being able to create a MIDI from our generated song. This is because the time argument for adding notes to a MIDI file with Mido is the time since the previous message, not the absolute time. So if we have a song that has one event at 0.5 seconds and the next event at 1.5 seconds, then the time argument for the second event will correspond to 1 second (Mido expects it in ticks, so we'll convert it when writing the file).

We'll have to calculate the difference between the time of each note event (a start or an end) and the time of the event before it. Fortunately, this is really easy to do with pandas. We'll start by converting our array of notes into a pandas DataFrame. Next, we'll use the `melt` method to convert the DataFrame from wide to long format where the start and end of each note has its own row. We'll then sort the DataFrame by time and use the `shift` method to get the time of the previous event. Finally, we'll subtract the previous event's time from each event's time to get the time delta.

Our array of notes will now become a data frame that looks like this:

|Pitch     | Type | Time | lagged_time | time_delta|
|----------|------|------|-------------|-----------|
| 60       | start| 0    | np.NaN      | 0         |
| 60       | end  | 0.5  | 0           | 0.5       |
| 62       | start| 2.5  | 0.5         | 2.0       |
| 122      | start| 2.5  | 2.5         | 0         |
| 62       | end  | 4.5  | 2.5         | 2.0       |
| 122      | end  | 4.5  | 4.5         | 0         |

And we can now use the `time_delta` column to add the notes to the MIDI file.
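
To double check the arithmetic in the table, here is the same calculation in plain Python for the three example notes. This is just a sketch of the logic with no pandas involved, and the variable names are made up for the example.

```python
# (pitch, start, end) in seconds, matching the example table
toy_notes = [(60, 0.0, 0.5), (62, 2.5, 4.5), (122, 2.5, 4.5)]

# One event per note start and one per note end
toy_events = []
for pitch, start, end in toy_notes:
    toy_events.append((start, "start", pitch))
    toy_events.append((end, "end", pitch))

# Python's sort is stable, so events at the same time keep their original order
toy_events.sort(key=lambda event: event[0])

# The delta is the time since the previous event
previous_time = 0.0
toy_rows = []
for time, event_type, pitch in toy_events:
    toy_rows.append((pitch, event_type, time, time - previous_time))
    previous_time = time

for row in toy_rows:
    print(row)
```

A handy sanity check is that the deltas add up to the time of the last event, 4.5 seconds in this case.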

In [None]:
import pandas as pd

def get_note_time_deltas(note_start_end_array: np.ndarray) -> pd.DataFrame:
    """
    Further formats the array to get time delta between each note to prep the data for MIDI conversion

    Args:
        note_start_end_array (np.ndarray): The array to reshape from reshape_array_for_note_start_end()

    Returns:
        pd.DataFrame: The reshaped array
    """
    # Formatting further to convert to MIDI
    # This is because we need the time delta between each note and if that note was a start or end
    # Using Pandas to do the remainder of the processing due to ease of use
    midi_df = pd.DataFrame(note_start_end_array, columns=["pitch", "start", "end"])
    midi_df = midi_df.melt(id_vars="pitch").rename(
        columns={"variable": "type", "value": "time"}
    )
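    # A stable sort keeps each note's start ahead of its end when they share a timestamp
    midi_df = midi_df.sort_values(by="time", kind="stable").reset_index(drop=True)
    # MIDI delta times are relative to the previous event, so we lag the time by one row
    midi_df["lagged_time"] = midi_df["time"].shift(1)
    # The first event has no previous event, so its delta is its own start time
    midi_df["time_delta"] = (midi_df["time"] - midi_df["lagged_time"]).fillna(midi_df["time"])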
    return midi_df


song_snippet_for_output = get_note_time_deltas(song_snippet_start_ends)

In [None]:
def create_midi(
    df: pd.DataFrame, midi_path: str, ticks_per_beat: int = 220, tempo: int = 500000
) -> None:
    """
    Creates a MIDI file from a dataframe of notes from get_note_time_deltas()

    Args:
        df (pd.DataFrame): The dataframe of notes from get_note_time_deltas()
        midi_path (str): The path to save the MIDI file to
        ticks_per_beat (int, optional): The number of ticks per beat in the MIDI files. Defaults to 220.
        tempo (int, optional): The tempo of the MIDI file in microseconds per beat. Defaults to 500000 (120 BPM).

    Returns:
        None: Saves the MIDI file to the specified path
    """
    mid = mido.MidiFile(ticks_per_beat=ticks_per_beat)
    track = mido.MidiTrack()
    mid.tracks.append(track)
    track.append(mido.MetaMessage("set_tempo", tempo=tempo, time=0))
    for row in df.itertuples():
        # A note_on with a velocity of 0 is treated as a note_off
        velocity = 64 if row.type == "start" else 0
        # time_delta is in seconds, but MIDI messages need the delta in ticks
        delta_ticks = int(round(mido.second2tick(row.time_delta, ticks_per_beat, tempo)))
        track.append(
            mido.Message("note_on", note=int(row.pitch), velocity=velocity, time=delta_ticks)
        )
    mid.save(midi_path)
    print(f"MIDI file saved to {midi_path}")
    

# Saving next to the other outputs (the file name here is just a placeholder)
midi_output_path = os.path.join(cwd, "Data/Outputs/generated_song.mid")
create_midi(song_snippet_for_output, midi_output_path)
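
One detail that's easy to trip over is the unit conversion from seconds to MIDI ticks, which depends on both the ticks per beat and the tempo. Here is a small sketch of the arithmetic. `seconds_to_ticks` is a made-up helper for illustration, and Mido has its own `second2tick` function for the same conversion.

```python
def seconds_to_ticks(seconds: float, ticks_per_beat: int = 220, tempo: int = 500_000) -> int:
    """Convert a duration in seconds to MIDI ticks.

    tempo is in microseconds per beat, so 500000 is 0.5 seconds per beat (120 BPM).
    """
    seconds_per_beat = tempo / 1_000_000
    beats = seconds / seconds_per_beat
    return round(beats * ticks_per_beat)

print(seconds_to_ticks(1.0))                   # At 120 BPM one second is two beats
print(seconds_to_ticks(1.0, tempo=1_000_000))  # At 60 BPM one second is one beat
```

At the default tempo, one second works out to twice the ticks per beat, so a different tempo would need a different multiplier.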

## Converting to audio

How has nobody named software dealing with audio files "audiofile"?

## Adding effects