# Making Math "Rock"

Generative models are all the rage right now, and I've always been fascinated with them.

I've been playing and listening to music for most of my life, and the more music theory I learn, the more it becomes apparent how much structure there is in music.

For this project I'll attempt to generate songs in the math rock genre using, well, math. [Math rock](https://en.wikipedia.org/wiki/Math_rock) is one of my favorite genres of music - it's a modern progressive style of rock that often features elements such as [odd time signatures](https://en.wikipedia.org/wiki/Time_signature#Complex_time_signatures), [polyrhythm](https://en.wikipedia.org/wiki/Polyrhythm), [counterpoint](https://en.wikipedia.org/wiki/Counterpoint), unconventional song structures, and so on. [Melody 4 by Tera Melos](https://open.spotify.com/track/0JeVTUELKzEIsPtEWd1VDU?si=5f9d43510d4f4dce) is, in my opinion, a great example that characterizes many elements of the genre. It's not for everyone, and some say it just sounds like noise. That works in my favor since the output of the model is likely to sound like noise anyway, but the downside is that the underlying structure of the songs in the training data is going to be much more difficult to learn than that of a conventional rock song.

## Overview

Something about what all is happening with this and why

### Steps

1. Convert mp3 files to MIDI files with Spotify's [Basic Pitch](https://github.com/spotify/basic-pitch) library
2. Convert MIDI files to a NumPy array
3. Train a model to predict the note(s) at the next time step
4. Generate songs by passing a seed song snippet to the model and then repeatedly predicting the note(s) at the next time step for a predetermined amount of time
5. Convert the generated song from a NumPy array to MIDI
6. Convert the MIDI file to an audio file
7. Add special effects with Spotify's [Pedalboard](https://github.com/spotify/pedalboard) library
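
Step 4 is the least obvious of these, so here is a minimal sketch of what that generation loop could look like. It is not the final implementation: `generate`, `repeat_oldest_frame`, and the `threshold` parameter are hypothetical names for illustration, plain Python lists stand in for the NumPy arrays, and the "model" is a stand-in function that just echoes the oldest frame in its input window.

```python
from typing import Callable, List

Frame = List[bool]  # One time step: True for every pitch that is sounding


def generate(
    seed: List[Frame],
    predict_next: Callable[[List[Frame]], List[float]],
    num_steps: int,
    threshold: float = 0.5,
) -> List[Frame]:
    """Extend a seed snippet by repeatedly predicting the next time step."""
    song = [list(frame) for frame in seed]
    window = len(seed)  # The model always sees the most recent `window` frames
    for _ in range(num_steps):
        probabilities = predict_next(song[-window:])
        # Turn the per-pitch probabilities into on/off decisions
        song.append([p >= threshold for p in probabilities])
    return song


def repeat_oldest_frame(window: List[Frame]) -> List[float]:
    """Hypothetical stand-in for a trained model: echoes the oldest frame."""
    return [1.0 if is_on else 0.0 for is_on in window[0]]


seed = [[True, False], [False, True]]  # Two time steps, two pitches
generated = generate(seed, repeat_oldest_frame, num_steps=4)
print(len(generated))
```

Each prediction is appended to the song and becomes part of the input for the next prediction, so the model ends up building on its own output once it moves past the seed.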


### Shortcomings

Unfortunately my training data for this project isn't great. This requires mp3 files, and I mostly started listening to math rock after I got a Spotify subscription and stopped purchasing mp3s.

The processing power for training the model is also limited. I'm training this on a desktop without a GPU, so 

### Other ideas

- Genre blending
- Compare to a song generated from ChatGPT

In [None]:
# TODO:
# - Separate main script out into other scripts
#   - Training the model
# - Fix audio to MIDI conversion for ones not working
# - Re-train the model with azure
#   - Train model
#       - Try w/ VAEs
#   - Save & Download model
# - Figure out num notes to use for training sequence
# - See if we can augment data with other things - key, tempo, time signature, etc.

## Converting mp3 files to MIDI

We'll do this with Spotify's [Basic Pitch](https://github.com/spotify/basic-pitch) library. This is what originally sparked the idea for this project because it made it feasible and easy to turn audio files into a numerical representation.

This library uses a model to predict the notes being played - specifically the start/stop times, pitch, and the velocity (how hard a note is played). It's surprisingly fast and lightweight, and it only took a few minutes to convert the songs I wanted to use for my training data.

If you want to learn more, I recommend listening to the [Spotify Engineering Podcast (NerdOut)](https://open.spotify.com/show/5eXZwvvxt3K2dxha3BSaAe) episode on [Basic Pitch](https://open.spotify.com/episode/4wDDgWn037xjuq4Hr0u6a3?si=8a93b14952d546ca) and checking out the [announcement post](https://engineering.atspotify.com/2022/06/meet-basic-pitch/) and [website](https://basicpitch.spotify.com/) for it.

We'll start with some library imports and then grab the paths to the MP3 files we want to use for the training set before using basic pitch.

In [None]:
import os
import tensorflow as tf
from basic_pitch import inference as basic_pitch_inference
from basic_pitch import ICASSP_2022_MODEL_PATH  # Recommended when predicting for multiple songs

# Setting up the directories
cwd = os.getcwd()
mp3_directory = os.path.join(cwd, "Data/Songs/")
midi_directory = os.path.join(cwd, "Data/MIDIs/")

# Renaming all mp3s to numbers
mp3_index = 0
for root, dirs, files in os.walk(mp3_directory):
    for file in files:
        filepath = os.path.join(root, file)  # Works on any operating system
        if filepath.lower().endswith(".mp3"):
            new_filepath = os.path.join(root, f"{mp3_index}.mp3")
            mp3_index += 1
            os.rename(filepath, new_filepath)

# Getting a list of already converted MIDI files and the remaining mp3s that need to be converted
mp3s = []
midis = []
for root, dirs, files in os.walk(midi_directory):
    for file in files:
        filepath = os.path.join(root, file)
        if filepath.lower().endswith(".mid"):
            midis.append(filepath)
for root, dirs, files in os.walk(mp3_directory):
    for file in files:
        filepath = os.path.join(root, file)
        if filepath.lower().endswith(".mp3"):
            mp3s.append(filepath)

# Removing mp3s that have already been converted
mp3s = [mp3 for mp3 in mp3s if mp3.replace(".mp3", ".mid").replace("/Songs/", "/MIDIs/") not in midis]

Next we'll use the basic pitch model to convert the mp3 files to MIDI files and save them in the MIDI directory. We'll also log any errors that occur into a separate file in case we run into issues and need to debug them.

It's important to note that the basic pitch model also outputs the notes, but we will not be using that here because we can extract them from the MIDI files and get additional information (e.g. the key and time signature) using other libraries like [music21](https://github.com/cuthbertLab/music21).

In [None]:
# Iterating through the mp3s, converting them to MIDI, and saving them to the MIDI directory
basic_pitch_model = tf.saved_model.load(str(ICASSP_2022_MODEL_PATH))
num_processed_songs = 0  # For reporting overall results
with open("Data/Outputs/processFailures.log", "w") as log_file:  # To report issues
    for i, mp3 in enumerate(mp3s):
        print(f"{i / len(mp3s):.0%}")
        mp3 = mp3.replace("\\", "/")

        # Using basic pitch model to convert mp3 to MIDI
        try:
            _, midi_data, _ = basic_pitch_inference.predict(mp3, basic_pitch_model)
            midi_path = mp3.replace("/Songs/", "/MIDIs/").replace(".mp3", ".mid")
            midi_data.write(midi_path)  # Saving the MIDI file
            midis.append(midi_path)  # Adding the MIDI file to the list of MIDI files
            num_processed_songs += 1

        # Logging errors that occur
        except Exception as e:
            print(f"Issue with {mp3}")
            log_file.write(f"{mp3}: {e}\n")

# Reporting the results
num_total_songs = len(mp3s)
if num_total_songs > 0:  # Avoiding a division by zero if there was nothing to convert
    print(
        f"Successfully processed {num_processed_songs} songs of {num_total_songs} "
        f"({num_processed_songs / num_total_songs:.0%})"
    )

## Converting MIDI files to NumPy arrays

Here we'll take the MIDI files we created with basic pitch and convert them to NumPy arrays that we can use to train our model. The goal is to convert each song into an array with one row for each time step, one column for each pitch, and a boolean indicating whether a given pitch was being played at a given time step.

Basic pitch extracts the velocity, but we're starting with a binary representation of whether each note is being played to keep things simple given the hardware limitations. This requires less memory (booleans take up less space than integers), and the model will be able to train faster since it does not need to learn what the velocity of a note should be.
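
To make the binary representation concrete, here is a small sketch of the idea on toy data. It is only an illustration: `notes_to_piano_roll` and the toy notes are made up for this example, and nested Python lists stand in for the NumPy array so it runs without any dependencies.

```python
from typing import List, Tuple

NUM_PITCHES = 128  # MIDI pitches run from 0 to 127


def notes_to_piano_roll(
    notes: List[Tuple[int, int, int]], num_steps: int
) -> List[List[bool]]:
    """Build a boolean grid from (start_step, end_step, pitch) tuples.

    The result has one row per time step and one column per pitch.
    The end step is exclusive, and the velocity is ignored entirely.
    """
    roll = [[False] * NUM_PITCHES for _ in range(num_steps)]
    for start_step, end_step, pitch in notes:
        for step in range(max(start_step, 0), min(end_step, num_steps)):
            roll[step][pitch] = True
    return roll


# Hypothetical notes: middle C (60) for steps 0-3 and E (64) for steps 2-5
toy_notes = [(0, 4, 60), (2, 6, 64)]
toy_roll = notes_to_piano_roll(toy_notes, num_steps=8)

# Printing which pitches are sounding at each time step
for step, row in enumerate(toy_roll):
    print(step, [pitch for pitch, is_on in enumerate(row) if is_on])
```

Both notes overlap on steps 2 and 3, which shows how chords end up as multiple `True` values in the same row.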

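We'll start by ensuring that the ticks per beat is the same across all of the MIDI files. This is important because we use the ticks per beat to turn a length in seconds into a number of time steps when slicing the songs into training sequences, so every song needs to share the same resolution.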
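
One way to see why the resolution matters is to look at how MIDI ticks relate to seconds. The sketch below uses the standard relationship between ticks, the tempo (in microseconds per beat), and the ticks per beat. The function name and the example resolutions of 220 and 480 are assumptions for illustration, and 500,000 microseconds per beat (120 BPM) is the MIDI default tempo.

```python
def ticks_to_seconds(
    ticks: int, ticks_per_beat: int, tempo_us_per_beat: int = 500_000
) -> float:
    """Convert a number of MIDI ticks to seconds.

    tempo_us_per_beat is the tempo in microseconds per beat
    (500,000 is the MIDI default, which is 120 beats per minute).
    """
    return ticks * tempo_us_per_beat / (ticks_per_beat * 1_000_000)


# The same number of ticks lasts a different amount of time at each resolution
seconds_at_220 = ticks_to_seconds(440, ticks_per_beat=220)
seconds_at_480 = ticks_to_seconds(440, ticks_per_beat=480)
print(seconds_at_220, seconds_at_480)
```

If the files used different resolutions, a fixed number of time steps would cover a different amount of music in each song, which is what the assertion in the next cell guards against.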

In [None]:
import mido
import numpy as np

# Double checking that the ticks per beat are consistent in all of the MIDI files
# This will be used to specify the length in seconds to slice by
all_ticks_per_beat = []
for midi in midis:
    mid = mido.MidiFile(midi)
    all_ticks_per_beat.append(mid.ticks_per_beat)

assert len(set(all_ticks_per_beat)) == 1, "Different ticks per beat in MIDI files"
ticks_per_beat = all_ticks_per_beat[0]

def parse_midi_to_array(file_path: str) -> np.ndarray:
    """Convert a MIDI file to a boolean array of which pitches are being played.

    The returned array has one row per time step and one column per MIDI pitch,
    and each second of the song is split into ticks_per_beat time steps.
    """
    # Load MIDI file
    mid = mido.MidiFile(file_path)

    # Initialize the song array with False (nothing being played)
    # Axis 0 is the time step, and axis 1 is the pitch
    steps_per_second = mid.ticks_per_beat
    max_time = int(mid.length * steps_per_second)
    num_pitches = 128  # MIDI pitches run from 0 to 127
    song_array = np.full((max_time, num_pitches), False)

    # Iterating over a MidiFile yields every message with its delta time in seconds,
    # so the time is accumulated over all messages and not just the note events
    current_time = 0.0
    note_start_times = {}  # Maps a pitch to the time (in seconds) it started playing
    for msg in mid:
        current_time += msg.time
        if msg.type == "note_on" and msg.velocity > 0:
            note_start_times.setdefault(msg.note, current_time)
        elif msg.type in ("note_on", "note_off"):
            # A note_on with a velocity of 0 is equivalent to a note_off
            start_time = note_start_times.pop(msg.note, None)
            if start_time is not None:
                start_step = int(start_time * steps_per_second)
                end_step = int(current_time * steps_per_second)
                song_array[start_step:end_step, msg.note] = True

    # Notes that never received a note_off are held until the end of the song
    for pitch, start_time in note_start_times.items():
        song_array[int(start_time * steps_per_second) :, pitch] = True

    return song_array


# Converting the MIDI files to numpy arrays
midi_arrays = []
for midi in midis:
    midi_arrays.append(parse_midi_to_array(midi))

assert len(midi_arrays) == len(midis), "MIDI arrays and # of MIDI files don't match"


In [None]:
# Determining how long of sequences to use for training and the seeds for the song generation
song_snippet_length_s = 5  # In seconds
sequence_length = song_snippet_length_s * ticks_per_beat

# Taking the list of arrays and converting it to semi-redundant sequences based on the max length of the sequence
step_ratio_from_example = 3 / 100  # TODO: Figure out what we should be using instead
step = int(step_ratio_from_example * sequence_length)

# Getting the number of X axes to be able to create the array to fill
# Doing it this way to avoid having to create a list of all the sequences and then convert it to an array which is very memory heavy
num_sequences = 0
for song in midi_arrays:
    for i in range(0, len(song) - sequence_length, step):
        num_sequences += 1

# Initializing the arrays to fill
assert (
    len(set([song.shape[1] for song in midi_arrays])) == 1
), "All songs must have the same number of pitches"
num_pitches = midi_arrays[0].shape[1]
sequences = np.zeros((num_sequences, sequence_length, num_pitches), dtype=bool)
next_notes = np.zeros((num_sequences, num_pitches), dtype=bool)

# Iterating through the songs and filling the arrays
seq_num = 0
for song in midi_arrays:
    for i in range(0, len(song) - sequence_length, step):
        sequences[seq_num] = song[i : i + sequence_length]
        next_notes[seq_num] = song[i + sequence_length]
        seq_num += 1

# Saving the arrays
output_file = os.path.join(cwd, "Data/training_data.npy")
np.save(output_file, sequences)
np.save(output_file.replace("_data.npy", "_labels.npy"), next_notes)

print("Number of sequences:", len(sequences))
print("Shape of sequences:", sequences[0].shape)
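
The windowing in the cell above is easier to follow on a toy example. This is a minimal sketch of the same idea, not a replacement for the cell: `make_windows` is a hypothetical helper, and a list of integers stands in for a song so that each window is easy to read.

```python
from typing import List, Sequence, Tuple


def make_windows(
    song: Sequence, sequence_length: int, step: int
) -> Tuple[List[list], list]:
    """Slice a song into overlapping input windows and their next time steps."""
    inputs, targets = [], []
    for start in range(0, len(song) - sequence_length, step):
        inputs.append(list(song[start : start + sequence_length]))
        targets.append(song[start + sequence_length])  # What the model should predict
    return inputs, targets


toy_song = list(range(10))  # Stand-in for 10 time steps of a song
inputs, targets = make_windows(toy_song, sequence_length=4, step=2)
for window, target in zip(inputs, targets):
    print(window, "->", target)
```

A smaller step produces more (and more redundant) training sequences from the same songs, while a larger step uses less memory at the cost of fewer examples.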


## Training the song generation model

There are a variety of models that can be used to generate music. Here we are going to try a few models in increasing levels of complexity. I'm doing this because it helps explain different concepts and components, and it gives options in case you are looking to borrow from this project.

The most complex of these will be a recurrent variational autoencoder (VAE). TODO: Explain why VAE

This is loosely based on the [MusicVAE](https://magenta.tensorflow.org/music-vae) model architecture. I recommend reading their [paper](https://arxiv.org/abs/1803.05428). We won't directly copy the hierarchical part of their model architecture because I'm not concerned about generating longer samples, but we may play with the latent space from the encoder. 

Our model will also be much smaller than MusicVAE's due to hardware limitations. MusicVAE's encoder (section 3.1 in their paper) is a two-layer LSTM with 2048 units each that feeds into a fully connected layer with 512 units, and their decoder (section 3.2) is a bit more complex due to its hierarchical nature, but it involves multiple LSTMs with 1024 units each. My computer (which doesn't have a GPU) struggled to train a simple LSTM with 56 units, so we'll stick to a smaller model since I'll be training it on a virtual machine.

### Simple LSTM

This is the most basic model we'll build, and it had surprisingly good results! It's just a single-layer [long short-term memory (LSTM)](https://en.wikipedia.org/wiki/Long_short-term_memory) network with 56 units that is fed to a dense layer that gives the probability of each note being played at the next time step.
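
For a rough sense of scale, the number of trainable parameters can be estimated by hand. The sketch below uses the usual formulas for a standard LSTM layer (four gates, each with input weights, recurrent weights, and a bias) and a dense layer. The function names are made up for this example, and the input size of 128 pitches is an assumption based on the full MIDI pitch range.

```python
def lstm_param_count(input_size: int, units: int) -> int:
    """Trainable parameters in a standard LSTM layer.

    Each of the 4 gates has input weights, recurrent weights, and a bias.
    """
    return 4 * ((input_size + units) * units + units)


def dense_param_count(input_size: int, units: int) -> int:
    """Trainable parameters in a dense layer (weights plus biases)."""
    return input_size * units + units


num_pitches = 128  # Assumption: one input and one output per MIDI pitch
small_lstm = lstm_param_count(num_pitches, 56)
output_layer = dense_param_count(56, num_pitches)
print("56-unit LSTM:", small_lstm)
print("Dense output layer:", output_layer)
print("Total:", small_lstm + output_layer)

# A single hypothetical 2048-unit LSTM layer over the same input, for comparison
print("2048-unit LSTM:", lstm_param_count(num_pitches, 2048))
```

The parameter count grows roughly with the square of the number of units, which is why the larger layers are out of reach without better hardware.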

LSTMs are a type of [recurrent neural network (RNN)](https://en.wikipedia.org/wiki/Recurrent_neural_network) that are designed to handle sequential data by "remembering" data from earlier in the sequence. I recommend reading Christopher Olah's [blog post on LSTMs](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) if you want to learn more. I'm a big fan of his work, and I highly recommend the [Distill](https://distill.pub/) journal he cofounded to help explain research and concepts in a more interactive way than traditional papers. [Here is an example on memory within RNNs](https://distill.pub/2019/memorization-in-rnns/) that is germane to this project.

LSTMs are very useful for music generation because what is played at the next time step most likely depends on more than what was played at the previous time step. For example, if a song is in the key of C major, then there are only 7 possible notes that can be played if we ignore accidentals and octaves. If the model only pays attention to the previous note played, then it may not know if the song is in C major and will have a more difficult time distinguishing which notes could be played next. Another example is that songs often have structure in the form of [chord progressions](https://en.wikipedia.org/wiki/Chord_progression), so if a model is trying to learn a song that is using the 12-bar blues chord progression (I-I-I-I-IV-IV-I-I-V-IV-I-V), then it will likely struggle if it only pays attention to the previous chord played.
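
The chord progression argument can be checked with a few lines of code. This sketch looks at how many different chords can follow a given context in the 12-bar blues progression above, depending on how many previous chords are remembered. The helper names are hypothetical, and this is only an illustration of why context helps, not how an LSTM works internally.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

TWELVE_BAR_BLUES = ["I", "I", "I", "I", "IV", "IV", "I", "I", "V", "IV", "I", "V"]


def next_chord_options(
    progression: List[str], context_length: int
) -> Dict[Tuple[str, ...], Set[str]]:
    """Map each context of previous chords to the set of chords that follow it."""
    options = defaultdict(set)
    for i in range(context_length, len(progression)):
        context = tuple(progression[i - context_length : i])
        options[context].add(progression[i])
    return dict(options)


def is_ambiguous(progression: List[str], context_length: int) -> bool:
    """True if any context is followed by more than one possible chord."""
    options = next_chord_options(progression, context_length)
    return any(len(next_chords) > 1 for next_chords in options.values())


# With only the previous chord, a I chord can be followed by three different chords
print(next_chord_options(TWELVE_BAR_BLUES, context_length=1))

for context_length in range(1, 5):
    print(context_length, is_ambiguous(TWELVE_BAR_BLUES, context_length))
```

Within a single pass through the progression, the next chord only becomes fully determined once the previous four chords are taken into account.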

## Generating songs

## Saving the generated songs as MIDI

## Converting to audio

How has nobody named software dealing with audio files "audiofile"?

## Adding special effects