# Introduction

This is a machine learning project that builds on my earlier Python synthesizer project.

A synthesizer is a machine that generates sound. It creates different sounds based on the parameters you choose for it. In this project, I teach a neural network to predict these parameters from the synthesized sound. I use TensorFlow's Keras for the machine learning.

---

# Imports

In [1]:
# For loading batches
import pickle 

# For machine learning
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.utils import plot_model

# About The Data

What makes this problem well suited to machine learning is that we can generate our own training data. We simply randomize the synthesizer's parameters and generate a sound from them; each sound-parameters pair then serves as one training example. I generated the training data in the original synthesizer notebook.
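As a rough illustration of this idea, the sketch below generates one sound-parameters pair with a toy synthesizer. The `toy_synth` and `random_example` functions, the two parameters, and their ranges are all made up for this example; the real synthesizer has many more parameters and lives in the original synthesizer notebook.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

SAMPLE_RATE = 44100 // 4   # same sample rate as the generated sounds
DURATION = 2               # seconds

def toy_synth(frequency, waveform):
    """Hypothetical stand-in for the real synthesizer: one oscillator only."""
    t = np.arange(SAMPLE_RATE * DURATION) / SAMPLE_RATE
    phase = 2 * np.pi * frequency * t
    if waveform == 'sine':
        return np.sin(phase)
    # Anything else is treated as a square wave
    return np.sign(np.sin(phase))

def random_example():
    """Randomize the parameters, then synthesize the matching sound."""
    params = {
        'frequency': rng.uniform(100.0, 1000.0),     # continuous parameter
        'waveform': rng.choice(['sine', 'square']),  # categorical parameter
    }
    sound = toy_synth(params['frequency'], params['waveform'])
    return sound, params

sound, params = random_example()
print(sound.shape, params)
```

The network's job is the reverse direction: given `sound`, predict `params`.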

---

Each generated sound is 2 seconds long, with a sample rate of 44100/4 = 11025 Hz. A lower sample rate means we need fewer samples to represent each sound, which makes the data more compact. We only lose very high frequency information, which isn't very relevant for these sounds.
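The numbers behind this can be checked with a few lines of arithmetic. The variable names are only for illustration; the cutoff follows from the general rule that a sampled signal can represent frequencies up to half its sample rate.

```python
original_rate = 44100                 # common audio sample rate, in Hz
sample_rate = original_rate // 4      # reduced rate used for the generated sounds
duration_seconds = 2

samples_per_sound = sample_rate * duration_seconds
highest_frequency_hz = sample_rate / 2   # frequencies above this cannot be represented

print(sample_rate, samples_per_sound, highest_frequency_hz)
```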

Instead of representing the sound as a regular audio file, I decided to use the Fourier transform to get another type of representation. The Fourier transform essentially describes a signal in terms of its frequencies, rather than its movement in time. However, it is also possible to have a compromise of sorts, by applying the Fourier transform in chunks, where we end up with a matrix representation for the sound. Each row represents a time step, while each column represents the strength of a particular frequency. This way we can represent how the frequency information changes over time, which is actually quite close to how humans perceive sounds.

Mirroring human perception isn't the only advantage of this. In audio applications, it is quite common to use a convolutional neural network, because convolution allows you to "scan through" the audio file looking for a specific pattern. A 2 second audio file with a sample rate of 44100/4 corresponds to an array of 22050 numbers, which is quite a long array to convolve over. By rearranging this into 20 time steps (matrix rows), each containing information for 551 frequencies (matrix columns), we only need to convolve over those 20 time steps. This makes the convolution, and therefore training the network, much faster.
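Here is a minimal sketch of how a chunked Fourier transform can turn 22050 samples into a (20, 551) matrix. The chunk size of 1100 samples, the non-overlapping chunks, and the use of magnitudes are assumptions chosen so that the shapes match; the actual preprocessing is in the original synthesizer notebook and may differ.

```python
import numpy as np

sample_rate = 44100 // 4
duration_seconds = 2

# A 440 Hz test tone standing in for a synthesized sound
t = np.arange(sample_rate * duration_seconds) / sample_rate
signal = np.sin(2 * np.pi * 440 * t)

chunk_amount = 20
chunk_size = 1100   # assumption: a real FFT of 1100 samples gives 1100/2 + 1 = 551 bins

# Drop the few leftover samples and split the signal into equal chunks
chunks = signal[:chunk_amount * chunk_size].reshape(chunk_amount, chunk_size)

# One real-valued FFT per chunk; the magnitude is the strength of each frequency
spectrogram = np.abs(np.fft.rfft(chunks, axis=1))

print(spectrogram.shape)
```

Each row of `spectrogram` is one time step and each column is one frequency, so a convolution only has to slide over 20 positions instead of 22050.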

---

The synthesizer has different kinds of parameters. Some of them are continuous, some of them are categorical. Some parameters weren't randomized, and shouldn't be learned.

The parameter values themselves are loaded as plain lists, which don't tell us what kind of parameters we are dealing with. For that information, there are two more general files that I load under the "General Synthesizer Parameter Information" heading:

- 'training_data/randomizer_parts.pkl' contains information about which parameters are continuous/categorical/constant
- 'training_data/keys.pkl' contains the names of the parameters

---

I generated the data as batches, each containing 32 examples.  

The sound data batches are in the 'training_data/fft/' folder (FFT stands for fast Fourier transform).  
Since each sound is represented as a matrix of shape (20,551), the shape of these batches is (32,20,551).  

The parameter batches are in the 'training_data/parameters/' folder.  
The parameters are also given as batches, but since there are multiple different parameters, each file is a *list* of batches.  
Since the constant parameters aren't used for training, those items of the list aren't batches, but just a string saying "constant".  

The continuous parameters are just numbers, the categorical variables have been one-hot-encoded.

The examples below clarify this:

In [2]:
# Shape of a sound batch
X_batch_example = np.load(f'training_data/fft/fft_batch_{0}.npy')
X_batch_example.shape

(32, 20, 551)

In [3]:
# Shape of a parameter batch
with open(f'training_data/parameters/parameter_batches_{0}.pkl', 'rb') as f:
    y_batch_example = pickle.load(f)

print('Length of list = amount of parameters:', len(y_batch_example), '\n')
print('Each item in the list is a batch for a specific parameter, with a length of:', len(y_batch_example[9]), '\n')
print('Except for the constant parameters, which are just a string saying:', y_batch_example[0], '\n')

print('A numerical variable batch (first 5 examples):', '\n', y_batch_example[9][:5], '\n')
print('A categorical variable batch (first 5 examples):', '\n', y_batch_example[10][:5], '\n')

Length of list = amount of parameters: 46 

Each item in the list is a batch for a specific parameter, with a length of: 32 

Except for the constant parameters, which are just a string saying: constant 

A numerical variable batch (first 5 examples): 
 [0.29331256 0.38375954 0.62783282 0.46206089 0.48295211] 

A categorical variable batch (first 5 examples): 
 [[0 0 1]
 [0 0 1]
 [0 1 0]
 [0 1 0]
 [0 1 0]] 
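For reference, one-hot rows like the ones printed above can be produced and decoded with NumPy as follows. The class indices are picked to reproduce the five rows shown; how the original notebook did the encoding is not shown here, so treat this as one possible approach.

```python
import numpy as np

# Index of the chosen option for each of the 5 examples (3 possible options)
class_indices = np.array([2, 2, 1, 1, 1])

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(3, dtype=int)[class_indices]
print(one_hot)

# Going back from one-hot rows to class indices
decoded = one_hot.argmax(axis=1)
print(decoded)
```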



# General Synthesizer Parameter Information

These two files contain information about the names and types of the parameters.  
They're turned into variables here, and used later when defining the data loader and the model.

In [4]:
# Load parameter keys file
with open('training_data/keys.pkl', 'rb') as f:
    parameter_keys = pickle.load(f)

# Load randomizer parts file
with open('training_data/randomizer_parts.pkl', 'rb') as f:
    parameter_randomizer_parts = pickle.load(f)

# Turn parameter keys file into a list of parameter names
# Each key path is joined into a single name, with its parts separated by underscores
parameter_names = ['_'.join(str(part) for part in key_path) for key_path in parameter_keys]

# Data Loader

Because the training data was generated, it was possible to create a lot of it. This makes it difficult to fit all of the data into the computer's memory during training. Luckily, Keras has functionality for loading the data during training.

The class below defines how the data is loaded during training. The `__getitem__` method is called before training on each batch and returns the batch to be used for that particular training step. Its `index` parameter is the number of the training step (the number of the batch).

The sound data batch (X) is a single numpy array. The synthesizer parameters are returned as a dictionary of batches, where each dictionary entry is a batch for a single parameter. Keras supports this because a model can have multiple output layers, and each dictionary entry is matched to an output layer by name.

In [5]:
class DataLoader(keras.utils.Sequence):
    def __init__(self, batch_ids, parameter_names, batch_size=32):
        self.batch_amount = len(batch_ids)
        self.batch_size = batch_size
        self.batch_ids = batch_ids
        self.parameter_names = parameter_names

    def __len__(self):
        return self.batch_amount
    
    # This function will be called before training a batch, it will return the batch to be trained on
    def __getitem__(self, index):
        # Convert training step index into batch number
        batch_number = self.batch_ids[index]
        
        # Load sound data batch
        X_fft = np.load(f'training_data/fft/fft_batch_{batch_number}.npy')
        X = {'in_fft': X_fft}
        
        # Load parameter data batch
        y = {}
        with open(f'training_data/parameters/parameter_batches_{batch_number}.pkl', 'rb') as f:
            y_parameters = pickle.load(f)
        
        # Convert it into a dictionary of individual parameter batches
        # Here we leave out the constant parameters, which are not used for training
        for i in range(len(y_parameters)):
            parameter_name = self.parameter_names[i]
            parameter_batch = y_parameters[i]
            if isinstance(parameter_batch, str) and parameter_batch == 'constant':
                continue
            y[f'out_{parameter_name}'] = parameter_batch

        return X, y
    
    # This function is called after every epoch
    # here it's used to shuffle the batches, so that each epoch goes through them in a different order
    def on_epoch_end(self):
        np.random.shuffle(self.batch_ids)
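The conversion step inside `__getitem__` can be tried in isolation, without any files on disk. In the sketch below, the parameter names and values are invented for illustration (the name `volume` in particular is hypothetical); only the structure mirrors the real data: a list with one entry per parameter, where constant parameters are the string "constant".

```python
import numpy as np

# Made-up stand-ins for the loaded names and one loaded parameter file
parameter_names = ['volume', 'amplitude_lfo_frequency', 'amplitude_lfo_waveform']
y_parameters = [
    'constant',                          # not randomized, so not learned
    np.array([0.29, 0.38]),              # continuous parameter, batch of 2
    np.array([[0, 0, 1], [0, 1, 0]]),    # categorical parameter, one-hot encoded
]

# Keep only the learnable parameters, keyed by the output layer name
y = {
    f'out_{name}': batch
    for name, batch in zip(parameter_names, y_parameters)
    if not (isinstance(batch, str) and batch == 'constant')
}

for key, batch in y.items():
    print(key, batch.shape)
```

The keys of `y` have to match the names of the model's output layers, which is why both use the same `out_` prefix.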

# Model

The model itself is a neural network with the following structure:  
Input layer -> Convolution layer -> Regular fully connected layer -> Output layers

The main source of complexity in the cell below is that there are so many output layers. Every synthesizer parameter to be learned has its own output layer, 38 in total. The numerical and categorical synthesizer parameters also require different types of output layers.  
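To make the difference between the two output types concrete, here is a small NumPy sketch of what each one computes: a softmax with categorical cross-entropy for a categorical parameter, and a single number with mean absolute error for a numerical parameter. The helper functions and example values are written for this illustration and only approximate what Keras does internally.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Penalizes low predicted probability for the true class
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred), axis=-1).mean()

def mean_absolute_error(y_true, y_pred):
    return np.abs(y_true - y_pred).mean()

# Categorical output: 3 options, batch of 2
y_true_cat = np.array([[0, 0, 1], [0, 1, 0]])
logits = np.array([[0.1, 0.2, 2.0], [0.5, 1.5, 0.3]])
probabilities = softmax(logits)
cce = categorical_crossentropy(y_true_cat, probabilities)

# Numerical output: one number per example, batch of 2
y_true_num = np.array([0.29, 0.38])
y_pred_num = np.array([0.35, 0.30])
mae = mean_absolute_error(y_true_num, y_pred_num)

print(probabilities.sum(axis=1), cce, mae)
```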

In [6]:
# Information for the shape of the sound data batch (for the shape of the convolution layer)
X_batch_example = np.load(f'training_data/fft/fft_batch_{0}.npy')
fft_chunk_size_half = X_batch_example.shape[2]
fft_chunk_amount = X_batch_example.shape[1]

# Input layer
fft_input = keras.Input(shape=(fft_chunk_amount,fft_chunk_size_half), name='in_fft')

# Convolution layer
x = keras.layers.Conv1D(filters=256, kernel_size=3, activation = keras.activations.relu)(fft_input)
x = keras.layers.Flatten()(x)

# Regular fully connected layer
x = keras.layers.Dense(256, activation = keras.activations.relu)(x)


# Output layers
# Since there are so many different outputs, we define these with a loop
loss = {}
metrics = {}
output_layers = []

for i in range(len(parameter_names)):
    if parameter_randomizer_parts[i]['random_type'] == 'constant':
        continue
    
    # Categorical variables are one-hot encoded numpy arrays
    elif parameter_randomizer_parts[i]['random_type'] == 'choice':
        label_amount = len(parameter_randomizer_parts[i]['random_values'])
        layer_name = f'out_{parameter_names[i]}'
        loss[layer_name] = keras.losses.CategoricalCrossentropy()
        metrics[layer_name] = keras.metrics.categorical_accuracy
        layer = keras.layers.Dense(label_amount,  activation = keras.activations.softmax, name=layer_name)(x)
        output_layers.append(layer)
        
    # Numerical variables are numpy float scalars
    elif parameter_randomizer_parts[i]['random_type'] == 'range':
        layer_name = f'out_{parameter_names[i]}'
        loss[layer_name] = keras.losses.MeanAbsoluteError()
        layer = keras.layers.Dense(1, name=layer_name)(x)
        output_layers.append(layer)

# Compile model
model = keras.Model(inputs=[fft_input], outputs=output_layers)
model.compile(loss = loss,
              optimizer = keras.optimizers.SGD(),
              metrics=metrics)

In [7]:
# Summary of the model
# Most of these are output layers
# model.summary() prints the table itself and returns None, so it isn't wrapped in print()
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 in_fft (InputLayer)            [(None, 20, 551)]    0           []                               
                                                                                                  
 conv1d (Conv1D)                (None, 18, 256)      423424      ['in_fft[0][0]']                 
                                                                                                  
 flatten (Flatten)              (None, 4608)         0           ['conv1d[0][0]']                 
                                                                                                  
 dense (Dense)                  (None, 256)          1179904     ['flatten[0][0]']                
                                                                                              

 ...
                                                                                                  
 out_amplitude_envelope_release  (None, 1)           257         ['dense[0][0]']                  
  (Dense)                                                                                         
                                                                                                  
 out_amplitude_lfo_waveform (De  (None, 4)           1028        ['dense[0][0]']                  
 nse)                                                                                             
                                                                                                  
 out_amplitude_lfo_frequency (D  (None, 1)           257         ['dense[0][0]']                  
 ense)                                                                                            
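The shapes and parameter counts in this summary can be verified by hand. The sketch below redoes the arithmetic for the layers defined earlier, assuming the Conv1D defaults of no padding and a stride of 1.

```python
time_steps, frequencies = 20, 551    # input shape of one sound
filters, kernel_size = 256, 3        # Conv1D settings
dense_units = 256                    # fully connected layer size

# Without padding, a kernel of size 3 fits into 20 time steps at 18 positions
conv_output_steps = time_steps - kernel_size + 1

# Each filter has kernel_size * frequencies weights plus one bias
conv_params = kernel_size * frequencies * filters + filters

# Flatten turns the (18, 256) convolution output into one vector
flatten_size = conv_output_steps * filters

# Dense layers have inputs * units weights plus one bias per unit
dense_params = flatten_size * dense_units + dense_units
numeric_output_params = dense_units * 1 + 1        # a Dense(1) output layer
categorical_output_params = dense_units * 4 + 4    # a Dense(4) output layer

print(conv_output_steps, conv_params, flatten_size, dense_params)
print(numeric_output_params, categorical_output_params)
```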
          

# Training

Below is an example of training with 9 batches (8 for training, 1 for validation) for 20 epochs.  
I did also train a model with 3500 batches for 30 epochs, which learned much better.  
That bigger model is what I used in the original synthesizer notebook and for the sound examples.  

Because of the many output layers, the accuracies printed during the training are a bit troublesome to read.  
In practice, there isn't too much to actually see from these accuracies.  
They don't reflect the model's capabilities as well as hearing the predictions it makes.  
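For context, the accuracy reported for each categorical output is the fraction of examples where the most probable predicted option matches the true one. A minimal NumPy version, with made-up predictions for a parameter that has 3 options:

```python
import numpy as np

# True one-hot labels and predicted probabilities for 4 examples
y_true = np.array([[0, 0, 1], [0, 0, 1], [0, 1, 0], [0, 1, 0]])
y_pred = np.array([[0.1, 0.2, 0.7],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.6, 0.2],
                   [0.3, 0.4, 0.3]])

# An example counts as correct when the largest probability is on the true option
correct = y_true.argmax(axis=1) == y_pred.argmax(axis=1)
accuracy = correct.mean()

print(accuracy)
```

In the model above this metric is only attached to the categorical outputs, so the numerical outputs are reported through their loss alone.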

One thing to note about the accuracies though, is the fact that the model stopped learning after a while.  
Adding more training data or more epochs made no further difference.  
In theory, I think the accuracy for this should be able to go all the way near perfect, if things are done right.  

I tried out different types of models and different input features (for example the regular audio file), but I wasn't able to find anything that'd work better than what's in this notebook. I suspect that it has more to do with the generated sounds themselves, because I realized that some of the parameters are impossible to predict for certain kinds of sounds. Perhaps these models could be improved simply by taking such things into account when generating the training data.  

In [8]:
# Create DataLoaders for training
train_amt = 9
valid_amt = 1
G_train = DataLoader(np.arange(train_amt - valid_amt), parameter_names)
G_valid = DataLoader(np.arange(valid_amt) + (train_amt - valid_amt), parameter_names)
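To see which batch files each loader ends up reading, the batch ids can be computed on their own. The names `train_ids` and `valid_ids` are only used in this illustration.

```python
import numpy as np

train_amt = 9   # total number of batches used
valid_amt = 1   # how many of them are held out for validation

# The first batches are used for training, the last ones for validation
train_ids = np.arange(train_amt - valid_amt)
valid_ids = np.arange(valid_amt) + (train_amt - valid_amt)

print(train_ids, valid_ids)
```

So batches 0 to 7 are used for training and batch 8 is used for validation, with no overlap between the two.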

In [9]:
# Train
history = model.fit(x=G_train, epochs = 20, validation_data=G_valid)

Epoch 1/20
...
Epoch 20/20

# Saving the Model

In [10]:
# Naming stands for the number of batches and epochs
model.save('keras_models/model_9_20.keras')