# Improving Our VAE
*Kiya Aminfar, Sean Steinle*

This notebook is our attempt to improve upon `./simple_vae.ipynb`, where we trained a simple but ineffective variational autoencoder to generate music. In this notebook, we try three techniques to improve our autoencoder: more epochs, more data, and more layers.

## Table of Contents
1. [More Epochs](#More-Epochs)
2. [More Data](#More-Data)
3. [More Layers](#More-Layers)

In [1]:
import librosa, os #audio processing and file system parsing
import librosa.display
import numpy as np #math library
import tensorflow as tf #for model building
from tensorflow.keras import layers, Model
import matplotlib.pyplot as plt # for visualization
import pandas as pd #for data analysis / prep
import IPython.display as ipd #for sound output

2025-03-14 14:22:49.813725: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-03-14 14:22:49.840199: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-03-14 14:22:50.261332: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1741976570.410230   26290 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1741976570.45

## More Epochs

## More Data

It looks like the [Free Music Archive](https://github.com/mdeff/fma) is the easiest way to get access to more music! They have a paper detailing all of their work, freely available metadata, and audio-only zips. Let's start with the smallest audio-only zip (`fma_small.zip`), which contains about 8,000 tracks!
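
The extracted archive holds more than audio: the load log later in this notebook shows `README.txt` and `checksums` being rejected, and the tracks may sit in nested subfolders. Here is a minimal sketch, assuming the zip has already been extracted locally, that gathers only the `.mp3` paths. `collect_audio_paths` is a hypothetical helper of ours, not part of the FMA tooling.

```python
import os

def collect_audio_paths(root_dir, extension=".mp3"):
    """Return a sorted list of files under root_dir whose name ends in extension."""
    paths = []
    # os.walk visits root_dir and every subfolder, so flat and nested layouts both work
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if name.lower().endswith(extension):
                paths.append(os.path.join(dirpath, name))
    return sorted(paths)
```

Filtering by extension up front keeps non-audio files out of the "couldn't load" log, so the remaining failures point at genuinely broken tracks.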

In [2]:
# Parameters
SAMPLE_RATE = 22050  # Standard sample rate for music processing
N_MELS = 128         # Number of Mel filterbanks
HOP_LENGTH = 512     # Hop length for STFT
N_FFT = 2048         # FFT window size
DURATION = 5         # Duration of each audio clip in seconds
BATCH_SIZE = 32

def load_audio_to_mel(file_path):
    y, sr = librosa.load(file_path, sr=SAMPLE_RATE, duration=DURATION)
    mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=N_FFT, hop_length=HOP_LENGTH, n_mels=N_MELS)
    mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)  # Convert to log scale (dB)
    mel_spec_norm = (mel_spec_db - mel_spec_db.min()) / (mel_spec_db.max() - mel_spec_db.min())
    return mel_spec_norm

def mel_to_audio(mel_spec, sr=22050, n_fft=2048, hop_length=512, n_mels=128, power=1.0):
    """
    Convert a Mel spectrogram back to audio using the Griffin-Lim algorithm.
    Args:
        mel_spec: Mel spectrogram (shape: [n_mels, time_steps])
        sr: Sample rate for the audio
        n_fft: FFT size for Griffin-Lim
        hop_length: Hop length for Griffin-Lim
        n_mels: Number of Mel bins in the spectrogram (unused; librosa infers it from mel_spec)
        power: Exponent applied to mel_spec before inversion
    Returns:
        Audio signal as a numpy array
    """
    # Invert the Mel spectrogram directly to a waveform (librosa applies Griffin-Lim internally)
    audio = librosa.feature.inverse.mel_to_audio(mel_spec ** power, sr=sr, n_fft=n_fft, hop_length=hop_length)
    return audio

def load_df_from_kaggle(music_dir: str):
    """This function loads a directory of songs and creates a simple dataframe with (genre,song,numpy) rows, based on the Kaggle dataset. We expect music_dir to be a 2-level directory with genre directories on the first level and .wav songs on the second level."""
    music_dicts,bad_paths = [],[]
    genres = os.listdir(music_dir)
    for genre in genres:
        try:
            for song in os.listdir(music_dir+genre):
                song_path = music_dir+genre+'/'+song
                try:
                    music_dicts.append({'genre': genre, 'song': song, 'numpy_representation': load_audio_to_mel(song_path)})
                except Exception as e:
                    print(f"couldn't load: {song_path}, got: {e}")
                    bad_paths.append([song_path,e])
        except Exception as e:
            print(f"couldn't process the {genre} directory, got: {e}")
    return pd.DataFrame(music_dicts)

def load_df_from_fma(music_dir: str, max_songs: int=8000):
    """This function loads a directory of songs and creates a simple dataframe with (song,numpy) rows, based on the FMA dataset. We expect music_dir to be a single-level directory with just songs."""
    music_dicts,bad_paths = [],[]
    for i,song in enumerate(os.listdir(music_dir)):
        if i >= max_songs: # stop once max_songs directory entries have been attempted
            break
        song_path = music_dir+song
        try:
            audio = load_audio_to_mel(song_path)
            if np.isnan(audio).any(): # constant (e.g. silent) clips normalize to 0/0
                raise Exception("got nan in audio")
            music_dicts.append({'song': song, 'numpy_representation': audio})
        except Exception as e:
            print(f"couldn't load: {song_path}, got: {e}")
            bad_paths.append([song_path,e])
    return pd.DataFrame(music_dicts)

def create_tf_dataset(music_df):
    """Converts music_df's numpy column to a TensorFlow dataset."""
    assert "numpy_representation" in music_df.columns
    numpy_representations = np.array(music_df["numpy_representation"].tolist(), dtype=np.float32)

    # Add a channel dimension: each example becomes (128, 216, 1) = (Mel bins, Time steps, Channels)
    numpy_representations = np.expand_dims(numpy_representations, -1)

    songs_dataset = tf.data.Dataset.from_tensor_slices(numpy_representations)
    songs_dataset = songs_dataset.shuffle(len(numpy_representations)).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
    return songs_dataset
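
`create_tf_dataset` stacks every spectrogram into one array, which only works when all clips share a shape. With the parameters above, a full 5-second clip gives 1 + (5 × 22050) // 512 = 216 frames, and a shorter clip gives fewer. Below is a minimal sketch, assuming that zero-padding on the right is acceptable for short clips. `pad_or_crop_frames` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import numpy as np

TARGET_FRAMES = 216  # 1 + (5 s * 22050 Hz) // 512 frames for a full-length clip

def pad_or_crop_frames(mel_spec, target_frames=TARGET_FRAMES):
    """Zero-pad or crop the time axis (axis 1) to exactly target_frames."""
    n_frames = mel_spec.shape[1]
    if n_frames >= target_frames:
        return mel_spec[:, :target_frames]
    pad_width = target_frames - n_frames
    # Pad only at the end of the time axis; the Mel axis is left untouched
    return np.pad(mel_spec, ((0, 0), (0, pad_width)), mode="constant")
```

It could be applied to the return value of `load_audio_to_mel` before the row is appended to the dataframe.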

In [4]:
music_df = load_df_from_kaggle("../data/kaggle/genres_original/")
songs_dataset = create_tf_dataset(music_df)

  y, sr = librosa.load(file_path, sr=SAMPLE_RATE, duration=DURATION)
	Deprecated as of librosa version 0.10.0.
	It will be removed in librosa version 1.0.
  y, sr_native = __audioread_load(path, offset, duration, dtype)


couldn't load: ../data/kaggle/genres_original/jazz/jazz.00054.wav, got: 


2025-03-14 14:24:53.046006: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:152] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)


## More Layers

In [5]:
# Parameters
LATENT_DIM = 64  # Latent space dimension
INPUT_SHAPE = (128, 216)  # (Mel bins, Time steps); Conv1D slides along the Mel axis and treats time steps as channels

# Reparameterization Trick
def sampling(args):
    """Reparameterization trick: z = mu + exp(log_var / 2) * epsilon"""
    mu, log_var = args
    epsilon = tf.keras.backend.random_normal(shape=tf.shape(mu))
    return mu + tf.exp(log_var * 0.5) * epsilon

# Encoder
def build_encoder(input_shape, latent_dim):
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv1D(32, kernel_size=3, strides=2, padding="same", activation="relu")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Conv1D(64, kernel_size=3, strides=2, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv1D(128, kernel_size=3, strides=2, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Flatten()(x)
    
    mu = layers.Dense(latent_dim, name="latent_mu")(x)
    log_var = layers.Dense(latent_dim, name="latent_log_var")(x)
    z = layers.Lambda(sampling, name="latent_sample")([mu, log_var])
    
    return Model(inputs, [mu, log_var, z], name="Encoder")

# Decoder
def build_decoder(latent_dim, output_shape):
    decoder_inputs = layers.Input(shape=(latent_dim,))
    
    x = layers.Dense(output_shape[0] // 8 * 128, activation="relu")(decoder_inputs)
    x = layers.Reshape((output_shape[0] // 8, 128))(x)  # Adjust size
    
    x = layers.Conv1DTranspose(128, kernel_size=3, strides=2, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv1DTranspose(64, kernel_size=3, strides=2, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv1DTranspose(32, kernel_size=3, strides=2, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    
    outputs = layers.Conv1DTranspose(output_shape[1], kernel_size=3, padding="same", activation="sigmoid")(x)
    return Model(decoder_inputs, outputs, name="Decoder")

# VAE Model
class VAE(Model):
    def __init__(self, encoder, decoder, **kwargs):
        super(VAE, self).__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder

    def compute_loss(self, inputs):
        """Computes the VAE loss function."""
        mu, log_var, z = self.encoder(inputs)
        reconstructed = self.decoder(z)

        # Reconstruction loss
        recon_loss = tf.reduce_mean(tf.keras.losses.mse(inputs, reconstructed))

        # KL Divergence loss
        kl_loss = -0.5 * tf.reduce_mean(1 + log_var - tf.square(mu) - tf.exp(log_var))

        return recon_loss + kl_loss

    def train_step(self, data):
        """Custom training step."""
        with tf.GradientTape() as tape:
            loss = self.compute_loss(data)
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        return {"loss": loss}

def generate_samples(model, num_samples=10):
    """Convenience function for generating num_samples from model."""
    random_latent_vectors = tf.random.normal(shape=(num_samples, LATENT_DIM)) # Sample random latent vectors from the prior (e.g., normal distribution)
    generated_samples = model.decoder(random_latent_vectors) # Use the decoder to generate samples from the random latent vectors
    generated_samples = [s.numpy().squeeze() for s in generated_samples] #convert to numpy, reshape
    return generated_samples
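
The encoder and decoder only line up because the first axis of `INPUT_SHAPE` (length 128) is divisible by 8. The sketch below traces that axis through the three stride-2 layers on each side, assuming the usual Keras behaviour for `padding="same"`: a strided `Conv1D` produces `ceil(n / stride)` steps and a `Conv1DTranspose` produces `n * stride` steps. The helper names are ours.

```python
import math

def conv_same_length(n, stride):
    """Output length of a strided Conv1D with padding='same'."""
    return math.ceil(n / stride)

def conv_transpose_same_length(n, stride):
    """Output length of a strided Conv1DTranspose with padding='same'."""
    return n * stride

# Encoder: three stride-2 convolutions starting from 128 steps
length = 128
encoder_lengths = [length]
for _ in range(3):
    length = conv_same_length(length, 2)
    encoder_lengths.append(length)

# Decoder: starts from output_shape[0] // 8 steps, then three stride-2 transposed convolutions
length = 128 // 8
decoder_lengths = [length]
for _ in range(3):
    length = conv_transpose_same_length(length, 2)
    decoder_lengths.append(length)

print(encoder_lengths)  # [128, 64, 32, 16]
print(decoder_lengths)  # [16, 32, 64, 128]
```

If the first axis were not a multiple of 8, the decoder output would no longer match the input length and the reconstruction loss would fail on a shape mismatch.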

In [7]:
# Build the encoder and decoder
encoder = build_encoder(INPUT_SHAPE, LATENT_DIM)
decoder = build_decoder(LATENT_DIM, INPUT_SHAPE)

# Create the VAE instance
vae = VAE(encoder, decoder)
vae.compile(optimizer=tf.keras.optimizers.Adam())

# Train the model
n_epochs = 20
vae.fit(songs_dataset, epochs=n_epochs)

Epoch 1/20
32/32 ━━━━━━━━━━━━━━━━━━━━ 7s 43ms/step - loss: 0.0381
Epoch 2/20
32/32 ━━━━━━━━━━━━━━━━━━━━ 2s 50ms/step - loss: 0.0293
Epoch 3/20
32/32 ━━━━━━━━━━━━━━━━━━━━ 1s 36ms/step - loss: 0.0248
Epoch 4/20
32/32 ━━━━━━━━━━━━━━━━━━━━ 1s 35ms/step - loss: 0.0229
Epoch 5/20
32/32 ━━━━━━━━━━━━━━━━━━━━ 1s 35ms/step - loss: 0.0223
Epoch 6/20
32/32 ━━━━━━━━━━━━━━━━━━━━ 1s 34ms/step - loss: 0.0219
Epoch 7/20
32/32 ━━━━━━━━━━━━━━━━━━━━ 1s 33ms/step - loss: 0.0219
Epoch 8/20
32/32 ━━━━━━━━━━━━━━━━━━━━ 1s 40ms/step - loss: 0.0223
Epoch 9/20
32/32 ━━━━━━━━━━━━━━━━━━━━ 1s 37ms/step - loss: 0.0222
Epoch 10/20
32/32 ━━━━━━━━━━━━━━━━━━━━ 1s 37ms/step - loss: 0.0214

<keras.src.callbacks.history.History at 0x7f37468ddcc0>
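
The loss reported above is the MSE plus a KL term that `compute_loss` averages over both the batch and the latent dimensions with `tf.reduce_mean`. The common textbook form instead sums the KL over latent dimensions before averaging over the batch. The NumPy sketch below compares the two on random illustrative values (not values from the trained model). The two differ by a factor of `LATENT_DIM`.

```python
import numpy as np

LATENT_DIM = 64
rng = np.random.default_rng(0)
mu = rng.normal(size=(4, LATENT_DIM))                  # illustrative encoder means
log_var = rng.normal(scale=0.1, size=(4, LATENT_DIM))  # illustrative log-variances

# Closed-form KL(N(mu, sigma^2) || N(0, 1)) for every latent element
kl_elements = -0.5 * (1 + log_var - mu**2 - np.exp(log_var))

kl_mean = kl_elements.mean()                # averaging used in compute_loss
kl_summed = kl_elements.sum(axis=1).mean()  # sum over latent dims, mean over batch

print(np.isclose(kl_summed, LATENT_DIM * kl_mean))
```

So, relative to the summed form, the notebook's loss weights the KL term by 1/64. That scaling may be worth keeping in mind when comparing loss values or tuning the balance between reconstruction and regularization.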

In [8]:
generated_samples = generate_samples(vae, num_samples=5)
mel_audio = mel_to_audio(generated_samples[0])
ipd.Audio(mel_audio, rate=22050) #generated sample
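
One possible contributor to the poor audio: the decoder ends in a sigmoid, so generated spectrograms live in [0, 1] like the min-max-normalized dB values from `load_audio_to_mel`, while `librosa.feature.inverse.mel_to_audio` interprets its input as a power spectrogram by default (`power=2.0`). The per-clip minimum and maximum were not stored, so the normalization cannot be undone exactly. The sketch below assumes an 80 dB range (the default `top_db` of `librosa.power_to_db`) with the peak at 0 dB. `denormalize_mel` is a hypothetical helper.

```python
import numpy as np

ASSUMED_RANGE_DB = 80.0  # assumption: matches the default top_db of librosa.power_to_db

def denormalize_mel(mel_norm, range_db=ASSUMED_RANGE_DB):
    """Map a [0, 1] normalized dB spectrogram back to a relative power spectrogram."""
    # Undo the min-max scaling under the assumed range: 0 -> -range_db dB, 1 -> 0 dB
    mel_db = np.clip(mel_norm, 0.0, 1.0) * range_db - range_db
    # Convert dB to power relative to the peak (peak power = 1.0)
    return np.power(10.0, mel_db / 10.0)
```

With this in place, `mel_to_audio(denormalize_mel(generated_samples[0]))` would hand Griffin-Lim a power spectrogram rather than normalized dB values. Whether this audibly helps here is untested.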

In [9]:
# Hm, same poor results. What if we train on classical music only, and for longer?

In [11]:
music_df = load_df_from_kaggle("../data/kaggle/genres_original/")
classical_df = music_df[music_df['genre'] == 'classical']
songs_dataset = create_tf_dataset(classical_df)

  y, sr = librosa.load(file_path, sr=SAMPLE_RATE, duration=DURATION)


couldn't load: ../data/kaggle/genres_original/jazz/jazz.00054.wav, got: 


In [12]:
# Build the encoder and decoder
encoder = build_encoder(INPUT_SHAPE, LATENT_DIM)
decoder = build_decoder(LATENT_DIM, INPUT_SHAPE)

# Create the VAE instance
vae = VAE(encoder, decoder)
vae.compile(optimizer=tf.keras.optimizers.Adam())

# Train the model
n_epochs = 100
vae.fit(songs_dataset, epochs=n_epochs)

Epoch 1/100
4/4 ━━━━━━━━━━━━━━━━━━━━ 5s 57ms/step - loss: 0.0509
Epoch 2/100
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 44ms/step - loss: 0.0450
Epoch 3/100
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 42ms/step - loss: 0.0429
Epoch 4/100
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 43ms/step - loss: 0.0428
Epoch 5/100
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 42ms/step - loss: 0.0407
Epoch 6/100
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 43ms/step - loss: 0.0374
Epoch 7/100
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 39ms/step - loss: 0.0386
Epoch 8/100
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 41ms/step - loss: 0.0359
Epoch 9/100
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 40ms/step - loss: 0.0343
Epoch 10/100
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 37ms/step - loss: 0.0316
Epoch 11/

<keras.src.callbacks.history.History at 0x7f3745d52dd0>

In [13]:
generated_samples = generate_samples(vae, num_samples=5)
mel_audio = mel_to_audio(generated_samples[0])
ipd.Audio(mel_audio, rate=22050) #generated sample

## The FMA

In [14]:
mp3_track = "../data/fma/small/000002.mp3"
audio, sr = librosa.load(mp3_track, sr=None)  # sr=None keeps the original sample rate
print("Original Ground-Truth Audio:")
ipd.display(ipd.Audio(audio, rate=sr))

Original Ground-Truth Audio:


In [15]:
load_audio_to_mel(mp3_track)

array([[0.16467619, 0.36605176, 0.4088923 , ..., 0.48357907, 0.49062985,
        0.51684415],
       [0.1480258 , 0.3213346 , 0.34637317, ..., 0.4673593 , 0.47747165,
        0.4846487 ],
       [0.11635093, 0.23060103, 0.29741874, ..., 0.4150136 , 0.4975031 ,
        0.5137979 ],
       ...,
       [0.        , 0.11782055, 0.27500135, ..., 0.30580443, 0.20114584,
        0.16650248],
       [0.        , 0.05122089, 0.2493271 , ..., 0.23760028, 0.12784128,
        0.10992394],
       [0.        , 0.        , 0.11801024, ..., 0.01942482, 0.        ,
        0.04164057]], dtype=float32)

In [16]:
fma_df = load_df_from_fma("../data/fma/small/")
songs_dataset = create_tf_dataset(fma_df)
fma_df

[src/libmpg123/layer3.c:INT123_do_layer3():1774] error: part2_3_length (3360) too large for available bit count (3240)
  mel_spec_norm = (mel_spec_db - mel_spec_db.min()) / (mel_spec_db.max() - mel_spec_db.min())


couldn't load: ../data/fma/small/145730.mp3, got: got nan in audio


Note: Illegal Audio-MPEG-Header 0x00000000 at offset 22401.
Note: Trying to resync...
Note: Skipped 1024 bytes in input.
[src/libmpg123/parse.c:wetwork():1349] error: Giving up resync after 1024 bytes - your stream is not nice... (maybe increasing resync limit could help).
  y, sr = librosa.load(file_path, sr=SAMPLE_RATE, duration=DURATION)


couldn't load: ../data/fma/small/098567.mp3, got: 
couldn't load: ../data/fma/small/107535.mp3, got: got nan in audio


[src/libmpg123/layer3.c:INT123_do_layer3():1774] error: part2_3_length (3328) too large for available bit count (3240)


couldn't load: ../data/fma/small/README.txt, got: 


Note: Illegal Audio-MPEG-Header 0x00000000 at offset 33361.
Note: Trying to resync...
Note: Skipped 1024 bytes in input.
[src/libmpg123/parse.c:wetwork():1349] error: Giving up resync after 1024 bytes - your stream is not nice... (maybe increasing resync limit could help).


couldn't load: ../data/fma/small/098565.mp3, got: 




couldn't load: ../data/fma/small/099134.mp3, got: 


[src/libmpg123/layer3.c:INT123_do_layer3():1804] error: dequantization failed!


couldn't load: ../data/fma/small/133297.mp3, got: 
couldn't load: ../data/fma/small/checksums, got: 


[src/libmpg123/layer3.c:INT123_do_layer3():1804] error: dequantization failed!
[src/libmpg123/layer3.c:INT123_do_layer3():1844] error: dequantization failed!


couldn't load: ../data/fma/small/057179.mp3, got: got nan in audio




couldn't load: ../data/fma/small/108925.mp3, got: 


[src/libmpg123/layer3.c:INT123_do_layer3():1804] error: dequantization failed!
Note: Illegal Audio-MPEG-Header 0x00000000 at offset 63168.
Note: Trying to resync...
Note: Skipped 1024 bytes in input.
[src/libmpg123/parse.c:wetwork():1349] error: Giving up resync after 1024 bytes - your stream is not nice... (maybe increasing resync limit could help).


couldn't load: ../data/fma/small/098569.mp3, got: 


| | song | numpy_representation |
|---|---|---|
| 0 | 131325.mp3 | [[0.4775898, 0.56400305, 0.50602037, 0.3654788... |
| 1 | 130991.mp3 | [[0.30612016, 0.4510931, 0.44410768, 0.3094793... |
| 2 | 036986.mp3 | [[0.0, 0.50472647, 0.78372747, 0.8312534, 0.80... |
| 3 | 055235.mp3 | [[0.78318435, 0.7341951, 0.7232668, 0.80268776... |
| 4 | 133781.mp3 | [[0.4461677, 0.58156985, 0.5606811, 0.38789338... |
| ... | ... | ... |
| 7985 | 108059.mp3 | [[0.521342, 0.5069345, 0.5159296, 0.48801193, ... |
| 7986 | 137719.mp3 | [[0.0, 0.7154924, 0.90667117, 0.91515195, 0.89... |
| 7987 | 025124.mp3 | [[0.5448439, 0.7134255, 0.76546204, 0.7591831,... |
| 7988 | 114402.mp3 | [[0.42155224, 0.56374013, 0.57610005, 0.561157... |

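The "got nan in audio" failures in the log above are consistent with the min-max normalization in `load_audio_to_mel` dividing by zero: when a clip is constant (for example, silent), the spectrogram's maximum equals its minimum and the result is 0/0. Here is a minimal sketch of a guarded normalization, assuming an all-zero spectrogram is an acceptable stand-in for such clips. `safe_min_max_normalize` is a hypothetical helper.

```python
import numpy as np

def safe_min_max_normalize(mel_spec_db, eps=1e-8):
    """Min-max normalize to [0, 1]; return all zeros when the input is (nearly) constant."""
    lo, hi = mel_spec_db.min(), mel_spec_db.max()
    span = hi - lo
    if span < eps:
        # Constant input: avoid 0/0 and return a well-defined array instead of NaNs
        return np.zeros_like(mel_spec_db)
    return (mel_spec_db - lo) / span
```

Whether silent clips should be kept as zeros or dropped, as this notebook currently does, is a judgment call. The guard mainly makes the failure mode explicit.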

In [19]:
# Build the encoder and decoder
encoder = build_encoder(INPUT_SHAPE, LATENT_DIM)
decoder = build_decoder(LATENT_DIM, INPUT_SHAPE)

# Create the VAE instance
vae = VAE(encoder, decoder)
vae.compile(optimizer=tf.keras.optimizers.Adam())

# Train the model
n_epochs = 100
vae.fit(songs_dataset, epochs=n_epochs)

Epoch 1/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 14s 39ms/step - loss: 0.0292
Epoch 2/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 10s 40ms/step - loss: 0.0252
Epoch 3/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 9s 37ms/step - loss: 0.0252
Epoch 4/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 11s 43ms/step - loss: 0.0250
Epoch 5/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 10s 42ms/step - loss: 0.0249
Epoch 6/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 9s 35ms/step - loss: 0.0249
Epoch 7/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 9s 35ms/step - loss: 0.0249
Epoch 8/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 8s 32ms/step - loss: 0.0248
Epoch 9/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 8s 33ms/step - loss: 0.0248
Epoch 10/100
250/250 ━━━━━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x7f3742ceb850>

In [18]:
generated_samples = generate_samples(vae, num_samples=5)
mel_audio = mel_to_audio(generated_samples[0])
ipd.Audio(mel_audio, rate=22050) #generated sample

In [None]:
# Unfortunately, we're still not getting the music quality we want!