### Regis University

**MSDS688_X70: Artificial Intelligence**  
Master of Science in Data Science Program

#### Week 7: Music Generation with GANs  
*GPU Required*

## Lecture: Week 7 - Audio Generation with GANs

### Overview

This week, we explore how **Generative Adversarial Networks (GANs)** can be adapted to generate audio. Generating audio presents unique challenges compared to image generation, due to the time-series nature of audio data and the complex patterns of frequency and amplitude it contains. In this lecture, we will cover the core principles of GANs, how they can be modified for audio generation, and the technical aspects of building and training these models.

---

### 1. **GANs for Audio Generation: A Brief Introduction**

At its core, the architecture of a **Generative Adversarial Network (GAN)** remains consistent whether it is used for images, audio, or other types of data. A GAN consists of two neural networks: the **generator** and the **discriminator**, which play an adversarial game. However, generating audio poses additional challenges because audio data involves both temporal and frequency components.

#### Key Components of a GAN:
- **Generator (G)**: This network takes a latent vector (random noise) and generates audio waveforms from it. The goal is to produce audio samples that are indistinguishable from real audio.
- **Discriminator (D)**: The discriminator classifies input audio as either real (from the dataset) or fake (generated by the generator). Its output provides the training signal for the generator: gradients flow back through the discriminator and show the generator how to make its samples look more realistic.

In GANs for audio generation, the input and output data are **time-series**. This introduces unique requirements for how the data is modeled and processed.

---

### 2. **Challenges in Audio Generation**

Generating high-quality audio is more challenging than generating images because:
1. **Temporal Structure**: Audio is inherently sequential, meaning there are dependencies between adjacent time points. The model needs to understand both short-term patterns (e.g., individual sound waves) and long-term structures (e.g., rhythms, melodies, phrases).
2. **Frequency Representation**: Audio can be represented in both the **time domain** (as raw waveforms) and the **frequency domain** (as spectrograms). Deciding how to represent the audio is critical to the architecture of the GAN.
3. **Dimensionality**: Audio data tends to have high dimensionality, especially when sampled at high rates. At 44.1 kHz, a single second of mono audio contains 44,100 samples.
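
To make the dimensionality point concrete, the short sketch below counts the raw sample values in a clip. The helper name `num_samples` and the three-minute duration are illustrative assumptions, not part of the assignment code.

```python
def num_samples(duration_s, sample_rate=44100, channels=1):
    """Number of raw sample values in a clip of the given length."""
    return int(round(duration_s * sample_rate)) * channels

one_second = num_samples(1.0)        # one second of mono audio at 44.1 kHz
three_minutes = num_samples(180.0)   # a three-minute mono track
print(one_second, three_minutes)     # 44100 7938000
```

A fully connected layer over even one second of raw audio would need tens of thousands of inputs, which is one reason audio models rely on convolutions or compact representations.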

---

### 3. **Time-Domain vs. Frequency-Domain Representation**

When building GANs for audio generation, it is important to decide whether to represent the audio data in the **time domain** (raw waveforms) or the **frequency domain** (spectrograms).

#### Time-Domain (Raw Waveforms):
- **Pros**: Captures the raw temporal structure of the sound directly. The generator creates waveforms that are ready to be played back as audio.
- **Cons**: Requires very high precision, especially for high-quality audio, and often needs more complex architectures to generate realistic outputs.
- **Applications**: Speech synthesis, simple sounds.

#### Frequency-Domain (Spectrograms):
- **Pros**: Spectrograms represent the frequency content of the audio over time, which allows the model to focus on spectral features that are important for human perception, such as harmonics and timbre.
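- **Cons**: Requires an additional step to convert the spectrogram back into audio. A magnitude spectrogram discards phase, so the inverse short-time Fourier transform needs a phase estimate (e.g., from the Griffin-Lim algorithm) or a separate vocoder.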
- **Applications**: Music generation, environmental sounds, complex audio.

Both methods have trade-offs, and the choice depends on the specific task and dataset.
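
The two representations can be compared on a toy signal. The following is a minimal NumPy sketch, assuming a Hann window, a 256-sample frame, and a low 8 kHz sample rate chosen only to keep the arrays small; `magnitude_spectrogram` is a hypothetical helper, not a library function.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=256, hop=128):
    """Short-time Fourier transform magnitudes, shape (frames, frequency bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps the non-negative frequencies; abs() drops the phase
    return np.abs(np.fft.rfft(frames, axis=1))

sample_rate = 8000                              # assumed rate, kept low for a small example
t = np.arange(sample_rate) / sample_rate        # one second of time stamps
waveform = np.sin(2 * np.pi * 440.0 * t)        # time-domain view: 8000 samples
spec = magnitude_spectrogram(waveform)          # frequency-domain view: frames x bins
peak_hz = spec.mean(axis=0).argmax() * sample_rate / 256
print(waveform.shape, spec.shape, peak_hz)
```

The spectrogram concentrates the 440 Hz tone into one frequency bin per frame, but because only magnitudes are kept, the original waveform cannot be recovered from it without estimating the phase.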

---

### 4. **Architecture of Audio GANs**

The architecture of a GAN for audio generation closely follows the principles of standard GANs but with some modifications to handle the unique structure of audio data.

#### 4.1 Generator Architecture

The **generator** is responsible for transforming a random latent vector into a structured audio sequence. The generator typically consists of a series of **upsampling layers** (e.g., transposed convolutions) that progressively increase the temporal resolution of the audio data.

##### Key Components:
1. **Latent Vector (Input)**: The generator starts with a random noise vector, often drawn from a standard normal distribution. This latent vector serves as the seed for generating the audio sequence.
   - Some audio GANs also concatenate conditioning information, such as a pitch or tempo label, to the latent vector so that these attributes can be controlled at generation time.
   
2. **Transposed Convolutional Layers**: These layers progressively upsample the latent vector into a waveform or spectrogram. The upsampling increases the temporal resolution, allowing the model to capture finer details in the audio.
   - For waveform generation, transposed convolutions generate audio samples directly.
   - For spectrogram generation, transposed convolutions produce spectrogram features that must later be converted to waveforms.

3. **Activation Functions**: Commonly used activation functions include:
   - **ReLU (Rectified Linear Unit)**: Introduces non-linearity, which is important for generating complex patterns in audio.
   - **tanh**: Often used in the final layer so that output values lie in [-1, 1], matching audio data normalized to that range.

4. **Output**: The generator produces an audio waveform or spectrogram, which can then be post-processed and converted into a playable format (such as WAV files).
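
How a transposed convolution increases temporal resolution can be seen in a naive single-channel version. This is a sketch of the mechanism only, assuming no padding and a fixed kernel; framework layers such as Keras `Conv1DTranspose` add learnable multi-channel kernels, bias terms, and padding options.

```python
import numpy as np

def conv1d_transpose(x, kernel, stride):
    """Naive single-channel transposed convolution (no padding, no bias)."""
    out = np.zeros((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        # Each input sample adds a scaled copy of the kernel at offset i * stride
        out[i * stride : i * stride + len(kernel)] += value * kernel
    return out

x = np.array([1.0, 2.0])
kernel = np.array([1.0, 1.0, 1.0])
y = conv1d_transpose(x, kernel, stride=2)
print(y)   # values 1, 1, 3, 2, 2: the two kernel copies overlap at index 2

# Stacking layers grows the sequence quickly: 16 -> 66 -> 266 samples
h = np.ones(16)
for _ in range(2):
    h = conv1d_transpose(h, np.ones(6), stride=4)
print(len(h))
```

With stride `s` and kernel length `k`, an input of length `n` becomes `(n - 1) * s + k` samples, so a few stacked layers are enough to go from a short latent sequence to thousands of output samples.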

#### 4.2 Discriminator Architecture

The **discriminator** is a binary classifier that tries to distinguish between real audio (from the dataset) and generated audio (from the generator). The architecture of the discriminator is similar to that of convolutional neural networks (CNNs) used for image classification, but adapted for the time-series nature of audio.

##### Key Components:
1. **Convolutional Layers**: The discriminator uses **strided convolutions** to downsample the input audio (or spectrogram), extracting features that allow it to distinguish between real and fake data.
   - In the time domain, the convolutions operate over the raw waveform.
   - In the frequency domain, the convolutions operate over the spectrogram.

2. **Leaky ReLU Activation**: In each layer of the discriminator, **Leaky ReLU** is used instead of the standard ReLU so that negative inputs still pass a small gradient. This avoids "dead" units that always output zero and keeps useful gradients flowing back to the generator.

3. **Sigmoid Output Layer**: The final output of the discriminator is a single scalar value representing the estimated probability that the input is real (values near 1 mean "real", values near 0 mean "fake"). This is achieved using a **sigmoid** activation function.
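
The difference between the two activations, and the role of the sigmoid output, can be checked numerically. This is a small NumPy sketch; the negative slope `alpha=0.2` is an assumed, commonly used value rather than something fixed by the lecture.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def leaky_relu(x, alpha=0.2):
    # alpha is the slope for negative inputs (assumed value)
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

pre_activations = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(pre_activations))        # negative inputs are zeroed, so their gradient is 0
print(leaky_relu(pre_activations))  # negative inputs keep a small slope (gradient alpha)
print(sigmoid(0.0))                 # 0.5: the discriminator is undecided
```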

---

### 5. **Loss Functions for Audio GANs**

The objective of GAN training is to optimize both the generator and the discriminator using a loss function that reflects how well the generator can fool the discriminator.

#### Generator Loss:
The generator’s goal is to produce audio that is indistinguishable from real audio. The generator loss is calculated based on the discriminator’s ability to classify the generated audio as fake. The generator tries to **minimize** this loss, effectively trying to "fool" the discriminator.

#### Discriminator Loss:
The discriminator’s goal is to correctly classify real and fake audio. Its loss is the binary cross-entropy between its predictions and the true labels (1 for real, 0 for generated). The discriminator tries to **minimize** this loss, which is equivalent to maximizing its ability to distinguish real from fake audio.

The adversarial nature of these loss functions helps both models improve over time.
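
In the common binary cross-entropy formulation (the one the assignment code in this notebook uses), each network minimizes its own cross-entropy term, and the adversarial tension comes from the two networks wanting opposite labels for generated audio. The sketch below computes both losses with NumPy; the discriminator outputs `0.9` and `0.1` are hypothetical values chosen for illustration.

```python
import numpy as np

def bce(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))))

d_real = np.array([0.9])   # D's output on a real clip (hypothetical value)
d_fake = np.array([0.1])   # D's output on a generated clip (hypothetical value)

# Discriminator: wants real -> 1 and fake -> 0
d_loss = bce(np.ones(1), d_real) + bce(np.zeros(1), d_fake)
# Generator: wants the discriminator to output 1 on generated clips
g_loss = bce(np.ones(1), d_fake)
print(round(d_loss, 4), round(g_loss, 4))
```

Here the discriminator is doing well, so its loss is small while the generator's loss is large; as the generator improves, `d_fake` rises and the two losses move in opposite directions.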

---

### 6. **Optimization and Training in Audio GANs**

Training audio GANs involves a delicate balance between optimizing the generator and discriminator. The models are trained in an alternating fashion:
1. **Training the Discriminator**: The discriminator is trained on batches of real audio and generated audio. Its weights are updated to better classify the audio as real or fake.
2. **Training the Generator**: The generator is trained to improve its ability to produce realistic audio that can fool the discriminator. This involves updating the generator's weights to minimize the loss based on the discriminator's feedback.

#### Key Hyperparameters:
- **Learning Rate**: Both the generator and discriminator are typically trained using the **Adam optimizer** with carefully tuned learning rates to ensure stable convergence.
- **Batch Size**: Smaller batch sizes are often used in audio GANs because long, high-dimensional audio examples consume a large amount of GPU memory.
- **Latent Dimensionality**: The size of the latent vector controls the variability in the generated audio. Higher-dimensional latent spaces allow for more complex variations in generated audio sequences.

---

### 7. **Challenges in Audio GANs**

While GANs for image generation are well understood, GANs for audio pose additional challenges:
1. **Training Stability**: Like all GANs, audio GANs are prone to instability during training. It is crucial to maintain a careful balance between the discriminator and generator to prevent issues like mode collapse or vanishing gradients.
2. **Audio Quality**: High-fidelity audio is difficult to generate, especially when producing long sequences. Fine-tuning the generator to produce smooth, coherent audio over time is non-trivial.
3. **Evaluation**: Evaluating the quality of generated audio is subjective and often requires manual listening, though some automatic metrics like **Fréchet Audio Distance (FAD)** have been developed.
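
FAD compares the statistics of embeddings computed from real and generated audio by a pretrained audio classifier. Assuming each embedding set is summarized by a mean $\mu$ and covariance $\Sigma$ (subscript $r$ for real, $g$ for generated), the Fréchet distance between the two fitted Gaussians is:

$$
\mathrm{FAD} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
$$

Lower values indicate that the embedding distribution of the generated audio is closer to that of the real audio.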

---

### Conclusion

This week’s assignment introduces you to the fascinating world of **audio generation with GANs**. You will learn how to design and train GANs for generating audio sequences, either in the time domain (as waveforms) or frequency domain (as spectrograms). Pay close attention to the unique challenges posed by audio data, including the temporal dependencies and high dimensionality, and experiment with different architectures and loss functions to achieve high-quality results.

---


## Assignment Part 1: Follow Me – Music Generation Using GANs

In this section, we will build and train a Generative Adversarial Network (GAN) to generate music sequences. Each generated sequence is interpreted as 128 MIDI note numbers, rendered as sine tones, and saved as a WAV file. You will explore how GANs can be used to generate creative outputs such as music.


In [None]:
# wave and struct are part of the Python standard library, so only TensorFlow needs installing
!pip install tensorflow

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers
import wave
import struct

In [None]:
# Define the generator model
def build_generator(latent_dim):
    model = tf.keras.Sequential()
    model.add(layers.Input(shape=(latent_dim,)))  # Latent noise vector input
    model.add(layers.Dense(256, activation="relu"))
    model.add(layers.Dense(512, activation="relu"))
    model.add(layers.Dense(1024, activation="relu"))
    model.add(layers.Dense(128, activation="tanh"))  # Output layer: 128 values in [-1, 1], later mapped to MIDI notes
    return model

In [None]:
# Define the discriminator model
def build_discriminator(input_shape):
    model = tf.keras.Sequential()
    model.add(layers.Input(shape=input_shape))  # Input: one sequence of 128 values
    model.add(layers.Dense(512, activation="relu"))
    model.add(layers.Dense(256, activation="relu"))
    model.add(layers.Dense(1, activation="sigmoid"))  # Binary classification
    return model

In [None]:
# GAN Model combining generator and discriminator
def build_gan(generator, discriminator):
    discriminator.trainable = False  # Freeze the discriminator during GAN training
    gan_input = layers.Input(shape=(latent_dim,))  # Uses the global latent_dim from the Parameters cell; run that cell before calling build_gan
    generated_sequence = generator(gan_input)
    gan_output = discriminator(generated_sequence)

    gan = tf.keras.models.Model(gan_input, gan_output)
    gan.compile(optimizer='adam', loss='binary_crossentropy')
    return gan

In [None]:
import wave
import struct

# Generate WAV file from the generated sequence (simple sine wave approximation)
def sequence_to_wav(sequence, wav_file="generated_music.wav", duration_per_note=0.5, sample_rate=44100):
    # Clip values to ensure they're within the MIDI range (0 to 127)
    sequence = np.clip((sequence + 1) * 63.5, 0, 127)  # Rescale from [-1, 1] to [0, 127]

    # Define basic parameters for audio synthesis
    num_samples_per_note = int(sample_rate * duration_per_note)
    amplitude = 32767  # Max amplitude for 16-bit WAV files
    t = np.arange(num_samples_per_note) / sample_rate  # Time axis for one note (seconds)

    # Generate a sine wave for each note in the sequence (vectorized per note)
    data = []
    for note in sequence:
        frequency = 440.0 * (2.0 ** ((note - 69.0) / 12.0))  # Convert MIDI note to frequency
        data.append(amplitude * np.sin(2 * np.pi * frequency * t))
    samples = np.concatenate(data).astype('<i2')  # Little-endian signed 16-bit integers

    # Write to a WAV file
    with wave.open(wav_file, 'w') as wav_out:
        wav_out.setnchannels(1)  # Mono sound
        wav_out.setsampwidth(2)  # 16-bit sound
        wav_out.setframerate(sample_rate)
        wav_out.writeframes(samples.tobytes())

    print(f"Saved generated music to {wav_file}")
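
The two conversions inside `sequence_to_wav` can be checked in isolation. The sketch below restates them as standalone helpers (the names `tanh_to_midi` and `midi_to_hz` are hypothetical) and evaluates them on a few known values, assuming standard tuning with A4 (MIDI note 69) at 440 Hz.

```python
import numpy as np

def tanh_to_midi(values):
    """Map generator outputs in [-1, 1] to MIDI note numbers in [0, 127]."""
    return np.clip((np.asarray(values, dtype=float) + 1.0) * 63.5, 0, 127)

def midi_to_hz(note):
    """Equal-temperament frequency, with A4 (MIDI 69) tuned to 440 Hz."""
    return 440.0 * 2.0 ** ((np.asarray(note, dtype=float) - 69.0) / 12.0)

notes = tanh_to_midi([-1.0, 0.0, 1.0])   # maps to MIDI 0, 63.5, and 127
octaves = midi_to_hz([57, 69, 81])       # A3, A4, A5: 220, 440, and 880 Hz
print(notes, octaves)
```

Because the generator emits continuous values, the resulting note numbers are generally fractional, so the rendered tones fall between the keys of a piano unless the values are rounded first.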


In [None]:
# Parameters
latent_dim = 100
input_shape = (128,)

In [None]:
# Build models
generator = build_generator(latent_dim)
discriminator = build_discriminator((128,))

In [None]:
# Compile generator and discriminator separately
generator.compile(optimizer='adam', loss='binary_crossentropy')
discriminator.compile(optimizer='adam', loss='binary_crossentropy')

In [None]:
# Build and compile the GAN model
gan = build_gan(generator, discriminator)

In [None]:
# Training loop (simplified)
def train_gan(gan, generator, discriminator, epochs, batch_size, latent_dim):
    for epoch in range(epochs):
        # Generate random noise as input
        noise = np.random.normal(0, 1, (batch_size, latent_dim))
        generated_sequences = generator.predict(noise)

        # Generate "real" sequences (placeholder: uniform random data in [0, 1), not actual music)
        real_sequences = np.random.rand(batch_size, 128)

        # Combine generated and real sequences (generated first, then real)
        combined_sequences = np.concatenate([generated_sequences, real_sequences])

        # Create labels in the same order: fake (0) for generated, real (1) for real
        labels = np.concatenate([np.zeros((batch_size, 1)), np.ones((batch_size, 1))])

        # Train discriminator
        discriminator.trainable = True  # Allow discriminator to be trainable
        d_loss = discriminator.train_on_batch(combined_sequences, labels)

        # Train generator via GAN
        noise = np.random.normal(0, 1, (batch_size, latent_dim))
        misleading_labels = np.ones((batch_size, 1))
        discriminator.trainable = False  # Freeze discriminator during GAN training
        g_loss = gan.train_on_batch(noise, misleading_labels)

        print(f"Epoch {epoch + 1}/{epochs}, Discriminator Loss: {d_loss}, Generator Loss: {g_loss}")

    # Generate a final WAV file after all epochs are complete
    final_noise = np.random.normal(0, 1, (1, latent_dim))
    final_generated_sequence = generator.predict(final_noise)[0]
    sequence_to_wav(final_generated_sequence, wav_file="walkthrough_generated_music.wav")

In [None]:
# Train the GAN
train_gan(gan, generator, discriminator, epochs=50, batch_size=32, latent_dim=latent_dim)

In [None]:
# Download the generated WAV file from Colab to your local machine
from google.colab import files

files.download('walkthrough_generated_music.wav')

## Assignment Part 2: Fine-Tuning the Music Generation Model

In this section, the GAN model will be fine-tuned to create custom music. We provide the rest of the model structure, but you need to update the GAN to explore different configurations and produce unique music. **A framework has been provided, and your job is to complete the TODOs.**


In [None]:
#### TODO: Define your custom hyperparameters ####
custom_latent_dim = 64  # Change the latent dimension
custom_epochs = 30  # Modify the number of epochs
custom_batch_size = 16  # Modify the batch size
custom_optimizer = 'rmsprop'  # Try a different optimizer (e.g., 'rmsprop')


In [None]:
#### TODO: Modify the generator model (example: changing number of layers and units) ####
def custom_build_generator(latent_dim):
    model = tf.keras.Sequential()
    model.add(layers.Input(shape=(latent_dim,)))  # Input layer

    #### TODO ####

    model.add(layers.Dense(128, activation="tanh"))  # Output with 128 units to match discriminator input
    return model

In [None]:
#### TODO: Modify the discriminator model (example: changing number of layers and units) ####
def custom_build_discriminator(input_shape):
    model = tf.keras.Sequential()
    model.add(layers.Input(shape=input_shape))  # Input layer

    #### TODO ####

    model.add(layers.Dense(1, activation="sigmoid"))  # Output for binary classification
    return model

In [None]:
# Rebuild the models with the custom parameters
custom_generator = custom_build_generator(custom_latent_dim)
custom_discriminator = custom_build_discriminator((128,))


In [None]:
# Compile the models with the new optimizer
custom_generator.compile(optimizer=custom_optimizer, loss='binary_crossentropy')
custom_discriminator.compile(optimizer=custom_optimizer, loss='binary_crossentropy')


In [None]:
# GAN Model combining generator and discriminator with custom latent dimension
def build_gan(generator, discriminator, latent_dim):
    discriminator.trainable = False  # Freeze the discriminator during GAN training
    gan_input = layers.Input(shape=(latent_dim,))  # Ensure the latent_dim is passed here
    generated_sequence = generator(gan_input)
    gan_output = discriminator(generated_sequence)

    gan = tf.keras.models.Model(gan_input, gan_output)
    gan.compile(optimizer=custom_optimizer, loss='binary_crossentropy')  # Use the optimizer chosen in the hyperparameter cell
    return gan


In [None]:
# Rebuild the GAN model with the updated generator, discriminator, and latent_dim
custom_gan = build_gan(custom_generator, custom_discriminator, custom_latent_dim)


In [None]:
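# Train the customized GAN.
# Note: train_gan() always writes "walkthrough_generated_music.wav" when it finishes, which would
# overwrite the Part 1 output. Back that file up first and restore it after training so the
# comparison cell below still compares Part 1 against Part 2.
import os
import shutil

shutil.copy("walkthrough_generated_music.wav", "walkthrough_part1_backup.wav")
train_gan(custom_gan, custom_generator, custom_discriminator, epochs=custom_epochs, batch_size=custom_batch_size, latent_dim=custom_latent_dim)
os.replace("walkthrough_part1_backup.wav", "walkthrough_generated_music.wav")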


In [None]:
# Save the WAV file from the generated music
generated_noise = np.random.normal(0, 1, (1, custom_latent_dim))  # Generate new noise
generated_sequence = custom_generator.predict(generated_noise)  # Generate a new sequence
sequence_to_wav(generated_sequence[0], wav_file="custom_generated_music.wav")  # Save to a new WAV file

In [None]:
# Download the generated WAV file from Colab to your local machine
from google.colab import files

files.download('custom_generated_music.wav')

In [None]:
# Compare Music Files
import numpy as np
import matplotlib.pyplot as plt
import wave

# Function to read WAV file data
def read_wav_data(wav_file):
    with wave.open(wav_file, 'rb') as wav:
        frames = wav.readframes(wav.getnframes())
        waveform = np.frombuffer(frames, dtype=np.int16)  # Assuming 16-bit WAV
        return waveform

# Load the waveforms of both files
custom_wav = "custom_generated_music.wav"
walkthrough_wav = "walkthrough_generated_music.wav"

custom_waveform = read_wav_data(custom_wav)
walkthrough_waveform = read_wav_data(walkthrough_wav)

# Make sure both waveforms are the same length for comparison
min_length = min(len(custom_waveform), len(walkthrough_waveform))
custom_waveform = custom_waveform[:min_length]
walkthrough_waveform = walkthrough_waveform[:min_length]

# Calculate the difference between the waveforms
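# Widen to int32 first: subtracting int16 arrays can overflow and wrap around
waveform_difference = custom_waveform.astype(np.int32) - walkthrough_waveform.astype(np.int32)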

# Plot only every 100th point to avoid overcrowding (this is a plotting stride, not the audio sample rate)
plot_step = 100
sampled_custom_waveform = custom_waveform[::plot_step]
sampled_walkthrough_waveform = walkthrough_waveform[::plot_step]
sampled_difference = waveform_difference[::plot_step]

# Plot the waveforms and their difference
plt.figure(figsize=(15, 8))

# Custom generated waveform
plt.subplot(3, 1, 1)
plt.plot(sampled_custom_waveform, color='blue', alpha=0.7, linewidth=0.7)
plt.title("Waveform of Custom Generated Music")
plt.xlabel("Sample Index")
plt.ylabel("Amplitude")

# Walkthrough generated waveform
plt.subplot(3, 1, 2)
plt.plot(sampled_walkthrough_waveform, color='green', alpha=0.7, linewidth=0.7)
plt.title("Waveform of Walkthrough Generated Music")
plt.xlabel("Sample Index")
plt.ylabel("Amplitude")

# Difference between waveforms
plt.subplot(3, 1, 3)
plt.plot(sampled_difference, color='red', alpha=0.8, linewidth=0.7)
plt.title("Difference Between Custom and Walkthrough Waveforms")
plt.xlabel("Sample Index")
plt.ylabel("Amplitude Difference")

plt.tight_layout()
plt.show()
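
The plots give a visual comparison; a single number can complement them. The sketch below computes the root-mean-square difference between two 16-bit waveforms. `rms_difference` is a hypothetical helper, and the two short arrays are synthetic stand-ins for the WAV data loaded above.

```python
import numpy as np

def rms_difference(a, b):
    """Root-mean-square difference of two int16 waveforms over their common length."""
    n = min(len(a), len(b))
    # Widen to int32 first: int16 subtraction can overflow and wrap around
    diff = a[:n].astype(np.int32) - b[:n].astype(np.int32)
    return float(np.sqrt(np.mean(diff.astype(np.float64) ** 2)))

# Synthetic stand-ins for the two WAV files (illustrative only)
a = np.array([32767, 0, -32768], dtype=np.int16)
b = np.array([-32768, 0, 32767], dtype=np.int16)
print(rms_difference(a, b))
```

In the notebook, the same function could be called as `rms_difference(custom_waveform, walkthrough_waveform)`. A large value only says the two files differ sample by sample; it does not say which one sounds better.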



### TODO: Analysis of Audio Generation with GANs from Random Noise

Now that you've trained a Generative Adversarial Network (GAN) to generate audio sequences purely from random noise, summarize your experience and insights by addressing the following questions:

- **Generated Audio Quality:**  
  How coherent and realistic was the audio generated entirely from random noise? Describe specific characteristics or limitations you observed.

- **Training Insights and Challenges:**  
  What challenges or issues arose during the training process (e.g., stability, convergence)? How did you manage or resolve these challenges?

- **Impact of Hyperparameters:**  
  Which hyperparameters (e.g., latent dimension size, learning rate, training epochs) significantly influenced the quality and characteristics of the generated audio? Explain your observations.

- **Practical and Creative Potential:**  
  Based on your results, in what practical or creative contexts could audio generated from random noise using GANs be utilized?

**Action:**  
Write a concise summary (1-2 paragraphs) capturing your observations clearly in a markdown cell below.
