# Generating .Wav Files from Splice Samples

The following models are my attempt at a simplified introduction to generative AI modeling. Earlier iterations of this assignment sought to use Google's Magenta to train a model on MIDI files/sequences to generate any sort of sound. I ran into multiple barriers (issues with packages and environments) that prevented me from pursuing this method. This notebook is number 5 in a series of 7 that use PyTorch to train a generator to convert random noise into something less harsh and meaningless. The first model below is my most successful attempt at getting this to work (it barely sounds like music, but at least it's not random noise). The second model iterates on this method by implementing a CNN architecture (although I was able to train it, this was completely unsuccessful, and no improvement to this implementation was made in the notebooks that followed).

This pre-write outlines the steps taken to build the first model (minus the frustration).

1. Gather training data -- .wav files were downloaded from Splice (a sample repository for music producers). The samples chosen were between 30 seconds and a minute long and instrumental only (no vox, minimal drums). About 220 samples across the soul, gospel, R&B, and lo-fi hip hop categories were chosen, primarily pianos, synths, and other melodic instruments.
2. Pre-processing -- when downloaded, these files were organized in various nested folders. A script was used to flatten the root directory so the data is more easily accessible. Earlier attempts at preprocessing and training had duplicated many of the files, so new functions were used to remove duplicates. After duplicate files were verified and removed, the length and sample rate of the .wav files were normalized.
3. Loading the data -- each file was converted to a tensor that could then be fed into the neural network through the 'data loader'.
4. Neural network and hyperparameter tuning -- this first model is a fully connected feedforward NN, or multi-layer perceptron. The parameters I primarily changed were batch size and number of epochs. Previous versions used smaller batches and more epochs. After the first 'successful' training run of nearly an hour, we achieved something beyond just noise. I could tell I was moving in the right direction when the noise had hints of something else (like a radio signal almost entirely obscured by static). By changing these parameters (as well as fixing my data directories), we were able to achieve results similar to the longer training runs (batch size 1 and 50 epochs) but much more quickly. I faced many issues with dimensionality, which ultimately prevented me from continuing to tune my CNN version.
5. Validation -- the only form of validation done in this process was saving a generated .wav file and listening to it. The final result was a cacophonous, harmonically dissonant drone with sparse elements of bell dings and rhythmic textures. I never thought I would be so happy to hear something so awful sounding.

In [1]:
import os
import librosa
import soundfile as sf
import hashlib

def hash_audio_file(file_path):
    """Decode the audio file and return (md5 hash of its samples, samples, sample rate)."""
    y, sr = librosa.load(file_path, sr=None)  # sr=None keeps the native sample rate
    # Hash the decoded samples so files with identical audio map to the same hash
    audio_hash = hashlib.md5(y.tobytes()).hexdigest()
    return audio_hash, y, sr

def flatten_wav_files(source_dir, target_dir):
    """Write every unique .wav under source_dir into target_dir, named by its content hash."""
    os.makedirs(target_dir, exist_ok=True)

    seen_files = set()

    for root, _, files in os.walk(source_dir):
        for file in files:
            # Match the extension case-insensitively so .WAV files are not skipped
            if file.lower().endswith('.wav'):
                file_path = os.path.join(root, file)

                try:
                    # Decode once and reuse the samples for both hashing and writing
                    audio_hash, y, sr = hash_audio_file(file_path)
                    new_file_name = f"{audio_hash}.wav"

                    # Check for duplicates
                    if new_file_name not in seen_files:
                        seen_files.add(new_file_name)
                        new_file_path = os.path.join(target_dir, new_file_name)

                        # Write the decoded samples to the target directory
                        sf.write(new_file_path, y, sr)
                        print(f"Copied: {new_file_path}")
                    else:
                        print(f"Duplicate found: {file_path}, skipping.")
                except Exception as e:
                    print(f"Error processing {file_path}: {e}")

source_directory = 'data_copy'
target_directory = 'flat_new'

flatten_wav_files(source_directory, target_directory)


Copied: flat_new/e17dec0980e9c5a98d55878cdf69a43f.wav
Copied: flat_new/0c35034c25032f8af4ecd7fd05823c2a.wav
Copied: flat_new/589de4c1ef2de26c172adc17a821a1b7.wav
Copied: flat_new/5f9f29b959ddcaf4427942f445b4c32c.wav
Copied: flat_new/12bc5bfcf99e5cfca7c9f5f3177c0aea.wav
Copied: flat_new/fcdb17f8d4842fbbbfbc64e8d7e97020.wav
Copied: flat_new/0dcbfd00524bec462bf274e1944fcf7a.wav
Copied: flat_new/834d08b5e47de29f55946560f20d5fe0.wav
Copied: flat_new/98287ff62cf2afa762d960164a2b9b82.wav
Copied: flat_new/c8382806236dbb13b111cf54a1019648.wav
Copied: flat_new/9d3fce79f095cf78e4eafd4ec6940124.wav
Copied: flat_new/58d50f354073bff07733f24123e871ee.wav
Copied: flat_new/036060341d0d564cb38f66e5df29960c.wav
Copied: flat_new/cd536a01a601e6951e57f29c74745bd2.wav
Copied: flat_new/802c402bc9727528d8571bb632491a83.wav
Copied: flat_new/eea7d1afe34ef61161670cd0353b6357.wav
Copied: flat_new/8506b824da08fd3a9ad9b73be6d6c070.wav
Copied: flat_new/03e3fba6be54d43b2336cf584e94c3e1.wav
Copied: flat_new/46ea08c8712
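
The cell above hashes the decoded samples, which means every file has to be decoded before it can be compared. As a cheaper way to double-check the source folder for duplicates, here is a minimal sketch that hashes the raw file bytes instead, using only the standard library. The helper names `hash_file_bytes` and `find_duplicates` are hypothetical (not part of the notebook), and this variant only catches byte-identical copies, so two files with the same audio but different metadata would not be flagged.

```python
import hashlib
import os
import tempfile

def hash_file_bytes(path, chunk_size=1 << 20):
    """Hash the raw bytes of a file in chunks (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicates(directory):
    """Return {hash: [paths]} for every hash shared by more than one file."""
    by_hash = {}
    for root, _, files in os.walk(directory):
        for name in sorted(files):
            path = os.path.join(root, name)
            by_hash.setdefault(hash_file_bytes(path), []).append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}

# Self-check with throwaway files standing in for real .wav data
with tempfile.TemporaryDirectory() as tmp:
    for name, payload in [("a.wav", b"same"), ("b.wav", b"same"), ("c.wav", b"different")]:
        with open(os.path.join(tmp, name), "wb") as fh:
            fh.write(payload)
    duplicates = find_duplicates(tmp)
    duplicate_groups = [sorted(os.path.basename(p) for p in paths)
                        for paths in duplicates.values()]

print(duplicate_groups)  # [['a.wav', 'b.wav']]
```

Running `find_duplicates('data_copy')` before flattening would list any byte-identical groups without decoding audio.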

In [1]:
import os
import torch
import torchaudio

def preprocess_audio(file_path, target_sample_rate=22050, target_length=661500):
    # target_length of 661500 samples is 30 seconds at 22050 Hz
    # Load the audio file; waveform has shape [channels, samples]
    waveform, sample_rate = torchaudio.load(file_path)

    # Resample if the sample rate is different
    if sample_rate != target_sample_rate:
        resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=target_sample_rate)
        waveform = resampler(waveform)

    # Trim long clips, or zero-pad short clips on the right, to the target length
    if waveform.size(1) > target_length:
        waveform = waveform[:, :target_length]
    elif waveform.size(1) < target_length:
        padding = target_length - waveform.size(1)
        waveform = torch.nn.functional.pad(waveform, (0, padding))

    return waveform

# Apply to all audio files
def load_and_preprocess_data(data_dir):
    audio_files = []
    for filename in os.listdir(data_dir):
        if filename.endswith(".wav"):
            file_path = os.path.join(data_dir, filename)
            audio_files.append(preprocess_audio(file_path))
    return audio_files

# Example usage
data_dir = "flat_new"
processed_audio = load_and_preprocess_data(data_dir)

In [3]:
from torch.utils.data import Dataset, DataLoader

class AudioDataset(Dataset):
    def __init__(self, audio_data):
        self.audio_data = audio_data

    def __len__(self):
        return len(self.audio_data)

    def __getitem__(self, idx):
        # Squeeze the waveform to remove unnecessary dimensions
        return self.audio_data[idx].squeeze(0)  # This will convert [1, 661500] to [661500]


# Create DataLoader
audio_dataset = AudioDataset(processed_audio)
audio_loader = DataLoader(audio_dataset, batch_size=32, shuffle=True)
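
To make the sizes in this pipeline concrete, here is a small worked calculation using the values from the cells above. The only assumption is `NUM_FILES = 220`, taken from the "about 220 samples" in the write-up; the exact file count after de-duplication is not recorded.

```python
import math

SAMPLE_RATE = 22050    # target_sample_rate in preprocess_audio
CLIP_SAMPLES = 661500  # target_length in preprocess_audio
BATCH_SIZE = 32        # batch_size passed to DataLoader
NUM_FILES = 220        # assumption: "about 220" samples, exact count not stated

clip_seconds = CLIP_SAMPLES / SAMPLE_RATE
batches_per_epoch = math.ceil(NUM_FILES / BATCH_SIZE)
floats_per_batch = BATCH_SIZE * CLIP_SAMPLES

print(clip_seconds)       # 30.0
print(batches_per_epoch)  # 7
print(floats_per_batch)   # 21168000
```

Every clip is therefore trimmed or padded to 30 seconds, and a full batch holds about 21 million float values. Seven batches per epoch is consistent with the seven "Batch complete" lines printed per epoch in the training output.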

In [10]:
import torch.nn as nn
class AudioGenerator(nn.Module):
    def __init__(self):
        super(AudioGenerator, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(100, 512),  # Input dimension is 100 for noise
            nn.ReLU(),
            nn.Linear(512, 1024),
            nn.ReLU(),
            nn.Linear(1024, 2048),
            nn.ReLU(),
            nn.Linear(2048, 661500),  # Output dimension for audio samples
            nn.Tanh()  # Output normalized to [-1, 1]
        )

    def forward(self, x):
        return self.model(x)

generator = AudioGenerator()
optimizer = torch.optim.Adam(generator.parameters(), lr=0.001)
criterion = nn.MSELoss()
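
It helps to see how large this network is. The sketch below counts the parameters of the four `nn.Linear` layers by hand (weights plus biases), using only the layer sizes from the cell above. The memory figure assumes 32-bit floats and counts the weights alone, not gradients or optimizer state.

```python
# (in_features, out_features) for each nn.Linear in AudioGenerator
layer_shapes = [(100, 512), (512, 1024), (1024, 2048), (2048, 661500)]

def linear_params(n_in, n_out):
    """Parameter count of a fully connected layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

per_layer = [linear_params(n_in, n_out) for n_in, n_out in layer_shapes]
total_params = sum(per_layer)
float32_gb = total_params * 4 / 1e9  # 4 bytes per float32 value

print(per_layer)             # [51712, 525312, 2099200, 1355413500]
print(total_params)          # 1358089724
print(round(float32_gb, 2))  # 5.43
```

Almost all of the roughly 1.36 billion parameters sit in the final layer, because it maps 2048 features directly to 661500 output samples. That is likely a large part of why training was slow.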

In [18]:
for epoch in range(10):
    for batch in audio_loader:
        optimizer.zero_grad()
        
        # Generate noise input
        noise = torch.randn(batch.size(0), 100)  # Adjust according to your noise dimension
        generated_audio = generator(noise)
        
        # Ensure both generated_audio and batch are in the shape [batch_size, 661500]
        loss = criterion(generated_audio, batch)  # Both should now match
        
        loss.backward()
        optimizer.step()
        print('Batch complete')
    print('Epoch:', epoch,'complete')


Batch complete
Batch complete
Batch complete
Batch complete
Batch complete
Batch complete
Batch complete
Epoch: 0 complete
Batch complete
Batch complete
Batch complete
Batch complete
Batch complete
Batch complete
Batch complete
Epoch: 1 complete
Batch complete
Batch complete
Batch complete
Batch complete
Batch complete
Batch complete
Batch complete
Epoch: 2 complete
Batch complete
Batch complete
Batch complete
Batch complete
Batch complete
Batch complete
Batch complete
Epoch: 3 complete
Batch complete
Batch complete
Batch complete
Batch complete
Batch complete
Batch complete
Batch complete
Epoch: 4 complete
Batch complete
Batch complete
Batch complete
Batch complete
Batch complete
Batch complete
Batch complete
Epoch: 5 complete
Batch complete
Batch complete
Batch complete
Batch complete
Batch complete
Batch complete
Batch complete
Epoch: 6 complete
Batch complete
Batch complete
Batch complete
Batch complete
Batch complete
Batch complete
Batch complete
Epoch: 7 complete
Batch complete
B
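
One possible reading of why the output settles into a drone: in the loop above, the noise vector is drawn independently of the batch, so no particular noise input is paired with a particular clip. Under MSE, the best single prediction for a set of targets is their mean, which suggests the generator is pulled toward an average of the training waveforms. The toy check below illustrates only that property of MSE. The numbers are made up (they are not data from this project) and stand in for one sample position across four clips.

```python
# Made-up values for a single sample position across four clips
targets = [-0.5, 0.0, 0.25, 0.75]

def mse(prediction, values):
    """Mean squared error of one constant prediction against all values."""
    return sum((prediction - v) ** 2 for v in values) / len(values)

mean_target = sum(targets) / len(targets)

# Try every constant prediction on a grid from -1 to 1 in steps of 1/8
candidates = [c / 8 for c in range(-8, 9)]
best = min(candidates, key=lambda c: mse(c, targets))

print(mean_target)  # 0.125
print(best)         # 0.125
```

The constant with the lowest error is the mean of the targets. If the same thing happens at every one of the 661500 output positions, the result would be a blurred average of all clips rather than any one of them.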

In [21]:
with torch.no_grad():
    noise = torch.randn(1, 100)  # Generate noise
    generated_audio = generator(noise)

# Optionally save the generated audio
torchaudio.save("generated_audio_new2.wav", generated_audio, 22050)

In [27]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

# Define the audio dataset class
class AudioDataset(Dataset):
    def __init__(self, audio_data):
        self.audio_data = audio_data

    def __len__(self):
        return len(self.audio_data)

    def __getitem__(self, idx):
        # Squeeze the waveform to remove unnecessary dimensions
        return self.audio_data[idx].squeeze(0)  # This will convert [1, 661500] to [661500]

# Define the CNN model
class AudioCNN(nn.Module):
    def __init__(self):
        super(AudioCNN, self).__init__()
        self.conv_layers = nn.Sequential(
            nn.Conv1d(in_channels=1, out_channels=16, kernel_size=5, stride=2, padding=2),  # First Conv Layer
            nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=5, stride=2, padding=2),  # Second Conv Layer
            nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, stride=2, padding=2),  # Third Conv Layer
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, stride=2, padding=2),  # Fourth Conv Layer
            nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=5, stride=2, padding=2),  # Fifth Conv Layer
            nn.ReLU()
        )
        
        # Run a dummy input through the conv stack to find the input size for the linear layer
        test_input = torch.zeros(1, 1, 661500)  # [batch_size, channels, length]
        self.flattened_size = self._get_flattened_size(test_input)
        self.fc = nn.Linear(self.flattened_size, 1)  # Fully connected layer mapping the conv features to one value

    def _get_flattened_size(self, input_tensor):
        with torch.no_grad():  # No need to track gradients here
            output = self.conv_layers(input_tensor)
        return output.numel()  # Get the number of elements in the output

    def forward(self, x):
        x = x.view(-1, 1, 661500)  # Add the channel dimension
        x = self.conv_layers(x)  # Pass through convolutional layers
        x = x.view(x.size(0), -1)  # Flatten the tensor
        x = self.fc(x)  # Pass through fully connected layer
        return x

# Create an instance of the model
model = AudioCNN()

# Create DataLoader with your processed audio data
# processed_audio is the list of [1, 661500] tensors built by load_and_preprocess_data
audio_dataset = AudioDataset(processed_audio)
audio_loader = DataLoader(audio_dataset, batch_size=32, shuffle=True)

# Define optimizer and loss function
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()

# Example training loop (simplified)
for epoch in range(5):  # Number of epochs
    for batch in audio_loader:
        optimizer.zero_grad()  # Zero gradients
        output = model(batch)  # Forward pass
        
        # Placeholder target: all zeros, so this loop only teaches the CNN to output 0
        # for every clip. Replace with real targets before reading anything into the loss.
        target = torch.zeros(batch.size(0), 1)
        
        loss = criterion(output, target)  # Calculate loss
        loss.backward()  # Backward pass
        optimizer.step()  # Update weights
        
        # Note: the "/10" in this label is hard-coded; the loop above runs 5 epochs
        print(f"Epoch [{epoch + 1}/10], Loss: {loss.item():.4f}")


Epoch [1/10], Loss: 0.0000
Epoch [1/10], Loss: 3712.9907
Epoch [1/10], Loss: 109.6233
Epoch [1/10], Loss: 134.4192
Epoch [1/10], Loss: 311.2141
Epoch [1/10], Loss: 225.3912
Epoch [1/10], Loss: 65.1125
Epoch [2/10], Loss: 2.1817
Epoch [2/10], Loss: 12.7934
Epoch [2/10], Loss: 18.7875
Epoch [2/10], Loss: 11.8786
Epoch [2/10], Loss: 3.1221
Epoch [2/10], Loss: 1.0223
Epoch [2/10], Loss: 3.7862
Epoch [3/10], Loss: 1.6029
Epoch [3/10], Loss: 0.0210
Epoch [3/10], Loss: 0.8483
Epoch [3/10], Loss: 0.7811
Epoch [3/10], Loss: 0.4956
Epoch [3/10], Loss: 0.1749
Epoch [3/10], Loss: 0.0100
Epoch [4/10], Loss: 0.0519
Epoch [4/10], Loss: 0.1993
Epoch [4/10], Loss: 0.2640
Epoch [4/10], Loss: 0.0857
Epoch [4/10], Loss: 0.0902
Epoch [4/10], Loss: 0.0361
Epoch [4/10], Loss: 0.0054
Epoch [5/10], Loss: 0.0406
Epoch [5/10], Loss: 0.0418
Epoch [5/10], Loss: 0.0214
Epoch [5/10], Loss: 0.0041
Epoch [5/10], Loss: 0.0013
Epoch [5/10], Loss: 0.0114
Epoch [5/10], Loss: 0.0159
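
Since dimensionality was the main obstacle with the CNN, here is a minimal sketch of how the length shrinks through the five `nn.Conv1d` layers. It uses the output-length formula from the PyTorch `Conv1d` documentation with the kernel size, stride, and padding from the cell above. The helper name `conv1d_out_len` is hypothetical.

```python
def conv1d_out_len(length, kernel_size=5, stride=2, padding=2, dilation=1):
    """Output length of a 1-D convolution for a given input length."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

length = 661500  # samples per clip
lengths = []
for _ in range(5):  # five conv layers, all with the same kernel/stride/padding
    length = conv1d_out_len(length)
    lengths.append(length)

# The last conv layer has 256 output channels
flattened_size = 256 * lengths[-1]

print(lengths)         # [330750, 165375, 82688, 41344, 20672]
print(flattened_size)  # 5292032
```

This should match the value `_get_flattened_size` computes with its dummy input, so the linear layer takes 5,292,032 features and reduces them to a single number per clip.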


In [37]:
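# Generate noise
# Note: this calls `generator`, the fully connected AudioGenerator as defined in this notebook.
# The AudioCNN `model` above outputs one value per clip, so it cannot produce audio itself.
with torch.no_grad():
    noise = torch.randn(1, 1, 100)  # Shape: (batch_size, channels, noise_dim)
    generated_audio = generator(noise)  # Shape: (1, 1, 661500)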

# Ensure the output is in the correct shape for torchaudio.save
generated_audio = generated_audio.view(1, -1)  # Flatten the tensor to (1, audio_length)

# Save the generated audio to a file
torchaudio.save("generated_audio_new3.wav", generated_audio, 22050) 