# Real-time voice cloning
Goal: 
Convert speech audio from person A into speech audio for person B using an embedding of person B's voice.
This conversion should be done in real-time and if possible on low-end hardware or CPU only.

We will train the model using unlabeled data, i.e. we will not require aligned pairs of audio from person A and person B.


## Component Overview

1. Voice VAE: 
    - A variational autoencoder (VAE) that learns a latent representation of a speaker's voice.
    - The encoder takes a wave tensor as input and outputs a latent representation of the speaker's voice.
    - The decoder takes the latent representation and outputs a reconstructed wave tensor.
    - The VAE is trained using a reconstruction loss and a KL divergence loss
  
2. Voice Embedder:
    - A neural network that takes a latent tensor as input and outputs a multidimensional representation of the speaker's voice.
    - The network is trained together with the Voice Converter on unlabeled data.

3. Voice Converter:
    - A neural network that takes a latent representation produced by the VAE encoder and a voice embedding as input and outputs a new latent tensor that can be decoded to a wave tensor.
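
The Voice VAE objective from item 1 (reconstruction loss plus KL divergence) can be sketched as follows. This is only an illustration under stated assumptions: the helper name `vae_loss`, the weight `beta`, and the use of MSE as the reconstruction term are not fixed by this notebook.

```python
import torch
import torch.nn.functional as F

def vae_loss(reconstruction, target, mu, log_var, beta=0.1):
    """Hypothetical helper: reconstruction loss plus beta-weighted KL divergence."""
    # reconstruction term: mean squared error between the two waveforms
    recon = F.mse_loss(reconstruction, target)
    # KL divergence between N(mu, exp(log_var)) and N(0, 1), averaged over all elements
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + beta * kl, recon, kl

# usage with stand-in tensors: a perfect reconstruction and a standard-normal posterior
target = torch.randn(4, 1, 1600)
mu = torch.zeros(4, 32, 10)
log_var = torch.zeros(4, 32, 10)
total, recon, kl = vae_loss(target.clone(), target, mu, log_var)
print(total.item(), recon.item(), kl.item())
```

Averaging the KL term instead of summing it keeps its scale independent of clip length, which is one possible choice among several.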

## Training
- The training will only use unlabeled speech recordings of random speakers in multiple languages.
- The voice VAE will be trained to reconstruct the input wave tensor while having a multidimensional latent space with a lower sample rate than the original wave tensor.
  - We will test different loss functions to find one that is ideal for our requirements.
  - Additional noise will be added to the input wave tensor to make the model more robust to noise and to guide it to learn only the most important features of the voice.
- The Voice Embedder will be trained using the latent representation of the voice VAE as input and generate a multidimensional representation of the speaker's voice of length 1, regardless of the length of the input.
- The Voice Converter will be trained to take the latent representation of the voice VAE and the voice embedding as input and output a new latent representation that can be decoded to a wave tensor.
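
One way to obtain an embedding of length 1 regardless of the input length, as required for the Voice Embedder above, is to average over the time axis. The sketch below is an assumption for illustration: `PooledEmbedder`, its layer sizes, and the choice of mean pooling are hypothetical and not the architecture chosen in this notebook.

```python
import torch
import torch.nn as nn

class PooledEmbedder(nn.Module):
    """Hypothetical embedder: (batch, latent_dim, frames) -> (batch, embedding_dim)."""

    def __init__(self, latent_dim=32, embedding_dim=64):
        super().__init__()
        self.proj = nn.Conv1d(latent_dim, embedding_dim, kernel_size=3, padding=1)

    def forward(self, latent):
        features = torch.relu(self.proj(latent))
        # average over the time axis so the result no longer depends on the clip length
        return features.mean(dim=2)

embedder = PooledEmbedder()
short_clip = embedder(torch.randn(2, 32, 50))
long_clip = embedder(torch.randn(2, 32, 400))
print(short_clip.shape, long_clip.shape)
```

Both calls return a tensor of shape `(2, 64)`, even though the inputs differ in length.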

### Diagram
Wave input: [i], Latent representation: [l], Voice embedding: [e], Wave output: [o]
Voice VAE: (VAE), Voice Embedder: (VE), Voice Converter: (VC)

```
[i] -> VAE -> [l] 
[l] -> VE -> [e]
[l] + [e] -> VC -> [l']
[l'] -> VAE -> [o]
```
To create a learnable pattern, we will cross-convert the voices of two different speakers and train the model to reconstruct the original input.
Let [l1] and [l2] be the latent representations of speaker 1 and speaker 2 respectively.
```
[l1] -> VE -> [e1]
[l2] -> VE -> [e2]

[l1] + [e2] -> VC -> [l1']
[l2] + [e1] -> VC -> [l2']

[l1'] -> VE -> [e1']
[l2'] -> VE -> [e2']

[l1'] + [e2'] -> VC -> [l1'']
[l2'] + [e1'] -> VC -> [l2'']
```
Since the embedding is supposed to represent the speaker's voice, e1 and e2', as well as e2 and e1', should be similar, if not identical. Therefore, we can use the distance between e1 and e2' and between e2 and e1' as one part of the loss function to train the model.
The other part of the loss function will be the reconstruction loss of the original latent representation l1/l2 and the final reconstruction l1''/l2''.
This way, the first conversion to l1'/l2', which is the voice clip to the voice of the other speaker, will not be directly part of the loss function, but will be indirectly trained by the requirements of the other parts of the loss function.
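
The two loss terms described above can be sketched in PyTorch as follows. This is a minimal sketch, not the final model: `TinyEmbedder` and `TinyConverter` are hypothetical stand-ins for VE and VC, and using MSE for both terms with equal weight is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEmbedder(nn.Module):
    """Stand-in VE: latent (batch, dim, frames) -> embedding (batch, dim)."""
    def __init__(self, dim=32):
        super().__init__()
        self.proj = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, latent):
        return self.proj(latent).mean(dim=2)

class TinyConverter(nn.Module):
    """Stand-in VC: latent + embedding -> new latent of the same shape."""
    def __init__(self, dim=32):
        super().__init__()
        self.mix = nn.Conv1d(2 * dim, dim, kernel_size=1)

    def forward(self, latent, embedding):
        # repeat the embedding over time and concatenate it along the channel axis
        e = embedding.unsqueeze(2).expand(-1, -1, latent.shape[2])
        return self.mix(torch.cat([latent, e], dim=1))

def cycle_loss(l1, l2, ve, vc):
    e1, e2 = ve(l1), ve(l2)
    l1_conv, l2_conv = vc(l1, e2), vc(l2, e1)                       # l1', l2'
    e1_conv, e2_conv = ve(l1_conv), ve(l2_conv)                     # e1', e2'
    l1_back, l2_back = vc(l1_conv, e2_conv), vc(l2_conv, e1_conv)   # l1'', l2''
    # e2' should match e1, and e1' should match e2
    embedding_loss = F.mse_loss(e2_conv, e1) + F.mse_loss(e1_conv, e2)
    # the double conversion should return to the original latents
    reconstruction_loss = F.mse_loss(l1_back, l1) + F.mse_loss(l2_back, l2)
    return embedding_loss + reconstruction_loss

ve, vc = TinyEmbedder(), TinyConverter()
l1, l2 = torch.randn(4, 32, 100), torch.randn(4, 32, 100)
loss = cycle_loss(l1, l2, ve, vc)
loss.backward()
print(loss.item())
```

Random tensors replace real VAE latents here, so the printed value only confirms that the graph is differentiable end to end.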

In case this idea doesn't work, there are other possibilities, such as using labeled data to control the latent representation, using an architecture similar to a GAN, or including embeddings of the speaker's own voice in the conversion, which should result in an output that is similar, if not identical, to the input.


In [1]:
# quick function to eval model results and throw exception if false
def assert_equals(actual, expected):
    if actual != expected:
        raise Exception("Expected: " + str(expected) + ", Actual: " + str(actual))

## Dataset
While testing different architectures, we will use small parts of the [Mozilla Common Voice](https://voice.mozilla.org/en/datasets) dataset. It includes speech from many different speakers in English and other languages. While the samples are transcribed and also contain other metadata, we will not use any of that information. Instead, we will use the raw audio files during unsupervised training.

### Dataset Class
The dataset class will be responsible for loading the audio files and preprocessing them. We will use the following preprocessing steps during each batch:
- load the audio file
- resample to 16kHz
- normalize the volume
- TODO: add noise or other augmentation
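
The normalization step and the noise augmentation that is still marked as TODO could look roughly like this. It is a sketch under assumptions: the helper names, the `eps` guard, white Gaussian noise, and the default signal-to-noise ratio of 20 dB are illustrative choices, not decisions made in this notebook.

```python
import math
import torch

def peak_normalize(audio, eps=1e-8):
    # scale so the largest absolute sample is (almost exactly) 1;
    # eps avoids a division by zero for silent clips
    return audio / (audio.abs().max() + eps)

def add_white_noise(audio, snr_db=20.0):
    # add Gaussian noise whose power lies snr_db decibels below the signal power
    signal_power = audio.pow(2).mean()
    noise_power = signal_power / (10 ** (snr_db / 10))
    return audio + torch.randn_like(audio) * noise_power.sqrt()

# stand-in clip: one second of a quiet sine wave, shape (channels, samples)
clip = 0.3 * torch.sin(torch.linspace(0, 100, 16000)).unsqueeze(0)
clean = peak_normalize(clip)
noisy = add_white_noise(clean, snr_db=20.0)

# measure the signal-to-noise ratio that was actually obtained
measured = 10 * math.log10((clean.pow(2).mean() / (noisy - clean).pow(2).mean()).item())
print(clean.abs().max().item(), measured)
```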

In [2]:
import torch
import torchaudio
from torch.utils.data import Dataset

import numpy as np
import os # for file path manipulation

import csv # for reading tsv files


# TODO: data augmentation and added noise



# custom dataset class
class SpeechDataset(Dataset):
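    def __init__(self, tsvs=None, sample_rate=16000, transform=None, columns=None):
        # use None instead of mutable default arguments, which would be shared between instances
        self.tsvs = tsvs if tsvs is not None else []
        self.sample_rate = sample_rate
        self.transform = transform
        self.columns = columns if columns is not None else ['path']
        self.data = []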

        # load metadata
        self._load_metadata()

    def augment(self, speech):
        # TODO: implement this
        return speech

    def add_noise(self, speech):
        # TODO: implement this
        return speech


    def split(self, split_ratio=0.8):
        # split data into train and test sets
        # split_ratio is the ratio of training data to test data
        # returns two SpeechDataset objects, one for train and one for test

        # get split index
        split_idx = int(len(self.data) * split_ratio)

        # split data
        train_data = self.data[:split_idx]
        test_data = self.data[split_idx:]

        # create new SpeechDataset objects
        train_dataset = SpeechDataset(sample_rate=self.sample_rate, transform=self.transform, columns=self.columns)
        test_dataset = SpeechDataset(sample_rate=self.sample_rate, transform=self.transform, columns=self.columns)

        # set data
        train_dataset.data = train_data
        test_dataset.data = test_data

        return train_dataset, test_dataset

    def _load_metadata(self):
        self.data = []
        for tsv in self.tsvs:
            dir_path, _ = os.path.split(tsv)
            
            clips = os.path.join(dir_path, 'clips', '')
            
            # read tsv and append to data
            with open(tsv, 'r') as f:
                reader = csv.DictReader(f, delimiter='\t')
                for row in reader:
                    # commonvoice columns:
                    # client_id	path	sentence	up_votes	down_votes	age	gender	accents	variant	locale	segment
                    
                    # get columns
                    data = [row[col] for col in self.columns]
                    if 'path' in self.columns:
                        # convert path to absolute path
                        path_idx = self.columns.index('path')
                        data[path_idx] = clips + data[path_idx]
                    # append to data
                    self.data.append(data)


        # shuffle data
        np.random.shuffle(self.data)

    def get_column_names(self):
        # if path is included, last column is audio data that will be loaded in __getitem__
        if 'path' in self.columns:
            # self.columns + ['audio']
            return self.columns + ['audio']
        else:
            return self.columns
        

    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):

        if torch.is_tensor(idx):
            idx = idx.tolist()
        
        # copy the metadata row so the entry stored in self.data is not modified
        # (appending to the stored list would add another audio tensor on every access)
        sample = list(self.data[idx])
        # load audio (if path is in sample)
        if 'path' in self.columns:
            # load audio
            #print(sample[self.columns.index('path')])
            audio, sample_rate = torchaudio.load(sample[self.columns.index('path')])
            
            # resample audio if necessary
            if sample_rate != self.sample_rate:
                resampler = torchaudio.transforms.Resample(sample_rate, self.sample_rate)
                audio = resampler(audio)

            # normalize the volume to a peak of 1, whether or not the clip was resampled
            # (silent clips are skipped to avoid a division by zero)
            peak = torch.max(torch.abs(audio))
            if peak > 0:
                audio = audio / peak

            # add audio to sample
            sample.append(audio)

        # apply transform if necessary
        if self.transform:
            sample = self.transform(sample)

        return sample


In [3]:
# load the specified tsv files

dataset = SpeechDataset(tsvs=[
    'commonvoice\\cv-corpus-16.0-delta-2023-12-06\\en\\validated.tsv',
    'commonvoice\\cv-corpus-16.0-delta-2023-12-06\\de\\validated.tsv', 
    'commonvoice\\cv-corpus-16.0-delta-2023-12-06\\ja\\validated.tsv'], columns=['path', 'sentence'])

# print some info about the dataset
print('Dataset length:', len(dataset))
print('Dataset columns:', dataset.get_column_names())


# play a random sample to make sure it works
import random
# get a random sample (random.randrange excludes len(dataset), unlike random.randint)
sample = dataset[random.randrange(len(dataset))]
print('Sample:', sample)

# get audio from sample
audio = sample[-1]

# play audio
from IPython.display import Audio


# Play the audio using IPython's Audio widget
audio_widget = Audio(data=audio.numpy(), rate=16000)
display(audio_widget)

Dataset length: 16894
Dataset columns: ['path', 'sentence', 'audio']
Sample: ['commonvoice\\cv-corpus-16.0-delta-2023-12-06\\ja\\clips\\common_voice_ja_39087888.mp3', '大きなのっぽのおじいさん', tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ..., -2.4783e-12,
         -6.1398e-12,  1.1443e-11]])]


## Secondary loss network to train the encoder
Since the encoder and decoder have to downsample and upsample the audio, it is difficult to use MSE loss alone. With MSE, the VAE would have to reconstruct the audio perfectly: being off by even one sample (1/16000 s at the 16 kHz sample rate used here) would be counted as an error. Instead, we will use a secondary loss network to train the encoder. This loss network will be a simple convolutional network that takes the original audio and the reconstructed audio as input and outputs a single value. This value will be the loss for the encoder. The loss network will be trained with MSE loss.
The input to the loss network will be the original audio and copies of the original audio shifted slightly in time. This will allow the loss network to learn to ignore small differences in the audio that are not relevant for the voice.

During the training of the VAE, we will also use the reconstructed audio to train the loss network. This will allow the loss network to learn to ignore the differences introduced by the VAE.
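
The sensitivity of MSE to a one-sample shift can be checked with a small experiment. The sketch below uses pure sine tones as a stand-in for speech, which is an assumption made only to keep the example short; `sine` and `mse` are helper names introduced here.

```python
import math

SAMPLE_RATE = 16000

def sine(freq_hz, n_samples, shift=0):
    # sine wave sampled at SAMPLE_RATE, delayed by `shift` samples
    return [math.sin(2 * math.pi * freq_hz * (n - shift) / SAMPLE_RATE)
            for n in range(n_samples)]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

for freq in (200, 2000, 4000):
    original = sine(freq, SAMPLE_RATE)
    shifted = sine(freq, SAMPLE_RATE, shift=1)
    print(freq, 'Hz:', round(mse(original, shifted), 4))
```

For a pure tone, the error of a one-sample delay is 1 - cos(2πf / 16000), so it grows with frequency: it is small at 200 Hz but reaches 1.0 at 4000 Hz, which is as large as the error between the tone and an unrelated tone of the same amplitude. A perceptually identical reconstruction can therefore be penalized heavily.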

In [4]:
# loss network
import torch.nn as nn

class LossNetwork(nn.Module):

    def __init__(self):
        super(LossNetwork, self).__init__()
        self.conv1 = nn.Conv1d(2, 4, 40, stride=1)
        self.conv2 = nn.Conv1d(4, 4, 20, stride=1)
        self.conv3 = nn.Conv1d(4, 1, 10, stride=1)

        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.conv1(x)
        x = self.relu(x)
        x = self.conv2(x)
        x = self.relu(x)
        x = self.conv3(x)
        x = torch.abs(x)
        x = torch.mean(x, dim=2)

        x = x.view(-1, 1)

        return x

In [5]:
# train network
import torch.optim as optim
from torch.utils.data import DataLoader

def loss_net_collate_fn(batch):
    # each sample is a list whose last element is the audio tensor of shape (channels, samples);
    # flatten every clip to 1D so that pad_sequence can pad along the time axis
    audio = [ b[-1].reshape(-1) for b in batch ]
    
    # zero-pad all clips to the longest clip in the batch -> (batch, samples)
    audio = nn.utils.rnn.pad_sequence(audio, batch_first=True)
    # add channel dimension -> (batch, 1, samples)
    audio = audio.unsqueeze(1)

    # to cuda
    audio = audio.cuda()

    # duplicate tensor (same shape as audio, so no further padding is needed)
    audio2 = audio.clone()

    # TODO: shift audio2 in time by a small random offset
    # random float for each i in batch
    #rand = torch.rand(len(batch)).cuda()
    #gauss = torch.randn(len(batch)).cuda()  * 10
    #move = torch.where(rand > 0.5, gauss, torch.zeros(len(batch)).cuda()).to(torch.int)

    return audio, audio2


# create dataloader
train_dataset, test_dataset = dataset.split()
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True, collate_fn=loss_net_collate_fn)
test_dataloader = DataLoader(test_dataset, batch_size=32, shuffle=True, collate_fn=loss_net_collate_fn)

# create loss network
loss_net = LossNetwork()
loss_net.cuda()

# create optimizer
optimizer = optim.Adam(loss_net.parameters(), lr=0.0001)

# create loss function
loss_fn = nn.MSELoss()

# train
for epoch in range(10):
    running_loss = 0.0
    for i, data in enumerate(train_dataloader):
        # get inputs
        audio, audio2 = data
        batch_loss = 0.0
        for j in range(audio.shape[0]):
            # zero the parameter gradients
            optimizer.zero_grad()

            # target is 0 for the identical pair (j == 0) and 1 for pairs of different clips
            true_loss = 0.0 if j == 0 else 1.0
            # the network returns one value per pair, so the target has shape (1, 1)
            y = torch.full((1, 1), true_loss, device=audio.device)
            # stack audio and audio2 as two channels -> (1, 2, samples)
            x = torch.stack([audio[j], audio2[j]], dim=1)

            # forward + backward + optimize
            outputs = loss_net(x)
            loss = loss_fn(outputs, y)
            loss.backward()
            optimizer.step()

            # roll audio2 along the batch axis so the next pair consists of different clips
            audio2 = torch.roll(audio2, 1, dims=0)

            batch_loss += loss.item()

        # print statistics once per 100 batches
        running_loss += batch_loss / audio.shape[0]
        if i % 100 == 99:
            print('[%d, %5d] loss: %.5f' %
                (epoch + 1, i + 1, running_loss / 100))
            running_loss = 0.0

In [4]:

# baseline models
import torch.nn as nn

SAMPLE_RATE = 16000

def pad_batch(batch):
    if isinstance(batch[0], list):
        # if batch is list of list, get tensor from last element
        batch = [sample[-1].reshape(-1) for sample in batch]
    # pads batch to longest sequence
    # batch is list of samples
    lengths = [len(sample) for sample in batch]
    max_length = max(lengths)
    max_length = (max_length // (64) + 1) * (64)
    # pad to max length
    padded_batch = [torch.nn.functional.pad(sample, (0, max_length - len(sample))) for sample in batch]
    t = torch.stack(padded_batch)
    # to cuda
    t = t.cuda()
    return t

class BaselineEmbedder(nn.Module):
    def __init__(self, sample_rate = SAMPLE_RATE, embedding_dim=32):
        super(BaselineEmbedder, self).__init__()
        self.sample_rate = sample_rate
        self.embedding_dim = embedding_dim

        # lstm layers
        self.lstm = nn.LSTM(input_size=1, hidden_size=embedding_dim, num_layers=3, batch_first=True)

    
    def forward(self, x):
        # x is audio, clips are padded to longest sequence
        # x is (batch_size, samples)

        # reshape to (batch_size, samples, 1)
        x = x.unsqueeze(2)
        x = self.lstm(x)
        # get last hidden state
        x = x[0][:, -1, :]
        x = x.reshape(-1, self.embedding_dim)
        return x
    


In [5]:
baseline = BaselineEmbedder()
# pad_batch moves the batch to the GPU, so the model has to be there as well
baseline.cuda()
print(baseline)

# random.randrange excludes len(dataset), unlike random.randint
batch = [dataset[random.randrange(len(dataset))][-1] for _ in range(16)]
# drop the channel dimension: (1, samples) -> (samples,)
batch = [sample[-1] for sample in batch]
batch = pad_batch(batch)

print('Input shape:', batch.shape)

# get embeddings
embeddings = baseline(batch)
print('Embeddings shape:', embeddings.shape)


BaselineEmbedder(
  (lstm): LSTM(1, 32, num_layers=3, batch_first=True)
)
Input shape: torch.Size([16, 168832])

In [6]:
# VAE and decoder

class print_shape(nn.Module):
    def __init__(self, message):
        super().__init__()
        self.message = message
    
    def forward(self, x):
        print(self.message, x.shape)
        return x

class VAEBase(nn.Module):
    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)

        self.training = True
        # encoder
        # input: audio waveform 16000 samples
        # latent space: 32 dimensions, 100 samples
        # goal: learn latent space representation of audio that is easier to use in RNNs

        self.encoder = nn.Sequential(
            nn.Conv1d(in_channels=1, out_channels=16, kernel_size=20, stride=10, padding=5),
            # samples 16,000 -> 1,600
            nn.ReLU(),
            #nn.Linear(in_features=16, out_features=16),
            #nn.ReLU(),
            nn.Conv1d(in_channels=16, out_channels=32, kernel_size=8, stride=4, padding=2),
            # samples 1,600 -> 400
            nn.ReLU(),
            #nn.Linear(in_features=32, out_features=32),
            #nn.ReLU(),
            nn.Conv1d(in_channels=32, out_channels=32, kernel_size=4, stride=2, padding=1),
            # samples 400 -> 200
            nn.ReLU(),
            #nn.Linear(in_features=32, out_features=32),
            #nn.ReLU(),
            nn.Conv1d(in_channels=32, out_channels=64, kernel_size=4, stride=2, padding=1),
            # samples 200 -> 100
            #nn.ReLU(),
            #nn.Linear(in_features=64, out_features=64),
            nn.Tanh()
        )

        # decoder
        # input: latent space representation
        # output: audio waveform 16000 samples
        # goal: reconstruct original audio waveform from latent space representation

        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(in_channels=32, out_channels=32, kernel_size=4, stride=2, padding=1),
            # samples 100 -> 200
            nn.ReLU(),
            #nn.Linear(in_features=32, out_features=32),
            #nn.ReLU(),
            nn.ConvTranspose1d(in_channels=32, out_channels=32, kernel_size=4, stride=2, padding=1),
            # samples 200 -> 400
            nn.ReLU(),
            #nn.Linear(in_features=32, out_features=32),
            #nn.ReLU(),
            nn.ConvTranspose1d(in_channels=32, out_channels=16, kernel_size=8, stride=4, padding=2),
            # samples 400 -> 1600
            nn.ReLU(),
            #nn.Linear(in_features=16, out_features=16),
            #nn.ReLU(),
            nn.ConvTranspose1d(in_channels=16, out_channels=1, kernel_size=20, stride=10, padding=5),
            # samples 1600 -> 16000
            nn.Tanh()
        )

    def set_training(self, training):
        self.training = training

    def sample(self, mu, log_var):
        # if not self.training:
        #     return mu
        if not self.training:
            return mu
        # reparameterization trick
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + eps * std
    
    def forward(self, x):

        # reshape to (n, 1, length)
        x = x.unsqueeze(1)
        # encode
        x = self.encoder(x)
        # get mu and log_var
        mu = x[:, :32]
        log_var = x[:, 32:]
        # sample from latent space
        z = self.sample(mu, log_var)
        # decode
        x = self.decoder(z)
        if not self.training:
            return x
        
        
        return x, mu, log_var

In [7]:
# VAE v2

class EncoderBlock(nn.Module):
    def __init__(self, in_channels, out_channels, n_conv=2, kernel_size=16, stride=1, padding="same", activation=nn.GELU):
        super().__init__()

        self.in_channels = in_channels
        self.out_channels = out_channels
        self.n_conv = n_conv
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding

        self.conv_in = nn.Conv1d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=padding)
        self.conv_rest = nn.ModuleList([nn.Conv1d(out_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=padding) for _ in range(n_conv)])

        self.activation = activation()
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)


    def residual(self, state, block_input):
        # fit residual to block output shape
        # block_input is (n, in_channels, length)
        # target shape is (n, out_channels, length)

        # repeat channels to match target shape
        block_input = block_input.repeat(1, self.out_channels // self.in_channels, 1)
        # add residual to block output
        state = state + block_input
        return state
    
    def forward(self, x):
        l = self.conv_in(x)
        for conv in self.conv_rest:
            l = self.activation(l)
            l = conv(l)

        # residual connection
        l = self.residual(l, x)

        # pool
        l = self.pool(l)

        # final activation
        l = self.activation(l)

        return l
    
# test encoder block
print("Test Encoder Block")
encoder_block = EncoderBlock(1, 8)
print(encoder_block)
x = torch.randn(16, 1, 16000)
print('Input shape:', x.shape)
x = encoder_block(x)

print("Expected output shape:", (16, 8, 8000))
print('True Output shape:', x.shape)
    
class DecoderBlock(nn.Module):
    def __init__(self, in_channels, out_channels, n_conv=2, kernel_size=3, stride=1, padding=1, activation=nn.GELU):
        super().__init__()

        self.in_channels = in_channels
        self.out_channels = out_channels
        self.n_conv = n_conv
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding

        self.conv_rest = nn.ModuleList([nn.Conv1d(in_channels, in_channels, kernel_size=kernel_size, stride=stride, padding=padding) for _ in range(n_conv)])
        self.conv_out = nn.Conv1d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=padding)

        self.activation = activation()
        self.upsample = nn.Upsample(scale_factor=2)


    def residual(self, state, upsampled):
        # fit residual to block output shape
        # upsampled is (n, in_channels, length)
        # target shape is (n, out_channels, length)
        # average channels to get target shape

        # hack, just return first n channels
        state = state + upsampled[:, :self.conv_out.out_channels, :]
        state = self.activation(state)
        return state


    
    def forward(self, x):
        # upsample
        l = self.upsample(x)
        res = l
        # convolutions
        for conv in self.conv_rest:
            l = conv(l)
            l = self.activation(l)
        # last convolution
        l = self.conv_out(l)
        # activation

        # residual connection
        l = self.residual(l, res)

        return l
    
# test decoder block
print('Test Decoder Block')
decoder_block = DecoderBlock(8, 1)
print(decoder_block)
print('Input shape:', x.shape)
# reuse x from encoder block
x = decoder_block(x)

print("Expected output shape:", (16, 1, 16000))
print('True Output shape:', x.shape)

    
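def get_dimension_count(layer_i, max_dim=64, layer_n=6):
    # channel count at the input of layer `layer_i`:
    # the raw waveform has 1 channel, every later layer uses max_dim channels
    if layer_i == 0:
        return 1
    # a linear ramp from max_dim/2 to max_dim is currently disabled:
    # return min(int(max_dim/2 + layer_i/layer_n*max_dim/2), max_dim)
    return max_dim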

class Encoder(nn.Module):
    def __init__(self, n_layers=6, latent_dim=64):
        super().__init__()

        self.n_layers = n_layers

        self.blocks = nn.ModuleList([EncoderBlock(get_dimension_count(i), get_dimension_count(i+1)) for i in range(n_layers)])

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x
    
class Decoder(nn.Module):
    def __init__(self, n_layers=6, latent_dim=64):
        super().__init__()

        self.n_layers = n_layers

        self.blocks = nn.ModuleList([DecoderBlock(get_dimension_count(n_layers - i), get_dimension_count(n_layers - i - 1)) for i in range(n_layers)])

        # set last block activation to tanh
        self.blocks[-1].activation = nn.Tanh()
        

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x
    
class VAE(nn.Module):
    def __init__(self, n_layers=6, latent_dim=64):
        super().__init__()

        self.n_layers = n_layers
        self.latent_dim = latent_dim

        self.encoder = Encoder(n_layers=n_layers, latent_dim=latent_dim)
        self.decoder = Decoder(n_layers=n_layers, latent_dim=latent_dim)

        self.fc_mu = nn.Linear(get_dimension_count(n_layers), latent_dim)
        self.fc_log_var = nn.Linear(get_dimension_count(n_layers), latent_dim)

    def sample(self, mu, log_var):
        # if not self.training:
        #     return mu
        if not self.training:
            return mu
        # reparameterization trick
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + eps * std
    
    def forward(self, x):
        # encode
        x = self.encoder(x)
        # get mu and log_var
        #mu = self.fc_mu(x[:, :, 0])
        #log_var = self.fc_log_var(x[:, :, 0])
        # sample from latent space
        #z = self.sample(mu, log_var)
        # decode
        z = x
        x = self.decoder(z)
        if not self.training:
            return x
        
        
        #return x, mu, log_var
        return x



        

Test Encoder Block
EncoderBlock(
  (conv_in): Conv1d(1, 8, kernel_size=(16,), stride=(1,), padding=same)
  (conv_rest): ModuleList(
    (0-1): 2 x Conv1d(8, 8, kernel_size=(16,), stride=(1,), padding=same)
  )
  (activation): GELU(approximate='none')
  (pool): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
)
Input shape: torch.Size([16, 1, 16000])
Expected output shape: (16, 8, 8000)
True Output shape: torch.Size([16, 8, 8000])
Test Decoder Block
DecoderBlock(
  (conv_rest): ModuleList(
    (0-1): 2 x Conv1d(8, 8, kernel_size=(3,), stride=(1,), padding=(1,))
  )
  (conv_out): Conv1d(8, 1, kernel_size=(3,), stride=(1,), padding=(1,))
  (activation): GELU(approximate='none')
  (upsample): Upsample(scale_factor=2.0, mode='nearest')
)
Input shape: torch.Size([16, 8, 8000])
Expected output shape: (16, 1, 16000)
True Output shape: torch.Size([16, 1, 16000])




In [13]:
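vae = VAEBase()
# pad_batch moves the batch to the GPU, so the model has to be there as well
vae.cuda()
print(vae)

# get batch (random.randrange excludes len(dataset), unlike random.randint)
batch = [dataset[random.randrange(len(dataset))][-1] for _ in range(16)]
batch = [sample[-1] for sample in batch]
batch = pad_batch(batch)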


print('Input shape:', batch.shape)

# get output
output,_,_ = vae(batch)
print('Output shape:', output.shape)



VAEBase(
  (encoder): Sequential(
    (0): Conv1d(1, 16, kernel_size=(20,), stride=(10,), padding=(5,))
    (1): ReLU()
    (2): Conv1d(16, 32, kernel_size=(8,), stride=(4,), padding=(2,))
    (3): ReLU()
    (4): Conv1d(32, 32, kernel_size=(4,), stride=(2,), padding=(1,))
    (5): ReLU()
    (6): Conv1d(32, 64, kernel_size=(4,), stride=(2,), padding=(1,))
    (7): Tanh()
  )
  (decoder): Sequential(
    (0): ConvTranspose1d(32, 32, kernel_size=(4,), stride=(2,), padding=(1,))
    (1): ReLU()
    (2): ConvTranspose1d(32, 32, kernel_size=(4,), stride=(2,), padding=(1,))
    (3): ReLU()
    (4): ConvTranspose1d(32, 16, kernel_size=(8,), stride=(4,), padding=(2,))
    (5): ReLU()
    (6): ConvTranspose1d(16, 1, kernel_size=(20,), stride=(10,), padding=(5,))
    (7): Tanh()
  )
)
Input shape: torch.Size([16, 156160])
Output shape: torch.Size([16, 1, 156160])
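
The shapes printed above follow from the standard output-length formulas for `nn.Conv1d` and `nn.ConvTranspose1d` (dilation 1, no output padding). The helper names below are introduced only for this check; the layer parameters are copied from `VAEBase`.

```python
def conv1d_out_len(length, kernel_size, stride, padding):
    # output length of nn.Conv1d with dilation 1
    return (length + 2 * padding - kernel_size) // stride + 1

def conv_transpose1d_out_len(length, kernel_size, stride, padding):
    # output length of nn.ConvTranspose1d with dilation 1 and output_padding 0
    return (length - 1) * stride - 2 * padding + kernel_size

# (kernel_size, stride, padding) of the four VAEBase encoder layers;
# the decoder uses the same values in reverse order
LAYERS = [(20, 10, 5), (8, 4, 2), (4, 2, 1), (4, 2, 1)]

def latent_frames(n_samples):
    for k, s, p in LAYERS:
        n_samples = conv1d_out_len(n_samples, k, s, p)
    return n_samples

def decoded_samples(n_frames):
    for k, s, p in reversed(LAYERS):
        n_frames = conv_transpose1d_out_len(n_frames, k, s, p)
    return n_frames

print(latent_frames(16000))                    # 100
print(latent_frames(156160))                   # 976
print(decoded_samples(latent_frames(156160)))  # 156160
print(decoded_samples(latent_frames(16064)))   # 16000
```

The encoder compresses time by a factor of 160. A length such as 16064, which is a multiple of 64 but not of 160, comes back as 16000 samples, so reconstruction and input only have the same shape when the padded length is also a multiple of 160.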


In [8]:
# test full VAE

print('Test VAE')
vae = VAE()

batch = torch.randn(32, 1, 54472)
print('Input shape:', batch.shape)
output = vae(batch)
print('Output shape:', output.shape)





Test VAE
Input shape: torch.Size([32, 1, 54472])
Output shape: torch.Size([32, 1, 54464])
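
The output is 8 samples shorter than the input because each of the 6 encoder blocks halves the length with `MaxPool1d(2)` and drops a trailing odd sample, while each decoder block exactly doubles it. A small sketch of this arithmetic, with helper names introduced only for illustration:

```python
def reconstructed_length(n_samples, n_layers=6):
    # each encoder block halves the length (integer division),
    # each decoder block doubles it again
    length = n_samples
    for _ in range(n_layers):
        length //= 2
    return length * 2 ** n_layers

def pad_to_multiple(n_samples, n_layers=6):
    # smallest length >= n_samples that survives the round trip unchanged
    factor = 2 ** n_layers
    return -(-n_samples // factor) * factor

print(reconstructed_length(54472))                   # 54464
print(pad_to_multiple(54472))                        # 54528
print(reconstructed_length(pad_to_multiple(54472)))  # 54528
```

`pad_batch` avoids the mismatch during training by padding every batch to a multiple of 64. It always pads up to the next multiple, whereas `pad_to_multiple` leaves lengths that are already a multiple of 64 unchanged.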


In [9]:
# VAE training test
import torch.optim as optim
from torch.utils.data import DataLoader
import tqdm

# hyperparameters
BATCH_SIZE = 32
LEARNING_RATE = 0.00001
EPOCHS = 1

# create dataloader
train_dataset, val_dataset = dataset.split(0.9)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=pad_batch)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=pad_batch)


def print_progress(epoch, batch, loss):
    prog_rounded = round(100 * batch / (len(train_loader)-1), 2)
    prog = batch / (len(train_loader)-1)
    prog = int(prog * 20)
    print(f'Epoch: {epoch} | {"#" * prog}{"-" * (20 - prog)} ({prog_rounded}%) | Loss: {loss.item()}', end='\r')

# create model
vae = VAE()
vae.to('cuda')
vae.train()

# create optimizer
optimizer = optim.Adam(vae.parameters(), lr=LEARNING_RATE, weight_decay=0.0001)

# create loss function
loss_fn = nn.MSELoss()

# train
for epoch in range(EPOCHS):
    for i, batch in enumerate(train_loader):
        # zero gradients
        optimizer.zero_grad()
        #batch = batch.to('cuda')
        batch = batch.reshape(batch.shape[0], 1, -1)

        # get output
        output = vae(batch)

        # calculate loss
        loss = loss_fn(output, batch)

        # backpropagate
        loss.backward()

        # update parameters
        optimizer.step()

        # print progress
        print_progress(epoch, i, loss)

    # validate: average the loss over all validation batches
    val_loss = 0.0
    with torch.no_grad():
        vae.eval()
        for batch in tqdm.tqdm(val_loader):
            # get output
            batch = batch.reshape(batch.shape[0], 1, -1)
            output = vae(batch)

            # accumulate loss
            val_loss += loss_fn(output, batch).item()

        vae.train()
    val_loss /= max(len(val_loader), 1)
    print()
    print('Epoch:', epoch, 'Validation loss:', val_loss)



RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


In [14]:
# train vae
import torch.optim as optim
from torch.utils.data import DataLoader
import tqdm

# hyperparameters
BATCH_SIZE = 64
LEARNING_RATE = 0.001
EPOCHS = 5
kl_beta = 0.1

# create dataloader
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=pad_batch)

# create model (pad_batch moves every batch to the GPU, so the model has to be there as well)
vae = VAEBase()
vae.cuda()
vae.train()
# create optimizer
optimizer = optim.Adam(vae.parameters(), lr=LEARNING_RATE)

# train
def l_rate(epoch):
    return 0.001 * 0.5 ** (epoch)

def b_size(epoch):
    return 16 * 2 ** (epoch // 2)

def print_progress(epoch, batch, loss):
    prog = batch / len(dataloader)
    prog = int(prog * 20)
    print(f'Epoch: {epoch} | {"#" * prog}{"-" * (20 - prog)} | Loss: {loss.item()}', end='\r')

for epoch in range(EPOCHS):

    # set learning rate
    for param_group in optimizer.param_groups:
        param_group['lr'] = l_rate(epoch)

    # set batch size
    BATCH_SIZE = b_size(epoch)
    dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=pad_batch)

    # set model to training mode    
    vae.train()

    for i, batch in enumerate(dataloader):
        # zero gradients
        optimizer.zero_grad()

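        # forward pass (the third return value is the log-variance)
        output, mu, log_var = vae(batch)
        output = output.squeeze(1)

        # calculate loss
        reconstruction_loss = torch.nn.functional.mse_loss(output, batch)
        # KL divergence between N(mu, exp(log_var)) and N(0, 1)
        kl_divergence = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
        # total loss: only the reconstruction term is optimized here;
        # kl_divergence and kl_beta are computed but not added to the loss
        loss = reconstruction_loss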

        # backward pass
        loss.backward()

        # update weights
        optimizer.step()

        # print progress
        print_progress(epoch, i, loss)
    print()

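# save model (create the target directory first so torch.save does not fail)
os.makedirs('weights', exist_ok=True)
torch.save(vae.state_dict(), 'weights/vae.pth')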

RuntimeError: Given groups=1, weight of size [4, 1, 3], expected input[1, 16, 152160] to have 1 channels, but got 16 channels instead

In [63]:
#vae = VAE()
#vae.load_state_dict(torch.load('weights/vae.pth'))

vae.eval()

# test vae
# get sample (random.randrange excludes len(dataset), unlike random.randint)
sample = dataset[random.randrange(len(dataset))]
audio = sample[-1]
print(sample[1])

# play before
audio_widget = Audio(data=audio.numpy()[0], rate=16000)
display(audio_widget)

# get reconstruction
# audio is already a tensor of shape (1, samples); move it to the GPU instead of copying it with torch.tensor
audio_tensor = audio.to('cuda')
with torch.no_grad():
    output = vae(audio_tensor)

# play after
output_np = output.cpu().numpy()[0]
output_np = output_np.reshape(-1)
#print(output_np[:100])
audio_widget = Audio(data=output_np, rate=16000)
display(audio_widget)


Er bietet bei klarem Wetter eine weit reichende Rundumsicht.


