# Re-implementation of AudioLM

This notebook introduces a reimplementation of the **AudioLM** network, proposed in the paper *"AudioLM: a Language Modeling Approach to Audio Generation"* (https://arxiv.org/abs/2209.03143).

AudioLM is a state-of-the-art framework built to **generate high-quality audio** while preserving **long-term consistency**. Trained on a large corpus of audio data, AudioLM is able to provide **natural and coherent audio continuations**, given short initial prompts. The network is also able to **maintain speaker identity**, finding a good trade-off between audio quality and semantic coherence. AudioLM can also produce good-quality musical continuations from a short prompt, but this is not discussed in this notebook.


## Import and settings

In [7]:
# basic import
import torch
import torchaudio
import numpy as np
import random
import pytorch_lightning as pl
import os

# data imports
from hubertKM import SemanticTokenizer
from SoundStream import soundstream_16khz
from data import TokensDataset, store_from_librilight
from torch.utils.data import DataLoader, random_split

# model imports
from TransformerModel import SemanticTransformer, CoarseTransformer, FineTransformer

In [4]:
seed = 42
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)  # If using CUDA
np.random.seed(seed)
random.seed(seed)

## Converting Audio Data into Tokens

The most important novelty provided by AudioLM is the use of a **mixed tokenization approach**, which sets it apart from earlier language-modeling approaches to audio generation. As shown below, there are two tokenization processes that can proceed in parallel.

<center><img src="reportImg/1.png"/></center>

To keep information about language syntax and semantic content in speech, the audio waveform is passed through a **w2v-BERT** model that, combined with a K-Means quantizer, returns a set of **Semantic tokens**.

On the other hand, the network also needs to maintain information about the acoustic features of the audio, in particular pronunciation and speaker identity. To do so, the audio waveform is passed through a pretrained audio codec, **SoundStream**, which builds an internal hierarchical representation of the audio. Through those representations, called **Acoustic tokens**, the audio is divided into different components, going from the most basic structural audio features (defined as **Coarse Acoustic tokens**) to the fine acoustic details (defined as **Fine Acoustic tokens**).

By modeling both semantic and acoustic tokens within the same framework, the semantic tokens ensure long-term consistency, while the acoustic tokens ensure high-quality audio synthesis.

<center><img src="reportImg/2.png"/></center>


### Semantic tokens with w2v-BERT-like model

In the original paper, the semantic tokens are computed through **w2v-BERT**, a recent model trained to learn **self-supervised audio representations**. When trained on large speech corpora, w2v-BERT learns to map the input audio waveform to a rich set of linguistic features.

<center><img src="reportImg/w2vbert.png"/></center>

The model is based on **Conformer Blocks**, a variation of Transformer blocks augmented with convolution, and during training it uses a combination of two self-supervised objectives: a **masked language modeling (MLM) loss** and a **contrastive loss**. Even though this model can be fine-tuned for different tasks, **AudioLM relies only on the pre-trained version**, using the Context Vectors of a specific MLM layer as embeddings.
Finally, the embeddings are passed through a **K-Means quantizer**, which converts each embedding into the index of its nearest centroid; these indices are used as Semantic tokens. The paper proposes K=1024 clusters for the quantizer and takes as Context Vectors the normalized output of the 7th MLM layer.

Regarding our implementation, since **w2v-BERT is a closed-source project**, we opted for a **HuBERT** model as an alternative. We decided to work with a **pretrained version of HuBERT-Base and K-Means** quantizer provided by Fairseq, at https://github.com/facebookresearch/fairseq/blob/main/examples/hubert. In the following model, we use K=500 clusters and the output is taken from the 9th layer of HuBERT.
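
The quantization step described above can be sketched in a few lines. This is only an illustration on toy data, not the code inside `SemanticTokenizer`: the function name `quantize`, the 2-D centroids and the example frames are all hypothetical, and the real model would use 500 centroids in the HuBERT embedding space.

```python
import numpy as np

def quantize(embeddings: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Map each frame embedding to the index of its nearest centroid."""
    # Squared Euclidean distance between every frame (T, D) and every centroid (K, D).
    diffs = embeddings[:, None, :] - centroids[None, :, :]
    distances = (diffs ** 2).sum(axis=-1)          # shape (T, K)
    return distances.argmin(axis=1)                # shape (T,)

# Toy example (hypothetical values): 3 centroids and 4 frames in 2 dimensions.
centroids = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
frames = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, 1.9], [0.2, 0.1]])

tokens = quantize(frames, centroids)
print(tokens)  # one centroid index per frame
```

Each frame is replaced by a single integer, which is why the semantic token sequence printed later in this section is a flat array of indices.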

In [5]:
w2vBERT = SemanticTokenizer("facebook/hubert-base-ls960","./hubertKM/hubert_base_ls960_L9_km500.bin") 



In [16]:
waveform, sampleRate = torchaudio.load("exampleAudio.flac")

with torch.no_grad():
    semanticTokens, embeddings = w2vBERT(waveform)

print("Embeddings taken from HuBERT\n")
print(embeddings)
print("\nSemantic Tokens\n")
print(semanticTokens)

Embeddings taken from HuBERT

tensor([[-0.1961, -0.2603,  0.1180,  ...,  0.2637, -0.0621,  0.3440],
        [-0.1926, -0.2778,  0.1176,  ...,  0.2671, -0.0666,  0.4041],
        [-0.2709, -0.2252,  0.0270,  ...,  0.1852, -0.1773,  0.7314],
        ...,
        [-0.1882, -0.3348,  0.0669,  ...,  0.1321, -0.2052,  0.1334],
        [-0.2268, -0.4392,  0.0788,  ...,  0.2549, -0.2078,  0.0664],
        [-0.2917, -0.2706,  0.1425,  ...,  0.2663, -0.1193,  0.1476]])

Semantic Tokens

[ 17  17 289 ... 160 160 193]


### Acoustic tokens with SoundStream codec model

The original framework computes the acoustic tokens using **SoundStream**, a state-of-the-art **neural audio codec** that, through a convolutional encoder, is able to **map the input audio to a representation with a lower sampling rate**. This allows the model to retain information about acoustic features while affecting the audio quality as little as possible.

<center><img src="reportImg/soundstream.png"/></center>

The model follows a **common Encoder-Decoder structure**, while the middle part, a **Residual Vector Quantizer** (RVQ), is responsible for capturing all the **meaningful acoustic aspects** of the input waveform, arranging them in a hierarchical order. The codec achieves high quality by being trained end-to-end with a combination of **reconstruction and adversarial losses**.

For AudioLM, the **acoustic tokens are the outputs** of the RVQ component (the codebook indices selected by each quantizer), in quantizer order. As already mentioned, the first quantizers capture elementary acoustics, while the last quantizers encapsulate finer audio details. The paper proposes 12 quantizer layers (where the first $Q' = 4$ are used for Coarse tokens), along with a codebook size of 1024 and 4 encoder/decoder blocks.

In our implementation, we used a **pretrained SoundStream model** to reduce training time, provided by https://github.com/kaiidams/soundstream-pytorch. In this version of the model, **the number of quantizers is reduced** to 8, 2 of which are dedicated to Coarse tokens.
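
To make the idea of residual quantization more concrete, here is a minimal sketch of an RVQ encode/decode pass on a single vector. It is not the SoundStream implementation: the codebooks are random, the names `rvq_encode` and `rvq_decode` are hypothetical, and a zero codeword is added to every codebook only so that the toy reconstruction error can never grow from one stage to the next.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize x with a cascade of codebooks.
    Each stage encodes the residual left over by the previous stages."""
    residual = x.copy()
    indices = []
    for codebook in codebooks:                     # codebook shape: (size, dim)
        dists = ((residual[None, :] - codebook) ** 2).sum(axis=1)
        idx = int(dists.argmin())
        indices.append(idx)
        residual = residual - codebook[idx]        # pass what is left to the next stage
    return indices

def rvq_decode(indices, codebooks):
    """Sum the selected codeword of every stage."""
    return sum(cb[i] for i, cb in zip(indices, codebooks))

rng = np.random.default_rng(0)
n_quantizers, n_coarse = 8, 2                      # values used in this notebook
codebooks = []
for q in range(n_quantizers):
    cb = rng.normal(scale=1.0 / 2 ** q, size=(16, 4))   # toy codebooks, finer at each stage
    cb[0] = 0.0                                    # toy choice: zero codeword keeps the error non-increasing
    codebooks.append(cb)

x = rng.normal(size=4)
indices = rvq_encode(x, codebooks)
coarse_idx, fine_idx = indices[:n_coarse], indices[n_coarse:]

err_coarse = np.linalg.norm(x - rvq_decode(coarse_idx, codebooks[:n_coarse]))
err_full = np.linalg.norm(x - rvq_decode(indices, codebooks))
print(err_coarse, err_full)
```

Decoding only the coarse indices gives a rough reconstruction, and the fine indices refine it, which is the hierarchy the Coarse and Fine tokens rely on.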

In [20]:
soundStream = soundstream_16khz()

In [42]:
from SoundStream import audio_to_tokens, decode_audio, tokens_to_audio
from IPython.display import Audio, display

waveform, sampleRate = torchaudio.load("exampleAudio.flac")

coarse, fine = audio_to_tokens(waveform, soundStream)

print("Coarse Tokens:")
print(coarse)

print("\nFine Tokens:")
print(fine)

reconstructedWaveform = tokens_to_audio(coarse, fine, soundStream)

print("\nOriginal Audio:\n")
display(Audio(waveform, rate = sampleRate))

print("\nReconstructed Audio:\n")
display(Audio(reconstructedWaveform, rate = sampleRate))

Coarse Tokens:
tensor([  20, 1601, 2287,  ...,  890, 1975, 2626])

Fine Tokens:
tensor([3774, 4108, 5947,  ..., 5890, 6705, 7568])

Original Audio:




Reconstructed Audio:



### Token Dataset creation

Now that we have defined models for both the Semantic and Acoustic tokenization, we can proceed with **converting audio data into tokens**. We **perform this operation separately** from the AudioLM training, in order to alleviate computational costs.

The data used for the model comes from **Libri-Light** (at https://github.com/facebookresearch/libri-light), a high-quality collection of audiobooks narrated by different speakers. The dataset already comes split into training data and fine-tuning data.

The original paper uses *unlab-60k*, the combination of the small, medium and large subsets, for a total of about **60000 hours** of recordings. Since we do not have the computational resources to train a network on such an amount of data, we used the **fine-tuning version**, which amounts to about **10 hours** of audio data.

Our implementation takes as input the data from the **Libri-Light Limited Torch dataset** (refer to https://pytorch.org/audio/main/generated/torchaudio.datasets.LibriLightLimited.html), and returns a **CSV file** containing the Semantic, Coarse and Fine tokens.

In [13]:
tokenPath = "out" ## output file directory
tokenFile = "tokens.csv" ## output file name
audioPath = "data" ## data_location

In [14]:
#fileCount = storeTokens(audioPath, tokenPath, tokenFile, w2vBERT, soundStream, fileCountCheckpoint = 10)
fileCount = store_from_librilight(tokenPath, tokenFile, w2vBERT, soundStream, fileCountCheckpoint = 10, subset = "10h")

100%|███████████████████████████████████████████████████████████████████████████████| 570M/570M [00:38<00:00, 15.7MB/s]


SAVED 10 AUDIO ON OUTPUT out\tokens.csv. Total of 10 records saved.


KeyboardInterrupt: 

In [15]:
AUDIO_LENGTH = 30
CROP_LENGTH = [15,5,1]

semanticDataset = TokensDataset(tokenPath, tokenFile, mode = "semantic", expected_audio_length = AUDIO_LENGTH, crop_length = CROP_LENGTH)
coarseDataset = TokensDataset(tokenPath, tokenFile, mode = "coarse", expected_audio_length = AUDIO_LENGTH, crop_length = CROP_LENGTH)
fineDataset = TokensDataset(tokenPath, tokenFile, mode = "fine", expected_audio_length = AUDIO_LENGTH, crop_length = CROP_LENGTH)
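
As a rough illustration of what cropping a token sequence to a fixed duration involves, the sketch below converts a crop length in seconds into a number of tokens and extracts a random contiguous window. It is not the code of `TokensDataset`: the helper `random_crop` is hypothetical, and it assumes that the entries of `CROP_LENGTH` are durations in seconds. The only rate used is the one of HuBERT-Base, which emits one frame every 20 ms (50 tokens per second) on 16 kHz audio.

```python
import random

# HuBERT-Base produces one frame every 20 ms on 16 kHz audio.
SEMANTIC_TOKENS_PER_SECOND = 50

def random_crop(tokens, crop_seconds, tokens_per_second, rng=random):
    """Return a random contiguous window covering crop_seconds of audio.
    Sequences shorter than the window are returned unchanged."""
    length = int(crop_seconds * tokens_per_second)
    if len(tokens) <= length:
        return list(tokens)
    start = rng.randrange(len(tokens) - length + 1)
    return list(tokens[start:start + length])

# Toy example: 30 seconds of semantic tokens cropped to 15 seconds.
random.seed(0)
sequence = list(range(30 * SEMANTIC_TOKENS_PER_SECOND))
crop = random_crop(sequence, 15, SEMANTIC_TOKENS_PER_SECOND)
print(len(crop))
```

For acoustic tokens the same helper would need the codec frame rate multiplied by the number of quantizers in the stage, which depends on the SoundStream configuration.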

## AudioLM: a transformer-based audio model

Once we have converted our data into token sequences, we can start defining the **generator** model. The AudioLM network is based on three decoder-only Transformers, each of them dedicated to the **auto-regressive** generation of a specific kind of token.
During inference, we first generate the new semantic tokens, and then use them to condition the generation of new acoustic tokens. This structure relies on the assumption that *semantic tokens are conditionally independent of past acoustic tokens* given past semantic tokens:
$$
  p(z_{t}|z_{\lt t},y_{\lt t}) \simeq p(z_{t}|z_{\lt t})
$$
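
Auto-regressive training of a stage amounts to next-token prediction: the input is the token sequence and the target is the same sequence shifted by one position. The sketch below shows this on a handful of semantic tokens. It assumes a standard cross-entropy objective; the helpers `next_token_pairs` and `cross_entropy` are hypothetical and are not taken from the `TransformerModel` module.

```python
import numpy as np

def next_token_pairs(tokens):
    """Inputs are tokens[:-1]; targets are the same sequence shifted left by one."""
    tokens = np.asarray(tokens)
    return tokens[:-1], tokens[1:]

def cross_entropy(logits, targets):
    """Mean negative log-likelihood of targets under softmax(logits); logits: (T, V)."""
    logits = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

inputs, targets = next_token_pairs([17, 17, 289, 160, 193])

# A model that knows nothing predicts a uniform distribution over the vocabulary,
# so its loss equals log(vocab_size).
vocab_size = 500
uniform_logits = np.zeros((len(targets), vocab_size))
loss = cross_entropy(uniform_logits, targets)
print(inputs, targets, loss)
```

The value `log(vocab_size)` is a useful baseline: a training loss that stays close to it suggests the model has not learned anything about the token distribution yet.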



### Some implementation details

In the original paper, each generator follows **the same decoder-only Transformer** structure, with:
- 12 decoder layers
- 16 attention heads
- 1024 as embedding dimension
- 4096 as dimension for the feed-forward layer
- 0.1 as dropout value
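
To give a sense of scale, the sketch below compares the configuration listed above with the reduced one used in the training cells of this notebook (3 layers, embedding dimension 256, feed-forward dimension 1024). It is a rough estimate that counts only the attention and feed-forward weight matrices; embeddings, biases, layer normalization and positional parameters are ignored, and the helper names are hypothetical.

```python
def approx_layer_params(d_model, dim_feedforward):
    """Weight matrices of one decoder layer (rough estimate)."""
    attention = 4 * d_model * d_model              # Q, K, V and output projections
    feed_forward = 2 * d_model * dim_feedforward   # two linear layers
    return attention + feed_forward

def approx_params(num_layers, d_model, dim_feedforward):
    return num_layers * approx_layer_params(d_model, dim_feedforward)

paper = approx_params(num_layers=12, d_model=1024, dim_feedforward=4096)
reduced = approx_params(num_layers=3, d_model=256, dim_feedforward=1024)
ratio = paper / reduced
print(paper, reduced, ratio)
```

Under these assumptions the reduced model has 64 times fewer weights per stage, which is what makes training on limited hardware feasible.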

The original model is also enriched with a **T5-style relative positional encoding**, that is trained along with the three stage transformers. For this reimplementation, we adapted an implementation of the paper *"Self-Attention with Relative Position Representations"* provided by https://github.com/AliHaiderAhmad001/Self-Attention-with-Relative-Position-Representations.
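
The core of *"Self-Attention with Relative Position Representations"* is that attention depends on the clipped distance between positions rather than on absolute positions. The sketch below builds the lookup index for such a scheme. It is only an illustration: the function name and the `max_distance` parameter are hypothetical and are not taken from the linked repository. Note that T5 uses bucketed scalar biases instead, so this is an approximation of the original setup.

```python
import numpy as np

def relative_position_index(seq_len, max_distance):
    """Entry (i, j) is the clipped offset j - i, shifted to the range [0, 2 * max_distance],
    so it can index a table of 2 * max_distance + 1 learned embeddings."""
    positions = np.arange(seq_len)
    relative = positions[None, :] - positions[:, None]        # j - i
    clipped = np.clip(relative, -max_distance, max_distance)
    return clipped + max_distance

index = relative_position_index(seq_len=5, max_distance=2)
print(index)
```

All pairs of positions farther apart than `max_distance` share the same embedding, so the number of positional parameters does not grow with the sequence length.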

### Semantic Transformer: expanding a sentence snippet

<center><img src="reportImg/semantic.png"/></center>

The first transformer is responsible for the generation of new semantic tokens, estimating $ p(z_{t}|z_{\lt t}) $, where $ z_{i} $ denotes the $i$-th semantic token. In this way, given the semantic aspects of the prompt, the network is able to generate **new coherent content**, keeping long-term consistency.
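
At inference time, estimating $ p(z_{t}|z_{\lt t}) $ translates into a sampling loop: predict a distribution over the next token, sample from it, append the result and repeat. The sketch below shows that loop with a toy stand-in for the Semantic Transformer; `generate`, `toy_model` and the `temperature` parameter are hypothetical and do not reflect the interface of the `SemanticTransformer` class.

```python
import numpy as np

def generate(next_token_logits, prompt, num_new_tokens, temperature=1.0, seed=0):
    """Extend the prompt one token at a time by sampling from the predicted distribution."""
    rng = np.random.default_rng(seed)
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        logits = np.asarray(next_token_logits(tokens), dtype=float) / temperature
        probs = np.exp(logits - logits.max())      # softmax, numerically stable
        probs /= probs.sum()
        tokens.append(int(rng.choice(len(probs), p=probs)))
    return tokens

def toy_model(tokens, vocab_size=500):
    """Stand-in for a trained model: strongly prefers (last token + 1) mod vocab_size."""
    logits = np.zeros(vocab_size)
    logits[(tokens[-1] + 1) % vocab_size] = 50.0
    return logits

continuation = generate(toy_model, prompt=[17, 17, 289], num_new_tokens=4)
print(continuation)
```

With a trained model in place of `toy_model`, lowering `temperature` makes the continuation more conservative, while raising it increases diversity.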

In [26]:
TRAINING_PERCENTAGE = 0.8
BATCH_SIZE = 16

# the Semantic Transformer is trained on the semantic tokens
train_dataset, valid_dataset = random_split(semanticDataset, [TRAINING_PERCENTAGE, 1 - TRAINING_PERCENTAGE])
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=BATCH_SIZE, shuffle=False)

trainer = pl.Trainer(
    max_epochs=30,
    accelerator='gpu' if torch.cuda.is_available() else 'cpu',
    log_every_n_steps=1,
    #devices=1 if torch.cuda.is_available() else None
)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


In [27]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

checkpoint_path = "checkpoints/semantic-checkpoint-10h-colab.ckpt"
d_model=256 #1024
num_layers=3 #12
num_heads=4 #16
dim_feedforward=1024 #4096
audioDuration=15 #30
vocab_size=500


if os.path.exists(checkpoint_path):
    print(f"Checkpoint found at {checkpoint_path}. Resuming old model...")
    model = SemanticTransformer.load_from_checkpoint(checkpoint_path, d_model=d_model, num_layers = num_layers, num_heads=num_heads, k = int(d_model/num_heads), dim_feedforward=dim_feedforward, audioDuration = audioDuration, vocab_size = vocab_size, myDevice = device)
else:
    print("No checkpoint found. Starting from scratch...")
    model = SemanticTransformer(d_model=d_model, num_layers = num_layers, num_heads=num_heads, k = int(d_model/num_heads), dim_feedforward=dim_feedforward, audioDuration = audioDuration, vocab_size = vocab_size, myDevice = device)

No checkpoint found. Starting from scratch...


In [None]:
model.train()
if os.path.exists(checkpoint_path):
    print(f"Checkpoint found at {checkpoint_path}. Resuming training...")
    # Pass the checkpoint path to trainer.fit
    trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=valid_loader, ckpt_path=checkpoint_path)
else:
    print("No checkpoint found. Starting from scratch...")
    # Start training from scratch
    trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=valid_loader)

### Coarse Transformer: generating new audio

<center><img src="reportImg/coarse.png"/></center>

Once we have defined a mechanism to generate new semantic content, we need to generate the corresponding acoustic content. Since we want the new tokens to be **coherent both with the sentence meaning** (represented by the Semantic tokens) and with the **original speaker and environment conditions** (represented by the past Coarse tokens), this model is trained to estimate:
$$ p(y_t^q \mid z, y_{<t}^{\leq Q'}, y_t^{<q})  \quad \text{for} \ q \le Q'$$
where the coarse acoustic token $y_t^q$ being generated depends on the **Semantic tokens** $z$ and on the **Coarse acoustic tokens** $ y_{<t}^{\leq Q'} $ and $ y_t^{<q} $, which refer respectively to the **previous time steps** and to the **lower quantizer levels of the current time step**. In the formula, $Q'$ denotes the **number of SoundStream quantizers** dedicated to Coarse tokens.
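
The conditioning in the formula follows from how the tokens are ordered in the sequence given to the Transformer. In the paper, acoustic tokens are flattened in row-major order (all quantizers of time step 1, then all quantizers of time step 2, and so on) and the semantic tokens are placed in front as conditioning. The sketch below shows that layout on toy values; the function name is hypothetical, and the exact layout used by `CoarseTransformer` (separator tokens, per-quantizer offsets) is not shown in this notebook, so none is assumed here.

```python
import numpy as np

def flatten_row_major(acoustic):
    """(T, Q) matrix of token indices -> sequence y_1^1, y_1^2, ..., y_2^1, y_2^2, ..."""
    return np.asarray(acoustic).reshape(-1)

# Toy example with T = 3 time steps and Q' = 2 coarse quantizers.
coarse = np.array([[11, 12],
                   [21, 22],
                   [31, 32]])
semantic = np.array([5, 7, 9])

flat_coarse = flatten_row_major(coarse)
stage_input = np.concatenate([semantic, flat_coarse])   # semantic tokens act as a prefix
print(flat_coarse)
print(stage_input)
```

With this ordering, a causal Transformer predicting the token at position $(t, q)$ sees exactly the semantic prefix, all coarse tokens of earlier time steps and the lower quantizer levels of the current one.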


In [None]:
TRAINING_PERCENTAGE = 0.8
BATCH_SIZE = 16

train_dataset, valid_dataset = random_split(coarseDataset, [TRAINING_PERCENTAGE, 1 - TRAINING_PERCENTAGE])
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=BATCH_SIZE, shuffle=False)

trainer = pl.Trainer(
    max_epochs=30,
    accelerator='gpu' if torch.cuda.is_available() else 'cpu',
    log_every_n_steps=1,
    #devices=1 if torch.cuda.is_available() else None
)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

checkpoint_path = "checkpoints/coarse-checkpoint-10h-colab.ckpt"
d_model=256 #1024
num_layers=3 #12
num_heads=4 #16
dim_feedforward=1024 #4096
audioDuration=5 #10
vocab_size=1024


if os.path.exists(checkpoint_path):
    print(f"Checkpoint found at {checkpoint_path}. Resuming old model...")
    model = CoarseTransformer.load_from_checkpoint(checkpoint_path, d_model=d_model, num_layers = num_layers, num_heads=num_heads, k = int(d_model/num_heads), dim_feedforward=dim_feedforward, audioDuration = audioDuration, vocab_size = vocab_size, myDevice = device)
else:
    print("No checkpoint found. Starting from scratch...")
    model = CoarseTransformer(d_model=d_model, num_layers = num_layers, num_heads=num_heads, k = int(d_model/num_heads), dim_feedforward=dim_feedforward, audioDuration = audioDuration, vocab_size = vocab_size, myDevice = device)

In [None]:
model.train()
if os.path.exists(checkpoint_path):
    print(f"Checkpoint found at {checkpoint_path}. Resuming training...")
    # Pass the checkpoint path to trainer.fit
    trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=valid_loader, ckpt_path=checkpoint_path)
else:
    print("No checkpoint found. Starting from scratch...")
    # Start training from scratch
    trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=valid_loader)

### Fine Transformer: rebuilding a detailed audio waveform

<center><img src="reportImg/fine.png"></center>

Finally, we need to generate the **acoustic details** of the extended audio, since we need to recreate the **hierarchical structure** of acoustic features required by the SoundStream model. This is done through a third and last Transformer that, given the Coarse tokens and the previous Fine tokens, is trained to generate new Fine tokens. In formula:

$$ p(y_t^q \mid y^{\leq Q'}, y_{<t}^{> Q'}, y_t^{<q}) \quad \text{for} \ q > Q' $$

where the new fine token $y_t^q$ depends on the **Coarse tokens** $y^{\leq Q'}$ and on the **Fine tokens** $y_{<t}^{> Q'}$ and $y_t^{<q}$, which refer respectively to the **previous time steps** and to the **lower quantizer levels of the current time step**.

In [None]:
TRAINING_PERCENTAGE = 0.8
BATCH_SIZE = 16

# the Fine Transformer is trained on the fine tokens
train_dataset, valid_dataset = random_split(fineDataset, [TRAINING_PERCENTAGE, 1 - TRAINING_PERCENTAGE])
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=BATCH_SIZE, shuffle=False)

trainer = pl.Trainer(
    max_epochs=30,
    accelerator='gpu' if torch.cuda.is_available() else 'cpu',
    log_every_n_steps=1,
    #devices=1 if torch.cuda.is_available() else None
)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

checkpoint_path = "checkpoints/fine-checkpoint-10h-colab.ckpt"
d_model=256 #1024
num_layers=3 #12
num_heads=4 #16
dim_feedforward=1024 #4096
audioDuration=1.5 #3
vocab_size=1024


if os.path.exists(checkpoint_path):
    print(f"Checkpoint found at {checkpoint_path}. Resuming old model...")
    model = FineTransformer.load_from_checkpoint(checkpoint_path, d_model=d_model, num_layers = num_layers, num_heads=num_heads, k = int(d_model/num_heads), dim_feedforward=dim_feedforward, audioDuration = audioDuration, vocab_size = vocab_size, myDevice = device)
else:
    print("No checkpoint found. Starting from scratch...")
    model = FineTransformer(d_model=d_model, num_layers = num_layers, num_heads=num_heads, k = int(d_model/num_heads), dim_feedforward=dim_feedforward, audioDuration = audioDuration, vocab_size = vocab_size, myDevice = device)

In [None]:
model.train()
if os.path.exists(checkpoint_path):
    print(f"Checkpoint found at {checkpoint_path}. Resuming training...")
    # Pass the checkpoint path to trainer.fit
    trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=valid_loader, ckpt_path=checkpoint_path)
else:
    print("No checkpoint found. Starting from scratch...")
    # Start training from scratch
    trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=valid_loader)

Once we have generated both Coarse and Fine acoustic tokens, we can reshape them into a format that matches the output of the Residual Vector Quantizer component of SoundStream. Finally, we pass the Acoustic tokens to the Decoder part of SoundStream, which returns the final audio waveform.
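
The reshaping step can be sketched as follows, assuming both sequences were flattened in row-major order (time step first, then quantizer). The helper `to_rvq_matrix` is hypothetical and is not the code of `tokens_to_audio`; if the stored tokens carry per-quantizer offsets, those would have to be removed before decoding.

```python
import numpy as np

def to_rvq_matrix(flat_coarse, flat_fine, n_coarse, n_fine):
    """Rebuild the (T, n_coarse + n_fine) matrix of codebook indices expected by an RVQ decoder."""
    coarse = np.asarray(flat_coarse).reshape(-1, n_coarse)
    fine = np.asarray(flat_fine).reshape(-1, n_fine)
    if coarse.shape[0] != fine.shape[0]:
        raise ValueError("coarse and fine tokens cover a different number of time steps")
    return np.concatenate([coarse, fine], axis=1)

# Toy example: 2 time steps, 2 coarse and 6 fine quantizers (8 in total, as in this notebook).
flat_coarse = [1, 2, 3, 4]
flat_fine = list(range(10, 22))
codes = to_rvq_matrix(flat_coarse, flat_fine, n_coarse=2, n_fine=6)
print(codes)
```

Each row of the resulting matrix holds the indices of all quantizers for one time step, which is the information an RVQ decoder needs to sum the codewords and reconstruct the latent frame.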

<center><img src="reportImg/full.png"></center>

## Inference and results

## References

- AudioLM paper: https://arxiv.org/abs/2209.03143
- AudioLM blog post: https://research.google/blog/audiolm-a-language-modeling-approach-to-audio-generation/
- HuBERT and K-Means implementation: https://github.com/facebookresearch/fairseq/blob/main/examples/hubert
- w2v-BERT paper: https://arxiv.org/abs/2108.06209
- SoundStream implementation: https://github.com/kaiidams/soundstream-pytorch
- SoundStream paper: https://arxiv.org/abs/2107.03312
- LibriLight full dataset: https://github.com/facebookresearch/libri-light/tree/main
- LibriLight Limited on pyTorch: https://pytorch.org/audio/main/generated/torchaudio.datasets.LibriLightLimited.html
- Relative positional embeddings implementation: https://github.com/AliHaiderAhmad001/Self-Attention-with-Relative-Position-Representations
  