# Re-implementation of AudioLM

This notebook presents a reimplementation of the **AudioLM** network, proposed in the paper *"AudioLM: a Language Modeling Approach to Audio Generation"* (https://arxiv.org/abs/2209.03143).

AudioLM is a state-of-the-art framework designed to **generate high-quality audio** while preserving **long-term consistency**. Trained on a large corpus of audio data, AudioLM produces **natural and coherent audio continuations** from short initial prompts. The network is also able to **maintain speaker identity**, striking a good trade-off between audio quality and semantic coherence. AudioLM can also produce good-quality musical continuations from a short prompt, but this is not discussed in this notebook.


## Import and settings

In [None]:
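# Imports used by the cells below (pytorch_lightning is assumed from the `pl` alias)
import os
import random

import numpy as np
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, random_split

# SemanticTransformer, CoarseTransformer and FineTransformer must also be imported
# from the project's model module, which is not shown in this notebook.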

In [None]:
seed = 42
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)  # If using CUDA
np.random.seed(seed)
random.seed(seed)

## Converting Audio Data into Tokens

The key novelty of AudioLM is its **mixed tokenization approach**: the same waveform is mapped to two complementary token sequences, one semantic and one acoustic. As shown below, the two tokenization processes are independent of each other and can run in parallel.

<center><img src="reportImg/1.png"/></center>

To retain information about language syntax and semantic content in speech, the audio waveform is passed through a **w2v-BERT** model whose embeddings are discretized by a K-Means quantizer, yielding a sequence of **Semantic tokens**.
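The quantization step can be illustrated with a minimal sketch: each frame embedding is replaced by the index of its nearest K-Means centroid. The function name `quantize_to_tokens`, the 2-D embeddings and the three centroids are toy assumptions for illustration only; they are not the `SemanticTokenizer` implementation used in this notebook.

```python
import numpy as np

def quantize_to_tokens(embeddings: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Map each frame embedding to the index of its nearest centroid."""
    # Squared Euclidean distance between every frame (T, D) and every centroid (K, D).
    diffs = embeddings[:, None, :] - centroids[None, :, :]
    distances = (diffs ** 2).sum(axis=-1)  # shape (T, K)
    return distances.argmin(axis=1)        # shape (T,)

# Toy data (hypothetical): 3 centroids in 2-D and 4 frames lying close to them.
centroids = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
frames = np.array([[0.1, -0.1], [0.9, 1.2], [-1.1, 0.8], [0.05, 0.0]])

semantic_tokens = quantize_to_tokens(frames, centroids)
print(semantic_tokens)  # [0 1 2 0]
```

With a real model, `frames` would be the per-frame embeddings of an intermediate layer, and the number of centroids would set the semantic vocabulary size.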

The network also needs to retain information about the acoustic features of the audio, in particular pronunciation and speaker identity. To this end, the audio waveform is passed through a pretrained audio codec, **SoundStream**, which builds an internal hierarchical representation of the audio. Through these representations, called **Acoustic tokens**, the audio is decomposed into components ranging from the most basic structural audio features (the **Coarse Acoustic tokens**) to the fine acoustic details (the **Fine Acoustic tokens**).
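The hierarchy comes from residual vector quantization, the scheme SoundStream is built on: each quantizer encodes what the previous ones left unexplained, so the first indices carry the coarse structure and the later ones the fine detail. The sketch below is a toy version for a single frame; the codebooks, the input vector and the `NUM_COARSE` split point are hypothetical values chosen for readability, not those of the `soundstream_16khz` model.

```python
import numpy as np

def residual_vector_quantize(x, codebooks):
    """Quantize x with a stack of codebooks; each stage encodes the previous residual."""
    residual = np.asarray(x, dtype=float)
    tokens = []
    for codebook in codebooks:
        # Pick the codeword closest to what is still unexplained.
        distances = ((codebook - residual) ** 2).sum(axis=1)
        index = int(distances.argmin())
        tokens.append(index)
        residual = residual - codebook[index]
    return tokens

# Hypothetical codebooks: large steps first, then progressively smaller corrections.
codebooks = [
    np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]),
    np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [-0.2, 0.0]]),
    np.array([[0.0, 0.0], [0.0, 0.05], [0.05, 0.0]]),
]
frame = np.array([1.8, 2.05])

tokens = residual_vector_quantize(frame, codebooks)

NUM_COARSE = 1  # assumed split point between coarse and fine quantizers
coarse_tokens, fine_tokens = tokens[:NUM_COARSE], tokens[NUM_COARSE:]

# Summing the selected codewords reconstructs the frame.
reconstruction = sum(cb[i] for cb, i in zip(codebooks, tokens))
print(coarse_tokens, fine_tokens)  # [2] [3, 1]
```

Dropping the fine tokens still gives a usable but rougher approximation, which is why the coarse and fine levels can be modeled in separate stages.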

By modeling both kinds of tokens within the same framework, the semantic tokens ensure long-term consistency, while the acoustic tokens ensure high-quality audio synthesis.

<center><img src="reportImg/2.png"/></center>


### Semantic tokens with w2v-BERT-like model

In [None]:
from hubertKM import SemanticTokenizer, visualizeEmbeddings

In [None]:
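# HuBERT (base, LibriSpeech 960h) stands in for w2v-BERT; the K-Means file name indicates layer-9 features and 500 clusters
w2vBERT = SemanticTokenizer("facebook/hubert-base-ls960", "./hubertKM/hubert_base_ls960_L9_km500.bin")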

In [None]:
## missing model Test

In [None]:
## missing visualization

### Acoustic tokens with SoundStream codec model

In [None]:
from SoundStream import soundstream_16khz

In [None]:
soundStream = soundstream_16khz()

In [None]:
## missing model Test

### Token Dataset creation

In [None]:
from data import storeTokens, TokensDataset, store_from_librilight

In [None]:
tokenPath = "out" ## output file directory
tokenFile = "tokens.csv" ## output file name
audioPath = "data" ## data_location

In [None]:
#fileCount = storeTokens(audioPath, tokenPath, tokenFile, w2vBERT, soundStream, fileCountCheckpoint = 10)
fileCount = store_from_librilight(tokenPath, tokenFile, w2vBERT, soundStream, fileCountCheckpoint = 10, subset = "10h")

In [None]:
AUDIO_LENGTH = 30
CROP_LENGTH = [15,5,1]

semanticDataset = TokensDataset(tokenPath, tokenFile, mode = "semantic", expected_audio_length = AUDIO_LENGTH, crop_length = CROP_LENGTH)
coarseDataset = TokensDataset(tokenPath, tokenFile, mode = "coarse", expected_audio_length = AUDIO_LENGTH, crop_length = CROP_LENGTH)
fineDataset = TokensDataset(tokenPath, tokenFile, mode = "fine", expected_audio_length = AUDIO_LENGTH, crop_length = CROP_LENGTH)
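The `crop_length` argument suggests that each dataset serves fixed-length windows of the stored token sequences. The internals of `TokensDataset` are not shown here, so the following is only a sketch of what random cropping could look like; the helper `random_crop` and the rate `TOKENS_PER_SECOND` are hypothetical and not taken from the `data` module.

```python
import random

def random_crop(tokens, crop_seconds, tokens_per_second, rng):
    """Return a random contiguous window covering crop_seconds of audio."""
    crop_len = int(crop_seconds * tokens_per_second)
    if len(tokens) <= crop_len:
        return list(tokens)  # already short enough: keep the whole sequence
    start = rng.randrange(len(tokens) - crop_len + 1)
    return list(tokens[start:start + crop_len])

TOKENS_PER_SECOND = 50  # assumed token rate, for illustration only
rng = random.Random(42)

sequence = list(range(30 * TOKENS_PER_SECOND))  # stand-in for 30 s of token IDs
crop = random_crop(sequence, crop_seconds=15,
                   tokens_per_second=TOKENS_PER_SECOND, rng=rng)
print(len(crop))  # 750
```

Cropping at a different random offset on every access lets a small corpus yield many distinct training windows.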

## AudioLM: a transformer-based audio model

Once the data has been converted into token sequences, we can define the generative model. AudioLM is built on three decoder-only Transformers, each dedicated to the autoregressive generation of a specific kind of token (semantic, coarse acoustic and fine acoustic).
During inference, we first generate the new semantic tokens and then use them to condition the generation of the new acoustic tokens. This ordering relies on the assumption that the semantic token $z_t$ is approximately conditionally independent of the past acoustic tokens $y_{\lt t}$, given the past semantic tokens $z_{\lt t}$:
$$
  p(z_{t}|z_{\lt t},y_{\lt t}) \simeq p(z_{t}|z_{\lt t})
$$
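Under this assumption, the semantic stage reduces to ordinary autoregressive sampling: each new token is drawn from a distribution that depends only on the semantic tokens generated so far. The loop below is a minimal sketch of that procedure; `toy_model` is a hypothetical stand-in for a trained Transformer and simply favors the next token ID in cyclic order.

```python
import random

def generate(next_token_probs, prompt, num_new_tokens, rng):
    """Extend `prompt` one token at a time, each sampled given all previous tokens."""
    sequence = list(prompt)
    for _ in range(num_new_tokens):
        probs = next_token_probs(sequence)  # plays the role of p(z_t | z_<t)
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        sequence.append(token)
    return sequence

def toy_model(sequence, vocab_size=4):
    """Hypothetical model: puts 0.9 of the mass on (last token + 1) mod vocab_size."""
    probs = [0.1 / (vocab_size - 1)] * vocab_size
    probs[(sequence[-1] + 1) % vocab_size] = 0.9
    return probs

rng = random.Random(0)
semantic = generate(toy_model, prompt=[0, 1], num_new_tokens=6, rng=rng)
print(semantic)
```

The acoustic stages can reuse the same loop, with the already generated semantic tokens supplied as part of the prompt so that they condition every acoustic token.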


### Semantic Transformer: expanding a sentence snippet


<center><img src="reportImg/semantic.png"/></center>

In [None]:
TRAINING_PERCENTAGE = 0.8
BATCH_SIZE = 16

# The semantic stage trains on the semantic token dataset
train_dataset, valid_dataset = random_split(semanticDataset, [TRAINING_PERCENTAGE, 1 - TRAINING_PERCENTAGE])
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=BATCH_SIZE, shuffle=False)

trainer = pl.Trainer(
    max_epochs=30,
    accelerator='gpu' if torch.cuda.is_available() else 'cpu',
    log_every_n_steps=1,
    #devices=1 if torch.cuda.is_available() else None
)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

checkpoint_path = "checkpoints/semantic-checkpoint-10h-colab.ckpt"
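d_model = 256           # 1024 in the paper
num_layers = 3          # 12 in the paper
num_heads = 4           # 16 in the paper
dim_feedforward = 1024  # 4096 in the paper
audioDuration = 15      # 30 in the paper
vocab_size = 500        # matches the 500-cluster K-Means quantizer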


if os.path.exists(checkpoint_path):
    print(f"Checkpoint found at {checkpoint_path}. Resuming old model...")
    model = SemanticTransformer.load_from_checkpoint(checkpoint_path, d_model=d_model, num_layers = num_layers, num_heads=num_heads, k = int(d_model/num_heads), dim_feedforward=dim_feedforward, audioDuration = audioDuration, vocab_size = vocab_size, myDevice = device)
else:
    print("No checkpoint found. Starting from scratch...")
    model = SemanticTransformer(d_model=d_model, num_layers = num_layers, num_heads=num_heads, k = int(d_model/num_heads), dim_feedforward=dim_feedforward, audioDuration = audioDuration, vocab_size = vocab_size, myDevice = device)

### Coarse Transformer: generating new audio


<center><img src="reportImg/coarse.png"/></center>

In [None]:
TRAINING_PERCENTAGE = 0.8
BATCH_SIZE = 16

train_dataset, valid_dataset = random_split(coarseDataset, [TRAINING_PERCENTAGE, 1 - TRAINING_PERCENTAGE])
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=BATCH_SIZE, shuffle=False)

trainer = pl.Trainer(
    max_epochs=30,
    accelerator='gpu' if torch.cuda.is_available() else 'cpu',
    log_every_n_steps=1,
    #devices=1 if torch.cuda.is_available() else None
)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

checkpoint_path = "checkpoints/coarse-checkpoint-10h-colab.ckpt"
d_model = 256           # 1024 in the paper
num_layers = 3          # 12 in the paper
num_heads = 4           # 16 in the paper
dim_feedforward = 1024  # 4096 in the paper
audioDuration = 5       # 10 in the paper
vocab_size = 1024


if os.path.exists(checkpoint_path):
    print(f"Checkpoint found at {checkpoint_path}. Resuming old model...")
    model = CoarseTransformer.load_from_checkpoint(checkpoint_path, d_model=d_model, num_layers = num_layers, num_heads=num_heads, k = int(d_model/num_heads), dim_feedforward=dim_feedforward, audioDuration = audioDuration, vocab_size = vocab_size, myDevice = device)
else:
    print("No checkpoint found. Starting from scratch...")
    model = CoarseTransformer(d_model=d_model, num_layers = num_layers, num_heads=num_heads, k = int(d_model/num_heads), dim_feedforward=dim_feedforward, audioDuration = audioDuration, vocab_size = vocab_size, myDevice = device)

### Fine Transformer: generating audio details


<center><img src="reportImg/fine.png"/></center>

In [None]:
TRAINING_PERCENTAGE = 0.8
BATCH_SIZE = 16

# The fine stage trains on the fine acoustic token dataset
train_dataset, valid_dataset = random_split(fineDataset, [TRAINING_PERCENTAGE, 1 - TRAINING_PERCENTAGE])
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=BATCH_SIZE, shuffle=False)

trainer = pl.Trainer(
    max_epochs=30,
    accelerator='gpu' if torch.cuda.is_available() else 'cpu',
    log_every_n_steps=1,
    #devices=1 if torch.cuda.is_available() else None
)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

checkpoint_path = "checkpoints/fine-checkpoint-10h-colab.ckpt"
d_model = 256           # 1024 in the paper
num_layers = 3          # 12 in the paper
num_heads = 4           # 16 in the paper
dim_feedforward = 1024  # 4096 in the paper
audioDuration = 1.5     # 3 in the paper
vocab_size = 1024


if os.path.exists(checkpoint_path):
    print(f"Checkpoint found at {checkpoint_path}. Resuming old model...")
    model = FineTransformer.load_from_checkpoint(checkpoint_path, d_model=d_model, num_layers = num_layers, num_heads=num_heads, k = int(d_model/num_heads), dim_feedforward=dim_feedforward, audioDuration = audioDuration, vocab_size = vocab_size, myDevice = device)
else:
    print("No checkpoint found. Starting from scratch...")
    model = FineTransformer(d_model=d_model, num_layers = num_layers, num_heads=num_heads, k = int(d_model/num_heads), dim_feedforward=dim_feedforward, audioDuration = audioDuration, vocab_size = vocab_size, myDevice = device)

## Inference and results