# EnCodec

- **arXiv**: [EnCodec Paper](https://arxiv.org/abs/2210.13438)
- **Hugging Face Models**:
  - [24kHz Model](https://huggingface.co/facebook/encodec_24khz)
  - [48kHz Model](https://huggingface.co/facebook/encodec_48khz)

EnCodec is a neural audio compression model based on residual vector quantization (RVQ) that reconstructs high-fidelity audio.

Two pretrained checkpoints are linked above:

- **24kHz**: Suitable for voice and basic audio; operates on 24,000 samples per second.
- **48kHz**: Suited to high-quality music and detailed audio; operates on 48,000 samples per second.

### Architecture
<img src="https://github.com/facebookresearch/encodec/raw/2d29d9353c2ff0ab1aeadc6a3d439854ee77da3e/architecture.png" alt="EnCodec Architecture" width="600">

In [3]:
from datasets import load_dataset, Audio
from transformers import EncodecModel, AutoProcessor

# load a demonstration dataset
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

# load the model + processor (for pre-processing the audio)
model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")




Let's check the model's default target bandwidth.

The 24 kHz model supports target bandwidths of 1.5, 3, 6, 12, or 24 kbps.

In [4]:
model.config.target_bandwidths[0]

1.5

In [7]:
# cast the audio data to the correct sampling rate for the model
librispeech_dummy = librispeech_dummy.cast_column("audio", Audio(sampling_rate=processor.sampling_rate))
audio_sample = librispeech_dummy[0]["audio"]["array"]

# pre-process the inputs
inputs = processor(raw_audio=audio_sample, sampling_rate=processor.sampling_rate, return_tensors="pt")

# explicitly encode then decode the audio inputs
# inputs["input_values"]: torch.Size([1, 1, 140520])
# at bandwidth=24 (kbps) the 24 kHz model uses 32 codebooks, so
# encoder_outputs.audio_codes is expected to be torch.Size([1, 1, 32, 440])
encoder_outputs = model.encode(inputs["input_values"], inputs["padding_mask"], bandwidth=24)
audio_values = model.decode(encoder_outputs.audio_codes, encoder_outputs.audio_scales, inputs["padding_mask"])[0]

# or the equivalent with a forward pass
# (pass the same bandwidth; without it the model falls back to target_bandwidths[0])
audio_values = model(inputs["input_values"], inputs["padding_mask"], bandwidth=24).audio_values
audio_values = audio_values.detach().cpu().numpy().squeeze()


## Display Audio

In [8]:
from IPython.display import Audio as IPythonAudio, display

# Play the input sound
print(f"Input Sound: {audio_sample.shape}")
display(IPythonAudio(audio_sample, rate=processor.sampling_rate))

# Play the output sound
print(f"Output Sound: {audio_values.shape}")
display(IPythonAudio(audio_values, rate=processor.sampling_rate))

Input Sound: (140520,)


Output Sound: (140520,)


## Step-by-Step **Encoding**

In [9]:
import torch 

batch, channels, input_length = inputs['input_values'].shape

# batch, channels, input_length
#     1,        1,       140520

chunk_length = input_length # 140520
stride = input_length # 140520
step = chunk_length - stride # 0

encoded_frames,scales = [], []

for offset in range(0, input_length - step, stride):
    # offset -> 0
    mask = inputs['padding_mask'][..., offset : offset + chunk_length].bool() # torch.Size([1, 140520])
    frame = inputs['input_values'][:, :, offset : offset + chunk_length] # torch.Size([1, 1, 140520])
    
    ######################
    # Encoding the frame #
    ######################
    # Step 1 : Encode the frame
    # model._encode_frame 
    length = frame.shape[-1] # 140520
    duration = length / model.config.sampling_rate # 5.855 sec <= 140520 / 24000 

    scale = None
    
    embeddings = model.encoder(frame) # => 1, 128, 440

    # Step 2 : Quantize the embeddings
    # model.quantizer.encode
    num_quantizers = model.quantizer.get_num_quantizers_for_bandwidth(model.config.target_bandwidths[0]) # 2 
    residual = embeddings
    all_indices = []
    for layer in model.quantizer.layers[:num_quantizers]:
        indices = layer.encode(residual) # (1, 440), (1, 440)
        quantized = layer.decode(indices) # (1, 128, 440), (1, 128, 440)
        residual = residual - quantized
        all_indices.append(indices) 
    codes = torch.stack(all_indices) # 2, 1, 440
    codes = codes.transpose(0, 1) # 1, 2, 440 

    encoded_frames.append(codes)
    scales.append(scale)

encoded_frames = torch.stack(encoded_frames) # 1, 1, 2, 440
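
The quantization loop above is residual vector quantization: each layer encodes whatever the previous layers failed to capture. The following is a minimal pure-Python sketch of that idea on a single 2-D vector. The codebooks, the input vector, and the helper names (`nearest`, `rvq_encode`, `rvq_decode`) are made up for illustration and are not part of EnCodec or Transformers.

```python
# Toy residual vector quantization in pure Python (illustrative values only).

def nearest(codebook, vec):
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    def sq_dist(word):
        return sum((a - b) ** 2 for a, b in zip(word, vec))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))

def rvq_encode(codebooks, vec):
    """Quantize vec stage by stage; each stage encodes the previous stage's residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices

def rvq_decode(codebooks, indices):
    """Sum the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Hypothetical 2-D codebooks: a coarse stage followed by a finer one.
codebooks = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
    [(0.0, 0.0), (-0.25, 0.25), (0.25, 0.25), (0.0, 0.25)],
]

x = (0.9, 0.3)
codes = rvq_encode(codebooks, x)
print(codes)                                 # [1, 3]
print(rvq_decode(codebooks, codes))          # [1.0, 0.25]
print(rvq_decode(codebooks[:1], codes[:1]))  # [1.0, 0.0]
```

Decoding with both stages lands closer to `x` than decoding with the first stage alone, which is the reason the model sums the outputs of all selected quantizer layers.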

## Step-by-Step **Decoding**

In [10]:
audio_codes = encoded_frames # 1, 1, 2, 440

chunk_length = model.config.chunk_length # None
# chunk_length is None, so the whole clip is decoded as a single frame

######################
# Decoding the frame #
######################

# model._decode_frame 
codes = audio_codes[0].transpose(0, 1) # 2, 1, 440
# Step 1: decode each quantizer layer's indices and sum the results
quantized_out = torch.tensor(0.0, device=codes.device) # 0.0
for i, indices in enumerate(codes):
    # i = 0, 1; indices: (1, 440) for each layer
    quantized = model.quantizer.layers[i].decode(indices) # (1, 128, 440)
    quantized_out = quantized_out + quantized # (1, 128, 440)

# Step 2: run the decoder on the summed embeddings
embeddings = quantized_out # 1, 128, 440
# the raw decoder output is not trimmed to the padding mask, hence 140800 rather than 140520 samples
audio_values_step_by_step = model.decoder(embeddings) # 1, 1, 140800
audio_values_step_by_step = audio_values_step_by_step.detach().cpu().numpy().squeeze()


## Check step-by-step results

In [11]:
from IPython.display import Audio as IPythonAudio, display

# Play the output sound
print(f"Output Sound: {audio_values_step_by_step.shape}")
display(IPythonAudio(audio_values_step_by_step, rate=processor.sampling_rate))

Output Sound: (140800,)
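
The step-by-step output has 140800 samples while the input had 140520. The shapes above (440 code frames, 140800 decoded samples) are consistent with one code frame per 320 input samples, with the last partial frame rounded up. The sketch below reproduces that arithmetic; `HOP_LENGTH` and the helper names are assumptions inferred from those shapes, not values read from the model config.

```python
import math

# Assumed: samples per code frame, inferred from 140800 / 440 in the cells above.
HOP_LENGTH = 320

def num_frames(num_samples, hop_length=HOP_LENGTH):
    """Number of code frames needed to cover num_samples (last frame rounded up)."""
    return math.ceil(num_samples / hop_length)

def decoded_length(num_samples, hop_length=HOP_LENGTH):
    """Length of the raw decoder output: a whole number of frames."""
    return num_frames(num_samples, hop_length) * hop_length

n = 140520
print(num_frames(n))          # 440
print(decoded_length(n))      # 140800
print(decoded_length(n) - n)  # 280 extra samples at the end
```

The earlier `model.decode` call received `padding_mask` and returned 140520 samples, so slicing the step-by-step result to the first 140520 samples should give a matching length.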


In [12]:
model.quantizer.codebook_size

1024
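
With 1024 entries per codebook, each code carries 10 bits, which ties the target bandwidths to the number of quantizer layers in use. The sketch below works through that arithmetic. The frame rate of 75 code frames per second is an assumption (24000 samples per second divided by an assumed 320 samples per frame), and `codebooks_for_bandwidth` is a hypothetical helper, not the library function.

```python
import math

def codebooks_for_bandwidth(bandwidth_kbps, codebook_size=1024, frame_rate=75):
    """Estimate how many codebooks fit in a target bandwidth.

    frame_rate=75 is an assumption for the 24 kHz model.
    """
    bits_per_code = math.log2(codebook_size)        # 10 bits for 1024 entries
    bps_per_codebook = bits_per_code * frame_rate   # 750 bits per second
    return max(1, math.floor(bandwidth_kbps * 1000 / bps_per_codebook))

for bw in (1.5, 3, 6, 12, 24):
    print(bw, codebooks_for_bandwidth(bw))
# 1.5 2
# 3 4
# 6 8
# 12 16
# 24 32
```

This agrees with the 2 quantizer layers seen at the default 1.5 kbps in the step-by-step encoding.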

### Integrating RVQ Codes with LLM Tokens

1. **Challenge**: Directly feeding RVQ codes of shape (1, 1, `2`, 440) to an LLM as tokens is impractical, because each time step carries two codebook indices rather than a single token.
2. **Requirement**: Both RVQ codebooks contribute to the generated voice.

### Solution
- Avoid decoding residuals.
- Keep a single codebook: (1, 1, `1`, 440) for LLM compatibility. The cell below uses `decode_index = 0`, which selects the first quantizer layer.

In [13]:
audio_codes = encoded_frames # 1, 1, 2, 440

chunk_length = model.config.chunk_length # None
# chunk_length is None, so the whole clip is decoded as a single frame

######################
# Decoding the frame #
######################

# model._decode_frame 
codes = audio_codes[0].transpose(0, 1) # 2, 1, 440
# Step 1: decode the indices of a single quantizer layer
quantized_out = torch.tensor(0.0, device=codes.device) # 0.0

# decode a single codebook only; index 0 is the first quantizer layer
decode_index = 0
codes = codes[decode_index] # 1, 440

quantized_last_only = model.quantizer.layers[decode_index].decode(codes)
quantized_out_last_only = quantized_out + quantized_last_only

embeddings_decode_last_residual_only = quantized_out_last_only # 1, 128, 440

audio_values_decode_last_only = model.decoder(embeddings_decode_last_residual_only) # 1, 1, 140800
audio_values_decode_last_only = audio_values_decode_last_only.detach().cpu().numpy().squeeze()

Let's check the output audio.

In [14]:
# Play the output sound
print(f"Output Sound: {audio_values_decode_last_only.shape}")
display(IPythonAudio(audio_values_decode_last_only, rate=processor.sampling_rate))

Output Sound: (140800,)


It sounds strange, like Darth Vader speaking.

Because only one codebook was decoded, the voice is distorted. To recover natural speech we need to use all the residual codebooks.
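
One possible way to keep every codebook while still producing a single token sequence is to interleave the codes per time step and offset each codebook into its own ID range. This is a hedged sketch of that flattening idea, not something taken from this notebook or from the EnCodec API; `flatten_codes` and `unflatten_tokens` are hypothetical helpers, and the example indices are made up.

```python
CODEBOOK_SIZE = 1024  # matches model.quantizer.codebook_size shown earlier

def flatten_codes(codes, codebook_size=CODEBOOK_SIZE):
    """Interleave K code streams of length T into one sequence of K * T tokens.

    codes: list of K lists, each holding T codebook indices.
    Token id = k * codebook_size + index, so each codebook has its own ID range.
    """
    num_steps = len(codes[0])
    tokens = []
    for t in range(num_steps):
        for k, stream in enumerate(codes):
            tokens.append(k * codebook_size + stream[t])
    return tokens

def unflatten_tokens(tokens, num_codebooks, codebook_size=CODEBOOK_SIZE):
    """Invert flatten_codes: recover the K per-codebook index streams."""
    codes = [[] for _ in range(num_codebooks)]
    for pos, token in enumerate(tokens):
        k = pos % num_codebooks
        codes[k].append(token - k * codebook_size)
    return codes

codes = [[5, 17, 900], [3, 1023, 0]]  # 2 codebooks, 3 time steps
tokens = flatten_codes(codes)
print(tokens)                        # [5, 1027, 17, 2047, 900, 1024]
print(unflatten_tokens(tokens, 2))   # [[5, 17, 900], [3, 1023, 0]]
```

The trade-off is sequence length: with K codebooks the token sequence is K times longer than the number of code frames.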

But we also need to consider how to run EnCodec in a streaming fashion.

## Streaming EnCodec

In [37]:
import librosa
import numpy as np
import torch
from IPython.display import Audio as IPythonAudio, display

sample_wav_file = "../data/hangang_kor.wav"

# resample to 24 kHz
y, sr = librosa.load(sample_wav_file, sr=24000)

print(f"Sample WAV file: {sample_wav_file}")
print(f"Sample WAV file shape: {y.shape}")
print(f"Sample WAV duration (s): {y.shape[0] / sr}")

# split the audio into 80 ms segments
segment_length = int(0.08 * sr)  # 80ms
num_segments = y.shape[0] // segment_length

decoded_segments = []

for i in range(num_segments):
    # always process a segment of exactly 80 ms
    segment = y[i*segment_length:(i+1)*segment_length]
    
    # pre-process the audio
    inputs = processor(raw_audio=segment, sampling_rate=sr, return_tensors="pt")
    
    # encode and decode the audio
    with torch.no_grad():
        outputs = model(input_values=inputs["input_values"], padding_mask=inputs["padding_mask"], bandwidth=24)
    
    if i % 33 == 0:
        print(f"Segment {i+1} processed")

    # extract the decoded audio
    audio_values = outputs.audio_values
    decoded_audio = audio_values[0].cpu().numpy().squeeze()
    
    # trim or zero-pad the decoded audio if its length differs from segment_length
    if len(decoded_audio) > segment_length:
        decoded_audio = decoded_audio[:segment_length]
    elif len(decoded_audio) < segment_length:
        decoded_audio = np.pad(decoded_audio, (0, segment_length - len(decoded_audio)))
    
    decoded_segments.append(decoded_audio)

# concatenate all decoded segments to reconstruct the full audio
full_audio = np.concatenate(decoded_segments)
print("Original audio shape:", y.shape)
display(IPythonAudio(y, rate=sr))

# play the reconstructed audio after the original
print(f"Reconstructed audio({full_audio.shape}) :")
display(IPythonAudio(full_audio, rate=sr))


Sample WAV file: ../data/hangang_kor.wav
Sample WAV file shape: (240000,)
Sample WAV duration (s): 10.0
Segment 1 processed
Segment 34 processed
Segment 67 processed
Segment 100 processed
Original audio shape: (240000,)


Reconstructed audio((240000,)) :
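
The segmentation arithmetic in the streaming loop can be checked in isolation. The sketch below uses plain Python lists instead of NumPy arrays; `segment_length_for`, `split_segments`, and `fit_length` are hypothetical helpers that mirror the slicing and the trim-or-pad step above, and the 320-sample frame size is an assumption inferred from the earlier shapes.

```python
def segment_length_for(sample_rate, duration_ms):
    """Number of samples in a segment of duration_ms milliseconds."""
    return sample_rate * duration_ms // 1000

def split_segments(samples, seg_len):
    """Fixed-length segments; a trailing remainder is dropped, as in the loop above."""
    count = len(samples) // seg_len
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(count)]

def fit_length(samples, seg_len):
    """Trim or zero-pad a decoded segment to exactly seg_len samples."""
    samples = list(samples[:seg_len])
    return samples + [0.0] * (seg_len - len(samples))

seg_len = segment_length_for(24000, 80)
print(seg_len)            # 1920
print(240000 // seg_len)  # 125 segments in the 10-second clip
print(seg_len // 320)     # 6 code frames per segment, assuming 320 samples per frame

print(fit_length([0.1, 0.2], 4))                 # [0.1, 0.2, 0.0, 0.0]
print(fit_length([0.1, 0.2, 0.3, 0.4, 0.5], 4))  # [0.1, 0.2, 0.3, 0.4]
```

Under the assumed frame size, 1920 samples is an exact multiple of 320, so the trim-or-pad branch would mainly matter for segment durations that are not a whole number of frames.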
