# Sound Generation with Stable Audio Open and OpenVINO™

[Stable Audio Open](https://huggingface.co/stabilityai/stable-audio-open-1.0) is an open-source model optimized for generating short audio samples, sound effects, and production elements using text prompts. The model was trained on data from Freesound and the Free Music Archive, respecting creator rights.

![stable-audio](https://github.com/openvinotoolkit/openvino_notebooks/assets/76171391/ed4aa0f2-0501-4519-8b24-c1c3072b4ef2)

#### Key Takeaways:

 - Stable Audio Open is an open-source text-to-audio model for generating samples and sound effects of up to 47 seconds.

 - Users can create drum beats, instrument riffs, ambient sounds, foley and production elements.

 - The model enables audio variations and style transfer of audio samples.

The model is intended to be used with the [stable-audio-tools](https://github.com/Stability-AI/stable-audio-tools) library for inference.

#### Table of contents:
- [Prerequisites](#Prerequisites)
- [Load the original model and inference](#Load-the-original-model-and-inference)
- [Convert the model to OpenVINO IR](#Convert-the-model-to-OpenVINO-IR)
  - [T5-based text embedding](#T5-based-text-embedding)
  - [Transformer-based diffusion (DiT) model](#Transformer-based-diffusion-(DiT)-model)
  - [Decoder part of autoencoder](#Decoder-part-of-autoencoder)
- [Compiling models and inference](#Compiling-models-and-inference)
- [Interactive inference](#Interactive-inference)

### Installation Instructions

This is a self-contained example that relies solely on its own code.

We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to [Installation Guide](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/README.md#-installation-guide).




## Prerequisites
[back to top ⬆️](#Table-of-contents:)

In [1]:
import platform

%pip install -q "torch>=2.2" "torchaudio" "einops" "einops-exts" "huggingface-hub" "k-diffusion" "pytorch_lightning" "alias-free-torch" "ema-pytorch" "transformers>=4.45" "gradio>=4.19" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q --no-deps "stable-audio-tools"
%pip install  -q "nncf>=2.12.0"
if platform.system() == "Darwin":
    %pip install -q "numpy>=1.26,<2.0.0" "pandas>2.0.2" "matplotlib>=3.9"
else:
    %pip install -q "numpy>=1.26" "pandas>2.0.2" "matplotlib>=3.9"
%pip install -q "openvino>=2024.4.0"

## Load the original model and inference
[back to top ⬆️](#Table-of-contents:)

>**Note**: To run the model in this notebook, you need to accept the license agreement.
>You must be a registered user on the 🤗 Hugging Face Hub. Please visit the [Hugging Face model card](https://huggingface.co/stabilityai/stable-audio-open-1.0), carefully read the terms of usage, and click the accept button. You will need an access token for the code below to run. For more information on access tokens, refer to [this section of the documentation](https://huggingface.co/docs/hub/security-tokens).
>You can log in to the Hugging Face Hub in a notebook environment using the following code:

In [2]:
# Uncomment these lines to log in to the Hugging Face Hub and get access to the pretrained model
# from huggingface_hub import notebook_login, whoami

# try:
#     whoami()
#     print('Authorization token already provided')
# except OSError:
#     notebook_login()

In [3]:
import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond

import requests
from pathlib import Path

if not Path("notebook_utils.py").exists():
    r = requests.get(
        url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py",
    )
    # Use a context manager so the file is flushed and closed deterministically
    with open("notebook_utils.py", "w") as f:
        f.write(r.text)

# Read more about telemetry collection at https://github.com/openvinotoolkit/openvino_notebooks?tab=readme-ov-file#-telemetry
from notebook_utils import collect_telemetry

collect_telemetry("stable-audio.ipynb")


# Download model
model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")

No module named 'flash_attn'
flash_attn not installed, disabling Flash Attention




In [4]:
sample_rate = model_config["sample_rate"]

model = model.to("cpu")
total_seconds = 20

# Set up text and timing conditioning
conditioning = [{"prompt": "128 BPM tech house drum loop", "seconds_start": 0, "seconds_total": total_seconds}]

# Generate stereo audio
output = generate_diffusion_cond(
    model,
    steps=100,
    seed=42,
    cfg_scale=7,
    conditioning=conditioning,
    sample_size=sample_rate * total_seconds,
    sigma_min=0.3,
    sigma_max=500,
    sampler_type="dpmpp-3m-sde",
    device="cpu",
)

# Rearrange audio batch to a single sequence
output = rearrange(output, "b d n -> d (b n)")

# Peak normalize, clip, convert to int16, and save to file
output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
torchaudio.save("output.wav", output, sample_rate)

42





In [5]:
from IPython.display import Audio

Audio("output.wav")
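
The post-processing line in the generation cell packs several steps into one tensor chain: peak normalization, clipping to [-1, 1], scaling to the 16-bit range, and an integer cast. As a rough, dependency-free sketch of the same arithmetic on a plain Python list (the helper name `to_int16_pcm` is hypothetical and is not part of `stable-audio-tools`; the guard for silent input is an addition that the tensor chain does not have):

```python
def to_int16_pcm(samples):
    """Peak-normalize float samples and convert them to 16-bit PCM integers."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        # All-silent input: skip the division entirely (assumption for this sketch).
        return [0 for _ in samples]
    pcm = []
    for s in samples:
        # Scale so the loudest sample reaches +/-1, then clip to [-1, 1].
        normalized = max(-1.0, min(1.0, s / peak))
        # Scale to the int16 range and truncate toward zero, like an integer cast.
        pcm.append(int(normalized * 32767))
    return pcm


print(to_int16_pcm([0.0, 0.25, -0.5, 0.1]))
```

Because every sample is divided by the peak, the loudest sample always maps to ±32767 regardless of the original level.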

## Convert the model to OpenVINO IR
[back to top ⬆️](#Table-of-contents:)

Let's define a conversion function for PyTorch modules. We use the `ov.convert_model` function to obtain an OpenVINO Intermediate Representation (IR) object and the `ov.save_model` function to save it as an XML file.

To reduce memory consumption, weight compression can be applied using [NNCF](https://github.com/openvinotoolkit/nncf). Weight compression aims to reduce the memory footprint of a model.
Models that require extensive memory to store their weights during inference can benefit from weight compression in the following ways:

* enabling the inference of exceptionally large models that cannot be accommodated in the memory of the device;

* improving the inference performance of the models by reducing the latency of the memory access when computing the operations with weights, for example, Linear layers.

[Neural Network Compression Framework (NNCF)](https://github.com/openvinotoolkit/nncf) provides 4-bit / 8-bit mixed weight quantization as a compression method. The main difference between weight compression and full model quantization (post-training quantization) is that activations remain in floating point under weight compression, which leads to better accuracy. In addition, weight compression is data-free and does not require a calibration dataset, making it easy to use.
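
To make the memory argument concrete, here is a back-of-the-envelope sketch of how the storage needed for the weights alone scales with the data type. The parameter count is a hypothetical round number, not a measurement of this model, and the estimate ignores quantization scales and activations.

```python
# Approximate bytes needed to store one weight in each data type.
BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params, dtype):
    """Approximate storage for the weights alone, in GiB."""
    return num_params * BYTES_PER_WEIGHT[dtype] / 1024**3


num_params = 1_000_000_000  # hypothetical model size, for illustration only
for dtype in BYTES_PER_WEIGHT:
    print(f"{dtype}: {weight_memory_gib(num_params, dtype):.2f} GiB")
```

Under these assumptions, going from 32-bit floats to 8-bit integers cuts weight storage by a factor of four.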

The `nncf.compress_weights` function performs weight compression. It accepts an OpenVINO model and other compression parameters. Different parameters may suit different models. In this case, the default parameters give poor results, but changing the mode to `CompressWeightsMode.INT8_SYM` to [compress weights symmetrically to the 8-bit integer data type](https://github.com/openvinotoolkit/nncf/blob/develop/docs/usage/post_training_compression/weights_compression/Usage.md#user-guide) yields inference results that match the original model.

More details about weight compression can be found in the [OpenVINO documentation](https://docs.openvino.ai/2024/openvino-workflow/model-optimization-guide/weight-compression.html).
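
For intuition about what symmetric 8-bit compression means, the following is a conceptual sketch in plain Python. It is not NNCF's implementation: the single per-row scale, the [-127, 127] integer range, and the function names are all assumptions made for illustration.

```python
def quantize_int8_sym(row):
    """Map floats to integers in [-127, 127] using one scale and no zero point."""
    max_abs = max(abs(w) for w in row)
    # Symmetric scheme: the largest magnitude maps to 127, and zero maps to zero.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in row]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


row = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8_sym(row)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(q, scale, max_error)
```

In this sketch the reconstruction error of each weight is bounded by half the scale, which is why weights with a narrow value range survive 8-bit storage with little loss.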

In [6]:
from pathlib import Path
import torch
from nncf import compress_weights, CompressWeightsMode
import openvino as ov


def convert(model: torch.nn.Module, xml_path: str, example_input):
    """Convert a PyTorch module to OpenVINO IR with INT8 weights; skip if the file already exists."""
    xml_path = Path(xml_path)
    if not xml_path.exists():
        xml_path.parent.mkdir(parents=True, exist_ok=True)
        model.eval()
        with torch.no_grad():
            converted_model = ov.convert_model(model, example_input=example_input)
            converted_model = compress_weights(converted_model, mode=CompressWeightsMode.INT8_SYM)
        ov.save_model(converted_model, xml_path)

        # Free TorchScript tracing caches (private PyTorch APIs that may change between versions)
        torch._C._jit_clear_class_registry()
        torch.jit._recursive.concrete_type_store = torch.jit._recursive.ConcreteTypeStore()
        torch.jit._state._clear_class_state()

INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, openvino


In [7]:
MODEL_DIR = Path("model")

CONDITIONER_ENCODER_PATH = MODEL_DIR / "conditioner_encoder.xml"
DIFFUSION_PATH = MODEL_DIR / "diffusion.xml"
PRETRANSFORM_PATH = MODEL_DIR / "pretransform.xml"

The pipeline comprises three components: an autoencoder that compresses waveforms into a manageable sequence length, a T5-based text embedding for text conditioning, and a transformer-based diffusion (DiT) model that operates in the latent space of the autoencoder. In this example, no initial audio is used, so the autoencoder's encoder is not needed: we only convert the T5-based text embedding model, the transformer-based diffusion (DiT) model, and the decoder part of the autoencoder.
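
To give a feel for how much the autoencoder shortens the sequence, here is a small arithmetic sketch. It assumes a 44.1 kHz sample rate and a downsampling ratio of 2048; both values are assumptions for illustration and should be checked against `model_config`. The function name is hypothetical.

```python
def latent_length(seconds, sample_rate=44100, downsampling_ratio=2048):
    """Number of latent frames for a clip of the given duration (floor division)."""
    return (seconds * sample_rate) // downsampling_ratio


# A 20-second clip at the assumed sample rate and ratio.
print(latent_length(20))

# The reverse direction: how many audio samples a 1024-frame latent covers.
print(1024 * 2048, 1024 * 2048 / 44100)
```

Under these assumptions, a 1024-frame latent, which is the sequence length of the example input used for the diffusion model below, covers 2,097,152 samples, or roughly 47.5 seconds of audio.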

### T5-based text embedding
[back to top ⬆️](#Table-of-contents:)

In [8]:
example_input = {
    "input_ids": torch.zeros(1, 120, dtype=torch.int64),
    "attention_mask": torch.zeros(1, 120, dtype=torch.int64),
}

convert(model.conditioner.conditioners["prompt"].model, CONDITIONER_ENCODER_PATH, example_input)

### Transformer-based diffusion (DiT) model
[back to top ⬆️](#Table-of-contents:)

In [9]:
import types


def continious_transformer_forward(self, x, mask=None, prepend_embeds=None, prepend_mask=None, global_cond=None, return_info=False, **kwargs):
    batch, seq, device = *x.shape[:2], x.device

    info = {
        "hidden_states": [],
    }

    x = self.project_in(x)

    if prepend_embeds is not None:
        prepend_length, prepend_dim = prepend_embeds.shape[1:]

        assert prepend_dim == x.shape[-1], "prepend dimension must match sequence dimension"

        x = torch.cat((prepend_embeds, x), dim=-2)

        if prepend_mask is not None or mask is not None:
            mask = mask if mask is not None else torch.ones((batch, seq), device=device, dtype=torch.bool)
            prepend_mask = prepend_mask if prepend_mask is not None else torch.ones((batch, prepend_length), device=device, dtype=torch.bool)

            mask = torch.cat((prepend_mask, mask), dim=-1)

    # Attention layers

    if self.rotary_pos_emb is not None:
        rotary_pos_emb = self.rotary_pos_emb.forward_from_seq_len(x.shape[1])
    else:
        rotary_pos_emb = None

    if self.use_sinusoidal_emb or self.use_abs_pos_emb:
        x = x + self.pos_emb(x)

    # Iterate over the transformer layers
    for layer in self.layers:
        x = layer(x, rotary_pos_emb=rotary_pos_emb, global_cond=global_cond, **kwargs)
        if return_info:
            info["hidden_states"].append(x)

    x = self.project_out(x)

    if return_info:
        return x, info

    return x


class DiffusionWrapper(torch.nn.Module):
    def __init__(self, diffusion):
        super().__init__()
        self.diffusion = diffusion

    def forward(self, x=None, t=None, cross_attn_cond=None, cross_attn_cond_mask=None, global_embed=None):
        model_inputs = {"cross_attn_cond": cross_attn_cond, "cross_attn_cond_mask": cross_attn_cond_mask, "global_embed": global_embed}

        return self.diffusion.forward(x, t, cfg_scale=7, **model_inputs)


example_input = {
    "x": torch.rand([1, 64, 1024], dtype=torch.float32),
    "t": torch.rand([1], dtype=torch.float32),
    "cross_attn_cond": torch.rand([1, 130, 768], dtype=torch.float32),
    "cross_attn_cond_mask": torch.ones([1, 130], dtype=torch.float32),
    "global_embed": torch.rand(torch.Size([1, 1536]), dtype=torch.float32),
}

diffuser = model.model.model
diffuser.transformer.forward = types.MethodType(continious_transformer_forward, diffuser.transformer)
convert(DiffusionWrapper(diffuser), DIFFUSION_PATH, example_input)

### Decoder part of autoencoder
[back to top ⬆️](#Table-of-contents:)

In [10]:
import types


def residual_forward(self, x):
    res = x
    x = self.layers(x)
    return x + res


for layer in model.pretransform.model.decoder.layers:
    if layer.__class__.__name__ == "DecoderBlock":
        for sublayer in layer.layers:
            if sublayer.__class__.__name__ == "ResidualUnit":
                sublayer.forward = types.MethodType(residual_forward, sublayer)

convert(model.pretransform.model.decoder, PRETRANSFORM_PATH, torch.rand([1, 64, 215], dtype=torch.float32))

## Compiling models and inference
[back to top ⬆️](#Table-of-contents:)

Select a device from the dropdown list to run inference with OpenVINO.

In [11]:
from notebook_utils import device_widget

device = device_widget("CPU")

device

Dropdown(description='Device:', options=('CPU', 'AUTO'), value='CPU')

Let's create callable wrapper classes for the compiled models so that they can interact with the original pipeline. Note that all wrapper classes return `torch.Tensor`s instead of `np.array`s.
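
The wrappers in the next cells all follow the same adapter idea: call the backend with a dictionary of inputs, take the first output, and convert it to the type the pipeline expects. A minimal, dependency-free sketch of that pattern is shown here; `FakeCompiledModel` and `OutputAdapter` are hypothetical names, and `tuple` stands in for the `torch.from_numpy` conversion used in the notebook.

```python
class FakeCompiledModel:
    """Stand-in for a compiled model: takes a dict of inputs, returns a list of outputs."""

    def __call__(self, inputs):
        return [[2 * v for v in inputs["x"]]]


class OutputAdapter:
    """Wraps a callable backend and converts its first output to the expected type."""

    def __init__(self, backend, convert):
        self.backend = backend
        self.convert = convert

    def __call__(self, **inputs):
        # Pack keyword arguments into a dict, run the backend, convert the first output.
        first_output = self.backend(inputs)[0]
        return self.convert(first_output)


adapter = OutputAdapter(FakeCompiledModel(), convert=tuple)
print(adapter(x=[1, 2, 3]))
```

Keeping the conversion at the boundary means the rest of the pipeline does not need to know which backend produced the result.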

In [12]:
core = ov.Core()


class TextEncoderWrapper(torch.nn.Module):
    def __init__(self, text_encoder, dtype, device="CPU"):
        super().__init__()
        self.text_encoder = core.compile_model(text_encoder, device)
        self.dtype = dtype

    def __call__(self, input_ids=None, attention_mask=None):
        inputs = {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
        }
        last_hidden_state = self.text_encoder(inputs)[0]

        return {"last_hidden_state": torch.from_numpy(last_hidden_state)}

In [13]:
class OVWrapper(torch.nn.Module):
    def __init__(self, ov_model, old_model, device="CPU") -> None:
        super().__init__()
        self.mock = torch.nn.Parameter(torch.zeros(1))  # dummy parameter so the unchanged pipeline can still query the module's parameters
        self.dif_transformer = core.compile_model(ov_model, device)

    def forward(self, x=None, t=None, cross_attn_cond=None, cross_attn_cond_mask=None, global_embed=None, **kwargs):
        inputs = {
            "x": x,
            "t": t,
            "cross_attn_cond": cross_attn_cond,
            "cross_attn_cond_mask": cross_attn_cond_mask,
            "global_embed": global_embed,
        }
        result = self.dif_transformer(inputs)

        return torch.from_numpy(result[0])

In [14]:
class PretransformDecoderWrapper(torch.nn.Module):
    def __init__(self, ov_model, device="CPU"):
        super().__init__()
        self.decoder = core.compile_model(ov_model, device)

    def forward(self, latents=None):
        result = self.decoder(latents)

        return torch.from_numpy(result[0])

Now we can replace the original models with our wrapped OpenVINO models and run inference.

In [15]:
model.model.model = OVWrapper(DIFFUSION_PATH, model.model.model, device.value)
model.conditioner.conditioners["prompt"].model = TextEncoderWrapper(
    CONDITIONER_ENCODER_PATH, model.conditioner.conditioners["prompt"].model.dtype, device.value
)
model.pretransform.model.decoder = PretransformDecoderWrapper(PRETRANSFORM_PATH, device.value)

In [16]:
output = generate_diffusion_cond(
    model,
    steps=100,
    seed=42,
    cfg_scale=7,
    conditioning=conditioning,
    sample_size=sample_rate * total_seconds,
    sigma_min=0.3,
    sigma_max=500,
    sampler_type="dpmpp-3m-sde",
    device="cpu",
)

# Rearrange audio batch to a single sequence
output = rearrange(output, "b d n -> d (b n)")

# Peak normalize, clip, convert to int16, and save to file
output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
torchaudio.save("output.wav", output, sample_rate)

42



In [17]:
Audio("output.wav")

## Interactive inference
[back to top ⬆️](#Table-of-contents:)

In [18]:
def _generate(prompt, total_seconds, steps, seed):
    sample_rate = model_config["sample_rate"]

    # Set up text and timing conditioning
    conditioning = [{"prompt": prompt, "seconds_start": 0, "seconds_total": total_seconds}]

    output = generate_diffusion_cond(
        model,
        steps=steps,
        seed=seed,
        cfg_scale=7,
        conditioning=conditioning,
        sample_size=sample_rate * total_seconds,
        sigma_min=0.3,
        sigma_max=500,
        sampler_type="dpmpp-3m-sde",
        device="cpu",
    )

    # Rearrange audio batch to a single sequence
    output = rearrange(output, "b d n -> d (b n)")

    # Peak normalize, clip, and convert to int16, then return the audio instead of saving it
    output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
    return (sample_rate, output.numpy().transpose())
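
The `transpose()` in the return statement switches the audio from a channels-first layout, with shape (channels, samples), to a samples-first layout, with shape (samples, channels). A plain-Python sketch of that reshaping on nested lists follows; the helper name is hypothetical and the tiny stereo signal is made up for illustration.

```python
def channels_first_to_samples_first(audio):
    """Turn [[L0, L1, ...], [R0, R1, ...]] into [[L0, R0], [L1, R1], ...]."""
    return [list(frame) for frame in zip(*audio)]


stereo = [[1, 2, 3], [10, 20, 30]]  # two channels, three samples each
print(channels_first_to_samples_first(stereo))
```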

In [None]:
if not Path("gradio_helper.py").exists():
    r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/stable-audio/gradio_helper.py")
    # Use a context manager so the file is flushed and closed deterministically
    with open("gradio_helper.py", "w") as f:
        f.write(r.text)

from gradio_helper import make_demo

demo = make_demo(fn=_generate)

try:
    demo.launch(debug=True)
except Exception:
    demo.launch(share=True, debug=True)
# If you are launching remotely, specify server_name and server_port
# EXAMPLE: `demo.launch(server_name='your server name', server_port='server port in int')`
# To learn more please refer to the Gradio docs: https://gradio.app/docs/