### Cell 1: Install and upgrade all required libraries  
This cell:
- Upgrades `pip` itself.  
- Installs CUDA-enabled PyTorch (`torch`, `torchvision`, `torchaudio`) from the CUDA 11.8 wheel index.  
- Installs the `diffusers`, `accelerate`, and `audioldm2` packages for AudioLDM2.  
- Installs audio utilities (`librosa`, `soundfile`, `scipy`).  
- Pins `transformers` to version 4.46.0 and upgrades `huggingface_hub` so we can authenticate and download the model.  

> **Note:** After this cell runs, you must **restart the runtime** so that the new versions are actually loaded.
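
After the restart, it can help to confirm which versions the runtime actually sees. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is introduced here for illustration and is not part of the original notebook.

```python
from importlib.metadata import version, PackageNotFoundError

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Report the packages that Cell 1 installs or pins.
for name, ver in installed_versions(
    ["torch", "diffusers", "transformers", "huggingface_hub"]
).items():
    print(f"{name}: {ver or 'not installed'}")
```

If `transformers` does not report 4.46.0 here, the runtime is probably still using the modules that were loaded before the restart.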


In [3]:
!pip install --upgrade pip
!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu118
!pip install "diffusers>=0.21.0" accelerate audioldm2
!pip install librosa soundfile scipy
!pip install --upgrade transformers==4.46.0 huggingface_hub

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu118
Collecting transformers==4.46.0
  Downloading transformers-4.46.0-py3-none-any.whl.metadata (44 kB)
Collecting tokenizers<0.21,>=0.20 (from transformers==4.46.0)
  Downloading tokenizers-0.20.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Reason for being yanked: This version unfortunately does not work with 3.8 but we did not drop the support yet
Downloading transformers-4.46.0-py3-none-any.whl (10.0 MB)
Downloading tokenizers-0.20.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found exist

### Cell 2: Log in to Hugging Face  
This cell uses `huggingface_hub.login()` to cache your Hugging Face access token in the Colab runtime.  
Once you paste your token, subsequent `from_pretrained(...)` calls pick up the cached token and download models without any further prompts.
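
For runs where nobody is available to paste a token, a non-interactive variant is possible. The sketch below assumes the token has been placed in an environment variable (`HF_TOKEN` is used as the default name); the helper names `hf_token_from_env` and `login_if_token_available` are hypothetical and not part of the original notebook.

```python
import os

def hf_token_from_env(var_name="HF_TOKEN"):
    """Return the token stored in the given environment variable, or None if unset or blank."""
    token = os.environ.get(var_name, "").strip()
    return token or None

def login_if_token_available(var_name="HF_TOKEN"):
    """Log in without a prompt when a token is present; return whether a login was attempted."""
    token = hf_token_from_env(var_name)
    if token is None:
        return False
    # Imported lazily so the helper can be defined before huggingface_hub is installed.
    from huggingface_hub import login
    login(token=token)
    return True

# Example (uncomment in the notebook):
# if not login_if_token_available():
#     print("No token in the environment; fall back to the interactive login() widget.")
```

Keeping the token in the environment rather than in the notebook source avoids committing it by accident when the notebook is shared.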


In [1]:
# Cell 2: Log in to Hugging Face (so from_pretrained won’t prompt you)
from huggingface_hub import login
login()


VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

### Cell 3: Monkey-patch and load AudioLDM2  
1. **Stub out** `numpy.dtypes` so that the JAX import check in Diffusers won’t crash on older NumPy versions.  
2. **Patch** `GPT2Model._get_initial_cache_position` so AudioLDM2’s internal GPT-based conditioning won’t hit a missing-method error.  
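3. **Import** `AudioLDM2Pipeline` and call `from_pretrained(...)` to load the 1.1B-parameter model in FP16, then move it to the GPU. The call also passes `use_auth_token=True`, but the installed Diffusers version reports that argument as unexpected and ignores it (see the warning in the output).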
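
The guarded patching pattern behind steps 1 and 2 can be illustrated on a toy class. This is only a sketch: `LegacyModel` is a hypothetical stand-in for a library class that calls a method it does not define, and it is not the real `GPT2Model`.

```python
class LegacyModel:
    """Hypothetical stand-in for a library class that calls a method it lacks."""

    def generate(self):
        # Without the patch below, this call would raise AttributeError.
        return self._get_initial_cache_position()

def _get_initial_cache_position(self, sequence_length: int = 0):
    # Same placeholder behavior as the stub in this cell: always start at 0.
    return 0

# Only attach the stub when the class does not already provide the method,
# so a library version that ships its own implementation is left untouched.
if not hasattr(LegacyModel, "_get_initial_cache_position"):
    LegacyModel._get_initial_cache_position = _get_initial_cache_position

print(LegacyModel().generate())
```

The `hasattr` guard is what makes the patch safe to leave in place across upgrades: once the library defines the method itself, the cell becomes a no-op.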


In [2]:
# Cell 3: Monkey‐patch numpy & GPT2Model, then load AudioLDM2
import numpy as np, types

# 1) Stub out numpy.dtypes so JAX import checks won’t crash
if not hasattr(np, "dtypes"):
    np.dtypes = types.SimpleNamespace()

# 2) Stub GPT2Model._get_initial_cache_position so generation won’t error
from transformers import GPT2Model
if not hasattr(GPT2Model, "_get_initial_cache_position"):
    def _get_initial_cache_position(self, sequence_length: int = 0):
        return 0
    GPT2Model._get_initial_cache_position = _get_initial_cache_position

import torch
from diffusers import AudioLDM2Pipeline

# 3) Load the 1.1B-parameter AudioLDM2 in FP16 on GPU
pipe = AudioLDM2Pipeline.from_pretrained(
    "cvssp/audioldm2",
    torch_dtype=torch.float16,
    use_auth_token=True
)
pipe = pipe.to("cuda")
print("✅ Loaded AudioLDM2 on", torch.cuda.get_device_name(0))


Keyword arguments {'use_auth_token': True} are not expected by AudioLDM2Pipeline and will be ignored.


Loading pipeline components...:   0%|          | 0/11 [00:00<?, ?it/s]

Expected types for language_model: (<class 'transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel'>,), got <class 'transformers.models.gpt2.modeling_gpt2.GPT2Model'>.


✅ Loaded AudioLDM2 on Tesla T4


### Cell 4: Verify GPU & CUDA  
This cell reports:
- Whether CUDA is available to PyTorch.  
- Which GPU device Colab has provided (e.g. “Tesla T4”).  
- How much total GPU memory you have, so you can size generations appropriately.


In [3]:
# Cell 4: Verify CUDA/GPU
import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"Total GPU memory: {total_gb:.1f} GB")


CUDA available: True
Device: Tesla T4
Total GPU memory: 14.7 GB
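
As a rough sanity check on those numbers, the memory needed for the weights alone can be estimated from the parameter count. The sketch below assumes 2 bytes per parameter for FP16 and uses the 1.1B figure quoted in Cell 3; the helper name `weights_gib` is hypothetical, and activations, latents, and CUDA overhead are not included.

```python
def weights_gib(num_params, bytes_per_param=2):
    """Estimate weight storage in GiB (FP16 uses 2 bytes per parameter)."""
    return num_params * bytes_per_param / 1024**3

model_gib = weights_gib(1.1e9)   # parameter count quoted in Cell 3
total_gib = 14.7                 # total memory reported above (computed with 1024**3)
print(f"FP16 weights: ~{model_gib:.2f} GiB of {total_gib} GiB total")
```

On these assumptions the weights take roughly 2 GiB, which leaves most of the T4's memory for the diffusion process itself.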


### Cell 5: Generate a short audio clip  
Here we:
- Define a text **prompt** (“A dog barking”) and a **negative prompt** (“wind noise”) that describes what the clip should avoid.  
- Call `pipe(...)` with `num_inference_steps=80`, `guidance_scale=7.5`, and `audio_length_in_s=4.0` to synthesize a 4-second waveform.  
- Store the first generated waveform, a NumPy array, in `audio`.


In [16]:
prompt = "A dog barking"
negative_prompt = "wind noise"

output = pipe(
    prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=80,
    guidance_scale=7.5,
    audio_length_in_s=4.0
)

audio = output.audios[0]


  0%|          | 0/80 [00:00<?, ?it/s]
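
Once generation finishes, the clip length can be checked against the request. This is a minimal sketch that assumes the 16 kHz sample rate used when the clip is played and saved in Cell 6; the helper names are hypothetical.

```python
def expected_num_samples(audio_length_in_s, sample_rate=16000):
    """Number of samples a clip of the requested length should contain."""
    return int(audio_length_in_s * sample_rate)

def duration_seconds(num_samples, sample_rate=16000):
    """Duration in seconds implied by a sample count."""
    return num_samples / sample_rate

print(expected_num_samples(4.0))  # 64000 samples for a 4-second clip at 16 kHz

# In the notebook, compare against the generated array, for example:
# print(duration_seconds(audio.shape[-1]))
```

A mismatch between the requested and actual duration would suggest that the sample rate assumed here differs from the one the pipeline's vocoder uses.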

### Cell 6: Play audio inline and save using prompt-based filename  
This cell:  
- Converts your `prompt` into a safe filename (strips characters other than letters, digits, spaces, underscores, and hyphens; replaces spaces with underscores; and truncates to 200 characters).  
- Embeds an audio player so you can listen to the clip right in the notebook.  
- Saves the waveform as a 16 kHz WAV file named `<sanitized_prompt>.wav`, matching your prompt.  


In [17]:

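from IPython.display import Audio, display
from scipy.io.wavfile import write
import re

# Sample rate used for both playback and saving
sample_rate = 16000

# Keep only letters, digits, spaces, underscores, and hyphens; then
# replace spaces with underscores and cap the length at 200 characters
base = re.sub(r"[^0-9A-Za-z _-]", "", prompt).strip().replace(" ", "_")[:200]
filename = f"{base}.wav"

# Inline player (no autoplay), then write the waveform to disk
display(Audio(audio, rate=sample_rate, autoplay=False))
write(filename, sample_rate, audio)
print(f"✅ Saved as {filename}")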


✅ Saved as A_dog_barking.wav
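
The filename logic can also be pulled into a small function so it is easy to test on its own. This is a sketch built on the same regular expression as the cell above; the function name `prompt_to_filename` and the `fallback` parameter are additions for illustration, covering the case where a prompt contains no allowed characters at all.

```python
import re

def prompt_to_filename(prompt, max_len=200, fallback="audio"):
    """Turn a prompt into a WAV filename using the same rules as Cell 6."""
    base = re.sub(r"[^0-9A-Za-z _-]", "", prompt).strip().replace(" ", "_")[:max_len]
    # An all-punctuation prompt would otherwise produce ".wav" (assumed fallback name).
    return f"{base or fallback}.wav"

print(prompt_to_filename("A dog barking"))  # A_dog_barking.wav
print(prompt_to_filename("???"))            # audio.wav
```

Note that removed punctuation can leave adjacent spaces behind, so a prompt such as "Rain, thunder & wind!" yields a double underscore in the result.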
