<a href="https://colab.research.google.com/github/Omri-Triff/Text-to-Timbre-Drum-Transfer/blob/main/models/audioLDM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text to Drum Timbre Generation - AudioLDM model

Generate a short drum audio sample that captures the **desired drum style** described in a text prompt.

Example prompts:
- "Vintage jazz drum kit"
- "Heavy metal drums with distortion"
- "80s electronic drum machine"

### 1. Environment Setup

System and GPU check

In [None]:
import torch
if torch.cuda.is_available():
    print(" GPU Connected: ", torch.cuda.get_device_name(0))
else:
    print(" Warning: No GPU connected. Go to Runtime > Change runtime type > T4 GPU")

 GPU Connected:  Tesla T4


Install dependencies

In [None]:
# AudioLDM
!pip install -q "diffusers==0.33.1" transformers accelerate scipy
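
After installing, it can help to confirm which versions were actually resolved, since only `diffusers` is pinned. The following is a minimal sketch using the standard library; `installed_version` is a hypothetical helper name, not part of this notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the packages requested by the pip command above
for name in ("diffusers", "transformers", "accelerate", "scipy"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```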

### 2. Run model

In [None]:
%cd /content

/content


In [None]:
%%writefile run_model.py

import os
os.environ["TRANSFORMERS_NO_TF"] = "1"
os.environ["USE_TF"] = "0"

import torch
from diffusers import AudioLDMPipeline
import scipy.io.wavfile
import argparse
import numpy as np

def generate_audio(prompt, duration=5.0, steps=50, output_file="output.wav"):
    print(f"\nStarting generation for prompt: '{prompt}'")

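    # Load the pretrained AudioLDM model
    # The model will be downloaded only if it is not already cached
    # float16 on GPU gives better performance and lower memory usage; CPU falls back to float32
    # The keyword is `torch_dtype`; this pipeline ignores a plain `dtype` keyword with a warning
    dtype = torch.float16 if torch.cuda.is_available() else torch.float32
    try:
        pipe = AudioLDMPipeline.from_pretrained(
            "cvssp/audioldm-s-full-v2",
            torch_dtype=dtype
        )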
    except Exception as e:
        print(f"Error loading model: {e}")
        return

    # Move the model to GPU if available
    if torch.cuda.is_available():
        pipe = pipe.to("cuda")
        print("Using CUDA GPU")
    else:
        print("Using CPU (generation may be slow)")

    # Generate audio from text prompt
    print("Generating audio...")
    audio = pipe(
        prompt,
        num_inference_steps=steps,
        audio_length_in_s=duration,
        guidance_scale=1.5,  # lower = often cleaner audio
        negative_prompt="melody, bass, synth, guitar, piano, vocals, singing, speech, chords, orchestra, reverb, ambience, static, hiss, noise, distortion, artifacts, low quality"
    ).audios[0]

    # --- Save the generated audio to a WAV file (robust) ---
    audio = np.asarray(audio)

    # If audio is shape (n,) it's fine; if it's (n,1) flatten it
    audio = audio.squeeze()

    # Clip to valid range
    audio = np.clip(audio, -1.0, 1.0)

    # Convert to int16 PCM (standard wav format)
    audio_int16 = (audio * 32767.0).astype(np.int16)

    # AudioLDM produces audio at a 16 kHz sample rate
    scipy.io.wavfile.write(output_file, rate=16000, data=audio_int16)
    print(f"Audio saved to: {output_file}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Generate audio from text using AudioLDM")
    parser.add_argument("--prompt", type=str, required=True, help="Text description for audio generation")
    parser.add_argument("--out", type=str, default="generated.wav", help="Output WAV filename")
    parser.add_argument("--time", type=float, default=5.0, help="Audio duration in seconds")
    args = parser.parse_args()

    generate_audio(
        args.prompt,
        duration=args.time,
        output_file=args.out
    )

Overwriting run_model.py
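
The save step in `run_model.py` squeezes, clips, and scales the float audio to 16-bit PCM. The sketch below isolates that conversion so it can be checked on a few hand-picked values without loading the model; `float_to_int16_pcm` is a hypothetical helper name introduced here for illustration.

```python
import numpy as np


def float_to_int16_pcm(audio):
    """Convert float audio in roughly [-1, 1] to 1-D int16 PCM samples."""
    audio = np.asarray(audio, dtype=np.float32).squeeze()  # e.g. (n, 1) -> (n,)
    audio = np.clip(audio, -1.0, 1.0)  # out-of-range values would not fit in int16
    return (audio * 32767.0).astype(np.int16)  # astype truncates toward zero


# Values above 1.0 or below -1.0 are clipped before scaling
samples = float_to_int16_pcm([[0.0], [0.5], [1.5], [-2.0]])
print(samples)
```

Clipping before the cast matters because a value such as 1.5 would scale to 49150.5, which is outside the int16 range.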


Generate drum audio for the chosen prompt and duration

In [None]:
# Run AudioLDM with the chosen prompt (change the prompt as needed)
!python run_model.py --prompt "A dramatic drum solo in a huge hall" --out "drums.wav" --time 10


Full prompt sent to AudioLDM:
  "drums solo in HipHop style"

Starting generation for prompt: 'drums solo in HipHop style'
Keyword arguments {'dtype': torch.float16} are not expected by AudioLDMPipeline and will be ignored.
Loading pipeline components...: 100% 6/6 [00:01<00:00,  5.48it/s]
Using CUDA GPU
Generating audio...
100% 50/50 [00:04<00:00, 11.31it/s]
Audio saved to: drums.wav
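
To try all the example prompts from the top of the notebook, one option is to build one `run_model.py` command per prompt. This is a minimal sketch that only constructs and prints the commands; `slugify` and `build_command` are hypothetical helpers, and the flags are the ones defined in the script's `argparse` setup.

```python
import re
import shlex

EXAMPLE_PROMPTS = [
    "Vintage jazz drum kit",
    "Heavy metal drums with distortion",
    "80s electronic drum machine",
]


def slugify(text):
    """Lowercase the text and replace runs of non-alphanumerics with underscores."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def build_command(prompt, seconds=5.0):
    """Build the argv list for one run_model.py call, naming the output after the prompt."""
    return ["python", "run_model.py", "--prompt", prompt,
            "--out", f"{slugify(prompt)}.wav", "--time", str(seconds)]


# Print shell-safe command lines; nothing is executed here
for prompt in EXAMPLE_PROMPTS:
    print(shlex.join(build_command(prompt)))
```

Passing each list to `subprocess.run(..., check=True)` would execute it. Note that every call reloads the model, so importing `generate_audio` and loading the pipeline once would likely be faster for larger batches.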


In [None]:
from IPython.display import Audio
Audio("drums.wav")
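
Besides listening, a quick numeric check of the saved file can confirm the sample rate and duration. The sketch below assumes the file is PCM WAV as written by `run_model.py`; `describe_wav` is a hypothetical helper name.

```python
import os

import numpy as np
import scipy.io.wavfile


def describe_wav(path):
    """Return (sample_rate, duration_seconds, peak_amplitude) for a PCM WAV file."""
    rate, data = scipy.io.wavfile.read(path)
    duration = data.shape[0] / rate
    # Widen to int32 first so abs(-32768) does not overflow int16
    peak = int(np.abs(data.astype(np.int32)).max()) if data.size else 0
    return rate, duration, peak


# Only inspect the file if the generation step has produced it
if os.path.exists("drums.wav"):
    rate, duration, peak = describe_wav("drums.wav")
    print(f"{rate} Hz, {duration:.2f} s, peak {peak} / 32767")
```

A peak far below 32767 suggests a quiet result, while a peak at 32767 suggests the clip step in `run_model.py` was active.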