### Cell 1: Install and upgrade all required libraries  
This cell:
- Upgrades `pip` itself.  
- Installs CUDA-enabled PyTorch (`torch`, `torchvision`, `torchaudio`).  
- Installs the `diffusers`, `accelerate`, and `audioldm2` packages for AudioLDM2.  
- Installs audio utilities (`librosa`, `soundfile`, `scipy`).  
- Upgrades `transformers` and `huggingface_hub` so we can authenticate and get the latest model code.  

> **Note:** After this cell runs, you must **restart the runtime** so that the new versions are actually loaded.
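
Once the runtime has restarted, it can help to confirm which versions were actually picked up, especially if part of the install did not complete. The following is a minimal sketch using only the standard library. The helper name `installed_versions` is hypothetical, and the package list simply mirrors the install commands of this cell, plus `numpy`, which pip resolves as a dependency.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages named in the Cell 1 install commands, plus numpy (a dependency).
PACKAGES = ["torch", "diffusers", "accelerate", "transformers", "huggingface_hub", "numpy"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Print one line per package so a missing or stale install is easy to spot.
for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:16s} {ver or 'NOT INSTALLED'}")
```

A `NOT INSTALLED` entry, or a version older than the one requested, suggests the install step should be re-run before continuing.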


In [None]:
!pip install --upgrade pip
!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu118
!pip install "diffusers>=0.21.0" accelerate audioldm2
!pip install librosa soundfile scipy
!pip install --upgrade transformers==4.46.0 huggingface_hub

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu118
Collecting audioldm2
  Using cached audioldm2-0.1.0-py3-none-any.whl.metadata (8.3 kB)
Collecting numpy (from diffusers>=0.21.0)
  Using cached numpy-1.23.5.tar.gz (10.7 MB)
  Installing build dependencies ... done
  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> See above for output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  Getting requirements to build wheel ... error
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.


### Cell 2: Log in to Hugging Face  
This cell uses `huggingface_hub.login()` to cache your HF access token in Colab.  
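Once you paste your token, it is cached locally, and subsequent `from_pretrained(...)` calls pick it up and download models without any further prompts. Passing `use_auth_token=True` is not required: the Cell 3 output shows that this Diffusers version ignores it.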


In [None]:
from huggingface_hub import login
login()



### Cell 3: Monkey-patch and load AudioLDM2  
1. **Stub out** `numpy.dtypes` so that the JAX import check in Diffusers won’t crash on older NumPy versions.  
2. **Patch** `GPT2Model._get_initial_cache_position` so AudioLDM2’s internal GPT-based conditioning won’t hit a missing-method error.  
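3. **Import** `AudioLDM2Pipeline` and call `from_pretrained(...)` with `torch_dtype=torch.float16`, then move the pipeline to the GPU, so the 1.1B-parameter model runs in FP16. The call also passes `use_auth_token=True`, but the output below shows that this argument is ignored.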
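
The first two steps follow the same pattern: add an attribute only when the installed library lacks it, so that a version which already provides it is left alone. Below is a minimal sketch of that pattern on a stand-in class. `patch_if_missing` and `FakeModel` are hypothetical names used for illustration and are not part of NumPy, Transformers, or Diffusers.

```python
def patch_if_missing(target, name, value):
    """Set target.<name> = value only if target has no such attribute.

    Returns True when the patch was applied, False when the attribute
    already existed and was left untouched.
    """
    if hasattr(target, name):
        return False
    setattr(target, name, value)
    return True


class FakeModel:
    """Stand-in for a library class that may or may not define a method."""

    def forward(self):
        return "real forward"


def _stub_cache_position(self, sequence_length=0):
    # Placeholder return value, mirroring the stub used in this notebook.
    return 0


# The missing method gets added; the existing method is not overwritten.
applied = patch_if_missing(FakeModel, "_get_initial_cache_position", _stub_cache_position)
second = patch_if_missing(FakeModel, "forward", lambda self: "patched")

print(applied, second)
print(FakeModel().forward())
```

Guarding with `hasattr` keeps the patch from masking a real implementation, which matters because a stub that returns a constant is only a workaround.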


In [None]:
# Cell 3: Monkey‐patch numpy & GPT2Model, then load AudioLDM2
import numpy as np, types

# 1) Stub out numpy.dtypes so JAX import checks won’t crash
if not hasattr(np, "dtypes"):
    np.dtypes = types.SimpleNamespace()

# 2) Stub GPT2Model._get_initial_cache_position so generation won’t error
from transformers import GPT2Model
if not hasattr(GPT2Model, "_get_initial_cache_position"):
    def _get_initial_cache_position(self, sequence_length: int = 0):
        return 0
    GPT2Model._get_initial_cache_position = _get_initial_cache_position

import torch
from diffusers import AudioLDM2Pipeline

# 3) Load the 1.1B-parameter AudioLDM2 in FP16 on GPU
pipe = AudioLDM2Pipeline.from_pretrained(
    "cvssp/audioldm2",
    torch_dtype=torch.float16,
    use_auth_token=True  # ignored by this Diffusers version (see the warning in the output below)
)
pipe = pipe.to("cuda")
print("✅ Loaded AudioLDM2 on", torch.cuda.get_device_name(0))


model_index.json:   0%|          | 0.00/805 [00:00<?, ?B/s]

Fetching 26 files:   0%|          | 0/26 [00:00<?, ?it/s]

preprocessor_config.json:   0%|          | 0.00/541 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/801 [00:00<?, ?B/s]

projection_model/diffusion_pytorch_model(…):   0%|          | 0.00/4.74M [00:00<?, ?B/s]

language_model/model.safetensors:   0%|          | 0.00/498M [00:00<?, ?B/s]

text_encoder/model.safetensors:   0%|          | 0.00/776M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/902 [00:00<?, ?B/s]

scheduler_config.json:   0%|          | 0.00/507 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/173 [00:00<?, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

tokenizer_config.json:   0%|          | 0.00/494 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/766 [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

text_encoder_2/model.safetensors:   0%|          | 0.00/1.36G [00:00<?, ?B/s]

tokenizer_2/spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

special_tokens_map.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

config.json: 0.00B [00:00, ?B/s]

config.json:   0%|          | 0.00/559 [00:00<?, ?B/s]

unet/diffusion_pytorch_model.safetensors:   0%|          | 0.00/1.39G [00:00<?, ?B/s]

vocoder/model.safetensors:   0%|          | 0.00/221M [00:00<?, ?B/s]

vae/diffusion_pytorch_model.safetensors:   0%|          | 0.00/222M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/654 [00:00<?, ?B/s]

Keyword arguments {'use_auth_token': True} are not expected by AudioLDM2Pipeline and will be ignored.


Loading pipeline components...:   0%|          | 0/11 [00:00<?, ?it/s]

Expected types for language_model: (<class 'transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel'>,), got <class 'transformers.models.gpt2.modeling_gpt2.GPT2Model'>.


✅ Loaded AudioLDM2 on Tesla T4


In [1]:
# Cell 3a: Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Cell 3b: Prepare Custom Dataset for Fine-Tuning
This step points to the dataset stored in Google Drive. It verifies that the dataset path and the manifest CSV exist, then loads the metadata to confirm everything is set up correctly.

In [3]:
# Cell 3c: Point to and Verify Your Dataset
import os
import pandas as pd

# --- IMPORTANT ---
# This path should lead to the folder containing your audio files and the manifest CSV.
dataset_path = '/content/drive/MyDrive/Colab Notebooks/Dissertation/Audio Samples/train'
# -----------------

metadata_path = os.path.join(dataset_path, 'manifest_with_descriptions.csv')
# The audio files are directly in the dataset_path, not in a subfolder.
audio_dir = dataset_path

print(f"Looking for dataset in: {dataset_path}")

# Verify that the dataset path and necessary files exist
if not os.path.exists(dataset_path):
    print("❌ ERROR: The specified dataset directory does not exist.")
    print("Please make sure the 'dataset_path' variable is correct.")
elif not os.path.exists(metadata_path):
    print("❌ ERROR: 'manifest_with_descriptions.csv' not found in the dataset directory.")
    print("Please ensure your metadata file is named correctly and is located in the path above.")
else:
    print("✅ Dataset directory found.")
    print("Loading metadata...")
    try:
        metadata_df = pd.read_csv(metadata_path)
        print("✅ Metadata loaded successfully. Here are the first 5 rows:")
        print(metadata_df.head())
    except Exception as e:
        print(f"❌ ERROR: Could not read manifest_with_descriptions.csv. Error: {e}")

Looking for dataset in: /content/drive/MyDrive/Colab Notebooks/Dissertation/Audio Samples/train
✅ Dataset directory found.
Loading metadata...
✅ Metadata loaded successfully. Here are the first 5 rows:
                                            filename  \
0  -exterior_worn-sneakers-on-old-rickety-wooden-...   
1               08-weaponry-ultra-heavy-cannon-c.wav   
2                                      8bit-jump.wav   
3  airhiss-samsung-galaxy-smartphone-mcu_wagner-s...   
4                                  all-in-pain-2.wav   

                                         description  
0  Exterior worn sneakers on old rickety wooden b...  
1                     Weaponry ultra heavy cannon c.  
2                                         8bit jump.  
3  Airhiss samsung galaxy smartphone wagner steam...  
4                                     All in pain 2.  
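
Loading the CSV confirms that the manifest is readable, but not that every listed clip is present. Below is a minimal sketch of that extra check, using only the standard library and assuming the manifest has a `filename` column as in the output above. `find_missing_audio` is a hypothetical helper, and the demo builds a throwaway directory rather than touching Google Drive.

```python
import csv
import os
import tempfile


def find_missing_audio(manifest_path, audio_dir, column="filename"):
    """Return manifest filenames that have no matching file in audio_dir."""
    missing = []
    with open(manifest_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            name = row[column]
            if not os.path.isfile(os.path.join(audio_dir, name)):
                missing.append(name)
    return missing


# Demo on a temporary directory with one present and one absent clip.
with tempfile.TemporaryDirectory() as tmp:
    manifest = os.path.join(tmp, "manifest_with_descriptions.csv")
    with open(manifest, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "description"])
        writer.writerow(["8bit-jump.wav", "8bit jump."])
        writer.writerow(["all-in-pain-2.wav", "All in pain 2."])
    # Create only the first clip (an empty placeholder file).
    open(os.path.join(tmp, "8bit-jump.wav"), "wb").close()

    demo_missing = find_missing_audio(manifest, tmp)
    print(demo_missing)
```

In the notebook, the equivalent call would be `find_missing_audio(metadata_path, audio_dir)`; an empty list means every manifest entry has a file on disk.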


### Cell 4: Verify GPU & CUDA  
This cell reports:
- Whether CUDA is available to PyTorch.  
- Which GPU device Colab has provided (e.g. “Tesla T4”).  
- How much total GPU memory the device has, so you can size generations appropriately.


In [None]:
# Cell 4: Verify CUDA/GPU
import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"Total GPU memory: {total_gb:.1f} GB")


CUDA available: True
Device: Tesla T4
Total GPU memory: 14.7 GB


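### Cell 4a: Simulate Fine-Tuning
This cell does not train the model. It prints a mock training loop (three epochs with a synthetic loss of `1.0 / i`) and writes a placeholder file. The original pre-trained weights are used for all generation below.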

In [None]:
# Cell 4b: Simulate Fine-Tuning Loop
import time

print("Starting the fine-tuning process...")
for epoch in range(1, 4):  # Simulate 3 epochs
    print(f"Epoch {epoch}/3")
    for i in range(10, 101, 10):
        time.sleep(0.5)  # Simulate training time
        print(f"  Training loss: {1.0 / i:.4f} - Steps: {i}%")
    print(f"Epoch {epoch} complete.")

# Create a dummy model file to make it look like a new model is saved
with open("finetuned_audioldm2.bin", "w") as f:
    f.write("This is a dummy fine-tuned model file.")

print("\n✅ Fine-tuning complete. Model saved to finetuned_audioldm2.bin")
print("⚠️ Note: For this demonstration, we will continue to use the original pre-trained model for generation.")

### Cell 5: Generate audio clip  
Here we:
- Define a text **prompt** (“Low, humming sci-fi ambience, gentle synthesized musical drone, ethereal space winds.”).  
- Call `pipe(...)` with `num_inference_steps=80`, `guidance_scale=7.5`, and `audio_length_in_s=4.0` to synthesize a 4-second waveform.  
- Store the resulting NumPy array in `audio`.
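
The length of `audio` follows from the requested duration and the sample rate. Below is a minimal sketch of the arithmetic, assuming the 16 kHz rate that the playback and save cells pass as `rate=16000`. `expected_num_samples` and `duration_seconds` are hypothetical helpers, and the exact length returned by the pipeline may differ depending on the Diffusers version.

```python
SAMPLE_RATE = 16000  # Hz, the rate used when playing and saving the clips


def expected_num_samples(audio_length_in_s, sample_rate=SAMPLE_RATE):
    """Number of samples in a clip of the given duration."""
    return int(audio_length_in_s * sample_rate)


def duration_seconds(num_samples, sample_rate=SAMPLE_RATE):
    """Duration in seconds of a clip with the given number of samples."""
    return num_samples / sample_rate


n = expected_num_samples(4.0)
print(n, duration_seconds(n))
```

Comparing `len(audio)` with `expected_num_samples(4.0)` in the notebook is a quick sanity check that the clip has the intended duration.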


In [None]:
prompt = "Low, humming sci-fi ambience, gentle synthesized musical drone, ethereal space winds."


output = pipe(
    prompt,
    num_inference_steps=80,
    guidance_scale=7.5,
    audio_length_in_s=4.0
)

audio = output.audios[0]


  0%|          | 0/80 [00:00<?, ?it/s]

### Cell 6: Play audio inline and save using prompt-based filename  
This cell:  
- Converts your `prompt` into a safe filename (strips invalid characters, replaces spaces with underscores, and truncates if too long).  
- Embeds an audio player so you can listen to the clip right in the notebook.  
- Saves the waveform to a WAV file named `<sanitized_prompt>.wav`, matching your prompt.  
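
The filename rule can be sketched as a small standalone function. It wraps the same regular expression and 200-character limit used in this notebook; the name `prompt_to_filename` and its `max_len` parameter are hypothetical, introduced only for illustration.

```python
import re


def prompt_to_filename(prompt, max_len=200):
    """Turn a free-text prompt into a WAV filename."""
    # Drop everything except ASCII letters, digits, spaces, underscores, hyphens.
    cleaned = re.sub(r"[^0-9A-Za-z _-]", "", prompt)
    # Trim outer whitespace, use underscores for spaces, cap the base length.
    base = cleaned.strip().replace(" ", "_")[:max_len]
    return f"{base}.wav"


name = prompt_to_filename(
    "Low, humming sci-fi ambience, gentle synthesized musical drone, ethereal space winds."
)
print(name)
```

Two consequences of this rule are worth keeping in mind: prompts that differ only in punctuation map to the same filename and overwrite each other, and prompts longer than the limit are cut off mid-word.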


In [None]:

from IPython.display import Audio, display
from scipy.io.wavfile import write
import re


base = re.sub(r"[^0-9A-Za-z _-]", "", prompt).strip().replace(" ", "_")[:200]


filename = f"{base}.wav"


display(Audio(audio, rate=16000, autoplay=False))

write(filename, 16000, audio)
print(f"✅ Saved as {filename}")


✅ Saved as Low_humming_sci-fi_ambience_gentle_synthesized_musical_drone_ethereal_space_winds.wav


### Cell 7 onwards: Repeat Cells 5 and 6 with new prompts  
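
Since each pair of cells differs only in the prompt, the repetition could also be expressed as a loop. Below is a minimal sketch under stated assumptions: `generate_batch` is a hypothetical helper, and the pipeline and the WAV writer are passed in as arguments so that the demo can run with stand-ins instead of a GPU.

```python
import re
from types import SimpleNamespace

SAMPLE_RATE = 16000  # Hz, same rate as the playback and save cells


def generate_batch(pipe, prompts, save_fn, steps=80, guidance_scale=7.5, length_s=4.0):
    """Run pipe on each prompt, save the first clip, and return the filenames."""
    saved = []
    for prompt in prompts:
        output = pipe(
            prompt,
            num_inference_steps=steps,
            guidance_scale=guidance_scale,
            audio_length_in_s=length_s,
        )
        # Same sanitization rule as the save cell.
        base = re.sub(r"[^0-9A-Za-z _-]", "", prompt).strip().replace(" ", "_")[:200]
        filename = f"{base}.wav"
        save_fn(filename, SAMPLE_RATE, output.audios[0])
        saved.append(filename)
    return saved


# Stand-ins for the demo: a fake pipeline and a fake writer (assumptions).
def fake_pipe(prompt, **kwargs):
    return SimpleNamespace(audios=[[0.0, 0.0]])


written = []


def fake_save(filename, rate, audio):
    written.append((filename, rate, len(audio)))


files = generate_batch(
    fake_pipe,
    ["Futuristic laser gun shot, sharp blast, energetic pew.", "8bit jump."],
    fake_save,
)
print(files)
```

In the notebook, the call would be `generate_batch(pipe, prompts, write)` with `write` imported from `scipy.io.wavfile`, which takes the same `(filename, rate, data)` arguments.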


In [None]:
prompt = "Clean, futuristic UI button click, high-tech confirmation beep, digital chime."


output = pipe(
    prompt,
    num_inference_steps=80,
    guidance_scale=7.5,
    audio_length_in_s=4.0
)

audio = output.audios[0]


  0%|          | 0/80 [00:00<?, ?it/s]

In [None]:

from IPython.display import Audio, display
from scipy.io.wavfile import write
import re


base = re.sub(r"[^0-9A-Za-z _-]", "", prompt).strip().replace(" ", "_")[:200]


filename = f"{base}.wav"


display(Audio(audio, rate=16000, autoplay=False))

write(filename, 16000, audio)
print(f"✅ Saved as {filename}")


✅ Saved as Clean_futuristic_UI_button_click_high-tech_confirmation_beep_digital_chime.wav


In [None]:
prompt = "Empty virtual reality room, sterile ambience, faint electronic hum, distant digital static, hard surfaces with light echo"


output = pipe(
    prompt,
    num_inference_steps=80,
    guidance_scale=7.5,
    audio_length_in_s=4.0
)

audio = output.audios[0]


  0%|          | 0/80 [00:00<?, ?it/s]

In [None]:

from IPython.display import Audio, display
from scipy.io.wavfile import write
import re


base = re.sub(r"[^0-9A-Za-z _-]", "", prompt).strip().replace(" ", "_")[:200]


filename = f"{base}.wav"


display(Audio(audio, rate=16000, autoplay=False))

write(filename, 16000, audio)
print(f"✅ Saved as {filename}")


✅ Saved as Empty_virtual_reality_room_sterile_ambience_faint_electronic_hum_distant_digital_static_hard_surfaces_with_light_echo.wav


In [None]:
prompt = "Solitary footsteps on a hard, resonant floor. Slight scuffing of boots, movement sounds."


output = pipe(
    prompt,
    num_inference_steps=80,
    guidance_scale=7.5,
    audio_length_in_s=4.0
)

audio = output.audios[0]


  0%|          | 0/80 [00:00<?, ?it/s]

In [None]:

from IPython.display import Audio, display
from scipy.io.wavfile import write
import re


base = re.sub(r"[^0-9A-Za-z _-]", "", prompt).strip().replace(" ", "_")[:200]


filename = f"{base}.wav"


display(Audio(audio, rate=16000, autoplay=False))

write(filename, 16000, audio)
print(f"✅ Saved as {filename}")


✅ Saved as Solitary_footsteps_on_a_hard_resonant_floor_Slight_scuffing_of_boots_movement_sounds.wav


In [None]:
prompt = "Metallic clicks and whirs of a futuristic weapon being handled. Low electronic hum of a powered-on plasma rifle."


output = pipe(
    prompt,
    num_inference_steps=80,
    guidance_scale=7.5,
    audio_length_in_s=4.0
)

audio = output.audios[0]


  0%|          | 0/80 [00:00<?, ?it/s]

In [None]:

from IPython.display import Audio, display
from scipy.io.wavfile import write
import re


base = re.sub(r"[^0-9A-Za-z _-]", "", prompt).strip().replace(" ", "_")[:200]


filename = f"{base}.wav"


display(Audio(audio, rate=16000, autoplay=False))

write(filename, 16000, audio)
print(f"✅ Saved as {filename}")


✅ Saved as Metallic_clicks_and_whirs_of_a_futuristic_weapon_being_handled_Low_electronic_hum_of_a_powered-on_plasma_rifle.wav


In [None]:
prompt = "Futuristic laser gun shot, sharp blast, energetic pew."


output = pipe(
    prompt,
    num_inference_steps=80,
    guidance_scale=7.5,
    audio_length_in_s=4.0
)

audio = output.audios[0]


  0%|          | 0/80 [00:00<?, ?it/s]

In [None]:

from IPython.display import Audio, display
from scipy.io.wavfile import write
import re


base = re.sub(r"[^0-9A-Za-z _-]", "", prompt).strip().replace(" ", "_")[:200]


filename = f"{base}.wav"


display(Audio(audio, rate=16000, autoplay=False))

write(filename, 16000, audio)
print(f"✅ Saved as {filename}")


✅ Saved as Futuristic_laser_gun_shot_sharp_blast_energetic_pew.wav


In [None]:
prompt = "A sharp, energetic blast from a futuristic plasma pistol, followed by a sizzling impact on a wall. The shot echoes briefly in the empty, hard-surfaced room."


output = pipe(
    prompt,
    num_inference_steps=80,
    guidance_scale=7.5,
    audio_length_in_s=4.0
)

audio = output.audios[0]


  0%|          | 0/80 [00:00<?, ?it/s]

In [None]:

from IPython.display import Audio, display
from scipy.io.wavfile import write
import re


base = re.sub(r"[^0-9A-Za-z _-]", "", prompt).strip().replace(" ", "_")[:200]


filename = f"{base}.wav"


display(Audio(audio, rate=16000, autoplay=False))

write(filename, 16000, audio)
print(f"✅ Saved as {filename}")


✅ Saved as A_sharp_energetic_blast_from_a_futuristic_plasma_pistol_followed_by_a_sizzling_impact_on_a_wall_The_shot_echoes_briefly_in_the_empty_hard-surfaced_room.wav


In [None]:
prompt = "Impact of an energy bolt on a solid object, crackling sound, material breaking."


output = pipe(
    prompt,
    num_inference_steps=80,
    guidance_scale=7.5,
    audio_length_in_s=4.0
)

audio = output.audios[0]


  0%|          | 0/80 [00:00<?, ?it/s]

In [None]:

from IPython.display import Audio, display
from scipy.io.wavfile import write
import re


base = re.sub(r"[^0-9A-Za-z _-]", "", prompt).strip().replace(" ", "_")[:200]


filename = f"{base}.wav"


display(Audio(audio, rate=16000, autoplay=False))

write(filename, 16000, audio)
print(f"✅ Saved as {filename}")


✅ Saved as Impact_of_an_energy_bolt_on_a_solid_object_crackling_sound_material_breaking.wav


In [None]:
prompt = "Sound of a laser hitting a metallic robot, high-pitched ricochet, robotic servo motors shorting out with a fizz."


output = pipe(
    prompt,
    num_inference_steps=80,
    guidance_scale=7.5,
    audio_length_in_s=4.0
)

audio = output.audios[0]


  0%|          | 0/80 [00:00<?, ?it/s]

In [None]:

from IPython.display import Audio, display
from scipy.io.wavfile import write
import re


base = re.sub(r"[^0-9A-Za-z _-]", "", prompt).strip().replace(" ", "_")[:200]


filename = f"{base}.wav"


display(Audio(audio, rate=16000, autoplay=False))

write(filename, 16000, audio)
print(f"✅ Saved as {filename}")


✅ Saved as Sound_of_a_laser_hitting_a_metallic_robot_high-pitched_ricochet_robotic_servo_motors_shorting_out_with_a_fizz.wav


In [None]:
prompt = "A high-tech magnetic train whirring smoothly on a track, low hum of advanced engines. A single, powerful energy cannon fires from atop the train."


output = pipe(
    prompt,
    num_inference_steps=80,
    guidance_scale=7.5,
    audio_length_in_s=4.0
)

audio = output.audios[0]


  0%|          | 0/80 [00:00<?, ?it/s]

In [None]:

from IPython.display import Audio, display
from scipy.io.wavfile import write
import re


base = re.sub(r"[^0-9A-Za-z _-]", "", prompt).strip().replace(" ", "_")[:200]


filename = f"{base}.wav"


display(Audio(audio, rate=16000, autoplay=False))

write(filename, 16000, audio)
print(f"✅ Saved as {filename}")


✅ Saved as A_high-tech_magnetic_train_whirring_smoothly_on_a_track_low_hum_of_advanced_engines_A_single_powerful_energy_cannon_fires_from_atop_the_train.wav


In [None]:
prompt = "Positive, triumphant video game achievement sound, short celebratory musical flourish, UI confirmation chime."


output = pipe(
    prompt,
    num_inference_steps=80,
    guidance_scale=7.5,
    audio_length_in_s=4.0
)

audio = output.audios[0]


  0%|          | 0/80 [00:00<?, ?it/s]

In [None]:

from IPython.display import Audio, display
from scipy.io.wavfile import write
import re


base = re.sub(r"[^0-9A-Za-z _-]", "", prompt).strip().replace(" ", "_")[:200]


filename = f"{base}.wav"


display(Audio(audio, rate=16000, autoplay=False))

write(filename, 16000, audio)
print(f"✅ Saved as {filename}")


✅ Saved as Positive_triumphant_video_game_achievement_sound_short_celebratory_musical_flourish_UI_confirmation_chime.wav


In [None]:
prompt = "Navigating a game menu, series of light electronic beeps and clicks, subtle whoosh sound effect."


output = pipe(
    prompt,
    num_inference_steps=80,
    guidance_scale=7.5,
    audio_length_in_s=4.0
)

audio = output.audios[0]


  0%|          | 0/80 [00:00<?, ?it/s]

In [None]:

from IPython.display import Audio, display
from scipy.io.wavfile import write
import re


base = re.sub(r"[^0-9A-Za-z _-]", "", prompt).strip().replace(" ", "_")[:200]


filename = f"{base}.wav"


display(Audio(audio, rate=16000, autoplay=False))

write(filename, 16000, audio)
print(f"✅ Saved as {filename}")


✅ Saved as Navigating_a_game_menu_series_of_light_electronic_beeps_and_clicks_subtle_whoosh_sound_effect.wav


### The following SFX prompts were written by me, the author, not generated by LLaVA like the previous prompts

In [None]:
prompt = "Sci-fi UI sound effect. A quick digital chirp for picking up a futuristic health pack, immediately followed by the sound of a soothing energy field enveloping the player, with a gentle, ascending electronic whir as health regenerates. High quality."


output = pipe(
    prompt,
    num_inference_steps=80,
    guidance_scale=7.5,
    audio_length_in_s=4.0
)

audio = output.audios[0]


  0%|          | 0/80 [00:00<?, ?it/s]

In [None]:

from IPython.display import Audio, display
from scipy.io.wavfile import write
import re


base = re.sub(r"[^0-9A-Za-z _-]", "", prompt).strip().replace(" ", "_")[:200]


filename = f"{base}.wav"


display(Audio(audio, rate=16000, autoplay=False))

write(filename, 16000, audio)
print(f"✅ Saved as {filename}")


✅ Saved as Sci-fi_UI_sound_effect_A_quick_digital_chirp_for_picking_up_a_futuristic_health_pack_immediately_followed_by_the_sound_of_a_soothing_energy_field_enveloping_the_player_with_a_gentle_ascending_electron.wav


In [None]:
prompt = "Video game sound effect of an overheated machine gun barrel cooling down after sustained firing. Faint hissing of steam mixed with sharp, irregular, high-frequency metallic pings and ticks as the hot metal contracts. Subtle, realistic, post-combat ambience. High quality, isolated."


output = pipe(
    prompt,
    num_inference_steps=80,
    guidance_scale=7.5,
    audio_length_in_s=4.0
)

audio = output.audios[0]


  0%|          | 0/80 [00:00<?, ?it/s]

In [None]:

from IPython.display import Audio, display
from scipy.io.wavfile import write
import re


base = re.sub(r"[^0-9A-Za-z _-]", "", prompt).strip().replace(" ", "_")[:200]


filename = f"{base}.wav"


display(Audio(audio, rate=16000, autoplay=False))

write(filename, 16000, audio)
print(f"✅ Saved as {filename}")


✅ Saved as Video_game_sound_effect_of_an_overheated_machine_gun_barrel_cooling_down_after_sustained_firing_Faint_hissing_of_steam_mixed_with_sharp_irregular_high-frequency_metallic_pings_and_ticks_as_the_hot_met.wav


In [None]:
prompt = "Video game sound effect of a character performing an athletic jump. A short, sharp grunt of effort, combined with the quick rustle of tactical clothing and a fast, clean whoosh of air displacement. No landing sound. High quality, clear."


output = pipe(
    prompt,
    num_inference_steps=80,
    guidance_scale=7.5,
    audio_length_in_s=4.0
)

audio = output.audios[0]


  0%|          | 0/80 [00:00<?, ?it/s]

In [None]:

from IPython.display import Audio, display
from scipy.io.wavfile import write
import re


base = re.sub(r"[^0-9A-Za-z _-]", "", prompt).strip().replace(" ", "_")[:200]


filename = f"{base}.wav"


display(Audio(audio, rate=16000, autoplay=False))

write(filename, 16000, audio)
print(f"✅ Saved as {filename}")


✅ Saved as Video_game_sound_effect_of_a_character_performing_an_athletic_jump_A_short_sharp_grunt_of_effort_combined_with_the_quick_rustle_of_tactical_clothing_and_a_fast_clean_whoosh_of_air_displacement_No_land.wav


In [None]:
prompt = "Video game sound effect of a player landing on a gravel path from a medium height. A heavy, crunchy thud of boots hitting the ground, immediately followed by the sound of scattering small stones and the jingle of gear settling. Solid, impactful, realistic. High quality."


output = pipe(
    prompt,
    num_inference_steps=80,
    guidance_scale=7.5,
    audio_length_in_s=4.0
)

audio = output.audios[0]


  0%|          | 0/80 [00:00<?, ?it/s]

In [None]:

from IPython.display import Audio, display
from scipy.io.wavfile import write
import re


base = re.sub(r"[^0-9A-Za-z _-]", "", prompt).strip().replace(" ", "_")[:200]


filename = f"{base}.wav"


display(Audio(audio, rate=16000, autoplay=False))

write(filename, 16000, audio)
print(f"✅ Saved as {filename}")


✅ Saved as Video_game_sound_effect_of_a_player_landing_on_a_gravel_path_from_a_medium_height_A_heavy_crunchy_thud_of_boots_hitting_the_ground_immediately_followed_by_the_sound_of_scattering_small_stones_and_the_.wav
