# Demo 1 – Minecraft Sound Generation with AudioLDM2

**Goal:** Show that small-scale LoRA adaptation of AudioLDM2 shifts generations toward Minecraft SFX.

**Runtime:** Google Colab with T4 GPU

### Pipeline
1. Clone repo & install dependencies
2. Fetch Minecraft sound assets (e.g. mob and ambient categories)
3. Preprocess audio → 16 kHz mono .wav, fixed 4 s length
4. Build manifest (manifest.csv with captions + train/val split)
5. Generate **baseline** samples from vanilla AudioLDM2
6. LoRA fine-tune UNet on the Minecraft dataset
7. Generate **adapted** samples and compare

---
## 0 · Check GPU & Setup

In [1]:
# Verify GPU is available
!nvidia-smi --query-gpu=name,memory.total --format=csv,noheader

import torch
print(f"PyTorch {torch.__version__}  |  CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

Tesla T4, 15360 MiB
PyTorch 2.10.0+cu128  |  CUDA available: True
GPU: Tesla T4


---
## 1 · Clone Repo & Install Dependencies

In [2]:
import os

# ── Clone the project repo (change URL to your fork) ──
REPO_URL = "https://github.com/BHatiru/GenAI-Minecraft-Sounds.git"  # TODO: update
REPO_DIR = "/content/GenAI-Minecraft-Sounds"

if not os.path.exists(REPO_DIR):
    !git clone {REPO_URL} {REPO_DIR}
os.chdir(REPO_DIR)
print(f"Working directory: {os.getcwd()}")

Cloning into '/content/GenAI-Minecraft-Sounds'...
remote: Enumerating objects: 46, done.
remote: Counting objects: 100% (46/46), done.
remote: Compressing objects: 100% (29/29), done.
remote: Total 46 (delta 8), reused 35 (delta 6), pack-reused 0 (from 0)
Receiving objects: 100% (46/46), 951.40 KiB | 14.42 MiB/s, done.
Resolving deltas: 100% (8/8), done.
Working directory: /content/GenAI-Minecraft-Sounds


In [3]:
# ── Install Python dependencies ──
# Colab already has a CUDA-enabled PyTorch – do NOT reinstall it,
# or the cu118 build will conflict with Colab's CUDA 12.x drivers
# and torch.cuda.is_available() will return False.
!pip install -q librosa soundfile pydub pyyaml requests tqdm
!pip install -q diffusers[torch] transformers accelerate peft datasets scipy

# Verify CUDA is visible to PyTorch
import torch
print(f"torch {torch.__version__}  |  CUDA: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

torch 2.10.0+cu128  |  CUDA: True
GPU: Tesla T4


---
## 2 · Fetch Minecraft Sound Assets

In [4]:
!python scripts/fetch_minecraft_assets.py --config configs/demo1.yaml

2026-02-28 12:47:53,993  INFO      Fetching category: mob/zombie
2026-02-28 12:47:54,160  INFO        ↓ mob/zombie/death.ogg
2026-02-28 12:47:54,300  INFO        ↓ mob/zombie/hurt1.ogg
2026-02-28 12:47:54,447  INFO        ↓ mob/zombie/hurt2.ogg
2026-02-28 12:47:54,585  INFO        ↓ mob/zombie/infect.ogg
2026-02-28 12:47:54,730  INFO        ↓ mob/zombie/metal1.ogg
2026-02-28 12:47:54,878  INFO        ↓ mob/zombie/metal2.ogg
2026-02-28 12:47:55,020  INFO        ↓ mob/zombie/metal3.ogg
2026-02-28 12:47:55,164  INFO        ↓ mob/zombie/remedy.ogg
2026-02-28 12:47:55,308  INFO        ↓ mob/zombie/say1.ogg
2026-02-28 12:47:55,455  INFO        ↓ mob/zombie/say2.ogg
2026-02-28 12:47:55,610  INFO        ↓ mob/zombie/say3.ogg
2026-02-28 12:47:55,779  INFO        ↓ mob/zombie/step1.ogg
2026-02-28 12:47:55,924  INFO        ↓ mob/zombie/step2.ogg
2026-02-28 12:47:56,071  INFO        ↓ mob/zombie/step3.ogg
2026-02-28 12:47:56,232  INFO        ↓ mob/zombie/step4.ogg
2026-02-28 12:47:56,382  INFO    

In [5]:
# Sanity check: list downloaded files
import glob

ogg_files = sorted(glob.glob("data/raw/**/*.ogg", recursive=True))
print(f"Total .ogg files downloaded: {len(ogg_files)}")
for f in ogg_files[:10]:
    print(f"  {f}")
if len(ogg_files) > 10:
    print(f"  ... and {len(ogg_files) - 10} more")

Total .ogg files downloaded: 195
  data/raw/ambient/cave/cave1.ogg
  data/raw/ambient/cave/cave10.ogg
  data/raw/ambient/cave/cave11.ogg
  data/raw/ambient/cave/cave12.ogg
  data/raw/ambient/cave/cave13.ogg
  data/raw/ambient/cave/cave14.ogg
  data/raw/ambient/cave/cave15.ogg
  data/raw/ambient/cave/cave16.ogg
  data/raw/ambient/cave/cave17.ogg
  data/raw/ambient/cave/cave18.ogg
  ... and 185 more


---
## 3 · Preprocess Audio

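Convert each .ogg to a 16 kHz mono .wav, trim silence, then pad or clip to exactly 4 seconds. The script also composes sequences and slowed variants (for example `cave10_slow.wav`), so the 195 source files yield 325 clips.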
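
The pad-or-clip step can be sketched as follows. This is a minimal illustration, not the code in `scripts/preprocess_audio.py`: `fix_length` is a hypothetical helper, and plain Python lists are used only to keep the sketch dependency-free.

```python
from typing import List, Sequence

TARGET_SR = 16_000       # sample rate stated in this section (16 kHz)
TARGET_SECONDS = 4.0     # fixed clip length stated in this section


def fix_length(samples: Sequence[float],
               sr: int = TARGET_SR,
               seconds: float = TARGET_SECONDS) -> List[float]:
    """Clip or zero-pad a mono signal to exactly `seconds` at rate `sr`.

    Hypothetical helper: longer inputs are truncated at the end,
    shorter inputs are padded with trailing silence (zeros).
    """
    target = int(round(sr * seconds))          # 64000 samples for 4 s at 16 kHz
    clipped = list(samples[:target])           # drop anything past the target
    padding = [0.0] * (target - len(clipped))  # empty when already long enough
    return clipped + padding


if __name__ == "__main__":
    short_clip = [0.5] * 10
    print(len(fix_length(short_clip)))
```

The 64000-sample target matches the `shape=(64000,)` reported by the sanity check in this section.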

In [6]:
!python scripts/preprocess_audio.py --config configs/demo1.yaml

2026-02-28 12:48:37,714  INFO      NumExpr defaulting to 2 threads.
2026-02-28 12:48:40,879  INFO      Loaded 195 sound blocks from data/raw
2026-02-28 12:48:40,879  INFO      ── Generating mob sequences …
2026-02-28 12:48:41,095  INFO         → 179 mob clips
2026-02-28 12:48:41,096  INFO      ── Generating step sequences …
2026-02-28 12:48:41,138  INFO         → 44 step clips
2026-02-28 12:48:41,138  INFO      ── Generating damage / combat sequences …
2026-02-28 12:48:41,152  INFO         → 20 damage/combat clips
2026-02-28 12:48:41,152  INFO      ── Generating ambient clips …
2026-02-28 12:48:41,251  INFO         → 82 ambient clips
2026-02-28 12:48:41,251  INFO      ── Total: 325 clips to export
2026-02-28 12:48:42,501  INFO      Wrote caption sidecar → data/processed/_captions.json  (325 entries)
2026-02-28 12:48:42,502  INFO      Done – 325 clips exported  |  4.00–4.00 s  |  mean 4.00 s


In [7]:
# Sanity check: verify processed files
import soundfile as sf
import numpy as np

wav_files = sorted(glob.glob("data/processed/**/*.wav", recursive=True))
print(f"Total processed .wav files: {len(wav_files)}")

# Spot-check first 3 files
for wf in wav_files[:3]:
    audio, sr = sf.read(wf, dtype="float32")
    dur = len(audio) / sr
    print(f"  {wf}  |  sr={sr}  dur={dur:.2f}s  "
          f"range=[{audio.min():.3f}, {audio.max():.3f}]  "
          f"shape={audio.shape}")

Total processed .wav files: 325
  data/processed/ambient/cave/cave1.wav  |  sr=16000  dur=4.00s  range=[-0.800, 1.000]  shape=(64000,)
  data/processed/ambient/cave/cave10.wav  |  sr=16000  dur=4.00s  range=[-0.879, 1.000]  shape=(64000,)
  data/processed/ambient/cave/cave10_slow.wav  |  sr=16000  dur=4.00s  range=[-0.887, 1.000]  shape=(64000,)


---
## 4 · Build Manifest (manifest.csv)

In [8]:
!python scripts/build_manifest.py --config configs/demo1.yaml

2026-02-28 12:48:43,275  INFO      Loaded 325 captions from data/processed/_captions.json
2026-02-28 12:48:43,286  INFO      Manifest written to data/manifest.csv  (325 rows: 277 train, 48 val)
2026-02-28 12:48:43,286  INFO      ── Example rows ──
2026-02-28 12:48:43,286  INFO        train | ambient/cave/cave1.wav | minecraft cave ambience sound effect
2026-02-28 12:48:43,286  INFO        val | ambient/cave/cave10.wav | minecraft cave ambience sound effect
2026-02-28 12:48:43,286  INFO        train | ambient/cave/cave10_slow.wav | slow minecraft cave ambience sound effect
2026-02-28 12:48:43,286  INFO        train | ambient/cave/cave11.wav | minecraft cave ambience sound effect
2026-02-28 12:48:43,286  INFO        train | ambient/cave/cave11_slow.wav | slow minecraft cave ambience sound effect
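
For orientation, a manifest with the three columns shown above (`file_name`, `caption`, `split`) could be assembled roughly like this. It is a sketch under assumptions, not the logic of `scripts/build_manifest.py`: the function names, the seeded random split, and the `val_fraction` default are all hypothetical, and the sketch does not try to reproduce the 277/48 split.

```python
import csv
import io
import random
from typing import Dict, List

FIELDS = ["file_name", "caption", "split"]  # columns seen in the manifest output


def build_manifest(captions: Dict[str, str],
                   val_fraction: float = 0.15,  # assumed value, not from the repo
                   seed: int = 0) -> List[Dict[str, str]]:
    """Assign each captioned file to 'train' or 'val' with a seeded shuffle."""
    files = sorted(captions)                      # stable order before sampling
    val_count = round(len(files) * val_fraction)
    val_files = set(random.Random(seed).sample(files, val_count))
    return [
        {
            "file_name": name,
            "caption": captions[name],
            "split": "val" if name in val_files else "train",
        }
        for name in files
    ]


def manifest_to_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize manifest rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    demo = {f"ambient/cave/cave{i}.wav": "minecraft cave ambience sound effect"
            for i in range(1, 11)}
    print(manifest_to_csv(build_manifest(demo)))
```

Seeding the split keeps the validation set fixed across reruns, so baseline and adapted models are evaluated on the same clips.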


In [9]:
# Preview the manifest
import pandas as pd

df = pd.read_csv("data/manifest.csv")
print(f"Manifest shape: {df.shape}")
print(f"Split counts:\n{df['split'].value_counts()}")
print()
df.head(10)

Manifest shape: (325, 3)
Split counts:
split
train    277
val       48
Name: count, dtype: int64



|   | file_name | caption | split |
|---|-----------|---------|-------|
| 0 | ambient/cave/cave1.wav | minecraft cave ambience sound effect | train |
| 1 | ambient/cave/cave10.wav | minecraft cave ambience sound effect | val |
| 2 | ambient/cave/cave10_slow.wav | slow minecraft cave ambience sound effect | train |
| 3 | ambient/cave/cave11.wav | minecraft cave ambience sound effect | train |
| 4 | ambient/cave/cave11_slow.wav | slow minecraft cave ambience sound effect | train |
| 5 | ambient/cave/cave12.wav | minecraft cave ambience sound effect | train |
| 6 | ambient/cave/cave12_slow.wav | slow minecraft cave ambience sound effect | train |
| 7 | ambient/cave/cave13.wav | minecraft cave ambience sound effect | train |
| 8 | ambient/cave/cave13_slow.wav | slow minecraft cave ambience sound effect | train |
| 9 | ambient/cave/cave14.wav | minecraft cave ambience sound effect | train |


---
## 5 · Listen to a Few Samples

Play some processed Minecraft sounds to verify quality.

In [10]:
import IPython.display as ipd

for wf in wav_files[:4]:
    print(f"\n▶ {wf}")
    audio, sr = sf.read(wf, dtype="float32")
    display(ipd.Audio(audio, rate=sr))


▶ data/processed/ambient/cave/cave1.wav



▶ data/processed/ambient/cave/cave10.wav



▶ data/processed/ambient/cave/cave10_slow.wav



▶ data/processed/ambient/cave/cave11.wav


---
## 6 · Baseline Generation (Vanilla AudioLDM2)

Generate samples from the pre-trained model *before* any fine-tuning.

In [30]:
# Generate baseline samples for a couple of prompts
PROMPTS = [
    "water dripping sounds",
    "metal clanging sounds",
]

for prompt in PROMPTS:
    !python -m src.mcaudio.infer.generate \
        --prompt "{prompt}" \
        --config configs/demo1.yaml \
        --num_samples 2 \
        --output outputs/demo1/baseline

2026-02-28 13:22:21,977  INFO      NumExpr defaulting to 2 threads.
Flax classes are deprecated and will be removed in Diffusers v1.0.0. We recommend migrating to PyTorch classes or pinning your version of Diffusers.
Flax classes are deprecated and will be removed in Diffusers v1.0.0. We recommend migrating to PyTorch classes or pinning your version of Diffusers.
2026-02-28 13:22:23,877  INFO      Loading AudioLDM2 pipeline: cvssp/audioldm2
2026-02-28 13:22:24,060  INFO      HTTP Request: GET https://huggingface.co/api/models/cvssp/audioldm2 "HTTP/1.1 200 OK"
2026-02-28 13:22:24,161  INFO      HTTP Request: HEAD https://huggingface.co/cvssp/audioldm2/resolve/main/model_index.json "HTTP/1.1 307 Temporary Redirect"
2026-02-28 13:22:24,170  INFO      HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/cvssp/audioldm2/c8e7e189d324425c05c4c2f81214041ef4107983/model_index.json "HTTP/1.1 200 OK"
Loading pipeline components...:   0% 0/11 [00:00<?, ?it/s]
Loading weights:   0% 0/

In [32]:
# Listen to baseline generations
baseline_wavs = sorted(glob.glob("outputs/demo1/baseline/*.wav"))
print(f"Baseline samples: {len(baseline_wavs)}")

for wf in baseline_wavs[:8]:
    print(f"\n▶ {os.path.basename(wf)}")
    audio, sr = sf.read(wf, dtype="float32")
    display(ipd.Audio(audio, rate=sr))

Baseline samples: 12

▶ metal_clanging_sounds_000.wav



▶ metal_clanging_sounds_001.wav



▶ minecraft_enderman_encounter_sound_000.wav



▶ minecraft_enderman_encounter_sound_001.wav



▶ minecraft_walking_sounds_000.wav



▶ minecraft_walking_sounds_001.wav



▶ minecraft_water_sound_effect_000.wav



▶ minecraft_water_sound_effect_001.wav


In [42]:
!git pull

Already up to date.


---
## 7 · LoRA Fine-Tuning

Fine-tune the UNet cross-attention layers with LoRA adapters on the Minecraft dataset.

Training loop:  audio → mel → VAE latents → add noise → UNet predicts noise → MSE loss
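
Written as formulas, the loop above corresponds to the standard noise-prediction objective for latent diffusion, with LoRA restricting which weights are trained. The notation is an assumption, since the notebook defines no symbols: $z_0$ is the VAE latent of the mel-spectrogram, $c$ the caption conditioning, $\epsilon_\theta$ the UNet, $\bar{\alpha}_t$ the cumulative noise schedule, and $r$, $\alpha$ the LoRA rank and scale. It also assumes the training script uses epsilon-prediction, which the log output does not confirm.

```latex
% Forward noising of the VAE latent z_0 at timestep t
z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

% Noise-prediction (MSE) loss minimized over the trainable parameters
\mathcal{L} = \mathbb{E}_{z_0,\, c,\, t,\, \epsilon}
  \left[ \left\lVert \epsilon - \epsilon_\theta(z_t, t, c) \right\rVert_2^2 \right]

% LoRA: frozen weight W plus a trainable low-rank update
W' = W + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k}
```

Only $A$ and $B$ receive gradients, which is what keeps the adaptation small enough for a T4.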

> **T4 GPU:** ~15-25 min for 500 steps (bs=1, grad_accum=4). Use `--max_steps 100` for a quick sanity check (~3 min).
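
To put the step counts in context, the arithmetic below relates steps, batch size, and gradient accumulation to passes over the 277 training clips. It assumes one step means one optimizer update after accumulation, which the notebook does not state; `training_budget` is a hypothetical helper.

```python
from typing import Tuple


def training_budget(max_steps: int,
                    batch_size: int,
                    grad_accum: int,
                    train_clips: int) -> Tuple[int, int, float]:
    """Return (effective batch size, clips seen, approximate epochs).

    Assumption: each step is one optimizer update that consumes
    batch_size * grad_accum clips.
    """
    effective_batch = batch_size * grad_accum
    clips_seen = max_steps * effective_batch
    epochs = clips_seen / train_clips
    return effective_batch, clips_seen, epochs


if __name__ == "__main__":
    # bs=1 and grad_accum=4 come from the note above; 277 is the train split size.
    for steps in (100, 500):
        eff, seen, epochs = training_budget(steps, 1, 4, 277)
        print(f"{steps} steps: effective batch {eff}, "
              f"{seen} clips seen, about {epochs:.1f} epochs")
```

Under that assumption, the 100-step sanity check sees each clip only once or twice, so large audible shifts are more plausible after the full 500 steps.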

In [36]:
# Quick sanity run: 100 steps (~3 min on T4). Set --max_steps 500 for the full run (~20 min).
!python -m src.mcaudio.train.lora_train \
    --config configs/demo1.yaml \
    --max_steps 100 \
    --log_every 10 \
    --save_every 50

2026-02-28 13:27:48,997  INFO      NumExpr defaulting to 2 threads.
Flax classes are deprecated and will be removed in Diffusers v1.0.0. We recommend migrating to PyTorch classes or pinning your version of Diffusers.
Flax classes are deprecated and will be removed in Diffusers v1.0.0. We recommend migrating to PyTorch classes or pinning your version of Diffusers.
2026-02-28 13:27:50,894  INFO      Device: cuda  |  weight dtype: torch.float16
2026-02-28 13:27:50,894  INFO      Loading AudioLDM2 pipeline: cvssp/audioldm2
2026-02-28 13:27:51,062  INFO      HTTP Request: GET https://huggingface.co/api/models/cvssp/audioldm2 "HTTP/1.1 200 OK"
2026-02-28 13:27:51,168  INFO      HTTP Request: HEAD https://huggingface.co/cvssp/audioldm2/resolve/main/model_index.json "HTTP/1.1 307 Temporary Redirect"
2026-02-28 13:27:51,179  INFO      HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/cvssp/audioldm2/c8e7e189d324425c05c4c2f81214041ef4107983/model_index.json "HTTP/1.1 200 OK"
Loa

---
## 8 · Generate with LoRA Adapter

Compare LoRA-adapted outputs against the baselines from §6.

In [None]:
# Generate with the LoRA adapter. These prompts differ from §6;
# rerun with the §6 prompts for a like-for-like comparison.
PROMPTS = [
    "minecraft skeleton combat encounter with player damage sound effect",
    "fast minecraft ghast moaning then charging and shooting fireball then dying sound effect"
]

for prompt in PROMPTS:
    !python -m src.mcaudio.infer.generate \
        --prompt "{prompt}" \
        --config configs/demo1.yaml \
        --lora_weights outputs/demo1/lora_weights \
        --num_samples 2 \
        --output outputs/demo1/lora

2026-02-28 13:29:59,683  INFO      NumExpr defaulting to 2 threads.
Flax classes are deprecated and will be removed in Diffusers v1.0.0. We recommend migrating to PyTorch classes or pinning your version of Diffusers.
Flax classes are deprecated and will be removed in Diffusers v1.0.0. We recommend migrating to PyTorch classes or pinning your version of Diffusers.
2026-02-28 13:30:01,607  INFO      Loading AudioLDM2 pipeline: cvssp/audioldm2
2026-02-28 13:30:01,788  INFO      HTTP Request: GET https://huggingface.co/api/models/cvssp/audioldm2 "HTTP/1.1 200 OK"
2026-02-28 13:30:01,898  INFO      HTTP Request: HEAD https://huggingface.co/cvssp/audioldm2/resolve/main/model_index.json "HTTP/1.1 307 Temporary Redirect"
2026-02-28 13:30:01,910  INFO      HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/cvssp/audioldm2/c8e7e189d324425c05c4c2f81214041ef4107983/model_index.json "HTTP/1.1 200 OK"
Loading pipeline components...:   0% 0/11 [00:00<?, ?it/s]
Loading weights:   0% 0/

In [41]:
# Listen to LoRA-adapted generations
lora_wavs = sorted(glob.glob("outputs/demo1/lora/*.wav"))
print(f"LoRA samples: {len(lora_wavs)}")

for wf in lora_wavs:
    print(f"\n▶ {os.path.basename(wf)}")
    audio, sr = sf.read(wf, dtype="float32")
    display(ipd.Audio(audio, rate=sr))

LoRA samples: 18

▶ metal_clanging_sounds_000.wav



▶ metal_clanging_sounds_001.wav



▶ minecraft_cave_ambience_sound_effect_000.wav



▶ minecraft_cave_ambience_sound_effect_001.wav



▶ minecraft_enderman_encounter_sound_000.wav



▶ minecraft_enderman_encounter_sound_001.wav



▶ minecraft_skeleton_combat_encounter_with_player_damage_sound_000.wav



▶ minecraft_skeleton_combat_encounter_with_player_damage_sound_001.wav



▶ minecraft_skeleton_hit_sound_effect_000.wav



▶ minecraft_skeleton_hit_sound_effect_001.wav



▶ minecraft_walking_on_wood_sound_effect_000.wav



▶ minecraft_walking_on_wood_sound_effect_001.wav



▶ minecraft_walking_sounds_000.wav



▶ minecraft_walking_sounds_001.wav



▶ minecraft_water_sound_effect_000.wav



▶ minecraft_water_sound_effect_001.wav



▶ minecraft_zombie_death_sound__000.wav



▶ minecraft_zombie_death_sound__001.wav


---
## 9 · Side-by-Side Comparison

In [None]:
import IPython.display as ipd
from pathlib import Path

PROMPTS = [
    "creeper hissing before explosion",
    "footsteps walking on stone",
    "skeleton shooting a bow",
    "zombie groaning",
]
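# Must match the --output directories used in §6 and §8.
# Prompts with no generated files in either directory are skipped below.
base_dir = Path("outputs/demo1/baseline")
lora_dir = Path("outputs/demo1/lora")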

for prompt in PROMPTS:
    slug = prompt.replace(" ", "_")[:60]
    base_wavs = sorted(base_dir.glob(f"{slug}*.wav"))
    lora_wavs = sorted(lora_dir.glob(f"{slug}*.wav"))
    if not base_wavs and not lora_wavs:
        continue
    print(f"\n{'='*60}")
    print(f"Prompt: \"{prompt}\"")
    print(f"{'='*60}")
    if base_wavs:
        print("▸ Baseline")
        display(ipd.Audio(str(base_wavs[0]), rate=16000))
    if lora_wavs:
        print("▸ LoRA-adapted")
        display(ipd.Audio(str(lora_wavs[0]), rate=16000))
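
Because the loop above silently skips prompts that have no files, a small check can report which prompts are missing one side of the comparison. This is a sketch: `slugify`, `pair_outputs`, and `missing_sides` are hypothetical helpers, and the slug rule (spaces to underscores, truncated to 60 characters) is copied from the cell above rather than from the generator's source.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[List[Path], List[Path]]


def slugify(prompt: str, max_len: int = 60) -> str:
    """Same rule as the comparison cell: spaces become underscores, then truncate."""
    return prompt.replace(" ", "_")[:max_len]


def pair_outputs(prompts: Iterable[str],
                 base_dir: Path,
                 lora_dir: Path) -> Dict[str, Pair]:
    """Map each prompt to its (baseline files, LoRA files), both sorted."""
    pairs: Dict[str, Pair] = {}
    for prompt in prompts:
        pattern = f"{slugify(prompt)}*.wav"
        pairs[prompt] = (
            sorted(Path(base_dir).glob(pattern)),
            sorted(Path(lora_dir).glob(pattern)),
        )
    return pairs


def missing_sides(pairs: Dict[str, Pair]) -> List[str]:
    """Prompts that lack a baseline sample, a LoRA sample, or both."""
    return [p for p, (base, lora) in pairs.items() if not base or not lora]


if __name__ == "__main__":
    found = pair_outputs(
        ["metal clanging sounds", "zombie groaning"],
        Path("outputs/demo1/baseline"),
        Path("outputs/demo1/lora"),
    )
    for prompt in missing_sides(found):
        print(f"No complete baseline/LoRA pair for: {prompt!r}")
```

Running the check before listening makes it obvious when a prompt was generated with only one of the two models.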

---
## Summary

| Stage | Artefact | Location |
|-------|----------|----------|
| Raw assets | .ogg files | `data/raw/` |
| Processed | 16 kHz mono .wav | `data/processed/` |
| Manifest | manifest.csv | `data/manifest.csv` |
| Baseline | generated .wav | `outputs/demo1/baseline/` |
| LoRA weights | adapter checkpoint | `outputs/demo1/lora_weights/` |
| LoRA samples | generated .wav | `outputs/demo1/lora/` |