# Demo 1 – Minecraft Sound Generation with AudioLDM2

**Goal:** Show that small-scale LoRA adaptation of AudioLDM2 shifts generations toward Minecraft SFX.

**Runtime:** Google Colab with a T4 GPU

### Pipeline
1. Clone repo & install dependencies
2. Fetch Minecraft sound assets (zombie & skeleton categories)
3. Preprocess audio → 16 kHz mono .wav, fixed 4 s length
4. Build manifest (`data/manifest.csv` with captions + train/val split)
5. Generate **baseline** samples from vanilla AudioLDM2
6. LoRA fine-tune UNet on the Minecraft dataset
7. Generate **adapted** samples and compare

---
## 0 · Check GPU & Setup

In [None]:
# Verify GPU is available
!nvidia-smi --query-gpu=name,memory.total --format=csv,noheader

import torch
print(f"PyTorch {torch.__version__}  |  CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

---
## 1 · Clone Repo & Install Dependencies

In [None]:
import os

# ── Clone the project repo (change URL to your fork) ──
REPO_URL = "https://github.com/<YOUR_USERNAME>/GenAI-Minecraft-Sounds.git"  # TODO: update
REPO_DIR = "/content/GenAI-Minecraft-Sounds"

if not os.path.exists(REPO_DIR):
    !git clone {REPO_URL} {REPO_DIR}
os.chdir(REPO_DIR)
print(f"Working directory: {os.getcwd()}")

In [None]:
# ── Install Python dependencies ──
# Audio I/O and utility packages
!pip install -q librosa soundfile pydub pyyaml requests tqdm
# PyTorch wheels built against CUDA 11.8
!pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Pin transformers & diffusers to versions compatible with AudioLDM2
!pip install -q "diffusers[torch]>=0.27,<=0.32.2" "transformers>=4.36,<=4.44.2" accelerate peft datasets scipy

---
## 2 · Fetch Minecraft Sound Assets

In [None]:
!python scripts/fetch_minecraft_assets.py --config configs/demo1.yaml

In [None]:
# Sanity check: list downloaded files
import glob

ogg_files = sorted(glob.glob("data/raw/**/*.ogg", recursive=True))
print(f"Total .ogg files downloaded: {len(ogg_files)}")
for f in ogg_files[:10]:
    print(f"  {f}")
if len(ogg_files) > 10:
    print(f"  ... and {len(ogg_files) - 10} more")

---
## 3 · Preprocess Audio

Convert each .ogg file to a 16 kHz mono .wav, trim silence, then pad or clip to exactly 4 seconds.
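
The trim and pad/clip steps can be sketched with plain NumPy. This is only an illustration, not the code in `scripts/preprocess_audio.py`: the function names and the `0.01` amplitude threshold are hypothetical, and decoding, resampling, and the mono downmix are assumed to have happened already.

```python
import numpy as np

TARGET_SR = 16_000      # sample rate named in this notebook
TARGET_SECONDS = 4.0    # fixed clip length named in this notebook


def trim_silence(audio: np.ndarray, threshold: float = 0.01) -> np.ndarray:
    """Drop leading/trailing samples whose absolute amplitude is below threshold.

    The threshold value is an illustrative assumption.
    """
    loud = np.flatnonzero(np.abs(audio) >= threshold)
    if loud.size == 0:              # entirely silent: nothing to keep
        return audio[:0]
    return audio[loud[0]: loud[-1] + 1]


def fix_length(audio: np.ndarray, sr: int = TARGET_SR,
               seconds: float = TARGET_SECONDS) -> np.ndarray:
    """Zero-pad at the end, or clip, so the result has exactly sr * seconds samples."""
    target = int(round(sr * seconds))
    if len(audio) >= target:
        return audio[:target]
    return np.pad(audio, (0, target - len(audio)))
```

Padding at the end rather than centring keeps the sound's onset at the start of the clip, which seems a reasonable choice for short sound effects.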

In [None]:
!python scripts/preprocess_audio.py --config configs/demo1.yaml

In [None]:
# Sanity check: verify processed files
import soundfile as sf
import numpy as np

wav_files = sorted(glob.glob("data/processed/**/*.wav", recursive=True))
print(f"Total processed .wav files: {len(wav_files)}")

# Spot-check first 3 files
for wf in wav_files[:3]:
    audio, sr = sf.read(wf, dtype="float32")
    dur = len(audio) / sr
    print(f"  {wf}  |  sr={sr}  dur={dur:.2f}s  "
          f"range=[{audio.min():.3f}, {audio.max():.3f}]  "
          f"shape={audio.shape}")

---
## 4 · Build Manifest (manifest.csv)
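
As a rough picture of what a manifest builder might do, the sketch below derives a caption from each file path and assigns a reproducible train/val split. Only the `split` column is confirmed by the preview cell in this section. The `path` and `caption` column names, the `<mob>/<action><n>.wav` directory layout, the 10% validation fraction, and all function names are assumptions for illustration.

```python
import csv
import random
from pathlib import Path


def caption_for(wav_path: Path) -> str:
    """Turn e.g. zombie/hurt1.wav into 'minecraft zombie hurt sound effect'.

    Assumes a <mob>/<action><n>.wav layout, which may not match the repo.
    """
    mob = wav_path.parent.name
    action = wav_path.stem.rstrip("0123456789")   # hurt1 -> hurt
    return f"minecraft {mob} {action} sound effect"


def build_manifest(wav_paths, val_fraction=0.1, seed=0):
    """Return manifest rows with a seeded, reproducible train/val split."""
    paths = sorted(Path(p) for p in wav_paths)
    random.Random(seed).shuffle(paths)
    # Keep at least one validation clip when there is more than one file.
    n_val = max(1, int(len(paths) * val_fraction)) if len(paths) > 1 else 0
    return [
        {
            "path": p.as_posix(),
            "caption": caption_for(p),
            "split": "val" if i < n_val else "train",
        }
        for i, p in enumerate(paths)
    ]


def write_manifest(rows, out_path):
    """Write the rows to a CSV file with a header line."""
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["path", "caption", "split"])
        writer.writeheader()
        writer.writerows(rows)
```

Sorting before the seeded shuffle makes the split independent of the order in which the file system lists the clips.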

In [None]:
!python scripts/build_manifest.py --config configs/demo1.yaml

In [None]:
# Preview the manifest
import pandas as pd

df = pd.read_csv("data/manifest.csv")
print(f"Manifest shape: {df.shape}")
print(f"Split counts:\n{df['split'].value_counts()}")
print()
df.head(10)

---
## 5 · Listen to a Few Samples

Play some processed Minecraft sounds to verify quality.

In [None]:
import IPython.display as ipd

for wf in wav_files[:4]:
    print(f"\n▶ {wf}")
    audio, sr = sf.read(wf, dtype="float32")
    display(ipd.Audio(audio, rate=sr))

---
## 6 · Baseline Generation (Vanilla AudioLDM2)

Generate samples from the pre-trained model *before* any fine-tuning.
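
For orientation, here is a minimal sketch of baseline generation using the `AudioLDM2Pipeline` class from diffusers. It is not the repo's `src.mcaudio.infer.generate` module: the `cvssp/audioldm2` checkpoint, the helper names `slugify` and `generate_baseline`, and the defaults (50 steps, seed 0) are assumptions chosen for illustration.

```python
import re
from pathlib import Path


def slugify(prompt: str) -> str:
    """Make a filesystem-friendly file stem from a prompt."""
    return re.sub(r"[^a-z0-9]+", "_", prompt.lower()).strip("_")


def generate_baseline(prompt, out_dir, num_samples=4, seconds=4.0, steps=50, seed=0):
    """Generate num_samples clips for one prompt and save them as 16 kHz .wav files."""
    # Heavy imports live inside the function so slugify() stays cheap to import.
    import torch
    from diffusers import AudioLDM2Pipeline
    from scipy.io import wavfile

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32
    # Assumed checkpoint; the repo's config may point elsewhere.
    pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2", torch_dtype=dtype).to(device)

    generator = torch.Generator(device).manual_seed(seed)   # fixed seed for repeatability
    audios = pipe(
        prompt,
        num_inference_steps=steps,
        audio_length_in_s=seconds,
        num_waveforms_per_prompt=num_samples,
        generator=generator,
    ).audios

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, audio in enumerate(audios):
        path = out_dir / f"{slugify(prompt)}_{i}.wav"
        wavfile.write(str(path), rate=16000, data=audio)     # AudioLDM2 outputs 16 kHz audio
        paths.append(path)
    return paths
```

Using the same seed for the baseline and the adapted runs would make the later comparison less dependent on sampling noise.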

In [None]:
# Generate baseline samples for a couple of prompts
PROMPTS = [
    "minecraft zombie hurt sound effect",
    "minecraft skeleton death sound effect",
]

for prompt in PROMPTS:
    !python -m src.mcaudio.infer.generate \
        --prompt "{prompt}" \
        --config configs/demo1.yaml \
        --num_samples 4 \
        --output outputs/demo1/baseline

In [None]:
# Listen to baseline generations
baseline_wavs = sorted(glob.glob("outputs/demo1/baseline/*.wav"))
print(f"Baseline samples: {len(baseline_wavs)}")

for wf in baseline_wavs[:4]:
    print(f"\n▶ {os.path.basename(wf)}")
    audio, sr = sf.read(wf, dtype="float32")
    display(ipd.Audio(audio, rate=sr))

---
## 7 · LoRA Fine-Tuning  *(optional for Demo 1)*

Fine-tune the UNet with LoRA adapters on the Minecraft dataset.
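
To make the idea of a LoRA adapter concrete, the sketch below implements the low-rank update for a single linear layer in NumPy: the frozen weight `W` is left untouched and only the small matrices `A` and `B` would be trained. The class name, rank, and `alpha` values are illustrative assumptions. The training script presumably relies on a library such as `peft` rather than anything like this.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight: np.ndarray, rank: int = 4, alpha: float = 8.0, seed: int = 0):
        out_dim, in_dim = weight.shape
        rng = np.random.default_rng(seed)
        self.weight = weight                                  # frozen, never updated
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))   # trainable down-projection
        self.B = np.zeros((out_dim, rank))                    # trainable up-projection, zero init
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        base = x @ self.weight.T                    # original layer output
        update = (x @ self.A.T) @ self.B.T          # low-rank correction
        return base + self.scale * update

    def trainable_params(self) -> int:
        return self.A.size + self.B.size


W = np.random.default_rng(1).normal(size=(320, 320))
layer = LoRALinear(W, rank=4)
x = np.ones((2, 320))
print(np.allclose(layer(x), x @ W.T))        # True: B starts at zero, so output is unchanged
print(layer.trainable_params(), W.size)      # 2560 102400
```

Because `B` starts at zero, the adapted layer initially reproduces the pre-trained layer exactly, and training moves it away from the baseline using about 2.5% of the layer's parameter count in this example.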

> **Note:** This takes roughly 15-30 min on a T4. For a quicker test, reduce the number of training steps (`max_train_steps`; the command below passes `--max_steps 200`).

In [None]:
# Uncomment to run training:
# !python -m src.mcaudio.train.lora_train --config configs/demo1.yaml --max_steps 200

---
## 8 · Generate with LoRA Adapter  *(after training)*

In [None]:
# Uncomment after LoRA training completes:
# for prompt in PROMPTS:
#     !python -m src.mcaudio.infer.generate \
#         --prompt "{prompt}" \
#         --config configs/demo1.yaml \
#         --lora_weights outputs/demo1/lora_weights \
#         --num_samples 4 \
#         --output outputs/demo1/lora

In [None]:
# # Listen to LoRA-adapted generations
# lora_wavs = sorted(glob.glob("outputs/demo1/lora/*.wav"))
# print(f"LoRA samples: {len(lora_wavs)}")
#
# for wf in lora_wavs[:4]:
#     print(f"\n▶ {os.path.basename(wf)}")
#     audio, sr = sf.read(wf, dtype="float32")
#     display(ipd.Audio(audio, rate=sr))
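
Listening is the main comparison here, but a crude numeric check can complement it. The sketch below computes the spectral centroid of each clip so that the average for baseline and LoRA samples can be compared against the processed Minecraft clips. This metric and the function names are assumptions added for illustration, not part of the repo, and a single spectral statistic is only a rough proxy for perceptual similarity.

```python
import numpy as np


def spectral_centroid(audio: np.ndarray, sr: int) -> float:
    """Magnitude-weighted mean frequency (Hz) over the whole clip."""
    magnitude = np.abs(np.fft.rfft(audio))
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    total = magnitude.sum()
    return float((freqs * magnitude).sum() / total) if total > 0 else 0.0


def mean_centroid(clips, sr: int) -> float:
    """Average spectral centroid over a collection of mono clips."""
    return float(np.mean([spectral_centroid(c, sr) for c in clips]))
```

If the adapter is having the intended effect, `mean_centroid` for the LoRA samples might be expected to sit closer to the value for the reference clips than the baseline does, though that is a hypothesis to check rather than a guaranteed outcome.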

---
## Summary

| Stage | Artefact | Location |
|-------|----------|----------|
| Raw assets | .ogg files | `data/raw/` |
| Processed | 16 kHz mono .wav | `data/processed/` |
| Manifest | CSV with captions + train/val split | `data/manifest.csv` |
| Baseline | generated .wav | `outputs/demo1/baseline/` |
| LoRA weights | adapter checkpoint | `outputs/demo1/lora_weights/` |
| LoRA samples | generated .wav | `outputs/demo1/lora/` |