# JointDiT: Text + Image Conditioning Quickstart

This notebook demonstrates the **text- and image-conditioned** pipeline:
1) (Optional) fetch a tiny AVSync15 subset
2) cache video/audio latents (optionally with CLIP image embeddings)
3) run tiny smoke trainings (Stage A/B)
4) infer from the CLI with text + (first-frame) image conditioning
5) optionally launch the Gradio UI

> **Assumptions**
> * You already have the repo environment set up (`.venv`, models in `assets/models/...`).
> * You’ve integrated the CLIP condition encoder and updated `infer_joint.py` per today’s changes.

In [None]:
import os, sys, torch, platform
print('Python  :', platform.python_version())
print('PyTorch :', torch.__version__)
print('CUDA OK?:', torch.cuda.is_available())
print('GPU     :', torch.cuda.get_device_name(0) if torch.cuda.is_available() else '-')
ROOT = os.getcwd()
print('ROOT    :', ROOT)

## 1) (Optional) Fetch a tiny AVSync15 subset
If you’ve already placed raw clips in `data/raw/{train,val}`, skip this. Otherwise, set your folder URL below (a Google Drive folder containing `videos.tar.gz`, `train.txt`, `test.txt`).

In [None]:
FOLDER_URL = ""  # e.g. 'https://drive.google.com/drive/folders/1onvx5y6QOceDrHZy8-ajFJ4RUuGWwT5V'
if FOLDER_URL:
    !python scripts/data/fetch_avsync15_1s.py --folder-url "$FOLDER_URL" --limit-train 3 --limit-val 1 --sr 16000 --fps 12
else:
    print('Skipping download; FOLDER_URL not set.')
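
Whether the download can be skipped can also be checked in code. The sketch below counts the files already present under `data/raw/train` and `data/raw/val`. The helper name `count_raw_clips` is hypothetical, and it counts every regular file because the clip file extension is not specified here.

```python
from pathlib import Path

def count_raw_clips(raw_root="data/raw", splits=("train", "val")):
    """Return {split: number of regular files found under raw_root/split}."""
    counts = {}
    for split in splits:
        split_dir = Path(raw_root) / split
        if split_dir.is_dir():
            # Count files at any depth; a missing directory counts as zero.
            counts[split] = sum(1 for p in split_dir.rglob("*") if p.is_file())
        else:
            counts[split] = 0
    return counts

if __name__ == "__main__":
    counts = count_raw_clips()
    print(counts)
    if all(n > 0 for n in counts.values()):
        print("Raw clips found for every split; the download step can be skipped.")
```

An empty or missing split directory reports `0`, which is the case where setting `FOLDER_URL` is needed.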

## 2) Cache latents (with optional CLIP image embedding)
Make sure `configs/day02_cache.yaml` has `clip.enabled: true` and a valid CLIP model variant & tag (e.g. `variant: ViT-B-16`, `model_path: openai`).

This will write:
* `data/cache/video_latents/{split}/*.pt`
* `data/cache/audio_latents/{split}/*.pt`
* `data/cache/img_firstframe/{split}/*.png`
* `data/cache/img_clip/{split}/*.pt` (if CLIP enabled)
* `data/cache/meta/{split}/*.json`

In [None]:
!PYTHONPATH=. python scripts/data/cache_latents.py --cfg configs/day02_cache.yaml --split train
!PYTHONPATH=. python scripts/data/cache_latents.py --cfg configs/day02_cache.yaml --split val
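
After caching, a quick count of the files in each output directory listed in this section can confirm that the run produced something for every modality. This is a minimal sketch: the helper `summarize_cache` is hypothetical, and the directory names and file patterns are copied from the list above, not read from the caching script.

```python
from pathlib import Path

# Sub-directory -> file pattern, as listed in this section.
EXPECTED = {
    "video_latents": "*.pt",
    "audio_latents": "*.pt",
    "img_firstframe": "*.png",
    "img_clip": "*.pt",  # only written when CLIP is enabled
    "meta": "*.json",
}

def summarize_cache(cache_root="data/cache", split="train"):
    """Return {sub-directory: number of matching files} for one split."""
    root = Path(cache_root)
    return {
        name: len(list((root / name / split).glob(pattern)))
        for name, pattern in EXPECTED.items()
    }

if __name__ == "__main__":
    for split in ("train", "val"):
        print(split, summarize_cache(split=split))
```

A count of `0` for `img_clip` is expected when `clip.enabled` is false; a `0` anywhere else suggests the caching step did not complete for that split.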

## 3) Train: tiny smoke test (Stage A)
Runs a very small number of steps to verify that training works with conditioning enabled. For real training, increase the step count and remove the `--max-steps` override.

In [None]:
!PYTHONPATH=. python scripts/train/train_stage_a.py --cfg configs/day05_train.yaml --max-steps 25 --log-suffix nb --ckpt-suffix nb

## 4) Train: tiny smoke test (Stage B)
Lightly fine-tunes the joint parts of the model. Adjust the unfrozen blocks and the step count in the config to fit your GPU budget.

In [None]:
!PYTHONPATH=. python scripts/train/train_stage_b.py --cfg configs/day07_trainB.yaml --max-steps 100 --log-suffix nb --ckpt-suffix nb

## 5) Inference (CLI): text + first-frame image conditioning
We use the **val** meta JSON to define shapes, and condition on a positive prompt and the first-frame image. Set `JOINTDIT_USE_IMG=1` to pull the image from the meta.

In [None]:
import glob, os
# Pick the most recently written checkpoint. A plain name sort would place
# ckpt_step_100 before ckpt_step_25, so sort by modification time instead.
ckpts = sorted(glob.glob('checkpoints/**/ckpt_step_*.pt', recursive=True), key=os.path.getmtime)
latest = ckpts[-1] if ckpts else ''
print('Latest ckpt:', latest)
meta = sorted(glob.glob('data/cache/meta/val/*.json'))
ref  = meta[0] if meta else ''
print('Ref meta  :', ref)

os.environ['JOINTDIT_PROMPT'] = 'a baby laughing'
os.environ['JOINTDIT_NEG_PROMPT'] = ''
os.environ['JOINTDIT_CKPT'] = latest
os.environ['JOINTDIT_REF_META'] = ref
os.environ['JOINTDIT_STEPS'] = '10'
os.environ['JOINTDIT_SEED']  = '0'
os.environ['JOINTDIT_WV']    = '1.2'
os.environ['JOINTDIT_WA']    = '1.2'
os.environ['JOINTDIT_WT']    = '1.5'
os.environ['JOINTDIT_WNT']   = '0.0'
os.environ['JOINTDIT_WI']    = '1.0'
os.environ['JOINTDIT_USE_IMG']= '1'

!PYTHONPATH=. python scripts/infer/infer_joint.py --cfg configs/ui_infer.yaml
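
The cell above falls back to an empty string when no checkpoint or meta file is found, and how `infer_joint.py` reacts to empty values is not covered here. A pre-flight check along the following lines, run before the inference command, can surface the problem early. The helper `preflight` is hypothetical; it only checks that both files exist and that the meta file parses as JSON, without assuming anything about its schema.

```python
import json
import os

def preflight(ckpt_path, ref_meta_path):
    """Return a list of problems; an empty list means both inputs look usable."""
    problems = []
    if not ckpt_path or not os.path.isfile(ckpt_path):
        problems.append(f"checkpoint not found: {ckpt_path!r}")
    if not ref_meta_path or not os.path.isfile(ref_meta_path):
        problems.append(f"reference meta not found: {ref_meta_path!r}")
    else:
        try:
            with open(ref_meta_path, "r", encoding="utf-8") as f:
                json.load(f)  # parse only; the schema is not inspected
        except json.JSONDecodeError as exc:
            problems.append(f"reference meta is not valid JSON: {exc}")
    return problems

if __name__ == "__main__":
    issues = preflight(
        os.environ.get("JOINTDIT_CKPT", ""),
        os.environ.get("JOINTDIT_REF_META", ""),
    )
    for msg in issues:
        print("WARNING:", msg)
```

If the list is non-empty, rerunning the training or caching cells is usually the next step before retrying inference.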

## 6) (Optional) Launch the UI
This will start a Gradio server; stop the cell to terminate it.

In [None]:
# !python scripts/ui/app.py