Gemini is natively multimodal across audio and video, capabilities Claude Code lacks. v0.7.0 offloads those tasks to agy.
- /agy:transcribe <audio|video|YouTube-URL> [focus]: faithful transcript + summary in the source language; timestamps for video/URLs. Voice notes, meetings, calls, screencasts. → docs/agy/transcripts/
- /agy:media <file|URL> | <question>: multimodal Q&A over audio/video/image ("what decisions were made?", "what happens at 2:30?"), grounded with time references. → docs/agy/media/
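As a rough illustration of the two syntaxes above, invocations might look like the following. The file names (voice-note.ogg, standup.mp4, demo.mp4) and the focus text are hypothetical placeholders, and how agy interprets the free-form focus argument is an assumption here, not something these notes specify.

```text
/agy:transcribe voice-note.ogg
/agy:transcribe standup.mp4 action items and owners
/agy:media standup.mp4 | what decisions were made?
/agy:media demo.mp4 | what happens at 2:30?
```

Going by the arrows in the command descriptions, results from the first two would presumably land under docs/agy/transcripts/ and from the last two under docs/agy/media/.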
Verified on a real WhatsApp .ogg voice note and a YouTube video. 15 commands now, no Node runtime.