# VibeVoice Colab — T4 Quickstart (1.5B)
This notebook is a quickstart guide for running VibeVoice on Google Colab with a T4 GPU.

The T4 GPU can run only the 1.5B model because of its limited memory. The T4 also cannot use flash_attention_2 and falls back to SDPA, which may make the generated audio unstable and lower in quality. For the best TTS experience, we recommend trying the 7B model on a more powerful GPU.
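The choice between the two attention backends can be made from the GPU's compute capability. The sketch below is only an illustration: the helper names are hypothetical, the strings `"flash_attention_2"` and `"sdpa"` are the values accepted by the `attn_implementation` argument in Hugging Face Transformers, and it assumes that FlashAttention-2 needs an Ampere-class GPU (capability 8.x) or newer, whereas the T4 is a Turing card (7.5). It does not check whether the `flash-attn` package is installed.

```python
def pick_attn_implementation(compute_capability):
    """Return an attention backend name for a (major, minor) CUDA capability."""
    major, _minor = compute_capability
    # Assumption: FlashAttention-2 kernels target Ampere (8.x) and newer GPUs.
    return "flash_attention_2" if major >= 8 else "sdpa"


def detect_attn_implementation():
    """Query the current GPU if PyTorch and CUDA are available, else use SDPA."""
    try:
        import torch
    except ImportError:
        return "sdpa"
    if not torch.cuda.is_available():
        return "sdpa"
    return pick_attn_implementation(torch.cuda.get_device_capability(0))


print(pick_attn_implementation((7, 5)))  # T4 (Turing)
print(pick_attn_implementation((8, 0)))  # A100 (Ampere)
```

On a Colab T4 runtime, `detect_attn_implementation()` should therefore return `"sdpa"`.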

## Step 1: Use T4

To select the T4 in Colab, go to Runtime → Change runtime type → Hardware accelerator: GPU → T4.

In [1]:
import torch
print(torch.cuda.is_available())
!nvidia-smi

True
Thu Aug 28 12:20:56 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   67C    P8             13W /   70W |       2MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
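
The 15360MiB reported by `nvidia-smi` makes the memory limit concrete. The following back-of-the-envelope sketch estimates the memory needed for the weights alone. It assumes 16-bit weights and takes the nominal parameter counts from the model names; the real checkpoints include additional components, and inference also needs memory for activations and caches, so actual usage will be higher.

```python
def weight_memory_gib(num_params, bytes_per_param=2):
    """Memory for the weights alone, in GiB (assumes 16-bit weights by default)."""
    return num_params * bytes_per_param / 2**30


T4_TOTAL_GIB = 15360 / 1024  # 15360 MiB, as reported by nvidia-smi

# Nominal parameter counts taken from the model names (an assumption).
for name, params in [("1.5B", 1.5e9), ("7B", 7e9)]:
    need = weight_memory_gib(params)
    print(f"{name}: ~{need:.1f} GiB of weights, ~{T4_TOTAL_GIB - need:.1f} GiB left on a T4")
```

Under these assumptions the 7B weights alone leave only about 2 GiB of headroom, which is consistent with restricting the T4 to the 1.5B model.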
                                           

## Step 2: Install the Environment

In [2]:
!git clone https://github.com/microsoft/VibeVoice.git

import os
os.chdir("./VibeVoice")

!apt update && apt install ffmpeg -y
!pip install -e .

Cloning into 'VibeVoice'...
remote: Enumerating objects: 360, done.
remote: Counting objects: 100% (112/112), done.
remote: Compressing objects: 100% (73/73), done.
remote: Total 360 (delta 68), reused 52 (delta 39), pack-reused 248 (from 2)
Receiving objects: 100% (360/360), 87.49 MiB | 22.00 MiB/s, done.
Resolving deltas: 100% (166/166), done.
Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Hit:2 https://cli.github.com/packages stable InRelease
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:5 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:6 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:7 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:8 https://r2u.stat.illinois.edu/ubuntu jammy/main all Packages [9,199 kB]
Hit:9 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu
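
Once the install cell finishes, a quick sanity check can confirm that both pieces are in place. This is a minimal sketch: `check_environment` is a hypothetical helper, and the package name `vibevoice` is an assumption about what `pip install -e .` registers.

```python
import importlib.util
import shutil


def check_environment(package="vibevoice", binary="ffmpeg"):
    """Report whether the Python package is importable and the binary is on PATH."""
    return {
        # Assumption: the editable install exposes a top-level package with this name.
        "package_importable": importlib.util.find_spec(package) is not None,
        "binary_on_path": shutil.which(binary) is not None,
    }


print(check_environment())
```

If either value is `False`, rerunning the install cell is a reasonable first step.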

## Step 3: Run VibeVoice

In [3]:
# The first run downloads the checkpoint, which takes about 3 minutes
!python demo/inference_from_file.py --model_path microsoft/VibeVoice-1.5B --txt_path demo/text_examples/2p_short.txt --speaker_names Alice Frank

from IPython.display import Audio
Audio("./outputs/2p_short_generated.wav")

2025-08-28 12:22:04.528761: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1756383724.545672    1306 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1756383724.550712    1306 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1756383724.563194    1306 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1756383724.563226    1306 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1756383724.563229    1306 computation_placer.cc:177] computation placer alr
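
In the cell above, the input `2p_short.txt` produced `outputs/2p_short_generated.wav`, which suggests that the output file is named after the input file. The sketch below encodes that observed pattern in a hypothetical helper and checks that the file exists before playback. The naming rule is inferred from this one example, not taken from the script's documentation. If the path does not exist, `Audio` tends to treat the string as raw sample data and raise a `ValueError` about a missing `rate`, so checking first gives a clearer signal.

```python
from pathlib import Path


def expected_output_path(txt_path, output_dir="outputs"):
    """Map demo/text_examples/NAME.txt to outputs/NAME_generated.wav (inferred pattern)."""
    return Path(output_dir) / f"{Path(txt_path).stem}_generated.wav"


wav = expected_output_path("demo/text_examples/2p_short.txt")
print(wav.as_posix(), "exists" if wav.exists() else "is missing")
```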

### TTS from your text

In [7]:
# text = """Speaker 1: Can I try VibeVoice with my own example?
# Speaker 2: Of course! VibeVoice is open-source, built to benefit everyone — you’re welcome to try it out."""
# with open("demo/text_examples/my_example.txt", "w", encoding="utf-8") as f:
#     f.write(text)

# Write the VibeVoice-compatible script to a file
text = """Speaker 1: The Voice of the Unspoken — a Dhanur AI podcast. Welcome, I’m Alice.
Speaker 2: And I’m Frank. Today we’re talking about why AI content will be everywhere — fully AI or AI-augmented — and why that’s a win for creators, brands, and communities.
Speaker 1: Let’s start with a simple truth: there are more ideas than hours to express them. Great thoughts get trapped by tight schedules, small budgets, language barriers, and the worry that your voice won’t sound “professional” enough.
Speaker 2: For years, the creative pipeline was gated by time, tools, and confidence. AI breaks those gates.
Speaker 1: When people hear “AI content,” some imagine bland copy. That’s the rough draft of a better story. Yes, AI can generate text, voices, music, and images. More importantly, it meets you where you are, takes the shape of your idea, and extends it into formats you never imagined.
Speaker 2: Think of two modes. Mode one: fully AI. You provide a brief, a message, a goal — the system produces a clear, consistent output. Great for routine updates, summaries, training modules, explainers, or turning long reports into bite-sized audio briefs.
Speaker 1: Mode two: AI-augmented. You remain the storyteller, director, and taste-maker. AI becomes your on-demand team: editor, researcher, script polisher, speaking coach, sound assistant, and multilingual translator. You set the strategy. AI clears the path from spark to final cut.
Speaker 2: In both modes, the power shifts from gatekeepers to creators. That’s why we call this “the voice of the unspoken.” The teacher with brilliant techniques but no studio budget. The founder with a day-one story that deserves day-one polish. The student who thinks in two languages and wants to publish in five.
Speaker 1: The activist who needs clarity and reach. The small business owner who can’t hire a full audio team but can turn bullet points into a brand anthem. AI turns a whisper into a broadcast.
Speaker 2: Let’s tackle the big three: authenticity, quality, and trust.
Speaker 1: Authenticity isn’t about whether a machine touched the file. It’s whether the idea rings true and serves real people. AI doesn’t erase your voice; it can protect it — keeping tone consistent across channels, translating without losing intent, and producing variations that fit context without diluting your core.
Speaker 2: Quality used to mean “expensive and slow.” Now it means “clear, useful, and respectful of time.” AI lifts the floor: mic noise is cleaned, pacing is guided, filler is trimmed, structure is sharpened. And when you want human textures — breaths, laughs, emphasis — you can keep them.
Speaker 1: We don’t want robotic. We want reliable humanity at scale.
Speaker 2: Trust is earned by process. Disclose when it matters. Get consent for references and voice. Keep creators in charge with watermarking, rights management, and audit trails. Feedback loops improve models — and keep humans in command.
Speaker 1: Let’s get practical. Imagine your idea as a seed. AI workflows are the greenhouse. You plant a headline, a paragraph, a sketch, or even a rough voice note.
Speaker 2: The system germinates it: outlines become scripts; scripts become audio; audio becomes short clips, reels, and multilingual versions. Each pass adds intent: who is this for, what do they need, how should it feel?
Speaker 1: You stop wrestling with software and spend your energy refining meaning.
Speaker 2: This is where augmentation shines. Start with your own voice — your cadence and lived experience. AI helps with breath control, consistent loudness, and subtle emphasis so your story lands.
Speaker 1: Prefer a narrator? Choose one aligned with your brand and audience comfort. Need accessibility? Generate transcripts, audio descriptions, and simple-language versions without starting from scratch. Inclusion grows when the cost to produce it approaches zero.
Speaker 2: Will AI replace creators? No. It replaces friction: the blank page, the clumsy mic setup, the tenth revision that steals momentum. What it can’t replace is judgment — what to say, what not to say, and why it matters.
Speaker 1: Your taste still leads. Your ethics still anchor the work. Your story still carries the weight.
Speaker 2: And yes, AI content will be everywhere — not because it’s trendy, but because it’s useful. Companies will ship product updates in familiar voices. Educators will tailor lessons to different speeds and languages. Nonprofits will scale outreach without scaling budgets.
Speaker 1: Indie artists will release in formats that used to require a label. The long tail of creativity — quiet, local, niche — finally gets a loudspeaker.
Speaker 2: If you’re wondering where to start, use this rhythm. First, capture truth. Voice memo, bullet list, messy draft — get it down.
Speaker 1: Second, shape with intent. One listener, one need, one promise.
Speaker 2: Third, augment with care. Use AI to clean, structure, and style — without sanding off your edges.
Speaker 1: Fourth, iterate in public. Release shorter, sooner. Learn from real reactions, not imagined ones.
Speaker 2: Finally, expand responsibly. Add languages. Add formats. Add accessibility. Let the same idea serve more people, better.
Speaker 1: That’s how the future arrives — not as a replacement, but as a reinforcement. Not as a gimmick, but as a craft upgrade.
Speaker 2: At Dhanur AI we’re building for that world: content that feels personal at scale, with creators keeping the steering wheel, and ideas that used to be unspoken gaining a clear, confident voice.
Speaker 1: If you’ve been waiting for perfect conditions, consider this your sign. Your story doesn’t need a bigger budget. It needs a shorter path. AI can be that path. Start with what you know, say it the way only you can, and let the tools carry it the last mile.
Speaker 2: Thanks for listening. Share this with someone whose ideas deserve a brighter microphone.
Speaker 1: This was “The Voice of the Unspoken” — a Dhanur AI podcast. See you next time.
"""

with open("demo/text_examples/ai_podcast.txt", "w", encoding="utf-8") as f:
    f.write(text)
print("Wrote demo/text_examples/ai_podcast.txt")

Wrote demo/text_examples/ai_podcast.txt
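
Before starting a long generation run, it can help to confirm that every line of the script follows the `Speaker N: text` pattern used in the examples. The parser below is a minimal sketch with hypothetical names; the strict one-turn-per-line format is an assumption drawn from the examples here, and the demo script may accept other layouts.

```python
import re

# Assumed format: "Speaker <number>: <utterance>", one turn per line.
SPEAKER_LINE = re.compile(r"^Speaker (\d+): (.+)$")


def parse_script(text):
    """Return (speaker_id, utterance) pairs; raise ValueError on a malformed line."""
    turns = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines
        match = SPEAKER_LINE.match(stripped)
        if match is None:
            raise ValueError(f"line {lineno} is not 'Speaker N: text': {stripped[:40]!r}")
        turns.append((int(match.group(1)), match.group(2)))
    return turns


sample = "Speaker 1: Can I try my own example?\nSpeaker 2: Of course!\n"
turns = parse_script(sample)
print(len(turns), "turns,", len({speaker for speaker, _ in turns}), "speakers")
```

The number of distinct speakers should match the number of names passed to `--speaker_names`.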


In [8]:
!python demo/inference_from_file.py --model_path microsoft/VibeVoice-1.5B --txt_path demo/text_examples/ai_podcast.txt --speaker_names Alice Frank
# The output file is named after the input file, as in Step 3 (2p_short.txt -> 2p_short_generated.wav)
Audio("./outputs/ai_podcast_generated.wav")

2025-08-28 12:29:41.201435: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1756384181.233478    3338 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1756384181.243228    3338 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1756384181.266878    3338 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1756384181.266922    3338 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1756384181.266929    3338 computation_placer.cc:177] computation placer alr

## Risks and Limitations

Although efforts have been made to optimize VibeVoice through various techniques, it may still produce outputs that are unexpected, biased, or inaccurate. VibeVoice inherits any biases, errors, or omissions of its base model (Qwen2.5-1.5B in this release).

Potential for deepfakes and disinformation: high-quality synthetic speech can be misused to create convincing fake audio for impersonation, fraud, or spreading disinformation. Users must ensure transcripts are reliable, check content accuracy, and avoid using generated content in misleading ways. Users are expected to use the generated content and deploy the models lawfully, in full compliance with all applicable laws and regulations in the relevant jurisdictions. It is best practice to disclose the use of AI when sharing AI-generated content.