<a href="https://colab.research.google.com/github/Troyanovsky/awesome-TTS-Colab/blob/main/VibeVoice%201.5B%20TTS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🗣️ Microsoft VibeVoice TTS Colab

## 📄 Description  
This Colab notebook uses VibeVoice TTS to generate expressive, long-form, multi-speaker conversational audio, such as podcasts, from text.

**Capabilities**: Context-Aware Expression, Multilingual Conversation, Podcasts with Background Music, Long Conversational Speech

---

## How to use

- Edit `script_text` and `speaker_voice_mapping` in the parameters cell, following the comments there.
- Run all cells in the section you need.
- The generated audio is saved as `output.wav` in the `VibeVoice` directory.
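
The script is expected to hold one `Speaker Name: text` line per turn, and the generation cell detects speakers with the regular expression `^(.+?):`. The sketch below applies that same pattern to a sample script and checks that every detected speaker has a voice assigned. The helper name `check_script` is hypothetical and is not part of VibeVoice.

```python
import re

def check_script(script_text: str, mapping: dict) -> list:
    """Return the sorted speaker names found in the script.

    Raises ValueError if no speakers are found or a speaker has no voice.
    `check_script` is an illustrative helper, not a VibeVoice API.
    """
    # Same pattern as the generation cell: text before the first colon on each line.
    speakers = sorted(set(re.findall(r"^(.+?):", script_text.strip(), re.MULTILINE)))
    if not speakers:
        raise ValueError("No 'Speaker Name: text' lines found.")
    missing = [s for s in speakers if s not in mapping]
    if missing:
        raise ValueError(f"No voice mapped for: {', '.join(missing)}")
    return speakers

sample = """
Speaker 1: Welcome to the show.
Speaker 2: Great to be here.
"""
speakers = check_script(sample, {"Speaker 1": "Alice", "Speaker 2": "Carter"})
print(speakers)
```

Because the pattern captures everything before the first colon on a line, a wrapped continuation line that happens to contain a colon would also be treated as a speaker name, so it is safest to keep each turn on a single line.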

---

## 🔗 Resources

- **GitHub Repository:** https://github.com/microsoft/VibeVoice
- **Model Availability:** https://huggingface.co/microsoft/VibeVoice-1.5B

---

## Special note

- Supports English and Chinese only.
- VibeVoice is a novel framework designed for generating expressive, long-form, multi-speaker conversational audio.
- A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz.
- The model can synthesize speech up to 90 minutes long with up to 4 distinct speakers, surpassing the typical 1-2 speaker limits of many prior models.
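
To give a sense of scale for the figures above, the arithmetic below converts a 90-minute duration at 7.5 Hz into a frame count. The 24 kHz sample rate used for the comparison is an assumption for illustration and is not stated in this notebook.

```python
FRAME_RATE_HZ = 7.5            # tokenizer frame rate quoted above
MAX_MINUTES = 90               # maximum length quoted above
ASSUMED_SAMPLE_RATE = 24_000   # assumption for illustration only

seconds = MAX_MINUTES * 60
frames = int(seconds * FRAME_RATE_HZ)
samples = seconds * ASSUMED_SAMPLE_RATE
samples_per_frame = ASSUMED_SAMPLE_RATE / FRAME_RATE_HZ

print(f"{seconds} s of audio -> {frames} frames")
print(f"{samples} raw samples, {samples_per_frame:.0f} samples per frame")
```

Under these assumptions a full 90-minute session is about 40,500 frames, which is why a low frame rate keeps long-form generation tractable.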

---

## 🎙️ Explore More TTS Models  
Want to try out additional TTS models? Check out the curated collection here:  
👉 [awesome-TTS-Colab](https://github.com/Troyanovsky/awesome-TTS-Colab)

In [1]:
# 1. Install dependencies and clone the repository
print("⏳ Installing dependencies and setting up the environment...")
# Install ffmpeg for audio processing
!apt-get update -y -qq > /dev/null
!apt-get install -y ffmpeg -qq > /dev/null

# Clone the VibeVoice repository if it doesn't exist
import os
if not os.path.exists('VibeVoice'):
  !git clone https://github.com/microsoft/VibeVoice.git > /dev/null
%cd VibeVoice

# Install flash-attn (built from source; the generation cell below uses 'sdpa' attention) and the VibeVoice package
!pip install flash-attn --no-build-isolation -qq
!pip install -e . -qq

print("✅ Environment setup complete.")

⏳ Installing dependencies and setting up the environment...
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
Cloning into 'VibeVoice'...
remote: Enumerating objects: 224, done.
remote: Counting objects: 100% (75/75), done.
remote: Compressing objects: 100% (55/55), done.
remote: Total 224 (delta 43), reused 41 (delta 20), pack-reused 149 (from 1)
Receiving objects: 100% (224/224), 85.72 MiB | 31.79 MiB/s, done.
Resolving deltas: 100% (81/81), done.
/content/VibeVoice
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.4/8.4 MB 75.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
  Building wheel for flash-attn (setup.py) ... done
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Gettin

In [None]:
# --- Parameters ---
# 📝 Paste your multi-speaker script below:
script_text = """
Speaker 1: The hype was immense, with teasers and leaks building for weeks.
Speaker 2: Great to be here, Alice. It's certainly been an eventful launch.
Speaker 1: And we also have Frank, a tech enthusiast and a super-user.
Speaker 3: Hey, Alice. Happy to be here.
"""

# 🗣️ Map Speakers to Available Voices
# Use a Python dictionary format. The keys should exactly match the speaker names in your script.
# Available voices: `Alice`, `Carter`, `Frank`, `Maya`, `Samuel`, `Anchen`, `Bowen`, `Xinran`
speaker_voice_mapping = "{'Speaker 1': 'Alice', 'Speaker 2': 'Carter', 'Speaker 3': 'Frank'}"

In [None]:
import os
import re
import torch
import time
import ast
from IPython.display import Audio, display

# --- Import VibeVoice components ---
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor
from transformers.utils import logging

logging.set_verbosity_info()
logger = logging.get_logger(__name__)

# --- Helper Class from the original script (with the fix) ---
class VoiceMapper:
    def __init__(self):
        self.setup_voice_presets()
        new_dict = {}
        for name, path in self.voice_presets.items():
            if '_' in name: name = name.split('_')[0]
            if '-' in name: name = name.split('-')[-1]
            new_dict[name] = path
        self.voice_presets.update(new_dict)

    def setup_voice_presets(self):
        # Voice samples ship with the repository; this path is relative to the VibeVoice directory.
        voices_dir = "demo/voices"
        if not os.path.isdir(voices_dir):
            self.voice_presets, self.available_voices = {}, {}
            print(f"❌ Error: Voices directory not found at path: {os.path.abspath(voices_dir)}")
            return

        wav_files = [f for f in os.listdir(voices_dir) if f.lower().endswith('.wav')]
        self.voice_presets = {os.path.splitext(f)[0]: os.path.join(voices_dir, f) for f in wav_files}
        self.voice_presets = dict(sorted(self.voice_presets.items()))
        self.available_voices = {n: p for n, p in self.voice_presets.items() if os.path.exists(p)}
        print(f"✅ Found {len(self.available_voices)} voice files. Available voices: {', '.join(self.available_voices.keys())}")

    def get_voice_path(self, speaker_name: str) -> str:
        speaker_lower = speaker_name.lower()
        for preset_name, path in self.voice_presets.items():
            if preset_name.lower() == speaker_lower: return path
        for preset_name, path in self.voice_presets.items():
            if speaker_lower in preset_name.lower(): return path
        default_voice = list(self.voice_presets.values())[0]
        print(f"⚠️ Warning: No voice preset found for '{speaker_name}', using default voice: {os.path.basename(default_voice)}")
        return default_voice

# --- Main Generation Logic ---
model_path = "microsoft/VibeVoice-1.5B"
final_output_file = "output.wav"

try:
    # 1. Parse user inputs
    speaker_voice_map = ast.literal_eval(speaker_voice_mapping)
    full_script = script_text.strip()

    # Automatically detect unique speakers from the script
    unique_speakers_in_script = sorted(set(re.findall(r"^(.+?):", full_script, re.MULTILINE)))
    if not unique_speakers_in_script:
        raise ValueError("No speakers found in the script. Ensure it follows the 'Speaker Name: Text' format.")
    print(f"✅ Detected speakers in script: {', '.join(unique_speakers_in_script)}")

    # 2. Map speakers to voice files
    voice_mapper = VoiceMapper()
    # Add a check to ensure voices were actually found
    if not voice_mapper.available_voices:
        raise FileNotFoundError("Could not find any voice files. Please ensure the 'demo/voices' directory is correct.")

    voice_samples = []
    print("✅ Mapping speakers to voices:")
    for speaker in unique_speakers_in_script:
        voice_name = speaker_voice_map.get(speaker)
        if not voice_name:
            raise ValueError(f"Speaker '{speaker}' from script is not defined in the Speaker-to-Voice mapping.")
        voice_path = voice_mapper.get_voice_path(voice_name)
        voice_samples.append(voice_path)
        print(f"  - '{speaker}' -> '{voice_name}' (using {os.path.basename(voice_path)})")

    # 3. Load processor and model
    print("\n⏳ Loading processor and model...")
    processor = VibeVoiceProcessor.from_pretrained(model_path)
    model = VibeVoiceForConditionalGenerationInference.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,
        device_map='cuda',
        attn_implementation="sdpa" # Use 'sdpa' for T4 GPU compatibility
    )
    model.eval()
    model.set_ddpm_inference_steps(num_steps=10)
    print("✅ Model loaded successfully.")

    # 4. Prepare inputs for the model
    inputs = processor(
        text=[full_script],
        voice_samples=[voice_samples],
        padding=True,
        return_tensors="pt",
        return_attention_mask=True,
    ).to('cuda')

    # 5. Generate audio
    print("\n🎙️ Generating audio...")
    start_time = time.time()
    outputs = model.generate(
        **inputs,
        max_new_tokens=None,
        cfg_scale=1.3,
        tokenizer=processor.tokenizer,
        generation_config={'do_sample': False},
        verbose=True,
    )
    generation_time = time.time() - start_time
    print(f"⏱️ Generation finished in {generation_time:.2f} seconds.")

    # 6. Save and display the audio
    processor.save_audio(outputs.speech_outputs[0], final_output_file)
    print(f"\n✅ Audio successfully generated and saved as '{final_output_file}'.")
    display(Audio(final_output_file, autoplay=True))

except Exception as e:
    print(f"\n❌ An error occurred: {e}")
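
The short-name aliases built in `VoiceMapper.__init__` can be sketched in isolation. The example assumes preset file stems shaped like `en-Alice_woman` (language prefix, name, descriptor). Those stems are illustrative, so check `demo/voices` for the actual filenames. The helper name `short_voice_name` is hypothetical.

```python
def short_voice_name(stem: str) -> str:
    """Mirror the alias logic in VoiceMapper.__init__ (hypothetical helper)."""
    if "_" in stem:
        stem = stem.split("_")[0]   # keep the part before the first underscore
    if "-" in stem:
        stem = stem.split("-")[-1]  # keep the part after the last hyphen
    return stem

# Illustrative stems, not read from disk.
for stem in ["en-Alice_woman", "zh-Bowen_man", "Frank"]:
    print(stem, "->", short_voice_name(stem))
```

Since the aliases are merged into `voice_presets` with `update`, the full file stem stays valid alongside the short name. `get_voice_path` then tries an exact case-insensitive match, falls back to a substring match, and finally uses the first preset with a warning.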