<a href="https://colab.research.google.com/github/Troyanovsky/awesome-TTS-Colab/blob/main/Orpheus_TTS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🗣️ Orpheus TTS Colab

## 📄 Description

This Colab notebook runs **Orpheus TTS**, an **open-source, LLM-based text-to-speech system** built on a **LLaMA-3B backbone**.
Orpheus showcases the **emergent capabilities of large language models for speech synthesis**, delivering **highly expressive**, **human-like**, and **low-latency** voice generation with **zero-shot voice cloning**.

**Capabilities:**
Human-Like Expressive Speech, Zero-Shot Voice Cloning, Guided Emotion & Intonation Tags, Low-Latency Streaming (~200ms), Realtime-Ready TTS

---

## How to use

* Run the first two cells to authenticate with Hugging Face and install dependencies
* Modify the voice name and input text for speech generation
* Run all remaining cells to generate and play the audio output

---

## ⚙️ Model Highlights

* 🗣 **State-of-the-art expressiveness** – natural rhythm, emotion, and intonation rivaling closed-source models
* 🎭 **Emotion & intonation control** – guide speech style with simple tags (e.g., `<laugh>`, `<sigh>`)
* 🧬 **Zero-shot voice cloning** – no fine-tuning required
* ⚡ **Low-latency streaming** – ~200ms latency, reducible to ~100ms with input streaming
* 🌍 **Research-backed scalability** – part of a larger multilingual model family

---

## 🧠 Model Details

* **Base Model:** LLaMA-3B
* **Model Variant:** Finetuned Prod (English)
* **Training Data:** 100k+ hours of English speech (pretraining)
* **Supported Language:** English
* **Voices:** Multiple preset speaker options (e.g., tara, leah, jess, leo, dan, mia, zac, zoe)
* **Use Case:** Production-ready, real-time expressive TTS

---

## 🔗 Resources

* **GitHub Repository:** https://github.com/canopyai/Orpheus-TTS
* **Model Card:** https://huggingface.co/canopylabs/orpheus-3b-0.1-ft

---

## 🎙️ Explore More TTS Models

Looking for more cutting-edge voice models?
👉 Check out the full collection: [awesome-TTS-Colab](https://github.com/Troyanovsky/awesome-TTS-Colab)

## Voice Generation

In [None]:
# Authenticate with Hugging Face so the model weights can be downloaded
from huggingface_hub import login
login()

In [None]:
# Install dependencies
!pip install -q snac transformers soundfile librosa ipywebrtc

In [None]:
from snac import SNAC
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import numpy as np
import IPython.display as ipd

model_name = "canopylabs/orpheus-3b-0.1-ft"

# Load SNAC codec model (24kHz)
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
snac_model = snac_model.to("cpu")  # keep codec on CPU

# Load Orpheus TTS (LLM backbone)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32)
model = model.to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(f"Loaded Orpheus TTS on {device} and SNAC codec on CPU.")

In [None]:
# Available English voices (from docs): "tara", "leah", "jess", "leo", "dan", "mia", "zac", "zoe"
voice_name = "tara"  # 👈 change to another voice if you like

prompt_text = (
    "Hey there, I'm Orpheus, a text-to-speech model that can speak with natural rhythm and emotion. "
    "<chuckle> It's nice to meet you in this Colab notebook!"
)  # 👈 change this to whatever you want the model to say

# Format prompt as expected by Orpheus finetuned model
full_prompt = f"{voice_name}: {prompt_text}"
print("Using prompt:\n", full_prompt)
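
The `voice: text` prompt format used above can be wrapped in a small helper that catches a mistyped voice name before any GPU time is spent. This is a minimal sketch: `build_prompt` and `VOICES` are hypothetical names introduced here for illustration, and the voice list is simply the one given in this notebook.

```python
# Voice names as listed in this notebook (not queried from the model).
VOICES = ("tara", "leah", "jess", "leo", "dan", "mia", "zac", "zoe")


def build_prompt(voice: str, text: str) -> str:
    """Return '<voice>: <text>' after basic validation (hypothetical helper)."""
    voice = voice.strip().lower()
    if voice not in VOICES:
        raise ValueError(
            f"Unknown voice {voice!r}; expected one of: {', '.join(VOICES)}"
        )
    # Collapse newlines and repeated spaces so the prompt is a single line.
    text = " ".join(text.split())
    if not text:
        raise ValueError("Prompt text is empty")
    return f"{voice}: {text}"


print(build_prompt("Tara", "Hello there. <chuckle> Nice to meet you!"))
# tara: Hello there. <chuckle> Nice to meet you!
```

Tags such as `<chuckle>` pass through untouched, since they are ordinary text as far as the prompt string is concerned.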

In [None]:
# Special tokens (from Orpheus example)
start_token = torch.tensor([[128259]], dtype=torch.int64)        # Start of human (SOH)
end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64) # End of text (EOT), End of human (EOH)

# Tokenize text
input_ids = tokenizer(full_prompt, return_tensors="pt").input_ids

# Final layout: SOH, tokenizer output (start-of-text token + text), EOT, EOH
modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)

# Move to device
input_ids = modified_input_ids.to(device)
attention_mask = torch.ones_like(input_ids, dtype=torch.int64).to(device)

# Generate audio
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=1200,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
        repetition_penalty=1.1,
        num_return_sequences=1,
        eos_token_id=128258,  # end of codes
    )

# Parse output
token_to_find = 128257  # code start token
token_to_remove = 128258  # code end token

token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)

if len(token_indices[1]) > 0:
    last_occurrence_idx = token_indices[1][-1].item()
    cropped_tensor = generated_ids[:, last_occurrence_idx + 1 :]
else:
    cropped_tensor = generated_ids

# Remove any end-of-codes tokens that remain in the cropped output
processed_rows = [row[row != token_to_remove] for row in cropped_tensor]

code_lists = []
for row in processed_rows:
    # Keep only whole 7-token frames, then shift token IDs down to code values
    new_length = (row.size(0) // 7) * 7
    trimmed_row = row[:new_length].tolist()  # plain Python ints, independent of device
    code_lists.append([t - 128266 for t in trimmed_row])

def redistribute_codes(code_list):
    # Each 7-token frame carries 1 code for SNAC layer 1, 2 for layer 2, and 4 for layer 3.
    # The token at position k in a frame is offset by k * 4096, which is subtracted here.
    layer_1 = []
    layer_2 = []
    layer_3 = []
    for i in range(len(code_list) // 7):
        layer_1.append(code_list[7 * i])
        layer_2.append(code_list[7 * i + 1] - 4096)
        layer_3.append(code_list[7 * i + 2] - (2 * 4096))
        layer_3.append(code_list[7 * i + 3] - (3 * 4096))
        layer_2.append(code_list[7 * i + 4] - (4 * 4096))
        layer_3.append(code_list[7 * i + 5] - (5 * 4096))
        layer_3.append(code_list[7 * i + 6] - (6 * 4096))

    codes = [
        torch.tensor(layer_1).unsqueeze(0),
        torch.tensor(layer_2).unsqueeze(0),
        torch.tensor(layer_3).unsqueeze(0),
    ]
    # Decoding needs no gradients
    with torch.no_grad():
        audio_hat = snac_model.decode(codes)  # waveform tensor, shape [1, 1, T]
    return audio_hat

my_samples = []
for code_list in code_lists:
    samples = redistribute_codes(code_list)
    my_samples.append(samples)

# Move to CPU, detach from autograd, and convert to a 1-D NumPy array
audio_tensor = my_samples[0].cpu().detach().numpy().squeeze()

# SNAC 24kHz model, so use 24000 Hz
sample_rate = 24000
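
The frame layout that `redistribute_codes` relies on may be easier to follow without tensors. The sketch below restates it in plain Python under the same constants used in the cell above (7 tokens per frame, an offset of 128266, 4096 codes per position). The names `deinterleave` and `LAYER_OF_POSITION` are hypothetical, and the range check is an addition the notebook code does not perform.

```python
AUDIO_TOKEN_OFFSET = 128266  # subtracted from every generated token ID
CODEBOOK_SIZE = 4096         # per-position offset step used in the notebook
FRAME = 7                    # tokens per audio frame

# Which SNAC layer each of the 7 frame positions feeds (1 + 2 + 4 codes).
LAYER_OF_POSITION = (1, 2, 3, 3, 2, 3, 3)


def deinterleave(token_ids):
    """Split a flat list of token IDs into three per-layer code lists."""
    usable = len(token_ids) - len(token_ids) % FRAME  # drop a trailing partial frame
    layers = {1: [], 2: [], 3: []}
    for i in range(usable):
        pos = i % FRAME
        code = token_ids[i] - AUDIO_TOKEN_OFFSET - pos * CODEBOOK_SIZE
        if not 0 <= code < CODEBOOK_SIZE:
            # Added safety check: flags tokens that are not valid audio codes.
            raise ValueError(f"token {token_ids[i]} at index {i} is not a valid code")
        layers[LAYER_OF_POSITION[pos]].append(code)
    return layers[1], layers[2], layers[3]


# One synthetic frame: the code at position k is stored as OFFSET + k * 4096 + code.
frame_codes = [10, 20, 30, 31, 21, 32, 33]
tokens = [AUDIO_TOKEN_OFFSET + k * CODEBOOK_SIZE + c for k, c in enumerate(frame_codes)]
print(deinterleave(tokens))
# ([10], [20, 21], [30, 31, 32, 33])
```

Raising on out-of-range codes is one possible design choice; it surfaces a malformed generation early instead of passing a bad index to the codec.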

In [None]:
print("Playing generated audio...")
ipd.display(ipd.Audio(audio_tensor, rate=sample_rate))
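
To keep the result after the Colab session ends, the samples can be written to a WAV file. This is a minimal sketch using only Python's standard `wave` module. `save_wav` and the file name `orpheus_demo.wav` are hypothetical, and a synthetic tone stands in for `audio_tensor` so the block runs on its own. It assumes mono float samples in the range [-1, 1].

```python
import math
import struct
import wave


def save_wav(path, samples, sample_rate=24000):
    """Write mono float samples in [-1, 1] as 16-bit PCM; return the sample count."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)       # mono
        wav_file.setsampwidth(2)       # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return len(frames) // 2


# Stand-in signal: one second of a 440 Hz tone at 24 kHz.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / 24000) for n in range(24000)]
n_written = save_wav("orpheus_demo.wav", tone)
print(n_written)
# 24000
```

In the notebook itself, `save_wav("orpheus_demo.wav", audio_tensor, sample_rate)` should work the same way, since the function only iterates over the samples. The install cell also pulls in `soundfile`, so `soundfile.write("orpheus_demo.wav", audio_tensor, sample_rate)` is a one-line alternative.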