# Spark TTS Inference Guide with vLLM

This notebook demonstrates text-to-speech (TTS) inference with the **Sunbird/spark-tts-salt** model served through vLLM. It was tested on an RTX 4090 (24 GB).

## Overview

Spark TTS is a text-to-speech model that generates speech in multiple languages and voices. This implementation uses:
- **vLLM** for efficient, batched model inference
- **BiCodec tokenizer** for converting generated audio tokens into waveforms
- **Error checks** that stop generation when a chunk produces no audio tokens
- **Text chunking** to process long texts sentence by sentence

## Key Features
- Multi-language support (English, Luganda, Swahili, etc.)
- Multiple speaker IDs for different voices
- Explicit error when a chunk yields no audio tokens (retries can be added around `text_to_speech`)
- Flexible text chunking strategies
- Audio playback in Jupyter notebooks

## 1. Setup and Installation

First, install the required dependencies:
- **vLLM**: For efficient LLM inference
- **soundfile & librosa**: Audio processing
- **omegaconf, einx, einops**: Supporting libraries

In [None]:
%pip install -q einx einops soundfile librosa vllm omegaconf

## 2. Import Libraries

Import all necessary libraries for TTS inference.

In [None]:
# Core imports
from vllm import LLM
from vllm.sampling_params import SamplingParams
import os
from getpass import getpass
import re
import soundfile as sf
from huggingface_hub import snapshot_download
import torch
import sys
from typing import Tuple, List, Optional
import numpy as np
from IPython.display import Audio, display
import time
import huggingface_hub

# Determine device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

## 3. Clone Spark-TTS Repository

Clone the Spark-TTS repository to access the audio tokenizer and utilities.

**Note**: The cell below clones the repository only if a `Spark-TTS` directory does not already exist in the working directory.

In [None]:
# Clone the Spark-TTS repository if it is not already present
if not os.path.exists('Spark-TTS'):
    !git clone https://github.com/SparkAudio/Spark-TTS

# Add Spark-TTS to Python path
sys.path.append('Spark-TTS')
print("Spark-TTS repository path added to sys.path")

## 4. Set Hugging Face Token

Log in to Hugging Face for model access. `huggingface_hub.login()` prompts for a token, which you can create at https://huggingface.co/settings/tokens

In [None]:
huggingface_hub.login()

## 5. Load the TTS Model

Load the Spark TTS model using vLLM.

**Model**: `Sunbird/spark-tts-salt`

This may take a few minutes depending on your internet connection.

In [None]:
# Load the TTS model with vLLM
print("Loading Spark TTS model...")
model = LLM(
    "Sunbird/spark-tts-salt",
    enforce_eager=False,
    gpu_memory_utilization=0.5,  # Leave some VRAM for the audio tokenizer
)
print("✅ Model loaded successfully!")

## 6. Download and Setup Audio Tokenizer

Download the BiCodec tokenizer model files from Hugging Face and initialize the audio tokenizer.

The tokenizer converts between audio and token representations.

In [None]:
# Download tokenizer model files
model_base_repo = "unsloth/Spark-TTS-0.5B"
cache_dir = "Spark-TTS-0.5B"

if not os.path.exists(cache_dir):
    print(f"Downloading tokenizer files from {model_base_repo}...")
    snapshot_download(
        repo_id=model_base_repo,
        local_dir=cache_dir,
        ignore_patterns=["*LLM*"],  # Skip LLM files, we only need tokenizer
    )
    print(f"✅ Tokenizer files downloaded to {cache_dir}")

# Initialize the audio tokenizer
from sparktts.models.audio_tokenizer import BiCodecTokenizer

print("Initializing audio tokenizer...")
audio_tokenizer = BiCodecTokenizer(cache_dir, device)
print("✅ Audio tokenizer initialized!")  

## 7. Text Chunking Utilities

These functions split long text into manageable chunks for TTS processing.

### Three Chunking Strategies:

1. **chunk_text**: Splits by sentence boundaries with a maximum character limit
2. **chunk_text_simple**: Splits into individual sentences (recommended for TTS)
3. **chunk_text_with_count**: Groups a fixed number of sentences per chunk

In [None]:
def chunk_text(text: str, max_chunk_size: int = 500) -> List[str]:
    """
    Split text into chunks based on sentence boundaries.
    
    This approach preserves natural sentence flow and intonation for TTS.
    
    Args:
        text: The input string to chunk
        max_chunk_size: Maximum character length per chunk (soft limit)
    
    Returns:
        List of text chunks, each containing one or more complete sentences
    """
    # Split on sentence-ending punctuation (. ! ?) followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    
    chunks: List[str] = []
    current_chunk: List[str] = []
    current_length = 0
    
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        
        sentence_length = len(sentence)
        
        # Start new chunk if adding this sentence would exceed limit
        if current_chunk and (current_length + sentence_length + 1) > max_chunk_size:
            chunks.append(' '.join(current_chunk))
            current_chunk = []
            current_length = 0
        
        current_chunk.append(sentence)
        current_length += sentence_length + 1
    
    # Add the final chunk
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    
    return chunks


def chunk_text_simple(text: str) -> List[str]:
    """
    Split text into individual sentences.
    
    Recommended for TTS - provides maximum control with one sentence per chunk.
    
    Args:
        text: The input string to chunk
    
    Returns:
        List of individual sentences
    """
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [s.strip() for s in sentences if s.strip()]


def chunk_text_with_count(text: str, sentences_per_chunk: int = 3) -> List[str]:
    """
    Split text into chunks containing a specific number of sentences.
    
    Args:
        text: The input string to chunk
        sentences_per_chunk: Number of sentences to include in each chunk
    
    Returns:
        List of text chunks
    """
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    sentences = [s.strip() for s in sentences if s.strip()]
    
    chunks: List[str] = []
    
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = ' '.join(sentences[i:i + sentences_per_chunk])
        chunks.append(chunk)
    
    return chunks
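
All three functions rely on the same splitting pattern, so it helps to see what that pattern does on a short input. The sketch below applies it to a sample text chosen for illustration (not taken from the notebook). Note that the pattern also splits after abbreviations such as "Dr.", which may be worth pre-processing in real text.

```python
import re

# Same pattern as in chunk_text, chunk_text_simple and chunk_text_with_count
SENTENCE_BOUNDARY = r'(?<=[.!?])\s+'

# Illustrative sample text (an assumption, not taken from the notebook)
sample = "Dr. Nafula is a nurse. She answers questions! Do you have one?"

sentences = [s.strip() for s in re.split(SENTENCE_BOUNDARY, sample.strip()) if s.strip()]
for sentence in sentences:
    print(repr(sentence))
```

Here "Dr." ends up as its own chunk, which would be synthesized as a separate, very short utterance.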

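## 8. Speech Generation

`text_to_speech` splits the input into sentences, builds one prompt per sentence from the speaker's precomputed global tokens, generates all prompts in a single vLLM call, and decodes the resulting semantic tokens to audio.

In [None]: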
# Precomputed global tokens for each speaker
GLOBAL_IDS_BY_SPEAKER = {
 241: [1755, 1265, 184, 3545, 2718, 2405, 3237, 1360, 3621, 1850, 37, 3382, 736,
       3380, 3131, 2036, 244, 2128, 254, 2550, 3181, 764, 1277, 502, 2941, 1993,
       3556, 1428, 3505, 3245, 3506, 1540],
 242: [1367, 1522, 308, 4061, 1449, 2468, 2193, 1349, 3458, 2339, 1651, 3174,
       501, 3364, 3194, 2041, 442, 1061, 502, 2234, 2397, 358, 3829, 2490, 2031,
       1002, 3548, 586, 3445, 1419, 4093, 2908],
 243: [2051, 242, 2684, 4062, 2654, 2252, 353, 3657, 2759, 3254, 1649, 3366,
       1017, 3600, 3131, 3813, 1535, 1595, 1059, 237, 2158, 1174, 4085, 2174,
       3791, 990, 3274, 2693, 3829, 2271, 2650, 1689],
 245: [2031, 2545, 116, 4060, 746, 1385, 3301, 1312, 3638, 1846, 85, 3190, 1016,
       3384, 3134, 954, 244, 1104, 235, 2549, 3357, 508, 1278, 1974, 2621, 1896,
       3812, 2185, 3061, 2941, 1187, 5],
 246: [1811, 1138, 2873, 3309, 2639, 723, 3363, 974, 1612, 2531, 1769, 3376,
       933, 3848, 3195, 2180, 2359, 1275, 3493, 3260, 2279, 3715, 3508, 2433,
       4082, 1087, 3545, 1449, 160, 3531, 2908, 2094],
 248: [2559, 1523, 440, 3789, 1438, 373, 2212, 1248, 3369, 1847, 36, 3126, 480,
       3380, 3133, 2041, 248, 2384, 730, 2554, 3182, 1785, 1277, 1013, 2425,
       1932, 3560, 1177, 2736, 2430, 2722, 261]
}

def text_to_speech(text, audio_tokenizer, model, speaker_id, temperature):
    '''Create a wav array (1-D NumPy array) of speech from text.'''
    if speaker_id not in GLOBAL_IDS_BY_SPEAKER:
        raise ValueError(
            f"Unknown speaker_id {speaker_id}; expected one of {sorted(GLOBAL_IDS_BY_SPEAKER)}"
        )

    # One sentence per chunk; chunk_text_simple already drops empty strings
    sentences = chunk_text_simple(text)
    if not sentences:
        raise ValueError("Input text contains no sentences to synthesize.")

    sampling_params = SamplingParams(temperature=temperature, max_tokens=2048)

    # The speaker's voice is fixed by its precomputed global tokens
    global_tokens = GLOBAL_IDS_BY_SPEAKER[speaker_id]
    global_token_str = ''.join(f'<|bicodec_global_{t}|>' for t in global_tokens)

    prompts = []
    for sentence in sentences:
        prompt = f"<|task_tts|><|start_content|>{speaker_id}: {sentence}<|end_content|><|start_global_token|>"
        prompt += global_token_str + '<|end_global_token|><|start_semantic_token|>'
        prompts.append(prompt)

    # Generate all sentences in a single vLLM call
    outputs = model.generate(
        prompts=prompts,
        sampling_params=sampling_params
    )

    pred_global_ids = torch.tensor([global_tokens], dtype=torch.long).to(device)
    speech_segments = []

    for i, output in enumerate(outputs):
        predicted_tokens = output.outputs[0].text
        semantic_matches = re.findall(r"<\|bicodec_semantic_(\d+)\|>", predicted_tokens)
        if not semantic_matches:
            raise ValueError(f"No semantic tokens found in the generated output for chunk {i}.")

        pred_semantic_ids = (
            torch.tensor([int(token) for token in semantic_matches]).long().unsqueeze(0)
        )

        # Decode global + semantic tokens back into a waveform
        wav_np = audio_tokenizer.detokenize(
            pred_global_ids, pred_semantic_ids.to(device)
        )
        speech_segments.append(wav_np)

    return np.concatenate(speech_segments)
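
To inspect the prompt layout without loading the model, here is a minimal standalone sketch of the two string operations used in `text_to_speech`: building the prompt and extracting semantic token IDs from generated text. The helper names `build_prompt` and `parse_semantic_ids`, the shortened token list, and the fake output string are illustrative assumptions; the special-token strings are copied from the function above.

```python
import re
from typing import List


def build_prompt(speaker_id: int, sentence: str, global_tokens: List[int]) -> str:
    """Assemble the TTS prompt in the same layout as text_to_speech."""
    global_part = ''.join(f'<|bicodec_global_{t}|>' for t in global_tokens)
    return (
        f"<|task_tts|><|start_content|>{speaker_id}: {sentence}<|end_content|>"
        f"<|start_global_token|>{global_part}<|end_global_token|>"
        "<|start_semantic_token|>"
    )


def parse_semantic_ids(generated_text: str) -> List[int]:
    """Extract semantic token IDs, in order, from generated text."""
    return [int(m) for m in re.findall(r"<\|bicodec_semantic_(\d+)\|>", generated_text)]


# Hypothetical values for illustration only (the real speakers use 32 global tokens)
prompt = build_prompt(248, "Hello.", [2559, 1523, 440])
print(prompt)

# Made-up model output, standing in for outputs[i].outputs[0].text
fake_output = "<|bicodec_semantic_12|><|bicodec_semantic_7|><|bicodec_semantic_4090|>"
print(parse_semantic_ids(fake_output))
```

The prompt ends with `<|start_semantic_token|>`, so the model's continuation is expected to consist of semantic tokens, which is why the parser looks only for those.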

## 9. Usage Examples

### Example 1: English Text

In [None]:
%%time

text = ("Hello, I'm Prosi Nafula. I am a nurse who takes care of many people who have cancer "
    "and who have questions about their illness and what to expect. There are many types of cancer. "
    "The type of cancer you have is named after the place where it started. For example, if cancer "
    "starts in the breast then it is called breast cancer. Cancer doesn't spread from one person to "
    "another but it can spread through your own body. All cancers need to be treated.")

# Optional long-form test: generate an hour's worth of speech (should take ~20 seconds).
# To run it, change the condition to True and pass `book_text` as `text` below.
if False:
    import urllib.request
    # THE ANALYSIS OF MIND By Bertrand Russell
    url = "https://www.gutenberg.org/cache/epub/2529/pg2529.txt"
    with urllib.request.urlopen(url) as response:
        full_text = response.read().decode('utf-8')
    book_text = ' '.join(full_text.split()[610:12000])
    

# 241: Acholi (female)
# 242: Ateso (female)
# 243: Runyankore (female)
# 245: Lugbara (female)
# 246: Swahili (male)
# 248: Luganda (female)

speaker_id = 241  
temperature = 0.7

result_wav = text_to_speech(
    text=text,
    audio_tokenizer=audio_tokenizer,
    model=model,
    speaker_id=speaker_id,
    temperature=temperature
)

duration = len(result_wav) / 16000
print(f"\n✅ TTS conversion completed!")
print(f"   Total duration: {duration:.2f} seconds")

In [None]:
display(Audio(result_wav, rate=16000))

In [None]:
# Optional: Save to file
if False:
    sf.write('output_english.wav', result_wav, 16000)  # 16 kHz, matching the playback rate above
    print("Audio saved to output_english.wav")

### Example 2: Luganda Text

In [None]:
# Sample Luganda text
luganda_text = (
    "Nze Prosi Nafula. Ndi musawo akola ku bantu abalina kookolo era abalina ebibuuzo ku bulwadde bwabwe n'ekyo kye basuubira. "
    "Waliwo ebika bya kookolo bingi. Ekika kya kookolo ky'olina kiyitibwa erinnya ly'ekifo we kyatandikira. "
    "Okugeza, kookolo bw'atandikira mu mabeere, ayitibwa kookolo w'amabeere. Kookolo tasaasaana okuva ku muntu omu okudda ku mulala "
    "naye asobola okusaasaana mu mubiri gwo. Kkookolo yenna yeetaaga okujjanjabibwa."
)

print(f"Text length: {len(luganda_text)} characters")

In [None]:
%%time

# Generate Luganda speech
speaker_id = 248  # Luganda speaker
temperature = 0.7

result_wav_luganda = text_to_speech(
    text=luganda_text,
    audio_tokenizer=audio_tokenizer,
    model=model,
    speaker_id=speaker_id,
    temperature=temperature
)

In [None]:
# Play the generated Luganda audio
display(Audio(result_wav_luganda, rate=16000))

### Example 3: Swahili Text

In [None]:
# Sample Swahili text
swahili_text = (
    "Habari, naitwa Prosi Nafula. Mimi ni muuguzi ambaye hushughulikia watu wengi walio na saratani "
    "na ambao wana maswali kuhusu ugonjwa wao na kile wanachoweza kutarajia. Kuna aina nyingi za saratani. "
    "Aina ya saratani unayokuwa nayo inaitwa kwa jina la mahali ilipoanza. Kwa mfano, saratani ikiwa imeanza "
    "katika matiti basi inaitwa saratani ya matiti. Saratani haisambaii kutoka mtu mmoja hadi mwingine lakini "
    "inaweza kusambaa katika mwili wako. Kansa zote zinahitaji kutibiwa."
)

print(f"Text length: {len(swahili_text)} characters")

In [None]:
%%time

# Generate Swahili speech
speaker_id = 246  # Swahili speaker
temperature = 0.7

result_wav_swahili = text_to_speech(
    text=swahili_text,
    audio_tokenizer=audio_tokenizer,
    model=model,
    speaker_id=speaker_id,
    temperature=temperature
)

In [None]:
# Play the generated Swahili audio
display(Audio(result_wav_swahili, rate=16000))

## 10. Custom Usage

Use this cell to generate speech from your own text.

In [None]:
# Enter your custom text here
my_text = "Your text goes here."

# Configure parameters
my_speaker_id = 248  # Choose appropriate speaker ID
my_temperature = 0.7  # 0.1 (conservative) to 1.0 (creative)

In [None]:
%%time

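# Generate speech (text_to_speech returns only the waveform)
my_wav = text_to_speech(
    text=my_text,
    audio_tokenizer=audio_tokenizer,
    model=model,
    speaker_id=my_speaker_id,
    temperature=my_temperature
)
my_sr = 16000  # Sample rate used for playback throughout this notebook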

In [None]:
# Play your audio
display(Audio(my_wav, rate=my_sr))

In [None]:
# Save to file (optional)
output_filename = 'my_tts_output.wav'
sf.write(output_filename, my_wav, my_sr)
print(f"✅ Audio saved to {output_filename}")
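
If `soundfile` is unavailable, the standard-library `wave` module can write the same audio as 16-bit PCM. This is a sketch under the assumption that the waveform holds float samples in the range [-1.0, 1.0] at 16 kHz; `write_wav_pcm16` is a hypothetical helper, not part of the notebook.

```python
import struct
import wave
from typing import Iterable


def write_wav_pcm16(path: str, samples: Iterable[float], sample_rate: int = 16000) -> int:
    """Write mono float samples as 16-bit PCM and return the number of frames written."""
    # Clip to [-1, 1] (the assumed input range) and scale to the int16 range
    pcm = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return len(pcm)
```

In this notebook it could be called as `write_wav_pcm16('my_tts_output.wav', my_wav)`. For long recordings `sf.write` is the faster choice, since this version converts samples one by one in Python.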

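## Re-calculate the global tokens for each speaker

The global tokens encode speaker characteristics such as speaking speed and pitch. They stay roughly the same across utterances from one speaker, so they are generated once here and hard-coded in `GLOBAL_IDS_BY_SPEAKER`.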

In [None]:
# 241: Acholi (female)
# 242: Ateso (female)
# 243: Runyankore (female)
# 245: Lugbara (female)
# 246: Swahili (male)
# 248: Luganda (female)

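# `sampling_params` is otherwise only defined inside text_to_speech, so define it here.
# These values mirror the settings used above and are an assumption; adjust as needed.
sampling_params = SamplingParams(temperature=0.7, max_tokens=2048)

global_ids_by_speaker = {}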

for speaker_id in [241, 242, 243, 245, 246, 248]:

    text = "I am a nurse who takes care of many people who have cancer and who have questions about their illness and what to expect."
    prompt = f"<|task_tts|><|start_content|>{speaker_id}: {text}<|end_content|><|start_global_token|>"
    outputs = model.generate(
        prompts=[prompt],
        sampling_params=sampling_params
    )
    predicted_tokens = outputs[0].outputs[0].text
    global_matches = re.findall(r"<\|bicodec_global_(\d+)\|>", predicted_tokens)
    if not global_matches:
        print("Warning: No global tokens found. Using zeros as fallback.")
        pred_global_ids = torch.zeros((1, 1), dtype=torch.long)
    else:
        pred_global_ids = (
            torch.tensor([int(token) for token in global_matches])
            .long()
            .unsqueeze(0)
        )
    global_ids_by_speaker[speaker_id] = [int(t) for t in pred_global_ids.numpy()[0]]   

import pprint
pprint.pp(global_ids_by_speaker, width=80, compact=True)
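
Before pasting the printed dictionary into `GLOBAL_IDS_BY_SPEAKER`, a quick sanity check can catch truncated or failed generations, such as the zeros fallback above. The sketch assumes 32 global tokens per speaker, which matches the precomputed lists in this notebook, and a codebook of 4096 entries, which is an assumption based on the largest IDs that appear there. `validate_global_ids` is a hypothetical helper.

```python
from typing import Dict, List


def validate_global_ids(
    table: Dict[int, List[int]],
    expected_len: int = 32,  # matches the precomputed lists in this notebook
    vocab_size: int = 4096,  # assumed codebook size
) -> Dict[int, str]:
    """Return {speaker_id: problem description}; empty if every entry looks valid."""
    problems: Dict[int, str] = {}
    for speaker_id, ids in table.items():
        if len(ids) != expected_len:
            problems[speaker_id] = f"expected {expected_len} tokens, got {len(ids)}"
        elif any(not 0 <= t < vocab_size for t in ids):
            problems[speaker_id] = "token ID outside the assumed range"
    return problems


# Illustrative input: one plausible entry and one zeros fallback
example = {241: list(range(32)), 242: [0]}
print(validate_global_ids(example))
```

An empty result means every speaker has a full-length list of in-range IDs; any reported speaker is worth regenerating.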

## 11. Tips and Best Practices

### Temperature Settings:
- **0.1-0.3**: More consistent but potentially monotone
- **0.5-0.7**: Balanced (recommended)
- **0.8-1.0**: More varied but potentially less stable
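
To see why lower temperatures give more consistent output, the toy example below applies a temperature-scaled softmax to three made-up logits. It illustrates the general sampling mechanism rather than this model's actual token distribution; the logits and the helper name are assumptions.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for temperature in (0.2, 0.7, 1.0):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature drops, probability mass concentrates on the highest-scoring token, so repeated runs pick the same tokens more often.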

### Text Chunking Strategies:
- **chunk_text_simple**: Best for most use cases (one sentence per chunk)
- **chunk_text**: Good for controlling chunk size
- **chunk_text_with_count**: Good for grouping related sentences
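
`chunk_text` treats `max_chunk_size` as a soft limit, so a single sentence longer than the limit is kept whole. If a hard limit matters, one option is to split such a sentence at word boundaries first. The helper below is a hypothetical sketch, not part of the notebook, and splitting mid-sentence may affect intonation.

```python
from typing import List


def split_long_sentence(sentence: str, max_chars: int = 500) -> List[str]:
    """Split a sentence at word boundaries so each piece is at most max_chars long.

    A single word longer than max_chars is kept whole.
    """
    pieces: List[str] = []
    current = ""
    for word in sentence.split():
        candidate = f"{current} {word}" if current else word
        if current and len(candidate) > max_chars:
            # Adding this word would exceed the limit: close the current piece
            pieces.append(current)
            current = word
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces


print(split_long_sentence("one two three four five", max_chars=9))
```

With `max_chars=9` the example yields `['one two', 'three', 'four five']`.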

### Handling Errors:
- The system automatically retries failed chunks up to 3 times
- Failed chunks are replaced with silence to maintain timing
- Adjust `max_retries` parameter if needed
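
`text_to_speech` as defined in this notebook raises a `ValueError` when a chunk yields no semantic tokens and does not retry by itself. A retry-with-fallback wrapper along the following lines is one way to get the behaviour described above. The names `call_with_retries`, `max_retries` and `fallback` are illustrative assumptions.

```python
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    generate: Callable[[], T],
    fallback: Callable[[], T],
    max_retries: int = 3,
    retry_on: Tuple[Type[BaseException], ...] = (ValueError, RuntimeError),
) -> T:
    """Call generate() up to max_retries times; return fallback() if every attempt fails."""
    for attempt in range(1, max_retries + 1):
        try:
            return generate()
        except retry_on as exc:
            print(f"Attempt {attempt}/{max_retries} failed: {exc}")
    return fallback()


# Illustrative stand-in for a generation call that fails twice, then succeeds
attempts = {"count": 0}


def flaky_generate() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ValueError("No semantic tokens found in the generated output.")
    return "audio"


result = call_with_retries(flaky_generate, fallback=lambda: "silence")
print(result)
```

In the notebook, `generate` could be `lambda: text_to_speech(sentence, audio_tokenizer, model, speaker_id, temperature)` called once per sentence, and `fallback` could return, for example, `np.zeros(16000)` for one second of silence at 16 kHz.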

### Performance Tips:
- Longer texts take more time to process
- GPU acceleration significantly speeds up generation
- Consider breaking very long texts into multiple batches
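
For very long inputs, sentence chunks can be grouped into fixed-size batches and synthesized one batch at a time, which bounds how many prompts are submitted at once. This is a minimal sketch; `iter_batches` and the batch size are assumptions to tune for your GPU.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


sentences = [f"Sentence {i}." for i in range(1, 8)]  # illustrative input
for batch in iter_batches(sentences, batch_size=3):
    print(len(batch), batch[0])
```

Each batch can be joined with spaces and passed to `text_to_speech`, with the returned waveforms combined using `np.concatenate`.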

### Common Issues:
1. **Dimension mismatch errors**: Usually resolved by regenerating the affected chunk
2. **No audio output**: Check that `speaker_id` is a key of `GLOBAL_IDS_BY_SPEAKER` and that the text is not empty
3. **Poor quality**: Try adjusting `temperature` or using a different `speaker_id`

## 12. Conclusion

You now have a complete TTS inference pipeline using Spark TTS with vLLM!

### Key Features Covered:
- ✅ Model loading and initialization
- ✅ Audio tokenizer setup
- ✅ Text chunking strategies
- ✅ Batched generation with error checks
- ✅ Multi-language support
- ✅ Audio playback and saving

### Next Steps:
- Experiment with different speaker IDs
- Try various temperature settings
- Test with your own texts and languages
- Integrate into your applications

### Resources:
- [Spark TTS GitHub](https://github.com/SparkAudio/Spark-TTS)
- [vLLM Documentation](https://docs.vllm.ai/)
- [Model on Hugging Face](https://huggingface.co/jq/spark-tts-salt)