# Minimal TTS Prototype — Kokoro-82M

**Model:** [`hexgrad/Kokoro-82M`](https://huggingface.co/hexgrad/Kokoro-82M) (82M params, ~350MB, Apache 2.0)

**Why Kokoro:** ranked #1 on TTS Arena (legacy), 44% win rate on Arena V2, ~96x real-time on a cloud GPU, built-in G2P and text normalization, 24 kHz output, and multilingual support. It uses under 1 GB of VRAM, which leaves about 15 GB of headroom on a 16 GB T4.

**Alternatives considered:**

| Model | Params | VRAM | Speed | Trade-off |
|---|---|---|---|---|
| **Kokoro-82M** | 82M | <1 GB | ~96x RT | Strongest in English; multilingual support still expanding |
| SpeechT5 | 143M | ~1.2 GB | ~1 s/sentence | Robotic, no text normalization, 16 kHz |
| Chatterbox-Turbo | 350M | ~4 GB | ~2-3x RT | Voice cloning, English only |
| Chatterbox | 500M | ~6-7 GB | ~1-2x RT | Expressive, but heavy for free Colab |
| XTTS-v2 | 467M | ~5 GB | ~0.5x RT | Multilingual cloning, slow, complex deps |
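
The "x RT" figures in the Speed column are real-time factors: seconds of audio produced per second of compute, so values below 1 mean slower than playback. A minimal sketch for measuring this yourself follows; `real_time_factor` and `timed` are hypothetical helper names, and the numbers in the example are made up for illustration rather than measured.

```python
import time

def real_time_factor(audio_seconds, wall_seconds):
    """Seconds of audio generated per second of wall-clock compute."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Illustrative numbers only: 28.8 s of audio generated in 0.3 s of compute.
print(f"{real_time_factor(28.8, 0.3):.0f}x real time")
```

On a GPU, the first call usually includes model warm-up, so timing a second or third call gives a fairer figure.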

Set **Runtime > Change runtime type > T4 GPU** before running.

In [None]:
# quote the requirement so the shell does not treat ">=" as an output redirect
!pip install -q "kokoro>=0.9.4" soundfile
!apt-get -qq -y install espeak-ng > /dev/null 2>&1

In [None]:
import torch
import numpy as np
import soundfile as sf
from kokoro import KPipeline
from IPython.display import display, Audio

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")

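# 'a' = American English; pass the detected device so the model runs on the GPU when available
pipeline = KPipeline(lang_code='a', device=device)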
print("Pipeline ready")

Device: cuda


Pipeline ready


In [None]:
# lang_code to voice prefix mapping:
#   'a' (American English): af_heart, af_bella, af_nicole, af_sarah, af_sky, am_adam, am_michael
#   'b' (British English):  bf_emma, bf_isabella, bm_george, bm_lewis
#   'e' (Spanish), 'f' (French), 'h' (Hindi), 'i' (Italian), 'p' (Portuguese), 'j' (Japanese), 'z' (Chinese)

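VOICE = 'af_heart'
SPEED = 1.0
SAMPLE_RATE = 24000  # Kokoro outputs 24 kHz audio

_counter = 0

def speak(text, voice=VOICE, speed=SPEED, filename=None):
    """Synthesize text, save it as a WAV file, and return an inline audio player."""
    global _counter

    # The pipeline yields one (graphemes, phonemes, audio) result per text chunk.
    audio_parts = [np.asarray(audio)
                   for _, _, audio in pipeline(text, voice=voice, speed=speed)
                   if audio is not None]
    if not audio_parts:
        raise ValueError("No audio was generated; check that `text` is not empty.")

    # Only consume a counter value once synthesis has succeeded.
    if filename is None:
        _counter += 1
        filename = f"tts_{_counter:03d}.wav"

    full_audio = np.concatenate(audio_parts)
    sf.write(filename, full_audio, SAMPLE_RATE)
    print(f"Saved: {filename} ({len(full_audio)/SAMPLE_RATE:.1f}s)")
    return Audio(filename, autoplay=True)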

In [None]:
speak("Hello! This is a text to speech prototype running on Google Colab with a T4 GPU.")

Saved: tts_001.wav (6.1s)
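
To sanity-check a saved file without playing it, the standard-library `wave` module can read back the sample rate and duration. This is a sketch under one assumption: the file is 16-bit PCM, which is what `soundfile` writes for WAV by default. `wav_info` is a hypothetical helper, and the demo writes a second of silence as a stand-in for a real `speak` output.

```python
import os
import tempfile
import wave

def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate

# Stand-in file: one second of 16-bit mono silence at 24 kHz.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)       # 2 bytes per sample = 16-bit PCM
    wav.setframerate(24000)
    wav.writeframes(b"\x00\x00" * 24000)

print(wav_info(demo_path))
```

Pointing `wav_info` at `tts_001.wav` should report a 24000 Hz mono file whose duration matches the "Saved" line.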


In [None]:
# Kokoro's G2P front end (misaki) handles numbers, abbreviations, and punctuation, so no manual normalization is needed
speak("Dr. Smith earned $2,500 in 2025. That is roughly 150 dollars per session vs. the usual rate.")

Saved: tts_002.wav (9.8s)


In [None]:
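# long text is chunked automatically by the pipeline (on newlines by default, and at punctuation
# when a segment is too long); speak() concatenates the chunks into one file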
speak(
    "Artificial intelligence has transformed many fields in recent years. "
    "Natural language processing now powers search engines, chatbots, and translation tools. "
    "Computer vision enables self-driving cars and medical image analysis. "
    "Speech synthesis, like this demo, can generate human-sounding audio from plain text. "
    "The progress has been remarkable, but significant challenges remain around safety, "
    "fairness, and ensuring these systems work reliably for everyone."
)

Saved: tts_003.wav (28.7s)
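
For intuition about chunking, here is a simplified sketch that groups whole sentences under a character budget. It is an illustration only and not Kokoro's actual algorithm (the pipeline works on phoneme sequences, not characters); `split_into_chunks` and `max_chars` are hypothetical names.

```python
import re

def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars where possible.

    A single sentence longer than max_chars is kept whole rather than cut mid-sentence.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # budget exceeded: close the current chunk
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

for chunk in split_into_chunks("One. Two! Three? Four.", max_chars=9):
    print(repr(chunk))
```

Pre-splitting like this is only needed if you want control over where pauses fall; otherwise passing the full text to `speak` is simpler.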


In [None]:
# compare voices on the same text
sample = "The quick brown fox jumps over the lazy dog near the riverbank."
for v in ['af_heart', 'af_bella', 'am_adam', 'bf_emma', 'bm_george']:
    print(f"\n--- {v} ---")
    display(speak(sample, voice=v))


--- af_heart ---
Saved: tts_004.wav (4.4s)

--- af_bella ---
Saved: tts_005.wav (4.7s)

--- am_adam ---
Saved: tts_006.wav (4.3s)

--- bf_emma ---
Saved: tts_007.wav (4.3s)

--- bm_george ---
Saved: tts_008.wav (5.1s)
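
Note that the comparison runs the British voices (`bf_emma`, `bm_george`) through a pipeline created with `lang_code='a'`. Assuming the naming convention from the voice list earlier (the first letter of a voice name is its `lang_code`), a small helper can pick the matching code so that a separate `KPipeline(lang_code=...)` could be created per language. `voice_lang_code` is a hypothetical helper, and whether the matching pipeline changes the audio noticeably is not verified here.

```python
LANG_NAMES = {
    "a": "American English", "b": "British English", "e": "Spanish",
    "f": "French", "h": "Hindi", "i": "Italian",
    "p": "Portuguese", "j": "Japanese", "z": "Chinese",
}

def voice_lang_code(voice):
    """Return the lang_code implied by a voice name such as 'bf_emma'.

    Assumes names look like '<lang><gender>_<name>'; this is a convention
    inferred from the voice list, not an official API.
    """
    prefix, sep, name = voice.partition("_")
    if len(prefix) != 2 or not sep or not name or prefix[0] not in LANG_NAMES:
        raise ValueError(f"Unrecognized voice name: {voice!r}")
    return prefix[0]

for v in ["af_heart", "am_adam", "bf_emma", "bm_george"]:
    code = voice_lang_code(v)
    print(f"{v}: lang_code={code!r} ({LANG_NAMES[code]})")
```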


In [None]:
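# speed control: 0.5 = slow/deliberate, 1.0 = normal, 1.5 = fast
# display() each player explicitly; a cell only renders its last expression on its own
display(speak("This sentence is spoken slowly for emphasis.", speed=0.8))
display(speak("This sentence is spoken quickly for a news-reader effect.", speed=1.3))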

Saved: tts_009.wav (4.0s)
Saved: tts_010.wav (3.1s)
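
As a rough rule of thumb, clip length scales inversely with `speed`. The sketch below assumes exactly that relationship; real output will deviate slightly, since durations are predicted per phoneme. `estimated_duration` is a hypothetical helper, and the 4.4 s baseline is taken from the `af_heart` sample in the voice comparison.

```python
def estimated_duration(base_seconds, speed):
    """Estimate clip length at a given speed from its length at speed=1.0.

    Assumes duration is inversely proportional to speed (an approximation).
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    return base_seconds / speed

base = 4.4  # seconds at speed=1.0
for speed in (0.8, 1.0, 1.3):
    print(f"speed={speed}: ~{estimated_duration(base, speed):.1f}s")
```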
