<a href="https://colab.research.google.com/github/Troyanovsky/awesome-TTS-Colab/blob/main/kokoro_TTS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🗣️ Kokoro TTS Google Colab

## 📄 Description  
This Colab notebook uses Kokoro TTS to generate voice audio from text.

**Languages supported** (with the `lang_code` used in this notebook): American English (`a`), British English (`b`), Spanish (`e`), French (`f`), Hindi (`h`), Italian (`i`), Japanese (`j`), Brazilian Portuguese (`p`), Mandarin Chinese (`z`).


**Capabilities**: Text-to-speech, Multi-lingual, Predefined Voices

---

## How to use
- Set the language code, the text to synthesize, and the voice in the parameters cell.
- Run all cells. The generated audio is saved to `output.wav` and played inline.

---

## 🔗 Resources

- **GitHub Repository:** [hexgrad/kokoro](https://github.com/hexgrad/kokoro)
- **Model Availability:** [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M)

---

## 🎙️ Explore More TTS Models  
Want to try out additional TTS models? Check out the curated collection here:  
👉 [awesome-TTS-Colab](https://github.com/Troyanovsky/awesome-TTS-Colab)


In [1]:
# Quote the extras so the shell never treats the brackets as a glob pattern
!pip install kokoro==0.9.4 soundfile "misaki[en]" "misaki[zh]" "misaki[ja]" pydub
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
!apt-get install -y ffmpeg

Collecting kokoro==0.9.4
  Downloading kokoro-0.9.4-py3-none-any.whl.metadata (21 kB)
Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting misaki[en]
  Downloading misaki-0.9.4-py3-none-any.whl.metadata (19 kB)
Collecting loguru (from kokoro==0.9.4)
  Downloading loguru-0.7.3-py3-none-any.whl.metadata (22 kB)
Collecting addict (from misaki[en])
  Downloading addict-2.4.0-py3-none-any.whl.metadata (1.0 kB)
Collecting espeakng-loader (from misaki[en])
  Downloading espeakng_loader-0.2.4-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.3 kB)
Collecting num2words (from misaki[en])
  Downloading num2words-0.5.14-py3-none-any.whl.metadata (13 kB)
Collecting phonemizer-fork (from misaki[en])
  Downloading phonemizer_fork-3.3.2-py3-none-any.whl.metadata (48 kB)
Collecting spacy-curated-transformers (from misaki[en])


In [2]:

en_us_voices = [
  "af_heart", "af_alloy", "af_aoede", "af_bella", "af_jessica", "af_kore",
  "af_nicole", "af_nova", "af_river", "af_sarah", "af_sky",
  "am_adam", "am_echo", "am_eric", "am_fenrir", "am_liam", "am_michael",
  "am_onyx", "am_puck", "am_santa"
]

en_gb_voices = [
  "bf_emma", "bf_isabella", "bf_alice", "bf_lily",
  "bm_george", "bm_lewis", "bm_daniel", "bm_fable"
]

ja_voices = [
  "jf_alpha", "jf_gongitsune", "jf_nezumi", "jf_tebukuro",
  "jm_kumo"
]

zh_voices = [
  "zf_xiaobei", "zf_xiaoni", "zf_xiaoxiao", "zf_xiaoyi",
  "zm_yunjian", "zm_yunxi", "zm_yunxia", "zm_yunyang"
]

es_voices = [
  "ef_dora", "em_alex", "em_santa"
]

# "ff_siwis" is a French voice (prefix 'f'), so it gets its own list
fr_voices = [
  "ff_siwis"
]

hi_voices = [
  "hf_alpha", "hf_beta", "hm_omega", "hm_psi"
]

it_voices = [
  "if_sara", "im_nicola"
]

pt_br_voices = [
  "pf_dora", "pm_alex", "pm_santa"
]
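
The voice names above appear to follow a pattern: the first character matches the language code and the second marks a female (`f`) or male (`m`) voice. A minimal sketch of grouping voices by that prefix, assuming the pattern holds for every voice; `group_by_lang_code` and `sample_voices` are hypothetical names used only for illustration:

```python
from collections import defaultdict

# A small sample of voice names taken from the lists above.
sample_voices = ["af_heart", "am_adam", "bf_emma", "jf_alpha", "zm_yunxi", "ff_siwis"]


def group_by_lang_code(voices):
    """Group voice names by their first character (assumed to be the lang_code)."""
    groups = defaultdict(list)
    for name in voices:
        groups[name[0]].append(name)
    return dict(groups)


voices_by_lang = group_by_lang_code(sample_voices)
print(voices_by_lang["a"])
```

Keying voices by language code this way makes it harder to pair a voice with the wrong pipeline language.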

In [3]:
# Language codes (the voice must match the code):
# 🇺🇸 'a' => American English
# 🇬🇧 'b' => British English
# 🇪🇸 'e' => Spanish (es)
# 🇫🇷 'f' => French (fr-fr)
# 🇮🇳 'h' => Hindi (hi)
# 🇮🇹 'i' => Italian (it)
# 🇯🇵 'j' => Japanese
# 🇧🇷 'p' => Brazilian Portuguese (pt-br)
# 🇨🇳 'z' => Mandarin Chinese
language = "a"   # Passed to KPipeline as lang_code; must match the voice
text_to_generate = "This is voice generated by Kokoro"
voice = en_us_voices[0]   # Make sure the voice matches the language code
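
Since the comments ask for the voice and language code to match, a small guard can catch a mismatch before any model is downloaded. This is only a sketch: it assumes the first character of a voice name equals its language code, as the voice lists suggest, and `check_voice_matches_language` is a hypothetical helper, not part of Kokoro.

```python
def check_voice_matches_language(voice: str, language: str) -> None:
    """Raise ValueError if the voice prefix does not match the language code.

    Assumption: the first character of the voice name is its lang_code.
    """
    if not voice or voice[0] != language:
        raise ValueError(
            f"Voice {voice!r} does not match language code {language!r}"
        )


# Passes silently: 'af_heart' starts with 'a'.
check_voice_matches_language("af_heart", "a")
```

Calling it with `voice` and `language` right after setting them turns a silent mispronunciation into an immediate error.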

In [6]:
import os

import soundfile as sf
from IPython.display import Audio, display
from kokoro import KPipeline
from pydub import AudioSegment
from pydub.utils import which

# Ensure pydub can find ffmpeg
AudioSegment.converter = which("ffmpeg")

pipeline = KPipeline(lang_code=language)

# Directory to save intermediate audio files
output_dir = "audio_fragments"
os.makedirs(output_dir, exist_ok=True)

# Generating audio fragments
audio_files = []
generator = pipeline(text_to_generate, voice=voice, speed=1, split_pattern=r'\n+')

for i, (gs, ps, audio) in enumerate(generator):
    print(f"Fragment {i}:")
    print(f"Graphemes: {gs}")
    print(f"Phonemes: {ps}")

    # Save each fragment as a WAV file at a 24000 Hz sample rate
    file_path = f"{output_dir}/{i}.wav"
    sf.write(file_path, audio, 24000)
    audio_files.append(file_path)

# Concatenate all audio fragments
combined = AudioSegment.silent(duration=0)  # Start with a silent segment

for file in audio_files:
    segment = AudioSegment.from_wav(file)
    combined += segment

# Save the combined audio as output.wav
combined_output_path = "output.wav"
combined.export(combined_output_path, format="wav")

# Display the concatenated audio file
display(Audio(combined_output_path, autoplay=True))
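
The concatenation step above relies on pydub and ffmpeg. As a hedged alternative, the fragments could be joined with Python's standard-library `wave` module. The sketch assumes every fragment is an uncompressed PCM WAV with the same channel count, sample width, and sample rate; `concat_wavs` and `write_silence` are hypothetical helpers, and the demo uses silent files rather than real Kokoro output.

```python
import os
import tempfile
import wave


def concat_wavs(paths, out_path):
    """Concatenate same-format PCM WAV files into out_path; return total frames."""
    if not paths:
        raise ValueError("no input files")
    total_frames = 0
    first_fmt = None
    with wave.open(out_path, "wb") as out:
        for path in paths:
            with wave.open(path, "rb") as src:
                fmt = (src.getnchannels(), src.getsampwidth(), src.getframerate())
                if first_fmt is None:
                    # The first file defines the output format.
                    first_fmt = fmt
                    out.setnchannels(fmt[0])
                    out.setsampwidth(fmt[1])
                    out.setframerate(fmt[2])
                elif fmt != first_fmt:
                    raise ValueError(f"{path} has a different format: {fmt}")
                n_frames = src.getnframes()
                out.writeframes(src.readframes(n_frames))
                total_frames += n_frames
    return total_frames


def write_silence(path, n_frames, rate=24000):
    """Write a mono 16-bit silent WAV file (demo input only)."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * n_frames)


tmp_dir = tempfile.mkdtemp()
frag_a = os.path.join(tmp_dir, "0.wav")
frag_b = os.path.join(tmp_dir, "1.wav")
write_silence(frag_a, 2400)
write_silence(frag_b, 1200)
total = concat_wavs([frag_a, frag_b], os.path.join(tmp_dir, "output.wav"))
print(total)
```

This avoids the ffmpeg dependency for plain WAV output, while pydub remains the more convenient choice for exporting to other formats.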






Fragment 0:
Graphemes: This is voice generated by Kokoro
Phonemes: ðˌɪs ɪz vˈYs ʤˈɛnəɹˌATᵻd bI kəkˈɔɹO
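
The run above yields a single fragment because the input text contains no newlines. The sketch below illustrates how a pattern such as `r'\n+'` would divide multi-line text, assuming the pipeline applies `split_pattern` as a regular expression to the input before synthesis. Kokoro may also break long segments into smaller chunks, which this example does not model.

```python
import re

split_pattern = r"\n+"
text = "First paragraph.\n\nSecond paragraph.\nThird line."

# Runs of one or more newlines act as a single separator; empty pieces are dropped.
segments = [s for s in re.split(split_pattern, text.strip()) if s]

for i, segment in enumerate(segments):
    print(f"Fragment {i}: {segment}")
```

Under that assumption, separating sentences or paragraphs with newlines in `text_to_generate` would produce one fragment per block.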
