<a href="https://colab.research.google.com/github/Troyanovsky/awesome-TTS-Colab/blob/main/GLM_TTS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🗣️ GLM-TTS Colab

## 📄 Description

This notebook runs **GLM-TTS**, an **LLM-powered text-to-speech system** featuring **zero-shot voice cloning**, **RL-enhanced emotion control**, and **streaming real-time synthesis**.  
Built on a **two-stage architecture** (speech token generation via an LLM, followed by waveform synthesis via Flow Matching), GLM-TTS produces **high-fidelity, expressive**, and **bilingual (EN/ZH)** speech suitable for interactive applications.

**Capabilities:**  
Zero-Shot Voice Cloning (3–10 s), RL-Tuned Emotion & Prosody, High-Quality Speech Generation, Phoneme-Level Control, Streaming / Low-Latency TTS, Chinese–English Mixed Input

---

## How to use

* Adjust the text/audio inputs in the provided fields  
* Expand and run the section you need. You may need to restart the session after the `pip install` cells and then run the remaining cells so that the newly installed libraries load properly.

---

## ⚙️ Model Highlights

* 🗣 **Zero-shot cloning** – reproduce a speaker from a few seconds of audio  
* 🎭 **Emotion-optimized prosody** – multi-reward RL (GRPO) improves expressiveness  
* 🔤 **Hybrid phoneme+text input** – precise control over pronunciation  
* ⚡ **Streaming inference** – supports real-time generation for interactive settings  
* 🌏 **Bilingual capability** – robust for Chinese, English, and mixed-language text  

---

## 🧠 Model Details

* **Architecture:** LLM-based speech token generator + Flow Matching vocoder  
* **Supported Languages:** English, Chinese, mixed text  
* **Voice Cloning:** 3–10 seconds of prompt audio  
* **Control Features:** Emotion, prosody, phoneme-level instructions  
* **Performance:** Low-latency streaming suitable for live applications  
* **Format:** Available via Hugging Face  

---

## 🔗 Resources

* **GitHub Repository:** https://github.com/zai-org/GLM-TTS  
* **Model Availability:** https://huggingface.co/zai-org/GLM-TTS  

---

## 🎙️ Explore More TTS Models

Looking for more cutting-edge voice models?  
👉 Check out the full collection: [awesome-TTS-Colab](https://github.com/Troyanovsky/awesome-TTS-Colab)

## TTS (using default example audio)

In [1]:
# Setup: Clone repo, install deps, download checkpoints

!git clone https://github.com/zai-org/GLM-TTS.git
%cd /content/GLM-TTS

# Remove any line containing WeTextProcessing from requirements.txt
import re

req_path = "requirements.txt"

with open(req_path, "r") as f:
    lines = f.readlines()

# Filter out the problematic line(s)
cleaned = []
for line in lines:
    if not re.search(r'wetextprocessing', line, re.IGNORECASE):
        cleaned.append(line)

with open(req_path, "w") as f:
    f.writelines(cleaned)

Cloning into 'GLM-TTS'...
remote: Enumerating objects: 170, done.
remote: Counting objects: 100% (170/170), done.
remote: Compressing objects: 100% (136/136), done.
remote: Total 170 (delta 41), reused 148 (delta 22), pack-reused 0 (from 0)
Receiving objects: 100% (170/170), 28.86 MiB | 17.07 MiB/s, done.
Resolving deltas: 100% (41/41), done.
/content/GLM-TTS
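
The requirements filter in the setup cell can also be written as a small reusable helper. This is a minimal sketch and not part of the GLM-TTS repo: the name `drop_requirement` is made up for illustration. Like the cell above, it drops any line that mentions the package name, ignoring case.

```python
from pathlib import Path


def drop_requirement(req_path: str, package: str) -> int:
    """Remove every line mentioning `package` (case-insensitive).

    Returns the number of lines that were dropped.
    """
    path = Path(req_path)
    lines = path.read_text().splitlines(keepends=True)
    # Keep only the lines that do not mention the package name
    kept = [line for line in lines if package.lower() not in line.lower()]
    path.write_text("".join(kept))
    return len(lines) - len(kept)


# Example (run from the repo root):
# drop_requirement("requirements.txt", "WeTextProcessing")
```

Returning the count makes it easy to notice when the upstream `requirements.txt` no longer contains the line and the workaround can be retired.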


In [2]:
# ⚠️ You may need to restart the session after installing these libraries, then run the following cells
!pip install -r requirements.txt
!pip install -U "huggingface-hub>=0.34.0,<1.0"
!pip install WeTextProcessing

Collecting numpy==1.26.4 (from -r requirements.txt (line 1))
  Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting scipy==1.15.3 (from -r requirements.txt (line 2))
  Downloading scipy-1.15.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting sentencepiece==0.1.99 (from -r requirements.txt (line 3))
  Downloading sentencepiece-0.1.99.tar.gz (2.6 MB)

Collecting WeTextProcessing
  Downloading WeTextProcessing-1.0.4.1-py3-none-any.whl.metadata (7.2 kB)
Collecting pynini==2.1.6 (from WeTextProcessing)
  Downloading pynini-2.1.6-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (4.9 kB)
Downloading WeTextProcessing-1.0.4.1-py3-none-any.whl (2.0 MB)
Downloading pynini-2.1.6-cp312-cp312-manylinux_2_28_x86_64.whl (154.7 MB)
Installing collected packages: pynini, WeTextProcessing
Successfully installed WeTextProcessing-1.0.4.1 pynini-2.1.6
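
Whether a restart is actually needed depends on whether a package that was just reinstalled had already been imported in the running session. The sketch below is one way to check this, under the assumption that the module exposes a `__version__` attribute (as `numpy` and `scipy` do); the helper name `needs_restart` is made up for illustration. It compares the version loaded in memory with the version recorded on disk.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def needs_restart(module_name, dist_name=None):
    """Return True if `module_name` is already imported with a version
    that differs from the one currently installed on disk."""
    module = sys.modules.get(module_name)
    if module is None:
        # Not imported yet: the next import picks up the installed version
        return False
    try:
        on_disk = version(dist_name or module_name)
    except PackageNotFoundError:
        return False
    loaded = getattr(module, "__version__", None)
    return loaded is not None and loaded != on_disk


# Module names taken from the pinned packages in the install log above
for name in ("numpy", "scipy"):
    if needs_restart(name):
        print(f"{name}: restart the session before continuing")
```

If nothing is printed, the modules checked here are either not imported yet or already match the installed versions.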


In [3]:
from huggingface_hub import snapshot_download

# Download full GLM-TTS checkpoint into ./ckpt
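# Note: `local_dir_use_symlinks` is deprecated and ignored by recent
# huggingface_hub releases, so only `local_dir` is passed here.
snapshot_download(
    "zai-org/GLM-TTS",
    local_dir="ckpt",
)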

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
For more details, check out https://huggingface.co/docs/huggingface_hub/main/en/guides/download#download-files-to-local-folder.


Fetching 17 files:   0%|          | 0/17 [00:00<?, ?it/s]

config.json:   0%|          | 0.00/674 [00:00<?, ?B/s]

.gitattributes: 0.00B [00:00, ?B/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

hift/hift.pt:   0%|          | 0.00/83.4M [00:00<?, ?B/s]

llm/model-00001-of-00002.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

config.yaml:   0%|          | 0.00/180 [00:00<?, ?B/s]

flow/flow.pt:   0%|          | 0.00/901M [00:00<?, ?B/s]

model.safetensors.index.json: 0.00B [00:00, ?B/s]

config.json: 0.00B [00:00, ?B/s]

llm/model-00002-of-00002.safetensors:   0%|          | 0.00/1.21G [00:00<?, ?B/s]

speech_tokenizer/model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/340 [00:00<?, ?B/s]

vq32k-phoneme-tokenizer/tokenizer.model:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

tokenization_chatglm.py: 0.00B [00:00, ?B/s]

vocos2d/generator_jit.ckpt:   0%|          | 0.00/60.4M [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

'/content/GLM-TTS/ckpt'
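
Because the checkpoint is several gigabytes, an interrupted download is worth catching before the models are loaded. The following is a minimal sketch of such a check. The helper name `missing_files` is made up, and the default file list is an assumption: it treats the labels printed in the download log as paths relative to `ckpt`, so adjust it if the repository layout differs.

```python
from pathlib import Path

# Assumed layout, based on the names shown in the download log
EXPECTED_FILES = [
    "hift/hift.pt",
    "flow/flow.pt",
    "speech_tokenizer/model.safetensors",
    "llm/model-00001-of-00002.safetensors",
    "llm/model-00002-of-00002.safetensors",
]


def missing_files(ckpt_dir, expected=EXPECTED_FILES):
    """Return the expected relative paths that are absent or empty."""
    root = Path(ckpt_dir)
    missing = []
    for rel in expected:
        path = root / rel
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


# Example: an empty list means every expected file is present
# print(missing_files("ckpt"))
```

Calling `snapshot_download` again with the same `local_dir` should fetch whatever is missing.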

In [1]:
# Text that GLM-TTS will synthesize
input_text = "Hello, this is Jiayan from GLM-TTS. How are you doing?"  #@param {type:"string"}

# Reference text that matches the built-in reference audio speaker
prompt_text = "I wonder if you'd like to have a burger with me."  #@param {type:"string"}

# Built-in reference audio shipped with the repo
prompt_wav_path = "examples/prompt/jiayan_en.wav"  #@param {type:"string"}

# Inference options
sample_rate = 24000  #@param {type:"integer"}
use_phoneme = False  #@param {type:"boolean"}
use_cache = True     #@param {type:"boolean"}
seed = 0             #@param {type:"integer"}


In [2]:
%cd /content/GLM-TTS

import torch
from IPython.display import Audio, display

# Import GLM-TTS helpers from the repo
from glmtts_inference import load_models, generate_long, DEVICE

# Load all frontends + models (LLM + Flow)
frontend, text_frontend, speech_tokenizer, llm, flow = load_models(
    use_phoneme=use_phoneme,
    sample_rate=sample_rate,
)

def glmtts_synthesize(
    prompt_wav: str,
    prompt_text: str,
    synth_text: str,
    sample_rate: int = 24000,
    use_phoneme: bool = False,
    seed: int = 0,
    use_cache: bool = True,
):
    """Minimal wrapper around generate_long for a single utterance."""
    # Normalize text
    prompt_text_norm = text_frontend.text_normalize(prompt_text)
    synth_text_norm = text_frontend.text_normalize(synth_text)

    # Extract tokens & features from prompt
    prompt_text_token = frontend._extract_text_token(prompt_text_norm + " ")
    prompt_speech_token = frontend._extract_speech_token([prompt_wav])
    speech_feat = frontend._extract_speech_feat(prompt_wav, sample_rate=sample_rate)
    embedding = frontend._extract_spk_embedding(prompt_wav)

    cache_speech_token = [prompt_speech_token.squeeze().tolist()]
    flow_prompt_token = torch.tensor(cache_speech_token, dtype=torch.int32).to(DEVICE)

    # LLM cache (for longer text, can reuse history)
    cache = {
        "cache_text": [prompt_text_norm],
        "cache_text_token": [prompt_text_token],
        "cache_speech_token": cache_speech_token,
        "use_cache": use_cache,
    }

    uttid = "simple_demo"

    # Core generation
    tts_speech, _, _, _ = generate_long(
        frontend=frontend,
        text_frontend=text_frontend,
        llm=llm,
        flow=flow,
        text_info=[uttid, synth_text_norm],
        cache=cache,
        embedding=embedding,
        seed=seed,
        flow_prompt_token=flow_prompt_token,
        speech_feat=speech_feat,
        device=DEVICE,
        use_phoneme=use_phoneme,
    )

    return tts_speech


# Run synthesis
tts_speech = glmtts_synthesize(
    prompt_wav=prompt_wav_path,
    prompt_text=prompt_text,
    synth_text=input_text,
    sample_rate=sample_rate,
    use_phoneme=use_phoneme,
    seed=seed,
    use_cache=use_cache,
)

/content/GLM-TTS
[load_quantize_encoder] start. model_path='ckpt/speech_tokenizer'
Configured for 24kHz frontend.


2025-12-15 02:37:43,916 WETEXT INFO building fst for zh_normalizer ...
INFO:wetext-zh_normalizer:building fst for zh_normalizer ...
2025-12-15 02:38:22,740 WETEXT INFO done
INFO:wetext-zh_normalizer:done
2025-12-15 02:38:22,742 WETEXT INFO fst path: /usr/local/lib/python3.12/dist-packages/tn/zh_tn_tagger.fst
INFO:wetext-zh_normalizer:fst path: /usr/local/lib/python3.12/dist-packages/tn/zh_tn_tagger.fst
2025-12-15 02:38:22,744 WETEXT INFO           /usr/local/lib/python3.12/dist-packages/tn/zh_tn_verbalizer.fst
INFO:wetext-zh_normalizer:          /usr/local/lib/python3.12/dist-packages/tn/zh_tn_verbalizer.fst
2025-12-15 02:38:22,751 WETEXT INFO found existing fst: /usr/local/lib/python3.12/dist-packages/tn/en_tn_tagger.fst
INFO:wetext-en_normalizer:found existing fst: /usr/local/lib/python3.12/dist-packages/tn/en_tn_tagger.fst
2025-12-15 02:38:22,753 WETEXT INFO                     /usr/local/lib/python3.12/dist-packages/tn/en_tn_verbalizer.fst
INFO:wetext-en_normalizer:                

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading HiFT model from ckpt/hift/hift.pt on cuda...


In [None]:
# Play audio directly from tensor
waveform = tts_speech.squeeze().cpu().numpy()
display(Audio(waveform, rate=sample_rate))
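
To keep the result after the session ends, the waveform can also be written to disk. This is a minimal sketch using only the standard library; the helper name `save_wav` is made up, and it assumes the samples are mono floats in the range [-1.0, 1.0] (values outside that range are clipped).

```python
import sys
import wave
from array import array


def save_wav(samples, path, sample_rate=24000):
    """Write mono float samples to a 16-bit PCM WAV file."""
    # Clip to [-1, 1] and scale to the signed 16-bit range
    pcm = array("h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())


# Example, using the variables from the cell above:
# save_wav(waveform, "output.wav", sample_rate)
```

In Colab, `files.download("output.wav")` from `google.colab` can then be used to fetch the file.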

## Voice Clone

In [1]:
# Setup: Clone repo, install deps, download checkpoints

!git clone https://github.com/zai-org/GLM-TTS.git
%cd /content/GLM-TTS

# Remove any line containing WeTextProcessing from requirements.txt
import re

req_path = "requirements.txt"

with open(req_path, "r") as f:
    lines = f.readlines()

# Filter out the problematic line(s)
cleaned = []
for line in lines:
    if not re.search(r'wetextprocessing', line, re.IGNORECASE):
        cleaned.append(line)

with open(req_path, "w") as f:
    f.writelines(cleaned)

Cloning into 'GLM-TTS'...
remote: Enumerating objects: 170, done.
remote: Counting objects: 100% (170/170), done.
remote: Compressing objects: 100% (136/136), done.
remote: Total 170 (delta 41), reused 148 (delta 22), pack-reused 0 (from 0)
Receiving objects: 100% (170/170), 28.86 MiB | 24.57 MiB/s, done.
Resolving deltas: 100% (41/41), done.
/content/GLM-TTS


In [2]:
# ⚠️ You may need to restart the session after installing these libraries, then run the following cells
!pip install -r requirements.txt
!pip install -U "huggingface-hub>=0.34.0,<1.0"
!pip install WeTextProcessing



In [3]:
from huggingface_hub import snapshot_download

# Download full GLM-TTS checkpoint into ./ckpt
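# Note: `local_dir_use_symlinks` is deprecated and ignored by recent
# huggingface_hub releases, so only `local_dir` is passed here.
snapshot_download(
    "zai-org/GLM-TTS",
    local_dir="ckpt",
)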


'/content/GLM-TTS/ckpt'

In [4]:
# Transcript of the uploaded reference audio (should match what the speaker says in the clip)
prompt_text = "In short, we embarked on, a mission to make America great again, for all Americans."  #@param {type:"string"}

# Upload reference audio for cloning
prompt_wav_path = "/content/GLM-TTS/examples/prompt/uploaded.wav"

from google.colab import files
import os

# Ensure the directory exists
os.makedirs(os.path.dirname(prompt_wav_path), exist_ok=True)

# Prompt user upload
uploaded = files.upload()

# Validate upload
if not uploaded:
    raise RuntimeError("No file was uploaded.")

# Take the first uploaded file
uploaded_filename = next(iter(uploaded))

# Basic validation: ensure it's a WAV file
if not uploaded_filename.lower().endswith(".wav"):
    raise ValueError("Uploaded file is not a .wav file.")

# Save to desired path
with open(prompt_wav_path, "wb") as f:
    f.write(uploaded[uploaded_filename])

print(f"File saved to: {prompt_wav_path}")


Saving trump_promptvn.wav to trump_promptvn.wav
File saved to: /content/GLM-TTS/examples/prompt/uploaded.wav
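
The upload cell only checks the file extension. Since the model description suggests 3–10 seconds of prompt audio, it can help to check the clip length as well. The sketch below uses the standard library `wave` module, which only reads uncompressed PCM WAV files; the helper names are made up, and the bounds are taken from this notebook's description rather than enforced by GLM-TTS.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_prompt_length(path, min_s=3.0, max_s=10.0):
    """Raise ValueError if the clip is outside the suggested range."""
    duration = wav_duration_seconds(path)
    if not min_s <= duration <= max_s:
        raise ValueError(
            f"Prompt audio is {duration:.1f} s; expected {min_s:g} to {max_s:g} s."
        )
    return duration


# Example, using the path from the upload cell:
# check_prompt_length(prompt_wav_path)
```

A `wave.Error` here usually means the file is not plain PCM WAV, in which case converting it first (for example with `ffmpeg`) is the simplest fix.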


In [2]:
# Text that GLM-TTS will synthesize
input_text = "How is this voice-cloning quality? Does it sound good?"  #@param {type:"string"}

# Inference options
sample_rate = 24000  #@param {type:"integer"}
use_phoneme = False  #@param {type:"boolean"}
use_cache = True     #@param {type:"boolean"}
seed = 0             #@param {type:"integer"}

In [None]:
%cd /content/GLM-TTS

import torch
from IPython.display import Audio, display

# Import GLM-TTS helpers from the repo
from glmtts_inference import load_models, generate_long, DEVICE

# Load all frontends + models (LLM + Flow)
frontend, text_frontend, speech_tokenizer, llm, flow = load_models(
    use_phoneme=use_phoneme,
    sample_rate=sample_rate,
)

def glmtts_synthesize(
    prompt_wav: str,
    prompt_text: str,
    synth_text: str,
    sample_rate: int = 24000,
    use_phoneme: bool = False,
    seed: int = 0,
    use_cache: bool = True,
):
    """Minimal wrapper around generate_long for a single utterance."""
    # Normalize text
    prompt_text_norm = text_frontend.text_normalize(prompt_text)
    synth_text_norm = text_frontend.text_normalize(synth_text)

    # Extract tokens & features from prompt
    prompt_text_token = frontend._extract_text_token(prompt_text_norm + " ")
    prompt_speech_token = frontend._extract_speech_token([prompt_wav])
    speech_feat = frontend._extract_speech_feat(prompt_wav, sample_rate=sample_rate)
    embedding = frontend._extract_spk_embedding(prompt_wav)

    cache_speech_token = [prompt_speech_token.squeeze().tolist()]
    flow_prompt_token = torch.tensor(cache_speech_token, dtype=torch.int32).to(DEVICE)

    # LLM cache (for longer text, can reuse history)
    cache = {
        "cache_text": [prompt_text_norm],
        "cache_text_token": [prompt_text_token],
        "cache_speech_token": cache_speech_token,
        "use_cache": use_cache,
    }

    uttid = "simple_demo"

    # Core generation
    tts_speech, _, _, _ = generate_long(
        frontend=frontend,
        text_frontend=text_frontend,
        llm=llm,
        flow=flow,
        text_info=[uttid, synth_text_norm],
        cache=cache,
        embedding=embedding,
        seed=seed,
        flow_prompt_token=flow_prompt_token,
        speech_feat=speech_feat,
        device=DEVICE,
        use_phoneme=use_phoneme,
    )

    return tts_speech

In [5]:
# Run synthesis
tts_speech = glmtts_synthesize(
    prompt_wav=prompt_wav_path,
    prompt_text=prompt_text,
    synth_text=input_text,
    sample_rate=sample_rate,
    use_phoneme=use_phoneme,
    seed=seed,
    use_cache=use_cache,
)

In [None]:
# Play audio directly from tensor
waveform = tts_speech.squeeze().cpu().numpy()
display(Audio(waveform, rate=sample_rate))