TL;DR: GPA unifies three speech tasks in a single model, and this repo includes code for training, fine-tuning, and efficient deployment of GPA.
GPA stands for General Purpose Audio.
In academia, a student's GPA (Grade Point Average) serves as a unified metric that reflects performance across diverse subjects, ranging from Calculus and Philosophy to Gym class.
Similarly, our GPA model unifies the three major pillars of audio tasks, Text-to-Speech (TTS), Automatic Speech Recognition (ASR), and Voice Conversion (VC), into a single autoregressive transformer.
- Our open-source release supports multiple inference frameworks and provides production-ready code for cloud deployment.
- We include concise inference examples and training pipelines for research purposes.
- The released 0.3B model is also well suited to edge devices; edge deployment support will be released later.
| Category | Item | Status |
|---|---|---|
| Core Features | Unified LLM-based audio generation & understanding | ✅ |
| | Inference Scripts (STT, TTS, VC) | ✅ |
| | Training Pipeline (DeepSpeed) | ✅ |
| | Interactive Demo | ✅ |
| | Basic Service Deployment (vLLM/FastAPI) | ✅ |
| | Paper (ArXiv) | ✅ |
| Model Releases | GPA-0.3B-preview (Edge-focused) | ✅ |
| | GPA-0.3B (Edge-focused) | ⬜ |
| Edge Deployment | Android Platform | ⬜ |
| | RK Series | ⬜ |
| | iOS Platform | ⬜ |
| Frameworks | vllm | ✅ |
| | llama-cpp | ✅ |
| | sglang | ✅ |
| | torch | ✅ |
| | mlx-lm | ✅ |
| | rknn | ⬜ |
The default development environment is configured for:
- OS: Linux (x86_64)
- GPU: NVIDIA
- CUDA: 12.x
The provided uv.lock file was generated under this configuration. If your system matches the above, you can use the uv-based setup for a fully reproducible environment. If you are using:
- CUDA 11.x (e.g. cu116)
- CPU-only systems
- macOS or Windows
please follow the pip-based installation described below.
We use uv for fast and reproducible Python environment management.
1. Install uv:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
# Or install via pip if you prefer:
# pip install uv
```

2. Sync the environment (installs all dependencies):

💡Note: If training is not required, or if building flash_attn is difficult or slow on your device, you may comment out this dependency in pyproject.toml. In that case, switch training to eager attention mode (see the sketch in the training section below).

```bash
uv sync
```

For the pip-based installation instead:

1. Create and activate a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate
```

2. Install base dependencies:

```bash
pip install -r requirements.txt
```

Before running inference, please download the model checkpoints from Hugging Face or ModelScope.
| Model | Hugging Face | ModelScope |
|---|---|---|
| GPA-0.3B-preview | Download | Download |
| GPA-0.3B | Coming Soon | Coming Soon |
Important: After downloading the checkpoints, please verify that your model directory structure matches the hierarchy below.
```
${GPA_MODEL_DIR}/
├── BiCodec/
│   └── wav2vec2-large-xlsr-53/
├── glm-4-voice-tokenizer/
├── added_tokens.json
├── chat_template.jinja
├── config.json
├── generation_config.json
├── merges.txt
├── model.safetensors
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── vocab.json
```
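As a convenience, the sketch below shows one way to fetch the checkpoints programmatically with the huggingface_hub Python API. The repository id is a placeholder, so substitute the actual GPA-0.3B-preview repo linked in the table above.

```python
# Sketch: download the GPA checkpoints into ${GPA_MODEL_DIR} via huggingface_hub.
# The repo id below is a placeholder -- replace it with the actual
# GPA-0.3B-preview repository linked in the download table above.
import os
from huggingface_hub import snapshot_download

gpa_model_dir = os.environ.get("GPA_MODEL_DIR", "./gpa_models")

snapshot_download(
    repo_id="<org>/GPA-0.3B-preview",  # placeholder repo id
    local_dir=gpa_model_dir,           # download into ${GPA_MODEL_DIR}
)
print(f"Checkpoints saved to {gpa_model_dir}")
```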
You can perform various tasks like Speech-to-Text, Text-to-Speech, and Voice Conversion using the provided scripts.
💡Note: Please navigate to the inference directory to ensure relative paths for audio files work correctly.
💡Note: Currently, we only support input in WAV format at a sample rate of 16 kHz (see the resampling sketch below for converting other formats).
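If your source audio is not already 16 kHz mono WAV, the sketch below converts it with librosa and soundfile; these packages may not be among this repo's pinned dependencies, so install them separately if needed.

```python
# Sketch: convert an arbitrary audio file to 16 kHz mono WAV before inference.
# Assumes librosa and soundfile are installed (pip install librosa soundfile);
# they may not be part of this repo's pinned dependencies.
import librosa
import soundfile as sf

def to_16k_wav(src_path: str, dst_path: str) -> None:
    # librosa.load resamples to the requested rate and downmixes to mono
    audio, sr = librosa.load(src_path, sr=16000, mono=True)
    sf.write(dst_path, audio, sr, subtype="PCM_16")

to_16k_wav("my_recording.mp3", "test_audio/my_recording_16k.wav")
```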
```bash
cd scripts/inference
```

💡Note: To use other Python environments, replace "uv run" with the path to your Python interpreter.
Speech-to-Text (STT/ASR):
```bash
# Using uv
uv run gpa_inference.py --task stt \
    --src_audio_path "test_audio/000.wav" \
    --gpa_model_path "${GPA_MODEL_DIR}" \
    --tokenizer_path "${GPA_MODEL_DIR}/glm-4-voice-tokenizer" \
    --bicodec_tokenizer_path "${GPA_MODEL_DIR}/BiCodec" \
    --text_tokenizer_path "${GPA_MODEL_DIR}"

# Or using python
python gpa_inference.py --task stt \
    --src_audio_path "test_audio/000.wav" \
    --gpa_model_path "${GPA_MODEL_DIR}" \
    --tokenizer_path "${GPA_MODEL_DIR}/glm-4-voice-tokenizer" \
    --bicodec_tokenizer_path "${GPA_MODEL_DIR}/BiCodec" \
    --text_tokenizer_path "${GPA_MODEL_DIR}"
```

Text-to-Speech (TTS):
```bash
# Using uv
uv run gpa_inference.py --task tts-a \
    --text "Hello world, this is Major Tom speaking." \
    --ref_audio_path "test_audio/astro.wav" \
    --gpa_model_path "${GPA_MODEL_DIR}" \
    --tokenizer_path "${GPA_MODEL_DIR}/glm-4-voice-tokenizer" \
    --bicodec_tokenizer_path "${GPA_MODEL_DIR}/BiCodec" \
    --text_tokenizer_path "${GPA_MODEL_DIR}"

# Or using python
python gpa_inference.py --task tts-a \
    --text "Hello world, this is Major Tom speaking." \
    --ref_audio_path "test_audio/astro.wav" \
    --gpa_model_path "${GPA_MODEL_DIR}" \
    --tokenizer_path "${GPA_MODEL_DIR}/glm-4-voice-tokenizer" \
    --bicodec_tokenizer_path "${GPA_MODEL_DIR}/BiCodec" \
    --text_tokenizer_path "${GPA_MODEL_DIR}"
```

Voice Conversion (VC):
```bash
# Using uv
uv run gpa_inference.py --task vc \
    --src_audio_path "test_audio/vc_src.wav" \
    --ref_audio_path "test_audio/astro.wav" \
    --gpa_model_path "${GPA_MODEL_DIR}" \
    --tokenizer_path "${GPA_MODEL_DIR}/glm-4-voice-tokenizer" \
    --bicodec_tokenizer_path "${GPA_MODEL_DIR}/BiCodec" \
    --text_tokenizer_path "${GPA_MODEL_DIR}"

# Or using python
python gpa_inference.py --task vc \
    --src_audio_path "test_audio/vc_src.wav" \
    --ref_audio_path "test_audio/astro.wav" \
    --gpa_model_path "${GPA_MODEL_DIR}" \
    --tokenizer_path "${GPA_MODEL_DIR}/glm-4-voice-tokenizer" \
    --bicodec_tokenizer_path "${GPA_MODEL_DIR}/BiCodec" \
    --text_tokenizer_path "${GPA_MODEL_DIR}"
```

For more details on inference arguments, check out the Inference README.
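If you need to process many files at once, a thin wrapper around the CLI above is often enough. The sketch below is a hypothetical helper (not part of this repo) that runs the STT command for every WAV file in a directory, reusing only the flags documented above.

```python
# Sketch: batch STT over a directory by invoking the gpa_inference.py CLI
# shown above. This wrapper is hypothetical and not part of the repo.
import os
import subprocess
from pathlib import Path

GPA_MODEL_DIR = os.environ["GPA_MODEL_DIR"]

def transcribe_dir(wav_dir: str) -> None:
    for wav in sorted(Path(wav_dir).glob("*.wav")):
        subprocess.run(
            [
                "python", "gpa_inference.py", "--task", "stt",
                "--src_audio_path", str(wav),
                "--gpa_model_path", GPA_MODEL_DIR,
                "--tokenizer_path", f"{GPA_MODEL_DIR}/glm-4-voice-tokenizer",
                "--bicodec_tokenizer_path", f"{GPA_MODEL_DIR}/BiCodec",
                "--text_tokenizer_path", GPA_MODEL_DIR,
            ],
            check=True,
        )

transcribe_dir("test_audio")
```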
We provide a training script to help you get started. A small sample dataset is included in the repository to quickly verify that the pipeline works as expected:
- scripts/train/merged_shuffled_train.jsonl
- scripts/train/dataset
💡Note: Before running, be sure to update the paths in train_gpa.sh.
```bash
# Run the training script (uses DeepSpeed)
cd scripts/train
bash train_gpa.sh
```

The training script automatically handles environment activation via uv run.
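If you commented out flash_attn during installation, training should fall back to eager attention. Below is a minimal sketch of what that looks like, assuming the checkpoint is loaded through Hugging Face transformers; the actual loading code used by train_gpa.sh may differ.

```python
# Sketch: loading the model with eager attention instead of flash_attention_2.
# Assumes the checkpoint loads via Hugging Face transformers; the actual
# loader in the training pipeline may differ.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/path/to/GPA_MODEL_DIR",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",  # use "flash_attention_2" only if flash_attn is installed
)
```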
Building your own dataset is as simple as following the format of the provided .jsonl example (scripts/train/merged_shuffled_train.jsonl) and pointing the entries at your prepared data; a minimal sketch is shown below.
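The snippet below sketches writing such a manifest in JSON Lines format. The field names are hypothetical placeholders, so mirror the actual keys used in scripts/train/merged_shuffled_train.jsonl.

```python
# Sketch: writing a training manifest in JSON Lines format.
# The field names below ("task", "audio_path", "text") are purely illustrative;
# mirror the actual keys used in scripts/train/merged_shuffled_train.jsonl.
import json

samples = [
    {"task": "tts", "audio_path": "dataset/spk1/0001.wav", "text": "Hello world."},
    {"task": "stt", "audio_path": "dataset/spk2/0042.wav", "text": "Good morning."},
]

with open("my_train.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```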
We provide a complete set of scripts for service deployment, including a FastAPI-based backend server, a Gradio-based GUI, and basic testing scripts.
⚠️ Caution: The current vLLM-based deployment may exhibit occasional audio quality degradation under large-scale concurrent workloads. For reliable evaluation and quality validation, we recommend using the basic PyTorch inference implementation provided in the inference module.
The core service is built with FastAPI. We utilize a Dockerfile to build the runtime environment, ensuring consistency and ease of deployment.
- Ensure you have Docker and Docker Compose installed.

- Set the required environment variables (e.g., in a .env file, or export them). Please configure GPA_CODE_ROOT and GPA_MODEL_DIR. For model preparation, refer to Checkpoint Download.

  ```bash
  export GPA_CODE_ROOT="/absolute/path/to/this/repo"
  export GPA_MODEL_DIR="/absolute/path/to/models"
  ```

- Run with Docker Compose:

  ```bash
  cd scripts/server
  docker compose up -d --build
  ```

- Test: You can use the provided client script to verify that the service is working correctly.

  ```bash
  # Run the test client
  python test_client.py
  ```
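If you want to call the service from your own code rather than the provided test_client.py, the sketch below shows the general shape of an HTTP client using requests. The endpoint path and payload are placeholders, not the actual API; consult scripts/server/test_client.py for the real routes and request fields.

```python
# Sketch: calling the deployed service from your own code with requests.
# The endpoint path and payload below are hypothetical placeholders --
# check scripts/server/test_client.py for the actual routes and fields.
import requests

resp = requests.post(
    "http://localhost:8000/tts",             # placeholder URL and route
    json={"text": "Hello from the client"},  # placeholder payload
    timeout=60,
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)
```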
We provide a user-friendly web interface for interacting with the API.
💡Note: The GUI uses the original PyTorch deployment instead of vLLM.
```bash
# Install Gradio if not already installed
pip install gradio

# Start the GUI app
cd scripts/server
python gui_app.py
```

The GUI will be available at http://localhost:7868.
The following results are obtained by benchmarking services instantiated via the official deployment scripts, reflecting end-to-end performance in realistic serving scenarios rather than offline inference.
Among currently available open-source systems, our model is one of the few that natively supports both concurrent and streaming inference, while achieving performance comparable to the first tier of existing approaches.
💡Note:
- TTFC: Time To First Chunk (TTS)
- TTFT: Time To First Token (ASR)
- RTF: Real-Time Factor (synthesis time / audio duration; lower is better)
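For reference, the sketch below illustrates how these metrics can be computed from client-side timings; it is a generic illustration, not the benchmark code used to produce the tables that follow.

```python
# Sketch: computing RTF and latency percentiles from client-side timings.
# Generic illustration only -- not taken from this repo's benchmark scripts.
import numpy as np

def rtf(synthesis_time_s: float, audio_duration_s: float) -> float:
    # Real-Time Factor: values below 1.0 mean faster than real time.
    return synthesis_time_s / audio_duration_s

def summarize(latencies_ms: list[float]) -> dict:
    arr = np.asarray(latencies_ms)
    return {
        "avg": float(arr.mean()),
        "p50": float(np.percentile(arr, 50)),
        "p99": float(np.percentile(arr, 99)),
    }

ttfc_ms = [258.8, 301.2, 270.5]  # example first-chunk latencies
print(summarize(ttfc_ms), rtf(1.27, 6.44))
```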
| Concurrency | Avg TTFC (ms) | P50 TTFC (ms) | P99 TTFC (ms) | Avg RTF | P50 RTF | P99 RTF | Audio Dur (s) |
|---|---|---|---|---|---|---|---|
| 1 | 258.8 | 258.8 | 258.8 | 0.197 | 0.197 | 0.197 | 6.44 |
| 5 | 385.0 | 394.7 | 396.2 | 0.218 | 0.217 | 0.248 | 6.76 |
| 10 | 544.6 | 564.2 | 566.7 | 0.282 | 0.301 | 0.313 | 6.49 |
| 20 | 977.8 | 977.9 | 982.9 | 0.470 | 0.490 | 0.538 | 7.19 |
| 40 | 1797.0 | 1736.4 | 2564.5 | 0.421 | 0.400 | 0.587 | 6.33 |
| 80 | 3786.4 | 4054.4 | 5415.8 | 0.763 | 0.763 | 1.096 | 6.32 |
| 160 | 9847.9 | 10239.9 | 14350.3 | 1.718 | 1.740 | 2.577 | 6.44 |
Table 2. TTS Streaming RTF and Audio Duration
| Concurrency | Avg TTFT (ms) | P50 TTFT (ms) | P99 TTFT (ms) | Avg Total (ms) |
|---|---|---|---|---|
| 1 | 157.5 | 157.5 | 157.5 | 190.9 |
| 5 | 394.1 | 393.7 | 395.9 | 400.0 |
| 10 | 589.6 | 721.3 | 723.3 | 598.1 |
| 20 | 1316.3 | 1495.6 | 1500.4 | 1317.8 |
| 40 | 2690.9 | 2678.3 | 2861.4 | 2693.7 |
| 80 | 3833.4 | 3961.3 | 4027.0 | 3845.1 |
| 160 | 5037.0 | 5689.3 | 6676.0 | 5044.0 |
Table 3. ASR Streaming Latency vs Concurrency
| Model | Open-Source | Model Size | test-zh CER (%) ↓ | test-zh Sim (%) ↑ | test-en WER (%) ↓ | test-en Sim (%) ↑ |
|---|---|---|---|---|---|---|
| Multi-Stage or NAR Methods | | | | | | |
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 |
| Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 |
| MiniMax-Speech | ❌ | - | 0.83 | 78.3 | 1.65 | 69.2 |
| F5-TTS | ✅ | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 |
| CosyVoice2 | ✅ | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 |
| FireRedTTS2 | ✅ | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 |
| Index-TTS2 | ✅ | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 |
| VibeVoice-1.5B | ✅ | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 |
| VibeVoice-Realtime | ✅ | 0.5B | - | - | 2.05 | 63.3 |
| HiggsAudio-v2 | ✅ | 3B | 1.50 | 74.0 | 2.44 | 67.7 |
| VoxCPM | ✅ | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 |
| GLM-TTS | ✅ | 1.5B | 1.03 | 76.1 | - | - |
| GLM-TTS RL | ✅ | 1.5B | 0.89 | 76.4 | - | - |
| Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 |
| Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 |
| One-Stage AR Methods | | | | | | |
| Spark TTS | ✅ | 0.5B | 1.20 | 66.0 | 1.98 | 57.3 |
| GPA-0.3B-preview | ✅ | 0.3B | 0.95 | 65.9 | 1.51 | 56.5 |
Note: ASR results on Librispeech and Aishell-1. WER (%) is reported for Librispeech, and CER (%) is reported for Aishell-1.
| Model | Model Size | Librispeech test-clean | Aishell-1 |
|---|---|---|---|
| Models with < 0.5B parameters | |||
| Whisper-S | 0.24B | 3.13 | - |
| GPA-0.3B-preview | 0.3B | 8.88 | 4.50 |
| Models with > 0.5B parameters | |||
| Fun-ASR-nano | 0.8B | 1.76 | 1.80 |
| FireRed-ASR | 1.1B | 1.84 | 0.54 |
| GLM-ASR-nano | 1.5B | 2.00 | 1.81 |
| GLM-ASR-nano* | 1.5B | 2.17 | 2.17 |
| Whisper-L | 1.55B | 1.82 | 4.72 |
| Kimi-Audio | - | 1.32 | 0.71 |
| Step-Audio2 | - | 1.17 | 0.63 |
| Seed-ASR | - | 1.58 | 0.68 |
| Seed-ASR* | - | 2.80 | 1.63 |
| Fun-ASR | 7.7B | 1.51 | 1.22 |
We borrowed a lot of code from the following excellent projects:
If you find GPA useful for your research or projects, please cite us:
```bibtex
@misc{cai2026unifyingspeechrecognitionsynthesis,
      title={Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers},
      author={Runyuan Cai and Yu Lin and Yiming Wang and Chunlin Fu and Xiaodong Zeng},
      year={2026},
      eprint={2601.10770},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2601.10770},
}
```