A high-performance Rust library for converting text to audio files using Zhipu AI's GLM models, featuring intelligent text segmentation, parallel processing, and advanced audio merging capabilities.
- 🤖 AI-Powered Text Segmentation - Intelligent semantic text splitting using GLM models for natural-sounding audio
- 🎵 Multiple Voice Options - Support for 7 distinct voices with customizable speed and volume
- ⚡ Parallel Processing - Concurrent audio generation for improved performance on long texts
- 🔄 Automatic Retry - Built-in exponential backoff retry mechanism for robust API calls
- 🛠️ Flexible Configuration - Builder pattern API for intuitive customization
- 📦 Zero Dependencies Audio Processing - Built-in WAV audio merging without external tools
- 🎯 Smart Modes - Automatic direct conversion for short texts, segmented processing for long texts
Used for intelligent text splitting and semantic analysis:
- GLM-4.7 - Latest flagship model with superior semantic understanding
- GLM-4.6 - Advanced reasoning model for complex text analysis
- GLM-4.5 - High-performance general-purpose model
- GLM-4.5-Flash - Optimized for speed (default)
- GLM-4.5-Air - Lightweight and cost-effective model
- GLM-TTS - Zhipu AI's dedicated text-to-speech model for high-quality audio generation
- Rust 1.70 or later
- Zhipu AI API Key - Get one from Zhipu AI Platform
- Network Connection - Required for API calls
```bash
export ZHIPU_API_KEY="your_api_key_here"
```

Add to your Cargo.toml:
```toml
[dependencies]
text2audio = "0.1.0"
tokio = { version = "1", features = ["full"] }
```

Basic usage:
```rust
use text2audio::Text2Audio;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let api_key = std::env::var("ZHIPU_API_KEY")?;
    let converter = Text2Audio::new(api_key);

    // "你好,世界!" ("Hello, world!")
    converter.convert("你好,世界!", "output.wav").await?;
    println!("Audio saved to output.wav");
    Ok(())
}
```
```rust
use text2audio::Text2Audio;

let converter = Text2Audio::new(&api_key);
converter.convert("Hello, world!", "hello.wav").await?;
```
```rust
use text2audio::{Text2Audio, Voice};

let converter = Text2Audio::new(&api_key)
    .with_voice(Voice::Xiaochen)
    .with_speed(1.5)   // 50% faster
    .with_volume(2.0); // Louder

// "加速版语音" ("sped-up speech")
converter.convert("加速版语音", "fast.wav").await?;
```
```rust
use text2audio::{Text2Audio, Model};

let long_text = "非常长的文本..."; // "a very long text..."
let converter = Text2Audio::new(&api_key)
    .with_model(Model::GLM4_7)    // Use the strongest model for segmentation
    .with_max_segment_length(300) // Shorter segments for better flow
    .with_thinking(true);         // Enable thinking mode

converter.convert(long_text, "long_audio.wav").await?;
```
```rust
use std::time::Duration;
use text2audio::{Text2Audio, Voice};

let converter = Text2Audio::new(&api_key)
    .with_voice(Voice::Tongtong)
    .with_parallel(5) // Process up to 5 segments concurrently
    .with_retry_config(5, Duration::from_millis(200));

converter.convert(very_long_text, "output.wav").await?;
```
```rust
use std::time::Duration;
use text2audio::{Text2Audio, Model, Voice};

let converter = Text2Audio::builder(&api_key)
    .model(Model::GLM4_7)
    .voice(Voice::Tongtong)
    .speed(1.2)
    .volume(1.5)
    .max_segment_length(500)
    .parallel(3)
    .thinking(true)
    .retry_config(3, Duration::from_millis(100))
    .build();

// "优化的长文本" ("optimized long text")
converter.convert("优化的长文本", "narration.wav").await?;
```

| Method | Type | Range | Default | Description |
|---|---|---|---|---|
| `with_model()` | `Model` | enum | `GLM4_5Flash` | AI model for text segmentation |
| `with_voice()` | `Voice` | enum | `Tongtong` | Voice selection for TTS |
| `with_speed()` | `f32` | 0.5 - 2.0 | 1.0 | Speech speed multiplier |
| `with_volume()` | `f32` | 0.0 - 10.0 | 1.0 | Audio volume level |
| `with_max_segment_length()` | `usize` | 100 - 1024 | 500 | Max characters per segment |
| `with_parallel()` | `usize` | 1 - 10 | disabled | Enable concurrent processing |
| `with_thinking()` | `bool` | true/false | false | Enable AI thinking mode |
| `with_coding_plan()` | `bool` | true/false | false | Use coding plan endpoint |
| `with_retry_config()` | `(u32, Duration)` | custom | (3, 100ms) | Retry attempts and delay |
All voices are provided by Zhipu AI's TTS service:
- `Voice::Tongtong` (童童) - Default female voice, clear and natural
- `Voice::Chuichui` (锤锤) - Warm and friendly male voice
- `Voice::Xiaochen` (晓辰) - Professional narration voice
- `Voice::Jam` - Youthful and energetic voice
- `Voice::Kazi` - Deep and authoritative voice
- `Voice::Douji` (豆鸡) - Cute and playful voice
- `Voice::Luodo` - Mature and calm voice
Choose the appropriate model based on your needs:
- GLM-4.7: Best for long, complex texts requiring deep semantic understanding
- GLM-4.6: Good balance of quality and speed for most use cases
- GLM-4.5: Reliable general-purpose model
- GLM-4.5-Flash: Fastest processing, ideal for simple texts
- GLM-4.5-Air: Most cost-effective for high-volume processing
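One way to act on this guidance is a small heuristic that picks a model from rough text size. The enum below is a local stand-in that only mirrors the variant names above, and the character-count thresholds are arbitrary assumptions, not library behavior:

```rust
/// Local stand-in mirroring a few of the library's Model variants
/// (illustrative only; use text2audio's own Model enum in real code).
#[derive(Debug, PartialEq)]
enum Model {
    GLM4_7,
    GLM4_5,
    GLM4_5Flash,
}

/// Assumed heuristic: short texts get the fast model, mid-length texts the
/// general-purpose model, and very long texts the strongest model.
fn pick_model(char_count: usize) -> Model {
    match char_count {
        0..=500 => Model::GLM4_5Flash,
        501..=5000 => Model::GLM4_5,
        _ => Model::GLM4_7,
    }
}

fn main() {
    // A 300-character text is "simple" under these assumed thresholds.
    println!("{:?}", pick_model(300));
}
```

Counting `chars()` rather than bytes matters here, since Chinese text is multi-byte in UTF-8.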
The library provides detailed error types for robust error handling:
```rust
use text2audio::{Text2Audio, Error};

match converter.convert(text, "output.wav").await {
    Ok(_) => println!("✓ Conversion successful"),
    Err(Error::EmptyInput) => eprintln!("✗ Error: Input text is empty"),
    Err(Error::TtsApi(msg)) => eprintln!("✗ TTS API Error: {}", msg),
    Err(Error::AiApi(msg)) => eprintln!("✗ AI API Error: {}", msg),
    Err(Error::Audio(msg)) => eprintln!("✗ Audio Processing Error: {}", msg),
    Err(Error::Io(e)) => eprintln!("✗ File I/O Error: {}", e),
    Err(e) => eprintln!("✗ Unexpected Error: {}", e),
}
```

```text
text2audio/
├── src/
│   ├── lib.rs           # Main API and Text2Audio struct
│   ├── client.rs        # Zhipu AI API client
│   ├── ai_splitter.rs   # AI-powered text segmentation
│   ├── audio_merger.rs  # WAV audio file merging
│   ├── config.rs        # Voice and configuration types
│   └── error.rs         # Error types and Result alias
├── examples/            # Usage examples
├── assets/              # Sample text files
└── target/              # Build output
```
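The zero-dependency WAV merging in audio_merger.rs is possible because concatenating PCM WAV files is mostly byte bookkeeping: append the second file's data chunk and patch two size fields. The sketch below is a simplified illustration, not the library's actual code; it assumes canonical 44-byte headers and identical audio formats, whereas a robust merger must walk RIFF chunks properly:

```rust
/// Merge two canonical PCM WAV files by concatenating their data chunks
/// and patching the RIFF and data size fields. Simplified sketch: assumes
/// plain 44-byte headers and identical formats in both inputs.
fn merge_wav(a: &[u8], b: &[u8]) -> Result<Vec<u8>, String> {
    const HEADER: usize = 44;
    if a.len() < HEADER || b.len() < HEADER {
        return Err("too short for a canonical WAV header".into());
    }
    // Bytes 20..36 hold format tag, channels, sample rate, byte rate,
    // block align, and bits per sample; they must match to merge directly.
    if a[20..36] != b[20..36] {
        return Err("WAV formats differ".into());
    }
    let mut out = a.to_vec();
    out.extend_from_slice(&b[HEADER..]); // append the second data chunk
    let data_len = (out.len() - HEADER) as u32;
    out[40..44].copy_from_slice(&data_len.to_le_bytes()); // data chunk size
    let riff_len = (out.len() - 8) as u32;
    out[4..8].copy_from_slice(&riff_len.to_le_bytes()); // RIFF chunk size
    Ok(out)
}

/// Demo helper: build a minimal mono 16-bit 8 kHz WAV around raw sample bytes.
fn make_wav(samples: &[u8]) -> Vec<u8> {
    let mut v = Vec::new();
    v.extend_from_slice(b"RIFF");
    v.extend_from_slice(&((36 + samples.len()) as u32).to_le_bytes());
    v.extend_from_slice(b"WAVEfmt ");
    v.extend_from_slice(&16u32.to_le_bytes()); // fmt chunk size
    v.extend_from_slice(&[1, 0, 1, 0]); // PCM, 1 channel
    v.extend_from_slice(&8000u32.to_le_bytes()); // sample rate
    v.extend_from_slice(&16000u32.to_le_bytes()); // byte rate
    v.extend_from_slice(&[2, 0, 16, 0]); // block align, bits per sample
    v.extend_from_slice(b"data");
    v.extend_from_slice(&(samples.len() as u32).to_le_bytes());
    v.extend_from_slice(samples);
    v
}

fn main() {
    let merged = merge_wav(&make_wav(&[1, 2, 3, 4]), &make_wav(&[5, 6])).unwrap();
    println!("merged: {} bytes total, {} data bytes", merged.len(), merged.len() - 44);
}
```

In practice a crate like hound (listed in the dependencies below) handles header parsing and sample iteration, which avoids the fixed-offset assumptions made here.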
- Input Validation: Check if text is empty
- Length Detection:
  - Short text (≤ max_segment_length): Direct TTS conversion
  - Long text (> max_segment_length): AI-powered segmentation
- Text Segmentation: AI model splits text at semantic boundaries
- Audio Generation:
  - Sequential: One segment at a time
  - Parallel: Multiple segments concurrently (if enabled)
- Audio Merging: Combine all audio segments into the final WAV file
- Retry Handling: Automatic retry with exponential backoff on failures
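The length-detection and segmentation steps above can be sketched with a naive punctuation-based splitter. This is purely illustrative: the library delegates splitting to a GLM model so that boundaries are semantic, and only the max_segment_length behavior is mirrored here.

```rust
/// Naive stand-in for the segmentation step: split at sentence-ending
/// punctuation, packing sentences greedily up to `max_len` characters.
/// Illustrative only; the library asks a GLM model for semantic splits.
fn split_text(text: &str, max_len: usize) -> Vec<String> {
    if text.chars().count() <= max_len {
        return vec![text.to_string()]; // short text: direct conversion path
    }
    let mut segments = Vec::new();
    let mut current = String::new();
    let mut sentence = String::new();
    for ch in text.chars() {
        sentence.push(ch);
        // Sentence boundaries: ASCII and fullwidth (Chinese) terminators.
        if matches!(ch, '。' | '!' | '?' | '.' | '!' | '?') {
            // Start a new segment if adding this sentence would overflow.
            // (A single oversized sentence still becomes its own segment.)
            if !current.is_empty()
                && current.chars().count() + sentence.chars().count() > max_len
            {
                segments.push(std::mem::take(&mut current));
            }
            current.push_str(&sentence);
            sentence.clear();
        }
    }
    current.push_str(&sentence); // trailing text without terminal punctuation
    if !current.is_empty() {
        segments.push(current);
    }
    segments
}

fn main() {
    let text = "First sentence. Second sentence! Third one? Fourth.";
    for (i, seg) in split_text(text, 20).iter().enumerate() {
        println!("segment {}: {:?}", i, seg);
    }
}
```

Character counts (not byte lengths) drive the packing, matching how segment length is naturally measured for Chinese text.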
The project includes comprehensive examples demonstrating various features:
- `cargo run --example simple` - Converts a short Chinese text to audio using default settings.
- `cargo run --example ai_splitter` - Demonstrates AI-powered semantic segmentation for long texts.
- `cargo run --example custom_voice` - Shows voice customization and parameter tuning.
- `cargo run --example parallel` - Illustrates concurrent audio generation for performance.
- `cargo run --example from_file` - Converts text from a file with optimized settings for long-form content.
- `cargo run --example ai_splitter` - Demonstrates direct usage of the AiSplitter component.
- Choose the Right Model: Use GLM-4.5-Flash for simple texts, GLM-4.7 for complex content
- Enable Parallel Processing: Set `with_parallel` to 3-5 for long texts to significantly reduce total time
- Optimize Segment Length:
  - 300-500 chars for narrative content
  - 800-1024 chars for technical content
- Adjust Retry Config: Increase retries and delays for unstable networks
- Use Thinking Mode: Enable for texts requiring deep semantic understanding
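For context on the retry tip: exponential backoff doubles the wait between attempts, so the `(attempts, base_delay)` pair passed to `with_retry_config` implies a delay schedule like the sketch below. The exact schedule inside the library is an assumption here; only the parameter shape comes from the API above.

```rust
use std::time::Duration;

/// Delay before retry `attempt` (0-based) under exponential backoff:
/// base, 2*base, 4*base, ... Mirrors the (attempts, base_delay) shape of
/// `with_retry_config`; the library's internal policy may differ.
fn backoff_delay(base: Duration, attempt: u32) -> Duration {
    base * 2u32.saturating_pow(attempt)
}

fn main() {
    // With the default (3, 100ms) config, the waits would be 100/200/400 ms.
    let base = Duration::from_millis(100);
    for attempt in 0..3 {
        println!("retry {} after {:?}", attempt, backoff_delay(base, attempt));
    }
}
```

On a flaky network, raising both the attempt count and the base delay spreads retries out further, at the cost of a longer worst-case conversion time.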
- Minimum Rust Version: 1.70.0
- Dependencies: tokio (async runtime), zai-rs (Zhipu AI client), hound (WAV handling)
- Network: Stable internet connection for API calls
- API Key: Valid Zhipu AI API key with TTS service enabled
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Areas for improvement:
- Additional audio format support (MP3, OGG)
- Custom voice training integration
- Local model inference support
- Batch processing utilities
- Audio post-processing effects
Please feel free to submit issues, feature requests, or pull requests.