text2audio

A high-performance Rust library for converting text to audio files using Zhipu AI's GLM models, featuring intelligent text segmentation, parallel processing, and advanced audio merging capabilities.

Features

🤖 AI-Powered Text Segmentation - Intelligent semantic text splitting using GLM models for natural-sounding audio
🎵 Multiple Voice Options - Support for 7 distinct voices with customizable speed and volume
⚡ Parallel Processing - Concurrent audio generation for improved performance on long texts
🔄 Automatic Retry - Built-in exponential backoff retry mechanism for robust API calls
🛠️ Flexible Configuration - Builder pattern API for intuitive customization
📦 Self-Contained Audio Processing - Built-in WAV merging in pure Rust, with no external tools required
🎯 Smart Modes - Automatic direct conversion for short texts, segmented processing for long texts

Supported AI Models

Text Segmentation Models

Used for intelligent text splitting and semantic analysis:

GLM-4.7 - Latest flagship model with superior semantic understanding
GLM-4.6 - Advanced reasoning model for complex text analysis
GLM-4.5 - High-performance general-purpose model
GLM-4.5-Flash - Optimized for speed (default)
GLM-4.5-Air - Lightweight and cost-effective model

Text-to-Speech Model

GLM-TTS - Zhipu AI's dedicated text-to-speech model for high-quality audio generation

Prerequisites

Rust 1.70 or later
Zhipu AI API Key - Get one from Zhipu AI Platform
Network Connection - Required for API calls

Environment Setup

export ZHIPU_API_KEY="your_api_key_here"

Quick Start

Add to your Cargo.toml:

[dependencies]
text2audio = "0.1.0"
tokio = { version = "1", features = ["full"] }

Basic usage:

use text2audio::Text2Audio;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let api_key = std::env::var("ZHIPU_API_KEY")?;
    let converter = Text2Audio::new(api_key);
    
    converter.convert("你好，世界！", "output.wav").await?;  // "Hello, world!"
    println!("Audio saved to output.wav");
    
    Ok(())
}

Usage Examples

1. Basic Text to Audio

use text2audio::Text2Audio;

let converter = Text2Audio::new(&api_key);
converter.convert("Hello, world!", "hello.wav").await?;

2. Custom Voice and Speed

use text2audio::{Text2Audio, Voice};

let converter = Text2Audio::new(&api_key)
    .with_voice(Voice::Xiaochen)
    .with_speed(1.5)  // 50% faster
    .with_volume(2.0);  // Louder

converter.convert("加速版语音", "fast.wav").await?;  // "Sped-up speech"

3. Long Text with AI Segmentation

use text2audio::{Text2Audio, Model};

let long_text = "非常长的文本...";  // "A very long text..."
let converter = Text2Audio::new(&api_key)
    .with_model(Model::GLM4_7)  // Use best model for segmentation
    .with_max_segment_length(300)  // Shorter segments for better flow
    .with_thinking(true);  // Enable thinking mode

converter.convert(long_text, "long_audio.wav").await?;
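To make the effect of a segment length limit concrete, here is a minimal, std-only sketch that splits text at sentence punctuation under a character limit. It is a rule-based stand-in, not the GLM-based semantic segmentation the library performs, and `split_by_sentences` is a hypothetical helper that is not part of the text2audio API.

```rust
/// Splits `text` into segments of at most `max_chars` characters,
/// preferring to break right after sentence-ending punctuation.
/// Hypothetical helper for illustration; not part of text2audio.
fn split_by_sentences(text: &str, max_chars: usize) -> Vec<String> {
    assert!(max_chars > 0, "max_chars must be positive");
    const ENDINGS: [char; 6] = ['。', '！', '？', '.', '!', '?'];

    // Step 1: cut the text into sentences, keeping the punctuation.
    let mut sentences: Vec<Vec<char>> = Vec::new();
    let mut sentence: Vec<char> = Vec::new();
    for ch in text.chars() {
        sentence.push(ch);
        if ENDINGS.contains(&ch) {
            sentences.push(std::mem::take(&mut sentence));
        }
    }
    if !sentence.is_empty() {
        sentences.push(sentence);
    }

    // Step 2: greedily pack sentences into segments of at most max_chars.
    // Lengths are counted in characters, not bytes.
    let mut segments: Vec<String> = Vec::new();
    let mut current: Vec<char> = Vec::new();
    for s in sentences {
        if !current.is_empty() && current.len() + s.len() > max_chars {
            segments.push(current.drain(..).collect());
        }
        // A single sentence longer than the limit is cut at the limit.
        for piece in s.chunks(max_chars) {
            if current.len() + piece.len() > max_chars {
                segments.push(current.drain(..).collect());
            }
            current.extend_from_slice(piece);
        }
    }
    if !current.is_empty() {
        segments.push(current.into_iter().collect());
    }
    segments
}
```

With a limit of 6, the text "你好。世界！再见。" is packed into two segments: the first two sentences together, then the third. A smaller limit yields more, shorter segments, which is the trade-off the segment length setting controls.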

4. Parallel Processing for Performance

use text2audio::{Text2Audio, Voice};
use std::time::Duration;

let converter = Text2Audio::new(&api_key)
    .with_voice(Voice::Tongtong)
    .with_parallel(5)  // Process up to 5 segments concurrently
    .with_retry_config(5, Duration::from_millis(200));

converter.convert(very_long_text, "output.wav").await?;
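As an illustration of what a concurrency limit means, the sketch below processes segments in batches using scoped threads from the standard library and returns results in the original order. This is an assumption-laden stand-in: the library is async and its actual scheduling is not documented here, and `process_in_batches` is a hypothetical name. The `work` closure takes the place of a per-segment TTS request.

```rust
use std::thread;

/// Runs `work` on every segment with at most `limit` running at once,
/// returning results in the original segment order.
/// Hypothetical helper for illustration; not part of text2audio.
fn process_in_batches<F>(segments: &[String], limit: usize, work: F) -> Vec<String>
where
    F: Fn(&str) -> String + Sync,
{
    assert!(limit > 0, "limit must be positive");
    let work = &work; // share one reference across all worker threads
    let mut results = Vec::with_capacity(segments.len());

    for batch in segments.chunks(limit) {
        // Scoped threads may borrow `batch` and `work` safely because
        // the scope joins every thread before it returns.
        let batch_results: Vec<String> = thread::scope(|scope| {
            let handles: Vec<_> = batch
                .iter()
                .map(|seg| scope.spawn(move || work(seg.as_str())))
                .collect();
            // Joining in spawn order keeps results in segment order.
            handles
                .into_iter()
                .map(|h| h.join().expect("worker thread panicked"))
                .collect()
        });
        results.extend(batch_results);
    }
    results
}
```

Batching is the simplest way to bound concurrency, but each batch waits for its slowest segment. A semaphore-style limiter would keep all slots busy, at the cost of a little more code.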

5. Using Builder Pattern

use text2audio::{Text2Audio, Model, Voice};
use std::time::Duration;

let converter = Text2Audio::builder(&api_key)
    .model(Model::GLM4_7)
    .voice(Voice::Tongtong)
    .speed(1.2)
    .volume(1.5)
    .max_segment_length(500)
    .parallel(3)
    .thinking(true)
    .retry_config(3, Duration::from_millis(100))
    .build();

converter.convert("优化的长文本", "narration.wav").await?;  // "Optimized long text"

Configuration Reference

Text2Audio Methods

| Method | Type | Range | Default | Description |
| --- | --- | --- | --- | --- |
| `with_model()` | `Model` | enum | `GLM4_5Flash` | AI model for text segmentation |
| `with_voice()` | `Voice` | enum | `Tongtong` | Voice selection for TTS |
| `with_speed()` | `f32` | 0.5 - 2.0 | `1.0` | Speech speed multiplier |
| `with_volume()` | `f32` | 0.0 - 10.0 | `1.0` | Audio volume level |
| `with_max_segment_length()` | `usize` | 100 - 1024 | `500` | Max characters per segment |
| `with_parallel()` | `usize` | 1 - 10 | disabled | Enable concurrent processing |
| `with_thinking()` | `bool` | true/false | `false` | Enable AI thinking mode |
| `with_coding_plan()` | `bool` | true/false | `false` | Use coding plan endpoint |
| `with_retry_config()` | `(u32, Duration)` | custom | `(3, 100ms)` | Retry attempts and delay |
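The documented ranges can be checked before a converter is built. The sketch below is one way to do that in application code; `Settings`, `documented_defaults`, and `check_settings` are hypothetical names, and how the library itself treats out-of-range values (rejecting, clamping, or passing them through) is not stated in this document.

```rust
/// Caller-side copy of the numeric settings from the reference table.
/// Hypothetical type for illustration; not part of text2audio.
#[derive(Debug, Clone, PartialEq)]
struct Settings {
    speed: f32,
    volume: f32,
    max_segment_length: usize,
    parallel: Option<usize>, // None = parallel processing disabled
}

/// Defaults as listed in the reference table.
fn documented_defaults() -> Settings {
    Settings {
        speed: 1.0,
        volume: 1.0,
        max_segment_length: 500,
        parallel: None,
    }
}

/// Returns an error naming the first setting outside its documented range.
fn check_settings(s: &Settings) -> Result<(), String> {
    if !(0.5..=2.0).contains(&s.speed) {
        return Err(format!("speed {} is outside 0.5 - 2.0", s.speed));
    }
    if !(0.0..=10.0).contains(&s.volume) {
        return Err(format!("volume {} is outside 0.0 - 10.0", s.volume));
    }
    if !(100..=1024).contains(&s.max_segment_length) {
        return Err(format!(
            "max_segment_length {} is outside 100 - 1024",
            s.max_segment_length
        ));
    }
    if let Some(n) = s.parallel {
        if !(1..=10).contains(&n) {
            return Err(format!("parallel {} is outside 1 - 10", n));
        }
    }
    Ok(())
}
```

Checking up front turns a configuration mistake into a local error message instead of a failed API call later in the pipeline.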

Voice Options

All voices are provided by Zhipu AI's TTS service:

Voice::Tongtong (童童) - Default female voice, clear and natural
Voice::Chuichui (锤锤) - Warm and friendly male voice
Voice::Xiaochen (晓辰) - Professional narration voice
Voice::Jam - Youthful and energetic voice
Voice::Kazi - Deep and authoritative voice
Voice::Douji (豆鸡) - Cute and playful voice
Voice::Luodo - Mature and calm voice

AI Models

Choose the appropriate model based on your needs:

GLM-4.7: Best for long, complex texts requiring deep semantic understanding
GLM-4.6: Good balance of quality and speed for most use cases
GLM-4.5: Reliable general-purpose model
GLM-4.5-Flash: Fastest processing, ideal for simple texts
GLM-4.5-Air: Most cost-effective for high-volume processing

Error Handling

The library provides detailed error types for robust error handling:

use text2audio::{Text2Audio, Error};

match converter.convert(text, "output.wav").await {
    Ok(_) => println!("✓ Conversion successful"),
    Err(Error::EmptyInput) => eprintln!("✗ Error: Input text is empty"),
    Err(Error::TtsApi(msg)) => eprintln!("✗ TTS API Error: {}", msg),
    Err(Error::AiApi(msg)) => eprintln!("✗ AI API Error: {}", msg),
    Err(Error::Audio(msg)) => eprintln!("✗ Audio Processing Error: {}", msg),
    Err(Error::Io(e)) => eprintln!("✗ File I/O Error: {}", e),
    Err(e) => eprintln!("✗ Unexpected Error: {}", e),
}

Architecture

text2audio/
├── src/
│   ├── lib.rs           # Main API and Text2Audio struct
│   ├── client.rs        # Zhipu AI API client
│   ├── ai_splitter.rs   # AI-powered text segmentation
│   ├── audio_merger.rs  # WAV audio file merging
│   ├── config.rs        # Voice and configuration types
│   └── error.rs         # Error types and Result alias
├── examples/            # Usage examples
├── assets/              # Sample text files
└── target/              # Build output
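To give a feel for what WAV merging involves, here is a std-only sketch that concatenates PCM WAV buffers and rewrites the size fields in the header. It is not the code in `audio_merger.rs` (the project lists hound for WAV handling). It assumes every input has the canonical 44-byte header with no extra chunks and an identical audio format. `merge_wavs` and `make_wav` are hypothetical helpers, and the 16-bit mono layout in `make_wav` is chosen only to produce test data.

```rust
/// Size of the canonical PCM WAV header (RIFF, fmt and data chunk headers).
const HEADER_LEN: usize = 44;

/// Concatenates WAV buffers that share one PCM format.
/// Hypothetical helper; assumes canonical 44-byte headers.
fn merge_wavs(files: &[Vec<u8>]) -> Result<Vec<u8>, String> {
    let first = files.first().ok_or("no input files")?;
    let mut merged: Vec<u8> = Vec::new();
    for (i, file) in files.iter().enumerate() {
        let canonical = file.len() >= HEADER_LEN
            && &file[0..4] == b"RIFF"
            && &file[8..12] == b"WAVE"
            && &file[36..40] == b"data";
        if !canonical {
            return Err(format!("file {}: not a canonical 44-byte-header WAV", i));
        }
        // Bytes 12..36 hold the fmt chunk: channels, sample rate, bit depth.
        if file[12..36] != first[12..36] {
            return Err(format!("file {}: audio format differs from file 0", i));
        }
        let declared = u32::from_le_bytes([file[40], file[41], file[42], file[43]]) as usize;
        let end = HEADER_LEN + declared;
        if end > file.len() {
            return Err(format!("file {}: data chunk is truncated", i));
        }
        if i == 0 {
            merged.extend_from_slice(&file[..HEADER_LEN]); // reuse the first header
        }
        merged.extend_from_slice(&file[HEADER_LEN..end]);
    }
    let total = merged.len() - HEADER_LEN;
    if total > (u32::MAX as usize) - 36 {
        return Err("merged audio exceeds the WAV size limit".to_string());
    }
    // Patch the RIFF chunk size and the data chunk size.
    merged[4..8].copy_from_slice(&((total as u32) + 36).to_le_bytes());
    merged[40..44].copy_from_slice(&(total as u32).to_le_bytes());
    Ok(merged)
}

/// Builds a 16-bit mono PCM WAV buffer, used here only as sample input.
fn make_wav(samples: &[i16], sample_rate: u32) -> Vec<u8> {
    let data_len = (samples.len() * 2) as u32;
    let mut w = Vec::with_capacity(HEADER_LEN + samples.len() * 2);
    w.extend_from_slice(b"RIFF");
    w.extend_from_slice(&(36 + data_len).to_le_bytes());
    w.extend_from_slice(b"WAVEfmt ");
    w.extend_from_slice(&16u32.to_le_bytes()); // fmt chunk size
    w.extend_from_slice(&1u16.to_le_bytes()); // format tag: PCM
    w.extend_from_slice(&1u16.to_le_bytes()); // channels: mono
    w.extend_from_slice(&sample_rate.to_le_bytes());
    w.extend_from_slice(&(sample_rate * 2).to_le_bytes()); // byte rate
    w.extend_from_slice(&2u16.to_le_bytes()); // block align
    w.extend_from_slice(&16u16.to_le_bytes()); // bits per sample
    w.extend_from_slice(b"data");
    w.extend_from_slice(&data_len.to_le_bytes());
    for s in samples {
        w.extend_from_slice(&s.to_le_bytes());
    }
    w
}
```

Real-world WAV files may carry additional chunks (for example `LIST`) between `fmt ` and `data`, which this sketch rejects; a WAV library handles those cases.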

Workflow

1. Input Validation: Check if text is empty
2. Length Detection:
   - Short text (≤ max_segment_length): Direct TTS conversion
   - Long text (> max_segment_length): AI-powered segmentation
3. Text Segmentation: AI model splits text at semantic boundaries
4. Audio Generation:
   - Sequential: One segment at a time
   - Parallel: Multiple segments concurrently (if enabled)
5. Audio Merging: Combine all audio segments into final WAV file
6. Retry Handling: Automatic retry with exponential backoff on failures
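The first two workflow steps can be sketched as a small decision function. `Plan` and `plan_conversion` are hypothetical names, and two details are assumptions rather than documented behavior: that length is measured in characters (the configuration reference says "max characters per segment") and that whitespace-only input counts as empty.

```rust
/// How a piece of text would be handled.
/// Hypothetical type for illustration; not part of text2audio.
#[derive(Debug, PartialEq)]
enum Plan {
    /// Text fits within the limit: one direct TTS conversion.
    Direct,
    /// Text exceeds the limit: segment first, then synthesize and merge.
    Segmented,
}

/// Validates the input and picks a processing mode by character count.
fn plan_conversion(text: &str, max_segment_length: usize) -> Result<Plan, String> {
    // Assumption: whitespace-only input is treated as empty.
    if text.trim().is_empty() {
        return Err("input text is empty".to_string());
    }
    // Count characters, not bytes: "你好" is 2 characters but 6 bytes in UTF-8.
    let chars = text.chars().count();
    if chars <= max_segment_length {
        Ok(Plan::Direct)
    } else {
        Ok(Plan::Segmented)
    }
}
```

Counting with `chars().count()` rather than `len()` matters for Chinese input, where a byte-based limit would trigger segmentation roughly three times too early.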

Running Examples

The project includes comprehensive examples demonstrating various features:

Basic Example

cargo run --example simple

Converts a short Chinese text to audio using default settings.

AI Segmentation Example

cargo run --example ai_splitter

Demonstrates AI-powered semantic segmentation for long texts.

Custom Voice Example

cargo run --example custom_voice

Shows voice customization and parameter tuning.

Parallel Processing Example

cargo run --example parallel

Illustrates concurrent audio generation for performance.

File Input Example

cargo run --example from_file

Converts text from a file with optimized settings for long-form content.

Direct AI Splitter Usage

cargo run --example ai_splitter

Demonstrates direct usage of the AiSplitter component.

Performance Tips

Choose the Right Model: Use GLM-4.5-Flash for simple texts, GLM-4.7 for complex content
Enable Parallel Processing: For long texts, call with_parallel() with a value between 3 and 5 so that several segments are synthesized concurrently, which shortens total conversion time
Optimize Segment Length:
- 300-500 chars for narrative content
- 800-1024 chars for technical content
Adjust Retry Config: Increase retries and delays for unstable networks
Use Thinking Mode: Enable for texts requiring deep semantic understanding
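When tuning the retry settings, it helps to see how quickly exponential backoff adds up. The sketch below assumes the delay doubles after each failed attempt, starting from the configured base delay; the actual growth factor used by the library is not stated here, and `backoff_delays` and `total_wait` are hypothetical helpers.

```rust
use std::time::Duration;

/// Delay before each retry: base, 2 * base, 4 * base, ...
/// Assumes a doubling factor; hypothetical helper, not part of text2audio.
fn backoff_delays(max_retries: u32, base: Duration) -> Vec<Duration> {
    (0..max_retries)
        .map(|attempt| base.saturating_mul(2u32.saturating_pow(attempt)))
        .collect()
}

/// Worst-case time spent waiting if every retry is used.
fn total_wait(max_retries: u32, base: Duration) -> Duration {
    backoff_delays(max_retries, base)
        .iter()
        .fold(Duration::ZERO, |acc, d| acc.saturating_add(*d))
}
```

Under this assumption, 3 retries with a 100 ms base wait 100 ms, 200 ms, and 400 ms (700 ms in total), while 5 retries with a 200 ms base wait up to 6.2 seconds per failing request. In parallel mode that worst case applies to each segment independently.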

Requirements

Minimum Rust Version: 1.70.0
Dependencies: tokio (async runtime), zai-rs (Zhipu AI client), hound (WAV handling)
Network: Stable internet connection for API calls
API Key: Valid Zhipu AI API key with TTS service enabled

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Areas for improvement:

Additional audio format support (MP3, OGG)
Custom voice training integration
Local model inference support
Batch processing utilities
Audio post-processing effects

Please feel free to submit issues, feature requests, or pull requests.

Acknowledgments

Zhipu AI - For providing the GLM models and TTS API
zai-rs - Rust client for Zhipu AI API
hound - WAV audio format handling
