AnlangA/text2audio

text2audio

A high-performance Rust library for converting text to audio files using Zhipu AI's GLM models, featuring intelligent text segmentation, parallel processing, and advanced audio merging capabilities.

Features

  • 🤖 AI-Powered Text Segmentation - Intelligent semantic text splitting using GLM models for natural-sounding audio
  • 🎵 Multiple Voice Options - Support for 7 distinct voices with customizable speed and volume
  • Parallel Processing - Concurrent audio generation for improved performance on long texts
  • 🔄 Automatic Retry - Built-in exponential backoff retry mechanism for robust API calls
  • 🛠️ Flexible Configuration - Builder pattern API for intuitive customization
  • 📦 Self-Contained Audio Processing - Built-in WAV audio merging with no external tools required
  • 🎯 Smart Modes - Automatic direct conversion for short texts, segmented processing for long texts

Supported AI Models

Text Segmentation Models

Used for intelligent text splitting and semantic analysis:

  • GLM-4.7 - Latest flagship model with superior semantic understanding
  • GLM-4.6 - Advanced reasoning model for complex text analysis
  • GLM-4.5 - High-performance general-purpose model
  • GLM-4.5-Flash - Optimized for speed (default)
  • GLM-4.5-Air - Lightweight and cost-effective model

Text-to-Speech Model

  • GLM-TTS - Zhipu AI's dedicated text-to-speech model for high-quality audio generation

Prerequisites

  • Rust 1.70 or later
  • Zhipu AI API Key - Get one from Zhipu AI Platform
  • Network Connection - Required for API calls

Environment Setup

export ZHIPU_API_KEY="your_api_key_here"

Quick Start

Add to your Cargo.toml:

[dependencies]
text2audio = "0.1.0"
tokio = { version = "1", features = ["full"] }

Basic usage:

use text2audio::Text2Audio;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let api_key = std::env::var("ZHIPU_API_KEY")?;
    let converter = Text2Audio::new(api_key);
    
    converter.convert("你好,世界!", "output.wav").await?; // text means "Hello, world!"
    println!("Audio saved to output.wav");
    
    Ok(())
}

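Usage Examples

The snippets below are fragments: they assume an async context that returns a Result (as in Quick Start), an api_key variable already in scope, and, where used, placeholder text variables such as very_long_text.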

1. Basic Text to Audio

use text2audio::Text2Audio;

let converter = Text2Audio::new(&api_key);
converter.convert("Hello, world!", "hello.wav").await?;

2. Custom Voice and Speed

use text2audio::{Text2Audio, Voice};

let converter = Text2Audio::new(&api_key)
    .with_voice(Voice::Xiaochen)
    .with_speed(1.5)  // 50% faster
    .with_volume(2.0);  // Louder

converter.convert("加速版语音", "fast.wav").await?; // text means "sped-up voice"

3. Long Text with AI Segmentation

use text2audio::{Text2Audio, Model};

let long_text = "非常长的文本..."; // text means "a very long text..."
let converter = Text2Audio::new(&api_key)
    .with_model(Model::GLM4_7)  // Use best model for segmentation
    .with_max_segment_length(300)  // Shorter segments for better flow
    .with_thinking(true);  // Enable thinking mode

converter.convert(long_text, "long_audio.wav").await?;

4. Parallel Processing for Performance

use std::time::Duration;
use text2audio::{Text2Audio, Voice};

let converter = Text2Audio::new(&api_key)
    .with_voice(Voice::Tongtong)
    .with_parallel(5)  // Process up to 5 segments concurrently
    .with_retry_config(5, Duration::from_millis(200));

converter.convert(very_long_text, "output.wav").await?;
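
How a concurrency limit such as with_parallel(5) can bound the work while keeping segments in order is sketched below. This is a minimal standard-library illustration, not the crate's implementation: it swaps in scoped threads for the async tasks the library presumably uses, and synthesize and synthesize_all are hypothetical names.

```rust
use std::thread;

/// Hypothetical stand-in for one TTS request: returns fake "audio" bytes.
/// The real library would call the GLM-TTS API here.
fn synthesize(segment: &str) -> Vec<u8> {
    segment.as_bytes().to_vec()
}

/// Processes segments in batches of at most `max_parallel` threads and
/// returns the results in the original segment order.
fn synthesize_all(segments: &[&str], max_parallel: usize) -> Vec<Vec<u8>> {
    let batch_size = max_parallel.max(1);
    let mut results = Vec::with_capacity(segments.len());
    for batch in segments.chunks(batch_size) {
        // Scoped threads may borrow `batch`; all are joined before the scope ends.
        let batch_results: Vec<Vec<u8>> = thread::scope(|scope| {
            let handles: Vec<_> = batch
                .iter()
                .map(|segment| scope.spawn(move || synthesize(segment)))
                .collect();
            // Joining in spawn order keeps the output aligned with the input.
            handles
                .into_iter()
                .map(|handle| handle.join().expect("worker thread panicked"))
                .collect()
        });
        results.extend(batch_results);
    }
    results
}

fn main() {
    let segments = ["first", "second", "third", "fourth", "fifth", "sixth"];
    let audio = synthesize_all(&segments, 5);
    println!("generated {} clips", audio.len());
}
```

Batching is the simplest way to cap concurrency; a semaphore-style limit would keep all workers busy instead of waiting for the slowest request in each batch.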

5. Using Builder Pattern

use text2audio::{Text2Audio, Model, Voice};
use std::time::Duration;

let converter = Text2Audio::builder(&api_key)
    .model(Model::GLM4_7)
    .voice(Voice::Tongtong)
    .speed(1.2)
    .volume(1.5)
    .max_segment_length(500)
    .parallel(3)
    .thinking(true)
    .retry_config(3, Duration::from_millis(100))
    .build();

converter.convert("优化的长文本", "narration.wav").await?; // text means "optimized long text"

Configuration Reference

Text2Audio Methods

| Method | Type | Range | Default | Description |
|---|---|---|---|---|
| with_model() | Model enum | - | GLM4_5Flash | AI model for text segmentation |
| with_voice() | Voice enum | - | Tongtong | Voice selection for TTS |
| with_speed() | f32 | 0.5 - 2.0 | 1.0 | Speech speed multiplier |
| with_volume() | f32 | 0.0 - 10.0 | 1.0 | Audio volume level |
| with_max_segment_length() | usize | 100 - 1024 | 500 | Max characters per segment |
| with_parallel() | usize | 1 - 10 | disabled | Enable concurrent processing |
| with_thinking() | bool | true/false | false | Enable AI thinking mode |
| with_coding_plan() | bool | true/false | false | Use coding plan endpoint |
| with_retry_config() | (u32, Duration) | custom | (3, 100ms) | Retry attempts and delay |
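
The documented ranges can be checked before a converter is built. The sketch below is an assumption-laden illustration: the Settings struct and validate function are hypothetical, and whether the crate itself rejects, clamps, or passes through out-of-range values is not stated here.

```rust
/// Hypothetical settings struct; field names follow the with_* methods in the table.
#[derive(Debug)]
struct Settings {
    speed: f32,
    volume: f32,
    max_segment_length: usize,
    parallel: Option<usize>, // None means parallel processing is disabled
}

impl Default for Settings {
    /// Defaults taken from the table above.
    fn default() -> Self {
        Settings { speed: 1.0, volume: 1.0, max_segment_length: 500, parallel: None }
    }
}

/// Returns an error message for the first value outside its documented range.
fn validate(settings: &Settings) -> Result<(), String> {
    // NaN is rejected too, because it is not contained in any range.
    if !(0.5..=2.0).contains(&settings.speed) {
        return Err(format!("speed {} is outside 0.5 - 2.0", settings.speed));
    }
    if !(0.0..=10.0).contains(&settings.volume) {
        return Err(format!("volume {} is outside 0.0 - 10.0", settings.volume));
    }
    if !(100..=1024).contains(&settings.max_segment_length) {
        return Err(format!(
            "max_segment_length {} is outside 100 - 1024",
            settings.max_segment_length
        ));
    }
    if let Some(workers) = settings.parallel {
        if !(1..=10).contains(&workers) {
            return Err(format!("parallel {} is outside 1 - 10", workers));
        }
    }
    Ok(())
}

fn main() {
    let fast = Settings { speed: 1.5, volume: 2.0, ..Settings::default() };
    println!("{:?} -> {:?}", fast, validate(&fast));
    let too_fast = Settings { speed: 3.0, ..Settings::default() };
    println!("{:?} -> {:?}", too_fast, validate(&too_fast));
}
```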

Voice Options

All voices are provided by Zhipu AI's TTS service:

  • Voice::Tongtong (童童) - Default female voice, clear and natural
  • Voice::Chuichui (锤锤) - Warm and friendly male voice
  • Voice::Xiaochen (晓辰) - Professional narration voice
  • Voice::Jam - Youthful and energetic voice
  • Voice::Kazi - Deep and authoritative voice
  • Voice::Douji (豆鸡) - Cute and playful voice
  • Voice::Luodo - Mature and calm voice

AI Models

Choose the appropriate model based on your needs:

  • GLM-4.7: Best for long, complex texts requiring deep semantic understanding
  • GLM-4.6: Good balance of quality and speed for most use cases
  • GLM-4.5: Reliable general-purpose model
  • GLM-4.5-Flash: Fastest processing, ideal for simple texts
  • GLM-4.5-Air: Most cost-effective for high-volume processing

Error Handling

The library provides detailed error types for robust error handling:

use text2audio::{Text2Audio, Error};

match converter.convert(text, "output.wav").await {
    Ok(_) => println!("✓ Conversion successful"),
    Err(Error::EmptyInput) => eprintln!("✗ Error: Input text is empty"),
    Err(Error::TtsApi(msg)) => eprintln!("✗ TTS API Error: {}", msg),
    Err(Error::AiApi(msg)) => eprintln!("✗ AI API Error: {}", msg),
    Err(Error::Audio(msg)) => eprintln!("✗ Audio Processing Error: {}", msg),
    Err(Error::Io(e)) => eprintln!("✗ File I/O Error: {}", e),
    Err(e) => eprintln!("✗ Unexpected Error: {}", e),
}
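
For readers curious how an error type with these variants can be shaped, here is a hypothetical mirror built only from the names in the match above. SketchError and save are illustrative names; the crate's real definition in error.rs may differ in payload types and messages.

```rust
use std::fmt;
use std::io::{self, Write};

/// Hypothetical mirror of the variants matched above; not the crate's actual definition.
#[allow(dead_code)]
#[derive(Debug)]
enum SketchError {
    EmptyInput,
    TtsApi(String),
    AiApi(String),
    Audio(String),
    Io(io::Error),
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::EmptyInput => write!(f, "input text is empty"),
            SketchError::TtsApi(msg) => write!(f, "TTS API error: {msg}"),
            SketchError::AiApi(msg) => write!(f, "AI API error: {msg}"),
            SketchError::Audio(msg) => write!(f, "audio processing error: {msg}"),
            SketchError::Io(err) => write!(f, "I/O error: {err}"),
        }
    }
}

impl std::error::Error for SketchError {}

/// Lets the `?` operator turn an io::Error into SketchError::Io.
impl From<io::Error> for SketchError {
    fn from(err: io::Error) -> Self {
        SketchError::Io(err)
    }
}

/// Rejects empty input, then writes the bytes to `sink`.
fn save(text: &str, sink: &mut dyn Write) -> Result<(), SketchError> {
    if text.trim().is_empty() {
        return Err(SketchError::EmptyInput);
    }
    sink.write_all(text.as_bytes())?; // io::Error is converted by the From impl
    Ok(())
}

fn main() {
    let mut buffer = Vec::new();
    match save("", &mut buffer) {
        Ok(()) => println!("saved"),
        Err(err) => eprintln!("failed: {err}"),
    }
}
```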

Architecture

text2audio/
├── src/
│   ├── lib.rs           # Main API and Text2Audio struct
│   ├── client.rs        # Zhipu AI API client
│   ├── ai_splitter.rs   # AI-powered text segmentation
│   ├── audio_merger.rs  # WAV audio file merging
│   ├── config.rs        # Voice and configuration types
│   └── error.rs         # Error types and Result alias
├── examples/            # Usage examples
├── assets/              # Sample text files
└── target/              # Build output
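
To illustrate what WAV merging involves, the sketch below concatenates PCM files in memory using only the standard library. It is not the code in audio_merger.rs (the crate lists hound for WAV handling); it assumes every input has the canonical 44-byte header with no extra chunks, and build_wav and merge_wavs are hypothetical names.

```rust
/// Builds a PCM WAV file with the canonical 44-byte header, in memory.
fn build_wav(channels: u16, sample_rate: u32, bits: u16, data: &[u8]) -> Vec<u8> {
    let block_align = channels * bits / 8;
    let byte_rate = sample_rate * u32::from(block_align);
    let mut out = Vec::with_capacity(44 + data.len());
    out.extend_from_slice(b"RIFF");
    out.extend_from_slice(&(36 + data.len() as u32).to_le_bytes());
    out.extend_from_slice(b"WAVE");
    out.extend_from_slice(b"fmt ");
    out.extend_from_slice(&16u32.to_le_bytes()); // size of the fmt chunk
    out.extend_from_slice(&1u16.to_le_bytes()); // audio format 1 = PCM
    out.extend_from_slice(&channels.to_le_bytes());
    out.extend_from_slice(&sample_rate.to_le_bytes());
    out.extend_from_slice(&byte_rate.to_le_bytes());
    out.extend_from_slice(&block_align.to_le_bytes());
    out.extend_from_slice(&bits.to_le_bytes());
    out.extend_from_slice(b"data");
    out.extend_from_slice(&(data.len() as u32).to_le_bytes());
    out.extend_from_slice(data);
    out
}

/// Concatenates the sample data of all inputs and rewrites the two size fields.
/// Fails if an input is not canonical or its format differs from the first file.
fn merge_wavs(files: &[Vec<u8>]) -> Result<Vec<u8>, String> {
    let first = files.first().ok_or("no input files")?;
    if first.len() < 44 {
        return Err("file 0 is shorter than a WAV header".to_string());
    }
    let format = &first[20..36]; // audio format through bits per sample
    let mut data = Vec::new();
    for (i, file) in files.iter().enumerate() {
        let canonical = file.len() >= 44
            && &file[0..4] == b"RIFF"
            && &file[8..12] == b"WAVE"
            && &file[36..40] == b"data";
        if !canonical {
            return Err(format!("file {i} is not a canonical WAV"));
        }
        if &file[20..36] != format {
            return Err(format!("file {i} has a different audio format"));
        }
        data.extend_from_slice(&file[44..]);
    }
    let mut out = first[..40].to_vec();
    out.extend_from_slice(&(data.len() as u32).to_le_bytes());
    out.extend_from_slice(&data);
    let riff_size = (out.len() as u32 - 8).to_le_bytes();
    out[4..8].copy_from_slice(&riff_size);
    Ok(out)
}

fn main() {
    let first = build_wav(1, 24_000, 16, &[0, 1, 2, 3]);
    let second = build_wav(1, 24_000, 16, &[4, 5]);
    match merge_wavs(&[first, second]) {
        Ok(merged) => println!("merged file is {} bytes", merged.len()),
        Err(err) => eprintln!("merge failed: {err}"),
    }
}
```

Files returned by a real API may carry additional chunks, which is one reason to parse them with a WAV library rather than fixed offsets.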

Workflow

  1. Input Validation: Check if text is empty
  2. Length Detection:
    • Short text (≤ max_segment_length): Direct TTS conversion
    • Long text (> max_segment_length): AI-powered segmentation
  3. Text Segmentation: AI model splits text at semantic boundaries
  4. Audio Generation:
    • Sequential: One segment at a time
    • Parallel: Multiple segments concurrently (if enabled)
  5. Audio Merging: Combine all audio segments into final WAV file
  6. Retry Handling: Failed API calls are retried automatically with exponential backoff; this applies throughout the steps above rather than as a final stage
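
Steps 1 to 3 can be sketched as follows. This is a simplified illustration under stated assumptions: length is counted in characters, and a greedy punctuation-based splitter stands in for the AI segmentation that the library delegates to a GLM model. Mode, choose_mode, and split_at_sentences are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
enum Mode {
    Direct,
    Segmented,
}

/// Steps 1 and 2: reject empty input, then pick a mode by character count.
fn choose_mode(text: &str, max_segment_length: usize) -> Option<Mode> {
    if text.trim().is_empty() {
        return None;
    }
    if text.chars().count() <= max_segment_length {
        Some(Mode::Direct)
    } else {
        Some(Mode::Segmented)
    }
}

/// Step 3 stand-in: packs whole sentences greedily into segments of at most
/// `max_len` characters. A single sentence longer than `max_len` is kept whole.
fn split_at_sentences(text: &str, max_len: usize) -> Vec<String> {
    let is_sentence_end = |c: char| matches!(c, '.' | '!' | '?' | '。' | '!' | '?');
    let mut segments = Vec::new();
    let mut current = String::new();
    for sentence in text.split_inclusive(is_sentence_end) {
        let would_overflow =
            current.chars().count() + sentence.chars().count() > max_len;
        if !current.is_empty() && would_overflow {
            segments.push(std::mem::take(&mut current));
        }
        current.push_str(sentence);
    }
    if !current.is_empty() {
        segments.push(current);
    }
    segments
}

fn main() {
    let text = "One. Two. Three.";
    match choose_mode(text, 10) {
        None => eprintln!("input is empty"),
        Some(Mode::Direct) => println!("direct conversion"),
        Some(Mode::Segmented) => {
            for segment in split_at_sentences(text, 10) {
                println!("segment: {segment:?}");
            }
        }
    }
}
```

The splitter never drops characters, so concatenating the segments reproduces the input.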

Running Examples

The project includes comprehensive examples demonstrating various features:

Basic Example

cargo run --example simple

Converts a short Chinese text to audio using default settings.

AI Segmentation Example

cargo run --example ai_splitter

Demonstrates AI-powered semantic segmentation for long texts.

Custom Voice Example

cargo run --example custom_voice

Shows voice customization and parameter tuning.

Parallel Processing Example

cargo run --example parallel

Illustrates concurrent audio generation for performance.

File Input Example

cargo run --example from_file

Converts text from a file with optimized settings for long-form content.

Direct AI Splitter Usage

cargo run --example ai_splitter

Demonstrates direct usage of the AiSplitter component.

Performance Tips

  1. Choose the Right Model: Use GLM-4.5-Flash for simple texts, GLM-4.7 for complex content
  2. Enable Parallel Processing: Call with_parallel(n) with n between 3 and 5 for long texts so several segments are synthesized at once, which shortens total conversion time
  3. Optimize Segment Length:
    • 300-500 chars for narrative content
    • 800-1024 chars for technical content
  4. Adjust Retry Config: Increase retries and delays for unstable networks
  5. Use Thinking Mode: Enable for texts requiring deep semantic understanding
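
When tuning the retry settings from tip 4, it helps to estimate the worst-case time spent waiting. The sketch below assumes the delay doubles after every failed attempt and is capped at a maximum; the README only says "exponential backoff", so the factor, the cap, and the names backoff_delay and total_wait are all assumptions.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at `max_delay`.
fn backoff_delay(base: Duration, attempt: u32, max_delay: Duration) -> Duration {
    // Saturate instead of overflowing for very large attempt numbers.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(max_delay).min(max_delay)
}

/// Total time spent sleeping if every one of `retries` retries is needed.
fn total_wait(base: Duration, retries: u32, max_delay: Duration) -> Duration {
    (0..retries)
        .map(|attempt| backoff_delay(base, attempt, max_delay))
        .sum()
}

fn main() {
    let cap = Duration::from_secs(30);
    // The default (3, 100ms) and the (5, 200ms) setting from the parallel example.
    for (retries, base_ms) in [(3u32, 100u64), (5, 200)] {
        let base = Duration::from_millis(base_ms);
        println!(
            "{retries} retries from {base_ms} ms: up to {:?} of waiting",
            total_wait(base, retries, cap)
        );
    }
}
```

Under these assumptions, raising the retry count lengthens the worst case much faster than raising the base delay, since each extra retry roughly doubles the total.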

Requirements

  • Minimum Rust Version: 1.70.0
  • Dependencies: tokio (async runtime), zai-rs (Zhipu AI client), hound (WAV handling)
  • Network: Stable internet connection for API calls
  • API Key: Valid Zhipu AI API key with TTS service enabled

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Areas for improvement:

  • Additional audio format support (MP3, OGG)
  • Custom voice training integration
  • Local model inference support
  • Batch processing utilities
  • Audio post-processing effects

Please feel free to submit issues, feature requests, or pull requests.

Acknowledgments

  • Zhipu AI - For providing the GLM models and TTS API
  • zai-rs - Rust client for Zhipu AI API
  • hound - WAV audio format handling
