Ladvien/speech_tool

speech_tool

A text-to-speech server that converts text to audio using the Kokoro-TTS models, served over FastAPI.

Quick Start

Install the package:

pip install speech_tool

Create a config.yaml file. See the Configuration section below for its content.

Create a main.py file with the following content:

import os
import yaml
from fastapi import FastAPI
import uvicorn
from speech_tool import SpeechToolServer, NodeConfig

CONFIG_PATH = os.environ.get("NODE_CONFIG_PATH", "config.yaml")
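# Use a context manager so the config file handle is closed after loading
with open(CONFIG_PATH, "r") as config_file:
    config = NodeConfig(**yaml.safe_load(config_file))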

app = FastAPI()

speech_tool = SpeechToolServer(config)
app.include_router(speech_tool.router)

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
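
main.py reads the config path from the NODE_CONFIG_PATH environment variable and falls back to config.yaml. As a minimal sketch, the lookup can be wrapped so a missing file fails with a clear message before the server starts. The helper name resolve_config_path is hypothetical and not part of speech_tool.

```python
import os
from pathlib import Path


def resolve_config_path(env=None, default="config.yaml"):
    """Return the config path from NODE_CONFIG_PATH, or the default.

    Hypothetical helper for illustration; not part of speech_tool.
    """
    env = os.environ if env is None else env
    path = Path(env.get("NODE_CONFIG_PATH", default))
    if not path.is_file():
        # Fail before the server starts so a wrong path is easy to spot.
        raise FileNotFoundError(f"Config file not found: {path}")
    return path
```

In main.py, CONFIG_PATH could then be set to resolve_config_path() instead of reading os.environ directly.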

Create a client file client.py with the following content:

import requests
import io
import sounddevice as sd
import soundfile as sf
from datetime import datetime

HOST = "http://127.0.0.1:8000"  # <--- Change to your server IP
url = f"{HOST}/node/speech"

start = datetime.now()
response = requests.get(
    url,
    params={
        "text": """Anyway, it was the Saturday of the football game with Saxon Hall. 
                   The game with Saxon Hall was supposed to be a very big deal around Pencey. 
        """,
        "voice": "af_bella",
        "speed": 1.1,
        "split_pattern": r"\n+",
    },
    stream=True,
)


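# Stop on an HTTP error instead of trying to decode an error body as audio
response.raise_for_status()

# Read the streamed response into memory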
audio_buffer = io.BytesIO()
for chunk in response.iter_content(chunk_size=4096):
    if chunk:
        audio_buffer.write(chunk)

# Play the audio once the full response has been buffered
audio_buffer.seek(0)  # Reset buffer for reading
data, samplerate = sf.read(audio_buffer)
sd.play(data, samplerate)
sd.wait()  # Wait for audio to finish playing

print(f"Time taken: {datetime.now() - start}")
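
The client passes text, voice, speed, and split_pattern as query parameters to /node/speech. The sketch below shows the URL this produces, using only the standard library. The function name build_speech_url is hypothetical, and the default values are simply the ones used in client.py above.

```python
from urllib.parse import urlencode


def build_speech_url(host, text, voice="af_bella", speed=1.1, split_pattern=r"\n+"):
    """Build the GET URL for the /node/speech endpoint (hypothetical helper)."""
    params = {
        "text": text,
        "voice": voice,
        "speed": speed,
        "split_pattern": split_pattern,
    }
    # urlencode percent-encodes the values, so the regex backslash and plus
    # sign in split_pattern survive the trip through the query string.
    return f"{host.rstrip('/')}/node/speech?{urlencode(params)}"


url = build_speech_url("http://127.0.0.1:8000", "Hello there")
print(url)
```

A URL built this way can be pasted into curl or a browser to test the server without the Python client.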

Start the server in the background:

python main.py &

Then run the client:

python client.py
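
To check a response without playing it, the bytes can be inspected with Python's built-in wave module. This is a sketch under one assumption: that the server returns uncompressed PCM WAV, which is the only kind the wave module reads. The silent clip generated here only stands in for a real server response, and both function names are hypothetical.

```python
import io
import wave


def describe_wav(payload: bytes):
    """Return (sample_rate, channels, duration_seconds) for PCM WAV bytes."""
    with wave.open(io.BytesIO(payload), "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return rate, wav.getnchannels(), frames / rate


def make_silence(seconds=0.5, rate=24000):
    """Create a silent mono 16-bit WAV in memory as a stand-in response."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buffer.getvalue()


rate, channels, duration = describe_wav(make_silence())
print(rate, channels, duration)  # 24000 1 0.5
```

With a real response, describe_wav(audio_buffer.getvalue()) in client.py should report the sample_rate set in config.yaml, provided the PCM assumption holds.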

Configuration

Create a config.yaml file with the following content:

name: "speech_node"

# Available models:
#   kokoro-v1.0.fp16-gpu.onnx
#   kokoro-v1.0.fp16.onnx
#   kokoro-v1.0.int8.onnx
#   kokoro-v1.0.onnx
model_name: kokoro-v1.0.int8.onnx
voices_name: voices-v1.0.bin

response:
  # TODO: type: stream
  sample_rate: 24000
  format: wav
  compression_level: 0

pipeline:
  model:
  device: cpu # cpu or cuda
  use_transformer: true

  # Model configuration
  # 'a' = American English
  # 'b' = British English
  # 'e' = Spanish
  # 'f' = French
  # 'h' = Hindi
  # 'i' = Italian
  # 'p' = Portuguese
  # 'j' = Japanese
  # 'z' = Chinese
  language_code: en-us

  # Request defaults
  speed: 1.0 # Can be set during request
  voice: "af_heart" # Can be set during request
  split_pattern: "\n" # Can be set during request
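
A typo in config.yaml is easier to catch before the model loads. The sketch below checks an already-parsed config dictionary against the values listed in the comments above. It is not how NodeConfig validates its input, the function name check_config is hypothetical, and the rule that speed must be positive is an assumption.

```python
KNOWN_MODELS = {
    "kokoro-v1.0.fp16-gpu.onnx",
    "kokoro-v1.0.fp16.onnx",
    "kokoro-v1.0.int8.onnx",
    "kokoro-v1.0.onnx",
}
KNOWN_DEVICES = {"cpu", "cuda"}


def check_config(cfg: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    model_name = cfg.get("model_name")
    if model_name not in KNOWN_MODELS:
        problems.append(f"unknown model_name: {model_name!r}")
    pipeline = cfg.get("pipeline") or {}
    device = pipeline.get("device")
    if device not in KNOWN_DEVICES:
        problems.append(f"device must be cpu or cuda, got {device!r}")
    speed = pipeline.get("speed", 1.0)
    # Assumption: a speed of zero or below is never meaningful.
    if not isinstance(speed, (int, float)) or speed <= 0:
        problems.append(f"speed must be a positive number, got {speed!r}")
    return problems


example = {
    "model_name": "kokoro-v1.0.int8.onnx",
    "pipeline": {"device": "cpu", "speed": 1.0},
}
print(check_config(example))  # []
```

The dictionary passed in would be the result of yaml.safe_load on config.yaml, checked before it is handed to NodeConfig.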

Dependencies

Linux

Ubuntu

sudo apt update
sudo apt install libglslang-dev

Manjaro

sudo pacman -S ffmpeg glslang

# Check for version mismatch
find /usr -name "libglslang-default-resource-limits.so*"
# If version mismatch
sudo ln -s /usr/lib/libglslang-default-resource-limits.so.15 /usr/lib/libglslang-default-resource-limits.so.14

# Check for version mismatch
find /usr -name "libSPIRV.so*"
# If version mismatch

sudo ldconfig
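
The find commands above can also be run from Python, for example in a setup script. This is a minimal sketch and the function name find_library_versions is hypothetical. It only lists matching file names; deciding whether a symlink is needed is still a manual step.

```python
from pathlib import Path


def find_library_versions(root, stem):
    """Return the sorted file names under root that start with '<stem>.so'."""
    return sorted(p.name for p in Path(root).rglob(f"{stem}.so*"))
```

For example, find_library_versions("/usr", "libglslang-default-resource-limits") lists the installed versions of that library, which can then be compared with the version named in the loader error.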

If the NVIDIA GPU is not working, reload the nvidia_uvm kernel module:

sudo modprobe -r nvidia_uvm
sudo modprobe nvidia_uvm

macOS

brew install ffmpeg
brew install glslang
