Kokoro: high-pitched voices produce harsh sibilance that post-processing can't fix #23

@Alex-Wengg

Summary

High-pitched Kokoro voices (mostly the female af_* voices) produce noticeably harsh sibilants (s, sh, z). Lower-pitched voices (the male am_* voices and some female ones) sound fine. The built-in AudioPostProcessor.deEss() at -3dB doesn't meaningfully reduce the harshness, and more aggressive post-processing (a -6dB high-shelf at 6kHz plus a -4dB parametric notch at 8kHz) produced no perceptible improvement either.

Reproduction

Using KokoroTtsManager.synthesizeDetailed() with any high-pitched voice (e.g. af_heart, af_bella), synthesize text containing sibilant-heavy words. Compare against a lower-pitched voice like am_adam.

The sibilance is clearly audible in the raw samples before any post-processing is applied.
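To put a number on "clearly audible", one option is to compare how much of a frame's energy sits in the sibilant range for different voices. The sketch below is only an illustration: it assumes the raw output is available as mono `[Float]` samples at 24kHz (how to get that out of the `synthesizeDetailed()` result is not shown), `bandEnergyRatio` and `sine` are hypothetical helpers, and the 5 to 10kHz band is a guess at where the harshness lives rather than a measured value.

```swift
import Foundation

/// Fraction of a frame's spectral energy that falls inside [lowHz, highHz].
/// Uses a naive DFT: slow, but dependency-free and fine for short frames.
func bandEnergyRatio(
    frame: [Float],
    sampleRate: Double,
    lowHz: Double,
    highHz: Double
) -> Double {
    let n = frame.count
    guard n > 1 else { return 0 }
    var total = 0.0
    var inBand = 0.0
    // Bins 1..<n/2 cover (0, Nyquist); DC is skipped on purpose.
    for k in 1..<(n / 2) {
        var re = 0.0
        var im = 0.0
        for (i, sample) in frame.enumerated() {
            let phase = -2.0 * Double.pi * Double(k) * Double(i) / Double(n)
            re += Double(sample) * cos(phase)
            im += Double(sample) * sin(phase)
        }
        let power = re * re + im * im
        let hz = Double(k) * sampleRate / Double(n)
        total += power
        if hz >= lowHz && hz <= highHz { inBand += power }
    }
    return total > 0 ? inBand / total : 0
}

/// Test-signal helper (hypothetical): a unit-amplitude sine wave.
func sine(hz: Double, sampleRate: Double, count: Int) -> [Float] {
    return (0..<count).map { Float(sin(2.0 * Double.pi * hz * Double($0) / sampleRate)) }
}

// Sanity check on synthetic tones: 20 ms frames at 24kHz.
let sampleRate = 24_000.0
let hissRatio = bandEnergyRatio(
    frame: sine(hz: 7_000, sampleRate: sampleRate, count: 480),
    sampleRate: sampleRate, lowHz: 5_000, highHz: 10_000)
let humRatio = bandEnergyRatio(
    frame: sine(hz: 200, sampleRate: sampleRate, count: 480),
    sampleRate: sampleRate, lowHz: 5_000, highHz: 10_000)
```

Running the same function over frames of the same sentence from af_heart and am_adam would show whether the high-pitched voice really carries more energy in that band, or whether the harshness comes from something a band-energy ratio does not capture.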

What we tried (in SpeakBook, consuming FluidAudioTTS)

  1. Built-in de-esser (deEss: true, default) — -3dB high-shelf at 6kHz via AudioPostProcessor.applyTtsPostProcessing(). No perceptible change on high-pitched voices.

  2. AVAudioUnitEQ in the playback chain — high-shelf at 6kHz (-6dB) + parametric notch at 8kHz (-4dB, bandwidth 1.5). Also no meaningful improvement.

  3. Both approaches together — still sibilant.
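For reference, attempt 2 corresponds roughly to the AVAudioUnitEQ setup below. This is a reconstruction from the numbers above, not the exact SpeakBook code, and `makeSibilanceEQ` is a hypothetical name. One thing worth ruling out: as far as I know, AVAudioUnitEQ bands default to `bypass = true`, so a band that is never enabled has no effect at all. The `#if canImport` guard only keeps the snippet compiling on platforms without AVFoundation.

```swift
#if canImport(AVFoundation)
import AVFoundation

/// Builds a two-band EQ matching attempt 2: a high shelf plus a parametric cut.
func makeSibilanceEQ() -> AVAudioUnitEQ {
    let eq = AVAudioUnitEQ(numberOfBands: 2)

    let shelf = eq.bands[0]
    shelf.filterType = .highShelf
    shelf.frequency = 6_000   // Hz
    shelf.gain = -6           // dB
    shelf.bypass = false      // bands stay inactive until bypass is cleared

    let notch = eq.bands[1]
    notch.filterType = .parametric
    notch.frequency = 8_000   // Hz
    notch.gain = -4           // dB
    notch.bandwidth = 1.5     // octaves
    notch.bypass = false

    return eq
}
#endif
```

The unit would then be attached to the AVAudioEngine and connected between the player node and the main mixer.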

Analysis

The sibilance appears to be baked into the model output for high-pitched voices, rather than being a simple excess of high-frequency energy that EQ can remove. The spectral content of the sibilants in these voices likely overlaps too much with the voice's natural harmonics, so attenuating the sibilant range degrades the voice quality without fixing the harshness.

Possible directions

  • Investigate whether this is a known characteristic of the upstream Kokoro-82M model for certain voice embeddings
  • Voice-specific de-essing with narrower/targeted frequency bands per voice
  • Training-side fix if the voice embeddings can be adjusted
  • Document which voices are known to have this issue so consumers can guide users toward better-sounding voices
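As a rough sketch of the voice-specific de-essing direction: a per-voice profile lookup that falls back to the current default, feeding a standard high-shelf biquad design (RBJ Audio EQ Cookbook, shelf slope S = 1). Everything here is hypothetical. `DeEssProfile`, `Biquad`, `profile(forVoice:)` and `highShelf` are not FluidAudioTTS types or functions, and the per-voice numbers are placeholders that would have to be replaced by measured values. Given the analysis above, a static shelf may still not be enough even when tuned per voice.

```swift
import Foundation

/// Hypothetical per-voice de-essing settings; not a FluidAudioTTS type.
struct DeEssProfile {
    var shelfHz: Double
    var gainDb: Double
}

/// Default mirrors the built-in de-esser described above (-3dB shelf at 6kHz).
let defaultProfile = DeEssProfile(shelfHz: 6_000, gainDb: -3)

/// Placeholder values only; real numbers would have to come from measurement.
let voiceProfiles: [String: DeEssProfile] = [
    "af_heart": DeEssProfile(shelfHz: 7_000, gainDb: -6),
    "af_bella": DeEssProfile(shelfHz: 7_500, gainDb: -6),
]

/// Looks up a voice's profile, falling back to the default for unknown voices.
func profile(forVoice voice: String) -> DeEssProfile {
    return voiceProfiles[voice] ?? defaultProfile
}

/// Biquad coefficients, normalized so that a0 == 1.
struct Biquad {
    var b0, b1, b2, a1, a2: Double
}

/// High-shelf design from the RBJ Audio EQ Cookbook with shelf slope S = 1.
func highShelf(_ p: DeEssProfile, sampleRate: Double) -> Biquad {
    let a = pow(10.0, p.gainDb / 40.0)
    let w0 = 2.0 * Double.pi * p.shelfHz / sampleRate
    let cosW = cos(w0)
    let alpha = sin(w0) / 2.0 * sqrt(2.0)
    let k = 2.0 * sqrt(a) * alpha
    let a0 = (a + 1.0) - (a - 1.0) * cosW + k
    let b0 = a * ((a + 1.0) + (a - 1.0) * cosW + k) / a0
    let b1 = -2.0 * a * ((a - 1.0) + (a + 1.0) * cosW) / a0
    let b2 = a * ((a + 1.0) + (a - 1.0) * cosW - k) / a0
    let a1 = 2.0 * ((a - 1.0) - (a + 1.0) * cosW) / a0
    let a2 = ((a + 1.0) - (a - 1.0) * cosW - k) / a0
    return Biquad(b0: b0, b1: b1, b2: b2, a1: a1, a2: a2)
}

// Example: coefficients for a voice without a dedicated profile at 24kHz.
let fallbackCoeffs = highShelf(profile(forVoice: "am_adam"), sampleRate: 24_000)
```

Keeping the default equal to the existing -3dB shelf means voices without a profile would behave as they do today.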

Environment

  • FluidAudioTTS (Kokoro via KokoroTtsManager)
  • iOS/macOS, CoreML models from Kokoro-82M HuggingFace repo
  • Sample rate: 24kHz
