Kokoro: high-pitched voices produce harsh sibilance that post-processing can't fix

## Summary

High-pitched Kokoro voices (mostly the female `af_*` voices) produce noticeably harsh sibilants (s, sh, z). Lower-pitched voices (the male `am_*` voices and some female ones) sound fine. The built-in `AudioPostProcessor.deEss()` at -3dB does not meaningfully reduce the harshness, and more aggressive post-processing (a -6dB high-shelf plus a -4dB parametric notch at 8kHz) gave no perceptible improvement either.

## Reproduction

Call `KokoroTtsManager.synthesizeDetailed()` with any high-pitched voice (e.g. `af_heart`, `af_bella`) on text containing sibilant-heavy words, then compare the result against a lower-pitched voice such as `am_adam`.

The sibilance is clearly audible in the raw samples before any post-processing is applied.

## What we tried (in SpeakBook, consuming FluidAudioTTS)

1. **Built-in de-esser** (`deEss: true`, the default): a `-3dB` high-shelf at 6kHz, applied via `AudioPostProcessor.applyTtsPostProcessing()`. No perceptible change on high-pitched voices.

2. **AVAudioUnitEQ in the playback chain**: a high-shelf at 6kHz (-6dB) plus a parametric notch at 8kHz (-4dB, bandwidth 1.5). No meaningful improvement either.

3. **Both approaches together**: still sibilant.
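For reference, the EQ from attempt 2 can be sketched roughly as follows. This is a minimal reconstruction, not the exact SpeakBook code: the function name `makeSibilanceEQ` is hypothetical, and the bandwidth of 1.5 is assumed to be in octaves, which is the unit `AVAudioUnitEQFilterParameters.bandwidth` uses.

```swift
import AVFoundation

/// Builds a two-band EQ matching the settings described in attempt 2.
/// `makeSibilanceEQ` is a hypothetical helper name used for illustration.
func makeSibilanceEQ() -> AVAudioUnitEQ {
    let eq = AVAudioUnitEQ(numberOfBands: 2)

    // Band 0: high-shelf cut above 6 kHz.
    let shelf = eq.bands[0]
    shelf.filterType = .highShelf
    shelf.frequency = 6_000   // Hz
    shelf.gain = -6           // dB
    shelf.bypass = false      // EQ bands are bypassed by default

    // Band 1: parametric cut centered at 8 kHz.
    let notch = eq.bands[1]
    notch.filterType = .parametric
    notch.frequency = 8_000   // Hz
    notch.bandwidth = 1.5     // octaves (assumed unit for the value in this issue)
    notch.gain = -4           // dB
    notch.bypass = false

    return eq
}
```

The returned node would be attached to the `AVAudioEngine` and connected between the player node and the mixer. With a 24kHz sample rate the Nyquist limit is 12kHz, so both bands fall inside the usable range.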

## Analysis

The sibilance appears to be baked into the model output for high-pitched voices rather than being a simple excess of high-frequency energy that EQ can remove. The sibilant energy in these voices likely overlaps too much with the voice's natural harmonics, so attenuating the sibilant range degrades the voice quality without fixing the harshness.
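One way to put numbers behind this hypothesis is to measure how much of a frame's energy sits in the sibilant range, before and after EQ, for an affected voice and an unaffected one. The sketch below uses the Goertzel algorithm on raw `Float` samples. The function names and the 5-10kHz band are assumptions chosen for illustration, not part of FluidAudioTTS.

```swift
import Foundation

/// Power of a single frequency in `samples`, computed with the Goertzel algorithm.
func goertzelPower(_ samples: [Float], frequency: Double, sampleRate: Double) -> Double {
    let w = 2.0 * Double.pi * frequency / sampleRate
    let coeff = 2.0 * cos(w)
    var s1 = 0.0
    var s2 = 0.0
    for x in samples {
        let s0 = Double(x) + coeff * s1 - s2
        s2 = s1
        s1 = s0
    }
    return s1 * s1 + s2 * s2 - coeff * s1 * s2
}

/// Fraction of a frame's energy (excluding DC and Nyquist) that falls inside `band`.
/// The default band is an assumed sibilant range, not a measured one.
/// Cost is O(n^2) per frame, so keep frames short (a few hundred samples).
func sibilantEnergyRatio(_ frame: [Float],
                         sampleRate: Double = 24_000,
                         band: ClosedRange<Double> = 5_000...10_000) -> Double {
    let n = frame.count
    guard n > 1 else { return 0 }
    var inBand = 0.0
    var total = 0.0
    // Probe every DFT bin frequency between DC and Nyquist.
    for k in 1..<(n / 2) {
        let f = Double(k) * sampleRate / Double(n)
        let p = goertzelPower(frame, frequency: f, sampleRate: sampleRate)
        total += p
        if band.contains(f) { inBand += p }
    }
    return total > 0 ? inBand / total : 0
}
```

If the ratio on sibilant frames drops after EQ while the harshness is still audible, that would support the idea that the problem is not simply the amount of high-frequency energy.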

## Possible directions

- Investigate whether this is a known characteristic of the upstream Kokoro-82M model for certain voice embeddings
- Voice-specific de-essing with narrower/targeted frequency bands per voice
- Training-side fix if the voice embeddings can be adjusted
- Document which voices are known to have this issue so consumers can guide users toward better-sounding voices
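As a starting point for the voice-specific de-essing idea, here is a rough sketch of a dynamic split-band de-esser with per-voice settings. Unlike the static EQ attempts above, it only attenuates the high band while that band is loud. Everything here is an assumption for illustration: the `DeEsserSettings` type, the `deEss` function, and the numeric values are placeholders that have not been tuned or tested against Kokoro output, and the one-pole crossover is deliberately simple.

```swift
import Foundation

/// Hypothetical per-voice settings. The numbers below are placeholders, not tuned values.
struct DeEsserSettings {
    var crossoverHz: Double    // content above this is treated as the sibilant band
    var threshold: Float       // linear envelope level above which the band is reduced
    var maxReductionDb: Float  // upper limit on attenuation of the high band
}

let voiceSettings: [String: DeEsserSettings] = [
    "af_heart": DeEsserSettings(crossoverHz: 5_000, threshold: 0.05, maxReductionDb: 9),
    "af_bella": DeEsserSettings(crossoverHz: 5_500, threshold: 0.05, maxReductionDb: 6),
]

/// Splits the signal with a one-pole low-pass, tracks the high band's envelope,
/// and turns the high band down only while its envelope exceeds the threshold.
func deEss(_ samples: [Float], sampleRate: Double, settings: DeEsserSettings) -> [Float] {
    let a = Float(exp(-2.0 * Double.pi * settings.crossoverHz / sampleRate))
    let release = Float(exp(-1.0 / (0.010 * sampleRate)))  // roughly 10 ms envelope release
    let minGain = powf(10, -settings.maxReductionDb / 20)

    var low: Float = 0
    var env: Float = 0
    var out = [Float]()
    out.reserveCapacity(samples.count)

    for x in samples {
        low = (1 - a) * x + a * low           // low band
        let high = x - low                    // high band; low + high reconstructs x
        env = max(abs(high), env * release)   // peak envelope with exponential release
        let gain: Float = env > settings.threshold
            ? max(settings.threshold / env, minGain)
            : 1
        out.append(low + gain * high)
    }
    return out
}
```

A consumer could look up `voiceSettings[voiceName]` and skip the processing when there is no entry, so unaffected voices such as `am_adam` pass through untouched. Whether this reduces the harshness described here is an open question, since the static EQ attempts did not.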

## Environment

- FluidAudioTTS (Kokoro via KokoroTtsManager)
- iOS/macOS, CoreML models from Kokoro-82M HuggingFace repo
- Sample rate: 24kHz
