fix(asr): drop Whisper hallucinated segments using verbose_json metadata (OpenAI/Groq only) #572

Merged
H-Chris233 merged 1 commit into Open-Less:beta from katanumahotori:fix/whisper-hallucination-verbose-json
Jun 1, 2026

katanumahotori commented Jun 1, 2026

User description

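Problem

When fed silence, faint audio, or noise, Whisper generates text that sounds plausible but that the user never said (the known hallucination defect). Silence before and after a recording and the microphone noise floor are often transcribed as unrelated words, polluting the final result. transcribe_chunk currently takes json["text"] directly, with no filtering at all.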

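Approach

When the provider returns verbose_json, each segment carries no_speech_prob / avg_logprob / compression_ratio. Conservative thresholds drop segments that are clearly not speech:

  • no_speech_prob > 0.6 and avg_logprob < -0.5 (high silence probability plus low confidence)
  • compression_ratio > 2.4 (repetitive hallucination, the standard Whisper threshold)
  • avg_logprob < -1.0 (extremely low confidence, noise turned into words)

Deleting real speech by mistake is the worst outcome, so the thresholds lean conservative. When the response has no segments, the code falls back to using text directly (same as the old behavior). When a metric field is missing, the segment is kept, so for a provider that returns segments without the metrics the filter is a harmless no-op.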
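The three drop rules and the two fallbacks can be sketched as below. This is a minimal illustration under stated assumptions, not the code in whisper.rs: the `Segment` struct and the function names `is_hallucination` and `confident_text` are hypothetical, the segments are assumed to be already parsed out of the JSON response, and joining surviving segments by plain concatenation (and treating an empty segment list like a missing one) is a guess.

```rust
/// One transcription segment, reduced to the fields the filter reads.
/// Hypothetical type: metrics are `Option` because some providers omit them.
#[derive(Debug, Clone)]
pub struct Segment {
    pub text: String,
    pub no_speech_prob: Option<f64>,
    pub avg_logprob: Option<f64>,
    pub compression_ratio: Option<f64>,
}

/// Returns true when the metrics mark the segment as a likely hallucination.
/// A missing metric never triggers a drop on its own.
pub fn is_hallucination(seg: &Segment) -> bool {
    // Rule 1: high silence probability AND low confidence (both must be present).
    let silent_and_unsure = matches!(
        (seg.no_speech_prob, seg.avg_logprob),
        (Some(p), Some(lp)) if p > 0.6 && lp < -0.5
    );
    // Rule 2: highly compressible text, i.e. repetitive output.
    let repetitive = seg.compression_ratio.map_or(false, |r| r > 2.4);
    // Rule 3: extremely low confidence.
    let very_unsure = seg.avg_logprob.map_or(false, |lp| lp < -1.0);
    silent_and_unsure || repetitive || very_unsure
}

/// Concatenates the text of surviving segments.
/// With no segments (assumption: `None` or empty), falls back to `text`.
pub fn confident_text(text: &str, segments: Option<&[Segment]>) -> String {
    match segments {
        Some(segs) if !segs.is_empty() => segs
            .iter()
            .filter(|s| !is_hallucination(s))
            .map(|s| s.text.as_str())
            .collect::<String>()
            .trim()
            .to_string(),
        _ => text.trim().to_string(),
    }
}
```

Keeping every metric optional is what makes the filter a no-op for providers that return segments without them: none of the three rules can fire when its inputs are absent.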

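Provider gating (key point)

verbose_json is enabled only for providers that are confirmed to support it and that benefit from it, to avoid breaking other backends:

| provider | model | verbose_json | handling |
| --- | --- | --- | --- |
| whisper (OpenAI) | Whisper | ✅ full (includes the metrics above) | enabled, filter is effective |
| groq | Whisper | ✅ full (seek/avg_logprob/compression_ratio/no_speech_prob) | enabled, filter is effective |
| zhipu (GLM-ASR) | GLM-ASR | accepts the value but does not emit the metrics above | stays on json (filter would be a no-op; minimizes behavior change) |
| siliconflow | SenseVoice / TeleSpeech | response_format is not documented | stays on json (avoids a 4xx from an unknown parameter) |

Basis: the current OpenAI and Groq docs both state explicitly that verbose_json returns the segment metrics above. The transcription endpoint in the SiliconFlow docs has no response_format parameter, and its models are SenseVoice/TeleSpeech. GLM-ASR accepts verbose_json, but its segments have a different shape.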

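whisper_supports_verbose_json(provider_id) decides whether the feature is enabled; WhisperBatchASR gains a verbose_json bool parameter. When it is enabled, temperature is also pinned to 0 (transcription is a deterministic task).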
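A sketch of how such a gate might look. The name `whisper_supports_verbose_json` and the provider ids come from this PR; the body is a guess at an allow-list, and `extra_form_fields` is a hypothetical helper that only illustrates the "verbose_json plus temperature 0" pairing. Which fields the existing json path already sends is not shown in the PR, so the non-gated branch adds nothing here.

```rust
/// Allow-list of provider ids documented to return per-segment metrics
/// under `verbose_json`. Unknown providers default to false.
pub fn whisper_supports_verbose_json(provider_id: &str) -> bool {
    matches!(provider_id, "whisper" | "groq")
}

/// Hypothetical helper: extra multipart form fields to add to the request.
/// Gated providers get verbose_json with temperature pinned to 0;
/// all others get no extra fields, so their request stays unchanged.
pub fn extra_form_fields(provider_id: &str) -> Vec<(&'static str, &'static str)> {
    if whisper_supports_verbose_json(provider_id) {
        vec![("response_format", "verbose_json"), ("temperature", "0")]
    } else {
        Vec::new()
    }
}
```

An allow-list fails closed: a provider added later keeps the plain json path until someone confirms its response shape and opts it in.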

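Tests

  • extract_confident_text: drops hallucinated segments / keeps confident segments / falls back to text when there are no segments / keeps segments whose metrics are missing.
  • whisper_supports_verbose_json: true only for whisper/groq, false for siliconflow/zhipu.
  • cargo check --lib --tests passes.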

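Platform / compatibility

Only transcribe_chunk and the constructor parameters change. Behavior is completely unchanged for providers that do not opt in.

The fork maintainer found the hallucination problem during real-world use in a Japanese-language environment. SiliconFlow could not be tested locally (no credentials), so it is gated conservatively according to its docs and its behavior is left unchanged. If the naming or thresholds need adjusting, please point it out directly.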


PR Type

Bug fix, Tests


Description

  • Add verbose_json support to filter hallucinated segments via metadata (no_speech_prob, avg_logprob, compression_ratio)

  • Gate the feature to only OpenAI/Groq to avoid breaking other providers

  • Add extract_confident_text function with conservative thresholds

  • Add unit tests for the new function and provider gating


File Walkthrough

Relevant files
Enhancement
whisper.rs
Add verbose_json hallucination filter and tests                   

openless-all/app/src-tauri/src/asr/whisper.rs

  • Added verbose_json boolean field to WhisperBatchASR
  • Modified transcribe to conditionally request
    response_format=verbose_json and use extract_confident_text filter
  • Added extract_confident_text function to drop hallucinated segments
    using thresholds (no_speech_prob, avg_logprob, compression_ratio)
  • Added unit tests for the filtering logic
+138/-3 
coordinator.rs
Gate verbose_json support to whisper/groq providers           

openless-all/app/src-tauri/src/coordinator.rs

  • Added whisper_supports_verbose_json function to gate the feature only
    to providers "whisper" and "groq"
  • Modified build_qa_asr_start to pass the flag when constructing
    WhisperBatchASR
  • Added unit test to verify provider gating
+25/-0   
dictation.rs
Pass verbose_json flag in dictation session                           

openless-all/app/src-tauri/src/coordinator/dictation.rs

  • Modified begin_session to pass the verbose_json flag when creating
    WhisperBatchASR
+1/-0     

…I/Groq only)

Whisper fabricates plausible-but-unspoken text on silence/noise (the
classic hallucination defect): leading/trailing silence or mic hiss turns
into unrelated words. When the provider returns verbose_json, each segment
carries no_speech_prob / avg_logprob / compression_ratio — use them to
drop segments that clearly aren't speech (conservative thresholds so real
speech is never trimmed). No segments in the response → fall back to text.

Provider-gated to avoid breaking non-Whisper backends:
- whisper (OpenAI) / groq: native Whisper, verbose_json fully supported
  with the metrics above — filter is effective. Verified against both
  providers' current docs.
- siliconflow: SenseVoice / TeleSpeech, response_format is undocumented;
  sending verbose_json risks a 4xx, so it stays on the existing json path.
- zhipu (GLM-ASR): accepts verbose_json but does not emit those metrics
  (filter would be a no-op), so it also stays on json to minimize behavior
  change. Only whisper/groq opt in.

whisper_supports_verbose_json(provider_id) decides the flag; WhisperBatchASR
gains a verbose_json bool. Missing metric fields are treated as "keep" so
the filter is harmless for any provider that returns segments without them.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
github-actions Bot commented Jun 1, 2026

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ No major issues detected

katanumahotori added a commit to katanumahotori/openless that referenced this pull request Jun 1, 2026
Aligns the fork with PR Open-Less#572: the Whisper hallucination filter only
requests response_format=verbose_json for providers that return the
metrics (whisper/groq). SiliconFlow (SenseVoice/TeleSpeech, no
response_format) and zhipu (GLM-ASR, no metrics) keep the plain json
path. Previously the fork always sent verbose_json, which was fine on
Groq but would risk a 4xx if switched to SiliconFlow.

WhisperBatchASR gains a verbose_json bool; whisper_supports_verbose_json
decides it at construction. strip_prompt_echo still runs on both paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@H-Chris233 H-Chris233 merged commit ad62936 into Open-Less:beta Jun 1, 2026
4 checks passed