
fix(bailian): harden the DashScope ASR cumulative-text duplication fix #535

Merged: H-Chris233 merged 1 commit into Open-Less:beta from H-Chris233:fix/issue-530-bailian-dedup on May 27, 2026

H-Chris233 (Collaborator) commented May 27, 2026

User description

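Background

Issue #530 reports that DashScope ASR produces duplicated text when segments are accumulated. The earlier commit 54f07d8 already fixed the core end_time check and the BTreeMap overwrite behavior. This PR hardens that fix further.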

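Changes

| Change | Description |
|---|---|
| Heartbeat filtering | Skip sentence.heartbeat events |
| sentence_end check | Use the sentence_end boolean field documented in the API, with end_time > 0 as a compatibility fallback |
| partial_segments tracking | Cache interim results by sentence_id; clear them automatically when the final arrives |
| Keep sentence_id == 0 | Store every final result in final_segments instead of dropping it |
| Overlap-merge guard | merge_segments() function with a minimum 2-character overlap check |
| Test coverage | Expanded from 3 tests to 20 |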
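
To make the overlap-merge guard listed above concrete, here is a minimal sketch of suffix/prefix overlap merging with a 2-character minimum. The signature, the `MIN_OVERLAP_CHARS` constant, and the longest-match-first strategy are assumptions for illustration; the actual `merge_segments()` in bailian.rs may differ.

```rust
/// Assumed threshold: overlaps shorter than this are treated as coincidence.
const MIN_OVERLAP_CHARS: usize = 2;

/// Appends `next` to `acc`, dropping the longest prefix of `next` that
/// already appears as a suffix of `acc`, provided that overlap is at
/// least `MIN_OVERLAP_CHARS` characters long.
fn merge_segments(acc: &str, next: &str) -> String {
    // Compare by `char`, not by byte, so multi-byte text (e.g. Chinese)
    // is never split in the middle of a code point.
    let acc_chars: Vec<char> = acc.chars().collect();
    let next_chars: Vec<char> = next.chars().collect();
    let max_overlap = acc_chars.len().min(next_chars.len());

    // Try the longest candidate first; the range is empty when either
    // input is shorter than the minimum overlap.
    for len in (MIN_OVERLAP_CHARS..=max_overlap).rev() {
        if acc_chars[acc_chars.len() - len..] == next_chars[..len] {
            let tail: String = next_chars[len..].iter().collect();
            return format!("{}{}", acc, tail);
        }
    }

    // No qualifying overlap: plain concatenation.
    format!("{}{}", acc, next)
}
```

Under these assumptions, `merge_segments("今天天气", "天气很好")` yields `"今天天气很好"`, while `merge_segments("abc", "cde")` yields `"abccde"` because a single shared character is below the threshold.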

Tests

cargo test --lib asr::bailian
20 passed; 0 failed

Review

The code went through an automated Claude Code review plus an independent sub-agent review; no serious issues were found.


PR Type

Bug fix, Tests


Description

  • Skip heartbeat-only ASR events

  • Commit finals by sentence_end

  • Cache partials per sentence_id

  • Merge overlaps and expand tests


Diagram Walkthrough

flowchart LR
  A["DashScope ASR events"] --> B["Ignore heartbeat events"]
  B --> C["Detect finality via sentence_end"]
  C --> D["Track partials by sentence_id"]
  D --> E["Store finals in BTreeMap"]
  E --> F["Merge overlapping segments"]
  F --> G["Emit deduplicated transcript"]

File Walkthrough

Relevant files
Bug fix
bailian.rs
Harden DashScope ASR deduplication flow                                   

openless-all/app/src-tauri/src/asr/bailian.rs

  • Skips heartbeat events before processing transcript text.
  • Uses sentence_end as the primary finality signal, with end_time as
    fallback.
  • Tracks interim results in partial_segments and clears them when a
    final arrives.
  • Adds merge_segments() to avoid duplicated overlap when assembling the
    final transcript.
  • Expands tests to cover heartbeat filtering, partial updates, duplicate
    finals, zero sentence_id, and ordered multi-segment assembly.
+304/-9 
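
The flow described in this walkthrough (skip heartbeats, cache interim text per sentence_id, promote to final and clear the cached partial) can be sketched roughly as follows. The `AsrEvent` enum, the `TranscriptState` struct, and the method names are hypothetical; the real code parses JSON payloads and its state type is not shown in this PR page. Only the field names `partial_segments` and `final_segments` come from the description.

```rust
use std::collections::BTreeMap;

/// Hypothetical, already-parsed event; the real handler reads JSON.
enum AsrEvent {
    Heartbeat,
    Sentence { sentence_id: u64, text: String, is_final: bool },
}

/// Hypothetical state holder using the two maps named in the PR.
#[derive(Default)]
struct TranscriptState {
    partial_segments: BTreeMap<u64, String>,
    final_segments: BTreeMap<u64, String>,
}

impl TranscriptState {
    fn apply(&mut self, event: AsrEvent) {
        match event {
            // Heartbeats carry no recognized text, so they are ignored.
            AsrEvent::Heartbeat => {}
            AsrEvent::Sentence { sentence_id, text, is_final } => {
                let trimmed = text.trim();
                if trimmed.is_empty() {
                    return;
                }
                if is_final {
                    // A final supersedes any cached interim text and
                    // overwrites an earlier final with the same id.
                    self.partial_segments.remove(&sentence_id);
                    self.final_segments.insert(sentence_id, trimmed.to_string());
                } else {
                    // Interim results are cumulative: keep only the latest.
                    self.partial_segments.insert(sentence_id, trimmed.to_string());
                }
            }
        }
    }

    /// Joins committed finals in ascending sentence_id order.
    fn final_text(&self) -> String {
        self.final_segments.values().map(String::as_str).collect()
    }
}
```

Keying both maps by sentence_id means a replayed final replaces its earlier copy instead of being appended a second time, which is the deduplication property the description relies on.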

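- Skip heartbeat events (they carry no recognized text)
- Use the API-documented sentence_end to determine finality, with end_time > 0 as a compatibility fallback
- Add a partial_segments BTreeMap to track interim results, cleared automatically on final
- Final results with sentence_id == 0 are no longer dropped; they are stored in final_segments
- Add merge_segments(), a concatenation guard with a 2-character minimum overlap check
- finish_with_partial_or_error now also checks partial_segments
- Expand tests from 3 to 20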
@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis 🔶

530 - Partially compliant

Compliant requirements:

  • Skip heartbeat-only DashScope ASR events.
  • Keep interim text separate from committed final text.
  • Merge overlapping segments to tolerate replayed or duplicated server events.
  • Add tests covering heartbeat events, multiple partial updates, duplicate finals, zero sentence_id, and multi-sentence assembly.

Non-compliant requirements:

  • Avoid dropping finals when sentence_id is missing or zero.
  • Detect final sentences using sentence_end as the authoritative finality signal.

Requires further human verification:

  • Whether the end_time fallback still matches all real DashScope payload variants in production.
⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

ID collision

Storing every final sentence in a BTreeMap keyed by sentence_id means any stream that emits multiple final sentences with sentence_id == 0 will overwrite earlier ones, so only the last sentence survives. The PR explicitly changes the code to keep zero IDs instead of dropping them, but this path still loses data for providers or payload versions that omit sentence IDs.

// All final results (including sentence_id == 0) are stored in final_segments.
// BTreeMap overwrite semantics ensure the same sentence_id is never appended twice.
st.final_segments.insert(sentence_id, trimmed.to_string());
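
The collision described above can be reproduced in isolation. In the sketch below, `store_by_id` mirrors the quoted insert, and `store_with_fallback_key` is one hypothetical mitigation (not part of this PR) that falls back to arrival order when the ID is 0. Both function names and the tuple key are assumptions.

```rust
use std::collections::BTreeMap;

/// Mirrors the quoted insert: the map is keyed by sentence_id alone.
fn store_by_id(finals: &[(u64, &str)]) -> BTreeMap<u64, String> {
    let mut map = BTreeMap::new();
    for (id, text) in finals {
        // A later final with the same id replaces the earlier one.
        map.insert(*id, text.to_string());
    }
    map
}

/// Hypothetical mitigation: when sentence_id is 0 (missing), key on the
/// arrival index so that distinct sentences do not overwrite each other.
fn store_with_fallback_key(finals: &[(u64, &str)]) -> BTreeMap<(u64, usize), String> {
    let mut map = BTreeMap::new();
    for (seq, (id, text)) in finals.iter().enumerate() {
        let key = if *id == 0 { (0, seq) } else { (*id, 0) };
        map.insert(key, text.to_string());
    }
    map
}
```

Given three finals that all carry sentence_id 0, `store_by_id` keeps only the last one, whereas the fallback key keeps all three in arrival order. The trade-off is that replayed finals with ID 0 are no longer deduplicated by the map, and ID-0 entries sort ahead of every non-zero ID, so this is a sketch of the design space rather than a recommended fix.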
Finality fallback

sentence_end is treated as final only when it is true, but the end_time > 0 fallback can still promote non-final cumulative updates into final_segments. If DashScope sends interim result-generated events with a populated end_time, this reintroduces the duplicate-prefix behavior the ticket is trying to fix.

// Use the API-documented sentence_end as the finality signal,
// with end_time > 0 as a compatibility fallback (some earlier interfaces may omit sentence_end).
let sentence_end = sentence
    .get("sentence_end")
    .and_then(Value::as_bool)
    .unwrap_or(false);
let end_time = sentence
    .get("end_time")
    .and_then(Value::as_i64)
    .unwrap_or(0);
let is_sentence_final = sentence_end || end_time > 0;
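
To illustrate the reviewer's concern, the sketch below compares the quoted predicate with a stricter variant that consults end_time only when sentence_end is absent. `SentenceFlags` is a hypothetical, already-parsed stand-in for the JSON value, and `is_final_strict` is an assumption about one possible tightening, not code from this PR.

```rust
/// Hypothetical parsed view of the two fields the predicate reads.
/// `None` means the field was absent from the payload.
#[derive(Clone, Copy)]
struct SentenceFlags {
    sentence_end: Option<bool>,
    end_time: Option<i64>,
}

/// Same logic as the quoted snippet: either signal marks the sentence final.
fn is_final_with_fallback(s: SentenceFlags) -> bool {
    s.sentence_end.unwrap_or(false) || s.end_time.unwrap_or(0) > 0
}

/// Assumed stricter variant: an explicit sentence_end always wins, and
/// end_time is consulted only when sentence_end is missing entirely.
fn is_final_strict(s: SentenceFlags) -> bool {
    match s.sentence_end {
        Some(flag) => flag,
        None => s.end_time.unwrap_or(0) > 0,
    }
}
```

For an interim payload with `sentence_end: Some(false)` and `end_time: Some(1200)`, the quoted predicate returns true while the stricter variant returns false. Payloads that omit sentence_end behave identically under both, so the stricter variant would not help if such legacy payloads also populate end_time on interim events.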

@H-Chris233 H-Chris233 merged commit e5cb823 into Open-Less:beta May 27, 2026
4 checks passed
@H-Chris233 H-Chris233 deleted the fix/issue-530-bailian-dedup branch May 29, 2026 10:53