Backend: should_discard_conversation uses whitespace split for word count, silently undercounts CJK transcripts #7065

@waffensam

Description

Summary

`should_discard_conversation` in `backend/utils/llm/conversation_processing.py` computes word count and applies the fast-path skip using Python's `str.split()`. This is correct for space-delimited languages (English, etc.) but silently undercounts CJK transcripts (Chinese, Japanese, Korean), where words are not separated by whitespace.

Root cause

```python
# backend/utils/llm/conversation_processing.py, line 167
if transcript and len(transcript.split(' ')) > 100:
    return False  # fast-path: skip LLM call, keep conversation

# line 170
word_count = len(transcript.split()) if transcript and transcript.strip() else 0
```

A Chinese transcript of 150+ characters — equivalent to ~100 English words in information density — produces a `word_count` of 1–5 after `split()`, because Chinese has no word-boundary spaces.

This means:

  1. The fast-path at line 167 is never triggered for CJK content, regardless of actual length.
  2. The `word_count` injected into the LLM prompt at line 190 is misleadingly low.
  3. Combined with the stricter discard bar applied when `duration_seconds < 120` (lines 191–196), short but substantive CJK conversations are systematically over-discarded.

Concrete example

A 105-second Chinese transcript (1 segment, ~100 characters):

```
那个换目的地的正常要传就是情况说明要传系统吗 不是就那个 出口的 就他那个划清的那个...
```

  • `len(transcript.split())` → ~15 (only whitespace-separated chunks)
  • Fast-path skip: not triggered
  • LLM prompt gets: `Word count: 15 words` (misleading)
  • Duration < 120 s → higher discard bar applied
  • Result: discarded, despite meaningful content
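The numbers above are easy to reproduce; a minimal sketch of the two checks (mirroring lines 167 and 170, using a shortened version of the sample transcript):

```python
# Shortened sample transcript: three whitespace-separated chunks of Chinese.
transcript = "那个换目的地的正常要传就是情况说明要传系统吗 不是就那个 出口的"

# Fast-path check (line 167): never fires for CJK, since split(' ')
# yields only a handful of chunks regardless of character count.
fast_path = bool(transcript) and len(transcript.split(' ')) > 100

# word_count (line 170): counts whitespace-separated chunks, not CJK words.
word_count = len(transcript.split()) if transcript and transcript.strip() else 0

print(fast_path, word_count)  # → False 3
```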

Expected behavior

CJK transcripts with substantial character count should not be undercounted. The fast-path and word-count heuristic should account for character-based languages.

Suggested fix

```python
import unicodedata

def _word_count(text: str) -> int:
    """Estimate word count for both space-delimited and CJK text."""
    cjk_chars = sum(1 for c in text if unicodedata.east_asian_width(c) in ('W', 'F'))
    if cjk_chars > len(text) * 0.3:
        # CJK-dominant: use character count / 2 as proxy for word count
        return cjk_chars // 2
    return len(text.split())
```

Apply to both the fast-path check (line 167) and the `word_count` variable (line 170).
