Backend: should_discard_conversation uses whitespace split for word count, silently undercounts CJK transcripts #7065

@waffensam

Description

Summary

`should_discard_conversation` in `backend/utils/llm/conversation_processing.py` computes word count and applies the fast-path skip using Python's `str.split()`. This is correct for space-delimited languages (English, etc.) but silently undercounts CJK transcripts (Chinese, Japanese, Korean), where words are not separated by whitespace.

Root cause

```python
# backend/utils/llm/conversation_processing.py, line 167
if transcript and len(transcript.split(' ')) > 100:
    return False  # fast-path: skip LLM call, keep conversation

# line 170
word_count = len(transcript.split()) if transcript and transcript.strip() else 0
```

A Chinese transcript of 150+ characters — equivalent to ~100 English words in information density — produces a `word_count` of 1–5 after `split()`, because Chinese has no word-boundary spaces.

This means:

  1. The fast-path at line 167 is never triggered for CJK content, regardless of actual length.
  2. The `word_count` injected into the LLM prompt at line 190 is misleadingly low.
  3. Combined with the stricter discard bar applied when `duration_seconds < 120` (lines 191–196), short but substantive CJK conversations are systematically over-discarded.

Concrete example

A 105-second Chinese transcript (1 segment, ~100 characters):

```
那个换目的地的正常要传就是情况说明要传系统吗 不是就那个 出口的 就他那个划清的那个...
```

  • `len(transcript.split())` → ~15 (only whitespace-separated chunks)
  • Fast-path skip: not triggered
  • LLM prompt gets: `Word count: 15 words` (misleading)
  • Duration < 120 s → higher discard bar applied
  • Result: discarded, despite meaningful content
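The numbers above are easy to reproduce; a minimal sketch of the two checks (mirroring lines 167 and 170, using a shortened version of the sample transcript):

```python
# Shortened sample transcript: three whitespace-separated chunks of Chinese.
transcript = "那个换目的地的正常要传就是情况说明要传系统吗 不是就那个 出口的"

# Fast-path check (line 167): never fires for CJK, since split(' ')
# yields only a handful of chunks regardless of character count.
fast_path = bool(transcript) and len(transcript.split(' ')) > 100

# word_count (line 170): counts whitespace-separated chunks, not CJK words.
word_count = len(transcript.split()) if transcript and transcript.strip() else 0

print(fast_path, word_count)  # → False 3
```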

Expected behavior

CJK transcripts with substantial character count should not be undercounted. The fast-path and word-count heuristic should account for character-based languages.

Suggested fix

```python
import unicodedata

def _word_count(text: str) -> int:
    """Estimate word count for both space-delimited and CJK text."""
    cjk_chars = sum(1 for c in text if unicodedata.east_asian_width(c) in ('W', 'F'))
    if cjk_chars > len(text) * 0.3:
        # CJK-dominant: use character count / 2 as proxy for word count
        return cjk_chars // 2
    return len(text.split())
```

Apply to both the fast-path check (line 167) and the `word_count` variable (line 170).
