Summary
`should_discard_conversation` in `backend/utils/llm/conversation_processing.py` computes word count and applies the fast-path skip using Python's `str.split()`. This is correct for space-delimited languages (English, etc.) but silently undercounts CJK transcripts (Chinese, Japanese, Korean), where words are not separated by whitespace.
Root cause
```python
# backend/utils/llm/conversation_processing.py, line 167
if transcript and len(transcript.split(' ')) > 100:
    return False  # fast-path: skip LLM call, keep conversation

# line 170
word_count = len(transcript.split()) if transcript and transcript.strip() else 0
```
A Chinese transcript of 150+ characters, roughly the information density of ~100 English words, produces a `word_count` of only 1–5 after `split()`, because Chinese text has no word-boundary spaces (demonstrated in the snippet after the list below).
This means:
- The fast-path at line 167 is never triggered for CJK content, regardless of actual length.
- The `word_count` injected into the LLM prompt at line 190 is misleadingly low.
- Combined with the stricter discard bar applied when `duration_seconds < 120` (lines 191–196), short but substantive CJK conversations are systematically over-discarded.
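A quick demonstration of the undercount, using an English sentence and the first whitespace-delimited chunk of the sample transcript from the concrete example below:
```python
english = "this sentence has exactly six words"
chinese = "那个换目的地的正常要传就是情况说明要传系统吗"  # 22 characters, no internal spaces

print(len(english.split()))  # 6 -- one token per word
print(len(chinese.split()))  # 1 -- the entire sentence counts as a single "word"
```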
Concrete example
A 105-second Chinese transcript (1 segment, ~100 characters):
```
那个换目的地的正常要传就是情况说明要传系统吗 不是就那个 出口的 就他那个划清的那个...
```
- `len(transcript.split())` → ~15 (only the whitespace-separated chunks are counted)
- Fast-path skip: not triggered
- LLM prompt gets: `Word count: 15 words` (misleading)
- Duration < 120 s → higher discard bar applied
- Result: discarded, despite meaningful content
Expected behavior
CJK transcripts with substantial character count should not be undercounted. The fast-path and word-count heuristic should account for character-based languages.
Suggested fix
```python
import unicodedata

def _word_count(text: str) -> int:
    """Estimate word count for both space-delimited and CJK text."""
    cjk_chars = sum(1 for c in text if unicodedata.east_asian_width(c) in ('W', 'F'))
    if cjk_chars > len(text) * 0.3:
        # CJK-dominant: use character count / 2 as a proxy for word count
        return cjk_chars // 2
    return len(text.split())
```
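For example, applied to the English sentence and the transcript chunk used earlier (the CJK branch returns `cjk_chars // 2`):
```python
>>> _word_count("this sentence has exactly six words")
6
>>> _word_count("那个换目的地的正常要传就是情况说明要传系统吗")  # 22 CJK characters
11
```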
Apply the helper at both call sites: the fast-path check (line 167) and the `word_count` assignment (line 170).
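A minimal sketch of the two patched call sites, assuming the surrounding code matches the excerpt in the root-cause section above:
```python
# line 167: the fast-path now also fires for long CJK transcripts
if transcript and _word_count(transcript) > 100:
    return False  # skip LLM call, keep conversation

# line 170
word_count = _word_count(transcript) if transcript and transcript.strip() else 0
```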