Feat: Improve translation quality and performance. #11
- Added functions to handle subtitle blocks without timestamps for easier parsing.
- Updated translation logic to enhance handling of subtitle blocks.
- Improved model interaction by eliminating noise from timestamps.
- Streamlined serialization of subtitle data, boosting translation accuracy.

- Updated subtitle serialization to exclude timestamps, improving translation accuracy.
- Added logic to reattach timestamps after translation.
- Simplified input format in translation prompt to align with changes.

- Integrates file context scanning for consistent character names and term translations.
- Enhances translation accuracy by addressing gendered-pronoun errors.
- Introduces a glossary system to improve scene-context translations.
- Provides better support for language-specific nuances in subtitle translations.

- Introduces batch splitting on validation failures to enhance translation accuracy.
- Adds register consistency for coherent translations across subtitle files.
- Ensures non-empty inputs do not produce empty outputs.
- Improves context detection for better glossary generation.
- Provides more granular retry logic, benefiting subtitle translation stability.

- Introduces context scanning for gendered-pronoun accuracy.
- Ensures consistent language register and terminology.
- Improves batch handling with retry logic and batch splitting.
- Enhances translation fidelity for non-Latin languages.

- Enables automatic detection of source language for translations.
- Enhances usability for mixed-language subtitle batches.
- Updates UI to include an "Auto-detect" option in language selection.
- Improves translation workflow by eliminating the need to specify a source language.

- Introduces a verbose flag to control the visibility of retry and validation warnings.
- By default, warnings are suppressed, enhancing user experience during translation tasks.
- Verbose mode helps users monitor detailed translation attempts and issues.

- Introduces contextual hints for form fields to guide user input.
- Enhances user experience by clarifying field purposes and options.
- Adds advice for optional fields and server-specific inputs.
- Provides better visual structure and alignment in the UI.

- Decreased default batch size from 15 to 10 for better handling.
- Updated help text to reflect batch size adjustment.
- Aligns batch size across CLI and web for consistency.
Pull request overview
This PR improves subtitle translation consistency and robustness across both the web app and CLI by adding a file-level context prepass (“cast & register”), enabling source-language auto-detection by default, and refining batching/validation/retry behavior.
Changes:
- Added a one-shot context scan to extract register/characters/terms and attach a batch-scoped glossary to each translation request (web + CLI).
- Switched to a timestamp-free “lite” wire format for LLM calls and added recursive batch splitting on repeated validation failures; updated defaults (batch size 10, source auto-detect) and added CLI `--verbose`.
- Updated UI hints, CLI help text, and README; bumped versions to 0.4.0.
Reviewed changes
Copilot reviewed 29 out of 32 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| web/src/index.html | Removes alternate favicon link while keeping SVG favicon setup. |
| web/src/app/core/translation.service.ts | Adds context prepass + glossary injection; refactors request building; adds split-on-validation-failure logic. |
| web/src/app/core/translation-prompt.ts | Updates system prompt for “no timestamps” wire format and supports optional glossary in user message. |
| web/src/app/core/time-tracker.ts | Refactors comments/docstrings for brevity/clarity (no logic change). |
| web/src/app/core/subtitle-formats/types.ts | Shortens interface documentation (no functional change). |
| web/src/app/core/subtitle-formats/index.ts | Shortens parsing documentation (no functional change). |
| web/src/app/core/srt-parser.ts | Adds serializeLite/parseLite wire format helpers for timestamp-free translation requests. |
| web/src/app/core/languages.ts | Condenses comments on language metadata (no functional change). |
| web/src/app/core/file-types.ts | Clarifies upload parsing comment (no functional change). |
| web/src/app/core/context-pass.ts | New web context-pass implementation (scan budget, parsing tagged response, batch-scoped glossary rendering). |
| web/src/app/app.component.ts | Defaults source language to auto-detect and adjusts swap behavior; comment cleanups. |
| web/src/app/app.component.scss | Adds layout/hint styles for language + advanced form hints. |
| web/src/app/app.component.html | Adds “Auto-detect” source option and explanatory field hints across advanced/basic sections. |
| web/public/favicon.ico | Includes favicon asset (supports browsers that request .ico). |
| web/package.json | Bumps web app version to 0.4.0. |
| web/package-lock.json | Updates lockfile version metadata to 0.4.0. |
| cli/uv.lock | Bumps CLI package version to 0.4.0 in lock metadata. |
| cli/translora.py | Makes --source optional (auto-detect default), changes default batch size to 10, adds --verbose, refreshes help/examples. |
| cli/tests/test_srt_parser.py | Adds validation coverage for “empty output text” cases. |
| cli/tests/test_context_pass.py | Adds tests for context-pass parsing/rendering and scan sampling. |
| cli/pyproject.toml | Bumps CLI package version to 0.4.0. |
| cli/core/translator.py | Adds per-file context scan step and passes FileContext into batch translation. |
| cli/core/time_tracker.py | Condenses docstrings (no logic change). |
| cli/core/srt_parser.py | Adds serialize_lite/parse_lite and extends validation to catch empty output on non-empty input. |
| cli/core/prompt.py | Updates system prompt for “no timestamps” wire format + glossary/register guidance. |
| cli/core/live_status.py | Condenses docstrings/comments (no logic change). |
| cli/core/formats/types.py | Condenses SubtitleDocument docstring (no functional change). |
| cli/core/formats/\_\_init\_\_.py | Condenses module docstring (no functional change). |
| cli/core/context_pass.py | New CLI context-pass implementation + scan extraction call. |
| cli/core/config.py | Adds verbose flag and makes warnings silent by default (reroutable). |
| cli/core/batch_runner.py | Adds lite wire format, glossary user message, recursive split on validation failure, and updated retry/warn behavior. |
| README.md | Documents prepass/context scan, auto-detect source, new defaults/flags, and pipeline updates. |
Files not reviewed (1)
- web/package-lock.json: Language not supported

    if len(output) == len(batch):
        output = [
            SubtitleBlock(number=batch[i].number,
                          timestamp=batch[i].timestamp,
                          text=output[i].text)

Here the parsed lite output is rewritten with the input batch’s numbers/timestamps before validation. That prevents validate_batch from catching incorrect numbering or reordered blocks (those fields get overwritten), which can silently misalign text with timestamps. Prefer validating the returned numbering/order first, and then reattaching timestamps by matching on the returned block number (or only overwriting timestamps).
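
One way to apply this suggestion is sketched below. It is a minimal illustration, not the project's code: the `SubtitleBlock` dataclass is a stand-in for the real type, and `reattach_timestamps` is a hypothetical helper name. The sketch checks count, numbering, and order first, and only then restores timestamps by block number.

```python
from dataclasses import dataclass


@dataclass
class SubtitleBlock:
    # Hypothetical stand-in for the project's SubtitleBlock type.
    number: int
    timestamp: str
    text: str


def reattach_timestamps(batch, output):
    """Return translated blocks with timestamps restored, or None on mismatch.

    Hypothetical helper: the caller would treat None as a validation failure.
    """
    expected = [b.number for b in batch]
    returned = [b.number for b in output]
    # Compare count, numbering, and order before touching any field, so a
    # reordered or renumbered reply is rejected instead of silently realigned.
    if returned != expected:
        return None
    timestamp_by_number = {b.number: b.timestamp for b in batch}
    # Only the timestamp is overwritten; number and text stay as returned.
    return [
        SubtitleBlock(
            number=o.number,
            timestamp=timestamp_by_number[o.number],
            text=o.text,
        )
        for o in output
    ]
```

With this ordering, a reply that swaps blocks 1 and 2 yields `None` rather than text attached to the wrong timestamp.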

    can_split = len(batch) > 1
    attempts = _ATTEMPTS_BEFORE_SPLIT if can_split else cfg.max_retries
    hit_validation_failure = False

attempts is set to _ATTEMPTS_BEFORE_SPLIT for any batch with >1 block, which limits retries for transient HTTP/network errors as well as validation failures. This can reduce resilience to 429/5xx spikes. Consider keeping cfg.max_retries for request failures, and only triggering split-after-N when validation keeps failing.
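
A rough sketch of the separation suggested here follows, assuming two hypothetical exception types (`RequestError`, `ValidationError`) and a hypothetical `call_model` callable that either returns the translated batch or raises one of them. None of these names come from the PR. Request failures count against `max_retries`, while only repeated validation failures trigger a split. Backoff between attempts is omitted for brevity.

```python
class RequestError(Exception):
    """Hypothetical: transient HTTP/network failure such as 429 or 5xx."""


class ValidationError(Exception):
    """Hypothetical: the model reply failed batch validation."""


def translate_with_split(batch, call_model, max_retries=5, attempts_before_split=2):
    """Translate a batch, halving it only after repeated validation failures."""
    request_failures = 0
    validation_failures = 0
    can_split = len(batch) > 1
    while True:
        try:
            return call_model(batch)
        except RequestError:
            # Transient errors always get the full retry budget.
            request_failures += 1
            if request_failures >= max_retries:
                raise
        except ValidationError:
            validation_failures += 1
            if can_split and validation_failures >= attempts_before_split:
                # Persistent validation failure: halve the batch and recurse.
                mid = len(batch) // 2
                left = translate_with_split(
                    batch[:mid], call_model, max_retries, attempts_before_split
                )
                right = translate_with_split(
                    batch[mid:], call_model, max_retries, attempts_before_split
                )
                return left + right
            if not can_split and validation_failures >= max_retries:
                # A single block cannot be split further; give up.
                raise
```

The two counters are independent, so a burst of 429 responses does not push a healthy batch toward an early split.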

    # Wire format: number + text only. Timestamps are stripped before sending
    # because small models sometimes corrupt them; callers reattach positionally.
    def serialize_lite(blocks: list[SubtitleBlock]) -> str:
        return "\n\n".join(f"{b.number}\n{b.text}" for b in blocks) + "\n"


    def parse_lite(content: str) -> list[SubtitleBlock]:

serialize_lite/parse_lite are new parsing/serialization paths but the existing tests don’t appear to cover them (e.g., multi-line text, trailing newlines, invalid blocks). Adding focused tests for these helpers would reduce the risk of wire-format regressions.
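
The tests requested here might look roughly like the sketch below. `serialize_lite` is copied from the diff, but the body of `parse_lite` is not visible in this excerpt, so the version here is a hypothetical stand-in (it skips blocks whose first line is not a number); the real helper may behave differently. `SubtitleBlock` is likewise a stand-in dataclass.

```python
from dataclasses import dataclass


@dataclass
class SubtitleBlock:
    # Hypothetical stand-in for the project's SubtitleBlock type.
    number: int
    timestamp: str
    text: str


def serialize_lite(blocks: list[SubtitleBlock]) -> str:
    # Same expression as in the diff excerpt.
    return "\n\n".join(f"{b.number}\n{b.text}" for b in blocks) + "\n"


def parse_lite(content: str) -> list[SubtitleBlock]:
    # Hypothetical implementation, written only so the tests can run.
    blocks = []
    for chunk in content.strip().split("\n\n"):
        lines = chunk.strip().split("\n")
        if len(lines) < 2 or not lines[0].strip().isdecimal():
            continue  # assumption: invalid blocks are skipped, not raised
        blocks.append(SubtitleBlock(int(lines[0]), "", "\n".join(lines[1:])))
    return blocks


def test_round_trip_multiline():
    blocks = [SubtitleBlock(1, "", "Hello\nthere"), SubtitleBlock(2, "", "Bye")]
    parsed = parse_lite(serialize_lite(blocks))
    assert [(b.number, b.text) for b in parsed] == [(1, "Hello\nthere"), (2, "Bye")]


def test_trailing_newlines_ignored():
    assert len(parse_lite("1\nHi\n\n\n")) == 1


def test_invalid_block_skipped():
    assert [b.number for b in parse_lite("x\nbad\n\n2\nok\n")] == [2]
```

These are written as plain functions with bare `assert` statements so they can be collected by pytest or called directly.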

    // Wire format strips timestamps; reattach positionally.
    if (output.length === inputBlocks.length) {
      output = output.map((b, i) => ({
        number: inputBlocks[i].number,
        timestamp: inputBlocks[i].timestamp,

The code reattaches block numbers/timestamps positionally before calling validateBatch. This means validateBatch can no longer detect a model that outputs incorrect block numbers or reorders blocks (the mapping overwrites those fields), which can silently attach the wrong text to a timestamp. Consider validating the model’s returned numbering/order first, and then reattaching timestamps by matching on the returned block number (or only overwriting timestamps, not numbers).

    const canSplit = inputBlocks.length > 1;
    // Splittable batches give up early — halving resolves persistent count
    // mismatches faster than more retries on the same payload.
    const attempts = canSplit ? ATTEMPTS_BEFORE_SPLIT : maxRetries;

attempts is reduced to ATTEMPTS_BEFORE_SPLIT for any batch with >1 block, which also limits retries for transient HTTP/network errors (not just validation mismatches). This makes large batches much more likely to fail on temporary 429/5xx. Consider using maxRetries for request failures, and only switching to split-after-N logic when repeated validation failures occur.

    // Wire format: number + text only. Timestamps are stripped before sending
    // because small models sometimes corrupt them; callers reattach positionally.
    export function serializeLite(blocks: SubtitleBlock[]): string {
      return blocks.map((b) => `${b.number}\n${b.text}`).join('\n\n') + '\n';
    }

New wire-format helpers (serializeLite/parseLite) are introduced here but there are no unit tests covering their round-trip behavior, multi-line text handling, or invalid input tolerance. Since this module already has a spec file, adding tests for these functions would help prevent subtle parsing regressions.
This pull request updates both the documentation (README.md) and the core CLI implementation to introduce a new pre-translation "cast & register prepass" for better consistency, improve language auto-detection, enhance validation and retry logic, and make the CLI interface clearer and more robust. The changes also adjust default behaviors and add new configuration options.

Key improvements and new features:

Major pipeline and feature enhancements

Validation and retry improvements

CLI and configuration changes
- Added a `verbose` flag to the CLI to show retry/validation warnings, which are now hidden by default for cleaner output.

Codebase cleanup and minor improvements