Feat: Improve Translation quality and performance. #11

Merged
SloMR merged 15 commits into main from feat/ImproveTranslation on Apr 21, 2026

SloMR commented Apr 21, 2026

This pull request updates both the documentation (README.md) and core CLI implementation to introduce a new pre-translation "cast & register prepass" for better consistency, improve language auto-detection, enhance validation and retry logic, and make the CLI interface clearer and more robust. The changes also adjust default behaviors and add new configuration options.

Key improvements and new features:

Major pipeline and feature enhancements

  • Added a "cast & register prepass" step to extract characters, recurring terms, and written register before translation, ensuring consistent translation of names and formality across batches. This is now included in both the documentation and the CLI pipeline. [1] [2] [3]
  • Enabled auto-detection of the source language by default, both in the web and CLI interfaces, simplifying usage for mixed-language files. [1] [2] [3]
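
To make the prepass idea concrete, here is a minimal sketch of how a file-level context could be narrowed to a batch-scoped glossary. The PR only names `FileContext`; the fields shown here, the `render_glossary` helper, and the `source -> target` line format are assumptions for illustration, not the project's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class FileContext:
    """Hypothetical shape of the prepass result; field names are assumptions."""
    register: str = ""
    characters: dict[str, str] = field(default_factory=dict)  # source name -> target rendering
    terms: dict[str, str] = field(default_factory=dict)       # recurring term -> target rendering


def render_glossary(ctx: FileContext, batch_texts: list[str]) -> str:
    """Return glossary lines only for entries that occur in this batch."""
    joined = "\n".join(batch_texts).lower()
    lines = []
    if ctx.register:
        lines.append(f"Register: {ctx.register}")
    # Characters come first, then recurring terms (dict order is preserved).
    for source, target in {**ctx.characters, **ctx.terms}.items():
        if source.lower() in joined:
            lines.append(f"{source} -> {target}")
    return "\n".join(lines)


ctx = FileContext(
    register="informal",
    characters={"Anna": "Ana", "Dr. Weiss": "Dra. Weiss"},
    terms={"the Archive": "el Archivo"},
)
glossary = render_glossary(ctx, ["Anna, wait!", "We must reach the Archive."])
print(glossary)
```

Scoping the glossary to the batch keeps each request short: entries for names that do not appear in the batch (here, "Dr. Weiss") are left out.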

Validation and retry improvements

  • Improved validation logic: on repeated validation failures, batches are recursively split down to single blocks before giving up, increasing robustness for difficult files. [1] [2] [3]
  • Enhanced retry/back-off logic, including more precise handling of rate limits and batch splitting.
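
The split-on-failure behavior can be sketched as follows. This is an illustrative outline only: `translate_with_split`, the tuple-based `Block` type, and the fixed `attempts` count are assumptions, and the real runner additionally handles back-off and rate limits.

```python
from typing import Callable

Block = tuple[int, str]  # (number, text); a stand-in for the project's block type


def translate_with_split(
    batch: list[Block],
    translate: Callable[[list[Block]], list[Block]],
    is_valid: Callable[[list[Block], list[Block]], bool],
    attempts: int = 2,
) -> list[Block]:
    """Try a batch a few times; on repeated validation failure, halve and recurse."""
    for _ in range(attempts):
        output = translate(batch)
        if is_valid(batch, output):
            return output
    if len(batch) == 1:
        # Nothing left to split: give up on this block.
        raise ValueError(f"block {batch[0][0]} failed validation after {attempts} attempts")
    mid = len(batch) // 2
    left = translate_with_split(batch[:mid], translate, is_valid, attempts)
    right = translate_with_split(batch[mid:], translate, is_valid, attempts)
    return left + right


def flaky_translate(batch: list[Block]) -> list[Block]:
    """Toy model that drops the last block whenever it gets more than two."""
    kept = batch if len(batch) <= 2 else batch[:-1]
    return [(n, text.upper()) for n, text in kept]


def same_numbers(src: list[Block], out: list[Block]) -> bool:
    return [n for n, _ in src] == [n for n, _ in out]


blocks = [(i, f"line {i}") for i in range(1, 6)]
result = translate_with_split(blocks, flaky_translate, same_numbers)
```

In this toy run the five-block batch fails, is halved into batches of two and three, and the three-block half is halved again, so every block is eventually translated.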

CLI and configuration changes

  • Changed the default batch size from 15 to 10 for better reliability with small/medium models, and updated the CLI/configuration accordingly. [1] [2] [3]
  • Added a verbose flag to the CLI to show retry/validation warnings, which are now hidden by default for cleaner output. [1] [2]
  • Clarified CLI usage and documentation, including clearer flag descriptions and examples for both auto-detect and explicit source language. [1] [2]
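
As a rough picture of the new defaults, the sketch below wires them into an `argparse` parser. Only `--source` and `--verbose` are named in this PR; the positional `input` argument and the `--batch-size` flag name are assumptions, and the real `cli/translora.py` may be structured differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="translora")
    # Positional input file: an assumption for this sketch.
    parser.add_argument("input", help="subtitle file to translate")
    parser.add_argument("--source", default=None,
                        help="source language; omit to auto-detect")
    # Flag name is an assumption; the PR only states the default changed to 10.
    parser.add_argument("--batch-size", type=int, default=10,
                        help="blocks per request (default: 10)")
    parser.add_argument("--verbose", action="store_true",
                        help="show retry/validation warnings")
    return parser


# Auto-detect source, default batch size, warnings hidden.
args = build_parser().parse_args(["movie.srt"])
# Explicit source language with warnings shown.
explicit = build_parser().parse_args(["movie.srt", "--source", "ja", "--verbose"])
```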

Codebase cleanup and minor improvements

  • Refactored and clarified docstrings and comments for better readability and maintainability. [1] [2] [3] [4]
  • Updated the roadmap to reflect the new "cast & register prepass" and removed the now-implemented translation memory feature.

SloMR added 15 commits April 21, 2026 18:21
- Added functions to handle subtitle blocks without timestamps for easier parsing.
- Updated translation logic to enhance handling of subtitle blocks.
- Improved model interaction by eliminating noise from timestamps.
- Streamlined serialization of subtitle data, boosting translation accuracy.
- Updated subtitle serialization to exclude timestamps, improving translation accuracy.
- Added logic to reattach timestamps after translation.
- Simplified input format in translation prompt to align with changes.
- Integrates file context scanning for consistent character names and term translations.
- Enhances translation accuracy by addressing gendered-pronoun errors.
- Introduces a glossary system to improve scene-context translations.
- Provides better support for language-specific nuances in subtitle translations.
- Introduces batch splitting on validation failures to enhance translation accuracy.
- Adds register consistency for coherent translations across subtitle files.
- Ensures non-empty inputs do not produce empty outputs.
- Improves context detection for better glossary generation.
- Provides more granular retry logic, benefiting subtitle translation stability.
- Introduces context scanning for gendered-pronoun accuracy.
- Ensures consistent language register and terminology.
- Improves batch handling with retry logic and batch splitting.
- Enhances translation fidelity for non-Latin languages.
- Enables automatic detection of source language for translations.
- Enhances usability for mixed-language subtitle batches.
- Updates UI to include an "Auto-detect" option in language selection.
- Improves translation workflow by eliminating the need to specify a source language.
- Introduces a verbose flag to control the visibility of retry and validation warnings.
- By default, warnings are suppressed, enhancing user experience during translation tasks.
- Verbose mode helps users monitor detailed translation attempts and issues.
- Introduces contextual hints for form fields to guide user input.
- Enhances user experience by clarifying field purposes and options.
- Adds advice for optional fields and server-specific inputs.
- Provides better visual structure and alignment in the UI.
- Decreased default batch size from 15 to 10 for better handling.
- Updated help text to reflect batch size adjustment.
- Aligns batch size across CLI and web for consistency.
SloMR self-assigned this Apr 21, 2026
Copilot AI review requested due to automatic review settings April 21, 2026 18:14
SloMR merged commit 51d84ff into main Apr 21, 2026
5 checks passed
SloMR deleted the feat/ImproveTranslation branch April 21, 2026 18:15

Copilot AI left a comment

Pull request overview

This PR improves subtitle translation consistency and robustness across both the web app and CLI by adding a file-level context prepass (“cast & register”), enabling source-language auto-detection by default, and refining batching/validation/retry behavior.

Changes:

  • Added a one-shot context scan to extract register/characters/terms and attach a batch-scoped glossary to each translation request (web + CLI).
  • Switched to a timestamp-free “lite” wire format for LLM calls and added recursive batch splitting on repeated validation failures; updated defaults (batch size 10, source auto-detect) and added CLI --verbose.
  • Updated UI hints, CLI help text, and README; bumped versions to 0.4.0.

Reviewed changes

Copilot reviewed 29 out of 32 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| web/src/index.html | Removes alternate favicon link while keeping SVG favicon setup. |
| web/src/app/core/translation.service.ts | Adds context prepass + glossary injection; refactors request building; adds split-on-validation-failure logic. |
| web/src/app/core/translation-prompt.ts | Updates system prompt for “no timestamps” wire format and supports optional glossary in user message. |
| web/src/app/core/time-tracker.ts | Refactors comments/docstrings for brevity/clarity (no logic change). |
| web/src/app/core/subtitle-formats/types.ts | Shortens interface documentation (no functional change). |
| web/src/app/core/subtitle-formats/index.ts | Shortens parsing documentation (no functional change). |
| web/src/app/core/srt-parser.ts | Adds serializeLite/parseLite wire format helpers for timestamp-free translation requests. |
| web/src/app/core/languages.ts | Condenses comments on language metadata (no functional change). |
| web/src/app/core/file-types.ts | Clarifies upload parsing comment (no functional change). |
| web/src/app/core/context-pass.ts | New web context-pass implementation (scan budget, parsing tagged response, batch-scoped glossary rendering). |
| web/src/app/app.component.ts | Defaults source language to auto-detect and adjusts swap behavior; comment cleanups. |
| web/src/app/app.component.scss | Adds layout/hint styles for language + advanced form hints. |
| web/src/app/app.component.html | Adds “Auto-detect” source option and explanatory field hints across advanced/basic sections. |
| web/public/favicon.ico | Includes favicon asset (supports browsers that request .ico). |
| web/package.json | Bumps web app version to 0.4.0. |
| web/package-lock.json | Updates lockfile version metadata to 0.4.0. |
| cli/uv.lock | Bumps CLI package version to 0.4.0 in lock metadata. |
| cli/translora.py | Makes --source optional (auto-detect default), changes default batch size to 10, adds --verbose, refreshes help/examples. |
| cli/tests/test_srt_parser.py | Adds validation coverage for “empty output text” cases. |
| cli/tests/test_context_pass.py | Adds tests for context-pass parsing/rendering and scan sampling. |
| cli/pyproject.toml | Bumps CLI package version to 0.4.0. |
| cli/core/translator.py | Adds per-file context scan step and passes FileContext into batch translation. |
| cli/core/time_tracker.py | Condenses docstrings (no logic change). |
| cli/core/srt_parser.py | Adds serialize_lite/parse_lite and extends validation to catch empty output on non-empty input. |
| cli/core/prompt.py | Updates system prompt for “no timestamps” wire format + glossary/register guidance. |
| cli/core/live_status.py | Condenses docstrings/comments (no logic change). |
| cli/core/formats/types.py | Condenses SubtitleDocument docstring (no functional change). |
| cli/core/formats/__init__.py | Condenses module docstring (no functional change). |
| cli/core/context_pass.py | New CLI context-pass implementation + scan extraction call. |
| cli/core/config.py | Adds verbose flag and makes warnings silent by default (reroutable). |
| cli/core/batch_runner.py | Adds lite wire format, glossary user message, recursive split on validation failure, and updated retry/warn behavior. |
| README.md | Documents prepass/context scan, auto-detect source, new defaults/flags, and pipeline updates. |
Files not reviewed (1)
  • web/package-lock.json: Language not supported

Comment thread cli/core/batch_runner.py
Comment on lines +144 to +148
if len(output) == len(batch):
    output = [
        SubtitleBlock(number=batch[i].number,
                      timestamp=batch[i].timestamp,
                      text=output[i].text)

Copilot AI Apr 21, 2026

Here the parsed lite output is rewritten with the input batch’s numbers/timestamps before validation. That prevents validate_batch from catching incorrect numbering or reordered blocks (those fields get overwritten), which can silently misalign text with timestamps. Prefer validating the returned numbering/order first, and then reattaching timestamps by matching on the returned block number (or only overwriting timestamps).
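
One way to picture the suggestion is the sketch below, which checks the returned numbering before touching anything and then copies only timestamps, keyed by block number. The `SubtitleBlock` dataclass is a stand-in built from the three fields visible in the diff, and `reattach_by_number` is a hypothetical helper, not code from this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubtitleBlock:
    """Stand-in with the three fields visible in the diff excerpt."""
    number: int
    timestamp: str
    text: str


def reattach_by_number(batch: list[SubtitleBlock],
                       output: list[SubtitleBlock]) -> list[SubtitleBlock]:
    """Validate numbering/order first, then copy only timestamps from the input."""
    expected = [b.number for b in batch]
    returned = [b.number for b in output]
    if returned != expected:
        raise ValueError(f"numbering mismatch: expected {expected}, got {returned}")
    timestamps = {b.number: b.timestamp for b in batch}
    return [
        SubtitleBlock(number=o.number, timestamp=timestamps[o.number], text=o.text)
        for o in output
    ]


batch = [
    SubtitleBlock(1, "00:00:01,000 --> 00:00:02,000", "Hello"),
    SubtitleBlock(2, "00:00:03,000 --> 00:00:04,000", "Goodbye"),
]
# Model output in the lite format carries no timestamps.
good = [SubtitleBlock(1, "", "Hola"), SubtitleBlock(2, "", "Adiós")]
reordered = [SubtitleBlock(2, "", "Adiós"), SubtitleBlock(1, "", "Hola")]

fixed = reattach_by_number(batch, good)
```

With this ordering of steps, a reordered response such as `reordered` raises instead of silently pairing text with the wrong timestamp.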

Copilot uses AI. Check for mistakes.
Comment thread cli/core/batch_runner.py
Comment on lines +133 to +135
can_split = len(batch) > 1
attempts = _ATTEMPTS_BEFORE_SPLIT if can_split else cfg.max_retries
hit_validation_failure = False

Copilot AI Apr 21, 2026

attempts is set to _ATTEMPTS_BEFORE_SPLIT for any batch with >1 block, which limits retries for transient HTTP/network errors as well as validation failures. This can reduce resilience to 429/5xx spikes. Consider keeping cfg.max_retries for request failures, and only triggering split-after-N when validation keeps failing.
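
A minimal sketch of the separation described above, assuming two independent counters: request failures are retried up to `max_retries`, while only validation failures count toward the split threshold. `RequestError`, `run_batch`, and the `None` return used to signal "split this batch" are illustrative assumptions, not the project's API.

```python
from typing import Callable, Optional


class RequestError(Exception):
    """Stand-in for a transient HTTP/network failure (429, 5xx, timeout)."""


def run_batch(
    batch: list[str],
    call_model: Callable[[list[str]], list[str]],
    is_valid: Callable[[list[str], list[str]], bool],
    max_retries: int = 5,
    attempts_before_split: int = 2,
) -> Optional[list[str]]:
    """Return the output, or None to tell the caller to split the batch."""
    request_failures = 0
    validation_failures = 0
    can_split = len(batch) > 1
    validation_limit = attempts_before_split if can_split else max_retries
    while True:
        try:
            output = call_model(batch)
        except RequestError:
            request_failures += 1
            if request_failures >= max_retries:
                raise
            continue  # real code would sleep/back off before retrying
        if is_valid(batch, output):
            return output
        validation_failures += 1
        if validation_failures >= validation_limit:
            if can_split:
                return None
            raise ValueError("single block kept failing validation")


calls = {"n": 0}


def rate_limited_model(batch: list[str]) -> list[str]:
    """Fails three times with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] <= 3:
        raise RequestError("429 Too Many Requests")
    return [text.upper() for text in batch]


result = run_batch(["a", "b"], rate_limited_model,
                   lambda src, out: len(src) == len(out))
```

Here a two-block batch survives three transient failures even though the split threshold is two, because request errors do not count toward it.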

Comment thread cli/core/srt_parser.py
Comment on lines +51 to +57
# Wire format: number + text only. Timestamps are stripped before sending
# because small models sometimes corrupt them; callers reattach positionally.
def serialize_lite(blocks: list[SubtitleBlock]) -> str:
    return "\n\n".join(f"{b.number}\n{b.text}" for b in blocks) + "\n"


def parse_lite(content: str) -> list[SubtitleBlock]:

Copilot AI Apr 21, 2026

serialize_lite/parse_lite are new parsing/serialization paths but the existing tests don’t appear to cover them (e.g., multi-line text, trailing newlines, invalid blocks). Adding focused tests for these helpers would reduce the risk of wire-format regressions.
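
The kind of focused tests meant here might look like the sketch below. `serialize_lite` is copied from the diff excerpt; the body of `parse_lite` is not shown in this thread, so the version here is a hypothetical stand-in (including its choice to skip invalid blocks and leave timestamps empty), and the real tests should target the project's own function.

```python
from dataclasses import dataclass


@dataclass
class SubtitleBlock:
    number: int
    timestamp: str
    text: str


def serialize_lite(blocks: list[SubtitleBlock]) -> str:
    # As quoted in the diff excerpt.
    return "\n\n".join(f"{b.number}\n{b.text}" for b in blocks) + "\n"


def parse_lite(content: str) -> list[SubtitleBlock]:
    # Hypothetical stand-in: the real implementation is not shown in this thread.
    blocks = []
    for chunk in content.strip().split("\n\n"):
        lines = chunk.strip().split("\n")
        if len(lines) < 2 or not lines[0].strip().isdigit():
            continue  # assumption: invalid blocks are skipped rather than raising
        blocks.append(SubtitleBlock(int(lines[0]), "", "\n".join(lines[1:])))
    return blocks


def test_round_trip_multiline_text():
    blocks = [SubtitleBlock(1, "", "Hello\nthere"), SubtitleBlock(2, "", "Bye")]
    assert parse_lite(serialize_lite(blocks)) == blocks


def test_trailing_newlines_are_ignored():
    assert parse_lite("1\nHi\n\n\n\n") == [SubtitleBlock(1, "", "Hi")]


def test_invalid_block_is_skipped():
    parsed = parse_lite("abc\nno number here\n\n2\nBye\n")
    assert parsed == [SubtitleBlock(2, "", "Bye")]
```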

Comment on lines +160 to +164
// Wire format strips timestamps; reattach positionally.
if (output.length === inputBlocks.length) {
  output = output.map((b, i) => ({
    number: inputBlocks[i].number,
    timestamp: inputBlocks[i].timestamp,

Copilot AI Apr 21, 2026

The code reattaches block numbers/timestamps positionally before calling validateBatch. This means validateBatch can no longer detect a model that outputs incorrect block numbers or reorders blocks (the mapping overwrites those fields), which can silently attach the wrong text to a timestamp. Consider validating the model’s returned numbering/order first, and then reattaching timestamps by matching on the returned block number (or only overwriting timestamps, not numbers).

Comment on lines +140 to +143
const canSplit = inputBlocks.length > 1;
// Splittable batches give up early — halving resolves persistent count
// mismatches faster than more retries on the same payload.
const attempts = canSplit ? ATTEMPTS_BEFORE_SPLIT : maxRetries;

Copilot AI Apr 21, 2026

attempts is reduced to ATTEMPTS_BEFORE_SPLIT for any batch with >1 block, which also limits retries for transient HTTP/network errors (not just validation mismatches). This makes large batches much more likely to fail on temporary 429/5xx. Consider using maxRetries for request failures, and only switching to split-after-N logic when repeated validation failures occur.

Comment on lines +48 to +52
// Wire format: number + text only. Timestamps are stripped before sending
// because small models sometimes corrupt them; callers reattach positionally.
export function serializeLite(blocks: SubtitleBlock[]): string {
  return blocks.map((b) => `${b.number}\n${b.text}`).join('\n\n') + '\n';
}

Copilot AI Apr 21, 2026

New wire-format helpers (serializeLite/parseLite) are introduced here but there are no unit tests covering their round-trip behavior, multi-line text handling, or invalid input tolerance. Since this module already has a spec file, adding tests for these functions would help prevent subtle parsing regressions.
