
Translator

Translates content from a CSV, Excel, plain-text, or HTML file — or directly from a URL — into one or more target languages using a local LLM via Ollama or the OpenRouter API. Each translation is back-translated into the source language and scored for round-trip quality. Results are written back to the output file.

Three model roles are configured independently and may be the same or different models:

Role             Purpose
Translator       Forward translation: source → target
Back-translator  Reverse translation: target → source
Evaluator        Scores semantic similarity of original vs back-translation (0.0–1.0)
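
The round-trip flow implied by the three roles can be sketched as follows; `translate` and `score` here are hypothetical stand-ins for the real model calls, not the tool's actual API:

```python
# Minimal sketch of the round-trip evaluation flow.
# `translate(text, src, tgt)` and `score(a, b)` are placeholders for the
# Translator/Back-translator/Evaluator backends.

def round_trip(phrase: str, translate, score, source: str, target: str):
    """Forward-translate, back-translate, then score the round trip."""
    forward = translate(phrase, source, target)   # Translator role
    back = translate(forward, target, source)     # Back-translator role
    quality = score(phrase, back)                 # Evaluator role, 0.0-1.0
    return forward, back, quality
```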

Prerequisites

uv

Install uv — the package manager used to install and run the tool:

# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows (PowerShell)
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Ollama (only needed if you plan to use local Ollama mode)

NOTE Ollama mode works for demonstration purposes, but throughput limitations make it a poor fit for real workloads; OpenRouter-based translation is recommended for production use.

Install Ollama and pull at least one model:

ollama pull phi4-mini

The tool will attempt to start Ollama automatically if it is not already running. If you prefer to start it yourself, run ollama serve before launching the tool.
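
A liveness check like the one the tool performs before auto-starting Ollama could look roughly like this sketch, which probes Ollama's /api/tags endpoint; the tool's actual startup logic may differ:

```python
# Hypothetical sketch: return True if an Ollama server is already
# listening at base_url, so we know whether to start one ourselves.
import urllib.error
import urllib.request

def ollama_is_running(base_url: str = "http://localhost:11434") -> bool:
    try:
        with urllib.request.urlopen(base_url + "/api/tags", timeout=2):
            return True
    except (urllib.error.URLError, OSError):
        return False
```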

OpenRouter API key (cloud mode only)

If you want to use OpenRouter instead of Ollama, create an account at openrouter.ai and obtain an API key. Then set it as an environment variable:

# macOS / Linux — add to ~/.bashrc or ~/.zshrc to make permanent
export OPENROUTER_API_KEY=your-key-here

# Windows (Command Prompt) — sets permanently for your user account
setx OPENROUTER_API_KEY your-key-here
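
In cloud mode the tool reads this variable at startup. A minimal sketch of such a check (the function name and error message here are illustrative, not the tool's actual code):

```python
# Fail fast with a clear message if the API key is missing.
import os

def get_openrouter_key() -> str:
    key = os.environ.get("OPENROUTER_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENROUTER_API_KEY is not set - see the Prerequisites section"
        )
    return key
```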

Installation

IMPORTANT Do not unzip the project into a folder managed by OneDrive. Copy the archive to a local location on your C: drive and unzip it there; otherwise uv and the installed libraries will fail to work correctly.

Clone or unzip the project files into a new folder. Navigate to the folder in a command window and then install all dependencies from the lockfile:

uv sync --frozen

This creates a .venv in the project directory and installs exact locked versions with SHA256 hash verification.


Usage

# navigate to your environment first
cd C:\Location\of\translator

# activate the project's virtual environment (optional; uv run also works without activation)
.venv\Scripts\activate

# Cloud models via OpenRouter (default config - recommended)
uv run translator --input phrases.csv --output results.csv

# Local models via Ollama (alternate config)
uv run translator --input phrases.csv --output results.csv --config config_ollama.toml

All options

Flag             Default        Description
--input          (optional)     Input file path (.csv, .xlsx, .txt, .html) or a URL. Omit to process all files in input/.
--output         (optional)     Output file path. Omit to overwrite --input, or write to output/ in folder mode. Not used for HTML/URL input.
--source-column  phrase         CSV/Excel column containing source text.
--config         ./config.toml  Path to config file.
--verbose                       Enable debug-level console output.
--quiet                         Suppress all console output except errors.
--log-file       log_file.txt   Path to log file.

All model, language, backend, and threshold settings are configured in config.toml — see Configuration below.


Note on first run

On first run, the tool downloads evaluation model weights (~2–3 GB total) from HuggingFace for the sentence-transformer and BERTScore models. This is a one-time download; subsequent runs use the cached weights. Any configured Ollama models not already present are also pulled automatically.


Where to put your files

Single file: pass --input and optionally --output directly:

uv run translator --input phrases.csv --output results.csv

Batch mode: place all your files in an input/ folder next to the project, then run with no flags. Translated files are written to output/ automatically:

translator/
├── input/
│   ├── phrases.csv
│   ├── marketing.xlsx
│   └── legal.txt
└── output/       ← created automatically
    ├── phrases.csv
    ├── marketing.xlsx
    └── legal.txt
uv run translator
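
Folder-mode file discovery can be sketched as below, assuming the extension list from the options table; the tool's real selection logic may differ:

```python
# Sketch: collect supported files from the input/ folder in a stable order.
from pathlib import Path

SUPPORTED = {".csv", ".xlsx", ".xls", ".txt", ".html", ".htm"}

def batch_files(input_dir: str = "input") -> list[Path]:
    root = Path(input_dir)
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )
```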

HTML and URL translation

Pass a URL or a local .html/.htm file as --input and the tool will translate every visible text node on the page:

# Translate a web page directly from its URL
uv run translator --input https://example.com/about

# Translate a local HTML file
uv run translator --input page.html

For each target language the tool writes two files into output/:

File                 Description
{stem}_{lang}.html   Full translated HTML with all markup preserved
{stem}_summary.xlsx  One row per text node: original, translation, back-translation, score

The --output flag is not used for HTML/URL input — output is always written to output/.
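
Extracting visible text nodes while skipping script and style content can be sketched with the standard-library parser; the tool itself may use a different HTML library:

```python
# Sketch: walk an HTML document and collect visible text nodes,
# ignoring anything inside <script> or <style>.
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.nodes: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.nodes.append(text)

def visible_text_nodes(html: str) -> list[str]:
    parser = VisibleText()
    parser.feed(html)
    return parser.nodes
```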


Translation cache

Short phrases (those with max_words words or fewer; default 5, set in the [cache] section) are cached in a local SQLite database at cache/translations.db. On subsequent runs any cached phrase is served instantly with no LLM call, which makes re-translating pages with repeated navigation text (menus, labels, headings) much faster. Short phrases that score below the quality threshold are accepted and cached without retry, since a two-word label is unlikely to improve on a second attempt.

To start fresh, delete the cache/ folder:

rm -rf cache/

To disable caching entirely, set max_words = 0 in the [cache] section of your config file.
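
The word-count gate and cache lookup can be sketched as follows; the table and column names here are illustrative, not the tool's actual schema:

```python
# Sketch: only short phrases pass the gate; longer ones always hit the LLM.
import sqlite3

def cacheable(phrase: str, max_words: int = 5) -> bool:
    """True if the phrase is short enough to cache (0 disables caching)."""
    return max_words > 0 and len(phrase.split()) <= max_words

def lookup(conn: sqlite3.Connection, phrase: str, lang: str):
    """Return a cached translation, or None on a cache miss."""
    row = conn.execute(
        "SELECT translation FROM cache WHERE phrase = ? AND lang = ?",
        (phrase, lang),
    ).fetchone()
    return row[0] if row else None
```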


Input format

The input CSV must have a column containing the source phrases (default column name: phrase).

id,phrase
1,The early bird catches the worm
2,Better late than never

Excel (.xlsx, .xls) and plain-text (.txt) files are also supported.


Output format

The original columns are preserved and three columns are appended per target language:

id,phrase,fr_translation,fr_back,fr_score,de_translation,de_back,de_score
1,The early bird catches the worm,Le lève-tôt attrape le ver,The early bird catches the worm,0.923,...
Column              Description
{lang}_translation  Forward translation produced by the Translator model
{lang}_back         Back-translation produced by the Back-translator model
{lang}_score        Round-trip quality score 0.0–1.0

Scores near 1.0 indicate the meaning was very well preserved. Scores below 0.6 suggest the translation may be unreliable.

For plain-text input, the output file begins with an average score line followed by each language's translation.
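
Interpreting a score against the pass/warn thresholds from config.toml (see Configuration below) might look like this illustrative sketch; the labels are not the tool's own terminology:

```python
# Sketch: bucket a round-trip score using the informational thresholds.
def classify(score: float,
             pass_threshold: float = 0.8,
             warn_threshold: float = 0.6) -> str:
    if score >= pass_threshold:
        return "pass"
    if score >= warn_threshold:
        return "warn"
    return "fail"
```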


Configuration

All model, language, backend, and threshold settings live in config.toml. Pass --config path/to/file.toml to use a different config. Every key listed below is required — the tool will report all missing keys if any are absent rather than silently using defaults.

[models]
backend = "ollama"          # "ollama" or "openrouter"
translator = "phi4-mini"
backtranslator = "phi4-mini"
evaluator = "phi4-mini"
request_timeout = 60        # seconds per LLM call; raise for slow models

[ollama]
base_url = "http://localhost:11434"

[openrouter]
# Set OPENROUTER_API_KEY as an environment variable (see Prerequisites above)

[scoring]
quality_threshold = 0.7     # score at which the retry loop exits early
retries = 3                 # max additional attempts per phrase/chunk
pass_threshold = 0.8        # informational: score considered passing
warn_threshold = 0.6        # informational: score considered acceptable
max_chunk_chars = 4000      # split inputs longer than this; 0 to disable

[cache]
max_words = 5               # cache phrases with this many words or fewer; 0 to disable

[languages]
source = "en"
targets = ["fr", "de", "nl", "fr-BE", "nl-BE"]

Four ready-made configs are included:

File                  Backend         Models
config.toml           Ollama (local)  phi4-mini for all roles
config_ollama.toml    Ollama (local)  phi4-mini for all roles
config_or_cheap.toml  OpenRouter      Claude Sonnet / GPT / DeepSeek
config_or_pricey.toml OpenRouter      Claude Opus / GPT / Gemini

Running tests

The project has a full test suite (unit and integration). Tests are excluded from the distribution zip to keep it lightweight, but are available in the full repository.

uv run pytest                  # Unit tests only (default)
uv run pytest -m integration   # Integration tests (requires live Ollama / OpenRouter)
uv run pytest -m ""            # Everything

About

LLM translation with accuracy estimation.
