OCR Translate

AI-powered manga/comic translation pipeline — extracts Japanese text from images using a local vision-language model and translates it via a separate local LLM. No cloud APIs, no Tesseract, no manual preprocessing.

Overview

OCR Translate processes comic and manga pages through a two-stage AI pipeline:

Vision Model (Qwen3-VL 8B) scans the raw image and extracts speech bubble text
Language Model (TranslateGemma 12B) translates the extracted text to natural English

Both models run locally through Ollama, keeping all data on-device with zero external API calls. The pipeline handles vertical Japanese text, varied bubble layouts, sound effects, and NSFW content without censorship (with appropriate model selection).

Features

Zero preprocessing — raw images go directly to the vision model; no grayscale, thresholding, or denoising required
Vertical text support — vision models natively understand right-to-left Japanese reading order
Streaming batch mode — processes 50+ pages sequentially with immediate file writes per page; survives crashes without data loss
Minimal prompts — deliberately terse instructions (2–3 lines) to minimize token waste and prevent model confusion
Configurable models — any Ollama vision model for extraction, any Ollama text model for translation
GPU-accelerated — vision encoding and text generation both run on GPU when properly configured
563 lines of Python — light, maintainable, no framework overhead

Architecture

images/                   # Input: .png, .jpg, .webp pages
    │
    ▼
┌─────────────────────────────────────────────┐
│  src/preprocess.py  (43 lines)              │
│  Loads image, downscales to config size,    │
│  encodes to base64 for Ollama API           │
└────────────────┬────────────────────────────┘
                 │ base64
                 ▼
┌─────────────────────────────────────────────┐
│  src/ocr.py  (94 lines)                     │
│  POST /api/chat to vision model             │
│  Returns extracted Japanese text             │
└────────────────┬────────────────────────────┘
                 │ raw text
                 ▼
┌─────────────────────────────────────────────┐
│  src/translate.py  (113 lines)              │
│  Strips furigana, POST /api/chat to LLM    │
│  Returns natural English translation         │
└────────────────┬────────────────────────────┘
                 │
                 ▼
            outputs/                          
       batch_translation.txt

Data flow between modules:

preprocess.py → ocr.py: base64-encoded image string
ocr.py → translate.py: raw Japanese text (furigana annotations stripped before sending)
All modules import from config.py, which loads settings from .env

Quickstart

Prerequisites

Ollama installed and running
Python 3.10+

1. Pull models

ollama pull qwen3-vl:8b         # Vision model for text extraction
ollama pull translategemma:12b  # Translation model

2. Install dependencies

pip install -r requirements.txt

3. Configure

Copy .env.example to .env and edit if needed:

OLLAMA_HOST=http://localhost:11434
VISION_MODEL=qwen3-vl:8b
TRANSLATION_MODEL=translategemma:12b
SOURCE_LANG=Japanese
TARGET_LANG=English
MAX_IMAGE_DIMENSION=1024
TEMPERATURE=0.5
REQUEST_TIMEOUT=120

4. Run

# Single image
python run.py images/comic.png

# Batch process a folder (streaming writes, survives crashes)
python run.py images/chapter1/ --batch

# Extract text only (no translation)
python run.py images/chapter1/ --batch --ocr-only

# Translate a text string directly
python run.py --translate-only "こんにちは"

Commands

Command	Description
`run.py <image>`	Extract text and translate
`run.py <dir> --batch`	Process all images in a directory
`run.py <image> --ocr-only`	Extract text only
`run.py --translate-only "text"`	Translate a text string
`run.py --setup`	Show model setup instructions
`run.py --list-models`	List installed Ollama models
`run.py --verbose`	Print detailed API request/response info
`run.py --output <path>`	Save output to file

Configuration (.env)

Setting	Default	Purpose
`VISION_MODEL`	`qwen3-vl:8b`	Model that reads text from images (must support vision)
`TRANSLATION_MODEL`	`translategemma:12b`	Model that translates text
`MAX_IMAGE_DIMENSION`	`1024`	Downscale images larger than this (pixels)
`TEMPERATURE`	`0.5`	Translation creativity (0.0 = deterministic, 1.0 = creative)
`REQUEST_TIMEOUT`	`120`	API request timeout in seconds
`SOURCE_LANG`	`Japanese`	Source language hint
`TARGET_LANG`	`English`	Target language

How It Works

Text Extraction (Vision Model)

The system sends a base64-encoded image to Ollama's /api/chat endpoint with a minimal prompt:

"Output only the raw Japanese text from each speech bubble. No translations, no descriptions, no commentary. Ignore any small furigana/ruby text next to kanji. Just the main text."

The vision model returns the extracted text directly. No scene descriptions, no formatting — just the raw dialogue. The response is parsed (handles context/text splitting on --- delimiters for single-image mode, or returns text directly for batch mode).

Furigana Stripping

Japanese manga often includes small reading characters (furigana) next to kanji. Before sending extracted text to the translator, a regex strips kana-only parenthetical content:

# 同(おな)じ顔(かお) → 同じ顔
re.sub(r'\([ぁ-ゟァ-ヿー]+\)', '', text)

This ensures the translator sees clean, uninterrupted Japanese without reading annotations that would confuse tokenization.

Translation (Language Model)

Clean text is sent to the translation model with a minimal prompt:

"Translate Japanese manga dialogue to natural English." / "Translate this manga text to English. Output only the translation."

For batch mode, all pages are consolidated into one request with === PAGE N === markers. The translation model reads the full narrative arc before translating each page, producing contextually aware translations.

Batch Mode Resilience

Batch processing writes each page's extracted text to outputs/batch_extracted.txt immediately after extraction. If the process crashes on page 40 of 58, pages 1–39 are already saved. Restarting continues from where it left off (the output file is rewritten each batch run, so restart from the beginning — no duplicate pages are possible).

Supported Image Formats

PNG, JPEG, WebP, BMP, TIFF — anything Pillow can open.

Dependencies

Package	Purpose
`Pillow`	Image loading, resizing, base64 encoding
`requests`	HTTP client for Ollama API
`python-dotenv`	Configuration via `.env` file

No OCR library, no OpenCV, no CUDA SDKs. The AI models handle everything.

Performance Notes

GPU: Vision encoding and LLM inference both require GPU for acceptable speed. CPU-only inference is 10–50x slower.
AMD GPUs: Ollama supports AMD via HIP SDK on Windows. If inference is slow (40s+/page), verify GPU utilization in Task Manager.
Batch speed: With keep_alive, the vision model loads once and stays in VRAM. 58 pages typically complete in 10–25 minutes depending on GPU.

Project Structure

ocrTranslate/
├── .env                    # Local config (gitignored)
├── .env.example            # Config template
├── requirements.txt        # Python dependencies
├── run.py                  # CLI entry point
├── README.md
├── images/                 # Input images
├── outputs/                # Generated files (gitignored)
└── src/
    ├── __init__.py
    ├── config.py           # Central config + AI prompts
    ├── preprocess.py       # Image load/resize/base64
    ├── ocr.py              # Vision model text extraction
    └── translate.py        # Translation + furigana stripping

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

OCR Translate

Overview

Features

Architecture

Quickstart

Prerequisites

1. Pull models

2. Install dependencies

3. Configure

4. Run

Commands

Configuration (.env)

How It Works

Text Extraction (Vision Model)

Furigana Stripping

Translation (Language Model)

Batch Mode Resilience

Supported Image Formats

Dependencies

Performance Notes

Project Structure

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
src		src
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt
run.py		run.py

Folders and files

Latest commit

History

Repository files navigation

OCR Translate

Overview

Features

Architecture

Quickstart

Prerequisites

1. Pull models

2. Install dependencies

3. Configure

4. Run

Commands

Configuration (.env)

How It Works

Text Extraction (Vision Model)

Furigana Stripping

Translation (Language Model)

Batch Mode Resilience

Supported Image Formats

Dependencies

Performance Notes

Project Structure

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages