SceneTrans: A method for text localization and translation in scene images

Abstract:

Objective: Existing scene text editing methods mainly focus on monolingual editing within cropped image patches that each contain a single text instance, and therefore do not support cross-lingual text editing. To address this limitation, this paper proposes SceneTrans, an end-to-end English-to-Chinese scene text localization-and-translation framework that localizes text accurately in natural images, translates it from English to Chinese, and edits the scene text in a visually consistent way.

Method: First, for text localization and translation, the MiniCPM-V 2.6 large multimodal model is adopted as the baseline. An instruction-tuning strategy based on location-enhanced prompt templates is designed so that text localization and translation are learned jointly, while a lightweight LoRA tuning scheme applied to the visual side preserves the model's inherent translation ability. Second, for scene text editing, a diffusion-based Chinese text editing module is constructed. A Chinese glyph structure encoder is developed to model glyph priors precisely, and a Chinese glyph recognition supervision mechanism is introduced to ensure the glyph accuracy of the generated text. Third, to address the scarcity of cross-lingual scene text localization-and-translation data, an English-Chinese paired dataset named SynthTrans is built. SynthTrans contains 200,000 pairs of English-Chinese scene text images synthesized on real street-scene backgrounds with diverse bilingual-compatible fonts. It includes complex geometric transformations such as curvature and slant, providing high-quality data for model training and evaluation.

Result: Experiments are conducted on the scene text detection datasets Total-Text, MSRA-TD500, and ICDAR2015, as well as on the proposed SynthTrans editing dataset. In the text localization task, the proposed framework achieves F1 scores comparable to those of mainstream detection models. In the text editing task, compared with methods such as TextCtrl and DiffSTE, the proposed method achieves an SSIM of 0.622 and a PSNR of 18.51, while improving text rendering accuracy (ACC) to 71.13% and achieving a normalized edit distance (NED) of 0.7837. Comparative and ablation experiments show that the proposed glyph structure encoding and glyph recognition supervision preserve structural fidelity and style consistency in Chinese text generation, and that the framework performs cross-lingual scene text localization and translation effectively.

Conclusion: The proposed end-to-end framework provides an integrated "localization-translation-editing" pipeline for natural scene text images, generating Chinese scene text with accurate glyph structures and a style consistent with the original text. It overcomes the limitations of previous methods, which cannot perform localization or cross-lingual editing, and offers a new technical solution and dataset foundation for cross-lingual scene text localization and translation in natural images.
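The editing results above are reported with SSIM, PSNR, ACC, and NED. As a rough illustration of how three of these are commonly computed, here is a minimal pure-Python sketch. It is not the evaluation code of this repository. The function names (`psnr`, `levenshtein`, `ned_similarity`, `evaluate`) are hypothetical, ACC is assumed to be exact-match accuracy over recognized strings, and NED is assumed to follow the common "higher is better" convention of 1 minus the edit distance divided by the longer string length; the abstract does not define either metric precisely.

```python
import math


def psnr(image_a, image_b, max_val=255.0):
    """PSNR between two equal-sized images given as flat sequences of pixel values."""
    if len(image_a) != len(image_b) or len(image_a) == 0:
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((x - y) ** 2 for x, y in zip(image_a, image_b)) / len(image_a)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def levenshtein(source, target):
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    prev = list(range(len(target) + 1))
    for i, cs in enumerate(source, 1):
        curr = [i]
        for j, ct in enumerate(target, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ned_similarity(pred, target):
    """Assumed NED convention: 1 - edit_distance / max(len), so 1.0 is a perfect match."""
    longest = max(len(pred), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred, target) / longest


def evaluate(preds, targets):
    """Return (ACC, mean NED) over paired recognized and ground-truth strings."""
    if len(preds) != len(targets) or len(targets) == 0:
        raise ValueError("preds and targets must be non-empty and of equal length")
    acc = sum(p == t for p, t in zip(preds, targets)) / len(targets)
    ned = sum(ned_similarity(p, t) for p, t in zip(preds, targets)) / len(targets)
    return acc, ned
```

In practice, the strings passed to `evaluate` would come from running a text recognizer on the edited images, and SSIM would typically be taken from an image library rather than reimplemented.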

Example of the SynthTrans dataset:
