Recognize japanese symbols in two screenshots #4102

Open
superbonaci opened this issue Jul 19, 2023 · 4 comments

@superbonaci

superbonaci commented Jul 19, 2023

Current Behavior

Tesseract does not recognize the symbols.

Expected Behavior

Recognize the symbols in these two screenshots.
Original pictures from Dragon Ball episode 1:

goku1

goku2

After some perspective correction (in case it helps):

goku1-ed

goku2-ed

Suggested Fix

Recognize the symbols.

tesseract -v

tesseract 5.3.2
leptonica-1.82.0
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.5.1) : libpng 1.6.40 : libtiff 4.5.1 : zlib 1.2.11 : libwebp 1.3.1 : libopenjp2 2.5.0
Found NEON
Found libarchive 3.6.2 zlib/1.2.11 liblzma/5.4.1 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.4
Found libcurl/7.88.1 SecureTransport (LibreSSL/3.3.6) zlib/1.2.11 nghttp2/1.51.0

Operating System

macOS 13 Ventura

@amitdo
Collaborator

amitdo commented Mar 16, 2024

Tesseract's layout analysis was designed to deal with simple layouts of books, magazines, newspapers and documents.

For any image that Tesseract fails to recognize, completely or only in some areas, it is recommended to clean the image with a different tool first, so that the text is easier for Tesseract to recognize.

https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html

In your case, you should give Tesseract just the letters without the frame around them.
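
As a rough sketch of that advice, the frame could be cropped away before OCR. This assumes ImageMagick 7 (the `magick` command) is installed, which the thread does not mention. The function name `crop_for_ocr`, the file names, and the crop geometry in the usage comment are hypothetical and would need to be measured on the actual screenshot.

```shell
#!/bin/sh
# Hypothetical helper: crop a region out of a screenshot and prepare it for OCR.
# Assumes ImageMagick 7 ("magick") is on PATH; this tool is not part of Tesseract.
crop_for_ocr() {
    input=$1      # source image (hypothetical name, e.g. input.png)
    output=$2     # where to write the cleaned crop
    geometry=$3   # WIDTHxHEIGHT+X+Y of the region holding only the characters

    # -crop keeps just the requested region, +repage resets the canvas offset,
    # -colorspace Gray drops colour, and -border adds a white margin around the text.
    magick "$input" -crop "$geometry" +repage \
        -colorspace Gray -bordercolor white -border 20 "$output"
}

# Example call (geometry is a placeholder, not measured from the real image):
# crop_for_ocr input.png crop.png 200x200+40+40
```

The white margin is there because text that touches the image edge tends to be harder for Tesseract to segment; the exact border width is a guess.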

@superbonaci
Author

No result either with the improved picture:

result

% tesseract -l jpn result.png result.txt
Empty page!!
Empty page!!
% tesseract -l script/Japanese result.png result.txt
Empty page!!
Empty page!!

@amitdo
Collaborator

amitdo commented Mar 17, 2024

Did you try with different psm values?
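
For reference, one way to try several page segmentation modes in a row is sketched below. The function name `try_psm_modes` is made up for illustration, and the list of modes is only a guess at which ones might suit a small crop with one or two characters: 3 is the default, 5 is a vertically aligned block, 6 a uniform block, 7 a single line, 8 a single word, 10 a single character, 11 sparse text, and 13 a raw line.

```shell
#!/bin/sh
# Hypothetical helper: run tesseract on one image with several --psm values
# and print each result to the terminal.
try_psm_modes() {
    image=$1          # e.g. result.png from the comment above
    lang=${2:-jpn}    # language or script model; defaults to jpn

    for psm in 3 5 6 7 8 10 11 13; do
        echo "== --psm $psm =="
        # "stdout" as the output base makes tesseract print the text
        # instead of writing a file.
        tesseract "$image" stdout -l "$lang" --psm "$psm"
    done
}

# Example calls:
# try_psm_modes result.png jpn
# try_psm_modes result.png script/Japanese
```

If every mode still reports an empty page, the problem is more likely the image itself (size, contrast, orientation) than the segmentation mode.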

@superbonaci
Author

superbonaci commented Mar 17, 2024

Still no luck, but Google Lens finds it fine:
https://ja.wikipedia.org/wiki/%E5%80%92%E7%A6%8F

@amitdo reopened this Mar 17, 2024