This issue was moved to a discussion.

25 jōyō kanji (regular-use kanji) not included in japanese model #11249

Closed
madmalkav opened this issue Nov 13, 2023 · 1 comment
@madmalkav

I have found that japan_dict.txt is missing at least these two kanji from the jōyō kanji list:


I think the training data for any Japanese OCR model should include all the jōyō kanji, as they are considered the most common and the most necessary for Japanese readers.

Is there any chance you could update the japan_dict.txt file and retrain the Japanese model? If needed, I can try to search for other missing kanji from the jōyō list, but I am not very good at scripting, so it will take me some time to obtain the list.

@matt-m-o

matt-m-o commented Nov 13, 2023

PaddleOCR is one of the best OCR toolkits I've tested so far. However, the official Japanese model could be significantly improved if it could recognize all Jouyou kanji (regular-use kanji), as these are very common.

I have identified 25 missing kanji:
塡 謁 喉 挫 愁 辣 栓 渇 舷 訃 摯 剝 綻 凄 咽 嚇 羞 隙 繭 慄 僅 錮 頰 厘 拷

It would be greatly appreciated if you could retrain the Japanese model with these additions. This would enhance the accuracy and usability.

Files:
jouyou-kanji.csv
dict_japan.txt

Python script to find missing kanji:

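# Compare the jōyō kanji list against the OCR dictionary and report the gaps.
jouyou_kanji_file = "jouyou-kanji.csv"
dict_file = "dict_japan.txt"
output_file = "missing_kanji.txt"

# Dictionary file: one entry per line. Read as UTF-8 instead of the platform default.
with open(dict_file, "r", encoding="utf-8") as f:
    dict_chars = set(f.read().splitlines())

# CSV: the kanji is the first character of each row.
# "utf-8-sig" drops a leading BOM if present; blank lines are skipped.
with open(jouyou_kanji_file, "r", encoding="utf-8-sig") as f:
    jouyou_kanji = set(line[0] for line in f.read().splitlines() if line)

missing_kanji = jouyou_kanji - dict_chars

# Sort so the output order is stable between runs.
with open(output_file, "w", encoding="utf-8") as f:
    for char in sorted(missing_kanji):
        f.write(char + "\n")

print(f"Missing kanji have been written to {output_file}")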
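
Once the missing characters are known, the dictionary update requested above could look roughly like the sketch below. It is not taken from the PaddleOCR repository: `append_missing_chars` is a hypothetical helper, and it assumes the dictionary is a UTF-8 text file with one entry per line, which is the same assumption the script above makes. New entries go at the end of the file so the positions of existing entries do not change.

```python
def append_missing_chars(dict_path, missing_chars):
    """Append entries that are not yet in a one-entry-per-line dictionary file.

    Hypothetical helper for illustration; returns the entries actually appended.
    """
    with open(dict_path, "r", encoding="utf-8") as f:
        content = f.read()
    existing = set(content.split("\n"))

    # Keep the caller's order; skip empty strings, duplicates, and known entries.
    to_add = []
    for char in missing_chars:
        if char and char not in existing:
            existing.add(char)
            to_add.append(char)

    if to_add:
        with open(dict_path, "a", encoding="utf-8") as f:
            # Make sure the first new entry starts on its own line.
            if content and not content.endswith("\n"):
                f.write("\n")
            f.write("\n".join(to_add) + "\n")
    return to_add
```

For example, `append_missing_chars("dict_japan.txt", "塡謁喉挫")` would append whichever of those four kanji are absent. Editing the dictionary alone would not make the existing model recognize the new characters; it would still need to be retrained, as the issue requests.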

@madmalkav madmalkav changed the title from "jōyō kanji not included in japanese model" to "25 jōyō kanji (regular-use kanji) not included in japanese model" Nov 14, 2023
@PaddlePaddle PaddlePaddle locked and limited conversation to collaborators Jun 10, 2024
@SWHL SWHL converted this issue into discussion #12940 Jun 10, 2024
