25 jōyō kanji (regular-use kanji) not included in japanese model #11249

madmalkav · 2023-11-13T08:03:17Z

I have found that japan_dict.txt is missing at least these two kanji, both of which are part of the jōyō kanji list:

I think training for any Japanese OCR model should include all the jōyō kanji, as they are considered the most common and the ones Japanese readers need most.

Any chance you could update the japan_dict.txt file and retrain the Japanese model? If needed, I can try to search for other missing kanji from the jōyō list, but I am not very good at scripting, so it will take me some time to obtain the list.

matt-m-o · 2023-11-13T17:56:02Z

PaddleOCR is one of the best OCR toolkits I've tested so far. However, the official Japanese model could be significantly improved if it could recognize all Jouyou kanji (regular-use kanji), as these are very common.

I have identified 25 missing kanji:
塡謁喉挫愁辣栓渇舷訃摯剝綻凄咽嚇羞隙繭慄僅錮頰厘拷

It would be greatly appreciated if you could retrain the Japanese model with these additions. This would enhance the accuracy and usability.
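
As a rough illustration of the dictionary side of this request, the sketch below writes a copy of the dictionary with the 25 characters listed above appended, one per line, skipping any that are already present. The helper name `append_missing` and its parameters are hypothetical, and it assumes the dictionary is a UTF-8 file with one character per line. The model would still need to be retrained before the new entries could be recognized.

```python
from pathlib import Path

# The 25 missing kanji reported in this comment.
MISSING = "塡謁喉挫愁辣栓渇舷訃摯剝綻凄咽嚇羞隙繭慄僅錮頰厘拷"


def append_missing(dict_path, out_path, chars=MISSING):
    """Write a copy of the dictionary with absent characters appended.

    Hypothetical helper: assumes a UTF-8 file with one character per line.
    Existing lines keep their positions; new characters go at the end.
    Returns the list of characters that were actually added.
    """
    lines = Path(dict_path).read_text(encoding="utf-8").splitlines()
    present = set(lines)
    added = []
    for ch in chars:
        if ch not in present:
            added.append(ch)
            present.add(ch)
    Path(out_path).write_text("\n".join(lines + added) + "\n", encoding="utf-8")
    return added
```

Writing to a separate output path leaves the original dictionary untouched, so the two files can be compared before anything is retrained.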

Files:
jouyou-kanji.csv
dict_japan.txt

Python script to find missing kanji:

jouyou_kanji_file = "jouyou-kanji.csv"
dict_file = "dict_japan.txt"
output_file = "missing_kanji.txt"

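# Read the model dictionary (one character per line). UTF-8 is set explicitly
# because the platform default encoding may not be able to decode kanji.
with open(dict_file, "r", encoding="utf-8") as f:
    dict_chars = set(f.read().splitlines())

# The first character of each CSV line is taken as the kanji.
# Blank lines are skipped so that line[0] cannot raise IndexError.
with open(jouyou_kanji_file, "r", encoding="utf-8") as f:
    jouyou_kanji = set(line[0] for line in f.read().splitlines() if line)

missing_kanji = jouyou_kanji - dict_chars

# Sort so the output order is stable between runs.
with open(output_file, "w", encoding="utf-8") as f:
    for char in sorted(missing_kanji):
        f.write(char + "\n")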

print(f"Missing kanji have been written to {output_file}")
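
The same comparison can be wrapped in a function so it can be tried on small in-memory inputs before running it on the real files. This is only a sketch: the name `find_missing` is hypothetical, and it assumes the first column of each CSV row holds a single kanji. Rows that are empty, or whose first field is not exactly one character (a header row, for example), are ignored.

```python
import csv
import io


def find_missing(kanji_csv, dict_lines):
    """Return first-column characters from kanji_csv that are absent from dict_lines.

    Hypothetical helper: kanji_csv is any iterable of CSV text lines and
    dict_lines is any iterable of dictionary lines (one character each).
    The result keeps the order in which the kanji appear in the CSV.
    """
    dict_chars = {line.rstrip("\r\n") for line in dict_lines}
    wanted = []
    for row in csv.reader(kanji_csv):
        # Skip blank rows and rows whose first field is not a single character.
        if row and len(row[0]) == 1 and row[0] not in wanted:
            wanted.append(row[0])
    return [ch for ch in wanted if ch not in dict_chars]


sample_csv = io.StringIO("kanji,reading\n喉,コウ\n山,サン\n\n挫,ザ\n")
sample_dict = ["山", "川"]
print(find_missing(sample_csv, sample_dict))  # ['喉', '挫']
```

With real files, both arguments can be file objects opened with `encoding="utf-8"`; for the CSV, the `csv` module documentation recommends also passing `newline=""` to `open`.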

paddle-bot bot assigned tink2123 Nov 13, 2023

madmalkav mentioned this issue Nov 13, 2023

Some OCR mismatches matt-m-o/YomiNinja#5

Closed

madmalkav changed the title ~~jōyō kanji not included in japanese model~~ 25 jōyō kanji (regular-use kanji) not included in japanese model Nov 14, 2023

matt-m-o mentioned this issue Jan 31, 2024

OCR mistakes matt-m-o/YomiNinja#17

Open

PaddlePaddle locked and limited conversation to collaborators Jun 10, 2024

SWHL converted this issue into discussion #12940 Jun 10, 2024


This issue was moved to a discussion.
