25 jōyō kanji (regular-use kanji) not included in japanese model #12940

madmalkav · 2023-11-13T08:03:17Z


I have found that japan_dict.txt is missing at least these two kanji, both of which are part of the jōyō kanji list:

I think the training data for any Japanese OCR model should include all the jōyō kanji, since they are considered the most common kanji and the ones Japanese readers need most.

Is there any chance you could update the japan_dict.txt file and retrain the Japanese model? If needed, I can try to search for other missing kanji from the jōyō list, but I am not very good at scripting, so it will take me some time to put the list together.

matt-m-o · 2023-11-13T17:56:02Z


PaddleOCR is one of the best OCR toolkits I've tested so far. However, the official Japanese model could be significantly improved if it could recognize all Jouyou kanji (regular-use kanji), as these are very common.

I have identified 25 missing kanji:
塡謁喉挫愁辣栓渇舷訃摯剝綻凄咽嚇羞隙繭慄僅錮頰厘拷
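
As a quick sanity check on the list above, the following minimal sketch counts the characters and looks for accidental duplicates. The names `REPORTED_MISSING` and `find_duplicates` are made up for this illustration and are not part of PaddleOCR.

```python
# The characters reported above, copied as-is from the post.
REPORTED_MISSING = "塡謁喉挫愁辣栓渇舷訃摯剝綻凄咽嚇羞隙繭慄僅錮頰厘拷"


def find_duplicates(chars: str) -> list:
    """Return characters that occur more than once, in first-seen order."""
    seen = set()
    dupes = []
    for c in chars:
        if c in seen and c not in dupes:
            dupes.append(c)
        seen.add(c)
    return dupes


total = len(REPORTED_MISSING)
duplicates = find_duplicates(REPORTED_MISSING)
print(f"{total} characters, {len(set(REPORTED_MISSING))} distinct, duplicates: {duplicates}")
```

If the count and the number of distinct characters agree and the duplicate list is empty, the list is at least internally consistent with the number stated in the title.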

It would be greatly appreciated if you could retrain the Japanese model with these additions. This would improve the model's accuracy and usability for Japanese text.

Files:
jouyou-kanji.csv
dict_japan.txt

Python script to find missing kanji:

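jouyou_kanji_file = "jouyou-kanji.csv"
dict_file = "dict_japan.txt"
output_file = "missing_kanji.txt"

# Read the dictionary: one character per line. UTF-8 is given explicitly
# because the platform default encoding may fail to decode kanji.
with open(dict_file, "r", encoding="utf-8") as f:
    dict_chars = set(f.read().splitlines())

# Take the first character of each non-empty CSV line as the kanji.
with open(jouyou_kanji_file, "r", encoding="utf-8") as f:
    jouyou_kanji = set(line[0] for line in f.read().splitlines() if line)

# Kanji that are in the jouyou list but not in the dictionary.
missing_kanji = jouyou_kanji - dict_chars

# Sort so the output file is the same on every run.
with open(output_file, "w", encoding="utf-8") as f:
    for char in sorted(missing_kanji):
        f.write(char + "\n")

print(f"Missing kanji have been written to {output_file}")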
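
Once the missing kanji are known, the dictionary update itself could look roughly like the sketch below. It appends new characters at the end so that existing entries keep their positions, and it skips anything already present. The function name `extend_dict` and the sample data are hypothetical and only illustrate the idea; this is not necessarily what the eventual fix does, and updating the dictionary alone would presumably not be enough without retraining, as requested above.

```python
from typing import Iterable, List


def extend_dict(dict_lines: List[str], missing: Iterable[str]) -> List[str]:
    """Return dict_lines plus any characters from `missing` not already present.

    Existing entries keep their order; new characters are appended at the end.
    """
    seen = set(dict_lines)
    result = list(dict_lines)
    for char in missing:
        if char not in seen:
            seen.add(char)
            result.append(char)
    return result


# Tiny in-memory stand-in for the contents of dict_japan.txt (hypothetical data).
sample_dict = ["日", "本", "語", "喉"]
reported = "喉挫愁"

updated = extend_dict(sample_dict, reported)
print(updated)
```

To apply this to real files, read the dictionary with `open(..., encoding="utf-8")` and `splitlines()`, as in the script above, and write the returned list back out one character per line.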

1 reply

madmalkav Jun 10, 2024
Author

Can I ask what it means that the issue was converted into a discussion?

madmalkav · 2024-06-20T09:39:28Z


Created PR to solve this: #13142
