Replies: 2 comments 1 reply
-
PaddleOCR is one of the best OCR toolkits I've tested so far. However, the official Japanese model would be significantly better if it could recognize all Jouyou kanji (regular-use kanji), since these are very common. I have identified 25 missing kanji. It would be greatly appreciated if you could retrain the Japanese model with these additions, as this would improve its accuracy and usability.

Python script to find missing kanji:

    jouyou_kanji_file = "jouyou-kanji.csv"
    dict_file = "dict_japan.txt"
    output_file = "missing_kanji.txt"

    # Read the dictionary: one character per line.
    with open(dict_file, "r", encoding="utf-8") as f:
        dict_chars = set(f.read().splitlines())

    # Take the first character of each non-empty CSV line as the kanji.
    with open(jouyou_kanji_file, "r", encoding="utf-8") as f:
        jouyou_kanji = set(line[0] for line in f.read().splitlines() if line)

    # Jouyou kanji that do not appear in the dictionary.
    missing_kanji = jouyou_kanji - dict_chars

    with open(output_file, "w", encoding="utf-8") as f:
        for char in sorted(missing_kanji):
            f.write(char + "\n")

    print(f"Missing kanji have been written to {output_file}")
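For anyone who wants to try the same check without the files at hand, here is a minimal sketch of the set-difference step as a small function. The function name `find_missing` and the sample data are hypothetical and only for illustration. It assumes the dictionary holds one character per line, as the script above does.

```python
def find_missing(required_chars, dict_lines):
    """Return the required characters that the dictionary lacks, sorted by code point."""
    # Assumption: each dictionary line holds exactly one character.
    # Blank lines are skipped; surrounding whitespace is stripped.
    known = {line.strip() for line in dict_lines if line.strip()}
    return sorted(set(required_chars) - known)


# Hypothetical sample data, not the real PaddleOCR dictionary.
sample_dict = ["日", "本", "語", ""]
sample_required = ["日", "本", "喉", "渇"]

missing = find_missing(sample_required, sample_dict)
print(missing)
```

With this sample data, the two kanji reported in the original post (喉 and 渇) come back as missing, because neither is in `sample_dict`.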
-
I created a PR to address this: #13142
-
I have found that japan_dict.txt is missing at least these two kanji, both of which are on the jōyō kanji list:
喉
渇
I think training for any Japanese OCR model should include all the jōyō kanji, as they are considered the most common and the most necessary for Japanese readers.
Is there any chance you could update the japan_dict.txt file and retrain the Japanese model? If needed, I can try to search for other missing kanji from the jōyō list, but I am not very good at scripting, so it will take me some time to produce the list.