This repository has been archived by the owner on Feb 25, 2023. It is now read-only.

Encoding Issue with some Entries #10

Open
CalculusAce opened this issue Oct 30, 2020 · 1 comment

@CalculusAce

I've been using zero-epwing to convert a number of old EPWING dictionaries to Yomichan, and I've run into an issue I haven't seen before. Some of these dictionaries have characters (like �) in their entries that cannot be encoded in EUC-JP. As a result, I think zero-epwing fails to convert the text of those entries to UTF-8 and ends up skipping the definition text for the affected headwords. Most entries are valid EUC-JP, so their definitions are collected as expected.

I verified this by looking at the JSON output from zero-epwing for headwords whose definitions contain � when viewed in an EPWING file reader: the JSON data had no text key for those headwords. I'm trying to figure out whether a regex could be added to the zero-epwing code to remove characters like � before the conversion to UTF-8. If those characters could be removed, more entry data could be collected when moving the EPWING data over to Yomichan.
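To illustrate the idea, here is a minimal sketch in Python (zero-epwing itself is written in C, so this is not its code). It assumes the entries are lost because a strict EUC-JP decode aborts on the first invalid byte, which is my guess rather than something confirmed from the source. The helper name `decode_entry` and the sample bytes are made up for the example. Instead of a regex, it uses the decoder's error handling to either replace or drop the bad bytes so that the rest of the entry survives.

```python
def decode_entry(raw: bytes, drop_invalid: bool = False) -> str:
    """Decode EUC-JP bytes without failing on invalid sequences.

    With drop_invalid=False, each undecodable byte becomes U+FFFD.
    With drop_invalid=True, undecodable bytes are silently removed.
    """
    return raw.decode("euc_jp", errors="ignore" if drop_invalid else "replace")


# Hypothetical entry: valid EUC-JP text with one stray byte (0xFF) in the middle.
raw_entry = "辞書".encode("euc_jp") + b"\xff" + "の定義".encode("euc_jp")

# A strict decode rejects the whole entry, which would match the missing text.
try:
    raw_entry.decode("euc_jp")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(strict_failed)
print(decode_entry(raw_entry))                     # stray byte becomes U+FFFD
print(decode_entry(raw_entry, drop_invalid=True))  # stray byte is removed
```

In C, the equivalent would be to skip past the offending bytes and resume the conversion rather than giving up on the whole entry, but I haven't checked how zero-epwing handles conversion errors.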

@Thermospore

I've noticed this as well. It doesn't seem to be an issue in the new version of yomichan-import, which (I think) uses https://github.com/FooSoft/zero-epwing-go. However, that version seems to have an issue that this version doesn't (number 3 here; the others are dictionary-specific).
