
Hiraganas does not take context into account (like numbers) #12

Open
lasyan3 opened this issue May 24, 2021 · 11 comments

@lasyan3 commented May 24, 2021

Hi,
Not sure if this is a Kawazu or LibNMeCab issue, but when converting kanji, exceptions to the regular readings are ignored.
For example, converting 300 written as 三百 outputs さんひゃく (sanhyaku), but the correct reading is さんびゃく (sanbyaku).
Same for 600 and 900.

Currently working on a workaround on my fork: lasyan3@156bf7e
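The workaround in the fork is C#; as an illustration only, here is a minimal Python sketch of the same idea: override known sound-change cases after a naive per-kanji conversion. All names are hypothetical; the readings さんびゃく and ろっぴゃく are standard Japanese.

```python
# Naive per-kanji readings that a context-free converter would concatenate
NAIVE = {"三": "さん", "六": "ろく", "百": "ひゃく"}

# Number compounds whose reading is not the concatenation of the parts
SPECIAL_CASES = {
    "三百": "さんびゃく",   # rendaku: not さんひゃく
    "六百": "ろっぴゃく",   # gemination + rendaku: not ろくひゃく
}

def convert(text: str) -> str:
    """Return a hiragana reading, applying special cases before the naive fallback."""
    for word, reading in SPECIAL_CASES.items():
        text = text.replace(word, reading)
    # Characters already converted (or unknown) pass through unchanged
    return "".join(NAIVE.get(ch, ch) for ch in text)

print(convert("三百"))  # さんびゃく, not さんひゃく
```

The lookup table approach is exactly the "handle special cases one by one" strategy discussed below; it is simple but only covers the cases someone has listed.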

@Cutano (Owner) commented May 25, 2021

I see.
I think it is because MeCab doesn't regard 三百 as a whole word. Handling special cases is an effective method, but it might be hard to cover all the cases. Still, it is currently the best way to solve this problem.

@lasyan3 (Author) commented May 25, 2021

I agree with you, handling special cases one by one might be hard. I'm doing it right now for numbers and time, because I need it for my web app that helps me learn Japanese, and I'll see whether I keep going in this direction.
Anyway, feel free to use the code from my fork if you want and if you think it's OK (I had to change JapaneseElement from a struct to a class to be able to set some properties).

@Cutano (Owner) commented May 25, 2021

> I agree with you, handling special cases one by one might be hard. I'm doing it right now for numbers and time, because I need it for my web app that helps me learn Japanese, and I'll see whether I keep going in this direction.
> Anyway, feel free to use the code from my fork if you want and if you think it's OK (I had to change JapaneseElement from a struct to a class to be able to set some properties).

Sure, if it goes well in your project, please let me know and send me a pull request if possible. Handling cases one by one is better than handling nothing. Thanks for the advice.

@ookii-tsuki (Collaborator) commented:

> I agree with you, handling special cases one by one might be hard. I'm doing it right now for numbers and time, because I need it for my web app that helps me learn Japanese, and I'll see whether I keep going in this direction.

I think you could use JMDICT: go through each entry in that dictionary and check whether the reading generated by NMeCab matches the dictionary's; if not, add it to a list you can use to determine the special cases. That would be faster than doing it manually one by one.
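The comparison pass suggested above can be sketched as follows. This is a hypothetical Python illustration; `jmdict_entries` and `nmecab_reading` are placeholders for the real JMDICT data and the NMeCab-generated reading, and the toy data at the bottom stands in for both.

```python
def find_special_cases(jmdict_entries, nmecab_reading):
    """Collect words whose generated reading differs from the dictionary's.

    jmdict_entries: iterable of (word, dictionary_reading) pairs
    nmecab_reading: callable mapping a word to its generated reading
    """
    mismatches = {}
    for word, dict_reading in jmdict_entries:
        if nmecab_reading(word) != dict_reading:
            mismatches[word] = dict_reading  # keep the correct reading
    return mismatches

# Toy stand-ins for JMDICT and the NMeCab output:
entries = [("三百", "さんびゃく"), ("勉強", "べんきょう")]
naive = {"三百": "さんひゃく", "勉強": "べんきょう"}
print(find_special_cases(entries, naive.get))  # {'三百': 'さんびゃく'}
```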

@lasyan3 (Author) commented May 26, 2021

Well, I downloaded the Wacton Desu NuGet package (a .NET port of JMDICT) to run some tests, and I just realized that the issue seems to appear only with kanji representing numbers. Indeed, in a sentence, words are correctly segmented from each other. For example: 日本語を勉強します --> [日本語] [を] [勉強] [し] [ます]
And in that case, as far as I can tell, the reading is correct.
But with numbers, all the kanji are split individually, so the program cannot detect the special readings.
So if I am right, we just have to handle the cases for numbers, which is a pretty small set.

@ookii-tsuki (Collaborator) commented:

Yeah, I think just fixing numbers would be enough.
Also, in case you didn't know, counters are treated the same way as numbers.
For example:
一人 (ひとり) is divided into two elements, 一 (いち) and 人 (にん), while it should be one word.
Or:
一回 (いっかい) is divided into 一 (いち) and 回 (かい).
And there are a lot of other counters.
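The counter problem above lends itself to the same table-driven fix: merge adjacent single-kanji segments when they form a known number-plus-counter word. A hypothetical Python sketch (the function and table names are illustrative; the readings listed are standard Japanese):

```python
# Counter compounds whose reading is not the concatenation of the parts
COUNTER_EXCEPTIONS = {
    "一人": "ひとり",    # not いちにん
    "二人": "ふたり",    # not ににん
    "一回": "いっかい",  # gemination: not いちかい
    "一本": "いっぽん",  # gemination + rendaku
}

def fix_counters(segments):
    """Merge adjacent segments that form a known counter word."""
    out, i = [], 0
    while i < len(segments):
        pair = segments[i] + segments[i + 1] if i + 1 < len(segments) else ""
        if pair in COUNTER_EXCEPTIONS:
            out.append(COUNTER_EXCEPTIONS[pair])  # emit the whole-word reading
            i += 2
        else:
            out.append(segments[i])
            i += 1
    return out

print(fix_counters(["一", "人"]))  # ['ひとり']
```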

@lasyan3 (Author) commented May 26, 2021

Hmm, so maybe the root cause is the way NMeCab parses sentences? Maybe there is a way to make it group counters and numbers correctly; I'll investigate in that direction.

@ookii-tsuki (Collaborator) commented:

I think the actual problem comes from IpaDic, the dictionary NMeCab uses to parse sentences.
NMeCab also supports UniDic, which I've heard is better than IpaDic and more up-to-date, but it's a lot bigger (2 GB or so), and I don't know whether it has this problem with numbers.

@lasyan3 (Author) commented May 28, 2021

I think I found a solution to deal with the issue. Maybe not the best one, but it at least seems to work.
I let Kawazu split the sentence and keep only the kanji that stand alone (in the other cases, Kawazu identified the compound and thus the proper reading).
Then I use Wacton Desu to analyze the remaining kanji and compare the readings.
You can view the detailed implementation in my repo, in the "desu" branch: lasyan3@93bb51f
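The real implementation in the "desu" branch is C#; as a hypothetical Python sketch of the approach described above, `segments` stands for Kawazu's (surface, reading) output and `dictionary_reading` stands for a Wacton Desu lookup:

```python
def is_kanji(ch: str) -> bool:
    """True for characters in the CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"

def fix_lone_kanji(segments, dictionary_reading):
    """Re-check only single-kanji segments; multi-kanji segments were
    matched as whole words, so their reading is trusted as-is."""
    fixed = []
    for surface, reading in segments:
        if len(surface) == 1 and is_kanji(surface):
            dict_reading = dictionary_reading(surface)
            if dict_reading and dict_reading != reading:
                reading = dict_reading  # prefer the dictionary reading
        fixed.append((surface, reading))
    return fixed

# Toy data: the multi-kanji word is left alone, the lone kanji is re-checked
segs = [("日本語", "にほんご"), ("人", "にん")]
print(fix_lone_kanji(segs, {"人": "ひと"}.get))
```

One caveat with this sketch: a single kanji often has several valid readings, so a plain lookup can only flag a mismatch, not always pick the contextually correct reading.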

@ookii-tsuki (Collaborator) commented:

I don't think depending on the Wacton library is a good idea, because it uses a lot of RAM (about 460 MB for the Japanese entries). It would be better to run this test in a separate project, collect all the cases where the reading is wrong, save them as JSON or XML, and then use that file to check for wrong readings in Kawazu.
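The offline/online split suggested above can be sketched in a few lines. This is a hypothetical Python illustration of the idea (the real code would be C#): run the heavy dictionary comparison once, dump the mismatches to JSON, and load only that small file at runtime.

```python
import json

def dump_wrong_readings(mismatches: dict, path: str) -> None:
    """Offline step: persist {word: correct_reading} pairs as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(mismatches, f, ensure_ascii=False, indent=2)

def load_wrong_readings(path: str) -> dict:
    """Runtime step: a cheap lookup table, no full dictionary in RAM."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

dump_wrong_readings({"三百": "さんびゃく"}, "wrong_readings.json")
print(load_wrong_readings("wrong_readings.json"))  # {'三百': 'さんびゃく'}
```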

@lasyan3 (Author) commented Jun 7, 2021

I tried running the test to collect all the cases, but I ended up with 58,109 wrong readings, which seems abnormally large to me.
So for now I'll stick with my first idea: dealing with counters and adding exceptions each time I see a new kind.
