New version of JMnedict (the proper name dictionary) #41

stephenmk · 2023-02-02T18:47:01Z

This pull request is to redesign the format of the JMnedict dictionary for Yomichan. It also includes a fix for a part-of-speech tag problem in non-English versions of JMdict.

New version of JMnedict

Related issue: FooSoft/yomichan#2111

Unlike the new version of JMdict, this redesign does not add new information or use any of Yomichan's new structured content features. It simply redesigns how the information is presented to users.

JMnedict contains a daunting number of entries, surpassing even JMdict. There are generally two types of entries in the file: (1) specific names of people, companies, events, etc., and (2) generic names such as given names and surnames. The latter category far outnumbers the former.

While the entries for specific names often provide useful information and context for a given term, the entries for generic names do not. The glossaries for generic names simply transliterate the term into Latin characters. So for example, the JMnedict entry for おおたに【大谷】 simply contains the gloss "Ootani" along with "place" and "surname" tags.

The problem is that JMnedict contains 44 generic name entries for the kanji 大. This means that any time a Yomichan user searches for a word beginning with 大, Yomichan will also retrieve all 44 generic name entries for 大. This clutters the search results with a large amount of low-quality information.

My suggestion is that we discard all glosses in generic entries with kanji forms. This way we can merge all generic entries sharing the same kanji form into single Yomichan entries.

Example: 尚三郎 (readings are moved to the glossaries for generic kanji terms)

Example: 山海経 (specific name entries retain glosses)

Example: 大谷海岸駅 (all 44 generic 大 entries merged into one)

Example: 林佳樹 (gloss is technically a transliteration but is retained because it has a space)

Example: じゅりあん (glosses are retained because they are not all transliterations)
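The merge described above could be sketched roughly as follows. This is only an illustration, not the code in this pull request: the `nameEntry` and `mergedTerm` types, the `isGeneric` heuristic, and the caller-supplied `isTranslit` check are all hypothetical names introduced here. It assumes an entry counts as generic when it has a kanji form and every gloss is a single-word transliteration of the reading.

```go
package main

import "strings"

// nameEntry is a hypothetical, simplified view of one JMnedict entry.
type nameEntry struct {
	Kanji   string   // kanji form; empty for kana-only names
	Reading string   // kana reading
	Glosses []string // English glosses
	Tags    []string // name-type tags such as "surname" or "place"
}

// mergedTerm is one output term: a kanji form with the distinct readings and
// tags of every generic entry that shared it.
type mergedTerm struct {
	Kanji    string
	Readings []string
	Tags     []string
}

// isGeneric reports whether an entry's glosses can be discarded. Assumed
// heuristic: the entry has a kanji form, and every gloss is a single-word
// transliteration according to the caller-supplied check.
func isGeneric(e nameEntry, isTranslit func(reading, gloss string) bool) bool {
	if e.Kanji == "" || len(e.Glosses) == 0 {
		return false
	}
	for _, g := range e.Glosses {
		if strings.Contains(g, " ") || !isTranslit(e.Reading, g) {
			return false
		}
	}
	return true
}

// appendUnique appends s to list unless it is already present.
func appendUnique(list []string, s string) []string {
	for _, v := range list {
		if v == s {
			return list
		}
	}
	return append(list, s)
}

// mergeGeneric groups generic entries by kanji form, preserving first-seen
// order. Non-generic entries are skipped here and would be emitted unchanged.
func mergeGeneric(entries []nameEntry, isTranslit func(reading, gloss string) bool) []mergedTerm {
	index := map[string]int{} // kanji form -> position in merged
	var merged []mergedTerm
	for _, e := range entries {
		if !isGeneric(e, isTranslit) {
			continue
		}
		i, ok := index[e.Kanji]
		if !ok {
			i = len(merged)
			index[e.Kanji] = i
			merged = append(merged, mergedTerm{Kanji: e.Kanji})
		}
		merged[i].Readings = appendUnique(merged[i].Readings, e.Reading)
		for _, tag := range e.Tags {
			merged[i].Tags = appendUnique(merged[i].Tags, tag)
		}
	}
	return merged
}
```

With this shape, the readings collected in `Readings` would become the glossary of the merged term, while entries that fail `isGeneric` (a gloss with a space, a kana-only name, or a gloss that is not a transliteration) keep their original glosses.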

JMdict: missing part-of-speech tags

I noticed that non-English versions of the new JMdict dictionaries did not have part-of-speech tags, unlike the old versions.

Only English-language senses in JMdict contain part-of-speech tags. The old version of Yomichan-Import took the PoS tags from the final sense in the English version of an entry and applied them to every sense of every other language. For example, 川・かわ has two senses in English JMdict: a noun sense and a suffix sense. Therefore every sense of 川・かわ in every other language was tagged as a suffix.

Instead, I suggest gathering all distinct part-of-speech tags from each English entry and applying them all to each non-English sense. Every non-English sense of 川・かわ will therefore be tagged as both a noun and a suffix. This still isn't ideal, but I think this is at least an improvement on the previous setup.
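A minimal sketch of this suggestion, under stated assumptions: the `sense` type and both function names are hypothetical and do not mirror the actual structures in jmdict.go, and each sense is assumed to carry a language code with `"eng"` marking English.

```go
package main

// sense is a hypothetical, simplified JMdict sense.
type sense struct {
	Lang          string   // language code, e.g. "eng" or "ger" (assumed field)
	PartsOfSpeech []string // PoS tags; only populated for English senses
	Glosses       []string
}

// distinctEnglishPoS collects every part-of-speech tag used by the English
// senses of an entry, in first-seen order and without duplicates.
func distinctEnglishPoS(senses []sense) []string {
	seen := map[string]bool{}
	var tags []string
	for _, s := range senses {
		if s.Lang != "eng" {
			continue
		}
		for _, pos := range s.PartsOfSpeech {
			if !seen[pos] {
				seen[pos] = true
				tags = append(tags, pos)
			}
		}
	}
	return tags
}

// applyPoS returns a copy of senses in which every non-English sense carries
// all distinct English tags. English senses are left as they are.
func applyPoS(senses []sense) []sense {
	tags := distinctEnglishPoS(senses)
	out := make([]sense, len(senses))
	copy(out, senses)
	for i := range out {
		if out[i].Lang != "eng" {
			// Give each sense its own slice so later edits do not alias.
			out[i].PartsOfSpeech = append([]string(nil), tags...)
		}
	}
	return out
}
```

For an entry like 川・かわ with one English noun sense and one English suffix sense, every non-English sense would come back from `applyPoS` with both tags, rather than only the tag of the final English sense.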

Test Dictionary Builds

Very old versions of JMdict and unofficial versions are unlikely to have the publication date entry at the end of the file.
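One way to guard against a missing date entry is sketched below. This is not the verification logic from this pull request: `publicationDate` and `datePattern` are hypothetical names, and the sketch assumes the date appears as `YYYY-MM-DD` somewhere in a gloss of the file's final entry.

```go
package main

import (
	"regexp"
	"time"
)

// datePattern matches a YYYY-MM-DD date anywhere in a string.
var datePattern = regexp.MustCompile(`\d{4}-\d{2}-\d{2}`)

// publicationDate searches the glosses of the final entry for a valid
// YYYY-MM-DD date. The boolean is false when no gloss contains one, which is
// the expected result for very old or unofficial builds.
func publicationDate(finalEntryGlosses []string) (time.Time, bool) {
	for _, gloss := range finalEntryGlosses {
		match := datePattern.FindString(gloss)
		if match == "" {
			continue
		}
		// time.Parse rejects impossible dates such as 2023-13-40.
		if date, err := time.Parse("2006-01-02", match); err == nil {
			return date, true
		}
	}
	return time.Time{}, false
}
```

A caller would check the boolean and fall back to some placeholder (or omit the date from the dictionary title) instead of treating an ordinary final entry as a date.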

Only English-language senses in JMdict contain part-of-speech tags. This info is displayed to users in definition tags and also used for deinflecting verbs and adjectives during term lookups. The old version of Yomichan-Import took the PoS tags from the final sense in the English version of an entry and applied them to every sense of every other language. For example, 川・かわ has two senses in English JMdict: a noun sense and a suffix sense. Therefore every sense of 川・かわ in every other language was tagged as a suffix. Instead, I suggest gathering all distinct PoS tags from each English entry and applying them all to each non-English sense. Every non-English sense of 川・かわ will therefore be tagged as both a noun and a suffix.

Thermospore · 2023-02-03T00:39:11Z

nice! yea currently I have jmnedict in its own profile with a different key to trigger it, cos it clutters things up. I'll have to try this out

one potential problem I see is that you can't do a kana -> kanji search for some entries. ex if you heard "おおやかいがん" and looked it up, this entry wouldn't show up

hopefully your ime or even just google could help you out in cases like this, but it is a bit of a regression

stephenmk · 2023-02-03T00:57:36Z

That is doable, but it's a tradeoff between utility and bloat. Adding kana-to-kanji lookups doubles the size of the term database, and I'm not sure if that functionality is actually useful.

I made a version like this last year if you'd like to try installing it and see for yourself: FooSoft/yomichan#2111 (comment)

Example: よしたけ

I've been using the version without the kana-to-kanji terms for about six months now and never found myself wishing for that functionality.

Thermospore · 2023-02-03T02:29:38Z

another issue I just noticed is if the reading is removed, freq dicts with readings (ex bccwj, B長 in my screenshot) wouldn't function anymore

maybe a yomichan change could allow for clean/compacted jmnedict entries while still allowing for kana searches and freq dicts with readings. (might even be some overlap with the changes described in this thread to allow for cleaner / more compact viewing of kanji/kana combinations)

tangentially related: I keep forgetting that modes other than group term-reading pairs exist... is there any reason not to use it? It might be better to just remove the other modes from yomichan and focus on improving grouped mode, instead of trying to finagle grouped mode-esque functionality into the other modes from the dictionary creation end

removing everything but grouped mode would also streamline development / testing / troubleshooting, since you'd have 1 less dimension of modes to worry about. maybe this could use its own thread on the yomichan repo...

thanks for reading, let me know your thoughts on this!


FooSoft · 2023-02-05T17:57:13Z

Looking good!

stephenmk · 2023-02-05T19:46:04Z

@FooSoft, thanks again for your time.

@Thermospore, it is indeed an issue that JMnedict contains no frequency information. For example, 若槻 might be read 「わかつき」 the vast majority of the time, but this isn't evident by looking at JMnedict. I actually mentioned this to the JMdict editors last year, although I didn't have any good solutions at the time. You made a good point that the BCCWJ frequency list could be used for this purpose. I just proposed this idea to the editors, and Dr. Breen agrees that it sounds promising.

If and when this frequency information is adapted and added to JMnedict, I can update the Yomichan dictionary to include standard expression + reading terms for names that are included in the BCCWJ list. This will allow frequency lists, pitch accent lists, flashcards, etc., to function normally.

Thermospore · 2023-02-07T05:50:52Z

> If and when this frequency information is adapted and added to JMnedict, I can update the Yomichan dictionary to include standard expression + reading terms for names that are included in the BCCWJ list. This will allow frequency lists, pitch accent lists, flashcards, etc., to function normally.

thanks for the response, sure that sounds like a good stopgap

next week when I have time, I'll make a thread on the yomichan repo about grouping modes, which would address the core of the issue

basically, I think grouped mode should be default (and various improvements / changes made), and have the other modes just be discontinued / hidden in advanced settings

probably 99% of people using a non grouped mode are just using it because it is default, or because of a feature it has which could just be implemented in grouped mode

the other modes are just holding things back, I think. I don't think grouped mode functionality should have to be finagled into all the modes, from the dictionary end:

it should all just be one mode

stephenmk added 5 commits January 30, 2023 13:26

Add verification logic for date entry in JMdict

b826dbf

Very old versions of JMdict and unofficial versions are unlikely to have the publication date entry at the end of the file.

New JMnedict version

8281301

Use library implementation of Contains function

3b420f8

Rename some jmdict functions

19d6d0b

stephenmk mentioned this pull request Feb 2, 2023

New version of JMdict for Yomichan #40

Merged

Use cached part-of-speech values

5755b79

Designate more JMnedict category tags

dffbec6

stephenmk added 2 commits February 3, 2023 15:51

Fix typo

70611a5

Simplify string -> runes conversion

a9d85dc

FooSoft reviewed Feb 4, 2023


Improve readability of publication date functions

ecf22da

FooSoft merged commit f4da17e into FooSoft:master Feb 5, 2023

stephenmk mentioned this pull request Apr 6, 2023

Possibility of moving/adding branded product names(particularly foods) to jmdict JMdictProject/JMdictIssues#93

Closed
