This repository has been archived by the owner on Feb 25, 2023. It is now read-only.

New version of JMdict for Yomichan #40

Merged 19 commits on Jan 30, 2023

@stephenmk
Contributor

stephenmk commented Jan 28, 2023

The current version of JMdict for Yomichan is missing important supplemental information provided in the original JMdict file. This pull request is to update Yomichan-Import to process this information and include it in the output dictionary files.

Example Sentences

Since 2021, an English version of the JMdict file featuring example sentences from the Tanaka Corpus has been published daily. Only priority-tagged sentences are included, so the number of examples per word is not overwhelming; typically at most one example is included per sense.

Following the style used on tatoeba.org, I have marked the English sentences with a UK flag emoji 🇬🇧 and Japanese sentences with a Japanese flag emoji 🇯🇵.

Example: 健康

1_example

Example: 現在 (includes two sentences on one sense, which is uncommon)

1_example_2

My updated version of Yomichan-Import can produce Yomichan dictionaries with or without these example sentences depending on which file is input (JMdict_e_examp or the regular JMdict file).

Sense Notes

These notes provide extra context on how words are used ("now mostly used in idioms", "also written as 訓む", etc.). I've updated Yomichan-Import to include this information and marked these notes with a notepad emoji 📝.

Example: に付けて

2_sensenotes

Related issue: yomichan #1165

Gloss Types

Some glosses (aka definitions) contain special type information. The current version of JMdict for Yomichan includes these glosses but does not indicate their types, which can cause some glosses to appear nonsensical.

I've marked these special glosses with an info emoji ℹ️ and prefixed the glosses with their types in italic font ("literally," "figuratively," or "trademark"). There are also "explanatory" gloss types, but I don't think those need an italic prefix.

Example: 上方絵 (explanatory)

3_glosstypes_2

Example: 猫の手も借りたい (literal)

3_glosstypes

Related issue: yomichan #2057

Source Languages

Entries for modern Japanese loanwords (外来語) contain language-of-origin information. I've updated Yomichan-Import to include this information and marked the notes with a globe emoji 🌐.

Example: ミサンガ

4_srclang_1

Example: スキンシップ (wasei)

4_srclang_3

Example: マッチポンプ (multiple languages of origin)

4_srclang_2

References

Some JMdict entries contain cross-references to other entries. I've updated Yomichan-Import to include these references and marked them with an arrow emoji ➡️.

I've formatted the referenced expressions as query links, so a user may jump to that entry by clicking on the link. The notes are also presented with a compact glossary of the referenced expression sense. For situations in which the reading of the referenced expression is ambiguous, I've also included the designated reading in parentheses.

Example: 舌の根

5_references_3

Example: 猛暑日 (includes two references)

5_references_2

Example: 脅す (references a word with an ambiguous reading)

5_references_4

Each reference in JMdict points to a specific numbered sense of an entry rather than the entire entry itself. However, the current version of JMdict for Yomichan does not indicate these sense numbers. For the sake of clarity, I have added numbered tags to entries containing multiple senses in order to indicate the original sense numbers.

Example: 故障 (four senses)

5_references_5

Antonyms

Antonyms are functionally identical to cross-references. I've marked these with a "counterclockwise arrows" emoji 🔄.

Example: 良くないね

6_antonyms_1

Other Forms

JMdict's structure assumes that the reader will be able to view the various forms of an expression alongside the term glossaries. For example, the entry for もと【元・本・素・基】 includes notes specifying that sense 1 is usually written 元, sense 2 is usually written 本, sense 4 is usually 素, etc. These alternative forms are not displayed in Yomichan's default result-grouping mode, and the new inclusion of sense notes without this form information could cause confusion. Aside from that, the ability to view these alternative forms is likely to be of interest to users in general.

For entries with more than one form, I have added an extra "forms" term in the Yomichan dictionary file which contains these forms in a regular list structure.

Example: ことば典

7_forms_3

For entries with more than one distinct reading, I have arranged the forms into a table.

Example: 魚虱

7_forms_4

Example: 素性

7_forms_1

Both the list and table formats contain symbols to represent various meta information about the different forms.

  • Priority (common) forms: ★
  • Rare kanji forms: 🅁
  • Irregular forms: ⚠
  • Outdated forms: ⛬ (Japanese map symbol for historic sites)

In tables, the ㊒ symbol is used to indicate valid forms without any special meta information. Also, gikun (義訓) readings and ateji (当て字) kanji are presented in angle brackets 〈 〉, which is a convention used by some Japanese dictionaries to denote jukujikun (熟字訓) terms.
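
To make the symbol scheme concrete, here is a small Go sketch of how a form's metadata could be mapped to these symbols. The function name `formSymbols` is hypothetical, and grouping the JMdict info tags `iK`, `ik`, and `io` under "irregular" and `oK`, `ok` under "outdated" is my assumption for illustration, not a description of the actual implementation.

```go
package main

import "fmt"

// formSymbols returns the display symbols for one form.
// priority marks a common form; infoTags are JMdict info tags.
// The tag-to-symbol grouping below is an assumption for this sketch.
func formSymbols(priority bool, infoTags []string) string {
	symbols := ""
	if priority {
		symbols += "★"
	}
	for _, tag := range infoTags {
		switch tag {
		case "rK": // rarely-used kanji form
			symbols += "🅁"
		case "iK", "ik", "io": // irregular kanji, kana, or okurigana
			symbols += "⚠"
		case "oK", "ok": // outdated kanji or kana
			symbols += "⛬"
		}
	}
	if symbols == "" {
		// Table cells for valid forms with no special metadata.
		return "㊒"
	}
	return symbols
}

func main() {
	fmt.Println(formSymbols(true, nil))
	fmt.Println(formSymbols(false, []string{"iK"}))
	fmt.Println(formSymbols(false, nil))
}
```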

I hope that my symbol choices are mostly intuitive, but I'm open to suggestions for improvements. I have also updated the term tags to align with these symbols, so users will be able to hover over them to see detailed explanations.

Example: ふいんき (with alt. text on the ⚠ term tag)

7_forms_5

Yomichan includes a "related terms" grouping mode which can be used to display much of this information, but for complicated entries the information can be difficult for users to parse.

Example: 素性 in "Group related terms" mode

7_forms_2

I've also designed this version of Yomichan-Import to produce a standalone dictionary which only contains these forms lists and tables, so users can access this information even if they don't want to use the rest of JMdict.

Related issue: yomichan #2183

Search-Only Terms

Since August 2022, JMdict has included "search-only" terms which are meant to aid term look-ups without cluttering entries with rare, non-standard spellings of words. I've updated Yomichan-Import to produce terms which display links to the standard forms of these terms.

Example: 登り旗 (redirects to のぼり旗)

8_search_1

Example: のぼり旗 in "Group related terms" mode (登り旗 is not displayed)

8_search_2

Example: 鉤なり (redirects to a word with an ambiguous reading)

8_search_3

Other Improvements

Frequency tags and term ranking

I've updated the names and descriptions of JMdict frequency tags to better reflect their meanings. The ranking method for determining the search result display order has also been adjusted. I wrote about this in the "Term prioritization" section of this comment.

Rarely-used kanji forms

I've updated the program to produce additional kana-only headwords for kana forms which are only associated with rare, irregular, or outdated kanji forms. So for example, a user who scans "それ" will now see "それ" as the headword of the top result rather than "其れ". I wrote about this in the "Rarely-used Kanji Forms" section of this comment.
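
The rule can be sketched in Go roughly as follows. The types and the `kanaOnlyHeadwords` function are hypothetical simplifications, and the `rK` tag on 其れ in the example is assumed for illustration: a kana form gets its own headword when every kanji form it is associated with is rare, irregular, or outdated.

```go
package main

import "fmt"

// kanjiForm and kanaForm are simplified, hypothetical models of JMdict forms.
type kanjiForm struct {
	Text string
	Info []string // info tags such as "rK", "iK", "oK"
}

type kanaForm struct {
	Text     string
	Restrict []string // kanji forms this reading applies to; empty means all
}

// rarelyUsed reports whether a kanji form is rare, irregular, or outdated.
func rarelyUsed(k kanjiForm) bool {
	for _, tag := range k.Info {
		switch tag {
		case "rK", "iK", "oK":
			return true
		}
	}
	return false
}

// appliesTo reports whether the reading is associated with the kanji form.
func appliesTo(r kanaForm, k kanjiForm) bool {
	if len(r.Restrict) == 0 {
		return true
	}
	for _, text := range r.Restrict {
		if text == k.Text {
			return true
		}
	}
	return false
}

// kanaOnlyHeadwords lists readings whose associated kanji forms are all
// rarely used, so the reading itself should also appear as a headword.
func kanaOnlyHeadwords(kanji []kanjiForm, kana []kanaForm) []string {
	var headwords []string
	for _, r := range kana {
		associated, allRare := 0, true
		for _, k := range kanji {
			if !appliesTo(r, k) {
				continue
			}
			associated++
			if !rarelyUsed(k) {
				allRare = false
			}
		}
		if associated > 0 && allRare {
			headwords = append(headwords, r.Text)
		}
	}
	return headwords
}

func main() {
	kanji := []kanjiForm{{Text: "其れ", Info: []string{"rK"}}}
	kana := []kanaForm{{Text: "それ"}}
	fmt.Println(kanaOnlyHeadwords(kanji, kana))
}
```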

Problems and Considerations

Yomichan validation

This new version of JMdict for Yomichan makes extensive use of nested data structures, so the validation step of the import process is very slow with the current version of Yomichan. On my PC (a 10-year-old Intel i7-3770K), the file takes 32 minutes to validate.

Related issue: yomichan #2138

Merging of terms from separate entries

Yomichan's default result-grouping mode merges terms from different JMdict entries if the terms share the same reading and expression. For example, the top search result for 元 will be a combination of the entries for もと【元・本・素・基】 (sequence 1260670) and もと【元・旧・故】 (sequence 2219590).

Example: 元 (the first 9 senses are for a priority ⭐ term, but the final 3 are not. Note also that sense #9 is hidden because it only applies to the 本 form of the word.)

9_merging

Example: 軽卒 (the final sense is for an irregular ⚠️ usage of the kanji form, but the first 2 are not)

9_merging_2

The term tags that appear next to the headword at the top of the Yomichan entry may not apply to every sense in the search results. Users can check the "forms" terms to determine which term tags apply to which senses, but this setup could cause confusion.

Test Dictionary Builds

(Updated 01/29/2023)

  • jmdict_english_extra_with_examples_2023_01_29.zip
    This is the complete new version of JMdict (English) for Yomichan with all of the new features described above.

  • jmdict_english_extra_2023_01_29.zip
    This is the same as the above version, except without the Tanaka Corpus example sentences. I wouldn't recommend this version, personally.

  • jmdict_english_2023_01_29.zip
    This is a "legacy" version of the dictionary which is similar to the currently published version. It does not contain the supplemental information in glossaries or the "forms" terms, but it does contain the search-only redirect terms and the new style of term tags (e.g. ⚠ tags instead of "iK" tags). This version validates very quickly during the import process. Once a solution is worked out for the validation problem, I think we can remove support for this version.

  • jmdict_forms_2023_01_29.zip
    Contains only "forms" terms and search-only redirect terms. This is for users who don't want to use the English version of JMdict.

  • jmdict_german_2023_01_29.zip
    This is a German version of JMdict produced by the new version of Yomichan-Import. The JMdict source file only contains the supplemental information described above (sense notes, cross references, etc.) for English language entries. Therefore the Yomichan versions for other languages remain largely the same as the old versions of the dictionaries.
    I've designed the new program to build non-English dictionary files without "forms" terms or the search-only terms. This allows a user to install both the English dictionary and a dictionary for another language without cluttering their installation with duplicated terms. If a user does not want to install the English dictionary, they can acquire the "forms" terms and search-only terms by installing the standalone "jmdict_forms" dictionary file.

Since this pull request summary is already very long, I've tried to avoid going into too much extra detail. I'd be happy to elaborate on any of these topics if there are any questions.

Thanks for taking the time to review this request.

Necessary for structured content support
This allows a user to install the English version and another version
without cluttering their setup with duplicated information.

If a user doesn't want to use the English version, they can get the
"search" and "forms" terms by installing the separate jmdict_forms
file.
If a term has a frequency tag, it should return higher in search
results than a match which does not have a tag.

For example, a search for 素性 should return すじょう rather than
そせい, because the former has a "news" frequency tag.
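
A minimal Go sketch of the ordering rule in this commit message, using a hypothetical `term` type and `rankTerms` helper (the real ranking considers more than the presence of a tag):

```go
package main

import (
	"fmt"
	"sort"
)

// term is a hypothetical, simplified search candidate.
type term struct {
	Expression    string
	Reading       string
	FrequencyTags []string
}

// rankTerms moves terms that carry any frequency tag ahead of terms that
// carry none, keeping the original order otherwise (stable sort).
func rankTerms(terms []term) {
	sort.SliceStable(terms, func(i, j int) bool {
		return len(terms[i].FrequencyTags) > 0 && len(terms[j].FrequencyTags) == 0
	})
}

func main() {
	terms := []term{
		{Expression: "素性", Reading: "そせい"},
		{Expression: "素性", Reading: "すじょう", FrequencyTags: []string{"news"}},
	}
	rankTerms(terms)
	fmt.Println(terms[0].Reading) // すじょう
}
```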
Sense numbers start at 1, not 0
If a headword appears in multiple entries, then each entry needs a
corresponding "forms" term in the output dictionary.

For example, 軽卒 is the only headword in entry 2275730, but 軽卒 also
appears as an irregular form in entry 1252910. If a "forms" term is
not included for the former entry, then it will appear that 軽卒 is
irregular for all senses in the output dictionary.
This commit ensures that terms are grouped among their entries of
origin and displayed in correct sequential order in Yomichan's default
result grouping mode, "Group term-reading pairs."

@FooSoft
Owner

FooSoft left a comment

お疲れ様です

I can see a lot of work has gone into this. Aside from some nitpicks, I think this looks like a righteous set of changes to me. Can you speak to any backwards compatibility issues here? I'm not seeing any as fields are being added to existing structures, but wanted to make sure that this is something we are thinking about.

Additional nit: can you please switch filenames like jmdictConstants.go from camelCase to snake_case?

Review comments on enamdict.go (outdated) and common.go (both resolved).

@Thermospore

sick, I'm excited to try this out! random question, where is the purple JMdict tag in the screenshots?
image

@stephenmk
Contributor Author

Thanks for the feedback, @FooSoft. I've updated the PR in accordance with your current notes.

As for backwards compatibility, do you mean with respect to Yomichan, the Yomichan-Import user interface, old versions of the JMdict file, or something else entirely? I've added one new format handler to the command line options (-format="forms") for producing the standalone forms dictionary, and I updated the build_dicts.sh script to include it. That should be the only change to the command-line interface.

The program will no longer produce the old style of JMdict dictionary files (the ones lacking all the supplemental info). Since my new version takes ~30 minutes or so to validate during the import process, it might not be wise to publish it for the general public until a solution is worked out. So if we want Yomichan-Import to continue to be able to produce the old style dictionaries, I could add some options to do so.

> sick, I'm excited to try this out! random question, where is the purple JMdict tag in the screenshots?

@Thermospore, I just hid it using custom CSS. I like to hide the dictionary tags and distinguish different dictionaries using differently colored backgrounds, like this:

.tag[data-category="dictionary"] {
    display: none
}
.definition-item[data-dictionary="JMdict"] {
    background-color: rgba(255,255,0,0.02);
}
.definition-item[data-dictionary="新明解"] {
    background-color: rgba(0,0,255,0.1);
}
.definition-item[data-dictionary="広辞苑"] {
    background-color: rgba(0,255,255,0.4);
}

etc.

@Thermospore

> I just hid it using custom CSS. I like to hide the dictionary tags and distinguish different dictionaries using differently colored backgrounds. like this:

ah that's cool, thanks

for reference I just tried importing the sans sentence one (jmdict_english_2023_01_28.zip). step 3 finished at 4m50s. step 6 finished 3.5 min later, at 8m22s. I think my 広辞苑 takes about 6m for me. I'm using chrome, win 10, and an AMD ryzen 9 5900X

@FooSoft
Owner

FooSoft commented Jan 29, 2023

@stephenmk if it's simple enough to add a flag for the output that can be quickly validated, we should probably do that. Folks do use this tool to build jmdict out-of-band of whenever I "officially" update the dictionaries.

The backwards compatibility question is mostly directed at how loading older versions of the dictionary is handled. There are weird unofficial versions of it floating around (for languages not officially supported); hopefully the new changes don't rely on this new data being present.

@Thermospore

Thermospore commented Jan 29, 2023

I agree it is a good idea to have search-only terms be, well, search only by default. however, I think it is an issue that the search-only terms are split out into separate entries. having the entries split up like this would cause issues with frequency lists and secondary lookups. my main use case for jmdict is that it is a (relatively) comprehensive index of all the kanji/reading forms of each headword, for pulling up jj dicts / frequencies etc as a primary dict in group related terms mode

also with this style, you have to perform an extra action whenever looking up a search only word (click hyperlink or re-search), which is a hassle. I get the impression a lot of words are commonly written in a search-only form, so this event would happen a lot

maybe yomichan could add native support for true search only terms (and add a "show search-only terms" option, off by default)? ie it could hide the headword somehow while still performing secondary lookups and frequency list lookups with it. "hide" could potentially mean entirely removing, or collapsing, or maybe greying it out like is currently done for obsolete kanji forms etc
image

honestly that whole headwords section might use a bit of rework in group related terms mode, now that jmdict has many new tags (rarely used kanji, search only, etc). currently you can't tell which of those tags apply to which headword, besides certain cases where headwords are colored/greyed out

maybe some improvements would be to

  • update the color coding with the new tags, so you can get the most important info at a glance (ex stuff that is rare / obsolete etc is greyed out, common/popular terms are blue, etc)
  • show all tags for a headword on hover, if you want all the details

sorry I typed this pretty quick feel free to ask for clarification or point out something I'm missing

Require `-language=english_extra` to produce the complete version of
the new JMdict dictionary file.

If and when we determine that all the new features are ready to be
included in the dictionary by default, we can remove this logic.
@stephenmk
Contributor Author

@FooSoft,

I updated my code to require a -language=english_extra parameter in order to produce the new style of the dictionary. Using the original -language=english parameter (or no -language parameter at all) will produce a dictionary file that is very similar to the currently published form of the file. I updated the "Test Dictionary Builds" section of my original post with more details.
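
For readers who want to picture the switch, here is a rough Go sketch using the standard `flag` package. The `buildOptions` type and `parseOptions` function are hypothetical, and defaulting the flag to "english" is an assumption standing in for the "no -language parameter at all" case; the actual program may be structured differently.

```go
package main

import (
	"flag"
	"fmt"
)

// buildOptions is a hypothetical summary of what the converter should build.
type buildOptions struct {
	Language string
	Extra    bool // true when the supplemental information is requested
}

// parseOptions reads the -language flag. "english_extra" selects the English
// dictionary with all of the new supplemental content enabled; any other
// value (or no flag) selects the plain output for that language.
func parseOptions(args []string) (buildOptions, error) {
	flags := flag.NewFlagSet("sketch", flag.ContinueOnError)
	language := flags.String("language", "english", "dictionary language")
	if err := flags.Parse(args); err != nil {
		return buildOptions{}, err
	}
	if *language == "english_extra" {
		return buildOptions{Language: "english", Extra: true}, nil
	}
	return buildOptions{Language: *language}, nil
}

func main() {
	for _, args := range [][]string{nil, {"-language=english"}, {"-language=english_extra"}} {
		opts, err := parseOptions(args)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%v -> %+v\n", args, opts)
	}
}
```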

Regarding weird unofficial versions of JMdict, as long as they can be parsed by your JMdict library, I think there shouldn't be much of a problem. The only issue I can think of would be with the headword frequency tags ("news1", "ichi2", "spec1", etc.) and info tags ("iK", "oK", "io", etc.). My version of Yomichan-Import searches for these tags specifically, renames them, and uses them to determine things like term ranking in search results. So if the weird unofficial version contained different tags, I think Yomichan-Import would still output an otherwise-complete dictionary file, but those custom tags would be missing.

That said, I'm very surprised to hear that such unofficial versions exist. The idea of the JMdict XML schema being used by anyone other than the EDRDG sounds ill-advised. The thought of maintaining unofficial support for such files also scares me a little. The JMdict format is a moving target with many changes planned in the future, so if the weird unofficial versions are still being developed and used, I think it would be better for the users to contact us so that we may provide a dedicated module for those formats rather than for us to continue providing unofficial support.


@Thermospore,

I share your concern and think that you make a good point, but there is more information that needs to be considered.

Search-only tags in JMdict are generally for forms that would not have been allowed in the dictionary prior to August 2022. This means that these forms are almost always relatively rare and also irregular (i.e. not featured in other major dictionaries). You can read more about this in the JMdict editorial policy. My new version should continue to serve as a relatively comprehensive index as it did before.

However, there are some notable cases in which it won't. For example, JMdict has an entry for the expression 「そこに山があるから」. A different form of the expression 「其処に山があるから」 is used in Daijisen, but because of the usage of a rare kanji form (其処), this form of the expression was not viable for inclusion in JMdict until the search-only policy was implemented. So while JMdict should function well as an indexing dictionary 99% of the time without the search-only forms, there are indeed some situations in which the search-only forms would be useful to group by.

For now, that's a bit outside of the scope of the dictionary file. Just as you suggest, Yomichan itself would need to be modified to hide the search-only terms in the "Group related terms" mode.

> also with this style, you have to perform an extra action whenever looking up a search only word (click hyperlink or re-search), which is a hassle. I get the impression a lot of words are commonly written in a search-only form, so this event would happen a lot

No, it should be a fairly rare occurrence. I can see how it would be a hassle if you're using Yomichan's clipboard monitor and aren't actively focused on the browser window, but otherwise it really is just one click to redirect to the standard form of the word. You also need to consider Yomichan's default "Group term-reading pairs" mode, which only displays one headword at a time. In that situation, a user wouldn't be very interested in the search-only form and would want to be redirected to a standard form of the word. The "Group related" mode removes the need for this redirect, but only in that mode.

> honestly that whole headwords section might use a bit of rework in group related terms mode, now that jmdict has many new tags (rarely used kanji, search only, etc). currently you can't tell which of those tags apply to which headword, besides certain cases where headwords are colored/greyed out

I think the right approach is to display this information in a table, as I've done in the new dictionary. Presenting three or more forms in a flat list with various different readings and subtle differences in kanji is invariably going to be difficult to comprehend regardless of the addition of extra symbols and colors.

Custom dictionary files using the JMdict XML format may contain
nonstandard frequency and information tags.
@stephenmk
Contributor Author

I noticed this comment in issue #30:

> Pretty much every Japanese study tool ever created (if it includes a dictionary) uses EDICT/JMDICT2. A converter from yomichan format would be very handy in plugging into these types of tools.

Having read that, I can see why someone would convert a dictionary into the JMdict XML format.

I updated my code so that undocumented frequency and information tags will be included in the output dictionary files. With that completed, I don't believe there are any backwards compatibility issues with this new version.

(The current production version of Yomichan-Import actually has a frequency tag whitelist, but I see no reason why unknown tags should be discarded if they're found.)
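
The difference between a whitelist and the pass-through behavior can be sketched in Go like this. The `renamedTags` table and `outputTags` function are hypothetical; the two renames shown follow the symbols described in the PR summary, and the full mapping in the real program is larger.

```go
package main

import "fmt"

// renamedTags is a small, illustrative subset of known tags and the names
// they are given in the output dictionary (assumed for this sketch).
var renamedTags = map[string]string{
	"iK": "⚠",
	"oK": "⛬",
}

// outputTags renames the tags it recognizes and keeps every other tag
// unchanged, instead of discarding tags that are not on a whitelist.
func outputTags(sourceTags []string) []string {
	out := make([]string, 0, len(sourceTags))
	for _, tag := range sourceTags {
		if renamed, known := renamedTags[tag]; known {
			out = append(out, renamed)
		} else {
			out = append(out, tag)
		}
	}
	return out
}

func main() {
	fmt.Println(outputTags([]string{"iK", "custom-tag"}))
}
```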

@FooSoft merged commit 74de4ce into FooSoft:master on Jan 30, 2023
@stephenmk
Contributor Author

stephenmk commented Feb 2, 2023

I just noticed that non-English versions of the new JMdict dictionaries do not have part-of-speech tags, unlike the old versions.

New Russian entry for 早急

new_russian

Old Russian entry for 早急 (has 'adj-na', 'adj-no', and 'n' tags)

old_russian

Just wanted to make a note about what's happening here. The prior version of the program would loop through all of the senses in an entry and keep a copy of the last set of part-of-speech tags that it found, even if those tags were from a different language within that entry. If it found a sense without part-of-speech tags, it would assign those previous tags to it.

yomichan-import/edict.go

Lines 121 to 127 in 9222417

var partsOfSpeech []string
for index, sense := range edictEntry.Sense {
	if len(sense.PartsOfSpeech) != 0 {
		partsOfSpeech = sense.PartsOfSpeech
	}

Strictly speaking this isn't correct, although it might produce correct information some or even most of the time. If both the English and Russian versions of an entry only have one sense each, then the part-of-speech info is most likely the same. All bets are off outside of that special case, though, so I think it's best to stick with the new behavior.
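
Here is a self-contained Go sketch that reproduces the carry-over described above. The `sense` type, the `carryForward` function, and the language codes are hypothetical simplifications of the real data structures; only the carry-forward logic mirrors the quoted loop.

```go
package main

import "fmt"

// sense is a simplified, hypothetical model of one JMdict sense.
type sense struct {
	Language      string
	PartsOfSpeech []string
}

// carryForward mimics the old loop: a sense without part-of-speech tags
// inherits the most recent tags seen, regardless of the sense's language.
func carryForward(senses []sense) [][]string {
	var partsOfSpeech []string
	assigned := make([][]string, len(senses))
	for index, s := range senses {
		if len(s.PartsOfSpeech) != 0 {
			partsOfSpeech = s.PartsOfSpeech
		}
		assigned[index] = partsOfSpeech
	}
	return assigned
}

func main() {
	senses := []sense{
		{Language: "eng", PartsOfSpeech: []string{"adj-na", "adj-no", "n"}},
		{Language: "rus"}, // no tags of its own
	}
	for index, tags := range carryForward(senses) {
		fmt.Println(senses[index].Language, tags)
	}
}
```

In this example the Russian sense ends up with the English sense's tags, which is how the old Russian entry for 早急 came to show 'adj-na', 'adj-no', and 'n'.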

However, Yomichan won't be able to deinflect verbs and such if this information is missing. I'll need to modify the program a bit to ensure that at least the appropriate grammar rules are added to these terms.


Edit: I thought about this some more and changed my mind a bit. I've written about this more here: #41

@Kalleo1

Kalleo1 commented Feb 12, 2023

This dictionary is amazing! But I'm having one little problem: my pop-up is not showing some emojis, like the country flags and the info emoji.

I noticed some people with the same problem as me, and we are all using Windows 10 and a Chromium-based browser, so I think the problem is one of these two (or both).

If there's a way I could change to a different emoji, I would appreciate it.

(screenshots: 4, 5)

@stephenmk
Contributor Author

@UMNV

Thanks, I'm glad to hear that people are liking the new dictionary.

I did a search for more info about this font issue and found this blog post: https://nolanlawson.com/2022/04/08/the-struggle-of-using-native-emoji-on-the-web/

> As it turns out, Microsoft’s emoji font does not have country flags on either Windows 10 or Windows 11. So instead of the US flag emoji, you’ll just see the characters “US” (and the equivalent country codes for other flags).

It sounds like this is an issue specifically with Chromium-based browsers on Windows. It seems Firefox doesn't have this problem because it ships with Twitter's emoji font by default.

I don't have a PC with Windows installed that I could use to help you troubleshoot the problem, unfortunately. I see that there is an extension for Chrome which adds the Twitter emoji font and claims to fix the flag issue. I can't test it myself, but maybe you could try it.

https://chrome.google.com/webstore/detail/twemoji-for-chrome/fopgafjdjlongoeblobbafbnapafcicg?hl=en

@Kalleo1

Kalleo1 commented Feb 12, 2023

@stephenmk
I tried this extension (and some others), and it does change the flags displayed in Chrome, but it doesn't work with Yomichan for some reason.
My temporary solution was to edit the JSON files and swap in some other emoji. It's not ideal, but it's working.
(screenshot: 1)

but thanks for your help and the amazing work you're doing.

@stephenmk
Contributor Author

@UMNV, sorry to hear it didn't work. I imagine a large majority of Yomichan users are on Windows and Chromium-based browsers, so it would be nice if we had a better solution. It might be smart to include some default embedded fonts in the Yomichan extension itself, but implementing that kind of functionality is outside my expertise at the moment.

Rather than editing the JSON files, you can set the icons using some custom CSS in your Yomichan settings. Here are the default values:

ul[data-sc-content="glossary"] {
  list-style-type: circle !important;
}
ul[data-sc-content="infoGlossary"] {
  list-style-type: "ℹ️ " !important;
}
ul[data-sc-content="sourceLanguages"] {
  list-style-type: "🌐 " !important;
}
ul[data-sc-content="notes"] {
  list-style-type: "📝 " !important;
}
ul[data-sc-content="antonyms"] {
  list-style-type: "🔄 " !important;
}
ul[data-sc-content="references"] {
  list-style-type: "➡️ " !important;
}
ul[data-sc-content="examples"] {
  list-style-type: "🇯🇵 " !important;
}
ul[data-sc-content="examples"] > li[lang="en"] {
  list-style-type: "🇬🇧 " !important;
}

4 participants