Kana forms - threshold for inclusion #62

Closed
Marcusjmdict opened this issue Apr 12, 2022 · 1 comment

Comments

@Marcusjmdict

https://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&q=1267620.1

In the 虎の巻 entry, Stephen added two additional surface forms to the kanji field, トラの巻 and とらの巻, quoting the ngram numbers:

虎の巻 257508
トラの巻 71459
とらの巻 5711
とらのまき 1565
トラのまき 633
とらのマキ No matches
トラのマキ No matches

I don't have an issue with トラの巻, which is obviously common, but とらの巻, on the other hand, only gets 1.7% of the total ngram hits. If it were a "true kanji form" (i.e. with another kanji rather than with kana replacing a kanji), we'd have tagged it [rK], and I think [rK] forms are only really worth adding when they appear in other dictionaries. So I removed とらの巻, but Jim added it back, saying "With 5k in the ngrams I'd keep it."

We discussed this previously in the 女の子 entry regarding the addition of オンナノコ[nokanji] to the reading field: https://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&e=2181330
We ended up not including it, despite the absolute number (100k) being rather high.

I feel we need to sit down and figure out exactly what thresholds we want these different forms to meet for inclusion. Personally, I don't think we should include a form that isn't in another dictionary if it would qualify as [rK] had it contained unique kanji, i.e. if it gets less than 2.5% of the total ngram hits. The ngrams of course aren't the be-all and end-all, and Twitter usage etc. can be a useful indicator too depending on the word, but I don't think we need to make any exceptions for absolute numbers. Like I said in the 女の子 entry, "it doesn't make sense to just look at the raw numbers, or all our P-tagged entries should be horrible messes with lots of different and rare versions. There's a balance we need to strike between presenting easy-to-read entries, and trying to include absolutely everything."
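
To make the proposed rule concrete, here is a rough sketch of the share calculation using the ngram counts quoted above. The 2.5% cut-off is the one suggested in this comment, not settled policy, and the names `shares`, `rare_forms` and `RARE_THRESHOLD` are made up for illustration. It assumes the total is taken over all five forms that have matches.

```python
# Ngram counts quoted above for the surface forms of 虎の巻
# (forms with no matches are left out).
counts = {
    "虎の巻": 257508,
    "トラの巻": 71459,
    "とらの巻": 5711,
    "とらのまき": 1565,
    "トラのまき": 633,
}

# Proposed cut-off: a form below 2.5% of the total would be treated as rare.
# This value is the suggestion made in this thread, not an agreed rule.
RARE_THRESHOLD = 0.025


def shares(counts):
    """Return each form's fraction of the summed ngram hits."""
    total = sum(counts.values())
    return {form: n / total for form, n in counts.items()}


def rare_forms(counts, threshold=RARE_THRESHOLD):
    """Return the forms whose share falls below the threshold."""
    return [form for form, share in shares(counts).items() if share < threshold]


for form, share in shares(counts).items():
    flag = "rare" if share < RARE_THRESHOLD else "ok"
    print(f"{form}\t{share:.1%}\t{flag}")
```

On these numbers トラの巻 is well above the cut-off, while とらの巻 and the two all-kana forms fall below it, regardless of their absolute counts.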

@JMdictProject
Owner

I think this matter has been resolved with the introduction of the sK tags. The とらの巻 form now has that tag, so it isn't displayed as part of the 虎の巻 entry, but the entry can still be found by searching for it. (https://www.edrdg.org/cgi-bin/wwwjdic/wwwjdic?1MDJ%A4%C8%A4%E9%A4%CE%B4%AC)
I'll close the issue.


2 participants