Possibility of moving/adding branded product names(particularly foods) to jmdict #93

Closed
briankrznarich opened this issue Apr 4, 2023 · 10 comments

Comments

@briankrznarich

briankrznarich commented Apr 4, 2023

I had a whole novel here, but I've replaced it with something more to the point after the feedback below (and since this may not be going anywhere if the general point doesn't find any sympathetic ears).

As noted by the reply post, my thoughts are that there is value in moving branded product names (particularly food names), into JMDictdb. The rationale is that these are often listed as ingredients in other dishes. We have every kind of orange (はっさく, 八朔, for example, although that seems like it could be moved into jmnedict by the same logic), and every kind of obscure spice or plant. But if a branded product name shows up on a restaurant menu, the dictionary provides no help in identifying what it is.

This isn't a large number of entries. I understand why jmnedict was split off from the main dictionary, but there are currently only ~600 items tagged as [product], and only a fraction of these are foods. Moving them to jmnedict, which most people expect to be a repository for people/place/company names, has greatly reduced their visibility and utility in downstream consumers of this data.

My specific personal experience was with a dish identified as ベビースターもんじゃ焼き. I looked up "babystar" and found nothing in the dictionary. This is apparently a common pairing (ベビースターもんじゃ焼き is something for which you can find many recipes online).

Jim points out that this was actually in jmnedict already:
http://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&e=2174891
ベビースターラーメン; ベビー・スター・ラーメン
[product]
▶ Baby Star Ramen

I'll note that the original entry was "baby star crispy noodle snack", which is what I would have actually wanted to see. Also, ベビースター seems to appear with もんじゃ焼き without the ラーメン suffix. In the end, what I'd like the dictionary to tell me is that "ベビースター" (53,935 ngrams) is a crispy ramen noodle snack.

I apologize for my laziness with jisho.org. I certainly don't mean to treat this as a forum for jisho.org issues; I just know that there is some communication between the groups. I will go ahead and ping their email and see if it goes anywhere.

What you might wish to know though is that:

  1. jisho.org is not purging entries that have moved from jmdict to jmnedict (edit: they are purging ordinary deletes)
  2. jisho.org is not importing any [product] entries into their system (maybe they don't import anything anymore from jmnedict, I haven't looked into it).

This goes to my point of reduced visibility/utility of these particular segregated entries. But I'll drop them a line, and if anything of relevance comes back, I'll update.

@JMdictProject
Owner

The starting point for this is the Proper Names section in our Editorial Policy (https://www.edrdg.org/wiki/index.php/Editorial_policy#Proper_Names). You are proposing that "branded product names" be added to the categories of proper names in JMdict.

My view is that in general a dictionary like JMdict is not the place for such entries. The companion JMnedict covers them and ベビースターラーメン is already an entry there, as is ポッキー.

The coverage of the jisho.org server should be taken up in its forum. It is not really relevant to the coverage of JMdict.

[Please try and be more succinct and to the point when raising these issues.]

@briankrznarich briankrznarich changed the title Possibility of moving branded-food names from jmnedict to jmdict; also, jisho.org does not display [product] entries anywhere Possibility of moving/adding branded product names(particularly foods) to jmdict Apr 4, 2023
@briankrznarich
Author

To the point of the proper role of a dictionary, while I don't think it makes sense to catalog every brand-name found on my grocery store shelf, at some point if a product becomes so engrained in a culture that everyone knows exactly what it is, and it starts showing up as an ingredient by name on menus, I think that an exception is warranted for the benefit of the user of the dictionary.

In English, I keep thinking of Taco Bell's "Doritos Locos Taco". By analogy with Babystar Ramen, it is also the case that if you google "doritos recipes", you'll find plenty of home-cooked meals you can make. At some point, a foreigner might need to figure out what a "dorito" actually is, and excluding it from the dictionary on a technicality is of no help to them (just as it was no help to me to be unable to figure out what "babystar" was).

And so I looked for "Doritos" in the dictionary, and this is a bit of what I found:

https://www.oxfordlearnersdictionaries.com/definition/english/doritostm
Doritos™ noun /dəˈriːtəʊz/ [plural]
​a crispy snack food made from maize and sold in various cheese and spicy flavours

From Longman Dictionary of Contemporary English
https://www.ldoceonline.com/dictionary/doritos
Do‧ri‧tos /dəˈriːtəʊz/ trademark
a type of corn chip

https://www.wordsense.eu/Doritos/
(This is an extremely elaborate entry with sentence examples and everything: "I looked out the kitchen window at my garden, my trenches, my dirt, and then my gaze turned downward toward my Dorito-stained hand.")

http://www.wordow.com/english/dictionary/Doritos
Doritos /doʊˈriːtoʊz/ is a brand of seasoned tortilla chips produced since 1964 by American food company Frito-Lay (a wholly owned subsidiary of PepsiCo).

Which is to say, the exceptional inclusion of such everyday brand names by proper dictionaries is not unprecedented.

@briankrznarich
Author

briankrznarich commented Apr 5, 2023

I'll lead with the conclusion. If an item goes into jmnedict, most end-users of jmdict data (given the popular tools available) will never see it. jmdict is a 100+ MB download (in some formats). jmnedict is larger still. People and tools that don't want a giant database of every Japanese surname are choosing not to incorporate jmnedict (which is one of the reasons it was broken out, no?).

I would assert that these people would still like to know what ポッキー, ベビースター and チョコビ are. They want to know that 都こんぶ is a "sweet and sour snack made from dried kelp" (our jmnedict entry). (And some of them might even like to know what a ピンゾロ is.)

But because these entries are shunted off with the rest of the proper names, many people do not get to benefit from them. There are only 600 [product] entries, and it wouldn't have to be a universal move. But if we want these entries to actually be useful (to a larger set of end-users), then it would be nice to consider (re-)incorporating them into the main database.

====

I wanted to actually see what the landscape was like for jmnedict. I downloaded all of the common browser (Chrome, Firefox) plugins I could find, and all of the common iOS and Android apps I could identify. I searched for a couple of key terms with different tags, dates-of-entry, and sources (jmdict, jmnedict) and compiled the results in a Google spreadsheet here:

https://docs.google.com/spreadsheets/d/1FwT3gtCD1xskpD-8V-raeWch-fut-D__4NdmRQXh50s/edit?usp=sharing

Only one tool seems to be doing the "right" thing with jmnedict entries, the firefox plugin:
10ten Japanese Reader (formerly Rikaichamp)

It finds [product], [place], etc. entries, and returns them only as exact matches (placed directly above other dictionary results).

The following tools appear to incorporate place names, but not [product] entries (mysterious):
jisho.org
ejje.weblio.jp/content/

The following never incorporate names:
Shirabe Jisho (popular iPhone app)
Rikaikun (popular Chrome plugin)
"Nihongo" (iPhone app)
imiwa (iPhone app - out of date, 2020?)
Midori (iPhone app)
Takoboto (popular Android app)
Yomiwa (Android, looks fairly out of date)

The following incorporate names with some big provisos:
Aedict is a very robust Android dictionary app with support for many online dictionaries, which can be downloaded from within the app. jmdictdb is installed as a base dictionary by default. The jmnedictdb database is available separately, among a list of ~10 potential choices. In order to get the definition of ポッキー to appear, the user must install a 160 MB database of proper names. Once the name dictionary is enabled, the name entries seem to get precedence over the primary entries, making some vocab searches practically unusable (みやこ・都 is one such case, returning hundreds of place names before the dictionary definition). Given that usability cost, I do not get the impression that most people enable the "proper name dictionary" option.

Yomichan (a popular browser plugin) has been unsupported for several years, and it doesn't look like the EPWING dictionary exports it consumes are updated by this project. Dictionaries must be manually downloaded from the developer's orphaned website (and they are 2+ years out of date), then manually imported. As with Aedict, the user must explicitly download, then import the "names" dictionary, in addition to the main jmdict database. Most people probably skip the names.

@stephenmk

This is tangential to the current discussion, but (for the record) Yomichan only stopped receiving support a couple months ago. There's a community-led fork attempting to keep it alive now since it's essential software to so many people.

Yomichan doesn't consume EPWING versions of JMdict. A separate desktop program, Yomichan-Import, downloads the regular XML-formatted JMdict file and converts it into a yomichan-compatible dictionary file. You can technically run the program yourself to get a dictionary file with the most up-to-date data any time you want, but that's admittedly too technical for most users.

For JMnedict, the original version for Yomichan did indeed get passed over by lots of users. A complaint that I had and also heard from many others was that it cluttered search results with too many low-quality entries. For example, searching for a word prefixed with 大 would also return results for all of the 44 generic "大" name entries in JMnedict. I recently redeveloped the Yomichan version of JMnedict to group all generic name entries together into one search result, similarly to how the entries are displayed on WWWJDIC.
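As a rough illustration of the grouping idea described above, here is a minimal Python sketch. The entry layout (dicts with `headword`, `reading`, `name_type`, `gloss`), the set of types treated as "generic", and the sample data are all assumptions made for this example; this is not the actual Yomichan dictionary format or the real conversion code.

```python
from collections import defaultdict

# Hypothetical set of name types treated as "generic" (transliteration-only).
GENERIC_TYPES = {"surname", "given", "fem", "masc", "place", "unclass"}

def group_generic_names(entries):
    """Collapse generic name entries that share a headword into one result.

    Specific entries (products, companies, ...) are passed through unchanged
    and listed first; grouped generic results follow.
    """
    grouped = defaultdict(list)  # headword -> list of (reading, gloss)
    results = []
    for e in entries:
        if e["name_type"] in GENERIC_TYPES:
            grouped[e["headword"]].append((e["reading"], e["gloss"]))
        else:
            results.append(e)
    for headword, readings in grouped.items():
        results.append(
            {"headword": headword, "name_type": "generic", "readings": readings}
        )
    return results

# Illustrative sample data only (not copied from JMnedict).
sample = [
    {"headword": "大", "reading": "おお", "name_type": "surname", "gloss": "Oo"},
    {"headword": "大", "reading": "だい", "name_type": "given", "gloss": "Dai"},
    {"headword": "大", "reading": "ひろし", "name_type": "masc", "gloss": "Hiroshi"},
    {"headword": "ポッキー", "reading": "ポッキー", "name_type": "product", "gloss": "Pocky"},
]

grouped_results = group_generic_names(sample)
for result in grouped_results:
    print(result)
```

With this layout, a search for 大 would surface one combined "generic" result instead of one row per reading, while the ポッキー product entry stays a result of its own.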

I've suggested in the past that it could be useful to somehow separate these generic name terms (which only contain transliterations in the glosses) from the more specific name entries (Doritos, etc.). I think a lot of people disregard JMnedict because they believe it only has the former.

@JMdictProject
Owner

I think the real issue is how to make it easier for the more common and important name entries to be made available on more platforms. I've been rather insulated from this as all the platforms I use display both JMdict and JMnedict entries, but Brian's summation has highlighted a problem. I wasn't aware, for example, that jisho only included some of the name entries. I agree that the size of JMnedict, combined with the fact that many entries in it are rare (to say the least), makes it difficult for a lot of apps to include it in its entirety.

I don't think choosing one or more name categories and moving them into JMdict is necessarily the way to go. I think we probably need to think of a deliverable dictionary file which combines JMdict with a selection of the more common and useful JMnedict entries. I'll think about this in coming days and see if there is a workable approach. I'll possibly open a fresh issue once I have my thoughts in order.

Just for interest, I did some counts of some of the various name categories. They are (roughly)

  • unclassified: 130636
  • places: 210699
  • surnames: 125002
  • female: 105259
  • male: 20669
  • given (no gender): 58616
  • companies: 1182
  • organizations: 4961
  • products: 620
  • works: 1203
  • stations: 8261
  • services: 93

@robinjmdict

robinjmdict commented Apr 7, 2023

I agree with Jim.

The simplest solution might be to have two JMnedict files: one for place names and personal names, and another for all the other categories (except "unclassified"). The latter file would be much smaller (and probably more useful for most users). I don't think there's an easy way to extract "useful" place names and personal names.
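For a rough sense of scale, here is a quick Python tally using the approximate category counts posted earlier in the thread. The grouping into `other` follows the split suggested above (everything except place names, personal names, and "unclassified"). The counts are rough and entry count is only a proxy for file size, so treat the percentage as a ballpark.

```python
# Category counts copied from the rough figures posted in this thread.
counts = {
    "unclassified": 130636,
    "places": 210699,
    "surnames": 125002,
    "female": 105259,
    "male": 20669,
    "given (no gender)": 58616,
    "companies": 1182,
    "organizations": 4961,
    "products": 620,
    "works": 1203,
    "stations": 8261,
    "services": 93,
}

# Categories that would go into the second, smaller file.
other = ["companies", "organizations", "products", "works", "stations", "services"]

total = sum(counts.values())
other_total = sum(counts[c] for c in other)

print(total, other_total, f"{other_total / total:.1%}")
# 667201 16320 2.4%
```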

@briankrznarich
Author

briankrznarich commented Apr 7, 2023

I'm going to put on my "software developer" hat now, and say that I am happy to help in implementing whatever the group might decide to do, if that help would be useful.

===

I'm glad to see that we seem to be reaching similar conclusions, and aren't really on such different pages after all. Jim's comment about a deliverable dictionary combining JMdict with some common jmnedict entries seems to me like one of the most beneficial options to end-users (and without having to re-organize jmdict/jmnedict).

We might think about making such a dictionary ("JMDict + Useful Proper Names") the "default" export file, and ("JMDict Classic - minimal proper names") a download file with a new name. This would make getting the new entries "opt-out" instead of "opt-in", and have the fastest immediate impact. I guess it depends on what we think downstream users would want, or on what "representations" we feel have been made about the data. (If the selected entries don't overwhelm standard searches, as 大 and 都 currently do with jmnedict, I think most people would want the data regardless of size. Bytes are cheap; a 20% file size increase would probably mean nothing to most applications.)

Continuing to think out loud....

Not sure about the large [station] category (since, when reading, the appearance of 駅 pretty much tells you what you are looking at, and reduces the rest to a "place" entry), but for companies, organizations, products, works, and services, these mostly look like entries that would be useful.

Katakana -> romanization, in particular, would be a godsend. In my 3rd year of Japanese, I still couldn't get from ボトル to "bottle". Not in a million years would I piece this together:

タフツ大学 | タフツだいがく | Tufts University (2600 ngrams)

We could consider using ngrams to do some preliminary filtering for some value of 'common':

中国国際航空 Air China (ngrams 47458, I think it would be nice to have this)
中国地質大学 China University of Geosciences; CUG (183 ngrams, maybe not so critical)

This would involve adding an explicit tag of some kind to jmnedict, like [shared], which we could initialize programmatically, then maintain by hand. This would allow cherry-picking of important person/place names later on if there were any desire to do so.

Alternatively, we could make [companies], [organizations], etc. [shared] by default, and [places], [surnames], [unclassified], etc. [unshared] by default, then use [shared] and [unshared] tags only to override this (either approach has tradeoffs).
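To make the second approach concrete, here is a minimal Python sketch of category defaults with explicit overrides, plus an optional ngram cutoff like the one floated above. Everything here is hypothetical: the `DEFAULT_SHARED` table, the `is_shared` function, the entry layout, the category assigned to each sample entry, and the threshold of 1000 are assumptions for illustration, not existing jmnedict tags or tooling. Only the ngram counts are taken from this comment.

```python
# Hypothetical category defaults: True means "include in the combined file".
DEFAULT_SHARED = {
    "company": True,
    "organization": True,
    "product": True,
    "work": True,
    "service": True,
    "place": False,
    "surname": False,
    "unclassified": False,
}

def is_shared(entry, ngram_threshold=None):
    """Decide whether a name entry goes into the combined export.

    Explicit "shared"/"unshared" tags always win. Otherwise the category
    default applies, optionally requiring a minimum ngram count.
    """
    tags = entry.get("tags", set())
    if "shared" in tags:
        return True
    if "unshared" in tags:
        return False
    if not DEFAULT_SHARED.get(entry["category"], False):
        return False
    if ngram_threshold is not None:
        return entry.get("ngrams", 0) >= ngram_threshold
    return True

entries = [
    {"kanji": "中国国際航空", "category": "company", "ngrams": 47458},
    {"kanji": "中国地質大学", "category": "organization", "ngrams": 183},
    {"kanji": "タフツ大学", "category": "organization", "ngrams": 2600},
    # Hand-tagged override: kept even without any ngram evidence.
    {"kanji": "ピンゾロ", "category": "unclassified", "tags": {"shared"}},
]

selected = [e["kanji"] for e in entries if is_shared(e, ngram_threshold=1000)]
print(selected)
```

The override tags are checked first so that hand-maintained decisions (like keeping a niche term) survive any later change to the defaults or the cutoff.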

I wouldn't advocate for a hard ngram requirement. My new dream is to let the world know that a ピンゾロ is a moniker for a 1995~2000 model-year Toyota Corolla Levin model AE111 (111 = ピンゾロ, a perfect dice roll of 3 ones). I'm not sure I could prove popularity in an absolute sense, but it is an interesting piece of information within a niche (car modding and street racing). Without such an entry, a twitter entry like this is very hard to interpret:

https://twitter.com/ae111teru32gtr/status/1643612874433085441

But as usual, I digress.

==
@stephenmk I'm glad to hear that Yomichan is being picked up. The maintainer who announced he was stopping development didn't bother to put a date in his announcement post. I guessed "2 years" based on the absence of ポッキーゲーム in my downloaded Yomichan dictionary. It may be the case that the dictionaries on his website haven't been updated in a while, even if the plugin has been maintained.

@yamagoya

yamagoya commented Apr 7, 2023

Some tool-related info that might be helpful in this discussion...

The JMdictDB database stores both JMdict and JMnedict entries in a common format and the tool that produces the XML files can already output JMnedict entries in JMdict-formatted XML. When doing this, JMnedict XML tags and entities are mapped to similar JMdict ones so that the produced XML conforms to the JMdict DTD.

Thus it is currently quite feasible to produce a JMdict-format XML file containing a subset of JMnedict entries that could be distributed separately or appended to a full JMdict XML file.

To select a subset of JMnedict entries is a little awkward at present (create a simple ad hoc tool to generate a list of the sequence numbers of the JMnedict entries desired and feed that list to the XML generator tool) but this could be easily improved.
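A minimal sketch of what such an ad hoc selection tool might look like in Python follows. It assumes the JMnedict XML layout of `<entry>` elements containing `<ent_seq>` and `<trans><name_type>` children. The inline sample uses plain strings where the real file uses DTD entities, the sequence numbers are made up, and the `WANTED` set and `select_seqs` name are hypothetical. How the resulting list is fed to the XML generator is not shown, since that interface is not described here.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for a JMnedict file. Plain strings replace the DTD entities
# used in the real file, and the sequence numbers are invented.
SAMPLE = """<JMnedict>
<entry><ent_seq>9000001</ent_seq>
<k_ele><keb>ポッキー</keb></k_ele><r_ele><reb>ポッキー</reb></r_ele>
<trans><name_type>product</name_type><trans_det>Pocky</trans_det></trans>
</entry>
<entry><ent_seq>9000002</ent_seq>
<k_ele><keb>東京</keb></k_ele><r_ele><reb>とうきょう</reb></r_ele>
<trans><name_type>place</name_type><trans_det>Tokyo</trans_det></trans>
</entry>
<entry><ent_seq>9000003</ent_seq>
<k_ele><keb>タフツ大学</keb></k_ele><r_ele><reb>タフツだいがく</reb></r_ele>
<trans><name_type>organization</name_type><trans_det>Tufts University</trans_det></trans>
</entry>
</JMnedict>"""

# Hypothetical set of name types to export.
WANTED = {"product", "company", "organization", "work", "service"}

def select_seqs(xml_text, wanted):
    """Return sequence numbers of entries having at least one wanted name type."""
    root = ET.fromstring(xml_text)
    seqs = []
    for entry in root.iter("entry"):
        types = {nt.text for nt in entry.iter("name_type")}
        if types & wanted:
            seqs.append(int(entry.findtext("ent_seq")))
    return seqs

seqs = select_seqs(SAMPLE, WANTED)
print("\n".join(str(s) for s in seqs))
```

For the full file, a streaming parser such as `ET.iterparse` would likely be a better fit than loading everything at once, given the size figures mentioned earlier in the thread.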

If a subset of a particular category of JMnedict entries (e.g., "places") is wanted, the current "frequency of use" tags might be suitable to indicate which ones. None of those tags are currently used on JMnedict entries so the "spec" tag could be used to denote the set of important places that should be output for inclusion in the JMdict XML file. IMO this would be better than inventing a new JMnedict-specific tag for the purpose.

There might be some other minor changes required to the XML tool such as suppressing warnings when certain JMnedict specific tags are encountered but I think any such changes will be pretty trivial.

I don't have any opinion on whether a JMdict XML file with selected JMnedict entries, or @robinjmdict 's suggestion of two JMnedict files (which is also quite feasible from a tooling standpoint), or some other option is best, but thought the above info might be helpful in deciding.

@birtles

birtles commented Apr 8, 2023

Only one tool seems to be doing the "right" thing with jmnedict entries, the firefox plugin:
10ten Japanese Reader (formerly Rikaichamp)

(Off topic but it's also available for Chrome, Edge, Safari (Mac and iOS), and Thunderbird. Hikibiki also provides the names dictionary in an offline format, but doesn't download it by default. Both automatically update the JMdict/JMnedict/Kanjidic data twice a week, downloading just the updated entries.)

@JMdictProject
Owner

I am closing this issue now, and will shortly open a new issue covering the possible expansion of the JMdict daily release by adding a selection of entries from JMnedict.

6 participants