Proposal to expand the JMdict daily release by adding selected entries from JMnedict #94

Open
JMdictProject opened this issue Apr 23, 2023 · 28 comments

Comments

@JMdictProject
Owner

JMdictProject commented Apr 23, 2023

Following the discussion in Issue #93 I have been exploring the options for including a number of JMnedict (proper name) entries in the daily JMdict release. The base entries would stay in the JMnedict database for maintenance, etc. and would continue to be in the daily JMnedict XML release. The main reason for adding the selection to JMdict is to assist users of apps and sites which lack ready access to the JMnedict entries. The entries would appear as nouns and would retain their classification (work, product, etc.).

The selection of entries in the various categories will depend on how common they are, so I have extracted the Google n-gram counts for all the terms in JMnedict to help with the selection.

I propose that initially about 4,000 name entries be selected. The proposed sets are:

  • 550 product entries that have n-gram counts of 20 or more
  • 900 works entries (same criterion)
  • 1,100 company name entries (same criterion)
  • the most common 2,000 person name entries (of the 53k entries 8k have counts over 5,000)
  • the most common 1,000 organization name entries (of the 5k entries 1k have counts over 5,000)
  • of the smaller categories all the entries with non-zero n-gram counts would be included. There are approximately: characters (130), creatures (8), events (45), fiction (22), legends (5), services (100), ships (7), myths (20)
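The selection rules in the list above could be sketched roughly as follows. This is only an illustration, not the actual selection script: the `NameEntry` type, the category labels and the function name are all hypothetical, and only the thresholds (20, 2,000, 1,000, non-zero) come from the proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameEntry:
    seq: int          # JMnedict sequence number
    category: str     # hypothetical category label, e.g. "product"
    ngram_count: int  # Google n-gram count for the term


# Categories selected by a minimum n-gram count (20 in the proposal).
THRESHOLD_CATEGORIES = {"product": 20, "work": 20, "company": 20}
# Categories where only the N most common entries are taken.
TOP_N_CATEGORIES = {"person": 2000, "organization": 1000}
# Smaller categories: every entry with a non-zero count is taken.
SMALL_CATEGORIES = {
    "character", "creature", "event", "fiction",
    "legend", "service", "ship", "myth",
}


def select_for_jmdict(entries):
    """Return the entries that the proposed rules would select."""
    by_category = {}
    for entry in entries:
        by_category.setdefault(entry.category, []).append(entry)

    selected = []
    for category, group in by_category.items():
        if category in THRESHOLD_CATEGORIES:
            minimum = THRESHOLD_CATEGORIES[category]
            selected.extend(e for e in group if e.ngram_count >= minimum)
        elif category in TOP_N_CATEGORIES:
            ranked = sorted(group, key=lambda e: e.ngram_count, reverse=True)
            selected.extend(ranked[: TOP_N_CATEGORIES[category]])
        elif category in SMALL_CATEGORIES:
            selected.extend(e for e in group if e.ngram_count > 0)
        # Any other category (surname, place, ...) is left out.
    return selected
```

Categories that appear in none of the three tables simply fall through, so the larger categories are excluded by default.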

At this stage it is not proposed to include any of the larger categories: surnames (145k entries, 125k with n-gram counts), female (109k, 70k), male (21k, 16k), given (61k, 60k), place-names (227k, 157k) and stations (8k, 8k). This can be revisited later.

The selected entries will be tagged in the JMnedict database by adding the "spec1" tag to the reading priority. This can be done initially via a bulk update, and entries can be later modified via the usual edit system. The tagged entries can be automatically extracted in JMdict format and included in the JMdict release.

I will announce this proposal on the mailing list to alert developers/maintainers and to seek feedback.

@robinjmdict

I don't think this is the best approach.

The Google n-gram counts are a blunt instrument. They're outdated (16 years old now) and false positives often distort the counts. How would we handle the thousands upon thousands of products/works/companies that have been created since 2007? Presumably almost all of them would meet the criteria for inclusion if the n-gram threshold is as low as 20.

Most of those 2,000 person name entries are just romanisations (sometimes with dates of birth and death). Also, I don't think full names belong in JMdict.

It complicates the editing process – any amendment would have to be made to both databases.

I think it would be more practical to have a separate names dictionary file without personal or place names, and to strongly encourage developers to include it in their app/sites.

@birtles

birtles commented Apr 24, 2023

As an app developer using both dictionaries, I'd rather avoid having duplicates. We don't want our users to have to download such entries twice but it's cumbersome for us to detect duplicates. If entries do end up being duplicated it would be very nice to have a flag indicating which entries appear in both databases so we can filter them out.

@JMdictProject
Owner Author

Robin wrote:

How would we handle the thousands upon thousands of products/works/companies that have been created since 2007?

Well, first we'd have to get them into JMnedict. Most of the present contents came from WWW-scraping I did years ago. The numbers of recent formations we have are rather low. The number of products/works/companies entries without n-gram scores is in the low 100s, and I plan to flag the more obvious recent ones such as Tik Tok and Weibo. Clearly it would be great to get more recent names included.

Most of those 2,000 person name entries are just romanisations (sometimes with dates of birth and death). Also, I don't think full names belong in JMdict.

But shouldn't they be searchable via a typical dictionary app? Bear in mind what I am proposing is a workaround to assist the more constrained apps. There's no change proposed to the underlying dictionary structures.

It complicates the editing process – any amendment would have to be made to both databases.

No, no, no! I'm not proposing duplicating the name entries in the JMdict database. What I'm proposing is that the main distributed XML file have the JMdict entries plus a selection from JMnedict. Editing of name entries would continue to be done only in the JMnedict database.

I think it would be more practical to have a separate names dictionary file without personal or place names, and to strongly encourage developers to include it in their app/sites.

That would only reduce the JMnedict size by about 35%, so I don't think it would really address the issue of apps not using the names file. It would also be a bit of a pain doing a split and we'd have to deal with the issue that some forms are both place names and surnames, etc.

What I'm proposing is a fairly simple system to help apps be more useful. It can be implemented by a couple of scripts without impacting the underlying database or editing processes.

@JMdictProject
Owner Author

Brian (?) Birtles wrote:

If entries do end up being duplicated it would be very nice to have a flag indicating which entries appear in both databases so we can filter them out.

I should have made clear that under this proposal the JMdict file would be available in versions with and without the added names.

In the JMnedict XML file it will be possible to detect the entries which are included in the JMdict distribution as they will have the "spec1" tag. In the (expanded) JMdict file these entries will have their "5nnnnnn" sequence numbers, but it may also be possible to flag them another way.
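For developers who want to find the tagged entries on the JMnedict side, a minimal sketch follows. It assumes the tag appears as a `<re_pri>` element under `<r_ele>`, in line with "added to the reading priority"; the function name and the tiny inline sample are illustrative only (the second sample entry is made up), and a real file would be parsed from disk rather than from a string.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only; the second entry is invented for the example.
JMNEDICT_SAMPLE = """<JMnedict>
  <entry>
    <ent_seq>5075303</ent_seq>
    <r_ele><reb>ベルゼブブ</reb><re_pri>spec1</re_pri></r_ele>
  </entry>
  <entry>
    <ent_seq>5999999</ent_seq>
    <r_ele><reb>れい</reb></r_ele>
  </entry>
</JMnedict>"""


def tagged_sequence_numbers(xml_text, tag="spec1"):
    """Return the ent_seq values of entries whose reading carries `tag`."""
    root = ET.fromstring(xml_text)
    return {
        entry.findtext("ent_seq")
        for entry in root.iter("entry")
        if any(pri.text == tag for pri in entry.iterfind("r_ele/re_pri"))
    }
```

The tag name is a parameter so that the same check would still work if a different marker were used.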

@robinjmdict

I'm not proposing duplicating the name entries in the JMdict database. What I'm proposing is that the main distributed XML file have the JMdict entries plus a selection from JMnedict. Editing of name entries would continue to be done only in the JMnedict database.

Ah, sorry, I completely misunderstood the proposal. This sounds a lot more sensible. I realise now that this is what @yamagoya suggested on #93. I don't think my idea of a separate abridged JMnedict file has any advantages over this approach (as long as there's always a version of the JMdict file without the added names).

But shouldn't [full names] be searchable via a typical dictionary app?

I'm not sure how helpful it would be to have 2000 searchable person names, especially if they're selected using n-gram data from a snapshot of the web in 2007. There's likely a bias towards celebrities who were popular at the time. Also, this set would only constitute a small proportion of the names that users are likely to encounter.

That would only reduce the JMnedict size by about 35%, so I don't think it would really address the issue of apps not using the names file.

By "personal names", I meant all names used for people (i.e. surname, given, male, female and person). Excluding these entries would massively reduce the size of the JMnedict dictionary file. But this is irrelevant now that I understand the proposal.

@birtles

birtles commented Apr 25, 2023

I should have made clear that under this proposal the JMdict file would be available in versions with and without the added names.

That sounds great. Thank you.

@chrisvasselli

As one of those app developers that uses JMdict but has so far not integrated the JMnedict file, I really like this idea, and would definitely use it in my app. 👍

@chasecolburn

I'm also an app developer that would welcome this change.

@JMdictProject
Owner Author

Thanks for the feedback. I think the proposal can go ahead.

I have put the basic mechanism in place, i.e. any JMnedict entry tagged as "spec1" will now automatically go into the JMdict_e* XML files. So far there is just one entry tagged (ティックトック) for the purposes of testing. In the coming days, I will progressively tag more entries.

In checking the "product" entries initially I see there are a few issues that need to be resolved:

  • there are a number of (near) identical entries in both databases, e.g. for 朝日新聞 and ポケモン. We don't need both to be in the JMdict release. It would be easy not to tag them in JMnedict, but perhaps it would be best to drop them from JMdict in the first place.
  • there are some that overlap, e.g. オセロ. They need to be rationalized a bit.

I plan to glance over them before the tagging, but rather than delay the process it would probably be better to start off with having some duplicated entries and we could tidy them up later.

@stephenmk

By my count, 79% (587k / 742k) of the entries in JMnedict contain glosses that are just romanizations of a particular noun (e.g. "Tarō" for 太郎). The remaining 155k entries that I've identified contain a lot of useful information in the "place," "unclass," and personal name categories. Most of them are foreign (i.e. non-Japanese) names.

Pulling some examples at random, I see that アントワープ (Antwerp), ウィンチェスター (Winchester), オークランド (Auckland), and ラガーディア (La Guardia) are among these entries and do not have corresponding entries in JMdict. There are also many useful Korean and Chinese place names and personal names.

Here's a CSV containing the full set of 155k entries.

Note that I didn't classify person entries such as 本庶佑 "Honjo Tasuku" as romanizations due to the inclusion of a space character in the gloss. If we want to trim this set of entries further, I could exclude those entries (several tens of thousands) as well.

Rather than the proposed approach to selecting a small number of JMnedict entries for inclusion, perhaps we could take this approach instead.

@JMdictProject
Owner Author

I don't think the number of added name entries should be that high. I was envisioning something like 10-20k max. Once the initial 4k or so were in I was planning to look at the other groups. Place names such as アントワープ and オークランド, which have fairly high n-gram counts, are obvious candidates.

@JMdictProject
Owner Author

I've been spending some spare moments setting up the scripts to implement the process. They are complete now and I put in the batch of deity names (13) yesterday and tracked the process to make sure it all ran. Everything seems fine, so if you look in the latest JMdict you'll see an entry for ベルゼブブ (5075303). It's been exported in the EDICT format too, so it's in WWWJDIC (https://www.edrdg.org/cgi-bin/wwwjdic/wwwjdic?1MDJ%A5%D9%A5%EB%A5%BC%A5%D6%A5%D6).

I'll push out more batches in coming days.

@birtles

birtles commented May 5, 2023

I should have made clear that under this proposal the JMdict file would be available in versions with and without the added names.

In the JMnedict XML file it will be possible to detect the entries which are included in the JMdict distribution as they will have the "spec1" tag. In the (expanded) JMdict file these entries will have their "5nnnnnn" sequence numbers, but it may also be possible to flag them another way.

It appears there is no multi-lingual version of JMdict without the extra names. Is there any way to detect these duplicate entries in the full version of JMdict?

@JMdictProject
Owner Author

The extra name entries all have a sequence number starting with "5". In the distant future, there may be some meta code for this.

If there was significant interest I could generate two versions of the multi-lingual distribution.

@birtles

birtles commented May 5, 2023

The extra name entries all have a sequence number starting with "5". In the distant future, there may be some meta code for this.

Great, thank you. I see about 557 such entries in the export from yesterday.

If there was significant interest I could generate two versions of the multi-lingual distribution.

It should be easy enough to filter out the extra name entries, but if there was a version without it I would probably use that.
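A minimal sketch of that filtering, assuming the expanded JMdict file has been parsed into an ElementTree root with `<entry>` children that each carry an `<ent_seq>`. The function name and the inline sample are illustrative (the first sample entry is made up), and the only rule relied on is the one stated above: added name entries have sequence numbers starting with "5".

```python
import xml.etree.ElementTree as ET

# Illustrative sample only; the first entry is invented for the example.
JMDICT_SAMPLE = """<JMdict>
  <entry><ent_seq>1999999</ent_seq><r_ele><reb>れい</reb></r_ele></entry>
  <entry><ent_seq>5075303</ent_seq><r_ele><reb>ベルゼブブ</reb></r_ele></entry>
</JMdict>"""


def strip_added_names(root):
    """Remove entries whose ent_seq starts with "5"; return how many were removed."""
    removed = 0
    # Copy the list first so that removal does not disturb the iteration.
    for entry in list(root.findall("entry")):
        seq = entry.findtext("ent_seq", default="")
        if seq.startswith("5"):
            root.remove(entry)
            removed += 1
    return removed
```

Calling `strip_added_names(ET.fromstring(JMDICT_SAMPLE))` on the sample removes the ベルゼブブ entry and leaves the other one in place.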

@JMdictProject
Owner Author

I have now processed all of the smaller categories of names. I ran checks against JMdict and blocked the flagging of ones that were already there. There is still some overlap with a few entries, but that can be tidied up eventually. In the end, just over 6,700 entries in the JMnedict database were given the tag which has resulted in them going into the JMdict XML release.
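The kind of overlap check described here might look something like the sketch below. The actual check is not described in this thread, so this is an assumption: it treats a candidate as "already there" if any of its written forms also appears as a JMdict headword. The function name and the candidate keys are hypothetical.

```python
def flaggable_candidates(candidates, jmdict_forms):
    """Return the candidate keys that share no written form with JMdict.

    candidates: mapping of a candidate key to its set of kanji/kana forms.
    jmdict_forms: set of kanji/kana forms already present in JMdict.
    """
    return [
        key
        for key, forms in candidates.items()
        if forms.isdisjoint(jmdict_forms)
    ]


# Example using forms mentioned earlier in the thread; the keys are made up.
already_in_jmdict = {"朝日新聞", "ポケモン"}
name_candidates = {
    "candidate-a": {"朝日新聞"},
    "candidate-b": {"ティックトック"},
}
to_flag = flaggable_candidates(name_candidates, already_in_jmdict)
```

A form-only comparison like this would not catch entries that merely overlap in meaning (such as the オセロ case), which fits the remark that some overlap remains to be tidied up.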

The big question now is what, if anything, to do with the other big categories. The main one is probably the "person" set. Some of those could be flagged. If there was enough interest I could put up a page of the most common 500 or so, with update links.

@chrisvasselli

Hi all, just curious to check on the status of this. I was thinking of mentioning this new set of entries in the next update for my app, but it probably makes sense to do it just once when all the data is imported.

Is there still an open question about what to do with the "person" set?

@JMdictProject
Owner Author

I think it's still open. The problem is how to choose a suitable subset from the large and messy collection.

I mean to look at person names which are also in references such as GG5 and kokugos. It will be a few weeks before I can get to doing that. It would be good to have some generally acceptable criteria for selection.

@stephenmk

Does the JMnedict tag have to be named [spec1] and not something more descriptive of its purpose? This seems to be a frequent source of confusion.

@JMdictProject
Owner Author

Does the JMnedict tag have to be named [spec1] and not something more descriptive of its purpose? This seems to be a frequent source of confusion.

It could potentially be any tag capable of being added to a reading.

Can you expand on the "frequent source of confusion"? None has been brought to my attention. After all, the entries that get added to JMdict have it removed.

The current [spec1] could be replaced by something like [nf1], or a special-purpose tag such as [prom] created for the role.

@stephenmk

https://www.edrdg.org/jmwsgi/entr.py?svc=jmdict&sid=&q=5144426.2

2023-05-07 16:13:42 Stephen Kraus
Was the [spec1] tag intentionally only placed on the reading?

https://www.edrdg.org/jmwsgi/entr.py?svc=jmdict&sid=&q=5741528.2

2023-07-22 16:51:17 Opencooper
Probably a false positive for spec1 considering it doesn't have any hits in Kotobank or on jawiki.
A 2023-07-22 21:25:25 Jim Breen
The spec1 is just to trigger its inclusion in the JMdict XML distribution.

https://www.edrdg.org/jmwsgi/entr.py?svc=jmdict&sid=&q=5746459.2

2023-09-09 21:21:37 Anonymous
Why spec1?

https://www.edrdg.org/jmwsgi/entr.py?svc=jmdict&sid=&q=5746458.2

2023-09-09 21:22:12 Anonymous
Why is spec1 on the reading?

https://www.edrdg.org/jmwsgi/entr.py?svc=jmdict&sid=&q=5000045.2

2023-09-10 00:41:09 Anonymous
The spec1 is likely wrong here if it's based on the reading

https://www.edrdg.org/jmwsgi/entr.py?svc=jmdict&sid=&q=5742445.2

2023-09-10 21:24:00 Anonymous
Why does spec1 go on the reading in jmnedict?

https://www.edrdg.org/jmwsgi/entr.py?svc=jmdict&sid=&q=5746610.2

2024-01-03 00:42:31 Anonymous
Why spec1?

Obviously not a hugely important issue, but I expect people will forever continue to ask about this every few months or so. The repurposing of the tag for this feature isn't documented anywhere, either in tooltips or on the reference page. To anyone beginning to learn how to edit the database, it looks like an obvious error. It probably leaves a bad impression on them when they learn that it isn't.

I think the ideal setup is to have a new tag with tooltip text explaining that the tag only exists to mark name entries for inclusion in the JMdict XML distribution.

I know there are plenty of more interesting things to work on, so I completely understand and won't press the issue if this doesn't seem important enough to bother with.

@JMdictProject
Owner Author

Fair enough. Probably best to have a specific tag. I should be able to create something like [prom] and convert the entries. It will be a few weeks until I can look into it in detail.

@JMdictProject
Owner Author

In https://www.edrdg.org/jmwsgi/entr.py?svc=jmdict&sid=&q=5259563 Stephen suggests including person names if they are in kokugos (ホーチミン is one such). I agree. I'd also include names that are in major JEs such as GG5.

@robinjmdict

Fine with me but is it possible to exclude the non-[person] senses (e.g. [surname], [place])? See トールキン and ペレ for examples.

It should be noted that the vast majority of our [person] entries are full names (first name + surname) but the non-Japanese person name entries in the JEs and kokugos typically only include the surname. Presumably this is what we'd do as well, i.e. include アインシュタイン but not アルベルト・アインシュタイン.

We'll have to manually amend these surname entries one by one. I think we only recently started adding [person] senses to surnames.

@stephenmk

By my count there are ~7500 person name entries and ~5600 place name entries in the latest edition of daijirin. If anyone is interested, I can send them the full list (reading, surface form, entry type).

Should we extend this new inclusion policy to place names as well? I just came across "鳥取砂丘" which is in daijirin, for example.

@Marcusjmdict

Marcusjmdict commented Jan 29, 2024 via email

@razasyedh

I think including [person] entries is a good next step if there haven't been any downstream issues with apps processing the currently added names.

However, I think any inclusion of [place] entries would need to be much more gradual. A lot of these entries are currently only romanizations and I think were scraped off the Internet, so may not be verified (e.g. we had Stephen's example of 「鳥取砂丘」 as "Tottorisakyū" before it was manually amended). We could perhaps start with the top, say, 200 places (these will probably be countries, which we already have), check those, and then expand.

For individual first and last names, the approach Marcus has suggested sounds reasonable so far. As an end user, it would also be nice to know how common a name is, just as we currently indicate with the dictionary entries. (of course, these top 500 are common, but I'm speaking long-term, if the scope gets expanded)
