Support the full ISO 639-3 list of languages #10578 #10762
I eyeballed the list (not that I really know what I'm looking at) and left a comment inline.
language French fra 1960 fra fre fr
language French Sign Language fsl 1961 fsl
language Friulian fur 1962 fur
language Fula, Fulah, Pulaar, Pular, Fulah, Pulaar, Pular ful 1963 ful ff Fula Fulah Pular Pulaar
Why does ful have seven variations? At https://en.wikipedia.org/wiki/ISO_639:f they give it a single English name.
Perhaps the release note could indicate the source of the data.
I downloaded a zip from https://iso639-3.sil.org/code_tables/download_tables#Complete%20Code%20Tables (I'm not sure if this is the right place) and I only see one name, not seven:
pdurbin@beamish ~ % cd ~/Downloads/iso-639-3_Code_Tables_20240415
pdurbin@beamish iso-639-3_Code_Tables_20240415 % ls
iso-639-3-macrolanguages.tab iso-639-3.tab iso-639-3_Name_Index.tab iso-639-3_Retirements.tab
pdurbin@beamish iso-639-3_Code_Tables_20240415 % ack Fulah
iso-639-3_Name_Index.tab
2069:ful Fulah Fulah
iso-639-3.tab
1978:ful ful ful ff M L Fulah
(Later I noticed that "Divehi, Dhivehi, Maldivian, Dhivehi" has four values.)
Is there a script we can add to this pull request that was used to put these values into the tsv file? What process do we follow if we need to update the list in a few years? Should we put something in the dev guide?
Already fixed it. Or at least most.
We had some of these aliases, like Pular, in our old block, possibly from their 639-1/639-2 sources. It looks like the merge method produced some duplicates (like Pular showing twice), which appear to be what got fixed. There's a further issue with this example, though, and perhaps with others: Pular is listed separately as
fuf I L Pular
in the ISO file and Pulaar is
fuc I L Pulaar
Right now, I don't see fuf or fuc in the citation block, and if they were there (I think they should be), they shouldn't still be listed as alternates for ful (which may have been true with 639-1/2).
I'm not sure how to automate unscrambling that. Almost seems like manually going through the existing 187 entries to see if they've been split or dropped, etc. is going to be needed. (Maybe automation can pull out all the lines where we have extra alternates so those can be the ones to focus on.)
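One possible way to narrow the manual review suggested above is a minimal sketch like the following. It assumes each controlled-vocabulary row is tab-separated as DatasetField, Value, identifier, displayOrder, followed by the alternates; that layout, the `ISO_CODES` lookup and the function name `extra_alternates` are illustrative assumptions, not the script used in this PR. It flags only the rows whose alternates go beyond the entry's own ISO codes.

```python
# Sketch only: the row layout and the ISO_CODES lookup are assumptions.
# Assumed tab-separated layout per row:
#   DatasetField, Value, identifier, displayOrder, alternate, alternate, ...
CVV_ROWS = [
    "language\tFrench\tfra\t1960\tfra\tfre\tfr",
    "language\tFriulian\tfur\t1962\tfur",
    "language\tFula, Fulah, Pulaar, Pular\tful\t48\tful\tff\tFula\tFulah\tPulaar\tPular",
]

# Codes that ISO 639 assigns to each identifier (639-3, 639-2 B/T, 639-1).
ISO_CODES = {
    "fra": {"fra", "fre", "fr"},
    "fur": {"fur"},
    "ful": {"ful", "ff"},
}


def extra_alternates(rows, iso_codes):
    """Map identifier -> alternates that are not one of its own ISO codes."""
    flagged = {}
    for row in rows:
        fields = row.split("\t")
        identifier, alternates = fields[2], fields[4:]
        known = iso_codes.get(identifier, set())
        extras = [alt for alt in alternates if alt and alt not in known]
        if extras:
            flagged[identifier] = extras
    return flagged


flagged = extra_alternates(CVV_ROWS, ISO_CODES)
for identifier, extras in flagged.items():
    print(identifier, extras)
```

Rows with nothing but their own codes (French, Friulian here) drop out, so only entries such as ful remain to be checked by hand.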
I merged with our original tsv file. That file has Pulaar mapped to ful:
language Fula, Fulah, Pulaar, Pular ful 48 ful ff Fula Fulah Pulaar Pular
I added the source of the ISO file in the release notes
language French fra 1960 fra fre fr
language French Sign Language fsl 1961 fsl
language Friulian fur 1962 fur
language Fula, Fulah, Pular, Pulaar ful 1963 ful ff Fula Fulah fuc Pular fuf Pulaar
In iso-639-3.tab, I see
fuc I L Pulaar
fuf I L Pular
ful ful ful ff M L Fulah
which means these are not the same language (as I assume they were considered on 639-1 or 639-2 where we got the original entry with all of these together.)
It looks like 639-3 considers ful to be the macrolanguage comprising ffm, fub, fuc, fue, fuf, fuh, fui, fuq, and fuv (from https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3-macrolanguages.tab). We have all but fuc and fuf as separate entries now.
I think fuc/fuf should also be separate and, in general, the algorithm doing the merge should untangle any alternates that are considered a separate language in 639-3. It looks like there are ~64 macrolanguages and probably only a few of them are represented in our old table. For those that are, I think any alternates that are for separate individual languages should keep their separate entries.
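The untangling described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the merge algorithm used in this PR: `MACRO_MEMBERS` stands in for the macrolanguage-to-member mapping in iso-639-3-macrolanguages.tab, `REF_NAMES` for the reference names in iso-639-3.tab (only the two names quoted in this thread are filled in), and `ENTRIES` for the legacy identifier-to-alternates mapping. The function name `tangled_alternates` is hypothetical.

```python
# Sketch only: data structures and names below are illustrative assumptions.
# Macrolanguage -> individual member codes (as listed for ful in this thread).
MACRO_MEMBERS = {
    "ful": ["ffm", "fub", "fuc", "fue", "fuf", "fuh", "fui", "fuq", "fuv"],
}

# Reference names of individual languages (only those quoted in this thread).
REF_NAMES = {"fuc": "Pulaar", "fuf": "Pular"}

# Legacy controlled-vocabulary entries: identifier -> alternates.
ENTRIES = {"ful": ["ful", "ff", "Fula", "Fulah", "Pulaar", "Pular"]}


def tangled_alternates(entries, macro_members, ref_names):
    """Find alternates of a macrolanguage that really name a member language.

    Returns tuples (macro_id, alternate, member_id, member_has_own_entry).
    """
    findings = []
    for identifier, alternates in entries.items():
        for member in macro_members.get(identifier, []):
            name = ref_names.get(member)
            for alt in alternates:
                if alt == member or (name is not None and alt == name):
                    findings.append((identifier, alt, member, member in entries))
    return findings


findings = tangled_alternates(ENTRIES, MACRO_MEMBERS, REF_NAMES)
for macro_id, alt, member_id, has_entry in findings:
    print(f"{macro_id}: alternate {alt!r} belongs to {member_id} (own entry: {has_entry})")
```

Each finding is an alternate that should move to the member language's own entry; a False in the last position means that entry still needs to be created.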
If I remove Pulaar and Pular from our original tsv file the merge will make them separate. I'll do that
Other than addressing the issue w.r.t. macrolanguages and alternates I've raised, this looks ready to go.
…the citation block #8578 (temporary? - we can drop these from the branch before we merge)
@stevenwinship @qqmyers a much longer version: This is how I tested:
The lists of entries to compare were selected by All the new entries have been confirmed to be properly formed.
... which is ok since these now have their own entries. The check on the main names in the 2nd column finds 37 discrepancies:
I would like to avoid introducing this many changes and address this by reviewing them on a case by case basis. My main goal is to avoid introducing more comma-separated lists in this column. In the ISO 639-3 as published these comma-separated lists serve the purpose that in our case is addressed by the list of "alternate" values that we supply. But since we use this "main name" for metadata exports, I don't think resulting entries like
It doesn't seem to be possible to accomplish the above without breaking the sort order, so the order numbers would need to be recalculated. But I cannot imagine any human researcher looking to enter "Croatian" and actually expecting to find it under letter "L", as "Logudorese Sardinian, Croatian". So I'm proposing to keep it "Croatian" and adjust the order numbers. We have discussed making it possible to drop the explicit order numbers for CVVs such as this one; plus we wanted to make it possible to supply localized versions of the names without breaking the sort order (German vs. Deutsch), which of course requires sorting after the database retrieval. But let's leave that part for later.
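Recalculating the order numbers after simplifying a main name might look like the minimal sketch below. It assumes the order numbers are plain sequential integers and that entries sort case-insensitively by main name; the dictionary keys, the starting offset and the sample order values are hypothetical, not the values in the actual CVV list.

```python
# Sketch only: keys, sample order values and the start offset are assumptions.
rows = [
    {"value": "French", "identifier": "fra", "displayOrder": 1960},
    {"value": "Friulian", "identifier": "fur", "displayOrder": 1962},
    {"value": "Croatian", "identifier": "hrv", "displayOrder": 3500},
    {"value": "French Sign Language", "identifier": "fsl", "displayOrder": 1961},
]


def renumber(rows, start=0):
    """Sort rows by main name (case-insensitive) and assign sequential order numbers."""
    ordered = sorted(rows, key=lambda row: row["value"].casefold())
    # Build new dicts so the input rows are left unmodified.
    return [{**row, "displayOrder": start + i} for i, row in enumerate(ordered)]


renumbered = renumber(rows)
for row in renumbered:
    print(row["displayOrder"], row["identifier"], row["value"])
```

With this approach "Croatian" sorts under "C" regardless of the order number it carried before the rename.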
There will always be issues trying to merge two lists. Maybe we should drop our existing list and only use the ISO list. Have you (@landreev) noticed many languages in our list that are not in the ISO list?
No, there are no cases of supported legacy languages that are no longer in ISO 639-3.
Our existing list was not really "our" in any proprietary way. It was originally based on ISO 639-1, so it was fairly standard.
Dropping our existing, legacy CVV and starting from scratch would be easier. But I just don't think we can afford to do that - for reasons of backward compatibility.
I have modified my PR, adding corrected sort order to the CVV list.
…ges-qa a proposed refinement (simplification) of some of the "main" language names in the new list
(waiting for the last Jenkins run to complete at this point)
What this PR does / why we need it: Some codes are still not managed. In the cases encountered, frm (Medieval French) and fro (Old French).
Which issue(s) this PR closes: Figure out whether, or how to support the extended ISO 639-3 list of languages #8578
Special notes for your reviewer:
Suggestions on how to test this: Start with preexisting datasets created by an older version of Dataverse.
Follow "Update the Citation metadata block" from the release notes. Make sure the languages are the same and that some new languages can be added.
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Is there a release notes update needed for this change?: Yes, to be included in this PR.
Additional documentation: The ISO data was downloaded from
https://iso639-3.sil.org/code_tables/download_tables#Complete%20Code%20Tables:~:text=iso%2D639%2D3_Code_Tables_20240415.zip