Support the full ISO 639-3 list of languages #10578 #10762

Merged
merged 15 commits into from
Sep 4, 2024

Conversation

stevenwinship
Contributor

@stevenwinship stevenwinship commented Aug 8, 2024

What this PR does / why we need it: Some language codes are still not supported. The cases encountered so far are frm (Middle French) and fro (Old French).

Which issue(s) this PR closes: Figure out whether, or how to support the extended ISO 639-3 list of languages #8578

Special notes for your reviewer:

Suggestions on how to test this: Start with preexisting datasets created by an older version of Dataverse.
Follow the "Update the Citation metadata block" step from the release notes. Make sure the existing languages are unchanged and that some of the new languages can be added.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?: Yes, to be included in this PR.

Additional documentation: The ISO data was downloaded from
https://iso639-3.sil.org/code_tables/download_tables#Complete%20Code%20Tables:~:text=iso%2D639%2D3_Code_Tables_20240415.zip

@stevenwinship stevenwinship added Type: Bug a defect Feature: Harvesting pm.epic.nih_harvesting NIH OTA: 1.4.1 4 | 1.4.1 | Resolve OAI-PMH harvesting issues | 5 prdOwnThis is an item synched from the product ... pm.GREI-d-1.4.2 NIH, yr1, aim4, task2: Create working group on packaging standards pm.GREI-d-2.4.1B NIH AIM:4 YR:2 TASK:1B | 2.4.1B | (started yr1) Resolve OAI-PMH harvesting issues GREI 3 Search and Browse Size: 50 A percentage of a sprint. 35 hours. labels Aug 8, 2024
@stevenwinship stevenwinship self-assigned this Aug 8, 2024
Member

@pdurbin pdurbin left a comment


I eyeballed the list (not that I really know what I'm looking at) and left a comment inline.

language French fra 1960 fra fre fr
language French Sign Language fsl 1961 fsl
language Friulian fur 1962 fur
language Fula, Fulah, Pulaar, Pular, Fulah, Pulaar, Pular ful 1963 ful ff Fula Fulah Pular Pulaar
Member


Why does ful have seven variations? At https://en.wikipedia.org/wiki/ISO_639:f they give it a single English name.

Perhaps the release note could indicate the source of the data.

I downloaded a zip from https://iso639-3.sil.org/code_tables/download_tables#Complete%20Code%20Tables (I'm not sure if this is the right place) and I only see one name, not seven:

pdurbin@beamish ~ % cd ~/Downloads/iso-639-3_Code_Tables_20240415 
pdurbin@beamish iso-639-3_Code_Tables_20240415 % ls
iso-639-3-macrolanguages.tab	iso-639-3.tab			iso-639-3_Name_Index.tab	iso-639-3_Retirements.tab
pdurbin@beamish iso-639-3_Code_Tables_20240415 % ack Fulah
iso-639-3_Name_Index.tab
2069:ful	Fulah	Fulah

iso-639-3.tab
1978:ful	ful	ful	ff	M	L	Fulah

(Later I noticed that "Divehi, Dhivehi, Maldivian, Dhivehi" has four values.)

Is there a script we can add to this pull request that was used to put these values into the tsv file? What process do we follow if we need to update the list in a few years? Should we put something in the dev guide?

Contributor Author


Already fixed it. Or at least most.

Member

@qqmyers qqmyers Aug 8, 2024


We had some of these aliases like Pular in our old block, possibly from their 639-1/639-2 sources. It looks like the merge method resulted in some duplicates (like Pular showing twice), which appear to be what got fixed. There's a further issue, though, with this example and perhaps others: Pular is listed separately as
fuf I L Pular
in the ISO file and Pulaar is
fuc I L Pulaar

Right now, I don't see fuf or fuc in the citation block and if they were there (I think they should be), they shouldn't still be listed as alternates for ful (which may have been true with 639-1/2).

I'm not sure how to automate unscrambling that. Almost seems like manually going through the existing 187 entries to see if they've been split or dropped, etc. is going to be needed. (Maybe automation can pull out all the lines where we have extra alternates so those can be the ones to focus on.)
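One way to narrow down the manual review suggested above is to list only the rows whose alternates include something other than a 2- or 3-letter code. This is a minimal Python sketch, not the tooling used in this PR. It assumes each row has the layout `<tab>language<tab>name<tab>identifier<tab>order<tab>alternates...`, and the helper names are hypothetical.

```python
import re

# Assumption: lowercase 2- or 3-letter values are codes, anything else is a name.
CODE = re.compile(r"^[a-z]{2,3}$")

def parse_language_line(line):
    """Split one language row into (name, identifier, alternates)."""
    fields = line.rstrip("\n").split("\t")
    # Assumed layout: '', 'language', name, identifier, order, alternates...
    name, identifier = fields[2], fields[3]
    alternates = [f for f in fields[5:] if f]
    return name, identifier, alternates

def entries_with_name_alternates(lines):
    """Yield (identifier, names) for rows whose alternates include non-code values."""
    for line in lines:
        if not line.startswith("\tlanguage\t"):
            continue
        _, identifier, alternates = parse_language_line(line)
        extra = [a for a in alternates if not CODE.match(a)]
        if extra:
            yield identifier, extra

sample = [
    "\tlanguage\tFrench\tfra\t1960\tfra\tfre\tfr\n",
    "\tlanguage\tFula, Fulah, Pulaar, Pular\tful\t48\tful\tff\tFula\tFulah\tPulaar\tPular\n",
]
for identifier, extra in entries_with_name_alternates(sample):
    print(identifier, extra)
```

With the two sample rows, only ful is reported, so the French row would not need a manual look.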

Contributor Author


I merged with our original tsv file. That file has Pulaar mapped to ful:

language Fula, Fulah, Pulaar, Pular ful 48 ful ff Fula Fulah Pulaar Pular

Contributor Author


I added the source of the ISO file in the release notes


@stevenwinship stevenwinship removed their assignment Aug 9, 2024
language French fra 1960 fra fre fr
language French Sign Language fsl 1961 fsl
language Friulian fur 1962 fur
language Fula, Fulah, Pular, Pulaar ful 1963 ful ff Fula Fulah fuc Pular fuf Pulaar
Member


In iso-639-3.tab, I see

fuc				I	L	Pulaar	
fuf				I	L	Pular	
ful	ful	ful	ff	M	L	Fulah	

which means these are not the same language (as I assume they were considered in 639-1 or 639-2, where we got the original entry with all of these together).

It looks like 639-3 considers ful to be the macrolanguage comprising ffm, fub, fuc, fue, fuf, fuh, fui, fuq, and fuv (from https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3-macrolanguages.tab). We have all but fuc and fuf as separate entries now.

I think fuc/fuf should also be separate and, in general, the algorithm doing the merge should untangle any alternates that are considered separate languages in 639-3. It looks like there are ~64 macrolanguages and probably few of them are represented in our old table. For those that are, I think any alternates that belong to separate individual languages should keep their separate entries.
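The untangling step described here could be sketched roughly as follows in Python. This is an illustration only, not the merge code from this PR: the function name is hypothetical, and the two dictionaries are assumed to have been loaded beforehand from iso-639-3-macrolanguages.tab (macrolanguage to member codes) and iso-639-3.tab (code to reference name).

```python
def conflicting_alternates(identifier, alternates, members, ref_names):
    """Return {alternate: member_code} for alternates that name a member language.

    members:   macrolanguage code -> list of individual language codes
    ref_names: individual language code -> ISO 639-3 reference name
    """
    by_name = {
        ref_names[code]: code
        for code in members.get(identifier, [])
        if code in ref_names
    }
    return {alt: by_name[alt] for alt in alternates if alt in by_name}

# Sample data taken from the ful example in this thread; only the two
# reference names quoted above are filled in.
members = {"ful": ["ffm", "fub", "fuc", "fue", "fuf", "fuh", "fui", "fuq", "fuv"]}
ref_names = {"fuc": "Pulaar", "fuf": "Pular"}

legacy_alternates = ["ful", "ff", "Fula", "Fulah", "Pulaar", "Pular"]
print(conflicting_alternates("ful", legacy_alternates, members, ref_names))
```

Any alternate returned by this check is a candidate to be dropped from the macrolanguage row and given its own entry.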

Contributor Author


If I remove Pulaar and Pular from our original tsv file the merge will make them separate. I'll do that

Member

@qqmyers qqmyers left a comment


Other than addressing the issue w.r.t. macrolanguages and alternates I've raised, this looks ready to go.


@stevenwinship stevenwinship removed their assignment Aug 14, 2024


@landreev landreev self-assigned this Aug 15, 2024
…the citation block #8578

(temporary? - we can drop these from the branch before we merge)


@landreev
Contributor

landreev commented Aug 26, 2024

@stevenwinship @qqmyers
short version: I made a pr into this branch, #10801 with a few proposed changes - mainly, simplifications, for the names of a few already-supported languages. Note that if you are ok with accepting this list, the numerical order values will need to be re-calculated.

a much longer version:

This is how I tested:
I put together a quick validation script; it is checked in under scripts/issues/8578/script_check_languages.pl (we can drop it from the branch before we merge, or keep it for reference). The script parses the lists of languages from this branch and from develop, and checks for the following:

  • That the entries are properly formed, i.e., each one has a valid name and a 3-letter identifier in the 3rd column, the same 3-letter identifier appears in the 5th column as the first alternate value, and any further valid alternate values (letter codes or names) are supplied in the subsequent tab-separated columns.
  • That all the 2- and 3-letter codes, plus all the alternate names that have been supported until now (i.e., in the develop branch), are still supported in the new list.
  • It also checks and compares the old and new "main names" of languages (the 2nd column).
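For readers who do not want to open the Perl script, the first check in the list above can be approximated in Python as follows. This is a hedged sketch and not a port of script_check_languages.pl: the function name and the error messages are made up, and the row layout (a leading empty field followed by `language`, name, identifier, order, alternates) is an assumption based on the egrep pattern used in this thread.

```python
import re

ID3 = re.compile(r"^[a-z]{3}$")

def check_entry(line):
    """Return a list of problems found in one language row (empty if well-formed)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 6 or fields[0] != "" or fields[1] != "language":
        return ["not a language row with name, identifier, order and alternates"]
    name, identifier, alternates = fields[2], fields[3], fields[5:]
    problems = []
    if not name.strip():
        problems.append("empty name")
    if not ID3.match(identifier):
        problems.append(f"bad identifier: {identifier!r}")
    if alternates[0] != identifier:
        problems.append("first alternate does not repeat the identifier")
    if any(not a.strip() for a in alternates):
        problems.append("empty alternate value")
    return problems

good = "\tlanguage\tFrench\tfra\t1960\tfra\tfre\tfr\n"
bad = "\tlanguage\tFrench\tfra\t1960\tfre\tfr\n"
print(check_entry(good))
print(check_entry(bad))
```

Returning a list of problems rather than stopping at the first one makes it easier to review a long file in a single pass.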

The lists of entries to compare were selected by egrep '^\tlanguage\t' scripts/api/data/metadatablocks/citation.tsv and saved as citation_languages_10762.tsv and citation_languages_develop.tsv in the directory above.

All the new entries have been confirmed to be properly formed.
All the currently supported codes and alternates are still supported, with the exception of the 2 that you specifically mentioned above:

./script_check_languages.pl -checkcodes citation_languages_10762.tsv citation_languages_develop.tsv
Parsing the new languages list...
7921 language entries processed. All entries well-formed.
Processing the previously supported list:
identifier: ful, alternate value Pulaar is missing in the new list!
identifier: ful, alternate value Pular is missing in the new list!
186 develop branch entries processed.

... which is ok since these now have their own entries.
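The backward-compatibility check reported above can be illustrated with a short Python sketch. It is not the Perl script itself. The function names are hypothetical, the row layout is assumed to be `<tab>language<tab>name<tab>identifier<tab>order<tab>alternates...`, and the "new" sample row is a reconstruction of the final ful entry for illustration.

```python
def parse(lines):
    """Map identifier -> set of alternate values (codes and names)."""
    table = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) >= 6 and fields[1] == "language":
            table[fields[3]] = {f for f in fields[5:] if f}
    return table

def missing_alternates(new_lines, old_lines):
    """List (identifier, value) pairs supported in the old list but not in the new one.

    A value of None means the whole identifier is gone from the new list.
    """
    new, old = parse(new_lines), parse(old_lines)
    report = []
    for identifier, alternates in old.items():
        if identifier not in new:
            report.append((identifier, None))
            continue
        for value in sorted(alternates - new[identifier]):
            report.append((identifier, value))
    return report

old_rows = ["\tlanguage\tFula, Fulah, Pulaar, Pular\tful\t48\tful\tff\tFula\tFulah\tPulaar\tPular\n"]
new_rows = ["\tlanguage\tFula, Fulah\tful\t1963\tful\tff\tFula\tFulah\n"]  # illustrative row
for identifier, value in missing_alternates(new_rows, old_rows):
    print(f"identifier: {identifier}, alternate value {value} is missing in the new list!")
```

The comparison is done per identifier, so a name that moved to its own entry (as Pulaar and Pular did) is still reported and has to be confirmed by hand.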

The check on the main names in the 2nd column finds 37 discrepancies:

./script_check_languages.pl -checknames citation_languages_10762.tsv citation_languages_develop.tsv
Parsing the new languages list...
7921 language entries processed. All entries well-formed.
Processing the previously supported list:
Language name different for id abk: old: Abkhaz, new: Abkhaz, Abkhazian
Language name different for id cat: old: Catalan,Valencian, new: Catalan, Valencian
Language name different for id hrv: old: Croatian, new: Logudorese Sardinian, Croatian
Language name different for id div: old: Divehi, Dhivehi, Maldivian, new: Maldivian, Dhivehi, Divehi
Language name different for id ful: old: Fula, Fulah, Pulaar, Pular, new: Fula, Fulah
Language name different for id ell: old: Greek (modern), new: Modern Greek (1453-), Greek (modern)
Language name different for id grn: old: Guaraní, new: Guarani, Guaraní
Language name different for id hat: old: Haitian, Haitian Creole, new: Haitian Creole, Haitian
Language name different for id heb: old: Hebrew (modern), new: Hebrew (modern), Hebrew
Language name different for id ina: old: Interlingua, new: Interlingua, Interlingua (International Auxiliary Language Association)
Language name different for id kal: old: Kalaallisut, Greenlandic, new: Greenlandic, Kalaallisut
Language name different for id kir: old: Kyrgyz, new: Kirghiz, Kyrgyz
Language name different for id kua: old: Kwanyama, Kuanyama, new: Kuanyama, Kwanyama
Language name different for id lim: old: Limburgish, Limburgan, Limburger, new: Limburgan, Limburger, Limburgish
Language name different for id msa: old: Malay, new: Malay, Malay (macrolanguage)
Language name different for id mri: old: Māori, new: Māori, Maori
Language name different for id mar: old: Marathi (Marāṭhī), new: Marathi, Marathi (Marāṭhī)
Language name different for id nde: old: Northern Ndebele, new: Northern Ndebele, North Ndebele
Language name different for id nep: old: Nepali, new: Nepali (macrolanguage), Nepali
Language name different for id iii: old: Nuosu, new: Sichuan Yi, Nuosu
Language name different for id nbl: old: Southern Ndebele, new: South Ndebele, Southern Ndebele
Language name different for id oci: old: Occitan, new: Occitan (post 1500), Occitan
Language name different for id chu: old: Old Church Slavonic,Church Slavonic,Old Bulgarian, new: Church Slavonic, Church Slavic, Old Church Slavonic, Old Bulgarian
Language name different for id ori: old: Oriya, new: Oriya, Oriya (macrolanguage)
Language name different for id pli: old: Pāli, new: Pāli, Pali
Language name different for id fas: old: Persian (Farsi), new: Persian, Persian (Farsi)
Language name different for id run: old: Kirundi, new: Rundi, Kirundi
Language name different for id san: old: Sanskrit (Saṁskṛta), new: Sanskrit (Saṁskṛta), Sanskrit
Language name different for id gla: old: Scottish Gaelic, Gaelic, new: Gaelic, Scottish Gaelic
Language name different for id slv: old: Slovene, new: Slovenian, Slovene
Language name different for id spa: old: Spanish, Castilian, new: Castilian, Spanish
Language name different for id swa: old: Swahili, new: Swahili, Swahili (macrolanguage)
Language name different for id bod: old: Tibetan Standard, Tibetan, Central, new: Tibetan, Tibetan Standard, Central
Language name different for id uig: old: Uyghur, Uighur, new: Uighur, Uyghur
186 develop branch entries processed.

I would like to avoid introducing this many changes and address this by reviewing them on a case by case basis. My main goal is to avoid introducing more comma-separated lists in this column. In the ISO 639-3 as published these comma-separated lists serve the purpose that in our case is addressed by the list of "alternate" values that we supply. But since we use this "main name" for metadata exports, I don't think resulting entries like <dc:language>Guarani, Guaraní</dc:language> look right. So, to be clear, we should never have added any such comma-separated lists in the first place, but since we've had a few for a while it is now a matter of legacy.
So I tried going through the list with the following rules in mind:

  • When the old name "X" is being replaced with "Y, X", where Y is the name that's scientifically preferred now, we simply go with Y, and make sure that X is listed as an alternate. Example:
    the current entry: Slovene slv 146 slv sl Slovenian
    the PR entry: Slovenian, Slovene slv 6298 slv sl Slovenian Slovene
    proposed solution: Slovenian slv 6298 slv sl Slovene
  • When the old name "X" is being replaced with "X, Y", with the assumption that X is still the preferred name, we keep the name unchanged, as X, and make sure Y is an alternate. Example:
    the current entry: Māori mri 103 mri mao mi Maori
    the PR entry: Māori, Maori mri 4677 mri Māori Maori mao mi
    proposed solution: Māori mri 4677 mri Maori mao mi
  • When an already supported comma-separated name is being replaced with a new list that's cosmetically different, or the same list in different order, we go with the 639-3 version (i.e., the PR). Example:
    current: Kalaallisut, Greenlandic kal 74 kal kl Kalaallisut Greenlandic
    PR: Greenlandic, Kalaallisut kal 2173 kal kl Greenlandic Kalaallisut
    solution: keep the PR entry.
  • When an already supported comma-separated name is being replaced with an unnecessarily complicated or longer list, we try to keep it as simple as possible.
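The first two rules can be expressed mechanically once a human has picked the preferred name. Below is a minimal Python sketch under that assumption. The function name is hypothetical, choosing the preferred name is still a manual decision, and splitting the PR name on commas is a simplification that would not suit every entry.

```python
def simplify(preferred, identifier, order, pr_name, pr_alternates):
    """Build a row that keeps one main name and moves the other names to alternates."""
    names = [n.strip() for n in pr_name.split(",")]
    alternates = []
    # Keep the identifier first, then the existing alternates, then leftover names.
    for value in [identifier] + pr_alternates + names:
        if value != preferred and value not in alternates:
            alternates.append(value)
    return "\t".join(["", "language", preferred, identifier, str(order)] + alternates)

# The two examples from the rules above.
slv = simplify("Slovenian", "slv", 6298, "Slovenian, Slovene",
               ["slv", "sl", "Slovenian", "Slovene"])
mri = simplify("Māori", "mri", 4677, "Māori, Maori",
               ["mri", "Māori", "Maori", "mao", "mi"])
print(slv)
print(mri)
```

For these two inputs the output rows match the proposed solutions listed in the first two rules.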

It doesn't seem to be possible to accomplish the above without breaking the sort order, so the order numbers would need to be recalculated. But I cannot imagine any human researcher looking to enter "Croatian" and actually expecting to find it under the letter "L", as "Logudorese Sardinian, Croatian". So I'm proposing to keep it as "Croatian" and adjust the order numbers.
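Recalculating the order numbers after renaming could look roughly like the Python sketch below. It is not the method used in the PR. It assumes that the order should follow the main name alphabetically, ignoring case and diacritics, and that numbering starts at 0. Both are assumptions, and the order values in the sample rows are placeholders.

```python
import unicodedata

def sort_key(name):
    """Case-insensitive key that ignores diacritics (e.g. 'Māori' sorts as 'maori')."""
    plain = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    return plain.casefold()

def renumber(rows, start=0):
    """rows: field lists ['', 'language', name, identifier, order, alternates...]."""
    ordered = sorted(rows, key=lambda r: sort_key(r[2]))
    return [r[:4] + [str(i)] + r[5:] for i, r in enumerate(ordered, start)]

rows = [
    ["", "language", "Slovenian", "slv", "6298", "slv", "sl", "Slovene"],
    ["", "language", "Croatian", "hrv", "0", "hrv", "hr"],
    ["", "language", "Māori", "mri", "4677", "mri", "Maori", "mao", "mi"],
    ["", "language", "French", "fra", "1960", "fra", "fre", "fr"],
]
for row in renumber(rows):
    print("\t".join(row))
```

Numbering the whole list in one pass avoids having to patch individual order values whenever a main name changes.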

We have discussed making it possible to drop the explicit order numbers for CVVs such as this one; plus we wanted to make it possible to supply localized versions of the names without breaking the sort order (German vs. Deutsch) - which of course requires sorting past the database retrieval. But let's leave that part for later.

@stevenwinship
Contributor Author

There will always be issues trying to merge two lists. Maybe we should drop our existing list and only use the ISO list. Have you (@landreev) noticed many languages in our list that are not in the ISO list?

@landreev
Contributor

Have you (@landreev) noticed many languages in our list that are not in the ISO list?

No, there are no cases of supported legacy languages that are no longer in ISO 639-3.
In my QA script referenced above, I have this at line 125:

unless ($LANGUAGE_NAMES{$identifier})
{
     die "Previously supported language " . $identifier . " (" . $mainname . ") is no longer on the list.\n";
}

Our existing list was not really "our" in any proprietary way. It was originally based on ISO 639-1, so it was fairly standard.

@landreev
Contributor

landreev commented Aug 27, 2024

There will always be issues trying to merge two lists. Maybe we should drop our existing list and only use the ISO list ... ?

Dropping our existing, legacy CVV and starting from scratch would be easier. But I just don't think we can afford to do that - for reasons of backward compatibility.
From what I understand from talking about these languages, including the couple of cases where somebody actually asked us to support more of them, I really get the impression that the thousands of extra languages added in ISO 639-3 are fairly exotic and will only be used in metadata that describes specialized research in linguistics, i.e., in relatively rare cases. We are adding all these new entries to citation.tsv because we've concluded that there was no clean or easy way to distribute them as some kind of optional "expansion pack". But most real-life uses of the "language" metadata field in actual datasets will likely still be covered by the most common languages that are currently supported. So, by that logic, I feel that maintaining backward compatibility with the existing list should be the priority. But I am of course open to counter-arguments.

@landreev
Contributor

I have modified my PR, adding corrected sort order to the CVV list.

…ges-qa

a proposed refinement (simplification) of some of the "main" language names in the new list

github-actions bot commented Sep 4, 2024

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:8578-support-for-iso-639-3-languages
ghcr.io/gdcc/configbaker:8578-support-for-iso-639-3-languages

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

@landreev
Contributor

landreev commented Sep 4, 2024

(waiting for the last Jenkins run to complete at this point)

@landreev landreev merged commit 8fd8c18 into develop Sep 4, 2024
12 of 13 checks passed
@stevenwinship stevenwinship deleted the 8578-support-for-iso-639-3-languages branch September 4, 2024 19:47
@pdurbin pdurbin added this to the 6.4 milestone Sep 4, 2024
Labels
Feature: Harvesting FY25 Sprint 4 FY25 Sprint 4 FY25 Sprint 5 FY25 sprint 5 GREI 3 Search and Browse NIH OTA: 1.4.1 4 | 1.4.1 | Resolve OAI-PMH harvesting issues | 5 prdOwnThis is an item synched from the product ... pm.epic.nih_harvesting pm.GREI-d-1.4.2 NIH, yr1, aim4, task2: Create working group on packaging standards pm.GREI-d-2.4.1B NIH AIM:4 YR:2 TASK:1B | 2.4.1B | (started yr1) Resolve OAI-PMH harvesting issues Size: 50 A percentage of a sprint. 35 hours. Type: Bug a defect
Projects
Status: Done 🧹
Development

Successfully merging this pull request may close these issues.

Figure out whether, or how to support the extended ISO 639-3 list of languages Missing 639-3 language codes?
5 participants