Support the full ISO 639-3 list of languages #10578 #10762

stevenwinship · 2024-08-08T16:03:18Z

What this PR does / why we need it: Some ISO 639-3 codes are still not supported. The cases encountered so far are frm (Middle French) and fro (Old French).

Which issue(s) this PR closes: Figure out whether, or how to support the extended ISO 639-3 list of languages #8578

Special notes for your reviewer:

Suggestions on how to test this: Start with preexisting datasets created by an older version of Dataverse. Follow the "Update the Citation metadata block" step from the release notes. Make sure the existing languages are unchanged and that some of the new languages can be added.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?: Yes, to be included in this PR.

Additional documentation: The ISO data was downloaded from
https://iso639-3.sil.org/code_tables/download_tables#Complete%20Code%20Tables:~:text=iso%2D639%2D3_Code_Tables_20240415.zip

pdurbin

I eyeballed the list (not that I really know what I'm looking at) and left a comment inline.

pdurbin · 2024-08-08T18:14:20Z

scripts/api/data/metadatablocks/citation.tsv

+	language	French	fra	1960	fra	fre	fr
+	language	French Sign Language	fsl	1961	fsl
+	language	Friulian	fur	1962	fur
+	language	Fula, Fulah, Pulaar, Pular, Fulah, Pulaar, Pular	ful	1963	ful	ff	Fula	Fulah	Pular	Pulaar


Why does ful have seven variations? At https://en.wikipedia.org/wiki/ISO_639:f they give it a single English name.

Perhaps the release note could indicate the source of the data.

I downloaded a zip from https://iso639-3.sil.org/code_tables/download_tables#Complete%20Code%20Tables (I'm not sure if this is the right place) and I only see one name, not seven:

pdurbin@beamish ~ % cd ~/Downloads/iso-639-3_Code_Tables_20240415
pdurbin@beamish iso-639-3_Code_Tables_20240415 % ls
iso-639-3-macrolanguages.tab iso-639-3.tab iso-639-3_Name_Index.tab iso-639-3_Retirements.tab
pdurbin@beamish iso-639-3_Code_Tables_20240415 % ack Fulah
iso-639-3_Name_Index.tab
2069:ful Fulah Fulah

iso-639-3.tab
1978:ful ful ful ff M L Fulah

(Later I noticed that "Divehi, Dhivehi, Maldivian, Dhivehi" has four values.)

Is there a script we can add to this pull request that was used to put these values into the tsv file? What process do we follow if we need to update the list in a few years? Should we put something in the dev guide?

stevenwinship · 2024-08-08

Already fixed it, or at least most of it.

qqmyers · 2024-08-08

We had some of these aliases, like Pular, in our old block, possibly from the 639-1/639-2 sources. It looks like the merge method produced some duplicates (such as Pular appearing twice), which is what appears to have been fixed. There is a further issue with this example, and perhaps others: Pular is listed separately as
fuf I L Pular
in the ISO file and Pulaar is
fuc I L Pulaar

Right now, I don't see fuf or fuc in the citation block. If they were there (and I think they should be), Pular and Pulaar should no longer be listed as alternates for ful (which may have been correct under 639-1/2).

I'm not sure how to automate unscrambling that. It almost seems that we will need to go through the existing 187 entries manually to see whether they've been split, dropped, etc. (Maybe automation can pull out all the lines where we have extra alternates, so those can be the ones to focus on.)
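The automation suggested above could be sketched roughly as follows. This is only an illustration, not the script used in this PR: the helper names are hypothetical, the row layout (a leading empty field, then `language`, name, identifier, order, alternates) is assumed from the diff lines quoted in this thread, and "name-like" is approximated as "anything that is not a 2- or 3-letter lowercase code".

```python
def parse_language_rows(text):
    """Return (name, identifier, alternates) tuples from citation.tsv-style rows."""
    rows = []
    for line in text.splitlines():
        fields = line.split("\t")
        # Assumed layout: "", "language", name, identifier, order, alternates...
        if len(fields) < 5 or fields[0] != "" or fields[1] != "language":
            continue
        alternates = [f for f in fields[5:] if f]
        rows.append((fields[2], fields[3], alternates))
    return rows


def is_code(value):
    """Treat 2- or 3-letter lowercase ASCII strings as language codes."""
    return len(value) in (2, 3) and value.isascii() and value.isalpha() and value.islower()


def rows_with_name_alternates(rows):
    """Keep only rows whose alternates include something that is not a code."""
    flagged = []
    for _name, identifier, alternates in rows:
        names = [a for a in alternates if not is_code(a)]
        if names:
            flagged.append((identifier, names))
    return flagged


# Sample rows modelled on the entries quoted in this thread.
sample = (
    "\tlanguage\tFrench\tfra\t1960\tfra\tfre\tfr\n"
    "\tlanguage\tFriulian\tfur\t1962\tfur\n"
    "\tlanguage\tFula, Fulah, Pulaar, Pular\tful\t1963\tful\tff\tFula\tFulah\tPular\tPulaar\n"
)
flagged = rows_with_name_alternates(parse_language_rows(sample))
print(flagged)  # [('ful', ['Fula', 'Fulah', 'Pular', 'Pulaar'])]
```

Only the flagged rows would then need a manual look to decide whether an alternate has become a language of its own.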

stevenwinship · 2024-08-09

I merged with our original TSV file. That file has Pulaar mapped to ful:

language Fula, Fulah, Pulaar, Pular ful 48 ful ff Fula Fulah Pulaar Pular

stevenwinship · 2024-08-09

I added the source of the ISO file to the release notes.

qqmyers · 2024-08-14T18:57:46Z

scripts/api/data/metadatablocks/citation.tsv

+	language	French	fra	1960	fra	fre	fr
+	language	French Sign Language	fsl	1961	fsl
+	language	Friulian	fur	1962	fur
+	language	Fula, Fulah, Pular, Pulaar	ful	1963	ful	ff	Fula	Fulah	fuc	Pular	fuf	Pulaar


In iso-639-3.tab, I see

fuc I L Pulaar
fuf I L Pular
ful ful ful ff M L Fulah

which means these are not the same language (as I assume they were considered in 639-1 or 639-2, where we got the original entry with all of these together).

It looks like 639-3 considers ful to be the macro language comprised of ffm, fub, fuc, fue, fuf, fuh, fui, fuq, and fuv (from https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3-macrolanguages.tab). We have all but fuc and fuf as separate entries now.

I think fuc/fuf should also be separate and, in general, the algorithm doing the merge should untangle any alternates that are considered separate languages in 639-3. It looks like there are ~64 macrolanguages, and probably few of them are represented in our old table. For those that are, I think any alternates that belong to separate individual languages should keep their separate entries.
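One way to picture the untangling step is the sketch below. It is an assumption-laden illustration rather than the merge code in this PR: `split_alternates` is a hypothetical helper, and `ISO_NAMES` holds only the three reference names quoted above instead of the full iso-639-3.tab.

```python
# Reference names keyed by ISO 639-3 identifier (only the rows quoted in this thread).
ISO_NAMES = {"fuc": "Pulaar", "fuf": "Pular", "ful": "Fulah"}


def split_alternates(identifier, alternates, iso_names):
    """Separate alternates that ISO 639-3 lists as another language's reference name."""
    name_to_id = {name: code for code, name in iso_names.items()}
    keep, move = [], {}
    for alt in alternates:
        owner = name_to_id.get(alt)
        if owner is not None and owner != identifier:
            move[owner] = alt   # belongs to a separate individual language
        else:
            keep.append(alt)    # code, or a name that still belongs to this entry
    return keep, move


keep, move = split_alternates(
    "ful", ["ful", "ff", "Fula", "Fulah", "Pular", "Pulaar"], ISO_NAMES
)
print(keep)  # ['ful', 'ff', 'Fula', 'Fulah']
print(move)  # {'fuf': 'Pular', 'fuc': 'Pulaar'}
```

A possible refinement would be to restrict the check to members of the same macrolanguage, using iso-639-3-macrolanguages.tab, so that unrelated name collisions are not moved by accident.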

stevenwinship · 2024-08-14

If I remove Pulaar and Pular from our original TSV file, the merge will make them separate. I'll do that.

qqmyers

Other than addressing the issue w.r.t. macrolanguages and alternates I've raised, this looks ready to go.

doc/release-notes/8578-support-for-iso-639-3-languages.md

landreev · 2024-08-26T14:50:01Z

@stevenwinship @qqmyers
Short version: I made a PR into this branch, #10801, with a few proposed changes (mainly simplifications) to the names of a few already-supported languages. Note that if you are OK with accepting this list, the numerical order values will need to be recalculated.

A much longer version:

This is how I tested:
I put together a quick validation script; it is checked in under scripts/issues/8578/script_check_languages.pl (we can drop it from the branch before we merge, or keep it for reference). The script parses the lists of languages from this branch and from develop, and checks for the following:

That the entries are properly formed, i.e. each one has a valid name, a 3-letter identifier in the 3rd column, the same 3-letter identifier in the 5th column as the first alternate value, and any further valid alternate values (letter codes or names) in the subsequent tab-separated columns.
That all the 2- and 3-letter codes, plus all the alternate names that have been supported until now (i.e., in the develop branch), are still supported in the new list.
It also checks and compares the old and new "main names" of languages (the 2nd column).
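The first of these checks could look roughly like the following. This is a sketch in Python, not a port of script_check_languages.pl; `check_entry` is a hypothetical helper, and it counts the leading empty field produced by the initial tab, so the name, identifier, order and first alternate sit at indexes 2 to 5.

```python
import re

CODE3 = re.compile(r"^[a-z]{3}$")


def check_entry(line):
    """Return a list of problems found in one language row (empty if well-formed)."""
    fields = line.rstrip("\n").split("\t")
    # Assumed layout: "", "language", name, identifier, order, alternates...
    if len(fields) < 6 or fields[0] != "" or fields[1] != "language":
        return ["not a language row with name, identifier, order and one alternate"]
    name, identifier, order, alternates = fields[2], fields[3], fields[4], fields[5:]
    problems = []
    if not name.strip():
        problems.append("empty name")
    if not CODE3.match(identifier):
        problems.append("identifier is not a 3-letter code")
    if not order.isdigit():
        problems.append("order is not a number")
    if alternates[0] != identifier:
        problems.append("first alternate does not repeat the identifier")
    if any(not a.strip() for a in alternates):
        problems.append("empty alternate value")
    return problems


good = check_entry("\tlanguage\tFrench\tfra\t1960\tfra\tfre\tfr")
bad = check_entry("\tlanguage\tFrench\tfra\t1960\tfre\tfr")
print(good)  # []
print(bad)   # ['first alternate does not repeat the identifier']
```

Returning a list of problems instead of stopping at the first one makes it easier to report everything wrong with a row in a single pass.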

The lists of entries to compare were selected by egrep '^\tlanguage\t' scripts/api/data/metadatablocks/citation.tsv and saved as citation_languages_10762.tsv and citation_languages_develop.tsv in the directory above.

All the new entries have been confirmed to be properly formed.
All the currently supported codes and alternates are still supported, with the exception of the 2 that you specifically mentioned above:

./script_check_languages.pl -checkcodes citation_languages_10762.tsv citation_languages_develop.tsv
Parsing the new languages list...
7921 language entries processed. All entries well-formed.
Processing the previously supported list:
identifier: ful, alternate value Pulaar is missing in the new list!
identifier: ful, alternate value Pular is missing in the new list!
186 develop branch entries processed.

... which is ok since these now have their own entries.
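For readers who want to reproduce the idea behind `-checkcodes` without the Perl script, a minimal sketch is shown below. The function names are hypothetical, and the fuc and fuf rows are made-up stand-ins for the new separate entries, not copied from the branch.

```python
def supported_values(rows):
    """Map each identifier to the set of alternate values accepted for it."""
    return {identifier: set(alternates) for identifier, alternates in rows}


def missing_alternates(old_rows, new_rows):
    """List (identifier, value) pairs supported in the old list but not in the new one."""
    new = supported_values(new_rows)
    missing = []
    for identifier, alternates in old_rows:
        if identifier not in new:
            missing.append((identifier, None))  # the whole language disappeared
            continue
        for alt in alternates:
            if alt not in new[identifier]:
                missing.append((identifier, alt))
    return missing


old_rows = [("ful", ["ful", "ff", "Fula", "Fulah", "Pulaar", "Pular"])]
new_rows = [
    ("ful", ["ful", "ff", "Fula", "Fulah"]),
    ("fuc", ["fuc", "Pulaar"]),  # illustrative row for the new separate entry
    ("fuf", ["fuf", "Pular"]),   # illustrative row for the new separate entry
]
report = missing_alternates(old_rows, new_rows)
print(report)  # [('ful', 'Pulaar'), ('ful', 'Pular')]
```

As in the run above, the two reported values are expected, because Pulaar and Pular are now covered by their own entries.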

The check on the main names in the 2nd column finds 37 discrepancies:

./script_check_languages.pl -checknames citation_languages_10762.tsv citation_languages_develop.tsv
Parsing the new languages list...
7921 language entries processed. All entries well-formed.
Processing the previously supported list:
Language name different for id abk: old: Abkhaz, new: Abkhaz, Abkhazian
Language name different for id cat: old: Catalan,Valencian, new: Catalan, Valencian
Language name different for id hrv: old: Croatian, new: Logudorese Sardinian, Croatian
Language name different for id div: old: Divehi, Dhivehi, Maldivian, new: Maldivian, Dhivehi, Divehi
Language name different for id ful: old: Fula, Fulah, Pulaar, Pular, new: Fula, Fulah
Language name different for id ell: old: Greek (modern), new: Modern Greek (1453-), Greek (modern)
Language name different for id grn: old: Guaraní, new: Guarani, Guaraní
Language name different for id hat: old: Haitian, Haitian Creole, new: Haitian Creole, Haitian
Language name different for id heb: old: Hebrew (modern), new: Hebrew (modern), Hebrew
Language name different for id ina: old: Interlingua, new: Interlingua, Interlingua (International Auxiliary Language Association)
Language name different for id kal: old: Kalaallisut, Greenlandic, new: Greenlandic, Kalaallisut
Language name different for id kir: old: Kyrgyz, new: Kirghiz, Kyrgyz
Language name different for id kua: old: Kwanyama, Kuanyama, new: Kuanyama, Kwanyama
Language name different for id lim: old: Limburgish, Limburgan, Limburger, new: Limburgan, Limburger, Limburgish
Language name different for id msa: old: Malay, new: Malay, Malay (macrolanguage)
Language name different for id mri: old: Māori, new: Māori, Maori
Language name different for id mar: old: Marathi (Marāṭhī), new: Marathi, Marathi (Marāṭhī)
Language name different for id nde: old: Northern Ndebele, new: Northern Ndebele, North Ndebele
Language name different for id nep: old: Nepali, new: Nepali (macrolanguage), Nepali
Language name different for id iii: old: Nuosu, new: Sichuan Yi, Nuosu
Language name different for id nbl: old: Southern Ndebele, new: South Ndebele, Southern Ndebele
Language name different for id oci: old: Occitan, new: Occitan (post 1500), Occitan
Language name different for id chu: old: Old Church Slavonic,Church Slavonic,Old Bulgarian, new: Church Slavonic, Church Slavic, Old Church Slavonic, Old Bulgarian
Language name different for id ori: old: Oriya, new: Oriya, Oriya (macrolanguage)
Language name different for id pli: old: Pāli, new: Pāli, Pali
Language name different for id fas: old: Persian (Farsi), new: Persian, Persian (Farsi)
Language name different for id run: old: Kirundi, new: Rundi, Kirundi
Language name different for id san: old: Sanskrit (Saṁskṛta), new: Sanskrit (Saṁskṛta), Sanskrit
Language name different for id gla: old: Scottish Gaelic, Gaelic, new: Gaelic, Scottish Gaelic
Language name different for id slv: old: Slovene, new: Slovenian, Slovene
Language name different for id spa: old: Spanish, Castilian, new: Castilian, Spanish
Language name different for id swa: old: Swahili, new: Swahili, Swahili (macrolanguage)
Language name different for id bod: old: Tibetan Standard, Tibetan, Central, new: Tibetan, Tibetan Standard, Central
Language name different for id uig: old: Uyghur, Uighur, new: Uighur, Uyghur
186 develop branch entries processed.

I would like to avoid introducing this many changes and address this by reviewing them on a case by case basis. My main goal is to avoid introducing more comma-separated lists in this column. In the ISO 639-3 as published these comma-separated lists serve the purpose that in our case is addressed by the list of "alternate" values that we supply. But since we use this "main name" for metadata exports, I don't think resulting entries like <dc:language>Guarani, Guaraní</dc:language> look right. So, to be clear, we should never have added any such comma-separated lists in the first place, but since we've had a few for a while it is now a matter of legacy.
So I tried going through the list with the following rules in mind:

When the old name "X" is being replaced with "Y, X", where Y is the name that is scientifically preferred now, we simply go with Y and make sure that X is listed as an alternate. Example:
the current entry: Slovene slv 146 slv sl Slovenian
the PR entry: Slovenian, Slovene slv 6298 slv sl Slovenian Slovene
proposed solution: Slovenian slv 6298 slv sl Slovene
When the old name "X" is being replaced with "X, Y", with the assumption that X is still the preferred name, we keep the name unchanged, as X, and make sure Y is an alternate. Example:
the current entry: Māori mri 103 mri mao mi Maori
the PR entry: Māori, Maori mri 4677 mri Māori Maori mao mi
proposed solution: Māori mri 4677 mri Maori mao mi
When an already supported comma-separated name is being replaced with a new list that's cosmetically different, or the same list in different order, we go with the 639-3 version (i.e., the PR). Example:
current: Kalaallisut, Greenlandic kal 74 kal kl Kalaallisut Greenlandic
PR: Greenlandic, Kalaallisut kal 2173 kal kl Greenlandic Kalaallisut
solution: keep the PR entry.
When an already supported comma-separated name is being replaced with an unnecessarily complicated or longer list, we try to keep it as simple as possible.
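The rules above still need a person to decide which name is preferred, but sorting the discrepancies into buckets first can be automated. The sketch below is one possible way to do that; `classify` and its labels are hypothetical and are not part of the QA script.

```python
def split_names(value):
    """Split a comma-separated main name into its individual names."""
    return [part.strip() for part in value.split(",") if part.strip()]


def classify(old, new):
    """Suggest which of the review rules a name discrepancy falls under."""
    old_names, new_names = split_names(old), split_names(new)
    if len(old_names) == 1 and len(new_names) > 1 and old_names[0] in new_names:
        if new_names[0] == old_names[0]:
            return "keep-old-name"         # "X" became "X, Y"
        return "check-new-first-name"      # "X" became "Y, X": a person confirms Y
    if len(old_names) > 1 and set(old_names) == set(new_names):
        return "cosmetic-or-reordered"     # same names, different order or spacing
    if len(old_names) > 1:
        return "simplify"                  # an existing list grew or changed
    return "other"


print(classify("Slovene", "Slovenian, Slovene"))                      # check-new-first-name
print(classify("Northern Ndebele", "Northern Ndebele, North Ndebele"))  # keep-old-name
print(classify("Kalaallisut, Greenlandic", "Greenlandic, Kalaallisut"))  # cosmetic-or-reordered
```

The "check-new-first-name" bucket deliberately does not pick a winner: a case like hrv, where the new list starts with "Logudorese Sardinian", shows that the first name in the merged list is not always the one to keep.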

It doesn't seem to be possible to accomplish the above without breaking the sort order, so the order numbers would need to be recalculated. But I cannot imagine any human researcher looking to enter "Croatian" and actually expecting to find it under the letter "L", as "Logudorese Sardinian, Croatian". So I'm proposing to keep it as "Croatian" and adjust the order numbers.
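Recalculating the order numbers after renaming entries amounts to sorting by main name and renumbering. A minimal sketch, assuming a hypothetical `renumber` helper and a starting value chosen by the caller (the real starting number in citation.tsv is not stated in this thread):

```python
def renumber(entries, start=0):
    """Sort (name, identifier) pairs by name and assign consecutive order numbers."""
    # casefold() gives a simple case-insensitive sort; it does not apply
    # language-aware collation, so accented names may need extra handling.
    ordered = sorted(entries, key=lambda entry: entry[0].casefold())
    return [(name, identifier, start + i) for i, (name, identifier) in enumerate(ordered)]


entries = [
    ("Slovenian", "slv"),
    ("Croatian", "hrv"),
    ("Greenlandic, Kalaallisut", "kal"),
]
renumbered = renumber(entries)
print(renumbered)
# [('Croatian', 'hrv', 0), ('Greenlandic, Kalaallisut', 'kal', 1), ('Slovenian', 'slv', 2)]
```

With "Croatian" kept as the main name, the entry sorts under "C" again and the numbers simply follow from the sorted position.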

We have discussed making it possible to drop the explicit order numbers for CVVs such as this one; plus we wanted to make it possible to supply localized versions of the names without breaking the sort order (German vs. Deutsch) - which of course requires sorting past the database retrieval. But let's leave that part for later.

stevenwinship · 2024-08-26T16:51:46Z

There will always be issues trying to merge two lists. Maybe we should drop our existing list and only use the ISO list. Have you (@landreev) noticed many languages in our list that are not in the ISO list?

landreev · 2024-08-27T15:16:07Z

> Have you (@landreev) noticed many languages in our list that are not in the ISO list?

No, there are no cases of supported legacy languages that are no longer in ISO 639-3.
In my QA script referenced above, I have this at line 125:

unless ($LANGUAGE_NAMES{$identifier})
{
     die "Previously supported language " . $identifier . " (" . $mainname . ") is no longer on the list.\n";
}

Our existing list was not really "ours" in any proprietary way. It was originally based on ISO 639-1, so it was fairly standard.

landreev · 2024-08-27T15:16:49Z

> There will always be issues trying to merge two lists. Maybe we should drop our existing list and only use the ISO list ... ?

Dropping our existing, legacy CVV and starting from scratch would be easier. But I just don't think we can afford to do that, for reasons of backward compatibility.
From what I understand from talking about these languages, including the couple of cases where somebody actually asked us to support more languages, I get the impression that the thousands of extra languages added in ISO 639-3 are fairly exotic and will only be used in metadata that describes specialized research in linguistics, i.e., in relatively rare cases. We are adding all these new entries to citation.tsv because we concluded that there was no clean or easy way to distribute them as some kind of optional "expansion pack". But most real-life uses of the "language" metadata field in datasets will likely still be covered by the most common languages that are currently supported. So, by that logic, I feel that maintaining backward compatibility with the existing list should be the priority. But I am of course open to counter-arguments.

landreev · 2024-08-27T16:00:24Z

I have modified my PR, adding corrected sort order to the CVV list.

github-actions · 2024-09-04T14:35:12Z

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:8578-support-for-iso-639-3-languages

ghcr.io/gdcc/configbaker:8578-support-for-iso-639-3-languages

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

landreev · 2024-09-04T14:42:38Z

(waiting for the last Jenkins run to complete at this point)

full iso 639 in tsv

d0d7b8b

stevenwinship self-assigned this Aug 8, 2024

remove duplicate properties

0c3840a

pdurbin reviewed Aug 8, 2024

stevenwinship added 5 commits August 8, 2024 16:48

fixing failed test

107de4d

fixing failed test

36119b7

fixing failed test

bd35185

removed lenient check for CVV

c74e5af

remove diplicates

5680a8a

stevenwinship added 2 commits August 9, 2024 15:20

updateing release notes

2c6c2f5

remove change to json file

accf498

stevenwinship removed their assignment Aug 9, 2024

qqmyers reviewed Aug 14, 2024

qqmyers requested changes Aug 14, 2024

qqmyers assigned stevenwinship Aug 14, 2024

qqmyers reviewed Aug 14, 2024

doc/release-notes/8578-support-for-iso-639-3-languages.md (outdated, resolved)

qqmyers mentioned this pull request Aug 14, 2024

Support the full ISO 639-3 list of languages #10578

Closed

make Pulaar fuc and Pular fuf separate languages

595094b

Adding to the release notes

b7f54db

stevenwinship removed their assignment Aug 14, 2024

qqmyers approved these changes Aug 14, 2024

landreev self-assigned this Aug 15, 2024

QA script and files for testing the ISO 639-3 language list added to …

3d534e3

…the citation block #8578 (temporary? - we can drop these from the branch before we merge)

a proposed refinement (simplification) of some of the "main" names on…

7dc1b3f

… the list. #8578

landreev mentioned this pull request Aug 26, 2024

a proposed refinement (simplification) of some of the "main" language names in the new list #10801

Merged

a quick experiment - a proposed citation.tsv with the sorted order co…

8764a6f

…rrected. #8578

cmbz added the FY25 Sprint 4 and FY25 Sprint 5 labels Aug 28, 2024

cmbz mentioned this pull request Aug 28, 2024

GREI 3: HDV Task - Improve OAI-PMH Harvesting IQSS/dataverse-pm#171

Open

56 tasks

Merge pull request #10801 from IQSS/8578-support-for-iso-639-3-langua…

f1b1342

…ges-qa a proposed refinement (simplification) of some of the "main" language names in the new list

landreev merged commit 8fd8c18 into develop Sep 4, 2024
12 of 13 checks passed

stevenwinship deleted the 8578-support-for-iso-639-3-languages branch September 4, 2024 19:47

pdurbin added this to the 6.4 milestone Sep 4, 2024

pdurbin mentioned this pull request Sep 6, 2024

Metadata Blocks Properties check failing after merging #10762, Missing key 'controlledvocabulary.language.abkhaz #10826

Closed

stevenwinship mentioned this pull request Sep 9, 2024

reworked controlled vocab language keys #10829

Merged
