standard/norm for LanguageSimpleType #27

Open
bertsky opened this issue Jan 22, 2021 · 4 comments

bertsky commented Jan 22, 2021

In PAGE-XML there's @language / @primaryLanguage of type pc:LanguageSimpleType to identify the natural language of segments. Its documentation refers to ISO 639.x 2016-07-14, which I cannot make sense of. There's 639-1, 639-2 and 639-3, but AFAICT no standard that allows strings of arbitrary length (as in the PAGE-XML enumeration), and nothing shows up for 2016-07-14. This is problematic because exact 639 mappings are needed for software implementation and interoperability.

Take Norwegian for example:

    <enumeration value="Norwegian"/>
    <enumeration value="Norwegian Bokmål"/>
    <enumeration value="Norwegian Nynorsk"/>

According to ISO 639, these could be coded as no/nb/nn (639-1) or nor/nob/nno (639-2 and 639-3). But how do we map that automatically, and where do the strings in PAGE-XML derive from?
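To make the mapping problem concrete, here is a minimal sketch of what a hand-maintained lookup table could look like for the three Norwegian values. The names `Iso639`, `PAGE_TO_ISO639` and `lookup_language` are hypothetical and not part of PAGE-XML or any existing tool; only the three enumeration strings and their codes are taken from the schema and from ISO 639.

```python
from typing import NamedTuple, Optional


class Iso639(NamedTuple):
    """ISO 639 codes for one language (hypothetical helper type)."""
    alpha2: str  # ISO 639-1 two-letter code
    alpha3: str  # ISO 639-3 three-letter code


# Hypothetical, deliberately incomplete table: only the three Norwegian
# enumeration values quoted above are covered.
PAGE_TO_ISO639 = {
    "Norwegian": Iso639("no", "nor"),
    "Norwegian Bokmål": Iso639("nb", "nob"),
    "Norwegian Nynorsk": Iso639("nn", "nno"),
}


def lookup_language(page_value: str) -> Optional[Iso639]:
    """Return the ISO 639 codes for a PAGE-XML language string, or None."""
    return PAGE_TO_ISO639.get(page_value)
```

Returning `None` for unknown strings keeps the gaps in the table visible instead of guessing a code.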

bertsky commented Jan 22, 2021

(Likewise, IIUC, only the first part of the ScriptSimpleType enums is actually ISO 15924, so these would have to be split at -.)
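A rough sketch of that split, assuming the enumeration values have the shape `<code> - <name>` (for example `Latn - Latin`; that exact shape is an assumption here and should be checked against the schema). The function name `script_code` is hypothetical. The pattern only checks the four-letter form of an ISO 15924 code, not whether the code is actually registered.

```python
import re
from typing import Optional

# Shape of an ISO 15924 code: one uppercase letter, three lowercase letters.
ISO15924_SHAPE = re.compile(r"[A-Z][a-z]{3}")


def script_code(page_value: str) -> Optional[str]:
    """Return the part before the first '-' if it looks like an ISO 15924 code."""
    head = page_value.split("-", 1)[0].strip()
    return head if ISO15924_SHAPE.fullmatch(head) else None
```

Values without a code-shaped prefix yield `None`, so they can be collected and handled by hand.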

bertsky commented Feb 15, 2021

So IMO what needs to be done is:

  1. In the next namespace version of PAGE-XML, change ScriptSimpleType to conform to ISO 15924 and LanguageSimpleType to conform to ISO 639.
  2. Provide a (manually crafted) transformation stylesheet mapping the existing, non-standardized xs:restriction strings to the new, standard ones. (That stylesheet can then be used by applications/users to update from the 2019 schema, or independently to interoperate with language and script values for PAGE-XML files up to 2019.)
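Step 2 proposes a transformation stylesheet. The block below is not that stylesheet; it is a small Python sketch of the same idea, rewriting `@language` and `@primaryLanguage` values through a hand-crafted table. The table contents beyond the Norwegian entries, the choice of ISO 639-3 as the target, the helper name `migrate_languages`, and the namespace-free toy document are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, incomplete mapping from legacy strings to ISO 639-3 codes.
LEGACY_TO_ISO639_3 = {
    "Norwegian": "nor",
    "Norwegian Bokmål": "nob",
    "Norwegian Nynorsk": "nno",
}

# The two attributes mentioned in this issue.
LANGUAGE_ATTRIBUTES = ("language", "primaryLanguage")


def migrate_languages(root: ET.Element) -> set:
    """Rewrite known legacy values in place; return the values left unmapped."""
    unmapped = set()
    for element in root.iter():
        for name in LANGUAGE_ATTRIBUTES:
            value = element.get(name)
            if value is None:
                continue
            code = LEGACY_TO_ISO639_3.get(value)
            if code is None:
                unmapped.add(value)  # leave the attribute untouched
            else:
                element.set(name, code)
    return unmapped


# Toy fragment without the real PAGE namespace, for illustration only.
SAMPLE = (
    '<PcGts><Page primaryLanguage="Norwegian Bokmål">'
    '<Word language="Some Unlisted Language"/>'
    "</Page></PcGts>"
)

root = ET.fromstring(SAMPLE)
unmapped = migrate_languages(root)
```

Reporting unmapped values rather than dropping them makes it easy to see which strings still need a manual decision before the table can be called complete.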

bertsky commented Apr 12, 2021

bertsky commented Apr 12, 2021

Oh, and there's a file here documentation/Language List (from ISO).xlsx – but it does not contain a complete mapping of all language strings against their 639 codes.
