Allow ScriptLangTag values with implied script subtag #978

Closed
nedley opened this issue Oct 6, 2022 · 13 comments

@nedley

nedley commented Oct 6, 2022

We are in the process of migrating out-of-font metadata to 'meta' tables, and most of the values being migrated have a language subtag but an implied script subtag. We would appreciate it if the spec allowed for ScriptLangTags with a language but no script subtag, in which case a likely script would be inferred using a process like the one described in UTS #35.


@PeterCon PeterCon self-assigned this Oct 7, 2022
@PeterCon PeterCon added this to the OpenType 1.9.1 milestone Oct 7, 2022
@PeterCon
Collaborator

PeterCon commented Oct 7, 2022

In general, this wouldn't be a breaking change for fonts—i.e., any existing fonts would remain conformant. For existing applications, it could mean new fonts provide a dlng or slng value that gets ignored—sub-optimal, but not a terrible problem in the long run.

A bigger concern, though, is maintenance over time: making sure that values put in fonts today continue to be valid and match the expectations of applications into the indefinite future.

A problem will exist if the assumptions as to what script is implicitly inferred from a language subtag change over time. If language tags existed in 1900, "tr" would have been assumed to imply Arabic script, and an Arabic-script font designed for Turkish might have used dlng=tr. But by 1950, that inference would be wrong.

You point to UTS #35, but I don't think that will be a good reference for determining when a script subtag can be omitted: CLDR uses suppress-script data from BCP 47, but also uses "heuristically-derived" values "based on the default content data, [and] the population data". Crucially, the likelyScript values "may change over time". The use of such heuristics suggests low expectation for stability, and significant risk that font data that's useful and conformant today might become un-useful or even non-conformant tomorrow.

BCP 47 suppress-script data may be better. It would be easy to describe and easy for font developers to find and explore (go to https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry and search for "suppress-script"). The suppress-script values can be changed, though proposed changes have to go through a human review process via the ietf-languages mailing list, and so probably have a better chance of stability. The fact that linguistic realities can change over time creates uncertainty, however: the record for "tr" has Suppress-Script: Latn today, but would have had a different value in the past (if BCP 47 had existed).
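
As a rough illustration of how that registry data could be consumed, here is a minimal Python sketch that extracts Suppress-Script values from text in the registry's record layout (records separated by `%%`, one `Field: value` per line). The sample text is an abbreviated, illustrative excerpt rather than the real file, and the function name `parse_suppress_script` is hypothetical.

```python
# Abbreviated, illustrative records in the registry's record layout.
# The real file has more fields per record and many more records.
SAMPLE_REGISTRY = """\
%%
Type: language
Subtag: tr
Description: Turkish
Suppress-Script: Latn
%%
Type: script
Subtag: Latn
Description: Latin
"""


def parse_suppress_script(registry_text):
    """Map language subtag -> Suppress-Script value, for records that have one."""
    result = {}
    for record in registry_text.split("%%"):
        fields = {}
        for line in record.splitlines():
            # Folded continuation lines (leading whitespace) are skipped here.
            if ":" in line and not line[0].isspace():
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
        if fields.get("Type") == "language" and "Suppress-Script" in fields:
            result[fields["Subtag"]] = fields["Suppress-Script"]
    return result


suppress = parse_suppress_script(SAMPLE_REGISTRY)
print(suppress.get("tr"))    # Latn
print(suppress.get("Latn"))  # None (a script record, not a language record)
```

Because the values can change through the review process mentioned above, a tool built this way would presumably want to record which registry version it read.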

The primary intent in the spec of making the script subtag the one required element was to allow for tags that include only a script subtag, which often makes sense as a way to characterize what fonts can support or are designed for. But it also has the benefit of ensuring forward compatibility for font data.

If an implementation is using something like ICU that expects tags that omit likely script subtags, it seems like it would be easy enough to detect these cases in the meta table and remove the script subtag. That doesn't seem like a bad tradeoff for providing better stability for font data.
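
The detect-and-remove step described above might look something like the following sketch. The `SUPPRESS_SCRIPT` table and the function name are hypothetical, and the tag is assumed to be well formed, with any script subtag being exactly four letters and appearing directly after the language subtag.

```python
# Hypothetical lookup table; in practice this would be loaded from the
# registry's Suppress-Script fields rather than hard-coded.
SUPPRESS_SCRIPT = {"tr": "Latn"}


def strip_suppressed_script(tag):
    """Drop the script subtag when it equals the language's Suppress-Script.

    Assumes a well-formed tag; a script subtag is taken to be four letters
    in second position, after the language subtag.
    """
    subtags = tag.split("-")
    if len(subtags) < 2:
        return tag  # script-only or language-only: nothing to remove
    language, second = subtags[0], subtags[1]
    is_script = len(second) == 4 and second.isalpha()
    if is_script and SUPPRESS_SCRIPT.get(language.lower()) == second.title():
        return "-".join([language] + subtags[2:])
    return tag


print(strip_suppressed_script("tr-Latn"))     # tr
print(strip_suppressed_script("tr-Arab"))     # tr-Arab
print(strip_suppressed_script("tr-Latn-TR"))  # tr-TR
```

The font keeps the explicit script subtag, and only the value handed to the downstream library is shortened, so the stored data stays unambiguous if the suppress-script value ever changes.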

With all that in mind, do you still think it would be an improvement to remove that constraint?

@nedley
Author

nedley commented Oct 7, 2022

Point taken that the likeliest script for a language can change over time, but note also that determining scripts supported by the font is a mechanical process given the character set, so an implementation can identify the correct one(s) as needed. Again, our aim is not to suggest that bare language tags are preferred moving forward, but it is awkward to see our historical use of these fields treated as nonconformant when the values are still accurate, if incomplete.

@PeterCon
Collaborator

PeterCon commented Oct 7, 2022

> determining scripts supported by the font is a mechanical process given the character set

True, though part of the point of having metadata is to inform without needing to do a detailed analysis of the font.

At any rate, it sounds like the main concern is to relax the constraint so that some existing fonts can be considered conformant. Would you be OK with allowing ScriptLangTags to have only a language subtag, but recommending that a script subtag should always be included?

@nedley
Author

nedley commented Oct 7, 2022

We would be fine with that resolution, yes.

@tiroj

tiroj commented Oct 7, 2022

> A problem will exist if the assumptions as to what script is implicitly inferred from a language subtag change over time. If language tags existed in 1900, "tr" would have been assumed to imply Arabic script, and an Arabic-script font designed for Turkish might have used dlng=tr. But by 1950, that inference would be wrong.

More immediate examples can be found in Central Asia, where several languages changed script twice during the 20th century, at least one language changed script three times during the same period, and several are in the process of transitioning to a new script right now.

@nedley
Author

nedley commented Oct 7, 2022

Suffice it to say the fonts and data we are dealing with do not suffer from problems of this nature…

[edit: By which I mean languages where the likeliest script is transitioning.]

@PeterCon
Collaborator

PeterCon commented Oct 7, 2022

> Suffice it to say the fonts and data we are dealing with do not suffer from problems of this nature…

Sure, but the spec still needs to anticipate the general case. My main concerns would be establishing what will provide for longer-term stability, and then not introducing ambiguity that gives some font developers the impression it's fine to do things that aren't conducive to that stability.

@nedley
Author

nedley commented Oct 7, 2022

We are not proposing any change that implies a preference for implied/suppressed script. But if software is to accommodate existing fonts on our platforms it needs to be aware of this until such time as we can make the necessary modifications.

@PeterCon
Collaborator

PeterCon commented Oct 7, 2022

Got it; I'll work on some wording...

@PeterCon
Collaborator

PeterCon commented Oct 8, 2022

See OT 1.9.1 alpha for draft revisions addressing this issue.

I've relaxed ScriptLangTag syntax as you requested but with clear statements that a tag without a script subtag is strongly discouraged and that applications are permitted to ignore such tags (allowing existing implementations that do so to remain conformant).
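
To make the "permitted to ignore" behavior concrete, here is a minimal sketch of how an application might filter a dlng or slng value. It assumes the value is a comma-separated list of ScriptLangTags with optional spaces around the commas, and it uses a simplified test for a script subtag (exactly four ASCII letters in first or second position). All function names are hypothetical.

```python
def split_script_lang_tags(value):
    """Split a dlng/slng string into tags, trimming spaces around commas."""
    return [t.strip() for t in value.split(",") if t.strip()]


def has_script_subtag(tag):
    """True if the tag is or starts with a script, or has one after the language.

    Simplified check: a script subtag is taken to be four ASCII letters.
    """
    def is_script(s):
        return len(s) == 4 and s.isascii() and s.isalpha()

    subtags = tag.split("-")
    return is_script(subtags[0]) or (len(subtags) > 1 and is_script(subtags[1]))


def usable_tags(value, accept_bare_language=False):
    """Return the tags an application would act on.

    With accept_bare_language=False, tags lacking a script subtag are
    ignored, which the relaxed wording still permits.
    """
    tags = split_script_lang_tags(value)
    return [t for t in tags if accept_bare_language or has_script_subtag(t)]


print(usable_tags("tr, tr-Latn, Arab"))
# ['tr-Latn', 'Arab']
print(usable_tags("tr, tr-Latn, Arab", accept_bare_language=True))
# ['tr', 'tr-Latn', 'Arab']
```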

@PeterCon PeterCon closed this as completed Oct 8, 2022
@PeterCon
Collaborator

PeterCon commented Oct 8, 2022

As we discussed offline, I also added some clarification regarding the intended use of, and distinction between, slng and dlng.

@dscorbett

The new ScriptLangTag grammar disallows some previously allowed tags. If a tag includes any region, variant, extension, or private use subtags, it must now include a language subtag. The first line:

ScriptLangTag = language | script | language "-" script

should be:

ScriptLangTag = (language | script | language "-" script)
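
The effect of the missing parentheses can be seen with a pair of regular expressions. This is only a sketch: the rest of the production is not shown in this thread, so the example assumes a single optional region suffix follows the first line, and the subtag patterns are simplified.

```python
import re

LANGUAGE = r"[A-Za-z]{2,3}"
SCRIPT = r"[A-Za-z]{4}"
# Assumed optional suffix standing in for the rest of the production.
REGION = r"(?:-(?:[A-Za-z]{2}|[0-9]{3}))?"

# Alternation binds loosest, so without grouping the suffix attaches
# only to the last alternative (language "-" script).
UNGROUPED = re.compile(rf"{LANGUAGE}|{SCRIPT}|{LANGUAGE}-{SCRIPT}{REGION}")
GROUPED = re.compile(rf"(?:{LANGUAGE}|{SCRIPT}|{LANGUAGE}-{SCRIPT}){REGION}")

for tag in ["Latn", "Latn-TR", "tr-TR", "tr-Latn-TR"]:
    print(tag, bool(UNGROUPED.fullmatch(tag)), bool(GROUPED.fullmatch(tag)))
# Latn True True
# Latn-TR False True
# tr-TR False True
# tr-Latn-TR True True
```

Under these assumptions, the ungrouped form rejects a script-only tag followed by a region, while the grouped form accepts it.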

@PeterConstable

Of course. It was my first thought to wrap in parens, but then I didn't do that. Thanks for the catch.

Fixed.
