Additional version/date attributes for gaiji description #2132

Open
747 opened this issue Apr 12, 2021 · 18 comments
Comments

@747

747 commented Apr 12, 2021

Because standardizing Han characters in Unicode is a gradual, time-consuming procedure, an unencoded Han gaiji will likely have multiple identities and go through property changes during and after the standardization process. To maximize the stability of texts that use the gaiji module, the character/glyph description needs to be able to record this update history so that changes can be traced and collated.

Thus we suggest:

  • Extended versioning: the existing @version is limited to the Unicode Standard version and is insufficient to support modifiable properties. We will need attributes to delimit the start and end points as Unicode versions (perhaps @verFrom and @verTo). Versioning systems outside Unicode should preferably be supported as well, for those regional or specialized character sets that are (still) widely used.

  • Datable elements: to further support changes that are not version-based, we should allow att.datable attributes on subelements of <char>/<glyph>. Whether the version-based and date-based attributes can coexist is subject to discussion.

For example (the <mapping>s contain placeholder notations instead of real code points, for readability):

<char xml:id="myChar">
  <localProp name="Name" value="A LOCAL GAIJI" />
  <unihanProp name="kIRG_USource" value="U-012345" verFrom="1X.0" verTo="1Y.0" />
  <unihanProp name="kIRG_SSource" value="S-567890" verFrom="1Z.0" />
  <mapping type="internal">0xABCD</mapping>
  <mapping type="PUA" from="2012-01-01" to="2018-03-31">U+FXXX</mapping>
  <mapping type="standard" from="2018-04-01" to="2019-10-15">U+YYYYY</mapping>
  <mapping type="standard" from="2019-10-16">U+YYYYY U+E0100</mapping>
</char>

A hypothetical sample for the local version attribute:

  <localProp name="reading" value="MEOW" localVer="1.1.0" />
  <localProp name="reading" value="OINK" localVer="1.2.0" />

It could also be given a start and an end, e.g. using @localVerFrom and @localVerTo.

Side question: How do we handle IVD (UTS #37) versions and properties, which are not linked in any way to those of Unicode proper?

@747
Author

747 commented Apr 26, 2021

From: #1805 (comment)

@sydb
Member

sydb commented May 8, 2021

Makes perfect sense to me that multiple version numbers would need to be recorded. But will it ever be necessary to specify anything other than

  1. No version # at all (not very interesting)
  2. A single version number (already supported)
  3. A range of version numbers (what @fromVer and @toVer would do)

If not, I am wondering (aloud) if just defining att.gaijiProp/@version as 1–2 occurrences of a Unicode version would do? (The semantics being that if there are 2 of them, it is a range.)

The original comment on this ticket is kind enough to show an example of a @localVer date attribute, but there is no explanation of what it is, or what it is supposed to be. I am hoping that someone (@747, @duncdrum?) can elaborate.

Also wondering what elements need att.datable and extended version capability. List of candidates:

from gaiji module

  • <charName> — has neither, to be deprecated 2022-02-15
  • <charProp> — has neither, to be deprecated 2022-02-15
  • <localProp> — has @version as part of att.gaijiProp
  • <mapping> — has neither
  • <unicodeProp> — has @version as part of att.gaijiProp
  • <unihanProp> — has @version as part of att.gaijiProp

from other modules

  • <binaryObject> — has neither
  • <desc> — has neither (but has @versionDate :-)
  • <figure> — has neither
  • <formula> — has neither
  • <graphic> — has neither
  • <media> — has neither
  • <note> — has neither
  • <noteGrp> — has neither

@747
Author

747 commented May 9, 2021

But will it ever be necessary to specify anything other than

  1. No version # at all (not very interesting)
  2. A single version number (already supported)
  3. A range of version numbers (what @fromVer and @toVer would do)

@sydb Thank you, this point relates to what I wrote above: "Whether the version-based and date-based attributes can coexist is subject to discussion." Actually, as far as our project's environment is concerned, adding a version range to the <*Prop> elements and a date range to <mapping> would probably suffice. At least, I have not yet been able to think of a situation in which a change in properties (of any form) is not accompanied by a version update (if any).

However, I can easily come up with a case in which a change in mapping involves a transition between multiple versioning schemes. Consider the following scenario:

  • You originally maintain gaiji using an old legacy character set (a handful of these seem to be still in use in Japan). For some time you have distributed a font so that users can display them. All glyphs are tentatively mapped to the PUA because Unicode does not support them.
  • You propose them for inclusion in Unicode. Some glyphs are accepted and assigned code points. However, two of the glyphs are subject to unification and are merged into a single code point in the main code chart.
  • You want to keep the distinction between the merged glyphs using the IVS mechanism, but this requires a base code point, which can only be acquired after the version of the standard that contains it has been published. Thus your proposal for a new IVS collection will be accepted and published some time after the release of the latest version of Unicode.

Here, the PUA, single-code-point, and IVS mappings of the glyph are all equally Unicode representations, so only one of them should be valid at any given moment. The versioning schemes of the legacy set, Unicode (core spec), and the IVD are all independent of each other, which means that if you try to delimit the mappings by "versions", the start and end attributes are described in terms of different frameworks. This could be very complicated compared to logging the changes by date. (Alternatively, you can update the version number of the legacy set whenever the Unicode mapping changes, but do you have to fork it to keep up with Unicode?)
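
To illustrate the date-based approach, here is a minimal Python sketch that picks the dated <mapping> in force on a given day. It assumes that @from and @to are ISO dates, that both ends are inclusive, and that a mapping without either attribute (such as the internal one) stays outside the timeline. The function name `mappings_on` is hypothetical, and the code points remain the placeholders from the original example.

```python
from datetime import date
import xml.etree.ElementTree as ET

# Adapted from the example in the original post; code points are placeholders.
SAMPLE = """
<char xml:id="myChar">
  <mapping type="internal">0xABCD</mapping>
  <mapping type="PUA" from="2012-01-01" to="2018-03-31">U+FXXX</mapping>
  <mapping type="standard" from="2018-04-01" to="2019-10-15">U+YYYYY</mapping>
  <mapping type="standard" from="2019-10-16">U+YYYYY U+E0100</mapping>
</char>
"""


def mappings_on(char, day):
    """Return the text of every dated <mapping> whose range covers `day`."""
    hits = []
    for mapping in char.iter("mapping"):
        start, end = mapping.get("from"), mapping.get("to")
        if start is None and end is None:
            continue  # undated mapping: assumed not to be part of the timeline
        if start is not None and day < date.fromisoformat(start):
            continue  # not yet in force
        if end is not None and day > date.fromisoformat(end):
            continue  # already superseded (assumes @to is inclusive)
        hits.append(mapping.text)
    return hits


char = ET.fromstring(SAMPLE)
print(mappings_on(char, date(2015, 6, 1)))
print(mappings_on(char, date(2020, 1, 1)))
```

If the date ranges are maintained without gaps or overlaps, each day yields exactly one Unicode representation, with no need to reconcile the three independent versioning schemes.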

I assume there are possible use cases in which <mapping> would best be recorded by version (I'd like to hear about them from others), but I have a feeling that marking both dates and versions at once is not a good idea.


If not, I am wondering (aloud) if just defining att.gaijiProp/@version as 1–2 occurrences of a Unicode version would do? (The semantics being that if there are 2 of them, it is a range.)

This seems a great idea for the version range notation, with one small drawback, I think: you will not be able to give the @toVer equivalent alone (is that useful anyway?).


The original comment on this ticket is kind enough to show an example of a @localVer date attribute, but there is no explanation of what it is, or what it is supposed to be.

What I have in mind is an analogy to <localProp>. As the Guidelines say (emphasis by me):

Where the information concerned relates to a property which has already been identified in the Unicode Standard, use of the appropriate Unicode property name with unicodeProp is strongly encouraged. The use of available Unihan property names with unihanProp is similarly encouraged. Validation rules for property names according to Unicode conventions are incorporated into the TEI schemas. Where neither of these standards suffices use localProp. (5.2.1 Character Properties)

the @localVer would be responsible for anything other than Unicode versions (as defined in context). So it should only be needed for <localProp>. I originally thought <unihanProp> might need a separate version attribute, but it seems to be always synchronized with Unicode core updates. The real problems I found in this area are the versioning of Emoji (UTS #51) and the IVD (UTS #37), which can be out of sync with Unicode core updates and yet are parts of the Unicode standard.


Also wondering what elements need att.datable and extended version capability.

I didn't think thoroughly beyond elements I showed in the original post, but on second thought, we might need (extended) versioning for <graphic> (and by extension <binaryObject>), since it can contain glyph images which are easily affected by standard updates.

@duncdrum
Contributor

duncdrum commented May 9, 2021

Just jotting down some quick notes.

To keep in line with version ranges in other standards I would propose to use minVer and maxVer as attribute names. Less confusion with from/to and potentially applicable elsewhere.

When I wrote the updates, my assumption was that all references to unihan or Unicode properties would be tied to a single unambiguous version. (Usually the latest at the time of publishing an edition). Automatic validation would assist and alert users to changes if they occur.

This could be made more explicit by defining a single @version on the parent of all *Name elements. More fine-grained control for localProps referencing external standards is a good idea, but I have the feeling that it can be more easily achieved outside the content model of the gaiji module.

@747
Author

747 commented Jul 5, 2021

To move the discussion forward, below is my tentative spec design based on the comments so far:

  • attributes added to <localProp>, <unicodeProp>, <unihanProp>, <binaryObject> and <graphic>

    • maxVer: maximum Unicode standard version applied (included, for non-stable properties); cannot coexist with version
    • minVer: minimum Unicode standard version applied (included, for non-stable properties); cannot coexist with version
    • otherVerSource: any identifier string representing a non-Unicode versioning scheme (better if a URL?)
    • otherVer: a non-Unicode-standard (specified by otherVerSource) version number; needs otherVerSource, cannot coexist with otherMinVer and otherMaxVer
    • otherMaxVer: maximum non-Unicode-standard version applied (included); needs otherVerSource, cannot coexist with otherVer
    • otherMinVer: minimum non-Unicode-standard version applied (included); needs otherVerSource, cannot coexist with otherVer

    otherVerSource can also take care of those Unicode-related version schemes such as IVD or emoji. It may be possible to have maxVer/minVer double the function of otherMaxVer/otherMinVer when otherVerSource exists, if that is okay. In that case, it'd be more elegant if we can also override the semantics of version.
    I have no concrete idea what role <figure> and <formula> play in gaiji information. They may deserve the attributes above if proven suitable.

    A more fine grained control for localProps referencing external standards is a good idea, but I have the feeling that that can be more easily achieved outside of the content model of the gaiji module.

    I hope an additional otherVerSource will account for the concern that the "external version number" alone is insufficient to specify what it means by itself.

  • attributes added to <mapping>


PS: For anyone who might be confused about the semantics of changes in <mapping> versus <*Prop> because they appear in the same discussion: the two are conceptually unrelated, and their only common trait is that both are frequently seen in our gaiji experience. The former changes the character's binary representation (or "identity") itself, and typically happens during the process of inclusion into a character set, while the latter changes character information (or "usage"), which can occur at any time while a character is included in a character set.
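
The coexistence rules in the tentative design above can be sketched as a small Python check. This is only an illustration of the rules as listed, not a schema constraint: the function name `check_version_attrs`, the sample values, and the example.org URL are all hypothetical.

```python
def check_version_attrs(attrs):
    """Return a list of rule violations for one element's attributes.

    `attrs` maps attribute names to values; the attribute names are those
    of the tentative design and do not exist in the current TEI schemas.
    """
    present = set(attrs)
    problems = []

    # minVer/maxVer express a Unicode range, so a single @version is redundant.
    if "version" in present and present & {"minVer", "maxVer"}:
        problems.append("version cannot coexist with minVer/maxVer")

    # Every other* version attribute needs otherVerSource to name its scheme.
    if present & {"otherVer", "otherMinVer", "otherMaxVer"} \
            and "otherVerSource" not in present:
        problems.append("other* version attributes need otherVerSource")

    # A single otherVer and an otherMinVer/otherMaxVer range are exclusive.
    if "otherVer" in present and present & {"otherMinVer", "otherMaxVer"}:
        problems.append("otherVer cannot coexist with otherMinVer/otherMaxVer")

    return problems


print(check_version_attrs({"minVer": "13.0", "maxVer": "14.0"}))
print(check_version_attrs({"version": "13.0", "minVer": "12.0"}))
print(check_version_attrs({"otherVer": "1.2.0"}))
```

The first call returns an empty list, while the other two each report one violation.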

@747
Author

747 commented Mar 4, 2022

I wonder if there is any further discussion ongoing.

@sydb
Member

sydb commented Mar 4, 2022

No, @747, I have to admit at least I have not thought about or discussed this ticket since last summer; so thanks for the ping!

As for versioning attributes you proposed on 05 Jul 21, two thoughts jump to mind.

  1. Putting version numbers on <graphic> and <binaryObject> would beg the question what does such a version number mean outside of the gaiji context.
  2. Could we dispense with @otherVer attributes by just asserting that @version or @minVer and @maxVer are version numbers within @scheme, where the default scheme is Unicode? Yes, in the case that a character property is applicable to both a range of versions in some other scheme and then a range of versions in Unicode the user would have to specify that property twice. Seems like a small price to pay.
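
A rough Python sketch of how the second idea might be read: each property element carries one version range, interpreted within @scheme, and an element without @scheme falls back to Unicode. The @scheme, @minVer and @maxVer attributes are only proposals at this point, and the scheme name `myLegacySet`, the version numbers, and the function name `ranges_by_scheme` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The same property stated twice: once per versioning scheme.
SAMPLE = """
<char>
  <localProp name="reading" value="MEOW" scheme="myLegacySet" minVer="1.1.0" maxVer="1.2.0"/>
  <localProp name="reading" value="MEOW" minVer="13.0" maxVer="14.0"/>
</char>
"""


def ranges_by_scheme(char, prop_name):
    """Map each scheme to the (minVer, maxVer) pair given for `prop_name`."""
    result = {}
    for prop in char:
        if prop.get("name") != prop_name:
            continue
        # No @scheme means the default scheme, assumed here to be "Unicode".
        scheme = prop.get("scheme", "Unicode")
        result[scheme] = (prop.get("minVer"), prop.get("maxVer"))
    return result


char = ET.fromstring(SAMPLE)
print(ranges_by_scheme(char, "reading"))
```

The duplication is the "small price" mentioned above: one element per scheme, but no separate family of @other* attributes.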

@747
Author

747 commented Mar 8, 2022

@sydb Thank you for your advice!

  1. Putting version numbers on <graphic> and <binaryObject> would beg the question what does such a version number mean outside of the gaiji context.

It is a very good point, and it involves the semantics of <char> (or <glyph>), which I don't think I have fully understood through the guideline text. If a <char> element is primarily an abstraction from the encoded document, so that the <graphic> is expected to be taken from, or based on, actual instances in the document, there will be little reason to add version information. But if the element is allowed to represent a character in some set that is believed to be equivalent to the in-document instances, so that the <graphic> can show its idealized shape (like those provided by Unicode), the content can be subject to changes made by that external standard (sometimes quite drastic ones).

As I re-read the Guidelines, I was actually able to find a description of <figure>, but oddly none of bare <binaryObject> or <graphic>.

The figure element, however, is used here only to link to an image of the character or glyph under discussion, or to contain a representation of it in SVG. (source)


  1. Could we dispense with @otherVer attributes by just asserting that @version or @minVer and @maxVer are version numbers within @scheme, where the default scheme is Unicode? Yes, in the case that a character property is applicable to both a range of versions in some other scheme and then a range of versions in Unicode the user would have to specify that property twice. Seems like a small price to pay.

Yes, it will be very welcome and efficient if possible.

@sydb
Member

sydb commented Jul 9, 2022

@747

Council discussed this today, and we are wondering if the following would address your needs?

  • Change the definition of @version of att.gaijiProp so that one can express a range (possibly an open-ended range) in that attribute.[1]
  • Add a @scheme (probably to att.gaijiProp), with the default being “Unicode”.
  • Add the children of <glyph> or <char> to att.datable.

Thus your original example would look something like

<char xml:id="myChar">
  <localProp name="Name" value="A LOCAL GAIJI" />
  <unihanProp name="kIRG_USource" value="U-012345" version="1X.0 1Y.0" />
  <unihanProp name="kIRG_SSource" value="S-567890" version=">=1Z.0" />
  <mapping type="internal">0xABCD</mapping>
  <mapping type="PUA" from="2012-01-01" to="2018-03-31">U+FXXX</mapping>
  <mapping type="standard" from="2018-04-01" to="2019-10-15">U+YYYYY</mapping>
  <mapping type="standard" from="2019-10-16">U+YYYYY U+E0100</mapping>
</char>

This has the slight disadvantages that a) you would not get the lovely drop-down list of Unicode version numbers that you do now, you would have to type it by hand (gasp!), and b) if you wanted to express a character property that was drawn from both Unicode and some other standard you would have to use two separate elements.

Note

[1] This might be done by creating a new teidata.semanticVersion datatype which would adopt the syntax, but not all of the semantics, of the semantic versioning system. (See #1993 and associated.) Thus values like (maybe) "1.1.0–1.3.2" and ">=2.0" could be used. The basic idea is that a single version number could be preceded by >, >=, &lt;, or &lt;= for the obvious “after”, “notBefore”, “before”, or “notAfter” semantics respectively; if 2 version numbers are present (separated by an en dash? a space? a solidus?), the semantics are “>=1.1.0 and <=1.3.2”, for example.
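
For discussion purposes, the range semantics in the note could be prototyped roughly as follows in Python. This is a sketch under several assumptions, not a definition of teidata.semanticVersion: it uses a space as the separator between two versions (the separator is still undecided), it accepts only dotted numeric versions, and the function name `matches` is hypothetical. Note that `<` has to be written as `&lt;` inside an XML attribute value; an XML parser hands the unescaped character to code like this.

```python
import re

_OPS = {
    ">": lambda v, ref: v > ref,    # "after"
    ">=": lambda v, ref: v >= ref,  # "notBefore"
    "<": lambda v, ref: v < ref,    # "before"
    "<=": lambda v, ref: v <= ref,  # "notAfter"
}


def _key(version):
    """Turn '13.0.0' into a comparable tuple, ignoring trailing zeros."""
    parts = [int(p) for p in version.split(".")]
    while parts and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def matches(expr, version):
    """Test whether `version` satisfies the version expression `expr`."""
    v = _key(version)
    parts = expr.split()
    if len(parts) == 2:
        # Two versions: an inclusive range, i.e. ">=low and <=high".
        low, high = parts
        return _key(low) <= v <= _key(high)
    if len(parts) != 1:
        raise ValueError(f"expected one or two versions: {expr!r}")
    m = re.fullmatch(r"(>=|<=|>|<)?(\d+(?:\.\d+)*)", parts[0])
    if m is None:
        raise ValueError(f"unsupported version expression: {expr!r}")
    op, ref = m.groups()
    if op is None:
        return v == _key(ref)  # a bare version matches only itself
    return _OPS[op](v, _key(ref))


print(matches("1.1.0 1.3.2", "1.2.5"))
print(matches(">=2.0", "1.9"))
```

Comparing tuples of integers keeps 9.0 before 10.0, which a plain string comparison would get wrong.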

@747
Author

747 commented Jul 11, 2022

@sydb

Thank you for your update and continued support. The described spec seems sufficient to cover our use cases. Apart from the fact that < and > might not look very nice in XML, I think the new version syntax solves our problem, since it can express a maximum, a minimum, and a range.

@sydb
Member

sydb commented Jul 11, 2022

OK, @747. (Or should we just call you “Boeing”?)
One of us (probably me) will try to have a test version of this solution up (probably in a branch on this repo) within the month.

@ebeshero
Member

ebeshero commented May 8, 2023

Note: related to #1993

@sydb
Member

sydb commented Aug 13, 2023

One of us (probably me) will try to have a test version of this solution up (probably in a branch on this repo) within the ~~month~~ year.

Yeah, I meant year, that’s it.

Seriously, @747, the hold-up here is that Council cannot make up its mind about version numbering in general (e.g., #1993), which sort of makes progress on the version number part of this hard. So once I realized an entire year has gone by, I did the other two bullet points, but not the 1st one (the version number stuff).

The results are available (only in English) on my basement server; see the Guidelines and the schemas there.

Council will be meeting again in Paderborn in a few weeks, and I am pretty sure version numbering will be on the agenda.

@747
Author

747 commented Aug 14, 2023

@sydb Hi, yes I understand that overriding @version would be a pain to the schema. Since the current @version is so tightly coupled with Unicode, shall we go back to the previous plan that makes distinct, more general attributes? I think I will be in Paderborn next month, is there anything I can do if I am present in the discussion?

@sydb
Member

sydb commented Oct 14, 2023

Sorry, @747, turns out I did not make it to Paderborn.
Do you think we should go ahead and request the other two items (added @scheme to att.gaijiProp with a default value of "Unicode"; and added <meeting>, <localProp>, <unicodeProp>, and <unihanProp> to att.datable) be pulled into the production Guidelines now, rather than waiting to solve @version first?

@747
Author

747 commented Oct 15, 2023

Hi @sydb thank you for your response. Do you mean <mapping> instead of <meeting>? On this assumption, those other additions can be independently processed from versioning matters, and we are okay with proceeding and discussing @version later.

@sydb
Member

sydb commented Oct 15, 2023

Oh, yes! That was a slip of the brain.
Will try to generate PR for the non-version issues later today.

@ebeshero
Member

ebeshero commented Jul 6, 2024

Noting that part of this ticket is addressed in #2511 for the Guidelines 4.8.0, but the rest (re @version) will have to wait for a later milestone.

@ebeshero ebeshero removed this from the Guidelines 4.8.0 milestone Jul 6, 2024
@ebeshero ebeshero added this to the Guidelines 4.9.0 milestone Jul 6, 2024

7 participants