RNG schema generation uses semantically unsound method of removing dangling references from content models #235
I have two concerns here:
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<text>
...
</text>
</TEI>
If we make this change to the processor and I do nothing but attempt to regenerate a Relax NG schema with the updated processor, I will get a (from my perspective) broken schema that doesn't validate anything, because the content model of
I'm not especially confident in this inference. But it happened well before my time. @lb42, can you shed any light on whether the current behavior is the result of a mistake or was a deliberate change in policy? |
My belief, on absolutely no evidence as I am currently engaged in teaching an ODD workshop en francais, is that the current behaviour is an unintended consequence of changes to the stylesheets. Finding out at what point in time the particular change concerned happened is a theoretically feasible but I would say not very useful exercise. |
The current code treats these in the same way as an explicit element spec with Other things being equal, I think the correct fix would be to continue to treat the various ways of excluding or not including an element as equivalent; this was not my first thought (as reflected in a comment on bug TEI 1589), but further examination of the source code and Lou's reassurance about potential interference with other idioms have led me to the belief that the simplest fix is simply to change the way odd2relax.xsl handles references to undefined patterns.
From where I'm sitting, the change is not breaking a legitimate ODD but making the system produce the schema actually defined by the ODD instead of a different schema. Users who wish to change the content model of As a matter of interest, are you aware of ODDs used in production systems of projects that eliminate |
Yes; it would be desirable to inform the user that the selection of elements in the ODD produces a schema with some elements which are unsatisfiable but not excluded. That will frequently be something the user wishes to change. It is (I think) easy enough to make the odd2relax stylesheet emit such a warning; how complicated it might be to make the Roma interface convey that warning to the user I do not know. |
I understand that's how you see it, but as you say, we may have users who assume the software is currently working correctly. So while we should fix it, we do need to take some care. I don't offhand know if we have any users who are doing things that Lou would call crazy. We may not. @lb42 remarks
I'm confused by this. Am I wrong that the only way you get definitions for elements in a module in your compiled ODD is to reference that module? If I compile tei_lite, I don't find an elementSpec for |
I think that's true, with one slightly pedantic correction. One can use an elementRef to refer to an element (normally to include it), or an elementSpec to (re)define it, without referring to the module containing it. So if one wants to exclude element I understand Lou's remark about modules being macros to be referring to the fact that including
And All the mechanisms mentioned so far:
have (as far as I have thus far seen) the same effect in the output of odd2odd.xsl: no elementSpec for I believe the cases in which content models in one module refer to elements in other modules are much more common than Lou indicates, but perhaps he is excluding references to classes with members from other modules. Striking examples of cross-module references include the references to |
I'm not advocating for anything. I'm just pointing out the existence of case 4 and wanting to make certain it is handled too. |
I am not sure how an ODD can usefully be considered "legitimate" if it is also "incorrect" but let that pass. I just wanted to confirm that (a) you can write an ODD containing only elementRefs and pull in a whole bunch of element specs if your @source attribute indicates a file that contains those specs. It won't be much use unless you include the module tei, however, since that's where the majority of class definitions hang out, and without them the content models of the elements you've referenced won't mean very much. (b) yes, I was not considering references to classes, since a reference to a class which is not populated is entirely benign in the construction of a schema. |
Backwards compatibility alert #12 & #35. We tell people that if they have an ODD which worked with TEI P5 release x.y, where x and y are some way back in the past, the ODD processor will continue to generate the same schemas as it did when that version was current if they specify x.y as the version of the source against which an ODD is to be compiled. The assumption is that the processor will treat old specs the same way as it always did, so all you need do is to give it the right mixture of old specs. If we now change the way the processor behaves, obviously that is no longer true. Or may no longer be true. |
@lb42 That's why I've always believed that the decision to split the TEI and Stylesheets repos apart was a mistake. The only way to remedy it really is to find a way to bind each TEI release to a specific Stylesheets release. Or put them back together again. |
@martindholmes well, we can always work out which stylesheet release corresponded with which TEI release, I suppose. But that might be counterproductive: I might want to go on using old specs, but benefit from super efficiency updates in the stylesheets. (I speak purely hypothetically of course) |
This is the point I was making above. I think it's probably solvable by essentially having a deprecation period for the old behavior, but we can't just make the change unless we're confident it won't break any ODDs. |
So we want an impossible situation: all Stylesheet releases work with all TEI releases, and vice versa. What happens if I want a bugfix from Stylesheets version 7.50, but I need processing for an element that was removed long before that version was released? In a perfect world (or maybe in P6), this is what I'd like:
That way, if you're using TEI version 3.5, then you can check out the latest release in that branch to get all the bugfixes for it and process your files successfully; when you want new features, you have to migrate to the appropriate later version, and see explicit error messages or warnings when your code is incompatible, so you can update it. If you need to stay on the old branch but there's a bug, it can get reported and fixed, with fixes ported to other branches if appropriate. |
Just a general comment on the current ODD processor: it really doesn't do error reporting. The code is (as they say) gnarly and occasionally redundant or even just plain inexplicable. Finding out why it behaves the way it does is often hard. So I tend to think that the benefits of any concerted effort to improve on its behaviour considerably outweigh the possible downside of some old behaviours disappearing. |
On Mar 8, 2017, at 9:53 AM, Lou ***@***.***> wrote:
Backwards compatibility alert #15 and #35. We tell people that if they have an ODD which worked with TEI P5 release x.y, where x and y are some way back in the past, the ODD processor will continue to generate the same schemas as it did when that version was current if they specify x.y as the version of the source against which an ODD is to be compiled. The assumption is that the processor will treat old specs the same way as it always did, so all you need do is to give it the right mixture of old specs. If we now change the way the processor behaves, obviously that is no longer true. Or may no longer be true.
Perhaps one way to address this would be as follows. Unless I see a better proposal here before I get around to actually making the changes, I will prepare the changes to the stylesheets thus:
1 Add a parameter (I’ll call it $sgd here, for 'subset guarantee on deletion’) to the appropriate stylesheets (and anything else that needs to know).
2 Default value of $sgd depends on the indicated source. If it’s the version current now, or an earlier one, then the default value is false; if it’s a later version, the default value is true.
3 If $sgd is false, the existing behavior is grandfathered in for Relax NG schemas; if it’s true, then deletion (= omission) of an element will have the subset semantics described in the bug reports, which is what the DTD processing now has.
4 I expect that for debugging purposes if no others I may add a parameter controlling whether unsatisfiable content model expressions should be simplified or not. Default value should be to simplify.
A consequence of this design is that the compatibility guarantee is preserved, but the subset-semantics behavior can be obtained immediately by those who don’t want the current behavior, without waiting for a new release.
Side question, for the compatibility lawyers in the crowd: How exactly should the stylesheet sniff out the version number?
* One might consult the tei:schemaSpec/@source attribute, look for a value matching the pattern "http://www.tei-c.org/Vault/P5/" + x + "." + y + "." + z + "/xml/tei/odd/p" + p + "subset.xml", and take the values x, y, and z from there.
When the source is a local document, or a resource which doesn't match that pattern, this approach won't work (so the stylesheet will need to have a default value for the case when it cannot identify a version number for the source).
* Instead of consulting the tei:schemaSpec 'source' attribute in the ODD being processed, one might instead consult the TEI document it points at. Assuming that $ED = /tei:TEI /tei:teiHeader[1] /tei:fileDesc /tei:editionStmt in this document, one could check for
$ED/tei:ref[contains(@target,'readme')]/string()
(: then tokenize() the string on '.' and compare Dewey-decimal style against (3, 1, 0) :)
or
$ED/tei:date/@when
(: if le '2016-12-15' then default to false(), else to true() :)
* Are there other better approaches?
[Edited 8 March 2017 at 2 pm Mountain Time to try to eliminate potential confusion about which document's TEI header is being examined looking for a TEI version number.]
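The first sniffing approach and the proposed default for $sgd can be sketched as follows. This is only an illustration in Python, not stylesheet code: the function names, the `CURRENT_RELEASE` constant (taken from the "(3, 1, 0)" comparison mentioned above), and the `fallback` argument for sources that do not match the Vault pattern are all assumptions made for the example.

```python
import re
from typing import Optional, Tuple

# Vault URL pattern as quoted in the comment above; x.y.z is the release number.
VAULT_RE = re.compile(
    r"^http://www\.tei-c\.org/Vault/P5/(\d+)\.(\d+)\.(\d+)"
    r"/xml/tei/odd/p(\d+)subset\.xml$"
)

# Assumption: the release current at the time of writing, per the
# "(3, 1, 0)" comparison suggested in the comment.
CURRENT_RELEASE = (3, 1, 0)


def source_version(source: Optional[str]) -> Optional[Tuple[int, ...]]:
    """Return (x, y, z) from a Vault @source URL, or None if it does not match."""
    if not source:
        return None
    match = VAULT_RE.match(source)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups()[:3])


def default_sgd(source: Optional[str], fallback: bool = False) -> bool:
    """Default for the hypothetical $sgd parameter.

    False for the current release or earlier (existing behavior is kept),
    True for later releases. `fallback` covers local or unrecognized
    sources; which value it should have is left open in the discussion.
    """
    version = source_version(source)
    if version is None:
        return fallback
    # Tuples compare component by component, i.e. Dewey-decimal style.
    return version > CURRENT_RELEASE
```

Comparing integer tuples rather than strings avoids ordering 3.10.0 before 3.2.0. A caller could still override the default explicitly, which is what makes the new behavior available without waiting for a new release.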
|
I think, @cmsmcq , you have hit it exactly right, except for one tweak: the |
One might even say that (and this is the most likely case) if the @source is not specified, then the default source assumed has to be tei:current -- as of the date the ODD was last revised. There's a table somewhere mapping TEI Release numbers to dates, for keen shedders of velocipedes. |
The generation of RelaxNG schemas from ODD documents removes from content models all references to undefined elements, without appropriate consideration of the context. The result is potentially a semantically inappropriate schema. The incorrect behavior appears to occur for elements for which the ODD contains an elementSpec with mode="delete", elements omitted from the include attribute on a moduleRef, and elements listed in the exclude attribute on a moduleRef.
A lengthy discussion of the topic, intermingled with some other topics, can be found in issue 1589 on the TEI repository. I'm opening this issue because the error in the generation of RNG schemas (and thus in all other schemas which are generated from the RNG schemas) and the error in the text of P5 can be fixed independently.
The discussion of the other issue suggests that some confusions and doubts should be addressed. Those who find the description above sufficient need not read further.
Assumptions about the meaning of element suppression
The key assumption in this bug report is that when an ODD suppresses an element E, the schema defined by the ODD should accept as valid all documents valid against the original schema which do not contain any instances of E. Some participants in the discussion of issue 1589 have apparently had some trouble accepting this as the correct behavior; some have suggested privately that the current behavior of simply removing all references to E without regard for context has an equal claim to correctness. The following paragraphs attempt to address this suggestion.
Two conflicting views of the semantics of element suppression
First, consider the choice between the two views offered on the meaning of mode="delete".
One view defines the meaning in terms of the meaning of the schema, as a change to the language to be defined. It does not prescribe the specific changes to be made to the text of the schema; any set of changes which has the required semantic effect is a correct implementation of the semantics. There are several different ways to operationalize the change, including one which is very simple to implement.
The other view defines the meaning of mode="delete" operationally, in terms of changes to the text of the schema. What is its effect on the language defined by the schema? As the examples below will show, the changes implemented by odd2relax.xsl are not guaranteed to restrict the language; some but not all of them will extend the language. I have not yet found a reliable characterization of the effect of the current behavior of the Relax NG generator in terms of the language defined, and none has thus far been offered.
I think an operator defined in terms of its semantics and easily implementable is likely to be more useful than an operator without a clear semantics, defined solely in terms of textual operations. So it seems to me that the meaning of element suppression in ODDs should be that of the first view identified above.
The motivating use case (the Mylonas principle)
Second, consider the basic use case. Facilities for suppressing elements of the TEI vocabulary were introduced in part at the insistence of projects who did not wish to make use of certain elements defined by TEI. When they were told "That's fine, you don't have to", they responded "That's not enough. It's not just that we don't mark personal names. Any occurrence of a person element in our texts is an error, and we want the schema to catch the error. That's part of what schemas are for." The goal, that is, was not to accept as valid any documents which would be invalid against the vanilla TEI grammar, but to enforce all of vanilla TEI's criteria of validity, while adding the criterion that certain elements should be absent. I think of this as the Mylonas Principle, since it was Elli Mylonas who successfully argued that the customization mechanisms should allow such restrictions of vanilla TEI.
That is, the motivating use case for including element suppression among the defined mechanisms for TEI customization favors the first, not the second, view of the required effect.
Functional compatibility of ODDs with P3 mechanisms
Third, consider the history of TEI customization mechanisms. In TEI P3, suppression of elements led to the omission of declarations for the elements concerned; no changes were undertaken to content models in which those elements were referred to. The resulting DTD is conforming both under the rules of ISO 8879 and those of XML, and the meaning of the resulting schema is consistent with the first of the two views above but not with the second. The meaning of DTDs with dangling references is parallel to the interpretation of context-free grammars containing undefined non-terminals.
The introduction of ODD documents (originally part of the publication system developed for TEI P2 and P3) as a replacement for the DTD-based customization mechanisms developed by the Metalanguage Committee for TEI P3 was (as I understand it) intended to make the customization mechanism schema-language neutral (or at least polyglot) and to replicate and improve on the mechanisms of P3. If a change in the meaning of the element-suppression operator was consciously intended, it will have been called out as a significant change, and discussed, and justified as an improvement. I haven't heard of any such discussion; I infer that no conscious change of semantics was intended.
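The parallel with context-free grammars containing undefined non-terminals can be made concrete with a small sketch. The grammar, the symbol names, and the function `productive` are invented for illustration and are not taken from any TEI tooling. The function computes which non-terminals can derive at least one string of terminals; an undefined non-terminal derives nothing, so any alternative that requires it is simply unusable, while sibling alternatives are unaffected.

```python
from typing import Dict, List, Set


def productive(rules: Dict[str, List[List[str]]], terminals: Set[str]) -> Set[str]:
    """Return the non-terminals that derive at least one terminal string.

    `rules` maps a non-terminal to its alternatives, each a list of symbols.
    A symbol that is neither a terminal nor a key of `rules` is an undefined
    non-terminal (a dangling reference) and is never productive.
    """
    ok: Set[str] = set()
    changed = True
    while changed:
        changed = False
        for head, alternatives in rules.items():
            if head in ok:
                continue
            # One alternative made entirely of usable symbols is enough.
            if any(all(s in terminals or s in ok for s in alt)
                   for alt in alternatives):
                ok.add(head)
                changed = True
    return ok


# Hypothetical grammar: E is referenced but never defined.
rules = {
    "S": [["A", "x"], ["B"]],
    "A": [["E"]],               # only alternative needs E
    "B": [["y"], ["E", "y"]],   # one alternative avoids E
}
terminals = {"x", "y"}
print(sorted(productive(rules, terminals)))
```

In this grammar A is unproductive because its only alternative requires the undefined E, whereas S and B remain usable through alternatives that avoid E. Nothing is rewritten; the dangling reference just never matches.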
Examples
Cases with the same results
Note that in some cases, the two views of the element-suppression operator produce the same results. Let us assume that both element E and element F are to be deleted.
* (a | b | E) will be rewritten (a | b).
* (a | E | F) will be rewritten (a).
* (a, E?, b) will be rewritten (a, b).
* (E?) and (E*) will be rewritten empty.
Cases with different results
In other cases, the two views will produce different results.
* For ( (a, E, b) | p+ ), the first view will produce (p+), and the second ((a, b) | p+). The first view preserves the invariant of the original model, that if a is used, then E and b are also used; the second violates the invariant.
* For (E), the first view will produce notAllowed, the second empty. The first view again preserves the invariant that every valid instance of the parent contains exactly one instance of E -- since our target schema allows no instances of E, there can be no valid instances of the parent. The second view discards the invariant and allows empty instances of the parent.
* For (E | F), the first view will produce notAllowed, the second empty.
* For (a | (E | F)), the first view produces (a | notAllowed) and then after simplification (a). The second view produces (a | empty) and then after simplification (a?).
Note that (a | E | F) and (a | (E | F)) recognize the same language, but the second view produces different results for them. This suggests it may be challenging to find a coherent semantic description of its effect on the language being recognized.
[Examples copy-edited 27 February, 10:30 Mountain Time.]
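The examples can be replayed with a small sketch in Python. It is not a description of how odd2relax.xsl is implemented: the tuple representation, the function names, and the modelling of the second view as "drop the reference, and turn a group left with nothing into empty" are assumptions chosen to reproduce the results listed above.

```python
EMPTY = ("empty",)
NOT_ALLOWED = ("notAllowed",)
SUFFIX = {"opt": "?", "star": "*", "plus": "+"}


def elem(name):
    return ("elem", name)


def group(kind, *kids):
    """kind is "seq" or "choice"."""
    return (kind, kids)


def repeat(kind, pattern):
    """kind is "opt", "star" or "plus"."""
    return (kind, pattern)


def simplify(p):
    """Apply the usual identities for empty and notAllowed."""
    kind = p[0]
    if kind in ("elem", "empty", "notAllowed"):
        return p
    if kind in SUFFIX:
        inner = simplify(p[1])
        if inner == EMPTY:
            return EMPTY
        if inner == NOT_ALLOWED:
            # E? and E* can still match nothing; E+ cannot match at all.
            return NOT_ALLOWED if kind == "plus" else EMPTY
        return (kind, inner)
    kids = [simplify(k) for k in p[1]]
    if kind == "seq":
        if NOT_ALLOWED in kids:
            return NOT_ALLOWED
        kids = [k for k in kids if k != EMPTY]
        if not kids:
            return EMPTY
        return kids[0] if len(kids) == 1 else ("seq", tuple(kids))
    live = [k for k in kids if k != NOT_ALLOWED]
    if not live:
        return NOT_ALLOWED
    rest = [k for k in live if k != EMPTY]
    if not rest:
        return EMPTY
    body = rest[0] if len(rest) == 1 else ("choice", tuple(rest))
    return ("opt", body) if len(rest) < len(live) else body


def delete_semantic(p, deleted):
    """First view: a deleted element can never match, so it becomes notAllowed."""
    def sub(q):
        if q[0] == "elem":
            return NOT_ALLOWED if q[1] in deleted else q
        if q[0] in ("seq", "choice"):
            return (q[0], tuple(sub(k) for k in q[1]))
        if q[0] in SUFFIX:
            return (q[0], sub(q[1]))
        return q
    return simplify(sub(p))


def delete_textual(p, deleted):
    """Second view (as modelled here): drop the reference from the text."""
    def strip(q):
        if q[0] == "elem":
            return None if q[1] in deleted else q
        if q[0] in ("seq", "choice"):
            kids = tuple(k for k in map(strip, q[1]) if k is not None)
            return (q[0], kids) if kids else EMPTY
        if q[0] in SUFFIX:
            inner = strip(q[1])
            return EMPTY if inner is None else (q[0], inner)
        return q
    stripped = strip(p)
    return simplify(EMPTY if stripped is None else stripped)


def show(p):
    kind = p[0]
    if kind == "elem":
        return p[1]
    if kind in ("seq", "choice"):
        joiner = ", " if kind == "seq" else " | "
        return "(" + joiner.join(show(k) for k in p[1]) + ")"
    if kind in SUFFIX:
        return show(p[1]) + SUFFIX[kind]
    return kind


a, b, p, E, F = (elem(n) for n in "abpEF")
gone = {"E", "F"}
for model in (group("choice", group("seq", a, E, b), repeat("plus", p)),
              group("seq", E),
              group("choice", a, E, F),
              group("choice", a, group("choice", E, F))):
    print(show(model), "first:", show(delete_semantic(model, gone)),
          "second:", show(delete_textual(model, gone)))
```

The first view needs no special cases: substituting notAllowed and simplifying is enough, which is one way to read the remark that a very simple implementation exists. The second view, by contrast, depends on how the model happens to be parenthesized, as the last two models in the loop show.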