add <w> to att.lexicographic #1776

Closed
iljackb opened this issue Jun 12, 2018 · 18 comments

@iljackb

iljackb commented Jun 12, 2018

Just as is the case with well established usages of attributes native to att.lexicographic within the dictionary module, there are identical use-cases for these attributes that arise in the development of a text corpus which currently, for lack of a sufficient alternative, require a customized solution in the TEI.

When building a corpus with <w>, the forms within these tokens may need to be normalized. Accepting this proposal would let users do this in one of two ways:

  • change the element value of <w> directly and record the original non-normalized form in @orig; (or)
  • keep the original, non-normalized element value of <w>, and record the normalized form as the value of @norm

Currently there is no attribute available on <w> which serves this function and the only feature available in the TEI at all is <orig> which does nothing about representing normalized forms either as an element or attribute value.

A specific use case for this proposal is a language documentation project of the Mixtepec-Mixtec language (iso 639: 'mix'), which is an under-resourced language with a very small body of published text booklets for children. In addition to these, new texts written by the project's native speaker consultants make up the core of the project's written material, and together with transcribed speech, these resources form the basis of the TEI corpus being produced. In dealing with such data there are three main factors which give rise to the need for the features in question, they are:

  1. the orthography is still undergoing changes (by a group from SIL Mexico), thus some texts have old spellings;
  2. the spelling conventions are not well known by speakers, leading to a need for significant corrections;
  3. there are also potential instances of sub-dialectal, and/or idiolectal vocabulary use (which we want to keep somewhere while also providing a normalized form for search and retrieval purposes);

Regarding usage case (1): in the Mixtepec-Mixtec language, for example, the lexical items meaning 'when' & 'where' were formerly both written orthographically as "nchii" and appeared as such in earlier publications (phonologically they are a minimal pair based on tone: [nd͜ʒiː˥] vs [nd͜ʒiː˥˩]). Because these items cannot reliably be distinguished by context, the word for 'when' retained the spelling "nchii" and 'what' was changed to "nchi". In the TEI corpus, the instances of the old spelling of 'what' were re-encoded as:

              `<w xml:id="d1e163" orig="Nchii">Nchi</w>`

Regarding usage case (2); where speakers spell something incorrectly but we would like to preserve it for any number of reasons, the use of @orig is essential and could have uses for both the speaker to see past mistakes, researchers to get insight into how untrained speakers write their language instinctually (in contrast to prescribed convention), etc.:

              `<w xml:id="d1e1435" orig="ntsa sia'i">ntsasia'i</w>`

             Side note: I could imagine `@split` (also att.lexicographic) might be used here instead of, or possibly in addition to, `@orig` to delineate the morphological sub-components (which correspond to the segmentation of the speaker's original written form)

Regarding usage case (3); although our speaker consultants are undisputedly part of the Mixtepec-Mixtec area, our speakers are from a small village of only several hundred people a significant distance from the other main populated areas, and the question of whether there are any significant lexical variations between this place and the greater population is not entirely settled. In fact there are certain tendencies demonstrated by at least one speaker that may be candidates for further exploration. Additionally, this particular speaker is less exposed to the language every day and doesn't live in the language region of origin and so these tendencies could be due to idiolectal differences which also may be of use for future socio-linguistic topics.

              `<w xml:id="d1e2363" orig="intu'u">ntu'u</w>`

Finally, note that in our project, given that the body of written language is so small and there is an urgent need to establish a significant body of consistent written text, our editorial practice prefers normalizing the element contents and recording the original in @orig. However, in any of these three cases, depending on editorial preferences, this could have been done the inverse way, i.e. giving precedence to the preservation of the original texts and placing the normalized form in the attribute @norm, e.g.

        (1)
              `<w xml:id="d1e163" norm="Nchi">Nchii</w>`
        (2)
              `<w xml:id="d1e1433" norm="ntsasia'i">
                   <w xml:id="d1e1434">ntsa</w>               
                   <w xml:id="d1e1435">sia'i</w>
                </w>`
        (3)
              `<w xml:id="d1e2363" norm="ntu'u">intu'u</w>`
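
Either convention gives downstream tools the same pair of forms. The following is a minimal Python sketch of that idea, assuming flat `<w>` elements with no namespace and no nested children (so the nested case (2) is not handled). The function name `forms` is hypothetical and not part of any TEI tool:

```python
import xml.etree.ElementTree as ET

def forms(w):
    """Return (original, normalized) for a flat <w> element.

    Assumes at most one of @orig / @norm is present and that <w>
    has no child elements.
    """
    text = (w.text or "").strip()
    if "orig" in w.attrib:   # content is normalized, source kept in @orig
        return w.get("orig"), text
    if "norm" in w.attrib:   # content is the source, normalized form in @norm
        return text, w.get("norm")
    return text, text        # no normalization recorded

a = ET.fromstring('<w xml:id="d1e163" orig="Nchii">Nchi</w>')
b = ET.fromstring('<w xml:id="d1e163" norm="Nchi">Nchii</w>')
print(forms(a))
print(forms(b))
```

Both calls return the same (original, normalized) pair, which is why the choice between the two conventions can stay an editorial preference.
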
@bansp
Member

bansp commented Sep 10, 2018

I wish I had seen this request earlier. Do I assume correctly that the core of this request is that @orig and @norm be available to more than just lexicographic items? That could also follow if the two were separated into, say, att.normalize. This potential new class could then be used by att.linguistic and in this way, the two attributes would make their way into <w> and <pc> (because the latter also needs to deal with normalization issues).

(I found this ticket by searching for the ticket suggesting that @norm be moved to att.global. Would someone kindly reference that ticket here, if I hadn't imagined it? Clearly, there is a momentum here worth exploiting. <w> needs @norm badly, and we didn't dare suggest that, for symmetry, @orig would be nice to have there as well -- because some projects concentrate on the normalized side, while longing to record the source.)

@lb42
Member

lb42 commented Sep 10, 2018

I just note that this reverses a sensible decision reluctantly taken long ago during the war on attributes. In particular it seems very likely that the value of @orig might need to contain markup constructs such as <g> or <hi> which you could not supply.
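
The limitation can be illustrated with a short Python sketch using only the standard library. The `<hi>` around "ch" is invented purely for illustration and is not taken from the corpus, and the helper name `parses` is hypothetical:

```python
import xml.etree.ElementTree as ET

def parses(xml_text):
    """True if xml_text is well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Markup inside an attribute value is not well-formed XML.
in_attribute = '<w orig="N<hi>ch</hi>ii">Nchi</w>'

# The element-based encoding can carry <hi> (or <g>) inside <orig>.
in_element = '<choice><orig>N<hi>ch</hi>ii</orig><reg>Nchi</reg></choice>'

print(parses(in_attribute))
print(parses(in_element))
orig = ET.fromstring(in_element).find("orig")
print([child.tag for child in orig])
```

The attribute version is rejected by the parser, while the element version keeps `<hi>` as a real child of `<orig>`.
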

@martindholmes
Contributor

Both James and I sadly noted exactly the same thing. :-)

@bansp
Member

bansp commented Sep 10, 2018

Let me take a stab at that, with apologies to Martin for repeating some statements from a discussion earlier today.

  • Firstly, it is very true that, in general, <g> and <hi> may be needed to render the original form. However, that is not the case at hand and will not be an issue in around 99% of the cases under discussion here, where we are safely catered for by Unicode.
  • Let me say more: it is also true that corpora of this sort need to be well documented in order to make sure that whatever is placed inside @orig or @norm gets the proper language tag -- as we know, xml:lang has its fuzzy sides, and in particular, in these cases, the language tag (or more generally, language identification) describing attribute content of @orig or @norm must be distinct from what describes the element content. Crucially, however, note that in the case at hand, and in many others, we're talking about two language tags: one expressed by xml:lang for element content and another one for the respective attribute stream. That other one has to be cleverly placed elsewhere, as part of the cost of creating this kind of resource.
  • Next, as in the nice example that Jack adduces, it may be that there is a mismatch in the number of segments between the appropriate attribute and the element content. This is either an issue or a non-issue, depending on the kind of tools and strategies employed for/in the corpus.

The above is something that corpus creators are rather acutely aware of. The attributes that Jack mentions are therefore not meant for any and all source forms or normalized forms, but rather for a relatively precise subset of them -- those that can be used to provide information at the level of <w>. It is also true that a beautiful way to handle such cases would involve many more elements, more complex structures, and probably a lot of links. The thing in this case, and similar cases, is that we're not aiming at something beautiful, but rather for something practical, open to manipulation by tools that process a sequence of <w> elements and need to find the relevant information locally. Very often (unlike the case that Jack mentions but increasingly so in many similar projects), the amount of data to be processed also plays an enormous role -- there's no processing power to follow all the links and disassemble beautiful structures when you're up against gigabytes of data.
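
As a rough illustration of such local, link-free processing, here is a minimal Python sketch that streams over `<w>` elements and reads the normalized form from the element itself. It assumes un-namespaced, flat `<w>` elements (real TEI files use the TEI namespace); the function name `iter_norm` and the two-token sample are made up for the example:

```python
import io
import xml.etree.ElementTree as ET

def iter_norm(source):
    """Yield the normalized form of each <w>, reading the document as a stream.

    Uses @norm when present, otherwise the element text.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "w":
            yield elem.get("norm") or (elem.text or "")
            elem.clear()  # drop the element's content once it has been read

sample = io.BytesIO(
    b"""<s><w norm="Nchi">Nchii</w> <w>ntu'u</w></s>"""
)
print(list(iter_norm(sample)))
```

Everything the loop needs sits on the `<w>` it is currently looking at, so no pointers have to be resolved and no larger structure has to be held in memory to answer "what is the normalized form of this token?".
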

Technological issues aside, there's also a strand of argumentation concerning cases like these and many others that adopts a stance of a somewhat irritated Ubercreator saying "nah, I won't allow this in the schema because a novice encoder might produce utter gibberish if that were allowed". Newsflash: a novice encoder may produce gibberish out of nearly anything, and innovation is often born where there's freedom to follow new goals and describe new data (otherwise, part of the forced "innovation" in this case may turn out to be adoption of a different XML encoding format or cooking up your own).

Summing up: Jack here and others elsewhere are not saying "abandon the old ways and henceforth follow our new solution exclusively". We're saying: "for a precisely delimited set of cases, and under relatively tight technological constraints, adopt this if you're sure you know what you're doing". In the spirit of the TEI being a toolkit for creating schemas, we propose a well-described set of advanced components for specialized users.


Let me also reference issue #1670 for more examples of where the need for @norm is dear (it's called @reg there). Various issues concerning the limits of analogical kind of annotation are mentioned/discussed in a recent LREC paper by Martin Mueller, Susanne Haaf and myself. We are also going to talk about this at the upcoming LingSIG meeting.

@lb42
Member

lb42 commented Sep 10, 2018

I believe that xml:lang specifies the language both for element content and for some attributes (depending on their datatype). I don't think XML allows you to change that rule, so if you want to give your attribute attributes (as it were) you're on your own.
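
A small Python sketch of what a processor sees, offered as an illustration rather than a statement about any particular tool: resolving xml:lang for a `<w>` yields a single tag for the element, with no separate slot for the language of @orig or @norm. The wrapper elements, the sample document and the helper name `effective_lang` are assumptions made for the example:

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def effective_lang(root, target):
    """Return the nearest xml:lang on target or its ancestors, else None."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent = {child: node for node in root.iter() for child in node}
    node = target
    while node is not None:
        if XML_LANG in node.attrib:
            return node.get(XML_LANG)
        node = parent.get(node)
    return None

doc = ET.fromstring(
    '<text xml:lang="mix"><s><w orig="Nchii">Nchi</w></s></text>'
)
w = doc.find(".//w")
print(effective_lang(doc, w))
```

Any distinction between the language of the element content and that of the attribute value would therefore have to be recorded by project convention somewhere else.
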

@sydb
Member

sydb commented Sep 10, 2018

Not sure which W3C standard you’re referring to, @lb42. My vague recollection is that the W3C had never considered characters outside of Unicode, and was happy to give you enough rope to hang yourself with if you put text that might need markup or language identification in your attributes. But I may well be mis-remembering.

But no matter. The argument against this ticket would be a lot stronger if the proposers were asking us to add a new attribute that violated the principles over which the War on Attributes was fought. But they’re not, as @norm and @orig already exist on quite a few elements. So they are just asking to tweak the peace treaty. Thus my instinct is to:
a) agree to the proposal, add <w> to att.lexicographic; and
b) use this as an opportunity to add a health warning to @orig and @norm.

My initial thought is that the health warning should comprise both a short warning in the <remarks> of each attribute and a longer discussion that includes alternate encodings that allow language identification, highlighting, and characters outside Unicode; and that the former should point to the latter.

@lb42
Member

lb42 commented Sep 11, 2018

@norm and @orig don't exist on any non lexicographic elements do they?

@sydb
Member

sydb commented Sep 11, 2018

Nope; just att.lexicographic.

@raffazizzi
Contributor

F2F subgroup agrees with @sydb: making <w> a member of att.lexicographic is acceptable, but the desc and remarks for @norm and @orig should be changed to clarify that they are not to be used outside of a lexicographic context, and to point to <orig> and <reg> for other uses.

@ebeshero ebeshero self-assigned this May 7, 2019
@tuurma
Contributor

tuurma commented May 7, 2019

@tuurma and @ebeshero to propose rewording of GL to make very clear when it's acceptable to use text-bearing attributes

@ebeshero
Member

ebeshero commented May 7, 2019

Council agrees in discussion with the proposal, but wants to implement a strong warning in the Guidelines, especially for the attributes bearing the teitext datatype. The warning would clarify that @norm and @orig are not the same as their element counterparts and have a precise linguistic use. Wording might be, "The attributes in this class are meant for lexicographic use and are not intended to be used for editorial interventions."

@bansp
Member

bansp commented Mar 2, 2020

I have submitted ticket #1973 that essentially does what I described in a comment above: introduces a separate attribute class to hold the two attributes in check, with a warning against abuse that the Council decided to add (see Elisa's comment immediately above). The ticket is accompanied by bits of documentation and it would be great if the Council could review it whenever convenient.

Thanks in advance for considering that -- enabling the use of @orig and @norm on <w> and <pc> would not only allow projects such as EarlyPrint to become 'legal', but would also allow work on developing a complete TEI serialization of the ISO MAF ("Morphosyntactic Annotation Framework") standard. Cheers!

@duncdrum
Contributor

duncdrum commented Mar 2, 2020

If there is interest I'd be happy to contribute a CJK example that avoids the use of <g> inside an attribute value.

@bansp
Member

bansp commented Mar 2, 2020

I think that such an example would definitely have general value also outside of this narrowly scoped discussion.

@ebeshero
Member

ebeshero commented May 2, 2020

Council greenlights this with some modifications to @bansp's pull request: to add <w> to att.lexicographic.normalized and to apply cautionary language. (We'd want to create a subclass of att.lexicographic. Ask @bansp to modify the pull request accordingly.)

@bansp
Member

bansp commented May 6, 2020

That is now done and merged. I do hope that @iljackb will find the result satisfactory!

@martinascholger martinascholger added this to the Guidelines 4.1.0 milestone May 6, 2020
@iljackb
Author

iljackb commented May 7, 2020 via email

@martinascholger
Member

So, I'm closing this. Thanks @bansp!
