Add etymology section from Jack's and Laurent's Paper #26

Open
ttasovac opened this issue Sep 3, 2018 · 26 comments
@ttasovac
Contributor

ttasovac commented Sep 3, 2018

Jack, what's your GitHub user name? I'd like to assign this to you.

@ttasovac ttasovac changed the title Add Etymology section from Jack's and Laurent's Paper Add etymology section from Jack's and Laurent's Paper Sep 3, 2018
@ttasovac ttasovac added this to the TEI Lex v0.7.0 milestone Sep 3, 2018
@ttasovac ttasovac added the docs Issues that concern the TEI Lex-0 documentation label Sep 3, 2018
@ttasovac
Contributor Author

ttasovac commented Jul 3, 2019

We haven't discussed this in great detail, but I need us to jumpstart this — also because my students in Lisbon need to encode some etymologies today in TEI Lex-0.

For the time being, I think we need:

  • specific types for etym from Laurent's and Jack's paper
  • cit type etymon (in addition to the types we currently allow)
  • desc as a child of etym

We will definitely discuss this and what our final recommendation will be. This is just to start the process.

@ttasovac
Contributor Author

ttasovac commented Jul 3, 2019

Merci, @laurentromary . I'll take a look.

One more general question — for you or anybody:

  • if we go for cit type="etymon"/form, I think it would be more natural to put the xml:lang on the form rather than the cit. Would you be ok with that?
  • Ancient Greek doesn't have an ISO 639-1 code, so we have to use ISO 639-2, i.e. a three-letter code. el is Modern Greek, which wouldn't be appropriate in the following example:

This is from Johnson's dictionary:

<etym type="borrowing"><pc>[</pc><cit type="etymon">
        <form xml:lang="grc"><orth>λεξικὸν</orth></form>
    </cit> and <cit type="etymon">
        <form xml:lang="grc"><orth>γράφω</orth></form>
    </cit>; <cit type="etymon">
        <form xml:lang="fr">lexicographe</form>
        <pc>,</pc>
        <lang value="fr">Fr.</lang>
    </cit><pc>]</pc>
</etym>

ttasovac added a commit that referenced this issue Jul 3, 2019
@ttasovac ttasovac added the schema Issues that concern the TEI Lex-0 schema specification label Jul 3, 2019
@laurentromary
Contributor

laurentromary commented Jul 3, 2019

For xml:lang, we should refer to BCP 47 and not to ISO 639 directly (it sets rules on how to use parts 2 and 3, for instance). My bible is always the IANA language subtag registry: https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

@laurentromary
Contributor

Should not you put a <lbl> around "and" in your example?

@ttasovac
Contributor Author

ttasovac commented Jul 3, 2019

> For xml:lang, we should refer to BCP 47 and not to ISO 639 directly (it sets rules on how to use parts 2 and 3, for instance). My bible is always the IANA language subtag registry: https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

Sure. I just don't like the fact that we have two-letter codes for modern languages and then a three-letter code for an ancient language, but I know that my 'liking' things is totally beside the point! 😃

> Should not you put a <lbl> around "and" in your example?

Yes, I was rushing... I think it will be a hard sell (I can imagine the questions starting with: "why is this a label"?) but yes, we don't like mixed content etc.

But, if I may ask again: are you ok with xml:lang on form and not on cit?
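
For reference, the Johnson example with the connective wrapped in <lbl>, as suggested above, might look like the sketch below. This is only an illustration of the suggestion, not an agreed recommendation. Wrapping the semicolon in <pc> and the French form in <orth> are additional assumptions made here so that no bare text remains inside <etym>; xml:lang stays on form, as in the original example.

```xml
<!-- Sketch only: same content as the Johnson example, with "and" in <lbl>.
     The <pc>;</pc> and the <orth> around "lexicographe" are assumptions. -->
<etym type="borrowing"><pc>[</pc><cit type="etymon">
        <form xml:lang="grc"><orth>λεξικὸν</orth></form>
    </cit> <lbl>and</lbl> <cit type="etymon">
        <form xml:lang="grc"><orth>γράφω</orth></form>
    </cit><pc>;</pc> <cit type="etymon">
        <form xml:lang="fr"><orth>lexicographe</orth></form>
        <pc>,</pc>
        <lang value="fr">Fr.</lang>
    </cit><pc>]</pc>
</etym>
```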

@laurentromary
Contributor

Do we need to make a decision on the fly now? My gut feeling relates this to @xml:lang on <entry> (and not on entry/form).

@ttasovac
Contributor Author

ttasovac commented Jul 3, 2019

We can't and don't need to make the final decision now. But I need to present something — as a temporary solution for our exercises today (we start in an hour and a half). I can put the xml:lang back on cit for today, but I still think we need to think about it a little more...
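
For comparison, the temporary variant with xml:lang on cit rather than on form might look like the sketch below. It reuses one etymon from the Johnson example and only illustrates where the attribute sits; it is not a final recommendation.

```xml
<!-- Sketch only: xml:lang moved from <form> up to <cit>. -->
<cit type="etymon" xml:lang="grc">
    <form><orth>λεξικὸν</orth></form>
</cit>
```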

@laurentromary
Contributor

Absolutely. One element to consider is when xml:lang is used to indicate the object language (such as in entry).

@WGBS2

WGBS2 commented Jul 3, 2019 via email

@ttasovac
Contributor Author

ttasovac commented Jul 4, 2019

Two remarks:

1. text nodes

Should we remove textNode from the content model of etym? It would be nice to get rid of mixed content, but, on the other hand, we can't expect that all dictionaries will encode etymologies deeply. Some may simply mark up the etym section and leave everything inside as text.

My initial thought here is that yes, we should disallow textNodes, but recommend in the narrative guidelines that those who do not go granular simply add a <note> inside <etym>, i.e.

<etym>
    <note>[λεξικὸν and γράφω; lexicographe, Fr.]</note>
</etym>

2. default type

We will need to discuss the typing. At the moment we put the types from Laurent's and Jack's paper, but those will need narrative explanations in the context of TEI Lex-0 because they may not be self-evident. We need to leave that longer conversation for later. (@anacastrosalgado and I will try to look at how our current typology works with the Portuguese Academy dictionary and will report back.)

But for the time being, with @type being required, I'm just wondering if we can come up with a default, catch-all type, which will be neither "borrowing" nor "inheritance" because those might simply be wrong in the given case.

Any thoughts @laurentromary, @iljackb?

@laurentromary
Contributor

I like the idea of the baseline provided with <note>. We should also signal a default way of marking up text nodes not identified as etymological components. Should we use <seg>, be more prescriptive right away with specific elements (<pc>, <lbl>, etc.), or, as I suggested on another ticket, use <alternate> models depending on the nature of the source encoding?

@ttasovac
Contributor Author

ttasovac commented Jul 4, 2019

I think we should preserve <pc> and <lbl> as specific elements for punctuation (when serving as delimiters between elements) and explicit labels. The text nodes not identified as etymological components should be placed in a different element.

Back in Berlin we were considering <desc> which we currently do not allow in TL0. But <seg> may be better:

> <seg> (arbitrary segment) represents any segmentation of text below the ‘chunk’ level.

I don't know what a chunk is but I like that segs are arbitrary. Whereas:

> <desc> (description) contains a brief description of the object documented by its parent element, typically a documentation element or an entity.

implies a complete description, not fragments of it.

So, yes, I'd actually prefer <seg> to <desc>.

@xlhrld
Collaborator

xlhrld commented Jul 5, 2019

In our TEI Lex-0 Etym paper, we (@iljackb, @laurentromary and I) propose seg[@type="desc"] for portions of text that cannot be marked up using any more specific element, yes. These portions are typically not proper descriptions of anything; they read more like arbitrary cut-offs from the running text (citing from the paper, e.g. »Others have proposed an etymology«, »with intervocalic«, »becoming«).

NB: To me, the whole business of avoiding mixed content feels a bit like over-engineering for prose-centered texts such as many etymologies. It doesn't add much to the modeling proper. Basically, you just confirm that you didn't forget to mark something up more specifically: it's just a <seg> of things you don't care about. It may be beneficial for certain parsers to avoid mixed content, though.
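
As a rough illustration of the seg[@type="desc"] proposal, a fragment might look like the sketch below. The two <seg> strings are the ones cited from the paper; everything else is an assumption: the etymon forms are elided with "…" as placeholders, and type="inheritance" is picked only because it is one of the types named earlier in this thread.

```xml
<!-- Sketch only: etymon forms are elided placeholders, not data from the paper. -->
<etym type="inheritance">
    <cit type="etymon"><form><orth>…</orth></form></cit>
    <seg type="desc">becoming</seg>
    <cit type="etymon"><form><orth>…</orth></form></cit>
    <pc>.</pc>
    <seg type="desc">Others have proposed an etymology</seg>
</etym>
```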

@TomazErjavec

I just discovered that some of my Lex0 dictionaries (cf. https://gitlab.clarin.si/et/tei-lex0-sl) are no longer valid, because etym/@type is now required. I then found this issue and this comment:

> But for the time being, with @type being required, I'm just wondering if we can come up with a default, catch-all type, which will be neither "borrowing" nor "inheritance" because those might simply be wrong in the given case.

  1. I think it is more "XML-like" to omit an attribute when you don't know its value. Why make it required and then add an "I don't know" value, rather than making it optional?

  2. Note that the documentation is rife with examples of etym without @type, so right now it is pretty misleading about what is OK and what is not. I'd also bet (1 beer) that for most legacy dictionaries it won't be clear what kind of etymology an etym represents, or at least it won't be easy to infer by machine, so @type will be the exception rather than the rule.

@WGBS2

WGBS2 commented Jul 5, 2019 via email

@anacastrosalgado
Collaborator

Hi! In Portuguese dictionaries, when etymologists do not know the source of the materials they handle, "De origem obscura" [Of obscure origin] is the usual label. How would you recommend encoding this (@ttasovac, @laurentromary, @iljackb)?
I would appreciate your help.

[image: cota2]

`<entry type="monolexicalWord" xml:lang="pt" xml:id="cota_b">

cota kˈɔtɐ :2 s. f. `

@laurentromary
Contributor

If it alternates with what would be an <etym>, maybe we should be going with one here as well, but typed undefined. <etym type="undefined">
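
A minimal sketch of that suggestion for the Portuguese label could look as follows. The value type="undefined" is the one proposed above; wrapping the text in <lbl> is an assumption made for this illustration (the <note> baseline discussed earlier in this thread would be another option).

```xml
<!-- Sketch only: the inner <lbl> is an assumption, not a settled recommendation. -->
<etym type="undefined">
    <lbl>De origem obscura</lbl>
</etym>
```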

@iljackb
Contributor

iljackb commented Sep 18, 2019 via email

@laurentromary
Contributor

I would imagine the term has variants, so relying on typing would help find the appropriate content unambiguously.

@ambs

ambs commented Feb 10, 2021

I know this is kind of off-topic, but can I ask why there is this aversion to mixed content?
It is one of the main reasons I use to sell XML over a serialization format like JSON for Digital Humanities.

@ttasovac
Contributor Author

Hi @ambs,

I wouldn't call it an aversion. The only concern is that mixed content is sometimes more difficult to process: I know I've run into whitespace issues in HTML that were really difficult to solve (and that differed between browsers, etc.). But all in all, I think everybody will agree with you that mixed content is sometimes a must, is often needed in humanistic texts (i.e. narratives, not tabular data), and yes, that's an argument in favor of XML over JSON, for sure.

@tklampfl

I have one question concerning etymologies in TEI Lex-0:
In the paper by Bowers and Romary, referencing with pRef and oRef plays an important role in etymological information.
However, both elements are excluded in the TEI Lex-0 schema:
[image: grafik]
So I am confused. What are the reasons for excluding pRef and oRef and using ref instead?

Thank you for your answer.

Best wishes,
Thomas

@ttasovac
Contributor Author

Etymology has not been officially added to TEI Lex-0 yet for no other reason than a lack of time on the part of everybody involved. When etymology is finally added and documented properly, pRef and oRef are unlikely to make a comeback, because we have already reached a consensus that specific elements for orthographic and pronunciation references are unnecessary from the point of view of TEI Lex-0, since we can use typed ref elements instead.
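
To make the idea of typed ref elements concrete, a replacement for the two dedicated elements might look like the sketch below. The type values "oRef" and "pRef" are hypothetical placeholders chosen for this illustration and are not confirmed anywhere in this thread; the element content is elided.

```xml
<!-- Sketch only: the type values are illustrative assumptions. -->
<ref type="oRef">…</ref>   <!-- orthographic reference, instead of <oRef> -->
<ref type="pRef">…</ref>   <!-- pronunciation reference, instead of <pRef> -->
```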

@tklampfl

Thank you very much for your answer. So, I will use ref instead to meet the requirements of TEI Lex-0.

@laurentromary
Contributor

laurentromary commented Mar 11, 2021 via email

@iljackb
Contributor

iljackb commented Mar 11, 2021 via email

@ttasovac ttasovac removed this from the TEI Lex v0.9.0 milestone Sep 26, 2021