Add etymology section from Jack's and Laurent's Paper #26

ttasovac · 2018-09-03T06:58:43Z

Jack, what's your GitHub user name? I'd like to assign this to you.

ttasovac · 2019-07-03T10:57:06Z

We haven't discussed this in great detail, but I need us to jumpstart this — also because my students in Lisbon need to encode some etymologies today in TEI Lex-0.

For the time being, I think we need:

specific types for etym from Laurent's and Jack's paper
cit type etymon (in addition to the types we currently allow)
desc as a child of etym

We will definitely discuss this and what our final recommendation will be. This is just to start the process.

As per #26

ttasovac · 2019-07-03T11:20:57Z

Thanks, @laurentromary. I'll take a look.

One more general question — for you or anybody:

if we go for cit type="etymon"/form, I think it would be more natural to put the xml:lang on the form rather than the cit. Would you be ok with that?
Ancient Greek doesn't have an ISO 639-1 code, so we have to use ISO 639-2, i.e. a three-letter code. el is Modern Greek, which wouldn't be appropriate in the following example:

This is from Johnson's dictionary:

<etym type="borrowing"><pc>[</pc><cit type="etymon">
        <form xml:lang="grc"><orth>λεξικὸν</orth></form>
    </cit> and <cit type="etymon">
        <form xml:lang="grc"><orth>γράφω</orth></form>
    </cit>; <cit type="etymon">
        <form xml:lang="fr">lexicographe</form>
        <pc>,</pc>
        <lang value="fr">Fr.</lang>
    </cit><pc>]</pc>
</etym>

laurentromary · 2019-07-03T13:44:12Z

For xml:lang, we should refer to BCP 47 and not to ISO 639 directly (it sets rules on how to use parts 2 and 3, for instance). My bible is always the IANA language subtag registry: https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
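As a minimal illustration of the BCP 47 point: the registry lists the two-letter subtag where a language has one and the three-letter subtag otherwise, which is why modern and ancient languages end up with tags of different lengths. The fragment below reuses forms from the Johnson example in this thread; it is a sketch, not a recommendation on where xml:lang should sit.

```xml
<!-- "fr": French has a two-letter subtag in the IANA registry, so BCP 47 uses it -->
<form xml:lang="fr"><orth>lexicographe</orth></form>

<!-- "grc": Ancient Greek has no two-letter subtag, so the three-letter one is used -->
<form xml:lang="grc"><orth>γράφω</orth></form>
```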

laurentromary · 2019-07-03T13:47:01Z

Should not you put a <lbl> around "and" in your example?
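For illustration, the Johnson example with this suggestion applied might look as follows. This is a sketch only: wrapping the semicolon in `<pc>` and the French form in `<orth>` are additional assumptions made here for consistency, not something decided in this thread.

```xml
<etym type="borrowing">
    <pc>[</pc>
    <cit type="etymon">
        <form xml:lang="grc"><orth>λεξικὸν</orth></form>
    </cit>
    <!-- the connective is wrapped in <lbl> instead of being left as a text node -->
    <lbl>and</lbl>
    <cit type="etymon">
        <form xml:lang="grc"><orth>γράφω</orth></form>
    </cit>
    <pc>;</pc>
    <cit type="etymon">
        <form xml:lang="fr"><orth>lexicographe</orth></form>
        <pc>,</pc>
        <lang value="fr">Fr.</lang>
    </cit>
    <pc>]</pc>
</etym>
```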

ttasovac · 2019-07-03T14:30:26Z

> For xml:lang, we should refer to BCP 47 and not to ISO 639 directly (it sets rules on how to use parts 2 and 3, for instance). My bible is always the IANA language subtag registry: https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

Sure. I just don't like the fact that we have two-letter codes for modern languages and then a three-letter code for an ancient language, but I know that my 'liking' things is totally beside the point! 😃

> Should not you put a <lbl> around "and" in your example?

Yes, I was rushing... I think it will be a hard sell (I can imagine the questions starting with: "why is this a label"?) but yes, we don't like mixed content etc.

But, if I may ask again: are you ok with xml:lang on form and not on cit?

laurentromary · 2019-07-03T14:33:42Z

Do we need to make a decision on the fly now? My gut feeling relates this to @xml:lang on <entry> (and not on entry/form).

ttasovac · 2019-07-03T14:52:16Z

We can't and don't need to make the final decision now. But I need to present something — as a temporary solution for our exercises today (we start in an hour and a half). I can put the xml:lang back on cit for today, but I still think we need to think about it a little more...
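The two placements under discussion can be sketched side by side. Both fragments are illustrative; because xml:lang is inherited by descendant elements, the two variants assign the same language to `<orth>`, and the open question is only which element should carry the attribute.

```xml
<!-- Variant A: xml:lang on <form>, as in the Johnson example earlier in this thread -->
<cit type="etymon">
    <form xml:lang="grc"><orth>γράφω</orth></form>
</cit>

<!-- Variant B: xml:lang on <cit>, inherited by <form> and <orth> -->
<cit type="etymon" xml:lang="grc">
    <form><orth>γράφω</orth></form>
</cit>
```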

laurentromary · 2019-07-03T16:19:02Z

Absolutely. One element is the notion of when xml:lang is used to indicate the object language (such as in entry)

WGBS2 · 2019-07-03T16:38:39Z

For BasNum, I always use three-letter country codes in order to reduce ambiguity. I use xml:lang on entry, but I find it somewhat redundant because, even if a word is of foreign origin, Furetière/Basnage considered it a French word: see aile (pronounced ale) for the English beer enjoyed by young Parisians at the end of the 17th century. Geoffrey

ttasovac · 2019-07-04T08:35:52Z

Two remarks:

1. text nodes

Should we remove textNode from the content model of etym? It would be nice to get rid of mixed content, but, on the other hand, we can't expect that all dictionaries will encode etymologies deeply. Some may simply mark up the etym section and leave everything inside as text.

My initial thought here is that yes, we should disallow textNodes, but recommend in the narrative guidelines that those who do not go granular simply add a <note> inside <etym>, i.e.

<etym>
    <note>[λεξικὸν and γράφω; lexicographe, Fr.]</note>
</etym>

2. default type

We will need to discuss the typing. At the moment we put the types from Laurent's and Jack's paper, but those will need narrative explanations in the context of TEI Lex-0 because they may not be self-evident. We need to leave that longer conversation for later. (@anacastrosalgado and I will try to look at how our current typology works with the Portuguese Academy dictionary and will report back.)

But for the time being, with @type being required, I'm just wondering if we can come up with a default, catch-all type, which will be neither "borrowing" nor "inheritance" because those might simply be wrong in the given case.

Any thoughts @laurentromary, @iljackb?

laurentromary · 2019-07-04T12:58:57Z

I like the idea of the baseline provided with <note>. We should also signal a default way of marking up text nodes not identified as etymological components. Should we use <seg>, be more prescriptive right away with specific elements (<pc>, <lbl>, etc.), or, as I suggested on another ticket, use <alternate> models depending on the nature of the source encoding?

ttasovac · 2019-07-04T13:23:10Z

I think we should preserve <pc> and <lbl> as specific elements for punctuation (when serving as delimiters between elements) and explicit labels. The text nodes not identified as etymological components should be placed in a different element.

Back in Berlin we were considering <desc> which we currently do not allow in TL0. But <seg> may be better:

<seg> (arbitrary segment) represents any segmentation of text below the ‘chunk’ level.

I don't know what a chunk is but I like that segs are arbitrary. Whereas:

<desc> (description) contains a brief description of the object documented by its parent element, typically a documentation element or an entity.

implies a complete description, not fragments of it.

So, yes, I'd actually prefer <seg> to <desc>.

xlhrld · 2019-07-05T05:47:57Z

In our TEI Lex-0 Etym paper we (@iljackb, @laurentromary and I) propose seg[@type="desc"] for portions of text that cannot be marked up using any more specific element, yes. These things are typically not sound descriptions of anything but rather seem like arbitrary cut-offs from the running text (citing from the paper, e.g. »Others have proposed an etymology«, »with intervocalic«, »becoming«).

NB: To me, the whole business with avoiding mixed content feels a bit like over-engineering for prose-centered texts such as many etymologies. It doesn't provide much benefit to the modeling proper. Basically you just confirm that yes, I didn't forget to mark this up as something more specific, it's just a <seg> of things I don't care about. It may be beneficial for certain parsers to avoid mixed content, though.
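The seg[@type="desc"] proposal above can be sketched as follows. The sentence is invented for illustration and is not taken from any dictionary; the type value "borrowing" and the French form are reused from the Johnson example earlier in this thread.

```xml
<etym type="borrowing">
    <!-- running text that no more specific element fits goes into <seg type="desc"> -->
    <seg type="desc">Others have proposed an etymology from</seg>
    <cit type="etymon">
        <form xml:lang="fr"><orth>lexicographe</orth></form>
    </cit>
    <pc>.</pc>
</etym>
```

With this pattern `<etym>` contains only element children, which is the point of avoiding mixed content; the cost is the extra wrapper around text that carries no structured information.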

TomazErjavec · 2019-07-05T18:18:59Z

I just discovered that some of my Lex0 dictionaries (cf. https://gitlab.clarin.si/et/tei-lex0-sl) are no longer valid, because etym/@type is now required. I have now found this issue and this comment:

> But for the time being, with @type being required, I'm just wondering if we can come up with a default, catch-all type, which will be neither "borrowing" nor "inheritance" because those might simply be wrong in the given case.

I think it is more "XML-like" that if you don't know a value for some attribute, you don't write the attribute, i.e. why make it required and then have an "I don't know" value, rather than making it optional?
Note that the documentation is rife with examples of etym without @type, so right now it is pretty misleading as to what is OK and what is not. I'd also bet (1 beer) that for most legacy dictionaries it won't be clear what kind of etymology an etym represents, or at least it won't be simply machine-inferable, so @type will be the exception rather than the rule.
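In ODD terms, the difference between the two options comes down to the usage attribute on the attribute definition. The fragment below is a hedged sketch of how an optional @type could be declared; it is not the project's actual ODD, and the surrounding schemaSpec is omitted.

```xml
<elementSpec xmlns="http://www.tei-c.org/ns/1.0" ident="etym" module="dictionaries" mode="change">
    <attList>
        <!-- usage="opt" leaves @type optional; usage="req" would make it mandatory -->
        <attDef ident="type" mode="change" usage="opt"/>
    </attList>
</elementSpec>
```

With usage="opt", an `<etym>` without @type stays valid, so legacy dictionaries that cannot classify their etymologies need no placeholder value.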

WGBS2 · 2019-07-05T18:31:08Z

I totally agree. Our <etym> are word histories, and more story than history. I shall only try classifying, using type, once I have full encoding and talk with real etymologists. I must say, I am wondering whether I can even attempt to stay in TLex0 as it is simply too simplistic for heritage dictionaries.

anacastrosalgado · 2019-09-17T22:56:50Z

Hi! In Portuguese dictionaries, when etymologists do not know the source of the materials they handle, "De origem obscura" [Of obscure origin] is the usual label. How would you recommend encoding this (@ttasovac, @laurentromary, @iljackb)?
I would appreciate your help.

`<entry type="monolexicalWord" xml:lang="pt" xml:id="cota_b">

cota kˈɔtɐ :2 s. f. `

laurentromary · 2019-09-18T04:56:55Z

If it alternates with what would be an <etym>, maybe we should go with one here as well, but typed as undefined: <etym type="undefined">

iljackb · 2019-09-18T08:19:14Z

So most simply I would do:

<etym>
    <seg type="desc">De origem obscura</seg>
</etym>

If you want and/or think it would be useful, you could also put a value in etym/@type such as "unknown", "undefined", "obscure", etc. But you don't necessarily need that, as the term in <seg> is enough to be able to search for where the etymology isn't known.

laurentromary · 2019-09-18T08:22:35Z

I would imagine the term has variants, and relying on a type would make it possible to find the appropriate content unambiguously.
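Combining the two suggestions, a typed encoding of the Portuguese label might look like the sketch below. The value "undefined" is the one floated in this thread, not a settled TEI Lex-0 value.

```xml
<etym type="undefined">
    <!-- the label text may vary across dictionaries; the type value stays constant -->
    <seg type="desc">De origem obscura</seg>
</etym>
```

A query on the attribute, for instance the XPath `//etym[@type='undefined']` (namespace handling omitted), would then find all such entries regardless of how the label is worded.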

ambs · 2021-02-10T22:11:21Z

I know this is kind of off-topic, but can I ask why this aversion to mixed content?
That is one of the main reasons I use to sell XML over a serialization format like JSON for Digital Humanities.

ttasovac · 2021-02-16T13:52:09Z

hi @ambs,

i wouldn't call it an aversion. the only concern is that mixed content is sometimes more difficult to process: i've run into whitespace issues in html that were really difficult to solve (and that differed between browsers etc.). but all in all I think everybody will agree with you that mixed content is sometimes a must and is often needed in humanistic texts (i.e. narratives, not tabular data), and yes, that's an argument in favor of XML over JSON, for sure.

tklampfl · 2021-03-11T13:58:42Z

I have one question concerning etymologies in TEI Lex-0:
In the paper by Bowers and Romary, referencing with pRef and oRef in etymological information plays an important role.
However, in the TEI Lex-0 schema both elements are excluded.

So, I am confused. What are the reasons for excluding pRef and oRef and for using ref instead?

Thank you for your answer.

Best wishes,
Thomas

ttasovac · 2021-03-11T14:10:15Z

Etymology has not been officially added to TEI Lex-0 yet for no other reason than a lack of time on the part of everybody involved. When etymology is finally added and documented properly, pRef and oRef are unlikely to make a comeback: we already reached a consensus that specific elements for orthographic and pronunciation references are unnecessary from the point of view of TEI Lex-0, since we can use typed ref elements for that.

tklampfl · 2021-03-11T14:24:50Z

Thank you very much for your answer. So, I will use ref instead to meet the requirements of TEI Lex-0.

laurentromary · 2021-03-11T14:28:53Z

If you’re not in a hurry: we need to finalise a paper on this by the end of the month. I could send you a stable draft by then. Laurent

iljackb · 2021-03-11T14:38:48Z

Hi Thomas,

Just to give a preview of how it is different in Lex0 Etym: if you are encoding a declaration of an etymon, cognate or derivative, the format is still within <cit type="etymon"> as in the first paper, but with <form> and <orth>/<pron>:

<cit type="etymon" xml:lang="pt">
    <form>
        <orth>humano</orth>
    </form>
</cit>

But if it is a cross-reference (such as the type that might occur in running text), that is when you would use <ref> (within <xr>), e.g. as follows:

....<xr type="related" subtype="etymon" xml:id="etym-dorsum" xml:lang="la">
<ref type="entry">dorsum</ref></xr>....

If this is a pronunciation form you can use @notation (as you can with <pRef>); otherwise it is assumed to be orthographic or simply unspecified. So whether you should use <ref> or not according to our recommendations depends on the function of the form. This is just to let you know the difference in how we are treating these in the new guidelines. But I see Laurent responded, so the details will best be explained in the paper itself when you get it.

Best,
Jack

ttasovac changed the title ~~Add Etymology section from Jack's and Laurent's Paper~~ Add etymology section from Jack's and Laurent's Paper Sep 3, 2018

ttasovac added this to the TEI Lex v0.7.0 milestone Sep 3, 2018

ttasovac added the docs Issues that concern the TEI Lex-0 documentation label Sep 3, 2018

ttasovac assigned ttasovac and laurentromary Jul 3, 2019

laurentromary added a commit that referenced this issue Jul 3, 2019

Several updates on the specification for etymology

4590ce0

As per #26

ttasovac added a commit that referenced this issue Jul 3, 2019

compiled as per changes in #26

1916b1b

ttasovac added the schema Issues that concern the TEI Lex-0 schema specification label Jul 3, 2019

ttasovac assigned ttasovac and laurentromary and unassigned laurentromary and ttasovac Jul 3, 2019

ttasovac modified the milestones: TEI Lex v0.7.0, TEI Lex v0.9.0 Nov 22, 2020

xlhrld mentioned this issue Jan 27, 2021

Mentioned in etym #143

Closed

ttasovac removed this from the TEI Lex v0.9.0 milestone Sep 26, 2021
