Fall 2019 Publication Thread #27

ctschroeder · 2019-06-11T17:03:33Z

Timeline:

hard list of documents determined by September 4
review beginning Sept 15

Version 3.0, version date 2019-09-30

List of materials as of 16 September.

PATHS texts (see #26 @ctschroeder )

Later in Fall

PATHS Sahidic colophons (see PATHS colophons #24 @ctschroeder )
more Johannes (@eplatte & @ctschroeder)
Seeks (@cluckmarq)
AP (MG & @cluckmarq now or later 2019)
God Says Through Those Who Are His (@bkrawiec)
Besa letter (@ctschroeder check with HB)
Shenoute Canons 6?
- needs permission from H Behlmer (@ctschroeder check)
- needs metadata (see Review metadata for new Canons 6 corpora #20)
  Alin has contacted us about more material and there's someone who wants to do G Philip

ctschroeder · 2019-09-16T17:15:39Z

I would like to add "copyist" or "scribe" to the metadata for the relevant texts in Marcion copied by Victor son of Mercurius (Onophrius, Cyrus, possibly others). See Layton's catalog https://www.dropbox.com/s/s7gdapyphgpc3mb/pLondCopt%20II%20%28Layton%29.pdf?dl=0. Everyone please let me know ASAP if you have any objections.

cluckmarq · 2019-09-16T17:23:26Z

No objection.


ctschroeder · 2019-09-16T17:34:49Z

Also the list of materials for publication is now set. I am going to check to see if someone can review my doc from Johannes. Can everyone working on docs for this round of publication be sure that the docs are appropriately tagged in GitDox as "review"?
@amir-zeldes the treebank corpora don't need review (except the Cyrus and Onophrius docs which need metadata revisions). Can you or @lancealanmartin please check that all the treebank docs except Cyrus and Onophrius have the correct version_n and version_date and are marked "review"? Thanks!! Once that is done, they can be labeled "to publish" in GitDox and checked off here on the top of the thread. If you can't do it I'll get to it later this week or next week -- just let me know the scoop. Thank you so much!!!

amir-zeldes · 2019-09-16T20:45:22Z

Adding scribe sounds fine to me (sounds better than copyist for my ears, but I'm fine with either)

I have reviewed all of the re-release documents (Shenoute, AP, Besa) and made sure the version is 3.0.0 + dated if they have been edited since last release, so those should be all good. The new Budge materials are either recent additions to the treebank (onno1, cyrus1, ephraim, repose), or they have been checked by either Lance or me, so I think they are OK as 'checked' and only need metadata review, no linguistic review necessary at this point.

Now assigned for (metadata) review to @ctschroeder :

ephraim
cyrus
onnophrius
repose

The following are assigned to others for sentences/translation, but also need metadata review:

proclus
victor2

Thanks!

ctschroeder · 2019-09-16T23:13:54Z

thx so much @amir-zeldes! I will deal with Proclus and Victor when others are done with them. Whoever's working on them can assign them to me when done.

ctschroeder · 2019-09-16T23:46:01Z

Greetings @amir-zeldes @lancealanmartin @eplatte @bkrawiec @cluckmarq. I'm working on URNs for the Marcion material. Part of the CTS URN is the "text group" and part is the "edition." A few questions have arisen. Apologies for the long post! Replies requested by the end of the week if humanly possible. This comment contains a fair bit of info in regular text with precise questions in bold.

This is PART 1. There may be a PART 2 as I work through the other texts.

For texts with known, identified authors, the "group" is the author. So urn:cts:copticLit:besa.aphthonia.monbba refers to the text "Letter to Aphthonia" in the text group "Writings of Besa" in the edition manuscript MONB.BA. For the material edited by Budge in Marcion we have two questions: What "text group" to designate and what "edition"?
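To make the URN structure described above concrete, here is a minimal Python sketch that splits a URN of this shape into namespace, text group, work, and edition. The `CtsUrn` and `parse_cts_urn` names are hypothetical and not part of any existing project tooling; the sketch assumes the dotted `group.work.edition` layout shown in the example and ignores any passage component.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    """Hypothetical container for the parts of a CTS URN."""
    namespace: str           # e.g. "copticLit"
    textgroup: str           # e.g. "besa" or "lives"
    work: Optional[str]      # e.g. "aphthonia"
    edition: Optional[str]   # e.g. "monbba" or "budge"


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split urn:cts:NAMESPACE:GROUP.WORK.EDITION into its parts.

    Assumption: work and edition are optional, and anything after a
    fourth colon (a passage reference) is ignored.
    """
    parts = urn.split(":")
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace = parts[2]
    work_part = parts[3].split(".")
    textgroup = work_part[0]
    work = work_part[1] if len(work_part) > 1 else None
    edition = work_part[2] if len(work_part) > 2 else None
    return CtsUrn(namespace, textgroup, work, edition)


print(parse_cts_urn("urn:cts:copticLit:besa.aphthonia.monbba"))
```

With this layout, the open questions about "text group" and "edition" amount to choosing the first and third dotted components.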

Edition (the simplest question): for the Martyrdom of Victor we used Budge as the edition. I suggest we do the same with the rest of the Marcion materials from Budge. Please let me know if you have objections to "Budge" as edition in the URN. no need to reply if this is fine.
For the "group" for each text/work we have a few options:

Onophrius & Cyrus: each is a vita or life. For O see https://atlas.paths-erc.eu/works/254 and https://atlas.paths-erc.eu/titles/235; Layton classifies it under "Miscellany" in his catalogue of BL Coptic Manuscripts, pp. 192-93 (see https://www.dropbox.com/s/s7gdapyphgpc3mb/pLondCopt%20II%20%28Layton%29.pdf?dl=0). For C see https://atlas.paths-erc.eu/works/246 and https://atlas.paths-erc.eu/titles/231. Shall we use lives, vitae, hagiography, or something else in the URN for the "textgroup" (e.g., urn:cts:copticLit:lives.cyrus.budge)? For Victor we used martyrdoms: urn:cts:copticLit:martyrdoms.victor. I lean toward "lives".
Ephraim (epistle) leads to 2 qs:
1. Should we spell as Ephraem or Ephrem or Ephraim? (usually Ephraem or Ephrem see https://atlas.paths-erc.eu/authors/40 http://syri.ac/ephrem http://syriaca.org/person/13; Budge uses Ephraim)
2. Should the "text group" be psephrem following our previous use of urn:cts:copticLit:pstheophilus.cross for a text by "pseudo-Theophilus"? or something else?
  There is no evidence Ephrem wrote the texts attributed to him in Coptic (see the rundown in the PATHS link just above; also I checked with Alin Suciu and Ellen Muehlberger). However, texts attributed to Ephrem seem to have constituted a "group" in antiquity. Alin and Ellen agree this is pseudepigraphical material; according to Alin the author is usually called pseudo-Ephrem or Ephrem Graecus.

Thank you!! Possibly more tomorrow on other Marcion texts/works.

eplatte · 2019-09-17T04:35:35Z

I like lives for the text group for Onophrius and Cyrus, and Ephrem for the spelling and psephrem for the text group for the epistle. Budge also makes sense for the edition.

amir-zeldes · 2019-09-17T14:33:37Z

Agreed on lives, budge and adding a pseudo prefix. For the spelling of Ephraim, I feel like we've been using mostly Latin spellings for some reason (onnophrius with 'u', cyrus with 'cy' and 'u'), so something like ephraem or ephrem seems more consistent than 'ai'. Whatever makes more sense as the 'Latin' form I would say.

amir-zeldes · 2019-09-17T14:41:15Z

OK, auto sentence spans are now added to paths. Some things to note:

Quality depends on three things:
- How good the NLP did/how badly segmented the original was (good: e.g. Aphou, less good, e.g. Longinus)
- Whether or not there's punctuation (Phib is the best: good NLP, punctuation; Aphou is not as good - no punctuation)
- Luck (really, coincidental similarity to the limited training data)
When quality is bad, and especially if there's no punctuation at all, there are sometimes super-long sentences. I manually broke up 3-4 instances where 'sentence' length was >400 words. This was mainly a problem in Paul of Tamma (no punctuation, and for some reason the sentences went very long stretches without breaking)
I should note the sentencer is biased towards caution: it prefers to abstain when things look murky, and the upshot is it makes fewer truly crazy splits.
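As a rough illustration of the manual check described above, the following Python sketch flags automatically produced sentences longer than a word threshold so they can be reviewed by hand. The function name and the list-of-token-lists input format are assumptions for illustration only; this is not the sentencer itself.

```python
from typing import List, Tuple

MAX_WORDS = 400  # threshold mentioned above for suspiciously long 'sentences'


def flag_long_sentences(
    sentences: List[List[str]], max_words: int = MAX_WORDS
) -> List[Tuple[int, int]]:
    """Return (sentence index, word count) for every sentence over the limit.

    Assumption: each sentence is already available as a list of word tokens.
    """
    return [
        (index, len(words))
        for index, words in enumerate(sentences)
        if len(words) > max_words
    ]


# Toy data: only the second 'sentence' exceeds the threshold.
example = [["w"] * 10, ["w"] * 401, ["w"] * 400]
print(flag_long_sentences(example))
```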

This all means we can now have the analytic vis for Paths. Note that because we do not have chapters, and the p tags (which seem fairly random) do not coincide with auto-sentences, we do not have a verses view for this data at the moment.

ctschroeder · 2019-09-17T16:18:18Z

@amir-zeldes I'll take a look this week about the chapters in PATHS. It sounds like other than that and the metadata, they are done? We are talking about Paul of Tamma, Phib, Aphou, Longinus and Luke (or no Longinus and Luke -- https://github.com/paths-erc/coptic-texts/blob/master/cc0418.xml). Thanks.

amir-zeldes · 2019-09-17T17:08:54Z

Yes, since we're releasing this as auto NLP they are basically done. If you want to do chapters let me know, but time is getting short - if so, they should properly nest 'translation' so we can do the blockified (non-numbered) verses view. Thanks!

And I think it is Longinus and Luke, the TEI header there is incorrectly copy-pasted from another file, right?

ctschroeder · 2019-09-17T17:11:48Z

Yes re Longinus and Luke.

Re chapters: part of the issue is the document URN usually includes the chapters, but we can skip that and just use the edition namespace as the end. Am wondering if the edition should be "CMCL" since it's taken from Tito Orlandi's editions (see for example this referenced in the paths header for Paul of Tamma) http://www.cmcl.it/~cmcl/paolotamma1.PDF

ctschroeder · 2019-09-17T17:14:27Z

or should the edition be "paths"? I think this is the best strategy, actually. Something like urn:cts:copticLit:lives.pauloftamma.cmcl or urn:cts:copticLit:lives.pauloftamma.paths

amir-zeldes · 2019-09-17T17:43:49Z

I also think it should be paths, since it includes paths annotations (e.g. their entity schema) and we don't actually know what processing steps happened between CMCL and their version. Saying it's paths is the simplest statement, and Paths's provenance from CMCL is something that should be described by Paths IMO

ctschroeder · 2019-09-17T17:49:28Z

Bingo

lancealanmartin · 2019-09-18T19:48:35Z

I can add PATHS as the edition. What should the collection be?

ctschroeder · 2019-09-19T03:59:45Z

hello, @amir-zeldes. Johannes.canons is ready for viz check; any documents with to_publish or review status. Beth is reviewing the doc needing review. Thanks so much!

amir-zeldes · 2019-09-26T17:34:53Z

version_date (and _n) has a validation, so that should get automatically flagged if someone used the wrong format.
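For readers unfamiliar with this kind of check, here is a minimal Python sketch of what a format validation for these two fields could look like. It is not the actual GitDox rule: the function names are hypothetical, and the accepted formats (dotted numbers such as 3.0.0 for version_n, ISO YYYY-MM-DD for version_date) are assumptions based on the values quoted earlier in this thread.

```python
import re
from datetime import date


def is_valid_version_n(value: str) -> bool:
    """Accept dotted version numbers such as '3.0' or '3.0.0' (assumed format)."""
    return re.fullmatch(r"\d+(\.\d+){1,2}", value) is not None


def is_valid_version_date(value: str) -> bool:
    """Accept only real calendar dates written as YYYY-MM-DD (assumed format)."""
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value) is None:
        return False
    try:
        date.fromisoformat(value)  # rejects impossible dates such as 2019-02-30
    except ValueError:
        return False
    return True


print(is_valid_version_n("3.0.0"), is_valid_version_date("2019-09-30"))
```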

According to a SQL query on the database, there are now no longer any documents with 'Liz', so that should be fine, but yes let's remember to always do full names!

Treebanking info:

AOF - Amir
A22 - Liz & Amir
Mark - Mitchell, Lance & Amir
1Cor - Mitchell & Amir
Cyrus - Lance & Amir
Onnophrius, Ephrem - Amir
AP - Liz & Amir

Of these, everything was already in corpus metadata except 1Cor, which had no corpus metadata at all. I copied it over from Mark and added all of the treebankers + Carrie, but I'm not sure who else has added 1Cor without treebanking (that's just who I'm seeing in the documents). Feel free to add if you know someone else!

ctschroeder · 2019-09-26T18:08:32Z

Thank you Amir! (I don't believe corpus metadata errors crop up in validation.) I will check 1 Cor annotators.

lancealanmartin · 2019-09-26T18:38:55Z

I did entity annotation for the first three chapters of both 1 Cor and Mark as well as shenoute.fox. Should I add my name to these docs?

ctschroeder · 2019-09-26T18:45:20Z

Yes @lancealanmartin please add your name to any document you edited, and then also add it to the corpus metadatum for annotation. Giving full credit to everyone is a major principle of ours!! Most documents have the primary annotator first, subsequent annotators in the middle, and the senior editor(s) who reviewed the document (usually Amir or me, sometimes Beth) as the last name.

amir-zeldes · 2019-09-26T19:04:53Z

I have no issues with adding Lance to those documents, as entity annotations will one day be released, but just to clarify, those entity annotations are not currently available in the online corpora.

As for annotator order: I'm embarrassed to say I seem to have had this wrong. I think anything where I added the names I did alphabetically by last name... Since Carrie and I are alphabetically relatively high, this may often match the pattern Carrie is mentioning, but anything I added annotation/translation to is probably just alphabetic. Also, in the repo interface, these things get split up and are findable separately no matter the order they are listed in inside the field.

ctschroeder · 2019-09-26T19:43:43Z

No worries. I think order is primarily a big deal for manually edited documents rather than the automated ones, and especially for junior folks; I try to keep an eye out for this during publication.

ctschroeder · 2019-09-26T23:35:12Z

@amir-zeldes the Marcion corpora are ready and should be frozen. Marcion corpora that are also in the gold treebank corpora will need metadata updated for the treebank files. TY!

ctschroeder · 2019-09-27T00:58:15Z

Hi @amir-zeldes I'm almost done with the johannes corpus -- checking visualizations, and I noticed that the new document is not in ANNIS. I see that there are 8 docs in the private instance and in the public one. I checked and FA215-224 is missing from the private instance. Thank you!

amir-zeldes · 2019-09-27T02:16:40Z

Got it. Try again now

ctschroeder · 2019-09-27T03:43:11Z

Oh goodness that was a doozy. I think due to the page layer being labeled pb_n instead of pb_xml_id. I hope that fixed it.

Also I am really sick (v sore throat) and so while Johannes is done the rest will have to wait for tomorrow.

amir-zeldes · 2019-09-27T03:50:36Z

Oh no, it's been going around here too. Feel better!

New version with the fix is already online.

ctschroeder · 2019-09-27T05:29:18Z

Johannes is good to go!

amir-zeldes · 2019-09-27T13:34:44Z

Thanks - right now TEI is not validating due to having chapter_n but not verse_n. We could revert it to 'p' mode, without chapters, but is there a reason the verses are 'ignore:'ed?

ctschroeder · 2019-09-27T14:59:42Z

Hi. Are we talking about Johannes or everything? For Johannes they’re ignored because I started and didn’t finish once we decided we didn’t need verse numbers for this release. Re TEI this must be common for all the documents that don’t have verses? This is odd because I don’t remember this as a problem in the past. I’m also really too sick to brainstorm at the moment. Do what you think is best.

amir-zeldes · 2019-09-27T15:15:23Z

The decision is per corpus, so we can either switch off verse numbers in 'verses' for all documents, or I'm happy to add consecutive numbers to verses in each chapter myself if that would solve it. Also, if only one document doesn't have verses, its TEI would have to look different from other documents in the corpus. Just give me your OK and I will add verse nums (they're mostly already there, I can easily finish).
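A minimal Python sketch of the "consecutive numbers to verses in each chapter" idea, assuming the verse units are available in document order as (chapter_n, text) pairs. The function name and input format are hypothetical and do not reflect how GitDox stores the spans.

```python
from typing import List, Tuple


def number_verses(units: List[Tuple[str, str]]) -> List[Tuple[str, int, str]]:
    """Assign verse_n values that restart at 1 whenever chapter_n changes.

    Assumption: units are in document order and each carries its chapter_n.
    """
    numbered = []
    current_chapter = None
    verse_n = 0
    for chapter_n, text in units:
        if chapter_n != current_chapter:
            # New chapter: restart the verse counter.
            current_chapter = chapter_n
            verse_n = 0
        verse_n += 1
        numbered.append((chapter_n, verse_n, text))
    return numbered


print(number_verses([("1", "a"), ("1", "b"), ("2", "c")]))
```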

Feel better!

ctschroeder · 2019-09-27T15:18:23Z

I’m not confident the numbers I have already are good sentences. If you have time to check please be my guest!

ctschroeder · 2019-09-27T22:17:52Z

@amir-zeldes it looks like we messed up the language/languages consistency in corpus metadata again. Is there an easy fix, or should I go back through all of them and check manually?

ctschroeder · 2019-09-27T22:21:24Z

@amir-zeldes sorry to bother you again but it appears the treebank annotators have not been added to document metadata in all the items. I'm noticing this in Mark. You've listed treebankers by corpus above, but I don't know which docs belong to whom. Can you please check the document level metadata to be sure the treebankers have been added? #27 (comment) Thank you!

ctschroeder · 2019-09-27T22:22:11Z

(This may mean the corpora we thought should be frozen need to be fixed. I assumed the treebank folks had been added to doc level annotation.)

amir-zeldes · 2019-09-28T01:47:31Z

OK, I will look into these tomorrow

ctschroeder · 2019-09-28T05:32:55Z

A few final things for this evening:

I'm noticing some corpora that are not ready are on public ANNIS. I'm guessing they are supposed to be behind the password and there was some glitch? At any rate, can they be removed right away? They are: AP, life of L&L, life of Phib, Mark (see below), 1 Cor?
red alert! unfreezing:

there was a problem with Cyrus (now fixed and ready to be reprocessed for publication)
Mark (see top post)

I checked the other treebanked docs to see if the treebankers were in the doc level metadata; for 1 Cor and AP I couldn't tell because there were many docs edited and multiple treebankers. Again, see the top post (#27 (comment)).
I could not commit part 1 of Longinus & Lucius. No clue why not. It gave me a GitHub error. Can you please commit part 1? Then it will be ready to publish.

amir-zeldes · 2019-09-28T13:36:59Z

ANNIS is a glitch from concurrent ANNIS4 security manager (you are actually seeing what's in ANNIS4 right now, including some non-ready tests). Now reverted, sorry about that.
OK, Cyrus is reconverted and Mark + 1Cor document annotators are checked for the treebanked parts (Chap. 1-6 in both)
1 Cor and AP are good to go (were already correct)
It seems that with the added annotations, it is now too big to commit via the API... I've committed it manually for now, but I'm opening an issue here: "Large serialized files cannot be committed via GitHub API" (gucorpling/gitdox#155)

PS - oh, weird, now that I've manually committed, I can actually commit small changes to Longinus, presumably because the diff is small(?)..

amir-zeldes · 2019-09-28T13:54:36Z

RE language/languages:

All individual documents have 'language'
All corpora have 'languages' except:
- fox
- dormition
- pseudo-ephrem

Was it intentional for corpora to have 'languages' to differentiate from the document level metadatum? In ANNIS, metadata queries just 'apply', so if the two fields share a name but hold conflicting values, it's possible that exact meta-based searches will actually yield zero results for these corpora.
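One way to check this convention in bulk is sketched below in Python. The helper name, the dictionary layout, and the sample values are all hypothetical; the sketch only assumes that corpus-level metadata should use 'languages' and document-level metadata should use 'language'.

```python
from typing import Dict, List


def find_key_mismatches(
    metadata_by_name: Dict[str, Dict[str, str]],
    expected_key: str,
    wrong_key: str,
) -> List[str]:
    """List entries that lack expected_key or that carry wrong_key."""
    return sorted(
        name
        for name, meta in metadata_by_name.items()
        if expected_key not in meta or wrong_key in meta
    )


# Hypothetical corpus-level metadata; the values are placeholders.
corpora = {
    "fox": {"language": "placeholder"},
    "mark": {"languages": "placeholder"},
}
print(find_key_mismatches(corpora, expected_key="languages", wrong_key="language"))
```

The same helper can be run over document-level metadata with the two key names swapped.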

Let me know your thoughts about what to do and I can try to apply it.

ctschroeder · 2019-09-28T14:55:49Z

Beth did some digging into this a year ago. I don’t remember the logic, but we went with language for doc/languages for corpus. It gets mangled in corpus metadata bc that can’t be validated.

ctschroeder · 2019-09-28T15:01:26Z

Also thanks so much for all of this! I will be offline almost all day. I think I’ve done everything I can (except for those additional 3 paths texts). Please ping me if you need anything and I’ll check in tonight. Take care.

amir-zeldes · 2019-09-28T17:16:00Z

Sounds good! Which 3 texts though? I think there's Longinus and Phib, which have chapter numbers from PATHS (p_n), and Aphou and Paul, which have unnumbered paragraphs that we made (just p)

ctschroeder · 2019-09-28T17:21:09Z

Greetings from the airport. see the corpora checked/not checked at the top of the thread. L&L should be the only one checked/ready. That checklist at the top should be the final list. I’ve done everything I can on all the docs except the 3 unchecked PATHS texts. Leave those three alone. Everything else is either ready or has items to check off that only you can do. Good luck.

ctschroeder · 2019-12-12T02:12:06Z

Closing. Info in #40.

ctschroeder added publish 2019 goals corpus annotation labels Jun 11, 2019

ctschroeder added this to the Fall2019 milestone Jun 11, 2019

ctschroeder assigned amir-zeldes and ctschroeder Jun 11, 2019

amir-zeldes unassigned amir-zeldes and ctschroeder Aug 22, 2019

ctschroeder assigned ctschroeder and unassigned ctschroeder Aug 22, 2019

amir-zeldes assigned amir-zeldes, ctschroeder and lancealanmartin Aug 22, 2019

ctschroeder mentioned this issue Oct 15, 2019

ISYE has no corpus metadata #38

Closed

5 tasks

ctschroeder closed this as completed Dec 12, 2019
