Fall 2019 Publication Thread #27

Closed
52 of 62 tasks
ctschroeder opened this issue Jun 11, 2019 · 51 comments

Comments

@ctschroeder
Member

ctschroeder commented Jun 11, 2019

Timeline:

  • hard list of documents determined by September 4
  • review beginning Sept 15

Version 3.0, version date 2019-09-30

List of materials as of 16 September.

PATHS texts (see #26 @ctschroeder )

  • Phib

  • Aphou

  • Paul of Tamma

  • L&L

  • possibly Treebank corpus (UD release is in Nov, ANNIS release this fall). Please check corpus AND document annotation for treebank annotator before ticking off each box. CTS is checking other corpus metadata fields.

    • AP: 1-6, 18-19, 23-32, 114-139
      • document level metadata needs to be checked for treebank annotators for all docs listed here; see comment below -- I don't know who of the treebankers worked on which document(s)
      • version number/date need to be added to listed docs
      • docs not listed need to be checked to see if there are commits since last version; if so update version number/date and commit
      • segmentation, tagging, parsing on the above docs are not all labeled "gold." These need to be updated for each AP doc listed.
    • Abraham: XL 93-94, YA 518-20
    • NBFB: XH 204-216
    • A22: YA 421-28
    • Mark: 1-9
      • document level metadata needs to be checked for treebank annotators for all docs 1-16. see comment below -- I don't know who of those treebankers worked on which document(s)
    • 1 Cor: 1-6
      • check corpus metadata
      • document level metadata needs to be checked for treebank annotators for all docs 1-16. @amir-zeldes made commits on all docs but is not listed in annotation; also see comment below -- I don't know who of those treebankers worked on which document(s)
    • Besa 1,2,13,15,25
    • Life of Cyrus 23-27 (also needs title change)
    • Life of Onnophrius: 1-7 (also needs title change)
    • Victor: part 1
    • Pseudo-Ephrem letter
  • Marcion

    • Onnophrius
      • needs next/previous, order
      • visualization review
      • corpus metadata (gitdox crashed Wed--try again later)
    • Cyrus
      • needs next/previous, order
      • visualization review
    • Ephrem (epistle)
      • visualization review
      • corpus metadata
    • Repose (dormition) of John
      • visualization review
    • Proclus (2 discourses)
      • needs next/previous, order
      • visualization review
      • corpus metadata
    • Victor (2 sections) [GitDox crashing -- docs need to be committed!]
      • visualization review
      • needs next/previous, order
      • corpus metadata
  • more Johannes (@ctschroeder; @eplatte is reviewing)
    • visualization review (done)
    • corpus metadata (done)

  • add copyist metadatum where relevant

  • check "Correct the lemmas of the suffix conjug verboid, ounte" (#31) for corpora listed above

  • check "Morph Errors" (#30) for corpora listed above

  • complete this checklist https://github.com/CopticScriptorium/budge-dev/issues/1

  • complete this checklist https://github.com/CopticScriptorium/budge-dev/issues/2

Later in Fall

@ctschroeder
Member Author

I would like to add "copyist" or "scribe" to the metadata for the relevant texts in Marcion copied by Victor son of Mercurius (Onnophrius, Cyrus, possibly others). See Layton's catalog https://www.dropbox.com/s/s7gdapyphgpc3mb/pLondCopt%20II%20%28Layton%29.pdf?dl=0. Everyone, please let me know ASAP if you have any objections.

@cluckmarq
Member

cluckmarq commented Sep 16, 2019 via email

@ctschroeder
Member Author

Also the list of materials for publication is now set. I am going to check to see if someone can review my doc from Johannes. Can everyone working on docs for this round of publication be sure that the docs are appropriately tagged in GitDox as "review"?
@amir-zeldes the treebank corpora don't need review (except the Cyrus and Onnophrius docs, which need metadata revisions). Can you or @lancealanmartin please check that all the treebank docs except Cyrus and Onnophrius have the correct version_n and version_date and are marked "review"? Thanks!! Once that is done, they can be labeled "to publish" in GitDox and checked off here on the top of the thread. If you can't do it I'll get to it later this week or next week -- just let me know the scoop. Thank you so much!!!

@amir-zeldes
Member

Adding scribe sounds fine to me (sounds better than copyist for my ears, but I'm fine with either)

I have reviewed all of the re-release documents (Shenoute, AP, Besa) and made sure the version is 3.0.0 + dated if they have been edited since last release, so those should be all good. The new Budge materials are either recent additions to the treebank (onno1, cyrus1, ephraim, repose), or they have been checked by either Lance or me, so I think they are OK as 'checked' and only need metadata review, no linguistic review necessary at this point.

Now assigned for (metadata) review to @ctschroeder :

  • ephraim
  • cyrus
  • onnophrius
  • repose

The following are assigned to others for sentences/translation, but also need metadata review:

  • proclus
  • victor2

Thanks!

@ctschroeder
Member Author

thx so much @amir-zeldes! I will deal with Proclus and Victor when others are done with them. Whoever's working on them can assign them to me when done.

@ctschroeder
Member Author

Greetings @amir-zeldes @lancealanmartin @eplatte @bkrawiec @cluckmarq. I'm working on URNs for the Marcion material. Part of the CTS URN is the "text group" and part is the "edition." A few questions have arisen. Apologies for the long post! Replies requested by the end of the week if humanly possible. This comment contains a fair bit of info in regular text with precise questions in bold.

This is PART 1. There may be a PART 2 as I work through the other texts.

For texts with known, identified authors, the "group" is the author. So urn:cts:copticLit:besa.aphthonia.monbba refers to the text "Letter to Aphthonia" in the text group "Writings of Besa" in the edition manuscript MONB.BA. For the material edited by Budge in Marcion we have two questions: What "text group" to designate and what "edition"?

  1. Edition (the simplest question): for the Martyrdom of Victor we used Budge as the edition. I suggest we do the same with the rest of the Marcion materials from Budge. Please let me know if you have objections to "Budge" as edition in the URN. No need to reply if this is fine.

  2. For the "group" for each text/work we have a few options:

Thank you!! Possibly more tomorrow on other Marcion texts/works.
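
For reference, the URN pattern discussed in this comment can be split mechanically. The following is a minimal sketch, assuming the colon-separated fields and the dotted textgroup.work.edition layout of the example urn:cts:copticLit:besa.aphthonia.monbba. The names `CtsUrn` and `parse_cts_urn` are hypothetical and are not part of any project tooling mentioned in this thread.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    """Fields of a CTS URN as described in the comment above (hypothetical helper)."""
    namespace: str
    text_group: str
    work: Optional[str]
    edition: Optional[str]


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a URN such as urn:cts:copticLit:besa.aphthonia.monbba into its parts."""
    parts = urn.split(":")
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace = parts[2]
    # The fourth field is dotted: text group, then work, then edition.
    work_fields = parts[3].split(".")
    if not namespace or not work_fields[0]:
        raise ValueError(f"missing namespace or text group: {urn!r}")
    text_group = work_fields[0]
    work = work_fields[1] if len(work_fields) > 1 else None
    edition = work_fields[2] if len(work_fields) > 2 else None
    return CtsUrn(namespace, text_group, work, edition)
```

With this sketch, `parse_cts_urn("urn:cts:copticLit:besa.aphthonia.monbba")` separates the text group (`besa`), the work (`aphthonia`), and the edition (`monbba`), which are the two pieces under discussion for the Marcion material plus the work itself.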

@eplatte
Member

eplatte commented Sep 17, 2019

I like lives for the text group for Onnophrius and Cyrus, and Ephrem for the spelling and psephrem for the text group for the epistle. Budge also makes sense for the edition.

@amir-zeldes
Member

Agreed on lives, budge and adding a pseudo prefix. For the spelling of Ephraim, I feel like we've been using mostly Latin spellings for some reason (onnophrius with 'u', cyrus with 'cy' and 'u'), so something like ephraem or ephrem seems more consistent than 'ai'. Whatever makes more sense as the 'Latin' form I would say.

@amir-zeldes
Member

OK, auto sentence spans are now added to paths. Some things to note:

  1. Quality depends on three things:
    • How good the NLP did/how badly segmented the original was (good: e.g. Aphou, less good, e.g. Longinus)
    • Whether or not there's punctuation (Phib is the best: good NLP, punctuation; Aphou is not as good - no punctuation)
    • Luck (really, coincidental similarity to the limited training data)
  2. When quality is bad, and especially if there's no punctuation at all, there are sometimes super-long sentences. I manually broke up 3-4 instances where 'sentence' length was >400 words. This was mainly a problem in Paul of Tamma (no punctuation, and for some reason the sentences went very long stretches without breaking)
  3. I should note the sentencer is biased towards caution: it prefers to abstain when things look murky, and the upshot is it makes fewer truly crazy splits.

This all means we can now have the analytic vis for Paths. Note that because we do not have chapters, and the p tags (which seem fairly random) do not coincide with auto-sentences, we do not have a verses view for this data at the moment.
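
The manual check in item 2 (breaking up 'sentences' of more than 400 words) could be pre-screened automatically. This is only a sketch under stated assumptions: sentences are taken to be lists of word tokens, and `overlong_sentences` is a hypothetical helper, not the sentencer or any tool actually used here.

```python
from typing import List, Tuple

MAX_WORDS = 400  # threshold mentioned in item 2 above


def overlong_sentences(
    sentences: List[List[str]], max_words: int = MAX_WORDS
) -> List[Tuple[int, int]]:
    """Return (sentence index, word count) for every sentence longer than max_words."""
    return [
        (index, len(words))
        for index, words in enumerate(sentences)
        if len(words) > max_words
    ]
```

The flagged indices would still need a human decision on where to split, as was done by hand for Paul of Tamma.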

@ctschroeder
Member Author

@amir-zeldes I'll take a look this week at the chapters in PATHS. It sounds like other than that and the metadata, they are done? We are talking about Paul of Tamma, Phib, Aphou, Longinus and Luke (or no Longinus and Luke -- https://github.com/paths-erc/coptic-texts/blob/master/cc0418.xml). Thanks.

@amir-zeldes
Member

Yes, since we're releasing this as auto NLP they are basically done. If you want to do chapters let me know, but time is getting short - if so, they should properly nest 'translation' so we can do the blockified (non-numbered) verses view. Thanks!

And I think it is Longinus and Luke, the TEI header there is incorrectly copy-pasted from another file, right?

@ctschroeder
Member Author

Yes re Longinus and Luke.

Re chapters: part of the issue is that the document URN usually includes the chapters, but we can skip that and just use the edition namespace as the end. I am wondering if the edition should be "CMCL" since it's taken from Tito Orlandi's editions (see, for example, this one referenced in the PATHS header for Paul of Tamma): http://www.cmcl.it/~cmcl/paolotamma1.PDF

@ctschroeder
Member Author

or should the edition be "paths"? I think this is the best strategy, actually. Something like urn:cts:copticLit:lives.pauloftamma.cmcl or urn:cts:copticLit:lives.pauloftamma.paths

@amir-zeldes
Member

I also think it should be paths, since it includes paths annotations (e.g. their entity schema) and we don't actually know what processing steps happened between CMCL and their version. Saying it's paths is the simplest statement, and Paths's provenance from CMCL is something that should be described by Paths IMO

@ctschroeder
Member Author

Bingo

@lancealanmartin

I can add PATHS as the edition. What should the collection be?

@ctschroeder
Member Author

Hello, @amir-zeldes. Johannes.canons is ready for viz check (any documents with to_publish or review status). Beth is reviewing the doc needing review. Thanks so much!

@amir-zeldes
Member

version_date (and _n) has a validation, so that should get automatically flagged if someone used the wrong format.
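
As an illustration of what such a format check might look like, here is a minimal sketch. It assumes the formats seen elsewhere in this thread, an ISO-style date like 2019-09-30 and a three-part version like 3.0.0. The actual validation rules configured in GitDox are not shown in this thread, so both patterns and both function names are assumptions.

```python
import re
from datetime import datetime

# Assumed format: three dot-separated numbers, e.g. 3.0.0
VERSION_N_PATTERN = re.compile(r"\d+\.\d+\.\d+")


def valid_version_n(value: str) -> bool:
    """True if value looks like a three-part version number."""
    return VERSION_N_PATTERN.fullmatch(value) is not None


def valid_version_date(value: str) -> bool:
    """True if value is a real calendar date written as zero-padded YYYY-MM-DD."""
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    # strptime tolerates unpadded fields such as 2019-9-30, so round-trip to reject them.
    return parsed.strftime("%Y-%m-%d") == value
```

A check of this kind would flag values such as `30/09/2019` or `3.0`, while leaving it to the editors to confirm that the date and number are the right ones for the release.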

According to a SQL query on the database, there are now no longer any documents with 'Liz', so that should be fine, but yes let's remember to always do full names!

Treebanking info:

  • AOF - Amir
  • A22 - Liz & Amir
  • Mark - Mitchell, Lance & Amir
  • 1Cor - Mitchell & Amir
  • Cyrus - Lance & Amir
  • Onnophrius, Ephrem - Amir
  • AP - Liz & Amir

Of these, everything was already in corpus metadata, except the only missing one I found was 1Cor, which had no corpus metadata. I copied it over from Mark and added all of the treebankers + Carrie, but I'm not sure who else has added 1Cor without treebanking (that's just who I'm seeing in the documents). Feel free to add if you know someone else!

@ctschroeder
Member Author

Thank you Amir! (I don't believe corpus metadata errors crop up in validation.) I will check 1 Cor annotators.

@lancealanmartin

I did entity annotation for the first three chapters of both 1 Cor and Mark as well as shenoute.fox. Should I add my name to these docs?

@ctschroeder
Member Author

Yes @lancealanmartin please add your name to any document you edited, and then also add it to the corpus metadatum for annotation. Giving full credit to everyone is a major principle of ours!! Most documents have the primary annotator first, subsequent annotators in the middle, and the senior editor(s) who reviewed the document (usually Amir or me, sometimes Beth) as the last name.

@amir-zeldes
Member

I have no issues with adding Lance to those documents, as entity annotations will one day be released, but just to clarify, those entity annotations are not currently available in the online corpora.

As for annotator order: I'm embarrassed to say I seem to have had this wrong. I think anything where I added the names I did alphabetically by last name... Since Carrie and I are alphabetically relatively high, this may often match the pattern Carrie is mentioning, but anything I added annotation/translation to is probably just alphabetic. Also, in the repo interface, these things get split up and are findable separately no matter the order they are listed in inside the field.

@ctschroeder
Member Author

No worries. I think order is primarily a big deal for manually edited documents rather than the automated ones, and especially for junior folks; I try to keep an eye out for this during publication.

@ctschroeder
Member Author

@amir-zeldes the Marcion corpora are ready and should be frozen. Marcion corpora that are also in the gold treebank corpora will need metadata updated for the treebank files. TY!

@ctschroeder
Member Author

Hi @amir-zeldes I'm almost done with the johannes corpus -- checking visualizations, and I noticed that the new document is not in ANNIS. I see that there are 8 docs in the private instance and in the public one. I checked and FA215-224 is missing from the private instance. Thank you!

@amir-zeldes
Member

Got it. Try again now

@ctschroeder
Member Author

Oh goodness that was a doozy. I think due to the page layer being labeled pb_n instead of pb_xml_id. I hope that fixed it.

Also I am really sick (v sore throat) and so while Johannes is done the rest will have to wait for tomorrow.

@amir-zeldes
Member

Oh no, it's been going around here too. Feel better!

New version with the fix is already online.

@ctschroeder
Member Author

Johannes is good to go!

@amir-zeldes
Member

Thanks - right now TEI is not validating due to having chapter_n but not verse_n. We could revert it to 'p' mode, without chapters, but is there a reason the verses are 'ignore:'ed?

@ctschroeder
Member Author

ctschroeder commented Sep 27, 2019 via email

@amir-zeldes
Member

The decision is per corpus, so we can either switch off verse numbers in 'verses' for all documents, or I'm happy to add consecutive numbers to verses in each chapter myself if that would solve it. Also, if only one document doesn't have verses, its TEI would have to look different from other documents in the corpus. Just give me your OK and I will add verse nums (they're mostly already there, I can easily finish).

Feel better!
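
The "consecutive numbers to verses in each chapter" option amounts to a counter that restarts at every chapter boundary. A minimal sketch, assuming the document can be read as an ordered list of (chapter_n, verse text) pairs; that input shape and the name `number_verses` are illustrative assumptions, not the GitDox or TEI export format.

```python
from typing import List, Optional, Tuple


def number_verses(units: List[Tuple[str, str]]) -> List[Tuple[str, int, str]]:
    """Assign verse_n = 1, 2, 3, ... within each chapter, restarting when chapter_n changes."""
    numbered: List[Tuple[str, int, str]] = []
    current_chapter: Optional[str] = None
    verse_n = 0
    for chapter_n, text in units:
        if chapter_n != current_chapter:
            # New chapter: restart the verse counter.
            current_chapter = chapter_n
            verse_n = 0
        verse_n += 1
        numbered.append((chapter_n, verse_n, text))
    return numbered
```

Existing verse numbers that are "mostly already there" would need to be compared against this output rather than overwritten blindly.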

@ctschroeder
Member Author

ctschroeder commented Sep 27, 2019 via email

@ctschroeder
Member Author

@amir-zeldes it looks like we messed up the language/languages consistency in corpus metadata again. Is there an easy fix, or should I go back through all of them and check manually?

@ctschroeder
Member Author

@amir-zeldes sorry to bother you again but it appears the treebank annotators have not been added to document metadata in all the items. I'm noticing this in Mark. You've listed treebankers by corpus above, but I don't know which docs belong to whom. Can you please check the document level metadata to be sure the treebankers have been added? #27 (comment) Thank you!

@ctschroeder
Member Author

(This may mean the corpora we thought should be frozen need to be fixed. I assumed the treebank folks had been added to doc level annotation.)

@amir-zeldes
Member

OK, I will look into these tomorrow

@ctschroeder
Member Author

A few final things for this evening:

  1. I'm noticing some corpora that are not ready are on public ANNIS. I'm guessing they are supposed to be behind the password and there was some glitch? At any rate, can they be removed right away? They are: AP, life of L&L, life of Phib, Mark (see below), 1 Cor?

  2. red alert! unfreezing:

  • there was a problem with Cyrus (now fixed and ready to be reprocessed for publication)

  • Mark (see top post)

  3. I checked the other Treebanked docs to see if the treebankers were in the doc level metadata; for 1 Cor and AP I couldn't tell because there were many docs edited and multiple treebankers. Again, see top post (Fall 2019 Publication Thread #27 (comment)).

  4. I could not commit part 1 of Longinus & Lucius. No clue why not. It gave me a GitHub error. Can you please commit part 1? Then it will be ready to publish.

@amir-zeldes
Member

  1. ANNIS is a glitch from concurrent ANNIS4 security manager (you are actually seeing what's in ANNIS4 right now, including some non-ready tests). Now reverted, sorry about that.
  2. OK, Cyrus is reconverted and Mark + 1Cor document annotators are checked for the treebanked parts (Chap. 1-6 in both)
  3. 1 Cor and AP are good to go (were already correct)
  4. It seems that with the added annotations, it is now too big to commit via the API... I've committed it manually for now, but I'm opening an issue here: "Large serialized files cannot be committed via GitHub API" (gucorpling/gitdox#155)

PS - oh, weird, now that I've manually committed, I can actually commit small changes to Longinus, presumably because the diff is small(?)..

@amir-zeldes
Member

RE language/languages:

  • All individual documents have 'language'
  • All corpora have 'languages' except:
    • fox
    • dormition
    • pseudo-ephrem

Was it intentional for corpora to have 'languages' to differentiate from the document level metadatum? In ANNIS, metadata queries just 'apply', so if the two fields conflict and are called the same, it's possible exact meta-based searches will actually yield zero results for these.

Let me know your thoughts about what to do and I can try to apply it.
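
A consistency check like the one summarized above can be scripted once the corpus metadata is available as key/value pairs. The sketch below assumes corpus metadata has been loaded into a plain dictionary per corpus; how it is exported from GitDox or ANNIS is not specified here, and `corpora_missing_key` is a hypothetical helper.

```python
from typing import Dict, List


def corpora_missing_key(
    corpus_meta: Dict[str, Dict[str, str]], key: str = "languages"
) -> List[str]:
    """Return the names of corpora whose metadata lacks the given key, sorted by name."""
    return sorted(name for name, meta in corpus_meta.items() if key not in meta)
```

Running the same function with `key="language"` over document-level metadata would cover the other half of the convention (documents carry 'language', corpora carry 'languages').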

@ctschroeder
Member Author

ctschroeder commented Sep 28, 2019 via email

@ctschroeder
Member Author

Also thanks so much for all of this! I will be offline almost all day. I think I’ve done everything I can (except for those additional 3 paths texts). Please ping me if you need anything and I’ll check in tonight. Take care.

@amir-zeldes
Member

Sounds good! Which 3 texts though? I think there's Longinus and Phib, which have chapter numbers from PATHS (p_n), and Aphou and Paul, which have unnumbered paragraphs that we made (just p)

@ctschroeder
Member Author

ctschroeder commented Sep 28, 2019 via email

@ctschroeder
Member Author

Closing. Info in #40.
