
As an admin, I want transcription content synchronized from annotation storage to a GitHub repository so that the content is backed up, versioned, and available for use in generating a text corpus. #912

Closed · 11 tasks done
rlskoeser opened this issue Jun 9, 2022 · 14 comments

rlskoeser commented Jun 9, 2022

testing notes

I manually ran a bulk annotation export so you could review the basic functionality.

Go to the test repository I'm using for the backup:
https://github.com/Princeton-CDH/test-pgp-annotations

  • confirm there is content in the annotations directory for documents with transcriptions (this content is in W3C annotation format; no need to understand the structure, but a minimal example follows this list)
  • confirm that there are simple HTML files with transcription content in transcriptions/html
  • confirm that there are text files with transcription content in transcriptions/txt — these should only contain labels and actual transcription content
  • confirm that HTML and text files include the PGPID and the lowercase names of transcription authors / editors
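
For reference, a minimal W3C Web Annotation looks roughly like this (illustrative only; the real exported annotations carry more fields, and their targets point at actual IIIF canvases rather than the placeholder URL here):

    {
      "@context": "http://www.w3.org/ns/anno.jsonld",
      "type": "Annotation",
      "body": {
        "type": "TextualBody",
        "value": "<p>transcription text</p>",
        "format": "text/html"
      },
      "target": "https://example.com/iiif/canvas/1"
    }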

I'm thinking of this backup as primarily machine readable rather than something someone would browse on GitHub, but I'm open to suggestions or requests for revising how it's structured.

dev notes

Should probably be able to adapt some logic from the existing SimpleAnnotationServer script for downloading annotation lists by canvas: https://github.com/glenrobson/SimpleAnnotationServer/blob/master/scripts/downloadAnnotationListsByCanvas.py

There may also be some overlapping logic/functionality with the PEMM scripts for syncing CSV data to GitHub.

Some of the logic needs to be reusable so we can create the initial GitHub repo in tandem with the TEI-to-annotation conversion.

Should include co-author information, but don't implement that in the first go.

  • method to save an annotation list for a single canvas + document (adapt from annotations view and reuse)
  • configure annotation backup directory
  • method to generate path / filename for annotation on disk
  • method to generate transcription html/text filenames: pgpid###_authorslug_transcription.txt or .html (filename logic sketched below)
  • method to output full transcription content as html file for document + source
  • method to output full transcription content as plain text file for document + source
  • add logic to save and push to GitHub (adapt from PEMM)
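
A minimal sketch of the filename logic from the list above, in Python (author_slug is a stand-in; the real implementation might reuse an existing slugify helper):

    import re

    def author_slug(name):
        # lowercase the author name; collapse non-alphanumeric runs to hyphens
        return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")

    def transcription_filename(pgpid, author, ext):
        # pattern from the task list: pgpid###_authorslug_transcription.txt or .html
        return f"pgpid{pgpid}_{author_slug(author)}_transcription.{ext}"

    # transcription_filename(1234, "Ben Johnston", "txt")
    #   -> "pgpid1234_ben-johnston_transcription.txt"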
@rlskoeser

@mrustow @kseniaryzhova I'm working on the logic to back up and version our transcription annotation content to GitHub. We'd discussed previously that we want to track who is making the edits, and we also discussed preserving the contributors to the Bitbucket version when we migrate. I have a plan for how to do this, but it does make the TEI-to-annotation migration more complicated, so I wanted to check in and make sure it's worth it. (More complicated because as we migrate each file we need to determine the contributors for that file, then generate the corresponding annotation/transcription backup files and commit them with the appropriate co-author list. But it is feasible.)

I used git log to get a list of all unique usernames for everyone who's contributed to the TEI Bitbucket repo. I think we can probably get GitHub co-author emails for almost all of these folks; do you agree?

Alan Elbaum
Ben
Ben Johnston
Brendan Goldman
Jessica Parker
Ksenia Ryzhova
Rachel Richman
Rebecca Sutton Koeser
benj@princeton.edu
benjohnsto
mrustow
rlskoeser

Please let me know whether or not you want me to preserve this contributor information in the migration.


documenting how I got the list: git log --pretty=format:'%an' | sort | uniq
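
For reference, GitHub's co-author attribution works via Co-authored-by trailers at the end of a commit message, one line per contributor (the names and emails below are placeholders):

    Export transcription updates for PGPID 1234

    Co-authored-by: Ben Johnston <ben@example.com>
    Co-authored-by: Ksenia Ryzhova <ksenia@example.com>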


mrustow commented Sep 16, 2022

I think it’s important to preserve all contributions to the editions at the document level if at all possible — seems like the best-practices thing to do. (We can of course merge the four Ben Johnstons and two RSKs.) Not preserving them would entail too much data loss. Sorry about the complexities this entails.


rlskoeser commented Sep 16, 2022

@mrustow thanks for confirming! I just wanted to make sure before I implemented it. I think it's valuable but wanted to check.

I'll write separate user stories next week for tracking historic and ongoing user contributions, so we can test the basic version and then work on this as a second round / enhancement.

@rlskoeser rlskoeser added this to the CDH/PGP end of grant year 2 milestone Sep 19, 2022
@rlskoeser

upgrading to 8 points for complexity

@rlskoeser rlskoeser added the "awaiting testing" label (implemented and ready to be tested) Sep 22, 2022
@kseniaryzhova

@rlskoeser I keep getting this message when I open up those folders - is this an issue? I don't have problems accessing the files though, at least the ones I can see.
[screenshot of GitHub's message that the directory listing is truncated]

@rlskoeser

@kseniaryzhova that's part of what I meant about the structure being machine readable vs. human readable — GitHub won't let you browse all the content in the current structure. Is that something that would be valuable? Probably, yes? How important? It would be helpful to have @mrustow take a look and weigh in on the file structure here.

If so, I'll need to revise the file structure so that we don't have so many files in a single directory while keeping it human-navigable. This will be even more important once you import all the HTR at some point down the road.

If we need to do that, my inclination is to use pairtree based on PGPIDs. This is a directory structure I've used before; it's often used in digital libraries, and it generates a nested directory structure from an identifier, e.g.:

   abcd      -> ab/cd/
   abcdefg   -> ab/cd/ef/g/
   12-986xy4 -> 12/-9/86/xy/4/
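
A minimal sketch of that splitting in Python, assuming plain 2-character chunks (a real implementation, e.g. the pairtree package on PyPI, also handles identifier encoding):

    def pairtree_path(identifier, chunk=2):
        # one directory level per fixed-size chunk of the identifier
        parts = [identifier[i:i + chunk] for i in range(0, len(identifier), chunk)]
        return "/".join(parts) + "/"

    # pairtree_path("abcdefg") -> "ab/cd/ef/g/"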

Do you think you'd be comfortable finding records by PGPID in this kind of structure?

@mrustow

mrustow commented Sep 23, 2022 via email

@rlskoeser

yes, sorry (that example was pulled from the spec — I can see how it's not very helpful)

Current highest PGPID is 36241; if we pad with zeros or dashes to make everything 6 digits, then to find PGPID 36241 you would have to navigate directories something like -3/62/ to find its transcription content. Or if we just take the first four digits of each PGPID (however many digits it has), it would look something like 36/24/
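
To make the two padding options concrete (a quick sketch; the directory splits assume the 2-character pairtree chunks from above):

    >>> f"{36241:06d}"            # zero-padded -> 03/62/41/
    '036241'
    >>> str(36241).rjust(6, "-")  # dash-padded -> -3/62/41/
    '-36241'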

If we go with this structure, I think we should put html and text files in the same directory (right now I have them split out). Would you want transcription and translation files in the same directory too? Then you could navigate to one place to find all available human-readable (ish) exports for a single document.

Maybe I should do a quick manual conversion of the existing backup content into this structure so you can try it out on github and see what it's like. Helpful?

@mrustow

mrustow commented Sep 23, 2022 via email

@rlskoeser

@mrustow @kseniaryzhova this branch has a demo of transcription html & text files in pairtree structure: https://github.com/Princeton-CDH/test-pgp-annotations/tree/pairtree-demo

It does require the nested "pairtree_root" folder, which I'd forgotten about; but we could put shortcut links in the README.

Thinking about it more, we probably need to go with something like this to make sure it will scale for when you add all the HTR transcriptions. I know pairtree will work at that scale.

@mrustow

mrustow commented Sep 23, 2022 via email

@rlskoeser

ok, so files for PGPID 1018 would be under 1000/1018/ ?

And what about longer IDs: would PGPID 36238 be under 36000/36238/ ?

That sounds reasonable enough; it shouldn't be too hard to generate the paths, it will make the content easier to find and browse on GitHub (possible to browse at all, really), and it should still be a usable enough structure for anyone who wants to do computational work with the text as a corpus (or for us to generate a corpus from the text content).
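
A quick sketch of the path logic for that structure, in Python (pgpid_dir is a hypothetical helper name):

    def pgpid_dir(pgpid):
        # group documents into directories by thousands: 1018 -> 1000/1018/
        return f"{(pgpid // 1000) * 1000}/{pgpid}/"

    # pgpid_dir(1018)  -> "1000/1018/"
    # pgpid_dir(36238) -> "36000/36238/"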

@rlskoeser

@kseniaryzhova I made a new issue for the work to revise the directory structure, so when you're comfortable that the basic export functionality is working you can close this issue.

@kseniaryzhova

@rlskoeser sounds good - I was keeping it open so Marina had a chance to make a decision, and it looks like that happened so I'm all set here!
