
As an admin, I want transcription content synchronized from annotation storage to a GitHub repository so that the content is backed up, versioned, and available for use in generating a text corpus. #912

Closed · 11 tasks done
rlskoeser opened this issue Jun 9, 2022 · 14 comments

rlskoeser commented Jun 9, 2022

testing notes

I manually ran a bulk annotation export so you could review the basic functionality.

Go to the test repository I'm using for the backup:
https://github.com/Princeton-CDH/test-pgp-annotations

  • confirm there is content in the annotations directory for documents with transcriptions (this content is in W3C annotation format; no need to understand the structure, but a minimal example follows this list)
  • confirm that there are simple HTML files with transcription content in transcriptions/html
  • confirm that there are text files with transcription content in transcriptions/txt — these should only contain labels and actual transcription content
  • confirm that HTML and text files include the PGPID and the lowercase names of transcription authors / editors
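
For reference, a minimal W3C Web Annotation looks roughly like this (illustrative only; the real exported annotations carry more fields, and their targets point at actual IIIF canvases rather than the placeholder URL here):

    {
      "@context": "http://www.w3.org/ns/anno.jsonld",
      "type": "Annotation",
      "body": {
        "type": "TextualBody",
        "value": "<p>transcription text</p>",
        "format": "text/html"
      },
      "target": "https://example.com/iiif/canvas/1"
    }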

I'm thinking of this backup as primarily machine readable rather than something someone would browse on GitHub, but I'm open to suggestions or requests for revising how it's structured.

dev notes

Should probably be able to adapt some logic from the existing SimpleAnnotationServer script for downloading annotation lists by canvas: https://github.com/glenrobson/SimpleAnnotationServer/blob/master/scripts/downloadAnnotationListsByCanvas.py

There may also be some overlapping logic/functionality with the PEMM scripts for syncing CSV data to GitHub.

Some of the logic needs to be reusable so we can create the initial GitHub repo in tandem with the TEI-to-annotation conversion.

Should include co-author information, but don't implement that in the first go.

  • method to save an annotation list for a single canvas + document (adapt from annotations view and reuse)
  • configure annotation backup directory
  • method to generate path / filename for annotation on disk
  • method to generate transcription html/text filenames: pgpid###_authorslug_transcription.txt or .html (filename logic sketched below)
  • method to output full transcription content as html file for document + source
  • method to output full transcription content as plain text file for document + source
  • add logic to save and push to GitHub (adapt from PEMM)
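
A minimal sketch of the filename logic from the list above, in Python (author_slug is a stand-in; the real implementation might reuse an existing slugify helper):

    import re

    def author_slug(name):
        # lowercase the author name; collapse non-alphanumeric runs to hyphens
        return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")

    def transcription_filename(pgpid, author, ext):
        # pattern from the task list: pgpid###_authorslug_transcription.txt or .html
        return f"pgpid{pgpid}_{author_slug(author)}_transcription.{ext}"

    # transcription_filename(1234, "Ben Johnston", "txt")
    #   -> "pgpid1234_ben-johnston_transcription.txt"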
@rlskoeser

@mrustow @kseniaryzhova I'm working on the logic to back up and version our transcription annotation content to GitHub. We'd discussed previously that we want to track who is making the edits, and we also discussed preserving the contributors to the Bitbucket version when we migrate. I have a plan for how to do this, but it does make the TEI-to-annotation migration more complicated, so I wanted to check in and make sure it's worth it. (More complicated because as we migrate each file we need to determine the contributors for that file, then generate the corresponding annotation/transcription backup files and commit them with the appropriate co-author list. But it is feasible.)

I used git log to get a list of all unique usernames for everyone who's contributed to the TEI Bitbucket repo. I think we can probably get GitHub co-author emails for almost all of these folks; do you agree?

Alan Elbaum
Ben
Ben Johnston
Brendan Goldman
Jessica Parker
Ksenia Ryzhova
Rachel Richman
Rebecca Sutton Koeser
benj@princeton.edu
benjohnsto
mrustow
rlskoeser

Please let me know whether or not you want me to preserve this contributor information in the migration.


documenting how I got the list: git log --pretty=format:'%an' | sort | uniq
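
For reference, GitHub's co-author attribution works via Co-authored-by trailers at the end of a commit message, one line per contributor (the names and emails below are placeholders):

    Export transcription updates for PGPID 1234

    Co-authored-by: Ben Johnston <ben@example.com>
    Co-authored-by: Ksenia Ryzhova <ksenia@example.com>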


mrustow commented Sep 16, 2022

I think it’s important to preserve all contributions to the editions at the document level if at all possible — seems like the best-practices thing to do. (We can of course merge the four Ben Johnstons and two RSKs.) Not preserving them would entail too much data loss. Sorry about the complexities this entails.


rlskoeser commented Sep 16, 2022

@mrustow thanks for confirming! I just wanted to make sure before I implemented it. I think it's valuable but wanted to check.

I'll write separate user stories next week for tracking historic and ongoing user contributions, so we can test the basic version and then work on this as a second round / enhancement.

@rlskoeser rlskoeser added this to the CDH/PGP end of grant year 2 milestone Sep 19, 2022
@rlskoeser

upgrading to 8 points for complexity

@rlskoeser rlskoeser added the "awaiting testing" label (implemented and ready to be tested) Sep 22, 2022
@kseniaryzhova

@rlskoeser I keep getting this message when I open up those folders - is this an issue? I don't have problems accessing the files though, at least the ones I can see.
[screenshot of GitHub's message that the directory listing is truncated]

@rlskoeser

@kseniaryzhova that's part of what I meant about the structure being machine readable vs. human readable — GitHub won't let you browse all the content in the current structure. Is that something that would be valuable? Probably, yes? How important? It would be helpful to have @mrustow take a look and weigh in on the file structure here.

If so, I'll need to revise the file structure so that we don't have so many files in a single directory while keeping it human-navigable. This will be even more important once you import all the HTR at some point down the road.

If we need to do that, my inclination is to use pairtree based on PGPIDs. This is a directory structure I've used before; it's often used in digital libraries, and it generates a nested directory structure from an identifier, e.g.:

   abcd      -> ab/cd/
   abcdefg   -> ab/cd/ef/g/
   12-986xy4 -> 12/-9/86/xy/4/
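
A minimal sketch of that splitting in Python, assuming plain 2-character chunks (a real implementation, e.g. the pairtree package on PyPI, also handles identifier encoding):

    def pairtree_path(identifier, chunk=2):
        # one directory level per fixed-size chunk of the identifier
        parts = [identifier[i:i + chunk] for i in range(0, len(identifier), chunk)]
        return "/".join(parts) + "/"

    # pairtree_path("abcdefg") -> "ab/cd/ef/g/"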

Do you think you'd be comfortable finding records by PGPID in this kind of structure?

@mrustow

mrustow commented Sep 23, 2022 via email

@rlskoeser

yes, sorry (that example was pulled from the spec — I can see how it's not very helpful)

Current highest PGPID is 36241; if we pad with zeros or dashes to make everything 6 digits, then to find PGPID 36241 you would have to navigate directories something like -3/62/ to find its transcription content. Or if we just take the first four digits of each PGPID (however many digits it has), it would look something like 36/24/
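
To make the two padding options concrete (a quick sketch; the directory splits assume the 2-character pairtree chunks from above):

    >>> f"{36241:06d}"            # zero-padded -> 03/62/41/
    '036241'
    >>> str(36241).rjust(6, "-")  # dash-padded -> -3/62/41/
    '-36241'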

If we go with this structure, I think we should put html and text files in the same directory (right now I have them split out). Would you want transcription and translation files in the same directory too? Then you could navigate to one place to find all available human-readable (ish) exports for a single document.

Maybe I should do a quick manual conversion of the existing backup content into this structure so you can try it out on github and see what it's like. Helpful?

@mrustow

mrustow commented Sep 23, 2022 via email

@rlskoeser

@mrustow @kseniaryzhova this branch has a demo of transcription html & text files in pairtree structure: https://github.com/Princeton-CDH/test-pgp-annotations/tree/pairtree-demo

It does require the nested "pairtree_root" folder, which I'd forgotten about; but we could put shortcut links in the README.

Thinking about it more, we probably need to go with something like this to make sure it will scale for when you add all the HTR transcriptions. I know pairtree will work at that scale.

@mrustow

mrustow commented Sep 23, 2022 via email

@rlskoeser

ok, so files for PGPID 1018 would be under 1000/1018/ ?

And what about longer IDs: would PGPID 36238 be under 36000/36238/ ?

That sounds reasonable enough; it shouldn't be too hard to generate the paths, it will make the content easier to find and browse on GitHub (possible to browse at all, really), and it should still be a usable enough structure for anyone who wants to do computational work with the text as a corpus (or for us to generate a corpus from the text content).
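
A quick sketch of the path logic for that structure, in Python (pgpid_dir is a hypothetical helper name):

    def pgpid_dir(pgpid):
        # group documents into directories by thousands: 1018 -> 1000/1018/
        return f"{(pgpid // 1000) * 1000}/{pgpid}/"

    # pgpid_dir(1018)  -> "1000/1018/"
    # pgpid_dir(36238) -> "36000/36238/"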

@rlskoeser

@kseniaryzhova I made a new issue for the work to revise the directory structure, so when you're comfortable that the basic export functionality is working you can close this issue.

@kseniaryzhova

@rlskoeser sounds good - I was keeping it open so Marina had a chance to make a decision, and it looks like that happened so I'm all set here!
