As an admin, I want transcription content synchronized from annotation storage to a GitHub repository so that the content is backed up, versioned, and available for use in generating a text corpus. #912
@mrustow @kseniaryzhova I'm working on the logic to back up and version our transcription annotation content to GitHub. We'd discussed previously that we want to track who is making the edits, and we also discussed preserving the contributors to the Bitbucket version when we migrate. I have a plan for how to do this, but it does make the TEI-to-annotation migration more complicated, so I wanted to check in and make sure it's worth it. (More complicated because as we migrate each file we need to determine the contributors for that file, then generate the corresponding annotation/transcription backup files and commit them with the appropriate co-author list. But it is feasible.) Please let me know whether or not you want me to preserve this contributor information in the migration. (I used Alan Elbaum as an example; documenting how I got the list.)
I think it’s important to preserve all contributions to the editions at the document level if at all possible — seems like the best-practices thing to do. (We can of course merge the four Ben Johnstons and two RSKs.) Not preserving them would entail too much data loss. Sorry about the complexities this entails.
@mrustow thanks for confirming! I just wanted to make sure before I implemented it; I think it's valuable but wanted to check. I'll write separate user stories next week for tracking historic and ongoing user contributions, so we can test the basic version first and then work on this as a second round / enhancement.
upgrading to 8 points for complexity
@rlskoeser I keep getting this message when I open up those folders - is this an issue? I don't have problems accessing the files though, at least the ones I can see. |
@kseniaryzhova that's part of what I meant about the structure being machine readable vs. human readable: GitHub won't let you browse all the content in the current structure. Is that something that would be valuable? Probably, yes? How important? It would be helpful to have @mrustow take a look and weigh in on the file structure here. If so, I'll need to revise the file structure so that we don't have so many files in a single directory, but so that it's also human navigable. This will be even more important once you import all the HTR at some point down the road. If we need to do that, my inclination is to use pairtree based on PGPIDs. This is a directory structure I've used before, and it's often used in digital libraries; it can generate a nested directory structure based on an identifier, e.g.:
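To make the idea concrete, here is a minimal sketch of pairtree-style path generation, assuming PGPIDs are zero-padded to six digits (as discussed below); the `pgpid_pairtree` helper is illustrative, not the project's actual implementation:

```python
def pgpid_pairtree(pgpid: int, width: int = 6) -> str:
    """Zero-pad the PGPID, then split it into two-character
    directory segments, pairtree-style."""
    padded = str(pgpid).zfill(width)
    segments = [padded[i:i + 2] for i in range(0, len(padded), 2)]
    return "/".join(segments)

print(pgpid_pairtree(36241))  # -> 03/62/41
print(pgpid_pairtree(1018))   # -> 00/10/18
```

A real pairtree implementation (e.g. the Python `pairtree` package) also adds the `pairtree_root` prefix and handles character escaping; this sketch only shows the segmenting idea.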
Do you think you'd be comfortable finding records by PGPID in this kind of structure?
In theory sounds like a good proposal — can you give a specific rather than theoretical example, as I’m having trouble picturing it in context? Can talk it through IRT if needed between 10 and 11 am today.
yes, sorry (that example was pulled from the spec — I can see how it's not very helpful). The current highest PGPID is 36241; if we pad with zeros or dashes to make everything six digits, then to find PGPID 36241 you would have to navigate a few levels of nested two-character directories. If we go with this structure, I think we should put html and text files in the same directory (right now I have them split out). Would you want transcription and translation files in the same directory too? Then you could navigate to one place to find all available human-readable (ish) exports for a single document. Maybe I should do a quick manual conversion of the existing backup content into this structure so you can try it out on GitHub and see what it's like. Helpful?
Ahh that makes sense — a manual conversion would be really helpful to see. Thanks!
@mrustow @kseniaryzhova this branch has a demo of transcription html & text files in pairtree structure: https://github.com/Princeton-CDH/test-pgp-annotations/tree/pairtree-demo It does require the nested "pairtree_root" folder, which I'd forgotten about; but we could put shortcut links in the README. Thinking about it more, we probably need to go with something like this to make sure it will scale for when you add all the HTR transcriptions. I know pairtree will work at that scale. |
OK — for me, it would be helpful if the first level were named like this:
1000
2000
3000
And the second level by the full numbers, not the last two digits.
ok, so files for PGPID 1018 would be under the 1000 directory. And what about longer ids — where would PGPID 36238 go? That sounds reasonable enough; it shouldn't be too hard to generate the paths, it will make the content easier to find and browse on GitHub (possible to browse on GitHub), and it should still be a usable enough structure for anyone who wants to do computational work with the text as a corpus (or for us to generate a corpus from the text content).
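A sketch of the revised scheme proposed above — thousands bucket at the first level, full PGPID at the second. The function name and exact layout are my assumptions, and how PGPIDs below 1000 would be bucketed is an open question:

```python
def pgpid_bucket_path(pgpid: int) -> str:
    """First-level directory is the thousands bucket (1000, 2000, ...);
    second level is the full PGPID. Ids below 1000 fall in a 0 bucket,
    which would need a decision."""
    bucket = (pgpid // 1000) * 1000
    return f"{bucket}/{pgpid}"

print(pgpid_bucket_path(1018))   # -> 1000/1018
print(pgpid_bucket_path(36238))  # -> 36000/36238
```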
@kseniaryzhova I made a new issue for the work to revise the directory structure, so when you're comfortable that the basic export functionality is working you can close this issue. |
@rlskoeser sounds good - I was keeping it open so Marina had a chance to make a decision, and it looks like that happened so I'm all set here! |
testing notes
I manually ran a bulk annotation export so you could review the basic functionality.
Go to the repository I'm using to test the backup:
https://github.com/Princeton-CDH/test-pgp-annotations
transcriptions/html
transcriptions/txt
— these should only contain labels and actual transcription content
I'm thinking of this backup as primarily machine readable, rather than something someone would browse on GitHub, but I'm open to suggestions or requests for revising how it is structured.
dev notes
Should probably be able to adapt some logic from the existing SimpleAnnotationServer script for downloading annotation lists by canvas: https://github.com/glenrobson/SimpleAnnotationServer/blob/master/scripts/downloadAnnotationListsByCanvas.py
There may also be some overlapping logic/functionality with the PEMM scripts for syncing CSV data to GitHub.
Some of the logic needs to be reusable so we can create the initial GitHub repo in tandem with the TEI-to-annotation conversion.
Should include co-author information, but don't implement that in the first go.
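For the co-author note above, one approach is to generate `Co-authored-by` trailers (GitHub's documented convention for crediting multiple commit authors) in the backup commit messages. This is a sketch; the function name, and the contributor name/email, are placeholders:

```python
def commit_message(summary: str, coauthors: list[tuple[str, str]]) -> str:
    """Build a commit message with one Co-authored-by trailer
    per (name, email) contributor pair."""
    if not coauthors:
        return summary
    trailers = "\n".join(
        f"Co-authored-by: {name} <{email}>" for name, email in coauthors
    )
    # GitHub requires trailers in the last block of the message,
    # separated from the summary by a blank line.
    return f"{summary}\n\n{trailers}"

print(commit_message(
    "Backup transcription for PGPID 1018",
    [("Jane Scholar", "jane@example.com")],
))
```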
pgpid###_authorslug_transcription.txt or .html
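A sketch of the filename convention above; the `goitein` slug is a made-up example, and the helper name is my assumption:

```python
def transcription_filename(pgpid: int, author_slug: str,
                           ext: str = "txt") -> str:
    """Filename following the pgpid###_authorslug_transcription.{txt,html}
    pattern."""
    return f"pgpid{pgpid}_{author_slug}_transcription.{ext}"

print(transcription_filename(1018, "goitein"))
print(transcription_filename(1018, "goitein", "html"))
```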