As a content editor, I want to navigate the transcription export data on GitHub so that I can find exported content by PGPID. #1124

Closed
2 tasks done
rlskoeser opened this issue Sep 23, 2022 · 6 comments

@rlskoeser
Contributor

rlskoeser commented Sep 23, 2022

testing notes

review the revised directory structure generated by a fresh tei migration and export in this branch:
https://github.com/Princeton-CDH/test-pgp-annotations/tree/test-tei-migration2

  • content files are organized based on PGPID in chunks of thousands
  • no warnings from GitHub about truncating directory listing

question: would it be better if the numbered directories were in a labeled directory (parallel to the annotations directory), instead of all at the top level? I wasn't sure what to call it, since I hope it will eventually include translation as well as transcription. Could we call it text or content?


per @mrustow :

OK — for me, it would be helpful if the first level was called like this:

1000
2000
3000

And the second level by the full numbers, not the last two digits.

OK, so files for PGPID 1018 would be under 1000/1018/?

And what about longer IDs: would PGPID 36238 be under 36000/36238/?

That sounds reasonable enough. It shouldn't be too hard to generate the paths, it will make the content easier to find (and possible to browse) on GitHub, and it should still be a usable structure for anyone who wants to do computational work with the text as a corpus (or for us to generate a corpus from the text content).
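For illustration, the path scheme described above could be sketched roughly like this. This is a minimal sketch, not the export code itself: the function name `export_dir` and the `CHUNK_SIZE` constant are hypothetical, and the handling of PGPIDs below 1000 (which land in a `0` directory here) is an assumption the thread does not settle.

```python
from pathlib import PurePosixPath

# Assumption: top-level directories group PGPIDs in chunks of 1000.
CHUNK_SIZE = 1000


def export_dir(pgpid: int, chunk_size: int = CHUNK_SIZE) -> PurePosixPath:
    """Return the relative export directory for a PGPID.

    The first level is the PGPID rounded down to the nearest thousand;
    the second level is the full PGPID (not just its last digits).
    """
    if pgpid < 0:
        raise ValueError("PGPID must be non-negative")
    chunk = (pgpid // chunk_size) * chunk_size
    return PurePosixPath(str(chunk)) / str(pgpid)


print(export_dir(1018))   # 1000/1018
print(export_dir(36238))  # 36000/36238
```

Keeping the full PGPID at the second level means a directory name is meaningful on its own, even when seen outside its parent chunk.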

Originally posted by @rlskoeser in #912 (comment)

@rlskoeser rlskoeser added this to the CDH/PGP end of grant year 2 milestone Sep 23, 2022
@rlskoeser rlskoeser added the 🛠️ chore One-off task or update label Sep 23, 2022
@rlskoeser rlskoeser changed the title from "revise transcription export directory structure" to "As a content editor, I want to navigate the transcription export data on GitHub so that I can find exported content by PGPID." Sep 26, 2022
@rlskoeser rlskoeser self-assigned this Sep 27, 2022
@rlskoeser rlskoeser added the 🗜️ awaiting testing Implemented and ready to be tested label Sep 28, 2022
@kseniaryzhova

@mrustow could you weigh in on Rebecca's question? Do we want the transcriptions to go in their own directory/folder (like annotations)? And if we do, what do we want to call it (knowing this directory will have transcriptions AND translations in the future)?

@mrustow

mrustow commented Sep 28, 2022

I don't understand it :(

@rlskoeser
Contributor Author

@mrustow sorry! context:

Here's the test version of the new export layout, chunked by 1000s: https://github.com/Princeton-CDH/test-pgp-annotations/tree/test-tei-migration2

I have an annotations directory/folder (listed at the bottom) for the annotation format exports (which I don't expect you all to refer to directly), but I put the 1000s directories at the top level, which makes the listing kind of long. Should those go into a directory, and if so, what would you call it? Right now it is the compiled transcription content, but I expect translation content to be backed up in the same way eventually, and I think it should be included in the same location (so you can find by PGPID and then you have text files for both transcription and translation, if available).

@rlskoeser rlskoeser reopened this Sep 28, 2022
@kseniaryzhova

@rlskoeser spoke with @mrustow - no need for a separate directory for transcriptions/translations - keep the organization as-is. But is it possible to get a counter of how many files are in each of the 1000 folders, just so we get a preview of how many files are in each thousand increment?

@rlskoeser
Contributor Author

Can you say more about the preview / counter? where would you like to see this?

We could maybe put something like that in the readme, but it would get out of date unless we recalculated it regularly... (we'd have to note when it was last updated)
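As a rough idea of what recalculating those counts could look like, here is a minimal sketch, assuming the export is checked out locally and that every all-digit top-level directory is one of the 1000s chunks. The function name `count_files_by_chunk` is hypothetical, and nothing like this exists in the export yet.

```python
from pathlib import Path


def count_files_by_chunk(export_root: str) -> dict[str, int]:
    """Count files under each numeric top-level directory of an export.

    Non-numeric directories (such as annotations) are skipped.
    Results are ordered numerically by chunk name.
    """
    counts = {}
    for chunk_dir in Path(export_root).iterdir():
        if chunk_dir.is_dir() and chunk_dir.name.isdigit():
            # Count files at any depth below the chunk directory.
            counts[chunk_dir.name] = sum(
                1 for path in chunk_dir.rglob("*") if path.is_file()
            )
    return dict(sorted(counts.items(), key=lambda item: int(item[0])))
```

The output could be written into the readme along with the date it was generated, so readers can tell how current the numbers are.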

@rlskoeser
Contributor Author

Thanks for reviewing and signing off on the layout. I'm going to close this as accepted, but if you have ideas on where we could provide counts, LMK and I will think about how we might implement.

@rlskoeser rlskoeser removed 🗜️ awaiting testing Implemented and ready to be tested 🛠️ chore One-off task or update labels Oct 4, 2022
3 participants