As an admin, I want TEI transcription content regularly synchronized to the new database so that transcriptions are updated with changes in the current system. #321

Closed
11 tasks done
rlskoeser opened this issue Oct 18, 2021 · 22 comments

@rlskoeser
Contributor

rlskoeser commented Oct 18, 2021

testing notes

  • in the admin site, look at documents with digitized transcriptions associated; confirm that updated transcriptions now display line numbers that match what is in the XML / what is on the PGP v3 site (note that transcriptions for documents in our error cases, i.e., joins and multiple editions, will not have updated content and should continue to display the old version with line numbers)
  • in the admin site, check a few footnotes with digitized transcriptions — confirm there is a log entry documenting that it was updated via script
  • check a few records to confirm that transcriptions are being matched correctly based on filename and pgpid

Until we have a new solution for managing and editing transcriptions, we need to use the existing TEI and make sure the new database is pulling in updates regularly.

dev notes

We need a new management command that can be configured to run as a nightly cron job.

  • add and document settings for path to local copy of TEI git repo

  • clone/pull any changes from the TEI git repo to local copy

  • match TEI to the correct footnote based on PGPID and source note; needs to handle old PGPIDs for transcriptions associated with merged documents. Matching up properly may require some data cleanup in the source notes. (for initial implementation, handle simple cases only!)

  • convert TEI to html that can be used in an IIIF Annotation List; should include labels for any blocks and line numbers in the TEI. Adapt from prototype code https://github.com/Princeton-CDH/geniza/blob/experiment/search/scripts/tei_transcriptions.py

  • update transcription content associated with the footnote

  • add a django admin log entry if the transcription has changed (would be nice to use git log entry details here, but probably not worth the effort, given that this is an interim solution)

  • script should include reporting to help with concerns raised in Numbers of transcriptions aren't populating correctly into the admin site. #295 — include total number of transcription files, documents with transcriptions, number of fragments, and how many joins

  • update document detail admin to handle the new format
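
The TEI-to-HTML conversion step could be sketched roughly as follows. This is a minimal illustration, not the prototype script linked above: it assumes block labels are encoded as TEI `label` elements and lines as `l` elements with an `n` attribute, and the function name `tei_to_html` and the choice of `h3`/`ol` output are hypothetical.

```python
import html
import xml.etree.ElementTree as ET

# Standard TEI namespace; assumes the transcription files declare it.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def tei_to_html(tei_xml: str) -> str:
    """Convert TEI <label> and <l n="..."> elements to a simple HTML fragment.

    Labels become headings; lines become numbered list items that keep
    the line number from the TEI (assumed markup, see note above).
    """
    root = ET.fromstring(tei_xml)
    parts = []
    in_list = False
    for el in root.iter():  # document order
        if el.tag == f"{TEI_NS}label":
            if in_list:
                parts.append("</ol>")
                in_list = False
            text = "".join(el.itertext()).strip()
            parts.append(f"<h3>{html.escape(text)}</h3>")
        elif el.tag == f"{TEI_NS}l":
            if not in_list:
                parts.append("<ol>")
                in_list = True
            text = "".join(el.itertext()).strip()
            n = el.get("n")
            value = f' value="{html.escape(n)}"' if n else ""
            parts.append(f"<li{value}>{html.escape(text)}</li>")
    if in_list:
        parts.append("</ol>")
    return "\n".join(parts)
```

Using the `value` attribute on list items keeps the numbering from the TEI even when it skips or restarts, which is what the line-number check in the testing notes depends on.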

@rlskoeser
Contributor Author

rlskoeser commented Nov 15, 2021

Hello @richmanrachel @mrustow — I'm finally circling back to working on the transcription synchronization we planned for the fall MVP, and I have a number of questions.

Here is the summary output the script is generating with totals for various categories:

Processed 4,557 TEI/XML files; skipped 38 TEI files with no text content.
39 documents not found in database.
290 documents with multiple fragments.
616 documents with multiple editions; 40 multiple editions with content (19 unique documents).
63 documents with no edition.
3,799 documents with one edition.
Updated 3,798 footnotes.
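
For reference, a summary like the one above can be produced from a simple counter. This is a sketch only; the key names (`xml`, `empty_tei`, and so on) are hypothetical and not necessarily what the script uses.

```python
from collections import Counter


def summarize(stats: Counter) -> str:
    """Format script totals as a plain-text report; missing keys count as 0."""
    return "\n".join(
        [
            f"Processed {stats['xml']:,} TEI/XML files; "
            f"skipped {stats['empty_tei']:,} TEI files with no text content.",
            f"{stats['document_not_found']:,} documents not found in database.",
            f"{stats['joins']:,} documents with multiple fragments.",
            f"{stats['multiple_editions']:,} documents with multiple editions; "
            f"{stats['multiple_editions_with_content']:,} multiple editions with content "
            f"({stats['multiple_edition_docs']:,} unique documents).",
            f"{stats['no_edition']:,} documents with no edition.",
            f"{stats['one_edition']:,} documents with one edition.",
            f"Updated {stats['footnote_updated']:,} footnotes.",
        ]
    )
```

In a loop over the TEI files, each case would increment its key (for example `stats["empty_tei"] += 1`) and the report is printed once at the end.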

My approach for now is to synchronize the transcription content from the TEI into the content field in the existing footnote; from the numbers above, you'll see there are some places where that's causing me some trouble. I need help investigating and resolving some of these.

notes

  • no text content: we were aware of these and I don't think it's a problem; just reporting so we have numbers
  • multiple fragments: not a problem, just including for reporting purposes
  • one TEI file (8594.xml) has a translation and no transcription; this is out of scope for my current work, but I wanted to mention because I don't want the content to get lost. We might also want to be thinking about how to handle translations — it seems to me that it would be valuable to include translation text in the public search.

questions

  • a number of documents have more than one footnote that provides an edition; if I filter to the one that has content, then I would preserve the mapping of footnote to edition content that we established on import. Does that seem like a reasonable approach for now, or should I compare source with sourceDesc in TEI? (Note: using current content mapping won't handle all cases, see problems)

problems

  • documents not found: these IDs aren't listed as either current or past ids. When I looked at one of the TEI files and searched on the shelfmark listed, I didn't find any results
  • documents with multiple editions: these are footnotes with document relationship of edition; if I find the one that currently has content, then we preserve the initial mapping
  • there are 40 TEI files / 19 documents where the document has multiple editions with content associated; I suspect these are all merges but have only investigated briefly. Here's one example: https://geniza.cdh.princeton.edu/admin/corpus/document/2493/change/ Two footnotes from the same source which are associated with different parts of the transcription. I'm not sure the best way to resolve this! One option would be to have someone (probably would have to be Alan?) combine the two XML files.
  • In some cases there are multiple footnotes with editions but none with content (don't know how many yet; at least 1). I can try matching source name on the sourceDesc contents in the TEI, but don't know yet how reliable it will be.
  • 63 documents with no edition: not finding an existing footnote to attach the content to. In one example there's no footnote at all: https://geniza.cdh.princeton.edu/admin/corpus/document/2285/change/ In this case the TEI doesn't include a source description; even if it did, I don't think it's structured enough for me to generate a footnote automatically.
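
To make the cases above concrete, here is a rough sketch of the footnote selection logic as described: use the single edition footnote if there is only one, otherwise fall back to the one that already has content. The `Footnote` dataclass is a stand-in for the real model, and the status strings are hypothetical labels for reporting.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Footnote:
    """Stand-in for the footnote model (hypothetical, for illustration)."""

    source: str
    is_edition: bool
    content: str = ""


def select_edition_footnote(
    footnotes: List[Footnote],
) -> Tuple[Optional[Footnote], str]:
    """Pick the footnote to attach transcription content to.

    Returns (footnote, status); footnote is None for cases that need
    manual investigation.
    """
    editions = [fn for fn in footnotes if fn.is_edition]
    if not editions:
        return None, "no_edition"
    if len(editions) == 1:
        return editions[0], "one_edition"
    # Multiple editions: preserve the mapping established on import
    # by choosing the one that currently has content.
    with_content = [fn for fn in editions if fn.content.strip()]
    if len(with_content) == 1:
        return with_content[0], "multiple_editions"
    if with_content:
        return None, "multiple_editions_with_content"
    return None, "multiple_editions_no_content"
```

The last two statuses correspond to the problem cases listed above, which this sketch skips rather than guessing.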

What's the best way to get help investigating these? I can generate some lists; maybe a separate list for each kind of problem?

rlskoeser added the "❓ question" label Nov 15, 2021
@rlskoeser
Contributor Author

A lower priority question, about formatting. The TEI includes a number of rend tags; how important are these? Can we ignore for the fall MVP? Are any of these important?

  • superscripts with numbers; they look like they could be footnote markers, but I don't see any footnotes
  • four documents have rend=col for columns with column breaks
  • 84 documents have a desc description at the beginning of the TEI; ok to ignore? examples: 5293, 5291, 5300, 786
  • I see formatting (italics, bold, center) but I think they are all in descriptions or source descriptions and not in transcription text
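
Counts like these can be gathered by tallying `rend` values across the files. A minimal sketch, assuming `rend` may hold several whitespace-separated tokens; the function name `rend_values` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def rend_values(tei_xml: str) -> Counter:
    """Count each rend token used anywhere in one TEI document."""
    root = ET.fromstring(tei_xml)
    counts = Counter()
    for el in root.iter():
        rend = el.get("rend")
        if rend:
            # a single attribute may carry more than one token
            counts.update(rend.split())
    return counts
```

Summing the per-file counters (`total += rend_values(xml)`) gives an overall inventory, and checking for `"col"` in a file's counter flags the documents with columns.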

@rlskoeser
Contributor Author

Ack, sorry — I have yet another question: I had been planning to have this tei synchronization script generate admin log entries on footnote records when it updates the transcription content, but now am questioning the usefulness of that, since this is an interim solution for transcriptions. Any opinions on whether this is valuable to document or would clutter the database needlessly?

@richmanrachel

@rlskoeser - thanks for the research you did and the good questions. I will set aside most of my meeting with Marina on Wednesday morning to investigate, and put a marker in the agenda to bring up the discussion points as well!

@rlskoeser
Contributor Author

@richmanrachel that sounds great. Should I go ahead and create a list of PGPIDs and xml files for reference/investigation? I should be able to do that sometime today so you'd have it available for your meeting tomorrow.

@richmanrachel

@rlskoeser - that would be amazing. Thanks a million!

@rlskoeser
Contributor Author

rlskoeser commented Nov 16, 2021

Here are some lists. Hope this is helpful for investigating!

Documents with multiple editions

9121: 9121.xml, 5299.xml
4740: 5493.xml, 4740.xml
606: 606.xml, 496.xml
9089: 9090.xml, 9089.xml
5410: 9053.xml, 5410.xml
4585: 4601.xml, 4585.xml
4717: 5432.xml, 4717.xml
4738: 4738.xml, 5545.xml
5552: 5552.xml, 4346.xml
4743: 5553.xml, 4743.xml
9804: 477.xml, 460.xml
1849: 1849.xml, 1850.xml
848: 850.xml, 848.xml
2142: 2142.xml, 2143.xml, 2144.xml, 2145.xml
4495: 4496.xml, 4495.xml
4721: 4721.xml, 5513.xml
591: 591.xml, 9072.xml
2493: 2493.xml, 2496.xml
2691: 2691.xml, 9066.xml
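
A list like this one can be built by resolving each TEI filename to a document through current and old PGPIDs, then grouping. The sketch below assumes filenames are `<pgpid>.xml` and that old PGPIDs are available as a mapping from current id to a list of old ids; both the function name and the data shape are hypothetical.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List, Tuple


def group_files_by_document(
    filenames: Iterable[str], old_pgpids: Dict[int, List[int]]
) -> Tuple[Dict[int, List[str]], List[str]]:
    """Group TEI files by the document they resolve to.

    old_pgpids maps each current document PGPID to its old PGPIDs
    (empty list if the document was never merged).
    Returns (files grouped by current PGPID, files with no match).
    """
    # Map every known id (current or old) to the current document id.
    lookup = {}
    for current, old_ids in old_pgpids.items():
        lookup[current] = current
        for old in old_ids:
            lookup[old] = current

    grouped = defaultdict(list)
    not_found = []
    for name in filenames:
        stem = Path(name).stem
        doc_id = lookup.get(int(stem)) if stem.isdigit() else None
        if doc_id is None:
            not_found.append(name)
        else:
            grouped[doc_id].append(name)
    return dict(grouped), not_found
```

Documents whose group has more than one file are the candidates for this list; the unmatched files correspond to the "not found in database" case.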

Documents with no edition footnote

2855: 2855.xml
2935: 2935.xml
9083: 9083.xml
5297: 5297.xml
7457: 7457.xml
3552: 3552.xml
3422: 3422.xml
2926: 2926.xml
3999: 3999.xml
1156: 1156.xml
3557: 3557.xml
4204: 4204.xml
4170: 4170.xml
2460: 2460.xml
3554: 3554.xml
2201: 2201.xml
3530: 3530.xml
1858: 1858.xml
2410: 2410.xml
3861: 3861.xml
2202: 2202.xml
3323: 3323.xml
665: 665.xml
3910: 3910.xml
3723: 3723.xml
3864: 3864.xml
516: 516.xml
3520: 3520.xml
3244: 3244.xml
2748: 2748.xml
3721: 3721.xml
2156: 2156.xml
2750: 2750.xml
928: 928.xml
3473: 3473.xml
2584: 2584.xml
2427: 2427.xml
2396: 2396.xml
2237: 2237.xml
4521: 4521.xml
4092: 4092.xml
3850: 3850.xml
1666: 1666.xml
1896: 1896.xml
2435: 2435.xml
3930: 3930.xml
4046: 4046.xml
3073: 3073.xml
2543: 2543.xml
1710: 1710.xml
3477: 3477.xml
2634: 2634.xml
4342: 4342.xml
2863: 2863.xml
1947: 1947.xml
17135: 17135.xml
4218: 4218.xml
2085: 2085.xml
3238: 3238.xml
2867: 2867.xml
5313: 5313.xml
2285: 2285.xml
4178: 4178.xml

Empty TEI files

1596.xml
1597.xml
1807.xml
2301.xml
1595.xml
1594.xml
1591.xml
2475.xml
1579.xml
1343.xml
2639.xml
4710.xml
706.xml
4513.xml
4505.xml
4458.xml
4064.xml
462.xml
5392.xml
514.xml
500.xml
5407.xml
651.xml
453.xml
2753.xml
1060.xml
5400.xml
2384.xml
694.xml
4734.xml
497.xml
4746.xml
1575.xml
2478.xml
2641.xml
2457.xml
3206.xml
5307.xml
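
For context, "empty" here means the TEI has no text content in its transcription. One way to check that is sketched below, assuming the transcription lives under the TEI `text` element and that header content should not count; the function name `has_text_content` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; assumes the files declare it.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def has_text_content(tei_xml: str) -> bool:
    """Return True if the TEI <text> element contains non-whitespace text."""
    root = ET.fromstring(tei_xml)
    text_el = root.find(f"{TEI_NS}text")
    if text_el is None:
        return False
    return any(chunk.strip() for chunk in text_el.itertext())
```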

Documents not found in database

(displaying pgpid from TEI in case it differs)

9082.xml: 9082
5323.xml: 5323
9108.xml: 9108
5518.xml: 5518
5444.xml: 5444
1581.xml: 1581
9127.xml: 9127
2517.xml: 2517
1586.xml: 1586
5509.xml: 5509
3965.xml: 3965
4506.xml: 4506
5341.xml: 5341
4739.xml: 4739
5394.xml: 5394
4075.xml: 4075
4713.xml: 4713
3109.xml: 3109
4712.xml: 4712
5395.xml: 5395
4510.xml: 4510
4449.xml: 4449
4649.xml: 4649
4648.xml: 4648
4725.xml: 4725
2234.xml: 2234
4724.xml: 4724
4493.xml: 4493
693.xml: 693
4723.xml: 4723
481.xml: 481
4736.xml: 4736
4720.xml: 4720
4709.xml: 4709
9048.xml: 9048
9128.xml: 9128
4221.xml: 4221
2736.xml: 2736
5514.xml: 5514

TEI files with columns

(markup contains rend="col")

4523.xml
4530.xml
4541.xml
4560.xml

@rlskoeser
Contributor Author

@richmanrachel when you discuss with Marina, please also discuss how much of this we need to handle for a first-pass implementation; I'd like to get the transcription sync out for testing so we can build out the functionality that depends on it — I think we should be able to proceed with that while we work on resolving these problems.

@richmanrachel

@rlskoeser - sounds good. Thank you so much!

@richmanrachel

> a number of documents have more than one footnote that provides an edition; if I filter to the one that has content, then I would preserve the mapping of footnote to edition content that we established on import. Does that seem like a reasonable approach for now, or should I compare source with sourceDesc in TEI? (Note: using current content mapping won't handle all cases, see problems)

  • We're not quite sure that it will map correctly... perhaps we can walk through it together? How are you determining which footnote has content? Fingers crossed this is as easy as you think it is!

@rlskoeser
Contributor Author

oh, sorry, that was unclear! "footnote with content" means the footnote that we attached the initial transcription text to when we did the spreadsheet import. (I think we got some of those wrong but they have been cleaned up / consolidated?)

@richmanrachel

> there are 40 TEI files / 19 documents where the document has multiple editions with content associated; I suspect these are all merges but have only investigated briefly. Here's one example: https://geniza.cdh.princeton.edu/admin/corpus/document/2493/change/ Two footnotes from the same source which are associated with different parts of the transcription. I'm not sure the best way to resolve this! One option would be to have someone (probably would have to be Alan?) combine the two XML files.

  • Yes, we can have Alan do this. I think in this case we can just delete one of the footnotes because they're both Goitein, Typed Texts.

@richmanrachel

> oh, sorry, that was unclear! "footnote with content" means the footnote that we attached the initial transcription text to when we did the spreadsheet import. (I think we got some of those wrong but they have been cleaned up / consolidated?)

  • @rlskoeser - we're totally blanking on this... Can you show us at the meeting, tomorrow, please?

@richmanrachel

> In some cases there are multiple footnotes with editions but none with content (don't know how many yet; at least 1). I can try matching source name on the sourceDesc contents in the TEI, but don't know yet how reliable it will be.

  • Let's not worry about this if it's only one case.

@rlskoeser
Contributor Author

yes, absolutely! let's plan to talk through whatever is confusing or can't be resolved asynchronously

@richmanrachel

> 63 documents with no edition: not finding an existing footnote to attach the content to. In one example there's no footnote at all: https://geniza.cdh.princeton.edu/admin/corpus/document/2285/change/ In this case the TEI doesn't include a source description; even if it did, I don't think it's structured enough for me to generate a footnote automatically.

  • We do need to clean up the TEI files that have sources/footnotes embedded. Working on a workflow now. Marina just wants to make sure that the information about when they were initially entered into the PGP can be manually input if needed, as these internal source notes often indicate documents added to the PGP in the 1980s.

@richmanrachel

> A lower priority question, about formatting. The TEI includes a number of rend tags;

@rlskoeser
Contributor Author

> What does "rend" mean? @rlskoeser

Ah, sorry — rendition; it's an attribute that's usually used to indicate formatting

@richmanrachel

> A lower priority question, about formatting. The TEI includes a number of rend tags; how important are these? Can we ignore for the fall MVP? Are any of these important?

  • Overall, yes, we can ignore italics/bold, and we want to get rid of the descriptions from the TEI. The only concern is the 4 columns, as lists are actually in columns, but we might be able to ignore for MVP because it's such a small group.

@richmanrachel

> I had been planning to have this tei synchronization script generate admin log entries on footnote records when it updates the transcription content, but now am questioning the usefulness of that, since this is an interim solution for transcriptions. Any opinions on whether this is valuable to document or would clutter the database needlessly?

  • We think it will be helpful to know when the transcription moved from Bitbucket to Django, so it won't be clutter? But we're not sure we entirely understand the question.

@rlskoeser
Contributor Author

Output from running the script in qa:

Processed 4,557 TEI/XML files; skipped 38 TEI files with no text content.
39 documents not found in database.
274 documents with multiple fragments.
614 documents with multiple editions; 42 multiple editions with content (20 unique documents).
64 documents with no edition.
3,800 documents with one edition.
Updated 4,354 footnotes.

rlskoeser added the "🗜️ awaiting testing" label Nov 19, 2021
@richmanrachel

Looks great! Closing :)

blms removed the "❓ question" label Jul 12, 2023