As an admin, I want TEI transcription content regularly synchronized to the new database so that transcriptions are updated with changes in the current system. #321

rlskoeser · 2021-10-18T15:48:53Z

testing notes

in the admin site, look at documents with digitized transcriptions associated; confirm that updated transcriptions now display line numbers that match what is in the xml / what is on PGP v3 site (note that transcriptions for documents in our error cases, i.e. joins and multiple editions, will not have updated content and should continue to display the old version with line numbers)
in the admin site, check a few footnotes with digitized transcriptions — confirm there is a log entry documenting that it was updated via script
check a few records to confirm that transcriptions are being matched correctly based on filename and pgpid

Until we have a new solution for managing and editing transcriptions, we need to use the existing TEI and make sure the new database is pulling in updates regularly.

dev notes

We need a new management command that can be configured to run as a nightly cron job.

add and document settings for path to local copy of TEI git repo
clone/pull any changes from the TEI git repo to local copy
match TEI to the correct footnote based on PGPID and source note; needs to handle old PGPIDs for transcriptions associated with merged documents. Matching up properly may require some data cleanup in the source notes. (for initial implementation, handle simple cases only!)
convert TEI to html that can be used in an IIIF Annotation List; should include labels for any blocks and line numbers in the TEI. Adapt from prototype code https://github.com/Princeton-CDH/geniza/blob/experiment/search/scripts/tei_transcriptions.py
update transcription content associated with the footnote
add a django admin log entry if the transcription has changed (would be nice to use git log entry details here, but probably not worth the effort, given that this is an interim solution)
script should include reporting to help with concerns raised in "Numbers of transcriptions aren't populating correctly into the admin site" (#295): include total number of transcription files, documents with transcriptions, number of fragments, and how many joins
update document detail admin to handle the new format
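
To make the TEI-to-HTML step above more concrete, here is a minimal sketch of one way block labels and line numbers could be carried over into HTML. It is not the prototype code linked above. It assumes (hypothetically) that blocks are marked with TEI `label` elements and lines with `l` elements carrying an `n` attribute; the function name `tei_to_html` is made up for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Standard TEI namespace
TEI_NS = {"t": "http://www.tei-c.org/ns/1.0"}


def tei_to_html(xml_text):
    """Convert labelled blocks of numbered lines to simple HTML.

    Assumption: labels are <label> elements and lines are <l n="..."> elements.
    """
    root = ET.fromstring(xml_text)
    text = root.find(".//t:text", TEI_NS)
    if text is None:
        return ""

    parts = []
    in_list = False
    # iter() walks the element and its descendants in document order
    for el in text.iter():
        tag = el.tag.split("}")[-1]  # strip the namespace prefix
        content = escape("".join(el.itertext()).strip())
        if tag == "label":
            if in_list:
                parts.append("</ol>")
                in_list = False
            parts.append(f"<h3>{content}</h3>")
        elif tag == "l":
            if not in_list:
                parts.append("<ol>")
                in_list = True
            n = el.get("n")
            # preserve the line number from the TEI when it is numeric
            value = f' value="{n}"' if n and n.isdigit() else ""
            parts.append(f"<li{value}>{content}</li>")
    if in_list:
        parts.append("</ol>")
    return "\n".join(parts)
```

Using `value` on each list item keeps the displayed number tied to the number in the XML rather than to the item's position, which is what the testing notes ask to confirm.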

rlskoeser · 2021-11-15T16:42:57Z

Hello @richmanrachel @mrustow — I'm finally circling back to working on the transcription synchronization we planned for the fall MVP, and I have a number of questions.

Here is the summary output the script is generating with totals for various categories:

Processed 4,557 TEI/XML files; skipped 38 TEI files with no text content.
39 documents not found in database.
290 documents with multiple fragments.
616 documents with multiple editions; 40 multiple editions with content (19 unique documents).
63 documents with no edition.
3,799 documents with one edition.
Updated 3,798 footnotes.

My approach for now is to synchronize the transcription content from the TEI into the content field in the existing footnote; from the numbers above, you'll see there are some places where that's causing me some trouble. I need help investigating and resolving some of these.

notes

no text content: we were aware of these and I don't think it's a problem; just reporting so we have numbers
multiple fragments: not a problem, just including for reporting purposes
one TEI file (8594.xml) has a translation and no transcription; this is out of scope for my current work, but I wanted to mention because I don't want the content to get lost. We might also want to be thinking about how to handle translations — it seems to me that it would be valuable to include translation text in the public search.

questions

a number of documents have more than one footnote that provides an edition; if I filter to the one that has content, then I would preserve the mapping of footnote to edition content that we established on import. Does that seem like a reasonable approach for now, or should I compare source with sourceDesc in TEI? (Note: using current content mapping won't handle all cases, see problems)

problems

documents not found: these IDs aren't listed as either current or past ids. When I looked at one of the TEI files and searched on the shelfmark listed, I didn't find any results
documents with multiple editions: these are footnotes with document relationship of edition; if I find the one that currently has content, then we preserve the initial mapping
there are 40 TEI files / 19 documents where the document has multiple editions with content associated; I suspect these are all merges but have only investigated briefly. Here's one example: https://geniza.cdh.princeton.edu/admin/corpus/document/2493/change/ Two footnotes from the same source which are associated with different parts of the transcription. I'm not sure the best way to resolve this! One option would be to have someone (probably would have to be Alan?) combine the two XML files.
In some cases there are multiple footnotes with editions but none with content (don't know how many yet; at least 1). I can try matching source name on the sourceDesc contents in the TEI, but don't know yet how reliable it will be.
63 documents with no edition: not finding an existing footnote to attach the content to. In one example there's no footnote at all: https://geniza.cdh.princeton.edu/admin/corpus/document/2285/change/ In this case the TEI doesn't include a source description; even if it did, I don't think it's structured enough for me to generate a footnote automatically.
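
The cases described above could be sorted roughly as follows. This is only a sketch of the decision logic, not the script itself: the `Footnote` class, its field names, and `choose_edition` are hypothetical stand-ins for the real footnote records.

```python
from dataclasses import dataclass


@dataclass
class Footnote:
    # Hypothetical stand-in for a footnote record
    doc_relation: str  # e.g. "edition"
    content: str = ""


def choose_edition(footnotes):
    """Return (category, footnote to update or None) for one document."""
    editions = [fn for fn in footnotes if fn.doc_relation == "edition"]
    if not editions:
        return "no edition", None
    if len(editions) == 1:
        return "one edition", editions[0]
    # multiple editions: prefer the one that already has content,
    # which preserves the mapping established on import
    with_content = [fn for fn in editions if fn.content.strip()]
    if len(with_content) == 1:
        return "multiple editions", with_content[0]
    if with_content:
        return "multiple editions with content", None
    return "multiple editions without content", None
```

Returning `None` for the ambiguous cases leaves the existing content untouched until someone decides how those documents should be handled.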

What's the best way to get help investigating these? I can generate some lists; maybe a separate list for each kind of problem?

rlskoeser · 2021-11-15T16:52:17Z

A lower priority question, about formatting. The TEI includes a number of rend tags; how important are these? Can we ignore for the fall MVP? Are any of these important?

superscripts with numbers; they look like they could be footnote markers, but I don't see any footnotes
four documents have rend=col for columns with column breaks
84 documents have a desc description at the beginning of the TEI; ok to ignore? examples: 5293, 5291, 5300, 786
I see formatting (italics, bold, center) but I think they are all in descriptions or source descriptions and not in transcription text
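
For anyone who wants to reproduce this kind of inventory, a minimal sketch that counts `rend` values in a single TEI document is below. The function name `rend_values` is hypothetical, and splitting the attribute on whitespace is an assumption about how multiple values are encoded.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def rend_values(xml_text):
    """Count the rend attribute values used anywhere in one TEI document."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for el in root.iter():
        rend = el.get("rend")
        if rend:
            # assumption: multiple values are separated by whitespace
            counts.update(rend.split())
    return counts
```

Summing the counters across all files would give totals per value, and checking for a single value such as `col` would pick out individual files.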

rlskoeser · 2021-11-15T16:57:05Z

Ack, sorry — I have yet another question: I had been planning to have this tei synchronization script generate admin log entries on footnote records when it updates the transcription content, but now am questioning the usefulness of that, since this is an interim solution for transcriptions. Any opinions on whether this is valuable to document or would clutter the database needlessly?

richmanrachel · 2021-11-15T17:56:01Z

@rlskoeser - thanks for the research you did and the good questions. I will set aside most of my meeting with Marina on Wednesday morning to investigate, and put a marker in the agenda to bring up the discussion points as well!

rlskoeser · 2021-11-16T14:19:28Z

@richmanrachel that sounds great. Should I go ahead and create a list of PGPIDs and xml files for reference/investigation? I should be able to do that sometime today so you'd have it available for your meeting tomorrow.

richmanrachel · 2021-11-16T16:22:21Z

@rlskoeser - that would be amazing. Thanks a million!

rlskoeser · 2021-11-16T22:12:50Z

Here are some lists. Hope this is helpful for investigating!

Documents with multiple editions

9121: 9121.xml, 5299.xml
4740: 5493.xml, 4740.xml
606: 606.xml, 496.xml
9089: 9090.xml, 9089.xml
5410: 9053.xml, 5410.xml
4585: 4601.xml, 4585.xml
4717: 5432.xml, 4717.xml
4738: 4738.xml, 5545.xml
5552: 5552.xml, 4346.xml
4743: 5553.xml, 4743.xml
9804: 477.xml, 460.xml
1849: 1849.xml, 1850.xml
848: 850.xml, 848.xml
2142: 2142.xml, 2143.xml, 2144.xml, 2145.xml
4495: 4496.xml, 4495.xml
4721: 4721.xml, 5513.xml
591: 591.xml, 9072.xml
2493: 2493.xml, 2496.xml
2691: 2691.xml, 9066.xml

Documents with no edition footnote

2855: 2855.xml
2935: 2935.xml
9083: 9083.xml
5297: 5297.xml
7457: 7457.xml
3552: 3552.xml
3422: 3422.xml
2926: 2926.xml
3999: 3999.xml
1156: 1156.xml
3557: 3557.xml
4204: 4204.xml
4170: 4170.xml
2460: 2460.xml
3554: 3554.xml
2201: 2201.xml
3530: 3530.xml
1858: 1858.xml
2410: 2410.xml
3861: 3861.xml
2202: 2202.xml
3323: 3323.xml
665: 665.xml
3910: 3910.xml
3723: 3723.xml
3864: 3864.xml
516: 516.xml
3520: 3520.xml
3244: 3244.xml
2748: 2748.xml
3721: 3721.xml
2156: 2156.xml
2750: 2750.xml
928: 928.xml
3473: 3473.xml
2584: 2584.xml
2427: 2427.xml
2396: 2396.xml
2237: 2237.xml
4521: 4521.xml
4092: 4092.xml
3850: 3850.xml
1666: 1666.xml
1896: 1896.xml
2435: 2435.xml
3930: 3930.xml
4046: 4046.xml
3073: 3073.xml
2543: 2543.xml
1710: 1710.xml
3477: 3477.xml
2634: 2634.xml
4342: 4342.xml
2863: 2863.xml
1947: 1947.xml
17135: 17135.xml
4218: 4218.xml
2085: 2085.xml
3238: 3238.xml
2867: 2867.xml
5313: 5313.xml
2285: 2285.xml
4178: 4178.xml

Empty TEI files

1596.xml
1597.xml
1807.xml
2301.xml
1595.xml
1594.xml
1591.xml
2475.xml
1579.xml
1343.xml
2639.xml
4710.xml
706.xml
4513.xml
4505.xml
4458.xml
4064.xml
462.xml
5392.xml
514.xml
500.xml
5407.xml
651.xml
453.xml
2753.xml
1060.xml
5400.xml
2384.xml
694.xml
4734.xml
497.xml
4746.xml
1575.xml
2478.xml
2641.xml
2457.xml
3206.xml
5307.xml

Documents not found in database

(displaying pgpid from TEI in case it differs)

9082.xml: 9082
5323.xml: 5323
9108.xml: 9108
5518.xml: 5518
5444.xml: 5444
1581.xml: 1581
9127.xml: 9127
2517.xml: 2517
1586.xml: 1586
5509.xml: 5509
3965.xml: 3965
4506.xml: 4506
5341.xml: 5341
4739.xml: 4739
5394.xml: 5394
4075.xml: 4075
4713.xml: 4713
3109.xml: 3109
4712.xml: 4712
5395.xml: 5395
4510.xml: 4510
4449.xml: 4449
4649.xml: 4649
4648.xml: 4648
4725.xml: 4725
2234.xml: 2234
4724.xml: 4724
4493.xml: 4493
693.xml: 693
4723.xml: 4723
481.xml: 481
4736.xml: 4736
4720.xml: 4720
4709.xml: 4709
9048.xml: 9048
9128.xml: 9128
4221.xml: 4221
2736.xml: 2736
5514.xml: 5514
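
For reference, a file lands in the list above when its ID matches neither a current nor a past PGPID. A rough sketch of that lookup follows; `find_document` is a hypothetical name, the in-memory `current_ids` set and `old_ids` mapping stand in for database queries, and for simplicity it takes the ID from the filename only.

```python
from pathlib import Path


def find_document(xml_path, current_ids, old_ids):
    """Return the current PGPID for a TEI file, or None if not found.

    current_ids: set of current PGPIDs
    old_ids: dict mapping an old PGPID (from a merged document) to the current one
    """
    stem = Path(xml_path).stem
    if not stem.isdigit():
        return None
    pgpid = int(stem)
    if pgpid in current_ids:
        return pgpid
    # fall back to old PGPIDs for merged documents
    return old_ids.get(pgpid)
```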

tei with columns

(markup contains rend="col")

4523.xml
4530.xml
4541.xml
4560.xml

rlskoeser · 2021-11-16T22:14:53Z

@richmanrachel when you discuss with Marina, please also discuss how much of this we need to handle for a first-pass implementation; I'd like to get the transcription sync out for testing so we can build out the functionality that depends on it. I think we should be able to proceed with that while we work on resolving these problems.

richmanrachel · 2021-11-16T22:25:27Z

@rlskoeser - sounds good. Thank you so much!

richmanrachel · 2021-11-17T14:49:54Z

a number of documents have more than one footnote that provides an edition; if I filter to the one that has content, then I would preserve the mapping of footnote to edition content that we established on import. Does that seem like a reasonable approach for now, or should I compare source with sourceDesc in TEI? (Note: using current content mapping won't handle all cases, see problems)

We're not quite sure that it will map correctly... perhaps we can walk through it together? How are you determining which footnote has content? Fingers crossed this is as easy as you think it is!

rlskoeser · 2021-11-17T14:52:39Z

oh, sorry, that was unclear! "footnote with content" means the footnote that we attached the initial transcription text to when we did the spreadsheet import. (I think we got some of those wrong but they have been cleaned up / consolidated?)

richmanrachel · 2021-11-17T14:58:41Z

there are 40 TEI files / 19 documents where the document has multiple editions with content associated; I suspect these are all merges but have only investigated briefly. Here's one example: https://geniza.cdh.princeton.edu/admin/corpus/document/2493/change/ Two footnotes from the same source which are associated with different parts of the transcription. I'm not sure the best way to resolve this! One option would be to have someone (probably would have to be Alan?) combine the two XML files.

Yes, we can have Alan do this. I think in this case we can just delete one of the footnotes because they're both Goitein, Typed Texts.

richmanrachel · 2021-11-17T14:59:47Z

oh, sorry, that was unclear! "footnote with content" means the footnote that we attached the initial transcription text to when we did the spreadsheet import. (I think we got some of those wrong but they have been cleaned up / consolidated?)

@rlskoeser - we're totally blanking on this... Can you show us at the meeting, tomorrow, please?

richmanrachel · 2021-11-17T15:00:57Z

In some cases there are multiple footnotes with editions but none with content (don't know how many yet; at least 1). I can try matching source name on the sourceDesc contents in the TEI, but don't know yet how reliable it will be.

Let's not worry about this if it's only one case.

rlskoeser · 2021-11-17T15:01:17Z

yes, absolutely! let's plan to talk through whatever is confusing or can't be resolved asynchronously

richmanrachel · 2021-11-17T15:09:16Z

63 documents with no edition: not finding an existing footnote to attach the content to. In one example there's no footnote at all: https://geniza.cdh.princeton.edu/admin/corpus/document/2285/change/ In this case the TEI doesn't include a source description; even if it did, I don't think it's structured enough for me to generate a footnote automatically.

We do need to clean up the TEI files that have sources/footnotes embedded. Working on a workflow now. Marina just wants to make sure that the information about when they were initially entered into the PGP can be manually input if needed as these internal source notes often indicate documents added to the PGP in the 80's.

richmanrachel · 2021-11-17T15:12:45Z

A lower priority question, about formatting. The TEI includes a number of rend tags;

What does "rend" mean? @rlskoeser

rlskoeser · 2021-11-17T15:13:30Z

What does "rend" mean? @rlskoeser

Ah, sorry — rendition; it's an attribute that's usually used to indicate formatting

richmanrachel · 2021-11-17T15:21:56Z

A lower priority question, about formatting. The TEI includes a number of rend tags; how important are these? Can we ignore for the fall MVP? Are any of these important?

Overall, yes, we can ignore italics/bold, and we want to get rid of the descriptions from the TEI. The only concern is the 4 columns, as lists are actually in columns, but we might be able to ignore for MVP because it's such a small group.

richmanrachel · 2021-11-17T15:26:11Z

I had been planning to have this tei synchronization script generate admin log entries on footnote records when it updates the transcription content, but now am questioning the usefulness of that, since this is an interim solution for transcriptions. Any opinions on whether this is valuable to document or would clutter the database needlessly?

We think it will be helpful to know when the transcription moved from bitbucket to Django, so it won't be clutter? But not sure we entirely understand the question.

rlskoeser · 2021-11-19T19:25:28Z

Output from running the script in qa:

Processed 4,557 TEI/XML files; skipped 38 TEI files with no text content.
39 documents not found in database.
274 documents with multiple fragments.
614 documents with multiple editions; 42 multiple editions with content (20 unique documents).
64 documents with no edition.
3,800 documents with one edition.
Updated 4,354 footnotes.

richmanrachel · 2021-11-19T19:29:45Z

Looks great! Closing :)

rlskoeser added this to the PGP v4.0 (MVP) milestone Oct 18, 2021

rlskoeser mentioned this issue Oct 21, 2021

Numbers of transcriptions aren't populating correctly into the admin site. #295

Closed

rlskoeser self-assigned this Nov 2, 2021

rlskoeser added the ❓ question Further information is requested label Nov 15, 2021

rlskoeser added the 🗜️ awaiting testing Implemented and ready to be tested label Nov 19, 2021

richmanrachel closed this as completed Nov 19, 2021

richmanrachel removed the 🗜️ awaiting testing Implemented and ready to be tested label Nov 19, 2021

rlskoeser mentioned this issue Nov 19, 2021

As an admin, I want TEI transcription synchronization to handle documents with multiple transcriptions, so that content is not lost or hidden in the new system. #377

Closed

6 tasks

blms removed the ❓ question Further information is requested label Jul 12, 2023
