
As a global admin, I want a one-time import of all documents and fragments currently in the PGP spreadsheet and the fields in the db populated accordingly, in order to work with the data in the database. #66

Closed
40 tasks done
sluescher opened this issue Feb 9, 2021 · 15 comments

@sluescher

sluescher commented Feb 9, 2021

testing notes

Check a variety of documents and fragments from the PGP metadata spreadsheet and test how they have been imported.

for fragments:

  • check that fields are populated accurately from the spreadsheet:
    • shelfmark
    • historic shelfmark
    • library/collection
    • multifragment (yes/no)
    • link to image
    • iiif url for CUL documents with link to image (can verify via iiif viewer on fragment edit page)
    • test that items with Library CUL in spreadsheet are assigned to the right collection based on shelfmark (T-S, Or., Add.)
  • check that record history documents creation via import script

for documents:

  • check that these fields are populated accurately from the spreadsheet:
    • PGPID
    • type
    • description
    • tags
    • languages — language+script based on list of languages (preliminary mapping)
    • probable languages — language+script based on languages listed with question mark
    • language note — should include text of language + parenthetical notes on vocalization, diacritics, etc
    • legacy input by
    • legacy input date
    • a text block for each associated fragment; should include when present:
      • side (recto/verso)
      • text block label (text block in spreadsheet -> extent label in database)
      • multifragment value
  • check that record history documents creation via import script

Check a few documents with joins to confirm that the document is linked to all fragments referenced by shelfmark in the join column


I want the following fields populated from the spreadsheet: library, shelfmark (current/historical), recto or verso, language/script, description, type and tags, and, if available, link to image.

dev notes

revisions after testing:

  • fragments with a multifragment value set should get boolean multifragment set true
  • actual multifragment value should be set on the text block
  • for fragments, should populate:
    • shelfmark
    • historic shelfmark
    • library/collection
    • infer missing library based on shelfmark (data cleanup requested; see the sketch after this list)
    • multifragment
    • link to image
    • infer iiif url based on linked to image where possible
  • for documents, should populate:
    • PGP ID
    • language/script
    • description
    • type
    • tags
    • associate with fragments based on shelfmark and any shelfmarks included in the join field
  • on document/fragment through model, track:
    • side (recto/verso)
    • text block
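
For the library/collection inference mentioned in the list above, here is a minimal sketch of a prefix-based lookup. The CUL prefixes (T-S, Or., Add.) come from the testing notes; the function name and return value are illustrative, not the import script's actual code.

```python
# Sketch only: infer a CUL collection from the shelfmark prefix.
# Prefixes follow the testing notes (T-S, Or., Add.); names here are
# illustrative, not the import script's actual API.
CUL_COLLECTION_PREFIXES = ("T-S", "Or.", "Add.")

def infer_cul_collection(shelfmark: str):
    """Return the collection prefix implied by a CUL shelfmark, or None."""
    for prefix in CUL_COLLECTION_PREFIXES:
        if shelfmark.startswith(prefix):
            return prefix
    return None

# e.g. infer_cul_collection("T-S 8J37.1") == "T-S"
```

The same prefix approach could be applied to infer a missing library for the larger sets (ENA, T-S), if the team confirms that can be done reliably.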
@kmcelwee
Contributor

Some potential pitfalls in parsing each field are listed in #1.

@rlskoeser rlskoeser self-assigned this Feb 12, 2021
@thatbudakguy thatbudakguy added the 🆕 enhancement New feature or request label Feb 15, 2021
@rlskoeser rlskoeser changed the title As a global admin, I want a one-time import of all documents currently in the PGP spreadsheet and the fields in the db populated accordingly, in order to work with the data in the database. As a global admin, I want a one-time import of all documents and fragments currently in the PGP spreadsheet and the fields in the db populated accordingly, in order to work with the data in the database. Feb 18, 2021
@rlskoeser
Contributor

Re-estimating as 5 points since the story now includes importing documents and fragments.

@rlskoeser
Contributor

Hello @sluescher — several questions and notes based on what's been done so far. (I'm sure you'll need to confer or coordinate with others on many of these, but thought I'd let you decide who and how to resolve.)

  1. Multiple types. For now, I am ignoring rows where there are two types separated by a semicolon. I propose we write a separate user story for that use case (are we still planning to split/demerge records on import?) so it can be implemented and tested after the more normal cases are handled.
  2. Missing Library. There are a number of records where the Library field is empty. I'm attaching a CSV file with PGPID and current shelfmark (the file is named .csv.txt because GitHub won't let me upload a csv file 🤷‍♀️). I suggest cleanup at least for the unusual shelfmarks and the small sets; if you want us to infer the collection based on prefix for the larger sets and it can be done reliably (ENA, T-S?), please let us know. pgp-missing-libraries.csv.txt
  3. What should be done with "unknown" in the recto and verso field?
  4. I think I mentioned this in our user story writing session, but so it's documented: we've named recto/verso "side".
  5. Also discussed in user story session, documenting here: legacy input information (input by and date entered) will be stored as readonly text fields so that the information is available to you (but we're not going to parse it or structure it beyond that).

shelfmark & join questions

  1. What should we do with shelfmark UNKNOWN 1 ?
  2. Do some shelfmarks include recto/verso information, or is this still part of the shelfmark? Examples: Yevr.-Arab. I 1700.21r–21v (PGPID 9403), Yevr.-Arab. I 1700.1v–2r (PGPID 31822).
  3. Some shelfmarks appear to include a version reference; is this still part of the shelfmark? (Examples: T-S 8J37.1 Ver. 2 / PGPID 9064; T-S 18J4.18 Ver. 2 / PGPID 5534)
  4. Some shelfmarks are listed in join notation (examples: PGPID 30817, 30845); what does this indicate? (Note: I think these are all included in the missing libraries csv attached above, since they also don't have libraries).
  5. In 21 rows with joins, the main shelfmark is not included in the join. Are these mistakes? List attached for review. pgp-shelfmark-notinjoin.csv.txt

language questions and requests

  1. Please provide a mapping from language as listed in the metadata spreadsheet to the corresponding combination of Language+Script. I'm attaching a file with a rough list of unique languages currently in the spreadsheet. There are some typos or other errors, and some with non-linguistic information ("no image"); I suggest these be cleaned up in the spreadsheet. pgp-metadata-languages.txt
  2. I propose that notes on vocalization, vowels, diacritics in the language field in the spreadsheet be preserved in a field on the document entity. Is this acceptable, and would you suggest a name for this field?
  3. What should be done with questionable identifications? (e.g. "some xx", "language (?)")
  4. What about unknown/unidentified in the language field?

@sluescher
Author

sluescher commented Feb 25, 2021

  • Multiple types. Checking and reporting back on best way to approach this.

  • Missing Library. Almost fixed (one library to be added)

  • What should be done with "unknown" in the recto and verso field? There should not be any unknowns left?

  • Legacy input information changes: to be discussed in meeting.

  • What should we do with shelfmark UNKNOWN 1 ? FOUND! Actual shelfmark added

  • Do some shelfmarks include recto/verso information, or is this still part of the shelfmark? Needs to be discussed; these are special cases.

  • Some shelfmarks appear to include a version reference; is this still part of the shelfmark? (Examples: T-S 8J37.1 Ver. 2 / PGPID 9064; T-S 18J4.18 Ver. 2 / PGPID 5534) Will be cleaned.

  • Some shelfmarks are listed in join notation (examples: PGPID 30817, 30845); what does this indicate? Alan will check

  • Mapping from language and cleanup in the spreadsheet: in progress.

  • I propose that notes on vocalization, vowels, diacritics in the language field in the spreadsheet be preserved in a field on the document entity. Is this acceptable, and would you suggest a name for this field? Agreed. Name to come

  • What should be done with questionable identifications? (e.g. "some xx", "language (?)") As discussed in the meeting on 2/25, some will be dispensed with; "language (possible)" to be added to the document model.

  • What about unknown/unidentified in the language field? Unidentified needs to stay, as for some we just don't know the language.

@kmcelwee
Contributor

kmcelwee commented Mar 2, 2021

List of languages that need to be corrected by the PGP team:

PGPID: 30500, Language: German
PGPID: 30616, Language: Catalan
PGPID: 31025, Language: Turkish
PGPID: 31082, Language: Arabo-Hebrew
PGPID: 31083, Language: Hebrew or Judaeo-Arabic
PGPID: 30977, Language: Turkish
PGPID: 19008, Language: Aramaic, Hebrew
PGPID: 19010, Language: Aramaic, Hebrew
PGPID: 19112, Language: Hebrew: Judaeo-Arabic
PGPID: 31369, Language: Arabo-Hebrew
PGPID: 31268, Language: English
PGPID: 31344, Language: Portuguese
PGPID: 19744, Language: Catalan
PGPID: 19769, Language: nomina barbara
PGPID: 31490, Language: Arabo-Hebrew
PGPID: 11621, Language: Sanskrit
PGPID: 11621, Language: English
PGPID: 11135, Language: Sanskrit
PGPID: 11135, Language: English
PGPID: 11112, Language: no image
PGPID: 11951, Language: cipher
PGPID: 11284, Language: Judaeo-arabic
PGPID: 12101, Language: Juaeo-Arabic
PGPID: 12230, Language: Ara
PGPID: 15435, Language: no image
PGPID: 15469, Language: Image missing
PGPID: 31232, Language: Greek or Coptic
PGPID: 32427, Language: English
PGPID: 32434, Language: English
PGPID: 28503, Language: Hebrewq
PGPID: 32295, Language: Judaeo-Arabic: Hebrew
PGPID: 7024, Language: Judaeo-ARabic
PGPID: 2910, Language: Judeao-Arabic
PGPID: 29377, Language: Christian Palestinian Aramaic
PGPID: 1869, Language: Romance
PGPID: 32299, Language: Ottoman Turkish
PGPID: 20341, Language: Arabic Judaeo-Arabic
PGPID: 20910, Language: Arabo-Hebrew
PGPID: 29264, Language: unidentified
PGPID: 17564, Language: Judaeo-Arabi
PGPID: 32113, Language: Gujarati
PGPID: 16817, Language: Coptic numerals
PGPID: 29070, Language: Coptic numerals
PGPID: 18874, Language: Coptic numbers
PGPID: 29175, Language: Coptic numerals
PGPID: 30740, Language: German
PGPID: 30747, Language: Catalan
PGPID: 8484, Language: Hebrew, Judaeo-Arabic
PGPID: 30749, Language: Romance
PGPID: 23956, Language: Portuguese
PGPID: 25907, Language: Judeo-Arabic
PGPID: 26304, Language: N/A
PGPID: 27870, Language: unknown language

We'll need the PGP team to either add these languages to the ontology spreadsheet or correct them in the metadata spreadsheet.

@rlskoeser rlskoeser added the 🗜️ awaiting testing Implemented and ready to be tested label Mar 3, 2021
@rlskoeser
Contributor

Preliminary import is available on the test site. Note that some aspects of this are still provisional, pending some data work and decisions from the team. For now, we're using a provisional mapping of languages in the spreadsheet to language name in the new Language+Script model, and we're ignoring records that need to be demerged.
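
To make the provisional mapping concrete, here is a minimal sketch of the lookup shape, with placeholder entries; the real mapping is the one the PGP team provides.

```python
# Sketch only: provisional lookup from a spreadsheet language value to a
# Language+Script combination. Entries are placeholders for illustration.
LANGUAGE_SCRIPT_MAP = {
    "Hebrew": ("Hebrew", "Hebrew script"),
    "Judaeo-Arabic": ("Judaeo-Arabic", "Hebrew script"),
    "Arabic": ("Arabic", "Arabic script"),
}

def lookup_language(value: str):
    """Return (language, script, probable) for one spreadsheet value, or None.

    A trailing question mark marks a probable identification, which goes to
    the document's probable languages rather than its languages.
    """
    cleaned = value.strip()
    probable = cleaned.endswith("?")
    mapped = LANGUAGE_SCRIPT_MAP.get(cleaned.rstrip("?").strip())
    if mapped is None:
        return None  # unmapped values are reported in the import output and skipped
    return mapped[0], mapped[1], probable
```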

I'm attaching the full output of the import script — kind of noisy because it reports on all the skipped records with multiple types and missing languages, but may be useful to refer to as you're testing.
import_data_output.txt

@richmanrachel

  • Filtering by multi-fragment would be helpful (so that we can fix them more easily).
  • We're excited about the multi-fragment setup and thinking about how it could be modified to fit other cases too (bifoliums, court notebooks, and distinguishing between multi-fragments that share the same support or have just been put together by libraries).
  • Legacy input date is present, but we would like to be able to filter for it.
  • Website just stopped working, but noting that we'll check Bodl. MS Heb. f 56/16 for the multi-fragment text block.

@richmanrachel

@rlskoeser - I'm having trouble tracking down any multifragments that show up as such in the database. None of the 25 documents that the PGP search finds (under "multifragment" in the description) seem to have enough information in the spreadsheet to migrate to the new database. I'm trying to think of a creative way to find the right docs to search, but I'll need to come back to this with more energy later.

Everything else looks great!

@rlskoeser
Contributor

Ah, sounds like the multifragment filter problem you identified on #75 is a blocker for the testing. Should have a fix for that soon.

  • Legacy input date is present, but we would like to be able to filter for it.

What kind of filter are you thinking would be useful? (I tested and there are too many values for the django list filter to be useful; I wasn't sure if you meant search, but that doesn't seem ideal either).

From the last meeting, I now know we would like to sort on this field — I'd like to track that as a separate user story, since it's a new requirement. If we add logic to parse this field into an actual date/time (or partial date for some? or approximate date?), then we could turn on django's date hierarchy, which would give you a nice way to drill down based on date. We'll also want to add logic to set the 'date_entered' field for new documents created in the database after the migration is complete.
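
For reference, once the legacy input date is parsed into a real date field (the 'date_entered' field mentioned above), enabling the admin date drill-down is a one-line setting. A minimal sketch, assuming a hypothetical DocumentAdmin:

```python
# Sketch only: Django's built-in date drill-down on the admin change list,
# assuming a hypothetical DocumentAdmin and a parsed date_entered field.
from django.contrib import admin

class DocumentAdmin(admin.ModelAdmin):
    date_hierarchy = "date_entered"  # adds year/month/day drill-down links
```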

FYI, ~9000 of the documents currently imported have no legacy input date. Do we need to do anything about that?

@richmanrachel can you coordinate writing a story for the input date and adding it to GitHub? Could just be something to the effect that you want to sort and do date-based filtering on input date with a brief explanation of why it's valuable.

@richmanrachel

Changing status to "tested needs attention" while awaiting a fix for search by multifragment.

@rlskoeser - could you give me a sample of what kind of entries are in the input date field so I can think about how to make it into a user story?

@richmanrachel richmanrachel added ⚠️ tested needs attention Has been through acceptance testing and needs additional work and removed 🗜️ awaiting testing Implemented and ready to be tested labels Mar 8, 2021
@rlskoeser
Contributor

could you give me a sample of what kind of entries are in the input date field so I can think about how to make it into a user story?

Sure. Some of them are full dates in MM/DD/YYYY format; some are year only; a few have ranges; some have notes in addition to a date. Some include multiple dates, and I'm not sure what we should do with that. Here are some examples (a rough parsing sketch follows the list):

  • Rev. MR Nov 2018
  • August 2017–November 2018
  • August 2017–November 2018; 08/2020
  • 2017
  • Nov 2018
  • 9/5/2020
  • 7/17/1990; 9/5/2020
  • 2017; 11/21/2020
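
A rough sketch of what best-effort parsing of these values could look like; the formats handled and the split behavior are assumptions for discussion, not the import script's actual logic:

```python
# Sketch only: best-effort parsing of legacy input date strings like the
# examples above into date objects. Formats and behavior are assumptions.
import re
from datetime import datetime

DATE_FORMATS = ("%m/%d/%Y", "%B %Y", "%b %Y", "%Y")

def parse_one(token: str):
    """Parse a single token like '9/5/2020', 'Nov 2018', or '2017'."""
    token = token.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(token, fmt).date()
        except ValueError:
            continue
    return None

def parse_legacy_input_date(value: str):
    """Return the dates that parse cleanly; tokens with notes or unknown formats are skipped."""
    tokens = re.split(r"[;–-]", value)  # split ranges and multiple entries
    return [d for d in (parse_one(t) for t in tokens) if d is not None]

# e.g. parse_legacy_input_date("August 2017–November 2018; 08/2020")
# -> [date(2017, 8, 1), date(2018, 11, 1)]  ("08/2020" is not handled here)
```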

@rlskoeser
Contributor

Revise multifragment handling on import (a sketch follows the list):

  • fragments with a multifragment value set should get boolean multifragment set true
  • actual multifragment value should be set on the text block
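
A minimal sketch of that handling, assuming hypothetical Fragment and TextBlock objects with these attribute names:

```python
# Sketch only: revised multifragment handling. Fragment.is_multifragment and
# TextBlock.multifragment are assumed attribute names, not the actual schema.
def apply_multifragment(fragment, text_block, multifrag_value):
    """Flag the fragment and keep the raw spreadsheet value on the text block."""
    if multifrag_value:
        fragment.is_multifragment = True            # boolean flag on the fragment
        text_block.multifragment = multifrag_value  # original spreadsheet value
```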

@rlskoeser rlskoeser added 🗜️ awaiting testing Implemented and ready to be tested and removed ⚠️ tested needs attention Has been through acceptance testing and needs additional work 🆕 enhancement New feature or request labels Mar 10, 2021
@richmanrachel

@kmcelwee - Marina really likes your Display Name category, and is adding a new column in the languages ontology: https://docs.google.com/spreadsheets/d/1m-6SWU2gSNcferuU4Uzri2IQLUSbj3U5voas0xCofOQ/edit#gid=0

She's trying to address most of the other languages in that spreadsheet too. Coptic numerals are a bigger issue (they're technically Greek, but you don't want to say there's Greek on a document w/o Greek language, and most people call them "Coptic").

@rlskoeser
Contributor

I added a new story so we can track importing language+script display name separately. #98

@richmanrachel

Looks great! Closing :)
