
As a global admin, I want a one-time import of all documents and fragments currently in the PGP spreadsheet and the fields in the db populated accordingly, in order to work with the data in the database. #66

Closed
40 tasks done
sluescher opened this issue Feb 9, 2021 · 15 comments

@sluescher

sluescher commented Feb 9, 2021

testing notes

Check a variety of documents and fragments from the PGP metadata spreadsheet and test how they have been imported.

for fragments:

  • check that fields are populated accurately from the spreadsheet:
    • shelfmark
    • historic shelfmark
    • library/collection
    • multifragment (yes/no)
    • link to image
    • iiif url for CUL documents with link to image (can verify via iiif viewer on fragment edit page)
    • test that items with Library CUL in spreadsheet are assigned to the right collection based on shelfmark (T-S, Or., Add.)
  • check that record history documents creation via import script

for documents:

  • check that these fields are populated accurately from the spreadsheet:
    • PGPID
    • type
    • description
    • tags
    • languages — language+script based on list of languages (preliminary mapping)
    • probable languages — language+script based on languages listed with question mark
    • language note — should include text of language + parenthetical notes on vocalization, diacritics, etc
    • legacy input by
    • legacy input date
    • a text block for each associated fragment; should include when present:
      • side (recto/verso)
      • text block label (text block in spreadsheet -> extent label in database)
      • multifragment value
  • check that record history documents creation via import script

Check a few documents with joins to confirm that the document is linked to all fragments referenced by shelfmark in the join column


I want the following fields populated from the spreadsheet: library, shelfmark (current/historical), recto or verso, language/script, description, type and tags, and, if available, link to image.

dev notes

revisions after testing:

  • fragments with a multifragment value set should get boolean multifragment set true
  • actual multifragment value should be set on the text block
  • for fragments, should populate:
    • shelfmark
    • historic shelfmark
    • library/collection
    • infer missing library based on shelfmark (data cleanup requested; see the sketch after this list)
    • multifragment
    • link to image
    • infer iiif url based on linked to image where possible
  • for documents, should populate:
    • PGP ID
    • language/script
    • description
    • type
    • tags
    • associate with fragments based on shelfmark and any shelfmarks included in the join field
  • on document/fragment through model, track:
    • side (recto/verso)
    • text block
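
For the library/collection inference mentioned in the list above, here is a minimal sketch of a prefix-based lookup. The CUL prefixes (T-S, Or., Add.) come from the testing notes; the function name and return value are illustrative, not the import script's actual code.

```python
# Sketch only: infer a CUL collection from the shelfmark prefix.
# Prefixes follow the testing notes (T-S, Or., Add.); names here are
# illustrative, not the import script's actual API.
CUL_COLLECTION_PREFIXES = ("T-S", "Or.", "Add.")

def infer_cul_collection(shelfmark: str):
    """Return the collection prefix implied by a CUL shelfmark, or None."""
    for prefix in CUL_COLLECTION_PREFIXES:
        if shelfmark.startswith(prefix):
            return prefix
    return None

# e.g. infer_cul_collection("T-S 8J37.1") == "T-S"
```

The same prefix approach could be applied to infer a missing library for the larger sets (ENA, T-S), if the team confirms that can be done reliably.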
@kmcelwee
Contributor

Some potential pitfalls in parsing each field are listed in #1.

@rlskoeser rlskoeser self-assigned this Feb 12, 2021
@thatbudakguy thatbudakguy added the 🆕 enhancement New feature or request label Feb 15, 2021
@rlskoeser rlskoeser changed the title As a global admin, I want a one-time import of all documents currently in the PGP spreadsheet and the fields in the db populated accordingly, in order to work with the data in the database. As a global admin, I want a one-time import of all documents and fragments currently in the PGP spreadsheet and the fields in the db populated accordingly, in order to work with the data in the database. Feb 18, 2021
@rlskoeser
Contributor

Re-estimating as 5 points since the story now includes importing documents and fragments.

@rlskoeser
Contributor

Hello @sluescher — several questions and notes based on what's been done so far. (I'm sure you'll need to confer or coordinate with others on many of these, but thought I'd let you decide who and how to resolve.)

  1. Multiple types. For now, I am ignoring rows where there are two types separated by a semicolon. I propose we write a separate user story for that use case (are we still planning to split/demerge records on import?) so it can be implemented and tested after the more normal cases are handled.
  2. Missing Library. There are a number of records where the Library field is empty. I'm attaching a CSV file with PGPID and current shelfmark (the file is named .csv.txt because GitHub won't let me upload a csv file 🤷‍♀️). I suggest cleanup at least for the unusual shelfmarks and the small sets; if you want us to infer the collection based on prefix for the larger sets and it can be done reliably (ENA, T-S?), please let us know. pgp-missing-libraries.csv.txt
  3. What should be done with "unknown" in the recto and verso field?
  4. I think I mentioned this in our user story writing session, but so it's documented: we've named recto/verso "side".
  5. Also discussed in user story session, documenting here: legacy input information (input by and date entered) will be stored as readonly text fields so that the information is available to you (but we're not going to parse it or structure it beyond that).

shelfmark & join questions

  1. What should we do with shelfmark UNKNOWN 1 ?
  2. Do some shelfmarks include recto/verso information, or is this still part of the shelfmark? Examples: Yevr.-Arab. I 1700.21r–21v (PGPID 9403), Yevr.-Arab. I 1700.1v–2r (PGPID 31822).
  3. Some shelfmarks appear to include a version reference; is this still part of the shelfmark? (Examples: T-S 8J37.1 Ver. 2 / PGPID 9064; T-S 18J4.18 Ver. 2 / PGPID 5534)
  4. Some shelfmarks are listed in join notation (examples: PGPID 30817, 30845); what does this indicate? (Note: I think these are all included in the missing libraries csv attached above, since they also don't have libraries).
  5. In 21 rows with joins, the main shelfmark is not included in the join. Are these mistakes? List attached for review. pgp-shelfmark-notinjoin.csv.txt

language questions and requests

  1. Please provide a mapping from language as listed in the metadata spreadsheet to the corresponding combination of Language+Script. I'm attaching a file with a rough list of unique languages currently in the spreadsheet. There are some typos or other errors, and some with non-linguistic information ("no image"); I suggest these be cleaned up in the spreadsheet. pgp-metadata-languages.txt
  2. I propose that notes on vocalization, vowels, diacritics in the language field in the spreadsheet be preserved in a field on the document entity. Is this acceptable, and would you suggest a name for this field?
  3. What should be done with questionable identifications? (e.g. "some xx", "language (?)")
  4. What about unknown/unidentified in the language field?

@sluescher
Author

sluescher commented Feb 25, 2021

  • Multiple types. Checking and reporting back on best way to approach this.

  • Missing Library. Almost fixed (one library to be added)

  • What should be done with "unknown" in the recto and verso field? There should not be any unknowns left?

  • Legacy input information changes: to be discussed in meeting.

  • What should we do with shelfmark UNKNOWN 1 ? FOUND! Actual shelfmark added

  • Do some shelfmarks include recto/verso information, or is this still part of the shelfmark? Needs to be discussed; these are special cases.

  • Some shelfmarks appear to include a version reference; is this still part of the shelfmark? (Examples: T-S 8J37.1 Ver. 2 / PGPID 9064; T-S 18J4.18 Ver. 2 / PGPID 5534) Will be cleaned.

  • Some shelfmarks are listed in join notation (examples: PGPID 30817, 30845); what does this indicate? Alan will check

  • Mapping from language and cleanup in the spreadsheet: in progress.

  • I propose that notes on vocalization, vowels, diacritics in the language field in the spreadsheet be preserved in a field on the document entity. Is this acceptable, and would you suggest a name for this field? Agreed. Name to come

  • What should be done with questionable identifications? (e.g. "some xx", "language (?)") As discussed in the meeting on 2/25, some will be dispensed with; "language (possible)" to be added to the document model.

  • What about unknown/unidentified in the language field? Unidentified needs to stay, as for some we just don't know the language.

@kmcelwee
Contributor

kmcelwee commented Mar 2, 2021

List of languages that need to be corrected by the PGP team:

PGPID: 30500, Language: German
PGPID: 30616, Language: Catalan
PGPID: 31025, Language: Turkish
PGPID: 31082, Language: Arabo-Hebrew
PGPID: 31083, Language: Hebrew or Judaeo-Arabic
PGPID: 30977, Language: Turkish
PGPID: 19008, Language: Aramaic, Hebrew
PGPID: 19010, Language: Aramaic, Hebrew
PGPID: 19112, Language: Hebrew: Judaeo-Arabic
PGPID: 31369, Language: Arabo-Hebrew
PGPID: 31268, Language: English
PGPID: 31344, Language: Portuguese
PGPID: 19744, Language: Catalan
PGPID: 19769, Language: nomina barbara
PGPID: 31490, Language: Arabo-Hebrew
PGPID: 11621, Language: Sanskrit
PGPID: 11621, Language: English
PGPID: 11135, Language: Sanskrit
PGPID: 11135, Language: English
PGPID: 11112, Language: no image
PGPID: 11951, Language: cipher
PGPID: 11284, Language: Judaeo-arabic
PGPID: 12101, Language: Juaeo-Arabic
PGPID: 12230, Language: Ara
PGPID: 15435, Language: no image
PGPID: 15469, Language: Image missing
PGPID: 31232, Language: Greek or Coptic
PGPID: 32427, Language: English
PGPID: 32434, Language: English
PGPID: 28503, Language: Hebrewq
PGPID: 32295, Language: Judaeo-Arabic: Hebrew
PGPID: 7024, Language: Judaeo-ARabic
PGPID: 2910, Language: Judeao-Arabic
PGPID: 29377, Language: Christian Palestinian Aramaic
PGPID: 1869, Language: Romance
PGPID: 32299, Language: Ottoman Turkish
PGPID: 20341, Language: Arabic Judaeo-Arabic
PGPID: 20910, Language: Arabo-Hebrew
PGPID: 29264, Language: unidentified
PGPID: 17564, Language: Judaeo-Arabi
PGPID: 32113, Language: Gujarati
PGPID: 16817, Language: Coptic numerals
PGPID: 29070, Language: Coptic numerals
PGPID: 18874, Language: Coptic numbers
PGPID: 29175, Language: Coptic numerals
PGPID: 30740, Language: German
PGPID: 30747, Language: Catalan
PGPID: 8484, Language: Hebrew, Judaeo-Arabic
PGPID: 30749, Language: Romance
PGPID: 23956, Language: Portuguese
PGPID: 25907, Language: Judeo-Arabic
PGPID: 26304, Language: N/A
PGPID: 27870, Language: unknown language

We'll need the PGP team to either add these languages to the ontology spreadsheet or correct them in the metadata spreadsheet.

@rlskoeser rlskoeser added the 🗜️ awaiting testing Implemented and ready to be tested label Mar 3, 2021
@rlskoeser
Contributor

Preliminary import is available on the test site. Note that some aspects of this are still provisional, pending some data work and decisions from the team. For now, we're using a provisional mapping of languages in the spreadsheet to language name in the new Language+Script model, and we're ignoring records that need to be demerged.
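
To make the provisional mapping concrete, here is a minimal sketch of the lookup shape, with placeholder entries; the real mapping is the one the PGP team provides.

```python
# Sketch only: provisional lookup from a spreadsheet language value to a
# Language+Script combination. Entries are placeholders for illustration.
LANGUAGE_SCRIPT_MAP = {
    "Hebrew": ("Hebrew", "Hebrew script"),
    "Judaeo-Arabic": ("Judaeo-Arabic", "Hebrew script"),
    "Arabic": ("Arabic", "Arabic script"),
}

def lookup_language(value: str):
    """Return (language, script, probable) for one spreadsheet value, or None.

    A trailing question mark marks a probable identification, which goes to
    the document's probable languages rather than its languages.
    """
    cleaned = value.strip()
    probable = cleaned.endswith("?")
    mapped = LANGUAGE_SCRIPT_MAP.get(cleaned.rstrip("?").strip())
    if mapped is None:
        return None  # unmapped values are reported in the import output and skipped
    return mapped[0], mapped[1], probable
```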

I'm attaching the full output of the import script — kind of noisy because it reports on all the skipped records with multiple types and missing languages, but may be useful to refer to as you're testing.
import_data_output.txt

@richmanrachel

  • Filtering by multi-fragment would be helpful (so that we can fix them more easily).
  • We're excited about the multi-fragment setup and thinking about how it could be modified to fit other cases too (bifoliums, court notebooks, and distinguishing between multi-fragments that share the same support or have just been put together by libraries).
  • Legacy input date is present, but we would like to be able to filter for it.
  • Website just stopped working, but noting that we'll check Bodl. MS Heb. f 56/16 for the multi-fragment text block.

@richmanrachel

@rlskoeser - I'm having trouble tracking down any multifragments that show up as such in the database. None of the 25 documents that the PGP search finds (under "multifragment" in the description) seem to have enough information in the spreadsheet to migrate to the new database. I'm trying to think of a creative way to find the right docs to search, but I'll need to come back to this with more energy later.

Everything else looks great!

@rlskoeser
Contributor

Ah, sounds like the multifragment filter problem you identified on #75 is a blocker for the testing. Should have a fix for that soon.

  • Legacy input date is present, but we would like to be able to filter for it.

What kind of filter are you thinking would be useful? (I tested and there are too many values for the django list filter to be useful; I wasn't sure if you meant search, but that doesn't seem ideal either).

From the last meeting, I now know we would like to sort on this field — I'd like to track that as a separate user story, since it's a new requirement. If we add logic to parse this field into an actual date/time (or partial date for some? or approximate date?), then we could turn on django's date hierarchy, which would give you a nice way to drill down based on date. We'll also want to add logic to set the 'date_entered' field for new documents created in the database after the migration is complete.
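
For reference, once the legacy input date is parsed into a real date field (the 'date_entered' field mentioned above), enabling the admin date drill-down is a one-line setting. A minimal sketch, assuming a hypothetical DocumentAdmin:

```python
# Sketch only: Django's built-in date drill-down on the admin change list,
# assuming a hypothetical DocumentAdmin and a parsed date_entered field.
from django.contrib import admin

class DocumentAdmin(admin.ModelAdmin):
    date_hierarchy = "date_entered"  # adds year/month/day drill-down links
```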

FYI, ~9000 of the documents currently imported have no legacy input date. Do we need to do anything about that?

@richmanrachel can you coordinate writing a story for the input date and adding it to GitHub? Could just be something to the effect that you want to sort and do date-based filtering on input date with a brief explanation of why it's valuable.

@richmanrachel

Changing status to "tested needs attention" while awaiting a fix for search by multifragment.

@rlskoeser - could you give me a sample of what kind of entries are in the input date field so I can think about how to make it into a user story?

@richmanrachel richmanrachel added ⚠️ tested needs attention Has been through acceptance testing and needs additional work and removed 🗜️ awaiting testing Implemented and ready to be tested labels Mar 8, 2021
@rlskoeser
Contributor

could you give me a sample of what kind of entries are in the input date field so I can think about how to make it into a user story?

Sure. Some of them are full dates in MM/DD/YYYY format; some are year only; a few have ranges; some have notes in addition to a date. Some include multiple dates, and I'm not sure what we should do with that. Here are some examples (a rough parsing sketch follows the list):

  • Rev. MR Nov 2018
  • August 2017–November 2018
  • August 2017–November 2018; 08/2020
  • 2017
  • Nov 2018
  • 9/5/2020
  • 7/17/1990; 9/5/2020
  • 2017; 11/21/2020
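
A rough sketch of what best-effort parsing of these values could look like; the formats handled and the split behavior are assumptions for discussion, not the import script's actual logic:

```python
# Sketch only: best-effort parsing of legacy input date strings like the
# examples above into date objects. Formats and behavior are assumptions.
import re
from datetime import datetime

DATE_FORMATS = ("%m/%d/%Y", "%B %Y", "%b %Y", "%Y")

def parse_one(token: str):
    """Parse a single token like '9/5/2020', 'Nov 2018', or '2017'."""
    token = token.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(token, fmt).date()
        except ValueError:
            continue
    return None

def parse_legacy_input_date(value: str):
    """Return the dates that parse cleanly; tokens with notes or unknown formats are skipped."""
    tokens = re.split(r"[;–-]", value)  # split ranges and multiple entries
    return [d for d in (parse_one(t) for t in tokens) if d is not None]

# e.g. parse_legacy_input_date("August 2017–November 2018; 08/2020")
# -> [date(2017, 8, 1), date(2018, 11, 1)]  ("08/2020" is not handled here)
```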

@rlskoeser
Contributor

Revise multifragment handling on import (a sketch follows the list):

  • fragments with a multifragment value set should get boolean multifragment set true
  • actual multifragment value should be set on the text block
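
A minimal sketch of that handling, assuming hypothetical Fragment and TextBlock objects with these attribute names:

```python
# Sketch only: revised multifragment handling. Fragment.is_multifragment and
# TextBlock.multifragment are assumed attribute names, not the actual schema.
def apply_multifragment(fragment, text_block, multifrag_value):
    """Flag the fragment and keep the raw spreadsheet value on the text block."""
    if multifrag_value:
        fragment.is_multifragment = True            # boolean flag on the fragment
        text_block.multifragment = multifrag_value  # original spreadsheet value
```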

@rlskoeser rlskoeser added 🗜️ awaiting testing Implemented and ready to be tested and removed ⚠️ tested needs attention Has been through acceptance testing and needs additional work 🆕 enhancement New feature or request labels Mar 10, 2021
@richmanrachel

@kmcelwee - Marina really likes your Display Name category, and is adding a new column in the languages ontology: https://docs.google.com/spreadsheets/d/1m-6SWU2gSNcferuU4Uzri2IQLUSbj3U5voas0xCofOQ/edit#gid=0

She's trying to address most of the other languages in that spreadsheet too. Coptic numerals are a bigger issue (they're technically Greek, but you don't want to say there's Greek on a document w/o Greek language, and most people call them "Coptic").

@rlskoeser
Contributor

I added a new story so we can track importing language+script display name separately. #98

@richmanrachel

Looks great! Closing :)
