This repository has been archived by the owner on Oct 11, 2022. It is now read-only.

As a researcher, I want the structured text file and incipits parsed into canonical stories, story instances, and manuscripts and imported into Google Sheets so I can work with the data in a more structured form. #11

Closed
12 tasks done
rlskoeser opened this issue Jan 21, 2020 · 7 comments

@rlskoeser
Contributor

rlskoeser commented Jan 21, 2020

  • document functions/methods
  • put generated file in an output folder that is git ignored; maybe add datestamp to filenames?
  • mss ids like PEth 41.8 and EMIP 601.225 should only use the number before the period. The number after the period should go in the order field - applies to PUL, EMIP, and maybe EMDL.
  • If there is a question mark after the period, ignore it (#.?)
  • for imported incipits: mark as macomber incipit (new field) and set confidence to high
  • for mss references with numbers not in parens, like ZBNE 60-29, CRA 52-35, and SBLE 32-28: the number after the dash should go in the story order field; there are no folios
  • in the MSS: line, use the previous repository/collection for subsequent references if not specified (e.g. in ZBNE 60-29; 61-27; both are ZBNE)
  • Put 'English translation:' contents into new canonical story field of same name
  • Put 'Text:' contents into new canonical story field 'Print Version'
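
The period and dash rules in the list above could be sketched roughly as follows. This is not the project's actual script: the function name `split_reference`, the regular expressions, and the choice to return an empty order for `#.?` are all assumptions made for illustration.

```python
import re

# Hypothetical patterns (assumptions): the numeric part of a reference only.
PERIOD_REF = re.compile(r"^(?P<ms>\d+)\.(?P<order>\d+|\?)$")  # e.g. 41.8 or 41.?
DASH_REF = re.compile(r"^(?P<ms>\d+)-(?P<order>\d+)$")        # e.g. 60-29


def split_reference(number):
    """Split the numeric part of a reference into (manuscript id, story order).

    Period notation (PEth 41.8) and dash notation (ZBNE 60-29) both put the
    manuscript id first and the story order second. A question mark after
    the period is ignored, so the order is left empty (None) in that case.
    """
    number = number.strip()
    for pattern in (PERIOD_REF, DASH_REF):
        match = pattern.match(number)
        if match:
            order = match.group("order")
            return match.group("ms"), (None if order == "?" else order)
    # No period or dash: treat the whole value as the manuscript id.
    return number, None
```

For example, `split_reference("601.225")` yields manuscript id `601` with order `225`, so the EMIP manuscript is recorded as 601 rather than 601.225.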

additional changes 2/24/2020

  • make folio notation consistent: convert all "r" to "a" and all "v" to "b"
  • handle single-manuscript collections: C-, CL-, CBS-, and G- only have 1 mss, so the number that follows is the miracle order. Give them all manuscript id 1.
  • only infer folio start = folio end for single folios if manuscript is in PEth, EMDL, or EMIP (leave folio end empty for all others)
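
A minimal sketch of the three additional rules, under stated assumptions: the function names are hypothetical, the single-manuscript sigla are matched exactly as written in the list, and folios are assumed to be a number followed by a single `r` or `v`.

```python
import re

# Collections listed above as holding a single manuscript (matched exactly).
SINGLE_MS_COLLECTIONS = {"C-", "CL-", "CBS-", "G-"}
# Collections where a single folio implies folio end == folio start.
INFER_FOLIO_END = {"PEth", "EMDL", "EMIP"}


def normalize_folio(folio):
    """Convert recto/verso suffixes to a/b, e.g. "7r" becomes "7a"."""
    return re.sub(
        r"(?<=\d)[rv]$",
        lambda m: "a" if m.group() == "r" else "b",
        folio.strip(),
    )


def folio_range(collection, folio):
    """Return (folio start, folio end) for a single-folio reference."""
    start = normalize_folio(folio)
    # Leave folio end empty unless the collection is PEth, EMDL, or EMIP.
    end = start if collection in INFER_FOLIO_END else ""
    return start, end


def manuscript_and_order(collection, number):
    """Return (manuscript id, miracle order) for a collection and number."""
    if collection in SINGLE_MS_COLLECTIONS:
        # Only one manuscript: the number is the miracle order, id is 1.
        return "1", number
    return number, ""
```

Keeping the collection lists as module-level sets means a new single-manuscript collection can be added without touching the parsing logic.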
@rlskoeser rlskoeser changed the title from "script to parse structured text file and generate CSVs for Macomber stories, story instances, and manuscripts" to "As a data curator, I want the structured text file and incipits parsed into canonical stories, story instances, and manuscripts and imported into Google Sheets so I can work with the data in a more structured form." Jan 28, 2020
@rlskoeser rlskoeser self-assigned this Jan 28, 2020
@rlskoeser rlskoeser added this to the v0.2.0 milestone Feb 3, 2020
@rlskoeser rlskoeser changed the title from "As a data curator, I want the structured text file and incipits parsed into canonical stories, story instances, and manuscripts and imported into Google Sheets so I can work with the data in a more structured form." to "As a researcher, I want the structured text file and incipits parsed into canonical stories, story instances, and manuscripts and imported into Google Sheets so I can work with the data in a more structured form." Feb 3, 2020
@WendyLBelcher
Collaborator

Just a note to say that the issue of "mss ids like PEth 41.8 and EMIP 601.225" is important, in case it is getting lost in the mix. I don't know how to comment on a field!

@rlskoeser
Contributor Author

@WendyLBelcher it's on my list! I'm sorry I haven't been able to get back to it yet. I started looking at it but discovered I needed to add some tests before I updated my script because I'm handling so many different cases now, and was worried about breaking things I already have working...

@rlskoeser rlskoeser added the "awaiting testing" label (issue is ready for acceptance testing) Feb 21, 2020
@rlskoeser
Contributor Author

Increasing from 3 to 5 pts for the complexity of handling variation in manuscript references and available information.

@rlskoeser
Contributor Author

rlskoeser commented Mar 2, 2020

Problem documented by @elambrinaki on #29

On the Story Instance sheet, there are repeating rows, which I assume result from importing manuscripts that are separated by ";" without the manuscript name repeated.

For example, on the Story Instance sheet, Canonical Story ID 404 is matched to G-Vatican (BAV) 92 three times. In the primary source, MAC0404 is listed in three different G-Vatican (BAV) manuscripts (GVE 92(7a); 146(69b); 242(23b)). In our Google Sheets, instead of three different rows (G-Vatican (BAV) 92 folio start 7a, G-Vatican (BAV) 146 folio start 69b, and G-Vatican (BAV) 242 folio start 23b), we have three identical rows (G-Vatican (BAV) 92 7a).

The same thing happens with this story and CR-Paris (BNF) 52. In the primary source, there are three manuscripts (CRA 52-91; 52(12a); 55(10b)), so the import should result in 1) CR-Paris (BNF) 52, miracle number 81, 2) CR-Paris (BNF) 52, folio start 12a, 3) CR-Paris (BNF) 55, folio start 10b.

  • fix logic for repeated manuscripts within one repository
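
One way the repeated-manuscript logic could look, as a sketch only: the regular expression, the `parse_mss` name, and the row dictionary keys are assumptions, not the script's real structure. The key point is that each ";"-separated reference is parsed on its own, and only the repository siglum is carried forward when it is omitted.

```python
import re

# One reference: optional repository siglum, manuscript number, then either
# "(folio)" or "-order". This pattern is an illustrative assumption.
REFERENCE = re.compile(
    r"^(?:(?P<repo>[A-Za-z]+)\s+)?(?P<ms>\d+)"
    r"(?:\((?P<folio>[^)]+)\)|-(?P<order>\d+))?$"
)


def parse_mss(line):
    """Parse references like "GVE 92(7a); 146(69b)" into one row each."""
    rows = []
    repository = None
    for chunk in line.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a trailing semicolon
        match = REFERENCE.match(chunk)
        if match is None:
            raise ValueError(f"unrecognized reference: {chunk!r}")
        # Carry the previous repository forward when none is given.
        repository = match.group("repo") or repository
        rows.append({
            "repository": repository,
            "manuscript": match.group("ms"),
            "folio_start": match.group("folio") or "",
            "order": match.group("order") or "",
        })
    return rows
```

With this approach, `parse_mss("GVE 92(7a); 146(69b); 242(23b)")` produces three distinct rows for manuscripts 92, 146, and 242 rather than three copies of the first reference.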

@WendyLBelcher
Collaborator

WendyLBelcher commented Mar 2, 2020

  • For conversion of text file, change 'BM': 'B-Oslo (SCOL)',
  • For conversion of text file, change 'WBLE': 'W-London (BM)',

@rlskoeser
Contributor Author

Revised to correct the repository mapping and correct logic for repeating manuscripts when manuscript name/repository does not repeat.

@WendyLBelcher
Collaborator

I believe this Issue can be closed.

@rlskoeser rlskoeser removed the "awaiting testing" label (issue is ready for acceptance testing) Mar 12, 2020