Balance between text and data? #9

Closed
seanredmond opened this issue Mar 6, 2018 · 1 comment

Comments

@seanredmond
Member

So far, some bits of the data are going to be converted to attributes, meaning they'll be taken out of the text representation of the XML, though the data is preserved. Can we decide on a principle to guide when that occurs? To take the copies element as an example:

[Image: screenshot, 2018-03-06 12:33:08 pm]

it could (1) just be text

<copies>2c 3May51</copies>

The current proposal (2) from DCL is to regularize the date (see #8)

<copies date="1951-05-03">2c</copies> 

But we could go further (3) and just parse out the number of copies, too, so that it's an empty tag

<copies date="1951-05-03" num="2"/>

Or combine the first and third (4)

<copies date="1951-05-03" num="2">2c 3May51</copies>

I think we should pick either the first or the last (and really, I think the last is the best option). They both preserve the original information. The second (currently proposed) version does some of the processing up front and makes later processing easier, but it leaves out an important piece: the original text of the date. The last option will be the easiest to deal with for both human and machine.
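As a rough illustration of how option (4) could be produced from the raw text, here is a minimal Python sketch. It is not the project's actual tooling. The function name `annotate_copies`, the regular expression, and the assumption that two-digit years belong to the 1900s are all hypothetical, and only the `<N>c <D><Mon><YY>` pattern seen in the example is handled.

```python
import re
import xml.etree.ElementTree as ET
from datetime import date

# Month abbreviations as they appear in entries like "3May51" (assumed English).
MONTHS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}

# Hypothetical pattern: "<count>c <day><Mon><yy>", e.g. "2c 3May51".
COPIES_RE = re.compile(r"^(\d+)c\s+(\d{1,2})([A-Za-z]{3})(\d{2})$")


def annotate_copies(text, century=1900):
    """Wrap `text` in <copies>, adding date/num attributes when it parses.

    The text content is never altered; unparseable input simply gets
    no attributes. `century` is an assumption about two-digit years.
    """
    elem = ET.Element("copies")
    elem.text = text
    match = COPIES_RE.match(text.strip())
    if match:
        num, day, mon, yy = match.groups()
        month = MONTHS.get(mon.capitalize())
        if month is not None:
            parsed = date(century + int(yy), month, int(day))
            elem.set("date", parsed.isoformat())
            elem.set("num", str(int(num)))
    return elem


print(ET.tostring(annotate_copies("2c 3May51"), encoding="unicode"))
```

Because the original string is stored as the element text and the parsed values only go into attributes, a failed parse loses nothing: the entry stays as plain text, as in option (1).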

@seanredmond
Member Author

After some offline discussion, we're going to handle this according to a few principles:

  1. Try to capture everything: Don't assume any detail will be uninteresting
  2. Don't add or remove any text: If you strip the XML tags, you should end up with the original text of the entry
  3. Add data and interpretation as attributes: Following the previous principle, anything we add (for convenience, regularization, etc.) should be added as attributes.
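The second principle is mechanically checkable. The following is a minimal Python sketch, assuming each entry is available as a well-formed XML fragment alongside its original text; the helper names `strip_tags` and `preserves_text` are hypothetical, not part of the project.

```python
import xml.etree.ElementTree as ET


def strip_tags(xml_fragment):
    """Return the concatenated text content of a fragment, tags removed."""
    root = ET.fromstring(xml_fragment)
    return "".join(root.itertext())


def preserves_text(original, xml_fragment):
    """True if stripping the tags gives back exactly the original text."""
    return strip_tags(xml_fragment) == original


original = "2c 3May51"

# The attribute-plus-text form keeps the original text intact.
print(preserves_text(original, '<copies date="1951-05-03" num="2">2c 3May51</copies>'))

# Moving the date into an attribute only drops "3May51" from the text.
print(preserves_text(original, '<copies date="1951-05-03">2c</copies>'))
```

A check like this could run over every converted entry, so that any markup step that adds or removes text is caught early.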

For this particular issue then, we will go with the last option:

<copies date="1951-05-03" num="2">2c 3May51</copies>

which both preserves the original text and adds some derived attributes that will make the data easier to work with.
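To show what "easier to work with" might look like for a consumer of the data, here is a small Python sketch that reads the derived attributes directly instead of re-parsing the text. The function name `read_copies` and the returned dictionary shape are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET
from datetime import date


def read_copies(xml_fragment):
    """Read a <copies> element into typed values.

    Missing attributes come back as None, so entries that were left
    as plain text can still be read.
    """
    elem = ET.fromstring(xml_fragment)
    num = elem.get("num")
    when = elem.get("date")
    return {
        "text": elem.text,  # original wording, untouched
        "num": int(num) if num is not None else None,
        "date": date.fromisoformat(when) if when is not None else None,
    }


entry = read_copies('<copies date="1951-05-03" num="2">2c 3May51</copies>')
print(entry["num"], entry["date"].year, entry["text"])
```

The original text remains available for display or for re-checking the interpretation, while the attributes carry the values needed for sorting and counting.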
