Consider updating EML’s unitDictionary list #289

Closed
mobb opened this issue Feb 23, 2018 · 26 comments

Contributor

@mobb mobb commented Feb 23, 2018

There are several issues logged which relate to the list of units built into EML. It would help constructors to have these cleaned up. The LTER created a unit resource so that constructors can reuse custom units created by other LTER sites, and see examples. It is a relational database (interface at http://unit.lternet.edu) containing both the built-in units and units contributed by many LTER sites. LTER also created best-practice guidelines for units (for both their dictionary and EML), but these have not yet been routinely applied to the dictionary.

EDI needs a unit list for its own EML construction, and the LTER unit dictionary is a candidate. A “clean” list of units from the LTER-dictionary may be a good candidate for EML as well. So the following tasks benefit both groups.

Steps:

  1. Migrate the LTER guidelines to the EDI website (side task, not directly related to EML)
  2. Finish the vetting of the units in the LTER dictionary
  3. Export as STMML, i.e., for eml-unitDictionary.xml
  4. Examine for use in EML 2.2
  5. Create a proposal for EML-2.2

Related to issues: #139, #140, #115, #284, #285, #286, #287

@mobb mobb self-assigned this Feb 23, 2018

Contributor Author

@mobb mobb commented Feb 23, 2018

Other things to consider:

  1. EML 2.2 will be backward compatible, so all the units currently in the list must be included, whether or not they match the guidelines.
  2. One of the problems with the 2.1 list is that spellings of unitNames are inconsistent. The “unitName” is simply a string id which points to the definition, so from one standpoint, the spelling doesn’t matter. However, since it’s the only required component, it has become, de facto, a meaningful version of the unit. Some of the proposed, vetted additions will simply be alternate spellings of 2.1 units, which is likely to cause confusion (if they show up together).
  3. The LTER-dictionary has an attribute deprecatedInFavorOf, which (if adopted by EML2.2) could be used to shorten the list (by demoting non-preferred spellings, e.g., in an EML editor), while still allowing them in docs.
  4. STMML is no longer widely used. Even the current version (stmml-1.1) was created by eml-dev for EML-2.1 (with permission from 1.0’s authors).
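
For illustration, item 3 could work roughly like the sketch below: an editor offers only units that carry no deprecatedInFavorOf attribute, while deprecated ids still resolve. This is a minimal sketch, not a proposal for the actual tooling. The XML fragment is made up, namespaces are left out, and deprecatedInFavorOf is the attribute proposed here, not something in stmml-1.1. The helper names `preferred_units` and `resolve` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: deprecatedInFavorOf is the attribute proposed in this
# thread, not part of stmml-1.1. Namespaces are omitted to keep the sketch short.
UNIT_LIST = """
<unitList>
  <unit id="meterSquared" name="meterSquared"/>
  <unit id="squareMeters" name="squareMeters" deprecatedInFavorOf="meterSquared"/>
  <unit id="gramPerLiter" name="gramPerLiter"/>
  <unit id="gramsPerLiter" name="gramsPerLiter" deprecatedInFavorOf="gramPerLiter"/>
</unitList>
"""


def preferred_units(xml_text):
    """Return the ids an editor would offer: units without deprecatedInFavorOf."""
    root = ET.fromstring(xml_text)
    return [u.get("id") for u in root.iter("unit")
            if u.get("deprecatedInFavorOf") is None]


def resolve(xml_text, unit_id):
    """Map a possibly deprecated id to its preferred replacement."""
    root = ET.fromstring(xml_text)
    for u in root.iter("unit"):
        if u.get("id") == unit_id:
            # Deprecated ids stay valid in documents; they just point elsewhere.
            return u.get("deprecatedInFavorOf") or unit_id
    raise KeyError(unit_id)


print(preferred_units(UNIT_LIST))
print(resolve(UNIT_LIST, "squareMeters"))
```

The deprecated spellings remain in the list, so existing documents keep validating; only the pick list gets shorter.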

Contributor

@mbjones mbjones commented Feb 23, 2018

@mobb agreed on all counts in your last two comments. I particularly like the addition of a deprecatedInFavorOf attribute, and I don't think it's a problem to add it to STMML because we are the de facto maintainers of our fork.

Member

@csjx csjx commented Feb 23, 2018

Yes, while STMML is not widely used or maintained, we widely use it and maintain the fork as Matt said. I'm in favor of considering a replacement, but only if it is a superset of the functionality and expressivity of STMML, and does have wide use and maintenance. Replacing STMML will also involve a bunch of client software refactoring, so that's something to consider. Did you have a replacement in mind @mobb, or were you just noting how STMML kind of dead-ended from a maintenance/improvement perspective?

Contributor Author

@mobb mobb commented Feb 23, 2018

@csjx - nope - no replacement in mind. Just pointing out that to the best of my knowledge, we are the only ones using it.

@mobb mobb added the next label Feb 23, 2018

Member

@csjx csjx commented Feb 24, 2018

For reference, the NCEAS data team has been investigating the udunits library from Unidata. From a quick scan of the XML instance documents, I don't see a formal schema, but it looks like some extensive work has been put into it and may inform our unit dictionary: https://www.unidata.ucar.edu/software/udunits/

Member

@cboettig cboettig commented Feb 26, 2018

👍 for Unidata units. (The library also has nice R bindings.) It also looks like it would be straightforward to generate STMML unitList definitions from the Unidata udunits2 XML library.

Contributor

@srearl srearl commented Feb 26, 2018

As information managers increasingly use R and tools like ropensci::EML to generate metadata, an approach/framework that would allow for a tighter integration between R and services/packages that would aid documenting units would be very welcome.

@mbjones mbjones added the enhancement label Feb 27, 2018
@mbjones mbjones added this to the EML2.2.0 milestone Feb 27, 2018

Contributor Author

@mobb mobb commented Feb 27, 2018

Re udunits2:
I've played around with this a bit, and the database is pretty comprehensive, and conversions are pretty straightforward. It looks like the grammar can be mapped to stmml, eg, kilogrampermetersquared == kilogram per meter^2
Common notations (kg/m^3) are also supported, I presume as synonyms.
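
As a rough illustration of that mapping, here is a small sketch that turns a camelCase unit id into the spelled-out form with a numeric exponent. It assumes ids are camelCase and that only the suffix words "squared" and "cubed" carry exponents; the function name `id_to_phrase` is made up for this example. Whether udunits2 accepts each resulting string would still need to be tested against the library itself.

```python
import re

# Assumed suffix words; the thread only shows "squared" explicitly.
POWERS = {"squared": "^2", "cubed": "^3"}


def id_to_phrase(unit_id):
    """Turn a camelCase unit id into a spelled-out phrase with numeric exponents."""
    words = [w.lower() for w in re.findall(r"[A-Z]?[a-z]+", unit_id)]
    out = []
    for word in words:
        if word in POWERS and out:
            # Attach the exponent to the base unit that precedes it.
            out[-1] += POWERS[word]
        else:
            out.append(word)
    return " ".join(out)


print(id_to_phrase("kilogramPerMeterSquared"))  # kilogram per meter^2
```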

The mapping between these is probably best placed in the EML-unit dictionary; to do so in udunits2 would be a major fork.

I am not quite clear on their use of unit system. I think it maps to stmml unit type -- in that it represents a unique group of base units that a defined unit belongs to, so that conversions are not allowed between units in different systems. So the STMML unit type amountOfSubstanceConcentration is equivalent to the udunit system that is a quotient of [mole, volume]. Both volume quantities (quart, liter) and length^3 will work. I just can’t quite tell where these systems are defined. But if we simply include the mapping at the unit level, (per the above), udunit2 will take care of it.

I think we are better off not generating a list of STMML units from udunits, though. For a couple of reasons:

  • We would still probably want to generate a unitType, and it’s not clear how
  • Units are combinations of prefix+base, so there would be a whole lot of almost-impossible units generated, e.g., petagram per centimeter^3
  • Some of the older, arcane EML units are not in udunits2 (indianYards!) so manual work is still in order to get them in

That is debatable, of course. But unless someone has a good argument (and wants to do the coding), I will continue with the plan to vet the units in the LTER DB, and export the STMML from that. There may be other units we want to add to EML’s list, but the LTER DB is already backward compatible (with EML 2.1) and is a good representation of what is commonly used now.

Regarding a nascent stmml-1.2, there are now two candidates for optional attributes for a unit:

  • deprecatedInFavorOf
  • synonym_udunit2

Contributor

@maier-m maier-m commented Mar 2, 2018

Udunits understands symbolic units. So synonym_udunit2 would just be the EML abbreviation I believe.

Contributor Author

@mobb mobb commented Mar 2, 2018

that occurred to me too - that abbreviation could hold this (be the synonym). but we have to watch out for character sets. using a dedicated attribute might make it more explicit.

Contributor

@maier-m maier-m commented Mar 3, 2018

The character encoding for udunits2 is "US-ASCII". If there was a problem with characters, the dedicated attribute would then likely have to be the symbolic form of the unit spelled out with care to have numeric exponents... e.g. "meters3 per second".
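
A minimal sketch of that character-set concern: rewrite superscript digits in an abbreviation as ^n and refuse anything that is still not ASCII. The mapping table and the name `to_ascii_symbol` are assumptions for illustration. It only checks the character set and says nothing about whether udunits2 can parse the result.

```python
# Assumed mapping; extend as needed for other superscript characters.
SUPERSCRIPT_EXPONENTS = {"²": "^2", "³": "^3"}


def to_ascii_symbol(abbrev):
    """Rewrite superscript digits as ^n; raise if non-ASCII characters remain."""
    text = abbrev
    for superscript, replacement in SUPERSCRIPT_EXPONENTS.items():
        text = text.replace(superscript, replacement)
    if not text.isascii():
        # Characters such as the micro sign need a manual decision.
        raise ValueError(f"non-ASCII characters remain in {abbrev!r}")
    return text


print(to_ascii_symbol("m³/s"))   # m^3/s
print(to_ascii_symbol("kg/m²"))  # kg/m^2
```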

Contributor Author

@mobb mobb commented Mar 29, 2018

Hi folks!

I am attaching a doc with the plan for EML2.2 units (it's a txt, because git won't accept markdown in attachments): unitDictionary_plan_eml22.txt. I just committed a revised stmml.xsd and a candidate eml-unitDictionary.xml that uses the revised xsd.

A few highlights:

  • eml-unitDictionary XML: The contents are mainly the EML-2.1 units; all the original units are still there, to maintain backward compatibility.
  • One of the problems in the 2.1 list was that spellings of unit/id were inconsistent, which made them difficult to find. SI has recommendations for expressing units: no plural, modifier follows the base unit (e.g. in our pattern then, use meterSquared, not squareMeters). So in the checked-in eml-unitDictionary.xml, where the 2.1 spelling did not follow these rules, a new unit was added and the existing version deprecated in favor of the replacement.
  • Some metrics: 295 units total, 78 were deprecated
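
To make the two spelling rules concrete, here is a heuristic sketch that flags ids with a plural word or with a modifier ahead of the base unit. It is not the procedure used to build the checked-in file. The modifier list, the exception list for unit names that naturally end in "s", and the name `naming_problems` are all assumptions.

```python
import re

# Assumed lists, for illustration only.
LEADING_MODIFIERS = {"square", "cubic"}          # should follow the base unit instead
NOT_PLURAL = {"celsius", "siemens", "gauss"}     # end in "s" but are singular


def naming_problems(unit_id):
    """Return a list of rule violations for one camelCase unit id (heuristic)."""
    words = [w.lower() for w in re.findall(r"[A-Z]?[a-z]+", unit_id)]
    problems = []
    for word in words:
        if word in LEADING_MODIFIERS:
            problems.append(f"modifier '{word}' precedes the base unit")
        elif word.endswith("s") and word not in NOT_PLURAL:
            problems.append(f"'{word}' looks plural")
    return problems


for candidate in ("meterSquared", "squareMeters", "gramsPerLiter"):
    print(candidate, naming_problems(candidate))
```

A check like this only produces candidates for review; each flagged id would still need a human decision about its replacement.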

The plan in brief:

Each unit has 5 important attributes (in xml, 4 attributes and 1 element):

  • unit/@id (required by the schema)
  • unit/@name
  • unit/description
  • unit/@udunitsSynonym
  • unit/@deprecatedInFavorOf

If we promote the udunits package, then this group is no longer needed:
unit/@multiplierToSi, unit/@constantToSi, unit/@parentSi

unit/@unittype: We should talk about unit/@unittype, and whether it is important. There are some issues with the way these are currently named, and units assigned. unit/@unittype may be useful for grouping units.

Still to do:

  1. Examine the plan and example list, evaluate for EML 2.2.
  2. Consider what additional units might be generally useful. I am looking at the units used by the LTER network, and collecting a list of new units used by two or more sites. Some are already included. Someone else could assemble a similar list, perhaps from the ADC. @maier-m perhaps?
  3. Populate the udunitsSynonym field. I manually added this attribute for the first 10 units (through ‘becquerel’). I am hoping one of you who has been working with the udunits package can script this, with a little text processing of the unit/name, and also test whether the udunitsSynonym field will work for us. @csjx - I think you brought up udunits first.

unitDictionary_plan_eml22.txt

Contributor

@maier-m maier-m commented Mar 29, 2018

Hello,

  • To clarify about plurals, is a unit like gramsPerLiter being deprecated in favor of gramPerLiter? I would assume not but was unsure from the above.

  • udunitsSynonym will generally just be the abbreviation of the EML unit. 148 EML units currently have abbreviations (out of 195).

    • Only 13 of these abbreviations are not currently parseable by udunits2. Most of these can be easily adjusted (e.g., M for molarity is not parseable but mole/L is; lbs is not parseable but lb is).
    • Of the current EML units that don't have abbreviations, some will be simple to add and others will be more difficult. Units like bushelsPerAcre, where all the base units are part of the udunits2 system, are simple (bu/acre).
    • EML units not in udunits will be a bit tricky, e.g., Yard_Sears, although something like .91439841461602867 m is parseable by udunits. Also, units like numberPerLiter would have to have a udunitsSynonym of 1/L (although udunits does seem to handle reciprocal units very bizarrely; see below, where udunits says 1 s = 1/s and also trips over 1/s).

[Screenshot, 2018-03-29: udunits output for reciprocal units]

  • I love the idea of linking to udunitsSynonym but wonder if users will find this phrasing confusing? The alternative may be to call this field definition? As long as proper abbreviations are used, udunits should be able to handle the units.

I would love to help further develop any component of the unit list as needed.

Cheers,
Mitchell

Contributor

@maier-m maier-m commented Mar 29, 2018

Sorry to be so verbose, but I have one more thought.

There is no mention of abbreviation in the above text. Is abbreviation no longer going to be a part of the schema? Abbreviation may still be helpful, even if it is typically the same as udunitsSynonym, for cases like mph (milesPerHour), to help users search through the 195+ EML units. How abbreviation is currently used is likely similar to how many PIs will have their units described in their own data (with the exception of ² over ^2, etc.; there may be some value in switching to the ^ notation used in udunits). I'm just trying to think how to make searching and inputting EML units as friendly as possible, because I do think that is a hurdle for users with lots of units. Maybe a bad idea, but requiring both an id and an abbreviation may be helpful, as users could then enter either an id or an abbreviation to find the corresponding unit.
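
To sketch what "enter either an id or an abbreviation" might look like in a tool, here is a small lookup built from a made-up unit table. The table contents and the names `build_lookup` and `find_unit` are hypothetical. Abbreviations are matched exactly (since m and M can mean different things), while ids are matched case-insensitively; that split is a design assumption, not something decided in this thread.

```python
# Hypothetical unit records; only mph/milesPerHour comes from the discussion.
UNITS = [
    {"id": "milesPerHour", "abbreviation": "mph"},
    {"id": "meter", "abbreviation": "m"},
    {"id": "numberPerLiter", "abbreviation": None},
]


def build_lookup(units):
    """Return a function that finds a unit id from either an id or an abbreviation."""
    by_abbrev = {u["abbreviation"]: u["id"] for u in units if u["abbreviation"]}
    by_id = {u["id"].lower(): u["id"] for u in units}

    def lookup(term):
        # Abbreviations are case-sensitive; ids are matched case-insensitively.
        return by_abbrev.get(term) or by_id.get(term.lower())

    return lookup


find_unit = build_lookup(UNITS)
print(find_unit("mph"))
print(find_unit("numberperliter"))
```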

Contributor Author

@mobb mobb commented Mar 29, 2018

Hi Mitchell -

  • plurals:

    • is a unit like gramsPerLiter being deprecated in favor of gramPerLiter? I would assume not but was unsure from the above.
      • yes. the most common incongruences were plurals, modifier before the unit, and inconsistent capitalization (which SI doesn't care about, but makes strings all look the same). I bet I did not catch them all, either.
  • abbreviation:

    • it's one of the optional attributes. Did I leave it out of the plan? The plan (attached doc) is longer and has explanations for the choices.
    • part of the reason I left abbreviations out of eml-unitDictionary.xml is that right now, I'm working in ascii. I think udunits prefers ascii as well. many abbreviations need more complex sets.
    • keep in mind that there is no such thing as one "proper" abbreviation. only abbreviations that are recognized by certain systems.
  • udunitsSynonym:

    • good to hear that most abbreviations can be used as is by udunits! I actually exported this list from a DB that includes the abbreviations, so it's easy to duplicate abbreviations to the udunitsSynonym attribute during the transform. we would need to go back and test each one, however, and be sure we're using the right character sets (for both xml and udunits).
    • name of the field:
      • We can change the name of this attribute, of course. But there is already a unit field called "definition" (it's an STMML unit element) so we can't use that.
      • I think that whatever name we choose, it should be clear that it's supposed to connect to the "udunits" package. udunits has lots of synonyms for each unit, and any of them would work in this field, so that's why I chose it. Here are some other options:
        • udunitsId
        • udunitsAbbreviation
        • udunitsSynonym (currently set at this)

Thanks for the offer to help! There are a couple of areas:

  1. collecting more candidate units. I am attaching another useful file: these are the candidate units that I gleaned from the LTER.
    candidate_units_LTER.txt

  2. unitType, or some other grouping mechanism, like quantity. Those two concepts are not synonyms. And many of the unitTypes in EML 2.1 are incorrectly named (or are quantities) or both! But this comment is too long already. I'll start another issue for unitType and add notes there. See #291

Contributor

@mbjones mbjones commented Mar 29, 2018

I like this proposal, and think it will help with making units more effective.

  1. I don't think we should remove the conversion information and unitType from units. While compatibility with udunits is a positive, the EML definitions of units should stand alone for long term interpretability and for backward compatibility. Many apps depend on those being available now, and so would have to be rewritten were they removed. I don't see any harm in including them, but lots of pain in removing them.

  2. We should think about how to officially include an externally maintained list of units. While it has been useful to have a base list in EML itself, it is clear there will always be a need for extensions. So, it would be helpful to define a mechanism for external units to be defined and used. The difficulty is longevity. The first EML documents were written in 1997, and I can still find their unit definitions today. I would like to ensure that, for documents authored in 2018, the exact definitions of any units used will be just as accessible in 2038. There are definite advantages to bundling definitions in the EML itself.
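
As an illustration of why the bundled conversion attributes in item 1 matter, here is a sketch of a client converting a value to its parent SI unit using only the dictionary entry, with no external library. The fragment is made up and namespaces are omitted. The attribute spellings (multiplierToSI, constantToSI, parentSI) and the multiply-then-add order are assumptions that should be checked against the schema.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment; attribute spellings and the conversion order
# (multiply, then add the constant) are assumptions to verify against stmml.xsd.
UNIT_XML = """
<unitList>
  <unit id="meter" name="meter" unitType="length"/>
  <unit id="foot" name="foot" unitType="length" parentSI="meter"
        multiplierToSI="0.3048"/>
  <unit id="celsius" name="celsius" unitType="temperature" parentSI="kelvin"
        multiplierToSI="1" constantToSI="273.15"/>
</unitList>
"""


def to_si(xml_text, unit_id, value):
    """Return (value expressed in the parent SI unit, parent SI unit id)."""
    root = ET.fromstring(xml_text)
    for unit in root.iter("unit"):
        if unit.get("id") == unit_id:
            parent = unit.get("parentSI", unit_id)  # SI units are their own parent
            multiplier = float(unit.get("multiplierToSI", "1"))
            constant = float(unit.get("constantToSI", "0"))
            return value * multiplier + constant, parent
    raise KeyError(unit_id)


print(to_si(UNIT_XML, "foot", 10))
print(to_si(UNIT_XML, "celsius", 25))
```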

@mbjones mbjones added in progress and removed next labels Apr 3, 2018

Contributor

@maier-m maier-m commented Apr 3, 2018

Hello, I am attaching a preliminary draft (in tabular form) of the candidate LTER units and some ADC candidate units that I processed. Hopefully these are useful as a starting point.
candidate_units_LTER_draft.txt
candidate_units_ADC_draft.txt

Contributor

@mbjones mbjones commented Apr 3, 2018

In doing a diff of the original eml-unitDictionary.xml file with SHA 82dd0dd, I see that a lot of whitespace and formatting has unnecessarily changed, which makes it really difficult to assess whether the changes were appropriate. Before we review and close this issue, I would like to see:

  • revert the eml-UnitDictionary.xml file to its original formatting and layout
  • that should restore the unitType and conversion attributes for existing units
  • add in any new units (with unittype and other STMML attributes) and any new attributes needed in a way that we can use diff to see the changes in the file (i.e., don't autogenerate the whole file afresh)

Contributor Author

@mobb mobb commented Apr 4, 2018

A couple of things:

  1. We should not reuse the 2008 unitTypes, (for reasons already outlined, issue #291) but we should replace them.

  2. editing: please don't take this the wrong way, but manually editing the XML is most likely how the 2.1 dictionary got its inconsistencies in the first place. I am editing with tools that let me examine similar units together -- mysql (LTER unit registry, a candidate for adoption by EDI, https://github.com/EDIorg/unit-registry).

  3. Diffs: Perhaps a meaningful diff can be constructed some other way (e.g., parse both the 2.1 and the candidate into csvs | sort, etc.).

A test doc containing an attribute with every unit from the 2.1 dictionary ought to work for assessing the backward compatibility of 2.2.
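
One way the formatting-independent diff in item 3 might look, as a sketch only: flatten each dictionary into rows keyed by unit id and compare the rows, so whitespace and attribute order no longer matter. The two fragments are made up and namespaces are omitted; the real files would need namespace-qualified tag names. `unit_rows` and `compare` are hypothetical names.

```python
import xml.etree.ElementTree as ET


def unit_rows(xml_text):
    """Flatten a unit list into {id: sorted (attribute, value) pairs}."""
    root = ET.fromstring(xml_text)
    return {u.get("id"): sorted(u.attrib.items()) for u in root.iter("unit")}


def compare(old_xml, new_xml):
    """Report unit ids that were removed, added, or changed between two lists."""
    old, new = unit_rows(old_xml), unit_rows(new_xml)
    return {
        # "removed" must stay empty if the new list is backward compatible.
        "removed": sorted(set(old) - set(new)),
        "added": sorted(set(new) - set(old)),
        "changed": sorted(i for i in set(old) & set(new) if old[i] != new[i]),
    }


OLD = """<unitList>
  <unit id="meter" unitType="length"/>
  <unit id="squareMeters" unitType="area"/>
</unitList>"""

NEW = """<unitList>
  <unit id="meter" unitType="length"/>
  <unit id="squareMeters" unitType="area" deprecatedInFavorOf="meterSquared"/>
  <unit id="meterSquared" unitType="area"/>
</unitList>"""

report = compare(OLD, NEW)
print(report)
```

An empty "removed" list covers the same ground as the test doc idea: every 2.1 id is still present in the candidate.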

Contributor

@mbjones mbjones commented Apr 4, 2018

See comment in issue #291 (comment) for discussion of unitTypes.

Contributor

@maier-m maier-m commented Apr 4, 2018

Hello,
If it's helpful, I tried to restore versioning as per Matt's request: https://github.com/NCEAS/eml/pull/295

Contributor Author

@mobb mobb commented Apr 5, 2018

Thanks, mm - maybe we should talk. are you editing manually? or do you have a way to reorganize the xml doc with a script or stylesheet?

Contributor

@maier-m maier-m commented Apr 5, 2018

Hello, I have been mostly using R scripts utilizing the xml2 package to work. I haven't worked much with stylesheets, do you think that would be a better option moving forward?

Contributor

@maier-m maier-m commented Apr 5, 2018

Contributor

@maier-m maier-m commented May 16, 2018

I've taken a first stab at drafting the unitDictionary with #295.
FYI, I am missing a handful (of the more unique units) from @mobb's original change to the dictionary that can be added in the next round, but I wanted to pause here for review before moving full speed ahead.

@mobb mobb added the needs-review label Jul 5, 2018

Contributor Author

@mobb mobb commented Jul 13, 2018

reviewed @maier-m's additions; added the aforementioned 'unique' units, and corrected a few copy-paste errors. Updated many definitions (based on wikipedia info).

pushing up the eml-unitDictionary.xml file. It would be great if someone else could review it.
also see #302; enumeration will be checked in today too (based on this file).
