Consider adding documentation for how to deal with log-units #283

Open
amoeba opened this issue Feb 5, 2018 · 21 comments

Comments

@amoeba
Contributor

amoeba commented Feb 5, 2018

This came up on Slack today: what do we fill in when an attribute is a log transform of another? The consensus was to create a custom, dimensionless unit called log{unit}, e.g., meter -> logmeter. I couldn't find any guidance on this in the docs, so I thought we might want to add a note or two so that at least a Ctrl+F for "log" or "transform" would pop something up for the curious user.

@mobb
Contributor

mobb commented Feb 5, 2018

thanks, Bryce. maybe someone could mine the EML to see what people are already doing with log units.
In general, we need guidelines for many dimensionless units, and attributes that are severely reduced.

@amoeba
Contributor Author

amoeba commented Feb 5, 2018

Good idea! This would take a bit of work but is fairly doable. A good stepping off point would be https://cn.dataone.org/cn/v2/query/solr/?q=attribute:*log*&fl=attribute
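For anyone who wants to script that search, here is a minimal sketch of building the same query URL with Python's standard library. The endpoint and the `q`/`fl` values come from the URL above; the `rows` and `wt` parameters are ordinary Solr parameters added here as an assumption, and the function name is hypothetical. The sketch only builds the URL and does not send a request.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint copied from the query URL quoted in the comment above.
SOLR_ENDPOINT = "https://cn.dataone.org/cn/v2/query/solr/"


def build_attribute_query(pattern: str, rows: int = 100) -> str:
    """Return a Solr query URL for attribute names matching a wildcard pattern.

    `rows` and `wt` are assumptions (generic Solr parameters), not part of
    the original query.
    """
    params = {
        "q": f"attribute:{pattern}",  # e.g. attribute:*log*
        "fl": "attribute",            # only return the attribute field
        "rows": rows,
        "wt": "json",
    }
    return SOLR_ENDPOINT + "?" + urlencode(params)


url = build_attribute_query("*log*")
# Round-trip check: decoding the URL gives back the original query string.
decoded = parse_qs(urlsplit(url).query)
print(decoded["q"][0])
```

The wildcard will also match names such as "logger" or "biological", so the results would still need manual review.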

@mbjones mbjones added this to the EML2.2.0 milestone Jun 29, 2018
@mobb mobb added in progress and removed next labels Jul 24, 2018
@mobb
Contributor

mobb commented Aug 14, 2018

Thanks for the query @amoeba . bummer that we don't have the semantics in place to remove the attrs that are about trees.

But to be perfectly correct, it seems to me that values that are the result of log transforms are dimensionless:
https://math.stackexchange.com/questions/238390/units-of-a-log-of-a-physical-quantity

But I think this recommendation fits what we see in environmental data:
https://www.reddit.com/r/askscience/comments/1x09zc/what_happens_to_the_units_of_a_number_after/cf72xlk

So we should state that the log (or ln) is dimensionless, but the attribute description can state the original unit, which no longer has meaning - because you can't subtract or add the numbers as you originally would have.

@amoeba
Contributor Author

amoeba commented Aug 14, 2018

So we should state that the log (or ln) is dimensionless, but the attribute description can state the original unit, which no longer has meaning - because you can't subtract or add the numbers as you originally would have.

👍

@mpsaloha
Contributor

mpsaloha commented Aug 14, 2018 via email

@mobb
Contributor

mobb commented Aug 15, 2018

Tried to summarize the slack discussion. something like this, for the EML documentation (feel free to edit):

If an attribute is a log transform, it can be unitless ("dimensionless" is a standardUnit in EML). If it is useful to include a version of the original unit for labeling, the customUnit can reflect the original dimensions, e.g., "logMeter" or "lnPa". However, the definition for a customUnit for a transformed value (in STMML) should state that its relation to a parent is through an inverse transformation, and describe the transform, e.g., exp(x); STMML assumes simple arithmetic.

@mpsaloha
Contributor

mpsaloha commented Aug 15, 2018 via email

@mobb
Contributor

mobb commented Aug 15, 2018

Down-voting my own comment, above. Trying to cram all this into a single EML "unit" is a bad idea. Logs are dimensionless by definition, and a unit implies that certain operations can be performed, which is misleading. A better recommendation for describing a log measurement will be to use the annotation field.

@mobb
Contributor

mobb commented Aug 15, 2018

comment from @mpsaloha regarding how to handle Units for TRANSFORMED DATA:

Interpretation of Units or Dimensions can be problematic after data are transformed for statistical purposes. Some transformations can be completely reversed to re-derive original values, although caution must be exercised if constants or other adjustments were made to the data beforehand.
Expressing both the nature of the transform ("transform_type") and the original unit (if any) associated with a measurement can often provide invaluable information.

EML should recommend a convention for expressing transformed attribute values, e.g.
transform_type[original_unit]
and provide some standardized abbreviations for popular transformations, and mechanisms for constructing the above format.

Examples:
Log[Meter]
SqRt[Count]
...etc...

Transforms to consider for providing standardized prefixes in EML include:
Log, Ln, SqRt, CuRt, Arcsin, Box-Cox

Construction of an EML customized unit, as proposed above, should not be taken to indicate that the "original unit" is still associated with the transformed value. Rather, it indicates what that original unit was, for improved evaluation of data for re-use, as well as the potential for implementing a reverse transformation to re-derive the original data (although this should be done cautiously).
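As a rough illustration of the proposed `transform_type[original_unit]` convention, the sketch below formats and parses such labels. Nothing here is part of EML: the function names, the `TransformedUnit` type, and the exact spelling of the prefixes are hypothetical, taken only from the list in this comment.

```python
import re
from typing import NamedTuple

# Prefixes listed in the comment above; the spellings are illustrative only.
KNOWN_TRANSFORMS = {"Log", "Ln", "SqRt", "CuRt", "Arcsin", "Box-Cox"}

# Matches labels of the form transform_type[original_unit].
_LABEL_RE = re.compile(r"^(?P<transform>[A-Za-z-]+)\[(?P<unit>[^\[\]]+)\]$")


class TransformedUnit(NamedTuple):
    transform_type: str
    original_unit: str  # what the unit WAS before the transform


def format_label(transform_type: str, original_unit: str) -> str:
    """Build a label such as 'Log[Meter]' from its two parts."""
    if transform_type not in KNOWN_TRANSFORMS:
        raise ValueError(f"unknown transform: {transform_type!r}")
    return f"{transform_type}[{original_unit}]"


def parse_label(label: str) -> TransformedUnit:
    """Split a label such as 'SqRt[Count]' back into its two parts."""
    match = _LABEL_RE.match(label)
    if match is None or match["transform"] not in KNOWN_TRANSFORMS:
        raise ValueError(f"not a transform_type[original_unit] label: {label!r}")
    return TransformedUnit(match["transform"], match["unit"])


print(format_label("Log", "Meter"))
print(parse_label("SqRt[Count]"))
```

Keeping the two parts separable matters for the caveat above: the parsed `original_unit` records provenance and is not a claim that the stored values still carry that unit.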

@mpsaloha will write some text, after @mobb finds the spot.

@mobb
Contributor

mobb commented Aug 16, 2018

@mpsaloha -
There are two places where the documentation could be augmented:

  1. At the top-level documentation of eml-attribute:
    which currently shows up here:
    https://knb.ecoinformatics.org/#external//emlparser/docs/eml-2.1.1/index.html
    and
    https://knb.ecoinformatics.org/#external//emlparser/docs/eml-2.1.1/./eml-attribute.html

Content is in this file (second paragraph, section starting approx line 70):
https://github.com/NCEAS/eml/blob/BRANCH_EML_2_2/xsd/eml-attribute.xsd

  2. and in the documentation for <customUnit> itself:
    same file, approx line 900.
    There is no URL directly to that documentation; all you can do is go to the attribute.html page (above) and search in page for customUnit. It's about half way down.

If you want, put the text here and I'll add it, since I have that file out now.

@mbjones
Member

mbjones commented Aug 19, 2018

While I am fine with clarifying the math behind the use of logs, sin, exp, and other transcendental functions, I would like us to be clear that it is not possible mathematically to take the log of a dimensioned quantity with units. The idea of a "log meter" is nonsensical mathematically. Rather, people often use a shorthand that assumes the arguments to transcendental functions have first been made dimensionless before the function is evaluated. There are numerous explanations of this on the web. Here are a couple of decent ones, the first of which is the most comprehensive, and points out that several popular internet sites like Wikipedia have promulgated mistakes in some of the math, including the use of the Taylor expansion as justification one way or the other:

The math stack exchange site also trots out some of these erroneous explanations. A simple and intuitive way to show that log(10 grams) is nonsensical is to see what happens when expanding it. Take the definition of the log function (using base 10 log as an example, but it's true for all bases): y = log(x) if x = 10^y. Then examine the following expansion:

log(10 grams) = log(10 * 1 gram)
              = log(10) + log(gram)
              = 1 + log(gram)

Following the paper linked above, to calculate log(gram) one must ask oneself, "what is the exponent y (a number) to which one should raise the base b that will yield gram(s)?" There is no such number, as gram is not a number.

The way textbooks get away with using dimensioned numbers as arguments to transcendental functions is to (implicitly) divide by a reference constant first (e.g., ln(3 m) is really ln(3 m/1 m), which makes the units cancel and leaves ln(3)). All arguments to transcendental functions must be dimensionless numbers, even though sometimes people don't make that explicit.
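A small numeric sketch of that implicit division, using the 10 gram example from above. The function name is hypothetical. It shows that the logged value depends on which reference quantity is divided out, so the same mass gives different pure numbers depending on whether the reference is 1 g or 1 mg.

```python
import math


def log10_ratio(value: float, reference: float) -> float:
    """log10 of value/reference.

    Both arguments must be expressed in the same unit, so the units cancel
    and the argument to log10 is a pure number.
    """
    if value <= 0 or reference <= 0:
        raise ValueError("value and reference must be positive")
    return math.log10(value / reference)


# 10 g, logged relative to a reference of 1 g (approximately 1).
relative_to_gram = log10_ratio(10.0, 1.0)

# The same mass written as 10 000 mg, logged relative to 1 mg (approximately 4).
relative_to_milligram = log10_ratio(10_000.0, 1.0)

# Changing the reference shifts the result by a constant: log10(1000 mg per g).
shift = relative_to_milligram - relative_to_gram
print(relative_to_gram, relative_to_milligram, shift)
```

This is also why metadata for a logged attribute needs to record the unit the values were in before the transform: without it the reference quantity, and therefore the inverse, is ambiguous.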

So, if people want to make a new STMML definition for logmeter in EML as another name for dimensionless and that has unitType=dimensionless then that is fine. It would clarify that the original unit was meter. But let's not imply that the value of a transcendental function has a unit. It is a pure number, and does not have units.

@mpsaloha
Contributor

mpsaloha commented Aug 21, 2018 via email

@mbjones
Member

mbjones commented Aug 21, 2018

You wrote:

while log-transformed measurements of wing-length might be unit-less, I would argue they retain their dimension of "length"

This is where we might diverge. My reading has led me to think that log-transformed values are indeed dimensionless -- e.g., they no longer represent a 'length' -- and rather now represent a pure numerical value. Here's the relevant quote from the Matta et al. paper:

By definition, transcendental functions such as logarithm (to any base), exponentiation, trigonometric functions, and hyperbolic functions act upon and deliver dimensionless numerical values.

And also you wrote:

logarithmic values in general can have units and dimensions, e.g. decibels, pH, and astronomical magnitude do

I think pH, dB, the Richter scale, and other similar indices are dimensionless and do not have units. The Matta et al. paper also uses pH as an example, showing that it is the log of a ratio of concentrations in which the dimensions and units cancel. The same holds for dB.

@cboettig
Member

I believe that the logarithm of length in meters would technically be considered a level measurement, that is, of type "level" or "level difference" rather than of type "length" or of type "dimensionless".

@mpsaloha
Contributor

Hi Matt and Carl,

Carl-- thanks for finding that. But did you notice that in the link you provided they refer to decibels under "Units of Level"?

Matt-- yes, I read the Matta et al. paper and liked their argument for dismissal of the Taylor expansion as "proof" of dimensionlessness in the case of logarithms, but noted that they also never mentioned the issue of inverse transformations in the case of logarithms-- which is a common use case.

So, we don't see eye-to-eye on several things here:

in general, what a "dimension" represents as opposed to a "unit"-- I don't think "dimensionless" is the same as "unitless"-- while a measurement value with its unit allows us to infer its dimension, the reverse is not true (a measured value with its dimension does not allow us to infer its unit-- as we well know from under-specified metadata! "Body weight of 5": dimension of Mass; units of ??)

that if one log-transforms a set of wing-lengths (e.g. measured in cm) they become pure numbers, so the inverse transforms of those pure numbers are also pure numbers (i.e. the dimension (length) and unit (cm) of those measurements are irretrievably lost). Note that analysts routinely re-derive original values and their associated units from statistically transformed variables-- how is this defensible if log-transforms are "pure numbers"?

that pH, dB and other logarithmic scaled measurements are unitless. For example, I'd assert that 10 is unitless, but that 10dB has a unit of decibel, which is a measurement of the log ratio of amplitude of two "sounds" (air pressure levels) or other energy sources. If you want to call 'dB' (as an example) something other than a unit, maybe we need to invent a new category-- "unitless standard" for these standard names for interpreting and comparing quantitative values along some scale (which coincidentally is the primary function of those thingies we call "units"). So, regardless of what we call these, I think we should retain them somewhere in the metadata, rather than letting them drift away in pure number bliss.

Also, we are promoting different notions of "dimensionlessness"-- yours having more to do with dimensional analysis, and mine more regarding semantics. E.g. if one has 100Kg of antelopes per 5Kg of Lions, I'd say the dimensions are "Mass"; whereas you (I think) would say this ratio is dimensionless.

@cboettig
Member

@mpsaloha Yeah, decibels are a particularly interesting case. Apparently decibel is technically the log ratio of any measurement, so arguably the 'units' of logarithm of length could be decibels! Wikipedia suggests the convention is to put the unit following decibels, so decibels of log voltage would be dBV. (Ironically, dBm apparently refers to decibels relative to one milliwatt, not meters, sorry.) Apparently the SI standard opposes this convention.

To make this more confusing, decibels are defined differently for power-type units and "field" (now called "root-power") type measurements, where it is typical to square the values before taking the ratio (equivalently, multiplying the log by 2); see: https://en.wikipedia.org/wiki/Field,_power,_and_root-power_quantities
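To make the two definitions concrete, here is a short sketch using the standard formulas: 10 log10 of a power ratio, and 20 log10 of a root-power (field) ratio. The function names are hypothetical. Note that both take a ratio of like quantities, so the argument to the log is always a pure number.

```python
import math


def power_db(power: float, reference_power: float) -> float:
    """Level in dB for a power-type quantity: 10 * log10(P / P_ref)."""
    return 10.0 * math.log10(power / reference_power)


def root_power_db(amplitude: float, reference_amplitude: float) -> float:
    """Level in dB for a root-power (field) quantity: 20 * log10(A / A_ref).

    The factor of 20 comes from squaring the amplitude ratio before taking
    the log, since power is proportional to amplitude squared.
    """
    return 20.0 * math.log10(amplitude / reference_amplitude)


# Doubling a voltage across the same load quadruples the power, so both
# definitions give the same level (roughly 6 dB).
from_voltage = root_power_db(2.0, 1.0)
from_power = power_db(4.0, 1.0)
print(from_voltage, from_power)
```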

so decibel-meters, anyone?

Not sure I'm helping. pH is a little cleaner as technically it's already defined as the log of H+ activity, which is already defined as a dimensionless measure, so the use of logs does not imply the need for a reference scale.

There is some argument that these log-scaled units are quantities we tend to think of in percentage/multiplicative terms anyway, and measure in log-scale units....

@mpsaloha
Contributor

mpsaloha commented Aug 22, 2018 via email

@mobb
Contributor

mobb commented Aug 29, 2018

Summary from @mpsaloha via email:
"BTW-- after getting input from two mathematicians (Profs), both more-or-less agree with me: log is a transform on the value, and not the unit. And the unit should be preserved (somehow) for utility-- such as when do inverse transform."

It's the 'somehow' that we want to explain in the EML documentation.
my opinion:

  • the unit (after a log transform) is dimensionless.
  • the attribute's definition should include the original unit (before transform), e.g. for an inverse transform.
  • that a log transform was performed is a complex aspect of the attribute that belongs in an annotation (what that annotation looks like is TBD).
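One way to picture those three points together is the sketch below. It is not EML and not a proposal for the annotation format, which is still TBD: the class, field names, and transform keys are all hypothetical. It keeps the stored unit as dimensionless, carries the original unit alongside, and uses the recorded transform to drive an inverse. The inverse assumes a plain log with a reference of one original unit and no constants added beforehand.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical transform keys mapped to their inverse functions.
INVERSES: Dict[str, Callable[[float], float]] = {
    "log10": lambda y: 10.0 ** y,
    "ln": math.exp,
}


@dataclass(frozen=True)
class TransformedAttribute:
    name: str
    unit: str           # unit of the stored values; dimensionless after a log
    original_unit: str  # unit before the transform, from the definition
    transform: str      # stands in for the annotation, whose form is TBD

    def back_transform(self, stored: float) -> Tuple[float, str]:
        """Return the re-derived value together with the original unit."""
        inverse = INVERSES[self.transform]
        return inverse(stored), self.original_unit


wing = TransformedAttribute(
    name="log_wing_length",
    unit="dimensionless",
    original_unit="centimeter",
    transform="log10",
)
value, unit = wing.back_transform(0.5)
print(value, unit)
```

If an offset or other adjustment was applied before the transform (for example log(x + 1)), that would also have to be recorded for the inverse to be valid.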

@brunj7

brunj7 commented Aug 29, 2018

Mark mentioned this thread to me --- so here are my 2 cents:

I do not agree that a log transform of a number removes either its associated unit or its dimension. If the number is a number of something, the log of this number is still of something.

100 km = 10^2 km = 100,000 m= 10^5 m = log(10^5) m = 5log10 m = 2log10 km

It is important since you can invert the transformation and get the original number (of something or not) back.

So the unit is still the same after a log transform, but we need to find a way to save the information that the stored values in the data file are in a log scale.

@mbjones
Member

mbjones commented Aug 16, 2019

@brunj7 Your "equation" commits the fundamental mistakes that are outlined in the Matta et al. paper (https://pubs.acs.org/doi/pdf/10.1021/ed1000476) that I linked to in my comment above. I suggest that a deep read and understanding of that paper is required before we can make headway on this issue. I propose that we remove this issue from the EML 2.2 release given that we have not reached consensus in the last year and a half on the issue. I will bump this issue to the 3.0.0 milestone unless others object and can show a mechanism for consensus to be very quickly reached.

@mbjones mbjones modified the milestones: EML2.2.0, EML3.0.0 Aug 18, 2019
@mobb
Contributor

mobb commented Aug 22, 2019

related to #323

7 participants