Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should measurand and associated SU data items always have the same content type #278

Open
vaitkus opened this issue Nov 5, 2021 · 10 comments

Comments

@vaitkus
Copy link
Collaborator

vaitkus commented Nov 5, 2021

I have noticed that several Measurand data items from the CIF_CORE dictionary have the Integer content type while the associated SU data items have the Real content type. Data items in question:

_diffrn_refln.counts_bg_1
_diffrn_refln.counts_bg_2
_diffrn_refln.counts_net
_diffrn_refln.counts_peak
_diffrn_refln.counts_total

Also note, that while it is possible to assign a real SU value to a integer measurand value using a separate data item, this does not seem to be possible using the alternative parenthesis notation. For example:

_value          43
_value_su       1

can be written as 43(1), but it is unclear how the same could be done for:

_value          43
_value_su       0.8

Two questions:

  1. Should the previously mentioned _diffr_refln.count_* data items be changed in any way (e.g. by redefining SU items as integers)?
  2. Can it be reasonably assumed that SU items and Measurand items will always have the same content type?
@jamesrhester
Copy link
Contributor

jamesrhester commented Nov 9, 2021

  1. No, _diffrn_refln.count_* do not need to be redefined. I believe this behaviour is consistent with the SU of a Poisson-distributed value being the square root of the value.
  2. No, as the example of Poisson counting statistics shows, it cannot be reasonably assumed.

The point that the bracket notation cannot handle some SU values for integers is worth making and we are fortunate that we now recognise that the SU can always be presented as a separate data value. I will add a comment on this to the DDLm chapter.

@jamesrhester
Copy link
Contributor

jamesrhester commented Nov 9, 2021

Here is the revised text of the relevant part of the draft DDLm chapter. Feel free to comment!

DDLm provides a mechanism to specify
that one data item holds the standard uncertainty (SU) of another data item.
When using CIF formats, which allow the SU to be presented as part of a
data value, this separate SU data name is assigned the SU value (see
Table \ref{tab:DDLm-derive-CIF}). Where
both an appended SU and a separate SU data item are present for a data item, the
values must agree. CIF users should also note that appending SU may
not be possible in cases where an integer data type has an SU that is smaller
than one (for example an SU of 0.8 on a value of 2), in which case the
explicit SU data name is available.

@vaitkus
Copy link
Collaborator Author

vaitkus commented Nov 9, 2021

Thank you for the answers.

The revised text looks great, however, the revision might not actually be required after all. A more detailed explanation is provided below.

I gave some more thought to this and came to the conclusion that expressing small SU values for integers using the parenthesis notation might be possible after all if the integer value syntax is slightly relaxed by allowing integer values to have any number of 0 after the decimal separator (e.g. treat 5, 5.0, 5.00, 5.000 ... as valid expressions of integer five). For example, 2 with SU of 0.8 can then be written as 2.0(8).

Furthermore, this might already be valid if DDLm or CIF 2.0 does not explicitly specify how an integer should be expressed. I am not completely sure this is true, but neither the definition of the _type.contents from the DDLm reference dictionary [2] nor the CIF 2.0 grammar [3] provides such information. CIF 1.0 has an explicit syntax for integers in the form of <Integer> = { '+' | '-' }? { <Digit> }+ [1] which does not allow integers of the form 3(.0+)?, however, this did not manifest as a limitation since DDL1 does not differentiate between Integer and Real values, but rather uses the Numb content type that encompasses them both (data item _diffrn_refln_counts_bg_1 value 9.0(45) is perfectly valid according to DDL1). The dREL draft also provided a syntax for the Integer type [4] that is most similar to the DDL1 one, however, I am unsure if this affects DDLm/CIF 2.0 in any way.

Given the above, the combination of CIF 1.1 and DDLm could potentially have the integer SU limitation if we assume that the format-agnostic DDLm inherits the data type limitations from the carrier format (CIF 1.1 in this case).

Two related questions:

  1. Is the previous interpretation of the DDLm/CIF 1.1 interaction correct? Would DDLm indeed inherit the definition of an Integer from CIF 1.1 or should these Integer concepts be considered separate?
  2. Is there a specific syntax that an Integer value must adhere to in DDLm or CIF 2.0?

[1] https://www.iucr.org/resources/cif/spec/version1.1/cifsyntax, Appendix A
[2]

cif_core/ddl.dic

Line 1683 in 9a2b576

Integer

[3] https://www.iucr.org/__data/assets/text_file/0009/112131/CIF2-ENBF.txt
[4] https://www.iucr.org/__data/assets/pdf_file/0007/16378/dREL_spec_aug08.pdf, section 2.2.2

@jamesrhester
Copy link
Contributor

jamesrhester commented Nov 10, 2021

These are good questions that have not been explicitly answered. For fun I will ask the DDLm group to see if they have any ideas. My own feeling is that the string '2.0' for data name '_xyz' of type 'Integer' corresponds to the integer 2, just as both delimited and undelimited strings can be considered numbers.

@vaitkus
Copy link
Collaborator Author

vaitkus commented Nov 10, 2021

As an additional note, after a lot of discussion the developers of JSON-Schema also came to the same conclusion [1] as @jamesrhester:

The integer type is used for integral numbers. JSON does not have distinct
types for integers and floating-point values. Therefore, the presence or absence
of a decimal point is not enough to distinguish between integers and non-integers.
For example, 1 and 1.0 are two ways to represent the same value in JSON. JSON
Schema considers that value an integer no matter which representation was used.

This definition also includes the scientific notation, that is actually also not covered
by the CIF 1.1 definition of an integer number ("1E2" is considered a Float).

https://json-schema.org/understanding-json-schema/reference/numeric.html#integer

@jcbollinger
Copy link
Contributor

jcbollinger commented Nov 10, 2021

I concur with James that as far as statistics go, there is no inconsistency in items defined to take integer values having standard uncertainties that are non-integral.

However, I think that there is an inconsistency between typing these specific items as integers and attributing an uncertainty to them. Do they convey exact counts measured by a device that can produce such? Then they have no uncertainty. Or do they convey estimates of counts? then there is no good reason to constrain them to be integral -- in fact, I would say that doing so would be statistically wrong.

Similar would apply to most other items. I'm having trouble seeing, semantically, where it would make sense for an item defined to describe an uncertain measurement to be constrained to be integral.

@jamesrhester
Copy link
Contributor

jamesrhester commented Nov 12, 2021

Thank you @vaitkus for finding that JSON information. I think it is reasonable to view the JSON specification as playing a similar role to the CIF1.1 syntax specification, that is, the only source of type information is the value itself. I think our context here is a little different, as we have a dictionary involved, so we can't draw as much guidance as we might like from how JSON have tackled it.

When a dictionary is available that states that a particular string should be an integer, then fundamentally the programmer is faced with a choice as to which of '2', '2.000', '0.2E1' to accept. One of the functions of a standard is to clarify such choices, and if we can try to make the standard align with the most likely choices of CIF software authors, who don't always read standards in quite the detail that we discuss them, then the standard will be more effective as we will have helped those that do think about these things to work in the same way as the rest. So I think we state that integers don't contain exponents or decimal points, in keeping with programming language behaviour and intuitive expectations. I will update the DDLm draft to clarify this if we and the DDLm group find this acceptable.

If a SU that is not an integer needs to be provided, then the _su data name can be used.

@jcbollinger makes an interesting point, but we are stuck with a historical decision that these data items are integers. To continue the discussion, measured raw counts must be integers. Due to Poisson counting statistics we can use that measurement to estimate the most likely value as being the measured value with sqrt(N) error bars. And I agree that this estimate is a real number. What has happened is that these two steps (measurement and estimate of true value) have been conflated, and for our IUCr SU formatting rules this only matters for the numbers 2 and 3.

Interestingly, the powder dictionary explicitly says (_pd_meas.counts_* items) that no SU should be attached to these values, and if the SU differs from sqrt(N) then the _pd_meas.intensity_* data names should be used. Clearly they were thinking about this 20 years ago!

So we have the following choices, assuming that we treat integers as not having a decimal point:

  1. Do nothing: if SU on '2' and '3' for _diffrn_refln.counts_* is needed, the _su item can be used
  2. Introduce _diffrn_refln.intensity_* items for future work, deprecate _su for counts data names and update the definition in the same way as for powder diffraction - but we can't remove them altogether as presumably somebody has used them somewhere? I couldn't find any mention of them in my copy of the COD.

As a final comment, the _diffrn_refln.counts_* data names are from the days of point detectors scanning peaks (which I remember!) but are not fit for use with area detectors, where we will clearly be using intensities due to the various processes involved in calculating background and peak fitting. So when we do get around to characterising data reduction, the real-valued intensity data names will be added - but this is a bigger project.

@vaitkus
Copy link
Collaborator Author

vaitkus commented Nov 12, 2021

Approach (2) seems nicer since it keeps things consistent between dictionaries (same approach). Also, I think that associated _su items can be safely removed without first deprecating them given that they were only recently introduced to the DDLm version of the dictionary (see PR #254) and never appeared in the DDL1 version of the dictionary.

@jamesrhester
Copy link
Contributor

jamesrhester commented Nov 15, 2021

The _su items I guess were implicitly present as the target of the bracketed SU. I'd prefer to wait and see if the core DMG have any comments about removed SU for the _diffrn_refln.counts_* data names before taking further steps.

@jamesrhester
Copy link
Contributor

jamesrhester commented Dec 8, 2021

The only comment from the core DMG approved of this change. The actions to take are:

  1. Change the counts_* data names to be Number and where necessary Recorded.
  2. Change any counts_net data names to be Derived.
  3. Remove the _su data names.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants