Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

index into a default list with more than one key value (ie create ENUMERATION_DEFAULTS) #399

Merged
merged 28 commits into from
Jun 27, 2023

Conversation

rowlesmr
Copy link
Collaborator

@rowlesmr rowlesmr commented May 18, 2023

From #397

This is a first attempt at putting things into place.

The idea is to allow indexing a single default value based on an arbitrary number of keys. In the example below, neutron scattering lengths are keyed on element symbol and atomic mass number. In any case, a single value is given to each combination of keys.

The definitions need work.

.

I've put in _enumeration.def_index_ids with a default value of [ _enumeration.def_index_id ], to give backward compatibility with the current use; in the new scheme, you need to parse _enumeration.def_index_ids, and we don't need to redefine every single current default enumeration.

I've generalised _enumeration_default.index as _enumeration_default.indexes, even though may not make total sense to make it a plural. Again, in the new scheme, _enumeration_default.indexes has the default value of [ _enumeration_default.index ], to give backward compatibility with the current use

@rowlesmr
Copy link
Collaborator Author

rowlesmr commented May 18, 2023

An example use would be:

save_atom_type_scat.length_neutron

    _definition.id                '_atom_type_scat.length_neutron'
    _definition.update            2023-05-17
    _description.text
;
    The bound coherent scattering length for a given atom type and isotope.
;
    _name.category_id             atom_type_scat
    _name.object_id               length_neutron
    _type.purpose                 Number
    _type.source                  Assigned
    _type.container               Single
    _type.contents                Real
    _enumeration.def_index_ids     
        ['_atom_type.element_symbol' '_atom_type.mass_number']
    _units.code                   femtometres

    loop_
         _enumeration_default.indexes
         _enumeration_default.value
     [ H   . ]   -3.7390
     [ H   1 ]   -3.7406
     [ H   2 ]    6.671
     [ H   3 ]    4.792
     [ D   . ]    6.671
     [ He  . ]    3.26
     [ He  3 ]    5.74
     [ He  4 ]    3.26
     [ Li  . ]   -1.90
     [ Li  6 ]    2.0
     [ Li  7 ]   -2.22
     #...
 
save_

* _atom_type.mass_number is not yet a real data item. It's proposal lead to this change.

@vaitkus
Copy link
Collaborator

vaitkus commented May 18, 2023

I am thinking that it might be better to create a completely new category that would be most similar to ENUMERATION_DEFAULT, but that would use the new multivalue-key approach (e.g. ENUMERATION_DEFAULT_LOOKUP, but please feel free to offer a better name). The ENUMERATION_DEFAULT would be deprecated and we could slowly migrate away from it in the dictionaries. The use of dREL code in the reference dictionary introduces some complexity that IMHO would be better avoided.

@rowlesmr , @jamesrhester what are your opinions on this?

@rowlesmr
Copy link
Collaborator Author

rowlesmr commented May 19, 2023

That makes some sense. The errors in the ddlm check as (I think) associated with not having the category key present, even though it's populated through dREL.

Something like:

  • _enumeration.def_index_ids - This is an array of datanames, that, together, form the key.
  • _enumeration_default_lookup.indexes - This is an array of values that form the key. The values are associated to the datanames given in _enumeration.def_index_ids.
  • _enumeration_default_lookup.value - This is the single value associated with a particular combination of key values.

.

The alternative would be to keep the current categories and just add _enumeration_default.indexes and _enumeration.def_index_ids as Array-type dataitems, and to change over all dictionaries at once. This would depend on how many other dictionaries use enumeration defaults.

.

More thinking, from #397 (comment)

My preference would simply be to define new data names as above, perhaps deprecating the current ones before there is too much DDLm software out there using them.

@vaitkus 's solution above is a go slowly approach, which is valid when dealing with standards: bring in a new category, apply it to scattering lengths, and move on from there. The alternate approach is to go nuclear and just define _enumeration.def_index_ids and _enumeration_default.indexes in the existing categories, and shift all of templ_enum.cif and cif_core.cif over to the new syntax. It may be a shock to the system, and I have no intuition as to how big a change this is, but it is an option.

I'm an idiot. We need @vaitkus 's solution above, as we are going to be changing the category key. But I would combine it with @jamesrhester 's preference, and deprecate the old and change over all the existing to the new.

I'll put something together.

@rowlesmr
Copy link
Collaborator Author

Just some other things to think about:

The atom symbol is a concatenation of element symbol and charge. If redesigning the entire system (which we're not), this could be changed to either separate element and charge or to also include the isotope number (if not natural abundance).

i) Separating the charge from the element would make the atomic mass, atomic number, radius bond*, dispersion tables, and the like much simpler. This would necessitate the requirement to index multiple keys to reference scattering factors and the like.

ii) Adding the isotope number to the _atom_type.symbol would negate the (current) need to index multiple keys. It would also vastly expand the aforementioned tables to include all the duplicate information for each isotope. You would need an isotope number for each charge state, as structures could be shared between X-ray and neutron experiments, necessitating both scattering factors and lengths.

and then (the position which should be chosen), iii) Dont't touch _atom_type.symbol, and just add _atom_type.mass_number, and a way to index on multiple keys.

* why are the bond radii of ions and neutral atoms the same? Whence did they come?

@jamesrhester
Copy link
Contributor

jamesrhester commented May 22, 2023

My thinking is as follows: If we create a new category to supercede enumeration_default, then we maintain perfect backwards compatibility, at the expense of carrying around an extra, unnecessary category and needing to update our templ_enum.cif file. If we instead change the category key of enumeration_default to the new list-valued object, we are not maintaining compatibility, which, with judicious use of dREL, has the following consequences:

  1. Software authors who have encoded these DDLm semantics need to update software to read our updated dictionaries
  2. We update version of DDLm to 5.0

Either of these approaches is viable, which means the decision is a policy one. I'm inclined to go with the update of DDLm to version 5.0, and simply provide the new category key with the dREL that explains how it is derived from the old category key. My reasons for preferring this is that at this level, very few software authors are affected (probably all are reading this message) and it keeps the foundational language cleaner.

If going with the above approach, we would need to pass it by ddlm-group for policy signoff.

@rowlesmr
Copy link
Collaborator Author

I have no intuition as to the best approach, so I'll go with the nuclear option of just changing the existing category. If proposing that to the dlm-group, then I don't think I need to do anything more.*

I've also already updated templ_enum.cif to the new syntax (not on this branch, though).

* I can't really update the scattering length, as that also needs a new data item, and that would complicate things in this PR.

@vaitkus
Copy link
Collaborator

vaitkus commented May 22, 2023

Either of these approaches is viable, which means the decision is a policy one. I'm inclined to go with the update of DDLm to version 5.0, and simply provide the new category key with the dREL that explains how it is derived from the old category key. My reasons for preferring this is that at this level, very few software authors are affected (probably all are reading this message) and it keeps the foundational language cleaner.

@jamesrhester As you mentioned, there seems to be two conflicting policies here. From my point of view, the proposed approach does not seem cleaner since it breaks things for DDLm software that is not capable of dynamically interpreting dREL code. Of course, the coders can still use dREL code as a reference, but it does require going back and tweaking the software. Given that, software from the cod-tools package does indeed not yet handle the ENUMERATION_DEFAULT fields when validating CIF instance files, however, dictionary validation will also be affected by this since the software would now need to know how to dynamically construct the category key from the values of another item (as can be seen by currently issued errors during PR validation). I do acknowledge, though, that such problems probably do not arise in software that properly interprets dREL.

I also understand that having two categories which serve a similar purpose may be somewhat confusing though still a much lesser evil from a programmers point of view as the handling of the new category can be implemented completely independently.

If the ddlm-group will prefer to go with the "nuclear" solution, I apologize in advance if the ddlm_validate program will break for a bit in a minor way. (It will probably be sufficient to modify the post commit check message filtering until I properly fix it).

@jamesrhester
Copy link
Contributor

@vaitkus , if the templ_enum.cif file is updated at the same time, would that help to fix the issues cod-tools software might experience? The new category key would be present, the only part that would not be validated is the relationship with the new _enumeration.def_index_ids.

Also, if cod-tools were to check the dictionary DDLm conformance version number, it could simply refuse to check anything that is not DDLm version 4 until updated.

I am not wedded to the "nuclear" option, but the other option would mean an old, unused category that would ideally eventually be removed, leading again to a version 5 of DDLm.

@vaitkus
Copy link
Collaborator

vaitkus commented May 23, 2023

If the templ_enum.cif file is updated at the same time, would that help to fix the issues cod-tools software might experience? The new category key would be present, the only part that would not be validated is the relationship with the new _enumeration.def_index_ids.

I think that would work. I would eventually implement the checking of the new relationships.

Also, if cod-tools were to check the dictionary DDLm conformance version number, it could simply refuse to check anything that is not DDLm version 4 until updated.

This is indeed something worth implementing since it would at least help to detect the incompatibilities. However, the underlying need to modify the code would still remain.

I am not wedded to the "nuclear" option, but the other option would mean an old, unused category that would ideally eventually be removed, leading again to a version 5 of DDLm.

I think that the accumulation of some outdated or deprecated definitions is unavoidable. These should preferably still be supported for a few intermediate versions and then eventually removed with the next major release (or even the next after that). Further more, is I correctly understand the current proposal, it still leaves some deprecated items (_enumeration.def_index_id, _enumeration_default.index) in the dictionary, with the only real difference being that they belong to the same category as the new items.

However, I do understand that this version of DDLm can still potentially be considered early in development so some less conservative changes may be preferred. Are these changes something you would like to see published in the upcoming volume G of ITC or are they aimed at a future release?

As a side note, the new propose name _enumeration_default.indexes should indeed be changed to something else since there is still a single index, it is simply potentially a multi-column (composite) one. Maybe something like "_enumeration_default.composite_index" be better? (Though still a bit of a misnomer in cases when we have only a single column.) We could go full database and call it a _enumeration_default.unique_index to clearly denote a one-to-one match.

@rowlesmr
Copy link
Collaborator Author

I think that the accumulation of some outdated or deprecated definitions is unavoidable. These should preferably still be supported for a few intermediate versions and then eventually removed with the next major release (or even the next after that). Further more, is I correctly understand the current proposal, it still leaves some deprecated items (_enumeration.def_index_id, _enumeration_default.index) in the dictionary, with the only real difference being that they belong to the same category as the new items.

@vaitkus makes a good argument. I say go with this. It still ends in a v5 with things removed, but is a more gradual, and officially deprecates them prior to removal.

Side note: is there anything in the CIF release notes about communicating the actual removal of deprecated dataitems?

As a side note, the new propose name _enumeration_default.indexes should indeed be changed to something else since there is still a single index, it is simply potentially a multi-column (composite) one. Maybe something like "_enumeration_default.composite_index" be better? (Though still a bit of a misnomer in cases when we have only a single column.) We could go full database and call it a _enumeration_default.unique_index to clearly denote a one-to-one match.

I am definitely not wedded to _enumeration_default.indexes. I can redo as _enumeration_default.unique_index.

<FUN>
image
</FUN>

@jamesrhester
Copy link
Contributor

OK, I'm happy to go with the preference of people who are actually using this thing. Let's have a new category and leave things at DDLm version 4 for now. As this is in time for Volume G 2nd Edition, I will write it up in place of the old data names. What do we think about enumeration_defaults as the new category name, and then _enumeration_defaults.index would work fine?

ddl.dic Outdated Show resolved Hide resolved
ddl.dic Outdated Show resolved Hide resolved
@jamesrhester
Copy link
Contributor

Apart from removing the dREL I think I'm happy to merge this. What do you think @vaitkus ?

@rowlesmr
Copy link
Collaborator Author

rowlesmr commented Jun 5, 2023

I shall remove the drel. I think I put it there to avoid having to alter all the enumerations in templ_enum, but they shouldn't be too hard to convert; I'll have to find where I stashed my go at changing them.

as per comments from @jamesrhester
@rowlesmr rowlesmr changed the title index into a default list with more than one key value index into a default list with more than one key value (ie create ENUMERATION_DEFAULTS) Jun 5, 2023
@rowlesmr rowlesmr mentioned this pull request Jun 5, 2023
@jamesrhester
Copy link
Contributor

This change can be independent of any changes in templ_enum.cif, as we haven't removed the old attributes.

@vaitkus
Copy link
Collaborator

vaitkus commented Jun 5, 2023

@jamesrhester There is actually one annoying issue with _enumeration_defaults.index which I am not sure how to resolve. The problem is that _enumeration_defaults.index treats all of its values as having the save content type (Code) while in reality the data item referenced by _enumeration.def_index_ids can have a whole range of content type (i.e. the mass_number will be an integer).

Possible solutions:

  1. We could explicitly state that the content type(s) should be inherited from the items referenced by the _enumeration.def_index_ids attribute, but we still need some kind of content type value to indicate that this should be done. Neither Implied nor ByReference content type seem to be suitable for this task, but I am not sure a new content type should be added for just this one case.

2.a Alternatively, we could continue having a single content type _enumeration_defaults.index, but then it should be clearly stated that all values of the composite key are treated as if they would be of that type (e.g. even if mass_number is an integer, key values 01 and 1 would be treated as different). However, the question then still remains which one of the content types should be chosen? Maybe Word?

2.b Same as 2a, but we also state that the values are normalised before converting them to the Word value. However, the problem here is that we do not have any rules for normalising data values of different content types.

@jamesrhester
Copy link
Contributor

Yes, this type of indexing is something we have deliberately removed from the main dictionary, which wanted to index all categories using a single data name potentially created from several data values. We can't do that here without redesigning the entire default system.

We want a low-effort solution, as all we are talking about is looking up default values to inform calculations of derived values by dREL. This is not something that affects actual data transmission or archiving. So some thoughts:

  1. The values used to index come from the data file, so we can't control their presentation. That means we must be able to handle caseless data values and alternative ways of writing integers.
  2. We can control what datanames are used to perform the lookup (def_index_ids)

So I would propose simply defining an Inherited data type as follows:

"Used only with compound data values in the DDLm attribute dictionary. The data type of each value in a compound value is the same as that of the data name to which it is directly related."

@rowlesmr
Copy link
Collaborator Author

rowlesmr commented Jun 7, 2023

Would you like me to work that into this PR?

@jamesrhester
Copy link
Contributor

Let's wait and see what @vaitkus thinks.

@vaitkus
Copy link
Collaborator

vaitkus commented Jun 7, 2023

I am generally OK with having the Inherited content type as long as it is only used in the reference dictionary.

However, the relationship between the value and the related item might not always be clear to an outside reader, so the definition of Inherited type should strongly encourage to include rules on establishing such relationships into the human-readable definition of the data item. Of course, we will have no formal way of checking if the description actually includes such information, but it is still better than nothing. Let's update the PR along these lines and then see if it need any further fine-tuning.

In the next major revision we may come up with better ideas, but, as you mentioned, this would require a serious reworking of the typing system.

@rowlesmr
Copy link
Collaborator Author

rowlesmr commented Jun 7, 2023

I've added some words.

The definition of _type.container for the value Implied refers to "the DDLm Reference Dictionary". Is that different to a "DDLm Attribute Dictionary"? or even "DDLm Dictionary"? also should it be "the" or "a" dictionary? is this supposed to refer to the file ddl.dic? I'm not up to speed on this level of nuance in the terminology.

@rowlesmr
Copy link
Collaborator Author

How can a List dataitem have a _type.contents attribute, if not Inherited?

A List is:

Ordered set of values. Elements need not be of same contents type.

I got it wrong, a List can have Implied contents if there are mixed types.

Does

Operations across arrays are equivalent to operations across elements of the Array

mean [1 2 3] * [2 4 6] = [1*2 2*4 3*6]?

@vaitkus
Copy link
Collaborator

vaitkus commented Jun 13, 2023

mean [1 2 3] * [2 4 6] = [1*2 2*4 3*6]?

Not sure if you can multiply two arrays like this. My interpretation would be more like [1 2 3] * 5 = [5 10 15].

@rowlesmr
Copy link
Collaborator Author

Sure you can. It's element-wise multiplication (Hadamad product).

@jamesrhester
Copy link
Contributor

Just to confirm, yes, Lists may have different content types, but only Implied and Inherited would seem to allow that to happen. There's no point defining such a heterogeneous compound data name for use in a data file, as each of the components being of a different type means that they are processed differently which means they have a different meaning, so need a separate data name.

An Array has uniform content type.

A Matrix follows the rules of matrix multiplication, and would include items meant to be used as row/column vectors.

@jamesrhester
Copy link
Contributor

I think this is ready for merging if those few suggested changes above are incorporated, and the minor version is bumped in ddl.dic, and _dictionary.ddl_conformance in cif_core.dic changed to match.

@vaitkus
Copy link
Collaborator

vaitkus commented Jun 22, 2023

@jamesrhester The last version number change already increased the minor version, that is, the version was changed from 4.0.2 to 4.1.0, so the DDL version number increase is probably not necessary. Or should we create a new version of 4.2.0 and put all the changes from this branch there?

Also, should we formally deprecate the old ENUMERATION_DEFAULT category?

@jamesrhester
Copy link
Contributor

Yes, we should deprecate the old category. OK not to change the version number then, as long as it is different to the one used in the 3.2.0 core release.

@vaitkus
Copy link
Collaborator

vaitkus commented Jun 23, 2023

Ok, so I did quite a few changes since the last review:

  • Deprecated the ENUMERATION_DEFAULT category and the _enumeration.def_index_id attribute.
  • Adapted the ENUMERATION_SOURCE category mechanism to work with ENUMERATION_DEFAULTS instead of ENUMERATION_DEFAULT.
  • Updated the version number of DDL reference dictionary. However, it seems that we did assign some of the changes to version that were already in the 3.2.0-rc branch. I corrected the DDL reference dictionary here, but a more proper fix was implemented in PR Update dictionary version numbers and changelog messages. #429. I think we should merge that PR first, sync the master branch with this branch and after a final review we should be good to go (IMHO).

Copy link
Collaborator Author

@rowlesmr rowlesmr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it looks all good. Just that one change on line 1002.

ddl.dic Outdated Show resolved Hide resolved
Co-authored-by: Matthew Rowles <rowlesmr@gmail.com>
Copy link
Collaborator

@vaitkus vaitkus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validator does not yet properly handle the 'Inherited' type, but I updated the GitHub actions to ignore the warnings for now so we can start using these dictionary attributes in other feature branches.

I will most likely implement the proper handling in the upcoming few weeks.

@jamesrhester jamesrhester merged commit 74c022a into COMCIFS:master Jun 27, 2023
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants