Allow per-phase calculated intensity #3

Closed
jamesrhester opened this issue Sep 13, 2021 · 25 comments · Fixed by #35
Labels
enhancement New feature or request

Comments

@jamesrhester
Contributor

Currently the calculated intensity _pd_calc_intensity_net is for the sum of all phases. It has been suggested that seeing the calculated contribution of each phase would also be useful for plotting. The sketch of a solution involves adding a child data name of phase_id to the pd_proc category.

@jamesrhester jamesrhester added the enhancement New feature or request label Sep 13, 2021
@rowlesmr
Collaborator

This one is my fault. I've been thinking recently about plotting pd data from CIF, and what would be good things to be able to see.

My initial idea of a solution to document the contribution from each phase is something like:

data_diffraction_pattern_info

loop_
_pd_phase_id
_pd_phase_block_id
1          long_unique_string_1
2          long_unique_string_2
3          long_unique_string_3

loop_
_pd_data_point_id
_pd_meas_2theta_scan
_pd_calc_intensity_net
1          5.00     0
2          5.02     6
…

loop_
_pd_data_point_id
_pd_phase_id
_pd_calc_phase_intensity_net
1          1          0
2          1          3
…

loop_
_pd_data_point_id
_pd_phase_id
_pd_calc_phase_intensity_net
1          2          0
2          2          1
…

loop_
_pd_data_point_id
_pd_phase_id
_pd_calc_phase_intensity_net
1          3          0
2          3          2
…

@briantoby
Collaborator

I think this is a good argument for the single-block CIF with _pd_phase.id. This would allow expansion by adding a new column for each phase rather than a new loop. In fact, the above is invalid unless each loop is put in a separate block, since each loop overwrites the previous data names.

@jamesrhester
Contributor Author

Yes, @rowlesmr's suggestion cannot work because you may not duplicate data names within a block. If each of the loops over _pd_data_point_id were in a separate data block, and each data block had a value of _pd_phase_id within it, then it would work. That looks like it may have been the original intention, as there were block pointers at the top of the example.
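
To make the duplicate-name problem concrete, here is a minimal Python sketch that reports data names occurring more than once in a block. The helper name `find_duplicate_data_names` is hypothetical, and this is not a CIF parser: it assumes every line whose first token starts with an underscore is a data name, and it ignores quoting, text fields and save frames.

```python
from collections import Counter


def find_duplicate_data_names(block_text):
    """Return data names that appear more than once in one data block.

    Illustration only: any line whose first token starts with '_' is
    treated as a data name (an assumption, not real CIF tokenisation).
    """
    names = []
    for line in block_text.splitlines():
        tokens = line.split()
        if tokens and tokens[0].startswith("_"):
            names.append(tokens[0].lower())  # CIF data names are case-insensitive
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


# Two of the loops from the original example, placed in the same block
example = """
data_diffraction_pattern_info
loop_
_pd_data_point_id
_pd_phase_id
_pd_calc_phase_intensity_net
1 1 0
2 1 3
loop_
_pd_data_point_id
_pd_phase_id
_pd_calc_phase_intensity_net
1 2 0
2 2 1
"""

duplicates = find_duplicate_data_names(example)
print(duplicates)
```

Any non-empty result means the block would need to be split, or the loops merged into one.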

@rowlesmr
Collaborator

rowlesmr commented Sep 21, 2021

Yeah, just noticed that. Multiple instances of a data name in a single block result in issues.

A modification of my example would be something like below. Each crystalline phase belongs to only one diffraction pattern, and therefore has a unique profile. Each diffraction pattern has many phases. I think everything knows about everything else.

data_overall_insitu_experiment
	# Many experimental patterns
	# Each experimental pattern is collected at a different temperature, pressure, and/or time, but on the same instrument
	# Each experimental pattern has many phases
	# Each phase has only one experimental pattern
	# Each phase has only one calculated profile
	# Experiment probably done to report quantitative phase analysis
	
	# insert common information here
	
	loop_
	_pd_phase_block_id
	phase_1_pattern_1_unique_string
	phase_2_pattern_1_unique_string
	#...
	
	loop_
	_pd_block_diffractogram_id
	pattern_1_unique_string
	pattern_2_unique_string
	#...
	
data_phase_1_pattern_1
	_pd_block_id	phase_1_pattern_1_unique_string
	_pd_block_diffractogram_id	pattern_1_unique_string
		
	# crystal structure information would go here
	
	loop_
	_pd_data_point_id
	_pd_calc_phase_intensity_net
	1	0
	2	3
	#...
	
data_phase_2_pattern_1
	_pd_block_id	phase_2_pattern_1_unique_string
	_pd_block_diffractogram_id	pattern_1_unique_string
		
	# crystal structure information would go here
	
	loop_
	_pd_data_point_id
	_pd_calc_phase_intensity_net
	1	0
	2	1
	#...
	
data_pattern_1	
	_pd_block_id	pattern_1_unique_string
	
	loop_
	_pd_phase_id
	_pd_phase_block_id
	_pd_phase_mass_%
	1	phase_1_pattern_1_unique_string	45.5
	2	phase_2_pattern_1_unique_string	54.5

	#time, temperature, pressure, other information
	#hkl info goes here, too, probably.
	
	loop_
	_pd_data_point_id
	_pd_meas_2theta_scan
	_pd_meas_intensity_total
	_pd_proc_ls_weight
	_pd_calc_intensity_total
	_pd_proc_intensity_bkg_calc
	1	5.00	43.364	0.040297	25.962	25.962  
	2	5.01	38.007	0.050546	26.168	26.168  
	#...

# etc....	

A more complicated example (taken from NISI.cif) is where each phase has multiple experimental patterns, and each pattern has multiple phases.

In this one:
The crystal structures know about their diffraction patterns through _pd_block_diffractogram_id.
The crystal structures know about their individual profiles through _pd_phase_block_id (is that the correct way to do it?).
The crystal structures don't know about each other.
The individual profiles know about their crystal structure through _pd_phase_block_id (is that the correct way to do it?).
The individual profiles of a crystal structure don't know about each other.
The diffraction patterns know about the crystal structures through _pd_phase_block_id.
The diffraction patterns have no knowledge of the individual phase profiles (should they?).

data_overall_structure_determination
	# Many experimental patterns, each collected at the same temperature, pressure, and/or time, but on different instruments
	# Each experimental pattern has many phases
	# Each phase has many experimental patterns
	# Each phase has many calculated profiles
	# Experiment probably done to report crystal structure
	
	# insert common information here
	
	loop_
	_pd_phase_block_id
	phase_1_unique_string
	phase_2_unique_string
	
	loop_
	_pd_block_diffractogram_id
	xray_pattern_unique_string
	cw_neutron_pattern_unique_string

data_phase_1
	_pd_block_id	phase_1_unique_string
	
	loop_
	_pd_block_diffractogram_id
	xray_pattern_unique_string
	cw_neutron_pattern_unique_string

	loop_
	_pd_phase_block_id
	phase_1_xray_unique_string
	phase_1_cw_unique_string

	#crystal structure information

data_phase_1_xray	
	_pd_block_id	phase_1_xray_unique_string
	_pd_phase_block_id	phase_1_unique_string
	_pd_block_diffractogram_id	xray_pattern_unique_string
	
	loop_
	_pd_data_point_id
	_pd_calc_phase_intensity_net
	1	0
	2	1
	#...
	
data_phase_1_cw	
	_pd_block_id	phase_1_cw_unique_string
	_pd_phase_block_id	phase_1_unique_string
	_pd_block_diffractogram_id	cw_neutron_pattern_unique_string
	
	loop_
	_pd_data_point_id
	_pd_calc_phase_intensity_net
	1	0
	2	3
	#...
	
data_phase_2
# blah
data_phase_2_xray	
# blah
data_phase_2_cw	
# blah
	
data_xray_pattern
	_pd_block_id	xray_pattern_unique_string
	
	_diffrn_radiation_wavelength 0.897654
	
	loop_
	_pd_phase_id
	_pd_phase_block_id
	1	phase_1_unique_string
	2	phase_2_unique_string

	loop_
	_pd_data_point_id
	_pd_meas_2theta_scan
	_pd_meas_intensity_total
	_pd_proc_ls_weight
	_pd_calc_intensity_total
	_pd_proc_intensity_bkg_calc
	1	5.00	43.364	0.040297	25.962	25.962  
	2	5.01	38.007	0.050546	26.168	26.168  
	#...
	
	loop_
	_refln_index_h
	_refln_index_k
	_refln_index_l
	_pd_refln_phase_id
	_refln_observed_status
	_refln_F_squared_meas
	_refln_F_squared_calc
	_refln_d_spacing
	2   0   0  1 o  16.505  16.060	1.76172
	3   1   1  2 o   4.854   5.087	1.63708
	2   2   2  2 o   0.000   0.000	1.56738
	4   0   0  2 o  10.301   9.812	1.35739
	2   2   0  1 o  15.566  15.195	1.24572
	#...
	
data_cw_pattern
	_pd_block_id	cw_neutron_pattern_unique_string
	
	_diffrn_radiation_wavelength 1.987
	
	loop_
	_pd_phase_id
	_pd_phase_block_id
	1	phase_1_unique_string
	2	phase_2_unique_string

	loop_
	_pd_data_point_id
	_pd_meas_2theta_scan
	_pd_meas_intensity_total
	_pd_proc_ls_weight
	_pd_calc_intensity_total
	_pd_proc_intensity_bkg_calc
	1	10.00	43.364	0.040297	25.962	25.962  
	2	10.10	38.007	0.050546	26.168	26.168  
	#...
	
	loop_
	_refln_index_h
	_refln_index_k
	_refln_index_l
	_pd_refln_phase_id
	_refln_observed_status
	_refln_F_squared_meas
	_refln_F_squared_calc 
	_refln_d_spacing
	4   0   0  2 o   9.773   9.812	1.35739
	3   3   1  2 o   4.799   4.801	1.24563
	2   2   0  1 o  15.254  15.195	1.24572
	#...
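
The "who knows about whom" links above are pointers from one block to the `_pd_block_id` of another. As a rough Python sketch of how a reader might resolve them, assume each data block has already been parsed into a plain dict. That dict layout and the `name` key are assumptions for illustration, not the output of any particular CIF library.

```python
def index_blocks_by_id(blocks):
    """Map each block's _pd_block_id to the block itself."""
    index = {}
    for block in blocks:
        block_id = block["_pd_block_id"]
        if block_id in index:
            raise ValueError(f"duplicate _pd_block_id: {block_id}")
        index[block_id] = block
    return index


def phases_for_pattern(pattern_block, index):
    """Follow _pd_phase_block_id pointers from a pattern block to phase blocks."""
    return [index[pointer] for pointer in pattern_block["_pd_phase_block_id"]]


# Hypothetical parsed form of three of the blocks in the example
blocks = [
    {"_pd_block_id": "phase_1_unique_string", "name": "data_phase_1"},
    {"_pd_block_id": "phase_2_unique_string", "name": "data_phase_2"},
    {
        "_pd_block_id": "xray_pattern_unique_string",
        "name": "data_xray_pattern",
        "_pd_phase_block_id": ["phase_1_unique_string", "phase_2_unique_string"],
    },
]

index = index_blocks_by_id(blocks)
xray = index["xray_pattern_unique_string"]
linked = [block["name"] for block in phases_for_pattern(xray, index)]
print(linked)
```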
	
	

@rowlesmr
Collaborator

rowlesmr commented Nov 7, 2021

Maybe my previous examples were a little too complex.

Here I propose the following new data names

  • _pd_profile_block_id: this is the block id of the block which contains the profile information pertaining to the structure/diffraction pattern in the current block
  • _pd_proc_profile_intensity_total & _pd_proc_profile_intensity_net: the intensity attributed to a certain phase, either with or without a background contribution.

In this one:
The crystal structures know about their diffraction patterns through _pd_block_diffractogram_id.
The crystal structures know about their individual profiles through _pd_profile_block_id.
The crystal structures don't know about each other.

The individual profiles know about their diffraction pattern through _pd_block_diffractogram_id.
The individual profiles of a crystal structure don't know about each other.
The individual profiles know about their crystal structure through _pd_phase_block_id.

The diffraction patterns don't know about each other.
The diffraction patterns know about their individual profiles through _pd_profile_block_id.
The diffraction patterns know about their crystal structures through _pd_phase_block_id.

Anyway, I don't really know what I'm doing here, so I'll stop for now.

data_STR1_block
	_pd_block_id STR1
	
	loop_
	_pd_block_diffractogram_id
	XRAY
	NEUTRON
	
	loop_
	_pd_profile_block_id
	STR1_XRAY
	STR1_NEUTRON
	
	loop_
	_refln_d_spacing
	2.3
	3.4
	4.5
	5.6	
	#other crystal structure information

	
data_STR2_block
	_pd_block_id STR2
	
	loop_
	_pd_block_diffractogram_id
	XRAY
	NEUTRON

	loop_
	_pd_profile_block_id
	STR2_XRAY
	STR2_NEUTRON

	loop_
	_refln_d_spacing
	2.35
	3.45
	4.55
	5.65
	#other crystal structure information

	
data_XRAY_block
	_pd_block_id XRAY
	
	loop_
	_pd_phase_block_id
	_pd_profile_block_id
	STR1 STR1_XRAY
	STR2 STR2_XRAY
	
	loop_
	_pd_meas_2theta_scan
	_pd_meas_counts_total
	_pd_calc_intensity_total
	_pd_proc_intensity_bkg_calc
	1 2 3 4
	2 3 4 5
	#etc


data_NEUTRON_block
	_pd_block_id NEUTRON
	
	loop_
	_pd_phase_block_id
	_pd_profile_block_id
	STR1 STR1_NEUTRON
	STR2 STR2_NEUTRON
	
	loop_
	_pd_meas_time_of_flight
	_pd_proc_d_spacing
	_pd_meas_counts_total
	_pd_calc_intensity_total
	_pd_proc_intensity_bkg_calc
	1 2 3 4 5
	2 3 4 5 6
	#etc
	
data_STR1_XRAY_block
	_pd_block_id STR1_XRAY
	
	loop_
	_pd_block_diffractogram_id
	_pd_phase_block_id
	XRAY STR1
	
	loop_
	_pd_meas_2theta_scan
	_pd_proc_profile_intensity_total
	1 2
	2 3
	#etc
	
	
data_STR1_NEUTRON_block
	_pd_block_id STR1_NEUTRON
	
	loop_
	_pd_block_diffractogram_id
	_pd_phase_block_id
	NEUTRON STR1
	
	loop_
	_pd_proc_d_spacing
	_pd_proc_profile_intensity_total
	1 2
	2 3
	#etc
	
	
data_STR2_XRAY_block
	_pd_block_id STR2_XRAY
	
	loop_
	_pd_block_diffractogram_id
	_pd_phase_block_id
	XRAY STR2
	
	loop_
	_pd_meas_2theta_scan
	_pd_proc_profile_intensity_total
	1 2
	2 3
	#etc
	
	
data_STR2_NEUTRON_block
	_pd_block_id STR2_NEUTRON
	
	loop_
	_pd_block_diffractogram_id
	_pd_phase_block_id
	NEUTRON STR2
	
	loop_
	_pd_proc_d_spacing
	_pd_proc_profile_intensity_total
	1 2
	2 3
	#etc

@briantoby
Collaborator

briantoby commented Nov 7, 2021 via email

@rowlesmr
Collaborator

rowlesmr commented Nov 8, 2021

> As a reflection table?

Yes, you can store reflections from individual phases together in a single table when you include _pd_refln_phase_id

loop_
_refln_index_h
_refln_index_k
_refln_index_l
_pd_refln_phase_id
_refln_d_spacing
1 2 3 a 3.4
1 4 8 b 3.6
1 7 9 b 3.8
1 4 1 a 6.6
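
As a small illustration of why the `_pd_refln_phase_id` column is enough to separate the phases again, here is a Python sketch that groups the rows of the loop above by phase. The tuple layout is an assumption made for the example.

```python
from collections import defaultdict

# Columns: h, k, l, _pd_refln_phase_id, _refln_d_spacing (values from the loop above)
rows = [
    (1, 2, 3, "a", 3.4),
    (1, 4, 8, "b", 3.6),
    (1, 7, 9, "b", 3.8),
    (1, 4, 1, "a", 6.6),
]


def d_spacings_by_phase(refln_rows):
    """Group ((h, k, l), d) pairs by the phase id carried on each row."""
    grouped = defaultdict(list)
    for h, k, l, phase_id, d_spacing in refln_rows:
        grouped[phase_id].append(((h, k, l), d_spacing))
    return dict(grouped)


by_phase = d_spacings_by_phase(rows)
print(by_phase["a"])
```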

> OTOH, there is the need to set up for n*m sets of profile descriptions (where there are n phases and m datasets).

Yes, this is clunky.

> It might still be better to use a looped variable for that where a phase ID would be included in a table by dataset

Does "dataset" mean "data block containing a diffraction pattern"? If so, there would need to be a bunch more keywords, but it would cut down on the number of blocks. You would need a profile version of every possible ordinate you could use as X and Y (TOF, 2theta_meas, 2theta_corrected, d_spacing..., intensity, counts, net, total...).

This would definitely mimic a reflection table, just for every point in the diffraction pattern.

It could look something like:

data_XRAY_diffraction_pattern_block
_pd_block_id XRAY

loop_
_pd_phase_id
_pd_phase_block_id
a STR1 
b STR2 

loop_
_pd_meas_2theta_scan
_pd_meas_counts_total
_pd_calc_intensity_total
_pd_proc_intensity_bkg_calc
1.00 2 7 1
1.02 3 7 1
1.04 4 9 3
#etc

loop_
_pd_profile_meas_2theta_scan
_pd_profile_phase_id
_pd_profile_intensity_net
1.00 a 4
1.00 b 2
1.02 a 4
1.04 b 2
#etc

@jamesrhester
Contributor Author

I think the time has come to figure out general principles for presenting complicated data. These principles would apply to PD as well as modulated + composite and any other complex dataset. The plan is to work these out for powder by imagining complicated scenarios and making sure they work. The following is a simple summary of what I've come up with so far. Note that this is all in terms of DDLm dictionaries; DDL1 could never cope properly with the demands of any reasonably complex dataset. NB: the use of block pointers addresses a separate problem that needn't complicate things here.

Key information:

  1. Data names in a Set category may only take a single value in a single data block.
  2. The default value of _audit.schema corresponds to the Set categories defined in the core + powder dictionaries
  3. A change in the value of data name _audit.schema from the default value can change the categories that are Set categories in a particular data block

Tasks:

  1. To define which categories in the powder CIF dictionary are Set categories and thereby define how things are distributed over data blocks
  2. To define a non-default value of _audit.schema for data blocks where we want to collect information from multiple data blocks.

As I understand it, the way in which powder would like to split things up is to have information specific to a particular phase in separate data blocks. Therefore, in DDLm terms, pd_phase is a Set category. This flows through to all "child" data names of _pd_phase.id: for example, _pd_profile.phase_id must also take only a single value in a single data block, so you can't loop _pd_profile as in the previous example. The same goes for _pd_refln.phase_id.

Cif_core specifies that diffrn is a Set category, so different experimental conditions/radiations should also be in separate data blocks. I think this means that there is one diffractogram per data block as well.

Now I gather that a "summary block" is desirable, where selected information found in the other blocks is collated. This would be where block pointers would be included, but it should be the case that the same information could be obtained by just reading in all of the other data blocks. In any case, the summary block would need to e.g. loop _pd_phase.id and _diffrn.id which means they are no longer Set categories. The way to write such a block would be to set _audit.schema to something like Powder Summary (which we can define) and then loop to our heart's content.

I think this all started because @rowlesmr wanted to record the contributions of each phase to the calculated diffraction pattern. In the scheme posited above, this would require a separate tabulation in each data block corresponding to a particular diffraction pattern + particular phase, as well as a tabulation of the overall fit in each data block corresponding to a particular diffraction pattern (with no phase-specific information). This may seem vaguely wasteful of space due to the repetition of the 2 theta values, but the alternative would be to define a further _audit.schema that allowed phases to be looped but not diffrn.

So my question is, does the above scheme cover all situations that you've encountered? Have I perhaps missed something else that should be separated into another data block?

@briantoby
Collaborator

briantoby commented Nov 9, 2021 via email

@jamesrhester
Contributor Author

Apologies for the lack of clarity. In DDLm dictionaries, categories are classified as Set or Loop. Datanames in a Set category may only have one value per data block (something like list = no in DDL1), so if there are in fact many values (e.g. many phases) then having those phase_ids in a Set category forces those phases to be listed in separate data blocks. Classifying categories between Set and Loop enables us to define how to present complex data unambiguously. So what I'm trying to pin down is exactly how we would like to do that. Note that the single value restriction applies only to the "topmost" data names, in our case _pd_phase.id. Child data names (the ones that draw from its values) do not have to belong to Set categories.
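
A rough Python sketch of the Set restriction, under stated assumptions: a data block is represented as a dict mapping data names to lists of values, and the caller supplies which data names are treated as belonging to Set categories. Both structures and the function name are hypothetical, and counting distinct values (rather than rows) is my reading of "one value per data block".

```python
def set_category_violations(block, set_data_names):
    """Return Set-category data names that carry more than one distinct value."""
    return sorted(
        name
        for name in set_data_names
        if len(set(block.get(name, []))) > 1
    )


# Assumed classification for the example, following the discussion above
set_names = {"_pd_phase.id", "_diffrn.id"}

looped_block = {"_pd_phase.id": ["a", "b"], "_diffrn.id": ["xray"]}
single_phase_block = {"_pd_phase.id": ["a"], "_diffrn.id": ["xray"]}

print(set_category_violations(looped_block, set_names))
print(set_category_violations(single_phase_block, set_names))
```

A block that fails this check would either need splitting into one block per phase or a non-default _audit.schema.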

> I now understand what is wanted to provide partial patterns by phase. From a logistics perspective one really wants all the partials in a single loop. What one really needs is a way to say a CIF name gets N values not 1 for every row in the table. I think star might have a quoting or grouping mechanism that allows this even if CIF does not.

The only way to do this in a single loop in even our most flexible interpretation of the relational model is to have a separate column labelling the phase this calculated intensity belongs to. So for two phases you would have what @rowlesmr proposed:

loop_
_pd_profile_meas_2theta_scan
_pd_profile_phase_id
_pd_profile_intensity_net
1.00 a 4
1.00 b 2
1.02 a 4
1.04 b 2
#etc

If that is what you would prefer then we can do that. I don't understand why having the partial pattern grouped together in a separate data block with the per phase, per histogram information is less practical though.
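
For what it's worth, a reader can turn the labelled single loop back into per-phase partial patterns in a few lines. This Python sketch assumes the rows have already been read into (2theta, phase id, net intensity) tuples; it relies only on the labelling column, not on row order.

```python
from collections import defaultdict

# (2theta, phase id, net intensity), as in the loop above
rows = [
    (1.00, "a", 4),
    (1.00, "b", 2),
    (1.02, "a", 4),
    (1.04, "b", 2),
]


def partials_by_phase(profile_rows):
    """Group (2theta, intensity) points by phase id."""
    partials = defaultdict(list)
    for two_theta, phase_id, intensity in profile_rows:
        partials[phase_id].append((two_theta, intensity))
    # Sort by 2theta so the result does not depend on the order of rows
    return {phase: sorted(points) for phase, points in partials.items()}


partials = partials_by_phase(rows)
print(partials)
```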

@rowlesmr
Collaborator

rowlesmr commented Nov 9, 2021

What do you mean by "logistically" when wanting the partials all in one loop?

If they are all in one loop, you probably don't need the complexity of linking them to the structures and diffractograms, as you could just stick it in the diffractogram block and piggyback off the linking that is already there.
If each profile is in its own block, you do need to link everything, but you get the simplicity of "this block is just for that phase in that other diffractogram".

In both cases, the total number of datapoints you're adding is the same, as you still need to repeat each datapoint in the measured data for each profile you want to record.

.

I should explain my "clunky" comment. Ideally, you could have a single loop that gives columns for 2theta, meas_intensity, calc_intensity, and then one column per individual profile, but that would either necessitate repeating the profile intensity dataname in a loop, or having an arbitrary number of datanames to hold profile_1, profile_2... intensities

The clunkiness arises from having to repeat 2theta values in different loops or blocks that already exist.

@briantoby
Collaborator

briantoby commented Nov 9, 2021 via email

@rowlesmr
Collaborator

> Now, something like 20 years after the introduction of multiple blocks for related information in pdCIF, do we yet have any software that assembles multiple blocks?

pip install pdCIFplotter :)

(Just on that, Dave Billings should be emailing you and James about what has just started in the CPD)

.

I've only looked at the pictures in "DDLm: A New Dictionary Definition Language". Is it possible to have vectors, where their length is defined by another data item?

#using a mixture of old and new syntax, as I don't know how to upgrade...
data_XRAY_diffraction_pattern_block
_pd_block_id XRAY

loop_
_pd_phase_id
_pd_phase_block_id
a STR1 
b STR2 
c STR3

loop_ 
_pd_meas_2theta_scan 
_pd_meas_counts_total
_pd_calc_intensity_total
_pd_proc_intensity_bkg_calc
_pd_profile_intensity_net #length is defined as number of rows in _pd_phase_id ( or in _pd_phase_block_id)
5.00 120 120 100 [7, 8, 5]
5.02 123 121 100 [7, 8, 6]
#...

@jamesrhester
Contributor Author

I am reminded of a quote I read once in a database book that I've never been able to find again: the relational model is always second-best, meaning that in any given situation you can find a more efficient, streamlined way to represent data, but the relational model will still be second-best when the situation changes, while your original streamlined approach is now much worse.

Anyway.

Matthew's suggestion of using CIF2 vectors would be workable, with another vector defined somewhere as per one of Brian's suggestions above to give the order of phases. There is no need to define a length for a CIF2 vector.

So, I've slightly expanded Matthew's example below. How does it look?

Notes on the example:

  1. New category pd_phases is a per-diffraction-pattern category for information about phases in general.
  2. When there are multiple diffraction patterns that have been fit, each separate diffraction pattern would need a _pd_phases_presentation_order item
  3. We would define dREL routines within the dictionary that define the use of these new datanames to be equivalent to presenting each phase's partials in an appropriate per-phase-per-diffraction-pattern block, so the per-phase-per-diffraction-pattern approach would remain an option.
  4. CIF1 format files would not be able to use the vector notation.

data_XRAY_diffraction_pattern_block
_pd_block_id XRAY
_pd_phases_presentation_order [a b c]

loop_
_pd_phase_id
_pd_phase_block_id
a STR1 
b STR2 
c STR3

loop_ 
_pd_meas_2theta_scan 
_pd_meas_counts_total
_pd_calc_intensity_total
_pd_proc_intensity_bkg_calc
_pd_profile_phase_intensity_net
5.00 120 120 100 [7 8 5]
5.02 123 121 100 [7 8 6]
#...
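
A minimal Python sketch of how software might use the presentation-order list to label each entry of the per-point vector. It assumes the values have already been parsed into Python lists; the function name `label_partials` is hypothetical.

```python
presentation_order = ["a", "b", "c"]  # value of _pd_phases_presentation_order

# (2theta, per-phase net intensities) taken from the example loop
rows = [
    (5.00, [7, 8, 5]),
    (5.02, [7, 8, 6]),
]


def label_partials(order, loop_rows):
    """Attach a phase id to every entry of each per-point intensity list."""
    labelled = []
    for two_theta, intensities in loop_rows:
        if len(intensities) != len(order):
            raise ValueError(
                f"expected {len(order)} values at 2theta={two_theta}, "
                f"got {len(intensities)}"
            )
        labelled.append((two_theta, dict(zip(order, intensities))))
    return labelled


labelled = label_partials(presentation_order, rows)
print(labelled[1])
```

The length check stands in for the "length defined by another data item" idea: the list needs no declared length, but it only makes sense if it matches the presentation order.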

@rowlesmr
Collaborator

Is it not possible to automatically define _pd_phases_presentation_order [a b c] from loop_ _pd_phase_id a b c?

It would be easier to maintain the CIF file if I only need to write down the phases in one place. Although, I do recall from somewhere (pycifrw docs?) that row order isn't guaranteed in CIFs...

@jamesrhester
Contributor Author

No, the order of rows is very deliberately not significant. I understand your concerns with writing down the phases in more than one place; this is a key concern of the relational model, which aims to minimise duplication of information. The "ideal" relational approach in our case would have every separate phase in a separate data block, with no "summary block", meaning you really would only write the phases down once, and then shuffle the data around after reading it in, to match your problem of the day.

@jamesrhester
Contributor Author

Relevant to this issue is http://comcifs.github.io/accepted/multi-block-principles. The core dictionary combined with that document and PD Loop/Set dictionary decisions dictates how multi-wavelength/sample/diffraction condition/histogram data are distributed over multiple data blocks. I suggest we work together on a document that lays out the principles for PD. I'm happy to draft a first attempt and put it up at comcifs.github.io for discussion. It would be great to have the PD commission involved as well; that can happen once a draft is up for discussion.

@jamesrhester
Contributor Author

I've now drafted a document for ongoing discussion: https://github.com/COMCIFS/comcifs.github.io/blob/master/draft/powder_data_presentation.md

@rowlesmr
Collaborator

How about something like this?

data_XRAY_diffraction_pattern_block
_pd_block_id XRAY
_pd_phases.profile_presentation_order [a b c]

loop_
_pd_phase.id
_pd_phase.block_id
a STR1 
b STR2 
c STR3

loop_ 
_pd_meas.2theta_scan 
_pd_meas.counts_total
_pd_calc.intensity_total
_pd_proc.intensity_bkg_calc
_pd_phase.profile_intensity_net
5.00 120 120 100 [7 8 5]
5.02 123 121 100 [7 8 6]
#...

New data items:

  • _pd_phases.profile_presentation_order
  • _pd_phase.profile_intensity_net
  • _pd_phase.profile_intensity_total

@jamesrhester
Contributor Author

As per previous discussions, I proposed that there would always be something like _pd_calc.phase_id to offer the option of tabulating the calculated contribution in separate per-phase blocks. I realise now that this idea of mine is fundamentally wrong. pdCIF has been set up to make it possible to tabulate measured and calculated intensities in a single loop, and there is no particular phase that measured intensities (in general!) belong to. Therefore there cannot be a specific phase associated with this loop, and the intensities-in-a-list proposal is compatible with current pdCIF.

Therefore, any per-phase calculated intensity loop must be in a different (new) category, let's call it pd_calc_components where the calculated intensities are listed for a single phase. By defining things this way, it is possible for the above component-intensities-in-a-list proposal and the per-phase listing proposal to be compatible and co-exist.

So that was a long-winded way of saying, yes, I have no objections to this proposal, as long as pd_calc_components exists.

@rowlesmr
Collaborator

rowlesmr commented Oct 6, 2022

I only know enough to be dangerous, so questions:

isn't that what _pd_phases.profile_presentation_order is doing? mapping a _pd_phase.profile_intensity_total to a _pd_phase_id to a _pd_phase_block_id?

and PD_CALC_COMPONENTS needs to be a child of PD_DATA so everything can be looped nicely? Isn't that just shifting the issue down the line one step?

.

or is it something like:

_pd_phases.profile_presentation_order is a matrix of _pd_phase.id values, and
_pd_phases.profile_intensity_net|total is a matrix of _pd_calc_components.profile_intensity_net|total

such that the order of values given in _pd_phases.profile_intensity_net matches the order of phases given in _pd_phases.profile_presentation_order

.

or is it to do with a summary block listing all of the histograms, phases, component profiles, and the like?

.

Example time!

component-intensities-in-a-list:

data_XRAY_diffraction_pattern_block
_pd_block_id XRAY
_pd_phases.profile_presentation_order [a b c]

loop_
_pd_phase.id
_pd_phase.block_id
a STR1 
b STR2 
c STR3

loop_ 
_pd_meas.2theta_scan 
_pd_meas.counts_total
_pd_calc.intensity_total
_pd_proc.intensity_bkg_calc
_pd_phases.profile_intensity_net
5.00 120 120 100 [7 8 5]
5.02 123 121 100 [7 8 6]
#...

Per-phase listing

data_summary
#things go here


data_XRAY_diffraction_pattern_block
_pd_block_id XRAY

loop_
_pd_phase.id
_pd_phase.block_id
a STR1 
b STR2 
c STR3

loop_ 
_pd_meas.2theta_scan 
_pd_meas.counts_total
_pd_calc.intensity_total
_pd_proc.intensity_bkg_calc
5.00 120 120 100 
5.02 123 121 100 
#...


data_STR1XRAY_component_block
_pd_block.id STR1_XRAY
_pd_phase.block_id STR1
_pd_block.diffractogram_id XRAY
loop_ 
_pd_meas.2theta_scan 
_pd_calc_components.profile_intensity_net
5.00 7
5.02 7
#...


data_STR2XRAY_component_block
_pd_block_id STR2_XRAY
_pd_phase.block_id STR2
_pd_block.diffractogram_id XRAY
loop_ 
_pd_meas.2theta_scan 
_pd_proc.intensity_bkg_calc
_pd_calc_components.profile_intensity_total
5.00 100 108
5.02 100 108
#...


data_STR3XRAY_component_block
_pd_block_id STR3_XRAY
_pd_phase.block_id STR3
_pd_block.diffractogram_id XRAY
loop_ 
_pd_meas.2theta_scan 
_pd_calc_components.profile_intensity_net
5.00  5
5.02  6
#...
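
To show what a reader has to do with the per-phase listing, here is a hedged Python sketch that gathers the component blocks belonging to one diffractogram. The dict layout and short key names are assumptions for illustration (not a real CIF library's output), and the conversion net = total - bkg ignores any normalisation.

```python
def net_profile(block):
    """Return [(2theta, net intensity)] for one component block.

    A block may tabulate net intensities directly, or total intensities
    together with the calculated background.
    """
    if "net" in block:
        return list(zip(block["two_theta"], block["net"]))
    return [
        (x, total - bkg)
        for x, total, bkg in zip(block["two_theta"], block["total"], block["bkg"])
    ]


def components_for(diffractogram_id, blocks):
    """Collect per-phase net profiles for the given diffractogram."""
    return {
        block["phase_block_id"]: net_profile(block)
        for block in blocks
        if block["diffractogram_id"] == diffractogram_id
    }


# Hypothetical parsed form of the three component blocks above
blocks = [
    {"phase_block_id": "STR1", "diffractogram_id": "XRAY",
     "two_theta": [5.00, 5.02], "net": [7, 7]},
    {"phase_block_id": "STR2", "diffractogram_id": "XRAY",
     "two_theta": [5.00, 5.02], "total": [108, 108], "bkg": [100, 100]},
    {"phase_block_id": "STR3", "diffractogram_id": "XRAY",
     "two_theta": [5.00, 5.02], "net": [5, 6]},
]

xray_components = components_for("XRAY", blocks)
print(xray_components)
```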

@jamesrhester
Contributor Author

So this is not to do with the summary block. By having the pd_calc_components block we can ensure that the machine-readable part of the dictionary is able to capture as many links as possible between data names, which in turn means that as much as possible of the dictionary can be interpreted and manipulated automatically using the relational model. The profile_presentation_order approach does capture the same relationships, but only if a programmer reads the text descriptions and implements the link between position in the list and phase, that is, the relationships are expressed outside of the relational model despite being expressible within the relational model. I know this is a bit of an abstract point, but experience shows that keeping as close as possible to the relational model keeps us robust against future changes.

Small point: lists (square-bracket-delimited values) are a CIF2 feature so any CIF reading software expecting CIF1 format is likely to fail rather than skipping over the value. Perhaps a more pedestrian reason for pd_calc_components as well as an incentive to handle CIF2?

I've written out some dREL below to assure myself that not having pd_calc_components is not going to render the data files somehow unable to be processed relationally. All seems fine so I can drop that pd_calc_components requirement for now and we can simply add it in future if it becomes desirable for some relational reason. For now dropping it just means that pure dictionary-based software that wants all information to do with a phase will not access any per-phase per-point information, and there is the CIF2 thing I mentioned above.

Also, CIF allows the use of massive image arrays of numbers instead of the pure relational approach of a table of x,y positions and pixel intensity. So it is not like using an array to save space is new.

I've written out some dREL showing the precise relationships between these categories. Note how dREL forces us to explicitly specify exactly how total intensity is calculated (i.e. whether or not scale factors are used).

# dREL pseudo-code for handling profile_presentation_order type information
# A Category method for populating a pd_calc_components category from profile_intensity_net information

loop pd as pd_calc {   # loop over the rows of pd_calc
    for phase_num in 1:len(pd.profile_intensity_net) {
        pd_calc_components.(point_id = pd.point_id,
                    phase_id = pd_phases.profile_presentation_order[phase_num],
                    profile_intensity_net = pd.profile_intensity_net[phase_num]
        )
    }
}

# dREL pseudo code for total intensity: attached to _pd_calc.intensity_total
# Called for every row in pd_calc

t = 0
loop pcc as pd_calc_components {
    t = t + pcc.profile_intensity_net   #Is this right? Do we need a scale factor? Background?
    }
pd_calc.intensity_total = t

It is indeed possible to write dREL for the profile_presentation_order case, skipping pd_calc_components:

# dREL pseudo-code for calculating net total intensity, this is called for each row of pd_calc
t = 0
for i in 1:len(pd_calc.profile_intensity_net) {   # i is a position in the list, not a value
    t = t + pd_calc.profile_intensity_net[i]
}
pd_calc.intensity_total = t

If we need to access the scale for a particular phase we get instead:

# dREL pseudo-code for calculating net total intensity scaled by phase scale
t = 0
for i in 1:len(pd_calc.profile_intensity_net) {   # i is a position in the list, not a value
    ph = pd_phases.profile_presentation_order[i]
    scale = pd_xxx.scale[ph]  # Don't actually record the scale?
    t = t + pd_calc.profile_intensity_net[i] * scale
}
pd_calc.intensity_total = t
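
For readers who don't speak dREL, here is a rough Python equivalent of the first two routines: expanding the list-valued column into one record per phase per point, then summing the net contributions per point. The dict keys mirror the data names above, but the data structures themselves are assumptions, and no scale factor or background is applied.

```python
presentation_order = ["a", "b", "c"]  # pd_phases.profile_presentation_order

# Rows of pd_calc, using the values from the earlier examples
calc_rows = [
    {"point_id": 1, "profile_intensity_net": [7, 8, 5]},
    {"point_id": 2, "profile_intensity_net": [7, 8, 6]},
]


def expand_components(order, rows):
    """Build pd_calc_components-style records from list-valued rows."""
    return [
        {"point_id": row["point_id"], "phase_id": phase_id,
         "profile_intensity_net": net}
        for row in rows
        for phase_id, net in zip(order, row["profile_intensity_net"])
    ]


def net_sum_per_point(components):
    """Sum the per-phase net intensities separately for each data point."""
    sums = {}
    for component in components:
        point = component["point_id"]
        sums[point] = sums.get(point, 0) + component["profile_intensity_net"]
    return sums


components = expand_components(presentation_order, calc_rows)
sums = net_sum_per_point(components)
print(sums)
```

Note that the sum is keyed on point_id, so each point only collects its own components.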

@rowlesmr
Collaborator

rowlesmr commented Oct 7, 2022

> I know this is a bit of an abstract point, but experience shows that keeping as close as possible to the relational model keeps us robust against future changes.

This sounds like a good reason to put it in.

> Small point: lists (square-bracket-delimited values) are a CIF2 feature so any CIF reading software expecting CIF1 format is likely to fail rather than skipping over the value. Perhaps a more pedestrian reason for pd_calc_components as well as an incentive to handle CIF2?

I know the parser I'm fiddling around with writing for CIF1 just fails when it gets a '['. Pedestrian, but still legitimate.

.

_pd_calc.intensity_total includes bkg and normalisation, and so is specified on the same scale as the observed intensities.
_pd_calc.intensity_net does not contain bkg or normalisation and so is specified on the same scale as _pd_proc.intensity_net.

so I think, strictly,

# dREL pseudo code for total intensity: attached to _pd_calc.intensity_total
# Called for every row in pd_calc

t = pd_proc.intensity_bkg_calc  #I don't know if this is legitimate, but it's what I want to do.
loop pcc as pd_calc_components {
    t += pcc.profile_intensity_total - pd_proc.intensity_bkg_calc
    }
pd_calc.intensity_total = t

^ With this definition, overlaying _pd_meas.intensity_total and _pd_calc_components.profile_intensity_total means they'll line up and overlap nicely: there is no bkg offset, and the intensities are on the same scale.

# dREL pseudo code for net intensity: attached to _pd_calc.intensity_net
# Called for every row in pd_calc

t = 0
loop pcc as pd_calc_components {
    t += pcc.profile_intensity_net
    }
pd_calc.intensity_net = t

^ This definition requires that the bkg and normalisation corrections are identical for each _pd_calc_components.profile_intensity_net.
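
As a quick numerical check of the two definitions, here is a Python sketch using the numbers from the earlier examples (bkg 100, net contributions 7, 8 and 5). It assumes each component total is simply bkg + net, i.e. no normalisation factor, which is a simplification.

```python
def calc_intensity_net(component_nets):
    """Net calculated intensity: the sum of the per-phase net contributions."""
    return sum(component_nets)


def calc_intensity_total(bkg, component_totals):
    """Total calculated intensity: background counted once, plus each
    component's contribution above background."""
    return bkg + sum(total - bkg for total in component_totals)


bkg = 100
nets = [7, 8, 5]
totals = [bkg + net for net in nets]  # assumption: total = bkg + net

print(calc_intensity_net(nets))
print(calc_intensity_total(bkg, totals))
```

Under that assumption the total reproduces the 120 tabulated for _pd_calc.intensity_total at 5.00 in the example loops.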

.

I think that the scale factor you're looking for should be _pd_proc.intensity_norm; _pd_proc.intensity_net doesn't go into detail on where to enumerate the "correction and normalization factors" used.

@rowlesmr rowlesmr linked a pull request Jan 6, 2023 that will close this issue
@rowlesmr
Collaborator

rowlesmr commented Jan 6, 2023

Still need to add _pd_calc_component.phase_id and _pd_calc_component.diffractogram_id data names to indicate in a machine-readable way that the information in pd_calc_component is per phase, per diffractogram.

@rowlesmr
Collaborator

rowlesmr commented Feb 2, 2023

> Still need to add _pd_calc_component.phase_id and _pd_calc_component.diffractogram_id data names to indicate in a machine-readable way that the information in pd_calc_component is per phase, per diffractogram.

They are there.

@rowlesmr rowlesmr closed this as completed Feb 2, 2023