Add property provided_dtypes to Context #404

ershockley · 2021-03-05T20:57:56Z

What is the problem / what does the code in this PR do
I often want a broad overview of the different data types, their lineage hashes, and whether they are saved to disk or not. This is especially useful for e.g. reprocessing or for comparing the changes to hashes between different contexts.

Can you briefly describe how it works?
It's a convenience function -- no new functionality. I thought making it a property would be best to make it immutable (or at least more difficult to mutate) but still a dictionary which is nice to work with.
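The trade-off described above can be sketched in plain Python. This is only an illustration under assumptions: `Registry` and its `_plugins` attribute are hypothetical stand-ins, not strax code. A property without a setter cannot be reassigned, and because the dict is rebuilt on each access, changes to a returned dict do not persist.

```python
class Registry:
    """Hypothetical stand-in for a context; not strax code."""

    def __init__(self):
        self._plugins = {'useless_data': 'ALWAYS'}

    @property
    def provided_dtypes(self):
        # Built fresh on every access, so each caller gets its own dict.
        return {name: {'save_when': sw} for name, sw in self._plugins.items()}


reg = Registry()
summary = reg.provided_dtypes
summary['bogus'] = {}  # only changes the caller's copy

try:
    reg.provided_dtypes = {}  # no setter is defined for the property
except AttributeError:
    reassigned = False
else:
    reassigned = True
```

The returned dict itself is still an ordinary mutable dict, which is why the property makes mutation "more difficult" rather than impossible.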

Can you give a minimal working example (or illustrate with a figure)?
Sure:

import strax
from pprint import pprint
import numpy as np

_dtype_name = 'variable'
_dtype = ('variable 1', _dtype_name)
test_dtype = [(_dtype, np.float64)] + strax.time_fields

class UselessPlugin(strax.Plugin):
    """A minimal plugin without dependencies"""
    provides = 'useless_data'
    dtype = test_dtype
    depends_on = tuple()

    def compute(self, something):
        return np.ones(len(something), dtype=self.dtype)

class UselessPlugin2(strax.Plugin):
    """A second plugin that depends on useless_data"""
    dtype = test_dtype
    provides = 'more_useless_data'
    depends_on = 'useless_data'
    save_when = strax.SaveWhen.TARGET

    def compute(self, something):
        return np.ones(len(something), dtype=self.dtype) * 2

st = strax.Context(storage=[])
st.register((UselessPlugin, UselessPlugin2))

pprint(st.provided_dtypes)

prints:

{'more_useless_data': {'hash': 'yxzcuhurrz', 'save_when': 'TARGET'},
 'useless_data': {'hash': 'bpf35bjrkg', 'save_when': 'ALWAYS'}}

This new method should not require any changes apart from perhaps including it in the documentation. I don't think tests are necessary.

JoranAngevaare

Nice feature Evan. This is definitely helpful!

I have three ideas/suggestions:

Would it make sense to make it a dataframe, so that we get a table that displays this information more easily (in a notebook)? We do a similar thing for st.show_config: https://github.com/AxFoundation/strax/blob/master/strax/context.py#L305
Would it make sense to add some other info to this list, such as the version and compressor?
Perhaps make run_id an option, as I explain below.

JoranAngevaare · 2021-03-08T07:40:15Z

strax/context.py

+        Summarize useful dtype information provided by this context
+        :return: dictionary of provided dtypes with their corresponding lineage hash and save_when
+        """
+        hashes = set([(d, self.key_for('0', d).lineage_hash, p.save_when)


While this does work, the hash you get can depend on which run you ask for, because an option may be declared with the default_per_run argument, as is done here:
https://github.com/XENONnT/straxen/blob/master/straxen/plugins/event_processing.py#L293

This way of tracking options is rather outdated now that we have CMT, and it is also not very nice (precisely because of bookkeeping issues like this). However, since strax does not forbid it a priori, perhaps you want to make run_id='0' a keyword argument?

In case you wonder, for nT we only actively use it for the nveto processing (would be better if we can use CMT instead):

>>> pprint((st.key_for('0', 'records_nv'), st.key_for('13000', 'records_nv')))
(0-records_nv-gemzm4oj2y, 13000-records_nv-xw2hbel7by)
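Why the key changes with the run can be shown with a toy model. The sketch below is not strax's hashing scheme: `toy_lineage_hash`, `resolve_option`, the option name `some_option` and the run boundaries are all made up for illustration. The only point is that once a default depends on the run, anything hashed from the resolved configuration depends on the run too.

```python
import hashlib
import json


def toy_lineage_hash(lineage):
    """Hash a JSON-serialisable lineage dict (illustrative, not strax's scheme)."""
    payload = json.dumps(lineage, sort_keys=True).encode()
    return hashlib.sha1(payload).hexdigest()[:10]


def resolve_option(per_run_defaults, run_id):
    """Return the value of the latest entry whose start run is <= run_id."""
    value = None
    for start_run, candidate in sorted(per_run_defaults):
        if int(run_id) >= start_run:
            value = candidate
    return value


# Hypothetical option whose default changes from run 10000 onwards.
per_run_defaults = [(0, 'old_setting'), (10000, 'new_setting')]


def key_for(run_id, data_type):
    """Build a toy 'run-datatype-hash' key from the run-resolved option."""
    option_value = resolve_option(per_run_defaults, run_id)
    lineage = {data_type: {'some_option': option_value}}
    return f"{run_id}-{data_type}-{toy_lineage_hash(lineage)}"


print(key_for('0', 'records_nv'))
print(key_for('13000', 'records_nv'))
```

Runs that resolve to the same option value share a hash in this model; runs on either side of the boundary do not.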

ershockley · 2021-03-08

Thanks @jorana - I had gotten this code from Jelle at some point a while ago and tbh never realized that the '0' was referring to the runid. I am quite surprised and concerned that the hash for the same context/dtype pair could change depending on the runid. This makes bookkeeping even more of a nightmare than it already is 😕

My personal preference would be to move to CMT for nveto processing ASAP and push to have the collection of dtypes and their hashes be a property of just the context, not the runid. But if this causes problems then yeah we should drop the property decorator and add a kwarg.

JoranAngevaare · 2021-03-08

Yes, for XENON we definitely need to do this. I just wanted to point out that this doesn't have to be the case for strax a priori (although, to be honest, I don't really like that we support it, so I'm fully with you on this one).

ershockley · 2021-03-08

Ah yes I see your point. In that case I agree we should probably make this a normal method and add the runid as kwarg. I love properties but oh well I'll live 😁
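A minimal sketch of the agreed change, turning the property into a method with a runid keyword argument. It runs without strax: `FakeContext`, `FakePlugin`, `DataKey`, the `SaveWhen` enum and the placeholder hash are hypothetical stand-ins, and the body only mirrors the shape of the line quoted from strax/context.py rather than reproducing the merged code.

```python
from collections import namedtuple
from enum import Enum

# Hypothetical stand-ins so the sketch runs without strax installed.
DataKey = namedtuple('DataKey', 'run_id data_type lineage_hash')


class SaveWhen(Enum):
    TARGET = 'target'
    ALWAYS = 'always'


class FakePlugin:
    def __init__(self, provides, save_when):
        self.provides = provides
        self.save_when = save_when


class FakeContext:
    def __init__(self, plugins):
        self.plugins = plugins

    def key_for(self, run_id, data_type):
        # Placeholder hash that varies with the run, for demonstration only.
        fake_hash = f'{data_type[:4]}{int(run_id) % 7}'
        return DataKey(run_id, data_type, fake_hash)

    def provided_dtypes(self, runid='0'):
        """Map each provided data type to its hash and save_when for runid."""
        return {
            d: {'hash': self.key_for(runid, d).lineage_hash,
                'save_when': p.save_when.name}
            for p in self.plugins
            for d in p.provides
        }


st = FakeContext([FakePlugin(('useless_data',), SaveWhen.ALWAYS)])
summary_default = st.provided_dtypes()
summary_run = st.provided_dtypes(runid='13000')
```

Keeping '0' as the default preserves the original behaviour for callers that do not care about per-run defaults, while the explicit argument covers the cases where the hash does depend on the run.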

ershockley · 2021-03-08T17:16:38Z

Thanks for the comments @jorana! I'm coming at this more from a data management standpoint and just included the values that are useful to me currently. Of course we can add more if you think they are useful. I think version makes sense but do you really think we need the compressor? At least for my use-cases this isn't super important, and I'm not sure analysts would care much either. Up to you though.

Maybe I'm too old school but I would prefer a dict to a dataframe, for my purposes at least. I would be mainly using it for jobs out on the grid where notebook visualization does not apply though. If you feel strongly we can of course switch to a df.

JoranAngevaare · 2021-03-08T17:23:03Z

Hi Evan, sure, that is completely fine. I was just thinking of using this as a plugin summary that I know others could use, but if it doesn't provide a useful format for the main intended user (you/your Darwin alter-ego), this idea doesn't float. If you need to display that info, dataframes are nice, but they are sometimes cumbersome if you want to use the data later on - I agree.
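Keeping the dict does not rule out a table view when one is wanted. The sketch below uses only the standard library; `format_summary` is a hypothetical helper, not part of strax. In a notebook, `pandas.DataFrame.from_dict(summary, orient='index')` would be the DataFrame route instead.

```python
def format_summary(summary):
    """Render a {name: {column: value}} dict as fixed-width text rows."""
    columns = sorted({col for info in summary.values() for col in info})
    header = ['data_type'] + columns
    rows = [[name] + [str(info.get(col, '')) for col in columns]
            for name, info in sorted(summary.items())]
    table = [header] + rows
    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in table) for i in range(len(header))]
    lines = ['  '.join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
             for row in table]
    return '\n'.join(lines)


# Values copied from the example output earlier in this conversation.
summary = {
    'more_useless_data': {'hash': 'yxzcuhurrz', 'save_when': 'TARGET'},
    'useless_data': {'hash': 'bpf35bjrkg', 'save_when': 'ALWAYS'},
}
table_text = format_summary(summary)
print(table_text)
```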

ershockley · 2021-03-08T17:35:53Z

Just to update the working example. Now it would look like this, for some context st:

pprint(st.provided_dtypes())

returns

{'more_useless_data': {'hash': 'yxzcuhurrz',
                       'save_when': 'TARGET',
                       'version': '0.0.0'},
 'useless_data': {'hash': 'bpf35bjrkg',
                  'save_when': 'ALWAYS',
                  'version': '0.0.0'}}
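One use case from the PR description, comparing hashes between contexts, could look roughly like this. `changed_hashes` is a hypothetical helper that relies only on the dict shape shown above; the first dict is copied from that output and the second is made up for illustration. With real contexts the inputs would come from calls such as `st_a.provided_dtypes()` and `st_b.provided_dtypes()`, where `st_a` and `st_b` are assumed names.

```python
def changed_hashes(old, new):
    """Return {data_type: (old_hash, new_hash)} for entries whose hash differs.

    A data type missing from one side is reported with None for that side.
    """
    changes = {}
    for name in sorted(set(old) | set(new)):
        old_hash = old.get(name, {}).get('hash')
        new_hash = new.get(name, {}).get('hash')
        if old_hash != new_hash:
            changes[name] = (old_hash, new_hash)
    return changes


summary_a = {
    'more_useless_data': {'hash': 'yxzcuhurrz', 'save_when': 'TARGET',
                          'version': '0.0.0'},
    'useless_data': {'hash': 'bpf35bjrkg', 'save_when': 'ALWAYS',
                     'version': '0.0.0'},
}
# Made-up second summary: one hash changed, one data type added.
summary_b = {
    'more_useless_data': {'hash': 'aaaaaaaaaa', 'save_when': 'TARGET',
                          'version': '0.0.1'},
    'useless_data': {'hash': 'bpf35bjrkg', 'save_when': 'ALWAYS',
                     'version': '0.0.0'},
    'extra_data': {'hash': 'bbbbbbbbbb', 'save_when': 'ALWAYS',
                   'version': '0.0.0'},
}
diff = changed_hashes(summary_a, summary_b)
```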

JoranAngevaare · 2021-03-08T18:21:50Z

Thanks Evan

Add property provided_dtypes to Context

b0b70fd

ershockley requested a review from JoranAngevaare March 5, 2021 20:59

Update docstring

bcde2cc

JoranAngevaare approved these changes Mar 8, 2021

ershockley added 2 commits March 8, 2021 09:32

Add runid argument and version to provided_dtypes

bb76664

update docstring

c077fc1

Merge branch 'master' into summarize_hashes

2204894

JoranAngevaare merged commit 38f60dc into master Mar 8, 2021

JoranAngevaare deleted the summarize_hashes branch March 8, 2021 18:21
