Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add property provided_dtypes to Context #404

Merged
merged 5 commits into from Mar 8, 2021
Merged

Conversation

ershockley
Copy link
Contributor

What is the problem / what does the code in this PR do
I often want a broad overview of the different data types, their lineage hashes, and whether they are saved to disk or not. This is especially useful for e.g. reprocessing or for comparing the changes to hashes between different contexts.

Can you briefly describe how it works?
It's a convenience function -- no new functionality. I thought making it a property would be best to make it immutable (or at least more difficult to mutate) but still a dictionary which is nice to work with.

Can you give a minimal working example (or illustrate with a figure)?
Sure:

import strax
from pprint import pprint
import numpy as np

_dtype_name = 'variable'
_dtype = ('variable 1', _dtype_name)
test_dtype = [(_dtype, np.float64)] + strax.time_fields

class UselessPlugin(strax.Plugin):
    """The plugin that we will be sub-classing"""
    provides = 'useless_data'
    dtype = test_dtype
    depends_on = tuple()

    def compute(self, something):
        return np.ones(len(something), dtype=self.dtype)

class UselessPlugin2(strax.Plugin):
    """The plugin that we will be sub-classing"""
    dtype = test_dtype
    provides = 'more_useless_data'
    depends_on = 'useless_data'
    save_when = strax.SaveWhen.TARGET

    def compute(self, something):
        return np.ones(len(something), dtype=self.dtype) * 2

st = strax.Context(storage=[])
st.register((UselessPlugin, UselessPlugin2))

pprint(st.provided_dtypes)

prints:

{'more_useless_data': {'hash': 'yxzcuhurrz', 'save_when': 'TARGET'},
 'useless_data': {'hash': 'bpf35bjrkg', 'save_when': 'ALWAYS'}}

This new method should not require any changes apart from perhaps including it in the documentation. I don't think tests are necessary.

Copy link
Member

@JoranAngevaare JoranAngevaare left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice feature Evan. This is definitely helpful!

I have three ideas/suggestions:

  • Would it make sense to make it a dataframe where we get a table to more easily display this information (in a notebook). We do a similar thing for the st.show_config: https://github.com/AxFoundation/strax/blob/master/strax/context.py#L305
  • Would you say it makes sense to add some other info to this list containing the version and compressor?
  • Perhaps make run_id an option like I explain below.

strax/context.py Outdated
Summarize useful dtype information provided by this context
:return: dictionary of provided dtypes with their corresponding lineage hash and save_when
"""
hashes = set([(d, self.key_for('0', d).lineage_hash, p.save_when)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this does work, you might get a different hash depending on the run you might ask for depending on the default_per_run argument such as used here:
https://github.com/XENONnT/straxen/blob/master/straxen/plugins/event_processing.py#L293

This kind of way of tracking options is kind of outdated with CMT and also not very nice (exactly for bookkeeping things). However, since it's not forbidden a priori in strax, perhaps you want to make run_id='0' a keyword argument?

In case you wonder, for nT we only actively use it for the nveto processing (would be better if we can use CMT instead):

>>>pprint(st.key_for('0', 'records_nv'), st.key_for('13000', 'records_nv'))
(0-records_nv-gemzm4oj2y, 13000-records_nv-xw2hbel7by)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jorana - I had gotten this code from Jelle at some point a while ago and tbh never realized that the '0' was referring to the runid. I am quite surprised and concerned that the hash for the same context/dtype pair could change depending on the runid. This makes bookkeeping even more of a nightmare than it already is 😕

My personal preference would be to move to CMT for nveto processing ASAP and push to have the collection of dtypes and their hashes be a property of just the context, not the runid. But if this causes problems then yeah we should drop the property decorator and add a kwarg.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, for xenon we definitely need to do this, just pointed out that this doesn't have to be the case for strax - a priori (although I don't really like that we support it to be honest so I'm fully with you on this one).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes I see your point. In that case I agree we should probably make this a normal method and add the runid as kwarg. I love properties but oh well I'll live 😁

@ershockley
Copy link
Contributor Author

Thanks for the comments @jorana! I'm coming at this more from a data management standpoint and just included the values that are useful to me currently. Of course we can add more if you think they are useful. I think version makes sense but do you really think we need the compressor? At least for my use-cases this isn't super important, and I'm not sure analysts would care much either. Up to you though.

Maybe I'm too old school but I would prefer a dict to a dataframe, for my purposes at least. I would be mainly using it for jobs out on the grid where notebook visualization does not apply though. If you feel strongly we can of course switch to a df.

@JoranAngevaare
Copy link
Member

Hi Evan, sure, that is completely fine. I was just thinking of using this as a plugin summary that I know others could use but if doesn't provide a format to the main intended user (you/your Darwin alter-ego) this idea doesn't float. If you need to display that info, dataframes are nice but sometimes cumbersome if you want to use it later-on - I agree.

@ershockley
Copy link
Contributor Author

Just to update the working example. Now it would look like this, for some context st:

pprint(st.provided_dtypes())

returns

{'more_useless_data': {'hash': 'yxzcuhurrz',
                       'save_when': 'TARGET',
                       'version': '0.0.0'},
 'useless_data': {'hash': 'bpf35bjrkg',
                  'save_when': 'ALWAYS',
                  'version': '0.0.0'}}

@JoranAngevaare
Copy link
Member

Thanks Evan

@JoranAngevaare JoranAngevaare merged commit 38f60dc into master Mar 8, 2021
@JoranAngevaare JoranAngevaare deleted the summarize_hashes branch March 8, 2021 18:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants