New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Apply function to data #304
Apply function to data #304
Conversation
|
strax/context.py
Outdated
|
||
result = np.concatenate(results) | ||
|
||
for function in self.context_config['apply_data_function']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function will be applied to all data_kinds/types which might not be desirable since it can be easily get forgotten while working in the same context. Maybe some dictionary would be here more desirable as strax.Option e.g. {'peak_basics': lambda res: res['area'] + 10}.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion. In the implementation I envisage for straxen this is desirable (want to check all datatypes). See XENONnT/straxen#166 and more importantly https://github.com/XENONnT/analysiscode/blob/master/DAQ/remap_sectors/Software_remap.ipynb.
I'd propose to also make the datatype the second arg of each function and do a regex in the function.
... since it can be easily get forgotten while working in the same context.
Perhaps it's good to stress that I think this is quite a big hammer that really should be expert only as the data that you load is not the data that is stored on disk (what you want for blinded data for example).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WenzDaniel I've updated this PR and updated the description above. This allows you to check it in the function you are feeding strax.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay now I see, and now I also agree that we should apply this kind of correction in accumulate. Easiest thing would be to add it into line 1083
@@ -1014,7 +1017,17 @@ def get_array(self, run_id: ty.Union[str, tuple, list], | |||
max_workers=max_workers, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this function will be applied outside of get_iter it wont be included in the computation time of the progressbar. Now, I also understand why progressbars always tend to take ages for the last 0.2 % ;-) Maybe save thing would be something like if self.config['apply_data_function'] switch off progressbar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could do that, on the other hand, it's fine if the progress bar reflects the loading. What happens after that can fal outside of the scope of the progress bar. For example, the select_runs also contains several steps that take different amounts of time.
True, perhaps my question makes more sense if you look at XENONnT/straxen#166. There I implement a remapping of channels that would be applied to any loading of data. Technically accumulate can also return data. If that is not remapped we might run into inconsistency issues between accumulate and get_array. On the other hand, I'm not sure if many people actually use this nice feature (and how they use it). |
As mentioned by Daniel the function is now also applied to accumulate. This PR is ready for merging |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine, just one last comment though
f'{function} but expected callable function with two ' | ||
f'positional arguments: f(data, targets).') | ||
# Make sure that the function takes two arguments (data and targets) | ||
data = function(data, targets) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, just check how traceable a "targets" does not exist error would be. In case the error message is not traceable/understandable at all (e.g. due to numba), maybe add an explicit check with ValueError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the problem / what does the code in this PR do
Apply a function in
get_array
just before returning the data.This way, you can make a last-minute correction (or apply a blinding cut in the context).
Can you briefly describe how it works?
When one loads data you do either
st.get_array
orst.get_df
(which then callsst.get_array
for you). In the lines changed in this PR, you see that the results that is usually returned directly can now be changed by a function. Such a function should be a function that takes a single argument (the data) and does something to it. For example in XENONnT/straxen#166 it remaps some channels but this could also be used to apply a blinding cut.Can you give a minimal working example (or illustrate with a figure)?
As written in the Strax.Option a function should have the form of:
See for example the MWE below.
What does it affect?
get_df
,get_array
andaccumulate
. It does not affectget_iter
(directly) otherwise alsost.make
would be affected which would be problematic as one might be applying corrections twice (which is fine if your cutting data but not fine if you switch channels).After discussing the PR,
accumulate
is also affected to keep loading of data consistent across all methods.