Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apply function to data #304

Merged

Conversation

JoranAngevaare
Copy link
Member

@JoranAngevaare JoranAngevaare commented Aug 12, 2020

What is the problem / what does the code in this PR do
Apply a function in get_array just before returning the data.

This way, you can make a last-minute correction (or apply a blinding cut in the context).

Can you briefly describe how it works?
When one loads data you do either st.get_array or st.get_df (which then calls st.get_array for you). In the lines changed in this PR, you see that the results that is usually returned directly can now be changed by a function. Such a function should be a function that takes a single argument (the data) and does something to it. For example in XENONnT/straxen#166 it remaps some channels but this could also be used to apply a blinding cut.

Can you give a minimal working example (or illustrate with a figure)?
As written in the Strax.Option a function should have the form of:

def function(data, targets):
    # Do something
    return data

See for example the MWE below.

def set_time_from_start(data, targets, 
                        works_on_target='*records*', 
                        kwargs_are_fine=True):
     print('Time field is changed to start of run')
     
     # You can use kwargs as long as you obey the positional arguments 'data' and 'targets'
     if kwargs_are_fine:
          pass
    
     # Let's check in our function that we actually want to work on this kind of target(s)
     if np.any([fnmatch.fnmatch(t, works_on_target) for t in strax.to_str_tuple(targets)]):
          # At least one of the targets is of the kind we want to apply this function to
          data['time']=data['time'][0]
     else: 
          # Ok this some non (raw-)record like data. Let's not do anything to the data
          pass
     return data

#Initiate some context 
st = strax.Context(...) # such as st = straxen.contexts.xenonnt_online()

# Now wrap whatever function we have around the data
st.set_context_config({'apply_data_function': (set_time_from_start,)})

# When doing this we will get the print message and data where 'time' starts to count in ns
# since run
st.get_array(run_id='008657', targets = ('raw_records'), seconds_range = (0,1))

What does it affect?
get_df, get_array and accumulate. It does not affect get_iter (directly) otherwise also st.make would be affected which would be problematic as one might be applying corrections twice (which is fine if your cutting data but not fine if you switch channels).

After discussing the PR, accumulate is also affected to keep loading of data consistent across all methods.

@JoranAngevaare JoranAngevaare marked this pull request as draft August 12, 2020 11:51
@JoranAngevaare JoranAngevaare marked this pull request as ready for review August 12, 2020 11:59
@WenzDaniel
Copy link
Collaborator

Also, accumulate is not affected
While I think it is a nice to have thing for get_array I do not see its need in accumulate since you can apply functions here anyhow.

strax/context.py Outdated

result = np.concatenate(results)

for function in self.context_config['apply_data_function']:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function will be applied to all data_kinds/types which might not be desirable since it can be easily get forgotten while working in the same context. Maybe some dictionary would be here more desirable as strax.Option e.g. {'peak_basics': lambda res: res['area'] + 10}.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. In the implementation I envisage for straxen this is desirable (want to check all datatypes). See XENONnT/straxen#166 and more importantly https://github.com/XENONnT/analysiscode/blob/master/DAQ/remap_sectors/Software_remap.ipynb.

I'd propose to also make the datatype the second arg of each function and do a regex in the function.

... since it can be easily get forgotten while working in the same context.

Perhaps it's good to stress that I think this is quite a big hammer that really should be expert only as the data that you load is not the data that is stored on disk (what you want for blinded data for example).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WenzDaniel I've updated this PR and updated the description above. This allows you to check it in the function you are feeding strax.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay now I see, and now I also agree that we should apply this kind of correction in accumulate. Easiest thing would be to add it into line 1083

@@ -1014,7 +1017,17 @@ def get_array(self, run_id: ty.Union[str, tuple, list],
max_workers=max_workers,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this function will be applied outside of get_iter it wont be included in the computation time of the progressbar. Now, I also understand why progressbars always tend to take ages for the last 0.2 % ;-) Maybe save thing would be something like if self.config['apply_data_function'] switch off progressbar.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do that, on the other hand, it's fine if the progress bar reflects the loading. What happens after that can fal outside of the scope of the progress bar. For example, the select_runs also contains several steps that take different amounts of time.

@JoranAngevaare
Copy link
Member Author

Also, accumulate is not affected

While I think it is a nice to have thing for get_array I do not see its need in accumulate since you can apply functions here anyhow

True, perhaps my question makes more sense if you look at XENONnT/straxen#166. There I implement a remapping of channels that would be applied to any loading of data. Technically accumulate can also return data. If that is not remapped we might run into inconsistency issues between accumulate and get_array. On the other hand, I'm not sure if many people actually use this nice feature (and how they use it).

@JoranAngevaare
Copy link
Member Author

As mentioned by Daniel the function is now also applied to accumulate. This PR is ready for merging

Copy link
Collaborator

@WenzDaniel WenzDaniel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine, just one last comment though

f'{function} but expected callable function with two '
f'positional arguments: f(data, targets).')
# Make sure that the function takes two arguments (data and targets)
data = function(data, targets)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, just check how traceable a "targets" does not exist error would be. In case the error message is not traceable/understandable at all (e.g. due to numba), maybe add an explicit check with ValueError

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion, this can be quite non-trivial in strax sometimes.

Fortunately this is about as understandable as it gets with 5(!) places in the last part of the trackback showing you what went wrong:
afbeelding

@JoranAngevaare JoranAngevaare merged commit 9dc02cf into AxFoundation:master Aug 17, 2020
@JoranAngevaare JoranAngevaare deleted the apply_function_to_data branch August 17, 2020 07:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants