Apply function to data #304

JoranAngevaare · 2020-08-12T11:51:01Z

What is the problem / what does the code in this PR do
Apply a function in get_array just before returning the data.

This way, you can make a last-minute correction (or apply a blinding cut in the context).

Can you briefly describe how it works?
When you load data, you call either st.get_array or st.get_df (which in turn calls st.get_array for you). In the lines changed in this PR, you can see that the result, which is usually returned directly, can now be changed by a function. Such a function takes the data (and, as described below, the targets) and does something to it. For example, in XENONnT/straxen#166 it remaps some channels, but this could also be used to apply a blinding cut.

Can you give a minimal working example (or illustrate with a figure)?
As written in the strax.Option, a function should have the form:

def function(data, targets):
    # Do something
    return data

See for example the MWE below.

import fnmatch

import strax


def set_time_from_start(data, targets,
                        works_on_target='*records*',
                        kwargs_are_fine=True):
    print('Time field is changed to count from start of run')

    # You can use kwargs as long as you obey the positional arguments 'data' and 'targets'
    if kwargs_are_fine:
        pass

    # Let's check in our function that we actually want to work on this kind of target(s)
    if any(fnmatch.fnmatch(t, works_on_target) for t in strax.to_str_tuple(targets)):
        # At least one of the targets is of the kind we want to apply this function to.
        # Subtract the first timestamp so that 'time' counts from the first loaded entry.
        data['time'] = data['time'] - data['time'][0]
    else:
        # OK, this is some non-(raw-)record-like data. Let's not do anything to the data
        pass
    return data

# Initiate some context
st = strax.Context(...)  # such as st = straxen.contexts.xenonnt_online()

# Now wrap whatever function we have around the data
st.set_context_config({'apply_data_function': (set_time_from_start,)})

# When doing this we will get the print message and data where 'time' counts
# in ns since the start of the run
st.get_array(run_id='008657', targets=('raw_records',), seconds_range=(0, 1))
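
The description also mentions blinding cuts as a use case. Below is a minimal sketch of what such a function could look like, using only NumPy so it runs without strax. The names blind_high_area, apply_functions, the 'area' field and the max_area threshold are all assumptions made for illustration; apply_functions only imitates the loop this PR adds and is not the strax implementation.

```python
import fnmatch

import numpy as np


def blind_high_area(data, targets, works_on_target='peak*', max_area=100.0):
    """Hypothetical blinding cut: drop entries with 'area' above max_area."""
    if isinstance(targets, str):
        targets = (targets,)
    # Only touch data kinds that match the pattern, leave everything else alone
    if not any(fnmatch.fnmatch(t, works_on_target) for t in targets):
        return data
    return data[data['area'] <= max_area]


def apply_functions(data, targets, functions):
    """Stand-in for the loop in this PR: pass the result through each function in turn."""
    for function in functions:
        data = function(data, targets)
    return data


# Toy structured array standing in for loaded data
peaks = np.zeros(4, dtype=[('time', np.int64), ('area', np.float64)])
peaks['time'] = [10, 20, 30, 40]
peaks['area'] = [5.0, 250.0, 80.0, 1000.0]

blinded = apply_functions(peaks, ('peak_basics',), (blind_high_area,))
untouched = apply_functions(peaks, ('raw_records',), (blind_high_area,))
```

Here blinded keeps only the two low-area entries, while untouched is returned unchanged because 'raw_records' does not match the 'peak*' pattern.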

What does it affect?
get_df, get_array and accumulate. It does not affect get_iter (directly); otherwise st.make would also be affected, which would be problematic because corrections might then be applied twice (which is fine if you're cutting data, but not fine if you switch channels).

After discussing the PR, accumulate is also affected to keep loading of data consistent across all methods.
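
The remark about applying corrections twice can be illustrated with plain NumPy. This is only a sketch: cut_low_area and swap_channels are hypothetical functions with the f(data, targets) form, and the field names and threshold are made up for the example.

```python
import numpy as np


def cut_low_area(data, targets):
    """A cut: keep only entries with area above 10 (hypothetical threshold)."""
    return data[data['area'] > 10]


def swap_channels(data, targets):
    """A hypothetical remap that exchanges channel 0 and channel 1."""
    remapped = data.copy()
    # Masks are computed on the input so the two assignments cannot interfere
    remapped['channel'][data['channel'] == 0] = 1
    remapped['channel'][data['channel'] == 1] = 0
    return remapped


data = np.zeros(3, dtype=[('channel', np.int16), ('area', np.float64)])
data['channel'] = [0, 1, 2]
data['area'] = [5.0, 50.0, 500.0]
targets = ('raw_records',)

# Applying a cut twice gives the same result as applying it once
cut_once = cut_low_area(data, targets)
cut_twice = cut_low_area(cut_once, targets)

# Applying the remap twice undoes it, so the data ends up wrong again
swapped_once = swap_channels(data, targets)
swapped_twice = swap_channels(swapped_once, targets)
```

The cut is idempotent, so a double application is harmless. The channel swap is not: after the second application the channels are back at their original, uncorrected values.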

WenzDaniel · 2020-08-12T15:23:05Z

Also, accumulate is not affected
While I think it is nice to have for get_array, I do not see the need for it in accumulate, since you can apply functions there anyhow.

WenzDaniel · 2020-08-12T15:30:08Z

strax/context.py

+
+        result = np.concatenate(results)
+
+        for function in self.context_config['apply_data_function']:


This function will be applied to all data kinds/types, which might not be desirable since it can easily be forgotten while working in the same context. Maybe a dictionary would be more desirable here as the strax.Option, e.g. {'peak_basics': lambda res: res['area'] + 10}.

Good suggestion. In the implementation I envisage for straxen this is desirable (want to check all datatypes). See XENONnT/straxen#166 and more importantly https://github.com/XENONnT/analysiscode/blob/master/DAQ/remap_sectors/Software_remap.ipynb.

I'd propose to also make the datatype the second arg of each function and do a regex in the function.

... since it can easily be forgotten while working in the same context.

Perhaps it's good to stress that I think this is quite a big hammer that really should be expert-only, as the data that you load is not the data that is stored on disk (which is what you want for blinded data, for example).

@WenzDaniel I've updated this PR and updated the description above. This allows you to check it in the function you are feeding strax.

Okay, now I see, and now I also agree that we should apply this kind of correction in accumulate. The easiest thing would be to add it in line 1083.
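
For comparison, the dictionary-keyed alternative raised earlier in this thread could be sketched roughly as follows. This is not what the PR ended up doing (it passes targets as the second positional argument instead); apply_per_target, add_area_offset and functions_by_target are hypothetical names, and the example uses a toy NumPy array rather than strax data.

```python
import numpy as np


def apply_per_target(data, targets, functions_by_target):
    """Hypothetical dispatcher: only apply the function registered for a target."""
    if isinstance(targets, str):
        targets = (targets,)
    for target in targets:
        function = functions_by_target.get(target)
        if function is not None:
            data = function(data)
    return data


def add_area_offset(res):
    """Toy correction: return a copy of the data with 10 added to 'area'."""
    res = res.copy()
    res['area'] += 10
    return res


functions_by_target = {'peak_basics': add_area_offset}

peaks = np.zeros(3, dtype=[('area', np.float64)])
peaks['area'] = [1.0, 2.0, 3.0]

shifted = apply_per_target(peaks, 'peak_basics', functions_by_target)
same = apply_per_target(peaks, 'event_info', functions_by_target)
```

With a dictionary, the user has to opt in per data type, which addresses the concern about forgetting that a function is active. The approach chosen in the PR leaves that decision to each function, which suits the remapping use case where every data type should be checked.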

WenzDaniel · 2020-08-12T15:33:37Z

strax/context.py

@@ -1014,7 +1017,17 @@ def get_array(self, run_id: ty.Union[str, tuple, list],
                max_workers=max_workers,


Since this function will be applied outside of get_iter, it won't be included in the computation time of the progress bar. Now I also understand why progress bars always tend to take ages for the last 0.2 % ;-) Maybe the safe thing would be something like: if self.context_config['apply_data_function'] is set, switch off the progress bar.

We could do that. On the other hand, it's fine if the progress bar reflects only the loading. What happens after that can fall outside the scope of the progress bar. For example, select_runs also contains several steps that take different amounts of time.

JoranAngevaare · 2020-08-13T08:08:11Z

Also, accumulate is not affected

While I think it is nice to have for get_array, I do not see the need for it in accumulate, since you can apply functions there anyhow.

True, perhaps my question makes more sense if you look at XENONnT/straxen#166. There I implement a remapping of channels that would be applied to any loading of data. Technically accumulate can also return data. If that is not remapped we might run into inconsistency issues between accumulate and get_array. On the other hand, I'm not sure if many people actually use this nice feature (and how they use it).

JoranAngevaare · 2020-08-14T12:31:52Z

As mentioned by Daniel, the function is now also applied to accumulate. This PR is ready for merging.

WenzDaniel

Looks fine, just one last comment though

WenzDaniel · 2020-08-17T06:06:53Z

strax/context.py

+                                f'{function} but expected callable function with two '
+                                f'positional arguments: f(data, targets).')
+            # Make sure that the function takes two arguments (data and targets)
+            data = function(data, targets)


Maybe just check how traceable a "targets does not exist" error would be. In case the error message is not traceable/understandable at all (e.g. due to numba), maybe add an explicit check with a ValueError.

Good suggestion, this can be quite non-trivial in strax sometimes.

Fortunately, this is about as understandable as it gets, with 5(!) places in the last part of the traceback showing you what went wrong.
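
Should an explicit check be wanted after all, one possible sketch uses inspect.signature from the standard library to test whether a function can be called as f(data, targets). The helper name check_data_function and the choice of ValueError for both failure modes are assumptions; this is not the check that was merged in strax.

```python
import inspect


def check_data_function(function):
    """Raise ValueError unless function can be called as function(data, targets)."""
    if not callable(function):
        raise ValueError(
            f'apply_data_function got {function} but expected callable function '
            f'with two positional arguments: f(data, targets).')
    try:
        # bind() raises TypeError if two positional arguments do not fit the signature
        inspect.signature(function).bind(None, None)
    except TypeError as error:
        raise ValueError(
            f'{function} cannot be called as f(data, targets): {error}') from error
    return function


def keeps_data(data, targets, works_on_target='*records*'):
    """Valid: two positional arguments plus an optional keyword argument."""
    return data


def only_data(res):
    """Invalid: does not accept the targets argument."""
    return res


check_data_function(keeps_data)

try:
    check_data_function(only_data)
except ValueError as error:
    message = str(error)
```

Because the check only inspects the signature, it never calls the function on real data, so it could run once when the context config is set rather than on every load.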

Update context.py

f8dbbc2

JoranAngevaare marked this pull request as draft August 12, 2020 11:51

JoranAngevaare requested a review from WenzDaniel August 12, 2020 11:59

JoranAngevaare marked this pull request as ready for review August 12, 2020 11:59

JoranAngevaare mentioned this pull request Aug 12, 2020

Remap old runs. XENONnT/straxen#166

Merged

WenzDaniel reviewed Aug 12, 2020

provide targets as positional argument

6ad5ae0

JoranAngevaare requested a review from petergaemers August 13, 2020 13:30

apply_function also to accumulate

d7b4499

JoranAngevaare added 2 commits August 14, 2020 15:39

Merge branch 'master' into apply_function_to_data

a4d17c8

Merge branch 'master' into apply_function_to_data

55e0110

JoranAngevaare requested a review from feigaodm August 14, 2020 15:30

WenzDaniel approved these changes Aug 17, 2020

JoranAngevaare merged commit 9dc02cf into AxFoundation:master Aug 17, 2020

JoranAngevaare deleted the apply_function_to_data branch August 17, 2020 07:40

