
Fix time range finding issues #231

Merged: 7 commits merged into master from fix_time_range on Feb 11, 2020

Conversation

JelleAalbers (Member)

This is a fix for #230, i.e. the treatment of the time_range / seconds_range / time_within options.

Background info:

  • Strax supports (a) data without time information, such as peaklet_classification or corrected_areas, and (b) data that is never stored, such as peaks, or the temporary datatypes that are made to merge e.g. peaks and peak_basics if you load them together.
  • Strax even allows time selection on a no-time datatype, as long as you load it alongside data of the same kind that does have time info (like peaks). It will then use the latter to map the time range to row numbers, which can be selected in the no-time datatypes. (A usage sketch follows this list.)
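
A usage sketch of these selections (not from the PR itself; the context constructor, run id, and datatype names are illustrative straxen/placeholder choices, while the get_df / get_array selection keywords are the ones this PR deals with):

```python
import straxen  # assumption: the straxen plugin suite; any strax context works

st = straxen.contexts.demo()   # placeholder context; substitute your own strax.Context
run_id = '000000'              # placeholder run id

# Time selection on a time-bearing datatype:
df = st.get_df(run_id, 'peak_basics', seconds_range=(0, 10))

# A time selection involving a no-time datatype (peaklet_classification) works
# as long as a time-bearing datatype of the same kind (peaklets) is loaded
# alongside it; strax uses the latter to map the time range to row numbers:
arr = st.get_array(run_id, ('peaklets', 'peaklet_classification'),
                   seconds_range=(0, 10))
```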

There were two bugs:

  • Strax was too coarse when converting time ranges to row numbers (a sketch of this follows the list). We picked whichever time-bearing datatype was listed first, found out which chunks contained the start and stop of the range, then loaded the row numbers corresponding to those entire chunks. If this data type happened to be very slim (few columns), a chunk would correspond to way more time than you selected (since we try to save in fixed-size files), possibly the entire run. If the simultaneously loaded data types were not slim, your RAM might explode.
  • The time range argument, and the associated conversion to row ranges, could be spuriously skipped. When loading a never-stored datatype such as peaks together with a time-bearing datatype of the same kind, such as peak_basics, peaks still has to be reconstructed from peaklets and merged_s2s. However, the time range was not passed on to the latter, so strax attempted to load the full data of the run, starting with the first chunk, which would usually cause a merge failure. (But not always: due to the very coarse chunk mapping, you were fine if the run only had one chunk, or perhaps if you were just loading the first chunk.)
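
To make the first bug concrete, here is a minimal sketch of the old, chunk-granularity mapping (not the actual strax code; the chunk metadata arrays are invented for illustration):

```python
import numpy as np

def coarse_time_range_to_row_range(chunk_start_times, chunk_n_rows, time_range):
    """Sketch of the old behaviour: map a time range to row numbers by
    returning every row of the *whole* chunks containing the range boundaries.

    chunk_start_times: sorted chunk start times (ns), one per chunk
    chunk_n_rows: number of rows stored in each chunk
    time_range: (start, stop) in ns
    """
    start, stop = time_range
    # Indices of the chunks containing the start and stop of the range
    first_chunk = max(np.searchsorted(chunk_start_times, start, side='right') - 1, 0)
    last_chunk = max(np.searchsorted(chunk_start_times, stop, side='right') - 1, first_chunk)

    # Global row number at which each chunk begins
    row_offsets = np.concatenate([[0], np.cumsum(chunk_n_rows)])
    # Every row of the boundary chunks is returned, even if those chunks
    # span far more time than was requested:
    return row_offsets[first_chunk], row_offsets[last_chunk + 1]
```

If the time-bearing datatype is slim, each of its chunks can cover a large fraction of the run, so the returned row range (and everything loaded along with it) is far wider than the requested time range.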

Now:

  • Strax now actually loads data to find the exact row numbers that correspond to the time range (only for the chunks containing the start and stop time, so possibly just one chunk); see the sketch after this list. This does mean that some data is loaded twice when doing time_range selections.
  • If a time_range is specified, it will apply to all data types that are loaded. There are likely still cases where the current logic fails, e.g. when you make some deep chain of no-time-bearing and never-stored plugins. In those cases you will now get a neater error: it will say which data type strax tried in vain to find time information for (rather than a weird merge failure or blowing up your memory).
  • _time_range_to_n_range has been renamed to time_range_to_row_range for clarity.
  • An additional option _chunk_number has been added to get_array, get_df, etc., to load just a single chunk. This is used internally in time_range_to_row_range.
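
A minimal sketch of the new lookup (again not the real time_range_to_row_range; the chunk indices and row offsets are assumed to come from chunk metadata as in the previous sketch). Only the chunks containing the range boundaries are loaded, via the new _chunk_number option, and the exact rows are found by searching the time field:

```python
import numpy as np

def exact_time_range_to_row_range(st, run_id, target, time_range,
                                  first_chunk, last_chunk, row_offsets):
    """Sketch of the new behaviour: load only the boundary chunks and look up
    the exact row numbers of the requested time range inside them."""
    start, stop = time_range

    # _chunk_number (added in this PR) loads a single chunk of `target`
    first = st.get_array(run_id, target, _chunk_number=first_chunk)
    last = (first if last_chunk == first_chunk
            else st.get_array(run_id, target, _chunk_number=last_chunk))

    # Exact offsets of the requested times within the boundary chunks
    start_row = row_offsets[first_chunk] + np.searchsorted(first['time'], start)
    end_row = row_offsets[last_chunk] + np.searchsorted(last['time'], stop)
    return start_row, end_row
```

This is also why some data is loaded twice with time_range selections: the boundary chunks are read once for the row lookup and again when the selected data itself is loaded.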

Taking a step back, this is all additional complexity that comes from supporting data types without time information (see also #79). If I had known how much trouble this would bring, I would never have supported it -- it saves a bit of space, but not that much. However, by now it is probably too disruptive a downstream change to stop supporting this.

@JelleAalbers (Member, Author)

So now codefactor thinks that using an f-string in an error message is a possible SQL injection vulnerability... right.

@JelleAalbers merged commit c68891a into master on Feb 11, 2020
@JelleAalbers deleted the fix_time_range branch on February 11, 2020 18:06