
Fix lineage if per run default is not allowed #483

Merged
11 commits merged into AxFoundation:master from fixed_lineage on Jul 13, 2021

Conversation

@JoranAngevaare (Member) commented Jul 9, 2021:

What is the problem / what does the code in this PR do

  • Massive speedup for loading many runs.
  • When use_per_run_defaults == False we don't allow changing lineages within the context. In that case, there is no need to recompute the lineage of each plugin for every run.
  • This is a similar PR to Speed up run selection by ~100x for fixed defaults #440.

Can you briefly describe how it works?

This only works when we have fixed lineages, but when they are fixed, we don't need a full initialization of the plugin each time we are going to load a new run. We can just re-use a plugin and only change its run_id attribute. To this end:

  • Initialize the plugin once.
  • Cache the plugin in the _fixed_plugin_cache. To prevent a changed config not being reflected in the plugin (e.g. you load data and later change an option), we make a sha1 of st.config; any change there results in a new hash, and as such a new key to cache the plugins under.
  • Each subsequent time a datakey/processor is requested, return the requested plugin from the _fixed_plugin_cache and only change the run_id (see the sketch below this list).
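
A minimal sketch of this caching idea, with simplified, hypothetical names (_plugin_for and _initialize_plugin are illustrative placeholders; the actual logic lives in strax/context.py):

import hashlib

class Context:
    # Minimal sketch of the plugin cache, keyed by a hash of the config.
    def __init__(self, config):
        self.config = config
        self._fixed_plugin_cache = None

    def _config_hash(self):
        # Any change to st.config gives a different hash, hence a new cache key.
        return hashlib.sha1(str(self.config).encode('ascii')).hexdigest()

    def _plugin_for(self, data_type, run_id):
        key = self._config_hash()
        if self._fixed_plugin_cache is None or key not in self._fixed_plugin_cache:
            # First request, or the config changed: start a fresh cache.
            self._fixed_plugin_cache = {key: {}}
        cache = self._fixed_plugin_cache[key]
        if data_type not in cache:
            # Full initialization is expensive; do it only once per config.
            cache[data_type] = self._initialize_plugin(data_type)
        plugin = cache[data_type]
        plugin.run_id = run_id  # reuse the instance; only the run_id changes
        return plugin

    def _initialize_plugin(self, data_type):
        raise NotImplementedError  # placeholder for the expensive full setup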

Can you give a minimal working example (or illustrate with a figure)?
Before
[image]

After
[image]

strax/context.py Outdated
Comment on lines 431 to 432
return hashlib.sha1(str(self.config).encode('ascii')).hexdigest()
# return strax.deterministic_hash(self.config)
@JoranAngevaare (Member, Author):

I tried using the deterministic hash but it does not like immutabledict. Perhaps there is another solution, but my guess is that this serves the purpose quite well.

@jmosbacher (Contributor):

Unfortunately this is not deterministic; you need to sort any dictionaries and sets to get a deterministic string. Also, the string representation may depend on the version of the package providing non-builtin values. Why not just fix the hashablize function here:

from collections.abc import Mapping

if isinstance(obj, Mapping):  # instead of: if isinstance(obj, dict)
...

This will catch almost any dict-like object, not just the builtin dict.

@JoranAngevaare (Member, Author):

Thanks Yossi, sure, let me try that.

Actually, it's not much of an issue if it's non-deterministic: the context always just builds the _config_hash on the fly, so a change in the order of the config or in the version of a package may lead to a different config_hash, but as long as it does not change for every run, it shouldn't matter. But you are right, better to make it deterministic: even if there is no issue now, some day there might be if one assumes it is deterministic (which shouldn't be a bad assumption).

@JoranAngevaare (Member, Author):

@jmosbacher , I tried your suggestion but it doesn't work. I initially also tried something like this (by doing the same for immutabledict instead of Mapping); however, it turns out that since immutabledict has a hash method, this whole logic is never accessed.

I'll propose something similar to get it working.
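
For reference, a rough sketch of what the hashablize logic looks like (paraphrased, not the exact strax code), showing why the Mapping branch is never reached for immutabledict:

from collections.abc import Mapping

def hashablize(obj):
    try:
        hash(obj)
    except TypeError:
        pass
    else:
        # immutabledict defines __hash__, so it returns here and the
        # Mapping/dict branch below is never reached for it.
        return obj
    if isinstance(obj, Mapping):
        return tuple(sorted((k, hashablize(v)) for k, v in obj.items()))
    if hasattr(obj, '__iter__'):
        return tuple(hashablize(o) for o in obj)
    raise TypeError(f"Can't hashablize object of type {type(obj)}")

Since the immutabledict passes through unchanged, the failure only shows up later, when the object reaches the JSON encoder.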

@jmosbacher (Contributor):

Oh, sorry about that. I assumed the issue was with the hash function, but I guess it was in the JSON encoder all along? In that case, maybe add the Mapping logic to the NumpyJSONEncoder class?

@JoranAngevaare JoranAngevaare changed the title Fix lineage if per run default is not allowed Fix lineage if per run default is not allowed - decrease many run loading time Jul 9, 2021
@JoranAngevaare JoranAngevaare changed the title Fix lineage if per run default is not allowed - decrease many run loading time Fix lineage if per run default is not allowed Jul 9, 2021
@JoranAngevaare JoranAngevaare marked this pull request as ready for review July 9, 2021 15:37
@WenzDaniel (Collaborator) left a comment:

I have not tested it, but from the code it looks fine. I am not sure if the deterministic hashing is really needed; I assume that changing options is only done by very few analysts. But if it can be implemented easily, it would be nice to have.

strax/context.py Outdated
Comment on lines 453 to 454
if self._fixed_plugin_cache is None or self._config_hash() not in self._fixed_plugin_cache:
    self._fixed_plugin_cache = {self._config_hash(): dict()}
@WenzDaniel (Collaborator):

I assume Yossi's comment is also the reason why you are always making a new dictionary when the hash cannot be found?

@JoranAngevaare (Member, Author):

Very keenly spotted, Daniel; I should have put a comment here.

So my suspicion is that the cache is only created once, at most twice if you change some option or register some plugin. I did think the likelihood that one was eating up memory by keeping the cache was greater than the chance that someone was flipping between options often (although neither is very likely).

@WenzDaniel (Collaborator):

So my suspicion is that the cache is only created once, at most twice if you change some option or register some plugin.

Yes, this would be my guess too. As I said, I am fine with it as it is, but if you can implement Yossi's suggestion easily, it would be nice to have.

@JoranAngevaare (Member, Author):

I fully agree

@jmosbacher (Contributor):

In general I think this is a good idea. My biggest concern is with multi-threading. Do we really have a guarantee that two runs are not being processed concurrently by two different threads?

@WenzDaniel (Collaborator):

I think so. The list of plugins needed to process some data is created before we send out the different jobs.

@JoranAngevaare (Member, Author) commented Jul 13, 2021:

@jmosbacher @WenzDaniel , thanks for the reviews. I've added three things:

  • 93e2993, convert Mapping to dict in order to allow using immutabledict for the deterministic hash (see the sketch below this list)
  • cae95cf , rely on the deterministic hash instead of a sha1. This does make this PR ~10 times slower without an immediate benefit for the purpose pursued here. The speed increase is still ~500x for the example above.
  • cae95cf , we must also check the versions of the plugins; I had initially forgotten this.
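
A rough sketch of the Mapping-to-dict idea in the first bullet (mappings_to_dicts is a hypothetical name for illustration; the actual change is in commit 93e2993):

from collections.abc import Mapping

def mappings_to_dicts(obj):
    # Recursively turn any Mapping (e.g. immutabledict) into a plain dict so
    # the JSON-based deterministic hash can serialize it.
    if isinstance(obj, Mapping):
        return {k: mappings_to_dicts(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [mappings_to_dicts(x) for x in obj]
    return obj

Something like strax.deterministic_hash(mappings_to_dicts(config)) then works even when config values are immutabledicts.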

@JoranAngevaare JoranAngevaare merged commit a7d7c11 into AxFoundation:master Jul 13, 2021
@JoranAngevaare JoranAngevaare deleted the fixed_lineage branch July 13, 2021 10:19
@JoranAngevaare (Member, Author):

In general I think this is a good idea. My biggest concern is with multi-threading. Do we really have a guarantee that two runs are not being processed concurrently by two different threads?

For completeness, this is all set up much later, after this function:

def get_components(self, run_id: str,

In fact, during multi-run loading/processing you loop over the components here:

def multi_run(exec_function, run_ids, *args,

@jmosbacher (Contributor):

@jorana Sorry, maybe I'm misunderstanding what's happening in your code. Aren't you caching the plugin instances themselves, so different runs will be processed with the same plugin instance? If I try to process multiple runs, by default those will be processed concurrently with a thread pool, right? But if the plugins are cached, all threads would share the same plugin instance and the run_id on that instance would just be the last one requested, no? What am I missing here?
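
A toy illustration of the race being described (not strax code; DummyPlugin and process are made up, assuming the cached instance is shared between worker threads):

from concurrent.futures import ThreadPoolExecutor
import time

class DummyPlugin:
    run_id = None

shared_plugin = DummyPlugin()  # stands in for the cached plugin instance

def process(run_id):
    shared_plugin.run_id = run_id  # every thread overwrites the same attribute
    time.sleep(0.01)               # pretend to do some work
    return shared_plugin.run_id    # may be the run_id set by another thread

with ThreadPoolExecutor(max_workers=2) as ex:
    print(list(ex.map(process, ['run_0', 'run_1'])))
    # possible output: ['run_1', 'run_1']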

@JoranAngevaare (Member, Author) commented Jul 13, 2021:

Ah right, I see your point better now. What you are saying is correct; thanks for being persistent. Let's write a test for this, as indeed the implications can be substantial. Otherwise we simply have to move this down a few lines in order to make sure we create separate instances.

I went ahead with merging this PR as I wanted to get things going for Daniel's PR on super-runs, but it would have been better to check this carefully first.

@jmosbacher (Contributor):

I think a possible solution to this would be to return a copy of the cached plugin instead of the original instance. You would also need a recursive copy of the plugin.deps variable, I guess.
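
A rough sketch of that suggestion (copy_cached_plugin is a hypothetical helper, assuming plugin.deps maps dependency names to plugin instances):

import copy

def copy_cached_plugin(cached_plugin, run_id):
    # Hand each request its own shallow copy of the cached plugin so that
    # concurrently processed runs do not share the same instance / run_id.
    plugin = copy.copy(cached_plugin)
    # plugin.deps holds references to other (shared) plugin instances, so it
    # needs to be copied recursively as well.
    plugin.deps = {name: copy_cached_plugin(dep, run_id)
                   for name, dep in cached_plugin.deps.items()}
    plugin.run_id = run_id
    return plugin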

JoranAngevaare added a commit that referenced this pull request Jul 13, 2021
JoranAngevaare added a commit that referenced this pull request Jul 15, 2021
* fix #483 and add tests

* haha, be smart about the st.key_for since it does allow reuse

* add the actual test

* Have to investigate

* add comments to copy function

* why did we use a CutPlugin here

* add type to function

* allow configurable compressors and timeouts
@WenzDaniel WenzDaniel mentioned this pull request Aug 26, 2021
@WenzDaniel WenzDaniel mentioned this pull request Oct 11, 2021