
Fix lineage if per run default is not allowed #483

Merged
11 commits merged into AxFoundation:master from fixed_lineage on Jul 13, 2021

Conversation

@JoranAngevaare (Member) commented Jul 9, 2021:

What is the problem / what does the code in this PR do

  • Massive speedup for loading many runs.
  • When use_per_run_defaults == False we don't allow changing lineages within the context. In that case, there is no need to recompute the lineage of each plugin for every run.
  • This is a similar PR to Speed up run selection by ~100x for fixed defaults #440.

Can you briefly describe how it works?

This only works when we have fixed lineages, but when they are fixed, we don't need a full initialization of the plugin each time we are going to load a new run. We can just re-use a plugin and only change its run_id attribute. To this end:

  • Initialize the plugin once.
  • Cache the plugin in the _fixed_plugin_cache. To prevent a changed config not being reflected in the plugin (e.g. you load data and later change an option), we make a sha1 of st.config; any change there results in a new hash, and as such a new key to cache the plugins under.
  • Each subsequent time a datakey/processor is requested, return the requested plugin from the _fixed_plugin_cache and only change the run_id (see the sketch below this list).
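
A minimal sketch of this caching idea, with simplified, hypothetical names (_plugin_for and _initialize_plugin are illustrative placeholders; the actual logic lives in strax/context.py):

import hashlib

class Context:
    # Minimal sketch of the plugin cache, keyed by a hash of the config.
    def __init__(self, config):
        self.config = config
        self._fixed_plugin_cache = None

    def _config_hash(self):
        # Any change to st.config gives a different hash, hence a new cache key.
        return hashlib.sha1(str(self.config).encode('ascii')).hexdigest()

    def _plugin_for(self, data_type, run_id):
        key = self._config_hash()
        if self._fixed_plugin_cache is None or key not in self._fixed_plugin_cache:
            # First request, or the config changed: start a fresh cache.
            self._fixed_plugin_cache = {key: {}}
        cache = self._fixed_plugin_cache[key]
        if data_type not in cache:
            # Full initialization is expensive; do it only once per config.
            cache[data_type] = self._initialize_plugin(data_type)
        plugin = cache[data_type]
        plugin.run_id = run_id  # reuse the instance; only the run_id changes
        return plugin

    def _initialize_plugin(self, data_type):
        raise NotImplementedError  # placeholder for the expensive full setup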

Can you give a minimal working example (or illustrate with a figure)?
Before
[image]

After
[image]

strax/context.py Outdated
Comment on lines 431 to 432
return hashlib.sha1(str(self.config).encode('ascii')).hexdigest()
# return strax.deterministic_hash(self.config)
@JoranAngevaare (Member, Author):

I tried using the deterministic hash but it does not like immutabledict. Perhaps there is another solution, but my guess is that this serves the purpose quite well.

@jmosbacher (Contributor):

Unfortunately this is not deterministic; you need to sort any dictionaries and sets to get a deterministic string. Also, the string representation may depend on the version of the package providing non-builtin values. Why not just fix the hashablize function here:

from collections.abc import Mapping

if isinstance(obj, Mapping):  # instead of: if isinstance(obj, dict)
...

This will catch almost any dict-like object, not just the builtin dict.

@JoranAngevaare (Member, Author):

Thanks Yossi, sure, let me try that.

Actually, it's not much of an issue if it's non-deterministic: the context always just builds the _config_hash on the fly, so a change in the order of the config or in the version of a package may lead to a different config_hash, but as long as it does not change for every run, it shouldn't matter. But you are right, better to make it deterministic: even if there is no issue now, some day there might be if one assumes it is deterministic (which shouldn't be a bad assumption).

@JoranAngevaare (Member, Author):

@jmosbacher , I tried your suggestion but it doesn't work. I initially also tried something like this (by doing the same for immutabledict instead of Mapping); however, it turns out that since immutabledict has a hash method, this whole logic is never accessed.

I'll propose something similar to get it working.
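
For reference, a rough sketch of what the hashablize logic looks like (paraphrased, not the exact strax code), showing why the Mapping branch is never reached for immutabledict:

from collections.abc import Mapping

def hashablize(obj):
    try:
        hash(obj)
    except TypeError:
        pass
    else:
        # immutabledict defines __hash__, so it returns here and the
        # Mapping/dict branch below is never reached for it.
        return obj
    if isinstance(obj, Mapping):
        return tuple(sorted((k, hashablize(v)) for k, v in obj.items()))
    if hasattr(obj, '__iter__'):
        return tuple(hashablize(o) for o in obj)
    raise TypeError(f"Can't hashablize object of type {type(obj)}")

Since the immutabledict passes through unchanged, the failure only shows up later, when the object reaches the JSON encoder.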

@jmosbacher (Contributor):

Oh, sorry about that. I assumed the issue was with the hash function, but I guess it was in the JSON encoder all along? In that case, maybe add the Mapping logic to the NumpyJSONEncoder class?

@JoranAngevaare JoranAngevaare changed the title Fix lineage if per run default is not allowed Fix lineage if per run default is not allowed - decrease many run loading time Jul 9, 2021
@JoranAngevaare JoranAngevaare changed the title Fix lineage if per run default is not allowed - decrease many run loading time Fix lineage if per run default is not allowed Jul 9, 2021
@JoranAngevaare JoranAngevaare marked this pull request as ready for review July 9, 2021 15:37
@WenzDaniel (Collaborator) left a comment:

I have not tested it, but from the code it looks fine. I am not sure if the deterministic hashing is really needed; I assume that changing options is only done by very few analysts. But if it can be implemented easily, it would be nice to have.

strax/context.py Outdated
Comment on lines 453 to 454
if self._fixed_plugin_cache is None or self._config_hash() not in self._fixed_plugin_cache:
    self._fixed_plugin_cache = {self._config_hash(): dict()}
@WenzDaniel (Collaborator):

I assume Yossi's comment is also the reason why you are always making a new dictionary when the hash cannot be found?

@JoranAngevaare (Member, Author):

Very keenly spotted, Daniel; I should have put a comment here.

So my suspicion is that the cache is only created once, at most twice if you change some option or register some plugin. I did think the likelihood that one was eating up memory by keeping the cache was greater than the chance that someone was flipping between options often (although neither is very likely).

@WenzDaniel (Collaborator):

So my suspicion is that the cache is only created once, at most twice if you change some option or register some plugin.

Yes, this would be my guess too. As I said, I am fine with it as it is, but if you can implement Yossi's suggestion easily, it would be nice to have.

@JoranAngevaare (Member, Author):

I fully agree

@jmosbacher (Contributor):

In general I think this is a good idea. My biggest concern is with multi-threading. Do we really have a guarantee that two runs are not being processed concurrently by two different threads?

@WenzDaniel (Collaborator):

I think so. The list of plugins needed to process some data is created before we send out the different jobs.

@JoranAngevaare (Member, Author) commented Jul 13, 2021:

@jmosbacher @WenzDaniel , thanks for the reviews. I've added three things:

  • 93e2993, convert Mapping to dict in order to allow using immutabledict for the deterministic hash (see the sketch below this list)
  • cae95cf , rely on the deterministic hash instead of a sha1. This does make this PR ~10 times slower without an immediate benefit for the purpose pursued here. The speed increase is still ~500x for the example above.
  • cae95cf , we must also check the versions of the plugins; I had initially forgotten this.
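
A rough sketch of the Mapping-to-dict idea in the first bullet (mappings_to_dicts is a hypothetical name for illustration; the actual change is in commit 93e2993):

from collections.abc import Mapping

def mappings_to_dicts(obj):
    # Recursively turn any Mapping (e.g. immutabledict) into a plain dict so
    # the JSON-based deterministic hash can serialize it.
    if isinstance(obj, Mapping):
        return {k: mappings_to_dicts(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [mappings_to_dicts(x) for x in obj]
    return obj

Something like strax.deterministic_hash(mappings_to_dicts(config)) then works even when config values are immutabledicts.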

@JoranAngevaare JoranAngevaare merged commit a7d7c11 into AxFoundation:master Jul 13, 2021
@JoranAngevaare JoranAngevaare deleted the fixed_lineage branch July 13, 2021 10:19
@JoranAngevaare (Member, Author):

In general I think this is a good idea. My biggest concern is with multi-threading. Do we really have a guarantee that two runs are not being processed concurrently by two different threads?

For completeness, this is all set up much later, after this function:

def get_components(self, run_id: str,

In fact, during multi-run loading/processing you loop over the components here:

def multi_run(exec_function, run_ids, *args,

@jmosbacher (Contributor):

@jorana Sorry, maybe I'm misunderstanding what's happening in your code. Aren't you caching the plugin instances themselves, so different runs will be processed with the same plugin instance? If I try to process multiple runs, by default those will be processed concurrently with a thread pool, right? But if the plugins are cached, all threads would share the same plugin instance and the run_id on that instance would just be the last one requested, no? What am I missing here?
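
A toy illustration of the race being described (not strax code; DummyPlugin and process are made up, assuming the cached instance is shared between worker threads):

from concurrent.futures import ThreadPoolExecutor
import time

class DummyPlugin:
    run_id = None

shared_plugin = DummyPlugin()  # stands in for the cached plugin instance

def process(run_id):
    shared_plugin.run_id = run_id  # every thread overwrites the same attribute
    time.sleep(0.01)               # pretend to do some work
    return shared_plugin.run_id    # may be the run_id set by another thread

with ThreadPoolExecutor(max_workers=2) as ex:
    print(list(ex.map(process, ['run_0', 'run_1'])))
    # possible output: ['run_1', 'run_1']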

@JoranAngevaare (Member, Author) commented Jul 13, 2021:

Ah right, I see your point better now. What you are saying is correct; thanks for being persistent. Let's write a test for this, as indeed the implications can be substantial. Otherwise we simply have to move this down a few lines in order to make sure we create separate instances.

I went ahead with merging this PR as I wanted to get things going for Daniel's PR on super-runs, but it would have been better to check this carefully first.

@jmosbacher (Contributor):

I think a possible solution to this would be to return a copy of the cached plugin instead of the original instance. You would also need a recursive copy of the plugin.deps variable, I guess.
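
A rough sketch of that suggestion (copy_cached_plugin is a hypothetical helper, assuming plugin.deps maps dependency names to plugin instances):

import copy

def copy_cached_plugin(cached_plugin, run_id):
    # Hand each request its own shallow copy of the cached plugin so that
    # concurrently processed runs do not share the same instance / run_id.
    plugin = copy.copy(cached_plugin)
    # plugin.deps holds references to other (shared) plugin instances, so it
    # needs to be copied recursively as well.
    plugin.deps = {name: copy_cached_plugin(dep, run_id)
                   for name, dep in cached_plugin.deps.items()}
    plugin.run_id = run_id
    return plugin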

JoranAngevaare added a commit that referenced this pull request Jul 13, 2021
JoranAngevaare added a commit that referenced this pull request Jul 15, 2021
* fix #483 and add tests

* haha, be smart about the st.key_for since it does allow reuse

* add the actual test

* Have to investigate

* add comments to copy function

* why did we use a CutPlugin here

* add type to function

* allow configurable compressors and timeouts
@WenzDaniel WenzDaniel mentioned this pull request Aug 26, 2021
@WenzDaniel WenzDaniel mentioned this pull request Oct 11, 2021