Fix lineage if per run default is not allowed #483
Conversation
strax/context.py
return hashlib.sha1(str(self.config).encode('ascii')).hexdigest()
# return strax.deterministic_hash(self.config)
I tried using the deterministic hash but it does not like immutabledict. Perhaps there is another solution, but my guess is that this serves the purpose quite well.
Unfortunately this is not deterministic; you need to sort any dictionaries and sets to get a deterministic string. Also, the string representation of non-builtin values may depend on the version of the package they come from. Why not just fix the hashablize function here:
from collections.abc import Mapping

if isinstance(obj, Mapping):  # instead of: if isinstance(obj, dict)
    ...

This will catch almost any dict-like object, not just the builtin dict.
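To illustrate the determinism point, here is a minimal sketch (not strax's actual implementation) of how sorting dictionary keys during serialization removes the insertion-order dependence that `str(dict)` has; real code would also need a custom encoder for non-JSON-serializable values:

```python
import hashlib
import json

def deterministic_hash(obj):
    # sort_keys=True fixes the serialization order, so two dicts with the
    # same items always produce the same string, regardless of the order
    # in which their keys were inserted.
    blob = json.dumps(obj, sort_keys=True)
    return hashlib.sha1(blob.encode('ascii')).hexdigest()

a = {'x': 1, 'y': 2}
b = {'y': 2, 'x': 1}     # same items, different insertion order
assert str(a) != str(b)  # str() reflects insertion order (Python 3.7+)
assert deterministic_hash(a) == deterministic_hash(b)
```

A hash built on `str(self.config)` would differ for `a` and `b` above, which is exactly the non-determinism being discussed.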
Thanks Yossi, sure, let me try that.
Actually, it's not much of an issue if it's non-deterministic: the contexts will always build the _config_hash on the fly, so a change in config order or in a package version may lead to a different context_hash, but as long as it does not change for every run, it shouldn't matter. But you are right, better to make it deterministic: even if there is no issue now, some day there might be if someone assumes it is deterministic (which shouldn't be a bad assumption).
@jmosbacher, I tried your suggestion but it doesn't work. I initially also tried something like this (by doing the same for immutabledict instead of Mapping); however, it turns out that since immutabledict has a hash method, this whole logic is never accessed.
I'll propose something similar to get it working.
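A small sketch of why the suggestion is bypassed, assuming a hashablize-style helper that only falls back to the Mapping branch when `hash()` fails (the `FrozenDict` class is a stand-in for immutabledict, not the real library):

```python
from collections.abc import Mapping

class FrozenDict(Mapping):
    """Stand-in for immutabledict: a Mapping that defines __hash__."""
    def __init__(self, d):
        self._d = dict(d)
    def __getitem__(self, k):
        return self._d[k]
    def __iter__(self):
        return iter(self._d)
    def __len__(self):
        return len(self._d)
    def __hash__(self):
        return hash(frozenset(self._d.items()))

def hashablize(obj):
    # Sketch only: the Mapping branch is a *fallback*, reached only when
    # the object is not already hashable.
    try:
        hash(obj)
    except TypeError:
        if isinstance(obj, Mapping):
            return tuple(sorted((k, hashablize(v)) for k, v in obj.items()))
        raise
    else:
        return obj  # already hashable: returned untouched

fd = FrozenDict({'a': 1})
# Because FrozenDict defines __hash__, hash(fd) succeeds and the
# isinstance(obj, Mapping) branch is never reached:
assert hashablize(fd) is fd
# A plain dict is unhashable, so it *does* take the Mapping branch:
assert hashablize({'b': 2, 'a': 1}) == (('a', 1), ('b', 2))
```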
Oh, sorry about that. I assumed the issue was with the hash function, but I guess it was in the JSON encoder all along? In that case, maybe add the Mapping logic to the NumpyJSONEncoder class?
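One way the suggestion could look, sketched with the standard library only (the class name `MappingJSONEncoder` is illustrative, not strax's actual `NumpyJSONEncoder`; `MappingProxyType` stands in for immutabledict):

```python
import json
from collections.abc import Mapping
from types import MappingProxyType

class MappingJSONEncoder(json.JSONEncoder):
    """Convert any Mapping (e.g. an immutabledict-like object) to a
    plain dict before the default JSON encoding takes over."""
    def default(self, obj):
        if isinstance(obj, Mapping):
            return dict(obj)
        return super().default(obj)

# MappingProxyType is a Mapping but not a dict subclass, so without the
# encoder json.dumps would raise TypeError:
proxy = MappingProxyType({'a': 1})
out = json.dumps({'cfg': proxy}, cls=MappingJSONEncoder, sort_keys=True)
assert out == '{"cfg": {"a": 1}}'
```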
I have not tested it, but from the code it looks fine. I am not sure the deterministic hashing is really needed; I assume that changing options is only done by very few analysts. But if it can be implemented easily, it would be a nice-to-have.
strax/context.py
if self._fixed_plugin_cache is None or self._config_hash() not in self._fixed_plugin_cache:
    self._fixed_plugin_cache = {self._config_hash(): dict()}
I assume Yossi's comment is also the reason why you are always making a new dictionary when the hash cannot be found?
Very keenly spotted, Daniel; I should have put a comment here.
My suspicion is that the cache is only created once, at most twice if you change some option or register some plugin. I thought the likelihood that one was eating away memory by keeping the cache was greater than the chance that someone was flipping between options often (although neither is very likely).
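The trade-off being described can be sketched as a single-entry cache (illustrative code, not strax's actual class): only the plugins for the current config hash are kept, so switching configs rebuilds the cache but memory stays bounded.

```python
class PluginCacheSketch:
    """Toy single-entry plugin cache keyed by a config hash."""
    def __init__(self):
        self._fixed_plugin_cache = None

    def get_cache(self, config_hash):
        if (self._fixed_plugin_cache is None
                or config_hash not in self._fixed_plugin_cache):
            # Replace (not update) the dict: plugins cached under the
            # previous config hash are dropped rather than accumulated.
            self._fixed_plugin_cache = {config_hash: dict()}
        return self._fixed_plugin_cache[config_hash]

c = PluginCacheSketch()
c.get_cache('hash_1')['peaks'] = object()
assert 'peaks' in c.get_cache('hash_1')      # same hash: cache survives
c.get_cache('hash_2')                        # new hash: old entry evicted
assert 'peaks' not in c.get_cache('hash_1')  # rebuilt empty for hash_1
```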
So my suspicion is that the cache is only once created, at most twice if you change some option or register some plugin.
Yes, this would be my guess, too. As I said, I am fine with it as it is, but if you can implement Yossi's suggestion easily, it would be a nice-to-have.
I fully agree
In general I think this is a good idea. My biggest concern is with multi-threading: do we really have a guarantee that two runs are not being processed concurrently by two different threads?
I think so. The list of plugins needed to process some data is created before we send off the different jobs.
@jmosbacher @WenzDaniel, thanks for the reviews, I've added three things:
For completeness, this is all set up much later, after this function (Line 647 in a7d7c11; Line 435 in a7d7c11).
@jorana Sorry, maybe I'm misunderstanding what's happening in your code. Aren't you caching the plugin instances themselves, so different runs will be processed with the same plugin instance? If I try to process multiple runs, by default those will be processed concurrently with a thread pool, right? But if the plugins are cached, all threads would share the same plugin instance, and the run_id on that instance would just be the last one requested, no? What am I missing here?
Ah right, I see your point better now. What you are saying is correct, thanks for being persistent. Let's write a test for this, as the implications can indeed be substantial. Otherwise we simply have to move this down a few lines to make sure we create separate instances. I went ahead with merging this PR as I wanted to get things going for Daniel's PR on super-runs, but it would have been better to check this carefully first.
I think a possible solution to this would be to return a copy of the cached plugin instead of the original instance. It would also need a recursive copy of the plugin.deps variable, I guess.
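The proposed fix could look roughly like this (a sketch under assumed names; the `Plugin` class below is a minimal stand-in, not strax's): each caller gets its own shallow copy of the cached plugin, with the `deps` mapping copied recursively so dependency instances are not shared either.

```python
import copy

class Plugin:
    """Minimal stand-in for a strax plugin with dependencies."""
    def __init__(self, name, deps=None):
        self.name = name
        self.run_id = None
        self.deps = deps or {}  # maps dependency name -> Plugin

def copy_plugin(plugin):
    # Shallow-copy the plugin itself, then recursively copy its deps so
    # two runs never share a dependency instance (and its run_id).
    new = copy.copy(plugin)
    new.deps = {k: copy_plugin(v) for k, v in plugin.deps.items()}
    return new

cached = Plugin('peaks', deps={'records': Plugin('records')})
a, b = copy_plugin(cached), copy_plugin(cached)
a.run_id, b.run_id = 'run_0', 'run_1'
assert a.run_id != b.run_id                        # no shared state
assert a.deps['records'] is not b.deps['records']  # deps copied too
```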
* fix #483 and add tests
* haha, be smart about the st.key_for since it does allow reuse
* add the actuall test
* Have to investigate
* add comments to copy function
* why did we use a CutPlugin here
* add type to function
* allow configurable compressors and timeouts
What is the problem / what does the code in this PR do?
When use_per_run_defaults == False, we don't allow changing lineages in the context. If so, there is no need to recompute the lineage of each plugin for every run.

Can you briefly describe how it works?
This only works when we have fixed lineages, but when they are fixed, we don't need a full initialization of the plugin when we are going to load a new run. We can just re-use a plugin and only change its run_id attribute. To this end, the plugins are cached under a hash of st.config; any changes there will result in a new hash, and as such a new key to cache the plugins under.

Can you give a minimal working example (or illustrate with a figure)?
Before
After
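The Before/After illustration does not survive in this extract; as a toy end-to-end sketch of the PR's idea (illustrative names only, not strax's real API): plugins are cached under a hash of the config, and loading another run reuses the cached plugin, updating only its run_id.

```python
import hashlib

class MiniContext:
    """Toy context: caches plugins under a hash of its config."""
    def __init__(self, config):
        self.config = dict(config)
        self._cache = {}

    def _config_hash(self):
        # Sort items so the hash does not depend on insertion order.
        items = sorted(self.config.items())
        return hashlib.sha1(str(items).encode()).hexdigest()

    def get_plugin(self, name, run_id, factory):
        key = (self._config_hash(), name)
        if key not in self._cache:
            self._cache[key] = factory()  # expensive init happens once
        plugin = self._cache[key]
        plugin.run_id = run_id            # cheap per-run update
        return plugin

class DummyPlugin:
    run_id = None

st = MiniContext({'gain': 1})
p1 = st.get_plugin('peaks', 'run_0', DummyPlugin)
p2 = st.get_plugin('peaks', 'run_1', DummyPlugin)
assert p1 is p2 and p2.run_id == 'run_1'  # reused, only run_id changed

st.config['gain'] = 2                     # config change -> new hash
p3 = st.get_plugin('peaks', 'run_2', DummyPlugin)
assert p3 is not p1                       # new key, fresh plugin
```

Note that `p1 is p2` here is exactly the instance sharing flagged in the thread-safety discussion above; the sketch shows the caching mechanism, not a concurrency-safe design.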