Add get_zarr method to context #540
Conversation
Thanks Yossi, looking at the documentation at https://zarr.readthedocs.io/en/stable/tutorial.html, this looks like something very useful. A few questions before diving into the code:
@JoranAngevaare thanks for the comments.
Indeed, I was on the fence about this. My first implementation used the target+hash as the data key, but I switched to make it more intuitive for the user. Also, needing the hash to access the target in your array can impede code reusability.
Yes, the purpose of this function is just to provide an easy way to load data that won't fit in memory while still allowing numpy-like access, not to serve as an alternative storage option. Perhaps my choice of ./strax_data as the default location for the persistent array was misleading; I think I will change that to a temp dir by default to emphasize that it's meant to be a cache rather than storage, since the data is already stored by strax.
The purpose is to let people load all the data they need for their analysis in a single call to strax, even if it won't fit in memory, and access it as if it were a regular numpy array. This can mean a single run or multiple runs for some analyses, or multiple data kinds for others. I haven't added any changes to the docs yet because I first wanted to get your input on what options you think we should have here. Once the options and behavior are finalized I'll of course add documentation and tests.
Hi Yossi, I think zarr is a nice package in general, and I see some potential to maybe replace multi- and superruns with such a system. But at the moment I have similar concerns as Joran. One additional thing I am wondering: in the end it is based on dask, so are dask.arrays supported by numba? Probably they are, but I do not know, and most of our important functions are based on numba. This looks like a nice exercise to be discussed at the upcoming workshop. In general I think we should make a list of things against which we want to compare all available options. Out of the box I think we should compare:
Further, in the end it might be worth having a test branch and asking a few analysts for a beta test.
I slightly disagree with this statement. It sounds a bit messy to me and too easy to screw up. I think if you want to compare different "data-settings-contexts" you should also use different "strax-contexts". That way we always have a nice and clean separation. Hence, in that sense, I am more in favor of your option 2.
Btw, I am wondering whether you could use the group feature for this level of organization.
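For reference, a minimal sketch of zarr's group API with a made-up store path and names (one group per run, one array per target):

import zarr

# hypothetical layout: a directory store with one group per run
root = zarr.open_group('./example_store', mode='w')
run = root.create_group('030000')
run.zeros('peak_basics', shape=(1000,), chunks=(100,), dtype='f8')
print(root.tree())  # prints the hierarchy as a text tree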
And maybe ragged arrays for raw_data? https://zarr.readthedocs.io/en/stable/tutorial.html#ragged-arrays :D
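Roughly, the ragged-array pattern from that tutorial section (the values here are made up, not strax data):

import numpy as np
import zarr
import numcodecs

# each element stores a variable-length float array
z = zarr.empty(3, dtype=object, object_codec=numcodecs.VLenArray('f8'))
z[0] = np.array([1.0, 2.0, 3.0])
z[1] = np.array([4.0])
print(z[0], z[1])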
@WenzDaniel thanks for the comments
zarr is not based on dask; it just allows you to store arrays on disk and access them as if they were regular numpy arrays in memory. I just showed a minimal example of how it integrates nicely with dask: if you use this to load data larger than memory, you would probably want to do something with that data, and any processing you do would have to use a distributed algorithm. dask.array and dask.dataframe have distributed implementations of most of the numpy and pandas APIs.
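A minimal sketch of that integration, reusing the made-up store from the group example above:

import dask.array as da
import zarr

# open the persistent array read-only and wrap it as a lazy dask array
z = zarr.open('./example_store/030000/peak_basics', mode='r')
d = da.from_array(z, chunks=z.chunks)
print(d.mean().compute())  # computed chunk by chunk, never fully in memory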
dask works well with numba as far as I could tell (I have also implemented a full strax processing pipeline in dask and it worked fine, but that will be a different PR).
Yes, that is what I meant in option number 2 :)
Sparse arrays for raw_data and ragged arrays for merging different data kinds are also something I am looking into separately, not related to this PR.
To solve this I can think of two simple solutions:
- Include the hash in the list of inserted runs. This would allow users to mix and match runs from different contexts as they please, but if the same run being loaded already exists under a different hash, it would be overwritten even if overwrite=False.
- Include the context hash in the label of the group itself, so all zarr groups created by this method are always associated with a specific context.
Do you have any preference between the two, or maybe a better solution?
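A rough illustration of the two options (context_hash here is a hypothetical stand-in, not the actual naming scheme):

import zarr

root = zarr.open_group('./example_store', mode='a')
context_hash = 'abc123'  # hypothetical

# option 1: keep the hash alongside the list of inserted runs in metadata
arr = root.require_dataset('peak_basics', shape=(0,), dtype='f8')
arr.attrs['RUNS'] = {'030000': context_hash}

# option 2: bake the context hash into the group label itself
grp = root.require_group(f'peak_basics-{context_hash}')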
I think I agree with Daniel that for comparing two datasets, one should rather create two contexts to make sure that things don't mix.
I think you now went for option 2 (using the lineage hash, not the context hash 😉), which also makes sense to me.
Hej Yossi, I am currently trying to play a bit with your PR to get a better understanding of how to use this addition, but I am afraid I am already failing at the basics :D
This should show me some tree-like structure for the requested data, shouldn't it? And maybe one other question :D How can I access your RUNS field?
Other than that it is pretty cool :D Although I have only tested it with lightweight data ;)
Just one last comment, besides the comment Joran made about the docstring.
Hey @WenzDaniel thanks for playing around with this
Indeed, but looking at their source code, they use an ipytree widget for the notebook display of the tree, so you need to install it explicitly or pip install zarr[jupyter]. I guess we can add it to the dependencies if you think it would be useful for people.
I set the RUNS field in the attrs of each array, so e.g.:
zgrp = st.get_zarr(('030000', '030001', '030002'), targets=('peak_basics', 'lone_hits'))
zgrp['peak_basics'].attrs['RUNS']
If the behavior seems reasonable to all, I will add an explanation to the docs and some tests.
Yes, this is what I did. I will try again later.
Ahh thanks, I tried
Since ipytree installs a jupyter widget, you will probably need to refresh the jupyter lab page at a minimum, and maybe even restart the jupyter server (less likely), to reload the javascript side of jupyter.
Hej Yossi, I saw you made some additional changes. Do you need some more time or should we add this nice PR to the upcoming release?
I think this PR is ready to merge unless anyone wants some more changes. Maybe I should add an EXPERIMENTAL warning when this method is called?
Sure, if you like. I will advertise the PR later, but I am not so sure if many people will use it.
Thanks Yossi!
This PR adds a get_zarr method to the context to create persistent arrays, which are useful for loading large datasets that don't fit in memory.
For each requested target, the method iterates over all run_ids and chunks and adds them to the given storage location (overwrite or append is optional). The zarr group is then returned to the user.
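Roughly, the behavior can be sketched like this (not the actual implementation; get_iter is strax's existing chunk iterator, and the zarr layout and append logic here are assumptions):

import zarr

def get_zarr_sketch(context, run_ids, targets, storage='./strax_data'):
    # one zarr array per target; chunks from every run are appended in order
    root = zarr.open_group(storage, mode='a')
    for target in targets:
        for run_id in run_ids:
            for chunk in context.get_iter(run_id, target):
                data = chunk.data  # structured numpy array for this chunk
                if target in root:
                    root[target].append(data)
                else:
                    root.create_dataset(target, data=data, chunks=True)
    return root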
Please add comments if you think the behavior should be different (e.g. auto-merging targets with the same datatypes, using the hash in array names, etc.). I will add tests once the behavior is finalized.
Minimal example:
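(a sketch along the lines of the usage discussed above; the run ids and targets are made up, and an already-configured strax context st is assumed)

# st = strax.Context(...) configured elsewhere
zgrp = st.get_zarr(('030000', '030001'), targets=('peak_basics',))
print(zgrp['peak_basics'][:10])           # numpy-like slicing; data stays on disk
print(zgrp['peak_basics'].attrs['RUNS'])  # provenance of the inserted runs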