
Add get_zarr method to context #540

Merged: 23 commits into AxFoundation:master on Oct 14, 2021

Conversation

@jmosbacher (Contributor) commented Oct 4, 2021

This PR adds a get_zarr method to the context, which creates persistent arrays; these are useful for loading large datasets that don't fit in memory.

For each requested target, the method iterates over all run_ids and chunks and adds them to the given storage location (overwriting or appending is optional). The zarr group is then returned to the user.

Please add comments if you think the behavior should be different (e.g. auto-merging targets with the same data types, using the hash in array names, etc.). I will add tests once the behavior is finalized.

minimal example:

import dask.array as da

zgrp = st.get_zarr(runs, ('peaks', 'event_basics')) # creates a zarr group and adds datasets with the requested data
event_basics = zgrp['event_basics']  # dict-like access to the created persistent arrays
darr = da.from_zarr(event_basics)  # to dask array
ddf = darr.to_dask_dataframe()  # to dask dataframe

@JoranAngevaare (Member) commented Oct 6, 2021

Thanks Yossi. Looking at the documentation at https://zarr.readthedocs.io/en/stable/tutorial.html, this looks like something very useful.

A few questions before diving into the code:

  • How does the zarr './strax_data' get re-used? It looks like you are using the target rather than the data key as the lookup. I think this can cause errors if one changes the context, right? It will still return the same data. If we can make this reproducible, it might be quite a benefit.
  • You chose to put this zarr outside of the storage-frontend system. Is the greatest advantage that you can easily create objects that transcend run_ids? I see that having zarr as another frontend also poses issues, as you would have two systems doing similar things. There are some cons too, like not being able to load the data the "normal" way: as far as the context is concerned, the data only exists in this one function, and no other function is aware of it. Probably you thought of more pros/cons? There is of course also your other PR: Zarr storage #412.
  • This functionality looks rather similar to the multi-run functionality; I'm not sure which of the approaches has the benefit in which use case. Perhaps this is something worth adding to the docs?

@jmosbacher (Contributor, Author) commented Oct 7, 2021

@JoranAngevaare thanks for the comments.

How does the zarr './strax_data' get re-used? It looks like you are using the target rather than the data key as the lookup. I think this can cause errors if one changes the context, right? It will still return the same data. If we can make this reproducible, it might be quite a benefit.

Indeed, I was on the fence about this. My first implementation used target+hash as the data key, but I switched to make it more intuitive for the user; needing the hash to access the target in your array can also impede code reusability.
The problem is that I then made it optional to append runs to the array instead of overwriting it, so this can indeed result in unintuitive behavior.
To solve this I can think of two simple solutions:

  1. Include the hash in the list of inserted runs. This would allow users to mix and match runs from different contexts as they please, but if the same run being loaded already exists under a different hash, it would be overwritten even if overwrite=False.
  2. Include the context hash in the label of the group itself, so all zarr groups created by this method are always associated with a specific context (a rough sketch of this option follows below).

Do you have any preference between the two, or maybe a better solution?
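
For illustration, a rough sketch of what option 2 could look like (a sketch only; the strax accessor used here is an assumption, and the actual key layout may differ):

import zarr

# Hypothetical sketch: key each zarr group by the lineage hash, so arrays
# created under one context can never silently mix with another context's data.
root = zarr.open_group('./strax_data', mode='a')
lineage_hash = st.key_for(run_id, target).lineage_hash  # assumed accessor
grp = root.require_group(f'{target}-{lineage_hash}')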

You chose to put this zarr outside of the storage-frontend system.

Yes, the purpose of this function is just to have an easy way to load data that won't fit in memory while still allowing numpy-like access, not to be an alternative storage option. Perhaps my choice of ./strax_data as the default location for the persistent array was misleading; I think I will change that to a temp dir by default, to emphasize that it's meant to be more of a cache than storage, since the data is already stored by strax.

This functionality looks rather similar to the multi-run functionality

The purpose is to enable people to load all the data they need for their analysis in a single call to strax, even if it won't fit in memory, and access it as if it were a regular numpy array. Depending on the analysis, that can be a single run, multiple runs, or multiple data kinds.

I still haven't made any changes to the docs because I wanted to first get your input on what options you think we should have here. Once the options and behavior are finalized, I'll of course add documentation and tests.

@WenzDaniel (Collaborator)

Hi Yossi, I think zarr is in general a nice package, and I see some potential to maybe replace multi- and superruns with such a system. But at the moment I have similar concerns as Joran. One additional thing I am wondering: in the end this is based on dask, so are dask arrays supported by numba? Probably they are, but I do not know, and most of our important functions are based on numba.

But this looks like a nice exercise to be discussed at the upcoming workshop. In general, I think we should make a list of things against which to compare all available options. Out of the box, I think we should compare:

  • Disk and memory usage
  • Loading performance
  • Metadata handling
  • Complexity for the analysts

Further, in the end it might be worth having a test branch and asking a few analysts for a beta-test.

@WenzDaniel (Collaborator) commented Oct 7, 2021

this would allow users to mix and match runs from different contexts as they please

I slightly disagree with this statement. It sounds a bit messy to me and too easy to screw up. I think if you want to compare different "data-settings contexts", you should also use different "strax contexts". That way we always have a nice and clean separation. Hence, in that sense, I am more for your option 2.

@WenzDaniel (Collaborator)

Btw, I am wondering if you couldn't use the group feature for this level of organization:
https://zarr.readthedocs.io/en/stable/tutorial.html#groups
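
For reference, a minimal sketch of what that group feature looks like (paths and shapes here are illustrative):

import zarr

# A group is a hierarchical container of arrays and sub-groups,
# similar to HDF5 groups or filesystem directories.
root = zarr.open_group('./example.zarr', mode='a')
peaks = root.require_group('peaks')
arr = peaks.zeros('data', shape=(1000,), chunks=(100,), dtype='f8')
print(root.tree())  # renders the hierarchy as a tree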

@WenzDaniel (Collaborator)

And maybe ragged arrays for raw_data? https://zarr.readthedocs.io/en/stable/tutorial.html#ragged-arrays :D
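
For reference, the linked tutorial implements ragged arrays via an object dtype with a variable-length codec; a minimal sketch:

import numcodecs
import zarr

# Each element of z is itself a variable-length integer array.
z = zarr.empty(4, dtype=object, object_codec=numcodecs.VLenArray(int))
z[0] = [1, 3, 5]
z[1] = [4]
print(z[0])  # array([1, 3, 5])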

@jmosbacher (Contributor, Author) commented Oct 7, 2021

@WenzDaniel thanks for the comments.

In the end it is based on dask

zarr is not based on dask; it just allows you to store arrays on disk and access them as if they were regular numpy arrays in memory. I showed a minimal example of how it integrates nicely with dask because, if you use this to load data larger than memory, you will probably want to do something with that data, and any processing will have to use a distributed algorithm; dask.array and dask.dataframe have distributed implementations of most of the numpy and pandas APIs.
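
To illustrate the distinction, a minimal sketch of zarr on its own, with no dask involved (path and shapes are made up):

import zarr

# The data lives on disk; slicing reads only the chunks it needs,
# so the array as a whole never has to fit in memory.
z = zarr.open('./big.zarr', mode='w', shape=(1_000_000,), chunks=(10_000,), dtype='f8')
z[:10_000] = 1.0        # writes one chunk to disk
part = z[5_000:6_000]   # reads back an ordinary numpy array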

dask.arrays supported by numba

dask works well with numba as far as I could tell (I have also implemented a full strax processing pipeline in dask and it worked fine, but that will be a different PR).
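
As a hedged illustration of that interplay (names made up): dask passes plain numpy chunks to whatever function is mapped over the blocks, so a numba-jitted kernel sees ordinary arrays and never dask itself.

import dask.array as da
import numba

@numba.njit
def double(x):
    # Receives an ordinary numpy chunk; numba is unaware of dask.
    return x * 2

darr = da.ones((1_000_000,), chunks=(100_000,))
out = darr.map_blocks(double).compute()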

Btw I am wondering if you cannot use the group feature for this level of organization

Yes, that is what I meant in option number 2 :)

And ragged arrays maybe for raw_data

That is also something I am looking into separately, unrelated to this PR: sparse arrays for raw_data and ragged arrays for merging different data kinds.

@JoranAngevaare added the "enhancement" (New feature or request) label on Oct 7, 2021
@JoranAngevaare (Member) left a comment

To solve this I can think of two simple solutions:

  1. Include the hash in the list of inserted runs. This would allow users to mix and match runs from different contexts as they please, but if the same run being loaded already exists under a different hash, it would be overwritten even if overwrite=False.
  2. Include the context hash in the label of the group itself, so all zarr groups created by this method are always associated with a specific context.

Do you have any preference between the two, or maybe a better solution?

I think I agree with Daniel that for comparing two datasets, one might better create two contexts to make sure that things don't mix.

I think you now went for option 2 (using the lineage hash, not the context hash 😉), which also makes sense to me.

Review threads on strax/context.py (resolved)
@WenzDaniel (Collaborator) commented Oct 12, 2021

Hej Yossi, I am currently playing a bit with your PR to get a better understanding of how to use this addition, but I am afraid I am already failing at the basics :D

zarr = st.get_zarr(('030000', '030001', '030002'), targets=('peak_basics', 'lone_hits'))
zarr.tree()

That should show me some tree-like structure for the requested data, shouldn't it?

And maybe one other question :D How can I access your RUNS field?

@WenzDaniel (Collaborator)

Other than that, it is pretty cool :D Although I have only tested it with lightweight data ;)

@WenzDaniel (Collaborator) left a comment

Just one last comment, besides the comment Joran made about the docstring.

Review thread on strax/context.py (resolved)
@jmosbacher (Contributor, Author) commented Oct 12, 2021

Hey @WenzDaniel, thanks for playing around with this.

zarr = st.get_zarr(('030000', '030001', '030002'), targets=('peak_basics', 'lone_hits'))
zarr.tree()

That should show me some tree-like structure for the requested data, shouldn't it?

Indeed, but looking at their source code, they use an ipytree widget for the notebook display of the tree, so you need to install it explicitly or pip install zarr[jupyter]. I guess we can add it to the dependencies if you think it would be useful for people.
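
For what it's worth, the widget should only be needed for the notebook view; if I read zarr's dependencies right (it ships with asciitree), the plain-text rendering works without ipytree:

print(zgrp.tree())  # text rendering via asciitree; no ipytree required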

And maybe one other question :D How can I access your RUNS field?

I set the RUNS field in the attrs of each array, so e.g.:

zgrp = st.get_zarr(('030000', '030001', '030002'), targets=('peak_basics', 'lone_hits'))
zgrp['peak_basics'].attrs['RUNS']

@jmosbacher (Contributor, Author)

If the behavior seems reasonable to all I will add an explanation to the docs and some tests.

@WenzDaniel (Collaborator)

Indeed, but looking at their source code, they use an ipytree widget for the notebook display of the tree, so you need to install it explicitly or pip install zarr[jupyter]. I guess we can add it to the dependencies if you think it would be useful for people.

Yes, this is what I did. I will try again later.

zgrp['peak_basics'].attrs['RUNS']

Ahh, thanks. I had tried zgrp.RUNS.

@jmosbacher (Contributor, Author)

Yes, this is what I did. I will try again later.

Since ipytree installs a Jupyter widget, you will probably need to refresh the JupyterLab page at a minimum, and maybe even restart the Jupyter server (less likely), to reload the JavaScript side of Jupyter.

@WenzDaniel (Collaborator) left a comment

Hej Yossi, I saw you made some additional changes. Do you need some more time, or should we add this nice PR to the upcoming release?

Review threads on docs/source/advanced/out_of_core.rst and extra_requirements/requirements-tests.txt (resolved)
@jmosbacher (Contributor, Author)

I think this PR is ready to merge unless anyone wants some more changes. Maybe I should add an EXPERIMENTAL warning when this method is called?

@WenzDaniel (Collaborator)

Maybe I should add an EXPERIMENTAL warning when this method is called?

Sure, if you like. I will advertise the PR later, but I am not so sure many people will use it.

@JoranAngevaare (Member) left a comment

Thanks Yossi!

Review thread on extra_requirements/requirements-tests.txt (resolved)
@WenzDaniel merged commit 729cc46 into AxFoundation:master on Oct 14, 2021
Labels: enhancement (New feature or request)
3 participants