
Add get_zarr method to context #540

Merged: 23 commits into AxFoundation:master on Oct 14, 2021

Conversation

@jmosbacher (Contributor) commented Oct 4, 2021

This PR adds a get_zarr method to the context, which creates persistent arrays; these are useful for loading large datasets that don't fit in memory.

For each requested target, the method iterates over all run_ids and chunks and adds them to the given storage location (overwriting or appending is optional). The zarr group is then returned to the user.

Please add comments if you think the behavior should be different (e.g. auto-merging targets with the same data types, using the hash in array names, etc.). I will add tests once the behavior is finalized.

minimal example:

import dask.array as da

zgrp = st.get_zarr(runs, ('peaks', 'event_basics')) # creates a zarr group and adds datasets with the requested data
event_basics = zgrp['event_basics']  # dict-like access to the created persistent arrays
darr = da.from_zarr(event_basics)  # to dask array
ddf = darr.to_dask_dataframe()  # to dask dataframe

@JoranAngevaare (Member) commented Oct 6, 2021

Thanks Yossi. Looking at the documentation at https://zarr.readthedocs.io/en/stable/tutorial.html, this looks like something very useful.

A few questions before diving into the code:

  • How does the zarr './strax_data' get re-used? It looks like you are using the target rather than the data key as the lookup. I think this can cause errors if one changes the context, right? It will still return the same data. If we can make this reproducible, it might be quite a benefit.
  • You chose to put this zarr outside of the storage-frontend system. Is the greatest advantage that you can easily create objects that transcend run_ids? I see that having zarr as another frontend also poses issues, as you would have two systems doing similar things. There are some cons too, like not being able to load the data the "normal" way: as far as the context is concerned, the data only exists in this one function, and no other function is aware of it. Probably you thought of more pros/cons? There is of course also your other PR: Zarr storage #412.
  • This functionality looks rather similar to the multi-run functionality; I'm not sure which of the approaches has the benefit in which use case. Perhaps this is something worth adding to the docs?

@jmosbacher (Contributor, Author) commented Oct 7, 2021

@JoranAngevaare thanks for the comments.

How does the zarr './strax_data' get re-used? It looks like you are using the target rather than the data key as the lookup. I think this can cause errors if one changes the context, right? It will still return the same data. If we can make this reproducible, it might be quite a benefit.

Indeed, I was on the fence about this. My first implementation used target+hash as the data key, but I switched to make it more intuitive for the user; needing the hash to access the target in your array can also impede code reusability.
The problem is that I then made it optional to append runs to the array instead of overwriting it, so this can indeed result in unintuitive behavior.
To solve this I can think of two simple solutions:

  1. Include the hash in the list of inserted runs. This would allow users to mix and match runs from different contexts as they please, but if the same run being loaded already exists under a different hash, it would be overwritten even if overwrite=False.
  2. Include the context hash in the label of the group itself, so all zarr groups created by this method are always associated with a specific context (a rough sketch of this option follows below).

Do you have any preference between the two, or maybe a better solution?
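
For illustration, a rough sketch of what option 2 could look like (a sketch only; the strax accessor used here is an assumption, and the actual key layout may differ):

import zarr

# Hypothetical sketch: key each zarr group by the lineage hash, so arrays
# created under one context can never silently mix with another context's data.
root = zarr.open_group('./strax_data', mode='a')
lineage_hash = st.key_for(run_id, target).lineage_hash  # assumed accessor
grp = root.require_group(f'{target}-{lineage_hash}')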

You chose to put this zarr outside of the storage-frontend system.

Yes, the purpose of this function is just to have an easy way to load data that won't fit in memory while still allowing numpy-like access, not to be an alternative storage option. Perhaps my choice of ./strax_data as the default location for the persistent array was misleading; I think I will change that to a temp dir by default, to emphasize that it's meant to be more of a cache than storage, since the data is already stored by strax.

This functionality looks rather similar to the multi-run functionality

The purpose is to enable people to load all the data they need for their analysis in a single call to strax, even if it won't fit in memory, and access it as if it were a regular numpy array. Depending on the analysis, that can be a single run, multiple runs, or multiple data kinds.

I still haven't made any changes to the docs because I wanted to first get your input on what options you think we should have here. Once the options and behavior are finalized, I'll of course add documentation and tests.

@WenzDaniel (Collaborator)

Hi Yossi, I think zarr is in general a nice package, and I see some potential to maybe replace multi- and superruns with such a system. But at the moment I have similar concerns as Joran. One additional thing I am wondering: in the end this is based on dask, so are dask arrays supported by numba? Probably they are, but I do not know, and most of our important functions are based on numba.

But this looks like a nice exercise to be discussed at the upcoming workshop. In general, I think we should make a list of things against which to compare all available options. Out of the box, I think we should compare:

  • Disk and memory usage
  • Loading performance
  • Metadata handling
  • Complexity for the analysts

Further, in the end it might be worth having a test branch and asking a few analysts for a beta-test.

@WenzDaniel (Collaborator) commented Oct 7, 2021

this would allow users to mix and match runs from different contexts as they please

I slightly disagree with this statement. It sounds a bit messy to me and too easy to screw up. I think if you want to compare different "data-settings contexts", you should also use different "strax contexts". That way we always have a nice and clean separation. Hence, in that sense, I am more for your option 2.

@WenzDaniel (Collaborator)

Btw, I am wondering if you couldn't use the group feature for this level of organization:
https://zarr.readthedocs.io/en/stable/tutorial.html#groups
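
For reference, a minimal sketch of what that group feature looks like (paths and shapes here are illustrative):

import zarr

# A group is a hierarchical container of arrays and sub-groups,
# similar to HDF5 groups or filesystem directories.
root = zarr.open_group('./example.zarr', mode='a')
peaks = root.require_group('peaks')
arr = peaks.zeros('data', shape=(1000,), chunks=(100,), dtype='f8')
print(root.tree())  # renders the hierarchy as a tree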

@WenzDaniel (Collaborator)

And maybe ragged arrays for raw_data? https://zarr.readthedocs.io/en/stable/tutorial.html#ragged-arrays :D
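
For reference, the linked tutorial implements ragged arrays via an object dtype with a variable-length codec; a minimal sketch:

import numcodecs
import zarr

# Each element of z is itself a variable-length integer array.
z = zarr.empty(4, dtype=object, object_codec=numcodecs.VLenArray(int))
z[0] = [1, 3, 5]
z[1] = [4]
print(z[0])  # array([1, 3, 5])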

@jmosbacher (Contributor, Author) commented Oct 7, 2021

@WenzDaniel thanks for the comments.

In the end it is based on dask

zarr is not based on dask; it just allows you to store arrays on disk and access them as if they were regular numpy arrays in memory. I showed a minimal example of how it integrates nicely with dask because, if you use this to load data larger than memory, you will probably want to do something with that data, and any processing will have to use a distributed algorithm; dask.array and dask.dataframe have distributed implementations of most of the numpy and pandas APIs.
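
To illustrate the distinction, a minimal sketch of zarr on its own, with no dask involved (path and shapes are made up):

import zarr

# The data lives on disk; slicing reads only the chunks it needs,
# so the array as a whole never has to fit in memory.
z = zarr.open('./big.zarr', mode='w', shape=(1_000_000,), chunks=(10_000,), dtype='f8')
z[:10_000] = 1.0        # writes one chunk to disk
part = z[5_000:6_000]   # reads back an ordinary numpy array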

dask.arrays supported by numba

dask works well with numba as far as I could tell (I have also implemented a full strax processing pipeline in dask and it worked fine, but that will be a different PR).
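
As a hedged illustration of that interplay (names made up): dask passes plain numpy chunks to whatever function is mapped over the blocks, so a numba-jitted kernel sees ordinary arrays and never dask itself.

import dask.array as da
import numba

@numba.njit
def double(x):
    # Receives an ordinary numpy chunk; numba is unaware of dask.
    return x * 2

darr = da.ones((1_000_000,), chunks=(100_000,))
out = darr.map_blocks(double).compute()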

Btw I am wondering if you cannot use the group feature for this level of organization

Yes, that is what I meant in option number 2 :)

And ragged arrays maybe for raw_data

That is also something I am looking into separately, unrelated to this PR: sparse arrays for raw_data and ragged arrays for merging different data kinds.

@JoranAngevaare added the "enhancement" (New feature or request) label on Oct 7, 2021
@JoranAngevaare (Member) left a comment

To solve this I can think of two simple solutions:

  1. Include the hash in the list of inserted runs. This would allow users to mix and match runs from different contexts as they please, but if the same run being loaded already exists under a different hash, it would be overwritten even if overwrite=False.
  2. Include the context hash in the label of the group itself, so all zarr groups created by this method are always associated with a specific context.

Do you have any preference between the two, or maybe a better solution?

I think I agree with Daniel that for comparing two datasets, one might better create two contexts to make sure that things don't mix.

I think you now went for option 2 (using the lineage hash, not the context hash 😉), which also makes sense to me.

Review threads on strax/context.py (resolved)
@WenzDaniel (Collaborator) commented Oct 12, 2021

Hej Yossi, I am currently playing a bit with your PR to get a better understanding of how to use this addition, but I am afraid I am already failing at the basics :D

zarr = st.get_zarr(('030000', '030001', '030002'), targets=('peak_basics', 'lone_hits'))
zarr.tree()

That should show me some tree-like structure for the requested data, shouldn't it?

And maybe one other question :D How can I access your RUNS field?

@WenzDaniel (Collaborator)

Other than that, it is pretty cool :D Although I have only tested it with lightweight data ;)

@WenzDaniel (Collaborator) left a comment

Just one last comment, besides the comment Joran made about the docstring.

Review thread on strax/context.py (resolved)
@jmosbacher (Contributor, Author) commented Oct 12, 2021

Hey @WenzDaniel, thanks for playing around with this.

zarr = st.get_zarr(('030000', '030001', '030002'), targets=('peak_basics', 'lone_hits'))
zarr.tree()

That should show me some tree-like structure for the requested data, shouldn't it?

Indeed, but looking at their source code, they use an ipytree widget for the notebook display of the tree, so you need to install it explicitly or pip install zarr[jupyter]. I guess we can add it to the dependencies if you think it would be useful for people.
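
For what it's worth, the widget should only be needed for the notebook view; if I read zarr's dependencies right (it ships with asciitree), the plain-text rendering works without ipytree:

print(zgrp.tree())  # text rendering via asciitree; no ipytree required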

And maybe one other question :D How can I access your RUNS field?

I set the RUNS field in the attrs of each array, so e.g.:

zgrp = st.get_zarr(('030000', '030001', '030002'), targets=('peak_basics', 'lone_hits'))
zgrp['peak_basics'].attrs['RUNS']

@jmosbacher (Contributor, Author)

If the behavior seems reasonable to all I will add an explanation to the docs and some tests.

@WenzDaniel (Collaborator)

Indeed, but looking at their source code, they use an ipytree widget for the notebook display of the tree, so you need to install it explicitly or pip install zarr[jupyter]. I guess we can add it to the dependencies if you think it would be useful for people.

Yes, this is what I did. I will try again later.

zgrp['peak_basics'].attrs['RUNS']

Ahh, thanks. I had tried zgrp.RUNS.

@jmosbacher (Contributor, Author)

Yes, this is what I did. I will try again later.

Since ipytree installs a Jupyter widget, you will probably need to refresh the JupyterLab page at a minimum, and maybe even restart the Jupyter server (less likely), to reload the JavaScript side of Jupyter.

@WenzDaniel (Collaborator) left a comment

Hej Yossi, I saw you made some additional changes. Do you need some more time, or should we add this nice PR to the upcoming release?

Review threads on docs/source/advanced/out_of_core.rst and extra_requirements/requirements-tests.txt (resolved)
@jmosbacher (Contributor, Author)

I think this PR is ready to merge unless anyone wants some more changes. Maybe I should add an EXPERIMENTAL warning when this method is called?

@WenzDaniel (Collaborator)

Maybe I should add an EXPERIMENTAL warning when this method is called?

Sure, if you like. I will advertise the PR later, but I am not so sure many people will use it.

@JoranAngevaare (Member) left a comment

Thanks Yossi!

Review thread on extra_requirements/requirements-tests.txt (resolved)
@WenzDaniel merged commit 729cc46 into AxFoundation:master on Oct 14, 2021
Labels: enhancement (New feature or request)
3 participants