Use 3 numpy arrays for manifest internally #107
Conversation
Here is a script which generated a 9GB JSON file across many years of NWM data: https://gist.github.com/rsignell-usgs/d386c85e02697c5b89d0211371e8b944 . I'll see if I can find a parquet version, but you should reckon on roughly a 10x reduction in on-disk size (parquet or compressed numpy). Unfortunately, the references have been deleted, because the whole dataset is now also available as zarr. I may have the chance to regenerate them sometime, if it's important.
Also, a super-simple arrow- or awkward-like string representation as contiguous numpy arrays could look something like this:
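For illustration, a minimal sketch of that idea (the file names are made up): all the character bytes live in one contiguous buffer, and a second array of offsets marks where each string starts.

```python
import numpy as np

# Character data for ["foo.nc", "bar.nc", "baz.nc"] concatenated into one
# contiguous buffer, plus an offsets array marking each string's start.
data = np.frombuffer(b"foo.ncbar.ncbaz.nc", dtype=np.uint8)
offsets = np.array([0, 6, 12, 18], dtype=np.int64)

def get_string(i: int) -> str:
    # Slice the contiguous buffer between consecutive offsets
    return data[offsets[i]:offsets[i + 1]].tobytes().decode()

print(get_string(1))  # "bar.nc"
```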
That's useful context for #104, thanks Martin!
This needs pre-release numpy and h5py (the latter from the scientific python nightly repo). Can I do this by adding a special pip install command to a conda env?
Now builds atop #139
you can:

```yaml
dependencies:
  # - netcdf4
  - h5netcdf
  - pip
  - pip:
      - -i https://pypi.anaconda.org/scientific-python-nightly-wheels/simple
      - --pre
      - h5py
      - numpy
```

(and if you need multiple packages from different sources, put the definition in a separate requirements file)
Note that an efficient packing of a string array would be all the character data concatenated, plus a separate offsets array. This is what arrow or awkward does. Parquet can also store this way, but alternating length+data is the standard. Dense packings like that assume immutability. https://github.com/martindurant/fastparquet/blob/faster/fastparquet/wrappers.py#L51 has the simplest possible version of this (yes, fastparquet is intending to move to pure numpy within months, if I didn't say this already).
@martindurant does numpy's new string dtype not do that too?
No, it allocates on the heap and stores pointers, I believe. That is not (necessarily) memory contiguous, but allows for mutability. |
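A small sketch of that behaviour (requires numpy >= 2.0; the fixed 16 bytes per element is also what the measurement below works out to):

```python
import numpy as np

# numpy 2.0's variable-width string dtype: each array element is a fixed
# 16-byte struct (on 64-bit builds) that either inlines a very short string
# or points to a heap allocation, so the buffer itself is not the characters.
arr = np.array(["s3://bucket/a.nc", "s3://bucket/b.nc"], dtype=np.dtypes.StringDType())

arr[0] = "s3://bucket/a-much-longer-replacement.nc"  # mutable in place
print(arr.nbytes)  # 32: 2 elements * 16 bytes, regardless of string lengths
```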
This PR now works (passes all tests locally). The failures are the same as on …
@rabernat @martindurant a pyarrow string array might be a bit more memory-efficient, but a) I can store millions of chunk references using just a few MB with this new numpy dtype, which seems plenty good enough to me:

```python
In [1]: from virtualizarr.tests import create_manifestarray

In [2]: marr = create_manifestarray(shape=(100, 100, 100), chunks=(1, 1, 1))

In [3]: marr
Out[3]: ManifestArray<shape=(100, 100, 100), dtype=float32, chunks=(1, 1, 1)>

In [4]: marr.manifest._paths.nbytes / 1e6
Out[4]: 16.0

In [5]: (marr.manifest._paths.nbytes + marr.manifest._offsets.nbytes + marr.manifest._lengths.nbytes) / 1e6
Out[5]: 24.0
```

and b) IIUC pyarrow string arrays are not N-dimensional arrays, and half the point of this PR is that my implementation of …
virtualizarr/manifests/manifest.py (outdated)

```python
@classmethod
def validate_chunks(cls, entries: Any) -> Mapping[ChunkKey, ChunkEntry]:
    validate_chunk_keys(list(entries.keys()))
    ...

@classmethod
def from_arrays(
    cls,
    paths: np.ndarray[Any, np.dtype[np.dtypes.StringDType]],
    offsets: np.ndarray[Any, np.dtype[np.int32]],
    lengths: np.ndarray[Any, np.dtype[np.int32]],
) -> "ChunkManifest":
    """
    Create manifest directly from numpy arrays containing the path and byte range information.

    Useful if you want to avoid the memory overhead of creating an intermediate dictionary first,
    as these 3 arrays are what will be used internally to store the references.
    """
```
@ayushnag @sharkinsspatial you might want to try building numpy arrays of references and passing them to this new constructor instead to keep memory usage down.
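For example, a minimal sketch of calling the new constructor (the bucket/file names and the 2x2 chunk grid are made up; assumes `ChunkManifest` is importable from `virtualizarr.manifests`):

```python
import numpy as np
from virtualizarr.manifests import ChunkManifest  # import path is an assumption

# One entry per chunk; the arrays' common shape defines the chunk grid (2x2 here)
paths = np.array(
    [["s3://bucket/file.nc"] * 2] * 2, dtype=np.dtypes.StringDType()
)
offsets = np.array([[0, 100], [200, 300]], dtype=np.int32)
lengths = np.full((2, 2), 100, dtype=np.int32)

manifest = ChunkManifest.from_arrays(paths=paths, offsets=offsets, lengths=lengths)
```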
Supersedes #39 as a way to close #33, the difference being that this PR uses 3 separate numpy arrays to store the path strings, byte offsets, and byte-range lengths (rather than trying to put them all in one numpy array with a structured dtype). Effectively implements option (2) in #104.
Relies on numpy 2.0 (which is currently only available as a release candidate).
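For context, a sketch of the difference between the two layouts (the shapes and the fixed field width are illustrative):

```python
import numpy as np

# (1) Three separate homogeneous arrays, as in this PR; their shared shape
# defines the chunk grid, and paths use the variable-width string dtype.
paths = np.empty((2, 2), dtype=np.dtypes.StringDType())
offsets = np.empty((2, 2), dtype=np.int32)
lengths = np.empty((2, 2), dtype=np.int32)

# (2) One structured array, the approach tried in #39; the path field would
# typically need a fixed-width unicode dtype, padding every entry out to the
# maximum expected path length.
entries = np.empty(
    (2, 2), dtype=[("path", "U64"), ("offset", np.int32), ("length", np.int32)]
)
```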