Trying to write combined virtual dataset (for MUR SST) results in TypeError: Can only serialize wrapped arrays... #60

Closed
abarciauskas-bgse opened this issue Mar 27, 2024 · 4 comments

Comments


abarciauskas-bgse commented Mar 27, 2024

Testing with a "real world dataset" (s3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1) mostly worked, with a few changes required, which are present in https://github.com/TomNicholas/VirtualiZarr/tree/ab/testing-mursst. Specifically:

  • We need a way to read from S3 - perhaps this should be a separate issue. I wrote a workaround (a sketch of the general approach follows this list), but we probably need a more thought-out solution (EDIT: Generating references from files in S3 (using kerchunk + fsspec) #61)
  • I got pydantic errors for the filters property on ZArray and Codecs, which was returned from this dataset as a list of dictionaries, not a string (a list of dicts appears to conform to the zarr v2 storage spec, but I'm not sure whether something changed in v3 or whether filters are expected to be encoded as a string). I changed the type to use List[Dict].
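
For context, here is a minimal sketch of the general shape of such a workaround, using kerchunk + fsspec as #61 proposes. This is an illustration, not the code on the actual branch, and the anon=False credential handling is an assumption:

import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

url = 's3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20210101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc'
# open the remote HDF5 file and generate kerchunk references from the open file object
with fsspec.open(url, mode='rb', anon=False) as f:  # S3 credentials come from the environment
    refs = SingleHdf5ToZarr(f, url).translate()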

With those changes in place, I was able to create the virtual zarr datasets, but when trying to write the combined reference to JSON, I got this error: *** TypeError: Can only serialize wrapped arrays of type ManifestArray, but got type <class 'numpy.ndarray'>, which I haven't been able to figure out yet.

Here is my code to replicate:

from virtualizarr import open_virtual_dataset
import xarray as xr
# first get + set credentials from https://archive.podaac.earthdata.nasa.gov/s3credentials
vds1 = open_virtual_dataset(
    's3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20210101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
    # we have to put in the filetype to avoid trying to open the dataset with NetCDF4 
    filetype='netcdf4'
)
vds2 = open_virtual_dataset(
    's3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20210102090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
    filetype='netcdf4'
)
combined_vds = xr.concat([vds1, vds2], dim='time', coords='minimal', compat='override')
combined_vds['analysed_sst'].data.manifest.dict() # this works

# combined_vds.virtualize.to_kerchunk('combined.json', format='json')
# results in
# *** TypeError: Can only serialize wrapped arrays of type ManifestArray, but got type <class 'numpy.ndarray'>
@abarciauskas-bgse abarciauskas-bgse changed the title Trying to write combined virtual dataset results in *** TypeError: Can only serialize wrapped arrays of type ManifestArray, but got type <class 'numpy.ndarray'> Trying to write combined virtual dataset (for MUR SST) results in TypeError: Can only serialize wrapped arrays... Mar 27, 2024

TomNicholas commented Mar 27, 2024

Thanks @abarciauskas-bgse !

> We need a way to read from S3

This I hadn't thought about yet. Thoughts and PRs welcome (and it deserves a separate issue: #61).

> I got pydantic errors for the filters property on ZArray and Codecs

Thanks for reporting that. Do I reproduce just by calling .filters?

> not sure if something changed in v3

The current code is supposed to work with v2, but there will be differences to smooth over (xref #17). Ideally I would be able to import classes directly from zarr-python to handle all of that.

> TypeError: Can only serialize wrapped arrays of type ManifestArray, but got type <class 'numpy.ndarray'>

That's a known error, which @norlandrhagen also reported. It will require another upstream adjustment to xarray to fix. In the meantime you should be able to avoid it by not creating indexes (i.e. pass indexes={} to open_virtual_dataset).

abarciauskas-bgse commented

Apologies, this is one issue which can probably now be separated into three issues, two of which are open.

S3

@TomNicholas Thanks for opening #61. I may take a closer look at how we could incorporate reading from S3 tomorrow.

Filters typing

I am getting pydantic errors when using open_virtual_dataset for this dataset. I changed the filters property's type to Optional[List[Dict]]; without that change, the traceback looks like this:

Traceback (most recent call last):
  File "/Users/aimeebarciauskas/github/developmentseed/pangeo-forge-aws-batch/docker-images/02_generate_kerchunk/testing-virtualizarr.py", line 4, in <module>
    vds1 = open_virtual_dataset(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/aimeebarciauskas/github/developmentseed/pangeo-forge-aws-batch/docker-images/02_generate_kerchunk/VirtualiZarr/virtualizarr/xarray.py", line 74, in open_virtual_dataset
    vds = dataset_from_kerchunk_refs(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/aimeebarciauskas/github/developmentseed/pangeo-forge-aws-batch/docker-images/02_generate_kerchunk/VirtualiZarr/virtualizarr/xarray.py", line 111, in dataset_from_kerchunk_refs
    vars[var_name] = variable_from_kerchunk_refs(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/aimeebarciauskas/github/developmentseed/pangeo-forge-aws-batch/docker-images/02_generate_kerchunk/VirtualiZarr/virtualizarr/xarray.py", line 135, in variable_from_kerchunk_refs
    chunk_dict, zarray, zattrs = kerchunk.parse_array_refs(arr_refs)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/aimeebarciauskas/github/developmentseed/pangeo-forge-aws-batch/docker-images/02_generate_kerchunk/VirtualiZarr/virtualizarr/kerchunk.py", line 144, in parse_array_refs
    zarray = ZArray.from_kerchunk_refs(arr_refs.pop(".zarray"))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/aimeebarciauskas/github/developmentseed/pangeo-forge-aws-batch/docker-images/02_generate_kerchunk/VirtualiZarr/virtualizarr/zarr.py", line 74, in from_kerchunk_refs
    return ZArray(
           ^^^^^^^
  File "/Users/aimeebarciauskas/github/developmentseed/eoapi/infrastructure/aws/.venv/lib/python3.11/site-packages/pydantic/main.py", line 150, in __init__
    __pydantic_self__.__pydantic_validator__.validate_python(data, self_instance=__pydantic_self__)
pydantic_core._pydantic_core.ValidationError: 1 validation error for ZArray
filters
  Input should be a valid string [type=string_type, input_value=[{'elementsize': 2, 'id':...d': 'zlib', 'level': 7}], input_type=list]
    For further information visit https://errors.pydantic.dev/2.1.2/v/string_type
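
For reference, the type change described above amounts to roughly the following. This is a minimal sketch assuming ZArray is a pydantic BaseModel, as the traceback shows; the model's other fields are elided and the default value is an assumption:

from typing import Dict, List, Optional
from pydantic import BaseModel

class ZArray(BaseModel):
    # ... other fields elided ...
    # zarr v2 stores filters as a list of codec config dicts, e.g.
    # [{'elementsize': 2, 'id': 'shuffle'}, {'id': 'zlib', 'level': 7}],
    # so a plain str annotation fails validation for this dataset
    filters: Optional[List[Dict]] = None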

Would it help if I created a separate issue for this error with a minimally reproducible example (via an artificially generated dataset, perhaps)?

TypeError: Can only serialize wrapped arrays of type ManifestArray, but got type <class 'numpy.ndarray'>

FWIW I get this error even when passing indexes={} to open_virtual_dataset, but I don't have more information about why yet.

TomNicholas commented

> I may take a closer look at how we could incorporate reading from S3 tomorrow.

That would be awesome, especially as solving that issue seems quite separate from all the guts of the rest of the package.


> Would it help if I created a separate issue for this error with a minimally reproducible example (via an artificially generated dataset, perhaps)?

That would certainly be the most correct way to move forward! But if you think the fix is just a simple type-hint change, then I'm happy to just accept a PR for that.


> FWIW I get this error even when passing indexes={} to open_virtual_dataset, but I don't have more information about why yet.

That's weird. Are you sure you're using both the most recent version of this package (i.e. main, because I haven't released it yet) and also the forked branch of xarray (see #14 (comment))?

You will get this error when your virtual dataset contains any arrays that are not ManifestArrays. In your case it will be because the coordinate arrays are somehow being accidentally coerced to np.ndarray inside xr.concat. We could actually imagine writing these out to disk anyway, see #62.
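
One way to check which variables tripped the serializer is a quick diagnostic like this sketch; the ManifestArray import path here is an assumption, so adjust it to wherever the class lives in your installed version:

from virtualizarr.manifests import ManifestArray  # import path is an assumption

# report the backing array type of each variable so that anything which was
# coerced to np.ndarray (e.g. index coordinates) stands out
for name, var in combined_vds.variables.items():
    status = 'ok' if isinstance(var.data, ManifestArray) else 'will fail to serialize'
    print(f'{name}: {type(var.data).__name__} ({status})')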

abarciauskas-bgse commented

@TomNicholas you were right, I had not correctly installed the forked branch of xarray in my testing. For future reference:

pip install xarray@git+https://github.com/TomNicholas/xarray@concat-no-indexes

Once I had verified the forked version was installed, I ran the example again and the following completes without error:

from virtualizarr import open_virtual_dataset
import xarray as xr

# first get + set credentials from https://archive.podaac.earthdata.nasa.gov/s3credentials
vds1 = open_virtual_dataset(
    's3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20210101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
    # we have to put in the filetype to avoid trying to open the dataset with NetCDF4 
    filetype='netcdf4',
    indexes={}
)
vds2 = open_virtual_dataset(
    's3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20210102090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
    filetype='netcdf4',
    indexes={}
)
combined_vds = xr.concat([vds1, vds2], dim='time', coords='minimal', compat='override')
combined_vds['analysed_sst'].data.manifest.dict() # this works

combined_vds.virtualize.to_kerchunk('combined.json', format='json')

This issue is now covered by #61 and #65, so closing.
