Description
What happened?
This is basically a failure to round-trip through zarr.
When a dataset is saved, loaded, and then re-saved through zarr, the contents of string-type coordinates or variables are deleted.
I have been banging my head against this for about a month, because certain "get" functions in xarray (e.g. to_numpy, as_numpy) seem to have side effects.
(marmovenv) riveale@rvzen13s:~/richard_home/git/marmostim/pymarmostim/tests> pip freeze | grep xarray
xarray==2025.6.1
(marmovenv) riveale@rvzen13s:~/richard_home/git/marmostim/pymarmostim/tests> pip freeze | grep numpy
numpy==2.2.6
(marmovenv) riveale@rvzen13s:~/richard_home/git/marmostim/pymarmostim/tests> pip freeze | grep zarr
zarr==2.18.3
Here is a minimal example:
import xarray as xr
mode = 'w'
ds1 = xr.Dataset(
    data_vars=dict(
        col1=('mydim', [1, 2, 3]),
        col2=('mydim', ['aa', 'be', 'cefe']),
    )
)
fname = 'testds.zarr'
ds1.to_zarr(fname, mode=mode)
ds1 = xr.open_dataset(fname)
# print(ds1['col2'].to_numpy())  # REV: if we call to_numpy() before saving the second time, for some reason it is saved properly...
ds1.to_zarr(fname, mode=mode)
ds1 = xr.open_dataset(fname)
print(ds1['col2'].to_numpy())  # REV: the contents are deleted.
Output is:
(marmovenv) riveale@rvzen13s:~/richard_home/git/marmostim/pymarmostim/tests> python zarr_roundtrip.py
['' '' '']
Note that a simple modification (reading the data with to_numpy or as_numpy between the first load and the second save) makes the problem disappear:
import xarray as xr
mode = 'w'
ds1 = xr.Dataset(
    data_vars=dict(
        col1=('mydim', [1, 2, 3]),
        col2=('mydim', ['aa', 'be', 'cefe']),
    )
)
fname = 'testds.zarr'
ds1.to_zarr(fname, mode=mode)
ds1 = xr.open_dataset(fname)
print(ds1['col2'].to_numpy())  # REV: if we call to_numpy() before saving the second time, for some reason it is saved properly...
ds1.to_zarr(fname, mode=mode)
ds1 = xr.open_dataset(fname)
print(ds1['col2'].to_numpy())  # WORKS FINE
Output:
(marmovenv) riveale@rvzen13s:~/richard_home/git/marmostim/pymarmostim/tests> python zarr_roundtrip.py
['aa' 'be' 'cefe']
['aa' 'be' 'cefe']
Numeric columns seem unaffected, although I have observed situations where they too disappear or are filled with random memory garbage:
import xarray as xr
mode = 'w'
ds1 = xr.Dataset(
    data_vars=dict(
        col1=('mydim', [1, 2, 3]),
        col2=('mydim', ['aa', 'be', 'cefe']),
    )
)
fname = 'testds.zarr'
ds1.to_zarr(fname, mode=mode)
ds1 = xr.open_dataset(fname)
# print(ds1['col2'].to_numpy())  # REV: if we call to_numpy() before saving the second time, for some reason it is saved properly...
ds1.to_zarr(fname, mode=mode)
ds1 = xr.open_dataset(fname)
print(ds1['col1'].to_numpy())  # WORKS FINE
Output:
(marmovenv) riveale@rvzen13s:~/richard_home/git/marmostim/pymarmostim/tests> python zarr_roundtrip.py
[1 2 3]
For some reason this only happens when the dimension names differ from the array names. If I do not set the dimensions of the contents, or give each variable its own independent dimension with the same name as the variable, then everything works fine (of course this is only relevant for 1D array variables):
import xarray as xr
mode = 'w'
ds1 = xr.Dataset(
    data_vars=dict(
        col1=('col1', [1, 2, 3]),
        col2=('col2', ['aa', 'be', 'cefe']),
    )
)
fname = 'testds.zarr'
ds1.to_zarr(fname, mode=mode)
ds1 = xr.open_dataset(fname)
# print(ds1['col2'].to_numpy())  # REV: if we call to_numpy() on the DataArray before saving the second time, for some reason it is saved properly...
ds1.to_zarr(fname, mode=mode)
ds1 = xr.open_dataset(fname)
print(ds1['col2'].to_numpy())  # WORKS FINE
Output:
(marmovenv) riveale@rvzen13s:~/richard_home/git/marmostim/pymarmostim/tests> python zarr_roundtrip.py
['aa' 'be' 'cefe']
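A possible reason for the difference, offered as an assumption and not verified against the zarr backend: a 1D variable whose name matches its dimension is treated by xarray as a dimension coordinate backed by an in-memory index, so its values may already be in memory by the time the store is overwritten. The sketch below only shows the structural difference between the two layouts; it does not touch zarr at all, and the variable names shared_dim and own_dims are illustrative.

```python
import xarray as xr

# Layout from the failing example: both variables share the dimension "mydim".
shared_dim = xr.Dataset(
    data_vars=dict(
        col1=("mydim", [1, 2, 3]),
        col2=("mydim", ["aa", "be", "cefe"]),
    )
)

# Layout from the working example: each variable is named after its own dimension.
own_dims = xr.Dataset(
    data_vars=dict(
        col1=("col1", [1, 2, 3]),
        col2=("col2", ["aa", "be", "cefe"]),
    )
)

# Names of the variables that are backed by an index in each layout.
print(sorted(shared_dim.indexes))
print(sorted(own_dims.indexes))
```

If this is the cause, the two layouts differ in more than naming: in the second one col1 and col2 are coordinates, not plain data variables.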
Note that everything else in the dataset is correct; only the data is deleted (replaced with empty strings):
import xarray as xr
mode = 'w'
ds1 = xr.Dataset(
    data_vars=dict(
        col1=('mydim', [1, 2, 3]),
        col2=('mydim', ['aa', 'be', 'cefe']),
    )
)
print(ds1)
fname = 'testds.zarr'
ds1.to_zarr(fname, mode=mode)
ds1 = xr.open_dataset(fname)
# print(ds1['col2'].to_numpy())  # REV: if we call to_numpy() before saving the second time, for some reason it is saved properly...
ds1.to_zarr(fname, mode=mode)
ds1 = xr.open_dataset(fname)
print(ds1['col2'].to_numpy())  # REV: the contents are deleted.
print(ds1)
Output:
(marmovenv) riveale@rvzen13s:~/richard_home/git/marmostim/pymarmostim/tests> python zarr_roundtrip.py
<xarray.Dataset> Size: 72B
Dimensions: (mydim: 3)
Dimensions without coordinates: mydim
Data variables:
col1 (mydim) int64 24B 1 2 3
col2 (mydim) <U4 48B 'aa' 'be' 'cefe'
['' '' '']
<xarray.Dataset> Size: 72B
Dimensions: (mydim: 3)
Dimensions without coordinates: mydim
Data variables:
col1 (mydim) int64 24B ...
col2 (mydim) <U4 48B '' '' ''
What did you expect to happen?
The data contents should not be deleted (currently all strings become empty strings ''), i.e. the dataset should round-trip correctly through zarr.
Minimal Complete Verifiable Example
import xarray as xr
mode = 'w'
ds1 = xr.Dataset(
    data_vars=dict(
        col1=('mydim', [1, 2, 3]),
        col2=('mydim', ['aa', 'be', 'cefe']),
    )
)
fname = 'testds.zarr'
ds1.to_zarr(fname, mode=mode)
ds1 = xr.open_dataset(fname)
# print(ds1['col2'].to_numpy())  # REV: if we call to_numpy() before saving the second time, for some reason it is saved properly...
ds1.to_zarr(fname, mode=mode)
ds1 = xr.open_dataset(fname)
print(ds1['col2'].to_numpy())  # REV: the contents are deleted.
MVCE confirmation
- Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- Complete example — the example is self-contained, including all data and the text of any traceback.
- Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- New issue — a search of GitHub Issues suggests this is not a duplicate.
- Recent environment — the issue occurs with the latest version of xarray and its dependencies.
Relevant log output
Anything else we need to know?
No response