
Streaming zarr data in tutorials? #244

@charles-turner-1

Description

I've been doing a bit of fiddling with streaming data, mostly for visualisation (e.g. here), but reading some posts from the Earthmover folks, it strikes me it might be helpful for these purposes too.

E.g. in the "Train and run a simplified global weather model" notebook, there's this cell:

import os

import xarray as xr

file_location = workdir + '/mini.nc'

if not os.path.exists(file_location):
    print("Training data not found, downloading around 2.8GB of data")
    era5_lowres = xr.open_zarr('gs://weatherbench2/datasets/era5/1959-2022-6h-64x32_equiangular_conservative.zarr')
    subset = era5_lowres[['10m_u_component_of_wind',
                          '10m_v_component_of_wind',
                          '2m_temperature',
                          'mean_sea_level_pressure',
                          #'geopotential',  # Uncomment this to fetch additional data
                          #'toa_incident_solar_radiation_6hr', # Uncomment this to fetch additional data
                          #'temperature' # Uncomment this to fetch additional data
                         ]]

    # bilevel = subset.sel({'level': [50, 500]})  # Uncomment if fetching additional data
    # bilevel.to_netcdf(file_location)

    subset.to_netcdf(file_location)  # Comment this out if using the bilevel data instead
    print(f"Wrote file to {file_location}")
    assert os.path.exists(file_location)
else:
    print("File already downloaded, skipping ...")

I think (I haven't investigated super thoroughly) that in principle the bottleneck for training should be crunching the data on the GPU, not downloading it, so in theory this could be done by streaming that zarr dataset straight into the model for training. The benefit would be that none of this data needs to be saved to disk, so the amount of training data fed into a locally trained model could scale arbitrarily.

Context: https://earthmover.io/blog/cloud-native-dataloader

Anyway, I'm not sure whether restructuring this would actually be much use, but I thought I'd raise the possibility. Happy to look into its viability myself, as I'm thinking about this kind of data-streaming architecture a lot right now.
