
Streaming zarr data in tutorials? #244

@charles-turner-1

Description

I've been doing a bit of fiddling with streaming data, mostly for visualisation (e.g. here), but reading some posts from the Earthmover folks, it strikes me it might be helpful for these purposes too.

E.g. in the "Train and run a simplified global weather model" notebook, there's this cell:

import os

import xarray as xr

file_location = workdir + '/mini.nc'

if not os.path.exists(file_location):
    print("Training data not found, downloading around 2.8GB of data")
    era5_lowres = xr.open_zarr('gs://weatherbench2/datasets/era5/1959-2022-6h-64x32_equiangular_conservative.zarr')
    subset = era5_lowres[['10m_u_component_of_wind',
                          '10m_v_component_of_wind',
                          '2m_temperature',
                          'mean_sea_level_pressure',
                          #'geopotential',  # Uncomment this to fetch additional data
                          #'toa_incident_solar_radiation_6hr', # Uncomment this to fetch additional data
                          #'temperature' # Uncomment this to fetch additional data
                         ]]

    # bilevel = subset.sel({'level': [50, 500]})  # Uncomment if fetching additional data
    # bilevel.to_netcdf(file_location)

    subset.to_netcdf(file_location)  # Comment this out if using the bilevel data instead
    print(f"Wrote file to {file_location}")
    assert os.path.exists(file_location)
else:
    print("File already downloaded, skipping ...")

I think (I haven't investigated super thoroughly) that in principle the bottleneck for training should be crunching the data on the GPU, not downloading it, so in theory this could be done by streaming that zarr dataset straight into the model for training. The benefit would be that none of this data needs to be saved to disk, so the amount of training data fed into a locally trained model could scale arbitrarily.

Context: https://earthmover.io/blog/cloud-native-dataloader

Anyway, I'm not sure whether restructuring this would actually be much use, but I thought I'd raise the possibility. Happy to look into its viability myself, as I'm thinking about this kind of data-streaming architecture a lot right now.
