I've been doing a bit of fiddling with streaming data - mostly for visualisation (e.g. here) - but it also strikes me (reading some stuff from the earthmover guys) that it might be helpful for these purposes.
E.g. in the Train and run a simplified global weather model notebook, there's this cell:
```python
import os

import xarray as xr

# workdir is defined earlier in the notebook
file_location = workdir + '/mini.nc'
if not os.path.exists(file_location):
    print("Training data not found, downloading around 2.8GB of data")
    era5_lowres = xr.open_zarr('gs://weatherbench2/datasets/era5/1959-2022-6h-64x32_equiangular_conservative.zarr')
    subset = era5_lowres[['10m_u_component_of_wind',
                          '10m_v_component_of_wind',
                          '2m_temperature',
                          'mean_sea_level_pressure',
                          # '
'geopotential',  # Uncomment this to fetch additional data
                          # 'toa_incident_solar_radiation_6hr',  # Uncomment this to fetch additional data
                          # 'temperature'  # Uncomment this to fetch additional data
                          ]]
    # bilevel = subset.sel({'level': [50, 500]})  # Uncomment if fetching additional data
    # bilevel.to_netcdf(file_location)
    subset.to_netcdf(file_location)  # Comment this out if using the bilevel data instead
    print(f"Wrote file to {file_location}")
    assert os.path.exists(file_location)
else:
    print("File already downloaded, skipping ...")
```

I think - I haven't investigated super thoroughly - that in principle the bottleneck for training should be crunching the data on the GPU, not downloading it, so in theory this could be done by streaming that zarr dataset straight into the model for training. The benefit would be that there is then no requirement to save any of this data to disk, so the amount of training data fed into a local model could scale arbitrarily.
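As a rough illustration of what I mean, here's a minimal sketch of streaming batches lazily from a zarr store instead of writing a netCDF file first. The function name, batch size, and the `train_step` call are all placeholders of mine, not anything in the repo, and a real dataloader would want prefetching and shuffling along the lines of the earthmover post:

```python
import numpy as np
import xarray as xr


def stream_batches(store, variables, batch_size=32):
    """Yield numpy batches lazily from a zarr store (or an open Dataset).

    Nothing is written to disk: open_zarr is lazy, and only the time
    slices actually iterated over are fetched and decoded.
    """
    ds = store if isinstance(store, xr.Dataset) else xr.open_zarr(store)
    ds = ds[variables]
    n = ds.sizes['time']
    for start in range(0, n, batch_size):
        chunk = ds.isel(time=slice(start, start + batch_size))
        # to_array() stacks the selected variables along a new 'variable'
        # dim; .values triggers the actual read for this batch only.
        yield chunk.to_array().transpose('time', ...).values


# Hypothetical usage against the WeatherBench2 store from the cell above
# (train_step stands in for whatever the model's update function is):
#
# for batch in stream_batches(
#         'gs://weatherbench2/datasets/era5/1959-2022-6h-64x32_equiangular_conservative.zarr',
#         ['2m_temperature', 'mean_sea_level_pressure']):
#     train_step(batch)
```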
Context: https://earthmover.io/blog/cloud-native-dataloader
Anyway, I'm not sure whether restructuring this would actually be much use, but I thought I'd raise the possibility - happy to take a look into the viability of it myself, as I'm very much thinking about this kind of data-streaming architecture right now.