# Loading ERA5

This notebook subsets and loads ERA5 data from Google's public analysis-ready, cloud-optimised (ARCO) ERA5 mirror. The mirror stores hourly ERA5 data for the period 1940-01-01 to 2025-12-31 (updated continually, if irregularly) in Zarr format. Beyond format, the sole difference between the data available there and the data available through the Copernicus Climate Data Store (CDS) is variable naming: the mirror uses longnames (e.g. `2m_temperature`) whereas the CDS uses shortnames (e.g. `2t`).

By default, subset data is written to the default blob storage container for the workspace, "workspaceblobstore".

In [None]:
import sys
from datetime import UTC, datetime
from pathlib import Path
from uuid import uuid4

import adlfs
import xarray as xr  # also requires zarr, fsspec, gcsfs, dask
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.core.exceptions import ResourceNotFoundError
from azure.identity import DefaultAzureCredential

# add the directory two levels up to sys.path so absolute local imports resolve
sys.path.insert(0, str(Path.cwd().parent.parent.resolve()))
from setup.common.utils import get_aml_ci_env_vars

Define the GCP ERA5 dataset from which to extract a subset.

N.B. See the [GCP ERA5 ARCO bucket](https://console.cloud.google.com/storage/browser/gcp-public-data-arco-era5) for other datasets including alternatively gridded Zarr and raw source NetCDF files. Not all datasets contain every variable or the same time range / frequency.

In [38]:
GCP_ERA5_PATH = "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3"

Define parameters dictating timestamps to load, namely start and end date and timestep (frequency).

In [11]:
START_DATE = datetime(2025, 1, 1, 0, tzinfo=UTC)
END_DATE = datetime(2025, 1, 1, 12, tzinfo=UTC)
FREQUENCY = 6  # slice step over the hourly time index, i.e. hours between timesteps
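
Because `FREQUENCY` is later used as the step of a slice over an hourly time index, it is effectively a number of hours. The following minimal sketch, using only the standard library, lists the timestamps such a selection should keep. The helper `expected_timestamps` is hypothetical and assumes the end date is included, as with label-based slicing.

```python
from datetime import datetime, timedelta, timezone

# Values mirror the parameters above; the helper name is hypothetical.
START_DATE = datetime(2025, 1, 1, 0, tzinfo=timezone.utc)
END_DATE = datetime(2025, 1, 1, 12, tzinfo=timezone.utc)
FREQUENCY = 6  # keep every 6th hourly timestep, i.e. one every 6 hours


def expected_timestamps(start, end, step_hours):
    """List the timestamps an inclusive, hourly-strided selection should keep."""
    stamps = []
    current = start
    while current <= end:  # assumption: the end point is included
        stamps.append(current)
        current += timedelta(hours=step_hours)
    return stamps


timestamps = expected_timestamps(START_DATE, END_DATE, FREQUENCY)
print([t.strftime("%Y-%m-%d %H:%M") for t in timestamps])
# ['2025-01-01 00:00', '2025-01-01 06:00', '2025-01-01 12:00']
```

Comparing this list with the `time` coordinate of the loaded subset is a quick way to confirm the parameters behave as intended.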

Define the surface and pressure level variables and pressure levels to load. Variable longnames are mapped to shortnames purely for convenience (particularly when reading data into Aurora `Batch` objects); the shortnames play no part in loading or subsetting.

To ingest new variables and levels, add the former by longname to the appropriate dictionary and the latter by integer pressure level (hPa) to the given list.

N.B. The two variable mappings are not strict: single-level variables can be added to the pressure level variable mapping without error. The split simply affords separation and readability.

In [40]:
SURF_VAR_MAP = {
    "10m_u_component_of_wind": "10u",
    "10m_v_component_of_wind": "10v",
    "2m_temperature": "2t",
    "2m_dewpoint_temperature": "2d",
    "mean_sea_level_pressure": "msl",
    "skin_temperature": "skt",
    "surface_pressure": "sp",
    "total_column_water": "tcw",
    "land_sea_mask": "lsm",
    "geopotential_at_surface": "z",
    "slope_of_sub_gridscale_orography": "slor",
    "standard_deviation_of_orography": "sdor",
    "soil_temperature_level_1": "stl1",
    "soil_temperature_level_2": "stl2",
    "volumetric_soil_water_layer_1": "swvl1",
    "volumetric_soil_water_layer_2": "swvl2",
}
ATMOS_VAR_MAP = {
    "geopotential": "z",
    "temperature": "t",
    "u_component_of_wind": "u",
    "v_component_of_wind": "v",
    "specific_humidity": "q",
    "vertical_velocity": "w",
}
ATMOS_LEVELS = [1000, 925, 850, 700, 600, 500, 400, 300, 250, 200, 150, 100, 50]
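
Since the shortnames are informational only, nothing enforces their uniqueness: `geopotential_at_surface` and `geopotential` both map to `"z"`. This is harmless while the two dictionaries stay separate, but would be ambiguous if they were ever merged into a single renaming step. The sketch below shows one way to surface such overlaps. It uses abbreviated copies of the mappings, and `shared_shortnames` is a hypothetical helper rather than part of this notebook.

```python
from collections import defaultdict

# Abbreviated copies of the mappings above, kept short for illustration.
SURF_VAR_MAP = {
    "2m_temperature": "2t",
    "geopotential_at_surface": "z",
}
ATMOS_VAR_MAP = {
    "geopotential": "z",
    "temperature": "t",
}


def shared_shortnames(*maps):
    """Return shortnames claimed by more than one longname across the maps."""
    owners = defaultdict(list)
    for mapping in maps:
        for longname, shortname in mapping.items():
            owners[shortname].append(longname)
    return {short: longs for short, longs in owners.items() if len(longs) > 1}


print(shared_shortnames(SURF_VAR_MAP, ATMOS_VAR_MAP))
# {'z': ['geopotential_at_surface', 'geopotential']}
```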

Lazy load and subset data by variables, levels, time range, and timestep.

N.B. This will take at least 1 minute regardless of subset size because all remote metadata must be loaded first which, for an archive of this volume, comprises several GB.

In [41]:
ds = xr.open_zarr(GCP_ERA5_PATH, chunks={})
var_subset_ds = ds[list(SURF_VAR_MAP.keys()) + list(ATMOS_VAR_MAP.keys())]
subset_ds = var_subset_ds.sel(
    time=slice(START_DATE, END_DATE, FREQUENCY),
    # select levels only if present, so surface-only requests do not error
    **({"level": ATMOS_LEVELS} if "level" in var_subset_ds.coords else {}),
)
# update metadata attributes to reflect the subset, not parent, data
subset_ds.attrs.update(
    valid_time_start=START_DATE.isoformat(),
    valid_time_stop=END_DATE.isoformat(),
)
subset_ds
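
Before writing a subset, it can help to estimate its uncompressed size. The sketch below is a back-of-the-envelope calculation under stated assumptions: a 0.25 degree global grid of 721 x 1440 points, float32 values, and a time dimension on every variable. None of these are read from the dataset, and `estimate_subset_bytes` is a hypothetical helper.

```python
# Assumptions (not taken from the dataset itself): a 0.25 degree global grid
# of 721 x 1440 points, float32 values, and a time dimension on every variable.
N_LAT, N_LON = 721, 1440
BYTES_PER_VALUE = 4


def estimate_subset_bytes(n_times, n_surf_vars, n_atmos_vars, n_levels):
    """Rough uncompressed size of a subset, in bytes."""
    fields_per_time = n_surf_vars + n_atmos_vars * n_levels
    return n_times * fields_per_time * N_LAT * N_LON * BYTES_PER_VALUE


# Defaults in this notebook: 3 timesteps, 16 surface variables,
# 6 pressure level variables on 13 levels.
size = estimate_subset_bytes(n_times=3, n_surf_vars=16, n_atmos_vars=6, n_levels=13)
print(f"{size / 1e9:.2f} GB")
# 1.17 GB
```

On the lazily loaded dataset itself, `subset_ds.nbytes` reports the corresponding figure without reading any data, so the estimate mainly serves for planning a request before opening the archive.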

Obtain and create necessary environment parameters and Azure interface objects.

In [None]:
az_cred = DefaultAzureCredential()
sub_id, rg_name, ws_name = get_aml_ci_env_vars()
ml_client = MLClient(
    credential=az_cred,
    subscription_id=sub_id,
    resource_group_name=rg_name,
    workspace_name=ws_name,
)

Define location and write subset data using the default workspace blob storage container ("workspaceblobstore"), the corresponding storage account, and a UUID v4 store name to avoid inadvertent naming collisions.

N.B. The filesystem object and mapper can be avoided by passing an "abfs://" protocol path and the `storage_options` parameter to `.to_zarr()`, though doing so can trigger bugs arising from the event loops that `xarray` / `zarr` and `fsspec` / `adlfs` each create and manage. For example:
```python
subset_ds.to_zarr(
    f"abfs://{dst_datastore.container_name}/{uuid4()}.zarr",
    mode="w",
    compute=True,
    consolidated=True,
    zarr_format=2,
    storage_options={
        "credential": DefaultAzureCredential(),
        "account_name": dst_datastore.account_name,
    },
)
```

In [None]:
dst_datastore = ml_client.datastores.get("workspaceblobstore")
path = f"aurora-workshop/input/{uuid4()}.zarr"
store = adlfs.AzureBlobFileSystem(dst_datastore.account_name, credential=az_cred)
mapper = store.get_mapper(f"{dst_datastore.container_name}/{path}")
subset_ds.to_zarr(
    mapper,
    mode="w",
    compute=True,
    consolidated=True,
    zarr_format=2,
)
print(
    f"Output to: account={dst_datastore.account_name}, "
    f"container={dst_datastore.container_name}, store={path}",
)

Output to: account=datasciencetea5561064689, container=azureml-blobstore-f8f32545-a0e0-4129-9345-c12bf82b4c73, store=aurora-workshop/input/4d69cea7-4ff3-417c-97b3-d70cc7587005.zarr


Confirm persisted data is available and valid.

N.B. For certainty, an equality check against the original `subset_ds` (e.g. `new_ds.equals(subset_ds)`) can be used, but it requires loading data into memory, which may take time and, depending on subset size, result in an out-of-memory (OOM) error.

In [None]:
xr.open_dataset(mapper, engine="zarr", chunks={})

KeysView(<fsspec.mapping.FSMap object at 0x7154e85e0af0>)

Define and create / update the Azure Machine Learning data asset entity for persisted data.

In [None]:
asset_name = "workshop-test-asset"
asset_description = "Zarr subset of ERA5 data from the GCP ERA5 ARCO dataset."
try:
    new = int(next(ml_client.data.list(name=asset_name)).version) + 1
except ResourceNotFoundError:
    new = 1

data_asset = Data(
    name=asset_name,
    version=str(new),
    description=asset_description,
    path=f"azureml://subscriptions/{sub_id}/resourcegroups/{rg_name}/workspaces/{ws_name}/datastores/{dst_datastore.name}/paths/{path}",
)
ml_client.data.create_or_update(data_asset)
action = "Created" if new == 1 else "Updated"
print(
    f"{action} asset: name={data_asset.name}, version={data_asset.version}, "
    f"path={data_asset.path}",
)

Updated asset: name=workshop-test-asset, version=5, path=azureml://subscriptions/62118f5c-be37-400f-9f20-a8b77a2a7877/resourcegroups/data-science-team-rg/workspaces/data-science-team-workspace/datastores/workspaceblobstore/paths/aurora-workshop/input/4d69cea7-4ff3-417c-97b3-d70cc7587005.zarr
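
The cell above derives the next version from the first item returned by `ml_client.data.list(name=asset_name)`. If the ordering of that listing cannot be relied upon (an assumption, not something verified here), the next version can instead be computed from all existing versions. The sketch below shows that logic in isolation; `next_version` is a hypothetical helper that would be fed something like `[d.version for d in ml_client.data.list(name=asset_name)]`.

```python
def next_version(existing_versions):
    """Return the next integer version as a string, ignoring non-integer labels."""
    numeric = [int(v) for v in existing_versions if str(v).isdigit()]
    return str(max(numeric) + 1) if numeric else "1"


print(next_version(["3", "1", "4"]))  # 5
print(next_version([]))  # 1
print(next_version(["latest-test", "2"]))  # 3
```

Skipping non-integer labels is a design choice for this sketch: asset versions are strings, so an asset versioned by hand with arbitrary labels would otherwise break the integer increment.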
