# Comparing Data Sources

This section compares three different methods for retrieving ERA5 data in the WeatherBench 2 format. The goal is to identify the most suitable approach for real-time weather prediction using Google DeepMind’s GenCast model at 1.0° resolution.

In [2]:
!conda install -c conda-forge xarray zarr gcsfs -y
!pip install cdsapi numpy pandas


Retrieving notices: done
Channels:
 - conda-forge
 - nvidia
 - pytorch
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done


    current version: 25.3.0
    latest version: 25.5.1

Please update conda by running

    $ conda update -n base -c conda-forge conda



## Package Plan ##

  environment location: /home/ec2-user/anaconda3/envs/python3

  added / updated specs:
    - gcsfs
    - xarray
    - zarr


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    asciitree-0.3.3            |             py_2           6 KB  conda-forge
    ca-certificates-2025.8.3   |       hbd8a1cb_0         151 KB  conda-forge
    cachetools-5.5.2           |     pyhd8ed1ab_0          15 KB  conda-forge
    certifi-2025.8.3           |     pyhd8ed1ab_0         155 KB  conda-forge
    fasteners-0.19             |     pyhd8ed1ab_1          20 KB  conda-forge
    gcsfs-202

In [2]:
import glob
import os
import zipfile
from datetime import datetime, timedelta

import cdsapi
import gcsfs
import numpy as np
import pandas as pd
import xarray
import xarray as xr


# Goal

The objective is to replicate the exact data structure that DeepMind’s GenCast model uses for its publicly released test case of March 29, 2019.

Since DeepMind provides usable test data only for this specific date, our goal is to reproduce that format using alternative data access methods, which makes it possible to run GenCast in real time or on other historical dates.
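
As a rough illustration of that target layout, the sketch below converts a tiny synthetic dataset with absolute timestamps into the shape used throughout this notebook: a leading `batch` dimension, a `time` coordinate expressed as offsets from the first step, and a `datetime` coordinate of shape `(batch, time)`. The helper name `to_relative_time_layout` and the synthetic values are illustrative assumptions, not part of DeepMind's code.

```python
import numpy as np
import pandas as pd
import xarray as xr


def to_relative_time_layout(ds: xr.Dataset) -> xr.Dataset:
    """Hypothetical helper: add a batch dim and make `time` relative to step 0."""
    times = pd.DatetimeIndex(ds.time.values)
    ds = ds.expand_dims("batch")
    return ds.assign_coords(
        # Offsets from the first time step, e.g. 0 h, 12 h, 24 h
        time=(times - times[0]).to_numpy().astype("timedelta64[ns]"),
        # Absolute timestamps, kept as a 2-D (batch, time) coordinate
        datetime=(
            ("batch", "time"),
            times.to_numpy().astype("datetime64[ns]").reshape(1, -1),
        ),
    )


# Synthetic input: 3 time steps on a 2 x 4 grid (values are placeholders)
times = pd.date_range("2019-03-29", periods=3, freq="12h")
raw = xr.Dataset(
    {"2m_temperature": (("time", "lat", "lon"), np.zeros((3, 2, 4), dtype="float32"))},
    coords={"time": times, "lat": [0.0, 1.0], "lon": [0.0, 1.0, 2.0, 3.0]},
)

example = to_relative_time_layout(raw)
print(example["2m_temperature"].dims)
print(example.time.values)
```

The three retrieval methods below all end with this same coordinate conversion, so that their outputs can be compared directly with DeepMind's file.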

# DeepMind’s provided data

In [None]:


# @title Load weather data
DATA_PATH = "./source-era5_date-2019-03-29_res-1.0_levels-13_steps-12.nc"  # E.g. "source-era5_date-2019-03-29_res-1.0_levels-13_steps-04.nc"
with open(DATA_PATH, "rb") as f:
    example_batch = xarray.load_dataset(f).compute()
example_batch

# Method one: WeatherBench 2 data



This method uses the WeatherBench 2 ERA5 datasets, which cover 1959 to 2023-01-10.

https://weatherbench2.readthedocs.io/en/latest/data-guide.html#era5
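
Because this dataset stops in January 2023, it can help to check a request against the coverage window before opening the Zarr store. The sketch below assumes the bounds implied by the dataset name (1959-01-01 to 2023-01-10) and the 14-step, 12-hourly request used in this notebook. The function name and the exact bounds are assumptions.

```python
import pandas as pd

# Assumed coverage bounds, read off the dataset name quoted above
COVERAGE_START = pd.Timestamp("1959-01-01")
COVERAGE_END = pd.Timestamp("2023-01-10")


def requested_times_in_coverage(start: str, periods: int = 14, freq: str = "12h") -> bool:
    """Return True if every requested time step lies inside the assumed coverage."""
    times = pd.date_range(start=start, periods=periods, freq=freq)
    return bool(times[0] >= COVERAGE_START and times[-1] <= COVERAGE_END)


print(requested_times_in_coverage("2019-03-29"))  # the GenCast test date
print(requested_times_in_coverage("2023-01-08"))  # last steps fall past the end
```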

In [3]:



def extract_era5_data(date):
    # Path to the Zarr dataset in Google Cloud Storage
    zarr_path = 'gs://weatherbench2/datasets/era5/1959-2023_01_10-wb13-6h-1440x721_with_derived_variables.zarr'
    
    # Open the Zarr dataset using xarray and gcsfs
    fs = gcsfs.GCSFileSystem()
    ds = xr.open_zarr(fs.get_mapper(zarr_path), consolidated=True)
    
    # Select 14 time steps: 00:00 and 12:00 UTC for 7 days
    times = pd.date_range(start=date, periods=14, freq='12h')
    ds = ds.sel(time=times)
    
    
    # Define variables and levels
    variables = [
        'land_sea_mask', 'geopotential_at_surface',
        '2m_temperature', 'sea_surface_temperature', 'mean_sea_level_pressure', '10m_v_component_of_wind',
        'total_precipitation_12hr', '10m_u_component_of_wind', 'u_component_of_wind', 'specific_humidity',
        'temperature', 'vertical_velocity', 'v_component_of_wind', 'geopotential', 
    ]
    levels = [50, 100, 150, 200, 250, 300, 400, 500, 600, 700, 850, 925, 1000]
    ds = ds[variables].sel(level=levels)
    
    # Rename dimensions before assigning new coordinates
    ds = ds.rename({'latitude': 'lat', 'longitude': 'lon'})

    # Expand dataset with a new 'batch' dimension first
    ds = ds.expand_dims('batch')
    
    # Create new coordinates
    datetime_coord = np.array(times, dtype='datetime64[ns]').reshape(1, -1)
    time_coord = (times - times[0]).values.astype('timedelta64[ns]')
    
    # Assign the relative time axis and the absolute datetime coordinate
    ds = ds.assign_coords({
        'time': time_coord,
        'datetime': (('batch', 'time'), datetime_coord)
    })
    ds = ds.sortby('lat')
    return ds


date = '2019-03-29'
weatherbench2_ds_deg_0x25 = extract_era5_data(date).compute()

# Convert from 0.25 to 1.0 deg by keeping every 4th grid point
weatherbench2_ds_deg_1x0 = weatherbench2_ds_deg_0x25.isel(lat=slice(None, None, 4), lon=slice(None, None, 4))
weatherbench2_ds_deg_1x0



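
The stride-based conversion used above keeps every fourth grid point rather than averaging neighbouring cells. The sketch below checks, on a synthetic global grid (an assumption: 721 latitudes from -90 to 90 and 1440 longitudes from 0 to 359.75), that this yields a regular 1.0° grid.

```python
import numpy as np
import xarray as xr

# Synthetic 0.25-degree global grid (coordinates only matter here)
lat = np.arange(-90, 90.25, 0.25)   # 721 points
lon = np.arange(0, 360, 0.25)       # 1440 points

fine = xr.Dataset(
    {"field": (("lat", "lon"), np.zeros((lat.size, lon.size), dtype="float32"))},
    coords={"lat": lat, "lon": lon},
)

# Same operation as in the notebook: keep every 4th point along lat and lon
coarse = fine.isel(lat=slice(None, None, 4), lon=slice(None, None, 4))

print(coarse.sizes["lat"], coarse.sizes["lon"])
print(np.unique(np.diff(coarse.lat.values)), np.unique(np.diff(coarse.lon.values)))
```

Subsampling leaves the retained values untouched, which is why the downsampled sources can match DeepMind's 1.0° file exactly. An area-weighted regridding would produce different numbers.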


# Method two: Google Research ARCO-ERA5 data in a public GCP bucket

This method uses the public GCP bucket that hosts the ERA5 datasets. The stable version of ERA5 is updated on a monthly cadence (on roughly the 9th of each month) with a 3-month delay.

ERA5T, the preliminary version, is produced by ECMWF (European Centre for Medium-Range Weather Forecasts), which takes 5–6 days to process and publish each day's data. The stable ERA5 release waits 3 months so that all corrections and validations are complete.

https://github.com/google-research/arco-era5?tab=readme-ov-file#analysis-ready-data

## Stable version

In [None]:


def extract_arco_era5_input(date):
    # Load the dataset from ARCO ERA5
    ds = xr.open_zarr(
        'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3',
        chunks=None,
        storage_options=dict(token='anon')
    )

    # Restrict to valid time range
    ds = ds.sel(time=slice(ds.attrs['valid_time_start'], ds.attrs['valid_time_stop']))

    # Select 14 time steps: 00:00 and 12:00 UTC for 7 days
    times = pd.date_range(start=date, periods=14, freq='12h')
    ds = ds.sel(time=times)


    # Rename dimensions
    ds = ds.rename({'latitude': 'lat', 'longitude': 'lon'})

    # Select 13 pressure levels
    levels = [50, 100, 150, 200, 250, 300, 400, 500, 600, 700, 850, 925, 1000]
    ds = ds.sel(level=levels)

    # Select specific variables
    variables = [
        'land_sea_mask', 'geopotential_at_surface',
        '2m_temperature', 'sea_surface_temperature', 'mean_sea_level_pressure',
        '10m_v_component_of_wind', '10m_u_component_of_wind',
        'u_component_of_wind', 'specific_humidity',
        'temperature', 'vertical_velocity', 'v_component_of_wind', 'geopotential'
    ]
    ds = ds[variables]

    # Expand dataset with a new 'batch' dimension first
    ds = ds.expand_dims('batch')

    # Create new coordinates
    datetime_coord = np.array(times, dtype='datetime64[ns]').reshape(1, -1)
    time_coord = (times - times[0]).values.astype('timedelta64[ns]')

    # Assign coordinates
    ds = ds.assign_coords({
        'time': time_coord,
        'datetime': (('batch', 'time'), datetime_coord)
    })

    # Convert all variables to NumPy arrays
    ds = ds.compute()

    return ds

# Example usage
date = '2019-03-29'
gcp_ds_deg_0x25 = extract_arco_era5_input(date)  # already computed inside the function

# Convert from 0.25 to 1.0 deg by keeping every 4th grid point
gcp_ds_deg_1x0 = gcp_ds_deg_0x25.isel(lat=slice(None, None, 4), lon=slice(None, None, 4))
gcp_ds_deg_1x0


## Preliminary version

The preliminary version of ERA5, known as ERA5T, is available with a delay of approximately 1 week, of which 5-6 days are due to processing at ECMWF.
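
The two delays suggest a simple rule of thumb for deciding which flavour of the data a given date can come from. The sketch below is only an approximation: it treats the 3-month delay as 90 days and the 1-week delay as 7 days, and the function name is hypothetical. The authoritative bounds are the `valid_time_stop` and `valid_time_stop_era5t` attributes stored on the dataset.

```python
from datetime import date, timedelta

# Assumed delays, taken from the figures quoted in this section
STABLE_DELAY = timedelta(days=90)      # "3 month delay", approximated as 90 days
PRELIMINARY_DELAY = timedelta(days=7)  # "approximately 1 week delay"


def pick_era5_flavour(target: date, today: date) -> str:
    """Guess which ERA5 flavour is likely to contain `target` as of `today`."""
    age = today - target
    if age >= STABLE_DELAY:
        return "stable"
    if age >= PRELIMINARY_DELAY:
        return "preliminary"
    return "unavailable"


today = date(2025, 6, 3)
print(pick_era5_flavour(date(2019, 3, 29), today))
print(pick_era5_flavour(date(2025, 5, 20), today))
print(pick_era5_flavour(date(2025, 6, 1), today))
```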


In [None]:

def extract_latest_arco_era5_input():
    # Load the dataset from ARCO ERA5
    ds = xr.open_zarr(
        'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3',
        chunks=None,
        storage_options=dict(token='anon')
    )

    # Get the latest available time from metadata
    latest_time_str = ds.attrs.get('valid_time_stop_era5t', ds.attrs.get('valid_time_stop'))
    latest_time = datetime.strptime(latest_time_str, "%Y-%m-%d")

    # Start date is 1 day before the latest time
    start_date = latest_time - timedelta(days=1)

    # Generate 2 time steps, 12 hours apart, starting at start_date
    requested_times = pd.date_range(start=start_date, periods=2, freq='12h')

    # Restrict to valid time range
    ds = ds.sel(time=slice(ds.attrs['valid_time_start'], latest_time_str))

    # Filter requested times to those that exist in the dataset
    available_times = pd.to_datetime(ds.time.values)
    valid_times = [t for t in requested_times if t in available_times]

    # Rename dimensions
    ds = ds.rename({'latitude': 'lat', 'longitude': 'lon'})

    # Select 13 pressure levels
    levels = [50, 100, 150, 200, 250, 300, 400, 500, 600, 700, 850, 925, 1000]
    ds = ds.sel(level=levels)

    # Select specific variables
    variables = [
        'land_sea_mask', 'geopotential_at_surface',
        '2m_temperature', 'sea_surface_temperature', 'mean_sea_level_pressure',
        '10m_v_component_of_wind', '10m_u_component_of_wind',
        'u_component_of_wind', 'specific_humidity', 'temperature',
        'vertical_velocity', 'v_component_of_wind', 'geopotential'
    ]
    ds = ds[variables]

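    # Select the valid time steps (fail early if none are available yet)
    if not valid_times:
        raise ValueError("None of the requested time steps are available in the dataset.")
    ds = ds.sel(time=valid_times)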

    # Expand dataset with a new 'batch' dimension
    ds = ds.expand_dims('batch')

    # Create new coordinates
    datetime_coord = np.array(valid_times, dtype='datetime64[ns]').reshape(1, -1)
    time_coord = (pd.to_datetime(valid_times) - pd.to_datetime(valid_times[0])).to_numpy().astype('timedelta64[ns]')
    


    # Assign coordinates
    ds = ds.assign_coords({
        'time': time_coord,
        'datetime': (('batch', 'time'), datetime_coord)
    })

    # Compute the dataset
    ds = ds.compute()

    return ds

# Example usage
gcp_ds_deg_0x25 = extract_latest_arco_era5_input()

# Convert from 0.25° to 1.0° resolution
gcp_ds_deg_1x0 = gcp_ds_deg_0x25.isel(lat=slice(None, None, 4), lon=slice(None, None, 4))
gcp_ds_deg_1x0


# Method three: Fetch directly from ERA5 and build a homemade WeatherBench format

This homemade method uses the Copernicus ERA5 datasets API, which covers 1959 to 5 days behind real time.

https://cds.climate.copernicus.eu/datasets/reanalysis-era5-pressure-levels?tab=documentation

In [16]:


# Replace the placeholder with your personal CDS API key
cdsapirc_content = """url: https://cds.climate.copernicus.eu/api
key: <YOUR-CDS-API-KEY>
"""

# Path to the user's home directory
user_home = os.path.expanduser("~")
cdsapirc_path = os.path.join(user_home, ".cdsapirc")

# Write the file to the user's home directory
with open(cdsapirc_path, 'w') as f:
    f.write(cdsapirc_content)

print(f"Configuration written to {cdsapirc_path}")

Configuration written to /home/ec2-user/.cdsapirc


In [18]:
def open_zipped_or_netcdf(file_path, extract_dir="temp_extract"):
    """
    Opens a Copernicus file that can be:
      - a pure NetCDF,
      - or a ZIP archive containing one or more .nc files.
    Returns a merged xarray.Dataset (if several .nc files are found in the ZIP).
    """
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"File {file_path} does not exist.")

    # Reads the first bytes to detect ZIP signature (PK\x03\x04)
    with open(file_path, "rb") as f:
        start_bytes = f.read(4)

    if start_bytes.startswith(b"PK\x03\x04"):
        # It's a ZIP, we decompress it
        if not os.path.exists(extract_dir):
            os.makedirs(extract_dir, exist_ok=True)

        with zipfile.ZipFile(file_path, 'r') as z:
            z.extractall(extract_dir)

        # Opens and merges all .nc files found
        nc_files = glob.glob(os.path.join(extract_dir, "**", "*.nc"), recursive=True)
        if not nc_files:
            raise ValueError(f"No .nc file found after decompressing {file_path}")

        ds_list = []
        for nc in nc_files:
            ds_tmp = xr.open_dataset(nc)
            ds_list.append(ds_tmp)
        ds_merged = xr.merge(ds_list)
        return ds_merged

    else:
        # It's probably a direct NetCDF
        return xr.open_dataset(file_path)

In [26]:
def download_gencast_global_two_times(output_file, area, target_date, time1, time2):
    # ================== Configuration Parameters ==================
    c = cdsapi.Client()

    # Pressure level variables
    pressure_vars = [
        'geopotential',
        'specific_humidity',
        'temperature',
        'u_component_of_wind',
        'v_component_of_wind',
        'vertical_velocity',
    ]
    pressure_levels = ['50', '100', '150', '200', '250', '300', '400', '500', '600', '700', '850', '925', '1000']

    # Surface variables (single level)
    surface_vars = [
        # 'total_precipitation_12hr',
        'land_sea_mask',
        '2m_temperature',
        'mean_sea_level_pressure',
        '10m_v_component_of_wind',
        '10m_u_component_of_wind',
        'sea_surface_temperature',
    ]

    # Time field parsing
    year_ = [target_date[:4]]
    month_ = [target_date[5:7]]
    day_ = [target_date[8:10]]

    # Output filenames (create the target directory if it does not exist yet)
    os.makedirs("API_data", exist_ok=True)
    pl_file = f"API_data/pressure_{target_date}_{time1}_{time2}.nc"
    sl_file = f"API_data/single_levels_{target_date}_{time1}_{time2}.nc"


    # ================== Download Pressure Level Data ==================
    if not os.path.exists(pl_file):
        print("=== Downloading pressure levels ===")
        c.retrieve(
            'reanalysis-era5-pressure-levels',
            {
                'product_type': 'reanalysis',
                'format': 'netcdf',
                'variable': pressure_vars,
                'pressure_level': pressure_levels,
                'year': year_,
                'month': month_,
                'day': day_,
                'time': [time1, time2],
                'area': area,
                'grid': [0.25, 0.25],
                'expver': '1',  # Experiment version, typically set to '1' to avoid getting default version
            },
            pl_file
        )
        print("Pressure data downloaded.\n")

    # ================== Download Surface Single Level Data ==================
    if not os.path.exists(sl_file):
        print("=== Downloading single levels ===")
        c.retrieve(
            'reanalysis-era5-single-levels',
            {
                'product_type': 'reanalysis',
                'format': 'netcdf',
                'variable': surface_vars,
                'year': year_,
                'month': month_,
                'day': day_,
                'time': [time1, time2],
                'area': area,
                'grid': [0.25, 0.25],
                'expver': '1',
            },
            sl_file
        )
        print("Surface data downloaded.\n")

    # ================== Open and Merge Datasets ==================
    ds_pl = open_zipped_or_netcdf(pl_file, extract_dir="temp_extract_pl")
    ds_sl = open_zipped_or_netcdf(sl_file, extract_dir="temp_extract_sl")

    ds = xr.merge([ds_pl, ds_sl])

    # ================== Dimension Correction and Validation ==================
    rename_dims = {}
    if 'latitude' in ds.dims:
        rename_dims['latitude'] = 'lat'
    if 'longitude' in ds.dims:
        rename_dims['longitude'] = 'lon'
    if 'pressure_level' in ds.dims:
        rename_dims['pressure_level'] = 'level'
    if 'valid_time' in ds.dims and 'time' not in ds.dims:
        rename_dims['valid_time'] = 'time'

    ds = ds.rename(rename_dims)

    ds = ds.expand_dims('batch')

    # Remove unnecessary coordinates
    for coord in ['expver', 'number']:
        if coord in ds.coords:
            ds = ds.drop_vars(coord)


    if 'time' not in ds.dims or ds.sizes['time'] != 2:
        raise ValueError("Expected 2 time steps (0h, +12h). Found something else.")

    # ================== Convert time to timedelta (relative to reference time) ==================
    reference_time = np.datetime64(f"{target_date}T{time1}:00")
    if 'time' in ds.coords:
        time_deltas = ds.time - reference_time
        ds['time'] = time_deltas.astype('timedelta64[ns]')

    # --- Add datetime coordinate (batch, time) with [0h, +12h] ---
    dt0 = np.datetime64(f"{target_date}T{time1}:00")
    dt1 = np.datetime64(f"{target_date}T{time2}:00")
    datetimes = np.array([[dt0, dt1]], dtype='datetime64[ns]')
    ds.coords['datetime'] = (('batch', 'time'), datetimes)

    # ================== Compute cyclical day/year variables ==================
    hours_since1970 = (datetimes - np.datetime64('1970-01-01')) / np.timedelta64(1, 'h')  # Hours since 1970-01-01

    # Compute the daily progress (cosine and sine).
    # Assumption: the day fraction follows local solar time, i.e. the UTC day
    # fraction is shifted by lon / 360, so the values vary along the lon dimension.
    day_fraction_utc = (hours_since1970 % 24) / 24   # shape (batch, time)
    lon_offsets = ds.lon.values / 360.0              # shape (lon,)
    day_progress = (day_fraction_utc[:, :, np.newaxis] + lon_offsets) % 1.0
    day_cos = np.cos(2 * np.pi * day_progress)
    day_sin = np.sin(2 * np.pi * day_progress)

    # Add day progress variables to dataset
    ds['day_progress_cos'] = (('batch', 'time', 'lon'), day_cos)
    ds['day_progress_sin'] = (('batch', 'time', 'lon'), day_sin)

    # Year progress: fraction of the year elapsed, derived from the time since
    # the epoch (adding the day of year on top would count the date twice)
    days_per_year = 365.24219  # mean length of the tropical year in days
    year_progress = ((hours_since1970 / 24) / days_per_year) % 1.0

    # Add year progress variables to dataset
    ds['year_progress_cos'] = (('batch', 'time'), np.cos(2 * np.pi * year_progress))
    ds['year_progress_sin'] = (('batch', 'time'), np.sin(2 * np.pi * year_progress))

    # ================== Rename variables to match the expected names ==================
    rename_mapping = {
        'lsm': 'land_sea_mask',
        # 'tp': 'total_precipitation_12hr',
        't2m': '2m_temperature',
        't': 'temperature',
        'v10': '10m_v_component_of_wind',
        'u10': '10m_u_component_of_wind',
        'msl': 'mean_sea_level_pressure',
        'z': 'geopotential',
        'q': 'specific_humidity',
        'u': 'u_component_of_wind',
        'v': 'v_component_of_wind',
        'w': 'vertical_velocity',
        'sst': 'sea_surface_temperature'
    }

    ds = ds.rename(rename_mapping)

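    # ================== Calculate and add 'geopotential_at_surface' variable ==================
    # Note: the 1000 hPa geopotential is used here as a stand-in. The true surface
    # geopotential (orography) is a separate ERA5 single-level field and can differ
    # substantially over elevated terrain.
    if 'geopotential' in ds and 'level' in ds.dims:
        ds['geopotential_at_surface'] = ds['geopotential'].sel(level=1000)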

    # ================== Adjust dimensions ==================
    # Adjust longitude coordinates
    ds = ds.reindex(lon=sorted(ds.lon.values))
    ds["lon"] = ds.lon.astype("float32")

    # Adjust latitude coordinates
    ds = ds.reindex(lat=sorted(ds.lat.values))
    ds["lat"] = ds.lat.astype("float32")

    # Adjust level coordinates
    ds["level"] = ds.level.round().astype("int32")
    ds = ds.sortby("level")

    # Shift the relative time axis by 12 h so the two steps sit at -12 h and 0 h
    ds["time"] = ds.time - np.timedelta64(12, "h")
    ds = ds.sortby("time")

    # ================== Adjust static variables ==================
    if "geopotential_at_surface" in ds:
        ds["geopotential_at_surface"] = ds["geopotential_at_surface"].isel(batch=0, drop=True)
        ds["geopotential_at_surface"] = ds["geopotential_at_surface"].isel(time=0, drop=True)
    if "land_sea_mask" in ds:
        ds["land_sea_mask"] = ds["land_sea_mask"].isel(time=0, drop=True)
        ds["land_sea_mask"] = ds["land_sea_mask"].isel(batch=0, drop=True)

    # Ensure data type is float32
    for var in ds.data_vars:
        if ds[var].dtype != 'float32':
            ds[var] = ds[var].astype('float32')

    # ================== Save output ==================
    output_file =  f"API_data/{output_file}_{target_date}.nc"
    ds.to_netcdf(output_file)
    print(f"Dataset saved to {output_file}")

    return output_file

In [33]:
area = [90, -180, -90, 180]  # global
#API_target_date = "2019-03-29"
API_target_date = "2025-05-29"

output_file_1 = download_gencast_global_two_times("ecmwf_2_times_1", area, API_target_date, '00:00', '12:00')

copernicus_ds_deg_0x25 = xr.open_dataset(output_file_1).compute()

# Convert from 0.25 to 1.0 deg by keeping every 4th grid point
copernicus_ds_deg_1x0 = copernicus_ds_deg_0x25.isel(lat=slice(None, None, 4), lon=slice(None, None, 4))
copernicus_ds_deg_1x0

2025-06-03 13:59:55,691 INFO [2024-09-26T00:00:00] Watch our [Forum](https://forum.ecmwf.int/) for Announcements, news and other discussed topics.
2025-06-03 13:59:55,874 INFO Request ID is 68171120-f35f-4ba6-a9a9-a3bf37537606


=== Downloading pressure levels ===


2025-06-03 13:59:55,957 INFO status has been updated to accepted
2025-06-03 14:00:04,453 INFO status has been updated to running
2025-06-03 14:01:50,442 INFO status has been updated to successful


dfd8aaee5f0ed91a1d73192790d4613.nc:   0%|          | 0.00/282M [00:00<?, ?B/s]

Pressure data downloaded.

=== Downloading single levels ===


2025-06-03 14:01:56,095 INFO Request ID is 43a61a8b-97ad-4ebf-a0fa-333f9b5eb045
2025-06-03 14:01:56,174 INFO status has been updated to accepted
2025-06-03 14:02:46,869 INFO status has been updated to successful


582e3f9347138261285cf8938d334a47.nc:   0%|          | 0.00/16.5M [00:00<?, ?B/s]

Surface data downloaded.

Dataset saved to API_data/ecmwf_2_times_1_2025-05-29.nc


# Performance test

This test compares each source with the data provided by DeepMind at a single location (the grid point nearest Paris) for one variable, u_component_of_wind, across all 13 pressure levels at the first time step.


In [32]:

# Coordinates for Paris
lat, lon = 48.8566, 2.3522
variable = "u_component_of_wind"

# For each source, take the nearest grid point and the first time step.
# Each result holds one value per pressure level.

# DeepMind example_batch
values_at_location_1 = example_batch[variable].sel(lat=lat, lon=lon, method="nearest")
values_1 = values_at_location_1.isel(time=0)

# WeatherBench 2
values_at_location_2 = weatherbench2_ds_deg_1x0[variable].sel(lat=lat, lon=lon, method="nearest")
values_2 = values_at_location_2.isel(time=0)

# GCP bucket
values_at_location_3 = gcp_ds_deg_1x0[variable].sel(lat=lat, lon=lon, method="nearest")
values_3 = values_at_location_3.isel(time=0)

# Copernicus
values_at_location_4 = copernicus_ds_deg_1x0[variable].sel(lat=lat, lon=lon, method="nearest")
values_4 = values_at_location_4.isel(time=0)

print("example batch")
print(values_1.values)
print("weatherbench2")
print(values_2.values)
print("gcp")
print(values_3.values)
print("copernicus")
print(values_4.values)


example batch
[[  0.9020796   -0.28765106  -4.1036263   -8.896164   -19.376253
  -16.91042    -15.8122425  -11.86301     -9.127249    -6.8284526
   -4.577563    -8.104098    -6.4245706 ]]
weatherbench2
[[  0.9020796   -0.28765106  -4.1036263   -8.896164   -19.376253
  -16.91042    -15.8122425  -11.86301     -9.127249    -6.8284526
   -4.577563    -8.104098    -6.4245706 ]]
gcp
[[  0.9020796   -0.28765106  -4.1036263   -8.896164   -19.376253
  -16.91042    -15.8122425  -11.86301     -9.127249    -6.8284526
   -4.577563    -8.104098    -6.4245706 ]]
copernicus
[[  0.9018116   -0.28805542  -4.1036835   -8.896225   -19.375809
  -16.909836   -15.81279    -11.863586    -9.127823    -6.8285675
   -4.577408    -8.1045685   -6.424774  ]]


# Data Source Comparison Summary

The WeatherBench 2 dataset and the GCP ERA5 bucket return values identical to the data DeepMind provides for GenCast's test case on 2019-03-29, in the same format.
The Copernicus CDS source differs slightly, but remains very close in structure and content.
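
To put a number on "slightly different", the sketch below reuses the 13 values printed in the performance test for the example batch and for Copernicus, and measures the largest absolute gap between them. The tolerance of 1e-3 is an arbitrary choice for illustration.

```python
import numpy as np

# Values copied from the performance test output (one per pressure level)
example_batch_values = np.array([
    0.9020796, -0.28765106, -4.1036263, -8.896164, -19.376253,
    -16.91042, -15.8122425, -11.86301, -9.127249, -6.8284526,
    -4.577563, -8.104098, -6.4245706,
], dtype="float32")

copernicus_values = np.array([
    0.9018116, -0.28805542, -4.1036835, -8.896225, -19.375809,
    -16.909836, -15.81279, -11.863586, -9.127823, -6.8285675,
    -4.577408, -8.1045685, -6.424774,
], dtype="float32")

abs_diff = np.abs(example_batch_values - copernicus_values)

print("max abs difference:", abs_diff.max())
print("bitwise identical:", np.array_equal(example_batch_values, copernicus_values))
print("close within 1e-3:", np.allclose(example_batch_values, copernicus_values, atol=1e-3))
```

At this grid point the two sources are not bitwise identical, but they agree to within the chosen tolerance at every pressure level.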

For our project, we proceeded with the GCP Bucket ERA5 data, as it supports real-time prediction. However, this source does not include the total_precipitation_12hr variable. While this variable is not used by the model for prediction, the codebase expects it to be present. To resolve this, we manually added it as a NaN-filled placeholder.
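
A NaN-filled placeholder of this kind could be added along the following lines. This is a minimal sketch on a synthetic dataset, not the code from the project notebook: the helper name `add_nan_precipitation` and the choice of `2m_temperature` as the shape template are assumptions.

```python
import numpy as np
import xarray as xr


def add_nan_precipitation(ds: xr.Dataset, template_var: str = "2m_temperature") -> xr.Dataset:
    """Hypothetical helper: add total_precipitation_12hr filled with NaN."""
    # Same dims and coords as the template variable, but every value is NaN
    placeholder = xr.full_like(ds[template_var], np.nan, dtype="float32")
    placeholder.attrs = {}
    return ds.assign(total_precipitation_12hr=placeholder)


# Tiny synthetic dataset with the (batch, time, lat, lon) layout
demo = xr.Dataset(
    {"2m_temperature": (("batch", "time", "lat", "lon"), np.ones((1, 2, 3, 4), dtype="float32"))}
)
demo = add_nan_precipitation(demo)

print(demo["total_precipitation_12hr"].dims)
print(bool(np.isnan(demo["total_precipitation_12hr"]).all()))
```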

Additionally, we had to adjust the dimensions of the static variables to match the expected format. All these preprocessing steps are implemented in the notebook real_time_prediction_1G.
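
For the static variables, the adjustment amounts to dropping the `batch` and `time` dimensions so that only `(lat, lon)` remains, as method three already does for its own output. The sketch below shows one way to do this on a synthetic dataset. The helper name `squeeze_static_vars` is hypothetical and the project notebook may proceed differently.

```python
import numpy as np
import xarray as xr

STATIC_VARS = ("land_sea_mask", "geopotential_at_surface")


def squeeze_static_vars(ds: xr.Dataset) -> xr.Dataset:
    """Hypothetical helper: reduce static fields to (lat, lon)."""
    ds = ds.copy()
    for name in STATIC_VARS:
        if name in ds:
            # Take the first entry along batch/time, if those dims are present
            extra = {dim: 0 for dim in ("batch", "time") if dim in ds[name].dims}
            ds[name] = ds[name].isel(extra, drop=True)
    return ds


# Tiny synthetic dataset where the static field carries batch and time dims
shape = (1, 2, 3, 4)
demo_static = xr.Dataset(
    {
        "land_sea_mask": (("batch", "time", "lat", "lon"), np.ones(shape, dtype="float32")),
        "2m_temperature": (("batch", "time", "lat", "lon"), np.zeros(shape, dtype="float32")),
    }
)
fixed = squeeze_static_vars(demo_static)

print(fixed["land_sea_mask"].dims)
print(fixed["2m_temperature"].dims)
```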