# ERA5 - reanalysis data for weather forecasting

## Introduction

The ERA5 dataset, developed by the European Centre for Medium-Range Weather Forecasts (ECMWF) through the Copernicus Climate Change Service, is a reanalysis dataset that provides a comprehensive and consistent record of global atmospheric, land, and ocean-wave conditions from 1940 to the present. It combines vast amounts of historical observations with advanced climate models using a process called data assimilation. This process ensures a globally complete and reliable dataset.

ERA5 offers high temporal resolution with hourly estimates and is updated daily with a latency of about five days. It is the successor to the ERA-Interim dataset and serves as a fundamental resource for various climate research, weather forecasting, and environmental applications.

Find the official documentation of the dataset [here](https://cds.climate.copernicus.eu/datasets/reanalysis-era5-single-levels?tab=overview).

![temp_era5](imgs/copernicus.png)

### ERA5 in research

State-of-the-art machine learning models for weather forecasting are often trained on the ERA5 dataset. These models learn complex spatiotemporal patterns to predict future atmospheric states from current conditions.

For training and evaluation, it is a common practice to split the data chronologically. Models are typically trained on data up to the end of 2016, validated on 2017, and tested on 2018.
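A chronological split of this kind can be sketched in a few lines. This is only an illustration: the function name `split_by_year` and the toy timestamps are assumptions made for this example, and they are not part of the hackathon code.

```python
from datetime import datetime, timedelta


def split_by_year(timestamps, train_end=2016, val_year=2017, test_year=2018):
    """Assign each timestamp to a split based only on its calendar year."""
    splits = {"train": [], "val": [], "test": []}
    for ts in timestamps:
        if ts.year <= train_end:
            splits["train"].append(ts)
        elif ts.year == val_year:
            splits["val"].append(ts)
        elif ts.year == test_year:
            splits["test"].append(ts)
        # Timestamps after test_year are ignored in this sketch.
    return splits


# Toy timestamps, 100 days apart, starting on the last day of 2016.
start = datetime(2016, 12, 31)
timestamps = [start + timedelta(days=d) for d in range(0, 800, 100)]

splits = split_by_year(timestamps)
for name, items in splits.items():
    print(name, len(items))
```

Splitting by time rather than at random keeps future atmospheric states out of the training set, which matters because consecutive samples are strongly correlated.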

The complete ERA5 dataset includes hundreds of variables, or "channels" (e.g., temperature, wind speed, pressure). In practice, however, models usually focus on a key subset of these variables to reduce complexity and training time.

### Hackathon Dataset
For this hackathon, we provide a pre-processed subset of the 2018 ERA5 data. A dedicated dataloader is also included, allowing you to easily extract the data for any selected time point, helping you to get started with building and testing your models immediately.

## Hackathon ERA5 Data

ERA5 is a valuable dataset for weather analysis, but its massive size presents significant challenges. A single year of data is approximately 400 GB, with the entire dataset from 1980 to 2018 exceeding 15 TB.
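These orders of magnitude can be checked with a rough estimate. The sketch below assumes 32-bit floats, four samples per day, and the 73-variable, 721 x 1440 layout used in this hackathon; the storage type in particular is an assumption, so treat the result as a plausibility check rather than an exact figure.

```python
# Back-of-the-envelope storage estimate (all constants below are assumptions).
N_CHANNELS = 73          # variables in the hackathon subset
N_LAT, N_LON = 721, 1440  # 0.25-degree global grid
BYTES_PER_VALUE = 4      # assumption: values stored as float32
SAMPLES_PER_DAY = 4      # assumption: 6-hourly samples

sample_bytes = N_CHANNELS * N_LAT * N_LON * BYTES_PER_VALUE
year_bytes = sample_bytes * SAMPLES_PER_DAY * 365
n_years = 2018 - 1980 + 1  # 1980 to 2018, inclusive
total_bytes = year_bytes * n_years

print(f"one sample: {sample_bytes / 1e6:.0f} MB")
print(f"one year:   {year_bytes / 1e9:.0f} GB")
print(f"{n_years} years:   {total_bytes / 1e12:.1f} TB")
```

Under these assumptions a single time step is already a few hundred megabytes, which is why the full record cannot be held in memory.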

This sheer volume makes it impossible to load the full dataset into memory at once. A common workaround is lazy loading, which reads only the required data into memory when it is needed. Even with this approach, the computational cost and time required for training complex models on ERA5 are substantial. Here, we rely on the HDF5 file format, which is read through the h5py library.
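As a minimal sketch of lazy loading with h5py, the example below writes a tiny stand-in file and then reads only one time step from it. The dataset name `fields` and the array shape are hypothetical and chosen for illustration; the layout of the hackathon files may differ.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.h5")

    # Write a small stand-in file with dimensions (time, channel, lat, lon).
    values = np.arange(4 * 3 * 8 * 16, dtype=np.float32).reshape(4, 3, 8, 16)
    with h5py.File(demo_path, "w") as f:
        f.create_dataset("fields", data=values)  # "fields" is a hypothetical name

    with h5py.File(demo_path, "r") as f:
        dset = f["fields"]        # a handle only; no array data is read yet
        full_shape = dset.shape   # the shape comes from the file header
        sample = dset[2]          # reads only time index 2 into a NumPy array
        window = dset[1:3, 0:2]   # reads two time steps of the first two channels

print(full_shape, sample.shape, window.shape)
```

Indexing the dataset handle, rather than calling `dset[...]` on the whole array, is what keeps memory use proportional to the slice that is requested.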

For this hackathon, we will not train new weather models due to these computational constraints. Instead, the focus will be on evaluating and testing existing, pre-trained models for specific downstream tasks.

The Copernicus website linked above allows you to download ERA5 data directly (you need to create an account for this). However, to simplify things, we provide parts of the ERA5 data for the year 2018 for use in this hackathon. Our dataset contains 73 key variables.

## Data Access and Structure

Our pre-processed dataset is available on Hugging Face: https://huggingface.co/datasets/franzigrkn/thinking_earth_hackathon_bids2025

The dataset is organized into two folders, each containing a 73-variable dataset:

- ```73varQ```: This folder contains data that includes the ```q``` variable (specific humidity). This dataset is suitable for models like Aurora and Pangu-Weather, which require the atmospheric variables ```u, v, t, z``` and ```q```.

- ```73varR```: This folder contains data that includes the ```r``` variable (relative humidity). This dataset is suitable for models like SFNO, which use the variables ```u, v, t, z``` and ```r```.

This structure ensures that the correct data is readily available for different model requirements.
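The choice of folder can be made explicit in code. The helper below is a hypothetical convenience function, not part of the provided dataloader, and the model keys `aurora` and `pangu` are assumed spellings chosen for this sketch.

```python
from pathlib import PurePosixPath

# Humidity variant required by each model, as described above.
# The dictionary keys are assumed spellings used only in this example.
HUMIDITY_FOLDER = {
    "aurora": "73varQ",  # specific humidity (q)
    "pangu": "73varQ",   # specific humidity (q)
    "sfno": "73varR",    # relative humidity (r)
}


def dataset_folder(root, model):
    """Return the dataset folder that matches the given model name."""
    key = model.lower()
    if key not in HUMIDITY_FOLDER:
        raise ValueError(f"Unknown model: {model!r}")
    return PurePosixPath(root) / HUMIDITY_FOLDER[key]


print(dataset_folder("/data", "SFNO"))
```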

Each folder then contains:

- A folder named ```2018```, which holds the pre-processed ERA5 data for the year 2018 in HDF5 (.h5) format. 
- A ```data.json``` file containing essential metadata about the dataset.
- A folder ```static``` containing the static variables.

We've provided a subset of the 2018 data, as the full year exceeds our size limitations. In each folder, you'll find two files: a small 1.5 GB file containing data for January 1st (ideal for testing) and a larger file covering all of February 2018. We suggest starting with these.

For data covering longer time periods, please download it directly from the official Copernicus provider.

TODO: Link WB2 repo for data!

For all evaluation tasks, it is crucial to normalize the input data before feeding it to the models. You should use the provided statistics in the stats folder for this purpose, unless a model's documentation specifies different normalization parameters.
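Per-channel normalization of this kind can be sketched as follows. The example assumes one mean and one standard deviation per channel and a sample of shape `(channels, lat, lon)`; the provided statistics files may be stored with a different shape, in which case the reshape has to be adapted. The function names are illustrative.

```python
import numpy as np


def normalize(sample, mean, std):
    """Standardize a (channels, lat, lon) sample with per-channel statistics."""
    mean = np.asarray(mean, dtype=sample.dtype).reshape(-1, 1, 1)
    std = np.asarray(std, dtype=sample.dtype).reshape(-1, 1, 1)
    return (sample - mean) / std  # assumes every std is strictly positive


def denormalize(sample, mean, std):
    """Invert normalize() to map model output back to physical units."""
    mean = np.asarray(mean, dtype=sample.dtype).reshape(-1, 1, 1)
    std = np.asarray(std, dtype=sample.dtype).reshape(-1, 1, 1)
    return sample * std + mean


# Toy data: 3 channels on a 4 x 8 grid, with statistics taken from the data itself.
rng = np.random.default_rng(0)
sample = rng.normal(loc=280.0, scale=15.0, size=(3, 4, 8)).astype(np.float32)
mean = sample.mean(axis=(1, 2))
std = sample.std(axis=(1, 2))

normed = normalize(sample, mean, std)
restored = denormalize(normed, mean, std)
print(normed.mean(axis=(1, 2)), normed.std(axis=(1, 2)))
```

The same statistics must be used to map model outputs back to physical units before comparing them with ERA5 values.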

## Dataloader

We provide a dataloader that allows you to extract the data sample for a specific time.

In [5]:
from era5.dataloader_era5 import *

For the setup, we use the included Docker image. Please download the data and mount the data folder to the ```/era5``` folder, then specify the path to the data folder below.

In [None]:
# Specify the data path; adapt it to your local setup
metadata_path = "/era5/data.json"
data_path = "/era5/2018/restricted_3days_2018.h5"

The models normalize their inputs with a mean and standard deviation computed from their training data. For the Aurora and Pangu-Weather models, this normalization is applied automatically in the loaded inference pipeline. For the SFNO model, it must be applied to the data manually. The statistics are included in the openly available checkpoint folder for the SFNO model.

In [None]:
# load the stats -- TODO adapt this to common setup for all three models
stats_mean_path = "/era5/stats_era5/global_means.npy"
stats_std_path = "/era5/stats_era5/global_stds.npy"

In [7]:
data = dataloader_era5(
    data_path=data_path,
    stats_mean_path=stats_mean_path,
    stats_std_path=stats_std_path,
    metadata_path=metadata_path,
    in_channels=[0, 1, 2],
    out_channels=[3, 4, 5],
    model="sfno",
    normalize=True)

The standard temporal resolution is 6 hours: for each day, data is available at 00:00:00, 06:00:00, 12:00:00 and 18:00:00. To extract the data sample for a specific time, you need to provide the exact time. Our dataloader requires the date in the format "%Y-%m-%dT%H:%M:%S" (Python ```datetime``` notation), for example "2018-01-01T18:00:00". The ```get_data``` function takes the date as input and returns the corresponding data sample. The output has dimension ```75 x 721 x 1440```, representing 75 variables and a spatial resolution of 0.25 degrees, corresponding to dimension 721 x 1440.

In [None]:
date = "2018-01-02T18:00:00"
data_sample = data.get_data(date)
print(f"Shape of data: {data_sample.shape}")
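Valid date strings for the 6-hourly steps can be generated and checked with Python's standard library. In `datetime` notation, a string such as "2018-01-02T18:00:00" corresponds to the format `%Y-%m-%dT%H:%M:%S`. The helper names below are illustrative and are not part of the provided dataloader.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def is_valid_sample_time(date_str, step_hours=6):
    """Check that a date string falls exactly on a 6-hourly time step."""
    ts = datetime.strptime(date_str, TIME_FORMAT)
    return ts.hour % step_hours == 0 and ts.minute == 0 and ts.second == 0


def sample_times(day, step_hours=6):
    """List all sample times of one day (given as YYYY-MM-DD) as strings."""
    start = datetime.strptime(day, "%Y-%m-%d")
    return [
        (start + timedelta(hours=h)).strftime(TIME_FORMAT)
        for h in range(0, 24, step_hours)
    ]


times = sample_times("2018-01-02")
print(times)
print(is_valid_sample_time("2018-01-02T17:00:00"))
```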

## Variables

The data provided includes 73 variables: 8 surface variables and 5 atmospheric variables at each of 13 pressure levels (8 + 5 x 13 = 73).

- Surface Variables: ```u10m, v10m, u100m, v100m, t2m, sp, msl, tcwv```.

- Atmospheric Variables (at 13 pressure levels): ```u,v,t,z``` and ```q``` (or ```r```). The pressure levels are: ```50,100,150,200,250,300,400,500,600,700,850,925``` and ```1000``` hPa.

For detailed descriptions of these variables, please refer to the official Copernicus ERA5 documentation.
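The variable count can be reproduced by enumerating the surface variables and every combination of atmospheric variable and pressure level. The naming pattern (for example `u50`) and the ordering used below are assumptions made for illustration; the authoritative channel list is the one stored in the metadata file.

```python
SURFACE = ["u10m", "v10m", "u100m", "v100m", "t2m", "sp", "msl", "tcwv"]
LEVELS = [50, 100, 150, 200, 250, 300, 400, 500, 600, 700, 850, 925, 1000]


def channel_names(humidity="q"):
    """Build a channel list: surface variables, then variable x level pairs.

    The "<variable><level>" naming and the ordering are hypothetical.
    """
    if humidity not in ("q", "r"):
        raise ValueError("humidity must be 'q' or 'r'")
    atmospheric = ["u", "v", "t", "z", humidity]
    return SURFACE + [f"{var}{level}" for var in atmospheric for level in LEVELS]


names = channel_names("q")
print(len(names))   # 8 surface + 5 x 13 atmospheric channels
print(names[:10])
```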

## Data Structure

The provided data is taken from the year 2018. It is structured with the following dimensions: ```(num_timestamps, num_variables, latitudes, longitudes)```.

- Spatial Grid: The spatial grid has a resolution of 0.25 degrees and a shape of 721 x 1440, representing the latitude and longitude coordinates.

- Metadata: A metadata file is included to provide specifics on the dataset. It contains general information and detailed specifications for each dimension. You can find the exact list of latitudes, longitudes, and variable (channel) names by accessing the following keys:
  - metadata['coords']['lat']
  - metadata['coords']['lon']
  - metadata['coords']['channel']
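A 0.25-degree global grid has 721 latitudes (both poles included) and 1440 longitudes. The sketch below builds such a grid and finds the grid point nearest to a location. It assumes latitudes run from 90 to -90 and longitudes from 0 to 359.75, which is a common convention but should be verified against `metadata['coords']`.

```python
import numpy as np

# Assumed orientation: north to south, and 0 to 359.75 degrees east.
lat = np.linspace(90.0, -90.0, 721)
lon = np.linspace(0.0, 360.0, 1440, endpoint=False)


def nearest_index(lat_deg, lon_deg):
    """Return (lat_index, lon_index) of the grid point closest to a location."""
    i = int(np.abs(lat - lat_deg).argmin())
    # Use the circular distance so that 359.9 degrees maps to 0 degrees.
    d = np.abs(lon - (lon_deg % 360.0))
    j = int(np.minimum(d, 360.0 - d).argmin())
    return i, j


i, j = nearest_index(52.52, 13.405)  # approximately Berlin
print(i, j, lat[i], lon[j])
```

With these indices, a single grid cell can be read from a sample as `data_sample[:, i, j]`.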

In [11]:
metadata_path = "/era5/data.json"

In [None]:
import json

# Load the dataset metadata
with open(metadata_path, "r") as f:
    metadata = json.load(f)

print(f"Keys of metadata: {metadata.keys()}\n")

print(f"Metadata dataset name: {metadata['dataset_name']}")
print(f"Metadata description: {metadata['attrs']['description']}")
print(f"Metadata entry_key: {metadata['h5_path']}")
print(f"Metadata dimensions: {metadata['dims']}")
print(f"Metadata temporal resolution (dhours): {metadata['dhours']}\n")

print(f"Metadata coords keys: {metadata['coords'].keys()}")
print(f"Metadata grid_type: {metadata['coords']['grid_type']}")
print(f"Metadata lat: {len(metadata['coords']['lat'])}")
print(f"Metadata lon: {len(metadata['coords']['lon'])}")
print(f"Metadata channel: {metadata['coords']['channel']}\n")

In [None]:
print(f"Metadata channel: {metadata['coords']['channel']}\n")
print(f"Total number of channel: {len(metadata['coords']['channel'])}")
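The channel list can also be used to select channels by name instead of hard-coding integer indices. In the sketch below, `demo_channels` is a stand-in for `metadata['coords']['channel']`, and the helper `channel_indices` is hypothetical; the actual names and ordering should be taken from the metadata file.

```python
def channel_indices(channel_names, wanted):
    """Map variable names to their positions in the channel dimension."""
    lookup = {name: index for index, name in enumerate(channel_names)}
    missing = [name for name in wanted if name not in lookup]
    if missing:
        raise KeyError(f"Channels not found: {missing}")
    return [lookup[name] for name in wanted]


# Stand-in for metadata['coords']['channel']; the real list may differ.
demo_channels = ["u10m", "v10m", "u100m", "v100m", "t2m", "sp", "msl", "tcwv"]

in_channels = channel_indices(demo_channels, ["t2m", "msl"])
print(in_channels)
```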

## Further readings

The data is stored in HDF5 files and loaded with h5py. The datetimes are converted to UTC timestamps. All loading functionality is handled by the dataloader. If you need to modify the code for your project, you may find the following documentation helpful.

- [H5py documentation](https://docs.h5py.org/en/stable/)
- [Climate Data Store, Copernicus ERA5](https://cds.climate.copernicus.eu/datasets/reanalysis-era5-single-levels?tab=overview)
- [Python datetime](https://docs.python.org/3/library/datetime.html)
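As an illustration of the datetime handling mentioned above, a date string can be converted to a UTC timestamp and then to an index along the time dimension. This is a sketch under the assumption that a file starts at a known first time and uses a fixed step of `dhours`; the provided dataloader may implement this differently.

```python
from datetime import datetime, timezone

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def to_utc_timestamp(date_str):
    """Parse a date string as UTC and return seconds since the Unix epoch."""
    ts = datetime.strptime(date_str, TIME_FORMAT).replace(tzinfo=timezone.utc)
    return ts.timestamp()


def time_index(date_str, first_date_str, dhours=6):
    """Return the position of date_str along a regularly spaced time axis."""
    delta = to_utc_timestamp(date_str) - to_utc_timestamp(first_date_str)
    index, remainder = divmod(delta, dhours * 3600)
    if index < 0 or remainder != 0:
        raise ValueError(f"{date_str} is not on the {dhours}-hourly time axis")
    return int(index)


epoch_seconds = to_utc_timestamp("2018-01-01T00:00:00")
index = time_index("2018-01-02T18:00:00", "2018-01-01T00:00:00")
print(epoch_seconds, index)
```

Attaching `timezone.utc` explicitly avoids depending on the local time zone of the machine that runs the code.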

