<img src="img/era-temp2.png" alt="ERA Temperature" style="width:60%;">

# Exploring ERA5 Weather Data with Zarr


---

This demo showcases how to efficiently access and analyze large climate datasets using the Zarr format with ERA5 data stored on Google Cloud.


## 1. Introduction to ERA5

ERA5 is the fifth generation of atmospheric reanalysis data from the European Centre for Medium-Range Weather Forecasts (ECMWF). 

### What is Reanalysis?
Reanalysis is a scientific method that creates a consistent, best-estimate reconstruction of Earth's past weather and climate by combining:

Historical Observations:

1. Surface stations
1. Weather balloons
1. Satellites
1. Ships and buoys
1. Aircraft reports

Numerical Weather Models:

1. Modern physics equations
1. Data assimilation techniques
1. Supercomputing power

### Key Characteristics:

- **"Best guess"** of past weather: Fills gaps where direct observations are missing
- **Temporally consistent**: Uses one modern system for entire period (unaffected by changing observation technologies)
- **Spatially complete**: Provides global coverage even over oceans/polar regions

### Key Features:
- **Global coverage**: Data for the entire Earth at high resolution
- **Temporal range**: 1940 to present, at hourly resolution
- **Spatial resolution**: 0.25° latitude/longitude grid (native model resolution of about 31 km)
- **Variables**: 240+ weather parameters (temperature, precipitation, wind, etc.)
- **Data assimilation**: Combines models with observations for consistency
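
As a rough sanity check on the resolution figure above, the following sketch derives the size of one global field on a regular 0.25° grid. It assumes that the latitude axis includes both poles and that values are stored as 32-bit floats; both are assumptions for illustration, not properties read from the dataset.

```python
# Size of one global field on a regular 0.25-degree grid.
# Assumptions (not read from the dataset): latitude includes both poles,
# longitude does not repeat 0/360, and values are 4-byte floats.
resolution_deg = 0.25
bytes_per_value = 4

n_lon = int(360 / resolution_deg)        # 0.00 ... 359.75
n_lat = int(180 / resolution_deg) + 1    # 90.00 ... -90.00, both poles included
n_points = n_lon * n_lat

field_mib = n_points * bytes_per_value / 1024**2
hourly_fields_per_year = 24 * 365

print(f"{n_lon} x {n_lat} = {n_points:,} grid points per field")
print(f"{field_mib:.2f} MiB per field")
print(f"{field_mib * hourly_fields_per_year / 1024:.1f} GiB per variable per non-leap year")
```

Even a single surface variable therefore adds up to tens of GiB per year before compression, which is why chunked, cloud-hosted storage matters for this dataset.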

### Relevant Resources:
1. [Official ECMWF ERA5 Documentation](https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5)
2. [ERA5 Data Access Guide](https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation)
3. [Google Cloud Public Dataset Info](https://cloud.google.com/storage/docs/public-datasets/era5)
4. [ARCO-ERA5 Project](https://github.com/google-research/arco-era5)
5. [Copernicus Climate Data Store](https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-single-levels?tab=overview)

## 2. Accessing ERA5 Data in Zarr format

Now we'll open the [ERA5 dataset](https://cloud.google.com/storage/docs/public-datasets/era5) stored in Zarr format on Google Cloud Storage.

Let's check first what is inside the cloud bucket.

In [None]:
import fsspec

fs = fsspec.filesystem('gcs', token='anon')
store_path = 'gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3'
files = fs.ls(store_path)

for f in files[:10]:
    print(f)

Remember how a Zarr store is structured:

<img src="https://zarr-specs.readthedocs.io/en/latest/_images/terminology-hierarchy.excalidraw.png" alt="Zarr Hierarchy" style="width:40%;">


Now, we can look inside the `.zmetadata` file

In [None]:
import json

fs = fsspec.filesystem('gcs', token='anon')
path = 'gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3/.zmetadata'

with fs.open(path, 'r') as f:
    zmetadata = json.load(f)

In [None]:
# zmetadata

In [None]:
zmetadata["metadata"][".zattrs"]

In [None]:
zmetadata["metadata"]["2m_temperature/.zarray"]
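
Each `.zarray` entry follows the Zarr v2 array metadata layout, which includes `shape`, `chunks`, and `dtype` fields. The sketch below shows how those three fields are enough to work out how many chunks an array has and how large each chunk is before compression. The `example_zarray` values are made up for illustration and are not copied from the ERA5 store; `describe_chunks` is a hypothetical helper.

```python
import math

# Illustrative values only, NOT the real ERA5 metadata.
example_zarray = {
    "shape": [48, 721, 1440],    # (time, latitude, longitude)
    "chunks": [1, 721, 1440],    # one time step per chunk
    "dtype": "<f4",              # little-endian 32-bit float
}


def describe_chunks(zarray):
    """Return (number of chunks, uncompressed bytes per chunk)."""
    # Chunks per dimension, rounding up for partial edge chunks
    per_dim = [math.ceil(s / c) for s, c in zip(zarray["shape"], zarray["chunks"])]
    # For simple numeric dtypes such as "<f4", the trailing digits give the item size in bytes
    itemsize = int(zarray["dtype"][2:])
    chunk_bytes = math.prod(zarray["chunks"]) * itemsize
    return math.prod(per_dim), chunk_bytes


n_chunks, chunk_bytes = describe_chunks(example_zarray)
print(f"{n_chunks} chunks of {chunk_bytes / 1024**2:.2f} MiB each (uncompressed)")
```

The same function could be applied to the dictionary returned by `zmetadata["metadata"]["2m_temperature/.zarray"]`, provided its `dtype` is a simple numeric type.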

## 3. Setting Up the Distributed Computing Environment

Next, we'll set up a distributed computing cluster using Dask and SLURM to handle the large dataset efficiently.


In [None]:
import xarray as xr
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

In [None]:
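cluster = SLURMCluster(
    # name=#...,
    cores=4,          # CPU cores requested per SLURM job
    memory="8GB",     # memory requested per SLURM job
    processes=1,      # worker processes per job (an integer, not a boolean)
    scheduler_options={"dashboard_address": ":0"},  # let Dask pick a free dashboard port
)
client = Client(cluster)  # Connect to the cluster and make it the default scheduler
client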

Request two SLURM jobs, each running one Dask worker:

In [None]:
cluster.scale(jobs=2)

### Now open the ERA5 dataset using `xarray`

In [None]:
%%time
ds = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3',
    # chunks=None,
    chunks={},
    consolidated=True,
    storage_options=dict(token='anon'),
)

In [None]:
ds

Our dataset has `time`, `latitude`, and `longitude` for dimensions. This makes our lives easier. For more info, check this Python [short-course](https://projectpythia.org/AtmosCol-2023/) in Spanish!


In [None]:
total_bytes = ds.nbytes
print(f"Dataset size: {total_bytes / 1024**5:.2f} PiB")
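
Dividing by powers of 1024 yields binary units (KiB, MiB, GiB, TiB, PiB), which differ from the decimal units (kB, MB, and so on) based on powers of 1000. A small helper along these lines can pick a suitable unit automatically; `format_binary_size` is a hypothetical name, not an xarray or Dask function.

```python
def format_binary_size(n_bytes):
    """Format a byte count using binary (power-of-1024) units."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    value = float(n_bytes)
    for unit in units[:-1]:
        if value < 1024:
            return f"{value:.2f} {unit}"
        value /= 1024
    # Anything this large is reported in the biggest unit available
    return f"{value:.2f} {units[-1]}"


print(format_binary_size(721 * 1440 * 4))   # one global float32 field
print(format_binary_size(2 * 1024**5))
```

With an xarray object, the call would be `format_binary_size(ds.nbytes)`.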

## 4. Data Slicing & Querying

### 4.1 Data selection
Understanding ERA5's arrays (Variables)

In [None]:
ds.data_vars

Let's select the `2m_temperature` variable, which holds the air temperature 2 meters above the ground:

In [None]:
t2m = ds['2m_temperature']  # attribute access is not possible here: the name starts with a digit
t2m

Similarly, we can check the size of the temperature array

In [None]:
print(f"Temperature array size: {t2m.nbytes / 1024**4:.2f} TiB")

As we can see, the chunk size, array size, and other properties come straight from the metadata, without loading any data into memory.

> 💡 **An important aspect to consider is the way data is chunked. It will depend on how data will be queried.** Check this [link](https://blog.lobelia.earth/arco-the-smartest-way-to-access-big-geospatial-data-eaf689eff3c9) for further information

<div style="background-color:#e6f0ff; border-left:5px solid #3399ff; padding:10px; margin:10px 0; height:300px; overflow-y:auto;">
  <p align="center">
    <img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*TpY57PSaWMX__S4_fXkXAQ.png" alt="Cloud Zarr" style="width:45%;">
    <br>
    <em style="color:#3366cc;">Figure: Cloud-native architecture of Zarr enabling scalable storage and retrieval.</em>
  </p>
</div>
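
To make the chunking point concrete, the sketch below counts how many chunks two different queries would touch under two chunk layouts. Both layouts and the `chunks_touched` helper are hypothetical and chosen for illustration; the real store's chunk shape should be read from its metadata.

```python
import math


def chunks_touched(selection, chunk_shape):
    """Count the chunks overlapped by a selection.

    selection: list of (start, stop) index ranges, one per dimension (stop exclusive)
    chunk_shape: chunk length along each dimension
    """
    counts = []
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size
        last = (stop - 1) // size
        counts.append(last - first + 1)
    return math.prod(counts)


# Dimensions are (time, latitude, longitude); both layouts are hypothetical.
map_layout = (1, 721, 1440)      # one full global map per chunk
series_layout = (8760, 10, 10)   # long time runs over small spatial tiles

one_map = [(0, 1), (0, 721), (0, 1440)]                     # all grid points, one time step
one_year_at_point = [(0, 8760), (300, 301), (1100, 1101)]   # one grid point, 8760 hourly steps

for name, layout in [("map layout", map_layout), ("series layout", series_layout)]:
    print(name,
          "| single map:", chunks_touched(one_map, layout),
          "| one-year series:", chunks_touched(one_year_at_point, layout))
```

A layout that serves one query with a single chunk forces the other query to read thousands, so the chunk shape should match the dominant access pattern.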

### 4.2. Data slicing

#### 4.2.1. Time slicing
Understanding ERA5's time

In [None]:
print(f"Time range: {ds.time.min().values} to {ds.time.max().values}")

In [None]:
# Time dimension analysis
time_res_ns = int(ds.time.diff('time')[0].values)  # nanoseconds
time_res_hours = time_res_ns / (1e9 * 3600)        # convert to hours

# Calculate total time span
total_ns = int(ds.time[-1].values - ds.time[0].values)
total_days = total_ns / (1e9 * 3600 * 24)
total_years = total_days / 365.25

# Count of time steps
n_steps = len(ds.time)

print(f"Time resolution: {time_res_hours:g} hour(s)")
print(f"Total span: {total_days:,.0f} days")
print(f"Total span: {total_years:,.0f} years")
print(f"\nNumber of time steps: {n_steps:,}")
# Both endpoints are counted, hence the + 1
print(f"Expected steps: {round(total_days * 24) + 1:,} (hourly data)")
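
As an independent check on the expected step count, the number of hourly timestamps between two dates can be computed with the standard library alone. The sketch counts both endpoints, so the span in hours is increased by one; `hourly_steps` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def hourly_steps(start, end):
    """Number of hourly timestamps from start to end, both included."""
    return (end - start) // timedelta(hours=1) + 1


print(hourly_steps(datetime(2015, 1, 1, 0), datetime(2015, 12, 31, 23)))  # non-leap year
print(hourly_steps(datetime(2016, 1, 1, 0), datetime(2016, 12, 31, 23)))  # leap year
```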

* **Historical Reanalysis (1900-1978)**
     - Lower quality than main ERA5
     - Uses fewer observations
     - Primarily useful for long-term climate studies
     - Note: The official ERA5 starts in 1940, but this extended version includes preliminary data back to 1900

We can use the [`xarray.Dataset.sel`](https://docs.xarray.dev/en/v2023.10.1/generated/xarray.Dataset.sel.html) or [`xarray.DataArray.sel`](https://docs.xarray.dev/en/latest/generated/xarray.DataArray.sel.html) method to slice data along a given dimension:

In [None]:
%%time
temp_hist = t2m.sel(time=slice('1900', '1978'))
temp_hist

We now have fewer timestamps along the `time` dimension:

In [None]:
print(f"Historical Temperature size: {temp_hist.nbytes / 1024**4:.2f} TiB")

* **Main ERA5 Reanalysis (1979-Present)**
     - Highest quality period
     - Incorporates satellite era data
     - Hourly temporal resolution
     - 0.25° spatial resolution

* **ERA5 Forecasts (Future Dates)**
     - Model projections beyond present day
     - Multiple ensemble members possible
     - Useful for testing model systems
     - Not a true forecast - generated with fixed sea surface temperatures

#### 4.2.2. Lat - Lon Slicing
We can select data using `longitude`, `latitude`, and `time` coordinates directly. This is a game changer in geospatial data analysis!

Instead of dealing with numerical indexes, we can use the intuitive `.sel()` method to slice data by real-world coordinates.

Let’s take a look at an example:

In this dataset, the `longitude` values range from `0°` to `359.75°`, and `latitude` ranges from `90°` to `-90°` (i.e., from North to South).
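
Because longitudes here are counted eastward from 0°, western-hemisphere longitudes written in the more familiar -180° to 180° convention have to be shifted before slicing. A minimal sketch of that conversion, using a hypothetical helper `to_0_360`:

```python
def to_0_360(lon_deg):
    """Convert a longitude from the -180..180 convention to 0..360."""
    return lon_deg % 360


# Approximate western and eastern edges of a box around Colombia (85°W and 60°W)
west, east = to_0_360(-85), to_0_360(-60)
print(west, east)
```

Longitudes that are already positive are left unchanged by the modulo operation.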

To extract data over Colombia, we can use the `.sel()` method with coordinate slices like this:

```python
.sel(longitude=slice(275, 300), latitude=slice(14, -3))
```

In [None]:
t2m_col = t2m.sel(longitude=slice(275, 300), latitude=slice(14, -3))

In [None]:
t2m_col
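
The order of the slice bounds matters: latitude is stored from north to south, so the slice has to run from the larger value to the smaller one. The sketch below reproduces this on a synthetic array with the same coordinate layout (an assumption made for illustration; no ERA5 data is read).

```python
import numpy as np
import xarray as xr

# Synthetic global 0.25-degree grid; latitude runs north to south
lat = np.arange(90, -90.25, -0.25)
lon = np.arange(0, 360, 0.25)
field = xr.DataArray(
    np.zeros((lat.size, lon.size), dtype="float32"),
    coords={"latitude": lat, "longitude": lon},
    dims=("latitude", "longitude"),
)

# Bounds given in the same order as the coordinate (north to south)
box = field.sel(longitude=slice(275, 300), latitude=slice(14, -3))
# Bounds given south to north select nothing on a descending coordinate
reversed_box = field.sel(latitude=slice(-3, 14))

print(dict(box.sizes))
print(dict(reversed_box.sizes))
```

An empty result after a latitude slice is usually a sign that the bounds were given in the wrong order for the coordinate.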

In this subset, the spatial extent has been significantly reduced, while the grid spacing is still 0.25°.

Instead of the **1440 longitude points** and **721 latitude points** of the full global ERA5 grid, we now have:

- longitude: **101 points**
- latitude: **69 points**

This smaller domain reduces data size and speeds up processing, which makes it well suited to learning, prototyping, or regional analysis.

In [None]:
print(f"Colombia temperature subset size: {t2m_col.nbytes / 1024**3:.2f} GiB")

## 5. Temperature Data Visualization

**Concept: Lazy Loading + Data Streaming**

In this section, we demonstrate how to visualize massive ERA5 temperature datasets without loading the entire dataset into memory.

Thanks to `Dask` and `Zarr`, data is accessed using lazy loading — meaning only the required chunks are **fetched and streamed on-demand**, right when needed. This makes it possible to work efficiently with terabyte-scale datasets using standard hardware.

✅ This approach is ideal for scalable, interactive analysis of climate and weather data.



In [None]:
import matplotlib.pyplot as plt

### 5.1. Basic Streaming Plot


This plot displays the 2-meter air temperature (t2m) over the entire spatial domain at a single time step:
January 1, 2015 at 00:00 UTC.

In [None]:
fig, ax = plt.subplots(figsize=(10,4))
t2m.sel(time="2015-01-01 00:00").plot(ax=ax)  # draw on the axes created above
ax.set_title("2015-01-01 00:00 temperature")
plt.savefig("img/era-temp2.png", dpi=150, bbox_inches='tight')
plt.show()

So far, we've only queried a **single time slice** of temperature data — just one chunk. But with `Dask` and `Zarr`, we can go further: run **scalable, parallel operations** across time or space without loading everything into memory.

🚀 This enables large-scale analysis on modest hardware.

### 5.2. Mean Temperature for Colombia: January 1, 2010
Let’s compute and visualize the mean **2-meter temperature** over **Colombia** on **January 1st, 2010**, averaged over that day's hourly time steps. This will give us an idea of the spatial temperature distribution across the country on that day.

In [None]:
col_temp_2010 = t2m_col.sel(time="2010-01-01") - 273.15 # Convert from kelvin to degrees Celsius
col_temp_2010

In [None]:
print(f"Colombia temperature array size for 2010-01-01: {col_temp_2010.nbytes / 1024**2:.2f} MiB")

In [None]:
mean_temp_2010 = col_temp_2010.mean("time")

For the Colombia subset, we've only built up lazy operations so far: no actual computation or data loading has happened yet. Now, let’s finally trigger computation and bring some data into memory!

🧠 This is where Dask turns your delayed operations into real results.

In [None]:
mean_temp_2010.plot(
    cmap="coolwarm"
)

In [None]:
mean_temp_2010.plot.contourf(
    cmap="coolwarm"
)

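Other commands that will fetch/download data are:

- `.values`
- `.to_numpy()`
- `.to_dataframe()`, `.to_series()`
- `.compute()`, `.load()`

Note that `.data` does not trigger computation on a Dask-backed array: it returns the underlying lazy Dask array.

Data is also fetched when saving:

- `.to_netcdf()`
- `.to_zarr()`
- `.to_dataframe().to_csv()` (xarray objects have no `.to_csv()` method of their own; it comes from pandas)

Let's try it!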

In [None]:
mean_temp_2010.compute()
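
The same lazy-then-compute pattern can be reproduced without any remote data. The sketch below uses a small synthetic Dask array (a constant 300 K field with the shape of one day over the Colombia box); the values are made up for illustration and nothing is read from the ERA5 store.

```python
import dask.array as da

# Synthetic stand-in: 24 hourly steps on a 69 x 101 grid, constant 300 K,
# split into one chunk per time step
kelvin = da.full((24, 69, 101), 300.0, chunks=(1, 69, 101))

# These lines only build a task graph; nothing is computed yet
celsius = kelvin - 273.15
daily_mean = celsius.mean(axis=0)
print(type(daily_mean).__name__, daily_mean.shape)

# .compute() executes the graph and returns a NumPy array
result = daily_mean.compute()
print(type(result).__name__, round(float(result[0, 0]), 2))
```

The shape of `daily_mean` is known before any computation because Dask tracks array metadata separately from the data itself.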

<div style="background-color:#ffdddd; border-left:6px solid #f44336; padding:16px; font-family:Arial, sans-serif; font-size:16px;">

<h2 style="color:#b30000;">⚠️ WARNING: Do NOT operate directly on the full <code>ds</code> ERA5 dataset!</h2>

<p>This dataset represents <strong>~3.27 Petabytes (PB)</strong> of data.</p>

<p>Any direct operation on <code>ds</code> (such as <code>.plot()</code>, <code>.values</code>, <code>.to_numpy()</code>, <code>.mean()</code>, etc. without subsetting) will <strong>trigger massive data fetching</strong> and <strong>very likely crash your session or overload the backend</strong>.</p>

<h3>✅ Best practices:</h3>
<ul>
  <li>Use <code>.isel()</code> or <code>.sel()</code> to select a small chunk of data before any processing.</li>
  <li>Use <code>.compute()</code> or <code>.load()</code> <strong>only</strong> after subsetting.</li>
  <li>Avoid calling <code>.plot()</code> or accessing <code>.values</code> on the full dataset.</li>
</ul>

</div>
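
One way to follow these practices is to check the size of a selection before loading it. The sketch below is a hypothetical guard, not part of xarray or Dask, and the 2 GiB budget is an arbitrary example value. With an xarray object, the `.nbytes` attribute used earlier in this notebook would supply the byte count directly.

```python
import math


def check_fits_in_memory(shape, itemsize=4, budget_bytes=2 * 1024**3):
    """Raise MemoryError if an array of this shape would exceed the budget."""
    needed = math.prod(shape) * itemsize
    if needed > budget_bytes:
        raise MemoryError(
            f"Selection needs {needed / 1024**3:.1f} GiB, "
            f"budget is {budget_bytes / 1024**3:.1f} GiB: subset further before loading"
        )
    return needed


# One day of hourly data over a 69 x 101 box: small enough
print(check_fits_in_memory((24, 69, 101)))

# Ten years of hourly global fields: far too large, so the guard raises
try:
    check_fits_in_memory((10 * 8760, 721, 1440))
except MemoryError as err:
    print(err)
```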


## 6. Release the resources reserved on the High-Performance Computing System

In [None]:
client.close()

In [None]:
cluster.close()