### Introduction
Author: Jake Goh

Date: 4th December 2025

Description:
This notebook documents how the ERA5-Land and ERA5 datasets were acquired, and the basic preprocessing steps applied to get the data into the format needed for the next step.


In [1]:
%load_ext autoreload
%autoreload 2

In [None]:
import logging
import xarray as xr
import glob
import zipfile
import os
import re
import cdsapi
from pathlib import Path
logging.basicConfig(level=logging.INFO)

Useful ERA5-Land and ERA5 documentation:
1. ERA5-Land: https://confluence.ecmwf.int/display/CKB/ERA5-Land%3A+data+documentation
2. ERA5: https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation

I have used the Climate Data Store (CDS) to get the required datasets for this project. 


For the CDS API (cdsapi) request syntax, I mainly referred to the download pages of the two datasets:
1. https://cds.climate.copernicus.eu/datasets/reanalysis-era5-single-levels?tab=download
2. https://cds.climate.copernicus.eu/datasets/reanalysis-era5-land?tab=download



To download the datasets using the CDS API:
1. Create a free account on this website: https://cds.climate.copernicus.eu/
2. Go to your profile and find your API key: https://cds.climate.copernicus.eu/profile
3. Create a .cdsapirc file in your home directory, which allows Python to authenticate automatically.
4. Install the cdsapi library: pip install cdsapi.
5. The CDS interface (links above) also provides code snippets that you can use directly in your code. They can also be adapted for batch downloads, as seen in my code.
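Step 3 can be sketched as below. This is a minimal illustration, not part of the original workflow: the helper names `build_cdsapirc_text` and `write_cdsapirc` are hypothetical, and the endpoint URL is an assumption, so check the exact `url` and `key` values shown on your CDS profile page.

```python
from pathlib import Path

# Assumed endpoint; confirm the exact value shown on your CDS profile page.
ASSUMED_CDS_URL = "https://cds.climate.copernicus.eu/api"


def build_cdsapirc_text(api_key: str, url: str = ASSUMED_CDS_URL) -> str:
    """Return the two-line 'url' / 'key' content of a .cdsapirc file."""
    return f"url: {url}\nkey: {api_key}\n"


def write_cdsapirc(api_key: str, target: Path) -> Path:
    """Write the credentials file, refusing to overwrite an existing one."""
    if target.exists():
        raise FileExistsError(f"{target} already exists; edit it by hand instead")
    target.write_text(build_cdsapirc_text(api_key), encoding="utf-8")
    return target
```

A typical call would be `write_cdsapirc("<your-api-key>", Path.home() / ".cdsapirc")`, run once per machine.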


First, I determine the parameters to use when calling the API: the variables to download, the time period, the data format, and the area of interest.

For my project, I download data for 2022 over Peninsular Malaysia, and only the following variables:
1. surface_solar_radiation_downwards (for both datasets)
2. surface_net_solar_radiation (for both datasets)
3. toa_incident_solar_radiation (for the ERA5 dataset only)
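The parameters above can be collected in one place before any request is sent. The sketch below uses a hypothetical helper, `build_request`, that returns the same request dictionary used in the download cells of this notebook. The `area` order is North, West, South, East.

```python
def build_request(year: int, month: int, variables: list[str]) -> dict:
    """Build a CDS request for every hour of one month (hypothetical helper)."""
    return {
        "variable": list(variables),
        "year": [str(year)],
        "month": f"{month:02d}",
        "day": [f"{d:02d}" for d in range(1, 32)],   # 01 to 31
        "time": [f"{h:02d}:00" for h in range(24)],  # 00:00 to 23:00
        "data_format": "netcdf",
        "download_format": "zip",
        "area": [7.5, 99.5, 1, 104.5],  # North, West, South, East (Peninsular Malaysia)
    }


era5_land_variables = [
    "surface_solar_radiation_downwards",  # ssrd
    "surface_net_solar_radiation",        # ssr
]
request = build_request(2022, 3, era5_land_variables)
print(request["month"], len(request["day"]), len(request["time"]))
```

For the ERA5 single-levels dataset, the download cell in this notebook additionally sets `"product_type": ["reanalysis"]` and requests `toa_incident_solar_radiation`.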



In [None]:
# Set up the folders
# Path objects (combined with os.path.join further down) keep the paths portable across Windows, macOS and Linux
ERA5_LAND_RAW_DATA_DIR = Path('../data/malaysia_raw_data_era5_land')
ERA5_LAND_EXTRACTED_DATA_DIR = Path("../data/malaysia_era5_land_extracted")
ERA5_RAW_DATA_DIR = Path('../data/malaysia_raw_data_era5')
ERA5_EXTRACTED_DATA_DIR = Path("../data/malaysia_era5_extracted")

os.makedirs(ERA5_LAND_RAW_DATA_DIR, exist_ok=True)
os.makedirs(ERA5_LAND_EXTRACTED_DATA_DIR, exist_ok=True)
os.makedirs(ERA5_RAW_DATA_DIR, exist_ok=True)
os.makedirs(ERA5_EXTRACTED_DATA_DIR, exist_ok=True)

### Downloading the ERA5-Land dataset

In [None]:
# Each request covers all days and hours of a single month.
# To get the data for the whole of 2022, I call the API month by month.
c = cdsapi.Client()

for month in range(1,13):
    c.retrieve(
        'reanalysis-era5-land',
        {
            'variable': [
                'surface_solar_radiation_downwards',     # ssrd
                'surface_net_solar_radiation',           # ssr
            ],
            'year': [
                '2022', 
            ],
            'month': f'{month:02}',  # one month per request
            'day': [
                f"{d:02d}" for d in range(1,32)  # 01 to 31
            ],
            'time': [
                f"{h:02d}:00" for h in range(24) # 00:00 to 23:00
            ],
            'data_format': 'netcdf', 
            'download_format': 'zip',
            'area': [7.5, 99.5, 1, 104.5], #Only Peninsular Malaysia's boundary
        },
        os.path.join(ERA5_LAND_RAW_DATA_DIR, f'era5_land_malaysia_2022_{month}.nc.zip')
    )


2025-09-01 11:41:36,008 INFO [2024-09-26T00:00:00] Watch our [Forum](https://forum.ecmwf.int/) for Announcements, news and other discussed topics.
INFO:ecmwf.datastores.legacy_client:[2024-09-26T00:00:00] Watch our [Forum](https://forum.ecmwf.int/) for Announcements, news and other discussed topics.
2025-09-01 11:41:36,676 INFO Request ID is eb231835-a975-4c13-9483-db492ff5e925
INFO:ecmwf.datastores.legacy_client:Request ID is eb231835-a975-4c13-9483-db492ff5e925
2025-09-01 11:41:36,899 INFO status has been updated to accepted
INFO:ecmwf.datastores.legacy_client:status has been updated to accepted
2025-09-01 11:41:59,168 INFO status has been updated to running
INFO:ecmwf.datastores.legacy_client:status has been updated to running
2025-09-01 11:45:57,820 INFO status has been updated to successful
INFO:ecmwf.datastores.legacy_client:status has been updated to successful
INFO:multiurl.base:Downloading https://object-store.os-api.cci2.ecmwf.int:443/cci2-prod-cache-2/2025-09-01/f503e823b41a
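The log above shows a single monthly request taking several minutes between being accepted and finishing, so re-running the whole loop after an interruption is expensive. One possible adaptation, sketched here with a hypothetical helper `months_to_download`, is to request only the months whose archive is not yet on disk. The file naming follows the download cell; note that a partially downloaded file would still count as present.

```python
from pathlib import Path


def months_to_download(raw_dir: Path, prefix: str, year: int) -> list[int]:
    """Return the months (1-12) whose zip archive is not yet present in raw_dir."""
    return [
        month
        for month in range(1, 13)
        if not (raw_dir / f"{prefix}_{year}_{month}.nc.zip").is_file()
    ]


# Example: loop over the result instead of range(1, 13) in the download cell.
# for month in months_to_download(ERA5_LAND_RAW_DATA_DIR, "era5_land_malaysia", 2022):
#     ...
```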

In [20]:
# Unzip the monthly zip files and store the NetCDF files in the output directory

# List all files in the raw directory (e.g. era5_land_malaysia_2022_1.nc.zip … 12.nc.zip)
zip_files = os.listdir(ERA5_LAND_RAW_DATA_DIR)

for zip_file_name in zip_files:
    
    # Extract year and month from filename using regex
    match = re.search(r"(\d{4})_(\d{1,2})\.nc\.zip$", zip_file_name)

    if not match:
        print(f"Skipping {zip_file_name}, no match for year/month")
        continue

    year, month = match.groups()
    month = month.zfill(2)  # pad the month to two digits
    
    # Expected output filename
    output_file_name = os.path.join(ERA5_LAND_EXTRACTED_DATA_DIR, f"era5_land_malaysia_{year}_{month}.nc")

    # Open zip and extract
    with zipfile.ZipFile(os.path.join(ERA5_LAND_RAW_DATA_DIR, zip_file_name), "r") as z:
        # Assume only 1 file inside. 
        nc_file = z.namelist()[0]
        z.extract(nc_file, ERA5_LAND_EXTRACTED_DATA_DIR)
        
        # Rename to the desired format (os.replace overwrites an existing file, so the cell can be re-run on any platform)
        extracted_path = os.path.join(ERA5_LAND_EXTRACTED_DATA_DIR, nc_file)
        os.replace(extracted_path, output_file_name)
        print(f"Extracted {zip_file_name} → {output_file_name}")

print("All files extracted and renamed")


Extracted era5_land_malaysia_2022_1.nc.zip → ../data/malaysia_era5_land_extracted\era5_land_malaysia_2022_01.nc
Extracted era5_land_malaysia_2022_10.nc.zip → ../data/malaysia_era5_land_extracted\era5_land_malaysia_2022_10.nc
Extracted era5_land_malaysia_2022_11.nc.zip → ../data/malaysia_era5_land_extracted\era5_land_malaysia_2022_11.nc
Extracted era5_land_malaysia_2022_12.nc.zip → ../data/malaysia_era5_land_extracted\era5_land_malaysia_2022_12.nc
Extracted era5_land_malaysia_2022_2.nc.zip → ../data/malaysia_era5_land_extracted\era5_land_malaysia_2022_02.nc
Extracted era5_land_malaysia_2022_3.nc.zip → ../data/malaysia_era5_land_extracted\era5_land_malaysia_2022_03.nc
Extracted era5_land_malaysia_2022_4.nc.zip → ../data/malaysia_era5_land_extracted\era5_land_malaysia_2022_04.nc
Extracted era5_land_malaysia_2022_5.nc.zip → ../data/malaysia_era5_land_extracted\era5_land_malaysia_2022_05.nc
Extracted era5_land_malaysia_2022_6.nc.zip → ../data/malaysia_era5_land_extracted\era5_land_malaysia_
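The extraction cell assumes each archive holds exactly one file. A slightly more defensive variant is sketched below, under the assumption that the member of interest ends in `.nc`. The helper name `extract_single_netcdf` is hypothetical. It checks the assumption explicitly and streams the member straight to its final name, so no rename step is needed.

```python
import shutil
import zipfile
from pathlib import Path


def extract_single_netcdf(zip_path: Path, output_path: Path) -> Path:
    """Extract the only .nc member of zip_path and write it to output_path."""
    with zipfile.ZipFile(zip_path, "r") as archive:
        nc_members = [name for name in archive.namelist() if name.endswith(".nc")]
        if len(nc_members) != 1:
            raise ValueError(
                f"Expected exactly one .nc file in {zip_path}, found {len(nc_members)}"
            )
        output_path.parent.mkdir(parents=True, exist_ok=True)
        # Copy the member's bytes directly to the target file name.
        with archive.open(nc_members[0]) as source, open(output_path, "wb") as target:
            shutil.copyfileobj(source, target)
    return output_path
```

The same function would serve both the ERA5-Land and the ERA5 archives, since only the input and output paths differ.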

In [23]:
# Concatenate the monthly files into a single file in ERA5_LAND_EXTRACTED_DATA_DIR

# 1. List all monthly files. The [0-9][0-9] pattern matches only the two-digit month,
#    so the yearly file written in step 4 is not picked up if this cell is re-run.
files = sorted(glob.glob(os.path.join(ERA5_LAND_EXTRACTED_DATA_DIR, "era5_land_malaysia_2022_[0-9][0-9].nc")))

print(files)
# 2. Open and concatenate along the time dimension
ds = xr.open_mfdataset(files, combine="by_coords")

# 3. Check variables and dimensions
print(ds)

# 4. Save as one yearly NetCDF
ds.to_netcdf(os.path.join(ERA5_LAND_EXTRACTED_DATA_DIR, "era5_land_malaysia_2022_year.nc"))


['../data/malaysia_era5_land_extracted\\era5_land_malaysia_2022_01.nc', '../data/malaysia_era5_land_extracted\\era5_land_malaysia_2022_02.nc', '../data/malaysia_era5_land_extracted\\era5_land_malaysia_2022_03.nc', '../data/malaysia_era5_land_extracted\\era5_land_malaysia_2022_04.nc', '../data/malaysia_era5_land_extracted\\era5_land_malaysia_2022_05.nc', '../data/malaysia_era5_land_extracted\\era5_land_malaysia_2022_06.nc', '../data/malaysia_era5_land_extracted\\era5_land_malaysia_2022_07.nc', '../data/malaysia_era5_land_extracted\\era5_land_malaysia_2022_08.nc', '../data/malaysia_era5_land_extracted\\era5_land_malaysia_2022_09.nc', '../data/malaysia_era5_land_extracted\\era5_land_malaysia_2022_10.nc', '../data/malaysia_era5_land_extracted\\era5_land_malaysia_2022_11.nc', '../data/malaysia_era5_land_extracted\\era5_land_malaysia_2022_12.nc']
<xarray.Dataset> Size: 236MB
Dimensions:     (valid_time: 8760, latitude: 66, longitude: 51)
Coordinates:
    number      int64 8B 0
  * valid_time
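Before concatenating, it can be worth confirming that all twelve monthly files are present, since a missing month would leave a gap in the yearly file. The sketch below uses a hypothetical helper, `missing_months`, that parses the two-digit month from file names in the format produced by the extraction cell.

```python
import re


def missing_months(file_names: list[str], year: int) -> list[int]:
    """Return the months (1-12) with no '<prefix>_<year>_<MM>.nc' file in file_names."""
    pattern = re.compile(rf"_{year}_(\d{{2}})\.nc$")
    found = {
        int(match.group(1))
        for name in file_names
        if (match := pattern.search(name))
    }
    return sorted(set(range(1, 13)) - found)


example = [f"era5_land_malaysia_2022_{m:02d}.nc" for m in range(1, 13) if m != 6]
print(missing_months(example, 2022))  # [6]
```

Because the pattern requires exactly two digits before `.nc`, a yearly file such as `era5_land_malaysia_2022_year.nc` is ignored.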

### Downloading the ERA5 dataset

In [None]:
# Same as the ERA5-Land code block, except that the dataset is 'reanalysis-era5-single-levels' instead of 'reanalysis-era5-land', and "product_type" and the extra TISR variable are added

c = cdsapi.Client()

for month in range(1,13):
    c.retrieve(
        'reanalysis-era5-single-levels',
        {
            "product_type": ["reanalysis"],
            'variable': [
                "surface_net_solar_radiation",
                "surface_solar_radiation_downwards",
                "toa_incident_solar_radiation"     #This TISR will be used to compare with our calculated values later. 
            ],
            'year': [
                '2022', 
            ],
            'month': f'{month:02}',
            'day': [
                f"{d:02d}" for d in range(1,32)  # 01 to 31
            ],
            'time': [
                f"{h:02d}:00" for h in range(24) # 00:00 to 23:00
            ],
            'data_format': 'netcdf',  
            'download_format': 'zip',
            'area': [7.5, 99.5, 1, 104.5], 
        },
        os.path.join(ERA5_RAW_DATA_DIR, f'era5_malaysia_2022_{month}.nc.zip')
    )


2025-09-25 19:35:29,665 INFO [2025-09-03T00:00:00] To improve our C3S service, we need to hear from you! Please complete this very short [survey](https://confluence.ecmwf.int/x/E7uBEQ/). Thank you.
INFO:ecmwf.datastores.legacy_client:[2025-09-03T00:00:00] To improve our C3S service, we need to hear from you! Please complete this very short [survey](https://confluence.ecmwf.int/x/E7uBEQ/). Thank you.
2025-09-25 19:35:29,665 INFO [2024-09-26T00:00:00] Watch our [Forum](https://forum.ecmwf.int/) for Announcements, news and other discussed topics.
INFO:ecmwf.datastores.legacy_client:[2024-09-26T00:00:00] Watch our [Forum](https://forum.ecmwf.int/) for Announcements, news and other discussed topics.
2025-09-25 19:35:30,367 INFO Request ID is 72472644-c54f-4863-b253-2a9bc3515952
INFO:ecmwf.datastores.legacy_client:Request ID is 72472644-c54f-4863-b253-2a9bc3515952
2025-09-25 19:35:30,565 INFO status has been updated to accepted
INFO:ecmwf.datastores.legacy_client:status has been updated to a

In [4]:
# Unzip the monthly zip files and store the NetCDF files in the output directory

# Find all zip files (e.g. era5_malaysia_2022_1.nc.zip -> 12.nc.zip)

zip_files = os.listdir(ERA5_RAW_DATA_DIR)

for zip_file_name in zip_files:
    
    # Extract year + month from filename using regex
    match = re.search(r"(\d{4})_(\d{1,2})\.nc\.zip$", zip_file_name)
    
    if not match:
        print(f"Skipping {zip_file_name}, no match for year/month")
        continue
    
    year, month = match.groups()
    month = month.zfill(2)  # pad to two digits
    
    # Expected output filename
    output_file_name = os.path.join(ERA5_EXTRACTED_DATA_DIR, f"era5_malaysia_{year}_{month}.nc")

    # Open zip and extract
    with zipfile.ZipFile(os.path.join(ERA5_RAW_DATA_DIR, zip_file_name), "r") as z:
        # Assume only 1 file inside. 
        nc_file = z.namelist()[0]
        z.extract(nc_file, ERA5_EXTRACTED_DATA_DIR)
        
        # Rename to the desired format (os.replace overwrites an existing file, so the cell can be re-run on any platform)
        extracted_path = os.path.join(ERA5_EXTRACTED_DATA_DIR, nc_file)
        os.replace(extracted_path, output_file_name)
        print(f"Extracted {zip_file_name} → {output_file_name}")
    
print("All files extracted and renamed.")

Extracted era5_malaysia_2022_1.nc.zip → ../data/malaysia_era5_extracted\era5_malaysia_2022_01.nc
Extracted era5_malaysia_2022_10.nc.zip → ../data/malaysia_era5_extracted\era5_malaysia_2022_10.nc
Extracted era5_malaysia_2022_11.nc.zip → ../data/malaysia_era5_extracted\era5_malaysia_2022_11.nc
Extracted era5_malaysia_2022_12.nc.zip → ../data/malaysia_era5_extracted\era5_malaysia_2022_12.nc
Extracted era5_malaysia_2022_2.nc.zip → ../data/malaysia_era5_extracted\era5_malaysia_2022_02.nc
Extracted era5_malaysia_2022_3.nc.zip → ../data/malaysia_era5_extracted\era5_malaysia_2022_03.nc
Extracted era5_malaysia_2022_4.nc.zip → ../data/malaysia_era5_extracted\era5_malaysia_2022_04.nc
Extracted era5_malaysia_2022_5.nc.zip → ../data/malaysia_era5_extracted\era5_malaysia_2022_05.nc
Extracted era5_malaysia_2022_6.nc.zip → ../data/malaysia_era5_extracted\era5_malaysia_2022_06.nc
Extracted era5_malaysia_2022_7.nc.zip → ../data/malaysia_era5_extracted\era5_malaysia_2022_07.nc
Extracted era5_malaysia_202

In [6]:
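# Concatenate the monthly files into a single file in ERA5_EXTRACTED_DATA_DIR

# 1. List all monthly files. The [0-9][0-9] pattern matches only the two-digit month,
#    so the yearly file written in step 4 is not picked up if this cell is re-run.
files = sorted(glob.glob(os.path.join(ERA5_EXTRACTED_DATA_DIR, "era5_malaysia_2022_[0-9][0-9].nc")))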

# 2. Open and concatenate along the time dimension
ds = xr.open_mfdataset(files, combine="by_coords")

# 3. Check variables and dimensions
print(ds)

# 4. Save as one yearly NetCDF
ds.to_netcdf(os.path.join(ERA5_EXTRACTED_DATA_DIR, "era5_malaysia_2022_year.nc"))


<xarray.Dataset> Size: 60MB
Dimensions:     (valid_time: 8760, latitude: 27, longitude: 21)
Coordinates:
    number      int64 8B 0
  * valid_time  (valid_time) datetime64[ns] 70kB 2022-01-01 ... 2022-12-31T23...
  * latitude    (latitude) float64 216B 7.5 7.25 7.0 6.75 ... 1.75 1.5 1.25 1.0
  * longitude   (longitude) float64 168B 99.5 99.75 100.0 ... 104.0 104.2 104.5
    expver      (valid_time) <U4 140kB dask.array<chunksize=(744,), meta=np.ndarray>
Data variables:
    ssr         (valid_time, latitude, longitude) float32 20MB dask.array<chunksize=(744, 27, 21), meta=np.ndarray>
    ssrd        (valid_time, latitude, longitude) float32 20MB dask.array<chunksize=(744, 27, 21), meta=np.ndarray>
    tisr        (valid_time, latitude, longitude) float32 20MB dask.array<chunksize=(744, 27, 21), meta=np.ndarray>
Attributes:
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_subCentre:          0
    Conventions:     

Note that the two datasets have the same time dimension, but ERA5-Land has more grid cells (as shown by the latitude and longitude dimensions).
This is because ERA5-Land has a higher spatial resolution (the printed coordinates step by 0.1° for ERA5-Land and by 0.25° for ERA5), so it returns more values along those dimensions even though the API request covers the same geographical area.


ERA5-Land: (valid_time: 8760, latitude: 66, longitude: 51)

ERA5: (valid_time: 8760, latitude: 27, longitude: 21) 
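These shapes can be checked with simple arithmetic. The sketch below assumes the grid spacings read off the printed coordinates (0.1° for ERA5-Land, 0.25° for ERA5) and the requested bounding box. The helper `expected_points` is hypothetical.

```python
def expected_points(start: float, stop: float, step: float) -> int:
    """Number of grid points from start to stop (both inclusive) at the given spacing."""
    return round(abs(start - stop) / step) + 1


HOURS_IN_2022 = 365 * 24  # 2022 is not a leap year

# Latitude runs from 7.5 to 1.0, longitude from 99.5 to 104.5 (the requested area).
era5_land_shape = (HOURS_IN_2022, expected_points(7.5, 1.0, 0.1), expected_points(99.5, 104.5, 0.1))
era5_shape = (HOURS_IN_2022, expected_points(7.5, 1.0, 0.25), expected_points(99.5, 104.5, 0.25))

print(era5_land_shape)  # (8760, 66, 51)
print(era5_shape)       # (8760, 27, 21)
```

Both results match the dimensions reported by xarray, which suggests that no hours or grid rows were lost during download and concatenation.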

There is an extra variable in the ERA5 data (tisr). It will be used later as a reference to compare against our calculated TOA values.

### Reviewing the dataset

In [8]:
ds = xr.open_dataset(os.path.join(ERA5_LAND_EXTRACTED_DATA_DIR, "era5_land_malaysia_2022_year.nc"), engine="netcdf4")
print(ds)

<xarray.Dataset> Size: 236MB
Dimensions:     (valid_time: 8760, latitude: 66, longitude: 51)
Coordinates:
    number      int64 8B ...
  * valid_time  (valid_time) datetime64[ns] 70kB 2022-01-01 ... 2022-12-31T23...
  * latitude    (latitude) float64 528B 7.5 7.4 7.3 7.2 7.1 ... 1.3 1.2 1.1 1.0
  * longitude   (longitude) float64 408B 99.5 99.6 99.7 ... 104.3 104.4 104.5
    expver      (valid_time) <U4 140kB ...
Data variables:
    ssrd        (valid_time, latitude, longitude) float32 118MB ...
    ssr         (valid_time, latitude, longitude) float32 118MB ...
Attributes:
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             European Centre for Medium-Range Weather Forecasts
    history:                 2025-09-01T03:35 GRIB to CDM+CF via cfgrib-0.9.1...


In [7]:
ds = xr.open_dataset(os.path.join(ERA5_EXTRACTED_DATA_DIR, "era5_malaysia_2022_year.nc"), engine="netcdf4")
print(ds)

<xarray.Dataset> Size: 60MB
Dimensions:     (valid_time: 8760, latitude: 27, longitude: 21)
Coordinates:
    number      int64 8B ...
  * valid_time  (valid_time) datetime64[ns] 70kB 2022-01-01 ... 2022-12-31T23...
  * latitude    (latitude) float64 216B 7.5 7.25 7.0 6.75 ... 1.75 1.5 1.25 1.0
  * longitude   (longitude) float64 168B 99.5 99.75 100.0 ... 104.0 104.2 104.5
    expver      (valid_time) <U4 140kB ...
Data variables:
    ssr         (valid_time, latitude, longitude) float32 20MB ...
    ssrd        (valid_time, latitude, longitude) float32 20MB ...
    tisr        (valid_time, latitude, longitude) float32 20MB ...
Attributes:
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             European Centre for Medium-Range Weather Forecasts
    history:                 2025-09-25T11:38 GRIB to CDM+CF via cfgrib-0.9.1...
