#### Instructions on how to prepare the example data used in the notebooks
To prepare the example data yourself, first download the raw ERA5 data (2 meter temperature), resample it to a daily timescale, and regrid it to 2 degree resolution. These steps can be done with `era5cli` and `cdo`. See the [README.md](./README.md) for more information.

Afterward, please follow this notebook to prepare the clustered 2 meter temperature dataset.

In [1]:
import numpy as np
from pathlib import Path
import xarray as xr

Load your dataset

In [23]:
path_to_data = "./"
t2m_raw = xr.load_dataset(Path(path_to_data, "t2m_1959-2021_1_12_daily_2.0deg.nc"))
t2m_raw

Convert the longitudes to the 0-360 range, then sort by latitude and longitude so that both coordinates increase monotonically

In [3]:
t2m_raw = t2m_raw.assign_coords(longitude=((t2m_raw.longitude + 360) % 360))
t2m_raw = t2m_raw.sortby('longitude') # otherwise slicing does not work due to jump in data
t2m_raw = t2m_raw.sortby('latitude')
t2m_raw
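
The longitude wrap in the cell above can be checked on a handful of values. This is a minimal sketch in plain Python, using made-up sample longitudes rather than the ERA5 grid. It shows why the sort is needed after applying `(lon + 360) % 360`.

```python
# Hypothetical longitudes on a -180..180 convention (not taken from the dataset).
lons = [-180, -90, 0, 90]

# Same arithmetic as the assign_coords call: map every value into [0, 360).
wrapped = [(lon + 360) % 360 for lon in lons]
print(wrapped)  # [180, 270, 0, 90]

# The wrapped axis jumps from 270 back to 0, so it has to be sorted
# before range-based slicing behaves as expected.
print(sorted(wrapped))  # [0, 90, 180, 270]
```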

Load the cluster mask

In [4]:
data_folder = '~/AI4S2S/data'
cluster_mask = xr.open_dataset(Path(data_folder,'tf5_nc5_dendo_80d77.nc'))
cluster_mask

In [5]:
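# take every second grid point so the mask matches the 2 degree grid
cluster_mask_2_deg = cluster_mask["xrclustered"][::2, 1::2]
# drop_vars replaces the deprecated drop for removing variables/coordinates
cluster_mask_2_deg = cluster_mask_2_deg.drop_vars("tfreq")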
cluster_mask_2_deg = cluster_mask_2_deg.sortby('latitude')
cluster_mask_2_deg
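
The slice `[::2, 1::2]` keeps every second point along the first axis starting at index 0, and every second point along the second axis starting at index 1. A minimal sketch of that stride arithmetic follows. It assumes a 1 degree source grid, which the notebook does not state, and the coordinate values are made up.

```python
# Made-up 1-degree coordinate axes; the real mask's coordinates may differ.
lats = [30, 31, 32, 33, 34, 35]
lons = [225, 226, 227, 228, 229, 230]

# [::2] starts at index 0, [1::2] starts at index 1; both step by 2.
print(lats[::2])   # [30, 32, 34]
print(lons[1::2])  # [226, 228, 230]
```

The starting offset decides which of the fine-grid points survive, so it should be chosen so that the remaining coordinates line up with the grid of the temperature data.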

Match the domain of the input data to that of the mask

In [6]:
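def match_coords_xarrays(wanted_coords_arr, to_match):
    """Select from `to_match` the grid points nearest to the domain of `wanted_coords_arr`."""
    # grid spacing of the array that will be cropped
    dlon = to_match.longitude[:2].diff('longitude').item()
    dlat = to_match.latitude[:2].diff('latitude').item()
    # bounding box of the target domain
    lonmin = float(wanted_coords_arr.longitude.min())
    lonmax = float(wanted_coords_arr.longitude.max())
    latmin = float(wanted_coords_arr.latitude.min())
    latmax = float(wanted_coords_arr.latitude.max())
    # pick the nearest grid points on a regular axis spanning the bounding box
    return to_match.sel(longitude=np.arange(lonmin, lonmax+dlon,dlon),
                       latitude=np.arange(latmin, latmax+dlat,dlat),
                       method='nearest')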

In [7]:
# matching domain of xrclustered DataArray
t2m_raw = match_coords_xarrays(cluster_mask_2_deg, t2m_raw)
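
The matching step builds a regular axis from the mask's extent and the data's grid spacing, then picks the nearest data point for each wanted value. The sketch below reproduces that idea for a single axis with NumPy only. The coordinate values are made up, and the explicit `argmin` lookup is an illustrative stand-in for `sel(..., method='nearest')`, not xarray's actual implementation.

```python
import numpy as np

# Made-up coordinate axes: a short "mask" axis and a wider "data" axis.
mask_lon = np.array([226.0, 228.0, 230.0])
data_lon = np.arange(220.0, 241.0, 2.0)  # 220, 222, ..., 240

# Build the wanted axis from the mask's extent and the data's spacing,
# in the same way match_coords_xarrays uses np.arange.
dlon = float(data_lon[1] - data_lon[0])
wanted = np.arange(mask_lon.min(), mask_lon.max() + dlon, dlon)

# Nearest-neighbour lookup: for each wanted value take the closest data value.
idx = np.abs(data_lon[None, :] - wanted[:, None]).argmin(axis=1)
matched = data_lon[idx]
print(matched)
```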

Take the mean for each cluster using the cluster masks

In [8]:
list_data_arrays = []

for i in range(1, cluster_mask["n_clusters"].values + 1):
    print(f"Process cluster number {i}")
    t2m_cluster = (
        t2m_raw["t2m"]
        .where(cluster_mask_2_deg == i)  # keep only the grid cells that belong to cluster i
        .mean(dim=["latitude", "longitude"])
    )
    # tag the series with its cluster number as a scalar coordinate
    t2m_cluster = t2m_cluster.assign_coords(cluster=i)
    list_data_arrays.append(t2m_cluster)

Process cluster number 1
Process cluster number 2
Process cluster number 3
Process cluster number 4
Process cluster number 5
Process cluster number 6
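
The per-cluster average is a masked mean: cells outside the cluster are set to NaN and ignored when averaging. The following is a minimal NumPy sketch of that idea on a made-up 2x3 field with made-up labels. As in the cell above, the mean is unweighted.

```python
import numpy as np

# Made-up 2x3 temperature field (K) and cluster labels; 0 means "no cluster".
field = np.array([[280.0, 282.0, 290.0],
                  [284.0, 292.0, 294.0]])
labels = np.array([[1, 1, 2],
                   [1, 2, 0]])

cluster_means = {}
for i in (1, 2):
    # Keep cells belonging to cluster i, set the rest to NaN, then average.
    masked = np.where(labels == i, field, np.nan)
    cluster_means[i] = float(np.nanmean(masked))

print(cluster_means)  # {1: 282.0, 2: 291.0}
```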


In [9]:
t2m_target = xr.concat(list_data_arrays, dim='n_cluster')

In [10]:
t2m_target
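
To see what the concatenation produces, here is a small sketch with toy time series standing in for the cluster means. The values and the `time` axis are invented for illustration. Each series carries a scalar `cluster` coordinate, and `xr.concat` stacks them along a new `n_cluster` dimension.

```python
import numpy as np
import xarray as xr

series = []
for i in (1, 2, 3):
    # Toy time series standing in for one cluster mean.
    da = xr.DataArray(np.full(4, float(i)), dims="time", coords={"time": np.arange(4)})
    # Scalar coordinate recording which cluster this series belongs to.
    series.append(da.assign_coords(cluster=i))

# Concatenating along a new dimension stacks the series on axis 0; the
# differing scalar "cluster" values become a coordinate along that dimension.
stacked = xr.concat(series, dim="n_cluster")
print(stacked.dims)
print(stacked["cluster"].values)
```

With this layout, `n_cluster` is the dimension name and `cluster` holds the cluster numbers as a coordinate along it.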

Combine the clustered data and the cluster mask into a single dataset

In [18]:
t2m_target_dataset = xr.Dataset({"t2m": t2m_target, "xrcluster": cluster_mask_2_deg})
t2m_target_dataset

Add attributes to variables and the dataset

In [22]:
# add attribute for t2m
t2m_target_dataset["t2m"] = t2m_target_dataset["t2m"].assign_attrs(
    {"long_name": "clustered 2 meter temperature", "units": "K"}
)

# add attribute to cluster masks
t2m_target_dataset["xrcluster"].attrs = {"long_name": "Cluster masks for t2m",
                                         "units": 1,
                                         "method": "AgglomerativeClustering",
                                         "kwrgs": "{'q': 66, 'n_clusters': [2, 3, 4, 5, 6, 7, 8], 'affinity': 'jaccard', 'linkage': 'average'}",
                                         "target": "mx2t_exceedances_of_66th_percentile"}

# add global attribute to the dataset
t2m_target_dataset = t2m_target_dataset.assign_attrs(
    {
        "history": "The dataset contains 2 meter temperature clustered using Agglomerative Clustering approach w.r.t rainfall depth.",
        "source": "ERA5",
    }
)

t2m_target_dataset

Export the processed data as a NetCDF4 file

In [29]:
t2m_target_dataset.to_netcdf(Path(path_to_data, "t2m_daily_1950-2021_2deg_clustered_226_300E_30_70N.nc"))

In [11]:
path_to_data = "./"
sst = xr.load_dataset("./sst_daily_1950-2021_5deg_Pacific_175_240E_25_50N.nc")
sst