# Workshop Context and Objectives

This notebook is a hands-on workshop exercise that demonstrates how to build a workflow going from remote sensing and climate data to estimates of missing spatio-temporal information, using an analogue approach.

## Context

<details>
<summary><b>Details</b></summary>

Suppose we want to run a fully distributed hydrological model over the Volta River Basin. Such models require continuous daily inputs over long time periods, but in many regions observational datasets are incomplete, temporally sparse, or unavailable for recent years. To overcome this limitation, we will use a data-driven analogue method to generate the missing information needed to run the model.

<div style="display:flex; gap:10px;">

<img src="https://raw.githubusercontent.com/LoicGerber/synthetic_data_generation_workshop/52acd5850ad2311637ecf5704099c15e947f975e/isohyets.png" width="35%">

<img src="https://raw.githubusercontent.com/LoicGerber/synthetic_data_generation_workshop/52acd5850ad2311637ecf5704099c15e947f975e/dem.png" width="34.8%">

</div>

In this workshop, we will use GLEAM evapotranspiration data as our target variable. The goal is to learn the relationships between climate conditions and evapotranspiration from this reference dataset, and then generate synthetic images that statistically and spatially resemble GLEAM. To achieve this, we will use ERA5-Land temperature and precipitation as predictor variables, allowing us to reconstruct realistic evapotranspiration fields even when observations are missing.

</details>

## Part 1 - Accessing, downloading, and visualizing data

<details>
<summary><b>Details</b></summary>

In the first part, we focus on data acquisition and preprocessing:
- Access remote sensing and reanalysis data directly from Google Earth Engine (GEE).
- Download Actual Evapotranspiration (AET) from the WaPOR Level 1 product (FAO).
- Download daily climate predictors (precipitation and temperature) from ERA5-Land.
- Spatially subset all datasets to the Volta River Basin using HydroSHEDS geometries.
- Convert GEE ImageCollections into local NumPy arrays for simple Python handling.
- Save and reload processed datasets efficiently using compressed .npz files for reproducibility.
- Visualize AET, precipitation, and temperature maps for qualitative inspection.

</details>

## Part 2 – kNN-based analogue modeling: method understanding and validation

<details>
<summary><b>Details</b></summary>

This second part focuses on understanding the kNN-based analogue method through a simple, step-by-step demonstration. Following this illustrative example, the method is applied to reconstruct three full years of daily data within a validation framework, allowing for qualitative and quantitative assessment of its performance.
- Load pre-processed NetCDF datasets of ET (target), precipitation, and temperature (predictors) clipped to the Volta Basin.
- Ensure temporal alignment of predictors and target.
- Optionally subset the spatial domain for faster computation.
- Construct covariate features, including lagged predictors (*climate window*), to capture temporal dynamics.
- Split the dataset into training and testing periods based on years.
- Apply k-nearest neighbors (kNN) regression, where each unobserved ET image is estimated by finding the most similar historical climate “analogues.”
- Visualize predicted versus observed ET maps for qualitative assessment.
- Perform quantitative evaluation using Root Mean Squared Error (RMSE) over space and time.

</details>

## Part 3 – kNN-based analogue modeling: production run for 2021–2025

<details>
<summary><b>Details</b></summary>

In this final part, we move from method validation to a full production run, applying the kNN-based analogue approach to estimate daily ET for the period 2021–2025, using all available historical observations up to 2020 as training data.

Key steps include:
- Prepare training datasets from ET (target) and climate predictors (precipitation and temperature) for all available historical data.
- Apply the kNN analogue method to generate daily ET estimates for 2021–2025.
- Compute _analogue-based uncertainty_ for each generated day, defined as the weighted standard deviation across the k nearest analogues.
- Visualize daily ET reconstructions and their uncertainty.
- Save the production datasets in both NetCDF and compressed `.npz` formats.
- Copy final outputs to Google Drive for long-term storage and reproducibility.

</details>

## Learning Outcomes

<details>
<summary><b>Details</b></summary>

By the end of this workshop, participants will be able to:
1. Access and process geospatial datasets from GEE and local NetCDF files.
2. Handle spatio-temporal data using `NumPy` and `xarray` efficiently.
3. Apply masking and subsetting to focus on a specific river basin.
4. Understand and implement a kNN-based analogue approach for estimating missing or unobserved images.
5. Validate analogue-based reconstructions using visual diagnostics and RMSE.
6. Apply the same method in a production setting to temporally disaggregate remote sensing products.
7. Build reproducible workflows that integrate data acquisition, preprocessing, modeling, validation, and production runs.

</details>

___

## Part 3 – Production run: generating ET for 2021–2025

In this final part, we run the analogue-based kNN model in production mode.

All available observations up to the end of 2020 are used for training, and
ET is generated for the period 2021–2025 using climate predictors only.

The methodology, predictors, distance metric, and kNN configuration are
identical to Part 2.  
The only difference is that no validation is performed.

## Import libraries and define helper functions

> **Note:** This block must be run, but we will not look into it in detail.

In this first step, we import all Python libraries required throughout the notebook. Key libraries include:
- `NumPy` and `xarray` for numerical data handling
- `matplotlib` for visualization
- `scikit-learn` for the nearest-neighbor search behind the kNN method

In [None]:
# Import required libraries
import numpy as np
from google.colab import files, drive
import matplotlib.pyplot as plt
import xarray as xr
import gdown
from sklearn.neighbors import NearestNeighbors
import os

### Prepare production data
We define a function `prepare_production_data` to:
- Construct covariates (precipitation and temperature)
- Optionally include lagged values (`time_window`)
- Split the dataset into training and production sets based on availability of the target dataset

In [None]:
def prepare_production_data(et_ds, pre_ds, tmax_ds, time_window):
    """
    Prepare X_train, y_train, X_prod, dates_train, dates_prod
    for production runs where ET is unavailable in the future.

    Training uses all available ET.
    Production uses predictors only, after the last ET date.
    """

    # ==========================================================
    # 1) EXTRACT VARIABLE NAMES FROM DATASETS
    # ==========================================================
    # Assumes each dataset contains only one variable
    target_var = list(et_ds.data_vars)[0]
    pre_var    = list(pre_ds.data_vars)[0]
    tmax_var   = list(tmax_ds.data_vars)[0]

    # Extract DataArrays
    et_da   = et_ds[target_var]
    pre_da  = pre_ds[pre_var]
    tmax_da = tmax_ds[tmax_var]

    time_dim = "time"

    # ==========================================================
    # 2) SPLIT TIME INTO TRAINING AND PRODUCTION PERIODS
    # ==========================================================
    # Find last available date for ET observations
    last_et_time = et_da[time_dim].max().values

    # Training predictors → same period as ET
    pre_train  = pre_da.sel(time=slice(None, last_et_time))
    tmax_train = tmax_da.sel(time=slice(None, last_et_time))

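    # Production predictors → future period (after ET ends).
    # Keep the last `time_window` training days as lag context: build_features
    # drops the first `time_window` steps, so production starts on the first
    # day after last_et_time (assumes both predictors share one time axis).
    n_hist    = pre_train.sizes[time_dim]
    lag_start = max(n_hist - time_window, 0)
    pre_prod  = pre_da.isel({time_dim: slice(lag_start, None)})
    tmax_prod = tmax_da.isel({time_dim: slice(lag_start, None)})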

    # ==========================================================
    # 3) BUILD FEATURE STACK WITH LAGGED VARIABLES
    # ==========================================================
    # This inner function builds predictors with time lags
    def build_features(pre_da, tmax_da):

        feat_list = []

        # Loop through predictors
        for da in (pre_da, tmax_da):

            # Current day predictor
            feat_list.append(da)

            # Add lagged predictors (t-1, t-2, ..., t-n)
            for lag in range(1, time_window + 1):
                feat_list.append(
                    da.shift({time_dim: lag})
                )

        # Concatenate all predictors into a single feature dimension
        features = xr.concat(feat_list, dim="feature")

        # Remove first days where lagged data is incomplete
        features = features.isel({time_dim: slice(time_window, None)})

        return features

    # Build training and production predictor stacks
    X_train_da = build_features(pre_train, tmax_train)
    X_prod_da  = build_features(pre_prod,  tmax_prod)

    # ==========================================================
    # 4) ALIGN ET TARGET WITH TRAINING FEATURES
    # ==========================================================
    # Remove first days so ET matches lagged predictors
    et_eff = et_da.isel(time=slice(time_window, None))

    # ==========================================================
    # 5) IDENTIFY SPATIAL DIMENSIONS
    # ==========================================================
    # Everything except time is considered spatial
    spatial_dims = [d for d in et_eff.dims if d != time_dim]

    # ==========================================================
    # 6) CONVERT XARRAY DATA → NUMPY ARRAYS
    # ==========================================================
    # Format predictors for ML:
    # (time, features, lat, lon)
    X_train = X_train_da.transpose(
        time_dim, "feature", *spatial_dims
    ).values

    X_prod = X_prod_da.transpose(
        time_dim, "feature", *spatial_dims
    ).values

    # Format target ET:
    # (time, lat, lon)
    y_train = et_eff.transpose(
        time_dim, *spatial_dims
    ).values

    # ==========================================================
    # 7) EXTRACT DATE ARRAYS
    # ==========================================================
    dates_train = et_eff[time_dim].values
    dates_prod  = X_prod_da[time_dim].values

    # ==========================================================
    # RETURN RESULTS
    # ==========================================================
    return X_train, y_train, X_prod, dates_train, dates_prod
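
To make the lagging step concrete, here is a minimal NumPy-only sketch of what `build_features` does for a single pixel and a single predictor. It assumes a toy 6-day series and `time_window = 2`. The helper `lagged_stack` is hypothetical and only stands in for the `shift` and `isel` calls above.

```python
import numpy as np

def lagged_stack(series, time_window):
    """Stack a 1D series with its lags t-1 ... t-time_window (hypothetical helper)."""
    series = np.asarray(series, dtype=float)
    cols = []
    for lag in range(time_window + 1):
        shifted = np.full_like(series, np.nan)
        # Values move forward in time by `lag` steps, leaving NaNs at the start
        shifted[lag:] = series[:len(series) - lag]
        cols.append(shifted)
    stacked = np.stack(cols, axis=1)      # shape: (time, time_window + 1)
    return stacked[time_window:]          # drop days with incomplete lags

precip = [0.0, 5.0, 2.0, 0.0, 1.0, 8.0]   # toy daily precipitation (made up)
features = lagged_stack(precip, time_window=2)

print(features.shape)   # (4, 3): two days lost, three feature columns
print(features[0])      # values for day 2, day 1, day 0
```

With two predictors and `time_window = 2`, the stack built in `prepare_production_data` therefore holds 2 × 3 = 6 feature layers per day.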

### kNN estimation

- For each query day, kNN identifies the `k_neighbors` closest climate analogues in the training set.
- Each analogue is weighted by similarity (inverse distance).
- The predicted ET is the weighted mean of the analogue ET maps.
- The "uncertainty" is the weighted standard deviation of the analogue ET values around that mean.
- Predictions are generated independently for each day.

> **Note:** This uncertainty reflects the spread of the analogue ensemble, not a formal statistical confidence interval.  
> It indicates regions or days where the selected analogues are less consistent (larger spread = higher analogue-based uncertainty).
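
Written out for one query day and one pixel, the quantities computed by the function below are as follows. The notation is ours, not the notebook's: $d_j$ is the Euclidean distance to analogue $j$, $y_j$ is its ET value, and $\varepsilon = 10^{-12}$ is the small constant the code adds to avoid division by zero.

$$
w_j = \frac{1/(d_j + \varepsilon)}{\sum_{m=1}^{k} 1/(d_m + \varepsilon)}, \qquad
\hat{y} = \sum_{j=1}^{k} w_j\, y_j, \qquad
\sigma = \sqrt{\sum_{j=1}^{k} w_j \left(y_j - \hat{y}\right)^2}
$$

Because the weights sum to 1, $\sigma$ is zero when all $k$ analogues agree and grows as they diverge.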

In [None]:
def knn_predict_with_uncertainty(X_train, y_train, X_query, k_neighbors):
    """
    KNN analogue prediction with uncertainty from analogue spread.

    Parameters
    ----------
    X_train : (T_train, n_features, latX, lonX)
        Predictor variables for training period
    y_train : (T_train, latY, lonY)
        Target variable for training period
    X_query : (T_query, n_features, latX, lonX)
        Predictor variables for query period
    k_neighbors : int
        Number of analogues (nearest neighbors)

    Returns
    -------
    y_mean : (T_query, latY, lonY)
        Predicted mean field
    y_std  : (T_query, latY, lonY)
        Analogue-based uncertainty (weighted std)
    """

    # -------------------------------
    # Extract dimensions
    # -------------------------------
    T_train, n_features, n_latX, n_lonX = X_train.shape
    T_query = X_query.shape[0]
    _, n_latY, n_lonY = y_train.shape

    # Total number of spatial pixels
    NX = n_latX * n_lonX
    NY = n_latY * n_lonY

    # ==========================================================
    # X MASKING — remove pixels that contain NaNs in ANY time or feature
    # ==========================================================

    # Combine train + query to ensure consistent valid mask
    X_all = np.concatenate([X_train, X_query], axis=0)

    # Valid pixel mask → True where all values finite across time + features
    valid_X = np.all(np.isfinite(X_all), axis=(0, 1))

    # Flatten spatial mask for vector indexing
    mask_X = valid_X.ravel()

    # Reshape predictors → (time, features, pixels)
    # then keep only valid pixels
    Xtr = X_train.reshape(T_train, n_features, NX)[:, :, mask_X]
    Xq  = X_query.reshape(T_query, n_features, NX)[:, :, mask_X]

    # Flatten feature + pixel dimensions into a single vector per time step
    # final shapes:
    #   Xtr_vec = (T_train, n_features * n_valid_pixels)
    #   Xq_vec  = (T_query, n_features * n_valid_pixels)
    Xtr_vec = Xtr.reshape(T_train, -1)
    Xq_vec  = Xq.reshape(T_query,  -1)

    # ==========================================================
    # Y MASKING — remove pixels invalid in training target
    # ==========================================================

    # Valid target pixels across ALL training time
    valid_Y = np.all(np.isfinite(y_train), axis=0)

    # Flatten spatial mask
    mask_Y = valid_Y.ravel()

    # Reshape y_train → (time, pixels) and keep valid ones
    ytr = y_train.reshape(T_train, NY)[:, mask_Y]

    # ==========================================================
    # NEAREST NEIGHBOUR SEARCH
    # ==========================================================

    # Build KNN search structure
    nn = NearestNeighbors(
        n_neighbors=k_neighbors,
        metric="euclidean",
        n_jobs=-1              # use all CPU cores
    )

    # Fit on training predictors
    nn.fit(Xtr_vec)

    # Find k nearest analogues for each query timestep
    distances, indices = nn.kneighbors(Xq_vec)

    # ==========================================================
    # DISTANCE → WEIGHTS
    # ==========================================================

    # Convert distances to inverse-distance weights
    # (small distance → large weight)
    weights = 1.0 / (distances + 1e-12)

    # Normalize weights so each row sums to 1
    weights /= weights.sum(axis=1, keepdims=True)

    # ==========================================================
    # ANALOGUE PREDICTIONS
    # ==========================================================

    # Dimensions
    Tq, k = indices.shape

    # Gather analogue target values with fancy indexing:
    # ytr[indices] has shape (query_time, k_neighbors, valid_pixels)
    y_pred = ytr[indices]

    # ==========================================================
    # WEIGHTED MEAN PREDICTION
    # ==========================================================

    # Weighted average across k analogues
    y_mean_flat = np.sum(
        weights[:, :, None] * y_pred,
        axis=1
    )

    # ==========================================================
    # UNCERTAINTY ESTIMATION
    # ==========================================================

    # Weighted variance across analogues
    y_var_flat = np.sum(
        weights[:, :, None] * (y_pred - y_mean_flat[:, None, :])**2,
        axis=1
    )

    # Standard deviation = uncertainty estimate
    y_std_flat = np.sqrt(y_var_flat)

    # ==========================================================
    # RESTORE FULL SPATIAL GRID (reinsert masked pixels)
    # ==========================================================

    # Initialize full grids with NaNs
    y_mean = np.full((Tq, NY), np.nan)
    y_std  = np.full((Tq, NY), np.nan)

    # Fill valid pixels only
    y_mean[:, mask_Y] = y_mean_flat
    y_std[:,  mask_Y] = y_std_flat

    # Reshape back to spatial maps
    y_mean = y_mean.reshape(Tq, n_latY, n_lonY)
    y_std  = y_std.reshape(Tq, n_latY, n_lonY)

    return y_mean, y_std
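
As a quick sanity check of the weighting, the sketch below reproduces the mean and spread computation for one query day on a toy problem. It is NumPy-only: a brute-force distance search stands in for `NearestNeighbors`, and all values are made up for illustration.

```python
import numpy as np

# Toy training set: 3 days, each described by 2 climate features
X_train = np.array([[1.0, 0.0],
                    [0.0, 3.0],
                    [4.0, 4.0]])
# Toy ET "maps" with 2 pixels per training day
y_train = np.array([[2.0, 4.0],
                    [6.0, 4.0],
                    [9.0, 9.0]])

x_query = np.array([0.0, 0.0])   # one query day
k = 2

# Euclidean distance from the query to every training day: 1, 3, ~5.66
dist = np.linalg.norm(X_train - x_query, axis=1)

# Indices and distances of the k closest analogues
nearest = np.argsort(dist)[:k]
d_k = dist[nearest]

# Inverse-distance weights, normalized to sum to 1 (about 0.75 and 0.25)
weights = 1.0 / (d_k + 1e-12)
weights /= weights.sum()

# Weighted mean and weighted standard deviation per pixel
analogues = y_train[nearest]                       # shape: (k, pixels)
y_mean = np.sum(weights[:, None] * analogues, axis=0)
y_std = np.sqrt(np.sum(weights[:, None] * (analogues - y_mean) ** 2, axis=0))

print(np.round(y_mean, 3))   # approx. 3 and 4
print(np.round(y_std, 3))    # approx. 1.732 and 0
```

The second pixel has zero spread because both analogues agree on its value, while the first pixel inherits the disagreement between the two analogue maps.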

### Visualizing Observed vs Predicted Maps

In [None]:
def plot_maps(y_obs, y_mean, y_std, dates, lon, lat, indices):
    """
    Plot observed ET, predicted ET, error, and uncertainty maps.

    Parameters
    ----------
    y_obs : array (T, lat, lon)
        Observed / reference ET (use None for production runs).
    y_mean : array (T, lat, lon)
        Mean reconstructed ET.
    y_std : array (T, lat, lon)
        Uncertainty (analogue spread).
    dates : array-like (T,)
        Datetime array.
    lon, lat : 1D or 2D arrays
        Spatial coordinates.
    indices : list of int
        Time indices to visualise.
    """

    # ==========================================================
    # PREPARE SPATIAL GRID
    # ==========================================================
    # If coordinates are 1D vectors, convert to 2D meshgrid
    # so they match the shape required by pcolormesh.
    if lon.ndim == 1 and lat.ndim == 1:
        Lon, Lat = np.meshgrid(lon, lat)
    else:
        # Already 2D coordinate grids
        Lon, Lat = lon, lat

    # ==========================================================
    # LOOP THROUGH REQUESTED TIME INDICES
    # ==========================================================
    for idx in indices:

        # Extract prediction and uncertainty for this timestep
        y_pred = y_mean[idx]
        y_unc  = y_std[idx]

        # Convert date to readable format
        date = np.array(dates[idx]).astype("datetime64[D]")

        # ------------------------------------------------------
        # IF REFERENCE DATA EXISTS → compute error + shared scale
        # ------------------------------------------------------
        if y_obs is not None:

            # Reference map
            y_ref = y_obs[idx]

            # Prediction error map
            y_err = y_pred - y_ref

            # Use same color scale for observed + predicted
            vmin  = np.nanmin([y_ref, y_pred])
            vmax  = np.nanmax([y_ref, y_pred])

            # Error scale based on robust percentile
            vmax_err = np.nanpercentile(np.abs(y_err), 95)

        # ------------------------------------------------------
        # IF NO REFERENCE DATA (production mode)
        # ------------------------------------------------------
        else:
            y_ref = None
            y_err = None

            # Scale only based on prediction
            vmin  = np.nanmin(y_pred)
            vmax  = np.nanmax(y_pred)

        # Robust scale for uncertainty
        vmax_std = np.nanpercentile(y_unc, 95)

        # ------------------------------------------------------
        # DETERMINE NUMBER OF PANELS
        # ------------------------------------------------------
        # If reference exists → show 4 panels
        # Otherwise → only prediction + uncertainty
        ncols = 4 if y_obs is not None else 2

        # Create figure
        fig, axes = plt.subplots(
            1, ncols,
            figsize=(4.5 * ncols, 4),
            constrained_layout=True
        )

        # Column pointer for flexible plotting
        col = 0

        # ======================================================
        # PANEL 1 — OBSERVED MAP
        # ======================================================
        if y_obs is not None:

            im0 = axes[col].pcolormesh(
                Lon, Lat, y_ref,
                shading="auto",
                vmin=vmin, vmax=vmax
            )

            axes[col].set_title(f"Observed ET\n{date}")
            axes[col].set_aspect("equal")

            # Colorbar
            c0 = plt.colorbar(im0, ax=axes[col])
            c0.set_label("[mm/day]")

            col += 1

        # ======================================================
        # PANEL 2 — PREDICTED MAP
        # ======================================================
        im1 = axes[col].pcolormesh(
            Lon, Lat, y_pred,
            shading="auto",
            vmin=vmin, vmax=vmax
        )

        # Title depends on mode
        if y_obs is not None:
            axes[col].set_title("Predicted ET")
        else:
            axes[col].set_title(f"Predicted ET\n{date}")

        axes[col].set_aspect("equal")

        # Colorbar
        c1 = plt.colorbar(im1, ax=axes[col])
        c1.set_label("[mm/day]")

        col += 1

        # ======================================================
        # PANEL 3 — ERROR MAP (only if reference exists)
        # ======================================================
        if y_obs is not None:

            im2 = axes[col].pcolormesh(
                Lon, Lat, y_err,
                shading="auto",
                vmin=-vmax_err, vmax=vmax_err,
                cmap="coolwarm"   # diverging colormap for errors
            )

            axes[col].set_title("Error (Pred − Ref)")
            axes[col].set_aspect("equal")

            # Colorbar
            c2 = plt.colorbar(im2, ax=axes[col])
            c2.set_label("[mm/day]")

            col += 1

        # ======================================================
        # FINAL PANEL — UNCERTAINTY MAP
        # ======================================================
        im3 = axes[col].pcolormesh(
            Lon, Lat, y_unc,
            shading="auto",
            vmin=0, vmax=vmax_std,
            cmap="magma"      # sequential colormap for uncertainty
        )

        axes[col].set_title("Uncertainty (analogue spread)")
        axes[col].set_aspect("equal")

        # Colorbar
        c3 = plt.colorbar(im3, ax=axes[col])
        c3.set_label("[mm/day]")

        # ======================================================
        # DISPLAY FIGURE
        # ======================================================
        plt.show()

### Quantitative evaluation using RMSE

Here, we compute the Root Mean Squared Error (RMSE):

$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_{obs,i} - y_{pred,i}\right)^2}$

- $N$ is the number of valid (non-NaN) pixels at a given time step.
- Computed spatially per time step.
- Can also be aggregated over the entire test period.
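
A small worked example, with made-up values, shows how the spatial RMSE for one time step behaves when a pixel is missing. `np.nanmean` averages over the valid pixels only.

```python
import numpy as np

# One time step on a 2 x 2 grid; NaN marks a pixel without observations
y_obs  = np.array([[1.0, 2.0],
                   [3.0, np.nan]])
y_pred = np.array([[3.0, 0.0],
                   [5.0, 7.0]])

diff = y_obs - y_pred                  # -2, 2, -2, NaN
rmse = np.sqrt(np.nanmean(diff ** 2))  # mean of 4, 4, 4 over 3 valid pixels

print(rmse)  # 2.0
```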

In [None]:
def compute_rmse(y_obs, y_pred, dates, max_gap_days=1):
    """
    Compute RMSE over space for each time step and split by date gaps.

    Returns
    -------
    rmse : (T,)
    segments : list of index arrays for continuous date segments
    """

    # ==========================================================
    # INITIALIZATION
    # ==========================================================
    # Number of time steps
    T = y_obs.shape[0]

    # Array to store RMSE at each timestep
    rmse = np.zeros(T)

    # ==========================================================
    # COMPUTE SPATIAL RMSE PER TIME STEP
    # ==========================================================
    for t in range(T):

        # Difference map between observation and prediction
        diff = y_obs[t] - y_pred[t]

        # RMSE over spatial domain (ignores NaNs)
        rmse[t] = np.sqrt(np.nanmean(diff**2))

    # ==========================================================
    # DETECT TEMPORAL GAPS IN THE DATE SERIES
    # ==========================================================
    # Convert dates to numpy array for vectorized operations
    dates = np.asarray(dates)

    # Compute day differences between consecutive dates
    gaps = np.diff(dates).astype("timedelta64[D]").astype(int)

    # Identify where gaps exceed allowed threshold
    breaks = np.where(gaps > max_gap_days)[0]

    # ==========================================================
    # SPLIT TIME SERIES INTO CONTINUOUS SEGMENTS
    # ==========================================================
    segments = []

    # Start index of current segment
    start = 0

    # Loop over detected breaks
    for b in breaks:

        # Segment runs from current start → break index
        segments.append(np.arange(start, b + 1))

        # Next segment starts after the break
        start = b + 1

    # Add final segment after last break
    segments.append(np.arange(start, T))

    # ==========================================================
    # RETURN RESULTS
    # ==========================================================
    return rmse, segments
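
The gap detection can be checked on a short, made-up date series. This sketch uses `np.split` as a compact alternative to the loop above; it is meant to produce the same segments, not to replace the function.

```python
import numpy as np

# Toy date series with a jump between 2018-01-03 and 2019-01-01
dates = np.array(
    ["2018-01-01", "2018-01-02", "2018-01-03", "2019-01-01", "2019-01-02"],
    dtype="datetime64[D]",
)

# Day differences between consecutive dates: 1, 1, 363, 1
gaps = np.diff(dates).astype("timedelta64[D]").astype(int)

# Positions where the gap exceeds one day
breaks = np.where(gaps > 1)[0]

# Split the index range right after each break
segments = np.split(np.arange(len(dates)), breaks + 1)

print([seg.tolist() for seg in segments])  # [[0, 1, 2], [3, 4]]
```

One possible use of the segments is to plot each one separately, so that a time series of RMSE does not draw a line across periods without data.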

## 3.1 - Temporal coverage for production

The datasets downloaded from Google Drive in the cell below contain:
- ET (GLEAM): daily, 1980–2020
- Precipitation (ERA5-Land): daily, 1950–2025
- Temperature (ERA5-Land): daily, 1950–2025

This allows us to:
- train the model on the historical ET record (the cell below keeps 2010–2020 via `t_start`)
- generate ET for 2021–2025

In [None]:
url_et   = "https://drive.google.com/file/d/1QKJCwt44LdsQNpNBIfFA4-kFs2MTaJRD/view?usp=drive_link"
url_pre  = "https://drive.google.com/file/d/1ZyRFuYSFle5zqPpyoSBXLiyIp5frSCWk/view?usp=drive_link"
url_tmax = "https://drive.google.com/file/d/1Ru0eKa_DrvjA2x_F1MF7EPixge2VXIJh/view?usp=drive_link"

gdown.download(url_et,   output="et.nc",   fuzzy=True)
gdown.download(url_pre,  output="pre.nc",  fuzzy=True)
gdown.download(url_tmax, output="tmax.nc", fuzzy=True)

et   = xr.open_dataset("et.nc")
pre  = xr.open_dataset("pre.nc")
tmax = xr.open_dataset("tmax.nc")

pre  =  pre.rename({'longitude': 'lon', 'latitude': 'lat'})
tmax = tmax.rename({'longitude': 'lon', 'latitude': 'lat'})

# Extend predictors beyond 2020 for production
t_start = "2010-01-01"
t_end   = "2025-12-31"

et   =   et.sel(time=slice(t_start, "2020-12-31"))   # ET only available until 2020
pre  =  pre.sel(time=slice(t_start, t_end))
tmax = tmax.sel(time=slice(t_start, t_end))

# Predictors are deliberately NOT aligned to ET's time axis here:
# they must extend past 2020 to cover the production period.

## 3.2 - Preparing training and production datasets

We reuse the same `prepare_data` logic as in Part 2, adapted to the production run and implemented here as `prepare_production_data`.
The last five years (2021–2025), which would form the **test block** in a validation setting, now
correspond to the production period, for which no ET observations exist.

We prepare:
- `X_train` and `y_train` from all available historical ET
- `X_prod` from predictors only

In [None]:
X_train, y_train, X_prod, dates_train, dates_prod = prepare_production_data(
    et_ds       = et,
    pre_ds      = pre,
    tmax_ds     = tmax,
    time_window = 2
)

print("Training:", dates_train[0], "-", dates_train[-1])
print("Production:", dates_prod[0], "-", dates_prod[-1])

## 3.3 - kNN analogue prediction with uncertainty

We now run the kNN-based analogue reconstruction.  

- The mean ET is the weighted average of the `k_neighbors` closest analogues.
- The uncertainty is the weighted standard deviation of the analogue ET values.

> This uncertainty reflects the spread among analogues, not a formal statistical confidence interval.


In [None]:
y_prod_mean, y_prod_std = knn_predict_with_uncertainty(
    X_train     = X_train,
    y_train     = y_train,
    X_query     = X_prod,
    k_neighbors = 20
)

print("Production ET generated for:")
print(dates_prod[0], "-", dates_prod[-1])

## 3.4 - Visualising production ET and uncertainty

In [None]:
target_var = list(et.data_vars)[0]
et_da      = et[target_var]
lat        = et_da["lat"].values
lon        = et_da["lon"].values

plot_maps(
    y_obs   = None,
    y_mean  = y_prod_mean,
    y_std   = y_prod_std,
    dates   = dates_prod,
    lon     = lon,
    lat     = lat,
    indices = [0, 120, 365] # <------------------------------------ CHANGE DAYS TO VISUALIZE HERE (based on index)
)

We also examine temporal and spatial patterns of uncertainty:
- Spatially averaged uncertainty per day
- Mean uncertainty map over 2021–2025
- Distribution of ET uncertainty values

In [None]:
# Spatial mean uncertainty per day
uncertainty_time_mean = np.nanmean(y_prod_std, axis=(1, 2))

plt.figure(figsize=(10, 3))
plt.plot(dates_prod, uncertainty_time_mean, lw=1.5)
plt.ylabel("Mean uncertainty [mm/day]")
plt.xlabel("Date")
plt.title("Spatially averaged ET uncertainty")
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

# Mean uncertainty map
mean_uncertainty_map = np.nanmean(y_prod_std, axis=0)
plt.figure(figsize=(5, 4))
im = plt.pcolormesh(lon, lat, mean_uncertainty_map, shading="auto", cmap="inferno")
plt.gca().set_aspect("equal")
plt.title("Mean ET uncertainty")
plt.xticks([]); plt.yticks([])
c = plt.colorbar(im, fraction=0.046, pad=0.04)
c.set_label("Mean uncertainty [mm/day]")
plt.tight_layout()
plt.show()

# Histogram of uncertainty
plt.figure(figsize=(5, 3))
plt.hist(y_prod_std.ravel(), bins=50, density=True, alpha=0.8)
plt.xlabel("Uncertainty [mm/day]")
plt.ylabel("Density")
plt.title("Distribution of ET uncertainty")
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

## 3.5 - Saving production ET and uncertainty

We save the results as:
- NetCDF (`et_mean` and `et_uncertainty`)
- Compressed `.npz` for lightweight local usage

Both files are copied to Google Drive for persistent storage.

In [None]:
# Save NetCDF
ds_prod = xr.Dataset(
    data_vars=dict(
        et_mean=(("time", "lat", "lon"), y_prod_mean),
        et_uncertainty=(("time", "lat", "lon"), y_prod_std),
    ),
    coords=dict(time=dates_prod, lat=lat, lon=lon),
    attrs=dict(
        title="Analogue-based evapotranspiration production run",
        method="kNN climate analogue reconstruction",
        uncertainty="Analogue spread (weighted standard deviation)",
        period="2021–2025",
        units="mm day-1",
    ),
)
ds_prod["et_mean"].attrs = dict(
    long_name="Daily evapotranspiration (mean reconstruction)",
    units="mm day-1",
    description="Weighted mean of k nearest climate analogues"
)
ds_prod["et_uncertainty"].attrs = dict(
    long_name="Evapotranspiration uncertainty",
    units="mm day-1",
    description="Weighted standard deviation across climate analogues"
)

output_path = "ET_production_2021_2025_mean_uncertainty.nc" # <---------------------- CHANGE FILE NAME HERE
ds_prod.to_netcdf(output_path)
files.download(output_path)

# Copy to Drive
drive.mount("/content/drive")                     # mount Drive first so the copy actually persists
name      = "Volta_ET_Production_2021_2025"       # <-------------------------------- CHANGE FILE NAME HERE
save_dir  = "/content/drive/MyDrive/SyntheticSatDataWorkshop2026" # <---------------- CHANGE DIRECTORY HERE
os.makedirs(save_dir, exist_ok=True)
data_path = os.path.join(save_dir, f"{name}.nc")
!cp "$output_path" "$data_path"
print(f"Production dataset saved to Google Drive at:\n{data_path}")

# Save compressed npz
np.savez_compressed(
    f"{name}.npz",
    et_mean=y_prod_mean,
    et_std=y_prod_std,
    dates=dates_prod,
    lat=lat,
    lon=lon
)
files.download(f"{name}.npz")
!cp "{name}.npz" "$save_dir"   # follows `name`, so no separate file name to edit here
print(f"Compressed dataset also saved to Google Drive at:\n{save_dir}/{name}.npz")
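
To confirm that a saved `.npz` file can be read back, a round trip like the following can be used. This is a minimal sketch with made-up arrays and a hypothetical file name (`example.npz`) written to a temporary directory. It does not touch the production files.

```python
import os
import tempfile
import numpy as np

# Toy stand-ins for the production arrays (shapes and values are made up)
dates = np.arange("2021-01-01", "2021-01-04", dtype="datetime64[D]")
et_mean = np.random.default_rng(0).random((3, 2, 2))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.npz")   # hypothetical file name

    # Write both arrays into one compressed archive
    np.savez_compressed(path, et_mean=et_mean, dates=dates)

    # Read them back; arrays are loaded by the key used when saving
    with np.load(path) as data:
        reloaded_mean = data["et_mean"]
        reloaded_dates = data["dates"]

print(reloaded_dates[0], reloaded_mean.shape)
```

The keys used when loading must match the keyword names passed to `np.savez_compressed`, which for the production file above are `et_mean`, `et_std`, `dates`, `lat`, and `lon`.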