# Workshop Context and Objectives

This notebook is a hands-on workshop exercise that builds a complete workflow, from remote sensing and climate data to the estimation of missing spatio-temporal information with an analogue approach.

## Context

<details>
<summary><b>Details</b></summary>

Suppose we want to run a fully distributed hydrological model over the Volta River Basin. Such models require continuous daily inputs over long time periods, but in many regions observational datasets are incomplete, temporally sparse, or unavailable for recent years. To overcome this limitation, we will use a data-driven analogue method to generate the missing information needed to run the model.

<div style="display:flex; gap:10px;">

<img src="https://raw.githubusercontent.com/LoicGerber/synthetic_data_generation_workshop/52acd5850ad2311637ecf5704099c15e947f975e/isohyets.png" width="35%">

<img src="https://raw.githubusercontent.com/LoicGerber/synthetic_data_generation_workshop/52acd5850ad2311637ecf5704099c15e947f975e/dem.png" width="34.8%">

</div>

In this workshop, we will use GLEAM evapotranspiration data as our target variable. The goal is to learn the relationships between climate conditions and evapotranspiration from this reference dataset, and then generate synthetic images that statistically and spatially resemble GLEAM. To achieve this, we will use ERA5-Land temperature and precipitation as predictor variables, allowing us to reconstruct realistic evapotranspiration fields even when observations are missing.

</details>

## Part 1 - Accessing, downloading, and visualizing data

<details>
<summary><b>Details</b></summary>

In the first part, we focus on data acquisition and preprocessing:
- Access remote sensing and reanalysis data directly from Google Earth Engine (GEE).
- Download Actual Evapotranspiration (AET) from the WaPOR Level 1 product (FAO).
- Download daily climate predictors (precipitation and temperature) from ERA5-Land.
- Spatially subset all datasets to the Volta River Basin using HydroSHEDS geometries.
- Convert GEE ImageCollections into local NumPy arrays for simple handling in Python.
- Save and reload processed datasets efficiently using compressed .npz files for reproducibility.
- Visualize AET, precipitation, and temperature maps for qualitative inspection.

</details>

## Part 2 - kNN-based analogue modeling: method understanding and validation

<details>
<summary><b>Details</b></summary>

This second part focuses on understanding the kNN-based analogue method through a simple, step-by-step demonstration. Following this illustrative example, the method is applied to reconstruct three full years of daily data within a validation framework, allowing for qualitative and quantitative assessment of its performance.
- Load pre-processed NetCDF datasets of ET (target), precipitation, and temperature (predictors) clipped to the Volta Basin.
- Ensure temporal alignment of predictors and target.
- Optionally subset the spatial domain for faster computation.
- Construct covariate features, including lagged predictors (*climate window*), to capture temporal dynamics.
- Split the dataset into training and testing periods based on years.
- Apply k-nearest neighbors (kNN) regression, where each unobserved ET image is estimated by finding the most similar historical climate “analogues.”
- Visualize predicted versus observed ET maps for qualitative assessment.
- Perform quantitative evaluation using Root Mean Squared Error (RMSE) over space and time.
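
The steps listed above can be sketched on synthetic data. This is a minimal illustration using only NumPy, not the workshop's implementation: the array sizes, the two-day climate window, `k = 3`, and the inverse-distance weighting are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumed shapes): 60 days, coarse predictors, finer target
n_days = 60
prec = rng.gamma(2.0, 2.0, size=(n_days, 4, 5))       # (time, y, x), coarse grid
temp = 25 + 5 * rng.standard_normal((n_days, 4, 5))   # (time, y, x), coarse grid
et = rng.uniform(0, 6, size=(n_days, 12, 15))         # (time, y, x), fine grid

def build_features(prec, temp, window=2):
    """Flatten the predictors and append `window` lagged days (climate window)."""
    daily = np.concatenate(
        [prec.reshape(len(prec), -1), temp.reshape(len(temp), -1)], axis=1
    )
    # Standardize so precipitation and temperature contribute comparably
    daily = (daily - daily.mean(axis=0)) / (daily.std(axis=0) + 1e-9)
    # The row for day t holds days t, t-1, ..., t-window
    lagged = [daily[window - lag : len(daily) - lag] for lag in range(window + 1)]
    return np.concatenate(lagged, axis=1)

window = 2
X = build_features(prec, temp, window)   # (n_days - window, n_features)
Y = et[window:]                          # targets aligned with the feature rows

# Split into a training period (first 50 rows) and a testing period (the rest)
n_train = 50
X_train, Y_train = X[:n_train], Y[:n_train]
X_test, Y_test = X[n_train:], Y[n_train:]

def knn_analogue(X_train, Y_train, X_query, k=3):
    """Estimate each query image as a distance-weighted mean of its k analogues."""
    # Euclidean distance between every query day and every training day
    dist = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    idx = np.argsort(dist, axis=1)[:, :k]              # k closest training days
    d_k = np.take_along_axis(dist, idx, axis=1)
    w = 1.0 / (d_k + 1e-9)
    w /= w.sum(axis=1, keepdims=True)                  # weights sum to 1 per query
    # Weighted average of analogue images: (q, k) x (q, k, y, x) -> (q, y, x)
    return np.einsum("qk,qkyx->qyx", w, Y_train[idx])

Y_pred = knn_analogue(X_train, Y_train, X_test, k=3)

# RMSE per pixel (over time) and per day (over space)
rmse_map = np.sqrt(np.nanmean((Y_pred - Y_test) ** 2, axis=0))
rmse_ts = np.sqrt(np.nanmean((Y_pred - Y_test) ** 2, axis=(1, 2)))
print(Y_pred.shape, rmse_map.shape, rmse_ts.shape)
```

Note that the predictors and the target live on different grids: the predictors are only used to rank the training days, and the estimate is assembled from the target images of the selected days.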

</details>

## Part 3 - kNN-based analogue modeling: production run for 2021–2025

<details>
<summary><b>Details</b></summary>

In this final part, we move from method validation to a full production run, applying the kNN-based analogue approach to estimate daily ET for the period 2021–2025, using all available historical observations up to 2020 as training data.

Key steps include:
- Prepare training datasets from ET (target) and climate predictors (precipitation and temperature) for all available historical data.
- Apply the kNN analogue method to generate daily ET estimates for 2021–2025.
- Compute _analogue-based uncertainty_ for each generated day, defined as the weighted standard deviation across the k nearest analogues.
- Visualize daily ET reconstructions and their uncertainty.
- Save the production datasets in both NetCDF and compressed `.npz` formats.
- Copy final outputs to Google Drive for long-term storage and reproducibility.
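
The analogue-based uncertainty mentioned above can be written as a weighted standard deviation across the k analogue images. The sketch below assumes the weights are normalized to sum to one and uses the biased (population) form of the variance; the helper name `weighted_mean_and_std` and the toy values are hypothetical, and the weighting scheme used in the workshop may differ.

```python
import numpy as np

def weighted_mean_and_std(analogues, weights):
    """
    analogues : (k, y, x) array holding the k nearest analogue images for one day
    weights   : (k,) non-negative weights (normalized inside the function)
    Returns the weighted mean image and the weighted standard deviation image.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    w = w[:, None, None]                       # broadcast over (y, x)
    mean = np.sum(w * analogues, axis=0)
    var = np.sum(w * (analogues - mean) ** 2, axis=0)
    return mean, np.sqrt(var)

# Toy example: three 2x2 analogue images with constant values 1, 2 and 4
analogues = np.stack([np.full((2, 2), v) for v in (1.0, 2.0, 4.0)])
mean, std = weighted_mean_and_std(analogues, weights=[0.5, 0.25, 0.25])
print(mean[0, 0], std[0, 0])
```

Pixels where the analogues agree get a small standard deviation, and pixels where they disagree get a large one, which is what makes this quantity usable as a per-pixel uncertainty map.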

</details>

## Learning Outcomes

<details>
<summary><b>Details</b></summary>

By the end of this workshop, participants will be able to:
1. Access and process geospatial datasets from GEE and local NetCDF files.
2. Handle spatio-temporal data using `NumPy` and `xarray` efficiently.
3. Apply masking and subsetting to focus on a specific river basin.
4. Understand and implement a kNN-based analogue approach for estimating missing or unobserved images.
5. Validate analogue-based reconstructions using visual diagnostics and RMSE.
6. Apply the same method in a production setting to temporally disaggregate remote sensing products.
7. Build reproducible workflows that integrate data acquisition, preprocessing, modeling, validation, and production runs.

</details>

## Import libraries and define helper functions


> **Note:** this block must be run, but we will not look into it in detail.



In this first step, we import all Python libraries required throughout the notebook. Key libraries include:
- `ee` and `geemap` for interacting with Google Earth Engine
- `NumPy` and `xarray` for numerical data handling
- `matplotlib` for visualization
- `scikit-learn` for kNN regression

In [None]:
# Import required libraries
import ee
import geemap
import numpy as np
from google.colab import files, drive
import matplotlib.pyplot as plt
from tqdm import tqdm
import os

### Converting ImageCollections to NumPy arrays

GEE datasets are stored as ImageCollections, which must be converted into local arrays for analysis.

This helper function:
- Iterates over all images in an ImageCollection
- Clips each image to the Volta Basin
- Converts each image to a NumPy array
- Stores acquisition dates alongside the data

In [None]:
# Function to convert an Earth Engine ImageCollection into a NumPy array
def ee_ic_to_numpy(ic, region, scale):
    """
    Convert an Earth Engine ImageCollection to a local NumPy array with corresponding dates.

    Parameters
    ----------
    ic : ee.ImageCollection
        The Earth Engine ImageCollection to download.
    region : ee.Geometry
        The spatial region to clip each image.
    scale : float
        The spatial resolution (in meters) for the output array.

    Returns
    -------
    np_array : np.ndarray (time, y, x)
        Stacked NumPy array of all images in the collection.
    dates : list of str
        Acquisition dates corresponding to each image (YYYY-MM-DD).
    """

    # Convert ImageCollection to a server-side list for iteration
    img_list = ic.toList(ic.size())

    # Force server-side evaluation to get number of images
    n = img_list.size().getInfo()
    print(f"    Downloading {n} daily images...")

    # Initialize lists to hold arrays and dates
    arrays = []
    dates = []

    # If the collection is empty, return empty outputs (same two values as below)
    if n == 0:
        return np.array([]), []

    # Loop over each image in the collection
    for i in tqdm(range(n), desc="      Progress"):

        # Retrieve individual image from server-side list
        img = ee.Image(img_list.get(i))

        # Extract the image acquisition date as a string
        date = ee.Date(img.get('system:time_start')).format('YYYY-MM-dd').getInfo()

        # Convert EE image to NumPy array and clip to the specified region
        arr = geemap.ee_to_numpy(img.clip(region), scale=scale)

        # Append array and date to lists
        arrays.append(arr)
        dates.append(date)

    # Stack all images along a new time dimension
    np_array = np.stack(arrays)

    # Return the stacked array and corresponding dates
    return np_array, dates

## 1.1 - Authenticate and initialize Google Earth Engine
Before accessing GEE datasets, we must authenticate and initialize the Earth Engine API.
- `ee.Authenticate()` opens an authentication prompt (only required once per session)
- `ee.Initialize()` establishes the connection to GEE

> **IMPORTANT**: You MUST specify a project name. Otherwise, GEE access may fail.

How to find your project ID:
1. Go to https://code.earthengine.google.com
2. Look at the top-right corner, where your name and photo are
3. The name displayed is your Project ID
4. Paste it below

If you do not have a project yet, you can create one from the Earth Engine Code Editor in a few clicks. Follow the steps listed here: https://developers.google.com/earth-engine/guides/access.

In [None]:
ee.Authenticate()
ee.Initialize(project='lgerber') # <----------- UPDATE WITH YOUR PROJECT ID

## 1.2 - Define temporal and spatial parameters
Here we define:
- The time period of interest
- The spatial resolution of each dataset

Why different resolutions?
- WaPOR AET is available at ~1km
- ERA5-Land climate variables are available at coarser (~11 km) resolution

This scale mismatch is not an issue for analogue-based modeling: the coarse-resolution predictors are only used to identify similar days, which keeps the computation cheap, while the fine-scale target images are taken from those analogue days.

In [None]:
# Set parameters
START_DATE       = '2025-01-01'
END_DATE         = '2025-01-31'
SCALE_TARGET     = 1000   # ~ 1 km (WaPOR resolution)
SCALE_PREDICTORS = 11132  # ~11 km (ERA5-Land resolution)
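
Note that Earth Engine's `filterDate(start, end)` includes the start date and excludes the end date. The small sketch below uses only the standard library to list the days covered by such a half-open interval; the helper name `days_in_range` is hypothetical.

```python
from datetime import date, timedelta

def days_in_range(start, end):
    """Return the ISO dates in the half-open interval [start, end)."""
    d0, d1 = date.fromisoformat(start), date.fromisoformat(end)
    return [(d0 + timedelta(days=i)).isoformat() for i in range((d1 - d0).days)]

days = days_in_range('2025-01-01', '2025-01-31')
print(len(days), days[0], days[-1])
```

With the parameters above, the request covers 1 to 30 January; an `END_DATE` of `'2025-02-01'` would be needed to include 31 January.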

## 1.3 - Load and visualize the Volta Basin geometry

We now load the Volta River Basin boundary using the HydroSHEDS basin dataset and create a binary mask.

Steps performed:

1. Load basin polygons from HydroSHEDS
2. Select the Volta Basin using its unique basin ID
3. Convert the basin polygon into a raster mask

To verify that the correct basin is selected, we display it on an interactive map.

In [None]:
# Load the Volta Basin geometry and create a mask
basins      = ee.FeatureCollection("WWF/HydroSHEDS/v1/Basins/hybas_3")
volta_basin = basins.filter(ee.Filter.eq('HYBAS_ID', 1030023300)).geometry()
mask_img    = ee.Image().byte().paint(volta_basin, 1)

mask_target = geemap.ee_to_numpy(mask_img, region=volta_basin, scale=SCALE_TARGET)

# Visualize the Volta Basin on an interactive map
Map = geemap.Map(center=[8, -1], zoom=6)  # Approximate center of Volta Basin
Map.add_basemap("SATELLITE")
Map.addLayer(volta_basin, {'color': 'red'}, "Volta Basin Boundary")
Map.addLayerControl()
Map

## 1.4 - Download WaPOR Actual Evapotranspiration (AET)

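We now download Actual Evapotranspiration from the WaPOR Level 1 product. The collection used here (`L1_AETI_D`) is the dekadal (10-day) product, with values expressed as an average daily rate.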

Processing steps:
- Filter by date
- Select the AET band
- Convert units from tenths of mm/day to mm/day
- Apply the Volta Basin mask

In [None]:
# Download WaPOR Actual Evapotranspiration (AET)
aet_ic = (
    ee.ImageCollection('FAO/WAPOR/3/L1_AETI_D')
    .filterDate(START_DATE, END_DATE)
    .select('L1-AETI-D')
    .map(lambda img: img.divide(10).copyProperties(img, img.propertyNames()))
)
print("Downloading WaPOR Actual ET...")
aet, aet_dates = ee_ic_to_numpy(aet_ic, volta_basin, SCALE_TARGET)
aet = np.where(mask_target == 1, aet, np.nan)
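
The masking step relies on NumPy broadcasting: a 2-D basin mask is applied to every time step of a 3-D stack in a single call. A minimal sketch with made-up arrays (the shapes and values are assumptions for illustration only):

```python
import numpy as np

# Hypothetical stack of 3 time steps on a 2x3 grid
stack = np.arange(18, dtype=float).reshape(3, 2, 3)

# Hypothetical basin mask: 1 inside the basin, 0 outside
basin_mask = np.array([[1, 1, 0],
                       [0, 1, 0]])

# The (2, 3) mask broadcasts against the (3, 2, 3) stack along the time axis
masked = np.where(basin_mask == 1, stack, np.nan)

print(masked.shape)
print(np.isnan(masked[0]))
```

Because pixels outside the basin become `NaN`, later statistics should use the NaN-aware functions such as `np.nanmean`.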

## 1.5 - Download ERA5-Land temperature and precipitation
We next download daily climate predictors from ERA5-Land.

### 1.5.1 Air temperature at 2 meters
- Converted from Kelvin to degrees Celsius
- Masked to valid land pixels

In [None]:
# Download ERA5-Land 2m Temperature
temp_ic = (
    ee.ImageCollection('ECMWF/ERA5_LAND/DAILY_AGGR')
    .filterDate(START_DATE, END_DATE)
    .select('temperature_2m')
    .map(lambda img: img.subtract(273.15).copyProperties(img, img.propertyNames()))  # Kelvin to °C
)
print("Downloading ERA5-Land 2m Temperature...")
temp, temp_dates = ee_ic_to_numpy(temp_ic, volta_basin, SCALE_PREDICTORS)
mask = np.where(temp[0,:,:] != 0, 1, 0)
temp = np.where(mask == 1, temp, np.nan)

### 1.5.2 Total daily precipitation
- Converted from meters/day to mm/day

In [None]:
# Download ERA5-Land Precipitation
prec_ic = (
    ee.ImageCollection('ECMWF/ERA5_LAND/DAILY_AGGR')
    .filterDate(START_DATE, END_DATE)
    .select('total_precipitation_sum')  # meters/day
    .map(lambda img: img.multiply(1000).copyProperties(img, img.propertyNames()))  # mm/day
)
print("Downloading ERA5-Land Precipitation...")
prec, prec_dates = ee_ic_to_numpy(prec_ic, volta_basin, SCALE_PREDICTORS)
prec = np.where(mask == 1, prec, np.nan)

## 1.6 - Save the processed dataset

To avoid repeated downloads, we save all arrays and metadata into a compressed NumPy archive (.npz).

In [None]:
# Save the downloaded data
name = "example_Volta_data"    # <-------------------------------- CHANGE FILE NAME HERE
np.savez_compressed(
    f'{name}.npz',
    aet=aet, aet_dates=aet_dates,
    precip=prec, precip_dates=prec_dates,
    temp=temp, temp_dates=temp_dates
)
print("All data downloaded and saved.")

# Trigger browser download to the Downloads folder
files.download(f'{name}.npz')

We also save a copy to Google Drive for persistent storage.

In [None]:
# Mount Google Drive
drive.mount('/content/drive')

# Define data path (folder + file)
save_dir  = '/content/drive/MyDrive/SyntheticSatDataWorkshop2026' # <-------------------------------- CHANGE DIRECTORY HERE
data_path = os.path.join(save_dir, f'{name}.npz')

# Create folder if it does not exist
os.makedirs(save_dir, exist_ok=True)

# Copy file to Google Drive (IPython expands {name} and $data_path before calling cp)
!cp "{name}.npz" "$data_path"

print(f"File saved to Google Drive at: {data_path}")

## 1.7 - Reloading the saved data

This section demonstrates how to reload the dataset without re-running Earth Engine queries.

In [None]:
# Load dataset from Google Drive
data = np.load(data_path, allow_pickle=True)

aet  = data['aet']
prec = data['precip']
temp = data['temp']

aet_dates  = data['aet_dates']
prec_dates = data['precip_dates']
temp_dates = data['temp_dates']

print(f"AET shape: {aet.shape}, Precip shape: {prec.shape}, Temp shape: {temp.shape}")
print(f"Dates: {aet_dates[0]} - {aet_dates[-1]}")
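
Because the date arrays are stored alongside the data, they can be used to keep only the time steps present in every dataset. The sketch below writes a tiny archive to a temporary folder, reloads it, and aligns two series with `np.intersect1d`; the array contents, variable names, and file name are invented for the example.

```python
import os
import tempfile
import numpy as np

# Invented example data: the target is available on fewer dates than the predictor
aet_demo = np.random.rand(2, 4, 4)
aet_demo_dates = ['2025-01-01', '2025-01-11']
temp_demo = np.random.rand(3, 2, 2)
temp_demo_dates = ['2025-01-01', '2025-01-02', '2025-01-11']

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.npz')
    np.savez_compressed(path, aet=aet_demo, aet_dates=aet_demo_dates,
                        temp=temp_demo, temp_dates=temp_demo_dates)
    # Lists of strings are stored as fixed-width string arrays
    with np.load(path) as archive:
        a, a_dates = archive['aet'], archive['aet_dates']
        t, t_dates = archive['temp'], archive['temp_dates']

# Dates common to both series, plus their positions in each array
common, ia, it = np.intersect1d(a_dates, t_dates, return_indices=True)
a_aligned, t_aligned = a[ia], t[it]
print(common, a_aligned.shape, t_aligned.shape)
```

After this step, `a_aligned[i]` and `t_aligned[i]` refer to the same date, `common[i]`.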

## 1.8 - Visual inspection of the datasets

Before moving to modeling, it is crucial to visually inspect the data.

This final section allows you to:
- Select a specific date index
- Compare spatial patterns of AET, precipitation, and temperature

In [None]:
# Visual comparison of AET, Precipitation, and Temperature

AET_dateIndex = 0   # <---- CHANGE VALUE HERE
pre_dateIndex = 0   # <---- CHANGE VALUE HERE
tem_dateIndex = 0   # <---- CHANGE VALUE HERE

# Plotting
fig, axes = plt.subplots(1, 3, figsize=(18,6))

im0 = axes[0].imshow(aet[AET_dateIndex,:,:], cmap='viridis', vmin=0, vmax=10)
axes[0].set_title(f"AET — {aet_dates[AET_dateIndex]}")
axes[0].axis('off')
cbar0 = fig.colorbar(im0, ax=axes[0], fraction=0.046, pad=0.04)
cbar0.set_label("AET [mm/day]")

im1 = axes[1].imshow(prec[pre_dateIndex,:,:], cmap='Blues', vmin=0)
axes[1].set_title(f"Precipitation — {prec_dates[pre_dateIndex]}")
axes[1].axis('off')
cbar1 = fig.colorbar(im1, ax=axes[1], fraction=0.046, pad=0.04)
cbar1.set_label("Precip [mm/day]")

im2 = axes[2].imshow(temp[tem_dateIndex,:,:], cmap='coolwarm', vmin=15, vmax=35)
axes[2].set_title(f"Temperature — {temp_dates[tem_dateIndex]}")
axes[2].axis('off')
cbar2 = fig.colorbar(im2, ax=axes[2], fraction=0.046, pad=0.04)
cbar2.set_label("2 m air temperature [°C]")

plt.tight_layout()
plt.show()