# 10 - Data Assimilation

## Using PAVICS-Hydro to perform data assimilation of streamflow to prepare the model states for a forecast.

Here we apply the Ensemble Kalman Filter (EnKF) data assimilation method to the initial states of a Raven hydrological model. Assimilating observed streamflow improves the estimate of the initial states and reduces the initial model bias. This in turn improves forecast skill at short lead times (up to a few days), and at longer lead times in some cases.
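
To give an intuition for what the EnKF does at each assimilation step, here is a minimal sketch of a textbook stochastic EnKF update for a single scalar observation, written with NumPy only. It is not the RavenPy implementation: the function name `enkf_update`, its arguments, and the toy linear link between soil storage and flow are all assumptions made for illustration.

```python
import numpy as np


def enkf_update(states, predicted_obs, obs, obs_std, rng):
    """Stochastic EnKF analysis step for a single scalar observation.

    states        : array (n_state, n_members), forecast state ensemble
    predicted_obs : array (n_members,), simulated observation per member
    obs           : float, observed value (e.g. streamflow)
    obs_std       : float, standard deviation of the observation error
    rng           : numpy.random.Generator used to perturb the observation
    """
    states = np.asarray(states, dtype=float)
    predicted_obs = np.asarray(predicted_obs, dtype=float)
    n_members = states.shape[1]

    # Ensemble anomalies (deviations from the ensemble mean)
    state_anom = states - states.mean(axis=1, keepdims=True)
    obs_anom = predicted_obs - predicted_obs.mean()

    # Sample covariance between each state and the simulated observation,
    # and sample variance of the simulated observation
    cov_state_obs = state_anom @ obs_anom / (n_members - 1)
    var_obs = obs_anom @ obs_anom / (n_members - 1)

    # Kalman gain: one weight per state variable
    gain = cov_state_obs / (var_obs + obs_std**2)

    # Each member is compared with its own perturbed copy of the observation
    perturbed_obs = obs + rng.normal(0.0, obs_std, size=n_members)
    innovation = perturbed_obs - predicted_obs

    # Shift every member toward the observation, weighted by the gain
    return states + np.outer(gain, innovation)


# Toy usage with hypothetical numbers: the ensemble simulates too much flow,
# so the update should reduce the soil storage.
rng = np.random.default_rng(42)
soil = np.array([[100.0, 110.0, 120.0, 130.0, 140.0]])  # hypothetical storage (mm)
q_sim = 0.1 * soil[0]  # hypothetical linear link between storage and flow
updated = enkf_update(soil, q_sim, obs=10.0, obs_std=0.5, rng=rng)
print(updated.mean(), soil.mean())
```

The gain is large when the state is strongly correlated with the simulated flow and the observation error is small, which is why the observation uncertainty chosen later in this notebook directly controls how strongly the states are corrected.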

In [None]:
import datetime as dt
import xarray as xr
import numpy as np
import xskillscore as xss
import matplotlib.pyplot as plt

from ravenpy.utilities.testdata import get_file
from ravenpy.utilities.data_assimilation import (
    assimilation_initialization,
    perturb_full_series,
    sequential_assimilation,
)

## A note on datasets

For this introduction to data assimilation, we will use pre-existing datasets that are hosted on the PAVICS-Hydro servers, as we did in the previous example notebooks. We also provide a model configuration and parameterization to keep things simple. However, you can adapt this to your own data and model setups using the tools seen in the previous notebooks.

In [None]:
# Here we are using a pre-defined file that is available on PAVICS-Hydro servers. 
# Replace with your own file that you can upload to your writable-workspace if desired.
forcing = get_file("raven-gr4j-cemaneige/Salmon-River-Near-Prince-George_meteo_daily.nc")

# Display the datasets that we will be using
display(forcing)

## Case 1: Open-loop simulation

An open-loop (OL) simulation is one that is run without data assimilation at any time step. To demonstrate how data assimilation improves the model states and reduces initial biases, we will compare an open-loop simulation with a simulation that integrates data assimilation.



In [None]:
# Define some of the common parameters for each model run
common_model_inputs=dict(
    area=4250.6,
    elevation=843.0,
    latitude=54.4848,
    longitude=-123.3659,
    params=(0.1353389, -0.005067198, 576.8007, 6.986121, 1.102917, 0.9224778)
)


In [None]:
# Import the GR4JCN model template we will be using 
from ravenpy.models import GR4JCN

# Generate a GR4JCN-configured Raven model instance.
# Replacing "GR4JCN()" with "HMETS()" would run an HMETS model emulator instead.
model = GR4JCN()

# Here is where we launch the model using the configuration parameters, as well as the forcing data and start and end dates. 
model(
    ts=forcing,
    start_date=dt.datetime(1997, 1, 1),
    end_date=dt.datetime(1999, 1, 1),
    **common_model_inputs,
)

In [None]:
openloop_hydrograph = model.hydrograph.q_sim 

## Case 2: Simulation with EnKF data assimilation 
Run the model as before, but perform data assimilation every 7 days over the entire period. The process is more involved than an open-loop run, but we will keep things as simple as possible here.
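
To make the 7-day cycle concrete, the sketch below lists the consecutive assimilation windows for the period used in this notebook. It uses only the standard library. The helper `assimilation_windows` is hypothetical, and the way the final, shorter window is handled is an assumption: RavenPy may treat the end of the period differently.

```python
import datetime as dt


def assimilation_windows(start, end, step_days):
    """Split [start, end] into consecutive windows of step_days days (inclusive)."""
    windows = []
    window_start = start
    while window_start < end:
        # Each window covers step_days days, so it ends step_days - 1 days later
        window_end = min(window_start + dt.timedelta(days=step_days - 1), end)
        windows.append((window_start, window_end))
        # The next window starts the day after the current one ends
        window_start = window_end + dt.timedelta(days=1)
    return windows


windows = assimilation_windows(dt.datetime(1997, 1, 1), dt.datetime(1999, 1, 1), 7)
print(len(windows))
print(windows[0])
print(windows[1])
```

The first window corresponds to the initialization run, and the assimilation loop then starts at the beginning of the second window.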


In [None]:
# First, define a set of hyperparameters to use during the assimilation steps.

# Hyperparameters for the input uncertainty.
# Note that the observed flow variable uses the CF-compliant name for discharge: "water_volume_transport_in_river_channel".
# These values represent the uncertainty around the observed values. Larger values = more uncertainty.
std = {
    "rainfall": 0.30,
    "prsn": 0.30,
    "tasmin": 2.0,
    "tasmax": 2.0,
    # Required key: without this variable, the assimilation will fail. This is the variable long_name attribute.
    "water_volume_transport_in_river_channel": 0.10,
}

if "water_volume_transport_in_river_channel" not in std:
    raise ValueError("Assimilation requires perturbing the flow variable. Please add the variable 'water_volume_transport_in_river_channel'.")

# Number of ensemble members to use for EnKF (25 is typically a good number; here we use 5 to keep things fast)
n_members = 5

# Distributions to sample from for each variable. For example, temperature uncertainty follows a normal distribution, but precipitation follows a gamma distribution.
# The default is "norm", so any other distribution should be specified here.
dists = {
    "pr": "gamma",
    "rainfall": "gamma",
    "prsn": "gamma",
    "water_volume_transport_in_river_channel": "rnorm",
}

# Define which variables we want to assimilate. Here we only adjust the water content of the first two soil layers (soil0 and soil1).
assim_var = ("soil0", "soil1")

# Assimilation period (days between each assimilation step)
assim_step_days = 7

# Define the start and end dates of the entire period
start_date = dt.datetime(1997, 1, 1)
end_date = dt.datetime(1999, 1, 1)
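
To illustrate what the `std` and `dists` hyperparameters express, here is a minimal NumPy sketch of perturbing a series under each distribution name. It is an assumption-laden illustration, not the RavenPy code: the helper `perturb` is hypothetical, "norm" is taken to mean additive noise in the variable's units, "rnorm" noise relative to the value, and "gamma" a gamma distribution whose mean is the original value and whose standard deviation is `std` times that value.

```python
import numpy as np


def perturb(values, dist, std, rng):
    """Return a randomly perturbed copy of values (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    if dist == "norm":
        # Additive noise: std is in the units of the variable (e.g. degrees)
        return values + rng.normal(0.0, std, size=values.shape)
    if dist == "rnorm":
        # Relative noise: std is a fraction of the value
        return values * (1.0 + rng.normal(0.0, std, size=values.shape))
    if dist == "gamma":
        # Gamma with mean = value and standard deviation = std * value.
        # Zero values (dry days) are left at zero; results are never negative.
        out = np.zeros_like(values)
        wet = values > 0
        shape = 1.0 / std**2
        scale = values[wet] * std**2
        out[wet] = rng.gamma(shape, scale)
        return out
    raise ValueError(f"Unknown distribution: {dist}")


rng = np.random.default_rng(0)
precip = np.array([0.0, 2.5, 10.0, 0.0, 4.0])  # hypothetical daily precipitation (mm)
temp = np.array([-3.0, 0.5, 4.0, 6.0, 1.0])  # hypothetical daily temperature (degC)
print(perturb(precip, "gamma", 0.30, rng))
print(perturb(temp, "norm", 2.0, rng))
```

A gamma distribution suits precipitation because it cannot produce negative amounts, whereas additive normal noise is acceptable for temperature.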


### The hyperparameters are ready. Now we need to run the sequential assimilation.
We start by creating a new GR4JCN model instance and initializing the state variables, which gives us an ensemble of initial states to pass to the assimilation functions.

In [None]:
# Let's re-initialize a new model instance of GR4JCN for the assimilation part:
model = GR4JCN()

# Do the first assimilation pass to get hru_states and basin_states.
# This can be skipped if these states are already available from a previous run.
model, xa, hru_states, basin_states = assimilation_initialization(
    model,
    ts=forcing,
    start_date=start_date,
    end_date=start_date + dt.timedelta(days=assim_step_days - 1),
    assim_var=assim_var,
    n_members=n_members,
    **common_model_inputs,
)

# This will return the model instance with n_members identical initial states at the end of the first period of assim_step_days.


In [None]:
# Now we can perturb the inputs for the rest of the assimilation
perturbed = perturb_full_series(
    model,
    std=std,
    start_date=start_date,
    end_date=end_date,
    dists=dists,
    n_members=n_members,
)

# Write the perturbed inputs to a NetCDF file so we can pass its path to the Raven GR4JCN model
p_fn = model.workdir / "perturbed_forcing.nc"
perturbed = xr.Dataset(perturbed)
perturbed.to_netcdf(p_fn, mode="w")

# One last step: get the observed streamflow, which is required for the assimilation and for plotting the results later.
q_obs = xr.open_dataset(forcing)["qobs"].sel(time=slice(start_date, end_date))

In [None]:
%%capture --no-display 
# The %%capture magic above avoids flooding the output with warnings about overwritten files.
# We now have: (1) an ensemble of initial states after the first assim_step_days and (2) perturbed hydrometeorological data for the assimilation and simulation steps.
# We can now run the assimilation loop and produce streamflows for the entire series:
q_assim, hru_states, basin_states = sequential_assimilation(
    model,
    hru_states,
    basin_states,
    p_fn,
    q_obs,
    assim_var,
    start_date=start_date + dt.timedelta(days=assim_step_days), # We have already "lost" one period during initialization.
    end_date=end_date,
    n_members=n_members,
    assim_step_days=assim_step_days,
)
# Also note that hru_states and basin_states can be used to generate a forecast for the next time step.

In [None]:
# We can now plot everything!
plt.plot(q_assim.T, "r", label=None)  # plot all assimilated ensemble members, without legend entries
plt.plot(q_assim[0, :].T, "r", label="Assimilated")  # re-plot the first member to get a single legend entry
plt.plot(q_obs.T, "b", label="Observed")  # plot the observed flows
plt.plot(openloop_hydrograph, "g", label="Open-Loop")  # plot the open-loop simulation (no assimilation)
plt.legend()
plt.show()

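# Compare the ensemble-mean assimilated flow and the open-loop flow against the observations over the same period.
print('RMSE - Assimilated: ' + str(xss.rmse(q_assim.mean(dim='state').T,q_obs[0:q_assim.shape[1]].T).data))
print('RMSE - Open-Loop: ' + str(xss.rmse(openloop_hydrograph[0:q_assim.shape[1],0],q_obs[0:q_assim.shape[1]].T).data))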
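
For reference, the score printed above can be sketched in plain NumPy as follows. This is an illustration under stated assumptions rather than a replacement for `xskillscore`: the helper `rmse` is hypothetical, the numbers are made up, and the sketch drops time steps where either series is missing, which may differ from the library's default handling of NaN values.

```python
import numpy as np


def rmse(sim, obs):
    """Root-mean-square error between two series, ignoring NaN pairs."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    # Keep only the time steps where both values are available
    valid = ~np.isnan(sim) & ~np.isnan(obs)
    return float(np.sqrt(np.mean((sim[valid] - obs[valid]) ** 2)))


# Hypothetical ensemble of 2 members over 3 time steps
members = np.array([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
observed = np.array([2.0, 3.0, 5.0])

# Score the ensemble mean, as done for the assimilated flows above
ensemble_mean = members.mean(axis=0)
print(rmse(ensemble_mean, observed))
```

A lower RMSE for the assimilated run than for the open-loop run indicates that the assimilation brought the simulated flows closer to the observations.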