![](./figures/Logo.PNG)

## In this part of the tutorial, you will
* use metrics to assess simulation performance
* study scatter plots of multiple objective functions

- - -

# 2b - Statistical Evaluation Metrics

- - -

## 1. Introducing Bias, RMSE, NSE and KGE

In tutorial 2a, we relied on visual inspection to assess model performance, that is, how well the model output fits the observed runoff. For some parameter sets, it can be difficult to judge visually which one gives the best result. In this tutorial, we will use evaluation metrics, which enable a more robust comparison between model runs with different parameterizations.

**#1 Bias**  
Bias is the consistent deviation of simulation results from observed values. It indicates the model's tendency to systematically overestimate or underestimate the target variable.  
  
Let $y_i$ represent the observed value and $\hat{y}_i$ denote the simulated value. The bias is calculated as:

$$
\text{Bias} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)
$$

where $n$ is the total number of data points.
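
As a small illustration of the sign convention, the sketch below computes the bias for made-up values (the arrays `obs`, `sim_high` and `sim_mixed` are hypothetical and not part of the tutorial data). It suggests two things: with observed minus simulated, an overestimating model gives a negative bias, and errors of opposite sign can cancel each other out.

```python
import numpy as np

# Hypothetical toy data in mm/day, for illustration only
obs = np.array([2.0, 4.0, 6.0, 8.0])
sim_high = np.array([3.0, 5.0, 7.0, 9.0])   # every value is 1.0 too high
sim_mixed = np.array([1.0, 5.0, 5.0, 9.0])  # errors alternate in sign

# Bias as defined above: mean of (observed - simulated)
bias_high = np.mean(obs - sim_high)
bias_mixed = np.mean(obs - sim_mixed)

print(bias_high)   # -1.0, negative because the simulation overestimates
print(bias_mixed)  # 0.0, although no single value is simulated correctly
```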

**#2 Root Mean Square Error (RMSE)**  
RMSE is the square root of the mean squared difference between simulated values and the corresponding observed values (in other words: the square root of the MSE).

$$
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
$$

where $y_i$ represents the observed value, $\hat{y}_i$ denotes the simulated value, and $n$ is the total number of data points.
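
Because the differences are squared before averaging, a few large errors weigh more heavily than many small ones. The sketch below uses made-up values to show this; the helper names `rmse_sketch` and `mae_sketch` are hypothetical, and the mean absolute error appears only as a point of comparison.

```python
import numpy as np

# Hypothetical toy data in mm/day, for illustration only
obs = np.array([2.0, 4.0, 6.0, 8.0])
sim_even = np.array([3.0, 3.0, 7.0, 7.0])   # four errors of size 1
sim_peak = np.array([2.0, 4.0, 6.0, 12.0])  # a single error of size 4

def rmse_sketch(obs, sim):
    """Root mean square error, following the formula above."""
    return np.sqrt(np.mean((obs - sim) ** 2))

def mae_sketch(obs, sim):
    """Mean absolute error, used here only for comparison."""
    return np.mean(np.abs(obs - sim))

# Both simulations have the same mean absolute error ...
print(mae_sketch(obs, sim_even), mae_sketch(obs, sim_peak))    # 1.0 1.0
# ... but the single large error doubles the RMSE
print(rmse_sketch(obs, sim_even), rmse_sketch(obs, sim_peak))  # 1.0 2.0
```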

**#3 Kling-Gupta Efficiency (KGE)**  
KGE is a hydrological metric that assesses the performance of hydrological models by measuring the correlation, bias, and variability of their predictions against observed hydrograph data. It allows evaluation of the model's accuracy, timing, and volume representation.

$$
\text{KGE} = 1 - \sqrt{(r - 1)^2 + (\alpha - 1)^2 + (\beta - 1)^2}
$$

where $r$ is the Pearson correlation coefficient between observed and simulated values, $\alpha$ (alpha) is the ratio of the standard deviation of the simulated values to that of the observed values, and $\beta$ (beta) is the ratio of the simulated mean to the observed mean.
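
To see how the three components enter the score, consider a made-up simulation that is exactly twice the observations. The sketch assumes that $\alpha$ and $\beta$ are taken as simulated over observed; the data are hypothetical. The timing is perfect ($r = 1$), yet the doubled spread and doubled volume pull the KGE below zero.

```python
import numpy as np

# Hypothetical toy data, for illustration only
obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sim = 2.0 * obs  # same timing, but twice the volume and twice the spread

r = np.corrcoef(obs, sim)[0, 1]     # Pearson correlation
alpha = np.std(sim) / np.std(obs)   # variability ratio (simulated / observed)
beta = np.mean(sim) / np.mean(obs)  # ratio of means (simulated / observed)
kge_value = 1 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

print(round(r, 3), round(alpha, 3), round(beta, 3), round(kge_value, 3))
# 1.0 2.0 2.0 -0.414
```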

**#4 Nash-Sutcliffe Efficiency (NSE)**  
NSE measures the proportion of the observed variance that is explained by the model results. It is particularly useful for evaluating streamflow predictions. An NSE value of 1 indicates a perfect fit between the model and observed data, while negative values suggest the model performs worse than simply using the mean of the observed values.

$$
\text{NSE} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$

where $y_i$ represents the observed value, $\hat{y}_i$ denotes the simulated value, $n$ is the total number of data points, and $\bar{y}$ is the mean of the observed values.
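
The reference points of the NSE can be checked with a few made-up values. In the sketch below, `nse_direct` is a hypothetical helper that writes the formula out literally; it is evaluated for a perfect simulation, for a simulation that always returns the observed mean, and for a simulation that is the observations in reverse order.

```python
import numpy as np

def nse_direct(obs, sim):
    """NSE written as in the formula above (hypothetical helper name)."""
    return 1 - np.sum((obs - sim) ** 2) / np.sum((obs - np.mean(obs)) ** 2)

# Hypothetical toy data, for illustration only
obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

print(nse_direct(obs, obs))                              # 1.0, perfect fit
print(nse_direct(obs, np.full_like(obs, np.mean(obs))))  # 0.0, as good as the observed mean
print(nse_direct(obs, obs[::-1]))                        # -3.0, worse than the observed mean
```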

<div style="background:#e0f2fe; padding: 1%; border:1mm solid SkyBlue; color:black">
    <h4><span>&#129300; </span>Task I: Understanding the Metrics</h4>
    <ol>
        <li>Describe in your own words what these metrics focus on and how they differ.</li>
        <li>Based on your answers from the first question: what could be limitations of these metrics and when should they be applied carefully?</li>
        <li>If <i>x</i> is the metric value, what does it mean if <i>x<0</i>, <i>x=0</i>, <i>x>0</i>, <i>x=1</i>?</li>
    </ol>
</div>

_PUT YOUR ANSWERS HERE_

<div style="background:#e0f2fe; padding: 1%; border:1mm solid SkyBlue;">
</div>

## 2. Using Bias, RMSE, NSE and KGE

**Import packages**

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import scipy
import random
import matplotlib.pyplot as plt
import matplotlib.dates as mdate
import itertools
import sys
sys.path.append('src/')
import HBV
from ipywidgets import interact, interactive_output, Dropdown, FloatSlider, VBox, Tab, fixed

**Defining Bias, RMSE, NSE and KGE**

By the way, the red string at the start of a function, enclosed in three <code style="color:darkred">"""</code>, is called a [docstring](https://realpython.com/documenting-python-code/#documenting-your-python-code-base-using-docstrings). It acts as a description of the function and is also used to describe the arguments and return value. We suggest that you write a docstring whenever a function gets more complicated or its arguments aren't immediately clear.

In Jupyter Notebook (or Lab) the documentation can be accessed by pressing `Shift` + `Tab` (both Windows and Mac) when the cursor is placed in the function call. This also works for all other functions and modules, e.g. _numpy_, _scipy_, _matplotlib_, ...
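
Outside the keyboard shortcut, the same text can be read programmatically. A minimal sketch with a hypothetical function `double`:

```python
def double(x):
    """Return twice the input value."""
    return 2 * x

# The docstring is stored on the function object itself ...
print(double.__doc__)  # Return twice the input value.

# ... and it is the text that help() displays
help(double)
```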

In [2]:
def bias(obs, sim):
    """
    Calculate the Bias between observed and simulated values.
    
    Bias measures the consistent deviation of simulation results from observed values,
    indicating whether the model systematically overestimates or underestimates the target variable.
    """
    return np.mean(np.subtract(obs, sim))  # Mean of observation values minus simulation results

def abs_bias(obs, sim):
    """Calculate the absolute bias"""
    return np.abs(bias(obs, sim))

def rmse(obs, sim):
    """
    Calculate the Root Mean Square Error (RMSE) between observed and simulated values.
    
    RMSE measures the square root of the average squared differences between predicted and actual values.
    """
    return np.sqrt(np.mean(np.square(np.subtract(obs, sim))))

def nse(obs, sim):
    """
    Calculate the Nash-Sutcliffe Efficiency (NSE) between observed and simulated values.
    
    NSE measures the proportion of the observed variance explained by the model results.
    A perfect NSE of 1 indicates a perfect fit, while negative values suggest worse performance than using the mean of observed values.
    """
    r_nse = np.corrcoef(obs, sim)[0][1] 
    alpha_nse = np.divide(np.std(sim), np.std(obs))
    beta_nse = np.divide(np.subtract(np.mean(sim), np.mean(obs)), np.std(obs))
    nse = 2 * alpha_nse * r_nse - np.square(alpha_nse) - np.square(beta_nse)
    return nse

def kge(obs, sim):
    """
    Calculate the Kling-Gupta Efficiency (KGE) between observed and simulated values.
    
    KGE assesses model performance by measuring correlation, bias, and variability against observed data.

    Returns:
    tuple: (correlation coefficient, variation ratio, bias ratio, KGE value)
    """
    r_kge = np.corrcoef(obs, sim)[0][1]  # Pearson correlation coefficient
    alpha_kge = np.divide(np.std(sim), np.std(obs))  # Variation ratio
    beta_kge = np.divide(np.mean(sim), np.mean(obs))  # Bias ratio (ratio of the means)
    kge = 1 - np.sqrt(np.square(r_kge - 1) + np.square(beta_kge - 1) + np.square(alpha_kge - 1))
    return round(r_kge, 3), round(alpha_kge, 3), round(beta_kge, 3), round(kge, 3)

def kge_only(obs, sim):
    """
    Calculate the Kling-Gupta Efficiency (KGE) between observed and simulated values.
    
    Returns only the KGE value, excluding intermediate metrics.
    """
    _, _, _, kge_value = kge(obs, sim)
    return kge_value

def hbv(par, precip, temp, evap):
    """Run the HBV snow routine and runoff simulation and return the simulated runoff."""
    # Run HBV snow routine
    p_s, _, _ = HBV.snow_routine(par[:4], temp, precip)
    # Run HBV runoff simulation
    Case = 1 # for now we assume that the preferred path in the upper zone is runoff (Case = 1), it can be set to percolation (Case = 2)
    ini = np.array([0,0,0]) # initial state
    runoff_sim, _, _ = HBV.hbv_sim(par[4:], p_s, evap, Case, ini)
    return runoff_sim

def hbv_and_one_obj_fun(par, precip, temp, evap, runoff_obs, n_days, obj_fun):
    """Run HBV and evaluate obj_fun on observed vs. simulated runoff, skipping the first n_days."""
    runoff_sim = hbv(par, precip, temp, evap)
    
    errors = obj_fun(runoff_obs[n_days:], runoff_sim[n_days:])
    return errors
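
The `nse` function above does not evaluate the formula from Section 1 term by term. It uses a decomposition into correlation, variability ratio and a bias term normalized by the observed standard deviation. The sketch below checks on synthetic data (hypothetical values, not catchment data) that both forms agree; it assumes, as in the cell above, that `np.std` is used with its default population definition.

```python
import numpy as np

# Synthetic series, for illustration only
rng = np.random.default_rng(seed=0)
obs = rng.gamma(shape=2.0, scale=1.5, size=365)   # hypothetical "observed" runoff
sim = 0.8 * obs + rng.normal(0.0, 0.5, size=365)  # hypothetical imperfect simulation

# Direct form: 1 - (sum of squared errors) / (sum of squared deviations from the mean)
nse_direct = 1 - np.sum((obs - sim) ** 2) / np.sum((obs - np.mean(obs)) ** 2)

# Decomposed form, mirroring the nse function above
r = np.corrcoef(obs, sim)[0, 1]
alpha = np.std(sim) / np.std(obs)
beta_n = (np.mean(sim) - np.mean(obs)) / np.std(obs)
nse_decomposed = 2 * alpha * r - alpha ** 2 - beta_n ** 2

print(np.isclose(nse_direct, nse_decomposed))  # True
```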

**Create and display dropdown for selecting catchment**

In [3]:
# DO NOT ALTER! code to select the catchment

catchment_names = ["Trout River, BC, Canada", "Medina River, TX, USA", "Siletz River, OR, USA"]
dropdown = Dropdown(
    options=catchment_names,
    value=catchment_names[0],
    description='Catchment:',
    disabled=False)

display(dropdown)


**Read catchment data and prepare input data for model**

In [4]:
# Read catchment data
catchment_name = dropdown.value
# Map each catchment name to its data file
file_dic = {catchment_names[0]: "hysets_10BE007", catchment_names[1]: "camels_08178880", catchment_names[2]: "camels_14305500"}
df_obs = pd.read_csv(f"data/{file_dic[catchment_name]}.csv")
# Make sure the date is interpreted as a datetime object -> makes temporal operations easier
df_obs.date = pd.to_datetime(df_obs['date'], format='%Y-%m-%d')
# Define the time frame
start_date = '2002-10-01'
end_date = '2003-09-30'

# Index frame by date
df_obs.set_index('date', inplace=True)
# Select time frame
df_obs = df_obs[start_date:end_date]
# Reformat the date for plotting
df_obs["date"] = df_obs.index.map(lambda s: s.strftime('%b-%d-%y'))
# Reindex
df_obs = df_obs.reset_index(drop=True)
# Select snow, precip, PET, streamflow and T
df_obs = df_obs[["snow_depth_water_equivalent_mean", "total_precipitation_sum","potential_evaporation_sum","streamflow", "temperature_2m_mean", "date"]]
# Rename variables
df_obs.columns = ["Snow [mm/day]", "P [mm/day]", "PET [mm/day]", "Q [mm/day]", "T [C]", "Date"]

# Prepare the input data for the model
P = df_obs["P [mm/day]"].to_numpy()
evap = df_obs["PET [mm/day]"].to_numpy()
temp = df_obs["T [C]"].to_numpy()    

# load calibrated parameters
params_calibrated = pd.read_csv("./data/calibrated_parameters - HBV.csv")
params_calibrated = params_calibrated[params_calibrated.catchment_name == catchment_name]

**Using bias and RMSE to evaluate HBV results**

In [5]:
def compare_hbv_runs(obj_funs=bias, lmbda=0.2, **params):

    if callable(obj_funs):
        obj_funs = [obj_funs, obj_funs]
    
    # convert params dict to array holding 26 parameters (two runs)
    params = np.array(list(params.values()))
    
    plt.figure(figsize=(20, 4))
    
    # the parameters are for two model runs that can be compared
    for i, obj_fun in enumerate(obj_funs):
        
        # run HBV model and plot output
        Q_sim = hbv(params[i*13:(i+1)*13], P, temp, evap)

        # Box-Cox transformation
        Q_obs = scipy.special.boxcox1p(df_obs["Q [mm/day]"], lmbda)
        Q_sim = scipy.special.boxcox1p(Q_sim, lmbda)
        
        # evaluate the objective function on the Box-Cox transformed series
        obj_fun_value = obj_fun(Q_obs, Q_sim)
        plt.plot(df_obs["Date"], Q_sim, color=["red", "blue"][i], label=f"Model Run #{i + 1}\n{obj_fun.__name__}: {obj_fun_value:.2f}")

    plt.plot(df_obs["Date"], Q_obs, color="black", label="Observed")
    
    plt.title(f"{catchment_name}\n(Box-Cox transformed)")
    plt.legend()
    plt.gca().xaxis.set_major_locator(mdate.MonthLocator(bymonth=range(1,13,4)))
    plt.show()
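
The Box-Cox transformation used in the plotting function compresses high flows, so that low flows carry more weight in the comparison. The sketch below reimplements the definition of `scipy.special.boxcox1p` with NumPy only, to make the role of lambda visible; `boxcox1p_sketch` is a hypothetical helper and the flow values are made up.

```python
import numpy as np

def boxcox1p_sketch(x, lmbda):
    """Box-Cox transformation of 1 + x (same definition as scipy.special.boxcox1p)."""
    x = np.asarray(x, dtype=float)
    if lmbda == 0:
        return np.log1p(x)  # limiting case: log(1 + x)
    return ((1.0 + x) ** lmbda - 1.0) / lmbda

# Hypothetical flows in mm/day spanning several orders of magnitude
q = np.array([0.0, 1.0, 10.0, 100.0])

print(boxcox1p_sketch(q, 1.0))  # lambda = 1 leaves the values unchanged
print(boxcox1p_sketch(q, 0.0))  # lambda = 0 gives log(1 + q) and compresses the peak
```

With lambda = 0, which is the initial slider value, the largest flow is mapped to about 4.6 instead of 100, so errors at peak flow no longer dominate the metric.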

In [6]:
# DO NOT ALTER! parameter definitions for easy input

param_names = ["Ts", "CFMAX", "CFR", "CWH", "BETA", "LP", "FC", "PERC", "K0", "K1", "K2", "UZL", "MAXBAS"]
lower = [-3, 0, 0, 0, 0, 0.3, 1, 0, 0.05, 0.01, 0.005, 0, 1] # lower bounds for the parameters
upper = [3, 20, 1, 0.8, 7, 1, 2000, 100, 2, 1, 0.1, 100, 6]  # upper bounds for the parameters

# slider for Box-Cox lambda
lmbda = FloatSlider(value=0, min=-2, max=2, description="Lambda")

# widgets for easy input
params = {f"{param}_{i}": FloatSlider(value=np.round(params_calibrated.iloc[i, j+3], 1), min=xmin, max=xmax, step=0.1, description=param) for i in range(2) for j, xmin, xmax, param in zip(range(13), lower, upper, param_names)}
tabs   = Tab([VBox(list(params.values())[:13] + [lmbda,]), VBox(list(params.values())[13:] + [lmbda,])])
tabs.set_title(0, "First Run (Red)")
tabs.set_title(1, "Second Run (Blue)")

display(tabs)
interactive_output(compare_hbv_runs, params | {"lmbda": lmbda, "obj_funs": fixed([bias, rmse])}) 


<div style="background:#e0f2fe; padding: 1%; border:1mm solid SkyBlue; color:black">
    <h4><span>&#129300; </span>Task II: Apply and Discuss Bias and RMSE</h4>
    In the above plot you see two HBV runs for the catchment that you selected. One was calibrated using the evaluation metric "bias" (red) and one using "rmse" (blue).
    <ol>
        <li>Compare the two simulated hydrographs. What differences can you see between the two metrics? Compare this to your answers to Task I.</li>
        <li>What features of the hydrograph do the two metrics pick up or miss? Again, think <i>timing, magnitude, ...</i> </li>
        <li>Under which conditions would you choose which metric?</li>
    </ol>
</div>

_PUT YOUR ANSWERS HERE_

<div style="background:#e0f2fe; padding: 1%; border:1mm solid SkyBlue;">
</div>

**Using NSE and KGE to evaluate HBV results**

In [7]:
# DO NOT ALTER! Code to compare NSE and KGE

# slider for Box-Cox lambda
lmbda = FloatSlider(value=0, min=-2, max=2, description="Lambda")

# widgets for easy input
params = {f"{param}_{i}": FloatSlider(value=np.round(params_calibrated.iloc[2, j+3], 1), min=xmin, max=xmax, step=0.1, description=param) for i in range(2) for j, xmin, xmax, param in zip(range(13), lower, upper, param_names)}
tabs   = Tab([VBox(list(params.values())[:13] + [lmbda,]), VBox(list(params.values())[13:] + [lmbda,])])
tabs.set_title(0, "First Run (Red)")
tabs.set_title(1, "Second Run (Blue)")

display(tabs)
interactive_output(compare_hbv_runs, params | {"lmbda": lmbda, "obj_funs": fixed([nse, kge_only])}) 


<div style="background:#e0f2fe; padding: 1%; border:1mm solid SkyBlue; color:black">
    <h4><span>&#129300; </span>Task III: Apply and Discuss NSE and KGE</h4>
    In the above plots you again find two model runs. Initially, both use the same set of parameters. Only the evaluation metric displayed in the legend differs ("nse" for the red run and "kge_only" for the blue run). As you can see, the same set of parameters (i.e. the same model run) can lead to different values of the two evaluation metrics.
    <ol>
        <li>Compare the values of NSE and KGE by individually tuning the parameters of both runs. How can you bring the values closer to their optimum of 1?</li>
        <li>Are the two metrics affected differently by individual parameters?</li>
        <li>Which component of KGE dominates the result? (reminder: in KGE, r represents the correlation coefficient, alpha is the ratio of the standard deviations between observed and simulated values, and beta is the ratio of their means)</li>
    </ol>
</div>

_PUT YOUR ANSWERS HERE_

<div style="background:#e0f2fe; padding: 1%; border:1mm solid SkyBlue;">
</div>

**Relationships between the Evaluation Metrics**

In [8]:
import itertools

@interact(lmbda=(-2, 2, 0.1))
def multiple_objectives(lmbda):
    
    # the function needs to be defined here to use the lambda parameter
    def bc_bias(obs, sim):
        return abs_bias(scipy.special.boxcox1p(obs, lmbda), scipy.special.boxcox1p(sim, lmbda))

    def plot_multiple_objectives(obj_funs=(bias, rmse, nse, kge_only)):

        n = len(obj_funs)
        
        # generate n_samples parameter sets from a uniform distribution
        n_samples = 200
        X0 = np.random.uniform(low=lower, high=upper, size=(n_samples, 13))
    
        # evaluate the different objective functions for all samples
        obj_fun_evaluations = dict()
        for obj_fun in obj_funs:
            obj_fun_evaluations[obj_fun] = [hbv_and_one_obj_fun(x0, P, temp, evap, df_obs["Q [mm/day]"].to_numpy(), 0, obj_fun) for x0 in X0]
    
        # scatter plots of all combinations of objective functions
    
        fig, axs = plt.subplots(n - 1, n - 1, figsize=(4*(n - 1), 4*(n - 1)))

        for ax in axs.ravel():
            ax.axis("off")
        
        for i, j in itertools.combinations(range(n), 2):
            fun1 = obj_funs[j]
            fun2 = obj_funs[i]
            ax = axs[i, n - 1 - j]
            ax.set_xlabel(fun1.__name__)
            ax.set_ylabel(fun2.__name__)
            ax.scatter(obj_fun_evaluations[fun1], obj_fun_evaluations[fun2], alpha=0.5)
            ax.grid()
            ax.axis("on")
        
        plt.tight_layout()
        plt.show()

    plot_multiple_objectives()


<div style="background:#e0f2fe; padding: 1%; border:1mm solid SkyBlue; color:black">
    <h4><span>&#129300; </span>Task IV: Relationships Between Evaluation Metrics</h4>
    For the above plots we have run HBV 200 times using random parameter sets sampled within commonly used ranges. For each model run, we calculated the bias, RMSE, NSE and KGE and plotted them against each other in multi-objective scatter plots. This allows you to compare the general relationships between the different metrics.
    <ol>
        <li>Building on the answers you gave earlier, continue your discussion on how bias, RMSE, NSE and KGE are related.</li>
        <li>Can you define regions in which the metrics behave similarly?</li>
    </ol>
</div>

_PUT YOUR ANSWERS HERE_

<div style="background:#e0f2fe; padding: 1%; border:1mm solid SkyBlue;">
</div>