![](./figures/Logo.PNG)

Please click the <span>&#x23E9;</span> button to run all cells before you start working on the notebook ...

## In this part of the tutorial, you will
* learn about different sampling strategies
* compare how these strategies result in different model evaluations 
* think of and implement your own method of assessing which parameter is "relevant" in a model

---

# 3 Sampling of Input Parameters

---

Sampling involves selecting a subset of potential input parameters from defined ranges. A goal of the process is to understand how variations in input parameters impact the output of computer models, aiding in model optimization and analysis. Typically, three different sampling strategies are used:


#### Grid Sampling
Input parameters are selected at even intervals within the specified parameter bounds. A derived approach is **stratified sampling**, where the samples are not placed at the grid intersections; instead, one random point is drawn in each grid cell.
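
As a minimal sketch of the difference between the two approaches, the snippet below builds an evenly spaced grid and a stratified sample on the unit square. The function names `grid_points` and `stratified_points` are illustrative only and are not used elsewhere in this notebook.

```python
import numpy as np

def grid_points(m):
    """All m*m intersections of an evenly spaced grid on [0, 1]^2."""
    axis = np.linspace(0.0, 1.0, m)
    xx, yy = np.meshgrid(axis, axis)
    return np.column_stack([xx.ravel(), yy.ravel()])

def stratified_points(m, rng):
    """One uniformly random point inside each of the m*m equal cells of [0, 1]^2."""
    idx = np.arange(m)
    ii, jj = np.meshgrid(idx, idx)
    corners = np.column_stack([ii.ravel(), jj.ravel()])  # lower-left cell corners, in cell units
    jitter = rng.random((m * m, 2))                      # random offset inside each cell
    return (corners + jitter) / m

rng = np.random.default_rng(0)
g = grid_points(4)             # 16 points on the grid intersections
s = stratified_points(4, rng)  # 16 points, one per grid cell
```

Grid points repeat the same few values in every dimension, whereas the stratified points all differ while still covering every cell.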

#### Random Sampling
Parameters are drawn at random from a uniform distribution within the parameter bounds. This is also known as Monte Carlo sampling.

#### Latin Hypercube Sampling (LHS)
A systematic and efficient technique for selecting a diverse set of input parameter combinations within defined ranges. It minimizes redundancy and ensures a more representative coverage of the parameter space in computational experiments.

It works like this:
1. Parameter Ranges Definition: Define the ranges for each input parameter that you want to sample in your study.
2. Divide Ranges: Divide each parameter range into equal intervals, corresponding to the desired number of samples or scenarios.
3. Create a Matrix: Create a Latin square matrix, where each row and column represents one interval of each parameter. The Latin property ensures that each interval is sampled exactly once across rows and columns.
4. Random Permutation: Randomly permute the elements within each row of the matrix, ensuring that the samples are selected randomly within their respective intervals.
5. Select Samples: Choose one sample from each row of the matrix. These samples represent the selected combinations of input parameters for your simulation or experiment.
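
The steps above can be sketched in a few lines of NumPy. This is a hand-written illustration under simplifying assumptions (unit cube, equal-width intervals). The function `latin_hypercube` and the example bounds are hypothetical, and the notebook itself relies on `scipy.stats.qmc.LatinHypercube` instead.

```python
import numpy as np

def latin_hypercube(n, d, rng):
    """Draw n points in [0, 1)^d with exactly one point per interval in every dimension."""
    # Steps 1-2: split [0, 1) into n equal intervals; interval k spans [k/n, (k+1)/n)
    # Steps 3-4: for each dimension, shuffle the order in which the intervals are used
    intervals = np.column_stack([rng.permutation(n) for _ in range(d)])
    # Step 5: place each sample at a random position inside its assigned interval
    return (intervals + rng.random((n, d))) / n

rng = np.random.default_rng(42)
unit_samples = latin_hypercube(8, 2, rng)

# rescale from the unit cube to (hypothetical) parameter bounds
low, high = np.array([-3.0, 0.0]), np.array([3.0, 20.0])
samples = low + unit_samples * (high - low)
```

Because every column is built from a permutation of the interval indices, each of the `n` intervals of each parameter contains exactly one sample.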

<center><img src="https://www.researchgate.net/publication/347334888/figure/fig1/AS:976080278138880@1609727081034/Comparison-of-random-stratified-and-latin-hypercube-samplings-with-16-points-d-2-M.png" style="max-width:50%"/></center>

| **Method**          | **Coverage**                       | **Bias**                 | **Complexity**  | **Marginal Distribution** |
|---------------------|------------------------------------|--------------------------|-----------------|---------------------------|
| Grid                | evenly spaced                      | low                      | simple          | discrete |
| Stratified          | randomly distributed               | low                      | simple          | roughly uniform |
| Random              | randomly distributed               | moderate (can be skewed) | simple          | roughly uniform  |
| LHS                 | evenly distributed in intervals    | low                      | moderate        | uniform  |
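
One way to put a number on the "Coverage" column is a discrepancy measure, which is lower for samples that fill the unit cube more evenly. The sketch below uses `scipy.stats.qmc.discrepancy` on a random and an LHS sample. The sample size and dimension are arbitrary choices for illustration, and the values change from run to run because the LHS sample is not seeded.

```python
import numpy as np
from scipy.stats import qmc

n, d = 64, 2
rng = np.random.default_rng(1)

random_sample = rng.random((n, d))               # plain random sampling in [0, 1)^d
lhs_sample = qmc.LatinHypercube(d=d).random(n)   # Latin hypercube sample in the unit cube

# centered discrepancy (the default method): lower values indicate more even coverage
disc_random = qmc.discrepancy(random_sample)
disc_lhs = qmc.discrepancy(lhs_sample)

print("random:", disc_random)
print("LHS:   ", disc_lhs)
```

For most draws the LHS sample should show the lower discrepancy, although this is not guaranteed for every single draw.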

## Loading Catchment Data

In [1]:
import sys
sys.path.append('src/')
import HBV
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from numbers import Number
from scipy.stats.qmc import LatinHypercube, scale
from ipywidgets import interact, interactive_output, Dropdown, Checkbox, VBox, Layout

In [2]:
def rmse(obs, sim, spinup=365):
    obs = obs[spinup:]
    sim = sim[spinup:]
    return np.sqrt(np.mean(np.square(np.subtract(obs, sim))))

def nse(obs, sim, spinup=365):
    obs = obs[spinup:]
    sim = sim[spinup:]
    r_nse     = np.corrcoef(obs, sim)[0][1] 
    alpha_nse = np.divide(np.std(sim), np.std(obs))
    beta_nse  = np.divide(np.subtract(np.mean(sim), np.mean(obs)), np.std(obs))
    return 2 * alpha_nse * r_nse - np.square(alpha_nse) - np.square(beta_nse)

def hbv(par, precip, temp, evap):
    # Run HBV snow routine
    p_s, _, _ = HBV.snow_routine(par[:4], temp, precip)
    # Run HBV runoff simulation
    Case = 1 # for now we assume that the preferred path in the upper zone is runoff (Case = 1), it can be set to percolation (Case = 2)
    ini = np.array([0,0,0]) # initial state
    runoff_sim, _, _ = HBV.hbv_sim(par[4:], p_s, evap, Case, ini)
    return runoff_sim
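
The `nse` function above does not use the textbook formula directly. It computes NSE from the correlation $r$, the variability ratio $\alpha$ and the normalised bias $\beta$ as $2\alpha r - \alpha^2 - \beta^2$. The following sketch checks on synthetic data that this agrees with the classical definition. The data and the helper names `nse_classic` and `nse_decomposed` are made up for illustration, and the spin-up handling is left out.

```python
import numpy as np

def nse_classic(obs, sim):
    """Textbook NSE: 1 - sum of squared errors / sum of squared deviations of obs."""
    return 1 - np.sum((obs - sim) ** 2) / np.sum((obs - np.mean(obs)) ** 2)

def nse_decomposed(obs, sim):
    """NSE from correlation (r), variability ratio (alpha) and normalised bias (beta)."""
    r = np.corrcoef(obs, sim)[0, 1]
    alpha = np.std(sim) / np.std(obs)
    beta = (np.mean(sim) - np.mean(obs)) / np.std(obs)
    return 2 * alpha * r - alpha**2 - beta**2

rng = np.random.default_rng(0)
obs = rng.gamma(2.0, 1.5, size=500)              # synthetic "observed" flow
sim = 0.8 * obs + rng.normal(0, 0.5, size=500)   # synthetic "simulated" flow

print(np.isclose(nse_classic(obs, sim), nse_decomposed(obs, sim)))
```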

In [3]:
# DO NOT ALTER! code to select the catchment

catchment_names = ["Medina River, TX, USA", "Siletz River, OR, USA", "Trout River, BC, Canada"]
dropdown = Dropdown(
    options=catchment_names,
    value=catchment_names[0],
    description='Catchment:',
    disabled=False
)

display(dropdown)


In [4]:
# read catchment data
catchment_name = dropdown.value
file_dic = {catchment_names[0]: "camels_08178880", catchment_names[1]: "camels_14305500", catchment_names[2]: "hysets_10BE007"}
df_obs = pd.read_csv(f"data/{file_dic[catchment_name]}.csv")

# correctly load the date and restrict the analysis to the years 2003-2005
df_obs.date = pd.to_datetime(df_obs['date'], format='%Y-%m-%d')
start_date  = '2003-01-01' # the first year is used as spinup
end_date    = '2005-12-30'

# Index frame by date
df_obs.set_index('date', inplace=True)
# Select time frame
df_obs = df_obs[start_date:end_date]
# Reformat the date for plotting
df_obs["date"] = df_obs.index.map(lambda s: s.strftime('%b-%d-%y'))
# Reindex
df_obs = df_obs.reset_index(drop=True)
# Select snow, precip, PET, streamflow and T
df_obs = df_obs[["snow_depth_water_equivalent_mean", "total_precipitation_sum","potential_evaporation_sum","streamflow", "temperature_2m_mean", "date"]]
# Rename variables
df_obs.columns = ["Snow [mm/day]", "P [mm/day]", "PET [mm/day]", "Q [mm/day]", "T [C]", "Date"]

# load calibrated parameters
params_calibrated = pd.read_csv("./data/calibrated_parameters - HBV.csv")
params_calibrated = params_calibrated[(params_calibrated.catchment_name == catchment_name) & (params_calibrated.objective_function == "nse")] # use only this catchment and the NSE-calibrated parameters
params_calibrated = params_calibrated.iloc[0,3:].values

## Comparing Different Sampling Strategies

In [5]:
param_names = ["Ts", "CFMAX", "CFR", "CWH", "BETA", "LP", "FC", "PERC", "K0", "K1", "K2", "UZL", "MAXBAS"]
lower       = np.array([-3, 0, 0, 0, 0, 0.3, 1, 0, 0.05, 0.01, 0.005, 0, 1])
upper       = np.array([3, 20, 1, 0.8, 7, 1, 2000, 100, 2, 1, 0.1, 100, 6])

# restrict the parameter range around a calibrated solution
ranges = list(zip(params_calibrated, lower, upper))
lower  = np.array([max(low, value - 0.1*(high - low)) for value, low, high in ranges])
upper  = np.array([min(value + 0.1*(high - low), high) for value, low, high in ranges])

def uniform(low, high, n=100, **kwargs):
    p = len(low)
    return np.random.uniform(low, high, (n, p))

def grid(low, high, n=100, m=5, **kwargs):
    dims = [np.linspace(l, h, m) for l, h in zip(low, high)]
    return np.array([np.random.choice(dim, n) for dim in dims]).T

def lhs(low, high, n=100, **kwargs):
    return scale(LatinHypercube(len(low)).random(n), low, high)

In [6]:
# DO NOT ALTER! code to illustrate the different sampling strategies

@interact(i=Dropdown(description="First P.", index=0, options=zip(param_names, range(13))), j=Dropdown(description="Second P.", index=1, options=zip(param_names, range(13))), n=(10, 1000, 10), m=(3, 20, 1))
def plot_scatter(i, j, n=100, m=5):

    np.random.seed(30)
    
    fig, axs = plt.subplots(2, 6, figsize=(15, 5), width_ratios=[4, 1, 4, 1, 4, 1], height_ratios=[1, 4], sharex="col", sharey="row")
    
    for k, strategy in enumerate([uniform, grid, lhs]):

        df = pd.DataFrame.from_records(strategy(lower[[i, j]], upper[[i, j]], n, m=m))
        df.columns = (i, j)
        
        ax = axs[1, k*2]
        ax.scatter(df[i], df[j], color=f"C{k}")
        ax.set_xlabel(param_names[i])
        ax.set_ylabel(param_names[j] if k == 0 else " ")
        
        ax = axs[0, k*2]
        ax.hist(df[i], color=f"C{k}")
        ax.set_title(strategy.__name__.upper() + ("*" if strategy == grid else ""))
        ax.set_yticks([])
        
        ax = axs[1, k*2 + 1]
        ax.hist(df[j], color=f"C{k}", orientation="horizontal")
        ax.set_xticks([])

        ax = axs[0, k*2+1]
        ax.axis("off")
        ax.set_xlim(ax.get_ylim())
        
    plt.tight_layout(w_pad=-0.5)
    plt.show()

    print("* For performance reasons, the grid strategy samples n random points from an m x m grid rather than all possible points.")
    print("  This will come in handy later, when we sample all 13 parameters. The full grid would then have m**13 points, which are too many to run HBV for.")


<div style="background:#e0f2fe; padding: 1%; border:1mm solid SkyBlue; color:black">
    <h4><span>&#129300; </span>Task I: Parameter Distributions</h4>
    Above you find the joint and marginal parameter distributions for three different sampling strategies (uniform, grid, LHS).
    <ol>
        <li>What differences and similarities between the joint and marginal distributions can you spot?</li>
        <li>Which sampling strategy fills the parameter space best and why is this important?</li>
    </ol>
</div>

_PUT YOUR ANSWERS HERE_

<div style="background:#e0f2fe; padding: 1%; border:1mm solid SkyBlue; color:black">
</div>

## Evaluating HBV for Different Sampling Strategies

In [7]:
# DO NOT ALTER! code to run hbv for sampled parameter values and plot results

runs = dict()

@interact(n=(10, 1000, 10), m=(3, 20, 1))
def plot_metrics(n=500, m=10):

    np.random.seed(30)

    fig, axs = plt.subplots(3, 2, figsize=(4, 6), sharex="col", sharey=True)
    
    for k, strategy in enumerate([uniform, grid, lhs]):
        print(f"Running HBV for n={n} samples for the {strategy.__name__.upper()} strategy")
        
        # sample n parameters using the strategy
        df = pd.DataFrame.from_records(strategy(lower, upper, n, m=m))

        # evaluate hbv using rmse and nse
        df_rmse = df.apply(lambda params: rmse(df_obs["Q [mm/day]"], hbv(params.to_numpy(), df_obs["P [mm/day]"], df_obs["T [C]"], df_obs["PET [mm/day]"])), axis=1)
        df_nse  = df.apply(lambda params:  nse(df_obs["Q [mm/day]"], hbv(params.to_numpy(), df_obs["P [mm/day]"], df_obs["T [C]"], df_obs["PET [mm/day]"])), axis=1)
        runs[(strategy, "rmse")] = pd.concat([df, df_rmse], axis=1)
        runs[(strategy,  "nse")] = pd.concat([df,  df_nse], axis=1)
        
        ax = axs[k, 0]
        ax.hist(df_rmse.values, color=f"C{k}", bins=30)
        ax.set_xlabel("RMSE")
        ax.set_title(strategy.__name__.upper())
        
        ax = axs[k, 1]
        ax.hist( df_nse.values, color=f"C{k}", bins=30)
        ax.set_xlabel("NSE")
        ax.set_title(strategy.__name__.upper())

    plt.tight_layout()
    plt.show()


<div style="background:#e0f2fe; padding: 1%; border:1mm solid SkyBlue; color:black">
    <h4><span>&#129300; </span>Task II: Model Evaluation</h4>
    You can now run HBV for n parameter sets drawn with each of the three sampling strategies (m tunes the grid size). The resulting plots show the distributions of RMSE and NSE for these model runs.
    <ol>
        <li>Can you spot further similarities and differences between the strategies?</li>
    </ol>
</div>

_PUT YOUR ANSWERS HERE_

<div style="background:#e0f2fe; padding: 1%; border:1mm solid SkyBlue; color:black">
</div>

## Partial Dependence Plots and Parameter Relevance

In [8]:
def bin_df(df, n_groups=10):
    """
    Input: 
        The function will be called automatically once for each parameter.
        The input is a pandas DataFrame with columns (param, metric) holding the parameter values and metric values that each run produced.
        These are the points in the scatter plots below.
    Your Task:
        Cut the param range into n_groups sections and calculate the mean parameter value and mean and variance of the metric value for each group.
    Expected Return:
        The resulting dataframe should have three columns (mean_param, mean_metric, vari_metric) and n_groups rows.
    """
    pass

def parameter_relevance(binned_df):
    """
    Input:
        This function will be called once per parameter.
        The input is your result from the bin_df function which represents the partial dependence of the metric on one parameter. 
        It should have columns (mean_param, mean_metric, vari_metric) and n_groups rows.
    Your Task:
        Think of and implement a way to measure which parameter is relevant, e.g. measure the impact of a parameter on the metric.
    Expected Return:
        A number that represents the parameter relevance.
    
    """
    pass

In [18]:
# DO NOT ALTER! code to plot partial dependence plots and parameter relevances

relevances = dict()

@interact(metric=["rmse", "nse"])
def plot_params_relation(metric):

    params = np.array(param_names)[[1, 4, 7, 9, 11]]

    if len(runs) == 0:
        print("To make this plot, please run HBV in Task II at least once.")
        return
    
    fig, axs = plt.subplots(3, 5, figsize=(5/3*6, 6), sharex="col", sharey="row")

    for i, strategy in enumerate([uniform, grid, lhs]):
        for j, param in enumerate(params):
            
            # grab the HBV runs from Task II
            df = runs[(strategy, metric)]
            m = df.iloc[:,-1] # the metric values
            p = df.iloc[:,param_names.index(param)]  # the parameter values

            # scatter all the runs in the background
            ax = axs[i, j]
            ax.scatter(p, m, color=f"C{i}", alpha=0.05)

            # bin the values into groups (IMPLEMENTED BY STUDENTS)
            df = pd.DataFrame(dict(param=p, metric=m))
            df = bin_df(df, n_groups=10)

            if df is not None:
                # check for errors in the implementation
                if df.shape != (10, 3):
                    raise RuntimeWarning(f"shape should be {(10, 3)}, yours has {df.shape}")
                if np.any(df.columns != ["mean_param", "mean_metric", "vari_metric"]):
                    raise RuntimeWarning(f"columns should be {['mean_param', 'mean_metric', 'vari_metric']}, yours has {df.columns}")

                # plot binned values and confidence intervals
                ax.scatter(df.mean_param, df.mean_metric, color=f"C{i}")
                ax.plot([df.mean_param, df.mean_param], [df.mean_metric - df.vari_metric**0.5, df.mean_metric + df.vari_metric**0.5], color=f"C{i}", marker="_", ms=5)
            else:
                ax.text(0.5, 0.5, "please implement \n bin_df", ha="center", va="center", transform=ax.transAxes)
                

            # calculate parameter relevance (IMPLEMENTED BY STUDENTS)
            relevance = parameter_relevance(df) if df is not None else np.nan
            #if not isinstance(relevance, Number):
            #    raise RuntimeWarning(f"relevance should be a float number, yours is {type(relevance)})")
            relevances[(strategy.__name__, param)] = relevance

            if j == 0: ax.set_title(strategy.__name__.upper())
            ax.set_ylabel(metric)
            ax.set_xlabel(param)

    fig.suptitle(f"Partial Dependence of {metric.upper()} on Parameters", fontsize=15)
    plt.tight_layout()
    plt.show()

    fig, axs = plt.subplots(1, 3, figsize=(5/3*6, 3), sharey=True)

    for i, strategy in enumerate([uniform, grid, lhs]):
        
        df = pd.DataFrame(relevances.values(), index=relevances.keys()).unstack(sort=False)
        df = df[df.index == strategy.__name__]
        
        ax = axs[i]
        ax.bar(params, df.replace(np.nan, 0).values.flatten(), color=f"C{i}")
        ax.set_title(strategy.__name__.upper())
        ax.set_ylabel("Parameter Importance")

        if np.all(np.isnan(df.values.flatten())):
            ax.text(0.5, 0.5, "please implement\nparameter_relevance", ha="center", va="center", transform=ax.transAxes)

    fig.suptitle("Your Parameter Importances", fontsize=15)
    plt.tight_layout()
    plt.show()


<div style="background:#e0f2fe; padding: 1%; border:1mm solid SkyBlue; color:black">
    <h4><span>&#129300; </span>Task III: Partial Dependence Plots and Parameter Relevance</h4>
    <p>
        Now it's your turn to implement some code yourself. Above, we want to show partial dependence plots of the metric values on the parameters and assess parameter relevance in some way.
        As you can see, we already scattered the RMSE and NSE values against the parameter values for all the HBV runs from Task II. Above the cell that creates this plot you will find two functions that you need to implement yourself.
    </p>
    <ol>
        <li>Your first task is to implement the function <code>bin_df</code> that bins the parameter ranges and calculates the mean and variance of the metric values in each bin. You'll find further information on the implementation details in the docstring.</li>
        <li>Once you have implemented your solution, rerun the plotting cell. What patterns can you now spot in the partial dependence plots?</li>
    </ol>
    <p>We now want to use these partial dependence plots to estimate parameter relevance in the model. The function <code>parameter_relevance</code> will be called with the dataframes that your implementation of <code>bin_df</code> returns and should return a number that represents the relevance of a parameter. For this task we want you to come up with your own way of determining which parameter is relevant. Don't worry: there is no single correct solution.</p>
    <ol start=3>
        <li>Implement your idea in the <code>parameter_relevance</code> function.</li>
        <li>Once you have implemented your solution, rerun the plotting cell. Which parameters have a high relevance according to your method?</li>
    </ol>
    <p>If you struggle to implement the code, you can find hints and solutions below.</p>
</div>

_PUT YOUR ANSWERS HERE_

<div style="background:#e0f2fe; padding: 1%; border:1mm solid SkyBlue; color:black">
</div>

## Hints and Solutions

### Hint for Dataframe Binning

Pandas offers easy dataframe manipulation. Two functions that could be helpful are `df.groupby` and `pd.cut`. You can read more about them in the documentation.
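
To see how the two functions interact, here is a small sketch on toy data. The column names `x` and `y` are hypothetical and differ from the ones used in this notebook, so you still need to adapt the idea to `bin_df`.

```python
import numpy as np
import pandas as pd

# toy data with hypothetical column names (not the ones used in this notebook)
toy = pd.DataFrame({"x": np.linspace(0, 1, 12), "y": np.arange(12.0)})

# pd.cut assigns each x value to one of 3 equal-width intervals
bins = pd.cut(toy["x"], 3)

# groupby then aggregates the rows that fall into the same interval
summary = toy.groupby(bins, observed=False)["y"].agg(["mean", "count"])
print(summary)
```

Each row of `summary` corresponds to one interval of `x` and holds the mean of `y` and the number of rows in that interval.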

### Solution

In [14]:
def bin_df(df, n_groups=10):
    """
    Input: df is a pandas DataFrame with columns (param, metric) holding the parameter and metric values.
    Your task is to cut the param range into n_groups sections and calculate the mean parameter value and the mean and variance of the metric value for each section.
    The resulting dataframe should have three columns (mean_param, mean_metric, vari_metric).
    """
    groups = pd.cut(df.param, n_groups)
    binned_df = pd.DataFrame()
    binned_df["mean_param"]  = df.groupby(groups, observed=False).mean().param.values
    binned_df["mean_metric"] = df.groupby(groups, observed=False).mean().metric.values
    binned_df["vari_metric"] = df.groupby(groups, observed=False).var().metric.values
    return binned_df

### Hint for Parameter Relevance

Note how for some parameters the RMSE or NSE is nearly independent of the parameter value. How could you express the strength of the relationship between the parameter value and the metric value?

### Solution

In [17]:
def parameter_relevance(binned_df):
    # variance of the mean metric values -> a stronger trend means a higher variance
    return binned_df.mean_metric.var()
    # another approach could be to calculate the correlation coefficient
    # return abs(np.corrcoef(binned_df.mean_param, binned_df.mean_metric)[0, 1])