In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import nivapy3 as nivapy
import numpy as np
import pandas as pd

plt.style.use("ggplot")

# Estimating fluxes using NivaPy

NivaPy includes some simple functions for estimating riverine fluxes (also known as "loads"). This notebook creates a synthetic dataset based on discharge data from Langtjern and compares the various methods.

## 1. Read Langtjern discharge data

In [None]:
# Read CSV data
csv_path = r"/home/jovyan/dstoolkit_examples/data/csv/langtjern_daily_flows.csv"
df = pd.read_csv(csv_path, parse_dates=True, index_col=0)

# Linear interp. of missing data
df.interpolate(method="linear", inplace=True)

# Flux functions require column named 'flow_m3/s'
df.rename({"flow_m3ps": "flow_m3/s"}, inplace=True, axis="columns")

df.head()

## 2. Create "fake" concentration data

We assume a log-log relationship between discharge and concentration (which is often theorised and *sometimes* observed in reality). The aim is just to generate a vaguely plausible timeseries of daily concentrations that can be used as a reference.

$$\log_{10}(C) = m \log_{10}(Q) + \log_{10}(\alpha) + \epsilon$$

where $\epsilon \sim \mathcal{N}(0,\,\sigma^{2})$
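
Assuming base-10 logarithms (as used in the code for this section), raising 10 to the power of both sides gives the equivalent power-law form. This makes it explicit that the noise term is multiplicative, so $C$ is log-normally scattered around the rating curve:

$$C = \alpha \, Q^{m} \, 10^{\epsilon}$$

With $\epsilon = 0$ this reduces to the deterministic relationship $C = \alpha Q^{m}$.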

In [None]:
# Create fake data
m = 0.3
alpha = 2
sigma = 0.1

# Calculate concentrations
np.random.seed(0)
df["par_mg/l"] = 10 ** (
    m * np.log10(df["flow_m3/s"])
    + np.log10(alpha)
    + np.random.normal(loc=0, scale=sigma, size=len(df["flow_m3/s"]))
)

df.plot(subplots=True, figsize=(12, 8))

## 3. Estimate "true" fluxes

In [None]:
# "True" daily flux in kg:
#   m3/s * 86400 s/day * 1000 l/m3 -> l/day; * mg/l -> mg/day; / 1e6 -> kg/day
df["true_flux_kg"] = 1000 * 24 * 60 * 60 * df["flow_m3/s"] * df["par_mg/l"] / 1e6
ann_true_df = df.resample("YE").sum()[["true_flux_kg"]]
ann_true_df.index = ann_true_df.index.year
ann_true_df.head()
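
The unit conversion used for the "true" flux can be checked by hand. The sketch below is illustrative only: the helper `daily_flux_kg` and the constant names are hypothetical and not part of NivaPy. It assumes a constant flow and concentration over one day.

```python
# Hypothetical helper for illustration; not part of NivaPy
SECONDS_PER_DAY = 24 * 60 * 60
LITRES_PER_M3 = 1000
MG_PER_KG = 1e6


def daily_flux_kg(flow_m3_per_s, conc_mg_per_l):
    """Daily flux in kg for a constant flow (m3/s) and concentration (mg/l)."""
    # Volume of water passing in one day, in litres
    litres_per_day = flow_m3_per_s * SECONDS_PER_DAY * LITRES_PER_M3
    # Mass transported in one day, in milligrams
    mg_per_day = litres_per_day * conc_mg_per_l
    # Convert milligrams to kilograms
    return mg_per_day / MG_PER_KG


# 1 m3/s at 2 mg/l for one day
example_flux = daily_flux_kg(1.0, 2.0)
print(example_flux)
```

For 1 m3/s at 2 mg/l this works out to 86,400,000 l/day x 2 mg/l = 172,800,000 mg/day, i.e. 172.8 kg/day.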

## 4. Estimate fluxes with NivaPy

### 4.1. Using all data

Using the "full" series should give similar results to the "true" series (except when `method='simple_means'`, which is very biased). This isn't very useful in practice, but it serves as a check for implementation errors in the code.
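
One general reason a means-based estimator can be biased is sketched below. This is an assumption made for illustration and not necessarily how NivaPy implements `simple_means`: the estimator here multiplies mean flow by mean concentration. The flow series is synthetic (log-normal), not the Langtjern data. When concentration increases with flow, the product of the means is smaller than the mean of the products, so the flux is underestimated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily flows (m3/s); hypothetical, not the Langtjern series
q = rng.lognormal(mean=0.0, sigma=1.0, size=3650)

# Noise-free concentrations (mg/l) that increase with flow: C = alpha * Q^m
c = 2.0 * q**0.3

secs_per_day = 24 * 60 * 60

# Reference: sum of daily fluxes in kg
true_flux_kg = (1000 * secs_per_day * q * c / 1e6).sum()

# "Product of means" estimator (assumed form, for illustration only)
means_flux_kg = 1000 * secs_per_day * q.mean() * c.mean() * q.size / 1e6

ratio = means_flux_kg / true_flux_kg
print(f"Product-of-means estimate / reference flux: {ratio:.2f}")
```

Because flow and concentration are positively related here, the ratio is always below 1, regardless of the random seed.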

In [None]:
# Get help for nivapy flux estimation methods
nivapy.stats.estimate_fluxes?

In [None]:
# Extract datasets
q_df = df[["flow_m3/s"]].copy()
conc_df = df[["par_mg/l"]].copy()

# Container for results
df_list = []

# Loop over methods
for method in [
    "linear_interpolation",
    "simple_means",
    "log_log_linear_regression",
    "ospar_annual",
]:
    print(f"Processing: {method}")
    # Estimate fluxes
    flux_df = nivapy.stats.estimate_fluxes(
        q_df, conc_df, base_freq="D", agg_freq="YE", method=method
    )

    # Delete flow vol, as only the flux estimate is needed
    if method != "ospar_annual":
        del flux_df["flow_m3"]

        # Convert date-times to integer years
        flux_df.index = flux_df.index.year

    # Rename col with method name for later
    flux_df.columns = [method]

    # Add to results
    df_list.append(flux_df)

# Merge results
flux_df = pd.concat(df_list, axis="columns")

flux_df.head()

These results compare well to the true/reference fluxes above, which suggests the code is working as expected.

### 4.2. Consider only monthly sampling

What if we include only every 30th observation from the concentration data (i.e. approximately monthly sampling)? This is a more realistic test of the statistical properties of each algorithm.
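
The subsampling step can be sketched on a small made-up series. The names `q_demo`, `conc_demo` and `sparse` are hypothetical. The sketch uses `reindex`, which should give the same result as concatenating against the daily flow index: the concentration series keeps one row per day, with `NaN` on days without a sample.

```python
import numpy as np
import pandas as pd

# Small made-up daily series (90 days); values are arbitrary
idx = pd.date_range("2000-01-01", periods=90, freq="D")
q_demo = pd.DataFrame({"flow_m3/s": np.linspace(1.0, 2.0, 90)}, index=idx)
conc_demo = pd.DataFrame({"par_mg/l": np.linspace(2.0, 3.0, 90)}, index=idx)

# Keep every 30th sample, then restore the full daily index.
# Days without a sample become NaN.
sparse = conc_demo[::30].reindex(q_demo.index)

n_obs = int(sparse["par_mg/l"].notna().sum())
print(f"{n_obs} samples on a {len(sparse)}-day index")
```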

In [None]:
# Extract datasets
q_df = df[["flow_m3/s"]].copy()
conc_df = df[["par_mg/l"]].copy()

# Use every 30th conc measurement
conc_df = pd.concat([q_df, conc_df[::30]], axis="columns")[["par_mg/l"]]

# Container for results
df_list = []

# Loop over methods
for method in [
    "linear_interpolation",
    "simple_means",
    "log_log_linear_regression",
    "ospar_annual",
]:
    print(f"Processing: {method}")
    # Estimate fluxes
    flux_df = nivapy.stats.estimate_fluxes(
        q_df, conc_df, base_freq="D", agg_freq="YE", method=method
    )

    # Delete flow vol, as only the flux estimate is needed
    if method != "ospar_annual":
        del flux_df["flow_m3"]

        # Convert date-times to integer years
        flux_df.index = flux_df.index.year

    # Rename col with method name for later
    flux_df.columns = [method]

    # Add to results
    df_list.append(flux_df)

# Merge results
flux_df = pd.concat(df_list, axis="columns")

flux_df.head()

Note that `method='log_log_linear_regression'` does a good job of estimating the correct regression coefficients here: $m = 0.3$ and $\alpha = 2 \approx 10^{0.3}$ (since $\log_{10}(2) \approx 0.301$).
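
The idea behind recovering $m$ and $\alpha$ from sparse samples can be sketched with an ordinary least-squares fit in log-log space. This is a minimal illustration using `np.polyfit` on a synthetic flow series, not NivaPy's implementation; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m_true, alpha_true, sigma = 0.3, 2.0, 0.1

# Synthetic daily flows (hypothetical, not the Langtjern series)
q = rng.lognormal(mean=0.0, sigma=1.0, size=3650)

# log10(C) = m * log10(Q) + log10(alpha) + noise
log_c = (
    m_true * np.log10(q)
    + np.log10(alpha_true)
    + rng.normal(loc=0.0, scale=sigma, size=q.size)
)

# Keep every 30th observation (roughly monthly sampling)
log_q_obs = np.log10(q[::30])
log_c_obs = log_c[::30]

# Straight-line fit in log-log space: slope is m, intercept is log10(alpha)
m_hat, intercept = np.polyfit(log_q_obs, log_c_obs, deg=1)
alpha_hat = 10**intercept

print(f"m_hat = {m_hat:.3f}, alpha_hat = {alpha_hat:.3f}")
```

The fitted intercept is in $\log_{10}$ units, so it needs back-transforming to compare with $\alpha$.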

## 5. Compare estimates to true values

In [None]:
# Plot
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 6))
ann_true_df.plot(ax=ax, lw=2, c="k")
flux_df.plot(ax=ax, style="--")
_ = ax.set_ylabel("Flux in kg")

Unsurprisingly, the log-log regression method works very well here: the "fake" data were generated by assuming a (noisy) log-log relationship, so the $R^2$ value in this example is much higher than is usually observed in reality. The OSPAR approach is also reasonable (although it overestimates in 2006 for some reason), while the other two methods tend to underestimate fluxes. This is largely as expected from statistical theory.
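
Over- and underestimation can also be expressed as a percentage bias relative to the reference flux. The sketch below uses small made-up numbers purely to show the calculation; `true_demo`, `est_demo`, `method_a` and `method_b` are hypothetical and are not results from this notebook.

```python
import pandas as pd

# Made-up annual fluxes in kg (illustrative values only)
true_demo = pd.Series([100.0, 200.0], index=[2005, 2006], name="true_flux_kg")
est_demo = pd.DataFrame(
    {"method_a": [90.0, 220.0], "method_b": [100.0, 150.0]},
    index=[2005, 2006],
)

# Percentage bias per year and method: 100 * (estimate - true) / true
pct_bias = est_demo.sub(true_demo, axis="index").div(true_demo, axis="index") * 100

print(pct_bias.round(1))
```

With the notebook's own variables, the same pattern should apply with `flux_df` as the estimates and `ann_true_df["true_flux_kg"]` as the reference, provided both are indexed by integer year.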

A more detailed understanding of the performance of each algorithm could be obtained by **bootstrap resampling** the synthetic concentration dataset, but this is not covered here.
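
As a rough indication of what such an analysis might look like, the sketch below bootstraps a simple log-log regression flux estimate. It is an illustration under several assumptions: the flow series is synthetic, the estimator is a plain `np.polyfit` fit rather than any NivaPy method, and the helper names (`total_flux_kg`, `regression_flux_kg`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily flows and concentrations (hypothetical data)
q = rng.lognormal(mean=0.0, sigma=1.0, size=3650)
c = 2.0 * q**0.3 * 10 ** rng.normal(loc=0.0, scale=0.1, size=q.size)


def total_flux_kg(flow_m3s, conc_mgl):
    """Total flux in kg from daily flow (m3/s) and concentration (mg/l)."""
    return (1000 * 86400 * flow_m3s * conc_mgl / 1e6).sum()


def regression_flux_kg(q_all, q_obs, c_obs):
    """Fit log10(C) against log10(Q) on the samples, then predict all days."""
    slope, intercept = np.polyfit(np.log10(q_obs), np.log10(c_obs), deg=1)
    c_pred = 10 ** (slope * np.log10(q_all) + intercept)
    return total_flux_kg(q_all, c_pred)


# Positions of the "monthly" samples (every 30th day)
obs_idx = np.arange(0, q.size, 30)

# Resample the sampled days with replacement and re-estimate the flux
n_boot = 200
boot = np.empty(n_boot)
for i in range(n_boot):
    sample = rng.choice(obs_idx, size=obs_idx.size, replace=True)
    boot[i] = regression_flux_kg(q, q[sample], c[sample])

lower, upper = np.percentile(boot, [2.5, 97.5])
reference = total_flux_kg(q, c)
print(f"Reference flux: {reference:.0f} kg")
print(f"Bootstrap 95% interval: {lower:.0f} to {upper:.0f} kg")
```

The spread of `boot` gives a sense of how sensitive the estimate is to which days happened to be sampled. The same loop could wrap any of the other estimation methods.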