In [1]:
import os

import numpy as np
import pandas as pd

# TEOTIL3: Tidy annual data

## Part 3: Large wastewater

This notebook estimates inflows and outflows of nutrients from "large" wastewater sites (≥50 p.e.) based on data provided by Torstein Finnesand at Miljødirektoratet.

## Workflow overview

 1. Torstein provides several data files that need combining, cleaning and patching to create a coherent dataset.
    
 2. Inflow and outflow fluxes use measured data where available, and are otherwise estimated based on the number of people & cabins connected and typical nutrient discharges per person.

 3. Cleaning and gap-filling are performed for TOTN, TOTP, SS and BOF5 (and, to a lesser extent, KOF).

 4. Annual Excel files are generated for **nutrients**. Files summarising reported annual discharges of **metals** are also created. These are not used directly in TEOTIL3, but they are relevant for OSPAR reporting, so it is useful to have them in the database.

In [2]:
# Final year for which emissions will be estimated
final_year = 2023

# Raw datasets to use i.e.
# /home/jovyan/shared/common/teotil3/point_data/raw_data_delivered_{deliv_year}
deliv_year = 2025

# Excel files from Torstein
site_xls = "AnleggSøk2025.03.28.xlsx"
meta_xls1 = "Anleggsdata - Avløp 04.09.2025.xlsx"
meta_xls2 = "Anleggsdata 2 - Avløp 28.03.2025.xlsx"
net_xls = "Avløpsnett - Avløp 25.08.2025.xlsx"
dis_xls = "Årlig utslipp - Avløp 21.08.2025.xlsx"
link_xls = "Avløpsnettsanlegg med ID til RA. PQ.xlsx"

## 1. Default discharges per person

The default values come from a [book published by Norsk Vann](https://va-kompetanse.no/butikk/laerebok-i-vann-og-avlopsteknikk/). See e-mail from Gisle received 27.03.2025 for a screenshot of the relevant page.

These factors are used for both large and small wastewater sites.

In [3]:
# Get a dict mapping parameters to per-person discharges
url = r"https://raw.githubusercontent.com/NIVANorge/teotil3/refs/heads/main/data/nutrient_discharges_per_person.csv"
pers_df = pd.read_csv(url)

# Convert to dict
pers_dict = dict(zip(pers_df["parameter"], pers_df["g_per_pers_per_day"]))
pers_dict

{'totn': 12.0, 'totp': 1.8, 'bof5': 60.0, 'kof': 120.0, 'ss': 70.0}
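
As a rough illustration of how these factors translate into annual loads, the sketch below converts a per-person rate in g/day into tonnes/year. The helper name `annual_load_tonnes` and the example population of 1000 people are hypothetical; the 365.25 days per year and the division by 1e6 mirror the conversion used later in this notebook.

```python
# Per-person discharges in g/person/day (values from the output above)
pers_dict = {"totn": 12.0, "totp": 1.8, "bof5": 60.0, "kof": 120.0, "ss": 70.0}


def annual_load_tonnes(n_people, g_per_pers_per_day, days_per_year=365.25):
    """Convert a per-person daily discharge (g) to an annual load (tonnes).

    Hypothetical helper for illustration only.
    """
    return n_people * g_per_pers_per_day * days_per_year / 1e6


# Annual untreated load from an illustrative population of 1000 people
example_loads = {
    par: annual_load_tonnes(1000, rate) for par, rate in pers_dict.items()
}
print(example_loads)
```

These are loads before treatment; outflows additionally depend on the treatment efficiencies described in the next section.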

## 2. Treatment efficiencies

The default treatment efficiencies for large plants assumed by Miljødirektoratet have been provided by Torstein (see e-mail received 21.08.2025 at 06:56). They are stored in `treatment_efficiencies_large_wastewater.csv`.

In [4]:
# Read treatment efficiency data
url = r"https://raw.githubusercontent.com/NIVANorge/teotil3/refs/heads/main/data/treatment_efficiencies_large_wastewater.csv"
eff_df = pd.read_csv(url).rename(columns={"mdir_type": "type"})
eff_df

Unnamed: 0,type,teotil_type,totp_eff_pct,totn_eff_pct,bof5_eff_pct,kof_eff_pct,ss_eff_pct
0,Annen rensing,Annen rensing,75,20,85,85,60
1,Biologisk,Biologisk,30,20,90,80,85
2,Kjemisk,Kjemisk,90,20,75,75,88
3,Kjemisk-biologisk,Kjemisk-biologisk,95,25,95,90,90
4,Kjemisk-biologisk med nitrogenfjerning,Kjemisk-biologisk m/N-fjerning,95,70,95,95,95
5,Mekanisk,Mekanisk,15,15,20,40,45
6,Mekanisk - sil/rist,"Mekanisk - sil, rist",15,15,20,35,45
7,Mekanisk - slamavskiller,Mekanisk - slamavskiller,15,15,20,50,45
8,Naturbasert,Naturbasert,75,20,85,85,87
9,Urenset,Urenset,0,0,0,0,0


## 3. Read raw data from Miljødirektoratet

Raw files for large wastewater sites are exported from Miljødirektoratet's database and sent by Torstein Finnesand. The **raw datasets should span the entire period from 2010 to the year of interest** (which is usually the current year minus one). This is because historic data are used both for outlier detection and for patching data gaps. The raw files are as follows:

 * `AnleggSøkYYYY.MM.DD.xlsx` provides basic site details (location co-ordinates etc.).
   
 * `Anleggsdata - Avløp YYYY.MM.DD.xlsx` includes extra information, including site capacity.

 * `Anleggsdata 2 - Avløp YYYY.MM.DD.xlsx` includes the renseprinsipp.

 * `Avløpsnett - Avløp YYYY.MM.DD.xlsx` includes the number of people and fritidsboliger (cabins etc.) connected to each plant.

 * `Årlig utslipp - Avløp YYYY.MM.DD.xlsx` is the main dataset containing annual inflows and outflows of nutrients at each plant (either directly measured or based on estimates made by MDir).

 * `Avløpsnettsanlegg med ID til RA. PQ.xlsx` is a lookup table linking each treatment plant to its associated network components (see below).

`Avløpsnett - Avløp YYYY.MM.DD.xlsx` includes data both for the main treatment plants and for the network components connected to them. The latter are listed with `activity == 'Offentlig avløpsnett'` in the Excel files. We are only interested in the treatment plants themselves, not the separate network components. In most cases, data for the treatment plants match the sum of the connected network components, but there are a few cases where data are only provided at network level.

The code in the cells below first reads all the Excel files listed above and joins them together to give the location, capacity, treatment type, and number of people & cabins connected to each plant. Gaps in this dataset are then patched where possible using aggregated data from the network dataset (linked by `Avløpsnettsanlegg med ID til RA. PQ.xlsx`).

Before running the code for a new data delivery:

 1. Remove any comment rows from the top of the files listed above.

 2. Upload all the files to

        /home/jovyan/shared/common/teotil3/point_data/raw_data_delivered_{deliv_year}

 3. Update the dates in the filenames at the start of this notebook to identify the correct files (this is left as a manual step, because there are often several iterations of data delivery).
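
The network-based patching described above can be sketched on toy data as follows. All site IDs and values here are invented for illustration; only the column names (`anlegg_nr`, `net_anlegg_nr`, `n_connected`, `n_fritid`) follow the notebook. The idea is to sum the network components linked to each plant, then use those sums only where the plant-level value is missing.

```python
import numpy as np
import pandas as pd

# Plant-level data: plant "A" is missing its connection counts
prop_df = pd.DataFrame(
    {
        "anlegg_nr": ["A", "B"],
        "year": [2020, 2020],
        "n_connected": [np.nan, 500.0],
        "n_fritid": [np.nan, 10.0],
    }
)

# Network-level data for the components feeding each plant
net_df = pd.DataFrame(
    {
        "net_anlegg_nr": ["N1", "N2", "N3"],
        "year": [2020, 2020, 2020],
        "n_connected": [100.0, 250.0, 999.0],
        "n_fritid": [5.0, 0.0, 99.0],
    }
)

# Lookup table linking network components to treatment plants
link_df = pd.DataFrame(
    {"anlegg_nr": ["A", "A", "B"], "net_anlegg_nr": ["N1", "N2", "N3"]}
)

# Aggregate network components to plant level
net_agg = (
    net_df.merge(link_df, on="net_anlegg_nr", how="left")
    .drop(columns="net_anlegg_nr")
    .groupby(["anlegg_nr", "year"], as_index=False)
    .sum()
)

# Use the aggregated values only where plant-level data are missing
patched = prop_df.merge(
    net_agg, on=["anlegg_nr", "year"], how="left", suffixes=("", "_net")
)
for col in ["n_connected", "n_fritid"]:
    patched[col] = patched[col].fillna(patched[f"{col}_net"])
    patched = patched.drop(columns=f"{col}_net")

print(patched)
```

In this toy case plant "A" is filled from its two network components, while plant "B" keeps its reported values even though the network sum differs.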

In [5]:
# Location of raw data
raw_dir = (
    f"/home/jovyan/shared/common/teotil3/point_data/raw_data_delivered_{deliv_year}"
)

# Basic site details ####################################################################
xl_path = os.path.join(raw_dir, site_xls)
site_df = pd.read_excel(xl_path, sheet_name="Treffliste")
col_dict = {
    "Anleggsnr.": "anlegg_nr",
    "Navn på anlegg": "anlegg_name",
    "Anleggsaktivitet": "activity",
    "Kommune": "kommune",
    "Fylke": "fylke",
    "Sone (lok.)": "site_zone",
    "Øst (lok.)": "site_east",
    "Nord (lok.)": "site_north",
    "Sone (utslipp)": "outlet_zone",
    "Øst (utslipp)": "outlet_east",
    "Nord (utslipp)": "outlet_north",
    "Kilderefnr.": "kilderefnr",
}
site_df.rename(columns=col_dict, inplace=True)
site_df = site_df[col_dict.values()]

# We are not interested in the avløpsnett, just the wastewater plants
# See above and e-mail from Torstein received 08.05.2025
site_df = site_df.query("activity != 'Offentlig avløpsnett'")

# Additional site details 1 - capacity ##################################################
xl_path = os.path.join(raw_dir, meta_xls1)
cap_df = pd.read_excel(xl_path, sheet_name="Sheet1")
prop_dict = {
    "Dimensjonerende kapasitet, i pe": "design_capacity",
    "Tilført mengde til avløpsanlegget inkl. overløp, i pe": "current_capacity",
    "Tilført mengde til avløpsanlegget inkl. overløp, i pe, beregnet av Forurensning ut fra BOF5 og fmaks = 1,5": "current_capacity_est",
    "Oppstartsår": "start_year",
    "Siste utvidelsesår": "upgraded_year",
    "Nedlagt? Hvis ja, angi nedlagt år og hvor avløpsvannet nå blir ført": "shut_down_year",
}
cap_df["variable"] = cap_df["Spørsmål"].replace(prop_dict)
var_list = list(prop_dict.values())
cap_df = cap_df.query("variable in @var_list")
col_dict = {
    "År": "year",
    "AnleggNummer": "anlegg_nr",
    "variable": "variable",
    "Verdi": "value",
}
cap_df.rename(columns=col_dict, inplace=True)
cap_df = cap_df[col_dict.values()]
cap_df = cap_df.pivot(
    index=["anlegg_nr", "year"], columns="variable", values="value"
).reset_index()
cap_df.columns.name = ""
cap_df["current_capacity"] = (
    cap_df["current_capacity"].fillna(cap_df["current_capacity_est"]).round(0)
)
del cap_df["current_capacity_est"]

# Additional site details 2 - renseprinsipp #############################################
xl_path = os.path.join(raw_dir, meta_xls2)
ren_df = pd.read_excel(xl_path, sheet_name="Sheet1")
col_dict = {
    "År": "year",
    "AnleggNummer": "anlegg_nr",
    "Renseprinsipp": "type",
}
ren_df = ren_df.rename(columns=col_dict).dropna(subset="type")
ren_df = ren_df[col_dict.values()]

# Additional site details 3 - number connected ##########################################
xl_path = os.path.join(raw_dir, net_xls)
con_df = pd.read_excel(xl_path, sheet_name="Sheet1").query(
    "`Tilknytning, overløp og lekkasjer` in ('Antall fritidsboliger tilknyttet avløpsnettet', 'Antall innbyggere tilknyttet avløpsnettet')"
)
id_cols = [
    "År",
    "AnleggNummer",
    "Anleggsaktivitet",
    "Tilknytning, overløp og lekkasjer",
]
con_df = (
    con_df[id_cols + ["Verdi"]]
    .set_index(id_cols)
    .unstack("Tilknytning, overløp og lekkasjer")
)
con_df.columns = con_df.columns.get_level_values(1)
con_df = con_df.reset_index()
col_dict = {
    "År": "year",
    "AnleggNummer": "anlegg_nr",
    "Anleggsaktivitet": "activity",
    "Antall fritidsboliger tilknyttet avløpsnettet": "n_fritid",
    "Antall innbyggere tilknyttet avløpsnettet": "n_connected",
}
con_df.rename(columns=col_dict, inplace=True)
con_df = con_df[col_dict.values()]

# Merge
prop_df = pd.merge(ren_df, cap_df, on=["year", "anlegg_nr"], how="outer")
prop_df = pd.merge(prop_df, con_df, on=["year", "anlegg_nr"], how="outer")

# Split into site properties and network properties
net_df = prop_df.query("activity == 'Offentlig avløpsnett'").copy()
prop_df = prop_df.query("activity != 'Offentlig avløpsnett'")
prop_df = prop_df.dropna(subset="type")
del prop_df["activity"]
net_df = net_df.drop(
    columns=[
        "activity",
        "type",
        "design_capacity",
        "current_capacity",
        "start_year",
        "upgraded_year",
        "shut_down_year",
    ]
)
net_df = net_df.rename(columns={"anlegg_nr": "net_anlegg_nr"})

# Linking table connecting network components to processing plants ######################
xl_path = os.path.join(raw_dir, link_xls)
link_df = pd.read_excel(xl_path, sheet_name="Avløpsnett")
link_df = link_df.rename(
    columns={"RAAnleggNummer": "anlegg_nr", "AnleggNummer": "net_anlegg_nr"}
)[["anlegg_nr", "net_anlegg_nr"]]

# Join to 'net_df' and aggregate
net_df = pd.merge(net_df, link_df, how="left", on="net_anlegg_nr").dropna(
    subset=["anlegg_nr", "year"]
)
del net_df["net_anlegg_nr"]
net_df = net_df.groupby(["anlegg_nr", "year"]).sum().reset_index()

# Join to prop_df
prop_df = pd.merge(
    prop_df, net_df, how="left", on=["anlegg_nr", "year"], suffixes=["", "_net"]
)

# Fill NaNs where possible
for col in ["n_fritid", "n_connected"]:
    prop_df[col] = prop_df[col].fillna(prop_df[f"{col}_net"])
    del prop_df[f"{col}_net"]

# Discharge data ########################################################################
xl_path = os.path.join(raw_dir, dis_xls)
dis_df = pd.read_excel(xl_path, sheet_name="Sheet1")

# Just pars of interest
par_dict = {
    "fosfor, total": "totp",
    "nitrogen, totalt": "totn",
    "biokjemisk oksygenforbruk (BOF), 5 døgn": "bof5",
    "kjemisk oksygenforbruk (KOF)": "kof",
    "suspendert stoff": "ss",
}
par_list = list(par_dict.keys())
# Copy so that the column assignment below does not act on a view of the raw data
dis_df = dis_df.query("Stoff in @par_list").copy()
dis_df["Stoff"] = dis_df["Stoff"].replace(par_dict)
col_dict = {
    "År": "year",
    "AnleggNummer": "anlegg_nr",
    "Stoff": "par",
    "Tilført mengde": "in_tonnes",
    "Utslippsmengde": "out_tonnes",
    "Grunnlag for utslippet": "method",
}
dis_df.rename(columns=col_dict, inplace=True)
dis_df = dis_df[col_dict.values()]
dis_df["method"] = dis_df["method"].fillna("Unknown")

# Pivot to wide
dis_df = dis_df.pivot_table(
    index=["year", "anlegg_nr", "method"],
    columns="par",
    values=["in_tonnes", "out_tonnes"],
)
dis_df.columns = [f"{col[1]}_{col[0]}" for col in dis_df.columns]
dis_df.reset_index(inplace=True)

# Combine
cols = [
    "anlegg_nr",
    "kilderefnr",
    "anlegg_name",
    "activity",
    "kommune",
    "fylke",
    "site_zone",
    "site_east",
    "site_north",
    "outlet_zone",
    "outlet_east",
    "outlet_north",
    "design_capacity",
    "current_capacity",
    "n_fritid",
    "n_connected",
    "start_year",
    "upgraded_year",
    "shut_down_year",
    "type",
    "year",
    "method",
    "bof5_in_tonnes",
    "bof5_out_tonnes",
    "kof_in_tonnes",
    "kof_out_tonnes",
    "ss_in_tonnes",
    "ss_out_tonnes",
    "totn_in_tonnes",
    "totn_out_tonnes",
    "totp_in_tonnes",
    "totp_out_tonnes",
]
df = pd.merge(dis_df, prop_df, on=["year", "anlegg_nr"], how="left")
df = pd.merge(df, site_df, on="anlegg_nr", how="left")
df = (
    df[cols]
    .sort_values(by=["anlegg_nr", "year"], ascending=True)
    .reset_index(drop=True)
)

df.head()

Unnamed: 0,anlegg_nr,kilderefnr,anlegg_name,activity,kommune,fylke,site_zone,site_east,site_north,outlet_zone,...,bof5_in_tonnes,bof5_out_tonnes,kof_in_tonnes,kof_out_tonnes,ss_in_tonnes,ss_out_tonnes,totn_in_tonnes,totn_out_tonnes,totp_in_tonnes,totp_out_tonnes
0,0301.0979.01,0301AL01,Bekkelaget renseanlegg med tilførselstuneller ...,Avløpsnett og -rensing,Oslo,Oslo,32.0,599070.0,6639810.0,32.0,...,7378.308128,526.710061,18562.913291,1461.364028,,,1328.249,407.903,,
1,0301.0979.01,0301AL01,Bekkelaget renseanlegg med tilførselstuneller ...,Avløpsnett og -rensing,Oslo,Oslo,32.0,599070.0,6639810.0,32.0,...,7808.765696,574.262503,21736.681119,2845.252931,,,1449.41,479.505,166.855,17.143
2,0301.0979.01,0301AL01,Bekkelaget renseanlegg med tilførselstuneller ...,Avløpsnett og -rensing,Oslo,Oslo,32.0,599070.0,6639810.0,32.0,...,6617.02,323.644,21146.08,1809.76,,,1452.467,412.18,171.102,11.191
3,0301.0979.01,0301AL01,Bekkelaget renseanlegg med tilførselstuneller ...,Avløpsnett og -rensing,Oslo,Oslo,32.0,599070.0,6639810.0,32.0,...,7532.404,351.8728,19336.6,1745.56,,,1389.5,435.0,170.1,7.14
4,0301.0979.01,0301AL01,Bekkelaget renseanlegg med tilførselstuneller ...,Avløpsnett og -rensing,Oslo,Oslo,32.0,599070.0,6639810.0,32.0,...,7031.698239,548.827537,21407.565445,2968.005254,,,1489.6,433.0,174.4,15.18


## 4. Patch basic metadata

Some site-years are missing metadata. Patch gaps with info from the same site in other years, if possible.

**Check that the number of records being dropped due to incomplete data is fairly small (e.g. < 100).**
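
A minimal sketch of this patching on toy data is shown below. The site IDs and values are invented; the grouping column and the fill-forward-then-fill-backward pattern follow the cell below. Site "B" has no metadata in any year, so it cannot be patched and is dropped.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "anlegg_nr": ["A", "A", "A", "B", "B"],
        "year": [2019, 2020, 2021, 2020, 2021],
        "kommune": [None, "Oslo", None, None, None],
        "n_connected": [np.nan, 100.0, np.nan, np.nan, np.nan],
    }
)

# Within each site, fill forward in time and then backward
cols = ["kommune", "n_connected"]
toy[cols] = toy.groupby("anlegg_nr")[cols].transform(lambda x: x.ffill().bfill())

# Drop site-years that still have gaps and count what was lost
n_before = len(toy)
toy = toy.dropna(subset=cols, how="any")
n_dropped = n_before - len(toy)
print(n_dropped, "records have been dropped due to incomplete data.")
```

Note that this approach assumes metadata such as location and treatment type are stable over time, since values are copied between years without adjustment.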

In [6]:
# Cols to patch. For each site, first fill forward, then fill backward
cols = [
    "anlegg_name",
    "kilderefnr",
    "activity",
    "kommune",
    "fylke",
    "site_zone",
    "site_east",
    "site_north",
    "outlet_zone",
    "outlet_east",
    "outlet_north",
    "design_capacity",
    "current_capacity",
    "n_fritid",
    "n_connected",
    "start_year",
    "upgraded_year",
    "shut_down_year",
    "type",
]
df[cols] = df.groupby("anlegg_nr")[cols].transform(lambda x: x.ffill().bfill())

# Drop rows with incomplete data in key columns
n_rows_before = len(df)
cols = [
    "anlegg_nr",
    "kilderefnr",
    "site_zone",
    "site_east",
    "site_north",
    "design_capacity",
    "current_capacity",
    "n_fritid",
    "n_connected",
    "type",
]
df = df.dropna(subset=cols, how="any")
n_rows_after = len(df)
print(n_rows_before - n_rows_after, "records have been dropped due to incomplete data.")

96 records have been dropped due to incomplete data.


## 5. Join treatment efficiencies

In [7]:
# Convert renseeffekter in percent to fractions
numeric_cols = eff_df.select_dtypes(include="number").columns
eff_df[numeric_cols] = eff_df[numeric_cols] / 100

# Join default renseeffekter
df = pd.merge(df, eff_df.drop(columns="teotil_type"), how="left", on="type")

## 6. Quality check data

There are some strange values in Miljødirektoratet's dataset (both measured and estimated), mostly from before 2017 when the reporting system was different. The code below gets the median discharge for each parameter at each site and flags values that are more than `thresh` times bigger or smaller than the median. Sites with fewer than three values for a parameter are not checked. Flagged values are set to NaN and patched later, if possible.

I am using a conservative threshold of 10 i.e. I will only ignore MDir's values if their input or output estimates are more than 10 times bigger or smaller than the median for the site. This removes some really large outliers that are obviously wrong, but leaves a lot of dubious-looking data in the dataset. I think it is better for MDir to clean the dataset at their end than for me to do it.
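
To make the rule concrete, here is a small sketch applied to a single invented time series. The values are hypothetical; the ratio test is the same one used in the cell below.

```python
import numpy as np
import pandas as pd

thresh = 10

# Invented annual values (tonnes) for one parameter at one site
series = pd.Series([10.0, 12.0, 11.0, 150.0, 0.5, np.nan])

# The median ignores NaNs
med_val = series.median()
ratio = series / med_val

# Flag values more than 'thresh' times bigger or smaller than the median.
# Comparisons involving NaN evaluate to False, so gaps are never flagged
is_outlier = (ratio > thresh) | (ratio < 1 / thresh)

# Set flagged values to NaN, ready for patching later
cleaned = series.mask(is_outlier)
print(pd.DataFrame({"value": series, "outlier": is_outlier, "cleaned": cleaned}))
```

Because the median is robust to a few extreme values, the two suspicious entries (150.0 and 0.5) do not distort the reference level used to judge them.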

In [8]:
# Remove values bigger than thresh*median or smaller than (1/thresh)*median
thresh = 10

# Create a folder for QC results
qc_dir = os.path.join(raw_dir, "qc")
os.makedirs(qc_dir, exist_ok=True)

# Flag outliers
df_list = []
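for site_id, site_ts_df in df.groupby("anlegg_nr"):
    # Work on a copy so that new columns are not added to a slice of 'df'
    site_ts_df = site_ts_df.copy()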
    for par in pers_dict.keys():
        for flow in ("in", "out"):
            col_name = f"{par}_{flow}_tonnes"
            outlier_col = f"{par}_{flow}_outlier"

            if site_ts_df[col_name].count() < 3:
                site_ts_df[outlier_col] = 0
            else:
                med_val = site_ts_df[col_name].median()
                ratio = site_ts_df[col_name] / med_val
                site_ts_df[outlier_col] = (
                    (ratio > thresh) | (ratio < 1 / thresh)
                ).astype(int)
    df_list.append(site_ts_df)

# Combine and save for checking
df = pd.concat(df_list, axis="rows")
xl_path = os.path.join(raw_dir, "qc", "outliers.xlsx")
df.to_excel(xl_path, index=False)

# Set outliers to NaN in original data
for par in pers_dict.keys():
    for flow in ("in", "out"):
        col_name = f"{par}_{flow}_tonnes"
        outlier_col = f"{par}_{flow}_outlier"
        df.loc[df[outlier_col] == 1, col_name] = np.nan
        del df[outlier_col]

## 7. Fill data gaps

We need a complete set of input data for BOF5, SS, TOTN and TOTP for each site. For KOF, I will use the measured data (and MDir's estimated values, where available). However, I will not patch gaps for KOF and will use estimated BOF5 instead. The reason for this is that TEOTIL3 uses KOF in preference to BOF5, because KOF is "closer" to TOC. If I estimate values for KOF, there may therefore be cases where the model chooses to use *estimated* KOF instead of *measured* BOF5, which is undesirable. Estimates of BOF5 and KOF based on the number of people connected are likely to be equally approximate, so I don't think there's a strong argument to prefer estimated KOF over estimated BOF5. However, I want to use measured values where available.

For consistency, gaps are patched sequentially, as described below. For BOF5, SS, TOTN and TOTP:

 1. **Case 1: Inflow and outflow are both known (either measured directly or estimated by MDir)**

    Values from MDir's database are assumed to be correct. The "true" renseeffekt can be estimated from the reported inflows and outflows.
    
 2. **Case 2: Inflow is known, outflow is NaN**

    If the input is known but the output is not, the output is calculated using typical renseeffekter for each site type.

 3. **Case 3: Inflow is NaN, outflow is known**

    If the outflow is known but the inflow is not, the inflow is calculated using typical renseeffekter for each site type.

 4. **Case 4: Inflow and outflow are both NaN**
    
    If the inflow and outflow are both NaN, but `n_fritid` and/or `n_connected` are known, the inflow is estimated from the number of people and cabins connected. The outflow is then calculated from the estimated inflow by assuming typical renseeffekter based on the site type.

Patching in this order prioritises measured data and gives the most consistent results.
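
The four cases can be sketched on a toy table with one row per case. The efficiency of 0.9 and all inflow/outflow values are invented for illustration; the per-person rate is the TOTP default from Section 1, and the cabin weighting of `4 * 60` is copied from the cell below without further interpretation.

```python
import numpy as np
import pandas as pd

eff = 0.9  # Hypothetical treatment efficiency, as a fraction
per_val = 1.8  # g TOTP per person per day (default from Section 1)

# One row per case: (1) both known, (2) outflow missing,
# (3) inflow missing, (4) both missing
toy = pd.DataFrame(
    {
        "n_connected": [1000, 1000, 1000, 1000],
        "n_fritid": [0, 0, 0, 50],
        "in_tonnes": [1.0, 1.0, np.nan, np.nan],
        "out_tonnes": [0.2, np.nan, 0.05, np.nan],
    }
)

# Case 2: estimate outflow from known inflow
toy["out_tonnes"] = toy["out_tonnes"].fillna(toy["in_tonnes"] * (1 - eff))

# Case 3: estimate inflow from known outflow
toy["in_tonnes"] = toy["in_tonnes"].fillna(toy["out_tonnes"] / (1 - eff))

# Case 4: estimate inflow from connections, then outflow from that inflow
in_est = (
    toy["n_connected"] * per_val * 365.25 + toy["n_fritid"] * per_val * 4 * 60
) / 1e6
toy["in_tonnes"] = toy["in_tonnes"].fillna(in_est.replace(0, np.nan))
toy["out_tonnes"] = toy["out_tonnes"].fillna(toy["in_tonnes"] * (1 - eff))

print(toy)
```

Because each step uses `fillna`, values already present are never overwritten: row 1 keeps its reported outflow of 0.2 even though the default efficiency would imply 0.1.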

In [9]:
# Create a single dataset from MDir's data. Assume values here are correct
cols = [
    "anlegg_nr",
    "type",
    "year",
    "n_fritid",
    "n_connected",
    "bof5_in_tonnes",
    "kof_in_tonnes",
    "totn_in_tonnes",
    "totp_in_tonnes",
    "ss_in_tonnes",
    "bof5_out_tonnes",
    "kof_out_tonnes",
    "totn_out_tonnes",
    "totp_out_tonnes",
    "ss_out_tonnes",
]
mdir_df = df[cols].copy()

# Check for NaNs
cols = ["type", "n_fritid", "n_connected"]
for col in cols:
    if mdir_df[col].isna().sum() > 0:
        raise ValueError(f"NaNs are present in column '{col}'.")

# Combine measured and estimated values from MDir (assume both are OK)
mdir_df = mdir_df.groupby(["anlegg_nr", "type", "year"]).mean().reset_index()

# Join default renseeffekter
mdir_df = pd.merge(mdir_df, eff_df.drop(columns="teotil_type"), how="left", on="type")

print(len(mdir_df), "rows in the dataset.")

for par, per_val in pers_dict.items():
    if par == "kof":
        # Do not patch KOF - see text above for explanation
        continue
    print("\n###########################")
    print(par.upper())

    print("\nCASE 1: Inflow and outflow both known.\nOK.")

    print("\nCASE 2: Inflow known; outflow NaN.")
    print("Estimating outflows from inflows.")
    mdir_df[f"{par}_out_est_tonnes"] = mdir_df[f"{par}_in_tonnes"] * (
        1 - mdir_df[f"{par}_eff_pct"]
    )
    n_vals0 = mdir_df[f"{par}_out_tonnes"].count()
    mdir_df[f"{par}_out_tonnes"] = mdir_df[f"{par}_out_tonnes"].fillna(
        mdir_df[f"{par}_out_est_tonnes"]
    )
    del mdir_df[f"{par}_out_est_tonnes"]
    n_vals1 = mdir_df[f"{par}_out_tonnes"].count()
    print(f"Before filling: {n_vals0} non-NaN values.")
    print(f"After filling:  {n_vals1} non-NaN values.")

    print("\nCASE 3: Inflow NaN; outflow known.")
    print("Estimating inflows from outflows.")
    mdir_df[f"{par}_in_est_tonnes"] = mdir_df[f"{par}_out_tonnes"] / (
        1 - mdir_df[f"{par}_eff_pct"]
    )
    n_vals0 = mdir_df[f"{par}_in_tonnes"].count()
    mdir_df[f"{par}_in_tonnes"] = mdir_df[f"{par}_in_tonnes"].fillna(
        mdir_df[f"{par}_in_est_tonnes"]
    )
    del mdir_df[f"{par}_in_est_tonnes"]
    n_vals1 = mdir_df[f"{par}_in_tonnes"].count()
    print(f"Before filling: {n_vals0} non-NaN values.")
    print(f"After filling:  {n_vals1} non-NaN values.")

    print("\nCASE 4: Inflow NaN; outflow NaN.")
    print("Estimating inflows from 'n_connected' and 'n_fritid'.")
    mdir_df[f"{par}_in_est_tonnes"] = (
        mdir_df["n_connected"] * per_val * 365.25
        + mdir_df["n_fritid"] * per_val * 4 * 60
    ) / 1e6
    mdir_df[f"{par}_in_est_tonnes"] = mdir_df[f"{par}_in_est_tonnes"].replace(0, np.nan)
    n_vals0 = mdir_df[f"{par}_in_tonnes"].count()
    mdir_df[f"{par}_in_tonnes"] = mdir_df[f"{par}_in_tonnes"].fillna(
        mdir_df[f"{par}_in_est_tonnes"]
    )
    del mdir_df[f"{par}_in_est_tonnes"]
    n_vals1 = mdir_df[f"{par}_in_tonnes"].count()
    print(f"Before filling: {n_vals0} non-NaN values.")
    print(f"After filling:  {n_vals1} non-NaN values.")

    print("Calculating outflows from estimated inflows.")
    mdir_df[f"{par}_out_est_tonnes"] = mdir_df[f"{par}_in_tonnes"] * (
        1 - mdir_df[f"{par}_eff_pct"]
    )
    n_vals0 = mdir_df[f"{par}_out_tonnes"].count()
    mdir_df[f"{par}_out_tonnes"] = mdir_df[f"{par}_out_tonnes"].fillna(
        mdir_df[f"{par}_out_est_tonnes"]
    )
    del mdir_df[f"{par}_out_est_tonnes"]
    n_vals1 = mdir_df[f"{par}_out_tonnes"].count()
    print(f"Before filling: {n_vals0} non-NaN values.")
    print(f"After filling:  {n_vals1} non-NaN values.")

# Drop unnecessary cols
cols = [col for col in mdir_df.columns if col.endswith("_eff_pct")] + [
    "n_fritid",
    "n_connected",
    "type",
]
mdir_df = mdir_df.drop(columns=cols)

# Only keep rows with at least some data
cols = [col for col in mdir_df.columns if col.endswith("_tonnes")]
all_nan_df = mdir_df[mdir_df[cols].isna().all(axis="columns")][["anlegg_nr", "year"]]
mdir_df = mdir_df.dropna(subset=cols, how="all")
if len(all_nan_df) > 0:
    print(
        "\nWARNING: The following site-years have no data at all. "
        "These rows cannot be gap-filled and will be dropped."
    )
    print(all_nan_df)

mdir_df.head()

23015 rows in the dataset.

###########################
TOTN

CASE 1: Inflow and outflow both known.
OK.

CASE 2: Inflow known; outflow NaN.
Estimating outflows from inflows.
Before filling: 19765 non-NaN values.
After filling:  20048 non-NaN values.

CASE 3: Inflow NaN; outflow known.
Estimating inflows from outflows.
Before filling: 19939 non-NaN values.
After filling:  20048 non-NaN values.

CASE 4: Inflow NaN; outflow NaN.
Estimating inflows from 'n_connected' and 'n_fritid'.
Before filling: 20048 non-NaN values.
After filling:  22957 non-NaN values.
Calculating outflows from estimated inflows.
Before filling: 20048 non-NaN values.
After filling:  22957 non-NaN values.

###########################
TOTP

CASE 1: Inflow and outflow both known.
OK.

CASE 2: Inflow known; outflow NaN.
Estimating outflows from inflows.
Before filling: 22321 non-NaN values.
After filling:  22590 non-NaN values.

CASE 3: Inflow NaN; outflow known.
Estimating inflows from outflows.
Before filling: 22429 no

Unnamed: 0,anlegg_nr,year,bof5_in_tonnes,kof_in_tonnes,totn_in_tonnes,totp_in_tonnes,ss_in_tonnes,bof5_out_tonnes,kof_out_tonnes,totn_out_tonnes,totp_out_tonnes,ss_out_tonnes
0,0301.0979.01,2010,7378.308128,18562.913291,1328.249,179.177478,6968.013045,526.710061,1461.364028,407.903,8.958874,348.400652
1,0301.0979.01,2011,7808.765696,21736.681119,1449.41,166.855,7256.849092,574.262503,2845.252931,479.505,17.143,362.842455
2,0301.0979.01,2012,6617.02,21146.08,1452.467,171.102,7530.421342,323.644,1809.76,412.18,11.191,376.521067
3,0301.0979.01,2013,7532.404,19336.6,1389.5,170.1,7744.676992,351.8728,1745.56,435.0,7.14,387.23385
4,0301.0979.01,2014,7031.698239,21407.565445,1489.6,174.4,7644.17115,548.827537,2968.005254,433.0,15.18,382.208558


## 8. Check data completeness

For records with at least some data in MDir's database.

In [10]:
# We always need inputs and outputs for SS, TOTN and TOTP
req_cols = [
    "ss_in_tonnes",
    "ss_out_tonnes",
    "totn_in_tonnes",
    "totn_out_tonnes",
    "totp_in_tonnes",
    "totp_out_tonnes",
]
req_complete = mdir_df[req_cols].notnull().all(axis=1)

# We also need complete data for *either* BOF5 *or* KOF (having both is fine, but not required)
bof5_complete = mdir_df[["bof5_in_tonnes", "bof5_out_tonnes"]].notnull().all(axis=1)
kof_complete = mdir_df[["kof_in_tonnes", "kof_out_tonnes"]].notnull().all(axis=1)

# Combine criteria
mdir_df["complete"] = ((bof5_complete | kof_complete) & req_complete).astype(int)

# Estimate the percentage of site-years with sufficient data
pct_complete = 100 * mdir_df["complete"].sum() / len(mdir_df)
print(f"Complete data for {pct_complete:.1f}% of site-years.")

# Filter to only 'complete' records
mdir_df = mdir_df.query("complete == 1").drop(columns="complete")

Complete data for 99.7% of site-years.


## 9. Interpolate missing data

Some sites report data in one year, and then do not appear at all in MDir's dataset until later. I would expect these sites to be included in MDir's "estimated" values, but in many cases they are not. In SSB's dataset, Gisle has filled these gaps.

In the code below, I generate complete annual time series for each site spanning the range of years in the dataset, padded with NaNs as necessary. I then **linearly interpolate missing data, but without extrapolation**. For example, if a site appears in MDir's dataset in 2015, and then again in 2020, I assume the site has been active from 2015 to 2020 and interpolate values for the missing years. However, I do not estimate values before 2015 or after 2020.
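
The 2015 to 2020 example can be reproduced with a short sketch. The site ID, year range and values are invented; `limit_area="inside"` is the `interpolate` option that restricts filling to gaps surrounded by valid values, which is what prevents extrapolation.

```python
import numpy as np
import pandas as pd

# A site that reports in 2015 and again in 2020, with nothing in between
toy = pd.DataFrame(
    {
        "anlegg_nr": ["A", "A"],
        "year": [2015, 2020],
        "totn_out_tonnes": [10.0, 20.0],
    }
)

# Build a complete site-year index and pad missing years with NaN
years = list(range(2013, 2023))  # 2013 to 2022 inclusive
full_index = pd.MultiIndex.from_product(
    [["A"], years], names=["anlegg_nr", "year"]
)
toy = toy.set_index(["anlegg_nr", "year"]).reindex(full_index).reset_index()

# Interpolate only between the first and last valid values for each site
toy["totn_out_tonnes"] = toy.groupby("anlegg_nr")["totn_out_tonnes"].transform(
    lambda x: x.interpolate(method="linear", limit_area="inside")
)
print(toy)
```

Years 2016 to 2019 are filled along a straight line between the two reported values, while 2013, 2014, 2021 and 2022 remain NaN.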

In [11]:
# Combine data and site properties
cols = [
    "anlegg_nr",
    "kilderefnr",
    "anlegg_name",
    "activity",
    "kommune",
    "fylke",
    "site_zone",
    "site_east",
    "site_north",
    "outlet_zone",
    "outlet_east",
    "outlet_north",
    "design_capacity",
    "current_capacity",
    "n_fritid",
    "n_connected",
    "start_year",
    "upgraded_year",
    "shut_down_year",
    "type",
    "year",
    "bof5_in_tonnes",
    "bof5_out_tonnes",
    "kof_in_tonnes",
    "kof_out_tonnes",
    "ss_in_tonnes",
    "ss_out_tonnes",
    "totn_in_tonnes",
    "totn_out_tonnes",
    "totp_in_tonnes",
    "totp_out_tonnes",
]
df = pd.merge(mdir_df, prop_df, on=["year", "anlegg_nr"], how="left")
df = pd.merge(df, site_df, on="anlegg_nr", how="left")
df = (
    df[cols]
    .sort_values(by=["anlegg_nr", "year"], ascending=True)
    .reset_index(drop=True)
)

# Some site-years are missing metadata. Patch with info from the same site in other years, if possible
cols = [
    "anlegg_name",
    "kilderefnr",
    "activity",
    "kommune",
    "fylke",
    "site_zone",
    "site_east",
    "site_north",
    "outlet_zone",
    "outlet_east",
    "outlet_north",
    "design_capacity",
    "current_capacity",
    "n_fritid",
    "n_connected",
    "start_year",
    "upgraded_year",
    "shut_down_year",
    "type",
]
df[cols] = df.groupby("anlegg_nr")[cols].transform(lambda x: x.ffill().bfill())

# Because we dropped sites with missing essential metadata above, we should now have
# complete metadata for everything except the outlet co-ords and the start, upgrade
# and shut-down years. Check this
for col in cols:
    if col not in (
        "outlet_zone",
        "outlet_east",
        "outlet_north",
        "start_year",
        "upgraded_year",
        "shut_down_year",
    ):
        if df[col].isna().sum() > 0:
            raise ValueError(f"Column '{col}' contains NaNs.")

# Get combination of sites and all years. range() excludes its end point, so add 1
# to keep the final year in the dataset
years = list(range(df["year"].min(), df["year"].max() + 1))
anlegg_ids = df["anlegg_nr"].unique()
full_index = pd.MultiIndex.from_product(
    [anlegg_ids, years], names=["anlegg_nr", "year"]
)

# Reindex to include all combinations
df = df.set_index(["anlegg_nr", "year"])
df = df.reindex(full_index).reset_index()

# Save for checking
xl_path = os.path.join(raw_dir, "qc", "before_interp.xlsx")
df.to_excel(xl_path, index=False)

# Interpolate data gaps for inflows and outflows
cols = [
    col for col in df.columns if col.endswith("_tonnes") and not col.startswith("kof")
]
df[cols] = df.groupby("anlegg_nr")[cols].transform(
    lambda x: x.interpolate(method="linear", limit_area="inside")
)

# Interpolate data gaps for capacity and number connected
cols = ["design_capacity", "current_capacity", "n_fritid", "n_connected"]
df[cols] = df.groupby("anlegg_nr")[cols].transform(
    lambda x: (
        x.interpolate(method="linear", limit_area="inside").ffill().bfill().round()
    )
)

# Fill gaps in other columns
cols = [
    "anlegg_name",
    "kilderefnr",
    "activity",
    "kommune",
    "fylke",
    "site_zone",
    "site_east",
    "site_north",
    "outlet_zone",
    "outlet_east",
    "outlet_north",
    "start_year",
    "upgraded_year",
    "shut_down_year",
    "type",
]
df[cols] = df.groupby("anlegg_nr")[cols].transform(lambda x: x.ffill().bfill())

# Drop empty rows
cols = [col for col in df.columns if col.endswith("_tonnes")]
df = df.dropna(subset=cols, how="all")

# Check key metadata is complete
cols = [
    "anlegg_name",
    "kilderefnr",
    "activity",
    "kommune",
    "fylke",
    "site_zone",
    "site_east",
    "site_north",
    "type",
    "design_capacity",
    "current_capacity",
    "n_fritid",
    "n_connected",
]
for col in cols:
    if df[col].isna().sum() > 0:
        raise ValueError(f"Column '{col}' contains NaNs.")

# Save for checking
xl_path = os.path.join(raw_dir, "qc", "after_interp.xlsx")
df.to_excel(xl_path, index=False)


## 10. Save annual nutrient discharges

In [12]:
# Convert MDir types to TEOTIL types
mdir_types = set(df["type"].unique())
valid_mdir_types = set(eff_df["type"].unique())
assert mdir_types.issubset(valid_mdir_types)
types_dict = eff_df.set_index("type")["teotil_type"].to_dict()
df["type"] = df["type"].replace(types_dict)

# Tidy
val_cols = [col for col in df.columns if col.endswith("_tonnes")]
df[val_cols] = df[val_cols].round(4)

# Save all data for checking
xl_path = os.path.join(raw_dir, "qc", "cleaned_patched.xlsx")
df.to_excel(xl_path, index=False)

# Save tidied data for each year
for year, ann_df in df.groupby("year"):
    if (year >= 2013) and (year <= final_year):
        xl_path = f"/home/jovyan/shared/common/teotil3/point_data/{year}/large_wastewater_{year}_raw.xlsx"
        ann_df.to_excel(xl_path, index=False)

df.head()

Unnamed: 0,anlegg_nr,year,kilderefnr,anlegg_name,activity,kommune,fylke,site_zone,site_east,site_north,...,bof5_in_tonnes,bof5_out_tonnes,kof_in_tonnes,kof_out_tonnes,ss_in_tonnes,ss_out_tonnes,totn_in_tonnes,totn_out_tonnes,totp_in_tonnes,totp_out_tonnes
0,0301.0979.01,2010,0301AL01,Bekkelaget renseanlegg med tilførselstuneller ...,Avløpsnett og -rensing,Oslo,Oslo,32.0,599070.0,6639810.0,...,7378.3081,526.7101,18562.9133,1461.364,6968.013,348.4007,1328.249,407.903,179.1775,8.9589
1,0301.0979.01,2011,0301AL01,Bekkelaget renseanlegg med tilførselstuneller ...,Avløpsnett og -rensing,Oslo,Oslo,32.0,599070.0,6639810.0,...,7808.7657,574.2625,21736.6811,2845.2529,7256.8491,362.8425,1449.41,479.505,166.855,17.143
2,0301.0979.01,2012,0301AL01,Bekkelaget renseanlegg med tilførselstuneller ...,Avløpsnett og -rensing,Oslo,Oslo,32.0,599070.0,6639810.0,...,6617.02,323.644,21146.08,1809.76,7530.4213,376.5211,1452.467,412.18,171.102,11.191
3,0301.0979.01,2013,0301AL01,Bekkelaget renseanlegg med tilførselstuneller ...,Avløpsnett og -rensing,Oslo,Oslo,32.0,599070.0,6639810.0,...,7532.404,351.8728,19336.6,1745.56,7744.677,387.2338,1389.5,435.0,170.1,7.14
4,0301.0979.01,2014,0301AL01,Bekkelaget renseanlegg med tilførselstuneller ...,Avløpsnett og -rensing,Oslo,Oslo,32.0,599070.0,6639810.0,...,7031.6982,548.8275,21407.5654,2968.0053,7644.1712,382.2086,1489.6,433.0,174.4,15.18
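
The type conversion at the start of the cell above follows a validate-then-map pattern. A minimal sketch is shown below, assuming a lookup table with the same two columns as `eff_df`. The type labels (`mdir_a`, `teo_x` etc.) and the name `toy_eff_df` are hypothetical placeholders, not real MDir or TEOTIL categories.

```python
import pandas as pd

# Hypothetical lookup table: several MDir types can map to one TEOTIL type
toy_eff_df = pd.DataFrame(
    {
        "type": ["mdir_a", "mdir_b", "mdir_c"],
        "teotil_type": ["teo_x", "teo_x", "teo_y"],
    }
)

# Hypothetical site data using MDir types
sites = pd.DataFrame(
    {
        "anlegg_nr": ["S1", "S2", "S3"],
        "type": ["mdir_a", "mdir_c", "mdir_b"],
    }
)

# Validate first: every type in the data must exist in the lookup table
unknown = set(sites["type"].unique()) - set(toy_eff_df["type"].unique())
assert not unknown, f"Unrecognised types: {unknown}"

# Then map MDir types to TEOTIL types
types_dict = toy_eff_df.set_index("type")["teotil_type"].to_dict()
sites["type"] = sites["type"].replace(types_dict)

print(sites)
```

Validating before mapping matters because `replace` leaves unmatched values unchanged, so an unrecognised MDir type would otherwise pass through silently.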


## 11. Metals ("miljøgifter")

The old version of TEOTIL included data for "miljøgifter" (environmental toxins). TEOTIL has never modelled these parameters, but they are part of the OSPAR reporting workflow, so they are stored in the TEOTIL database. SS used to be grouped with the miljøgifter, but it is now modelled explicitly by TEOTIL3. The remaining "miljøgifter" are all heavy metals.

In [13]:
# Read discharge data again
xl_path = os.path.join(raw_dir, dis_xls)
metals_df = pd.read_excel(xl_path, sheet_name="Sheet1")

# Just pars of interest
par_dict = {
    "arsen": "As",
    "kadmium": "Cd",
    "krom": "Cr",
    "kobber": "Cu",
    "kvikksølv": "Hg",
    "nikkel": "Ni",
    "bly": "Pb",
    "sink": "Zn",
}
par_list = list(par_dict.keys())
# Take a copy so that later column assignments do not act on a view
metals_df = metals_df.query("Stoff in @par_list").copy()
assert (
    metals_df["Enhet"] == "Kilogram"
).all(), "Not all values in the 'Enhet' column are 'Kilogram'"
metals_df["Stoff"] = metals_df["Stoff"].replace(par_dict)

# Tidy
col_dict = {
    "År": "year",
    "AnleggNummer": "anlegg_nr",
    "Stoff": "par",
    "Tilført mengde": "in_kg",
    "Utslippsmengde": "out_kg",
}
metals_df.rename(columns=col_dict, inplace=True)
metals_df = metals_df[col_dict.values()]

# Pivot to wide
metals_df = metals_df.pivot_table(
    index=["anlegg_nr", "year"],
    columns="par",
    values=["in_kg", "out_kg"],
)
metals_df.columns = [f"{col[1]}_{col[0]}" for col in metals_df.columns]
metals_df.reset_index(inplace=True)
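
The reshaping step above can be sketched on a small example. The data below are invented for illustration; only the column names (`anlegg_nr`, `year`, `par`, `in_kg`, `out_kg`) follow the tidied dataset.

```python
import pandas as pd

# Hypothetical long-format data: one row per site, year and parameter
long_df = pd.DataFrame(
    {
        "anlegg_nr": ["A", "A", "A", "A"],
        "year": [2020, 2020, 2021, 2021],
        "par": ["Cd", "Zn", "Cd", "Zn"],
        "in_kg": [1.0, 40.0, 2.0, 50.0],
        "out_kg": [0.5, 20.0, 1.0, 25.0],
    }
)

# Pivot to wide format. The result has two-level columns: (value, par)
wide_df = long_df.pivot_table(
    index=["anlegg_nr", "year"],
    columns="par",
    values=["in_kg", "out_kg"],
)

# Flatten the two-level columns to names like 'Cd_in_kg'
wide_df.columns = [f"{par}_{flow}" for flow, par in wide_df.columns]
wide_df = wide_df.reset_index()

print(wide_df)
```

Note that `pivot_table` aggregates duplicates using the mean by default, so if a site reported the same parameter twice in one year the two values would be averaged rather than raising an error.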

In [14]:
# Remove extreme outliers (typically unit errors)
# Remove values bigger than thresh*median or smaller than (1/thresh)*median.
# Series with fewer than 3 values are not screened
thresh = 100

# Flag outliers
df_list = []
for site_id, site_ts_df in metals_df.groupby("anlegg_nr"):
    # Work on a copy so the new flag columns are not written to a view
    site_ts_df = site_ts_df.copy()
    for par in par_dict.values():
        for flow in ("in", "out"):
            col_name = f"{par}_{flow}_kg"
            outlier_col = f"{par}_{flow}_outlier"

            if site_ts_df[col_name].count() < 3:
                site_ts_df[outlier_col] = 0
            else:
                med_val = site_ts_df[col_name].median()
                ratio = site_ts_df[col_name] / med_val
                site_ts_df[outlier_col] = (
                    (ratio > thresh) | (ratio < 1 / thresh)
                ).astype(int)
    df_list.append(site_ts_df)

# Combine
metals_df = pd.concat(df_list, axis="rows")

# Set outliers to NaN in original data
for par in par_dict.values():
    for flow in ("in", "out"):
        col_name = f"{par}_{flow}_kg"
        outlier_col = f"{par}_{flow}_outlier"
        metals_df.loc[metals_df[outlier_col] == 1, col_name] = np.nan
        del metals_df[outlier_col]

# Just period of interest
metals_df = metals_df.query("2013 <= year <= @final_year")

# Check all sites are in main dataset
metals_stns = set(metals_df["anlegg_nr"].unique())
nutrient_stns = set(df["anlegg_nr"].unique())
metals_only = metals_stns - nutrient_stns
if len(metals_only) > 0:
    print(
        "WARNING: The following sites only have data for metals (not nutrients) and will be dropped:"
    )
    print(metals_only)
nutrient_stns = list(nutrient_stns)
metals_df = metals_df.query("anlegg_nr in @nutrient_stns")

# Join site metadata
site_cols = [
    "anlegg_nr",
    "kilderefnr",
    "anlegg_name",
    "site_zone",
    "site_east",
    "site_north",
    "outlet_zone",
    "outlet_east",
    "outlet_north",
    "type",
    "year",
]
value_cols = [col for col in metals_df if col.endswith("_kg")]
kilde_df = df[site_cols].drop_duplicates()
metals_df = pd.merge(metals_df, kilde_df, how="left", on=["anlegg_nr", "year"])
metals_df = metals_df[site_cols + value_cols]
metals_df[value_cols] = metals_df[value_cols].round(1)

# Check 'type' matches TEOTIL types
teo_types = set(metals_df["type"].unique())
valid_teo_types = set(eff_df["teotil_type"].unique())
assert teo_types.issubset(
    valid_teo_types
), f"Unrecognised TEOTIL types: {teo_types - valid_teo_types}"

# Save tidied data for each year
for year, ann_df in metals_df.groupby("year"):
    if (year >= 2013) and (year <= final_year):
        xl_path = f"/home/jovyan/shared/common/teotil3/point_data/{year}/metals_{year}_raw.xlsx"
        ann_df.to_excel(xl_path, index=False)

metals_df.head()

Unnamed: 0,anlegg_nr,kilderefnr,anlegg_name,site_zone,site_east,site_north,outlet_zone,outlet_east,outlet_north,type,...,Pb_in_kg,Zn_in_kg,As_out_kg,Cd_out_kg,Cr_out_kg,Cu_out_kg,Hg_out_kg,Ni_out_kg,Pb_out_kg,Zn_out_kg
0,0301.0979.01,0301AL01,Bekkelaget renseanlegg med tilførselstuneller ...,32.0,599070.0,6639810.0,32.0,598452.0,6639512.0,Kjemisk-biologisk m/N-fjerning,...,,,7.1,0.7,24.6,924.0,0.2,183.0,13.3,1432.0
1,0301.0979.01,0301AL01,Bekkelaget renseanlegg med tilførselstuneller ...,32.0,599070.0,6639810.0,32.0,598452.0,6639512.0,Kjemisk-biologisk m/N-fjerning,...,,,23.0,1.5,32.5,1191.0,0.2,210.0,21.6,1803.0
2,0301.0979.01,0301AL01,Bekkelaget renseanlegg med tilførselstuneller ...,32.0,599070.0,6639810.0,32.0,598452.0,6639512.0,Kjemisk-biologisk m/N-fjerning,...,,,19.6,1.4,46.1,1171.0,0.4,240.0,19.8,2223.0
3,0301.0979.01,0301AL01,Bekkelaget renseanlegg med tilførselstuneller ...,32.0,599070.0,6639810.0,32.0,598452.0,6639512.0,Kjemisk-biologisk m/N-fjerning,...,,,21.7,1.1,9.3,694.0,0.1,168.0,7.7,1451.0
4,0301.0979.01,0301AL01,Bekkelaget renseanlegg med tilførselstuneller ...,32.0,599070.0,6639810.0,32.0,598452.0,6639512.0,Kjemisk-biologisk m/N-fjerning,...,100.6,4431.2,22.8,1.3,19.6,641.8,0.3,162.0,17.3,1497.3
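
The median-ratio outlier screen applied to the metals data can be illustrated on a single toy series. This is a minimal sketch: the helper name `mask_ratio_outliers` and the example values are hypothetical, while the default threshold (100) and minimum count (3) mirror the settings used in this notebook.

```python
import numpy as np
import pandas as pd


def mask_ratio_outliers(values, thresh=100, min_count=3):
    """Set values to NaN if they differ from the series median by more
    than a factor of 'thresh'. Series with fewer than 'min_count'
    non-null values are returned unchanged.

    Hypothetical helper for illustration only.
    """
    if values.count() < min_count:
        return values.copy()
    ratio = values / values.median()
    is_outlier = (ratio > thresh) | (ratio < 1 / thresh)
    return values.mask(is_outlier)


# Hypothetical annual Cu discharges (kg). The third value looks like a
# unit error (e.g. grams reported instead of kilograms)
cu_kg = pd.Series([640.0, 700.0, 690000.0, 920.0, np.nan, 1190.0])
cleaned = mask_ratio_outliers(cu_kg)
print(cleaned)

# Too few values to estimate a meaningful median, so nothing is removed
short = mask_ratio_outliers(pd.Series([1.0, 1000.0]))
print(short)
```

One caveat of this approach: if the median of a series is zero, the ratio is infinite for every positive value, so all positive values at that site would be flagged.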
