In [1]:
import os
import sqlite3
import sys

sys.path.append("..")

import nivapy3 as nivapy
import numpy as np
import pandas as pd
import utils

# Tiltaksovervakingen: opsjon for kvalitetskontroll av analysedata
# Eurofins 2021 Q2

## Notebook 1: Initial exploration and data cleaning

## 1. Initial data check

This notebook explores the 2021 Q2 data from **Eurofins** sent by Marianne Isebakke on 20.09.2021 at 14:04. The code performs initial data exploration and cleaning, with the aim of creating a tidy dataset in SQLite that can be used for further analysis.

**Note:** This file uses an updated export from Vannmiljø reflecting corrections made to two station codes in August 2021 - see e-mail from Kjetil received 27.05.2021 at 17:15 for details.

Initial screening of the Eurofins Excel file shows the following:

 * For three samples, the sample date column contains the text `Kosåna utløp` instead of the date
 
 * There is one sample with pH < 1, which seems unlikely
 
 * The columns for conductivity, alkalinity, alk-E, LAL and SO4 all include `,` as the decimal separator for some values. I have replaced all occurrences of `,` with `.`
 
 * The columns `Vannmengde` and `Temp (C)` are missing from the latest spreadsheet (but have been included in previous rounds). For the QC process described here, `Vannmengde` is not used as it contains qualitative data, but `Temp (C)` has previously been considered. I have excluded temperature from the comparison for this round, but if it is of interest it should be included in the final submission
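
The decimal-separator fix and the date problem noted above can be illustrated with a minimal sketch using only the standard library. The helper names `to_float` and `is_valid_date`, and the date format string, are assumptions made for illustration; they are not the functions used in `utils`.

```python
from datetime import datetime


def to_float(text):
    """Parse a number that may use ',' as the decimal separator."""
    return float(str(text).strip().replace(",", "."))


def is_valid_date(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Return True if `text` parses as a date in the assumed format."""
    try:
        datetime.strptime(str(text), fmt)
        return True
    except ValueError:
        return False


print(to_float("3,8"))
print(is_valid_date("2021-04-07 07:20:25"))
print(is_valid_date("Kosåna utløp"))
```

Replacing every `,` with `.` is only safe if commas are never used as thousands separators, which appears to be the case for these columns.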

## 2. Create SQLite database to store results

Using a database will provide basic checks on data integrity and consistency. For this project, three tables will be sufficient:

 * Station locations and metadata
 * Parameters and units used by VestfoldLAB, Eurofins and Vannmiljø, and conversion factors between these
 * Water chemistry data
 
The code below creates a basic database structure, which will be populated later.

In [2]:
# Create database
dbname = "kalk_data.db"
if os.path.exists(dbname):
    os.remove(dbname)
eng = sqlite3.connect(dbname, detect_types=sqlite3.PARSE_DECLTYPES)

# Turn off synchronous writes and journalling for performance
eng.execute("PRAGMA synchronous = OFF")
eng.execute("PRAGMA journal_mode = OFF")

# Create stations table
sql = (
    "CREATE TABLE stations "
    "( "
    "  fylke text NOT NULL, "
    "  vassdrag text NOT NULL, "
    "  station_name text NOT NULL, "
    "  station_number text, "
    "  vannmiljo_code text NOT NULL, "
    "  vannmiljo_name text, "
    "  utm_east real NOT NULL, "
    "  utm_north real NOT NULL, "
    "  utm_zone integer NOT NULL, "
    "  lon real NOT NULL, "
    "  lat real NOT NULL, "
    "  liming_status text NOT NULL, "
    "  comment text, "
    "  PRIMARY KEY (vannmiljo_code) "
    ")"
)
eng.execute(sql)

# Create parameters table
sql = (
    "CREATE TABLE parameters_units "
    "( "
    "  vannmiljo_name text NOT NULL UNIQUE, "
    "  vannmiljo_id text NOT NULL UNIQUE, "
    "  vannmiljo_unit text NOT NULL, "
    "  vestfoldlab_name text NOT NULL UNIQUE, "
    "  vestfoldlab_unit text NOT NULL, "
    "  vestfoldlab_to_vm_conv_fac real NOT NULL, "
    "  eurofins_name text NOT NULL UNIQUE, "
    "  eurofins_unit text NOT NULL, "
    "  eurofins_to_vm_conv_fac real NOT NULL, "
    "  min real NOT NULL, "
    "  max real NOT NULL, "
    "  PRIMARY KEY (vannmiljo_id) "
    ")"
)
eng.execute(sql)

# Create chemistry table
sql = (
    "CREATE TABLE water_chemistry "
    "( "
    "  vannmiljo_code text NOT NULL, "
    "  sample_date datetime NOT NULL, "
    "  lab text NOT NULL, "
    "  period text NOT NULL, "
    "  depth1 real, "
    "  depth2 real, "
    "  parameter text NOT NULL, "
    "  flag text, "
    "  value real NOT NULL, "
    "  unit text NOT NULL, "
    "  PRIMARY KEY (vannmiljo_code, sample_date, depth1, depth2, parameter), "
    "  CONSTRAINT vannmiljo_code_fkey FOREIGN KEY (vannmiljo_code) "
    "      REFERENCES stations (vannmiljo_code) "
    "      ON UPDATE NO ACTION ON DELETE NO ACTION, "
    "  CONSTRAINT parameter_fkey FOREIGN KEY (parameter) "
    "      REFERENCES parameters_units (vannmiljo_name) "
    "      ON UPDATE NO ACTION ON DELETE NO ACTION, "
    "  CONSTRAINT unit_fkey FOREIGN KEY (unit) "
    "      REFERENCES parameters_units (vannmiljo_unit) "
    "      ON UPDATE NO ACTION ON DELETE NO ACTION "
    ")"
)
eng.execute(sql)

<sqlite3.Cursor at 0x7fe7600b82d0>
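
One point worth keeping in mind is that SQLite only enforces `FOREIGN KEY` constraints when `PRAGMA foreign_keys = ON` has been issued on the connection. The sketch below shows the difference on a simplified, hypothetical two-table schema in an in-memory database; the function name `orphan_insert_allowed` and the station code `000-00000` are made up for illustration.

```python
import sqlite3


def orphan_insert_allowed(enforce_foreign_keys):
    """Try to insert a chemistry row for a station that does not exist."""
    con = sqlite3.connect(":memory:")
    if enforce_foreign_keys:
        # Must be set per connection; it is off by default
        con.execute("PRAGMA foreign_keys = ON")
    con.execute("CREATE TABLE stations (vannmiljo_code text PRIMARY KEY)")
    con.execute(
        "CREATE TABLE water_chemistry ( "
        "  vannmiljo_code text NOT NULL REFERENCES stations (vannmiljo_code), "
        "  value real NOT NULL "
        ")"
    )
    try:
        con.execute("INSERT INTO water_chemistry VALUES ('000-00000', 1.0)")
        return True
    except sqlite3.IntegrityError:
        return False
    finally:
        con.close()


print(orphan_insert_allowed(False))
print(orphan_insert_allowed(True))
```

If enforcement were switched on for the schema above, the constraints might need adjusting first, since SQLite expects the referenced columns to be a primary key or to carry a unique index.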

## 3. Explore station data

Station details are stored in `../../data/active_stations_2020.xlsx`, which is a tidied version of Øyvind's original file here:

    K:\Prosjekter\langtransporterte forurensninger\Kalk Tiltaksovervåking\12 KS vannkjemi\Vannlokaliteter koordinater_kun aktive stasj 2020.xlsx
    
Note that corrections (e.g. adjusted station co-ordinates) have been made to the tidied file, but not the original on `K:`. **The version in this repository should therefore be used as the "master" copy**.

In [3]:
# Read station data
stn_df = pd.read_excel(r"../../data/active_stations_2020.xlsx", sheet_name="data")
stn_df = nivapy.spatial.utm_to_wgs84_dd(stn_df)

print("The following stations are missing spatial co-ordinates:")
stn_df.query("lat != lat")

The following stations are missing spatial co-ordinates:


Unnamed: 0,fylke,vassdrag,station_name,station_number,vannmiljo_code,vannmiljo_name,utm_east,utm_north,utm_zone,liming_status,comment,lat,lon


In [4]:
print("The following stations do not have a code in Vannmiljø:")
stn_df.query("vannmiljo_code != vannmiljo_code")

The following stations do not have a code in Vannmiljø:


Unnamed: 0,fylke,vassdrag,station_name,station_number,vannmiljo_code,vannmiljo_name,utm_east,utm_north,utm_zone,liming_status,comment,lat,lon


In [5]:
# Map
stn_map = nivapy.spatial.quickmap(
    stn_df.dropna(subset=["lat"]),
    lat_col="lat",
    lon_col="lon",
    popup="station_name",
    cluster=True,
    kartverket=True,
    aerial_imagery=True,
)

stn_map.save("../../pages/stn_map.html")

stn_map

In [6]:
# Add to database
stn_df.dropna(subset=["vannmiljo_code", "lat"], inplace=True)
stn_df.to_sql(name="stations", con=eng, if_exists="append", index=False)

## 4. Parameters and units of interest

The file `../../data/parameter_unit_mapping.xlsx` provides a lookup between parameter names & units used by the labs and those in Vannmiljø. It also contains plausible ranges (using Vannmiljø units) for each parameter. These ranges have been chosen by using the values already in Vannmiljø as a reference. However, it looks as though some of the data in Vannmiljø might also be spurious, so it would be good to refine these ranges based on domain knowledge, if possible.

**Note:** Concentrations reported as *exactly* zero are likely to be errors, because most (all?) lab methods should report an LOQ instead.

In [7]:
# Read parameter mappings
par_df = utils.get_par_unit_mappings()

# Add to database
par_df.to_sql(name="parameters_units", con=eng, if_exists="append", index=False)

par_df

Unnamed: 0,vannmiljo_name,vannmiljo_id,vannmiljo_unit,vestfoldlab_name,vestfoldlab_unit,vestfoldlab_to_vm_conv_fac,eurofins_name,eurofins_unit,eurofins_to_vm_conv_fac,min,max
0,Temperatur,TEMP,°C,Temp,°C,1.0,Temp,°C,1.0,-10,30
1,pH,PH,<ubenevnt>,pH,enh,1.0,pH,enh,1.0,1,10
2,Konduktivitet,KOND,mS/m,Kond,mS/m,1.0,Kond,ms/m,1.0,0,100
3,Total alkalitet,ALK,mmol/l,Alk,mmol/l,1.0,Alk,mmol/l,1.0,0,2
4,Totalfosfor,P-TOT,µg/l P,Tot-P,µg/l,1.0,Tot-P,µg/l,1.0,0,500
5,Totalnitrogen,N-TOT,µg/l N,Tot-N,µg/l,1.0,Tot-N,µg/l,1.0,0,4000
6,Nitrat,N-NO3,µg/l N,NO3,µg/l,1.0,NO3,µg/l,1.0,0,2000
7,Totalt organisk karbon (TOC),TOC,mg/l C,TOC,mg/l,1.0,TOC,mg/l,1.0,0,100
8,Reaktivt aluminium,RAL,µg/l Al,RAl,µg/l,1.0,RAl,µg/l,1.0,0,500
9,Ikke-labilt aluminium,ILAL,µg/l Al,ILAl,µg/l,1.0,ILAl,µg/l,1.0,0,500


## 5. Historic data from Vannmiljø

The Vannmiljø dataset is large and reading from Excel is slow; the code below takes a couple of minutes to run.

Note from the output below that **there are more than 1100 "duplicated" samples in the Vannmiljø dataset**, i.e. where the station code, sample date, sample depth, lab and parameter name are all the same, but a different value is reported. It would be helpful to know why these duplicates were collected, e.g. are these reanalysis values, where only one of the duplicates should be used, or are they genuine (in which case should they be averaged or kept separate?). **For the moment, I will ignore these values**.
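
The two ways of handling duplicates (ignore them entirely, or average them) can be compared on a toy dataframe. This is a small sketch with made-up station codes and values, using a shortened key list rather than the full set of key columns.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "vannmiljo_code": ["A", "A", "A", "B"],
        "sample_date": ["2021-04-07"] * 4,
        "par_unit": ["PH_<ubenevnt>", "PH_<ubenevnt>", "KOND_mS/m", "PH_<ubenevnt>"],
        "value": [5.6, 5.8, 3.8, 6.3],
    }
)
keys = ["vannmiljo_code", "sample_date", "par_unit"]

# keep=False marks every member of a duplicated group, not just the repeats
dups = toy[toy.duplicated(keys, keep=False)]

# Option 1: ignore duplicates (drop all rows that share a key)
ignored = toy.drop_duplicates(subset=keys, keep=False)

# Option 2: average duplicates (one row per key)
averaged = toy.groupby(keys, as_index=False)["value"].mean()

print(len(dups), len(ignored), len(averaged))
```

Note that `groupby` excludes rows whose key columns contain missing values unless `dropna=False` is passed, which would matter for the averaging option if `depth1` or `depth2` can be empty.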

In [8]:
# Read historic data from Vannmiljø
his_df = utils.read_historic_data(
    r"../../data/vannmiljo_export_2012-19_2021-08-16.xlsx"
)

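# Tidy lab names for clarity (assign back, since in-place replacement
# on a selected column is unreliable in recent versions of pandas)
his_df["lab"] = his_df["lab"].replace(
    {"NIVA": "NIVA (historic)", "VestfoldLAB AS": "VestfoldLAB (historic)"}
)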

# Check duplicates
key_cols = [
    "vannmiljo_code",
    "sample_date",
    "lab",
    "depth1",
    "depth2",
    "par_unit",
]
his_dup_df = his_df[
    his_df.duplicated(
        key_cols,
        keep=False,
    )
].sort_values(key_cols)
his_dup_df.to_csv(r"../../output/vannmiljo_duplicates.csv", index=False)

## Average duplicates
# his_df = his_df.groupby(key_cols).aggregate(
#    {
#        "flag": "first",
#        "value": "mean",
#    }
# ).reset_index()

# Remove all duplicates
his_df.drop_duplicates(subset=key_cols, keep=False, inplace=True)

# Add label for data period
his_df["period"] = "historic"

# Print summary
n_stns = len(his_df["vannmiljo_code"].unique())
print(f"The number of unique stations with data is: {n_stns}.\n")

print(
    f"There are {len(his_dup_df)} duplicated records (same station_code-date-depth-parameter, but different value).\n"
    "These will be averaged or ignored (depending on the settings above).\n"
)

his_df.head()

The number of unique stations with data is: 211.

There are 1620 duplicated records (same station_code-date-depth-parameter, but different value).
These will be averaged or ignored (depending on the settings above).



Unnamed: 0,vannmiljo_code,sample_date,lab,depth1,depth2,par_unit,flag,value,period
0,027-28435,2012-01-02,NIVA (historic),0.0,0.0,ALK_mmol/l,=,0.06,historic
1,027-28435,2012-01-02,NIVA (historic),0.0,0.0,CA_mg/l,=,1.23,historic
2,027-28435,2012-01-02,NIVA (historic),0.0,0.0,ILAL_µg/l Al,=,18.0,historic
3,027-28435,2012-01-02,NIVA (historic),0.0,0.0,KOND_mS/m,=,3.8,historic
4,027-28435,2012-01-02,NIVA (historic),0.0,0.0,PH_<ubenevnt>,=,6.24,historic


## 6. New data from labs

The code below reads the Excel template provided by Eurofins and reformats it to the same structure (parameter names, units etc.) as the data in Vannmiljø.

In [9]:
# Metadata
lab = "Eurofins"
year = 2021
qtr = 2

# Read new data
new_df = utils.read_data_template(
    r"../../data/eurofins_data_to_2021-06-30.xlsx",
    sheet_name="results",
    lab=lab,
)

# Check duplicates
new_dup_df = new_df[
    new_df.duplicated(
        key_cols,
        keep=False,
    )
].sort_values(key_cols)
new_dup_df.to_csv(
    f"../../output/{lab.lower()}_{year}_q{qtr}_duplicates.csv", index=False
)

## Average duplicates
# new_df = (
#    new_df.groupby(key_cols)
#    .aggregate(
#        {
#            "flag": "first",
#            "value": "mean",
#        }
#    )
#    .reset_index()
# )

# Remove all duplicates
new_df.drop_duplicates(subset=key_cols, keep=False, inplace=True)

# Add label for data period
new_df["period"] = "new"

print(
    f"\nThere are {len(new_dup_df)} duplicated records (same station_code-date-depth-parameter, but different value).\n"
    "These will be averaged or ignored (depending on the settings above).\n"
)

new_df.head()


The following location IDs have inconsistent names within this template:
     036-58861 ['Lågen oppstrøms Heimsåna (14)']   ==>   ['Lågen oppstrøms Heimsåna' 'Lågen oppstrøms Mosåna']
     063-62170 ['Mysterelvi (1b)']   ==>   ['Mysterelvi' 'Mysterøyri']

The following location names have multiple IDs within this template:

There are 175 duplicated records (same station_code-date-depth-parameter, but different value).
These will be averaged or ignored (depending on the settings above).



Unnamed: 0,vannmiljo_code,sample_date,lab,depth1,depth2,par_unit,flag,value,period
0,023-45835,2021-04-07 07:20:25,Eurofins,0.0,0.0,PH_<ubenevnt>,,5.6,new
1,023-45835,2021-05-05 07:20:27,Eurofins,0.0,0.0,PH_<ubenevnt>,,5.8,new
2,023-45835,2021-06-08 07:00:58,Eurofins,0.0,0.0,PH_<ubenevnt>,,6.1,new
3,023-62175,2021-04-07 07:20:25,Eurofins,0.0,0.0,PH_<ubenevnt>,,6.3,new
4,023-62175,2021-05-05 07:20:27,Eurofins,0.0,0.0,PH_<ubenevnt>,,6.8,new


## 7. Combine

Combine the `historic` and `new` datasets into a single dataframe in "long" format.

In [10]:
# Combine
df = pd.concat([his_df, new_df], axis="rows")

# Separate par and unit
df[["parameter", "unit"]] = df["par_unit"].str.split("_", n=1, expand=True)
del df["par_unit"]

df.reset_index(drop=True, inplace=True)

df.head()

Unnamed: 0,vannmiljo_code,sample_date,lab,depth1,depth2,flag,value,period,parameter,unit
0,027-28435,2012-01-02,NIVA (historic),0.0,0.0,=,0.06,historic,ALK,mmol/l
1,027-28435,2012-01-02,NIVA (historic),0.0,0.0,=,1.23,historic,CA,mg/l
2,027-28435,2012-01-02,NIVA (historic),0.0,0.0,=,18.0,historic,ILAL,µg/l Al
3,027-28435,2012-01-02,NIVA (historic),0.0,0.0,=,3.8,historic,KOND,mS/m
4,027-28435,2012-01-02,NIVA (historic),0.0,0.0,=,6.24,historic,PH,<ubenevnt>


In [11]:
# Apply correction to historic SIO2
df["value"] = np.where(
    (df["lab"] == "VestfoldLAB (historic)") & (df["parameter"] == "SIO2"),
    df["value"] * 467.5432,
    df["value"],
)

# Reclassify (nitrate + nitrite) to nitrate (assign back rather than
# replacing in place on a selected column)
df["parameter"] = df["parameter"].replace({"N-SNOX": "N-NO3"})

## 8. Check data ranges

A simple method for preliminary quality control is to check whether parameter values are within sensible ranges (as defined in the `parameters_units` table; see Section 4 above). I believe this screening should be implemented differently for the `historic` (i.e. Vannmiljø) and `new` datasets, as follows:

 * For the `historic` data in Vannmiljø, values outside the plausible ranges should be **removed from the dataset entirely**. This is because we intend to use the Vannmiljø data as a reference against which new values will be compared, so it is important the dataset does not contain anything too strange. Ideally, the reference dataset should be carefully manually curated to ensure it is as good as possible, but I'm not sure we have the resources in this project to thoroughly quality assess the data *already* in Vannmiljø. Dealing with any obvious issues is a good start, though
 
 * For the `new` data, values outside the plausible ranges should be highlighted and checked with the reporting lab
 
**Note:** At present, my code will remove any concentration values of exactly zero from the historic dataset. **Check with Øyvind whether this is too strict**.
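
As an illustration of the idea, the sketch below runs a range check in plain SQL against a toy in-memory database. It is not the implementation in `utils.check_data_ranges`; the simplified tables, the example values and the inclusive comparison (values equal to a limit are flagged) are assumptions made for this sketch.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE parameters_units "
    "(vannmiljo_id text PRIMARY KEY, min real NOT NULL, max real NOT NULL)"
)
con.execute(
    "CREATE TABLE water_chemistry "
    "(parameter text NOT NULL, period text NOT NULL, value real NOT NULL)"
)
con.executemany(
    "INSERT INTO parameters_units VALUES (?, ?, ?)",
    [("PH", 1, 10), ("KOND", 0, 100)],
)
con.executemany(
    "INSERT INTO water_chemistry VALUES (?, ?, ?)",
    [
        ("PH", "new", 1.0),
        ("PH", "new", 6.3),
        ("KOND", "historic", 309.0),
        ("KOND", "historic", 3.8),
    ],
)

# Flag values on or outside the plausible range for each parameter
sql = (
    "SELECT w.parameter, w.period, w.value "
    "FROM water_chemistry w "
    "JOIN parameters_units p ON w.parameter = p.vannmiljo_id "
    "WHERE w.value <= p.min OR w.value >= p.max "
    "ORDER BY w.parameter"
)
flagged = con.execute(sql).fetchall()

# Historic problems would be dropped; new ones queried with the lab
to_drop = [row for row in flagged if row[1] == "historic"]
to_query = [row for row in flagged if row[1] == "new"]
print(to_drop)
print(to_query)
```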

In [12]:
# Check ranges
df = utils.check_data_ranges(df)


Checking data ranges for the 'historic' period.
    KOND: Maximum value of 309.00 is greater than or equal to upper limit (100.00).
    LAL: Minimum value of 0.00 is less than or equal to lower limit (0.00).
    SO4: Minimum value of 0.00 is less than or equal to lower limit (0.00).
    CA: Minimum value of 0.00 is less than or equal to lower limit (0.00).
    SIO2: Maximum value of 51897.30 is greater than or equal to upper limit (7000.00).

Checking data ranges for the 'new' period.
    PH: Minimum value of 1.00 is less than or equal to lower limit (1.00).
    LAL: Minimum value of 0.00 is less than or equal to lower limit (0.00).

Dropping problem rows from historic data.
    Dropping rows for KOND.
    Dropping rows for LAL.
    Dropping rows for SO4.
    Dropping rows for CA.
    Dropping rows for SIO2.


In [13]:
# Add to database
df.to_sql(
    name="water_chemistry",
    con=eng,
    if_exists="append",
    index=False,
    method="multi",
    chunksize=1000,
)

## 9. Summary

Points to note from the initial data exploration:

### 9.1. Data formatting and missing data

 * For three samples, the sample date column contains the text `Kosåna utløp` instead of the date
 
 * There is one sample with pH < 1, which seems unlikely
 
 * The columns for conductivity, alkalinity, alk-E, LAL and SO4 all include `,` as the decimal separator for some values. I have replaced all occurrences of `,` with `.`
 
 * The columns `Vannmengde` and `Temp (C)` are missing from the latest spreadsheet (but have been included in previous rounds). For the QC process described here, `Vannmengde` is not used as it contains qualitative data, but `Temp (C)` has previously been considered. I have excluded temperature from the comparison for this round, but if it is of interest it should be included in the final submission
 
### 9.2. Site IDs

The following **location IDs have inconsistent names within this template**:
    
         036-58861 ['Lågen oppstrøms Heimsåna (14)']   ==>   ['Lågen oppstrøms Heimsåna' 'Lågen oppstrøms Mosåna']
         063-62170 ['Mysterelvi (1b)']                 ==>   ['Mysterelvi' 'Mysterøyri']
 
It seems the wrong station codes have been used for some samples:

 * `Lågen oppstrøms Mosåna` has been entered with code `036-58861` in the template, but it should be `036-58862`  
 * `Mysterøyri` has been entered with code `063-62170` in the template, but it should be `063-58804`
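
If Eurofins confirm these corrections, they could be applied with a simple lookup keyed on the code and station name as entered in the template. This is a minimal sketch; the names `CODE_FIXES` and `corrected_code` are hypothetical.

```python
# Corrections from the list above, keyed by (code in template, station name)
CODE_FIXES = {
    ("036-58861", "Lågen oppstrøms Mosåna"): "036-58862",
    ("063-62170", "Mysterøyri"): "063-58804",
}


def corrected_code(code, station_name):
    """Return the corrected station code, or the original if no fix applies."""
    return CODE_FIXES.get((code, station_name), code)


print(corrected_code("063-62170", "Mysterøyri"))
print(corrected_code("063-62170", "Mysterelvi"))
```

Keying on both the code and the name leaves correctly coded samples (e.g. `Mysterelvi` with `063-62170`) unchanged.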
 
### 9.3. Duplicates

 * There are **175 duplicated records** in the Eurofins data submission, where the same station_code-date-depth-parameter has been reported more than once (see [here](https://github.com/NIVANorge/tiltaksovervakingen/blob/master/output/eurofins_2021_q2_duplicates.csv) for a list). **78** of these are caused by the site ID issues mentioned above, and a further **21** are due to dates being entered as `Kosåna utløp` (see 9.1, above). The remainder may be genuine duplicates (e.g. flood samples), but **check with Eurofins**
 
### 9.4. Range checks

**Note:** Some of the points identified below are explored more thoroughly in the time series analysis.

 * Values of pH reported as `<1` seem very low?