In [1]:
import nivapy3 as nivapy
import pandas as pd
import utils
from sqlalchemy import exc, text



In [2]:
eng = nivapy.da.connect()

Username:  ········
Password:  ········


Connection successful.


# ICP Waters Call for Data 2022-23

## Part B: Upload data

This notebook processes data from the 2022 call. The main work here is to create two Excel spreadsheets summarising:

 1. New stations to be added
 
 2. New data to be added
 
## 1. Data processing by country

The following summarises the work to be done for each country.

### 1.1. Armenia

The database already contains some Armenian sites, but no data have ever been uploaded for them because previous submissions did not use the template. The most recent of those submissions was in 2012 and comprised data from 2004 to 2008. This year, we have data for a new set of sites using the new template.

 * Old Armenian sites and previous data submissions for 2004 to 2008 will be ignored.
 
 * Create new sites based on the latest submission and add the data provided for 2020 and 2021. These sites will be assigned to the `Core` group.
 
Some of the data distributions for the Armenian sites seem strange. In particular, there are lots of cases where ORTP > TOTP. We have agreed to ignore these issues for now and upload the data as they are - see e-mail from Heleen received 19.04.2023 13.12.
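
For reference, a minimal sketch of how rows with ORTP > TOTP could be flagged. The data frame and its column names are made up for illustration and are not the template's actual layout; both columns are assumed to be in the same units.

```python
import pandas as pd

# Hypothetical example data (values and column names are assumptions)
df = pd.DataFrame(
    {
        "station_code": ["AM_001", "AM_001", "AM_057"],
        "ORTP": [12.0, 30.0, 5.0],
        "TOTP": [20.0, 18.0, 5.0],
    }
)

# Orthophosphate is a fraction of total phosphorus, so ORTP > TOTP is suspicious
suspect = df[df["ORTP"] > df["TOTP"]]
print(suspect)
```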
 
### 1.2. Austria

No data provided.

### 1.3. Canada

New data will be uploaded. The original submissions contained "duplicates" where the station code and sample date appeared twice, but each row contained only half the data. I have manually merged these rows in Excel.
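
The merge was done by hand in Excel, but the same operation could be sketched in pandas roughly as follows. The column names and values are hypothetical. The sketch relies on `groupby(...).first()` returning the first non-null value in each column, which only gives the intended result if the two half-rows never report conflicting values for the same parameter.

```python
import pandas as pd

# Hypothetical "half-row" duplicates: each row holds part of the sample's data
df = pd.DataFrame(
    {
        "station_code": ["X1", "X1", "X2"],
        "sample_date": pd.to_datetime(["2020-05-01", "2020-05-01", "2020-06-01"]),
        "pH": [6.5, None, 7.0],
        "TOC": [None, 3.2, 4.1],
    }
)

# Collapse to one row per station and date, taking the first non-null value
merged = df.groupby(["station_code", "sample_date"], as_index=False).first()
print(merged)
```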

### 1.4. Czech Republic

New data will be uploaded.

### 1.5. Estonia

New data will be uploaded.

### 1.6. Finland

Only provided data for the "core" sites. This will be uploaded.

### 1.7. Georgia

Limited data received for a few sites. These sites are new to ICPW and no station details have been provided. Cathrine has asked for details, but no reply yet. **Cannot be processed until site details are provided**.

### 1.8. Germany

New data will be uploaded.

`DE13` and `DE24` are affected by liming. **These sites should be "archived" i.e. removed from the current ICPW project**.

At `DE12`, one TOTN value has been removed as it seems very low (see e-mail from Heleen received 19.04.2023 13.12).

### 1.9. Ireland

Lakes data was previously sent to James; rivers data recently sent to Cathrine. 

We need to clarify exactly which Irish stations should be used: Julian Aherne previously suggested lakes to be included, but each lake actually has several monitoring sites, including "open water" and "near shore" alternatives. I have chosen a single "open water" station for each lake and updated the site co-ordinates in the database. I have also added the Irish station code as the `NFC_Code` attribute to make it clear which site we are using at each location. This should make it easier to request data in the future, since Wayne and Alan will have a better idea of what data to send.

Station `IE04` has been "archived", as we cannot identify a current location that matches the description from Julian.

Rivers data for 2020 are not available.

### 1.10. Italy

New data will be uploaded.

### 1.11. Latvia

**Some of the values seem extreme - check?**

New data will be uploaded.

### 1.12. Moldova

**Some of the values seem extreme - check?**

New data will be uploaded.

### 1.13. Montenegro

There was an old ICPW site on Black Lake (`ME 01`), but data were never submitted using the template and have never been added to the database. This site is no longer considered suitable for ICPW, so adding the (rather limited) old data is no longer relevant. 

**I do not understand the data recently submitted by Montenegro**:

 * There is a file named `Montenegro_Site list.xlsx` that contains co-ordinates for a location in Canada (near Toronto).
 
 * There is a data template for "Black Lake", but it is not clear if this is the same as the original Black Lake site (which is apparently no longer suitable for ICPW anyway?).
 
 * There is a file named `crno_jezero_emep.xlsx`, which presumably contains data from the new EMEP station that we are supposed to be adding to ICPW. However, according to the e-mail chain, the new EMEP site was only established in 2019, whereas data in the file go back to 2000. Furthermore, the station co-ordinates for 2020 and 2021 are different to the co-ordinates for the new EMEP station given in the e-mail, and co-ordinates for years prior to 2020 are missing completely.
 
 * Data for the EMEP station do not use the template.
 
**These data cannot be processed until these issues are resolved**.

### 1.14. Netherlands

Data for 8 new sites has been included. New sites will be created and the new data added.

### 1.15. Poland

Site co-ordinates have been updated. New data will be added.

### 1.16. Slovakia

Entirely new dataset submitted for 1990 to 2022, due to errors discovered in previous submissions. All old data should be deleted and the new data added.

### 1.17. Spain

Create three new sites and add data. Also update metadata (land cover, catchment area etc.). Two samples have been removed due to strange TOTN values (see e-mail from Heleen received 19.04.2023 13.12).

### 1.18. Sweden

Try to upload automatically using the API.

### 1.19. Switzerland

Create one new site. Add all data. Update metadata (land cover etc.) for several sites.

### 1.20. UK

New data will be uploaded.

### 1.21. USA

Includes a lot of data for stations that are unknown/have not been included before. Based on discussions with John:

 * Need to filter to just the stations we have used previously
 
 * Al data can be ignored
 
 * LOD values need adjusting. See e-mail from John received 18.04.2023
 
 * There are lots of duplicated values. Check to see if these are genuine and then either average or remove them 
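
A rough sketch of the duplicate check and the "average" option from the last point, using made-up data in long format. The column names are assumptions, and the sketch ignores detection-limit flags, which would need separate handling before averaging.

```python
import pandas as pd

# Hypothetical long-format data with one duplicated station/date/parameter
df = pd.DataFrame(
    {
        "station_code": ["US1", "US1", "US1", "US2"],
        "sample_date": pd.to_datetime(["2021-04-01"] * 4),
        "par": ["SO4", "SO4", "NO3", "SO4"],
        "value": [2.0, 4.0, 1.0, 5.0],
    }
)
keys = ["station_code", "sample_date", "par"]

# Show every row involved in a duplicate, so they can be inspected first
dups = df[df.duplicated(keys, keep=False)]
print(dups)

# If the duplicates look genuine, average them to one value per key
averaged = df.groupby(keys, as_index=False)["value"].mean()
print(averaged)
```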
 
## 2. Data issues

Potential data issues highlighted by the app regarding the ion balance, conductivity and outliers have been largely ignored. These data will be uploaded as supplied - see e-mail from Heleen received 19.04.2023 at 13.12. 

The issues relating to `LAl != RAl - ILAl` are due to some Focal Centres reporting LAl (which is a "mandatory" parameter) but not RAl and ILAl (which are "optional").
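
As an illustration only (this is not the app's actual check), the sketch below separates genuine mismatches from rows where the comparison cannot be made because RAl or ILAl is missing. The column names and values are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 reports LAl only; row 2 is a genuine mismatch
df = pd.DataFrame(
    {
        "LAl": [30.0, 25.0, 40.0],
        "RAl": [50.0, np.nan, 70.0],
        "ILAl": [20.0, np.nan, 25.0],
    }
)

# Rows where LAl is reported without both of the optional fractions
missing_parts = df[df["LAl"].notna() & (df["RAl"].isna() | df["ILAl"].isna())]

# Only compare LAl with RAl - ILAl where all three values are present
complete = df.dropna(subset=["LAl", "RAl", "ILAl"])
mismatch = complete[~np.isclose(complete["LAl"], complete["RAl"] - complete["ILAl"])]

print(missing_parts)
print(mismatch)
```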

There are more than 100 samples where LAl > 20 and pH > 6.4. After discussion with Rolf and Heleen, I have removed any LAl values for these samples greater than 50 ug/l - see e-mail from Heleen received 17.04.2023 at 17.35.
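
The removal rule can be sketched as below, using made-up values and assumed column names. Only LAl values above 50 in samples with pH > 6.4 are set to missing; everything else is left unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical samples (LAl in ug/l)
df = pd.DataFrame(
    {
        "pH": [6.8, 6.9, 5.5, 6.6],
        "LAl": [75.0, 30.0, 120.0, 50.0],
    }
)

# High labile Al at high pH is implausible, so blank out those values
mask = (df["pH"] > 6.4) & (df["LAl"] > 50)
df.loc[mask, "LAl"] = np.nan
print(df)
```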
 
## 3. Create new stations

The code below creates stations that are entirely new to ICPW in this call.

In [3]:
# List of new sites
xl_path = r"./data/new_sites_cfd_2022.xlsx"
stn_df = pd.read_excel(xl_path)
stn_df.head()

Unnamed: 0,station_id,station_code,station_name,latitude,longitude,altitude,continent,country,region,group
0,38810,AM_001,"Pambak river, 0.5 km above Khnkoyan village",40.839315,44.048911,,Europe,Armenia,Armenia,Core
1,38811,AM_057,"Marmarik river, 0.5 km above Hankavan village",40.663605,44.466029,,Europe,Armenia,Armenia,Core
2,38812,AM_080,"Vedi river, 0.5 km above Urtsadzor village",39.921174,44.819649,,Europe,Armenia,Armenia,Core
3,38813,AM_083,"Arpa river, 0.5 km above Jermuk town",39.843045,45.686151,,Europe,Armenia,Armenia,Core
4,38814,AM_089,"Meghri river, 0.5 km above Meghri town",38.915268,46.233703,,Europe,Armenia,Armenia,Core


In [4]:
# # Add basic station info to 'resa2.stations'
# cols = [
#     "station_id",
#     "station_code",
#     "station_name",
#     "latitude",
#     "longitude",
#     "altitude",
# ]
# stn_df[cols].to_sql("stations", eng, schema="resa2", if_exists="append", index=False)

In [5]:
# # Add station attributes
# cols = ["station_id", "continent", "country", "region", "group"]
# stn_df = stn_df[cols].copy()
# stn_df.rename(
#     {
#         "continent": 322,
#         "country": 261,
#         "region": 254,
#         "group": 357,
#     },
#     axis="columns",
#     inplace=True,
# )
# stn_df = stn_df.melt(id_vars='station_id', var_name='var_id')
# stn_df.to_sql("stations_par_values", eng, schema="resa2", if_exists="append", index=False)
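
To show what the rename and `melt` steps above produce, here is a small standalone sketch using two of the Armenian stations and the same `var_id` mapping. It does not touch the database; `long_df` is just an illustrative name.

```python
import pandas as pd

# Two stations in "wide" format, one column per attribute
wide_df = pd.DataFrame(
    {
        "station_id": [38810, 38811],
        "continent": ["Europe", "Europe"],
        "country": ["Armenia", "Armenia"],
        "region": ["Armenia", "Armenia"],
        "group": ["Core", "Core"],
    }
)

# Map attribute names to the var_id values used in the cell above
var_ids = {"continent": 322, "country": 261, "region": 254, "group": 357}

# Reshape to one row per (station_id, var_id) pair, as stored in the table
long_df = wide_df.rename(columns=var_ids).melt(id_vars="station_id", var_name="var_id")
print(long_df)
```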

## 4. Update station properties

For all stations currently within ICPW, the code below updates the station properties. This includes edits to station codes, station names, co-ordinates etc.

In [6]:
def upsert(stn_id, var_id, value, eng):
    """Updates a value in resa2.stations_par_values if it exists,
    otherwise inserts it as a new row.
    """
    try:
        sql = text(
            "INSERT INTO resa2.stations_par_values "
            "(station_id, var_id, value) "
            "VALUES (:stn_id, :var_id, :value)"
        )
        eng.execute(sql, stn_id=stn_id, var_id=var_id, value=value)
    except exc.IntegrityError:
        sql = text(
            "UPDATE resa2.stations_par_values "
            "SET value = :value "
            "WHERE station_id = :stn_id "
            "AND var_id = :var_id"
        )
        eng.execute(sql, stn_id=stn_id, var_id=var_id, value=value)
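
The insert-then-update pattern used by `upsert` can be tried out safely against an in-memory SQLite database, as sketched below. The table definition and the `upsert_demo` name are assumptions made for the demonstration; in particular, the pattern only works if the table has a unique constraint on `(station_id, var_id)`, otherwise the insert never fails and duplicate rows are created.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Assumed schema: the primary key is what makes a repeated insert fail
con.execute(
    "CREATE TABLE stations_par_values ("
    "station_id INTEGER, var_id INTEGER, value TEXT, "
    "PRIMARY KEY (station_id, var_id))"
)


def upsert_demo(con, stn_id, var_id, value):
    """Try to insert a row; if the key already exists, update it instead."""
    params = {"stn_id": stn_id, "var_id": var_id, "value": value}
    try:
        con.execute(
            "INSERT INTO stations_par_values (station_id, var_id, value) "
            "VALUES (:stn_id, :var_id, :value)",
            params,
        )
    except sqlite3.IntegrityError:
        con.execute(
            "UPDATE stations_par_values SET value = :value "
            "WHERE station_id = :stn_id AND var_id = :var_id",
            params,
        )


upsert_demo(con, 38810, 357, "Core")    # inserts a new row
upsert_demo(con, 38810, 357, "Trends")  # same key, so the row is updated
rows = con.execute(
    "SELECT station_id, var_id, value FROM stations_par_values"
).fetchall()
print(rows)
```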

In [7]:
# Get current stations
xl_path = r"../../../all_icpw_sites_mar_2023.xlsx"
stn_df = pd.read_excel(xl_path, sheet_name="all_icpw_stns")
stn_df.head()

Unnamed: 0,station_id,station_code,nfc_code,station_name,latitude,longitude,altitude,continent,country,region,group
0,38810,AM_001,,"Pambak river, 0.5 km above Khnkoyan village",40.839315,44.048911,,Europe,Armenia,Armenia,Core
1,38811,AM_057,,"Marmarik river, 0.5 km above Hankavan village",40.663605,44.466029,,Europe,Armenia,Armenia,Core
2,38812,AM_080,,"Vedi river, 0.5 km above Urtsadzor village",39.921174,44.819649,,Europe,Armenia,Armenia,Core
3,38813,AM_083,,"Arpa river, 0.5 km above Jermuk town",39.843045,45.686151,,Europe,Armenia,Armenia,Core
4,38814,AM_089,,"Meghri river, 0.5 km above Meghri town",38.915268,46.233703,,Europe,Armenia,Armenia,Core


In [8]:
# for idx, row in stn_df.iterrows():
#     stn_id, stn_code, nfc_code, stn_name, lat, lon, alt, cont, country, reg, grp = row

#     if pd.isna(alt):
#         alt = None

#     if pd.isna(nfc_code):
#         nfc_code = None

#     # Update 'stations' table
#     sql = text(
#         "UPDATE resa2.stations "
#         "SET station_code = :stn_code, "
#         "  station_name = :stn_name, "
#         "  latitude = :lat, "
#         "  longitude = :lon, "
#         "  altitude = :alt "
#         "WHERE station_id = :stn_id"
#     )
#     eng.execute(
#         sql,
#         stn_code=stn_code,
#         stn_name=stn_name,
#         lat=lat,
#         lon=lon,
#         alt=alt,
#         stn_id=stn_id,
#     )

#     # Update 'stations_par_values'
#     upsert(stn_id, 322, cont, eng)
#     upsert(stn_id, 261, country, eng)
#     upsert(stn_id, 254, reg, eng)
#     upsert(stn_id, 357, grp, eng)
#     upsert(stn_id, 321, nfc_code, eng)

## 5. Update ICPW project

In [9]:
# # Delete stations currently associated with project
# sql = text("DELETE FROM resa2.projects_stations WHERE project_id = 4617")
# eng.execute(sql)

# # Add correct stations
# proj_stn_df = stn_df[["station_id"]].copy()
# proj_stn_df["project_id"] = 4617
# proj_stn_df.to_sql(
#     name="projects_stations",
#     schema="resa2",
#     con=eng,
#     if_exists="append",
#     index=False,
# )

## 6. Upload data

New data submitted by the Focal Centres has been cleaned, compiled and added to a single input template named `icpw_cfd_2022_all_data.xlsx`. However, a few countries require special treatment first.

### 6.1. Sweden

The Swedish data have been uploaded directly from the MVM national database using the notebook [here](https://github.com/JamesSample/icpw/blob/master/upload_sweden_icpw_data.ipynb). The notebook is located at:

    ...\ICP_Waters\TOC_Trends_Analysis_2015\Python\icpw\upload_sweden_icpw_data.ipynb
    
### 6.2. Slovakia

As noted above, an entirely new dataset for Slovakia has been provided. The code below deletes all the existing data for the Slovakian sites.

In [10]:
slov_df = stn_df.query("country == 'Slovakia'")
slov_df

Unnamed: 0,station_id,station_code,nfc_code,station_name,latitude,longitude,altitude,continent,country,region,group
343,38323,BA-01,,Batizovské pleso,49.15203,20.12988,1884.0,Europe,Slovakia,ECE,Trends
344,38324,FU-01,,Vyšné Wahlenbergovo pleso,49.16383,20.02567,2150.0,Europe,Slovakia,ECE,Trends
345,38325,ME-01,,Velké Hincovo pleso,49.17917,20.05904,1953.0,Europe,Slovakia,ECE,Trends
346,38326,ME-02,,Malé Hincovo pleso,49.17398,20.05742,1927.0,Europe,Slovakia,ECE,Trends
347,38327,ME-04,,Vyšné Satanie pliesko,49.17006,20.06079,1903.0,Europe,Slovakia,ECE,Trends
348,38328,NE-01,,Vyšné Terianske pleso,49.16774,20.02059,2112.0,Europe,Slovakia,ECE,Trends
349,38329,NE-03,,Nižné Terianske pleso,49.16966,20.01287,1943.0,Europe,Slovakia,ECE,Trends
350,38330,RO-01,,Štvrté Rohácske pleso,49.20589,19.73576,1719.0,Europe,Slovakia,ECE,Trends
351,38331,SL-02,,Slavkovské pleso,49.1526,20.18326,1676.0,Europe,Slovakia,ECE,Trends
352,38332,ST-01,,Jamské pleso,49.13301,20.01243,1447.0,Europe,Slovakia,ECE,Trends


In [11]:
# # Station IDs for which data should be deleted
# stn_ids = slov_df["station_id"].tolist()
# stn_ids = ",".join("%d" % i for i in stn_ids)

# # Get all sample IDs associated with stations
# sql = text(
#     "SELECT water_sample_id FROM resa2.water_samples "
#     "WHERE station_id IN (%s)" % stn_ids
# )
# samp_df = pd.read_sql(sql, eng)
# samp_ids = samp_df["water_sample_id"].tolist()
# samp_ids = ",".join("(1, %d)" % i for i in samp_ids)

# # Delete from sample selections
# sql = text(
#     "DELETE FROM resa2.sample_selections WHERE (1, water_sample_id) IN (%s)" % samp_ids
# )
# eng.execute(sql)

# # Delete from values
# sql = text(
#     "DELETE FROM resa2.water_chemistry_values2 WHERE (1, sample_id) IN (%s)" % samp_ids
# )
# eng.execute(sql)

# # Delete from water samples
# sql = text("DELETE FROM resa2.water_samples WHERE station_id IN (%s)" % stn_ids)
# eng.execute(sql)
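
The string formatting in the cell above can be checked without a database connection, as in the sketch below. The sample IDs are made up. The `(1, id)` tuple form is presumably a workaround for databases that limit the number of items in a plain `IN` list, but the notebook does not say so and this should be treated as an assumption. Note also that the deletes run from the dependent tables (`sample_selections`, `water_chemistry_values2`) to the parent (`water_samples`).

```python
# Station IDs for two of the Slovakian sites; sample IDs are hypothetical
stn_ids = [38323, 38324]
samp_ids = [101, 102, 103]

# Plain comma-separated list for the station filter
stn_str = ",".join("%d" % i for i in stn_ids)

# List of (1, id) tuples for the sample filter
samp_str = ",".join("(1, %d)" % i for i in samp_ids)

sql = (
    "DELETE FROM resa2.sample_selections "
    "WHERE (1, water_sample_id) IN (%s)" % samp_str
)
print(stn_str)
print(sql)
```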

### 6.3. Upload template

Finally, we can upload data from the new template. Note the following:

 * **Make sure the template has been checked using the ICPW app and that all issues have been fixed before running the code below**

 * The new template distinguishes between "total" and "filtered" metals. This has not been done previously and is not standard in RESA2. All previously uploaded data are assumed to be for "total" metals. For this data call, I have also created new methods in RESA2 for the filtered samples. However, these methods are not yet connected to `parameters` in RESA2, so they cannot be accessed yet (although they are in the database).

In [12]:
xl_path = r"./data/icpw_cfd_2022_all_data.xlsx"
ws_df, df = utils.process_template(xl_path, eng, dups="mean", dry_run=True)

display(ws_df.head())
display(df.head())

Unnamed: 0,water_sample_id,station_id,date
0,,23472,2020-08-31
1,,23472,2020-10-08
2,,23472,2021-09-13
3,,23472,2021-09-27
4,,23474,2020-08-31


Unnamed: 0,method_id,value,flag1,water_sample_id
0,10249,43.69,,
1,10251,0.593,,
2,10252,0.1,<,
3,10253,0.093,,
4,10254,0.2,,


In [13]:
len(ws_df), len(df)

(3269, 59820)

In [16]:
ws_df.describe()

Unnamed: 0,water_sample_id,station_id
count,47.0,3269.0
mean,875310.531915,34311.313246
std,3653.154056,6649.368826
min,866264.0,23472.0
25%,872653.5,23624.0
50%,872665.0,38331.0
75%,878132.5,38490.0
max,881238.0,38831.0
