In [1]:
%matplotlib inline
import geopandas as gpd
import geopandas.tools
import matplotlib.pyplot as plt
import nivapy3 as nivapy
import numpy as np
import pandas as pd
import pyproj
import seaborn as sn
import useful_rid_code as rid
from shapely.geometry import Point
from sqlalchemy import types

sn.set_context("notebook")

# Process model input datasets (2020-21)

Modelling for the RID programme makes use of the following input datasets:

 * **Avløp** (sewage and other drainage). These datasets are provided by **Gisle Berge at SSB** and are sub-divided into
     * Large treatment works ("store anlegg"; ≥ 50 p.e.)
     * Small treatment works ("små anlegg"; < 50 p.e.)
     * Other environmental pollutants ("miljøgifter") <br><br>
     
 * **Fiskeoppdrett** (fish farming). Provided by **Knut Johan Johnsen at Fiskeridirektoratet** <br><br>
 
 * **Industri** (industrial point sources). Provided by **Preben Danielsen at Miljødirektoratet** <br><br>
 
 * **Jordbruk** (land use and management activities). Provided by **Hans Olav Eggestad at NIBIO**
 
In addition, an annual figure for the use of **copper in aquaculture** is provided by **Preben Danielsen at Miljødirektoratet**.
 
The raw datasets must be restructured into a standardised format and added to the RESA2 database. Once in the database, they can be used to generate input files for TEOTIL2.

This notebook processes the raw data and adds it to RESA2.

In [2]:
# Connect to db
engine = nivapy.da.connect()

Username:  ···
Password:  ··············


Connection successful.


In [3]:
# Year of interest
year = 2020

## 1. Store anlegg, Miljøgifter and Industri

These three datasets are all treated similarly, and there is some duplication between the files. 

 * The **store anlegg** dataset is in wide format. Copy and rename the file, then change the worksheet name to `store_anlegg_{year}`. The header must also be tidied (see example datasets from previous years) and blank rows at the end of the worksheet can be deleted <br><br>
 
 * The **miljøgifter** dataset is in wide format. Copy and rename the file, then change the worksheet name to `miljogifter_{year}`. Also check that column headings are the same as in previous years <br><br>
 
 * The **industri** dataset is in long format and usually contains data for multiple years. Copy and rename the file, then rename the worksheet to `industry_{year}`. Delete rows above the header and check the header is the same as in previous years. Remember to filter the data to **only include the year of interest** (i.e. delete rows for other years)
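
As an alternative to deleting rows for other years by hand, the industri data could be filtered in pandas after reading. The sketch below uses a toy data frame. The column name `År` and all the values are assumptions for illustration, because the name of the year column in the raw file is not shown in this notebook.

```python
import pandas as pd

# Toy stand-in for the raw industri worksheet.
# NOTE: the "År" column name and all values are hypothetical.
toy_ind_df = pd.DataFrame(
    {
        "Anleggsnr": ["SITE_A", "SITE_A", "SITE_B"],
        "År": [2019, 2020, 2020],
        "Mengde": [1.2, 1.5, 0.3],
    }
)

year = 2020

# Keep only rows for the year of interest
toy_ind_df = toy_ind_df[toy_ind_df["År"] == year].reset_index(drop=True)
print(toy_ind_df)
```

Filtering in code leaves the raw file untouched, which makes it easier to check later which rows were excluded.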

The data in these files must be added to two tables in RESA2:

 * The site data must be added to `RESA2.RID_PUNKTKILDER`. Most of the sites should already be there, but occasionally new sites are added. Any new stations must be assigned lat/lon co-ordinates and the appropriate "regine" catchment ID (the latter being most important). This usually requires geocoding plus co-ordinate conversions and/or a spatial join to determine catchment IDs. <br><br>
 
    **Note:** As of 2017, many (>120) of the stations already in the database were missing regine IDs and many more (>3000) were missing co-ordinate information. John Rune says we have previously asked Miljødirektoratet about this, but they have not been able to provide the missing data. During data processing for 2019/20, I noticed that some of the missing co-ordinate information *was* provided in more recent data submissions. However, if a site is already in the database (with missing spatial information), it will not be updated even if co-ordinates are provided in later years. In 2020, I created a notebook ([here](https://nbviewer.org/github/JamesSample/rid/blob/master/notebooks/update_renseanlegg_coords.ipynb)) to update co-ordinates where possible based on data submitted since 2016. This has reduced the number of sites without regine IDs to 30 (`SELECT count(*) from resa2.rid_punktkilder WHERE regine IS NULL;`). I have also modified the code in this notebook so that sites are only added to `RESA2.RID_PUNKTKILDER` when co-ordinate information is available and a regine ID has been successfully assigned. This usually means that some sites in the submission for each year must be ignored, but the benefit is that if the same sites are reported with complete information at a later date, they can then be added to the database and processed correctly. This seems preferable to adding incomplete data that cannot be used. <br><br>
 
 * The chemistry data for each site must be extracted and converted to "long" format, then added to `RESA2.RID_PUNKTKILDER_INPAR_VALUES`. Parameter IDs etc. are taken from `RESA2.RID_PUNKTKILDER_INPAR_DEF`.
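
The wide-to-long conversion mentioned in the last bullet can be illustrated with a toy example. The site codes, column names and values below are made up. Only the parameter IDs (44 for N, 45 for P) are the ones used elsewhere in this notebook.

```python
import pandas as pd

# Toy "wide" table: one row per site, one column per parameter.
# Site codes, column names and values are hypothetical.
wide_df = pd.DataFrame(
    {
        "ANLEGG_NR": ["SITE_A", "SITE_B"],
        "N_kg": [100.0, None],
        "P_kg": [5.0, 2.0],
    }
)

# Map column names to parameter IDs (N is 44 and P is 45)
par_ids = {"N_kg": 44, "P_kg": 45}

# Melt to "long" format: one row per (site, parameter) pair
long_df = wide_df.rename(columns=par_ids).melt(
    id_vars="ANLEGG_NR", var_name="INP_PAR_ID", value_name="VALUE"
)

# Sites that did not report a parameter produce NaN rows, which are dropped
long_df = long_df.dropna(subset=["VALUE"]).reset_index(drop=True)
print(long_df)
```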

In [4]:
# Read raw (tidied) data

# Store anlegg
in_xlsx = f"../../../Data/point_data_{year}/avlop_stor_anlegg_{year}_raw.xlsx"
stan_df = pd.read_excel(in_xlsx, sheet_name=f"store_anlegg_{year}")

# Miljøgifter
in_xlsx = f"../../../Data/point_data_{year}/avlop_miljogifter_{year}_raw.xlsx"
milo_df = pd.read_excel(in_xlsx, sheet_name=f"miljogifter_{year}")

# Industri
in_xlsx = f"../../../Data/point_data_{year}/industri_{year}_raw.xlsx"
ind_df = pd.read_excel(in_xlsx, sheet_name=f"industry_{year}")

# Drop blank rows
stan_df.dropna(how="all", inplace=True)
milo_df.dropna(how="all", inplace=True)
ind_df.dropna(how="all", inplace=True)

### 1.1. Basic data checking

All of the store anlegg and miljøgifter sites are classified as `RENSEANLEGG` in the `TYPE` column of `RESA2.RID_PUNKTKILDER`; industri sites are labeled `INDUSTRI`.

The code below adds `TYPE` columns, merges site data from different sources, converts UTM co-ordinates to WGS84 decimal degrees and identifies sites not already in the database. Issues identified below (e.g. missing co-ordinates) should be corrected if possible before continuing.
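
The two checks described above (sites not yet in the database, and sites without co-ordinates) amount to a set difference and a NaN test. A minimal sketch with made-up site codes is shown here. `isna()` has the same effect as a `lat != lat` comparison, which relies on NaN never being equal to itself.

```python
import numpy as np
import pandas as pd

# Toy site table; codes and co-ordinates are hypothetical
sites = pd.DataFrame(
    {
        "anlegg_nr": ["SITE_A", "SITE_B", "SITE_C"],
        "lat": [59.9, np.nan, 61.2],
        "lon": [10.8, 7.6, np.nan],
    }
)

# IDs assumed to be in the database already (hypothetical)
in_db = {"SITE_A"}

# Sites reported this year that are not in the database
new_ids = set(sites["anlegg_nr"]) - in_db

# Sites where either co-ordinate is missing
no_coords = sites[sites[["lat", "lon"]].isna().any(axis=1)]

# New sites that cannot be geocoded and must therefore be ignored
new_no_coords = no_coords[no_coords["anlegg_nr"].isin(new_ids)]
print(new_no_coords)
```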

In [5]:
# Add TYPE cols
stan_df["TYPE"] = "RENSEANLEGG"
milo_df["TYPE"] = "RENSEANLEGG"
ind_df["TYPE"] = "INDUSTRI"

# Get just stn info from each df
stan_loc = stan_df[
    ["ANLEGGSNR", "ANLEGGSNAVN", "Kommunenr", "TYPE", "Sone", "UTM_E", "UTM_N"]
].copy()

milo_loc = milo_df[
    ["ANLEGGSNR", "ANLEGGSNAVN", "KOMMUNE_NR", "TYPE", "SONEBELTE", "UTMOST", "UTMNORD"]
].copy()

ind_loc = ind_df[
    [
        "Anleggsnr",
        "Anleggsnavn",
        "Komm.nr",
        "TYPE",
        "Geografisk Longitude",
        "Geografisk Latitude",
    ]
].copy()


# Rename cols
stan_loc.columns = [
    "anlegg_nr",
    "anlegg_navn",
    "komm_no",
    "TYPE",
    "zone",
    "east",
    "north",
]
milo_loc.columns = [
    "anlegg_nr",
    "anlegg_navn",
    "komm_no",
    "TYPE",
    "zone",
    "east",
    "north",
]
ind_loc.columns = ["anlegg_nr", "anlegg_navn", "komm_no", "TYPE", "lon", "lat"]

# Drop duplicates
stan_loc.drop_duplicates(inplace=True)
milo_loc.drop_duplicates(inplace=True)
ind_loc.drop_duplicates(inplace=True)

# Convert UTM Zone col to Pandas' nullable integer data type
# (because proj. now complains about float UTM zones)
stan_loc["zone"] = stan_loc["zone"].astype(pd.Int64Dtype())
milo_loc["zone"] = milo_loc["zone"].astype(pd.Int64Dtype())

# Convert UTM to lat/lon
# "Industri" data is already in dd
stan_loc = nivapy.spatial.utm_to_wgs84_dd(stan_loc, "zone", "east", "north")
milo_loc = nivapy.spatial.utm_to_wgs84_dd(milo_loc, "zone", "east", "north")

# Remove UTM data
del stan_loc["zone"], stan_loc["east"], stan_loc["north"]
del milo_loc["zone"], milo_loc["east"], milo_loc["north"]

# Combine into single df
loc_df = pd.concat([stan_loc, milo_loc, ind_loc], axis=0, sort=True)

# The same site can be in multiple files, so drop duplicates
loc_df.drop_duplicates(inplace=True)

# Kommune nr. should be a 4 char string, not a float
fmt = lambda x: "%04d" % x
loc_df["komm_no"] = loc_df["komm_no"].apply(fmt)

# Check ANLEGG_NR is unique
assert not loc_df["anlegg_nr"].duplicated().any(), 'Some "ANLEGGSNRs" are duplicated.'

# Check if any sites are not already in db
sql = "SELECT UNIQUE(ANLEGG_NR) FROM resa2.rid_punktkilder"
annr_df = pd.read_sql_query(sql, engine)

not_in_db = set(loc_df["anlegg_nr"].values) - set(annr_df["anlegg_nr"].values)
not_in_db_df = loc_df[loc_df["anlegg_nr"].isin(list(not_in_db))][
    ["anlegg_nr", "anlegg_navn"]
].sort_values("anlegg_nr")
no_coords_df = loc_df.query("(lat!=lat) or (lon!=lon)")[
    ["anlegg_nr", "anlegg_navn"]
].sort_values("anlegg_nr")
not_in_db_no_coords_df = not_in_db_df[
    not_in_db_df["anlegg_nr"].isin(no_coords_df["anlegg_nr"])
].sort_values("anlegg_nr")

print(f"The following {len(not_in_db_df)} locations are not already in the database:")
print(not_in_db_df)

print(
    f"\nThe following {len(no_coords_df)} locations do not have co-ordinates "
    "in this year's data:"
)
print(no_coords_df)

print(
    f"\nThe following {len(not_in_db_no_coords_df)} locations are not in the "
    "database and do not have co-ordinates (and therefore must be ignored):"
)
print(not_in_db_no_coords_df)

The following 46 locations are not already in the database:
         anlegg_nr                                  anlegg_navn
1247      0429AL10                  Camp Rødsmoen, avløpsanlegg
1389      0540AL13                       Ølnesseter renseanlegg
987       0604AL92              Skrim - Omholtfjell renseanlegg
1107      0618AL22                          Markegardslia avløp
1558      1029AL13             Lande Eiendom Invest renseanlegg
49    1101.0126.01                                Prima Protein
63        1102AL24                                      Breivik
92    1108.0247.01                 Hermod Teigen, Foss-Eikeland
1742      1221AL27                                 Sæverhagen 2
1789      1224AL39                                   Eidsvik SA
552       1557AL14                                       Søvika
643       1579AL00                                       Vevang
733       1820AL13                          Bærøya avløpsanlegg
734       1820AL14                        Of

### 1.2. Identify Regine Vassdragsnummer

The shapefile here:

    K:/Kart/Regine_2006/RegMinsteF.shp

shows locations for all the regine catchments used by TEOTIL (see e-mail from John Rune received 29/06/2017 at 17.26). I've copied this file locally here:

    ../../../Data/gis/shapefiles/RegMinsteF.shp

and re-projected it to WGS84 geographic co-ordinates. The new file is called `reg_minste_f_wgs84.shp`.

The code cell below identifies which regine polygon each point is located in.
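
Conceptually, the spatial join tests each point against each catchment polygon. The pure-Python sketch below illustrates the idea with a ray-casting test on two made-up square "catchments". It is not the implementation used by `nivapy`, which is not shown in this notebook. The function names and polygon IDs are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test. `polygon` is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x co-ordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            # Count crossings to the right of the point
            if x < x_cross:
                inside = not inside
    return inside


# Two adjacent toy polygons (hypothetical IDs and co-ordinates)
catchments = {
    "CATCH_1": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "CATCH_2": [(2, 0), (4, 0), (4, 2), (2, 2)],
}


def find_catchment(lon, lat):
    """Return the ID of the first polygon containing the point, else None."""
    for name, poly in catchments.items():
        if point_in_polygon(lon, lat, poly):
            return name
    return None


print(find_catchment(1, 1))
print(find_catchment(5, 5))
```

Points that fall outside every polygon return `None`, which corresponds to the missing regine IDs that are dropped later in the workflow.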

In [6]:
# Path to Regine catchment shapefile
reg_shp_path = r"../../../Data/gis/shapefiles/reg_minste_f_wgs84.shp"

# Spatial join
loc_df = nivapy.spatial.identify_point_in_polygon(
    loc_df, reg_shp_path, "anlegg_nr", "VASSDRAGNR", "lat", "lon"
)

loc_df.head()



Unnamed: 0,TYPE,anlegg_navn,anlegg_nr,komm_no,lat,lon,VASSDRAGNR
0,RENSEANLEGG,Bekkelaget renseanlegg med tilførselstuneller ...,0301AL01,301,59.882995,10.767014,006.21
1,RENSEANLEGG,Mariholtet renseanlegg,0301AL14,301,59.891822,10.906905,002.CBB5
2,RENSEANLEGG,Tryvannsstua renseanlegg,0301AL19,301,59.997898,10.663006,007.A0
3,RENSEANLEGG,Grefsenkollen renseanlegg,0301AL27,301,59.958846,10.803291,006.A3
4,RENSEANLEGG,Kobberhaugshytta,0301AL30,301,60.036105,10.663652,007.B1


### 1.3. Restructuring site data

Rename columns to match RESA2.

In [7]:
# Rename other cols to match RESA2
loc_df["ANLEGG_NR"] = loc_df["anlegg_nr"]
loc_df["ANLEGG_NAVN"] = loc_df["anlegg_navn"]
loc_df["KNO"] = loc_df["komm_no"]
loc_df["REGINE"] = loc_df["VASSDRAGNR"]
loc_df["LON_UTL"] = loc_df["lon"]
loc_df["LAT_UTL"] = loc_df["lat"]

del loc_df["anlegg_nr"], loc_df["anlegg_navn"], loc_df["komm_no"]
del loc_df["VASSDRAGNR"], loc_df["lon"], loc_df["lat"]

# Get details for sites not already in db
loc_upld = loc_df[loc_df["ANLEGG_NR"].isin(list(not_in_db))].copy()

# Drop rows where 'regine' is NaN (usually because of missing co-ordinates).
# In the past, all rows have been added, leading to sites in the database 
# without co-ordinates. These then do not get updated if co-ordinates are 
# provided in later years. It is therefore better to only add sites with
# co-ordinates, as sites with missing data this year may be completed in
# subsequent years
loc_upld.dropna(subset=['REGINE'], inplace=True)

loc_upld

Unnamed: 0,TYPE,ANLEGG_NR,ANLEGG_NAVN,KNO,REGINE,LON_UTL,LAT_UTL
63,RENSEANLEGG,1102AL24,Breivik,1108,031.A3F,6.64971,58.997915
552,RENSEANLEGG,1557AL14,Søvika,1557,108.223,7.59941,62.932531
987,RENSEANLEGG,0604AL92,Skrim - Omholtfjell renseanlegg,3006,015.C6B,9.7593,59.505362
1247,RENSEANLEGG,0429AL10,"Camp Rødsmoen, avløpsanlegg",3422,002.JB4,11.462672,61.201083
1389,RENSEANLEGG,0540AL13,Ølnesseter renseanlegg,3449,012.HAZ,9.46982,60.781269
1560,RENSEANLEGG,1029AL13,Lande Eiendom Invest renseanlegg,4205,023.1,7.413042,58.009407
1744,RENSEANLEGG,1221AL27,Sæverhagen 2,4614,044.31,5.527282,59.799517
1791,RENSEANLEGG,1224AL39,Eidsvik SA,4617,042.94,5.68561,59.790565
2745,INDUSTRI,1101.0126.01,Prima Protein,1101,027.4,5.978354,58.438394
2755,INDUSTRI,1108.0247.01,"Hermod Teigen, Foss-Eikeland",1108,028.B,5.734317,58.80195


In [8]:
# Add to RESA2.RID_PUNKTKILDER
# loc_upld.to_sql('rid_punktkilder', con=engine, schema='resa2',
#                if_exists='append', index=False)

### 1.4. Restructuring values

In [9]:
# Store Anlegg
# Get cols of interest
stan_vals = stan_df[["ANLEGGSNR", "MENGDE_P_UT_kg", "MENGDE_N_UT_kg"]].copy()

# In RESA2.RID_PUNKTKILDER_INPAR_DEF, N is par_id 44 and P par_id 45
stan_vals.columns = ["ANLEGG_NR", 45, 44]

# Melt to "long" format
stan_vals = pd.melt(
    stan_vals,
    id_vars="ANLEGG_NR",
    value_vars=[45, 44],
    var_name="INP_PAR_ID",
    value_name="VALUE",
)

# Drop NaN values
stan_vals.dropna(how="any", inplace=True)

As far as I can tell from exploring the 2015 data in the database, the main columns of interest for Miljøgifter are given in `milo_dict`, below, together with the corresponding parameter IDs from `RESA2.RID_PUNKTKILDER_INPAR_DEF`. This hard-coding is a bit messy, but I can't see any database table providing a nice lookup between these values, so they're included here for now.

In [10]:
# Miljøgifter
# Get cols of interest
milo_dict = {
    "MILJOGIFTHG2": 16,
    "MILJOGIFTPAH2": 48,
    "MILJOGIFTPCB2": 30,
    "MILJOGIFTCD2": 8,
    "MILJOGIFTDEHP2": 119,
    "MILJOGIFTAS2": 2,
    "MILJOGIFTCR2": 10,
    "MILJOGIFTPB2": 28,
    "MILJOGIFTNI2": 25,
    "MILJOGIFTCU2": 15,
    "MILJOGIFTZN2": 38,
    "KONSMENGDTOTP10": 45,
    "KONSMENGDTOTN10": 44,
    "KONSMENGDSS10": 46,
    "ANLEGGSNR": "ANLEGG_NR",
}  # Make headings match RESA

milo_vals = milo_df[list(milo_dict)].copy()

# Get par IDs from dict
milo_vals.columns = [milo_dict[i] for i in milo_vals.columns]

# Melt to "long" format
milo_vals = pd.melt(
    milo_vals, id_vars="ANLEGG_NR", var_name="INP_PAR_ID", value_name="VALUE"
)

# Drop NaN values
milo_vals.dropna(how="any", inplace=True)

The industry data is already in "long" format.

In [11]:
# Industri
# Get cols of interest
ind_vals = ind_df[["Anleggsnr", "Komp.kode", "Mengde", "Enhet"]].copy()
ind_vals.columns = ["anlegg_nr", "name", "value", "unit"]

# Get par defs from db
sql = "SELECT * FROM resa2.rid_punktkilder_inpar_def"
par_df = pd.read_sql_query(sql, engine)
del par_df["descr"]

# Standardise unit capitalisation (first letter upper case, the rest
# lower case) so that the join below is not case-sensitive
ind_vals["unit"] = ind_vals["unit"].str.capitalize()
par_df["unit"] = par_df["unit"].str.capitalize()

# Join
ind_vals = pd.merge(ind_vals, par_df, how="left", on=["name", "unit"])

# Some parameters that are not of interest are not matched
# Drop these
ind_vals.dropna(how="any", inplace=True)

# Get just cols of interest
ind_vals = ind_vals[["anlegg_nr", "in_pid", "value"]].copy()

# Rename for db
ind_vals.columns = ["ANLEGG_NR", "INP_PAR_ID", "VALUE"]

# Convert col types
ind_vals["INP_PAR_ID"] = ind_vals["INP_PAR_ID"].astype(int)


In [12]:
# Combine
val_df = pd.concat([stan_vals, milo_vals, ind_vals], axis=0, sort=True)

# Add column for year
val_df["YEAR"] = year

# Explicitly set data types
val_df["ANLEGG_NR"] = val_df["ANLEGG_NR"].astype(str)
val_df["INP_PAR_ID"] = val_df["INP_PAR_ID"].astype(int)
val_df["VALUE"] = val_df["VALUE"].astype(float)
val_df["YEAR"] = val_df["YEAR"].astype(int)

# Store Anlegg and Miljøgifter contain some duplicated information
val_df.drop_duplicates(inplace=True)

# Average any remaining duplicates (because sometimes the same value is reported with different precision)
val_df = val_df.groupby(["ANLEGG_NR", "INP_PAR_ID", "YEAR"]).mean().reset_index()

val_df.head()

Unnamed: 0,ANLEGG_NR,INP_PAR_ID,YEAR,VALUE
0,0101AL01,44,2020,815.98
1,0101AL01,45,2020,2.35
2,0101AL01,46,2020,267.8
3,0101AL06,44,2020,253.65
4,0101AL06,45,2020,2.86
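
The two-stage handling of duplicates in the cell above (drop exact duplicates, then average values that differ only in precision) can be seen more easily on a toy example. The site codes and the values for `SITE_B` are made up.

```python
import pandas as pd

# SITE_A is reported identically twice; SITE_B is reported with
# two different precisions (hypothetical values)
toy_vals = pd.DataFrame(
    {
        "ANLEGG_NR": ["SITE_A", "SITE_A", "SITE_B", "SITE_B"],
        "INP_PAR_ID": [44, 44, 45, 45],
        "YEAR": [2020, 2020, 2020, 2020],
        "VALUE": [815.98, 815.98, 2.35, 2.4],
    }
)

# Exact duplicates collapse to a single row
toy_vals = toy_vals.drop_duplicates()

# Remaining near-duplicates are averaged per site, parameter and year
toy_vals = toy_vals.groupby(
    ["ANLEGG_NR", "INP_PAR_ID", "YEAR"], as_index=False
)["VALUE"].mean()
print(toy_vals)
```

Dropping exact duplicates first matters: otherwise a value reported identically in two files would be counted twice when averaging against a third, differently rounded value.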


In [13]:
## Drop any existing values for this year
# sql = ("DELETE FROM resa2.rid_punktkilder_inpar_values "
#       "WHERE year = %s" % year)
# res = engine.execute(sql)
#
## Add to RESA2.RID_PUNKTKILDER_INPAR_VALUES
# val_df.to_sql('rid_punktkilder_inpar_values', con=engine, schema='resa2',
#              if_exists='append', index=False)

## 2. Små anlegg (small treatment works)

Copy and rename the file, then rename the worksheet to `sma_anlegg_{year}`. Delete rows above the header and delete unnecessary columns. The only columns required are `KOMMUNENR`, `SUM FOSFOR` and `SUM NITROGEN`; the last two should be renamed `P_kg` and `N_kg`, respectively.

This data is added directly to `RESA2.RID_KILDER_SPREDT_VALUES`. 

**Note:** The kommune ID numbers in the små anlegg file should be present in 

    ../../../teotil2/data/core_input_data/regine_{year}.csv
    
Kommune IDs change from year to year, so they will usually need updating in TEOTIL - see [update_regine_kommune.ipynb](https://nbviewer.org/github/JamesSample/rid/blob/master/notebooks/update_regine_kommune.ipynb) for details.
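
The kommune check compares zero-padded, four-character ID strings from the two files. A minimal sketch with made-up IDs is given below. The helper name `fmt_komm` is hypothetical, and `9999` stands in for a kommune that is missing from the TEOTIL file.

```python
import pandas as pd


def fmt_komm(x):
    """Format a kommune number as a 4-character, zero-padded string."""
    return f"{int(x):04d}"


# Toy data: Excel often reads IDs as floats, CSV as ints (hypothetical values)
sman = pd.DataFrame({"KOMMUNENR": [301.0, 1101.0, 9999.0]})
regine = pd.DataFrame({"komnr": [301, 1101, 1103]})

sman_ids = set(sman["KOMMUNENR"].apply(fmt_komm))
teotil_ids = set(regine["komnr"].apply(fmt_komm))

# Kommuner in the små anlegg data that TEOTIL does not know about
missing = sorted(sman_ids - teotil_ids)
print(missing)
```

Formatting both sides in the same way before comparing avoids false mismatches such as `301` versus `0301`.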

In [14]:
# Read raw (tidied) data
in_xlsx = f"../../../Data/point_data_{year}/avlop_sma_anlegg_{year}_raw.xlsx"
sman_df = pd.read_excel(in_xlsx, sheet_name=f"sma_anlegg_{year}")

# Drop blank rows
sman_df.dropna(how="all", inplace=True)

# Kommune nr. should be a 4 char string, not a float
fmt = lambda x: "%04d" % x
sman_df["KOMMUNENR"] = sman_df["KOMMUNENR"].apply(fmt)

# Check if any kommuner are not already in TEOTIL
reg_csv = f"../../../teotil2/data/core_input_data/regine_{year}.csv"
kmnr_df = pd.read_csv(reg_csv, sep=";", encoding="utf-8")
kmnr_df["komnr"] = kmnr_df["komnr"].apply(fmt)

not_in_db = set(sman_df["KOMMUNENR"].values) - set(kmnr_df["komnr"].values)
if len(not_in_db) > 0:
    print(f'\nThe following {len(not_in_db)} kommuner are not in the TEOTIL "regine" file. Consider updating?:')
    print(sman_df[sman_df["KOMMUNENR"].isin(list(not_in_db))])

# Get cols of interest for RID_KILDER_SPREDT_VALUES
sman_df = sman_df[["KOMMUNENR", "P_kg", "N_kg"]]

# In RESA2.RID_PUNKTKILDER_INPAR_DEF, N is par_id 44 and P par_id 45
sman_df.columns = ["KOMM_NO", 45, 44]

# Melt to "long" format
sman_df = pd.melt(
    sman_df,
    id_vars="KOMM_NO",
    value_vars=[45, 44],
    var_name="INP_PAR_ID",
    value_name="VALUE",
)

# Add column for year
sman_df["AR"] = year

sman_df.head()

Unnamed: 0,KOMM_NO,INP_PAR_ID,VALUE,AR
0,301,45,211.06125,2020
1,1101,45,830.9079,2020
2,1103,45,2895.16905,2020
3,1106,45,198.236249,2020
4,1108,45,1334.749972,2020


In [15]:
# Add to RESA2.RID_KILDER_SPREDT_VALUES
# sman_df.to_sql('rid_kilder_spredt_values', con=engine, schema='resa2',
#               if_exists='append', index=False)

## 3. Fish farms

An example of the raw data is here:

 * K:\Prosjekter\Ferskvann\O-13255-TEOTIL\2016\Rådata\Fiskeoppdrett\Teotil - 2015 (2) (pr. 09.08.16).xlsx.zip

I have made a local copy of the 2016 file here:

 * C:\Data\James_Work\Staff\Oeyvind_K\Elveovervakingsprogrammet\Data\point_data_2016\fiske_oppdret_2016_raw.xlsx

The data must be added to two tables in RESA2:

 * First, the site data must be added to `RESA2.RID_KILDER_AQUAKULTUR`. Most of the sites should already be there, but occasionally new sites are added. Any new stations must be assigned lat/lon co-ordinates and the appropriate "Regine" catchment ID. This usually requires geocoding plus co-ordinate conversions and/or a spatial join to determine catchment IDs.
 
    **Note:** The key ID fields in the raw data appear to be `LOKNR` and `LOKNAVN`. <br><br>
 
 * Secondly, the chemistry data for each site must be extracted and converted to "long" format, then added to `RESA2.RID_KILDER_AQKULT_VALUES`. Parameter IDs etc. are taken from `RESA2.RID_PUNKTKILDER_INPAR_DEF`.
 
### 3.1. Basic data checking

In [None]:
# Set the year for the data in question
year = 2019

In [None]:
# Read raw (tidied) data
# Fish farms
in_xlsx = f"../../../Data/point_data_{year}/fiske_oppdret_{year}_raw.xlsx"
fish_df = pd.read_excel(in_xlsx, sheet_name="Ark1")

# Drop no data
fish_df.dropna(how="all", inplace=True)

In [None]:
# Check if any sites are not already in db
sql = "SELECT UNIQUE(NR) " "FROM resa2.rid_kilder_aquakultur"
aqua_df = pd.read_sql_query(sql, engine)

not_in_db = set(fish_df["LOKNR"].values) - set(aqua_df["nr"].values)

nidb_df = fish_df[fish_df["LOKNR"].isin(list(not_in_db))][
    ["LOKNR", "LOKNAVN", "N_DESIMALGRADER_Y", "O_DESIMALGRADER_X"]
].drop_duplicates(subset=["LOKNR"])

print("\nThe following locations are not in the database:")
print(nidb_df)

### 3.2. Geocode fish farms and add to database

In [None]:
# Path to Regine catchment shapefile
reg_shp_path = r"../../../Data/gis/shapefiles/reg_minste_f_wgs84.shp"

# Spatial join
if len(nidb_df) > 0:
    loc_df = rid.identify_point_in_polygon(
        nidb_df,
        reg_shp_path,
        "LOKNR",
        "VASSDRAGNR",
        "N_DESIMALGRADER_Y",
        "O_DESIMALGRADER_X",
    )

    # Rename cols
    loc_df.columns = ["NR", "NAVN", "LENGDE", "BREDDE", "REGINE"]

    print(loc_df.head())

In [None]:
# Add to RESA2.RID_KILDER_AQUAKULTUR
# loc_df.to_sql('rid_kilder_aquakultur', con=engine, schema='resa2',
#              if_exists='append', index=False)

### 3.3. Estimate nutrient inputs

The methodology here is a little unclear. The following is my best guess, based on the files located here:

K:\Avdeling\Vass\316_Miljøinformatikk\Prosjekter\RID\2016\Rådata\Fiskeoppdrett

Old workflow:

 1. Calculate the fish biomass from the raw data. See the equation in the `Biomasse` column of the spreadsheet *JSE_TEOTIL_2015.xlsx* <br><br>
 
 2. Split the data according to salmon ("laks"; species ID 71101) and trout ("ørret"; species ID 71401), then group by location and month, summing biomass and `FORFORBRUK_KILO` columns (see Fiskeoppdrett_biomasse_2016.accdb) <br><br>
 
 3. Calculate production. This involves combining biomass for the current month with that for the previous month. See the calculations in e.g. *N_P_ørret_2015.xlsx*. <br><br>
 
 4. Calculate NTAP and PTAP. **NB:** I don't know what these quantities are, so I'm just blindly duplicating the Excel calculations in the code below. The functions are therefore not very well explained <br><br>
 
 5. Estimate copper usage at each fish farm by scaling the total annual Cu usage in proportion to P production. For 2018, John Rune has supplied an annual Cu value of **1217 tonnes** (see e-mail received 15.10.2019 at 10.27).
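
Step 5 is a simple proportional allocation: each farm receives a share of the annual Cu total equal to its share of total P. The sketch below shows the arithmetic with made-up farm names and P values. It is only an illustration of the scaling described above, not the code in `rid.estimate_fish_farm_nutrient_inputs`, which is not shown here.

```python
# Annual Cu usage in tonnes (value quoted in the text for 2018)
an_cu_tonnes = 1217.0

# Hypothetical P production per farm (units cancel in the ratio)
p_by_site = {"FARM_A": 30.0, "FARM_B": 10.0}

total_p = sum(p_by_site.values())

# Allocate Cu to each farm in proportion to its share of total P
cu_by_site = {
    site: an_cu_tonnes * p / total_p for site, p in p_by_site.items()
}

for site, cu in cu_by_site.items():
    print(f"{site}: {cu:.2f} tonnes Cu")
```

By construction, the per-farm values sum to the annual total.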

In [None]:
# Annual Cu usage in tonnes
an_cu = 1217

# Estimate nutrient inputs from fish farms
fish_nut = rid.estimate_fish_farm_nutrient_inputs(fish_df, year, an_cu)

fish_nut.head()

In [None]:
# Add to RESA2.RID_KILDER_AQKULT_VALUES
# fish_nut.to_sql('rid_kilder_aqkult_values', con=engine, schema='resa2',
#                if_exists='append', index=False)

## 4. Land use

An example of the raw data is here:

 * K:\Avdeling\Vass\316_Miljøinformatikk\Prosjekter\RID\2016\Rådata\Jordbruk\to-niva.2015.xls

Note that this file is not really an Excel file and opening it directly creates errors. I have corrected the data format, tidied the column headings and made a local copy of the 2016 data here:

 * C:\Data\James_Work\Staff\Oeyvind_K\Elveovervakingsprogrammet\Data\point_data_2016\jordbruk_2016.xlsx
 
This is added to the table `RESA2.RID_AGRI_INPUTS`.

**Note:** In recent years, the entry for Oslo (fylke_sone = 3_1) has been missing from the data provided by Bioforsk. This row should be added manually to the Excel file using `omrade = "osl1"`. The values should be identical to those for område `ake2`. This works because the land areas in `RID_Fylke-Sone_LU_Areas.xlsx` have been made identical for `osl1` and `ake2` (even though this is not correct), so the inputs in terms of kg/km2 are calculated as being the same for both regions, which is what is required.
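
As an alternative to editing the Excel file by hand, the missing Oslo row could be added in pandas after reading the data. This is a sketch under stated assumptions: the toy values and the `ake1` row are made up, and only two of the value columns are shown.

```python
import pandas as pd

# Toy stand-in for the tidied jordbruk data (hypothetical values)
lu = pd.DataFrame(
    {
        "omrade": ["ake1", "ake2"],
        "n_diff_kg": [10.0, 20.0],
        "p_diff_kg": [1.0, 2.0],
    }
)

# Add an "osl1" row copied from "ake2" if Oslo is missing
if "osl1" not in lu["omrade"].values:
    osl = lu[lu["omrade"] == "ake2"].copy()
    osl["omrade"] = "osl1"
    lu = pd.concat([lu, osl], ignore_index=True)

print(lu)
```

The `if` guard means the cell can be re-run safely, and it does nothing in years when the Oslo row is supplied.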

In [None]:
# Set the year for the data in question
year = 2019

In [None]:
# Path to (tidied) Bioforsk data
in_xlsx = f"../../../Data/point_data_{year}/jordbruk_{year}.xlsx"

lu_df = pd.read_excel(in_xlsx)

# Add year
lu_df["year"] = year

# Order cols
lu_df = lu_df[
    [
        "omrade",
        "year",
        "n_diff_kg",
        "n_point_kg",
        "n_back_kg",
        "p_diff_kg",
        "p_point_kg",
        "p_back_kg",
    ]
]

lu_df.head()

In [None]:
# Write to RESA
# lu_df.to_sql(name='rid_agri_inputs', con=engine,
#             schema='resa2', index=False,
#             if_exists='append',
#             dtype={'omrade': types.VARCHAR(lu_df['omrade'].str.len().max())})