In [None]:
%matplotlib inline
import geopandas as gpd
import geopandas.tools
import matplotlib.pyplot as plt
import nivapy3 as nivapy
import numpy as np
import pandas as pd
import pyproj
import seaborn as sn
import useful_rid_code as rid
from shapely.geometry import Point
from sqlalchemy import types

sn.set_context("notebook")

# Process model input datasets (2020-21)

Modelling for the RID programme makes use of the following input datasets:

 * **Avløp** (sewage and other drainage). These datasets are provided by **Gisle Berge at SSB** and are sub-divided into
     * Large treatment works ("store anlegg"; ≥ 50 p.e.)
     * Small treatment works ("små anlegg"; < 50 p.e.)
     * Other environmental pollutants ("miljøgifter") <br><br>
     
 * **Fiskeoppdrett** (Fish farming). Provided by **Knut Johan Johnsen at Fiskeridirektoratet** <br><br>
 
 * **Industri** (industrial point sources). Provided by **Preben Danielsen at Miljødirektoratet** <br><br>
 
 * **Jordbruk** (land use and management activities). Provided by **Hans Olav Eggestad at NIBIO**
 
In addition, an annual figure for the use of **copper in aquaculture** is provided by **Preben Danielsen at Miljødirektoratet**.
 
The raw datasets must be restructured into a standardised format and added to the RESA2 database. Once in the database, they can be used to generate input files for TEOTIL2.

This notebook processes the raw data and adds it to RESA2.

In [None]:
# Connect to db
engine = nivapy.da.connect()

In [None]:
# Year of interest
year = 2020

## 1. Store anlegg, Miljøgifter and Industri

These three datasets are all treated similarly, and there is some duplication between the files. 

 * The **store anlegg** dataset is in wide format. Copy and rename the file, then change the worksheet name to `store_anlegg_{year}`. The header must also be tidied (see example datasets from previous years) and blank rows at the end of the worksheet can be deleted <br><br>
 
 * The **miljøgifter** dataset is in wide format. Copy and rename the file, then change the worksheet name to `miljogifter_{year}`. Also check that column headings are the same as in previous years <br><br>
 
 * The **industri** dataset is in long format and usually contains data for multiple years. Copy and rename the file, then rename the worksheet to `industry_{year}`. Delete rows above the header and check the header is the same as in previous years. Remember to filter the data to **only include the year of interest** (i.e. delete rows for other years)

The data in these files must be added to two tables in RESA2:

 * The site data must be added to `RESA2.RID_PUNKTKILDER`. Most of the sites should already be there, but occasionally new sites are added. Any new stations must be assigned lat/lon co-ordinates and the appropriate "regine" catchment ID (the latter being most important). This usually requires geocoding plus co-ordinate conversions and/or a spatial join to determine catchment IDs. <br><br>
 
    **Note:** As of 2017, many (>120) of the stations already in the database were missing regine IDs and many more (>3000) were missing co-ordinate information. John Rune says we have previously asked Miljødirektoratet about this, but they have not been able to provide the missing data. During data processing for 2019/20, I noticed that some of the missing co-ordinate information *was* provided in more recent data submissions. However, if a site is already in the database (with missing spatial information), it will not be updated even if co-ordinates are provided in later years. In 2020, I created a notebook ([here](https://nbviewer.org/github/JamesSample/rid/blob/master/notebooks/update_renseanlegg_coords.ipynb)) to update co-ordinates where possible based on data submitted since 2016. This has reduced the number of sites without regine IDs to 30 (`SELECT count(*) from resa2.rid_punktkilder WHERE regine IS NULL;`). I have also modified the code in this notebook so that sites are only added to `RESA2.RID_PUNKTKILDER` when co-ordinate information is available and a regine ID has been successfully assigned. This usually means that some sites in the submission for each year must be ignored, but the benefit is that if the same sites are reported with complete information at a later date, they can then be added to the database and processed correctly. This seems preferable to adding incomplete data that cannot be used. <br><br>
 
 * The chemistry data for each site must be extracted and converted to "long" format, then added to `RESA2.RID_PUNKTKILDER_INPAR_VALUES`. Parameter IDs etc. are taken from `RESA2.RID_PUNKTKILDER_INPAR_DEF`.
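
As a rough illustration of the wide-to-long conversion described above, the sketch below melts a toy table into the three-column layout used for the values table. The site IDs, input column names and values are invented; only the parameter IDs (44 for N, 45 for P) come from this notebook.

```python
import pandas as pd

# Toy wide-format table: one row per site, one column per parameter.
# Site IDs, column names and values are invented for illustration.
wide_df = pd.DataFrame(
    {
        "ANLEGG_NR": ["A1", "A2"],
        "P_kg": [10.0, None],
        "N_kg": [100.0, 250.0],
    }
)

# Map each value column to its parameter ID (N is 44 and P is 45)
par_ids = {"P_kg": 45, "N_kg": 44}

# Rename, melt to long format and drop missing values
long_df = (
    wide_df.rename(columns=par_ids)
    .melt(id_vars="ANLEGG_NR", var_name="INP_PAR_ID", value_name="VALUE")
    .dropna(subset=["VALUE"])
    .reset_index(drop=True)
)
print(long_df)
```

Each remaining row is one (site, parameter, value) triple, which is the shape expected by `RESA2.RID_PUNKTKILDER_INPAR_VALUES` once a `YEAR` column is added.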

In [None]:
# Read raw (tidied) data

# Store anlegg
in_xlsx = f"../../../Data/point_data_{year}/avlop_stor_anlegg_{year}_raw.xlsx"
stan_df = pd.read_excel(in_xlsx, sheet_name=f"store_anlegg_{year}")

# Miljøgifter
in_xlsx = f"../../../Data/point_data_{year}/avlop_miljogifter_{year}_raw.xlsx"
milo_df = pd.read_excel(in_xlsx, sheet_name=f"miljogifter_{year}")

# Industri
in_xlsx = f"../../../Data/point_data_{year}/industri_{year}_raw.xlsx"
ind_df = pd.read_excel(in_xlsx, sheet_name=f"industry_{year}")

# Drop blank rows
stan_df.dropna(how="all", inplace=True)
milo_df.dropna(how="all", inplace=True)
ind_df.dropna(how="all", inplace=True)

### 1.1. Basic data checking

All of the store anlegg and miljøgifter sites are classified as `RENSEANLEGG` in the `TYPE` column of `RESA2.RID_PUNKTKILDER`; industri sites are labeled `INDUSTRI`.

The code below adds `TYPE` columns, merges site data from different sources, converts UTM co-ordinates to WGS84 decimal degrees and identifies sites not already in the database. Issues identified below (e.g. missing co-ordinates) should be corrected if possible before continuing.
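
The checks for new sites and missing co-ordinates boil down to a set difference and a NaN test. The following is a minimal sketch on invented data (the site IDs and the `in_db` set are hypothetical); `isna()` is used here as a more readable equivalent of the `lat != lat` idiom.

```python
import numpy as np
import pandas as pd

# Invented example sites from "this year's" submission
sites = pd.DataFrame(
    {
        "anlegg_nr": ["A1", "A2", "A3"],
        "lat": [59.9, np.nan, 60.4],
        "lon": [10.7, 5.3, np.nan],
    }
)

# Invented set of site IDs already present in the database
in_db = {"A1", "A3"}

# Sites in the submission that are not yet in the database
new_ids = set(sites["anlegg_nr"]) - in_db

# Sites where either co-ordinate is missing
no_coords = sites[sites[["lat", "lon"]].isna().any(axis=1)]

# New sites that cannot be located (these would have to be ignored)
new_no_coords = no_coords[no_coords["anlegg_nr"].isin(new_ids)]
print(new_no_coords)
```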

In [None]:
# Add TYPE cols
stan_df["TYPE"] = "RENSEANLEGG"
milo_df["TYPE"] = "RENSEANLEGG"
ind_df["TYPE"] = "INDUSTRI"

# Get just stn info from each df
stan_loc = stan_df[
    ["ANLEGGSNR", "ANLEGGSNAVN", "Kommunenr", "TYPE", "Sone", "UTM_E", "UTM_N"]
].copy()

milo_loc = milo_df[
    ["ANLEGGSNR", "ANLEGGSNAVN", "KOMMUNE_NR", "TYPE", "SONEBELTE", "UTMOST", "UTMNORD"]
].copy()

ind_loc = ind_df[
    [
        "Anleggsnr",
        "Anleggsnavn",
        "Komm.nr",
        "TYPE",
        "Geografisk Longitude",
        "Geografisk Latitude",
    ]
].copy()


# Rename cols
stan_loc.columns = [
    "anlegg_nr",
    "anlegg_navn",
    "komm_no",
    "TYPE",
    "zone",
    "east",
    "north",
]
milo_loc.columns = [
    "anlegg_nr",
    "anlegg_navn",
    "komm_no",
    "TYPE",
    "zone",
    "east",
    "north",
]
ind_loc.columns = ["anlegg_nr", "anlegg_navn", "komm_no", "TYPE", "lon", "lat"]

# Drop duplicates
stan_loc.drop_duplicates(inplace=True)
milo_loc.drop_duplicates(inplace=True)
ind_loc.drop_duplicates(inplace=True)

# Convert UTM Zone col to Pandas' nullable integer data type
# (because proj. now complains about float UTM zones)
stan_loc["zone"] = stan_loc["zone"].astype(pd.Int64Dtype())
milo_loc["zone"] = milo_loc["zone"].astype(pd.Int64Dtype())

# Convert UTM to lat/lon
# "Industri" data is already in dd
stan_loc = nivapy.spatial.utm_to_wgs84_dd(stan_loc, "zone", "east", "north")
milo_loc = nivapy.spatial.utm_to_wgs84_dd(milo_loc, "zone", "east", "north")

# Remove UTM data
del stan_loc["zone"], stan_loc["east"], stan_loc["north"]
del milo_loc["zone"], milo_loc["east"], milo_loc["north"]

# Combine into single df
loc_df = pd.concat([stan_loc, milo_loc, ind_loc], axis=0, sort=True)

# The same site can be in multiple files, so drop duplicates
loc_df.drop_duplicates(inplace=True)

# Kommune nr. should be a 4 char string, not a float
fmt = lambda x: "%04d" % x
loc_df["komm_no"] = loc_df["komm_no"].apply(fmt)

# Check ANLEGG_NR is unique
assert not loc_df["anlegg_nr"].duplicated().any(), 'Some "ANLEGGSNRs" are duplicated.'

# Check if any sites are not already in db
sql = "SELECT UNIQUE(ANLEGG_NR) FROM resa2.rid_punktkilder"
annr_df = pd.read_sql_query(sql, engine)

not_in_db = set(loc_df["anlegg_nr"].values) - set(annr_df["anlegg_nr"].values)
not_in_db_df = loc_df[loc_df["anlegg_nr"].isin(list(not_in_db))][
    ["anlegg_nr", "anlegg_navn"]
].sort_values("anlegg_nr")
no_coords_df = loc_df.query("(lat!=lat) or (lon!=lon)")[
    ["anlegg_nr", "anlegg_navn"]
].sort_values("anlegg_nr")
not_in_db_no_coords_df = not_in_db_df[
    not_in_db_df["anlegg_nr"].isin(no_coords_df["anlegg_nr"])
].sort_values("anlegg_nr")

print(f"The following {len(not_in_db_df)} locations are not already in the database:")
print(not_in_db_df)

print(
    f"\nThe following {len(no_coords_df)} locations do not have co-ordinates "
    "in this year's data:"
)
print(no_coords_df)

print(
    f"\nThe following {len(not_in_db_no_coords_df)} locations are not in the "
    "database and do not have co-ordinates (and therefore must be ignored):"
)
print(not_in_db_no_coords_df)

### 1.2. Identify Regine Vassdragsnummer

The shapefile here:

    K:/Kart/Regine_2006/RegMinsteF.shp

shows locations for all the regine catchments used by TEOTIL (see e-mail from John Rune received 29/06/2017 at 17.26). I've copied this file locally here:

    ../../../Data/gis/shapefiles/RegMinsteF.shp

and re-projected it to WGS84 geographic co-ordinates. The new file is called `reg_minste_f_wgs84.shp`.

The code cell below identifies which regine polygon each point is located in.
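
For readers unfamiliar with point-in-polygon lookups, here is a small sketch of the idea using `shapely` directly. It is not the implementation of `nivapy.spatial.identify_point_in_polygon`; the two rectangular "catchments" and their IDs are invented.

```python
from shapely.geometry import Point, Polygon

# Two invented rectangular "catchments", with vertices in (lon, lat) order
catchments = {
    "001.A": Polygon([(10.0, 59.0), (11.0, 59.0), (11.0, 60.0), (10.0, 60.0)]),
    "002.B": Polygon([(11.0, 59.0), (12.0, 59.0), (12.0, 60.0), (11.0, 60.0)]),
}


def find_catchment(lon, lat):
    """Return the ID of the first polygon containing the point, else None."""
    pt = Point(lon, lat)
    for reg_id, poly in catchments.items():
        if poly.contains(pt):
            return reg_id
    return None


print(find_catchment(10.5, 59.5))
print(find_catchment(5.0, 59.5))
```

Points that fall outside every polygon return `None`, which corresponds to the sites that end up without a regine ID.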

In [None]:
# Path to Regine catchment shapefile
reg_shp_path = r"../../../Data/gis/shapefiles/reg_minste_f_wgs84.shp"

# Spatial join
loc_df = nivapy.spatial.identify_point_in_polygon(
    loc_df, reg_shp_path, "anlegg_nr", "VASSDRAGNR", "lat", "lon"
)

loc_df.head()

### 1.3. Restructuring site data

Rename columns to match RESA2.

In [None]:
# Rename other cols to match RESA2
loc_df["ANLEGG_NR"] = loc_df["anlegg_nr"]
loc_df["ANLEGG_NAVN"] = loc_df["anlegg_navn"]
loc_df["KNO"] = loc_df["komm_no"]
loc_df["REGINE"] = loc_df["VASSDRAGNR"]
loc_df["LON_UTL"] = loc_df["lon"]
loc_df["LAT_UTL"] = loc_df["lat"]

del loc_df["anlegg_nr"], loc_df["anlegg_navn"], loc_df["komm_no"]
del loc_df["VASSDRAGNR"], loc_df["lon"], loc_df["lat"]

# Get details for sites not already in db
loc_upld = loc_df[loc_df["ANLEGG_NR"].isin(list(not_in_db))].copy()

# Drop rows where 'regine' is NaN (usually because of missing co-ordinates).
# In the past, all rows have been added, leading to sites in the database
# without co-ordinates. These then do not get updated if co-ordinates are
# provided in later years. It is therefore better to only add sites with
# co-ordinates, as sites with missing data this year may be completed in
# subsequent years
loc_upld.dropna(subset=["REGINE"], inplace=True)

loc_upld

In [None]:
# Add to RESA2.RID_PUNKTKILDER
# loc_upld.to_sql(
#     "rid_punktkilder", con=engine, schema="resa2", if_exists="append", index=False
# )

### 1.4. Restructuring values

In [None]:
# Store Anlegg
# Get cols of interest
stan_vals = stan_df[["ANLEGGSNR", "MENGDE_P_UT_kg", "MENGDE_N_UT_kg"]]

# In RESA2.RID_PUNKTKILDER_INPAR_DEF, N is par_id 44 and P par_id 45
stan_vals.columns = ["ANLEGG_NR", 45, 44]

# Melt to "long" format
stan_vals = pd.melt(
    stan_vals,
    id_vars="ANLEGG_NR",
    value_vars=[45, 44],
    var_name="INP_PAR_ID",
    value_name="VALUE",
)

# Drop NaN values
stan_vals.dropna(how="any", inplace=True)

As far as I can tell from exploring the 2015 data in the database, the main columns of interest for Miljøgifter are given in `milo_dict`, below, together with the corresponding parameter IDs from `RESA2.RID_PUNKTKILDER_INPAR_DEF`. This hard-coding is a bit messy, but I can't see any database table providing a nice lookup between these values, so they're included here for now.

In [None]:
# Miljøgifter
# Get cols of interest
milo_dict = {
    "MILJOGIFTHG2": 16,
    "MILJOGIFTPAH2": 48,
    "MILJOGIFTPCB2": 30,
    "MILJOGIFTCD2": 8,
    "MILJOGIFTDEHP2": 119,
    "MILJOGIFTAS2": 2,
    "MILJOGIFTCR2": 10,
    "MILJOGIFTPB2": 28,
    "MILJOGIFTNI2": 25,
    "MILJOGIFTCU2": 15,
    "MILJOGIFTZN2": 38,
    "KONSMENGDTOTP10": 45,
    "KONSMENGDTOTN10": 44,
    "KONSMENGDSS10": 46,
    "ANLEGGSNR": "ANLEGG_NR",
}  # Make headings match RESA

milo_vals = milo_df[milo_dict.keys()]

# Get par IDs from dict
milo_vals.columns = [milo_dict[i] for i in milo_vals.columns]

# Melt to "long" format
milo_vals = pd.melt(
    milo_vals, id_vars="ANLEGG_NR", var_name="INP_PAR_ID", value_name="VALUE"
)

# Drop NaN values
milo_vals.dropna(how="any", inplace=True)

The industry data is already in "long" format.
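
Because the data are already long, the main task is looking up a parameter ID for each (name, unit) pair. A minimal sketch of that lookup on invented data is shown below; the parameter names, units and site IDs are hypothetical, and the real definitions come from `RESA2.RID_PUNKTKILDER_INPAR_DEF`.

```python
import pandas as pd

# Invented long-format submission
ind = pd.DataFrame(
    {
        "anlegg_nr": ["I1", "I1", "I2"],
        "name": ["TOTN", "XYZ", "TOTP"],
        "value": [5.0, 1.0, 2.0],
        "unit": ["TONN", "kg", "tonn"],
    }
)

# Invented parameter definitions
pars = pd.DataFrame(
    {"in_pid": [44, 45], "name": ["TOTN", "TOTP"], "unit": ["Tonn", "Tonn"]}
)

# Standardise capitalisation so that units compare equal
ind["unit"] = ind["unit"].str.capitalize()
pars["unit"] = pars["unit"].str.capitalize()

# Left join, then drop parameters with no matching definition
matched = ind.merge(pars, how="left", on=["name", "unit"]).dropna(subset=["in_pid"])
matched["in_pid"] = matched["in_pid"].astype(int)
print(matched)
```

The unmatched `XYZ` row is dropped, mirroring how parameters that are not of interest are discarded.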

In [None]:
# Industri
# Get cols of interest
ind_vals = ind_df[["Anleggsnr", "Komp.kode", "Mengde", "Enhet"]].copy()
ind_vals.columns = ["anlegg_nr", "name", "value", "unit"]

# Get par defs from db
sql = "SELECT * FROM resa2.rid_punktkilder_inpar_def"
par_df = pd.read_sql_query(sql, engine)
del par_df["descr"]

# Standardise unit capitalisation (first letter upper case, the rest lower)
# so that units match between the two tables
ind_vals["unit"] = ind_vals["unit"].str.capitalize()
par_df["unit"] = par_df["unit"].str.capitalize()

# Join
ind_vals = pd.merge(ind_vals, par_df, how="left", on=["name", "unit"])

# Some parameters that are not of interest are not matched
# Drop these
ind_vals.dropna(how="any", inplace=True)

# Get just cols of interest
ind_vals = ind_vals[["anlegg_nr", "in_pid", "value"]]

# Rename for db
ind_vals.columns = ["ANLEGG_NR", "INP_PAR_ID", "VALUE"]

# Convert col types
ind_vals["INP_PAR_ID"] = ind_vals["INP_PAR_ID"].astype(int)

In [None]:
# Combine
val_df = pd.concat([stan_vals, milo_vals, ind_vals], axis=0, sort=True)

# Add column for year
val_df["YEAR"] = year

# Explicitly set data types
val_df["ANLEGG_NR"] = val_df["ANLEGG_NR"].astype(str)
val_df["INP_PAR_ID"] = val_df["INP_PAR_ID"].astype(int)
val_df["VALUE"] = val_df["VALUE"].astype(float)
val_df["YEAR"] = val_df["YEAR"].astype(int)

# Store Anlegg and Miljøgifter contain some duplicated information
val_df.drop_duplicates(inplace=True)

# Average any remaining duplicates (because sometimes the same value is reported with different precision)
val_df = val_df.groupby(["ANLEGG_NR", "INP_PAR_ID", "YEAR"]).mean().reset_index()

val_df.head()

In [None]:
# # Drop any existing values for this year
# sql = f"DELETE FROM resa2.rid_punktkilder_inpar_values WHERE year = {year}"
# res = engine.execute(sql)

# # Add to RESA2.RID_PUNKTKILDER_INPAR_VALUES
# val_df.to_sql(
#     "rid_punktkilder_inpar_values",
#     con=engine,
#     schema="resa2",
#     if_exists="append",
#     index=False,
# )

## 2. Små anlegg (small treatment works)

Copy and rename the file, and rename the worksheet to `sma_anlegg_{year}`. Delete rows above the header and delete unnecessary columns. The only columns required are `KOMMUNENR`, `SUM FOSFOR` and `SUM NITROGEN`, which should be renamed `KOMMUNENR`, `P_kg` and `N_kg`, respectively.

This data is added directly to `RESA2.RID_KILDER_SPREDT_VALUES`. 

**Note:** The kommune ID numbers in the små anlegg file should be present in 

    ../../../teotil2/data/core_input_data/regine_{year}.csv
    
Kommune IDs change from year to year, so they will usually need updating in TEOTIL - see [update_regine_kommune.ipynb](https://nbviewer.org/github/JamesSample/rid/blob/master/notebooks/update_regine_kommune.ipynb) for details.
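
Excel tends to read kommune numbers as floats, which drops leading zeros and breaks the comparison with the TEOTIL file. The sketch below shows the zero-padding and the resulting check on invented values (the `known` set is hypothetical).

```python
import pandas as pd

# Kommune numbers as they might arrive from Excel (invented values)
raw = pd.Series([301.0, 1103.0, 5001.0])

# Convert to 4-character, zero-padded strings
komm = raw.apply(lambda x: f"{int(x):04d}")

# Invented set of kommune IDs present in the TEOTIL "regine" file
known = {"0301", "1103"}

# IDs in the submission that TEOTIL does not know about
missing = sorted(set(komm) - known)
print(list(komm), missing)
```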

In [None]:
# Read raw (tidied) data
in_xlsx = f"../../../Data/point_data_{year}/avlop_sma_anlegg_{year}_raw.xlsx"
sman_df = pd.read_excel(in_xlsx, sheet_name=f"sma_anlegg_{year}")

# Drop blank rows
sman_df.dropna(how="all", inplace=True)

# Kommune nr. should be a 4 char string, not a float
fmt = lambda x: "%04d" % x
sman_df["KOMMUNENR"] = sman_df["KOMMUNENR"].apply(fmt)

# Check if any kommuner are not already in TEOTIL
reg_csv = f"../../../teotil2/data/core_input_data/regine_{year}.csv"
kmnr_df = pd.read_csv(reg_csv, sep=";", encoding="utf-8")
kmnr_df["komnr"] = kmnr_df["komnr"].apply(fmt)

not_in_db = set(sman_df["KOMMUNENR"].values) - set(kmnr_df["komnr"].values)
if len(not_in_db) > 0:
    print(
        f'\nThe following {len(not_in_db)} kommuner are not in the TEOTIL "regine" file. Consider updating?:'
    )
    print(sman_df[sman_df["KOMMUNENR"].isin(list(not_in_db))])

# Get cols of interest for RID_KILDER_SPREDT_VALUES
sman_df = sman_df[["KOMMUNENR", "P_kg", "N_kg"]]

# In RESA2.RID_PUNKTKILDER_INPAR_DEF, N is par_id 44 and P par_id 45
sman_df.columns = ["KOMM_NO", 45, 44]

# Melt to "long" format
sman_df = pd.melt(
    sman_df,
    id_vars="KOMM_NO",
    value_vars=[45, 44],
    var_name="INP_PAR_ID",
    value_name="VALUE",
)

# Add column for year
sman_df["AR"] = year

sman_df.head()

In [None]:
# Add to RESA2.RID_KILDER_SPREDT_VALUES
# sman_df.to_sql(
#     "rid_kilder_spredt_values",
#     con=engine,
#     schema="resa2",
#     if_exists="append",
#     index=False,
# )

## 3. Fish farms

The aquaculture data is usually encrypted and must be stored securely. Copy and rename the file, and change the worksheet name to `fiskeoppdrett_{year}`. Check that data have only been provided for **one year** and that the column names match submissions from previous years.

These data must be added to two tables in RESA2:

 * First, the site data must be added to `RESA2.RID_KILDER_AQUAKULTUR`. Most of the sites should already be there, but occasionally new sites are added. Any new stations must be assigned lat/lon co-ordinates and the appropriate "regine" catchment ID. This usually requires geocoding plus co-ordinate conversions and/or a spatial join to determine catchment IDs.
 
    **Note:** The key ID fields in the raw data appear to be `LOKNR` and `LOKNAVN`. <br><br>
 
 * Secondly, the chemistry data for each site must be extracted and converted to "long" format, then added to `RESA2.RID_KILDER_AQKULT_VALUES`. Parameter IDs etc. are taken from `RESA2.RID_PUNKTKILDER_INPAR_DEF`.

### 3.1. Basic data checking

In [None]:
# Read raw (tidied) data
# Fish farms
in_xlsx = f"../../../Data/point_data_{year}/fiske_oppdret_{year}_raw.xlsx"
fish_df = pd.read_excel(in_xlsx, sheet_name=f"fiskeoppdrett_{year}")

# Drop no data
fish_df.dropna(how="all", inplace=True)

In [None]:
# Check if any sites are not already in db
sql = "SELECT UNIQUE(NR) FROM resa2.rid_kilder_aquakultur"
aqua_df = pd.read_sql_query(sql, engine)

not_in_db = set(fish_df["LOKNR"].values) - set(aqua_df["nr"].values)
nidb_df = fish_df[fish_df["LOKNR"].isin(list(not_in_db))][
    ["LOKNR", "LOKNAVN", "N_DESIMALGRADER_Y", "O_DESIMALGRADER_X"]
].drop_duplicates(subset=["LOKNR"])
if len(not_in_db) > 0:
    print(f"\nThe following {len(not_in_db)} locations are not in the database:")
    print(nidb_df)

# Check for missing co-ords
no_coords_df = fish_df.query(
    "(N_DESIMALGRADER_Y!=N_DESIMALGRADER_Y) or "
    "(O_DESIMALGRADER_X!=O_DESIMALGRADER_X)"
)[["LOKNR", "LOKNAVN"]].sort_values("LOKNR")
if len(no_coords_df) > 0:
    print(
        f"\nThe following {len(no_coords_df)} locations do not have co-ordinates "
        "in this year's data:"
    )
    print(no_coords_df)

### 3.2. Geocode fish farms and add to database

In [None]:
# Path to Regine catchment shapefile
reg_shp_path = r"../../../Data/gis/shapefiles/reg_minste_f_wgs84.shp"

# Spatial join
if len(nidb_df) > 0:
    loc_df = nivapy.spatial.identify_point_in_polygon(
        nidb_df,
        reg_shp_path,
        "LOKNR",
        "VASSDRAGNR",
        "N_DESIMALGRADER_Y",
        "O_DESIMALGRADER_X",
    )

    # Rename cols
    loc_df.columns = ["NR", "NAVN", "LENGDE", "BREDDE", "REGINE"]

    # Drop rows where 'REGINE' is NaN
    no_reg = pd.isna(loc_df["REGINE"])
    if no_reg.sum() > 0:
        no_reg_df = loc_df[no_reg]
        print(
            f"The following {len(no_reg_df)} locations cannot be linked to a regine. "
            "These sites will be ignored."
        )
        print(no_reg_df)

        loc_df.dropna(subset=["REGINE"], inplace=True)

    print(f"The following {len(loc_df)} locations will be added to the database.")
    print(loc_df)

In [None]:
# Add to RESA2.RID_KILDER_AQUAKULTUR
# loc_df.to_sql(
#     "rid_kilder_aquakultur", con=engine, schema="resa2", if_exists="append", index=False
# )

### 3.3. Estimate nutrient and copper inputs

The methodology here is a little unclear. The following is my best guess, based on the files located here:

    K:\Avdeling\Vass\316_Miljøinformatikk\Prosjekter\RID\2016\Rådata\Fiskeoppdrett

Old workflow:

 1. Calculate the fish biomass from the raw data. See the equation in the `Biomasse` column of the spreadsheet *JSE_TEOTIL_2015.xlsx* <br><br>
 
 2. Split the data according to salmon ("laks"; species ID 71101) and trout ("ørret"; species ID 71401), then group by location and month, summing biomass and `FORFORBRUK_KILO` columns (see Fiskeoppdrett_biomasse_2016.accdb) <br><br>
 
 3. Calculate production. This involves combining biomass for the current month with that for the previous month. See the calculations in e.g. *N_P_ørret_2015.xlsx*. <br><br>
 
 4. Calculate NTAP and PTAP. **NB:** I don't know what these quantities are, so I'm just duplicating the Excel calculations in the code below. The functions are therefore not very well explained <br><br>
 
 5. Estimate copper usage at each fish farm by scaling the total annual Cu usage in proportion to P production
 
The annual copper figure is provided by Miljødirektoratet and it is assumed that 85% of the total is lost to water. Values for each year are stored in 

    ../../../Data/annual_copper_usage_aquaculture.xlsx
    
**This file should be updated with the latest figure before running the code below**.
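
To make the copper allocation in step 5 concrete, the sketch below applies the 85% loss factor and shares the result between farms in proportion to their P production. The annual total, the farm IDs and the `p_tonnes` column are invented, and this is only my reading of the method, not necessarily what `rid.estimate_fish_farm_nutrient_inputs` does internally.

```python
import pandas as pd

# Invented total annual copper use in aquaculture (tonnes)
tot_an_cu = 1000.0

# Assume 85% of the total is lost to water
an_cu = 0.85 * tot_an_cu

# Invented P production per fish farm (tonnes)
farms = pd.DataFrame({"LOKNR": [1, 2, 3], "p_tonnes": [10.0, 30.0, 60.0]})

# Share the copper lost to water in proportion to each farm's P production
farms["cu_tonnes"] = an_cu * farms["p_tonnes"] / farms["p_tonnes"].sum()
print(farms)
```

By construction, the per-farm values sum back to the national figure lost to water.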

In [None]:
# Get annual copper usage
cu_xlsx = r"../../../Data/annual_copper_usage_aquaculture.xlsx"
cu_df = pd.read_excel(cu_xlsx, sheet_name="Sheet1", index_col=0)
tot_an_cu = cu_df.loc[year, "tot_cu_tonnes"]
an_cu = 0.85 * tot_an_cu
print(f"The total annual copper lost to water from aquaculture is {an_cu:.1f} tonnes.")

In [None]:
# Estimate nutrient inputs from fish farms
fish_nut = rid.estimate_fish_farm_nutrient_inputs(fish_df, year, an_cu)
fish_nut.head()

In [None]:
# Add to RESA2.RID_KILDER_AQKULT_VALUES
# fish_nut.to_sql(
#     "rid_kilder_aqkult_values",
#     con=engine,
#     schema="resa2",
#     if_exists="append",
#     index=False,
# )

## 4. Land use

The land use dataset is provided by NIBIO. The file usually gives errors when it is opened, but it can be saved again as a `.xlsx` without problems. 

Open the file and `Save as`, then rename the worksheet to `jordbruk_{year}`. Tidy the header to match previous submissions and correct any issues with Norwegian characters in the `omrade` column. 

The entry for Oslo (`osl1`; fylke_sone = 3_1) is usually missing from the data provided by NIBIO. This row should be added manually to the Excel file and the values should be set identical to those for område `ake2`. This works because the land areas in TEOTIL's `fysone_land_areas.csv` have been made identical for `osl1` and `ake2` (even though this is not correct), so the inputs in terms of kg/km2 are calculated as being the same for both regions, which is what is required.
 
These data are added to the table `RESA2.RID_AGRI_INPUTS`.
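
If editing the Excel file by hand is inconvenient, the missing Oslo row could in principle be added in pandas instead. This is only a sketch of that alternative on invented data (the `xyz1` area and all values are hypothetical); the workflow above still assumes the row is added manually.

```python
import pandas as pd

# Invented land use table without an entry for Oslo
lu = pd.DataFrame(
    {
        "omrade": ["xyz1", "ake2"],
        "n_diff_kg": [100.0, 200.0],
        "p_diff_kg": [10.0, 20.0],
    }
)

# Copy the values for 'ake2' to a new 'osl1' row, if one is not already there
if "osl1" not in lu["omrade"].values:
    oslo = lu[lu["omrade"] == "ake2"].copy()
    oslo["omrade"] = "osl1"
    lu = pd.concat([lu, oslo], ignore_index=True)

print(lu)
```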

In [None]:
# Path to (tidied) NIBIO data
lu_xlsx = f"../../../Data/point_data_{year}/jordbruk_{year}.xlsx"
lu_df = pd.read_excel(lu_xlsx, sheet_name=f"jordbruk_{year}")

lu_df["year"] = year

# Order cols
lu_df = lu_df[
    [
        "omrade",
        "year",
        "n_diff_kg",
        "n_point_kg",
        "n_back_kg",
        "p_diff_kg",
        "p_point_kg",
        "p_back_kg",
    ]
]

lu_df.head()

In [None]:
# Write to RESA
# lu_df.to_sql(
#     name="rid_agri_inputs",
#     con=engine,
#     schema="resa2",
#     index=False,
#     if_exists="append",
#     dtype={"omrade": types.VARCHAR(lu_df["omrade"].str.len().max())},
# )