In [1]:
%matplotlib inline
import imp
import pandas as pd
import geopandas as gpd
import geopandas.tools
import pyproj
import numpy as np
from shapely.geometry import Point
from sqlalchemy import types
import seaborn as sn
import matplotlib.pyplot as plt
sn.set_context('notebook')

# Process model input datasets

Modelling for the RID programme makes use of the following input datasets:

 * **Avløp** (sewage and other drainage), sub-divided into
     * Large treatment works
     * Small treatment works
     * Other environmental pollutants <br><br>
     
 * **Fiskeoppdrett** (fish farming) <br><br>
 
 * **Industri** (industrial point sources) <br><br>
 
 * **Jordbruk** (land use and management activities)
 
The raw datasets come from a variety of sources and must be restructured into a standardised format and added to the RESA2 database. Once in the database, they can either be used to generate input files for TEOTIL (using either Tore's code or the workflow documented [here](http://nbviewer.jupyter.org/github/JamesSample/rid/blob/master/notebooks/prepare_teotil_inputs.ipynb)), or they can be used to run the new [NOPE model](http://nbviewer.jupyter.org/github/JamesSample/rid/blob/master/notebooks/nope_model.ipynb). Generating input files for NOPE from the data in RESA2 is straightforward: call `nope.make_rid_input_file()` for the year of interest.

This notebook takes the raw data, restructures it, and adds it to RESA2.

In [2]:
# Connect to db
resa2_basic_path = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\Upload_Template'
                    r'\useful_resa2_code.py')

resa2_basic = imp.load_source('useful_resa2_code', resa2_basic_path)

engine, conn = resa2_basic.connect_to_resa2()

# Import custom RID functions
rid_func_path = (r'C:\Data\James_Work\Staff\Oeyvind_K\Elveovervakingsprogrammet'
                 r'\Python\rid\notebooks\useful_rid_code.py')

rid = imp.load_source('useful_rid_code', rid_func_path)

## 1. Store anlegg, Miljøgifter and Industri

These three datasets are all treated similarly, and there is some duplication between the files. Examples of the raw data formats are here:

 * K:\Prosjekter\Ferskvann\O-13255-TEOTIL\2016\Rådata\Avløp\TEOTIL store anlegg 2015 (sendt 18.08.2016).xlsx

 * K:\Prosjekter\Ferskvann\O-13255-TEOTIL\2016\Rådata\Avløp\Miljogifter_NIVA_RID-prosjektet_2015.xlsx

 * K:\Prosjekter\Ferskvann\O-13255-TEOTIL\2016\Rådata\Industri\Teotiluttrekket til NIVA - 2016_v2.xlsx

I have made local copies of the 2016 data files here:

 * C:\Data\James_Work\Staff\Oeyvind_K\Elveovervakingsprogrammet\Data\point_data_2016

and tidied all the column headings. See also the information in the e-mail from John Rune received 29/06/2017 at 15.53. 

**Note:** The raw data files for industry often contain several years of data. For the file in the folder above, I've filtered the values to only include the year of interest.

The data in these files must be added to two tables in RESA2:

 * First, the site data must be added to `RESA2.RID_PUNKTKILDER`. Most of the sites should already be there, but occasionally new sites are added. Any new stations must be assigned lat/lon co-ordinates and the appropriate "Regine" catchment ID. This usually requires geocoding plus co-ordinate conversions and/or a spatial join to determine catchment IDs.
 
    **Note:** Many (>70) of the stations already in the database are missing Regine IDs. Many more (>3000) are missing co-ordinate information. We have previously asked Miljødirektoratet about this, but they have not yet provided the missing data. <br><br>
 
 * Secondly, the chemistry data for each site must be extracted and converted to "long" format, then added to `RESA2.RID_PUNKTKILDER_INPAR_VALUES`. Parameter IDs etc. are taken from `RESA2.RID_PUNKTKILDER_INPAR_DEF`.

In [3]:
# Read raw (tidied) data

# Store anlegg
in_xlsx = (r'C:\Data\James_Work\Staff\Oeyvind_K\Elveovervakingsprogrammet'
           r'\Data\point_data_2016\avlop_stor_anlegg_2016_raw.xlsx')
stan_df = pd.read_excel(in_xlsx, sheetname='store_anlegg_2016')

# Miljøgifter
in_xlsx = (r'C:\Data\James_Work\Staff\Oeyvind_K\Elveovervakingsprogrammet'
           r'\Data\point_data_2016\avlop_miljogifter_2016_raw.xlsx')
milo_df = pd.read_excel(in_xlsx, sheetname='miljogifter_2016')

# Industri
in_xlsx = (r'C:\Data\James_Work\Staff\Oeyvind_K\Elveovervakingsprogrammet'
           r'\Data\point_data_2016\industri_2016_raw.xlsx')
ind_df = pd.read_excel(in_xlsx, sheetname='industry_2016')

# Drop blank rows
stan_df.dropna(how='all', inplace=True)
milo_df.dropna(how='all', inplace=True)
ind_df.dropna(how='all', inplace=True)

### 1.1. Basic data checking

All of the "Store Anlegg" and "Miljøgifter" sites are classified as `RENSEANLEGG` in the `TYPE` column of `RESA2.RID_PUNKTKILDER`; "Industri" sites are labelled `INDUSTRI`.

Add `TYPE` columns, merge site data from different sources, convert UTM co-ordinates to WGS84 decimal degrees and identify sites not already in the database. Issues identified below (e.g. missing co-ordinates) should be corrected if possible before continuing.
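
The checks described above can be sketched on toy data as follows. This is only an illustration, not the notebook's actual workflow: the site IDs, kommune numbers and co-ordinates are invented, and the `ids_in_db` set is a hypothetical stand-in for the result of the SQL query against `RESA2.RID_PUNKTKILDER`.

```python
import numpy as np
import pandas as pd

# Toy site tables standing in for two of the raw files (all values invented)
stan_loc = pd.DataFrame({'anlegg_nr': ['A1', 'A2'],
                         'komm_no': [101.0, 1505.0],
                         'lat': [59.0, np.nan],
                         'lon': [11.4, np.nan]})
milo_loc = pd.DataFrame({'anlegg_nr': ['A2', 'A3'],
                         'komm_no': [1505.0, 226.0],
                         'lat': [np.nan, 60.1],
                         'lon': [np.nan, 10.2]})

# Combine, then drop sites that appear in more than one file
loc_df = pd.concat([stan_loc, milo_loc], axis=0, ignore_index=True)
loc_df = loc_df.drop_duplicates().reset_index(drop=True)

# Kommune numbers are read as floats; store them as 4-character strings
loc_df['komm_no'] = loc_df['komm_no'].apply(lambda x: '%04d' % x)

# Any site ID that still occurs more than once needs manual attention
dup_ids = loc_df.loc[loc_df['anlegg_nr'].duplicated(keep=False), 'anlegg_nr']

# Hypothetical set of site IDs already in the database
ids_in_db = {'A1', 'A2'}
not_in_db = sorted(set(loc_df['anlegg_nr']) - ids_in_db)

# Sites where either co-ordinate is missing
no_coords = loc_df[loc_df[['lat', 'lon']].isnull().any(axis=1)]
```

Checking for duplicates on the ID column itself, rather than on the dataframe index, matters here because the index is not meaningful after concatenating several files.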

In [4]:
# Add TYPE cols
stan_df['TYPE'] = 'RENSEANLEGG'
milo_df['TYPE'] = 'RENSEANLEGG'
ind_df['TYPE'] = 'INDUSTRI'

# Get just stn info from each df
stan_loc = stan_df[['ANLEGGSNR', 'ANLEGGSNAVN', 'Kommunenr', 
                    'TYPE', 'Sone', 'UTM_E', 'UTM_N']]

milo_loc = milo_df[['ANLEGGSNR', 'ANLEGGSNAVN', 'KOMMUNE_NR', 
                    'TYPE', 'SONEBELTE', 'UTMOST', 'UTMNORD']]

ind_loc = ind_df[['Anleggsnr', 'Anleggsnavn', 'Komm.nr', 'TYPE', 
                  'Geografisk Longitude', 'Geografisk Latitude']]


# Rename cols
stan_loc.columns = ['anlegg_nr', 'anlegg_navn', 'komm_no',
                    'TYPE', 'zone', 'east', 'north']
milo_loc.columns = ['anlegg_nr', 'anlegg_navn', 'komm_no',
                    'TYPE', 'zone', 'east', 'north']
ind_loc.columns = ['anlegg_nr', 'anlegg_navn', 'komm_no',
                   'TYPE', 'lon', 'lat']

# Drop duplicates
stan_loc.drop_duplicates(inplace=True)
milo_loc.drop_duplicates(inplace=True)
ind_loc.drop_duplicates(inplace=True)

# Convert UTM to lat/lon
# "Industri" data is already in dd
stan_loc = rid.utm_to_wgs84_dd(stan_loc, 'zone', 'east', 'north')
milo_loc = rid.utm_to_wgs84_dd(milo_loc, 'zone', 'east', 'north')

# Remove UTM data 
del stan_loc['zone'], stan_loc['east'], stan_loc['north']
del milo_loc['zone'], milo_loc['east'], milo_loc['north']

# combine into single df
loc_df = pd.concat([stan_loc, milo_loc, ind_loc], axis=0)

# The same site can be in multiple files, so drop duplicates
loc_df.drop_duplicates(inplace=True)

# Kommune nr. should be a 4 char string, not a float
fmt = lambda x: '%04d' % x
loc_df['komm_no'] = loc_df['komm_no'].apply(fmt)

# Check ANLEGG_NR is unique (test the ID column itself, not the index)
assert not loc_df['anlegg_nr'].duplicated().any(), 'Some "ANLEGGSNRs" are duplicated.'

# Check if any sites are not already in db
sql = ('SELECT UNIQUE(ANLEGG_NR) '
       'FROM resa2.rid_punktkilder')
annr_df = pd.read_sql_query(sql, engine)

not_in_db = set(loc_df['anlegg_nr'].values) - set(annr_df['anlegg_nr'].values)

print '\nThe following locations are not in the database:'
print loc_df[loc_df['anlegg_nr'].isin(list(not_in_db))][['anlegg_nr', 'anlegg_navn']]

# Check if any sites are missing co-ords
print '\nThe following locations do not have co-ordinates:'
print loc_df.query('(lat!=lat) or (lon!=lon)')[['anlegg_nr', 'anlegg_navn']]


The following locations are not in the database:
Empty DataFrame
Columns: [anlegg_nr, anlegg_navn]
Index: []

The following locations do not have co-ordinates:
     anlegg_nr                          anlegg_navn
55    0226AL71          MIRA renseanlegg (v/Tangen)
136   0430AL03                    Koppang rensepark
342   0615AL11        Damtjernhallin 2 avløpsanlegg
343   0615AL12  Solheimseter, Sørbølfjell - trinn 1
349   0617AL19       Einarset Stølslag Felt H1 + H5
353   0617AL90                  Brekko Camping r.a.
365   0619AL69                        Øyni menighet
377   0621AL37                     Nedre Eggedal RA
399   0626AL64                          Tronstad RA
410   0631AL32            Borge–Blestua hytteområde
413   0633AL21           EKT FJELLGÅRD OG LEIRSKOLE
414   0633AL24                IMINGFJELL TURISTHEIM
621   1101AL28                          Trosavig RA
666   1114AL18                      Stavtjørnknuten
768   1141AL43                   Østabøvågen Talgje
827   1

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  return func(*args, **kwargs)


### 1.2. Identify Regine Vassdragsnummer

The shapefile here:

K:\Kart\Regine_2006\RegMinsteF.shp

shows locations for all the Regine catchments used by TEOTIL (see e-mail from John Rune received 29/06/2017 at 17.26). I've copied this file locally here:

C:\Data\James_Work\Staff\Oeyvind_K\Elveovervakingsprogrammet\Data\gis\shapefiles\RegMinsteF.shp

and re-projected it to WGS84 geographic co-ordinates. The new file is called *reg_minste_f_wgs84.shp*.

I have also written a function to perform a spatial join and identify which Regine polygon each point is located in.

**Note:** Geopandas is quite fussy about its input data and can also be awkward to install. The code below works, but the GDAL/OGR version [here](https://stackoverflow.com/a/13433127/505698) might be more robust.
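
To illustrate what a point-in-polygon lookup involves, here is a minimal pure-Python sketch using the even-odd (ray casting) rule. It is not the implementation behind `rid.identify_point_in_polygon`, which uses Geopandas; the function names and the two rectangular "catchments" are invented for illustration, and the sketch ignores holes, multi-part polygons and points lying exactly on a boundary.

```python
def point_in_polygon(lon, lat, ring):
    """Return True if (lon, lat) lies inside the polygon ring.

    ring is a list of (lon, lat) vertices and need not be closed.
    Uses the even-odd (ray casting) rule; points exactly on an edge
    may be classified either way.
    """
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line
            x_cross = x1 + (lat - y1) * (x2 - x1) / float(y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def assign_catchment(lon, lat, catchments):
    """Return the ID of the first catchment containing the point, else None."""
    for catch_id, ring in catchments.items():
        if point_in_polygon(lon, lat, ring):
            return catch_id
    return None


# Two invented rectangular "catchments" (not real Regine polygons)
catchments = {
    '001.A': [(10.0, 59.0), (11.0, 59.0), (11.0, 60.0), (10.0, 60.0)],
    '002.B': [(11.0, 59.0), (12.0, 59.0), (12.0, 60.0), (11.0, 60.0)],
}

inside_a = assign_catchment(10.5, 59.5, catchments)
inside_b = assign_catchment(11.5, 59.5, catchments)
outside = assign_catchment(13.0, 59.5, catchments)
```

A point that falls outside every polygon returns `None`, which corresponds to a site that would be left without a Regine ID and needs checking by hand.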

In [5]:
# Path to Regine catchment shapefile
reg_shp_path = (r'C:\Data\James_Work\Staff\Oeyvind_K\Elveovervakingsprogrammet'
                r'\Data\gis\shapefiles\reg_minste_f_wgs84.shp')

# Spatial join
loc_df = rid.identify_point_in_polygon(loc_df, reg_shp_path, 
                                       'anlegg_nr', 'VASSDRAGNR',
                                       'lat', 'lon')

loc_df.head()



Unnamed: 0,TYPE,anlegg_navn,anlegg_nr,komm_no,lat,lon,VASSDRAGNR
0,RENSEANLEGG,Bakke,0101AL02,101,59.019598,11.443762,001.2220
1,RENSEANLEGG,Kornsjø,0101AL06,101,58.935184,11.668959,001.1J
2,RENSEANLEGG,Remmendalen avløpsanlegg,0101AL07,101,59.120864,11.360106,001.31Z
3,RENSEANLEGG,Kambo,0104AL01,104,59.474488,10.686496,003.20
4,RENSEANLEGG,Alvim Renseanlegg,0105AL00,105,59.273056,11.075773,002.A4


### 1.3. Restructuring site data

For the sites dataframe, rename the columns to match RESA2.

In [6]:
# Rename other cols to match RESA2
loc_df['ANLEGG_NR'] = loc_df['anlegg_nr']
loc_df['ANLEGG_NAVN'] = loc_df['anlegg_navn']
loc_df['KNO'] = loc_df['komm_no']
loc_df['REGINE'] = loc_df['VASSDRAGNR']
loc_df['LON_UTL'] = loc_df['lon']
loc_df['LAT_UTL'] = loc_df['lat']

del loc_df['anlegg_nr'], loc_df['anlegg_navn'], loc_df['komm_no']
del loc_df['VASSDRAGNR'], loc_df['lon'], loc_df['lat']

loc_df.head()

Unnamed: 0,TYPE,ANLEGG_NR,ANLEGG_NAVN,KNO,REGINE,LON_UTL,LAT_UTL
0,RENSEANLEGG,0101AL02,Bakke,101,001.2220,11.443762,59.019598
1,RENSEANLEGG,0101AL06,Kornsjø,101,001.1J,11.668959,58.935184
2,RENSEANLEGG,0101AL07,Remmendalen avløpsanlegg,101,001.31Z,11.360106,59.120864
3,RENSEANLEGG,0104AL01,Kambo,104,003.20,10.686496,59.474488
4,RENSEANLEGG,0105AL00,Alvim Renseanlegg,105,002.A4,11.075773,59.273056


In [7]:
# Get details for sites not already in db
loc_upld = loc_df[loc_df['ANLEGG_NR'].isin(list(not_in_db))]

loc_upld

Unnamed: 0,TYPE,ANLEGG_NR,ANLEGG_NAVN,KNO,REGINE,LON_UTL,LAT_UTL


In [8]:
# Add to RESA2.RID_PUNKTKILDER
#loc_upld.to_sql('rid_punktkilder', con=engine, schema='resa2', 
#                if_exists='append', index=False)

### 1.4. Restructuring values

In [9]:
# Set the year for the data in question
year = 2016

In [10]:
# Store Anlegg
# Get cols of interest 
stan_vals = stan_df[['ANLEGGSNR', 'MENGDE_P_UT_kg', 'MENGDE_N_UT_kg']]

# In RESA2.RID_PUNKTKILDER_INPAR_DEF, N is par_id 44 and P par_id 45
stan_vals.columns = ['ANLEGG_NR', 45, 44]

# Melt to "long" format
stan_vals = pd.melt(stan_vals, id_vars='ANLEGG_NR', value_vars=[45, 44],
                    var_name='INP_PAR_ID', value_name='VALUE')

# Drop NaN values
stan_vals.dropna(how='any', inplace=True)

As far as I can tell from exploring the 2015 data in the database, the main columns of interest for Miljøgifter are given in `milo_dict`, below, together with the corresponding parameter IDs from `RESA2.RID_PUNKTKILDER_INPAR_DEF`. This hard-coding is a bit messy, but I can't see any database table providing a nice lookup between these values, so they're included here for now.
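
Because the parameter IDs are hard-coded, it may be worth checking them against the definitions table before uploading. The sketch below shows one way to do this on toy data; the three-entry mapping and the contents of the stand-in parameter table (including the names and units) are invented, and only the `in_pid` column name is taken from the code later in this notebook.

```python
import pandas as pd

# Hypothetical subset of a column-to-parameter-ID mapping
col_to_par = {'MILJOGIFTHG2': 16,
              'MILJOGIFTCD2': 8,
              'KONSMENGDTOTP10': 45}

# Invented stand-in for RESA2.RID_PUNKTKILDER_INPAR_DEF
par_df = pd.DataFrame({'in_pid': [8, 16, 44],
                       'name': ['par_a', 'par_b', 'par_c'],
                       'unit': ['Kg', 'Kg', 'Tonn']})

# IDs defined in the (toy) parameter table
known_ids = set(par_df['in_pid'].tolist())

# Columns whose hard-coded ID has no matching definition
unknown = {col: pid for col, pid in col_to_par.items()
           if pid not in known_ids}
```

An empty `unknown` dictionary means every hard-coded ID has a definition; anything else points to a typo in the mapping or a parameter missing from the table.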

In [11]:
# Miljøgifter
# Get cols of interest 
milo_dict = {'MILJOGIFTHG2':16, 
             'MILJOGIFTPAH2':48, 
             'MILJOGIFTPCB2':30, 
             'MILJOGIFTCD2':8, 
             'MILJOGIFTDEHP2':119, 
             'MILJOGIFTAS2':2,
             'MILJOGIFTCR2':10, 
             'MILJOGIFTPB2':28, 
             'MILJOGIFTNI2':25,
             'MILJOGIFTCU2':15, 
             'MILJOGIFTZN2':38, 
             'KONSMENGDTOTP10':45,
             'KONSMENGDTOTN10':44, 
             'KONSMENGDSS10':46,
             'ANLEGGSNR':'ANLEGG_NR'} # Make heading match RESA

milo_vals = milo_df[milo_dict.keys()]

# Get par IDs from dict
milo_vals.columns = [milo_dict[i] for i in milo_vals.columns]

# Melt to "long" format
milo_vals = pd.melt(milo_vals, id_vars='ANLEGG_NR',
                    var_name='INP_PAR_ID', value_name='VALUE')

# Drop NaN values
milo_vals.dropna(how='any', inplace=True)

The industry data is already in "long" format.
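
Matching the industry values to parameter IDs relies on a join on parameter name and unit. The sketch below, on invented data, uses the `indicator` option of `pd.merge` to list the (name, unit) pairs that fail to match before they are dropped, which makes it easier to confirm that only parameters of no interest are being discarded. The parameter names, units and values here are made up.

```python
import pandas as pd

# Invented industry values in "long" format
ind_vals = pd.DataFrame({'anlegg_nr': ['X1', 'X1', 'X2'],
                         'name': ['par_n', 'par_other', 'par_p'],
                         'value': [1.5, 2.0, 0.3],
                         'unit': ['tonn', 'kg', 'TONN']})

# Invented stand-in for the parameter definitions table
par_df = pd.DataFrame({'in_pid': [44, 45],
                       'name': ['par_n', 'par_p'],
                       'unit': ['Tonn', 'Tonn']})

# Normalise case so that e.g. 'tonn' and 'TONN' both match 'Tonn'
ind_vals['unit'] = ind_vals['unit'].str.capitalize()
par_df['unit'] = par_df['unit'].str.capitalize()

# Left join, recording where each row came from in the '_merge' column
merged = pd.merge(ind_vals, par_df, how='left',
                  on=['name', 'unit'], indicator=True)

# Parameter/unit pairs with no definition
unmatched = (merged.loc[merged['_merge'] == 'left_only', ['name', 'unit']]
             .drop_duplicates())

# Rows that matched; the ID column can now safely be cast to int
matched = merged[merged['_merge'] == 'both'].copy()
matched['in_pid'] = matched['in_pid'].astype(int)
```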

In [12]:
# Industri
# Get cols of interest
ind_vals = ind_df[['Anleggsnr', 'Komp.kode', 'Mengde', 'Enhet']]
ind_vals.columns = ['anlegg_nr', 'name', 'value', 'unit']

# Get par defs from db
sql = ('SELECT * '
       'FROM resa2.rid_punktkilder_inpar_def')
par_df = pd.read_sql_query(sql, engine)
del par_df['descr']

# Capitalise units (first letter upper case, the rest lower) so the join ignores case
ind_vals['unit'] = ind_vals['unit'].str.capitalize()
par_df['unit'] = par_df['unit'].str.capitalize()

# Join
ind_vals = pd.merge(ind_vals, par_df, how='left',
                    on=['name', 'unit'])

# Some parameters that are not of interest are not matched
# Drop these
ind_vals.dropna(how='any', inplace=True)

# Get just cols of interest
ind_vals = ind_vals[['anlegg_nr', 'in_pid', 'value']]

# Rename for db
ind_vals.columns = ['ANLEGG_NR', 'INP_PAR_ID', 'VALUE']

# Convert col types
ind_vals['INP_PAR_ID'] = ind_vals['INP_PAR_ID'].astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [13]:
# Combine
val_df = pd.concat([stan_vals, milo_vals, ind_vals], axis=0)

# Add column for year
val_df['YEAR'] = year

# Explicitly set data types
val_df['ANLEGG_NR'] = val_df['ANLEGG_NR'].astype(str)
val_df['INP_PAR_ID'] = val_df['INP_PAR_ID'].astype(int)
val_df['VALUE'] = val_df['VALUE'].astype(float)
val_df['YEAR'] = val_df['YEAR'].astype(int)

# Store Anlegg and Miljøgifter contain some duplicated information
val_df.drop_duplicates(inplace=True)

In [14]:
# Drop any existing values for this year
#sql = ("DELETE FROM resa2.rid_punktkilder_inpar_values "
#       "WHERE year = %s" % year)
#res = conn.execute(sql)

# Add to RESA2.RID_PUNKTKILDER_INPAR_VALUES 
#val_df.to_sql('rid_punktkilder_inpar_values', con=engine, schema='resa2', 
#              if_exists='append', index=False)

## 2. Små anlegg (small treatment works)

An example of the raw data format is here:

K:\Prosjekter\Ferskvann\O-13255-TEOTIL\2016\Rådata\Avløp\TEOTIL små anlegg 2015 (sendt 18.08.2016).xlsx

I have made a local copy of the 2016 file here:

C:\Data\James_Work\Staff\Oeyvind_K\Elveovervakingsprogrammet\Data\point_data_2016\avlop_sma_anlegg_2016_raw.xlsx

and deleted unnecessary columns. All of this data can be added directly to `RESA2.RID_KILDER_SPREDT_VALUES`. The kommune ID numbers and names are in `RESA2.KOMMUNER`, but not all kommune IDs in `RID_KILDER_SPREDT_VALUES` are in `KOMMUNER`. I still need to check whether Tore's code actually uses the `KOMMUNER` table to link kommuner to OSPAR areas. If it does, we **need to be careful**, but perhaps the link is made directly on kommune ID?

In [15]:
# Set the year for the data in question
year = 2016

In [16]:
# Read raw (tidied) data
in_xlsx = (r'C:\Data\James_Work\Staff\Oeyvind_K\Elveovervakingsprogrammet'
           r'\Data\point_data_2016\avlop_sma_anlegg_2016_raw.xlsx')
sman_df = pd.read_excel(in_xlsx, 
                        sheetname='sma_anlegg_2016')
# Drop blank rows
sman_df.dropna(how='all', inplace=True)

# Kommune nr. should be a 4 char string, not a float
fmt = lambda x: '%04d' % x
sman_df['KOMMUNENR'] = sman_df['KOMMUNENR'].apply(fmt)

# Check if any kommuner are not already in db
sql = ('SELECT UNIQUE(kommune_no) '
       'FROM resa2.kommuner')
kmnr_df = pd.read_sql_query(sql, engine)

not_in_db = set(sman_df['KOMMUNENR'].values) - set(kmnr_df['kommune_no'].values)

print '\nThe following locations are not in the database:'
print sman_df[sman_df['KOMMUNENR'].isin(list(not_in_db))]

# Get cols of interest for RID_KILDER_SPREDT_VALUES
sman_df = sman_df[['KOMMUNENR', 'P_kg', 'N_kg']]

# In RESA2.RID_PUNKTKILDER_INPAR_DEF, N is par_id 44 and P par_id 45
sman_df.columns = ['KOMM_NO', 45, 44]

# Melt to "long" format
sman_df = pd.melt(sman_df, id_vars='KOMM_NO', value_vars=[45, 44],
                  var_name='INP_PAR_ID', value_name='VALUE')

# Add column for year
sman_df['AR'] = year

sman_df.head()


The following locations are not in the database:
    KOMMUNENR   KOMMUNENAVN        P_kg       N_kg
259      1505  Kristiansund  1433.17980   9554.532
292      1576          Aure  1438.50150   9951.360
340      1756       Inderøy  1292.87745   8896.875
386      1903       Harstad  2595.41280  17317.206


Unnamed: 0,KOMM_NO,INP_PAR_ID,VALUE,AR
0,101,45,809.2269,2016
1,104,45,93.0312,2016
2,105,45,1074.29355,2016
3,106,45,395.15265,2016
4,111,45,29.59785,2016


In [17]:
# Add to RESA2.RID_KILDER_SPREDT_VALUES
#sman_df.to_sql('rid_kilder_spredt_values', con=engine, schema='resa2', 
#               if_exists='append', index=False)

## 3. Fish farms

An example of the raw data is here:

 * K:\Prosjekter\Ferskvann\O-13255-TEOTIL\2016\Rådata\Fiskeoppdrett\Teotil - 2015 (2) (pr. 09.08.16).xlsx.zip

I have made a local copy of the 2016 file here:

 * C:\Data\James_Work\Staff\Oeyvind_K\Elveovervakingsprogrammet\Data\point_data_2016\fiske_oppdret_2016_raw.xlsx

The data must be added to two tables in RESA2:

 * First, the site data must be added to `RESA2.RID_KILDER_AQUAKULTUR`. Most of the sites should already be there, but occasionally new sites are added. Any new stations must be assigned lat/lon co-ordinates and the appropriate "Regine" catchment ID. This usually requires geocoding plus co-ordinate conversions and/or a spatial join to determine catchment IDs.
 
    **Note:** The key ID fields in the raw data appear to be `LOKNR` and `LOKNAVN`. <br><br>
 
 * Secondly, the chemistry data for each site must be extracted and converted to "long" format, then added to `RESA2.RID_KILDER_AQKULT_VALUES`. Parameter IDs etc. are taken from `RESA2.RID_PUNKTKILDER_INPAR_DEF`.
 
### 3.1. Basic data checking

In [18]:
# Set the year for the data in question
year = 2016

In [19]:
# Read raw (tidied) data
# Fish farms
in_xlsx = (r'C:\Data\James_Work\Staff\Oeyvind_K\Elveovervakingsprogrammet'
           r'\Data\point_data_2016\fiske_oppdret_2016_raw.xlsx')
fish_df = pd.read_excel(in_xlsx, sheetname='Ark1')

# Drop no data
fish_df.dropna(how='all', inplace=True)

In [20]:
# Check if any sites are not already in db
sql = ('SELECT UNIQUE(NR) '
       'FROM resa2.rid_kilder_aquakultur')
aqua_df = pd.read_sql_query(sql, engine)

not_in_db = set(fish_df['LOKNR'].values) - set(aqua_df['nr'].values)

nidb_df = fish_df[fish_df['LOKNR'].isin(list(not_in_db))][['LOKNR', 'LOKNAVN', 
                                                           'N_DESIMALGRADER_Y',
                                                           'O_DESIMALGRADER_X']].drop_duplicates(subset=['LOKNR'])

print '\nThe following locations are not in the database:'
print nidb_df


The following locations are not in the database:
Empty DataFrame
Columns: [LOKNR, LOKNAVN, N_DESIMALGRADER_Y, O_DESIMALGRADER_X]
Index: []


### 3.2. Geocode fish farms and add to database

In [21]:
## Path to Regine catchment shapefile
#reg_shp_path = (r'C:\Data\James_Work\Staff\Oeyvind_K\Elveovervakingsprogrammet'
#                r'\Data\gis\shapefiles\reg_minste_f_wgs84.shp')
#
## Spatial join
#loc_df = rid.identify_point_in_polygon(nidb_df, reg_shp_path, 
#                                       'LOKNR', 'VASSDRAGNR',
#                                       'N_DESIMALGRADER_Y',
#                                       'O_DESIMALGRADER_X')
#
## Rename cols
#loc_df.columns = ['NR', 'NAVN', 'LENGDE', 'BREDDE', 'REGINE']
#
#loc_df.head()

In [22]:
# Add to RESA2.RID_KILDER_AQUAKULTUR
#loc_df.to_sql('rid_kilder_aquakultur', con=engine, schema='resa2', 
#              if_exists='append', index=False)

### 3.3. Estimate nutrient inputs

The methodology here is a little unclear. The following is my best guess, based on the files located here:

K:\Avdeling\Vass\316_Miljøinformatikk\Prosjekter\RID\2016\Rådata\Fiskeoppdrett

Old workflow:

 1. Calculate the fish biomass from the raw data. See the equation in the `Biomasse` column of the spreadsheet *JSE_TEOTIL_2015.xlsx* <br><br>
 
 2. Split the data according to salmon ("laks"; species ID 71101) and trout ("ørret"; species ID 71401), then group by location and month, summing the biomass and `FORFORBRUK_KILO` columns (see Fiskeoppdrett_biomasse_2016.accdb) <br><br>
 
 3. Calculate production. This involves combining biomass for the current month with that for the previous month. See the calculations in e.g. *N_P_ørret_2015.xlsx*. <br><br>
 
 4. Calculate NTAP and PTAP. **NB:** I don't know what these quantities are, so I'm just blindly duplicating the Excel calculations in the code below. The functions are therefore not very well explained. <br><br>
 
 5. Estimate copper usage at each fish farm by scaling the total annual Cu usage in proportion to P production. For 2016, John Rune has supplied an annual Cu value of **1088 tonnes** (see e-mail received 12/09/2017 at 09.49).
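
Step 5 can be sketched as a simple proportional allocation. This is an assumption about how the scaling works, based only on the description above; it is not the code in `rid.estimate_fish_farm_nutrient_inputs`. The function name, farm IDs and P values are invented, and the result is left in the same units as the annual total (tonnes).

```python
def allocate_copper(p_by_site, annual_cu):
    """Share an annual Cu total between sites in proportion to P.

    p_by_site maps site ID -> P value for the year (any consistent unit).
    Returns a dict mapping site ID -> Cu, in the units of annual_cu.
    """
    total_p = float(sum(p_by_site.values()))
    if total_p <= 0:
        raise ValueError('Total P must be positive to allocate Cu.')
    return {site: annual_cu * p / total_p
            for site, p in p_by_site.items()}


# Invented P values for three hypothetical farms
p_by_site = {'farm_a': 300.0, 'farm_b': 100.0, 'farm_c': 0.0}

# Annual Cu usage in tonnes, as quoted above for 2016
cu_by_site = allocate_copper(p_by_site, 1088)
```

With this approach the per-site values always sum to the annual total, and a farm with no P production receives no copper.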

In [23]:
# Annual Cu usage in tonnes
an_cu = 1088

# Estimate nutrient inputs from fish farms
fish_nut = rid.estimate_fish_farm_nutrient_inputs(fish_df, year, an_cu)

fish_nut.head()

Unnamed: 0,ANLEGG_NR,INP_PAR_ID,AR,MANED,ART,VALUE
0,10029,39,2016,6,,0.0
1,10041,39,2016,6,,157490.099784
2,10050,39,2016,6,,231.467583
3,10054,39,2016,6,,156819.055095
4,10078,39,2016,6,,71567.641771


In [24]:
# Add to RESA2.RID_KILDER_AQKULT_VALUES
#fish_nut.to_sql('rid_kilder_aqkult_values', con=engine, schema='resa2', 
#                if_exists='append', index=False)

## 4. Land use

An example of the raw data is here:

 * K:\Avdeling\Vass\316_Miljøinformatikk\Prosjekter\RID\2016\Rådata\Jordbruk\to-niva.2015.xls

Note that this file is not really an Excel file and opening it directly creates errors. I have corrected the data format, tidied the column headings and made a local copy of the 2016 data here:

 * C:\Data\James_Work\Staff\Oeyvind_K\Elveovervakingsprogrammet\Data\point_data_2016\jordbruk_2016.xlsx
 
This is added to the table `RESA2.RID_AGRI_INPUTS`.

**Note:** In recent years, the entry for Oslo (fylke_sone = 3_1) has been missing from the data provided by Bioforsk. This row should be added manually to the Excel file using `omrade = "osl1"`. The values should be identical to those for område `ake2`. This works because the land areas in `RID_Fylke-Sone_LU_Areas.xlsx` have been made identical for `osl1` and `ake2` (even though this is not correct), so the inputs in terms of kg/km2 are calculated as being the same for both regions, which is what is required.
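
As an alternative to editing the Excel file by hand, the missing Oslo row could be added after reading the data. The sketch below is one possible approach and is not part of the original workflow: the helper name `add_oslo_row`, the `ake1` area and all the numbers are invented, and only a subset of the real columns is shown.

```python
import pandas as pd


def add_oslo_row(df, src='ake2', new='osl1'):
    """Return a copy of df with a row for `new` duplicated from `src`.

    If `new` is already present, the data are returned unchanged.
    """
    if (df['omrade'] == new).any():
        return df.copy()
    src_rows = df[df['omrade'] == src]
    if len(src_rows) != 1:
        raise ValueError('Expected exactly one row for %s.' % src)
    new_row = src_rows.copy()
    new_row['omrade'] = new
    return pd.concat([df, new_row], ignore_index=True)


# Invented agricultural inputs for two areas
lu_toy = pd.DataFrame({'omrade': ['ake1', 'ake2'],
                       'n_diff_kg': [100.0, 200.0],
                       'p_diff_kg': [10.0, 20.0]})

lu_fixed = add_oslo_row(lu_toy)

# Applying the helper a second time leaves the data unchanged
lu_again = add_oslo_row(lu_fixed)
```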

In [25]:
# Path to (tidied) Bioforsk data
in_xlsx = (r'C:\Data\James_Work\Staff\Oeyvind_K\Elveovervakingsprogrammet'
           r'\Data\point_data_2016\jordbruk_2016.xlsx')

lu_df = pd.read_excel(in_xlsx)

# Add year
lu_df['year'] = 2016

# Order cols
lu_df = lu_df[['omrade', 'year', 'n_diff_kg', 'n_point_kg', 
               'n_back_kg', 'p_diff_kg', 'p_point_kg', 
               'p_back_kg']]

In [26]:
# Write to RESA
#lu_df.to_sql(name='rid_agri_inputs', con=engine, 
#             schema='resa2', index=False,
#             if_exists='append',
#             dtype={'omrade': types.VARCHAR(lu_df['omrade'].str.len().max())})