In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sn
import pandas as pd
import numpy as np
import os
import nivapy3 as nivapy
import useful_rid_code as rid

sn.set_context('notebook')

In [13]:
# Connect to db
engine = nivapy.da.connect()

Username:  ···
Password:  ········


Connection successful.


# RID 2018-19: data processing notebook

Previous notebooks have developed the tools and methodology required to implement the RID workflow. Note that the programme for 2017 - 2020 requires some changes, which are described [here](http://nbviewer.jupyter.org/github/JamesSample/rid/blob/master/notebooks/programme_changes_2017-18.ipynb). In particular, we now have 20 "core" sites and 135 "other" sites, instead of the previous 11/36/108 split.

This notebook keeps a record of the processing for the 2018 data.

## 1. Add 2018 datasets

### 1.1. Update flow datasets

[This notebook](http://nbviewer.jupyter.org/github/JamesSample/rid/blob/master/notebooks/update_flow_datasets.ipynb) documents the processing and upload of the NVE flow datasets (both modelled and observed) for each year since 2016. The RESA2 database now contains a complete record of all the discharge data required for this year's RID processing.

### 1.2. Water chemistry quality control

The water samples collected for this project are analysed by the NIVA laboratory and results are automatically transferred to the RESA2 database. Liv Bente has quality-checked the 2018 data and the necessary corrections have been made in the database. 

### 1.3. Sample selections

The analysis for the RID report should only use water samples collected as part of the "core" monitoring programme (i.e. not flood samples or those collected under Option 3). In 2018, **no flood samples were taken**, but the **Option 3 samples need to be separated from the "core" sampling**, as agreed with Øyvind and Hans Fredrik (see e-mail from Hans Fredrik received 09.08.2019 at 08.16 for full details).

The code below is modified from [this notebook](http://nbviewer.jupyter.org/github/JamesSample/rid/blob/master/notebooks/programme_changes_2017-18.ipynb) and separates the Option 3 and "core" samples using the `SAMPLE_SELECTIONS` tables.
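
The separation step is a set difference on water sample IDs. The following is a minimal sketch of that idea using toy dataframes in place of the database queries. The sample IDs are invented for illustration; only the selection ID 63 for "core" samples comes from this notebook.

```python
import pandas as pd

# Toy stand-ins for the query results (IDs are made up for illustration)
all_ws = pd.DataFrame({'water_sample_id': [101, 102, 103, 104, 105]})
oth_ws = pd.DataFrame({'water_sample_id': [102, 105]})  # flood + Option 3

# Each sample should appear only once in each list
assert all_ws['water_sample_id'].is_unique
assert oth_ws['water_sample_id'].is_unique

# "Core" samples are everything not flagged as flood or Option 3
core_ws = sorted(set(all_ws['water_sample_id']) - set(oth_ws['water_sample_id']))

# Build rows ready to append to the selections table
core_df = pd.DataFrame({'water_sample_id': core_ws})
core_df['sample_selection_id'] = 63  # ID used for "core" samples in this notebook

print(core_df)
```

Sorting the result is optional, but it makes the rows reproducible between runs because Python sets have no guaranteed order.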

In [3]:
# Read new site groupings (for 2017 to 2020)
in_xlsx = (r'../../../Data/RID_Sites_List_2017-2020.xlsx')
rid_20_df = pd.read_excel(in_xlsx, sheet_name='RID_20')
rid_135_df = pd.read_excel(in_xlsx, sheet_name='RID_135')
rid_155_df = pd.read_excel(in_xlsx, sheet_name='RID_All')

In [4]:
# Get WS IDs for Option 3
sql = ("SELECT water_sample_id from resa2.water_samples "
       "WHERE station_id IN ( "
       "  SELECT station_id FROM resa2.projects_stations "
       "  WHERE project_id = 4310) "
       "AND sample_date >= DATE '2018-01-01' "
       "AND sample_date < DATE '2019-01-01'")
ws_df = pd.read_sql_query(sql, engine)

# Add to sample selections
ws_df['sample_selection_id'] = 65
#ws_df.to_sql('sample_selections', con=engine, schema='resa2', 
#             if_exists='append', index=False)

In [5]:
# Get flood and Option 3 samples
sql = ("SELECT water_sample_id "
       "FROM resa2.sample_selections "
       "WHERE sample_selection_id IN (64, 65)")
oth_ws = pd.read_sql_query(sql, engine)
assert oth_ws['water_sample_id'].is_unique

# Get all WS associated with the RID sites (all 155) for 2018
sql = ("SELECT water_sample_id FROM resa2.water_samples "
       "WHERE station_id IN %s "
       "AND sample_date >= DATE '2018-01-01' "
       "AND sample_date < DATE '2019-01-01'" % str(tuple(rid_155_df['station_id'].astype(int))))
all_ws = pd.read_sql_query(sql, engine)
assert all_ws['water_sample_id'].is_unique

# Remove flood and option 3 samples from core
core_ws = set(all_ws['water_sample_id']) - set(oth_ws['water_sample_id'])

# Add to sample selections
core_df = pd.DataFrame({'water_sample_id':list(core_ws)})
core_df['sample_selection_id'] = 63
#core_df.to_sql('sample_selections', con=engine, schema='resa2', 
#               if_exists='append', index=False)
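
One caveat with building the `IN` clause via `str(tuple(...))`: for a single station, Python writes the tuple as `(29612,)`, and the trailing comma is not valid SQL. This does not matter for the 155 sites used here, but a small helper avoids the problem if the code is reused with shorter lists. The sketch below is one possible approach; `sql_in_list` is a hypothetical name and is not part of `nivapy` or `useful_rid_code`.

```python
def sql_in_list(ids):
    """Format integer IDs as '(1, 2, 3)' for use in a SQL IN clause.

    Hypothetical helper, not part of nivapy or useful_rid_code.
    """
    # int() raises an error for anything that is not a plain number
    ids = [int(i) for i in ids]
    if not ids:
        raise ValueError('At least one ID is required.')
    return '(%s)' % ', '.join(str(i) for i in ids)

print(sql_in_list([29612]))         # (29612)
print(sql_in_list([29612, 29837]))  # (29612, 29837)
```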

## 2. Tabulate raw water chemistry and flow

### 2.1. Data for year of interest

From 2017 onwards, water chemistry samples have been collected at 20 sites (`RID_20`). These data are exported to CSV format below.

**Remember to update the year!**

In [6]:
# Year of interest
year = 2018

# Output CSV
out_csv = r'../../../Results/Loads_CSVs/concs_and_flows_rid_20_%s.csv' % year

df = rid.write_csv_water_chem(rid_20_df, year, out_csv, engine, samp_sel=63)

    Extracting flow data...
    The database contains duplicated values for some station-date-parameter combinations.
    Only the most recent values will be used, but you should check the repeated values are not errors.
    The duplicated entries are returned in a separate dataframe.

    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    The database contains duplicated values for some station-date-parameter combinations.
    Only the most recent values will be used, but you should check the repeated values are not errors.
    The duplicated entries are returned in a separate dataframe.

    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Ex

### 2.2. Data for all years

**Added 03.09.2018**. We would like to estimate trends in both loads and concentrations for the "main" rivers over the period from 1990 to 2017. Because of the changes to the programme in 2017, Eva needs concentrations tabulated for all years, not just 2017 (see e-mail from Øyvind received 31.08.2018 at 14.04).

This may not be necessary for the 2018 data, but I'll run the code here anyway as the output may be useful.
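
Because the set of measured parameters has changed over time, the yearly tables do not all have the same columns. The sketch below shows, on toy data, how such tables can be stacked and the columns put into a stable order. The column names and values are assumptions for illustration. Passing `sort=False` to `pd.concat` is optional here, since the columns are reordered explicitly afterwards, but it silences the pandas `FutureWarning` about sorting.

```python
import pandas as pd

# Toy yearly tables: the 2018 table has an extra parameter column (DOC)
df_2017 = pd.DataFrame({'station_id': [1], 'sample_date': ['2017-06-01'],
                        'TOC_mg/l': [3.1]})
df_2018 = pd.DataFrame({'station_id': [1], 'sample_date': ['2018-06-01'],
                        'TOC_mg/l': [2.9], 'DOC_mg/l': [2.5]})

# sort=False keeps columns in order of first appearance;
# missing parameters are filled with NaN
df = pd.concat([df_2017, df_2018], axis=0, sort=False, ignore_index=True)

# Station/date columns first, then parameter columns alphabetically
st_cols = ['station_id', 'sample_date']
par_cols = sorted(c for c in df.columns if c not in st_cols)
df = df[st_cols + par_cols]

print(df.columns.tolist())
```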

In [7]:
# Year of interest
end_year = 2018

# Container for data
df_list = []

# Dummy path for intermediate output (which isn't needed here)
out_csv = r'../../../Results/Loads_CSVs/cons_and_flows_intermed.csv'

# Loop over years
for year in range(1990, end_year+1):
    # Get data
    df = rid.write_csv_water_chem(rid_20_df, year, out_csv, engine, samp_sel=63)
    
    # Add to output
    df_list.append(df)

# Delete intermediate
os.remove(out_csv)

# Combine
df = pd.concat(df_list, axis=0)

# Reorder cols and tidy
st_cols = ['station_id', 'station_code', 'station_name', 'old_rid_group', 
           'new_rid_group', 'ospar_region', 'sample_date', 'Qs_m3/s']
par_cols = [i for i in df.columns if i not in st_cols]
par_cols.sort()
df = df[st_cols + par_cols]
        
# Output CSV
out_csv = r'../../../Results/Loads_CSVs/concs_and_flows_rid_20_1990-%s.csv' % end_year
df.to_csv(out_csv, encoding='utf-8', index=False)

    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...


of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  df = pd.concat(df_list, axis=0)


    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
No chemistry data found for station ID 29837.
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
    Extracting flow data...
  




## 3. Estimate observed loads

### 3.1. Annual flows

First get a dataframe of annual flow volumes to join to the summary output. **NB:** This dataframe isn't actually used in the loads calculations - they are handled separately - it's just for the output CSVs.
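
The output column is named `mean_q_1000m3/day`. How `rid.get_flow_volumes` derives it is not shown in this notebook, so the following is only a sketch of the unit conversion, assuming daily mean discharge in m3/s as the input. The toy series and the column name `flow_m3/s` are invented for illustration.

```python
import pandas as pd

# Toy daily discharge series in m3/s (values are made up)
dates = pd.date_range('2018-01-01', '2018-12-31', freq='D')
daily_df = pd.DataFrame({'date': dates, 'flow_m3/s': 300.0})

# 1 m3/s sustained for a day is 86400 m3, i.e. 86.4 thousand m3/day
daily_df['flow_1000m3/day'] = daily_df['flow_m3/s'] * 86400 / 1000

# Average over each calendar year
ann_df = (daily_df.groupby(daily_df['date'].dt.year)['flow_1000m3/day']
                  .mean()
                  .rename('mean_q_1000m3/day')
                  .rename_axis('year')
                  .reset_index())

print(ann_df)
```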

**Remember to update the years!**

In [8]:
# Sites of interest: combine all site dfs into one
rid_all_df = pd.concat([rid_20_df, rid_135_df], axis=0)

# Get flow data
q_df = rid.get_flow_volumes(rid_all_df, 1990, 2018, engine)

q_df.head()

|   | station_id | year | mean_q_1000m3/day |
|---|------------|------|-------------------|
| 0 | 29612 | 1990 | 25891.13466 |
| 1 | 29612 | 1991 | 19274.318392 |
| 2 | 29612 | 1992 | 22209.901227 |
| 3 | 29612 | 1993 | 28155.888465 |
| 4 | 29612 | 1994 | 27384.945933 |


### 3.2. Loads for all rivers

The code below is taken from Section 2 of [notebook 3](http://nbviewer.jupyter.org/github/JamesSample/rid/blob/master/notebooks/estimate_loads.ipynb), but this time run using the 2018 data. For the RID_20 sites, loads are calculated directly from contemporary observations; for the RID_135 sites, they are inferred from historic concentrations.

As above, note the use of the `'samp_sel'` argument in the code below.
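
The internals of `rid.estimate_loads` are not shown in this notebook. As a rough illustration only, one widely used way to estimate an annual load from spot samples is to multiply the annual flow volume by the flow-weighted mean concentration. The sketch below assumes that approach; the function name `flow_weighted_load`, the column names and the numbers are all hypothetical and may not match what `rid.estimate_loads` actually does.

```python
import pandas as pd

def flow_weighted_load(samp_df, annual_vol_m3):
    """Annual load in tonnes from spot samples (illustrative only).

    samp_df:       columns 'conc_mg/l' and 'flow_m3/s' at sampling times
    annual_vol_m3: total water volume for the year, in m3
    """
    # Flow-weighted mean concentration (mg/l is the same as g/m3)
    weights = samp_df['flow_m3/s']
    fw_conc = (samp_df['conc_mg/l'] * weights).sum() / weights.sum()

    # g/m3 * m3 = g; divide by 1e6 to get tonnes
    return fw_conc * annual_vol_m3 / 1e6

# Two made-up samples: the high-flow sample dominates the mean
samp_df = pd.DataFrame({'conc_mg/l': [2.0, 4.0],
                        'flow_m3/s': [100.0, 300.0]})
load = flow_weighted_load(samp_df, annual_vol_m3=1e9)
print(load)  # 3500.0 tonnes
```

Weighting by flow means that samples taken at high discharge contribute more to the mean concentration than samples taken at low discharge.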

**Remember to change the year below!**

In [9]:
# Sites of interest: combine all site dfs into one
rid_all_df = pd.concat([rid_20_df, rid_135_df], axis=0)

# Pars of interest
par_list = ['SPM', 'TOC', 'PO4-P', 'TOTP', 'NO3-N', 'NH4-N', 
            'TOTN', 'SiO2', 'Ag', 'As', 'Pb', 'Cd', 'Cu', 
            'Zn', 'Ni', 'Cr', 'Hg']

# Year of interest
year = 2018

# Container for results from each site
loads_list = []

# Loop over sites
for stn_id in rid_all_df['station_id'].values:
    # Estimate loads at this site
    loads_list.append(rid.estimate_loads(stn_id, par_list, 
                                         year, engine,
                                         infer_missing=True,
                                         samp_sel=63))

# Concatenate to new df
lds_all = pd.concat(loads_list, axis=0)
lds_all.index.name = 'station_id'
lds_all.reset_index(inplace=True)

# Get flow data for year
q_yr = q_df.query('year == @year')

# Join
lds_all = pd.merge(lds_all, rid_all_df, how='left', on='station_id')
lds_all = pd.merge(lds_all, q_yr, how='left', on='station_id')

# Reorder cols and tidy
st_cols = ['station_id', 'station_code', 'station_name', 'old_rid_group', 
           'new_rid_group', 'ospar_region', 'mean_q_1000m3/day']
unwant_cols = ['nve_vassdrag_nr', 'lat', 'lon', 'utm_north', 'utm_east', 
               'utm_zone', 'station_type', 'year'] 
par_cols = [i for i in lds_all.columns if i not in (st_cols+unwant_cols)]

for col in unwant_cols:
    del lds_all[col]

lds_all = lds_all[st_cols + par_cols]

# Write output
out_csv = r'../../../Results/Loads_CSVs/loads_and_flows_all_sites_%s.csv' % year
lds_all.to_csv(out_csv, encoding='utf-8', index=False)

    The database contains duplicated values for some station-date-parameter combinations.
    Only the most recent values will be used, but you should check the repeated values are not errors.
    The duplicated entries are returned in a separate dataframe.

    The database contains duplicated values for some station-date-parameter combinations.
    Only the most recent values will be used, but you should check the repeated values are not errors.
    The duplicated entries are returned in a separate dataframe.

    The database contains duplicated values for some station-date-parameter combinations.
    Only the most recent values will be used, but you should check the repeated values are not errors.
    The duplicated entries are returned in a separate dataframe.

    The database contains duplicated values for some station-date-parameter combinations.
    Only the most recent values will be used, but you should check the repeated values are not errors.
    The duplicated entries are

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.




### 3.3. Loads for the RID_20 rivers through time

The code below is taken from Section 3 of [notebook 3](http://nbviewer.jupyter.org/github/JamesSample/rid/blob/master/notebooks/estimate_loads.ipynb), but this time the bar charts include data from 2018 for the RID_20 stations.

Note the use of the `'samp_sel'` argument in the code below.

**Remember to change the years below!**

In [10]:
# Period of interest
st_yr, end_yr = 1990, 2018

# Container for results 
loads_list = []

# Loop over sites
for stn_id in rid_20_df['station_id'].values:
    # Loop over years
    for year in range(st_yr, end_yr+1):
        print('Processing Station ID %s for %s' % (stn_id, year))
        
        # Get loads
        l_df = rid.estimate_loads(stn_id, par_list, 
                                  year, engine,
                                  infer_missing=True,
                                  samp_sel=63)
        
        if l_df is not None:
            # Name and reset index
            l_df.index.name = 'station_id'
            l_df.reset_index(inplace=True)

            # Add year
            l_df['year'] = year

            # Add to output
            loads_list.append(l_df)

# Concatenate to new df
lds_ts = pd.concat(loads_list, axis=0)

# Join
lds_q_ts = pd.merge(lds_ts, rid_20_df, how='left', on='station_id')
lds_q_ts = pd.merge(lds_q_ts, q_df, how='left', on=['station_id', 'year'])

# Reorder cols and tidy
st_cols = ['station_id', 'station_code', 'station_name', 'old_rid_group', 
           'new_rid_group', 'ospar_region', 'mean_q_1000m3/day']
unwant_cols = ['nve_vassdrag_nr', 'lat', 'lon', 'utm_north', 'utm_east', 
               'utm_zone', 'station_type'] 
par_cols = [i for i in lds_q_ts.columns if i not in (st_cols+unwant_cols)]

for col in unwant_cols:
    del lds_q_ts[col]

lds_q_ts = lds_q_ts[st_cols + par_cols]

# Save output
out_csv = r'../../../Results/Loads_CSVs/loads_and_flows_rid_20_%s-%s.csv' % (st_yr, end_yr)
lds_q_ts.to_csv(out_csv, encoding='utf-8', index=False)

# Build multi-index on lds_ts for further processing
lds_ts.set_index(['station_id', 'year'], inplace=True)

Processing Station ID 29612 for 1990
    The database contains duplicated values for some station-date-parameter combinations.
    Only the most recent values will be used, but you should check the repeated values are not errors.
    The duplicated entries are returned in a separate dataframe.

Processing Station ID 29612 for 1991
    The database contains duplicated values for some station-date-parameter combinations.
    Only the most recent values will be used, but you should check the repeated values are not errors.
    The duplicated entries are returned in a separate dataframe.

Processing Station ID 29612 for 1992
Processing Station ID 29612 for 1993
Processing Station ID 29612 for 1994
    The database contains duplicated values for some station-date-parameter combinations.
    Only the most recent values will be used, but you should check the repeated values are not errors.
    The duplicated entries are returned in a separate dataframe.

Processing Station ID 29612 for 1995
P

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.




**Remember to change the year in the folder path below!**

In [11]:
%%capture
# This code cell produces lots of Deprecation Warnings from Seaborn/Pandas.
# %%capture suppresses all output from this cell to keep things tidy

# Output folder for plots
out_fold = r'../../../Results/TS_Plots/RID_Plots_To_2018'

# Loop over df
for stn_id in rid_20_df['station_id'].values:
    # Get data for this station
    df = lds_ts.loc[stn_id]  # .loc replaces the deprecated .ix indexer
    
    # Separate est and val cols to two dfs
    cols = df.columns
    est_cols = [i for i in cols if i.split('_')[1]=='Est']
    val_cols = [i for i in cols if i.split('_')[1]!='Est']    
    # reset_index() returns new dataframes, avoiding in-place edits on slices
    val_df = df[val_cols].reset_index()
    est_df = df[est_cols].reset_index()
    
    # Convert to "long" format
    val_df = pd.melt(val_df, id_vars='year', var_name='par_unit')    
    est_df = pd.melt(est_df, id_vars='year', var_name='par_est', value_name='est')
    
    # Get just par for joining
    val_df['par'] = val_df['par_unit'].str.split('_', expand=True)[0]
    est_df['par'] = est_df['par_est'].str.split('_', expand=True)[0]
    
    # Join
    df = pd.merge(val_df, est_df, how='left',
                  on=['year', 'par'])
    
    # Extract cols of interest
    df = df[['year', 'par_unit', 'value', 'est']]

    # Plot
    # catplot is the current name for factorplot (renamed in seaborn 0.9)
    g = sn.catplot(x='year', y='value', hue='est',
                   col='par_unit', col_wrap=3,
                   data=df, 
                   kind='bar',
                   dodge=False,
                   sharex=False,
                   sharey=False,
                   alpha=0.5,
                   aspect=2,
                   legend=False)
    
    # Rotate tick labels and tidy
    for ax in g.axes.flatten(): 
        for tick in ax.get_xticklabels(): 
            tick.set(rotation=45)
    plt.tight_layout()
    
    # Save
    out_path = os.path.join(out_fold, '%s.png' % stn_id)
    plt.savefig(out_path, dpi=200)
    plt.close()
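
The reshaping step in the cell above can be illustrated on a toy dataframe. This is a minimal sketch that assumes load columns are named `PAR_Unit` and flag columns `PAR_Est`, as the `split('_')` logic implies; the values themselves are invented.

```python
import pandas as pd

# Toy loads for one station, indexed by year (values are made up)
df = pd.DataFrame({'TOC_tonnes': [10.0, 12.0], 'TOC_Est': [0, 1],
                   'Hg_kg': [1.5, 1.1], 'Hg_Est': [0, 0]},
                  index=pd.Index([2017, 2018], name='year'))

# Split value columns from estimate-flag columns
est_cols = [c for c in df.columns if c.split('_')[1] == 'Est']
val_cols = [c for c in df.columns if c.split('_')[1] != 'Est']

# Melt each part to "long" format
val_df = pd.melt(df[val_cols].reset_index(), id_vars='year',
                 var_name='par_unit')
est_df = pd.melt(df[est_cols].reset_index(), id_vars='year',
                 var_name='par_est', value_name='est')

# Join values to their flags on year and parameter name
val_df['par'] = val_df['par_unit'].str.split('_', expand=True)[0]
est_df['par'] = est_df['par_est'].str.split('_', expand=True)[0]
long_df = pd.merge(val_df, est_df, how='left', on=['year', 'par'])
long_df = long_df[['year', 'par_unit', 'value', 'est']]

print(long_df)
```

Each row of `long_df` holds one year, one parameter, its load and its flag, which is the layout the faceted bar chart needs (`col='par_unit'`, `hue='est'`).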

The three files created above (`concs_and_flows_rid_20_year.csv`, `loads_and_flows_all_sites_year.csv` and `loads_and_flows_rid_20_1990-year.csv`) can now be imported into Excel and sent to NIBIO. The data layout is illustrated here:

C:\Data\James_Work\Staff\Oeyvind_K\Elveovervakingsprogrammet\Results\Loads_CSVs\rid_conc_and_loads_summaries_2016.xlsx

**NB:** For neatness, a couple of columns can be manually reordered so that the "flag" columns always come before the data columns.

## 4. Generate output tables for Word

### 4.1. Table 1: Raw water chemistry

The code below is based on Section 2 of [notebook 5](http://nbviewer.jupyter.org/github/JamesSample/rid/blob/master/notebooks/word_data_tables.ipynb).

**Updated 24/09/2018**

This function has been modified to reflect changes in the 2017-20 monitoring programme:

 1. The Word template now has pages for just the 20 "main" rivers, not the 11 + 36 rivers, as previously <br><br>
 
 2. Four new columns have been added for new parameters measured during 2017-20 (DOC, Part. C, Tot. Part. N and TDP) <br><br>
 
 3. Hours and minutes have been removed from the date-time column to create space for the new columns <br><br>
 
 4. I have corrected various typos in the database (and in the template):
 
     * `'Tot.part. N'` > `'Tot. Part. N'`
     * `'Vosso(Bolstadelvi)'` > `'Vosso (Bolstadelvi)'`
     * `'Nidelva(Tr.heim)'` > `'Nidelva (Tr.heim)'`
     * `'More than 70%LOD'` > `'More than 70% >LOD'` (template only) 
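
The corrections above were made in the database and template themselves. Purely as an illustration, the same kind of tidying (label fixes from item 4 and dropping hours and minutes as in item 3) could be applied to a dataframe as sketched below. The date format `'%d.%m.%Y'` and the example values are assumptions, not taken from the Word template.

```python
import pandas as pd

# Old label -> corrected label (taken from the list above)
fixes = {'Tot.part. N': 'Tot. Part. N',
         'Vosso(Bolstadelvi)': 'Vosso (Bolstadelvi)',
         'Nidelva(Tr.heim)': 'Nidelva (Tr.heim)'}

# Series.replace with a dict swaps whole values that match exactly
names = pd.Series(['Alna', 'Vosso(Bolstadelvi)', 'Nidelva(Tr.heim)'])
names = names.replace(fixes)
print(names.tolist())

# Drop hours and minutes from sample date-times (format is an assumption)
dates = pd.Series(pd.to_datetime(['2018-01-08 10:30', '2018-02-05 09:15']))
date_strs = dates.dt.strftime('%d.%m.%Y')
print(date_strs.tolist())
```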

**Remember to change the year below!**

In [14]:
# Path to *COPIED* template for editing
in_docx = r'../../../Results/Word_Tables/2019Analysis_2018Data/Table_1_2018.docx'

# Write tables
rid.write_word_water_chem_tables(rid_20_df, 2018, in_docx, engine, samp_sel=63)

Processing: Glomma ved Sarpsfoss
    Extracting water chemistry data...
    Extracting flow data...
    Writing sample dates...
    Deleting empty rows...
    Writing data values...
    Writing summary statistics...
    Done.
Processing: Alna
    Extracting water chemistry data...
    The database contains duplicated values for some station-date-parameter combinations.
    Only the most recent values will be used, but you should check the repeated values are not errors.
    The duplicated entries are returned in a separate dataframe.

    Extracting flow data...
    Writing sample dates...
    Deleting empty rows...
    Writing data values...
    Writing summary statistics...
    Done.
Processing: Drammenselva
    Extracting water chemistry data...
    The database contains duplicated values for some station-date-parameter combinations.
    Only the most recent values will be used, but you should check the repeated values are not errors.
    The duplicated entries are returned in a sep

### 4.2. Table 2: Estimated loads at each site

The code below is based on Section 3 of [notebook 5](http://nbviewer.jupyter.org/github/JamesSample/rid/blob/master/notebooks/word_data_tables.ipynb).

**Updated 24/09/2018**

For the 2017-20 programme, we will only report loads for the 20 "main" rivers, not all 155. I have therefore simplified the Word template by deleting unnecessary rows. The function itself is unchanged.
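
The loads CSV still contains all 155 sites, so only the rows for the 20 "main" rivers are needed for this table. How `rid.write_word_loads_table` selects them is not shown here; the sketch below is one way to do it with toy data, using `.loc` for label-based lookup. Station IDs, column names and values are invented.

```python
import pandas as pd

# Toy loads table covering "all sites" (IDs and values are made up)
lds_all = pd.DataFrame({'station_id': [1, 2, 3, 4],
                        'TOC_tonnes': [10.0, 20.0, 30.0, 40.0]})

# Toy list of "main" rivers
main_df = pd.DataFrame({'station_id': [2, 4]})

# Keep only the main rivers, then index by station for label-based lookup
main_lds = lds_all[lds_all['station_id'].isin(main_df['station_id'])]
main_lds = main_lds.set_index('station_id')

print(main_lds.loc[2, 'TOC_tonnes'])  # 20.0
```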

**Remember to change the year in the file paths below!**

In [15]:
# Path to *COPIED* template for editing
in_docx = r'../../../Results/Word_Tables/2019Analysis_2018Data/Table_2_2018.docx'

# Read loads data (from "loads notebook")
loads_csv = r'../../../Results/Loads_CSVs/loads_and_flows_all_sites_2018.csv'

# Write tables
rid.write_word_loads_table(rid_20_df, loads_csv, in_docx, engine)

Processing:
    Drammenselva (BUSEDRA)...


.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated
  load = lds_df.ix[stn_id, par_l]


    Altaelva (FINEALT)...
    Tanaelva (FINETAN)...
    Vosso (Bolstadelvi) (HOREVOS)...
    Vefsna (NOREVEF)...
    Alna (OSLEALN)...
    Orreelva (ROGEORR)...
    Vikedalselva (ROGEVIK)...
    Orkla (STREORK)...
    Skienselva (TELESKI)...
    Otra (VAGEOTR)...
    Numedalslågen (VESENUM)...
    Glomma ved Sarpsfoss (ØSTEGLO)...
    Målselv (TROEMÅL)...
    Driva (MROEDRI)...
    Pasvikelva (FINEPAS)...
    Nidelva (Tr.heim) (STRENID)...
    Bjerkreimselva (ROGEBJE)...
    Vegårdselva (AAGEVEG)...
    Nausta (SFJENAU)...
Finished.
