In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import glob
import os
import nivapy3 as nivapy

# Update RID flow datasets

Each year, updated flow datasets (both modelled and observed) are obtained from NVE and added to RESA2. Tore has a number of Access files here:

K:\Avdeling\Vass\316_Miljøinformatikk\Prosjekter\RID\Vannføring

which handle the update process. The code in this notebook replaces this.

In [2]:
# Connect to db
engine = nivapy.da.connect()

Username:  ···
Password:  ········


Connection successful.


## 1. Observed discharge

Observed time series are used **only** for the 11 main rivers (project `RID (O 25800 03)`) - all other calculations are based on modelled flows (from HBV). Discharge for these 11 sites can be obtained from NVE (mostly from the [Hydra II database](http://www4.nve.no/xhydra/)). Note that more than 11 discharge stations are involved, because at some chemistry sampling locations the flow is the sum of several NVE discharge series. Note also the following:

 * Chemistry station 29613 should ideally use the sum of NVE series 16.133 and 16.153, but the latter is no longer available. Trine Fjeldstad at NVE can supply data for station 16.133 and we simply assume the input from 16.153 is constant at 10 $m^3/s$ (which is roughly equal to the long-term average). <br><br>
 
 * The discharge for chemistry station 29614 is **either** NVE station 21.71 **or** 21.11. Both stations should have exactly the same flow values in the Hydra II database. Only one set of values is required - check Hydra II to see which dataset is updated first. <br><br> 
 
 * Discharge data for chemistry stations 29617 (NVE ID 2.605) and 36225 (NVE ID 6.78) are often delayed. Contact Trine at NVE early to avoid problems later.
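
The rule in the first bullet for chemistry station 29613 amounts to adding a constant to the 16.133 series. A minimal sketch of that arithmetic follows; the flow values are made up for illustration and the helper name `combine_with_constant` is hypothetical.

```python
import pandas as pd

# Assumed constant contribution from NVE series 16.153 (m3/s),
# roughly the long-term average as described in the text
CONST_16_153 = 10.0


def combine_with_constant(q_16_133, const=CONST_16_153):
    """Estimate flow at chemistry station 29613 from series 16.133 alone."""
    return q_16_133 + const


# Illustrative values only (not real NVE data)
dates = pd.date_range("2018-01-01", periods=3, freq="D")
q_16_133 = pd.Series([50.0, 52.5, 49.0], index=dates, name="xvalue")
q_29613 = combine_with_constant(q_16_133)
print(q_29613)
```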

### 1.1. Data for 2016

The discharge stations associated with the 11 water chemistry sampling locations are listed in the table below, together with where the datasets came from in 2016. 

| Chem station ID | Chem station code | NVE station ID(s) | RESA flow station ID | Availability | Comment |
|:---------------:|:-----------------:|:-----------------:|:--------------------:|:------------:|:-----------------------------------------------------------------------------------------------:|
| 29612 | BUSEDRA | 12.285 | 57 | Hydra II | 2016 data downloaded 09/08/2017 |
| 29613 | TELESKI | 16.153 + 16.133 | 59 | From Trine | Data for 16.153 not available - assume constant at 10 m3/s. Data for 16.133 supplied 28/06/2017 |
| 29614 | VAGEOTR | 21.71 or 21.11 | 487 | Hydra II | 2016 data downloaded 19/06/2017 |
| 29615 | VESENUM | 15.61 | 58 | Hydra II | 2016 data downloaded 19/06/2017 |
| 29617 | ØSTEGLO | 2.605 | 56 | From Trine | 2016 data received 28/06/2017 |
| 29778 | STREORK | 121.22 | 348 | Hydra II | 2016 data downloaded 19/06/2017 |
| 29779 | FINEALT | 212.11 | 386 | Hydra II | 2016 data downloaded 26/07/2017 |
| 29782 | NOREVEF | 151.5 | 351 | Hydra II | 2016 data downloaded 26/07/2017 |
| 29783 | ROGEORR | 28.7 | 355 | Hydra II | 2016 data downloaded 19/06/2017 |
| 29821 | HOREVOS | 62.5 | 546 | Hydra II | 2016 data downloaded 19/06/2017 |
| 36225 | OSLEALN | 6.78 | 626 | Hydra II | 2016 data downloaded 26/07/2017 |

### 1.2. Data for 2017

For 2017, all datasets were supplied by Trine Fjeldstad at NVE (see e-mails received 15.08.2018 at 10.13 and 15.08.2018 at 11.52). For station_id 29614, Trine has provided data for station 21.71 (rather than 21.11). This is fine - see the second bullet point above.

### 1.3. Data for 2018

Most datasets were downloaded from Hydra II on 08.08.2019. Data for three sites were provided by Trine - see e-mail received 08.08.2019 at 12:14. This year, Trine has provided data for station 21.11.

In [3]:
# Year of interest
year = 2018

# Folder containing Hydra II data
hyd_fold = r'../../../Data/nve_observed/2018-19/hydra_ii'

# Folder containing data from Trine
tri_fold = r'../../../Data/nve_observed/2018-19/from_trine'

In [4]:
# Dict mapping NVE codes to RESA discharge station IDs
stn_id_dict = {'12_285':57,
               '16_133':59,   # Need to add 10 m3/s
               '21_11':487,   # Could also use 21_71
               '15_61':58,
               '2_605':56,
               '121_22':348,
               '212_11':386,
               '151_5':351,
               '28_7':355,
               '62_5':546,
               '6_78':626}

# List to store output
df_list = []

# Get list of Hydra II files to process
search_path = os.path.join(hyd_fold, '*.csv')
file_list = glob.glob(search_path)

# Loop over Hydra II data
for file_path in file_list:
    # Get RESA station ID
    name = os.path.split(file_path)[1][:-4]
    stn_id = stn_id_dict[name]
    
    # Parse file
    df = pd.read_csv(file_path, skiprows=2, index_col=0,
                     parse_dates=True, header=None, 
                     names=['xdate', 'xvalue'], 
                     na_values='-9999')
    
    # Get just records for year of interest
    # (a boolean mask avoids including 1 January of the following year)
    df = df[df.index.year == year].copy()

    # Linear interpolation and back-filling of NaN
    # (assign the result back rather than modifying the column in place)
    df['xvalue'] = df['xvalue'].interpolate(method='linear').bfill()
    
    # Remove HH:MM:SS part from dates
    df = df.resample('D').mean()

    # Add 10 m3/s to 16.133 (RESA2 ID 59)
    if stn_id == 59:
        df['xvalue'] = df['xvalue'] + 10.
        
    # Add other required cols and tidy
    df['dis_station_id'] = stn_id
    df['xcomment'] = np.nan
    df.reset_index(inplace=True)
    
    # Reorder cols
    df = df[['dis_station_id', 'xdate', 'xvalue', 'xcomment']]
    
    # Append to output
    df_list.append(df)

# Get list of files from Trine to process
search_path = os.path.join(tri_fold, '*.csv')
file_list = glob.glob(search_path)

# Loop over files from Trine
for file_path in file_list:
    # Get RESA station ID
    name = os.path.split(file_path)[1][:-4]
    stn_id = stn_id_dict[name]
    
    # Parse file
    df = pd.read_csv(file_path, skiprows=1, index_col=0,
                     parse_dates=True, header=None, 
                     sep=';', names=['xdate', 'xvalue'],
                     na_values='-9999')
    
    # Get just records for year of interest
    # (a boolean mask avoids including 1 January of the following year)
    df = df[df.index.year == year].copy()
    
    # Linear interpolation and back-filling of NaN
    # (assign the result back rather than modifying the column in place)
    df['xvalue'] = df['xvalue'].interpolate(method='linear').bfill()

    # Remove HH:MM:SS part from dates
    df = df.resample('D').mean()

    # Add 10 m3/s to 16.133 (RESA2 ID 59)
    if stn_id == 59:
        df['xvalue'] = df['xvalue'] + 10.
        
    # Add dis_id and tidy
    df['dis_station_id'] = stn_id
    df['xcomment'] = np.nan
    df.reset_index(inplace=True)
    
    # Reorder cols
    df = df[['dis_station_id', 'xdate', 'xvalue', 'xcomment']]
    
    # Append to output
    df_list.append(df)  

# Stack data
df = pd.concat(df_list, axis=0)

# Check length of df is as expected 
# Get number of days in year of interest
days = len(pd.date_range(start='%s-01-01' % year, 
                         end='%s-12-31' % year,
                         freq='D'))

assert len(df) == 11*days, 'Check datasets are complete for all sites.'

df.head()

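| | dis_station_id | xdate | xvalue | xcomment |
|---:|:--------------:|:----------:|:------:|:--------:|
| 0 | 348 | 2018-01-01 | 45.396 | |
| 1 | 348 | 2018-01-02 | 46.759 | |
| 2 | 348 | 2018-01-03 | 41.772 | |
| 3 | 348 | 2018-01-04 | 45.874 | |
| 4 | 348 | 2018-01-05 | 42.909 | |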


In [5]:
# Basic checking
assert df['xvalue'].dtypes == np.float64, 'Check for text in "xvalue" column.'
assert pd.isnull(df['xvalue']).sum() == 0, 'Check for NaN in "xvalue" column.'
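
If the length assertion on the stacked data fails, it does not say which station is at fault. The following is a rough sketch of a per-station completeness check. The helper name `incomplete_stations` and the demo data are hypothetical; the sketch only assumes the `dis_station_id` and `xdate` columns used in this notebook.

```python
import pandas as pd


def incomplete_stations(df, year):
    """Return missing-day counts for stations lacking a full year of data."""
    expected = pd.date_range(start="%s-01-01" % year,
                             end="%s-12-31" % year, freq="D")
    missing = {}
    for stn_id, grp in df.groupby("dis_station_id"):
        # Days of the year that have no record for this station
        n_missing = len(expected.difference(pd.DatetimeIndex(grp["xdate"])))
        if n_missing > 0:
            missing[stn_id] = n_missing
    return pd.Series(missing, dtype="int64")


# Demo with made-up data: station 57 is complete, station 59 lacks two days
full = pd.date_range("2018-01-01", "2018-12-31", freq="D")
demo = pd.concat([
    pd.DataFrame({"dis_station_id": 57, "xdate": full, "xvalue": 1.0}),
    pd.DataFrame({"dis_station_id": 59, "xdate": full[:-2], "xvalue": 1.0}),
])
print(incomplete_stations(demo, 2018))
```

An empty result means every station has one record per day for the year of interest.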

In [6]:
# Add new rows to database
#df.to_sql('discharge_values', con=engine, schema='resa2', 
#          if_exists='append', index=False) 

## 2. Modelled discharge

Each year, Stein Beldring supplies modelled data from HBV for the period from 1990 to the year of interest. These datasets are stored locally here:

    ...Elveovervakingsprogrammet\Data\hbv_modelled

and on the network here:

K:\Avdeling\Vass\316_Miljøinformatikk\Prosjekter\RID\Vannføring\Modellert

The flow files are named e.g. `hbv_00000001.var`, where the number corresponds to the NVE "vassdragsområde". These are listed in *vassomr.pdf* in the above folder, and they're also included in RESA2's `DISCHARGE_STATIONS` table. The vassdragsområde numbers are stored in the `NVE_SERIENUMMER` field.
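
As an illustration of the naming convention, the vassdragsområde number can be pulled out of a file name as sketched below. The helper name `parse_reg_nr` is hypothetical, and the pattern assumes that all flow files follow the `hbv_<digits>.var` form shown above.

```python
import os
import re


def parse_reg_nr(file_path):
    """Extract the vassdragsområde number from a name like hbv_00000001.var."""
    name = os.path.basename(file_path)
    match = re.fullmatch(r"hbv_(\d+)\.var", name)
    if match is None:
        # Fail loudly rather than guessing at an unexpected file name
        raise ValueError("Unexpected file name: %s" % name)
    return int(match.group(1))


print(parse_reg_nr("hbv_00000001.var"))
```

Converting to `int` drops the leading zeros, so the result can be compared directly with a numeric station number.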

Tore has an Access database in e.g.

K:\Avdeling\Vass\316_Miljøinformatikk\Prosjekter\RID\Vannføring\Modellert\NVE_MODELLERT_2016\vannføring

that first deletes the modelled NVE values for each station from 1990 onwards and then adds the new data, which includes everything from 1990 plus the additional year of data. The code below does the same, and performs some basic checking of the data at the same time.
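
Because the old rows are deleted before the new ones are added, a failure between the two steps could leave a station with no data from 1990 onwards. One way to guard against this is to run both steps in a single transaction. The sketch below illustrates the idea using Python's built-in `sqlite3` module as a stand-in for RESA2, purely so that the example is self-contained. The function name `replace_from_1990` and the simplified table layout are hypothetical.

```python
import sqlite3


def replace_from_1990(conn, dis_id, new_rows):
    """Delete rows from 1990 onwards for one station, then insert new_rows.

    new_rows is a list of (xdate, xvalue) tuples with ISO-format dates.
    Both steps are committed together or rolled back together.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "DELETE FROM discharge_values "
            "WHERE dis_station_id = ? AND xdate >= '1990-01-01'",
            (dis_id,),
        )
        conn.executemany(
            "INSERT INTO discharge_values (dis_station_id, xdate, xvalue) "
            "VALUES (?, ?, ?)",
            [(dis_id, d, v) for d, v in new_rows],
        )


# Stand-in database with a simplified table (not the real RESA2 schema)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE discharge_values "
    "(dis_station_id INTEGER, xdate TEXT, xvalue REAL)"
)
conn.executemany(
    "INSERT INTO discharge_values VALUES (?, ?, ?)",
    [(1, "1989-12-31", 5.0), (1, "1990-01-01", 6.0), (2, "1990-01-01", 7.0)],
)
conn.commit()

replace_from_1990(conn, 1, [("1990-01-01", 6.5), ("1990-01-02", 6.7)])
rows = conn.execute(
    "SELECT dis_station_id, xdate, xvalue FROM discharge_values "
    "ORDER BY dis_station_id, xdate"
).fetchall()
print(rows)
```

In this sketch, rows before 1990 and rows for other stations are left untouched; only the station being updated has its series from 1990 onwards replaced.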

### Data delivery

 * **2016:** See e-mail from Stein received 13.06.2017 at 12.17
 * **2017:** See e-mail from Stein received 14.08.2018 at 08.51
 * **2018:** See e-mail from Stein received 31.05.2019 at 09.25

In [7]:
# Year of interest
year = 2018

# Folder containing modelled data
data_fold = r'../../../Data/hbv_modelled/RID_2018'

In [8]:
# Get a list of files to process (only interested in flow here)
search_path = os.path.join(data_fold, 'hbv_*.var')
file_list = glob.glob(search_path)

# Get number of days between 1990 and year of interest
days_new = len(pd.date_range(start='1990-01-01', 
                             end='%s-12-31' % year,
                             freq='D'))

# Get number of days between 1990 and year before
days_old = len(pd.date_range(start='1990-01-01', 
                             end='%s-12-31' % (year-1),
                             freq='D'))

# Loop over files
for file_path in file_list:
    # Get name and reg. nr.
    name = os.path.split(file_path)[1]
    reg_nr = int(name.split('_')[1][:-4])
    
    # Get RESA2 station ID
    sql = ("SELECT dis_station_id FROM resa2.discharge_stations "
           "WHERE nve_serienummer = '%s'" % reg_nr)
    dis_id = pd.read_sql_query(sql, engine).iloc[0,0]

    # Check number of post-1990 records already in db
    # (should equal days_old)
    sql = ("SELECT COUNT(*) FROM resa2.discharge_values "
           "WHERE dis_station_id = %s "
           "AND xdate >= DATE '1990-01-01'" % dis_id)    
    cnt_old = pd.read_sql_query(sql, engine).iloc[0,0]    
    assert cnt_old == days_old, 'Unexpected number of records already in database.'
    
    # Read new data
    df = pd.read_csv(file_path, sep=r'\s+', 
                     header=None, names=['XDATE', 'XVALUE'])
    
    # Convert dates
    df['XDATE'] = pd.to_datetime(df['XDATE'], format='%Y%m%d/1200')

    # Check st, end and length
    assert df['XDATE'].iloc[0] == pd.Timestamp('1990-01-01'), 'New series does not start on 01/01/1990.'
    assert df['XDATE'].iloc[-1] == pd.Timestamp('%s-12-31' % year), 'New series does not end on 31/12/%s.' % year
    assert len(df) == days_new, 'Unexpected length for new series.'
    
    # Add station ID to df
    df['DIS_STATION_ID'] = dis_id
    
    # Drop existing rows post-1990 for this site
    sql = ("DELETE FROM resa2.discharge_values "
           "WHERE dis_station_id = %s "
           "AND xdate >= DATE '1990-01-01'" % dis_id)
    res = engine.execute(sql)
    
    # Add new rows
    df.to_sql('discharge_values', con=engine, schema='resa2', 
              if_exists='append', index=False)    