# Creating an iES experiment

The iterative ensemble smoother as implemented in PEST++ is an ensemble-based version of the Gauss-Levenberg-Marquardt (GLM), gradient-based regression technique for minimizing a weighted sum-of-squared residuals objective function for history matching. This approach is similar to an Ensemble Kalman Filter in which all the data are assimilated in a single batch.
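As a reference point for the rest of the lesson, the objective function (phi) mentioned above can be sketched in a few lines. This is a minimal illustration and not PEST++ code; the function name `weighted_ssr` and the sample numbers are made up for this example.

```python
def weighted_ssr(observed, simulated, weights):
    """Weighted sum of squared residuals (phi)."""
    return sum((w * (o - s)) ** 2 for o, s, w in zip(observed, simulated, weights))

# Made-up values for illustration only
observed = [10.0, 20.0, 30.0]
simulated = [12.0, 19.0, 30.0]
weights = [1.0, 0.5, 0.0]  # a zero weight removes an observation from phi

phi = weighted_ssr(observed, simulated, weights)
print(phi)
```

Because each residual is multiplied by its weight before squaring, changing the weights changes which observations dominate phi. That is the lever used in the reweighting steps later in this lesson.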

We adopt the following steps for performing history matching using iES. 

<img src="./images/iES_workflow.png" width="500"/>

In the previous lesson we performed the prior MC run, and in this lesson we will go through the rest of the steps shown in the flowchart. 

In [None]:
import pandas as pd
import pyemu
import sys
import shutil
import os
import numpy as np
from pathlib import Path
from datetime import datetime as dt
import matplotlib.pyplot as plt
from pytsproc import filters, series_metrics
plt.rcParams['font.size']=12
%matplotlib inline
import warnings
warnings.simplefilter("ignore", DeprecationWarning)

## **Step 1. Define the working directories**

In [None]:
# Define the path to the single model run PEST++ host directory 
singledir = Path('/home/docker/wrf-hydro-training/output/lesson4/host')

# Define the path to the prior MC PEST++ host directory 
priordir = Path('/home/docker/wrf-hydro-training/output/lesson5/host') 

# Define where the iES work directory would be
iesdir = Path('/home/docker/wrf-hydro-training/output/lesson6/iES_Run') 

# Define where to save plots generated in this notebook
plotdir = Path('/home/docker/wrf-hydro-training/output/lesson6/plots')

In [None]:
plotdir.mkdir(parents=True, exist_ok=True)

## **Step 2. Define the calibration, validation, and spin-up periods**
Below is a short description of each period:
* burn_in: the period whose simulations are discarded. The model needs time to adjust to the changed parameters and to avoid instabilities, so this period is used in neither calibration nor validation. In a real experiment, it could be a year long.
* calibration: the period used for parameter estimation.
* validation: an independent period used to verify model performance after history matching.
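The way these three periods partition the simulation can be sketched as follows. The dates mirror the ones defined in the next cell, where the validation period comes before the calibration period; `label_period` is a hypothetical helper written for this illustration only.

```python
from datetime import datetime

# Dates mirror the ones used in this lesson
simulation_start = datetime(2018, 8, 1)
valid_start = datetime(2018, 8, 2)
calib_start = datetime(2018, 8, 10)

def label_period(timestamp):
    """Return the period a timestamp falls into (hypothetical helper)."""
    if timestamp < valid_start:
        return "burn_in"
    if timestamp < calib_start:
        return "validation"
    return "calibration"

samples = [datetime(2018, 8, 1, 12), datetime(2018, 8, 5), datetime(2018, 8, 15)]
labels = [label_period(t) for t in samples]
print(labels)
```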

In [None]:
simulation_start_date = '2018-08-01'
valid_start_date = '2018-08-02'
calib_start_date = '2018-08-10'

## **Step 3. Read in the PESTPP control file from the single model run**

We can read the setup of an existing model run with the pyEMU library, modify it, and write it out as a new model run.
Here we follow the steps shown in blue in the flowchart, starting from the single model run.
Let us read the experiment and look at the parameters and observations used in parameter estimation.


In [None]:
pst = pyemu.Pst(str(singledir / 'wrfpst.pst'), resfile=str(singledir / 'wrfpst.base.rei'))

The parameters used in the history matching experiment are stored in `parameter_data`.

In [None]:
pst.parameter_data.head()

The observations used in the history matching are stored in `observation_data`.

In [None]:
pst.observation_data.head()

We can display the different components of the phi calculation. Here we have only one observation category, called `streamflow`. In this lesson we will define more groups and weight them differently to form the objective function.

In [None]:
pst.plot(kind='phi_pie')

## **Step 4. Categorize the observations**

Let's start by defining the burn-in and validation periods.

In [None]:
# make a copy of the observation data 
obs = pst.observation_data.copy()

obs.loc[obs.obsval==-9999, 'obsval'] = np.nan

def parsename(cn):
    '''
    parse the dates from the WRF_hydro obs names
    '''
    tmp = cn.replace('obs_','')
    return dt.strptime(tmp, '%Y%m%d_%H0000')

# get the time from the observation name (obs_YYYYmmdd_HH0000);
# the first row is the KGE entry, which has no date
obs['dtime'] = [np.nan] + [parsename(i) for i in obs.iloc[1:].index]

# label the burn-in period
obs.loc[obs.dtime<valid_start_date, 'obgnme'] = 'burn_in'

# label the validation period
obs.loc[(obs.dtime >= valid_start_date) & (obs.dtime < calib_start_date), 'obgnme'] = 'validation'

In [None]:
# copy over the updated obsgroups to the original
pst.observation_data.obgnme = obs.obgnme.values

obgnames = obs.obgnme.copy()

obgnames.loc['kge'] = 'kge'
print(obgnames.unique())

obs['discharge'] = obs.obsval

In [None]:
obs.iloc[:]

In [None]:
# set index to datetime
obs.set_index('dtime', drop=False, inplace=True)

# plot the full period of simulation
ax = obs.discharge[1:].plot(figsize=(14,4))
ax.axvline('2018-08-02', c='orange', alpha=.4);
ax.axvline('2018-08-10', c='green', alpha=.4);

### For the next analysis, trim off the KGE entry and the burn-in and validation periods

In [None]:
# drop the KGE row and keep only the calibration period
obs = obs.iloc[1:].loc[obs.dtime>= calib_start_date]

# Finding the NaN streamflows
print(obs.loc[obs.discharge.isnull()])

# fill in the NaN discharge values with linear interpolation
obs['discharge']=obs.discharge.interpolate()

# flip back the NaN obs values to -9999
obs.loc[obs.obsval.isnull(), 'obsval'] = -9999

### Quantiles

In this step, we define the streamflow quantiles and categorize each flow value by the quantile range it falls into.
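The binning logic can be sketched without pandas as follows. The `quantile` helper uses linear interpolation between neighbouring values, which is also the default of `pandas.Series.quantile`. The flow values are made up, and the helper names are specific to this illustration.

```python
import math
from bisect import bisect_left

def quantile(sorted_vals, q):
    """Quantile with linear interpolation between neighbouring values."""
    pos = q * (len(sorted_vals) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

flows = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # made-up discharge values
n_groups = 4
thresholds = [quantile(sorted(flows), (i + 1) / n_groups) for i in range(n_groups)]

# bisect_left finds the first threshold >= flow, so group k holds
# flows in the half-open range (thresholds[k-1], thresholds[k]]
groups = [f"q{bisect_left(thresholds, f) + 1}" for f in flows]
print(thresholds)
print(groups)
```

Using the upper bound of each range as the inclusive edge matches the `<=` / `>` comparisons in the cell below, so the maximum flow always lands in the last group.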


In [None]:
# set number of quantiles
quantiles=4

quantile_vals = [obs.discharge.quantile(((i+1)/quantiles)) for i in range(quantiles)]

quantile_vals

#identify the locations of the quantiles
obs['quantile_grp'] = np.nan
for i,q_current in zip(range(1,quantiles+1),quantile_vals):
    if i==1:
        obs.loc[obs.discharge<=q_current, 'quantile_grp'] = 'q1'
    else:
        obs.loc[(obs.discharge <= q_current) & (obs.discharge>quantile_vals[i-2]), 'quantile_grp'] = f'q{i}'
        

assert len(obs.loc[obs.quantile_grp.isnull()]) == 0

obs.quantile_grp.unique()

for cn,cg in obs.groupby('quantile_grp'):
    print(cn, len(cg))

### Event based weighting

Next, we define events using the function `hydro_events`, which returns the start and end dates of each event as well as the peak time.
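Once the event windows are known, every observation that falls inside a window is moved from its quantile group into the `event` group. The sketch below shows only that relabeling step. The event window and observations are made up, and it does not reproduce how `hydro_events` detects events.

```python
from datetime import datetime

# Hypothetical event windows (start, end); in the lesson these come from hydro_events
events = [(datetime(2018, 8, 12), datetime(2018, 8, 14))]

# Made-up observations: timestamp -> quantile group
groups = {
    datetime(2018, 8, 11): "q1",
    datetime(2018, 8, 13): "q4",
    datetime(2018, 8, 14): "q3",
    datetime(2018, 8, 16): "q2",
}

for start, end in events:
    for t in groups:
        if start <= t <= end:  # both ends of the window are inclusive
            groups[t] = "event"

labels = list(groups.values())
print(labels)
```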

In [None]:
Qhe = series_metrics.hydro_events(obs,  wlen=50, prominence=25, height=2)

In [None]:
ax = obs.discharge.plot(figsize=(14,4))
[ax.axvline(i, c='orange', alpha=.4) for i in Qhe[1]['event_starts']];
[ax.axvline(i, c='green', alpha=.4) for i in Qhe[1]['event_ends']];


In [None]:
Qhe[1]

In [None]:
# Assign the new group naming 
obs.obgnme = obs.quantile_grp
for st, en in zip(Qhe[1]['event_starts'], Qhe[1]['event_ends']):
    obs.loc[(obs.index>=st) & (obs.index<=en), 'obgnme'] = 'event'
obs.loc[obs.obgnme=='event']

### Now bring the new observation group names back to `pst.observation_data`

In [None]:
pst.observation_data.loc[obs.obsnme, 'obgnme'] = obs.obgnme.values

# only the KGE entry still has the group name 'streamflow'
pst.observation_data.loc[pst.observation_data.obgnme=='streamflow', 'obgnme'] = 'kge'
pst.observation_data.loc[pst.observation_data['obgnme'] == 'kge', 'weight'] = 0

pst.observation_data.loc[pst.observation_data.index.isin(obs.obsnme), 'obgnme'] = obs.obgnme.values

pst.observation_data.loc[pst.observation_data['obsval'] == -9999, 'obgnme'] = 'burn_in'
pst.observation_data.loc[pst.observation_data['obsval'] == -9999, 'weight'] = 0
pst.observation_data.loc[pst.observation_data['obgnme'] == 'burn_in', 'weight'] = 0
pst.observation_data

### Now let us check the contribution of each category to the phi 

In [None]:
pst.plot(kind='phi_pie')

## **Step 5. Reweight observations based on a 10% coefficient of variation (CV)**
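The factor of 10 in the next cell follows from the CV assumption: if the standard deviation of an observation is taken to be 10% of its value, the weight (the inverse of the standard deviation) is `1 / (0.1 * obsval) = 10 / obsval`. A minimal sketch, with `weight_from_cv` as a hypothetical helper name:

```python
import math

def weight_from_cv(obsval, cv=0.10):
    """Weight as the inverse of an assumed standard deviation cv * obsval."""
    if obsval <= 0:
        raise ValueError("a non-positive observation has no finite CV-based weight")
    return 1.0 / (cv * obsval)

w = weight_from_cv(50.0)
print(w, math.isclose(w, 10 / 50.0))
```

Under this scheme low flows receive larger weights than high flows, and a zero flow would give an infinite weight, which is why the weights are checked for infinities before they are rebalanced.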

In [None]:
pst.observation_data.loc[pst.observation_data.obgnme != 'kge' , 'weight'] = \
    10 / pst.observation_data.loc[pst.observation_data.obgnme != 'kge' , 'obsval']

### Let us also set the weights for the burn-in and validation periods to 0 so they do not contribute to phi during calibration.

In [None]:
pst.observation_data.loc[pst.observation_data['obsval'] == -9999, 'obgnme'] = 'burn_in'
pst.observation_data.loc[pst.observation_data['obsval'] == -9999, 'weight'] = 0

pst.observation_data.loc[pst.observation_data.obgnme == 'burn_in' , 'weight'] = 0 
pst.observation_data.loc[pst.observation_data.obgnme == 'validation' , 'weight'] = 0 

pst.res['weight'] = pst.observation_data.weight.values # have to trick the residuals to know about the new weights

In [None]:
pst.observation_data.head()

In [None]:
pst.observation_data.loc[pst.observation_data.obgnme=='streamflow']

In [None]:
pst.plot(kind='phi_pie')

## **Step 6. Provide weighting for each category**

The objective function for iES is the weighted sum of squared errors. Although it is a single quantity, the weights assigned to each observation balance the contributions of the various observation categories (e.g., streamflow in different flow regimes, peaks, recession periods, and others). This allows the history matching process to be tuned toward parameters that suit a specific output category. In this training we target superior performance during events, and therefore assign half of the total weight to the `event` category.
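The idea of rebalancing can be sketched as follows: the weights of each group are scaled by a common factor so that the group's contribution to phi equals a chosen target. This illustrates the concept only and is not pyemu's implementation of `adjust_weights`. The function `rebalance` and all numbers are made up.

```python
import math

def rebalance(weights, residuals, groups, targets):
    """Scale weights so each group contributes targets[group] to phi."""
    phi_by_group = {}
    for w, r, g in zip(weights, residuals, groups):
        phi_by_group[g] = phi_by_group.get(g, 0.0) + (w * r) ** 2
    new_weights = []
    for w, g in zip(weights, groups):
        current = phi_by_group[g]
        # a group with zero phi cannot be rescaled; leave its weights alone
        factor = math.sqrt(targets[g] / current) if current > 0 else 1.0
        new_weights.append(w * factor)
    return new_weights

weights = [1.0, 1.0, 1.0]
residuals = [3.0, 4.0, 2.0]
groups = ["event", "event", "q1"]
targets = {"event": 50.0, "q1": 10.0}

new_weights = rebalance(weights, residuals, groups, targets)
new_phi = {
    g: sum((w * r) ** 2 for w, r, gg in zip(new_weights, residuals, groups) if gg == g)
    for g in targets
}
print(new_phi)
```

The square root appears because phi depends on the squared weights. Note that a target of zero drives all weights in that group to zero, which is how the `burn_in`, `validation`, and `kge` groups are excluded in the cells below.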

In [None]:
# make sure there are no zero flow values, which would produce infinite weights
assert not np.isinf(pst.observation_data.weight.values).any()

In [None]:
new_portions = {'burn_in': 0.0,
 'validation':0,
 'event': 0.5,
 'kge': 0.0,
 'q1': 0.1,
 'q2': 0.1,
 'q3': 0.1,
 'q4': 0.2}

In [None]:
new_portions = {k:v*pst.nnz_obs for k,v in 
                new_portions.items()}

In [None]:
pst.res['group'] = pst.observation_data.obgnme.values # have to trick the residuals to know about the new obs groups
pst.res['weight'] = pst.observation_data.weight.values # ... and about the new weights
pst.adjust_weights(obsgrp_dict=new_portions)

In [None]:
pst.plot(kind='phi_pie')

## **Step 7. Recalculate the objective function (phi) with new weights to perform rejection sampling**

In [None]:
obens = pyemu.ObservationEnsemble.from_csv(pst, str(priordir / 'wrfpst.0.obs.csv'),
                                           index_col=0, dtype={'real_name':str})

In [None]:
obens.head()

In [None]:
phi = obens.phi_vector
print(len(phi))

In [None]:
#No noise
phi.hist(bins=10)

### Perform rejection sampling based on the prior MC runs

To perform rejection sampling, we examine the objective function calculated with the new weights and define a cutoff that removes the outliers. In this small example we probably do not benefit from rejection sampling; it is included for training purposes only.
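In its simplest form, rejection sampling is a filter on the per-realization phi values. The sketch below uses made-up phi values and realization names; only the cutoff of 1200 is taken from this lesson.

```python
# Made-up phi values keyed by realization name
phi = {"0": 350.0, "1": 2400.0, "2": 980.0, "base": 610.0}
phi_cutoff = 1200.0

# Keep only the realizations whose phi is below the cutoff
reals_to_keep = [name for name, value in phi.items() if value < phi_cutoff]
print(reals_to_keep)
```

A histogram of phi, like the one plotted above, is a practical way to choose the cutoff: it should remove the poorly performing tail without discarding most of the ensemble.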

In [None]:
phicutoff = 1200

In [None]:
print(len(phi))
phi = phi.loc[phi<phicutoff]
print(len(phi))


In [None]:
reals_to_keep = phi.index.values
reals_to_keep

In [None]:
prior_pars = pd.read_csv(priordir / 'wrfpst.0.par.csv', index_col=0, dtype={'real_name':str})
pp = prior_pars.loc[reals_to_keep]
# renumber the kept realizations; this assumes 'base' is the last row
pp.index = [str(i) for i in range(len(pp)-1)] + ['base']
pp.to_csv(priordir / 'wrfpst.starting_pars.csv')
oe = obens._df.loc[reals_to_keep].copy()
oe.index = pp.index
oe.to_csv(priordir / 'wrfpst.starting_obs.csv')

## **Step 8. Create iES run directory**
Let us first delete the iES folder if it already exists.

In [None]:
if iesdir.exists():
    shutil.rmtree(iesdir)

Let us create the run directory by copying the single model run directory.

In [None]:
!cp -r ~/wrf-hydro-training/output/lesson4/Single_Model_Run/ ~/wrf-hydro-training/output/lesson6/iES_Run

Now we can modify the PEST++ control files and overwrite the files in the iES run directory. Let us start by preparing the new starting observation and parameter files for the experiment.

In [None]:
prior_pars = pd.read_csv(priordir / 'wrfpst.0.par.csv', index_col= 0, dtype={'real_name':str})
pp = prior_pars.loc[reals_to_keep]
pp.index = [str(i) for i in range(len(pp)-1)] + ['base']
pp.to_csv(iesdir/'wrfpst.starting_pars.csv')
oo = obens._df.loc[reals_to_keep].copy()
oo.index = pp.index
oo.to_csv(iesdir/'wrfpst.starting_obs.csv')

### User-specified observation file with added noise
The iES algorithm is predicated on the assumption that the observed values are corrupted by noise. PEST++ offers three ways to handle this:
* Ignore the noise (`ies_no_noise = true`, and no observation ensemble provided). All members of the observation ensemble are then identical.
* Let PEST++ generate the noise (`ies_no_noise = false`, and no observation ensemble provided). The observation errors are then additive noise sampled from an assumed pdf of observation error. Time series generated this way are usually not smooth, since the noise can be positive at one time step and negative at the next.
* Supply an external text file containing observation realizations with added noise. This makes it possible to provide an observation ensemble with temporally autocorrelated additive noise.

Ideally, we would like smooth hydrographs, in particular to preserve the diurnal cycles in the model, so here we use the third option. We have created an observation file with added noise, which is used instead of the single observation time series. Note that the file was created for a case with 300 ensemble members; due to compute limitations, we use only a subset of those members here.
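The difference between uncorrelated and temporally autocorrelated noise can be illustrated with a first-order autoregressive, AR(1), process. This is an assumption made for illustration: the lesson does not state how the supplied noise file was generated, and the function names and parameter values below are hypothetical.

```python
import random

def ar1_noise(n, rho, sigma, rng):
    """AR(1) series with lag-1 correlation rho and marginal std sigma."""
    innovation_std = sigma * (1.0 - rho ** 2) ** 0.5
    x = rng.gauss(0.0, sigma)
    series = [x]
    for _ in range(n - 1):
        x = rho * x + rng.gauss(0.0, innovation_std)
        series.append(x)
    return series

def lag1_autocorr(series):
    """Sample lag-1 autocorrelation."""
    mean = sum(series) / len(series)
    dev = [v - mean for v in series]
    return sum(a * b for a, b in zip(dev, dev[1:])) / sum(d * d for d in dev)

rng = random.Random(42)
white = [rng.gauss(0.0, 1.0) for _ in range(2000)]       # sign can flip every step
smooth = ar1_noise(2000, rho=0.95, sigma=1.0, rng=rng)   # neighbouring values stay close

print(round(lag1_autocorr(white), 2), round(lag1_autocorr(smooth), 2))
```

Adding a series like `smooth` to the observed hydrograph shifts it up or down gradually, whereas adding `white` produces a jagged trace that can obscure features such as diurnal cycles.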

In [None]:
obs_noise_ens = pd.read_csv('/home/docker/wrf-hydro-training/example_case/OBS/obs_noise_01473000_300ens_201808.csv', 
                            index_col = 'real_name', dtype={'real_name':str})
one = obs_noise_ens.loc[reals_to_keep].copy()
one.index=pp.index
one.to_csv(iesdir/'wrfpst.starting_obs+noise.csv')

In [None]:
obs_noise_ens.head()

### Modify the PEST++ Control File
Last but not least, we modify entries of the PEST++ control file and set some problem-specific PESTPP-IES options.

In [None]:
len(reals_to_keep)

In [None]:
pst.control_data.noptmax=2
pst.pestpp_options["ies_num_reals"] = len(reals_to_keep)
pst.pestpp_options["overdue_giveup_minutes"] = 200
pst.pestpp_options["ies_no_noise"] = 'false'
pst.pestpp_options["ies_observation_ensemble"] = 'wrfpst.starting_obs+noise.csv'
pst.pestpp_options["ies_restart_observation_ensemble"] = 'wrfpst.starting_obs.csv'
pst.pestpp_options["ies_parameter_ensemble"] = 'wrfpst.starting_pars.csv'

pst.write(iesdir/'wrfpst.pst', version=2)

## **Step 9. Running PEST++ with WRF-Hydro**



In [None]:
# rel_path (str, optional) – the relative path to where pest(++) should be run from within the worker_dir, defaults to the uppermost level of the worker dir. 

pyemu.utils.os_utils.start_workers(worker_dir = "/home/docker/wrf-hydro-training/output/lesson6/iES_Run", 
                                   exe_rel_path = "pestpp-ies", 
                                   pst_rel_path = "wrfpst.pst", 
                                   num_workers=8, 
                                   worker_root='/home/docker/wrf-hydro-training/output/lesson6/',
                                   master_dir = "/home/docker/wrf-hydro-training/output/lesson6/host",
                                   port=4004, 
                                   verbose = True, 
                                   cleanup = False)


## **Step 10. Let us check the run directory**

In [None]:
%%bash 
ls  /home/docker/wrf-hydro-training/output/lesson6/

In [None]:
%%bash 
ls  /home/docker/wrf-hydro-training/output/lesson6/host

The message reporting a successful PEST++ finish can be found in the `wrfpst.rec` file.

In [None]:
%%bash 
cat  /home/docker/wrf-hydro-training/output/lesson6/host/wrfpst.rec | tail -n 5

Let us check the contents of the worker folder.

In [None]:
%%bash 
ls  /home/docker/wrf-hydro-training/output/lesson6/worker_0/

Now we could continue to the next lesson and verify the results of the history matching exercise using the iES. 