# data worth and related assessments

In this notebook, we will use outputs from previous notebooks (in particular `pestpp-glm_part1.ipynb`) to undertake data worth assessments based on first-order second-moment (FOSM) techniques. "Worth" is framed here as the extent to which data collection reduces the uncertainty surrounding a model prediction of management interest. Given that these analyses can help target and optimize data acquisition strategies, this is a concept that really resonates with decision makers.

In [None]:
%matplotlib inline
import os
import shutil
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
plt.rcParams['font.size']=12
import flopy
import pyemu


In [None]:
m_d = "master_glm"

In [None]:
pst = pyemu.Pst(os.path.join(m_d,"freyberg_pp.pst"))
print(pst.npar_adj)
pst.write_par_summary_table(filename="none")

first ingredient: parameter covariance matrix (representing prior uncertainty in this instance)

In [None]:
cov = pyemu.Cov.from_binary(os.path.join(m_d,"prior_cov.jcb")).to_dataframe()
cov = cov.loc[pst.adj_par_names,pst.adj_par_names]
cov = pyemu.Cov.from_dataframe(cov)

In [None]:
# inspect the prior covariance matrix; mask near-zero entries so the structure stands out
x = cov.x.copy()
x[x<1e-7] = np.nan
c = plt.imshow(x)
plt.colorbar()

In [None]:
pst.adj_par_groups

second ingredient: jacobian matrix

In [None]:
jco = os.path.join(m_d,"freyberg_pp.jcb")

the third ingredient--the (diagonal) noise covariance matrix--populated on-the-fly using weights when constructing the Schur object below...
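A minimal sketch of how a diagonal noise covariance matrix can be built from weights. It assumes the usual PEST convention that a weight is the inverse of the measurement noise standard deviation; the weights below are made up, and this is not pyemu's internal code. Zero-weighted observations are simply left out.

```python
import numpy as np

# hypothetical weights for four observations (0.0 means "not used")
weights = np.array([2.0, 2.0, 0.5, 0.0])

# keep only the non-zero weighted observations
nz = weights > 0.0

# assumption: weight = 1 / (noise standard deviation),
# so the noise variance is 1 / weight**2
noise_var = 1.0 / weights[nz] ** 2
obscov = np.diag(noise_var)
print(obscov)
```

Under this assumption, a larger weight means a smaller noise variance, which is how the weights in the control file end up controlling how much each observation can condition the parameters.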

In [None]:
sc = pyemu.Schur(jco=jco,parcov=cov)

In [None]:
sc

### there we have it--all computations done and contained within `sc`.  We will only be required to access different parts of `sc` below...

### Parameter uncertainty

First let's inspect the (approx) posterior parameter covariance matrix and the reduction in parameter uncertainty through "data assimilation", before mapping to forecasts... (note that this matrix is ${\it not}$ forecast-specific)
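For intuition, the Schur complement update can be written as $\Sigma_{post} = \Sigma - \Sigma J^T (J \Sigma J^T + R)^{-1} J \Sigma$, where $\Sigma$ is the prior parameter covariance, $J$ the jacobian and $R$ the noise covariance. The sketch below applies this with NumPy to small random matrices; all of the values are hypothetical and it is not how `pyemu.Schur` is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(0)
npar, nobs = 5, 3

J = rng.normal(size=(nobs, npar))      # hypothetical jacobian (obs x par)
prior = np.diag(np.full(npar, 1.0))    # hypothetical prior parameter covariance
noise = np.diag(np.full(nobs, 0.1))    # hypothetical diagonal noise covariance

# Schur complement: prior - prior J^T (J prior J^T + noise)^-1 J prior
S = J @ prior @ J.T + noise
post = prior - prior @ J.T @ np.linalg.solve(S, J @ prior)

# percent reduction in each parameter's variance
prior_var = np.diag(prior)
post_var = np.diag(post)
percent_reduction = 100.0 * (prior_var - post_var) / prior_var
print(percent_reduction.round(1))
```

Note that no observed values appear anywhere: only sensitivities, prior uncertainty and noise are needed, which is what makes data worth analysis possible before any data are collected.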

In [None]:
sc.posterior_parameter.to_dataframe().sort_index(axis=1).iloc[100:105,100:105]

In [None]:
x = sc.posterior_parameter.x.copy()
x[x<1e-7] = np.nan
c = plt.imshow(x)
plt.colorbar(c)

We can see the posterior variance for each parameter along the diagonal. The off-diags are symmetric.

In [None]:
par_sum = sc.get_parameter_summary().sort_values("percent_reduction",ascending=False)
par_sum

In [None]:
par_sum.loc[par_sum.index[:25],"percent_reduction"].plot(kind="bar",color="turquoise")

What have we achieved by "notionally calibrating" our model to 13 head and 1 stream flow observations? Which parameters are informed? Will they matter for the forecast of interest? Which ones are un-informed?

In [None]:
pst.nnz_obs_names

## Forecast uncertainty

In [None]:
forecasts = sc.pst.pestpp_options['forecasts'].split(",")
forecasts

In [None]:
df = sc.get_forecast_summary()
df

In [None]:
# make a pretty plot 
fig = plt.figure()
ax = plt.subplot(111)
ax = df["percent_reduction"].plot(kind='bar',ax=ax,grid=True)
ax.set_ylabel("percent uncertainty\nreduction from calibration")
ax.set_xlabel("forecast")
plt.tight_layout()

Surprise, surprise... Some forecasts benefit from calibration, some do not! 

### Before moving onto data worth, let's look at the contribution of different parameters to forecast uncertainty

Parameter contributions to uncertainty are quantified by "fixing" parameters (or parameter groups), that is, treating them as perfectly known, and observing the resulting reduction in forecast uncertainty. This approach is of course subject to some sizable assumptions related to parameter representativeness, but it can be very informative. Let's do this by group.
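The "fixing" idea can be sketched with NumPy as follows. This is a toy under stated assumptions: the jacobian, forecast sensitivities and group names are hypothetical, and the prior is taken to be diagonal so that fixing a group amounts to dropping its columns (with a correlated prior, the remaining parameters would also need to be conditioned on the fixed ones).

```python
import numpy as np

def forecast_variance(J, prior, noise, y):
    """Posterior variance of a forecast with sensitivity vector y."""
    S = J @ prior @ J.T + noise
    post = prior - prior @ J.T @ np.linalg.solve(S, J @ prior)
    return float(y @ post @ y)

rng = np.random.default_rng(1)
npar, nobs = 6, 3
J = rng.normal(size=(nobs, npar))   # hypothetical jacobian
prior = np.eye(npar)                # assumption: diagonal (uncorrelated) prior
noise = 0.1 * np.eye(nobs)          # hypothetical noise covariance
y = rng.normal(size=npar)           # hypothetical forecast sensitivities

# hypothetical parameter groups (indices into the parameter vector)
groups = {"hk": [0, 1, 2], "rch": [3, 4], "wel": [5]}

base = forecast_variance(J, prior, noise, y)
contrib = {}
for name, idx in groups.items():
    keep = [i for i in range(npar) if i not in idx]
    # "fixing" a group: with a diagonal prior this amounts to dropping its columns
    var_fixed = forecast_variance(J[:, keep], prior[np.ix_(keep, keep)],
                                  noise, y[keep])
    # percent reduction in forecast variance relative to the base case
    contrib[name] = 100.0 * (base - var_fixed) / base

print(contrib)
```

Each value is the percent of the base posterior forecast variance that would disappear if that group were known perfectly.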

In [None]:
par_contrib = sc.get_par_group_contribution()

In [None]:
par_contrib.head()

In [None]:
base = par_contrib.loc["base",:]
par_contrib = 100.0 * (base - par_contrib) / base  # percent reduction relative to the base variance
par_contrib.sort_index()

In [None]:
for forecast in par_contrib.columns:
    fore_df = par_contrib.loc[:,forecast].copy()
    fore_df.sort_values(inplace=True, ascending=False)
    ax = fore_df.iloc[:10].plot(kind="bar",color="b")
    ax.set_title(forecast)
    ax.set_ylabel("percent variance reduction")
    plt.show()

### Data worth

### what is the worth of ${\it existing}$ observations?

What is happening under the hood is that we are recalculating the Schur complement without some of the observations to see how the posterior forecast uncertainty increases (wrt a "base" condition in which we have all observation data available).
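A rough NumPy sketch of that leave-one-out loop, using small random matrices (all hypothetical); it illustrates the idea rather than the pyemu implementation.

```python
import numpy as np

def forecast_variance(J, prior, noise_var, y, use):
    """Posterior forecast variance using only the observations listed in `use`."""
    if len(use) == 0:
        return float(y @ prior @ y)
    Ju = J[use, :]
    S = Ju @ prior @ Ju.T + np.diag(noise_var[use])
    post = prior - prior @ Ju.T @ np.linalg.solve(S, Ju @ prior)
    return float(y @ post @ y)

rng = np.random.default_rng(2)
npar, nobs = 5, 4
J = rng.normal(size=(nobs, npar))   # hypothetical jacobian
prior = np.eye(npar)                # hypothetical prior covariance
noise_var = np.full(nobs, 0.1)      # hypothetical noise variances
y = rng.normal(size=npar)           # hypothetical forecast sensitivities

all_obs = list(range(nobs))
base = forecast_variance(J, prior, noise_var, y, all_obs)

removed = {}
for i in all_obs:
    use = [k for k in all_obs if k != i]   # everything except observation i
    removed[f"obs{i}"] = forecast_variance(J, prior, noise_var, y, use)

# percent increase in forecast variance relative to the all-observations base
pct_increase = {k: 100.0 * (v - base) / base for k, v in removed.items()}
print(pct_increase)
```

Removing an observation can never decrease the posterior variance, so every entry is non-negative; the largest entries flag the observations the forecast relies on most.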

In [None]:
dw_rm = sc.get_removed_obs_importance()
dw_rm.head()

Here the ``base`` row contains the results of the Schur complement calculation (in terms of forecast uncertainty variance) using all observations.  

In [None]:
# let's normalize to make more meaningful comparisons of data worth (percent increase in uncertainty variance relative to base)
base = dw_rm.loc["base",:]
dw_rm = 100 * (dw_rm - base) / base
dw_rm.head()

In [None]:
for forecast in dw_rm.columns:
    fore_df = dw_rm.loc[:,forecast].copy()
    fore_df.sort_values(inplace=True, ascending=False)
    ax = fore_df.iloc[:10].plot(kind="bar",color="b")
    ax.set_title(forecast)
    ax.set_ylabel("percent variance increase")
    plt.show()

There is also an option to calculate the worth of observations by taking a "base" condition of zero observation (i.e., a priori) and calculating the reduction in uncertainty through adding observations to the dataset. 

In [None]:
dw_ad = sc.get_added_obs_importance()
base = dw_ad.loc["base",:]
dw_ad = 100 * (base - dw_ad) / base
for forecast in dw_ad.columns:
    fore_df_ad = dw_ad.loc[:,forecast].copy()
    fore_df_ad.sort_values(inplace=True, ascending=False)
    ax = fore_df_ad.iloc[:20].plot(kind="bar",color="b")
    ax.set_title(forecast)
    ax.set_ylabel("percent variance decrease")
    plt.show()

Do these two approaches give the same answer? They shouldn't.. Why? Let's discuss..
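One way to see the difference is a toy problem with redundant data. In the sketch below (all values hypothetical), observations 0 and 1 both measure the first parameter, while observation 2 is the only one that measures the second; the forecast depends equally on both parameters.

```python
import numpy as np

def forecast_variance(J, prior, noise_var, y, use):
    """Posterior forecast variance using only the observations listed in `use`."""
    if len(use) == 0:
        return float(y @ prior @ y)
    Ju = J[use, :]
    S = Ju @ prior @ Ju.T + np.diag(noise_var[use])
    post = prior - prior @ Ju.T @ np.linalg.solve(S, Ju @ prior)
    return float(y @ post @ y)

J = np.array([[1.0, 0.0],    # obs 0 measures parameter 0
              [1.0, 0.0],    # obs 1 duplicates obs 0
              [0.0, 1.0]])   # obs 2 is the only one informing parameter 1
prior = np.eye(2)
noise_var = np.full(3, 0.01)
y = np.array([1.0, 1.0])
all_obs = [0, 1, 2]

prior_var = forecast_variance(J, prior, noise_var, y, [])
post_var = forecast_variance(J, prior, noise_var, y, all_obs)

# "added": each observation on its own, starting from the prior
added = [100.0 * (prior_var - forecast_variance(J, prior, noise_var, y, [i])) / prior_var
         for i in all_obs]

# "removed": each observation left out, starting from the full dataset
removed = [100.0 * (forecast_variance(J, prior, noise_var, y,
                                      [k for k in all_obs if k != i]) - post_var) / post_var
           for i in all_obs]

print("added:  ", np.round(added, 1))
print("removed:", np.round(removed, 1))
```

Added one at a time, all three observations look equally valuable. Removed one at a time, the two duplicates look nearly expendable because each covers for the other, while the third is critical. The "removed" metric accounts for redundancy in the rest of the dataset; the "added" metric does not.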

### what is the worth of ${\it potential}$ observations? what data should we collect?

Recall we are "carrying" cell-by-cell heads, reach-based sfr flows, etc..

In [None]:
z_obs = pst.observation_data.loc[(pst.observation_data.weight == 0),"obsnme"].tolist()
z_obs = [x for x in z_obs if x not in forecasts]  # less our forecasts
z_obs

We can therefore repeat above analysis for the observations that currently have zero weight by turning those observations "on".

#### Beware: calculating the Schur complement for all potential observation types and locations could take some time!! So we will sample to speed things up. You may need to further reduce the number of potential obs - you can do this by adding [0::2] to take every second element for example.

In [None]:
new_obs = [x for x in z_obs if "hds_00" in x]#and x.endswith("_000")  # all heads in top layer
print("number of new potential head observation locations considered: {}".format(len(new_obs)))

In [None]:
from datetime import datetime
start = datetime.now()
df_worth_new = sc.get_added_obs_importance(obslist_dict=new_obs, base_obslist=sc.pst.nnz_obs_names, reset_zero_weight=True)
print("took:",datetime.now() - start)

In [None]:
df_worth_new.head()

### nice! now let's process a little bit and make some plots of (potential) data worth

In [None]:
def worth_plot_prep(df):
    # some processing
    df_new_base = df.loc["base",:].copy()  # "base" row
    df_new_imax = df.apply(lambda x: df_new_base - x, axis=1).idxmax()  # obs with largest unc red for each pred
    df_new_worth = 100.0 * (df.apply(lambda x: df_new_base - x, axis=1) / df_new_base)  # normalizing like above
    
    # plot prep
    df_new_worth_plot = df_new_worth[df_new_worth.index != 'base'].copy()
    df_new_worth_plot.loc[:,'names'] = df_new_worth_plot.index
    names = df_new_worth_plot.names
    df_new_worth_plot.loc[:,"i"] = names.apply(lambda x: int(x[8:10]))
    df_new_worth_plot.loc[:,"j"] = names.apply(lambda x: int(x[11:14]))
    df_new_worth_plot.loc[:,'kper'] = names.apply(lambda x: int(x[-3:]))
    #df_new_worth_plot.head()
    
    return df_new_worth_plot, df_new_imax

In [None]:
df_worth_new_plot, df_worth_new_imax = worth_plot_prep(df_worth_new)

In [None]:
df_worth_new_plot.head()

In [None]:
df_worth_new_imax  # which obs causes largest unc var reduction?

In [None]:
df_worth_new_plot.drop(axis=1,labels=["part_status"],inplace=True) # drop "part_status"
df_worth_new_plot.head()

### plotting

In [None]:
m = flopy.modflow.Modflow.load("freyberg.nam", model_ws=os.path.join(m_d))

In [None]:
def plot_added_importance(df_worth_plot, ml, forecast_name=None, 
                          newlox=None,):

    vmax = df_worth_plot[forecast_name].max()
    
    fig, axs = plt.subplots(1,2)
    if newlox:
        currx = []
        curry = []
        for i,clox in enumerate(newlox):
            crow = int(clox[8:10])
            ccol = int(clox[11:14])
            currx.append(ml.sr.xcentergrid[crow,ccol])
            curry.append(ml.sr.ycentergrid[crow,ccol])

    for sp,ax in enumerate(axs): # by kpers
        unc_array = np.zeros_like(ml.upw.hk[0].array) - 1
        df_worth_csp = df_worth_plot.groupby('kper').get_group(sp)
        for i,j,unc in zip(df_worth_csp.i,df_worth_csp.j,
                           df_worth_csp[forecast_name]):
            unc_array[i,j] = unc 
        unc_array[unc_array == -1] = np.nan
        cb = ax.imshow(unc_array,interpolation="nearest",
                       alpha=0.5,extent=ml.sr.get_extent(), 
                       vmin=0, vmax=vmax)
        if sp==1:
            plt.colorbar(cb,label="percent uncertainty reduction")
        
        # plot sfr
        sfr_data = ml.sfr.stress_period_data[0]
        sfr_x = ml.sr.xcentergrid[sfr_data["i"],sfr_data["j"]]
        sfr_y = ml.sr.ycentergrid[sfr_data["i"],sfr_data["j"]]
        for (x,y) in zip(sfr_x,sfr_y):
            ax.scatter([x],[y],marker="s",color="g",s=26)
       
        # plot the pumping wells
        wel_data = ml.wel.stress_period_data[0]
        wel_x = ml.sr.xcentergrid[wel_data["i"],wel_data["j"]]
        wel_y = ml.sr.ycentergrid[wel_data["i"],wel_data["j"]]
        for w,(x,y) in enumerate(zip(wel_x,wel_y)):
            ax.scatter([x],[y],marker="v",color="m",s=10)

        if newlox:
            for nl,(cx,cy,cobs) in enumerate(zip(currx, curry, newlox)):
                csp = int(cobs[-3:])  # stress period is the last three characters of the name
                if csp == sp:
                    ax.plot(cx, cy, 'rd', mfc=None, ms=10, alpha=0.8)
                    ax.text(cx-50,cy-50, nl, size=10)

        # plot the location of the forecast if possible
        if forecast_name.startswith('hds'):
            i = int(forecast_name[8:10])
            j = int(forecast_name[11:14])
            forecast_x = ml.sr.xcentergrid[i,j]
            forecast_y = ml.sr.ycentergrid[i,j]
            ax.scatter(forecast_x, forecast_y, marker='o', s=600, alpha=0.5)
            
        ax.set_title("worth for {0}\n at kper {1}".format(forecast_name,sp), fontsize=13)
        plt.tight_layout()
    return fig

In [None]:
fig = plot_added_importance(df_worth_plot=df_worth_new_plot, ml=m,forecast_name="part_time")

In [None]:
for i in [x for x in forecasts if "part_status" not in x]:
    fig = plot_added_importance(df_worth_plot=df_worth_new_plot, ml=m, 
                                forecast_name=i)
    #fig.savefig('add_worth_{}.pdf'.format(i))

## the "next best" observation

This is what we would ultimately like to know. The analysis accounts for what we already know by adding observations incrementally: each new pick is evaluated given the ones already selected. For example, consider making an observation in the middle of the zone of highest worth. Where should we subsequently collect data?

Let's just use the same potential observation list for now (the head in every top-layer cell) and evaluate which ones to collect, if we only had the budget for 5, in the context of the particle travel time prediction.
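The incremental search can be sketched as a greedy loop: at each iteration, try every remaining candidate together with everything chosen so far, keep the one giving the lowest forecast variance, and repeat. The NumPy sketch below uses hypothetical matrices, and the helper names (`forecast_variance`, `next_best`) are made up for illustration; it is a toy stand-in for the idea, not pyemu's code.

```python
import numpy as np

def forecast_variance(J, prior, noise_var, y, use):
    """Posterior forecast variance using only the observations listed in `use`."""
    if len(use) == 0:
        return float(y @ prior @ y)
    Ju = J[use, :]
    S = Ju @ prior @ Ju.T + np.diag(noise_var[use])
    post = prior - prior @ Ju.T @ np.linalg.solve(S, Ju @ prior)
    return float(y @ post @ y)

def next_best(J, prior, noise_var, y, base_obs, candidates, niter):
    """Greedy selection of `niter` observations from `candidates`."""
    chosen = list(base_obs)
    remaining = list(candidates)
    picks = []
    for _ in range(niter):
        # evaluate each remaining candidate added to what has been chosen so far
        trial = {c: forecast_variance(J, prior, noise_var, y, chosen + [c])
                 for c in remaining}
        best = min(trial, key=trial.get)
        picks.append((best, trial[best]))
        chosen.append(best)
        remaining.remove(best)
    return picks

rng = np.random.default_rng(3)
npar, nobs = 6, 8
J = rng.normal(size=(nobs, npar))   # hypothetical jacobian
prior = np.eye(npar)                # hypothetical prior covariance
noise_var = np.full(nobs, 0.1)      # hypothetical noise variances
y = rng.normal(size=npar)           # hypothetical forecast sensitivities

base_obs = [0, 1]                   # existing (non-zero weighted) observations
candidates = [2, 3, 4, 5, 6, 7]     # potential new observations
picks = next_best(J, prior, noise_var, y, base_obs, candidates, niter=3)
for obs, var in picks:
    print(obs, round(var, 4))
```

Because every pick is conditioned on the previous ones, the second-best location is often not the runner-up from the single-observation ranking: information already supplied by the first pick is discounted.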

In [None]:
start = datetime.now()
next_most_df = sc.next_most_important_added_obs(forecast='part_time',niter=5,obslist_dict=dict(zip(new_obs,new_obs)),
                                                base_obslist=sc.pst.nnz_obs_names,reset_zero_weight=True)
print("took:",datetime.now() - start)

In [None]:
next_most_df

In [None]:
fig = plot_added_importance(df_worth_new_plot, m, 'part_time', 
                            newlox=next_most_df.best_obs.tolist())

In [None]:
# for fun after class - this will take a while!
for i in ["fa_tw_19801229","part_time"]:#[x for x in forecasts if "part_status" not in x]:
    next_most_df = sc.next_most_important_added_obs(forecast=i,niter=10,obslist_dict=dict(zip(new_obs,new_obs)),
                                                    base_obslist=sc.pst.nnz_obs_names,reset_zero_weight=True)
    fig = plot_added_importance(df_worth_new_plot, m, forecast_name=i, 
                                newlox=next_most_df.best_obs.tolist())
    fig.savefig('next_best_10_worth_{}.pdf'.format(i))

### Note: an important assumption underpinning the above is that the model is able to fit observations to a level that is commensurate with measurement noise... Are we comfortable with this assumption? We will discuss this more in `pestpp-glm_part2.ipynb`

In [None]:
# recall...
pst.observation_data.loc[pst.nnz_obs_names,:]

### an "extra" if we have time: parameter identifiability

In [None]:
la = pyemu.ErrVar(jco=jco)

In [None]:
s = la.qhalfx.s  # singular spectrum
s.x[:10]

In [None]:
figure = plt.figure()
ax = plt.subplot(111)
ax.plot(np.log10(s.x))
ax.set_ylabel("log10 singular value")
ax.set_xlabel("index")
ax.set_xlim(0,100)
plt.show()

As expected, the singular spectrum decays rapidly.

In [None]:
truncation_thresh = 1e-6  # singular values below this fraction of the largest are treated as zero
n_signif_singvals = ((s.x / s[0].x) > truncation_thresh).sum()
n_signif_singvals

In [None]:
print("This means that, on the basis of the {0} (non-zero) weighted observations, \
there are {1} unique pieces of information in the calibration dataset.  \
Recall the inverse problem we are trying to solve involves the estimation of {2} parameters using this information only...".\
      format(pst.nnz_obs, n_signif_singvals, pst.npar_adj))

Now let's compute the identifiability of actual model parameters based on these singular vectors. Identifiability ranges from 0 (not identified by the data) to 1 (fully identified by the data).
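As a rough illustration of what the next cell computes, a parameter's identifiability can be taken as the sum of squares of its components in the retained right singular vectors of the weighted jacobian. The NumPy sketch below uses a hypothetical jacobian and weights and is not pyemu's implementation.

```python
import numpy as np

rng = np.random.default_rng(4)
nobs, npar = 3, 6
J = rng.normal(size=(nobs, npar))       # hypothetical jacobian
weights = np.array([1.0, 2.0, 0.5])     # hypothetical observation weights

# weighted jacobian, Q^(1/2) J, with Q^(1/2) = diag(weights)
qhalfx = np.diag(weights) @ J

# rows of vt are the right singular vectors (parameter space)
u, s, vt = np.linalg.svd(qhalfx, full_matrices=False)

k = nobs  # truncation point: number of singular vectors retained
ident = (vt[:k, :] ** 2).sum(axis=0)
print(ident.round(3))
```

Since the retained vectors are orthonormal, the identifiabilities sum to `k`: with 3 observations and 6 parameters, at most 3 "units" of identifiability are shared across all parameters.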

In [None]:
ident_df = la.get_identifiability_dataframe()  # sing val trunc defaults to pst.nnz_obs

In [None]:
ident_df.sort_values(by="ident",ascending=False).iloc[0:20].loc[:,"ident"].plot(kind="bar")

Note the similarity with some of the earlier results for parameter contributions to forecast uncertainty.

In [None]:
css = la.get_par_css_dataframe()
css.head()

In [None]:
css.sort_values(by="pest_css",ascending=False)