This notebook computes the residence-time distribution (RTD) for an entire aquifer from an existing MODFLOW model. Any group or label can be read in from a 3D array, and an RTD can be made for each group. The approach is to 
* read an existing model
* create flux-weighted particle starting locations in every cell
* run MODPATH and read endpoints
* fit parametric distributions

This notebook fits parametric distributions. Another notebook creates flux-weighted particles.

In [None]:
__author__ = 'Jeff Starn'
%matplotlib notebook

from IPython.display import set_matplotlib_formats
set_matplotlib_formats('png', 'pdf')
from IPython.display import Image
from IPython.display import Math
from ipywidgets import interact, Dropdown
from IPython.display import display

import os
import sys
import shutil
import pickle
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import matplotlib.ticker as mt
import matplotlib.patches as patches

import flopy as fp
import imeth
import fit_parametric_distributions
import pandas as pd
import scipy.stats as ss
import scipy.optimize as so


# Preliminary stuff

## Set user-defined variables

MODFLOW and MODPATH use elapsed time and are not aware of calendar time. To place MODFLOW/MODPATH elapsed time on the calendar, two calendar dates were specified at the top of the notebook: the beginning of the first stress period (`mf_start_date`) and when particles are to be released (`mp_release_date`). The latter date could be used in many ways, for example to represent a sampling date, or it could be looped over to create a time-lapse set of ages. 

## Loop through home directory to get list of name files

In [None]:
homes = ['../Models']
fig_dir = '../Figures'

if not os.path.exists(fig_dir):
    os.mkdir(fig_dir)

mfpth = '../executables/MODFLOW-NWT_1.0.9/bin/MODFLOW-NWT_64.exe'

mf_start_date_str = '01/01/1900' 
mp_release_date_str = '01/01/2020' 

logtransform = False

age_cutoff = 65
year_cutoff = '01/01/1952'

# weighting scheme, either 'flow' or 'volume'
# weight_scheme = 'volume'
# the weighting scheme is always volume in this version of the notebook
# to get flow weighted travel times, a weight dataframe is output
# that has as weights the proportion of flow in each cell
weight_scheme = 'flow'

# dist_list = [ss.invgauss, ss.gamma, ss.weibull_min]
dist_list = [ss.weibull_min]

por = 0.20

dir_list = []
mod_list = []
i = 0

for home in homes:
    if os.path.exists(home):
        for dirpath, dirnames, filenames in os.walk(home):
            for f in filenames:
                if os.path.splitext(f)[-1] == '.nam':
                    mod = os.path.splitext(f)[0]
                    mod_list.append(mod)
                    dir_list.append(dirpath)
                    i += 1
print('    {} models read'.format(i))

model_area = Dropdown(
    options=mod_list,
    description='Model:',
    background_color='cyan',
    border_color='black',
    border_width=2)
display(model_area)


In [None]:
model = model_area.value
model_ws = [item for item in dir_list if model in item][0]
nam_file = '{}.nam'.format(model)
print("working model is {}".format(model_ws))

##  Create names and path for model workspace. 

The procedures in this notebook can be run interactively or from a batch file. For batch use, download the notebook as a Python script, comment out the preceding cell (which sets `model` from the dropdown), and uncomment the following code. The remainder of the script must then be indented so that it falls inside the loop. This may require familiarity with Python. 

In [None]:
# for pth in dir_list:
#     model = os.path.normpath(pth).split(os.sep)[2]
#     model_ws = [item for item in dir_list if model in item][0]
#     nam_file = '{}.nam'.format(model)
#     print("working model is {}".format(model_ws))

# Load an existing model

In [None]:
print ('Reading model information')

fpmg = fp.modflow.Modflow.load(nam_file, model_ws=model_ws, exe_name=mfpth, version='mfnwt', 
                               load_only=['DIS', 'BAS6', 'UPW', 'OC'], check=False)

dis = fpmg.get_package('DIS')
bas = fpmg.get_package('BAS6')
upw = fpmg.get_package('UPW')
oc = fpmg.get_package('OC')

delr = dis.delr
delc = dis.delc
nlay = dis.nlay
nrow = dis.nrow
ncol = dis.ncol
bot = dis.getbotm()
top = dis.gettop()

hnoflo = bas.hnoflo
ibound = np.asarray(bas.ibound.get_value())
hdry = upw.hdry

print ('   ... done') 

## Specification of time in MODFLOW/MODPATH

There are several time-related concepts used in MODPATH.
* `simulation time` is the elapsed time in model time units from the beginning of the first stress period
* `reference time` is an arbitrary value of `simulation time` that is between the beginning and ending of `simulation time`
* `tracking time` is the elapsed time relative to `reference time`. It is always positive regardless of whether particles are tracked forward or backward
* `release time` is when a particle is released and is specified in `tracking time`
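
As a minimal sketch of how the two calendar dates relate to elapsed model time, the interval between the start of the first stress period and the particle release can be computed as below. The names `release_days` and `release_years` are hypothetical, and the sketch assumes model time units of days and 365.25 days per year.

```python
import datetime as dt

# dates as given in the user-defined variables of this notebook
mf_start_date = dt.datetime.strptime('01/01/1900', '%m/%d/%Y')
mp_release_date = dt.datetime.strptime('01/01/2020', '%m/%d/%Y')

# elapsed time from the start of the first stress period to particle release
# (hypothetical names; assumes the model time unit is days)
release_days = (mp_release_date - mf_start_date).days
release_years = release_days / 365.25

print(release_days, round(release_years, 2))
```

If the release date is looped over to build a time-lapse set of ages, only `mp_release_date` changes; the start date stays fixed.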

In [None]:
# convert string representation of dates into Python datetime objects
mf_start_date = dt.datetime.strptime(mf_start_date_str , '%m/%d/%Y')
mp_release_date = dt.datetime.strptime(mp_release_date_str , '%m/%d/%Y')

In [None]:
src = os.path.join(model_ws, 'water_table.csv')
water_table = pd.read_csv(src, index_col=0).values.reshape(nrow, ncol)

# Process endpoint information

## Read endpoint file

## Review zones and create zone groups for processing

# Fit parametric distributions

## Subsample endpoints
The fitting cell below reduces the endpoints to `s` values by interpolating the sorted travel times at `s` equally spaced points of the empirical CDF. This makes the curve fitting much faster. A value of 50,000 for `s` seems to work well.
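
The reduction step can be sketched on synthetic data as follows. The travel times here are random draws from a Weibull distribution (an assumption made only so the example runs on its own), `s` is smaller than the notebook's value to keep it quick, and the least-squares CDF fit with `scipy.optimize.curve_fit` is an illustrative stand-in, not necessarily what `fit_parametric_distributions.fit_dists` does.

```python
import numpy as np
import scipy.stats as ss
import scipy.optimize as so

# synthetic travel times in years; stands in for the sorted 'rt' column
rng = np.random.default_rng(0)
tt = np.sort(rng.weibull(1.5, size=100_000) * 30.0)

n = tt.size   # number of particles
s = 5_000     # number of points kept (the notebook uses 50,000)

# empirical CDF of the particles and the equally spaced CDF values to keep
tt_cdf = np.linspace(1.0 / n, 1.0, n)
ly = np.linspace(1.0 / s, 1.0, s)

# travel times at the equally spaced CDF values
lprt = np.interp(ly, tt_cdf, tt)


def weibull_cdf(t, shape, scale):
    """One-component Weibull CDF with the location fixed at zero."""
    return ss.weibull_min.cdf(t, shape, loc=0, scale=scale)


# least-squares fit of the CDF to the reduced set of points
(shape, scale), _ = so.curve_fit(weibull_cdf, lprt, ly, p0=[1.0, 10.0],
                                 bounds=(1e-6, np.inf))
print('shape = {:.2f}, scale = {:.2f}'.format(shape, scale))
```

Because the points are equally spaced in probability, every part of the CDF carries the same weight in the fit regardless of how many particles fall there.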

## Calculate summary statistics

Two definitions of young fraction are included. The first is relative to a particle travel time (age) cut-off in years. For steady-state models, it is independent of time and describes the young fraction of the RTD. For transient models, the young fraction depends on the particle release time. The second definition is relative to a calendar date; only particles that recharged after that date are counted as young. With the default values (`age_cutoff` of 65 years and `year_cutoff` of 1952), the two definitions coincide if particles are released in 2017. The second definition may be useful for assessing the breakthrough of chemicals released at a known date. For both definitions, the young fraction is described by the number of particles, the mean travel time of those particles, and the fraction of total particles.
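
A small worked example may make the two definitions concrete. The particle ages below are made up for illustration; the cut-offs and release date are the defaults set earlier in this notebook.

```python
import datetime as dt
import numpy as np

# hypothetical particle ages in years (illustration only)
ages = np.array([5.0, 20.0, 40.0, 64.0, 66.0, 80.0, 150.0])

age_cutoff = 65
mp_release_date = dt.datetime(2020, 1, 1)
year_cutoff = dt.datetime(1952, 1, 1)

# definition 1: particles younger than a fixed age
young_by_age = ages[ages < age_cutoff]
frac_by_age = young_by_age.size / ages.size

# definition 2: particles recharged after a calendar date
years_since_cutoff = (mp_release_date - year_cutoff).days / 365.25
young_by_date = ages[ages < years_since_cutoff]
frac_by_date = young_by_date.size / ages.size

print(young_by_age.size, young_by_age.mean(), frac_by_age)
print(young_by_date.size, young_by_date.mean(), frac_by_date)
```

With a 2020 release the calendar cut-off corresponds to 68 years, so the 66-year-old particle is young under the second definition but not the first.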



In [None]:
src = os.path.join(model_ws, 'zone_df.csv')
zone_df = pd.read_csv(src, index_col=0)
fit_dict = dict()

cols = ['mean particle age', 'standard dev of particle age', 
        'minimum particle age', 
        '10th percentile of particle age', 
        '20th percentile of particle age', 
        '30th percentile of particle age', 
        '40th percentile of particle age', 
        '50th percentile of particle age', 
        '60th percentile of particle age', 
        '70th percentile of particle age', 
        '80th percentile of particle age', 
        '90th percentile of particle age', 
        'maximum particle age', 
        'number particles < {} yrs old'.format(age_cutoff), 
        'mean age of particles < {} yrs old'.format(age_cutoff), 
        'proportion of particles < {} yrs old'.format(age_cutoff), 
        'number particles recharged since {}'.format(year_cutoff), 
        'mean age of particles recharged since {}'.format(year_cutoff), 
        'proportion of particles recharged since {}'.format(year_cutoff), 
        'total number of particles', 
        'minimum linear x-y path length',
        'median linear x-y path length',
        'maximum linear x-y path length',
        'minimum linear x-y-z path length',
        'median linear x-y-z path length',
        'maximum linear x-y-z path length',
        'One component Weibull shape',
        'One component Weibull location',
        'One component Weibull scale',
        'Two component Weibull shape 1',
        'Two component Weibull location 1',
        'Two component Weibull scale 1',
        'Two component Weibull shape 2',
        'Two component Weibull location 2',
        'Two component Weibull scale 2',
        'Two component Weibull fraction',
       ]

data_df = pd.DataFrame(index=cols)

# switch weight_scheme in the first cell to switch between volume weighted and flow weighted 
if weight_scheme == 'flow':
    weight_label = 'flow'

elif weight_scheme == 'volume':
    weight_label = 'volume'



In [None]:
for group in zone_df:
    
    print('Summarizing endpoints and fitting distributions for {}'.format(group))

    mpname = '{}_{}_{}'.format(fpmg.name, weight_label, group)

    endpoint_file = '{}.{}'.format(mpname, 'mpend')
    endpoint_file = os.path.join(model_ws, endpoint_file)

    # read the endpoint file
    ep_data = fit_parametric_distributions.read_endpoints(endpoint_file, dis)

    # set the Z coordinate for particles that end in dry cells to the 
    # head of the nearest non-dry cell below the dry cell.
    ind = np.isclose(ep_data['Final Global Z'], hdry, rtol=0.1)
    ep_data['Final Global Z'] = np.where(ind, water_table[ep_data['Final Row']-1, 
                                        ep_data['Final Column']-1], ep_data['Final Global Z'])

    # eliminate particles that start in dry cells
    ind = np.isclose(ep_data['Initial Global Z'], hdry, rtol=0.99999)
    ep_data = ep_data.loc[~ind, :]

    # calculate approximate linear path distances
    x_dist = ep_data['Final Global X'] - ep_data['Initial Global X']
    y_dist = ep_data['Final Global Y'] - ep_data['Initial Global Y']
    z_dist = ep_data['Final Global Z'] - ep_data['Initial Global Z']
    ep_data['xy_path_len'] = np.sqrt(x_dist**2 + y_dist**2)
    ep_data['xyz_path_len'] = np.sqrt(x_dist**2 + y_dist**2 + z_dist**2)

    endpoint_file = '{}_mod.{}'.format(mpname, 'mpend')
    endpoint_file = os.path.join(model_ws, endpoint_file)
    ep_data.to_csv(endpoint_file)

    # extract travel times 
    trav_time_raw = ep_data.loc[:, ['rt', 'xy_path_len', 'xyz_path_len']].copy()

    # sort them
    trav_time_raw.sort_values('rt', inplace=True)

    # create arrays of CDF values between 1/s (or 1/n) and 1

    # number of particles 
    n = trav_time_raw.shape[0]

    # number of particles desired to approximate the particle CDF
    s = 50000

    ly = np.linspace(1. / s, 1., s, endpoint=True)

    tt_cdf = np.linspace(1. / n, 1., n, endpoint=True)

    # log transform the travel times if requested; otherwise use them as is
    if logtransform:
        tt = np.log(trav_time_raw.rt)
    else:
        tt = trav_time_raw.rt

    # interpolate at equally spaced points to reduce the number of particles
    lprt = np.interp(ly, tt_cdf , tt)

    first = lprt.min()
    
    fit_dict[group] = fit_parametric_distributions.fit_dists(ly, lprt, dist_list)

    xp = np.arange(1, 10) * 10
    data = np.zeros((36))
    tt_count = trav_time_raw.shape[0]
    data[0] = trav_time_raw.rt.mean()
    data[1] = trav_time_raw.rt.std()
    data[2] = trav_time_raw.rt.min()
    # 10th through 90th percentiles go in rows 3 through 11
    for j, pct in enumerate(xp):
        data[j+3] = np.percentile(trav_time_raw.rt, pct)
    data[12] = trav_time_raw.rt.max()
    cutoff_age_ar = trav_time_raw[trav_time_raw.rt < age_cutoff]
    cutoff_age_count = cutoff_age_ar.shape[0]
    data[13] = cutoff_age_count
    data[14] = cutoff_age_ar.rt.mean()
    data[15] = cutoff_age_count / tt_count
    cut_off_years_ago = mp_release_date - dt.datetime.strptime(year_cutoff, '%m/%d/%Y')

    cutoff_year_ar = trav_time_raw.loc[trav_time_raw.rt < cut_off_years_ago.days / 365.25, :]
    cutoff_year_count = cutoff_year_ar.shape[0]
    data[16] = cutoff_year_count
    data[17] = cutoff_year_ar.rt.mean()
    data[18] = cutoff_year_count / tt_count                
    data[19] = trav_time_raw.rt.count()
    data[20] = trav_time_raw.xy_path_len.min()
    data[21] = trav_time_raw.xy_path_len.median()
    data[22] = trav_time_raw.xy_path_len.max()
    data[23] = trav_time_raw.xyz_path_len.min()
    data[24] = trav_time_raw.xyz_path_len.median()
    data[25] = trav_time_raw.xyz_path_len.max()
    data[26:29] = fit_dict[group]['par']['uni_weibull_min']
    data[29:36] = fit_dict[group]['par']['add_weibull_min']

    data_df[group] = data
        
    print('   ...done')
    
dst = os.path.join(model_ws, 'summary_data_aquifer_system_{}.csv'.format(weight_scheme))
data_df.to_csv(dst)

dst = os.path.join(model_ws, 'fit_dict_{}.pickle'.format(weight_scheme))
with open(dst, 'wb') as f:
    pickle.dump(fit_dict, f)

# Notes on RTD parent distributions

From Stack Exchange Cross Validated:

Both the gamma and Weibull distributions can be seen as generalisations of the exponential distribution. If we look at the exponential distribution as describing the waiting time of a Poisson process (the time we have to wait until an event happens, if that event is equally likely to occur in any time interval), then the $\Gamma(k, \theta)$ distribution describes the time we have to wait for $k$ independent events to occur.

On the other hand, the Weibull distribution effectively describes the time we have to wait for one event to occur, if that event becomes more or less likely with time. Here the $k$ parameter describes how quickly the probability ramps up (proportional to $t^{k-1}$).

We can see the difference in effect by looking at the pdfs of the two distributions. Ignoring all the normalising constants:

$f_\Gamma(x)\propto x^{k-1}\exp(-\frac{x}{\theta})$

$f_W(x)\propto x^{k-1}\exp\left(-\left(\frac{x}{\lambda}\right)^k\right)$

The Weibull distribution drops off much more quickly (for $k>1$) or slowly (for $k<1$) than the gamma distribution. In the case where $k=1$, they both reduce to the exponential distribution.
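
These statements can be checked numerically with `scipy.stats`. The sketch below uses an arbitrary scale of 30 (an assumption for illustration) and compares the upper-tail probability at a single point.

```python
import numpy as np
import scipy.stats as ss

x = np.linspace(0.1, 200.0, 500)
theta = 30.0  # hypothetical scale, used for both distributions

# k = 1: both distributions reduce to the exponential
expon_pdf = ss.expon.pdf(x, scale=theta)
gamma_k1 = ss.gamma.pdf(x, 1.0, scale=theta)
weibull_k1 = ss.weibull_min.pdf(x, 1.0, scale=theta)
print(np.allclose(gamma_k1, expon_pdf), np.allclose(weibull_k1, expon_pdf))

# upper-tail probability beyond x = 150 for k > 1 and k < 1
for k in (2.0, 0.5):
    tail_gamma = ss.gamma.sf(150.0, k, scale=theta)
    tail_weibull = ss.weibull_min.sf(150.0, k, scale=theta)
    print(k, tail_gamma, tail_weibull)
```

For $k=2$ the Weibull tail is far smaller than the gamma tail, and for $k=0.5$ it is larger.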

From Wikipedia:

The generalized gamma has three parameters: $a>0$, $d>0$, and $p>0$. For non-negative $x$, the probability density function of the generalized gamma is

$f(x;a,d,p)=\frac{(p/a^{d})\,x^{d-1}e^{-(x/a)^{p}}}{\Gamma(d/p)},$

where $\Gamma (\cdot )$ denotes the gamma function.

The cumulative distribution function is
$F(x;a,d,p)={\frac  {\gamma (d/p,(x/a)^{p})}{\Gamma (d/p)}},$

where $\gamma (\cdot )$ denotes the lower incomplete gamma function.

If $d=p$ then the generalized gamma distribution becomes the Weibull distribution. Alternatively, if $p=1$ the generalised gamma becomes the gamma distribution.
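
The two special cases can be verified with `scipy.stats.gengamma`. SciPy parameterizes the distribution with shapes `a` and `c`, which correspond to $d/p$ and $p$ in the notation above, with SciPy's `scale` playing the role of $a$. The numeric values are arbitrary choices for illustration.

```python
import numpy as np
import scipy.stats as ss

x = np.linspace(0.1, 100.0, 400)
scale = 30.0  # corresponds to the parameter a in the formula above

# d = p  (SciPy a = d/p = 1): generalized gamma becomes Weibull with shape p
p = 1.7
weibull_from_gg = ss.gengamma.pdf(x, 1.0, p, scale=scale)
weibull_direct = ss.weibull_min.pdf(x, p, scale=scale)

# p = 1  (SciPy c = 1): generalized gamma becomes gamma with shape d
d = 2.5
gamma_from_gg = ss.gengamma.pdf(x, d, 1.0, scale=scale)
gamma_direct = ss.gamma.pdf(x, d, scale=scale)

print(np.allclose(weibull_from_gg, weibull_direct),
      np.allclose(gamma_from_gg, gamma_direct))
```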

From the NIST/SEMATECH e-Handbook of Statistical Methods:

The formula for the probability density function of the general Weibull distribution is

$f(x)=\frac\gamma\alpha(\frac{x-\mu}\alpha)^{(\gamma-1)}\exp(-(\frac{(x-\mu)}\alpha)^\gamma)$

$x\ge\mu; \gamma,\alpha>0$

where $\gamma$ is the shape parameter, $\mu$ is the location parameter and $\alpha$ is the scale parameter. The case where $\mu = 0$ and $\alpha = 1$ is called the standard Weibull distribution. The case where $\mu = 0$ is called the 2-parameter Weibull distribution. The equation for the standard Weibull distribution reduces to

$f(x)=\gamma x^{(\gamma-1)}\exp(-(x^\gamma))$

Since the general form of probability functions can be expressed in terms of the standard distribution, all subsequent formulas in this section are given for the standard form of the function.
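
The general Weibull form maps directly onto `scipy.stats.weibull_min`, the distribution in this notebook's `dist_list`: $\gamma$ is the shape argument, $\mu$ is `loc`, and $\alpha$ is `scale`. The sketch below evaluates the formula by hand and compares it with SciPy; the parameter values are arbitrary.

```python
import numpy as np
import scipy.stats as ss

# hypothetical parameter values: shape, location, scale
gamma_, mu, alpha = 1.5, 2.0, 30.0

x = np.linspace(mu + 0.01, 200.0, 500)

# general Weibull pdf written out term by term
z = (x - mu) / alpha
pdf_manual = (gamma_ / alpha) * z ** (gamma_ - 1) * np.exp(-z ** gamma_)

# the same density from SciPy
pdf_scipy = ss.weibull_min.pdf(x, gamma_, loc=mu, scale=alpha)

print(np.allclose(pdf_manual, pdf_scipy))
```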

The general formula for the probability density function of the gamma distribution is

$f(x)=\frac{(\frac{x-\mu}{\beta})^{\gamma-1}\exp(-\frac{x-\mu}{\beta})}{\beta\Gamma(\gamma)}$

where $\gamma$ is the shape parameter, $\mu$ is the location parameter, $\beta$ is the scale parameter, and $\Gamma$ is the gamma function which has the formula

$\Gamma(a)=\int_0^\infty t^{a-1}e^{-t}dt$

The case where $\mu=0$ and $\beta=1$ is called the standard gamma distribution. The equation for the standard gamma distribution reduces to

$f(x)=\frac{x^{\gamma-1}\exp(-x)}{\Gamma(\gamma)}$

  