# Estimating Flood Return Periods with Annual Maxima Series (AMS) vs. Partial Duration Series (PDS) / Peaks over Threshold (POT)

For peaks over threshold estimation, we'll use some built-in functions in the [pyextremes](https://georgebv.github.io/pyextremes/) library. We'll also use the [dataretrieval](https://github.com/DOI-USGS/dataretrieval-python) package to get daily streamflow data from USGS. Both of these packages need to be installed.

In [None]:
!pip install pyextremes
!pip install dataretrieval

We'll download the historical record of daily discharge data on the Potomac River at Washington, DC from HY 1931-2024. This is [USGS gauge 01646500](https://waterdata.usgs.gov/nwis/inventory?site_no=01646500&agency_cd=USGS).

In [None]:
import pandas as pd
import pyextremes
import dataretrieval.nwis as nwis

flow_df = nwis.get_record(sites='01646500', service='dv', parameterCd='00060', start='1930-10-01', end='2024-09-30') # Potomac River at Washington, DC
flow_df.head()

## Annual Maxima Series baseline

To estimate return periods from annual maxima, we first need to find the maximum flood in each hydrologic year, which begins October 1 and ends September 30.

In [None]:
import numpy as np

# find year of each data point
flow_df['Year'] = flow_df.index.year
flow_df['Month'] = flow_df.index.month

# October-December belong to the next hydrologic year; use .loc to avoid chained assignment
flow_df.loc[flow_df['Month'] >= 10, 'Year'] += 1

maxQ = flow_df.groupby('Year').max()
maxQ.head()
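To see the hydrologic-year bookkeeping in isolation, here is a minimal sketch on a synthetic daily series. The column names `flow` and `water_year` and the values are made up for illustration; they are not part of the USGS record.

```python
import pandas as pd

# Synthetic daily series spanning two water-year boundaries (values are made up)
idx = pd.date_range("2000-09-28", "2001-10-03", freq="D")
demo = pd.DataFrame({"flow": range(len(idx))}, index=idx)

# October-December are assigned to the following calendar year's water year
demo["water_year"] = demo.index.year + (demo.index.month >= 10).astype(int)

# One maximum per water year
annual_max = demo.groupby("water_year")["flow"].max()
print(annual_max)
```

Note that the first and last water years in this toy series are only partially covered, which is why the real download above starts on October 1 and ends on September 30.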

Now let's use utils.py to fit a three-parameter lognormal (LN3) distribution to the annual maxima using the method of moments (MOM). utils.py loads the `lmoments3` package, so we'll need to install that first.

In [None]:
!pip install lmoments3

In [None]:
from google.colab import drive

# allow access to google drive
drive.mount('/content/drive')

!cp "drive/MyDrive/Colab Notebooks/CE6280/CodingExamples/utils.py" .
from utils import *

# fit LN3 to maxQ with MOM and estimate 100-yr flood from it
distfit = LogNormal()
distfit.fit(maxQ["00060_Mean"], 'MOM', 3)
q100 = distfit.findReturnPd(100)
distfit.plotHistPDF(maxQ["00060_Mean"], 0, 450000, "LN3 MOM Fit")
print("LN3 MOM mu: %0.2f" % distfit.mu)
print("LN3 MOM sigma: %0.2f" % distfit.sigma)
print("LN3 MOM tau: %0.2f" % distfit.tau)
print("LN3 MOM 100-yr flood: %0.0f cfs" % q100)  # USGS parameter 00060 is discharge in cfs
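For readers without access to utils.py, the following is a rough, standalone sketch of what an LN3 method-of-moments fit could look like. It is not the course's `LogNormal` class: the function names `fit_ln3_mom` and `ln3_quantile`, the choice of sample moment estimators, and the peak values are all assumptions made for illustration.

```python
import math
from statistics import NormalDist, fmean

def fit_ln3_mom(x):
    """MOM fit of a 3-parameter lognormal, X = tau + exp(Y), Y ~ N(mu, sigma^2)."""
    n = len(x)
    m = fmean(x)
    var = sum((v - m) ** 2 for v in x) / (n - 1)  # sample variance
    skew = (n / ((n - 1) * (n - 2))) * sum((v - m) ** 3 for v in x) / var ** 1.5
    if skew <= 0:
        raise ValueError("LN3 MOM needs positive sample skewness")
    # Solve (w + 2) * sqrt(w - 1) = skew for w = exp(sigma^2) by bisection
    lo, hi = 1.0, 2.0
    while (hi + 2) * math.sqrt(hi - 1) < skew:
        hi *= 2
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if (mid + 2) * math.sqrt(mid - 1) < skew:
            lo = mid
        else:
            hi = mid
    w = 0.5 * (lo + hi)
    sigma = math.sqrt(math.log(w))
    mu = 0.5 * math.log(var / (w * (w - 1)))   # from var = exp(2*mu) * w * (w - 1)
    tau = m - math.exp(mu + 0.5 * sigma ** 2)  # from mean = tau + exp(mu + sigma^2 / 2)
    return mu, sigma, tau

def ln3_quantile(mu, sigma, tau, return_period):
    """Flow with annual exceedance probability 1 / return_period."""
    z = NormalDist().inv_cdf(1 - 1 / return_period)
    return tau + math.exp(mu + sigma * z)

# Made-up annual peaks, only to exercise the functions
peaks = [30, 35, 42, 50, 55, 61, 70, 88, 120, 210]
mu, sigma, tau = fit_ln3_mom(peaks)
q10 = ln3_quantile(mu, sigma, tau, 10)
q100 = ln3_quantile(mu, sigma, tau, 100)
print(mu, sigma, tau, q10, q100)
```

Bisection is used here only to keep the sketch dependency-free; any root finder would do for the skewness equation.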

## Peaks Over Threshold comparison

The `pyextremes` library only takes in one column with the data, indexed by the datetime, so let's subset the "00060_Mean" column with the mean daily flow.

In [None]:
flow_df = flow_df["00060_Mean"]
flow_df.head()

### Threshold Estimation from Mean Residual Life Plot

To decide on a threshold, we want to see where the mean residual life plot starts to linearly increase. There is a function `plot_mean_residual_life` in the `pyextremes` library.
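The quantity behind a mean residual life plot is the average excess over each candidate threshold. If the excesses follow a Generalized Pareto distribution, that average is linear in the threshold, which is why we look for a linear stretch. Below is a bare-bones sketch of the calculation; the helper name `mean_excess` and the numbers are illustrative assumptions, and it omits any confidence bounds.

```python
import numpy as np

def mean_excess(values, thresholds):
    """Mean of (x - u) over all x > u, for each threshold u."""
    values = np.asarray(values, dtype=float)
    out = []
    for u in thresholds:
        excess = values[values > u] - u
        # No exceedances above u: the mean excess is undefined
        out.append(excess.mean() if excess.size else np.nan)
    return np.array(out)

flows = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up values
mrl = mean_excess(flows, [0.0, 2.0, 4.0, 5.0])
print(mrl)
```

The estimates at the highest thresholds rest on very few points, which is the source of the noise at the right-hand end of such plots.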

In [None]:
from matplotlib import pyplot as plt
fig, ax = plt.subplots()
pyextremes.plot_mean_residual_life(flow_df, ax=ax)

This is a tough one! It's linear almost from the beginning, but that would certainly include non-extreme events. The slope does slightly change after 100k, so let's try 110,000 cfs as the threshold. Let's add it to the plot to see if the increase seems linear after that (before getting noisy from limited data at higher thresholds).

In [None]:
thres = 110000

fig, ax = plt.subplots()
pyextremes.plot_mean_residual_life(flow_df, ax=ax)
ax.plot([thres,thres],[20000,140000])

That seems decent. We can also see if the parameter estimates of the GP distribution fit to exceedances are relatively stable after this point (before becoming noisy from limited data at higher thresholds).

In [None]:
pyextremes.plot_parameter_stability(flow_df)

In [None]:
fig, axes = plt.subplots(2,1)
pyextremes.plot_parameter_stability(flow_df, axes=axes)
axes[0].plot([thres,thres],[-3,1])
axes[1].plot([thres,thres],[0,1.25E06])

The parameter estimates actually start to become *unstable* around there, so we might even want to lower the threshold.

What would the arrival rate be if we stuck with a threshold of 110,000 cfs? To determine this, we need to extract the floods over the threshold, which we can do with the `get_extremes` function in `pyextremes`. This requires specifying a value for the parameter `r`, which defines the duration over which multiple peaks are not considered independent. Thus, if more than one peak occurs within duration `r`, only the largest will be returned as an extreme. We will set `r="5d"` for 5 days.

In [None]:
extremes = pyextremes.get_extremes(flow_df, method="POT", threshold=thres, r="5d")
extremes.head()
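To make the role of `r` concrete, here is a simplified declustering sketch. It assumes one particular rule: exceedances separated by gaps of no more than `r` belong to the same cluster, and only the largest value in each cluster is kept. The function name `decluster_peaks` and the flows are hypothetical, and the exact rule inside `pyextremes` may differ in its details.

```python
from datetime import date, timedelta

def decluster_peaks(series, threshold, r_days):
    """series: list of (date, value) sorted by date. Returns one peak per cluster."""
    peaks = []
    cluster = []
    for day, value in series:
        if value <= threshold:
            continue
        # Start a new cluster when the gap since the last exceedance is longer than r
        if cluster and (day - cluster[-1][0]) > timedelta(days=r_days):
            peaks.append(max(cluster, key=lambda p: p[1]))
            cluster = []
        cluster.append((day, value))
    if cluster:
        peaks.append(max(cluster, key=lambda p: p[1]))
    return peaks

# Made-up daily flows starting 2020-01-01
flows = [50, 120, 150, 130, 60, 60, 125, 40, 40, 40, 40, 40, 40, 140, 50]
series = [(date(2020, 1, 1) + timedelta(days=i), q) for i, q in enumerate(flows)]
pot_peaks = decluster_peaks(series, threshold=110, r_days=5)
print(pot_peaks)
```

In this toy record the exceedance of 125 falls three days after the 150 peak, so it is absorbed into the same event, while the 140 a week later counts as a separate flood.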

In [None]:
n_years = 2024 - 1931 + 1  # 94 hydrologic years, HY 1931-2024
arrival_rate = len(extremes) / n_years
print("lambda = %0.2f floods/year" % arrival_rate)

This is on the low end, also justifying a potential lowering of the threshold to 100k, but we'll continue with 110k just for this example.  

We can plot the time series of exceedances of the threshold by creating a model using the `EVA` class and its `plot_extremes` method after using `get_extremes` to extract them.

In [None]:
model = pyextremes.EVA(flow_df)
extremes = model.get_extremes("POT", threshold=thres, r="5d")
model.plot_extremes()

### Distribution of peaks over selected threshold

We can fit a Generalized Pareto distribution to the exceedances with `fit_model` and plot the quality of the fit with `plot_diagnostic`. This uses `scipy.stats.genpareto.fit`, which uses MLE for parameter estimation. We'd have to write our own `GenPareto` class to fit this with MOM and L-moments (using the `lmoments3` library). You'll have to do this for your homework.

In [None]:
model.fit_model(distribution="genpareto") # uses MLE in scipy.stats.genpareto
model.plot_diagnostic(alpha=0.95)

These fits are quite good, with p-values of 0.000 on the PPCC test.

Unfortunately, if we wanted to estimate the values of floods of different return periods from this, `pyextremes` only estimates these empirically, as shown below.

In [None]:
return_periods = pyextremes.get_return_periods(
    ts=flow_df,
    extremes=extremes,
    extremes_method="POT",
    extremes_type="high",
)
return_periods.sort_values("return period", ascending=False).head()
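For intuition about what "empirical" means here, the sketch below computes return periods from ranked peaks. It assumes the Weibull plotting position, rank / (n + 1), combined with the arrival rate of peaks per year; the function name `empirical_return_periods` and the inputs are made up, and this is not a reproduction of the `pyextremes` internals.

```python
def empirical_return_periods(peaks, n_years):
    """Return (flow, return period in years) pairs, largest flow first."""
    n = len(peaks)
    rate = n / n_years                    # arrival rate, peaks per year
    ranked = sorted(peaks, reverse=True)  # rank 1 = largest peak
    out = []
    for rank, q in enumerate(ranked, start=1):
        exceed_prob = rank / (n + 1)      # Weibull plotting position among peaks
        out.append((q, 1 / (rate * exceed_prob)))
    return out

# Four made-up peaks observed over a hypothetical 10-year record
rp = empirical_return_periods([150, 140, 200, 120], n_years=10)
for q, t in rp:
    print(q, round(t, 2))
```

Because the longest return period such a table can report is tied to the record length, it cannot by itself give estimates for events rarer than the largest observed flood.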

We can also plot the stability of return period estimates with the `plot_return_value_stability` function, shown using the 100-year event below. By default, this tests 100 different threshold values and takes a long time, so we'll just pass 20 thresholds between 50k and 150k to it to speed that up for illustrative purposes.

In [None]:
pyextremes.plot_return_value_stability(
    flow_df,
    return_period=100,
    thresholds=np.linspace(50000,150000,20),
    alpha=0.95,
)

The estimates are pretty stable up to about 100k, so that would likely be the highest threshold we'd want to choose.  

To get better, analytical (not empirical) estimates of the return periods, you'll have to write your own code for HW3 to estimate the parameters of a GEV distribution of annual maxima from 1) the arrival rate of peaks over the threshold and 2) the parameters of the GP distribution of exceedances using the formulas from class. Thus, the main benefits of the `pyextremes` library are its built-in mean residual life plot and parameter/return period stability plots for informing the selection of a threshold for POT analysis, and its extraction of the peaks above the selected threshold.