# GEOGloWS Python API

This Esri Notebook closely follows the material covered in the GEOGloWS tutorials found in the following notebooks:

* [GEOGloWS lat_lon values notebook](https://gist.github.com/rileyhales/ad92d1fce3aa36ef5873f2f7c2632d31)

* [GEOGloWS package tutorial notebook](https://gist.github.com/rileyhales/873896e426a5bd1c4e68120b286bc029)

* [GEOGloWS Bias Correction Tutorial](https://gist.github.com/rileyhales/d5290e12b5858d59960d0898fbd0ed69)

The GEOGloWS ECMWF Streamflow model is a global hydrologic model driven by the HTESSEL land surface model. It produces a new 15-day streamflow forecast each day at midnight (UTC +0). It also has a 40+ year (and growing) historical simulation of streamflow.

The GEOGloWS ECMWF Streamflow Data Service (REST API) and this Python package client were developed at Brigham Young University in the Civil and Environmental Engineering Department by Riley Hales, Rohit Khatar, Chris Edwards, Kyler Ashby, Gio Romero, and others. This project is an expansion and enhancement of the original work by Dr Jim Nelson and Dr Michael Souffront with funding from GEOGloWS, ECMWF, NASA, The World Bank, Microsoft Azure, BYU, and others.

You can interact with the streamflow model using the `geoglows` python package. This notebook will take you through some of the functions available in the `geoglows` python library.

Please refer to the following resources for more information:

* [GEOGloWS Documentation](https://geoglows.readthedocs.io)

* [HydroViewer Website](https://apps.geoglows.org/apps/geoglows-hydroviewer/)

* [GEOGloWS Model and Forecast publication (2022) in the Journal of Flood Risk Management](https://onlinelibrary.wiley.com/doi/full/10.1111/jfr3.12859)

* [GEOGloWS Bias Correction Tutorial](https://gist.github.com/rileyhales/d5290e12b5858d59960d0898fbd0ed69)

In [None]:
# Find the directory of the current project and add to PATH
import sys, os, arcpy
home_folder = arcpy.mp.ArcGISProject("current").homeFolder
sys.path.insert(0, home_folder)
os.chdir(home_folder)

# The 00_environment_setup notebook contains libraries and other things common to all the notebooks (e.g. file paths)
%run "00_environment_setup.ipynb"

### Read in the station data file to get station coordinates

We have staged a CSV-format ASCII file on a server at NCAR that describes a set of station locations within India for which bias-correction of GloFAS data has been performed. We will use the `pandas` library to read this file directly from the server.

In [None]:
# URL to the station file
station_details_URL = 'https://staff.ral.ucar.edu/hopson/GloFAS/Q2Qbiascorrection/glofas_stations_details.CSV'
    
# Create a data frame
df = pd.read_csv(station_details_URL)
df.head()

In [None]:
# Station selection dropdown
w = drop_down_select(df['sta_id'].tolist(), descriptor='Station:')
print('Select a station')
display(w)

For consistency with the rest of this training, please select station `024-mgd5pat`.

In [None]:
# Subset the DataFrame to just this station
station = w.value
print ('Station selected: {0}'.format(station))
df_station = df.loc[df['sta_id']==station]

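# Obtain the coordinates of the station, for obtaining streamflow.
# .iloc[0] extracts the scalar; calling float() on a one-row Series is deprecated in recent pandas.
station_lat = float(df_station['lat_cwc'].iloc[0])
station_lon = float(df_station['lon_cwc'].iloc[0])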
print(station_lat, station_lon)

### View the location in Google Maps

In [None]:
# Double-check that the coordinates are correct by following this link to Google Maps
print(f'https://www.google.com/maps/place/{station_lat},{station_lon}')

### View the location in an interactive map widget within this notebook

In [None]:
# Create a spatially enabled data frame
sedf = pd.DataFrame.spatial.from_xy(df=df_station, x_column='lon_cwc', y_column='lat_cwc', sr=4326)

# Passing a place name to the constructor will initialize the extent of the map.
m1 = gis.map('India', zoomlevel=12)

# Change the size of the map widget
m1.layout = Layout(flex='1 1', padding='0px', height='600px')

# Plot the SEDF on the map
sedf.spatial.plot(map_widget=m1)

# Zoom to the layer
m1.zoom_to_layer(sedf)

# Specify map center point. Must reverse coordinate order here
m1.center = sedf.spatial.bbox.centroid[::-1]     

# Plot the map
m1

### Use the geoglows library to find the stream segment

Note that the geoglows library may provide a poorly snapped location for certain gauges, especially where gauges exist near a confluence. The coordinate provided is snapped to the nearest GEOGloWS reach, and this network may not exactly match true hydro-locations.
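The snapping distance that the package reports alongside the reach ID is expressed in degrees, so its length on the ground depends on latitude and direction. The following is a rough sketch, not part of the `geoglows` package, that brackets a degree offset between its east-west and north-south length in kilometres. It assumes a spherical Earth with a mean radius of 6371 km, and the function name `degree_offset_to_km_range` is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth, mean radius


def degree_offset_to_km_range(distance_deg, latitude_deg):
    """Return (east_west_km, north_south_km) for an offset given in degrees."""
    # Length of one degree of latitude on the assumed sphere
    km_per_degree = math.pi * EARTH_RADIUS_KM / 180.0
    north_south_km = distance_deg * km_per_degree
    # A degree of longitude shrinks with the cosine of the latitude
    east_west_km = north_south_km * math.cos(math.radians(latitude_deg))
    return east_west_km, north_south_km


# Example: a 0.01 degree offset at latitude 12.43 N
low_km, high_km = degree_offset_to_km_range(0.01, 12.43)
print(f'0.01 degrees is roughly {low_km:.2f} to {high_km:.2f} km')
```

A snapped distance of more than a few kilometres may be a sign that the gauge was matched to the wrong reach and is worth checking on the map.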

In [None]:
# Use the geoglows package function to convert from latitude/longitude to reach_id
model_data = geoglows.streamflow.latlon_to_reach(station_lat, station_lon)
reach_id = model_data['reach_id']
print('Coordinates (lat,lon): ({0},{1})'.format(station_lat, station_lon))
print('GEOGloWS Reach ID: {0}'.format(reach_id))

# Take a look at the kind of information that's available. The distances are in degrees of latitude/longitude. The equivalent length depends on the location on the globe.
print('Distance from coordinate: {0:3.4f}\u00b0'.format(model_data['distance']))

### Check for warnings for this reach

`geoglows.streamflow.forecast_warnings(<region>)` - The forecast warnings product is a rudimentary flood early warning system. It is a table produced by comparing the return periods to the forecasted average flow. If the average flow is forecasted to exceed a return period flow, the river is considered likely to experience flooding during the 15-day forecast period.

We will obtain this table for the region ("south_asia-geoglows"), and determine if this reach ID is in the table. For a list of all available regions, you can call `geoglows.streamflow.available_regions()`
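The comparison behind such a table can be sketched in a few lines. This is only an illustration of the idea of checking a forecast against return period flows. The exact rules the service uses to build the warnings table are not described here, and the function name and all numbers below are made up.

```python
def highest_exceeded_return_period(forecast_avg_flows, return_period_flows):
    """Return the largest return period (years) whose flow is exceeded, or None."""
    peak = max(forecast_avg_flows)
    exceeded = [years for years, flow in return_period_flows.items() if peak >= flow]
    return max(exceeded) if exceeded else None


# Made-up values for illustration only (m^3/s)
return_period_flows = {2: 150.0, 5: 220.0, 10: 270.0, 25: 330.0}
forecast_avg_flows = [80.0, 140.0, 235.0, 190.0, 120.0]

print(highest_exceeded_return_period(forecast_avg_flows, return_period_flows))
```

With these numbers the forecast peak of 235 m^3/s exceeds the 2-year and 5-year flows, so the reach would be flagged at the 5-year level.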

In [None]:
%%time

# Obtain the warning table for this region.
# Named warnings_df so that it does not shadow Python's built-in `warnings` module.
warnings_df = geoglows.streamflow.forecast_warnings('south_asia-geoglows')
display(warnings_df)

# warnings_df.index[0] - for a working example
if reach_id in warnings_df.index:
    print("!!! Found reach ID {0} in the warnings table !!!".format(reach_id))
    display(warnings_df.loc[reach_id])
else:
    print("Did not find reach ID {0} in the warnings table.".format(reach_id))

### Examine the recent historical forecast data for this stream segment

The `geoglows.streamflow.forecast_records()` function returns a time-series of ensemble average forecasted flow for the current calendar year (January 1 to the current date). The Records data set is a log of the average streamflow from the first day of each forecast. This is a log of the best estimate for streamflow in each river. It can be viewed and plotted with the current forecast to provide additional context for what the river has done in the recent past. This fills in the gaps between the end of the hindcast data set and the current day's forecast.

In [None]:
%%time
%matplotlib inline

records = geoglows.streamflow.forecast_records(reach_id)
records.plot(figsize=(15, 5))

### Obtain the available forecast dates and select the most recent forecast

In [None]:
# Obtain available forecast dates
available_dates = geoglows.streamflow.available_dates(reach_id)
latest_forecast = sorted(available_dates['available_dates'])[-1]
print('Most recent forecast for reach {0}: {1}'.format(reach_id, latest_forecast))

### Pull the ensemble forecast as well as forecast statistics that are based on the 51-member ensemble

From the [GEOGloWS Publication](https://onlinelibrary.wiley.com/doi/full/10.1111/jfr3.12859), the returned forecast is as follows:

A single 15-day time series of the average flow of the 51 HTESSEL ensemble members with the same spatiotemporal resolution (18 km, 3 h). The “Forecast” data set represents the best estimate of streamflow in a river based on the ensemble streamflow forecast.

1. `geoglows.streamflow.forecast_stats()` -  A statistical summary of the streamflow results for each of the 52 forecasts. Forecast Stats provides seven separate 15-day time series; one each for the maximum, minimum, average, median, 75th percentile, and 25th percentile of the forecast ensemble members at each time step plus the higher resolution forecast. This provides a more complete summary of the available forecasts than the simple average. A sample plot of this data is shown in the figure below.


2. `geoglows.streamflow.forecast_ensembles()` - About 52 separate, 15-day time series of streamflow forecasts. These time series include the 51 forecasts coming from each of the HTESSEL runoff ensemble members as well as the single higher resolution deterministic runoff forecast.

<img src="https://onlinelibrary.wiley.com/cms/asset/d00d261b-ee84-4da2-bff9-1cb7e39f8fe3/jfr312859-fig-0005-m.jpg" width="1000">
A graphical representation of the forecast statistics. The blue shaded region shows the maximum and minimum range. The green shaded region marks the 25th and 75th percentile range. The solid blue line shows the average. The black line shows the high-resolution ensemble member.
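To make the relationship between the ensemble members and the summary statistics concrete, here is a small standard-library sketch that computes the same kinds of statistics at each time step. The member values are made up, the function name is hypothetical, and the `inclusive` quantile method is an assumption chosen because it interpolates linearly between sorted values. The service may compute its percentiles differently.

```python
import statistics


def ensemble_statistics(members):
    """members: list of equal-length flow series, one per ensemble member.

    Returns one dict of summary statistics per time step.
    """
    rows = []
    for flows in zip(*members):  # all member values at one time step
        q25, q50, q75 = statistics.quantiles(flows, n=4, method='inclusive')
        rows.append({
            'max': max(flows),
            '75%': q75,
            'avg': statistics.fmean(flows),
            'median': q50,
            '25%': q25,
            'min': min(flows),
        })
    return rows


# Five made-up members with three time steps each (m^3/s)
members = [
    [10.0, 12.0, 15.0],
    [11.0, 14.0, 18.0],
    [9.0, 11.0, 13.0],
    [12.0, 16.0, 21.0],
    [10.0, 13.0, 17.0],
]
for row in ensemble_statistics(members):
    print(row)
```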

### Analyze the forecast using a statistical representation of the ensemble forecast

In [None]:
%%time

# Pull a statistical representation of the streamflow forecast time-series products:
stats = geoglows.streamflow.forecast_stats(reach_id, forecast_date=latest_forecast)
display(stats)

forecast_length = stats.index[-1]-stats.index[0]
print('Forecast length: {0} ({1} timesteps)'.format(forecast_length, len(stats.index)))

In [None]:
# Plot the ensemble average flow forecast from the forecast statistics table.
stats['flow_avg_m^3/s'].plot(figsize=(15, 5), marker='o', markersize=2, legend=True)

In [None]:
# Plot all of the data from the forecast statistics table.
stats.plot(figsize=(15, 5), title='Forecasted Streamflow', linestyle='-', marker='.', lw=1, markersize=2, legend=True)

### Analyze the forecast using the 51-member ensemble

In [None]:
%%time

# Pull the full ensemble 15-day forecast
ensembles = geoglows.streamflow.forecast_ensembles(reach_id, forecast_date=latest_forecast)
ensembles

### Save forecast data to disk

We will save this forecast data as a CSV file to be read in later. 

Make sure the selected forecast point is consistent with the basin delineation performed earlier.

In [None]:
%%time

# Setup output files
out_file_stats = output_data_dir / 'GEOGloWS_forecast_stats.csv'
out_file_ens = output_data_dir / 'GEOGloWS_forecast_ens.csv'

# Save to disk
stats.to_csv(out_file_stats)
print('Saved output GEOGloWS ensemble statistics file to disk: {0}'.format(out_file_stats))
ensembles.to_csv(out_file_ens)
print('Saved output GEOGloWS ensemble forecast file to disk: {0}'.format(out_file_ens))

### Pull historic simulation data

First, attempt to pull the data from the GEOGloWS API.

In [None]:
%%time

# Historical simulation products from GEOGloWS
hist = geoglows.streamflow.historic_simulation(reach_id)
rperiods = geoglows.streamflow.return_periods(reach_id)
day_avg = geoglows.streamflow.daily_averages(reach_id)
mon_avg = geoglows.streamflow.monthly_averages(reach_id)

A variety of historical data can be obtained for any given reach from the `geoglows` API.

1. `geoglows.streamflow.historic_simulation` - A time series of simulated daily average streamflow produced by the ERA5 reanalysis forcing. This data set started on January 1, 1979 with one value per day. We update the historical simulation as additional curated ERA5 data becomes available; usually monthly with 2 months of lag from the present. 

2. `geoglows.streamflow.monthly_averages()` & `daily_averages()` - Daily and Monthly Averages of the 40+ year historical simulation which could be understood as representations of a typical hydrologic year. The daily average data set has 366-time steps (each day of the year including leap day), while the monthly average data set has 12 steps.

3. `geoglows.streamflow.return_periods()` - This data set includes an estimation of 2-, 5-, 10-, 25-, 50-, and 100-year return period high flows of each stream segment using the Gumbel type-I distribution. The 2-year return period was included since it is a typical approximation for “bank-full” flows. There are many challenges to accurately estimating flood frequencies and bank-full conditions which are documented in the literature (Ahilan et al., 2013; Konrad & Restivo, 2021; Wilkerson, 2008; Zhou & Jin, 2021; Zsoter et al., 2020). GEOGloWS chose to use the 2-year return period for visualization and analytical purposes. This is comparable to the GloFAS which also provides a 2-year return period and the NWM which uses a 1.5-year return period as the threshold for bank-full. This generalization will not be accurate for all rivers.
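As an illustration of the return period estimate, the sketch below applies the textbook method-of-moments form of the Gumbel type-I distribution to a series of annual maximum flows. This is an assumption about the fitting approach. The text above only states that a Gumbel type-I distribution is used, so the values returned by the service may differ. The annual maxima are made up and the function names are hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649


def gumbel_frequency_factor(return_period_years):
    """Frequency factor K_T for the Gumbel type-I distribution (method of moments)."""
    t = return_period_years
    return -(math.sqrt(6.0) / math.pi) * (EULER_GAMMA + math.log(math.log(t / (t - 1.0))))


def gumbel_return_period_flow(annual_maxima, return_period_years):
    """Estimate the T-year flow as mean + K_T * sample standard deviation."""
    mean = statistics.fmean(annual_maxima)
    std = statistics.stdev(annual_maxima)
    return mean + gumbel_frequency_factor(return_period_years) * std


# Made-up annual maximum flows (m^3/s), for illustration only
annual_maxima = [310.0, 420.0, 280.0, 515.0, 390.0, 460.0, 350.0, 605.0, 330.0, 440.0]
for years in (2, 5, 10, 25, 50, 100):
    flow = gumbel_return_period_flow(annual_maxima, years)
    print(f'{years:>3}-year flow: {flow:.0f} m^3/s')
```

The 2-year flow falls slightly below the mean of the annual maxima because the Gumbel distribution is skewed toward high values.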

Occasionally, the geoglows service will time-out for large requests such as the `historic_simulation` data. We can optionally load this data from disk.

In [None]:
%%time

# Select whether or not to load data from local disk
load_local = False

if load_local:
    # Load previously-generated GEOGloWS historical simulation products
    hist = pd.read_csv(os.path.join(input_data_dir, 'GEOGloWS_historical_simulations.csv'))
    rperiods = pd.read_csv(os.path.join(input_data_dir, 'GEOGloWS_return_periods.csv'))
    day_avg = pd.read_csv(os.path.join(input_data_dir, 'GEOGloWS_daily_averages.csv'))
    mon_avg = pd.read_csv(os.path.join(input_data_dir, 'GEOGloWS_monthly_averages.csv')).drop(columns=['datetime'])

### Plot the historic simulation time-series

In [None]:
%matplotlib inline

# Historical simulation data
hist_length = hist.index[-1]-hist.index[0]
print('Length of historical simulation time-series: {0} ({1} timesteps)'.format(hist_length, len(hist.index)))
hist.plot(figsize=(15, 4), title='Historical Simulated Streamflow', lw=1)

# Day-of-year average from the historical simulation
day_avg.plot(figsize=(15, 4), title='Daily Average Simulated Streamflow', lw=2)

# Monthly average from the historical simulation
mon_avg.plot(figsize=(15, 4), title='Monthly Average Simulated Streamflow', lw=2)

### Plot the forecast statistics for this stream segment

The `geoglows` library contains some handy plotting functions that will render a `plotly` plot of the data. The plot also displays useful information such as return periods for additional context.

Unfortunately, the returned plot must currently be rendered in a browser because it is not supported in this version of Jupyter Notebook. The plot can be written as HTML to disk for viewing later, or the user can call the `.show()` method on the plot to open in the default browser.

In [None]:
%matplotlib inline

# Statistical summary of the forecasted flows
forecast_figure = geoglows.plots.forecast_stats(stats, rperiods, titles={'Reach ID': reach_id})

# Save the figure locally
out_file = os.path.join(output_data_dir, 'GeoGLOWS_{0}.html'.format(station.replace('-', '_')))
forecast_figure.write_html(out_file)

# Display the figure
forecast_figure.show()

In [None]:
# Hydroviewer plot
geoglows.plots.hydroviewer(records, stats, ensembles, rperiods).show()

The previous cell should open your computer's default web browser to render the HTML. Once you have finished exploring the interactive figure, please close it and continue this lesson.

## Application: Compute aggregated forecast statistics as a proxy for reservoir inflow

We have previously identified the representative inflow segments from the GEOGloWS flow network for the reservoir created by the K.R. Sagara (KRS) Dam on the Cauvery River in Karnataka.

The inflow reach IDs are as follows: `5076980, 5076976, 5076778, 5076636, 5076687`

In [None]:
# Define the inflow reservoir name and GEOGloWS reach IDs
reservoir_name = 'K_R_Sagara'

# Reservoir location
point_lon_lat = (76.586, 12.43)    # KRS Dam

# Define the inflow reach IDs from GEOGloWS
inflow_reach_IDs = [5076980, 5076976, 5076778, 5076636, 5076687]

### Make a map of the area

Create a map widget, centered on the reservoir of interest. Add the GEOGloWS forecast layer, and a point for the dam location for reference.

In [None]:
from arcgis.features import FeatureLayerCollection
from arcgis.geometry import Point

# URL for the ArcGIS Feature Service of GeoGLOWs data
geoglows_url = r'https://livefeeds2.arcgis.com/arcgis/rest/services/GEOGLOWS/GlobalWaterModel_Medium/MapServer'
geoglows_FLC = FeatureLayerCollection(geoglows_url)

# Item ID for a layer of CWC reservoirs
CWC_Reservoirs = "34b71f5ea24b49ce857e8ee5e71a4117"
res_layer = gis.content.get(CWC_Reservoirs)
    
# Passing a place name to the constructor will initialize the extent of the map.
map = gis.map('India', zoomlevel=11)

map.add_layer(res_layer)

# Add layer if it returns a valid HTTP code
if gis.content.check_url(geoglows_url)['httpStatusCode']==200:
    map.add_layer(geoglows_FLC.layers[0])
    
# Change the size of the map widget. Limit the size to get the feature layer to draw some features quickly.
map.layout = Layout(flex='1 1', padding='0px', height='600px', width='800px')

# Specify map center point. Must reverse coordinate order here
map.center = [point_lon_lat[1], point_lon_lat[0]] # center the map on the dam location

# Draw point on the map
pt = Point({"x" : point_lon_lat[0], "y" : point_lon_lat[1], "spatialReference" : {"wkid" : 4326}})
map.draw(pt)

# Set a particular basemap
map.basemap = 'arcgis-topographic'

# Plot the map
map

The inflow reaches for this reservoir were selected by identifying the reach ID of each GEOGloWS flowline that represents an inflow to the reservoir.

#### Now aggregate the individual reach forecasts to emulate a forecast of reservoir inflows 

We will iterate over each of these reaches, combining the forecasts by adding all of the ensemble and deterministic forecasts for the inflows identified above. Much of the functionality here has already been demonstrated, but in this section we will demonstrate how to aggregate forecasts to simulate the naturalized inflows to a theoretical reservoir. Remember that reservoirs do not exist on the GEOGloWS network, so caution must be applied when interpreting the results.
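The aggregation itself is an element-wise sum over matching time steps. The sketch below shows the idea with plain dictionaries and made-up values. The reach IDs `101` to `103` and the function name are hypothetical. It raises an error if the reaches are not on the same time steps, whereas adding pandas DataFrames aligns on the index and leaves `NaN` where a time step is missing from one of them.

```python
def sum_reach_series(reach_series):
    """Add several flow series keyed by timestamp.

    Raises ValueError if the reaches do not share the same timestamps,
    because a silent mismatch would under-count the total inflow.
    """
    series_list = list(reach_series.values())
    timestamps = list(series_list[0])
    for series in series_list[1:]:
        if list(series) != timestamps:
            raise ValueError('Reach series are not on the same time steps')
    return {t: sum(series[t] for series in series_list) for t in timestamps}


# Made-up flows (m^3/s) for three hypothetical reach IDs
reach_series = {
    101: {'2023-06-01T00': 5.0, '2023-06-01T03': 6.0},
    102: {'2023-06-01T00': 2.5, '2023-06-01T03': 3.0},
    103: {'2023-06-01T00': 1.0, '2023-06-01T03': 1.5},
}
print(sum_reach_series(reach_series))
```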

In [None]:
%%time

# Gather statistics from forecast ensembles
print('Iterating over {0} reaches to aggregate forecasts.'.format(len(inflow_reach_IDs)))

# Aggregate historical simulations
historical = True        

# Iterate over each inflow ID and gather the ensemble statistics and (optional) the historical simulation data
for n,inflow_reach in enumerate(inflow_reach_IDs):
    print('  [{0}] Obtaining data for inflow reach {1}'.format(n+1, inflow_reach))
    if n == 0:
        ensembles_df = geoglows.streamflow.forecast_ensembles(inflow_reach, forecast_date=latest_forecast)
        if historical:
            historical_df = geoglows.streamflow.historic_simulation(inflow_reach)
    else:   
        ensembles_df += geoglows.streamflow.forecast_ensembles(inflow_reach, forecast_date=latest_forecast)
        if historical:
            historical_df += geoglows.streamflow.historic_simulation(inflow_reach)

#### We will re-construct a new statistics DataFrame from the aggregated ensembles

Because the 52nd ensemble member is a deterministic, high-resolution run of HTESSEL, we can pull it directly from the aggregated ensemble DataFrame. For the rest of the fields, we can compute them directly from the ensemble forecast. First, we will remove the deterministic, high-resolution forecast from the ensemble table, and then calculate row-by-row statistics directly from ensemble members 1-51. The statistics calculated are the minimum, maximum, mean, and the 25th and 75th percentiles.

In [None]:
%%time

# Build statistics from forecast ensembles table
stats_df = pd.DataFrame()

# Remove ensemble member 52, which is a high-resolution deterministic forecast
pure_ensembles = ensembles_df.drop(columns=['ensemble_52_m^3/s'])

# Populate the columns of the table using ensemble members 1-51.
stats_df['flow_max_m^3/s'] = pure_ensembles.max(axis=1)
stats_df['flow_75%_m^3/s'] = pure_ensembles.quantile(0.75, axis=1)
stats_df['flow_avg_m^3/s'] = pure_ensembles.mean(axis=1)
stats_df['flow_25%_m^3/s'] = pure_ensembles.quantile(0.25, axis=1)
stats_df['flow_min_m^3/s'] = pure_ensembles.min(axis=1)
stats_df['high_res_m^3/s'] = ensembles_df['ensemble_52_m^3/s']

# Display the resulting dataframe
stats_df

### Save aggregated reservoir inflow simulations

In [None]:
%%time

# Define output CSV files
out_file_ensembles = os.path.join(output_data_dir, f'GeoGLOWS_Reservoir_Inflow_ensembles_{reservoir_name}.csv')
out_file_historical = os.path.join(output_data_dir, f'GeoGLOWS_Reservoir_Inflow_historical_{reservoir_name}.csv')
out_file_stats = os.path.join(output_data_dir, f'GeoGLOWS_Reservoir_Inflow_ensemble_stats_{reservoir_name}.csv')

# Save files to disk
ensembles_df.to_csv(out_file_ensembles)
stats_df.to_csv(out_file_stats)
if historical:
    historical_df.to_csv(out_file_historical)

### Plot aggregated reservoir inflow forecasts

In [None]:
%matplotlib inline

if historical:
    historical_df.plot(figsize=(15, 5), title=f'Aggregated Historical Reservoir Inflow Simulation for {reservoir_name}', linestyle='-', marker='.', lw=1, markersize=2, legend=True)

In [None]:
%matplotlib inline
stats_df.plot(figsize=(15, 5), title=f'Aggregated Reservoir Inflow Forecast Statistics for {reservoir_name}', linestyle='-', marker='.', lw=1, markersize=2, legend=True)

In [None]:
%matplotlib inline
pure_ensembles.plot(figsize=(15, 5), title=f'Aggregated Reservoir Inflow Ensemble Forecast for {reservoir_name}', linestyle='-', marker='.', lw=1, markersize=2, legend=False)

### Use the `geoglows` library to plot the forecasted inflows

Because we created a statistics table identical to the `forecast_stats` table returned by GEOGloWS, we can use the plotting utilities to plot the aggregated forecast. The only issue is that we do not have return periods for an aggregation of forecasts, so we will pass `None`, and return periods will not be displayed.

In [None]:
%matplotlib inline

# Statistical summary of the forecasted flows
forecast_figure = geoglows.plots.forecast_stats(stats_df, None, titles={f'Inflows for {reservoir_name} Reservoir': ''})

# Save the figure locally
out_file = os.path.join(output_data_dir, 'GeoGLOWS_Reservoir_Inflows.html')
forecast_figure.write_html(out_file)

# Display the figure
forecast_figure.show()

## Bias Correction of GEOGloWS

In order to use the bias correction tools in the geoglows package, you need 3 things.

1. Observed streamflow data
2. Simulated historical streamflow data from the geoglows model
3. Data to correct: either the historical data, or any other timeseries of simulated flows from the geoglows model

The simulated historical and predicted streamflow are available through the GEOGloWS ECMWF Streamflow model via the geoglows python package.

Methods for recording streamflow and the formats used to save it vary by country, and not all streamflow data is publicly available online. As such, there is no generic tool for retrieving observed streamflow through the geoglows package. You will need to provide it yourself.

GEOGloWS bias correction code can be found in the [bias.py](https://github.com/geoglows/pygeoglows/blob/master/geoglows/bias.py) script within the `geoglows` library.

### NCAR EHP Naturalized Flow Data

Naturalized flow is the flow that would have occurred in the river had no human interventions taken place; that is, the influence of regulation, use, and management is removed, leaving an unregulated ‘natural’ flow. Naturalized flows are typically used to understand the water availability in river basins and are beneficial in the development and calibration of hydrologic models, which represent rainfall-runoff processes that are not influenced by human activities. Among the several approaches for development of naturalized flow series in the literature (Emerson, 2005; Prairie & Callejo, 2005), the most common approach is based on development of a water budget using traditional mass balance methods.
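A mass balance of this kind can be sketched for a single reservoir and a single time step as follows. This is a simplified illustration under stated assumptions: the function name and the choice of terms are hypothetical, all values are made up, and a full naturalization would also have to account for regulation and water use upstream of the reservoir.

```python
def inflow_from_water_budget(storage_start, storage_end, release, evaporation, withdrawals):
    """Back-calculate the inflow volume for one time step from a reservoir water budget.

    All arguments are volumes over the same time step, in the same unit.
    inflow = change in storage + everything that left the reservoir
    """
    return (storage_end - storage_start) + release + evaporation + withdrawals


# Made-up daily volumes in units of 1000 m^3
inflow = inflow_from_water_budget(
    storage_start=52000.0,
    storage_end=52300.0,
    release=900.0,
    evaporation=40.0,
    withdrawals=60.0,
)
print(inflow)
```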

NCAR and other collaborators have derived a daily naturalized inflow dataset for a number of reservoirs in India. This dataset can be used as the 'observation' dataset for comparing to simulated flow, and for the purpose of bias-correcting an inflow forecast. 

To use the `geoglows.bias` functions, the input observation data must have two columns, each with its name in the first row. The first column should be titled `datetime` and contain dates in a standard format. The other may have any title but *must* contain streamflow values in cubic meters per second (m^3/s).
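The sketch below shows one way to check that layout and convert units using only the standard library. The function name `read_observations`, the sample column name, and the sample values are hypothetical. The unit factor follows from 1000 m^3 per 86400 seconds.

```python
import csv
import io
from datetime import datetime

SECONDS_PER_DAY = 86400.0


def read_observations(csv_text, flow_column, to_m3s=1.0):
    """Parse CSV text with a 'datetime' column and one flow column.

    to_m3s is the factor that converts the file's flow unit to m^3/s.
    Returns a list of (datetime, flow_m3s) tuples.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or reader.fieldnames[0] != 'datetime':
        raise ValueError("The first column must be titled 'datetime'")
    rows = []
    for record in reader:
        when = datetime.strptime(record['datetime'], '%Y-%m-%d')
        rows.append((when, float(record[flow_column]) * to_m3s))
    return rows


# Made-up sample in thousands of cubic meters per day
sample = 'datetime,flow_1000m3_per_day\n2020-01-01,864.0\n2020-01-02,1728.0\n'
observations = read_observations(sample, 'flow_1000m3_per_day',
                                 to_m3s=1000.0 / SECONDS_PER_DAY)
print(observations)
```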

In [None]:
# Read data from CSV file
observed_data = pd.read_csv(os.path.join(input_data_dir, 'reservoir', 'Cauvery_KRS_NatFlow_1Dtmh.csv'))

# Interpret time from string to datetime object and set time as the index
observed_data['Date [YYYY-MM-DD]'] = pd.to_datetime(observed_data['Date [YYYY-MM-DD]'], format ='%Y-%m-%d')
observed_data = observed_data.rename(columns={'Date [YYYY-MM-DD]': "datetime"})
observed_data = observed_data.set_index("datetime").tz_localize('UTC')

# Adjust from native units of 1000m3/day to m3/s
observed_data = (observed_data*1000.)/86400.
observed_data = observed_data.rename(columns={'AdjNatFlow [1000m3/day]': 'AdjNatFlow [m3/s]'})

# Plot the time series
observed_data.plot(figsize=(15, 5), title=f'Naturalized Reservoir Inflows for {reservoir_name}', linestyle='-', lw=1, markersize=2, legend=True)

In [None]:
fig, ax = plt.subplots(figsize=(15, 5))
historical_df.plot(title=f'Naturalized and Simulated Reservoir Inflows for {reservoir_name}', linestyle='-', marker='.', lw=0.5, markersize=1, legend=True, ax=ax)
observed_data.plot(linestyle='-', lw=0.5, markersize=1, legend=True, ax=ax)

In [None]:
pd.concat([observed_data.describe(), historical_df.describe()], axis=1)

If we can treat the naturalized inflow time series for this reservoir as `observations`, then we can bias-correct the forecasts using the observations, the GEOGloWS historical simulations (forced with ERA5), and the aggregated inflow forecast data.
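The underlying idea is a mapping between two flow distributions: find where a simulated value sits in the distribution of simulated flows, then read off the observed flow at the same probability. The sketch below is a simplified illustration of that idea with made-up numbers and hypothetical function names. It works on sorted samples instead of monthly histograms and clamps values outside the simulated range, so it is not a reproduction of the `geoglows.bias` code.

```python
from bisect import bisect_right


def empirical_probability(sorted_values, x):
    """Non-exceedance probability of x, interpolated between sorted sample values."""
    n = len(sorted_values)
    if x <= sorted_values[0]:
        return 0.0
    if x >= sorted_values[-1]:
        return 1.0
    upper = bisect_right(sorted_values, x)  # first index with a value greater than x
    lower = upper - 1
    weight = (x - sorted_values[lower]) / (sorted_values[upper] - sorted_values[lower])
    return (lower + weight) / (n - 1)


def empirical_quantile(sorted_values, p):
    """Value at non-exceedance probability p, with linear interpolation."""
    position = min(max(p, 0.0), 1.0) * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1.0 - weight) + sorted_values[upper] * weight


def quantile_map(value, simulated_history, observed_history):
    """Map a simulated flow onto the observed flow distribution."""
    sim_sorted = sorted(simulated_history)
    obs_sorted = sorted(observed_history)
    return empirical_quantile(obs_sorted, empirical_probability(sim_sorted, value))


# Made-up example: the simulation runs at twice the observed flow (m^3/s)
simulated_history = [10.0, 20.0, 30.0, 40.0, 50.0]
observed_history = [5.0, 10.0, 15.0, 20.0, 25.0]
print(quantile_map(35.0, simulated_history, observed_history))
```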

In [None]:
%%time

corrected_historical = geoglows.bias.correct_historical(historical_df, observed_data)

### Modify the GEOGloWS bias-correction function

In [None]:
import math
import warnings  # used by warnings.warn() in _flow_and_probability_mapper
import hydrostats as hs
import hydrostats.data as hd
from scipy import interpolate

__all__ = ['correct_historical', 'correct_forecast', 'statistics_tables']


def correct_historical(simulated_data: pd.DataFrame, observed_data: pd.DataFrame) -> pd.DataFrame:
    """
    Accepts a historically simulated flow timeseries and observed flow timeseries and attempts to correct biases in the
    simulation on a monthly basis.

    Args:
        simulated_data: A dataframe with a datetime index and a single column of streamflow values
        observed_data: A dataframe with a datetime index and a single column of streamflow values

    Returns:
        pandas DataFrame with a datetime index and a single column of streamflow values
    """
    # list of the unique months in the historical simulation. should always be 1->12 but just in case...
    unique_simulation_months = sorted(set(simulated_data.index.strftime('%m')))
    dates = []
    values = []

    for month in unique_simulation_months:
        # filter historic data to only be current month
        monthly_simulated = simulated_data[simulated_data.index.month == int(month)].dropna()
        to_prob = _flow_and_probability_mapper(monthly_simulated, to_probability=True, extrapolate=True)
        # filter the observations to current month
        monthly_observed = observed_data[observed_data.index.month == int(month)].dropna()
        to_flow = _flow_and_probability_mapper(monthly_observed, to_flow=True, extrapolate=True)

        dates += monthly_simulated.index.to_list()
        value = to_flow(to_prob(monthly_simulated.values))
        values += value.tolist()

    corrected = pd.DataFrame(data=values, index=dates, columns=['Corrected Simulated Streamflow'])
    corrected.sort_index(inplace=True)
    return corrected

def _flow_and_probability_mapper(monthly_data: pd.DataFrame, to_probability: bool = False,
                                 to_flow: bool = False, extrapolate: bool = False) -> interpolate.interp1d:
    if not to_flow and not to_probability:
        raise ValueError('You need to specify either to_probability or to_flow as True')

    # get maximum value to bound histogram
    max_val = math.ceil(np.max(monthly_data.max()))
    min_val = math.floor(np.min(monthly_data.min()))

    if max_val == min_val:
        warnings.warn('The observational data has the same max and min value. You may get unanticipated results.')
        max_val += .1

    # determine number of histograms bins needed
    number_of_points = len(monthly_data.values)
    number_of_classes = math.ceil(1 + (3.322 * math.log10(number_of_points)))

    # specify the bin width for histogram (in m3/s)
    step_width = (max_val - min_val) / number_of_classes

    # specify histogram bins
    bins = np.arange(-np.min(step_width), max_val + 2 * np.min(step_width), np.min(step_width))

    # np.concatenate needs sequences, so wrap the scalar in a list
    if bins[0] == 0:
        bins = np.concatenate(([-bins[1]], bins))
    elif bins[0] > 0:
        bins = np.concatenate(([-bins[0]], bins))

    # make the histogram
    counts, bin_edges = np.histogram(monthly_data, bins=bins)

    # keep the upper edge of each bin so the edges line up with the cumulative counts
    bin_edges = bin_edges[1:]

    # normalize the histograms
    counts = counts.astype(float) / monthly_data.size

    # calculate the cdfs
    cdf = np.cumsum(counts)

    # interpolated function to convert simulated streamflow to prob
    if to_probability:
        if extrapolate:
            return interpolate.interp1d(bin_edges, cdf, fill_value='extrapolate')
        return interpolate.interp1d(bin_edges, cdf)
    # interpolated function to convert simulated prob to observed streamflow
    elif to_flow:
        if extrapolate:
            return interpolate.interp1d(cdf, bin_edges, fill_value='extrapolate')
        return interpolate.interp1d(cdf, bin_edges)

### Use the modified functions and plot the bias-corrected historical simulation

In [None]:
# Bias-correct historical
corrected_historical = correct_historical(historical_df, observed_data)

# Save file to disk
out_file_historical_bc = os.path.join(output_data_dir, f'GeoGLOWS_Reservoir_Inflow_historical_corrected_{reservoir_name}.csv')
corrected_historical.to_csv(out_file_historical_bc)

corrected_historical

In [None]:
# You can add more entries to the dictionary and they will appear in the title of the graph
titles = {'Reach ID': "KRS", 'bias_corrected': True}

# This is a plot of the Original Simulated, Corrected Simulated, and Observed data
geoglows.plots.corrected_historical(corrected_historical, historical_df, observed_data, titles=titles).show()

The resulting plot is displayed in a browser window.

Now examine the statistics of the 'observations' (naturalized inflows), the GEOGloWS historical simulations, and the bias-corrected GEOGloWS historical simulations.

In [None]:
pd.concat([observed_data.describe(), historical_df.describe(), corrected_historical.describe()], axis=1)

Now, let's proceed to bias-correct the forecast statistics and ensembles. Use the `geoglows.bias` tools to correct the bias using your observed data.

In [None]:
%%time

# Bias-correct forecast
corrected_stats = geoglows.bias.correct_forecast(stats_df, historical_df, observed_data)
corrected_ensembles = geoglows.bias.correct_forecast(ensembles_df, historical_df, observed_data)

Since there are many lines on the forecast plots, we recommend plotting the adjusted and original forecasts side by side rather than overlaying them together.

In [None]:
# corrected data
geoglows.plots.forecast_stats(corrected_stats, titles=titles).show()

In [None]:
corrected_stats

### Statistics, Summaries, Averages, etc.

There are many tools in the geoglows package for analyzing how much the bias correction improved the streamflow simulations. These are based on the statistical analysis performed by the hydrostats and HydroErr Python packages.
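For intuition about what such comparisons measure, the sketch below computes three common goodness-of-fit metrics by hand for made-up original and corrected simulations. The choice of metrics and the function names are assumptions for illustration. They are not taken from the hydrostats or HydroErr packages, which offer many more.

```python
import math


def mean_error(simulated, observed):
    """Average of simulated minus observed; positive means over-prediction."""
    return sum(s - o for s, o in zip(simulated, observed)) / len(observed)


def rmse(simulated, observed):
    """Root mean square error."""
    return math.sqrt(sum((s - o) ** 2 for s, o in zip(simulated, observed)) / len(observed))


def nash_sutcliffe(simulated, observed):
    """Nash-Sutcliffe efficiency; 1.0 is a perfect match."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((s - o) ** 2 for s, o in zip(simulated, observed))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / variance


# Made-up flows (m^3/s)
observed = [10.0, 20.0, 30.0, 40.0]
original = [14.0, 26.0, 33.0, 47.0]
corrected = [11.0, 19.0, 31.0, 41.0]

for label, series in (('original', original), ('corrected', corrected)):
    print(label, mean_error(series, observed), rmse(series, observed),
          nash_sutcliffe(series, observed))
```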

In [None]:
# This is a scatter plot of the original and corrected simulated data against the observed data
geoglows.plots.corrected_scatterplots(corrected_historical, historical_df, observed_data, titles=titles).show()

In [None]:
# This is a plot of the monthly averages
geoglows.plots.corrected_month_average(corrected_historical, historical_df, observed_data, titles=titles).show()

In [None]:
# This is a plot of the daily averages
geoglows.plots.corrected_day_average(corrected_historical, historical_df, observed_data, titles=titles).show()

In [None]:
# This is a plot of the cumulative annual volumes
geoglows.plots.corrected_volume_compare(corrected_historical, historical_df, observed_data, titles=titles).show()

In [None]:
# This is a table of a few important statistics 
display(HTML(geoglows.bias.statistics_tables(corrected_historical, historical_df, observed_data)))

### Plot the ensemble of bias-corrected forecasts for this reservoir

In [None]:
%matplotlib inline
corrected_ensembles.plot(figsize=(15, 5), title=f'Bias-Corrected Aggregated Reservoir Inflow Ensemble Forecast for {reservoir_name}', linestyle='-', marker='.', lw=1, markersize=2, legend=False)

Save the data to disk for use later on in this training.

In [None]:
# Setup output files
out_file_corrected = output_data_dir / f'GeoGLOWS_Reservoir_Inflow_forecast_ensemble_corrected_{reservoir_name}.csv'

# Save
corrected_ensembles.to_csv(out_file_corrected)

### Conclusion

We installed the open-source `geoglows` library, which provides a powerful API for accessing streamflow forecasts based on GEOGloWS and GloFAS ensembles.

* For a selected reach, we examined and plotted the available data in the forecast. 
* For a known reservoir, we used a list of the primary inflow reaches from GEOGloWS and aggregated forecast ensembles, manually re-building the statistics for the ensemble and high-resolution forecasts of streamflow to simulate possible forecasts of reservoir inflows.
* We used the reservoir inflow historical simulation data as well as a time-series of naturalized inflows for that reservoir, and used the `geoglows` library to bias-correct the historical time series.
* We also bias-corrected the ensemble streamflow forecast using the naturalized flow time-series.

Let's save this notebook and save the current map document.

In [None]:
aprx.save()

### Reset the namespace

The following `%reset -f` command is an IPython magic command, available in Jupyter Notebook, that resets the namespace without asking for confirmation. It is good practice to run it when you are finished with the notebook.

In [None]:
%reset -f

# Next up - Explore Gridded Precipitation Data

This concludes this lesson. In the next lesson, we will explore gridded precipitation data.

**IT IS BEST TO EITHER SHUT DOWN THIS LESSON OR CLOSE IT BEFORE PROCEEDING TO THE NEXT LESSON TO AVOID POSSIBLY EXCEEDING ALLOCATED MEMORY. Select `Command Palette -> restart kernel`.**

© UCAR 2023