# `get_data()` Demo Notebook from `CIROH-UVM/forecast-workflow`

## Setup
1. Clone the [forecast-workflow repository](https://github.com/CIROH-UVM/forecast-workflow/tree/main) to your user space on the testbed
2. Launch Jupyter Lab and select the kernel called "forecast"
3. Add the path to your cloned repo to your Python path by running the cell below (only run once per notebook)

In [None]:
# import sys
# sys.path.append('/your/path/to/forecast-workflow')
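
Re-running the cell above appends the same path again. A minimal sketch of an idempotent alternative follows; `add_repo_to_path` is a hypothetical helper written for this notebook, not part of `forecast-workflow`, and the path is the same placeholder used above.

```python
import sys
from pathlib import Path


def add_repo_to_path(repo_dir):
    """Append repo_dir to sys.path only if it is not already there.

    Returns True when the path was added and False when it was already present.
    Hypothetical helper, not part of the forecast-workflow repository.
    """
    repo = str(Path(repo_dir).expanduser())
    if repo in sys.path:
        return False
    sys.path.append(repo)
    return True


# placeholder path from the setup step; replace with the location of your clone
add_repo_to_path('/your/path/to/forecast-workflow')
```

With this helper, the cell can be run any number of times without growing `sys.path`.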

## NWM forecasted streamflow data

In [None]:
'''
A function to download and process NWM hydrology forecast data and return a nested dictionary of pandas series for each variable, for each location.

Args:
-- start_date: The start date for which to retrieve data.
-- end_date: The end date for which to retrieve data.
-- member: The member type of NWM forecast to get (medium_range_mem1, long_range_mem3, short_range, etc).
-- locations: A dictionary or list of locations to pull out of the forecast files. When a dict is passed in the format {"user-name":"gauge_ID"},
	then the function will use those user-defined names when returning data. Otherwise, default location IDs found in the dataset are used.
-- variables: A dictionary or list of variables to pull out of the 'channel_rt' forecast files. When a dictionary is passed in the format {"user-name":"variable-name"},
	then the function will use those user-defined names when returning data. Otherwise, default variable names found in the dataset are used. 
-- reference_date: The forecast reference time, i.e., the date and time at which 
	the forecast was initialized. Defaults to start_date if None.
-- data_dir: Directory to store downloaded data. Defaults to OS's default temp directory.
-- format: The format in which to return the data. Default is 'dictionary', which returns a nested dictionary of pandas series. Other valid option is 'xarray', which returns an xarray dataset.
-- gcs: Flag determining whether or not to use Google buckets for NWM download as opposed to NOMADs site. Default is True.
-- end_date_exclusive: Whether to exclude the end date from the time series. Defaults to True.
-- dwnld_threads: Number of threads to use for downloads. Default is half of OS's available threads.
-- load_threads: Number of threads to use for reading data. Default is half of OS's available threads.

Returns:
NWM timeseries for the given locations in a nested dict format where 1st-level keys are user-provided location names and 2nd-level keys
are variable names and values are the respective data in a Pandas Series object, or an xarray Dataset if format='xarray'.
'''

In [None]:
import data.nwm_fc as nwm
import datetime as dt
import pandas as pd

fc_start_dt = dt.datetime(2024, 1, 16, 6)
# without an hour specified, will default to midnight forecast cycle
# fc_start_dt = "20240116"

# use the same hour as our start datetime, so that we get a full 10 days of fc data
fc_end_dt = dt.datetime(2024, 1, 26, 6)

# define some locations to grab data for. Reach IDs for the NWM can be found at: 
# https://water.noaa.gov/map
reaches = {"Missisquoi River": "166176984",
           "Jewett Brook": "4587092",
           "Mill River": "4587100"}

fc_type = "medium_range_mem1"

# define a directory in which to download NWM data
# this specific directory hosts a large collection of NWM and GFS data shared by CIROH researchers
data_directory = "/netfiles/ciroh/downloadedData/"
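
The comment above says a date string without an hour falls back to the midnight forecast cycle. The sketch below illustrates that idea only. The accepted string layouts (`YYYYMMDD` and `YYYYMMDDHH`) are assumptions, and `parse_forecast_start` is a hypothetical helper, not the parser used inside `get_data()`.

```python
import datetime as dt


def parse_forecast_start(value):
    """Turn a date string into a datetime, defaulting to the midnight cycle.

    Assumed layouts: 'YYYYMMDD' (midnight) or 'YYYYMMDDHH'.
    Hypothetical helper; the repository's own parsing may differ.
    """
    if isinstance(value, dt.datetime):
        return value
    if len(value) == 8:
        return dt.datetime.strptime(value, "%Y%m%d")
    if len(value) == 10:
        return dt.datetime.strptime(value, "%Y%m%d%H")
    raise ValueError(f"expected 8 or 10 digits, got {value!r}")


print(parse_forecast_start("20240116"))    # midnight cycle
print(parse_forecast_start("2024011606"))  # 06:00 cycle
```

Rejecting strings of any other length catches mistyped dates early instead of silently requesting the wrong forecast.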

In [None]:
nwm_data = nwm.get_data(start_date=fc_start_dt,
                        end_date=fc_end_dt,
                        member=fc_type,
                        locations=reaches,
                        data_dir=data_directory,
                        variables={'discharge': 'streamflow'},
                        format='dictionary',
                        end_date_exclusive=False)  # by default, the end date is excluded from the time series
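
To see what `end_date_exclusive` changes, the sketch below counts hourly timestamps between the start and end datetimes used in this notebook. The hourly step is an assumption made for illustration, and `hourly_steps` is a hypothetical helper. It shows only the inclusive versus exclusive end-date behaviour, not how NWM lays out its forecast timesteps.

```python
import datetime as dt


def hourly_steps(start, end, end_date_exclusive=True):
    """List hourly timestamps from start to end, optionally dropping end itself."""
    step = dt.timedelta(hours=1)  # assumed step, for illustration only
    stamps = []
    current = start
    while current < end or (current == end and not end_date_exclusive):
        stamps.append(current)
        current += step
    return stamps


start = dt.datetime(2024, 1, 16, 6)
end = dt.datetime(2024, 1, 26, 6)

print(len(hourly_steps(start, end)))                            # 240
print(len(hourly_steps(start, end, end_date_exclusive=False)))  # 241
```

The single extra timestamp in the second call is the end datetime itself.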

In [None]:
nwm_data

In [None]:
pd.DataFrame(nwm_data['Missisquoi River']['discharge'])
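
The nested dictionary is keyed first by location and then by variable, so a side-by-side table of one variable across all locations takes one extra step. The sketch below uses made-up numbers in place of a real download; `variable_frame` is a hypothetical helper, not part of the repository.

```python
import pandas as pd


def variable_frame(nested, variable):
    """Build a DataFrame with one column per location for a single variable."""
    return pd.DataFrame({location: by_variable[variable]
                         for location, by_variable in nested.items()})


# synthetic stand-in for the structure returned with format='dictionary'
index = pd.date_range("2024-01-16 06:00", periods=3, freq=pd.Timedelta(hours=1))
demo = {
    "Missisquoi River": {"discharge": pd.Series([10.0, 11.0, 12.0], index=index)},
    "Jewett Brook": {"discharge": pd.Series([0.5, 0.6, 0.7], index=index)},
}

frame = variable_frame(demo, "discharge")
print(frame)
```

With real data, the equivalent call would be `variable_frame(nwm_data, 'discharge')`.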

## NWM Forcings Data

In [None]:
import data.nwm_forcings_fc as nwm_forcings

'''
A function to download and process NWM forcings data.

Args:
-- start_date: The start date for which to retrieve data.
-- end_date: The end date for which to retrieve data.
-- member: The NWM forecast member to get forcings for (currently accepts 'medium_range', 'short_range', 'analysis_assim', and 'analysis_assim_extend').
-- locations: A dictionary containing either bounding box information or a dict of points to extract from the gridded forcings dataset. Default value of None does not spatially subset the data. See validate_locations() for more details.
	Note that the bounding box must be in latitude, longitude (WGS 1984), but the dataset is NOT reprojected and will maintain its original CRS.
-- variables: A dictionary or list of variables to pull out of the forcing files. When a dictionary is passed in the format {"user-name":"variable-name"}, the function will use those user-defined names when returning data. Otherwise, the variable names found in the dataset are used. Default value 'all' keeps all variables
-- reference_date: The forecast reference time, i.e., the date and time at which 
	the forecast you want forcings for was initialized. Defaults to start_date if None.
-- data_dir: Directory to store downloaded data. Defaults to OS's default temp directory.
-- end_date_exclusive: Whether to exclude the end date from the time series. Defaults to True.
-- dwnld_threads: Number of threads to use for downloads. Default is half of OS's available threads.

Returns:
xr.Dataset or dict: If locations is None or a bounding box, returns an xarray.Dataset. If locations specifies points, returns a nested dictionary where 1st-level keys are points, 2nd-level keys are variables, and 2nd-level values are pandas.Series
'''

In [None]:
# The NWM Forcings data getter is unique in that it can return data in two formats: an xarray dataset or a nested dictionary of pandas series.
# the format the data is returned in is determined by the nature of the locations argument passed in.

# If locations is a bounding box, the data will be returned as an xarray dataset.
boundary_box = {'bbox' : {'min_lat' : 42.34, 'max_lat' : 45.21, 'min_lon' : -75.89, 'max_lon' : -72.68}}
# If locations is a dictionary of labelled points, the data will be returned as a nested dictionary
labelled_points = {'points': {'01413088': (42.29380556, -74.5591944),
                              '01413398': (42.1508333, -74.60138889)}}
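
The rule described in the docstring and comments above can be written as a small check. This is a sketch of the documented behaviour only: `expected_return_kind` is a hypothetical helper, and the real validation lives in `validate_locations()`, which may accept or reject inputs differently.

```python
def expected_return_kind(locations):
    """Name the return type the forcings getter is documented to produce."""
    if locations is None or 'bbox' in locations:
        return 'xarray.Dataset'
    if 'points' in locations:
        return 'nested dict'
    raise ValueError("locations must be None or contain a 'bbox' or 'points' key")


bbox_example = {'bbox': {'min_lat': 42.34, 'max_lat': 45.21,
                         'min_lon': -75.89, 'max_lon': -72.68}}
points_example = {'points': {'01413088': (42.29380556, -74.5591944)}}

print(expected_return_kind(bbox_example))    # xarray.Dataset
print(expected_return_kind(points_example))  # nested dict
```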

In [None]:
forcings_data = nwm_forcings.get_data(start_date=fc_start_dt,
                                      end_date=fc_end_dt,
                                      member="medium_range",
                                      locations=boundary_box,  # or labelled_points
                                      variables='all',
                                      data_dir=data_directory,
                                      end_date_exclusive=True)  # by default, the end date is excluded from the time series

In [None]:
# Note that since we passed in a bounding box, the data is returned as an xarray dataset.
# If we had passed in a dictionary of labelled points, the data would be returned as a nested dictionary of pandas series.
# The latter method is not recommended for large datasets, as it can be very memory intensive and take a long time to execute.
forcings_data

## USGS observed streamflow data

In [None]:
"""
A function to download and process USGS observational hydrology data and return a nested dictionary of pandas series for each variable, for each location.

Args:
-- start_date (str, date, or datetime) [req]: the start date for which to grab USGS data
-- end_date (str, date, or datetime) [req]: the end date for which to grab USGS data
-- locations (dict) [req]: a dictionary (stationID/name:IDValue/latlong tuple) of locations to get USGS data for.
-- variables (dict) [req]: a dictionary of variables to download, where keys are user-defined variable names and values are dataset-specific variable names.
-- service (str) [opt]: what USGS service to get data from. Default is the instantaneous values service. For more options, see https://waterservices.usgs.gov/docs/

Returns:
USGS observed streamflow data for the given stations in a nested dict format where 1st-level keys are user-provided location names and 2nd-level keys
are variable names and values are the respective data in a Pandas Series object.
"""

In [None]:
import data.usgs_ob as usgs
import matplotlib.pyplot as plt

In [None]:
# USGS site numbers can be found at:
# https://maps.waterdata.usgs.gov/mapper/index.html
usgs_stations = {"Missisquoi River": "04294000",
                 "Jewett Brook": "04292810",
                 "Mill River": "04292750"}

In [None]:
usgs_data = usgs.get_data(start_date = "20240116",
						  end_date = "20240126",
						  locations = usgs_stations)

In [None]:
df = pd.DataFrame(usgs_data['Missisquoi River']['streamflow'].astype('float') * 0.0283168)
# pd.options.display.max_rows = 60
df

In [None]:
nwm_data['Missisquoi River']

In [None]:
plt.plot(nwm_data['Missisquoi River']['discharge'], label='NWM')
# Convert from cubic ft/s (USGS) to cubic m/s (NWM)
plt.plot(usgs_data['Missisquoi River']['streamflow'].astype('float') * 0.0283168, label='USGS')
plt.xticks(rotation = 15)
plt.ylabel('Streamflow (m^3/s)')
plt.grid()
plt.legend()

plt.show()
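
The factor 0.0283168 used above is the rounded volume of one cubic foot in cubic meters, which follows from the exact definition 1 ft = 0.3048 m. A minimal sketch of the conversion as a named helper is shown below; `cfs_to_cms` is a hypothetical name, not a function from the repository.

```python
# exact by definition: 1 ft = 0.3048 m, so 1 ft^3 = 0.3048^3 m^3
CUBIC_FEET_TO_CUBIC_METERS = 0.3048 ** 3


def cfs_to_cms(flow_cfs):
    """Convert a flow in cubic feet per second to cubic meters per second.

    Accepts numbers or numeric strings, mirroring the astype('float') step above.
    """
    return float(flow_cfs) * CUBIC_FEET_TO_CUBIC_METERS


print(cfs_to_cms("1000"))
```

For a whole series, `usgs_data['Missisquoi River']['streamflow'].astype('float').map(cfs_to_cms)` would give the same result as the inline multiplication, up to the rounding of the constant.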

## GFS Forecasted Meteorological Data

In the `forecast-workflow` repository, there are two modules for accessing GFS data: `gfs_fc.py` and `gfs_fc_thredds.py`. The former accesses data from the NOAA Operational Model Archive and Distribution System, or [NOMADS](https://nomads.ncep.noaa.gov/gribfilter.php?ds=gfs_0p25); the latter accesses data from the [NCAR THREDDS Research Data Archive](https://thredds.rda.ucar.edu/thredds/catalog/catalog_d084001.html). NOMADS only holds GFS data from the last 10 days, while the THREDDS server has GFS data going back to 2015. In this demo, we'll use the THREDDS module to access GFS data.
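
The choice between the two modules can be sketched as a simple date check. The 10-day and 2015 limits come from the paragraph above. The helper name, the year-level cutoff, and the preference for NOMADS on recent dates are assumptions made for illustration.

```python
import datetime as dt

NOMADS_RETENTION = dt.timedelta(days=10)  # NOMADS keeps roughly the last 10 days
THREDDS_FIRST_YEAR = 2015                 # coarse cutoff; exact start date not checked


def pick_gfs_module(forecast_dt, now):
    """Suggest which GFS module name to import for a given forecast datetime."""
    if forecast_dt.year < THREDDS_FIRST_YEAR:
        raise ValueError("no GFS archive listed here covers dates before 2015")
    if now - forecast_dt <= NOMADS_RETENTION:
        return "gfs_fc"
    return "gfs_fc_thredds"


print(pick_gfs_module(dt.datetime(2023, 7, 1), now=dt.datetime(2024, 1, 16)))
```

Passing `now` explicitly keeps the function deterministic; in a live notebook it would typically be the current UTC time.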

In [None]:
"""
Download specified GFS forecast data and return a nested dictionary of pandas series for each variable, for each location.

Args:
-- forecast_datetime (str, date, or datetime) [req]: the start date and time (00, 06, 12, 18) of the forecast to download. Times are assumed to be UTC time.
-- end_datetime (str, date, or datetime) [req]: the end date and time for the forecast. GFS forecasts extend 16 days out from a given start date.
-- locations (dict) [req]: a dictionary (stationID/name:IDValue/latlong tuple) of locations to download forecast data for.
-- data_dir (str) [opt]: directory to store downloaded data. Defaults to OS's default temp directory.
-- dwnld_threads (int) [opt]: number of threads to use for downloads. Default is half of OS's available threads.
-- load_threads (int) [opt]: number of threads to use for reading data. Default is 2 for GFS, since file reads are already pretty fast.
-- useTCDCInstant (bool) [opt]: whether to use the instantaneous variable for cloud cover or the rolling average value.

Returns:
GFS forecast data for the given locations in a nested dict format where 1st-level keys are user-provided location names and 2nd-level keys
are variable names and values are the respective data in a Pandas Series object.
"""
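
The two constraints stated in the docstring (forecast cycles at 00, 06, 12, and 18 UTC, and a 16-day forecast horizon) can be checked before any download starts. The sketch below is a hypothetical pre-flight check, not validation performed by `get_data()` itself.

```python
import datetime as dt

GFS_CYCLE_HOURS = (0, 6, 12, 18)     # UTC cycle hours named in the docstring
GFS_HORIZON = dt.timedelta(days=16)  # forecast length named in the docstring


def check_gfs_window(forecast_datetime, end_datetime):
    """Raise ValueError if the request is off-cycle or beyond the horizon.

    Returns the requested lead time as a timedelta when the window is valid.
    """
    on_cycle = (forecast_datetime.hour in GFS_CYCLE_HOURS
                and forecast_datetime.minute == 0
                and forecast_datetime.second == 0)
    if not on_cycle:
        raise ValueError("forecast_datetime must fall on a 00, 06, 12, or 18 UTC cycle")
    lead = end_datetime - forecast_datetime
    if lead <= dt.timedelta(0) or lead > GFS_HORIZON:
        raise ValueError("end_datetime must be after the start and within 16 days of it")
    return lead


print(check_gfs_window(dt.datetime(2023, 7, 1), dt.datetime(2023, 7, 2)))  # 1 day, 0:00:00
```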

In [None]:
import data.gfs_fc_thredds as gfs

start_dt = dt.datetime(2023, 7, 1)
# without an hour specified, will default to midnight forecast cycle
# start_dt = "20230701"

# use the same hour as our start datetime, so that we get a full day of forecast data
end_dt = dt.datetime(2023, 7, 2)

# define some locations to grab data for. Dictionary values must be lat/long tuples, at 0.25-degree resolution
stations = {'401': (45.00, -73.25),
			'402': (44.75, -73.25),
			'403': (44.75, -73.25)}

# define a directory in which to download GFS data
data_directory = "/netfiles/ciroh/downloadedData/"
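
The station coordinates above sit on a 0.25-degree grid. For an arbitrary point of interest, the nearest grid node can be found by rounding. This is a sketch under the assumption that the caller supplies on-grid coordinates; whether the GFS modules snap coordinates themselves is not stated here, and `snap_to_grid` is a hypothetical helper.

```python
def snap_to_grid(lat, lon, resolution=0.25):
    """Round a (lat, lon) pair to the nearest node of a regular grid."""
    return (round(lat / resolution) * resolution,
            round(lon / resolution) * resolution)


# an off-grid point near the example stations
print(snap_to_grid(44.80, -73.21))  # (44.75, -73.25)
```

Python's `round` resolves exact ties to the nearest even integer, so a point exactly halfway between two nodes may snap in either direction.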

In [None]:
gfs_data = gfs.get_data(forecast_datetime = start_dt,
						end_datetime = end_dt,
						locations = stations,
						data_dir = data_directory)

In [None]:
# let's check out the meteorological variables downloaded - hardcoded for now
gfs_data['401'].keys()

In [None]:
gfs_data['401']['T2']

## Observed local climatological data from NOAA
[NCEI Data Service API User Documentation](https://www.ncei.noaa.gov/support/access-data-service-api-user-documentation)

Additional LCD station IDs can be found at: https://www.ncei.noaa.gov/cdo-web/datatools/lcd

In [None]:
"""
A function to download and process NOAA Local Climatological Data and return a nested dictionary of pandas series for each variable, for each location.

Args:
-- start_date (str, date, or datetime) [req]: the start date for which to grab LCD data.
-- end_date (str, date, or datetime) [req]: the end date for which to grab LCD data.
-- locations (dict) [req]: a dictionary (stationID/name:IDValue/latlong tuple) of locations to get data for.
-- variables (dict) [req]: a dictionary of variables to download, where keys are user-defined variable names and values are LCD-specific variable names.
							Currently only tested for variables listed in global var_units. 
-- units (str) [opt]: specifies unit convention for the data request. Options are 'standard' for standard US units, or 'metric' for metric units.
	
Returns:
NOAA Local Climatological Data timeseries for the given locations in a nested dict format where 1st-level keys are user-provided location names and 2nd-level keys
are variable names and values are the respective data in a Pandas Series object.
"""

In [None]:
import data.lcd_ob as lcd

In [None]:
start = dt.datetime(2025, 7, 1)
end = dt.datetime(2025, 7, 2)

In [None]:
lcd_data = lcd.get_data(start_date=start,
                        end_date=end,
                        locations={"BTV": "72617014742"},
                        variables={'precip': 'HourlyPrecipitation',
                                   'relhum': 'HourlyRelativeHumidity'},
                        units='metric')

In [None]:
# list the variables returned for the station; keys are the user-defined names passed in `variables`
lcd_data['BTV'].keys()

In [None]:
lcd_data['BTV']['relhum']

## Observed meteorology data from UVM Forest Ecosystem Monitoring Cooperative (FEMC)
Currently, this module can be used to get quality-controlled meteorological data for Colchester Reef.

In [None]:
"""
A function to download and process observational meteorological data from UVM FEMC (Forest Ecosystem Monitoring Cooperative - https://www.uvm.edu/femc/) and return a nested dictionary of pandas series for each variable, for each location.

Args:
-- start_date (str, date, or datetime) [req]: the start date for which to grab FEMC data
-- end_date (str, date, or datetime) [req]: the end date for which to grab FEMC data
-- locations (dict) [opt]: a dictionary (stationID/name:IDValue/latlong tuple) of locations to get FEMC data for.
-- variables (dict) [opt]: a dictionary of variables to get; keys can be whatever you want to call the variables, but the values must be the variable abbreviations as seen in default dictionary

Returns:
FEMC observational meteorological data for the specified date range and locations, in a nested dict format where 1st-level keys are user-provided location names and 2nd-level keys
are variable names and values are the respective data in a Pandas Series object.
"""

In [None]:
import data.femc_ob as femc

start = dt.datetime(2021, 7, 1)
end = dt.datetime(2021, 7, 5)

In [None]:
femc_data = femc.get_data(start_date = start,
						  end_date = end)

In [None]:
# taking a look at what meteorological vars we have
femc_data['CR'].keys()

In [None]:
femc_data['CR']['T2']