# `get_data()` Demo Notebook from `CIROH-UVM/forecast-workflow`

## Setup
1. Clone the [forecast-workflow repository](https://github.com/CIROH-UVM/forecast-workflow/tree/main) to your user space on the testbed
2. Launch Jupyter Lab and select the kernel called "forecast"
3. Add the path to your cloned repo to your Python path by running the cell below (only run once per notebook)

In [None]:
import sys
# replace the placeholder below with the path to your clone of the repo
sys.path.append('/your/path/to/forecast-workflow')
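
Because the setup cell should only run once per notebook, a guarded version can make re-running it harmless. The sketch below is one way to do that; `add_repo_to_path` is a hypothetical helper, not part of forecast-workflow.

```python
import sys
from pathlib import Path


def add_repo_to_path(repo_dir: str) -> bool:
    """Append repo_dir to sys.path once; return False if the directory is missing.

    Hypothetical helper for illustration, not part of forecast-workflow.
    """
    repo = Path(repo_dir).expanduser().resolve()
    if not repo.is_dir():
        return False
    if str(repo) not in sys.path:
        sys.path.append(str(repo))
    return True


# Placeholder path from the setup step; substitute your own clone location
if not add_repo_to_path("/your/path/to/forecast-workflow"):
    print("Repo directory not found; check the path before importing the data modules.")
```

Calling the helper a second time leaves `sys.path` unchanged, so the cell can be re-run safely.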

## NWM forecasted streamflow data


In [None]:
"""
A function to download and process NWM hydrology forecast data and return a nested dictionary of pandas Series for each variable, for each location.

Args:
-- forecast_datetime (str, date, or datetime) [req]: the start date and time (00, 06, 12, 18) of the forecast to download. Times are assumed to be UTC time.
-- end_datetime (str, date, or datetime) [req]: the end date and time for the forecast. The maximum horizon depends on forecast_type.
-- locations (dict) [req]: a dictionary (stationID/name:IDValue/latlong tuple) of locations to download forecast data for.
-- forecast_type (str) [req]: The type of forecast, e.g. "medium_range_mem1".
-- data_dir (str) [opt]: directory to store downloaded data. Defaults to OS's default temp directory.
-- dwnld_threads (int) [opt]: number of threads to use for downloads. Default is half of OS's available threads.
-- load_threads (int) [opt]: number of threads to use for reading data. Default is 2 for GFS, since file reads are already pretty fast.
-- forecast_cycle (str) [req]: The starting time for the forecasts. valid values are 00, 06, 12, 18
-- google_buckets (bool) [opt]: Flag determining whether or not to use Google buckets for NWM download as opposed to the NOMADS site.
-- archive (bool) [opt]: Flag determining whether or not the data you are grabbing is older than the last two days (relevant for NWM only)
-- return_type (string) [opt]: string indicating which format to return data in. Default is "dict", which will return data in a nested dict format:
								{locationID1:{
									var1_name:pd.Series,
									var2_name:pd.Series,
									...},
								locationID2:{...},
								...
								}
								Alternative return type is "dataframe", which combines all data into a single DataFrame MultiIndexed by station ID, then timestamp
Returns:
NWM data in the format specified by return_type
"""
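
The docstring restricts forecast start times to the 00, 06, 12, and 18 UTC cycles. A minimal sketch of rounding an arbitrary datetime down to the most recent cycle is shown here; `latest_cycle` is a hypothetical helper, and whether `get_data()` performs a similar adjustment internally is not something this notebook confirms.

```python
import datetime as dt

VALID_CYCLES = (0, 6, 12, 18)  # forecast cycle hours (UTC) listed in the docstring


def latest_cycle(when: dt.datetime) -> dt.datetime:
    """Round a datetime down to the most recent forecast cycle hour."""
    cycle_hour = max(h for h in VALID_CYCLES if h <= when.hour)
    return when.replace(hour=cycle_hour, minute=0, second=0, microsecond=0)


print(latest_cycle(dt.datetime(2024, 1, 16, 9, 30)))  # 2024-01-16 06:00:00
```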

In [None]:
import data.nwm_fc as nwm
import datetime as dt
import pandas as pd

fc_start_dt = dt.datetime(2024, 1, 16, 6)
# without an hour specified, will default to midnight forecast cycle
# fc_start_dt = "20240116"

# use the same hour as our start datetime, so that we get a full 10 days of fc data
fc_end_dt = dt.datetime(2024, 1, 26, 6)

# define some locations to grab data for. Reach IDs for the NWM can be found at: 
# https://water.noaa.gov/map
reaches = {"Missisquoi River": "166176984",
           "Jewett Brook": "4587092",
           "Mill River": "4587100"}

fc_type = "medium_range_mem1"

# define a directory in which to download NWM data
data_directory = "/data/users/n/b/nbeckage/forecastData/"

# yes, we want to use google buckets for all data older than yesterday
buckets = True
arch = True


In [None]:
nwm_data = nwm.get_data(forecast_datetime = fc_start_dt,
                        end_datetime = fc_end_dt,
                        locations = reaches,
                        forecast_type = fc_type,
                        data_dir = data_directory,
                        google_buckets = buckets,
                        archive = arch)

In [None]:
nwm_data

In [None]:
pd.DataFrame(nwm_data['Missisquoi River']['streamflow'])
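
The nested dict layout described in the docstring can also be flattened by hand. The sketch below uses synthetic values (not real NWM output) and plain pandas to build a DataFrame indexed by location, then timestamp. It is an illustration of the idea only; the layout produced by `return_type="dataframe"` in the repo may differ.

```python
import pandas as pd

# Synthetic stand-in for the nested dict that get_data() returns (values are made up)
idx = pd.to_datetime(["2024-01-16 06:00", "2024-01-16 07:00", "2024-01-16 08:00"])
nested = {
    "Missisquoi River": {"streamflow": pd.Series([10.0, 11.0, 12.0], index=idx)},
    "Jewett Brook": {"streamflow": pd.Series([1.0, 1.5, 2.0], index=idx)},
}

# One DataFrame per location (columns = variables), stacked under a location index level
frames = {loc: pd.DataFrame(variables) for loc, variables in nested.items()}
combined = pd.concat(frames, names=["location", "time"])

# Select all rows for one location
print(combined.loc["Jewett Brook"])
```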

## USGS observed streamflow data

In [None]:
"""
A function to download and process USGS observational hydrology data and return a nested dictionary of pandas Series for each variable, for each location.

Args:
-- start_date (str, date, or datetime) [req]: the start date for which to grab USGS data
-- end_date (str, date, or datetime) [req]: the end date for which to grab USGS data
-- locations (dict) [req]: a dictionary (stationID/name:IDValue/latlong tuple) of locations to get USGS data for.
-- return_type (string) [opt]: string indicating which format to return data in. Default is "dict", which will return data in a nested dict format:
							{locationID1:{
								var1_name:pd.Series,
								var2_name:pd.Series,
								...},
							locationID2:{...},
							...
							}
							Alternative return type is "dataframe", which combines all data into a single DataFrame MultiIndexed by station ID, then timestamp

Returns:
USGS observed streamflow data for the given stations in the format specified by return_type
"""

In [None]:
import data.usgs_ob as usgs
import matplotlib.pyplot as plt

In [None]:
# USGS site numbers can be found at:
# https://maps.waterdata.usgs.gov/mapper/index.html
usgs_stations = {"Missisquoi River": "04294000",
                 "Jewett Brook": "04292810",
                 "Mill River": "04292750"}

In [None]:
usgs_data = usgs.get_data(start_date = "20240116",
						  end_date = "20240126",
						  locations = usgs_stations)

In [None]:
# Convert from cubic ft/s (USGS) to cubic m/s (NWM)
df = pd.DataFrame(usgs_data['Missisquoi River']['streamflow'].astype('float') * 0.0283168)
# pd.options.display.max_rows = 60
df

In [None]:
plt.plot(nwm_data['Missisquoi River']['streamflow'], label='NWM')
# Convert from cubic ft/s (USGS) to cubic m/s (NWM)
plt.plot(usgs_data['Missisquoi River']['streamflow'].astype('float') * 0.0283168, label='USGS')
plt.xticks(rotation = 60)
plt.legend()
plt.show()
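
Beyond plotting, the forecast and observations can be compared numerically once they share units and timestamps. The sketch below uses synthetic values and a hypothetical `cfs_to_cms` helper; it assumes both series use the same time zone convention, which should be checked against the real data before relying on it.

```python
import pandas as pd

CFS_TO_CMS = 0.0283168  # cubic feet per second -> cubic meters per second


def cfs_to_cms(flow_cfs: pd.Series) -> pd.Series:
    """Convert a streamflow series from ft^3/s to m^3/s."""
    return flow_cfs.astype("float") * CFS_TO_CMS


# Synthetic values for illustration only
obs_idx = pd.to_datetime(["2024-01-16 06:00", "2024-01-16 06:15", "2024-01-16 07:00"])
fc_idx = pd.to_datetime(["2024-01-16 06:00", "2024-01-16 07:00"])
observed_cfs = pd.Series(["1000", "1100", "1200"], index=obs_idx)
forecast_cms = pd.Series([30.0, 33.0], index=fc_idx)

# Outer-join on timestamp, then keep only timestamps present in both series
paired = pd.concat({"nwm": forecast_cms, "usgs": cfs_to_cms(observed_cfs)}, axis=1).dropna()
paired["error"] = paired["nwm"] - paired["usgs"]
print(paired)
```

Observations at timestamps with no matching forecast value (06:15 in this toy data) are dropped rather than interpolated.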

## GFS forecasted meteorological data

In [None]:
"""
Download specified GFS forecast data and return a nested dictionary of pandas Series for each variable, for each location.

Args:
-- forecast_datetime (str, date, or datetime) [req]: the start date and time (00, 06, 12, 18) of the forecast to download. Times are assumed to be UTC time.
-- end_datetime (str, date, or datetime) [req]: the end date and time for the forecast. GFS forecasts 16-days out for a given start date.
-- locations (dict) [req]: a dictionary (stationID/name:IDValue/latlong tuple) of locations to download forecast data for.
-- data_dir (str) [opt]: directory to store downloaded data. Defaults to OS's default temp directory.
-- dwnld_threads (int) [opt]: number of threads to use for downloads. Default is half of OS's available threads.
-- load_threads (int) [opt]: number of threads to use for reading data. Default is 2 for GFS, since file reads are already pretty fast.
-- return_type (string) [opt]: string indicating which format to return data in. Default is "dict", which will return data in a nested dict format:
								{locationID1:{
									var1_name:pd.Series,
									var2_name:pd.Series,
									...},
								locationID2:{...},
								...
								}
								Alternative return type is "dataframe", which combines all data into a single DataFrame MultiIndexed by station ID, then timestamp

Returns:
GFS forecast data for the given locations in the format specified by return_type
"""

In [None]:
import data.gfs_fc as gfs

start_dt = dt.datetime(2024, 1, 16)
# without an hour specified, will default to midnight forecast cycle
# start_dt = "20240116"

# end one day after the start, so that we get 24 hours of fc data
end_dt = dt.datetime(2024, 1, 17)

# define some locations to grab data for. Dictionary values must be (lat, long) tuples, at up to 0.25-degree resolution
stations = {'401': (45.00, -73.25),
			'402': (44.75, -73.25),
			'403': (44.75, -73.25)}

# define a directory in which to download GFS data
data_directory = "/data/users/n/b/nbeckage/forecastData/"
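
The location comment above limits coordinates to a 0.25 resolution. Assuming that means a 0.25-degree grid, arbitrary coordinates could be snapped to the nearest grid point before being passed in. The sketch below uses a hypothetical `snap_to_grid` helper and made-up station coordinates; whether `get_data()` needs or already performs this step is an assumption. It also flags stations that land on the same grid cell, since those would receive identical forecast values.

```python
from collections import defaultdict

GRID_STEP = 0.25  # assumed grid spacing in degrees


def snap_to_grid(lat, lon, step=GRID_STEP):
    """Round a (lat, lon) pair to the nearest grid point."""
    return (round(lat / step) * step, round(lon / step) * step)


# Hypothetical off-grid coordinates, for illustration only
requested = {"A": (44.98, -73.31), "B": (44.70, -73.20), "C": (44.80, -73.30)}
snapped = {name: snap_to_grid(*coords) for name, coords in requested.items()}

# Group station names by grid cell to spot stations that would share a forecast
shared = defaultdict(list)
for name, cell in snapped.items():
    shared[cell].append(name)
duplicates = {cell: names for cell, names in shared.items() if len(names) > 1}

print(snapped)
print(duplicates)
```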

In [None]:
gfs_data = gfs.get_data(forecast_datetime = start_dt,
						end_datetime = end_dt,
						locations = stations,
						data_dir = data_directory)

In [None]:
# let's check out the meteorological variables downloaded - hardcoded for now
gfs_data['401'].keys()

In [None]:
gfs_data['401']['T2']

## Observed local climatological data from NOAA
[NCEI Data Service API User Documentation](https://www.ncei.noaa.gov/support/access-data-service-api-user-documentation)

Additional LCD station IDs can be found at: https://www.ncei.noaa.gov/cdo-web/datatools/lcd

In [None]:
"""
A function to download and process NOAA Local Climatological Data (LCD) and return a nested dictionary of pandas Series for each variable, for each location.

Args:
-- start_date (str, date, or datetime) [req]: the start date for which to grab LCD data
-- end_date (str, date, or datetime) [req]: the end date for which to grab LCD data
-- locations (dict) [req]: a dictionary (stationID/name:IDValue/latlong tuple) of locations to get LCD data for.
-- return_type (string) [opt]: string indicating which format to return data in. Default is "dict", which will return data in a nested dict format:
								{locationID1:{
									var1_name:pd.Series,
									var2_name:pd.Series,
									...},
								locationID2:{...},
								...
								}
								Alternative return type is "dataframe", which combines all data into a single DataFrame MultiIndexed by station ID, then timestamp

Returns:
NOAA Local Climatological Data (total cloud cover and precipitation currently) for the date range and locations provided
"""

In [None]:
import data.lcd_ob as lcd

In [None]:
lcd_data = lcd.get_data(start_date = start_dt,
						end_date = end_dt,
						locations = {"BTV":"72617014742"})

In [None]:
# Local Climatological Dataset (LCD) only provides total cloud cover (%) and rain data
lcd_data['BTV'].keys()

In [None]:
lcd_data['BTV']['TCDC']

## Observed meteorology data from UVM Forest Ecosystem Monitoring Cooperative (FEMC)
Currently, this can be used to get quality-controlled meteorological data for Colchester Reef.

In [None]:
"""
A function to download and process observational meteorological data from UVM FEMC (Forest Ecosystem Monitoring Cooperative - https://www.uvm.edu/femc/) and return a nested dictionary of pandas Series for each variable, for each location.

Args:
-- start_date (str, date, or datetime) [req]: the start date for which to grab FEMC data
-- end_date (str, date, or datetime) [req]: the end date for which to grab FEMC data
-- locations (dict) [req]: a dictionary (stationID/name:IDValue/latlong tuple) of locations to get FEMC data for.
-- return_type (string) [opt]: string indicating which format to return data in. Default is "dict", which will return data in a nested dict format:
								{locationID1:{
									var1_name:pd.Series,
									var2_name:pd.Series,
									...},
								locationID2:{...},
								...
								}
								Alternative return type is "dataframe", which combines all data into a single DataFrame MultiIndexed by station ID, then timestamp

Returns:
FEMC observational meteorological data for the specified date range and locations, in the format specified by return_type
"""

In [None]:
import data.femc_ob as femc

In [None]:
femc_data = femc.get_data(start_date = start_dt,
						  end_date = end_dt)

In [None]:
# taking a look at what meteorological vars we have
femc_data['CR'].keys()

In [None]:
femc_data['CR']['T2']