# 02 - Loading Data
TEEHR comes with utilities to fetch data from a few different sources, format it, and store it in the TEEHR data format. We are open to adding more sources, but we also tried to keep the format simple enough that getting your own data into it should be relatively easy: create a Pandas DataFrame with the correct column names and save it as a Parquet file.
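As a rough illustration of the "DataFrame with the correct column names" route, the sketch below builds a tiny made-up timeseries and provides a helper to write it to Parquet. The column names in `TIMESERIES_COLUMNS` are an assumption based on our reading of the TEEHR timeseries data model and should be checked against the TEEHR documentation for your version; the values, the `my_model` configuration, the `my-prefix-0001` location ID, and the helper names are all hypothetical.

```python
import pandas as pd

# ASSUMPTION: these are the TEEHR timeseries column names.
# Verify against the TEEHR data model documentation before relying on them.
TIMESERIES_COLUMNS = [
    "reference_time",
    "value_time",
    "value",
    "variable_name",
    "measurement_unit",
    "configuration",
    "location_id",
]


def build_timeseries_frame() -> pd.DataFrame:
    """Build a small, made-up hourly streamflow series for one location."""
    value_times = pd.to_datetime(
        ["2023-03-18 00:00", "2023-03-18 01:00", "2023-03-18 02:00"]
    )
    df = pd.DataFrame(
        {
            "reference_time": pd.Timestamp("2023-03-18 00:00"),
            "value_time": value_times,
            "value": [1.2, 1.5, 1.4],          # made-up values
            "variable_name": "streamflow",
            "measurement_unit": "m3/s",
            "configuration": "my_model",        # hypothetical label
            "location_id": "my-prefix-0001",    # hypothetical ID
        }
    )
    # Enforce a consistent column order.
    return df[TIMESERIES_COLUMNS]


def save_timeseries(df: pd.DataFrame, path: str) -> None:
    """Write the frame to a Parquet file (needs pyarrow or fastparquet)."""
    df.to_parquet(path, index=False)
```

A call such as `save_timeseries(build_timeseries_frame(), "my_timeseries.parquet")` would then write the file; the file name here is only an example.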

## The included loading utilities are:

- NWM v2.1 and NWM v2.2 feature data in Google Cloud
- NWM v2.1 and NWM v2.2 forcing data in Google Cloud 
- NWM v2.0 and NWM v2.1 Retrospective in AWS
- USGS NWIS Data

These utilities cover both feature data (e.g., data at NWM features) and aggregated grid data (e.g., mean areal precipitation for catchments).

In [1]:
import teehr.loading.nwm_point_data as nwmp
from pathlib import Path

In [4]:
?nwmp.nwm_to_parquet

Signature:
nwmp.nwm_to_parquet(
    run: str,
    output_type: str,
    variable_name: str,
    start_date: Union[str, datetime.datetime],
    ingest_days: int,
    location_ids: Iterable[int],
    json_dir: str,
    output_parquet_dir: str,
    t_minus_hours: Optional[Iterable[int...

In [5]:
# Set some notebook variables to point to the relevant study files.  
RUN = "short_range"
OUTPUT_TYPE = "channel_rt"
VARIABLE_NAME = "streamflow"

START_DATE = "2023-03-18"
INGEST_DAYS = 1

OUTPUT_ROOT = Path(Path.home(), "cache")
JSON_DIR = Path(OUTPUT_ROOT, "zarr", RUN)
OUTPUT_DIR = Path(OUTPUT_ROOT, "timeseries", RUN)

# For this simple example, we'll get data for 10 NWM reaches that coincide with USGS gauges
LOCATION_IDS = [
    7086109, 7040481, 7053819, 7111205, 7110249,
    14299781, 14251875, 14267476, 7152082, 14828145,
]

In [None]:
from dask.distributed import Client
client = Client(n_workers=8)
client

In [None]:
%%time
nwmp.nwm_to_parquet(
    RUN,
    OUTPUT_TYPE,
    VARIABLE_NAME,
    START_DATE,
    INGEST_DAYS,
    LOCATION_IDS,
    JSON_DIR,
    OUTPUT_DIR
)
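Once the call above finishes, one quick way to see what was written is to list the Parquet files under the output directory. This is a minimal sketch using only the standard library; the helper name `summarize_parquet_output` is hypothetical (it is not part of TEEHR), and it assumes the output files use a `.parquet` extension, possibly inside subdirectories.

```python
from pathlib import Path


def summarize_parquet_output(output_dir):
    """Return (relative path, size in bytes) for each Parquet file found.

    Searches output_dir recursively; assumes files end in '.parquet'.
    """
    root = Path(output_dir)
    return [
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in sorted(root.rglob("*.parquet"))
    ]
```

In this notebook it could be called as `summarize_parquet_output(OUTPUT_DIR)`; an empty list would suggest that nothing was written to that directory.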

## Cached Data
We cached several datasets on AWI's 2i2c JupyterHub shared drive for this workshop.

List datasets here!