# 02 - Loading Data
TEEHR comes with utilities to fetch and format data from a few different sources and store it in the TEEHR data format. We are open to adding more sources, but we have also tried to keep the format simple enough that getting your own data into it is relatively easy: create a Pandas DataFrame with the correct column names and save it as a Parquet file.
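As a rough illustration of the "DataFrame with the correct column names, saved as Parquet" idea, here is a minimal sketch. The column names in `ASSUMED_COLUMNS` and the helpers `build_timeseries_frame` and `save_timeseries` are placeholders invented for this example, not the actual TEEHR schema or API; check the TEEHR documentation for the real column names before writing your own data.

```python
from pathlib import Path

import pandas as pd

# ASSUMPTION: placeholder column names for illustration only.
# They are not taken from the TEEHR schema.
ASSUMED_COLUMNS = ["location_id", "value_time", "value"]


def build_timeseries_frame(records):
    """Build a DataFrame with a fixed column order and a datetime column."""
    df = pd.DataFrame(records, columns=ASSUMED_COLUMNS)
    df["value_time"] = pd.to_datetime(df["value_time"])
    return df


def save_timeseries(df, path):
    """Write the frame to Parquet (needs pyarrow or fastparquet installed)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(path, index=False)


# Two made-up rows, only to show the shape of the data.
records = [
    {"location_id": "example-1", "value_time": "2023-03-18 00:00", "value": 1.5},
    {"location_id": "example-1", "value_time": "2023-03-18 01:00", "value": 1.7},
]
df = build_timeseries_frame(records)
print(df.dtypes)
```

Calling `save_timeseries(df, Path.home() / "cache" / "my_data.parquet")` would then write the file; the path is likewise only an example.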

## The included loading utilities are:

- NWM v2.1 and NWM v2.2 feature data in Google Cloud
- NWM v2.1 and NWM v2.2 forcing data in Google Cloud 
- NWM v2.0 and NWM v2.1 Retrospective in AWS
- USGS NWIS Data

This covers both feature data (e.g., data at NWM features) and aggregated grid data (e.g., mean areal precipitation for catchments). Caching data can take a significant amount of time, so we have attempted to make these tools as efficient as possible.

| Data                       | Increment                                      | Time                         |
| -------------------------- | ---------------------------------------------- | ---------------------------- |
| NWM v2.2 Feature Data      | 1 medium range forecast at ~7600 USGS gages    | X mins                       |
| NWM v2.2 Forcing Data      | 1 medium range forecast aggregated to HUC10    | Y mins                       |
| NWM v2.1 Retrospective     |                                                |                              |
| USGS NWIS Data             |                                                |                              |

The following is an example of how the loading tools would be used. Scroll to the bottom for a list of the datasets that were cached for this workshop.

In [None]:
import teehr.loading.nwm_point_data as nwmp
from pathlib import Path

In [None]:
# Set some notebook variables to point to the relevant study files.  
RUN = "short_range"
OUTPUT_TYPE = "channel_rt"
VARIABLE_NAME = "streamflow"

START_DATE = "2023-03-18"
INGEST_DAYS = 1

OUTPUT_ROOT = Path(Path.home(), "cache")
JSON_DIR = Path(OUTPUT_ROOT, "zarr", RUN)
OUTPUT_DIR = Path(OUTPUT_ROOT, "timeseries", RUN)

# For this simple example, we'll get data for 10 NWM reaches that coincide with USGS gauges
LOCATION_IDS = [7086109,  7040481,  7053819,  7111205,  7110249, 14299781, 14251875, 14267476,  7152082, 14828145]

In [None]:
from dask.distributed import Client
client = Client()
client

In [None]:
# ?nwmp.nwm_to_parquet

In [None]:
"""
nwmp.nwm_to_parquet(
    run: str,
    output_type: str,
    variable_name: str,
    start_date: Union[str, datetime.datetime],
    ingest_days: int,
    location_ids: Iterable[int],
    json_dir: str,
    output_parquet_dir: str,
    t_minus_hours: Optional[Iterable[int]] = None,
)
"""
nwmp.nwm_to_parquet(
    RUN,
    OUTPUT_TYPE,
    VARIABLE_NAME,
    START_DATE,
    INGEST_DAYS,
    LOCATION_IDS,
    JSON_DIR,
    OUTPUT_DIR
)
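Once the call above finishes, it can be useful to check what was written. The sketch below assumes the loader writes files with a `.parquet` extension somewhere under `OUTPUT_DIR`; the file naming and layout are not stated here, so the helper `list_parquet_files` (a hypothetical name, not part of TEEHR) simply searches recursively.

```python
from pathlib import Path


def list_parquet_files(output_dir):
    """Return (relative path, size in bytes) for each .parquet file under output_dir."""
    root = Path(output_dir)
    if not root.is_dir():
        # Nothing has been written yet, or the path is wrong.
        return []
    return [
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in sorted(root.rglob("*.parquet"))
    ]


# Same location as OUTPUT_DIR in the cell above.
output_dir = Path(Path.home(), "cache", "timeseries", "short_range")
for name, size in list_parquet_files(output_dir):
    print(f"{name}: {size} bytes")
```

An empty listing means either that nothing was ingested or that the files live somewhere other than assumed.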

## Cached Data
We cached several datasets on AWI's 2i2c JupyterHub shared drive for this workshop that we will be exploring in subsequent sections.
Because we are using small JupyterHub instances, we will be somewhat selective about what and how much data we query.

The `tree` output for the cache directory we have set up shows the following.

In [None]:
!tree ~/shared/rti-eval -d
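If the `tree` command is not installed in your environment, a similar directory-only listing can be sketched in plain Python. The helper `directory_tree` is a hypothetical name used only for this example, and its output is a simple indented list rather than the exact format `tree` prints.

```python
from pathlib import Path


def directory_tree(root, indent=""):
    """Return indented lines naming every subdirectory under root, skipping files."""
    lines = []
    for child in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        lines.append(f"{indent}{child.name}/")
        lines.extend(directory_tree(child, indent + "    "))
    return lines


# Same directory as the `tree` call above.
cache_root = Path.home() / "shared" / "rti-eval"
if cache_root.is_dir():
    print("\n".join(directory_tree(cache_root)))
```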