# 02 - Loading Data
TEEHR comes with utilities to fetch and format data from a few different sources and store it in the TEEHR data format. We are open to adding more, but we have also tried to keep the format simple enough that getting your own data into it should be relatively easy: create a Pandas DataFrame with the correct column names and save it as a Parquet file.
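
As a minimal sketch of that do-it-yourself route, the snippet below builds a small DataFrame and defines a helper that writes it to Parquet. The column names, the example values, and the `build_timeseries_frame` and `save_timeseries` helpers are assumptions made for illustration; they are not taken from this notebook, so check the TEEHR documentation for the authoritative schema. Writing Parquet from pandas also requires a Parquet engine such as `pyarrow`.

```python
from pathlib import Path

import pandas as pd

# Assumed column names for a TEEHR-style timeseries table (illustrative only;
# confirm against the TEEHR documentation before relying on them).
ASSUMED_COLUMNS = [
    "reference_time",
    "value_time",
    "value",
    "variable_name",
    "measurement_unit",
    "configuration",
    "location_id",
]


def build_timeseries_frame() -> pd.DataFrame:
    """Build a three-row hourly timeseries with made-up values."""
    value_times = pd.date_range(
        "2023-03-18 00:00", periods=3, freq=pd.Timedelta(hours=1)
    )
    return pd.DataFrame(
        {
            "reference_time": pd.Timestamp("2023-03-18 00:00"),
            "value_time": value_times,
            "value": [1.2, 1.5, 1.4],
            "variable_name": "streamflow",
            "measurement_unit": "m3/s",
            "configuration": "my_model",        # hypothetical label
            "location_id": "my-source-0001",    # hypothetical identifier
        },
        columns=ASSUMED_COLUMNS,
    )


def save_timeseries(df: pd.DataFrame, path: Path) -> None:
    """Write the frame to a Parquet file (needs pyarrow or fastparquet)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(path, index=False)


if __name__ == "__main__":
    print(build_timeseries_frame())
```

Calling `save_timeseries(build_timeseries_frame(), Path("example.parquet"))` would then write the file to the current directory.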

## Included Loading Utilities

- NWM v2.1 and NWM v2.2 feature data in Google Cloud
- NWM v2.1 and NWM v2.2 forcing data in Google Cloud 
- NWM v2.0 and NWM v2.1 Retrospective in AWS
- USGS NWIS Data

These utilities cover both feature data (e.g., data at NWM features) and aggregated grid data (e.g., mean areal precipitation for catchments). Caching data can take a significant amount of time, so we have tried to make these tools as efficient as possible. Approximate times to cache some representative increments of data are:

| Data                       | Increment                                       | Time                         |
| -------------------------- | ----------------------------------------------- | ---------------------------- |
| NWM v2.2 Feature Data      | 1 medium range forecast at ~7600 USGS gages     | 15 sec                       |
| NWM v2.2 Forcing Data      | 1 medium range forecast aggregated to HUC10     | 2.5 min                      |
| NWM v2.1 Retrospective     | 40-yr Retrospective at 1 NWM COMID              | 1.5 min                      |
| USGS NWIS Data             | 40-yr Observation record at 1 USGS gage         | 30 sec                       |

The following is an example of how the loading tools would be used. Scroll to the bottom for a list of the datasets that were cached for this workshop.

In [1]:
import teehr.loading.nwm_point_data as nwmp
from pathlib import Path

In [2]:
# Set some notebook variables to define what data to ingest and where to save it.  
RUN = "short_range"
OUTPUT_TYPE = "channel_rt"
VARIABLE_NAME = "streamflow"

START_DATE = "2023-03-18"
INGEST_DAYS = 1

OUTPUT_ROOT = Path(Path().home(), "cache")
JSON_DIR = Path(OUTPUT_ROOT, "zarr", RUN)
OUTPUT_DIR = Path(OUTPUT_ROOT, "timeseries", RUN)

# For this simple example, we'll get data for 10 NWM reaches that coincide with USGS gages
LOCATION_IDS = [7086109,  7040481,  7053819,  7111205,  7110249, 14299781, 14251875, 14267476,  7152082, 14828145]
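
To make the date settings concrete, here is a small standard-library sketch that lists the calendar days such a request would span. It assumes that `INGEST_DAYS` counts consecutive days beginning at `START_DATE`. That reading, and the `ingest_dates` helper itself, are illustrative and not taken from the TEEHR source.

```python
from datetime import date, datetime, timedelta
from typing import List, Union


def ingest_dates(start_date: Union[str, datetime], ingest_days: int) -> List[date]:
    """Return the consecutive calendar days covered by a request.

    Hypothetical helper: assumes ingest_days counts whole days starting
    at start_date, which is an interpretation, not TEEHR's documented API.
    """
    if ingest_days < 1:
        raise ValueError("ingest_days must be at least 1")
    if isinstance(start_date, str):
        start_date = datetime.strptime(start_date, "%Y-%m-%d")
    first_day = start_date.date()
    return [first_day + timedelta(days=offset) for offset in range(ingest_days)]


if __name__ == "__main__":
    # Same START_DATE as above, but three days instead of one for illustration.
    for day in ingest_dates("2023-03-18", 3):
        print(day.isoformat())
```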

In [3]:
# Start a LocalCluster
from dask.distributed import Client
client = Client()
client

Client
  Connection method: Cluster object
  Cluster type: distributed.LocalCluster
  Dashboard: /user/mgdenno/proxy/8787/status

LocalCluster
  Dashboard: /user/mgdenno/proxy/8787/status
  Workers: 2, Total threads: 2, Total memory: 7.00 GiB
  Status: running, Using processes: True

Scheduler
  Comm: tcp://127.0.0.1:41387
  Dashboard: /user/mgdenno/proxy/8787/status
  Workers: 2, Total threads: 2, Total memory: 7.00 GiB
  Started: Just now

Worker
  Comm: tcp://127.0.0.1:37665, Nanny: tcp://127.0.0.1:38983
  Dashboard: /user/mgdenno/proxy/34003/status
  Total threads: 1, Memory: 3.50 GiB
  Local directory: /tmp/dask-worker-space/worker-_ldvls34

Worker
  Comm: tcp://127.0.0.1:44801, Nanny: tcp://127.0.0.1:37283
  Dashboard: /user/mgdenno/proxy/34763/status
  Total threads: 1, Memory: 3.50 GiB
  Local directory: /tmp/dask-worker-space/worker-t4hx2_5l

In [4]:
# ?nwmp.nwm_to_parquet

In [5]:
"""
nwmp.nwm_to_parquet(
    run: str,
    output_type: str,
    variable_name: str,
    start_date: Union[str, datetime.datetime],
    ingest_days: int,
    location_ids: Iterable[int],
    json_dir: str,
    output_parquet_dir: str,
    t_minus_hours: Optional[Iterable[int]] = None,
)
"""
nwmp.nwm_to_parquet(
    RUN,
    OUTPUT_TYPE,
    VARIABLE_NAME,
    START_DATE,
    INGEST_DAYS,
    LOCATION_IDS,
    JSON_DIR,
    OUTPUT_DIR
)
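
Once the call returns, a quick way to confirm that something was written is to list the Parquet files under the output directory. The sketch below uses only the standard library. The `list_parquet_files` helper is hypothetical, and because the file naming and layout used by `nwm_to_parquet` are not described here, it simply searches recursively for the `.parquet` extension.

```python
from pathlib import Path
from typing import List


def list_parquet_files(output_dir: Path) -> List[Path]:
    """Return all *.parquet files below output_dir, sorted by path."""
    if not output_dir.is_dir():
        # Nothing has been written yet (or the path is wrong).
        return []
    return sorted(p for p in output_dir.rglob("*.parquet") if p.is_file())


if __name__ == "__main__":
    # Same location as OUTPUT_DIR in the cells above.
    output_dir = Path(Path.home(), "cache", "timeseries", "short_range")
    for path in list_parquet_files(output_dir):
        print(path.name, f"{path.stat().st_size / 1024:.1f} KiB")
```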

## Cached Data
We cached several datasets on AWI's 2i2c JupyterHub shared drive for this workshop that we will be exploring in subsequent sections.
Because we are using small JupyterHub instances, we will be somewhat selective about what and how much data we query, but on larger instances TEEHR is quite capable of handling more.

The `tree` output below for the cache directory we set up shows cached data for 3 studies:

- huc1802_retro
- ngen-simulation-example
- post-event-example

In [6]:
!tree -d /home/jovyan/shared/teehr-workshop

/home/jovyan/shared/teehr-workshop
├── huc1802_retro
│   ├── geo
│   ├── notebooks
│   └── timeseries
│       ├── nwm20_retro
│       ├── nwm21_retro
│       └── usgs
├── ngen-simulation-example
│   ├── geo
│   ├── ngen
│   │   ├── config
│   │   ├── forcings
│   │   └── output
│   └── timeseries
└── post-event-example
    ├── geo
    └── timeseries
        ├── analysis_assim
        ├── forcing_analysis_assim
        ├── forcing_medium_range
        ├── forcing_short_range
        ├── medium_range_mem1
        ├── short_range
        └── usgs

24 directories
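
If the `tree` command is not installed, a similar directory-only listing can be produced from Python. This is a rough standard-library sketch, not part of TEEHR; the `directory_tree` helper is hypothetical, it does not skip hidden directories, and it does not guard against symlink loops.

```python
from pathlib import Path
from typing import List


def directory_tree(root: Path, prefix: str = "") -> List[str]:
    """Return tree-style lines for the subdirectories of root (files are skipped)."""
    subdirs = sorted(p for p in root.iterdir() if p.is_dir())
    lines: List[str] = []
    for index, subdir in enumerate(subdirs):
        is_last = index == len(subdirs) - 1
        lines.append(f"{prefix}{'└── ' if is_last else '├── '}{subdir.name}")
        # Children of the last entry are indented with spaces, others with a bar.
        child_prefix = prefix + ("    " if is_last else "│   ")
        lines.extend(directory_tree(subdir, child_prefix))
    return lines


if __name__ == "__main__":
    root = Path("/home/jovyan/shared/teehr-workshop")
    if root.is_dir():
        print(root)
        print("\n".join(directory_tree(root)))
```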
