## FSSPEC download for NWM RouteLink file for developing topologic relationships
This notebook demonstrates accessing the topological definition used by the National Water Model (NWM) channel routing simulation. The methods applied here use Zarr and fsspec (via `kerchunk`) to retrieve the file header and then only the topology-defining fields: "link" and "to". Building the dataframe directly from these fields over the web avoids a 200 MB download and takes considerably less time than downloading the full file and working from local storage.

The key here is to note which operations take a long time: 
* The initial `SingleHdf5ToZarr` step is about 1 second
* The `.translate()` operation (inline in our example) is about 8 seconds
* Opening the dataset from the translated .json object is only a few milliseconds
* Reading the "link" and "to" variables into a pandas dataframe takes 11 seconds

That last step would be a lot longer if all variables were downloaded. 
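
The `%time` magic used in the cells below only works inside IPython/Jupyter. As a rough sketch of how the same per-step timings could be collected in a plain Python script, the helper below wraps each step in a context manager. The names `timed` and `timings` are hypothetical and not part of this notebook, and the `time.sleep` call is only a stand-in for a real step such as the `.translate()` call.

```python
import time
from contextlib import contextmanager

# Hypothetical container for the measured durations, keyed by step label.
timings = {}

@contextmanager
def timed(label, store=timings):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        store[label] = time.perf_counter() - start

# Stand-in workload; in the notebook this would be one of the steps listed above.
with timed("placeholder step"):
    time.sleep(0.01)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```

Wrapping each of the four steps in its own `with timed(...)` block would give a small table of durations that can be compared across runs.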

In [None]:
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

fs = fsspec.filesystem("http")

rl_nwm_url = "https://www.nco.ncep.noaa.gov/pmb/codes/nwprod/nwm.v2.2.0/parm/DOMAIN_WCOSS_Names/RouteLink_CONUS.nc"
with fs.open(rl_nwm_url) as f:
    %time    rl_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=0).translate()
    
    # Key example here: 
    # https://fsspec.github.io/kerchunk/test_example.html
 

The `kerchunk`-ing example that we started with had a number of other parameters... 
perhaps some might be reintroduced to make the data access even speedier!
e.g., ...
```py
fs = fsspec.filesystem('ftp', host="https://www.nco.ncep.noaa.gov/pmb")

with fs.open(rl_nwm_url, mode='rb', anon=True, default_fill_cache=False, default_cache_type='first') as f:
```
 ...
 
One thing that I specifically explored was the `inline_threshold` setting. Values of 2 or below gave better results, though not a massive improvement: roughly 9.5 to 9.9 seconds overall vs. 11.1 to 11.3 seconds for larger values or the default.
```py
    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url).translate() # 11.1 s
    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=30000).translate() # 11.3 s
    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=300).translate() # 11.2 s
    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=10).translate() # 11.3 s
    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=2).translate() # 9.8 s
    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=1).translate() # 9.85 s
    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=0).translate() # 9.83 s
    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=-1).translate() # 9.54 s
```
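
To see what `inline_threshold` changes in the output, it can help to count how many chunk references in the translated object are stored inline versus as pointers back to the remote file. The sketch below assumes the version 1 reference layout as I understand it: a `"refs"` mapping whose values are either a string (inline data) or a `[url, offset, length]` list (remote byte range). The `toy` dictionary and the `summarize_refs` helper are hypothetical illustrations, not output from the RouteLink file.

```python
def summarize_refs(reference_set):
    """Count metadata keys, inline chunks, and remote chunk references."""
    # Accept either the full {"version": 1, "refs": {...}} object or the bare mapping.
    refs = reference_set.get("refs", reference_set)
    counts = {"metadata": 0, "inline": 0, "remote": 0}
    for key, value in refs.items():
        name = key.rsplit("/", 1)[-1]
        if name.startswith(".z"):
            # .zgroup, .zarray, .zattrs: Zarr metadata, always stored in the reference set
            counts["metadata"] += 1
        elif isinstance(value, (list, tuple)):
            # [url, offset, length]: the chunk bytes stay in the remote file
            counts["remote"] += 1
        else:
            # a plain or base64-encoded string: the chunk bytes were inlined
            counts["inline"] += 1
    return counts

# Hypothetical toy reference set, for illustration only.
toy = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "link/.zarray": "{}",
        "link/0": ["https://example.com/RouteLink.nc", 1024, 4096],
        "to/.zarray": "{}",
        "to/0": "base64:AAAA",
    },
}

print(summarize_refs(toy))
```

Calling `summarize_refs(rl_t)` on the translated object from the cell above would be one way to check whether a given threshold inlined any chunks.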
Inlining the `.translate()` call vs. splitting seemed to be about equal, with inlining having the additional advantage of omitting the unused intermediate output. 
```py
    %time    rl_h5 = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=0)
    %time    rl_t = rl_h5.translate() # This translate MUST happen inside the context block
```
    

In [None]:
backend_args = { 
    "consolidated": False,
    "storage_options": { 
        "fo": rl_t,
        # Adding these options returns a properly dimensioned but otherwise null dataframe
        # "remote_protocol": "https", 
        # "remote_options": {'anon':True} 
        }
    }
%time ds = xr.open_dataset("reference://", engine="zarr", backend_kwargs=backend_args,)

In [None]:
subslice = ["link", "to",]
%time df = ds[subslice].to_dataframe()

In [None]:
df
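
With "link" and "to" in hand, the topologic relationships mentioned in the title can be built with ordinary dictionaries. The following is a minimal sketch, not the method used by the NWM itself: `build_topology` and `trace_downstream` are hypothetical helpers, the four-segment network is made-up data, and treating a "to" value of 0 as an outlet is an assumption that should be checked against the RouteLink file.

```python
from collections import defaultdict

def build_topology(links, tos, outlet_value=0):
    """Return (downstream, upstream) lookups from parallel link/to sequences.

    Assumption: a `to` equal to `outlet_value` marks a segment with no downstream link.
    """
    downstream = {}
    upstream = defaultdict(list)
    for link, to in zip(links, tos):
        link, to = int(link), int(to)
        downstream[link] = to
        if to != outlet_value:
            upstream[to].append(link)
    return downstream, dict(upstream)

def trace_downstream(link, downstream, outlet_value=0):
    """Follow `to` pointers from `link` until an outlet or a repeated segment."""
    path = [link]
    seen = {link}
    while True:
        nxt = downstream.get(path[-1], outlet_value)
        if nxt == outlet_value or nxt in seen:
            break
        path.append(nxt)
        seen.add(nxt)
    return path

# Made-up network: segments 101 and 102 join at 103, which drains to outlet segment 104.
links = [101, 102, 103, 104]
tos = [103, 103, 104, 0]

downstream, upstream = build_topology(links, tos)
print(upstream)                           # {103: [101, 102], 104: [103]}
print(trace_downstream(101, downstream))  # [101, 103, 104]
```

With the dataframe from this notebook, the same helpers could be called as `build_topology(df["link"].to_numpy(), df["to"].to_numpy())`.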
