## Creating a parquet file of NWM RouteLink
This notebook creates a parquet file of the National Water Model (NWM) RouteLink to be uploaded to BigQuery.

To access the NWM RouteLink, some code (cells 1 to 4) was adapted from [route_link_fsspec.ipynb](https://github.com/AlabamaWaterInstitute/data_access_examples/blob/main/nwm_network/route_link_fsspec.ipynb).

### Imports

In [None]:
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr
from pyarrow.parquet import ParquetFile

### FSSPEC download for NWM RouteLink file

In [None]:
fs = fsspec.filesystem("http")

rl_nwm_url = "https://www.nco.ncep.noaa.gov/pmb/codes/nwprod/nwm.v2.2.0/parm/DOMAIN_WCOSS_Names/RouteLink_CONUS.nc"
with fs.open(rl_nwm_url) as f:
    %time    rl_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=0).translate()
    
    # Key example here: 
    # https://fsspec.github.io/kerchunk/test_example.html
 

The `kerchunk` example that we started with passed a number of other parameters.
Some of them might be worth reintroducing to make the data access even speedier,
e.g.:
```py
fs = fsspec.filesystem('ftp', host="https://www.nco.ncep.noaa.gov/pmb")

with fs.open(rl_nwm_url, mode='rb', anon=True, default_fill_cache=False, default_cache_type='first') as f:
```
 ...
 
One thing that I specifically explored was the `inline_threshold` setting. Smaller values gave slightly better results, though not a massive improvement: roughly 9.5 to 9.9 seconds for thresholds of 2 or below vs. 11 or so for larger values and the default.
```py
    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url).translate() # 11.1 s
    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=30000).translate() # 11.3 s
    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=300).translate() # 11.2 s
    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=10).translate() # 11.3 s
    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=2).translate() # 9.8 s
    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=1).translate() # 9.85 s
    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=0).translate() # 9.83 s
    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=-1).translate() # 9.54 s
```
Inlining the `.translate()` call vs. splitting seemed to be about equal, with inlining having the additional advantage of omitting the unused intermediate output. 
```py
    %time    rl_h5 = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=0)
    %time    rl_t = rl_h5.translate() # This translate MUST happen inside the context block
```
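The numbers above come from single `%time` runs, and with differences of a second or two on a remote read, run-to-run noise may account for part of the gap. A minimal sketch of repeating a measurement and taking the median is shown below. `time_call` and `fake_translate` are hypothetical helpers introduced only for illustration; `fake_translate` is a stand-in for the real `SingleHdf5ToZarr(...).translate()` call so that the block runs without network access.

```python
import time
from statistics import median


def time_call(func, *args, repeats=3, **kwargs):
    """Call func several times; return (median seconds, last result)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return median(durations), result


def fake_translate(inline_threshold=0):
    """Hypothetical stand-in for the kerchunk translate step."""
    return {"version": 1, "inline_threshold": inline_threshold}


# Compare two settings using the same measurement procedure
for threshold in (300, 0):
    seconds, refs = time_call(fake_translate, inline_threshold=threshold)
    print(f"inline_threshold={threshold}: {seconds:.6f} s")
```

In the notebook, the stand-in would be replaced by a small function that opens the URL and runs the translate step inside the `with fs.open(...)` block, so that every repeat does the same work.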
    

In [None]:
backend_args = {
    "consolidated": False,
    "storage_options": {
        "fo": rl_t,
        # Adding these options returns a properly dimensioned but otherwise null dataframe
        # "remote_protocol": "https",
        # "remote_options": {'anon':True}
    },
}
%time ds = xr.open_dataset("reference://", engine="zarr", backend_kwargs=backend_args)

In [None]:
# only keep the necessary variables
subslice = ["link","to"]

# Convert to pandas dataframe
%time df = ds[subslice].to_dataframe().astype({"link": int, "to": int})

In [None]:
# Set "link" as the index of the dataframe

df = df.set_index("link")
df

### Convert the dataframe to parquet and save it

In [None]:
df.to_parquet("/Users/grad/NWMRouteLinkParquet.gzip", engine="pyarrow", compression="gzip")

# Show the metadata of the parquet file 
ParquetFile("/Users/grad/NWMRouteLinkParquet.gzip").metadata 