In [1]:
from kerchunk.grib2 import (
    build_idx_grib_mapping,
    extract_datatree_chunk_index,
    grib_tree,
    map_from_index,
    parse_grib_idx,
    scan_grib,
)
import pandas as pd
import datatree
import fsspec

## Testing out the building of the k_index (kerchunk index)

In this notebook, we walk through a single step of the index building. We build a `mapping` from the file `s3://noaa-gefs-pds/gefs.20170228/18/gec00.t18z.pgrb2af006`, apply it to the file `s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z.pgrb2af006`, and compare the result against an index built by scanning that second file directly.

This also works for any files from the same repository with the **same forecast horizon**, irrespective of the run time.
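The idea behind reusing a mapping can be sketched with plain dictionaries. This is a simplified illustration under stated assumptions, not kerchunk's implementation: the `attrs` strings, metadata fields, byte offsets, and the helper `apply_mapping` are all invented for the example. The mapping ties each `.idx` attribute string to GRIB metadata once; for another file with the same horizon, only the byte ranges from its own `.idx` listing are needed.

```python
# Toy mapping built once from a fully scanned file: idx attrs -> GRIB metadata.
# All strings and numbers here are illustrative, not taken from a real file.
toy_mapping = {
    "ULWRF:surface:0-6 hour ave fcst": {
        "varname": "ulwrf", "typeOfLevel": "surface",
    },
    "ULWRF:top of atmosphere:0-6 hour ave fcst": {
        "varname": "ulwrf", "typeOfLevel": "nominalTop",
    },
}

# Toy idx rows for a different run with the same forecast horizon.
toy_idx = [
    {"attrs": "ULWRF:surface:0-6 hour ave fcst", "offset": 100, "length": 40},
    {"attrs": "ULWRF:top of atmosphere:0-6 hour ave fcst", "offset": 140, "length": 50},
    {"attrs": "UNKNOWN:surface:anl", "offset": 190, "length": 10},
]


def apply_mapping(mapping, idx_rows):
    """Attach metadata to byte ranges; rows with no mapping entry are skipped."""
    out = []
    for row in idx_rows:
        meta = mapping.get(row["attrs"])
        if meta is None:
            continue  # nothing known about this message, so leave it out
        out.append({**meta, "offset": row["offset"], "length": row["length"]})
    return out


toy_index = apply_mapping(toy_mapping, toy_idx)
```

The expensive scan happens once, when the mapping is built; every later file only costs a small text download and a lookup.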

In [2]:
idxdf = parse_grib_idx("s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z.pgrb2af006", storage_options=dict(anon=True))
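For orientation, a GRIB `.idx` sidecar file is a colon-separated text inventory with one line per message: message number, starting byte offset, reference date, then descriptive fields such as variable, level, and forecast step. The sketch below parses a few such lines with the standard library and derives each message length from the next message's offset. It is an illustration only, not the code that `parse_grib_idx` runs; the sample lines, offsets, and the helper `parse_idx_text` are invented.

```python
# Invented sample in the usual idx layout: number:offset:date:descriptive fields
sample_idx = """\
1:0:d=2017010106:HGT:10 mb:6 hour fcst:ENS=low-res ctl
2:5000:d=2017010106:TMP:10 mb:6 hour fcst:ENS=low-res ctl
3:9000:d=2017010106:RH:10 mb:6 hour fcst:ENS=low-res ctl
"""


def parse_idx_text(text):
    """Parse idx lines into dicts and derive lengths from consecutive offsets."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first three colons only; the rest stays whole as attrs.
        num, offset, date, attrs = line.split(":", 3)
        rows.append(
            {"idx": int(num), "offset": int(offset), "date": date, "attrs": attrs}
        )
    # Length of a message = next message's offset minus its own offset.
    for cur, nxt in zip(rows, rows[1:]):
        cur["length"] = nxt["offset"] - cur["offset"]
    # The last message has no successor; its length needs the total file size.
    if rows:
        rows[-1]["length"] = None
    return rows


parsed = parse_idx_text(sample_idx)
```

The `attrs` string kept here plays the same role as the `attrs` column used later as the join key.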

In [3]:
mapping = build_idx_grib_mapping("s3://noaa-gefs-pds/gefs.20170228/18/gec00.t18z.pgrb2af006", storage_options=dict(anon=True), remote_options=dict(anon=True), validate=True)

The grib hierarchy in s3://noaa-gefs-pds/gefs.20170228/18/gec00.t18z.pgrb2af006 is not unique for 54 variables: ['gh', 't', 'r', 'u', 'v', 'gh', 't', 'r', 'u', 'v', 'gh', 't', 'r', 'u', 'v', 'gh', 't', 'r', 'u', 'v', 'gh', 't', 'r', 'u', 'v', 'u', 'v', 'u', 'v', 'gh', 't', 'r', 'u', 'v', 'gh', 't', 'r', 'u', 'v', 'gh', 't', 'r', 'u', 'v', 'gh', 't', 'r', 'u', 'v', 't', 'r', 'u', 'v', 'gh']


In [4]:
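# Scan every GRIB message in the file, then merge the per-message references into one tree.
scanned_msgs = scan_grib(
    "s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z.pgrb2af006",
    storage_options=dict(anon=True),
)
grib_tree_store = grib_tree(scanned_msgs, remote_options=dict(anon=True))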



In [5]:
# Expose the kerchunk reference store as a virtual Zarr store through fsspec.
fs = fsspec.filesystem(
    "reference",
    fo=grib_tree_store,
    remote_protocol="s3",
    remote_options={"anon": True},
)
dt = datatree.open_datatree(fs.get_mapper(""), engine="zarr", consolidated=False)

In [6]:
grib_df = extract_datatree_chunk_index(dt, grib_tree_store)
grib_df.loc[grib_df['varname'] == "ulwrf"]

| | varname | typeOfLevel | stepType | name | number | step | time | valid_time | uri | offset | length | inline_value | surface | isobaricInhPa | meanSea | atmosphereSingleLayer | heightAboveGround | atmosphere | nominalTop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 66 | ulwrf | nominalTop | avg | Upward long-wave radiation flux | 0 | 0 days 06:00:00 | 2017-01-01 06:00:00 | 2017-01-01 12:00:00 | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | 3924345 | 43221 | | | | | | | | 0.0 |
| 67 | ulwrf | surface | avg | Upward long-wave radiation flux | 0 | 0 days 06:00:00 | 2017-01-01 06:00:00 | 2017-01-01 12:00:00 | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | 3885258 | 39087 | | 0.0 | | | | | | |


In [7]:
mapped_index = map_from_index(
    pd.Timestamp("2017-01-01T06"),
    mapping.loc[~mapping["attrs"].duplicated(keep="first"), :],
    idxdf.loc[~idxdf["attrs"].duplicated(keep="first"), :],
)
mapped_index.loc[mapped_index['varname'] == "ulwrf"]

| | varname | typeOfLevel | stepType | name | step | level | time | valid_time | uri | offset | length | inline_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 78 | ulwrf | surface | avg | Upward long-wave radiation flux | 0 days 06:00:00 | 0.0 | 2017-01-01 06:00:00 | 2017-01-01 12:00:00 | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | 3885258 | 39087 | |
| 79 | ulwrf | nominalTop | avg | Upward long-wave radiation flux | 0 days 06:00:00 | 0.0 | 2017-01-01 06:00:00 | 2017-01-01 12:00:00 | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | 3924345 | 43221 | |
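The `~...duplicated(keep="first")` filters passed to `map_from_index` keep only the first row for each repeated `attrs` value, so an attribute string matches at most one row on either side. A plain-Python sketch of that keep-first rule follows; the rows and the helper name `keep_first` are invented for illustration.

```python
def keep_first(rows, key):
    """Return rows in order, dropping any row whose key value was already seen."""
    seen = set()
    kept = []
    for row in rows:
        if row[key] in seen:
            continue  # a later duplicate: drop it, the first occurrence wins
        seen.add(row[key])
        kept.append(row)
    return kept


# Invented rows: the first and third share the same attrs string.
toy_rows = [
    {"attrs": "UGRD:10 mb", "offset": 0},
    {"attrs": "VGRD:10 mb", "offset": 10},
    {"attrs": "UGRD:10 mb", "offset": 20},
]
unique_rows = keep_first(toy_rows, "attrs")
```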


As we can see, `grib_df` and `mapped_index` hold the same `offset` and `length` values for each level of a given variable, here `ulwrf`; only the row order differs.
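The agreement can also be checked programmatically rather than by eye. The sketch below copies the `ulwrf` rows from the two outputs into plain dictionaries and compares them on columns both tables share, ignoring row order. The helper name `as_key_set` and the choice of comparison columns are assumptions made for this illustration.

```python
# ulwrf rows as shown in the grib_df output (scanned index).
grib_rows = [
    {"varname": "ulwrf", "typeOfLevel": "nominalTop", "offset": 3924345, "length": 43221},
    {"varname": "ulwrf", "typeOfLevel": "surface", "offset": 3885258, "length": 39087},
]

# ulwrf rows as shown in the mapped_index output (index built from the mapping).
mapped_rows = [
    {"varname": "ulwrf", "typeOfLevel": "surface", "offset": 3885258, "length": 39087},
    {"varname": "ulwrf", "typeOfLevel": "nominalTop", "offset": 3924345, "length": 43221},
]


def as_key_set(rows, columns=("varname", "typeOfLevel", "offset", "length")):
    """Reduce rows to a set of tuples so the comparison ignores row order."""
    return {tuple(row[c] for c in columns) for row in rows}


same = as_key_set(grib_rows) == as_key_set(mapped_rows)
print("indexes agree:", same)
```

With the real DataFrames, the same check could be run over every variable instead of only `ulwrf`.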