# Fast NODD GRIB Aggregations

## Overview

In this tutorial we demonstrate how to build kerchunk aggregations of **NODD GRIB2 weather forecasts** quickly. The workflow primarily uses [xarray-datatree](https://xarray-datatree.readthedocs.io/en/latest/), [pandas](https://pandas.pydata.org/) and the `grib_tree` function released in **kerchunk v0.2.3**.


### About the Dataset

For this operation we will be looking at GRIB2 files generated by the [**NOAA Global Ensemble Forecast System (GEFS)**](https://www.ncei.noaa.gov/products/weather-climate-models/global-ensemble-forecast), a weather forecast model made up of 21 separate forecasts, or ensemble members. With global coverage, GEFS is produced four times a day, every 6 hours starting at midnight, with weather forecasts going out to 16 days.

More information on this dataset can be found [here](https://registry.opendata.aws/noaa-gefs).


## Prerequisites
| Concepts | Importance | Notes |
| --- | --- | --- |
| [Kerchunk Basics](../foundations/kerchunk_basics) | Required | Core |
| [Pandas Tutorial](https://foundations.projectpythia.org/core/pandas/pandas.html#) | Required | Core |
| [Kerchunk and Xarray-Datatree](https://projectpythia.org/kerchunk-cookbook/notebooks/using_references/Datatree.html) | Required | IO |
| [Xarray-Datatree Overview](https://xarray-datatree.readthedocs.io/en/latest/quick-overview.html)| Required | IO |

- **Time to learn**: 30 minutes

## Motivation

**Kerchunk** provides a unified way to represent a variety of chunked, compressed data formats (e.g. NetCDF/HDF5, GRIB2, TIFF, …) by generating *references*. This workflow adds the ability to build large aggregations from **NODD GRIB forecasts** in a fraction of the time by using the `idx` files that accompany them.

### Imports

In [1]:
from kerchunk.grib2 import (
    scan_grib,
    grib_tree, 
    parse_grib_idx, 
    strip_datavar_chunks,
    build_idx_grib_mapping, 
    map_from_index, 
    reinflate_grib_store,
    AggregationType
)
import copy
import pandas as pd
import datatree
import fsspec

### Listing the files to build the datatree

In [40]:
s3_files = [
    "s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z.pgrb2af006", 
    "s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z.pgrb2af012", 
    "s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z.pgrb2af018"
    
    # "s3://noaa-gefs-pds/gefs.20170101/00/gec00.t00z.pgrb2af006",
    # "s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z.pgrb2af006",
    # "s3://noaa-gefs-pds/gefs.20170101/12/gec00.t12z.pgrb2af006"
]

In [41]:
# Scan every GRIB2 file into per-message references, then merge them into one hierarchical store
grib_tree_store = grib_tree([group for f in s3_files for group in scan_grib(f, storage_options=dict(anon=True))], remote_options=dict(anon=True))



In [42]:
# Open the reference store as a datatree through fsspec's "reference" filesystem
s3_dt = datatree.open_datatree(fsspec.filesystem("reference", fo=grib_tree_store).get_mapper(""), engine="zarr", consolidated=False)

#### Removing the data variable chunk references

In [68]:
# Work on a deep copy, because strip_datavar_chunks modifies the store it is given
deflated_grib_tree_store = copy.deepcopy(grib_tree_store)
strip_datavar_chunks(deflated_grib_tree_store)
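To get a feel for what "deflating" a reference store means, here is a minimal standalone sketch. It is not kerchunk's implementation. The toy store, its key names, the `strip_chunks` helper and the assumption that latitude and longitude chunks are kept are all hypothetical, chosen only to illustrate the idea of dropping chunk references while keeping the metadata.

```python
import copy
import re

# Hypothetical miniature reference store; keys and values are illustrative only
toy_store = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "t/instant/surface/.zgroup": '{"zarr_format": 2}',
        "t/instant/surface/t/.zarray": '{"shape": [1, 1, 181, 360]}',
        "t/instant/surface/t/0.0.0.0": ["s3://example-bucket/file.grib2", 0, 100],
        "t/instant/surface/latitude/.zarray": '{"shape": [181]}',
        "t/instant/surface/latitude/0": "base64:AAAA",
    },
}

# A chunk key ends in dot-separated integers such as "0" or "0.0.0.0"
CHUNK_KEY = re.compile(r"^\d+(\.\d+)*$")


def strip_chunks(store, keep=("latitude", "longitude")):
    """Delete chunk references in place, except for the variables in `keep`."""
    refs = store["refs"]
    for key in list(refs):
        parts = key.split("/")
        is_chunk = CHUNK_KEY.match(parts[-1]) is not None
        is_kept_coord = len(parts) >= 2 and parts[-2] in keep
        if is_chunk and not is_kept_coord:
            del refs[key]


# Copy first so the original store is left untouched
deflated = copy.deepcopy(toy_store)
strip_chunks(deflated)
print(sorted(deflated["refs"]))
```

The deflated copy keeps the group and array metadata, so it can serve as a template that is later refilled with chunk references taken from the index files.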

### Index dataframe made from a single GRIB file

In [44]:
# what an idx dataframe looks like
idxdf = parse_grib_idx("s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z.pgrb2af006", storage_options=dict(anon=True))
idxdf.head()

| idx | offset | date | attrs | length | idx_uri | grib_uri |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | d=2017010106 | HGT:10 mb:6 hour fcst:ENS=low-res ctl | 47493 | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... |
| 2 | 47493 | d=2017010106 | TMP:10 mb:6 hour fcst:ENS=low-res ctl | 19438 | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... |
| 3 | 66931 | d=2017010106 | RH:10 mb:6 hour fcst:ENS=low-res ctl | 10835 | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... |
| 4 | 77766 | d=2017010106 | UGRD:10 mb:6 hour fcst:ENS=low-res ctl | 22625 | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... |
| 5 | 100391 | d=2017010106 | VGRD:10 mb:6 hour fcst:ENS=low-res ctl | 20488 | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... |
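The columns above suggest how little information an index file needs to carry: a message number, a byte offset, a date and a descriptive `attrs` string. The sketch below parses a few such lines with the standard library only. The raw colon-separated layout in `SAMPLE_IDX` is an assumption reconstructed from the dataframe columns, and `parse_idx_text` is a hypothetical helper, not the `parse_grib_idx` function itself.

```python
# Assumed raw layout of an idx file: index:offset:date:attrs
# (values copied from the first rows of the dataframe shown above)
SAMPLE_IDX = """\
1:0:d=2017010106:HGT:10 mb:6 hour fcst:ENS=low-res ctl
2:47493:d=2017010106:TMP:10 mb:6 hour fcst:ENS=low-res ctl
3:66931:d=2017010106:RH:10 mb:6 hour fcst:ENS=low-res ctl
4:77766:d=2017010106:UGRD:10 mb:6 hour fcst:ENS=low-res ctl
"""


def parse_idx_text(text):
    """Split each line into idx, offset, date and attrs, then derive lengths."""
    rows = []
    for line in text.strip().splitlines():
        # Split only on the first three colons: attrs itself contains colons
        idx, offset, date, attrs = line.split(":", 3)
        rows.append(
            {"idx": int(idx), "offset": int(offset), "date": date, "attrs": attrs}
        )
    # A message ends where the next one starts
    for current, following in zip(rows, rows[1:]):
        current["length"] = following["offset"] - current["offset"]
    # The last length cannot be derived without knowing the total file size
    rows[-1]["length"] = None
    return rows


rows = parse_idx_text(SAMPLE_IDX)
for row in rows:
    print(row["idx"], row["offset"], row["length"], row["attrs"])
```

The derived lengths for the first three messages agree with the `length` column of the dataframe, which is what makes a byte-range reference (URI, offset, length) possible without reading the GRIB file itself.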


### Building a mapping between the index dataframe and grib metadata

In [45]:
# Create the mapping for a single horizon file; it is reused later for the other files
mapping = build_idx_grib_mapping("s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z.pgrb2af006", storage_options=dict(anon=True), remote_options=dict(anon=True))
mapping.head()

The grib hierarchy in s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z.pgrb2af006 is not unique for 54 variables: ['gh', 't', 'r', 'u', 'v', 'gh', 't', 'r', 'u', 'v', 'gh', 't', 'r', 'u', 'v', 'gh', 't', 'r', 'u', 'v', 'gh', 't', 'r', 'u', 'v', 'u', 'v', 'u', 'v', 'gh', 't', 'r', 'u', 'v', 'gh', 't', 'r', 'u', 'v', 'gh', 't', 'r', 'u', 'v', 'gh', 't', 'r', 'u', 'v', 't', 'r', 'u', 'v', 'gh']


| idx | offset_idx | date | attrs | length_idx | idx_uri | grib_uri | varname | typeOfLevel | stepType | name | level | step | time | valid_time | uri | offset_grib | length_grib | inline_value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | d=2017010106 | HGT:10 mb:6 hour fcst:ENS=low-res ctl | 47493 | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | gh | isobaricInhPa | instant | Geopotential height | 0.0 | 0 days 06:00:00 | 2017-01-01 06:00:00 | 2017-01-01 12:00:00 | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | 0 | 47493 |  |
| 2 | 47493 | d=2017010106 | TMP:10 mb:6 hour fcst:ENS=low-res ctl | 19438 | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | t | isobaricInhPa | instant | Temperature | 0.0 | 0 days 06:00:00 | 2017-01-01 06:00:00 | 2017-01-01 12:00:00 | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | 47493 | 19438 |  |
| 3 | 66931 | d=2017010106 | RH:10 mb:6 hour fcst:ENS=low-res ctl | 10835 | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | r | isobaricInhPa | instant | Relative humidity | 0.0 | 0 days 06:00:00 | 2017-01-01 06:00:00 | 2017-01-01 12:00:00 | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | 66931 | 10835 |  |
| 4 | 77766 | d=2017010106 | UGRD:10 mb:6 hour fcst:ENS=low-res ctl | 22625 | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | u | isobaricInhPa | instant | U component of wind | 0.0 | 0 days 06:00:00 | 2017-01-01 06:00:00 | 2017-01-01 12:00:00 | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | 77766 | 22625 |  |
| 5 | 100391 | d=2017010106 | VGRD:10 mb:6 hour fcst:ENS=low-res ctl | 20488 | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | v | isobaricInhPa | instant | V component of wind | 0.0 | 0 days 06:00:00 | 2017-01-01 06:00:00 | 2017-01-01 12:00:00 | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | 100391 | 20488 |  |


### Mapping an idx file onto the grib metadata

In [46]:
# This step is repeated for every grib/idx pair, reusing the "mapping" dataframe created above
mapped_index = map_from_index(
    pd.Timestamp("2017-01-01T06"),
    mapping.loc[~mapping["attrs"].duplicated(keep="first"), :],
    idxdf.loc[~idxdf["attrs"].duplicated(keep="first"), :],
)
mapped_index

|  | varname | typeOfLevel | stepType | name | step | level | time | valid_time | uri | offset | length | inline_value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | gh | isobaricInhPa | instant | Geopotential height | 0 days 06:00:00 | 0.0 | 2017-01-01 06:00:00 | 2017-01-01 12:00:00 | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | 0 | 47493 |  |
| 1 | t | isobaricInhPa | instant | Temperature | 0 days 06:00:00 | 0.0 | 2017-01-01 06:00:00 | 2017-01-01 12:00:00 | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | 47493 | 19438 |  |
| 2 | r | isobaricInhPa | instant | Relative humidity | 0 days 06:00:00 | 0.0 | 2017-01-01 06:00:00 | 2017-01-01 12:00:00 | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | 66931 | 10835 |  |
| 3 | u | isobaricInhPa | instant | U component of wind | 0 days 06:00:00 | 0.0 | 2017-01-01 06:00:00 | 2017-01-01 12:00:00 | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | 77766 | 22625 |  |
| 4 | v | isobaricInhPa | instant | V component of wind | 0 days 06:00:00 | 0.0 | 2017-01-01 06:00:00 | 2017-01-01 12:00:00 | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | 100391 | 20488 |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 78 | ulwrf | surface | avg | Upward long-wave radiation flux | 0 days 06:00:00 | 0.0 | 2017-01-01 06:00:00 | 2017-01-01 12:00:00 | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | 3885258 | 39087 |  |
| 79 | ulwrf | nominalTop | avg | Upward long-wave radiation flux | 0 days 06:00:00 | 0.0 | 2017-01-01 06:00:00 | 2017-01-01 12:00:00 | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | 3924345 | 43221 |  |
| 80 | cape | pressureFromGroundLayer | instant | Convective available potential energy | 0 days 06:00:00 | 0.0 | 2017-01-01 06:00:00 | 2017-01-01 12:00:00 | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | 3967566 | 42488 |  |
| 81 | cin | pressureFromGroundLayer | instant | Convective inhibition | 0 days 06:00:00 | 0.0 | 2017-01-01 06:00:00 | 2017-01-01 12:00:00 | s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z... | 3967566 | 42488 |  |
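The `map_from_index` call above filters both dataframes with `~df["attrs"].duplicated(keep="first")` before combining them. The sketch below illustrates, under stated assumptions, why that matters: if the `attrs` string is the key that links an idx row to its grib metadata, it has to identify a single row on each side. This is a conceptual, standard-library illustration only. `drop_duplicate_attrs` and `join_on_attrs` are hypothetical helpers, the third idx row is an invented duplicate, and kerchunk's own logic may differ.

```python
# Metadata side: attrs string linked to the cfgrib-style variable description
mapping_rows = [
    {"attrs": "HGT:10 mb:6 hour fcst:ENS=low-res ctl", "varname": "gh",
     "typeOfLevel": "isobaricInhPa", "stepType": "instant"},
    {"attrs": "TMP:10 mb:6 hour fcst:ENS=low-res ctl", "varname": "t",
     "typeOfLevel": "isobaricInhPa", "stepType": "instant"},
]

# Index side: attrs string linked to a byte range in the grib file
idx_rows = [
    {"attrs": "HGT:10 mb:6 hour fcst:ENS=low-res ctl", "offset": 0, "length": 47493},
    {"attrs": "TMP:10 mb:6 hour fcst:ENS=low-res ctl", "offset": 47493, "length": 19438},
    # Invented duplicate, to show what keep="first" style filtering discards
    {"attrs": "HGT:10 mb:6 hour fcst:ENS=low-res ctl", "offset": 999999, "length": 1},
]


def drop_duplicate_attrs(rows):
    """Keep only the first row for each attrs value."""
    seen = set()
    kept = []
    for row in rows:
        if row["attrs"] not in seen:
            seen.add(row["attrs"])
            kept.append(row)
    return kept


def join_on_attrs(mapping_rows, idx_rows):
    """Attach offset and length from the idx rows to the matching metadata."""
    by_attrs = {row["attrs"]: row for row in mapping_rows}
    joined = []
    for row in idx_rows:
        meta = by_attrs.get(row["attrs"])
        if meta is not None:
            joined.append({**meta, "offset": row["offset"], "length": row["length"]})
    return joined


joined = join_on_attrs(
    drop_duplicate_attrs(mapping_rows), drop_duplicate_attrs(idx_rows)
)
for row in joined:
    print(row["varname"], row["offset"], row["length"])
```

Without the filtering step, the duplicated `attrs` value would produce two candidate byte ranges for the same variable, and the join would no longer be one-to-one.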


In [47]:
mapped_index_list = []

# Drop duplicated "attrs" entries from the mapping once, outside the loop
deduped_mapping = mapping.loc[~mapping["attrs"].duplicated(keep="first"), :]

# Loop over 20 days of forecasts, four runs per day (00, 06, 12 and 18 UTC)
for date in range(1, 21):
    for runtime in range(0, 24, 6):
        fname = f"s3://noaa-gefs-pds/gefs.201701{date:02}/{runtime:02}/gec00.t{runtime:02}z.pgrb2af006"

        # Parse the idx file that accompanies this grib file
        idxdf = parse_grib_idx(basename=fname, storage_options=dict(anon=True))

        mapped_index = map_from_index(
            pd.Timestamp(f"2017-01-{date:02}T{runtime:02}"),
            deduped_mapping,
            idxdf.loc[~idxdf["attrs"].duplicated(keep="first"), :],
        )

        mapped_index_list.append(mapped_index)

gfs_kind = pd.concat(mapped_index_list)
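The loop above builds one file name per model run from a day and an hour. As a small offline sketch of the same bookkeeping, the generator below uses `datetime` arithmetic to pair each run time with its valid time and URL. The path pattern is copied from the loop, while `gefs_control_runs` and its parameters are hypothetical names introduced only for illustration.

```python
from datetime import datetime, timedelta

BUCKET = "s3://noaa-gefs-pds"


def gefs_control_runs(start, days, horizon_hours=6):
    """Yield (run_time, valid_time, url) for the control member at one horizon."""
    for day in range(days):
        for hour in range(0, 24, 6):
            run_time = start + timedelta(days=day, hours=hour)
            url = (
                f"{BUCKET}/gefs.{run_time:%Y%m%d}/{run_time:%H}/"
                f"gec00.t{run_time:%H}z.pgrb2af{horizon_hours:03d}"
            )
            # The valid time is the run time plus the forecast horizon
            yield run_time, run_time + timedelta(hours=horizon_hours), url


runs = list(gefs_control_runs(datetime(2017, 1, 1), days=20))
print(len(runs))
print(runs[0][2])
print(runs[-1][0], runs[-1][1])
```

Deriving the names from real timestamps avoids hard-coding the month in the format string and makes it easy to check how many files a given date range should contain.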

In [48]:
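# The axes of the aggregation: the forecast step(s) and the valid times to cover
axes = [
  pd.Index(
    [
      # Only the 6 hour horizon files (pgrb2af006) were indexed above
      pd.timedelta_range(start="0 hours", end="6 hours", freq="6h", closed="right", name="6 hour"),
    ],
    name="step"
  ),
  pd.date_range("2017-01-01T00:00", "2017-01-20T18:00", freq="360min", name="valid_time")
]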
axes

[Index([[0 days 06:00:00]], dtype='object', name='step'),
 DatetimeIndex(['2017-01-01 00:00:00', '2017-01-01 06:00:00',
                '2017-01-01 12:00:00', '2017-01-01 18:00:00',
                '2017-01-02 00:00:00', '2017-01-02 06:00:00',
                '2017-01-02 12:00:00', '2017-01-02 18:00:00',
                '2017-01-03 00:00:00', '2017-01-03 06:00:00',
                '2017-01-03 12:00:00', '2017-01-03 18:00:00',
                '2017-01-04 00:00:00', '2017-01-04 06:00:00',
                '2017-01-04 12:00:00', '2017-01-04 18:00:00',
                '2017-01-05 00:00:00', '2017-01-05 06:00:00',
                '2017-01-05 12:00:00', '2017-01-05 18:00:00',
                '2017-01-06 00:00:00', '2017-01-06 06:00:00',
                '2017-01-06 12:00:00', '2017-01-06 18:00:00',
                '2017-01-07 00:00:00', '2017-01-07 06:00:00',
                '2017-01-07 12:00:00', '2017-01-07 18:00:00',
                '2017-01-08 00:00:00', '2017-01-08 06:00:00',
            

In [63]:
gfs_store = reinflate_grib_store(
    axes=axes,
    aggregation_type=AggregationType.HORIZON,
    chunk_index=gfs_kind.loc[gfs_kind.varname.isin(["ulwrf", "prmsl"])],
    zarr_ref_store=deflated_grib_tree_store
)


In [64]:
gfs_dt = datatree.open_datatree(fsspec.filesystem("reference", fo=gfs_store).get_mapper(""), engine="zarr", consolidated=False)

In [65]:
gfs_dt

In [66]:
gfs_dt.ulwrf.avg.surface