# Obtain JSONs & Consolidate Them into a Single JSON for NetCDF from S3

__Purpose:__
To use a kerchunk-formatted configuration file to request conversion of cloud-hosted data, read as zarr, into the Anemoi format.

### 1) Extracting JSONs from NetCDF(s) Stored in S3 & Consolidating Them into a Single JSON

__Finding:__

Consolidating the per-file JSONs into a single JSON is required before a data file can be converted into an Anemoi-formatted dataset.

__Data Tested:__
  
fn="e5.oper.an.pl.128_060_pv.ll025sc.2024030100_2024030123.nc"
pattern = f"s3://nsf-ncar-era5/e5.oper.an.pl/202403/{fn}"


In [1]:
import json
import fsspec
import tqdm
from kerchunk.combine import MultiZarrToZarr
from kerchunk.hdf import SingleHdf5ToZarr

# Establish S3 filesystem
fs = fsspec.filesystem("s3", anon=True)

# Cloud object(s) of interest
fn="e5.oper.an.pl.128_060_pv.ll025sc.2024030100_2024030123.nc"
pattern = f"s3://nsf-ncar-era5/e5.oper.an.pl/202403/{fn}"

# Build a list of reference JSONs, one per data file in the cloud
# (SingleHdf5ToZarr reads HDF5/NetCDF4 files only)
jsons = []
for file in tqdm.tqdm(fs.glob(pattern)):
    with fs.open(file, "rb", anon=True) as f:
        h5chunks = SingleHdf5ToZarr(f, file)
        jsons.append(h5chunks.translate())

print("CHECK HERE ON THE JSON & HOW IT LOOKS PRIOR TO IT BEING ADD WITH MULTIZARRTOZARR", jsons)

# Consolidate list of jsons into a single json
mzz = MultiZarrToZarr(
    jsons,
    remote_protocol="s3",
    remote_options={"anon": True},
    concat_dims=["time"],
    identical_dims=["latitude", "longitude"],
)

with open("combined.json", "w") as f:
    json.dump(mzz.translate(), f)

100%|█████████████████████████████████████████████| 1/1 [00:14<00:00, 14.12s/it]

CHECK HERE ON THE JSON & HOW IT LOOKS PRIOR TO IT BEING ADD WITH MULTIZARRTOZARR [{'version': 1, 'refs': {'.zgroup': '{"zarr_format":2}', '.zattrs': '{"CONVERSION_DATE":"Mon 03 Jun 2024 06:17:05 PM MDT","CONVERSION_PLATFORM":"Linux crhtc58 5.14.21-150400.24.46-default #1 SMP PREEMPT_DYNAMIC Thu Feb 9 08:38:18 UTC 2023 (2d95137) x86_64 x86_64 x86_64 GNU\\/Linux","Conventions":"CF-1.6","DATA_SOURCE":"ECMWF: https:\\/\\/cds.climate.copernicus.eu, Copernicus Climate Data Store","NCO":"netCDF Operators version 5.1.9 (Homepage = http:\\/\\/nco.sf.net, Code = http:\\/\\/github.com\\/nco\\/nco, Citation = 10.1016\\/j.envsoft.2008.03.004)","NETCDF_COMPRESSION":"NCO: Precision-preserving compression to netCDF4\\/HDF5 (see \\"history\\" and \\"NCO\\" global attributes below for specifics).","NETCDF_CONVERSION":"CISL RDA: Conversion from ECMWF GRIB 1 data to netCDF4.","NETCDF_VERSION":"4.9.2","history":"Mon Jun  3 18:17:16 2024: ncks -4 -L 1 --baa=0 --ppc default=7 e5.oper.an.pl.128_060_pv.ll025sc
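
The printed reference set is hard to read as one long line. As a minimal sketch, the structure can be summarized instead of dumped. It assumes the kerchunk version 1 layout visible in the output above (a top-level `refs` mapping whose keys are zarr metadata files or chunk keys); `summarize_refs` and the `example` dictionary are hypothetical and hand-written, not taken from the real file.

```python
from collections import Counter


def summarize_refs(reference_set):
    """List zarr metadata keys and count chunk references per top-level variable."""
    refs = reference_set["refs"]
    metadata_keys = []
    chunks_per_variable = Counter()
    for key in refs:
        name = key.rsplit("/", 1)[-1]
        if name in (".zgroup", ".zattrs", ".zarray"):
            # Zarr metadata documents (group, attributes, array description)
            metadata_keys.append(key)
        elif "/" in key:
            # Anything else under "<variable>/" is treated as a chunk reference
            chunks_per_variable[key.split("/", 1)[0]] += 1
    return {"metadata": sorted(metadata_keys), "chunks": dict(chunks_per_variable)}


# Hypothetical reference set; the URL, offsets and lengths are made up
example = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format":2}',
        "PV/.zarray": "{}",
        "PV/0.0.0.0": ["s3://example-bucket/file.nc", 1000, 200],
        "PV/1.0.0.0": ["s3://example-bucket/file.nc", 1200, 200],
        "time/.zarray": "{}",
        "time/0": ["s3://example-bucket/file.nc", 100, 24],
    },
}

summary = summarize_refs(example)
print(summary)
```

The same function could presumably be applied to each element of `jsons`, or to the dictionary loaded from `combined.json` with `json.load`, to compare the references before and after `MultiZarrToZarr`.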




In [4]:
! anemoi-datasets create test_s3_nc.yaml test_s3_nc.zarr

2024-09-16 13:47:35 INFO Task init((),{}) starting
2024-09-16 13:47:35 INFO Setting flatten_grid=True in config
2024-09-16 13:47:35 INFO Setting ensemble_dimension=2 in config
2024-09-16 13:47:35 INFO Setting flatten_grid=True in config
2024-09-16 13:47:35 INFO Setting ensemble_dimension=2 in config
2024-09-16 13:47:35 INFO {'start': datetime.datetime(2024, 3, 1, 20, 0), 'end': datetime.datetime(2024, 3, 1, 23, 0), 'frequency': '1h', 'group_by': 'monthly'}
2024-09-16 13:47:35 INFO Groups(dates=1)
2024-09-16 13:47:35 INFO FunctionAction: json=combined_tested.json param=PV level=[1000, 50] 
2024-09-16 13:47:35 INFO Minimal input for 'init' step (using only the first date) :
2024-09-16 13:47:35 INFO xarray-kerchunk(['2024-03-01T20:00:00'])
2024-09-16 13:47:35 INFO Config loaded ok:
2024-09-16 13:47:35 INFO Found 4 datetimes.
2024-09-16 13:47:35 INFO Dates: Found 4 datetimes, in 1 groups: 
2024-09-16 13:47:35 INFO Missing dates: 0
2024-09-16 13:47:35 INFO Read reference from URL combined_t

### 2) Testing Conversion of NetCDF stored in S3 to Anemoi Format

__Finding:__

Consolidating the JSONs of a zarr store into a single JSON appears to be required before the store can be converted into an Anemoi-formatted dataset via a kerchunk-format configuration file.

This does not appear to be offered by the current anemoi-datasets framework (main and develop branches). A capability that consolidates the JSONs of a zarr store into a single JSON file would need to be added to the framework (see the error below).
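
One possible shape for such a capability, offered only as an untested sketch: because a zarr store is already laid out as one object per key, a kerchunk version 1 reference set can point each key at its whole object using a single-element `[url]` reference. The helper `zarr_store_to_refs` and the key listing (including the variable name `t2m`) are hypothetical; whether anemoi-datasets accepts the resulting JSON has not been verified here.

```python
import json


def zarr_store_to_refs(store_url, object_keys):
    """Map every object key of a zarr store to a whole-file kerchunk reference."""
    base = store_url.rstrip("/")
    # A one-element list [url] refers to the entire remote object
    refs = {key: [f"{base}/{key}"] for key in sorted(object_keys)}
    return {"version": 1, "refs": refs}


# Hypothetical key listing; the real store's variables are not shown in this document
keys = [".zgroup", ".zattrs", "t2m/.zarray", "t2m/0.0.0"]
reference_set = zarr_store_to_refs("s3://noaa-ufs-gdas-pds/test_ar.zarr/", keys)
print(json.dumps(reference_set, indent=2))
```

With a real store, the keys could presumably be obtained by listing the store recursively (for example with `fs.find`) and stripping the store prefix. kerchunk also appears to provide a `kerchunk.zarr.single_zarr` helper aimed at this case; it was not tried here, so its behavior should be checked against the kerchunk documentation.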

__Data Tested:__
  
pattern = 's3://noaa-ufs-gdas-pds/test_ar.zarr/*'

In [5]:
import json
import fsspec
import tqdm
from kerchunk.combine import MultiZarrToZarr
from kerchunk.hdf import SingleHdf5ToZarr

# Establish S3 filesystem
fs = fsspec.filesystem("s3", anon=True)

# Cloud object of interest
pattern = 's3://noaa-ufs-gdas-pds/test_ar.zarr/*'

# ======================== TODO: =========================
# Create a script that consolidates the JSONs into one and then feeds it into a
# kerchunk-formatted configuration file
# =================================================================
# Build a list of reference JSONs, one per data file in the cloud
# (SingleHdf5ToZarr reads HDF5/NetCDF4 files only)
jsons = []
for file in tqdm.tqdm(fs.glob(pattern)):
    with fs.open(file, "rb", anon=True) as f:
        h5chunks = SingleHdf5ToZarr(f, file) 
        jsons.append(h5chunks.translate())

print("CHECK HERE ON THE JSON & HOW IT LOOKS PRIOR TO IT BEING ADD WITH MULTIZARRTOZARR", jsons)

# Consolidate list of jsons into a single json
mzz = MultiZarrToZarr(
    jsons,
    remote_protocol="s3",
    remote_options={"anon": True},
    concat_dims=["time"],
    identical_dims=["latitude", "longitude"],
)
# =================================================================

with open("combined4zarr.json", "w") as f:
    json.dump(mzz.translate(), f)

  0%|                                                    | 0/39 [00:00<?, ?it/s]


OSError: Unable to open file (file signature not found)
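
The `file signature not found` message is what the HDF5 library reports when the bytes it is given do not contain the HDF5 signature. The objects inside a zarr store (JSON metadata and raw chunks) are not HDF5 files, so `SingleHdf5ToZarr` cannot parse them. A minimal sketch of a pre-check follows; `looks_like_hdf5` and the sample bytes are hypothetical, and the check only looks at offset 0, whereas HDF5 also permits the signature at later offsets (512, 1024, ...) when a user block is present.

```python
# The 8-byte signature that begins an HDF5 (and therefore NetCDF4) file
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(first_bytes):
    """Return True if the given leading bytes start with the HDF5 signature."""
    return first_bytes[:8] == HDF5_SIGNATURE


# Hypothetical leading bytes: one HDF5-like header and one zarr metadata document
samples = {
    "file.nc": HDF5_SIGNATURE + b"\x00" * 8,
    "test_ar.zarr/.zgroup": b'{"zarr_format":2}',
}

results = {name: looks_like_hdf5(data) for name, data in samples.items()}
print(results)
```

In the loop above, the first eight bytes of each object could be read from the opened file (for instance `f.read(8)` followed by `f.seek(0)`) to skip non-HDF5 objects before they reach `SingleHdf5ToZarr`.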