# Virtual data set (VDS) reference file for MUR-JPL-L4-GLOB-v4.1 using VirtualiZarr

This notebook builds a VDS for MUR-JPL-L4-GLOB-v4.1 (https://doi.org/10.5067/GHGMR-4FJ04) and saves it as a Parquet file. The collection is an L4 data set with one granule per time step, so the per-granule references are concatenated along the time dimension.

In [1]:
# Built-in packages
import os
import sys

# Filesystem management 
import fsspec
import earthaccess

# Data handling
import numpy as np
import xarray as xr
from virtualizarr import open_virtual_dataset
import pandas as pd

# Parallel computing 
import multiprocessing
from dask import delayed
import dask.array as da
from dask.distributed import Client
import coiled

# Other
import matplotlib.pyplot as plt

## 1. Get Data File S3 endpoints in Earthdata Cloud

In [2]:
# Get Earthdata creds
earthaccess.login()

Enter your Earthdata Login username:  deanh808
Enter your Earthdata password:  ········


<earthaccess.auth.Auth at 0x7fd79c3ffce0>

In [3]:
# Get AWS creds. Note that if you spend more than 1 hour in the notebook, you may have to re-run this line.
fs = earthaccess.get_s3_filesystem(daac="PODAAC")

In [4]:
# Locate MUR file information / metadata:
granule_info = earthaccess.search_data(
    short_name="MUR-JPL-L4-GLOB-v4.1",
    )

In [5]:
# Get S3 endpoints for all files:
data_s3links = [g.data_links(access="direct")[0] for g in granule_info]
print(len(data_s3links))
data_s3links[0:3]

8374


['s3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
 's3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20020602090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
 's3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20020603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc']
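
Each file name above begins with a `YYYYMMDDhhmmss` timestamp, so the link list could be narrowed to a date range before any files are opened. The following is a minimal sketch, assuming that naming pattern holds for every granule; `granule_time` and `links_in_range` are hypothetical helpers, not part of earthaccess.

```python
from datetime import datetime

def granule_time(s3_link):
    """Parse the leading YYYYMMDDhhmmss timestamp from a granule file name."""
    fname = s3_link.rsplit("/", 1)[-1]
    return datetime.strptime(fname[:14], "%Y%m%d%H%M%S")

def links_in_range(links, start, end):
    """Return the links whose timestamp t satisfies start <= t < end."""
    return [p for p in links if start <= granule_time(p) < end]

# The first three links printed above, used here as example input.
prefix = "s3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/"
suffix = "-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc"
example_links = [
    prefix + "20020601090000" + suffix,
    prefix + "20020602090000" + suffix,
    prefix + "20020603090000" + suffix,
]

subset = links_in_range(example_links, datetime(2002, 6, 2), datetime(2002, 6, 4))
print(len(subset))  # 2
```

In the notebook, the same call on `data_s3links` would replace the positional slice `data_s3links[:n_files_process]` with a selection by date.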

## 2. Generate single-granule reference files

One file per time step, so one reference file per time step.

In [6]:
n_files_process = 1000

In [7]:
# This will be assigned to 'loadable_variables' and needs to be modified per the specific 
# coord names of the data set:
coord_vars = ["lat","lon"]

In [8]:
print("CPU count =", multiprocessing.cpu_count())

CPU count = 32


In [9]:
reader_opts = {"storage_options": fs.storage_options} # S3 filesystem creds from previous section.

In [10]:
# Start up cluster and print some information about it:
client = Client(n_workers=15, threads_per_worker=1)
print(client.cluster)
print("View any work being done on the cluster here", client.dashboard_link)

LocalCluster(b755e4ad, 'tcp://127.0.0.1:35117', workers=15, threads=15, memory=122.29 GiB)
View any work being done on the cluster here https://cluster-qfcgi.dask.host/jupyter/proxy/8787/status


In [11]:
%%time
# Create individual references:
open_vds_par = delayed(open_virtual_dataset)
tasks = [
    open_vds_par(p, indexes={}, reader_options=reader_opts, loadable_variables=coord_vars) 
    for p in data_s3links[:n_files_process]
    ]
virtual_ds_list = list(da.compute(*tasks)) # The xr.combine_nested() function below needs a list rather than a tuple.

CPU times: user 1min 52s, sys: 18.7 s, total: 2min 11s
Wall time: 14min 20s
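
At this rate (about 14 minutes per 1000 files), the full list of 8374 links would need roughly two hours of wall time, which is longer than the one-hour credential lifetime noted in Section 1. One option is to process the links in batches and refresh the credentials before each batch. The following is a minimal sketch of that pattern; `process_batch` and `refresh` are hypothetical callables that you would supply (for example, wrapping the `da.compute` call and `earthaccess.get_s3_filesystem`), and stand-ins are used here so the block runs on its own.

```python
def batches(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def run_in_batches(links, size, process_batch, refresh):
    """Call refresh() before each batch, then process_batch(batch, creds)."""
    results = []
    for batch in batches(links, size):
        creds = refresh()  # e.g. fetch fresh S3 credentials (assumption)
        results.extend(process_batch(batch, creds))
    return results

# Stand-ins for the real credential and processing functions.
refresh_calls = []

def fake_refresh():
    refresh_calls.append(1)
    return {"token": len(refresh_calls)}

def fake_process(batch, creds):
    return [(p, creds["token"]) for p in batch]

out = run_in_batches([f"file{i}" for i in range(5)], 2, fake_process, fake_refresh)
print(out)
# [('file0', 1), ('file1', 1), ('file2', 2), ('file3', 2), ('file4', 3)]
```

The batch size would be chosen so that one batch finishes comfortably within the credential lifetime.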


In [13]:
len(virtual_ds_list)

1000

## 3. Generate combined reference file

In [14]:
%%time
# Create the combined reference
virtual_ds_combined = xr.combine_nested(
    virtual_ds_list,
    concat_dim='time',
    coords='minimal',
    compat='override',
    combine_attrs='drop_conflicts',
    )

NotImplementedError: The ManifestArray class cannot concatenate arrays which were stored using different codecs, But found codecs Codec(compressor=None, filters=[{'elementsize': 1, 'id': 'shuffle'}, {'id': 'zlib', 'level': 7}]) vs Codec(compressor=None, filters=[{'id': 'zlib', 'level': 8}]) .See https://github.com/zarr-developers/zarr-specs/issues/288
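
The error indicates that the granules were not all written with the same filters (here, shuffle plus zlib level 7 in one file versus zlib level 8 alone in another), so they cannot be concatenated into a single manifest. One possible workaround is to split the list into consecutive runs that share identical codecs and combine each run separately. The sketch below shows only the grouping step, on plain tuples; `codec_key` is a hypothetical function that you would need to write to extract a hashable codec description from each virtual dataset.

```python
from itertools import groupby

def consecutive_runs(items, key):
    """Split `items` into lists of consecutive elements that share the same key."""
    return [list(group) for _, group in groupby(items, key=key)]

# Stand-in data: (file name, codec description) pairs in time order.
items = [
    ("a.nc", "shuffle+zlib7"),
    ("b.nc", "shuffle+zlib7"),
    ("c.nc", "zlib8"),
    ("d.nc", "zlib8"),
    ("e.nc", "shuffle+zlib7"),
]

def codec_key(item):
    # Hypothetical: for real virtual datasets this would inspect the codecs.
    return item[1]

runs = consecutive_runs(items, codec_key)
print([len(r) for r in runs])  # [2, 2, 1]
```

Grouping only consecutive elements keeps each run contiguous in time, at the cost of producing more than one combined reference when the codecs alternate.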

In [19]:
# Save in both JSON and Parquet formats:
fname_combined_json = 'ref_combined_1year.json'
fname_combined_parq = 'ref_combined_1year.parq'
virtual_ds_combined.virtualize.to_kerchunk(fname_combined_json, format='json')
virtual_ds_combined.virtualize.to_kerchunk(fname_combined_parq, format='parquet')