The processing involves these basic steps.

* download the raw file
* load all the daily files
* pad the dates if the data do not go back to 1979
* trim the dates to the max in the zarr file
* rename the data variables if needed
* interpolate (coarsen or adjust centers) to the 0.25 degree grid
* change from float64 to float32
* rechunk the time element
* save to a zarr file

In [2]:
import numpy as np
import pandas as pd
import xarray as xr
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

## The grids and times

In [3]:
# The 0.25 degree grid
zarr_grid=xr.open_dataset('../grid.nc')

In [3]:
# The time grid; there is probably a better way to do this
zarr_time=xr.open_dataset("../time.nc")
date_start=str(zarr_time.time.min().values)[:10]
date_end=str(zarr_time.time.max().values)[:10]
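
One possible alternative to slicing the string representation is to convert the endpoints to pandas timestamps and format them explicitly. The sketch below uses a small synthetic time axis as a stand-in for `../time.nc` (the 10-day range is an assumption for illustration only).

```python
import pandas as pd
import xarray as xr

# Hypothetical stand-in for ../time.nc: a short daily time axis
zarr_time = xr.Dataset(
    coords={"time": pd.date_range("1979-01-01", "1979-01-10", freq="D")}
)

# Take the numpy datetime64 endpoints, wrap them as pandas Timestamps,
# and format them as YYYY-MM-DD strings
date_start = pd.Timestamp(zarr_time.time.values.min()).strftime("%Y-%m-%d")
date_end = pd.Timestamp(zarr_time.time.values.max()).strftime("%Y-%m-%d")
print(date_start, date_end)
```

This avoids relying on the first 10 characters of the printed timestamp being the date.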

## The functions

The helper functions are saved in `functions.py`.

In [4]:
# Load the functions
%run -i "~/indian-ocean-zarr/notebooks/functions.py"

## How to do the steps

In [None]:
# Open nc files
ds = xr.open_dataset('/home/jovyan/shared/data/copernicus/chlorophyll-a/20240424.nc')
# Open multiple files
ds = xr.open_mfdataset('/home/jovyan/shared/data/copernicus/chlorophyll-a/*.nc')

In [None]:
# Slice data (lat1 and lat2 are the latitude bounds you want to keep)
ds = ds.sel(time=slice(date_start, date_end), lat=slice(lat1, lat2))
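
A small sketch of the slicing step on synthetic data. The variable name `chl`, the dates, and the values of `lat1` and `lat2` are assumptions for illustration. It also shows one thing to watch for: label-based slices follow the order of the coordinate, so a descending latitude axis needs the bounds given from high to low (or the data sorted first).

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic dataset with a descending latitude axis (common in raw products)
time = pd.date_range("2024-01-01", periods=10, freq="D")
lat = np.array([10.0, 5.0, 0.0, -5.0, -10.0])
ds = xr.Dataset(
    {"chl": (("time", "lat"), np.zeros((time.size, lat.size)))},
    coords={"time": time, "lat": lat},
)

lat1, lat2 = -5.0, 5.0  # hypothetical latitude bounds

# Descending coordinate: give the slice bounds from high to low
sub_desc = ds.sel(time=slice("2024-01-03", "2024-01-06"), lat=slice(lat2, lat1))

# Alternative: sort latitude ascending first, then slice low to high
sub_sorted = ds.sortby("lat").sel(lat=slice(lat1, lat2))

print(sub_desc.sizes, sub_sorted.lat.values)
```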

In [None]:
# Regrid and interpolate
ds = ds.interp_like(zarr_grid)
# Use compute() to force the interpolation to run immediately
ds = ds.interp_like(zarr_grid).compute()
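
To make the regridding step concrete, here is a minimal sketch on synthetic data, assuming a source grid at 0.125 degrees and a hypothetical 0.25 degree target grid standing in for `../grid.nc`. It shows `interp_like` (adjusting to the target centers) and, for comparison, `coarsen` (block averaging). Note that `interp_like` relies on SciPy being installed, and that `compute()` only matters when the data are dask-backed, as they are after `open_mfdataset`.

```python
import numpy as np
import xarray as xr

# Synthetic source data on a 0.125 degree grid; the field is linear in
# lat and lon so linear interpolation reproduces it exactly
src_lat = np.arange(0.0, 2.001, 0.125)
src_lon = np.arange(60.0, 62.001, 0.125)
field = src_lat[:, None] + src_lon[None, :]
ds = xr.Dataset(
    {"chl": (("lat", "lon"), field)},
    coords={"lat": src_lat, "lon": src_lon},
)

# Hypothetical stand-in for the 0.25 degree target grid
zarr_grid = xr.Dataset(
    coords={
        "lat": np.arange(0.0, 2.001, 0.25),
        "lon": np.arange(60.0, 62.001, 0.25),
    }
)

# Interpolate onto the target grid centers
ds_interp = ds.interp_like(zarr_grid)

# Alternative: average 2x2 blocks; the new centers fall between the
# original points, so they may still need adjusting to the target grid
ds_coarse = ds.coarsen(lat=2, lon=2, boundary="trim").mean()

print(ds_interp.sizes, ds_coarse.sizes)
```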

In [None]:
# Rename variables
ds = ds.rename({"latitude": "lat", "longitude": "lon"})

In [8]:
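# Trim to the last date in the zarr time grid, then pad time back to its start
chloro_interp = chloro_interp.sel(time=slice(None, date_end))
# Number of days missing between the zarr start date and the first data date
timepad = pd.to_datetime(chloro_interp.time[0].values) - pd.to_datetime(zarr_time.time[0].values)
chloro_interp = chloro_interp.pad(time=(timepad.days, 0))
# After padding, the time axis has the same length as the zarr time grid
chloro_interp['time'] = zarr_time.time.values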
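
As an alternative to trimming and padding by hand, `reindex` can do both in one call: times missing from the data are filled with NaN and times outside the target axis are dropped. This is a sketch on synthetic data, not the method used in this notebook. The variable name `chl` and the date ranges are assumptions, and it assumes the data timestamps match the target timestamps exactly (for example, both at midnight).

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical target time axis (stand-in for ../time.nc)
zarr_time = xr.Dataset(
    coords={"time": pd.date_range("1979-01-01", "1979-01-10", freq="D")}
)

# Synthetic data that start late and run past the end of the target axis
data_time = pd.date_range("1979-01-04", "1979-01-12", freq="D")
ds = xr.Dataset(
    {"chl": ("time", np.arange(data_time.size, dtype="float64"))},
    coords={"time": data_time},
)

# Align to the target axis: pads the front with NaN, trims the extra days
padded = ds.reindex(time=zarr_time.time.values)

print(padded["chl"].values)
```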

In [9]:
# Convert variables to float32 and rechunk time into 100-day chunks
# Function defined in functions.py
chloro_interp = standardize_float64(chloro_interp)
chloro_interp = standardize_chunk(chloro_interp)
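
The bodies of `standardize_float64` and `standardize_chunk` live in `functions.py` and are not shown here. The sketch below is only a guess at what helpers of this kind might look like. The names `to_float32_sketch` and `rechunk_time_sketch` are hypothetical, and the rechunking helper needs dask installed when it is called.

```python
import numpy as np
import xarray as xr


def to_float32_sketch(ds):
    """Return a copy of ds with every float64 data variable cast to float32."""
    out = ds.copy()
    for name in list(out.data_vars):
        if out[name].dtype == np.float64:
            out[name] = out[name].astype(np.float32)
    return out


def rechunk_time_sketch(ds, days=100):
    """Rechunk along time only (requires dask); other dims are left as-is."""
    return ds.chunk({"time": days})


# Synthetic dataset with one float64 and one integer variable
ds = xr.Dataset(
    {
        "chl": ("time", np.zeros(5, dtype="float64")),
        "flag": ("time", np.zeros(5, dtype="int32")),
    }
)
ds32 = to_float32_sketch(ds)
print(ds32["chl"].dtype, ds32["flag"].dtype)
# With dask available: ds32 = rechunk_time_sketch(ds32, days=100)
```

Halving the float precision roughly halves the size of the saved store, which is the usual motivation for this step.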

## Save the files

Saving can be very memory intensive. Saving intermediate files helps, and using `dask.distributed` helps a lot.

In [12]:
from dask.distributed import Client
client = Client(n_workers=4)
# client.close()

In [24]:
# Add full attributes to the variables
# When the initial dataset has the full attributes
for var in ds.data_vars:
    ds[var].attrs.update(ds.attrs)

In [32]:
# Save data to zarr to preserve chunks
ds.to_zarr('/home/jovyan/shared/data/finalized/chlorophyll.zarr')

<xarray.backends.zarr.ZarrStore at 0x7fbc9e3910c0>

### Append variables

When you have a zarr file with the same lat, lon, and time coordinates, you can append new variables to it with `mode='a'`.

In [37]:
ds2.to_zarr('/home/jovyan/shared/data/finalized/chlorophyll.zarr', mode='a')

<xarray.backends.zarr.ZarrStore at 0x7fbc7eca8a40>

In [None]:
topo = xr.open_dataarray('data/topography.nc')

In [None]:
grid = xr.open_dataarray('data/grid.nc')

In [None]:
topo = topo.rename({'latitude': 'lat', 'longitude': 'lon'})

In [None]:
topo_interp = topo.interp_like(grid)

In [None]:
topo_interp.to_netcdf('shared/data/finalized/topography.nc')

In [None]:
topo_interp.plot.imshow()

In [None]:
grid