# General data reader for AQUA 
## Units, coordinates, variable name fixer

The reader includes a simple 'data fixer': the ability to edit the metadata of the input datasets, fixing variable and coordinate names and performing unit conversions.

It is still unclear how much this will be needed, since the GSV should provide a common format. For the moment, however, this capability is needed to ingest NextGEMS data on Levante in a uniform way.

A further problem is that, for now, there is no well-defined standard for the data format to adopt.

In [10]:
import sys
#sys.path.append("../..")  # hack to import module -- to be removed later

from aqua import Reader, catalogue

Let's load some IFS data. We first instantiate a `Reader` object, specifying the type of data we want to read from the catalogue. The actual data are then read with the `retrieve` method. For now, passing `fix=False` prevents the reader from applying unit fixes and other fixes.

In [11]:
reader = Reader(model="IFS", exp="tco2559-ng5", source="ICMGG_atm2d")
data = reader.retrieve(fix=False)

These are raw IFS data on the original grid. Notice how, for example, the variable with short name `2t` represents near-surface temperature.

In [12]:
data["2t"]

| | Array | Chunk |
|---|---|---|
| Bytes | 1.59 TiB | 200.70 MiB |
| Shape | (8329, 26306560) | (1, 26306560) |
| Dask graph | 8329 chunks in 2 graph layers | 8329 chunks in 2 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |

The output also lists two one-dimensional arrays, each summarised as:

| | Array | Chunk |
|---|---|---|
| Bytes | 200.70 MiB | 200.70 MiB |
| Shape | (26306560,) | (26306560,) |
| Dask graph | 1 chunks in 2 graph layers | 1 chunks in 2 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |
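As a quick sanity check, the sizes reported in the summary above can be reproduced from the shape alone. This is a minimal sketch in plain Python. It assumes that the first dimension counts time steps and the second counts grid points (the output itself does not name the dimensions), and it relies on the fact that a float64 value occupies 8 bytes. The constant names are illustrative.

```python
# Reproduce the reported array and chunk sizes from the shape (8329, 26306560).
N_TIMES = 8329         # assumed: number of time steps (first dimension)
N_POINTS = 26306560    # assumed: number of grid points (second dimension)
ITEMSIZE = 8           # bytes per float64 value

chunk_bytes = N_POINTS * ITEMSIZE      # one chunk has shape (1, 26306560)
total_bytes = chunk_bytes * N_TIMES    # 8329 chunks in total

chunk_mib = chunk_bytes / 2**20
total_tib = total_bytes / 2**40
print(f"chunk: {chunk_mib:.2f} MiB, total: {total_tib:.2f} TiB")
# prints "chunk: 200.70 MiB, total: 1.59 TiB"
```

Each chunk therefore holds one full field, and nothing of this size is loaded into memory until a computation is requested.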


Now let's try again, dropping the `fix=False` flag:

In [13]:
data = reader.retrieve()

tp: corrected multiplying by density of water 1000 kg m-3
tp: corrected dividing by accumulation time 10800 s


The resulting data are now adjusted using the instructions in the `config/fixes.yaml` file. For now, for IFS data, `2t` is renamed to `tas` and `tp` is converted to `pr`. Units are converted too, and accumulated IFS fields are converted to fluxes (the output time interval needed for this is stored in `fixes.yaml`). The `config/fixes.yaml` file will need to be extended to cover other variables.

The fixer uses the `metpy.units` module and can guess some basic conversions. In particular, if a density is missing from the conversion, it assumes that it is the density of water and takes it into account.
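The two corrections logged for `tp` can be checked by hand. The sketch below is plain Python and does not use the fixer or `metpy`. The density and accumulation time are taken from the log messages, while the function name, the constant names and the sample value are purely illustrative.

```python
# Worked example of the two corrections logged for `tp`.
WATER_DENSITY = 1000.0        # kg m-3, value reported in the log message
ACCUMULATION_TIME = 10800.0   # s, value reported in the log message (3 hours)


def accumulated_depth_to_flux(depth_m: float) -> float:
    """Convert an accumulated precipitation depth (m) to a mass flux (kg m-2 s-1)."""
    mass_per_area = depth_m * WATER_DENSITY     # m * kg m-3 = kg m-2
    return mass_per_area / ACCUMULATION_TIME    # kg m-2 / s = kg m-2 s-1


# Illustrative input: 1.08 mm of precipitation accumulated over one output step
flux = accumulated_depth_to_flux(0.00108)
print(f"{flux:.1e} kg m-2 s-1")   # prints "1.0e-04 kg m-2 s-1"
```

Taken together, the two steps amount to a single multiplicative factor of 1000 / 10800, with no additive offset.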

In [14]:
data.pr

| | Array | Chunk |
|---|---|---|
| Bytes | 1.59 TiB | 200.70 MiB |
| Shape | (8329, 26306560) | (1, 26306560) |
| Dask graph | 8329 chunks in 3 graph layers | 8329 chunks in 3 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |

The output also lists two one-dimensional arrays, each summarised as:

| | Array | Chunk |
|---|---|---|
| Bytes | 200.70 MiB | 200.70 MiB |
| Shape | (26306560,) | (26306560,) |
| Dask graph | 1 chunks in 2 graph layers | 1 chunks in 2 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


By default the fixer directly converts all the variables it can, so `tp` (now `pr`) has also been converted to units of kg m-2 s-1. This behaviour can be switched off by specifying `apply_unit_fix=False`.

In [15]:
data = reader.retrieve(apply_unit_fix=False)

In [16]:
data.pr

| | Array | Chunk |
|---|---|---|
| Bytes | 1.59 TiB | 200.70 MiB |
| Shape | (8329, 26306560) | (1, 26306560) |
| Dask graph | 8329 chunks in 3 graph layers | 8329 chunks in 3 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |

The output also lists two one-dimensional arrays, each summarised as:

| | Array | Chunk |
|---|---|---|
| Bytes | 200.70 MiB | 200.70 MiB |
| Shape | (26306560,) | (26306560,) |
| Dask graph | 1 chunks in 2 graph layers | 1 chunks in 2 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


Notice that this time, while precipitation has been renamed to `pr`, the actual unit conversion has not been performed yet. Units are still m (the precipitation is still accumulated), but the fixer has annotated the DataArray with the attributes `target_units`, `factor` and `offset`, which can be used to perform the conversion. It may be preferable to delay the actual conversion because it requires a multiplication or an addition, and it may be more efficient to perform this operation at the very end, for example after aggregation. This sequence is still experimental, and we will have to discuss what is best (and what the best defaults are).
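The idea of delaying the conversion until after aggregation can be illustrated with a small stand-alone sketch. It uses plain Python lists rather than a DataArray, and it assumes that the stored attributes describe a linear conversion of the form `new = old * factor + offset`. The helper `apply_pending_fix`, the attribute values and the sample data are all hypothetical and are not taken from the reader's code.

```python
from statistics import fmean


def apply_pending_fix(values, attrs):
    """Apply a linear unit fix, assuming new = old * factor + offset."""
    factor = attrs.get("factor", 1.0)
    offset = attrs.get("offset", 0.0)
    return [v * factor + offset for v in values]


# Hypothetical attributes, mimicking what the fixer might attach to `pr`
attrs = {"target_units": "kg m-2 s-1", "factor": 1000.0 / 10800.0, "offset": 0.0}
raw = [0.0, 0.00108, 0.00216, 0.00054]   # illustrative accumulated precipitation, in m

# Option A: convert every value first, then aggregate
mean_a = fmean(apply_pending_fix(raw, attrs))

# Option B: aggregate first, then convert a single number
mean_b = apply_pending_fix([fmean(raw)], attrs)[0]

print(abs(mean_a - mean_b) < 1e-12)   # prints "True"
```

Both orders give the same mean because the conversion is linear, but option B touches far fewer values. The equivalence does not carry over unchanged to nonlinear statistics: a variance, for example, scales with the square of `factor`.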

We can now perform the actual unit conversion of a specific DataArray explicitly with:

In [17]:
reader.apply_unit_fix(data.pr)

The units have now been fixed, which is also marked by an additional attribute `units_fixed=True`.

In [18]:
data.pr

| | Array | Chunk |
|---|---|---|
| Bytes | 1.59 TiB | 200.70 MiB |
| Shape | (8329, 26306560) | (1, 26306560) |
| Dask graph | 8329 chunks in 3 graph layers | 8329 chunks in 3 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |

The output also lists two one-dimensional arrays, each summarised as:

| | Array | Chunk |
|---|---|---|
| Bytes | 200.70 MiB | 200.70 MiB |
| Shape | (26306560,) | (26306560,) |
| Dask graph | 1 chunks in 2 graph layers | 1 chunks in 2 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


The entire concept of 'delayed' unit conversion still has to be discussed and tested. It may well be that applying the unit conversion immediately (which involves a costly operation such as multiplying the data) is actually quite efficient in dask because of its scheduling properties.