# Model Agnostic Analysis

## Introduction

In this tutorial, model agnostic analysis means writing your notebook so that it can easily be used with any CF compliant data source.

### What are the CF Conventions?

From [CF Metadata conventions](https://cfconventions.org):

> The CF metadata conventions are designed to promote the processing and sharing of files created with the NetCDF API. The conventions define metadata that provide a definitive description of what the data in each variable represents, and the spatial and temporal properties of the data. This enables users of data from different sources to decide which quantities are comparable, and facilitates building applications with powerful extraction, regridding, and display capabilities. The CF convention includes a standard name table, which defines strings that identify physical quantities.

In most cases the model output data accessed through the COSIMA Cookbook complies with some version of the CF conventions, enough to be usable for model agnostic analysis.
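
As a rough illustration of the kind of metadata the conventions describe, the sketch below attaches CF-style attributes to a small synthetic array. The array values and the names `yt` and `xt` are invented for this example; the attribute keys (`standard_name`, `units`, `axis`) and the standard name strings are taken from the CF conventions.

```python
import numpy as np
import xarray as xr

# Synthetic 2 x 3 field; the values and the dimension names are made up
sst = xr.DataArray(
    np.full((2, 3), 290.0),
    dims=("yt", "xt"),
    coords={
        # (dims, values, attributes) tuples attach metadata to each coordinate
        "yt": ("yt", [-10.0, 10.0],
               {"standard_name": "latitude", "units": "degrees_north", "axis": "Y"}),
        "xt": ("xt", [0.0, 120.0, 240.0],
               {"standard_name": "longitude", "units": "degrees_east", "axis": "X"}),
    },
    attrs={"standard_name": "sea_surface_temperature", "units": "K"},
)

# The metadata, not the variable names, says what each piece of data represents
print(sst.attrs["standard_name"], sst.attrs["units"])
print(sst["xt"].attrs["standard_name"])
```

Tools that follow the conventions can then identify the longitude coordinate from its attributes, whatever the coordinate happens to be called.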

### Why bother?

Model agnostic means the same code can work for multiple models. This makes your code more usable by **you** and by others. You no longer need to have different versions of code for different models. It makes you and anyone who uses your code more productive. It allows common tasks to be abstracted into general methods that can be more easily reused, meaning less code needs to be written and maintained. This is an enormous productivity boost.

### How is model agnostic analysis achieved?

This can be achieved by using packages designed for the purpose:
- [cf_xarray](https://cf-xarray.readthedocs.io/en/latest/index.html) for generalised coordinate naming
- [xgcm](https://xgcm.readthedocs.io) to make grid operations generic across data
- [pint](https://pint.readthedocs.io/) and [pint-xarray](https://pint-xarray.readthedocs.io/) for handling units easily and robustly

## Example

### Introduction

This example takes a simple analysis, shows how it might be done in a traditional, model specific manner, and then implements the same analysis in a model agnostic way.

The first step is to import the necessary libraries.

In [1]:
%matplotlib inline

import intake
import pandas as pd
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt
import cmocean as cm
import cartopy.crs as ccrs
import cartopy.feature as cft

# Units support: cf_xarray.units (CF-aware unit definitions) is imported before pint_xarray
import cf_xarray as cfxr
import cf_xarray.units
import pint_xarray
from pint import application_registry as ureg

# ACCESS-NRI intake catalog used to locate model output
catalog = intake.cat.access_nri

In [2]:
from dask.distributed import Client
client = Client("tcp://10.6.77.34:8786")
client


+-------------+----------------+-----------------+-----------------+
| Package     | Client         | Scheduler       | Workers         |
+-------------+----------------+-----------------+-----------------+
| dask        | 2023.8.0       | 2023.10.0       | 2023.10.0       |
| distributed | 2023.8.0       | 2023.10.0       | 2023.10.0       |
| pandas      | 2.0.3          | 2.1.1           | 2.1.1           |
| python      | 3.9.17.final.0 | 3.10.13.final.0 | 3.10.13.final.0 |
| tornado     | 6.3.2          | 6.3.3           | 6.3.3           |
+-------------+----------------+-----------------+-----------------+



cf_xarray works best when xarray keeps attributes by default.

In [3]:
xr.set_options(keep_attrs=True);
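
A small sketch of what this option changes, using a made-up array: with `keep_attrs=True`, a reduction such as `mean` carries the `units` attribute through to its result, which is what the unit handling and `cf_xarray` lookups later in this tutorial rely on. The option is applied here through a context manager so that the sketch does not alter any global setting.

```python
import xarray as xr

# Made-up temperatures with a units attribute
temps = xr.DataArray([271.0, 280.0, 289.0], dims="time", attrs={"units": "K"})

# Inside this block, reductions keep the attributes of their input
with xr.set_options(keep_attrs=True):
    mean_temp = temps.mean("time")

print(mean_temp.attrs)
```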

Now load surface temperature data from a 0.25$^\circ$ global MOM5 model. First, list the variables available in the experiment:

In [5]:
sorted(set().union(*cat_subset.df['variable']))

['ANGLE',
 'ANGLET',
 'HTE',
 'HTN',
 'NCAT',
 'TLAT',
 'TLON',
 'Tair_m',
 'Tsfc_m',
 'ULAT',
 'ULON',
 'age_global',
 'agm',
 'aice',
 'aice_m',
 'aicen_m',
 'aiso_bih',
 'albice_m',
 'albsni_m',
 'albsno_m',
 'alidf_ai_m',
 'alidr_ai_m',
 'alvdf_ai_m',
 'alvdr_ai_m',
 'area_t',
 'area_u',
 'aredi',
 'average_DT',
 'average_T1',
 'average_T2',
 'bih_fric_u',
 'bih_fric_v',
 'blkmask',
 'bmf_u',
 'bmf_v',
 'bottom_salt',
 'bottom_temp',
 'bottom_temp_max',
 'buoyfreq2_wt',
 'bv_freq',
 'congel',
 'congel_m',
 'daidtd',
 'daidtd_m',
 'daidtt',
 'daidtt_m',
 'dht',
 'diff_cbt_s',
 'diff_cbt_t',
 'divu_m',
 'drag_coeff',
 'dvidtd',
 'dvidtd_m',
 'dvidtt',
 'dvidtt_m',
 'dvirdgdt_m',
 'dxt',
 'dxu',
 'dyt',
 'dyu',
 'ekman_we',
 'eta_adjust',
 'eta_global',
 'eta_nonbouss',
 'eta_t',
 'evap',
 'evap_ai_m',
 'evap_heat',
 'fcondtop_ai',
 'fcondtop_ai_m',
 'fcondtopn_ai_m',
 'fhocn_ai_m',
 'flat_ai_m',
 'flatn_ai_m',
 'flwdn_m',
 'flwup_ai_m',
 'fmeltt_ai_m',
 'fmelttn_ai_m',
 'fprec',
 'fp

In [6]:
experiment = '025deg_jra55_iaf_omip2_cycle6'
variable = 'sst'
cat_subset = catalog[experiment]
var_search = cat_subset.search(variable=variable, frequency='1mon')
darray = var_search.to_dask()
darray = darray[variable]
SST = darray


This is a 3D dataset in latitude, longitude and time:

In [None]:
SST

### Model specific

First do this as it might usually be done, in a model specific manner:

1. Use the model's time coordinate name (`time`) in the mean function
2. Subtract a hard-coded value to convert the temperature from kelvin to degrees Celsius (the metadata says the units are `deg_C`, but this is clearly incorrect)

In [None]:
SST_time_mean = SST.mean('time') - 273.15
SST_time_mean

Now plot the result:

In [None]:
plt.figure(figsize=(12, 6))
ax = plt.axes(projection=ccrs.Robinson())

SST_time_mean.plot(ax=ax,
                   x='xt_ocean', y='yt_ocean', 
                   transform=ccrs.PlateCarree(),
                   vmin=-2, vmax=30, extend='both',
                   cmap=cm.cm.thermal)

ax.coastlines();

Note that the Arctic is not correctly represented, because the 1D lat/lon coordinates are not correct in the tripole area. See the [Making Maps with cartopy Tutorial](https://cosima-recipes.readthedocs.io/en/latest/tutorials/Making_Maps_with_Cartopy.html#fixing-the-tripole) for more information.

### Model agnostic

Now do the same calculation in a model agnostic manner.

For this data it is necessary to correct the units attribute first. This step would not be needed if the metadata were correct.

In [None]:
SST.attrs['units'] = 'K'

Now use `pint` to ensure the data is in degrees Celsius. If the data were already in degrees Celsius the conversion would simply do nothing, so this transparently caters for any temperature units that `pint` understands. Note the call to `quantify`, which invokes `pint`'s machinery to parse the units and allow unit conversions.

In [None]:
# 'degC' is pint's unambiguous name for degrees Celsius
SST = SST.pint.quantify().pint.to('degC')

In [None]:
SST

Now take the time mean, but this time use the `cf` accessor to automatically determine the name of the time dimension. `cf_xarray` checks the names of variables and coordinates, and associated metadata to try and infer information about the data based on the CF conventions.

To see what `cf_xarray` information is available just evaluate the accessor:

In [None]:
SST.cf

In this case it has found `X`, `Y` and `T` axes, and `longitude`, `latitude` and `time` coordinates. These are now accessible like a `dict` using the `cf` accessor. Note that indexing this way returns the coordinate itself, whereas many functions just want a simple string argument: the name of the coordinate.

`cf_xarray` also wraps many standard xarray functions allowing `cf` names to be used, which are [automatically converted to the name in the data](https://cf-xarray.readthedocs.io/en/latest/examples/introduction.html#feature-rewriting-arguments). 

The upshot: use the `cf` accessor, append the required function, and pass the standard CF coordinate name (in this case the CF name and the name in the data are the same, `time`, but that is not guaranteed).

In [None]:
SST_time_mean = SST.cf.mean('time')
SST_time_mean

Using the `cf_xarray` wrapped function makes the code more legible and easier to write, i.e.
```python
SST.cf.mean('time')
```
compared to
```python
SST.mean(SST.cf['time'].name)
```
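
To make the idea of name rewriting concrete, the sketch below mimics one small part of it in plain xarray: finding a coordinate by its `standard_name` attribute rather than by its variable name. This is a deliberately simplified stand-in, not how `cf_xarray` is implemented (which also considers other attributes and naming patterns). The helper names `coord_name_for` and `fake_sst`, and both synthetic arrays, are hypothetical.

```python
import numpy as np
import xarray as xr

def coord_name_for(da, standard_name):
    """Return the name of the one coordinate with this standard_name (hypothetical helper)."""
    matches = [name for name, coord in da.coords.items()
               if coord.attrs.get("standard_name") == standard_name]
    if len(matches) != 1:
        raise KeyError(f"expected one coordinate for {standard_name!r}, found {matches}")
    return str(matches[0])

def fake_sst(time_name, lon_name):
    """Build a tiny synthetic array whose dimension names vary between 'models'."""
    return xr.DataArray(
        np.arange(6.0).reshape(2, 3),
        dims=(time_name, lon_name),
        coords={
            time_name: (time_name, [0, 1], {"standard_name": "time"}),
            lon_name: (lon_name, [0.0, 120.0, 240.0], {"standard_name": "longitude"}),
        },
    )

# The same loop body works for both naming schemes
for da in (fake_sst("time", "xt_ocean"), fake_sst("t", "xh")):
    time_dim = coord_name_for(da, "time")
    print(time_dim, da.mean(time_dim).values)
```

Note that the helper refuses to guess when more than one coordinate matches, which is the same kind of ambiguity that can arise with real data carrying several longitude variables.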

In the same way, the `cf` accessor can be used in the plot, with the CF names for longitude and latitude passed as the `x` and `y` arguments:

In [None]:
plt.figure(figsize=(12, 6))
ax = plt.axes(projection=ccrs.Robinson())

SST_time_mean.cf.plot(ax=ax,
                      x='longitude', y='latitude', 
                      transform=ccrs.PlateCarree(),
                      vmin=-2, vmax=30, extend='both',
                      cmap=cm.cm.thermal)

ax.coastlines();

## Putting this into practice

The section above demonstrated a model agnostic version of some code, but that alone doesn't realise the full benefit of the approach. The model agnostic code can now be easily turned into a function that accepts an xarray DataArray:

In [None]:
def plot_global_temp_in_degrees_celcius(da):
    """Take the time mean of da and plot a global temperature field in a Robinson projection.

    Input DataArray (da) should be a 3D array of latitude, longitude and time.
    """
    # 'degC' is pint's unambiguous name for degrees Celsius
    da = da.pint.quantify().pint.to('degC')
    da_time_mean = da.cf.mean('time')
    
    plt.figure(figsize=(12, 6))
    ax = plt.axes(projection=ccrs.Robinson())

    da_time_mean.cf.plot(ax=ax,
                         x='longitude', y='latitude', 
                         transform=ccrs.PlateCarree(),
                         vmin=-2, vmax=30, extend='both',
                         cmap=cm.cm.thermal)

    ax.coastlines();

Try it out with the SST data used above:

In [None]:
plot_global_temp_in_degrees_celcius(SST)

Now try it on the output from a different experiment and model (MOM6):

In [None]:
experiment = 'OM4_025.JRA_RYF'
variable = 'tos'
cat_subset = catalog[experiment]
var_search = cat_subset.search(variable=variable, frequency='1mon')
darray = var_search.to_dask()
darray = darray[variable]
SST_mom6 = darray

In [None]:
SST_mom6

Check that it has correctly parsed the CF information. It is not necessary to print this out, but it is interesting to note that the index and coordinate names are quite different:

In [None]:
SST_mom6.cf

Use the function from above, which also worked on MOM5 data with very different coordinates:

In [None]:
plot_global_temp_in_degrees_celcius(SST_mom6)

## What to do when it goes wrong

The model agnostic function worked flawlessly with two different ocean data sets, after the units were corrected in the MOM5 data. What about some ice data?

Now load the ice air temperature variable, `Tair_m`, from the `025deg_jra55v13_iaf_gmredi6` experiment:

In [None]:
experiment = '025deg_jra55v13_iaf_gmredi6'
variable = 'Tair_m'
cat_subset = catalog[experiment]
var_search = cat_subset.search(variable=variable, frequency='1mon')
darray = var_search.to_dask()
darray = darray[variable]
ice_air_temp = darray

In [None]:
ice_air_temp

And try the generic routine:

In [None]:
plot_global_temp_in_degrees_celcius(ice_air_temp)

The error message is
```
"Receive multiple variables for key 'longitude': ['TLON', 'ULON']. Expected only one. Please pass a list ['longitude'] instead to get all variables matching 'longitude'."
```
This suggests that `cf_xarray` has found multiple longitude coordinates, `TLON` and `ULON`, and doesn't know how to resolve this automatically.

Inspecting the `cf` information doesn't show multiple axes like it [does in the documentation](https://cf-xarray.readthedocs.io/en/latest/examples/introduction.html#what-attributes-have-been-discovered):

In [None]:
ice_air_temp.cf

[This is a bug](https://github.com/xarray-contrib/cf-xarray/issues/396). Taking the mean alters the coordinates:

In [None]:
ice_air_temp.cf.mean('time').cf

So the solution is to drop the redundant velocity grid:

In [None]:
ice_air_temp = ice_air_temp.drop_vars(['ULON', 'ULAT'])
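
The effect of dropping the extra coordinates can be sketched on a small synthetic array that, like the ice data, carries two sets of 2D longitude and latitude coordinates. The values are invented; only the coordinate and dimension names follow the data above.

```python
import numpy as np
import xarray as xr

dims = ("nj", "ni")
lon = np.array([[0.0, 90.0], [0.0, 90.0]])
lat = np.array([[-45.0, -45.0], [45.0, 45.0]])

ice_like = xr.DataArray(
    np.ones((2, 2)),
    dims=dims,
    coords={
        "TLON": (dims, lon), "TLAT": (dims, lat),                # T grid
        "ULON": (dims, lon + 45.0), "ULAT": (dims, lat + 22.5),  # velocity (U) grid
    },
)

# drop_vars returns a new object without the named coordinates
trimmed = ice_like.drop_vars(["ULON", "ULAT"])
print(sorted(trimmed.coords))
```

After the drop only one longitude and one latitude coordinate remain, so a lookup by CF name is no longer ambiguous.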

Now try to plot again using the generic function, and there is another error:

In [None]:
plot_global_temp_in_degrees_celcius(ice_air_temp)

This error:
```
ValueError: x and y arguments to pcolormesh cannot have non-finite values or be of type numpy.ma.core.MaskedArray with masked values
```
is because there are `NaN`s in the coordinate variables, as [explained in the plotting with cartopy tutorial](https://cosima-recipes.readthedocs.io/en/latest/tutorials/Making_Maps_with_Cartopy.html#fixing-the-tripole).

By following the instructions in that tutorial and the [Spatial selection with tripolar ACCESS-OM2 grid notebook](https://cosima-recipes.readthedocs.io/en/latest/documented_examples/Spatial_selection.html#gallery-documented-examples-spatial-selection-ipynb), the coordinates can be fixed by replacing them with coordinates from the ice grid input file. This requires some work: the dimensions must be renamed to match, and the coordinates converted from radians to degrees.

In [None]:
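# Open the CICE grid file, renaming its dimensions to match the ice output
grid_file = '/g/data/ik11/inputs/access-om2/input_eee21b65/cice_025deg/grid.nc'
ice_grid = xr.open_dataset(grid_file).rename({'ny': 'nj', 'nx': 'ni'})
ice_grid = ice_grid.pint.quantify()

# Replace the coordinates, converting from radians to degrees
ice_air_temp = ice_air_temp.assign_coords({'TLON': ice_grid.tlon.pint.to('degrees_E'),
                                           'TLAT': ice_grid.tlat.pint.to('degrees_N')})
ice_air_temp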
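
Two of the steps above can be checked with plain NumPy before touching the real data: detecting non-finite coordinate values, and converting radians to degrees. The arrays are invented for illustration, and `np.rad2deg` stands in for the `pint` conversion used in the cell above.

```python
import numpy as np

# Coordinate array with a missing value, the situation that breaks pcolormesh
tlon_from_output = np.array([[10.0, np.nan], [10.0, 20.0]])
print(np.isnan(tlon_from_output).any())

# Grid-file style coordinate stored in radians, with no missing values
tlon_radians = np.array([[0.0, np.pi / 2], [np.pi, 3 * np.pi / 2]])
tlon_degrees = np.rad2deg(tlon_radians)
print(np.isfinite(tlon_degrees).all())
print(tlon_degrees)
```

A check like `np.isfinite(...).all()` on the replacement coordinates is a cheap way to confirm the plot will not fail for the same reason again.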

Finally, the generic plotting routine works:

In [None]:
plot_global_temp_in_degrees_celcius(ice_air_temp)

One more step is to modify the original routine to take the colour range (`vmin` and `vmax`) as arguments, so it is more generally useful:

In [None]:
def plot_global_temp_in_degrees_celcius(da, vmin=-2, vmax=30):
    """Take the time mean of da and plot a global temperature field in a Robinson projection.

    Input DataArray (da) should be a 3D array of latitude, longitude and time.
    """
    # 'degC' is pint's unambiguous name for degrees Celsius
    da = da.pint.quantify().pint.to('degC')
    da_time_mean = da.cf.mean('time')
    
    plt.figure(figsize=(12, 6))
    ax = plt.axes(projection=ccrs.Robinson())

    da_time_mean.cf.plot(ax=ax,
                         x='longitude', y='latitude', 
                         transform=ccrs.PlateCarree(),
                         vmin=vmin, vmax=vmax, extend='both',
                         cmap=cm.cm.thermal)

    ax.coastlines();

By specifying default values for the arguments the function is completely backwards compatible: no functionality is lost, but the ice air temperature can now be plotted with a range that better suits the data:

In [None]:
plot_global_temp_in_degrees_celcius(ice_air_temp, vmin=-30)

## Conclusion

Model specific code to take a time mean and plot the data was converted to a model agnostic function with no loss of functionality.

The same function can be used with a wide range of CF compliant data.

In some cases the input data will need to be modified if it is not sufficiently compliant, or is non-conforming in some way, such as the ice data above with `NaN`s in the coordinates. It is better to modify the data to be more compliant and higher quality, and use generic tools, than to have multiple code versions to account for the vagaries or problems of individual datasets.

Ideally those improvements can be incorporated into future versions of the data at source, in post-processing, or in some utility functions for transforming a class of non-conforming data.
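
As one possible shape for such a utility function, the sketch below relabels temperature data whose metadata claims degrees Celsius but whose values look like kelvin, as was the case for the MOM5 `sst` above. The function name, the set of unit spellings, and the threshold of 100 are all assumptions made for illustration, not an established convention.

```python
import numpy as np
import xarray as xr

CELSIUS_SPELLINGS = {"deg_c", "degc", "degrees_c", "celsius"}  # assumed, not exhaustive

def relabel_kelvin_if_mislabelled(da, threshold=100.0):
    """Return da with units 'K' if it is labelled Celsius but looks like kelvin.

    Hypothetical helper: the threshold is an arbitrary sanity limit, and
    computing the minimum loads the data if it is lazy.
    """
    labelled_celsius = str(da.attrs.get("units", "")).lower() in CELSIUS_SPELLINGS
    if labelled_celsius and float(da.min()) > threshold:
        # assign_attrs returns a new object, leaving the input untouched
        return da.assign_attrs(units="K")
    return da

# Made-up values: one array in kelvin with the wrong label, one genuinely in Celsius
mislabelled = xr.DataArray(np.array([271.5, 290.0, 301.2]), attrs={"units": "deg_C"})
correct = xr.DataArray(np.array([-1.6, 16.9, 28.1]), attrs={"units": "deg_C"})

print(relabel_kelvin_if_mislabelled(mislabelled).attrs["units"])
print(relabel_kelvin_if_mislabelled(correct).attrs["units"])
```

A helper like this keeps the dataset-specific correction in one place, so the analysis and plotting code can stay generic.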