# How to use the COSIMA Cookbook

This notebook is designed to help new users get to grips with the COSIMA Cookbook.

It assumes that:
 * You have access to the COSIMA Cookbook. We recommend using the latest version, available through the `conda/analysis3-unstable` module on NCI.
 * You can fire up a Jupyter notebook!

**Before starting,** load in some standard libraries that you are likely to need:

In [1]:
%matplotlib inline
%config InlineBackend.figure_format='retina'

import pandas as pd
import matplotlib.pyplot as plt
import cmocean as cm
import xarray as xr
import numpy as np
import IPython.display

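In addition, you **always** need to load the `cosima_cookbook` module, which provides the functions we use; later cells refer to it as `cc`. The next two cells connect to a running Dask scheduler and open the `access_nri` intake catalog: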

In [2]:
from dask.distributed import Client
client = Client("tcp://10.6.58.34:8786")
client

Connection method: Direct
Dashboard: /proxy/8787/status

Comm: tcp://10.6.58.34:8786
Workers: 1
Dashboard: /proxy/8787/status
Total threads: 12
Started: 3 hours ago
Total memory: 46.00 GiB

Comm: tcp://10.6.58.34:40723
Total threads: 12
Dashboard: /proxy/41799/status
Memory: 46.00 GiB
Nanny: tcp://10.6.58.34:36889
Local directory: /scratch/iq82/mp7041/dasktmp/dask-scratch-space/worker-poxqoog5
Tasks executing:
Tasks in memory:
Tasks ready:
Tasks in flight:
CPU usage: 2.0%
Last seen: Just now
Memory usage: 2.04 GiB
Spilled bytes: 0 B
Read bytes: 10.56 kiB
Write bytes: 10.36 kiB


In [3]:
import intake
catalog = intake.cat.access_nri

## 1. The Cookbook Philosophy
The COSIMA Cookbook is a framework for analysing ocean-sea ice model output.
It is designed to:

* Provide examples of commonly used diagnostics;
* Write efficient, well-documented, openly accessible code;
* Encourage community input to the code;
* Ensure diagnostic results are reproducible;
* Process diagnostics directly from the model output, minimising creation of intermediate files;
* Find methods to deal with the memory limitations of analysing high-resolution model output.


### 1.1 A database of experiments
The COSIMA Cookbook relies on a database of experiments in order to load model output. This database effectively holds metadata for each experiment, as well as variable names, data ranges and so on. 

**NCI Projects**: Access to COSIMA ocean-sea ice model output requires that you are a member of NCI projects `hh5` and `ik11` and, potentially, also of `cj50` and `jk72`.

With that sorted out, there are three different ways for you to access the database:

1. Use the default database, which is periodically refreshed automatically. This database sits in `/g/data/ik11/databases/cosima_master.db` and should be readable by all users. It includes all experiments stored in the COSIMA data directories on NCI under the projects mentioned above. The examples in this tutorial use this database.

2. Use some of the databases sitting in `/g/data/ik11/databases/`. Note that some may *not* be up to date.

3. Make your own database, which is stored in your own path and includes only the experiments you are interested in. Please refer to the `Make_Your_Own_Database` tutorial for instructions on how to create this database.

To access the default database, you need to start a database session each time you fire up a notebook; the `cc.querying` calls later in this tutorial refer to this session as `session`.

### 1.2 Inbuilt Database Functions

We have constructed a few functions to help you operate the cookbook and to access the datasets. These functions all sit in the `cosima_cookbook` directory. The following functions query the data available in the database, without loading the data itself.

They return `pandas` dataframes, which by default are truncated to show only the first and last 5 rows. To see more results, we need to set an option in `pandas` itself:

```python
import pandas as pd
pd.set_option("display.max_rows", n)
```

Here `n` is the maximum number of rows to display without truncation; if there are more rows, the result will again be truncated. Pass `None` to never truncate (but some dataframes can be very big!).
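If you would rather not change the display setting for the whole notebook, `pandas` also provides `pd.option_context`, which applies an option only inside a `with` block. The following is a minimal sketch using a made-up 100-row dataframe; `demo_df` and its `var_...` names are hypothetical stand-ins, not the result of a cookbook query:

```python
import pandas as pd

# Hypothetical stand-in for a query result: 100 rows of made-up variable names.
demo_df = pd.DataFrame({"name": [f"var_{i:03d}" for i in range(100)]})

# With at most 60 rows allowed (the pandas default), a 100-row frame is truncated.
with pd.option_context("display.max_rows", 60):
    truncated_repr = repr(demo_df)

# With None, nothing is truncated; the previous setting is restored on exit.
with pd.option_context("display.max_rows", None):
    full_repr = repr(demo_df)

print(len(truncated_repr.splitlines()), "lines when truncated")
print(len(full_repr.splitlines()), "lines when shown in full")
```

This keeps very long listings confined to the cells where you actually want them.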

`get_experiments` lists all of the experiments that are catalogued in the database.

In [5]:
sorted(set(catalog.df['name']))

['01deg_jra55v13_ryf9091',
 '01deg_jra55v140_iaf',
 '01deg_jra55v140_iaf_cycle2',
 '01deg_jra55v140_iaf_cycle3',
 '01deg_jra55v140_iaf_cycle4',
 '01deg_jra55v140_iaf_cycle4_jra55v150_extension',
 '01deg_jra55v150_iaf_cycle1',
 '025deg_jra55_iaf_omip2_cycle1',
 '025deg_jra55_iaf_omip2_cycle2',
 '025deg_jra55_iaf_omip2_cycle3',
 '025deg_jra55_iaf_omip2_cycle4',
 '025deg_jra55_iaf_omip2_cycle5',
 '025deg_jra55_iaf_omip2_cycle6',
 '025deg_jra55_ryf9091_gadi',
 '1deg_jra55_iaf_omip2_cycle1',
 '1deg_jra55_iaf_omip2_cycle2',
 '1deg_jra55_iaf_omip2_cycle3',
 '1deg_jra55_iaf_omip2_cycle4',
 '1deg_jra55_iaf_omip2_cycle5',
 '1deg_jra55_iaf_omip2_cycle6',
 '1deg_jra55_iaf_omip2spunup_cycle1',
 '1deg_jra55_iaf_omip2spunup_cycle10',
 '1deg_jra55_iaf_omip2spunup_cycle11',
 '1deg_jra55_iaf_omip2spunup_cycle12',
 '1deg_jra55_iaf_omip2spunup_cycle13',
 '1deg_jra55_iaf_omip2spunup_cycle14',
 '1deg_jra55_iaf_omip2spunup_cycle15',
 '1deg_jra55_iaf_omip2spunup_cycle16',
 '1deg_jra55_iaf_omip2spunup_cycle17'

Internally, an experiment is a set of netCDF4 files as shown in the above table.

`get_ncfiles` provides a list of all the netCDF filenames saved for a given experiment, along with the time stamp for when each file was added to the cookbook database. Note that each of these filenames is present in some or all of the output directories -- **but the cookbook philosophy is that you don't need to know about the directories in which these files are stored**. To see the relevant files:

More usefully, `get_variables` provides a list of all the variables available in a specific experiment. 

In [None]:
catalog.search

In [7]:

dd = catalog['025deg_jra55v13_iaf_gmredi6'].search(frequency='1mon')
#cc.querying.get_variables(session, experiment='025deg_jra55v13_iaf_gmredi6', )

KeyError: "key='025deg_jra55v13_iaf_gmredi6' not found in catalog. You can access the list of valid source keys via the .keys() method."

Since this is a pretty big list, we can search the dataframe for variable names which contain a specific string.

In [None]:
vars_025deg = cc.querying.get_variables(session, experiment='025deg_jra55v13_iaf_gmredi6')
vars_025deg[vars_025deg['name'].str.lower().str.contains('temp')]
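The filtering step in the cell above is plain `pandas` string matching, so it can be tried without any model output. The sketch below applies it to a small invented table; `demo_vars` and every name in it are illustrative assumptions, not entries from the database:

```python
import pandas as pd

# Hypothetical table of variables; names and descriptions are illustrative only.
demo_vars = pd.DataFrame(
    {
        "name": ["temp", "salt", "surface_temp", "Temp_flux", "u"],
        "long_name": [
            "Potential temperature",
            "Practical salinity",
            "Surface temperature",
            "Temperature flux",
            "i-current",
        ],
    }
)

# Lower-case the names first so the match ignores case,
# then keep only the rows whose name contains the substring.
mask = demo_vars["name"].str.lower().str.contains("temp")
matches = demo_vars[mask]

# Equivalent one-step form using the `case` argument of `str.contains`.
same_matches = demo_vars[demo_vars["name"].str.contains("temp", case=False)]

print(matches)
```

Both forms select the same rows; the second simply avoids the intermediate lower-casing.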

Omitting the `frequency` would give variables at all temporal frequencies. To determine what frequencies are in a given experiment, we can use `get_frequencies`. Leaving off the `experiment` gives all possible frequencies.

In [None]:
cc.querying.get_frequencies(session, experiment='025deg_jra55v13_iaf_gmredi6')

### 1.3 Loading data from a netcdf file

Python has many ways of reading in data from a netcdf file ... so we thought we would add another way. This is done via the `querying.getvar()` function, which is the most commonly used function in the cookbook. This function queries the database to find a specific variable, and loads some or all of that file.

Let's now take a little while to get to know this function. In its simplest form, you need just three arguments: `experiment`, `variable`, and the database `session`.

You can see all the available options using the inbuilt help function, which brings up the function's documentation.

In [None]:
help(cc.querying.getvar)

You may like to note a few things about this function:
1. The data is returned as an xarray DataArray, which includes the coordinate and attribute information from the netcdf file (more on xarray later). 
2. The variable time does not start at zero - and if you don't like it you can introduce an offset to alter the time axis.
3. By default, we load the whole dataset, but we can load a subset of the times (see below).
4. Other customisable options include setting the variable chunking and incorporating a function to operate on the data.

In [None]:
experiment = '025deg_jra55v13_iaf_gmredi6'
variable = 'temp_global_ave'

cat_subset = catalog[experiment]
var_search = cat_subset.search(variable=variable)
darray = var_search.to_dask()
darray = darray[variable]
darray

You can see that this operation loads the globally averaged potential temperature from the model output. The time axis runs from 1900 to 2198. For some variables (particularly 3D variables that might use a lot of memory) you may prefer to restrict yourself to a smaller time window:

In [None]:
cat_subset = catalog[experiment]
var_search = cat_subset.search(variable=variable)
darray = var_search.to_dask()
darray = darray[variable]
darray = darray.sel(time=slice('2000-01-01', '2050-12-31'))
darray

You will see that the time boundaries are not exact here. `cc.querying.getvar` loads all files that include any dates within the specified range.  You can use `.sel(time = ...)` to refine this selection if required (see below).
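To see exactly what a `.sel(time=slice(...))` refinement keeps, it can help to try it on a small synthetic series first. The sketch below assumes a standard `datetime64` time axis built with `pandas`; model output may use a different calendar, and the array `demo` is invented for illustration:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic monthly series standing in for a model variable (values are arbitrary).
time = pd.date_range("1998-01-01", periods=72, freq="MS")  # Jan 1998 to Dec 2003
demo = xr.DataArray(np.arange(72.0), coords={"time": time}, dims="time", name="demo")

# Label-based slicing on the time coordinate includes both end points.
subset = demo.sel(time=slice("2000-01-01", "2001-12-31"))

print(subset.time.values[0], subset.time.values[-1], subset.sizes["time"])
```

Only time stamps that fall inside the requested window survive, regardless of how the original files were split.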

### 1.4 Exercises
OK, this is a tutorial, so now you have to do some work. Your tasks are to:
* Find and load SSH from an experiment (perhaps choose a 1° configuration to start).

* Just load the last 10 files from an experiment (any variable you like).

* Load potential temperature from an experiment (again, 1° would be quickest). Can you chunk the data differently from the default?

## 2. How to manipulate and plot variables with xarray
We use the Python package `xarray` (which is built on `dask`, `pandas`, `matplotlib` and `numpy`) for many of our diagnostics. `xarray` has a lot of nice features, some of which we will try to demonstrate for you.

### 2.1 Plotting
`xarray`'s `.plot()` method does its best to figure out what you are trying to plot, and plots it for you. Let's start by loading a 1-dimensional variable and plotting.

In [None]:
experiment = '025deg_jra55v13_iaf_gmredi6'
variable = 'temp_global_ave'
cat_subset = catalog[experiment]
var_search = cat_subset.search(variable=variable)
darray = var_search.to_dask()
darray = darray[variable]
darray.plot();

In [None]:
darray

You should see that `xarray` has figured out that this data is a timeseries, that the x-axis is representing time and that the y-axis is `temp_global_ave`. You can always modify aspects of your plot if you are unhappy with the default xarray behaviour:

In [None]:
darray.plot()
plt.xlabel('Year')
plt.ylabel('Temperature (°C)')
plt.title('Globally Averaged Temperature');

Because `xarray` knows about dimensions, it has plotting routines which can figure out what it should plot. By way of example, let's load a single time slice of `surface_temp` and see how `.plot()` handles it: 

In [None]:
experiment = '025deg_jra55v13_iaf_gmredi6'
variable = 'surface_temp'
cat_subset = catalog[experiment]
var_search = cat_subset.search(variable=variable)
var_search = var_search.search(path=var_search.df['path'][0])
darray = var_search.to_dask()
darray = darray[variable]
darray.mean('time').plot();

Again, you can customise this plot as you see fit:

In [None]:
temp_C = darray - 273.15 # convert from Kelvin to Celsius
temp_C.mean('time').plot.contourf(levels=np.arange(-2, 32, 2), cmap=cm.cm.thermal);
plt.ylabel('latitude')
plt.xlabel('longitude');

### 2.2 Slicing and dicing

There are two different ways of subselecting from a DataArray: `isel` and `sel`. The first of these is probably what you are used to -- you specify the value of the index of the array. In the second case you specify the value of the coordinate you want to select. These two methods are demonstrated in the following example:

In [None]:
cat_subset = catalog['025deg_jra55v13_iaf_gmredi6']
var_search = cat_subset.search(variable='pot_rho_2')
darray = var_search.to_dask()
darray = darray['pot_rho_2']
density = darray.isel(time = 200).sel(st_ocean = 1000, method='nearest')
density.plot();

In the above example, a 300-year dataset is loaded. We then use `isel` to select the 201st year (time index of 200) and use `sel` to select the $z$ level that is about 1000m deep. The `sel` method is very flexible, allowing us to use similar code in differing model resolutions or grids. In addition, both methods allow you to slice a range of values:

In [None]:
cat_subset = catalog['1deg_jra55v13_iaf_spinup1_B1']
var_search = cat_subset.search(variable='v')
darray = var_search.to_dask()
darray = darray['v']
v = darray.isel(time = 100).sel(st_ocean=50, method='nearest').sel({'xu_ocean': slice(-230, -180),
                                                                    'yu_ocean': slice(-50, -20)})
v.plot();

Here we have taken meridional velocity, and sliced out a small region of interest for our plot.
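The difference between `isel` and `sel` is easiest to check on an array small enough to inspect by eye. The following sketch uses a synthetic (depth, latitude) field; the coordinate names mimic the model's, but the depth levels and values are made up:

```python
import numpy as np
import xarray as xr

# Synthetic (depth, latitude) field with invented depth levels and random values.
depth = np.array([5.0, 50.0, 200.0, 950.0, 1100.0, 3000.0])
lat = np.arange(-60.0, 61.0, 10.0)
field = xr.DataArray(
    np.random.default_rng(0).normal(size=(depth.size, lat.size)),
    coords={"st_ocean": depth, "yt_ocean": lat},
    dims=("st_ocean", "yt_ocean"),
)

# isel: positional index, here the 4th depth level whatever its depth is.
by_index = field.isel(st_ocean=3)

# sel: coordinate value, here the level closest to 1000 m.
by_value = field.sel(st_ocean=1000, method="nearest")

# sel with a slice: a range of coordinate values, end points included.
region = field.sel(yt_ocean=slice(-50, -20))

print(float(by_value.st_ocean), region.yt_ocean.values)
```

On this grid both selections land on the 950 m level, but only the `sel` version would still pick a level near 1000 m if the vertical grid changed.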

### 2.3 Averaging along dimensions

We often perform operations such as averaging on dataarrays. Again, knowledge of the coordinates can be a big help here, as you can instruct the `mean()` method to operate along given coordinates. The case below takes a temporal and zonal average of potential density.

#### IMPORTANT
To be precise, it is actually a mean in the $i$-grid direction, which is only zonal outside the tripolar region in the Arctic, i.e., *south of 65N* in the ACCESS-OM2 models. To compute the zonal mean correctly one needs to be a bit more careful; see the [`DocumentedExamples/True_Zonal_Mean.ipynb`](https://cosima-recipes.readthedocs.io/en/latest/documented_examples/True_Zonal_Mean.html#gallery-documented-examples-true-zonal-mean-ipynb) example.

In [None]:
cat_subset = catalog['1deg_jra55v13_iaf_spinup1_B1']
var_search = cat_subset.search(variable='pot_rho_2')
darray = var_search.to_dask()
darray = darray['pot_rho_2']
darray.mean(['time', 'xt_ocean']).plot(cmap=cm.cm.haline)
plt.gca().invert_yaxis();
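To confirm which dimensions remain after such an average, the same call can be run on a tiny synthetic array. In this sketch the dimension names mirror the cell above, while the values and depth levels are arbitrary:

```python
import numpy as np
import xarray as xr

# Synthetic (time, depth, longitude) array; values are just a running count.
demo = xr.DataArray(
    np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4),
    dims=("time", "st_ocean", "xt_ocean"),
    coords={"st_ocean": [5.0, 50.0, 500.0]},
)

# Averaging over two named dimensions leaves only the remaining one.
profile = demo.mean(["time", "xt_ocean"])

# The same reduction in plain numpy needs the axis positions instead of names.
profile_np = demo.values.mean(axis=(0, 2))

print(profile.dims, profile.values)
```

Naming the dimensions means the call does not have to change if the array's axis order does.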

### 2.4 Resampling

`xarray` uses `datetime` conventions to allow for operations such as resampling in time. This resampling is simple and powerful. Here is an example of re-plotting the figure from 2.1 with annual averaging:

In [None]:
cat_subset = catalog['025deg_jra55v13_iaf_gmredi6']
var_search = cat_subset.search(variable='temp_global_ave')
darray = var_search.to_dask()
darray = darray['temp_global_ave']
meandata = darray.resample(time='A').mean(dim='time')
meandata.plot();
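Annual averaging can also be checked on synthetic data where the answer is known in advance. The sketch below makes two assumptions that differ from the cell above: it uses the `'YS'` alias, which labels each bin by the start of the year, and it builds a `datetime64` time axis with `pandas` rather than reading model output:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Three years of synthetic monthly values: 1.0 in year one, 2.0 in year two, 3.0 in year three.
time = pd.date_range("2001-01-01", periods=36, freq="MS")
monthly = xr.DataArray(np.repeat([1.0, 2.0, 3.0], 12), coords={"time": time}, dims="time")

# Resample into calendar-year bins labelled by the start of each year, then average.
annual = monthly.resample(time="YS").mean()

# Grouping by calendar year gives the same numbers, indexed by an integer `year`.
annual_by_year = monthly.groupby("time.year").mean()

print(annual.values, annual_by_year.year.values)
```

`resample` keeps a time axis, which suits plotting, whereas `groupby("time.year")` replaces it with a `year` coordinate.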

### 2.5 Exercises

 * Pick an experiment and plot a map of the temperature of the upper 100m of the ocean for one year.

 * Now, take the same experiment and construct a timeseries of spatially averaged (regional or global) upper 700m temperature, resampled every 3 years.

## 3. More Advanced Stuff

### 3.1 Making a map with cartopy
Refer to the [map tutorial](https://cosima-recipes.readthedocs.io/en/latest/tutorials/Making_Maps_with_Cartopy.html#gallery-tutorials-making-maps-with-cartopy-ipynb).

### 3.2 Distributed computing

Many of our scripts use multiple cores for their calculations, usually via a Dask `Client` as in the following cell. The cell connects to the Dask cluster running on your node for distributed computation; the scheduler address is specific to your session.

In [None]:
from dask.distributed import Client

client = Client("tcp://10.6.43.39:8786")
client

The dashboard link should allow you to access information on how your work is distributed between the cores on your local cluster.