# How To Use The COSIMA CookBook

This notebook is designed to help new users get to grips with the COSIMA Cookbook. It assumes that:
 * You have cloned the cosima-cookbook git repository to a location that can see the COSIMA storage space on [NCI](http://cosima-cookbook.readthedocs.io/en/latest/nci.org.au) (/g/data3/hh5/tmp/cosima). We recommend the [Virtual Desktop Infrastructure (VDI)](http://nci.org.au/services/vdi/).
 * You have access to a python3 distribution with the required packages.
 * You have installed the cosima-cookbook package (via `pip install --user -e`).
 * You can fire up a Jupyter notebook!

**Before starting,** load in some libraries that you are likely to need:

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import xarray as xr
import numpy as np
import IPython.display

In addition, you **always** need to load the cosima_cookbook module. This provides a bunch of functions that you will use:

In [2]:
import cosima_cookbook as cc

netcdf_index loaded.


## 1. The CookBook Philosophy
* framework
* consistent methods
* reproducibility
* a database of expts
* efficient code

### 1.1 A database of experiments
The cookbook assumes that we store data organised by:
* configuration, 
* experiment,
* output???
* netcdf files

In [6]:
cc.build_index()

Files found but not yet indexed: 0
No new .nc files found.


True

### 1.2 Inbuilt Database Functions

We have constructed a few functions to help you operate the cookbook and to access the datasets. These functions all sit in the `cosima_cookbook` directory. For example, `netcdf_index.py` contains the above `build_index` function as well as a series of functions that are built to query the SQL database.
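To make the idea of querying an SQL index concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `ncfiles`, its columns, the helper names `list_configurations` and `list_experiments`, and the `example_expt` row are all assumptions made for illustration; the cookbook's actual schema and queries may differ.

```python
import sqlite3

# Hypothetical schema: one row per indexed netcdf file.
# The real cookbook database may use different table and column names.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ncfiles (configuration TEXT, experiment TEXT, ncfile TEXT)"
)

# A few illustrative rows (the 'example_expt' name is made up).
rows = [
    ("access-om2", "1deg_jra55_ryf8485_spinup1", "ocean.nc"),
    ("access-om2", "1deg_jra55_ryf8485_spinup1", "ocean_scalar.nc"),
    ("access-om2", "1deg_jra55_ryf8485_spinup2", "ocean.nc"),
    ("mom025", "example_expt", "ocean.nc"),
]
conn.executemany("INSERT INTO ncfiles VALUES (?, ?, ?)", rows)


def list_configurations(conn):
    """Return the distinct configuration names, sorted alphabetically."""
    cur = conn.execute(
        "SELECT DISTINCT configuration FROM ncfiles ORDER BY configuration"
    )
    return [row[0] for row in cur]


def list_experiments(conn, configuration):
    """Return the distinct experiments catalogued for one configuration."""
    cur = conn.execute(
        "SELECT DISTINCT experiment FROM ncfiles "
        "WHERE configuration = ? ORDER BY experiment",
        (configuration,),
    )
    return [row[0] for row in cur]


configurations = list_configurations(conn)
experiments = list_experiments(conn, "access-om2")
```

The point of the sketch is only that each query function reduces to a `SELECT DISTINCT` over an index of files, which is why the lookups are quick even when the underlying data is large.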

`get_configurations` returns a list of all model configurations that are saved in the database.

In [3]:
cc.get_configurations()

['mom01v5',
 'APE-MOM',
 'access-om2',
 'mom025',
 'mom-sis',
 'access-om2-025',
 'access-om2-01',
 'woa13',
 'kds75_wp2']

`get_experiments` lists all of the experiments that are catalogued for a given configuration. This function needs one of the above configurations as an input.

In [10]:
cc.get_experiments('access-om2')

['1deg_core_nyf_spinup_A',
 '1deg_jra55_ryf0304_gfdl50',
 '1deg_jra55_ryf0304_kds100_RCP45',
 '1deg_jra55_ryf0304_kds100_s6_RCP45',
 '1deg_jra55_ryf0304_kds100_s6_mushy',
 '1deg_jra55_ryf0304_kds100_sss12',
 '1deg_jra55_ryf0304_kds50_RCP45',
 '1deg_jra55_ryf0304_kds50_s13p8_RCP45',
 '1deg_jra55_ryf0304_kds50_s13p8_mushy',
 '1deg_jra55_ryf0304_kds50_sss12',
 '1deg_jra55_ryf0304_kds75_sss12',
 '1deg_jra55_ryf03_kds50',
 '1deg_jra55_ryf04_kds50',
 '1deg_jra55_ryf8485_gfdl50',
 '1deg_jra55_ryf8485_kds100_RCP45',
 '1deg_jra55_ryf8485_kds100_s6_RCP45',
 '1deg_jra55_ryf8485_kds100_s6_mushy',
 '1deg_jra55_ryf8485_kds100_sss12',
 '1deg_jra55_ryf8485_kds50_RCP45',
 '1deg_jra55_ryf8485_kds50_s13p8_RCP45',
 '1deg_jra55_ryf8485_kds50_s13p8_mushy',
 '1deg_jra55_ryf8485_kds50_sss12',
 '1deg_jra55_ryf8485_kds75_sss12',
 '1deg_jra55_ryf8485_spinup1',
 '1deg_jra55_ryf8485_spinup2',
 '1deg_jra55_ryf84_kds50',
 '1deg_jra55_ryf85_kds50',
 '1deg_jra55_ryf9091_gfdl50',
 '1deg_jra55_ryf9091_gfdl50_mdsr',
 '1d

`get_ncfiles` provides a list of all the netcdf filenames saved for a given experiment. Note that each of these filenames may be present in some or all of the output directories.

In [7]:
cc.get_ncfiles('025deg_jra55v13_ryf8485_spinup_A')

['rmp_jra55_runoff_cice_conserve.nc',
 'rmp_jrar_to_cict_CONSERV.nc',
 'rmp_jra55_cice_smooth.nc',
 'i2a.nc',
 'a2i.nc',
 'rmp_jra55_cice_conserve.nc',
 'ocean.nc',
 'ocean_scalar.nc',
 'o2i.nc',
 'ocean_grid.nc',
 'ocean_month.nc',
 'iceh.\\d+-\\d+.nc']
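The last entry in the listing above looks like a regular expression rather than a literal filename, presumably standing in for a family of files whose names differ only by a numeric suffix. The following sketch shows how such a pattern would match using Python's `re` module. The filenames in it are hypothetical, and how the cookbook itself applies the pattern is not shown in this notebook.

```python
import re

# The last entry from the listing above, written as a raw string.
# Note that the unescaped "." matches any single character.
pattern = re.compile(r"iceh.\d+-\d+.nc")

# Hypothetical filenames, made up for illustration.
filenames = ["iceh.1900-01.nc", "iceh.1900-02.nc", "ocean.nc", "iceh.nc"]

# Keep only the names that match the whole pattern.
matching = [name for name in filenames if pattern.fullmatch(name)]
print(matching)
```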

And finally, `get_variables` provides a list of all the variables available in a specific netcdf file. This function requires both the experiment and the filename to be provided.

In [8]:
cc.get_variables('025deg_jra55v13_ryf8485_spinup_A','ocean.nc')

['xt_ocean',
 'yt_ocean',
 'st_ocean',
 'st_edges_ocean',
 'time',
 'nv',
 'xu_ocean',
 'yu_ocean',
 'sw_ocean',
 'sw_edges_ocean',
 'grid_xu_ocean',
 'grid_yt_ocean',
 'potrho',
 'potrho_edges',
 'grid_xt_ocean',
 'grid_yu_ocean',
 'neutral',
 'neutralrho_edges',
 'temp',
 'salt',
 'age_global',
 'u',
 'v',
 'wt',
 'dzt',
 'pot_rho_0',
 'pot_rho_2',
 'tx_trans',
 'ty_trans',
 'tx_trans_rho',
 'ty_trans_rho',
 'ty_trans_nrho_submeso',
 'temp_xflux_adv',
 'temp_yflux_adv',
 'diff_cbt_t',
 'average_T1',
 'average_T2',
 'average_DT',
 'time_bounds']

### 1.3 Loading data from a netcdf file

Python has many ways of reading in data from a netcdf file ... so we thought we would add another way. This is achieved in the `get_nc_variable` function, which is the most commonly used function in the cookbook. This function queries the database to find a variable in a specific file from a specific experiment, and loads some or all of that variable. We will now take a little while to get to know this function. In its simplest form, you need just three arguments: expt, ncfile and variable:

In [9]:
cc.get_nc_variable('025deg_jra55v13_ryf8485_spinup_A','ocean_scalar.nc','temp_global_ave')



<xarray.DataArray 'temp_global_ave' (time: 2328, scalar_axis: 1)>
dask.array<shape=(2328, 1), dtype=float64, chunksize=(1, 1)>
Coordinates:
  * scalar_axis  (scalar_axis) float64 0.0
  * time         (time) datetime64[ns] 1900-01-16T12:00:00 1900-02-15 ...
Attributes:
    long_name:      Global mean temp in liquid seawater
    units:          deg_C
    valid_range:    [  -10.  1000.]
    cell_methods:   time: mean
    time_avg_info:  average_T1,average_T2,average_DT
    standard_name:  sea_water_potential_temperature

You may like to note a few things about this function:
1. The data is returned as an xarray DataArray, which includes the coordinate and attribute information from the netcdf file (more on xarray later).
2. The time variable does not start at zero (as it does in the netcdf file). We generally shift it into a date range that allows us to use `pandas` time series and date functionality.
3. By default, we load the whole dataset, but we could load just the last `n` netcdf files (useful for testing).
4. Other customisable options include setting the variable chunking and incorporating a function to operate on the data.

You can see all the available options using the inbuilt help function, which brings up the function documentation.

In [11]:
help(cc.get_nc_variable)

Help on function get_nc_variable in module cosima_cookbook.netcdf_index:

get_nc_variable(expt, ncfile, variable, chunks={}, n=None, op=None, time_units='days since 1900-01-01', use_bag=False)
    For a given experiment, concatenate together
    variable over all time given a basename ncfile.
    
    Since some NetCDF4 files have trailing integers (e.g. ocean_123_456.nc)
    ncfile can use glob syntax http://www.sqlitetutorial.net/sqlite-glob/
    and regular expressions also work in some limited cases.
    
    By default, xarray is set to use the same chunking pattern that is
    stored in the ncfile. This can be overwritten by passing in a dictionary
    chunks or setting chunks=None for no chunking (load directly into memory).
    
    n > 0 means only use the last n ncfiles files. Useful for testing.
    
    op() is function to apply to each variable before concatenating.
    
    time_units (e.g. "days since 1600-01-01") can be used to override
    the original time.units.  If 
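The `n`, `op` and `time_units` options described in the docstring above can be illustrated on plain Python lists. This is only a sketch of the documented behaviour, not the cookbook's implementation: the helper names `concatenate_last_n` and `decode_time` and the sample data are assumptions. The first two offsets were chosen so that they decode to the first two time stamps shown in the `temp_global_ave` output earlier.

```python
from datetime import datetime, timedelta


def decode_time(offsets_in_days, reference=datetime(1900, 1, 1)):
    """Convert numeric 'days since <reference>' offsets to datetimes."""
    return [reference + timedelta(days=d) for d in offsets_in_days]


def concatenate_last_n(per_file_values, n=None, op=None):
    """Mimic the documented meaning of n and op on plain lists.

    per_file_values: one list of values per file, in time order.
    n: if given and > 0, keep only the last n files.
    op: if given, applied to each file's values before concatenating.
    """
    selected = per_file_values[-n:] if n else per_file_values
    result = []
    for values in selected:
        result.extend(op(values) if op is not None else values)
    return result


# Hypothetical data: three "files", each holding two time offsets in days.
files = [[15.5, 45.0], [74.5, 105.0], [135.5, 166.0]]

everything = concatenate_last_n(files)
last_two = concatenate_last_n(files, n=2)
halved = concatenate_last_n(files, n=1, op=lambda v: [x / 2 for x in v])
times = decode_time(files[0])
```

Going by the signature above, the corresponding options in the cookbook itself would be passed as keyword arguments to `cc.get_nc_variable`, for example `n=2` or an `op` callable.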

### 1.4 Exercises
OK, this is a tutorial, so now you have to do some work. Your tasks are to:
* Find and load SSH (sea surface height) from an experiment (perhaps a 1° configuration would be best).

* Just load the last 10 files from an experiment (any variable you like).

* Load potential temperature from an experiment (again, 1° would be quickest). Can you chunk the data differently from the default?