# The Dataset and DataArray objects used in the ECCOv4 Python package.

## Objectives

To introduce the two high-level data structures, `Dataset` and `DataArray`, that are used by the `ecco_v4_py` Python package to load and store the ECCO v4 model grid parameters and state estimate variables.

## Introduction 

In this first tutorial we will start slowly, providing detail at every step.  Later tutorials will assume knowledge of some basic operations introduced here.

Let's get started.

## Import external packages and modules

Before using Python libraries we must import them.  Usually this is done at the beginning of every Python program or interactive Jupyter notebook instance, but one can import a library at any point in the code.  Python libraries, called **packages**, contain subroutines and/or define data structures that provide useful functionality.

Before we go further, let's import some packages needed for this tutorial:

In [1]:
# NumPy is the fundamental package for scientific computing with Python. 
# It contains among other things:
#    a powerful N-dimensional array object
#    sophisticated (broadcasting) functions
#    tools for integrating C/C++ and Fortran code
#    useful linear algebra, Fourier transform, and random number capabilities
# http://www.numpy.org/
#
# make all functions from the 'numpy' module available with the prefix 'np'
import numpy as np

# xarray is an open source project and Python package that aims to bring the 
# labeled data power of pandas to the physical sciences, by providing
# N-dimensional variants of the core pandas data structures.
# Our approach adopts the Common Data Model for self-describing scientific
# data in widespread use in the Earth sciences: xarray.Dataset is an in-memory
# representation of a netCDF file.
# http://xarray.pydata.org/en/stable/
#
# make all functions from the 'xarray' module available with the prefix 'xr'
import xarray as xr

### Load the ECCO Version 4 Python package

*ecco_v4_py* is a Python package written specifically for working with the NetCDF output provided in the [nctiles_monthly](ftp://ecco.jpl.nasa.gov/Version4/Release3/nctiles_monthly/) directory of the [ECCO v4 release](ftp://ecco.jpl.nasa.gov/Version4/Release3/).

See the "Getting Started" page in the tutorial for instructions about installing the *ecco_v4_py* module on your machine.

In [2]:
import ecco_v4_py as ecco

The syntax 

```Python
  import XYZ as ABC
```

allows you to access all of the subroutines and/or objects in a package that may have a long, complicated name through a shorter, easier name.

Here, we import `ecco_v4_py` as `ecco` because typing `ecco` is easier than typing `ecco_v4_py` every time.  Also, `ecco_v4_py` is actually composed of multiple Python modules, and by importing just `ecco_v4_py` we can access all of the subroutines in those modules as well.  Fancy.
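
As a small illustration of the aliasing syntax, here is a sketch that uses the standard-library `statistics` module instead of `ecco_v4_py`, so that it runs without any extra installation.  The alias `st` and the list `sample` are arbitrary names chosen for this example.

```python
# Import the standard-library 'statistics' module under the shorter alias 'st'.
import statistics as st

# The alias is simply another name for the same module, so every function
# in 'statistics' is reachable through the prefix 'st'.
sample = [1.0, 2.0, 3.0, 4.0]
print(st.mean(sample))
```

The same pattern applies to `import numpy as np`, `import xarray as xr`, and `import ecco_v4_py as ecco`.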

## Load a single NetCDF grid tile file

To load ECCO v4's NetCDF files we will use the *open_dataset* command from the Python package [xarray](http://xarray.pydata.org/en/stable/index.html). The *open_dataset* routine creates a `Dataset` object and loads the contents of the NetCDF file, including its metadata, into a data structure.    

Let's open the model grid parameter file associated with *tile 3* (the North East Atlantic Ocean).

In [3]:
# Set this to be the directory for your fields
ECCO_dir = '/Users/ifenty/ECCOv4/R3'

# Directory containing the NetCDF grid tile files of the LLC90 grid
grid_dir = ECCO_dir + '/nctiles_grid/'

fname = 'GRID.0003.nc'
ds = xr.open_dataset(grid_dir + fname)

What is *ds*?  It is a `Dataset` object which is defined somewhere deep in the `xarray` package:

In [4]:
type(ds)

xarray.core.dataset.Dataset

## The Dataset object 

According to the xarray documentation, a [Dataset](http://xarray.pydata.org/en/stable/generated/xarray.Dataset.html) is a Python object designed as an "in-memory representation of the data model from the NetCDF file format."

What does that mean?  NetCDF files are *self-describing* in the sense that they [include information about the data they contain](https://www.unidata.ucar.edu/software/netcdf/docs/faq.html).  When `Datasets` are created by loading a NetCDF file they load all of the same data and metadata.

Just as a NetCDF file can contain many variables, a `Dataset` can contain many variables.  These variables are referred to as `Data Variables` in the `xarray` nomenclature.

`Datasets` contain three main classes of fields:

1. **Coordinates**   : indices and labels for all of the coordinates used by all data variables 
2. **Data Variables**: `DataArray` objects which contain numerical arrays, their coordinates, coordinate labels, and variable-specific metadata
3. **Attributes**    : metadata 
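
To make these three classes of fields concrete, the sketch below builds a tiny `Dataset` by hand rather than reading it from a file.  The name `toy`, the array shapes, and the attribute text are hypothetical stand-ins chosen for illustration; they are not the contents of the ECCO grid file.

```python
import numpy as np
import xarray as xr

# A hypothetical miniature "grid tile": 2 vertical levels, 3 x 4 horizontal points.
toy = xr.Dataset(
    data_vars={
        "hFacC": (("i1", "i2", "i3"), np.ones((2, 3, 4))),
        "Depth": (("i2", "i3"), np.full((3, 4), 100.0)),
    },
    coords={
        "i1": np.arange(1.0, 3.0),  # labels 1, 2
        "i2": np.arange(1.0, 4.0),  # labels 1, 2, 3
        "i3": np.arange(1.0, 5.0),  # labels 1, 2, 3, 4
    },
    attrs={"description": "toy grid for illustration only"},
)

print(list(toy.coords))     # names of the coordinates
print(list(toy.data_vars))  # names of the data variables
print(toy.attrs)            # the metadata dictionary
```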

Now that we've loaded `GRID.0003.nc` as the *ds* `Dataset` object let's examine its contents.  

> **Note:** *You can get information about objects and their contents by typing the name of the variable and hitting **enter** in an interactive session of an IDE such as Spyder or by executing the cell of a Jupyter notebook.*

In [5]:
ds

<xarray.Dataset>
Dimensions:  (i1: 50, i2: 90, i3: 90)
Coordinates:
  * i1       (i1) float64 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 ...
  * i2       (i2) float64 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 ...
  * i3       (i3) float64 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 ...
Data variables:
    hFacC    (i1, i2, i3) float64 ...
    hFacW    (i1, i2, i3) float64 ...
    hFacS    (i1, i2, i3) float64 ...
    XC       (i2, i3) float64 ...
    YC       (i2, i3) float64 ...
    XG       (i2, i3) float64 ...
    YG       (i2, i3) float64 ...
    RAC      (i2, i3) float64 ...
    RAZ      (i2, i3) float64 ...
    DXC      (i2, i3) float64 ...
    DYC      (i2, i3) float64 ...
    DXG      (i2, i3) float64 ...
    DYG      (i2, i3) float64 ...
    Depth    (i2, i3) float64 ...
    AngleCS  (i2, i3) float64 ...
    AngleSN  (i2, i3) float64 ...
    RC       (i1) float64 ...
    RF       (i1) float64 ...
    DRC      (i1) float64 ...
    DRF      (i1) float64 ...

### Examining the Dataset object contents

Let's go through *ds* piece by piece, starting from the top.

#### 1. Object type
`<xarray.Dataset>`

The top line tells us what type of object the variable is.  *ds* is an instance of a `Dataset` defined in `xarray`.

#### 2. Dimensions
```Dimensions:  (i1: 50, i2: 90, i3: 90)```

The *Dimensions* list shows all of the different dimensions used by all of the different arrays stored in the NetCDF file (and now loaded in the `Dataset` object).

Arrays may use any combination of these dimensions.  We find 1D, 2D, and 3D arrays in the loaded NetCDF ECCO grid tile file.

The names and lengths of the three dimensions are given by `(i1: 50, i2: 90, i3: 90)`.  There are 50 vertical levels in the ECCO v4 model grid, so `i1` corresponds to the vertical dimension while `i2` and `i3` correspond to the horizontal dimensions.

> **Note:** Each tile in the llc90 grid used by ECCO v4 has 90x90 horizontal grid points.  That's where the 90 in llc**90** comes from!  
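
The dimension names and lengths can also be read programmatically.  The sketch below assumes a synthetic `Dataset` named `toy`, filled with placeholder zeros and ones, whose shapes mimic the grid tile described above; with the real file one would use `ds` instead.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in with the same dimension lengths as one grid tile.
toy = xr.Dataset(
    {
        "hFacC": (("i1", "i2", "i3"), np.ones((50, 90, 90))),  # 3D
        "Depth": (("i2", "i3"), np.zeros((90, 90))),           # 2D
        "RC": (("i1",), np.zeros(50)),                         # 1D
    }
)

# 'sizes' maps each dimension name to its length.
print(dict(toy.sizes))

# Each array uses only the dimensions it needs.
print(toy["hFacC"].shape, toy["Depth"].shape, toy["RC"].shape)
```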

#### 3. Coordinates
```
Coordinates:
  * i1       (i1) float64 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 ...
  * i2       (i2) float64 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 ...
  * i3       (i3) float64 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 ...
``` 
  
**i1**, **i2**, and **i3** are the [coordinates](http://xarray.pydata.org/en/stable/data-structures.html#coordinates), which are stored in a dictionary-like container of arrays whose values *label* each point.  They are used for label-based indexing and alignment.

In this case, the *coordinates* of each dimension consist of labels $[1, 2, ... n]$, where $n$ is the length of the dimension:
  
  + Dim **i1**: `array([  1.,   2., ..., 50.])`
  + Dim **i2** and **i3**: `array([  1.,   2., ..., 90.])`
  
> **Note:** Actually these coordinates are so-called *dimension coordinates*: one-dimensional arrays (marked by an asterisk **"\*"** when printing a dataset or data array) that share their name with a dimension.  Here their values run from $1..n$, where $n$ is the length of the array in a given dimension.
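
Because the labels start at 1 while Python positions start at 0, label-based and position-based indexing pick out different points.  The following sketch uses a hypothetical four-element `DataArray` named `da` (not part of the ECCO file) to show the difference between `.sel` (by label) and `.isel` (by integer position).

```python
import numpy as np
import xarray as xr

# A hypothetical 1D array whose dimension coordinate is labeled 1, 2, 3, 4.
da = xr.DataArray(
    np.array([10.0, 20.0, 30.0, 40.0]),
    dims=("i1",),
    coords={"i1": np.array([1.0, 2.0, 3.0, 4.0])},
    name="toy",
)

by_label = da.sel(i1=1.0)    # the point whose coordinate label is 1.0
by_position = da.isel(i1=1)  # the point at zero-based position 1

print(float(by_label), float(by_position))
```

Here the label `1.0` refers to the first element, while position `1` refers to the second.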

#### 4. Data Variables
```
Data variables:
    hFacC    (i1, i2, i3) float64 ...
    hFacW    (i1, i2, i3) float64 ...
    hFacS    (i1, i2, i3) float64 ...
    ...
    XC       (i2, i3) float64 ...
    YC       (i2, i3) float64 ...
    ...
    RC       (i1) float64 ...
    RF       (i1) float64 ...
```

The *Data Variables* are one or more `xarray.DataArray` objects.  `DataArray` objects are labeled, multi-dimensional arrays that may also contain metadata (attributes).  `DataArray` objects are very important to understand because they are container objects which store the numerical arrays of the state estimate fields.  We'll investigate these objects in more detail after completing our survey of this `Dataset`.

A subset of all *Data variables* in *ds* is shown above to demonstrate that in this NetCDF grid file there are variables with three different dimension combinations: 3D with dimensions (**i1**, **i2**, **i3**), 2D with dimensions (**i2**, **i3**), and 1D with dimensions (**i1**).

The 1D variables have values along the single **i1** (vertical) dimension, the 2D variables have values in the **i2** and **i3** (horizontal) dimensions, and the 3D variables have values in all three dimensions.  All of these particular variables are 64-bit floating point numbers.
  
> **Note:** ECCO v4 NetCDF grid files include a number of grid parameters.  Of these, 3 are 3D, 13 are 2D, and 4 are 1D.  The 3D grid parameters vary with horizontal location and depth,  2D grid parameters only vary with horizontal location and are therefore independent of depth, and the 1D grid parameters only vary with depth and are therefore independent of horizontal location. The meaning of all MITgcm grid parameters can be found in section [2.10.5 of the MITgcm documentation](http://mitgcm.org/sealion/online_documents/node47.html).
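
One way to tally variables by their number of dimensions is to loop over the data variables and count each array's `ndim`.  The sketch below does this on a small synthetic `Dataset` named `toy` (an assumption for illustration, with made-up shapes); applied to the real `ds`, the same loop would be expected to reproduce the 3/13/4 split quoted above.

```python
from collections import Counter

import numpy as np
import xarray as xr

# Synthetic stand-in containing one 3D, two 2D, and one 1D variable.
toy = xr.Dataset(
    {
        "hFacC": (("i1", "i2", "i3"), np.ones((2, 3, 4))),
        "XC": (("i2", "i3"), np.zeros((3, 4))),
        "YC": (("i2", "i3"), np.zeros((3, 4))),
        "RC": (("i1",), np.zeros(2)),
    }
)

# Count how many data variables have 1, 2, or 3 dimensions.
counts = Counter(var.ndim for var in toy.data_vars.values())
for ndim in sorted(counts):
    print(ndim, "D variables:", counts[ndim])
```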

#### 5. Attributes
```
Attributes:
    description:    C-grid parameters (see MITgcm documentation for details)....
    A:              :Format      = native grid (nctiles w. 13 tiles)
    B:              :source      = ECCO consortium (http://ecco-group.org/)
    C:              :institution = JPL/UT/MIT/AER
    D:              :history     = files revision history :
    E:                                 04/20/2017: fill in geometry info for ...
    F:                                 11/06/2016: third release of ECCO v4 (...
    ...
    W:              file created using gcmfaces_IO/write2nctiles.m
    date:           21-Apr-2017
    Conventions:    CF-1.6
    _FillValue:     nan
    missing_value:  nan
```
  
The `attrs` attribute is a Python [dictionary object](https://www.python-course.eu/dictionaries.php) containing metadata or any auxiliary information.
  
Metadata is presented as a set of dictionary `key-value` pairs.  Here the `keys` are *description, A, B,  ... missing_value.* while the `values` are the corresponding text and non-text values.  
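
Since `attrs` behaves like an ordinary dictionary, the usual dictionary operations apply.  The sketch below uses an otherwise empty `Dataset` named `toy` carrying two of the key-value pairs shown above; the key `history` in the last line is simply an example of a key that is absent from this toy object.

```python
import xarray as xr

# An otherwise empty Dataset with two metadata entries copied from the listing above.
toy = xr.Dataset(attrs={"Conventions": "CF-1.6", "date": "21-Apr-2017"})

# Loop over all key-value pairs.
for key, value in toy.attrs.items():
    print(key, "->", value)

# Test for a key, and fall back to a default when a key is absent.
has_date = "date" in toy.attrs
history = toy.attrs.get("history", "not present")
print(has_date, history)
```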
  
To see the metadata `value` associated with the metadata `key` called "Conventions" we can print the value as follows:

In [6]:
print(ds.attrs['Conventions'])

CF-1.6


"CF-1.6" tells us that ECCO NetCDF output conforms to the [**Climate and Forecast Conventions version 1.6**](http://cfconventions.org/).  How convenient.  

### Map of the `Dataset` object

Now that we've completed our survey, we see that a `Dataset` is really a kind of *container* composed of (actually pointing to) many other objects.

+ dims: A `dict` that maps dimension names (keys) to dimension lengths (values)
+ coords: A `dict` that maps dimension names (keys such as **i1**, **i2**, **i3**) to arrays that label each point in the dimension (values)
+ One or more *Data Variables* that are pointers to `DataArray` objects
+ attrs: A `dict` that maps different attribute names (keys) to the attributes themselves (values).

![Dataset-diagram](../figures/Dataset-diagram.png)

## The `DataArray` Object

It is worth looking at the `DataArray` object in more detail because `DataArrays` store the arrays that store the ECCO output.  Please see the [xarray documentation on the DataArray object](http://xarray.pydata.org/en/stable/data-structures.html#dataarray) for more information.

`DataArrays` are actually very similar to `Datasets`.  They also contain dimensions, coordinates, and attributes.  The two main differences between `Datasets` and `DataArrays` are that `DataArrays` have a **name** (a string) and an array of **values**.  The **values** array is a [numpy n-dimensional array](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.array.html), an `ndarray`.
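
A `DataArray` can also be assembled by hand, which makes its parts easy to see.  The sketch below is a hypothetical 2 x 2 stand-in named `lon`, with made-up values and labels, that borrows the name and attributes of `XC`; it is not the array stored in the grid file.

```python
import numpy as np
import xarray as xr

# A hypothetical 2 x 2 longitude array with a name, coordinates, and attributes.
lon = xr.DataArray(
    np.array([[-37.5, -36.5], [-37.5, -36.5]]),
    dims=("i2", "i3"),
    coords={"i2": [1.0, 2.0], "i3": [1.0, 2.0]},
    name="XC",
    attrs={"long_name": "longitude", "units": "degrees_east"},
)

print(lon.name)          # the name (a string)
print(lon.dims)          # the dimension names
print(type(lon.values))  # the underlying numpy ndarray
```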

### Examining the contents of a `DataArray` 

Let's examine the contents of one of the `DataArrays` found in *ds*, *XC*:

In [7]:
ds.XC

<xarray.DataArray 'XC' (i2: 90, i3: 90)>
array([[-37.5     , -36.5     , -35.5     , ...,  49.5     ,  50.5     ,
         51.5     ],
       [-37.5     , -36.5     , -35.5     , ...,  49.5     ,  50.5     ,
         51.5     ],
       [-37.5     , -36.5     , -35.5     , ...,  49.5     ,  50.5     ,
         51.5     ],
       ...,
       [-37.730072, -37.178291, -36.597565, ...,  50.597565,  51.178291,
         51.730072],
       [-37.771988, -37.291943, -36.764027, ...,  50.764027,  51.291943,
         51.771988],
       [-37.837925, -37.44421 , -36.968143, ...,  50.968143,  51.44421 ,
         51.837925]])
Coordinates:
  * i2       (i2) float64 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 ...
  * i3       (i3) float64 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 ...
Attributes:
    long_name:  longitude
    units:      degrees_east

### Examining the `DataArray`

The layout of a `DataArray` is very similar to that of a `Dataset`.  Let's examine each part of *ds.XC*, starting from the top.

#### 1. Object type
`<xarray.DataArray>`

This is indeed a `DataArray` object from the `xarray` package.

> Note: You can also find the type of an object with the `type` command: `print(type(ds.XC))`

In [8]:
print(type(ds.XC))

<class 'xarray.core.dataarray.DataArray'>


#### 2. Object Name
`XC`

The top line also tells the name of this `DataArray`, `XC`.

In [9]:
print(ds.XC.name)

XC


#### 3. Dimensions
`Dimensions:  (i2: 90, i3: 90)`  

Unlike *ds*, *XC* only has two dimensions, **i2** and **i3**.  This makes sense since the longitude of the grid cell centers varies only with horizontal location and not with depth.

In [10]:
print(ds.XC.dims)

(u'i2', u'i3')


#### 4. The `numpy` Array
````
array([[-37.5     , -36.5     , -35.5     , ...,  49.5     ,  50.5     ,  51.5     ],
       [-37.5     , -36.5     , -35.5     , ...,  49.5     ,  50.5     ,  51.5     ],
       [-37.5     , -36.5     , -35.5     , ...,  49.5     ,  50.5     ,  51.5     ],
       ..., 
       [-37.730072, -37.178291, -36.597565, ...,  50.597565,  51.178291,
         51.730072],
       [-37.771988, -37.291943, -36.764027, ...,  50.764027,  51.291943,
         51.771988],
       [-37.837925, -37.44421 , -36.968143, ...,  50.968143,  51.44421 ,
         51.837925]])
````

Unlike the `Dataset` object there are no *Data variables*.   Instead, we find an **array** of values.  Python prints out a small subset of the entire array.  

> **Note**: `DataArrays` store **only one** array while `Datasets` can store **one or more** `DataArrays`.

We access the `numpy` array through the `.values` attribute of the `DataArray`.

In [11]:
ds.XC.values

array([[-37.5       , -36.5       , -35.5       , ...,  49.5       ,
         50.5       ,  51.5       ],
       [-37.5       , -36.5       , -35.5       , ...,  49.5       ,
         50.5       ,  51.5       ],
       [-37.5       , -36.5       , -35.5       , ...,  49.5       ,
         50.5       ,  51.5       ],
       ...,
       [-37.73007202, -37.17829132, -36.5975647 , ...,  50.5975647 ,
         51.17829132,  51.73007202],
       [-37.77198792, -37.2919426 , -36.76402664, ...,  50.76402664,
         51.2919426 ,  51.77198792],
       [-37.83792496, -37.44421005, -36.96814346, ...,  50.96814346,
         51.44421005,  51.83792496]])

The array that is returned is a numpy n-dimensional array:

In [12]:
type(ds.XC.values)

numpy.ndarray

Because it is a numpy array, all of the numerical operations provided by the numpy module can be applied to it.

> **Note:** You may find it useful to learn about the operations that can be made on numpy arrays. Here is a quickstart guide: https://docs.scipy.org/doc/numpy-dev/user/quickstart.html

We'll learn more about how to access the values of this array in a later tutorial.  For now it is sufficient to know how to access the arrays!
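
As a brief taste of what this allows, the sketch below applies a few numpy operations to the `.values` of a small stand-in array.  The 2 x 3 `DataArray` named `xc` is an assumption for illustration; it reuses a handful of the longitude values printed above rather than the full 90 x 90 tile.

```python
import numpy as np
import xarray as xr

# A small stand-in for ds.XC holding a few longitudes in degrees east.
xc = xr.DataArray(
    np.array([[-37.5, -36.5, -35.5], [-37.5, -36.5, -35.5]]),
    dims=("i2", "i3"),
    name="XC",
)

values = xc.values  # a plain numpy ndarray

print(values.min(), values.max())  # smallest and largest longitude
print(np.mean(values))             # mean over all points
print(values[0, :])                # first row, using positional indexing
print(np.deg2rad(values))          # element-wise conversion to radians
```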

#### 5. Coordinates
```
Coordinates:
  * i2       (i2) float64 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 ...
  * i3       (i3) float64 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 ...
```

We find two 1D arrays with coordinate labels for **i2** and **i3**.

In [13]:
ds.XC.coords

Coordinates:
  * i2       (i2) float64 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 ...
  * i3       (i3) float64 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 ...

#### 6. Attributes
```
Attributes:
    long_name:  longitude
    units:      degrees_east
```

The `XC` variable has a `long_name` (longitude) and units (degrees_east).  This metadata was loaded from the NetCDF file.  The entire attribute dictionary is accessed using `.attrs`.

In [14]:
ds.XC.attrs

OrderedDict([(u'long_name', u'longitude'), (u'units', u'degrees_east')])

In [15]:
print(ds.XC.attrs['long_name'])

longitude


### Map of the `DataArray` Object

The `DataArray` can be mapped out with the following diagram:

![DataArray-diagram](../figures/DataArray-diagram.png)

## Summary

Now you know the basics of the `Dataset` and `DataArray` objects that will store the ECCO v4 model grid parameters and state estimate output variables.  Go back and take a look at the grid *ds* object that we originally loaded.  It should make a lot more sense now!