# Accessing CMIP5 Data with Xarray


In this notebook we demonstrate how to access CMIP5 data locally and remotely using Xarray.

* What is Xarray
* Remote vs. direct filesystem access
* Build a multi-file dataset from an OPeNDAP server
* File Variables and Attributes

This example uses the Coupled Model Intercomparison Project Phase 5 (CMIP5) collections. For more information, please visit the [data catalogue](https://geonetwork.nci.org.au/geonetwork/srv/eng/catalog.search#/metadata/f3525_9322_8600_7716) and note the terms of use.
   
---------

- Authors: NCI Virtual Research Environment Team
- Keywords: CMIP, Xarray, OPeNDAP, NetCDF
- Create Date: 2020-Apr
---

### Xarray

Xarray builds upon and extends the strengths of pandas and NumPy. NumPy provides the core structure for working with multi-dimensional arrays, while pandas contributes labelled indexing and DataFrame-style capabilities. Xarray is actively developed by the climate science community and is a useful tool for analysis. For more information on these developments and related projects, see the Pangeo community: https://pangeo.io/
 
We will use xarray to open the CMIP5 file defined below. Opening a file with xarray creates an `xarray.Dataset`, which is a collection of multiple variables. A `DataArray`, on the other hand, is a single multi-dimensional variable together with its coordinates.
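The relationship between a Dataset and a DataArray can be sketched without any CMIP5 file. The example below is only an illustration: it builds a tiny synthetic Dataset in memory (the values, grid, and the second variable `tasmin` are made up) and then pulls one variable out as a DataArray.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a small CMIP-style file (all values are made up).
time = pd.date_range("2000-01-16", periods=3, freq="D")
lat = np.array([-45.0, 0.0, 45.0])
lon = np.array([0.0, 90.0, 180.0, 270.0])

ds = xr.Dataset(
    data_vars={
        "tasmax": (("time", "lat", "lon"), np.full((3, 3, 4), 290.0, dtype="float32")),
        "tasmin": (("time", "lat", "lon"), np.full((3, 3, 4), 280.0, dtype="float32")),
    },
    coords={"time": time, "lat": lat, "lon": lon},
)

# Selecting one variable from the Dataset returns a DataArray
# that carries the same coordinates.
da = ds["tasmax"]

print(type(ds).__name__)   # Dataset
print(type(da).__name__)   # DataArray
print(list(ds.data_vars))  # ['tasmax', 'tasmin']
print(da.dims)             # ('time', 'lat', 'lon')
```

The same `ds["name"]` (or `ds.name`) selection applies to a Dataset opened from a netCDF file.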
 
xarray loads netCDF data 'lazily', which means that data can be manipulated, sliced and subset without loading array values into memory. Values are read into memory when `load()` is called or when a computation needs them.
 
xarray is designed for use with multidimensional datasets and is particularly useful for climate data on multidimensional grids with dimensions such as lat, lon, depth and time.

#### Import the xarray and netCDF modules

In [1]:
import xarray as xr
import netCDF4 as nc
%matplotlib inline

### Remote vs. direct filesystem access

In this example, we will use a file from the CMIP5 Australian Published data collection, specifically the monthly historical tasmax data:

    /g/data/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-3/historical/mon/atmos/Amon/r1i1p1/v20130325/tasmax/tasmax_Amon_ACCESS1-3_historical_r1i1p1_185001-200512.nc
    

We will compare direct and remote access. Timings (from the `%%time` cell magic) are also shown to help illustrate when it is worthwhile to run the analysis directly on the filesystem.
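The `%%time` magic is only available inside IPython or Jupyter. When the same comparison is run from a plain Python script, a small timer built on the standard library can play a similar role. The sketch below is one possible approach; the helper name `wall_timer` is hypothetical, and `time.sleep` stands in for a call such as `xr.open_dataset(path)`.

```python
import time
from contextlib import contextmanager


@contextmanager
def wall_timer(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

with wall_timer("sleep", timings):
    time.sleep(0.05)  # placeholder for the operation being timed

print(f"sleep: {timings['sleep']:.3f} s")
```

Unlike `%%time`, this reports wall time only, not the user and system CPU times.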

#### Local path on /g/data

In [2]:
path = '/g/data/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-3/historical/mon/atmos/Amon/r1i1p1/v20130325/tasmax/tasmax_Amon_ACCESS1-3_historical_r1i1p1_185001-200512.nc'

#### OPeNDAP Data URL

For more information on where to find the CMIP5 OPeNDAP endpoints, see our [Geonetwork catalogue](https://geonetwork.nci.org.au/geonetwork/srv/eng/catalog.search#/metadata/f7448_2157_9857_1076).


In [4]:
url = 'http://dapds00.nci.org.au/thredds/dodsC/rr3/CMIP5/output1/CSIRO-BOM/ACCESS1-3/historical/mon/atmos/\
Amon/r1i1p1/v20130325/tasmax/tasmax_Amon_ACCESS1-3_historical_r1i1p1_185001-200512.nc'

#### Open the file, comparing the time on the local filesystem and remote url

In [5]:
%%time
f1 = xr.open_dataset(path)

CPU times: user 86.1 ms, sys: 87.6 ms, total: 174 ms
Wall time: 2.86 s


In [6]:
%%time
f2 = xr.open_dataset(url)

CPU times: user 54 ms, sys: 5.77 ms, total: 59.8 ms
Wall time: 4.55 s


In [7]:
f1

In [8]:
f2

The difference in wall time comes from the remote URL access. Both calls return quickly because Xarray's `open_dataset()` function loads data lazily, so the array values are not read at this point. Now we force the data to load into memory:

In [9]:
%%time
tasmax = f1.tasmax
tasmax.load()

CPU times: user 267 ms, sys: 484 ms, total: 751 ms
Wall time: 3.08 s


In [10]:
%%time
tasmax = f2.tasmax
tasmax.load()

CPU times: user 1.16 s, sys: 572 ms, total: 1.73 s
Wall time: 4.65 s


<div class="alert alert-info">
One big advantage of working directly on the filesystem is that data access is much faster. For modest subsets, the difference is quite small but as you work with larger data, remote access can become much slower or even exceed NCI's THREDDS Data Server memory limits. </div>
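One common way to keep remote requests modest is to select the region and period of interest before calling `load()`. The sketch below illustrates the pattern on a small synthetic dataset (the grid and values are made up, and the data is already in memory, so `load()` does no real work here). With a lazily opened dataset such as `f2`, the same `sel(...)` followed by `load()` should request only the selected values.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic monthly field on a coarse global grid; all values are made up.
time = pd.date_range("1990-01-01", periods=24, freq="MS")
lat = np.arange(-90.0, 90.1, 30.0)  # 7 latitudes, ascending
lon = np.arange(0.0, 360.0, 60.0)   # 6 longitudes
data = np.zeros((time.size, lat.size, lon.size), dtype="float32")
ds = xr.Dataset(
    {"tasmax": (("time", "lat", "lon"), data)},
    coords={"time": time, "lat": lat, "lon": lon},
)

# Select by coordinate label first (label slices include both end points)...
subset = ds["tasmax"].sel(
    time=slice("1990-01-01", "1990-12-31"),
    lat=slice(-60, 0),
    lon=slice(120, 180),
)

# ...then load only the reduced selection.
subset = subset.load()

print(subset.shape)                 # (12, 3, 2)
print(subset.size, ds.tasmax.size)  # 72 1008
```

The latitude slice assumes the coordinate is stored in ascending order, as it is in the file shown in this notebook.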

### Building multi-file datasets using OPeNDAP vs. locally on the /g/data filesystem

In [11]:
### OPeNDAP multi-file dataset

base_url = 'http://dapds00.nci.org.au/thredds/dodsC/rr3/CMIP5/output1/CSIRO-BOM/ACCESS1-3/\
historical/day/ocean/day/r1i1p1/latest/tos/tos_day_ACCESS1-3_historical_r1i1p1_'
files_opendap = [f'{base_url}{year}0101-{year+9}1231.nc' for year in range(1850, 1990, 10)]
files_opendap

['http://dapds00.nci.org.au/thredds/dodsC/rr3/CMIP5/output1/CSIRO-BOM/ACCESS1-3/historical/day/ocean/day/r1i1p1/latest/tos/tos_day_ACCESS1-3_historical_r1i1p1_18500101-18591231.nc',
 'http://dapds00.nci.org.au/thredds/dodsC/rr3/CMIP5/output1/CSIRO-BOM/ACCESS1-3/historical/day/ocean/day/r1i1p1/latest/tos/tos_day_ACCESS1-3_historical_r1i1p1_18600101-18691231.nc',
 'http://dapds00.nci.org.au/thredds/dodsC/rr3/CMIP5/output1/CSIRO-BOM/ACCESS1-3/historical/day/ocean/day/r1i1p1/latest/tos/tos_day_ACCESS1-3_historical_r1i1p1_18700101-18791231.nc',
 'http://dapds00.nci.org.au/thredds/dodsC/rr3/CMIP5/output1/CSIRO-BOM/ACCESS1-3/historical/day/ocean/day/r1i1p1/latest/tos/tos_day_ACCESS1-3_historical_r1i1p1_18800101-18891231.nc',
 'http://dapds00.nci.org.au/thredds/dodsC/rr3/CMIP5/output1/CSIRO-BOM/ACCESS1-3/historical/day/ocean/day/r1i1p1/latest/tos/tos_day_ACCESS1-3_historical_r1i1p1_18900101-18991231.nc',
 'http://dapds00.nci.org.au/thredds/dodsC/rr3/CMIP5/output1/CSIRO-BOM/ACCESS1-3/historical
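The list comprehension above relies on the files being split into ten-year blocks that each run from 1 January to 31 December. As a rough sketch, the same naming pattern can be wrapped in a small helper; the function name `decade_files` and its parameters are illustrative only, and the pattern would need adjusting for collections whose files cover different periods.

```python
def decade_files(prefix, first_year, last_year, step=10):
    """Return one file name per `step`-year block starting at each year
    from first_year to last_year (inclusive)."""
    return [
        f"{prefix}{year}0101-{year + step - 1}1231.nc"
        for year in range(first_year, last_year + 1, step)
    ]


# File name stem taken from the listing above; any base URL or directory
# can be prepended to it.
prefix = "tos_day_ACCESS1-3_historical_r1i1p1_"
files = decade_files(prefix, 1850, 1980)

print(len(files))  # 14
print(files[0])    # tos_day_ACCESS1-3_historical_r1i1p1_18500101-18591231.nc
print(files[-1])   # tos_day_ACCESS1-3_historical_r1i1p1_19800101-19891231.nc
```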

We can now read the entire set of files on the remote server as a single xarray Dataset using `xr.open_mfdataset()`!

In [12]:
%%time
f3 = xr.open_mfdataset(files_opendap,combine='by_coords')
f3

CPU times: user 1.78 s, sys: 146 ms, total: 1.93 s
Wall time: 2min 5s


Dask array summary for the six arrays in `f3` (all chunks are of type `numpy.ndarray`):

| Array size | Chunk size | Array shape | Chunk shape | Tasks | Chunks | Type |
|---|---|---|---|---|---|---|
| 432.00 kB | 432.00 kB | (300, 360) | (300, 360) | 65 | 1 | float32 |
| 432.00 kB | 432.00 kB | (300, 360) | (300, 360) | 65 | 1 | float32 |
| 818.14 kB | 58.45 kB | (51134, 2) | (3653, 2) | 42 | 14 | datetime64[ns] |
| 88.36 GB | 6.31 GB | (51134, 300, 360, 4) | (3653, 300, 360, 4) | 56 | 14 | float32 |
| 88.36 GB | 6.31 GB | (51134, 300, 360, 4) | (3653, 300, 360, 4) | 56 | 14 | float32 |
| 22.09 GB | 1.58 GB | (51134, 300, 360) | (3653, 300, 360) | 42 | 14 | float32 |
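The sizes reported in this summary can be checked by hand: each `float32` value occupies 4 bytes, so an array's size is the product of its shape multiplied by 4. The short sketch below reproduces the figures for the largest arrays, using decimal gigabytes (1 GB = 10^9 bytes), which is what the reported numbers match. The helper name `nbytes` is just for illustration.

```python
from math import prod


def nbytes(shape, itemsize=4):
    """Size in bytes of an array with the given shape (float32 by default)."""
    return prod(shape) * itemsize


full_3d = nbytes((51134, 300, 360))      # whole (time, y, x) array
chunk_3d = nbytes((3653, 300, 360))      # one chunk of that array
full_4d = nbytes((51134, 300, 360, 4))   # array with an extra axis of length 4

print(f"{full_3d / 1e9:.2f} GB")   # 22.09 GB
print(f"{chunk_3d / 1e9:.2f} GB")  # 1.58 GB
print(f"{full_4d / 1e9:.2f} GB")   # 88.36 GB
```

Numbers of this size are a reminder to subset before loading: the full array is far larger than the memory of a typical session.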


In [13]:
### gdata multi-file dataset
base_dir = '/g/data/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-3/\
historical/day/ocean/day/r1i1p1/latest/tos/tos_day_ACCESS1-3_historical_r1i1p1_'
files_gdata = [f'{base_dir}{year}0101-{year+9}1231.nc' for year in range(1850, 1990, 10)]
files_gdata

['/g/data/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-3/historical/day/ocean/day/r1i1p1/latest/tos/tos_day_ACCESS1-3_historical_r1i1p1_18500101-18591231.nc',
 '/g/data/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-3/historical/day/ocean/day/r1i1p1/latest/tos/tos_day_ACCESS1-3_historical_r1i1p1_18600101-18691231.nc',
 '/g/data/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-3/historical/day/ocean/day/r1i1p1/latest/tos/tos_day_ACCESS1-3_historical_r1i1p1_18700101-18791231.nc',
 '/g/data/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-3/historical/day/ocean/day/r1i1p1/latest/tos/tos_day_ACCESS1-3_historical_r1i1p1_18800101-18891231.nc',
 '/g/data/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-3/historical/day/ocean/day/r1i1p1/latest/tos/tos_day_ACCESS1-3_historical_r1i1p1_18900101-18991231.nc',
 '/g/data/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-3/historical/day/ocean/day/r1i1p1/latest/tos/tos_day_ACCESS1-3_historical_r1i1p1_19000101-19091231.nc',
 '/g/data/rr3/publicat

In [14]:
%%time
f4 = xr.open_mfdataset(files_gdata,combine='by_coords')
f4

CPU times: user 2.25 s, sys: 2.56 s, total: 4.81 s
Wall time: 1min 29s


Dask array summary for the six arrays in `f4` (all chunks are of type `numpy.ndarray`):

| Array size | Chunk size | Array shape | Chunk shape | Tasks | Chunks | Type |
|---|---|---|---|---|---|---|
| 432.00 kB | 432.00 kB | (300, 360) | (300, 360) | 65 | 1 | float32 |
| 432.00 kB | 432.00 kB | (300, 360) | (300, 360) | 65 | 1 | float32 |
| 818.14 kB | 58.45 kB | (51134, 2) | (3653, 2) | 42 | 14 | datetime64[ns] |
| 88.36 GB | 6.31 GB | (51134, 300, 360, 4) | (3653, 300, 360, 4) | 56 | 14 | float32 |
| 88.36 GB | 6.31 GB | (51134, 300, 360, 4) | (3653, 300, 360, 4) | 56 | 14 | float32 |
| 22.09 GB | 1.58 GB | (51134, 300, 360) | (3653, 300, 360) | 42 | 14 | float32 |
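Both `open_mfdataset` calls pass `combine='by_coords'`, which tells xarray to order and join the files using their coordinate values (via `xarray.combine_by_coords`) rather than the order of the file list. The sketch below shows that behaviour on two tiny synthetic datasets; the helper `make_block` and the data values are made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr


def make_block(start, periods):
    """Build a small synthetic dataset with a daily time axis."""
    time = pd.date_range(start, periods=periods, freq="D")
    return xr.Dataset(
        {"tos": (("time",), np.arange(periods, dtype="float32"))},
        coords={"time": time},
    )


first = make_block("1850-01-01", 5)
second = make_block("1850-01-06", 5)

# The blocks are passed in the "wrong" order on purpose:
# combine_by_coords sorts them by their time coordinate.
combined = xr.combine_by_coords([second, first])

print(combined.sizes["time"])                                  # 10
print(bool(combined.indexes["time"].is_monotonic_increasing))  # True
print(str(combined.time.values[0])[:10])                       # 1850-01-01
```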


### File variables and attributes

With xarray, you can view the variables and attributes contained in the file by printing the dataset:

In [15]:
print(f1)

<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 145, lon: 192, time: 1872)
Coordinates:
  * time       (time) datetime64[ns] 1850-01-16T12:00:00 ... 2005-12-16T12:00:00
  * lat        (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0
  * lon        (lon) float64 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1
    height     float64 1.5
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) datetime64[ns] 1850-01-01 1850-02-01 ... 2006-01-01
    lat_bnds   (lat, bnds) float64 -90.0 -89.38 -89.38 ... 89.38 89.38 90.0
    lon_bnds   (lon, bnds) float64 -0.9375 0.9375 0.9375 ... 357.2 357.2 359.1
    tasmax     (time, lat, lon) float32 242.22649 242.22649 ... 248.79779
Attributes:
    institution:            CSIRO (Commonwealth Scientific and Industrial Res...
    institute_id:           CSIRO-BOM
    experiment_id:          historical
    source:                 ACCESS1.3 2011. Atmosphere: AGCM v1.0 (N96 grid-p...
    model_id:               ACCES

### Dataset and DataArray

Above, we printed the Dataset and can see the multiple variables included in the file. If we select a specific variable, such as `tasmax`, we get an `xarray.DataArray` with its coordinates.

In [16]:
print(f1.tasmax)

<xarray.DataArray 'tasmax' (time: 1872, lat: 145, lon: 192)>
array([[[242.22649, 242.22649, 242.22649, ..., 242.22028, 242.22028,
         242.22028],
        [243.91931, 243.87326, 243.82875, ..., 244.05038, 244.00957,
         243.96425],
        [245.13287, 245.08662, 245.0183 , ..., 245.27603, 245.23296,
         245.18419],
        ...,
        [242.64438, 242.74596, 242.86943, ..., 242.36124, 242.45232,
         242.54388],
        [241.69199, 241.74554, 241.79771, ..., 241.67577, 241.68051,
         241.69809],
        [240.87485, 240.87485, 240.87485, ..., 240.87485, 240.87485,
         240.87485]],

       [[235.21791, 235.21791, 235.21791, ..., 235.2146 , 235.2146 ,
         235.2146 ],
        [237.052  , 237.029  , 237.00833, ..., 237.17393, 237.13855,
         237.09853],
        [237.41914, 237.29343, 237.1765 , ..., 237.80144, 237.6658 ,
         237.53809],
        ...,
        [240.60106, 240.53203, 240.48082, ..., 240.71024, 240.71312,
         240.65256],
        [23

#### Print an attribute
The attributes of a variable can be accessed using the `.<attribute>` syntax. For example, to print the units of `tasmax`:

In [17]:
f1.tasmax.units

'K'
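Attribute-style access is convenient, but an attribute whose name matches an existing property or method (for example `name`) is shadowed by it, so the `.attrs` dictionary is a more robust alternative. The sketch below uses a small synthetic DataArray with made-up values and attributes to show both forms.

```python
import numpy as np
import xarray as xr

# Synthetic variable; the values and attributes are for illustration only.
da = xr.DataArray(
    np.array([290.0, 291.5], dtype="float32"),
    dims=("time",),
    name="tasmax",
    attrs={"units": "K", "standard_name": "air_temperature"},
)

print(da.units)           # K  (attribute-style access)
print(da.attrs["units"])  # K  (dictionary access)

# dict.get avoids an error when an attribute is absent.
print(da.attrs.get("cell_methods", "not set"))  # not set

# `name` is the variable name, not an entry in attrs.
print(da.name)            # tasmax
```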

### Summary

In this example, we showed how to access CMIP5 datasets on Gadi, both directly from the filesystem and remotely using OPeNDAP URLs from our THREDDS server. The same approach applies to other data collections. Please visit [Xarray](http://xarray.pydata.org/en/stable/) for more information.