![](../nci-logo.png)

-------
# Data Access and Manipulation using iPython Notebooks
## The Basics



### In this notebook:

- Using iPython Notebooks with NetCDF data within the VDI
    - <a href='#part1'>Launch Jupyter Notebook</a>  
    - <a href='#part2'>What is Xarray</a>  
    - <a href='#part3'>Remote vs. direct filesystem access</a> 
    - <a href='#part4a'>File Variables and Attributes</a> 
    - <a href='#part5'>Subsetting</a>
    - <a href='#part6'>Xarray .plot()</a>
---------

<br>


<a id='part1'></a> 
## Launch the Jupyter Notebook application

#### Using the public hh5 conda environment managed by CLEX

Many Python modules are available in the hh5 conda environment maintained by CLEX, including additional tools such as CleF, used in the previous examples. Loading this environment also provides pre-built netCDF and xarray modules.
```
    $ module use /g/data3/hh5/public/modules
    $ module load conda/analysis3
   
```  

Launch the Jupyter Notebook application:
```
    $ jupyter notebook
``` 

<div class="alert alert-info">
<b>NOTE: </b> This will launch the <b>Notebook Dashboard</b> within a new web browser window. 
</div>

<br>



<a id='part2'></a> 
## Xarray

Xarray builds upon and extends the strengths of pandas and NumPy. NumPy provides the core structure for working with multi-dimensional arrays, while pandas contributes its indexing and dataframe-style capabilities. Xarray is actively developed by the climate science community and is a useful tool for analysis. For more information on the developments being undertaken (along with other related projects), see the Pangeo community: https://pangeo.io/

We will use xarray to open a CMIP5 file (its path is given in the next section). Opening a file with xarray creates an `xarray.Dataset`. A Dataset is a collection of multiple variables. A DataArray, on the other hand, is a single multi-dimensional variable together with its coordinates.

xarray always loads netCDF data 'lazily': data can be manipulated, sliced and subset without loading the array values into memory. Values are only read into memory when `.load()` is called or when a computation is performed on the data.

xarray is designed for use with multidimensional datasets and is particularly useful for climate data on multidimensional grids with dimensions such as lat, lon, depth and time.
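To make the Dataset/DataArray distinction concrete, here is a minimal sketch that builds a tiny Dataset in memory rather than reading a file. The values, grid and coordinate ranges are made up for illustration; only the variable name `tasmax` and the dimension names are borrowed from the CMIP5 file used later.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a small tasmax field: 3 time steps on a 2 x 4 grid.
# (Illustrative values only, not real model output.)
values = np.arange(24, dtype="float64").reshape(3, 2, 4)

ds = xr.Dataset(
    data_vars={"tasmax": (("time", "lat", "lon"), values)},
    coords={
        "time": np.arange(3),
        "lat": [-10.0, 10.0],
        "lon": [0.0, 90.0, 180.0, 270.0],
    },
)

# Pulling one variable out of the Dataset gives a DataArray that
# carries its own coordinates along with the values.
da = ds["tasmax"]

print(type(ds).__name__)  # Dataset
print(type(da).__name__)  # DataArray
print(da.dims)            # ('time', 'lat', 'lon')
print(list(da.coords))    # names of the coordinates attached to the variable
```

The same objects are returned by `xr.open_dataset()` for a real file; the only difference is that the values there stay on disk until they are needed.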

#### Import the xarray and netCDF modules

In [None]:
import xarray as xr
import netCDF4 as nc

<a id='part3'></a> 
## Remote vs. direct filesystem access

In this example, we will use a file from the CMIP5 Australian Published data collection, specifically the monthly historical tasmax data:

    /g/data/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-3/historical/mon/atmos/Amon/r1i1p1/v20130325/tasmax/tasmax_Amon_ACCESS1-3_historical_r1i1p1_185001-200512.nc
    

and we are going to compare direct vs. remote access. Timings (using the `%%time` magic function) will also be shown to help illustrate when it can be useful to conduct analysis on the filesystem.

#### Local path on /g/data

In [None]:
path = '/g/data/rr3/publications/CMIP5/output1/CSIRO-BOM/ACCESS1-3/historical/mon/atmos/\
Amon/r1i1p1/v20130325/tasmax/tasmax_Amon_ACCESS1-3_historical_r1i1p1_185001-200512.nc'

#### OPeNDAP Data URL

For more information on where to find OPeNDAP URLs, see:
<a href="https://nbviewer.jupyter.org/github/nci-training/readthedoc_NCI_data_training/blob/master/docs/_notebook/TDS/tds_OPeNDAP_cmip5.ipynb">THREDDS Data Server: Data Access</a>


In [None]:
url = 'http://dapds00.nci.org.au/thredds/dodsC/rr3/CMIP5/output1/CSIRO-BOM/ACCESS1-3/historical/mon/atmos/\
Amon/r1i1p1/v20130325/tasmax/tasmax_Amon_ACCESS1-3_historical_r1i1p1_185001-200512.nc'

#### Open the file, comparing the time taken on the local filesystem and via the remote URL

In [None]:
%%time
f1 = xr.open_dataset(path)

In [None]:
%%time
f2 = xr.open_dataset(url)

#### The times are not much different because the data are loaded lazily. But if we force the data to load into memory:

In [None]:
%%time
f1 = xr.open_dataset(path)
tasmax = f1.tasmax
tasmax.load()

In [None]:
%%time
f2 = xr.open_dataset(url)
tasmax = f2.tasmax
tasmax.load()

<div class="alert alert-info">
One big advantage of working directly on the filesystem is that data access is much faster. For modest subsets the difference is small, but as the volume of data grows, remote access can become much slower, and requests can even exceed the memory limits of NCI's THREDDS Data Server.
</div>

<a id='part4a'></a>
## File variables and attributes

With xarray, you can easily view the dataset variables and attributes contained in the file by printing the loaded metadata.

In [None]:
f1 = xr.open_dataset(path)
print(f1)

### Dataset and DataArray

In the above we have loaded the Dataset and you can see the multiple variables included in the file. If we look at a specific variable, like tasmax, we will get an xarray.DataArray with its coordinates.

In [None]:
f1 = xr.open_dataset(path)
print(f1.tasmax)

#### Print an attribute
The attributes of a variable can be accessed with the `.<attribute>` syntax. For example, to print the units of tasmax:

In [None]:
f1.tasmax.units
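Attributes are also stored in the `.attrs` dictionary of the variable, which can be handy when an attribute may be missing. The sketch below uses a small made-up DataArray instead of the real file; the attribute values shown are illustrative assumptions, not copied from the CMIP5 metadata.

```python
import numpy as np
import xarray as xr

# Small synthetic variable with hand-written attributes (assumed values).
da = xr.DataArray(
    np.zeros((2, 3)),
    dims=("lat", "lon"),
    name="tasmax",
    attrs={"units": "K", "long_name": "maximum near-surface air temperature"},
)

print(da.units)            # attribute-style access, as used above
print(da.attrs["units"])   # dictionary-style access to the same value

# .get() avoids an error when the attribute is not present.
print(da.attrs.get("cell_methods", "not set"))
```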

<a id='part5'></a>
## Subsetting

There are multiple ways to select subsets of the data using xarray. 

#### 1. Using the regular numpy method of indexing:


In [None]:
tasmax = f1.tasmax
tasmax[1:10,:,0].values

In the above, however, it is not easy to know *where* the selected values come from. The `.sel()` method permits label-based indexing instead.
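A halfway step between the two is `.isel()`, which still uses integer positions but names the dimension each index applies to. The sketch below, on a small synthetic array with an assumed `(time, lat, lon)` dimension order, shows that it returns the same result as the positional form.

```python
import numpy as np
import xarray as xr

# Synthetic array with the same dimension order assumed for tasmax.
values = np.arange(12 * 3 * 4, dtype="float64").reshape(12, 3, 4)
da = xr.DataArray(
    values,
    dims=("time", "lat", "lon"),
    coords={
        "time": np.arange(12),
        "lat": [-10.0, 0.0, 10.0],
        "lon": [0.0, 90.0, 180.0, 270.0],
    },
)

# Positional indexing: the reader has to remember which axis is which.
by_position = da[1:10, :, 0]

# Same selection with the dimensions spelled out.
by_name = da.isel(time=slice(1, 10), lon=0)

print(by_position.equals(by_name))
print(by_name.dims)
```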

#### 2. Using .sel() for label based indexing

In the case below we find the values of tasmax at a latitude of -12.46 and a longitude of 130.85 degrees, approximately Darwin. This is done using the `.sel()` method, which can be applied to a Dataset or a DataArray. This method permits selection based on coordinate values rather than indices.

However, it is unlikely that the grid has a lat/lon coordinate at exactly that location. For example, look at the available values of longitude:

In [None]:
f1.lon

There is no longitude value at exactly 130.85 deg E, so to find tasmax over Darwin we need to select the nearest grid point instead. This is done by passing the `method` argument to `.sel()`; here we use `method='nearest'`, which performs a nearest-neighbour lookup (no values are interpolated).

In [None]:
tasmax.sel(lon=130.85,lat=-12.46,method='nearest')
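To see which grid point was actually picked, inspect the coordinates of the result. The following sketch uses a made-up regular grid (1.25 degrees in latitude, 1.875 degrees in longitude, chosen only for illustration) and also shows the optional `tolerance` argument, which makes the lookup fail when the nearest point is further away than you are willing to accept.

```python
import numpy as np
import xarray as xr

# Hypothetical regular grid; the spacing is an assumption for this sketch.
lat = np.linspace(-90.0, 90.0, 145)   # 1.25 degree spacing
lon = np.arange(192) * 1.875          # 0 to 358.125 degrees east
field = xr.DataArray(
    np.zeros((lat.size, lon.size)),
    dims=("lat", "lon"),
    coords={"lat": lat, "lon": lon},
)

# Nearest-neighbour lookup: returns the closest existing grid point.
point = field.sel(lon=130.85, lat=-12.46, method="nearest")
print(float(point.lat), float(point.lon))

# With a tight tolerance the same request is rejected instead of
# silently returning a point that is too far away.
try:
    field.sel(lon=130.85, lat=-12.46, method="nearest", tolerance=0.01)
    rejected = False
except KeyError:
    rejected = True
print(rejected)
```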

### Subsetting in Time 
Notice that the time variable has also been automatically decoded by xarray to represent dates rather than floats.

In [None]:
f1.time

Compare this with the raw numeric time values, obtained by switching off time decoding:

In [None]:
ft = xr.open_dataset(path, decode_times=False)
ft.time

This decoding is very helpful in quickly selecting data over specific time periods.
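As a rough illustration of what the decoding step does, the sketch below builds a tiny Dataset whose time axis is stored as plain numbers with a CF-style `units` attribute, then decodes it with `xr.decode_cf()`. The numbers and the reference date are assumptions made up for this example, not read from the CMIP5 file.

```python
import numpy as np
import xarray as xr

# Time stored as "days since" a reference date, as is common in netCDF files.
raw = xr.Dataset(
    data_vars={"tasmax": ("time", [300.0, 301.0, 302.0])},
    coords={
        "time": ("time", [15.5, 45.0, 74.5], {"units": "days since 1850-01-01"}),
    },
)

# decode_cf applies the CF conventions, turning the numbers into dates.
decoded = xr.decode_cf(raw)

print(raw.time.values)      # plain floats
print(decoded.time.values)  # datetime64 values
```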

### Exercise

Using the `.sel()` method shown above, find the tasmax values during the year 2005. What are the dimensions of tasmax in this case?

<a href="#ans1" data-toggle="collapse">Answer</a>
<div class="collapse" id="ans1">
<pre><code>
tasmax.sel(time='2005')
</code></pre>
</div>

### Selecting data within a range of values

In the above examples we found the tasmax values at a particular lat/lon location, and at a particular time. To select data over a range of values, pass a `slice()` object to `.sel()`.

In the below case we are getting the subset of tasmax data from years 2000 to 2005 and between latitudes 20S to 20N.

In [None]:
subset = tasmax.sel(time=slice('2000','2005'),lat=slice(-20,20))
subset
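Two details of label-based slicing are worth checking on a small example: both end points are included, and the slice bounds must follow the order in which the coordinate is stored. The sketch below uses a synthetic monthly time axis and a coarse latitude axis (both assumptions for illustration) to show each point.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic axes: monthly steps for 1999-2006 and latitudes every 10 degrees.
time = pd.date_range("1999-01-01", "2006-12-01", freq="MS")
lat = np.arange(-30.0, 31.0, 10.0)   # -30, -20, ..., 30 (ascending)
da = xr.DataArray(
    np.zeros((time.size, lat.size)),
    dims=("time", "lat"),
    coords={"time": time, "lat": lat},
)

# Both ends of a label slice are inclusive: all of 2000 through all of 2005,
# and latitudes -20 and 20 are both kept.
subset = da.sel(time=slice("2000", "2005"), lat=slice(-20, 20))
print(dict(subset.sizes))

# If latitude is stored from north to south, the same bounds select nothing;
# the slice has to run in the same direction as the coordinate.
north_to_south = da.isel(lat=slice(None, None, -1))
empty = north_to_south.sel(lat=slice(-20, 20))
fixed = north_to_south.sel(lat=slice(20, -20))
print(empty.sizes["lat"], fixed.sizes["lat"])
```

Printing the coordinate (for example `f1.lat`) before slicing shows which direction it runs in.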

### Exercise
Find the tasmax values over Australia, between latitudes -45 and -10 deg N and longitudes 110 and 155 deg E, for the years 1990 to 2000.

<a id='part6'></a>
## Xarray .plot()

Plotting with xarray is very simple. Xarray plots automatically set the axis values and labels from the information contained in the DataArray.

In [None]:
tasmax.sel(time='2005-01-16',method='nearest').plot()

In [None]:
tasmax.sel(lon=130.85,lat=-12.46,method='nearest').plot()
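The two calls above produce different kinds of plot because `.plot()` chooses the plot type from the number of dimensions left after the selection: a 2-D field is drawn as a colour mesh and a 1-D series as a line. A minimal sketch on random synthetic data (the array and the grid are made up for illustration) shows the same behaviour and how to place both plots on one figure:

```python
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

# Synthetic (time, lat, lon) field filled with random numbers.
rng = np.random.default_rng(0)
da = xr.DataArray(
    rng.random((6, 5, 8)),
    dims=("time", "lat", "lon"),
    coords={
        "time": np.arange(6),
        "lat": np.linspace(-40.0, 0.0, 5),
        "lon": np.linspace(110.0, 155.0, 8),
    },
    name="tasmax",
)

fig, (ax_map, ax_line) = plt.subplots(1, 2, figsize=(10, 4))

# Two dimensions left (lat, lon): drawn as a colour mesh with a colourbar.
mesh = da.isel(time=0).plot(ax=ax_map)

# One dimension left (time): drawn as a line.
lines = da.isel(lat=0, lon=0).plot(ax=ax_line)

fig.tight_layout()
```

Passing `ax=` is optional; without it each call draws on the current matplotlib axes, as in the cells above.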

### Exercise

Try plotting a colour map of tasmax over Australia, that is, between latitudes -45 and -10 deg N and longitudes 110 and 155 deg E. Pick the year and month you were born and plot that date (the file covers January 1850 to December 2005).