# Exercise 0.4 - NetCDF files (using xarray)
prepared by M.Hauser

The `netCDF4` library is not the only library that can read netCDF files. In this exercise we will get to know [xarray](http://xarray.pydata.org/en/stable/). `xarray` combines a netCDF-like data model with capabilities of [pandas](http://pandas.pydata.org/).

In [None]:
import xarray as xr

import numpy as np

We will again use the netCDF file with the growing season length (GSL), see [exercise 0.3](./ex0_3_netCDF4.ipynb).

The data is described in Donat et al., [2013](http://onlinelibrary.wiley.com/doi/10.1002/jgrd.50150/abstract), and was obtained from http://www.climdex.org/. 

The data has already undergone some postprocessing - see [prepare_HadEX2_GSL](./../data/prepare_HadEX2_GSL.ipynb)

## Opening a dataset

In [None]:
fN = './../data/HadEX2_GSL.nc'

In [None]:
ds = xr.open_dataset(fN)

ds

## Dataset vs DataArray

xarray has two main types: `Dataset` and `DataArray`. A `Dataset` is a collection of `DataArray`s.
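
As a minimal sketch of this relationship, the following builds a small `Dataset` from synthetic data and pulls a `DataArray` back out. The variable names and values are made up for illustration and are not taken from the HadEX2 file.

```python
import numpy as np
import xarray as xr

# synthetic coordinates and data, for illustration only
lat_vals = [40.0, 45.0, 50.0]
lon_vals = [0.0, 5.0]
data = np.arange(6, dtype=float).reshape(3, 2)

# a DataArray bundles values with dimension names and coordinates
da = xr.DataArray(
    data,
    dims=("lat", "lon"),
    coords={"lat": lat_vals, "lon": lon_vals},
    name="example",
)

# a Dataset holds one or more DataArrays that share coordinates
ds_example = xr.Dataset({"example": da, "doubled": da * 2})

print(type(ds_example["example"]))
print(list(ds_example.data_vars))
```

Indexing the `Dataset` by name returns a `DataArray`, which is the same pattern used with the real file in the cells that follow.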

### Reading variables (`DataArray`s)

With xarray, you can access variables using dot notation (`ds.lat`) or dict-like indexing (`ds['lat']`).


In [None]:
# get lat from the file, using dict-like indexing
ds['lat'][:10]

In [None]:
# get lat from the file, using dot notation
lat = ds.lat
lat[:10]

## Conversion to numpy array

If you want a numpy array (instead of an xarray `DataArray`), you can use `lat.values` or `np.array(lat)`. The second form also works if you pass in something that already is a numpy array.

In [None]:
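# the underlying numpy array of the DataArray
print(lat.values[:10])

# np.array() accepts a DataArray directly
print(np.array(lat)[:10])

# this works: np.array() also accepts a numpy array
print(np.array(lat.values)[:10])

# this errors: a numpy array has no `.values` attribute (AttributeError)
print(np.array(lat).values)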

In [None]:
# load the trend
trend = ds.trend
# in contrast to netCDF4, the arrays are not masked but use NaN to denote invalid values.
trend

If you look at the output of `trend` again, you see that it lists 'Coordinates'. xarray carries them along with the data, which is very helpful because it lets you easily subset data for certain regions (see the section on subsetting below).
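
Regarding the NaN values mentioned in the code comment above: the following sketch uses a tiny synthetic array (not the real `trend` data) to show how xarray reductions treat NaN differently from plain numpy. For floating-point data, xarray skips NaN by default, whereas `np.mean` propagates it.

```python
import numpy as np
import xarray as xr

# synthetic 1D array with one missing value, for illustration only
da_nan = xr.DataArray([1.0, np.nan, 3.0], dims="x")

# boolean mask of the missing values
print(da_nan.isnull().values)

# xarray ignores NaN by default for float data
print(float(da_nan.mean()))

# plain numpy propagates NaN
print(np.mean(da_nan.values))
```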

### Exercise
 * Get the longitude.

### Solution

In [None]:
ds.lon

## Time

In [None]:
ds.time[:10]

`time` already has the correct datetime format (which was quite annoying to obtain with netCDF4)! This is also very helpful when selecting data.
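
To illustrate why a decoded time axis is convenient, here is a minimal sketch with a synthetic annual time series (the years and values are invented and do not come from the GSL file). It selects a period using date strings rather than integer positions.

```python
import numpy as np
import xarray as xr

# synthetic annual time axis 1951..1960, for illustration only
years = np.arange("1951", "1961", dtype="datetime64[Y]").astype("datetime64[ns]")
da_time = xr.DataArray(np.arange(10.0), dims="time", coords={"time": years})

# select a period with date strings instead of integer positions
subset = da_time.sel(time=slice("1955", "1957"))

# the .dt accessor exposes datetime components such as the year
print(subset.time.dt.year.values)
```

Assuming the `time` coordinate of `ds` is decoded in the same way, a similar `sel` call should work on the real data.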

## Subsetting data

xarray can subset (select) data according to the coordinates very easily. Next, we select a region in Central North America (CNA).

In [None]:
# load the growing season length
GSL = ds.GSL

In [None]:
# select a region in Central North America
lat = slice(30, 50)
lon = slice(360-105, 360-85)

GSL_CNA = GSL.sel(lat=lat, lon=lon)

print(GSL.shape)
print(GSL_CNA.shape)
print()
print(GSL_CNA.lon)
print()
print(GSL_CNA.lat)
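
The `360-105` in the slice above suggests that the longitudes in this file run from 0° to 360° rather than from -180° to 180°. A small helper can make that conversion explicit. Note that `to_0_360` is a hypothetical function written for this sketch and is not part of xarray.

```python
def to_0_360(lon_deg):
    """Convert a longitude in degrees east (-180..180) to the 0..360 convention."""
    return lon_deg % 360


# 105° W and 85° W expressed as 0..360 longitudes
lon_west = to_0_360(-105)
lon_east = to_0_360(-85)

print(lon_west, lon_east)
```

With these values, `slice(lon_west, lon_east)` is equivalent to the `slice(360-105, 360-85)` used for the CNA region.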

### Exercise

* Obtain a GSL time series for Switzerland (approx. 47° N, 8° E). (Hint: `method='nearest'`).

In [None]:
# code here

### Solution

In [None]:
# solution

lat = 47
lon = 8

GSL_CH = GSL.sel(lat=lat, lon=lon, method='nearest')
GSL_CH

## Computations (`mean` etc.)

With numpy arrays you need to know which axis number corresponds to which dimension. If your data has the form `(time, lat, lon)` you have to compute the time mean as `data.mean(axis=0)`. xarray lets you use the dimension names instead, so you can write `data.mean('time')`.

In [None]:
GSL.mean('time')
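
The equivalence between the two approaches can be checked with a small synthetic array (shape and values chosen arbitrarily for this sketch):

```python
import numpy as np
import xarray as xr

# synthetic data with shape (time, lat, lon) = (4, 3, 2), for illustration only
values = np.arange(24, dtype=float).reshape(4, 3, 2)
da_demo = xr.DataArray(values, dims=("time", "lat", "lon"))

# numpy: you must know that time is axis 0
mean_by_axis = values.mean(axis=0)

# xarray: refer to the dimension by name
mean_by_name = da_demo.mean("time")

# the time dimension is gone, lat and lon remain
print(mean_by_name.dims)
print(np.allclose(mean_by_axis, mean_by_name.values))
```

The named version keeps working if the dimension order of the data changes, while the `axis=0` version would silently average over the wrong dimension.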

### Exercise

 * Compute the mean GSL over the Central North American domain.
 * Ensure that you still have the time dimension

### Solution

Note that you can calculate the mean over more than one dimension if you pass the dimension names as a tuple (or list):

In [None]:
GSL_CNA.mean(('lat', 'lon'))