# Exercise 0.4 - NetCDF files (using `xarray`)
prepared by M.Hauser

The `netCDF4` library is not the only library to load netCDFs. In this exercise we will get to know [xarray](http://xarray.pydata.org/en/stable/). `xarray` combines a netCDF-like data model with capabilities of [pandas](http://pandas.pydata.org/) (see next exercise).

In [None]:
import xarray as xr

import numpy as np

We will use the netCDF file with the growing season length (GSL), see [exercise 0.3](./ex0_3_netCDF4.ipynb).

The data is described in Donat et al., [2013](http://onlinelibrary.wiley.com/doi/10.1002/jgrd.50150/abstract), and was obtained from http://www.climdex.org/.

The data has already undergone some postprocessing, see [prepare_HadEX2_GSL](./../data/prepare_HadEX2_GSL.ipynb).

## Opening a dataset

In [None]:
fN = './../data/HadEX2_GSL.nc'

ds = xr.open_dataset(fN)

ds

## Data types: `Dataset` and `DataArray`

xarray has two main types: `Dataset` and `DataArray`. A `Dataset` is a collection of `DataArray`s. Reading a netCDF file usually returns a `Dataset`.
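The relationship between the two types can be sketched with synthetic data. All names and values below (`temperature`, `doubled`, `demo_ds`) are made up for illustration and are not part of the HadEX2 file.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a gridded variable; names and values are made up.
temperature = xr.DataArray(
    np.arange(6.0).reshape(3, 2),
    dims=("lat", "lon"),
    coords={"lat": [40.0, 45.0, 50.0], "lon": [0.0, 5.0]},
    name="temperature",
)

# A Dataset bundles one or more DataArrays that share coordinates.
demo_ds = xr.Dataset({"temperature": temperature, "doubled": temperature * 2})

# Pulling a variable back out of the Dataset returns a DataArray.
demo_var = demo_ds["doubled"]
print(type(demo_ds).__name__, type(demo_var).__name__)
print(list(demo_ds.data_vars))
```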

### Reading variables (= `DataArray`)

To read a variable (= `DataArray`), you can use either dot notation (`ds.lat`) or dict-like indexing (`ds['lat']`).

#### get lat - dict-like indexing

In [None]:
lat = ds['lat']
lat[:10]

#### get lat - dot notation

In [None]:
lat = ds.lat
lat[:10]

### Exercise
 * Get the longitude.

### Solution

In [None]:
ds.lon

## Conversion to numpy array

`DataArray`s behave more or less like numpy arrays. However, some operations require a raw numpy array; in that case you can use `lat.values` or `np.asarray(lat)`.

#### using values

In [None]:
lat.values[:10]

#### using asarray

In [None]:
print(np.asarray(lat)[:10])

## NaNs

Invalid data is given as NaN. This differs from the `netCDF4` library, which uses masked arrays.

In [None]:
# load the trend
trend = ds.trend

trend
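What NaNs mean for computations can be sketched on a small synthetic array (the values are placeholders, not HadEX2 data). xarray reductions skip NaNs by default for floating-point data, whereas plain numpy reductions propagate them.

```python
import numpy as np
import xarray as xr

# Synthetic data with one missing value (not taken from HadEX2).
demo = xr.DataArray([1.0, np.nan, 3.0], dims="x")

# xarray reductions skip NaNs by default for float data ...
mean_skip = float(demo.mean())
# ... unless skipna=False is passed explicitly.
mean_all = float(demo.mean(skipna=False))

# On the raw numpy array, np.nanmean ignores NaNs (np.mean would return NaN).
mean_np = float(np.nanmean(demo.values))

# Count the valid (non-NaN) entries.
n_valid = int(demo.notnull().sum())

print(mean_skip, mean_all, mean_np, n_valid)
```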

## Coordinates

If you look at the output of `trend` again, you will see that it lists `Coordinates`. xarray carries them along with the data, which is very helpful because it makes it easy to subset data for a particular region.

## Subsetting data

xarray can subset (select) data according to the coordinates very easily. Let's select a region in Central North America (CNA).

In [None]:
# load the growing season length
GSL = ds.GSL

#### select a region in Central North America (CNA)

 * lat: 30°N to 50°N
 * lon: -105°E to -85°E

In [None]:
# select a region in Central North America
lat = slice(30, 50)
lon = slice(360 - 105, 360 - 85)

GSL_CNA = GSL.sel(lat=lat, lon=lon)

print('Shape of the data:')
print(' * all:', GSL.shape)
print(' * CNA:', GSL_CNA.shape)

*Note on `slice`*: the following two commands are equivalent:

    GSL.values[:10]
    GSL.values[slice(0, 10)]
    
However, the `:` operator only works inside square brackets (`[]`). For functions like `GSL.sel(...)` we therefore need to use `slice`.
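One difference is worth keeping in mind: positional slices exclude the end index, while label-based slices passed to `sel` include both end points. A minimal sketch on a synthetic array (the coordinate values are made up):

```python
import numpy as np
import xarray as xr

# Synthetic 1D array with latitude labels (placeholder values).
demo = xr.DataArray(
    np.arange(5) * 10,
    dims="lat",
    coords={"lat": [10.0, 20.0, 30.0, 40.0, 50.0]},
)

# Positional slicing: the end index is excluded (same as demo.values[1:3]).
by_position = demo.values[slice(1, 3)]

# Label-based slicing with .sel: both end points are included.
by_label = demo.sel(lat=slice(20, 40))

print(by_position)
print(by_label.lat.values)
```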

#### lon goes from 255 to 273.75

In [None]:
GSL_CNA.lon
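The selection above writes western longitudes as `360 - 105` because the file stores longitudes from 0° to 360°. As a small sketch, the conversion can be wrapped in a helper; the function name `to_0_360` is hypothetical and not part of xarray.

```python
def to_0_360(lon_deg):
    """Map a longitude in degrees east onto the range [0, 360)."""
    # Python's modulo returns a non-negative result for a positive divisor,
    # so negative (western) longitudes wrap around correctly.
    return lon_deg % 360


# Bounds of the Central North America box, as used in the selection above.
lon_bounds = slice(to_0_360(-105), to_0_360(-85))
print(lon_bounds)
```

The resulting `lon_bounds` could then be passed as `GSL.sel(lon=lon_bounds)`.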

### Exercise

* Obtain a GSL time series for Switzerland (approx. 47° N, 8° E). (Hint: `method='nearest'`).

In [None]:
# code here
lat = 47
lon = 8

# GSL_CH = GSL.sel(...)


### Solution

In [None]:
lat = 47
lon = 8

GSL_CH = GSL.sel(lat=lat, lon=lon, method='nearest')
GSL_CH
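To see what `method='nearest'` does, here is a sketch on a synthetic grid. The 2.5° x 3.75° spacing is an assumption chosen for illustration and may differ from the actual file. Without `method='nearest'`, `sel` only accepts labels that exist exactly on the grid.

```python
import numpy as np
import xarray as xr

# Synthetic grid (spacing is an assumption, values are placeholders).
lat = np.arange(40.0, 52.5, 2.5)   # 40, 42.5, 45, 47.5, 50
lon = np.arange(0.0, 15.0, 3.75)   # 0, 3.75, 7.5, 11.25
demo = xr.DataArray(
    np.zeros((lat.size, lon.size)),
    dims=("lat", "lon"),
    coords={"lat": lat, "lon": lon},
)

# An exact lookup fails because 47N, 8E is not a grid point.
try:
    demo.sel(lat=47, lon=8)
except KeyError:
    print("47N, 8E is not an exact grid point")

# With method='nearest' the closest grid cell is returned instead.
point = demo.sel(lat=47, lon=8, method="nearest")

# The coordinates of the result show which grid cell was picked.
print(float(point.lat), float(point.lon))
```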

## Selection of time

To select time ranges from datetime coordinates, pass the dates as strings:

In [None]:
GSL.sel(time=slice('1960', '1961'))
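String-based time selection can be tried on a short synthetic annual series (the dates and values below are placeholders). As with other label-based slices, both end points are included.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic annual series for 1958 to 1963 (values are placeholders).
time = pd.to_datetime(
    ["1958-01-01", "1959-01-01", "1960-01-01",
     "1961-01-01", "1962-01-01", "1963-01-01"]
)
demo = xr.DataArray(np.arange(6), dims="time", coords={"time": time})

# A slice of year strings includes all time steps in both end years.
two_years = demo.sel(time=slice("1960", "1961"))

# A single year string selects all time steps that fall within that year.
one_year = demo.sel(time="1960")

print(two_years.values, one_year.values)
```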

### Exercise

 * select the data for the year 2000 for `GSL_CH`

In [None]:
#GSL_CH.sel(...)

### Solution

In [None]:
GSL_CH.sel(time='2000')

## Computations (`mean` etc.)

With numpy arrays you need to know which axis number corresponds to which dimension. If your data has the form `(time, lat, lon)`, you have to compute the time mean as `data.mean(axis=0)`. xarray allows you to use the named dimensions instead, so you can write `data.mean('time')` to compute the climatology.

In [None]:
GSL.mean('time')
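The equivalence of the two approaches can be checked on a small synthetic array. This is only a sketch; the shape and values are placeholders.

```python
import numpy as np
import xarray as xr

# Synthetic (time, lat, lon) array; shape and values are placeholders.
data = np.arange(24.0).reshape(2, 3, 4)
demo = xr.DataArray(data, dims=("time", "lat", "lon"))

mean_by_axis = data.mean(axis=0)   # numpy: must know that time is axis 0
mean_by_name = demo.mean("time")   # xarray: refer to the dimension by name

# Both give the same numbers; the xarray result keeps the dimension names.
print(np.allclose(mean_by_axis, mean_by_name.values))
print(mean_by_name.dims)
```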

### Exercise

 * Compute the mean GSL over the Central North American domain.
 * Ensure that you still have the time dimension

In [None]:
GSL_CNA

### Solution

Note that you can calculate the mean over more than one dimension by passing the dimension names as a tuple (or list):

In [None]:
GSL_CNA.mean(('lat', 'lon'))