# Exercise 0.3 - NetCDF files (using netCDF4)
prepared by M.Hauser

We need to read (and write) netCDF files. Several modules can do this; the most commonly used is the `netCDF4` library from [Unidata](http://unidata.github.io/netcdf4-python/).

In [None]:
import netCDF4 as nc
import numpy as np

We will open a netCDF file with the growing season length (GSL) from 1956 to 2005. GSL is a climate index indicating conditions favourable for plant growth. It is defined as the number of consecutive days per year with a temperature above 5 °C.

The data is described in Donat et al. ([2013](http://onlinelibrary.wiley.com/doi/10.1002/jgrd.50150/abstract)) and was obtained from http://www.climdex.org/.
We will use this dataset in a later exercise.

The data has already undergone some postprocessing, see [prepare_HadEX2_GSL](./data/prepare_HadEX2_GSL.ipynb).

## Opening the dataset

In [None]:
fN = '../data/HadEX2_GSL.nc'

In [None]:
ncf = nc.Dataset(fN)

print(ncf)

In [None]:
# you can also use ncdump to show the structure of the file
! ncdump -h {fN}

## Print all variables in the dataset

In [None]:
# get all variables
print(ncf.variables.keys())

## Get a variable

You can get a variable from a netCDF file like so:

In [None]:
# get a variable from the file
ncf.variables['lon']

However, this does not load the data yet. What you get is a special kind of data structure, a `netCDF4.Variable`, which also carries some metadata. For example, the variable has a `units` attribute:

In [None]:
ncf.variables['time'].units

To load the actual data (as a numpy array), you have to index it:

In [None]:
# get data of lon from the file
lon = ncf.variables['lon'][:]
# this is a numpy array
lon

Note: if you only need a subset of the data you can index it here: `ncf.variables['lon'][:10]`. This only loads the first ten elements from the file.

### Exercise

* get the values of the latitude

### Solution

In [None]:
lat = ncf.variables['lat'][:]
lat

In [None]:
# load the trend
trend_masked = ncf.variables['trend'][:]

trend_masked

`trend_masked` is also a numpy array, more precisely a masked array. Masked arrays hold one array with the actual data (e.g. \[0, 1, 2\]) and one array that indicates whether each value is masked (= invalid, e.g. \[True, False, False\]). This corresponds to an unmasked array that looks like \[NaN, 1, 2\].

In [None]:
# example

# np.nan instead of np.NaN: the latter alias was removed in NumPy 2.0
ma = np.ma.array([0., 1, 2], mask=[True, False, False], fill_value=np.nan)
ma
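The two parts of a masked array can be inspected directly. The following sketch uses made-up values (not the trend data). It shows the data and mask arrays, how masked values are skipped in a reduction, and how to go back and forth between a masked array and a plain NaN array.

```python
import numpy as np

# hypothetical values, mirroring the small example above
ma = np.ma.array([0.0, 1.0, 2.0], mask=[True, False, False])

data = ma.data              # underlying values, including the masked one
mask = ma.mask              # True where a value is masked
mean_valid = ma.mean()      # masked values are ignored: (1 + 2) / 2

filled = ma.filled(np.nan)          # plain ndarray with NaN at masked positions
back = np.ma.masked_invalid(filled) # mask NaN (and inf) values again

print(data, mask, mean_valid, filled, back)
```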

In [None]:
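# masked arrays can be converted to NaN arrays by filling the masked values
# (np.asarray alone returns the underlying data and ignores the mask)
trend = np.ma.filled(trend_masked, np.nan)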
trend

### Time

Next we load the time vector. It's still in the original form, and thus not very helpful:

In [None]:
# load time
time = ncf.variables['time'][:]
# time is still in 'netCDF' format
time[:10]

netCDF files (should) follow some conventions for storing time stamps. We can convert the original timestamps using `nc.num2date`.

In [None]:
ncv = ncf.variables['time']

print(ncv.units)
print(ncv.calendar)

time = nc.num2date(ncv[:], ncv.units, ncv.calendar)

time[:10]
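To illustrate what such a conversion does, here is a simplified sketch using only the standard library. It assumes units of the form `days since <reference date>` and the standard calendar. The units string and the offsets are hypothetical and not taken from the file, and `days_since_to_datetime` is a helper made up for this example. `nc.num2date` handles many more units and calendars.

```python
from datetime import datetime, timedelta


def days_since_to_datetime(values, units):
    """Decode 'days since <date>' offsets (standard calendar only)."""
    unit, _, reference = units.partition(" since ")
    if unit.strip() != "days":
        raise ValueError("this sketch only handles 'days since ...' units")
    ref = datetime.fromisoformat(reference.strip())
    # each time stamp is the reference date plus the stored offset
    return [ref + timedelta(days=float(v)) for v in values]


units = "days since 1956-01-01 00:00:00"  # hypothetical units string
decoded = days_since_to_datetime([0, 366, 731], units)

for d in decoded:
    print(d.isoformat())
```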

However, this format is still not very helpful, so we'll convert it further to numpy `datetime64` objects.

In [None]:
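# go via ISO strings: this works for datetime.datetime objects and for the
# cftime objects that recent netCDF4 versions return by default
np.asarray([np.datetime64(t.isoformat()) for t in time])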
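One reason `datetime64` arrays are convenient is that they support vectorized comparisons and unit conversions. A small sketch with made-up dates (not read from the file):

```python
import numpy as np
from datetime import datetime

# hypothetical time stamps, one per year
times = [datetime(1956, 7, 1), datetime(1957, 7, 1), datetime(1958, 7, 1)]

t64 = np.asarray(times, dtype="datetime64[s]")

# truncate to years; the integer value counts years since 1970
years = t64.astype("datetime64[Y]").astype(int) + 1970

# boolean mask selecting all time steps from 1957 onwards
sel = t64 >= np.datetime64("1957-01-01")

print(years, sel)
```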