# Trend Analysis

In this workshop, we're going to explore a 'data cube' - a medium-sized vegetation dataset with x, y, and time dimensions.

- What is a data cube anyway?  Is there higher-dimensional data?
- Reducing data to a 1D timeseries
- Calculating summaries along various dimensions
- Dealing with unevenly-spaced observations and missing data
- Drawing awesome plots

In [None]:
# First things first - as usual, we import the tools
import xarray as xr
import numpy as np
import pandas as pd  # named after "panel data" - like Excel, but better

import matplotlib.pyplot as plt
import seaborn
%matplotlib inline

In the course theory you will have heard how passive microwave observations from a series of satellite radiometers can be used to develop time series of a measure called Vegetation Optical Depth, and from this, global annual maps of above-ground biomass. We did this as part of a journal article that you can find in the reading material (Liu et al., 2015). Here we are going to look at these time series and do some trend analysis.

In [None]:
# I put the data on NCI for us, so you don't have to download it again.
data = xr.open_dataset('http://dapds00.nci.org.au/thredds/dodsC/ub8/au/RegionTimeSeries/VOD_NCC2015_VOD_1993-2012.nc')
data

By now, you should be familiar with the display above.  Perhaps you recognise the creator_name in the attributes metadata?

An excellent piece of free software to visualise, explore and map NetCDF data is *Panoply*, developed by NASA. [You can download it here](http://www.giss.nasa.gov/tools/panoply/). To avoid any problems with downloading and installing, we will not be using it in the tutorial, but if you will be working with NetCDF files it is strongly recommended for visual exploration and even for publishing nice-looking maps - it's *much* easier than MATLAB, and still faster than Python (grumble grumble).

*We're skipping a lot of stuff here, where xarray automatically handles things that are tedious and error-prone in MATLAB.  Nice choice to use Python instead!*

Because the dataset has only one variable we're interested in, let's work with the data array instead of the dataset (conceptually, a set of data arrays that happens to only have one entry).  Because this data is relatively small at 14MB, we'll also download the lot to save time later.
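
If the dataset / data array distinction is still fuzzy, here is a minimal sketch on a tiny synthetic cube. The names `toy` and `toy_ds`, the coordinates and the random values are all made up for illustration - nothing here comes from the real VOD file.

```python
import numpy as np
import xarray as xr

# A tiny synthetic cube standing in for the real VOD data (values are random).
rng = np.random.default_rng(0)
toy = xr.DataArray(
    rng.random((4, 3, 2)),
    dims=("lon", "lat", "time"),
    coords={
        "lon": [110.0, 120.0, 130.0, 140.0],
        "lat": [-10.0, -20.0, -30.0],
        "time": [2011, 2012],
    },
    name="VOD",
)

# A Dataset is a dict-like collection of named DataArrays.
toy_ds = xr.Dataset({"VOD": toy})
print(list(toy_ds.data_vars))   # names of the variables it holds

# Attribute access and key access give you the same DataArray.
same = toy_ds.VOD.identical(toy_ds["VOD"])
print(same)

# nbytes is the size of the values: 4 * 3 * 2 float64 numbers of 8 bytes each.
print(toy_ds.VOD.nbytes, "bytes")
```

Checking `nbytes` like this is a quick way to decide whether it's safe to load a whole array into memory.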

In [None]:
VOD = data.VOD
VOD.load()
VOD

Remember that this array still has plenty of metadata - for example, you can see the time of each of the time steps by inspecting `VOD.time` in a new cell ('Insert > Insert Cell Below' in the menu).

In [None]:
# Just to show off, let's make a grid with a VOD map for every timestep
VOD.plot.imshow(robust=True, col='time', col_wrap=5)

Why is the world map on its side?  That's just how the data is stored in this file!  You can change the order of dimensions using [`VOD.transpose('time', 'lat', 'lon')`](http://xarray.pydata.org/en/stable/reshaping.html#reordering-dimensions) (for example), but in this notebook we're going to be reducing the dimensionality of the data and analysing it in a more traditional format (tables! timeseries! statistics!) so it doesn't matter much.  This is of course only possible because we can use the metadata to operate on dimensions by name - much better than having to remember if latitude is `1` or `2` in every file!
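
Here is a small sketch of what 'operating on dimensions by name' buys you. It uses a synthetic array called `toy` (an assumption for illustration, not the real data) stored in the same lon, lat, time order.

```python
import numpy as np
import xarray as xr

# Synthetic cube stored in (lon, lat, time) order, like the file above.
toy = xr.DataArray(
    np.arange(24.0).reshape(4, 3, 2),
    dims=("lon", "lat", "time"),
)

# Reorder by name - no need to know which axis number is which.
reordered = toy.transpose("time", "lat", "lon")
print(reordered.dims)    # ('time', 'lat', 'lon')
print(reordered.shape)   # (2, 3, 4)

# Reductions take names too, so this line works whatever the storage order is.
by_name = toy.mean(dim="time")

# The plain-numpy version is only right if you remember that time is axis 2.
by_axis = toy.values.mean(axis=2)
print(np.allclose(by_name.values, by_axis))   # True
```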

## Above-ground Biomass Carbon

In our (ed. note: Albert's) study we used existing biomass data to develop an equation that predicts Above-ground Biomass Carbon (ABC, in MgC/ha - or 10^6 grammes Carbon per hectare) from Vegetation Optical Depth (VOD). This is called a _retrieval algorithm_, albeit in this case only a partial one: VOD itself was derived from the original passive microwave brightness temperatures using a retrieval algorithm, and in a second step we are extending this to find ABC. The ABC retrieval algorithm can be applied directly to the whole data cube.

You can find the origin and description of the equation in our article. The _arctan_ function calculates the inverse tangent (a trigonometric function, with the result in radians).  Numpy supplies so many such situationally useful functions that nobody remembers them all - just look it up as I looked up arctan when writing this!

In [None]:
# Coefficients
a = 320.6
b = 9.10
c = 0.95
d = 5.5
# The equation
ABC = a * ( np.arctan( b*(VOD-c) ) - np.arctan( b*(0-c) ) ) / (np.arctan( b*(np.inf-c) ) - np.arctan( b*(0-c) ) ) + d
ABC

Well, the equation worked - but there are two problems.  We've lost our attributes metadata, and the array is still called VOD!  Let's fix both of those:

In [None]:
ABC.name = 'ABC'
ABC.attrs = {
    'long_name': 'Above-ground Biomass Carbon',
    'units': 'MgC/ha (mega-grams of Carbon per hectare)',
    'comment': 'Derived from vegetation optical depth.  See Liu et al., 2015.',
    'author': 'Your name goes here!',
}
# Ah, much better - plots will be correctly labelled, and your name will appear if you save the file.
ABC

In [None]:
# Select the last time step, transpose the axes, and plot:
ABC.isel(time=-1).T.plot.imshow(robust=True)

You may notice that ABC is much more concentrated (mainly in the tropics and boreal forests) than VOD was. This is because of the non-linear shape of the retrieval algorithm.
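
To get a feel for that non-linear shape, here is a sketch that applies the same equation to a handful of plain numbers. The function name `vod_to_abc` and the sample VOD values are made up for illustration; the coefficients are copied from the cell above.

```python
import numpy as np

# Coefficients copied from the cell above.
a, b, c, d = 320.6, 9.10, 0.95, 5.5

def vod_to_abc(vod):
    """The same equation as above, as a function of numbers or arrays."""
    low = np.arctan(b * (0 - c))
    high = np.arctan(b * (np.inf - c))   # arctan(inf) is pi/2
    return a * (np.arctan(b * (vod - c)) - low) / (high - low) + d

# Illustrative VOD values, chosen by hand.
vod_values = np.array([0.0, 0.5, 0.9, 1.0, 1.5])
abc_values = vod_to_abc(vod_values)
for v, abc in zip(vod_values, abc_values):
    print(f"VOD = {v:.1f}  ->  ABC = {abc:6.1f} MgC/ha")
```

The curve is an S-shape centred on `c`: a VOD of zero gives exactly `d`, very large VOD approaches `a + d`, and almost all of the change happens close to VOD = 0.95. That's why moderately vegetated areas nearly vanish from the ABC map while dense forests stand out.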

## Reducing data to one dimension

Let's reduce our data to the mean at each latitude (averaging over the time and longitude dimensions) so we can inspect this relationship more closely.

In [None]:
abc_1d = ABC.mean(dim=['time', 'lon'])
abc_1d

While we can work with low-dimensional and smaller data in xarray, it's not at its strongest - because xarray must support all its capabilities for very large and high-dimensional data too.  We'll have a quick look at this approach, then demonstrate pandas.

[Pandas](http://pandas.pydata.org/pandas-docs/stable/) (the name comes from 'panel data', and is also a play on 'Python data analysis') is a powerful and concise package for working with one and two-dimensional data.  It has excellent statistical tools built-in, and makes it easy to manipulate and summarise data.  If you could do it in Excel, Pandas is probably the best way to do it in Python.

In [None]:
# xarray can still plot 1D arrays, of course.  
# Try abc_1d.<tab> to see what other plots are available!
abc_1d.plot.line()

In [None]:
# Convert our abc_1d array to a dataframe (called df by convention)
# then look at a summary.  This is more impressive with multiple columns!
df = abc_1d.to_dataframe()
df.describe()

In [None]:
df.plot()

Interpretation: `mean` has only counted values that are not `nan`, and therefore terrestrial areas only.  There are clear peaks around 60 degrees (Siberia) and from 5 to -10 degrees (the Amazon, central Africa, Indonesia, etc. - the tropics).
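
You can check the 'only counts values that are not `nan`' behaviour on a grid small enough to do in your head. This is a sketch with made-up numbers - the array `toy` and its coordinates are assumptions for illustration only.

```python
import numpy as np
import xarray as xr

# Synthetic (lat, lon) grid: nan marks "ocean" pixels, numbers mark "land".
toy = xr.DataArray(
    [[10.0, np.nan, 30.0],       # two land pixels
     [np.nan, np.nan, np.nan],   # all ocean
     [5.0, 5.0, np.nan]],        # two land pixels
    dims=("lat", "lon"),
    coords={"lat": [10.0, 0.0, -10.0], "lon": [100.0, 110.0, 120.0]},
)

# nan values are skipped by default, so each mean is over land pixels only.
lat_mean = toy.mean(dim="lon")

# count tells you how many non-nan pixels went into each mean.
land_count = toy.count(dim="lon")

print(lat_mean.values)     # 20 for the first row, nan for the all-ocean row, 5 for the last
print(land_count.values)   # 2, 0 and 2 land pixels respectively
```

Keeping the count next to the mean is a useful habit: a latitude band whose mean comes from two pixels deserves less trust than one built from two hundred.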

A few differences between Xarray and Pandas are apparent even for this super-simple plot.

- Pandas has used a legend, where Xarray used a y-axis label.  This is because Pandas can draw the same plot with multiple columns (lines), which may not have identical units.
- Our latitude coordinates actually count down from 90 degrees.  Xarray puts the coordinates in ascending order for display, while Pandas displays exactly what you give it.
- Xarray has limited the plot axis to the part that has data (as our mean ABC in the southern ocean is `nan`, i.e. missing data).  Pandas is often used for things where it's important to distinguish between 'out of range' and 'in range but missing' data, so it does not adjust the axis.

You may prefer one approach or the other, and that's OK - it's simply good to be aware that they treat labelled data a little differently.  In short, Xarray labels must be coordinates ordered in some dimension - but Pandas can label data with almost anything.

## Selecting data by coordinates

If you have been working through the [Software Carpentry](https://software-carpentry.org/lessons/) or [*Think Python*](http://greenteapress.com/wp/think-python-2e/) materials, as I suggest you do, you will be familiar with integer-based indexing.  (recap: `some_list[n]` is the item `n` places after the start of the list).  

This also works on xarray data, using `data.isel(time=0)` - index selection of the first step along the time dimension.  However, we will usually be interested in selecting data based on the coordinates - either a single point, or a smaller area.  [The xarray documentation](http://xarray.pydata.org/en/stable/indexing.html) describes this in detail - let's just look at the two most common examples.

First, selecting a single point - we'll use Canberra as our example:

In [None]:
# This doesn't work, because the exact coordinates we gave aren't in the index
ABC.sel(lat=-35.5, lon=148.75)

In [None]:
# Instead, we should explicitly ask for the nearest point to the location we want.
# You might also give an optional tolerance argument, the maximum acceptable distance
# (for example tolerance=0.2 means 0.2 degrees).  What happens with tolerance=0.1 ?
point = ABC.sel(lat=-35.5, lon=148.75, method='nearest')
point
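
If you'd rather experiment with `tolerance` without touching the real data, here is a sketch on a synthetic 1D array. The name `toy`, the grid values and the target latitude are all assumptions chosen so the distances are easy to check by hand.

```python
import numpy as np
import xarray as xr

# Synthetic 1D array on a 0.25-degree grid of cell centres (values are made up).
lats = np.array([-35.875, -35.625, -35.375, -35.125, -34.875])
toy = xr.DataArray(np.arange(5.0), dims="lat", coords={"lat": lats})

target = -35.3   # not on the grid; the closest value is -35.375, 0.075 degrees away

# Exact lookup fails because -35.3 is not in the index.
try:
    toy.sel(lat=target)
    exact_ok = True
except KeyError:
    exact_ok = False

# Nearest-neighbour lookup works while the match is within the tolerance.
near = toy.sel(lat=target, method="nearest", tolerance=0.1)

# A tolerance smaller than the actual distance raises KeyError again.
try:
    toy.sel(lat=target, method="nearest", tolerance=0.05)
    tight_ok = True
except KeyError:
    tight_ok = False

print(exact_ok, float(near.lat), tight_ok)
```

A tolerance is a cheap safeguard: without one, `method='nearest'` will happily return a pixel on the other side of the world if your coordinates fall outside the grid.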

In [None]:
# You can see the 2003 bushfires, and also the 2009/10 La Niña
point.plot()
plt.title('Mean ABC, Canberra')

Selecting an area is a bit different, because you can't ask for the nearest area.  Instead, you ask for all pixels within a given boundary as represented with `slice(start_coord, end_coord)` objects.  Be careful with the order of your coordinates - if the coordinates are decreasing, the end_coord must be smaller than the start_coord!

In [None]:
# Define our boundaries
aus_lats = slice(-10, -45)
aus_lons = slice(112, 155)
# Select the Australian data, and name the datacube "aus"
aus = ABC.sel(lat=aus_lats, lon=aus_lons)
# Plot, after taking the mean along the time dimension and transposing
aus.mean(dim='time').T.plot.imshow(robust=True)
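
To see why the order of the slice matters, here is a sketch on a synthetic latitude axis stored north-to-south (the array `toy` and its values are made up for illustration).

```python
import numpy as np
import xarray as xr

# Synthetic latitude axis stored north-to-south, i.e. with decreasing values.
lats = np.array([0.0, -10.0, -20.0, -30.0, -40.0, -50.0])
toy = xr.DataArray(np.arange(6.0), dims="lat", coords={"lat": lats})

# The slice must run in the same direction as the stored coordinates.
north_to_south = toy.sel(lat=slice(-10, -45))
south_to_north = toy.sel(lat=slice(-45, -10))   # wrong direction: selects nothing

print(north_to_south.lat.values)      # the four latitudes from -10 down to -40
print(south_to_north.sizes["lat"])    # 0

# Sorting the coordinate first lets you write the slice in ascending order.
ascending = toy.sortby("lat").sel(lat=slice(-45, -10))
print(ascending.lat.values)           # the same four latitudes, from -40 up to -10
```

Note that the wrong-way slice doesn't raise an error - it quietly returns an empty array, so an empty plot is usually the first hint that your slice is backwards.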

In [None]:
# We can also reduce this area to a 1D timeseries, and plot that:
aus.mean(dim=['lat', 'lon']).plot()
plt.title('Mean ABC, Australia')

## Time-series analysis

TODO:

- Saving pandas dataframes to CSV for Excel etc
- Saving Xarray data to NetCDF for ArcGIS or QGIS etc.
- Using scipy.stats for linear regression (noting that mean, median, stdev, etc are built in to arrays)
- Per-pixel regression of an area (vectorised if possible, looping if not)
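
Until this section is written, here is a rough sketch of what the two regression items might look like. Everything in it is an assumption: the data is synthetic with a known trend, the years are simply taken from the range in the file name, and using `np.polyfit` for the per-pixel step is just one possible vectorised approach. It also assumes there are no missing values, which is not true of the real cube (ocean pixels are `nan`), so you would need to mask those first.

```python
import numpy as np
from scipy import stats

# Assumed annual time axis, covering the years in the file name.
years = np.arange(1993, 2013)
rng = np.random.default_rng(42)

# One synthetic series: a known trend of 0.5 units per year, plus noise.
series = 50.0 + 0.5 * (years - years[0]) + rng.normal(0.0, 1.0, years.size)
fit = stats.linregress(years, series)
print(f"slope = {fit.slope:.2f} per year, p = {fit.pvalue:.3g}")

# Per-pixel version: a synthetic (time, lat, lon) cube with no missing values.
cube = series[:, None, None] + rng.normal(0.0, 1.0, (years.size, 3, 4))
flat = cube.reshape(years.size, -1)   # one column per pixel

# np.polyfit fits every column in a single call; row 0 holds the slopes.
# Centring the years keeps the fit well conditioned and leaves the slope unchanged.
coeffs = np.polyfit(years - years.mean(), flat, deg=1)
slope_map = coeffs[0].reshape(cube.shape[1:])
print(slope_map.shape)   # (3, 4) - one slope per pixel
```

The vectorised fit gives the same slopes as calling `linregress` on each pixel in a loop, but it doesn't return p-values, so looping may still be the simpler choice if you need significance tests.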

## This workshop notebook is incomplete.

Please ask Zac for suggestions if you're working this far ahead!