# Trend Analysis

In this workshop, we're going to explore a 'data cube' - a medium-sized vegetation dataset with x, y, and time dimensions.

- What is a data cube anyway?  Is there higher-dimensional data?
- Reducing data to a 1D timeseries
- Calculating summaries along various dimensions
- Dealing with unevenly-spaced observations and missing data
- Drawing awesome plots

In [None]:
# First things first - as usual, we import the tools
import xarray as xr
import numpy as np
import pandas as pd      # "Python ANd Data AnalysiS" - like Excel, but better

import matplotlib.pyplot as plt
import seaborn
%matplotlib inline

In [None]:
# Let's disable the white gridlines for this notebook.
# See https://seaborn.pydata.org/tutorial/aesthetics.html - you might even pick a different style!
# Matplotlib also has several styles built in - see `plt.style.available` for the list.
seaborn.set_style("dark")

In the course theory you will have heard how passive microwave observations from a series of satellite radiometers can be used to develop time series of a measure called Vegetation Optical Depth, and from this, global annual maps of above-ground biomass. We did this as part of a journal article that you can find in the reading material (Liu et al., 2015). Here we are going to look at these time series and do some trend analysis.

In [None]:
# I put the data on NCI for us, so you don't have to download it again.
data = xr.open_dataset('http://dapds00.nci.org.au/thredds/dodsC/ub8/au/RegionTimeSeries/VOD_NCC2015_VOD_1993-2012.nc')
data

By now, you should be familiar with the display above.  Perhaps you recognise the creator_name in the attributes metadata?

An excellent piece of free software to visualise, explore and map NetCDF data is *Panoply*, developed by NASA. [You can download it here](http://www.giss.nasa.gov/tools/panoply/). To avoid any problems with downloading and installing, we will not be using it in the tutorial, but if you will be working with NetCDF data it is strongly recommended for visual exploration and even for publishing nice-looking maps - it's *much* easier than MATLAB, and still faster than Python (grumble grumble).

*We're skipping a lot of stuff here, where xarray automatically handles things that are tedious and error-prone in MATLAB.  Nice choice to use Python instead!*

Because the dataset has only one variable we're interested in, let's work with the data array instead of the dataset (conceptually, a dataset is a collection of data arrays, which here happens to have only one entry).  Because this data is relatively small at 14 MB, we'll also download the lot to save time later.

In [None]:
VOD = data.VOD
VOD.load()
VOD

Remember that this array still has plenty of metadata - for example, you can see the time of each of the time steps by inspecting `VOD.time` in a new cell ('Insert > Insert Cell Below' in the menu).

In [None]:
# Just to show off, let's make a grid with a VOD map for every timestep
VOD.plot.imshow(robust=True, col='time', col_wrap=5)

Why is the world map on its side?  That's just how the data is stored in this file!  You can change the order of dimensions using [`VOD.transpose('time', 'lat', 'lon')`](http://xarray.pydata.org/en/stable/reshaping.html#reordering-dimensions) (for example), but in this notebook we're going to be reducing the dimensionality of the data and analysing it in a more traditional format (tables! timeseries! statistics!) so it doesn't matter much.  This is of course only possible because we can use the metadata to operate on dimensions by name - much better than having to remember if latitude is `1` or `2` in every file!
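
To make the "operate on dimensions by name" point concrete, here is a minimal sketch on a tiny synthetic cube. The `(lon, lat, time)` storage order and all the values are assumptions for illustration only, not the layout of the real VOD file.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a data cube, stored as (lon, lat, time) for illustration
cube = xr.DataArray(
    np.arange(24.0).reshape(4, 3, 2),
    dims=("lon", "lat", "time"),
    coords={"lon": [0, 90, 180, 270], "lat": [45, 0, -45], "time": [1993, 1994]},
    name="demo",
)

# Reorder the dimensions by name rather than by position
reordered = cube.transpose("time", "lat", "lon")
print(reordered.dims)

# Reducing by name gives the same answer whichever order the data is stored in
mean_original = float(cube.mean(dim="lon").sel(lat=0, time=1993))
mean_reordered = float(reordered.mean(dim="lon").sel(lat=0, time=1993))
print(mean_original, mean_reordered)
```

Because both reductions name `lon` explicitly, neither needs to know which axis number longitude happens to be.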

## Above-ground Biomass Carbon

In our (ed. note: Albert's) study we used existing biomass data to develop an equation that predicts Above-ground Biomass Carbon (ABC, in MgC/ha - or 10^6 grammes Carbon per hectare) from Vegetation Optical Depth (VOD). This is called a _retrieval algorithm_, albeit in this case only a partial one: VOD itself was derived from the original passive microwave brightness temperatures using a retrieval algorithm, and in a second step we are extending this to find ABC. The ABC retrieval algorithm can be applied directly to the whole data cube.

You can find the origin and description of the equation in our article.  The _arctan_ function calculates the inverse tangent (a trigonometric function, with the result in radians).  Numpy supplies so many such situationally useful functions that nobody remembers them all - just look them up, as I looked up arctan when writing this!

In [None]:
# Coefficients
a = 320.6
b = 9.10
c = 0.95
d = 5.5
# The equation
ABC = a * ( np.arctan( b*(VOD-c) ) - np.arctan( b*(0-c) ) ) / (np.arctan( b*(np.inf-c) ) - np.arctan( b*(0-c) ) ) + d
ABC

Well, the equation worked - but there are two problems.  We've lost our attributes metadata, and the array is still called VOD!  Let's fix both of those:

In [None]:
ABC.name = 'ABC'
ABC.attrs = {
    'long_name': 'Above-ground Biomass Carbon',
    'units': 'MgC/ha (mega-grams of Carbon per hectare)',
    'comment': 'Derived from vegetation optical depth.  See Liu et al., 2015.',
    'author': 'Your name goes here!',
}
# Ah, much better - plots will be correctly labelled, and your name will appear if you save the file.
ABC

In [None]:
# Select the last time step, transpose the axes, and plot:
ABC.isel(time=-1).T.plot.imshow(robust=True)

You may notice that ABC is much more concentrated (mainly in the tropics and boreal forests) than VOD was. This is because of the non-linear shape of the retrieval algorithm.
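
To see that non-linearity directly, the sketch below wraps the same equation and coefficients in a small function and evaluates it at a few VOD values. The function name `vod_to_abc` and the sample VOD values are made up for illustration; `np.pi / 2` is simply the value of `np.arctan(np.inf)`, which is what the upper term evaluates to in the cell above.

```python
import numpy as np

def vod_to_abc(vod, a=320.6, b=9.10, c=0.95, d=5.5):
    """Apply the ABC retrieval equation from the notebook to VOD values."""
    lower = np.arctan(b * (0 - c))  # arctan term at VOD = 0
    upper = np.pi / 2               # limit of arctan as VOD goes to infinity
    return a * (np.arctan(b * (vod - c)) - lower) / (upper - lower) + d

# Illustrative VOD values only - not taken from the dataset
vod_values = np.array([0.0, 0.5, 0.95, 1.2])
abc_values = vod_to_abc(vod_values)
for vod, abc in zip(vod_values, abc_values):
    print(f"VOD = {vod:.2f}  ->  ABC = {abc:.1f} MgC/ha")
```

ABC starts at `d` when VOD is zero, climbs steeply around VOD = `c`, and levels off towards `a + d`, so fairly small differences in VOD near `c` become large differences in ABC.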

## Reducing data to one dimension

Let's reduce our data to one dimension, by taking the mean over time and longitude for each latitude, so we can inspect this pattern more closely.

In [None]:
abc_1d = ABC.mean(dim=['time', 'lon'])
abc_1d

While we can work with low-dimensional and smaller data in xarray, it's not at its strongest there - because xarray must also support all its capabilities for very large and high-dimensional data.  We'll have a quick look at this approach, then demonstrate pandas.

[Pandas](http://pandas.pydata.org/pandas-docs/stable/) (the name comes from 'panel data', and is also a play on 'Python data analysis') is a powerful and concise package for working with one- and two-dimensional data.  It has excellent statistical tools built-in, and makes it easy to manipulate and summarise data.  If you could do it in Excel, Pandas is probably the best way to do it in Python.

Note that a "dataframe" is a 2D table of data, from Pandas.  "Tidy data" has one observation per row, and one variable per column - this makes analysis far easier.
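
As a small illustration of what "tidy" means here, this sketch converts a made-up 2 x 2 grid into a dataframe with one row per observation. The coordinates and values are invented; only `to_dataframe` and `reset_index` are the real xarray and Pandas methods.

```python
import xarray as xr

# A tiny synthetic (lat, lon) grid - values are invented for illustration
small = xr.DataArray(
    [[1.0, 2.0], [3.0, 4.0]],
    dims=("lat", "lon"),
    coords={"lat": [10, -10], "lon": [100, 110]},
    name="ABC",
)

# One row per (lat, lon) observation, one column per variable
tidy = small.to_dataframe().reset_index()
print(tidy)
```

Each grid cell becomes its own row, with `lat`, `lon` and `ABC` as columns.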

In [None]:
# xarray can still plot 1D arrays, of course.  
# Try abc_1d.<tab> to see what other plots are available!
abc_1d.plot.line()

In [None]:
# Convert our abc_1d array to a Pandas dataframe (called df by convention)
# then look at a summary.  This is more impressive with multiple columns!
df = abc_1d.to_dataframe()
df.describe()

In [None]:
df.plot()

Interpretation: `mean` has only counted values that are not `nan`, and therefore terrestrial areas only.  There are clear peaks around 60 degrees (Siberia) and between 5 and -10 degrees (the tropics - the Amazon, central Africa, Indonesia, etc.).
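
The `nan`-skipping behaviour is easy to check on a toy grid. In this sketch the values and coordinates are invented, with `nan` standing in for ocean pixels; `skipna` is the xarray argument that controls the behaviour.

```python
import numpy as np
import xarray as xr

# Synthetic (lat, lon) grid: NaN marks "ocean" pixels
grid = xr.DataArray(
    [[np.nan, 10.0, 30.0],       # two land pixels
     [np.nan, np.nan, 50.0],     # one land pixel
     [np.nan, np.nan, np.nan]],  # all ocean
    dims=("lat", "lon"),
    coords={"lat": [10, 0, -60], "lon": [100, 110, 120]},
)

land_mean = grid.mean(dim="lon")                  # NaNs are skipped by default
strict_mean = grid.mean(dim="lon", skipna=False)  # any NaN in a row gives NaN
print(land_mean.values)
print(strict_mean.values)
```

Rows with at least one land pixel get the mean of their land values only, and the all-ocean row stays `nan` - which is why the southern-ocean latitudes have no mean ABC at all.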

A few differences between Xarray and Pandas are apparent even for this super-simple plot.

- Pandas has used a legend, where Xarray used a y-axis label.  This is because Pandas can draw the same plot with multiple columns (lines), which may not have identical units.
- Our latitude coordinates actually count down from 90 degrees.  Xarray puts the coordinates in ascending order for display, while Pandas displays exactly what you give it.
- Xarray has limited the plot axis to the part that has data (as our mean ABC in the southern ocean is `nan`, i.e. missing data).  Pandas is often used for things where it's important to distinguish between 'out of range' and 'in range but missing' data, so it does not adjust the axis.

You may prefer one approach or the other, and that's OK - it's simply good to be aware that they treat labelled data a little differently.  In short, Xarray labels must be coordinates ordered in some dimension - but Pandas can label data with almost anything.

## Selecting data by coordinates

If you have been working through the [Software Carpentry](https://software-carpentry.org/lessons/) or [*Think Python*](http://greenteapress.com/wp/think-python-2e/) materials, as I suggest you do, you will be familiar with integer-based indexing.  (recap: `some_list[n]` is the item `n` places after the start of the list).  

This also works on xarray data, using `data.isel(time=0)` - index selection of the first step along the time dimension (you can even index without names, if you don't mind selecting the wrong data).  However, we will usually be interested in selecting data based on the coordinates - either a single point, or a smaller area.  [The xarray documentation](http://xarray.pydata.org/en/stable/indexing.html) describes this in detail - let's just look at the two most common examples.

First, selecting a single point - we'll use Canberra as our example:

In [None]:
# This doesn't work, because the exact coordinates we gave aren't in the index
ABC.sel(lat=-35.5, lon=148.75)

In [None]:
# Instead, we should explicitly ask for the nearest point to the location we want.
# You might also give an optional tolerance=0.2 argument, which sets the maximum
# acceptable distance (in this example, 0.2 degrees).  What happens with tolerance=0.1 ?
point = ABC.sel(lat=-35.5, lon=148.75, method='nearest')
point
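
If you'd like to see both behaviours side by side without touching the real data, here is a minimal sketch on a made-up 3 x 3 grid. The grid coordinates are assumptions chosen so that the requested point falls between grid cells.

```python
import numpy as np
import xarray as xr

# Synthetic grid whose coordinates do not include the requested point
grid = xr.DataArray(
    np.arange(9.0).reshape(3, 3),
    dims=("lat", "lon"),
    coords={"lat": [-35.0, -35.4, -35.8], "lon": [148.4, 148.8, 149.2]},
)

# An exact lookup raises a KeyError, because -35.5 and 148.75 are not in the index
try:
    grid.sel(lat=-35.5, lon=148.75)
    exact_lookup_failed = False
except KeyError:
    exact_lookup_failed = True

# A nearest-neighbour lookup snaps to the closest coordinates instead
nearest = grid.sel(lat=-35.5, lon=148.75, method="nearest")
print(exact_lookup_failed, float(nearest.lat), float(nearest.lon))
```

Note that the result carries the coordinates of the grid cell that was actually selected, so you can always check how far you were moved.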

In [None]:
# You can see the 2003 bushfires, and also the 2009/10 La Niña
point.plot()
plt.title('Mean ABC, Canberra')

Selecting an area is a bit different, because you can't ask for the nearest area.  Instead, you ask for all pixels within a given boundary as represented with `slice(start_coord, end_coord)` objects.  Be careful with the order of your coordinates - if the coordinates are decreasing, the end_coord must be smaller than the start_coord!
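
The ordering rule is easiest to see on a toy example. This sketch uses an invented latitude axis that counts down from north to south; the 10-degree spacing is an assumption for illustration.

```python
import numpy as np
import xarray as xr

# Latitudes stored from north to south: 80, 70, ..., -80
lat = np.arange(80.0, -81.0, -10.0)
profile = xr.DataArray(np.arange(lat.size, dtype=float), dims="lat", coords={"lat": lat})

north_to_south = profile.sel(lat=slice(-10, -45))  # matches the storage order
south_to_north = profile.sel(lat=slice(-45, -10))  # wrong way round for this index
print(north_to_south.lat.values)
print(south_to_north.sizes["lat"])
```

The second selection doesn't raise an error - it silently returns an empty array, so an unexpectedly blank plot is a good hint to check your slice order.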

In [None]:
# Define our boundaries
aus_lats = slice(-10, -45)
aus_lons = slice(112, 155)
# Select the Australian data, and name the datacube "aus"
aus = ABC.sel(lat=aus_lats, lon=aus_lons)
# Plot, after taking the mean along the time dimension and transposing
aus.mean(dim='time').T.plot.imshow(robust=True)

In [None]:
# We can also reduce this area to a 1D timeseries, and plot that:
aus.mean(dim=['lat', 'lon']).plot()
plt.title('Mean ABC, Australia')

## Saving Data

Sometimes, you will need to share your data with someone who doesn't use Python.  At this point it is important that you can save your data to a standard, interoperable file format - not like the MATLAB files we used in notebook 3.

For gridded data, the earth sciences community has largely standardised on NetCDF - of course other formats still exist, but NetCDF is very flexible and the built-in metadata makes it "self-describing".  Any scientific GIS tool should support NetCDF.

For tabular data, we'll use `.csv` format.  This can be opened by almost anything, including Microsoft Excel - in fact, it is so easy to share csv data that better-but-specialised formats have never really caught on.

In [None]:
# Saving an Xarray dataset as NetCDF is *really* easy:
aus.to_netcdf('australia_datacube_ABC.nc')

Done!  You can now load this up in another cell, or - if you're on your own computer - install NASA's [Panoply](https://www.giss.nasa.gov/tools/panoply/) viewer and take a look.  You might also want to save a subset of your input data, if you're selecting and transforming a small part of a much larger dataset - or if you want to keep working without internet access!

Saving Pandas dataframes to csv is just as easy:

In [None]:
# `df` holds the mean ABC at each latitude (or at each time step, if you changed the dimensions above)
df.to_csv('global_mean_ABC.csv')

In [None]:
# I'll also create and save a table with multiple columns

# First, make a dataframe (rows and columns) from the global mean
table = ABC.mean(dim=['lat', 'lon']).to_dataframe()
# and give it a meaningful name ("ABC" is correct, but doesn't distinguish columns)
table.columns = ['ABC (global mean)']

# Then add new series (columns) to the dataframe with meaningful names
table['ABC (Aus. mean)'] = aus.mean(dim=['lat', 'lon']).to_series()
table['ABC (Canberra)'] = point.to_series()

# Finally, save to .csv and print a summary of the data
table.to_csv('ABC_timeseries_comparison.csv')
table.describe()

Congratulations - you've saved your data in a format that anyone can use!  Try opening the comparison data in Excel, and see that it really worked.
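
You can also check the round trip from Python. The sketch below writes a small made-up table to a temporary file and reads it back; the numbers and the file name are invented, and it assumes the index is called `time`, as it would be for a table built from the data cube.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the comparison table, indexed by timestamp
table_demo = pd.DataFrame(
    {"ABC (global mean)": [40.0, 41.0, 39.5], "ABC (Aus. mean)": [12.0, 12.5, 11.0]},
    index=pd.to_datetime(["1993-01-01", "1994-01-01", "1995-01-01"]).rename("time"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_timeseries.csv")
    table_demo.to_csv(path)
    # index_col and parse_dates restore the time index when reading back
    restored = pd.read_csv(path, index_col="time", parse_dates=True)

print(restored.index.year.tolist())
print(restored)
```

Without `parse_dates=True` the timestamps would come back as plain strings, and `restored.index.year` would not work.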


## Time-series analysis

We now have a table with three timeseries, so let's do some more traditional statistics with Python.

Pandas and Xarray both have simple statistics like min, median, max, mean, std (standard deviation), and so on all built-in as methods of the data.  Try `table.std()` for example, and then `table.median(axis=1)`.  What does the `axis` argument do?

Graphing your data is always a good way to start, as [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet) shows.  If you're not familiar with this, Seaborn has [several small demonstration datasets](https://github.com/mwaskom/seaborn-data) that you can load with e.g. `seaborn.load_dataset('anscombe')` - play around a bit!

In [None]:
table.plot()

Does there appear to be a trend in ABC? If so, describe it - is it linear, a step change, or a different type of trend?

And having confirmed that the data isn't *too* weird, we'll try calculating a least-squares linear regression.  For statistics like this, `scipy.stats` is a good choice.  If you want something fancier, I would reach for the [`statsmodels`](http://www.statsmodels.org/stable/) package - and a good textbook, as it's very easy to get correct-but-meaningless output from advanced tools.

In [None]:
# Load the linregress function, and check its documentation.
from scipy.stats import linregress
help(linregress)

So we need two arrays of the same length, suitable for numerical calculations.  For `x`, we'll use `table.index.year` - the year taken from each timestamp - because `linregress` only deals with unitless numbers.  You could alternatively convert to seconds since 1970-01-01, but an annual rate of change is probably easier to interpret!  For `y`, we simply select one of the series from our table.

In [None]:
reg = linregress(x=table.index.year, y=table['ABC (Canberra)'])
reg

Does this fitted linear trend summarise the temporal pattern well? From a statistical perspective, we can answer that question by looking at the statistical significance of the fitted model. You may recall that the p-value is a measure of statistical significance - in this case of a trend - and p < 0.05 is usually considered significant.

Even if statistical testing suggests there is a significant trend, that does not mean that there is good evidence for a linear trend. For instance, in the example above, you might argue that there is a step change to very low values in Canberra between 2003 and 2009, after which values increased again. Such step changes are not well captured by simple trend models, and their results can be deceiving.

This is why `linregress` also shows you the standard error, and the rvalue (which can be used to calculate r-squared, the coefficient of determination).  Given this extra information, do you think that this timeseries exhibits a linear trend?  Why or why not?
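
For reference, here is how those extra numbers can be pulled out of a `linregress` result. This is a sketch on a synthetic series (a steady 0.5 per year rise plus an alternating wiggle), not the Canberra data, so the numbers it prints say nothing about the real trend.

```python
import numpy as np
from scipy.stats import linregress

# Synthetic 20-year series: a linear rise plus a small alternating wiggle
years = np.arange(1993, 2013)
wiggle = np.tile([1.0, -1.0], 10)
series = 100.0 + 0.5 * (years - 1993) + wiggle

fit = linregress(x=years, y=series)
r_squared = fit.rvalue ** 2  # coefficient of determination

print(f"slope = {fit.slope:.3f} per year (standard error {fit.stderr:.3f})")
print(f"r-squared = {r_squared:.2f}, p-value = {fit.pvalue:.4f}")
```

Comparing the slope with its standard error, and r-squared with 1, gives a quick feel for how much of the variation a straight line really explains.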


## Change Mapping

Another approach is to examine change over time, by calculating the change from each timestep to the next.  This is known as a finite difference or discrete difference, and is [built into xarray](http://xarray.pydata.org/en/stable/generated/xarray.DataArray.diff.html) just like `mean`.
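
As a toy illustration of what a first difference does, this sketch applies `diff` to a four-step series with invented values.

```python
import xarray as xr

# A made-up annual series
series = xr.DataArray(
    [10.0, 12.0, 11.0, 15.0],
    dims="time",
    coords={"time": [2009, 2010, 2011, 2012]},
)

step_change = series.diff("time")
print(step_change.values)       # each value is x[t] - x[t-1]
print(step_change.time.values)  # by default, labelled with the later step of each pair
```

The result is one step shorter than the input, because the first time step has nothing to be compared with.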

In [None]:
# Take the Australian data, transpose axes, difference along time dimension
change = aus.T.diff('time')
# Update name and units
change.name = 'Annual change in ABC'
change.attrs['units'] = 'MgC/ha-yr (mega-grams of Carbon per hectare per year)'
# Show another grid of maps
change.plot.imshow(robust=True, cmap='RdYlGn', col='time', col_wrap=4)

Can you draw any conclusions about spatial or temporal trends from these maps?  What if you use `robust=False` or manually specify `vmin` and `vmax` values for the colour map?  Calculating the second difference (see the docs!) or calculating statistics for smaller areas is left as an exercise - this notebook has given you all the pieces, and it's up to you to put them together.


## Summary and Research Ideas

This tutorial showed examples of parameter retrieval, dimensionality reduction, trend mapping and statistical significance testing.

On [Australia's Environment Explorer](http://www.ausenv.online) you can find several environmental variables that are available in NetCDF-formatted data cubes. In principle, you could do a spatio-temporal trend analysis on any of these data. Have a look at that website, and see if there is anything that raises a question or research idea in your mind. You can also read the accompanying annual environment report for inspiration.