# Image Analysis with Python

In this workshop, we're going to load and explore some satellite imagery, then calculate some indices such as NDVI.
By the end, you will have learned a bit about the NetCDF data format and how to load and visualise gridded data, and you'll be able to say that you've worked with "Big Data"!

Below, I've written some demonstration code to:

1. load a MODIS composite image of Australia
2. slice the array (e.g. monochrome images, mean colour)
3. view the image
4. calculate and view NDVI

First, let's import the libraries (packages of code) that we want to use for this task.  `numpy` is the foundation of scientific Python, supporting very fast numerical operations on arrays.  `matplotlib` is, well, a MATLAB-style plotting library; we're importing `seaborn` because it makes the default styles much nicer (and has nice statistical graphs, but that's a topic for an optional session later).  `xarray` [(docs)](http://xarray.pydata.org) is the nicest way to work with labelled multidimensional data - specialised, but indispensable for us.

In [None]:
# NumPy for arrays, and Xarray for gridded geospatial datasets
import numpy as np
import xarray as xr

# To draw some images, and with nice styles
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Next, we'll open our dataset for this workshop - a MODIS composite covering all of Australia.

For the interested, here's how I found the data - note that **you won't need to do this for the course**; I'll be happy to help you find data if you need it for your research paper.  Identifying what data you want or need and where it can be found is often one of the most challenging parts of an analysis.  In this case I visited the TERN (Terrestrial Ecosystem Research Network) AusCover website, and searched for "MODIS mosaic" - finding [this page](http://data.auscover.org.au/xwiki/bin/view/Product+pages/LPDAAC+Mosaics+MxD09+CMAR).  From there I clicked through NetCDF and Australian mosaics to find [this listing](http://data.auscover.org.au/thredds/catalog/auscover/lpdaac-csiro/c5/v2-nc4/aust/catalog.html).  I chose "Nadir BRDF-Adjusted Reflectance 16-Day 500m - Combined" because it adjusts for atmospheric effects, view angle, illumination, and so on; and taking the best quality pixel in 16 days avoids most problems with clouds (at least in Australia!).

`xarray`, the tool we will use for NetCDF data, can load data from a URL as well as a file.  If the data is provided via OPeNDAP (the **O**pen-source **P**roject for a **Ne**twork **D**ata **A**ccess **P**rotocol), `xarray` will automatically avoid downloading data until you need it - so opening very large collections of files only transfers a little metadata, and taking subsets is usually quite efficient.  We'll therefore avoid downloading anything manually, and just use the OPeNDAP link for a recent mosaic.  (There's also multi-file support, but more on that later.)

In [None]:
# Specify the filename and url on two lines for readability
# Note that there are tools to discover these links automatically, but we'll do it manually for now
file_name = 'MCD43A4.2017.073.aust.005.nadir_brdf_adjusted_reflectance.nc'
url = 'http://data.auscover.org.au/thredds/dodsC/auscover/lpdaac-csiro/c5/v2-nc4/aust/MCD43A4.005/2017.03.14/'

# Open the dataset and see what's inside
ds = xr.open_dataset(url + file_name)
ds

What does all of this mean?  In order:

- `<xarray.Dataset>` means that this represents... an [xarray dataset](http://xarray.pydata.org/en/stable/data-structures.html#dataset): "a dict-like container of labeled arrays (DataArray objects) with aligned dimensions. It is designed as an in-memory representation of the data model from the netCDF file format."

- `Dimensions:` lists the dimensions of this data in space and time, and the size of the data in each dimension.  Latitude, longitude, and time are pretty obvious, but what could `nv` represent?  We'll see in the next point!

- `Coordinates:` contains a list of "coordinate arrays", which tell us the location that each index in our arrays of data corresponds to in a real coordinate system.  In a NetCDF file these are stored separately from the dimensions, but `xarray` is smart enough to handle this for us; the dimension or dimensions of each coordinate array are shown in brackets, then the type, then the first few values.  For example, time is measured along the time dimension (yep!), as a 64-bit date-and-time stamp with nanosecond precision (!!), and the first value is March 14th, 2017.

- `Dimensions without coordinates: nv`... huh?  If you look down at the data variables section, the `time_bounds` array has a `time` and an `nv` dimension.  Looking at `ds.nv` in a new cell shows that it's equal to `[0, 1]` - so it looks like this is used to describe the edges of the pixels.  It's useful to recognise this when you see it, but we won't be using pixel boundaries in this course!

- `Data variables:` is where the fun really starts: they're the variables with data in them!  (did you guess?)

  - `crs` is the coordinate reference system.  Critically important for changing the data projection, but we'll just use lat/long coordinates and don't care today.
  - `time_bounds`, `lat_bounds`, and `lon_bounds` describe pixel boundaries, as mentioned above.
  - `nbar_0620_0670nm` through to `nbar_2105_2155nm` are the surface reflectance data - the whole reason we got this file!
  - `typical_mask`, `quality`, and `snow` all describe quality issues with the data.  This is something you should usually think about and handle carefully to ensure that your results are valid, but is less important when you're just learning to open files and explore the data.
    
  Note that where the coordinates entries showed their first few values, the data variables just show `...`.  Because we are loading over the internet, the data is not actually downloaded until we attempt to use it.  This makes many operations (e.g. subsetting) much faster, because we can avoid downloading anything we won't actually use.  This is convenient on a 400MB file, and crucial when working with larger collections (such as all such files through time) which may be many gigabytes or even terabytes.
  
- Finally, `Attributes:` lists all the other metadata of the file in general.  Author, contact, provenance, description; it's all there.  You can inspect `ds.attrs` in a new cell to see the full text of each, since it's truncated to fit on one line in this summary.  The `Conventions`, `standard_name_vocabulary`, and `keywords_vocabulary` indicate that this file uses standardised names (and which version of each standard), which enables much more useful automatic analysis in specialised programs.
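To make the parts of that summary concrete, here is a minimal sketch that builds a tiny dataset by hand and looks up each part in turn.  Everything in it is made up for illustration (the `toy` name, the 2 x 3 grid, the values and the attribute text); only the band name is borrowed from the real file, and the dimension names are an assumption.

```python
import numpy as np
import xarray as xr

# A hand-made stand-in for the MODIS file: one time step on a 2 x 3 grid.
# All values here are invented for illustration.
toy = xr.Dataset(
    data_vars={
        # Each data variable lists its dimensions, then its values
        'nbar_0620_0670nm': (
            ('time', 'latitude', 'longitude'),
            np.arange(6, dtype='float32').reshape(1, 2, 3) / 10,
        ),
    },
    coords={
        'time': np.array(['2017-03-14'], dtype='datetime64[ns]'),
        'latitude': [-10.0, -10.5],
        'longitude': [112.0, 112.5, 113.0],
    },
    attrs={'summary': 'A toy dataset for demonstration'},
)

# Dimensions: a mapping of name -> size
print(dict(toy.sizes))
# Coordinates: the real-world location of each index
print(toy.latitude.values)
# Data variables: the arrays themselves, with their dimensions attached
print(toy['nbar_0620_0670nm'].dims)
# Attributes: free-form metadata about the whole dataset
print(toy.attrs['summary'])
```

Displaying `toy` on its own in a cell prints the same kind of summary as `ds`, just small enough to read in full.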

Let's have a quick look at some of these data and metadata...

In [None]:
ds.attrs['summary']

In [None]:
# This is the coordinate array for time.
# Note that arrays have their own metadata, just like the full dataset
ds.time

In [None]:
# Let's check a more complicated data array
ds.nbar_0459_0479nm

TODO quick summary of chunk sizes

TODO set up that we'll be plotting things, and describe why we prefer high-level to low level tools.  
TODO first demonstrate an example with a mpl.pyplot function, with title etc, then show the plot method approach.  Also with and without robust=True; and then show multipanel plots.
(this ties into the idea of object orientation - plug *Think Python* at this point)
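As a rough sketch of the low-level versus high-level contrast, the cell below draws the same invented grid twice: once with a `matplotlib` function, where we add every label ourselves, and once with the array's own `plot` method, which reads the labels from the array.  The `grid` array, its coordinates and its name are all made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

# An invented 4 x 5 grid with labelled coordinates
grid = xr.DataArray(
    np.arange(20.0).reshape(4, 5),
    dims=('latitude', 'longitude'),
    coords={
        'latitude': [-10, -11, -12, -13],
        'longitude': [112, 113, 114, 115, 116],
    },
    name='reflectance',
)

fig, (low, high) = plt.subplots(ncols=2, figsize=(10, 4))

# Low-level: matplotlib only sees bare numbers, so we label everything ourselves
low.imshow(grid.values)
low.set_title('matplotlib function')
low.set_xlabel('longitude')
low.set_ylabel('latitude')

# High-level: the plot method takes axis labels and tick values from the array
grid.plot.imshow(ax=high)
high.set_title('plot method')
```

The high-level version is shorter, and its axes show real coordinates rather than pixel indices.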

In [None]:
small = slice(None, None, 50)
ds['nbar_0620_0670nm'].isel(time=0, latitude=small, longitude=small).plot.imshow(robust=True)
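Two details in that cell are worth unpacking.  `slice(None, None, 50)` keeps every 50th pixel, so far less data has to be downloaded.  `robust=True` chooses the colour limits from the 2nd and 98th percentiles of the data rather than the minimum and maximum, so a few extreme pixels don't wash out the image.  The sketch below reproduces both ideas with plain NumPy; the `values` array is invented, and the percentile calculation only approximates what the plotting code does internally.

```python
import numpy as np

# Made-up reflectance values: ordinary pixels plus one extreme outlier
values = np.linspace(0.0, 0.5, 1000)
values[-1] = 100.0

# slice(None, None, 50) is the object behind the familiar values[::50]
small = slice(None, None, 50)
subset = values[small]
print(len(values), '->', len(subset))

# Colour limits taken from the full range are dominated by the outlier...
full_range = (values.min(), values.max())
# ...while the 2nd and 98th percentiles ignore it
robust_range = tuple(np.percentile(values, [2, 98]))
print(full_range, robust_range)
```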

In [None]:
# Sample URL - full data expected mid-2017
data_url = 'http://dapds00.nci.org.au/thredds/dodsC/uc0/rs0_dev/multiple_band_variables/LS7_ETM_NBAR_P54_GANBAR01-002_089_078_2015_153_-26.nc'
# Open the dataset, but only download contents as needed - at this stage, just the metadata
data = xr.open_dataset(data_url)
# Drop the coordinate reference system variable (it's not a band)
data = data.drop_vars('crs')
# Translate band numbers to names.  `rename` returns a new dataset rather than
# changing `data` in place, so we assign the result back to the same name.
data = data.rename({
    'band1': 'blue',
    'band2': 'green',
    'band3': 'red',
    'band4': 'nir',
    'band5': 'swir1',
    'band6': 'swir2',
})

# Display a quick summary
data

In [None]:
# TODO:  explain how this is stacking arrays along a new axis
arr = data.isel(time=0).to_array(dim='band')
arr
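`to_array` takes every data variable in a dataset and stacks them along a new dimension, labelled with the variable names, so several same-shaped 2D bands become one 3D array.  A minimal sketch with invented data (the `bands` dataset, its 2 x 2 grid, and the `y`/`x` dimension names are made up for illustration):

```python
import numpy as np
import xarray as xr

# Two invented 2 x 2 "bands" that share the same dimensions
bands = xr.Dataset({
    'red': (('y', 'x'), np.full((2, 2), 0.1)),
    'nir': (('y', 'x'), np.full((2, 2), 0.4)),
})

# Stack the variables along a new leading dimension called 'band'
stacked = bands.to_array(dim='band')

print(stacked.dims)
print(list(stacked.band.values))
# Selecting by label along the new dimension recovers a single band
print(stacked.sel(band='nir').values)
```

Because the band is now an ordinary labelled dimension, it can be used like any other, for example to lay out one panel per band when plotting.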

In [None]:
arr.plot.imshow(robust=True, col='band', col_wrap=3)
# TODO: illustrate and explain difference between plt.imshow(...) and (...).plot.imshow()
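
With named `red` and `nir` bands in hand, NDVI (the normalised difference vegetation index) is a single line of array arithmetic: `(nir - red) / (nir + red)`.  Here is a minimal sketch on made-up reflectance values; the arrays are invented, and with the real data the same expression should work on `data.nir` and `data.red`, assuming both hold floating-point reflectances.

```python
import numpy as np

# Invented surface reflectances for three pixels:
# dense vegetation, bare soil, and water
red = np.array([0.05, 0.25, 0.04])
nir = np.array([0.45, 0.30, 0.02])

# Normalised difference: ranges from -1 to 1, and higher means greener
ndvi = (nir - red) / (nir + red)
print(ndvi.round(2))
```

Healthy vegetation reflects strongly in the near infrared and absorbs red light, which is why the first pixel scores highest.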