# Open Datacube
## Feature Summary Examples

The [Open Datacube](https://www.opendatacube.org/) provides an integrated gridded data analysis environment for decades of analysis-ready earth observation satellite and related data, acquired from multiple satellites and acquisition systems.

For instructions on using the Datacube on NCI, see: http://datacube-core.readthedocs.io/en/stable/user/nci_usage.html

For instructions on setting up your own instance, see: http://datacube-core.readthedocs.io/en/stable/ops/install.html

This notebook touches briefly on some of the implemented features of the Datacube module. It is intended to demonstrate functionality rather than to serve as a tutorial.

In [None]:
%matplotlib inline
import datacube

If you have set up your config correctly, or are using the module on NCI, you should be able to create a `Datacube` object that connects to the configured datacube system.

In [None]:
dc = datacube.Datacube(app='dc-example')
dc

## Datacube products and measurements
The Datacube provides pandas.DataFrame representations of the available products and measurements:

In [None]:
dc.list_products()

## Datacube Measurements
The measurements stored in the datacube can also be listed.

Measurements are also known as _bands_ in the imagery domain, and _data variables_ when stored in NetCDF files or when working with `xarray.Dataset` objects.

In [None]:
dc.list_measurements()

## Retrieving data


In [None]:
nbar = dc.load(product='ls7_usgs_sr_albers', measurements=['green'], x=(149.25, 149.35), 
               y=(-35.25, -35.35), time=('2016-01', '2016-06'), use_threads=True)

The returned data is an `xarray.Dataset` object, which is a labelled n-dimensional array wrapping a `numpy` array.

We can investigate the data to see the variables (measurement bands) and dimensions that were returned:

In [None]:
nbar
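
To make the structure of the returned object concrete without a datacube connection, the following is a minimal sketch that builds a small `xarray.Dataset` by hand. The grid size, coordinate values and pixel values are invented for illustration; only the `green` variable name and the `nodata` attribute mirror what is used in this notebook.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a loaded product: 2 time steps on a 3 x 4 grid.
# All values and coordinates here are made up for illustration.
green = np.arange(24, dtype="int16").reshape(2, 3, 4)

ds = xr.Dataset(
    {"green": (("time", "y", "x"), green, {"nodata": -9999})},
    coords={
        "time": np.array(["2016-01-05", "2016-01-21"], dtype="datetime64[ns]"),
        "y": [30.0, 20.0, 10.0],
        "x": [0.0, 10.0, 20.0, 30.0],
    },
)

# The variable is a labelled wrapper around a plain numpy array.
raw = ds["green"].values

# Label-based selection of one pixel across all time steps.
pixel_series = ds["green"].sel(y=20.0, x=10.0)
```

Here `raw` is an ordinary `numpy.ndarray`, while `pixel_series` keeps its `time` coordinate.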

First we define a helper that drops time slices containing no valid data. We can then look at the data by name directly, or through the `data_vars` dictionary:

In [None]:
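import numpy as np

def mask_temporal(xd_in):
    """Drop time slices whose first data variable contains only fill values.

    Returns the retained time labels and the dataset restricted to them.
    """
    keys = ['coastal_aerosol', 'blue', 'green', 'red', 'nir', 'swir1', 'swir2', 'pixel_qa']
    ds_key = list(xd_in.data_vars)[0]
    if ds_key not in keys:
        raise RuntimeError("Unknown data variable " + ds_key + "; must be one of " + ','.join(keys))
    # Most bands record their fill value in the 'nodata' attribute
    null = xd_in[ds_key].attrs['nodata']
    # For pixel_qa, the value 1 is treated as the fill value instead
    if ds_key == 'pixel_qa':
        null = 1
    # Keep only the time steps that contain at least one non-fill pixel
    time = [x.time.data for x in xd_in[ds_key] if np.count_nonzero(x.data != null) != 0]

    return time, xd_in.sel(time=time)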

In [None]:
time, nbar = mask_temporal(nbar)
print(nbar)
print(len(time))

In [None]:
nbar.data_vars

In [None]:
nbar.green

## Plotting data
We can select the data at a particular time and see what is there. We can use pandas-style labels to select a time period, inclusive of the end label:

In [None]:
autumn = nbar.green.loc['2016-3':'2016-5']
autumn.shape

In [None]:
autumn.plot(col='time', col_wrap=3, vmin=0, vmax=2500)
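
The inclusive end label is easy to miss, so here is a small sketch of the same slicing on a synthetic time series. The dates and values are invented; the point is only that a partial date string such as `'2016-05'` matches every timestamp in that month, including the last day.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Six made-up acquisition dates, one of them on the last day of May.
times = pd.to_datetime(
    ["2016-01-15", "2016-02-15", "2016-03-15", "2016-04-15", "2016-05-31", "2016-06-15"]
)
series = xr.DataArray(np.arange(6), coords={"time": times}, dims="time")

# Label-based slice: both the start month and the end month are included.
autumn_demo = series.loc["2016-03":"2016-05"]

# The same selection written with sel() and an explicit slice object.
autumn_demo_sel = series.sel(time=slice("2016-03", "2016-05"))
```

Both forms keep the three observations from March, April and 31 May.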

## Masking out NO_DATA values
When there is no data available, such as on the boundaries of a scene, the pixel is filled in with a special value.
We can filter it out, although xarray will convert the data from `int` to `float` so that it can use `NaN` to indicate no data.

Now that bad values are no longer represented as `-9999`, the data fits on a much better colour ramp:

In [None]:
autumn_valid = autumn.where(autumn != autumn.attrs['nodata'])
autumn_valid.plot(col='time', col_wrap=3, vmin=0, vmax=1000)
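
The reason for the `int` to `float` conversion can be seen in a plain `numpy` sketch: integers have no representation for `NaN`, so the masked result has to be a floating point array. The array values below are invented, and `NODATA = -9999` is assumed to match the fill value mentioned above.

```python
import numpy as np

NODATA = -9999  # assumed fill value, as mentioned in the text above

# A tiny made-up integer band with two fill pixels.
band = np.array([[350, NODATA], [NODATA, 900]], dtype="int16")

# Replace fill pixels with NaN; the result must be floating point.
valid = np.where(band != NODATA, band.astype("float32"), np.nan)

# NaN-aware statistics now ignore the fill pixels.
mean_valid = np.nanmean(valid)
```

A plain mean over `band` would be dominated by the `-9999` entries, whereas `np.nanmean` uses only the two real observations.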

## Masking out cloud
Some of the images are clearly cloud, and we should remove them. There is a product with detected clouds called **PQ** (for Pixel Quality) that we can use to mask out the clouds.

In [None]:
pq = dc.load(product='ls7_usgs_sr_albers', measurements=['pixel_qa'], x=(149.25, 149.35), y=(-35.25, -35.35), 
             time=('2016-01', '2016-06'), use_threads=True)
pq = pq.sel(time=time)
pq_autumn = pq.pixel_qa.loc['2016-3':'2016-5']
pq_autumn.plot(col='time', col_wrap=3, vmin=1)

The PQ layer stores a bitmask of several values. We can list the information available:

In [None]:
from datacube.storage import masking
import pandas
#pandas.DataFrame.from_dict(masking.get_flags_def(pq))#, orient='index')
import pprint

pprint.pprint(masking.get_flags_def(pq))

In [None]:
CLEAR_DATA = 2
good_data = pq & CLEAR_DATA
autumn_good_data = good_data.pixel_qa.loc['2016-3':'2016-5']
autumn_good_data.plot(col='time', col_wrap=3)
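
As a rough illustration of how the bitwise test works, the sketch below applies the same `CLEAR_DATA = 2` constant to a handful of sample values. The sample `pixel_qa` values and their meanings in the comments are assumptions chosen for illustration; the authoritative flag definitions are the ones printed by `masking.get_flags_def`.

```python
import numpy as np

CLEAR_DATA = 2  # bit 1, the same constant used in the cell above

# Sample pixel_qa values (assumed for illustration only):
# 1 = fill, 66 = clear, 72 = cloud shadow, 224 = cloud
pixel_qa = np.array([1, 66, 72, 224], dtype="uint16")

# The AND keeps only bit 1; the result is 2 where the bit is set, else 0.
masked_bits = pixel_qa & CLEAR_DATA

# Convert to a boolean mask for use with where().
is_clear = masked_bits != 0
```

Note that the raw result of the AND is `0` or `2` rather than `0` or `1`, which matters if the values are later summed to count clear pixels.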

In [None]:
autumn_cloud_free = autumn_valid.where(autumn_good_data)
autumn_cloud_free.plot(col='time', col_wrap=3, vmin=0, vmax=1000)

## Group by time
You may have noticed that some of the days above are repeated, with times less than a minute apart. This is because of the overlap in Landsat scenes. If we group by solar day (a rough local time based on longitude), we can combine these slices:

In [None]:
nbar_by_solar_day = dc.load(product='ls7_usgs_sr_albers', measurements=['green'], x=(149.25, 149.35), 
               y=(-35.25, -35.35), time=('2016-01', '2016-06'), group_by='solar_day', use_threads=True)
time, nbar_by_solar_day = mask_temporal(nbar_by_solar_day)
len(nbar_by_solar_day.time)

We have fewer times than we did previously.
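
The idea of a solar day can be sketched with the standard library alone. This is not datacube's implementation, only a hedged approximation of the concept: shift the UTC timestamp by the longitude expressed in hours (15 degrees per hour) and take the calendar date. The function name `solar_day` and the two timestamps are hypothetical.

```python
from datetime import datetime, timedelta

def solar_day(utc_time, longitude_deg):
    """Approximate local solar date for a UTC timestamp (illustrative only)."""
    # 360 degrees of longitude span 24 hours, i.e. 15 degrees per hour.
    offset = timedelta(hours=longitude_deg / 15.0)
    return (utc_time + offset).date()

# Two made-up acquisitions 24 seconds apart, late in the UTC day.
first = datetime(2016, 3, 4, 23, 49, 50)
second = datetime(2016, 3, 4, 23, 50, 14)

day_first = solar_day(first, 149.3)
day_second = solar_day(second, 149.3)
```

Both timestamps fall on the same solar day, so a grouping keyed on that date would merge them into one slice. At this longitude the solar date is also one day ahead of the UTC date.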

In [None]:
autumn2 = nbar_by_solar_day.green.loc['2016-3':'2016-5']
autumn2.shape

In [None]:
autumn2.plot(col='time', col_wrap=3, vmin=0, vmax=1000)

## Some basic band maths
We can combine the `red` and `nir` (_near-infrared_) bands to calculate NDVI (_normalised difference vegetation index_).

In [None]:
nbar = 0
two_bands = dc.load(product='ls7_usgs_sr_albers', measurements=['red', 'nir'], x=(149.07, 149.17), 
                    y=(-35.25, -35.35), time=('2016-01', '2016-06'), group_by='solar_day', use_threads=True)
time, two_bands = mask_temporal(two_bands)

In [None]:
pq = dc.load(product='ls7_usgs_sr_albers', measurements=['pixel_qa'], x=(149.07, 149.17), 
                    y=(-35.25, -35.35), time=('2016-01', '2016-06'), group_by='solar_day', use_threads=True)
pq = pq.sel(time=time)

In [None]:
CLEAR_DATA = 2

red = two_bands.red.where(two_bands.red != two_bands.red.attrs['nodata'])
nir = two_bands.nir.where(two_bands.nir != two_bands.nir.attrs['nodata'])

good_data = pq.pixel_qa & CLEAR_DATA
ndvi = ((nir - red) / (nir + red)).where(good_data)
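
For reference, the NDVI formula behaves as follows on a few hand-picked reflectance values. The numbers are invented; the third pixel shows that a `NaN` in either band (a masked pixel) propagates into the result.

```python
import numpy as np

# Made-up reflectance values; NaN stands for a masked (no data) pixel.
red_demo = np.array([500.0, 1000.0, np.nan])
nir_demo = np.array([1500.0, 1000.0, 2000.0])

# NDVI = (NIR - red) / (NIR + red), computed element by element.
ndvi_demo = (nir_demo - red_demo) / (nir_demo + red_demo)
```

The first pixel, with much higher near-infrared than red reflectance, gives 0.5; equal reflectance gives 0; the masked pixel stays `NaN`.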

In [None]:
print(ndvi)

In [None]:
ndvi.plot(col='time', col_wrap=5, vmin=-0.25, vmax=0.75)

In [None]:
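# good_data holds 0 or 2 per pixel, so count the non-zero pixels rather than summing the values.
# good_data.size / good_data.time.size is the number of pixels in one time slice.
mostly_cloud_free = (good_data != 0).sum(dim=('x','y')) > (0.75 * good_data.size / good_data.time.size)
mostly_good_ndvi = ndvi.where(mostly_cloud_free).dropna('time', how='all')
mostly_good_ndvi.plot(col='time', col_wrap=5, vmin=-0.25, vmax=0.75)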

## Some stats

In [None]:
mostly_good_ndvi.median(dim='time').plot()

In [None]:
mostly_good_ndvi.std(dim='time').plot(vmin=0.0, vmax=0.2)

## Pixel drill

In [None]:
mostly_good_ndvi.sel(y=-3955361, x=1549813, method='nearest').plot()

In [None]:
mostly_good_ndvi.isel(x=[200], y=[200]).plot()

In [None]:
mostly_good_ndvi.isel(y=250).plot()

A line shapefile with pairs of coordinates (using `sel_points` instead of `isel_points`) could be interpolated into something less blocky for the next plot.

In [None]:
mostly_good_ndvi.isel_points(x=[0, 100, 200, 300, 300, 400], 
                             y=[200, 200, 200, 250, 300, 400]).plot(x='points', y='time')
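
In more recent xarray releases `isel_points` and `sel_points` are no longer available. The same pointwise selection can be written with `isel` by passing `DataArray` indexers that share a dimension. The sketch below uses a small synthetic array rather than the NDVI data, and the index values are invented.

```python
import numpy as np
import xarray as xr

# Synthetic (time, y, x) array standing in for the NDVI stack.
stack = xr.DataArray(np.arange(2 * 5 * 5).reshape(2, 5, 5), dims=("time", "y", "x"))

# Indexers that share the 'points' dimension are paired element by element.
xs = xr.DataArray([0, 2, 4], dims="points")
ys = xr.DataArray([1, 1, 3], dims="points")
transect = stack.isel(x=xs, y=ys)

# Plain lists select the full 3 x 3 grid of combinations instead.
grid = stack.isel(x=[0, 2, 4], y=[1, 1, 3])
```

`transect` has one value per (x, y) pair at each time step, while `grid` contains every combination of the listed rows and columns. Label-based selection works the same way with `sel` in place of `isel`.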

## Plotting a multi-band image

In [None]:
rgb = dc.load(product='ls7_usgs_sr_albers', 
              x=(149.07, 149.17), y=(-35.25, -35.35), 
              time=('2016-3-1', '2016-6-30'), 
              measurements=['red', 'green', 'blue'], 
              group_by='solar_day', use_threads=True).to_array(dim='color').transpose('time', 'y', 'x', 'color')

In [None]:
fake_saturation = 3000
clipped_visible = rgb.where(rgb<fake_saturation).fillna(fake_saturation)
max_val = clipped_visible.max(['y', 'x'])
scaled = (clipped_visible / max_val)

In [None]:
from matplotlib import pyplot as plt
plt.imshow(scaled.isel(time=0))