# Tabulation of Fractional Cover data within shapefile polygons

**What does this notebook do?**

This notebook is a pilot collaboration between Geoscience Australia and the Australian Bureau of Statistics. It uses polygon boundaries from a shapefile to load the Fractional Cover dataset, compute zonal statistics and tabulate the results.

**Requirements**

Before launching Jupyter, run the following commands in the same terminal so that the required libraries and paths are set:

`module use /g/data/v10/public/modules/modulefiles`

`module load dea/20181213`


**Background**

Data from the Landsat 5, 7 and 8 satellite missions are accessible through Digital Earth Australia (DEA). The code snippets in this notebook let you retrieve and plot the Fractional Cover (FC25) data stored in DEA.


**How to use this notebook**

A basic understanding of any programming language is desirable, but you don't need to be an expert Python programmer to adapt the code to retrieve and display the data. This notebook applies to the following Landsat satellites, Fractional Cover bands and WOfS datasets:

- Landsat 5
- Landsat 7
- Landsat 8
- PV - Photosynthetic vegetation
- NPV - Non-Photosynthetic vegetation
- BS - Bare Soil
- UE - Unmixing Error
- Water Observations from Space (WOfS)
- WOfS Feature Layer (WOFL)

**Bugs still to fix**

- add water to percentage - AH
- Memory errors for extra large polygons - AH & ET


**Errors or bugs**

If you find an error or bug in this notebook, please contact erin.telfer@ga.gov.au.


## Import Libraries and define functions

In [1]:
%matplotlib inline

from datetime import time, datetime
import os.path

from matplotlib import pyplot as plt
import pandas as pd
import numpy
import xarray as xr
import rasterio
import rasterio.features
import fiona
import dask
import dask.array  # geometry_mask() below calls dask.array.from_array
from dask.delayed import delayed
from dask.distributed import LocalCluster, Client
import tempfile

import datacube
from datacube import Datacube
from datacube.virtual import construct, construct_from_yaml
from datacube.ui.task_app import year_splitter

### Set up a local dask cluster
This lets several processes work at the same time while keeping total memory usage under control.

We also get a dashboard that shows how the system is running.

In [2]:
cluster = LocalCluster(local_dir=tempfile.gettempdir(), 
                       n_workers=3, 
                       threads_per_worker=1,
                       memory_limit=6e9)
client = Client(cluster)
dask.config.set(get=client.get)
client

Client: Scheduler: tcp://127.0.0.1:35134, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 3, Cores: 3, Memory: 18.00 GB


In [3]:
chunk_size = {'time': 1, 'x': 2000, 'y': 2000}
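
As a rough sanity check on these settings, the sketch below estimates how much memory one chunk occupies compared with the 6 GB per-worker limit set on the cluster. The 4 bytes per pixel (float32) and the count of three measurements are assumptions made for illustration; the actual dtype depends on how the product is loaded.

```python
# Values copied from the notebook cells above.
chunk_size = {'time': 1, 'x': 2000, 'y': 2000}
memory_limit = 6e9  # bytes per worker, as passed to LocalCluster

# Assumptions for illustration only: float32 pixels, three bands (PV, NPV, BS).
bytes_per_pixel = 4
n_measurements = 3

pixels_per_chunk = chunk_size['time'] * chunk_size['x'] * chunk_size['y']
chunk_bytes = pixels_per_chunk * bytes_per_pixel
per_time_step_bytes = chunk_bytes * n_measurements

print(f"one chunk: {chunk_bytes / 1e6:.0f} MB")
print(f"one time step, three bands: {per_time_step_bytes / 1e6:.0f} MB")
print(f"share of one worker's limit: {per_time_step_bytes / memory_limit:.1%}")
```

A single chunk is small relative to the limit under these assumptions; memory pressure comes from the number of chunks and time steps a large polygon needs at once.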

In [4]:
dc = Datacube()

### Construct virtual product

In [5]:
LS7_BROKEN_DATE = datetime(2003, 5, 31)
is_pre_slc_failure = lambda dataset: dataset.center_time < LS7_BROKEN_DATE
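
The predicate can be tried out without a datacube connection. In the sketch below, `SimpleNamespace` objects are hypothetical stand-ins for datacube datasets; only the `center_time` attribute that the predicate reads is modelled.

```python
from datetime import datetime
from types import SimpleNamespace

LS7_BROKEN_DATE = datetime(2003, 5, 31)
is_pre_slc_failure = lambda dataset: dataset.center_time < LS7_BROKEN_DATE

# Hypothetical stand-ins for datasets acquired before and after the cut-off.
before = SimpleNamespace(center_time=datetime(2002, 1, 15))
after = SimpleNamespace(center_time=datetime(2005, 1, 15))

print(is_pre_slc_failure(before))  # True: kept
print(is_pre_slc_failure(after))   # False: filtered out
```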

In [6]:
def wofls_fuser(dest, src):
    where_nodata = (src & 1) == 0
    numpy.copyto(dest, src, where=where_nodata)
    return dest
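
To see what the fuser does, the sketch below runs it on two small arrays. The pixel values are arbitrary bit patterns chosen for illustration, not real WOfS observations: wherever the lowest bit of `src` is 0, the `src` value overwrites `dest`; elsewhere `dest` is left unchanged.

```python
import numpy

def wofls_fuser(dest, src):
    where_nodata = (src & 1) == 0
    numpy.copyto(dest, src, where=where_nodata)
    return dest

# Arbitrary example values; the lowest bit decides which array wins per pixel.
dest = numpy.array([1, 1, 128], dtype='uint8')
src = numpy.array([128, 1, 0], dtype='uint8')

fused = wofls_fuser(dest, src)
print(fused.tolist())  # [128, 1, 0]
```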

In [7]:
fc_land_only_yaml = """
    transform: apply_mask
    mask_measurement_name: water
    preserve_dtype: false
    input:
        juxtapose:
          - collate:
              - transform: apply_mask
                mask_measurement_name: pixelquality
                preserve_dtype: false
                input:
                    juxtapose:
                      - product: ls5_fc_albers
                        group_by: solar_day
                        measurements: [PV, NPV, BS]
                      - transform: make_mask
                        input:
                            product: ls5_pq_albers
                            group_by: solar_day
                            fuse_func: datacube.helpers.ga_pq_fuser
                        flags:
                            ga_good_pixel: true
                        mask_measurement_name: pixelquality
              - transform: apply_mask
                mask_measurement_name: pixelquality
                preserve_dtype: false
                input:
                    juxtapose:
                      - product: ls7_fc_albers
                        group_by: solar_day
                        measurements: [PV, NPV, BS]
                        dataset_predicate: __main__.is_pre_slc_failure
                      - transform: make_mask
                        input:
                            product: ls7_pq_albers
                            group_by: solar_day
                            fuse_func: datacube.helpers.ga_pq_fuser
                        flags:
                            ga_good_pixel: true
                        mask_measurement_name: pixelquality
              - transform: apply_mask
                mask_measurement_name: pixelquality
                preserve_dtype: false
                input:
                    juxtapose:
                      - product: ls8_fc_albers
                        group_by: solar_day
                        measurements: [PV, NPV, BS]
                      - transform: make_mask
                        input:
                            product: ls8_pq_albers
                            group_by: solar_day
                            fuse_func: datacube.helpers.ga_pq_fuser
                        flags:
                            ga_good_pixel: true
                        mask_measurement_name: pixelquality
          - transform: make_mask
            input:
                product: wofs_albers
                group_by: solar_day
                fuse_func: __main__.wofls_fuser
            flags:
                water_observed: false
            mask_measurement_name: water
"""
fc_land_only = construct_from_yaml(fc_land_only_yaml)
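
The recipe above nests two masking steps: a pixel-quality mask per sensor, then a land-only mask from WOfS. The NumPy sketch below is only a rough picture of the combined effect on one band, assuming masked pixels become NaN once the data are converted to floating point; it is not how datacube implements `apply_mask`, and the arrays are made up.

```python
import numpy as np

# Made-up values for four pixels of one FC band.
pv = np.array([10, 20, 30, 40], dtype='int8')

# True means "keep": good pixel quality, and water NOT observed (land).
pixelquality = np.array([True, True, False, True])
water = np.array([True, False, True, True])

# Converting to float lets rejected pixels be represented as NaN.
land_pv = np.where(pixelquality & water, pv.astype('float32'), np.nan)
print(land_pv)
```

Only pixels that pass both masks keep their value, which is why later steps use NaN-aware means and counts.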

### Set up functions

In [8]:
def geometry_mask(geoms, geobox, all_touched=False, invert=False, chunks=None):
    """
    Create a mask from shapes.

    By default, mask is intended for use as a
    numpy mask, where pixels that overlap shapes are False.
    :param list[Geometry] geoms: geometries to be rasterized
    :param datacube.utils.GeoBox geobox:
    :param bool all_touched: If True, all pixels touched by geometries will be burned in. If
                             false, only pixels whose center is within the polygon or that
                             are selected by Bresenham's line algorithm will be burned in.
    :param bool invert: If True, mask will be True for pixels that overlap shapes.
    """
    data = rasterio.features.geometry_mask([geom.to_crs(geobox.crs) for geom in geoms],
                                           out_shape=geobox.shape,
                                           transform=geobox.affine,
                                           all_touched=all_touched,
                                           invert=invert)
    if chunks is not None:
        data = dask.array.from_array(data, chunks=tuple(chunks[d] for d in geobox.dims))
        
    coords = [xr.DataArray(data=coord.values, name=dim, dims=[dim], attrs={'units': coord.units}) 
              for dim, coord in geobox.coords.items()]
    return xr.DataArray(data, coords=coords)
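
The `invert` and pixel-centre behaviour described in the docstring can be pictured with plain NumPy. The sketch below is a stand-in for illustration, not the rasterio implementation: it uses a hypothetical 4 x 4 grid of 25 m pixels and a hypothetical axis-aligned rectangle, and marks a pixel as inside when its centre falls within the rectangle.

```python
import numpy as np

pixel = 25.0
# Pixel centres for a 4 x 4 grid with its top-left corner at (x=0, y=100).
x_centres = (np.arange(4) + 0.5) * pixel          # 12.5, 37.5, 62.5, 87.5
y_centres = 100.0 - (np.arange(4) + 0.5) * pixel  # 87.5, 62.5, 37.5, 12.5
xx, yy = np.meshgrid(x_centres, y_centres)

# Hypothetical rectangle: 20 <= x <= 70 and 30 <= y <= 100.
inside = (xx >= 20) & (xx <= 70) & (yy >= 30) & (yy <= 100)

mask_default = ~inside   # invert=False: False where the shape is
mask_inverted = inside   # invert=True: True where the shape is

print(int(mask_inverted.sum()), "pixels inside the shape")
```

The notebook calls `geometry_mask` with `invert=True`, so summing the mask gives the number of pixels in the polygon.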

In [9]:
def get_shapes(shape_file):
    with fiona.open(shape_file) as shapes:
        crs = datacube.utils.geometry.CRS(shapes.crs_wkt)
        for shape in shapes:
            geom = datacube.utils.geometry.Geometry(shape['geometry'], crs=crs)
            yield geom, shape['properties']

In [10]:
def fc_summary(data,mask_int):
    fc = data[['BS', 'PV', 'NPV']].sum(dim=('x', 'y'))
    area = fc * (25 * 25 / 1_000_000)
    area = area.rename({'BS': 'BS_area', 'PV': 'PV_area', 'NPV': 'NPV_area'})
    for da in area.data_vars.values():
        da.attrs['units'] = 'km2'

    fc = fc / mask_int * 100
    for da in fc.data_vars.values():
        da.attrs['units'] = '%'
        
    fc = fc.merge(area)
    
    return fc
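
The percentage step in `fc_summary` can be checked by hand on a tiny example. The sketch below uses made-up PV values and assumes each value is a percentage between 0 and 100, with NaN marking pixels outside the polygon. As in `process_area` further down, `mask_int` is the pixel count multiplied by 100.

```python
import numpy as np

# Made-up PV values (per cent) on a 2 x 2 grid; NaN is outside the polygon.
pv = np.array([[50.0, 100.0],
               [np.nan, 0.0]])
mask = np.array([[True, True],
                 [False, True]])

mask_int = mask.sum() * 100              # 3 pixels * 100 = 300
pv_percent = np.nansum(pv) / mask_int * 100

print(pv_percent)  # 150 / 300 * 100 = 50.0
```

Dividing by the pixel count times 100 and then multiplying by 100 yields the mean cover over the polygon, expressed as a percentage.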

In [11]:
def keepna(a, dim=None, thresh=None):
    if type(a) is xr.Dataset:
        return a.apply(keepna, keep_attrs=True, dim=dim, thresh=thresh)
    
    keep_dim = [] if dim is None else [dim]
    dims = [d for d in a.dims if d not in keep_dim]
    if thresh is None:
        keep = numpy.isfinite(a).sum(dim=dims) > 0
    else:
        keep = numpy.isfinite(a).sum(dim=dims) >= thresh
    return a.where(keep, other=numpy.nan)
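
The thresholding idea behind `keepna` can be sketched with plain NumPy, assuming a 2-D array whose rows are time steps and whose columns are the pixels inside the polygon. This is an illustration of the logic with made-up numbers, not a replacement for the xarray version above.

```python
import numpy as np

# Rows are time steps, columns are the pixels inside the polygon.
a = np.array([[10.0, 20.0, 30.0, 40.0],
              [np.nan, np.nan, np.nan, 5.0]])

n_mask_pixels = 4
thresh = 0.9 * n_mask_pixels                # at least 90% valid pixels

finite_count = np.isfinite(a).sum(axis=1)   # valid pixels per time step
keep = finite_count >= thresh
result = np.where(keep[:, None], a, np.nan) # blank out sparse time steps

print(keep.tolist())  # [True, False]
```

The second time step has only one valid pixel out of four, so it is set entirely to NaN and later removed by `dropna`.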

In [134]:
def plot_stacked(ds, catchment_id,plot_title='title', show=True):
    if not show:
        plt.ioff()
        
    fig,ax = plt.subplots(figsize=(10,5))
    ax.stackplot(ds.dropna(dim='time').time.data, 
                 ds.dropna(dim='time').BS, 
                 ds.dropna(dim='time').NPV, 
                 ds.dropna(dim='time').PV,
                 colors = ['tan','olive','darkolivegreen',], 
                 labels=['BS','NPV','PV',])
    plt.legend(loc='upper center', ncol = 3)
    plt.title(f'FC Components: Catchment ID {catchment_id}', size=12)
    plt.ylabel('Percentage (%)', size=12) #Set Y label
    plt.xlabel('Date', size=12) #Set X label
    
    plt.savefig(f'/g/data/r78/ext547/abs/output/{catchment_id}_{plot_title}.png');
        
#     plt.savefig(f'{catchment_id}_monthly_plot.png');
    plt.close(fig)
    
    # Turn interactive mode back on (the figure is already saved and closed)
    if not show:
        plt.ion()

In [13]:
def month_splitter(start_year, end_year_inclusive):
    yield from (str(p) for p in pd.period_range(start=pd.Period(start_year, freq='A').start_time, 
                               end=pd.Period(end_year_inclusive, freq='A').end_time, 
                               freq='M'))
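
For readers without pandas to hand, the labels that `month_splitter` yields can be reproduced with the standard library. `month_labels` below is a hypothetical stand-in written for illustration; it generates one `YYYY-MM` string per month, which is the form a monthly pandas period takes when converted to a string.

```python
def month_labels(start_year, end_year_inclusive):
    """Yield 'YYYY-MM' for every month from January of the first year
    to December of the last year (hypothetical stdlib stand-in)."""
    for year in range(int(start_year), int(end_year_inclusive) + 1):
        for month in range(1, 13):
            yield f"{year}-{month:02d}"

months = list(month_labels('2000', '2001'))
print(len(months), months[0], months[-1])  # 24 2000-01 2001-12
```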

### Process the query
For each month and polygon, query the product, apply the geometry mask and compute the fractional cover statistics.

Using `client.compute()` lets us use the monthly results in calculating the annual results at the same time.

In [14]:
shape_file = os.path.expanduser('../input/SA_2016_twosmallpolygons_3577.shp')

# shape_file = os.path.expanduser('../input/SA_2016_threepolygons_3577.shp')
# shape_file = os.path.expanduser('../input/SA_2016_twopolygons_3577.shp')
shapes = list(get_shapes(shape_file))

In [149]:
start_year, end_year = 2000, 2010
time_range = (str(start_year), str(end_year))
time_range

('2000', '2010')

In [23]:
# # Use this list instead of shapes to process just the big outback South Australian area
# s2 = [(g,p) for g, p in shapes if str(p['SA2_MAIN16']) == '406021141']

In [24]:
# s2

If we have enough resources, we can start the query and calculation of the next year's data while the previous is still being calculated. `by_slice=False` will be faster, but use more memory.

For larger areas `by_slice` will need to be `True`, so that the compute cluster does not become overwhelmed.  

If you get the error:
> `distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting`

then you will need to set `by_slice=True`
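
The trade-off between the two settings can be pictured with the standard library. The sketch below is an analogy using `concurrent.futures`, not the dask API: `summarise` is a hypothetical placeholder for the per-month calculation. Waiting on each result before submitting the next mirrors `by_slice=True`; submitting everything first and gathering afterwards mirrors `by_slice=False`.

```python
from concurrent.futures import ThreadPoolExecutor

def summarise(month):
    # Hypothetical placeholder for the per-month zonal statistics.
    return month, sum(range(1000))

months = ['2000-01', '2000-02', '2000-03']

with ThreadPoolExecutor(max_workers=3) as pool:
    # by_slice=True analogue: block on each month before starting the next,
    # so only one month's work is in flight at a time.
    sync_results = [pool.submit(summarise, m).result() for m in months]

    # by_slice=False analogue: queue every month, then gather all results.
    futures = [pool.submit(summarise, m) for m in months]
    async_results = [f.result() for f in futures]

print(sync_results == async_results)  # True
```

Both approaches give the same results; the second keeps more work queued at once, which is faster but needs more memory.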

In [150]:
by_slice=True

In [151]:
def process_area(geometry, catchment_id, time_range):
    monthly_values = []
    annual_values = []
    mask = None
          
    for sub_time_range in month_splitter(time_range[0], time_range[-1]):
        try:
            data = fc_land_only.load(dc, dask_chunks=chunk_size, 
                                     time=sub_time_range, 
                                     geopolygon=geometry)
        except ValueError:
            print(f'    Error, skipping...')
            continue

        if mask is None:
            mask = geometry_mask([geometry], data.geobox, invert=True, chunks=data.chunks)
            mask_int = mask * 1
            mask_int = mask_int.sum() * 100
        data = data.where(mask)
        data = data.resample(time='1MS').mean(dim='time', skipna=True)
        data = keepna(data, dim='time', thresh=0.9*int(mask.sum()))

        monthly_data = fc_summary(data,mask_int)
        monthly_data = monthly_data.where(monthly_data != 0)
        monthly_data = client.compute(monthly_data, sync=by_slice)
        monthly_values.append(monthly_data)
        
        print(f"    calculation complete for {str(monthly_data.time.values)}")

    if not by_slice:
        print("  all years queried, hard load data")
        monthly_values = client.gather(monthly_values)
    
    monthly_values = xr.concat(monthly_values, dim='time').dropna(dim='time')
    annual_values = monthly_values.resample(time='1YS').mean(dim='time', skipna=True)
    print(f"    calculation complete for annual values")

    
    plot_stacked(monthly_values, catchment_id, plot_title='monthly_plot',show=False)
    plot_stacked(annual_values, catchment_id, plot_title='annual_plot', show=False)
    
    print("  all data loaded, save to csv")
    monthly_values.to_dataframe().to_csv(f"../output/{catchment_id}_monthly.csv")
    annual_values.to_dataframe().to_csv(f"../output/{catchment_id}_annual.csv")
    print(f"  Catchment {catchment_id} done")
    # annual_values02 is not defined in this version of the function
    return monthly_values, annual_values

In [152]:
for geometry, properties in shapes:
    catchment_id = str(properties['SA2_MAIN16'])
    print(f"Catchment ID: {catchment_id}, size: {properties['AREASQKM16']}km^2, time: {time_range}")
    
    try:
          process_area(geometry, catchment_id, time_range)
    except Exception as err:
          # Report the cause so that failures are not silently hidden
          print(f"Could not process {catchment_id}: {err!r}")
          client.restart()
          continue

Catchment ID: 201031017, size: 1627.5521km^2, time: ('2000', '2010')
    calculation complete for ['2000-01-01T00:00:00.000000000']
    calculation complete for ['2000-02-01T00:00:00.000000000']
    calculation complete for ['2000-03-01T00:00:00.000000000']
    calculation complete for ['2000-04-01T00:00:00.000000000']
    calculation complete for ['2000-05-01T00:00:00.000000000']
    calculation complete for ['2000-06-01T00:00:00.000000000']
    calculation complete for ['2000-07-01T00:00:00.000000000']
    calculation complete for ['2000-08-01T00:00:00.000000000']
    calculation complete for ['2000-09-01T00:00:00.000000000']
    calculation complete for ['2000-10-01T00:00:00.000000000']
    calculation complete for ['2000-11-01T00:00:00.000000000']
    calculation complete for ['2000-12-01T00:00:00.000000000']
    calculation complete for ['2001-01-01T00:00:00.000000000']
    calculation complete for ['2001-02-01T00:00:00.000000000']
    calculation complete for ['2001-03-01T00:00:0

    calculation complete for ['2000-08-01T00:00:00.000000000']
    calculation complete for ['2000-09-01T00:00:00.000000000']
    calculation complete for ['2000-10-01T00:00:00.000000000']
    calculation complete for ['2000-11-01T00:00:00.000000000']
    calculation complete for ['2000-12-01T00:00:00.000000000']
    calculation complete for ['2001-01-01T00:00:00.000000000']
    calculation complete for ['2001-02-01T00:00:00.000000000']
    calculation complete for ['2001-03-01T00:00:00.000000000']
    calculation complete for ['2001-04-01T00:00:00.000000000']
    calculation complete for ['2001-05-01T00:00:00.000000000']
    calculation complete for ['2001-06-01T00:00:00.000000000']
    calculation complete for ['2001-07-01T00:00:00.000000000']
    calculation complete for ['2001-08-01T00:00:00.000000000']
    calculation complete for ['2001-09-01T00:00:00.000000000']
    calculation complete for ['2001-10-01T00:00:00.000000000']
    calculation complete for ['2001-11-01T00:00:00.0000

In [92]:
def process_area(geometry, catchment_id, time_range):
    monthly_values = []
    annual_values = []
    mask = None
          
    for sub_time_range in month_splitter(time_range[0], time_range[-1]):
#         print(f'  lazy loading {sub_time_range}...')
        try:
            data = fc_land_only.load(dc, dask_chunks=chunk_size, 
                                     time=sub_time_range, 
                                     geopolygon=geometry)
#             print(f'    lazy loaded {sub_time_range}')
        except ValueError:
            print(f'    Error, skipping...')
            continue

        if mask is None:
            mask = geometry_mask([geometry], data.geobox, invert=True, chunks=data.chunks)
            mask_int = mask * 1
            mask_int = mask_int.sum() * 100
        data = data.where(mask)
#         print(client.compute([data], sync=by_slice))
        data = data.resample(time='1MS').mean(dim='time', skipna=True)
        data = keepna(data, dim='time', thresh=0.9*int(mask.sum()))

        monthly_data = fc_summary(data,mask_int)
        monthly_data = monthly_data.where(monthly_data != 0)
        annual_data = monthly_data.resample(time='1YS').mean(dim='time', skipna=True)
        print(f"    calculating for {dict(monthly_data.sizes)}")
        monthly_data, annual_data = client.compute([monthly_data, annual_data], sync=by_slice)
#         monthly_data, annual_data = client.compute([monthly_data, annual_data], sync=by_slice)

#         print('data')
#         print(data)
#         print('mv')
#         print(monthly_data)
#         print('av')
#         print(annual_values)   
        monthly_values.append(monthly_data)
        annual_values.append(annual_data)
        
    if not by_slice:
        print("  all years queried, hard load data")
        monthly_values = client.gather(monthly_values)
        annual_values = client.gather(annual_values)


    print('b')
    monthly_values = xr.concat(monthly_values, dim='time').dropna(dim='time')
    annual_values02 = monthly_values.resample(time='1YS').mean(dim='time', skipna=True)
    print('c')    
    plot_stacked(monthly_values, catchment_id, show=False)
    print('d')
    annual_values = xr.concat(annual_values, dim='time').dropna(dim='time')
    print('e')
    
    print("  all data loaded, save to csv")
    monthly_values.to_dataframe().to_csv(f"../output/{catchment_id}_monthly.csv")
    annual_values.to_dataframe().to_csv(f"../output/{catchment_id}_annual.csv")
    annual_values02.to_dataframe().to_csv(f"../output/{catchment_id}_annual02.csv")
    print(f"  Catchment {catchment_id} done")
    return monthly_values, annual_values, annual_values02