# Tabulation of Fractional Cover data within shapefile polygons

**What does this notebook do?**

This notebook is a pilot collaboration between Geoscience Australia and Australian Bureau of Statistics. The purpose of the notebook is to use a shapefile polygon boundaries to load fractional cover dataset, complete zonal statistics and tabulate the results.

**Requirements**

You need to run the following commands from the command line prior to launching jupyter notebooks from the same terminal so that the required libraries and paths are set:

`module use /g/data/v10/public/modules/modulefiles`

`module load dea/20181213`


**Background**

Data from the Landsat 5,7 and 8 satellite missions are accessible through Digital Earth Australia (DEA). The code snippets in this notebook will let you retrieve and plot the Fractional Cover (FC25) data stored in DEA.


**How to use this notebook**

A basic understanding of any programming language is desirable but one doesn't have to be an expert Python programmer to manipulate the code to get and display the data.This doc applies to the following Landsat satellites, Fractional Cover bands and the WOfS dataset:

- Landsat 5
- Landsat 7
- Landsat 8
- PV - Photosythetic vegetation
- NPV - Non-Photosythetic vegetation
- BS - Bare Soil
- UE - Unmixing Error
- Water Observations from Space (WOFs)
- WOfS Feature Layer (WOFL)

**Bugs still to fix**

- Memory errors for extra large polygons - AH & ET


**Errors or bugs**

If you find an error or bug in this notebook, please contact erin.telfer@ga.gov.au.


## 1. Import Libraries

In [None]:
%matplotlib inline

from datetime import time, datetime
import os.path

from matplotlib import pyplot as plt
import pandas as pd
import numpy
import csv
import xarray as xr
import rasterio
import rasterio.features
import fiona
from datetime import datetime
import dask
from dask.delayed import delayed
from dask.distributed import LocalCluster, Client
import tempfile

import datacube
from datacube import Datacube
from datacube.virtual import construct, construct_from_yaml
from datacube.ui.task_app import year_splitter
from datacube.utils.geometry import CRS

print(f"Cell finished at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

## 2. Define directories and years of interest

In [None]:
#Set folder  and SA2 shapefile locations
shapefile_path = '../input/SA2_2016_AUST.shp'
output_path = '/g/data/r78/ext547/abs/output/'

In [None]:
#Specify time years of interest
start_year, end_year = 2011, 2012
time_range = (str(start_year), str(end_year))
print(time_range)

## 3. Set up a local dask cluster
Some calculations take more memory than is available on a system.  By breaking the data up into chunks, we can chain a sequence of operations together, and work on the data a small piece at a time.

This lets several processes work at the same time, and manage total memory usage for the calculations.

In more advanced setups, we can distribute the work across multiple computers, using all of their memory and CPU power.

In [None]:
#Set up dask cluster
n_workers = 7
threads_per_worker=1
mem_per_worker = 8e9  # 8e9 is 8GB (8,000,000,000 bytes)

chunk_size = {'time': 1, 'x': 2000, 'y': 2000}

In [None]:
cluster = LocalCluster(local_dir=tempfile.gettempdir(), 
                       n_workers=n_workers, 
                       threads_per_worker=threads_per_worker,
                       memory_limit=mem_per_worker)
client = Client(cluster)
dask.config.set(get=client.get)
client

We also get a dashboard to see how the system is running, by clicking the link above after the cell has been run.

## 4. Connect to the Datacube 

In [6]:
dc = Datacube()
print(f"Cell finished at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

Cell finished at 2019-03-25 13:19:19


## 5. Construct the virtual product

In [7]:
#Remove Landsat 7 scenes with the Scan Line Correction (SLC) missing data
LS7_BROKEN_DATE = datetime(2003, 5, 31)
is_pre_slc_failure = lambda dataset: dataset.center_time < LS7_BROKEN_DATE
print(f"Cell finished at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

Cell finished at 2019-03-25 13:19:19


In [8]:
#Create function to ensure wofls in correct format
def wofls_fuser(dest, src):
    where_nodata = (src & 1) == 0
    numpy.copyto(dest, src, where=where_nodata)
    return dest
print(f"Cell finished at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

Cell finished at 2019-03-25 13:19:19


In [9]:
#Create virtual product so that datacube data can be loaded effectively within memory
fc_and_water_yaml = """
        juxtapose:
          - collate:
              - transform: apply_mask
                mask_measurement_name: pixelquality
                preserve_dtype: false
                input:
                    juxtapose:
                      - product: ls5_fc_albers
                        group_by: solar_day
                        measurements: [PV, NPV, BS]
                      - transform: make_mask
                        input:
                            product: ls5_pq_albers
                            group_by: solar_day
                            fuse_func: datacube.helpers.ga_pq_fuser
                        flags:
                            ga_good_pixel: true
                        mask_measurement_name: pixelquality
              - transform: apply_mask
                mask_measurement_name: pixelquality
                preserve_dtype: false
                input:
                    juxtapose:
                      - product: ls7_fc_albers
                        group_by: solar_day
                        measurements: [PV, NPV, BS]
                        # dataset_predicate: __main__.is_pre_slc_failure
                      - transform: make_mask
                        input:
                            product: ls7_pq_albers
                            group_by: solar_day
                            fuse_func: datacube.helpers.ga_pq_fuser
                        flags:
                            ga_good_pixel: true
                        mask_measurement_name: pixelquality
              - transform: apply_mask
                mask_measurement_name: pixelquality
                preserve_dtype: false
                input:
                    juxtapose:
                      - product: ls8_fc_albers
                        group_by: solar_day
                        measurements: [PV, NPV, BS]
                      - transform: make_mask
                        input:
                            product: ls8_pq_albers
                            group_by: solar_day
                            fuse_func: datacube.helpers.ga_pq_fuser
                        flags:
                            ga_good_pixel: true
                        mask_measurement_name: pixelquality
          - transform: make_mask
            input:
                product: wofs_albers
                group_by: solar_day
                fuse_func: __main__.wofls_fuser
            flags:
                wet: true
            mask_measurement_name: water
"""
fc_and_water = construct_from_yaml(fc_and_water_yaml)

## 6. Set up functions

In [10]:
def geometry_mask(geoms, geobox, all_touched=False, invert=False, chunks=None):
    """
    Create a mask from shapes.

    By default, mask is intended for use as a
    numpy mask, where pixels that overlap shapes are False.
    :param list[Geometry] geoms: geometries to be rasterized
    :param datacube.utils.GeoBox geobox:
    :param bool all_touched: If True, all pixels touched by geometries will be burned in. If
                             false, only pixels whose center is within the polygon or that
                             are selected by Bresenham's line algorithm will be burned in.
    :param bool invert: If True, mask will be True for pixels that overlap shapes.
    """
    data = rasterio.features.geometry_mask([geom.to_crs(geobox.crs) for geom in geoms],
                                           out_shape=geobox.shape,
                                           transform=geobox.affine,
                                           all_touched=all_touched,
                                           invert=invert)
    if chunks is not None:
        data = dask.array.from_array(data, chunks=tuple(chunks[d] for d in geobox.dims))
        
    coords = [xr.DataArray(data=coord.values, name=dim, dims=[dim], attrs={'units': coord.units}) 
              for dim, coord in geobox.coords.items()]
    return xr.DataArray(data, coords=coords)
print(f"Cell finished at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

Cell finished at 2019-03-25 13:19:19


In [11]:
def get_shapes(shape_file):
    """
    Extract spatial inforamtion from polygons within shapefile
    """
    with fiona.open(shape_file) as shapes:
        crs = datacube.utils.geometry.CRS(shapes.crs_wkt)
        for shape in shapes:
            if shape['geometry'] is None:
                continue
            geom = datacube.utils.geometry.Geometry(shape['geometry'], crs=crs)
            geom = geom.to_crs(CRS('EPSG:3577'))
            yield geom, shape['properties']
print(f"Cell finished at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

Cell finished at 2019-03-25 13:19:19


In [12]:
def fc_and_water_summary(data, mask_int):
    """
    Calculate the percentage and the area of each FC component, 
    water, and unclassified data within each SA2 boundary. 
    """    
    # Where there is water, set the FC bands to 0%
    fc = data[['PV', 'NPV', 'BS']].where(data.water==0, other=0)
    fc['water'] = data.water.where(numpy.isfinite(fc['BS'])) * numpy.float32(100)
    fc = fc.apply(lambda data_array: data_array.clip(0, 100).where(numpy.isfinite(data_array)))
    
    # Flatten to a monthly mean
    fc = fc.groupby(fc.time.astype('datetime64[M]'), squeeze=False).mean(dim='time', skipna=True)
    # Calculate percentage of cover based on area of mask
    percentage = fc.sum(dim=['x','y']) * (100 / mask_int)
    
    for da in percentage.data_vars.values():
        da.attrs['units'] = '%'

    fc_unclass = percentage.to_array('variable').sum(dim='variable')
    fc_unclass = 100 - float(fc_unclass.values)
    print(f"   PV = {int(percentage.PV.values)}%, NPV = {int(percentage.NPV.values)}%, BS = {int(percentage.BS.values)}%, water = {int(percentage.water.values)}%, unclassified = {int(fc_unclass)}%")
    
    fc['unclassified'] = ('time', numpy.repeat(fc_unclass,fc.time.size)) 
    percentage['unclassified'] = ('time', numpy.repeat(fc_unclass,fc.time.size)) 
        
    pixel_area_in_metres2 = 25 * 25
    m2_to_km2 = (1 / 1_000_000)
    percent_to_fraction = (1 / 100)
    
    area = (fc * (pixel_area_in_metres2 * m2_to_km2 * percent_to_fraction)).sum(dim=['x','y'])
    area = area.rename({'BS': 'BS_area', 
                        'PV': 'PV_area', 
                        'NPV': 'NPV_area', 
                        'water': 'water_area', 
                        'unclassified':'unclassified_area'})
    
    for da in area.data_vars.values():
        da.attrs['units'] = 'km^2'
    fc = percentage.merge(area)
    return fc

In [13]:
def plot_stacked(ds, sa2_id,plot_title='title', show=True):
    """
    Create and save a stacked plot to visualise FC components
    """
    if not show:
        plt.ioff()
        
    fig,ax = plt.subplots(figsize=(10,5))
    ax.stackplot(ds.dropna(dim='time').time.data, 
                 ds.dropna(dim='time').PV,
                 ds.dropna(dim='time').NPV, 
                 ds.dropna(dim='time').BS, 
                 ds.dropna(dim='time').water, 
                 ds.dropna(dim='time').unclassified, 
                 colors = ['darkolivegreen','olive','tan','darkblue','red'], 
                 labels=['PV','NPV','BS','Water','Unclassified',])
    plt.legend(loc='upper center', ncol = 5)
    plt.title(f'FC components: SA2 ID {sa2_id}', size=12)
    plt.ylabel('Percentage (%)', size=12) #Set Y label
    plt.xlabel('Date', size=12) #Set X label
    
    plt.savefig(f'{output_path}/{sa2_id}_{plot_title}.png');
    plt.close(fig)
    
    # Turn interactive back on
    if not show:
        plt.show()
print(f"Cell finished at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

Cell finished at 2019-03-25 13:19:19


In [14]:
def month_splitter(start_year, end_year_inclusive):
    """ 
    Split specified years into months 
    """
    yield from (str(p) for p in pd.period_range(start=pd.Period(start_year).start_time, 
                               end=pd.Period(end_year_inclusive).end_time, 
                               freq='M'))

In [15]:
def output_csv(input_ds, sa2_id, sa2_name, sa2_size, monthly_or_annual='frequency'):
    """
    Save tabulated data into a csv
    """
    input_ds = input_ds.to_dataframe()
    input_ds.insert(0,'SA2_id', sa2_id)
    input_ds.insert(1,'SA2_name', sa2_name)
    input_ds.insert(2,'SA2_size', sa2_size)
    input_ds.to_csv(f"{output_path}/tabulate_KO_{monthly_or_annual}_FC.csv",mode='a',header=False)  

## 7. Set up the query
For each year and polygon query the product, apply the geometry mask and compute the fractional cover stats

Using `client.compute()` lets us use the monthly results in calculating the annual results at the same time.

In [None]:
#Obtain spatial information from shapefile
shape_file = os.path.expanduser(f'{shapefile_path}')
shapes = list(get_shapes(shape_file))

# #Specify a particular SA2 boundary, if required
#shapes = [(g,p) for g, p in shapes if str(p['SA2_MAIN16']) == '312011340']
shapes = [(g,p) for g, p in shapes if str(p['AREASQKM16']) < '5_000']


print(f"Cell finished at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

If we have enough resources, we can start the query and calculation of the next year's data while the previous is still being calculated. `by_slice=False` will be faster, but use more memory.

For larger areas `by_slice` will need to be `True`, so that the compute cluster does not become overwhelmed.  

If you get the error:
> `distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting`

then you will need to set `by_slice=True`

In [17]:
by_slice=True

In [18]:
def process_area(geometry, sa2_id, sa2_name, sa2_size, time_range):
    monthly_values = []
    annual_values = []
    mask = None

    for sub_time_range in month_splitter(time_range[0], time_range[-1]):
        print(sub_time_range)
        try:
            data = fc_and_water.load(dc, dask_chunks=chunk_size, 
                                     time=sub_time_range, 
                                     geopolygon=geometry)
        except ValueError:
            print(f'    No data for {sub_time_range} , skipping...')
            continue

        if mask is None:
            mask = geometry_mask([geometry], data.geobox, invert=True, chunks=data.chunks)
            mask_int = mask * 1
            mask_int = mask_int.sum() * 100
        data = data.where(mask)
        monthly_data = fc_and_water_summary(data, mask_int)
        monthly_data = client.compute(monthly_data, sync=by_slice)
        monthly_values.append(monthly_data)
        
    if not by_slice:
        print("  all years queried, hard load data")
        monthly_values = client.gather(monthly_values)

    monthly_values = xr.concat(monthly_values, dim='time')
    monthly_values = monthly_values.where(monthly_values['unclassified'] < 10).dropna(dim='time')
    annual_values = monthly_values.resample(time='1YS').mean(dim='time', skipna=True)
    print(monthly_values)
    print(annual_values)
    print(f"Calculation complete for annual values")
    
    plot_stacked(monthly_values, sa2_id, plot_title='monthly_plot_wofs',show=False)
    plot_stacked(annual_values, sa2_id, plot_title='annual_plot_wofs', show=False)
    
    print("All data loaded, save to csv")
    output_csv(monthly_values, sa2_id, sa2_name, sa2_size, monthly_or_annual='monthly')
    output_csv(annual_values, sa2_id, sa2_name, sa2_size, monthly_or_annual='annual')
    
    print(f"SA2 {sa2_id} done")

## 8. Process query

In [19]:
#Create empty CSV with specified headings
header = ['DATE','SA2_ID', 'SA2_NAME', 'SA2_SIZE','PV_PERCENTAGE','NPV_PERCENTAGE','BS_PERCENTAGE','WOFL_PERCENTAGE',
          'UNCLASSIFIED_PERCENTAGE','PV_AREA_SQKM_ALBERS3577','NPV_AREA_SQKM_ALBERS3577',
          'BS_AREA_SQKM_ALBERS3577','WOFL_AREA_SQKM_ALBERS3577','UNCLASSIFIED_AREA_SQKM_ALBERS3577']

with open(f"{output_path}/tabulate_annual_FC.csv","w") as outcsv: #create csv to save output and add header text
    writer = csv.writer(outcsv)
    writer.writerow(header)
    
with open(f"{output_path}/tabulate_monthly_FC.csv","w") as outcsv: #create csv to save output and add header text
    writer = csv.writer(outcsv)
    writer.writerow(header)

In [21]:
#Run tabulation script
for geometry, properties in shapes:
    sa2_id = str(properties['SA2_MAIN16'])
    sa2_name = str(properties['SA2_NAME16'])
    sa2_size = str(properties['AREASQKM16'])
    print(f"SA2 ID: {sa2_id}, size: {sa2_size}km^2, time: {time_range}")
    
    try:
          process_area(geometry, sa2_id, sa2_name, sa2_size, time_range)
    except Exception as e:
          print(f"Could not process {sa2_id}: {e}")
          client.restart()

SA2 ID: 312011340, size: 21092.2365km^2, time: ('2011', '2012')
2011-01
   PV = 63%, NPV = 27%, BS = 3%, water = 1%, unclassified = 3%
2011-02
   PV = 36%, NPV = 13%, BS = 3%, water = 0%, unclassified = 46%




KeyboardInterrupt: 