# C3 FC percentile validation

* [**Sign up to the DEA Sandbox**](https://docs.dea.ga.gov.au/setup/sandbox.html) to run this notebook interactively from a browser
* **Compatibility:** Notebook currently compatible with the `DEA Sandbox` environment
* **Products used:** 
[fc_percentile_albers_annual](https://explorer.sandbox.dea.ga.gov.au/products/fc_percentile_albers_annual), 
C3 fc percentile test product

## Description
This notebook validates the new C3 FC percentile product against the C2 product `fc_percentile_albers_annual`. It produces the outputs used in the validation report.

1. Generate distributions and plot PDFs as the validation results
2. Produce the summary of validation results
3. Plot examples of the findings

***

## Getting started

Install the required package by running

`!pip install awswrangler`

in the top cell or in a terminal, then restart the notebook.

In [None]:
import datacube
import rasterio
import boto3
import xarray as xr
import numpy as np
import re
from datacube.utils.dask import start_local_dask
from datacube import Datacube
from osgeo import ogr, gdal, osr
from scipy.stats import norm
import pandas as pd
import matplotlib.pyplot as plt
import os
import scipy.stats as sps
import awswrangler as wr

In [None]:
# create a local cluster
client = start_local_dask(n_workers=1, threads_per_worker=7, memory_limit='56GB')
client

In [4]:
# `dev` is the credential profile name
# change it accordingly
session = boto3.Session(profile_name='dev')
fc_bucket = "s3://dea-public-data-dev/test/fc-percentile/"

In [5]:
# list all the available file paths/prefix
fc_x_dirs = wr.s3.list_directories(fc_bucket, boto3_session=session)
fc_file_dirs = []
for x_idx in fc_x_dirs:
    fc_file_dirs += wr.s3.list_directories(x_idx, boto3_session=session)

In [6]:
fc_file_dirs

['s3://dea-public-data-dev/test/fc-percentile/x12/y19/',
 's3://dea-public-data-dev/test/fc-percentile/x12/y20/',
 's3://dea-public-data-dev/test/fc-percentile/x14/y29/',
 's3://dea-public-data-dev/test/fc-percentile/x15/y29/',
 's3://dea-public-data-dev/test/fc-percentile/x17/y19/',
 's3://dea-public-data-dev/test/fc-percentile/x17/y20/',
 's3://dea-public-data-dev/test/fc-percentile/x17/y37/',
 's3://dea-public-data-dev/test/fc-percentile/x18/y19/',
 's3://dea-public-data-dev/test/fc-percentile/x18/y20/',
 's3://dea-public-data-dev/test/fc-percentile/x18/y37/',
 's3://dea-public-data-dev/test/fc-percentile/x19/y25/',
 's3://dea-public-data-dev/test/fc-percentile/x20/y25/',
 's3://dea-public-data-dev/test/fc-percentile/x24/y40/',
 's3://dea-public-data-dev/test/fc-percentile/x27/y42/',
 's3://dea-public-data-dev/test/fc-percentile/x27/y43/',
 's3://dea-public-data-dev/test/fc-percentile/x28/y31/',
 's3://dea-public-data-dev/test/fc-percentile/x28/y42/',
 's3://dea-public-data-dev/test

In [42]:
def generate_seamask(shape_file, data_shape, orig_coords, resolution):
    """
        create a mask that excludes oceans
        input:
            shape_file: path to the shapefile of the Australian coastline
            data_shape: shape (rows, columns) of the loaded data to be masked
            orig_coords: (x, y) centre of the top-left pixel, used to build the GDAL geotransform
            resolution: pixel size with signs, e.g., (30, -30) for C3 and (25, -25) for C2
        output:
            a numpy array of mask, where valid pixels = 1
    """
    source_ds = ogr.Open(shape_file)
    source_layer = source_ds.GetLayer()
    source_layer.SetAttributeFilter("FEAT_CODE!='sea'")

    yt, xt = data_shape
    xres = resolution[0]
    yres = resolution[1]
    no_data = 0

    xcoord, ycoord = orig_coords
    geotransform = (xcoord - (xres*0.5), xres, 0, ycoord - (yres*0.5), 0, yres)

    target_ds = gdal.GetDriverByName('MEM').Create('', xt, yt, gdal.GDT_Byte)
    target_ds.SetGeoTransform(geotransform)
    albers = osr.SpatialReference()
    albers.ImportFromEPSG(3577)
    target_ds.SetProjection(albers.ExportToWkt())
    band = target_ds.GetRasterBand(1)
    band.SetNoDataValue(no_data)

    gdal.RasterizeLayer(target_ds, [1], source_layer, burn_values=[1])
    return band.ReadAsArray()
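
The geotransform in `generate_seamask` shifts the origin by half a pixel, which suggests that `orig_coords` holds the centre of the top-left pixel while GDAL expects its outer corner. A minimal sketch of that arithmetic in plain Python follows; `corner_geotransform` is a hypothetical helper and the coordinates are made up for illustration, not taken from a real tile.

```python
def corner_geotransform(orig_coords, resolution):
    """Build a GDAL-style geotransform from the centre of the top-left pixel."""
    xcoord, ycoord = orig_coords
    xres, yres = resolution
    # shift by half a pixel: west for x, north for y (yres is negative)
    return (xcoord - xres * 0.5, xres, 0, ycoord - yres * 0.5, 0, yres)

# hypothetical pixel-centre coordinates for the top-left pixel of a C3 tile
gt = corner_geotransform((1000015.0, -2999985.0), (30, -30))
print(gt)
```

Because `yres` is negative, subtracting `yres*0.5` moves the origin north, so the same expression works for both C2 and C3 resolutions.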


In [41]:
def random_samples(input_array, pixel_size, size_div=150*150):
    """
        randomly sample the data with replacement
        input:
            input_array: the array of data to sample
            pixel_size: area of a pixel, e.g., 30^2 for C3 and 25^2 for C2
            size_div: area which includes integer numbers of pixels for both C2 and C3, default 150^2
        output:
            mean and variance of the random samples
    """
    sample_mean = 0
    sample_var = 0
    tmp_array = input_array.reshape(-1)
    tmp_array = tmp_array[~np.isnan(tmp_array)]
    size_d = size_div / pixel_size * 1e3
    batches = tmp_array.size // size_d
    for i in range(int(batches*2)):
        sample_array = tmp_array[np.random.randint(0, int(tmp_array.size), int(size_d))]
        sample_mean += sample_array.mean()
        sample_var += np.var(sample_array)
    return (sample_mean/batches/2, sample_var/batches/2)
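
The batch size in `random_samples` depends on the pixel area, so that a batch drawn from C2 and a batch drawn from C3 cover the same ground area. The sketch below only reproduces the `size_d` arithmetic from the function; `samples_per_batch` is a hypothetical name introduced for illustration.

```python
def samples_per_batch(pixel_size, size_div=150 * 150):
    """Number of pixels drawn per batch, mirroring size_d in random_samples."""
    return int(size_div / pixel_size * 1e3)

c3_batch = samples_per_batch(30 ** 2)   # 25 pixels per 150 m x 150 m block, times 1000
c2_batch = samples_per_batch(25 ** 2)   # 36 pixels per 150 m x 150 m block, times 1000
print(c3_batch, c2_batch)

# ground area covered by one batch, in square metres, for each product
print(c3_batch * 30 ** 2, c2_batch * 25 ** 2)
```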

1.1. Generate the mean and variance for each grid in the list
-----

In [None]:
dc = Datacube()
var_list = ["pv_pc_", "npv_pc_", "bs_pc_"]
perc_list = ["10", "50", "90"]
i = 0
pd_columns = []
for v in var_list:
    for p in perc_list:
        pd_columns += [v+p+'_mean']
        pd_columns += [v+p+'_var']

In [None]:
# loop over the path/prefix of all the test grids
for f_dir in fc_file_dirs:
    dataset = None
    for i in range(1987, 2021):
        # get the path/prefix of every year for the grid
        non_empty_list = wr.s3.list_objects(f_dir + str(i), boto3_session=session, suffix=['tif'])
        if non_empty_list == []:
            continue
        tmp_set = []
        # load all the data into dask array and named by the year
        for o in non_empty_list:
            data = xr.open_rasterio(o, chunks={'x':3200, 'y':3200})
            data.name = re.findall(r'(?<=P1Y_)\w+', o)[0]
            tmp_set += [data]
        # give the xarray dataset the same layout as C2
        tmp_set = xr.merge(tmp_set)
        tmp_set = tmp_set.rename_dims({'band': 'time'})
        tmp_set = tmp_set.rename_vars({'band': 'time'})
        tmp_set.time.data[0] = i
        if dataset is None:
            dataset = tmp_set
        else:
            dataset = xr.concat([dataset, tmp_set], dim='time')
    # mask nodata
    re_c3 = dataset.where(dataset.qa==2, 0)
    # query and load C2 by the geolocation of grid
    query = {'time':('1987-01-01', '2021-01-01'), 'x': (re_c3.x.data.min() - 15, re_c3.x.data.max() + 15), 'y': (re_c3.y.data.min() - 15, re_c3.y.data.max() + 15), 'crs': 'EPSG:3577'}
    c2_data = dc.load(product='fc_percentile_albers_annual', **query, dask_chunks={'time':1})
    # mask nodata
    re_c2 = c2_data.where(c2_data > -1, 0)
    
    # generate raster of oceans mask
    c2_land_raster = generate_seamask("aus_map/cstauscd_r_3577.shp", re_c2.PV_PC_10.shape[1:], (re_c2.x.data.min(), re_c2.y.data.max()), (25, -25))
    c3_land_raster = generate_seamask("aus_map/cstauscd_r_3577.shp", re_c3.pv_pc_10.shape[1:], (re_c3.x.data.min(), re_c3.y.data.max()), (30, -30))
    # initialise pandas DataFrames to save the results
    results_c2 = pd.DataFrame(columns=pd_columns, index=np.arange(1987, 2021))
    results_c3 = pd.DataFrame(columns=pd_columns, index=np.arange(1987, 2021))
    
    print("start load data")
    re_c3.load()
    re_c2.load()
    # compute mean and variance for each band and each year
    for y in range(1987, 2021):
        for v in var_list:
            for p in perc_list:
                results_c3.loc[y, v+p+'_mean'], results_c3.loc[y, v+p+'_var'] = random_samples(re_c3[v+p].loc[dict(time=y)].where(c3_land_raster > 0, 0).data, 30**2)
                results_c2.loc[y, v+p+'_mean'], results_c2.loc[y, v+p+'_var'] = random_samples(re_c2[v.upper()+p].loc[dict(time=str(y)+'-01-01')].where(c2_land_raster > 0, 0).data, 25**2)
    # save the results to csvs
    # named after grid index and product version, e.g., x14_y29_c2.csv
    results_c2.to_csv('_'.join(f_dir.split('/')[-3:-1])+'_c2.csv')
    results_c3.to_csv('_'.join(f_dir.split('/')[-3:-1])+'_c3.csv')
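
The CSV names are derived from the last two path components of each S3 prefix. A small sketch of that string handling, using one of the prefixes listed earlier; `tile_name_from_prefix` is a hypothetical helper that wraps the same expression used in the loop.

```python
def tile_name_from_prefix(f_dir):
    """Turn '.../x14/y29/' into 'x14_y29', as done when naming the CSVs."""
    # the trailing '/' yields an empty last element, so [-3:-1] picks x and y
    return '_'.join(f_dir.split('/')[-3:-1])

prefix = 's3://dea-public-data-dev/test/fc-percentile/x14/y29/'
tile_name = tile_name_from_prefix(prefix)
print(tile_name + '_c2.csv', tile_name + '_c3.csv')
```

Note that the expression relies on the prefix ending with `/`; without it the slice would pick up the parent directory instead.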

1.2. Plot PDFs of each grid from the mean and variance saved in the CSVs
--------------------

In [None]:
for f_dir in fc_file_dirs:
    tile_name = '_'.join(f_dir.split('/')[-3:-1])
    if os.path.exists(tile_name+'.png'):
        continue
    # read in results from csvs
    results_c2 = pd.read_csv(tile_name+'_c2.csv', index_col=0)
    results_c3 = pd.read_csv(tile_name+'_c3.csv', index_col=0)
    fig, axs = plt.subplots(34, 9,  sharey=True, sharex=True, figsize=(20, 60))
    i = 0
    j = 0
    # plot PDFs for each year and each band
    for y in range(1987, 2021):
        for v in var_list:
            for p in perc_list:
                # evaluate each fitted normal PDF over mean ± 3 standard deviations
                # and plot it against x so the C2 and C3 curves share a value axis
                for res, label, color in ((results_c3, 'C3 distribution', 'darkblue'), (results_c2, 'C2 distribution', 'darkorange')):
                    mu, sigma = res.loc[y, v+p+'_mean'], np.sqrt(res.loc[y, v+p+'_var'])
                    x = np.arange(mu - 3*sigma, mu + 3*sigma, 1)
                    axs[j, i].plot(x, norm.pdf(x, mu, sigma), label=label, color=color)
                # set title of columns of plotting grid
                if j == 0:
                    axs[j, i].set_title(v+p)
                # set title of rows of plotting grid
                if i == 0:
                    axs[j, i].set_ylabel(str(y), rotation=90, size='large')
                i += 1
        j += 1
        i = 0
    plt.tight_layout()
    # plot legends shared by all subplots
    handles, labels = axs[0, 0].get_legend_handles_labels()
    fig.legend(handles, labels, loc='upper left')
    fig.savefig(tile_name+'.png')
    print("plot", tile_name)
    plt.close()
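
Each curve in these figures is a normal PDF defined by the sampled mean and variance and evaluated over the mean ± 3 standard deviations. The sketch below shows that evaluation on its own. It swaps `scipy.stats.norm` for the standard library's `statistics.NormalDist` to stay dependency-free, and the mean and variance are illustrative values, not validation results; `normal_pdf_curve` is a hypothetical helper.

```python
from statistics import NormalDist

def normal_pdf_curve(mean, var, step=1.0):
    """Return (x, pdf) lists over mean ± 3 standard deviations."""
    sigma = var ** 0.5
    dist = NormalDist(mu=mean, sigma=sigma)
    n = int(6 * sigma / step)
    xs = [mean - 3 * sigma + k * step for k in range(n)]
    return xs, [dist.pdf(x) for x in xs]

# illustrative values only
xs, ys = normal_pdf_curve(mean=40.0, var=25.0)
print(len(xs), xs[0], xs[-1])
print(max(ys))
```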

2. Plot the summary of the validation results
-------

In [9]:
# define the columns to summarize
mean_columns = []
for v in var_list:
    for p in perc_list:
        mean_columns += [v+p+'_mean']

In [None]:
# scatter plot the difference of mean for all testing grids in the list
fig, axs = plt.subplots(3, 3,  sharey=True, sharex=True, figsize=(6, 6))
for f_dir in fc_file_dirs:
    tile_name = '_'.join(f_dir.split('/')[-3:-1])
    if tile_name in ['x40_y13', 'x43_y15', 'x45_y17']:
        continue
    # read in the results saved in the csvs
    results_c2 = pd.read_csv(tile_name+'_c2.csv', index_col=0)
    results_c3 = pd.read_csv(tile_name+'_c3.csv', index_col=0)
    # compute the difference of mean for all the bands
    mean_diff = results_c3[mean_columns] - results_c2[mean_columns]
    i = 0
    j = 0
    # scatter plot the differences
    for v in var_list:
        for p in perc_list:
            axs[i, j].plot(mean_diff[v+p+'_mean'], 'o', color='SteelBlue',  mfc='none', markersize=3)
            # set title of columns of plotting grid
            if i == 0:
                axs[i, j].set_title('pc_'+str(p))
            # set title of rows of plotting grid
            if j == 0:
                axs[i, j].set_ylabel(v.split('_')[0], rotation=90, size='large')
            j += 1
        i += 1
        j = 0
# mark the crucial time points
for i in range(3):
    for j in range(3):
        axs[i, j].axvline(x=1987, linestyle='--', color='darkgreen', label='1987 LS5 start')
        axs[i, j].axvline(x=1999, linestyle='--', color='darkblue', label='1999 LS5/LS7 switch')
        axs[i, j].axvline(x=2003, linestyle='--', color='darkorange', label='2003 LS7 broken')
        axs[i, j].axvline(x=2013, linestyle='--', color='OliveDrab', label='2013 LS8 start')
# plot the legends shared by all subplots
handles, labels = axs[0, 0].get_legend_handles_labels()
fig.legend(handles, labels, loc='lower center', ncol=2)
fig.savefig("all_tiles_mean_diff.png", bbox_inches='tight')

In [None]:
# histogram plot for the difference of mean for all testing grids in the list
fig, axs = plt.subplots(3, 3,  sharey=True, sharex=True, figsize=(6, 6))
mean_diff = None
# compute the difference of mean for all grids
for f_dir in fc_file_dirs:
    tile_name = '_'.join(f_dir.split('/')[-3:-1])
    if tile_name in ['x40_y13', 'x43_y15', 'x45_y17']:
        continue
    results_c2 = pd.read_csv(tile_name+'_c2.csv', index_col=0)
    results_c3 = pd.read_csv(tile_name+'_c3.csv', index_col=0)
    # DataFrame.append was removed in pandas 2.0, so accumulate with pd.concat
    tile_diff = results_c3[mean_columns] - results_c2[mean_columns]
    mean_diff = tile_diff if mean_diff is None else pd.concat([mean_diff, tile_diff])
i = 0
j = 0
for v in var_list:
    for p in perc_list:
        # histogram plot for each band
        axs[i, j].hist(mean_diff[v+p+'_mean'].to_numpy(), color='SteelBlue', bins=50, density=True)
        kde = sps.gaussian_kde(mean_diff[v+p+'_mean'].to_numpy())
        axs[i, j].plot(np.arange(-20, 20, 0.1), kde.pdf(np.arange(-20, 20, 0.1)), color='darkorange', linestyle='--', linewidth=1)
        if i == 0:
            axs[i, j].set_title('pc_'+str(p))
        if j == 0:
            axs[i, j].set_ylabel(v.split('_')[0], rotation=90, size='large')
        j += 1
    i += 1
    j = 0
fig.savefig("all_tiles_hist.png", bbox_inches='tight')
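
The dashed curve over each histogram is a Gaussian kernel density estimate of the pooled mean differences. `scipy.stats.gaussian_kde` uses Scott's rule for the bandwidth by default; the sketch below reimplements the 1-D case with the standard library only, as an illustration of the technique rather than a replacement for the scipy call. The input differences are made up, and `gaussian_kde_pdf` is a hypothetical helper.

```python
import math
from statistics import stdev

def gaussian_kde_pdf(samples, x):
    """Evaluate a 1-D Gaussian kernel density estimate at x."""
    n = len(samples)
    # Scott's rule bandwidth for 1-D data: sample std (ddof=1) times n^(-1/5)
    h = stdev(samples) * n ** (-1 / 5)
    scale = 1 / (n * h * math.sqrt(2 * math.pi))
    return scale * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)

# illustrative differences only, not taken from the validation results
diffs = [-2.0, -1.0, 0.0, 1.0, 2.0]
grid = [-20 + 0.1 * k for k in range(400)]   # same range and step as the plotted curve
density = [gaussian_kde_pdf(diffs, x) for x in grid]
area = sum(density) * 0.1                    # Riemann sum, should be close to 1
print(round(area, 3))
```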

3. Plot an FC percentile band of any testing grid
---------------------

In [None]:
# plot an FC percentile band for a chosen grid
# the data are read the same way as in the computation loop above
f_dir = fc_bucket + 'x45/y17/'
dataset = None
for i in range(1987, 2021):
    non_empty_list = wr.s3.list_objects(f_dir + str(i), boto3_session=session, suffix=['tif'])
    if non_empty_list == []:
        continue
    tmp_set = []
    for o in non_empty_list:
        data = xr.open_rasterio(o, chunks={'x':3200, 'y':3200})
        data.name = re.findall(r'(?<=P1Y_)\w+', o)[0]
        tmp_set += [data]
    tmp_set = xr.merge(tmp_set)
    tmp_set = tmp_set.rename_dims({'band': 'time'})
    tmp_set = tmp_set.rename_vars({'band': 'time'})
    tmp_set.time.data[0] = i
    if dataset is None:
        dataset = tmp_set
    else:
        dataset = xr.concat([dataset, tmp_set], dim='time')
re_c3 = dataset.where(dataset.qa==2)
query = {'time':('1987-01-01', '2021-01-01'), 'x': (re_c3.x.data.min() - 15, re_c3.x.data.max() + 15), 'y': (re_c3.y.data.min() - 15, re_c3.y.data.max() + 15), 'crs': 'EPSG:3577'}
c2_data = dc.load(product='fc_percentile_albers_annual', **query, dask_chunks={'time':1})
re_c2 = c2_data.where(c2_data > -1)

c2_land_raster = generate_seamask("aus_map/cstauscd_r_3577.shp", re_c2.PV_PC_10.shape[1:], (re_c2.x.data.min(), re_c2.y.data.max()), (25, -25))
c3_land_raster = generate_seamask("aus_map/cstauscd_r_3577.shp", re_c3.pv_pc_10.shape[1:], (re_c3.x.data.min(), re_c3.y.data.max()), (30, -30))

In [None]:
# plot the valid data for a band
re_c3.pv_pc_10.loc[dict(time=2018)].where(c3_land_raster > 0).compute().plot(aspect=1.5, size=10)
plt.savefig('x45y17_2018_c3.png', bbox_inches='tight')

In [None]:
# title too long for C2, drop spatial_ref: 3577
re_c2 = re_c2.drop_vars('spatial_ref')

In [None]:
# plot the valid data for a band
re_c2.PV_PC_10.loc[dict(time='2018-01-01')].where(c2_land_raster > 0).compute().plot(aspect=1.5, size=10)
plt.savefig('x45y17_2018_c2.png', bbox_inches='tight')

***

## Additional information

**License:** The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). 
Digital Earth Australia data is licensed under the [Creative Commons by Attribution 4.0](https://creativecommons.org/licenses/by/4.0/) license.

**Contact:** If you need assistance, please post a question on the [Open Data Cube Slack channel](http://slack.opendatacube.org/) or on the [GIS Stack Exchange](https://gis.stackexchange.com/questions/ask?tags=open-data-cube) using the `open-data-cube` tag (you can view previously asked questions [here](https://gis.stackexchange.com/questions/tagged/open-data-cube)).
If you would like to report an issue with this notebook, you can file one on [GitHub](https://github.com/GeoscienceAustralia/dea-notebooks).

**Last modified:** August 2021

**Compatible datacube version:** 

In [125]:
print(datacube.__version__)

1.8.4.dev81+g80d466a2


## Tags
Browse all available tags on the DEA User Guide's [Tags Index](https://docs.dea.ga.gov.au/genindex.html)

**Tags**: :index:`sandbox compatible`, :index:`landsat 8`, :index:`landsat 7`, :index:`landsat 5`, :index:`fc percentile`