# Extracting training data from the ODC <img align="right" src="../Supplementary_data/DE_Africa_Logo_Stacked_RGB_small.jpg">

* **Products used:** 
[ga_ls8c_gm_2_annual](https://explorer.digitalearth.africa/ga_ls8c_gm_2_annual), [s2_l2a](https://explorer.digitalearth.africa/s2_l2a)


## Background

**Training data** is the most important part of any supervised machine learning workflow. The quality of the training data has a greater impact on the classification than the algorithm used. Large and accurate training data sets are preferable: increasing the training sample size results in increased classification accuracy ([Maxwell et al. 2018](https://www.tandfonline.com/doi/full/10.1080/01431161.2018.1433343)). A review of training data methods in the context of Earth Observation is available [here](https://www.mdpi.com/2072-4292/12/6/1034).

When creating training data, be sure to capture the **spectral variability** of the class, and to use imagery from the time period you want to classify (rather than relying on basemap composites). Another common problem with training data is **class imbalance**. This can occur when one of your classes is relatively rare and therefore the rare class will comprise a smaller proportion of the training set. When imbalanced data is used, it is common that the final classification will under-predict less abundant classes relative to their true proportion.
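
As a rough illustration of checking for class imbalance, the sketch below counts the share of each class in a list of labels and then downsamples every class to the size of the rarest one. The label values and the helper names (`class_proportions`, `downsample_to_smallest`) are hypothetical and are not part of the DE Africa scripts; downsampling is only one of several possible remedies.

```python
import random
from collections import Counter


def class_proportions(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def downsample_to_smallest(labels, seed=0):
    """Return sorted sample indices with every class cut to the rarest class size."""
    rng = random.Random(seed)
    by_class = {}
    for idx, cls in enumerate(labels):
        by_class.setdefault(cls, []).append(idx)
    n_min = min(len(idxs) for idxs in by_class.values())
    keep = []
    for idxs in by_class.values():
        # Sample without replacement within each class
        keep.extend(rng.sample(idxs, n_min))
    return sorted(keep)


# Hypothetical labels: 1 = crop (rare), 0 = non-crop (common)
labels = [0] * 90 + [1] * 10
proportions = class_proportions(labels)
balanced_idx = downsample_to_smallest(labels)
balanced_labels = [labels[i] for i in balanced_idx]

print(proportions)
print(Counter(balanced_labels))
```

Downsampling discards data from the abundant class, so it trades sample size for balance; whether that trade is worthwhile depends on how rare the minority class is.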

There are many platforms for gathering training data; the best one depends on your application. GIS platforms are great for collecting training data as they are highly flexible and mature; [Geo-Wiki](https://www.geo-wiki.org/) and [Collect Earth Online](https://collect.earth/home) are two open-source websites that may also be useful, depending on the reference data strategy employed.

This notebook assumes the user has already collected reference data polygons, and now wishes to extract feature layers from the `open-data-cube` to assist in classifying satellite images.


## Description
This notebook will extract training data (feature layers) from the `open-data-cube` using geometries within a shapefile (or geojson). To do this, we rely on a custom `deafrica-sandbox-notebooks` function called `collect_training_data`, contained within the [deafrica_classificationtools](../Scripts/deafrica_classificationtools.py) script. The goal of this notebook is to familiarise users with this function so they can extract the appropriate data for their use-case.

1. Preview the polygons in our training data by plotting them on a basemap
2. Extract training data from the datacube using `collect_training_data`'s inbuilt feature layer parameters
3. Extract training data from the datacube using a custom-defined feature layer function that we can pass to `collect_training_data`
4. Export the training data to disk for use in subsequent scripts

***

## Getting started

To run this analysis, run all the cells in the notebook, starting with the "Load packages" cell. 

### Load packages


In [None]:
# !pip install richdem
# !pip install https://packages.dea.ga.gov.au/hdstats/hdstats-0.1.5.tar.gz
# !pip install dask-ml

In [1]:
%matplotlib inline

import sys
import os
import warnings
import datacube
import numpy as np
import xarray as xr
import subprocess as sp
import geopandas as gpd
from datacube.utils.geometry import assign_crs

sys.path.append('../Scripts')
from deafrica_plotting import map_shapefile
from xr_geomedian_tmad import xr_geomedian_tmad
from deafrica_bandindices import calculate_indices
from deafrica_classificationtools import collect_training_data 

warnings.filterwarnings("ignore")

# from deafrica_temporal_statistics import temporal_statistics
%load_ext autoreload
%autoreload 2



## Analysis parameters

* `path`: The path to the input shapefile from which we will extract training data. A default shapefile is provided.
* `field`: This is the name of the column in your shapefile attribute table that contains the class labels. **The class labels must be integers.**
* `ncpus`: Set this value to > 1 to parallelize the collection of training data, e.g. `ncpus=8`.

> **Note**: With supervised classification, it's common to have many labelled geometries in the training data. `collect_training_data` can parallelize across the geometries to speed up the extraction of training data. Setting `ncpus>1` automatically triggers the parallelization; however, it's best to begin with `ncpus=1` to assist with debugging.


In [2]:
path = 'data/training/test_training_dataset.shp' 
field = 'class'

### Find the number of cpus

In [3]:
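# Read the CPU count from the trailing characters of the environment
# variable(s) matching 'CPU'
try:
    ncpus = int(float(sp.getoutput('env | grep CPU')[-4:]))
except ValueError:
    ncpus = int(float(sp.getoutput('env | grep CPU')[-3:]))

print('ncpus = '+str(ncpus))

# Manually override the detected value (adjust to suit your machine)
ncpus = 15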
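
Where no CPU-related environment variable is set, the CPU count can also be queried through the Python standard library. The sketch below is an alternative to the cell above, not the approach this notebook uses; `detect_ncpus` and `ncpus_detected` are hypothetical names, and the value returned may differ from the limit configured for a given sandbox.

```python
import os


def detect_ncpus(default=1):
    """Return the number of CPUs usable by this process, or `default` if unknown."""
    try:
        # Available on Linux: honours the CPU affinity mask of this process
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Other platforms: fall back to the total logical CPU count
        return os.cpu_count() or default


ncpus_detected = detect_ncpus()
print('ncpus = ' + str(ncpus_detected))
```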


## Preview input data

We can load and preview our input data shapefile using `geopandas`. The shapefile should contain a column with class labels (e.g. 'class'). These labels will be used to train our model. 

> Remember, the class labels **must** be represented by `integers`.
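
If the labels in a shapefile are stored as text, they need to be converted to integers first. The following is a minimal sketch of one way to do that with an explicit lookup table; the label names, the `mapping` dictionary, and the `encode_labels` helper are hypothetical and chosen only to match the crop/non-crop example used in this notebook.

```python
def encode_labels(labels, mapping):
    """Translate text labels to integer codes, failing loudly on unknown labels."""
    missing = sorted(set(labels) - set(mapping))
    if missing:
        raise ValueError(f"No integer code defined for labels: {missing}")
    return [mapping[label] for label in labels]


# Hypothetical lookup table: 1 = crop, 0 = non-crop
mapping = {'noncrop': 0, 'crop': 1}
text_labels = ['crop', 'noncrop', 'crop', 'noncrop', 'noncrop']

codes = encode_labels(text_labels, mapping)
print(codes)
```

An explicit lookup table keeps the meaning of each integer stable between runs, which an automatically generated encoding would not guarantee.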


In [4]:
# Load input data shapefile
input_data = gpd.read_file(path)

# Preview the first five rows
input_data.head()

| Unnamed: 0 | class | geometry |
|---|---|---|
| 0 | 0 | POLYGON ((38.52475 12.27793, 38.52475 12.27841... |
| 1 | 0 | POLYGON ((35.27410 5.60795, 35.27410 5.60842, ... |
| 2 | 0 | POLYGON ((38.97103 -1.84843, 38.97103 -1.84796... |
| 3 | 0 | POLYGON ((38.30430 0.32585, 38.30430 0.32632, ... |
| 4 | 0 | POLYGON ((41.33019 3.89073, 41.33019 3.89120, ... |


In [6]:
# Plot training data in an interactive map
# map_shapefile(input_data, attribute=field)

## Extracting training data

The function `collect_training_data` takes our shapefile containing class labels and extracts training data from the datacube over the locations specified by the input geometries. The function will also pre-process our training data by stacking the arrays into a useful format and removing any `NaN` (not-a-number) values.

`collect_training_data` can generate many different types of **feature layers**. Relatively simple layers can be calculated using pre-defined parameters within the function; more complex layers can be computed by passing in a `custom_func`. To begin with, let's try generating feature layers using the pre-defined methods.

The in-built feature layer parameters are described below:
* `product`: The name of the product to extract from the datacube. In this example we use a geomedian composite from 2018, `'ga_ls8c_gm_2_annual'`
* `time`: The time range from which to extract data
* `calc_indices`: This parameter provides a method for calculating a number of remote sensing indices (e.g. `['NDWI', 'NDVI']`). Any of the indices found in the [deafrica_bandindices](../Scripts/deafrica_bandindices.py) script can be used here
* `drop`: If set to `True` and `calc_indices` is supplied, the spectral bands are dropped from the dataset, leaving only the band indices as data variables.
* `reduce_func`: The classification models we're applying here require our training data to be in two dimensions (i.e. `x` and `y`). If our data has a time dimension (e.g. if we load an annual time series of satellite images), then we need to collapse it. `reduce_func` is the summary statistic used to collapse the temporal dimension. Options are 'mean', 'median', 'std', 'max', 'min', and 'geomedian'. In the default example we are loading a geomedian composite, so there is no time dimension to reduce.
* `zonal_stats`: An optional string giving the name of the zonal statistic to calculate across each polygon. Default is `None` (all pixel values are returned). Supported values are 'mean', 'median', 'max', 'min', and 'std'.
* `return_coords`: If `True`, then the training data will contain two extra columns, 'x_coord' and 'y_coord', corresponding to the x, y coordinates of each sample. These columns can be useful for handling spatial autocorrelation between samples later on in the ML workflow.
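
To make the roles of `reduce_func` and `zonal_stats` concrete, the sketch below applies the two kinds of reduction to a small, made-up set of NDVI values for one polygon. It is a conceptual illustration using the standard library only, not the implementation inside `collect_training_data`; the numbers and variable names are hypothetical.

```python
from statistics import mean, median

# Hypothetical NDVI values for one polygon: 3 time steps x 4 pixels
ndvi = [
    [0.2, 0.4, 0.6, 0.8],  # time step 1
    [0.3, 0.5, 0.7, 0.9],  # time step 2
    [0.1, 0.3, 0.5, 0.7],  # time step 3
]

# Temporal reduction (like reduce_func='median'):
# collapse the time dimension, leaving one value per pixel
per_pixel = [median(values) for values in zip(*ndvi)]

# Zonal statistic (like zonal_stats='mean'):
# collapse the pixels, leaving one value for the whole polygon
per_polygon = mean(per_pixel)

print(per_pixel)
print(per_polygon)
```

With `zonal_stats=None` the second step would be skipped, so each pixel inside the polygon would contribute its own row of training data.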

In addition to the parameters required for `collect_training_data`, we also need to set up a few parameters for the open-data-cube query, such as `measurements` (the bands to load from the satellite), the `resolution` (the cell size), and the `output_crs` (the output projection). 

In [None]:
#set up our inputs to collect_training_data
products =  ['ga_ls8c_gm_2_annual']
time = ('2018')
reduce_func = None
calc_indices = ['NDVI','LAI', 'NDWI']
drop = False
zonal_stats = 'mean'
return_coords = True

# Set up the inputs for the ODC query
measurements =  ['blue','green','red','nir','swir_1','swir_2']
resolution = (-30,30)
output_crs='epsg:6933'

In [None]:
# generate a datacube query object
query = {
    'time': time,
    'measurements': measurements,
    'resolution': resolution,
    'output_crs': output_crs,
    'group_by' : 'solar_day',
}

Now let's run the `collect_training_data` function.

In [None]:
column_names, model_input = collect_training_data(
                                    gdf=input_data,
                                    products=products,
                                    dc_query=query,
                                    ncpus=ncpus,
                                    return_coords=return_coords,
                                    field=field,
                                    calc_indices=calc_indices,
                                    reduce_func=reduce_func,
                                    drop=drop,
                                    zonal_stats=zonal_stats)

The function returns two objects. The first (`column_names`) is a list of the names of the feature layers we've computed:

In [None]:
print(column_names)

The second (`model_input`) is a numpy array containing the data from our labelled geometries. The first column is the class integer (in the default example, 1 = 'crop' and 0 = 'noncrop'), and the remaining columns hold the values of each feature layer we computed:

In [None]:
print(np.array_str(model_input, precision=2, suppress_small=True))
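
Given that layout, the labels and features can be separated by column position. The sketch below uses small stand-in arrays (`demo_names`, `demo_input`) with made-up values rather than the real outputs, so it can be run on its own; the same indexing would apply to `column_names` and `model_input`.

```python
import numpy as np

# Hypothetical stand-ins for the outputs of collect_training_data
demo_names = ['class', 'blue', 'green', 'NDVI', 'x_coord', 'y_coord']
demo_input = np.array([
    [1, 0.04, 0.07, 0.61, 3464870.0, -406010.0],
    [0, 0.08, 0.10, 0.22, 3695990.0, -593590.0],
])

# The first column holds the class label
y = demo_input[:, 0].astype(int)

# Every other column except the coordinates is a feature layer
feature_names = [n for n in demo_names[1:] if n not in ('x_coord', 'y_coord')]
feature_idx = [demo_names.index(n) for n in feature_names]
X = demo_input[:, feature_idx]

print(feature_names)
print(X.shape, y)
```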

## Optional: Custom feature layers

The feature layers that are most relevant for discriminating the classes of your classification problem may be more complicated than those provided in the `collect_training_data` function. In this case, we can pass a custom feature layer function through the `custom_func` parameter.

* `custom_func`: A custom function for generating feature layers. If this parameter is set, all other options (excluding `zonal_stats`) will be ignored. The result of the `custom_func` must be a single xarray dataset containing 2D coordinates (i.e. x, y; no time dimension). The custom function has access to the datacube dataset extracted using the `dc_query` params. To load other datasets, you can use the `like=ds.geobox` parameter in `dc.load`.

First, let's define a custom feature layer function that will demonstrate how to extract more complicated temporal statistics.

In [7]:
from custom_feature_layers import xr_terrain

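def two_seasons_gm_mads(ds):
    """Return geomedian, TMAD and band-index layers for two six-month seasons, plus slope."""
    dc = datacube.Datacube(app='training')
    # Rescale the loaded band values (divide by 10000)
    ds = ds / 10000
    # Split the 2019 time series into two six-month seasons
    ds1 = ds.sel(time=slice('2019-01', '2019-06'))
    ds2 = ds.sel(time=slice('2019-07', '2019-12'))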
    
    def fun(ds, era):
        
        #geomedian and tmads
        gm_mads = xr_geomedian_tmad(ds)
        gm_mads = calculate_indices(gm_mads,
                               index=['NDVI', 'LAI'],
                               drop=False,
                               normalise=False,
                               collection='s2')
        
        gm_mads['edev'] = -np.log(gm_mads['edev'])
        gm_mads['sdev'] = -np.log(gm_mads['sdev'])
        gm_mads['bcdev'] = -np.log(gm_mads['bcdev'])
        
        for band in gm_mads.data_vars:
            gm_mads = gm_mads.rename({band:band+era})
        
        return gm_mads
    
    epoch1 = fun(ds1, era='_S1')
    epoch2 = fun(ds2, era='_S2')
    
#     from datacube.testutils.io import rio_slurp_xarray
#     slope = rio_slurp_xarray("https://deafrica-data.s3.amazonaws.com/ancillary/dem-derivatives/cog_slope_africa.tif", gbox=ds.geobox)
    slope = dc.load(product='srtm', like=ds.geobox).squeeze()
    slope = slope.elevation
    slope = xr_terrain(slope, 'slope_riserun')
    slope = slope.to_dataset(name='slope')
    
    result = xr.merge([epoch1,
                       epoch2,
                       slope],compat='override')

    return result.squeeze()

Now we can pass this function to `collect_training_data`. For each of the geometries in our shapefile we will extract temporal statistics and elevation data.

Because we are now interested in calculating a range of temporal statistics, we will redefine our initial parameters to include a time series of Sentinel-2 data (whereas above we loaded data from a geomedian composite). Remember, passing in a `custom_func` to `collect_training_data` means many of the other feature layer parameters are ignored.

In [8]:
#set up our inputs to collect_training_data
products =  ['s2_l2a']
time = ('2019-01','2019-12')
zonal_stats = 'median' 
return_coords=True

# Set up the inputs for the ODC query
measurements =  ['red','blue','green','nir','swir_1']
resolution = (-20,20)
output_crs='epsg:6933'

In [9]:
#generate a new datacube query object
query = {
    'time': time,
    'measurements': measurements,
    'resolution': resolution,
    'output_crs': output_crs,
    'group_by' : 'solar_day',
}

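As we are now extracting 12 months of Sentinel-2 data from the datacube for every geometry, the load times will be much slower. To keep the run time manageable, we first take a random subsample of 100 geometries from each class.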

In [10]:
import pandas as pd

# Randomly sample 100 geometries from each class (without replacement)
crops = input_data[input_data['class']==1].sample(n=100, replace=False).reset_index(drop=True)
noncrops = input_data[input_data['class']==0].sample(n=100, replace=False).reset_index(drop=True)

# Recombine the two samples into a single table
input_data = pd.concat([crops,noncrops]).reset_index(drop=True)

In [20]:
column_names, model_input = collect_training_data(
                                    gdf=input_data,
                                    products=products,
                                    dc_query=query,
                                    ncpus=ncpus,
                                    return_coords=return_coords,
                                    field=field,
                                    zonal_stats=zonal_stats,
                                    custom_func=two_seasons_gm_mads
                                    )

Reducing data using user supplied custom function
Taking zonal statistic: median
Collecting training data in parallel mode


 98%|█████████▊| 196/200 [10:38<00:13,  3.26s/it]


Output training data has shape (196, 24)
Removed NaNs & Infs, cleaned input shape:  (163, 24)
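
The log above reports that rows containing `NaN` or `Inf` values were removed, reducing the training data from 196 rows to 163. Infinite values can arise in a custom function like the one above because `-np.log` of a zero value yields `inf`. The sketch below shows one common way to filter such rows with numpy on a small made-up array; it is an illustration of the idea and not necessarily how `collect_training_data` does it internally.

```python
import numpy as np

# Hypothetical training rows: class label followed by two feature values
rows = np.array([
    [1.0, 0.07, 0.29],
    [1.0, np.nan, 0.07],   # missing value
    [0.0, 0.08, np.inf],   # e.g. the result of -np.log(0)
    [0.0, 0.16, 0.03],
])

# Keep only the rows in which every value is finite
finite_rows = np.isfinite(rows).all(axis=1)
cleaned = rows[finite_rows]

print('Input shape:  ', rows.shape)
print('Cleaned shape:', cleaned.shape)
```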





In [21]:
print(column_names)
print('')
print(np.array_str(model_input, precision=2, suppress_small=True))

['class', 'red_S1', 'blue_S1', 'green_S1', 'nir_S1', 'swir_1_S1', 'edev_S1', 'sdev_S1', 'bcdev_S1', 'NDVI_S1', 'LAI_S1', 'red_S2', 'blue_S2', 'green_S2', 'nir_S2', 'swir_1_S2', 'edev_S2', 'sdev_S2', 'bcdev_S2', 'NDVI_S2', 'LAI_S2', 'slope', 'x_coord', 'y_coord']

[[      1.         0.07       0.04 ...       0.29 3464870.   -406010.  ]
 [      1.         0.13       0.07 ...       0.07 3377000.    236650.  ]
 [      1.         0.08       0.04 ...       0.11 3432470.   -267270.  ]
 ...
 [      0.         0.08       0.05 ...       0.46 3695990.   -593590.  ]
 [      0.         0.16       0.08 ...       0.03 4083460.   1250110.  ]
 [      0.         0.2        0.08 ...       0.04 3648150.    110110.  ]]


### Separate coordinates

By setting `return_coords=True` in the `collect_training_data` function, our training data now has two extra columns called `x_coord` and `y_coord`. We need to separate these from our training dataset as they will not be used to train the machine learning model. Instead, these variables will be used to help conduct Spatial K-fold Cross Validation (SKCV) and spatially aware test-train splits in the notebook `3_Train_fit_evaluate_classifier`. For more information on why this is important, see this [article](https://www.tandfonline.com/doi/abs/10.1080/13658816.2017.1346255?journalCode=tgis20).

In [22]:
# Select the coordinate columns, which are stored separately from the model features
coord_variables = ['x_coord', 'y_coord']

# Find the positions of the coordinate columns in the training data
model_col_indices = [column_names.index(var_name) for var_name in coord_variables]


In [23]:
np.savetxt("results/training_data/training_data_coordinates.txt", model_input[:, model_col_indices])
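
To give a feel for how saved coordinates can support spatially aware splitting, the sketch below groups samples into square grid tiles and assigns whole tiles to folds, so that nearby samples never straddle a train/test boundary. This is a simplified stand-in and may differ from the method used in `3_Train_fit_evaluate_classifier`; the helper names, the 100 km tile size, and the coordinates are all hypothetical.

```python
import math


def assign_spatial_blocks(coords, block_size):
    """Label each (x, y) sample with the grid tile it falls in."""
    return [
        (math.floor(x / block_size), math.floor(y / block_size))
        for x, y in coords
    ]


def block_folds(blocks, n_folds):
    """Assign whole tiles to folds in round-robin order; return one fold per sample."""
    unique_blocks = sorted(set(blocks))
    fold_of_block = {b: i % n_folds for i, b in enumerate(unique_blocks)}
    return [fold_of_block[b] for b in blocks]


# Hypothetical coordinates in metres, in the style of x_coord / y_coord
coords = [
    (3464870.0, -406010.0),
    (3464900.0, -405950.0),   # close neighbour of the first sample
    (3377000.0, 236650.0),
    (4083460.0, 1250110.0),
]

blocks = assign_spatial_blocks(coords, block_size=100_000)
folds = block_folds(blocks, n_folds=2)

print(blocks)
print(folds)
```

Because folds are assigned per tile rather than per sample, the two neighbouring samples always land in the same fold.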

## Export training data

Once we've collected all the training data we require, we can write the data to disk. This will allow us to import the data in the next step(s) of the workflow.


In [24]:
#set the name and location of the output file
output_file = "results/training_data/test_training_data.txt"

In [34]:
#grab all columns except the x-y coords
model_col_indices = [column_names.index(var_name) for var_name in column_names[0:-2]]
#Export files to disk
np.savetxt(output_file, model_input[:, model_col_indices], header=" ".join(column_names[0:-2]), fmt="%4f")
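
A later step in the workflow needs to read this file back in. The sketch below writes a small made-up array in the same way to a temporary file and then recovers both the column names and the values; it relies on `np.savetxt` prefixing the header line with `# ` and `np.loadtxt` skipping such comment lines. The `demo_` names and the temporary path are hypothetical.

```python
import os
import tempfile

import numpy as np

# Hypothetical training data: class label followed by two feature layers
demo_names = ['class', 'NDVI', 'slope']
demo_data = np.array([[1, 0.61, 0.29], [0, 0.22, 0.46]])

with tempfile.TemporaryDirectory() as tmp:
    tmp_path = os.path.join(tmp, 'training_data.txt')

    # Write the data with the column names as a header line
    np.savetxt(tmp_path, demo_data, header=' '.join(demo_names), fmt='%4f')

    # Recover the column names from the commented header line
    with open(tmp_path) as f:
        names_back = f.readline().lstrip('# ').split()

    # Load the numeric values; the header is skipped as a comment
    data_back = np.loadtxt(tmp_path)

print(names_back)
print(data_back)
```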

## Recommended next steps

To continue working through the notebooks in this `Scalable Machine Learning on the ODC` workflow, go to the next notebook `2_Inspect_training_data.ipynb`.

1. **Extracting_training_data (this notebook)**
2. [Inspect_training_data](2_Inspect_training_data.ipynb)
3. [Train_fit_evaluate_classifier](3_Train_fit_evaluate_classifier.ipynb)
4. [Predict](4_Predict.ipynb)
5. [Accuracy_assessment](5_Accuracy_assessment.ipynb)
6. [Object-based_filtering](6_Object-based_filtering_(optional).ipynb)


***

## Additional information

**License:** The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). 
Digital Earth Africa data is licensed under the [Creative Commons by Attribution 4.0](https://creativecommons.org/licenses/by/4.0/) license.

**Contact:** If you need assistance, please post a question on the [Open Data Cube Slack channel](http://slack.opendatacube.org/) or on the [GIS Stack Exchange](https://gis.stackexchange.com/questions/ask?tags=open-data-cube) using the `open-data-cube` tag (you can view previously asked questions [here](https://gis.stackexchange.com/questions/tagged/open-data-cube)).
If you would like to report an issue with this notebook, you can file one on [Github](https://github.com/digitalearthafrica/deafrica-sandbox-notebooks).

**Last modified:** August 2020

**Compatible datacube version:** 

In [None]:
print(datacube.__version__)

## Tags
Browse all available tags on the DE Africa User Guide's [Tags Index](https://) (placeholder as this does not exist yet)