GEODATA-HARVESTER NOTEBOOK: Temporal Processing Example II
----------------------------------------------------------

This notebook demonstrates the temporal processing capabilities of the Geodata-Harvester for extracting a time series (using four time intervals) of climate and satellite data with multiple bands.
The example is based on the settings as defined in the `settings/settings_temporal2.yaml` file. 
Only data sources with a time component are selected, i.e., SILO, DEA, and Google Earth Engine (GEE) satellite data.


The Geodata-Harvester provides researchers with reusable workflows for automatic data extraction from a range of data sources, including spatial-temporal processing into usable formats. User-provided data is auto-completed with a suitable set of spatially and temporally aligned covariates as a ready-made dataset for machine learning and agriculture models. In addition, all requested data layer maps are automatically extracted and aligned for a specific region and time period.

The main workflow of the Harvester is as follows:

Options and user settings (e.g., data layer selections, spatial coverage, temporal constraints, I/O directory names) are defined in the notebook settings menu or loaded from a settings YAML file (e.g., settings/settings_test). All settings are also saved to a YAML file for reusability.

The notebook imports the settings and all Python modules that provide the functionality to download and extract data for each data source. After the settings are read in, checked, and processed into valid data retrieval (API) queries, all selected data layers are downloaded sequentially and then processed into a clean dataframe table and co-registered raster maps. The entire workflow can be run either fully automatically or step by step by selecting only certain parts of the process in the notebook.
Additional data sources are best added by writing the API handlers and extraction functionality as a separate Python module, which is then imported by the notebook. Currently, the following data sources are supported by the following modules:

- 'getdata_slga.py': Soil data from the Soil and Landscape Grid of Australia (SLGA)
- 'getdata_landscape': Landscape data from the Soil and Landscape Grid of Australia (SLGA)
- 'getdata_silo.py': Climate data from SILO
- 'getdata_dem.py': National Digital Elevation Model (DEM) 1 Second plus slope and aspect calculation
- 'getdata_dea_nci.py': Digital Earth Australia's (DEA) Geoscience Earth Observations via NCI server
- 'getdata_dea.py': Digital Earth Australia's (DEA) Geoscience Earth Observations via Open Web Service server provided by DEA
- 'getdata_radiometric.py': Geoscience Australia National Geophysical Compilation Sub-collection Radiometrics
- 'eeharvest': Google Earth Engine API integration handler

For more details, please see the README and the Data Overview page.

This notebook is part of the Data Harvester project developed for the Agricultural Research Federation (AgReFed).

Copyright 2023 Sydney Informatics Hub (SIH), The University of Sydney

### Import libraries

In [1]:
import os

# Import harvest function from geodata_harvester
from geodata_harvester import harvest

### Specify settings file

Define the settings, such as data-layer names, region, and dates, in the settings YAML file beforehand.

In [2]:
# Path to file:
path_settings = 'settings'
# Filename
fname_settings = 'settings_temporal2.yaml'
infname = os.path.join(path_settings, fname_settings)
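
Because the harvest step reads this path directly, it can help to confirm that the file exists before starting any downloads. The following is a minimal sketch using only the Python standard library; the helper name `resolve_settings_file` is hypothetical and is not part of the geodata_harvester package.

```python
from pathlib import Path


def resolve_settings_file(path_settings, fname_settings):
    """Return the settings path as a string, or raise if it is unusable.

    Hypothetical helper for illustration; not part of geodata_harvester.
    """
    infname = Path(path_settings) / fname_settings
    # Fail early with a clear message instead of partway through a harvest
    if not infname.is_file():
        raise FileNotFoundError(f"Settings file not found: {infname}")
    # The notebook expects a YAML settings file
    if infname.suffix.lower() not in {".yaml", ".yml"}:
        raise ValueError(f"Expected a .yaml or .yml file, got: {infname.name}")
    return str(infname)
```

With the variables defined in the cell above, `infname = resolve_settings_file(path_settings, fname_settings)` would yield the same path as `os.path.join`, with the added existence check.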

### Harvest

The harvest function automatically executes all download and processing steps for all data layers specified in the settings file above.
The filenames of the processed files and some metadata are saved in a dataframe `df`.

In [3]:
df = harvest.run(infname, return_df = True)

[1m[35mStarting the data harvester -----[0m
[35mℹ Found the following 3 sources: ['DEA', 'SILO', 'GEE'][0m
[1m[35m
Downloading from API sources -----[0m
[35m⊙ Initialising Earth Engine...[0m 4.0s                                                                            
[32m✔ Done[0m
[1m
⌛ Downloading Google Earth Engine data...[0m
[35mℹ Multiple collections detected in Google Earth Engine config file[0m
[35mℹ Validating settings and generating 2 configuration profiles[0m
  Profile 1 will process 'LANDSAT/LC09/C02/T1_L2' and download bands ['SR_B2', 'SR_B3', 'SR_B4']
  Profile 2 will process 'COPERNICUS/S2_SR_HARMONIZED' and download bands ['NDVI']
[35mℹ -------------------- Downloading Profile 1 --------------------[0m
[1m[36mRunning preprocess() -----[0m
[35mℹ Number of image(s) found: 1[0m
[35m⊙ Applying scale, offset and cloud masks...[0m 1.6s                                                               
[35m⊙ Calculating spectral indices: NDVI...[0m 

### Inspect result dataframe

In [4]:
# Inspect either entire generated dataframe with 
# df
# or only the first rows with
df.head()

Unnamed: 0,Longitude,Latitude,ee_COPERNICUS_d1229d42_mean_2022-08-01-to-2022-08-31_NDVI_mean,ee_LANDSAT_2a2b9af2_mean_2022-08-01-to-2022-08-31_SR_B2_mean,ee_LANDSAT_2a2b9af2_mean_2022-08-01-to-2022-08-31_SR_B3_mean,ee_LANDSAT_2a2b9af2_mean_2022-08-01-to-2022-08-31_SR_B4_mean,ee_LANDSAT_71cd5d63_mean_2022-08-31-to-2022-09-30_SR_B2_mean,ee_LANDSAT_71cd5d63_mean_2022-08-31-to-2022-09-30_SR_B3_mean,ee_LANDSAT_71cd5d63_mean_2022-08-31-to-2022-09-30_SR_B4_mean,ee_COPERNICUS_fcb50880_mean_2022-08-31-to-2022-09-30_NDVI_mean,...,ga_ls_landcover_level4,silo_max_temp_2022-08-01-2022-12-01_mean_2022-07-31_1,silo_max_temp_2022-08-01-2022-12-01_mean_2022-08-31_1,silo_max_temp_2022-08-01-2022-12-01_mean_2022-10-01_1,silo_max_temp_2022-08-01-2022-12-01_mean_2022-10-31_1,silo_min_temp_2022-08-01-2022-12-01_mean_2022-07-31_1,silo_min_temp_2022-08-01-2022-12-01_mean_2022-08-31_1,silo_min_temp_2022-08-01-2022-12-01_mean_2022-10-01_1,silo_min_temp_2022-08-01-2022-12-01_mean_2022-10-31_1,geometry
0,149.85268,-30.264663,0.456572,7964.0,9059.0,8309.0,8073.0,8772.0,8066.0,0.619215,...,97.0,19.025805,20.974195,24.726673,26.345163,5.735485,8.151613,11.923333,11.96129,POINT (149.85268 -30.26466)
1,149.884838,-30.265302,0.13241,8598.0,9324.0,10031.0,8748.0,9505.0,10193.0,0.183365,...,30.0,18.919352,20.887094,24.623329,26.245161,5.590323,8.032257,11.776666,11.82258,POINT (149.88484 -30.26530)
2,149.884838,-30.265302,0.13241,8598.0,9324.0,10031.0,8748.0,9505.0,10193.0,0.183365,...,30.0,18.919352,20.887094,24.623329,26.245161,5.590323,8.032257,11.776666,11.82258,POINT (149.88484 -30.26530)
3,149.838791,-30.278542,0.421019,8055.0,9010.0,8257.0,8049.0,8750.0,8083.0,0.683723,...,97.0,19.216133,21.154839,24.9,26.496777,5.819355,8.261291,12.003332,12.045162,POINT (149.83879 -30.27854)
4,149.830843,-30.275437,0.367244,8316.0,10031.0,9594.0,8101.0,9553.0,8658.0,0.611031,...,35.0,19.216133,21.154839,24.9,26.496777,5.819355,8.261291,12.003332,12.045162,POINT (149.83084 -30.27544)
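
The Google Earth Engine column names in the table above encode the time interval of each layer (e.g., `2022-08-01-to-2022-08-31`). As a rough sketch, the columns can be grouped by interval with a regular expression. The pattern below is inferred only from the column names shown in this output and is an assumption, not a documented naming scheme; the function name `group_ee_columns_by_interval` is hypothetical.

```python
import re
from collections import defaultdict

# Pattern inferred from names such as
# 'ee_LANDSAT_2a2b9af2_mean_2022-08-01-to-2022-08-31_SR_B2_mean'
# (an assumption based on this notebook's output only).
EE_COLUMN = re.compile(
    r"^ee_(?P<collection>[A-Z0-9]+)_(?P<hash>[0-9a-f]+)_(?P<reducer>[a-z]+)_"
    r"(?P<start>\d{4}-\d{2}-\d{2})-to-(?P<end>\d{4}-\d{2}-\d{2})_"
    r"(?P<band>.+)_(?P<agg>[a-z]+)$"
)


def group_ee_columns_by_interval(columns):
    """Map (start, end) date strings to the matching column names."""
    groups = defaultdict(list)
    for name in columns:
        match = EE_COLUMN.match(name)
        if match:  # non-GEE columns such as 'Longitude' are skipped
            groups[(match["start"], match["end"])].append(name)
    return dict(groups)


# Example with a few of the column names shown above
example = [
    "Longitude",
    "ee_COPERNICUS_d1229d42_mean_2022-08-01-to-2022-08-31_NDVI_mean",
    "ee_LANDSAT_2a2b9af2_mean_2022-08-01-to-2022-08-31_SR_B2_mean",
    "ee_LANDSAT_71cd5d63_mean_2022-08-31-to-2022-09-30_SR_B2_mean",
]
for interval, names in group_ee_columns_by_interval(example).items():
    print(interval, len(names))
```

Passing `df.columns` instead of `example` would group all harvested GEE layers in the same way, which makes it easier to select one time step at a time, e.g. `df[names]`.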


In [6]:
print('Number of columns in dataframe: ', len(df.columns))
# Print column names of dataframe
print(df.columns)

Number of columns in dataframe:  65
Index(['Longitude', 'Latitude',
       'ee_COPERNICUS_d1229d42_mean_2022-08-01-to-2022-08-31_NDVI_mean',
       'ee_LANDSAT_2a2b9af2_mean_2022-08-01-to-2022-08-31_SR_B2_mean',
       'ee_LANDSAT_2a2b9af2_mean_2022-08-01-to-2022-08-31_SR_B3_mean',
       'ee_LANDSAT_2a2b9af2_mean_2022-08-01-to-2022-08-31_SR_B4_mean',
       'ee_LANDSAT_71cd5d63_mean_2022-08-31-to-2022-09-30_SR_B2_mean',
       'ee_LANDSAT_71cd5d63_mean_2022-08-31-to-2022-09-30_SR_B3_mean',
       'ee_LANDSAT_71cd5d63_mean_2022-08-31-to-2022-09-30_SR_B4_mean',
       'ee_COPERNICUS_fcb50880_mean_2022-08-31-to-2022-09-30_NDVI_mean',
       'ee_LANDSAT_6b57e1a5_mean_2022-09-30-to-2022-10-30_SR_B2_mean',
       'ee_LANDSAT_6b57e1a5_mean_2022-09-30-to-2022-10-30_SR_B3_mean',
       'ee_LANDSAT_6b57e1a5_mean_2022-09-30-to-2022-10-30_SR_B4_mean',
       'ee_COPERNICUS_03b3ab57_mean_2022-09-30-to-2022-10-30_NDVI_mean',
       'ee_COPERNICUS_dac18192_mean_2022-10-30-to-2022-11-29_NDVI_mean',
 