GEODATA-HARVESTER NOTEBOOK
--------------------------------

The Geodata-Harvester provides researchers with reusable workflows for automatic data extraction from a range of data sources, including spatial-temporal processing into usable formats. User-provided data is automatically completed with a suitable set of spatially and temporally aligned covariates, yielding a ready-made dataset for machine learning and agricultural models. In addition, all requested data layer maps are automatically extracted and aligned for a specific region and time period.

The main workflow of the Harvester is as follows:

Options and user settings (e.g., data layer selections, spatial coverage, temporal constraints, i/o directory names) are defined by the user in the notebook settings menu or can be loaded from a YAML settings file (e.g., settings/settings_test). All settings are also saved to a YAML file so that they can be reused.

The notebook imports the settings and all Python modules that provide the functionality to download and extract data for each data source. After the settings are read in, checked, and processed into valid data retrieval (API) queries, all selected data layers are downloaded sequentially and then processed into a clean dataframe table and co-registered raster maps. The entire workflow can be run either fully automatically or step by step, by selecting only certain parts of the process in the notebook.

Additional data sources are best added by writing the API handlers and extraction functionality as a separate Python module, which is then imported by the notebook. The following data sources are currently supported by the following modules:

- 'getdata_slga.py': Soil data from the Soil and Landscape Grid of Australia (SLGA)
- 'getdata_landscape.py': Landscape data from the Soil and Landscape Grid of Australia (SLGA)
- 'getdata_silo.py': Climate data from SILO
- 'getdata_dem.py': National Digital Elevation Model (DEM) 1 Second, plus slope and aspect calculation
- 'getdata_dea_nci.py': Digital Earth Australia's (DEA) Geoscience Earth Observations via the NCI server
- 'getdata_dea.py': Digital Earth Australia's (DEA) Geoscience Earth Observations via the Open Web Service server provided by DEA
- 'getdata_radiometric.py': Geoscience Australia National Geophysical Compilation Sub-collection Radiometrics
- 'eeharvest': Google Earth Engine API integration handler

For more details, please see the README and the Data Overview page.
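The "one module per data source, run sequentially" pattern described above can be illustrated with a small sketch. It is not the Harvester's actual implementation: the handler functions, the `HANDLERS` registry, the `run_all` helper, and the `layers` settings key are all hypothetical names chosen for illustration, and the handlers only build file names instead of calling any real API.

```python
from typing import Callable, Dict, List

# Hypothetical handler signature (an assumption, not the real Harvester API):
# a handler takes the settings for one source and returns the output file names.
Handler = Callable[[dict], List[str]]


def fetch_dem(cfg: dict) -> List[str]:
    # Placeholder: a real handler would query the DEM service here.
    return [f"DEM_{layer}.tif" for layer in cfg.get("layers", [])]


def fetch_silo(cfg: dict) -> List[str]:
    # Placeholder: a real handler would query the SILO service here.
    return [f"SILO_{layer}.tif" for layer in cfg.get("layers", [])]


# Registry mapping a source name to its handler. Supporting a new data source
# then amounts to writing one more handler module and registering it here.
HANDLERS: Dict[str, Handler] = {"DEM": fetch_dem, "SILO": fetch_silo}


def run_all(target_sources: Dict[str, dict]) -> List[dict]:
    """Call the handler of each requested source in turn and collect records."""
    records = []
    for name in sorted(target_sources):
        handler = HANDLERS.get(name)
        if handler is None:
            raise KeyError(f"No handler registered for source {name!r}")
        for fname in handler(target_sources[name]):
            records.append({"source": name, "filename": fname})
    return records


records = run_all(
    {"SILO": {"layers": ["daily_rain"]}, "DEM": {"layers": ["DEM", "Slope"]}}
)
for rec in records:
    print(rec["source"], rec["filename"])
```

Keeping the handlers behind a single registry means the calling code does not change when a source is added, which is the main appeal of the separate-module approach.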

This notebook is part of the Data Harvester project developed for the Agricultural Research Federation (AgReFed).

Copyright 2023 Sydney Informatics Hub (SIH), The University of Sydney

### Import libraries

In [1]:
import os

# Import harvest function from geodata_harvester
from geodata_harvester import harvest

### Specify settings file

Define the settings, such as data-layer names, region, and dates, in the settings YAML file beforehand.

In [2]:
fname_settings = 'settings_harvest.yaml'
path_settings = 'settings'
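Because every later step depends on this file, it can help to confirm that the path resolves before starting any downloads. The helper below is a minimal sketch using only the standard library. `resolve_settings` is a hypothetical name and not part of geodata_harvester.

```python
import os


def resolve_settings(path_settings: str, fname_settings: str) -> str:
    """Join directory and file name, failing early if the file is missing."""
    fpath = os.path.join(path_settings, fname_settings)
    if not os.path.isfile(fpath):
        raise FileNotFoundError(f"Settings file not found: {fpath}")
    return fpath
```

The returned path could then be passed to the harvest call in place of the inline `os.path.join(...)` expression.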

### Harvest

The harvest function automatically executes all download and processing steps for all data layers specified in the settings file above.
The filenames of the processed files and some metadata are saved in a dataframe df.

In [3]:
df = harvest.run(os.path.join(path_settings, fname_settings))

[1m[35mStarting the data harvester -----[0m
[35mℹ Found the following 6 sources: ['DEA', 'DEM', 'Landscape', 'Radiometric', 'SILO', 'SLGA'][0m
[1m[35m
Downloading from API sources -----[0m
[1m
⌛ Downloading DEA data...[0m
[35m⊙ Downloading landsat_barest_earth.tif for None[0m 1.2s                                                           
[35m⊙ Downloading ga_ls_ard_3.tif for 2022-10-01T00:00:00.000Z[0m 0.7s                                                
[35m⊙ Downloading ga_ls_ard_3.tif for 2022-10-02T00:00:00.000Z[0m 0.7s                                                
[35m⊙ Downloading ga_ls_ard_3.tif for 2022-10-03T00:00:00.000Z[0m 0.8s                                                
[35m⊙ Downloading ga_ls_ard_3.tif for 2022-10-04T00:00:00.000Z[0m 0.8s                                                
[35m⊙ Downloading ga_ls_ard_3.tif for 2022-10-05T00:00:00.000Z[0m 0.8s                                                
[35m⊙ Downloading ga_ls_ard_3.tif for 202