GEODATA-HARVESTER WITH SETTINGS WIDGET
--------------------------------------

The Geodata-Harvester provides researchers with reusable workflows for automatic data extraction from a range of data sources, including spatial-temporal processing into usable formats. User-provided data is automatically complemented with a suitable set of spatially and temporally aligned covariates, producing a ready-made dataset for machine learning and agricultural models. In addition, all requested data layer maps are automatically extracted and aligned for a specific region and time period.

The main workflow of the Harvester is as follows:

Options and user settings (e.g., data layer selections, spatial coverage, temporal constraints, i/o directory names) are defined in the notebook settings menu or loaded from a settings YAML file (e.g., settings/settings_test). All settings are also saved to a YAML file for reuse.

The notebook imports the settings and all Python modules that provide the functionality to download and extract data for each data source. After the settings are read in, checked, and processed into valid data retrieval (API) queries, all selected data layers are downloaded sequentially and then processed into a clean dataframe table and co-registered raster maps. The workflow can be run either fully automatically or step by step by selecting only certain parts of the process in the notebook.

Additional data sources are best added by writing the API handlers and extraction functionality as a separate Python module, which is then imported by the notebook. Currently the following data sources are supported by the following modules:

- 'getdata_slga.py': Soil data from the Soil and Landscape Grid of Australia (SLGA)
- 'getdata_landscape.py': Landscape data from the Soil and Landscape Grid of Australia (SLGA)
- 'getdata_silo.py': Climate data from SILO
- 'getdata_dem.py': National Digital Elevation Model (DEM) 1 Second plus slope and aspect calculation
- 'getdata_dea_nci.py': Digital Earth Australia's (DEA) Geoscience Earth Observations via NCI server
- 'getdata_dea.py': Digital Earth Australia's (DEA) Geoscience Earth Observations via Open Web Service server provided by DEA
- 'getdata_radiometric.py': Geoscience Australia National Geophysical Compilation Sub-collection Radiometrics
- 'eeharvest': Google Earth Engine API integration handler

For more details, please see the README and the Data Overview page.
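
As a rough illustration of adding a data source as a separate module, the sketch below shows what a minimal handler could look like. It is not taken from the existing getdata_*.py modules, which may be structured differently. The module name `getdata_example.py`, the function `build_wcs_url`, the endpoint `https://example.org/wcs`, and the layer name `example_layer` are all hypothetical. The query parameters follow the OGC WCS 1.0.0 GetCoverage convention, which is an assumption about the kind of service being wrapped.

```python
"""getdata_example.py: hypothetical skeleton for a new data-source module."""
from urllib.parse import urlencode

# Hypothetical endpoint; replace with the URL of the real service.
BASE_URL = "https://example.org/wcs"


def build_wcs_url(layer, bbox, resolution_arcsec, base_url=BASE_URL):
    """Build a WCS 1.0.0 GetCoverage URL for one layer (no request is sent).

    bbox is [lng_min, lat_min, lng_max, lat_max] in EPSG:4326,
    the same order as settings.target_bbox.
    """
    lng_min, lat_min, lng_max, lat_max = bbox
    if not (lng_min < lng_max and lat_min < lat_max):
        raise ValueError(f"Invalid bounding box: {bbox}")
    res_deg = resolution_arcsec / 3600.0  # arcseconds to degrees
    params = {
        "SERVICE": "WCS",
        "VERSION": "1.0.0",
        "REQUEST": "GetCoverage",
        "COVERAGE": layer,
        "CRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),
        "RESX": res_deg,
        "RESY": res_deg,
        "FORMAT": "GeoTIFF",
    }
    return f"{base_url}?{urlencode(params)}"


# Example call using the bounding box and resolution from this notebook
url = build_wcs_url(
    "example_layer",
    [149.769345, -30.335861, 149.949173, -30.206271],
    3.0,
)
print(url)
```

Keeping URL construction separate from the download step makes the query logic easy to check without network access. A real module would add the download and raster extraction on top of it.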

This notebook is part of the Data Harvester project developed for the Agricultural Research Federation (AgReFed).

Copyright 2023 Sydney Informatics Hub (SIH), The University of Sydney

### Import libraries

In [1]:
import os
import time
from datetime import datetime
from os.path import exists
from pathlib import Path
from types import SimpleNamespace

# Import harvest function from geodata_harvester
from geodata_harvester import harvest
# Import widget library
from geodata_harvester.widgets import harvesterwidgets as hw

### Settings via interactive widget

Set options such as data-layer names, region, and dates in the widget window. The widget window might take a few seconds to open in the notebook.

In [2]:
tab_nest, w_settings, names_settings, w_load = hw.gen_maintab()
# Note: the widget may take a couple of seconds to render after loading
time.sleep(2)
display(tab_nest)

Tab(children=(Accordion(children=(GridBox(children=(FileChooser(path='/Users/seb/CTDS/Projects/AgReFed/Harvest…

### Evaluate and save settings as YAML file

In [3]:
if w_load.value is None:
    dict_settings = hw.eval_widgets(w_settings, names_settings)
    # Convert settings from dictionary to SimpleNamespace (so all settings names are available as settings.xxxname)
    settings = SimpleNamespace(**dict_settings)
    # Check if output path exists, if not create it:
    os.makedirs(settings.outpath, exist_ok=True)
    # Save settings to yaml file:
    fname_settings = os.path.join(settings.outpath, 'settings_saved.yaml')
    hw.save_dict_settings(dict_settings, fname_settings)
else:
    print(f'Settings loaded from {w_load.value}')
    # Use the loaded file as the settings file passed to harvest.run() below
    fname_settings = w_load.value
    settings = hw.load_settings(w_load.value)
# Print settings
hw.print_settings(settings)

Settings saved to file dataresults_example/settings_saved.yaml
Settings loaded:
----------------
settings.infile : /Users/seb/CTDS/Projects/AgReFed/Harvester/geodata-harvester/notebooks/testdata/example-site_llara.csv
settings.outpath : dataresults_example/
settings.colname_lng : Long
settings.colname_lat : Lat
settings.target_bbox : [149.769345, -30.335861, 149.949173, -30.206271]
settings.target_res : 3.0
settings.date_min : 2022-01-01
settings.temp_intervals : 1
settings.date_max : 2022-01-31
settings.temp_buffer : 1
settings.target_sources:
   'SLGA': {'Bulk_Density': ['0-5cm'], 'Organic_Carbon': ['0-5cm']}
   'SILO': {'monthly_rain': ['median']}
   'DEA': ['ls8_barest_earth_mosaic']
   'DEM': ['DEM']
   'Radiometric': ['radmap2019_grid_dose_terr_awags_rad_2019']
   'Landscape': ['Slope', 'Aspect']
   'GEE': {'preprocess': {'collection': ['LANDSAT/LC08/C02/T1_L2'], 'spectral': 'NDVI', 'reduce': 'median', 'mask_clouds': True, 'mask_probability': None}, 'download': {'bands': 'NDVI'}}
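
Before starting a long download it can help to sanity-check a few of the settings printed above. The following is a minimal sketch, not part of the Harvester itself: the function `check_settings` and the variable `example_settings` are hypothetical, and the dates are assumed to be ISO-formatted strings as they appear in the printout.

```python
from datetime import date
from types import SimpleNamespace

# Values copied from the settings printout above
example_dict = {
    "target_bbox": [149.769345, -30.335861, 149.949173, -30.206271],
    "target_res": 3.0,
    "date_min": "2022-01-01",  # assumed to be ISO date strings
    "date_max": "2022-01-31",
}


def check_settings(settings):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    lng_min, lat_min, lng_max, lat_max = settings.target_bbox
    if not (-180 <= lng_min < lng_max <= 180):
        problems.append("target_bbox longitudes out of order or out of range")
    if not (-90 <= lat_min < lat_max <= 90):
        problems.append("target_bbox latitudes out of order or out of range")
    if settings.target_res <= 0:
        problems.append("target_res must be positive")
    if date.fromisoformat(settings.date_min) > date.fromisoformat(settings.date_max):
        problems.append("date_min is after date_max")
    return problems


# Same dictionary-to-namespace conversion as used for the widget settings
example_settings = SimpleNamespace(**example_dict)
problems = check_settings(example_settings)
print(problems or "Settings look consistent")
```

Returning a list of messages rather than raising on the first failure lets all problems be reported in one pass.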

### Harvest

The harvest function automatically executes all download and processing steps for all data layers specified in the settings file above.
Filenames of the processed files and some metadata are saved in a dataframe df.

In [4]:
df = harvest.run(fname_settings, return_df=True)

Starting the data harvester -----
ℹ Found the following 7 sources: ['DEA', 'DEM', 'GEE', 'Landscape', 'Radiometric', 'SILO', 'SLGA']

Downloading from API sources -----
⊙ Initialising Earth Engine... 4.2s
✔ Done

⌛ Downloading Google Earth Engine data...
Running preprocess() -----
ℹ Number of image(s) found: 2
⊙ Applying scale, offset and cloud masks... 1.3s
⊙ Calculating spectral indices: NDVI... 1.1s
✔ Preprocessing complete
Running download() -----
ℹ Band(s) selected: ['NDVI_median']
ℹ Setting scale to ~80.1m, converted from 3.0 arcsec at latitude -30.27
ℹ Setting download dir to dataresults_example/
⊙ Downloading ee_LANDSAT_424

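The harvest log reports that the target resolution of 3.0 arcsec was converted to a scale of about 80.1 m at latitude -30.27. How the package computes this is not shown in the notebook. The sketch below uses a common spherical approximation (east-west arc length scaled by the cosine of the latitude, with the WGS84 equatorial radius as an assumed constant) that gives a similar figure. The function name `arcsec_to_metres` is hypothetical.

```python
import math

# WGS84 equatorial radius in metres (assumption made for this sketch)
EARTH_RADIUS_M = 6378137.0


def arcsec_to_metres(res_arcsec, latitude_deg):
    """Approximate east-west ground distance covered by an angular resolution."""
    # Length of one arcsecond of longitude at the equator
    metres_per_arcsec_equator = 2 * math.pi * EARTH_RADIUS_M / (360 * 3600)
    # Meridians converge towards the poles, so scale by cos(latitude)
    return res_arcsec * metres_per_arcsec_equator * math.cos(math.radians(latitude_deg))


scale_m = arcsec_to_metres(3.0, -30.27)
print(f"~{scale_m:.1f}m")
```

The conversion depends on latitude, so the same angular resolution corresponds to a different ground distance for regions closer to or further from the equator.
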
### Inspect result dataframe

In [5]:
# Inspect either the entire generated dataframe with
# df
# or only the first rows with
df.head()

| | Longitude | Latitude | geometry | ee_LANDSAT_4248c388_median | ls8_barest_earth_mosaic | DEM | landscape_Slope | landscape_Aspect | radmap2019_grid_dose_terr_awags_rad_2019 | monthly_rain_median |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 149.85268 | -30.264663 | POINT (149.85268 -30.26466) | 0.239018 | 948 | 245.135803 | 1.052722 | 189.684097 | 33.15168 | 80.599854 |
| 1 | 149.884838 | -30.265302 | POINT (149.88484 -30.26530) | 0.077492 | 1023 | 263.777679 | 1.554277 | 281.077698 | 35.969486 | 74.300049 |
| 2 | 149.884838 | -30.265302 | POINT (149.88484 -30.26530) | 0.077492 | 1023 | 263.777679 | 1.554277 | 281.077698 | 35.969486 | 74.300049 |
| 3 | 149.838791 | -30.278542 | POINT (149.83879 -30.27854) | 0.215485 | 1095 | 233.248428 | 0.920433 | 242.743683 | 29.618393 | 87.199951 |
| 4 | 149.830843 | -30.275437 | POINT (149.83084 -30.27544) | 0.110641 | 1153 | 230.523056 | 1.062537 | 267.301697 | 25.061012 | 87.199951 |
