GEODATA-HARVESTER WITH SETTINGS WIDGET
----------------------------------------------------------

The Geodata-Harvester provides researchers with reusable workflows for automatic data extraction from a range of data sources, including spatial-temporal processing into usable formats. User-provided data is automatically complemented with a suitable set of spatially and temporally aligned covariates, yielding a ready-made dataset for machine learning and agricultural models. In addition, all requested data layer maps are automatically extracted and aligned for a specific region and time period.

The main workflow of the Harvester is as follows:

Options and user settings (e.g., data layer selections, spatial coverage, temporal constraints, I/O directory names) are defined in the notebook settings menu or loaded from a settings YAML file (e.g., settings/settings_test). All settings are also saved to a YAML file for reusability.

The notebook imports the settings and all Python modules that provide the functionality to download and extract data for each data source. After the settings are read in, checked, and processed into valid data retrieval (API) queries, all selected data layers are downloaded sequentially and then processed into a clean dataframe table and co-registered raster maps. The workflow can be run either fully automatically or step by step, by running only selected parts of the notebook.

Additional data sources are best added by writing the API handlers and extraction functionality as a separate Python module, which is then imported by the notebook. The following data sources are currently supported by the following modules:

- 'getdata_slga.py': Soil Data from Soil and Landscape Grid of Australia (SLGA)
- 'getdata_landscape.py': Landscape data from Soil and Landscape Grid of Australia (SLGA)
- 'getdata_silo.py': Climate Data from SILO
- 'getdata_dem.py': National Digital Elevation Model (DEM) 1 Second plus Slope and Aspect calculation
- 'getdata_dea_nci.py': Digital Earth Australia's (DEA) Geoscience Earth Observations via NCI server
- 'getdata_dea.py': Digital Earth Australia's (DEA) Geoscience Earth Observations via Open Web Service server provided by DEA
- 'getdata_radiometric.py': Geoscience Australia National Geophysical Compilation Sub-collection Radiometrics
- 'eeharvest': Google Earth Engine (GEE) API integration handler (GEE account required)

NOTE THAT A GOOGLE EARTH ENGINE ACCOUNT AND AUTHENTICATION ARE REQUIRED IF GEE LAYERS ARE SELECTED.
Please follow the instructions in [GEE Setup](https://sydney-informatics-hub.github.io/AgReFed-Workshop/pydocs/setup-gee) to set up your GEE account.

For more details, please see the README and the Data Overview page.

This notebook is part of the Data Harvester project developed for the Agricultural Research Federation (AgReFed).

Copyright 2023 Sydney Informatics Hub (SIH), The University of Sydney

### Import libraries

In [1]:
import os
import time
from datetime import datetime
from os.path import exists
from pathlib import Path
from types import SimpleNamespace
import pandas as pd
import IPython

# Import harvest function from geodata_harvester
from geodata_harvester import harvest, settingshandler
# Import widget library
from geodata_harvester.widgets import harvesterwidgets as hw

### Settings via interactive widget

Set settings such as data-layer names, region, and dates in the widget window. The widget window might take a few seconds to open in the notebook. Note that widgets do not show up in some VS Code extensions; in this case, please use the default Jupyter notebook in the browser or another Jupyter application such as JupyterLab Desktop.

If no bounding box is provided, the program will automatically generate a bounding box based on the maximum extent of the locations as given in the input file plus a padding of 0.05 deg.
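The automatic bounding box described above can be sketched in a few lines. This is a minimal illustration and not the Harvester's own code: the function name `bbox_from_points` is hypothetical, and the `[lng_min, lat_min, lng_max, lat_max]` ordering is assumed from the `settings.target_bbox` value printed later in this notebook.

```python
def bbox_from_points(lngs, lats, padding=0.05):
    """Return [lng_min, lat_min, lng_max, lat_max] around the points, padded in degrees."""
    if not lngs or len(lngs) != len(lats):
        raise ValueError("need equally sized, non-empty coordinate lists")
    return [
        min(lngs) - padding,
        min(lats) - padding,
        max(lngs) + padding,
        max(lats) + padding,
    ]


# Three locations taken from the example result table in this notebook
lngs = [149.85268, 149.884838, 149.838791]
lats = [-30.264663, -30.265302, -30.278542]
bbox = bbox_from_points(lngs, lats)
print([round(v, 6) for v in bbox])
```

Passing an explicit bounding box in the settings skips this step, which is useful when the maps should cover a larger area than the sampled points.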

In [2]:
tab_nest, w_settings, names_settings, w_load = hw.gen_maintab()
# Note: the widget may take a couple of seconds more to render after loading
time.sleep(2)
display(tab_nest) 

Tab(children=(Accordion(children=(GridBox(children=(FileChooser(path='/Users/seb/CTDS/Projects/AgReFed/Harvest…

### Evaluate and save settings as YAML file

In [3]:
if w_load.value is None:
    dict_settings = hw.eval_widgets(w_settings, names_settings)
    # Convert settings from dictionary to SimpleNamespace (so all settings names available as settings.xxxname)
    settings = SimpleNamespace(**dict_settings)
    # Remove GEE from settings if not used
    if settings.target_sources['GEE']['preprocess']['collection'] is None:
        del settings.target_sources['GEE']
    # Check if output path exists, if not create it:
    os.makedirs(settings.outpath, exist_ok=True)
    # Save settings to yaml file:
    fname_settings = os.path.join(settings.outpath, 'settings_saved.yaml')
    hw.save_dict_settings(dict_settings, fname_settings)
else:
    print(f'Settings loaded from {w_load.value}')
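    settings = hw.load_settings(w_load.value)
    # Reuse the loaded file as the settings file passed to harvest.run below
    fname_settings = w_load.value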
# Print settings
hw.print_settings(settings)

Settings saved to file results_harvest_widget/settings_saved.yaml
Settings loaded:
----------------
settings.infile : /Users/seb/CTDS/Projects/AgReFed/Harvester/geodata-harvester/notebooks/data/example-site_llara.csv
settings.outpath : results_harvest_widget
settings.colname_lng : Long
settings.colname_lat : Lat
settings.target_bbox : [149.769345, -30.335861, 149.949173, -30.206271]
settings.target_res : 6.0
settings.date_min : 2022-10-01
settings.time_intervals : 2
settings.date_max : 2022-11-30
settings.temp_buffer : 1
settings.target_sources:
   'SLGA': {'Bulk_Density': ['0-5cm'], 'Clay': ['0-5cm']}
   'SILO': {'monthly_rain': 'sum'}
   'DEA': ['s2_barest_earth']
   'DEM': ['DEM']
   'Radiometric': ['radmap2019_grid_dose_terr_awags_rad_2019']
   'Landscape': ['Slope', 'Aspect']
   'GEE': {'preprocess': {'collection': ['LANDSAT/LC09/C02/T1_L2'], 'spectral': 'NDVI', 'reduce': 'median', 'mask_clouds': True, 'mask_probability': None}, 'download': {'bands': 'NDVI'}}
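The `settings.target_res` value of 6.0 is an angular resolution in arcseconds, which the harvest log further down reports as roughly 160.2 m at latitude -30.27. The sketch below shows one conversion that reproduces this figure. It assumes a spherical Earth with the WGS84 equatorial radius and measures the east-west extent of a pixel; the function name is hypothetical, and whether the GEE handler uses exactly this formula is an assumption.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 equatorial radius, Earth treated as a sphere (assumption)


def arcsec_to_metres(arcsec, latitude_deg):
    """Approximate east-west ground distance of an angular resolution at a given latitude."""
    metres_per_degree = 2 * math.pi * EARTH_RADIUS_M / 360.0
    degrees = arcsec / 3600.0
    # Meridians converge towards the poles, so scale by cos(latitude)
    return degrees * metres_per_degree * math.cos(math.radians(latitude_deg))


scale_m = arcsec_to_metres(6.0, -30.27)
print(f"{scale_m:.1f} m")  # prints "160.2 m"
```

At the equator the same 6 arcseconds correspond to about 185.5 m, so the metric pixel size of a harvested layer depends on where the region lies.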


### Harvest

The harvest.run function automatically runs all download and processing steps for all requested data sources as specified in the settings file above. While this provides a simple and fast way to download and process data, it is also possible to run the individual steps separately, which offers more options for processing. For more details on the individual steps, please see the [source code](https://github.com/Sydney-Informatics-Hub/geodata-harvester/blob/main/src/geodata_harvester/harvest.py) and the documentation within the individual modules. Alternatively, see the [workshop page](https://sydney-informatics-hub.github.io/AgReFed-Workshop/pydocs/py00-workshop.html) for an introduction to the individual steps.

The harvest.run function returns a dataframe with the filenames of all downloaded data layers. All results and images are saved to disk in the output directory specified in the settings file.

The following main steps are automatically executed within the harvest.run() function:

- loading settings from config yaml file
- if bounding box is not provided, create bounding box from input file points plus padding of 0.05 deg
- downloading data layers for each source as specified in settings file (this may take a while, depending on number of layers, size of region, and speed of internet connection)
- GEE authorization if GEE layers are selected
- processing data layers as specified in settings file (e.g., temporal binning)
- save downloaded image files to disk as GeoTiffs (.tif)
- save summary table of downloaded files as CSV (see `download_summary.csv`)
- extract data for point locations provided in input file (Lat and Long columns)
- save extracted point result table to disk as CSV (`results.csv`) and as geopackage (`results.gpkg`)
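To make the temporal binning step more concrete, the sketch below splits a date range into equal-length windows. With `date_min` 2022-10-01, `date_max` 2022-11-30 and `time_intervals` 2 it yields the same two windows that appear in the GEE file names of the download summary. The function name `split_date_range` is hypothetical, and the Harvester's actual binning logic may differ in detail.

```python
from datetime import date


def split_date_range(date_min, date_max, n_intervals):
    """Split [date_min, date_max] into n equal-length windows; adjacent windows share a boundary."""
    if n_intervals < 1:
        raise ValueError("n_intervals must be at least 1")
    step = (date_max - date_min) / n_intervals
    # Adding a timedelta to a date ignores fractions of a day, so inner edges fall on whole days
    edges = [date_min + i * step for i in range(n_intervals)] + [date_max]
    return list(zip(edges[:-1], edges[1:]))


windows = split_date_range(date(2022, 10, 1), date(2022, 11, 30), 2)
for start, end in windows:
    print(f"{start}-to-{end}")
```

Each window is then reduced to a single layer with the chosen aggregation, such as `median` for GEE or `sum` for the SILO rainfall in this example.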

In [5]:
df = harvest.run(fname_settings, return_df = True)

[1m[35mStarting the data harvester -----[0m
[35mℹ Found the following 7 sources: ['DEA', 'DEM', 'GEE', 'Landscape', 'Radiometric', 'SILO', 'SLGA'][0m
[1m[35m
Downloading from API sources -----[0m
[35m⊙ Initialising Earth Engine...[0m 5.0s                                    
[32m✔ Done[0m
[1m
⌛ Downloading Google Earth Engine data...[0m
[1m[36mRunning preprocess() -----[0m
[35mℹ Number of image(s) found: 1[0m
[35m⊙ Applying scale, offset and cloud masks...[0m 1.8s                       
[35m⊙ Calculating spectral indices: NDVI...[0m 0.9s                           
[32m✔ Preprocessing complete[0m
[1m[36mRunning download() -----[0m
[35mℹ Band(s) selected: ['NDVI_median'][0m
[35mℹ Setting scale to ~160.2m, converted from 6.0 arcsec at latitude -30.27[0m
[35mℹ Setting download dir to results_harvest_widget/ee[0m
[35m⊙ Downloading ee_LANDSAT_dbbc2809.tif[0m 3.6s                             
[32m✔ Google Earth Engine download(s) complete[0m
[1m[36mRunni

### Inspect result dataframe

The result dataframe contains the data extracted from all data sources for the locations specified in the input file. It can be used for further processing or analysis, and it is also saved as a CSV file (`results.csv`) in the output directory.
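Conceptually, extracting a raster value at a point means mapping the point's coordinates to a pixel index and reading that pixel. The sketch below shows a nearest-pixel lookup on a toy north-up grid. It is only an illustration of the idea: the function names and the toy values are made up, and the Harvester's own extraction may work differently (for example with interpolation or a raster library).

```python
import math


def pixel_index(lng, lat, lng_min, lat_max, res_deg):
    """Map a coordinate to (row, col) of a north-up raster whose top-left corner is (lng_min, lat_max)."""
    col = math.floor((lng - lng_min) / res_deg)
    row = math.floor((lat_max - lat) / res_deg)
    return row, col


def sample_raster(grid, lng, lat, lng_min, lat_max, res_deg):
    """Return the value of the pixel containing the point, or None if the point lies outside the grid."""
    row, col = pixel_index(lng, lat, lng_min, lat_max, res_deg)
    if 0 <= row < len(grid) and 0 <= col < len(grid[0]):
        return grid[row][col]
    return None


# Toy 2 x 3 raster with 0.1 degree pixels; the values are invented for illustration
grid = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]
value = sample_raster(grid, lng=149.85, lat=-30.27, lng_min=149.7, lat_max=-30.2, res_deg=0.1)
print(value)  # 2.0
```

Because all harvested layers are co-registered to the same bounding box and resolution, the same index can in principle be reused across layers for a given point.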


In [7]:
# render pandas dataframe df as html table in jupyter notebook
IPython.display.HTML(df.to_html())
# Alternatively print head of table only:
# df.head()

Unnamed: 0,Longitude,Latitude,ee_LANDSAT_dbbc2809_median_2022-10-01-to-2022-10-31_NDVI_median,ee_LANDSAT_749ee22c_median_2022-10-31-to-2022-11-30_NDVI_median,s2_barest_earth_red,s2_barest_earth_green,s2_barest_earth_blue,s2_barest_earth_red_edge_1,s2_barest_earth_red_edge_2,s2_barest_earth_red_edge_3,s2_barest_earth_nir,s2_barest_earth_nir_2,s2_barest_earth_swir1,s2_barest_earth_swir2,DEM_SRTM_1_Second_Hydro_Enforced_2023_06_14_1,Landscape_Slope_1,Landscape_Aspect_1,radiometric_radmap2019_grid_dose_terr_awags_rad_2019_radmap2019_grid_dose_terr_awags_rad_2019,silo_monthly_rain_2022-10-01-2022-11-30_sum_2022-10-15_1,SLGA_Bulk_Density_0-5cm_1,SLGA_Clay_0-5cm_1,geometry
0,149.85268,-30.264663,0.522849,0.247036,1134.0,827.0,643.0,1293.0,1394.0,1479.0,1592.0,1658.0,2191.0,2050.0,244.658585,1.046624,209.138062,33.15168,269.699951,1.368779,27.214527,POINT (149.85268 -30.26466)
1,149.884838,-30.265302,0.123883,0.167186,1433.0,1005.0,765.0,1646.0,1796.0,1918.0,2079.0,2159.0,2837.0,2520.0,264.428772,1.001,279.542847,35.969486,263.199951,1.362662,31.956041,POINT (149.88484 -30.26530)
2,149.884838,-30.265302,0.123883,0.167186,1433.0,1005.0,765.0,1646.0,1796.0,1918.0,2079.0,2159.0,2837.0,2520.0,264.428772,1.001,279.542847,35.969486,263.199951,1.362662,31.956041,POINT (149.88484 -30.26530)
3,149.838791,-30.278542,0.488897,0.228174,1236.0,936.0,756.0,1400.0,1508.0,1602.0,1724.0,1797.0,2281.0,2149.0,233.005081,0.84143,242.743683,29.618393,258.599854,1.360451,32.675858,POINT (149.83879 -30.27854)
4,149.830843,-30.275437,0.464824,0.185975,1250.0,936.0,746.0,1421.0,1532.0,1636.0,1775.0,1854.0,2556.0,2369.0,230.575439,1.062537,242.921112,25.061012,258.599854,1.334362,35.097813,POINT (149.83084 -30.27544)
5,149.83739,-30.272546,0.461214,0.186517,1319.0,975.0,764.0,1515.0,1657.0,1786.0,1950.0,2042.0,2823.0,2574.0,234.390594,0.474602,218.443161,33.548824,269.699951,1.381845,27.269773,POINT (149.83739 -30.27255)
6,149.884349,-30.274718,0.423914,0.079271,1294.0,949.0,743.0,1465.0,1570.0,1654.0,1780.0,1837.0,2370.0,2141.0,263.179749,1.135136,189.133453,37.309677,263.199951,1.381452,23.636356,POINT (149.88435 -30.27472)
7,149.834467,-30.283844,0.436778,0.106983,1185.0,880.0,689.0,1341.0,1443.0,1537.0,1651.0,1715.0,2130.0,1991.0,231.333786,0.564514,274.254364,32.40744,258.599854,1.351733,41.639271,POINT (149.83447 -30.28384)
8,149.875227,-30.271458,0.124853,0.090514,1083.0,796.0,623.0,1232.0,1334.0,1427.0,1530.0,1585.0,1926.0,1747.0,256.943665,1.165965,239.259262,34.95285,263.199951,1.353937,31.951994,POINT (149.87523 -30.27146)
9,149.887489,-30.276671,0.419985,0.09773,1242.0,904.0,701.0,1415.0,1536.0,1629.0,1755.0,1817.0,2392.0,2143.0,261.955414,2.485159,226.067764,43.085884,237.699951,1.388412,22.133448,POINT (149.88749 -30.27667)


### Overview of downloaded files
This table provides an overview of all downloaded files. The files are saved in the output directory specified in the settings file (see column `filename_out` below).

In [10]:
# Load the summary table of all downloaded files from the output directory
df_log = pd.read_csv(os.path.join(settings.outpath, 'download_summary.csv'))

# render pandas dataframe as html table
IPython.display.HTML(df_log.to_html())

Unnamed: 0,layername,agfunction,dataset,layertitle,filename_out,loginfo
0,ee_LANDSAT_dbbc2809,median,GEE,ee_LANDSAT_dbbc2809_median_2022-10-01-to-2022-10-31,results_harvest_widget/ee/ee_LANDSAT_dbbc2809_median_2022-10-01-to-2022-10-31.tif,downloaded
1,ee_LANDSAT_749ee22c,median,GEE,ee_LANDSAT_749ee22c_median_2022-10-31-to-2022-11-30,results_harvest_widget/ee/ee_LANDSAT_749ee22c_median_2022-10-31-to-2022-11-30.tif,downloaded
2,s2_barest_earth,median,DEA,s2_barest_earth,results_harvest_widget/dea/s2_barest_earth.tif,downloaded
3,DEM,,DEM,DEM,results_harvest_widget/DEM_SRTM_1_Second_Hydro_Enforced_2023_06_14.tif,downloaded
4,Slope,,Landscape,landscape_Slope,results_harvest_widget/Landscape_Slope.tif,downloaded
5,Aspect,,Landscape,landscape_Aspect,results_harvest_widget/Landscape_Aspect.tif,downloaded
6,radmap2019_grid_dose_terr_awags_rad_2019,,Radiometric,radmap2019_grid_dose_terr_awags_rad_2019,results_harvest_widget/radiometric_radmap2019_grid_dose_terr_awags_rad_2019.tif,downloaded
7,monthly_rain,sum,SILO,silo_monthly_rain_2022-10-01-2022-11-30_sum_2022-10-15,results_harvest_widget/silo/silo_monthly_rain_2022-10-01-2022-11-30_sum_2022-10-15.tif,downloaded
8,Bulk_Density,0-5cm,SLGA,Bulk_Density_0-5cm,results_harvest_widget/SLGA_Bulk_Density_0-5cm.tif,downloaded
9,Clay,0-5cm,SLGA,Clay_0-5cm,results_harvest_widget/SLGA_Clay_0-5cm.tif,downloaded
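
A quick sanity check on this summary is to confirm that every file listed in `filename_out` actually exists on disk. The helper below is a minimal sketch using only the standard library; the name `missing_outputs` is hypothetical, and it assumes that the paths in `filename_out` are relative to the current working directory, as they appear in the table above.

```python
import csv
import os


def missing_outputs(summary_csv):
    """Return the filename_out entries of a download summary that are not present on disk."""
    with open(summary_csv, newline="") as f:
        rows = list(csv.DictReader(f))
    return [row["filename_out"] for row in rows if not os.path.exists(row["filename_out"])]


# Example call inside this notebook (assumes `settings` from the cells above):
# missing = missing_outputs(os.path.join(settings.outpath, "download_summary.csv"))
# print(missing or "all listed files are present")
```

An empty list means that all layers in the summary were written to the output directory.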
