# Acquiring necessary geospatial domain data for CONFLUENCE

## Types of geospatial data
To build and develop our hydrological models we need information about the geospatial attributes of our domain. These data include:

1. Elevation data (Digital Elevation Model, DEM)
2. Land cover classifications
3. Soil type classifications

## Methods of acquiring geospatial data
There are several ways of acquiring geospatial data for our domain in CONFLUENCE, depending on the resources we have access to:

1. Subsetting from full-domain datasets stored on HPC. If you have access to appropriate HPC infrastructure, you can use the gistool (https://github.com/CH-Earth/gistool).
2. Downloading data directly from the provider.
3. User-supplied data. If you want to use your own geospatial data, e.g. datasets not currently integrated in CONFLUENCE, these can be defined in the CONFLUENCE configuration file.

In this notebook we will cover methods 1 and 2 for acquiring the pertinent geospatial data for our models.

# 1. Subsetting data from HPC storage
## Key Configuration Settings

Let's begin by reviewing the key parts of the `config_active.yaml` file that are essential for initializing a new project:

1. `CONFLUENCE_DATA_DIR`: The root directory where all CONFLUENCE data will be stored.
2. `CONFLUENCE_CODE_DIR`: The directory containing the CONFLUENCE code.
3. `DOMAIN_NAME`: The name of your study area or project domain.
4. `BOUNDING_BOX_COORDS`: Coordinates of the domain bounding box.
5. `GISTOOL_DATASET_ROOT`: Path to the gistool datasets root directory.
6. `TOOL_ACCOUNT`: HPC account for running gistool.
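
A missing or empty setting tends to surface only later, as an unhelpful error inside one of the acquisition classes. The following is a minimal sketch of a pre-flight check over the six keys listed above. The helper name `find_unset_keys` and the example values are hypothetical and not part of CONFLUENCE.

```python
# Keys taken from the list above; everything else here is illustrative.
REQUIRED_KEYS = [
    "CONFLUENCE_DATA_DIR",
    "CONFLUENCE_CODE_DIR",
    "DOMAIN_NAME",
    "BOUNDING_BOX_COORDS",
    "GISTOOL_DATASET_ROOT",
    "TOOL_ACCOUNT",
]


def find_unset_keys(config: dict, required=REQUIRED_KEYS) -> list:
    """Return the required keys that are absent, None, or an empty string."""
    return [key for key in required if config.get(key) in (None, "")]


# Hypothetical configuration with one key left blank and one key missing.
example_config = {
    "CONFLUENCE_DATA_DIR": "/path/to/CONFLUENCE_data",
    "CONFLUENCE_CODE_DIR": "/path/to/CONFLUENCE",
    "DOMAIN_NAME": "Bow_at_Banff",
    "BOUNDING_BOX_COORDS": "51.76/-116.55/50.95/-115.5",
    "GISTOOL_DATASET_ROOT": None,  # e.g. left blank in the YAML file
}

unset = find_unset_keys(example_config)
print(unset)
```

Running the same check on the dictionary returned by `yaml.safe_load` would flag blank entries before any data acquisition starts.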

In [1]:
import sys
import logging
from pathlib import Path

import numpy as np
import rasterio
import yaml # type: ignore
from scipy import stats

# Add the parent directory to sys.path
current_dir = Path.cwd()
parent_dir = current_dir.parent.parent
sys.path.append(str(parent_dir))

# Import required CONFLUENCE utility functions
from utils.dataHandling_utils.data_acquisition_utils import gistoolRunner, meritDownloader, soilgridsDownloader, modisDownloader # type: ignore

# Print if successful
print("All modules imported successfully")

All modules imported successfully


## Check configurations

Now we should print our configuration settings and make sure that we have defined all the settings we need. 

In [2]:
config_path = Path('../../0_config_files/config_active.yaml')
with open(config_path, 'r') as config_file:
    config = yaml.safe_load(config_file)
    
    # Display key configuration settings
    print(f"CONFLUENCE_DATA_DIR: {config['CONFLUENCE_DATA_DIR']}")
    print(f"CONFLUENCE_CODE_DIR: {config['CONFLUENCE_CODE_DIR']}")
    print(f"DOMAIN_NAME: {config['DOMAIN_NAME']}")
    print(f"BOUNDING_BOX_COORDS: {config['BOUNDING_BOX_COORDS']}")
    print(f"GISTOOL_DATASET_ROOT: {config['GISTOOL_DATASET_ROOT']}")
    print(f"TOOL_ACCOUNT: {config['TOOL_ACCOUNT']}")

CONFLUENCE_DATA_DIR: /home/darri/data/CONFLUENCE_data
CONFLUENCE_CODE_DIR: /home/darri/code/CONFLUENCE
DOMAIN_NAME: Bow_at_Banff
BOUNDING_BOX_COORDS: 51.76/-116.55/50.95/-115.5
GISTOOL_DATASET_ROOT: /project/6079554/data/geospatial-data/
TOOL_ACCOUNT: def-mclark-ab
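
The `BOUNDING_BOX_COORDS` string appears to be ordered as `lat_max/lon_min/lat_min/lon_max`. This is inferred from how the notebook later splits it into `latlims` and `lonlims`, not from a documented specification. Under that assumption, a small hypothetical helper can parse and sanity-check the value before it is passed to any tool:

```python
def parse_bounding_box(bbox: str) -> dict:
    """Split 'lat_max/lon_min/lat_min/lon_max' into validated limit strings.

    The ordering is an assumption inferred from this notebook.
    """
    parts = [float(p) for p in bbox.split("/")]
    if len(parts) != 4:
        raise ValueError(f"Expected 4 values separated by '/', got {len(parts)}")

    lat_max, lon_min, lat_min, lon_max = parts

    if not (-90 <= lat_min < lat_max <= 90):
        raise ValueError(f"Invalid latitude limits: {lat_min}, {lat_max}")
    # Note: this simple check rejects boxes that cross the antimeridian.
    if not (-180 <= lon_min < lon_max <= 180):
        raise ValueError(f"Invalid longitude limits: {lon_min}, {lon_max}")

    # 'min,max' strings, the same layout the notebook builds for gistool
    return {
        "lat_lims": f"{lat_min},{lat_max}",
        "lon_lims": f"{lon_min},{lon_max}",
    }


limits = parse_bounding_box("51.76/-116.55/50.95/-115.5")
print(limits)
```

A swapped latitude pair, for example, raises a `ValueError` here instead of producing an empty subset later.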


## Define default paths

Now let's define the paths to the attribute data and create the containing directories before we run the acquisition scripts.

In [3]:
# Main project directory
data_dir = config['CONFLUENCE_DATA_DIR']
project_dir = Path(data_dir) / f"domain_{config['DOMAIN_NAME']}"

# Attribute directories
dem_dir = project_dir / 'attributes' / 'elevation' / 'dem'
soilclass_dir = project_dir / 'attributes' / 'soilclass'
landclass_dir = project_dir / 'attributes' / 'landclass'

# Create the directories (avoid shadowing the built-in `dir`)
for attribute_dir in [dem_dir, soilclass_dir, landclass_dir]:
    attribute_dir.mkdir(parents=True, exist_ok=True)

# 1. Running gistool
Now that we have our configuration loaded, let's run the gistool to get data we need. This process involves initializing the gistoolRunner with the appropriate settings for each of the datasets we want to extract.

## A. Elevation data

Currently, gistool supports the MERIT Hydro digital elevation model.

In [3]:
# Set up 
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Initialize gistoolRunner class
gr = gistoolRunner(config, logger)

# Get latitude and longitude limits as "min,max" strings
# (the bounding box is ordered lat_max/lon_min/lat_min/lon_max)
bbox = config['BOUNDING_BOX_COORDS'].split('/')
latlims = f"{bbox[2]},{bbox[0]}"
lonlims = f"{bbox[1]},{bbox[3]}"

# Create the gistool command
gistool_command = gr.create_gistool_command(
    dataset='MERIT-Hydro',
    output_dir=dem_dir,
    lat_lims=latlims,
    lon_lims=lonlims,
    variables='elv',
)
gr.execute_gistool_command(gistool_command)


NameError: name 'config' is not defined

## B. Landcover Data

Currently, the gistool supports the MODIS (MCD12Q1) and Landsat (NALCMS) land cover classification data.

In [13]:
# First we define which years we should acquire the landcover data for
start_year = 2001
end_year = 2020

# Select which MODIS dataset to use
modis_var = "MCD12Q1.061"

# Create the gistool command
gistool_command = gr.create_gistool_command(
    dataset='MODIS',
    output_dir=landclass_dir,
    lat_lims=latlims,
    lon_lims=lonlims,
    variables=modis_var,
    start_date=f"{start_year}-01-01",
    end_date=f"{end_year}-01-01",
)
gr.execute_gistool_command(gistool_command)




['/home/darri/data/CONFLUENCE_data/installs/gistool/extract-gis.sh', '--dataset=MODIS', '--dataset-dir=/project/6079554/data/geospatial-data/MODIS', '--output-dir=/home/darri/data/CONFLUENCE_data/domain_Bow_at_Banff/attributes/landclass', '--lat-lims=50.95,51.76', '--lon-lims=-116.55,-115.5', '--variable=MCD12Q1.061', '--prefix=domain_Bow_at_Banff', '--lib-path=/project/rrg-mclark/lib/--submit-job', '--print-geotiff=true', '--account=def-mclark-ab', '--start-date=2001-01-01', '--end-date=2020-01-01']
(2024-10-19 23:44:23) modis.sh: processing MODIS HDF(s)...
(2024-10-19 23:44:23) modis.sh: creating cache directory under /home/darri/scratch/.temp_data_358124064
(2024-10-19 23:44:23) modis.sh: creating output directory under /home/darri/data/CONFLUENCE_data/domain_Bow_at_Banff/attributes/landclass
(2024-10-19 23:44:23) modis.sh: building virtual format (.vrt) of MODIS HDFs under /home/darri/scratch/.temp_data_358124064
(2024-10-20 00:14:35) modis.sh: subsetting HDFs in GeoTIFF format und

mkdir: cannot create directory ‘/home/darri/empty_dir’: File exists
2024-10-19 20:14:41,369 - INFO - gistool completed successfully.
2024-10-19 20:14:41,370 - INFO - Geospatial data acquisition process completed


(2024-10-20 00:14:41) modis.sh: deleting temporary files from /home/darri/scratch/.temp_data_358124064
(2024-10-20 00:14:41) modis.sh: temporary files from /home/darri/scratch/.temp_data_358124064 are removed
(2024-10-20 00:14:41) modis.sh: results are produced under /home/darri/data/CONFLUENCE_data/domain_Bow_at_Banff/attributes/landclass
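
The command list printed above shows how the keyword arguments end up as `--flag=value` strings for `extract-gis.sh`. The sketch below reproduces that layout for illustration only. The flag names are copied from the printed command, but the function `sketch_gistool_args` is hypothetical and is not the `gistoolRunner` implementation. The `--lib-path` entry is left out because its value in the log is ambiguous.

```python
def sketch_gistool_args(script, dataset, dataset_dir, output_dir,
                        lat_lims, lon_lims, variable, prefix, account,
                        start_date=None, end_date=None):
    """Assemble an argument list shaped like the one in the log above."""
    args = [
        str(script),
        f"--dataset={dataset}",
        f"--dataset-dir={dataset_dir}",
        f"--output-dir={output_dir}",
        f"--lat-lims={lat_lims}",
        f"--lon-lims={lon_lims}",
        f"--variable={variable}",
        f"--prefix={prefix}",
        "--print-geotiff=true",
        f"--account={account}",
    ]
    # Date flags only appear for time-varying datasets such as MODIS
    if start_date and end_date:
        args += [f"--start-date={start_date}", f"--end-date={end_date}"]
    return args


# Placeholder paths and account; substitute your own values.
args = sketch_gistool_args(
    script="/path/to/gistool/extract-gis.sh",
    dataset="MODIS",
    dataset_dir="/path/to/geospatial-data/MODIS",
    output_dir="/path/to/attributes/landclass",
    lat_lims="50.95,51.76",
    lon_lims="-116.55,-115.5",
    variable="MCD12Q1.061",
    prefix="domain_Bow_at_Banff",
    account="my-account",
    start_date="2001-01-01",
    end_date="2020-01-01",
)
print(args)
```

Keeping the command as a list rather than a single string means it can be handed to `subprocess.run(args, check=True)` without shell quoting issues.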


In [16]:
# If we selected a range of years we need to calculate the mode of the timeseries
def calculate_landcover_mode(input_dir, output_file, start_year, end_year):
    # List all the geotiff files for the years we're interested in
    geotiff_files = [input_dir / f"domain_{config['DOMAIN_NAME']}{year}.tif" for year in range(start_year, end_year + 1)]
    
    # Read the first file to get metadata
    with rasterio.open(geotiff_files[0]) as src:
        meta = src.meta
        shape = src.shape
    
    # Initialize an array to store all the data
    all_data = np.zeros((len(geotiff_files), *shape), dtype=np.int16)
    
    # Read all the geotiffs into the array
    for i, file in enumerate(geotiff_files):
        with rasterio.open(file) as src:
            all_data[i] = src.read(1)
    
    # Calculate the mode along the time axis
    mode_data, _ = stats.mode(all_data, axis=0)
    mode_data = mode_data.astype(np.int16).squeeze()
    
    # Update metadata for output
    meta.update(count=1, dtype='int16')
    
    # Write the result
    with rasterio.open(output_file, 'w', **meta) as dst:
        dst.write(mode_data, 1)
    
    print(f"Mode calculation complete. Result saved to {output_file}")

if start_year != end_year:
    input_dir = landclass_dir / modis_var
    # Build the output name from the configured domain instead of hard-coding it
    output_file = landclass_dir / f"domain_{config['DOMAIN_NAME']}_landcover.tif"

    calculate_landcover_mode(input_dir, output_file, start_year, end_year)

Mode calculation complete. Result saved to /home/darri/data/CONFLUENCE_data/domain_Bow_at_Banff/attributes/landclass/domain_Bow_at_Banff_landcover.tif
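
To make the mode step concrete, here is a small NumPy-only sketch of the same idea on a synthetic 3-year, 2x2 stack of class values. It is an illustration of the technique, not a replacement for the `scipy.stats.mode` call above. The function name `temporal_mode` and the class values are made up for the example.

```python
import numpy as np


def temporal_mode(stack: np.ndarray) -> np.ndarray:
    """Most frequent class per pixel along axis 0 (ties go to the smallest class)."""
    classes = np.unique(stack)  # sorted unique class values
    # counts[k, row, col] = number of time steps where the pixel equals classes[k]
    counts = np.stack([(stack == c).sum(axis=0) for c in classes])
    # argmax returns the first maximum, i.e. the smallest class on ties
    return classes[counts.argmax(axis=0)]


# Synthetic stack: axis 0 is time (3 years), then rows and columns
stack = np.array(
    [
        [[1, 2], [3, 3]],
        [[1, 2], [4, 3]],
        [[5, 2], [4, 1]],
    ],
    dtype=np.int16,
)

result = temporal_mode(stack)
print(result)
```

The top-left pixel is class 1 in two of the three years, so its mode is 1 even though the final year reports class 5. This is the behaviour we want when smoothing out single-year classification noise.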


## C. Soil classification data

Currently, the gistool supports: i. Soil Grids (v1), ii. USDA Soil Class, and iii. Global Soil Dataset for Earth System Modelling (GSDE).

In [6]:
# Create the gistool command
gistool_command = gr.create_gistool_command(dataset = 'soil_class', output_dir = soilclass_dir, lat_lims = latlims, lon_lims = lonlims, variables = 'soil_classes')
gr.execute_gistool_command(gistool_command)


['/home/darri/data/CONFLUENCE_data/installs/gistool/extract-gis.sh', '--dataset=soil_class', '--dataset-dir=/project/6079554/data/geospatial-data/soil_classes', '--output-dir=/home/darri/data/CONFLUENCE_data/domain_Bow_at_Banff/attributes/soilclass', '--lat-lims=50.95,51.76', '--lon-lims=-116.55,-115.5', '--variable=soil_classes', '--prefix=domain_Bow_at_Banff', '--lib-path=/project/rrg-mclark/lib/--submit-job', '--print-geotiff=true', '--account=def-mclark-ab']
(2024-10-19 23:21:58) soil_class.sh: processing Wouter's wonderful soil_class GeoTIFF(s)...
(2024-10-19 23:21:58) soil_class.sh: creating output directory under /home/darri/data/CONFLUENCE_data/domain_Bow_at_Banff/attributes/soilclass
(2024-10-19 23:21:58) soil_class.sh: subsetting GeoTIFFs under /home/darri/data/CONFLUENCE_data/domain_Bow_at_Banff/attributes/soilclass


mkdir: cannot create directory ‘/home/darri/empty_dir’: File exists
2024-10-19 19:22:00,283 - INFO - gistool completed successfully.
2024-10-19 19:22:00,284 - INFO - Geospatial data acquisition process completed


(2024-10-19 23:22:00) soil_class.sh: deleting temporary files from /home/darri/scratch/.temp_data_856264249
(2024-10-19 23:22:00) soil_class.sh: temporary files from /home/darri/scratch/.temp_data_856264249 are removed
(2024-10-19 23:22:00) soil_class.sh: results are produced under /home/darri/data/CONFLUENCE_data/domain_Bow_at_Banff/attributes/soilclass
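
Since gistool reports where results are produced, a quick listing of the GeoTIFFs under the attribute directories can confirm that each step actually wrote something. The sketch below uses a temporary directory and a made-up file name so that it runs anywhere. The helper `list_geotiffs` is hypothetical.

```python
import tempfile
from pathlib import Path


def list_geotiffs(directory) -> list:
    """Return the sorted names of all .tif files found below `directory`."""
    return sorted(p.name for p in Path(directory).rglob("*.tif"))


# Stand-in for the project's attributes directory, with one placeholder file
with tempfile.TemporaryDirectory() as tmp:
    soil_dir = Path(tmp) / "attributes" / "soilclass"
    soil_dir.mkdir(parents=True)
    (soil_dir / "domain_example_soil_classes.tif").touch()

    found = list_geotiffs(Path(tmp) / "attributes")

print(found)
```

In the notebook, calling `list_geotiffs(project_dir / 'attributes')` would be the equivalent check. An empty list indicates that a subsetting step did not produce output.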


# 2. Download data from provider

If you don't have access to gistool-supported HPC infrastructure, data can be downloaded from the original data provider. CONFLUENCE currently supports direct downloads of the following datasets:

1. Elevation (MERIT Hydro)
2. Soil classifications (SOILGRIDS)
3. Landcover classifications (MODIS MCD12Q1)

These scripts are adapted from the CWARHM workflows by Knoben et al., 2021. The user can also develop their own download scripts here. If you do so, please consider contributing them to the CONFLUENCE repository.
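
If you write your own download script, transient network failures are worth handling explicitly. The following is a minimal sketch of a retry wrapper, assuming the actual download is supplied as a callable. The names `download_with_retries` and `flaky_fetch` are hypothetical, and this is not how the CONFLUENCE downloaders are implemented.

```python
import time
from typing import Callable


def download_with_retries(fetch: Callable[[], bytes],
                          max_tries: int = 3,
                          wait_seconds: float = 1.0) -> bytes:
    """Call `fetch` until it succeeds or `max_tries` attempts have failed."""
    last_error = None
    for attempt in range(1, max_tries + 1):
        try:
            return fetch()
        except OSError as err:  # urllib.error.URLError is a subclass of OSError
            last_error = err
            print(f"Error on try {attempt}: {err}")
            if attempt < max_tries:
                time.sleep(wait_seconds * attempt)  # wait longer after each failure
    raise RuntimeError(f"Download failed after {max_tries} tries") from last_error


# Stand-in for a real download: fails twice, then returns a payload
calls = {"n": 0}


def flaky_fetch() -> bytes:
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("simulated network failure")
    return b"payload"


data = download_with_retries(flaky_fetch, max_tries=3, wait_seconds=0.0)
print(data, calls["n"])
```

Passing the download as a callable keeps the retry logic independent of the provider, so the same wrapper could be used with `urllib.request.urlopen` or any other client.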

## 1. Download elevation data from MERIT Hydro

In [3]:
# 1. Download MERIT HYDRO elevation data
# Set up 
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Initialize meritDownloader class
md = meritDownloader(config, logger)

# Run MERIT downloads
md.run_download()

logger.info("MERIT data processing completed")

2024-10-19 21:11:57,301 - INFO - Attempting to download elv_n30w120.tar
2024-10-19 21:11:57,378 - ERROR - Error downloading elv_n30w120.tar on try 1: <urlopen error [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:1006)>


KeyboardInterrupt: 

## 2. Download soil classification data from SOILGRIDS

In [3]:
# Set up 
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Initialize soilgridsDownloader class
sd = soilgridsDownloader(config, logger)

# Run SOILGRIDS downloads and processing
sd.process_soilgrids_data()

logger.info("SOILGRIDS data processing completed")

2024-10-19 21:19:16,796 - ERROR - Error downloading SOILGRIDS data: HTTPSConnectionPool(host='www.hydroshare.org', port=443): Max retries exceeded with url: /hsapi/resource/None/files/usda_mode_soilclass_250m_ll.tif (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x2b7724672d50>, 'Connection to www.hydroshare.org timed out. (connect timeout=None)'))
2024-10-19 21:19:16,805 - ERROR - No .tif files found in the raw soil data directory
2024-10-19 21:19:16,807 - INFO - Removed raw soil data directory: /home/darri/data/CONFLUENCE_data/domain_Bow_at_Banff/parameters/soilclass/1_soil_classes_global
2024-10-19 21:19:16,808 - INFO - SOILGRIDS data processing completed
2024-10-19 21:19:16,808 - INFO - SOILGRIDS data processing completed


## 3. Download landcover classifications from MODIS (MCD12Q1)

In [4]:
# Set up 
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Initialize modisDownloader class
md = modisDownloader(config, logger)

# Run MODIS downloads and processing
md.run_modis_workflow()

logger.info("MODIS data processing completed")

TypeError: expected str, bytes or os.PathLike object, not NoneType

## Conclusion
Congratulations! You have successfully acquired the geospatial data we need to define our modelling domain and to estimate our model attributes.