# Model-Agnostic Input Data Preprocessing in CONFLUENCE

## Introduction

This notebook focuses on the model-agnostic preprocessing steps for input data in CONFLUENCE. Model-agnostic preprocessing involves tasks that are common across different hydrological models, such as data acquisition, quality control, and initial formatting.

Key steps covered in this notebook include:

1. Spatial resampling of forcing data to match the model domain
2. Calculation of zonal statistics for the domain's geospatial attributes

In this preprocessing stage we ensure that our input data is consistent, complete, and properly formatted before we move on to model-specific preprocessing steps. By the end of this notebook, you will have clean, standardized datasets ready for further model-specific processing.

## First we import the libraries and functions we need

In [1]:
import sys
from pathlib import Path
from typing import Dict, Any
import logging
import yaml # type: ignore

current_dir = Path.cwd()
parent_dir = current_dir.parent.parent
sys.path.append(str(parent_dir))

from utils.dataHandling_utils.agnosticPreProcessor_util import forcingResampler, geospatialStatistics # type: ignore

# Set up logger
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

## Check configurations

Next, we print the key configuration settings to confirm that everything this notebook needs is defined.

In [2]:
config_path = Path('../../0_config_files/config_active.yaml')
with open(config_path, 'r') as config_file:
    config = yaml.safe_load(config_file)
    print(f"FORCING_DATASET: {config['FORCING_DATASET']}")
    print(f"EASYMORE_CLIENT: {config['EASYMORE_CLIENT']}")
    print(f"FORCING_VARIABLES: {config['FORCING_VARIABLES']}")
    print(f"EXPERIMENT_TIME_START: {config['EXPERIMENT_TIME_START']}")

FORCING_DATASET: ERA5
EASYMORE_CLIENT: easymore cli
FORCING_VARIABLES: longitude,latitude,time,LWRadAtm,SWRadAtm,pptrate,airpres,airtemp,spechum,windspd
EXPERIMENT_TIME_START: 2010-01-01 01:00
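Printing the settings shows their values, but a missing key would only surface as a `KeyError`. A minimal sketch of an explicit check is given below. The helper name `find_missing_keys` and the `REQUIRED_KEYS` list are hypothetical and not part of CONFLUENCE; the list simply collects the keys this notebook reads.

```python
from typing import Any, Dict, List

# Keys read in this notebook (illustrative list; extend it for your own setup)
REQUIRED_KEYS = [
    "CONFLUENCE_DATA_DIR",
    "DOMAIN_NAME",
    "FORCING_DATASET",
    "FORCING_VARIABLES",
    "EXPERIMENT_TIME_START",
]


def find_missing_keys(config: Dict[str, Any], required: List[str]) -> List[str]:
    """Return the required keys that are absent or empty in the config."""
    return [key for key in required if config.get(key) in (None, "")]


# Made-up config for illustration only; in the notebook, pass the loaded `config`
example_config = {
    "CONFLUENCE_DATA_DIR": "/path/to/CONFLUENCE_data",
    "DOMAIN_NAME": "Bow_at_Banff",
    "FORCING_DATASET": "ERA5",
    "FORCING_VARIABLES": "longitude,latitude,time,airtemp",
}

missing = find_missing_keys(example_config, REQUIRED_KEYS)
print(f"Missing settings: {missing}")
```

With the real configuration, an empty list indicates that all the listed settings are present.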


## Define default paths

Before running the preprocessing scripts, let's define the paths to the data directories and create any output directories that do not exist yet.

In [3]:
# Main project directory
data_dir = config['CONFLUENCE_DATA_DIR']
project_dir = Path(data_dir) / f"domain_{config['DOMAIN_NAME']}"

# Data directories
raw_data_dir = project_dir / 'forcing' / 'raw_data'
basin_averaged_data = project_dir / 'forcing' / 'basin_averaged_data'
catchment_intersection_dir = project_dir / 'shapefiles' / 'catchment_intersection'

# Make sure the output directories exist
basin_averaged_data.mkdir(parents=True, exist_ok=True)
catchment_intersection_dir.mkdir(parents=True, exist_ok=True)
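
The output directories are created here, but the raw forcing directory must already contain data. A small sketch of a check that could be run before resampling is shown below. It assumes the raw forcing files are NetCDF files with an `.nc` extension; the helper `list_forcing_files` is hypothetical, and a temporary directory stands in for `raw_data_dir`.

```python
import tempfile
from pathlib import Path
from typing import List


def list_forcing_files(raw_dir: Path, pattern: str = "*.nc") -> List[Path]:
    """Return matching files in raw_dir sorted by name, or [] if the directory is missing."""
    if not raw_dir.is_dir():
        return []
    return sorted(raw_dir.glob(pattern))


# Demonstration with a temporary directory and made-up file names
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "forcing" / "raw_data"
    demo_dir.mkdir(parents=True, exist_ok=True)
    (demo_dir / "demo_ERA5_merged_201002.nc").touch()
    (demo_dir / "demo_ERA5_merged_201001.nc").touch()

    found = [f.name for f in list_forcing_files(demo_dir)]
    print(f"Found {len(found)} forcing files: {found}")
```

In the notebook, calling `list_forcing_files(raw_data_dir)` and stopping early on an empty result would avoid starting a long resampling run with no input.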

## 1. Preprocess forcing data

Now let's resample the forcing data onto our model domain. We use the EASYMORE resampling tool by Gharari et al., 2023.

In [4]:
# Initialize the forcingResampler class
fr = forcingResampler(config, logger)

# Run resampling
fr.run_resampling()

2024-10-20 20:57:06,330 - INFO - Starting forcing data resampling process
2024-10-20 20:57:06,331 - INFO - Creating ERA5 shapefile
2024-10-20 20:57:07,394 - INFO - Created 20 records
2024-10-20 20:57:07,397 - INFO - ERA5 shapefile created and saved to /home/darri/data/CONFLUENCE_data/domain_Bow_at_Banff/shapefiles/forcing/forcing_ERA5.shp
2024-10-20 20:57:07,398 - INFO - Starting forcing remapping process
2024-10-20 20:57:07,399 - INFO - Creating one weighted forcing file
2024-10-20 20:57:07,461 - INFO - Created 136 records
2024-10-20 20:57:07,525 - INFO - Created 20 records


EASYMORE version 0.0.4 is initiated.
EASYMORE is given multiple varibales to be remapped but only on format and fill valueEASYMORE repeat the format and fill value for all the variables in output files
EASYMORE will remap variable  airpres  from source file to variable  airpres  in remapped NeCDF file
EASYMORE will remap variable  LWRadAtm  from source file to variable  LWRadAtm  in remapped NeCDF file
EASYMORE will remap variable  SWRadAtm  from source file to variable  SWRadAtm  in remapped NeCDF file
EASYMORE will remap variable  pptrate  from source file to variable  pptrate  in remapped NeCDF file
EASYMORE will remap variable  airtemp  from source file to variable  airtemp  in remapped NeCDF file
EASYMORE will remap variable  spechum  from source file to variable  spechum  in remapped NeCDF file
EASYMORE will remap variable  windspd  from source file to variable  windspd  in remapped NeCDF file
EASYMORE detects that target shapefile is in WGS84 (epsg:4326)
EASYMORE detects that th

2024-10-20 20:57:07,609 - INFO - Created 136 records
2024-10-20 20:57:07,621 - INFO - Created 20 records
2024-10-20 20:57:07,658 - INFO - Created 136 records
2024-10-20 20:57:07,709 - INFO - Created 20 records
  shp_int.to_file(self.temp_dir+self.case_name+'_intersected_shapefile.shp') # save the intersected files
  ogr_write(

------REMAPPING------
netcdf output file will be compressed at level 4
Remapping /home/darri/data/CONFLUENCE_data/domain_Bow_at_Banff/forcing/raw_data/domain_Bow_at_Banff_ERA5_merged_201001.nc to /home/darri/data/CONFLUENCE_data/domain_Bow_at_Banff/forcing/basin_averaged_data/Bow_at_Banff_ERA5_remapped_2010-01-01-00-00-00.nc
Started at date and time 2024-10-20 20:57:08.038089


2024-10-20 20:57:27,705 - INFO - Creating all weighted forcing files


Ended   at date and time 2024-10-20 20:57:27.654515
------
EASYMORE version 0.0.4 is initiated.
No temporary folder is provided for EASYMORE; this will result in EASYMORE saving the files in the same directory as python script
EASYMORE is given multiple varibales to be remapped but only on format and fill valueEASYMORE repeat the format and fill value for all the variables in output files
remap file is provided; EASYMORE will use this file and skip calculation of remapping
EASYMORE will remap variable  airpres  from source file to variable  airpres  in remapped NeCDF file
EASYMORE will remap variable  LWRadAtm  from source file to variable  LWRadAtm  in remapped NeCDF file
EASYMORE will remap variable  SWRadAtm  from source file to variable  SWRadAtm  in remapped NeCDF file
EASYMORE will remap variable  pptrate  from source file to variable  pptrate  in remapped NeCDF file
EASYMORE will remap variable  airtemp  from source file to variable  airtemp  in remapped NeCDF file
EASYMORE will

2024-10-20 21:32:13,963 - INFO - All weighted forcing files created
2024-10-20 21:32:13,964 - INFO - Forcing remapping process completed
2024-10-20 21:32:13,965 - INFO - Forcing data resampling process completed


Ended   at date and time 2024-10-20 21:32:13.918443
------
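
Conceptually, remapping gridded forcing onto catchments can be viewed as an area-weighted average: each catchment receives the value of every grid cell it overlaps, weighted by the share of the catchment's area that falls in that cell. The NumPy sketch below illustrates that idea only. It is not EASYMORE's implementation, and the function name, overlap areas, and temperatures are made up for illustration.

```python
import numpy as np


def remap_area_weighted(grid_values: np.ndarray, overlap_area: np.ndarray) -> np.ndarray:
    """Average grid-cell values onto catchments using overlap areas as weights.

    grid_values:  shape (n_time, n_cells)
    overlap_area: shape (n_catchments, n_cells), area of each catchment inside each cell
    returns:      shape (n_time, n_catchments)
    """
    row_sums = overlap_area.sum(axis=1, keepdims=True)
    if np.any(row_sums == 0):
        raise ValueError("Every catchment must overlap at least one grid cell")
    weights = overlap_area / row_sums  # each row now sums to 1
    return grid_values @ weights.T


# Made-up example: two time steps, three grid cells, two catchments
airtemp = np.array([[270.0, 272.0, 274.0],
                    [271.0, 273.0, 275.0]])
overlap_area = np.array([[3.0, 1.0, 0.0],
                         [0.0, 2.0, 2.0]])

basin_airtemp = remap_area_weighted(airtemp, overlap_area)
print(basin_airtemp)
```

Because the weights depend only on geometry, they can be computed once and reused for every forcing file, which matches the log message above stating that a provided remap file lets EASYMORE skip the remapping calculation.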


## 2. Preprocess geospatial data

Now let's calculate the zonal statistics of the geospatial attributes we need for our model.
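
Zonal statistics summarize the raster cells that fall inside each catchment. As a rough illustration, the sketch below computes a mean for a continuous raster (elevation) and a majority class for a categorical raster (soil class) from a zone-ID grid. The function names and arrays are made up, and whether `geospatialStatistics` uses exactly these statistics is an assumption.

```python
from typing import Dict

import numpy as np


def zonal_mean(values: np.ndarray, zones: np.ndarray) -> Dict[int, float]:
    """Mean raster value per zone ID, ignoring NaN cells."""
    return {
        int(zone_id): float(np.nanmean(values[zones == zone_id]))
        for zone_id in np.unique(zones)
    }


def zonal_majority(classes: np.ndarray, zones: np.ndarray) -> Dict[int, int]:
    """Most frequent class per zone ID."""
    result = {}
    for zone_id in np.unique(zones):
        labels, counts = np.unique(classes[zones == zone_id], return_counts=True)
        result[int(zone_id)] = int(labels[np.argmax(counts)])
    return result


# Made-up 2 x 3 rasters: zone IDs, elevation in metres, soil class codes
zones = np.array([[1, 1, 2],
                  [1, 2, 2]])
elevation = np.array([[1500.0, 1700.0, 2100.0],
                      [1600.0, 2300.0, 2200.0]])
soil_class = np.array([[3, 3, 5],
                       [4, 5, 6]])

elev_mean = zonal_mean(elevation, zones)
soil_majority = zonal_majority(soil_class, zones)
print(elev_mean)
print(soil_majority)
```

The notebook cell that follows performs the actual calculation on the domain's shapefiles and rasters.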

In [5]:
# Initialize the geospatialStatistics class
gs = geospatialStatistics(config, logger)

# Run the zonal statistics
gs.run_statistics()

2024-10-20 21:32:13,971 - INFO - Calculating soil statistics
2024-10-20 21:32:14,693 - INFO - Created 136 records
2024-10-20 21:32:14,699 - INFO - Soil statistics saved to /home/darri/data/CONFLUENCE_data/domain_Bow_at_Banff/shapefiles/catchment_intersection/with_soilgrids/catchment_with_soilclass.shp
2024-10-20 21:32:14,700 - INFO - Calculating land statistics
2024-10-20 21:32:15,414 - INFO - Created 136 records
2024-10-20 21:32:15,422 - INFO - Land statistics saved to /home/darri/data/CONFLUENCE_data/domain_Bow_at_Banff/shapefiles/catchment_intersection/with_landclass/catchment_with_landclass.shp
2024-10-20 21:32:15,423 - INFO - Calculating elevation statistics
2024-10-20 21:32:16,135 - INFO - Updating existing 'elev_mean' column
2024-10-20 21:32:16,157 - INFO - Created 136 records
2024-10-20 21:32:16,163 - INFO - Elevation statistics saved to /home/darri/data/CONFLUENCE_data/domain_Bow_at_Banff/shapefiles/catchment_intersection/with_dem/catchment_with_dem.shp
2024-10-20 21:32:16,164