# Model-Agnostic Input Data Preprocessing in CONFLUENCE

## Introduction

This notebook focuses on the model-agnostic preprocessing steps for input data in CONFLUENCE. Model-agnostic preprocessing involves tasks that are common across different hydrological models, such as data acquisition, quality control, and initial formatting.

Key steps covered in this notebook include:

1. Spatial resampling of forcing data to match the model domain
2. Calculation of zonal statistics for the domain's geospatial attributes

In this preprocessing stage we ensure that our input data is consistent, complete, and properly formatted before we move on to model-specific preprocessing steps. By the end of this notebook, you will have clean, standardized datasets ready for further model-specific processing.

## First we import the libraries and functions we need

In [1]:
import sys
from pathlib import Path
from typing import Dict, Any
import logging
import yaml # type: ignore

current_dir = Path.cwd()
parent_dir = current_dir.parent.parent
sys.path.append(str(parent_dir))

from utils.dataHandling_utils.agnosticPreProcessor_util import forcingResampler, geospatialStatistics # type: ignore

# Set up logger
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

## Check configurations

Next, we print the key configuration settings to confirm that everything this notebook needs is defined.

In [2]:
config_path = Path('../../0_config_files/config_active.yaml')
with open(config_path, 'r') as config_file:
    config = yaml.safe_load(config_file)
    print(f"FORCING_DATASET: {config['FORCING_DATASET']}")
    print(f"EASYMORE_CLIENT: {config['EASYMORE_CLIENT']}")
    print(f"FORCING_VARIABLES: {config['FORCING_VARIABLES']}")
    print(f"EXPERIMENT_TIME_START: {config['EXPERIMENT_TIME_START']}")

FORCING_DATASET: ERA5
EASYMORE_CLIENT: easymore cli
FORCING_VARIABLES: longitude,latitude,time,LWRadAtm,SWRadAtm,pptrate,airpres,airtemp,spechum,windspd
EXPERIMENT_TIME_START: 2009-01-01 01:00
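
One way to confirm that every setting this notebook relies on is present is to compare the loaded configuration against a list of required keys. The sketch below is a minimal illustration and not part of CONFLUENCE: the helper `find_missing_keys` and the `example_config` dictionary are hypothetical, and the key list contains only the settings read in this notebook.

```python
from typing import Any, Dict, List

# Keys that this notebook reads from the configuration.
REQUIRED_KEYS = [
    "CONFLUENCE_DATA_DIR",
    "DOMAIN_NAME",
    "FORCING_DATASET",
    "EASYMORE_CLIENT",
    "FORCING_VARIABLES",
    "EXPERIMENT_TIME_START",
]


def find_missing_keys(config: Dict[str, Any], required: List[str]) -> List[str]:
    """Return the required keys that are absent, None, or an empty string."""
    return [key for key in required if config.get(key) in (None, "")]


# Hypothetical stand-in for the dictionary loaded from config_active.yaml.
# CONFLUENCE_DATA_DIR is left out on purpose to show a failed check.
example_config = {
    "DOMAIN_NAME": "Chena",
    "FORCING_DATASET": "ERA5",
    "EASYMORE_CLIENT": "easymore cli",
    "FORCING_VARIABLES": "longitude,latitude,time,airtemp",
    "EXPERIMENT_TIME_START": "2009-01-01 01:00",
}

missing = find_missing_keys(example_config, REQUIRED_KEYS)
if missing:
    print(f"Missing configuration keys: {missing}")
```

In the notebook itself, pass the loaded `config` instead of `example_config`.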


## Define default paths

Now let's define the paths to the data directories and create any that do not yet exist before we run the preprocessing scripts.

In [3]:
# Main project directory
data_dir = config['CONFLUENCE_DATA_DIR']
project_dir = Path(data_dir) / f"domain_{config['DOMAIN_NAME']}"

# Data directories
raw_data_dir = project_dir / 'forcing' / 'raw_data'
basin_averaged_data = project_dir / 'forcing' / 'basin_averaged_data'
catchment_intersection_dir = project_dir / 'shapefiles' / 'catchment_intersection'

# Make sure the new directories exist
basin_averaged_data.mkdir(parents=True, exist_ok=True)
catchment_intersection_dir.mkdir(parents=True, exist_ok=True)
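
Before resampling, it can help to confirm that the raw forcing directory actually contains NetCDF files. The sketch below is only an illustration: `list_netcdf_files` is a hypothetical helper, `example_dir` is a placeholder path, and the `.nc` extension is inferred from the file names that appear in this notebook's log output.

```python
from pathlib import Path
from typing import List


def list_netcdf_files(directory: Path) -> List[Path]:
    """Return the .nc files in `directory`, sorted by name.

    An empty list is returned if the directory is missing or holds no .nc files.
    """
    if not directory.is_dir():
        return []
    return sorted(directory.glob("*.nc"))


# Placeholder path for illustration; in the notebook, use raw_data_dir.
example_dir = Path("forcing") / "raw_data"
files = list_netcdf_files(example_dir)
print(f"Found {len(files)} NetCDF file(s) in {example_dir}")
```

An empty result is a cue to check the data acquisition step before running the resampler.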

## 1. Preprocess forcing data

Now let's resample the forcing data onto our model domain. We use the EASYMORE resampling tool (Gharari et al., 2023).
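
Conceptually, remapping gridded forcing onto catchments can be understood as an area-weighted average: each catchment value is the sum of the overlapping grid-cell values, weighted by the share of the catchment that each cell covers. The following sketch illustrates that idea in plain Python. It is not EASYMORE's implementation, and the function name, cell IDs, values, and weights are all made up for illustration.

```python
from typing import Dict


def area_weighted_mean(values: Dict[int, float], weights: Dict[int, float]) -> float:
    """Average grid-cell values, using each cell's overlap fraction as its weight."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(values[cell] * w for cell, w in weights.items()) / total_weight


# Hypothetical air temperatures (K) for three grid cells.
airtemp = {1: 270.0, 2: 272.0, 3: 268.0}
# Hypothetical fractions of one catchment's area covered by each cell.
overlap = {1: 0.5, 2: 0.25, 3: 0.25}

basin_airtemp = area_weighted_mean(airtemp, overlap)
print(f"Basin-averaged air temperature: {basin_airtemp:.1f} K")
```

Dividing by the total weight keeps the result correct even when the overlap fractions do not sum exactly to one.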

In [4]:
# Initialize forcingResampler class
fr = forcingResampler(config, logger)

# Run resampling
fr.run_resampling()

2024-10-25 23:23:43,976 - INFO - Starting forcing data resampling process
2024-10-25 23:23:43,976 - INFO - Creating ERA5 shapefile
2024-10-25 23:23:43,977 - INFO - Creating ERA5 shapefile


/home/darri/data/CONFLUENCE_data/domain_Chena/forcing/raw_data/domain_Chena_ERA5_merged_200907.nc


2024-10-25 23:23:45,690 - INFO - Created 30 records
2024-10-25 23:23:45,693 - INFO - ERA5 shapefile created and saved to /home/darri/data/CONFLUENCE_data/domain_Chena/shapefiles/forcing/forcing_ERA5.shp
2024-10-25 23:23:45,694 - INFO - Starting forcing remapping process
2024-10-25 23:23:45,695 - INFO - Creating one weighted forcing file
2024-10-25 23:23:45,779 - INFO - Created 125 records
2024-10-25 23:23:45,858 - INFO - Created 30 records


EASYMORE version 0.0.4 is initiated.
sourceSHP:     ID    lat     lon       elev_m  \
0    6  65.25 -146.75   614.841658   
1    7  65.00 -146.75   478.848153   
2    8  64.75 -146.75   405.491750   
3   11  65.25 -146.50   665.868057   
4   12  65.00 -146.50   579.080089   
5   13  64.75 -146.50   382.907236   
6   16  65.25 -146.25   664.235418   
7   17  65.00 -146.25   504.349369   
8   18  64.75 -146.25   547.247863   
9   21  65.25 -146.00   737.616422   
10  22  65.00 -146.00   545.425692   
11  23  64.75 -146.00   613.711567   
12  26  65.25 -145.75   715.852561   
13  27  65.00 -145.75   697.727087   
14  28  64.75 -145.75   669.787407   
15  31  65.25 -145.50   671.115521   
16  32  65.00 -145.50   719.796766   
17  33  64.75 -145.50   770.912365   
18  36  65.25 -145.25   702.725251   
19  37  65.00 -145.25   810.155836   
20  38  64.75 -145.25   705.485242   
21  41  65.25 -145.00   785.682317   
22  42  65.00 -145.00   979.410587   
23  43  64.75 -145.00   931.779886   
24

2024-10-25 23:23:45,946 - INFO - Created 125 records
2024-10-25 23:23:45,958 - INFO - Created 30 records
2024-10-25 23:23:45,998 - INFO - Created 125 records
2024-10-25 23:23:46,048 - INFO - Created 30 records


EASYMORE detects that shapefile longitude is between -180 and 180, no correction is performed
                                             geometry  ID
0   POLYGON ((-146.875 65.125, -146.875 65.375, -1...   1
1   POLYGON ((-146.875 64.875, -146.875 65.125, -1...   2
2   POLYGON ((-146.875 64.625, -146.875 64.875, -1...   3
3   POLYGON ((-146.625 65.125, -146.625 65.375, -1...   4
4   POLYGON ((-146.625 64.875, -146.625 65.125, -1...   5
5   POLYGON ((-146.625 64.625, -146.625 64.875, -1...   6
6   POLYGON ((-146.375 65.125, -146.375 65.375, -1...   7
7   POLYGON ((-146.375 64.875, -146.375 65.125, -1...   8
8   POLYGON ((-146.375 64.625, -146.375 64.875, -1...   9
9   POLYGON ((-146.125 65.125, -146.125 65.375, -1...  10
10  POLYGON ((-146.125 64.875, -146.125 65.125, -1...  11
11  POLYGON ((-146.125 64.625, -146.125 64.875, -1...  12
12  POLYGON ((-145.875 65.125, -145.875 65.375, -1...  13
13  POLYGON ((-145.875 64.875, -145.875 65.125, -1...  14
14  POLYGON ((-145.875 64.625, -145.

  shp_int.to_file(self.temp_dir+self.case_name+'_intersected_shapefile.shp') # save the intersected files
  ogr_write(

------REMAPPING------
netcdf output file will be compressed at level 4
Remapping /home/darri/data/CONFLUENCE_data/domain_Chena/forcing/raw_data/domain_Chena_ERA5_merged_200901.nc to /home/darri/data/CONFLUENCE_data/domain_Chena/forcing/basin_averaged_data/Chena_ERA5_remapped_2009-01-01-00-00-00.nc
Started at date and time 2024-10-25 23:23:46.360541


2024-10-25 23:24:05,913 - INFO - Creating all weighted forcing files


Ended   at date and time 2024-10-25 23:24:05.861269
------
EASYMORE version 0.0.4 is initiated.
No temporary folder is provided for EASYMORE; this will result in EASYMORE saving the files in the same directory as python script
EASYMORE is given multiple varibales to be remapped but only on format and fill valueEASYMORE repeat the format and fill value for all the variables in output files
remap file is provided; EASYMORE will use this file and skip calculation of remapping
EASYMORE will remap variable  airpres  from source file to variable  airpres  in remapped NeCDF file
EASYMORE will remap variable  LWRadAtm  from source file to variable  LWRadAtm  in remapped NeCDF file
EASYMORE will remap variable  SWRadAtm  from source file to variable  SWRadAtm  in remapped NeCDF file
EASYMORE will remap variable  pptrate  from source file to variable  pptrate  in remapped NeCDF file
EASYMORE will remap variable  airtemp  from source file to variable  airtemp  in remapped NeCDF file
EASYMORE will

KeyboardInterrupt: 

## 2. Preprocess geospatial data

Now let's calculate the zonal statistics of the geospatial attributes we need for our model.

In [None]:
# Initialize geospatialStatistics class
gs = geospatialStatistics(config, logger)

# Run zonal statistics
gs.run_statistics()
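
Zonal statistics summarize the raster cells that fall inside each zone (here, each catchment). As a rough illustration of the idea only, the sketch below computes a per-zone mean from two small grids in plain Python. `zonal_mean`, the elevation values, and the zone labels are hypothetical and do not reflect how `geospatialStatistics` is implemented.

```python
from collections import defaultdict
from typing import Dict, List


def zonal_mean(raster: List[List[float]], zones: List[List[int]]) -> Dict[int, float]:
    """Return the mean raster value for every zone ID found in `zones`."""
    sums: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    # Walk both grids in lockstep so each value is paired with its zone ID.
    for raster_row, zone_row in zip(raster, zones):
        for value, zone_id in zip(raster_row, zone_row):
            sums[zone_id] += value
            counts[zone_id] += 1
    return {zone_id: sums[zone_id] / counts[zone_id] for zone_id in sums}


# Hypothetical elevation raster (m) and a matching grid of catchment IDs.
elevation = [
    [600.0, 620.0, 700.0],
    [580.0, 640.0, 720.0],
]
catchment_ids = [
    [1, 1, 2],
    [1, 2, 2],
]

zone_means = zonal_mean(elevation, catchment_ids)
print(zone_means)
```

The same pattern extends to other statistics, such as the minimum, maximum, or most frequent class, by changing what is accumulated per zone.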