# Model-Agnostic Input Data Preprocessing in CONFLUENCE

## Introduction

This notebook focuses on the model-agnostic preprocessing steps for input data in CONFLUENCE. Model-agnostic preprocessing involves tasks that are common across different hydrological models, such as data acquisition, quality control, and initial formatting.

Key steps covered in this notebook include:

1. Acquiring meteorological and geospatial data for the study area
2. Standardizing data formats and units
3. Spatial resampling of data to match the model domain
4. Calculating zonal statistics for the model attributes

In this preprocessing stage we ensure that our input data is consistent, complete, and properly formatted before we move on to model-specific preprocessing steps. By the end of this notebook, you will have clean, standardized datasets ready for further model-specific processing.
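
To make the zonal statistics step concrete, here is a minimal sketch of a zonal mean using only the standard library. It is not the CONFLUENCE implementation: the helper `zonal_mean` and the toy grids are hypothetical, and the sketch assumes the attribute grid and the zone grid (for example, catchment IDs) already share the same shape and alignment.

```python
import math
from collections import defaultdict
from typing import Dict, List


def zonal_mean(values: List[List[float]], zones: List[List[int]]) -> Dict[int, float]:
    """Average `values` over each zone ID in `zones` (hypothetical helper).

    Assumes both grids have the same shape; NaN cells are skipped.
    """
    totals: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for value_row, zone_row in zip(values, zones):
        for value, zone_id in zip(value_row, zone_row):
            if math.isnan(value):
                continue  # skip missing cells
            totals[zone_id] += value
            counts[zone_id] += 1
    return {zone_id: totals[zone_id] / counts[zone_id] for zone_id in counts}


# Toy 3x3 elevation grid (m) and a matching zone grid with two catchments.
elevation = [
    [100.0, 110.0, 200.0],
    [120.0, float("nan"), 210.0],
    [130.0, 220.0, 230.0],
]
catchments = [
    [1, 1, 2],
    [1, 1, 2],
    [1, 2, 2],
]

result = zonal_mean(elevation, catchments)
print(result)  # {1: 115.0, 2: 215.0}
```

The missing cell in catchment 1 is excluded from both the sum and the count, so it does not bias the mean.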

## First we import the libraries and functions we need

In [1]:
import sys
import logging
from pathlib import Path
from typing import Dict, Any

import yaml  # PyYAML, needed for yaml.safe_load below

# Add the parent directory to sys.path
current_dir = Path.cwd()
parent_dir = current_dir.parent
sys.path.append(str(parent_dir))

from utils.data_utils import DataAcquisitionProcessor, DataPreProcessor # type: ignore
from utils.logging_utils import setup_logger # type: ignore

# Load configuration
config_path = parent_dir / '0_config_files' / 'config_active.yaml'
with open(config_path, 'r') as config_file:
    config = yaml.safe_load(config_file)

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
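
Before going further, it can help to confirm that the loaded configuration contains the keys this notebook reads: `CONFLUENCE_DATA_DIR`, `DOMAIN_NAME`, and `DATA_ACQUIRE`. The following is a minimal sketch of such a check; `find_missing_keys` is a hypothetical helper, not part of CONFLUENCE, and the example values are placeholders.

```python
from typing import Any, Dict, List

# Configuration keys that this notebook reads.
REQUIRED_KEYS = ("CONFLUENCE_DATA_DIR", "DOMAIN_NAME", "DATA_ACQUIRE")


def find_missing_keys(config: Dict[str, Any]) -> List[str]:
    """Return required keys that are absent or empty (hypothetical helper)."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Illustrative values only; a real config comes from config_active.yaml.
example_config = {
    "CONFLUENCE_DATA_DIR": "/path/to/data",
    "DOMAIN_NAME": "example_basin",
}

missing = find_missing_keys(example_config)
print(missing)  # ['DATA_ACQUIRE']
```

Checking up front gives a clearer message than the `TypeError` or `ValueError` that would otherwise surface later in processing.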

## Second, we define the functions we need

In [2]:
def setup_logging(log_file: str = 'input_data_processing.log') -> logging.Logger:
    """Set up and return a logger."""
    return setup_logger('input_data_processor', log_file)

def get_project_dir(config: Dict[str, Any]) -> Path:
    """Determine the project directory based on configuration."""
    data_dir = Path(config.get('CONFLUENCE_DATA_DIR'))
    domain_name = config.get('DOMAIN_NAME')
    return data_dir / f"domain_{domain_name}"

def process_hpc_data(config: Dict[str, Any], logger: logging.Logger):
    """Process data using HPC resources."""
    logger.info('Data acquisition set to HPC')
    data_acquisition = DataAcquisitionProcessor(config, logger)
    try:
        data_acquisition.run_data_acquisition()
    except Exception as e:
        logger.error(f"Error during data acquisition: {e}")
        raise

def process_supplied_data(config: Dict[str, Any], logger: logging.Logger):
    """Process user-supplied data."""
    logger.info('Model input data set to supplied by user')
    data_preprocessor = DataPreProcessor(config, logger)
    data_preprocessor.process_zonal_statistics()

def process_input_data(config: Dict[str, Any]):
    """Main function to process input data based on configuration."""
    logger = setup_logging()
    logger.info("Starting input data processing")
    
    project_dir = get_project_dir(config)
    logger.info(f"Project directory: {project_dir}")
    
    if config.get('DATA_ACQUIRE') == 'HPC':
        process_hpc_data(config, logger)
    elif config.get('DATA_ACQUIRE') == 'supplied':
        process_supplied_data(config, logger)
    else:
        logger.error(f"Invalid DATA_ACQUIRE option: {config.get('DATA_ACQUIRE')}")
        raise ValueError("DATA_ACQUIRE must be either 'HPC' or 'supplied'")
    
    logger.info("Input data processing completed")
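
As a quick illustration of how these functions interpret the configuration, the sketch below resolves the project directory and names the branch that `process_input_data` would take. The path logic is repeated from `get_project_dir` so the example runs on its own; `choose_branch` and the example values are hypothetical and exist only for this illustration.

```python
from pathlib import Path
from typing import Any, Dict


def get_project_dir(config: Dict[str, Any]) -> Path:
    """Same logic as the notebook function, repeated so the sketch runs alone."""
    return Path(config["CONFLUENCE_DATA_DIR"]) / f"domain_{config['DOMAIN_NAME']}"


def choose_branch(config: Dict[str, Any]) -> str:
    """Name the function process_input_data would call (hypothetical helper)."""
    branches = {"HPC": "process_hpc_data", "supplied": "process_supplied_data"}
    mode = config.get("DATA_ACQUIRE")
    if mode not in branches:
        raise ValueError("DATA_ACQUIRE must be either 'HPC' or 'supplied'")
    return branches[mode]


# Illustrative values only.
example_config = {
    "CONFLUENCE_DATA_DIR": "/path/to/data",
    "DOMAIN_NAME": "example_basin",
    "DATA_ACQUIRE": "supplied",
}

project_dir = get_project_dir(example_config)
branch = choose_branch(example_config)
print(project_dir, branch)
```

Note that the comparison is case-sensitive: a value such as `'hpc'` or `'Supplied'` falls through to the `ValueError` branch.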

## Lastly, run the process_input_data function

In [None]:
if __name__ == "__main__":
    # Reuse the configuration loaded from config_active.yaml in the first cell
    process_input_data(config)