### Introduction & Vision
Welcome to this **prototype notebook**, which illustrates a **foundational workflow** for **Physics-AI** modeling, using **Darcy Flow** (and an **FNO** approach) as a **concrete example**. The goals of this notebook are **multi-faceted**:

1. **Showcase a workflow** for **Physics-AI training** with **Darcy Flow** data, emphasizing how easily we can integrate **FNO** models in **NVIDIA Modulus** or similar frameworks.  
2. **Demonstrate a general-purpose AutoML approach** that systematically searches over **hyperparameters** (learning rate, channels, modes, etc.) and is designed to carry over to other PDEs and neural operators.  
3. **Preview Active Learning** as a complementary strategy to guide data acquisition based on model **uncertainty**. (In this notebook, I’ll outline how to do it; the *actual active learning code* is placed in a second notebook, [`darcy_active_learning.ipynb`](darcy_active_learning.ipynb).)

Beyond these **current technical** accomplishments, this notebook also hints at **inspirational** next steps—extending from a **single PDE** toward a broad **Physics-AI solution** that includes:

#### I. Technical (Software Engineering)
- **Ontology-based data transformations**: A structured, automated pipeline for bridging different PDE data shapes (mesh ↔ grid ↔ point cloud). It reduces manual conversions and helps unify HPC outputs (e.g., AMReX) with ML-ready arrays.  
- **Ontology engine**: A framework that **detects** dataset geometry (e.g., uniform grid vs. unstructured mesh) and picks the right operator or transformation step. This paves the way for “one-click” PDE model building, especially when integrated with AutoML.  
- **AutoML for candidate selection**: Not just hyperparameters, but also *which* neural operator (FNO, AFNO, WNO, PINNs, etc.) is best for a given domain geometry or PDE. This accelerates experimentation by automatically ranking model architectures.  
- **Advanced workflow, pipelines, and training**: Encompasses **HPC synergy** (ingesting large or partially refined HPC data), orchestrated pipelines (e.g., Kedro, Airflow), and **accelerated PDE surrogate training** (multi-GPU, distributed). Together, these let us efficiently handle time-evolving PDE snapshots and large-scale parameter sweeps in real engineering environments.
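
To make the ontology-engine idea above more concrete, here is a minimal sketch of rule-based dispatch from a dataset descriptor to an operator family. The function name `suggest_operator_family` and the rules themselves are illustrative assumptions, not the notebook's `OntologyEngine`. The descriptor keys (`geometry_type`, `uniform`) follow the data descriptor created later in this notebook.

```python
def suggest_operator_family(data_structure: dict) -> str:
    """Hypothetical rule table: map dataset geometry to an operator family."""
    geometry = data_structure.get("geometry_type")
    uniform = data_structure.get("uniform", False)

    if geometry == "grid" and uniform:
        return "spectral (e.g., FNO/AFNO)"
    if geometry == "grid":
        return "non-uniform grid (e.g., NUFNO)"
    if geometry in ("mesh", "point_cloud"):
        return "graph/mesh-based (e.g., DiffusionNet)"
    raise ValueError(f"Unsupported geometry_type: {geometry!r}")


print(suggest_operator_family({"geometry_type": "grid", "uniform": True}))
```

A real engine would likely consult a model registry rather than hard-coded branches; the point here is only that the descriptor fields are enough to drive the choice.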

#### II. Model & Physics Content
- **Models: WNO, NUFNO, etc.**:
  Beyond standard FNO, options like **Wavelet Neural Operator (WNO)** capture sharper local features via wavelet transforms, while **Non-Uniform FNO (NUFNO)** accommodates partial refinements or semi-structured domains. These advanced architectures can improve accuracy without a complete shift to graph-based methods.

- **Customized operator designs**:
  Domain-specific enhancements—e.g., a plasma-tailored FNO or specialized boundary treatments—boost performance on PDEs with unique constraints (sharp separatrices, anisotropy). This ensures surrogates match real-world physics more precisely than generic operators.

- **Ensembles / partial refinement & local closures**:
  In HPC settings with variable mesh refinement or subgrid phenomena, a hybrid approach (e.g., FNO + DiffusionNet) can handle global PDE patterns while focusing local operators on high-resolution patches. This preserves large-scale coverage and detail where it matters most.

- **Multi-scale patterns**:
  Many PDEs combine broad wave modes with fine-edged phenomena (e.g., subgrid turbulence). Leveraging wavelet-based or ensemble architectures means each scale can be tackled effectively—ensuring no critical features get lost.

- **Multi-physics regimes**:
  Real engineering tasks often blend multiple physics (e.g., fluid–structure interaction, electromagnetic–thermal coupling). By composing or extending neural operators for each sub-physics domain, we can solve coupled PDE sets under one pipeline.

- **Physics-Informed Loss (added to current physics-informed architecture)**:
  Incorporating PDE constraints directly into the training objective ensures surrogates adhere to known physics. This is invaluable for **inverse problem solving** (where data can be sparse) and for overall stability/robustness when extrapolating to new parameter regimes.
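
As a rough illustration of what a physics-informed term would penalize, the sketch below evaluates the mean-squared residual of the steady-state Darcy equation, assumed here in the form -∇·(k∇u) = f, with a flux-form finite-difference stencil on a uniform grid. The function name `darcy_residual_mse` and the arithmetic averaging of permeability at cell faces are assumptions made for this example. It is written in plain Python for readability; in training, the same stencil would be expressed with differentiable tensor operations and added to the data loss.

```python
def darcy_residual_mse(k, u, f, h):
    """Mean-squared residual of -div(k grad u) = f on interior nodes.

    k, u, f are square nested lists indexed [i][j]; h is the grid spacing.
    Face permeabilities are arithmetic averages of neighbouring nodes.
    """
    n = len(u)
    total, count = 0.0, 0
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            k_e = 0.5 * (k[i][j] + k[i + 1][j])
            k_w = 0.5 * (k[i][j] + k[i - 1][j])
            k_n = 0.5 * (k[i][j] + k[i][j + 1])
            k_s = 0.5 * (k[i][j] + k[i][j - 1])
            div_flux = (
                k_e * (u[i + 1][j] - u[i][j]) - k_w * (u[i][j] - u[i - 1][j])
                + k_n * (u[i][j + 1] - u[i][j]) - k_s * (u[i][j] - u[i][j - 1])
            ) / (h * h)
            residual = -div_flux - f[i][j]
            total += residual ** 2
            count += 1
    return total / count


# Sanity check: with k = 1 and u = x^2 + y^2, -laplace(u) = -4 everywhere.
n, h = 9, 1.0 / 8
u = [[(i * h) ** 2 + (j * h) ** 2 for j in range(n)] for i in range(n)]
k = [[1.0] * n for _ in range(n)]
f = [[-4.0] * n for _ in range(n)]
print(darcy_residual_mse(k, u, f, h))  # zero up to round-off
```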

#### III. Downstream Applications
- **Inverse Problem Solving**:
  Quickly invert PDE relationships to find **which input conditions** yield a desired output (e.g., *a permeability field that reproduces a target pressure field*). Because each surrogate evaluation is far cheaper than a full solve, this can shorten design cycles considerably compared to iterative HPC solves.

- **Optimization**:
  Plug surrogates into parametric optimization loops (shape optimization, operational parameter tuning). The surrogate’s fast inference replaces expensive HPC calls at each iteration, speeding up design exploration.

- **Deployment, HPC workflows, and MLOps**:
  Once trained, the model can be **deployed** alongside HPC codes to provide fast PDE updates for process control or monitoring. MLOps features (monitoring, versioning) support reliability, traceability, and straightforward model updates on production or research HPC clusters.

I’m calling it a **“kick-off”** project because, even though it’s built around Darcy Flow and FNO, the underlying design can **readily scale**—both in terms of PDE complexity (multi-scale turbulence, advanced HPC data) and in terms of **workflow** (AutoML, HPC integration, interactive active learning, etc.). By adopting these modular components, we set the stage for a future in which **Physics-AI** model development becomes more automated, adaptable, and robust—serving a wide range of scientific and engineering challenges.

## Table of Contents
1. [00_Generate_Data](#00_generate_data)  
   - [00_01 Darcy Flow Simulation + Data Descriptor Creation](#00_01-darcy-flow-simulation--data-descriptor-creation)  
     - Demonstrates **generating synthetic Darcy flow data** (via `Darcy2D`) and creating a **data descriptor**. This lays the groundwork for PDE data ingestion, transformations, and future AutoML usage.

2. [01_Build_Surrogate_Model](#01_build_surrogate_model)  
   - [01_00 AutoMLCandidateModelSelection](#01_00-automl-candidate-model-selection)
     - **Motivation**: Showcases how we determine which PDE surrogate model(s) might work best given a dataset descriptor.  
     - **01_00_01 Data Descriptor & Model Registry Initialization**  
       - Loads the generated data descriptor, creates or loads a `ModelRegistry` with metadata for multiple models (FNO, AFNO, DiffusionNet, etc.).  
     - **01_00_02 Candidate Model Selection & Validation**  
       - Validates descriptor, applies selection logic (e.g., simple rules or advanced ranking), and saves the selected model(s) for downstream pipeline steps.

   - [01_01 Data Loading and (Optional) Data Transformation](#01_01-data-loading-and-optional-data-transformation)  
     - Covers loading raw `.pt` files or synthetic data, plus transformations like normalization or boundary labeling.  
       - [01_01_01 LoadRawData](#01_01_01-loadrawdata)  
         - Shows how `.pt` data is read, with minimal Exploratory Data Analysis (EDA).  
       - [01_01_03 TransformRawData](#01_01_03-transformrawdata)  
         - Applies any coordinate expansions, normalization, or shape fixes.  
       - [01_01_04 Preprocessing](#01_01_04-preprocessing)  
         - Optional steps for data quality checks or outlier removal.  
       - [01_01_05 FeaturePreparation](#01_01_05-featurepreparation)  
         - Final feature engineering, e.g., boundary-channel additions.

   - [01_02 Model Definition](#01_02-model-definition)  
     - **Implements** the PDE surrogate networks (e.g., `FNOWithDropout`, AFNO). Explains class architecture and relevant config fields.

   - [01_03 Model Factory](#01_03-model-factory)  
     - Demonstrates a single function `get_model(cfg)` that returns a chosen operator based on `model_name` in the config.

   - [01_04 Configuring Hyperparameters](#01_04-configuring-hyperparameters)  
     - Discusses reading or overriding hyperparams (Fourier modes, widths, learning rate, etc.) from `config.yaml`. Also references HPC or local usage.

   - [01_05 Model Training Loop](#01_05-model-training-loop)  
     - Outlines the core training logic: optimizer, loss, epoch iteration, logging (potentially with MLflow).

   - [01_06 Model Training Execution](#01_06-model-training-execution)  
     - **Brings it together**: builds a `model`, obtains a `dataloader`, and runs the main training loop. For example:
       ```python
       model = get_model(cfg)
       train_loader = get_darcy_data_loader(cfg)
       final_val_loss = run_modulus_training_loop(cfg, model, train_loader)
       ```
     - Presents how we might do single-run or multi-model iteration.

   - [01_07 AutoML and Hyperparameter Tuning](#01_07-automl-and-hyperparameter-tuning)  
     - Demonstrates **Optuna** or similar libraries for PDE hyperparameter search (e.g., `modes`, `width`, `depth`, `lr`). Also covers multi-model tuning (FNO vs. AFNO).

   - [01_08 Visualizing Performance and Results](#01_08-visualizing-performance-and-results)  
     - Shows how to **plot** training/validation curves or produce PDE field comparisons (predicted vs. ground truth). Possibly lists best trials from AutoML.

3. [Offline Active Learning (Short Overview)](#offline-active-learning-short-overview)
   - **Note**: Active Learning steps (MC-Dropout for uncertain PDE samples) are covered in a separate notebook. We only summarize here.  
   - [Active Learning Notebook](darcy_active_learning.ipynb) — The second file demonstrates:
     1. Loading a **dropout-enabled** operator,
     2. Running multiple forward passes for uncertainty,
     3. Selecting top-K uncertain PDE inputs,
     4. (Optionally) saving them for partial retraining or HPC PDE solves.
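
The selection step in that workflow (multiple stochastic passes, then top-K by uncertainty) can be sketched without any deep-learning dependency. The function `select_top_k_uncertain` and the toy numbers are assumptions made for illustration; the companion notebook may score uncertainty differently.

```python
from statistics import pvariance


def select_top_k_uncertain(mc_predictions, k):
    """Rank samples by mean per-pixel variance across stochastic passes.

    mc_predictions[s][t] is the flattened prediction for sample s on
    forward pass t (dropout active). Returns indices of the k most
    uncertain samples, most uncertain first.
    """
    scores = []
    for sample_idx, passes in enumerate(mc_predictions):
        per_pixel_var = [pvariance(values) for values in zip(*passes)]
        scores.append((sum(per_pixel_var) / len(per_pixel_var), sample_idx))
    scores.sort(reverse=True)
    return [idx for _, idx in scores[:k]]


# Three samples, three passes each, two "pixels" per prediction (toy numbers).
mc_predictions = [
    [[0.10, 0.20], [0.10, 0.20], [0.10, 0.20]],  # passes agree
    [[0.00, 0.50], [0.40, 0.10], [0.20, 0.90]],  # passes disagree strongly
    [[0.30, 0.30], [0.32, 0.28], [0.31, 0.29]],  # mild disagreement
]
print(select_top_k_uncertain(mc_predictions, k=2))  # [1, 2]
```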

> *If you only need to see the AL approach, jump directly to [Active Learning Notebook](darcy_active_learning.ipynb).* This first notebook focuses on data generation, model building, and AutoML. 

In [54]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [55]:
import logging
logging.basicConfig(level=logging.WARNING) # logging.basicConfig(level=logging.DEBUG)

In [56]:
import os
from omegaconf import OmegaConf
from darcy_automl_active_learning.path_utils import get_paths

repo_root_path, darcy_project_root_path, config_file_path, data_dir_path, results_dir_path = get_paths()

### 00_01 Darcy Flow Simulation & Data Descriptor Creation

In this section, we will **generate synthetic Darcy flow data** using `Darcy2D` (part of the NVIDIA Modulus datapipe utilities) and **document** the resulting dataset via a standardized **data descriptor**. The data descriptor follows our **physics-AI data taxonomy**, ensuring we capture crucial fields like dimensionality, geometry type, uniformity, and so forth.

1. **Load Configuration**  
   - We leverage a `config.yaml` that provides hyperparameters for data generation: resolution, batch size, normalizer values, etc. If the file is missing, the cell falls back to built-in defaults.

2. **Generate Data**  
   - We create multiple `.pt` files containing the Darcy field samples (`permeability` and `darcy`), using the `Darcy2D` datapipe for 2D PDE generation.  
   - These files are placed in the `data/00_Generate_Data/` folder.

3. **Create a Data Descriptor**  
   - After data generation, we write out a JSON file (e.g., `data_desc.json`) describing the dataset’s structure. This descriptor conforms to our **Core Taxonomy & Ontology** for PDE data. For a 2D uniform grid, we specify fields like `"dimension": 2`, `"geometry_type": "grid"`, `"uniform": true`, and the channel layout.

In [57]:
import os
import json
import torch
from omegaconf import OmegaConf

# If you have Modulus installed:
try:
    from modulus.datapipes.benchmarks.darcy import Darcy2D
except ImportError:
    Darcy2D = None
    print("[Warning] 'modulus.datapipes.benchmarks.darcy' not found. Please ensure NVIDIA Modulus is installed.")

SKIP_EXISTING = True  # True -> won't overwrite existing .pt files

# 1) Load config from 'config_file' (path to your config.yaml)
if not os.path.exists(config_file_path):
    print(f"[GenerateData] config.yaml not found at {config_file_path}; using fallback defaults.")
    cfg = OmegaConf.create({
        "normaliser": {
            "permeability": {"mean": 1.0, "std_dev": 0.5},
            "darcy":       {"mean": 0.1, "std_dev": 0.05}
        },
        "training": {
            "resolution": 64,
            "batch_size": 4,
            "max_pseudo_epochs": 10,
            "pseudo_epoch_sample_size": 512
        }
    })
else:
    cfg = OmegaConf.load(config_file_path)

# 2) Construct normaliser info if config.yaml has mean/std for 'permeability' / 'darcy'
norm_cfg = cfg.normaliser
normaliser = {
    "permeability": (norm_cfg.permeability.mean, norm_cfg.permeability.std_dev),
    "darcy":       (norm_cfg.darcy.mean,       norm_cfg.darcy.std_dev),
}

# 3) Prepare an output directory under data_dir
data_path = os.path.join(data_dir_path, "00_Generate_Data")
os.makedirs(data_path, exist_ok=True)

use_cuda_if_avail = True
device = "cuda" if (use_cuda_if_avail and torch.cuda.is_available()) else "cpu"

# 4) Try to instantiate Darcy2D (if Modulus is installed)
if Darcy2D is None:
    print("[GenerateData] Darcy2D not available. Skipping PDE generation or using fallback.")
else:
    print("[GenerateData] Initializing Darcy2D data loader...")
    dataloader = Darcy2D(
        resolution=cfg.training.resolution,
        batch_size=cfg.training.batch_size,
        normaliser=normaliser,
        device=device
    )

    # 5) Determine how many batches to save
    num_batches_to_save = cfg.training.batch_size * 5
    print(f"[GenerateData] Generating and saving {num_batches_to_save} batch files to '{data_path}' ...")

    data_iter = iter(dataloader)  # build the iterator once instead of once per batch
    for i in range(num_batches_to_save):
        save_path = os.path.join(data_path, f"darcy_batch_{i}.pt")
        if SKIP_EXISTING and os.path.exists(save_path):
            print(f"Skipping batch {i} -> file {save_path} already exists (SKIP_EXISTING=True).")
            continue

        batch = next(data_iter)
        torch.save(batch, save_path)
        perm_shape  = batch["permeability"].shape
        darcy_shape = batch["darcy"].shape
        print(f"Saved batch {i} -> permeability {perm_shape}, darcy {darcy_shape}")

    print("[GenerateData] Data generation complete!")

# 6) Create a comprehensive data descriptor (2D uniform grid, 2 channels, etc.)
data_descriptor = {
    "descriptor_name": "Darcy2D_Uniform_2Ch",
    "data_structure": {
        "dimension": 2,
        "geometry_type": "grid",
        "uniform": True,
        "representation": {
            "array_layout": "[N, H, W, C]",
            "coordinate_mapping": "implicit uniform",
            "coordinate_bounds": {
                "x_min": 0.0,
                "x_max": 1.0,
                "y_min": 0.0,
                "y_max": 1.0
            }
        },
        "is_transient": False,
        "boundary": False,
        "boundary_info": None,
        "cell_type": None,
        "decimation": False,
        "decimation_level": None,
        "channels": 2,
        "time_steps": None,
        "adjacency": False
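    },
    "metadata": {
        "pde_type": "Darcy",
        # Total samples = number of saved files (batch_size * 5) x samples per file (batch_size)
        "num_samples": (cfg.training.batch_size * 5) * cfg.training.batch_size,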
        "description": (
            "2D Darcy dataset with 2 channels (e.g. 'permeability' and 'darcy'), "
            "generated via Modulus Darcy2D with uniform grid."
        ),
        "source_script": "00_Generate_Data.ipynb",
        "file_pattern": "darcy_batch_*.pt",
        "notes": "Each .pt file stores a dict with keys 'permeability' & 'darcy'."
    }
}

# 7) Save the descriptor to JSON in the same directory
data_desc_path = os.path.join(data_path, "data_desc.json")
with open(data_desc_path, "w") as f:
    json.dump(data_descriptor, f, indent=2)

print(f"[GenerateData] Wrote data descriptor to: {data_desc_path}")

[GenerateData] Initializing Darcy2D data loader...
[GenerateData] Generating and saving 20 batch files to 'examples/cfd/darcy_autoML_active_learning/data/00_Generate_Data' ...
Module modulus.datapipes.benchmarks.kernels.initialization 0f5b36a load on device 'cuda:0' took 48.97 ms  (cached)
Module modulus.datapipes.benchmarks.kernels.utils 27df179 load on device 'cuda:0' took 301.55 ms  (cached)
Module modulus.datapipes.benchmarks.kernels.finite_difference d7632e6 load on device 'cuda:0' took 154.55 ms  (cached)


### 01_00 AutoML Candidate Model Selection

Machine learning pipelines often involve a **model selection** phase, where we decide which model or models best fit a given dataset based on **data compatibility**, **user constraints**, and **desired outcomes** (e.g., accuracy vs. speed). In PDE-based workflows, this can become **challenging** because each **neural operator** or **graph-based surrogate** may expect different **geometry types** (mesh vs. grid), different channels or dimensionality, and so on.  

In this section, we demonstrate a **simple** version of an **AutoML** or **candidate model selection** pipeline. We show how to:

1. **Load** the data descriptor generated from the previous step (`00_Generate_Data`).  
2. **Initialize** a `ModelRegistry` containing metadata for our seven candidate models (e.g., FNO, AFNO, DiffusionNet).  
3. **Create** a `CandidateModelSelector` or logic component that can pick which model(s) to recommend.  
4. **Validate** that the dataset’s descriptor is coherent and meets minimal requirements.  
5. **Select** candidate models (for demonstration, we pick FNO or whichever is compatible).  
6. **Retrieve** the target data structure each candidate model expects.  
7. **Save** the results—so future steps can build on them for data transformations, training pipelines, or hyperparameter tuning.

### 01_00_01 Data Descriptor & Model Registry Initialization

In this **first subsection**, we perform the **setup** needed to run the model selection routine:

- **Load** the data descriptor from `data/00_Generate_Data/data_desc.json`, confirming it has the required **fields** (e.g., `dimension`, `geometry_type`, etc.).  
- **Instantiate** our `ModelRegistry`, which knows about each candidate model’s **“accepted formats”**—like “2D uniform grid” for FNO or “3D unstructured mesh” for DiffusionNet. This registry can be extended later to incorporate **hyperparameters**, **training code references**, or **performance** metadata.  
- (Optionally) **Preview** or **print** the loaded descriptor to ensure correctness.
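
The registry's matching logic is not shown in this notebook, so the following is only a sketch of how an "accepted formats" check might work. It uses the requirement layout that the next cell prints for `FNOWithDropout` (`dimension`, `geometry_type`, `representations` with `uniform`, `channels_min`, `channels_max`). The helper name `is_compatible` is hypothetical.

```python
def is_compatible(data_structure: dict, requirement: dict) -> bool:
    """Hypothetical check of a descriptor against one accepted-format entry."""
    if data_structure["dimension"] not in requirement["dimension"]:
        return False
    if data_structure["geometry_type"] not in requirement["geometry_type"]:
        return False
    channels = data_structure["channels"]
    for rep in requirement["representations"]:
        if rep["uniform"] != data_structure["uniform"]:
            continue
        if channels < rep["channels_min"]:
            continue
        if rep["channels_max"] is not None and channels > rep["channels_max"]:
            continue
        return True  # at least one representation accepts this dataset
    return False


darcy = {"dimension": 2, "geometry_type": "grid", "uniform": True, "channels": 2}
fno_like = {
    "dimension": [2, 3],
    "geometry_type": ["grid"],
    "representations": [
        {"uniform": True, "channels_min": 1, "channels_max": None}
    ],
}
print(is_compatible(darcy, fno_like))  # True
```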


In [51]:
import os
import json

from darcy_automl_active_learning.model_registry.model_registry import ModelRegistry
from darcy_automl_active_learning.model_selection.candidate_selector import CandidateModelSelector
from darcy_automl_active_learning.model_selection.selection_strategies import SimpleSelectionStrategy
# If you have a data descriptor utility:
# from darcy_automl_active_learning.data_descriptor.data_descriptor_utils import load_data_descriptor

# 1) Identify the path to the data descriptor
data_desc_path = f"{data_dir_path}/00_Generate_Data/data_desc.json"
print(f"Loading data descriptor from '{data_desc_path}'...")

# Optionally, load the descriptor directly if you want to do some pre-check:
with open(data_desc_path, "r") as f:
    data_desc = json.load(f)
print("[AutoML] Data descriptor loaded successfully. Domain Data Structures:")
print(f"  - 'descriptor_name': {data_desc.get('descriptor_name')}")

# 2) Initialize the ModelRegistry (defaults to the seven built-in candidate models)
registry = ModelRegistry()
print("[AutoML] ModelRegistry initialized. Available models:")
for m in registry.list_models():
    print(f"  - {m}")

# 3) Create our CandidateModelSelector using a strategy
#    Here, we use a "SimpleSelectionStrategy" for demonstration; in the run shown here it selects FNOWithDropout and AFNO.
strategy = SimpleSelectionStrategy()
candidate_selector = CandidateModelSelector(
    model_registry=registry,
    selection_strategy=strategy
)

# 4) Validate the data descriptor
#    (If you have a custom load/validate function, call it here.
#     Or you can rely on candidate_selector internally.)
is_valid = candidate_selector.validate_data_descriptor(data_desc_path)
if not is_valid:
    raise ValueError("[AutoML] Data descriptor validation failed. Please fix the descriptor.")

print("[AutoML] Data descriptor is valid.")

# 5) Perform candidate model selection
selected_candidates = candidate_selector.automl_candidate_model_selection(data_desc_path)
print("[AutoML] Selected candidates:", selected_candidates)

# 6) For each selected model, retrieve the required data structure
for (model_name, candidate_key) in selected_candidates:
    data_struct = candidate_selector.get_required_data_structure(model_name)
    print(f"[AutoML] For model '{model_name}' (key='{candidate_key}'), "
          f"the required data structure is:\n{data_struct}")

# 7) Save the chosen candidates to JSON so that later steps know which model(s) we plan to train
output_folder = f"{data_dir_path}/01_00_AutoMLCandidateModelSelection"
os.makedirs(output_folder, exist_ok=True)
save_path = candidate_selector.save_candidate_models(selected_candidates, output_folder)
print(f"[AutoMLCandidateModelSelection] Results saved to: {save_path}")

Loading data descriptor from 'examples/cfd/darcy_autoML_active_learning/data/00_Generate_Data/data_desc.json'...
[AutoML] Data descriptor loaded successfully. Domain Data Structures:
  - 'descriptor_name': Darcy2D_Uniform_2Ch
[AutoML] ModelRegistry initialized. Available models:
  - AFNO
  - DiffusionNet
  - FNOWithDropout
  - FNO
  - GraphCast
  - NuFNO
  - WNO
[AutoML] Data descriptor is valid.
[AutoML] Selected candidates: [('FNOWithDropout', 'candidate0'), ('AFNO', 'candidate1')]
[AutoML] For model 'FNOWithDropout' (key='candidate0'), the required data structure is:
[{'dimension': [2, 3], 'geometry_type': ['grid'], 'representations': [{'representation_name': 'uniform_grid', 'uniform': True, 'is_voxel_grid': False, 'is_transient_supported': False, 'channels_min': 1, 'channels_max': None, 'boundary_required': False, 'mesh_type': None, 'notes': 'Same shape requirements as vanilla FNO (e.g. [N, C, H, W] for 2D).'}]}]
[AutoML] For model 'AFNO' (key='candidate1'), the required data struc

### 01_00_02 Candidate Model Selection & Validation

In this **second subsection**, we complete the **selection** routine. Steps 1 and 2 already ran in the previous cell; the cell below reloads their saved results and carries out steps 3 and 4:

1. **Validate** the data descriptor, ensuring it meets minimal fields (e.g. `dimension`, `geometry_type`) and that it’s consistent with the PDE problem.
2. **Apply** the selection logic, which might be as simple as “Pick FNO if the data is 2D–3D uniform,” or as complex as an **algorithm** that ranks models by performance, speed, or hyperparameter constraints.
3. For each **recommended** model, fetch the **target data structure** (e.g. `{"dimension": [2,3], "uniform": true, ...}`) so we know what transformations might be needed. The `OntologyEngine` compares it with the source descriptor and proposes a transformation plan.
4. **Save** the chosen candidates and their transformation plans (and any relevant metadata) so we can retrieve them later when performing data transformations, training, or hyperparameter tuning.
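
To illustrate step 3, here is a minimal sketch of comparing a source descriptor with a model's target data structure to decide which operation is needed. The function `needed_transformations` is hypothetical and far simpler than whatever `OntologyEngine.suggest_transformations` does internally. The operation names `copy_only` and `TRANSFORM_MESH_TO_GRID` are taken from the plan output and the transformation list elsewhere in this notebook.

```python
def needed_transformations(source: dict, target: dict) -> list:
    """Hypothetical diff between a source descriptor and one model requirement."""
    accepted = target["geometry_type"]
    if source["geometry_type"] in accepted:
        # Geometry already matches: files only need to be copied forward.
        return ["copy_only"]
    if source["geometry_type"] == "mesh" and "grid" in accepted:
        return ["TRANSFORM_MESH_TO_GRID"]
    raise ValueError(
        f"No known transformation from {source['geometry_type']!r} to {accepted}"
    )


source = {"dimension": 2, "geometry_type": "grid", "uniform": True}
target = {"dimension": [2, 3], "geometry_type": ["grid"]}
print(needed_transformations(source, target))  # ['copy_only']
```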

In [52]:
import os
import json

from darcy_automl_active_learning.model_selection.candidate_selector import CandidateModelSelector
from darcy_automl_active_learning.ontology.ontology_engine import OntologyEngine

########################
# 0) READ INPUTS
########################

# Load the chosen candidates from the previous step (01_00 AutoMLCandidateModelSelection).
candidates_path = os.path.join(data_dir_path, "01_00_AutoMLCandidateModelSelection", "chosen_candidates.json")
with open(candidates_path, "r") as f:
    selected_candidates = json.load(f)
# Example: selected_candidates = [["FNO", "candidate0"], ["DiffusionNet", "candidate1"], ...]

# Load the data descriptor (the "source" data structure).
data_desc_path = os.path.join(data_dir_path, "00_Generate_Data", "data_desc.json")
with open(data_desc_path, "r") as f:
    data_desc = json.load(f)

########################
# 1) INITIALIZE CLASSES
########################
ontology_engine = OntologyEngine()
candidate_selector = CandidateModelSelector(
    model_registry=registry,
    selection_strategy=strategy
)

########################
# 2) GENERATE TRANSFORMATION PLANS
########################

all_candidates_plans = {}

for (model_name, candidate_key) in selected_candidates:
    # Query the model registry/selector to get the "desired/required data structure"
    target_data_struct = candidate_selector.get_required_data_structure(model_name)
    
    # Ask the ontology engine what transformations are needed
    transformation_plan = ontology_engine.suggest_transformations(
        source_data_desc=data_desc["data_structure"],
        target_data_requirements=target_data_struct,
        model_name=model_name,
        candidate_key=candidate_key,
        data_dir_path=data_dir_path
    )

    print(f"[Candidate: {candidate_key}] Model: {model_name}")
    print(" - Required data structure:", target_data_struct)
    print(" - Proposed transformation plan:", transformation_plan, "\n")
    
    # Store the plan in a dictionary for future steps
    all_candidates_plans[candidate_key] = {
        "model_name": model_name,
        "plan": transformation_plan
    }

# Optionally, save these plans to JSON so other notebook sections can load them
plans_output_path = os.path.join(data_dir_path, "01_01_DataTransformationPlan", "transformation_plans.json")
os.makedirs(os.path.dirname(plans_output_path), exist_ok=True)

with open(plans_output_path, "w") as f:
    json.dump(all_candidates_plans, f, indent=2)

print(f"[Info] Transformation plans saved to {plans_output_path}.")

[Candidate: candidate0] Model: FNOWithDropout
 - Required data structure: [{'dimension': [2, 3], 'geometry_type': ['grid'], 'representations': [{'representation_name': 'uniform_grid', 'uniform': True, 'is_voxel_grid': False, 'is_transient_supported': False, 'channels_min': 1, 'channels_max': None, 'boundary_required': False, 'mesh_type': None, 'notes': 'Same shape requirements as vanilla FNO (e.g. [N, C, H, W] for 2D).'}]}]
 - Proposed transformation plan: {'model_name': 'FNOWithDropout', 'stages': [{'stage_name': '01_01_01_LoadRawData', 'transform_ops': [{'method': 'copy_only', 'params': {'source_folder': '00_Generate_Data', 'dest_folder': '01_01_LoadRawData', 'subfolder_source': 'candidate0', 'subfolder_dest': 'candidate0', 'data_dir_path': 'examples/cfd/darcy_autoML_active_learning/data'}}]}, {'stage_name': '01_01_03_TransformRawData', 'transform_ops': [{'method': 'copy_only', 'params': {'source_folder': '01_01_LoadRawData', 'dest_folder': '01_01_03_TransformRawData', 'subfolder_sou

### 01_01 Data Loading and (Optional) Data Transformation


#### 01_01_01 LoadRawData

In this subsection, we handle the **initial import** of raw `.pt` files generated during our **“00_Generate_Data”** step. These files typically contain PDE fields (e.g., **`permeability`**, **`darcy`**) that we’ll eventually feed into one or more candidate models. Specifically, we aim to:

1. **Copy** the `.pt` files from `data/00_Generate_Data/` into a new folder, `data/01_01_LoadRawData/`.  
2. **Preserve** the existing data descriptor (`data_desc.json`), ensuring we maintain consistent metadata on dimensions, geometry types, channels, etc.  
3. Optionally perform minimal **exploratory data analysis (EDA)**—for instance, loading a sample `.pt` file and checking array shapes or key names.

Why a separate **LoadRawData** step? By isolating this phase, we keep our pipeline **modular**: each section (loading, transforming, preprocessing, feature engineering) has its own folder and minimal concerns. This structure scales well to more complex PDE workflows or HPC environments, where multiple transformations or domain-specific checks might be added later.

We’ll also demonstrate how we can integrate **transformation plans**—particularly the “01_01_01_LoadRawData” stage from our `transformation_plans.json`—to orchestrate these copy/EDA operations consistently. In a real production setting, you might expand this step to include **additional** data integrity checks (e.g., verifying file counts, ensuring no missing `.pt` or descriptor file), or HPC scheduling logic. For now, our **copy** operation and light EDA illustrate how to set a clear foundation for downstream tasks.
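
The integrity checks mentioned above (file counts, presence of the descriptor) could look roughly like the sketch below. The helper `check_raw_data_folder` is an assumption made for illustration and is not part of the notebook's package. The file names `data_desc.json` and `darcy_batch_*.pt` match those written by the data-generation step.

```python
import glob
import os
import tempfile


def check_raw_data_folder(folder, expected_files=None, pattern="darcy_batch_*.pt"):
    """Return a list of problems found in a raw-data folder (empty if none)."""
    problems = []
    if not os.path.isfile(os.path.join(folder, "data_desc.json")):
        problems.append("missing data_desc.json")
    n_files = len(glob.glob(os.path.join(folder, pattern)))
    if n_files == 0:
        problems.append(f"no files matching {pattern}")
    elif expected_files is not None and n_files != expected_files:
        problems.append(f"expected {expected_files} files, found {n_files}")
    return problems


# Demo on a throwaway folder with two empty placeholder files and no descriptor.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(2):
        open(os.path.join(tmp, f"darcy_batch_{i}.pt"), "wb").close()
    print(check_raw_data_folder(tmp, expected_files=2))  # ['missing data_desc.json']
```

Returning a list of problems rather than raising on the first one lets a pipeline log every issue for a candidate folder in a single pass.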

In [53]:
import os
import json
from darcy_automl_active_learning.ontology.ontology_transformation_engine import OntologyTransformationEngine

# 1) Path to the transformation plans (produced by OntologyEngine in an earlier step).
plans_json_path = os.path.join(data_dir_path, "01_01_DataTransformationPlan", "transformation_plans.json")

if not os.path.exists(plans_json_path):
    raise FileNotFoundError(f"[LoadRawData] Cannot find transformation plans at {plans_json_path}")

# 2) Load the transformation plans
with open(plans_json_path, "r") as f:
    all_candidates_plans = json.load(f)

print(f"[LoadRawData] Loaded transformation plans from {plans_json_path}.")
print("Candidate keys found:", list(all_candidates_plans.keys()))

# Example structure of all_candidates_plans (dictionary):
# {
#   "candidate0": { "model_name": "FNOWithDropout", "plan": {...} },
#   "candidate1": { "model_name": "AFNO",          "plan": {...} },
#    ...
# }

# 3) Instantiate our transformation engine
trans_engine = OntologyTransformationEngine()

# 4) We'll iterate through each candidate, find the stage "01_01_01_LoadRawData",
#    and execute its transform_ops in order.
for candidate_key, plan_info in all_candidates_plans.items():
    # plan_info might look like:
    # {
    #   "model_name": "FNOWithDropout",
    #   "plan": {
    #       "model_name": "FNOWithDropout",
    #       "stages": [
    #         { 
    #           "stage_name": "01_01_01_LoadRawData",
    #           "transform_ops": [ { "method": "copy_only", "params": {...} }, ... ]
    #         },
    #         ...
    #       ]
    #   }
    # }
    model_name = plan_info["model_name"]
    plan_dict  = plan_info["plan"]

    print(f"\n[LoadRawData] Processing candidate '{candidate_key}' for model '{model_name}'")

    # 4a) Retrieve the "stages" from plan_dict
    stages = plan_dict.get("stages", [])
    # 4b) Filter for the stage we want -> "01_01_01_LoadRawData"
    loadraw_stage = next(
        (st for st in stages if st.get("stage_name") == "01_01_01_LoadRawData"),
        None
    )

    if loadraw_stage is None:
        print(f"    -> No '01_01_01_LoadRawData' stage found for candidate '{candidate_key}'. Skipping.")
        continue

    # 4c) Execute each transform_op
    transform_ops = loadraw_stage.get("transform_ops", [])
    for op in transform_ops:
        method_name = op["method"]
        params      = op["params"]

        print(f"    -> Invoking '{method_name}' with params: {params}")
        # We'll dispatch to the transformation engine methods
        if hasattr(trans_engine, method_name):
            method = getattr(trans_engine, method_name)
            method(**params)  # e.g. copy_only(source_folder, dest_folder)
        else:
            print(f"    -> [Warning] Transformation method '{method_name}' not found. Skipped.")

print("\n[LoadRawData] All candidates processed for stage '01_01_01_LoadRawData'.")

[LoadRawData] Loaded transformation plans from examples/cfd/darcy_autoML_active_learning/data/01_01_DataTransformationPlan/transformation_plans.json.
Candidate keys found: ['candidate0', 'candidate1']

[LoadRawData] Processing candidate 'candidate0' for model 'FNOWithDropout'
    -> Invoking 'copy_only' with params: {'source_folder': '00_Generate_Data', 'dest_folder': '01_01_LoadRawData', 'subfolder_source': 'candidate0', 'subfolder_dest': 'candidate0', 'data_dir_path': 'examples/cfd/darcy_autoML_active_learning/data'}
[OntologyTransformationEngine] COPY_ONLY done: examples/cfd/darcy_autoML_active_learning/data/00_Generate_Data/candidate0 -> examples/cfd/darcy_autoML_active_learning/data/01_01_LoadRawData/candidate0

[LoadRawData] Processing candidate 'candidate1' for model 'AFNO'
    -> Invoking 'copy_only' with params: {'source_folder': '00_Generate_Data', 'dest_folder': '01_01_LoadRawData', 'subfolder_source': 'candidate1', 'subfolder_dest': 'candidate1', 'data_dir_path': 'examples/

#### 01_01_03 TransformRawData

In this section, we apply (or simulate) **data transformations** needed by each **candidate model**. Recall that the previous step selected one or more target architectures (e.g., `"FNO"`, `"AFNO"`, `"DiffusionNet"`) and assigned them labels (`"candidate0"`, `"candidate1"`, etc.). Here, each candidate’s data flows from **`01_01_LoadRawData`** into a **new** subfolder—for instance, `data/01_01_03_TransformRawData/candidate0/`.

1. **Copy** the `.pt` files from the “LoadRawData” folder.  
2. **Transform** them per model requirements (if necessary).

> **Why Transform?**  
> Different models impose distinct constraints on the dataset’s **geometry**, **resolution**, or **channels**. From an HPC or PDE perspective, transformations ensure the raw data aligns with each model’s assumptions—such as **spectral operators** needing uniform spacing or **mesh-based operators** expecting unstructured vertex/face data.

> **Possible Transformations** might include:
> - **TRANSFORM_MESH_TO_GRID**:  
>   Convert unstructured mesh data to a uniform grid—necessary if you want to feed a domain with irregular elements into a **Fourier** or **wavelet** operator that performs global transforms along regular axes. This can involve interpolation or resampling of original nodal values.  
>
> - **TRANSFORM_DECIMATE_MESH**:  
>   Downsample or reduce vertex count for large HPC-generated meshes. This is often needed if memory constraints or real-time performance requires smaller data sets. Decimation should preserve key PDE features or boundaries without losing critical geometry detail.  
>
> - **TRANSFORM_REGRID_DATA**:  
>   Change resolution from, say, 128×128 to 64×64, matching a model’s input dimension or training memory budget. This is especially relevant for **FNO/AFNO** if your PDE solver originally output very high resolution.  
>
> - **TRANSFORM_ADD_BOUNDARY_CHANNEL**:  
>   Insert an extra channel labeling boundary indices, inlet/outlet regions, or domain interfaces. Many PDE surrogates benefit from explicitly differentiating boundary conditions.  
>
> - **TRANSFORM_COORDINATE_MAPPING**:  
>   Adjust coordinate references (e.g., non-uniform → uniform) or embed extra coordinate fields (e.g., adding `(x, y)` grids as input channels). Useful for **PINNs** or operator-learning methods that rely on positional encodings.  
>
> - **TRANSFORM_NORMALIZE_TENSORS**:  
>   Scale PDE fields to a standard range or distribution (e.g., zero mean, unit variance). This can stabilize training by preventing large differences in scales across multiple PDE variables (e.g., velocity vs. pressure).  
>
> - **TRANSFORM_TIME_SUBSAMPLING** (if transient data):  
>   Select or downsample time steps from a high-frequency simulation if your surrogate only needs coarse temporal resolution.

> **Minimal or Custom**  
> Sometimes, no transformation is needed if the dataset already matches the model’s expected shape (e.g., a 2D uniform grid for FNO). In other scenarios—especially bridging drastically different data formats—transforms can be **extensive** (e.g., partial **voxelization** or complex manifold parameterization for unstructured surfaces).
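
To make two of these operations concrete, here is a minimal sketch of `TRANSFORM_REGRID_DATA` (block averaging) and `TRANSFORM_NORMALIZE_TENSORS` (zero mean, unit variance) on a toy 2D field. It uses plain Python lists so it runs without extra dependencies; the helper names `regrid_block_mean` and `normalize_field` are hypothetical and are not part of `OntologyTransformationEngine`. A real pipeline would more likely operate on the `.pt` tensors directly.

```python
from statistics import fmean, pstdev


def regrid_block_mean(field, factor):
    """Downsample a 2D grid by averaging non-overlapping factor x factor blocks."""
    rows, cols = len(field), len(field[0])
    if rows % factor or cols % factor:
        raise ValueError("grid size must be divisible by the factor")
    return [
        [
            fmean(
                field[r + dr][c + dc]
                for dr in range(factor)
                for dc in range(factor)
            )
            for c in range(0, cols, factor)
        ]
        for r in range(0, rows, factor)
    ]


def normalize_field(field, eps=1e-12):
    """Shift a 2D field to zero mean and scale it to unit variance."""
    flat = [v for row in field for v in row]
    mean, std = fmean(flat), pstdev(flat)
    return [[(v - mean) / (std + eps) for v in row] for row in field]


# Toy 4x4 "permeability" field standing in for one sample (illustrative values)
field = [
    [1.0, 1.0, 2.0, 2.0],
    [1.0, 1.0, 2.0, 2.0],
    [3.0, 3.0, 4.0, 4.0],
    [3.0, 3.0, 4.0, 4.0],
]
coarse = regrid_block_mean(field, factor=2)  # 4x4 -> 2x2
scaled = normalize_field(coarse)             # zero mean, unit variance
```

Block averaging is only one way to regrid; interpolation or spectral truncation may suit a given solver output better.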

**Implementation Outline**  
1. **Identify** the selected models and their subfolders (e.g., `"candidate0"` for FNO).  
2. **Gather** relevant transformations from a “transformation plan” (possibly stored in JSON or a Python object).  
3. **Apply** the transformations in sequence to each `.pt` file (or geometry file) from the previous step:
   - Each step modifies shapes, channels, geometry format, or resolution.  
   - If no transform is required, the script simply copies the data.  
4. **Save** the result in `01_01_03_TransformRawData/candidateX/` with an **updated `data_desc.json`** if the geometry or channels changed. That descriptor now reflects the new data layout (e.g., from `dimension=3, geometry_type="mesh"` to `dimension=2, geometry_type="grid"`).
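
The outline above can be sketched as a small stage runner. This is an illustration under stated assumptions: `ToyEngine` is a hypothetical stand-in that only records calls (it is not the real `OntologyTransformationEngine`), and `run_stage` is a hypothetical helper name. The plan layout mirrors the `stages` / `transform_ops` structure used in this notebook.

```python
class ToyEngine:
    """Hypothetical stand-in for the transformation engine."""

    def __init__(self):
        self.calls = []

    def copy_only(self, source_folder, dest_folder, **extra):
        # A real engine would copy files; here we only record the request.
        self.calls.append(("copy_only", source_folder, dest_folder))


def run_stage(plans, stage_name, engine):
    """Run every transform_op of `stage_name` for each candidate in `plans`.

    Returns the candidates (or candidate:method pairs) that were skipped.
    """
    skipped = []
    for candidate_key, plan_info in plans.items():
        stages = plan_info["plan"].get("stages", [])
        stage = next((s for s in stages if s.get("stage_name") == stage_name), None)
        if stage is None:
            skipped.append(candidate_key)
            continue
        for op in stage.get("transform_ops", []):
            method = getattr(engine, op["method"], None)
            if method is None:
                skipped.append(f"{candidate_key}:{op['method']}")
                continue
            method(**op["params"])
    return skipped


plans = {
    "candidate0": {
        "model_name": "FNOWithDropout",
        "plan": {"stages": [{
            "stage_name": "01_01_03_TransformRawData",
            "transform_ops": [{
                "method": "copy_only",
                "params": {"source_folder": "01_01_LoadRawData",
                           "dest_folder": "01_01_03_TransformRawData"},
            }],
        }]},
    },
    # A candidate without the requested stage is reported as skipped.
    "candidate1": {"model_name": "AFNO", "plan": {"stages": []}},
}
engine = ToyEngine()
skipped = run_stage(plans, "01_01_03_TransformRawData", engine)
```

Returning the skipped entries, rather than only printing them, makes it easier to fail fast when a candidate unexpectedly lacks a stage.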

By isolating each candidate’s transformed data, we keep the pipeline modular, so subsequent **Preprocessing** or **FeaturePreparation** steps can be tailored per model. For demonstration, we’ll load the JSON transformation plans (keyed by candidate, e.g., `"candidate0"`, `"candidate1"`) and apply a minimal transform (here, a plain copy) to confirm the pipeline structure. In a production scenario, you might incorporate advanced geometry libraries (e.g., PyVista, VTK) or PDE-aware boundary labeling at this stage, especially in HPC contexts where domain complexity is high.

In [49]:
import os
import json
from darcy_automl_active_learning.ontology.ontology_transformation_engine import OntologyTransformationEngine

# 1) Path to the transformation plans (produced by OntologyEngine in an earlier step).
plans_json_path = os.path.join(data_dir_path, "01_01_DataTransformationPlan", "transformation_plans.json")

if not os.path.exists(plans_json_path):
    raise FileNotFoundError(f"[TransformRawData] Cannot find transformation plans at {plans_json_path}")

# 2) Load the transformation plans
with open(plans_json_path, "r") as f:
    all_candidates_plans = json.load(f)

print(f"[TransformRawData] Loaded transformation plans from {plans_json_path}.")
print("Candidate keys found:", list(all_candidates_plans.keys()))

# Example structure of all_candidates_plans (dictionary):
# {
#   "candidate0": { "model_name": "FNOWithDropout", "plan": {...} },
#   "candidate1": { "model_name": "AFNO",          "plan": {...} },
#   ...
# }

# 3) Instantiate our transformation engine
trans_engine = OntologyTransformationEngine()

# 4) We'll iterate through each candidate, find the stage "01_01_03_TransformRawData",
#    and execute its transform_ops in order.
for candidate_key, plan_info in all_candidates_plans.items():
    # plan_info might look like:
    # {
    #   "model_name": "FNOWithDropout",
    #   "plan": {
    #       "model_name": "FNOWithDropout",
    #       "stages": [
    #         {
    #           "stage_name": "01_01_01_LoadRawData",
    #           "transform_ops": [...]
    #         },
    #         {
    #           "stage_name": "01_01_03_TransformRawData",
    #           "transform_ops": [...]
    #         },
    #         ...
    #       ]
    #   }
    # }
    model_name = plan_info["model_name"]
    plan_dict  = plan_info["plan"]

    print(f"\n[TransformRawData] Processing candidate '{candidate_key}' for model '{model_name}'")

    # 4a) Retrieve the list of stages from plan_dict
    stages = plan_dict.get("stages", [])

    # 4b) Look for the stage named "01_01_03_TransformRawData"
    transformraw_stage = next(
        (st for st in stages if st.get("stage_name") == "01_01_03_TransformRawData"),
        None
    )

    if transformraw_stage is None:
        print(f"    -> No '01_01_03_TransformRawData' stage found for candidate '{candidate_key}'. Skipping.")
        continue

    # 4c) Execute each transform_op in that stage
    transform_ops = transformraw_stage.get("transform_ops", [])
    for op in transform_ops:
        method_name = op["method"]
        params      = op["params"]

        # Log the operation
        print(f"    -> Invoking '{method_name}' with params: {params}")

        # Dispatch to the transformation engine methods
        if hasattr(trans_engine, method_name):
            method = getattr(trans_engine, method_name)
            method(**params)  # e.g., copy_only(source_folder, dest_folder)
        else:
            print(f"    -> [Warning] Transformation method '{method_name}' not found. Skipped.")

print("\n[TransformRawData] All candidates processed for stage '01_01_03_TransformRawData'.")

[TransformRawData] Loaded transformation plans from examples/cfd/darcy_autoML_active_learning/data/01_01_DataTransformationPlan/transformation_plans.json.
Candidate keys found: ['candidate0', 'candidate1']

[TransformRawData] Processing candidate 'candidate0' for model 'FNOWithDropout'
    -> Invoking 'copy_only' with params: {'source_folder': '01_01_LoadRawData', 'dest_folder': '01_01_03_TransformRawData', 'subfolder_source': 'candidate0', 'subfolder_dest': 'candidate0', 'data_dir_path': 'examples/cfd/darcy_autoML_active_learning/data'}
[OntologyTransformationEngine] COPY_ONLY done: examples/cfd/darcy_autoML_active_learning/data/01_01_LoadRawData/candidate0 -> examples/cfd/darcy_autoML_active_learning/data/01_01_03_TransformRawData/candidate0

[TransformRawData] Processing candidate 'candidate1' for model 'AFNO'
    -> Invoking 'copy_only' with params: {'source_folder': '01_01_LoadRawData', 'dest_folder': '01_01_03_TransformRawData', 'subfolder_source': 'candidate1', 'subfolder_dest':

#### 01_01_04 Preprocessing

Even after **data transformations** (e.g., re-gridding or mesh decimation), **real-world PDE workflows** frequently require **additional** refinement before final training. This **preprocessing** ensures the dataset is consistently formatted, free of corruption, and enriched with any domain-specific metadata. Common operations might include:

- **Geometry Augmentation**  
  Performing random translations, rotations, or domain cropping to enhance model robustness and generalization.  

- **Cleaning & Filtering**  
  - **`PREPROC_REMOVE_OUTLIERS`**: Identifying and removing aberrant data points (e.g., extremely large velocities or pressures that arise from solver instabilities).  
  - **`PREPROC_DETECT_REPLACE_NANS`**: Automatically detecting `NaN` or infinite values and replacing them with defaults (e.g., zeros) or discarding those samples.  
  - **`PREPROC_FILTER_INCOMPLETE_SAMPLES`**: Skipping data entries where certain channels or geometry components are missing (e.g., partial PDE fields).  

- **Domain-Specific Preprocessing**  
  - **`PREPROC_LOG_STATS`**: Logging basic statistics (mean, std, min/max) per channel or boundary region for QA/QC.  
  - **`PREPROC_ADD_BOUNDARY_LABELS`**: Adding specialized boundary or interface labels, if not handled in the transform phase.  
  - **`PREPROC_ADD_CUSTOM_COORDS`**: Incorporating advanced parameterizations (e.g., polar/spherical coordinates for circular or spherical domains).  
  - **`PREPROC_MULTI_PHYSICS_COMBINE`**: Merging multiple PDE fields (e.g., fluid + thermal data) into a unified feature map.
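
As an illustration of the cleaning operations listed above, the following sketch implements toy versions of `PREPROC_FILTER_INCOMPLETE_SAMPLES`, `PREPROC_DETECT_REPLACE_NANS`, and `PREPROC_LOG_STATS` on plain Python lists. The function names are hypothetical, and the channel names `"permeability"` and `"darcy"` are assumed to match the keys stored in the `.pt` files.

```python
import math


def replace_non_finite(values, default=0.0):
    """Replace NaN and +/-inf entries with `default`; return the cleaned list and a count."""
    cleaned, replaced = [], 0
    for v in values:
        if math.isfinite(v):
            cleaned.append(v)
        else:
            cleaned.append(default)
            replaced += 1
    return cleaned, replaced


def filter_incomplete(samples, required_keys):
    """Keep only samples that contain every required channel."""
    return [s for s in samples if all(k in s for k in required_keys)]


def channel_stats(values):
    """Basic QA statistics (mean, population std, min, max) for one channel."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return {"mean": mean, "std": math.sqrt(var), "min": min(values), "max": max(values)}


samples = [
    {"permeability": [1.0, float("nan"), 3.0], "darcy": [0.1, 0.2, float("inf")]},
    {"permeability": [2.0, 2.0, 2.0]},  # missing "darcy", so it gets dropped
]
complete = filter_incomplete(samples, required_keys=("permeability", "darcy"))
perm, n_replaced = replace_non_finite(complete[0]["permeability"])
stats = channel_stats(perm)
```

Replacing non-finite values with zero is a simple default; for physical fields, discarding the sample may be the safer choice.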

---

In our **prototype** pipeline, we **simplify** preprocessing to minimal or **no** extra modifications. We essentially:

1. **Copy** each candidate’s transformed data files into a **`01_01_04_Preprocessing`** folder.  
2. **Optionally** perform consistency checks (ensuring each `.pt` file has the expected dimensions, verifying boundary channels exist if required, etc.).

> **Why keep this step separate?**  
> Preprocessing is distinct from core transformations because it can be **highly domain-specific** and may evolve over time. For instance, advanced HPC or industrial PDE pipelines might integrate strict validation rules (e.g., confirming mesh connectivity, verifying boundary compliance).

> **Extending Preprocessing**  
> In a **production** setup, you might expand this step to automate the following:
> - **`PREPROC_REMOVE_OUTLIERS`** and **`PREPROC_DETECT_REPLACE_NANS`** to ensure data integrity.  
> - **`PREPROC_LOG_STATS`** to capture summarizing metrics in a QA/QC log.  
> - **`PREPROC_ADD_BOUNDARY_LABELS`** to incorporate more sophisticated geometry masks.  
> - **Integration** with anomaly detection networks to flag suspicious samples or **retain** high-value domain extremes.

For now, our function (`do_preprocessing_for_candidates`) remains a **placeholder** indicating where these operations would occur. Future versions can expand domain-specific logic as needed, ensuring each candidate’s data is **validated**, **cleaned**, and **augmented** prior to **FeaturePreparation** or final model training.

{'candidate0': {'model_name': 'FNOWithDropout',
  'plan': {'candidate0': {'model_name': 'FNOWithDropout',
    'plan': {'model_name': 'FNOWithDropout',
     'stages': [{'stage_name': '01_01_01_LoadRawData',
       'transform_ops': [{'method': 'copy_only',
         'params': {'source_folder': '00_Generate_Data',
          'dest_folder': '01_01_LoadRawData',
          'data_dir_path': 'examples/cfd/darcy_autoML_active_learning/data'}}]},
      {'stage_name': '01_01_03_TransformRawData',
       'transform_ops': [{'method': 'copy_only',
         'params': {'source_folder': '01_01_LoadRawData',
          'dest_folder': '01_01_03_TransformRawData',
          'subfolder_source': 'candidate0',
          'subfolder_dest': 'candidate0',
          'data_dir_path': 'examples/cfd/darcy_autoML_active_learning/data'}}]},
      {'stage_name': '01_01_04_Preprocessing',
       'transform_ops': [{'method': 'copy_only',
         'params': {'source_folder': '01_01_03_TransformRawData',
          'dest_folde

In [14]:
import os
import json
from darcy_automl_active_learning.ontology.ontology_transformation_engine import OntologyTransformationEngine

# 1) Path to the transformation plans
plans_json_path = os.path.join(data_dir_path, "01_01_DataTransformationPlan", "transformation_plans.json")
if not os.path.exists(plans_json_path):
    raise FileNotFoundError(f"[Preprocessing] Cannot find transformation plans at {plans_json_path}")

# 2) Load the transformation plans
with open(plans_json_path, "r") as f:
    all_candidates_plans = json.load(f)

print(f"[Preprocessing] Loaded transformation plans from {plans_json_path}.")
print("Candidate keys found:", list(all_candidates_plans.keys()))

# 3) Instantiate or reuse the transformation engine
trans_engine = OntologyTransformationEngine()

# We'll assume the source folder is "01_01_03_TransformRawData"
source_root = os.path.join(data_dir_path, "01_01_03_TransformRawData")

# 4) For each candidate, find the stage "01_01_04_Preprocessing" and execute
for candidate_key, plan_info in all_candidates_plans.items():

    model_name = plan_info["model_name"]
    plan_dict  = plan_info["plan"]

    print(f"\n[Preprocessing] Processing candidate '{candidate_key}' for model '{model_name}'")

    # 4a) Retrieve the list of stages
    stages = plan_dict.get("stages", [])

    # 4b) Look for "01_01_04_Preprocessing"
    preprocessing_stage = next(
        (st for st in stages if st.get("stage_name") == "01_01_04_Preprocessing"),
        None
    )

    if preprocessing_stage is None:
        print(f"    -> No '01_01_04_Preprocessing' stage found for candidate '{candidate_key}'. Skipping.")
        continue

    # 4c) Retrieve transform_ops
    transform_ops = preprocessing_stage.get("transform_ops", [])
    if not transform_ops:
        print(f"    -> '01_01_04_Preprocessing' has no transform_ops for '{candidate_key}'. Skipping.")
        continue

    # 4d) Create the destination folder
    dest_folder = os.path.join(data_dir_path, "01_01_04_Preprocessing", candidate_key)
    os.makedirs(dest_folder, exist_ok=True)

    # 4e) Execute each transform method
    for op in transform_ops:
        method_name = op["method"]
        params      = op["params"]

        # Override source/dest so this stage reads from the previous stage's folder
        params["source_folder"] = os.path.join(source_root, candidate_key)
        params["dest_folder"]   = dest_folder

        print(f"    -> Invoking '{method_name}' with params: {params}")

        if hasattr(trans_engine, method_name):
            method = getattr(trans_engine, method_name)
            method(**params)
        else:
            print(f"    -> [Warning] Method '{method_name}' not found in transformation engine. Skipped.")

    print(f"    -> Finished preprocessing for candidate '{candidate_key}'.")

print("\n[Preprocessing] All candidates processed for stage '01_01_04_Preprocessing'.")


[Preprocessing] Loaded transformation plans from examples/cfd/darcy_autoML_active_learning/data/01_01_DataTransformationPlan/transformation_plans.json.
Candidate keys found: ['candidate0']

[Preprocessing] Processing candidate 'candidate0' for model 'FNOWithDropout'
    -> No '01_01_04_Preprocessing' stage found for candidate 'candidate0'. Skipping.

[Preprocessing] All candidates processed for stage '01_01_04_Preprocessing'.


#### 01_01_05 FeaturePreparation

Even with **preprocessing** in place, many **PDE workflows** can benefit from **feature engineering** to give models the best possible representation of the domain. Such feature engineering often targets **input** channels or **auxiliary** data that helps the model learn PDE patterns more effectively. Typical operations may include:

- **Boundary Channel Additions**  
  - **`FEATURE_ADD_BOUNDARY_MASK`**: Creating a channel that flags boundary nodes or cells (e.g., 1 at boundary, 0 in the interior). This clarifies region distinctions for the model.  
  - **`FEATURE_MESH_ADJ_INFO`**: For mesh-based PDEs, encoding adjacency or connectivity in a way that the model can leverage more directly.

- **Coordinate Expansions**  
  - **`FEATURE_ADD_COORDS`**: Injecting explicit $(x, y)$ or $(x, y, z)$ coordinates into each data sample if they’re not already included.  
  - **`FEATURE_TRANSFORM_COORDS`**: Converting from Cartesian to polar/spherical coordinates for certain domains or PDE problems.

- **Channel Rearrangements & Combinations**  
  - **`FEATURE_STACK_INPUTS`**: Stacking multiple PDE fields (e.g., temperature + velocity) into a single input tensor.  
  - **`FEATURE_SPLIT_FIELDS`**: Splitting one multi-channel input into separate sub-tensors for specialized architectures.

- **Scaling or Normalizing Fields**  
  - **`FEATURE_SCALE_CHANNELS`**: Applying scaling or normalization to each channel (e.g., min–max scaling or standard deviation normalization) after domain-specific preprocessing.  
  - **`FEATURE_LOG_TRANSFORM`**: Sometimes used for PDE variables that span multiple magnitudes (e.g., exponential growth in wave amplitude or flow velocity).

- **Noise Injection & Data Augmentation**  
  - **`FEATURE_ADD_NOISE`**: Introducing mild noise for regularization or simulating measurement uncertainty in sensor-based PDE data.  
  - **`FEATURE_AUGMENT_GEOMETRY`**: Additional geometric transformations (e.g., flips, slight domain perturbations) that specifically enhance feature diversity.
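
To illustrate two of the operations above, this sketch builds a boundary mask (`FEATURE_ADD_BOUNDARY_MASK`) and normalized coordinate grids (`FEATURE_ADD_COORDS`) for a uniform 2D grid and stacks them with the original field. All function names are hypothetical, and the channels-first layout is an assumption made for the example.

```python
def boundary_mask(rows, cols):
    """Return a rows x cols grid with 1.0 on the outer edge and 0.0 inside."""
    return [
        [1.0 if r in (0, rows - 1) or c in (0, cols - 1) else 0.0 for c in range(cols)]
        for r in range(rows)
    ]


def coord_channels(rows, cols):
    """Return (x, y) grids with coordinates normalized to [0, 1] (needs rows, cols >= 2)."""
    xs = [c / (cols - 1) for c in range(cols)]
    ys = [r / (rows - 1) for r in range(rows)]
    x_grid = [xs[:] for _ in range(rows)]
    y_grid = [[y] * cols for y in ys]
    return x_grid, y_grid


def add_features(field):
    """Stack [field, boundary mask, x, y] into a channels-first nested list."""
    rows, cols = len(field), len(field[0])
    x_grid, y_grid = coord_channels(rows, cols)
    return [field, boundary_mask(rows, cols), x_grid, y_grid]


field = [[0.5] * 3 for _ in range(3)]  # toy 3x3 permeability field
features = add_features(field)         # 4 channels, each 3x3
```

Whether coordinates help depends on the architecture; some operator implementations already append a coordinate grid internally, in which case adding one here would duplicate it.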

---

**Prototype Implementation**  
In our current pipeline, we keep **FeaturePreparation** **minimal**—doing little more than **copying** the data to a new folder. However, this step represents a natural **extension point** for domain-specific feature engineering. We envision a function `prepare_features_for_candidates(...)` in **`src/feature_engineering.py`** that could eventually:

1. **Verify** the presence of core PDE channels (e.g., `permeability`, `pressure`, `velocity`).  
2. **Combine** or **split** channels as needed for a given architecture (e.g., wavelet vs. graph-based).  
3. **Inject** boundary masks or coordinate arrays if a model demands explicit domain context.

By **decoupling** this from the earlier **Preprocessing** (which focuses on data cleaning and consistency), we ensure that model-specific or domain-specific feature engineering can **evolve** independently. Over time, additional transformations (like **`FEATURE_ADD_COORDS`**, **`FEATURE_SPLIT_FIELDS`**, or **`FEATURE_LOG_TRANSFORM`**) can be integrated without disrupting the rest of the pipeline. As a result, each candidate model can have precisely the **feature representation** it needs to learn effectively from the PDE data.

In [11]:
import os
import json
from darcy_automl_active_learning.ontology.ontology_transformation_engine import OntologyTransformationEngine

# 1) Path to the transformation plans
plans_json_path = os.path.join(data_dir_path, "01_01_DataTransformationPlan", "transformation_plans.json")
if not os.path.exists(plans_json_path):
    raise FileNotFoundError(f"[FeaturePreparation] Cannot find transformation plans at {plans_json_path}")

# 2) Load the transformation plans
with open(plans_json_path, "r") as f:
    all_candidates_plans = json.load(f)

print(f"[FeaturePreparation] Loaded transformation plans from {plans_json_path}.")
print("Candidate keys found:", list(all_candidates_plans.keys()))

# 3) Instantiate (or reuse) the transformation engine
trans_engine = OntologyTransformationEngine()

# We'll assume the source folder is "01_01_04_Preprocessing"
source_root = os.path.join(data_dir_path, "01_01_04_Preprocessing")

# 4) For each candidate, look for the stage "01_01_05_FeaturePreparation" and execute
for candidate_key, plan_info in all_candidates_plans.items():
    # plan_info typically looks like:
    # {
    #   "model_name": "FNOWithDropout",
    #   "plan": {
    #       "model_name": "FNOWithDropout",
    #       "stages": [
    #         {
    #           "stage_name": "01_01_01_LoadRawData",
    #           "transform_ops": [...]
    #         },
    #         {
    #           "stage_name": "01_01_05_FeaturePreparation",
    #           "transform_ops": [...]
    #         },
    #         ...
    #       ]
    #   }
    # }

    model_name = plan_info["model_name"]
    plan_dict = plan_info["plan"]

    print(f"\n[FeaturePreparation] Processing candidate '{candidate_key}' for model '{model_name}'")

    # 4a) Retrieve the list of stages
    stages = plan_dict.get("stages", [])

    # 4b) Find the "01_01_05_FeaturePreparation" stage
    featureprep_stage = next(
        (st for st in stages if st.get("stage_name") == "01_01_05_FeaturePreparation"),
        None
    )

    if featureprep_stage is None:
        print(f"    -> No '01_01_05_FeaturePreparation' stage found for candidate '{candidate_key}'. Skipping.")
        continue

    # 4c) Retrieve transform_ops
    transform_ops = featureprep_stage.get("transform_ops", [])
    if not transform_ops:
        print(f"    -> '01_01_05_FeaturePreparation' has no transform_ops for '{candidate_key}'. Skipping.")
        continue

    # 4d) Create the destination folder
    dest_folder = os.path.join(data_dir_path, "01_01_05_FeaturePreparation", candidate_key)
    os.makedirs(dest_folder, exist_ok=True)

    # 4e) Execute each transformation operation
    for op in transform_ops:
        method_name = op["method"]
        params = op["params"]

        # Override the source/dest for clarity
        params["source_folder"] = os.path.join(source_root, candidate_key)
        params["dest_folder"]   = dest_folder

        print(f"    -> Invoking '{method_name}' with params: {params}")

        if hasattr(trans_engine, method_name):
            method = getattr(trans_engine, method_name)
            method(**params)
        else:
            print(f"    -> [Warning] Method '{method_name}' not found in transform engine. Skipped.")

    print(f"    -> Finished feature preparation for candidate '{candidate_key}'.")

print("\n[FeaturePreparation] All candidates processed for stage '01_01_05_FeaturePreparation'.")


[FeaturePreparation] Loaded transformation plans from examples/cfd/darcy_autoML_active_learning/data/01_01_DataTransformationPlan/transformation_plans.json.
Candidate keys found: ['candidate0']

[FeaturePreparation] Processing candidate 'candidate0' for model 'FNOWithDropout'
    -> No '01_01_05_FeaturePreparation' stage found for candidate 'candidate0'. Skipping.

[FeaturePreparation] All candidates processed for stage '01_01_05_FeaturePreparation'.


### Conclusion: Data Pipeline Ready

We have now walked through an **end-to-end data pipeline** for our Darcy Flow (or, more generally, PDE-based) project. In this prototype most stages are copy-only placeholders, but the structure covers:

1. **Data Descriptor & Model Compatibility**
   - Defined a **comprehensive data descriptor** (`data_desc.json`) capturing dimension, geometry type, uniformity, and more.
   - Used a `ModelRegistry` and a simple `AutoMLCandidateModelSelection` to verify which candidate models (e.g., FNO, AFNO) can directly consume our dataset, or if transformations are required.

2. **Raw Data Loading (`01_01_LoadRawData`)**
   - Copied raw `.pt` files from `data/00_Generate_Data` to `data/01_01_LoadRawData`.
   - Preserved the data descriptor for consistency.
   - Performed minimal Exploratory Data Analysis (EDA) to confirm file integrity and shapes (e.g., checking `"permeability"`, `"darcy"`).

3. **Transforming Raw Data (`01_01_03_TransformRawData`)**
   - For each **candidate model** (e.g., `candidate0`, `candidate1`), created a dedicated subfolder in `data/01_01_03_TransformRawData`.
   - Outlined how to handle **typical PDE transformations**:
     - **`TRANSFORM_MESH_TO_GRID`**: Converting an unstructured mesh to a uniform grid if required by a spectral-based operator.
     - **`TRANSFORM_DECIMATE_MESH`**: Reducing mesh complexity for memory or performance constraints.
     - **`TRANSFORM_REGRID_DATA`**: Adjusting resolution or coordinate spacing to match model expectations.
     - **`TRANSFORM_ADD_BOUNDARY_CHANNEL`**: Incorporating boundary-condition channels (if not added earlier).
     - **`TRANSFORM_NORMALIZE_TENSORS`**: Standardizing or normalizing PDE fields (e.g., subtract the mean, divide by the standard deviation).
   - Kept transformations minimal in this prototype, but laid out the structure for more sophisticated re-gridding or domain modifications if needed.

4. **Preprocessing (`01_01_04_Preprocessing`)**
   - Introduced a **placeholder** for additional PDE data modifications, including:
     - **`PREPROC_GEOMETRY_AUGMENT`**: Random rotations, domain cropping, or flips.
     - **`PREPROC_REMOVE_OUTLIERS`**: Filtering extreme or invalid values.
     - **`PREPROC_DETECT_REPLACE_NANS`**: Handling missing or corrupted data points.
     - **`PREPROC_FILTER_INCOMPLETE_SAMPLES`**: Removing partial or malformed data entries.
   - Copied the transformed data for each candidate into `data/01_01_04_Preprocessing`, ensuring any domain- or application-specific cleaning can be done here.

5. **Feature Preparation (`01_01_05_FeaturePreparation`)**
   - Final stage of data engineering before **model training**, covering potential:
     - **Boundary Mask Channels** (e.g., `FEATURE_ADD_BOUNDARY_MASK`).
     - **Coordinate Expansions** (`FEATURE_ADD_COORDS`), if needed for operator-based PDE solvers.
     - **Channel Stacking** (`FEATURE_STACK_INPUTS`) or **Splitting** (`FEATURE_SPLIT_FIELDS`) to reorganize PDE fields.
     - **Scaling** or **Augmentation** for the final inputs (e.g., `FEATURE_SCALE_CHANNELS` or `FEATURE_ADD_NOISE`).
   - Copied or updated files under `data/01_01_05_FeaturePreparation`, providing a flexible hook for advanced PDE-specific feature engineering.

**Outcome & Next Steps**  
All data have now passed through the **transform**, **preprocessing**, and **feature-preparation** stages in a structured manner and are ready for **surrogate model training** or **AutoML** hyperparameter tuning. Our project’s data folders now look like:

```
data/
 ├─ 00_Generate_Data/
 │    └─ data_desc.json
 ├─ 01_00_AutoMLCandidateModelSelection/
 │    └─ chosen_candidates.json
 ├─ 01_01_LoadRawData/
 ├─ 01_01_03_TransformRawData/
 │    ├─ candidate0/
 │    └─ candidate1/
 ├─ 01_01_04_Preprocessing/
 │    ├─ candidate0/
 │    └─ candidate1/
 └─ 01_01_05_FeaturePreparation/
      ├─ candidate0/
      └─ candidate1/
```

With the **data pipeline** complete, we can move on to **model definition**, **training**, and (optionally) **AutoML** tasks such as hyperparameter optimization or multi-model experimentation.

### 01_02 Model Definition

In this section, we introduce our primary PDE surrogate model definitions. We focus on two main variants:

1. **FNOWithDropout** – A custom subclass of Modulus’s Fourier Neural Operator (FNO) that injects dropout. This allows us to do Monte Carlo Dropout–based uncertainty estimation or simply add a regularization mechanism.
2. **AFNO** – NVIDIA Modulus’s Adaptive Fourier Neural Operator, which uses an adaptive frequency gating approach for improved spectral flexibility.

Both surrogates rely on hyperparameter definitions stored in our `config.yaml` under `cfg.arch.fno.*` or `cfg.arch.afno.*`. By default, we’ll pull settings like `in_channels`, `out_channels`, `latent_channels`, `drop` (dropout rate), and so on directly from `config.yaml`. You can override these values in the notebook if needed—just edit the `cfg` object before creating the models.

We’ll keep the actual model classes (and any helper functions) in `src/models.py` (or sub-files like `fno_dropout.py`, `afno.py`), each thoroughly documented with docstrings. Then, in the next cells, we’ll show how to use these classes in conjunction with the config fields.
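
To show the idea behind Monte Carlo Dropout without depending on Modulus or PyTorch, here is a toy sketch: a linear model with dropout on its inputs is evaluated many times with dropout left active, and the spread of the outputs serves as an uncertainty estimate. Everything here (`toy_forward`, `mc_dropout_predict`, the weights) is hypothetical and only illustrates the sampling loop, not the actual `FNOWithDropout` implementation.

```python
import random
from statistics import fmean, pstdev


def dropout(values, p, rng):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def toy_forward(x, weights, p, rng):
    """One stochastic forward pass of a toy linear model with dropout on its inputs."""
    dropped = dropout(x, p, rng)
    return sum(w * v for w, v in zip(weights, dropped))


def mc_dropout_predict(x, weights, p=0.1, n_passes=200, seed=0):
    """Run repeated passes with dropout active; return mean and std of the outputs."""
    rng = random.Random(seed)
    outputs = [toy_forward(x, weights, p, rng) for _ in range(n_passes)]
    return fmean(outputs), pstdev(outputs)


mean, std = mc_dropout_predict(x=[1.0, 2.0, 3.0], weights=[0.5, -0.2, 0.1])
```

With a real network the same loop applies: keep the dropout layers in training mode at inference time, run several forward passes on the same input, and summarize the predictions per grid point.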

### 01_03 Model Factory

This section focuses on **merging our user configuration** (especially the field `cfg.model_name`) with the model definitions created in “01_02 Model Definition.” By doing so, we can **automate** which PDE surrogate to build—be it an FNO-based model, AFNO, or a future extension (like a PINN or DiffusionNet). 

**Why a Factory?** It lets us keep a **single** entry point (`get_model(cfg)`), which reads the relevant parameters (`cfg.arch.fno.*`, `cfg.arch.afno.*`, etc.) and returns the correct PyTorch module. This modular approach also makes it straightforward to **add** new model variants (e.g., a different neural operator) without changing the notebook workflow. 

In the following steps, we’ll:
1. Create a new file, `model_factory.py`, that defines `get_model(cfg)` (with docstrings).
2. Demonstrate how we **import** and **use** this factory function in the notebook.
3. Confirm it works by instantiating a model and optionally running a quick shape check.

This pattern helps maintain a **clean separation** between model definitions and the logic that decides **which** model to instantiate—making the pipeline easier to scale and adapt for new PDE surrogates.
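
A minimal sketch of the factory pattern described here, assuming a registry that maps `cfg.model_name` to a constructor and to the matching `cfg.arch` sub-section. `ToyFNO` and `ToyAFNO` are hypothetical stubs standing in for the real model classes, and `SimpleNamespace` stands in for the notebook's config object.

```python
from types import SimpleNamespace


class ToyFNO:
    """Hypothetical stub standing in for the FNO-with-dropout model class."""

    def __init__(self, in_channels, out_channels, drop=0.0):
        self.in_channels, self.out_channels, self.drop = in_channels, out_channels, drop


class ToyAFNO:
    """Hypothetical stub standing in for the AFNO model class."""

    def __init__(self, in_channels, out_channels, drop=0.0):
        self.in_channels, self.out_channels, self.drop = in_channels, out_channels, drop


# Maps cfg.model_name to (constructor, name of the cfg.arch sub-section)
MODEL_REGISTRY = {
    "FNOWithDropout": (ToyFNO, "fno"),
    "AFNO": (ToyAFNO, "afno"),
}


def get_model(cfg):
    """Build the model named by cfg.model_name from its cfg.arch.<key> settings."""
    try:
        constructor, arch_key = MODEL_REGISTRY[cfg.model_name]
    except KeyError:
        raise ValueError(
            f"Unknown model_name '{cfg.model_name}'; expected one of {sorted(MODEL_REGISTRY)}"
        ) from None
    arch_cfg = getattr(cfg.arch, arch_key)
    return constructor(**vars(arch_cfg))


cfg = SimpleNamespace(
    model_name="FNOWithDropout",
    arch=SimpleNamespace(
        fno=SimpleNamespace(in_channels=1, out_channels=1, drop=0.1),
        afno=SimpleNamespace(in_channels=1, out_channels=1, drop=0.1),
    ),
)
model = get_model(cfg)
```

Adding a new surrogate then only requires one new registry entry and a matching `cfg.arch` section; the notebook code that calls `get_model(cfg)` stays unchanged.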

### 01_04 Configuring Hyperparameters

In this section, we outline how to configure the hyperparameters for our PDE surrogate models. 
Recall that we store default values (like `epochs`, `learning_rate`, `batch_size`, etc.) in our
[`config.yaml`](./config.yaml). 

For instance, here are a few default hyperparameters you might see in that file:

| Hyperparameter           | Default Value | Description / Notes                          |
|--------------------------|---------------|----------------------------------------------|
| `training.epochs`        | 10            | Number of training epochs                    |
| `training.lr`            | 1e-3          | Initial learning rate for the optimizer      |
| `training.batch_size`    | 16            | Mini-batch size for training loops           |
| `arch.fno.num_fno_modes` | 12            | Number of Fourier modes (FNO-specific)       |
| `arch.afno.drop`         | 0.1           | Dropout rate for AFNO gating (AFNO-specific) |

**Overriding Hyperparams Locally**  
You can update these hyperparameters within the notebook before training or tuning. For example:
```python
cfg.training.lr = 5e-4
cfg.training.epochs = 30
print("Updated training config:", cfg.training)
```

**Using MLflow**  
We also demonstrate how to log hyperparameters to MLflow, so each run’s configuration is 
stored alongside its metrics and artifacts. In a typical flow, you might do:

```python
import mlflow

# The context manager ends the run even if training raises an exception.
with mlflow.start_run(run_name="Experiment_FNO"):
    # Log hyperparameters; log_hyperparams_mlflow is assumed to be a
    # project helper defined or imported earlier in the notebook.
    log_hyperparams_mlflow(cfg)

    # proceed with training...
```

In subsequent cells, we’ll show how to integrate these hyperparameters into the training loop, 
as well as how to override them for AutoML or HPC use cases if you wish. 
This approach ensures a **reproducible** pipeline—where each run can be traced back 
to its exact configuration and settings.
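
One way such a logging helper could work is to flatten the nested configuration into dotted keys and pass the result to `mlflow.log_params`. The sketch below is an assumption about the approach, not the project's actual helper: `flatten_config` and `log_flat_params` are hypothetical names, and the config is shown as a plain dict using the default values from the table above.

```python
def flatten_config(cfg, prefix=""):
    """Flatten a nested dict into {"a.b.c": value} pairs suitable for parameter logging."""
    flat = {}
    for key, value in cfg.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten_config(value, name))
        else:
            flat[name] = value
    return flat


def log_flat_params(cfg):
    """Log a nested config to the active MLflow run (requires mlflow to be installed)."""
    import mlflow  # imported lazily so flatten_config works without mlflow

    mlflow.log_params(flatten_config(cfg))


cfg_dict = {
    "training": {"epochs": 10, "lr": 1e-3, "batch_size": 16},
    "arch": {"fno": {"num_fno_modes": 12}, "afno": {"drop": 0.1}},
}
flat = flatten_config(cfg_dict)
```

Dotted keys such as `training.lr` keep the logged parameters aligned with the names used in `config.yaml`, which makes runs easier to compare later.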

### 01_05 Model Training Loop

In this section, we implement a **generic PDE training loop** that references our **configuration parameters** (like epochs, learning rate, batch size, etc.) from `config.yaml`. This training loop can be used for:

- **Single-Run Training**: Train one model with a chosen set of hyperparameters (e.g., an FNO or AFNO).
- **Multi-Run/AutoML Scenarios**: Call the loop repeatedly with different hyperparameter overrides during tuning (we’ll see this usage in a later section).

We incorporate:
- **Progress Bars** with `tqdm`, to get live feedback on training progress (especially helpful in notebooks).
- **MLFlow Logging** (optional), so each epoch’s train and validation loss is recorded for future analysis.
- **Device Handling** (CPU vs. GPU via a `device` parameter).

If you’re running in **HPC or distributed** environments, you may want to disable the tqdm progress bars (they add overhead and clutter batch-job logs) and/or integrate distributed managers from Modulus or PyTorch. We’ll point out where those hooks go, but keep them minimal for this prototype.

Below, we’ll demonstrate how to use our training loop, pass in a config object, and see the relevant progress bar and MLFlow logs.
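As a rough picture of what such a loop involves, here is a minimal sketch assuming PyTorch and plain `(input, target)` batches. The function `train_sketch`, the random tensors, and the single convolution layer are illustrative stand-ins, not the notebook's actual trainer, Darcy data, or an FNO.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def train_sketch(model, train_loader, val_loader, epochs, lr, device="cpu"):
    """Minimal supervised loop; returns one (train_loss, val_loss) pair per epoch."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    history = []
    for epoch in range(epochs):
        model.train()
        train_total = 0.0
        for x, y in train_loader:  # wrap the loader in tqdm(...) for a progress bar
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
            train_total += loss.item() * x.size(0)

        model.eval()
        val_total = 0.0
        with torch.no_grad():  # no gradients needed for validation
            for x, y in val_loader:
                x, y = x.to(device), y.to(device)
                val_total += loss_fn(model(x), y).item() * x.size(0)

        history.append((train_total / len(train_loader.dataset),
                        val_total / len(val_loader.dataset)))
        # Optional MLflow hook: mlflow.log_metric("val_loss", history[-1][1], step=epoch)
    return history


# Toy stand-in data: random 1-channel 16x16 fields.
torch.manual_seed(0)
inputs, targets = torch.randn(32, 1, 16, 16), torch.randn(32, 1, 16, 16)
train_loader = DataLoader(TensorDataset(inputs[:24], targets[:24]), batch_size=8, shuffle=True)
val_loader = DataLoader(TensorDataset(inputs[24:], targets[24:]), batch_size=8)
toy_model = nn.Conv2d(1, 1, kernel_size=3, padding=1)

history = train_sketch(toy_model, train_loader, val_loader, epochs=2, lr=1e-3)
print(history)
```

Returning the loss history, rather than only printing it, lets the same function serve both single runs and tuning trials that need a final validation score.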

### 01_06 Model Training Execution

In this section, we bring together all of the moving parts from our pipeline:
- **Data pipeline**: The raw data has been generated, transformed, and preprocessed in the earlier steps.
- **Model factory**: We can instantiate our chosen model (e.g., FNO or AFNO) using the config-based logic from “01_03 Model Factory.”
- **Hyperparameter settings**: From “01_04 Configuring Hyperparameters,” we have default (or overridden) values for epochs, learning rate, batch size, and so on.
- **Training loop**: As defined in “01_05 Model Training Loop,” which handles epochs, mini-batches, loss calculation, optional validation, and more.

By **combining** these steps, we now present a **user-facing script or function** (`execute_training` or similar) that performs the **end-to-end** training process:
1. **Pull** the final data loader(s),  
2. **Create** or load the model,  
3. **Train** using our training loop,  
4. **Track** progress in a notebook progress bar (using `tqdm` by default),  
5. **Log** metrics to MLFlow (if desired),  
6. **Save** checkpoints according to the user’s preference (final, best, or every epoch).
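The checkpoint preference in step 6 can be reduced to a small decision function. The sketch below is an assumption about how such a policy might be encoded: the policy names `"final"`, `"best"`, and `"every"` and the helper `should_save_checkpoint` are hypothetical. In a real run, a `True` result would be followed by something like `torch.save(model.state_dict(), path)`.

```python
from typing import Optional


def should_save_checkpoint(policy: str, epoch: int, total_epochs: int,
                           val_loss: float, best_val_loss: Optional[float]) -> bool:
    """Decide whether to write a checkpoint after `epoch` (0-based).

    policy: "final" (last epoch only), "best" (validation loss improved),
    or "every" (all epochs). These names are illustrative assumptions.
    """
    if policy == "every":
        return True
    if policy == "final":
        return epoch == total_epochs - 1
    if policy == "best":
        return best_val_loss is None or val_loss < best_val_loss
    raise ValueError(f"Unknown checkpoint policy: {policy!r}")


# Walk through a made-up validation curve to see which epochs would be saved.
val_curve = [0.90, 0.60, 0.65, 0.50]
best = None
saved_epochs = []
for epoch, val_loss in enumerate(val_curve):
    if should_save_checkpoint("best", epoch, len(val_curve), val_loss, best):
        saved_epochs.append(epoch)
    best = val_loss if best is None else min(best, val_loss)
print(saved_epochs)  # [0, 1, 3]
```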

We’ll also briefly show how to adjust or disable certain features for HPC usage—such as turning off the progress bar or hooking in distributed training if needed. The remainder of this section walks through a Python function and example usage in the notebook to carry out this consolidated training flow.

### 01_07 AutoML and Hyperparameter Tuning
In the previous section (“01_06 Model Training Execution”), we demonstrated how to train our PDE surrogate (FNO or AFNO) with a chosen set of hyperparameters—either from our default `config.yaml` or via simple overrides. Now, we turn to a more **systematic** approach: **hyperparameter tuning** or **AutoML**.

Here, we’ll use a search strategy (grid, random, or Bayesian) to explore the hyperparameter space; in Python, **Optuna** is a common library for this. Our `config.yaml` already contains default parameter values, plus additional fields (under `cfg.automl`) specifying **ranges** (e.g., Fourier modes from 8 to 20, learning rate from 1e-4 to 5e-3). 
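To make the idea of a search space concrete, here is a small random-search sampler using only the standard library. It is a stand-in for Optuna, not a copy of it. The ranges follow the examples above, while the key names and the layout of `SEARCH_SPACE` are assumptions about what `cfg.automl` might contain.

```python
import math
import random
from typing import Any, Dict

# Hypothetical search space; only the numeric ranges come from the text.
SEARCH_SPACE = {
    "arch.fno.num_fno_modes": {"type": "int", "low": 8, "high": 20},
    "training.lr": {"type": "log_float", "low": 1e-4, "high": 5e-3},
}


def sample_params(space: Dict[str, Dict[str, Any]], rng: random.Random) -> Dict[str, Any]:
    """Draw one candidate configuration from the search space."""
    params = {}
    for name, spec in space.items():
        if spec["type"] == "int":
            params[name] = rng.randint(spec["low"], spec["high"])  # inclusive bounds
        elif spec["type"] == "log_float":
            # Sample uniformly in log space so small and large rates are covered evenly.
            log_value = rng.uniform(math.log(spec["low"]), math.log(spec["high"]))
            params[name] = math.exp(log_value)
        else:
            raise ValueError(f"Unsupported parameter type: {spec['type']}")
    return params


rng = random.Random(0)
for trial_index in range(3):
    print(trial_index, sample_params(SEARCH_SPACE, rng))
```

With Optuna, the corresponding calls inside an objective function are `trial.suggest_int(name, low, high)` and `trial.suggest_float(name, low, high, log=True)`.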

**MLFlow Logging**  
Just as in our normal training, we’ll integrate MLFlow to log each trial’s configuration and final metrics. By doing so, we can easily compare many trials in a single, consolidated UI. 

**Progress Bars**  
For each trial, we can still rely on our PDE training loop’s `tqdm` progress bar. For a large number of trials, though, it is practical to reduce the training epochs or the amount of training data to speed up each run.

---

**Key Points in This Section**
1. **Hyperparameter Range Setup**  
   We confirm or update the `config.yaml` sub-tree (`cfg.automl`) that defines the search space for FNO (e.g. `modes`, `width`, `depth`, etc.) and, if relevant, for AFNO (`drop`, `gating_strength`, etc.).

2. **AutoML Logic**  
   We’ll create or review a new file, `src/automl.py`, which contains code to parse those search ranges and define an **Optuna objective** function.

3. **Partial vs. Full Training**  
   In each trial, we might do a reduced set of epochs or data to expedite the search. Once the best params are found, we’ll do a **full** retraining using the discovered configuration.

4. **MLFlow**  
   We’ll log each trial’s hyperparams and final validation metrics under separate nested runs, so you can open MLFlow and compare them.

By the end of this section, you’ll have seen how to run multiple hyperparam search trials—**automatically** adjusting FNO or AFNO parameters—before picking the best discovered setup for a final training pass.
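The partial-versus-full pattern from point 3 can be sketched independently of any tuning library. In the example below, `search_then_retrain` and `toy_train_fn` are hypothetical names, and the loss is a synthetic formula chosen only so the example runs without data. In the notebook, `train_fn` would wrap the real training loop and return a validation loss.

```python
import math
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]


def search_then_retrain(train_fn: Callable[[Params, int], float],
                        candidates: List[Params],
                        search_epochs: int,
                        full_epochs: int) -> Tuple[Params, float]:
    """Rank candidates with short runs, then retrain the winner at full length.

    `train_fn(params, epochs)` must return a validation loss (lower is better).
    """
    # Each call could sit inside `with mlflow.start_run(nested=True):` to log one trial.
    scored = [(train_fn(params, search_epochs), params) for params in candidates]
    _, best_params = min(scored, key=lambda item: item[0])
    final_loss = train_fn(best_params, full_epochs)
    return best_params, final_loss


def toy_train_fn(params: Params, epochs: int) -> float:
    """Synthetic loss: lowest near lr = 1e-3 and improving with more epochs."""
    return (math.log10(params["training.lr"]) + 3.0) ** 2 + 1.0 / epochs


candidates = [{"training.lr": lr} for lr in (1e-4, 1e-3, 5e-3)]
best_params, final_loss = search_then_retrain(toy_train_fn, candidates,
                                              search_epochs=3, full_epochs=30)
print(best_params, round(final_loss, 4))
```

One caveat of this pattern is that a short run can rank configurations differently from a full run, so the epoch budget for the search phase is itself a judgment call.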

### 01_08 Visualizing Performance and Results

After training our PDE surrogate models (and possibly using AutoML to tune hyperparameters),
we now want to **examine** how well they perform. In this section, we will:

1. **Load** the training/validation metrics from our logs (or MLFlow, if enabled).
2. **Plot** these metrics (e.g., loss curves over epochs).
3. **Compare** model predictions to ground-truth solutions for a few test samples—especially
   valuable in Darcy flow, where we can visualize the predicted pressure fields vs. the true
   solution.
4. **Summarize** errors (e.g., MSE, absolute difference) across a set of test samples to
   get a sense of overall accuracy, variance, and potential failure cases.
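For step 4, per-sample error statistics can be computed in a few lines of NumPy. This is a sketch under the assumption that predictions and ground truth are arrays shaped `(n_samples, height, width)` with non-zero ground-truth fields; `summarize_errors` is a hypothetical name and is not one of the functions in `src/visualization.py`.

```python
import numpy as np


def summarize_errors(pred: np.ndarray, truth: np.ndarray) -> dict:
    """Per-sample error statistics for arrays shaped (n_samples, height, width)."""
    diff = (pred - truth).reshape(len(pred), -1)
    ref = truth.reshape(len(truth), -1)
    mse = np.mean(diff ** 2, axis=1)
    # Relative L2 error: ||pred - truth|| / ||truth||, one value per sample.
    rel_l2 = np.linalg.norm(diff, axis=1) / np.linalg.norm(ref, axis=1)
    return {
        "mse_mean": float(mse.mean()),
        "max_abs_error": float(np.abs(diff).max()),
        "rel_l2_mean": float(rel_l2.mean()),
        "rel_l2_std": float(rel_l2.std()),
        "worst_sample": int(rel_l2.argmax()),  # index of the likeliest failure case
    }


# Two synthetic samples: the first is predicted exactly, the second is off by 0.1 everywhere.
truth = np.ones((2, 4, 4))
pred = truth.copy()
pred[1] += 0.1
summary = summarize_errors(pred, truth)
print(summary)
```

Reporting the index of the worst sample alongside the averages makes it easy to pull that case into a side-by-side plot.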

We rely on the utility functions we placed in **`src/visualization.py`**:

- `plot_train_val_loss(...)`: For plotting training/validation loss curves.
- `plot_prediction_comparison(...)`: Side-by-side visualization of **input** (permeability),
  **predicted** (pressure), **ground truth** (pressure), and a simple **error map**.
- `plot_error_distribution(...)`: Quick histogram or boxplot of errors across many samples.
- `summarize_metrics_table(...)`: A small table summarizing results from multiple runs.
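The implementations in `src/visualization.py` are not reproduced in this section. As an indication of what a four-panel comparison involves, the following sketch uses Matplotlib with synthetic fields; `plot_comparison_sketch` is a hypothetical stand-in and may differ from the project's `plot_prediction_comparison` in signature and styling.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_comparison_sketch(permeability, predicted, truth):
    """Draw input, prediction, ground truth, and absolute error side by side."""
    panels = [
        ("Permeability (input)", permeability),
        ("Predicted pressure", predicted),
        ("True pressure", truth),
        ("Absolute error", np.abs(predicted - truth)),
    ]
    fig, axes = plt.subplots(1, len(panels), figsize=(16, 4))
    for ax, (title, field) in zip(axes, panels):
        image = ax.imshow(field, origin="lower")
        ax.set_title(title)
        ax.set_xticks([])
        ax.set_yticks([])
        fig.colorbar(image, ax=ax, fraction=0.046, pad=0.04)  # one colorbar per panel
    fig.tight_layout()
    return fig


# Synthetic 32x32 fields stand in for one Darcy test sample.
rng = np.random.default_rng(0)
permeability = rng.random((32, 32))
truth = rng.random((32, 32))
predicted = truth + 0.05 * rng.standard_normal((32, 32))
fig = plot_comparison_sketch(permeability, predicted, truth)
```

Giving each panel its own colorbar avoids the small error values being washed out by the scale of the pressure field.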

Finally, we’ll also **load** a saved model checkpoint (if we have one) or pick a final/best-epoch
checkpoint to run inference on sample PDE inputs. By the end, we should have a clear picture
of how our model is performing and any areas for improvement.

#### Concluding the Visualization & Pipeline
We’ve now completed a full pass through our PDE surrogate pipeline—from data preparation, 
model definition, and hyperparameter tuning, to final training and results visualization.

- **Final Observations**:  
  - For instance, record the relative test-set error obtained with FNO at `modes=12` and `width=64` (**X%** is a placeholder to be replaced with the value from your own run).  
  - The predicted Darcy fields show close alignment with the ground truth solutions, as seen in our 2D plots.

- **HPC Readiness**:  
  - If you plan to run larger resolutions or more epochs, the same notebook logic can scale to HPC environments. 
  - You may disable the progress bar or use a distributed manager (e.g., `DistributedManager` in Modulus) to parallelize training.

- **Advanced Features**:  
  - In real-world scenarios, consider adding PDE constraints, subgrid modeling, or multi-objective optimization if the use-case demands more advanced physics fidelity.
  - **Active Learning** can be integrated to select new PDE samples, especially if generating or simulating data is expensive.

- **MLFlow or Other Logs**:  
  - If you recorded metrics in MLFlow, open the MLFlow UI (or your logging interface) to view interactive charts, parameter comparisons, and artifacts (e.g., model checkpoints, images).

**Next Steps**:
1. **Refine the Model**: Increase epochs, tweak hyperparameters further, or incorporate additional PDE constraints.
2. **Deploy or Save** the pipeline: Export your final model to an inference engine, or port the pipeline to an HPC environment.
3. **Explore** expansions like deeper AFNO gating, multi-physics PDE coupling, or more advanced domain transformations.

With these steps, you have a **functioning** pipeline that can be adapted for **larger HPC** runs, 
extended to more sophisticated PDE tasks, or combined with **AutoML** strategies that systematically refine hyperparameters.