# Pipeline Examples

## PCA-GP on Field Data

This notebook demonstrates the end-to-end process of building a machine learning pipeline for predicting high-dimensional flow fields from scalar input parameters using the PLAID dataset. The pipeline leverages PLAID’s scikit-learn-compatible blocks and standard dataset structures to enable modular, reusable workflows.

Key steps covered in this notebook:

- **Loading and preparing the PLAID dataset** using Hugging Face integration and PLAID’s dataset classes  
- **Standardizing features** with PLAID-wrapped scikit-learn transformers for scalars and fields  
- **Dimensionality reduction** of flow fields via Principal Component Analysis (PCA) to reduce output complexity  
- **Regression modeling** of PCA coefficients from scalar inputs using Gaussian Process regression  
- **Pipeline assembly** combining transformations and regressors into a single scikit-learn-compatible workflow  
- **Hyperparameter tuning** using Optuna and scikit-learn’s `GridSearchCV`
- **Model evaluation** using cross-validation and appropriate metrics  
- **Best practices** for working with PLAID datasets and pipelines in a reproducible and modular manner

### 📦 Imports

In [1]:
import warnings
warnings.filterwarnings('ignore', module='sklearn')
warnings.filterwarnings("ignore", message=".*IProgress not found.*")

import os
from pathlib import Path

import yaml
import numpy as np
import optuna

from datasets.utils.logging import disable_progress_bar
from datasets import load_dataset

from sklearn.base import clone
from sklearn.pipeline import Pipeline

from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.multioutput import MultiOutputRegressor

from sklearn.model_selection import KFold, GridSearchCV

from plaid.bridges.huggingface_bridge import huggingface_dataset_to_plaid, huggingface_description_to_problem_definition
from plaid.pipelines.sklearn_block_wrappers import WrappedPlaidSklearnTransformer, WrappedPlaidSklearnRegressor
from plaid.pipelines.plaid_blocks import PlaidTransformedTargetRegressor, PlaidColumnTransformer

disable_progress_bar()
n_processes = min(os.cpu_count() or 1, 6)  # os.cpu_count() may return None

## 📥 Load Dataset

We load the `VKI-LS59` dataset from Hugging Face and restrict ourselves to its first 24 samples, which serve as the training set in this example.

In [2]:
hf_dataset = load_dataset("PLAID-datasets/VKI-LS59", split="all_samples[:24]")
dataset_train, _ = huggingface_dataset_to_plaid(hf_dataset, processes_number = n_processes, verbose = False)

Printing `dataset_train` shows 24 samples with 8 scalars and 8 fields, which is consistent with the `VKI-LS59` dataset:

In [3]:
print(dataset_train)

Dataset(24 samples, 8 scalars, 0 time_seriess, 8 fields)


## ⚙️ Pipeline Configuration

For convenience, the `in_features_identifiers` and `out_features_identifiers` for each pipeline block are defined in a `.yml` file. Here's an example of how the configuration might look:

```yaml
pca_nodes:
  in_features_identifiers:
    - type: nodes
      base_name: Base_2_2
  out_features_identifiers:
    - type: scalar
      name: reduced_nodes_*
```

In [4]:
try:
    filename = Path(__file__).parent.parent.parent / "docs" / "source" / "notebooks" / "config_pipeline.yml"
except NameError:
    filename = "config_pipeline.yml"

with open(filename, 'r') as f:
    config = yaml.safe_load(f)

all_feature_id = config['input_scalar_scaler']['in_features_identifiers'] +\
    config['pca_nodes']['in_features_identifiers'] + config['pca_mach']['in_features_identifiers']
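
As a rough illustration of how such a configuration is consumed, the sketch below hand-writes the dictionary that `yaml.safe_load` would return for the `pca_nodes` entry shown earlier and unpacks it with `**`, in the same way the later cells pass `**config[...]` to the wrapped blocks. The helper `describe_block` is hypothetical and only stands in for a wrapper constructor; it is not part of PLAID.

```python
# Hand-written equivalent of the parsed `pca_nodes` YAML entry shown above.
config = {
    "pca_nodes": {
        "in_features_identifiers": [{"type": "nodes", "base_name": "Base_2_2"}],
        "out_features_identifiers": [{"type": "scalar", "name": "reduced_nodes_*"}],
    }
}


def describe_block(sklearn_block, in_features_identifiers, out_features_identifiers=None):
    """Hypothetical stand-in for a wrapper constructor (not the PLAID API)."""
    return {
        "block": sklearn_block,
        "inputs": [f["type"] for f in in_features_identifiers],
        "outputs": [f["name"] for f in (out_features_identifiers or [])],
    }


# `**` turns each top-level key of the config entry into a keyword argument.
summary = describe_block("PCA", **config["pca_nodes"])
print(summary)
```

Keeping the identifiers in a separate file means the same block definitions can be pointed at other features without touching the pipeline code.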

In this example, we aim to predict the `mach` field from two input scalars, `angle_in` and `mach_out`, and the mesh node coordinates. To limit memory consumption, we restrict the dataset to the features required for this example:

In [5]:
dataset_train = dataset_train.from_features_identifier(all_feature_id)
print(dataset_train)

Dataset(24 samples, 2 scalars, 0 time_seriess, 1 field)


We notice that only the 2 scalars and the field of interest are kept after restriction.

### 1. Preprocessor

We now define a preprocessor: a `MinMaxScaler` applied to the 2 input scalars and a `PCA` applied to the node coordinates of the meshes:

In [6]:
preprocessor = PlaidColumnTransformer(
    [
        ('input_scalar_scaler', WrappedPlaidSklearnTransformer(MinMaxScaler(), **config['input_scalar_scaler'])),
        ('pca_nodes', WrappedPlaidSklearnTransformer(PCA(), **config['pca_nodes'])),
    ]
)
preprocessor

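| `PlaidColumnTransformer` parameter | Value |
| --- | --- |
| `plaid_transformers` | `[('input_scalar_scaler', ...), ('pca_nodes', ...)]` |

| `MinMaxScaler` parameter | Value |
| --- | --- |
| `feature_range` | `(0, ...)` |
| `copy` | `True` |
| `clip` | `False` |

| `PCA` parameter | Value |
| --- | --- |
| `n_components` | |
| `copy` | `True` |
| `whiten` | `False` |
| `svd_solver` | `'auto'` |
| `tol` | `0.0` |
| `iterated_power` | `'auto'` |
| `n_oversamples` | `10` |
| `power_iteration_normalizer` | `'auto'` |
| `random_state` | |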


We use a `PlaidColumnTransformer` to apply independent transformations to different feature groups.

To verify this behavior, we apply the `preprocessor` to `dataset_train`:

In [7]:
preprocessed_dataset = preprocessor.fit_transform(dataset_train)
print(preprocessed_dataset)
print("scalar names =", preprocessed_dataset.get_scalar_names())
print("field names =", preprocessed_dataset.get_field_names())

Dataset(24 samples, 3 scalars, 0 time_seriess, 1 field)
scalar names = ['angle_in', 'mach_out', 'reduced_nodes_*']
field names = ['mach']


Using `MinMaxScaler`, we scaled the `angle_in` and `mach_out` features, replacing their original values. In contrast, `PCA` compressed the node coordinates and produced new scalar features named `reduced_nodes_*`, representing the PCA components. Alternatively, we could have specified `out_features_identifiers` in the `.yml` file configuring the `MinMaxScaler` block to generate new scalars without overwriting the original inputs.
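
For readers more familiar with plain scikit-learn, the following sketch shows an analogous construction on NumPy arrays using `ColumnTransformer`. It is only an analogy under stated assumptions: the data are synthetic, the column layout (2 scalar columns followed by 50 flattened coordinate columns) is invented for illustration, and `n_components=3` is an arbitrary choice. It does not use the PLAID wrappers.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
n_samples = 24
scalars = rng.normal(size=(n_samples, 2))   # stand-ins for angle_in, mach_out
coords = rng.normal(size=(n_samples, 50))   # stand-in for flattened node coordinates
X = np.hstack([scalars, coords])

# Each transformer sees only its own column group, independently of the other.
analog = ColumnTransformer(
    [
        ("input_scalar_scaler", MinMaxScaler(), [0, 1]),
        ("pca_nodes", PCA(n_components=3), slice(2, None)),
    ]
)
Xt = analog.fit_transform(X)

# 2 scaled scalars followed by 3 PCA coefficients per sample.
print(Xt.shape)
```

The first two output columns are the rescaled scalars and the remaining ones are PCA coefficients, which mirrors the split between overwritten scalars and newly created `reduced_nodes_*` features described above.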

### 2. Postprocessor

Next, we define the postprocessor, which applies PCA to the `mach` field:

In [8]:
postprocessor = WrappedPlaidSklearnTransformer(PCA(), **config['pca_mach'])
postprocessor

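| `WrappedPlaidSklearnTransformer` parameter | Value |
| --- | --- |
| `sklearn_block` | `PCA()` |
| `in_features_identifiers` | `[{'base_name': 'Base_2_2', 'name': 'mach', 'type': 'field'}]` |
| `out_features_identifiers` | `[{'name': 'reduced_mach_*', 'type': 'scalar'}]` |

| `PCA` parameter | Value |
| --- | --- |
| `n_components` | |
| `copy` | `True` |
| `whiten` | `False` |
| `svd_solver` | `'auto'` |
| `tol` | `0.0` |
| `iterated_power` | `'auto'` |
| `n_oversamples` | `10` |
| `power_iteration_normalizer` | `'auto'` |
| `random_state` | |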


### 3. TransformedTargetRegressor

The Gaussian Process regressor takes the transformed `angle_in` and `mach_out` scalars, along with the PCA coefficients of the mesh node coordinates as inputs, and predicts the PCA coefficients of the `mach` field as outputs. This is facilitated by using a `PlaidTransformedTargetRegressor`.

In [9]:
kernel = Matern(length_scale_bounds=(1e-8, 1e8), nu = 2.5)

gpr = GaussianProcessRegressor(
    kernel=kernel,
    optimizer='fmin_l_bfgs_b',
    n_restarts_optimizer=1,
    random_state=42)

reg = MultiOutputRegressor(gpr)

def length_scale_init(X):
    return np.ones(X.shape[1])

dynamics_params_factory = {'estimator__kernel__length_scale':length_scale_init}

regressor = WrappedPlaidSklearnRegressor(reg, **config['regressor_mach'], dynamics_params_factory = dynamics_params_factory)

target_regressor = PlaidTransformedTargetRegressor(
    regressor=regressor,
    transformer=postprocessor
)
target_regressor

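| `PlaidTransformedTargetRegressor` parameter | Value |
| --- | --- |
| `regressor` | `WrappedPlaidS...om_state=42)))` |
| `transformer` | `WrappedPlaidS...n_block=PCA())` |

| `GaussianProcessRegressor` parameter | Value |
| --- | --- |
| `kernel` | `Matern(length_scale=1, nu=2.5)` |
| `alpha` | `1e-10` |
| `optimizer` | `'fmin_l_bfgs_b'` |
| `n_restarts_optimizer` | `1` |
| `normalize_y` | `False` |
| `copy_X_train` | `True` |
| `n_targets` | |
| `random_state` | `42` |
| `kernel__length_scale` | `1.0` |
| `kernel__length_scale_bounds` | `(1e-08, ...)` |

| `PCA` parameter | Value |
| --- | --- |
| `n_components` | |
| `copy` | `True` |
| `whiten` | `False` |
| `svd_solver` | `'auto'` |
| `tol` | `0.0` |
| `iterated_power` | `'auto'` |
| `n_oversamples` | `10` |
| `power_iteration_normalizer` | `'auto'` |
| `random_state` | |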


`PlaidTransformedTargetRegressor` functions like scikit-learn’s `TransformedTargetRegressor` but operates directly on PLAID datasets.
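
To make the idea concrete, here is a minimal sketch of the transformed-target pattern written by hand with plain scikit-learn on NumPy arrays: compress the output field with PCA, regress the PCA coefficients, then map predictions back with `inverse_transform`. The synthetic field, the array shapes, and `n_components=5` are assumptions made for illustration, and this is not how PLAID implements the block internally.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(24, 2))                 # two scalar inputs per sample
grid = np.linspace(0.0, 1.0, 200)             # 200 hypothetical mesh nodes
# Synthetic "field": one smooth curve per sample, driven by the two inputs.
Y = np.sin(2 * np.pi * np.outer(X[:, 0], grid)) + np.outer(X[:, 1], grid**2)

# 1) Transform the target: field values -> a few PCA coefficients.
pca = PCA(n_components=5)
coeffs = pca.fit_transform(Y)

# 2) Fit one Gaussian Process per PCA coefficient.
gpr = GaussianProcessRegressor(kernel=Matern(nu=2.5), random_state=0)
reg = MultiOutputRegressor(gpr).fit(X, coeffs)

# 3) Predict coefficients and map them back to the field space.
Y_pred = pca.inverse_transform(reg.predict(X))
print(coeffs.shape, Y_pred.shape)
```

Wrapping these three steps in a single estimator is what allows the target transformation to be tuned and cross-validated together with the regressor.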

### 4. Pipeline assembly

We then define the complete pipeline as follows:

In [10]:
pipeline = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("regressor", target_regressor),
    ]
)
pipeline

| `Pipeline` parameter | Value |
| --- | --- |
| `steps` | `[('preprocessor', ...), ('regressor', ...)]` |
| `transform_input` | |
| `memory` | |
| `verbose` | `False` |


## 🎯 Optuna hyperparameter tuning

We now use Optuna to optimize hyperparameters, specifically tuning the number of components for the two `PCA` blocks using three-fold cross-validation.

In [11]:
def objective(trial):
    # Suggest hyperparameters
    nodes_n_components = trial.suggest_int("preprocessor__pca_nodes__sklearn_block__n_components", 3, 4)
    mach_n_components = trial.suggest_int("regressor__transformer__sklearn_block__n_components", 4, 5)

    # Clone and configure pipeline
    pipeline_run = clone(pipeline)
    pipeline_run.set_params(
        preprocessor__pca_nodes__sklearn_block__n_components=nodes_n_components,
        regressor__transformer__sklearn_block__n_components=mach_n_components
    )

    cv = KFold(n_splits=3, shuffle=True, random_state=42)

    scores = []

    indices = np.arange(len(dataset_train))

    for train_idx, val_idx in cv.split(indices):

        dataset_cv_train_ = dataset_train[train_idx]
        dataset_cv_val_   = dataset_train[val_idx]

        pipeline_run.fit(dataset_cv_train_)

        score = pipeline_run.score(dataset_cv_val_)

        scores.append(score)

    return np.mean(scores)

We maximize this objective over 4 trials, with the hyperparameter values of each trial proposed by Optuna.

In [12]:
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=4)
print("best_params =", study.best_params)

[I 2025-07-26 11:01:20,280] A new study created in memory with name: no-name-ee8584cc-79b2-4930-832e-f917dc67ff82
[I 2025-07-26 11:01:22,425] Trial 0 finished with value: 0.9238537953117891 and parameters: {'preprocessor__pca_nodes__sklearn_block__n_components': 4, 'regressor__transformer__sklearn_block__n_components': 5}. Best is trial 0 with value: 0.9238537953117891.
[I 2025-07-26 11:01:24,693] Trial 1 finished with value: 0.9231200462807997 and parameters: {'preprocessor__pca_nodes__sklearn_block__n_components': 3, 'regressor__transformer__sklearn_block__n_components': 5}. Best is trial 0 with value: 0.9238537953117891.
[I 2025-07-26 11:01:26,571] Trial 2 finished with value: 0.9231200122557025 and parameters: {'preprocessor__pca_nodes__sklearn_block__n_components': 3, 'regressor__transformer__sklearn_block__n_components': 5}. Best is trial 0 with value: 0.9238537953117891.
[I 2025-07-26 11:01:28,618] Trial 3 finished with value: 0.9231199879688123 and parameters: {'preprocessor__p

best_params = {'preprocessor__pca_nodes__sklearn_block__n_components': 4, 'regressor__transformer__sklearn_block__n_components': 5}
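
The long parameter names above follow scikit-learn's double-underscore convention, where each `__` descends one level into a nested estimator. The sketch below shows the same mechanism on a small, plain scikit-learn pipeline; the step names `scaler`, `pca`, and `regressor` and the use of `LinearRegression` are illustrative assumptions, not part of the notebook's pipeline.

```python
from sklearn.base import clone
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

pipe = Pipeline(
    steps=[
        ("scaler", MinMaxScaler()),
        ("pca", PCA()),
        ("regressor", LinearRegression()),
    ]
)

# "<step name>__<parameter>" addresses a parameter of a nested estimator.
params = {"pca__n_components": 3}

# Clone first so the template pipeline keeps its default parameters.
configured = clone(pipe).set_params(**params)

print(configured.named_steps["pca"].n_components)
print(pipe.named_steps["pca"].n_components)
```

Because the tuned names are ordinary `set_params` keys, a dictionary such as `study.best_params` can be unpacked directly into `set_params`.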


We retrieve the best hyperparameters found by Optuna and use them to define and fit the `optimized_pipeline`.

In [13]:
optimized_pipeline = clone(pipeline).set_params(**study.best_params)
optimized_pipeline.fit(dataset_train)
optimized_pipeline


Next, we use the fitted `optimized_pipeline` to predict on `dataset_train` and evaluate its score on the same data.

In [14]:
dataset_pred = optimized_pipeline.predict(dataset_train)
score = optimized_pipeline.score(dataset_train)
print("score =", score, ", error =", 1. - score)

score = 0.9692695269755178 , error = 0.030730473024482174


We use an anisotropic kernel in the Gaussian Process: its optimized `length_scale` is a vector with one entry per input dimension, that is, 2 (for the two input scalars) plus the number of PCA components set by `preprocessor__pca_nodes__sklearn_block__n_components`.

In [15]:
print("Dimension GP kernel length_scale =", len(optimized_pipeline.named_steps["regressor"].regressor_.sklearn_block_.estimators_[0].kernel_.get_params()['length_scale']))
print("Expected dimension =", 2 + study.best_params['preprocessor__pca_nodes__sklearn_block__n_components'])

Dimension GP kernel length_scale = 6
Expected dimension = 6
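
The same check can be reproduced with plain scikit-learn. In this sketch the data are synthetic and the input dimension of 6 is chosen only to match the output above; passing a vector as `length_scale` is what makes the `Matern` kernel anisotropic, which is presumably why the notebook builds the initial vector at fit time through `length_scale_init`.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
n_features = 6                                  # e.g. 2 scalars + 4 PCA coefficients
X = rng.uniform(size=(24, n_features))
y = np.sin(3.0 * X[:, 0]) + 0.1 * X[:, 1]       # synthetic scalar target

# One length scale per input dimension makes the kernel anisotropic.
kernel = Matern(length_scale=np.ones(n_features), nu=2.5)
gpr = GaussianProcessRegressor(kernel=kernel, random_state=0).fit(X, y)

# The fitted kernel keeps one optimized length scale per input dimension.
fitted_scales = gpr.kernel_.get_params()["length_scale"]
print(len(fitted_scales), gpr.kernel_.anisotropic)
```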


The error remains non-zero because of the truncation introduced by PCA. Since the Gaussian Process regressor interpolates its training data, the error is expected to vanish on the training set when all PCA modes are retained (here 24, the maximum for 24 samples).

In [16]:
exact_pipeline = clone(pipeline).set_params(
    preprocessor__pca_nodes__sklearn_block__n_components = 24,
    regressor__transformer__sklearn_block__n_components = 24
)
exact_pipeline.fit(dataset_train)
dataset_pred = exact_pipeline.predict(dataset_train)
score = exact_pipeline.score(dataset_train)
print("score =", score, ", error =", 1. - score)

score = 0.9999999999923527 , error = 7.64732721592054e-12
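
The PCA part of this argument can be checked in isolation with NumPy. The sketch below uses a random synthetic matrix with 24 samples (an assumption; any data would do) and computes PCA through an SVD of the centered data: reconstruction is lossy with a few modes and exact, up to round-off, once all modes are kept.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(24, 200))            # 24 samples of a synthetic 200-node field
mean = Y.mean(axis=0)

# Rows of Vt are the PCA modes of the centered data.
_, _, Vt = np.linalg.svd(Y - mean, full_matrices=False)


def reconstruct(k):
    """Project onto the first k PCA modes and map back to field space."""
    coeffs = (Y - mean) @ Vt[:k].T
    return mean + coeffs @ Vt[:k]


def rel_error(k):
    """Relative reconstruction error in the Frobenius norm."""
    return np.linalg.norm(Y - reconstruct(k)) / np.linalg.norm(Y)


for k in (5, 24):
    print(k, rel_error(k))
```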


## 🔍 GridSearchCV hyperparameter tuning

Since our pipeline blocks conform to the scikit-learn API, the constructed pipeline can be passed directly to `GridSearchCV`.

In [17]:
param_grid = {
    'preprocessor__pca_nodes__sklearn_block__n_components': [3, 4],
    'regressor__transformer__sklearn_block__n_components': [4, 5],
}

cv = KFold(n_splits=3, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid=param_grid, cv=cv, verbose=3, error_score='raise')
search.fit(dataset_train)

Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV 1/3] END preprocessor__pca_nodes__sklearn_block__n_components=3, regressor__transformer__sklearn_block__n_components=4;, score=0.936 total time=   0.7s
[CV 2/3] END preprocessor__pca_nodes__sklearn_block__n_components=3, regressor__transformer__sklearn_block__n_components=4;, score=0.913 total time=   0.7s
[CV 3/3] END preprocessor__pca_nodes__sklearn_block__n_components=3, regressor__transformer__sklearn_block__n_components=4;, score=0.921 total time=   0.6s
[CV 1/3] END preprocessor__pca_nodes__sklearn_block__n_components=3, regressor__transformer__sklearn_block__n_components=5;, score=0.936 total time=   0.6s
[CV 2/3] END preprocessor__pca_nodes__sklearn_block__n_components=3, regressor__transformer__sklearn_block__n_components=5;, score=0.913 total time=   0.9s
[CV 3/3] END preprocessor__pca_nodes__sklearn_block__n_components=3, regressor__transformer__sklearn_block__n_components=5;, score=0.921 total time=   0.6s
[CV 

| `GridSearchCV` parameter | Value |
| --- | --- |
| `estimator` | `Pipeline(step...ock=PCA())))])` |
| `param_grid` | `{'preprocessor__pca_node...arn_block__n_components': [3, 4], 'regressor__transformer...arn_block__n_components': [4, 5]}` |
| `scoring` | |
| `n_jobs` | |
| `refit` | `True` |
| `cv` | `KFold(n_split... shuffle=True)` |
| `verbose` | `3` |
| `pre_dispatch` | `'2*n_jobs'` |
| `error_score` | `'raise'` |
| `return_train_score` | `False` |


We refit the pipeline with the best parameters found by the grid search and evaluate it by computing its score on the training set.

In [18]:
print("best_params =", search.best_params_)
optimized_pipeline = clone(pipeline).set_params(**search.best_params_)
optimized_pipeline.fit(dataset_train)
dataset_pred = optimized_pipeline.predict(dataset_train)
score = optimized_pipeline.score(dataset_train)
print("score =", score, ", error =", 1. - score)

best_params = {'preprocessor__pca_nodes__sklearn_block__n_components': 4, 'regressor__transformer__sklearn_block__n_components': 5}
score = 0.9692695269794028 , error = 0.03073047302059717
