diff --git a/examples/no-priors-characterization/README.md b/examples/no-priors-characterization/README.md deleted file mode 100644 index 79fc313be..000000000 --- a/examples/no-priors-characterization/README.md +++ /dev/null @@ -1,281 +0,0 @@ -# Exploring Parameter Spaces with No-Priors Characterization - - - -> [!NOTE] The scenario -> -> You have an experiment with multiple parameters, -> and you want to understand how these parameters influence the outcome. -> **In this example, `ado`'s no-priors characterization operator is used to -> systematically sample and measure the target property across the parameter -> space using various sampling strategies aimed at covering uniformly the -> parameter space.** Using the no-priors characterization -> operator involves: -> -> 1. Defining the parameter space to explore. -> 2. Creating an `operation` that uses no-priors characterization to sample -> points using a chosen strategy. -> 3. Observing the sampling process as it measures the target output property with -> the selected strategy. - -> [!IMPORTANT] Prerequisites -> -> Get the example files and install dependencies: -> -> ```commandline -> git clone https://github.com/IBM/ado.git -> cd ado -> pip install plugins/operators/no-priors-characterization/ -> pip install examples/no-priors-characterization/custom_experiments/ -> ``` - -> [!CAUTION] -> -> All commands below assume you are running them from the -> **top-level of the `ado` repository**. - -> [!TIP] TL;DR -> -> To create a `discoveryspace` and explore it with the no-priors -> characterization operator, execute the following from the root of the `ado` -> repository: -> -> ```bash -> : # Create the space to explore based on a custom experiment -> ado create space -f \ -> examples/no-priors-characterization/example_yamls/space_reaction.yaml \ -> --new-sample-store -> : # Explore it with no-priors characterization! 
-> ado create operation -f \ -> examples/no-priors-characterization/example_yamls/op_basic_sampling.yaml \ -> --use-latest space -> ``` - - - -## What is No-Priors Characterization? - -**No-Priors Characterization** is a sampling operator designed to explore a -parameter space systematically without requiring any prior knowledge or -existing data. It's perfect for initial exploration of a system where you want -to gather representative samples across the entire parameter space. - -**Handling Existing Measurements**: If the discovery space already contains -measured entities for the target property, the operator automatically: - -- Identifies which entities have already been measured -- Excludes them from sampling, so that the operator will measure the - desired amount of entities - -The operator supports three sampling strategies: - -1. **Random Sampling (`random`)**: Uniformly random sampling across the - parameter space. Fast and simple, but may not provide optimal coverage. - -2. **Concatenated Latin Hypercube Sampling (`clhs`)**: An adaptation of Latin - Hypercube Sampling for discrete spaces. Good coverage in each dimension is - obtained by avoiding measuring parameters combinations with many common - values. Particularly effective for high-dimensional spaces. - -3. **Sobol Sampling (`sobol`)**: A quasi-random low-discrepancy sampling - method that provides better space-filling properties than pure random - sampling. It has been adapted for discrete parameter spaces. It falls back - to Concatenated Latin Hypercube Sampling when collisions are detected - during the discretization process. - - -> [!CAUTION] -> -> In the current version of no-priors characterization, if not all -> measurements produce the observed target output property specified in the -> `operation.parameters.targetOutput` field, the operation may fail or produce -> incomplete results. Ensure all experiments return the expected target property. 
- - - -The operator samples a specified number of points in batches, measures them -using the configured experiment, and stores the results in the sample store. - -## Creating a `discoveryspace` - -A `discoveryspace` describes the parameters you want to explore (`entitySpace`) -and how to measure them (`measurementSpace`). In this example, we'll use two -custom Python functions as experiments and take inspiration from the Chemistry domain: - -1. **`calculate_reaction_yield`**: Calculates chemical reaction yield based on - temperature (K), concentration (mol/L), and catalyst amount (g) using an - Arrhenius-like equation. - -2. **`calculate_material_strength`**: Calculates material tensile strength (MPa) - based on composition percentages, temperature (°C), and grain size (μm) using - a Hall-Petch relationship. - -First, create the `discoveryspace` by executing this command from the repository -root: - -```commandline -ado create space -f \ - examples/no-priors-characterization/example_yamls/space_reaction.yaml \ - --new-sample-store -``` - -This will create a new space and a sample store to hold the measurement results. -The output will be similar to: - -```terminaloutput -Success! Created space with identifier: space-bfed2d-19b49a -``` - -## Exploring with a No-Priors Characterization Operation - -Next, we will run an `operation` that uses no-priors characterization to -explore the `discoveryspace`. 
We provide three example configurations with -different sampling strategies: - -### Basic Sampling with CLHS - -The configuration for a basic sampling operation using CLHS is in -`op_basic_sampling.yaml`: - - - -```yaml -{% - include-markdown "./example_yamls/op_basic_sampling.yaml" -%} -``` - - -To run the operation, execute: - - - -```commandline -ado create operation -f \ - examples/no-priors-characterization/example_yamls/op_basic_sampling.yaml \ - --use-latest space -``` - - - -### Exploration with Random Sampling - -For an exploration with random sampling (uses random sampling with 20 samples -and batch size of 5 for quick initial exploration): - -```commandline -ado create operation -f \ - examples/no-priors-characterization/example_yamls/op_quick_exploration.yaml \ - --use-latest space -``` - -**Note**: Each operation samples different points from the space based on its -strategy and parameters, even when using the same discovery space. - -### Thorough Coverage with Sobol Sequence - -For comprehensive coverage using Sobol sequences (uses Sobol sampling with 100 -samples and batch size of 1 for detailed parameter space coverage): - -```commandline -ado create operation -f \ - examples/no-priors-characterization/example_yamls/op_thorough_coverage.yaml \ - --use-latest space -``` - -### What to Expect in the Terminal - -You will see output as the no-priors characterization operator samples and -measures points. The key stages are: - -#### Initialization - -The operator will log the start of the sampling process: - - - -```commandline -2026-03-09 16:30:00,000 INFO MainThread no_priors_characterization.operator: Starting no-priors characterization with 30 samples using clhs strategy -``` - - - -#### Sampling and Measurement - -For each batch of points, you will see output indicating the experiments being -submitted and completed: - - - -```commandline -(RandomWalk pid=82843) Continuous batching: SUBMIT EXPERIMENT. 
Submitted experiment custom_experiments.calculate_reaction_yield for temperature.353-concentration.4.1-catalyst_amount.4.5. Request identifier: c72090 -(RandomWalk pid=82843) -(RandomWalk pid=82843) Continuous batching: SUMMARY. Entities sampled and submitted: 2. Experiments completed: 1 Waiting on 1 active requests. There are 0 dependent experiments -(RandomWalk pid=82843) Continuous Batching: EXPERIMENT COMPLETION. Received finished notification for experiment in measurement request in group 1: request-c72090-experiment-calculate_reaction_yield-entities-temperature.353-concentration.4.1-catalyst_amount.4.5 (no_priors_characterization)-requester-randomwalk-1.6.1.dev9+03a65e7b.dirty-9a277d-time-2026-03-10 11:43:11.066810+00:00 -``` - - - -#### Completion - -The operation will end with a success message: - - - -```commandline -Success! Created operation with identifier operation-no-priors-characterization-v0.1-8b23a245 and it finished successfully. -``` - - - -## Looking at the `operation` output - -After the operation completes, you can view the sampled entities and their -measured values. - -You can see the relationship between the space and operations with: - -```commandline -ado show related space --use-latest -``` - -This will show the `discoveryspace` and the operations that were run. -To see the entities of the space that have been measured, you can run: - - - -```commandline -ado show entities space --use-latest -``` - - - -This will display a table of the entities sampled and their measured reaction -yield values. 
- - - -```text -┌───────┬──────────────────────────────────────────────────────────┬────────────────────────────┬─────────────────────────────────────────────┬─────────────┬───────────────┬─────────────────┬──────────┐ -│ INDEX │ identifier │ generatorid │ experiment_id │ temperature │ concentration │ catalyst_amount │ yield │ -├───────┼──────────────────────────────────────────────────────────┼────────────────────────────┼─────────────────────────────────────────────┼─────────────┼───────────────┼─────────────────┼──────────┤ -│ 0 │ temperature.300-concentration.1.0-catalyst_amount.2.0 │ no_priors_characterization │ custom_experiments.calculate_reaction_yield │ 300 │ 1.0 │ 2.0 │ 45.23 │ -│ 1 │ temperature.350-concentration.2.5-catalyst_amount.5.0 │ no_priors_characterization │ custom_experiments.calculate_reaction_yield │ 350 │ 2.5 │ 5.0 │ 78.91 │ -│ 2 │ temperature.400-concentration.0.5-catalyst_amount.1.0 │ no_priors_characterization │ custom_experiments.calculate_reaction_yield │ 400 │ 0.5 │ 1.0 │ 92.15 │ -│ ... │ ... │ ... │ ... │ ... │ ... │ ... │ ... │ -└───────┴──────────────────────────────────────────────────────────┴────────────────────────────┴─────────────────────────────────────────────┴─────────────┴───────────────┴─────────────────┴──────────┘ -``` - - - -## Takeaways - -- **Systematic Exploration**: The no-priors characterization operator provides - systematic sampling of parameter spaces without requiring prior knowledge. -- **Multiple Strategies**: Choose from random, Sobol, or CLHS sampling based on - your needs for speed vs. coverage quality. -- **Flexible Configuration**: Adjust the number of samples and batch size to - balance thoroughness with computational resources. -- **Foundation for Further Analysis**: The sampled data can serve as a - foundation for building surrogate models or for use with other operators like - TRIM. 
diff --git a/examples/no-priors-characterization/custom_experiments/no_priors_custom_experiments/experiments.py b/examples/no-priors-characterization/custom_experiments/no_priors_custom_experiments/experiments.py deleted file mode 100644 index 8d426ef38..000000000 --- a/examples/no-priors-characterization/custom_experiments/no_priors_custom_experiments/experiments.py +++ /dev/null @@ -1,176 +0,0 @@ -# Copyright IBM Corporation 2025, 2026 -# SPDX-License-Identifier: MIT - -from typing import Literal - -import numpy as np - -from orchestrator.modules.actuators.custom_experiments import custom_experiment -from orchestrator.schema.domain import PropertyDomain, VariableTypeEnum -from orchestrator.schema.property import ConstitutiveProperty - -# --------------------------- -# Properties for Reaction Yield -# --------------------------- - -temperature = ConstitutiveProperty( - identifier="temperature", - propertyDomain=PropertyDomain( - variableType=VariableTypeEnum.CONTINUOUS_VARIABLE_TYPE, - domainRange=[273, 473], # 0-200°C in Kelvin - ), -) - -concentration = ConstitutiveProperty( - identifier="concentration", - propertyDomain=PropertyDomain( - variableType=VariableTypeEnum.CONTINUOUS_VARIABLE_TYPE, - domainRange=[0.1, 5.0], # mol/L - ), -) - -catalyst_amount = ConstitutiveProperty( - identifier="catalyst_amount", - propertyDomain=PropertyDomain( - variableType=VariableTypeEnum.CONTINUOUS_VARIABLE_TYPE, - domainRange=[0.0, 10.0], # grams - ), -) - -# --------------------------- -# Properties for Material Strength -# --------------------------- - -composition_a = ConstitutiveProperty( - identifier="composition_a", - propertyDomain=PropertyDomain( - variableType=VariableTypeEnum.CONTINUOUS_VARIABLE_TYPE, - domainRange=[0, 100], # percentage - ), -) - -composition_b = ConstitutiveProperty( - identifier="composition_b", - propertyDomain=PropertyDomain( - variableType=VariableTypeEnum.CONTINUOUS_VARIABLE_TYPE, - domainRange=[0, 100], # percentage - ), -) - 
-temperature_celsius = ConstitutiveProperty( - identifier="temperature_celsius", - propertyDomain=PropertyDomain( - variableType=VariableTypeEnum.CONTINUOUS_VARIABLE_TYPE, - domainRange=[-50, 200], # Celsius - ), -) - -grain_size = ConstitutiveProperty( - identifier="grain_size", - propertyDomain=PropertyDomain( - variableType=VariableTypeEnum.CONTINUOUS_VARIABLE_TYPE, - domainRange=[1, 100], # micrometers - ), -) - -# --------------------------- -# Reaction Yield Experiment -# --------------------------- - - -@custom_experiment( - required_properties=[temperature, concentration, catalyst_amount], - output_property_identifiers=["yield"], -) -def calculate_reaction_yield( - temperature: float, concentration: float, catalyst_amount: float -) -> dict[Literal["yield"], float]: - """ - Calculate chemical reaction yield using Arrhenius-like equation with catalyst effect. - - The yield is calculated using: - k = A * exp(-Ea / (R * T)) * (1 + 0.1 * catalyst_amount) - yield = 100 * (1 - exp(-k * concentration * time)) - - where: - A = 1e10 (pre-exponential factor) - Ea = 50000 J/mol (activation energy) - R = 8.314 J/(mol·K) (gas constant) - time = 3600 s (reaction time) - - Args: - temperature: Reaction temperature in Kelvin - concentration: Reactant concentration in mol/L - catalyst_amount: Catalyst amount in grams - - Returns: - dict: Dictionary containing the calculated yield as a percentage (0-100) - """ - A = 1e10 # pre-exponential factor - Ea = 50000 # J/mol, activation energy - R = 8.314 # J/(mol·K), gas constant - time = 3600 # seconds, reaction time - - # Calculate rate constant with catalyst effect - k = A * np.exp(-Ea / (R * temperature)) * (1 + 0.1 * catalyst_amount) - - # Calculate yield - reaction_yield = 100 * (1 - np.exp(-k * concentration * time)) - - # Ensure yield is between 0 and 100 - reaction_yield = np.clip(reaction_yield, 0, 100) - - return {"yield": float(reaction_yield)} - - -# --------------------------- -# Material Strength Experiment -# 
--------------------------- - - -@custom_experiment( - required_properties=[composition_a, composition_b, temperature_celsius, grain_size], - output_property_identifiers=["tensile_strength"], -) -def calculate_material_strength( - composition_a: float, - composition_b: float, - temperature_celsius: float, - grain_size: float, -) -> dict[Literal["tensile_strength"], float]: - """ - Calculate material tensile strength using Hall-Petch relationship with composition effects. - - The strength is calculated using: - base_strength = composition_a * 500 + composition_b * 300 + (100 - composition_a - composition_b) * 200 - temp_factor = 1 - 0.002 * (temperature_celsius - 20) - grain_factor = 1 + 100 / sqrt(grain_size) - tensile_strength = base_strength * temp_factor * grain_factor / 1000 - - Args: - composition_a: Percentage of component A (0-100) - composition_b: Percentage of component B (0-100) - temperature_celsius: Testing temperature in Celsius - grain_size: Grain size in micrometers - - Returns: - dict: Dictionary containing the calculated tensile strength in MPa - """ - # Calculate base strength from composition - composition_c = 100 - composition_a - composition_b - base_strength = composition_a * 500 + composition_b * 300 + composition_c * 200 - - # Temperature effect (strength decreases with temperature) - temp_factor = 1 - 0.002 * (temperature_celsius - 20) - temp_factor = np.clip(temp_factor, 0.1, 2.0) # Prevent unrealistic values - - # Hall-Petch relationship (strength increases with smaller grain size) - grain_factor = 1 + 100 / np.sqrt(grain_size) - - # Calculate final tensile strength in MPa - tensile_strength = base_strength * temp_factor * grain_factor / 1000 - - # Ensure positive strength - tensile_strength = np.maximum(tensile_strength, 0) - - return {"tensile_strength": float(tensile_strength)} diff --git a/examples/no-priors-characterization/custom_experiments/pyproject.toml b/examples/no-priors-characterization/custom_experiments/pyproject.toml 
deleted file mode 100644 index 2c21edf05..000000000 --- a/examples/no-priors-characterization/custom_experiments/pyproject.toml +++ /dev/null @@ -1,19 +0,0 @@ -[project] -name = "no_priors_custom_experiments" -description = "A set of custom experiments used to test No-Priors Characterization Operation" -dependencies = [ - "ado-core", - "numpy", -] -dynamic = ["version"] - -[project.entry-points."ado.custom_experiments"] -# This should be python file with your decorated function(s). -no_priors_experiments = "no_priors_custom_experiments.experiments" - -[build-system] -requires = ["setuptools", "setuptools_scm"] -build-backend = "setuptools.build_meta" - -[tool.setuptools_scm] -root = "../../../" diff --git a/examples/no-priors-characterization/example_yamls/op_basic_sampling.yaml b/examples/no-priors-characterization/example_yamls/op_basic_sampling.yaml deleted file mode 100644 index 566ecf4e1..000000000 --- a/examples/no-priors-characterization/example_yamls/op_basic_sampling.yaml +++ /dev/null @@ -1,13 +0,0 @@ -# Copyright IBM Corporation 2025, 2026 -# SPDX-License-Identifier: MIT -spaces: - - space-c8717f-3a68bf -operation: - module: - operationType: characterize - operatorName: no_priors_characterization - parameters: - targetOutput: yield - samples: 30 - batchSize: 1 - sampling_strategy: clhs diff --git a/examples/no-priors-characterization/example_yamls/op_quick_exploration.yaml b/examples/no-priors-characterization/example_yamls/op_quick_exploration.yaml deleted file mode 100644 index 1d5bea309..000000000 --- a/examples/no-priors-characterization/example_yamls/op_quick_exploration.yaml +++ /dev/null @@ -1,13 +0,0 @@ -# Copyright IBM Corporation 2025, 2026 -# SPDX-License-Identifier: MIT -spaces: - - space-c8717f-3a68bf -operation: - module: - operationType: characterize - operatorName: no_priors_characterization - parameters: - targetOutput: yield - samples: 20 - batchSize: 5 - sampling_strategy: random diff --git 
a/examples/no-priors-characterization/example_yamls/op_thorough_coverage.yaml b/examples/no-priors-characterization/example_yamls/op_thorough_coverage.yaml deleted file mode 100644 index a2026f891..000000000 --- a/examples/no-priors-characterization/example_yamls/op_thorough_coverage.yaml +++ /dev/null @@ -1,13 +0,0 @@ -# Copyright IBM Corporation 2025, 2026 -# SPDX-License-Identifier: MIT -spaces: - - space-c8717f-3a68bf -operation: - module: - operationType: characterize - operatorName: no_priors_characterization - parameters: - targetOutput: yield - samples: 100 - batchSize: 1 - sampling_strategy: sobol diff --git a/examples/no-priors-characterization/example_yamls/space_reaction.yaml b/examples/no-priors-characterization/example_yamls/space_reaction.yaml deleted file mode 100644 index eea63704b..000000000 --- a/examples/no-priors-characterization/example_yamls/space_reaction.yaml +++ /dev/null @@ -1,21 +0,0 @@ -# Copyright IBM Corporation 2025, 2026 -# SPDX-License-Identifier: MIT -sampleStoreIdentifier: 3a68bf -metadata: - name: reaction_yield_space -entitySpace: - - identifier: temperature - propertyDomain: - domainRange: [273, 473] - interval: 10 - - identifier: concentration - propertyDomain: - domainRange: [0.1, 5.0] - interval: 0.2 - - identifier: catalyst_amount - propertyDomain: - domainRange: [0.0, 10.0] - interval: 0.5 -experiments: - - actuatorIdentifier: custom_experiments - experimentIdentifier: calculate_reaction_yield diff --git a/examples/trim/example_yamls/randomwalk_clhs_operation.yaml b/examples/trim/example_yamls/randomwalk_clhs_operation.yaml new file mode 100644 index 000000000..35efa1b2e --- /dev/null +++ b/examples/trim/example_yamls/randomwalk_clhs_operation.yaml @@ -0,0 +1,26 @@ +# Copyright IBM Corporation 2025, 2026 +# SPDX-License-Identifier: MIT +metadata: + name: 'randomwalk-clhs' + description: 'Perform a random walk using Concatenated Latin Hypercube Sampling (CLHS) for better space coverage' +spaces: + - space-2fa5d0-2905f9 +operation: +
module: + operatorName: "random_walk" + operationType: "search" + parameters: + numberEntities: 20 + batchSize: 5 + singleMeasurement: true + samplerConfig: + module: + moduleName: trim.samplers.no_priors_sampler + moduleClass: NoPriorsSampleSelector + parameters: + targetOutput: pressure + samples: 20 + batchSize: 1 + sampling_strategy: clhs + +# Made with Bob diff --git a/examples/trim/example_yamls/randomwalk_sobol_operation.yaml b/examples/trim/example_yamls/randomwalk_sobol_operation.yaml new file mode 100644 index 000000000..19ba89791 --- /dev/null +++ b/examples/trim/example_yamls/randomwalk_sobol_operation.yaml @@ -0,0 +1,26 @@ +# Copyright IBM Corporation 2025, 2026 +# SPDX-License-Identifier: MIT +metadata: + name: 'randomwalk-sobol' + description: 'Perform a random walk using Sobol quasi-random sampling for better space coverage' +spaces: + - space-2fa5d0-2905f9 +operation: + module: + operatorName: "random_walk" + operationType: "search" + parameters: + numberEntities: 20 + batchSize: 5 + singleMeasurement: true + samplerConfig: + module: + moduleName: trim.samplers.no_priors_sampler + moduleClass: NoPriorsSampleSelector + parameters: + targetOutput: pressure + samples: 20 + batchSize: 1 + sampling_strategy: sobol + +# Made with Bob diff --git a/plugins/operators/no-priors-characterization/README.md b/plugins/operators/no-priors-characterization/README.md deleted file mode 100644 index f7bdf3c03..000000000 --- a/plugins/operators/no-priors-characterization/README.md +++ /dev/null @@ -1,46 +0,0 @@ -# ADO No-Priors Characterization Operator - -`ado-no-priors-characterization` is an operator plugin for the -[Accelerated Discovery Orchestrator (ADO)](https://github.com/IBM/ado), -providing initial exploration of discovery spaces using high-dimensional -sampling strategies. - -**No-Priors Characterization** is designed for unbiased exploration when no -measured data exists yet, establishing an initial dataset for subsequent -model-based exploration. 
- -## How it Works - -The `No-Priors Characterization` operator uses different sampling strategies -to ensure good coverage of the discovery space: - -- **`random`**: Random sampling across the space for unbiased exploration. - This provides the baseline sampling approach. -- **`clhs`** (Concatenated Latin Hypercube Sampling): Ensures uniform coverage - by enforcing stratification in each dimension independently. Each dimension - cycles through all possible values before repeating. -- **`sobol`**: Sobol sequence sampling for quasi-random low-discrepancy coverage - -The operator retrieves already-measured entities from the discovery space, -orders the unmeasured entities using the specified sampling strategy, -and yields entities in batches -for measurement. - -## Installation - -You can install the `No-Priors Characterization` operator and its dependencies -(including `ado-core`) directly from PyPI: - -```bash -pip install ado-no-priors-characterization -``` - -## More Information - -To learn more about No-Priors Characterization and explore the full -capabilities of ADO, including detailed documentation, configuration guides, and -additional examples, visit the official ADO website: - -- **No-Priors Quickstart**: -- **Configuring No-Priors**: -- **ADO Documentation**: diff --git a/plugins/operators/no-priors-characterization/pyproject.toml b/plugins/operators/no-priors-characterization/pyproject.toml deleted file mode 100644 index d9df0afcb..000000000 --- a/plugins/operators/no-priors-characterization/pyproject.toml +++ /dev/null @@ -1,30 +0,0 @@ -[project] -name = "ado-no-priors-characterization" -description = "No-priors characterization operator for sampling discovery spaces using high-dimensional sampling strategies" -readme = "README.md" -requires-python = ">=3.10,<3.14" -dependencies = [ - "ado-core", - "numpy", - "pandas>=2.2.0", - "scipy", -] -dynamic = ["version"] - -[project.entry-points."ado.operators"] -no-priors-characterization = 
"no_priors_characterization.operator" - -[build-system] -requires = ["setuptools", "setuptools_scm"] -build-backend = "setuptools.build_meta" - -[tool.setuptools.packages.find] -where = ["src"] -include = ["no_priors_characterization*"] - -[tool.setuptools_scm] -root = "../../../" -local_scheme = "node-and-timestamp" - -[tool.uv.sources] -ado-core = { workspace = true } diff --git a/plugins/operators/no-priors-characterization/src/no_priors_characterization/__init__.py b/plugins/operators/no-priors-characterization/src/no_priors_characterization/__init__.py deleted file mode 100644 index ff303fb0e..000000000 --- a/plugins/operators/no-priors-characterization/src/no_priors_characterization/__init__.py +++ /dev/null @@ -1,6 +0,0 @@ -# Copyright IBM Corporation 2025, 2026 -# SPDX-License-Identifier: MIT - -from no_priors_characterization.operator import no_priors_characterization - -__all__ = ["no_priors_characterization"] diff --git a/plugins/operators/no-priors-characterization/src/no_priors_characterization/operator.py b/plugins/operators/no-priors-characterization/src/no_priors_characterization/operator.py deleted file mode 100644 index b6c26b592..000000000 --- a/plugins/operators/no-priors-characterization/src/no_priors_characterization/operator.py +++ /dev/null @@ -1,109 +0,0 @@ -# Copyright IBM Corporation 2025, 2026 -# SPDX-License-Identifier: MIT - -import logging -from importlib.metadata import version - -from no_priors_characterization.no_priors_pydantic import NoPriorsParameters -from orchestrator.core.discoveryspace.space import DiscoverySpace -from orchestrator.core.operation.config import FunctionOperationInfo -from orchestrator.core.operation.operation import OperationOutput -from orchestrator.modules.operators.collections import characterize_operation - -logger = logging.getLogger(__name__) - - -@characterize_operation( - name="no_priors_characterization", - configuration_model=NoPriorsParameters, - 
configuration_model_default=NoPriorsParameters(targetOutput="default_target"), - description=""" - No-priors characterization samples a discovery space using high-dimensional - sampling strategies (random, CLHS, Sobol, etc.) without relying on prior - model knowledge or feature importance. This operator is useful for initial - exploration of discovery spaces when no training data exists yet. - """, - version=version("ado-no-priors-characterization"), -) -def no_priors_characterization( - discoverySpace: DiscoverySpace = None, # type: ignore[name-defined] - operationInfo: FunctionOperationInfo | None = None, - **kwargs: object, -) -> OperationOutput: - """ - Execute no-priors characterization on a discovery space. - - Samples entities using high-dimensional sampling strategies without requiring - prior model training or feature importance information. Useful for initial - characterization when no measured data exists. - - Args: - discoverySpace: The discovery space to characterize - operationInfo: Optional operation metadata - **kwargs: Additional parameters validated against NoPriorsParameters model - - Returns: - OperationOutput containing the operation resources and metadata - """ - # Lazy import to avoid circular import issues during plugin loading - import orchestrator.modules.operators.randomwalk # noqa: F401 — registers explore.random_walk - from orchestrator.modules.operators.collections import explore - from orchestrator.modules.operators.randomwalk import ( - CustomSamplerConfiguration, - RandomWalkParameters, - SamplerModuleConf, - ) - - random_walk = explore.operators["random_walk"].function - - params = NoPriorsParameters.model_validate(kwargs) - logger.info( - f"No-priors characterization starts. 
Target variable = {params.targetOutput}" - ) - logger.info(f"Parameters: {params}") - - # Configure the no-priors sampler - no_priors_module = SamplerModuleConf( - moduleClass="NoPriorsSampleSelector", - moduleName="no_priors_characterization.no_priors_sampler", - ) - - no_priors_sampler_config = CustomSamplerConfiguration( - module=no_priors_module, parameters=params - ) - - no_priors_random_walk_params = RandomWalkParameters( - samplerConfig=no_priors_sampler_config, - batchSize=params.batchSize, - numberEntities=params.samples, - singleMeasurement=True, - ) - - # Execute the random walk with the no-priors sampler - from orchestrator.core.metadata import ConfigurationMetadata - - # Create metadata with custom fields for tracking no-priors parameters - metadata = ConfigurationMetadata( - name="No-priors characterization", - description=f"No-priors characterization using {params.sampling_strategy} strategy with {params.samples} samples", - ) - # Add custom fields using extra="allow" in ConfigurationMetadata - metadata.sampling_strategy = params.sampling_strategy # type: ignore[attr-defined] - metadata.samples = params.samples # type: ignore[attr-defined] - - updated_operation_info = FunctionOperationInfo( - metadata=metadata, - actuatorConfigurationIdentifiers=( - operationInfo.actuatorConfigurationIdentifiers if operationInfo else [] - ), - ) - - op_output = random_walk( - discoverySpace=discoverySpace, - operationInfo=updated_operation_info, - **no_priors_random_walk_params.model_dump(), - ) - - logger.info("No-priors characterization completed") - - return op_output diff --git a/plugins/operators/no-priors-characterization/src/no_priors_characterization/utils/__init__.py b/plugins/operators/no-priors-characterization/src/no_priors_characterization/utils/__init__.py deleted file mode 100644 index deb2683da..000000000 --- a/plugins/operators/no-priors-characterization/src/no_priors_characterization/utils/__init__.py +++ /dev/null @@ -1,33 +0,0 @@ -# Copyright IBM 
Corporation 2025, 2026 -# SPDX-License-Identifier: MIT - -# Export commonly used utilities for easier imports -from no_priors_characterization.utils.high_dimensional_sampling import ( - concatenated_latin_hypercube_sampling, - get_sampling_indices_multi_dimensional, -) -from no_priors_characterization.utils.one_dimensional_sampling import ( - get_index_list_ordered_partitions, - get_index_list_van_der_corput, -) -from no_priors_characterization.utils.space_df_connector import ( - get_df_all_entities_no_measurements, - get_list_of_entities_from_df_and_space, - get_project_context, - get_source_and_target, - get_space, -) -from orchestrator.utilities.pandas import sort_rows_by_column_names - -__all__ = [ - "concatenated_latin_hypercube_sampling", - "get_df_all_entities_no_measurements", - "get_index_list_ordered_partitions", - "get_index_list_van_der_corput", - "get_list_of_entities_from_df_and_space", - "get_project_context", - "get_sampling_indices_multi_dimensional", - "get_source_and_target", - "get_space", - "sort_rows_by_column_names", -] diff --git a/plugins/operators/no-priors-characterization/src/no_priors_characterization/utils/high_dimensional_sampling.py b/plugins/operators/no-priors-characterization/src/no_priors_characterization/utils/high_dimensional_sampling.py deleted file mode 100644 index a77bbabb4..000000000 --- a/plugins/operators/no-priors-characterization/src/no_priors_characterization/utils/high_dimensional_sampling.py +++ /dev/null @@ -1,338 +0,0 @@ -# Copyright IBM Corporation 2025, 2026 -# SPDX-License-Identifier: MIT - -import logging -import math -import random -from typing import Literal - -import numpy as np -from scipy.stats.qmc import Sobol - -from no_priors_characterization.utils.one_dimensional_sampling import ( - get_index_list_van_der_corput, -) - -logger_high_dimensional = logging.getLogger(__name__) - - -def concatenated_latin_hypercube_sampling( - dimensions: list[int], - final_sample_size: int, - seed: int | None = None, -) -> 
list[list[int]]: - """ - Generates samples using a Concatenated Latin Hypercube Sampling strategy. - - For each dimension independently, this method enforces a 1D stratification - (Latin Hypercube property) by generating random permutations of the - possible values. If the number of requested samples 'final_sample_size' exceeds the cardinality - of a dimension, new random permutations are concatenated to the sequence. - - This guarantees that for any dimension j with size d_j, every sequence - of d_j samples contains exactly one instance of every value in range(d_j). - - Args: - dimensions (List[int]): Cardinality (size) of each dimension. Must be positive. - final_sample_size (int): Total number of points to sample. - seed (Optional[int]): Optional PRNG seed for reproducibility. - - Returns: - List[List[int]]: A list of final_sample_size sampled points, where each point is a - list of indices corresponding to the dimensions. - - Raises: - ValueError: If any dimension size is less than 1. - """ - if any(d <= 0 for d in dimensions): - raise ValueError( - f"All dimensions must be >= 1, received dimensions={dimensions}" - ) - - if final_sample_size <= 0: - return [] - - # Use default RNG when seed is not provided, otherwise create seeded instance - rng = random.Random() if seed is None else random.Random(seed) # noqa: S311 - - # Per-dimension pools: active permutation for the current block. - # We maintain the Latin Hypercube property by sampling without replacement. - pools: list[list[int]] = [list(range(d)) for d in dimensions] - samples: list[list[int]] = [] - - for _ in range(final_sample_size): - point: list[int] = [] - for j, d in enumerate(dimensions): - # If the current permutation block is exhausted, start a new one (new cycle). - if not pools[j]: - pools[j] = list(range(d)) - - # Select a random element from the remaining pool for this block. 
- k = rng.randrange(len(pools[j])) - value = pools[j].pop(k) - point.append(value) - - samples.append(point) - - return samples - - -# NOTE: preliminary tests on collision reveal that if final_sample_size is half of the product of dimensions collisions are rare -def sobol_sampling( - dimensions: list[int], final_sample_size: int, seed: int | None = None -) -> list[list[int]]: - """ - Generates Sobol sampled points scaled to integer dimensions. - - This function uses a Sobol sequence to generate points in the unit hypercube [0, 1)^d, - scales them to the specified integer dimensions, and checks for collisions. If collisions - occur (duplicate points), it falls back to Concatenated Latin Hypercube Sampling. - - Args: - dimensions (list[int]): A list of integers representing the size (cardinality) of each dimension. - final_sample_size (int): The number of points to sample. - seed (int | None, optional): Random seed for the Sobol scrambler. Defaults to None. - - Returns: - list[list[int]]: A list of final_sample_size points, where each point is a list of integer coordinates. - """ - # Sobol generates points in [0, 1). We scale them to the integer dimensions. 
- - sampler = Sobol(d=len(dimensions), scramble=True, rng=seed) - points = sampler.random(final_sample_size) - - # Scale and floor to get integer indices - discrete_points = [ - [int(val * d) for val, d in zip(p, dimensions, strict=True)] for p in points - ] - - # Check for collisions - # Convert inner lists to tuples because lists are unhashable and cannot be used in a set - unique_points = {tuple(p) for p in discrete_points} - n_collisions = final_sample_size - len(unique_points) - - if n_collisions > 0: - logger_high_dimensional.error( - f"Sobol sampling failed, {n_collisions} collisions detected, defaulting to clhs sampling" - ) - return concatenated_latin_hypercube_sampling( - dimensions=dimensions, final_sample_size=final_sample_size, seed=seed - ) - - return discrete_points - - -# TODO: test this function -def distinct_sobol_sampling( - dimensions: list[int], final_sample_size: int, seed: int | None = None -) -> list[list[int]]: - """ - Generates 'n' distinct points on a grid of size 'dimensions' using a Sobol sequence. - Guarantees no collisions by skipping duplicates in the sequence. - """ - # 1. Safety Check: Is the grid big enough? - total_capacity = np.prod(dimensions) - if final_sample_size > total_capacity: - raise ValueError( - f"Cannot generate {final_sample_size} distinct points: Grid only has {total_capacity} cells." - ) - - # 2. Setup Sobol - # We scramble to get better coverage. - sampler = Sobol(d=len(dimensions), scramble=True, rng=seed) - - unique_points = set() - results = [] - - # 3. Iterative Generation - # We generate in batches to be efficient. - # Start with a batch larger than N to account for potential rejections. 
- batch_size = max(final_sample_size * 2, 64) - - while len(results) < final_sample_size: - # Draw a batch of float points [0, 1) - raw_points = sampler.random(batch_size) - - for p in raw_points: - # Discretize: Map [0, 1) -> Integer coordinates - # Using int(x * dim) scales it to the grid index [0, dim-1] - coord = tuple([int(p[i] * dimensions[i]) for i in range(len(dimensions))]) - - # Check Uniqueness - if coord in unique_points: - continue - - unique_points.add(coord) - results.append(list(coord)) - - # Stop immediately if we have enough - if len(results) == final_sample_size: - return results - - # If we need more points, increase batch size for next iteration - # (helpful if the grid is nearly full and collisions are frequent) - batch_size *= 2 - - return results - - -def random_high_dimensional_sampling( - dimensions: list[int], final_sample_size: int, seed: int | None = None -) -> list[list[int]]: - """ - Generate n unique random samples from a high-dimensional space. - - Args: - dimensions: Cardinality (size) of each dimension. Must be positive. - final_sample_size: Total number of points to sample. - seed: Optional PRNG seed for reproducibility. - - Returns: - List of final_sample_size sampled points, each point is a list of indices - - Raises: - ValueError: If final_sample_size exceeds the total number of possible configurations - """ - import itertools - import random - from math import prod - - # Set the seed for the random number generator - if seed is not None: - random.seed(seed) - - # Check if the number of requested samples is valid - num_configs = prod(dimensions) - if final_sample_size > num_configs: - raise ValueError( - f"Cannot generate {final_sample_size} unique samples. " - f"The sample space only contains {num_configs} possibilities." - ) - - # This still creates all combinations in memory, which is a limitation - # for extremely large dimensional spaces. 
- configs = list(itertools.product(*[range(d) for d in dimensions])) - - # Ensure we don't try to sample more than available - actual_sample_size = min(final_sample_size, len(configs)) - if actual_sample_size < final_sample_size: - import logging - - logger = logging.getLogger(__name__) - logger.warning( - f"Requested {final_sample_size} samples but only {len(configs)} unique " - f"configurations available. Sampling {actual_sample_size} instead." - ) - - # random.sample is highly optimized for this task. - # It's much faster than manually choosing and removing. - samples = random.sample(configs, actual_sample_size) - - return [list(s) for s in samples] - - -def get_sampling_indices_multi_dimensional( - dimensions: list[int], - n: int | Literal["all", "max"], - space: dict[str, int] | None = None, - strategy: Literal["random", "clhs", "sobol"] = "clhs", - seed: int | None = None, -) -> list[list[int]]: - """ - Generate sampling indices for a high-dimensional space using `get_index_list_van_der_corput` for each dimension. - - Args: - dimensions (List[int]): Sizes of each dimension (e.g., [8, 5]). - n (int | str): Number of points to sample: - - 'all': sample all possible combinations (product of dimensions) - - 'max': sample up to max(dimensions) - strategy (str): sampling subroutine: - - 'random': selects random points from the beginning - - 'clhs': refer to concatenated_latin_hypercube_sampling - - 'sobol': sobol sampling - - space (Optional[Dict[str, int]]): Optional mapping of dimension names to sizes (used only for logging/debug purposes). - Example: - space = {'batch_size': 8, 'model_name': 5} - seed (Optional[int]): controls the randomness - - note: strategies may have an upper bound on the number of elements that respect the strategy that they can return - if this number is exceeded, they resort to random sampling. - - Returns: - List[List[int]]: Outer list length = n (or product of dimensions if n='all'). 
- Each inner list contains one sampled combination across dimensions. - """ - - # Set the seed for the random number generator - if seed is not None: - random.seed(seed) - - # Log space details if provided - if space: - indices_dict = { - k: get_index_list_van_der_corput(v, v) for k, v in space.items() - } - if [len(indices) for indices in list(indices_dict.values())] != dimensions: - logger_high_dimensional.error( - f"A space dict has been provided ->{space}. It is inconsistent with dimensions={dimensions}" - ) - logger_high_dimensional.warning( - f"list(indices_dict.values()) = {list(indices_dict.values())}" - ) - raise ValueError("Space has inconsistent dimensions!") - logger_high_dimensional.info( - "Sampling indices for each named dimension (ordered low to high): %s", - indices_dict, - ) - - # Compute sampling orders for each dimension - orders = [get_index_list_van_der_corput(v, v) for v in dimensions] - - if logger_high_dimensional.isEnabledFor(logging.DEBUG): - logger_high_dimensional.debug("Dimensions: %s", dimensions) - logger_high_dimensional.debug("Sampling orders for each dimension:") - for i, o in enumerate(orders): - logger_high_dimensional.debug("Dimension %d order: %s", i, o) - - # Calculate maximum possible samples - maximum_n = 1 - for d in dimensions: - maximum_n *= d - lcm = math.lcm(*dimensions) - - if lcm != maximum_n: - logger_high_dimensional.debug( - "Periodicity detected, the sampling subroutine will ensure that you will not sample " - "the same configuration more than once." - ) - - if isinstance(n, str): - if n == "all": - n = maximum_n - elif n == "max": - n = max(dimensions) - else: - raise ValueError(f"Unrecognized string for n: {n}") - - if n > maximum_n: - logger_high_dimensional.warning( - f"Maximal sample size is {maximum_n}, you requested {n} sampling prescriptions. "
- f"Elaborating prescription for n_samples = {maximum_n}" - ) - - logger_high_dimensional.debug( - "Preparing to sample %d out of %d possible points.", n, maximum_n - ) - - match strategy: - case "random": - return random_high_dimensional_sampling(dimensions, n, seed=seed) - case "clhs": - return concatenated_latin_hypercube_sampling( - dimensions=dimensions, final_sample_size=n, seed=seed - ) - case "sobol": - return sobol_sampling(dimensions=dimensions, final_sample_size=n, seed=seed) - case _: - raise NotImplementedError(f"Strategy {strategy} is unknown") diff --git a/plugins/operators/no-priors-characterization/src/no_priors_characterization/utils/one_dimensional_sampling.py b/plugins/operators/no-priors-characterization/src/no_priors_characterization/utils/one_dimensional_sampling.py deleted file mode 100644 index e5b28e625..000000000 --- a/plugins/operators/no-priors-characterization/src/no_priors_characterization/utils/one_dimensional_sampling.py +++ /dev/null @@ -1,293 +0,0 @@ -# Copyright IBM Corporation 2025, 2026 -# SPDX-License-Identifier: MIT -# -import logging - -logger = logging.getLogger(__name__) - - -def get_index_list_van_der_corput( - length_segment: int, - tot_points_to_sample: int, - sampled_indices: list[int] | None = None, - sort: bool = False, - verbose: bool = False, -) -> list[int]: - """ - Selects a set of indices from a 1D segment using a deterministic sampling strategy. - It is a modified Van der Corput Sequence - - :param length_segment: Total number of units in the 1D segment. - :type length_segment: int - :param tot_points_to_sample: Total number of indices to sample. - :type tot_points_to_sample: int - :param sampled_indices: List of indices already sampled. Defaults to an empty list. - :type sampled_indices: list[int], optional - :param sort: If True, returns the final list sorted in ascending order. Defaults to False. - :type sort: bool, optional - :param verbose: If True, prints debug information during sampling. 
Defaults to False. - :type verbose: bool, optional - - :raises ValueError: If `tot_points_to_sample` exceeds `length_segment`. - - :return: A list of sampled indices satisfying the distribution strategy. - :rtype: list[int] - - ## Additional Observations and examples - This function assumes that the data has been projected into a 1D segment based on feature importance, - making it isomorphic to a 1d segment. The goal is to sample `tot_points_to_sample` indices from this segment, - optionally considering a set of already sampled indices (`sampled_indices`). The strategy ensures that the - selected points are well-distributed and structurally balanced, akin to placing support ropes on a beam to - prevent collapse. - - The metaphor used is that of a beam suspended by ropes. Initially, ropes are placed at the extremities (indices 0 and `length_segment - 1`) - to ensure boundary support. Additional ropes (sampled points) are added iteratively at the midpoint of the longest unsampled intervals. - In cases of symmetry or multiple equally sparse regions, the algorithm evaluates local neighborhood density to prioritize selection. - - - For example, consider a segment of 14 elements (get_index_list_van_der_corput(14,8)): - - :: - - Index: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 - Sample: 1 - 8 5 - 7 3 - - 4 - 6 - 2 - - Here, numbers in the bottom row represent the order in which each point is added, and `-` indicates unsampled positions. - The algorithm ensures that each new point is placed where it maximally improves the balance of the structure, - often targeting the midpoint of the largest gaps. - - :examples: - - >>> get_index_list_van_der_corput(5, 3, sampled_indices=[0, 4]) - [0, 2, 4] - - >>> get_index_list_van_der_corput(10, 4, sampled_indices=[0, 4, 9]) - [0, 4, 6, 9] - - This strategy is particularly useful in optimization settings where boundary coverage and balanced sampling are important. 
- """ - - if tot_points_to_sample == 0: - return [] - - if tot_points_to_sample > length_segment: - raise ValueError( - "You are trying to sample more points than are available" - ) - - if sampled_indices is None: - sampled_indices = [] - - if len(sampled_indices) == length_segment: - maximal_indices_list = list(range(length_segment)) - # list.sort() sorts in place and returns None, so compare against sorted() instead - if sorted(sampled_indices) != maximal_indices_list: - logger.error( - "Sampled indices do not correspond to [0, ..., length_segment - 1]. " - "Returning list(range(length_segment))" - ) - return maximal_indices_list - - if len(sampled_indices) > tot_points_to_sample: - logger.warning( - "Number of sampled indices is greater than the number of indices you want to sample. " - "Returning sampled indices" - ) - return sampled_indices - - index_list = list(sampled_indices) - sampled_set = set(index_list) - - for point in [0, length_segment - 1]: - if point not in sampled_set: - index_list.append(point) - sampled_set.add(point) - if len(index_list) == tot_points_to_sample: - return sorted(index_list) - - def build_prefix_and_len(index_list: list[int]) -> tuple[list[int], int]: - """ - Builds prefix sums over a truncated mask: M = max(index_list)+1. - prefix[j] = sum(mask[0:j]) with prefix length M+1. - """ - if not index_list: - return [0], 0 - - M = max(index_list) + 1 - - # You must define sampled_set based on the input list - sampled_set = set(index_list) - - prefix = [0] * (M + 1) - s = 0 - - for i in range(M): - # i represents the current index in the imaginary mask array - s += 1 if i in sampled_set else 0 - prefix[i + 1] = s - - return prefix, M - - def get_list_min_weight( - prefix: list[int], M: int, d: int, selectable_indices: list[int] - ) -> list[int]: - """ - uses prefix sums instead of numpy.mean. - Only considers indices i in selectable_indices intersected with [0, M-1], - and preserves ascending order for ties exactly like the OG.
- """ - # Compute mean densities and track the minimum - # We must preserve order: OG loops i = 0..M-1 and filters by membership. - # Achieve the same by iterating selectable_indices (which we build in ascending order) - # but breaking when i >= M. - vals = {} - for i in selectable_indices: - if i >= M: - break - left = i - d - right = i + d - if left < 0: - left = 0 - if right >= M: - right = M - 1 - total = prefix[right + 1] - prefix[left] - denom = right - left + 1 - mean = total / denom # float64-equivalent - matches numpy.mean on booleans - vals[i] = mean - - if not vals: - return [] - - min_val = min(vals.values()) - # preserving order of candidates as OG: ascending index order - out = [] - for i in selectable_indices: - if i >= M: - break - if vals.get(i) == min_val: - out.append(i) - return out - - def get_selectable_indices() -> list[int]: - # OG did O(N*m) with "i not in list"; we do O(N) with a set, and the order is identical. - return [i for i in range(length_segment) if i not in sampled_set] - - max_d = length_segment - - # main loop - while len(index_list) < tot_points_to_sample: - selection = 0 - selectable_indices = get_selectable_indices() - - # prefix sums for the current (truncated) mask once per outer iteration - prefix, M = build_prefix_and_len(index_list=index_list) - - d = 1 - # keeping "previous set" semantics exactly (used when the candidate list becomes empty) - previous_set = selectable_indices - - while selection == 0: - indices = get_list_min_weight(prefix, M, d, selectable_indices) - - if not indices: - # Exact OG behavior: pick first element of the previous set - # when the intersection is empty at this d. - if not previous_set: - raise ValueError( - "Previous candidate set should not be empty or None" - ) - if verbose: - logger.info( - f"No intersection found with d={d}.
Using the previous set " - f"Appending to {index_list} the first element of {previous_set}" - ) - chosen = previous_set[0] - index_list.append(chosen) - sampled_set.add(chosen) - selection = 1 - - else: - # narrowing minimal-density set - previous_set = selectable_indices - selectable_indices = indices - - if len(selectable_indices) == 1 or d == max_d: - # pick the first element (ascending order preserved) - if verbose: - logger.info( - f"Appending to {index_list} the first element of {selectable_indices}" - ) - chosen = selectable_indices[0] - index_list.append(chosen) - sampled_set.add(chosen) - selection = 1 - - # OG increments d regardless it's immaterial after selection, but we mirror it - d += 1 - - if sort: - return sorted(index_list) - return index_list - - -def get_index_list_ordered_partitions(n: int, tot_points: int) -> list[int]: - """ - Select indices from a 1D segment using a partition-based sampling strategy. - - The data is treated as isomorphic to a 1D segment ordered by feature importance. - Points are selected by iteratively finding midpoints of the largest gaps. 
- - Args: - n: Total length of the segment (len(df)), valid indices are 0 to n-1 - tot_points: Number of points to sample - - Returns: - Sorted list of sampled indices - - Raises: - ValueError: If tot_points exceeds n - """ - if tot_points == 0: - logger.debug("No points selected from the list, return empty list") - return [] - if tot_points > n: - raise ValueError - if tot_points == 1: - return [0] - index_list = [n - 1, 0] - number_of_inner_points_sampled = 0 - while number_of_inner_points_sampled + 2 < tot_points: - l_copy_sorted = index_list.copy() - l_copy_sorted.sort() - l_copy = index_list.copy() - for _i, el in enumerate(l_copy[1:]): - start = el - index_seen = l_copy_sorted.index(el) - end = l_copy_sorted[index_seen + 1] - mid = midpoint(start=start, end=end) - if mid in index_list: - continue - number_of_inner_points_sampled += 1 - index_list.append(mid) - if number_of_inner_points_sampled + 2 == tot_points: - break - index_list.sort() - return index_list - - -def midpoint(start: int, end: int) -> int: - """ - Calculate the midpoint between two indices. 
- - Args: - start: Starting index - end: Ending index - - Returns: - Integer midpoint index - - Raises: - ValueError: If start is greater than end - """ - if end - start < 0: - raise ValueError("Start is greater than end!") - return start + ((end - start) // 2) diff --git a/plugins/operators/no-priors-characterization/src/no_priors_characterization/utils/order.py b/plugins/operators/no-priors-characterization/src/no_priors_characterization/utils/order.py deleted file mode 100644 index 5ff4e320e..000000000 --- a/plugins/operators/no-priors-characterization/src/no_priors_characterization/utils/order.py +++ /dev/null @@ -1,247 +0,0 @@ -# Copyright IBM Corporation 2025, 2026 -# SPDX-License-Identifier: MIT - -import itertools -import logging -import math -from typing import Literal - -import numpy as np -import pandas as pd - -from no_priors_characterization.utils.high_dimensional_sampling import ( - get_sampling_indices_multi_dimensional, -) - -logger = logging.getLogger(__name__) - - -def order_df_for_sampling_with_no_priors( - df: pd.DataFrame, - constitutive_properties: list[str], - n: int, - strategy: Literal["random", "clhs", "sobol"], -) -> pd.DataFrame: - """ - Orders a DataFrame for high-dimensional sampling without prior knowledge. - - Deduplicates rows based on constitutive properties, orders them for sampling, - and returns a subset of n samples using the specified strategy. - - Args: - df: Input dataset containing at least the columns specified in - constitutive_properties. May contain duplicate configurations. - constitutive_properties: Column names defining the configuration space. - Uniqueness is enforced over the Cartesian product of these properties. - n: Number of samples to generate. Adjusted if larger than available - unique configurations. - strategy: Sampling strategy - "random", "clhs", or "sobol". - - Returns: - DataFrame with n sampled rows, preserving the original column schema. - Index is positional (0..n-1). 
- - Raises: - ValueError: If n <= 0 after adjustment or no samples are available. - """ - - # Filtering - len_original = len(df) - df_unique = df.drop_duplicates(subset=constitutive_properties).reset_index( - drop=True - ) - delta_len = len_original - len(df_unique) - if delta_len > 0: - logging.warning( - f"Removing {delta_len} duplicate configurations." - f"They are characterized by the same combination of constitutive properties = {constitutive_properties}" - ) - - if n > len(df_unique): - logging.warning( - f"Requested {n} samples, but DataFrame has only {len(df_unique)} rows. Adjusting n to {len(df_unique)}." - ) - n = len(df_unique) - - if n <= 0: - logging.error( - f"No samples available to select. DataFrame has {len(df_unique)} rows and {n} samples were requested." - ) - # Return empty DataFrame with same columns as input - return pd.DataFrame(columns=df_unique.columns) - - # Build dictionaries - def _get_sorted_uniques(prop: str) -> list: - """Helper to safely sort unique values for a property.""" - vals = df_unique[prop].unique() - try: - return sorted(vals) - except TypeError: - logging.warning( - f"Cannot sort mixed types for property '{prop}'. " - "Keeping original order." 
- ) - return list(vals) - - value_dict = {prop: _get_sorted_uniques(prop) for prop in constitutive_properties} - - space_dict = {prop: len(vals) for prop, vals in value_dict.items()} - - dimensions = list(space_dict.values()) - - # Order DataFrame for index mapping - df_unique = order_df_for_get_index_list_nn_high_dimensional( - df_unique, constitutive_properties, dimensions=dimensions - ).reset_index(drop=True) - - # Generate sampling orders - orders_to_sample = get_sampling_indices_multi_dimensional( - dimensions=dimensions, space=space_dict, n=n, strategy=strategy - ) - - # Map orders to DataFrame indices - indices_to_sample = get_index_list_nn_high_dimensional(orders_to_sample, dimensions) - - logger.info(f"Indexes are:\n {indices_to_sample}") - try: - return df_unique.iloc[indices_to_sample] - except IndexError: - logging.error( - f"Index Error detected. Length of the dataframe is {len(df_unique)}." - "The indices that cause the error are:" - ) - max_len = len(df_unique) - out_of_bounds_list = [i for i in indices_to_sample if i < 0 or i >= max_len] - - logging.error(out_of_bounds_list) - logging.error("Returning empty dataset") - return pd.DataFrame({}) - - -def order_df_for_get_index_list_nn_high_dimensional( - df: pd.DataFrame, constitutive_properties: list[str], dimensions: list[int] -) -> pd.DataFrame: - """ - Ensure DataFrame is ordered and complete for high-dimensional index generation. - - Prepares the DataFrame so rows align with the Cartesian product implied by - constitutive_properties and dimensions. Sorts rows, validates completeness, - and injects missing combinations if needed. - - Args: - df: Input DataFrame containing at least the columns in constitutive_properties. - constitutive_properties: Column names defining the high-dimensional space. - Order determines sort priority. - dimensions: Expected cardinality for each constitutive property. - Used to compute expected_len = product(dimensions). 
- - Returns: - DataFrame sorted by constitutive_properties and augmented with any missing - combinations. Injected rows have NaN for non-constitutive columns. - - Notes: - If dimensions and actual unique values disagree, uses observed unique - values to generate combinations. - """ - # Sort by constitutive properties - df = df.sort_values(by=constitutive_properties).reset_index(drop=True) - - expected_len = math.prod(dimensions) - - # Return early if already complete - if len(df) == expected_len: - return df - - # Generate all possible combinations based on actual unique values - unique_values = [ - sorted(df[prop].dropna().unique()) for prop in constitutive_properties - ] - all_combinations = list(itertools.product(*unique_values)) - actual_expected_len = len(all_combinations) - - logger.warning( - f"DataFrame length mismatch: expected {expected_len} (product of {dimensions}), " - f"but got {len(df)}. Actual unique combinations: {actual_expected_len}." - ) - - # Identify existing combinations - existing_combinations = { - tuple(row[prop] for prop in constitutive_properties) for _, row in df.iterrows() - } - - # Find missing combinations - missing_combinations = [ - comb for comb in all_combinations if comb not in existing_combinations - ] - - if missing_combinations: - logger.info( - f"Injecting {len(missing_combinations)} missing rows to satisfy the property." 
- ) - injected_rows = [] - for comb in missing_combinations: - row_data = dict(zip(constitutive_properties, comb, strict=False)) - # Fill other columns with pd.NA - for col in df.columns: - if col not in constitutive_properties: - row_data[col] = pd.NA - injected_rows.append(row_data) - - # Append missing rows - df = pd.concat([df, pd.DataFrame(injected_rows)], ignore_index=True) - - # Sort again after injection - df = df.sort_values(by=constitutive_properties).reset_index(drop=True) - - logger.info(f"Injected rows: {injected_rows}") - - return df - - -def get_index_list_nn_high_dimensional( - orders_to_sample: list[list[int]], dimensions: list[int] -) -> list[int]: - """ - Map high-dimensional sampling orders to linear (flattened) indices. - - Converts multi-dimensional coordinates to linear indices using row-major ordering, - where the last dimension varies fastest. - - Args: - orders_to_sample: List of multi-dimensional coordinates [i0, i1, ..., ik] - dimensions: Size of each dimension [d0, d1, ..., dk] - - Returns: - List of linear indices corresponding to the input coordinates - - Warns: - If duplicate or out-of-bounds indices are detected - """ - indices = [] - cprod = np.cumprod(np.array(dimensions), dtype=int).tolist() - maximum_n = cprod[-1] - - for order in orders_to_sample: - index = 0 - multiplier = 1 - # Iterate reversed so last dimension varies fastest - for i in reversed(range(len(dimensions))): - index += order[i] * multiplier - multiplier *= dimensions[i] - - # Valid linear indices are 0..maximum_n-1, so maximum_n itself is out of bounds - if index >= maximum_n: - logging.warning( - f"Out of bound index {index} computed from order {order}, dimensions are {dimensions}" - ) - indices.append(index) - - if len(set(indices)) != len(indices): - logger.error(f"{len(indices) - len(set(indices))} Duplicated indices!") - - out_of_bounds_list = [i for i in indices if i >= maximum_n] - if out_of_bounds_list: - logger.error( - f"The following indices are out of bound: {out_of_bounds_list}, maximum admissible value is {maximum_n-1}" - ) - -
return indices diff --git a/plugins/operators/no-priors-characterization/src/no_priors_characterization/utils/space_df_connector.py b/plugins/operators/no-priors-characterization/src/no_priors_characterization/utils/space_df_connector.py deleted file mode 100644 index 9c29c2fa3..000000000 --- a/plugins/operators/no-priors-characterization/src/no_priors_characterization/utils/space_df_connector.py +++ /dev/null @@ -1,524 +0,0 @@ -# Copyright IBM Corporation 2025, 2026 -# SPDX-License-Identifier: MIT - -from __future__ import annotations - -import logging -from typing import TYPE_CHECKING, Any - -import pandas as pd - -from orchestrator.core.discoveryspace.space import DiscoverySpace -from orchestrator.schema.virtual_property import PropertyAggregationMethodEnum - -if TYPE_CHECKING: - from collections.abc import Hashable - - from orchestrator.metastore.project import ProjectContext - from orchestrator.schema.entity import Entity - -logger = logging.getLogger(__name__) - - -def get_project_context() -> ProjectContext: - """ - Retrieve the current ADO project context from configuration. - - Returns: - ProjectContext object for the active project - """ - import orchestrator.cli.core.config - - ado_configuration = orchestrator.cli.core.config.AdoConfiguration.load() - return ado_configuration.project_context # type: ignore[name-defined] - - -def get_space( - space_or_space_id: DiscoverySpace | str, -) -> DiscoverySpace: - """ - Get a DiscoverySpace object from either a space object or identifier string. 
- - Args: - space_or_space_id: Either a DiscoverySpace object or its string identifier - - Returns: - DiscoverySpace object - """ - - if isinstance(space_or_space_id, DiscoverySpace): - return space_or_space_id - - return DiscoverySpace.from_stored_configuration( - project_context=get_project_context(), - space_identifier=space_or_space_id, - ) - - -# %% - - -def get_df_all_entities_no_measurements( - discoverySpace: DiscoverySpace | str, -) -> pd.DataFrame: - """ - Return a DataFrame of all entities in the given Discovery Space, regardless of whether - they have any measured target outputs. - - - Each row represents an entity from the entity space. - - Includes the entity identifier and all constitutive property values. - - Does NOT include any measured target outputs (only features). - - Useful for generating the full feature set for prediction or backfilling missing measurements. - - Parameters - ---------- - discoverySpace : DiscoverySpace | str - The Discovery Space object or its identifier. - targetOutput_list : list, optional - List of target output names (ignored in this function, included for API consistency). - - Returns - ------- - pd.DataFrame - DataFrame with columns: ['identifier', ].
- """ - - space = get_space(space_or_space_id=discoverySpace) - - entity_space = space.entitySpace - cp_ids = [cp.identifier for cp in entity_space.constitutiveProperties] - - list_of_dicts_to_convert = [] - for point_values in entity_space.sequential_point_iterator(): - point_dict = dict(zip(cp_ids, point_values, strict=True)) - entity = entity_space.entity_for_point(point_dict) - ed = {"identifier": entity.identifier} - ed.update(point_dict) - list_of_dicts_to_convert.append(ed) - - return pd.DataFrame(list_of_dicts_to_convert) - - -def get_df_at_least_one_measured_value( - discoverySpace: DiscoverySpace | str, - targetOutput_list: list[str] | None = None, - add_measurement_id: bool = False, -) -> pd.DataFrame: - """ - Return a DataFrame of entities that have at least one measured target output from the - provided list, aggregated across all experiments in the Discovery Space. - - - Each row represents an entity with measurements. - - Includes identifier (optional), constitutive properties, and the requested target outputs. - - Drops rows with missing values for the selected targets. - - May return an empty DataFrame. - - Parameters - ---------- - discoverySpace : DiscoverySpace | str - The Discovery Space object or its identifier. - targetOutput_list : list[str] | None - List of target output names to include in the DataFrame. - add_measurement_id : bool - If True, include the entity identifier column in the output. - - Returns - ------- - pd.DataFrame - DataFrame with columns: ['identifier' (optional), , ].
- """ - - if not targetOutput_list: - targetOutput_list = [] - space = get_space(space_or_space_id=discoverySpace) - col_list = [cp.identifier for cp in space.entitySpace.constitutiveProperties] - if add_measurement_id: - col_list = ["identifier", *col_list] - - discoverySpace.sample_store.refresh() - - df = pd.DataFrame( - space.matchingEntitiesTable( - property_type="target", - aggregationMethod=PropertyAggregationMethodEnum.mean, - ) - ) - - if df.empty: - # NOTE: this condition is hit when there are no measurements at all existing in the space - logger.warning( - "No measured properties found in the discovery space\nReturning empty DataFrame\n " - ) - return df - - all_df_cols = list(df.columns) - valid_targetOutput_list = [] - for el in targetOutput_list: - if el in all_df_cols: - valid_targetOutput_list.append(el) - elif f"{el}-mean" in all_df_cols and el not in all_df_cols: - logger.warning( - f"Column named '{el}-mean' (instead of '{el}', which is not present)" - "found in the DataFrame obtained through matchingEntitiesTable. " - f"Renaming it to '{el}'." - ) - # Rename the column in the DataFrame - df.rename(columns={f"{el}-mean": el}, inplace=True) - valid_targetOutput_list += [el] - elif f"{el}-mean" in all_df_cols and el in all_df_cols: - logger.warning( - f"Columns named '{el}-mean' and '{el}'" - "found in the DataFrame obtained through matchingEntitiesTable. " - f"Renaming it to '{el}'." - ) - logger.error("Unexpected behavior can happen!") - # Rename the column in the DataFrame - df.rename(columns={f"{el}-mean": el}, inplace=True) - valid_targetOutput_list += [el] - col_list += valid_targetOutput_list - - # Something unexpected happened: log here about it - if valid_targetOutput_list != targetOutput_list: - if len(valid_targetOutput_list) == 0: - logger.error( - "No valid target in the columns of the DataFrame." - f"columns are:\t{list(df.columns)}." 
- f"First rows are:\n{df.head(5)}" - ) - else: - not_found = [ - t for t in targetOutput_list if t not in valid_targetOutput_list - ] - logger.error( - f"Found measurements for the following valid targets:\t{valid_targetOutput_list}" - ) - logger.error( - f"No measurement found for the following valid targets:\t{not_found}" - ) - - removed_cols = [c for c in list(df.columns) if c not in col_list] - logger.debug( - "Obtaining df with at least one measured target." - f"Removed columns: {removed_cols}" - ) - - df = df[col_list] - - # I can still have Nans here for cols in targetOutput_list, - # because I am taking points for which I have at least one of the measured properties of the experiment - df.dropna(inplace=True) - - # The resulting DataFrame can be empty - if df.empty: - logger.warning( - "Although there were some measured properties in the discovery space." - ) - logger.warning( - "All measured properties in the discovery space" - f"are different from the desired outputs {targetOutput_list}.Returning empty DataFrame\n " - ) - - return df - - -def get_source_and_target( - discoverySpace: DiscoverySpace | str, - targetOutput: str, - log_string: str = "", -) -> tuple[pd.DataFrame, pd.DataFrame]: - """ - Build source (labeled) and target (unlabeled) DataFrames for a given target output `t`. - Note, source can be empty - - - Retrieves measured entities for `t` and all entities without measurements. - - Merges on common feature columns (excluding 'identifier'). - - Splits into: - source_df: rows with non-null `t` (features + target). - target_df: rows with null `t` (features only). - - Parameters - ---------- - discoverySpace : str - Discovery Space identifier (e.g., 'space-1a2469-6a3ed5'). - t : str - Target output column name. 
- - Returns - ------- - tuple - (source_df, target_df) - """ - - dfm = get_df_at_least_one_measured_value(discoverySpace, [targetOutput]) - dfu = get_df_all_entities_no_measurements(discoverySpace) - keys = [c for c in dfu.columns if c in dfm.columns and c != "identifier"] - - if dfm.empty: - logger.warning("The source space is empty") - return dfm, dfu - - df = dfu.merge(dfm, on=keys, how="left") - - # If nothing is measured you do not have the columns, so I add the column as empty to run the - # following logic safely - if targetOutput not in list(df.columns): - logger.info( - f"""The target output was not present in the columns of the measured+unmeasured DataFrame,' \ - meaning that '{targetOutput}' has never been measured in this space. - dfm.empty = {df.empty}. Adding an empty column to the DataFrame. - """ - ) - logger.debug("Adding an empty column to the DataFrame.") - df[targetOutput] = pd.NA - - if targetOutput in list(df.columns): - df_measured_drop_na = df.dropna(subset=[targetOutput]) - df_unmeasured_drop_na = df[df[targetOutput].isna()].drop(columns=[targetOutput]) - n_rows_dropped = len(df) - len(df_measured_drop_na) - logger.debug( - f"Dropped {n_rows_dropped} rows. Function called with log_string={log_string}" - ) - if df_measured_drop_na.empty: - logger.warning( - f"Empty source after dropping rows that contain Nan in {targetOutput} column" - ) - if df_unmeasured_drop_na.empty: - logger.warning( - f"Empty target after filtering rows that contain Nan in {targetOutput} column" - ) - return df_measured_drop_na, df_unmeasured_drop_na - save_path = "df_with_no_targetOutput_columns.csv" - logger.error( - f"'{targetOutput}' column is missing, saving df in {save_path}, returning unmerged DataFrames" - ) - df.to_csv(save_path) - return dfm, dfu - - -def validate_points_in_space( - points: list[dict], - space: DiscoverySpace, -) -> tuple[list[dict], list[int]]: - """ - Validate a list of point dictionaries against a Discovery Space entity space. 
- - A point is considered valid if `space.entitySpace.isPointInSpace(point)` returns True. - This function returns both the subset of valid points (in original order) and - the indices of invalid points for diagnostics. - - Parameters - ---------- - points : list[dict] - List of point dicts `{constitutive_property_id: value}` to validate. - space : DiscoverySpace - The Discovery Space whose entity space defines the validity constraints. - - Returns - ------- - (valid_points, invalid_indices) : tuple[list[dict], list[int]] - valid_points : - The points that are valid under `space.entitySpace.isPointInSpace`. - invalid_indices : - The zero-based indices (relative to the input `points`) that were invalid. - - Examples - -------- - >>> points = make_points_from_df(df, space) - >>> valid_points, invalid_idx = validate_points_in_space(points, space) - >>> if invalid_idx: - ... print(f"Warning: {len(invalid_idx)} invalid rows at indices {invalid_idx}") - """ - valid_points: list[dict] = [] - invalid_indices: list[int] = [] - - for i, p in enumerate(points): - if space.entitySpace.isPointInSpace(p): - valid_points.append(p) - else: - invalid_indices.append(i) - return valid_points, invalid_indices - - -def df_to_points( - df: pd.DataFrame, - cols: list[str] | None = None, - dropna: bool = True, - drop_duplicates: bool = False, -) -> list[dict[Hashable, Any]]: - """ - Convert DataFrame rows to list of point dictionaries. - - Args: - df: Input DataFrame - cols: Columns to include. 
If None, uses all columns - dropna: If True, drop rows containing any NaN values - drop_duplicates: If True, drop duplicate rows - - Returns: - List of dictionaries, each representing a point {property_id: value} - - Raises: - KeyError: If requested columns are not present in DataFrame - """ - - if cols is None: - cols = list(df.columns) - missing = set(cols) - set(df.columns) - if missing: - raise KeyError(f"Requested columns not present in DataFrame: {missing}") - - sub = df[cols].copy() - if dropna: - sub = sub.dropna(how="any") - if drop_duplicates: - sub = sub.drop_duplicates() - - # Convert numpy scalars to python builtins for safety - def to_py(x: object) -> object: - import numpy as np - - if isinstance(x, (np.generic)): - return x.item() - return x - - # apply conversion (only if needed) - for c in sub.columns: - sub[c] = sub[c].map(to_py) - - return sub.to_dict(orient="records") - - -# TODO: check if these are actually needed -def df_to_points_parsing( - df: pd.DataFrame, - cols: list[str] | None = None, - dropna: bool = True, - parse_values: bool = False, -) -> list[dict]: - """ - Convert DataFrame to points with optional string value parsing. 
- - Args: - df: Input DataFrame - cols: Columns to include - dropna: If True, drop rows with NaN values - parse_values: If True, parse string values using ast.literal_eval - - Returns: - List of point dictionaries with parsed values - """ - import ast - - points = df_to_points(df, cols=cols, dropna=dropna) - if not parse_values: - return points - - parsed = [] - for p in points: - newp = {} - for k, v in p.items(): - if isinstance(v, str): - try: - newp[k] = ast.literal_eval(v) - except Exception: - newp[k] = v - else: - newp[k] = v - parsed.append(newp) - return parsed - - -def make_points_from_df( - df: pd.DataFrame, - space: DiscoverySpace, - cols: list[str] | None = None, - dropna: bool = True, - parse_values: bool = True, -) -> list[dict]: - """ - Convert a DataFrame of constitutive properties into a list of point dictionaries, - using the entity-space canonical column order by default. - - Each point is a mapping {constitutive_property_id: value}. By default, rows with - any NaN across the selected columns are dropped, and string values are parsed - into Python literals where possible (e.g., "[1, 2]" -> [1, 2]) via `ast.literal_eval`. - - Parameters - ---------- - df : pd.DataFrame - Input DataFrame whose columns correspond to constitutive property identifiers. - space : DiscoverySpace - The Discovery Space providing the canonical order of constitutive properties. - cols : list[str], optional - Explicit list of columns to use. If None, uses the canonical order: - `[cp.identifier for cp in space.entitySpace.constitutiveProperties]`. - dropna : bool, default True - If True, drop rows containing any NaN in the selected columns. - parse_values : bool, default True - If True, attempt to parse string values into Python objects using `ast.literal_eval`. - - Returns - ------- - list[dict] - A list of point dicts, one per retained row: `[{prop_id: value, ...}, ...]`. - - Raises - ------ - KeyError - If any of the requested `cols` are not present in `df`. 
- - Examples - -------- - >>> space_cols = [cp.identifier for cp in space.entitySpace.constitutiveProperties] - >>> points = make_points_from_df(df, space, cols=space_cols, dropna=True, parse_values=True) - """ - # Determine canonical order if cols not provided - if cols is None: - cols = [cp.identifier for cp in space.entitySpace.constitutiveProperties] - - # Validate requested columns exist - missing = set(cols) - set(df.columns) - if missing: - raise KeyError(f"Requested columns not present in DataFrame: {missing}") - - # Convert rows -> point dicts, with optional parsing - return df_to_points_parsing(df, cols=cols, dropna=dropna, parse_values=parse_values) - - -def get_list_of_entities_from_df_and_space( - df: pd.DataFrame, space: DiscoverySpace -) -> list[Entity]: - """ - Convert DataFrame rows to Entity objects validated against a discovery space. - - Args: - df: DataFrame containing constitutive property values - space: DiscoverySpace defining the entity space constraints - - Returns: - List of valid Entity objects - - Warns: - If number of valid entities differs from DataFrame row count - """ - points = make_points_from_df(df=df, space=space) - valid_points, __ = validate_points_in_space(points, space) - - list_of_entities = [] - from orchestrator.schema.point import SpacePoint - - for p in valid_points: - # p is a dict mapping constitutive property id -> value - sp = SpacePoint(entity=p) - entity = sp.to_entity( - generatorid="no_priors_characterization" - ) # builds an Entity from the dict without touching the sample store - list_of_entities.append(entity) - - numberEntities = len(list_of_entities) - if numberEntities != len(df): - numberEntities_log = f"""Warning: number of valid entities {numberEntities} is different from the number of rows in the ordered df {len(df)}. - This means that some rows in the ordered df did not correspond to valid entities in the discovery space. 
- """ - logging.warning(numberEntities_log) - return list_of_entities diff --git a/plugins/operators/no-priors-characterization/visualize_sampling.py b/plugins/operators/no-priors-characterization/visualize_sampling.py deleted file mode 100644 index 275f7e7a0..000000000 --- a/plugins/operators/no-priors-characterization/visualize_sampling.py +++ /dev/null @@ -1,135 +0,0 @@ -# Copyright IBM Corporation 2025, 2026 -# SPDX-License-Identifier: MIT - -""" -Visualization script for comparing sampling strategies. - -This script demonstrates the distribution patterns of different sampling -strategies (random, CLHS, Sobol) in a 2D grid space. -""" - -import sys - -try: - import matplotlib.pyplot as plt - import numpy as np - from matplotlib.axes import Axes -except ModuleNotFoundError: - print("matplotlib not found. Please install it to run the visualization.") - print("pip install matplotlib") - sys.exit(1) - -from no_priors_characterization.utils.high_dimensional_sampling import ( - concatenated_latin_hypercube_sampling, - random_high_dimensional_sampling, - sobol_sampling, -) - - -def plot_grid( - ax: Axes, - dimensions: list[int] | tuple[int, int], - points: np.ndarray | list[list[int]], - title: str, -) -> None: - """ - Plot a 2D grid visualization of sampled points with overlap detection. - - Args: - ax: Matplotlib axes object to draw on. - dimensions: Dimensions of the grid [width, height]. - points: List of sampled points as [x, y] coordinates. - title: Title for the plot. 
- """ - from collections import defaultdict - - import matplotlib.patches as patches - - nx, ny = dimensions[0], dimensions[1] - - # Setup grid - ax.set_xlim(0, nx) - ax.set_ylim(0, ny) - ax.set_xticks(range(nx + 1)) - ax.set_yticks(range(ny + 1)) - ax.grid(True, color="black", linewidth=1) - ax.set_aspect("equal") - ax.set_title(title, fontsize=12, pad=10) - - # Track points in each cell to handle overlaps - # Maps (x, y) -> list of time indices (1-based) - grid_content = defaultdict(list) - - # points is a list of [x, y], enumerate gives us the time index (0-based) - for time, point in enumerate(points): - x, y = int(point[0]), int(point[1]) # Ensure integers - if 0 <= x < nx and 0 <= y < ny: - # Store t + 1 so the first sample is '1' - grid_content[(x, y)].append(time + 1) - - # Draw squares and text - for (x, y), indices in grid_content.items(): - count = len(indices) - # Darker alpha if multiple points hit the same square - alpha = min(0.4 + 0.2 * count, 1.0) - rect = patches.Rectangle( - (x, y), 1, 1, linewidth=0, facecolor="#ff0000", alpha=alpha - ) - ax.add_patch(rect) - - # Label is the comma-separated list of indices - label = ",".join(map(str, indices)) - - # Add text with shadow effect - ax.text( - x + 0.52, - y + 0.52, - label, - ha="center", - va="center", - color="#D4FF00", - fontweight="bold", - ) - ax.text( - x + 0.5, - y + 0.5, - label, - ha="center", - va="center", - color="#000000", - fontweight="bold", - ) - - -def main() -> None: - """Run the sampling visualization comparison.""" - # Configuration - dimensions = [20, 6] # 20 columns, 6 rows (Total 120 cells) - N = 30 # Number of samples to draw - SEED = 42 - - # Plotting - _fig, axes = plt.subplots(1, 3, figsize=(15, 5)) - - # 1. Random Sampling - pts_rnd = random_high_dimensional_sampling(dimensions, N, seed=SEED) - plot_grid(axes[0], dimensions, pts_rnd, f"Random Sampling (N={N})\n(Clumps & Gaps)") - - # 2. 
Concatenated LHS - pts_lhs = concatenated_latin_hypercube_sampling(dimensions, N, seed=SEED) - plot_grid( - axes[1], dimensions, pts_lhs, f"Concatenated LHS (N={N})\n(Uniform Rows/Cols)" - ) - - # 3. Sobol Sequence - pts_sobol = sobol_sampling(dimensions, N, seed=SEED) - plot_grid( - axes[2], dimensions, pts_sobol, f"Sobol Sequence (N={N})\n(Maximal Spreading)" - ) - - plt.tight_layout() - plt.show() - - -if __name__ == "__main__": - main() diff --git a/plugins/operators/trim/pyproject.toml b/plugins/operators/trim/pyproject.toml index 233b36c60..9aa2419b4 100644 --- a/plugins/operators/trim/pyproject.toml +++ b/plugins/operators/trim/pyproject.toml @@ -5,7 +5,6 @@ readme = "README.md" requires-python = ">=3.10,<3.14" dependencies = [ "ado-core", - "ado-no-priors-characterization", "autogluon-tabular[catboost,xgboost]==1.5", "numpy", "pandas>=2.2.0", @@ -30,4 +29,3 @@ local_scheme = "node-and-timestamp" [tool.uv.sources] ado-core = { workspace = true } -ado-no-priors-characterization = { workspace = true } diff --git a/plugins/operators/trim/src/trim/operator.py b/plugins/operators/trim/src/trim/operator.py index f4ad7a2ae..5873e8616 100644 --- a/plugins/operators/trim/src/trim/operator.py +++ b/plugins/operators/trim/src/trim/operator.py @@ -5,12 +5,11 @@ import logging from importlib.metadata import version -from no_priors_characterization.utils import get_source_and_target - from orchestrator.core.discoveryspace.space import DiscoverySpace from orchestrator.core.operation.config import FunctionOperationInfo from orchestrator.core.operation.operation import OperationOutput from orchestrator.modules.operators.collections import characterize_operation +from trim.samplers.no_priors_utils import get_source_and_target from trim.trim_pydantic import ( TrimParameters, ) # Importing this way works when the package is installed @@ -54,7 +53,8 @@ def trim( Returns: OperationOutput containing the operation resources and metadata """ - from 
orchestrator.modules.operators.collections import characterize, explore + # Lazy import to avoid circular import issues during plugin loading + from orchestrator.modules.operators.collections import explore from orchestrator.modules.operators.randomwalk import ( CustomSamplerConfiguration, RandomWalkParameters, @@ -95,9 +95,23 @@ def trim( f"Note: Trim sampler has been called with a minimum budget of {params.samplingBudget.minPoints} points." ) - # Call the no-priors-characterization operator directly - no_priors_operator = characterize.no_priors_characterization - op_output_characterization_no_prior = no_priors_operator( + # Use random-walk with no-priors sampler instead of direct operator call + no_priors_module = SamplerModuleConf( + moduleClass="NoPriorsSampleSelector", + moduleName="trim.samplers.no_priors_sampler", + ) + no_priors_sampler_config = CustomSamplerConfiguration( + module=no_priors_module, + parameters=params.noPriorParameters, + ) + no_priors_rwparams = RandomWalkParameters( + samplerConfig=no_priors_sampler_config, + batchSize=params.noPriorParameters.batchSize, + numberEntities=params.samplingBudget.minPoints - len(source_df), + singleMeasurement=True, + ) + + op_output_characterization_no_prior = random_walk( discoverySpace=discoverySpace, operationInfo=FunctionOperationInfo.model_validate( { @@ -112,7 +126,7 @@ def trim( ), } ), - **params.noPriorParameters.model_dump(), + **no_priors_rwparams.model_dump(), ) source_df, target_df = get_source_and_target( @@ -157,7 +171,11 @@ def trim( operationInfo=FunctionOperationInfo.model_validate( { "metadata": {"completed operation": "Iterative Modeling Operation"}, - "actuatorConfigurationIdentifiers": operationInfo.actuatorConfigurationIdentifiers, + "actuatorConfigurationIdentifiers": ( + operationInfo.actuatorConfigurationIdentifiers + if operationInfo + else [] + ), } ), **trim_rwparams.model_dump(), diff --git 
a/examples/no-priors-characterization/custom_experiments/no_priors_custom_experiments/__init__.py b/plugins/operators/trim/src/trim/samplers/__init__.py similarity index 100% rename from examples/no-priors-characterization/custom_experiments/no_priors_custom_experiments/__init__.py rename to plugins/operators/trim/src/trim/samplers/__init__.py diff --git a/plugins/operators/no-priors-characterization/src/no_priors_characterization/no_priors_pydantic.py b/plugins/operators/trim/src/trim/samplers/no_priors_parameters.py similarity index 69% rename from plugins/operators/no-priors-characterization/src/no_priors_characterization/no_priors_pydantic.py rename to plugins/operators/trim/src/trim/samplers/no_priors_parameters.py index 3608470df..c1240c4b1 100644 --- a/plugins/operators/no-priors-characterization/src/no_priors_characterization/no_priors_pydantic.py +++ b/plugins/operators/trim/src/trim/samplers/no_priors_parameters.py @@ -15,8 +15,6 @@ class NoPriorsParameters(BaseModel): strategy (str): sampling subroutine: - 'random': selects random points from the beginning - - 'one_shift': refer to one_shift_then_random_points_high_dimensional_sampling - - 'recursive_aggregation': refer to recursive_aggregation_high_dimensional_sampling - 'clhs': refer to concatenated_latin_hypercube_sampling - 'sobol': sobol sampling """ @@ -48,25 +46,15 @@ class NoPriorsParameters(BaseModel): ] = 1 sampling_strategy: Annotated[ - Literal["random", "one_shift", "recursive_aggregation", "clhs", "sobol"], + Literal["random", "clhs", "sobol"], BeforeValidator(lambda s: s.lower()), Field( description=( "Sampling subroutine. 
Supported values:\n" " - 'random': selects random points from the beginning\n" - " - 'one_shift': see one_shift_then_random_points_high_dimensional_sampling\n" - " - 'recursive_aggregation': see recursive_aggregation_high_dimensional_sampling\n" " - 'clhs': dimension-wise random without replacement until each dim cycles\n" " - 'sobol': sobol sampling via scipy\n" - "Aliases: 'random_shifts' → 'recursive_aggregation'.\n" "Validation is case-insensitive; value is normalized to lowercase." ), ), ] = "clhs" - - -if __name__ == "__main__": - params = NoPriorsParameters.model_validate(NoPriorsParameters(targetOutput="test")) - print( - f"type of model_validate output on no-priors-characterization default is {type(params)}, printing the full object gives {params}" - ) diff --git a/plugins/operators/no-priors-characterization/src/no_priors_characterization/no_priors_sampler.py b/plugins/operators/trim/src/trim/samplers/no_priors_sampler.py similarity index 64% rename from plugins/operators/no-priors-characterization/src/no_priors_characterization/no_priors_sampler.py rename to plugins/operators/trim/src/trim/samplers/no_priors_sampler.py index 2d7c220d1..3030d4965 100644 --- a/plugins/operators/no-priors-characterization/src/no_priors_characterization/no_priors_sampler.py +++ b/plugins/operators/trim/src/trim/samplers/no_priors_sampler.py @@ -7,15 +7,15 @@ from pydantic import BaseModel -from no_priors_characterization.no_priors_pydantic import NoPriorsParameters -from no_priors_characterization.utils.order import order_df_for_sampling_with_no_priors -from no_priors_characterization.utils.space_df_connector import ( - get_list_of_entities_from_df_and_space, - get_source_and_target, -) from orchestrator.core.discoveryspace.samplers import BaseSampler from orchestrator.core.discoveryspace.space import DiscoverySpace, Entity from orchestrator.modules.operators.discovery_space_manager import DiscoverySpaceManager +from trim.samplers.no_priors_parameters import 
NoPriorsParameters +from trim.samplers.no_priors_utils import ( + get_list_of_entities_from_df_and_space, + get_source_and_target, + order_df_for_sampling_with_no_priors, +) logger_no_priors = logging.getLogger(__name__) @@ -113,18 +113,70 @@ async def iterator() -> typing.AsyncGenerator[list[Entity], None]: # type: igno def entityIterator( self, discoverySpace: DiscoverySpace, batchsize: int = 1 ) -> typing.Generator[list[Entity], None, None]: - """Returns an remoteEntityIterator that returns entities in order""" + """ + Generate entities for no-priors characterization sampling (synchronous version). + + Orders the target space using a high-dimensional sampling strategy (e.g., CLHS, Sobol) + without relying on prior model knowledge or feature importance. + + Args: + discoverySpace: The discovery space to sample from + batchsize: Number of entities to yield per iteration + + Yields: + List of Entity objects to be measured, in the determined order + """ def iterator_closure( space: DiscoverySpace, ) -> typing.Callable[[], typing.Generator[list[Entity], None, None]]: - # list_of_entities = list(...) # type: ignore[name-defined] - # numberEntities = len(list_of_entities) + logger_no_priors.info("Characterization with no-priors starts.\n") + logger_no_priors.info(f"Parameters are:\n{self.params}\n\n") + + source_df, target_df = get_source_and_target( + space, self.params.targetOutput + ) + logger_no_priors.info(f"Target dataframe has length {len(target_df)}") + + # The 'samples' parameter specifies the number of NEW entities to sample, + # regardless of how many entities have already been measured in the space + logger_no_priors.info( + f"Space has {len(source_df)} measured entities. " + f"Sampling {self.params.samples} new entities as requested." 
+ ) + target_df = order_df_for_sampling_with_no_priors( + target_df, + [cp.identifier for cp in space.entitySpace.constitutiveProperties], + self.params.samples, + strategy=self.params.sampling_strategy, + ) + list_of_entities_for_no_prior_characterization = ( + get_list_of_entities_from_df_and_space(df=target_df, space=space) + ) + + logger_no_priors.info( + "\n\nCharacterization with no-priors finished. Starting Iterative Modeling.\n" + ) - def iterator() -> typing.Generator[list[Entity], None, None]: # type: ignore[name-defined] - raise NotImplementedError - # ...for i in range(0, numberEntities, batchsize): + def iterator() -> typing.Generator[list[Entity], None, None]: + logger_no_priors.info( + "\n\nIteration over sorted entities for no priors characterization starts.\n" + ) + for i in range( + 0, len(list_of_entities_for_no_prior_characterization), batchsize + ): + entities = list_of_entities_for_no_prior_characterization[ + i : i + batchsize + ] + if len(entities) == 0: + logger_no_priors.info( + "\n\nCharacterization with no-priors finished.\n" + ) + break + else: + yield entities + logger_no_priors.info("\n\nCharacterization with no-priors finished.\n") return iterator diff --git a/plugins/operators/trim/src/trim/samplers/no_priors_utils.py b/plugins/operators/trim/src/trim/samplers/no_priors_utils.py new file mode 100644 index 000000000..ccf6a2544 --- /dev/null +++ b/plugins/operators/trim/src/trim/samplers/no_priors_utils.py @@ -0,0 +1,953 @@ +# Copyright IBM Corporation 2025, 2026 +# SPDX-License-Identifier: MIT + +""" +Utility functions for no-priors sampling, including: +- High-dimensional sampling strategies (CLHS, Sobol, random) +- DataFrame ordering and index mapping +- Entity/point conversion and validation +- Discovery space data extraction +""" + +from __future__ import annotations + +import itertools +import logging +import math +import random +from typing import TYPE_CHECKING, Any, Literal + +import numpy as np +import pandas as pd +from 
scipy.stats.qmc import Sobol + +from orchestrator.core.discoveryspace.space import DiscoverySpace +from orchestrator.schema.virtual_property import PropertyAggregationMethodEnum + +if TYPE_CHECKING: + from collections.abc import Hashable + + from orchestrator.metastore.project import ProjectContext + from orchestrator.schema.entity import Entity + +logger = logging.getLogger(__name__) + + +# ============================================================================ +# 1D Sampling Functions +# ============================================================================ + + +def get_index_list_van_der_corput( + length_segment: int, + tot_points_to_sample: int, + sampled_indices: list[int] | None = None, + sort: bool = False, + verbose: bool = False, +) -> list[int]: + """ + Selects indices from a 1D segment using a modified Van der Corput sequence. + + Args: + length_segment: Total number of units in the 1D segment + tot_points_to_sample: Total number of indices to sample + sampled_indices: List of indices already sampled + sort: If True, returns the final list sorted + verbose: If True, logs debug information + + Returns: + List of sampled indices + + Raises: + ValueError: If tot_points_to_sample exceeds length_segment + """ + if tot_points_to_sample == 0: + return [] + + if tot_points_to_sample > length_segment: + raise ValueError( + f"Cannot sample {tot_points_to_sample} points from a segment of length {length_segment}" + ) + + if sampled_indices is None: + sampled_indices = [] + + if len(sampled_indices) == length_segment: + maximal_indices_list = list(range(length_segment)) + if sorted(sampled_indices) != maximal_indices_list: + logger.error( + "Sampled indices do not correspond to [0,..., max_n_indices -1].
" + "Returning list(range(max_n_indices))" + ) + return maximal_indices_list + + if len(sampled_indices) > tot_points_to_sample: + logging.warning( + "Number of sampled indices is greater than the number of indices you want to sample" + "Returning sampled indices" + ) + return sampled_indices + + index_list = list(sampled_indices) + sampled_set = set(index_list) + + for point in [0, length_segment - 1]: + if point not in sampled_set: + index_list.append(point) + sampled_set.add(point) + if len(index_list) == tot_points_to_sample: + return sorted(index_list) + + def build_prefix_and_len(index_list: list[int]) -> tuple[list[int], int]: + if not index_list: + return [0], 0 + + M = max(index_list) + 1 + sampled_set = set(index_list) + prefix = [0] * (M + 1) + s = 0 + + for i in range(M): + s += 1 if i in sampled_set else 0 + prefix[i + 1] = s + + return prefix, M + + def get_list_min_weight( + prefix: list[int], M: int, d: int, selectable_indices: list[int] + ) -> list[int]: + vals = {} + for i in selectable_indices: + if i >= M: + break + left = max(0, i - d) + right = min(M - 1, i + d) + total = prefix[right + 1] - prefix[left] + denom = right - left + 1 + mean = total / denom + vals[i] = mean + + if not vals: + return [] + + min_val = min(vals.values()) + out = [] + for i in selectable_indices: + if i >= M: + break + if vals.get(i) == min_val: + out.append(i) + return out + + def get_selectable_indices() -> list[int]: + return [i for i in range(length_segment) if i not in sampled_set] + + max_d = length_segment + + while len(index_list) < tot_points_to_sample: + selection = 0 + selectable_indices = get_selectable_indices() + prefix, M = build_prefix_and_len(index_list=index_list) + d = 1 + previous_set = selectable_indices + + while selection == 0: + indices = get_list_min_weight(prefix, M, d, selectable_indices) + + if not indices: + if not previous_set: + raise ValueError( + "Previous candidate set should not be empty or None" + ) + if verbose: + logger.info( + 
f"No intersection found with d={d}. Using the previous set " + f"Appending to {index_list} the first element of {previous_set}" + ) + chosen = previous_set[0] + index_list.append(chosen) + sampled_set.add(chosen) + selection = 1 + else: + previous_set = selectable_indices + selectable_indices = indices + + if len(selectable_indices) == 1 or d == max_d: + if verbose: + logger.info( + f"Appending to {index_list} the first element of {selectable_indices}" + ) + chosen = selectable_indices[0] + index_list.append(chosen) + sampled_set.add(chosen) + selection = 1 + + d += 1 + + if sort: + return sorted(index_list) + return index_list + + +# ============================================================================ +# High-Dimensional Sampling Functions +# ============================================================================ + + +def concatenated_latin_hypercube_sampling( + dimensions: list[int], + final_sample_size: int, + seed: int | None = None, +) -> list[list[int]]: + """ + Generates samples using Concatenated Latin Hypercube Sampling. 
+ + Args: + dimensions: Cardinality (size) of each dimension + final_sample_size: Total number of points to sample + seed: Optional PRNG seed for reproducibility + + Returns: + List of sampled points + + Raises: + ValueError: If any dimension size is less than 1 + """ + if any(d <= 0 for d in dimensions): + raise ValueError( + f"All dimensions must be >= 1, received dimensions={dimensions}" + ) + + if final_sample_size <= 0: + return [] + + rng = random.Random() if seed is None else random.Random(seed) # noqa: S311 + pools: list[list[int]] = [list(range(d)) for d in dimensions] + samples: list[list[int]] = [] + + for _ in range(final_sample_size): + point: list[int] = [] + for j, d in enumerate(dimensions): + if not pools[j]: + pools[j] = list(range(d)) + k = rng.randrange(len(pools[j])) + value = pools[j].pop(k) + point.append(value) + samples.append(point) + + return samples + + +def sobol_sampling( + dimensions: list[int], final_sample_size: int, seed: int | None = None +) -> list[list[int]]: + """ + Generates Sobol sampled points scaled to integer dimensions. + + Falls back to CLHS if collisions are detected. 
+ + Args: + dimensions: Size of each dimension + final_sample_size: Number of points to sample + seed: Random seed for the Sobol scrambler + + Returns: + List of sampled points + """ + sampler = Sobol(d=len(dimensions), scramble=True, rng=seed) + points = sampler.random(final_sample_size) + + discrete_points = [ + [int(val * d) for val, d in zip(p, dimensions, strict=True)] for p in points + ] + + unique_points = {tuple(p) for p in discrete_points} + n_collisions = final_sample_size - len(unique_points) + + if n_collisions > 0: + logger.error( + f"Sobol sampling failed, {n_collisions} collisions detected, defaulting to clhs sampling" + ) + return concatenated_latin_hypercube_sampling( + dimensions=dimensions, final_sample_size=final_sample_size, seed=seed + ) + + return discrete_points + + +def random_high_dimensional_sampling( + dimensions: list[int], final_sample_size: int, seed: int | None = None +) -> list[list[int]]: + """ + Generate unique random samples from a high-dimensional space. + + Args: + dimensions: Cardinality of each dimension + final_sample_size: Total number of points to sample + seed: Optional PRNG seed + + Returns: + List of sampled points + + Raises: + ValueError: If final_sample_size exceeds total configurations + """ + if seed is not None: + random.seed(seed) + + num_configs = math.prod(dimensions) + if final_sample_size > num_configs: + raise ValueError( + f"Cannot generate {final_sample_size} unique samples. " + f"The sample space only contains {num_configs} possibilities." + ) + + configs = list(itertools.product(*[range(d) for d in dimensions])) + actual_sample_size = min(final_sample_size, len(configs)) + + if actual_sample_size < final_sample_size: + logger.warning( + f"Requested {final_sample_size} samples but only {len(configs)} unique " + f"configurations available. Sampling {actual_sample_size} instead." 
+ ) + + samples = random.sample(configs, actual_sample_size) + return [list(s) for s in samples] + + +def get_sampling_indices_multi_dimensional( + dimensions: list[int], + n: int | Literal["all", "max"], + space: dict[str, int] | None = None, + strategy: Literal["random", "clhs", "sobol"] = "clhs", + seed: int | None = None, +) -> list[list[int]]: + """ + Generate sampling indices for a high-dimensional space. + + Args: + dimensions: Sizes of each dimension + n: Number of points to sample ('all', 'max', or integer) + space: Optional mapping of dimension names to sizes + strategy: Sampling strategy ('random', 'clhs', or 'sobol') + seed: Controls randomness + + Returns: + List of sampled multi-dimensional coordinates + """ + if seed is not None: + random.seed(seed) + + if space: + indices_dict = { + k: get_index_list_van_der_corput(v, v) for k, v in space.items() + } + if [len(indices) for indices in list(indices_dict.values())] != dimensions: + logger.error( + f"A space dict has been provided ->{space}. It is inconsistent with dimensions={dimensions}" + ) + raise ValueError("Space has inconsistent dimensions!") + logger.info( + "Sampling indices for each named dimension (ordered low to high): %s", + indices_dict, + ) + + orders = [get_index_list_van_der_corput(v, v) for v in dimensions] + + if logger.isEnabledFor(logging.DEBUG): + logger.debug("Dimensions: %s", dimensions) + logger.debug("Sampling orders for each dimension:") + for i, o in enumerate(orders): + logger.debug("Dimension %d order: %s", i, o) + + maximum_n = math.prod(dimensions) + lcm = math.lcm(*dimensions) + + if lcm != maximum_n: + logger.debug( + "Periodicity detected, the sampling subroutine will ensure that you will not sample" + "the same configuration more than once." 
+ ) + + if isinstance(n, str): + if n == "all": + n = maximum_n + elif n == "max": + n = max(dimensions) + else: + raise ValueError(f"Unrecognized string for n: {n}") + + if n > maximum_n: + logger.warning( + f"Maximal sample size is {maximum_n}, you requested {n} sampling prescriptions. " + f"Elaborating prescription for n_samples = {maximum_n}" + ) + n = maximum_n + + logger.debug("Preparing to sample %d out of %d possible points.", n, maximum_n) + + match strategy: + case "random": + return random_high_dimensional_sampling(dimensions, n, seed=seed) + case "clhs": + return concatenated_latin_hypercube_sampling( + dimensions=dimensions, final_sample_size=n, seed=seed + ) + case "sobol": + return sobol_sampling(dimensions=dimensions, final_sample_size=n, seed=seed) + case _: + raise NotImplementedError(f"Strategy {strategy} is unknown") + + +# ============================================================================ +# DataFrame Ordering and Index Mapping +# ============================================================================ + + +def get_index_list_nn_high_dimensional( + orders_to_sample: list[list[int]], dimensions: list[int] +) -> list[int]: + """ + Map high-dimensional sampling orders to linear (flattened) indices.
+ + Args: + orders_to_sample: List of multi-dimensional coordinates + dimensions: Size of each dimension + + Returns: + List of linear indices + + Warns: + If duplicate or out-of-bounds indices are detected + """ + indices = [] + cprod = np.cumprod(np.array(dimensions), dtype=int).tolist() + maximum_n = cprod[-1] + + for order in orders_to_sample: + index = 0 + multiplier = 1 + for i in reversed(range(len(dimensions))): + index += order[i] * multiplier + multiplier *= dimensions[i] + + if index >= maximum_n: + logger.warning( + f"Out of bound index {index} computed from order {order}, dimensions are {dimensions}" + ) + indices.append(index) + + if len(set(indices)) != len(indices): + logger.error(f"{len(indices) - len(set(indices))} Duplicated indices!") + + out_of_bounds_list = [i for i in indices if i >= maximum_n] + if out_of_bounds_list: + logger.error( + f"The following indices are out of bound: {out_of_bounds_list}, maximum admissible value is {maximum_n-1}" + ) + + return indices + + +def order_df_for_get_index_list_nn_high_dimensional( + df: pd.DataFrame, constitutive_properties: list[str], dimensions: list[int] +) -> pd.DataFrame: + """ + Ensure DataFrame is ordered and complete for high-dimensional index generation. + + Args: + df: Input DataFrame + constitutive_properties: Column names defining the space + dimensions: Expected cardinality for each property + + Returns: + DataFrame sorted and augmented with missing combinations + """ + df = df.sort_values(by=constitutive_properties).reset_index(drop=True) + expected_len = math.prod(dimensions) + + if len(df) == expected_len: + return df + + unique_values = [ + sorted(df[prop].dropna().unique()) for prop in constitutive_properties + ] + all_combinations = list(itertools.product(*unique_values)) + actual_expected_len = len(all_combinations) + + logger.warning( + f"DataFrame length mismatch: expected {expected_len} (product of {dimensions}), " + f"but got {len(df)}.
Actual unique combinations: {actual_expected_len}." + ) + + existing_combinations = { + tuple(row[prop] for prop in constitutive_properties) for _, row in df.iterrows() + } + + missing_combinations = [ + comb for comb in all_combinations if comb not in existing_combinations + ] + + if missing_combinations: + logger.info( + f"Injecting {len(missing_combinations)} missing rows to satisfy the property." + ) + injected_rows = [] + for comb in missing_combinations: + row_data = dict(zip(constitutive_properties, comb, strict=False)) + for col in df.columns: + if col not in constitutive_properties: + row_data[col] = pd.NA + injected_rows.append(row_data) + + df = pd.concat([df, pd.DataFrame(injected_rows)], ignore_index=True) + df = df.sort_values(by=constitutive_properties).reset_index(drop=True) + logger.info(f"Injected rows: {injected_rows}") + + return df + + +def order_df_for_sampling_with_no_priors( + df: pd.DataFrame, + constitutive_properties: list[str], + n: int, + strategy: Literal["random", "clhs", "sobol"], +) -> pd.DataFrame: + """ + Orders a DataFrame for high-dimensional sampling without prior knowledge. + + Args: + df: Input dataset + constitutive_properties: Column names defining the configuration space + n: Number of samples to generate + strategy: Sampling strategy + + Returns: + DataFrame with n sampled rows + + Raises: + ValueError: If n <= 0 after adjustment or no samples available + """ + len_original = len(df) + df_unique = df.drop_duplicates(subset=constitutive_properties).reset_index( + drop=True + ) + delta_len = len_original - len(df_unique) + if delta_len > 0: + logging.warning( + f"Removing {delta_len} duplicate configurations." + f"They are characterized by the same combination of constitutive properties = {constitutive_properties}" + ) + + if n > len(df_unique): + logging.warning( + f"Requested {n} samples, but DataFrame has only {len(df_unique)} rows. Adjusting n to {len(df_unique)}." 
+ ) + n = len(df_unique) + + if n <= 0: + logging.error( + f"No samples available to select. DataFrame has {len(df_unique)} rows and {n} samples were requested." + ) + return pd.DataFrame(columns=df_unique.columns) + + def _get_sorted_uniques(prop: str) -> list: + vals = df_unique[prop].unique() + try: + return sorted(vals) + except TypeError: + logging.warning( + f"Cannot sort mixed types for property '{prop}'. " + "Keeping original order." + ) + return list(vals) + + value_dict = {prop: _get_sorted_uniques(prop) for prop in constitutive_properties} + space_dict = {prop: len(vals) for prop, vals in value_dict.items()} + dimensions = list(space_dict.values()) + + df_unique = order_df_for_get_index_list_nn_high_dimensional( + df_unique, constitutive_properties, dimensions=dimensions + ).reset_index(drop=True) + + orders_to_sample = get_sampling_indices_multi_dimensional( + dimensions=dimensions, space=space_dict, n=n, strategy=strategy + ) + + indices_to_sample = get_index_list_nn_high_dimensional(orders_to_sample, dimensions) + + logger.info(f"Indexes are:\n {indices_to_sample}") + try: + return df_unique.iloc[indices_to_sample] + except IndexError: + logging.error( + f"Index Error detected. Length of the dataframe is {len(df_unique)}." 
+ "The indices that cause the error are:" + ) + max_len = len(df_unique) + out_of_bounds_list = [i for i in indices_to_sample if i < 0 or i >= max_len] + logging.error(out_of_bounds_list) + logging.error("Returning empty dataset") + return pd.DataFrame({}) + + +# ============================================================================ +# Discovery Space Data Extraction +# ============================================================================ + + +def get_project_context() -> ProjectContext: + """Retrieve the current ADO project context from configuration.""" + import orchestrator.cli.core.config + + ado_configuration = orchestrator.cli.core.config.AdoConfiguration.load() + return ado_configuration.project_context # type: ignore[name-defined] + + +def get_space( + space_or_space_id: DiscoverySpace | str, +) -> DiscoverySpace: + """Get a DiscoverySpace object from either a space object or identifier string.""" + if isinstance(space_or_space_id, DiscoverySpace): + return space_or_space_id + + return DiscoverySpace.from_stored_configuration( + project_context=get_project_context(), + space_identifier=space_or_space_id, + ) + + +def get_df_all_entities_no_measurements( + discoverySpace: DiscoverySpace | str, +) -> pd.DataFrame: + """ + Return a DataFrame of all entities in the Discovery Space. 
+ + Returns: + DataFrame with columns: ['identifier', ] + """ + space = get_space(space_or_space_id=discoverySpace) + entity_space = space.entitySpace + cp_ids = [cp.identifier for cp in entity_space.constitutiveProperties] + + list_of_dicts_to_convert = [] + for point_values in entity_space.sequential_point_iterator(): + point_dict = dict(zip(cp_ids, point_values, strict=True)) + entity = entity_space.entity_for_point(point_dict) + ed = {"identifier": entity.identifier} + ed.update(point_dict) + list_of_dicts_to_convert.append(ed) + + return pd.DataFrame(list_of_dicts_to_convert) + + +def get_df_at_least_one_measured_value( + discoverySpace: DiscoverySpace | str, + targetOutput_list: list[str] | None = None, + add_measurement_id: bool = False, +) -> pd.DataFrame: + """ + Return a DataFrame of entities with at least one measured target output. + + Returns: + DataFrame with columns: ['identifier' (optional), , ] + """ + if not targetOutput_list: + targetOutput_list = [] + space = get_space(space_or_space_id=discoverySpace) + col_list = [cp.identifier for cp in space.entitySpace.constitutiveProperties] + if add_measurement_id: + col_list = ["identifier", *col_list] + + # Refresh through the resolved space: discoverySpace may be an identifier string + space.sample_store.refresh() + + df = pd.DataFrame( + space.matchingEntitiesTable( + property_type="target", + aggregationMethod=PropertyAggregationMethodEnum.mean, + ) + ) + + if df.empty: + logger.warning( + "No measured properties found in the discovery space\nReturning empty DataFrame\n " + ) + return df + + all_df_cols = list(df.columns) + valid_targetOutput_list = [] + for el in targetOutput_list: + if el in all_df_cols: + valid_targetOutput_list.append(el) + elif f"{el}-mean" in all_df_cols and el not in all_df_cols: + logger.warning( + f"Column named '{el}-mean' (instead of '{el}', which is not present) " + "found in the DataFrame obtained through matchingEntitiesTable. " + f"Renaming it to '{el}'.
+ ) + df.rename(columns={f"{el}-mean": el}, inplace=True) + valid_targetOutput_list += [el] + elif f"{el}-mean" in all_df_cols and el in all_df_cols: + logger.warning( + f"Columns named '{el}-mean' and '{el}'" + "found in the DataFrame obtained through matchingEntitiesTable. " + f"Renaming it to '{el}'." + ) + logger.error("Unexpected behavior can happen!") + df.rename(columns={f"{el}-mean": el}, inplace=True) + valid_targetOutput_list += [el] + col_list += valid_targetOutput_list + + if valid_targetOutput_list != targetOutput_list: + if len(valid_targetOutput_list) == 0: + logger.error( + "No valid target in the columns of the DataFrame." + f"columns are:\t{list(df.columns)}." + f"First rows are:\n{df.head(5)}" + ) + else: + not_found = [ + t for t in targetOutput_list if t not in valid_targetOutput_list + ] + logger.error( + f"Found measurements for the following valid targets:\t{valid_targetOutput_list}" + ) + logger.error( + f"No measurement found for the following valid targets:\t{not_found}" + ) + + removed_cols = [c for c in list(df.columns) if c not in col_list] + logger.debug( + "Obtaining df with at least one measured target." + f"Removed columns: {removed_cols}" + ) + + df = df[col_list] + df.dropna(inplace=True) + + if df.empty: + logger.warning( + "Although there were some measured properties in the discovery space." + ) + logger.warning( + "All measured properties in the discovery space" + f"are different from the desired outputs {targetOutput_list}.Returning empty DataFrame\n " + ) + + return df + + +def get_source_and_target( + discoverySpace: DiscoverySpace | str, + targetOutput: str, + log_string: str = "", +) -> tuple[pd.DataFrame, pd.DataFrame]: + """ + Build source (labeled) and target (unlabeled) DataFrames for a target output. 
+ + Returns: + Tuple of (source_df, target_df) + """ + dfm = get_df_at_least_one_measured_value(discoverySpace, [targetOutput]) + dfu = get_df_all_entities_no_measurements(discoverySpace) + keys = [c for c in dfu.columns if c in dfm.columns and c != "identifier"] + + if dfm.empty: + logger.warning("The source space is empty") + return dfm, dfu + + df = dfu.merge(dfm, on=keys, how="left") + + if targetOutput not in list(df.columns): + logger.info( + f"""The target output was not present in the columns of the measured+unmeasured DataFrame,' \ + meaning that '{targetOutput}' has never been measured in this space. + dfm.empty = {df.empty}. Adding an empty column to the DataFrame. + """ + ) + logger.debug("Adding an empty column to the DataFrame.") + df[targetOutput] = pd.NA + + if targetOutput in list(df.columns): + df_measured_drop_na = df.dropna(subset=[targetOutput]) + df_unmeasured_drop_na = df[df[targetOutput].isna()].drop(columns=[targetOutput]) + n_rows_dropped = len(df) - len(df_measured_drop_na) + logger.debug( + f"Dropped {n_rows_dropped} rows. 
Function called with log_string={log_string}" + ) + if df_measured_drop_na.empty: + logger.warning( + f"Empty source after dropping rows that contain Nan in {targetOutput} column" + ) + if df_unmeasured_drop_na.empty: + logger.warning( + f"Empty target after filtering rows that contain Nan in {targetOutput} column" + ) + return df_measured_drop_na, df_unmeasured_drop_na + + save_path = "df_with_no_targetOutput_columns.csv" + logger.error( + f"'{targetOutput}' column is missing, saving df in {save_path}, returning unmerged DataFrames" + ) + df.to_csv(save_path) + return dfm, dfu + + +# ============================================================================ +# Entity/Point Conversion +# ============================================================================ + + +def validate_points_in_space( + points: list[dict], + space: DiscoverySpace, +) -> tuple[list[dict], list[int]]: + """ + Validate point dictionaries against a Discovery Space. + + Returns: + Tuple of (valid_points, invalid_indices) + """ + valid_points: list[dict] = [] + invalid_indices: list[int] = [] + + for i, p in enumerate(points): + if space.entitySpace.isPointInSpace(p): + valid_points.append(p) + else: + invalid_indices.append(i) + return valid_points, invalid_indices + + +def df_to_points( + df: pd.DataFrame, + cols: list[str] | None = None, + dropna: bool = True, + drop_duplicates: bool = False, +) -> list[dict[Hashable, Any]]: + """ + Convert DataFrame rows to list of point dictionaries. 
+ + Args: + df: Input DataFrame + cols: Columns to include + dropna: If True, drop rows containing NaN + drop_duplicates: If True, drop duplicate rows + + Returns: + List of point dictionaries + """ + if cols is None: + cols = list(df.columns) + missing = set(cols) - set(df.columns) + if missing: + raise KeyError(f"Requested columns not present in DataFrame: {missing}") + + sub = df[cols].copy() + if dropna: + sub = sub.dropna(how="any") + if drop_duplicates: + sub = sub.drop_duplicates() + + def to_py(x: object) -> object: + if isinstance(x, (np.generic)): + return x.item() + return x + + for c in sub.columns: + sub[c] = sub[c].map(to_py) + + return sub.to_dict(orient="records") + + +def df_to_points_parsing( + df: pd.DataFrame, + cols: list[str] | None = None, + dropna: bool = True, + parse_values: bool = False, +) -> list[dict]: + """Convert DataFrame to points with optional string value parsing.""" + import ast + + points = df_to_points(df, cols=cols, dropna=dropna) + if not parse_values: + return points + + parsed = [] + for p in points: + newp = {} + for k, v in p.items(): + if isinstance(v, str): + try: + newp[k] = ast.literal_eval(v) + except Exception: + newp[k] = v + else: + newp[k] = v + parsed.append(newp) + return parsed + + +def make_points_from_df( + df: pd.DataFrame, + space: DiscoverySpace, + cols: list[str] | None = None, + dropna: bool = True, + parse_values: bool = True, +) -> list[dict]: + """ + Convert DataFrame of constitutive properties into point dictionaries. 
+ + Args: + df: Input DataFrame + space: Discovery Space providing canonical order + cols: Explicit list of columns to use + dropna: If True, drop rows with NaN + parse_values: If True, parse string values + + Returns: + List of point dictionaries + """ + if cols is None: + cols = [cp.identifier for cp in space.entitySpace.constitutiveProperties] + + missing = set(cols) - set(df.columns) + if missing: + raise KeyError(f"Requested columns not present in DataFrame: {missing}") + + return df_to_points_parsing(df, cols=cols, dropna=dropna, parse_values=parse_values) + + +def get_list_of_entities_from_df_and_space( + df: pd.DataFrame, space: DiscoverySpace +) -> list[Entity]: + """ + Convert DataFrame rows to Entity objects validated against a discovery space. + + Args: + df: DataFrame containing constitutive property values + space: DiscoverySpace defining the entity space constraints + + Returns: + List of valid Entity objects + """ + points = make_points_from_df(df=df, space=space) + valid_points, __ = validate_points_in_space(points, space) + + list_of_entities = [] + from orchestrator.schema.point import SpacePoint + + for p in valid_points: + sp = SpacePoint(entity=p) + entity = sp.to_entity(generatorid="no_priors_characterization") + list_of_entities.append(entity) + + numberEntities = len(list_of_entities) + if numberEntities != len(df): + numberEntities_log = f"""Warning: number of valid entities {numberEntities} is different from the number of rows in the ordered df {len(df)}. + This means that some rows in the ordered df did not correspond to valid entities in the discovery space. 
+ """ + logging.warning(numberEntities_log) + return list_of_entities + + +# Made with Bob diff --git a/plugins/operators/trim/src/trim/trim_pydantic.py b/plugins/operators/trim/src/trim/trim_pydantic.py index 0010d297b..05362a7ab 100644 --- a/plugins/operators/trim/src/trim/trim_pydantic.py +++ b/plugins/operators/trim/src/trim/trim_pydantic.py @@ -5,9 +5,10 @@ from typing import Annotated import pydantic -from no_priors_characterization.no_priors_pydantic import NoPriorsParameters from pydantic import BaseModel, ConfigDict, Field, model_validator +from trim.samplers.no_priors_parameters import NoPriorsParameters + class SamplingBudget(pydantic.BaseModel): minPoints: Annotated[ diff --git a/plugins/operators/trim/src/trim/trim_sampler.py b/plugins/operators/trim/src/trim/trim_sampler.py index c22b83bea..9ca0f7833 100644 --- a/plugins/operators/trim/src/trim/trim_sampler.py +++ b/plugins/operators/trim/src/trim/trim_sampler.py @@ -20,6 +20,11 @@ from autogluon.tabular import TabularDataset, TabularPredictor from orchestrator.core.discoveryspace.samplers import BaseSampler +from trim.samplers.no_priors_utils import ( + get_index_list_van_der_corput, + get_list_of_entities_from_df_and_space, + get_source_and_target, +) from trim.trim_pydantic import TrimParameters if TYPE_CHECKING: @@ -29,11 +34,6 @@ from orchestrator.modules.operators.discovery_space_manager import ( DiscoverySpaceManager, ) -from no_priors_characterization.utils import ( - get_index_list_van_der_corput, - get_list_of_entities_from_df_and_space, - get_source_and_target, -) from orchestrator.utilities.pandas import sort_rows_by_column_names from trim.utils.exceptions import InsufficientDataError diff --git a/plugins/operators/trim/src/trim/utils/order.py b/plugins/operators/trim/src/trim/utils/order.py index 459657ade..eb7c2a8b8 100644 --- a/plugins/operators/trim/src/trim/utils/order.py +++ b/plugins/operators/trim/src/trim/utils/order.py @@ -9,8 +9,10 @@ import numpy as np import pandas as pd from 
autogluon.tabular import TabularPredictor -from no_priors_characterization.utils import get_sampling_indices_multi_dimensional +from trim.samplers.no_priors_utils import ( + get_sampling_indices_multi_dimensional, +) from trim.trim_pydantic import AutoGluonArgs from trim.utils.miscellaneous import delete_dir diff --git a/plugins/operators/trim/tests/test_high_dimensional_sampling.py b/plugins/operators/trim/tests/test_high_dimensional_sampling.py index 0b2c6457c..c8971692f 100644 --- a/plugins/operators/trim/tests/test_high_dimensional_sampling.py +++ b/plugins/operators/trim/tests/test_high_dimensional_sampling.py @@ -13,10 +13,8 @@ from typing import Any import pytest -from no_priors_characterization.utils.high_dimensional_sampling import ( - concatenated_latin_hypercube_sampling, -) from test_data_documentation import TEST_DATAFRAMES +from trim.samplers.no_priors_utils import concatenated_latin_hypercube_sampling class TestConcatenatedLatinHypercubeSampling: diff --git a/plugins/operators/trim/tests/test_sampling.py b/plugins/operators/trim/tests/test_sampling.py index a0113b1ae..4fcc79486 100644 --- a/plugins/operators/trim/tests/test_sampling.py +++ b/plugins/operators/trim/tests/test_sampling.py @@ -2,10 +2,7 @@ # SPDX-License-Identifier: MIT import pytest -from no_priors_characterization.utils.one_dimensional_sampling import ( - get_index_list_ordered_partitions, - get_index_list_van_der_corput, -) # Replace with actual module name +from trim.samplers.no_priors_utils import get_index_list_van_der_corput # --- Error Handling Tests --- @@ -36,21 +33,3 @@ def test_get_index_list_nn_full_sampling() -> None: def test_get_index_list_nn_sorted_sampling(points: int, expected: list[int]) -> None: """Should return sorted sampling for segment of length 17.""" assert get_index_list_van_der_corput(17, points, sort=True) == expected - - -# --- Functional Tests for get_index_list_ordered_partitions --- - - -@pytest.mark.parametrize( - ("points", "expected"), - [ - (7, [0, 
2, 4, 8, 10, 12, 16]), - (8, [0, 2, 4, 6, 8, 10, 12, 16]), - (9, [0, 2, 4, 6, 8, 10, 12, 14, 16]), - ], -) -def test_get_index_list_ordered_partitions_sampling( - points: int, expected: list[int] -) -> None: - """Should return correct partition-based sampling for segment of length 17.""" - assert get_index_list_ordered_partitions(17, points) == expected diff --git a/pyproject.toml b/pyproject.toml index f1ac80459..2300b3854 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -92,7 +92,6 @@ resolution-helpers = [ # cuda dependencies. test = [ "ado-autoconf", - "ado-no-priors-characterization", "ado-ray-tune", "ado-sfttrainer; python_version < '3.13'", "ado-trim", @@ -152,7 +151,6 @@ members = [ [tool.uv.sources] ado-autoconf = { workspace = true, editable = true } -ado-no-priors-characterization = { workspace = true, editable = true } ado-ray-tune = { workspace = true, editable = true } ado-sfttrainer = { path = "plugins/actuators/sfttrainer", editable = true } ado-trim = { workspace = true, editable = true } diff --git a/requirements.txt b/requirements.txt index 7614c34cf..199703253 100644 --- a/requirements.txt +++ b/requirements.txt @@ -422,7 +422,6 @@ googleapis-common-protos==1.74.0 \ # via google-api-core greenlet==3.4.0 ; platform_machine == 'AMD64' or platform_machine == 'WIN32' or platform_machine == 'aarch64' or platform_machine == 'amd64' or platform_machine == 'ppc64le' or platform_machine == 'win32' or platform_machine == 'x86_64' \ --hash=sha256:04403ac74fe295a361f650818de93be11b5038a78f49ccfb64d3b1be8fbf1267 \ - --hash=sha256:0e1254cf0cbaa17b04320c3a78575f29f3c161ef38f59c977108f19ffddaf077 \ --hash=sha256:1054c5a3c78e2ab599d452f23f7adafef55062a783a8e241d24f3b633ba6ff82 \ --hash=sha256:16dec271460a9a2b154e3b1c2fa1050ce6280878430320e85e08c166772e3f97 \ --hash=sha256:1a54a921561dd9518d31d2d3db4d7f80e589083063ab4d3e2e950756ef809e1a \ @@ -436,27 +435,20 @@ greenlet==3.4.0 ; platform_machine == 'AMD64' or platform_machine == 'WIN32' or 
--hash=sha256:5b99e87be7eba788dd5b75ba1cde5639edffdec5f91fe0d734a249535ec3408c \ --hash=sha256:5cb614ace7c27571270354e9c9f696554d073f8aa9319079dcba466bbdead711 \ --hash=sha256:636d2f95c309e35f650e421c23297d5011716be15d966e6328b367c9fc513a82 \ - --hash=sha256:6f0def07ec9a71d72315cf26c061aceee53b306c36ed38c35caba952ea1b319d \ --hash=sha256:805bebb4945094acbab757d34d6e1098be6de8966009ab9ca54f06ff492def58 \ --hash=sha256:8424683caf46eb0eb6f626cb95e008e8cc30d0cb675bdfa48200925c79b38a08 \ --hash=sha256:849f8bc17acd6295fcb5de8e46d55cc0e52381c56eaf50a2afd258e97bc65940 \ - --hash=sha256:89995ce5ddcd2896d89615116dd39b9703bfa0c07b583b85b89bf1b5d6eddf81 \ - --hash=sha256:8c5696c42e6bb5cfb7c6ff4453789081c66b9b91f061e5e9367fa15792644e76 \ --hash=sha256:90036ce224ed6fe75508c1907a77e4540176dcf0744473627785dd519c6f9996 \ --hash=sha256:9390ad88b652b1903814eaabd629ca184db15e0eeb6fe8a390bbf8b9106ae15a \ --hash=sha256:956215d5e355fffa7c021d168728321fd4d31fd730ac609b1653b450f6a4bc71 \ - --hash=sha256:98eedd1803353daf1cd9ef23eef23eda5a4d22f99b1f998d273a8b78b70dd47f \ --hash=sha256:9b2d9a138ffa0e306d0e2b72976d2fb10b97e690d40ab36a472acaab0838e2de \ --hash=sha256:a0a53fb071531d003b075c444014ff8f8b1a9898d36bb88abd9ac7b3524648a2 \ --hash=sha256:a19093fbad824ed7c0f355b5ff4214bffda5f1a7f35f29b31fcaa240cc0135ab \ --hash=sha256:a1c4f6b453006efb8310affb2d132832e9bbb4fc01ce6df6b70d810d38f1f6dc \ --hash=sha256:a70ed1cb0295bee1df57b63bf7f46b4e56a5c93709eea769c1fec1bb23a95875 \ - --hash=sha256:ac6a5f618be581e1e0713aecec8e54093c235e5fa17d6d8eb7ffc487e2300508 \ --hash=sha256:b45e45fe47a19051a396abb22e19e7836a59ee6c5a90f3be427343c37908d65b \ - --hash=sha256:b7857e2202aae67bc5725e0c1f6403c20a8ff46094ece015e7d474f5f7020b55 \ --hash=sha256:c660bce1940a1acae5f51f0a064f1bc785d07ea16efcb4bc708090afc4d69e83 \ --hash=sha256:d18eae9a7fb0f499efcd146b8c9750a2e1f6e0e93b5a382b3481875354a430e6 \ - --hash=sha256:d336d46878e486de7d9458653c722875547ac8d36a1cff9ffaf4a74a3c1f62eb \ 
--hash=sha256:ee407d4d1ca9dc632265aee1c8732c4a2d60adff848057cdebfe5fe94eb2c8a2 \ --hash=sha256:f38b81880ba28f232f1f675893a39cf7b6db25b31cc0a09bb50787ecf957e85e \ --hash=sha256:f50a96b64dafd6169e595a5c56c9146ef80333e67d4476a65a9c55f400fc22ff \ diff --git a/tests/fixtures/modules/operators.py b/tests/fixtures/modules/operators.py index 7e3320bc9..557613eba 100644 --- a/tests/fixtures/modules/operators.py +++ b/tests/fixtures/modules/operators.py @@ -17,7 +17,7 @@ @pytest.fixture def expected_characterize_operators() -> list[str]: - return ["profile", "detect_anomalous_series", "trim", "no_priors_characterization"] + return ["profile", "detect_anomalous_series", "trim"] @pytest.fixture diff --git a/tests/operators/test_general_orchestration.py b/tests/operators/test_general_orchestration.py index 4db4c904a..86c250800 100644 --- a/tests/operators/test_general_orchestration.py +++ b/tests/operators/test_general_orchestration.py @@ -14,7 +14,7 @@ @pytest.mark.parametrize( "operator_name", - ["profile", "no_priors_characterization"], + ["profile"], ) def test_operator_callable_for_harness_unwraps_decorated_operator( operator_name: str, diff --git a/tests/operators/test_trim_example_integration.py b/tests/operators/test_trim_example_integration.py index 4e05f9eb5..260e58074 100644 --- a/tests/operators/test_trim_example_integration.py +++ b/tests/operators/test_trim_example_integration.py @@ -9,7 +9,6 @@ import pytest import trim_custom_experiments.experiments # noqa: F401 — registers ideal-gas experiment import yaml -from no_priors_characterization.no_priors_pydantic import NoPriorsParameters from testcontainers.mysql import MySqlContainer import orchestrator.modules.operators.randomwalk # noqa: F401 @@ -31,6 +30,7 @@ pytest.importorskip("autogluon") +from trim.samplers.no_priors_parameters import NoPriorsParameters from trim.trim_pydantic import ( AutoGluonArgs, SamplingBudget, diff --git a/uv.lock b/uv.lock index 158ae9425..853607571 100644 --- a/uv.lock +++ b/uv.lock 
@@ -16,7 +16,6 @@ required-markers = [ members = [ "ado-autoconf", "ado-core", - "ado-no-priors-characterization", "ado-ray-tune", "ado-trim", "ado-vllm-performance", @@ -122,7 +121,6 @@ resolution-helpers = [ ] test = [ { name = "ado-autoconf" }, - { name = "ado-no-priors-characterization" }, { name = "ado-ray-tune" }, { name = "ado-sfttrainer", marker = "python_full_version < '3.13'" }, { name = "ado-trim" }, @@ -186,7 +184,6 @@ docs = [ resolution-helpers = [{ name = "urllib3", specifier = ">=2.5.0" }] test = [ { name = "ado-autoconf", editable = "plugins/custom_experiments/autoconf" }, - { name = "ado-no-priors-characterization", editable = "plugins/operators/no-priors-characterization" }, { name = "ado-ray-tune", editable = "plugins/operators/ray_tune" }, { name = "ado-sfttrainer", marker = "python_full_version < '3.13'", editable = "plugins/actuators/sfttrainer" }, { name = "ado-trim", editable = "plugins/operators/trim" }, @@ -200,25 +197,6 @@ test = [ { name = "trim-custom-experiments", editable = "examples/trim/custom_experiments" }, ] -[[package]] -name = "ado-no-priors-characterization" -source = { editable = "plugins/operators/no-priors-characterization" } -dependencies = [ - { name = "ado-core" }, - { name = "numpy" }, - { name = "pandas" }, - { name = "scipy", version = "1.15.3", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.11'" }, - { name = "scipy", version = "1.16.3", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.11'" }, -] - -[package.metadata] -requires-dist = [ - { name = "ado-core", editable = "." 
}, - { name = "numpy" }, - { name = "pandas", specifier = ">=2.2.0" }, - { name = "scipy" }, -] - [[package]] name = "ado-ray-tune" source = { editable = "plugins/operators/ray_tune" } @@ -268,7 +246,6 @@ name = "ado-trim" source = { editable = "plugins/operators/trim" } dependencies = [ { name = "ado-core" }, - { name = "ado-no-priors-characterization" }, { name = "autogluon-tabular", extra = ["catboost", "xgboost"] }, { name = "numpy" }, { name = "pandas" }, @@ -278,7 +255,6 @@ dependencies = [ [package.metadata] requires-dist = [ { name = "ado-core", editable = "." }, - { name = "ado-no-priors-characterization", editable = "plugins/operators/no-priors-characterization" }, { name = "autogluon-tabular", extras = ["catboost", "xgboost"], specifier = "==1.5" }, { name = "numpy" }, { name = "pandas", specifier = ">=2.2.0" }, @@ -2618,18 +2594,14 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/0c/bc/e30e1e3d5e8860b0e0ce4d2b16b2681b77fd13542fc0d72f7e3c22d16eff/greenlet-3.4.0-cp310-cp310-macosx_11_0_universal2.whl", hash = "sha256:d18eae9a7fb0f499efcd146b8c9750a2e1f6e0e93b5a382b3481875354a430e6", size = 284315, upload-time = "2026-04-08T17:02:52.322Z" }, { url = "https://files.pythonhosted.org/packages/5b/cc/e023ae1967d2a26737387cac083e99e47f65f58868bd155c4c80c01ec4e0/greenlet-3.4.0-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:636d2f95c309e35f650e421c23297d5011716be15d966e6328b367c9fc513a82", size = 601916, upload-time = "2026-04-08T16:24:35.533Z" }, { url = "https://files.pythonhosted.org/packages/67/32/5be1677954b6d8810b33abe94e3eb88726311c58fa777dc97e390f7caf5a/greenlet-3.4.0-cp310-cp310-manylinux_2_24_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:234582c20af9742583c3b2ddfbdbb58a756cfff803763ffaae1ac7990a9fac31", size = 616399, upload-time = "2026-04-08T16:30:54.536Z" }, - { url = 
"https://files.pythonhosted.org/packages/82/0a/3a4af092b09ea02bcda30f33fd7db397619132fe52c6ece24b9363130d34/greenlet-3.4.0-cp310-cp310-manylinux_2_24_s390x.manylinux_2_28_s390x.whl", hash = "sha256:ac6a5f618be581e1e0713aecec8e54093c235e5fa17d6d8eb7ffc487e2300508", size = 621077, upload-time = "2026-04-08T16:40:34.946Z" }, { url = "https://files.pythonhosted.org/packages/74/bf/2d58d5ea515704f83e34699128c9072a34bea27d2b6a556e102105fe62a5/greenlet-3.4.0-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:523677e69cd4711b5a014e37bc1fb3a29947c3e3a5bb6a527e1cc50312e5a398", size = 611978, upload-time = "2026-04-08T15:56:31.335Z" }, - { url = "https://files.pythonhosted.org/packages/8c/39/3786520a7d5e33ee87b3da2531f589a3882abf686a42a3773183a41ef010/greenlet-3.4.0-cp310-cp310-manylinux_2_39_riscv64.whl", hash = "sha256:d336d46878e486de7d9458653c722875547ac8d36a1cff9ffaf4a74a3c1f62eb", size = 416893, upload-time = "2026-04-08T16:43:02.392Z" }, { url = "https://files.pythonhosted.org/packages/bd/69/6525049b6c179d8a923256304d8387b8bdd4acab1acf0407852463c6d514/greenlet-3.4.0-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:b45e45fe47a19051a396abb22e19e7836a59ee6c5a90f3be427343c37908d65b", size = 1571957, upload-time = "2026-04-08T16:26:17.041Z" }, { url = "https://files.pythonhosted.org/packages/4e/6c/bbfb798b05fec736a0d24dc23e81b45bcee87f45a83cfb39db031853bddc/greenlet-3.4.0-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:5434271357be07f3ad0936c312645853b7e689e679e29310e2de09a9ea6c3adf", size = 1637223, upload-time = "2026-04-08T15:57:27.556Z" }, { url = "https://files.pythonhosted.org/packages/b7/7d/981fe0e7c07bd9d5e7eb18decb8590a11e3955878291f7a7de2e9c668eb7/greenlet-3.4.0-cp310-cp310-win_amd64.whl", hash = "sha256:a19093fbad824ed7c0f355b5ff4214bffda5f1a7f35f29b31fcaa240cc0135ab", size = 237902, upload-time = "2026-04-08T17:03:14.16Z" }, { url = 
"https://files.pythonhosted.org/packages/fb/c6/dba32cab7e3a625b011aa5647486e2d28423a48845a2998c126dd69c85e1/greenlet-3.4.0-cp311-cp311-macosx_11_0_universal2.whl", hash = "sha256:805bebb4945094acbab757d34d6e1098be6de8966009ab9ca54f06ff492def58", size = 285504, upload-time = "2026-04-08T15:52:14.071Z" }, { url = "https://files.pythonhosted.org/packages/54/f4/7cb5c2b1feb9a1f50e038be79980dfa969aa91979e5e3a18fdbcfad2c517/greenlet-3.4.0-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:439fc2f12b9b512d9dfa681c5afe5f6b3232c708d13e6f02c845e0d9f4c2d8c6", size = 605476, upload-time = "2026-04-08T16:24:37.064Z" }, { url = "https://files.pythonhosted.org/packages/d6/af/b66ab0b2f9a4c5a867c136bf66d9599f34f21a1bcca26a2884a29c450bd9/greenlet-3.4.0-cp311-cp311-manylinux_2_24_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:a70ed1cb0295bee1df57b63bf7f46b4e56a5c93709eea769c1fec1bb23a95875", size = 618336, upload-time = "2026-04-08T16:30:56.59Z" }, - { url = "https://files.pythonhosted.org/packages/6d/31/56c43d2b5de476f77d36ceeec436328533bff960a4cba9a07616e93063ab/greenlet-3.4.0-cp311-cp311-manylinux_2_24_s390x.manylinux_2_28_s390x.whl", hash = "sha256:8c5696c42e6bb5cfb7c6ff4453789081c66b9b91f061e5e9367fa15792644e76", size = 625045, upload-time = "2026-04-08T16:40:37.111Z" }, { url = "https://files.pythonhosted.org/packages/e5/5c/8c5633ece6ba611d64bf2770219a98dd439921d6424e4e8cf16b0ac74ea5/greenlet-3.4.0-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:c660bce1940a1acae5f51f0a064f1bc785d07ea16efcb4bc708090afc4d69e83", size = 613515, upload-time = "2026-04-08T15:56:32.478Z" }, - { url = "https://files.pythonhosted.org/packages/80/ca/704d4e2c90acb8bdf7ae593f5cbc95f58e82de95cc540fb75631c1054533/greenlet-3.4.0-cp311-cp311-manylinux_2_39_riscv64.whl", hash = "sha256:89995ce5ddcd2896d89615116dd39b9703bfa0c07b583b85b89bf1b5d6eddf81", size = 419745, upload-time = "2026-04-08T16:43:04.022Z" }, { url = 
"https://files.pythonhosted.org/packages/a9/df/950d15bca0d90a0e7395eb777903060504cdb509b7b705631e8fb69ff415/greenlet-3.4.0-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:ee407d4d1ca9dc632265aee1c8732c4a2d60adff848057cdebfe5fe94eb2c8a2", size = 1574623, upload-time = "2026-04-08T16:26:18.596Z" }, { url = "https://files.pythonhosted.org/packages/1a/e7/0839afab829fcb7333c9ff6d80c040949510055d2d4d63251f0d1c7c804e/greenlet-3.4.0-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:956215d5e355fffa7c021d168728321fd4d31fd730ac609b1653b450f6a4bc71", size = 1639579, upload-time = "2026-04-08T15:57:29.231Z" }, { url = "https://files.pythonhosted.org/packages/d9/2b/b4482401e9bcaf9f5c97f67ead38db89c19520ff6d0d6699979c6efcc200/greenlet-3.4.0-cp311-cp311-win_amd64.whl", hash = "sha256:5cb614ace7c27571270354e9c9f696554d073f8aa9319079dcba466bbdead711", size = 238233, upload-time = "2026-04-08T17:02:54.286Z" }, @@ -2637,9 +2609,7 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/65/8b/3669ad3b3f247a791b2b4aceb3aa5a31f5f6817bf547e4e1ff712338145a/greenlet-3.4.0-cp312-cp312-macosx_11_0_universal2.whl", hash = "sha256:1a54a921561dd9518d31d2d3db4d7f80e589083063ab4d3e2e950756ef809e1a", size = 286902, upload-time = "2026-04-08T15:52:12.138Z" }, { url = "https://files.pythonhosted.org/packages/38/3e/3c0e19b82900873e2d8469b590a6c4b3dfd2b316d0591f1c26b38a4879a5/greenlet-3.4.0-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:16dec271460a9a2b154e3b1c2fa1050ce6280878430320e85e08c166772e3f97", size = 606099, upload-time = "2026-04-08T16:24:38.408Z" }, { url = "https://files.pythonhosted.org/packages/b5/33/99fef65e7754fc76a4ed14794074c38c9ed3394a5bd129d7f61b705f3168/greenlet-3.4.0-cp312-cp312-manylinux_2_24_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:90036ce224ed6fe75508c1907a77e4540176dcf0744473627785dd519c6f9996", size = 618837, upload-time = "2026-04-08T16:30:58.298Z" }, - { url = 
"https://files.pythonhosted.org/packages/44/57/eae2cac10421feae6c0987e3dc106c6d86262b1cb379e171b017aba893a6/greenlet-3.4.0-cp312-cp312-manylinux_2_24_s390x.manylinux_2_28_s390x.whl", hash = "sha256:6f0def07ec9a71d72315cf26c061aceee53b306c36ed38c35caba952ea1b319d", size = 624901, upload-time = "2026-04-08T16:40:38.981Z" }, { url = "https://files.pythonhosted.org/packages/36/f7/229f3aed6948faa20e0616a0b8568da22e365ede6a54d7d369058b128afd/greenlet-3.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:a1c4f6b453006efb8310affb2d132832e9bbb4fc01ce6df6b70d810d38f1f6dc", size = 615062, upload-time = "2026-04-08T15:56:33.766Z" }, - { url = "https://files.pythonhosted.org/packages/6a/8a/0e73c9b94f31d1cc257fe79a0eff621674141cdae7d6d00f40de378a1e42/greenlet-3.4.0-cp312-cp312-manylinux_2_39_riscv64.whl", hash = "sha256:0e1254cf0cbaa17b04320c3a78575f29f3c161ef38f59c977108f19ffddaf077", size = 423927, upload-time = "2026-04-08T16:43:05.293Z" }, { url = "https://files.pythonhosted.org/packages/08/97/d988180011aa40135c46cd0d0cf01dd97f7162bae14139b4a3ef54889ba5/greenlet-3.4.0-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:9b2d9a138ffa0e306d0e2b72976d2fb10b97e690d40ab36a472acaab0838e2de", size = 1573511, upload-time = "2026-04-08T16:26:20.058Z" }, { url = "https://files.pythonhosted.org/packages/d4/0f/a5a26fe152fb3d12e6a474181f6e9848283504d0afd095f353d85726374b/greenlet-3.4.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:8424683caf46eb0eb6f626cb95e008e8cc30d0cb675bdfa48200925c79b38a08", size = 1640396, upload-time = "2026-04-08T15:57:30.88Z" }, { url = "https://files.pythonhosted.org/packages/42/cf/bb2c32d9a100e36ee9f6e38fad6b1e082b8184010cb06259b49e1266ca01/greenlet-3.4.0-cp312-cp312-win_amd64.whl", hash = "sha256:a0a53fb071531d003b075c444014ff8f8b1a9898d36bb88abd9ac7b3524648a2", size = 238892, upload-time = "2026-04-08T17:03:10.094Z" }, @@ -2647,9 +2617,7 @@ wheels = [ { url = 
"https://files.pythonhosted.org/packages/7a/75/7e9cd1126a1e1f0cd67b0eda02e5221b28488d352684704a78ed505bd719/greenlet-3.4.0-cp313-cp313-macosx_11_0_universal2.whl", hash = "sha256:43748988b097f9c6f09364f260741aa73c80747f63389824435c7a50bfdfd5c1", size = 285856, upload-time = "2026-04-08T15:52:45.82Z" }, { url = "https://files.pythonhosted.org/packages/9d/c4/3e2df392e5cb199527c4d9dbcaa75c14edcc394b45040f0189f649631e3c/greenlet-3.4.0-cp313-cp313-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:5566e4e2cd7a880e8c27618e3eab20f3494452d12fd5129edef7b2f7aa9a36d1", size = 610208, upload-time = "2026-04-08T16:24:39.674Z" }, { url = "https://files.pythonhosted.org/packages/da/af/750cdfda1d1bd30a6c28080245be8d0346e669a98fdbae7f4102aa95fff3/greenlet-3.4.0-cp313-cp313-manylinux_2_24_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:1054c5a3c78e2ab599d452f23f7adafef55062a783a8e241d24f3b633ba6ff82", size = 621269, upload-time = "2026-04-08T16:30:59.767Z" }, - { url = "https://files.pythonhosted.org/packages/e0/93/c8c508d68ba93232784bbc1b5474d92371f2897dfc6bc281b419f2e0d492/greenlet-3.4.0-cp313-cp313-manylinux_2_24_s390x.manylinux_2_28_s390x.whl", hash = "sha256:98eedd1803353daf1cd9ef23eef23eda5a4d22f99b1f998d273a8b78b70dd47f", size = 628455, upload-time = "2026-04-08T16:40:40.698Z" }, { url = "https://files.pythonhosted.org/packages/54/78/0cbc693622cd54ebe25207efbb3a0eb07c2639cb8594f6e3aaaa0bb077a8/greenlet-3.4.0-cp313-cp313-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:f82cb6cddc27dd81c96b1506f4aa7def15070c3b2a67d4e46fd19016aacce6cf", size = 617549, upload-time = "2026-04-08T15:56:34.893Z" }, - { url = "https://files.pythonhosted.org/packages/7f/46/cfaaa0ade435a60550fd83d07dfd5c41f873a01da17ede5c4cade0b9bab8/greenlet-3.4.0-cp313-cp313-manylinux_2_39_riscv64.whl", hash = "sha256:b7857e2202aae67bc5725e0c1f6403c20a8ff46094ece015e7d474f5f7020b55", size = 426238, upload-time = "2026-04-08T16:43:06.865Z" }, { url = 
"https://files.pythonhosted.org/packages/ba/c0/8966767de01343c1ff47e8b855dc78e7d1a8ed2b7b9c83576a57e289f81d/greenlet-3.4.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:227a46251ecba4ff46ae742bc5ce95c91d5aceb4b02f885487aff269c127a729", size = 1575310, upload-time = "2026-04-08T16:26:21.671Z" }, { url = "https://files.pythonhosted.org/packages/b8/38/bcdc71ba05e9a5fda87f63ffc2abcd1f15693b659346df994a48c968003d/greenlet-3.4.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:5b99e87be7eba788dd5b75ba1cde5639edffdec5f91fe0d734a249535ec3408c", size = 1640435, upload-time = "2026-04-08T15:57:32.572Z" }, { url = "https://files.pythonhosted.org/packages/a1/c2/19b664b7173b9e4ef5f77e8cef9f14c20ec7fce7920dc1ccd7afd955d093/greenlet-3.4.0-cp313-cp313-win_amd64.whl", hash = "sha256:849f8bc17acd6295fcb5de8e46d55cc0e52381c56eaf50a2afd258e97bc65940", size = 238760, upload-time = "2026-04-08T17:04:03.878Z" }, diff --git a/website/docs/examples/example_yamls/op_basic_sampling.yaml b/website/docs/examples/example_yamls/op_basic_sampling.yaml deleted file mode 120000 index e01111145..000000000 --- a/website/docs/examples/example_yamls/op_basic_sampling.yaml +++ /dev/null @@ -1 +0,0 @@ -../../../../examples/no-priors-characterization/example_yamls/op_basic_sampling.yaml \ No newline at end of file diff --git a/website/docs/examples/example_yamls/op_quick_exploration.yaml b/website/docs/examples/example_yamls/op_quick_exploration.yaml deleted file mode 120000 index ee9e2d0c6..000000000 --- a/website/docs/examples/example_yamls/op_quick_exploration.yaml +++ /dev/null @@ -1 +0,0 @@ -../../../../examples/no-priors-characterization/example_yamls/op_quick_exploration.yaml \ No newline at end of file diff --git a/website/docs/examples/example_yamls/op_thorough_coverage.yaml b/website/docs/examples/example_yamls/op_thorough_coverage.yaml deleted file mode 120000 index c38ecaf28..000000000 --- a/website/docs/examples/example_yamls/op_thorough_coverage.yaml +++ /dev/null @@ -1 +0,0 @@ 
-../../../../examples/no-priors-characterization/example_yamls/op_thorough_coverage.yaml \ No newline at end of file diff --git a/website/docs/examples/example_yamls/space_reaction.yaml b/website/docs/examples/example_yamls/space_reaction.yaml deleted file mode 120000 index 48a189ac5..000000000 --- a/website/docs/examples/example_yamls/space_reaction.yaml +++ /dev/null @@ -1 +0,0 @@ -../../../../examples/no-priors-characterization/example_yamls/space_reaction.yaml \ No newline at end of file diff --git a/website/docs/examples/no-priors-characterization.md b/website/docs/examples/no-priors-characterization.md deleted file mode 120000 index 7daf43406..000000000 --- a/website/docs/examples/no-priors-characterization.md +++ /dev/null @@ -1 +0,0 @@ -../../../examples/no-priors-characterization/README.md \ No newline at end of file diff --git a/website/docs/operators/no-priors-characterization.md b/website/docs/operators/no-priors-characterization.md deleted file mode 120000 index dee9ca30f..000000000 --- a/website/docs/operators/no-priors-characterization.md +++ /dev/null @@ -1 +0,0 @@ -../../../plugins/operators/no-priors-characterization/README.md \ No newline at end of file diff --git a/website/docs/operators/random-walk.md b/website/docs/operators/random-walk.md index 507bd16c7..e9327a250 100644 --- a/website/docs/operators/random-walk.md +++ b/website/docs/operators/random-walk.md @@ -59,25 +59,6 @@ After the second operation: replayed (as they were already measured during the first operation) - The timeseries of this second operation is stored. It has 200 entities in it. -## Controlling sampling and measurements: Continuous batching - -When a `random_walk` operation encounters an unmeasured entity in the -`discoveryspace`, it applies the experiments defined by its `measurementspace`. -Depending on the experiments, you may want to control how many concurrent -experiments are being executed. 
- -`random_walk` uses continuous batching to set the number of concurrent -**requested** experiments and ensure that, as far as possible, there is always -this number of experiments in flight. - -This approach maximizes throughput compared to standard batch-wise submission. -In the normal case the time to finish measuring batch of N entities is, at a -minimum, the time taken for the longest experiment to complete. This means if -one experiment is very long and the others short, there can be capacity in the -system for (N-1) additional entities to be measured but it will not be used. - -The next section explains more about configuring continuous batching - ## Configuring a `random_walk` operation The parameters for a `random_walk` operation are (default values shown): @@ -123,6 +104,8 @@ spaces: - your-spaces ``` +The following sections explain the different options + !!! info end You can get a default `random_walk` operation template and the schema of its @@ -131,10 +114,27 @@ spaces: The information output by this command should always be preferred over the information presented here if there is an inconsistency. +## Continuous batching + +When a `random_walk` operation encounters an unmeasured entity in the +`discoveryspace`, it applies the experiments defined by its `measurementspace`. +Depending on the experiments, you may want to control how many concurrent +experiments are being executed. + +`random_walk` uses continuous batching to set the number of concurrent +**requested** experiments and ensure that, as far as possible, there is always +this number of experiments in flight. + +This approach maximizes throughput compared to standard batch-wise submission. +In the normal case the time to finish measuring batch of N entities is, at a +minimum, the time taken for the longest experiment to complete. 
This means if +one experiment is very long and the others short, there can be capacity in the +system for (N-1) additional entities to be measured but it will not be used. + ### Batch Size and Concurrent Experiments -When it comes to managing resources during an exploration, the key variable one -wants to control is the number of concurrent experiments. +When it comes to managing resources during an exploration, the key variable +to control is the number of concurrent experiments. For the `random_walk` operator, this number is its `batchSize` parameter (the number of initial entities submitted) multiplied by the number of experiments in @@ -151,7 +151,37 @@ this many concurrent experiment requests during the operation. Hence, continuous batching can only maintain that there are N experiments requested at any time. -### Base Sampling Types and Modes +### Sampling all Entities + +If either of the following conditions are true you can specify a value of "all" +for the `numberOfEntities` field in the random walk configuration: + +- All dimensions in the `entityspace`s are discrete and bounded or categorical +- The sampling type is `selector` i.e. you are iterating over an existing set + number of entities in a `samplestore` + +In the first case `all` will be converted to the size of the space. In the +second case `all` will be converted to the number of matching entities in the +`samplestore`. + +If both of these conditions is False the `random_walk` operator will raise a +ValueError when the execution starts. + +!!! info end + + Depending on the Filter settings a randomwalk operation may not sample "all" + entities even if "all" is specified. This is because the filter may filter out + some entities. + +!!! warning end + + For `discoveryspaces` where one/both of the above conditions are True setting + `numberOfEntities` greater than the corresponding size (size of space, or number + of matching entities in `samplestore`) will raise a ValueError. 
This means you + cannot set `numberOfEntities` to an arbitrarily large number to ensure sampling + all of them - use `all` instead. + +## Basic Sampling The `samplerConfig` field controls how Entities are sampled during the operation. The base `samplerConfig` is shown in the examples above and has the @@ -163,7 +193,7 @@ samplerType: selector grouping: [] ``` -#### Sampling Types +### Sampling Types There are two sampling types: `generator` and `selector`. @@ -175,7 +205,7 @@ are bounded. The `selector` sampling type draws _existing matching entities_ from the `samplestore` of the `discoveryspace` i.e. it doesn't use the entity space. -#### Sampler Modes +### Sampler Modes Both sampling types support four modes, which can be categorised as flat or grouped: @@ -230,7 +260,7 @@ for x in propertyN.values: entity({'propertyN':x, 'propertyN_1':y, ..., 'property1':z}) ``` -#### Why Grouped Modes? +### Why Grouped Modes? The advantage of the group modes is that they can allow [actuators](../actuators/working-with-actuators.md) to reuse their test @@ -248,7 +278,7 @@ allows. See the docs of the specific actuator you are using to see if and how it can benefit from grouping. -#### Enabling Grouping +### Enabling Grouping To use the grouped modes (`randomgrouped`, `sequentialgrouped`) you need to supply a list of constitutive properties to group by using the `grouping` @@ -279,14 +309,10 @@ spaces: - your-spaces ``` -### Custom Samplers +## Custom Samplers -It is also possible to specify that `random_walk` uses a custom sampler. This is -a class that inherits from -`orchestrator.core.discoveryspace.samplers.BaseSampler`. This is useful for -implementing more complex sampling schemes. For example, for developers who want -to use random_walk to drive an exploration but have custom logic to execute -before choosing each sample/entity. +`random_walk` can also use custom samplers for +more complex sampling schemes. 
For custom samplers the `samplerConfig` field has the following structure: @@ -302,7 +328,94 @@ parameters: # A dictionary of key value pairs with the values for the custom sam -#### Implementing a Custom Sampler +### Available Custom Samplers + +#### No Priors Sample Selector + +To install `NoPriorsSampleSelector` execute + +```bash +pip install plugins/operators/trim/ +``` + +The `NoPriorsSampleSelector` provides quasi-random sampling strategies designed +for high-dimensional discrete spaces. These strategies produce sequences where +consecutive elements are maximally dispersed, favoring uniform coverage of the +space: + +- **`sobol`**: Sobol sequences are low-discrepancy quasi-random sequences widely + used for space-filling designs. They provide better coverage than pure random + sampling by ensuring points are well-distributed across all dimensions. +- **`clhs`**: Concatenated Latin Hypercube Sampling (CLHS) samples each dimension + independently without replacement, cycling through all values before repeating. + This ensures each dimension is uniformly covered. + +**Collision Handling**: Sobol sampling may produce collisions (duplicate points), +when this happens the sampler automatically falls back to CLHS to ensure +the requested number of unique samples. + +##### Example: Sobol Sampling + +Here we write an example using Sobol ordering for quasi-random +low-discrepancy coverage. Make sure to install the TRIM package first. 
+Then install TRIM custom experiments with + +```bash +pip install examples/trim/custom_experiments/ +``` + +To create a discoveryspace and explore it with the TRIM operator, execute the +following from the root of the ado repository: + +```bash +ado create space -f examples/trim/example_yamls/space_pressure.yaml --new-sample-store + +ado create operation -f \ + examples/trim/example_yamls/randomwalk_sobol_operation.yaml \ + --use-latest space +``` + +The configuration file `randomwalk_sobol_operation.yaml` contains the following +to specify which points to sample + +```yaml +samplerConfig: + module: + moduleName: trim.samplers.no_priors_sampler + moduleClass: NoPriorsSampleSelector + parameters: + targetOutput: pressure + samples: 20 + batchSize: 1 + sampling_strategy: sobol +``` + +Since `batchSize: 1` the operation will sample one point at a time, this +ensures that the sequence of measurements has the desired uniform coverage + +```bash +ado show entities operation --use-latest -o csv --output-file your_file.csv +``` + +The file `your_file.csv` will contain the sequence of sampled points, you +will see something like this: + + + +```csv +request_index,result_index,identifier,experiment_id,generatorid,mol,temperature,volume,pressure,request_id,entity_index,valid +0,0,mol.0.2-temperature.274-volume.8,custom_experiments.calculate_pressure_ideal_gas,no_priors_characterization,0.2,274,8,56.9540689333,c8f814,0,True +1,0,mol.0.7-temperature.284-volume.1,custom_experiments.calculate_pressure_ideal_gas,no_priors_characterization,0.7,284,1,1652.9151684584,232c8e,0,True +2,0,mol.0.4-temperature.294-volume.7,custom_experiments.calculate_pressure_ideal_gas,no_priors_characterization,0.4,294,7,139.6829719824,9c6ae3,0,True +3,0,mol.0.9-temperature.284-volume.5,custom_experiments.calculate_pressure_ideal_gas,no_priors_characterization,0.9,284,5,425.03532903216,83a93d,0,True 
+4,0,mol.0.5-temperature.280-volume.6,custom_experiments.calculate_pressure_ideal_gas,no_priors_characterization,0.5,280,6,194.00412775333334,9e8ecd,0,True +5,0,mol.0.1-temperature.298-volume.4,custom_experiments.calculate_pressure_ideal_gas,no_priors_characterization,0.1,298,4,61.9427465041,db9284,0,True +... +``` + + + +### Implementing a Custom Sampler To implement a custom sampler create a sub-class of `orchestrator.core.discovery.samplers.BaseSampler` and implement all required @@ -337,37 +450,7 @@ class MySampler(BaseSampler): ... ``` -### Sampling all Entities - -If either of the following conditions are true you can specify a value of "all" -for the `numberOfEntities` field in the random walk configuration: - -- All dimensions in the `entityspace`s are discrete and bounded or categorical -- The sampling type is `selector` i.e. you are iterating over an existing set - number of entities in a `samplestore` - -In the first case `all` will be converted to the size of the space. In the -second case `all` will be converted to the number of matching entities in the -`samplestore`. - -If both of these conditions is False the `random_walk` operator will raise a -ValueError when the execution starts. - -!!! info end - - Depending on the Filter settings a randomwalk operation may not sample "all" - entities even if "all" is specified. This is because the filter may filter out - some entities. - -!!! warning end - - For `discoveryspaces` where one/both of the above conditions are True setting - `numberOfEntities` greater than the corresponding size (size of space, or number - of matching entities in `samplestore`) will raise a ValueError. This means you - cannot set `numberOfEntities` to an arbitrarily large number to ensure sampling - all of them - use `all` instead. - -### Filtering Entities +## Filtering Entities In some circumstance you may want to only sample a subset of Entities. 
Some examples include @@ -391,26 +474,29 @@ which can take the following values: - `measured`: Only Entities fully measured by the experiments in the `measurementspace` will be sampled -### Multiple Measurement +## Memoization: Reusing existing measurements -By setting `singleMeasurement:` to False the random walk operation will measure -ALL entities it samples, even if they already have measurements. +If `singleMeasurement:` is False, all experiments are applied to +ALL entities sampled, even if they already have the results for that +experiment. -If entities have multiple measurements e.g. you turned this off and then turned -it on again, then if an entity has multiple measurements each one will be -replayed. +By setting `singleMeasurement:` to True (the default) a random walk operation +will check if an experiment has already been applied to an entity and, +if it has, reuse a.k.a. replay, the result. -Check [replayed measurements](explore_operators.md#memoization-replaying-measurements) +If the entity has multiple results for the same experiment, each one will be +replayed. +See [replayed measurements](explore_operators.md#memoization-replaying-measurements) for more details. -### Retrying Failed Measurements +## Retrying Failed Measurements If the measurement of an entity by an experiment fails `random_walk` can retry it. The parameter controlling this is `maxRetries` which by default is 0 - no retries. If `maxRetries` is N then failing measurements will be retried up to `N` times. 
-#### Experiment request index v number of experiments requested +### Experiment request index v number of experiments requested To understand a `random_walk` operations logs when maxRetries is greater than 0 it's necessary to understand how it tracks the entity+experiment combinations it diff --git a/website/mkdocs.yml b/website/mkdocs.yml index 924dd9542..3e0adc9d9 100644 --- a/website/mkdocs.yml +++ b/website/mkdocs.yml @@ -158,7 +158,6 @@ nav: - Space Characterization: - Identify the important dimensions of a space: examples/lhu.md - Quickly building a predictive model for a configuration space: examples/trim.md - - Characterizing Spaces Without Prior Knowledge: examples/no-priors-characterization.md - Fine-Tuning Throughput: - Measure throughput of fine-tuning locally: examples/finetune-locally.md - Measure throughput of fine-tuning on a remote RayCluster: examples/finetune-remotely.md @@ -199,4 +198,3 @@ nav: - The Random Walk Operator: operators/random-walk.md - The Ray Tune Operator: operators/optimisation-with-ray-tune.md - The TRIM Operator: operators/trim.md - - The No-Priors Characterization Operator: operators/no-priors-characterization.md