# Step Examples

This notebook demonstrates the `Step` API, which combines decorator-based configuration with pure `run` methods. Each step declares the state keys it reads and writes, so business logic stays separate from state management and steps are easier to test and compose.

## Features Demonstrated

- **Basic Steps**: Decorator-based input/output declarations with `@Step.requires` and `@Step.provides`
- **Validation**: Automatic validation of required outputs and type safety
- **Conditional Steps**: Steps that execute only when specific conditions are met
- **Data-Driven Logic**: Decision making based on actual data characteristics
- **Fit-Aware Steps**: Two-phase ML workflow with separate fitting and execution phases
- **Error Handling**: Proper exception handling for unfitted steps and missing outputs
- **Scoped Steps**: Steps initialized with a `scope_prefix` so that they read and write only a restricted view of the state

## Setup

In [14]:
import sys
import os

# Add the project root to Python path
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.insert(0, project_root)

In [15]:
from src.idspy.core.state import State
from src.idspy.core.step import Step, ConditionalStep, FitAwareStep

## Basic Step with Decorators

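Inputs and outputs are declared with decorators. The `run` method stays pure: it receives the required inputs as keyword arguments and returns a dict of outputs, which the step validates and writes back to the state.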

In [16]:
@Step.requires(data=list)
@Step.provides(sum=int)
class MakeSum(Step):
    def run(self, data, **kwargs):
        return {"sum": sum(data)}


s = State({"data": [1, 2, 3]})
MakeSum()(s)
s

State(size=2, data={'data': [1, 2, 3], 'sum': 6})

### Validation: Missing Required Outputs

Automatic validation ensures steps produce all declared outputs.

In [17]:
@Step.provides(x=int)
class NoopProvides(Step):
    def run(self, **kwargs):
        # forgets to return {"x": some_value}
        return {}


s = State()
try:
    NoopProvides()(s)
except KeyError as e:
    print(e)  # -> NoopProvides: missing required output 'x'
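To make the mechanism concrete, here is a minimal sketch of how decorator-based declarations and output validation could work. It is self-contained and uses a plain `dict` as the state; `requires`, `provides`, `MiniStep`, `MiniSum`, and `Forgetful` are hypothetical stand-ins written for illustration, not the idspy implementation.

```python
# Hypothetical stand-ins for illustration; these are NOT the idspy classes.
def requires(**types):
    """Class decorator recording the state keys (and types) a step reads."""
    def wrap(cls):
        cls._requires = {**getattr(cls, "_requires", {}), **types}
        return cls
    return wrap


def provides(**types):
    """Class decorator recording the state keys (and types) a step writes."""
    def wrap(cls):
        cls._provides = {**getattr(cls, "_provides", {}), **types}
        return cls
    return wrap


class MiniStep:
    _requires = {}
    _provides = {}

    def run(self, **kwargs):
        raise NotImplementedError

    def __call__(self, state: dict) -> None:
        name = type(self).__name__
        # Pass only the declared inputs to the pure run() method.
        inputs = {}
        for key, typ in self._requires.items():
            if key not in state:
                raise KeyError(f"{name}: missing required input '{key}'")
            if not isinstance(state[key], typ):
                raise TypeError(f"{name}: input '{key}' is not {typ.__name__}")
            inputs[key] = state[key]
        outputs = self.run(**inputs) or {}
        # Check every declared output before touching the state.
        for key, typ in self._provides.items():
            if key not in outputs:
                raise KeyError(f"{name}: missing required output '{key}'")
            if not isinstance(outputs[key], typ):
                raise TypeError(f"{name}: output '{key}' is not {typ.__name__}")
        state.update(outputs)


@requires(data=list)
@provides(sum=int)
class MiniSum(MiniStep):
    def run(self, data, **kwargs):
        return {"sum": sum(data)}


@provides(x=int)
class Forgetful(MiniStep):
    def run(self, **kwargs):
        return {}  # declared output 'x' is never produced


state = {"data": [1, 2, 3]}
MiniSum()(state)

# Because run() is pure, it can be tested without any state object.
direct = MiniSum().run(data=[4, 5])

try:
    Forgetful()({})
    error = None
except KeyError as exc:
    error = exc.args[0]
```

Validating outputs before calling `state.update` means a failing step leaves the state untouched in this sketch; whether idspy gives the same guarantee is not shown in this notebook.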

## Conditional Steps

A `ConditionalStep` executes only when its `should_run()` method returns `True`. When the step is skipped, the optional `on_skip()` hook is called instead.

In [18]:
@Step.requires(data=list)
@Step.provides(data=list)
class MaybeNormalize(ConditionalStep):
    def should_run(self, state: State) -> bool:
        return bool(state.get("normalize", False))

    def run(self, data, **kwargs):
        m = sum(data) / len(data)
        normalized = [x - m for x in data]
        return {"data": normalized}

    def on_skip(self, state: State) -> None:
        print(f"[skip] {self.name} because normalize flag is False")


s = State({"data": [1, 2, 3], "normalize": False})
MaybeNormalize()(s)  # skipped → prints message
s.to_dict()
# {'data': [1, 2, 3], 'normalize': False}

[skip] MaybeNormalize because normalize flag is False


{'data': [1, 2, 3], 'normalize': False}

In [19]:
s = State({"data": [1, 2, 3], "normalize": True})
MaybeNormalize()(s)  # runs
s.to_dict()
# {'data': [-1.0, 0.0, 1.0], 'normalize': True}

{'data': [-1.0, 0.0, 1.0], 'normalize': True}
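The control flow behind a conditional step can be sketched in a few lines. The version below is a simplified illustration with a plain `dict` state; `MiniConditionalStep` and `CenterIfFlagged` are hypothetical names, and the real `ConditionalStep` may differ in detail.

```python
class MiniConditionalStep:
    """Hypothetical stand-in, not the idspy ConditionalStep."""

    def should_run(self, state: dict) -> bool:
        return True

    def on_skip(self, state: dict) -> None:
        pass

    def run(self, state: dict) -> dict:
        raise NotImplementedError

    def __call__(self, state: dict) -> None:
        # Gate execution on should_run(); notify the hook when skipping.
        if not self.should_run(state):
            self.on_skip(state)
            return
        state.update(self.run(state))


class CenterIfFlagged(MiniConditionalStep):
    def __init__(self):
        self.skipped = 0  # counts how often the step was skipped

    def should_run(self, state):
        return bool(state.get("normalize", False))

    def on_skip(self, state):
        self.skipped += 1

    def run(self, state):
        data = state["data"]
        mean = sum(data) / len(data)
        return {"data": [x - mean for x in data]}


step = CenterIfFlagged()
off = {"data": [1, 2, 3], "normalize": False}
on = {"data": [1, 2, 3], "normalize": True}
step(off)  # skipped: state is left unchanged
step(on)   # runs: data is mean-centered
```

Counting skips in `on_skip` rather than printing makes the skip path easy to assert on in a unit test.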

### Data-Driven Conditional Steps

`should_run()` can also inspect the data itself, so the pipeline adapts to what is actually in the state rather than to an explicit flag.

In [20]:
@Step.requires(data=list)
@Step.provides(trained=bool)
class TrainIfEnoughData(ConditionalStep):
    def __init__(self, min_len: int = 3):
        super().__init__()
        self.min_len = min_len

    def should_run(self, state: State) -> bool:
        return len(state.get("data", [])) >= self.min_len

    def run(self, data, **kwargs):
        # pretend training...
        return {"trained": True}


s = State({"data": [1, 2]})
TrainIfEnoughData(min_len=3)(s)  # skipped
s.get("trained", None)
# None

s["data"] = [1, 2, 3, 4]
TrainIfEnoughData(min_len=3)(s)  # runs
s["trained"]
# True

True

## Fit-Aware Steps

Fit-aware steps follow a two-phase ML workflow: `fit_impl()` learns parameters from the data, and `run()` applies the learned parameters.

In [21]:
@Step.requires(data=list)
@Step.provides(data=list)
class MeanCenter(FitAwareStep):
    def __init__(self):
        super().__init__()
        self.mean = None

    def fit_impl(self, data, **kwargs):
        self.mean = sum(data) / len(data)

    def run(self, data, **kwargs):
        centered = [x - self.mean for x in data]
        return {"data": centered}

### Error Handling: Unfitted Steps

Calling a fit-aware step before it has been fitted raises a `RuntimeError` with a clear message.

In [22]:
s = State({"data": [1.0, 2.0, 3.0]})
step = MeanCenter()

try:
    step(s)
except RuntimeError as e:
    print(e)  # 'MeanCenter' is not fitted.

'MeanCenter' is not fitted.


### Proper Usage: Fit Then Run

Fitted steps store learned parameters internally, keeping pipeline state clean.

In [23]:
step = MeanCenter()
s = State({"data": [1.0, 2.0, 3.0]})

step.fit(s)  # computes and stores mean internally
print(f"Mean computed: {step.mean}")  # Mean computed: 2.0

step(s)  # apply centering
print(s.to_dict())  # {'data': [-1.0, 0.0, 1.0]}

Mean computed: 2.0
{'data': [-1.0, 0.0, 1.0]}
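A common reason to separate the two phases is to learn parameters on one dataset and apply them to another. The sketch below illustrates that pattern together with the unfitted-step guard, assuming a plain `dict` state; `MiniFitAwareStep`, `MiniMeanCenter`, and the `is_fitted` flag are hypothetical names, not the idspy implementation.

```python
class MiniFitAwareStep:
    """Hypothetical stand-in, not the idspy FitAwareStep."""

    def __init__(self):
        self.is_fitted = False

    def fit_impl(self, state: dict) -> None:
        raise NotImplementedError

    def run(self, state: dict) -> dict:
        raise NotImplementedError

    def fit(self, state: dict) -> None:
        # Learn parameters, then mark the step as ready to run.
        self.fit_impl(state)
        self.is_fitted = True

    def __call__(self, state: dict) -> None:
        if not self.is_fitted:
            raise RuntimeError(f"'{type(self).__name__}' is not fitted.")
        state.update(self.run(state))


class MiniMeanCenter(MiniFitAwareStep):
    def __init__(self):
        super().__init__()
        self.mean = None

    def fit_impl(self, state):
        data = state["data"]
        self.mean = sum(data) / len(data)

    def run(self, state):
        return {"data": [x - self.mean for x in state["data"]]}


train = {"data": [1.0, 2.0, 3.0]}
holdout = {"data": [4.0, 6.0]}

step = MiniMeanCenter()
try:
    step(holdout)  # not fitted yet, so this raises
    message = None
except RuntimeError as exc:
    message = str(exc)

step.fit(train)  # learns mean = 2.0 from the training data only
step(holdout)    # applies the training mean to unseen data
```

The holdout values are centered with the training mean, and the training state itself is never modified by `fit`.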


## Scoped Step

By scoping a step, you ensure that it only reads from and writes to the portion of the state relevant to its context, avoiding conflicts between similarly named data in different namespaces.

In [24]:
@Step.requires(data=list)
@Step.provides(data=list)
class Sort(Step):
    def __init__(self):
        super().__init__(scope_prefix="user")

    def run(self, data: list, **kwargs):
        return {"data": sorted(data)}

s = State({
    "user.data": [2.0, 1.0, 3.0],
    "company.data": [3.0, 1.0, 2.0]
})

Sort()(s)
print(s.to_dict())

{'user.data': [1.0, 2.0, 3.0], 'company.data': [3.0, 1.0, 2.0]}
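One way to picture scoping is as a key translation layer around `run`: inputs are looked up under the prefix, and outputs are written back under the same prefix. The sketch below assumes a plain `dict` state and the dot separator seen in the keys above; `run_scoped` and `sort_data` are hypothetical helpers, not part of idspy.

```python
def run_scoped(run, state: dict, prefix: str, requires: list) -> None:
    """Call `run` with unprefixed inputs and store its outputs under `prefix`."""
    # Read "<prefix>.<key>" from the state but pass it to run() as "<key>".
    inputs = {key: state[f"{prefix}.{key}"] for key in requires}
    outputs = run(**inputs)
    # Write each output back into the same namespace.
    for key, value in outputs.items():
        state[f"{prefix}.{key}"] = value


def sort_data(data, **kwargs):
    return {"data": sorted(data)}


state = {
    "user.data": [2.0, 1.0, 3.0],
    "company.data": [3.0, 1.0, 2.0],
}
run_scoped(sort_data, state, prefix="user", requires=["data"])
```

The same `sort_data` function could be reused for the `company` namespace by changing only the `prefix` argument, which is the main appeal of keeping the scope out of the step logic.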
