# Step Classes - Building Blocks of Pipelines

This notebook demonstrates the core Step classes that form the building blocks of pipelines. Steps are reusable components that encapsulate data-processing logic behind clear contracts for inputs and outputs.

## What you'll learn

In this tutorial, you'll discover how to:

1. **Create Basic Steps** - Define atomic units of work with input/output contracts
2. **Handle Step Composition** - Combine steps into complex processing workflows
3. **Implement Contextual Steps** - Steps that run within resource management contexts
4. **Use Conditional Steps** - Execute steps only when specific conditions are met
5. **Build Fittable Steps** - Two-phase ML workflows with separate fit/transform operations

## Key Benefits

- **Modularity**: Encapsulate processing logic in reusable components
- **Composability**: Mix and match steps to build complex workflows
- **ML-Ready**: Built-in support for fit/transform patterns
- **Resource Management**: Automatic handling of contexts and cleanup

---

Let's start by setting up our environment and exploring the different types of steps available.

In [27]:
import sys
import os

# Add the project root to Python path
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.insert(0, project_root)

In [28]:
import pandas as pd
import numpy as np
from typing import Dict, Any, Optional

from src.idspy.core.step.base import Step
from src.idspy.core.step.conditional import ConditionalStep
from src.idspy.core.step.fittable import FittableStep
from src.idspy.core.step.contextual import ContextualStep
from src.idspy.core.pipeline.base import Pipeline
from src.idspy.core.storage.dict import DictStorage

# Create sample data for demonstrations
data = pd.DataFrame({
    "feature_1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "feature_2": [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5],
    "category": ["A", "B", "A", "B", "A", "B", "A", "B", "A", "B"],
    "target": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
})

print("Sample dataset created:")
print(f"\tShape: {data.shape}")
print(f"\tColumns: {list(data.columns)}")
print("\nFirst few rows:")
display(data.head())

Sample dataset created:
	Shape: (10, 4)
	Columns: ['feature_1', 'feature_2', 'category', 'target']

First few rows:


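|   | feature_1 | feature_2 | category | target |
|---|-----------|-----------|----------|--------|
| 0 | 1.0 | 0.5 | A | 0 |
| 1 | 2.0 | 1.5 | B | 1 |
| 2 | 3.0 | 2.5 | A | 0 |
| 3 | 4.0 | 3.5 | B | 1 |
| 4 | 5.0 | 4.5 | A | 0 |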


## Base Step Implementation

The **Step** class is the foundation of all processing components. A step is defined by three pieces:

- **`@Step.needs()` Decorator**: Formally declares the inputs required for step execution; a step that takes no inputs, such as a loader, can omit it
- **`bindings` Property**: Map logical parameter names to physical storage keys for flexible data routing
- **`compute()`**: Contains the actual processing logic

### Design Patterns

- Steps declare their data dependencies explicitly through requirements
- Bindings decouple step logic from the storage implementation
- Required inputs are automatically injected as `compute()` method parameters
- A `Pipeline` orchestrates the steps, and a `Storage` holds the data passed between them
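
One way to picture how requirements and bindings interact is the minimal sketch below. It does not use idspy: `needs`, `MiniStep`, and `Doubler` are hypothetical stand-ins, and the library's actual implementation may differ. The sketch only illustrates resolving logical names to storage keys and injecting the resolved values into `compute()`.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical stand-ins: these are NOT idspy's classes, only an illustration
# of the requirement/binding idea described above.


def needs(*names: str) -> Callable[[type], type]:
    """Class decorator that records the logical inputs a step requires."""
    def decorate(cls: type) -> type:
        cls.required = tuple(names)
        return cls
    return decorate


class MiniStep:
    required: Tuple[str, ...] = ()

    def bindings(self) -> Dict[str, str]:
        return {}

    def compute(self, **inputs: Any) -> Dict[str, Any]:
        raise NotImplementedError

    def run(self, storage: Dict[str, Any]) -> None:
        bound = self.bindings()
        # Resolve each logical input name to its storage key, then inject it.
        inputs = {name: storage[bound.get(name, name)] for name in self.required}
        outputs = self.compute(**inputs) or {}
        # Write each logical output back under its bound storage key.
        for name, value in outputs.items():
            storage[bound.get(name, name)] = value


@needs("values")
class Doubler(MiniStep):
    def bindings(self) -> Dict[str, str]:
        return {"values": "raw_values", "doubled": "doubled_values"}

    def compute(self, values):
        return {"doubled": [v * 2 for v in values]}


storage = {"raw_values": [1, 2, 3]}
Doubler().run(storage)
print(storage["doubled_values"])  # [2, 4, 6]
```

Because `Doubler` only refers to the logical names `values` and `doubled`, it could be pointed at different storage keys by changing `bindings()` alone, which is the decoupling the list above describes.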

Let's create some example steps:

In [29]:
# Example 1: Data Loading Step
class DataLoader(Step):
    """Loads data and makes it available to other steps."""

    def bindings(self):
        return {"dataset": "raw_data"}  # Output key -> storage key

    def compute(self):
        # In practice, this might load from files, databases, APIs, etc.
        print("Loading dataset...")
        return {"dataset": data.copy()}


# Example 2: Feature Selection Step
@Step.needs("data")  # Declares "data" as a required input, injected into compute()
class FeatureSelector(Step):
    """Selects specific columns from the dataset."""

    def __init__(self, columns):
        super().__init__()
        self.columns = columns

    def bindings(self):
        return {"data": "raw_data", "selected": "selected_features"}

    def compute(self, data):
        print(f"Selecting columns: {self.columns}")
        selected = data[self.columns].copy()
        print(f"\tOriginal shape: {data.shape} -> Selected shape: {selected.shape}")
        return {"selected": selected}


# Example 3: Simple transformation step
@Step.needs("data")
class ConstantAdder(Step):
    """Adds a constant value to numerical columns."""

    def __init__(self, value):
        super().__init__()
        self.value = value

    def bindings(self):
        return {"data": "selected_features", "result": "result"}

    def compute(self, data):
        print(f"Adding {self.value} to all numerical columns...")
        result = data.copy()
        numeric_cols = result.select_dtypes(include=[np.number]).columns
        result[numeric_cols] = result[numeric_cols] + self.value
        return {"result": result}

In [30]:
# Test the basic steps in a simple pipeline
storage = DictStorage()

loader = DataLoader()
selector = FeatureSelector(["feature_1", "feature_2"])
adder = ConstantAdder(100)

print("Testing Basic Steps")
print("=" * 40)

# Create and run pipeline
pipeline = Pipeline([loader, selector, adder], storage=storage)
result = pipeline.run()

print("\nPipeline completed!")
final_data = storage.get(["result"])
print("\nFinal transformed data:")
display(final_data)

Testing Basic Steps
Loading dataset...
Selecting columns: ['feature_1', 'feature_2']
	Original shape: (10, 4) -> Selected shape: (10, 2)
Adding 100 to all numerical columns...

Pipeline completed!

Final transformed data:


{'result':    feature_1  feature_2
 0      101.0      100.5
 1      102.0      101.5
 2      103.0      102.5
 3      104.0      103.5
 4      105.0      104.5
 5      106.0      105.5
 6      107.0      106.5
 7      108.0      107.5
 8      109.0      108.5
 9      110.0      109.5}

## Contextual Step Examples

A `ContextualStep` runs its `compute()` inside a context manager, which handles resource setup and cleanup automatically.

In [31]:
from contextlib import contextmanager
import tempfile
from pathlib import Path

class FileWriterStep(ContextualStep):
    """Contextual step that provides a temporary directory and writes data to a file."""

    def __init__(self, filename: str, content: str, name: Optional[str] = None):
        super().__init__(name=name or "FileWriterStep")
        self.filename = filename
        self.content = content

    @contextmanager
    def context(self, **kwargs):
        """Create and cleanup a temporary directory."""
        with tempfile.TemporaryDirectory() as temp_dir:
            print(f"[{self.name}] Created temporary directory: {temp_dir}")

            # Provide context object with temp directory
            ctx = type('Context', (), {'temp_dir': temp_dir})()

            try:
                yield ctx
            finally:
                print(f"[{self.name}] Cleaned up temporary directory: {temp_dir}")

    def compute(self, context: Any = None, **kwargs: Any) -> Dict[str, Any]:
        if context and hasattr(context, 'temp_dir'):
            filepath = Path(context.temp_dir) / self.filename
        else:
            raise ValueError("Context with 'temp_dir' is required")

        print(f"[{self.name}] Writing to: {filepath}")

        with open(filepath, 'w') as f:
            f.write(self.content)

        return {
            "filepath": str(filepath),
            "bytes_written": len(self.content.encode())
        }

    def bindings(self) -> Dict[str, str]:
        return {}

In [32]:
# Demonstrate contextual steps
print("=== Contextual Steps Demo ===")

# Create the step
writer_step = FileWriterStep(
    name="FileWriter",
    filename="example.txt",
    content="Hello from contextual step!\nThis file will be automatically cleaned up.",
)

# Run with automatic temp directory management
result = writer_step.run()

print(f"File written to: {result['filepath']}")
print(f"Bytes written: {result['bytes_written']}")

=== Contextual Steps Demo ===
[FileWriter] Created temporary directory: /var/folders/9q/5fdsccl51c34cl9rkc28dfx40000gn/T/tmphamx4k5y
[FileWriter] Writing to: /var/folders/9q/5fdsccl51c34cl9rkc28dfx40000gn/T/tmphamx4k5y/example.txt
[FileWriter] Cleaned up temporary directory: /var/folders/9q/5fdsccl51c34cl9rkc28dfx40000gn/T/tmphamx4k5y
File written to: /var/folders/9q/5fdsccl51c34cl9rkc28dfx40000gn/T/tmphamx4k5y/example.txt
Bytes written: 71
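
The log above shows the cleanup message before the path is printed, so the returned `filepath` points at a file that has already been removed. The standalone sketch below reproduces that behavior with the standard library only. `temp_workspace` is a hypothetical helper, not part of idspy.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temp_workspace():
    """Yield a temporary directory that is removed when the block exits."""
    with tempfile.TemporaryDirectory() as temp_dir:
        yield Path(temp_dir)


with temp_workspace() as workspace:
    filepath = workspace / "example.txt"
    filepath.write_text("Hello from contextual step!")
    existed_inside = filepath.exists()   # file is present inside the context

exists_after = filepath.exists()         # directory and file are gone afterwards
print(existed_inside, exists_after)      # True False
```

Anything that must outlive the step, such as the file's contents or a copy in a permanent location, should therefore be produced inside `compute()` rather than read back from the returned path.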


## Conditional Step Examples

A `ConditionalStep` executes only when `should_run()` returns `True`; otherwise `on_skip()` is called and `compute()` is not run.

In [33]:
class DataQualityChecker(ConditionalStep):
    """Check if data meets quality requirements before processing."""

    def __init__(self, min_rows: int = 5, required_columns: Optional[list] = None, name: Optional[str] = None):
        super().__init__(name=name)
        self.min_rows = min_rows
        self.required_columns = required_columns or []

    def bindings(self) -> Dict[str, str]:
        return {}

    def should_run(self, data: pd.DataFrame, **kwargs) -> bool:
        has_enough_rows = len(data) >= self.min_rows
        has_required_columns = all(col in data.columns for col in self.required_columns)

        print(f"[{self.name}] Quality check - Rows: {len(data)}>={self.min_rows}: {has_enough_rows}")
        print(f"[{self.name}] Quality check - Required columns {self.required_columns}: {has_required_columns}")

        return has_enough_rows and has_required_columns

    def on_skip(self, data: pd.DataFrame, **kwargs) -> None:
        print(f"[{self.name}] SKIPPED - Data quality check failed")
        print(f"  - Rows: {len(data)} (need >= {self.min_rows})")
        print(f"  - Missing columns: {set(self.required_columns) - set(data.columns)}")

    def compute(self, **kwargs: Any) -> Dict[str, Any]:
        print(f"[{self.name}] Data quality check PASSED")
        return {}  # no outputs to store; matches the declared return type

In [34]:
# Test with good data
good_checker = DataQualityChecker(min_rows=5, required_columns=["feature_1", "target"])
good_checker.run(data=data)

# Test with insufficient data
small_data = data.head(3)  # Only 3 rows
bad_checker = DataQualityChecker(min_rows=5, required_columns=["feature_1", "target"])
bad_checker.run(data=small_data)

[DataQualityChecker] Quality check - Rows: 10>=5: True
[DataQualityChecker] Quality check - Required columns ['feature_1', 'target']: True
[DataQualityChecker] Data quality check PASSED
[DataQualityChecker] Quality check - Rows: 3>=5: False
[DataQualityChecker] Quality check - Required columns ['feature_1', 'target']: True
[DataQualityChecker] SKIPPED - Data quality check failed
  - Rows: 3 (need >= 5)
  - Missing columns: set()


{}
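
The `{}` at the end of the output above is the return value of the skipped run. The control flow can be sketched in plain Python as follows. `MiniConditionalStep` and `MinRowsGate` are hypothetical stand-ins written for this illustration, and the real `ConditionalStep` may differ in its details.

```python
from typing import Any, Dict, List


class MiniConditionalStep:
    """Hypothetical stand-in for the should_run / on_skip / compute flow."""

    def should_run(self, **inputs: Any) -> bool:
        return True

    def on_skip(self, **inputs: Any) -> None:
        pass

    def compute(self, **inputs: Any) -> Dict[str, Any]:
        raise NotImplementedError

    def run(self, **inputs: Any) -> Dict[str, Any]:
        if not self.should_run(**inputs):
            self.on_skip(**inputs)
            return {}  # a skipped step contributes no outputs
        return self.compute(**inputs) or {}


class MinRowsGate(MiniConditionalStep):
    def __init__(self, min_rows: int) -> None:
        self.min_rows = min_rows
        self.skipped: List[int] = []

    def should_run(self, rows: list, **inputs: Any) -> bool:
        return len(rows) >= self.min_rows

    def on_skip(self, rows: list, **inputs: Any) -> None:
        self.skipped.append(len(rows))  # record the size that failed the check

    def compute(self, rows: list, **inputs: Any) -> Dict[str, Any]:
        return {"row_count": len(rows)}


gate = MinRowsGate(min_rows=5)
passed = gate.run(rows=list(range(10)))
skipped = gate.run(rows=list(range(3)))
print(passed, skipped, gate.skipped)  # {'row_count': 10} {} [3]
```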

## Fittable Step Implementation

A `FittableStep` splits an ML workflow into two phases: `fit()` learns state from training data, and `run()` applies that state to new data. Running a step before it has been fitted raises an error.
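
The fit-before-run guard can be sketched without the library. `MiniFittableStep` and `MeanCenterer` below are hypothetical stand-ins, and the way idspy tracks fitted state internally is an assumption here.

```python
from typing import Any, Dict, List


class MiniFittableStep:
    """Hypothetical stand-in showing a fit-before-run guard."""

    def __init__(self) -> None:
        self.is_fitted = False

    def fit_impl(self, **inputs: Any) -> None:
        raise NotImplementedError

    def compute(self, **inputs: Any) -> Dict[str, Any]:
        raise NotImplementedError

    def fit(self, **inputs: Any) -> None:
        self.fit_impl(**inputs)
        self.is_fitted = True  # only set once fitting has succeeded

    def run(self, **inputs: Any) -> Dict[str, Any]:
        if not self.is_fitted:
            raise RuntimeError(f"{type(self).__name__} is not fitted; cannot run.")
        return self.compute(**inputs)


class MeanCenterer(MiniFittableStep):
    def fit_impl(self, values: List[float], **inputs: Any) -> None:
        self.mean_ = sum(values) / len(values)

    def compute(self, values: List[float], **inputs: Any) -> Dict[str, Any]:
        return {"centered": [v - self.mean_ for v in values]}


step = MeanCenterer()
try:
    step.run(values=[8.0, 9.0, 10.0])
except RuntimeError as err:
    error_message = str(err)

step.fit(values=[1.0, 2.0, 3.0])
centered = step.run(values=[8.0, 9.0, 10.0])["centered"]
print(error_message)  # MeanCenterer is not fitted; cannot run.
print(centered)       # [6.0, 7.0, 8.0]
```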

In [35]:
class StandardScaler(FittableStep):
    """Standardize features by removing mean and scaling to unit variance."""

    def __init__(self, features: Optional[list] = None, name: Optional[str] = None):
        super().__init__(name=name)
        self.features = features
        self.mean_ = None
        self.std_ = None

    @property
    def bindings(self) -> Dict[str, str]:
        return {}

    def fit_impl(self, data: pd.DataFrame, **kwargs) -> None:
        """Learn the mean and standard deviation from training data."""
        features = self.features or data.select_dtypes(include=[np.number]).columns.tolist()

        print(f"[{self.name}] Fitting on features: {features}")
        self.mean_ = data[features].mean()
        self.std_ = data[features].std()

        print(f"[{self.name}] Learned statistics:")
        print(f"  Mean: {dict(self.mean_)}")
        print(f"  Std: {dict(self.std_)}")

    def compute(self, data: pd.DataFrame, **kwargs: Any) -> Dict[str, Any]:
        """Apply learned scaling to new data."""
        features = self.features or data.select_dtypes(include=[np.number]).columns.tolist()

        print(f"[{self.name}] Applying scaling to {len(features)} features")
        scaled_data = data.copy()
        scaled_data[features] = (data[features] - self.mean_[features]) / self.std_[features]

        return {"scaled_data": scaled_data}

In [36]:
# Demonstrate fittable steps
print("=== Fittable Steps Demo ===")

# Create train/test split
train_data = data.iloc[:7]  # First 7 rows for training
test_data = data.iloc[7:]   # Last 3 rows for testing

print("Train data shape:", train_data.shape)
print("Test data shape:", test_data.shape)

# Create the scaler (not fitted yet)
scaler = StandardScaler(features=["feature_1", "feature_2"])
print(f"\nScaler fitted: {scaler.is_fitted}")

# Try to run without fitting (should raise error)
try:
    scaler.run(data=test_data)
except RuntimeError as e:
    print(f"Expected error: {e}\n")

# Fit the scaler
scaler.fit(data=train_data)

# Now run successfully
scaled_result = scaler.run(data=test_data)
print("\nOriginal test data:")
display(test_data)
print("\nScaled test data:")
display(scaled_result['scaled_data'][['feature_1', 'feature_2']])

=== Fittable Steps Demo ===
Train data shape: (7, 4)
Test data shape: (3, 4)

Scaler fitted: False
Expected error: Step StandardScaler(, fitted=False) is not fitted; cannot run.

[StandardScaler] Fitting on features: ['feature_1', 'feature_2']
[StandardScaler] Learned statistics:
  Mean: {'feature_1': np.float64(4.0), 'feature_2': np.float64(3.5)}
  Std: {'feature_1': np.float64(2.160246899469287), 'feature_2': np.float64(2.160246899469287)}
[StandardScaler] Applying scaling to 2 features

Original test data:


|   | feature_1 | feature_2 | category | target |
|---|-----------|-----------|----------|--------|
| 7 | 8.0 | 7.5 | B | 1 |
| 8 | 9.0 | 8.5 | A | 0 |
| 9 | 10.0 | 9.5 | B | 1 |



Scaled test data:


|   | feature_1 | feature_2 |
|---|-----------|-----------|
| 7 | 1.85164 | 1.85164 |
| 8 | 2.31455 | 2.31455 |
| 9 | 2.77746 | 2.77746 |
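
The scaled values can be checked by hand. The sketch below recomputes the `feature_1` column with the standard library only. It also shows why the result depends on the standard deviation convention: pandas `.std()` divides by `n - 1` by default, whereas a population standard deviation (the convention used by scikit-learn's `StandardScaler`) divides by `n`.

```python
import statistics

train = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]   # feature_1, first 7 rows
test = [8.0, 9.0, 10.0]                        # feature_1, last 3 rows

mean = statistics.mean(train)
sample_std = statistics.stdev(train)       # divides by n - 1, like pandas .std()
population_std = statistics.pstdev(train)  # divides by n

scaled_sample = [round((x - mean) / sample_std, 5) for x in test]
scaled_population = [round((x - mean) / population_std, 5) for x in test]

print(scaled_sample)      # [1.85164, 2.31455, 2.77746]
print(scaled_population)  # [2.0, 2.5, 3.0]
```

The first list matches the notebook output, which confirms that the scaler in this tutorial uses the sample standard deviation.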


## Key Takeaways

1. **Step Architecture**: Steps are reusable components with clear input/output contracts
2. **Base Step Pattern**: Implement `bindings()` and `compute()`, and declare inputs with `@Step.needs()`
3. **Composability**: Steps can be combined into complex processing pipelines
4. **Resource Management**: ContextualStep handles setup/teardown automatically
5. **Conditional Execution**: ConditionalStep allows steps to run only when conditions are met
6. **ML Workflows**: FittableStep provides fit/transform patterns for machine learning