# Interactive Pipeline Configuration with DAGConfigFactory

This notebook demonstrates the new interactive approach to pipeline configuration using the DAGConfigFactory.
Instead of manually creating 500+ lines of static configuration, we use a guided step-by-step process.

## Workflow Overview

1. **Define Pipeline DAG** - Create the pipeline structure
2. **Initialize DAGConfigFactory** - Set up the interactive factory
3. **Configure Base Settings** - Set shared pipeline configuration
4. **Configure Processing Settings** - Set shared processing configuration
5. **Configure Individual Steps** - Set step-specific configurations
6. **Generate Final Configurations** - Create config instances
7. **Save to JSON** - Export unified configuration file



## Environment Setup

In [1]:
import os
import json
import sys
from pathlib import Path
from datetime import datetime, date
import logging
from typing import List, Optional, Dict, Any


# Set up logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

# Get parent directory of current notebook
project_root = str(Path().absolute().parent)
print(f"project root {project_root}")
if project_root not in sys.path:
    sys.path.insert(0, project_root)
    print(f"add project root {project_root} into system")

project root /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines
add project root /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines into system


In [2]:
# SageMaker and SAIS imports
from sagemaker import Session
from sagemaker.workflow.pipeline_context import PipelineSession

2025-11-28 03:11:18,137 - INFO - Note: NumExpr detected 48 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
2025-11-28 03:11:18,138 - INFO - NumExpr defaulting to 16 threads.
2025-11-28 03:11:19,118 - INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [3]:
print(f"Role: {PipelineSession().get_caller_identity_arn()}")

2025-11-28 03:11:19,516 - INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


Role: arn:aws:iam::178936618742:role/AmazonSageMaker-ExecutionRole-Default


In [4]:
bucket = "buyer-seller-messaging-reversal"
print(f"Bucket: {bucket}")

Bucket: buyer-seller-messaging-reversal


## Step 1: Define Pipeline DAG

First, we define the pipeline structure using a DAG (Directed Acyclic Graph).
This replaces the hardcoded pipeline structure from the legacy approach.

In [5]:
from cursus.api.dag.base_dag import PipelineDAG


def create_bedrock_batch_pytorch_with_label_ruleset_e2e_dag() -> PipelineDAG:
    """
    Create a DAG for Bedrock Batch-enhanced PyTorch E2E pipeline with Label Ruleset steps.

    This DAG represents a complete end-to-end workflow that uses:
    1. Bedrock prompt template generation and batch processing for LLM-enhanced data
    2. Label ruleset generation and execution for transparent label transformation
    3. PyTorch training, followed by calibration, packaging, and registration

    The label ruleset steps sit between Bedrock processing and training/evaluation,
    providing transparent, rule-based label transformation that's easy to modify.

    Returns:
        PipelineDAG: The directed acyclic graph for the pipeline
    """
    dag = PipelineDAG()

    # Add all nodes. This variant starts from a dummy training step and dummy data loading
    dag.add_node("DummyTraining")  # Dummy training step that supplies a pretrained model
    dag.add_node(
        "ModelCalibration_calibration"
    )  # Model calibration step with calibration variant
    dag.add_node("Package")  # Package step
    # dag.add_node("Registration")  # MIMS registration step
    # dag.add_node("Payload")  # Payload step
    dag.add_node("DummyDataLoading_calibration")  # Dummy data load for calibration
    dag.add_node(
        "TabularPreprocessing_calibration"
    )  # Tabular preprocessing for calibration
    dag.add_node("PyTorchModelEval_calibration")  # Model evaluation step

    # Calibration flow: load the calibration data, then preprocess it
    dag.add_edge("DummyDataLoading_calibration", "TabularPreprocessing_calibration")

    # Evaluation flow
    dag.add_edge("DummyTraining", "PyTorchModelEval_calibration")
    dag.add_edge(
        "TabularPreprocessing_calibration", "PyTorchModelEval_calibration"
    )  # Use labeled calibration data

    # Model calibration flow - depends on model evaluation
    dag.add_edge("PyTorchModelEval_calibration", "ModelCalibration_calibration")

    # Output flow
    dag.add_edge("ModelCalibration_calibration", "Package")
    dag.add_edge("DummyTraining", "Package")  # Raw model is also input to packaging
    # dag.add_edge("PyTorchTraining", "Payload")  # Payload test uses the raw model
    # dag.add_edge("Package", "Registration")
    # dag.add_edge("Payload", "Registration")

    logger.info(
        f"Created Bedrock Batch-PyTorch with Label Ruleset E2E DAG with {len(dag.nodes)} nodes and {len(dag.edges)} edges"
    )
    return dag


# Create the pipeline DAG
dag = create_bedrock_batch_pytorch_with_label_ruleset_e2e_dag()

print(f"Pipeline DAG created with {len(dag.nodes)} steps:")
for node in dag.nodes:
    print(f"  - {node}")

2025-11-28 03:11:20,386 - INFO - Added node: DummyTraining
2025-11-28 03:11:20,387 - INFO - Added node: ModelCalibration_calibration
2025-11-28 03:11:20,387 - INFO - Added node: Package
2025-11-28 03:11:20,387 - INFO - Added node: DummyDataLoading_calibration
2025-11-28 03:11:20,388 - INFO - Added node: TabularPreprocessing_calibration
2025-11-28 03:11:20,388 - INFO - Added node: PyTorchModelEval_calibration
2025-11-28 03:11:20,389 - INFO - Added edge: DummyDataLoading_calibration -> TabularPreprocessing_calibration
2025-11-28 03:11:20,389 - INFO - Added edge: DummyTraining -> PyTorchModelEval_calibration
2025-11-28 03:11:20,389 - INFO - Added edge: TabularPreprocessing_calibration -> PyTorchModelEval_calibration
2025-11-28 03:11:20,390 - INFO - Added edge: PyTorchModelEval_calibration -> ModelCalibration_calibration
2025-11-28 03:11:20,390 - INFO - Added edge: ModelCalibration_calibration -> Package
2025-11-28 03:11:20,390 - INFO - Added node: PyTorchTraining
2025-11-28 03:11:20,391 -

Pipeline DAG created with 7 steps:
  - DummyTraining
  - ModelCalibration_calibration
  - Package
  - DummyDataLoading_calibration
  - TabularPreprocessing_calibration
  - PyTorchModelEval_calibration
  - PyTorchTraining
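
As a quick sanity check on the structure, the edges declared in the function above can be fed to the standard library's `graphlib.TopologicalSorter`. This is a minimal sketch that works on plain tuples rather than on `PipelineDAG` itself; the helper name `execution_order` is hypothetical and not part of cursus.

```python
from graphlib import CycleError, TopologicalSorter

# Edges copied from the DAG definition above, written as (source, target).
edges = [
    ("DummyDataLoading_calibration", "TabularPreprocessing_calibration"),
    ("DummyTraining", "PyTorchModelEval_calibration"),
    ("TabularPreprocessing_calibration", "PyTorchModelEval_calibration"),
    ("PyTorchModelEval_calibration", "ModelCalibration_calibration"),
    ("ModelCalibration_calibration", "Package"),
    ("DummyTraining", "Package"),
]


def execution_order(edge_list):
    """Return one valid execution order; raises CycleError if the graph has a cycle."""
    sorter = TopologicalSorter()
    for source, target in edge_list:
        # add(node, *predecessors): the target step depends on the source step.
        sorter.add(target, source)
    return list(sorter.static_order())


order = execution_order(edges)
print(order)
```

Every other step is an ancestor of `Package`, so it always comes last in the returned order. A `CycleError` here would indicate that an edge was added in the wrong direction.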


## Step 2: Initialize DAGConfigFactory

Now we initialize the DAGConfigFactory with our DAG. This will automatically:
- Map DAG nodes to configuration classes
- Set up the interactive workflow
- Prepare for step-by-step configuration

In [6]:
from cursus.api.factory.dag_config_factory import DAGConfigFactory

# Initialize the factory with our DAG
factory = DAGConfigFactory(dag)

# Get the config class mapping
config_map = factory.get_config_class_map()

print("DAG Node to Config Class Mapping:")
print("=" * 50)
for node_name, config_class in config_map.items():
    print(f"  {node_name:<35} -> {config_class.__name__}")

print(f"\nSuccessfully mapped {len(config_map)} steps to configuration classes.")

2025-11-28 03:11:20,397 - INFO - 🔧 BuilderAutoDiscovery.__init__ starting - package_root: /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/cursus
2025-11-28 03:11:20,397 - INFO - 🔧 BuilderAutoDiscovery.__init__ - workspace_dirs: []
2025-11-28 03:11:20,398 - INFO - ✅ BuilderAutoDiscovery basic initialization complete
2025-11-28 03:11:20,398 - INFO - ✅ Registry info loaded: 41 steps
2025-11-28 03:11:20,399 - INFO - 🎉 BuilderAutoDiscovery initialization completed successfully
2025-11-28 03:11:20,399 - INFO - 🔍 ScriptAutoDiscovery.__init__ starting - package_root: /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/cursus
2025-11-28 03:11:20,400 - INFO - 🔍 ScriptAutoDiscovery.__init__ - workspace_dirs: []
2025-11-28 03:11:20,400 - INFO - 🔍 ScriptAutoDiscovery.__init__ - priority_workspace_dir: None
2025-11-28 03:11:20,400 - INFO - ✅ Registry info loaded: 41 steps
2025-11-28 03:11:20,401 - INFO - 🎉 ScriptAutoD

DAG Node to Config Class Mapping:
  DummyTraining                       -> DummyTrainingConfig
  ModelCalibration_calibration        -> ModelCalibrationConfig
  Package                             -> PackageConfig
  DummyDataLoading_calibration        -> DummyDataLoadingConfig
  TabularPreprocessing_calibration    -> TabularPreprocessingConfig
  PyTorchModelEval_calibration        -> PyTorchModelEvalConfig
  PyTorchTraining                     -> PyTorchTrainingConfig

Successfully mapped 7 steps to configuration classes.
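
The printed mapping follows a visible naming pattern: the part of the node name before the first underscore, plus `Config`. The sketch below reproduces only that pattern. It is an illustration, not the factory's actual resolution logic (the log output suggests a step registry is consulted), and `guess_config_class_name` is a hypothetical helper.

```python
def guess_config_class_name(node_name: str) -> str:
    """Hypothetical helper: drop an optional '_<job_type>' suffix and append 'Config'."""
    # partition() splits at the first underscore; names without one are kept whole.
    step_type, _, _job_type = node_name.partition("_")
    return f"{step_type}Config"


# Node names taken from the mapping printed above.
for node in ["DummyTraining", "ModelCalibration_calibration", "Package"]:
    print(f"{node:<35} -> {guess_config_class_name(node)}")
```

A pattern like this can help predict which config class a new node name would resolve to, but `factory.get_config_class_map()` remains the authoritative source.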


## Step 3: Configure Base Pipeline Settings

These settings are shared across ALL pipeline steps. Instead of repeating them
in every step configuration, we set them once here.

In [7]:
# Get base configuration requirements
base_requirements = factory.get_base_config_requirements()

print("Base Pipeline Configuration Requirements:")
print("=" * 50)
for req in base_requirements:
    marker = "*" if req["required"] else " "
    default_info = (
        f" (default: {req.get('default')})"
        if not req["required"] and "default" in req
        else ""
    )
    print(f"{marker} {req['name']:<25} ({req['type']}){default_info}")
    print(f"    {req['description']}")
    print()

Base Pipeline Configuration Requirements:
* author                    (str)
    Author or owner of the pipeline.

* bucket                    (str)
    S3 bucket name for pipeline artifacts and data.

* role                      (str)
    IAM role for pipeline execution.

* region                    (str)
    Custom region code (NA, EU, FE) for internal logic.

* service_name              (str)
    Service name for the pipeline.

* pipeline_version          (str)
    Version string for the SageMaker Pipeline.

  model_class               (str) (default: xgboost)
    Model class (e.g., XGBoost, PyTorch).

  current_date              (str) (default: PydanticUndefined)
    Current date, typically used for versioning or pathing.

  framework_version         (str) (default: 2.1.0)
    Default framework version (e.g., PyTorch).

  py_version                (str) (default: py310)
    Default Python version.

  source_dir                (Optional) (default: None)
    Common source directory fo
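
Because each requirement is a dictionary with `name` and `required` keys (as used in the loop above), the values you plan to pass can be checked against the required fields before calling `set_base_config`. This is a minimal sketch on hand-written sample data; `missing_required_fields` is a hypothetical helper, not a factory method.

```python
from typing import Any, Dict, List


def missing_required_fields(
    requirements: List[Dict[str, Any]], provided: Dict[str, Any]
) -> List[str]:
    """Return the names of required fields that are absent or None in `provided`."""
    return [
        req["name"]
        for req in requirements
        if req["required"] and provided.get(req["name"]) is None
    ]


# Sample data shaped like the requirement entries printed above.
sample_requirements = [
    {"name": "author", "required": True},
    {"name": "bucket", "required": True},
    {"name": "role", "required": True},
    {"name": "model_class", "required": False, "default": "xgboost"},
]

planned_values = {"author": "lukexie", "bucket": "buyer-seller-messaging-reversal"}
print(missing_required_fields(sample_requirements, planned_values))
```

With the real list, the same call would be `missing_required_fields(base_requirements, planned_values)`.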

In [8]:
# Set up basic configuration values
region_list = ["NA", "EU", "FE"]
region_selection = 0
region = region_list[region_selection]

# Map region to AWS region
region_mapping = {"NA": "us-east-1", "EU": "eu-west-1", "FE": "us-west-2"}
aws_region = region_mapping[region]

service_name = "BuyerAbuseRnR"
pipeline_version = "0.0.2"
author = "lukexie"
model_class = "pytorch"

# Get current directory and set up paths
current_dir = Path.cwd()
package_root = Path(current_dir).resolve()
source_dir = Path("docker")
project_root_folder = "rnr_pytorch_bedrock"

# Set base configuration
factory.set_base_config(
    # Infrastructure settings
    bucket=bucket,
    role=PipelineSession().get_caller_identity_arn(),
    region=region,
    aws_region=aws_region,
    # Project identification
    author=author,
    service_name=service_name,
    pipeline_version=pipeline_version,
    model_class=model_class,
    # Framework settings
    framework_version="2.1.0",
    py_version="py310",
    source_dir=str(source_dir),
    project_root_folder=project_root_folder,
    # Date settings
    current_date=date.today().strftime("%Y-%m-%d"),
    # Enable Cache
    enable_caching=False,
    # Use Secure PyPI
    use_secure_pypi=False,
)

print("✅ Base pipeline configuration set successfully!")
print(f"   Region: {region} ({aws_region})")
print(f"   Service: {service_name}")
print(f"   Author: {author}")
print(f"   Pipeline Version: {pipeline_version}")

2025-11-28 03:11:20,773 - INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
2025-11-28 03:11:20,946 - INFO - Base configuration set successfully


✅ Base pipeline configuration set successfully!
   Region: NA (us-east-1)
   Service: BuyerAbuseRnR
   Author: lukexie
   Pipeline Version: 0.0.2


## Step 4: Configure Base Processing Settings

These settings are shared across all PROCESSING steps (data loading, preprocessing, etc.)
but not training steps.

In [9]:
# Get base processing configuration requirements
processing_requirements = factory.get_base_processing_config_requirements()

if processing_requirements:
    print("Base Processing Configuration Requirements:")
    print("=" * 50)
    for req in processing_requirements:
        marker = "*" if req["required"] else " "
        default_info = (
            f" (default: {req.get('default')})"
            if not req["required"] and "default" in req
            else ""
        )
        print(f"{marker} {req['name']:<30} ({req['type']}){default_info}")
        print(f"    {req['description']}")
        print()
else:
    print("No base processing configuration required for this pipeline.")

Base Processing Configuration Requirements:
  processing_instance_count      (int) (default: 1)
    Instance count for processing jobs

  processing_volume_size         (int) (default: 500)
    Volume size for processing jobs in GB

  processing_instance_type_large (str) (default: ml.m5.4xlarge)
    Large instance type for processing step.

  processing_instance_type_small (str) (default: ml.m5.2xlarge)
    Small instance type for processing step.

  use_large_processing_instance  (bool) (default: False)
    Set to True to use large instance type, False for small instance type.

  processing_source_dir          (Optional) (default: None)
    Source directory for processing scripts. Falls back to base source_dir if not provided.

  processing_entry_point         (Optional) (default: None)
    Entry point script for processing, must be relative to source directory. Can be overridden by derived classes.

  processing_script_arguments    (Optional) (default: None)
    Optional arguments fo

In [10]:
# Set base processing configuration if needed
if processing_requirements:
    processing_source_dir = source_dir / "scripts"

    factory.set_base_processing_config(
        # Processing infrastructure
        processing_source_dir=str(processing_source_dir),
        processing_instance_type_large="ml.m5.12xlarge",
        processing_instance_type_small="ml.m5.4xlarge",
    )

    print("✅ Base processing configuration set successfully!")
    print(f"   Processing source: {processing_source_dir}")
else:
    print("✅ No base processing configuration needed.")

2025-11-28 03:11:20,959 - INFO - Package location discovery succeeded (bundled): /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/docker
2025-11-28 03:11:20,960 - INFO - Hybrid resolution completed successfully via Package Location Discovery: /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/docker
2025-11-28 03:11:20,960 - INFO - Base processing configuration set successfully


✅ Base processing configuration set successfully!
   Processing source: docker/scripts
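
The field descriptions above imply two simple rules: `use_large_processing_instance` picks between the large and small instance types, and `processing_source_dir` falls back to the base `source_dir` when unset. The sketch below models those two rules with a plain dataclass. It is an illustration based on the descriptions, not cursus code; `ProcessingSettings` and both helper functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProcessingSettings:
    # Defaults mirror the requirement listing printed above.
    processing_instance_type_large: str = "ml.m5.4xlarge"
    processing_instance_type_small: str = "ml.m5.2xlarge"
    use_large_processing_instance: bool = False
    processing_source_dir: Optional[str] = None


def effective_instance_type(settings: ProcessingSettings) -> str:
    """Pick the large or small instance type based on the boolean flag."""
    if settings.use_large_processing_instance:
        return settings.processing_instance_type_large
    return settings.processing_instance_type_small


def effective_source_dir(settings: ProcessingSettings, base_source_dir: str) -> str:
    """Fall back to the base source directory when no processing directory is set."""
    return settings.processing_source_dir or base_source_dir


# Values set in this notebook, with the large-instance flag turned on.
settings = ProcessingSettings(
    processing_instance_type_large="ml.m5.12xlarge",
    processing_instance_type_small="ml.m5.4xlarge",
    use_large_processing_instance=True,
    processing_source_dir="docker/scripts",
)
print(effective_instance_type(settings), effective_source_dir(settings, "docker"))
```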


## Step 5: Check Configuration Status

Let's see which steps still need configuration.

In [11]:
# Check current status
status = factory.get_configuration_status()
pending_steps = factory.get_pending_steps()

print("Configuration Status:")
print("=" * 30)
print(f"Base config set: {'✅' if status['base_config'] else '❌'}")
print(f"Processing config set: {'✅' if status['base_processing_config'] else '❌'}")
print(f"Total steps: {len(config_map)}")
print(f"Pending steps: {len(pending_steps)}")
print()

if pending_steps:
    print("Steps needing configuration:")
    for step in pending_steps:
        print(f"  - {step}")
else:
    print("✅ All steps configured!")

Configuration Status:
Base config set: ✅
Processing config set: ✅
Total steps: 7
Pending steps: 6

Steps needing configuration:
  - DummyTraining
  - ModelCalibration_calibration
  - DummyDataLoading_calibration
  - TabularPreprocessing_calibration
  - PyTorchModelEval_calibration
  - PyTorchTraining


## Step 6: Configure Individual Steps

Now we configure each step with its specific requirements. The factory will show us
only the fields that are unique to each step (not inherited from base configs).

### Step 6.2: Configure Dummy Data Loading Steps

In [12]:
# Configure dummy data loading
if "DummyDataLoading_calibration" in pending_steps:
    step_name = "DummyDataLoading_calibration"

    data_source = "s3://buyer-seller-messaging-reversal/pipeline/lukexie-BuyerAbuseRnR-pytorch-NA/20251111035543/labelrulesetexecution/processed_data/test"

    factory.set_step_config(
        step_name,
        job_type="calibration",
        data_source=data_source,
        processing_entry_point="dummy_data_loading.py",
        use_large_processing_instance=True,
        write_data_shards=True,
        output_format="PARQUET",
    )
    print(f"✅ {step_name} configured")

2025-11-28 03:11:20,972 - INFO - Package location discovery succeeded (bundled): /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/docker/scripts
2025-11-28 03:11:20,972 - INFO - Hybrid resolution completed successfully via Package Location Discovery: /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/docker/scripts
2025-11-28 03:11:20,973 - INFO - ✅ DummyDataLoading_calibration configured successfully using DummyDataLoadingConfig


✅ DummyDataLoading_calibration configured


### Step 6.3: Configure Training Step

This configuration is for the **TrainingStep**.
* It asks the user for the information needed to build a **Container** and start a **Training Job**.
* The most important information is provided in the **HyperParameter** section.


In [13]:
tab_field_list = [
    "net_conc_amt",
    "ttm_conc_amt",
    "ttm_conc_count",
    "concsi",
    "deliverable_flag",
    "undeliverable_flag",
    "unique_message_count",
    "total_ship_track_events_by_order",
    "total_unique_ship_track_events_by_order",
]

In [14]:
cat_field_list = [
    "dialogue",
    "shiptrack_event_history_by_order",
]

In [15]:
label_name = "llm_reversal_flag"
id_name = "order_id"

in_house_field_list = [
    "THREAD_ID",
    "region",
    "sender_customer_ids",
    "receipient_customer_ids",
    "entry_point_list",
    "topic_id_list",
    "subject_list",
    "shipment_ids",
    "shiptrack_tracking_id_lists_by_order",
    "shiptrack_sender_id_lists_by_order",
    "shiptrack_event_code_lists_by_order",
    "shiptrack_supplement_code_lists_by_order",
    "shiptrack_event_date_lists_by_order",
    "shiptrack_max_estimated_arrival_date_by_order",
    "shiptrack_max_estimated_delivery_date_by_order",
    "customer_id",
    "marketplace_id",
    "org",
    "concession_type",
    "abuse_type",
    "mfn_type",
    "week",
    "month",
    "order_day",
    "queued_date",
    "action_date",
    "queueid",
    "fraud_action_type_id",
    "concession_status",
    "abuse_flag",
    "concession_closed_date",
    "rr_tag",
    "reversal_reason",
    "reversal_flag",
    "reclassification_flag",
    "reclassified_segment",
    "row_num",
    "llm_status",
    "llm_error",
    "llm_category",
    "llm_confidence_score",
    "llm_key_evidence",
    "llm_reasoning",
    "llm_parse_status",
    "llm_validation_passed",
    "llm_raw_response",
    "llm_validation_error",
    id_name,
    label_name,
]

In [16]:
full_field_list = tab_field_list + cat_field_list + in_house_field_list
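
Since the full field list is built by concatenating three hand-maintained lists, a quick consistency check can catch duplicated names or a missing label or ID column before the hyperparameters are created. This is a minimal sketch on a small sample; `check_field_lists` is a hypothetical helper and the sample lists are shortened versions of the ones above.

```python
from collections import Counter
from typing import List


def check_field_lists(
    tab_fields: List[str],
    cat_fields: List[str],
    full_fields: List[str],
    label_name: str,
    id_name: str,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means the lists agree."""
    problems = []
    duplicates = sorted(name for name, count in Counter(full_fields).items() if count > 1)
    if duplicates:
        problems.append(f"duplicate fields: {duplicates}")
    missing = sorted((set(tab_fields) | set(cat_fields)) - set(full_fields))
    if missing:
        problems.append(f"feature fields missing from full list: {missing}")
    for required in (label_name, id_name):
        if required not in full_fields:
            problems.append(f"'{required}' missing from full list")
    return problems


# Shortened sample lists for illustration.
sample_tab = ["net_conc_amt", "ttm_conc_amt"]
sample_cat = ["dialogue"]
sample_other = ["region", "order_id", "llm_reversal_flag"]
sample_full = sample_tab + sample_cat + sample_other

print(check_field_lists(sample_tab, sample_cat, sample_full, "llm_reversal_flag", "order_id"))
```

On the real lists the call would be `check_field_lists(tab_field_list, cat_field_list, full_field_list, label_name, id_name)`.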

In [17]:
batch_size = 2
lr = 3e-05
max_epochs = 3  # 3, #15,
metric_choices = ["f1_score", "auroc"]
optimizer = "SGD"

In [18]:
# First, let's create the hyperparameters
from cursus.core.base.hyperparameters_base import ModelHyperparameters
from cursus.steps.hyperparams.hyperparameters_trimodal import TriModalHyperparameters


# Create base hyperparameters
base_hyperparameter = ModelHyperparameters(
    full_field_list=full_field_list,
    cat_field_list=cat_field_list,
    tab_field_list=tab_field_list,
    label_name=label_name,
    id_name=id_name,
    multiclass_categories=[0, 1],
    batch_size=batch_size,
    lr=lr,
    max_epochs=max_epochs,
    metric_choices=metric_choices,
    optimizer=optimizer,
)

In [19]:
model_class = "trimodal_bert"
tokenizer = "bert-base-multilingual-uncased"
primary_text_name = "dialogue"
secondary_text_name = "shiptrack_event_history_by_order"
lr_decay = 0.05
max_sen_len = 512
chunk_trancate = True
max_total_chunks = 3
momentum = 0.9
pretrained_embedding = True
reinit_layers = 2
reinit_pooler = True
run_scheduler = True
load_ckpt = False
hidden_common_dim = 100
val_check_interval = 0.25
warmup_steps = 300
weight_decay = 0
early_stop_metric = "val_loss"
early_stop_patience = 3
primary_tokenizer = secondary_tokenizer = tokenizer
text_input_ids_key = "input_ids"
text_attention_mask_key = "attention_mask"
fp16 = False
use_gradient_checkpointing = False  # False

In [20]:
# Create tri-modal hyperparameters from the base hyperparameters
trimodal_hyperparams = TriModalHyperparameters.from_base_hyperparam(
    base_hyperparameter,
    model_class=model_class,
    tokenizer=tokenizer,
    primary_text_name=primary_text_name,
    secondary_text_name=secondary_text_name,
    text_input_ids_key=text_input_ids_key,
    text_attention_mask_key=text_attention_mask_key,
    momentum=momentum,
    lr_decay=lr_decay,
    weight_decay=weight_decay,
    max_sen_len=max_sen_len,
    chunk_trancate=chunk_trancate,
    max_total_chunks=max_total_chunks,
    reinit_layers=reinit_layers,
    reinit_pooler=reinit_pooler,
    run_scheduler=run_scheduler,
    load_ckpt=load_ckpt,
    hidden_common_dim=hidden_common_dim,
    warmup_steps=warmup_steps,
    val_check_interval=val_check_interval,
    early_stop_metric=early_stop_metric,
    early_stop_patience=early_stop_patience,
    fp16=fp16,
    use_gradient_checkpointing=use_gradient_checkpointing,
)

print("✅ Hyperparameters created")
print(
    f"   Features: {len(full_field_list)} total, {len(tab_field_list)} numerical, {len(cat_field_list)} categorical"
)
print(f"   Max epochs: {trimodal_hyperparams.max_epochs}")

✅ Hyperparameters created
   Features: 60 total, 9 numerical, 2 categorical
   Max epochs: 3


In [21]:
instance_type_list = [
    "ml.m5.4xlarge",
    "ml.m5.12xlarge",
    "ml.p3.16xlarge",
    "ml.g4dn.16xlarge",
    "ml.g5.12xlarge",
    "ml.g5.16xlarge",
]

In [22]:
instance_select = -4  # -2 #-1
instance_type_list[instance_select]

'ml.p3.16xlarge'
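
The negative index counts from the end of the list, which is why `-4` selects the third entry of a six-element list. A short sketch of the arithmetic, using the same list; `resolved_index` is a name introduced here for illustration only.

```python
instance_type_list = [
    "ml.m5.4xlarge",
    "ml.m5.12xlarge",
    "ml.p3.16xlarge",
    "ml.g4dn.16xlarge",
    "ml.g5.12xlarge",
    "ml.g5.16xlarge",
]

instance_select = -4
# For a valid negative index, the equivalent non-negative index is index % len(list).
resolved_index = instance_select % len(instance_type_list)

print(resolved_index, instance_type_list[resolved_index])
```

Negative indices keep the selection stable when new instance types are appended to the front of the list, but they shift when entries are appended to the end, so selecting by name may be less surprising.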

In [23]:
# Configure PyTorch training
if "PyTorchTraining" in pending_steps:
    step_name = "PyTorchTraining"
    training_volume_size = 800
    factory.set_step_config(
        step_name,
        training_instance_type=instance_type_list[instance_select],
        training_entry_point="pytorch_training.py",
        training_volume_size=training_volume_size,
    )
    print(f"✅ {step_name} configured")
    print(f"   Instance type: {instance_type_list[instance_select]}")
    print(f"   Volume size: {training_volume_size} GB")

2025-11-28 03:11:21,038 - INFO - ✅ PyTorchTraining configured successfully using PyTorchTrainingConfig


✅ PyTorchTraining configured
   Instance type: ml.p3.16xlarge
   Volume size: 800 GB


### Step 6.4: Configure Preprocessing Steps

In [24]:
# Configure training preprocessing
if "TabularPreprocessing_training" in pending_steps:
    step_name = "TabularPreprocessing_training"

    factory.set_step_config(
        step_name,
        job_type="training",
        label_name=None,
        processing_entry_point="tabular_preprocessing.py",
        use_large_processing_instance=True,
        output_format="Parquet",
    )
    print(f"✅ {step_name} configured")

In [25]:
# Configure calibration preprocessing
if "TabularPreprocessing_calibration" in pending_steps:
    step_name = "TabularPreprocessing_calibration"

    factory.set_step_config(
        step_name,
        job_type="calibration",
        label_name=None,
        processing_entry_point="tabular_preprocessing.py",
        use_large_processing_instance=True,
        output_format="Parquet",
    )
    print(f"✅ {step_name} configured")

2025-11-28 03:11:21,050 - INFO - Package location discovery succeeded (bundled): /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/docker/scripts
2025-11-28 03:11:21,050 - INFO - Hybrid resolution completed successfully via Package Location Discovery: /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/docker/scripts
2025-11-28 03:11:21,051 - INFO - ✅ TabularPreprocessing_calibration configured successfully using TabularPreprocessingConfig


✅ TabularPreprocessing_calibration configured


### Step 6.3-2: Configure Dummy Training Step

In [26]:
# Configure dummy training (reuses a previously trained model artifact)
if "DummyTraining" in pending_steps:
    step_name = "DummyTraining"
    pretrained_model_path = "s3://buyer-seller-messaging-reversal/pipeline/lukexie-BuyerAbuseRnR-pytorch-NA/20251127060644/pytorch_training/pipelines-2x77ph4km9d6-PyTorchTraining-eLa7mE7jJT/output/model.tar.gz"

    factory.set_step_config(
        step_name,
        pretrained_model_path=pretrained_model_path,
        processing_entry_point="dummy_training.py",
        use_large_processing_instance=True,
    )
    print(f"✅ {step_name} configured")

2025-11-28 03:11:21,056 - INFO - Package location discovery succeeded (bundled): /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/docker/scripts
2025-11-28 03:11:21,056 - INFO - Hybrid resolution completed successfully via Package Location Discovery: /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/docker/scripts
2025-11-28 03:11:21,057 - INFO - ✅ DummyTraining configured successfully using DummyTrainingConfig


✅ DummyTraining configured


### Get Remaining Steps

**USER INPUT BLOCK**: Fill in the essential fields for each remaining step.
The factory has identified the required fields for each step.

In [27]:
# Get current pending steps
current_pending = factory.get_pending_steps()

print("Remaining steps to configure:")
print("=" * 40)

for step_name in current_pending:
    requirements = factory.get_step_requirements(step_name)
    essential_reqs = [req for req in requirements if req["required"]]

    print(f"\n{step_name}:")
    print(f"  Essential fields ({len(essential_reqs)}):")
    for req in essential_reqs:
        print(f"    * {req['name']} ({req['type']}) - {req['description']}")

    if len(requirements) > len(essential_reqs):
        optional_count = len(requirements) - len(essential_reqs)
        print(f"  Optional fields: {optional_count}")

Remaining steps to configure:

ModelCalibration_calibration:
  Essential fields (1):
    * label_field (str) - Name of the label column
  Optional fields: 12

PyTorchModelEval_calibration:
  Essential fields (2):
    * id_name (str) - Name of the ID field in the dataset (required for evaluation).
    * label_name (str) - Name of the label field in the dataset (required for evaluation).
  Optional fields: 8


### Step 6.6: Configure Label Ruleset Generation Steps

In [28]:
from cursus.steps.configs.config_label_ruleset_generation_step import (
    LabelRulesetGenerationConfig,
    LabelConfig,
    RuleDefinition,
    RulesetDefinitionList,
    RuleCondition,
    ComparisonOperator,
)

In [29]:
label_config = LabelConfig(
    output_label_name="llm_reversal_flag",
    output_label_type="binary",
    label_values=[0, 1],
    label_mapping={"0": "No_Reversal_Required", "1": "Reversal_Required"},
    default_label=1,  # Conservative: flag for review if no rules match
    evaluation_mode="priority",
)

In [30]:
# Rule 1: Categories indicating no reversal (output_label = 0)
rule_no_reversal = RuleDefinition(
    name="No_Reversal_Categories",
    priority=1,
    enabled=True,
    conditions=RuleCondition(
        field="llm_category",
        operator=ComparisonOperator.IN,
        value=["TrueDNR", "PDA_Undeliverable", "PDA_Early_Refund", "Returnless_Refund"],
    ),
    output_label=0,
    description="Categories indicating NO reversal: delivered/fraud patterns, logistics delays, goodwill refunds",
)

# Rule 2: Categories indicating reversal required (output_label = 1)
rule_reversal_required = RuleDefinition(
    name="Reversal_Required_Categories",
    priority=2,
    enabled=True,
    conditions=RuleCondition(
        field="llm_category",
        operator=ComparisonOperator.IN,
        value=[
            "Confirmed_Delay",
            "Delivery_Attempt_Failed",
            "Seller_Unable_To_Ship",
            "Buyer_Received_WrongORDefective_Item",
            "BuyerCancellation",
            "Return_NoLongerNeeded",
            "Product_Information_Support",
            "Insufficient_Information",
        ],
    ),
    output_label=1,
    description="Categories indicating reversal required: legitimate refunds, quality issues, cancellations, manual review cases",
)

# Collect all rules (just 2 rules now!)
# Wrap rules in RulesetDefinitionList (required by config)
ruleset_definitions = RulesetDefinitionList(
    rules=[rule_no_reversal, rule_reversal_required]
)
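
To make the intent of `evaluation_mode="priority"` and `default_label=1` concrete, the sketch below applies a first-match-wins evaluation over simplified rules. The execution script's real logic is not shown in this notebook, so this is an assumption about its behavior (including that a lower `priority` number is checked first). `SimpleRule` and `apply_rules` are hypothetical stand-ins, not cursus classes, and the category lists are shortened.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class SimpleRule:
    """Simplified stand-in for a rule with a single IN condition."""

    name: str
    priority: int
    field: str
    allowed_values: List[str]
    output_label: int
    enabled: bool = True


def apply_rules(row: Dict[str, Any], rules: List[SimpleRule], default_label: int) -> int:
    """Return the label of the first enabled rule that matches, else the default."""
    # Assumption: rules are checked in ascending priority order.
    for rule in sorted((r for r in rules if r.enabled), key=lambda r: r.priority):
        if row.get(rule.field) in rule.allowed_values:
            return rule.output_label
    return default_label


rules = [
    SimpleRule("No_Reversal_Categories", 1, "llm_category", ["TrueDNR", "Returnless_Refund"], 0),
    SimpleRule("Reversal_Required_Categories", 2, "llm_category", ["Confirmed_Delay", "BuyerCancellation"], 1),
]

for category in ["TrueDNR", "BuyerCancellation", "Unmapped_Category"]:
    print(category, apply_rules({"llm_category": category}, rules, default_label=1))
```

Under this reading, a category that matches neither rule receives the default label `1`, which matches the comment in `label_config` about flagging unmatched rows for review.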

In [31]:
if "LabelRulesetGeneration" in pending_steps:
    step_name = "LabelRulesetGeneration"

    factory.set_step_config(
        step_name,
        # Label configuration (Pydantic model)
        label_config=label_config,
        # Rule definitions (RulesetDefinitionList with Pydantic models)
        rule_definitions=ruleset_definitions,
        # ===== Tier 2: Optional Configuration (with defaults) =====
        # Validation settings (all default to True)
        enable_field_validation=True,
        enable_label_validation=True,
        enable_logic_validation=True,
        # Optimization settings (defaults to True)
        enable_rule_optimization=True,
        # Configuration path (defaults to 'ruleset_configs')
        ruleset_configs_path="ruleset_configs",
        # Processing configuration (defaults to 'label_ruleset_generation.py')
        processing_entry_point="label_ruleset_generation.py",
    )

    print(f"✅ {step_name} configured")

### Step 6.7: Configure Label Ruleset Execution Steps

In [32]:
from cursus.steps.configs.config_label_ruleset_execution_step import (
    LabelRulesetExecutionConfig,
)

In [33]:
if "LabelRulesetExecution_training" in pending_steps:
    step_name = "LabelRulesetExecution_training"

    factory.set_step_config(
        step_name,
        # ===== Tier 1: Required Configuration =====
        # Job type determines which splits to process
        job_type="training",  # One of: 'training', 'validation', 'testing', 'calibration'
        # ===== Tier 2: Optional Configuration (with defaults) =====
        # Execution configuration (all default to True)
        fail_on_missing_fields=True,  # Fail if llm_category field missing
        enable_rule_match_tracking=True,  # Track which rules match
        enable_progress_logging=True,  # Log progress during execution
        # Data format configuration (defaults to empty string for auto-detection)
        preferred_input_format="Parquet",  # Options: 'CSV', 'TSV', 'Parquet', or '' for auto
        # Processing configuration (defaults to 'label_ruleset_execution.py')
        processing_entry_point="label_ruleset_execution.py",
        # ===== Processing Step Base Configuration =====
        # SageMaker instance configuration
        use_large_processing_instance=True,
    )

    print(f"✅ {step_name} configured for job_type='training'")

In [34]:
if "LabelRulesetExecution_calibration" in pending_steps:
    step_name = "LabelRulesetExecution_calibration"

    factory.set_step_config(
        step_name,
        # ===== Tier 1: Required Configuration =====
        # Job type determines which splits to process
        job_type="calibration",  # One of: 'training', 'validation', 'testing', 'calibration'
        # ===== Tier 2: Optional Configuration (with defaults) =====
        # Execution configuration (all default to True)
        fail_on_missing_fields=True,  # Fail if llm_category field missing
        enable_rule_match_tracking=True,  # Track which rules match
        enable_progress_logging=True,  # Log progress during execution
        # Data format configuration (defaults to empty string for auto-detection)
        preferred_input_format="Parquet",  # Options: 'CSV', 'TSV', 'Parquet', or '' for auto
        # Processing configuration (defaults to 'label_ruleset_execution.py')
        processing_entry_point="label_ruleset_execution.py",
        # ===== Processing Step Base Configuration =====
        # SageMaker instance configuration
        use_large_processing_instance=True,
    )

    print(f"✅ {step_name} configured for job_type='calibration'")
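
The training and calibration cells above pass the same arguments apart from `job_type`. A minimal sketch of one way to build those arguments once is shown here; the helper name `label_ruleset_execution_kwargs` and the constant `VALID_JOB_TYPES` are hypothetical, while the field names and values are copied from the cells above.

```python
from typing import Any, Dict

# Job types listed in the comments of the cells above.
VALID_JOB_TYPES = ("training", "validation", "testing", "calibration")


def label_ruleset_execution_kwargs(job_type: str) -> Dict[str, Any]:
    """Build the keyword arguments shared by the LabelRulesetExecution steps."""
    if job_type not in VALID_JOB_TYPES:
        raise ValueError(f"job_type must be one of {VALID_JOB_TYPES}, got {job_type!r}")
    return {
        "job_type": job_type,
        "fail_on_missing_fields": True,
        "enable_rule_match_tracking": True,
        "enable_progress_logging": True,
        "preferred_input_format": "Parquet",
        "processing_entry_point": "label_ruleset_execution.py",
        "use_large_processing_instance": True,
    }


# Intended use in this notebook (factory and pending_steps come from earlier cells):
# for job_type in ("training", "calibration"):
#     step_name = f"LabelRulesetExecution_{job_type}"
#     if step_name in pending_steps:
#         factory.set_step_config(step_name, **label_ruleset_execution_kwargs(job_type))
#         print(f"{step_name} configured for job_type={job_type!r}")
```

Deriving both the step name and the status message from `job_type` keeps the printed message in sync with the step that was actually configured.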

### Step 6.8: PyTorch Model Evaluation Step

In [35]:
id_name

'order_id'

In [36]:
label_name

'llm_reversal_flag'

In [37]:
# Configure Model Evaluation
if "PyTorchModelEval_calibration" in current_pending:
    step_name = "PyTorchModelEval_calibration"
    factory.set_step_config(
        step_name,
        job_type="calibration",
        processing_entry_point="pytorch_model_eval.py",
        id_name=id_name,
        label_name=label_name,
        processing_source_dir=str(source_dir),
        processing_instance_type_large="ml.g5.12xlarge",
        use_large_processing_instance=True,
    )
    print(f"✅ {step_name} configured")

2025-11-28 03:11:21,123 - INFO - Package location discovery succeeded (bundled): /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/docker
2025-11-28 03:11:21,123 - INFO - Hybrid resolution completed successfully via Package Location Discovery: /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/docker
2025-11-28 03:11:21,124 - INFO - ✅ PyTorchModelEval_calibration configured successfully using PyTorchModelEvalConfig


✅ PyTorchModelEval_calibration configured


### Step 6.9: Model Calibration Step

In [38]:
# Configure Model Calibration
if "ModelCalibration_calibration" in current_pending:
    factory.set_step_config(
        "ModelCalibration_calibration",
        label_field=label_name,
        processing_entry_point="model_calibration.py",
        score_field="prob_class_1",
        is_binary=True,
        num_classes=2,
        score_field_prefix="prob_class_",
        multiclass_categories=[0, 1],
    )
    print("✅ ModelCalibration_calibration configured")

2025-11-28 03:11:21,130 - INFO - Package location discovery succeeded (bundled): /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/docker/scripts
2025-11-28 03:11:21,130 - INFO - Hybrid resolution completed successfully via Package Location Discovery: /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/docker/scripts
2025-11-28 03:11:21,131 - INFO - ✅ ModelCalibration_calibration configured successfully using ModelCalibrationConfig


✅ ModelCalibration_calibration configured


### Step 6.10: Pre-Registration Step

## Step 7: Generate Final Configurations

Now that all steps are configured, we can generate the final configuration instances.
The factory will validate that all essential fields are provided and create the config objects.
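
As a rough illustration of what "validating essential fields" means, the sketch below reports which required names are still unset for a step. It is an assumption for illustration only, not the factory's internal logic; `missing_required_fields` is a hypothetical helper, and `job_type` is used as the required field because the cells above mark it as Tier 1.

```python
from typing import Any, Dict, Iterable, List


def missing_required_fields(values: Dict[str, Any], required: Iterable[str]) -> List[str]:
    """Return the required field names that are absent or set to None."""
    return [name for name in required if values.get(name) is None]


# A step configured with only optional (Tier 2) fields is still missing job_type.
incomplete = {"fail_on_missing_fields": True}
complete = {"job_type": "training", "fail_on_missing_fields": True}

print(missing_required_fields(incomplete, ["job_type"]))
print(missing_required_fields(complete, ["job_type"]))
```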

In [39]:
# Check final status
final_status = factory.get_configuration_status()
final_pending = factory.get_pending_steps()

print("Final Configuration Status:")
print("=" * 40)
print(f"Base config: {'✅' if final_status['base_config'] else '❌'}")
print(f"Processing config: {'✅' if final_status['base_processing_config'] else '❌'}")
print(f"Pending steps: {len(final_pending)}")

if final_pending:
    print("\nStill pending:")
    for step in final_pending:
        print(f"  - {step}")
    print("\n⚠️  Please configure remaining steps before generating configs.")
else:
    print("\n✅ All steps configured! Ready to generate configurations.")

Final Configuration Status:
Base config: ✅
Processing config: ✅
Pending steps: 0

✅ All steps configured! Ready to generate configurations.


In [40]:
# Generate final configurations
if not final_pending:
    try:
        print("Generating final configurations...")
        configs = factory.generate_all_configs()

        print(f"\n✅ Successfully generated {len(configs)} configuration instances:")
        for i, config in enumerate(configs, 1):
            print(f"  {i:2d}. {config.__class__.__name__}")

        print("\n🎉 Configuration generation complete!")

    except Exception as e:
        print(f"\n❌ Configuration generation failed: {e}")
        print("\nPlease check that all required fields are provided.")
        configs = None
else:
    print("\n⚠️  Cannot generate configs - some steps are still pending configuration.")
    configs = None

2025-11-28 03:11:21,143 - INFO - Package location discovery succeeded (bundled): /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/docker/scripts
2025-11-28 03:11:21,143 - INFO - Hybrid resolution completed successfully via Package Location Discovery: /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/docker/scripts
2025-11-28 03:11:21,144 - INFO - ✅ Package auto-configured successfully (only tier 2+ fields)
2025-11-28 03:11:21,144 - INFO - ✅ Auto-configured 1 steps with only tier 2+ fields
2025-11-28 03:11:21,145 - INFO - ✅ Returning 7 pre-validated configuration instances


Generating final configurations...

✅ Successfully generated 7 configuration instances:
   1. DummyDataLoadingConfig
   2. PyTorchTrainingConfig
   3. TabularPreprocessingConfig
   4. DummyTrainingConfig
   5. PyTorchModelEvalConfig
   6. ModelCalibrationConfig
   7. PackageConfig

🎉 Configuration generation complete!


In [41]:
len(configs)

7

## Step 8: Save to JSON

Finally, we save the generated configurations to a unified JSON file using the existing
`merge_and_save_configs` utility. This creates the same format as the legacy approach
but with much less effort!
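
To give a feel for what a unified file can contain, the sketch below separates fields that have the same value in every config from fields that differ per step. This is an assumption for illustration, not the algorithm `merge_and_save_configs` implements; `split_shared_and_specific` is a hypothetical helper, and the `region` value in the sample data is made up.

```python
import json
from typing import Any, Dict


def split_shared_and_specific(configs: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Split fields into those identical in every config and those that are not."""
    if not configs:
        return {"shared": {}, "specific": {}}
    first = next(iter(configs.values()))
    # A field is shared only if every config has it with the same value.
    shared = {
        key: value
        for key, value in first.items()
        if all(key in cfg and cfg[key] == value for cfg in configs.values())
    }
    specific = {
        step: {k: v for k, v in cfg.items() if k not in shared}
        for step, cfg in configs.items()
    }
    return {"shared": shared, "specific": specific}


# Hypothetical sample data; "region" is an illustrative field, not taken from this notebook.
example = {
    "PyTorchModelEval_calibration": {
        "region": "NA",
        "job_type": "calibration",
        "id_name": "order_id",
    },
    "ModelCalibration_calibration": {
        "region": "NA",
        "job_type": "calibration",
        "score_field": "prob_class_1",
    },
}
merged = split_shared_and_specific(example)
print(json.dumps(merged, indent=2, sort_keys=True))
```

Storing common values once keeps the file smaller and makes a change to a shared setting a single edit.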

In [42]:
if configs:
    # Set up output directory and filename
    MODEL_CLASS = "pytorch"
    service_name = "BuyerAbuseRnR"

    config_dir = Path(current_dir) / "pipeline_config"
    config_dir.mkdir(parents=True, exist_ok=True)

    config_file_name = "config.json"
    config_path = config_dir / config_file_name

    print(f"Saving configurations to: {config_path}")

    # Use the existing merge_and_save_configs utility
    from cursus.steps.configs.utils import merge_and_save_configs

    try:
        merged_config = merge_and_save_configs(configs, str(config_path))

        print("\n✅ Configuration saved successfully!")
        print(f"   File: {config_path}")
        print(f"   Size: {config_path.stat().st_size / 1024:.1f} KB")

        # Also save hyperparameters separately (for compatibility)
        hyperparam_path = source_dir / "hyperparams" / "hyperparameters.json"
        with open(hyperparam_path, "w") as f:
            json.dump(trimodal_hyperparams.model_dump(), f, indent=2, sort_keys=True)

        print(f"   Hyperparameters: {hyperparam_path}")

        # print(f"\n🎉 Interactive configuration complete!")
        # print(f"\n📊 Comparison with legacy approach:")
        # print(f"   Legacy: 500+ lines of manual configuration")
        # print(f"   Interactive: Guided step-by-step process")
        # print(f"   Time saved: ~20-25 minutes")
        # print(f"   Error reduction: Validation at each step")

    except Exception as e:
        print(f"\n❌ Failed to save configurations: {e}")

else:
    print("\n⚠️  No configurations to save. Please generate configs first.")

2025-11-28 03:11:21,156 - INFO - Package location discovery succeeded (bundled): /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/docker
2025-11-28 03:11:21,157 - INFO - Hybrid resolution completed successfully via Package Location Discovery: /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/docker
2025-11-28 03:11:21,160 - INFO - Merging and saving 7 configs to /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/pipeline_config/config.json
2025-11-28 03:11:21,160 - INFO - Collecting field information for 7 configs (6 processing configs)


Saving configurations to: /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/pipeline_config/config.json


2025-11-28 03:11:21,164 - INFO - Collected information for 67 unique fields
2025-11-28 03:11:21,166 - INFO - Efficient algorithm identified 15 shared fields
2025-11-28 03:11:21,168 - INFO - Populated specific fields for 7 configs
2025-11-28 03:11:21,169 - INFO - Shared fields: 15
2025-11-28 03:11:21,169 - INFO - Specific steps: 7
2025-11-28 03:11:21,170 - INFO - Saving merged configuration to /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/pipeline_config/config.json
2025-11-28 03:11:21,171 - INFO - Successfully saved merged configuration to /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/pipeline_config/config.json
2025-11-28 03:11:21,171 - INFO - Successfully saved merged configs to /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/pipeline_config/config.json



✅ Configuration saved successfully!
   File: /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/pipeline_config/config.json
   Size: 20.9 KB
   Hyperparameters: docker/hyperparams/hyperparameters.json


### Test if we can load it

In [43]:
from cursus.steps.configs.config_dummy_data_loading_step import DummyDataLoadingConfig
from cursus.steps.configs.config_tabular_preprocessing_step import (
    TabularPreprocessingConfig,
)
from cursus.steps.configs.config_bedrock_prompt_template_generation_step import (
    BedrockPromptTemplateGenerationConfig,
)
from cursus.steps.configs.config_bedrock_batch_processing_step import (
    BedrockBatchProcessingConfig,
)
from cursus.steps.configs.config_label_ruleset_generation_step import (
    LabelRulesetGenerationConfig,
)
from cursus.steps.configs.config_label_ruleset_execution_step import (
    LabelRulesetExecutionConfig,
)
from cursus.steps.configs.config_pytorch_training_step import PyTorchTrainingConfig
from cursus.steps.configs.config_pytorch_model_eval_step import PyTorchModelEvalConfig
from cursus.steps.configs.config_dummy_training_step import DummyTrainingConfig
from cursus.steps.configs.config_model_calibration_step import ModelCalibrationConfig
from cursus.steps.configs.config_package_step import PackageConfig

In [44]:
from cursus.steps.configs.utils import load_configs

In [45]:
CONFIG_CLASSES = {
    "DummyDataLoadingConfig": DummyDataLoadingConfig,
    "BedrockPromptTemplateGenerationConfig": BedrockPromptTemplateGenerationConfig,
    "BedrockBatchProcessingConfig": BedrockBatchProcessingConfig,
    "TabularPreprocessingConfig": TabularPreprocessingConfig,
    "LabelRulesetGenerationConfig": LabelRulesetGenerationConfig,
    "LabelRulesetExecutionConfig": LabelRulesetExecutionConfig,
    "PyTorchTrainingConfig": PyTorchTrainingConfig,
    "PyTorchModelEvalConfig": PyTorchModelEvalConfig,
    "DummyTrainingConfig": DummyTrainingConfig,
    "ModelCalibrationConfig": ModelCalibrationConfig,
    "PackageConfig": PackageConfig,
}

In [46]:
loaded_configs = load_configs(str(config_path), CONFIG_CLASSES)

2025-11-28 03:11:21,195 - INFO - Loading configs from /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/pipeline_config/config.json
2025-11-28 03:11:21,195 - INFO - Loading configuration from /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/pipeline_config/config.json
2025-11-28 03:11:21,196 - INFO - Successfully loaded configuration from /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/pipeline_config/config.json
2025-11-28 03:11:21,196 - INFO - Successfully loaded configs from /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/pipeline_config/config.json with 7 specific configs
2025-11-28 03:11:21,197 - INFO - Creating additional config instance for DummyDataLoading_calibration (DummyDataLoadingConfig)
2025-11-28 03:11:21,198 - INFO - Package location discovery succeeded (bundled): /home/ec2-us

In [47]:
len(loaded_configs)

7
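
Matching counts are a useful first check. A slightly stronger one is to compare which config classes came back. The sketch below is one way to do that; `class_name_counts` is a hypothetical helper, the stand-in classes exist only so the example runs without cursus, and the helper accepts either a list or a dict because the container type returned by `load_configs` is not shown here.

```python
from collections import Counter
from collections.abc import Mapping
from typing import Any, Iterable, Union


def class_name_counts(configs: Union[Mapping, Iterable[Any]]) -> Counter:
    """Count config instances by class name; accepts a dict of configs or a list."""
    items = configs.values() if isinstance(configs, Mapping) else configs
    return Counter(type(item).__name__ for item in items)


# Stand-in classes so the sketch runs without cursus installed.
class PackageConfig:
    pass


class ModelCalibrationConfig:
    pass


generated = [PackageConfig(), ModelCalibrationConfig()]
reloaded = {
    "Package": PackageConfig(),
    "ModelCalibration_calibration": ModelCalibrationConfig(),
}
same_classes = class_name_counts(generated) == class_name_counts(reloaded)
print(same_classes)
```

In this notebook the equivalent check would be `class_name_counts(configs) == class_name_counts(loaded_configs)`.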

## Summary

This notebook demonstrates the **DAGConfigFactory** approach to pipeline configuration:

### ✅ **Benefits Achieved**

1. **Reduced Complexity**: From 500+ lines of manual config to a guided workflow
2. **Base Config Inheritance**: Set common fields once, inherit everywhere
3. **Step-by-Step Guidance**: Clear requirements for each configuration step
4. **Validation**: Comprehensive validation prevents configuration errors
5. **Reusable DAG**: Pipeline structure defined once, reused across environments

### 🔄 **Workflow Comparison**

| Aspect | Legacy Approach | Interactive Approach |
|--------|----------------|---------------------|
| **Lines of Code** | 500+ manual lines | Guided step-by-step |
| **Time Required** | 30+ minutes | 10-15 minutes |
| **Error Rate** | High (manual entry) | Low (validation) |
| **Reusability** | Copy-paste heavy | DAG-driven |
| **Maintenance** | Manual updates | Automatic inheritance |

### 🚀 **Next Steps**

The generated configuration file can now be used with the existing pipeline compiler:

```python
# Use with pipeline compiler (from demo_pipeline.ipynb)
from cursus.core.compiler.dag_compiler import PipelineDAGCompiler

dag_compiler = PipelineDAGCompiler(
    config_path=config_path,
    sagemaker_session=pipeline_session,
    role=role
)

# Compile DAG to pipeline
template_pipeline, report = dag_compiler.compile_with_report(dag=dag)
```

The interactive configuration approach transforms the user experience from complex manual setup to an intuitive, guided workflow while maintaining full compatibility with the existing cursus infrastructure.