# Interactive Pipeline Configuration with DAGConfigFactory

This notebook demonstrates the new interactive approach to pipeline configuration using the DAGConfigFactory.
Instead of manually creating 500+ lines of static configuration, we use a guided step-by-step process.

## Workflow Overview

1. **Define Pipeline DAG** - Create the pipeline structure
2. **Initialize DAGConfigFactory** - Set up the interactive factory
3. **Configure Base Settings** - Set shared pipeline configuration
4. **Configure Processing Settings** - Set shared processing configuration
5. **Check Configuration Status** - See which steps still need configuration
6. **Configure Individual Steps** - Set step-specific configurations
7. **Generate Final Configurations** - Create config instances
8. **Save to JSON** - Export unified configuration file



## Environment Setup

In [1]:
import os
import json
import sys
from pathlib import Path
from datetime import datetime, date
import logging
from typing import List, Optional, Dict, Any


# Set up logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

# Get parent directory of current notebook
project_root = str(Path().absolute().parent)
print(f"project root {project_root}")
if project_root not in sys.path:
    sys.path.insert(0, project_root)
    print(f"add project root {project_root} into system")

project root /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines
add project root /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines into system


In [2]:
# SageMaker imports
from sagemaker import Session
from sagemaker.workflow.pipeline_context import PipelineSession

2025-11-05 22:49:00,623 - INFO - Note: NumExpr detected 48 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2025-11-05 22:49:00,624 - INFO - NumExpr defaulting to 8 threads.
  from pandas.core.computation.check import NUMEXPR_INSTALLED
2025-11-05 22:49:01,759 - INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [3]:
print(f"Role: {PipelineSession().get_caller_identity_arn()}")

2025-11-05 22:49:02,105 - INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


Role: arn:aws:iam::178936618742:role/AmazonSageMaker-ExecutionRole-Default


In [4]:
bucket = "buyer-seller-messaging-reversal"
print(f"Bucket: {bucket}")

Bucket: buyer-seller-messaging-reversal


## Step 1: Define Pipeline DAG

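First, we define the pipeline structure as a DAG (directed acyclic graph).
This replaces the hardcoded pipeline structure from the legacy approach.
In this run only the `BedrockPromptTemplateGeneration` node is active; the data loading, preprocessing, and batch processing nodes and their edges are commented out.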
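
To make the DAG idea concrete, here is a minimal standalone sketch of a graph with `add_node`, `add_edge`, and a topological sort. `TinyDAG` and `topological_order` are hypothetical names used only for illustration; this is not the `PipelineDAG` implementation from cursus. The node names are the ones that appear (mostly commented out) in the cell that follows.

```python
from collections import deque
from typing import Dict, List, Tuple


class TinyDAG:
    """Hypothetical stand-in for a pipeline DAG (not cursus's PipelineDAG)."""

    def __init__(self) -> None:
        self.nodes: List[str] = []
        self.edges: List[Tuple[str, str]] = []

    def add_node(self, name: str) -> None:
        if name not in self.nodes:
            self.nodes.append(name)

    def add_edge(self, src: str, dst: str) -> None:
        # Register both endpoints so an edge never points at a missing node.
        self.add_node(src)
        self.add_node(dst)
        self.edges.append((src, dst))

    def topological_order(self) -> List[str]:
        """Kahn's algorithm: repeatedly emit nodes whose dependencies are all done."""
        indegree: Dict[str, int] = {name: 0 for name in self.nodes}
        for _, dst in self.edges:
            indegree[dst] += 1
        ready = deque(name for name in self.nodes if indegree[name] == 0)
        order: List[str] = []
        while ready:
            node = ready.popleft()
            order.append(node)
            for src, dst in self.edges:
                if src == node:
                    indegree[dst] -= 1
                    if indegree[dst] == 0:
                        ready.append(dst)
        if len(order) != len(self.nodes):
            raise ValueError("graph contains a cycle")
        return order


sketch = TinyDAG()
sketch.add_edge("DummyDataLoading_training", "TabularPreprocessing_training")
sketch.add_edge("TabularPreprocessing_training", "BedrockBatchProcessing_training")
sketch.add_edge("BedrockPromptTemplateGeneration", "BedrockBatchProcessing_training")
execution_order = sketch.topological_order()
print(execution_order)
```

The batch processing node has two incoming edges (data and template), so it can only run after both of its inputs are available.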

In [5]:
from cursus.api.dag.base_dag import PipelineDAG


def create_bedrock_batch_data_processing_dag() -> PipelineDAG:
    """
    Create a DAG for Bedrock Batch data processing pipeline.

    This DAG represents the simplest possible workflow that includes
    cost-efficient Bedrock batch LLM enhancement for pure data processing
    without any training, calibration, packaging, registration, or evaluation steps.
    Perfect for data enhancement and annotation workflows.

    Returns:
        PipelineDAG: The directed acyclic graph for the pipeline
    """
    dag = PipelineDAG()

    # Add minimal data processing nodes with Bedrock batch enhancement
    # dag.add_node("DummyDataLoading_training")  # Dummy data load
    # dag.add_node("TabularPreprocessing_training")  # Tabular preprocessing
    dag.add_node(
        "BedrockPromptTemplateGeneration"
    )  # Bedrock prompt template generation
    # dag.add_node("BedrockBatchProcessing_training")  # Bedrock batch processing step

    # Simple data processing flow with Bedrock batch enhancement
    # dag.add_edge("DummyDataLoading_training", "TabularPreprocessing_training")

    # Bedrock batch processing flow - two inputs to BedrockBatchProcessing
    # dag.add_edge("TabularPreprocessing_training", "BedrockBatchProcessing_training")  # Data input
    # dag.add_edge("BedrockPromptTemplateGeneration", "BedrockBatchProcessing_training")  # Template input

    logger.info(
        f"Created Bedrock Batch data processing DAG with {len(dag.nodes)} nodes and {len(dag.edges)} edges"
    )
    return dag


# Create the pipeline DAG
dag = create_bedrock_batch_data_processing_dag()

print(f"Pipeline DAG created with {len(dag.nodes)} steps:")
for node in dag.nodes:
    print(f"  - {node}")

2025-11-05 22:49:02,993 - INFO - Added node: BedrockPromptTemplateGeneration
2025-11-05 22:49:02,994 - INFO - Created Bedrock Batch data processing DAG with 1 nodes and 0 edges


Pipeline DAG created with 1 steps:
  - BedrockPromptTemplateGeneration


## Step 2: Initialize DAGConfigFactory

Now we initialize the DAGConfigFactory with our DAG. This will automatically:
- Map DAG nodes to configuration classes
- Set up the interactive workflow
- Prepare for step-by-step configuration
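
One way to picture the node-to-class mapping is a registry lookup that ignores a trailing job-type suffix such as `_training`. The sketch below is an assumption about how such a lookup could work, not the factory's actual logic; `STEP_REGISTRY`, `KNOWN_JOB_TYPES`, `resolve_config_class`, and the stub classes are all hypothetical.

```python
from typing import Dict, Type


class BedrockPromptTemplateGenerationConfigStub:
    """Placeholder standing in for the real config class."""


class TabularPreprocessingConfigStub:
    """Placeholder; the real class name is an assumption."""


# Hypothetical registry keyed by step type.
STEP_REGISTRY: Dict[str, Type] = {
    "BedrockPromptTemplateGeneration": BedrockPromptTemplateGenerationConfigStub,
    "TabularPreprocessing": TabularPreprocessingConfigStub,
}

# Hypothetical list of job-type suffixes a node name may carry.
KNOWN_JOB_TYPES = ("training", "validation", "testing", "calibration")


def resolve_config_class(node_name: str) -> Type:
    """Look up a node's config class, ignoring a trailing job-type suffix."""
    if node_name in STEP_REGISTRY:
        return STEP_REGISTRY[node_name]
    base, _, suffix = node_name.rpartition("_")
    if base and suffix in KNOWN_JOB_TYPES and base in STEP_REGISTRY:
        return STEP_REGISTRY[base]
    raise KeyError(f"no config class registered for node {node_name!r}")


sketch_map = {
    name: resolve_config_class(name)
    for name in ["BedrockPromptTemplateGeneration", "TabularPreprocessing_training"]
}
for name, cls in sketch_map.items():
    print(f"{name:<35} -> {cls.__name__}")
```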

In [6]:
from cursus.api.factory.dag_config_factory import DAGConfigFactory

# Initialize the factory with our DAG
factory = DAGConfigFactory(dag)

# Get the config class mapping
config_map = factory.get_config_class_map()

print("DAG Node to Config Class Mapping:")
print("=" * 50)
for node_name, config_class in config_map.items():
    print(f"  {node_name:<35} -> {config_class.__name__}")

print(f"\nSuccessfully mapped {len(config_map)} steps to configuration classes.")

2025-11-05 22:49:03,000 - INFO - 🔧 BuilderAutoDiscovery.__init__ starting - package_root: /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/cursus
2025-11-05 22:49:03,001 - INFO - 🔧 BuilderAutoDiscovery.__init__ - workspace_dirs: []
2025-11-05 22:49:03,002 - INFO - ✅ BuilderAutoDiscovery basic initialization complete
2025-11-05 22:49:03,002 - INFO - ✅ Registry info loaded: 34 steps
2025-11-05 22:49:03,002 - INFO - 🎉 BuilderAutoDiscovery initialization completed successfully
2025-11-05 22:49:03,003 - INFO - 🔍 ScriptAutoDiscovery.__init__ starting - package_root: /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/cursus
2025-11-05 22:49:03,003 - INFO - 🔍 ScriptAutoDiscovery.__init__ - workspace_dirs: []
2025-11-05 22:49:03,003 - INFO - 🔍 ScriptAutoDiscovery.__init__ - priority_workspace_dir: None
2025-11-05 22:49:03,004 - INFO - ✅ Registry info loaded: 34 steps
2025-11-05 22:49:03,004 - INFO - 🎉 ScriptAutoD

DAG Node to Config Class Mapping:
  BedrockPromptTemplateGeneration     -> BedrockPromptTemplateGenerationConfig

Successfully mapped 1 steps to configuration classes.


## Step 3: Configure Base Pipeline Settings

These settings are shared across ALL pipeline steps. Instead of repeating them
in every step configuration, we set them once here.
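
The idea of setting shared values once can be sketched with plain dictionaries: each step declares only what is unique to it and inherits the rest. This is a simplified illustration, not how the factory stores configuration. `merge_settings` is a hypothetical helper, and the rule that a step-level value overrides a shared one is an assumption.

```python
from typing import Any, Dict


def merge_settings(shared: Dict[str, Any], step_specific: Dict[str, Any]) -> Dict[str, Any]:
    """Combine shared settings with one step's settings; the step wins on conflicts."""
    return {**shared, **step_specific}


# Shared values taken from this notebook.
shared_settings = {
    "bucket": "buyer-seller-messaging-reversal",
    "region": "NA",
    "service_name": "BuyerAbuseRnR",
    "pipeline_version": "0.0.1",
}

# Two illustrative steps that only declare what is unique to them.
prompt_step = merge_settings(shared_settings, {"template_style": "structured"})
loading_step = merge_settings(shared_settings, {"output_format": "PARQUET", "region": "EU"})

print(prompt_step["bucket"], prompt_step["template_style"])
print(loading_step["region"])
```

Because the merge builds a new dictionary each time, overriding `region` for one step leaves the shared settings untouched.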

In [7]:
# Get base configuration requirements
base_requirements = factory.get_base_config_requirements()

print("Base Pipeline Configuration Requirements:")
print("=" * 50)
for req in base_requirements:
    marker = "*" if req["required"] else " "
    default_info = (
        f" (default: {req.get('default')})"
        if not req["required"] and "default" in req
        else ""
    )
    print(f"{marker} {req['name']:<25} ({req['type']}){default_info}")
    print(f"    {req['description']}")
    print()

Base Pipeline Configuration Requirements:
* author                    (str)
    Author or owner of the pipeline.

* bucket                    (str)
    S3 bucket name for pipeline artifacts and data.

* role                      (str)
    IAM role for pipeline execution.

* region                    (str)
    Custom region code (NA, EU, FE) for internal logic.

* service_name              (str)
    Service name for the pipeline.

* pipeline_version          (str)
    Version string for the SageMaker Pipeline.

  model_class               (str) (default: xgboost)
    Model class (e.g., XGBoost, PyTorch).

  current_date              (str) (default: PydanticUndefined)
    Current date, typically used for versioning or pathing.

  framework_version         (str) (default: 2.1.0)
    Default framework version (e.g., PyTorch).

  py_version                (str) (default: py310)
    Default Python version.

  source_dir                (Optional) (default: None)
    Common source directory fo

In [8]:
# Set up basic configuration values
region_list = ["NA", "EU", "FE"]
region_selection = 0
region = region_list[region_selection]

# Map region to AWS region
region_mapping = {"NA": "us-east-1", "EU": "eu-west-1", "FE": "us-west-2"}
aws_region = region_mapping[region]

service_name = "BuyerAbuseRnR"
pipeline_version = "0.0.1"
author = "lukexie"
model_class = "pytorch"

# Get current directory and set up paths
current_dir = Path.cwd()
package_root = Path(current_dir).resolve()
source_dir = Path("dockers")
project_root_folder = "rnr_pytorch_bedrock"

# Set base configuration
factory.set_base_config(
    # Infrastructure settings
    bucket=bucket,
    role=PipelineSession().get_caller_identity_arn(),
    region=region,
    aws_region=aws_region,
    # Project identification
    author=author,
    service_name=service_name,
    pipeline_version=pipeline_version,
    model_class=model_class,
    # Framework settings
    framework_version="2.1.0",
    py_version="py310",
    source_dir=str(source_dir),
    project_root_folder=project_root_folder,
    # Date settings
    current_date=date.today().strftime("%Y-%m-%d"),
    # Enable Cache
    enable_caching=False,
)

print("✅ Base pipeline configuration set successfully!")
print(f"   Region: {region} ({aws_region})")
print(f"   Service: {service_name}")
print(f"   Author: {author}")
print(f"   Pipeline Version: {pipeline_version}")

2025-11-05 22:49:03,194 - INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
2025-11-05 22:49:03,545 - INFO - Base configuration set successfully


✅ Base pipeline configuration set successfully!
   Region: NA (us-east-1)
   Service: BuyerAbuseRnR
   Author: lukexie
   Pipeline Version: 0.0.1


## Step 4: Configure Base Processing Settings

These settings are shared across all PROCESSING steps (data loading, preprocessing, etc.)
but not training steps.
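
The large/small instance fields listed in the requirements output below suggest a simple selection rule. The following sketch shows one plausible reading of it, using the field names and defaults from that output; `ProcessingSettingsSketch` and `effective_instance_type` are hypothetical names, and the real config class may resolve the instance type differently.

```python
from dataclasses import dataclass


@dataclass
class ProcessingSettingsSketch:
    """Hypothetical mirror of three processing fields and their listed defaults."""

    processing_instance_type_large: str = "ml.m5.4xlarge"
    processing_instance_type_small: str = "ml.m5.2xlarge"
    use_large_processing_instance: bool = False

    def effective_instance_type(self) -> str:
        # Assumed rule: the boolean flag picks between the two configured types.
        if self.use_large_processing_instance:
            return self.processing_instance_type_large
        return self.processing_instance_type_small


defaults = ProcessingSettingsSketch()
overridden = ProcessingSettingsSketch(
    processing_instance_type_large="ml.m5.12xlarge",
    processing_instance_type_small="ml.m5.4xlarge",
    use_large_processing_instance=True,
)
print(defaults.effective_instance_type())
print(overridden.effective_instance_type())
```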

In [9]:
# Get base processing configuration requirements
processing_requirements = factory.get_base_processing_config_requirements()

if processing_requirements:
    print("Base Processing Configuration Requirements:")
    print("=" * 50)
    for req in processing_requirements:
        marker = "*" if req["required"] else " "
        default_info = (
            f" (default: {req.get('default')})"
            if not req["required"] and "default" in req
            else ""
        )
        print(f"{marker} {req['name']:<30} ({req['type']}){default_info}")
        print(f"    {req['description']}")
        print()
else:
    print("No base processing configuration required for this pipeline.")

Base Processing Configuration Requirements:
  processing_instance_count      (int) (default: 1)
    Instance count for processing jobs

  processing_volume_size         (int) (default: 500)
    Volume size for processing jobs in GB

  processing_instance_type_large (str) (default: ml.m5.4xlarge)
    Large instance type for processing step.

  processing_instance_type_small (str) (default: ml.m5.2xlarge)
    Small instance type for processing step.

  use_large_processing_instance  (bool) (default: False)
    Set to True to use large instance type, False for small instance type.

  processing_source_dir          (Optional) (default: None)
    Source directory for processing scripts. Falls back to base source_dir if not provided.

  processing_entry_point         (Optional) (default: None)
    Entry point script for processing, must be relative to source directory. Can be overridden by derived classes.

  processing_script_arguments    (Optional) (default: None)
    Optional arguments fo

In [10]:
# Set base processing configuration if needed
if processing_requirements:
    processing_source_dir = source_dir / "scripts"

    factory.set_base_processing_config(
        # Processing infrastructure
        processing_source_dir=str(processing_source_dir),
        processing_instance_type_large="ml.m5.12xlarge",
        processing_instance_type_small="ml.m5.4xlarge",
    )

    print("✅ Base processing configuration set successfully!")
    print(f"   Processing source: {processing_source_dir}")
else:
    print("✅ No base processing configuration needed.")

2025-11-05 22:49:03,559 - INFO - Package location discovery succeeded (bundled): /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/dockers
2025-11-05 22:49:03,559 - INFO - Hybrid resolution completed successfully via Package Location Discovery: /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/dockers
2025-11-05 22:49:03,560 - INFO - Base processing configuration set successfully


✅ Base processing configuration set successfully!
   Processing source: dockers/scripts


## Step 5: Check Configuration Status

Let's see which steps still need configuration.
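
Conceptually, the pending list is the set of DAG steps that have not been configured yet. A minimal sketch of that bookkeeping follows; `PendingTrackerSketch` is a hypothetical class, not the factory's internal state.

```python
from typing import List, Set


class PendingTrackerSketch:
    """Hypothetical tracker: which steps still lack a configuration."""

    def __init__(self, step_names: List[str]) -> None:
        self._steps = list(step_names)
        self._configured: Set[str] = set()

    def mark_configured(self, step_name: str) -> None:
        if step_name not in self._steps:
            raise KeyError(f"unknown step {step_name!r}")
        self._configured.add(step_name)

    def pending(self) -> List[str]:
        # Preserve the original step order in the result.
        return [name for name in self._steps if name not in self._configured]


tracker = PendingTrackerSketch(["BedrockPromptTemplateGeneration"])
before = tracker.pending()
tracker.mark_configured("BedrockPromptTemplateGeneration")
after = tracker.pending()
print(before, after)
```

Note that a list captured earlier is a snapshot: in this notebook `pending_steps` still shows the prompt step after it has been configured, which is why a later cell calls `factory.get_pending_steps()` again.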

In [11]:
# Check current status
status = factory.get_configuration_status()
pending_steps = factory.get_pending_steps()

print("Configuration Status:")
print("=" * 30)
print(f"Base config set: {'✅' if status['base_config'] else '❌'}")
print(f"Processing config set: {'✅' if status['base_processing_config'] else '❌'}")
print(f"Total steps: {len(config_map)}")
print(f"Pending steps: {len(pending_steps)}")
print()

if pending_steps:
    print("Steps needing configuration:")
    for step in pending_steps:
        print(f"  - {step}")
else:
    print("✅ All steps configured!")

Configuration Status:
Base config set: ✅
Processing config set: ✅
Total steps: 1
Pending steps: 1

Steps needing configuration:
  - BedrockPromptTemplateGeneration


## Step 6: Configure Individual Steps

Now we configure each step with its specific requirements. The factory will show us
only the fields that are unique to each step (not inherited from base configs).
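
Showing only step-specific fields amounts to subtracting the inherited fields from the step class's full field list. The sketch below illustrates this with standard-library dataclasses; the real configs are Pydantic models, so this is an analogy rather than the factory's code, and `BaseSketch`, `PromptStepSketch`, and `step_specific_fields` are hypothetical.

```python
from dataclasses import dataclass, field, fields
from typing import List, Type


@dataclass
class BaseSketch:
    """Hypothetical base config holding a few shared fields."""

    author: str
    bucket: str
    region: str


@dataclass
class PromptStepSketch(BaseSketch):
    """Hypothetical step config adding its own fields on top of the base."""

    input_placeholders: List[str] = field(default_factory=list)
    prompt_configs_path: str = ""
    template_style: str = "structured"


def step_specific_fields(step_cls: Type, base_cls: Type) -> List[str]:
    """Return field names declared by step_cls that base_cls does not have."""
    inherited = {f.name for f in fields(base_cls)}
    return [f.name for f in fields(step_cls) if f.name not in inherited]


unique_fields = step_specific_fields(PromptStepSketch, BaseSketch)
print(unique_fields)
```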

### Step 6.1: Configure Prompt Template Generation

In [12]:
config_requirements = factory.get_step_requirements("BedrockPromptTemplateGeneration")

if config_requirements:
    print("BedrockPromptTemplateGeneration Config Requirements:")
    print("=" * 50)
    for req in config_requirements:
        marker = "*" if req["required"] else " "
        default_info = (
            f" (default: {req.get('default')})"
            if not req["required"] and "default" in req
            else ""
        )
        print(f"{marker} {req['name']:<30} ({req['type']}){default_info}")
        print(f"    {req['description']}")
        print()
else:
    print("No step-specific configuration required for BedrockPromptTemplateGeneration.")

BedrockPromptTemplateGeneration Config Requirements:
* input_placeholders             (List)
    List of input field names to include in the template (e.g., ['input_data', 'context', 'metadata'])

* prompt_configs_path            (str)
    Path to prompt configuration directory containing system_prompt.json, output_format.json, instruction.json, and category_definitions.json files, relative to processing source directory

  template_task_type             (str) (default: classification)
    Type of task for template generation (classification, sentiment_analysis, content_moderation)

  template_style                 (str) (default: structured)
    Style of template generation (structured, conversational, technical)

  validation_level               (str) (default: standard)
    Level of template validation (basic, standard, comprehensive)

  output_format_type             (str) (default: structured_json)
    Type of output format (structured_json, formatted_text, hybrid)

  required_out

In [13]:
from cursus.steps.configs.config_bedrock_prompt_template_generation_step import (
    BedrockPromptTemplateGenerationConfig,
    SystemPromptConfig,
    OutputFormatConfig,
    InstructionConfig,
)

In [14]:
category_definitions_path = package_root / "dockers" / "prompt_configs"

In [15]:
category_definitions_path

PosixPath('/home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/dockers/prompt_configs')

In [16]:
system_prompt_settings = SystemPromptConfig(
    role_definition="expert in analyzing buyer-seller messaging conversations and shipping logistics",
    expertise_areas=[
        "buyer-seller messaging analysis",
        "shipping logistics",
        "delivery timing analysis",
        "e-commerce dispute resolution",
        "classification and categorization",
    ],
    responsibilities=[
        "classify interactions based on message content",
        "analyze shipping events and delivery timing",
        "categorize into predefined dispute categories",
        "provide evidence-based reasoning for classifications",
    ],
    behavioral_guidelines=[
        "be precise in classification decisions",
        "be objective in evidence evaluation",
        "be thorough in timeline analysis",
        "follow exact formatting requirements",
        "consider all available evidence sources",
    ],
    tone="professional",  # Options: "professional", "casual", "technical", "formal"
)

In [17]:
output_format_settings = OutputFormatConfig(
    format_type="structured_text",
    header_text="**CRITICAL: Follow this exact format for automated parsing**",
    structured_text_sections=[
        {
            "number": 1,
            "header": "Category",
            "format": "single_value",
            "placeholder": "${category_enum}",  # Auto-resolved from category definitions
            "placeholder_source": "schema_enum",
        },
        {
            "number": 2,
            "header": "Confidence Score",
            "format": "single_value",
            "placeholder": "${numeric_range}",  # Auto-resolved from schema constraints
            "placeholder_source": "schema_range",
        },
        {
            "number": 3,
            "header": "Key Evidence",
            "format": "subsections",
            "item_prefix": "- ",
            "indent": "   ",
            "subsections": [
                {
                    "name": "Message Evidence",
                    "example_items": [
                        "- [BUYER]: Example message",
                        "- [SELLER]: Example response",
                    ],
                },
                {
                    "name": "Shipping Evidence",
                    "example_items": ["- [Event]: Delivered"],
                },
                {
                    "name": "Timeline Evidence",
                    "example_items": ["- Delivery on 2025-02-21"],
                },
            ],
        },
        {
            "number": 4,
            "header": "Reasoning",
            "format": "subsections",
            "item_prefix": "- ",
            "indent": "   ",
            "subsections": [
                {"name": "Primary Factors", "example_items": ["- Main reason"]},
                {
                    "name": "Supporting Evidence",
                    "example_items": ["- Supporting detail"],
                },
                {"name": "Contradicting Evidence", "example_items": ["- None"]},
            ],
        },
    ],
    field_descriptions={
        "Category": "Exactly one category from the predefined list (case-sensitive match required)",
        "Confidence Score": "Decimal number between 0.00 and 1.00 indicating classification certainty",
        "Key Evidence": "Three subsections: Message Evidence, Shipping Evidence, Timeline Evidence - each with [sep] token separators",
        "Reasoning": "Three subsections: Primary Factors, Supporting Evidence, Contradicting Evidence - each with [sep] token separators",
    },
    formatting_rules=[
        "Use exact section headers with numbers and colons",
        "No semicolons (;) anywhere in response",
    ],
    validation_requirements=[
        "Category must match exactly from predefined list",
        "Confidence score must be decimal format (e.g., 0.85, not 85%)",
        "Each evidence item must start with '[sep] ' token",
    ],
    evidence_validation_rules=[
        "Message Evidence must include direct quotes with speaker identification",
        "Shipping Evidence must include tracking events with timestamps",
        "Timeline Evidence must show chronological sequence of events",
        "All evidence must reference specific content from input data",
    ],
    example_output=(
        "1. Category: TrueDNR\n\n"
        "2. Confidence Score: 0.92\n\n"
        "3. Key Evidence:\n"
        "   * Message Evidence:\n"
        "     - [BUYER]: Hello, I have not received my package\n"
        "     - [BUYER]: But I did not find any package, please refund me\n"
        "   * Shipping Evidence:\n"
        "     - [Event Time]: 2025-02-21T17:40:49.323Z [Event]: Delivered\n"
        "     - No further shipping events after delivery confirmation\n"
        "   * Timeline Evidence:\n"
        "     - Delivery confirmation on 2025-02-21 17:40\n"
        "     - Buyer reports non-receipt starting 2025-02-25 07:14\n\n"
        "4. Reasoning:\n"
        "   * Primary Factors:\n"
        "     - Tracking shows package was delivered successfully\n"
        "     - Buyer explicitly states they did not receive the package\n"
        "   * Supporting Evidence:\n"
        "     - Buyer requests refund due to missing package\n"
        "     - No evidence of buyer receiving wrong/defective item\n"
        "   * Contradicting Evidence:\n"
        "     - None"
    ),
)
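
Since the format above is meant for automated parsing, it may help to see how a response in that layout could be read back. This is a minimal sketch that only handles the top-level numbered headers; `SECTION_RE`, `parse_top_level_sections`, and `extract_category_and_confidence` are hypothetical helpers, not part of cursus, and the downstream parser used by the pipeline is not shown in this notebook.

```python
import re
from typing import Dict, Tuple

# Matches top-level lines such as "1. Category: TrueDNR" or "3. Key Evidence:".
SECTION_RE = re.compile(r"^(\d+)\.[ \t]+([^:\n]+):[ \t]*(.*)$", re.MULTILINE)


def parse_top_level_sections(text: str) -> Dict[str, str]:
    """Map each numbered header to the value on its header line (may be empty)."""
    return {m.group(2).strip(): m.group(3).strip() for m in SECTION_RE.finditer(text)}


def extract_category_and_confidence(text: str) -> Tuple[str, float]:
    """Return (category, confidence) or raise ValueError if either is invalid."""
    sections = parse_top_level_sections(text)
    category = sections.get("Category", "")
    if not category:
        raise ValueError("missing Category section")
    # float() itself rejects values such as "85%".
    confidence = float(sections.get("Confidence Score", "nan"))
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be a decimal between 0.00 and 1.00")
    return category, confidence


sample_response = (
    "1. Category: TrueDNR\n\n"
    "2. Confidence Score: 0.92\n\n"
    "3. Key Evidence:\n"
    "   * Message Evidence:\n"
    "     - [BUYER]: Hello, I have not received my package\n\n"
    "4. Reasoning:\n"
    "   * Primary Factors:\n"
    "     - Tracking shows package was delivered successfully"
)
parsed = extract_category_and_confidence(sample_response)
print(parsed)
```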

In [18]:
instruction_settings = InstructionConfig(
    include_analysis_steps=True,
    include_decision_criteria=True,
    include_reasoning_requirements=True,
    step_by_step_format=True,
    include_evidence_validation=True,
    classification_guidelines={
        "sections": [
            {
                "title": "## Classification Guidelines",
                "subsections": [
                    {
                        "title": "### 1. Output Format Requirements",
                        "content": [
                            "**Category Selection:**",
                            "- Choose exactly ONE category from the provided list",
                            "- Category name must match exactly (case-sensitive)",
                            "",
                            "**Confidence Score:**",
                            "- Provide as decimal number between 0.00 and 1.00 (e.g., 0.95)",
                            "- Base confidence for complete data: 0.7-1.0",
                            "- Missing one field: reduce by 0.1-0.2",
                            "- Missing two fields: reduce by 0.2-0.3",
                            "- Minimum confidence threshold: 0.5",
                        ],
                    },
                    {
                        "title": "### 2. Shiptrack Parsing Rules",
                        "content": [
                            "**Multiple Shipment Structure:**",
                            "- Multiple shipment sequences separated by shipment IDs",
                            '- Each sequence starts with "[bom] [Shipment ID]:* [eom]"',
                            "",
                            "**Analysis Approach:**",
                            "- Process each shipment sequence separately",
                            "- Compare delivery events (EVENT_301) across all sequences",
                            "",
                            "**Key Event Codes:**",
                            "- EVENT_301: Delivery confirmation",
                            "- EVENT_302: Out for delivery",
                            "- EVENT_201: Arrival at facility",
                        ],
                    },
                    {
                        "title": "### 3. Missing Data Handling",
                        "content": [
                            "**When Dialogue is Empty but Shiptrack Exists:**",
                            "- Focus on shipping events and timeline",
                            "- Reduce confidence score by 0.1-0.2",
                            "",
                            "**When Shiptrack is Empty but Dialogue Exists:**",
                            "- Focus on message content and reported issues",
                            "- Reduce confidence score by 0.1-0.2",
                        ],
                    },
                    {
                        "title": "### 4. Category Priority Hierarchy",
                        "content": [
                            "**Tier 1: Abuse Pattern Categories (Highest Priority)**",
                            "- PDA_Undeliverable: Verify no delivery + refund given",
                            "- PDA_Early_Refund: Verify refund before delivery",
                            "",
                            "**Tier 2: Delivery Status Categories**",
                            "- TrueDNR: Delivered but disputed",
                            "- Confirmed_Delay: External factors confirmed",
                        ],
                    },
                    {
                        "title": "### 5. Evidence Requirements",
                        "content": [
                            "**Message Evidence Must Include:**",
                            "- Direct quotes from dialogue with speaker identification",
                            "",
                            "**Shipping Evidence Must Include:**",
                            "- All tracking events listed chronologically",
                            "",
                            "**Timeline Evidence Must Show:**",
                            "- Clear chronological sequence of events",
                        ],
                    },
                ],
            }
        ]
    },
)

In [19]:
category_definitions_path

PosixPath('/home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/dockers/prompt_configs')

In [20]:
step_name = "BedrockPromptTemplateGeneration"
factory.set_step_config(
    step_name,
    # input
    input_placeholders=[
        "dialogue",
        "shiptrack_event_history_by_order",
        "shiptrack_max_estimated_arrival_date_by_order",
    ],
    prompt_configs_path=str(category_definitions_path),
    # basic setting
    template_task_type="buyer_seller_classification",
    template_style="structured",
    validation_level="comprehensive",
    template_version="2.0",
    # Output configuration
    output_format_type="structured_text",
    required_output_fields=[
        "Category",
        "Confidence Score",
        "Key Evidence",
        "Reasoning",
    ],
    # Template features
    include_examples=True,
    generate_validation_schema=True,
    # Sub-configurations (Pydantic models)
    system_prompt_settings=system_prompt_settings,
    output_format_settings=output_format_settings,
    instruction_settings=instruction_settings,
    processing_entry_point="bedrock_prompt_template_generation.py",
)
print(f"✅ {step_name} configured")

2025-11-05 22:49:03,624 - INFO - Generated system prompt config: /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/dockers/prompt_configs/system_prompt.json
2025-11-05 22:49:03,625 - INFO - Generated output format config: /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/dockers/prompt_configs/output_format.json
2025-11-05 22:49:03,625 - INFO - Generated instruction config: /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/dockers/prompt_configs/instruction.json
2025-11-05 22:49:03,626 - INFO - Skipping category_definitions.json generation (no category definitions available)
2025-11-05 22:49:03,626 - INFO - Generated prompt configuration bundle in: /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/dockers/prompt_configs
2025-11-05 22:49:03,626 - INFO - Bundle contains 3 JSON configuration files: s

✅ BedrockPromptTemplateGeneration configured


### Step 6.2: Configure Dummy Data Loading Steps

In [21]:
# Configure dummy data loading
if "DummyDataLoading_training" in pending_steps:
    step_name = "DummyDataLoading_training"

    data_source = (
        "s3://buyer-seller-messaging-reversal/production-pipeline/raw-input/2025-11-03"
    )

    factory.set_step_config(
        step_name,
        job_type="training",
        data_source=data_source,
        processing_entry_point="dummy_data_loading.py",
        use_large_processing_instance=True,
        output_format="PARQUET",
    )
    print(f"✅ {step_name} configured")

### Step 6.3: Configure Preprocessing Steps

In [22]:
pending_steps

['BedrockPromptTemplateGeneration']

In [23]:
# Configure training preprocessing
if "TabularPreprocessing_training" in pending_steps:
    step_name = "TabularPreprocessing_training"

    factory.set_step_config(
        step_name,
        job_type="training",
        label_name="reversal_flag",
        processing_entry_point="tabular_preprocessing.py",
        use_large_processing_instance=True,
    )
    print(f"✅ {step_name} configured")

### Step 6.4: Configure Remaining Steps

**USER INPUT BLOCK**: Fill in the essential fields for each remaining step.
The factory has identified the required fields for each step.

In [24]:
# Get current pending steps
current_pending = factory.get_pending_steps()

print("Remaining steps to configure:")
print("=" * 40)

for step_name in current_pending:
    requirements = factory.get_step_requirements(step_name)
    essential_reqs = [req for req in requirements if req["required"]]

    print(f"\n{step_name}:")
    print(f"  Essential fields ({len(essential_reqs)}):")
    for req in essential_reqs:
        print(f"    * {req['name']} ({req['type']}) - {req['description']}")

    if len(requirements) > len(essential_reqs):
        optional_count = len(requirements) - len(essential_reqs)
        print(f"  Optional fields: {optional_count}")

Remaining steps to configure:


In [25]:
bedrock_batch_role_arn = PipelineSession().get_caller_identity_arn()

2025-11-05 22:49:03,673 - INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


In [26]:
bedrock_primary_model_id = "anthropic.claude-sonnet-4-5-20250929-v1:0"

In [27]:
bedrock_inference_profile_arn = "arn:aws:bedrock:us-east-1:178936618742:inference-profile/us.anthropic.claude-sonnet-4-20250514-v1:0"

In [28]:
# Configure Bedrock batch processing
if "BedrockBatchProcessing_training" in current_pending:
    factory.set_step_config(
        "BedrockBatchProcessing_training",
        job_type="training",
        processing_entry_point="bedrock_batch_processing.py",
        bedrock_batch_role_arn=bedrock_batch_role_arn,
        bedrock_primary_model_id=bedrock_primary_model_id,
        bedrock_inference_profile_arn=bedrock_inference_profile_arn,
    )
    print("✅ BedrockBatchProcessing_training configured")

## Step 7: Generate Final Configurations

Now that all steps are configured, we can generate the final configuration instances.
The factory will validate that all essential fields are provided and create the config objects.
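
The validation described here can be pictured as a check that every required field has a value before any config object is built. The sketch below uses the requirement dictionaries' `name` and `required` keys as they appear in the cells above; `find_missing_required` is a hypothetical helper, and the factory's actual validation is done by its own code and the config classes.

```python
from typing import Any, Dict, List


def find_missing_required(
    requirements: List[Dict[str, Any]], provided: Dict[str, Any]
) -> List[str]:
    """Return names of required fields that have no value in `provided`."""
    return [
        req["name"]
        for req in requirements
        if req["required"] and provided.get(req["name"]) is None
    ]


# Subset of the BedrockPromptTemplateGeneration requirements shown earlier.
prompt_requirements = [
    {"name": "input_placeholders", "required": True},
    {"name": "prompt_configs_path", "required": True},
    {"name": "template_style", "required": False},
]

incomplete = find_missing_required(prompt_requirements, {"input_placeholders": ["dialogue"]})
complete = find_missing_required(
    prompt_requirements,
    {"input_placeholders": ["dialogue"], "prompt_configs_path": "prompt_configs"},
)
print(incomplete, complete)
```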

In [29]:
# Check final status
final_status = factory.get_configuration_status()
final_pending = factory.get_pending_steps()

print("Final Configuration Status:")
print("=" * 40)
print(f"Base config: {'✅' if final_status['base_config'] else '❌'}")
print(f"Processing config: {'✅' if final_status['base_processing_config'] else '❌'}")
print(f"Pending steps: {len(final_pending)}")

if final_pending:
    print("\nStill pending:")
    for step in final_pending:
        print(f"  - {step}")
    print("\n⚠️  Please configure remaining steps before generating configs.")
else:
    print("\n✅ All steps configured! Ready to generate configurations.")

Final Configuration Status:
Base config: ✅
Processing config: ✅
Pending steps: 0

✅ All steps configured! Ready to generate configurations.


In [30]:
# Generate final configurations
if not final_pending:
    try:
        print("Generating final configurations...")
        configs = factory.generate_all_configs()

        print(f"\n✅ Successfully generated {len(configs)} configuration instances:")
        for i, config in enumerate(configs, 1):
            print(f"  {i:2d}. {config.__class__.__name__}")

        print("\n🎉 Configuration generation complete!")

    except Exception as e:
        print(f"\n❌ Configuration generation failed: {e}")
        print("\nPlease check that all required fields are provided.")
        configs = None
else:
    print("\n⚠️  Cannot generate configs - some steps are still pending configuration.")
    configs = None

2025-11-05 22:49:03,901 - INFO - ✅ Returning 1 pre-validated configuration instances


Generating final configurations...

✅ Successfully generated 1 configuration instances:
   1. BedrockPromptTemplateGenerationConfig

🎉 Configuration generation complete!


In [31]:
len(configs)

1

## Step 8: Save to JSON

Finally, we save the generated configurations to a unified JSON file using the existing
`merge_and_save_configs` utility. This creates the same format as the legacy approach
but with much less effort!
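
Once the file is written, it can be useful to look at its size and top-level layout without depending on any cursus internals. The sketch below uses only the standard library; `summarize_config_file` is a hypothetical helper, and it makes no assumption about which keys `merge_and_save_configs` actually writes.

```python
import json
from pathlib import Path
from typing import Any, Dict, Union


def summarize_config_file(path: Union[str, Path]) -> Dict[str, Any]:
    """Load a JSON file and report its size and top-level keys."""
    path = Path(path)
    with path.open("r", encoding="utf-8") as f:
        data = json.load(f)
    # Only dict documents have named top-level keys; report an empty list otherwise.
    keys = sorted(data) if isinstance(data, dict) else []
    return {
        "size_kb": round(path.stat().st_size / 1024, 1),
        "top_level_keys": keys,
    }
```

After the save cell below has run, `summarize_config_file(config_path)` gives a quick view of what was written.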

In [32]:
if configs:
    # Set up output directory and filename
    MODEL_CLASS = "pytorch"
    service_name = "BuyerAbuseRnR"

    config_dir = Path(current_dir) / "pipeline_config"
    config_dir.mkdir(parents=True, exist_ok=True)

    config_file_name = "config.json"
    config_path = config_dir / config_file_name

    print(f"Saving configurations to: {config_path}")

    # Use the existing merge_and_save_configs utility
    from cursus.steps.configs.utils import merge_and_save_configs

    try:
        merged_config = merge_and_save_configs(configs, str(config_path))

        print("\n✅ Configuration saved successfully!")
        print(f"   File: {config_path}")
        print(f"   Size: {config_path.stat().st_size / 1024:.1f} KB")

        # Also save hyperparameters separately (for compatibility)
        # hyperparam_path = source_dir / 'hyperparams' / f'hyperparameters.json'
        # with open(hyperparam_path, 'w') as f:
        #    json.dump(xgb_hyperparams.model_dump(), f, indent=2, sort_keys=True)

        # print(f"   Hyperparameters: {hyperparam_path}")

        # print(f"\n🎉 Interactive configuration complete!")
        # print(f"\n📊 Comparison with legacy approach:")
        # print(f"   Legacy: 500+ lines of manual configuration")
        # print(f"   Interactive: Guided step-by-step process")
        # print(f"   Time saved: ~20-25 minutes")
        # print(f"   Error reduction: Validation at each step")

    except Exception as e:
        print(f"\n❌ Failed to save configurations: {e}")

else:
    print("\n⚠️  No configurations to save. Please generate configs first.")

2025-11-05 22:49:03,987 - INFO - Discovered 47 core config classes
2025-11-05 22:49:03,994 - INFO - Discovered 4 core hyperparameter classes
2025-11-05 22:49:04,021 - INFO - Discovered 7 base hyperparameter classes from core/base
2025-11-05 22:49:04,022 - INFO - Built complete config classes: 58 total (47 config + 11 hyperparameter auto-discovered)
2025-11-05 22:49:04,022 - INFO - Discovered 58 config classes via step catalog
2025-11-05 22:49:04,025 - INFO - Merging and saving 1 configs to /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/pipeline_config/config.json
2025-11-05 22:49:04,026 - INFO - Collecting field information for 1 configs (1 processing configs)


Saving configurations to: /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/pipeline_config/config.json


2025-11-05 22:49:04,098 - INFO - Discovered 47 core config classes
2025-11-05 22:49:04,105 - INFO - Discovered 4 core hyperparameter classes
2025-11-05 22:49:04,327 - INFO - Discovered 7 base hyperparameter classes from core/base
2025-11-05 22:49:04,327 - INFO - Built complete config classes: 58 total (47 config + 11 hyperparameter auto-discovered)
2025-11-05 22:49:04,328 - INFO - Discovered 58 config classes via step catalog
2025-11-05 22:49:04,331 - INFO - Collected information for 49 unique fields
2025-11-05 22:49:04,403 - INFO - Discovered 47 core config classes
2025-11-05 22:49:04,410 - INFO - Discovered 4 core hyperparameter classes
2025-11-05 22:49:04,438 - INFO - Discovered 7 base hyperparameter classes from core/base
2025-11-05 22:49:04,439 - INFO - Built complete config classes: 58 total (47 config + 11 hyperparameter auto-discovered)
2025-11-05 22:49:04,439 - INFO - Discovered 58 config classes via step catalog
2025-11-05 22:49:04,441 - INFO - Populated specific fields for 1


✅ Configuration saved successfully!
   File: /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/pipeline_config/config.json
   Size: 26.8 KB


### Test if we can load it

In [33]:
from cursus.steps.configs.config_dummy_data_loading_step import DummyDataLoadingConfig
from cursus.steps.configs.config_tabular_preprocessing_step import (
    TabularPreprocessingConfig,
)
from cursus.steps.configs.config_bedrock_prompt_template_generation_step import (
    BedrockPromptTemplateGenerationConfig,
)
from cursus.steps.configs.config_bedrock_batch_processing_step import (
    BedrockBatchProcessingConfig,
)

In [34]:
from cursus.steps.configs.utils import load_configs

In [35]:
CONFIG_CLASSES = {
    "DummyDataLoadingConfig": DummyDataLoadingConfig,
    "BedrockPromptTemplateGenerationConfig": BedrockPromptTemplateGenerationConfig,
    "BedrockBatchProcessingConfig": BedrockBatchProcessingConfig,
    "TabularPreprocessingConfig": TabularPreprocessingConfig,
}
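
The mapping above repeats each class name as a string key, so a typo in a key would go unnoticed until load time. One way to avoid that is to derive the keys from the classes themselves. This is a minimal sketch; `build_class_registry` is a hypothetical helper, not a cursus utility.

```python
from typing import Dict, Iterable


def build_class_registry(classes: Iterable[type]) -> Dict[str, type]:
    """Map each class's __name__ to the class, rejecting duplicate names."""
    registry: Dict[str, type] = {}
    for cls in classes:
        if cls.__name__ in registry:
            raise ValueError(f"Duplicate class name: {cls.__name__}")
        registry[cls.__name__] = cls
    return registry
```

With the imports from cell 33, `build_class_registry([DummyDataLoadingConfig, TabularPreprocessingConfig, BedrockPromptTemplateGenerationConfig, BedrockBatchProcessingConfig])` would produce the same dictionary as the literal above.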

In [36]:
loaded_configs = load_configs(str(config_path), CONFIG_CLASSES)

2025-11-05 22:49:04,576 - INFO - Loading configs from /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/pipeline_config/config.json
2025-11-05 22:49:04,577 - INFO - Loading configuration from /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/pipeline_config/config.json
2025-11-05 22:49:04,579 - INFO - Successfully loaded configuration from /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/pipeline_config/config.json
2025-11-05 22:49:04,579 - INFO - Successfully loaded configs from /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/pipeline_config/config.json with 1 specific configs
2025-11-05 22:49:04,580 - INFO - Creating additional config instance for BedrockPromptTemplateGeneration (BedrockPromptTemplateGenerationConfig)
2025-11-05 22:49:04,653 - INFO - Discovered 47 core config classes
2025-11-

In [37]:
len(loaded_configs)

1
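
Comparing lengths confirms only that the same number of configs came back. A slightly stronger round-trip check compares the class names of the generated and loaded configs. The sketch below is an assumption-laden illustration: `config_class_names` and `same_config_classes` are hypothetical helpers, and because the container type returned by `load_configs` is not shown in this notebook, the helper accepts either a dict of configs or a plain sequence.

```python
from typing import Any, List


def config_class_names(configs: Any) -> List[str]:
    """Return sorted class names for a dict of configs or a sequence of configs."""
    items = configs.values() if isinstance(configs, dict) else configs
    return sorted(type(cfg).__name__ for cfg in items)


def same_config_classes(generated: Any, loaded: Any) -> bool:
    """True if both collections contain the same config classes (by name)."""
    return config_class_names(generated) == config_class_names(loaded)
```

In this notebook the comparison would be `same_config_classes(configs, loaded_configs)`. It checks class names only, not field values.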

## Summary

This notebook demonstrates the **DAGConfigFactory** approach to pipeline configuration:

### ✅ **Benefits Achieved**

1. **Reduced Complexity**: From 500+ lines of manual config to guided workflow
2. **Base Config Inheritance**: Set common fields once, inherit everywhere
3. **Step-by-Step Guidance**: Clear requirements for each configuration step
4. **Validation**: Essential fields are checked before config objects are created, so missing values surface early
5. **Reusable DAG**: Pipeline structure defined once, reused across environments

### 🔄 **Workflow Comparison**

| Aspect | Legacy Approach | Interactive Approach |
|--------|----------------|---------------------|
| **Lines of Code** | 500+ manual lines | Guided step-by-step |
| **Time Required** | 30+ minutes | 10-15 minutes |
| **Error Rate** | High (manual entry) | Low (validation) |
| **Reusability** | Copy-paste heavy | DAG-driven |
| **Maintenance** | Manual updates | Automatic inheritance |

### 🚀 **Next Steps**

The generated configuration file can now be used with the existing pipeline compiler:

```python
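# Use with pipeline compiler (from demo_pipeline.ipynb)
# Note: `pipeline_session`, `role`, and `dag` are not defined in this section;
# they are assumed to be available from earlier cells or from demo_pipeline.ipynb.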
from cursus.core.compiler.dag_compiler import PipelineDAGCompiler

dag_compiler = PipelineDAGCompiler(
    config_path=config_path,
    sagemaker_session=pipeline_session,
    role=role
)

# Compile DAG to pipeline
template_pipeline, report = dag_compiler.compile_with_report(dag=dag)
```

The interactive configuration approach transforms the user experience from complex manual setup to an intuitive, guided workflow while maintaining full compatibility with the existing cursus infrastructure.