# Cursus: Automatic SageMaker (MODS) Pipeline Compiler

The main contribution of this work is **Cursus**, a **compiler** that automatically generates a **[MODS (Model Training Workflow Operation and Development System) Pipeline](https://w.amazon.com/bin/view/CMLS/Overview/MODS/)** based on two sets of user inputs:
* The **Pipeline DAG (Directed Acyclic Graph)**, which describes the pipeline as a graph
* The **Unified Config JSON**, which provides a central hub to extract all user inputs and their associated step information
    * Run [demo_config](./demo_config.ipynb) first to generate the Unified Config JSON
    * The config JSON will be saved in the `./pipeline_config/xxx/` folder

![mods_pipeline_train_eval_calib](./demo/mods_pipeline_train_eval_calib.png)


In [1]:
import os
import json
import pandas as pd
import pickle
import sys
import subprocess
from datetime import datetime

from pathlib import Path

In [2]:
from pydantic import BaseModel, Field, model_validator, field_validator
from typing import List, Optional, Dict, Any, Type, Union, Tuple


In [3]:
from collections import (
    defaultdict,
    deque
)

In [4]:
import logging

In [5]:
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)


## Environment Setup

In [6]:
from sagemaker import Session

2025-11-30 06:22:53,846 - INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [7]:
from sagemaker.workflow.pipeline_context import PipelineSession

In [8]:
bucket_name='buyer-seller-messaging-reversal'

In [9]:
pipeline_session = PipelineSession(default_bucket=bucket_name) # IMPORTANT now the session uses the generated sagemaker_config

2025-11-30 06:22:54,265 - INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


In [10]:
role=PipelineSession().get_caller_identity_arn()
role

2025-11-30 06:22:54,569 - INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


'arn:aws:iam::178936618742:role/AmazonSageMaker-ExecutionRole-Default'

In [11]:
from pathlib import Path
import sys

# Get parent directory of current notebook
project_root = str(Path().absolute().parent)
if project_root not in sys.path:
    sys.path.insert(0, project_root)  
    print(f"add project root {project_root} into system")

add project root /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines into system


## Basic Information

In [12]:
region_list = [
    'NA',
    'EU',
    'FE'
]

In [13]:
region_selection = 0

In [14]:
region = region_list[region_selection]
region

'NA'

In [15]:
MODEL_CLASS='pytorch'

In [16]:
service_name="BuyerAbuseRnR"

#### Config and Hyperparameter Information

In [17]:
current_dir = Path.cwd()
config_dir = Path(current_dir) / 'pipeline_config'
print(config_dir)

/home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/pipeline_config


In [18]:
#hyparam_filename = f'hyperparameters_{region}_{MODEL_CLASS}.json' #'hyperparameters.json'

In [19]:
pipeline_config_name = 'config.json'  # alternatively: f'config_{region}.json'
pipeline_config_name

'config.json'

In [20]:
config_path = config_dir / pipeline_config_name

In [21]:
config_path

PosixPath('/home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/pipeline_config/config.json')
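Before loading, it can help to confirm that the unified config file exists and parses as JSON. The following is a minimal sketch using only the standard library. `check_config_file` is a hypothetical helper, not part of Cursus, and the file written to a temporary directory merely stands in for the real `config.json`.

```python
import json
import tempfile
from pathlib import Path


def check_config_file(path: Path) -> dict:
    """Return the parsed JSON content, failing early with a clear message."""
    if not path.is_file():
        raise FileNotFoundError(
            f"Config file not found: {path}. Run demo_config.ipynb first."
        )
    with path.open("r") as fh:
        return json.load(fh)


# Stand-in for the real config file (hypothetical content)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "pipeline_config" / "config.json"
    demo_path.parent.mkdir(parents=True)
    demo_path.write_text(json.dumps({"example_key": "example_value"}))
    demo_content = check_config_file(demo_path)

    # A missing file is reported before any loading is attempted
    try:
        check_config_file(Path(tmp) / "pipeline_config" / "missing.json")
        missing_detected = False
    except FileNotFoundError:
        missing_detected = True
```

In the notebook itself, the same check could be applied to `config_path` before calling `load_configs`.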

## Pipeline Imports

In [22]:
from enum import Enum
from pydantic import BaseModel

## [Optional]: Test Config Load Functionality

Skip this section if you are not concerned with inspecting the loaded config information.

### Hyperparameters

In [23]:
#from cursus.steps.hyperparams.hyperparameters_xgboost import XGBoostModelHyperparameters

In [24]:
#hyparam_path = config_dir / hyparam_filename
#with open(hyparam_path, 'r') as file:
#    hyperparam_dict = json.load(file)

In [25]:
#hyperparams = XGBoostModelHyperparameters(**hyperparam_dict)

In [26]:
#hyperparams.num_classes

In [27]:
#hyperparams.is_binary

### Import Configs

In [28]:
from cursus.core.base.config_base import BasePipelineConfig



In [29]:
#from cursus.steps.configs.config_cradle_data_loading_step import (CradleDataLoadingConfig,
#                                                    MdsDataSourceConfig,
#                                                    EdxDataSourceConfig,
#                                                    DataSourceConfig,
#                                                    DataSourcesSpecificationConfig,
#                                                    JobSplitOptionsConfig,
#                                                    TransformSpecificationConfig,
#                                                    OutputSpecificationConfig,
#                                                    CradleJobSpecificationConfig
#                                                   )

In [30]:
from cursus.steps.configs.config_dummy_data_loading_step import DummyDataLoadingConfig
from cursus.steps.configs.config_tabular_preprocessing_step import TabularPreprocessingConfig
from cursus.steps.configs.config_bedrock_prompt_template_generation_step import BedrockPromptTemplateGenerationConfig
from cursus.steps.configs.config_bedrock_batch_processing_step import BedrockBatchProcessingConfig
from cursus.steps.configs.config_label_ruleset_generation_step import LabelRulesetGenerationConfig
from cursus.steps.configs.config_label_ruleset_execution_step import LabelRulesetExecutionConfig
from cursus.steps.configs.config_pytorch_training_step import PyTorchTrainingConfig
from cursus.steps.configs.config_pytorch_model_eval_step import PyTorchModelEvalConfig
from cursus.steps.configs.config_dummy_training_step import DummyTrainingConfig
from cursus.steps.configs.config_model_calibration_step import ModelCalibrationConfig
from cursus.steps.configs.config_package_step import PackageConfig
from cursus.steps.configs.config_payload_step import PayloadConfig

### Load Config

In [31]:
from cursus.steps.configs.utils import serialize_config, merge_and_save_configs, load_configs, verify_configs

In [32]:
CONFIG_CLASSES = {
        'DummyDataLoadingConfig':                     DummyDataLoadingConfig,
        'BedrockPromptTemplateGenerationConfig':      BedrockPromptTemplateGenerationConfig,
        'BedrockBatchProcessingConfig':               BedrockBatchProcessingConfig,
        'TabularPreprocessingConfig':                 TabularPreprocessingConfig,
        'LabelRulesetGenerationConfig':               LabelRulesetGenerationConfig,
        'LabelRulesetExecutionConfig':                LabelRulesetExecutionConfig,
        'PyTorchTrainingConfig':                      PyTorchTrainingConfig,
        'PyTorchModelEvalConfig':                     PyTorchModelEvalConfig,
        'DummyTrainingConfig':                        DummyTrainingConfig,
        'ModelCalibrationConfig':                     ModelCalibrationConfig,
        'PackageConfig':                              PackageConfig,
        'PayloadConfig':                              PayloadConfig,
    }

In [33]:
config_path

PosixPath('/home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/pipeline_config/config.json')

In [34]:
# Load configs
loaded_configs = load_configs(config_path, CONFIG_CLASSES)

2025-11-30 06:22:55,602 - INFO - Loading configs from /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/pipeline_config/config.json
2025-11-30 06:22:55,602 - INFO - Loading configuration from /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/pipeline_config/config.json
2025-11-30 06:22:55,603 - INFO - Successfully loaded configuration from /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/pipeline_config/config.json
2025-11-30 06:22:55,604 - INFO - Successfully loaded configs from /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/pipeline_config/config.json with 7 specific configs
2025-11-30 06:22:55,604 - INFO - Creating additional config instance for DummyDataLoading_calibration (DummyDataLoadingConfig)
2025-11-30 06:22:55,605 - INFO - 🔧 BuilderAutoDiscovery.__init__ starting - package_root: 

In [35]:
loaded_configs

{'DummyDataLoading_calibration': DummyDataLoadingConfig(author='lukexie', bucket='buyer-seller-messaging-reversal', role='arn:aws:iam::178936618742:role/AmazonSageMaker-ExecutionRole-Default', region='NA', service_name='BuyerAbuseRnR', pipeline_version='0.0.2', model_class='pytorch', current_date='2025-11-30', framework_version='2.1.0', py_version='py310', source_dir='docker', enable_caching=False, use_secure_pypi=False, max_runtime_seconds=172800, project_root_folder='rnr_pytorch_bedrock', processing_instance_count=1, processing_volume_size=500, processing_instance_type_large='ml.m5.12xlarge', processing_instance_type_small='ml.m5.4xlarge', use_large_processing_instance=True, processing_source_dir='docker/scripts', processing_entry_point='dummy_data_loading.py', processing_script_arguments=None, processing_framework_version='1.2-1', data_source='s3://buyer-seller-messaging-reversal/pipeline/lukexie-BuyerAbuseRnR-pytorch-NA/20251111035543/labelrulesetexecution/processed_data/test', job

In [36]:
len(loaded_configs)

7

In [37]:
[str(k) for k in loaded_configs.keys()]

['DummyDataLoading_calibration',
 'DummyTraining',
 'ModelCalibration_calibration',
 'Package',
 'Payload',
 'PyTorchModelEval_calibration',
 'TabularPreprocessing_calibration']
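The keys above appear to follow a `<StepType>_<job_type>` pattern, with the suffix omitted for steps that have no job-type variant. As a sketch under that assumption (this is an observation from the listed keys, not a documented Cursus rule), a hypothetical helper `split_step_name` can separate the two parts for inspection:

```python
from typing import Dict, Optional, Tuple


def split_step_name(name: str) -> Tuple[str, Optional[str]]:
    """Split 'StepType_jobtype' into its parts; job type is None if absent."""
    step_type, sep, job_type = name.partition("_")
    return step_type, (job_type if sep else None)


# Keys copied from the loaded_configs output above
config_keys = [
    "DummyDataLoading_calibration",
    "DummyTraining",
    "ModelCalibration_calibration",
    "Package",
    "Payload",
    "PyTorchModelEval_calibration",
    "TabularPreprocessing_calibration",
]

parsed: Dict[str, Tuple[str, Optional[str]]] = {
    key: split_step_name(key) for key in config_keys
}

for key, (step_type, job_type) in parsed.items():
    print(f"{key}: step_type={step_type}, job_type={job_type}")
```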

In [38]:
first_config = next(iter(loaded_configs.values()))

In [39]:
PIPELINE_VERSION = first_config.pipeline_version

In [40]:
PIPELINE_DESCRIPTION = first_config.pipeline_description

In [41]:
PIPELINE_NAME = first_config.pipeline_name

## Parameter Setup

In [42]:
import boto3
from sagemaker.workflow.pipeline_context import PipelineSession

# Initialize boto3 clients
ec2_client = boto3.client('ec2')
kms_client = boto3.client('kms')
sts_client = boto3.client('sts')

# Get account and region info
# NOTE: this rebinds `region` (set to 'NA' above) to the AWS region name, e.g. 'us-east-1'
account_id = sts_client.get_caller_identity()['Account']
region = boto3.Session().region_name

2025-11-30 06:22:55,836 - INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


### Find VPC Subnet - Get default VPC subnets or list all

In [43]:
response = ec2_client.describe_subnets(
    Filters=[{'Name': 'default-for-az', 'Values': ['true']}]
)
vpc_subnet_id = response['Subnets'][0]['SubnetId'] if response['Subnets'] else None

# OR list all subnets and choose one
#all_subnets = ec2_client.describe_subnets()
#for subnet in all_subnets['Subnets']:
#    print(f"Subnet ID: {subnet['SubnetId']}, VPC: {subnet['VpcId']}, AZ: {subnet['AvailabilityZone']}")


### Find Security Group - Get default or list all

In [44]:
response = ec2_client.describe_security_groups(
    Filters=[{'Name': 'group-name', 'Values': ['default']}]
)
security_group_id = response['SecurityGroups'][0]['GroupId'] if response['SecurityGroups'] else None

# OR list all security groups
#all_sgs = ec2_client.describe_security_groups()
#for sg in all_sgs['SecurityGroups']:
#    print(f"SG ID: {sg['GroupId']}, Name: {sg['GroupName']}, VPC: {sg.get('VpcId')}")
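Both lookups above fall back to `None` when nothing matches, and a `None` default would likely surface later as a less obvious error. A minimal sketch of a fail-fast alternative follows. `first_id` is a hypothetical helper, and the dictionaries are hand-written stand-ins shaped like the `describe_subnets` and `describe_security_groups` responses used above, so no AWS call is made.

```python
from typing import Any, Dict


def first_id(response: Dict[str, Any], list_key: str, id_key: str) -> str:
    """Return the ID of the first entry, or raise if the list is empty."""
    items = response.get(list_key) or []
    if not items:
        raise LookupError(
            f"No entries under '{list_key}'; supply the ID explicitly instead."
        )
    return items[0][id_key]


# Hand-written stand-ins for the boto3 responses (illustrative values only)
fake_subnets = {"Subnets": [{"SubnetId": "subnet-aaaa1111"}]}
fake_groups = {"SecurityGroups": []}

subnet_id = first_id(fake_subnets, "Subnets", "SubnetId")

try:
    first_id(fake_groups, "SecurityGroups", "GroupId")
    empty_detected = False
except LookupError:
    empty_detected = True
```

In the notebook, the helper would be applied to the real `response` objects, for example `first_id(response, "Subnets", "SubnetId")`.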

### Find KMS Key - List KMS keys for SageMaker

In [45]:
response = kms_client.list_aliases()
for alias in response['Aliases']:
    if 'sagemaker' in alias['AliasName'].lower():
        print(f"KMS Alias: {alias['AliasName']}, Key ID: {alias.get('TargetKeyId')}")

# The loop above only prints candidates; here we use the alias ARN of the AWS-managed SageMaker key
kms_key_id = f"arn:aws:kms:{region}:{account_id}:alias/aws/sagemaker"


In [46]:
print(f"\nFound values:")
print(f"VPC Subnet: {vpc_subnet_id}")
print(f"Security Group: {security_group_id}")
print(f"KMS Key: {kms_key_id}")


Found values:
VPC Subnet: subnet-45db3e4b
Security Group: sg-e116c4be
KMS Key: arn:aws:kms:us-east-1:178936618742:alias/aws/sagemaker


### Execution Id

In [47]:
execution_id = datetime.now().strftime("%Y%m%d%H%M%S")

### Define Parameter String

In [48]:
from sagemaker.network import NetworkConfig
from sagemaker.processing import ProcessingInput
from sagemaker.workflow.functions import Join
from sagemaker.workflow.parameters import ParameterString


In [49]:
# Predefined Pipeline Parameters
PIPELINE_EXECUTION_TEMP_DIR = ParameterString(name="EXECUTION_S3_PREFIX", default_value=f"s3://{bucket_name}/pipeline/{PIPELINE_NAME}/{execution_id}")
KMS_ENCRYPTION_KEY_PARAM = ParameterString(name="KMS_ENCRYPTION_KEY_PARAM", default_value=kms_key_id)
VPC_SUBNET = ParameterString(
    name="VPC_SUBNET",
    default_value=vpc_subnet_id
)  # TODO: test if we can replace it with multiple subnets
SECURITY_GROUP_ID = ParameterString(name="SECURITY_GROUP_ID", default_value=security_group_id)
PROCESSING_JOB_SHARED_NETWORK_CONFIG = NetworkConfig(
    enable_network_isolation=False,
    security_group_ids=[SECURITY_GROUP_ID],
    subnets=[VPC_SUBNET],
    encrypt_inter_container_traffic=True,
)
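The default for `EXECUTION_S3_PREFIX` combines the bucket, the pipeline name, and the timestamp-based execution id. Because the same string is assembled again when the execution is started, one option is to build it in a single place. The sketch below assumes a hypothetical helper `build_execution_prefix` and uses placeholder bucket and pipeline names.

```python
from datetime import datetime


def build_execution_prefix(bucket: str, pipeline_name: str, now: datetime) -> str:
    """Build s3://<bucket>/pipeline/<pipeline_name>/<YYYYmmddHHMMSS>."""
    execution_id = now.strftime("%Y%m%d%H%M%S")
    return f"s3://{bucket}/pipeline/{pipeline_name}/{execution_id}"


# Placeholder inputs; a fixed timestamp keeps the example deterministic
prefix = build_execution_prefix(
    bucket="example-bucket",
    pipeline_name="example-pipeline",
    now=datetime(2025, 11, 30, 6, 22, 56),
)
print(prefix)  # s3://example-bucket/pipeline/example-pipeline/20251130062256
```

Passing `datetime.now()` reproduces the behavior of the `execution_id` cell above.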

## Import Packages

In [50]:
from abc import ABC, abstractmethod
from typing import Dict, List, Any, Optional, Type
from pathlib import Path
import logging
import os
import importlib

In [51]:
import sagemaker
from sagemaker import Session
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.parameters import Parameter
from sagemaker.workflow.properties import Properties
from sagemaker.workflow.pipeline_context import PipelineSession # Crucial import

## Demo: An End-to-End Pipeline based on PipelineDAG Compiler
Let us use the following simpler DAG (without registration) as an example.


This demo takes several user inputs:
* the **Unified Config JSON file** at `config_path`
* the **Registry Manager**: an object that maps each step's logical name to its `step.properties`
* the **Dependency Resolver**: an object that handles the *automatic dependency resolution* between steps
* the other fields
    * `sagemaker_session`: pipeline session
    * `role`: IAM role
    * `notebook_root`: tracks the root path


In this pipeline template, we inherit from base class `PipelineTemplateBase`. 

The **major tasks** are
* *`Config` Classes Import*
* *Configuration Validation*
* *Step Builder Retrieval and Step Builder Map Creation*
* *Configuration Map Creation*
* **Pipeline DAG Generation**: ideally, the user creates this DAG and passes it in as input
* **Automatic Pipeline Assembly**: calls `pipeline_assembler`
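To make the role of the DAG concrete: assembling steps automatically requires visiting nodes so that every step comes after the steps it depends on. The sketch below shows one standard way to do this, Kahn's topological sort, on a toy graph. It is an illustration under stated assumptions and not the Cursus implementation; `topological_order` and the node names are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, List, Tuple


def topological_order(nodes: List[str], edges: List[Tuple[str, str]]) -> List[str]:
    """Return nodes in dependency order using Kahn's algorithm."""
    indegree: Dict[str, int] = {node: 0 for node in nodes}
    children: Dict[str, List[str]] = defaultdict(list)
    for src, dst in edges:
        if src not in indegree or dst not in indegree:
            raise ValueError(f"Edge ({src} -> {dst}) references an unknown node")
        children[src].append(dst)
        indegree[dst] += 1

    # Start from nodes that have no incoming edges
    queue = deque(node for node in nodes if indegree[node] == 0)
    order: List[str] = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)

    # Any node left unvisited sits on a cycle
    if len(order) != len(indegree):
        raise ValueError("Graph contains a cycle; it is not a DAG")
    return order


toy_nodes = ["load", "preprocess", "train"]
toy_edges = [("load", "preprocess"), ("preprocess", "train")]
toy_order = topological_order(toy_nodes, toy_edges)

try:
    topological_order(["a", "b"], [("a", "b"), ("b", "a")])
    cycle_detected = False
except ValueError:
    cycle_detected = True
```

The same check also rejects edges that point at undeclared nodes, which is a common source of errors when parts of a DAG definition are commented out.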


### DAG to Template Compiler

In [52]:
from cursus.api.dag.base_dag import PipelineDAG
from cursus.core.compiler.dag_compiler import compile_dag_to_pipeline, PipelineDAGCompiler
from cursus.core.compiler.validation import ConversionReport
from cursus.steps.configs.utils import load_configs

In [53]:
def create_bedrock_batch_pytorch_with_label_ruleset_e2e_dag() -> PipelineDAG:
    """
    Create a DAG for Bedrock Batch-enhanced PyTorch E2E pipeline with Label Ruleset steps.

    This DAG represents a complete end-to-end workflow that uses:
    1. Bedrock prompt template generation and batch processing for LLM-enhanced data
    2. Label ruleset generation and execution for transparent label transformation
    3. PyTorch training, followed by calibration, packaging, and registration

    The label ruleset steps sit between Bedrock processing and training/evaluation,
    providing transparent, rule-based label transformation that's easy to modify.

    Returns:
        PipelineDAG: The directed acyclic graph for the pipeline
    """
    dag = PipelineDAG()

    # Add all nodes - incorporating Bedrock batch processing and label ruleset steps
    dag.add_node("DummyDataLoading_training")  # Dummy data load for training
    dag.add_node("TabularPreprocessing_training")  # Tabular preprocessing for training
    #dag.add_node(
    #    "BedrockPromptTemplateGeneration"
    #)  # Bedrock prompt template generation (shared)
    #dag.add_node(
    #    "BedrockBatchProcessing_training"
    #)  # Bedrock batch processing step for training
    #dag.add_node(
    #    "LabelRulesetGeneration"
    #)  # Label ruleset generation (shared for training and calibration)
    #dag.add_node(
    #    "LabelRulesetExecution_training"
    #)  # Label ruleset execution for training data
    dag.add_node("PyTorchTraining")  # PyTorch training step
    #dag.add_node(
    #    "ModelCalibration_calibration"
    #)  # Model calibration step with calibration variant
    #dag.add_node("Package")  # Package step
    #dag.add_node("Registration")  # MIMS registration step
    #dag.add_node("Payload")  # Payload step
    #dag.add_node("DummyDataLoading_calibration")  # Dummy data load for calibration
    #dag.add_node(
    #    "TabularPreprocessing_calibration"
    #)  # Tabular preprocessing for calibration
    #dag.add_node(
    #    "BedrockBatchProcessing_calibration"
    #)  # Bedrock batch processing step for calibration
    #dag.add_node(
    #    "LabelRulesetExecution_calibration"
    #)  # Label ruleset execution for calibration data
    #dag.add_node("PyTorchModelEval_calibration")  # Model evaluation step

    # Training flow with Bedrock batch processing and label ruleset integration
    dag.add_edge("DummyDataLoading_training", "TabularPreprocessing_training")
    dag.add_edge(
        "TabularPreprocessing_training", "PyTorchTraining"
    )  # Data input
    
    # Disabled: the edges below reference nodes that are commented out above.
    # Re-enable them together with the corresponding dag.add_node(...) calls.
    # Bedrock batch processing flow for training - two inputs to BedrockBatchProcessing_training
    #dag.add_edge(
    #    "TabularPreprocessing_training", "BedrockBatchProcessing_training"
    #)  # Data input
    #dag.add_edge(
    #    "BedrockPromptTemplateGeneration", "BedrockBatchProcessing_training"
    #)  # Template input

    # Label ruleset execution for training - two inputs to LabelRulesetExecution_training
    #dag.add_edge(
    #    "BedrockBatchProcessing_training", "LabelRulesetExecution_training"
    #)  # Data input
    #dag.add_edge(
    #    "LabelRulesetGeneration", "LabelRulesetExecution_training"
    #)  # Ruleset input

    # Labeled data flows to PyTorch training
    #dag.add_edge("LabelRulesetExecution_training", "PyTorchTraining")

    # Calibration flow with Bedrock batch processing and label ruleset integration
    #dag.add_edge("DummyDataLoading_calibration", "TabularPreprocessing_calibration")

    # Bedrock batch processing flow for calibration - two inputs to BedrockBatchProcessing_calibration
    #dag.add_edge(
    #    "TabularPreprocessing_calibration", "BedrockBatchProcessing_calibration"
    #)  # Data input
    #dag.add_edge(
    #    "BedrockPromptTemplateGeneration", "BedrockBatchProcessing_calibration"
    #)  # Template input

    # Label ruleset execution for calibration - two inputs to LabelRulesetExecution_calibration
    #dag.add_edge(
    #    "BedrockBatchProcessing_calibration", "LabelRulesetExecution_calibration"
    #)  # Data input
    #dag.add_edge(
    #    "LabelRulesetGeneration", "LabelRulesetExecution_calibration"
    #)  # Ruleset input

    # Evaluation flow
    #dag.add_edge("PyTorchTraining", "PyTorchModelEval_calibration")
    #dag.add_edge(
    #    "LabelRulesetExecution_calibration", "PyTorchModelEval_calibration"
    #)  # Use labeled calibration data

    # Model calibration flow - depends on model evaluation
    #dag.add_edge("PyTorchModelEval_calibration", "ModelCalibration_calibration")

    # Output flow
    #dag.add_edge("ModelCalibration_calibration", "Package")
    #dag.add_edge("PyTorchTraining", "Package")  # Raw model is also input to packaging
    #dag.add_edge("PyTorchTraining", "Payload")  # Payload test uses the raw model
    #dag.add_edge("Package", "Registration")
    #dag.add_edge("Payload", "Registration")

    logger.info(
        f"Created Bedrock Batch-PyTorch with Label Ruleset E2E DAG with {len(dag.nodes)} nodes and {len(dag.edges)} edges"
    )
    return dag

In [54]:
def create_bedrock_batch_pytorch_with_label_ruleset_e2e_dag() -> PipelineDAG:
    """
    Create a minimal PyTorch training pipeline DAG (data loading, preprocessing, training).
    
    This DAG represents the same workflow as the legacy demo_config.ipynb
    but in a structured, reusable format.
    
    Returns:
        PipelineDAG: The directed acyclic graph for the pipeline
    """
    dag = PipelineDAG()
    
    # Add all nodes - matching the structure from demo_config.ipynb
    dag.add_node("DummyDataLoading_training")      # Training data loading
    dag.add_node("TabularPreprocessing_training")   # Training data preprocessing
    dag.add_node("PyTorchTraining")                 # PyTorch model training
    
    # Define dependencies - training flow
    dag.add_edge("DummyDataLoading_training", "TabularPreprocessing_training")
    dag.add_edge("TabularPreprocessing_training", "PyTorchTraining")
    
    logger.info(f"Created PyTorch training DAG with {len(dag.nodes)} nodes and {len(dag.edges)} edges")
    return dag

In [55]:
def create_bedrock_batch_pytorch_with_label_ruleset_e2e_dag() -> PipelineDAG:
    """
    Create the evaluation, calibration, packaging, and payload DAG for a PyTorch model.
    
    This DAG represents the same workflow as the legacy demo_config.ipynb
    but in a structured, reusable format.
    
    Returns:
        PipelineDAG: The directed acyclic graph for the pipeline
    """
    dag = PipelineDAG()
    
    # Add all nodes - evaluation, calibration, packaging, and payload steps
    dag.add_node("DummyTraining")  # Dummy training step (stands in for real training)
    dag.add_node("DummyDataLoading_calibration")  # Dummy data load for calibration
    dag.add_node(
        "TabularPreprocessing_calibration"
    )  # Tabular preprocessing for calibration
    dag.add_node("PyTorchModelEval_calibration")  # Model evaluation step
    dag.add_node(
        "ModelCalibration_calibration"
    )  # Model calibration step with calibration variant
    dag.add_node("Package")  # Package step
    dag.add_node("Payload")  # Payload step


    # Calibration data flow
    dag.add_edge("DummyDataLoading_calibration", "TabularPreprocessing_calibration")

    # Evaluation flow
    dag.add_edge("DummyTraining", "PyTorchModelEval_calibration")
    dag.add_edge(
        "TabularPreprocessing_calibration", "PyTorchModelEval_calibration"
    )  # Use preprocessed calibration data

    # Model calibration flow - depends on model evaluation
    dag.add_edge("PyTorchModelEval_calibration", "ModelCalibration_calibration")

    # Output flow
    dag.add_edge("ModelCalibration_calibration", "Package")
    dag.add_edge("DummyTraining", "Package")  # Raw model is also input to packaging
    dag.add_edge("DummyTraining", "Payload")  # Payload test uses the raw model
    
    logger.info(f"Created PyTorch calibration E2E DAG with {len(dag.nodes)} nodes and {len(dag.edges)} edges")
    return dag
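Each DAG node needs a matching entry in the unified config. A quick set comparison can flag mismatches before compilation. The sketch below uses plain lists copied from the seven nodes of the DAG above and the seven config keys loaded earlier. `nodes_without_config` is a hypothetical helper, and it assumes exact name matching, which may be stricter than the resolution Cursus actually performs.

```python
from typing import Iterable, List


def nodes_without_config(dag_nodes: Iterable[str], config_keys: Iterable[str]) -> List[str]:
    """Return DAG node names that have no config entry of the same name."""
    return sorted(set(dag_nodes) - set(config_keys))


dag_nodes = [
    "DummyTraining",
    "DummyDataLoading_calibration",
    "TabularPreprocessing_calibration",
    "PyTorchModelEval_calibration",
    "ModelCalibration_calibration",
    "Package",
    "Payload",
]
config_keys = [
    "DummyDataLoading_calibration",
    "DummyTraining",
    "ModelCalibration_calibration",
    "Package",
    "Payload",
    "PyTorchModelEval_calibration",
    "TabularPreprocessing_calibration",
]

missing = nodes_without_config(dag_nodes, config_keys)

# Adding a node with no config entry (here the disabled Registration step) is flagged
missing_with_registration = nodes_without_config(dag_nodes + ["Registration"], config_keys)
```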

In [56]:
dag = create_bedrock_batch_pytorch_with_label_ruleset_e2e_dag()

2025-11-30 06:22:56,532 - INFO - Added node: DummyTraining
2025-11-30 06:22:56,532 - INFO - Added node: DummyDataLoading_calibration
2025-11-30 06:22:56,533 - INFO - Added node: TabularPreprocessing_calibration
2025-11-30 06:22:56,533 - INFO - Added node: PyTorchModelEval_calibration
2025-11-30 06:22:56,533 - INFO - Added node: ModelCalibration_calibration
2025-11-30 06:22:56,534 - INFO - Added node: Package
2025-11-30 06:22:56,534 - INFO - Added node: Payload
2025-11-30 06:22:56,534 - INFO - Added edge: DummyDataLoading_calibration -> TabularPreprocessing_calibration
2025-11-30 06:22:56,535 - INFO - Added edge: DummyTraining -> PyTorchModelEval_calibration
2025-11-30 06:22:56,535 - INFO - Added edge: TabularPreprocessing_calibration -> PyTorchModelEval_calibration
2025-11-30 06:22:56,535 - INFO - Added edge: PyTorchModelEval_calibration -> ModelCalibration_calibration
2025-11-30 06:22:56,536 - INFO - Added edge: ModelCalibration_calibration -> Package
2025-11-30 06:22:56,536 - INFO - 

In [57]:
pipeline_parameters = [
    PIPELINE_EXECUTION_TEMP_DIR,
    KMS_ENCRYPTION_KEY_PARAM,
    SECURITY_GROUP_ID,
    VPC_SUBNET,
]

In [58]:
dag_compiler = PipelineDAGCompiler(
    config_path=config_path,
    sagemaker_session=pipeline_session,
    role=role,
    pipeline_parameters=pipeline_parameters
)

2025-11-30 06:22:56,546 - INFO - 🔧 BuilderAutoDiscovery.__init__ starting - package_root: /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/cursus
2025-11-30 06:22:56,547 - INFO - 🔧 BuilderAutoDiscovery.__init__ - workspace_dirs: []
2025-11-30 06:22:56,547 - INFO - ✅ BuilderAutoDiscovery basic initialization complete
2025-11-30 06:22:56,548 - INFO - ✅ Registry info loaded: 43 steps
2025-11-30 06:22:56,548 - INFO - 🎉 BuilderAutoDiscovery initialization completed successfully
2025-11-30 06:22:56,548 - INFO - 🔍 ScriptAutoDiscovery.__init__ starting - package_root: /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/cursus
2025-11-30 06:22:56,549 - INFO - 🔍 ScriptAutoDiscovery.__init__ - workspace_dirs: []
2025-11-30 06:22:56,549 - INFO - 🔍 ScriptAutoDiscovery.__init__ - priority_workspace_dir: None
2025-11-30 06:22:56,550 - INFO - ✅ Registry info loaded: 43 steps
2025-11-30 06:22:56,550 - INFO - 🎉 ScriptAutoD

### Create a Pipeline

#### DAG Validation and Preview of Config Resolution

In [59]:
preview_only = True

In [60]:
if preview_only:
    preview = dag_compiler.preview_resolution(dag)
    logger.info("DAG node resolution preview:")
    for node, config_type in preview.node_config_map.items():
        confidence = preview.resolution_confidence.get(node, 0.0)
        logger.info(f"  {node} → {config_type} (confidence: {confidence:.2f})")
        
    if preview.recommendations:
        logger.info("Recommendations:")
        for recommendation in preview.recommendations:
            logger.info(f"  - {recommendation}")
        
    validation = dag_compiler.validate_dag_compatibility(dag)
    logger.info(f"DAG validation: {'VALID' if validation.is_valid else 'INVALID'}")
    if not validation.is_valid:
        if validation.missing_configs:
            logger.warning(f"Missing configs: {validation.missing_configs}")
        if validation.unresolvable_builders:
            logger.warning(f"Unresolvable builders: {validation.unresolvable_builders}")
        if validation.config_errors:
            logger.warning(f"Config errors: {validation.config_errors}")

2025-11-30 06:22:56,565 - INFO - Previewing resolution for 7 DAG nodes
2025-11-30 06:22:56,565 - INFO - Creating template for DAG with 7 nodes
2025-11-30 06:22:56,566 - INFO - 🔧 BuilderAutoDiscovery.__init__ starting - package_root: /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/cursus
2025-11-30 06:22:56,567 - INFO - 🔧 BuilderAutoDiscovery.__init__ - workspace_dirs: []
2025-11-30 06:22:56,567 - INFO - ✅ BuilderAutoDiscovery basic initialization complete
2025-11-30 06:22:56,568 - INFO - ✅ Registry info loaded: 43 steps
2025-11-30 06:22:56,568 - INFO - 🎉 BuilderAutoDiscovery initialization completed successfully
2025-11-30 06:22:56,568 - INFO - 🔍 ScriptAutoDiscovery.__init__ starting - package_root: /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/cursus
2025-11-30 06:22:56,569 - INFO - 🔍 ScriptAutoDiscovery.__init__ - workspace_dirs: []
2025-11-30 06:22:56,569 - INFO - 🔍 ScriptAutoDiscovery.__init__ - pri

### Put it Together: Pipeline Generation from DAG

In [61]:
# Convert DAG to pipeline and get report
try:
    logger.info(f"Converting DAG to pipeline")
    template_pipeline, report = dag_compiler.compile_with_report(
        dag=dag
    )
        
    # Log report summary
    logger.info(f"Conversion complete: {report.summary()}")
    for node, details in report.resolution_details.items():
        logger.info(f"  {node} → {details['config_type']} ({details['builder_type']})")
        
    # Log pipeline creation details
    logger.info(f"Pipeline '{template_pipeline.name}' created successfully")
    logger.info(f"Pipeline ARN: {template_pipeline.arn if hasattr(template_pipeline, 'arn') else 'Not available until upserted'}")
    logger.info("To upsert the pipeline, call pipeline.upsert()")       
except Exception as e:
    logger.error(f"Failed to convert DAG to pipeline: {e}")
    raise

2025-11-30 06:22:56,719 - INFO - Converting DAG to pipeline
2025-11-30 06:22:56,720 - INFO - Compiling DAG with detailed reporting
2025-11-30 06:22:56,720 - INFO - Compiling DAG with 7 nodes to pipeline
2025-11-30 06:22:56,720 - INFO - Creating template for DAG with 7 nodes
2025-11-30 06:22:56,721 - INFO - Loading configs from: /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/pipeline_config/config.json
2025-11-30 06:22:56,721 - INFO - 🔧 BuilderAutoDiscovery.__init__ starting - package_root: /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/cursus
2025-11-30 06:22:56,722 - INFO - 🔧 BuilderAutoDiscovery.__init__ - workspace_dirs: []
2025-11-30 06:22:56,723 - INFO - ✅ BuilderAutoDiscovery basic initialization complete
2025-11-30 06:22:56,723 - INFO - ✅ Registry info loaded: 43 steps
2025-11-30 06:22:56,723 - INFO - 🎉 BuilderAutoDiscovery initialization completed successfully
2025-11-30 06:22:56,724 

### Pipeline Template

After the pipeline is generated, we can retrieve the pipeline template from the compiler.

In [62]:
pipeline_template_builder = dag_compiler.get_last_template()

## Start Execution

In [63]:
role_arn = pipeline_session.get_caller_identity_arn()
role_arn

'arn:aws:iam::178936618742:role/AmazonSageMaker-ExecutionRole-Default'

In [64]:
pipeline_description=PIPELINE_DESCRIPTION

In [65]:
PIPELINE_DESCRIPTION

'BuyerAbuseRnR pytorch Model NA'

### Upsert

In [66]:
template_pipeline.upsert(
    role_arn=role_arn, description=pipeline_description
)

2025-11-30 06:22:57,401 - INFO - SageMaker Python SDK will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features.
To opt out of telemetry, please disable via TelemetryOptOut parameter in SDK defaults config. For more information, refer to https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk.
2025-11-30 06:22:57,814 - INFO - Uploaded /home/ec2-user/SageMaker/AmazonSageMaker-lukexie-sagemaker-bsm-repo/pipelines/rnr_pytorch_bedrock/docker/scripts to s3://buyer-seller-messaging-reversal/lukexie-BuyerAbuseRnR-pytorch-NA-0-0-2-pipeline/code/1ab2be6de97597c80d219d20eb33001aa781ae46fd39e85d51a7d69fd249e405/sourcedir.tar.gz
2025-11-30 06:22:57,874 - INFO - runproc.sh uploaded to s3://buyer-seller-messaging-reversal/lukexie-BuyerAbuseRnR-pytorch-NA-0-0-2-pipeline/code/98f6fa33c1a36beaee98b312e63ec04cef74b2ac27e39d60b70e8a5cbfb87f71/runproc.sh
2025-11-30 06:22:58,429 - INFO

{'PipelineArn': 'arn:aws:sagemaker:us-east-1:178936618742:pipeline/lukexie-BuyerAbuseRnR-pytorch-NA-0-0-2-pipeline',
 'PipelineVersionId': 10,
 'ResponseMetadata': {'RequestId': '5dfca02b-149d-4b3d-a112-c9bb55a13f79',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '5dfca02b-149d-4b3d-a112-c9bb55a13f79',
   'strict-transport-security': 'max-age=47304000; includeSubDomains',
   'x-frame-options': 'DENY',
   'content-security-policy': "frame-ancestors 'none'",
   'cache-control': 'no-cache, no-store, must-revalidate',
   'x-content-type-options': 'nosniff',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '138',
   'date': 'Sun, 30 Nov 2025 06:23:06 GMT'},
  'RetryAttempts': 0}}

### Start

In [67]:
pipeline_execution_parameters={
    "EXECUTION_S3_PREFIX": f"s3://{bucket_name}/pipeline/{PIPELINE_NAME}/{execution_id}",
    "KMS_ENCRYPTION_KEY_PARAM": kms_key_id,
    "VPC_SUBNET": vpc_subnet_id,
    "SECURITY_GROUP_ID": security_group_id,
}
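The keys of the execution parameters must match the names declared on the pipeline (`EXECUTION_S3_PREFIX`, `KMS_ENCRYPTION_KEY_PARAM`, `VPC_SUBNET`, `SECURITY_GROUP_ID`), and a value left as `None` by a failed lookup is easy to miss. As a sketch, a hypothetical helper `check_execution_parameters` can flag both problems on plain dictionaries before `start()` is called. The example values below are placeholders.

```python
from typing import Dict, List, Optional, Tuple


def check_execution_parameters(
    declared_names: List[str],
    execution_parameters: Dict[str, Optional[str]],
) -> Tuple[List[str], List[str]]:
    """Return (names not declared on the pipeline, names whose value is None)."""
    unknown = sorted(set(execution_parameters) - set(declared_names))
    empty = sorted(k for k, v in execution_parameters.items() if v is None)
    return unknown, empty


declared = [
    "EXECUTION_S3_PREFIX",
    "KMS_ENCRYPTION_KEY_PARAM",
    "VPC_SUBNET",
    "SECURITY_GROUP_ID",
]

# Deliberately flawed example: one misspelled name and one missing value
example_parameters = {
    "EXECUTION_S3_PREFIX": "s3://example-bucket/pipeline/example/20251130062256",
    "VPC_SUBNET": None,
    "SECURITY_GROUP": "sg-example",
}

unknown, empty = check_execution_parameters(declared, example_parameters)
print(f"Unknown names: {unknown}")
print(f"Empty values: {empty}")
```

Declared parameters that are absent from the dictionary are not flagged, since they fall back to their default values.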

In [68]:
pipeline_execution = template_pipeline.start(
    parameters=pipeline_execution_parameters
)

2025-11-30 06:23:06,585 - INFO - SageMaker Python SDK will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features.
To opt out of telemetry, please disable via TelemetryOptOut parameter in SDK defaults config. For more information, refer to https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk.
