# Interactive Pipeline Configuration with DAGConfigFactory

This notebook demonstrates the new interactive approach to pipeline configuration using the DAGConfigFactory.
Instead of manually creating 500+ lines of static configuration, we use a guided step-by-step process.

## Workflow Overview

1. **Define Pipeline DAG** - Create the pipeline structure
2. **Initialize DAGConfigFactory** - Set up the interactive factory
3. **Configure Base Settings** - Set shared pipeline configuration
4. **Configure Processing Settings** - Set shared processing configuration
5. **Check Configuration Status** - See which steps still need configuration
6. **Configure Individual Steps** - Set step-specific configurations
7. **Generate Final Configurations** - Create config instances
8. **Save to JSON** - Export unified configuration file

![mods_pipeline_train_eval_calib](./tutorials/mods_end_to_end_xgboost.png)

## Environment Setup

In [1]:
import os
import json
import sys
from pathlib import Path
from datetime import datetime, date
import logging
from typing import List, Optional, Dict, Any


# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Add the directory two levels above the notebook's working directory to sys.path
project_root = str(Path().absolute().parent.parent)
print(f"project root {project_root}")
if project_root not in sys.path:
    sys.path.insert(0, project_root)
    print(f"add project root {project_root} into system")

project root /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src
add project root /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src into system


In [2]:
# SageMaker and SAIS imports
from sagemaker import Session
from sagemaker.workflow.pipeline_context import PipelineSession
from secure_ai_sandbox_python_lib.session import Session as SaisSession
from mods_workflow_helper.utils.secure_session import create_secure_session_config
from mods_workflow_helper.sagemaker_pipeline_helper import SecurityConfig

# Initialize SAIS session
sais_session = SaisSession(".")

# Create security config
security_config = SecurityConfig(
    kms_key=sais_session.get_team_owned_bucket_kms_key(),
    security_group=sais_session.sandbox_vpc_security_group(),
    vpc_subnets=sais_session.sandbox_vpc_subnets()
)

# Create SageMaker config
sagemaker_config = create_secure_session_config(
    role_arn=PipelineSession().get_caller_identity_arn(),
    bucket_name=sais_session.team_owned_s3_bucket_name(),
    kms_key=sais_session.get_team_owned_bucket_kms_key(),
    vpc_subnet_ids=sais_session.sandbox_vpc_subnets(),
    vpc_security_groups=[sais_session.sandbox_vpc_security_group()]
)

# Create pipeline session
pipeline_session = PipelineSession(
    default_bucket=sais_session.team_owned_s3_bucket_name(), 
    sagemaker_config=sagemaker_config
)
pipeline_session.config = sagemaker_config

print(f"Bucket: {sais_session.team_owned_s3_bucket_name()}")
print(f"Role: {PipelineSession().get_caller_identity_arn()}")

2025-10-16 03:49:34,880 - INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


2025-10-16 03:49:35,173 - INFO - CA certs are provided via the AmazonCACerts installation at /home/ec2-user/.local/lib/python3.10/site-packages/amazoncerts
2025-10-16 03:49:35,613 - INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
2025-10-16 03:49:36,446 - INFO - successfully patched module botocore
2025-10-16 03:49:36,463 - INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
2025-10-16 03:49:36,645 - INFO - There is no MODS workflow execution id provided, this is probably because you are running your pipeline outside of MODS.
2025-10-16 03:49:36,661 - INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
2025-10-16 03:49:37,174 - INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


Bucket: sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um
Role: arn:aws:iam::601857636239:role/SandboxRole-lukexie-us-east-1


## Step 1: Define Pipeline DAG

First, we define the pipeline structure using a DAG (Directed Acyclic Graph).
This replaces the hardcoded pipeline structure from the legacy approach.

In [3]:
from buyer_abuse_mods_template.cursus.api.dag.base_dag import PipelineDAG

def create_xgboost_complete_e2e_dag() -> PipelineDAG:
    """
    Create a complete end-to-end XGBoost pipeline DAG.
    
    This DAG represents the same workflow as the legacy demo_config.ipynb
    but in a structured, reusable format.
    
    Returns:
        PipelineDAG: The directed acyclic graph for the pipeline
    """
    dag = PipelineDAG()
    
    # Add all nodes - matching the structure from demo_config.ipynb
    dag.add_node("CradleDataLoading_training")      # Training data loading
    dag.add_node("CradleDataLoading_calibration")   # Calibration data loading
    dag.add_node("TabularPreprocessing_training")   # Training data preprocessing
    dag.add_node("TabularPreprocessing_calibration") # Calibration data preprocessing
    dag.add_node("XGBoostTraining")                 # XGBoost model training
    dag.add_node("XGBoostModelEval_calibration")    # Model evaluation
    dag.add_node("ModelCalibration_calibration")    # Model calibration
    dag.add_node("Package")                         # Model packaging
    dag.add_node("Registration")                    # MIMS model registration
    dag.add_node("Payload")                         # Payload generation
    
    # Define dependencies - training flow
    dag.add_edge("CradleDataLoading_training", "TabularPreprocessing_training")
    dag.add_edge("TabularPreprocessing_training", "XGBoostTraining")
    
    # Calibration flow
    dag.add_edge("CradleDataLoading_calibration", "TabularPreprocessing_calibration")
    
    # Evaluation flow
    dag.add_edge("XGBoostTraining", "XGBoostModelEval_calibration")
    dag.add_edge("TabularPreprocessing_calibration", "XGBoostModelEval_calibration")
    
    # Model calibration flow
    dag.add_edge("XGBoostModelEval_calibration", "ModelCalibration_calibration")
    
    # Output flow
    dag.add_edge("ModelCalibration_calibration", "Package")
    dag.add_edge("XGBoostTraining", "Package")
    dag.add_edge("XGBoostTraining", "Payload")
    dag.add_edge("Package", "Registration")
    dag.add_edge("Payload", "Registration")
    
    logger.info(f"Created XGBoost E2E DAG with {len(dag.nodes)} nodes and {len(dag.edges)} edges")
    return dag

# Create the pipeline DAG
dag = create_xgboost_complete_e2e_dag()

print(f"Pipeline DAG created with {len(dag.nodes)} steps:")
for node in dag.nodes:
    print(f"  - {node}")

2025-10-16 03:49:37,571 - INFO - Added node: CradleDataLoading_training
2025-10-16 03:49:37,571 - INFO - Added node: CradleDataLoading_calibration
2025-10-16 03:49:37,572 - INFO - Added node: TabularPreprocessing_training
2025-10-16 03:49:37,572 - INFO - Added node: TabularPreprocessing_calibration
2025-10-16 03:49:37,573 - INFO - Added node: XGBoostTraining
2025-10-16 03:49:37,573 - INFO - Added node: XGBoostModelEval_calibration
2025-10-16 03:49:37,574 - INFO - Added node: ModelCalibration_calibration
2025-10-16 03:49:37,574 - INFO - Added node: Package
2025-10-16 03:49:37,574 - INFO - Added node: Registration
2025-10-16 03:49:37,575 - INFO - Added node: Payload
2025-10-16 03:49:37,575 - INFO - Added edge: CradleDataLoading_training -> TabularPreprocessing_training
2025-10-16 03:49:37,575 - INFO - Added edge: TabularPreprocessing_training -> XGBoostTraining
2025-10-16 03:49:37,576 - INFO - Added edge: CradleDataLoading_calibration -> TabularPreprocessing_calibration
2025-10-16 03:49:

Pipeline DAG created with 10 steps:
  - CradleDataLoading_training
  - CradleDataLoading_calibration
  - TabularPreprocessing_training
  - TabularPreprocessing_calibration
  - XGBoostTraining
  - XGBoostModelEval_calibration
  - ModelCalibration_calibration
  - Package
  - Registration
  - Payload
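
As a quick sanity check on the structure above, the following sketch rebuilds the same edge list with the standard-library `graphlib` module and derives one valid execution order. It does not use `PipelineDAG` at all; `EDGES` and `execution_order` are names introduced here for illustration only, and the edges are copied by hand from `create_xgboost_complete_e2e_dag()`.

```python
from graphlib import TopologicalSorter

# Edges copied from create_xgboost_complete_e2e_dag(): (source, target)
EDGES = [
    ("CradleDataLoading_training", "TabularPreprocessing_training"),
    ("TabularPreprocessing_training", "XGBoostTraining"),
    ("CradleDataLoading_calibration", "TabularPreprocessing_calibration"),
    ("XGBoostTraining", "XGBoostModelEval_calibration"),
    ("TabularPreprocessing_calibration", "XGBoostModelEval_calibration"),
    ("XGBoostModelEval_calibration", "ModelCalibration_calibration"),
    ("ModelCalibration_calibration", "Package"),
    ("XGBoostTraining", "Package"),
    ("XGBoostTraining", "Payload"),
    ("Package", "Registration"),
    ("Payload", "Registration"),
]


def execution_order(edges):
    """Return one valid execution order; graphlib.CycleError is raised on a cycle."""
    sorter = TopologicalSorter()
    for source, target in edges:
        # TopologicalSorter.add(node, *predecessors): target depends on source
        sorter.add(target, source)
    return list(sorter.static_order())


order = execution_order(EDGES)
for position, step in enumerate(order, start=1):
    print(f"{position:>2}. {step}")
```

Because every other node is an ancestor of `Registration`, it always comes last in any valid order, which matches the intent of the output flow.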


## Step 2: Initialize DAGConfigFactory

Now we initialize the DAGConfigFactory with our DAG. This will automatically:
- Map DAG nodes to configuration classes
- Set up the interactive workflow
- Prepare for step-by-step configuration

In [4]:
from buyer_abuse_mods_template.cursus.api.factory.dag_config_factory import DAGConfigFactory

# Initialize the factory with our DAG
factory = DAGConfigFactory(dag)

# Get the config class mapping
config_map = factory.get_config_class_map()

print("DAG Node to Config Class Mapping:")
print("=" * 50)
for node_name, config_class in config_map.items():
    print(f"  {node_name:<35} -> {config_class.__name__}")

print(f"\nSuccessfully mapped {len(config_map)} steps to configuration classes.")

2025-10-16 03:49:37,586 - INFO - 🔧 BuilderAutoDiscovery.__init__ starting - package_root: /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/cursus
2025-10-16 03:49:37,586 - INFO - 🔧 BuilderAutoDiscovery.__init__ - workspace_dirs: []
2025-10-16 03:49:37,587 - INFO - ✅ BuilderAutoDiscovery basic initialization complete
2025-10-16 03:49:37,587 - INFO - ✅ Registry info loaded: 25 steps
2025-10-16 03:49:37,588 - INFO - 🎉 BuilderAutoDiscovery initialization completed successfully
2025-10-16 03:49:37,781 - INFO - Discovered 33 core config classes
2025-10-16 03:49:37,791 - INFO - Discovered 3 core hyperparameter classes
2025-10-16 03:49:37,817 - INFO - Discovered 7 base hyperparameter classes from core/base
2025-10-16 03:49:37,818 - INFO - Built complete config classes: 43 total (33 config + 10 hyperparameter auto-discovered)
2025-10-16 03:49:37,818 - INFO - Discovered 43 config classes via step catalog
2025-10-16 03:49:37,818 - INFO - Registry system initialized su

DAG Node to Config Class Mapping:
  CradleDataLoading_training          -> CradleDataLoadingConfig
  CradleDataLoading_calibration       -> CradleDataLoadingConfig
  TabularPreprocessing_training       -> TabularPreprocessingConfig
  TabularPreprocessing_calibration    -> TabularPreprocessingConfig
  XGBoostTraining                     -> XGBoostTrainingConfig
  XGBoostModelEval_calibration        -> XGBoostModelEvalConfig
  ModelCalibration_calibration        -> ModelCalibrationConfig
  Package                             -> PackageConfig
  Registration                        -> RegistrationConfig
  Payload                             -> PayloadConfig

Successfully mapped 10 steps to configuration classes.
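
The mapping printed above is consistent with a simple naming convention: drop the job-type suffix after the first underscore and append `Config`. The sketch below encodes that convention only to make the pattern explicit. It is an assumption inferred from the output, not the factory's actual resolution logic (the logs suggest the real lookup goes through a step catalog and registry), and `guess_config_class_name` is a hypothetical helper.

```python
def guess_config_class_name(node_name: str) -> str:
    """Hypothetical rule: '<StepType>[_<job_type>]' -> '<StepType>Config'."""
    step_type = node_name.split("_", 1)[0]
    return f"{step_type}Config"


# Node names taken from the DAG defined in Step 1
node_names = [
    "CradleDataLoading_training",
    "TabularPreprocessing_calibration",
    "XGBoostModelEval_calibration",
    "Package",
]
guessed = {name: guess_config_class_name(name) for name in node_names}
for name, class_name in guessed.items():
    print(f"  {name:<35} -> {class_name}")
```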


## Step 3: Configure Base Pipeline Settings

These settings are shared across ALL pipeline steps. Instead of repeating them
in every step configuration, we set them once here.

In [5]:
# Get base configuration requirements
base_requirements = factory.get_base_config_requirements()

print("Base Pipeline Configuration Requirements:")
print("=" * 50)
for req in base_requirements:
    marker = "*" if req['required'] else " "
    default_info = f" (default: {req.get('default')})" if not req['required'] and 'default' in req else ""
    print(f"{marker} {req['name']:<25} ({req['type']}){default_info}")
    print(f"    {req['description']}")
    print()

Base Pipeline Configuration Requirements:
* author                    (str)
    Author or owner of the pipeline.

* bucket                    (str)
    S3 bucket name for pipeline artifacts and data.

* role                      (str)
    IAM role for pipeline execution.

* region                    (str)
    Custom region code (NA, EU, FE) for internal logic.

* service_name              (str)
    Service name for the pipeline.

* pipeline_version          (str)
    Version string for the SageMaker Pipeline.

  model_class               (str) (default: xgboost)
    Model class (e.g., XGBoost, PyTorch).

  current_date              (str) (default: PydanticUndefined)
    Current date, typically used for versioning or pathing.

  framework_version         (str) (default: 2.1.0)
    Default framework version (e.g., PyTorch).

  py_version                (str) (default: py310)
    Default Python version.

  source_dir                (Optional) (default: None)
    Common source directory fo

In [6]:
# Set up basic configuration values
region_list = ['NA', 'EU', 'FE']
region_selection = 0
region = region_list[region_selection]

# Map region to AWS region
region_mapping = {
    'NA': 'us-east-1',
    'EU': 'eu-west-1', 
    'FE': 'us-west-2'
}
aws_region = region_mapping[region]

# Get current directory and set up paths
current_dir = Path.cwd()
package_root = Path(current_dir).resolve()
source_dir = Path('dockers')
project_root_folder = "bap_example_pipeline"

# Set base configuration
factory.set_base_config(
    # Infrastructure settings
    bucket=sais_session.team_owned_s3_bucket_name(),
    role=PipelineSession().get_caller_identity_arn(),
    region=region,
    aws_region=aws_region,
    
    # Project identification
    author=sais_session.owner_alias(),
    service_name='AtoZ',
    pipeline_version='1.3.1',
    
    # Framework settings
    framework_version='1.7-1',
    py_version='py3',
    source_dir=str(source_dir),
    project_root_folder=project_root_folder,
    
    # Date settings
    current_date=date.today().strftime("%Y-%m-%d")
)

print("✅ Base pipeline configuration set successfully!")
print(f"   Region: {region} ({aws_region})")
print(f"   Service: AtoZ")
print(f"   Author: {sais_session.owner_alias()}")
print(f"   Pipeline Version: 1.3.1")

2025-10-16 03:49:37,847 - INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
2025-10-16 03:49:38,014 - INFO - Base configuration set successfully


✅ Base pipeline configuration set successfully!
   Region: NA (us-east-1)
   Service: AtoZ
   Author: lukexie
   Pipeline Version: 1.3.1
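
Before calling `set_base_config`, it can help to confirm that every required field has a value. The sketch below assumes only the requirement shape used in the printing loop above (dicts with `name` and `required` keys); `missing_required` and the sample data are illustrative and not part of the factory API.

```python
from typing import Any, Dict, List


def missing_required(requirements: List[Dict[str, Any]], provided: Dict[str, Any]) -> List[str]:
    """Names of required fields that were not supplied (or were supplied as None)."""
    return [
        req["name"]
        for req in requirements
        if req["required"] and provided.get(req["name"]) is None
    ]


# Hand-written sample mirroring a few entries of the printed requirements
sample_requirements = [
    {"name": "author", "required": True},
    {"name": "bucket", "required": True},
    {"name": "pipeline_version", "required": True},
    {"name": "model_class", "required": False, "default": "xgboost"},
]

missing = missing_required(
    sample_requirements,
    {"author": "example-alias", "bucket": "example-bucket"},
)
print(missing)  # ['pipeline_version']
```

In the notebook, the same helper could be pointed at `base_requirements` and the keyword arguments intended for `set_base_config`.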


## Step 4: Configure Base Processing Settings

These settings are shared across all PROCESSING steps (data loading, preprocessing, etc.)
but not training steps.

In [7]:
# Get base processing configuration requirements
processing_requirements = factory.get_base_processing_config_requirements()

if processing_requirements:
    print("Base Processing Configuration Requirements:")
    print("=" * 50)
    for req in processing_requirements:
        marker = "*" if req['required'] else " "
        default_info = f" (default: {req.get('default')})" if not req['required'] and 'default' in req else ""
        print(f"{marker} {req['name']:<30} ({req['type']}){default_info}")
        print(f"    {req['description']}")
        print()
else:
    print("No base processing configuration required for this pipeline.")

Base Processing Configuration Requirements:
  processing_instance_count      (int) (default: 1)
    Instance count for processing jobs

  processing_volume_size         (int) (default: 500)
    Volume size for processing jobs in GB

  processing_instance_type_large (str) (default: ml.m5.4xlarge)
    Large instance type for processing step.

  processing_instance_type_small (str) (default: ml.m5.2xlarge)
    Small instance type for processing step.

  use_large_processing_instance  (bool) (default: False)
    Set to True to use large instance type, False for small instance type.

  processing_source_dir          (Optional) (default: None)
    Source directory for processing scripts. Falls back to base source_dir if not provided.

  processing_entry_point         (Optional) (default: None)
    Entry point script for processing, must be relative to source directory. Can be overridden by derived classes.

  processing_script_arguments    (Optional) (default: None)
    Optional arguments fo

In [8]:
# Set base processing configuration if needed
if processing_requirements:
    processing_source_dir = source_dir / 'scripts'
    
    factory.set_base_processing_config(
        # Processing infrastructure
        processing_source_dir=str(processing_source_dir),
        processing_instance_type_large='ml.m5.12xlarge',
        processing_instance_type_small='ml.m5.4xlarge',
    )
    
    print("✅ Base processing configuration set successfully!")
    print(f"   Processing source: {processing_source_dir}")
else:
    print("✅ No base processing configuration needed.")

2025-10-16 03:49:38,025 - INFO - Package location discovery succeeded (bundled): /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/bap_example_pipeline/dockers
2025-10-16 03:49:38,026 - INFO - Hybrid resolution completed successfully via Package Location Discovery: /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/bap_example_pipeline/dockers
2025-10-16 03:49:38,026 - INFO - Base processing configuration set successfully


✅ Base processing configuration set successfully!
   Processing source: dockers/scripts
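
The `use_large_processing_instance` flag described in the requirements selects between the two instance types. A minimal sketch of that rule, using the values set in the cell above, is shown below; `select_processing_instance_type` is a hypothetical helper, and how the config classes expose the resolved value is not shown in this notebook.

```python
def select_processing_instance_type(use_large: bool, small: str, large: str) -> str:
    """Return the large instance type when use_large is True, otherwise the small one."""
    return large if use_large else small


# Values from set_base_processing_config above; use_large defaults to False
chosen = select_processing_instance_type(
    use_large=False,
    small="ml.m5.4xlarge",
    large="ml.m5.12xlarge",
)
print(chosen)  # ml.m5.4xlarge
```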


## Step 5: Check Configuration Status

Let's see which steps still need configuration.

In [9]:
# Check current status
status = factory.get_configuration_status()
pending_steps = factory.get_pending_steps()

print("Configuration Status:")
print("=" * 30)
print(f"Base config set: {'✅' if status['base_config'] else '❌'}")
print(f"Processing config set: {'✅' if status['base_processing_config'] else '❌'}")
print(f"Total steps: {len(config_map)}")
print(f"Pending steps: {len(pending_steps)}")
print()

if pending_steps:
    print("Steps needing configuration:")
    for step in pending_steps:
        print(f"  - {step}")
else:
    print("✅ All steps configured!")

Configuration Status:
Base config set: ✅
Processing config set: ✅
Total steps: 10
Pending steps: 8

Steps needing configuration:
  - CradleDataLoading_training
  - CradleDataLoading_calibration
  - TabularPreprocessing_training
  - TabularPreprocessing_calibration
  - XGBoostTraining
  - XGBoostModelEval_calibration
  - ModelCalibration_calibration
  - Registration


## Step 6: Configure Individual Steps

Now we configure each step with its specific requirements. The factory will show us
only the fields that are unique to each step (not inherited from base configs).
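
One way to picture "fields unique to each step" is as the difference between a step config's fields and those of its base class. The sketch below illustrates the idea with plain dataclasses; `BaseConfigSketch`, `TrainingConfigSketch`, and `step_specific_fields` are hypothetical stand-ins, and the real config classes and the factory's introspection may work differently.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class BaseConfigSketch:
    """Hypothetical stand-in for the shared base configuration."""
    bucket: str
    role: str
    region: str


@dataclass
class TrainingConfigSketch(BaseConfigSketch):
    """Hypothetical stand-in for a step config that extends the base."""
    training_instance_type: str
    training_entry_point: str
    training_volume_size: int


def step_specific_fields(step_cls, base_cls) -> List[str]:
    """Field names declared on step_cls that are not inherited from base_cls."""
    inherited = {f.name for f in fields(base_cls)}
    return [f.name for f in fields(step_cls) if f.name not in inherited]


unique = step_specific_fields(TrainingConfigSketch, BaseConfigSketch)
print(unique)
```

Only the three training fields are reported, so the base values set once in Step 3 do not have to be repeated for each step.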

### Step 6.1: Configure Training Step

This config is for the **TrainingStep**.
* It asks the user to provide all the information needed to construct a **Container** and start a **Training Job**.
* The most important information is provided in the **HyperParameter** section.


In [10]:
tab_field_list = [
f'Abuse.abuse_fap_action_by_customer_inline_transform_{region.lower()}.n_claims_solicit_count_last_365_days',
f'Abuse.abuse_fap_action_by_customer_inline_transform_{region.lower()}.n_claims_warn_count_last_365_days',
f'Abuse.abuse_fap_action_by_customer_inline_transform_{region.lower()}.n_concession_solicit_count_last_365_days',
f'Abuse.abuse_fap_action_by_customer_inline_transform_{region.lower()}.n_concession_warn_count_last_365_days',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_max_buyer_order_message_time_gap',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_max_order_message_time_gap',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_max_seller_order_message_time_gap',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_message_count_with_diff_topic_si',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_message_count_with_notr_topic_si',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_message_count_with_return_keywords_si',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_min_buyer_message_count',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_min_buyer_order_message_time_gap',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_min_message_count',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_min_order_message_time_gap',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_min_seller_message_count',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_min_seller_order_message_time_gap',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_total_buyer_message_count',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_total_message_count',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_total_seller_message_count',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_total_topic_count',
'Abuse.completed_afn_orders_by_customer_marketplace.n_afn_order_count_last_365_days',
'Abuse.completed_afn_orders_by_customer_marketplace.n_afn_unit_amount_last_365_days',
'Abuse.completed_afn_orders_by_customer_marketplace.n_afn_unit_count_last_365_days',
'Abuse.completed_mfn_orders_by_customer_marketplace.n_mfn_order_count_last_365_days',
'Abuse.completed_mfn_orders_by_customer_marketplace.n_mfn_unit_amount_last_365_days',
'Abuse.completed_mfn_orders_by_customer_marketplace.n_mfn_unit_count_last_365_days',
'Abuse.dnr_by_customer_marketplace.n_dnr_amount_si_last_365_days',
'Abuse.dnr_by_customer_marketplace.n_dnr_order_count_last_365_days',
'Abuse.dnr_by_customer_marketplace.n_dnr_unit_amount_last_365_days',
'Abuse.dnr_by_customer_marketplace.n_dnr_unit_count_last_365_days',
'Abuse.mfn_a2z_claims_by_customer_na.n_mfn_claims_amount_last_365_days',
'Abuse.mfn_a2z_claims_by_customer_na.n_mfn_claims_count_last_365_days',
'Abuse.mfn_a2z_claims_by_customer_na.n_mfn_diff_claims_amount_last_365_days',
'Abuse.mfn_a2z_claims_by_customer_na.n_mfn_diff_claims_count_last_365_days',
'Abuse.mfn_a2z_claims_by_customer_na.n_mfn_notr_claims_amount_last_365_days',
'Abuse.mfn_a2z_claims_by_customer_na.n_mfn_notr_claims_count_last_365_days',
'Abuse.mfn_categorized_refunds_by_customer_marketplace.n_mfn_diff_refunds_order_count_last_365_days',
'Abuse.mfn_categorized_refunds_by_customer_marketplace.n_mfn_diff_refunds_unit_amount_last_365_days',
'Abuse.mfn_categorized_refunds_by_customer_marketplace.n_mfn_diff_refunds_unit_count_last_365_days',
'Abuse.mfn_categorized_refunds_by_customer_marketplace.n_mfn_notr_refunds_order_count_last_365_days',
'Abuse.mfn_categorized_refunds_by_customer_marketplace.n_mfn_notr_refunds_unit_amount_last_365_days',
'Abuse.mfn_categorized_refunds_by_customer_marketplace.n_mfn_notr_refunds_unit_count_last_365_days',
'Abuse.mfn_categorized_refunds_si_by_customer_marketplace.n_mfn_diff_refunds_si_365_days',
'Abuse.mfn_categorized_refunds_si_by_customer_marketplace.n_mfn_notr_refunds_si_365_days',
'Abuse.mfn_refunds_by_customer_marketplace.n_mfn_refund_order_count_last_365_days',
'Abuse.mfn_refunds_by_customer_marketplace.n_mfn_refund_unit_amount_last_365_days',
'Abuse.mfn_refunds_by_customer_marketplace.n_mfn_refund_unit_count_last_365_days',
'Abuse.mfn_refunds_si_by_customer_marketplace.n_mfn_refund_amount_si_last_365_days',
'Abuse.order_to_execution_time_from_eventvariables.n_order_to_execution',
'Abuse.shiptrack_flag_by_order.n_any_delivered',
'Abuse.shiptrack_flag_by_order.n_any_available_for_pickup',
'Abuse.shiptrack_flag_by_order.n_any_partial_delivered',
'Abuse.shiptrack_flag_by_order.n_any_undeliverable',
'Abuse.shiptrack_flag_by_order.n_any_returning',
'Abuse.shiptrack_flag_by_order.n_any_returned',
'COMP_DAYOB',
'claimAmount_value',
'claimantInfo_allClaimCount365day',
'claimantInfo_lifetimeClaimCount',
'claimantInfo_pendingClaimCount',
]

In [11]:
cat_field_list = [
    'PAYMETH',
    'claim_reason',
    'claimantInfo_status',
    'shipments_status'
]

In [12]:
label_name = 'is_abuse'         
id_name = 'order_id'

in_house_field_list = [
    id_name,
    'marketplace_id',
    label_name
]

In [13]:
full_field_list = tab_field_list + cat_field_list + in_house_field_list

In [14]:
# First, let's create the hyperparameters
from buyer_abuse_mods_template.cursus.steps.hyperparams.hyperparameters_xgboost import XGBoostModelHyperparameters
from buyer_abuse_mods_template.cursus.core.base.hyperparameters_base import ModelHyperparameters

# Create base hyperparameters
base_hyperparameter = ModelHyperparameters(
    full_field_list=full_field_list,
    cat_field_list=cat_field_list,
    tab_field_list=tab_field_list,
    label_name=label_name,
    id_name=id_name,
    multiclass_categories=[0, 1]
)

# Create XGBoost hyperparameters
xgb_hyperparams = XGBoostModelHyperparameters.from_base_hyperparam(
    base_hyperparameter,
    num_round=300,
    max_depth=6,
    min_child_weight=1
)

print("✅ Hyperparameters created")
print(f"   Features: {len(full_field_list)} total, {len(tab_field_list)} numerical, {len(cat_field_list)} categorical")
print(f"   XGBoost rounds: {xgb_hyperparams.num_round}")

✅ Hyperparameters created
   Features: 67 total, 60 numerical, 4 categorical
   XGBoost rounds: 300
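
The "build from a base object, then override" pattern used by `from_base_hyperparam` can be sketched with plain dataclasses. Everything below is a hypothetical illustration: the class names, the `from_base` method, and the default values are assumptions, and the actual `ModelHyperparameters` and `XGBoostModelHyperparameters` classes may be implemented quite differently.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class BaseHyperparamsSketch:
    """Hypothetical stand-in holding the shared, model-agnostic settings."""
    label_name: str
    id_name: str
    multiclass_categories: List[int] = field(default_factory=lambda: [0, 1])


@dataclass
class XgbHyperparamsSketch(BaseHyperparamsSketch):
    """Hypothetical stand-in adding model-specific settings (defaults are made up)."""
    num_round: int = 100
    max_depth: int = 6

    @classmethod
    def from_base(cls, base: BaseHyperparamsSketch, **overrides):
        # Copy every base field, then let explicit keyword arguments win
        return cls(**{**asdict(base), **overrides})


base = BaseHyperparamsSketch(label_name="is_abuse", id_name="order_id")
xgb = XgbHyperparamsSketch.from_base(base, num_round=300)
print(xgb.label_name, xgb.num_round, xgb.max_depth)  # is_abuse 300 6
```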


In [15]:
instance_type_list = [
    "ml.m5.4xlarge",
    "ml.g4dn.16xlarge", 
    "ml.g5.12xlarge", 
    "ml.g5.16xlarge",
    "ml.p3.8xlarge", 
    "ml.m5.12xlarge",
    "ml.p3.16xlarge"
]

In [16]:
instance_select = -2

In [17]:
# Configure XGBoost training
if "XGBoostTraining" in pending_steps:
    step_name = "XGBoostTraining"
    
    factory.set_step_config(
        step_name,
        training_instance_type=instance_type_list[instance_select],
        training_entry_point='xgboost_training.py',
        training_volume_size=800
    )
    print(f"✅ {step_name} configured")
    print(f"   Instance type: {instance_type_list[instance_select]}")
    print(f"   Volume size: 800 GB")

2025-10-16 03:49:38,080 - INFO - ✅ XGBoostTraining configured successfully using XGBoostTrainingConfig


✅ XGBoostTraining configured
   Instance type: ml.m5.12xlarge
   Volume size: 800 GB


### Step 6.2: Configure Data Loading Steps


In this section, the user provides the inputs needed to construct a **Cradle profile**. A Cradle profile has **four** sections:
1. **Data Source Specification**: specify
    1. *data source* (MDS, EDX, ANDES)
    2. *input schema*
2. **Transform Specification**: specify
    1. *transform SQL*
    2. *job split*
3. **Output Specification**: specify
    1. *output path*
    2. *output format* (CSV, UNESCAPED_TSV, JSON, ION, PARQUET)
    3. *output schema*
    4. *save mode*
4. **Cradle Job Specification**: specify
    1. *cradle account*
    2. *cluster type*

This config is for **CradleDataLoadingStep**, a customized step provided by [SecureAISandboxWorkflowPythonSDK](https://code.amazon.com/packages/SecureAISandboxWorkflowPythonSDK/trees/mainline#).
* This step inherits from **MODSPredefinedProcessingStep**, a customized base class that itself inherits from **ScriptProcessingStep**. The source code is in [MODSWorkflowCore](https://code.amazon.com/packages/MODSWorkflowCore/trees/mainline#).
* This step needs to load an **Execution Document** before it can run.
* This step itself does not have many options.
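
To make the four-section layout concrete, the sketch below models it with plain dataclasses. These are hypothetical stand-ins, not the real `CradleDataLoadingConfig` family imported in the next cell, whose field names and validation are not shown here. Only the allowed source types and output formats are taken from the list above; every example value is a placeholder.

```python
from dataclasses import dataclass
from typing import List

ALLOWED_SOURCE_TYPES = {"MDS", "EDX", "ANDES"}
ALLOWED_OUTPUT_FORMATS = {"CSV", "UNESCAPED_TSV", "JSON", "ION", "PARQUET"}


@dataclass
class DataSourceSketch:
    source_type: str
    input_schema: List[str]

    def __post_init__(self):
        if self.source_type not in ALLOWED_SOURCE_TYPES:
            raise ValueError(f"unknown data source type: {self.source_type}")


@dataclass
class TransformSketch:
    transform_sql: str
    split_job: bool = False


@dataclass
class OutputSketch:
    output_path: str
    output_format: str
    output_schema: List[str]
    save_mode: str

    def __post_init__(self):
        if self.output_format not in ALLOWED_OUTPUT_FORMATS:
            raise ValueError(f"unknown output format: {self.output_format}")


@dataclass
class CradleJobSketch:
    cradle_account: str
    cluster_type: str


@dataclass
class CradleProfileSketch:
    data_sources: List[DataSourceSketch]
    transform: TransformSketch
    output: OutputSketch
    job: CradleJobSketch


# Placeholder values only; none of these are real accounts, paths, or modes
profile = CradleProfileSketch(
    data_sources=[DataSourceSketch("MDS", ["objectId", "transactionDate"])],
    transform=TransformSketch(transform_sql="SELECT objectId FROM example_source"),
    output=OutputSketch("s3://example-bucket/output", "PARQUET", ["objectId"], "example-save-mode"),
    job=CradleJobSketch("example-account", "example-cluster-type"),
)
print(profile.output.output_format)  # PARQUET
```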

In [18]:
from buyer_abuse_mods_template.cursus.steps.configs.config_cradle_data_loading_step import (
    CradleDataLoadingConfig,
    MdsDataSourceConfig,
    EdxDataSourceConfig,
    DataSourceConfig,
    DataSourcesSpecificationConfig,
    JobSplitOptionsConfig,
    TransformSpecificationConfig,
    OutputSpecificationConfig,
    CradleJobSpecificationConfig,
)

#### Cradle Data Loading (Training) Step

In [19]:
training_start_datetime='2025-01-01T00:00:00'  #'2024-12-01T00:00:00'  #'2024-03-01T00:00:00'
training_end_datetime='2025-04-17T00:00:00' 

In [20]:
mds_service_name='AtoZ'
org_id=0

In [21]:
mds_field_list = ['objectId', 'transactionDate'] + tab_field_list + cat_field_list
mds_output_schema = [{"field_name": field, "field_type": "STRING"} for field in mds_field_list]

In [22]:
train_edx_arn = {
    'NA': 'arn:amazon:edx:iad::manifest/trms-abuse-analytics/qingyuye-notr-exp/atoz-tag/["24292902",2025-01-01T00:00:00Z,2025-04-17T00:00:00Z,"NA"]',
    'EU': 'arn:amazon:edx:iad::manifest/trms-abuse-analytics/qingyuye-notr-exp/atoz-tag/["24292941",2025-01-01T00:00:00Z,2025-04-17T00:00:00Z,"EU"]',
    'FE': 'arn:amazon:edx:iad::manifest/trms-abuse-analytics/qingyuye-notr-exp/atoz-tag/["25782074",2025-01-01T00:00:00Z,2025-04-17T00:00:00Z,"FE"]',
}

In [23]:
tag_schema = [
    'order_id',
    'marketplace_id',
    'tag_date',
    'is_abuse',
    'abuse_type',
    'concession_type',
]
edx_schema_overrides = [{"field_name": field, "field_type": "STRING"} for field in tag_schema]

In [24]:
def get_all_fields(mds_fields: List[str], tag_fields: List[str]) -> List[str]:
    """
    Get a combined list of all fields from MDS and EDX sources.

    This function handles case-insensitivity to avoid duplicate columns in SQL SELECT
    statements where the only difference is case (e.g., "OrderId" and "orderid").
    When duplicates with different cases are found, the first occurrence is kept.

    Args:
        mds_fields (List[str]): List of MDS fields
        tag_fields (List[str]): List of tag fields

    Returns:
        List[str]: Combined and deduplicated list of fields
    """
    # Track lowercase field names to detect duplicates
    seen_lowercase = {}
    deduplicated_fields = []

    # Process all fields, keeping only the first occurrence when case-insensitive duplicates exist
    for field in mds_fields + tag_fields:
        field_lower = field.lower()
        if field_lower not in seen_lowercase:
            seen_lowercase[field_lower] = True
            deduplicated_fields.append(field)

    return sorted(deduplicated_fields)
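
To see the case-insensitive deduplication in action, the self-contained sketch below restates the same logic compactly (using a set instead of a dict) and runs it on a small made-up input. `dedupe_case_insensitive` is a name introduced here so the example can run on its own.

```python
from typing import List


def dedupe_case_insensitive(mds_fields: List[str], tag_fields: List[str]) -> List[str]:
    """Same behavior as get_all_fields: keep the first spelling, then sort."""
    seen = set()
    kept = []
    for name in mds_fields + tag_fields:
        key = name.lower()
        if key not in seen:
            seen.add(key)
            kept.append(name)
    return sorted(kept)


combined = dedupe_case_insensitive(
    ["objectId", "PAYMETH", "claim_reason"],
    ["order_id", "ObjectID", "is_abuse"],
)
print(combined)
# ['PAYMETH', 'claim_reason', 'is_abuse', 'objectId', 'order_id']
```

Note that `"ObjectID"` is dropped in favor of the earlier `"objectId"`, and that `sorted` compares strings case-sensitively, so uppercase names such as `PAYMETH` come before all lowercase names.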

In [25]:
output_schema = get_all_fields(mds_field_list, tag_schema)

In [26]:
output_format = "PARQUET"
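
The transform SQL in this notebook refers to MDS columns with every `.` in the field name written as `__DOT__`. Rather than typing those columns by hand, the `SELECT` list could be generated from the field list, which also keeps the region suffix in sync with `region`. The sketch below is one possible approach; `to_sql_column` and `build_select_list` are hypothetical helpers, and it covers only the column list, not the rest of the query.

```python
from typing import Iterable


def to_sql_column(field_name: str) -> str:
    """Rewrite a dotted field name the way the transform SQL spells it."""
    return field_name.replace(".", "__DOT__")


def build_select_list(field_names: Iterable[str], indent: str = "    ") -> str:
    """Join the rewritten column names into a comma-separated SELECT list."""
    return ",\n".join(f"{indent}{to_sql_column(name)}" for name in field_names)


region = "NA"
sample_fields = [
    f"Abuse.abuse_fap_action_by_customer_inline_transform_{region.lower()}.n_claims_solicit_count_last_365_days",
    "Abuse.shiptrack_flag_by_order.n_any_delivered",
    "COMP_DAYOB",
]
select_list = build_select_list(sample_fields)
print("SELECT\n" + select_list)
```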

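Adjust the transform SQL below to match your own field list and region. Column names use `__DOT__` in place of `.`, and this example hardcodes the `na` region suffix.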

In [27]:
transform_sql = """
SELECT
    Abuse__DOT__abuse_fap_action_by_customer_inline_transform_na__DOT__n_claims_solicit_count_last_365_days,
    Abuse__DOT__abuse_fap_action_by_customer_inline_transform_na__DOT__n_claims_warn_count_last_365_days,
    Abuse__DOT__abuse_fap_action_by_customer_inline_transform_na__DOT__n_concession_solicit_count_last_365_days,
    Abuse__DOT__abuse_fap_action_by_customer_inline_transform_na__DOT__n_concession_warn_count_last_365_days,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_max_buyer_order_message_time_gap,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_max_order_message_time_gap,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_max_seller_order_message_time_gap,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_message_count_with_diff_topic_si,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_message_count_with_notr_topic_si,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_message_count_with_return_keywords_si,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_min_buyer_message_count,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_min_buyer_order_message_time_gap,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_min_message_count,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_min_order_message_time_gap,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_min_seller_message_count,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_min_seller_order_message_time_gap,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_total_buyer_message_count,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_total_message_count,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_total_seller_message_count,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_total_topic_count,
    Abuse__DOT__completed_afn_orders_by_customer_marketplace__DOT__n_afn_order_count_last_365_days,
    Abuse__DOT__completed_afn_orders_by_customer_marketplace__DOT__n_afn_unit_amount_last_365_days,
    Abuse__DOT__completed_afn_orders_by_customer_marketplace__DOT__n_afn_unit_count_last_365_days,
    Abuse__DOT__completed_mfn_orders_by_customer_marketplace__DOT__n_mfn_order_count_last_365_days,
    Abuse__DOT__completed_mfn_orders_by_customer_marketplace__DOT__n_mfn_unit_amount_last_365_days,
    Abuse__DOT__completed_mfn_orders_by_customer_marketplace__DOT__n_mfn_unit_count_last_365_days,
    Abuse__DOT__dnr_by_customer_marketplace__DOT__n_dnr_amount_si_last_365_days,
    Abuse__DOT__dnr_by_customer_marketplace__DOT__n_dnr_order_count_last_365_days,
    Abuse__DOT__dnr_by_customer_marketplace__DOT__n_dnr_unit_amount_last_365_days,
    Abuse__DOT__dnr_by_customer_marketplace__DOT__n_dnr_unit_count_last_365_days,
    Abuse__DOT__mfn_a2z_claims_by_customer_na__DOT__n_mfn_claims_amount_last_365_days,
    Abuse__DOT__mfn_a2z_claims_by_customer_na__DOT__n_mfn_claims_count_last_365_days,
    Abuse__DOT__mfn_a2z_claims_by_customer_na__DOT__n_mfn_diff_claims_amount_last_365_days,
    Abuse__DOT__mfn_a2z_claims_by_customer_na__DOT__n_mfn_diff_claims_count_last_365_days,
    Abuse__DOT__mfn_a2z_claims_by_customer_na__DOT__n_mfn_notr_claims_amount_last_365_days,
    Abuse__DOT__mfn_a2z_claims_by_customer_na__DOT__n_mfn_notr_claims_count_last_365_days,
    Abuse__DOT__mfn_categorized_refunds_by_customer_marketplace__DOT__n_mfn_diff_refunds_order_count_last_365_days,
    Abuse__DOT__mfn_categorized_refunds_by_customer_marketplace__DOT__n_mfn_diff_refunds_unit_amount_last_365_days,
    Abuse__DOT__mfn_categorized_refunds_by_customer_marketplace__DOT__n_mfn_diff_refunds_unit_count_last_365_days,
    Abuse__DOT__mfn_categorized_refunds_by_customer_marketplace__DOT__n_mfn_notr_refunds_order_count_last_365_days,
    Abuse__DOT__mfn_categorized_refunds_by_customer_marketplace__DOT__n_mfn_notr_refunds_unit_amount_last_365_days,
    Abuse__DOT__mfn_categorized_refunds_by_customer_marketplace__DOT__n_mfn_notr_refunds_unit_count_last_365_days,
    Abuse__DOT__mfn_categorized_refunds_si_by_customer_marketplace__DOT__n_mfn_diff_refunds_si_365_days,
    Abuse__DOT__mfn_categorized_refunds_si_by_customer_marketplace__DOT__n_mfn_notr_refunds_si_365_days,
    Abuse__DOT__mfn_refunds_by_customer_marketplace__DOT__n_mfn_refund_order_count_last_365_days,
    Abuse__DOT__mfn_refunds_by_customer_marketplace__DOT__n_mfn_refund_unit_amount_last_365_days,
    Abuse__DOT__mfn_refunds_by_customer_marketplace__DOT__n_mfn_refund_unit_count_last_365_days,
    Abuse__DOT__mfn_refunds_si_by_customer_marketplace__DOT__n_mfn_refund_amount_si_last_365_days,
    Abuse__DOT__order_to_execution_time_from_eventvariables__DOT__n_order_to_execution,
    Abuse__DOT__shiptrack_flag_by_order__DOT__n_any_available_for_pickup,
    Abuse__DOT__shiptrack_flag_by_order__DOT__n_any_delivered,
    Abuse__DOT__shiptrack_flag_by_order__DOT__n_any_partial_delivered,
    Abuse__DOT__shiptrack_flag_by_order__DOT__n_any_returned,
    Abuse__DOT__shiptrack_flag_by_order__DOT__n_any_returning,
    Abuse__DOT__shiptrack_flag_by_order__DOT__n_any_undeliverable,
    COMP_DAYOB,
    PAYMETH,
    abuse_type,
    claimAmount_value,
    claim_reason,
    claimantInfo_allClaimCount365day,
    claimantInfo_lifetimeClaimCount,
    claimantInfo_pendingClaimCount,
    claimantInfo_status,
    concession_type,
    is_abuse,
    marketplace_id,
    objectId,
    order_id,
    shipments_status,
    tag_date,
    transactionDate
FROM (
    SELECT
        RAW_MDS_NA.Abuse__DOT__abuse_fap_action_by_customer_inline_transform_na__DOT__n_claims_solicit_count_last_365_days,
        RAW_MDS_NA.Abuse__DOT__abuse_fap_action_by_customer_inline_transform_na__DOT__n_claims_warn_count_last_365_days,
        RAW_MDS_NA.Abuse__DOT__abuse_fap_action_by_customer_inline_transform_na__DOT__n_concession_solicit_count_last_365_days,
        RAW_MDS_NA.Abuse__DOT__abuse_fap_action_by_customer_inline_transform_na__DOT__n_concession_warn_count_last_365_days,
        RAW_MDS_NA.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_max_buyer_order_message_time_gap,
        RAW_MDS_NA.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_max_order_message_time_gap,
        RAW_MDS_NA.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_max_seller_order_message_time_gap,
        RAW_MDS_NA.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_message_count_with_diff_topic_si,
        RAW_MDS_NA.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_message_count_with_notr_topic_si,
        RAW_MDS_NA.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_message_count_with_return_keywords_si,
        RAW_MDS_NA.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_min_buyer_message_count,
        RAW_MDS_NA.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_min_buyer_order_message_time_gap,
        RAW_MDS_NA.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_min_message_count,
        RAW_MDS_NA.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_min_order_message_time_gap,
        RAW_MDS_NA.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_min_seller_message_count,
        RAW_MDS_NA.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_min_seller_order_message_time_gap,
        RAW_MDS_NA.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_total_buyer_message_count,
        RAW_MDS_NA.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_total_message_count,
        RAW_MDS_NA.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_total_seller_message_count,
        RAW_MDS_NA.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_total_topic_count,
        RAW_MDS_NA.Abuse__DOT__completed_afn_orders_by_customer_marketplace__DOT__n_afn_order_count_last_365_days,
        RAW_MDS_NA.Abuse__DOT__completed_afn_orders_by_customer_marketplace__DOT__n_afn_unit_amount_last_365_days,
        RAW_MDS_NA.Abuse__DOT__completed_afn_orders_by_customer_marketplace__DOT__n_afn_unit_count_last_365_days,
        RAW_MDS_NA.Abuse__DOT__completed_mfn_orders_by_customer_marketplace__DOT__n_mfn_order_count_last_365_days,
        RAW_MDS_NA.Abuse__DOT__completed_mfn_orders_by_customer_marketplace__DOT__n_mfn_unit_amount_last_365_days,
        RAW_MDS_NA.Abuse__DOT__completed_mfn_orders_by_customer_marketplace__DOT__n_mfn_unit_count_last_365_days,
        RAW_MDS_NA.Abuse__DOT__dnr_by_customer_marketplace__DOT__n_dnr_amount_si_last_365_days,
        RAW_MDS_NA.Abuse__DOT__dnr_by_customer_marketplace__DOT__n_dnr_order_count_last_365_days,
        RAW_MDS_NA.Abuse__DOT__dnr_by_customer_marketplace__DOT__n_dnr_unit_amount_last_365_days,
        RAW_MDS_NA.Abuse__DOT__dnr_by_customer_marketplace__DOT__n_dnr_unit_count_last_365_days,
        RAW_MDS_NA.Abuse__DOT__mfn_a2z_claims_by_customer_na__DOT__n_mfn_claims_amount_last_365_days,
        RAW_MDS_NA.Abuse__DOT__mfn_a2z_claims_by_customer_na__DOT__n_mfn_claims_count_last_365_days,
        RAW_MDS_NA.Abuse__DOT__mfn_a2z_claims_by_customer_na__DOT__n_mfn_diff_claims_amount_last_365_days,
        RAW_MDS_NA.Abuse__DOT__mfn_a2z_claims_by_customer_na__DOT__n_mfn_diff_claims_count_last_365_days,
        RAW_MDS_NA.Abuse__DOT__mfn_a2z_claims_by_customer_na__DOT__n_mfn_notr_claims_amount_last_365_days,
        RAW_MDS_NA.Abuse__DOT__mfn_a2z_claims_by_customer_na__DOT__n_mfn_notr_claims_count_last_365_days,
        RAW_MDS_NA.Abuse__DOT__mfn_categorized_refunds_by_customer_marketplace__DOT__n_mfn_diff_refunds_order_count_last_365_days,
        RAW_MDS_NA.Abuse__DOT__mfn_categorized_refunds_by_customer_marketplace__DOT__n_mfn_diff_refunds_unit_amount_last_365_days,
        RAW_MDS_NA.Abuse__DOT__mfn_categorized_refunds_by_customer_marketplace__DOT__n_mfn_diff_refunds_unit_count_last_365_days,
        RAW_MDS_NA.Abuse__DOT__mfn_categorized_refunds_by_customer_marketplace__DOT__n_mfn_notr_refunds_order_count_last_365_days,
        RAW_MDS_NA.Abuse__DOT__mfn_categorized_refunds_by_customer_marketplace__DOT__n_mfn_notr_refunds_unit_amount_last_365_days,
        RAW_MDS_NA.Abuse__DOT__mfn_categorized_refunds_by_customer_marketplace__DOT__n_mfn_notr_refunds_unit_count_last_365_days,
        RAW_MDS_NA.Abuse__DOT__mfn_categorized_refunds_si_by_customer_marketplace__DOT__n_mfn_diff_refunds_si_365_days,
        RAW_MDS_NA.Abuse__DOT__mfn_categorized_refunds_si_by_customer_marketplace__DOT__n_mfn_notr_refunds_si_365_days,
        RAW_MDS_NA.Abuse__DOT__mfn_refunds_by_customer_marketplace__DOT__n_mfn_refund_order_count_last_365_days,
        RAW_MDS_NA.Abuse__DOT__mfn_refunds_by_customer_marketplace__DOT__n_mfn_refund_unit_amount_last_365_days,
        RAW_MDS_NA.Abuse__DOT__mfn_refunds_by_customer_marketplace__DOT__n_mfn_refund_unit_count_last_365_days,
        RAW_MDS_NA.Abuse__DOT__mfn_refunds_si_by_customer_marketplace__DOT__n_mfn_refund_amount_si_last_365_days,
        RAW_MDS_NA.Abuse__DOT__order_to_execution_time_from_eventvariables__DOT__n_order_to_execution,
        RAW_MDS_NA.Abuse__DOT__shiptrack_flag_by_order__DOT__n_any_available_for_pickup,
        RAW_MDS_NA.Abuse__DOT__shiptrack_flag_by_order__DOT__n_any_delivered,
        RAW_MDS_NA.Abuse__DOT__shiptrack_flag_by_order__DOT__n_any_partial_delivered,
        RAW_MDS_NA.Abuse__DOT__shiptrack_flag_by_order__DOT__n_any_returned,
        RAW_MDS_NA.Abuse__DOT__shiptrack_flag_by_order__DOT__n_any_returning,
        RAW_MDS_NA.Abuse__DOT__shiptrack_flag_by_order__DOT__n_any_undeliverable,
        RAW_MDS_NA.COMP_DAYOB,
        RAW_MDS_NA.PAYMETH,
        RAW_MDS_NA.claimAmount_value,
        RAW_MDS_NA.claim_reason,
        RAW_MDS_NA.claimantInfo_allClaimCount365day,
        RAW_MDS_NA.claimantInfo_lifetimeClaimCount,
        RAW_MDS_NA.claimantInfo_pendingClaimCount,
        RAW_MDS_NA.claimantInfo_status,
        RAW_MDS_NA.objectId,
        RAW_MDS_NA.shipments_status,
        RAW_MDS_NA.transactionDate,
        TAGS.abuse_type,
        TAGS.concession_type,
        TAGS.is_abuse,
        TAGS.marketplace_id,
        TAGS.order_id,
        TAGS.tag_date,
        ROW_NUMBER() OVER (PARTITION BY RAW_MDS_NA.objectId, TAGS.order_id ORDER BY RAW_MDS_NA.transactionDate DESC) as row_num
    FROM RAW_MDS_NA
    JOIN TAGS ON RAW_MDS_NA.objectId = TAGS.order_id
)
WHERE row_num = 1
"""

In [28]:
merge_sql = """
select * from INPUT
"""

In [29]:
available_cluster_types = ['STANDARD',
                           'SMALL',
                           'MEDIUM',
                           'LARGE'
                          ]
cluster_choice=-2
cluster_type=available_cluster_types[cluster_choice]
cluster_type

'MEDIUM'
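The cell above picks the cluster type with a negative index, which is easy to misread. As an alternative sketch, the hypothetical helper `pick_cluster_type` (not part of the notebook or any library) selects by name and fails loudly on a typo.

```python
available_cluster_types = ['STANDARD', 'SMALL', 'MEDIUM', 'LARGE']

def pick_cluster_type(name: str) -> str:
    """Return `name` if it is a known cluster type, otherwise raise ValueError."""
    if name not in available_cluster_types:
        raise ValueError(
            f"Unknown cluster type {name!r}; expected one of {available_cluster_types}"
        )
    return name

# Negative indexing counts from the end of the list, so -2 is the same choice.
cluster_type = pick_cluster_type('MEDIUM')
same_as_index = available_cluster_types[-2] == cluster_type
```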

In [30]:
cradle_account = 'Buyer-Abuse-RnD-Dev'

In [31]:
training_cradle_data_load_dict = {
    "job_type": "training",
    "data_sources_spec": DataSourcesSpecificationConfig(
        start_date=training_start_datetime, 
        end_date=training_end_datetime,
        data_sources=[
            DataSourceConfig(
                data_source_name='RAW_MDS_NA',
                data_source_type='MDS',
                mds_data_source_properties=MdsDataSourceConfig(
                    service_name=mds_service_name,
                    region=region,
                    output_schema=mds_output_schema,
                    org_id=org_id,
                )
            ), 
            DataSourceConfig(
                data_source_name='TAGS',
                data_source_type='EDX',
                edx_data_source_properties=EdxDataSourceConfig(
                    schema_overrides=edx_schema_overrides,
                    edx_arn=train_edx_arn[region],
                )
            )
        ]
    ),
    "output_spec": OutputSpecificationConfig(
        output_schema=output_schema,
        output_format=output_format
    ),
    "transform_spec":  TransformSpecificationConfig(
        transform_sql=transform_sql,
        job_split_options=JobSplitOptionsConfig(
            merge_sql=merge_sql
        )
    ),
    "cradle_job_spec": CradleJobSpecificationConfig(
        cradle_account=cradle_account,
        cluster_type=cluster_type,
    )
}

In [32]:
# Configure training data loading
if "CradleDataLoading_training" in pending_steps:
    step_name = "CradleDataLoading_training"
    requirements = factory.get_step_requirements(step_name)
    
    print(f"Configuring {step_name}:")
    print("-" * 40)
    for req in requirements[:5]:  # Show first 5 requirements
        marker = "*" if req['required'] else " "
        print(f"{marker} {req['name']:<25} ({req['type']})")
        print(f"    {req['description']}")
    
    if len(requirements) > 5:
        print(f"    ... and {len(requirements) - 5} more fields")
    
    # Set configuration for training data loading
    factory.set_step_config(
        step_name,
        **training_cradle_data_load_dict
    )
    print(f"✅ {step_name} configured")
    print()

2025-10-16 03:49:38,173 - INFO - ✅ CradleDataLoading_training configured successfully using CradleDataLoadingConfig


Configuring CradleDataLoading_training:
----------------------------------------
* job_type                  (str)
    One of ['training','validation','testing','calibration'] to indicate which dataset this job is pulling
* data_sources_spec         (DataSourcesSpecificationConfig)
    Full data‐sources specification (start/end dates plus list of sources).
* transform_spec            (TransformSpecificationConfig)
    Transform specification: SQL + job‐split options.
* output_spec               (OutputSpecificationConfig)
    Output specification: schema, output format, save mode, etc.
* cradle_job_spec           (CradleJobSpecificationConfig)
    Cradle job specification: cluster type, account, retry count, etc.
    ... and 1 more fields
✅ CradleDataLoading_training configured
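The requirement listing above can be factored into a small function so every step prints the same way. This is a sketch that assumes only the dictionary keys the notebook already reads (`name`, `type`, `required`, `description`). The function name `format_requirements` and the sample entries are hypothetical.

```python
from typing import Any, Dict, List

def format_requirements(requirements: List[Dict[str, Any]], limit: int = 5) -> List[str]:
    """Render up to `limit` requirements, marking required fields with '*'."""
    lines = []
    for req in requirements[:limit]:
        marker = "*" if req["required"] else " "
        lines.append(f"{marker} {req['name']:<25} ({req['type']})")
        lines.append(f"    {req['description']}")
    hidden = len(requirements) - limit
    if hidden > 0:
        lines.append(f"    ... and {hidden} more field(s)")
    return lines

# Hypothetical sample entries, shaped like the factory's requirement dicts.
sample = [
    {"name": "job_type", "type": "str", "required": True,
     "description": "Which dataset this job is pulling."},
    {"name": "example_optional_field", "type": "str", "required": False,
     "description": "Made-up optional field for illustration."},
]
lines = format_requirements(sample, limit=1)
print("\n".join(lines))
```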



#### Cradle Data Loading (Calibration) Step

In [33]:
#=======================================================
calibration_start_datetime = '2025-04-17T00:00:00'  # previously '2024-05-26T00:00:00'
calibration_end_datetime = '2025-04-28T00:00:00'  # previously '2024-06-29T23:00:00'
#=======================================================

In [34]:
#=======================================================
calibration_edx_arn = {
    'NA': 'arn:amazon:edx:iad::manifest/trms-abuse-analytics/qingyuye-notr-exp/atoz-tag/["24292902",2025-04-17T00:00:00Z,2025-04-28T00:00:00Z,"NA"]',
    'EU': 'arn:amazon:edx:iad::manifest/trms-abuse-analytics/qingyuye-notr-exp/atoz-tag/["24292941",2025-04-17T00:00:00Z,2025-04-27T00:00:00Z,"EU"]',
    'FE': 'arn:amazon:edx:iad::manifest/trms-abuse-analytics/qingyuye-notr-exp/atoz-tag/["25782074",2025-04-17T00:00:00Z,2025-04-27T00:00:00Z,"FE"]',
}
#=======================================================
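Each EDX ARN above embeds its own date range, so it can be compared against the calibration window before submitting a job. The sketch below assumes the manifest key always ends in `["<id>",<start>Z,<end>Z,"<region>"]`, as in the three ARNs shown. The helper `arn_date_range` is hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches the trailing ["<id>",<start>,<end>,"<region>"] part of the manifest key.
_ARN_RANGE = re.compile(r'\["[^"]*",([^,\]]+),([^,\]]+),"[^"]*"\]$')

def arn_date_range(edx_arn: str) -> Optional[Tuple[str, str]]:
    """Return (start, end) from the ARN without the trailing 'Z', or None if absent."""
    match = _ARN_RANGE.search(edx_arn)
    if match is None:
        return None
    start, end = match.groups()
    return start.rstrip("Z"), end.rstrip("Z")

calibration_start_datetime = '2025-04-17T00:00:00'
calibration_end_datetime = '2025-04-28T00:00:00'
calibration_edx_arn = {
    'NA': 'arn:amazon:edx:iad::manifest/trms-abuse-analytics/qingyuye-notr-exp/atoz-tag/["24292902",2025-04-17T00:00:00Z,2025-04-28T00:00:00Z,"NA"]',
    'EU': 'arn:amazon:edx:iad::manifest/trms-abuse-analytics/qingyuye-notr-exp/atoz-tag/["24292941",2025-04-17T00:00:00Z,2025-04-27T00:00:00Z,"EU"]',
    'FE': 'arn:amazon:edx:iad::manifest/trms-abuse-analytics/qingyuye-notr-exp/atoz-tag/["25782074",2025-04-17T00:00:00Z,2025-04-27T00:00:00Z,"FE"]',
}

# Regions whose ARN range differs from the configured calibration window.
window = (calibration_start_datetime, calibration_end_datetime)
mismatched = {
    rgn: arn_date_range(arn)
    for rgn, arn in calibration_edx_arn.items()
    if arn_date_range(arn) != window
}
print(mismatched)
```

With the values in this notebook, the EU and FE ARNs end one day earlier than `calibration_end_datetime`. The notebook does not say whether that is intentional, so the check only flags the difference.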

In [35]:
calibration_cradle_data_load_dict = {
    "job_type": "calibration",
    "data_sources_spec": DataSourcesSpecificationConfig(
        start_date=calibration_start_datetime, 
        end_date=calibration_end_datetime,
        data_sources=[
            DataSourceConfig(
                data_source_name='RAW_MDS_NA',
                data_source_type='MDS',
                mds_data_source_properties=MdsDataSourceConfig(
                    service_name=mds_service_name,
                    region=region,
                    output_schema=mds_output_schema,
                    org_id=org_id,
                )
            ), 
            DataSourceConfig(
                data_source_name='TAGS',
                data_source_type='EDX',
                edx_data_source_properties=EdxDataSourceConfig(
                    schema_overrides=edx_schema_overrides,
                    edx_arn=calibration_edx_arn[region],
                )
            )
        ]
    ),
    "output_spec": OutputSpecificationConfig(
        output_schema=output_schema,
        output_format=output_format
    ),
    "transform_spec":  TransformSpecificationConfig(
        transform_sql=transform_sql,
        job_split_options=JobSplitOptionsConfig(
            merge_sql=merge_sql
        )
    ),
    "cradle_job_spec": CradleJobSpecificationConfig(
        cradle_account=cradle_account,
        cluster_type=cluster_type,
    )
}
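The training and calibration dictionaries differ only in `job_type`, the date range, and the EDX ARN. A hypothetical builder could capture that, as sketched below with plain dictionaries standing in for the real `*Config` classes so it runs on its own. The function name, the `shared` layout, and the `<...>` placeholder values are assumptions for illustration.

```python
from typing import Any, Dict

def build_cradle_data_load_dict(
    job_type: str,
    start_date: str,
    end_date: str,
    edx_arn: str,
    shared: Dict[str, Any],
) -> Dict[str, Any]:
    """Assemble one data-loading spec; only the first four arguments vary per job."""
    return {
        "job_type": job_type,
        "data_sources_spec": {
            "start_date": start_date,
            "end_date": end_date,
            "data_sources": [
                {"data_source_name": "RAW_MDS_NA", "data_source_type": "MDS",
                 **shared["mds"]},
                {"data_source_name": "TAGS", "data_source_type": "EDX",
                 "edx_arn": edx_arn, **shared["edx"]},
            ],
        },
        "output_spec": shared["output_spec"],
        "transform_spec": shared["transform_spec"],
        "cradle_job_spec": shared["cradle_job_spec"],
    }

# Placeholder values; in the notebook these would be the real config objects.
shared = {
    "mds": {"service_name": "<mds_service_name>", "region": "NA"},
    "edx": {"schema_overrides": []},
    "output_spec": {"output_format": "<output_format>"},
    "transform_spec": {"transform_sql": "<transform_sql>",
                       "merge_sql": "select * from INPUT"},
    "cradle_job_spec": {"cradle_account": "Buyer-Abuse-RnD-Dev",
                        "cluster_type": "MEDIUM"},
}

training = build_cradle_data_load_dict(
    "training", "<training_start>", "<training_end>", "<training_arn>", shared)
calibration = build_cradle_data_load_dict(
    "calibration", "2025-04-17T00:00:00", "2025-04-28T00:00:00", "<calibration_arn>", shared)

# Only the job-specific entries should differ between the two specs.
differing = sorted(key for key in training if training[key] != calibration[key])
print(differing)
```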

In [36]:
# Configure calibration data loading
if "CradleDataLoading_calibration" in pending_steps:
    step_name = "CradleDataLoading_calibration"
    
    factory.set_step_config(
        step_name,
        **calibration_cradle_data_load_dict
    )
    print(f"✅ {step_name} configured")

2025-10-16 03:49:38,197 - INFO - ✅ CradleDataLoading_calibration configured successfully using CradleDataLoadingConfig


✅ CradleDataLoading_calibration configured


### Step 6.3: Configure Registration Step


* [MRAS (Model Resource Allocation System)](https://w.amazon.com/bin/view/CMLS/ME/MIMS/) is a system that manages your **model endpoints**. 
    * It takes your model artifact and its metadata and deploys an endpoint to an AWS account you have onboarded to MRAS. You can access this endpoint through the AMES system, which URES uses.
* **MIMS (Model Inference Management System)** is the system that handles model creation.
* **MMS (Model Management Service)** manages the model card.

> Note that **MRAS** used to be called **MIMS** (**Model Inference Management System**).
> - **MIMS** is the component of MRAS that handles endpoint creation.
> - To reduce customer confusion, we have started to use *MRAS* to also refer to *MIMS*.
> - Some of our wikis may still use *MIMS* instead of *MRAS*.
>
> If your team has not already done so, please [onboard an AWS account to MRAS](https://w.amazon.com/bin/view/CMLS/ME/MIMS/UserGuide/Onboarding/).

* **MIMSModelRegistrationStep** is a SageMaker Workflow Step that wraps the service call to **MIMS**.
    * It is a customized step provided by the SAIS Python SDK.
        * See the source code in [SecureAISandboxWorkflowPythonSDK](https://code.amazon.com/packages/SecureAISandboxWorkflowPythonSDK/trees/mainline#)
    * This step inherits from **MODSPredefinedProcessingStep**, a customized base class that itself inherits from **ScriptProcessingStep**.
        * Source code in [MODSWorkflowCore](https://code.amazon.com/packages/MODSWorkflowCore/trees/mainline#)
    * This step needs to load the **Execution Document** to take action.

In **MIMSModelRegistrationStep**, we need to specify the following fields to fill in the **Execution Document**:
* *model_owner*
* *model_registration_domain*
* *model_registration_objective*
* *source_model_inference_input_variable_list*
* *source_model_inference_output_variable_list*
* *source_model_inference_content_types*
* *source_model_inference_response_types*


In [38]:
# Configure Registration step
if "Registration" in pending_steps:
    #=================== Update This =======================
    model_domain = 'AtoZ'
    model_objective = f'AtoZ_Claims_SM_Model_{region}'
    #=======================================================
    
    source_model_inference_input_variable_list = {
        field: 'NUMERIC' if field in tab_field_list else 'TEXT' 
        for field in tab_field_list + cat_field_list
    }
    
    source_model_inference_output_variable_list = {
        'legacy-score': 'NUMERIC',
        'calibrated-score': 'NUMERIC',
        'custom-output-label': 'TEXT'
    }

    #=================== Update This =======================
    framework='xgboost'
    inference_entry_point='xgboost_inference.py'
    #=======================================================
    
    factory.set_step_config(
        "Registration",
        framework=framework,
        inference_entry_point=inference_entry_point,
        model_owner='amzn1.abacus.team.djmdvixm5abr3p75c5ca',  # abuse-analytics team
        model_domain=model_domain,
        model_objective=model_objective,
        source_model_inference_output_variable_list=source_model_inference_output_variable_list,
        source_model_inference_input_variable_list=source_model_inference_input_variable_list
    )
    print(f"✅ Registration configured")

2025-10-16 03:49:58,097 - INFO - ✅ Registration configured successfully using RegistrationConfig


✅ Registration configured
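The input variable list built in the Registration cell maps every tabular field to `NUMERIC` and every categorical field to `TEXT`. A small sketch of that mapping follows. The field names come from the transform SQL, but which ones count as tabular or categorical here is an assumption made purely for illustration.

```python
# Assumed split for illustration; the notebook defines the real lists earlier.
tab_field_list = ["claimAmount_value", "COMP_DAYOB"]
cat_field_list = ["PAYMETH", "claim_reason"]

# A set makes the membership test constant-time for long feature lists.
tab_fields = set(tab_field_list)

# A field present in both lists would be typed NUMERIC, so check for overlap first.
overlap = tab_fields & set(cat_field_list)
if overlap:
    raise ValueError(f"Fields listed as both tabular and categorical: {sorted(overlap)}")

source_model_inference_input_variable_list = {
    field: "NUMERIC" if field in tab_fields else "TEXT"
    for field in tab_field_list + cat_field_list
}
print(source_model_inference_input_variable_list)
```

Dictionaries preserve insertion order, so the variables appear with tabular fields first and categorical fields after them.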


### Step 6.4: Configure Preprocessing Steps

In [39]:
# Configure training preprocessing
if "TabularPreprocessing_training" in pending_steps:
    step_name = "TabularPreprocessing_training"
    
    factory.set_step_config(
        step_name,
        job_type='training',
        label_name='is_abuse',
        processing_entry_point='tabular_preprocessing.py',
        use_large_processing_instance=True
    )
    print(f"✅ {step_name} configured")

# Configure calibration preprocessing
if "TabularPreprocessing_calibration" in pending_steps:
    step_name = "TabularPreprocessing_calibration"
    
    factory.set_step_config(
        step_name,
        job_type='calibration',
        label_name='is_abuse',
        processing_entry_point='tabular_preprocessing.py',
        use_large_processing_instance=False
    )
    print(f"✅ {step_name} configured")

2025-10-16 03:50:00,243 - INFO - ✅ TabularPreprocessing_training configured successfully using TabularPreprocessingConfig
2025-10-16 03:50:00,245 - INFO - ✅ TabularPreprocessing_calibration configured successfully using TabularPreprocessingConfig


✅ TabularPreprocessing_training configured
✅ TabularPreprocessing_calibration configured


### Step 6.5: Configure Remaining Steps

**USER INPUT BLOCK**: Fill in the essential fields for each remaining step.
The factory has identified the required fields for each step.

In [40]:
# Get current pending steps
current_pending = factory.get_pending_steps()

print("Remaining steps to configure:")
print("=" * 40)

for step_name in current_pending:
    requirements = factory.get_step_requirements(step_name)
    essential_reqs = [req for req in requirements if req['required']]
    
    print(f"\n{step_name}:")
    print(f"  Essential fields ({len(essential_reqs)}):")
    for req in essential_reqs:
        print(f"    * {req['name']} ({req['type']}) - {req['description']}")
    
    if len(requirements) > len(essential_reqs):
        optional_count = len(requirements) - len(essential_reqs)
        print(f"  Optional fields: {optional_count}")

Remaining steps to configure:
========================================

XGBoostModelEval_calibration:
  Essential fields (2):
    * id_name (str) - Name of the ID field in the dataset (required for evaluation).
    * label_name (str) - Name of the label field in the dataset (required for evaluation).
  Optional fields: 3

ModelCalibration_calibration:
  Essential fields (1):
    * label_field (str) - Name of the label column
  Optional fields: 10
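Before calling `set_step_config`, the essential fields listed above can be compared against the keyword arguments you plan to pass. The sketch below uses a hypothetical helper, `missing_required_fields`, and relies only on the `name` and `required` keys shown in the requirement dicts. It illustrates the idea and is not a substitute for the factory's own validation.

```python
from typing import Any, Dict, List

def missing_required_fields(
    requirements: List[Dict[str, Any]], provided: Dict[str, Any]
) -> List[str]:
    """Names of required fields that are absent from the planned keyword arguments."""
    return [
        req["name"]
        for req in requirements
        if req["required"] and req["name"] not in provided
    ]

# Essential fields as reported for XGBoostModelEval_calibration above.
eval_requirements = [
    {"name": "id_name", "type": "str", "required": True},
    {"name": "label_name", "type": "str", "required": True},
]

incomplete = missing_required_fields(eval_requirements, {"id_name": "order_id"})
complete = missing_required_fields(
    eval_requirements, {"id_name": "order_id", "label_name": "is_abuse"}
)
print(incomplete, complete)
```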


In [47]:
id_name

'order_id'

In [48]:
label_name

'is_abuse'

In [50]:
# Configure Model Evaluation
if "XGBoostModelEval_calibration" in current_pending:
    factory.set_step_config(
        "XGBoostModelEval_calibration",
        job_type='calibration',
        processing_entry_point='xgboost_model_eval.py',
        id_name=id_name,
        label_name=label_name,
    )
    print(f"✅ XGBoostModelEval_calibration configured")

2025-10-16 03:59:10,546 - INFO - ✅ XGBoostModelEval_calibration configured successfully using XGBoostModelEvalConfig


✅ XGBoostModelEval_calibration configured


In [51]:
# Configure Model Calibration
if "ModelCalibration_calibration" in current_pending:
    factory.set_step_config(
        "ModelCalibration_calibration",
        label_field='is_abuse',
        processing_entry_point='model_calibration.py',
        score_field='prob_class_1',
        is_binary=True,
        num_classes=2,
        score_field_prefix='prob_class_',
        multiclass_categories=[0, 1]
    )
    print(f"✅ ModelCalibration_calibration configured")

2025-10-16 03:59:13,052 - INFO - ✅ ModelCalibration_calibration configured successfully using ModelCalibrationConfig


✅ ModelCalibration_calibration configured
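The calibration settings suggest that score columns are named by appending a category to `score_field_prefix`. This is an assumption inferred from the values above, not something the notebook confirms. Under that assumption, a quick consistency check might look like this sketch, where `score_columns` is a hypothetical helper.

```python
from typing import List, Sequence

def score_columns(prefix: str, categories: Sequence[object]) -> List[str]:
    """Build one score column name per category, e.g. 'prob_class_' + 1."""
    return [f"{prefix}{category}" for category in categories]

score_field = "prob_class_1"
score_field_prefix = "prob_class_"
multiclass_categories = [0, 1]
num_classes = 2

columns = score_columns(score_field_prefix, multiclass_categories)

# The binary score field should be one of the derived columns, and the
# number of categories should agree with num_classes.
consistent = score_field in columns and len(columns) == num_classes
print(columns, consistent)
```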


## Step 7: Generate Final Configurations

Now that all steps are configured, we can generate the final configuration instances.
The factory will validate that all essential fields are provided and create the config objects.

In [52]:
# Check final status
final_status = factory.get_configuration_status()
final_pending = factory.get_pending_steps()

print("Final Configuration Status:")
print("=" * 40)
print(f"Base config: {'✅' if final_status['base_config'] else '❌'}")
print(f"Processing config: {'✅' if final_status['base_processing_config'] else '❌'}")
print(f"Pending steps: {len(final_pending)}")

if final_pending:
    print("\nStill pending:")
    for step in final_pending:
        print(f"  - {step}")
    print("\n⚠️  Please configure remaining steps before generating configs.")
else:
    print("\n✅ All steps configured! Ready to generate configurations.")

Final Configuration Status:
========================================
Base config: ✅
Processing config: ✅
Pending steps: 0

✅ All steps configured! Ready to generate configurations.


In [53]:
# Generate final configurations
if not final_pending:
    try:
        print("Generating final configurations...")
        configs = factory.generate_all_configs()
        
        print(f"\n✅ Successfully generated {len(configs)} configuration instances:")
        for i, config in enumerate(configs, 1):
            print(f"  {i:2d}. {config.__class__.__name__}")
        
        print("\n🎉 Configuration generation complete!")
        
    except Exception as e:
        print(f"\n❌ Configuration generation failed: {e}")
        print("\nPlease check that all required fields are provided.")
        configs = None
else:
    print("\n⚠️  Cannot generate configs - some steps are still pending configuration.")
    configs = None

2025-10-16 03:59:28,689 - INFO - ✅ Package auto-configured successfully (only tier 2+ fields)
2025-10-16 03:59:28,691 - INFO - ✅ Payload auto-configured successfully (only tier 2+ fields)
2025-10-16 03:59:28,691 - INFO - ✅ Auto-configured 2 steps with only tier 2+ fields
2025-10-16 03:59:28,691 - INFO - ✅ Returning 10 pre-validated configuration instances


Generating final configurations...

✅ Successfully generated 10 configuration instances:
   1. XGBoostTrainingConfig
   2. CradleDataLoadingConfig
   3. CradleDataLoadingConfig
   4. RegistrationConfig
   5. TabularPreprocessingConfig
   6. TabularPreprocessingConfig
   7. ModelCalibrationConfig
   8. XGBoostModelEvalConfig
   9. PackageConfig
  10. PayloadConfig

🎉 Configuration generation complete!


In [55]:
len(configs)

10

## Step 8: Save to JSON

Finally, we save the generated configurations to a unified JSON file using the existing
`merge_and_save_configs` utility. This creates the same format as the legacy approach
but with much less effort!

In [56]:
if configs:
    # Set up output directory and filename
    MODEL_CLASS = 'xgboost'
    service_name = 'AtoZ'
    
    config_dir = Path(current_dir) / 'pipeline_config'
    config_dir.mkdir(parents=True, exist_ok=True)
    
    config_file_name = 'config.json'
    config_path = config_dir / config_file_name
    
    print(f"Saving configurations to: {config_path}")
    
    # Use the existing merge_and_save_configs utility
    from buyer_abuse_mods_template.cursus.steps.configs.utils import merge_and_save_configs
    
    try:
        merged_config = merge_and_save_configs(configs, str(config_path))
        
        print(f"\n✅ Configuration saved successfully!")
        print(f"   File: {config_path}")
        print(f"   Size: {config_path.stat().st_size / 1024:.1f} KB")
        
        # Also save hyperparameters separately (for compatibility)
        hyperparam_path = config_dir / f'hyperparameters_{region}_{MODEL_CLASS}.json'
        with open(hyperparam_path, 'w') as f:
            json.dump(xgb_hyperparams.model_dump(), f, indent=2, sort_keys=True)
        
        print(f"   Hyperparameters: {hyperparam_path}")
        
        print(f"\n🎉 Interactive configuration complete!")
        print(f"\n📊 Comparison with legacy approach:")
        print(f"   Legacy: 500+ lines of manual configuration")
        print(f"   Interactive: Guided step-by-step process")
        print(f"   Time saved: ~20-25 minutes")
        print(f"   Error reduction: Validation at each step")
        
    except Exception as e:
        print(f"\n❌ Failed to save configurations: {e}")
        
else:
    print("\n⚠️  No configurations to save. Please generate configs first.")

2025-10-16 04:00:32,927 - INFO - 🔧 BuilderAutoDiscovery.__init__ starting - package_root: /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/cursus
2025-10-16 04:00:32,927 - INFO - 🔧 BuilderAutoDiscovery.__init__ - workspace_dirs: []
2025-10-16 04:00:32,928 - INFO - ✅ BuilderAutoDiscovery basic initialization complete
2025-10-16 04:00:32,928 - INFO - ✅ Registry info loaded: 25 steps
2025-10-16 04:00:32,929 - INFO - 🎉 BuilderAutoDiscovery initialization completed successfully


Saving configurations to: /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/bap_example_pipeline/pipeline_config/config.json


2025-10-16 04:00:33,000 - INFO - Discovered 33 core config classes
2025-10-16 04:00:33,009 - INFO - Discovered 3 core hyperparameter classes
2025-10-16 04:00:33,035 - INFO - Discovered 7 base hyperparameter classes from core/base
2025-10-16 04:00:33,036 - INFO - Built complete config classes: 43 total (33 config + 10 hyperparameter auto-discovered)
2025-10-16 04:00:33,036 - INFO - Discovered 43 config classes via step catalog
2025-10-16 04:00:33,037 - INFO - Package location discovery succeeded (bundled): /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/bap_example_pipeline/dockers
2025-10-16 04:00:33,037 - INFO - Hybrid resolution completed successfully via Package Location Discovery: /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/bap_example_pipeline/dockers
2025-10-16 04:00:33,080 - INFO - Discovered 33 core config classes
2025-10-16 04:00:33,085 - INFO - Discovered 3 core hyperparameter classes


✅ Configuration saved successfully!
   File: /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/bap_example_pipeline/pipeline_config/config.json
   Size: 114.9 KB
   Hyperparameters: /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/bap_example_pipeline/pipeline_config/hyperparameters_NA_xgboost.json

🎉 Interactive configuration complete!

📊 Comparison with legacy approach:
   Legacy: 500+ lines of manual configuration
   Interactive: Guided step-by-step process
   Time saved: ~20-25 minutes
   Error reduction: Validation at each step
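Writing a JSON file raises `FileNotFoundError` when its parent directory is missing, so it can help to create the directory right before the write. The sketch below shows that pattern with a hypothetical helper `save_json` and made-up hyperparameter values. It writes to a temporary directory so it is safe to run anywhere.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict

def save_json(payload: Dict[str, Any], path: Path) -> float:
    """Write `payload` as sorted, indented JSON and return the file size in KB."""
    path.parent.mkdir(parents=True, exist_ok=True)  # create missing folders first
    with open(path, "w") as f:
        json.dump(payload, f, indent=2, sort_keys=True)
    return path.stat().st_size / 1024

# Made-up values standing in for xgb_hyperparams.model_dump().
example_hyperparams = {"num_round": 100, "eta": 0.1}

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "hyperparams" / "hyperparameters.json"
    size_kb = save_json(example_hyperparams, out_path)
    written = json.loads(out_path.read_text())

print(f"{size_kb:.3f} KB", written)
```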


### Test if we can load it

In [59]:
from buyer_abuse_mods_template.cursus.steps.configs.config_cradle_data_loading_step import CradleDataLoadingConfig
from buyer_abuse_mods_template.cursus.steps.configs.config_tabular_preprocessing_step import TabularPreprocessingConfig
from buyer_abuse_mods_template.cursus.steps.configs.config_xgboost_training_step import XGBoostTrainingConfig
from buyer_abuse_mods_template.cursus.steps.configs.config_xgboost_model_eval_step import XGBoostModelEvalConfig
from buyer_abuse_mods_template.cursus.steps.configs.config_model_calibration_step import ModelCalibrationConfig
from buyer_abuse_mods_template.cursus.steps.configs.config_package_step import PackageConfig
from buyer_abuse_mods_template.cursus.steps.configs.config_payload_step import PayloadConfig
from buyer_abuse_mods_template.cursus.steps.configs.config_registration_step import RegistrationConfig

In [60]:
from buyer_abuse_mods_template.cursus.steps.configs.utils import load_configs

In [61]:
CONFIG_CLASSES = {
    'XGBoostTrainingConfig': XGBoostTrainingConfig,
    'ModelCalibrationConfig': ModelCalibrationConfig,
    'PackageConfig': PackageConfig,
    'RegistrationConfig': RegistrationConfig,
    'PayloadConfig': PayloadConfig,
    'CradleDataLoadingConfig': CradleDataLoadingConfig,
    'TabularPreprocessingConfig': TabularPreprocessingConfig,
    'XGBoostModelEvalConfig': XGBoostModelEvalConfig,
}
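
Every key in the mapping above equals the `__name__` of its class, so the mapping could also be derived from the classes themselves, which avoids typos in the string keys. The following is a minimal sketch. `build_class_registry` is a hypothetical helper, not part of cursus, and the two stand-in classes exist only so the snippet runs on its own.

```python
from typing import Dict, Iterable


def build_class_registry(classes: Iterable[type]) -> Dict[str, type]:
    """Map each class's __name__ to the class, rejecting duplicate names."""
    registry: Dict[str, type] = {}
    for cls in classes:
        if cls.__name__ in registry:
            raise ValueError(f"Duplicate config class name: {cls.__name__}")
        registry[cls.__name__] = cls
    return registry


# Stand-in classes for illustration only. In the notebook, pass the imported
# config classes (XGBoostTrainingConfig, PackageConfig, ...) instead.
class ExampleTrainingConfig:
    pass


class ExamplePackageConfig:
    pass


registry = build_class_registry([ExampleTrainingConfig, ExamplePackageConfig])
print(sorted(registry))  # ['ExamplePackageConfig', 'ExampleTrainingConfig']
```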

In [62]:
loaded_configs = load_configs(str(config_path), CONFIG_CLASSES)

2025-10-16 04:08:41,715 - INFO - Loading configs from /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/bap_example_pipeline/pipeline_config/config.json
2025-10-16 04:08:41,716 - INFO - Loading configuration from /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/bap_example_pipeline/pipeline_config/config.json
2025-10-16 04:08:41,726 - INFO - Successfully loaded configuration from /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/bap_example_pipeline/pipeline_config/config.json
2025-10-16 04:08:41,726 - INFO - Successfully loaded configs from /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/bap_example_pipeline/pipeline_config/config.json with 10 specific configs
2025-10-16 04:08:41,729 - INFO - Creating additional config instance for CradleDataLoading_calibration (CradleDataLoadingConfig)
2025-10-16 04:08:41,774 - INFO - Discovered 33 core config classes
2025-10-16 04:08:41

In [63]:
len(loaded_configs)

10
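
A count alone does not show which configs were loaded. Assuming `loaded_configs` is a dict-like mapping from step name to config instance (an assumption, since the cell above only shows its length), a per-class tally makes the check more informative. `summarize_configs` is a hypothetical helper, and the stand-in objects and most of the step names below are illustrative rather than taken from the real file.

```python
from collections import Counter
from typing import Any, Dict, Mapping


def summarize_configs(configs: Mapping[str, Any]) -> Dict[str, int]:
    """Count loaded config instances per class name."""
    return dict(Counter(type(cfg).__name__ for cfg in configs.values()))


# Stand-in classes replacing real config classes, for illustration only
class FakeLoadingConfig:
    pass


class FakeTrainingConfig:
    pass


example = {
    "CradleDataLoading_training": FakeLoadingConfig(),
    "CradleDataLoading_calibration": FakeLoadingConfig(),
    "XGBoostTraining": FakeTrainingConfig(),
}
print(summarize_configs(example))
# {'FakeLoadingConfig': 2, 'FakeTrainingConfig': 1}
```

In the notebook, `summarize_configs(loaded_configs)` would show whether a class such as `CradleDataLoadingConfig` was instantiated more than once, as the log line about `CradleDataLoading_calibration` suggests.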

## Summary

This notebook demonstrates the **DAGConfigFactory** approach to pipeline configuration:

### ✅ **Benefits Achieved**

1. **Reduced Complexity**: From 500+ lines of manual configuration to a guided workflow
2. **Base Config Inheritance**: Set common fields once and let every step inherit them
3. **Step-by-Step Guidance**: Clear requirements for each configuration step
4. **Validation**: Checks at each step catch configuration errors early
5. **Reusable DAG**: Pipeline structure defined once, reused across environments
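
The inheritance idea in item 2 can be illustrated with plain dictionaries: shared fields are defined once, and each step supplies only its own additions or overrides. This is a conceptual sketch, not how DAGConfigFactory is implemented. The field names, values, and the `merge_step_config` helper are hypothetical.

```python
from typing import Any, Dict


def merge_step_config(base: Dict[str, Any], step: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict of base fields, overridden by step-specific fields."""
    return {**base, **step}


# Hypothetical shared fields, set once
base_config = {"region": "NA", "author": "example-user"}

# A step adds only what is specific to it
training_config = merge_step_config(base_config, {"instance_type": "ml.m5.4xlarge"})
print(training_config)
# {'region': 'NA', 'author': 'example-user', 'instance_type': 'ml.m5.4xlarge'}
```

Because the helper builds a new dictionary, the shared base stays unchanged and can be reused for every step.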

### 🔄 **Workflow Comparison**

| Aspect | Legacy Approach | Interactive Approach |
|--------|----------------|---------------------|
| **Configuration Effort** | 500+ lines written by hand | Guided step-by-step prompts |
| **Time Required (estimated)** | 30+ minutes | 10-15 minutes |
| **Error Risk** | Higher (manual entry) | Lower (validation at each step) |
| **Reusability** | Copy-paste heavy | DAG-driven |
| **Maintenance** | Manual updates | Automatic inheritance |

### 🚀 **Next Steps**

The generated configuration file can now be used with the existing pipeline compiler:

```python
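# Use with the pipeline compiler (from demo_pipeline.ipynb).
# Assumes `dag`, `config_path`, `pipeline_session`, and `role` are already defined.
# Elsewhere in this notebook cursus is imported as `buyer_abuse_mods_template.cursus`;
# adjust the import path below if that applies to your environment.
from cursus.core.compiler.dag_compiler import PipelineDAGCompiler

dag_compiler = PipelineDAGCompiler(
    config_path=str(config_path),  # string path, as passed to load_configs above
    sagemaker_session=pipeline_session,
    role=role,
)

# Compile the DAG into a pipeline and collect the compilation report
template_pipeline, report = dag_compiler.compile_with_report(dag=dag)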
```

The interactive approach replaces a long manual setup with a guided workflow while remaining fully compatible with the existing cursus infrastructure.