# Interactive Pipeline Configuration with DAGConfigFactory

This notebook demonstrates the new interactive approach to pipeline configuration using the DAGConfigFactory.
Instead of manually creating 500+ lines of static configuration, we use a guided step-by-step process.

## Workflow Overview

1. **Define Pipeline DAG** - Create the pipeline structure
2. **Initialize DAGConfigFactory** - Set up the interactive factory
3. **Configure Base Settings** - Set shared pipeline configuration
4. **Configure Processing Settings** - Set shared processing configuration
5. **Configure Individual Steps** - Set step-specific configurations
6. **Generate Final Configurations** - Create config instances
7. **Save to JSON** - Export unified configuration file

![mods_pipeline_train_eval_calib](./tutorials/mods_end_to_end_xgboost.png)

## Environment Setup

In [1]:
import os
import json
import sys
from pathlib import Path
from datetime import datetime, date
import logging
from typing import List, Optional, Dict, Any


# Set up logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

# Get parent directory of current notebook
project_root = str(Path().absolute().parent)
print(f"project root {project_root}")
if project_root not in sys.path:
    sys.path.insert(0, project_root)
    print(f"add project root {project_root} into system")

project root /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template
add project root /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template into system


In [2]:
# SageMaker and SAIS imports
from sagemaker import Session
from sagemaker.workflow.pipeline_context import PipelineSession

2026-01-08 00:00:49,795 - INFO - Note: NumExpr detected 48 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-01-08 00:00:49,796 - INFO - NumExpr defaulting to 8 threads.
2026-01-08 00:00:50,995 - INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [3]:
print(f"Role: {PipelineSession().get_caller_identity_arn()}")

2026-01-08 00:00:52,343 - INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


Role: arn:aws:iam::601857636239:role/SandboxRole-lukexie-us-east-1


In [4]:
bucket = "sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um"  # "buyer-seller-messaging-reversal"
print(f"Bucket: {bucket}")

Bucket: sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um


## Step 1: Define Pipeline DAG

First, we define the pipeline structure using a DAG (Directed Acyclic Graph).
This replaces the hardcoded pipeline structure from the legacy approach.

In [5]:
from cursus.api.dag.base_dag import PipelineDAG


def create_pytorch_e2e_training_dag() -> PipelineDAG:
    """
    Create a complete end-to-end PyTorch training pipeline DAG.

    This DAG represents the same workflow as the legacy demo_config.ipynb
    but in a structured, reusable format.

    Returns:
        PipelineDAG: The directed acyclic graph for the pipeline
    """
    dag = PipelineDAG()

    # Add all nodes - matching the structure from demo_config.ipynb
    dag.add_node("CradleDataLoading_training")  # Training data loading
    dag.add_node("TabularPreprocessing_training")  # Training data preprocessing
    dag.add_node("TokenizerTraining_training")  # Customized Tokenizer training
    dag.add_node("PyTorchTraining")  # PyTorch model training

    dag.add_node("CradleDataLoading_calibration")  # Dummy data load for calibration
    dag.add_node(
        "TabularPreprocessing_calibration"
    )  # Tabular preprocessing for calibration
    dag.add_node("PyTorchModelEval_calibration")  # Model evaluation step
    dag.add_node(
        "PercentileModelCalibration_calibration"
    )  # Model calibration step with calibration variant
    dag.add_node("Package")  # Package step
    dag.add_node("Registration")  # MIMS registration step
    dag.add_node("Payload")  # Payload step

    # Define dependencies - training flow
    dag.add_edge("CradleDataLoading_training", "TabularPreprocessing_training")
    dag.add_edge("TabularPreprocessing_training", "TokenizerTraining_training")
    dag.add_edge("TokenizerTraining_training", "PyTorchTraining")
    dag.add_edge("TabularPreprocessing_training", "PyTorchTraining")

    # Calibration data flow - load, then preprocess
    dag.add_edge("CradleDataLoading_calibration", "TabularPreprocessing_calibration")

    # Evaluation flow
    dag.add_edge("PyTorchTraining", "PyTorchModelEval_calibration")
    dag.add_edge(
        "TabularPreprocessing_calibration", "PyTorchModelEval_calibration"
    )  # Use labeled calibration data

    # Model calibration flow - depends on model evaluation
    dag.add_edge(
        "PyTorchModelEval_calibration", "PercentileModelCalibration_calibration"
    )

    # Output flow
    dag.add_edge("PercentileModelCalibration_calibration", "Package")
    dag.add_edge("PyTorchTraining", "Package")  # Raw model is also input to packaging
    dag.add_edge("PyTorchTraining", "Payload")  # Raw model is also input to payload generation
    dag.add_edge("Package", "Registration")
    dag.add_edge("Payload", "Registration")

    logger.info(
        f"Created PyTorch E2E DAG with {len(dag.nodes)} nodes and {len(dag.edges)} edges"
    )
    return dag


# Create the pipeline DAG
dag = create_pytorch_e2e_training_dag()

print(f"Pipeline DAG created with {len(dag.nodes)} steps:")
for node in dag.nodes:
    print(f"  - {node}")

2026-01-08 00:00:56,391 - INFO - Added node: CradleDataLoading_training
2026-01-08 00:00:56,392 - INFO - Added node: TabularPreprocessing_training
2026-01-08 00:00:56,392 - INFO - Added node: TokenizerTraining_training
2026-01-08 00:00:56,392 - INFO - Added node: PyTorchTraining
2026-01-08 00:00:56,393 - INFO - Added node: CradleDataLoading_calibration
2026-01-08 00:00:56,393 - INFO - Added node: TabularPreprocessing_calibration
2026-01-08 00:00:56,393 - INFO - Added node: PyTorchModelEval_calibration
2026-01-08 00:00:56,393 - INFO - Added node: PercentileModelCalibration_calibration
2026-01-08 00:00:56,393 - INFO - Added node: Package
2026-01-08 00:00:56,394 - INFO - Added node: Registration
2026-01-08 00:00:56,394 - INFO - Added node: Payload
2026-01-08 00:00:56,395 - INFO - Added edge: CradleDataLoading_training -> TabularPreprocessing_training
2026-01-08 00:00:56,395 - INFO - Added edge: TabularPreprocessing_training -> TokenizerTraining_training
2026-01-08 00:00:56,395 - INFO - Ad

Pipeline DAG created with 11 steps:
  - CradleDataLoading_training
  - TabularPreprocessing_training
  - TokenizerTraining_training
  - PyTorchTraining
  - CradleDataLoading_calibration
  - TabularPreprocessing_calibration
  - PyTorchModelEval_calibration
  - PercentileModelCalibration_calibration
  - Package
  - Registration
  - Payload
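
As a quick sanity check on the structure defined above, the edge list can be verified for acyclicity and turned into a valid execution order. The sketch below uses only the standard library (`graphlib`, Python 3.9+) and a hand-copied edge list; it does not rely on any `PipelineDAG` method, so treat it as an independent illustration rather than part of the cursus API.

```python
from graphlib import TopologicalSorter

# Edge list copied by hand from create_pytorch_e2e_training_dag(): (upstream, downstream)
edges = [
    ("CradleDataLoading_training", "TabularPreprocessing_training"),
    ("TabularPreprocessing_training", "TokenizerTraining_training"),
    ("TokenizerTraining_training", "PyTorchTraining"),
    ("TabularPreprocessing_training", "PyTorchTraining"),
    ("CradleDataLoading_calibration", "TabularPreprocessing_calibration"),
    ("PyTorchTraining", "PyTorchModelEval_calibration"),
    ("TabularPreprocessing_calibration", "PyTorchModelEval_calibration"),
    ("PyTorchModelEval_calibration", "PercentileModelCalibration_calibration"),
    ("PercentileModelCalibration_calibration", "Package"),
    ("PyTorchTraining", "Package"),
    ("PyTorchTraining", "Payload"),
    ("Package", "Registration"),
    ("Payload", "Registration"),
]

# TopologicalSorter expects a mapping of node -> set of predecessor nodes
predecessors = {}
for upstream, downstream in edges:
    predecessors.setdefault(upstream, set())
    predecessors.setdefault(downstream, set()).add(upstream)

# static_order() raises graphlib.CycleError if the graph contains a cycle
execution_order = list(TopologicalSorter(predecessors).static_order())
position = {name: index for index, name in enumerate(execution_order)}

for step in execution_order:
    print(step)
```

`Registration` is the only node without outgoing edges, so it always comes last in any valid order.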


## Step 2: Initialize DAGConfigFactory

Now we initialize the DAGConfigFactory with our DAG. This will automatically:
- Map DAG nodes to configuration classes
- Set up the interactive workflow
- Prepare for step-by-step configuration

In [6]:
from cursus.api.factory.dag_config_factory import DAGConfigFactory

# Initialize the factory with our DAG
factory = DAGConfigFactory(dag)

# Get the config class mapping
config_map = factory.get_config_class_map()

print("DAG Node to Config Class Mapping:")
print("=" * 50)
for node_name, config_class in config_map.items():
    print(f"  {node_name:<35} -> {config_class.__name__}")

print(f"\nSuccessfully mapped {len(config_map)} steps to configuration classes.")

2026-01-08 00:01:02,846 - INFO - 🔧 BuilderAutoDiscovery.__init__ starting - package_root: /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/cursus
2026-01-08 00:01:02,847 - INFO - 🔧 BuilderAutoDiscovery.__init__ - workspace_dirs: []
2026-01-08 00:01:02,847 - INFO - ✅ BuilderAutoDiscovery basic initialization complete
2026-01-08 00:01:02,847 - INFO - ✅ Registry info loaded: 49 steps
2026-01-08 00:01:02,848 - INFO - 🎉 BuilderAutoDiscovery initialization completed successfully
2026-01-08 00:01:02,848 - INFO - 🔍 ScriptAutoDiscovery.__init__ starting - package_root: /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/cursus
2026-01-08 00:01:02,848 - INFO - 🔍 ScriptAutoDiscovery.__init__ - workspace_dirs: []
2026-01-08 00:01:02,848 - INFO - 🔍 ScriptAutoDiscovery.__init__ - priority_workspace_dir: None
2026-01-08 00:01:02,849 - INFO - ✅ Registry info loaded: 49 steps
2026-01-08 00:01:02,849 - INFO - 🎉 ScriptAutoD

DAG Node to Config Class Mapping:
  CradleDataLoading_training          -> CradleDataLoadingConfig
  TabularPreprocessing_training       -> TabularPreprocessingConfig
  TokenizerTraining_training          -> TokenizerTrainingConfig
  PyTorchTraining                     -> PyTorchTrainingConfig
  CradleDataLoading_calibration       -> CradleDataLoadingConfig
  TabularPreprocessing_calibration    -> TabularPreprocessingConfig
  PyTorchModelEval_calibration        -> PyTorchModelEvalConfig
  PercentileModelCalibration_calibration -> PercentileModelCalibrationConfig
  Package                             -> PackageConfig
  Registration                        -> RegistrationConfig
  Payload                             -> PayloadConfig

Successfully mapped 11 steps to configuration classes.
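
The mapping printed above follows a visible naming pattern: a trailing job-type suffix (`_training`, `_calibration`) is dropped and `Config` is appended. The helper below is a hypothetical illustration of that pattern only. The factory itself presumably resolves classes through its step registry, so `guess_config_class_name` and `KNOWN_JOB_TYPES` are assumptions, not cursus functions.

```python
# Job-type suffixes that appear in this notebook's node names (assumption: not exhaustive)
KNOWN_JOB_TYPES = ("training", "calibration")


def guess_config_class_name(node_name: str) -> str:
    """Guess the config class name for a DAG node from the pattern in the output above."""
    base, separator, suffix = node_name.rpartition("_")
    # Only strip the suffix when it is a recognised job type
    if separator and suffix in KNOWN_JOB_TYPES:
        node_name = base
    return f"{node_name}Config"


for name in ("CradleDataLoading_training", "PyTorchModelEval_calibration", "Package"):
    print(f"{name:<35} -> {guess_config_class_name(name)}")
```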


## Step 3: Configure Base Pipeline Settings

These settings are shared across ALL pipeline steps. Instead of repeating them
in every step configuration, we set them once here.

In [7]:
# Get base configuration requirements
base_requirements = factory.get_base_config_requirements()

print("Base Pipeline Configuration Requirements:")
print("=" * 50)
for req in base_requirements:
    marker = "*" if req["required"] else " "
    default_info = (
        f" (default: {req.get('default')})"
        if not req["required"] and "default" in req
        else ""
    )
    print(f"{marker} {req['name']:<25} ({req['type']}){default_info}")
    print(f"    {req['description']}")
    print()

Base Pipeline Configuration Requirements:
* author                    (str)
    Author or owner of the pipeline.

* bucket                    (str)
    S3 bucket name for pipeline artifacts and data.

* role                      (str)
    IAM role for pipeline execution.

* region                    (str)
    Custom region code (NA, EU, FE) for internal logic.

* service_name              (str)
    Service name for the pipeline.

* pipeline_version          (str)
    Version string for the SageMaker Pipeline.

  model_class               (str) (default: xgboost)
    Model class (e.g., XGBoost, PyTorch).

  current_date              (str) (default: PydanticUndefined)
    Current date, typically used for versioning or pathing.

  framework_version         (str) (default: 2.1.0)
    Default framework version (e.g., PyTorch).

  py_version                (str) (default: py310)
    Default Python version.

  source_dir                (Optional) (default: None)
    Common source directory fo

In [8]:
# Set up basic configuration values
region_list = ["NA", "EU", "FE"]
region_selection = 0
region = region_list[region_selection]

# Map region to AWS region
region_mapping = {"NA": "us-east-1", "EU": "eu-west-1", "FE": "us-west-2"}
aws_region = region_mapping[region]

service_name = "Names3Risk"
pipeline_version = "1.0.0"
author = "lukexie"
model_class = "pytorch"

# Get current directory and set up paths
current_dir = Path.cwd()
package_root = Path(current_dir).resolve()
source_dir = Path("dockers")
project_root_folder = "names3risk_pytorch"

# Set base configuration
factory.set_base_config(
    # Infrastructure settings
    bucket=bucket,
    role=PipelineSession().get_caller_identity_arn(),
    region=region,
    aws_region=aws_region,
    # Project identification
    author=author,
    service_name=service_name,
    pipeline_version=pipeline_version,
    model_class=model_class,
    # Framework settings
    framework_version="2.1.0",
    py_version="py310",
    source_dir=str(source_dir),
    project_root_folder=project_root_folder,
    # Date settings
    current_date=date.today().strftime("%Y-%m-%d"),
    # Step caching (disabled for this run)
    enable_caching=False,
    # Use Secure PyPI
    use_secure_pypi=True,
)

print("✅ Base pipeline configuration set successfully!")
print(f"   Region: {region} ({aws_region})")
print(f"   Service: {service_name}")
print(f"   Author: {author}")
print(f"   Pipeline Version: {pipeline_version}")

2026-01-08 00:01:28,166 - INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
2026-01-08 00:01:28,319 - INFO - Base configuration set successfully


✅ Base pipeline configuration set successfully!
   Region: NA (us-east-1)
   Service: Names3Risk
   Author: lukexie
   Pipeline Version: 1.0.0
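
Before calling `set_base_config`, it can help to confirm that every required field has a value. The sketch below assumes the requirement entries are dicts with the `name` and `required` keys used by the printing loop above; `missing_required` is a hypothetical helper and the sample list is hand-written, not returned by the factory.

```python
from typing import Any, Dict, List


def missing_required(
    requirements: List[Dict[str, Any]], provided: Dict[str, Any]
) -> List[str]:
    """Return the names of required fields that have no value in `provided`."""
    return [
        req["name"]
        for req in requirements
        if req["required"] and provided.get(req["name"]) is None
    ]


# Hand-written sample in the same shape as the entries printed above
sample_requirements = [
    {"name": "author", "type": "str", "required": True},
    {"name": "bucket", "type": "str", "required": True},
    {"name": "model_class", "type": "str", "required": False, "default": "xgboost"},
]

print(missing_required(sample_requirements, {"author": "lukexie"}))  # ['bucket']
```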


## Step 4: Configure Base Processing Settings

These settings are shared across all PROCESSING steps (data loading, preprocessing, etc.)
but not training steps.

In [9]:
# Get base processing configuration requirements
processing_requirements = factory.get_base_processing_config_requirements()

if processing_requirements:
    print("Base Processing Configuration Requirements:")
    print("=" * 50)
    for req in processing_requirements:
        marker = "*" if req["required"] else " "
        default_info = (
            f" (default: {req.get('default')})"
            if not req["required"] and "default" in req
            else ""
        )
        print(f"{marker} {req['name']:<30} ({req['type']}){default_info}")
        print(f"    {req['description']}")
        print()
else:
    print("No base processing configuration required for this pipeline.")

Base Processing Configuration Requirements:
  processing_instance_count      (int) (default: 1)
    Instance count for processing jobs

  processing_volume_size         (int) (default: 500)
    Volume size for processing jobs in GB

  processing_instance_type_large (str) (default: ml.m5.4xlarge)
    Large instance type for processing step.

  processing_instance_type_small (str) (default: ml.m5.2xlarge)
    Small instance type for processing step.

  use_large_processing_instance  (bool) (default: False)
    Set to True to use large instance type, False for small instance type.

  processing_source_dir          (Optional) (default: None)
    Source directory for processing scripts. Falls back to base source_dir if not provided.

  processing_entry_point         (Optional) (default: None)
    Entry point script for processing, must be relative to source directory. Can be overridden by derived classes.

  processing_script_arguments    (Optional) (default: None)
    Optional arguments fo

In [10]:
# Set base processing configuration if needed
if processing_requirements:
    processing_source_dir = source_dir / "scripts"

    factory.set_base_processing_config(
        # Processing infrastructure
        processing_source_dir=str(processing_source_dir),
        processing_instance_type_large="ml.m5.12xlarge",
        processing_instance_type_small="ml.m5.4xlarge",
    )

    print("✅ Base processing configuration set successfully!")
    print(f"   Processing source: {processing_source_dir}")
else:
    print("✅ No base processing configuration needed.")

2026-01-08 00:01:36,527 - INFO - Working directory discovery succeeded (direct): /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/names3risk_pytorch/dockers
2026-01-08 00:01:36,527 - INFO - Hybrid resolution completed successfully via Working Directory Discovery: /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/names3risk_pytorch/dockers
2026-01-08 00:01:36,527 - INFO - Base processing configuration set successfully


✅ Base processing configuration set successfully!
   Processing source: dockers/scripts
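
The description of `use_large_processing_instance` suggests a simple switch between the two instance types set above. A minimal sketch of that selection, assuming nothing beyond the field descriptions (the function name is hypothetical and this is not the cursus implementation):

```python
def effective_processing_instance_type(use_large: bool, small: str, large: str) -> str:
    """Pick the processing instance type according to the use_large flag."""
    return large if use_large else small


# Values set in the cell above; use_large_processing_instance defaults to False
print(effective_processing_instance_type(False, "ml.m5.4xlarge", "ml.m5.12xlarge"))
```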


## Step 5: Check Configuration Status

Let's see which steps still need configuration.

In [11]:
# Check current status
status = factory.get_configuration_status()
pending_steps = factory.get_pending_steps()

print("Configuration Status:")
print("=" * 30)
print(f"Base config set: {'✅' if status['base_config'] else '❌'}")
print(f"Processing config set: {'✅' if status['base_processing_config'] else '❌'}")
print(f"Total steps: {len(config_map)}")
print(f"Pending steps: {len(pending_steps)}")
print()

if pending_steps:
    print("Steps needing configuration:")
    for step in pending_steps:
        print(f"  - {step}")
else:
    print("✅ All steps configured!")

Configuration Status:
Base config set: ✅
Processing config set: ✅
Total steps: 11
Pending steps: 10

Steps needing configuration:
  - CradleDataLoading_training
  - TabularPreprocessing_training
  - TokenizerTraining_training
  - PyTorchTraining
  - CradleDataLoading_calibration
  - TabularPreprocessing_calibration
  - PyTorchModelEval_calibration
  - PercentileModelCalibration_calibration
  - Registration
  - Payload


## Step 6: Configure Individual Steps

Now we configure each step with its specific requirements. The factory will show us
only the fields that are unique to each step (not inherited from base configs).

### Step 6.1: Configure Training Step

This config is for **TrainingStep**.
* It asks the user to provide all the information needed to construct a **Container** and start a **Training Job**.
* The most important information is provided in the **HyperParameter** section.


In [12]:
tab_field_list = [
    f"Abuse.abuse_fap_action_by_customer_inline_transform_{region.lower()}.n_claims_solicit_count_last_365_days",
    f"Abuse.abuse_fap_action_by_customer_inline_transform_{region.lower()}.n_claims_warn_count_last_365_days",
    f"Abuse.abuse_fap_action_by_customer_inline_transform_{region.lower()}.n_concession_solicit_count_last_365_days",
    f"Abuse.abuse_fap_action_by_customer_inline_transform_{region.lower()}.n_concession_warn_count_last_365_days",
    f"Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_max_buyer_order_message_time_gap",
    f"Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_max_order_message_time_gap",
    f"Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_max_seller_order_message_time_gap",
    f"Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_message_count_with_diff_topic_si",
    f"Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_message_count_with_notr_topic_si",
    f"Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_message_count_with_return_keywords_si",
    f"Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_min_buyer_message_count",
    f"Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_min_buyer_order_message_time_gap",
    f"Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_min_message_count",
    f"Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_min_order_message_time_gap",
    f"Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_min_seller_message_count",
    f"Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_min_seller_order_message_time_gap",
    f"Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_total_buyer_message_count",
    f"Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_total_message_count",
    f"Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_total_seller_message_count",
    f"Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_total_topic_count",
    "Abuse.completed_afn_orders_by_customer_marketplace.n_afn_order_count_last_365_days",
    "Abuse.completed_afn_orders_by_customer_marketplace.n_afn_unit_amount_last_365_days",
    "Abuse.completed_afn_orders_by_customer_marketplace.n_afn_unit_count_last_365_days",
    "Abuse.completed_mfn_orders_by_customer_marketplace.n_mfn_order_count_last_365_days",
    "Abuse.completed_mfn_orders_by_customer_marketplace.n_mfn_unit_amount_last_365_days",
    "Abuse.completed_mfn_orders_by_customer_marketplace.n_mfn_unit_count_last_365_days",
    "Abuse.dnr_by_customer_marketplace.n_dnr_amount_si_last_365_days",
    "Abuse.dnr_by_customer_marketplace.n_dnr_order_count_last_365_days",
    "Abuse.dnr_by_customer_marketplace.n_dnr_unit_amount_last_365_days",
    "Abuse.dnr_by_customer_marketplace.n_dnr_unit_count_last_365_days",
    f"Abuse.mfn_a2z_claims_by_customer_{region.lower()}.n_mfn_claims_amount_last_365_days",
    f"Abuse.mfn_a2z_claims_by_customer_{region.lower()}.n_mfn_claims_count_last_365_days",
    f"Abuse.mfn_a2z_claims_by_customer_{region.lower()}.n_mfn_diff_claims_amount_last_365_days",
    f"Abuse.mfn_a2z_claims_by_customer_{region.lower()}.n_mfn_diff_claims_count_last_365_days",
    f"Abuse.mfn_a2z_claims_by_customer_{region.lower()}.n_mfn_notr_claims_amount_last_365_days",
    f"Abuse.mfn_a2z_claims_by_customer_{region.lower()}.n_mfn_notr_claims_count_last_365_days",
    "Abuse.mfn_categorized_refunds_by_customer_marketplace.n_mfn_diff_refunds_order_count_last_365_days",
    "Abuse.mfn_categorized_refunds_by_customer_marketplace.n_mfn_diff_refunds_unit_amount_last_365_days",
    "Abuse.mfn_categorized_refunds_by_customer_marketplace.n_mfn_diff_refunds_unit_count_last_365_days",
    "Abuse.mfn_categorized_refunds_by_customer_marketplace.n_mfn_notr_refunds_order_count_last_365_days",
    "Abuse.mfn_categorized_refunds_by_customer_marketplace.n_mfn_notr_refunds_unit_amount_last_365_days",
    "Abuse.mfn_categorized_refunds_by_customer_marketplace.n_mfn_notr_refunds_unit_count_last_365_days",
    "Abuse.mfn_categorized_refunds_si_by_customer_marketplace.n_mfn_diff_refunds_si_365_days",
    "Abuse.mfn_categorized_refunds_si_by_customer_marketplace.n_mfn_notr_refunds_si_365_days",
    "Abuse.mfn_refunds_by_customer_marketplace.n_mfn_refund_order_count_last_365_days",
    "Abuse.mfn_refunds_by_customer_marketplace.n_mfn_refund_unit_amount_last_365_days",
    "Abuse.mfn_refunds_by_customer_marketplace.n_mfn_refund_unit_count_last_365_days",
    "Abuse.mfn_refunds_si_by_customer_marketplace.n_mfn_refund_amount_si_last_365_days",
    "Abuse.order_to_execution_time_from_eventvariables.n_order_to_execution",
    "Abuse.shiptrack_flag_by_order.n_any_delivered",
    "Abuse.shiptrack_flag_by_order.n_any_available_for_pickup",
    "Abuse.shiptrack_flag_by_order.n_any_partial_delivered",
    "Abuse.shiptrack_flag_by_order.n_any_undeliverable",
    "Abuse.shiptrack_flag_by_order.n_any_returning",
    "Abuse.shiptrack_flag_by_order.n_any_returned",
    "COMP_DAYOB",
    "claimAmount_value",
    "claimantInfo_allClaimCount365day",
    "claimantInfo_lifetimeClaimCount",
    "claimantInfo_pendingClaimCount",
]

In [13]:
cat_field_list = [
    "PAYMETH",
    "claim_reason",
    "claimantInfo_status",
    "shipments_status",
    "Abuse.buyer_abuse_bsm_message_body_concat_by_order_marketplaceid.c_message_body_concat_by_order",
]

In [14]:
label_name = "is_abuse"  # "llm_reversal_flag"
id_name = "order_id"

in_house_field_list = [
    "marketplace_id",
    id_name,
    label_name,
]

In [15]:
full_field_list = tab_field_list + cat_field_list + in_house_field_list
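
Because `full_field_list` is assembled by concatenating three lists, a quick consistency check can catch duplicated or missing columns before the hyperparameters are built. The helper below is a hypothetical sketch with tiny made-up lists; the rules it enforces are assumptions about what a sensible field setup looks like, not validation performed by cursus.

```python
from typing import List


def check_field_lists(
    tab_fields: List[str],
    cat_fields: List[str],
    full_fields: List[str],
    label: str,
    id_field: str,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    overlap = set(tab_fields) & set(cat_fields)
    if overlap:
        problems.append(f"fields listed as both tabular and categorical: {sorted(overlap)}")
    if len(set(full_fields)) != len(full_fields):
        problems.append("full field list contains duplicates")
    missing = (set(tab_fields) | set(cat_fields)) - set(full_fields)
    if missing:
        problems.append(f"fields missing from full list: {sorted(missing)}")
    for required in (label, id_field):
        if required not in full_fields:
            problems.append(f"'{required}' is not in the full field list")
    return problems


# Made-up miniature lists, used only to exercise the checks
tab = ["n_orders", "n_refunds"]
cat = ["PAYMETH", "claim_reason"]
full = tab + cat + ["marketplace_id", "order_id", "is_abuse"]
print(check_field_lists(tab, cat, full, label="is_abuse", id_field="order_id"))  # []
```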

In [16]:
batch_size = 2
lr = 2e-5  # 3e-05
max_epochs = 5  # 3, #15,
metric_choices = ["f1_score", "auroc"]
optimizer = "SGD"

In [17]:
# First, let's create the hyperparameters
from cursus.core.base.hyperparameters_base import ModelHyperparameters
from cursus.steps.hyperparams.hyperparameters_bimodal import BimodalModelHyperparameters


# Create base hyperparameters
base_hyperparameter = ModelHyperparameters(
    full_field_list=full_field_list,
    cat_field_list=cat_field_list,
    tab_field_list=tab_field_list,
    label_name=label_name,
    id_name=id_name,
    multiclass_categories=[0, 1],
    batch_size=batch_size,
    lr=lr,
    max_epochs=max_epochs,
    metric_choices=metric_choices,
    optimizer=optimizer,
)

In [18]:
model_class = "bimodal_gate_fusion"  # "bimodal_bert"
tokenizer = "bert-base-multilingual-uncased"
text_name = "Abuse.buyer_abuse_bsm_message_body_concat_by_order_marketplaceid.c_message_body_concat_by_order"
lr_decay = 0.05
max_sen_len = 512
chunk_trancate = True
max_total_chunks = 3
momentum = 0.9
pretrained_embedding = True
reinit_layers = 2
reinit_pooler = True
run_scheduler = True
load_ckpt = False
hidden_common_dim = 100
val_check_interval = 0.25
warmup_steps = 300
weight_decay = 0
early_stop_metric = "val_loss"
early_stop_patience = 3
text_input_ids_key = "input_ids"
text_attention_mask_key = "attention_mask"
fp16 = False
use_gradient_checkpointing = False  # False

In [19]:
# Create bimodal (text + tabular) hyperparameters
bimodal_hyperparams = BimodalModelHyperparameters.from_base_hyperparam(
    base_hyperparameter,
    model_class=model_class,
    text_name=text_name,
    tokenizer=tokenizer,
    text_input_ids_key=text_input_ids_key,
    text_attention_mask_key=text_attention_mask_key,
    momentum=momentum,
    lr_decay=lr_decay,
    weight_decay=weight_decay,
    max_sen_len=max_sen_len,
    chunk_trancate=chunk_trancate,
    max_total_chunks=max_total_chunks,
    reinit_layers=reinit_layers,
    reinit_pooler=reinit_pooler,
    run_scheduler=run_scheduler,
    load_ckpt=load_ckpt,
    hidden_common_dim=hidden_common_dim,
    warmup_steps=warmup_steps,
    val_check_interval=val_check_interval,
    early_stop_metric=early_stop_metric,
    early_stop_patience=early_stop_patience,
    fp16=fp16,
    use_gradient_checkpointing=use_gradient_checkpointing,
)

print("✅ Hyperparameters created")
print(
    f"   Features: {len(full_field_list)} total, {len(tab_field_list)} numerical, {len(cat_field_list)} categorical"
)
print(f"   Max epochs: {bimodal_hyperparams.max_epochs}")

✅ Hyperparameters created
   Features: 68 total, 60 numerical, 5 categorical
   Max epochs: 5


In [20]:
bimodal_hyperparams

BimodalModelHyperparameters(full_field_list=['Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_solicit_count_last_365_days', 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_warn_count_last_365_days', 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_solicit_count_last_365_days', 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_warn_count_last_365_days', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_buyer_order_message_time_gap', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_order_message_time_gap', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_seller_order_message_time_gap', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_message_count_with_diff_topic_si', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_message_count_with_notr_topic_si', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_message_count_with_

In [21]:
instance_type_list = [
    "ml.m5.4xlarge",
    "ml.m5.12xlarge",
    "ml.p3.16xlarge",
    "ml.g4dn.16xlarge",
    "ml.g5.12xlarge",
    "ml.g5.16xlarge",
]

In [22]:
instance_select = -4  # -2 #-1
instance_type_list[instance_select]

'ml.p3.16xlarge'

In [23]:
# Configure PyTorch training
if "PyTorchTraining" in pending_steps:
    step_name = "PyTorchTraining"
    training_volume_size = 800
    factory.set_step_config(
        step_name,
        training_instance_type=instance_type_list[instance_select],
        training_entry_point="pytorch_training.py",
        training_volume_size=training_volume_size,
    )
    print(f"✅ {step_name} configured")
    print(f"   Instance type: {instance_type_list[instance_select]}")
    print(f"   Volume size: {training_volume_size} GB")

2025-12-20 22:37:19,371 - INFO - ✅ PyTorchTraining configured successfully using PyTorchTrainingConfig


✅ PyTorchTraining configured
   Instance type: ml.p3.16xlarge
   Volume size: 800 GB


### Step 6.2: Configure Data Loading Steps


In this section, the user provides the input needed to construct a **Cradle profile**. A Cradle profile has **four** sections:
1. **Data Source Specification**: specify
    1. *data source* (MDS, EDX, ANDES)
    2. *input schema*
2. **Transform Specification**: specify
    1. *transform SQL*
    2. *job split*
3. **Output Specification**: specify
    1. *output path*
    2. *output format* (CSV, UNESCAPED_TSV, JSON, ION, PARQUET)
    3. *output schema*
    4. *save mode*
4. **Cradle Job Specification**: specify
    1. *cradle account*
    2. *cluster_type*

This config is for **CradleDataLoadingStep**, a customized step provided by [SecureAISandboxWorkflowPythonSDK](https://code.amazon.com/packages/SecureAISandboxWorkflowPythonSDK/trees/mainline#).
* This step inherits from **MODSPredefinedProcessingStep**, a customized base class that itself inherits from **ScriptProcessingStep**. The source code is in [MODSWorkflowCore](https://code.amazon.com/packages/MODSWorkflowCore/trees/mainline#).
* This step needs to load an **Execution Document** to take action.
* This step itself does not have many options.
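
To make the four sections concrete, the outline below models them as a plain nested dict and checks it against the options listed above. The section names, key names, and `validate_profile_outline` are hypothetical, chosen only to mirror the list; the real profile is built from the cursus config classes imported in the next cell, whose fields may differ.

```python
from typing import Any, Dict, List

# Allowed values taken from the list above
ALLOWED_DATA_SOURCES = {"MDS", "EDX", "ANDES"}
ALLOWED_OUTPUT_FORMATS = {"CSV", "UNESCAPED_TSV", "JSON", "ION", "PARQUET"}

# Hypothetical section and key names that mirror the four sections
REQUIRED_KEYS = {
    "data_sources_spec": {"data_sources", "input_schema"},
    "transform_spec": {"transform_sql", "job_split"},
    "output_spec": {"output_path", "output_format", "output_schema", "save_mode"},
    "cradle_job_spec": {"cradle_account", "cluster_type"},
}


def validate_profile_outline(outline: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return a list of problems found in the outline; empty means it is well-formed."""
    problems = []
    for section, keys in REQUIRED_KEYS.items():
        missing = keys - set(outline.get(section, {}))
        if missing:
            problems.append(f"{section}: missing {sorted(missing)}")
    for source in outline.get("data_sources_spec", {}).get("data_sources", []):
        if source not in ALLOWED_DATA_SOURCES:
            problems.append(f"unknown data source: {source}")
    output_format = outline.get("output_spec", {}).get("output_format")
    if output_format is not None and output_format not in ALLOWED_OUTPUT_FORMATS:
        problems.append(f"unknown output format: {output_format}")
    return problems


# Placeholder values (None) stand in for settings that are specific to your account
outline = {
    "data_sources_spec": {"data_sources": ["MDS", "EDX"], "input_schema": []},
    "transform_spec": {"transform_sql": "SELECT ...", "job_split": None},
    "output_spec": {
        "output_path": None,
        "output_format": "PARQUET",
        "output_schema": [],
        "save_mode": None,
    },
    "cradle_job_spec": {"cradle_account": None, "cluster_type": None},
}
print(validate_profile_outline(outline))  # []
```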

In [24]:
from cursus.steps.configs.config_cradle_data_loading_step import (
    CradleDataLoadingConfig,
    MdsDataSourceConfig,
    EdxDataSourceConfig,
    DataSourceConfig,
    DataSourcesSpecificationConfig,
    JobSplitOptionsConfig,
    TransformSpecificationConfig,
    OutputSpecificationConfig,
    CradleJobSpecificationConfig,
)

#### Cradle Data Loading (Training) Step

In [25]:
training_start_datetime = "2025-01-01T00:00:00"  # "2025-05-11T00:00:00"  #'2024-12-01T00:00:00'  #'2024-03-01T00:00:00'
training_end_datetime = "2025-10-31T00:00:00"

In [26]:
mds_service_name = "AtoZ"
org_id = 0

In [27]:
mds_field_list = ["objectId", "transactionDate"] + tab_field_list + cat_field_list
mds_output_schema = [
    {"field_name": field, "field_type": "STRING"} for field in mds_field_list
]

In [28]:
train_edx_arn = {
    "NA": 'arn:amazon:edx:iad::manifest/trms-abuse-analytics/buyer-seller-messaging/bsm-tag-atoz/["20251215",2025-01-01T00:00:00Z,2025-10-31T01:00:00Z,"LLM_TAG_WW_Train"]',
    "EU": 'arn:amazon:edx:iad::manifest/trms-abuse-analytics/buyer-seller-messaging/bsm-tag-atoz/["20251215",2025-01-01T00:00:00Z,2025-10-31T01:00:00Z,"LLM_TAG_WW_Train"]',
    "FE": 'arn:amazon:edx:iad::manifest/trms-abuse-analytics/buyer-seller-messaging/bsm-tag-atoz/["20251215",2025-01-01T00:00:00Z,2025-10-31T01:00:00Z,"LLM_TAG_WW_Train"]',
}

In [29]:
tag_schema = [
    "order_id",
    "customer_id",
    "marketplace_id",
    "llm_reversal_flag",
    "is_abuse",
]
edx_schema_overrides = [
    {"field_name": field, "field_type": "STRING"} for field in tag_schema
]

In [30]:
def get_all_fields(mds_fields: List[str], tag_fields: List[str]) -> List[str]:
    """
    Get a combined list of all fields from MDS and EDX sources.

    This function handles case-insensitivity to avoid duplicate columns in SQL SELECT
    statements where the only difference is case (e.g., "OrderId" and "orderid").
    When duplicates with different cases are found, the first occurrence is kept.

    Args:
        mds_fields (List[str]): List of MDS fields
        tag_fields (List[str]): List of tag fields

    Returns:
        List[str]: Combined and deduplicated list of fields
    """
    # Track lowercase field names to detect duplicates
    seen_lowercase = {}
    deduplicated_fields = []

    # Process all fields, keeping only the first occurrence when case-insensitive duplicates exist
    for field in mds_fields + tag_fields:
        field_lower = field.lower()
        if field_lower not in seen_lowercase:
            seen_lowercase[field_lower] = True
            deduplicated_fields.append(field)

    return sorted(deduplicated_fields)
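
A small usage sketch may make the deduplication rule concrete. The helper below re-implements the same first-occurrence-wins logic under a hypothetical name, `dedupe_case_insensitive`, with made-up field names. The final `sorted` call is case-sensitive, so names that start with an uppercase letter sort ahead of lowercase ones.

```python
from typing import List


def dedupe_case_insensitive(fields: List[str]) -> List[str]:
    """Keep the first spelling of each field, comparing names case-insensitively."""
    seen = set()
    kept = []
    for field in fields:
        key = field.lower()
        if key not in seen:
            seen.add(key)
            kept.append(field)
    # sorted() compares code points, so "OrderId" sorts before "is_abuse".
    return sorted(kept)


# Hypothetical inputs: "OrderId" and "orderid" differ only by case.
example_mds = ["objectId", "transactionDate", "OrderId"]
example_tags = ["orderid", "is_abuse"]
combined = dedupe_case_insensitive(example_mds + example_tags)
print(combined)
```

Only `OrderId` survives from the case-variant pair because it appears first in the combined input.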

In [31]:
output_schema = get_all_fields(mds_field_list, tag_schema)

In [32]:
output_schema

['Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_solicit_count_last_365_days',
 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_warn_count_last_365_days',
 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_solicit_count_last_365_days',
 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_warn_count_last_365_days',
 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_buyer_order_message_time_gap',
 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_order_message_time_gap',
 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_seller_order_message_time_gap',
 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_message_count_with_diff_topic_si',
 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_message_count_with_notr_topic_si',
 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_message_count_with_return_keywords_si',
 'Abuse.bsm_st

In [33]:
output_format = "PARQUET"

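Update the transform SQL below to match your feature list. Field names that contain `.` in `output_schema` are written with `__DOT__` in place of each dot in the SQL.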
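
Because the SQL in the next cell repeats every field name twice, generating the column list from the field list can reduce copy-paste errors. The sketch below is one way to do that. The `__DOT__` substitution is inferred from comparing `output_schema` with the SQL column names, and the helper names `to_sql_column` and `build_select_list` are hypothetical, not part of any library used here.

```python
from typing import Iterable, Optional


def to_sql_column(field: str) -> str:
    """Replace each '.' with '__DOT__', mirroring the column names in the SQL."""
    return field.replace(".", "__DOT__")


def build_select_list(
    fields: Iterable[str], table: Optional[str] = None, indent: str = "    "
) -> str:
    """Render one column per line, optionally qualified with a table alias."""
    prefix = f"{table}." if table else ""
    return ",\n".join(f"{indent}{prefix}{to_sql_column(f)}" for f in fields)


# Hypothetical two-field sample; in the notebook this would be mds_field_list.
sample_fields = [
    "Abuse.shiptrack_flag_by_order.n_any_delivered",
    "claim_reason",
]
select_list = build_select_list(sample_fields, table="RAW_MDS_NA")
print(select_list)
```

The resulting string could be interpolated into the f-string in place of the hand-written column lists. Columns that need extra handling, such as the `regexp_replace` cleanup of the message body, would still be written by hand.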

In [34]:
transform_sql = f"""
SELECT
    Abuse__DOT__abuse_fap_action_by_customer_inline_transform_{region.lower()}__DOT__n_claims_solicit_count_last_365_days,
    Abuse__DOT__abuse_fap_action_by_customer_inline_transform_{region.lower()}__DOT__n_claims_warn_count_last_365_days,
    Abuse__DOT__abuse_fap_action_by_customer_inline_transform_{region.lower()}__DOT__n_concession_solicit_count_last_365_days,
    Abuse__DOT__abuse_fap_action_by_customer_inline_transform_{region.lower()}__DOT__n_concession_warn_count_last_365_days,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_max_buyer_order_message_time_gap,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_max_order_message_time_gap,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_max_seller_order_message_time_gap,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_message_count_with_diff_topic_si,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_message_count_with_notr_topic_si,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_message_count_with_return_keywords_si,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_min_buyer_message_count,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_min_buyer_order_message_time_gap,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_min_message_count,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_min_order_message_time_gap,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_min_seller_message_count,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_min_seller_order_message_time_gap,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_total_buyer_message_count,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_total_message_count,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_total_seller_message_count,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_total_topic_count,
    Abuse__DOT__completed_afn_orders_by_customer_marketplace__DOT__n_afn_order_count_last_365_days,
    Abuse__DOT__completed_afn_orders_by_customer_marketplace__DOT__n_afn_unit_amount_last_365_days,
    Abuse__DOT__completed_afn_orders_by_customer_marketplace__DOT__n_afn_unit_count_last_365_days,
    Abuse__DOT__completed_mfn_orders_by_customer_marketplace__DOT__n_mfn_order_count_last_365_days,
    Abuse__DOT__completed_mfn_orders_by_customer_marketplace__DOT__n_mfn_unit_amount_last_365_days,
    Abuse__DOT__completed_mfn_orders_by_customer_marketplace__DOT__n_mfn_unit_count_last_365_days,
    Abuse__DOT__dnr_by_customer_marketplace__DOT__n_dnr_amount_si_last_365_days,
    Abuse__DOT__dnr_by_customer_marketplace__DOT__n_dnr_order_count_last_365_days,
    Abuse__DOT__dnr_by_customer_marketplace__DOT__n_dnr_unit_amount_last_365_days,
    Abuse__DOT__dnr_by_customer_marketplace__DOT__n_dnr_unit_count_last_365_days,
    Abuse__DOT__mfn_a2z_claims_by_customer_{region.lower()}__DOT__n_mfn_claims_amount_last_365_days,
    Abuse__DOT__mfn_a2z_claims_by_customer_{region.lower()}__DOT__n_mfn_claims_count_last_365_days,
    Abuse__DOT__mfn_a2z_claims_by_customer_{region.lower()}__DOT__n_mfn_diff_claims_amount_last_365_days,
    Abuse__DOT__mfn_a2z_claims_by_customer_{region.lower()}__DOT__n_mfn_diff_claims_count_last_365_days,
    Abuse__DOT__mfn_a2z_claims_by_customer_{region.lower()}__DOT__n_mfn_notr_claims_amount_last_365_days,
    Abuse__DOT__mfn_a2z_claims_by_customer_{region.lower()}__DOT__n_mfn_notr_claims_count_last_365_days,
    Abuse__DOT__mfn_categorized_refunds_by_customer_marketplace__DOT__n_mfn_diff_refunds_order_count_last_365_days,
    Abuse__DOT__mfn_categorized_refunds_by_customer_marketplace__DOT__n_mfn_diff_refunds_unit_amount_last_365_days,
    Abuse__DOT__mfn_categorized_refunds_by_customer_marketplace__DOT__n_mfn_diff_refunds_unit_count_last_365_days,
    Abuse__DOT__mfn_categorized_refunds_by_customer_marketplace__DOT__n_mfn_notr_refunds_order_count_last_365_days,
    Abuse__DOT__mfn_categorized_refunds_by_customer_marketplace__DOT__n_mfn_notr_refunds_unit_amount_last_365_days,
    Abuse__DOT__mfn_categorized_refunds_by_customer_marketplace__DOT__n_mfn_notr_refunds_unit_count_last_365_days,
    Abuse__DOT__mfn_categorized_refunds_si_by_customer_marketplace__DOT__n_mfn_diff_refunds_si_365_days,
    Abuse__DOT__mfn_categorized_refunds_si_by_customer_marketplace__DOT__n_mfn_notr_refunds_si_365_days,
    Abuse__DOT__mfn_refunds_by_customer_marketplace__DOT__n_mfn_refund_order_count_last_365_days,
    Abuse__DOT__mfn_refunds_by_customer_marketplace__DOT__n_mfn_refund_unit_amount_last_365_days,
    Abuse__DOT__mfn_refunds_by_customer_marketplace__DOT__n_mfn_refund_unit_count_last_365_days,
    Abuse__DOT__mfn_refunds_si_by_customer_marketplace__DOT__n_mfn_refund_amount_si_last_365_days,
    Abuse__DOT__order_to_execution_time_from_eventvariables__DOT__n_order_to_execution,
    Abuse__DOT__shiptrack_flag_by_order__DOT__n_any_available_for_pickup,
    Abuse__DOT__shiptrack_flag_by_order__DOT__n_any_delivered,
    Abuse__DOT__shiptrack_flag_by_order__DOT__n_any_partial_delivered,
    Abuse__DOT__shiptrack_flag_by_order__DOT__n_any_returned,
    Abuse__DOT__shiptrack_flag_by_order__DOT__n_any_returning,
    Abuse__DOT__shiptrack_flag_by_order__DOT__n_any_undeliverable,
    COMP_DAYOB,
    PAYMETH,
    claimAmount_value,
    claim_reason,
    claimantInfo_allClaimCount365day,
    claimantInfo_lifetimeClaimCount,
    claimantInfo_pendingClaimCount,
    claimantInfo_status,
    marketplace_id,
    objectId,
    order_id,
    shipments_status,
    customer_id,
    llm_reversal_flag,
    is_abuse,
    transactionDate,
    Abuse__DOT__buyer_abuse_bsm_message_body_concat_by_order_marketplaceid__DOT__c_message_body_concat_by_order
FROM (
    SELECT
        RAW_MDS_{region}.Abuse__DOT__abuse_fap_action_by_customer_inline_transform_{region.lower()}__DOT__n_claims_solicit_count_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__abuse_fap_action_by_customer_inline_transform_{region.lower()}__DOT__n_claims_warn_count_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__abuse_fap_action_by_customer_inline_transform_{region.lower()}__DOT__n_concession_solicit_count_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__abuse_fap_action_by_customer_inline_transform_{region.lower()}__DOT__n_concession_warn_count_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_max_buyer_order_message_time_gap,
        RAW_MDS_{region}.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_max_order_message_time_gap,
        RAW_MDS_{region}.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_max_seller_order_message_time_gap,
        RAW_MDS_{region}.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_message_count_with_diff_topic_si,
        RAW_MDS_{region}.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_message_count_with_notr_topic_si,
        RAW_MDS_{region}.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_message_count_with_return_keywords_si,
        RAW_MDS_{region}.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_min_buyer_message_count,
        RAW_MDS_{region}.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_min_buyer_order_message_time_gap,
        RAW_MDS_{region}.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_min_message_count,
        RAW_MDS_{region}.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_min_order_message_time_gap,
        RAW_MDS_{region}.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_min_seller_message_count,
        RAW_MDS_{region}.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_min_seller_order_message_time_gap,
        RAW_MDS_{region}.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_total_buyer_message_count,
        RAW_MDS_{region}.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_total_message_count,
        RAW_MDS_{region}.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_total_seller_message_count,
        RAW_MDS_{region}.Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}__DOT__n_total_topic_count,
        RAW_MDS_{region}.Abuse__DOT__completed_afn_orders_by_customer_marketplace__DOT__n_afn_order_count_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__completed_afn_orders_by_customer_marketplace__DOT__n_afn_unit_amount_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__completed_afn_orders_by_customer_marketplace__DOT__n_afn_unit_count_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__completed_mfn_orders_by_customer_marketplace__DOT__n_mfn_order_count_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__completed_mfn_orders_by_customer_marketplace__DOT__n_mfn_unit_amount_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__completed_mfn_orders_by_customer_marketplace__DOT__n_mfn_unit_count_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__dnr_by_customer_marketplace__DOT__n_dnr_amount_si_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__dnr_by_customer_marketplace__DOT__n_dnr_order_count_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__dnr_by_customer_marketplace__DOT__n_dnr_unit_amount_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__dnr_by_customer_marketplace__DOT__n_dnr_unit_count_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__mfn_a2z_claims_by_customer_{region.lower()}__DOT__n_mfn_claims_amount_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__mfn_a2z_claims_by_customer_{region.lower()}__DOT__n_mfn_claims_count_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__mfn_a2z_claims_by_customer_{region.lower()}__DOT__n_mfn_diff_claims_amount_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__mfn_a2z_claims_by_customer_{region.lower()}__DOT__n_mfn_diff_claims_count_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__mfn_a2z_claims_by_customer_{region.lower()}__DOT__n_mfn_notr_claims_amount_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__mfn_a2z_claims_by_customer_{region.lower()}__DOT__n_mfn_notr_claims_count_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__mfn_categorized_refunds_by_customer_marketplace__DOT__n_mfn_diff_refunds_order_count_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__mfn_categorized_refunds_by_customer_marketplace__DOT__n_mfn_diff_refunds_unit_amount_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__mfn_categorized_refunds_by_customer_marketplace__DOT__n_mfn_diff_refunds_unit_count_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__mfn_categorized_refunds_by_customer_marketplace__DOT__n_mfn_notr_refunds_order_count_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__mfn_categorized_refunds_by_customer_marketplace__DOT__n_mfn_notr_refunds_unit_amount_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__mfn_categorized_refunds_by_customer_marketplace__DOT__n_mfn_notr_refunds_unit_count_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__mfn_categorized_refunds_si_by_customer_marketplace__DOT__n_mfn_diff_refunds_si_365_days,
        RAW_MDS_{region}.Abuse__DOT__mfn_categorized_refunds_si_by_customer_marketplace__DOT__n_mfn_notr_refunds_si_365_days,
        RAW_MDS_{region}.Abuse__DOT__mfn_refunds_by_customer_marketplace__DOT__n_mfn_refund_order_count_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__mfn_refunds_by_customer_marketplace__DOT__n_mfn_refund_unit_amount_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__mfn_refunds_by_customer_marketplace__DOT__n_mfn_refund_unit_count_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__mfn_refunds_si_by_customer_marketplace__DOT__n_mfn_refund_amount_si_last_365_days,
        RAW_MDS_{region}.Abuse__DOT__order_to_execution_time_from_eventvariables__DOT__n_order_to_execution,
        RAW_MDS_{region}.Abuse__DOT__shiptrack_flag_by_order__DOT__n_any_available_for_pickup,
        RAW_MDS_{region}.Abuse__DOT__shiptrack_flag_by_order__DOT__n_any_delivered,
        RAW_MDS_{region}.Abuse__DOT__shiptrack_flag_by_order__DOT__n_any_partial_delivered,
        RAW_MDS_{region}.Abuse__DOT__shiptrack_flag_by_order__DOT__n_any_returned,
        RAW_MDS_{region}.Abuse__DOT__shiptrack_flag_by_order__DOT__n_any_returning,
        RAW_MDS_{region}.Abuse__DOT__shiptrack_flag_by_order__DOT__n_any_undeliverable,
        RAW_MDS_{region}.COMP_DAYOB,
        RAW_MDS_{region}.PAYMETH,
        RAW_MDS_{region}.claimAmount_value,
        RAW_MDS_{region}.claim_reason,
        RAW_MDS_{region}.claimantInfo_allClaimCount365day,
        RAW_MDS_{region}.claimantInfo_lifetimeClaimCount,
        RAW_MDS_{region}.claimantInfo_pendingClaimCount,
        RAW_MDS_{region}.claimantInfo_status,
        RAW_MDS_{region}.objectId,
        RAW_MDS_{region}.shipments_status,
        RAW_MDS_{region}.transactionDate,
        TAGS.llm_reversal_flag,
        TAGS.is_abuse,
        TAGS.marketplace_id,
        TAGS.order_id,
        TAGS.customer_id,
        ROW_NUMBER() OVER (PARTITION BY RAW_MDS_{region}.objectId, TAGS.order_id ORDER BY RAW_MDS_{region}.transactionDate DESC) as row_num,
        regexp_replace(regexp_replace(regexp_replace(RAW_MDS_{region}.Abuse__DOT__buyer_abuse_bsm_message_body_concat_by_order_marketplaceid__DOT__c_message_body_concat_by_order, '(")', ''), '\\n', ''), '\\t', '') as Abuse__DOT__buyer_abuse_bsm_message_body_concat_by_order_marketplaceid__DOT__c_message_body_concat_by_order
    FROM RAW_MDS_{region}
    JOIN TAGS ON RAW_MDS_{region}.objectId = TAGS.order_id
)
WHERE row_num = 1
"""

In [35]:
merge_sql = """
select * from INPUT
"""

In [36]:
available_cluster_types = ["STANDARD", "SMALL", "MEDIUM", "LARGE"]
cluster_choice = -2
cluster_type = available_cluster_types[cluster_choice]
cluster_type

'MEDIUM'
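
Selecting the cluster by a negative index works, but it silently changes meaning if the list is reordered. A minimal alternative, assuming the same four cluster types, is to select by name and fail loudly on a typo. The helper `pick_cluster_type` is hypothetical.

```python
AVAILABLE_CLUSTER_TYPES = ["STANDARD", "SMALL", "MEDIUM", "LARGE"]


def pick_cluster_type(name: str) -> str:
    """Return the normalized cluster type, or raise if it is not in the list."""
    normalized = name.strip().upper()
    if normalized not in AVAILABLE_CLUSTER_TYPES:
        raise ValueError(
            f"Unknown cluster type {name!r}; expected one of {AVAILABLE_CLUSTER_TYPES}"
        )
    return normalized


chosen = pick_cluster_type("medium")
print(chosen)
```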

In [37]:
cradle_account = "BRP-ML-Abuse"  #'Buyer-Abuse-RnD-Dev'

In [38]:
training_cradle_data_load_dict = {
    "job_type": "training",
    "data_sources_spec": DataSourcesSpecificationConfig(
        start_date=training_start_datetime,
        end_date=training_end_datetime,
        data_sources=[
            DataSourceConfig(
                data_source_name=f"RAW_MDS_{region}",
                data_source_type="MDS",
                mds_data_source_properties=MdsDataSourceConfig(
                    service_name=mds_service_name,
                    region=region,
                    output_schema=mds_output_schema,
                    org_id=org_id,
                ),
            ),
            DataSourceConfig(
                data_source_name="TAGS",
                data_source_type="EDX",
                edx_data_source_properties=EdxDataSourceConfig(
                    schema_overrides=edx_schema_overrides,
                    edx_arn=train_edx_arn[region],
                ),
            ),
        ],
    ),
    "output_spec": OutputSpecificationConfig(
        output_schema=output_schema, output_format=output_format
    ),
    "transform_spec": TransformSpecificationConfig(
        transform_sql=transform_sql,
        job_split_options=JobSplitOptionsConfig(merge_sql=merge_sql),
    ),
    "cradle_job_spec": CradleJobSpecificationConfig(
        cradle_account=cradle_account,
        cluster_type=cluster_type,
    ),
}

In [39]:
# Configure training data loading
if "CradleDataLoading_training" in pending_steps:
    step_name = "CradleDataLoading_training"
    requirements = factory.get_step_requirements(step_name)

    print(f"Configuring {step_name}:")
    print("-" * 40)
    for req in requirements[:5]:  # Show first 5 requirements
        marker = "*" if req["required"] else " "
        print(f"{marker} {req['name']:<25} ({req['type']})")
        print(f"    {req['description']}")

    if len(requirements) > 5:
        print(f"    ... and {len(requirements) - 5} more fields")

    # Set configuration for training data loading
    factory.set_step_config(step_name, **training_cradle_data_load_dict)
    print(f"✅ {step_name} configured")
    print()

2025-12-20 22:37:19,473 - INFO - ✅ CradleDataLoading_training configured successfully using CradleDataLoadingConfig


Configuring CradleDataLoading_training:
----------------------------------------
* job_type                  (str)
    One of ['training','validation','testing','calibration'] to indicate which dataset this job is pulling
* data_sources_spec         (DataSourcesSpecificationConfig)
    Full data-sources specification (start/end dates plus list of sources).
* transform_spec            (TransformSpecificationConfig)
    Transform specification: SQL + job-split options.
* output_spec               (OutputSpecificationConfig)
    Output specification: schema, output format, save mode, etc.
* cradle_job_spec           (CradleJobSpecificationConfig)
    Cradle job specification: cluster type, account, retry count, etc.
    ... and 1 more fields
✅ CradleDataLoading_training configured



#### Cradle Data Loading (Calibration) Step

In [40]:
# =======================================================
calibration_start_datetime = "2025-01-01T00:00:00"  #'2024-05-26T00:00:00'
calibration_end_datetime = "2025-10-31T00:00:00"  #'2024-06-29T23:00:00'
# =======================================================
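
Since the start and end timestamps are plain strings, a quick sanity check before they reach the data-loading config can catch a swapped or mistyped date. The sketch below uses only the standard library. `check_window` is a hypothetical helper, and the notebook itself does not perform this check.

```python
from datetime import datetime


def check_window(start: str, end: str) -> int:
    """Parse two ISO-8601 timestamps and return the window length in days."""
    start_dt = datetime.fromisoformat(start)
    end_dt = datetime.fromisoformat(end)
    if end_dt <= start_dt:
        raise ValueError(f"end ({end}) must be after start ({start})")
    return (end_dt - start_dt).days


window_days = check_window("2025-01-01T00:00:00", "2025-10-31T00:00:00")
print(window_days)
```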

In [41]:
# =======================================================
calibration_edx_arn = {
    "NA": 'arn:amazon:edx:iad::manifest/trms-abuse-analytics/buyer-seller-messaging/bsm-tag-atoz/["20251215",2025-01-01T00:00:00Z,2025-10-31T01:00:00Z,"LLM_TAG_WW_Test"]',
    "EU": 'arn:amazon:edx:iad::manifest/trms-abuse-analytics/buyer-seller-messaging/bsm-tag-atoz/["20251215",2025-01-01T00:00:00Z,2025-10-31T01:00:00Z,"LLM_TAG_WW_Test"]',
    "FE": 'arn:amazon:edx:iad::manifest/trms-abuse-analytics/buyer-seller-messaging/bsm-tag-atoz/["20251215",2025-01-01T00:00:00Z,2025-10-31T01:00:00Z,"LLM_TAG_WW_Test"]',
}
# =======================================================
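
The three regional entries hold the same ARN, and the training and calibration ARNs differ only in the final tag. A sketch of building them from one template follows. The helper and parameter names are hypothetical, and calling the first key (`"20251215"`) a `snapshot` is an assumption, since the notebook does not say what that key means.

```python
EDX_MANIFEST_PREFIX = (
    "arn:amazon:edx:iad::manifest/"
    "trms-abuse-analytics/buyer-seller-messaging/bsm-tag-atoz/"
)


def build_edx_arn(snapshot: str, start: str, end: str, tag: str) -> str:
    """Assemble the manifest ARN with the same key layout used above."""
    return f'{EDX_MANIFEST_PREFIX}["{snapshot}",{start},{end},"{tag}"]'


def arns_by_region(tag: str, regions=("NA", "EU", "FE")) -> dict:
    """Give every region the same ARN, as the dictionaries in this notebook do."""
    arn = build_edx_arn(
        "20251215", "2025-01-01T00:00:00Z", "2025-10-31T01:00:00Z", tag
    )
    return {region: arn for region in regions}


calibration_arns = arns_by_region("LLM_TAG_WW_Test")
print(calibration_arns["NA"])
```

Calling `arns_by_region("LLM_TAG_WW_Train")` would produce the training counterpart with the same layout.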

In [42]:
calibration_cradle_data_load_dict = {
    "job_type": "calibration",
    "data_sources_spec": DataSourcesSpecificationConfig(
        start_date=calibration_start_datetime,
        end_date=calibration_end_datetime,
        data_sources=[
            DataSourceConfig(
                data_source_name=f"RAW_MDS_{region}",
                data_source_type="MDS",
                mds_data_source_properties=MdsDataSourceConfig(
                    service_name=mds_service_name,
                    region=region,
                    output_schema=mds_output_schema,
                    org_id=org_id,
                ),
            ),
            DataSourceConfig(
                data_source_name="TAGS",
                data_source_type="EDX",
                edx_data_source_properties=EdxDataSourceConfig(
                    schema_overrides=edx_schema_overrides,
                    edx_arn=calibration_edx_arn[region],
                ),
            ),
        ],
    ),
    "output_spec": OutputSpecificationConfig(
        output_schema=output_schema, output_format=output_format
    ),
    "transform_spec": TransformSpecificationConfig(
        transform_sql=transform_sql,
        job_split_options=JobSplitOptionsConfig(merge_sql=merge_sql),
    ),
    "cradle_job_spec": CradleJobSpecificationConfig(
        cradle_account=cradle_account,
        cluster_type=cluster_type,
    ),
}

In [43]:
# Configure calibration data loading
if "CradleDataLoading_calibration" in pending_steps:
    step_name = "CradleDataLoading_calibration"

    factory.set_step_config(step_name, **calibration_cradle_data_load_dict)
    print(f"✅ {step_name} configured")

2025-12-20 22:37:19,497 - INFO - ✅ CradleDataLoading_calibration configured successfully using CradleDataLoadingConfig


✅ CradleDataLoading_calibration configured


### Step 6.3: Configure Registration Step


* [MRAS (Model Resource Allocation System)](https://w.amazon.com/bin/view/CMLS/ME/MIMS/) is a system that manages your **model endpoints**.
    * It takes your model artifact and its metadata and deploys an endpoint to an AWS account that you have onboarded to MRAS. You can access this endpoint through the AMES system, which URES uses.
* **MIMS (Model Inference Management System)** handles model creation.
* **MMS (Model Management Service)** manages the model card.
>
> Note that we used to refer to **MRAS** as **MIMS** (**Model Inference Management System**).
> - **MIMS** is the component of MRAS that handles endpoint creation.
> - To reduce customer confusion, we have started to use *MRAS* to also refer to *MIMS*.
> - Some of our wikis may still use *MIMS* instead of *MRAS*.
>
> If your team has not done so already, please [onboard an AWS account to MRAS](https://w.amazon.com/bin/view/CMLS/ME/MIMS/UserGuide/Onboarding/).

* **MIMSModelRegistrationStep** is a SageMaker Workflow Step that wraps the service call to **MIMS**.
    * It is a customized step provided by the SAIS Python SDK.
        * See the source code in [SecureAISandboxWorkflowPythonSDK](https://code.amazon.com/packages/SecureAISandboxWorkflowPythonSDK/trees/mainline#).
    * This step inherits from **MODSPredefinedProcessingStep**, a customized base class that itself inherits from **ScriptProcessingStep**.
        * See the source code in [MODSWorkflowCore](https://code.amazon.com/packages/MODSWorkflowCore/trees/mainline#).
    * This step needs to load the **Execution Document** before it can act.

In **MIMSModelRegistrationStep**, we need to specify the following fields to fill in the **Execution Document**:
* *model_owner*
* *model_registration_domain*
* *model_registration_objective*
* *source_model_inference_input_variable_list*
* *source_model_inference_output_variable_list*
* *source_model_inference_content_types*
* *source_model_inference_response_types*
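
The registration cell that follows passes the input variables as `[name, type]` pairs and the output variables as a name-to-type mapping. The sketch below shows those two shapes with hypothetical field names standing in for `tab_field_list` and `cat_field_list`. The helper names are illustrative only and are not part of the SAIS SDK.

```python
from typing import Dict, List


def build_input_variable_list(
    numeric_fields: List[str], text_fields: List[str]
) -> List[List[str]]:
    """Return [name, type] pairs, numeric fields first, then text fields."""
    numeric = set(numeric_fields)  # set lookup instead of scanning the list each time
    return [
        [field, "NUMERIC" if field in numeric else "TEXT"]
        for field in numeric_fields + text_fields
    ]


def registration_objective(region: str) -> str:
    """EU and FE keep their region suffix; any other region maps to US."""
    suffix = region if region in ("EU", "FE") else "US"
    return f"AtoZ_Claims_BSM_Model_{suffix}"


# Hypothetical field names standing in for tab_field_list and cat_field_list.
inputs = build_input_variable_list(["claimAmount_value"], ["claim_reason"])
outputs: Dict[str, str] = {"legacy-score": "NUMERIC", "custom-output-label": "TEXT"}
print(inputs, outputs, registration_objective("NA"))
```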


In [44]:
# Configure Registration step
if "Registration" in pending_steps:
    # =================== Update This =======================
    model_domain = "BuyerSellerMessaging"
    model_objective = (
        f"AtoZ_Claims_BSM_Model_{region}"
        if region in ["EU", "FE"]
        else "AtoZ_Claims_BSM_Model_US"
    )
    # =======================================================

    # source_model_inference_input_variable_list = {
    #    field: 'NUMERIC' if field in tab_field_list else 'TEXT'
    #    for field in tab_field_list + cat_field_list
    # }
    source_model_inference_input_variable_list = [
        [field, "NUMERIC"] if field in tab_field_list else [field, "TEXT"]
        for field in tab_field_list + cat_field_list
    ]

    source_model_inference_output_variable_list = {
        "legacy-score": "NUMERIC",
        "score-percentile": "NUMERIC",
        "calibrated-score": "NUMERIC",
        "custom-output-label": "TEXT",
    }

    # =================== Update This =======================
    framework = "pytorch"
    inference_entry_point = "pytorch_inference_handler.py"
    # =======================================================

    factory.set_step_config(
        "Registration",
        framework=framework,
        inference_entry_point=inference_entry_point,
        model_owner="amzn1.abacus.team.djmdvixm5abr3p75c5ca",  # abuse-analytics team
        model_domain=model_domain,
        model_objective=model_objective,
        source_model_inference_output_variable_list=source_model_inference_output_variable_list,
        source_model_inference_input_variable_list=source_model_inference_input_variable_list,
    )
    print("✅ Registration configured")

2025-12-20 22:37:19,505 - INFO - ✅ Registration configured successfully using RegistrationConfig


✅ Registration configured


### Step 6.4: Configure Preprocessing Steps

In [45]:
# Configure training preprocessing
if "TabularPreprocessing_training" in pending_steps:
    step_name = "TabularPreprocessing_training"

    factory.set_step_config(
        step_name,
        job_type="training",
        label_name=label_name,
        processing_entry_point="tabular_preprocessing.py",
        use_large_processing_instance=True,
        output_format="Parquet",
    )
    print(f"✅ {step_name} configured")

2025-12-20 22:37:19,510 - INFO - Package location discovery succeeded (bundled): /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/atoz_bsm_pytorch/dockers/scripts
2025-12-20 22:37:19,511 - INFO - Hybrid resolution completed successfully via Package Location Discovery: /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/atoz_bsm_pytorch/dockers/scripts
2025-12-20 22:37:19,511 - INFO - ✅ TabularPreprocessing_training configured successfully using TabularPreprocessingConfig


✅ TabularPreprocessing_training configured


In [46]:
# Configure calibration preprocessing
if "TabularPreprocessing_calibration" in pending_steps:
    step_name = "TabularPreprocessing_calibration"

    factory.set_step_config(
        step_name,
        job_type="calibration",
        label_name=None,
        processing_entry_point="tabular_preprocessing.py",
        use_large_processing_instance=True,
        output_format="Parquet",
    )
    print(f"✅ {step_name} configured")

2025-12-20 22:37:19,517 - INFO - Package location discovery succeeded (bundled): /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/atoz_bsm_pytorch/dockers/scripts
2025-12-20 22:37:19,517 - INFO - Hybrid resolution completed successfully via Package Location Discovery: /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/atoz_bsm_pytorch/dockers/scripts
2025-12-20 22:37:19,518 - INFO - ✅ TabularPreprocessing_calibration configured successfully using TabularPreprocessingConfig


✅ TabularPreprocessing_calibration configured


### Step 6.5: Configure Remaining Steps

**USER INPUT BLOCK**: Fill in the essential fields for each remaining step.
The factory has identified the required fields for each step.

In [47]:
# Get current pending steps
current_pending = factory.get_pending_steps()

print("Remaining steps to configure:")
print("=" * 40)

for step_name in current_pending:
    requirements = factory.get_step_requirements(step_name)
    essential_reqs = [req for req in requirements if req["required"]]

    print(f"\n{step_name}:")
    print(f"  Essential fields ({len(essential_reqs)}):")
    for req in essential_reqs:
        print(f"    * {req['name']} ({req['type']}) - {req['description']}")

    if len(requirements) > len(essential_reqs):
        optional_count = len(requirements) - len(essential_reqs)
        print(f"  Optional fields: {optional_count}")

Remaining steps to configure:

PyTorchModelEval_calibration:
  Essential fields (2):
    * id_name (str) - Name of the ID field in the dataset (required for evaluation).
    * label_name (str) - Name of the label field in the dataset (required for evaluation).
  Optional fields: 7

PercentileModelCalibration_calibration:
  Essential fields (1):
    * job_type (str) - Which data split to use for calibration (e.g., 'training', 'calibration', 'validation', 'test').
  Optional fields: 4

Payload:
  Essential fields (2):
    * expected_tps (int) - Expected transactions per second
    * max_latency_in_millisecond (int) - Maximum acceptable latency in milliseconds
  Optional fields: 8


In [48]:
id_name

'order_id'

In [49]:
label_name

'is_abuse'

In [50]:
# Configure Model Evaluation
if "PyTorchModelEval_calibration" in current_pending:
    step_name = "PyTorchModelEval_calibration"
    factory.set_step_config(
        step_name,
        job_type="calibration",
        processing_entry_point="pytorch_model_eval.py",
        id_name=id_name,
        label_name=label_name,
        processing_source_dir=str(source_dir),
        processing_instance_type_large="ml.p3.8xlarge",
        use_large_processing_instance=True,
        processing_framework_version="2.1.2",
        py_version="py310",
    )
    print(f"✅ {step_name} configured")

2025-12-20 22:37:19,542 - INFO - Package location discovery succeeded (bundled): /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/atoz_bsm_pytorch/dockers
2025-12-20 22:37:19,542 - INFO - Hybrid resolution completed successfully via Package Location Discovery: /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/atoz_bsm_pytorch/dockers
2025-12-20 22:37:19,543 - INFO - ✅ PyTorchModelEval_calibration configured successfully using PyTorchModelEvalConfig


✅ PyTorchModelEval_calibration configured


In [51]:
# Configure Model Calibration
if "ModelCalibration_calibration" in current_pending:
    factory.set_step_config(
        "ModelCalibration_calibration",
        label_field=label_name,
        processing_entry_point="model_calibration.py",
        score_field="prob_class_1",
        is_binary=True,
        num_classes=2,
        score_field_prefix="prob_class_",
        multiclass_categories=[0, 1],
    )
    print("✅ ModelCalibration_calibration configured")

In [52]:
# Configure Percentile Model Calibration
if "PercentileModelCalibration_calibration" in current_pending:
    factory.set_step_config(
        "PercentileModelCalibration_calibration",
        job_type="calibration",
        processing_entry_point="percentile_model_calibration.py",
        score_field="prob_class_1",
        score_fields=None,
    )
    print("✅ PercentileModelCalibration_calibration configured")

2025-12-20 22:37:19,553 - INFO - Package location discovery succeeded (bundled): /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/atoz_bsm_pytorch/dockers/scripts
2025-12-20 22:37:19,554 - INFO - Hybrid resolution completed successfully via Package Location Discovery: /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/atoz_bsm_pytorch/dockers/scripts
2025-12-20 22:37:19,554 - INFO - ✅ PercentileModelCalibration_calibration configured successfully using PercentileModelCalibrationConfig


✅ PercentileModelCalibration_calibration configured


In [53]:
# Configure Payload (sample input fields and load-test settings)
if "Payload" in current_pending:
    field_dict = {
        "Abuse.buyer_abuse_bsm_message_body_concat_by_order_marketplaceid.c_message_body_concat_by_order": "[bom] [Arrival Time]: 2025-06-11 [BUYER]: I need my refund. [eom]",
    }
    load_test_instance_type_list = [
        "ml.m5.12xlarge"  # "ml.m5.4xlarge"
    ]
    factory.set_step_config(
        "Payload",
        processing_entry_point="payload.py",
        # =================================
        expected_tps=10,
        max_latency_in_millisecond=1000,
        field_defaults=field_dict,
        load_test_instance_type_list=load_test_instance_type_list,
    )
    print("✅ Payload configured")

2025-12-20 22:37:19,560 - INFO - Package location discovery succeeded (bundled): /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/atoz_bsm_pytorch/dockers/scripts
2025-12-20 22:37:19,561 - INFO - Hybrid resolution completed successfully via Package Location Discovery: /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/atoz_bsm_pytorch/dockers/scripts
2025-12-20 22:37:19,561 - INFO - ✅ Payload configured successfully using PayloadConfig


✅ Payload configured


## Step 7: Generate Final Configurations

Now that all steps are configured, we can generate the final configuration instances.
The factory will validate that all essential fields are provided and create the config objects.
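
The check described above can be pictured with a small, self-contained sketch. It does not use the cursus API: `ToyStepConfig`, `missing_essential_fields`, and `build_config` are hypothetical names, and the sketch assumes that "essential" simply means "a field with no default". The real factory may apply different rules.

```python
from dataclasses import MISSING, dataclass, fields
from typing import Any, Dict, List, Type


@dataclass
class ToyStepConfig:
    """Hypothetical stand-in for a step config; not a cursus class."""

    processing_entry_point: str      # essential: no default
    score_field: str                 # essential: no default
    job_type: str = "calibration"    # optional: has a default


def missing_essential_fields(config_cls: Type, inputs: Dict[str, Any]) -> List[str]:
    """Return names of fields without defaults that are absent from inputs."""
    return [
        f.name
        for f in fields(config_cls)
        if f.default is MISSING
        and f.default_factory is MISSING
        and f.name not in inputs
    ]


def build_config(config_cls: Type, inputs: Dict[str, Any]):
    """Create the config only when every essential field is present."""
    missing = missing_essential_fields(config_cls, inputs)
    if missing:
        raise ValueError(f"{config_cls.__name__} is missing: {missing}")
    return config_cls(**inputs)


# Usage: an incomplete input is reported before any object is created.
print(missing_essential_fields(ToyStepConfig, {"score_field": "prob_class_1"}))
```

Validating before construction is what lets a factory report every missing field at once instead of failing on the first one.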

In [54]:
# Check final status
final_status = factory.get_configuration_status()
final_pending = factory.get_pending_steps()

print("Final Configuration Status:")
print("=" * 40)
print(f"Base config: {'✅' if final_status['base_config'] else '❌'}")
print(f"Processing config: {'✅' if final_status['base_processing_config'] else '❌'}")
print(f"Pending steps: {len(final_pending)}")

if final_pending:
    print("\nStill pending:")
    for step in final_pending:
        print(f"  - {step}")
    print("\n⚠️  Please configure remaining steps before generating configs.")
else:
    print("\n✅ All steps configured! Ready to generate configurations.")

Final Configuration Status:
Base config: ✅
Processing config: ✅
Pending steps: 0

✅ All steps configured! Ready to generate configurations.


In [55]:
# Generate final configurations
if not final_pending:
    try:
        print("Generating final configurations...")
        configs = factory.generate_all_configs()

        print(f"\n✅ Successfully generated {len(configs)} configuration instances:")
        for i, config in enumerate(configs, 1):
            print(f"  {i:2d}. {config.__class__.__name__}")

        print("\n🎉 Configuration generation complete!")

    except Exception as e:
        print(f"\n❌ Configuration generation failed: {e}")
        print("\nPlease check that all required fields are provided.")
        configs = None
else:
    print("\n⚠️  Cannot generate configs - some steps are still pending configuration.")
    configs = None

Generating final configurations...

2025-12-20 22:37:19,574 - INFO - Package location discovery succeeded (bundled): /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/atoz_bsm_pytorch/dockers/scripts
2025-12-20 22:37:19,574 - INFO - Hybrid resolution completed successfully via Package Location Discovery: /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/atoz_bsm_pytorch/dockers/scripts
2025-12-20 22:37:19,575 - INFO - ✅ Package auto-configured successfully (only tier 2+ fields)
2025-12-20 22:37:19,575 - INFO - ✅ Auto-configured 1 steps with only tier 2+ fields
2025-12-20 22:37:19,575 - INFO - ✅ Returning 10 pre-validated configuration instances


✅ Successfully generated 10 configuration instances:
   1. PyTorchTrainingConfig
   2. CradleDataLoadingConfig
   3. CradleDataLoadingConfig
   4. RegistrationConfig
   5. TabularPreprocessingConfig
   6. TabularPreprocessingConfig
   7. PyTorchModelEvalConfig
   8. PercentileModelCalibrationConfig
   9. PayloadConfig
  10. PackageConfig

🎉 Configuration generation complete!


In [56]:
len(configs)

10

## Step 8: Save to JSON

Finally, we save the generated configurations to a unified JSON file using the existing
`merge_and_save_configs` utility. This creates the same format as the legacy approach
but with much less effort!
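
To give a feel for what a "unified" file can mean, here is a toy sketch that is independent of cursus. It assumes, purely for illustration, that fields with identical values in every step are stored once under a `shared` key and everything else is kept per step under `specific`. The function `split_shared_specific`, the key names, and the sample values (including `package.py`) are hypothetical; the layout actually written by `merge_and_save_configs` may differ.

```python
import json
from typing import Any, Dict


def split_shared_specific(configs: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Move fields with identical values in every config into 'shared'."""
    if not configs:
        return {"shared": {}, "specific": {}}
    all_configs = list(configs.values())
    first = all_configs[0]
    # A field is shared only if every config has it with the same value.
    shared = {
        key: value
        for key, value in first.items()
        if all(key in cfg and cfg[key] == value for cfg in all_configs)
    }
    # Whatever is left stays with its own step.
    specific = {
        name: {k: v for k, v in cfg.items() if k not in shared}
        for name, cfg in configs.items()
    }
    return {"shared": shared, "specific": specific}


# Toy input: two steps that agree on the region but not on the script.
toy = {
    "Payload": {"region": "NA", "processing_entry_point": "payload.py"},
    "Package": {"region": "NA", "processing_entry_point": "package.py"},
}
merged = split_shared_specific(toy)
print(json.dumps(merged, indent=2, sort_keys=True))
```

Storing common values once keeps the file smaller and makes it obvious which settings apply to the whole pipeline.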

In [57]:
if configs:
    # Set up output directory and filename
    MODEL_CLASS = "pytorch"
    service_name = "AtoZ"

    config_dir = Path(current_dir) / "pipeline_config"
    config_dir.mkdir(parents=True, exist_ok=True)

    config_file_name = f"config_{region}.json"
    config_path = config_dir / config_file_name

    print(f"Saving configurations to: {config_path}")

    # Use the existing merge_and_save_configs utility
    from cursus.steps.configs.utils import merge_and_save_configs

    try:
        merged_config = merge_and_save_configs(configs, str(config_path))

        print("\n✅ Configuration saved successfully!")
        print(f"   File: {config_path}")
        print(f"   Size: {config_path.stat().st_size / 1024:.1f} KB")

        # Also save hyperparameters separately (for compatibility)
        hyperparam_path = source_dir / "hyperparams" / f"hyperparameters_{region}.json"
        # Create the target directory if it does not exist yet
        hyperparam_path.parent.mkdir(parents=True, exist_ok=True)
        with open(hyperparam_path, "w") as f:
            json.dump(bimodal_hyperparams.model_dump(), f, indent=2, sort_keys=True)

        print(f"   Hyperparameters: {hyperparam_path}")

        print("\n🎉 Interactive configuration complete!")
        print("\n📊 Comparison with legacy approach:")
        print("   Legacy: 500+ lines of manual configuration")
        print("   Interactive: Guided step-by-step process")
        print("   Time saved: ~20-25 minutes")
        print("   Error reduction: Validation at each step")

    except Exception as e:
        print(f"\n❌ Failed to save configurations: {e}")

else:
    print("\n⚠️  No configurations to save. Please generate configs first.")

2025-12-20 22:37:19,587 - INFO - Package location discovery succeeded (bundled): /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/atoz_bsm_pytorch/dockers
2025-12-20 22:37:19,588 - INFO - Hybrid resolution completed successfully via Package Location Discovery: /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/atoz_bsm_pytorch/dockers
2025-12-20 22:37:19,590 - INFO - Package location discovery succeeded (bundled): /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/atoz_bsm_pytorch/dockers
2025-12-20 22:37:19,590 - INFO - Hybrid resolution completed successfully via Package Location Discovery: /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/atoz_bsm_pytorch/dockers
2025-12-20 22:37:19,592 - INFO - Package location discovery succeeded (bundled): /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/atoz_bsm_pytorch/dockers
2025-12-20 22:37:19,592 - INFO

Saving configurations to: /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/atoz_bsm_pytorch/pipeline_config/config_NA.json

✅ Configuration saved successfully!
   File: /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/atoz_bsm_pytorch/pipeline_config/config_NA.json
   Size: 116.7 KB
   Hyperparameters: dockers/hyperparams/hyperparameters_NA.json

🎉 Interactive configuration complete!

📊 Comparison with legacy approach:
   Legacy: 500+ lines of manual configuration
   Interactive: Guided step-by-step process
   Time saved: ~20-25 minutes
   Error reduction: Validation at each step


### Test if we can load it

In [58]:
from cursus.steps.configs.config_cradle_data_loading_step import CradleDataLoadingConfig
from cursus.steps.configs.config_tabular_preprocessing_step import (
    TabularPreprocessingConfig,
)
from cursus.steps.configs.config_pytorch_training_step import PyTorchTrainingConfig
from cursus.steps.configs.config_pytorch_model_eval_step import PyTorchModelEvalConfig
from cursus.steps.configs.config_model_calibration_step import ModelCalibrationConfig
from cursus.steps.configs.config_percentile_model_calibration_step import (
    PercentileModelCalibrationConfig,
)
from cursus.steps.configs.config_package_step import PackageConfig
from cursus.steps.configs.config_payload_step import PayloadConfig
from cursus.steps.configs.config_registration_step import RegistrationConfig

In [59]:
from cursus.steps.configs.utils import load_configs

In [60]:
CONFIG_CLASSES = {
    "CradleDataLoadingConfig": CradleDataLoadingConfig,
    "TabularPreprocessingConfig": TabularPreprocessingConfig,
    "PyTorchTrainingConfig": PyTorchTrainingConfig,
    "PyTorchModelEvalConfig": PyTorchModelEvalConfig,
    "ModelCalibrationConfig": ModelCalibrationConfig,
    "PercentileModelCalibrationConfig": PercentileModelCalibrationConfig,
    "PackageConfig": PackageConfig,
    "PayloadConfig": PayloadConfig,
    "RegistrationConfig": RegistrationConfig,
}

In [61]:
loaded_configs = load_configs(str(config_path), CONFIG_CLASSES)

2025-12-20 22:37:19,647 - INFO - Loading configs from /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/atoz_bsm_pytorch/pipeline_config/config_NA.json
2025-12-20 22:37:19,648 - INFO - Loading configuration from /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/atoz_bsm_pytorch/pipeline_config/config_NA.json
2025-12-20 22:37:19,656 - INFO - Successfully loaded configuration from /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/atoz_bsm_pytorch/pipeline_config/config_NA.json
2025-12-20 22:37:19,656 - INFO - Successfully loaded configs from /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/atoz_bsm_pytorch/pipeline_config/config_NA.json with 10 specific configs
2025-12-20 22:37:19,659 - INFO - Creating additional config instance for CradleDataLoading_calibration (CradleDataLoadingConfig)
2025-12-20 22:37:19,660 - INFO - Creating additional config instance for CradleDataLoading

In [62]:
len(loaded_configs)

10
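
A matching count is a useful first signal, but it does not show that the same config classes came back. The sketch below is one way to compare the two collections by class name. It is self-contained and uses toy classes; `class_name_counts` and `round_trip_diff` are hypothetical helpers, and the sketch assumes the loaded result may be either a sequence or a mapping from step name to config.

```python
from collections import Counter
from collections.abc import Mapping


def class_name_counts(configs) -> Counter:
    """Count config instances per class name; accepts a mapping or a sequence."""
    values = configs.values() if isinstance(configs, Mapping) else configs
    return Counter(type(cfg).__name__ for cfg in values)


def round_trip_diff(saved, loaded) -> dict:
    """Return {class name: (saved count, loaded count)} for every mismatch."""
    before, after = class_name_counts(saved), class_name_counts(loaded)
    return {
        name: (before[name], after[name])
        for name in sorted(set(before) | set(after))
        if before[name] != after[name]
    }


# Toy stand-ins for real config classes.
class ToyPayloadConfig:
    pass


class ToyPackageConfig:
    pass


saved = [ToyPayloadConfig(), ToyPackageConfig()]
loaded = {"Payload": ToyPayloadConfig(), "Package": ToyPackageConfig()}
print(round_trip_diff(saved, loaded))  # an empty dict means the class counts match
```

In this notebook the equivalent call would be `round_trip_diff(configs, loaded_configs)`, provided both variables are still in scope.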

## Summary

This notebook demonstrates the **DAGConfigFactory** approach to pipeline configuration:

### ✅ **Benefits Achieved**

1. **Reduced Complexity**: A guided workflow replaces 500+ lines of manual configuration
2. **Base Config Inheritance**: Set common fields once and inherit them in every step
3. **Step-by-Step Guidance**: Clear requirements for each configuration step
4. **Validation**: Checks at each step catch configuration errors before the configs are generated
5. **Reusable DAG**: Pipeline structure defined once, reused across environments

### 🔄 **Workflow Comparison**

| Aspect | Legacy Approach | Interactive Approach |
|--------|----------------|---------------------|
| **Lines of Code** | 500+ manual lines | Guided step-by-step |
| **Time Required** | 30+ minutes | 10-15 minutes |
| **Error Risk** | Higher (manual entry) | Lower (validation at each step) |
| **Reusability** | Copy-paste heavy | DAG-driven |
| **Maintenance** | Manual updates | Automatic inheritance |

### 🚀 **Next Steps**

The generated configuration file can now be used with the existing pipeline compiler:

```python
# Use with pipeline compiler (from demo_pipeline.ipynb)
from cursus.core.compiler.dag_compiler import PipelineDAGCompiler

dag_compiler = PipelineDAGCompiler(
    config_path=str(config_path),  # same string form used when saving and loading above
    sagemaker_session=pipeline_session,
    role=role,
)

# Compile DAG to pipeline
template_pipeline, report = dag_compiler.compile_with_report(dag=dag)
```

The interactive configuration approach transforms the user experience from complex manual setup to an intuitive, guided workflow while maintaining full compatibility with the existing cursus infrastructure.