# Part 1: Config Creation

This notebook collects all **user input** and saves it into a **unified config JSON file**, which is then used to construct the pipeline.

In this notebook, we assume a commonly seen pipeline. The tasks are as follows:
* Data Pulling
* XGBoost Model Training
* Model Evaluation
* Model Registration

The following is the **Pipeline DAG (Directed Acyclic Graph)**:
![mods_pipeline_train_eval_calib](./demo/mods_pipeline_train_eval_calib.png)


The *steps* involved are as follows:
1. **CradleDataLoadingStep**, with replicated steps for the **training** and **calibration** data flows
2. **TabularPreprocessingStep**, with two different types (**training** and **calibration**)
3. **XGBoostTrainingStep**
4. **XGBoostModelEvaluationStep** 
5. **PackagingStep**
6. **MIMSModelRegistrationStep**
7. **PayloadStep** (Optional)

This notebook lets the user specify the input information for each of these steps.
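
The step list above can be read as a dependency graph. The sketch below is only an illustration: the edges are assumptions inferred from the step names and their order (the DAG image remains the authoritative source), and node labels such as `CradleDataLoading_training` are hypothetical, not identifiers from the pipeline code. It uses the standard-library `graphlib` module to derive one valid execution order.

```python
from graphlib import TopologicalSorter

# Map each step to the set of steps it depends on.
# NOTE: these edges are assumed for illustration; they are not read from the pipeline.
assumed_dependencies = {
    "CradleDataLoading_training": set(),
    "CradleDataLoading_calibration": set(),
    "TabularPreprocessing_training": {"CradleDataLoading_training"},
    "TabularPreprocessing_calibration": {"CradleDataLoading_calibration"},
    "XGBoostTraining": {"TabularPreprocessing_training"},
    "XGBoostModelEvaluation": {"XGBoostTraining", "TabularPreprocessing_calibration"},
    "Packaging": {"XGBoostTraining"},
    "Payload": {"XGBoostTraining"},
    "MIMSModelRegistration": {"Packaging", "Payload"},
}

# static_order() yields the nodes so that every step appears after its dependencies.
execution_order = list(TopologicalSorter(assumed_dependencies).static_order())
print(execution_order)
```

Any order produced this way places data loading before preprocessing and registration last, which is the property a DAG-based pipeline relies on.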


There are two more **base steps**, which control the **information sharing** across all steps:
1. **Base Config**: shared for all steps
2. **Base Processing Config**: shared for all *processing steps*


#### OPTIONAL Install SAIS Python SDK, MODS Workflow Helper and MODS Python SDK

import importlib.metadata
import subprocess
import sys

def install_package(package_name):
    try:
        # Check if package is already installed
        # (importlib.metadata replaces the deprecated pkg_resources API)
        importlib.metadata.version(package_name)
        print(f"{package_name} is already installed")
    except importlib.metadata.PackageNotFoundError:
        # Package not found, install it
        print(f"Installing {package_name}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", 
                             package_name, "--ignore-installed"])
        print(f"Successfully installed {package_name}")
    except Exception as e:
        print(f"Error occurred: {str(e)}")

install_package('amzn-secure-ai-sandbox-workflow-python-sdk')

install_package('amzn-mods-workflow-helper')

install_package('amzn-mods-python-sdk')

In [1]:
#!pip list

In [2]:
#!pip install --upgrade sagemaker

In [3]:
#!pip uninstall -y rpds-py
#!pip install rpds-py --force-reinstall

In [4]:
#!pip install amzn-secure-ai-sandbox-workflow-python-sdk --ignore-installed

In [5]:
#!pip install amzn-mods-workflow-helper amzn-mods-python-sdk --upgrade

#### Start

In [6]:
import os
import json
import pandas as pd
import pickle
import sys
import boto3
import subprocess
from datetime import datetime
import logging
from pathlib import Path

In [7]:
logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(message)s'
    )

<a id='1'></a>
## Environment Setup

In [8]:
from sagemaker import Session

2025-09-21 19:08:47,202 - INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [9]:
from secure_ai_sandbox_python_lib.session import Session as SaisSession
# Initialize session with team bucket
sais_session = SaisSession(".")

2025-09-21 19:08:47,492 - INFO - CA certs are provided via the AmazonCACerts installation at /home/ec2-user/.local/lib/python3.10/site-packages/amazoncerts
2025-09-21 19:08:47,880 - INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
2025-09-21 19:08:48,631 - INFO - successfully patched module botocore


In [10]:
from mods_workflow_helper.sagemaker_pipeline_helper import SecurityConfig
security_config = SecurityConfig(
    kms_key=sais_session.get_team_owned_bucket_kms_key(),
    security_group=sais_session.sandbox_vpc_security_group(),
    vpc_subnets=sais_session.sandbox_vpc_subnets()
)

In [11]:
from mods_workflow_helper.utils.secure_session import create_secure_session_config
from sagemaker.workflow.pipeline_context import PipelineSession

sagemaker_config = create_secure_session_config(
    role_arn=PipelineSession().get_caller_identity_arn(),
    # If you are uploading to andes, use cradle_read_s3_bucket_name() and get_cradle_read_bucket_kms_key() respectively
    bucket_name=sais_session.team_owned_s3_bucket_name(),
    kms_key=sais_session.get_team_owned_bucket_kms_key(),
    vpc_subnet_ids=sais_session.sandbox_vpc_subnets(),
    vpc_security_groups=[sais_session.sandbox_vpc_security_group()]
)

pipeline_session = PipelineSession(default_bucket=sais_session.team_owned_s3_bucket_name(), sagemaker_config=sagemaker_config) # IMPORTANT now the session uses the generated sagemaker_config

2025-09-21 19:08:48,709 - INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
2025-09-21 19:08:48,892 - INFO - There is no MODS workflow execution id provided, this is probably because you are running your pipeline outside of MODS.
2025-09-21 19:08:48,906 - INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


In [12]:
pipeline_session.config = sagemaker_config

In [13]:
sagemaker_config

{'SchemaVersion': '1.0',
 'SageMaker': {'PythonSDK': {'Modules': {'Session': {'DefaultS3Bucket': 'sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um'},
    'RemoteFunction': {'VpcConfig': {'SecurityGroupIds': ['sg-0f97baad9c44ab543'],
      'Subnets': ['subnet-0e484e4a41446ca78', 'subnet-0206d83e93e45844a']},
     'S3KmsKeyId': 'arn:aws:kms:us-east-1:601857636239:key/5e32d636-8848-43ba-bd1f-f1a1c9a06982',
     'VolumeKmsKeyId': 'arn:aws:kms:us-east-1:601857636239:key/5e32d636-8848-43ba-bd1f-f1a1c9a06982',
     'RoleArn': 'arn:aws:iam::601857636239:role/SandboxRole-lukexie-us-east-1'}}},
  'FeatureGroup': {'OfflineStoreConfig': {'S3StorageConfig': {'KmsKeyId': 'arn:aws:kms:us-east-1:601857636239:key/5e32d636-8848-43ba-bd1f-f1a1c9a06982'}},
   'OnlineStoreConfig': {'SecurityConfig': {'KmsKeyId': 'arn:aws:kms:us-east-1:601857636239:key/5e32d636-8848-43ba-bd1f-f1a1c9a06982'}},
   'RoleArn': 'arn:aws:iam::601857636239:role/SandboxRole-lukexie-us-east-1'},
  'MonitoringSchedule':

In [14]:
print(sais_session.available_clients())
print(sais_session.available_resources())
print(sais_session.my_owned_s3_bucket_name())
print(sais_session.team_owned_s3_bucket_name())

dict_keys(['TRMSModelManagementServiceClient', 'DataAnalyticsWorkflowServiceClient', 'ModelInferenceManagementServiceClient', 'MODSModelWorkflowManagementServiceClient', 'SandboxProxyServiceClient', 'CloudWatch'])
dict_keys(['MDSDataLoader', 'CradleDataLoader', 'MyOwnS3BucketDataLoader', 'SharedBucketS3DataLoader', 'CrossTeamS3BucketDataLoader', 'CrossTeamExternalS3BucketDataLoader', 'EdxDataLoader', 'FeatureHubDataLoader', 'Docker', 'OTFSimulationTool', 'DAWSWorkflowFilesDownloader', 'MIMSModelRegistrar', 'TagFileUploader', 'AlexandriaDataLoader', 'DataUploader', 'MODSWorkflowBuilder'])
sandboxuserdependency-lukexie-personals3bucket-7hkl6lkqmo82
sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um


<a id='1.1'></a>
### SageMaker Pipeline Steps

In [15]:
import sagemaker
from sagemaker import Session, TrainingInput
from sagemaker import image_uris, model_uris, script_uris
from sagemaker.processing import ProcessingOutput, ProcessingInput, FrameworkProcessor
from sagemaker.sklearn import SKLearnProcessor, SKLearn
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import (
    ProcessingStep,
    TrainingStep,
    TuningStep,
    TransformStep,
)

<a id='1.2'></a>
### MODS Import

In [16]:
from mods.mods_template import MODSTemplate

from secure_ai_sandbox_workflow_python_sdk.mims_model_registration.mims_model_registration_processing_step import (
    MimsModelRegistrationProcessingStep,
)
from secure_ai_sandbox_workflow_python_sdk.cradle_data_loading.cradle_data_loading_step import (
    CradleDataLoadingStep,
)
from mods_workflow_core.utils.constants import (
    PIPELINE_EXECUTION_TEMP_DIR,
    KMS_ENCRYPTION_KEY_PARAM,
    PROCESSING_JOB_SHARED_NETWORK_CONFIG,
    SECURITY_GROUP_ID,
    VPC_SUBNET,
)
from secure_ai_sandbox_workflow_python_sdk.model_performance_evaluation.model_performance_evaluation_step import (
    ModelPerformanceEvaluationStep,
)

from secure_ai_sandbox_workflow_python_sdk.model_performance_evaluation.model_performance_evaluation_processor import (
    ModelPerformanceEvaluationProcessor,
)
from secure_ai_sandbox_workflow_python_sdk.utils.constants import PROCESSOR_DIRECTORY_ROOT

In [17]:
PIPELINE_EXECUTION_TEMP_DIR

ParameterString(name='EXECUTION_S3_PREFIX', parameter_type=<ParameterTypeEnum.STRING: 'String'>, default_value=None)

In [18]:
KMS_ENCRYPTION_KEY_PARAM

ParameterString(name='KMS_ENCRYPTION_KEY_PARAM', parameter_type=<ParameterTypeEnum.STRING: 'String'>, default_value=None)

In [19]:
SECURITY_GROUP_ID

ParameterString(name='SECURITY_GROUP_ID', parameter_type=<ParameterTypeEnum.STRING: 'String'>, default_value=None)

In [20]:
VPC_SUBNET

ParameterString(name='VPC_SUBNET', parameter_type=<ParameterTypeEnum.STRING: 'String'>, default_value=None)

In [21]:
from pydantic import BaseModel, Field, model_validator, field_validator
from typing import List, Optional, Dict, Any, Union, ClassVar

In [22]:
from pathlib import Path
import sys

# Add the src/ folder under the current working directory to sys.path
project_root = str(Path().absolute() / 'src')
if project_root not in sys.path:
    sys.path.insert(0, project_root)  
    print(f"add project root {project_root} into system")

add project root /home/ec2-user/SageMaker/Cursus/src into system


In [23]:
project_root

'/home/ec2-user/SageMaker/Cursus/src'

In [24]:
current_dir = Path.cwd()
current_dir

PosixPath('/home/ec2-user/SageMaker/Cursus')

## **REQUIRED**: Set Up Training Hyperparameters

In this section, the user provides the information that describes their model development. This information is necessary for the construction of
* **data loading**,
* **training**,
* **payload testing** and
* **model registration**


The user needs to provide information for the following fields:
* input variable list
    * **full variable list**
    * **tabular variable list** (float type)
    * **categorical variable list** (string or object type)
* region
* **label field name**
* **unique id name**
* *binary* classification or *multi-class* classification
    * if multi-class classification, *number of classes*
    * *class weight*
* *metric choice*
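
Before building the hyperparameter object, it can help to sanity-check the field lists against each other. The sketch below is a hypothetical helper (`check_field_lists` is not part of `cursus`), and the rules it enforces are assumptions: the tabular and categorical lists are subsets of the full list, they do not overlap, and the label and ID fields appear in the full list.

```python
def check_field_lists(full_fields, tab_fields, cat_fields, label_name, id_name):
    """Return a list of problems found; an empty list means the inputs are consistent."""
    problems = []
    full = set(full_fields)

    # Every tabular or categorical field should also be in the full list.
    missing = [f for f in list(tab_fields) + list(cat_fields) if f not in full]
    if missing:
        problems.append(f"not in full list: {missing}")

    # A field should not be declared as both tabular and categorical.
    overlap = sorted(set(tab_fields) & set(cat_fields))
    if overlap:
        problems.append(f"both tabular and categorical: {overlap}")

    # The label and ID columns must be pulled together with the features.
    for name in (label_name, id_name):
        if name not in full:
            problems.append(f"'{name}' missing from full list")
    return problems


# Toy example using a handful of field names from this notebook.
example_full = ["COMP_DAYOB", "PAYMETH", "claim_reason", "order_id", "is_abuse"]
example_tab = ["COMP_DAYOB"]
example_cat = ["PAYMETH", "claim_reason"]
issues = check_field_lists(example_full, example_tab, example_cat, "is_abuse", "order_id")
print(issues)
```

The same call can be made with `full_field_list`, `tab_field_list`, `cat_field_list`, `label_name` and `id_name` once they are defined in the cells that follow.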

In [25]:
region_list = [
    'NA',
    'EU',
    'FE'
]

In [26]:
region_selection = 0

In [27]:
region = region_list[region_selection]
region

'NA'

### Base Hyperparameters

In [28]:
from cursus.core.base.hyperparameters_base import ModelHyperparameters

In [29]:
def remove_case_insensitive_duplicates(string_list: list) -> list:
    """
    Remove duplicates from a list of strings, case-insensitive.
    Keeps the first occurrence of each string.
    
    Args:
        string_list (list): List of strings with potential case-insensitive duplicates
        
    Returns:
        list: List with duplicates removed (case-insensitive)
        
    Example:
        >>> remove_case_insensitive_duplicates(['Hello', 'HELLO', 'world', 'World'])
        ['Hello', 'world']
    """
    seen = set()
    result = []
    
    for s in string_list:
        if s.lower() not in seen:
            seen.add(s.lower())
            result.append(s)
    
    return result

##### Choose Variable

In [30]:
full_field_list = [
f'Abuse.abuse_fap_action_by_customer_inline_transform_{region.lower()}.n_claims_solicit_count_last_365_days',
f'Abuse.abuse_fap_action_by_customer_inline_transform_{region.lower()}.n_claims_warn_count_last_365_days',
f'Abuse.abuse_fap_action_by_customer_inline_transform_{region.lower()}.n_concession_solicit_count_last_365_days',
f'Abuse.abuse_fap_action_by_customer_inline_transform_{region.lower()}.n_concession_warn_count_last_365_days',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_max_buyer_order_message_time_gap',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_max_order_message_time_gap',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_max_seller_order_message_time_gap',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_message_count_with_diff_topic_si',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_message_count_with_notr_topic_si',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_message_count_with_return_keywords_si',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_min_buyer_message_count',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_min_buyer_order_message_time_gap',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_min_message_count',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_min_order_message_time_gap',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_min_seller_message_count',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_min_seller_order_message_time_gap',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_total_buyer_message_count',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_total_message_count',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_total_seller_message_count',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_total_topic_count',
'Abuse.completed_afn_orders_by_customer_marketplace.n_afn_order_count_last_365_days',
'Abuse.completed_afn_orders_by_customer_marketplace.n_afn_unit_amount_last_365_days',
'Abuse.completed_afn_orders_by_customer_marketplace.n_afn_unit_count_last_365_days',
'Abuse.completed_mfn_orders_by_customer_marketplace.n_mfn_order_count_last_365_days',
'Abuse.completed_mfn_orders_by_customer_marketplace.n_mfn_unit_amount_last_365_days',
'Abuse.completed_mfn_orders_by_customer_marketplace.n_mfn_unit_count_last_365_days',
'Abuse.dnr_by_customer_marketplace.n_dnr_amount_si_last_365_days',
'Abuse.dnr_by_customer_marketplace.n_dnr_order_count_last_365_days',
'Abuse.dnr_by_customer_marketplace.n_dnr_unit_amount_last_365_days',
'Abuse.dnr_by_customer_marketplace.n_dnr_unit_count_last_365_days',
'Abuse.mfn_a2z_claims_by_customer_na.n_mfn_claims_amount_last_365_days',
'Abuse.mfn_a2z_claims_by_customer_na.n_mfn_claims_count_last_365_days',
'Abuse.mfn_a2z_claims_by_customer_na.n_mfn_diff_claims_amount_last_365_days',
'Abuse.mfn_a2z_claims_by_customer_na.n_mfn_diff_claims_count_last_365_days',
'Abuse.mfn_a2z_claims_by_customer_na.n_mfn_notr_claims_amount_last_365_days',
'Abuse.mfn_a2z_claims_by_customer_na.n_mfn_notr_claims_count_last_365_days',
'Abuse.mfn_categorized_refunds_by_customer_marketplace.n_mfn_diff_refunds_order_count_last_365_days',
'Abuse.mfn_categorized_refunds_by_customer_marketplace.n_mfn_diff_refunds_unit_amount_last_365_days',
'Abuse.mfn_categorized_refunds_by_customer_marketplace.n_mfn_diff_refunds_unit_count_last_365_days',
'Abuse.mfn_categorized_refunds_by_customer_marketplace.n_mfn_notr_refunds_order_count_last_365_days',
'Abuse.mfn_categorized_refunds_by_customer_marketplace.n_mfn_notr_refunds_unit_amount_last_365_days',
'Abuse.mfn_categorized_refunds_by_customer_marketplace.n_mfn_notr_refunds_unit_count_last_365_days',
'Abuse.mfn_categorized_refunds_si_by_customer_marketplace.n_mfn_diff_refunds_si_365_days',
'Abuse.mfn_categorized_refunds_si_by_customer_marketplace.n_mfn_notr_refunds_si_365_days',
'Abuse.mfn_refunds_by_customer_marketplace.n_mfn_refund_order_count_last_365_days',
'Abuse.mfn_refunds_by_customer_marketplace.n_mfn_refund_unit_amount_last_365_days',
'Abuse.mfn_refunds_by_customer_marketplace.n_mfn_refund_unit_count_last_365_days',
'Abuse.mfn_refunds_si_by_customer_marketplace.n_mfn_refund_amount_si_last_365_days',
'Abuse.order_to_execution_time_from_eventvariables.n_order_to_execution',
'Abuse.shiptrack_flag_by_order.n_any_delivered',
'Abuse.shiptrack_flag_by_order.n_any_available_for_pickup',
'Abuse.shiptrack_flag_by_order.n_any_partial_delivered',
'Abuse.shiptrack_flag_by_order.n_any_undeliverable',
'Abuse.shiptrack_flag_by_order.n_any_returning',
'Abuse.shiptrack_flag_by_order.n_any_returned',
'COMP_DAYOB',
'PAYMETH',
'claimAmount_value',
'claim_reason',
'claimantInfo_allClaimCount365day',
'claimantInfo_lifetimeClaimCount',
'claimantInfo_pendingClaimCount',
'claimantInfo_status',
'shipments_status',
'order_id',
'marketplace_id',
'is_abuse'
]

In [31]:
cat_field_list = [
    'PAYMETH',
    'claim_reason',
    'claimantInfo_status',
    'shipments_status'
]

In [32]:
tab_field_list = [
f'Abuse.abuse_fap_action_by_customer_inline_transform_{region.lower()}.n_claims_solicit_count_last_365_days',
f'Abuse.abuse_fap_action_by_customer_inline_transform_{region.lower()}.n_claims_warn_count_last_365_days',
f'Abuse.abuse_fap_action_by_customer_inline_transform_{region.lower()}.n_concession_solicit_count_last_365_days',
f'Abuse.abuse_fap_action_by_customer_inline_transform_{region.lower()}.n_concession_warn_count_last_365_days',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_max_buyer_order_message_time_gap',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_max_order_message_time_gap',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_max_seller_order_message_time_gap',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_message_count_with_diff_topic_si',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_message_count_with_notr_topic_si',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_message_count_with_return_keywords_si',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_min_buyer_message_count',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_min_buyer_order_message_time_gap',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_min_message_count',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_min_order_message_time_gap',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_min_seller_message_count',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_min_seller_order_message_time_gap',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_total_buyer_message_count',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_total_message_count',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_total_seller_message_count',
f'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_{region.lower()}.n_total_topic_count',
'Abuse.completed_afn_orders_by_customer_marketplace.n_afn_order_count_last_365_days',
'Abuse.completed_afn_orders_by_customer_marketplace.n_afn_unit_amount_last_365_days',
'Abuse.completed_afn_orders_by_customer_marketplace.n_afn_unit_count_last_365_days',
'Abuse.completed_mfn_orders_by_customer_marketplace.n_mfn_order_count_last_365_days',
'Abuse.completed_mfn_orders_by_customer_marketplace.n_mfn_unit_amount_last_365_days',
'Abuse.completed_mfn_orders_by_customer_marketplace.n_mfn_unit_count_last_365_days',
'Abuse.dnr_by_customer_marketplace.n_dnr_amount_si_last_365_days',
'Abuse.dnr_by_customer_marketplace.n_dnr_order_count_last_365_days',
'Abuse.dnr_by_customer_marketplace.n_dnr_unit_amount_last_365_days',
'Abuse.dnr_by_customer_marketplace.n_dnr_unit_count_last_365_days',
'Abuse.mfn_a2z_claims_by_customer_na.n_mfn_claims_amount_last_365_days',
'Abuse.mfn_a2z_claims_by_customer_na.n_mfn_claims_count_last_365_days',
'Abuse.mfn_a2z_claims_by_customer_na.n_mfn_diff_claims_amount_last_365_days',
'Abuse.mfn_a2z_claims_by_customer_na.n_mfn_diff_claims_count_last_365_days',
'Abuse.mfn_a2z_claims_by_customer_na.n_mfn_notr_claims_amount_last_365_days',
'Abuse.mfn_a2z_claims_by_customer_na.n_mfn_notr_claims_count_last_365_days',
'Abuse.mfn_categorized_refunds_by_customer_marketplace.n_mfn_diff_refunds_order_count_last_365_days',
'Abuse.mfn_categorized_refunds_by_customer_marketplace.n_mfn_diff_refunds_unit_amount_last_365_days',
'Abuse.mfn_categorized_refunds_by_customer_marketplace.n_mfn_diff_refunds_unit_count_last_365_days',
'Abuse.mfn_categorized_refunds_by_customer_marketplace.n_mfn_notr_refunds_order_count_last_365_days',
'Abuse.mfn_categorized_refunds_by_customer_marketplace.n_mfn_notr_refunds_unit_amount_last_365_days',
'Abuse.mfn_categorized_refunds_by_customer_marketplace.n_mfn_notr_refunds_unit_count_last_365_days',
'Abuse.mfn_categorized_refunds_si_by_customer_marketplace.n_mfn_diff_refunds_si_365_days',
'Abuse.mfn_categorized_refunds_si_by_customer_marketplace.n_mfn_notr_refunds_si_365_days',
'Abuse.mfn_refunds_by_customer_marketplace.n_mfn_refund_order_count_last_365_days',
'Abuse.mfn_refunds_by_customer_marketplace.n_mfn_refund_unit_amount_last_365_days',
'Abuse.mfn_refunds_by_customer_marketplace.n_mfn_refund_unit_count_last_365_days',
'Abuse.mfn_refunds_si_by_customer_marketplace.n_mfn_refund_amount_si_last_365_days',
'Abuse.order_to_execution_time_from_eventvariables.n_order_to_execution',
'Abuse.shiptrack_flag_by_order.n_any_delivered',
'Abuse.shiptrack_flag_by_order.n_any_available_for_pickup',
'Abuse.shiptrack_flag_by_order.n_any_partial_delivered',
'Abuse.shiptrack_flag_by_order.n_any_undeliverable',
'Abuse.shiptrack_flag_by_order.n_any_returning',
'Abuse.shiptrack_flag_by_order.n_any_returned',
'COMP_DAYOB',
'claimAmount_value',
'claimantInfo_allClaimCount365day',
'claimantInfo_lifetimeClaimCount',
'claimantInfo_pendingClaimCount',
]

In [33]:
label_name = 'is_abuse'         
id_name = 'order_id'
#marketplace_id_col='marketplace_id'


In [34]:
multiclass_categories = [0, 1]  #[0, 1, 2]

#### Construct Base Class

In [35]:
base_hyperparameter = ModelHyperparameters(
    full_field_list=full_field_list,
    cat_field_list=cat_field_list,
    tab_field_list=tab_field_list,
    label_name=label_name,
    id_name=id_name,
    # multi-class or binary class
    multiclass_categories=multiclass_categories
)

In [36]:
print(str(base_hyperparameter))

=== ModelHyperparameters ===

- Essential User Inputs -
cat_field_list: ['PAYMETH', 'claim_reason', 'claimantInfo_status', 'shipments_status']
full_field_list: ['Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_solicit_count_last_365_days', 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_warn_count_last_365_days', 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_solicit_count_last_365_days', 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_warn_count_last_365_days', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_buyer_order_message_time_gap', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_order_message_time_gap', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_seller_order_message_time_gap', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_message_count_with_diff_topic_si', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_me

### XGBoost Hyperparameters

This class is **derived** from the **base hyperparameters** class. It holds the parameters that are specific to **XGBoost training**.
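
The idea of deriving a model-specific class from shared base values can be sketched with plain dataclasses. This is a simplified stand-in, not the `cursus` implementation: `BaseHyperparams`, `XGBHyperparams`, `from_base` and the default values are hypothetical and only illustrate how the base fields are copied before model-specific values are applied.

```python
from dataclasses import dataclass, fields


@dataclass
class BaseHyperparams:  # hypothetical stand-in for the base hyperparameters class
    label_name: str
    id_name: str


@dataclass
class XGBHyperparams(BaseHyperparams):  # hypothetical stand-in for the derived class
    num_round: int = 100  # illustrative default, not the library's
    max_depth: int = 6    # illustrative default, not the library's

    @classmethod
    def from_base(cls, base: BaseHyperparams, **overrides):
        # Copy every field declared on the base class, then add model-specific values.
        shared = {f.name: getattr(base, f.name) for f in fields(BaseHyperparams)}
        return cls(**shared, **overrides)


base = BaseHyperparams(label_name="is_abuse", id_name="order_id")
derived = XGBHyperparams.from_base(base, num_round=300)
print(derived)
```

The benefit of this pattern is that shared values such as the label and ID names are entered once and cannot drift between the base and derived objects.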

In [37]:
from cursus.steps.hyperparams.hyperparameters_xgboost import XGBoostModelHyperparameters

In [38]:
# model_class on SageMaker
model_class = 'xgboost'

In [39]:
model_params = {
    "num_round": 300,
    "max_depth": 6, 
    "min_child_weight": 1
}

In [40]:
model_params

{'num_round': 300, 'max_depth': 6, 'min_child_weight': 1}

#### Construct Derived Class (Share Same Field Values from Base)

In [41]:
xgb_hyperparams = XGBoostModelHyperparameters.from_base_hyperparam(
    base_hyperparameter,
    **model_params
)

In [42]:
xgb_hyperparams

XGBoostModelHyperparameters(full_field_list=['Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_solicit_count_last_365_days', 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_warn_count_last_365_days', 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_solicit_count_last_365_days', 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_warn_count_last_365_days', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_buyer_order_message_time_gap', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_order_message_time_gap', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_seller_order_message_time_gap', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_message_count_with_diff_topic_si', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_message_count_with_notr_topic_si', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_message_count_with_

In [43]:
print(str(xgb_hyperparams))

=== XGBoostModelHyperparameters ===

- Essential User Inputs -
cat_field_list: ['PAYMETH', 'claim_reason', 'claimantInfo_status', 'shipments_status']
full_field_list: ['Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_solicit_count_last_365_days', 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_warn_count_last_365_days', 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_solicit_count_last_365_days', 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_warn_count_last_365_days', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_buyer_order_message_time_gap', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_order_message_time_gap', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_seller_order_message_time_gap', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_message_count_with_diff_topic_si', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_

## Setup Configurations for Steps in Pipeline

### Input/Output Names Standardization Pattern

We follow a consistent pattern for input_names and output_names dictionaries:

### For input_names:
- KEYS are logical names used in pipeline connection code
- VALUES are script input names used in processing/training scripts

### For output_names:
- KEYS are logical names referenced in pipeline connection code
- VALUES are descriptive output names used as keys in outputs dictionaries

This standardization ensures consistent handling across all pipeline steps.

Example:
- Step A's output_names = {"data": "ProcessedData"}
- When connecting: outputs = {"ProcessedData": "s3://path/..."}
- Step B's input_names = {"input": "InputData"}
- When connecting: step_b_inputs = {"input": step_a.properties.ProcessingOutputConfig.Outputs["ProcessedData"].S3Output.S3Uri}
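
The example above can be written out with plain dictionaries. This is a minimal sketch: `connect` is a hypothetical helper, the S3 URI is a placeholder, and a plain dict stands in for the `step_a.properties...` lookup that a real pipeline would use.

```python
# Step A: logical name -> descriptive output name.
step_a_output_names = {"data": "ProcessedData"}
# Step B: logical name -> script input name.
step_b_input_names = {"input": "InputData"}

# Stand-in for what step A produced, keyed by the descriptive output name.
step_a_outputs = {"ProcessedData": "s3://example-bucket/step-a/processed"}  # placeholder URI


def connect(output_names, outputs, output_key, input_names, input_key):
    """Resolve an upstream logical output and key it by the downstream logical input."""
    if input_key not in input_names:
        raise KeyError(f"unknown logical input name: {input_key}")
    descriptive_name = output_names[output_key]  # "data" -> "ProcessedData"
    return {input_key: outputs[descriptive_name]}


step_b_inputs = connect(
    step_a_output_names, step_a_outputs, "data", step_b_input_names, "input"
)
print(step_b_inputs)
```

Connection code only ever touches the logical keys (`"data"`, `"input"`), so the script-facing names can change without rewriting the wiring.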



### Base Config

In [44]:
from pydantic import BaseModel, Field, model_validator, field_validator
from typing import List, Optional, Dict, Any
from pathlib import Path
import json
from datetime import datetime

In [45]:
from cursus.core.base.config_base import BasePipelineConfig

In [46]:
# set up the config list for saving to JSON at the end
config_list = []

### **REQUIRED: [BASE STEP 1.0] Setup Base Config**

In this base config, the user provides all the *necessary information* that is **shared across** all steps.

#### Service Name (for Deployment)

In [47]:
# for MDS data downloading
service_name = 'AtoZ'
service_name

'AtoZ'

#### Default Bucket

In [48]:
bucket=sais_session.team_owned_s3_bucket_name()
bucket

'sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um'

#### IAM Role

In [49]:
role=PipelineSession().get_caller_identity_arn()
role

2025-09-21 19:09:04,920 - INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


'arn:aws:iam::601857636239:role/SandboxRole-lukexie-us-east-1'

#### Author Signature, Pipeline Information, Pipeline Folder

In [50]:
author = sais_session.owner_alias()

In [51]:
pipeline_version = '1.3.1' #'0.0.7'

#### Training Container Set Up

In [52]:
# model_version on SageMaker
framework_version = '1.7-1'
py_version = "py3"

In [53]:
from datetime import date

current_date = date.today().strftime("%Y-%m-%d")
print(current_date)

2025-09-21


In [54]:
#current_date = '2025-07-25'

In [55]:
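if region == 'NA':
    aws_region = "us-east-1"
elif region == 'EU':
    aws_region = "eu-west-1"
elif region == 'FE':
    aws_region = "us-west-2"
else:
    # Fail early instead of leaving aws_region undefined
    raise ValueError(f"Unsupported region: {region}")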

In [56]:
aws_region

'us-east-1'

#### Source Dir

In [57]:
current_dir = Path.cwd()

print(f"Current working directory: {current_dir}")

Current working directory: /home/ec2-user/SageMaker/Cursus


In [58]:
package_root = Path(current_dir).resolve()
source_dir = package_root / 'dockers' / 'xgboost_atoz' 
source_dir

PosixPath('/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz')

#### Construct Base Config

In [59]:
base_config = BasePipelineConfig(
        bucket=bucket,
        current_date=current_date,
        region=region,
        aws_region=aws_region,
        author=author,
        role=role,
        service_name=service_name,
    
        # Overall pipeline identification
        pipeline_version=pipeline_version,

        # Common framework/scripting info (if shared across steps like train/inference)
        framework_version=framework_version,
        py_version=py_version,
        source_dir=str(source_dir)
)

In [60]:
base_config

BasePipelineConfig(author='lukexie', bucket='sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um', role='arn:aws:iam::601857636239:role/SandboxRole-lukexie-us-east-1', region='NA', service_name='AtoZ', pipeline_version='1.3.1', model_class='xgboost', current_date='2025-09-21', framework_version='1.7-1', py_version='py3', source_dir='/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz', aws_region='us-east-1')

In [61]:
base_config.print_config()




===== CONFIGURATION =====
Class: BasePipelineConfig

----- Essential User Inputs (Tier 1) -----
Author: lukexie
Bucket: sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um
Pipeline_Version: 1.3.1
Region: NA
Role: arn:aws:iam::601857636239:role/SandboxRole-lukexie-us-east-1
Service_Name: AtoZ

----- System Inputs with Defaults (Tier 2) -----
Current_Date: 2025-09-21
Framework_Version: 1.7-1
Model_Class: xgboost
Py_Version: py3
Source_Dir: /home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz

----- Derived Fields (Tier 3) -----
Aws_Region: us-east-1
Model_Extra: {'aws_region': 'us-east-1'}
Model_Fields_Set: {'region', 'role', 'pipeline_version', 'framework_version', 'bucket', 'current_date', 'py_version', 'aws_region', 'source_dir', 'author', 'service_name'}
Pipeline_Description: AtoZ xgboost Model NA
Pipeline_Name: lukexie-AtoZ-xgboost-NA
Pipeline_S3_Loc: s3://sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um/MODS/lukexie-AtoZ-xgboost-NA_1.3.1
Portable_Source_Di

In [62]:
print(str(base_config))

=== BasePipelineConfig ===

- Essential User Inputs -
author: lukexie
bucket: sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um
pipeline_version: 1.3.1
region: NA
role: arn:aws:iam::601857636239:role/SandboxRole-lukexie-us-east-1
service_name: AtoZ

- System Inputs -
current_date: 2025-09-21
framework_version: 1.7-1
model_class: xgboost
py_version: py3
source_dir: /home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz

- Derived Fields -
aws_region: us-east-1
model_extra: {'aws_region': 'us-east-1'}
model_fields_set: {'region', 'role', 'pipeline_version', 'framework_version', 'bucket', 'current_date', 'py_version', 'aws_region', 'source_dir', 'author', 'service_name'}
pipeline_description: AtoZ xgboost Model NA
pipeline_name: lukexie-AtoZ-xgboost-NA
pipeline_s3_loc: s3://sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um/MODS/lukexie-AtoZ-xgboost-NA_1.3.1
portable_source_dir: dockers/xgboost_atoz
script_contract: None
step_catalog: <cursus.step_catalog.step_catal

#### Save into Config List

At the end of each section, we save the config into a list so that all configs can be aggregated into a unified JSON file.
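
As a rough illustration of what aggregating a list of configs into one JSON document can look like, the sketch below uses plain dataclasses. It is not the serialization used by `cursus`: the `Example*` classes and the choice of keying each entry by class name are assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ExampleBaseConfig:  # hypothetical stand-in for a base config
    author: str
    region: str


@dataclass
class ExampleProcessingConfig(ExampleBaseConfig):  # hypothetical derived config
    processing_instance_count: int = 1


example_config_list = [
    ExampleBaseConfig(author="alias", region="NA"),
    ExampleProcessingConfig(author="alias", region="NA"),
]

# Key each config by its class name so entries stay distinguishable in the JSON.
unified = {type(cfg).__name__: asdict(cfg) for cfg in example_config_list}
unified_json = json.dumps(unified, indent=2, sort_keys=True)
print(unified_json)
```

Keying by class name only works while each config class appears once; a pipeline with repeated step types would need a more specific key.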

In [63]:
config_list.append(base_config)
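The idea of collecting configs in a list and aggregating them later can be sketched as follows. This is only an illustration using plain dataclasses; `DemoConfig` and `configs_to_json` are hypothetical names, and the actual serialization used by the pipeline may differ.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class DemoConfig:
    """Hypothetical stand-in for a pipeline config object."""
    author: str
    region: str


def configs_to_json(configs):
    """Serialize a list of configs into one JSON document."""
    merged = {}
    for index, cfg in enumerate(configs):
        # Suffix the list position so two configs of the same class do not collide.
        key = f"{type(cfg).__name__}_{index}"
        merged[key] = asdict(cfg)
    return json.dumps(merged, indent=2, sort_keys=True)


demo_list = [DemoConfig("demo-user", "NA"), DemoConfig("demo-user", "EU")]
unified_json = configs_to_json(demo_list)
print(unified_json)
```

Keying by class name plus position is one simple way to keep replicated steps (such as the training and calibration data flows) apart in a single file.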

### [BASE STEP 2.0] Base Processing Config

The following fields are shared across all **processing step** configurations.

In [64]:
from cursus.steps.configs.config_processing_step_base import ProcessingStepConfigBase
from cursus.steps.configs.config_package_step import PackageConfig

In [65]:
processing_source_dir = source_dir / 'scripts'
processing_source_dir

PosixPath('/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz/scripts')

In [66]:
processing_dict = {
    'processing_source_dir': str(processing_source_dir),
    'processing_instance_type_large': 'ml.m5.12xlarge',
    'processing_instance_type_small': 'ml.m5.4xlarge',
}

In [67]:
processing_step_config = ProcessingStepConfigBase.from_base_config(
    base_config,
    **processing_dict
)
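The `from_base_config` pattern used above copies the shared fields from a parent config and adds step-specific fields on top. A minimal sketch of that idea with plain dataclasses is shown below; the class names are hypothetical and this is not the cursus implementation.

```python
from dataclasses import dataclass, fields


@dataclass
class BaseDemoConfig:
    author: str
    region: str

    @classmethod
    def from_base_config(cls, base, **overrides):
        # Copy every field defined on the base instance, then apply overrides.
        inherited = {f.name: getattr(base, f.name) for f in fields(base)}
        inherited.update(overrides)
        return cls(**inherited)


@dataclass
class ProcessingDemoConfig(BaseDemoConfig):
    processing_instance_type_small: str = "ml.m5.4xlarge"
    processing_instance_type_large: str = "ml.m5.12xlarge"


demo_base = BaseDemoConfig(author="demo-user", region="NA")
demo_processing = ProcessingDemoConfig.from_base_config(
    demo_base, processing_instance_type_large="ml.m5.12xlarge"
)
print(demo_processing)
```

Because the shared values are copied rather than retyped, every derived config stays consistent with the base config.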

In [68]:
processing_step_config

ProcessingStepConfigBase(author='lukexie', bucket='sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um', role='arn:aws:iam::601857636239:role/SandboxRole-lukexie-us-east-1', region='NA', service_name='AtoZ', pipeline_version='1.3.1', model_class='xgboost', current_date='2025-09-21', framework_version='1.7-1', py_version='py3', source_dir='/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz', processing_instance_count=1, processing_volume_size=500, processing_instance_type_large='ml.m5.12xlarge', processing_instance_type_small='ml.m5.4xlarge', use_large_processing_instance=False, processing_source_dir='/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz/scripts', processing_entry_point=None, processing_script_arguments=None, processing_framework_version='1.2-1')

In [69]:
print(str(processing_step_config))



=== ProcessingStepConfigBase ===

- Essential User Inputs -
author: lukexie
bucket: sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um
pipeline_version: 1.3.1
region: NA
role: arn:aws:iam::601857636239:role/SandboxRole-lukexie-us-east-1
service_name: AtoZ

- System Inputs -
current_date: 2025-09-21
framework_version: 1.7-1
model_class: xgboost
processing_framework_version: 1.2-1
processing_instance_count: 1
processing_instance_type_large: ml.m5.12xlarge
processing_instance_type_small: ml.m5.4xlarge
processing_source_dir: /home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz/scripts
processing_volume_size: 500
py_version: py3
source_dir: /home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz
use_large_processing_instance: False

- Derived Fields -
aws_region: us-east-1
effective_instance_type: ml.m5.4xlarge
effective_source_dir: dockers/xgboost_atoz/scripts
model_extra: {}
model_fields_set: {'processing_instance_type_small', 'region', 'role', 'processing_source_dir', 'model_cla

In [70]:
processing_step_config.source_dir

'/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz'

In [71]:
processing_step_config.portable_source_dir

'dockers/xgboost_atoz'

In [72]:
processing_step_config.processing_source_dir

'/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz/scripts'

In [73]:
processing_step_config.portable_processing_source_dir

'dockers/xgboost_atoz/scripts'
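The derived fields shown above (`effective_instance_type` and the portable directories) can be reproduced with a short sketch. It assumes the portable path is the absolute path taken relative to a project root, and that the root is `/home/ec2-user/SageMaker/Cursus`; both are inferred from the printed values rather than from the library source, and the function names are hypothetical.

```python
from pathlib import PurePosixPath


def effective_instance_type(use_large, small, large):
    """Pick the large instance type only when the flag is set."""
    return large if use_large else small


def portable_path(absolute_dir, project_root):
    """Express an absolute directory relative to an assumed project root."""
    return str(PurePosixPath(absolute_dir).relative_to(project_root))


# The root below is an assumption based on the outputs above.
demo_root = "/home/ec2-user/SageMaker/Cursus"
demo_scripts = "/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz/scripts"

print(effective_instance_type(False, "ml.m5.4xlarge", "ml.m5.12xlarge"))
print(portable_path(demo_scripts, demo_root))
```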

#### Save to Config List

In [74]:
# Save to config list
config_list.append(processing_step_config)

### **REQUIRED: [STEP 1.0] Cradle Data Loading Config**

In this section, the user provides the input to construct a **Cradle profile**. A Cradle profile has **four** sections:
1. **Data Source Specification**: specifies
    1. *data source* (MDS, EDX, ANDES)
    2. *input schema*
2. **Transform Specification**: specifies
    1. *transform SQL*
    2. *job split*
3. **Output Specification**: specifies
    1. *output path*
    2. *output format* (CSV, UNESCAPED_TSV, JSON, ION, PARQUET)
    3. *output schema*
    4. *save mode*
4. **Cradle Job Specification**: specifies
    1. *cradle account*
    2. *cluster_type*

This config is for **CradleDataLoadingStep**, a customized step provided by [SecureAISandboxWorkflowPythonSDK](https://code.amazon.com/packages/SecureAISandboxWorkflowPythonSDK/trees/mainline#)
* This step inherits from **MODSPredefinedProcessingStep**, a customized base class that itself inherits from **ScriptProcessingStep**. The source code is in [MODSWorkflowCore](https://code.amazon.com/packages/MODSWorkflowCore/trees/mainline#)
* This step needs to load an **Execution Document** to take action.
* This step itself does not have many options.
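The four sections of a Cradle profile can be pictured as a nested dictionary. The sketch below checks that such a dictionary is complete. The section names `data_sources_spec`, `transform_spec` and `output_spec` match attributes used later in this notebook; `cradle_job_spec` and all inner key names are assumptions made for illustration only.

```python
# Section names and required keys; the inner key names are hypothetical.
REQUIRED_SECTIONS = {
    "data_sources_spec": {"data_sources"},
    "transform_spec": {"transform_sql", "job_split_options"},
    "output_spec": {"output_path", "output_format", "output_schema", "output_save_mode"},
    "cradle_job_spec": {"cradle_account", "cluster_type"},
}


def missing_profile_keys(profile):
    """Return the sections or keys that a profile dictionary is missing."""
    missing = []
    for section, keys in REQUIRED_SECTIONS.items():
        body = profile.get(section)
        if body is None:
            missing.append(section)
            continue
        missing.extend(f"{section}.{key}" for key in sorted(keys - set(body)))
    return missing


demo_profile = {
    "data_sources_spec": {"data_sources": []},
    "transform_spec": {"transform_sql": "SELECT 1", "job_split_options": {}},
    "output_spec": {"output_path": "s3://demo", "output_format": "PARQUET", "output_schema": []},
    "cradle_job_spec": {"cradle_account": "demo-account", "cluster_type": "MEDIUM"},
}
print(missing_profile_keys(demo_profile))
```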

In [75]:
from cursus.steps.configs.config_cradle_data_loading_step import (CradleDataLoadConfig,
                                                    MdsDataSourceConfig,
                                                    EdxDataSourceConfig,
                                                    DataSourceConfig,
                                                    DataSourcesSpecificationConfig,
                                                    JobSplitOptionsConfig,
                                                    TransformSpecificationConfig,
                                                    OutputSpecificationConfig,
                                                    CradleJobSpecificationConfig
                                                   )

In [76]:
from secure_ai_sandbox_workflow_python_sdk.utils.constants import (
        OUTPUT_TYPE_DATA,
        OUTPUT_TYPE_METADATA,
        OUTPUT_TYPE_SIGNATURE,
    )

In [77]:
from cursus.core.config_fields.cradle_config_factory import create_cradle_data_load_config

In [78]:
role

'arn:aws:iam::601857636239:role/SandboxRole-lukexie-us-east-1'

In [79]:
region

'NA'

#### [REQUIRED] 2.1.1 MDS Data Source 

In [80]:
mds_field_list=['objectId', 
                 'transactionDate'] + tab_field_list + cat_field_list

In [81]:
mds_field_list

['objectId',
 'transactionDate',
 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_solicit_count_last_365_days',
 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_warn_count_last_365_days',
 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_solicit_count_last_365_days',
 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_warn_count_last_365_days',
 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_buyer_order_message_time_gap',
 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_order_message_time_gap',
 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_seller_order_message_time_gap',
 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_message_count_with_diff_topic_si',
 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_message_count_with_notr_topic_si',
 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_message_count_with_re

In [82]:
training_start_datetime = '2025-01-01T00:00:00'  #'2024-12-01T00:00:00'  #'2024-03-01T00:00:00'
training_end_datetime = '2025-04-17T00:00:00' 

In [83]:
calibration_start_datetime ='2025-04-17T00:00:00'  #'2024-05-26T00:00:00'
calibration_end_datetime = '2025-04-28T00:00:00' #'2024-06-29T23:00:00'
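Before building the configs, it can be useful to confirm that the training and calibration windows are well formed and do not overlap. The following is a small sketch, assuming each window is half-open (the end timestamp is exclusive); that convention is an assumption, not something stated by Cradle.

```python
from datetime import datetime


def parse_window(start, end):
    """Parse an ISO start/end pair and make sure start comes first."""
    start_dt = datetime.fromisoformat(start)
    end_dt = datetime.fromisoformat(end)
    if start_dt >= end_dt:
        raise ValueError(f"start {start} must be earlier than end {end}")
    return start_dt, end_dt


def windows_overlap(first, second):
    """Two half-open windows overlap when each starts before the other ends."""
    return first[0] < second[1] and second[0] < first[1]


training_window = parse_window("2025-01-01T00:00:00", "2025-04-17T00:00:00")
calibration_window = parse_window("2025-04-17T00:00:00", "2025-04-28T00:00:00")

print(windows_overlap(training_window, calibration_window))
print((training_window[1] - training_window[0]).days)
```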

In [84]:
service_name = 'AtoZ'

In [85]:
org_id=0

#### [REQUIRED] 2.1.2 Tag EDX Data Source

In [86]:
tag_edx_provider = "trms-abuse-analytics"

In [87]:
tag_edx_subject = "qingyuye-notr-exp"

In [88]:
tag_edx_dataset = "atoz-tag"

In [89]:
etl_job_id_dict = {
    'NA': '24292902',
    'EU': '24292941',
    'FE': '25782074',
}

In [90]:
etl_job_id = etl_job_id_dict[region]
etl_job_id

'24292902'

In [91]:
tag_schema = [
    'order_id',
    'marketplace_id',
    'tag_date',
    'is_abuse',
    'abuse_type',
    'concession_type',
]

tag_edx_provider = "trms-abuse-analytics"
tag_edx_subject = "pre-delivery"
tag_edx_dataset = "otf"

In [92]:
edx_manifest_comment = region
edx_manifest_comment

'NA'

etl_job_id = '28480724'

tag_schema = [
    'order_id',
    'customerid',
    'marketplace_id',
    'order_fulfillment_network',
    'transactiondate',
    'finaldecision',
    'order_total',
    'is_pda_conceded_order_afn',
    'is_abuse_pda_afn',
    'is_refund_denied_order',
    'is_cap_denied_order',
    'is_chargeback_order_afn',
    'is_conceded_order',
    'is_enforced_customer',
    'has_pda_concession',
    'has_pda_abuse',
    'random_number'
]

#### [REQUIRED] 2.2 Output Specification

`output_format_list=["CSV","UNESCAPED_TSV","JSON","ION","PARQUET"]`

In [93]:
output_format = "PARQUET"

#### [REQUIRED] 2.3 Transform Specification

#### [REQUIRED] 2.4 Cradle Job Specification

In [94]:
aws_region

'us-east-1'

In [95]:
cradle_account = 'Buyer-Abuse-RnD-Dev'

In [96]:
current_date

'2025-09-21'

In [97]:
available_cluster_types = ['STANDARD',
                           'SMALL',
                           'MEDIUM',
                           'LARGE'
                          ]

In [98]:
cluster_choice=-2

In [99]:
cluster_type=available_cluster_types[cluster_choice]
cluster_type

'MEDIUM'
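Selecting by a negative index works, but it silently changes meaning if the list is reordered. An alternative sketch is to select the cluster type by name and validate it; `pick_cluster_type` is a hypothetical helper, not part of any SDK.

```python
available_cluster_types = ["STANDARD", "SMALL", "MEDIUM", "LARGE"]


def pick_cluster_type(name, allowed=available_cluster_types):
    """Return the normalized cluster type, or raise if it is not allowed."""
    normalized = name.strip().upper()
    if normalized not in allowed:
        raise ValueError(f"Unknown cluster type {name!r}; choose one of {allowed}")
    return normalized


print(pick_cluster_type("medium"))
```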

In [100]:
base_config

BasePipelineConfig(author='lukexie', bucket='sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um', role='arn:aws:iam::601857636239:role/SandboxRole-lukexie-us-east-1', region='NA', service_name='AtoZ', pipeline_version='1.3.1', model_class='xgboost', current_date='2025-09-21', framework_version='1.7-1', py_version='py3', source_dir='/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz', aws_region='us-east-1')

#### Assemble Cradle Data Load Config (for Training, Calibration) via Cradle Config Factory

In [101]:
training_cradle_data_load_config = create_cradle_data_load_config(
    # Base pipeline essentials
    base_config=base_config,
    # Job configuration
    job_type='training',
    
    # MDS field list (direct fields to include)
    mds_field_list=mds_field_list,
    
    # Data timeframe
    start_date=training_start_datetime,
    end_date=training_end_datetime,
    
    # MDS data source
    service_name=service_name,
    
    # EDX data source
    tag_edx_provider=tag_edx_provider,
    tag_edx_subject=tag_edx_subject,
    tag_edx_dataset=tag_edx_dataset,
    etl_job_id=etl_job_id,
    
    # Infrastructure configuration
    cradle_account=cradle_account,
    org_id=org_id,
    edx_manifest_comment=edx_manifest_comment,
    
    
    # Optional overrides with reasonable defaults
    cluster_type=cluster_type,
    output_format= output_format,
    output_save_mode="ERRORIFEXISTS",
    #transform_sql=training_transform_sql,
    use_dedup_sql=True,
    tag_schema=tag_schema,
    
    # Join configuration
    mds_join_key= 'objectId',
    edx_join_key= 'order_id',
    join_type= 'JOIN'
)

In [102]:
print(training_cradle_data_load_config.transform_spec.transform_sql)


SELECT
    Abuse__DOT__abuse_fap_action_by_customer_inline_transform_na__DOT__n_claims_solicit_count_last_365_days,
    Abuse__DOT__abuse_fap_action_by_customer_inline_transform_na__DOT__n_claims_warn_count_last_365_days,
    Abuse__DOT__abuse_fap_action_by_customer_inline_transform_na__DOT__n_concession_solicit_count_last_365_days,
    Abuse__DOT__abuse_fap_action_by_customer_inline_transform_na__DOT__n_concession_warn_count_last_365_days,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_max_buyer_order_message_time_gap,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_max_order_message_time_gap,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_max_seller_order_message_time_gap,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_message_count_with_diff_topic_si,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_message_count_with_notr_topi
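In the generated SQL, the dots in MDS field names appear as `__DOT__`. The sketch below reproduces that escaping for a list of field names. The helper names are hypothetical, and the real factory may apply additional rules that are not visible in the printed output.

```python
DOT_TOKEN = "__DOT__"


def escape_field(name):
    """Replace dots so the field name can be used as a SQL column identifier."""
    return name.replace(".", DOT_TOKEN)


def unescape_field(name):
    """Reverse the escaping (only safe if the original name had no DOT_TOKEN)."""
    return name.replace(DOT_TOKEN, ".")


def select_clause(field_names):
    """Build the SELECT column list in the same layout as the output above."""
    columns = ",\n".join(f"    {escape_field(name)}" for name in field_names)
    return f"SELECT\n{columns}"


print(select_clause(["Abuse.shiptrack_flag_by_order.n_any_returned", "objectId"]))
```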

In [103]:
print(training_cradle_data_load_config.data_sources_spec.data_sources[0])

data_source_name='RAW_MDS_NA' data_source_type='MDS' mds_data_source_properties=MdsDataSourceConfig(service_name='AtoZ', region='NA', output_schema=[{'field_name': 'Abuse.completed_mfn_orders_by_customer_marketplace.n_mfn_unit_count_last_365_days', 'field_type': 'STRING'}, {'field_name': 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_min_buyer_message_count', 'field_type': 'STRING'}, {'field_name': 'Abuse.shiptrack_flag_by_order.n_any_returned', 'field_type': 'STRING'}, {'field_name': 'Abuse.completed_afn_orders_by_customer_marketplace.n_afn_unit_count_last_365_days', 'field_type': 'STRING'}, {'field_name': 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_order_message_time_gap', 'field_type': 'STRING'}, {'field_name': 'Abuse.shiptrack_flag_by_order.n_any_available_for_pickup', 'field_type': 'STRING'}, {'field_name': 'transactionDate', 'field_type': 'STRING'}, {'field_name': 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_solicit

In [104]:
print(training_cradle_data_load_config.data_sources_spec.data_sources[1])

data_source_name='TAGS' data_source_type='EDX' mds_data_source_properties=None edx_data_source_properties=EdxDataSourceConfig(edx_provider='trms-abuse-analytics', edx_subject='qingyuye-notr-exp', edx_dataset='atoz-tag', edx_manifest_key='["24292902",2025-01-01T00:00:00Z,2025-04-17T00:00:00Z,"NA"]', schema_overrides=[{'field_name': 'order_id', 'field_type': 'STRING'}, {'field_name': 'marketplace_id', 'field_type': 'STRING'}, {'field_name': 'tag_date', 'field_type': 'STRING'}, {'field_name': 'is_abuse', 'field_type': 'STRING'}, {'field_name': 'abuse_type', 'field_type': 'STRING'}, {'field_name': 'concession_type', 'field_type': 'STRING'}]) andes_data_source_properties=None


In [105]:
training_cradle_data_load_config.data_sources_spec.data_sources[1].edx_data_source_properties.edx_manifest

'arn:amazon:edx:iad::manifest/trms-abuse-analytics/qingyuye-notr-exp/atoz-tag/["24292902",2025-01-01T00:00:00Z,2025-04-17T00:00:00Z,"NA"]'
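The manifest ARN above follows a visible pattern: provider, subject and dataset form the path, and the manifest key is a bracketed list of the ETL job id, the start and end timestamps with a `Z` suffix, and an optional comment. The function below is a sketch that reproduces the printed string; it is inferred from the output only, and `build_edx_manifest_arn` and its `region_code` default are assumptions.

```python
def build_edx_manifest_arn(provider, subject, dataset, etl_job_id,
                           start, end, comment=None, region_code="iad"):
    """Rebuild an EDX manifest ARN in the format printed above."""
    key_parts = [f'"{etl_job_id}"', f"{start}Z", f"{end}Z"]
    if comment:
        # The comment (the region in this notebook) is quoted like the job id.
        key_parts.append(f'"{comment}"')
    manifest_key = "[" + ",".join(key_parts) + "]"
    return (
        f"arn:amazon:edx:{region_code}::manifest/"
        f"{provider}/{subject}/{dataset}/{manifest_key}"
    )


demo_arn = build_edx_manifest_arn(
    "trms-abuse-analytics", "qingyuye-notr-exp", "atoz-tag",
    "24292902", "2025-01-01T00:00:00", "2025-04-17T00:00:00", comment="NA",
)
print(demo_arn)
```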

In [106]:
'arn:amazon:edx:iad::manifest/trms-abuse-analytics/pre-delivery/otf/["28480724",2025-06-16T07:00:00Z,2025-06-17T07:00:00Z]'

'arn:amazon:edx:iad::manifest/trms-abuse-analytics/pre-delivery/otf/["28480724",2025-06-16T07:00:00Z,2025-06-17T07:00:00Z]'

In [107]:
calibration_cradle_data_load_config = create_cradle_data_load_config(
    # Base pipeline essentials
    base_config=base_config,
    # Job configuration
    job_type='calibration',
    
    # MDS field list (direct fields to include)
    mds_field_list=mds_field_list,
    
    # Data timeframe
    start_date=calibration_start_datetime,
    end_date=calibration_end_datetime,
    
    # MDS data source
    service_name=service_name,
    
    # EDX data source
    tag_edx_provider=tag_edx_provider,
    tag_edx_subject=tag_edx_subject,
    tag_edx_dataset=tag_edx_dataset,
    etl_job_id=etl_job_id,
    
    # Infrastructure configuration
    cradle_account=cradle_account,
    org_id=org_id,
    edx_manifest_comment=edx_manifest_comment,
    
    # Optional overrides with reasonable defaults
    cluster_type=cluster_type,
    output_format=output_format,
    output_save_mode="ERRORIFEXISTS",
    #transform_sql=training_transform_sql,
    use_dedup_sql=True,
    tag_schema=tag_schema,
    
    # Join configuration
    mds_join_key= 'objectId',
    edx_join_key= 'order_id',
    join_type= 'JOIN'
)

In [108]:
print(calibration_cradle_data_load_config.transform_spec.transform_sql)


SELECT
    Abuse__DOT__abuse_fap_action_by_customer_inline_transform_na__DOT__n_claims_solicit_count_last_365_days,
    Abuse__DOT__abuse_fap_action_by_customer_inline_transform_na__DOT__n_claims_warn_count_last_365_days,
    Abuse__DOT__abuse_fap_action_by_customer_inline_transform_na__DOT__n_concession_solicit_count_last_365_days,
    Abuse__DOT__abuse_fap_action_by_customer_inline_transform_na__DOT__n_concession_warn_count_last_365_days,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_max_buyer_order_message_time_gap,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_max_order_message_time_gap,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_max_seller_order_message_time_gap,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_message_count_with_diff_topic_si,
    Abuse__DOT__bsm_stats_for_evaluated_mfn_concessions_by_customer_na__DOT__n_message_count_with_notr_topi

In [109]:
training_cradle_data_load_config.data_sources_spec.data_sources[0]

DataSourceConfig(data_source_name='RAW_MDS_NA', data_source_type='MDS', mds_data_source_properties=MdsDataSourceConfig(service_name='AtoZ', region='NA', output_schema=[{'field_name': 'Abuse.completed_mfn_orders_by_customer_marketplace.n_mfn_unit_count_last_365_days', 'field_type': 'STRING'}, {'field_name': 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_min_buyer_message_count', 'field_type': 'STRING'}, {'field_name': 'Abuse.shiptrack_flag_by_order.n_any_returned', 'field_type': 'STRING'}, {'field_name': 'Abuse.completed_afn_orders_by_customer_marketplace.n_afn_unit_count_last_365_days', 'field_type': 'STRING'}, {'field_name': 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_order_message_time_gap', 'field_type': 'STRING'}, {'field_name': 'Abuse.shiptrack_flag_by_order.n_any_available_for_pickup', 'field_type': 'STRING'}, {'field_name': 'transactionDate', 'field_type': 'STRING'}, {'field_name': 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n

In [110]:
training_cradle_data_load_config.data_sources_spec.data_sources[1]

DataSourceConfig(data_source_name='TAGS', data_source_type='EDX', mds_data_source_properties=None, edx_data_source_properties=EdxDataSourceConfig(edx_provider='trms-abuse-analytics', edx_subject='qingyuye-notr-exp', edx_dataset='atoz-tag', edx_manifest_key='["24292902",2025-01-01T00:00:00Z,2025-04-17T00:00:00Z,"NA"]', schema_overrides=[{'field_name': 'order_id', 'field_type': 'STRING'}, {'field_name': 'marketplace_id', 'field_type': 'STRING'}, {'field_name': 'tag_date', 'field_type': 'STRING'}, {'field_name': 'is_abuse', 'field_type': 'STRING'}, {'field_name': 'abuse_type', 'field_type': 'STRING'}, {'field_name': 'concession_type', 'field_type': 'STRING'}]), andes_data_source_properties=None)

In [111]:
training_cradle_data_load_config.data_sources_spec.data_sources[1].edx_data_source_properties.edx_manifest

'arn:amazon:edx:iad::manifest/trms-abuse-analytics/qingyuye-notr-exp/atoz-tag/["24292902",2025-01-01T00:00:00Z,2025-04-17T00:00:00Z,"NA"]'

In [112]:
'arn:amazon:edx:iad::manifest/trms-abuse-analytics/qingyuye-notr-exp/atoz-tag/["24292902",2025-01-01T00:00:00Z,2025-04-17T00:00:00Z,"NA"]'

'arn:amazon:edx:iad::manifest/trms-abuse-analytics/qingyuye-notr-exp/atoz-tag/["24292902",2025-01-01T00:00:00Z,2025-04-17T00:00:00Z,"NA"]'

In [113]:
calibration_cradle_data_load_config.data_sources_spec.data_sources[1].edx_data_source_properties.edx_manifest

'arn:amazon:edx:iad::manifest/trms-abuse-analytics/qingyuye-notr-exp/atoz-tag/["24292902",2025-04-17T00:00:00Z,2025-04-28T00:00:00Z,"NA"]'

In [114]:
training_cradle_data_load_config.output_spec.output_schema

['Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_solicit_count_last_365_days',
 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_warn_count_last_365_days',
 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_solicit_count_last_365_days',
 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_warn_count_last_365_days',
 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_buyer_order_message_time_gap',
 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_order_message_time_gap',
 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_seller_order_message_time_gap',
 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_message_count_with_diff_topic_si',
 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_message_count_with_notr_topic_si',
 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_message_count_with_return_keywords_si',
 'Abuse.bsm_st

#### Save to Config List

In [115]:
# Save to config list
config_list.append(training_cradle_data_load_config)

In [116]:
# Save to config list
config_list.append(calibration_cradle_data_load_config)

### [STEP 2.0] Tabular Preprocessing Config (Training)

This step can be seen as a **preprocessing step** for the **training step**, or a **post-processing step** for the **cradle data loading step**.

This step has several simple tasks:
* **Concatenate** chunks of data (due to job split or data split)
* Transform label fields into integer type
* For **job_type = training**, it *splits* the dataset into three parts (train, val, test)
* This step is a **ProcessingStep**

In [117]:
from cursus.steps.configs.config_tabular_preprocessing_step import TabularPreprocessingConfig

In [118]:
training_tabular_preprocessing_step_config = TabularPreprocessingConfig.from_base_config(
    processing_step_config,
    job_type="training",
    label_name=base_hyperparameter.label_name,
    processing_entry_point="tabular_preprocessing.py"
)

In [119]:
training_tabular_preprocessing_step_config

TabularPreprocessingConfig(author='lukexie', bucket='sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um', role='arn:aws:iam::601857636239:role/SandboxRole-lukexie-us-east-1', region='NA', service_name='AtoZ', pipeline_version='1.3.1', model_class='xgboost', current_date='2025-09-21', framework_version='1.7-1', py_version='py3', source_dir='/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz', processing_instance_count=1, processing_volume_size=500, processing_instance_type_large='ml.m5.12xlarge', processing_instance_type_small='ml.m5.4xlarge', use_large_processing_instance=False, processing_source_dir='/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz/scripts', processing_entry_point='tabular_preprocessing.py', processing_script_arguments=None, processing_framework_version='1.2-1', label_name='is_abuse', job_type='training', train_ratio=0.7, test_val_ratio=0.5)
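The config above shows `train_ratio=0.7` and `test_val_ratio=0.5`. One plausible reading, which is an assumption here and not confirmed by the script, is that 70% of rows go to train and the remaining 30% is divided evenly between test and val. A sketch of that split on row indices:

```python
import random


def split_indices(n_rows, train_ratio=0.7, test_val_ratio=0.5, seed=0):
    """Shuffle row indices and split them into train, test and val."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = round(n_rows * train_ratio)
    holdout = indices[n_train:]
    # Assumed meaning: test_val_ratio is the share of the holdout given to test.
    n_test = round(len(holdout) * test_val_ratio)
    return {
        "train": indices[:n_train],
        "test": holdout[:n_test],
        "val": holdout[n_test:],
    }


demo_split = split_indices(1000)
print({name: len(rows) for name, rows in demo_split.items()})
```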

In [120]:
training_tabular_preprocessing_step_config.use_large_processing_instance = True #use large instance type

#### Save to Config List

In [121]:
# Save to config list
config_list.append(training_tabular_preprocessing_step_config)

### [STEP 2.1] Tabular Preprocessing Config (Calibration)

This step can be seen as a **preprocessing step** for the **model evaluation step**, or a **post-processing step** for the **cradle data loading step**.

This step has several simple tasks:
* **Concatenate** chunks of data (due to job split or data split)
* Transform label fields into integer type
* For **job_type = calibration**, it provides the entire processed dataset without splitting.
* This step is a **ProcessingStep**

In [122]:
calibration_tabular_preprocessing_step_config = TabularPreprocessingConfig.from_base_config(
    processing_step_config,
    job_type="calibration",
    label_name=base_hyperparameter.label_name,
    processing_entry_point="tabular_preprocessing.py"
)

#### Save to Config List

In [123]:
# Save to config list
config_list.append(calibration_tabular_preprocessing_step_config)

In [124]:
len(config_list)

6

### [REQUIRED] [STEP 3.0] Training Config

This config is for **TrainingStep**.
* It asks the user to provide all necessary information to construct a **Container** and start a **Training Job**
* The most important information has already been provided in the **HyperParameter** section.


In [125]:
from cursus.steps.configs.config_xgboost_training_step import XGBoostTrainingConfig

In [126]:
instance_type_list = [
    "ml.m5.4xlarge",
    "ml.g4dn.16xlarge", 
    "ml.g5.12xlarge", 
    "ml.g5.16xlarge",
    "ml.p3.8xlarge", 
    "ml.m5.12xlarge",
    "ml.p3.16xlarge"
]

In [127]:
instance_select = -2

In [128]:
training_instance_type = instance_type_list[instance_select]
training_instance_type

'ml.m5.12xlarge'

In [129]:
training_volume_size = 800

In [130]:
training_entry_point = "xgboost_training.py"


In [131]:
train_dict = {
    'training_instance_type': training_instance_type,
    'training_entry_point': training_entry_point,
    'training_volume_size': training_volume_size,
    'hyperparameters': xgb_hyperparams
}

In [132]:
xgb_hyperparams

XGBoostModelHyperparameters(full_field_list=['Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_solicit_count_last_365_days', 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_warn_count_last_365_days', 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_solicit_count_last_365_days', 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_warn_count_last_365_days', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_buyer_order_message_time_gap', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_order_message_time_gap', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_seller_order_message_time_gap', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_message_count_with_diff_topic_si', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_message_count_with_notr_topic_si', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_message_count_with_

In [133]:
xgboost_train_config = XGBoostTrainingConfig.from_base_config(
    base_config,
    **train_dict
)

In [134]:
xgboost_train_config

XGBoostTrainingConfig(author='lukexie', bucket='sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um', role='arn:aws:iam::601857636239:role/SandboxRole-lukexie-us-east-1', region='NA', service_name='AtoZ', pipeline_version='1.3.1', model_class='xgboost', current_date='2025-09-21', framework_version='1.7-1', py_version='py3', source_dir='/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz', training_entry_point='xgboost_training.py', hyperparameters=XGBoostModelHyperparameters(full_field_list=['Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_solicit_count_last_365_days', 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_warn_count_last_365_days', 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_solicit_count_last_365_days', 'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_warn_count_last_365_days', 'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_buyer_order_message_time_gap', 'Abuse.bsm

#### Save to Config List

In [135]:
# Save to config list
config_list.append(xgboost_train_config)

### [STEP 4.0] Model Calibration Config
The task of **model calibration** is to transform *raw model scores* into **calibrated probabilities** that better reflect the _true probability_ of the event occurring.
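To make the idea concrete, here is a minimal sketch of score calibration using histogram binning: each score bin is mapped to the observed positive rate in that bin. This is a deliberately simpler stand-in, not the GAM-based method configured below, and unlike a monotonic calibrator it does not guarantee that calibrated values increase with the raw score.

```python
def fit_binning_calibrator(scores, labels, n_bins=10):
    """Learn the observed positive rate for each equal-width score bin."""
    totals = [0] * n_bins
    positives = [0] * n_bins
    for score, label in zip(scores, labels):
        bin_index = min(int(score * n_bins), n_bins - 1)
        totals[bin_index] += 1
        positives[bin_index] += label
    # Empty bins fall back to the bin midpoint so every score maps to a value.
    return [
        positives[b] / totals[b] if totals[b] else (b + 0.5) / n_bins
        for b in range(n_bins)
    ]


def calibrate(score, bin_rates):
    """Map a raw score in [0, 1] to the learned rate of its bin."""
    n_bins = len(bin_rates)
    return bin_rates[min(int(score * n_bins), n_bins - 1)]


demo_scores = [0.95, 0.92, 0.97, 0.91]
demo_labels = [1, 0, 1, 0]
demo_rates = fit_binning_calibrator(demo_scores, demo_labels)
print(calibrate(0.93, demo_rates))
```

In this toy data the raw scores are all above 0.9, but only half of the examples are positive, so the calibrated value for that bin is pulled down to the observed rate.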


In [136]:
from cursus.steps.configs.config_model_calibration_step import ModelCalibrationConfig

In [137]:
label_field = base_hyperparameter.label_name

In [138]:
score_field_prefix = 'prob_class_'

In [139]:
model_calibration_config = ModelCalibrationConfig.from_base_config(
    processing_step_config,
    label_field=base_hyperparameter.label_name,
    processing_entry_point='model_calibration.py',
    score_field='prob_class_1',
    is_binary=base_hyperparameter.is_binary,
    num_classes=base_hyperparameter.num_classes,
    score_field_prefix='prob_class_',
    multiclass_categories=list(range(base_hyperparameter.num_classes))
    )

In [140]:
model_calibration_config

ModelCalibrationConfig(author='lukexie', bucket='sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um', role='arn:aws:iam::601857636239:role/SandboxRole-lukexie-us-east-1', region='NA', service_name='AtoZ', pipeline_version='1.3.1', model_class='xgboost', current_date='2025-09-21', framework_version='1.7-1', py_version='py3', source_dir='/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz', processing_instance_count=1, processing_volume_size=500, processing_instance_type_large='ml.m5.12xlarge', processing_instance_type_small='ml.m5.4xlarge', use_large_processing_instance=False, processing_source_dir='/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz/scripts', processing_entry_point='model_calibration.py', processing_script_arguments=None, processing_framework_version='1.2-1', label_field='is_abuse', calibration_method='gam', monotonic_constraint=True, gam_splines=10, error_threshold=0.05, is_binary=True, num_classes=2, score_field='prob_class_1', score_field_prefix='prob_

#### Save to Config List

In [141]:
config_list.append(model_calibration_config)

### [STEP 5.0] Model Evaluation Config

The task of **model evaluation** is to run **model inference** on the processed **calibration data set**. Thus it has two dependencies:
* **Training Step**
* **Calibration Data Flow** (data loading + preprocessing)

This implementation uses a **ProcessingStep**, but a **BatchTransformStep** could be used instead.

In [142]:
from cursus.steps.configs.config_xgboost_model_eval_step import XGBoostModelEvalConfig

In [143]:
model_eval_processing_entry_point = 'xgboost_model_evaluation.py'

In [144]:
model_eval_source_dir = source_dir

In [145]:
model_eval_job_type = 'calibration'

In [146]:
previous_processing_config = processing_step_config.model_dump()
previous_processing_config

{'author': 'lukexie',
 'bucket': 'sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um',
 'role': 'arn:aws:iam::601857636239:role/SandboxRole-lukexie-us-east-1',
 'region': 'NA',
 'service_name': 'AtoZ',
 'pipeline_version': '1.3.1',
 'model_class': 'xgboost',
 'current_date': '2025-09-21',
 'framework_version': '1.7-1',
 'py_version': 'py3',
 'source_dir': '/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz',
 'processing_instance_count': 1,
 'processing_volume_size': 500,
 'processing_instance_type_large': 'ml.m5.12xlarge',
 'processing_instance_type_small': 'ml.m5.4xlarge',
 'use_large_processing_instance': False,
 'processing_source_dir': '/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz/scripts',
 'processing_entry_point': None,
 'processing_script_arguments': None,
 'processing_framework_version': '1.2-1',
 'aws_region': 'us-east-1',
 'pipeline_name': 'lukexie-AtoZ-xgboost-NA',
 'pipeline_description': 'AtoZ xgboost Model NA',
 'pipeline_s3_loc': 's3://sandboxdepe

In [147]:
previous_processing_config['processing_source_dir'] = str(model_eval_source_dir)

In [148]:
previous_processing_config['processing_entry_point'] = model_eval_processing_entry_point

In [149]:
previous_processing_config['use_large_processing_instance'] = True

In [150]:
previous_processing_config

{'author': 'lukexie',
 'bucket': 'sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um',
 'role': 'arn:aws:iam::601857636239:role/SandboxRole-lukexie-us-east-1',
 'region': 'NA',
 'service_name': 'AtoZ',
 'pipeline_version': '1.3.1',
 'model_class': 'xgboost',
 'current_date': '2025-09-21',
 'framework_version': '1.7-1',
 'py_version': 'py3',
 'source_dir': '/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz',
 'processing_instance_count': 1,
 'processing_volume_size': 500,
 'processing_instance_type_large': 'ml.m5.12xlarge',
 'processing_instance_type_small': 'ml.m5.4xlarge',
 'use_large_processing_instance': True,
 'processing_source_dir': '/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz',
 'processing_entry_point': 'xgboost_model_evaluation.py',
 'processing_script_arguments': None,
 'processing_framework_version': '1.2-1',
 'aws_region': 'us-east-1',
 'pipeline_name': 'lukexie-AtoZ-xgboost-NA',
 'pipeline_description': 'AtoZ xgboost Model NA',
 'pipeline_s3_loc': '

In [151]:
xgboost_framework_version=base_config.framework_version

In [152]:
xgboost_model_eval_config = XGBoostModelEvalConfig(
    **previous_processing_config,
    job_type=model_eval_job_type,
    hyperparameters=xgb_hyperparams,
    xgboost_framework_version=xgboost_framework_version
)
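The cells above follow a dump, override, rebuild pattern: the shared processing config is dumped to a dictionary, a few keys are changed, and a new config is constructed from the result. The sketch below shows the same pattern with plain dataclasses (`asdict` plays the role of `model_dump`); the class names are hypothetical.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class SharedProcessingDemo:
    processing_source_dir: str
    processing_entry_point: Optional[str] = None
    use_large_processing_instance: bool = False


@dataclass
class EvalDemo(SharedProcessingDemo):
    job_type: str = "calibration"


shared_demo = SharedProcessingDemo(processing_source_dir="dockers/demo/scripts")

# asdict returns a new dictionary, so edits here do not touch shared_demo.
eval_kwargs = asdict(shared_demo)
eval_kwargs.update(
    processing_source_dir="dockers/demo",
    processing_entry_point="xgboost_model_evaluation.py",
    use_large_processing_instance=True,
)
eval_demo = EvalDemo(**eval_kwargs, job_type="calibration")
print(eval_demo)
```

Working on a dumped copy keeps the shared processing config unchanged for the steps that are built from it later.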

In [153]:
xgboost_model_eval_config

XGBoostModelEvalConfig(author='lukexie', bucket='sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um', role='arn:aws:iam::601857636239:role/SandboxRole-lukexie-us-east-1', region='NA', service_name='AtoZ', pipeline_version='1.3.1', model_class='xgboost', current_date='2025-09-21', framework_version='1.7-1', py_version='py3', source_dir='/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz', processing_instance_count=1, processing_volume_size=500, processing_instance_type_large='ml.m5.12xlarge', processing_instance_type_small='ml.m5.4xlarge', use_large_processing_instance=True, processing_source_dir='/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz', processing_entry_point='xgboost_model_evaluation.py', processing_script_arguments=None, processing_framework_version='1.2-1', hyperparameters=XGBoostModelHyperparameters(full_field_list=['Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_solicit_count_last_365_days', 'Abuse.abuse_fap_action_by_customer_inline_tr

In [154]:
xgboost_model_eval_config.portable_processing_source_dir

'dockers/xgboost_atoz'

#### Save to Config List

In [155]:
# Save to config list
config_list.append(xgboost_model_eval_config)

### [STEP 6.0] Packaging Config

The **PackagingStep** is a *prerequisite* for **MIMSModelRegistrationStep**. 

It performs a few simple tasks:
* Take the `model.tar.gz` produced by the training step as input
* Unpack the model
* Take the source inference code as input (including `input_fn`, `model_fn`, `predict_fn`, `output_fn`)
* Repack the model and the inference code into a single `model.tar.gz`

We could ask the **TrainingStep** to do this, but it is better to make this step explicit.
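The unpack-and-repack flow listed above can be sketched with the standard library alone. This is a minimal illustration, not the actual `package.py` script: the function name `repack_model`, the `code/` sub-directory layout, and the file names used in the usage note are all assumptions made for the example.

```python
import tarfile
import tempfile
from pathlib import Path
from typing import List


def repack_model(model_tar: Path, inference_dir: Path, output_tar: Path) -> List[str]:
    """Unpack model_tar, add inference scripts under code/, and write output_tar.

    Hypothetical helper: the code/ layout is an assumption for illustration.
    Returns the sorted member names of the repacked archive.
    """
    with tempfile.TemporaryDirectory() as tmp:
        work_dir = Path(tmp)

        # 1. Unpack the trained model artifact.
        with tarfile.open(model_tar, "r:gz") as tar:
            tar.extractall(work_dir)

        # 2. Copy the inference scripts next to the model, under code/.
        code_dir = work_dir / "code"
        code_dir.mkdir(exist_ok=True)
        for script in inference_dir.glob("*.py"):
            (code_dir / script.name).write_bytes(script.read_bytes())

        # 3. Repack model + inference code into one archive.
        with tarfile.open(output_tar, "w:gz") as tar:
            for path in sorted(work_dir.rglob("*")):
                if path.is_file():
                    tar.add(path, arcname=path.relative_to(work_dir).as_posix())

    # Re-open the result to report what it contains.
    with tarfile.open(output_tar, "r:gz") as tar:
        return sorted(tar.getnames())
```

Calling `repack_model(Path("model.tar.gz"), Path("inference_src"), Path("repacked.tar.gz"))` would return the member names of the new archive, which makes it easy to check that both the model file and the inference script ended up inside.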

In [156]:
from cursus.steps.configs.config_package_step import PackageConfig

In [157]:
package_config = PackageConfig.from_base_config(
    processing_step_config
)

In [158]:
package_config

PackageConfig(author='lukexie', bucket='sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um', role='arn:aws:iam::601857636239:role/SandboxRole-lukexie-us-east-1', region='NA', service_name='AtoZ', pipeline_version='1.3.1', model_class='xgboost', current_date='2025-09-21', framework_version='1.7-1', py_version='py3', source_dir='/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz', processing_instance_count=1, processing_volume_size=500, processing_instance_type_large='ml.m5.12xlarge', processing_instance_type_small='ml.m5.4xlarge', use_large_processing_instance=False, processing_source_dir='/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz/scripts', processing_entry_point='package.py', processing_script_arguments=None, processing_framework_version='1.2-1')

In [159]:
package_config.portable_processing_source_dir

'dockers/xgboost_atoz/scripts'

#### Save to Config List

In [160]:
# Save to config list
config_list.append(package_config)

### **REQUIRED** [STEP 7.0] MIMS Model Registration

* [MRAS (Model Resource Allocation System)](https://w.amazon.com/bin/view/CMLS/ME/MIMS/) is a system that manages your **model endpoints**. 
    * It takes your model artifact and its metadata and deploys an endpoint to an AWS account you have onboarded to MRAS. You can access this endpoint through the AMES system, which URES uses.
* **MIMS (Model Inference Management System)** is the system that handles model creation
* **MMS (Model Management Service)** manages the model card
> 
> Note that **MRAS** used to be called **MIMS** (**Model Inference Management System**). 
> - **MIMS** is the component of MRAS that handles endpoint creation. 
> - To reduce customer confusion, we have started to use *MRAS* to also refer to *MIMS*. 
> - Some of our wikis may still use *MIMS* instead of *MRAS*.
> 
> If your team has not already, please [onboard an AWS account to MRAS](https://w.amazon.com/bin/view/CMLS/ME/MIMS/UserGuide/Onboarding/).

* **MIMSModelRegistrationStep** is a SageMaker Workflow Step that wraps the service call to **MIMS**.
    * It is also a customized step provided by the SAIS Python SDK
        * See the source code in [SecureAISandboxWorkflowPythonSDK](https://code.amazon.com/packages/SecureAISandboxWorkflowPythonSDK/trees/mainline#)
    * This step inherits from **MODSPredefinedProcessingStep**, which is a customized base class that itself inherits from **ScriptProcessingStep**.
        * Source code in [MODSWorkflowCore](https://code.amazon.com/packages/MODSWorkflowCore/trees/mainline#)
    * This step needs to load the **Execution Document** to take action.

In **MIMSModelRegistrationStep**, we need to specify the following fields to fill in the **Execution Document**:
* *model_owner*
* *model_registration_domain*
* *model_registration_objective*
* *source_model_inference_input_variable_list*
* *source_model_inference_output_variable_list*
* *source_model_inference_content_types*
* *source_model_inference_response_types*
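
Before building the config, it can help to check that every field in the list above has a value. The sketch below is only an illustration: the flat dictionary shape and the helper `missing_registration_fields` are assumptions, and the real Execution Document schema is defined by the MODS / SAIS SDKs, not by this snippet.

```python
# Field names copied from the list above.
REQUIRED_FIELDS = (
    "model_owner",
    "model_registration_domain",
    "model_registration_objective",
    "source_model_inference_input_variable_list",
    "source_model_inference_output_variable_list",
    "source_model_inference_content_types",
    "source_model_inference_response_types",
)


def missing_registration_fields(fields: dict) -> list:
    """Return the required field names that are absent or empty (hypothetical helper)."""
    empty_values = (None, "", [], {})
    return [name for name in REQUIRED_FIELDS if fields.get(name) in empty_values]


# Example values taken from this notebook; the input list is shortened to one feature.
example_fields = {
    "model_owner": "amzn1.abacus.team.djmdvixm5abr3p75c5ca",
    "model_registration_domain": "AtoZ",
    "model_registration_objective": "AtoZ_Claims_SM_Model_NA",
    "source_model_inference_input_variable_list": [
        ["Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_solicit_count_last_365_days", "NUMERIC"],
    ],
    "source_model_inference_output_variable_list": {"legacy-score": "NUMERIC"},
    "source_model_inference_content_types": ["text/csv"],
    # source_model_inference_response_types is left out on purpose.
}

print(missing_registration_fields(example_fields))
```

Here the helper reports `source_model_inference_response_types` as missing, since that key was deliberately omitted from the example.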


In [161]:
from enum import Enum

In [162]:
from cursus.steps.configs.config_registration_step import RegistrationConfig, create_inference_variable_list, VariableType

In [163]:
model_owner = "amzn1.abacus.team.djmdvixm5abr3p75c5ca" # team_id of abuse-analytics

In [164]:
model_domain = 'AtoZ'
model_objective = f'AtoZ_Claims_SM_Model_{region}'
model_objective

'AtoZ_Claims_SM_Model_NA'

In [165]:
source_model_inference_output_variable_list = {
    'legacy-score': 'NUMERIC',
    'calibrated-score': 'NUMERIC',
    'custom-output-label': 'TEXT'
}

In [166]:
source_model_inference_content_types = ["text/csv"]
source_model_inference_response_types = ["application/json"]

In [167]:
framework='xgboost'

In [168]:
inference_entry_point='xgboost_inference.py'

In [169]:
inference_instance_type="ml.m5.4xlarge"

In [170]:
source_model_inference_input_variable_list = create_inference_variable_list(tab_field_list, cat_field_list, 'list')

In [171]:
source_model_inference_input_variable_list

[['Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_solicit_count_last_365_days',
  'NUMERIC'],
 ['Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_warn_count_last_365_days',
  'NUMERIC'],
 ['Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_solicit_count_last_365_days',
  'NUMERIC'],
 ['Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_warn_count_last_365_days',
  'NUMERIC'],
 ['Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_buyer_order_message_time_gap',
  'NUMERIC'],
 ['Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_order_message_time_gap',
  'NUMERIC'],
 ['Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_seller_order_message_time_gap',
  'NUMERIC'],
 ['Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_message_count_with_diff_topic_si',
  'NUMERIC'],
 ['Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_message_count_with_notr_topic_si'
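
The output above is a list of `[field_name, type]` pairs. As a rough mental model only (this is not the source of `create_inference_variable_list` from cursus), a helper of this kind could look like the sketch below. The name `build_variable_list`, the `output_format` argument, and the mapping of tabular fields to `NUMERIC` and categorical fields to `TEXT` are assumptions for illustration.

```python
from typing import Dict, List, Sequence, Union


def build_variable_list(
    numeric_fields: Sequence[str],
    text_fields: Sequence[str],
    output_format: str = "list",
) -> Union[List[List[str]], Dict[str, str]]:
    """Pair each field name with a variable type (hypothetical helper).

    Assumption: numeric (tabular) fields map to 'NUMERIC' and
    categorical fields map to 'TEXT'.
    """
    pairs = [[name, "NUMERIC"] for name in numeric_fields]
    pairs += [[name, "TEXT"] for name in text_fields]

    if output_format == "list":
        return pairs
    if output_format == "dict":
        return {name: var_type for name, var_type in pairs}
    raise ValueError(f"Unsupported output_format: {output_format!r}")
```

With `output_format="list"` the result has the same shape as the cell output above; `"dict"` gives the shape used for `source_model_inference_output_variable_list`.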

In [172]:
model_registration_config = RegistrationConfig.from_base_config(
    base_config,
    framework=framework,
    inference_entry_point=inference_entry_point,
    model_owner=model_owner,
    model_domain=model_domain,
    model_objective=model_objective,
    source_model_inference_output_variable_list=source_model_inference_output_variable_list,
    source_model_inference_input_variable_list=source_model_inference_input_variable_list
)


In [173]:
model_registration_config

RegistrationConfig(author='lukexie', bucket='sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um', role='arn:aws:iam::601857636239:role/SandboxRole-lukexie-us-east-1', region='NA', service_name='AtoZ', pipeline_version='1.3.1', model_class='xgboost', current_date='2025-09-21', framework_version='1.7-1', py_version='py3', source_dir='/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz', model_owner='amzn1.abacus.team.djmdvixm5abr3p75c5ca', model_domain='AtoZ', model_objective='AtoZ_Claims_SM_Model_NA', framework='xgboost', inference_entry_point='xgboost_inference.py', source_model_inference_input_variable_list=[['Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_solicit_count_last_365_days', 'NUMERIC'], ['Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_warn_count_last_365_days', 'NUMERIC'], ['Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_solicit_count_last_365_days', 'NUMERIC'], ['Abuse.abuse_fap_action_by_customer_inline_t

In [174]:
print(str(model_registration_config))



=== RegistrationConfig ===

- Essential User Inputs -
author: lukexie
bucket: sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um
framework: xgboost
inference_entry_point: xgboost_inference.py
model_domain: AtoZ
model_objective: AtoZ_Claims_SM_Model_NA
model_owner: amzn1.abacus.team.djmdvixm5abr3p75c5ca
pipeline_version: 1.3.1
region: NA
role: arn:aws:iam::601857636239:role/SandboxRole-lukexie-us-east-1
service_name: AtoZ

- System Inputs -
current_date: 2025-09-21
framework_version: 1.7-1
inference_instance_type: ml.m5.large
model_class: xgboost
py_version: py3
source_dir: /home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz
source_model_inference_content_types: ['text/csv']
source_model_inference_input_variable_list: [['Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_solicit_count_last_365_days', 'NUMERIC'], ['Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_warn_count_last_365_days', 'NUMERIC'], ['Abuse.abuse_fap_action_by_customer_inline_tr

#### Save to Config List

In [175]:
# Save to config list
config_list.append(model_registration_config)

### [STEP 8.0] Payload Sample Generation and Payload Config

The task of **Payload Step** is to generate the payload sample for **MIMS Model Registration.**
* This step is completely **optional**.
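
To make the idea of a payload sample concrete, here is a minimal sketch that turns an input variable list into one `text/csv` row. It is not the logic of `payload.py`; the helper name `csv_payload` and the placeholder default values per variable type are assumptions chosen for the example.

```python
import csv
import io

# Assumed placeholder values per variable type (illustrative only).
DEFAULT_VALUES = {"NUMERIC": "0.0", "TEXT": "DEFAULT_TEXT"}


def csv_payload(variable_list, defaults=DEFAULT_VALUES) -> str:
    """Build a single CSV row with one placeholder value per input variable.

    variable_list is a list of [field_name, variable_type] pairs,
    in the order the model expects its inputs.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([defaults[var_type] for _, var_type in variable_list])
    return buffer.getvalue()


sample = csv_payload([["n_claims", "NUMERIC"], ["marketplace", "TEXT"]])
print(sample)
```

The column order follows the order of the variable list, which is why the same list is passed to both the registration config and the payload config.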


In [176]:
from io import BytesIO
import boto3
import tarfile

In [177]:
from cursus.steps.configs.config_payload_step import PayloadConfig

In [178]:
processing_base_dict = processing_step_config.model_dump()
processing_base_dict

{'author': 'lukexie',
 'bucket': 'sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um',
 'role': 'arn:aws:iam::601857636239:role/SandboxRole-lukexie-us-east-1',
 'region': 'NA',
 'service_name': 'AtoZ',
 'pipeline_version': '1.3.1',
 'model_class': 'xgboost',
 'current_date': '2025-09-21',
 'framework_version': '1.7-1',
 'py_version': 'py3',
 'source_dir': '/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz',
 'processing_instance_count': 1,
 'processing_volume_size': 500,
 'processing_instance_type_large': 'ml.m5.12xlarge',
 'processing_instance_type_small': 'ml.m5.4xlarge',
 'use_large_processing_instance': False,
 'processing_source_dir': '/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz/scripts',
 'processing_entry_point': None,
 'processing_script_arguments': None,
 'processing_framework_version': '1.2-1',
 'aws_region': 'us-east-1',
 'pipeline_name': 'lukexie-AtoZ-xgboost-NA',
 'pipeline_description': 'AtoZ xgboost Model NA',
 'pipeline_s3_loc': 's3://sandboxdepe

In [179]:
expected_tps = 2
max_latency_in_millisecond = 800

In [180]:
payload_config = PayloadConfig.from_base_config(
    processing_step_config,
    model_owner=model_owner,
    model_domain=model_domain,
    model_objective=model_objective,
    source_model_inference_output_variable_list=source_model_inference_output_variable_list,
    source_model_inference_input_variable_list=source_model_inference_input_variable_list,
    expected_tps=expected_tps,
    max_latency_in_millisecond=max_latency_in_millisecond
)

In [181]:
payload_config.processing_entry_point

'payload.py'

In [182]:
# The variable types will be properly handled in both configs
print(payload_config.source_model_inference_input_variable_list)

[['Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_solicit_count_last_365_days', 'NUMERIC'], ['Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_warn_count_last_365_days', 'NUMERIC'], ['Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_solicit_count_last_365_days', 'NUMERIC'], ['Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_warn_count_last_365_days', 'NUMERIC'], ['Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_buyer_order_message_time_gap', 'NUMERIC'], ['Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_order_message_time_gap', 'NUMERIC'], ['Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_seller_order_message_time_gap', 'NUMERIC'], ['Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_message_count_with_diff_topic_si', 'NUMERIC'], ['Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_message_count_with_notr_topic_si', 'NUMERIC'], ['Abuse.bs

In [183]:
payload_config

PayloadConfig(author='lukexie', bucket='sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um', role='arn:aws:iam::601857636239:role/SandboxRole-lukexie-us-east-1', region='NA', service_name='AtoZ', pipeline_version='1.3.1', model_class='xgboost', current_date='2025-09-21', framework_version='1.7-1', py_version='py3', source_dir='/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz', processing_instance_count=1, processing_volume_size=500, processing_instance_type_large='ml.m5.12xlarge', processing_instance_type_small='ml.m5.4xlarge', use_large_processing_instance=False, processing_source_dir='/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz/scripts', processing_entry_point='payload.py', processing_script_arguments=None, processing_framework_version='1.2-1', model_owner='amzn1.abacus.team.djmdvixm5abr3p75c5ca', model_domain='AtoZ', model_objective='AtoZ_Claims_SM_Model_NA', source_model_inference_output_variable_list={'legacy-score': 'NUMERIC', 'calibrated-score': 'NUMERIC

In [184]:
payload_config.portable_processing_source_dir

'dockers/xgboost_atoz/scripts'

#### Save to Config List

In [185]:
# Save to config list
config_list.append(payload_config)

## Merge and Save Config List to JSON

To better organize the user input and reduce redundant storage (caused by config class inheritance), we apply a systematic rule to save all configs into one JSON file.

The JSON has the following hierarchical structure: ![config_NA_xgboost](config_NA_xgboost.png)
* We split the fields into **shared** vs. **specific**
    * **shared**: common to all configs, so we keep only one copy
    * **specific**: each field is associated with a *step*
* **Processing steps** are split the same way into **processing_shared** vs. **processing_specific**
* The JSON also provides an **inverted list**: given a *field name query*, it *retrieves the steps* that use that field
* The JSON provides a **list of all steps** involved
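
The shared / specific split and the inverted list can be illustrated with a small sketch. This is a simplification and not the rule implemented by `merge_and_save_configs`: it assumes a field is shared only when it appears in every config with the same value, and it leaves out the separate processing tier. The function name `categorize_fields` and the toy configs are hypothetical.

```python
from collections import defaultdict


def categorize_fields(configs: dict) -> dict:
    """Split fields into shared vs. step-specific and build an inverted index.

    configs maps step name -> {field name: value}.
    Simplifying assumption: a field is shared only if every step has it
    with an identical value.
    """
    steps = list(configs)

    # Inverted index: field name -> steps that use it.
    inverted = defaultdict(list)
    for step, fields in configs.items():
        for name in fields:
            inverted[name].append(step)

    shared = {}
    specific = {step: {} for step in steps}
    for name, users in inverted.items():
        values = [configs[step][name] for step in users]
        if len(users) == len(steps) and all(v == values[0] for v in values):
            shared[name] = values[0]  # keep a single copy
        else:
            for step in users:
                specific[step][name] = configs[step][name]

    return {
        "shared": shared,
        "specific": specific,
        "inverted_index": dict(inverted),
        "steps": steps,
    }


toy_configs = {
    "Base": {"author": "lukexie"},
    "Package": {"author": "lukexie", "processing_entry_point": "package.py"},
    "Payload": {"author": "lukexie", "processing_entry_point": "payload.py"},
}
result = categorize_fields(toy_configs)
```

In this toy case `author` lands in `shared`, while `processing_entry_point` stays specific to the `Package` and `Payload` steps, and `result["inverted_index"]["processing_entry_point"]` lists both of them.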

### List of Configs

In [186]:
config_list 

[BasePipelineConfig(author='lukexie', bucket='sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um', role='arn:aws:iam::601857636239:role/SandboxRole-lukexie-us-east-1', region='NA', service_name='AtoZ', pipeline_version='1.3.1', model_class='xgboost', current_date='2025-09-21', framework_version='1.7-1', py_version='py3', source_dir='/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz', aws_region='us-east-1'),
 ProcessingStepConfigBase(author='lukexie', bucket='sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um', role='arn:aws:iam::601857636239:role/SandboxRole-lukexie-us-east-1', region='NA', service_name='AtoZ', pipeline_version='1.3.1', model_class='xgboost', current_date='2025-09-21', framework_version='1.7-1', py_version='py3', source_dir='/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz', processing_instance_count=1, processing_volume_size=500, processing_instance_type_large='ml.m5.12xlarge', processing_instance_type_small='ml.m5.4xlarge', use_large_p

In [187]:
len(config_list)

12

In [188]:
from cursus.steps.configs.utils import serialize_config, merge_and_save_configs, load_configs, verify_configs

2025-09-21 19:11:08,184 - pipeline_registry.builder_registry - INFO - Registered builder: BatchTransform -> BatchTransformStepBuilder
2025-09-21 19:11:08,184 - INFO - Registered builder: BatchTransform -> BatchTransformStepBuilder
2025-09-21 19:11:08,186 - pipeline_registry.builder_registry - INFO - Registered builder: CurrencyConversion -> CurrencyConversionStepBuilder
2025-09-21 19:11:08,186 - INFO - Registered builder: CurrencyConversion -> CurrencyConversionStepBuilder
2025-09-21 19:11:08,187 - pipeline_registry.builder_registry - INFO - Registered builder: DummyTraining -> DummyTrainingStepBuilder
2025-09-21 19:11:08,187 - INFO - Registered builder: DummyTraining -> DummyTrainingStepBuilder
2025-09-21 19:11:08,190 - pipeline_registry.builder_registry - INFO - Registered builder: ModelCalibration -> ModelCalibrationStepBuilder
2025-09-21 19:11:08,190 - INFO - Registered builder: ModelCalibration -> ModelCalibrationStepBuilder
2025-09-21 19:11:08,193 - pipeline_registry.builder_regi

### Merge and Save Config List into one JSON

In [189]:
MODEL_CLASS = 'xgboost'

In [190]:
config_dir = Path(current_dir) / 'pipeline_config' / f'config_{region}_{MODEL_CLASS}_{service_name}_v2'
Path(config_dir).mkdir(parents=True, exist_ok=True)

In [191]:
config_dir

PosixPath('/home/ec2-user/SageMaker/Cursus/pipeline_config/config_NA_xgboost_AtoZ_v2')

In [192]:
config_file_name = f'config_{region}_{MODEL_CLASS}_{service_name}.json'

In [193]:
merged_config = merge_and_save_configs(config_list, 
                                       str(config_dir / config_file_name)
                                      )

2025-09-21 19:11:17,022 - INFO - Categorizing fields for 12 configs
2025-09-21 19:11:17,023 - INFO - Collecting field information for 12 configs (7 processing configs)
2025-09-21 19:11:17,033 - INFO - Collected information for 72 unique fields
2025-09-21 19:11:17,358 - INFO - Shared fields: 16
2025-09-21 19:11:17,358 - INFO - Specific steps: 12
2025-09-21 19:11:17,358 - INFO - Field categorization complete
2025-09-21 19:11:17,359 - INFO - Merging and saving 12 configs to /home/ec2-user/SageMaker/Cursus/pipeline_config/config_NA_xgboost_AtoZ_v2/config_NA_xgboost_AtoZ.json
2025-09-21 19:11:17,359 - INFO - Merged result contains:
2025-09-21 19:11:17,360 - INFO -   - 16 shared fields
2025-09-21 19:11:17,360 - INFO -   - 12 specific steps with 159 total fields
2025-09-21 19:11:17,369 - INFO - Field 'processing_instance_type_small' appears in multiple configs but is in specific section(s): ['Processing', 'TabularPreprocessing_training', 'TabularPreprocessing_calibration', 'ModelCalibration_c

In [194]:
merged_config

{'shared': {'author': 'lukexie',
  'bucket': 'sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um',
  'role': 'arn:aws:iam::601857636239:role/SandboxRole-lukexie-us-east-1',
  'region': 'NA',
  'service_name': 'AtoZ',
  'pipeline_version': '1.3.1',
  'model_class': 'xgboost',
  'current_date': '2025-09-21',
  'framework_version': '1.7-1',
  'py_version': 'py3',
  'source_dir': '/home/ec2-user/SageMaker/Cursus/dockers/xgboost_atoz',
  'aws_region': 'us-east-1',
  'pipeline_name': 'lukexie-AtoZ-xgboost-NA',
  'pipeline_description': 'AtoZ xgboost Model NA',
  'pipeline_s3_loc': 's3://sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um/MODS/lukexie-AtoZ-xgboost-NA_1.3.1',
  'portable_source_dir': 'dockers/xgboost_atoz'},
 'specific': defaultdict(dict,
             {'Base': {'__model_type__': 'BasePipelineConfig'},
              'Processing': {'__model_type__': 'ProcessingStepConfigBase',
               'processing_instance_count': 1,
               'processing_volume

In [195]:
def save_config_to_json(
    config: BaseModel, 
    config_path: str = "config/config.json"
) -> Path:
    """
    Save a single Pydantic config (for example, the model hyperparameters) to a JSON file
    
    Args:
        config: Pydantic model instance to serialize
        config_path: Path to save the config file
    
    Returns:
        Path object of the saved config file
    """
    try:
        # Convert the model to a dictionary
        config_dict = config.model_dump()
        
        # Create config directory if it doesn't exist
        path = Path(config_path)
        path.parent.mkdir(parents=True, exist_ok=True)
        
        # Save to JSON file
        with open(path, 'w') as f:
            json.dump(config_dict, f, indent=2, sort_keys=True)
            
        print(f"Configuration saved to: {path}")
        return path
        
    except Exception as e:
        # Chain the original exception so the root cause stays in the traceback
        raise ValueError(f"Failed to save config: {str(e)}") from e

In [196]:
save_config_to_json(
    xgb_hyperparams, 
    str(config_dir / f'hyperparameters_{region}_{MODEL_CLASS}.json')
)

Configuration saved to: /home/ec2-user/SageMaker/Cursus/pipeline_config/config_NA_xgboost_AtoZ_v2/hyperparameters_NA_xgboost.json


PosixPath('/home/ec2-user/SageMaker/Cursus/pipeline_config/config_NA_xgboost_AtoZ_v2/hyperparameters_NA_xgboost.json')

In [197]:
data_type = 'mds_parquet'
region_alias = region
mds_data_dir_suffix = f"{data_type}/{region_alias}"
mds_data_dir_suffix

'mds_parquet/NA'

In [198]:
bucket

'sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um'

### Optional: Clean Up S3 and Reupload Configs

In [None]:
def sync_with_s3(s3_loc: str, config_dir: Path):
    """
    Synchronize local configurations with S3.
    
    WARNING: this first deletes everything under s3_loc.
    
    Args:
        s3_loc: S3 location base path (an s3:// URI)
        config_dir: Local config directory path
    """
    # Clear existing S3 location
    subprocess.run([
        'aws', 's3', 'rm', '--recursive', s3_loc
    ], check=True)

    # Upload config files
    config_s3_path = os.path.join(s3_loc, 'input', 'config')
    subprocess.run([
        'aws', 's3', 'cp', '--recursive', str(config_dir), config_s3_path
    ], check=True)
    # config_s3_path already starts with s3://, so print it as-is
    print(f"Config uploaded to {config_s3_path}")

    # Upload data files (disabled; would also need prefix, data_type and region arguments)
    #data_dir_suffix = f"{data_type}/{region}"
    #data_source = os.path.join(prefix, data_dir_suffix)
    #data_destination = os.path.join(s3_loc, data_dir_suffix)
    #subprocess.run([
    #    'aws', 's3', 'cp', '--recursive', data_source, data_destination
    #], check=True)

In [None]:
pipeline_s3_loc = base_config.pipeline_s3_loc

In [None]:
config_dir

In [None]:
sync_with_s3(pipeline_s3_loc, config_dir)


## Optional: Upload Local Model to S3

In [None]:
#model_s3_path = os.path.join(pipeline_s3_loc, 'prod', 'model.tar.gz')
#model_s3_path

In [None]:
#model_local_path = Path(current_dir) / 'model.tar.gz'
#model_local_path

import logging
from botocore.exceptions import ClientError

def upload_model_to_s3(
    model_local_path: Path,
    model_s3_path: str,
    remove_local: bool = True
) -> bool:
    """
    Upload model artifact to S3.
    
    Args:
        model_local_path: Local path to model file
        model_s3_path: S3 destination path
        remove_local: Whether to remove local file after upload
    
    Returns:
        bool: True if successful, False otherwise
    """
    logger = logging.getLogger(__name__)
    
    try:
        # Method 1: Using boto3
        s3_client = boto3.client('s3')
        
        # Extract bucket and key from s3 path
        s3_path = model_s3_path.replace('s3://', '')
        bucket = s3_path.split('/')[0]
        key = '/'.join(s3_path.split('/')[1:])
        
        logger.info(f"Uploading model from {model_local_path} to s3://{bucket}/{key}")
        
        s3_client.upload_file(
            str(model_local_path),
            bucket,
            key
        )
        
        # Alternative Method 2: Using AWS CLI
        # subprocess.run([
        #     'aws', 's3', 'cp',
        #     str(model_local_path),
        #     model_s3_path
        # ], check=True)
        
        if remove_local and model_local_path.exists():
            model_local_path.unlink()
            logger.info(f"Removed local file: {model_local_path}")
            
        logger.info("Model upload completed successfully")
        return True
        
    except ClientError as e:
        logger.error(f"Failed to upload model to S3: {str(e)}")
        return False
    except Exception as e:
        logger.error(f"Unexpected error during model upload: {str(e)}")
        return False

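def ensure_s3_path_exists(s3_path: str) -> bool:
    """
    Ensure the S3 path exists by writing a zero-byte placeholder object.
    
    S3 has no real directories; the placeholder key "<prefix>/" only makes
    the prefix visible as a folder in the console.
    
    Args:
        s3_path: S3 path to verify/create
    
    Returns:
        bool: True if the placeholder was written (or no prefix was given)
    """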
    try:
        s3_client = boto3.client('s3')
        
        # Extract bucket and prefix
        s3_path = s3_path.replace('s3://', '')
        bucket = s3_path.split('/')[0]
        prefix = '/'.join(s3_path.split('/')[1:])
        
        # Create empty object to ensure path exists
        if prefix:
            s3_client.put_object(Bucket=bucket, Key=f"{prefix}/")
        return True
    except Exception as e:
        logging.error(f"Failed to create S3 path: {str(e)}")
        return False

if not model_local_path.exists():
    raise FileNotFoundError(f"Model file not found at {model_local_path}")

# Ensure S3 path exists
if not ensure_s3_path_exists(model_s3_path):
    raise RuntimeError(f"Failed to create S3 path: {model_s3_path}")

# Upload model
if upload_model_to_s3(model_local_path, model_s3_path, remove_local=False):
    logging.info("Model upload process completed successfully")
else:
    raise RuntimeError("Failed to upload model to S3")
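
Both helpers above split the `s3://` path by hand. As an alternative sketch (not part of the original notebook), the bucket and key can be extracted with `urllib.parse` from the standard library, which also lets the code reject inputs that are not S3 URIs. The helper name `split_s3_uri` and the bucket name in the example are hypothetical.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/key/parts' into (bucket, key) (hypothetical helper)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {uri!r}")
    # The path component starts with '/', which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, object_key = split_s3_uri("s3://my-example-bucket/prod/model.tar.gz")
print(bucket_name, object_key)
```

The returned pair can be passed directly as the `Bucket` and `Key` arguments of the boto3 calls used in `upload_model_to_s3`.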