# Script Validation

This notebook provides a framework for systematic script testing. Pipeline generation does not verify the functionality and connectivity of the scripts themselves. This matters because a pipeline that can be launched is not guaranteed to have ALL of its scripts run to completion.


In this test, we assume the following **Pipeline DAG (Directed Acyclic Graph)**:

![mods_pipeline_train_eval_calib](./tutorials/mods_end_to_end_xgboost.png)


The *steps* involved are as follows:
1. **CradleDataLoadingStep**, replicated for the **training** and **calibration** data flows
2. **TabularPreprocessingStep**, in two variants (**training** and **calibration**)
3. **XGBoostTrainingStep**
4. **XGBoostModelEvaluationStep**
5. **ModelCalibrationStep**
6. **PackagingStep**
7. **PayloadStep**
8. **MIMSModelRegistrationStep**


This notebook lets the user specify the input information for each of these steps.


There are two more **base steps**, which control **information sharing** across all steps.
1. **Base Config**: shared for all steps
2. **Base Processing Config**: shared for all *processing steps*

In [1]:
import os
import json
import pandas as pd
import pickle
import sys
import subprocess
from datetime import datetime
import logging
import shutil

from pathlib import Path

In [2]:
from pathlib import Path
import sys

# Use the directory two levels above the notebook's working directory as the project root
project_root = str(Path().absolute().parent.parent)
print(f"project root {project_root}")
if project_root not in sys.path:
    sys.path.insert(0, project_root)  
    print(f"add project root {project_root} into system")

project root /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src
add project root /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src into system


## Pipeline DAG

In [3]:
from buyer_abuse_mods_template.cursus.api.dag.base_dag import PipelineDAG

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml






In [4]:
def create_xgboost_complete_e2e_dag() -> PipelineDAG:
    """
    Create a DAG matching the exact structure from demo/demo_pipeline.ipynb.

    This DAG represents a complete end-to-end workflow including training,
    calibration, packaging, registration, and evaluation of an XGBoost model.

    Returns:
        PipelineDAG: The directed acyclic graph for the pipeline
    """
    dag = PipelineDAG()

    # Add all nodes - exactly as in the demo notebook
    dag.add_node("CradleDataLoading_training")  # Data load for training
    dag.add_node("TabularPreprocessing_training")  # Tabular preprocessing for training
    dag.add_node("XGBoostTraining")  # XGBoost training step
    dag.add_node(
        "ModelCalibration_calibration"
    )  # Model calibration step with calibration variant
    dag.add_node("Package")  # Package step
    dag.add_node("Registration")  # MIMS registration step
    dag.add_node("Payload")  # Payload step
    dag.add_node("CradleDataLoading_calibration")  # Data load for calibration
    dag.add_node(
        "TabularPreprocessing_calibration"
    )  # Tabular preprocessing for calibration
    dag.add_node("XGBoostModelEval_calibration")  # Model evaluation step

    # Training flow
    dag.add_edge("CradleDataLoading_training", "TabularPreprocessing_training")
    dag.add_edge("TabularPreprocessing_training", "XGBoostTraining")

    # Calibration flow
    dag.add_edge("CradleDataLoading_calibration", "TabularPreprocessing_calibration")

    # Evaluation flow
    dag.add_edge("XGBoostTraining", "XGBoostModelEval_calibration")
    dag.add_edge("TabularPreprocessing_calibration", "XGBoostModelEval_calibration")

    # Model calibration flow - depends on model evaluation
    dag.add_edge("XGBoostModelEval_calibration", "ModelCalibration_calibration")

    # Output flow
    dag.add_edge("ModelCalibration_calibration", "Package")
    dag.add_edge("XGBoostTraining", "Package")  # Raw model is also input to packaging
    dag.add_edge("XGBoostTraining", "Payload")  # Payload test uses the raw model
    dag.add_edge("Package", "Registration")
    dag.add_edge("Payload", "Registration")

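    # `logger` was never defined in this notebook; use the logging module imported in the first cell
    logging.getLogger(__name__).info(
        f"Created XGBoost complete E2E DAG with {len(dag.nodes)} nodes and {len(dag.edges)} edges"
    )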
    return dag

## Load Data From S3 to Local

We can use the output of the Cradle Data Loading step from day 1 to create local data in SAIS.

**Instructions** for finding the output location of `CradleDataLoadingStep` in SAIS:
- Go to the `cmls-byoa-abuse-prod-us-east-1` account and log in with the user name `SageMakerStudioRole-abuse-Prod-us-east-1`
- **Request CAZ** by creating a 2-PR Review and getting approval from any BAP team member.
- After logging in, type "Amazon SageMaker AI" in the search bar to find SageMaker
- Find "Studio" under "Applications and IDEs" in the left panel
- Make sure you are in the "us-east-1" region (check for "United States (N. Virginia)" in the top right)
- Look for `SaisStudioDomain` with user profile "default"
- Open Studio
- Find "Pipeline" in the left panel
- Find your pipeline by searching for your alias
- Double-click the executed pipeline in the main panel
- You will see a node named `CradleDataLoading_xxx`
- Click the node and find the 4 tabs on the right
- Open the "Information" tab ![find_job](./tutorials/find_job_for_step.png)
- Find the "Processing job" field in the table below
- It contains a link to the actual processing job for the `CradleDataLoading_xxx` step
- In the processing job, find "Processing output config : DATA", which gives the output path of the CradleDataLoading job



For example:
- my `CradleDataLoading_training` output 
    - `s3://sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um/lukexie-AtoZ-xgboost-NA-1-3-8-pipeline-2025-10-06-22-36-30/CradleDataLoading-Training/output/data`
- my `CradleDataLoading_calibration` output
    - `s3://sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um/lukexie-AtoZ-xgboost-NA-1-3-8-pipeline-2025-10-06-22-36-30/CradleDataLoading-Calibration/output/data`
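
The last instructions above read the output path from the console. The same field is also part of the response of the SageMaker `DescribeProcessingJob` API, so a programmatic lookup might be sketched as follows. This assumes your role is allowed to call `describe_processing_job`; `find_output_s3_uri` and `describe_job` are helpers introduced here for illustration, and the sample response uses a made-up bucket.

```python
from typing import Any, Dict


def find_output_s3_uri(description: Dict[str, Any], output_name: str = "DATA") -> str:
    """Pull the S3 URI of one named output from a DescribeProcessingJob response."""
    outputs = description.get("ProcessingOutputConfig", {}).get("Outputs", [])
    for output in outputs:
        if output.get("OutputName") == output_name:
            return output["S3Output"]["S3Uri"]
    raise KeyError(f"No processing output named {output_name!r}")


def describe_job(job_name: str) -> Dict[str, Any]:
    """Fetch the job description (needs AWS credentials, so it is not called here)."""
    import boto3  # imported lazily so the parsing helper works without boto3

    return boto3.client("sagemaker").describe_processing_job(ProcessingJobName=job_name)


# Hand-written response fragment with a hypothetical bucket, for illustration only.
sample = {
    "ProcessingOutputConfig": {
        "Outputs": [
            {"OutputName": "DATA", "S3Output": {"S3Uri": "s3://example-bucket/run/output/data"}}
        ]
    }
}
print(find_output_s3_uri(sample))
```

In the notebook, `find_output_s3_uri(describe_job("<your processing job name>"))` would replace the manual copy and paste, provided the credentials in use can see the job.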

In [97]:
import boto3
import os
import shutil
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional, Union
from tqdm import tqdm
import time
import logging
import glob

In [6]:
class S3LocalDownloader:
    def __init__(self):
        """Initialize S3 downloader"""
        self.s3_client = boto3.client('s3')
        
        # Setup logging
        logging.basicConfig(level=logging.INFO,
                          format='%(asctime)s - %(levelname)s - %(message)s')
        self.logger = logging.getLogger(__name__)

    def cleanup_local_directory(self, local_dir: str) -> None:
        """
        Clean up existing files in the local directory.
        
        Args:
            local_dir (str): Path to local directory to clean
        """
        try:
            if os.path.exists(local_dir):
                # Check if directory has any files
                files = glob.glob(os.path.join(local_dir, '*'))
                if files:
                    self.logger.info(f"Found {len(files)} existing files in {local_dir}")
                    self.logger.info("Cleaning up existing files...")
                    
                    # Remove each file
                    for file_path in files:
                        try:
                            if os.path.isfile(file_path):
                                os.remove(file_path)
                                self.logger.debug(f"Removed: {file_path}")
                            elif os.path.isdir(file_path):
                                shutil.rmtree(file_path)
                                self.logger.debug(f"Removed directory: {file_path}")
                        except Exception as e:
                            self.logger.error(f"Error removing {file_path}: {e}")
                    
                    self.logger.info("Cleanup completed")
                else:
                    self.logger.info(f"Directory {local_dir} is empty")
            else:
                self.logger.info(f"Creating new directory: {local_dir}")
                os.makedirs(local_dir)
                
        except Exception as e:
            self.logger.error(f"Error during cleanup: {e}")
            raise

    def parse_s3_uri(self, s3_uri: str) -> tuple:
        """Parse S3 URI into bucket and key components."""
        if not s3_uri.startswith('s3://'):
            raise ValueError("URI must start with 's3://'")
        
        path = s3_uri[5:]
        parts = path.split('/', 1)
        bucket = parts[0]
        key = parts[1] if len(parts) > 1 else ''
        
        return bucket, key

    def list_s3_files(self, s3_uri: str) -> List[tuple]:
        """
        List all files in an S3 path.
        
        Returns:
            List[tuple]: List of (bucket, key) pairs
        """
        bucket, prefix = self.parse_s3_uri(s3_uri)
        files = []
        
        try:
            paginator = self.s3_client.get_paginator('list_objects_v2')
            for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
                if 'Contents' in page:
                    files.extend([(bucket, obj['Key']) for obj in page['Contents']])
            return files
        except Exception as e:
            self.logger.error(f"Error listing files: {e}")
            return []

    def download_file(self, bucket: str, key: str, local_path: str) -> bool:
        """
        Download a single file from S3 to specified local path.
        
        Args:
            bucket (str): S3 bucket name
            key (str): S3 key (file path in bucket)
            local_path (str): Full local path where file should be saved
        """
        try:
            # Create directory if it doesn't exist
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            self.s3_client.download_file(bucket, key, local_path)
            return True
        except Exception as e:
            self.logger.error(f"Error downloading {key}: {e}")
            return False

    def download_all(self, s3_uri: str, local_dir: str, max_workers: int = 5) -> List[str]:
        """
        Download all files from S3 path using multiple threads.
        
        Args:
            s3_uri (str): S3 URI to download files from
            local_dir (str): Local directory to save files to
            max_workers (int): Number of concurrent downloads
            
        Returns:
            List[str]: List of successfully downloaded file paths
        """
        # Clean up existing files first
        self.cleanup_local_directory(local_dir)
        
        self.logger.info(f"Listing files in {s3_uri}...")
        files = self.list_s3_files(s3_uri)
        
        if not files:
            self.logger.warning("No files found to download.")
            return []

        self.logger.info(f"Found {len(files)} files. Starting download...")
        successful_downloads = []
        
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = []
            for bucket, key in files:
                # Flatten the S3 structure: keep only the filename of each key
                filename = key.split('/')[-1]
                if not filename:
                    continue  # Skip "folder" placeholder keys that end with '/'
                local_path = os.path.join(local_dir, filename)
                
                # Pair each future with its own destination path
                futures.append(
                    (executor.submit(self.download_file, bucket, key, local_path), local_path)
                )
            
            # Monitor downloads with progress bar
            with tqdm(total=len(futures), desc="Downloading") as pbar:
                for future, local_path in futures:
                    try:
                        if future.result():
                            successful_downloads.append(local_path)
                    except Exception as e:
                        self.logger.error(f"Download failed: {e}")
                    finally:
                        pbar.update(1)

        self.logger.info(f"Downloaded {len(successful_downloads)} files successfully to {local_dir}")
        return successful_downloads
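
`download_all` keeps only the filename of each key, so two objects with the same name under different sub-prefixes would land on the same local file. If the S3 layout ever has nested folders, one option is to keep the part of the key below the listed prefix. The sketch below is an alternative mapping, not what the class above does; `local_path_for_key` is a hypothetical helper and the keys are made up.

```python
import os


def local_path_for_key(key: str, prefix: str, local_dir: str) -> str:
    """Map an S3 key to a local path that keeps the folder structure below `prefix`."""
    if not key.startswith(prefix):
        raise ValueError(f"Key {key!r} is not under prefix {prefix!r}")
    relative = key[len(prefix):].lstrip("/")
    if not relative:
        # The key is the prefix itself (a single object): fall back to its filename
        relative = key.rsplit("/", 1)[-1]
    return os.path.join(local_dir, *relative.split("/"))


# Two made-up keys that share a filename but live in different sub-prefixes
for key in ("run/output/data/train/part-0.parquet", "run/output/data/val/part-0.parquet"):
    print(local_path_for_key(key, "run/output/data", "./data/demo"))
```

Because `download_file` already creates missing parent directories, such a mapping could be dropped into the loop in `download_all` without further changes.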

### CradleDataLoading_training

In [7]:
cradle_data_loading_training_s3_uri = "s3://sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um/lukexie-AtoZ-xgboost-NA-1-3-8-pipeline-2025-10-06-22-36-30/CradleDataLoading-Training/output/data"

In [8]:
cradle_data_loading_training_local_dir = "./data/cradle_data_loading_training_output/"

In [9]:
# Create downloader instance
downloader = S3LocalDownloader()

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


In [10]:
# Start download with progress tracking
start_time = time.time()
downloaded_files = downloader.download_all(cradle_data_loading_training_s3_uri, cradle_data_loading_training_local_dir, max_workers=5)
end_time = time.time()

INFO:__main__:Found 791 existing files in ./data/cradle_data_loading_training_output/
INFO:__main__:Cleaning up existing files...
INFO:__main__:Cleanup completed
INFO:__main__:Listing files in s3://sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um/lukexie-AtoZ-xgboost-NA-1-3-8-pipeline-2025-10-06-22-36-30/CradleDataLoading-Training/output/data...
INFO:__main__:Found 791 files. Starting download...
Downloading: 100%|██████████| 791/791 [00:14<00:00, 55.68it/s]
INFO:__main__:Downloaded 791 files successfully to ./data/cradle_data_loading_training_output/


In [11]:
def get_directory_size(directory: str) -> float:
    """Calculate total size of directory in MB."""
    total_size = sum(
        os.path.getsize(os.path.join(dirpath, filename))
        for dirpath, _, filenames in os.walk(directory)
        for filename in filenames
    )
    return total_size / (1024 * 1024)  # Convert to MB

In [12]:
# Print summary
print("\nDownload Summary:")
print(f"Time taken: {end_time - start_time:.2f} seconds")
print(f"Files downloaded: {len(downloaded_files)}")
print(f"Total size: {get_directory_size(cradle_data_loading_training_local_dir):.2f} MB")
        
# Print local directory content
print("\nLocal directory content:")
for file in os.listdir(cradle_data_loading_training_local_dir):
    file_path = os.path.join(cradle_data_loading_training_local_dir, file)
    if os.path.isfile(file_path):
        file_size = os.path.getsize(file_path)
        print(f"- {file} ({file_size/1024/1024:.2f} MB)")


Download Summary:
Time taken: 14.51 seconds
Files downloaded: 791
Total size: 94.07 MB

Local directory content:
- part-00566-7056aba9-97a9-4cfa-80a7-e0e6e6967d51-c000.snappy.parquet (0.12 MB)
- part-00020-7056aba9-97a9-4cfa-80a7-e0e6e6967d51-c000.snappy.parquet (0.12 MB)
- part-00046-7056aba9-97a9-4cfa-80a7-e0e6e6967d51-c000.snappy.parquet (0.12 MB)
- part-00498-7056aba9-97a9-4cfa-80a7-e0e6e6967d51-c000.snappy.parquet (0.11 MB)
- part-00631-7056aba9-97a9-4cfa-80a7-e0e6e6967d51-c000.snappy.parquet (0.13 MB)
- part-00212-7056aba9-97a9-4cfa-80a7-e0e6e6967d51-c000.snappy.parquet (0.11 MB)
- part-00146-7056aba9-97a9-4cfa-80a7-e0e6e6967d51-c000.snappy.parquet (0.12 MB)
- part-00736-7056aba9-97a9-4cfa-80a7-e0e6e6967d51-c000.snappy.parquet (0.12 MB)
- part-00759-7056aba9-97a9-4cfa-80a7-e0e6e6967d51-c000.snappy.parquet (0.13 MB)
- part-00608-7056aba9-97a9-4cfa-80a7-e0e6e6967d51-c000.snappy.parquet (0.12 MB)
- part-00004-7056aba9-97a9-4cfa-80a7-e0e6e6967d51-c000.snappy.parquet (0.11 MB)
- part

### CradleDataLoading_calibration

In [13]:
cradle_data_loading_calibration_s3_uri = "s3://sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um/lukexie-AtoZ-xgboost-NA-1-3-8-pipeline-2025-10-06-22-36-30/CradleDataLoading-Calibration/output/data"

In [14]:
cradle_data_loading_calibration_local_dir = "./data/cradle_data_loading_calibration_output/"

In [15]:
downloader = S3LocalDownloader()

In [16]:
# Start download with progress tracking
start_time = time.time()
calibration_downloaded_files = downloader.download_all(cradle_data_loading_calibration_s3_uri, cradle_data_loading_calibration_local_dir, max_workers=5)
end_time = time.time()

INFO:__main__:Found 3 existing files in ./data/cradle_data_loading_calibration_output/
INFO:__main__:Cleaning up existing files...
INFO:__main__:Cleanup completed
INFO:__main__:Listing files in s3://sandboxdependency-abuse-secureaisandboxteamshare-1l77v9am252um/lukexie-AtoZ-xgboost-NA-1-3-8-pipeline-2025-10-06-22-36-30/CradleDataLoading-Calibration/output/data...
INFO:__main__:Found 3 files. Starting download...
Downloading: 100%|██████████| 3/3 [00:00<00:00, 16.34it/s]
INFO:__main__:Downloaded 3 files successfully to ./data/cradle_data_loading_calibration_output/
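
Before feeding a downloaded directory to the next script, it can help to confirm that it holds the kind of files you expect. A minimal sketch, using only the standard library: `count_files_by_suffix` is a helper introduced here, and the demo runs on a throwaway directory with made-up filenames rather than on the real download.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_files_by_suffix(directory: str) -> Counter:
    """Count files under `directory` (recursively), grouped by their last suffix."""
    counts: Counter = Counter()
    for path in Path(directory).rglob("*"):
        if path.is_file():
            counts[path.suffix or "<no suffix>"] += 1
    return counts


# Demo on a temporary directory with made-up, empty files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("part-0.snappy.parquet", "part-1.snappy.parquet", "_SUCCESS"):
        (Path(tmp) / name).write_bytes(b"")
    demo_counts = count_files_by_suffix(tmp)
print(dict(demo_counts))
```

Calling `count_files_by_suffix(cradle_data_loading_calibration_local_dir)` in the notebook would give the same kind of summary for the real data.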


## Test Script

Note that our standard requires each script to have a `main` function with the following signature:
```python
def main(
    input_paths: Dict[str, str],
    output_paths: Dict[str, str],
    environ_vars: Dict[str, str],
    job_args: argparse.Namespace,
)
```
We import these `main` functions and test them in the order given by the PipelineDAG.
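
To make the calling convention concrete, here is a toy script that follows the same signature, together with a local harness that calls it on a temporary directory. Everything in it is illustrative: the channel keys, the `SUFFIX` variable and the copy logic are assumptions made for this sketch and do not come from any real script in the project.

```python
import argparse
import tempfile
from pathlib import Path
from typing import Dict, List


def main(
    input_paths: Dict[str, str],
    output_paths: Dict[str, str],
    environ_vars: Dict[str, str],
    job_args: argparse.Namespace,
) -> List[str]:
    """Toy script: copy each input file to the output directory under a new name."""
    in_dir = Path(input_paths["data_input"])
    out_dir = Path(output_paths["data_output"])
    out_dir.mkdir(parents=True, exist_ok=True)
    suffix = environ_vars.get("SUFFIX", "copy")  # hypothetical environment variable
    written = []
    for src in sorted(in_dir.iterdir()):
        if src.is_file():
            dst = out_dir / f"{job_args.job_type}_{suffix}_{src.name}"
            dst.write_text(src.read_text())
            written.append(dst.name)
    return written


# Local harness: temporary directories stand in for the container paths
with tempfile.TemporaryDirectory() as tmp:
    in_dir = Path(tmp) / "in"
    in_dir.mkdir()
    (in_dir / "a.csv").write_text("x\n1\n")
    result = main(
        {"data_input": str(in_dir)},
        {"data_output": str(Path(tmp) / "out")},
        {"SUFFIX": "copy"},
        argparse.Namespace(job_type="training"),
    )
print(result)  # ['training_copy_a.csv']
```

Since paths, environment variables and arguments are all passed in explicitly, the same function can run inside a container or in a notebook without modification, which is what the tests below rely on.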

![mods_pipeline_train_eval_calib](./tutorials/mods_end_to_end_xgboost.png)


We test the scripts in topological order:
- **TabularPreprocessing (Training and Calibration)**
- **XGBoostTraining**
- **XGBoostModelEval (Calibration)**
- **ModelCalibration (Calibration)**
- **Package**
- **Payload**
- **Inference**
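
A valid test order can also be derived from the DAG edges instead of being written by hand. The sketch below copies the edges from `create_xgboost_complete_e2e_dag` into a plain list and sorts them with the standard library's `graphlib` (Python 3.9+). It does not use `PipelineDAG` itself, whose API is not shown in this notebook, and a DAG usually has more than one valid order.

```python
from graphlib import TopologicalSorter
from typing import List, Tuple

# (source, destination) pairs, copied from the DAG definition earlier in this notebook
EDGES: List[Tuple[str, str]] = [
    ("CradleDataLoading_training", "TabularPreprocessing_training"),
    ("TabularPreprocessing_training", "XGBoostTraining"),
    ("CradleDataLoading_calibration", "TabularPreprocessing_calibration"),
    ("XGBoostTraining", "XGBoostModelEval_calibration"),
    ("TabularPreprocessing_calibration", "XGBoostModelEval_calibration"),
    ("XGBoostModelEval_calibration", "ModelCalibration_calibration"),
    ("ModelCalibration_calibration", "Package"),
    ("XGBoostTraining", "Package"),
    ("XGBoostTraining", "Payload"),
    ("Package", "Registration"),
    ("Payload", "Registration"),
]


def topological_order(edges: List[Tuple[str, str]]) -> List[str]:
    """Return one order in which every node comes after all of its predecessors."""
    sorter: TopologicalSorter = TopologicalSorter()
    for src, dst in edges:
        sorter.add(dst, src)  # dst depends on src
    return list(sorter.static_order())


order = topological_order(EDGES)
for step in order:
    print(step)
```

`static_order` raises `graphlib.CycleError` if the edges contain a cycle, so the same call doubles as a check that the graph is acyclic.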


In [17]:
import argparse

### Step 2 Tabular Preprocessing (Training) Step

In [27]:
#!pip install pyarrow fastparquet

In [28]:
from buyer_abuse_mods_template.bap_example_pipeline.dockers.scripts.tabular_preprocessing import main as tabular_main

#### Read the Contract to understand the input and output channel names

```python
TABULAR_PREPROCESSING_CONTRACT = ScriptContract(
    entry_point="tabular_preprocessing.py",
    expected_input_paths={"DATA": "/opt/ml/processing/input/data"},
    expected_output_paths={"processed_data": "/opt/ml/processing/output"},
    expected_arguments={
        # No expected arguments - job_type comes from config
    },
    required_env_vars=["LABEL_FIELD", "TRAIN_RATIO", "TEST_VAL_RATIO"],
    optional_env_vars={
        "CATEGORICAL_COLUMNS": "",
        "NUMERICAL_COLUMNS": "",
        "TEXT_COLUMNS": "",
        "DATE_COLUMNS": "",
    },
    framework_requirements={
        "pandas": ">=1.3.0",
        "numpy": ">=1.21.0",
        "scikit-learn": ">=1.0.0",
    },
    description="""
    Tabular preprocessing script that:
    1. Combines data shards from input directory
    2. Cleans and processes label field
    3. Splits data into train/test/val for training jobs
    4. Outputs processed CSV files by split
    
    Contract aligned with actual script implementation:
    - Inputs: DATA (required) - reads from /opt/ml/processing/input/data
    - Outputs: processed_data (primary) - writes to /opt/ml/processing/output
    - Arguments: job_type (required) - defines processing mode (training/validation/testing)
    
    Script Implementation Details:
    - Reads data shards (CSV, JSON, Parquet) from input/data directory
    - Supports gzipped files and various formats
    - Processes labels (converts categorical to numeric if needed)
    - Splits data based on job_type (training creates train/test/val splits)
    - Outputs processed files to split subdirectories under /opt/ml/processing/output
    """,
)
```

We use `cradle_data_loading_training_local_dir` as the input to the tabular preprocessing step.

In [29]:
tabular_preprocessing_training_local_dir = './data/tabular_preprocessing_training_output/'

In [30]:
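# Local directories standing in for the SageMaker container paths in the contract.
# Note: the keys used here ("data_input", "data_output") are not the contract's
# channel names ("DATA", "processed_data").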
input_paths = {
    "data_input": cradle_data_loading_training_local_dir
}

output_paths = {
    "data_output": tabular_preprocessing_training_local_dir
}
environ_vars = {
    "LABEL_FIELD": 'is_abuse',
    "TRAIN_RATIO": '0.8',
    "TEST_VAL_RATIO": '0.5'
}
job_args = argparse.Namespace(**{'job_type': 'training'})


In [31]:
result_training = tabular_main(input_paths, output_paths, environ_vars, job_args)

[INFO] Combining data shards from ./data/cradle_data_loading_training_output/…
[INFO] Combined data shape: (895594, 72)
[INFO] Data shape after cleaning labels: (895594, 72)
[INFO] Saved data/tabular_preprocessing_training_output/train/train_processed_data.csv (shape=(716475, 72))
[INFO] Saved data/tabular_preprocessing_training_output/test/test_processed_data.csv (shape=(89559, 72))
[INFO] Saved data/tabular_preprocessing_training_output/val/val_processed_data.csv (shape=(89560, 72))
[INFO] Preprocessing complete.
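
The logged shapes follow from `TRAIN_RATIO` and `TEST_VAL_RATIO`: 80% of the rows go to train, and the remaining rows are divided between test and val. The arithmetic below reproduces the three row counts in the log. The rounding convention (floor for train, ceiling for val) is an assumption chosen because it matches the log; the script's actual splitting code is not shown in this notebook.

```python
import math
from typing import Tuple


def split_sizes(n_rows: int, train_ratio: float, test_val_ratio: float) -> Tuple[int, int, int]:
    """Return (train, test, val) row counts under the assumed rounding convention."""
    n_train = math.floor(n_rows * train_ratio)
    holdout = n_rows - n_train
    n_val = math.ceil(holdout * test_val_ratio)
    n_test = holdout - n_val
    return n_train, n_test, n_val


print(split_sizes(895594, 0.8, 0.5))  # (716475, 89559, 89560)
```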


In [32]:
result_training['train'].columns

Index(['Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_solicit_count_last_365_days',
       'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_warn_count_last_365_days',
       'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_solicit_count_last_365_days',
       'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_warn_count_last_365_days',
       'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_buyer_order_message_time_gap',
       'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_order_message_time_gap',
       'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_seller_order_message_time_gap',
       'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_message_count_with_diff_topic_si',
       'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_message_count_with_notr_topic_si',
       'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer

### Step 2 Tabular Preprocessing (Calibration) Step

In [33]:
tabular_preprocessing_calibration_local_dir = './data/tabular_preprocessing_calibration_output/'

In [34]:
# Local directories standing in for the SageMaker container paths in the contract
input_paths = {
    "data_input": cradle_data_loading_calibration_local_dir
}

output_paths = {
    "data_output": tabular_preprocessing_calibration_local_dir
}
environ_vars = {
    "LABEL_FIELD": 'is_abuse',
    "TRAIN_RATIO": '0.8',
    "TEST_VAL_RATIO": '0.5'
}
job_args = argparse.Namespace(**{'job_type': 'calibration'})


In [35]:
result_calibration = tabular_main(input_paths, output_paths, environ_vars, job_args)

[INFO] Combining data shards from ./data/cradle_data_loading_calibration_output/…
[INFO] Combined data shape: (45738, 72)
[INFO] Data shape after cleaning labels: (45738, 72)
[INFO] Saved data/tabular_preprocessing_calibration_output/calibration/calibration_processed_data.csv (shape=(45738, 72))
[INFO] Preprocessing complete.


In [36]:
result_calibration['calibration'].columns

Index(['Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_solicit_count_last_365_days',
       'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_warn_count_last_365_days',
       'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_solicit_count_last_365_days',
       'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_warn_count_last_365_days',
       'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_buyer_order_message_time_gap',
       'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_order_message_time_gap',
       'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_max_seller_order_message_time_gap',
       'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_message_count_with_diff_topic_si',
       'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer_na.n_message_count_with_notr_topic_si',
       'Abuse.bsm_stats_for_evaluated_mfn_concessions_by_customer

### Step 3 XGBoost Training Step

#### Read the Contract to understand the input and output requirements

```python
XGBOOST_TRAIN_CONTRACT = TrainingScriptContract(
    entry_point="xgboost_training.py",
    expected_input_paths={
        "input_path": "/opt/ml/input/data",
        "hyperparameters_s3_uri": "/opt/ml/code/hyperparams/hyperparameters.json",
    },
    expected_output_paths={
        "model_output": "/opt/ml/model",
        "evaluation_output": "/opt/ml/output/data",
    },
    expected_arguments={
        # No expected arguments - using standard paths from contract
    },
    required_env_vars=[
        # No strictly required environment variables - script uses hyperparameters.json
    ],
    optional_env_vars={},
    framework_requirements={
        "boto3": ">=1.26.0",
        "xgboost": "==1.7.6",
        "scikit-learn": ">=0.23.2,<1.0.0",
        "pandas": ">=1.2.0,<2.0.0",
        "pyarrow": ">=4.0.0,<6.0.0",
        "beautifulsoup4": ">=4.9.3",
        "flask": ">=2.0.0,<3.0.0",
        "pydantic": ">=2.0.0,<3.0.0",
        "typing-extensions": ">=4.2.0",
        "matplotlib": ">=3.0.0",
        "numpy": ">=1.19.0",
    },
    description="""
    XGBoost training script for tabular data classification that:
    1. Loads training, validation, and test datasets from split directories
    2. Applies numerical imputation using mean strategy for missing values
    3. Fits risk tables on categorical features using training data
    4. Transforms all datasets using fitted preprocessing artifacts
    5. Trains XGBoost model with configurable hyperparameters
    6. Supports both binary and multiclass classification
    7. Handles class weights for imbalanced datasets
    8. Evaluates model performance with comprehensive metrics
    9. Saves model artifacts and preprocessing components
    10. Generates prediction files and performance visualizations
    
    Input Structure:
    - /opt/ml/input/data: Root directory containing train/val/test subdirectories
      - /opt/ml/input/data/train: Training data files (.csv, .parquet, .json)
      - /opt/ml/input/data/val: Validation data files
      - /opt/ml/input/data/test: Test data files
    - /opt/ml/input/data/config/hyperparameters.json: Model configuration (optional)
    
    Output Structure:
    - /opt/ml/model: Model artifacts directory
      - /opt/ml/model/xgboost_model.bst: Trained XGBoost model
      - /opt/ml/model/risk_table_map.pkl: Risk table mappings for categorical features
      - /opt/ml/model/impute_dict.pkl: Imputation values for numerical features
      - /opt/ml/model/feature_importance.json: Feature importance scores
      - /opt/ml/model/feature_columns.txt: Ordered feature column names
      - /opt/ml/model/hyperparameters.json: Model hyperparameters
    - /opt/ml/output/data: Evaluation results directory
      - /opt/ml/output/data/val.tar.gz: Validation predictions and metrics
      - /opt/ml/output/data/test.tar.gz: Test predictions and metrics
    
    Contract aligned with step specification:
    - Inputs: input_path (required), hyperparameters_s3_uri (optional)
    - Outputs: model_output (primary), evaluation_output (secondary)
    
    Hyperparameters (via JSON config):
    - Data fields: tab_field_list, cat_field_list, label_name, id_name
    - Model: is_binary, num_classes, class_weights
    - XGBoost: eta, gamma, max_depth, subsample, colsample_bytree, lambda_xgb, alpha_xgb
    - Training: num_round, early_stopping_rounds
    - Risk tables: smooth_factor, count_threshold
    
    Binary Classification:
    - Uses binary:logistic objective
    - Supports scale_pos_weight for class imbalance
    - Generates ROC and PR curves
    - Computes AUC-ROC, Average Precision, F1-Score
    
    Multiclass Classification:
    - Uses multi:softprob objective
    - Supports sample weights for class imbalance
    - Generates per-class and aggregate metrics
    - Computes micro/macro averaged metrics
    
    Risk Table Processing:
    - Fits risk tables on categorical features using target correlation
    - Applies smoothing and count thresholds for robust estimation
    - Transforms categorical values to risk scores
    
    Numerical Imputation:
    - Uses mean imputation strategy for missing numerical values
    - Fits imputation on training data only
    - Applies same imputation to validation and test sets
    """,
)
```
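
The contract says that numerical imputation is fitted on the training data only and then applied unchanged to the validation and test sets. A minimal sketch of that pattern in plain Python follows. It is not the training script's implementation: the row format, the helper names and the `amount` column are assumptions made for illustration.

```python
from statistics import fmean
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def fit_mean_imputer(train_rows: List[Row], columns: List[str]) -> Dict[str, float]:
    """Compute one mean per column from the observed (non-missing) training values."""
    impute_values = {}
    for col in columns:
        observed = [row[col] for row in train_rows if row.get(col) is not None]
        impute_values[col] = fmean(observed) if observed else 0.0
    return impute_values


def apply_imputer(rows: List[Row], impute_values: Dict[str, float]) -> List[Row]:
    """Fill missing values with the means fitted on the training data."""
    filled = []
    for row in rows:
        new_row = dict(row)
        for col, value in impute_values.items():
            if new_row.get(col) is None:
                new_row[col] = value
        filled.append(new_row)
    return filled


train = [{"amount": 1.0}, {"amount": None}, {"amount": 3.0}]
val = [{"amount": None}, {"amount": 10.0}]

impute_values = fit_mean_imputer(train, ["amount"])
print(impute_values)                      # {'amount': 2.0}
print(apply_imputer(val, impute_values))  # [{'amount': 2.0}, {'amount': 10.0}]
```

The validation value of 10.0 plays no part in the fitted mean, which is the point of fitting on training data only: statistics from held-out data never leak into preprocessing.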

In [39]:
#!pip install xgboost

In [40]:
from buyer_abuse_mods_template.cursus.steps.scripts.xgboost_training import main as training_main 

Note: You have installed the 'manylinux2014' variant of XGBoost. Certain features such as GPU algorithms or federated learning are not available. To use these features, please upgrade to a recent Linux distro with glibc 2.28+, and install the 'manylinux_2_28' variant.
INFO:matplotlib.font_manager:generated new fontManager


In [41]:
hyperparameter_local_dir = './dockers/hyperparams/hyperparameters.json'
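
Before running training, it can be useful to see which of the hyperparameter names listed in the contract description are present in the JSON file. The sketch below takes the field names from that description; the contract does not say which of them are mandatory, so treat the result as informational. `missing_fields` is a helper introduced here, and the demo file holds made-up values, not the project's settings.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List

# Field names copied from the contract description above
CONTRACT_FIELDS = [
    "tab_field_list", "cat_field_list", "label_name", "id_name",
    "is_binary", "num_classes", "class_weights",
    "eta", "gamma", "max_depth", "subsample", "colsample_bytree",
    "lambda_xgb", "alpha_xgb",
    "num_round", "early_stopping_rounds",
    "smooth_factor", "count_threshold",
]


def missing_fields(config_path: str, fields: List[str] = CONTRACT_FIELDS) -> List[str]:
    """Return the contract field names that are absent from the JSON config."""
    with open(config_path) as f:
        config: Dict[str, Any] = json.load(f)
    return [name for name in fields if name not in config]


# Demo on a throwaway file with made-up values
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "hyperparameters.json"
    demo_path.write_text(json.dumps({"label_name": "is_abuse", "eta": 0.1, "max_depth": 6}))
    missing = missing_fields(str(demo_path))
print(missing)
```

In the notebook, `missing_fields(hyperparameter_local_dir)` would run the same check against the real file.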

In [42]:
xgboost_training_model_output_local_dir = './data/xgboost_training_model_output_raw'

In [43]:
xgboost_training_evaluation_output_local_dir = './data/xgboost_training_evaluation_output'

In [44]:
# Local paths standing in for the SageMaker container paths in the contract
input_paths = {
    "input_path": tabular_preprocessing_training_local_dir,
    "hyperparameters_s3_uri": hyperparameter_local_dir,
}

output_paths = {
    "model_output": xgboost_training_model_output_local_dir,
    "evaluation_output": xgboost_training_evaluation_output_local_dir,
}
environ_vars = {}
job_args = argparse.Namespace()

In [45]:
training_main(input_paths, output_paths, environ_vars, job_args)

2025-10-14 23:30:32 - buyer_abuse_mods_template.cursus.steps.scripts.xgboost_training - INFO - Starting XGBoost training process...
2025-10-14 23:30:32 - buyer_abuse_mods_template.cursus.steps.scripts.xgboost_training - INFO - Loading configuration from ./dockers/hyperparams/hyperparameters.json
2025-10-14 23:30:32 - buyer_abuse_mods_template.cursus.steps.scripts.xgboost_training - INFO - Configuration loaded successfully
2025-10-14 23:30:32 - buyer_abuse_mods_template.cursus.steps.scripts.xgboost_training - INFO - Loading datasets...
2025-10-14 23:30:38 - buyer_abuse_mods_template.cursus.steps.scripts.xgboost_training - INFO - Loaded data -> train: (716475, 72), val: (89560, 72), test: (89559, 72)
2025-10-14 23:30:38 - buyer_abuse_mods_template.cursus.steps.scripts.xgboost_training - INFO - Datasets loaded successfully
2025-10-14 23:30:38 - buyer_abuse_mods_template.cursus.steps.scripts.xgboost_training - INFO - Starting numerical imputation...
2025-10-14 23:30:39 - buyer_abuse_mods_t

#### Pack model
In SageMaker, the training step automatically packs `model_output` into `model.tar.gz`. When running locally, we have to do this manually.

In [46]:
import tarfile

In [47]:
def pack_model_to_tar_gz(
    source_dir: str,
    target_dir: str,
    model_name: str = "model.tar.gz",
    cleanup: bool = True
) -> str:
    """
    Pack XGBoost model output into model.tar.gz file.
    
    Args:
        source_dir (str): Directory containing model files
        target_dir (str): Directory where tar.gz will be saved
        model_name (str): Name of the tar.gz file (default: 'model.tar.gz')
        cleanup (bool): Whether to remove existing tar file if it exists
        
    Returns:
        str: Full path to the created tar.gz file
    """
    logger = logging.getLogger(__name__)
    
    try:
        # Convert to Path objects
        source_path = Path(source_dir)
        target_path = Path(target_dir)
        
        # Validate source directory
        if not source_path.exists():
            raise FileNotFoundError(f"Source directory does not exist: {source_dir}")
        
        if not source_path.is_dir():
            raise NotADirectoryError(f"Source path is not a directory: {source_dir}")
        
        # Create target directory if it doesn't exist
        target_path.mkdir(parents=True, exist_ok=True)
        
        # Full path for the tar.gz file
        tar_path = target_path / model_name
        
        # Remove existing tar file if cleanup is True
        if cleanup and tar_path.exists():
            logger.info(f"Removing existing tar file: {tar_path}")
            tar_path.unlink()
            
        # Create tar.gz file
        logger.info(f"Creating tar.gz file at: {tar_path}")
        with tarfile.open(tar_path, "w:gz") as tar:
            # Add each file from source directory
            for file_path in source_path.glob("*"):
                if file_path.is_file():  # Only pack files, not directories
                    logger.debug(f"Adding file to tar: {file_path.name}")
                    tar.add(
                        file_path,
                        arcname=file_path.name  # Store only filename without path
                    )
        
        # Verify the tar file was created
        if not tar_path.exists():
            raise RuntimeError(f"Failed to create tar file at: {tar_path}")
            
        logger.info(f"Successfully created model tar file: {tar_path}")
        logger.info(f"Tar file size: {tar_path.stat().st_size / (1024*1024):.2f} MB")
        
        return str(tar_path)
        
    except Exception as e:
        logger.error(f"Error packing model: {str(e)}")
        raise

In [48]:
xgboost_training_model_tar_gz_local_dir = './data/xgboost_training_model_output_compressed'

In [49]:
try:
    # Pack model
    tar_path = pack_model_to_tar_gz(
        source_dir=xgboost_training_model_output_local_dir,
        target_dir=xgboost_training_model_tar_gz_local_dir,
        model_name='model.tar.gz',
        cleanup=True
    )
        
    # Print results
    print(f"\nModel packed successfully!")
    print(f"Tar file location: {tar_path}")
    
    # List contents of tar file
    print("\nTar file contents:")
    with tarfile.open(tar_path, 'r:gz') as tar:
        for member in tar.getmembers():
            print(f"- {member.name} ({member.size / 1024:.2f} KB)")
                
except Exception as e:
    print(f"Error: {str(e)}")

2025-10-14 23:30:52 - __main__ - INFO - Removing existing tar file: data/xgboost_training_model_output_compressed/model.tar.gz
2025-10-14 23:30:52 - __main__ - INFO - Creating tar.gz file at: data/xgboost_training_model_output_compressed/model.tar.gz
2025-10-14 23:30:52 - __main__ - INFO - Successfully created model tar file: data/xgboost_training_model_output_compressed/model.tar.gz
2025-10-14 23:30:52 - __main__ - INFO - Tar file size: 0.38 MB

Model packed successfully!
Tar file location: data/xgboost_training_model_output_compressed/model.tar.gz

Tar file contents:
- feature_columns.txt (4.89 KB)
- xgboost_model.bst (1145.58 KB)
- risk_table_map.pkl (17.70 KB)
- hyperparameters.json (10.98 KB)
- feature_importance.json (5.21 KB)
- impute_dict.pkl (5.16 KB)


### Step 4 XGBoostModelEval (Calibration) Step

```python
XGBOOST_MODEL_EVAL_CONTRACT = ScriptContract(
    entry_point="xgboost_model_eval.py",
    expected_input_paths={
        "model_input": "/opt/ml/processing/input/model",
        "processed_data": "/opt/ml/processing/input/eval_data",
    },
    expected_output_paths={
        "eval_output": "/opt/ml/processing/output/eval",
        "metrics_output": "/opt/ml/processing/output/metrics",
    },
    expected_arguments={
        # No expected arguments - job_type comes from config
    },
    required_env_vars=["ID_FIELD", "LABEL_FIELD"],
    optional_env_vars={},
    framework_requirements={
        "pandas": ">=1.3.0",
        "numpy": ">=1.21.0",
        "scikit-learn": ">=1.0.0",
        "xgboost": ">=1.6.0",
        "matplotlib": ">=3.5.0",
    },
    description="""
    XGBoost model evaluation script that:
    1. Loads trained XGBoost model and preprocessing artifacts
    2. Loads and preprocesses evaluation data using risk tables and imputation
    3. Generates predictions and computes performance metrics
    4. Creates ROC and Precision-Recall curve visualizations
    5. Saves predictions, metrics, and plots
    
    Input Structure:
    - /opt/ml/processing/input/model: Model artifacts directory containing:
      - xgboost_model.bst: Trained XGBoost model
      - risk_table_map.pkl: Risk table mappings for categorical features
      - impute_dict.pkl: Imputation dictionary for numerical features
      - feature_columns.txt: Feature column names and order
      - hyperparameters.json: Model hyperparameters and metadata
    - /opt/ml/processing/input/eval_data: Evaluation data (CSV or Parquet files)
    
    Output Structure:
    - /opt/ml/processing/output/eval/eval_predictions.csv: Model predictions with probabilities
    - /opt/ml/processing/output/metrics/metrics.json: Performance metrics
    - /opt/ml/processing/output/metrics/roc_curve.jpg: ROC curve visualization
    - /opt/ml/processing/output/metrics/pr_curve.jpg: Precision-Recall curve visualization
    
    Environment Variables:
    - ID_FIELD: Name of the ID column in evaluation data
    - LABEL_FIELD: Name of the label column in evaluation data
    
    Arguments:
    - job_type: Type of evaluation job to perform (e.g., "evaluation", "validation")
    
    Supports both binary and multiclass classification with appropriate metrics for each.
    """,
)
```
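
Before calling a script's `main`, it can help to confirm that the local `input_paths` and `output_paths` dictionaries use the logical names the contract declares. The following is a minimal sketch: `check_paths_against_contract` is a hypothetical helper (not part of cursus), the contract is represented as plain lists copied from the listing above rather than as a `ScriptContract` object, and the local directories are placeholders.

```python
from typing import Dict, Iterable, List


def check_paths_against_contract(
    expected_keys: Iterable[str], local_paths: Dict[str, str]
) -> List[str]:
    """Return a list of readable problems; an empty list means the keys match."""
    expected = set(expected_keys)
    provided = set(local_paths)
    problems = []
    for key in sorted(expected - provided):
        problems.append(f"missing logical name: {key}")
    for key in sorted(provided - expected):
        problems.append(f"unexpected logical name: {key}")
    return problems


# Logical names copied from XGBOOST_MODEL_EVAL_CONTRACT above.
expected_inputs = ["model_input", "processed_data"]
expected_outputs = ["eval_output", "metrics_output"]

# Hypothetical local directories standing in for the notebook variables.
local_inputs = {
    "model_input": "./data/xgboost_training_model_output_compressed",
    "processed_data": "./data/calibration_input",
}
# "metrics_output" is left out on purpose to show what a mismatch looks like.
local_outputs = {"eval_output": "./data/xgboost_model_eval_eval_output"}

input_problems = check_paths_against_contract(expected_inputs, local_inputs)
output_problems = check_paths_against_contract(expected_outputs, local_outputs)
print(input_problems)
print(output_problems)
```

A check like this only compares key names. It does not verify that the directories exist or contain the expected files.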

In [50]:
from buyer_abuse_mods_template.bap_example_pipeline.dockers.scripts.xgboost_model_eval import main as model_eval_main

Looking in indexes: https://aws:****@amazon-149122183214.d.codeartifact.us-west-2.amazonaws.com/pypi/secure-pypi/simple/
***********************Package Installed*********************
2025-10-14 23:30:54 - buyer_abuse_mods_template.bap_example_pipeline.dockers.scripts.xgboost_model_eval - INFO - Added /opt/ml/processing/input/code to Python path
2025-10-14 23:30:54 - buyer_abuse_mods_template.bap_example_pipeline.dockers.scripts.xgboost_model_eval - INFO - Successfully embedded processing modules


In [51]:
xgboost_model_eval_eval_output = './data/xgboost_model_eval_eval_output'
xgboost_model_eval_metrics_output =  './data/xgboost_model_eval_metrics_output'

In [52]:
# Map the contract's logical names to local directories
input_paths = {
    "model_input": xgboost_training_model_tar_gz_local_dir,
    "processed_data": tabular_preprocessing_calibration_local_dir,
}

output_paths = {
    "eval_output": xgboost_model_eval_eval_output,
    "metrics_output": xgboost_model_eval_metrics_output,
}
environ_vars = {
    "ID_FIELD": 'order_id',
    "LABEL_FIELD": "is_abuse"
}
job_args = argparse.Namespace(**{'job_type': 'calibration'})

In [53]:
model_eval_main(input_paths, output_paths, environ_vars, job_args)

2025-10-14 23:30:54 - buyer_abuse_mods_template.bap_example_pipeline.dockers.scripts.xgboost_model_eval - INFO - Running model evaluation with job_type: calibration
2025-10-14 23:30:54 - buyer_abuse_mods_template.bap_example_pipeline.dockers.scripts.xgboost_model_eval - INFO - Starting model evaluation script
2025-10-14 23:30:54 - buyer_abuse_mods_template.bap_example_pipeline.dockers.scripts.xgboost_model_eval - INFO - Loading model artifacts from ./data/xgboost_training_model_output_compressed
2025-10-14 23:30:54 - buyer_abuse_mods_template.bap_example_pipeline.dockers.scripts.xgboost_model_eval - INFO - Decompress the model tarball if it exists
2025-10-14 23:30:54 - buyer_abuse_mods_template.bap_example_pipeline.dockers.scripts.xgboost_model_eval - INFO - Found model.tar.gz at data/xgboost_training_model_output_compressed/model.tar.gz. Extracting...
2025-10-14 23:30:54 - buyer_abuse_mods_template.bap_example_pipeline.dockers.scripts.xgboost_model_eval - INFO - Extraction complete.
2

### Step 5 Model Calibration (Calibration) Step

```python
MODEL_CALIBRATION_CONTRACT = ScriptContract(
    entry_point="model_calibration.py",
    expected_input_paths={"evaluation_data": "/opt/ml/processing/input/eval_data"},
    expected_output_paths={
        "calibration_output": "/opt/ml/processing/output/calibration",
        "metrics_output": "/opt/ml/processing/output/metrics",
        "calibrated_data": "/opt/ml/processing/output/calibrated_data",
    },
    required_env_vars=["CALIBRATION_METHOD", "LABEL_FIELD", "SCORE_FIELD", "IS_BINARY"],
    optional_env_vars={
        "MONOTONIC_CONSTRAINT": "True",
        "GAM_SPLINES": "10",
        "ERROR_THRESHOLD": "0.05",
        "NUM_CLASSES": "2",
        "SCORE_FIELD_PREFIX": "prob_class_",
        "MULTICLASS_CATEGORIES": "[0, 1]",
    },
    framework_requirements={
        "scikit-learn": ">=0.23.2,<1.0.0",
        "pandas": ">=1.2.0,<2.0.0",
        "numpy": ">=1.20.0",
        "pygam": ">=0.8.0",
        "matplotlib": ">=3.3.0",
    },
    description="""Contract for model calibration processing step.
    
    The model calibration step takes a trained model's raw prediction scores and
    calibrates them to better reflect true probabilities, which is essential for
    risk-based decision-making, threshold setting, and confidence in model outputs.
    Supports both binary and multi-class classification scenarios.
    
    Input Structure:
    - /opt/ml/processing/input/eval_data: Evaluation dataset with ground truth labels and model predictions
    
    Output Structure:
    - /opt/ml/processing/output/calibration: Calibration mapping and artifacts
    - /opt/ml/processing/output/metrics: Calibration quality metrics
    - /opt/ml/processing/output/calibrated_data: Dataset with calibrated probabilities
    
    Environment Variables:
    - CALIBRATION_METHOD: Method to use for calibration (gam, isotonic, platt)
    - LABEL_FIELD: Name of the label column
    - SCORE_FIELD: Name of the prediction score column (for binary classification)
    - IS_BINARY: Whether this is a binary classification task (true/false)
    - MONOTONIC_CONSTRAINT: Whether to enforce monotonicity in GAM (optional)
    - GAM_SPLINES: Number of splines for GAM (optional)
    - ERROR_THRESHOLD: Acceptable calibration error threshold (optional)
    - NUM_CLASSES: Number of classes for multi-class classification (optional, default=2)
    - SCORE_FIELD_PREFIX: Prefix for probability columns in multi-class scenario (optional)
    - MULTICLASS_CATEGORIES: JSON string of class names/values for multi-class (optional)
    """,
)

```
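
Every environment variable in this contract is a string, so the values have to be converted to booleans, numbers, or lists before use. The sketch below shows one plausible conversion using the defaults listed in the contract. `parse_calibration_env` is a hypothetical helper, and the actual parsing inside `model_calibration.py` may differ.

```python
import json
from typing import Any, Dict


def parse_calibration_env(env: Dict[str, str]) -> Dict[str, Any]:
    """Convert string-valued settings into typed values (illustrative only)."""

    def as_bool(value: str) -> bool:
        return value.strip().lower() in {"true", "1", "yes"}

    return {
        # Required settings: a missing key raises KeyError.
        "calibration_method": env["CALIBRATION_METHOD"].lower(),
        "label_field": env["LABEL_FIELD"],
        "score_field": env["SCORE_FIELD"],
        "is_binary": as_bool(env["IS_BINARY"]),
        # Optional settings fall back to the defaults listed in the contract.
        "monotonic_constraint": as_bool(env.get("MONOTONIC_CONSTRAINT", "True")),
        "gam_splines": int(env.get("GAM_SPLINES", "10")),
        "error_threshold": float(env.get("ERROR_THRESHOLD", "0.05")),
        "num_classes": int(env.get("NUM_CLASSES", "2")),
        "score_field_prefix": env.get("SCORE_FIELD_PREFIX", "prob_class_"),
        "multiclass_categories": json.loads(
            env.get("MULTICLASS_CATEGORIES", "[0, 1]")
        ),
    }


settings = parse_calibration_env(
    {
        "CALIBRATION_METHOD": "gam",
        "LABEL_FIELD": "is_abuse",
        "SCORE_FIELD": "prob_class_1",
        "IS_BINARY": "True",
    }
)
print(settings)
```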

In [55]:
#!pip install pygam

In [56]:
from buyer_abuse_mods_template.cursus.steps.scripts.model_calibration import main as model_calibration_main

**Note:** There is a compatibility issue between **pygam 0.8.1** and newer versions of **scipy**. It surfaces as the error `'csr_matrix' object has no attribute 'A'`.
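
A quick way to see whether the environment is affected is to look at the installed pygam version. The sketch below assumes that 0.9.1 (the version requested by the pip command in the next cell) is new enough; that threshold is taken from this notebook and is not a statement about pygam's release history. `parse_version` and `installed_version` are hypothetical helpers.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '0.9.1' into (0, 9, 1); suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


MIN_PYGAM = (0, 9, 1)  # assumption: taken from the pip command in this notebook

pygam_version = installed_version("pygam")
if pygam_version is None:
    print("pygam is not installed")
elif parse_version(pygam_version) < MIN_PYGAM:
    print(f"pygam {pygam_version} is older than 0.9.1; consider upgrading")
else:
    print(f"pygam {pygam_version} looks recent enough")
```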

In [57]:
#!pip install "pygam>=0.9.1" --ignore-installed

In [58]:
model_calibration_calibration_output = './data/model_calibration_calibration_output'
model_calibration_metrics_output =  './data/model_calibration_metrics_output'
model_calibration_calibrated_data = './data/model_calibration_calibrated_data'

In [59]:
# Map the contract's logical names to local directories
input_paths = {
    "evaluation_data": xgboost_model_eval_eval_output,
}

output_paths = {
    "calibration_output": model_calibration_calibration_output,
    "metrics_output": model_calibration_metrics_output,
    "calibrated_data": model_calibration_calibrated_data,
}
environ_vars = {
    "CALIBRATION_METHOD": "gam",
    "LABEL_FIELD": "is_abuse",
    "SCORE_FIELD":  "prob_class_1",
    "IS_BINARY":   "True",
    "MONOTONIC_CONSTRAINT": "True",
    "GAM_SPLINES": "10",
    "ERROR_THRESHOLD": "0.05",
    "NUM_CLASSES": "2",
    "SCORE_FIELD_PREFIX": "prob_class_",
    "MULTICLASS_CATEGORIES": "[0, 1]",
}
job_args = argparse.Namespace(**{'job_type': 'calibration'})

In [60]:
model_calibration_main(input_paths, output_paths, environ_vars, job_args)

2025-10-14 23:31:13 - buyer_abuse_mods_template.cursus.steps.scripts.model_calibration - INFO - Starting model calibration
2025-10-14 23:31:13 - buyer_abuse_mods_template.cursus.steps.scripts.model_calibration - INFO - Running in binary mode
2025-10-14 23:31:13 - buyer_abuse_mods_template.cursus.steps.scripts.model_calibration - INFO - Using job_type from command line: calibration
2025-10-14 23:31:13 - buyer_abuse_mods_template.cursus.steps.scripts.model_calibration - INFO -   DATA PREPARATION
2025-10-14 23:31:13 - buyer_abuse_mods_template.cursus.steps.scripts.model_calibration - INFO - Loading data for job_type=calibration using standard loading
2025-10-14 23:31:13 - buyer_abuse_mods_template.cursus.steps.scripts.model_calibration - INFO - Loading data from ./data/xgboost_model_eval_eval_output/eval_predictions.csv
2025-10-14 23:31:13 - buyer_abuse_mods_template.cursus.steps.scripts.model_calibration - INFO - Loaded data with shape (45738, 4)
2025-10-14 23:31:13 - buyer_abuse_mods_te

{'status': 'success',
 'mode': 'binary',
 'calibration_method': 'gam',
 'metrics_report': {'mode': 'binary',
  'calibration_method': 'gam',
  'uncalibrated': {'expected_calibration_error': 0.007620695944738972,
   'maximum_calibration_error': 0.11980885920000009,
   'brier_score': 0.01867939043959358,
   'auc_roc': 0.9865442633026731,
   'reliability_diagram': {'true_probs': [0.006402959590210586,
     0.2066420664206642,
     0.368,
     0.43902439024390244,
     0.5,
     0.5714285714285714,
     0.68,
     0.7083333333333334,
     0.8282442748091603,
     0.8990825688073395],
    'pred_probs': [0.002698654235238307,
     0.139401767099631,
     0.2481911407999999,
     0.3509751668292684,
     0.4483121692391303,
     0.5487611722448977,
     0.6521399908000005,
     0.7542549802777774,
     0.8577333140076336,
     0.9492656996880736]},
   'bin_statistics': {'bin_counts': [42168.0,
     542.0,
     250.0,
     246.0,
     184.0,
     196.0,
     250.0,
     288.0,
     524.0,
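
The `expected_calibration_error` and `maximum_calibration_error` values can be related to the reliability diagram. Under the usual binned definitions (assumed here, since the script's implementation is not shown), ECE is the count-weighted mean of the gap between `true_probs` and `pred_probs` per bin, and MCE is the largest such gap. In the output above, the third bin has a gap of 0.368 - 0.2482, about 0.1198, which agrees with the reported maximum. The sketch below uses small made-up bins, not the notebook's data.

```python
from typing import Sequence, Tuple


def calibration_errors(
    true_probs: Sequence[float],
    pred_probs: Sequence[float],
    bin_counts: Sequence[float],
) -> Tuple[float, float]:
    """Return (ECE, MCE) computed from per-bin reliability statistics."""
    if not (len(true_probs) == len(pred_probs) == len(bin_counts)):
        raise ValueError("all inputs must have one entry per bin")
    total = sum(bin_counts)
    if total <= 0:
        raise ValueError("bin_counts must sum to a positive number")
    # Absolute gap between observed frequency and mean predicted probability.
    gaps = [abs(t - p) for t, p in zip(true_probs, pred_probs)]
    # ECE weights each gap by the share of samples that fall in the bin.
    ece = sum(gap * count / total for gap, count in zip(gaps, bin_counts))
    # MCE is the worst single-bin gap.
    mce = max(gaps)
    return ece, mce


# Made-up example with three bins (not taken from the notebook output).
ece, mce = calibration_errors(
    true_probs=[0.0, 0.5, 1.0],
    pred_probs=[0.1, 0.5, 0.8],
    bin_counts=[50, 30, 20],
)
print(f"ECE={ece:.3f}, MCE={mce:.3f}")
```

Because the first bin holds most of the samples in the notebook's output, its small gap dominates the ECE, while the MCE is driven by a sparsely populated middle bin.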
     

### Step 6 Package Step

```python
PACKAGE_CONTRACT = ScriptContract(
    entry_point="package.py",
    expected_input_paths={
        "model_input": "/opt/ml/processing/input/model",
        "inference_scripts_input": "/opt/ml/processing/input/script",
        "calibration_model": "/opt/ml/processing/input/calibration",
    },
    expected_output_paths={"packaged_model": "/opt/ml/processing/output"},
    expected_arguments={
        # No expected arguments - using standard paths from contract
    },
    required_env_vars=[
        # No required environment variables for this script
    ],
    optional_env_vars={},
    framework_requirements={
        "python": ">=3.7"
        # Uses only standard library modules: shutil, tarfile, pathlib, logging, os
    },
    description="""
    MIMS packaging script that:
    1. Extracts model artifacts from input model directory or model.tar.gz
    2. Includes calibration model if available
    3. Copies inference scripts to code directory
    4. Creates a packaged model.tar.gz file for deployment
    5. Provides detailed logging of the packaging process
    
    Input Structure:
    - /opt/ml/processing/input/model: Model artifacts (files or model.tar.gz)
    - /opt/ml/processing/input/script: Inference scripts to include
    - /opt/ml/processing/input/calibration: Optional calibration model artifacts
    
    Output Structure:
    - /opt/ml/processing/output/model.tar.gz: Packaged model ready for deployment
    """,
)
```

In [61]:
from buyer_abuse_mods_template.bap_example_pipeline.dockers.scripts.package import main as package_main

In [65]:
inference_script_dir = './dockers'

In [69]:
package_packaged_model_output = './data/package_packaged_model_output'
temp_working_dir = './data/mims_packaging_directory'

In [70]:
# Map the contract's logical names to local directories
input_paths = {
    "model_input": xgboost_training_model_tar_gz_local_dir,
    "inference_scripts_input": inference_script_dir,
    "calibration_model": model_calibration_calibration_output
}

output_paths = {
    "packaged_model": package_packaged_model_output
}
environ_vars = {
    "WORKING_DIRECTORY": temp_working_dir
}
job_args = argparse.Namespace()

In [71]:
result_package = package_main(input_paths, output_paths, environ_vars, job_args)

2025-10-14 23:38:36 - buyer_abuse_mods_template.bap_example_pipeline.dockers.scripts.package - INFO - 
=== Starting MIMS packaging process ===
2025-10-14 23:38:36 - buyer_abuse_mods_template.bap_example_pipeline.dockers.scripts.package - INFO - Python version: 3.10.18 | packaged by conda-forge | (main, Jun  4 2025, 14:45:41) [GCC 13.3.0]
2025-10-14 23:38:36 - buyer_abuse_mods_template.bap_example_pipeline.dockers.scripts.package - INFO - Working directory: /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/bap_example_pipeline
2025-10-14 23:38:36 - buyer_abuse_mods_template.bap_example_pipeline.dockers.scripts.package - INFO - Available disk space: 58.05GB
2025-10-14 23:38:36 - buyer_abuse_mods_template.bap_example_pipeline.dockers.scripts.package - INFO - 
Using paths:
2025-10-14 23:38:36 - buyer_abuse_mods_template.bap_example_pipeline.dockers.scripts.package - INFO -   Model path: data/xgboost_training_model_output_compressed
2025-10-14 23:38:36 - buyer_ab

### Step 7 Payload Step

```python
PAYLOAD_CONTRACT = ScriptContract(
    entry_point="payload.py",
    expected_input_paths={"model_input": "/opt/ml/processing/input/model"},
    expected_output_paths={"payload_sample": "/opt/ml/processing/output"},
    expected_arguments={
        # No expected arguments - using standard paths from contract
    },
    required_env_vars=[
        # No strictly required environment variables - script has defaults
    ],
    optional_env_vars={
        # Only these environment variables are actually used by the script:
        "CONTENT_TYPES": "application/json",
        "DEFAULT_NUMERIC_VALUE": "0.0",
        "DEFAULT_TEXT_VALUE": "DEFAULT_TEXT",
        # Special field environment variables follow pattern SPECIAL_FIELD_<fieldname>
    },
    framework_requirements={
        "python": ">=3.7"
        # Uses only standard library modules: json, logging, os, tarfile, tempfile, pathlib, enum, typing, datetime
    },
    description="""
    MIMS payload generation script that:
    1. Extracts hyperparameters from model artifacts (model.tar.gz or directory)
    2. Creates model variable list from field information
    3. Generates sample payloads in multiple formats (JSON, CSV)
    4. Archives payload files for deployment
    
    Note: This script extracts pipeline name, version, and model objective from hyperparameters,
    not from environment variables. It does not use PIPELINE_NAME, REGION, PAYLOAD_S3_KEY, or 
    BUCKET_NAME environment variables.
    
    Input Structure:
    - /opt/ml/processing/input/model: Model artifacts containing hyperparameters.json
    
    Output Structure:
    - /tmp/mims_payload_work/payload_sample/: Sample payload files (temporary)
    - /opt/ml/processing/output/: Output directory containing payload.tar.gz file
    
    Environment Variables:
    - CONTENT_TYPES: Comma-separated list of content types (default: "application/json")
    - DEFAULT_NUMERIC_VALUE: Default value for numeric fields (default: "0.0")
    - DEFAULT_TEXT_VALUE: Default value for text fields (default: "DEFAULT_TEXT")
    - SPECIAL_FIELD_<fieldname>: Custom values for specific fields
    
    Arguments:
    - mode: Operating mode for the script (default: "standard")
    """,
)
```
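
To make the description concrete, here is a rough sketch of how a sample payload could be assembled from a list of fields and the default-value settings. The field names, the `(name, kind)` representation, the exact form of the `SPECIAL_FIELD_<fieldname>` lookup, and `build_sample_payload` itself are all hypothetical. The real `payload.py` derives its field list from `hyperparameters.json` and may format its output differently.

```python
import json
from typing import Dict, List, Tuple


def build_sample_payload(
    fields: List[Tuple[str, str]],
    env: Dict[str, str],
) -> Dict[str, str]:
    """Build JSON and CSV sample payloads from (name, kind) pairs."""
    numeric_default = env.get("DEFAULT_NUMERIC_VALUE", "0.0")
    text_default = env.get("DEFAULT_TEXT_VALUE", "DEFAULT_TEXT")

    values = {}
    for name, kind in fields:
        # Assumed convention: SPECIAL_FIELD_<fieldname> overrides the default.
        override = env.get(f"SPECIAL_FIELD_{name}")
        if override is not None:
            values[name] = override
        elif kind == "NUMERIC":
            values[name] = numeric_default
        else:
            values[name] = text_default

    return {
        "application/json": json.dumps(values),
        # Naive CSV: values only, in field order, no header and no quoting.
        "text/csv": ",".join(values[name] for name, _ in fields),
    }


# Hypothetical fields, not the model's real variable list.
fields = [("order_total", "NUMERIC"), ("payment_method", "TEXT")]
payloads = build_sample_payload(fields, {"SPECIAL_FIELD_payment_method": "CC"})
print(payloads["application/json"])
print(payloads["text/csv"])
```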


In [72]:
from buyer_abuse_mods_template.bap_example_pipeline.dockers.scripts.payload import main as payload_main

In [78]:
payload_payload_sample_output = './data/payload_payload_sample_output'
temp_working_dir = './data/mims_payload_work'

In [79]:
# Map the contract's logical names to local directories
input_paths = {
    "model_input": xgboost_training_model_tar_gz_local_dir,
}

output_paths = {
    "output_dir": payload_payload_sample_output
}
environ_vars = {
    "WORKING_DIRECTORY": temp_working_dir, 
    "CONTENT_TYPES": "application/json",
    "DEFAULT_NUMERIC_VALUE": "0.0",
    "DEFAULT_TEXT_VALUE": "DEFAULT_TEXT",
}
job_args = argparse.Namespace()

In [80]:
result_payload = payload_main(input_paths, output_paths, environ_vars, job_args)

2025-10-14 23:46:13 - buyer_abuse_mods_template.bap_example_pipeline.dockers.scripts.payload - INFO - 
Using paths:
2025-10-14 23:46:13 - buyer_abuse_mods_template.bap_example_pipeline.dockers.scripts.payload - INFO -   Model input directory: data/xgboost_training_model_output_compressed
2025-10-14 23:46:13 - buyer_abuse_mods_template.bap_example_pipeline.dockers.scripts.payload - INFO -   Output directory: data/payload_payload_sample_output
2025-10-14 23:46:13 - buyer_abuse_mods_template.bap_example_pipeline.dockers.scripts.payload - INFO -   Working directory: data/mims_payload_work
2025-10-14 23:46:13 - buyer_abuse_mods_template.bap_example_pipeline.dockers.scripts.payload - INFO -   Payload sample directory: data/mims_payload_work/payload_sample
2025-10-14 23:46:13 - buyer_abuse_mods_template.bap_example_pipeline.dockers.scripts.payload - INFO - Looking for hyperparameters in model artifacts
2025-10-14 23:46:13 - buyer_abuse_mods_template.bap_example_pipeline.dockers.scripts.payloa

## Inference Test

SageMaker inference with pre-built containers requires an inference handler. In MIMS Registration, the inference script `xgboost_inference.py` is that handler.

It contains four functions:
- `model_fn(model_dir: str)` takes the model directory as input and returns the unpacked model artifacts. During inference, the `model.tar.gz` file is unpacked internally first. For this test, we unpack it manually.
- `input_fn(request_body: Union[str, bytes], request_content_type: str, context: Optional[Any] = None)` takes the raw request body (a string or bytes; in this test, bytes built with `BytesIO`) and parses it for the prediction task.
- `predict_fn(input_data: pd.DataFrame, model_artifacts: Dict[str, Any])` takes the output of `input_fn` together with the model artifacts returned by `model_fn`, runs inference, and returns the predictions.
- `output_fn(prediction_output: Union[np.ndarray, List, Dict[str, np.ndarray]], accept: str = CONTENT_TYPE_JSON)` takes the output of `predict_fn` and formats it as the response sent back to the calling service.
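
The four functions are called in a fixed order for each request. The sketch below shows that call chain with stand-in handlers so that it runs on its own. `run_inference` and the `stub_*` functions are hypothetical, and the stub bodies are placeholders that say nothing about what `xgboost_inference.py` does inside each function.

```python
import json
from typing import Any, Callable, Dict, List, Tuple


def run_inference(
    model_fn: Callable[[str], Any],
    input_fn: Callable[[bytes, str], Any],
    predict_fn: Callable[[Any, Any], Any],
    output_fn: Callable[[Any, str], Tuple[str, str]],
    model_dir: str,
    request_body: bytes,
    content_type: str = "text/csv",
    accept: str = "application/json",
) -> Tuple[str, str]:
    """Chain the four handler functions in request order."""
    # In a real container the model is loaded once at startup, not per request.
    model_artifacts = model_fn(model_dir)
    input_data = input_fn(request_body, content_type)
    prediction = predict_fn(input_data, model_artifacts)
    return output_fn(prediction, accept)


# Stand-in handlers (placeholders, not the real implementations).
def stub_model_fn(model_dir: str) -> Dict[str, float]:
    return {"weight": 2.0}


def stub_input_fn(body: bytes, content_type: str) -> List[float]:
    return [float(x) for x in body.decode("utf-8").split(",")]


def stub_predict_fn(data: List[float], artifacts: Dict[str, float]) -> List[float]:
    return [artifacts["weight"] * x for x in data]


def stub_output_fn(prediction: List[float], accept: str) -> Tuple[str, str]:
    return json.dumps({"predictions": prediction}), accept


response = run_inference(
    stub_model_fn,
    stub_input_fn,
    stub_predict_fn,
    stub_output_fn,
    model_dir="unused",
    request_body=b"1,2.5",
)
print(response)
```

The cells that follow call the real handlers one at a time instead, which makes it easier to inspect each intermediate result.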


In [86]:
# Add the parent directory of the processing module (./dockers) to sys.path
processing_root = str(Path('./dockers/processing').absolute().parent)
print(f"processing module {processing_root}")
if processing_root not in sys.path:
    sys.path.insert(0, processing_root)  
    print(f"add processing module {processing_root} into system")

processing module /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/bap_example_pipeline/dockers
add processing module /home/ec2-user/SageMaker/BuyerAbuseModsTemplate/src/buyer_abuse_mods_template/bap_example_pipeline/dockers into system


In [87]:
from buyer_abuse_mods_template.bap_example_pipeline.dockers.xgboost_inference import model_fn, input_fn, predict_fn, output_fn

### Unpack model artifacts locally

In [88]:
def unpack_model_artifacts(
    source_tar_gz: str,
    target_dir: str,
    cleanup: bool = True
) -> str:
    """
    Unpack model.tar.gz artifacts to target directory.
    
    Args:
        source_tar_gz (str): Path to model.tar.gz file
        target_dir (str): Directory to unpack artifacts to
        cleanup (bool): Whether to clean target directory before unpacking
        
    Returns:
        str: Path to unpacked artifacts directory
    """
    logger = logging.getLogger(__name__)
    
    try:
        # Convert to Path objects
        source_path = Path(source_tar_gz)
        target_path = Path(target_dir)
        
        # Validate source file
        if not source_path.exists():
            raise FileNotFoundError(f"Source file does not exist: {source_tar_gz}")
        
        if not source_path.is_file():
            raise ValueError(f"Source path is not a file: {source_tar_gz}")
        
        if not tarfile.is_tarfile(source_path):
            raise ValueError(f"Source file is not a valid tar file: {source_tar_gz}")
        
        # Clean up target directory if it exists and cleanup is True
        if cleanup and target_path.exists():
            logger.info(f"Cleaning up target directory: {target_dir}")
            shutil.rmtree(target_path)
        
        # Create target directory
        target_path.mkdir(parents=True, exist_ok=True)
        
        # Unpack tar.gz file
        logger.info(f"Unpacking {source_tar_gz} to {target_dir}")
        with tarfile.open(source_path, 'r:gz') as tar:
            # List contents before extraction
            logger.info("Tar contents:")
            for member in tar.getmembers():
                logger.info(f"- {member.name} ({member.size / 1024:.2f} KB)")
            
            # Extract all files
            tar.extractall(path=target_path)
        
        # Verify extraction
        extracted_files = list(target_path.glob('*'))
        if not extracted_files:
            raise RuntimeError(f"No files were extracted to {target_dir}")
        
        logger.info(f"Successfully unpacked {len(extracted_files)} files to {target_dir}")
        
        # Log extracted files and sizes
        logger.info("Extracted files:")
        for file_path in extracted_files:
            if file_path.is_file():
                size_kb = file_path.stat().st_size / 1024
                logger.info(f"- {file_path.name} ({size_kb:.2f} KB)")
        
        return str(target_path)
        
    except Exception as e:
        logger.error(f"Error unpacking model artifacts: {str(e)}")
        raise


In [89]:
try:
    # Define paths
    source_tar_gz = './data/package_packaged_model_output/model.tar.gz'
    target_dir = './data/inference_model_artifacts'
        
    # Unpack artifacts
    output_dir = unpack_model_artifacts(
        source_tar_gz=source_tar_gz,
        target_dir=target_dir,
        cleanup=True  # Clean target directory before unpacking
    )
        
    print(f"\nModel artifacts unpacked successfully!")
    print(f"Location: {output_dir}")
        
    # Print contents of target directory
    print("\nExtracted files:")
    for root, dirs, files in os.walk(output_dir):
        level = root.replace(output_dir, '').count(os.sep)
        indent = ' ' * 4 * level
        print(f"{indent}{os.path.basename(root)}/")
        subindent = ' ' * 4 * (level + 1)
        for file in files:
            file_path = os.path.join(root, file)
            size = os.path.getsize(file_path) / 1024  # KB
            print(f"{subindent}{file} ({size:.2f} KB)")
                
except Exception as e:
    print(f"Error: {str(e)}")

2025-10-15 04:03:24 - __main__ - INFO - Unpacking ./data/package_packaged_model_output/model.tar.gz to ./data/inference_model_artifacts
2025-10-15 04:03:24 - __main__ - INFO - Tar contents:
2025-10-15 04:03:24 - __main__ - INFO - - feature_columns.txt (4.89 KB)
2025-10-15 04:03:24 - __main__ - INFO - - xgboost_model.bst (1145.58 KB)
2025-10-15 04:03:24 - __main__ - INFO - - risk_table_map.pkl (17.70 KB)
2025-10-15 04:03:24 - __main__ - INFO - - hyperparameters.json (10.98 KB)
2025-10-15 04:03:24 - __main__ - INFO - - feature_importance.json (5.21 KB)
2025-10-15 04:03:24 - __main__ - INFO - - impute_dict.pkl (5.16 KB)
2025-10-15 04:03:24 - __main__ - INFO - - code/__init__.py (0.00 KB)
2025-10-15 04:03:24 - __main__ - INFO - - code/xgboost_training.py (28.44 KB)
2025-10-15 04:03:24 - __main__ - INFO - - code/xgboost_inference.py (35.83 KB)
2025-10-15 04:03:24 - __main__ - INFO - - code/__pycache__/__init__.cpython-310.pyc (0.21 KB)
2025-10-15 04:03:24 - __main__ - INFO - - code/__pycach

### Load Model

In [90]:
model_artifact_dict = model_fn(model_dir=target_dir)

2025-10-15 04:04:23,097 - INFO - Loading model from ./data/inference_model_artifacts
2025-10-15 04:04:23,098 - INFO - Found required file: xgboost_model.bst
2025-10-15 04:04:23,098 - INFO - Found required file: risk_table_map.pkl
2025-10-15 04:04:23,098 - INFO - Found required file: impute_dict.pkl
2025-10-15 04:04:23,099 - INFO - Found required file: feature_columns.txt
2025-10-15 04:04:23,109 - INFO - Loaded 64 ordered feature columns
2025-10-15 04:04:23,112 - INFO - Loading binary calibration model from ./data/inference_model_artifacts/calibration/calibration_model.pkl
2025-10-15 04:04:23,113 - INFO - Calibration model loaded successfully


In [91]:
model_artifact_dict

{'model': <xgboost.core.Booster at 0x7fc52a090cd0>,
 'risk_processors': {'PAYMETH': <processing.risk_table_processor.RiskTableMappingProcessor at 0x7fc525ab0d30>,
  'claim_reason': <processing.risk_table_processor.RiskTableMappingProcessor at 0x7fc525ab0e80>,
  'claimantInfo_status': <processing.risk_table_processor.RiskTableMappingProcessor at 0x7fc525ab0790>,
  'shipments_status': <processing.risk_table_processor.RiskTableMappingProcessor at 0x7fc525ab0d00>},
 'numerical_processor': <processing.numerical_imputation_processor.NumericalVariableImputationProcessor at 0x7fc525ab0910>,
 'feature_importance': {'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_solicit_count_last_365_days': 8.0,
  'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_claims_warn_count_last_365_days': 1.0,
  'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_solicit_count_last_365_days': 30.0,
  'Abuse.abuse_fap_action_by_customer_inline_transform_na.n_concession_warn_coun

In [114]:
from typing import List, Union

def load_feature_columns(feature_columns_file: Union[str, Path]) -> List[str]:
    """
    Load feature columns from file, ignoring comments and empty lines.
    
    Args:
        feature_columns_file (Union[str, Path]): Path to feature columns file
        
    Returns:
        List[str]: List of column names in order
    """
    feature_columns = []
    with open(feature_columns_file, 'r') as f:
        for line in f:
            # Skip comments and empty lines
            if line.strip() and not line.startswith('#'):
                # Split by comma and take the column name (index 1)
                _, column_name = line.strip().split(',', 1)
                feature_columns.append(column_name)
    return feature_columns

In [115]:
feature_columns = load_feature_columns(Path('./data/xgboost_training_model_output_raw') / 'feature_columns.txt' )

In [118]:
from io import BytesIO
def transform_row_to_bytes(
    csv_path: Union[str, Path],
    feature_columns_file: Union[str, Path],
    row_index: int = 0
) -> bytes:
    """
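    Transform one row of a CSV file into bytes using BytesIO.
    
    Args:
        csv_path (Union[str, Path]): Path to the CSV file
        feature_columns_file (Union[str, Path]): Path to the feature columns file
        row_index (int): Index of the row to transform (default: 0)
        
    Returns:
        bytes: The selected row as CSV-encoded bytes, without header or trailing newline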
    """
    # Load feature columns
    feature_columns = load_feature_columns(feature_columns_file)
    
    # Read CSV
    df = pd.read_csv(csv_path, index_col=None)
    
    # Validate all required columns exist
    missing_columns = set(feature_columns) - set(df.columns)
    if missing_columns:
        raise ValueError(f"Missing required columns: {missing_columns}")
    
    # Select only the required columns in the specified order
    row_data = df.iloc[row_index:row_index+1][feature_columns]
    
    # Serialize the row to CSV in an in-memory buffer and strip the trailing newline
    buffer = BytesIO()
    row_data.to_csv(buffer, header=False, index=False)
    value = buffer.getvalue().rstrip(b'\n')
    buffer.close()
    
    return value

### Load Input

In [119]:
request_body = transform_row_to_bytes((Path(tabular_preprocessing_calibration_local_dir) / 'calibration' / 'calibration_processed_data.csv'),
                        feature_columns_file=(Path('./data/xgboost_training_model_output_raw') / 'feature_columns.txt'), 
                        row_index=1)

In [120]:
request_body

b'0,0,0,0,591950,618857,618857,0.0,0.0,0.5,1,591950,2,591950,1,618857,1,2,1,1,26,2702.39,85,10,430.87,12,0.022569,1,60.99,1,0.0,0,0.0,0,0.0,0,0,0.0,0,1,31.29,1,0.0,0.072621,1,31.29,1,0.072621,10,1,0,0,0,0,0,3108.784357,152.4,1,1,1,CC,NOTR,Normal,DELIVERED'

In [121]:
input_data = input_fn(request_body, request_content_type="text/csv")

2025-10-15 04:25:54,418 - INFO - Received request with Content-Type: text/csv
2025-10-15 04:25:54,419 - INFO - Processing content type: text/csv
2025-10-15 04:25:54,424 - INFO - Successfully parsed CSV into DataFrame. Shape: (1, 64)


In [122]:
input_data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,54,55,56,57,58,59,60,61,62,63
0,0,0,0,0,591950,618857,618857,0.0,0.0,0.5,...,0,3108.784357,152.4,1,1,1,CC,NOTR,Normal,DELIVERED


### Predict

In [124]:
prediction_output = predict_fn(input_data, model_artifact_dict)

2025-10-15 04:26:50,092 - INFO - Applied calibration to predictions


In [127]:
prediction_output

{'raw_predictions': array([[9.9957627e-01, 4.2373402e-04]], dtype=float32),
 'calibrated_predictions': array([[0.9959391 , 0.00406094]], dtype=float32)}

### Output

In [125]:
response = output_fn(prediction_output, accept="application/json")

2025-10-15 04:27:25,382 - INFO - Received prediction output of type: <class 'dict'> for accept type: application/json


In [126]:
response

('{"predictions": [{"legacy-score": "0.0004237340181134641", "calibrated-score": "0.004060941748321056", "custom-output-label": "class-0"}]}',
 'application/json')

In [129]:
print(response)

('{"predictions": [{"legacy-score": "0.0004237340181134641", "calibrated-score": "0.004060941748321056", "custom-output-label": "class-0"}]}', 'application/json')
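
The handler returns a `(body, content_type)` tuple, and the scores inside the JSON body are encoded as strings. A small sketch of reading them back as floats follows. `parse_scores` is a hypothetical helper, and the sample body is copied from the output above.

```python
import json
from typing import Dict, List, Tuple


def parse_scores(response: Tuple[str, str]) -> List[Dict[str, object]]:
    """Decode a JSON response and convert the two score fields to floats."""
    body, content_type = response
    if content_type != "application/json":
        raise ValueError(f"unsupported content type: {content_type}")
    rows = []
    for item in json.loads(body)["predictions"]:
        rows.append(
            {
                "legacy_score": float(item["legacy-score"]),
                "calibrated_score": float(item["calibrated-score"]),
                "label": item["custom-output-label"],
            }
        )
    return rows


# Response copied from the notebook output above.
sample_response = (
    '{"predictions": [{"legacy-score": "0.0004237340181134641", '
    '"calibrated-score": "0.004060941748321056", '
    '"custom-output-label": "class-0"}]}',
    "application/json",
)
scores = parse_scores(sample_response)
print(scores)
```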
