# XGBoost Pipeline Test - Pipeline Configuration

This notebook defines the 3-step XGBoost test pipeline and generates the configuration files that each step reads.

**Pipeline Steps:**
1. XGBoost Training
2. XGBoost Model Evaluation
3. Model Calibration

**This notebook covers:**
- Pipeline definition (steps and edges)
- Step configuration generation
- Workspace setup validation
- Configuration file creation and validation

## 1. Setup and Imports

In [1]:
import os
import sys
import json
import pandas as pd
from datetime import datetime
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Add cursus to path
sys.path.append(str(Path.cwd().parent.parent.parent / 'src'))

# Import Cursus components
try:
    from cursus.steps.registry.step_names import STEP_NAMES
    print("✓ Successfully imported Cursus step registry")
    cursus_available = True
except ImportError as e:
    print(f"⚠ Import error: {e}")
    print("Using mock step names for testing...")
    cursus_available = False

print(f"Configuration setup started at {datetime.now()}")

sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/tianpeixie/Library/Application Support/sagemaker/config.yaml


2025-08-26 08:23:32,983 - pipeline_registry.builder_registry - INFO - Registered builder: BatchTransform -> BatchTransformStepBuilder
2025-08-26 08:23:32,984 - pipeline_registry.builder_registry - INFO - Registered builder: CurrencyConversion -> CurrencyConversionStepBuilder
2025-08-26 08:23:32,985 - pipeline_registry.builder_registry - INFO - Registered builder: DummyTraining -> DummyTrainingStepBuilder
2025-08-26 08:23:32,986 - pipeline_registry.builder_registry - INFO - Registered builder: ModelCalibration -> ModelCalibrationStepBuilder
2025-08-26 08:23:32,988 - pi

✓ Successfully imported Cursus step registry
Configuration setup started at 2025-08-26 08:23:33.015468


## 2. Directory Structure Validation

In [2]:
# Define directory structure (should match 01_setup_and_data_preparation.ipynb)
BASE_DIR = Path.cwd()
DATA_DIR = BASE_DIR / 'data'
CONFIG_DIR = BASE_DIR / 'configs'
OUTPUTS_DIR = BASE_DIR / 'outputs'
WORKSPACE_DIR = OUTPUTS_DIR / 'workspace'
LOGS_DIR = OUTPUTS_DIR / 'logs'
RESULTS_DIR = OUTPUTS_DIR / 'results'

# Validate directories exist, creating any that are missing
directories = [DATA_DIR, CONFIG_DIR, OUTPUTS_DIR, WORKSPACE_DIR, LOGS_DIR, RESULTS_DIR]
missing_dirs = []  # directories that had to be created during this run

for directory in directories:
    if directory.exists():
        print(f"✓ Directory exists: {directory}")
    else:
        print(f"⚠ Directory missing: {directory}")
        missing_dirs.append(directory)
        directory.mkdir(parents=True, exist_ok=True)
        print(f"✓ Created directory: {directory}")

# Check for data files
train_data_path = DATA_DIR / 'train_data.csv'
eval_data_path = DATA_DIR / 'eval_data.csv'
metadata_path = DATA_DIR / 'dataset_metadata.json'

data_files = [train_data_path, eval_data_path, metadata_path]
for data_file in data_files:
    if data_file.exists():
        print(f"✓ Data file exists: {data_file}")
    else:
        print(f"⚠ Data file missing: {data_file}")
        print("Please run 01_setup_and_data_preparation.ipynb first!")

print("\nDirectory validation completed!")

✓ Directory exists: /Users/tianpeixie/github_workspace/cursus/test/integration/runtime/data
✓ Directory exists: /Users/tianpeixie/github_workspace/cursus/test/integration/runtime/configs
✓ Directory exists: /Users/tianpeixie/github_workspace/cursus/test/integration/runtime/outputs
✓ Directory exists: /Users/tianpeixie/github_workspace/cursus/test/integration/runtime/outputs/workspace
✓ Directory exists: /Users/tianpeixie/github_workspace/cursus/test/integration/runtime/outputs/logs
✓ Directory exists: /Users/tianpeixie/github_workspace/cursus/test/integration/runtime/outputs/results
⚠ Data file missing: /Users/tianpeixie/github_workspace/cursus/test/integration/runtime/data/train_data.csv
Please run 01_setup_and_data_preparation.ipynb first!
⚠ Data file missing: /Users/tianpeixie/github_workspace/cursus/test/integration/runtime/data/eval_data.csv
Please run 01_setup_and_data_preparation.ipynb first!
⚠ Data file missing: /Users/tianpeixie/github_workspace/cursus/test/integration/runtime
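
The cell above only prints a warning for each missing data file and then carries on. A minimal sketch of a fail-fast alternative follows; the helper names `find_missing_files` and `require_files` are hypothetical and not part of Cursus, and the demonstration uses a temporary directory rather than the real `data/` folder.

```python
import tempfile
from pathlib import Path


def find_missing_files(paths):
    """Return the subset of paths that do not exist on disk."""
    return [Path(p) for p in paths if not Path(p).exists()]


def require_files(paths):
    """Raise FileNotFoundError listing every missing path (hypothetical helper)."""
    missing = find_missing_files(paths)
    if missing:
        listing = "\n".join(f"  - {p}" for p in missing)
        raise FileNotFoundError(f"Missing required data files:\n{listing}")


# Demonstration in a throwaway directory: one file present, one absent.
with tempfile.TemporaryDirectory() as tmp:
    demo_data_dir = Path(tmp)
    (demo_data_dir / "train_data.csv").write_text("feature_0,target\n0.1,1\n")
    demo_paths = [demo_data_dir / "train_data.csv", demo_data_dir / "eval_data.csv"]
    demo_missing = [p.name for p in find_missing_files(demo_paths)]

print(demo_missing)
```

Calling `require_files(data_files)` at the end of the validation cell would stop the notebook before any configuration is generated against data that is not there, at the cost of no longer being able to produce configs ahead of the data.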

## 3. Pipeline Definition

In [3]:
# Define the 3-step XGBoost pipeline
PIPELINE_STEPS = [
    'XGBoostTraining',
    'XGBoostModelEval', 
    'ModelCalibration'
]

# Define pipeline edges (dependencies)
PIPELINE_EDGES = [
    ('XGBoostTraining', 'XGBoostModelEval'),
    ('XGBoostModelEval', 'ModelCalibration')
]

# Pipeline metadata
PIPELINE_METADATA = {
    'name': 'XGBoost_3_Step_Pipeline',
    'description': 'End-to-end XGBoost pipeline with training, evaluation, and calibration',
    'version': '1.0.0',
    'steps': PIPELINE_STEPS,
    'edges': PIPELINE_EDGES,
    'created_at': datetime.now().isoformat()
}

print("PIPELINE DEFINITION")
print("=" * 40)
print(f"Pipeline Name: {PIPELINE_METADATA['name']}")
print(f"Steps: {len(PIPELINE_STEPS)}")
print(f"Edges: {len(PIPELINE_EDGES)}")
print("\nPipeline Flow:")
for i, step in enumerate(PIPELINE_STEPS):
    if i == 0:
        print(f"  {step}")
    else:
        print("    ↓")
        print(f"  {step}")

print("\nPipeline edges:")
for source, target in PIPELINE_EDGES:
    print(f"  {source} → {target}")

# Save pipeline definition
pipeline_def_path = CONFIG_DIR / 'pipeline_definition.json'
with open(pipeline_def_path, 'w') as f:
    json.dump(PIPELINE_METADATA, f, indent=2)

print(f"\n✓ Pipeline definition saved: {pipeline_def_path}")

PIPELINE DEFINITION
Pipeline Name: XGBoost_3_Step_Pipeline
Steps: 3
Edges: 2

Pipeline Flow:
  XGBoostTraining
    ↓
  XGBoostModelEval
    ↓
  ModelCalibration

Pipeline edges:
  XGBoostTraining → XGBoostModelEval
  XGBoostModelEval → ModelCalibration

✓ Pipeline definition saved: /Users/tianpeixie/github_workspace/cursus/test/integration/runtime/configs/pipeline_definition.json
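
The step list and the edge list are written by hand, so a typo in an edge or an accidental cycle would go unnoticed until execution. The sketch below checks both with the standard library's `graphlib` (Python 3.9+). The function name `check_pipeline_graph` is hypothetical, and the steps and edges are repeated here only to keep the block self-contained.

```python
from graphlib import CycleError, TopologicalSorter


def check_pipeline_graph(steps, edges):
    """Return the steps in a valid execution order, or raise ValueError."""
    known = set(steps)
    unknown = sorted({name for edge in edges for name in edge if name not in known})
    if unknown:
        raise ValueError(f"Edges reference undeclared steps: {unknown}")

    # TopologicalSorter expects a mapping of node -> set of predecessors.
    predecessors = {step: set() for step in steps}
    for source, target in edges:
        predecessors[target].add(source)

    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError as exc:
        # The detected cycle is the second element of the exception args.
        raise ValueError(f"Pipeline edges contain a cycle: {exc.args[1]}") from exc


demo_steps = ["XGBoostTraining", "XGBoostModelEval", "ModelCalibration"]
demo_edges = [
    ("XGBoostTraining", "XGBoostModelEval"),
    ("XGBoostModelEval", "ModelCalibration"),
]
execution_order = check_pipeline_graph(demo_steps, demo_edges)
print(execution_order)
```

For a linear chain like this one the topological order is unique, so it matches the declared step order. With branching pipelines several orders can be valid, and the result should be treated as one of them.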


## 4. Step Configuration Generation

In [4]:
def create_step_configurations():
    """
    Create configuration files for each pipeline step.
    
    Returns:
        dict: Dictionary of step configurations
    """
    
    print("CREATING STEP CONFIGURATIONS")
    print("=" * 40)
    
    # XGBoost Training Configuration
    xgboost_training_config = {
        "step_name": "XGBoostTraining",
        "step_type": "training",
        "description": "Train XGBoost model on synthetic dataset",
        "input_data_path": str(DATA_DIR / 'train_data.csv'),
        "output_model_path": str(WORKSPACE_DIR / 'xgboost_model.pkl'),
        "output_model_metadata_path": str(WORKSPACE_DIR / 'xgboost_model_metadata.json'),
        "hyperparameters": {
            "n_estimators": 100,
            "max_depth": 6,
            "learning_rate": 0.1,
            "subsample": 0.8,
            "colsample_bytree": 0.8,
            "random_state": 42,
            "objective": "binary:logistic",
            "eval_metric": "logloss"
        },
        "target_column": "target",
        "feature_columns": [f"feature_{i}" for i in range(10)],
        "validation_split": 0.2,
        "early_stopping_rounds": 10
    }
    
    # XGBoost Model Evaluation Configuration
    xgboost_eval_config = {
        "step_name": "XGBoostModelEval",
        "step_type": "evaluation",
        "description": "Evaluate trained XGBoost model on evaluation dataset",
        "model_path": str(WORKSPACE_DIR / 'xgboost_model.pkl'),
        "model_metadata_path": str(WORKSPACE_DIR / 'xgboost_model_metadata.json'),
        "eval_data_path": str(DATA_DIR / 'eval_data.csv'),
        "output_predictions_path": str(WORKSPACE_DIR / 'predictions.csv'),
        "output_metrics_path": str(WORKSPACE_DIR / 'eval_metrics.json'),
        "output_plots_dir": str(WORKSPACE_DIR / 'evaluation_plots'),
        "target_column": "target",
        "feature_columns": [f"feature_{i}" for i in range(10)],
        "metrics_to_compute": [
            "accuracy", "precision", "recall", "f1_score", 
            "auc_roc", "auc_pr", "log_loss"
        ],
        "probability_threshold": 0.5,
        "generate_plots": True
    }
    
    # Model Calibration Configuration
    calibration_config = {
        "step_name": "ModelCalibration",
        "step_type": "calibration",
        "description": "Calibrate model predictions using isotonic regression",
        "predictions_path": str(WORKSPACE_DIR / 'predictions.csv'),
        "eval_data_path": str(DATA_DIR / 'eval_data.csv'),
        "output_calibrated_model_path": str(WORKSPACE_DIR / 'calibrated_model.pkl'),
        "output_calibrated_predictions_path": str(WORKSPACE_DIR / 'calibrated_predictions.csv'),
        "output_calibration_metrics_path": str(WORKSPACE_DIR / 'calibration_metrics.json'),
        "output_calibration_plots_dir": str(WORKSPACE_DIR / 'calibration_plots'),
        "calibration_method": "isotonic",
        "target_column": "target",
        "cv_folds": 3,
        "generate_plots": True,
        "metrics_to_compute": [
            "brier_score", "calibration_error", "reliability_diagram"
        ]
    }
    
    # Collect all configurations
    configs = {
        'XGBoostTraining': xgboost_training_config,
        'XGBoostModelEval': xgboost_eval_config,
        'ModelCalibration': calibration_config
    }
    
    # Save individual configuration files
    for step_name, config in configs.items():
        config_path = CONFIG_DIR / f'{step_name.lower()}_config.json'
        with open(config_path, 'w') as f:
            json.dump(config, f, indent=2)
        print(f"✓ Created config: {config_path}")
    
    # Save combined configuration file
    combined_config = {
        'pipeline_metadata': PIPELINE_METADATA,
        'step_configurations': configs,
        'created_at': datetime.now().isoformat()
    }
    
    combined_config_path = CONFIG_DIR / 'pipeline_config.json'
    with open(combined_config_path, 'w') as f:
        json.dump(combined_config, f, indent=2)
    
    print(f"✓ Created combined config: {combined_config_path}")
    
    return configs

# Create step configurations
step_configs = create_step_configurations()
print("\nStep configurations created successfully!")

CREATING STEP CONFIGURATIONS
✓ Created config: /Users/tianpeixie/github_workspace/cursus/test/integration/runtime/configs/xgboosttraining_config.json
✓ Created config: /Users/tianpeixie/github_workspace/cursus/test/integration/runtime/configs/xgboostmodeleval_config.json
✓ Created config: /Users/tianpeixie/github_workspace/cursus/test/integration/runtime/configs/modelcalibration_config.json
✓ Created combined config: /Users/tianpeixie/github_workspace/cursus/test/integration/runtime/configs/pipeline_config.json

Step configurations created successfully!
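
The three configurations are linked only by matching path strings: the evaluation step's `model_path` must equal the training step's `output_model_path`, and the calibration step's `predictions_path` must equal the evaluation step's `output_predictions_path`. A small sketch of a wiring check is shown below. It assumes the notebook's naming convention that produced artifacts use keys prefixed with `output_`; the function name `unwired_edges` and the shortened demo configs are illustrative only.

```python
def unwired_edges(configs, edges):
    """Return edges whose target consumes none of the source's output paths."""
    problems = []
    for source, target in edges:
        produced = {
            value
            for key, value in configs[source].items()
            if key.startswith("output_") and isinstance(value, str)
        }
        consumed = {
            value
            for key, value in configs[target].items()
            if not key.startswith("output_") and isinstance(value, str)
        }
        if not produced & consumed:
            problems.append((source, target))
    return problems


# Reduced copies of the notebook's configs, with relative paths for brevity.
demo_configs = {
    "XGBoostTraining": {
        "input_data_path": "data/train_data.csv",
        "output_model_path": "workspace/xgboost_model.pkl",
    },
    "XGBoostModelEval": {
        "model_path": "workspace/xgboost_model.pkl",
        "eval_data_path": "data/eval_data.csv",
        "output_predictions_path": "workspace/predictions.csv",
    },
    "ModelCalibration": {
        "predictions_path": "workspace/predictions.csv",
        "eval_data_path": "data/eval_data.csv",
    },
}
demo_edges = [
    ("XGBoostTraining", "XGBoostModelEval"),
    ("XGBoostModelEval", "ModelCalibration"),
]
print(unwired_edges(demo_configs, demo_edges))
```

An empty list means every edge is backed by at least one shared path. The check compares strings only, so it catches renamed or mistyped files but says nothing about whether the upstream step actually writes them.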


## 5. Configuration Validation

In [5]:
def validate_step_configuration(step_name, config):
    """
    Validate a step configuration for completeness and correctness.
    
    Args:
        step_name: Name of the step
        config: Configuration dictionary
    
    Returns:
        tuple: (is_valid, validation_messages)
    """
    messages = []
    is_valid = True
    
    # Check required fields
    required_fields = ['step_name', 'step_type', 'description']
    for field in required_fields:
        if field not in config:
            messages.append(f"Missing required field: {field}")
            is_valid = False
        elif not config[field]:
            messages.append(f"Empty required field: {field}")
            is_valid = False
    
    # Check step-specific requirements
    if step_name == 'XGBoostTraining':
        training_required = ['input_data_path', 'output_model_path', 'hyperparameters', 'target_column']
        for field in training_required:
            if field not in config:
                messages.append(f"Training step missing: {field}")
                is_valid = False
        
        # Check if input data exists
        if 'input_data_path' in config:
            input_path = Path(config['input_data_path'])
            if not input_path.exists():
                messages.append(f"Input data file not found: {input_path}")
                is_valid = False
            else:
                messages.append(f"✓ Input data file exists: {input_path}")
    
    elif step_name == 'XGBoostModelEval':
        eval_required = ['model_path', 'eval_data_path', 'output_predictions_path', 'output_metrics_path']
        for field in eval_required:
            if field not in config:
                messages.append(f"Evaluation step missing: {field}")
                is_valid = False
    
    elif step_name == 'ModelCalibration':
        calib_required = ['predictions_path', 'eval_data_path', 'output_calibrated_model_path', 'calibration_method']
        for field in calib_required:
            if field not in config:
                messages.append(f"Calibration step missing: {field}")
                is_valid = False
    
    return is_valid, messages

# Validate all step configurations
print("CONFIGURATION VALIDATION")
print("=" * 40)

all_valid = True
for step_name, config in step_configs.items():
    print(f"\nValidating {step_name}:")
    is_valid, messages = validate_step_configuration(step_name, config)
    
    if is_valid:
        print(f"  ✓ {step_name} configuration is valid")
    else:
        print(f"  ✗ {step_name} configuration has issues:")
        all_valid = False
    
    for message in messages:
        if message.startswith('✓'):
            print(f"    {message}")
        else:
            print(f"    ⚠ {message}")

print(f"\n{'='*40}")
if all_valid:
    print("✓ All step configurations are valid!")
else:
    print("⚠ Some step configurations have issues. Please review and fix.")

print("\nConfiguration validation completed!")

CONFIGURATION VALIDATION

Validating XGBoostTraining:
  ✗ XGBoostTraining configuration has issues:
    ⚠ Input data file not found: /Users/tianpeixie/github_workspace/cursus/test/integration/runtime/data/train_data.csv

Validating XGBoostModelEval:
  ✓ XGBoostModelEval configuration is valid

Validating ModelCalibration:
  ✓ ModelCalibration configuration is valid

⚠ Some step configurations have issues. Please review and fix.

Configuration validation completed!
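
In the run above, `XGBoostModelEval` and `ModelCalibration` are reported as valid even though the directory check earlier flagged `eval_data.csv` as missing, because only the training branch of `validate_step_configuration` tests whether its input file exists. The sketch below shows one way to cover every step: it treats any `*_path` key without the `output_` prefix as an input, skips inputs that another step declares as an output, and reports the rest if they are not on disk. The function name `missing_external_inputs` and this key convention are assumptions for illustration.

```python
import tempfile
from pathlib import Path


def missing_external_inputs(configs):
    """Map step name -> input keys whose files are neither produced by a step nor on disk."""
    produced = {
        value
        for cfg in configs.values()
        for key, value in cfg.items()
        if key.startswith("output_") and key.endswith("_path")
    }
    missing = {}
    for step_name, cfg in configs.items():
        for key, value in cfg.items():
            is_input_path = key.endswith("_path") and not key.startswith("output_")
            if not is_input_path or value in produced:
                continue
            if not Path(value).exists():
                missing.setdefault(step_name, []).append(key)
    return missing


# Demonstration: training data exists, evaluation data does not.
with tempfile.TemporaryDirectory() as tmp:
    demo_data = Path(tmp) / "data"
    demo_workspace = Path(tmp) / "workspace"
    demo_data.mkdir()
    (demo_data / "train_data.csv").write_text("feature_0,target\n0.1,1\n")
    demo_configs = {
        "XGBoostTraining": {
            "input_data_path": str(demo_data / "train_data.csv"),
            "output_model_path": str(demo_workspace / "xgboost_model.pkl"),
        },
        "XGBoostModelEval": {
            "model_path": str(demo_workspace / "xgboost_model.pkl"),
            "eval_data_path": str(demo_data / "eval_data.csv"),
            "output_predictions_path": str(demo_workspace / "predictions.csv"),
        },
        "ModelCalibration": {
            "predictions_path": str(demo_workspace / "predictions.csv"),
            "eval_data_path": str(demo_data / "eval_data.csv"),
        },
    }
    demo_missing_inputs = missing_external_inputs(demo_configs)

print(demo_missing_inputs)
```

The model and predictions files are not reported because they are declared outputs of earlier steps and are expected to appear only once the pipeline runs.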


## 6. Configuration Summary

In [6]:
# Display configuration summary
print("CONFIGURATION SUMMARY")
print("=" * 50)

print(f"Pipeline: {PIPELINE_METADATA['name']}")
print(f"Version: {PIPELINE_METADATA['version']}")
print(f"Description: {PIPELINE_METADATA['description']}")
print(f"Created: {PIPELINE_METADATA['created_at']}")

print("\nStep Configurations:")
for step_name, config in step_configs.items():
    print(f"\n  {step_name}:")
    print(f"    Type: {config['step_type']}")
    print(f"    Description: {config['description']}")
    
    # Show key paths
    if 'input_data_path' in config:
        print(f"    Input: {Path(config['input_data_path']).name}")
    if 'output_model_path' in config:
        print(f"    Output: {Path(config['output_model_path']).name}")
    if 'output_predictions_path' in config:
        print(f"    Output: {Path(config['output_predictions_path']).name}")
    if 'output_calibrated_model_path' in config:
        print(f"    Output: {Path(config['output_calibrated_model_path']).name}")

print("\nConfiguration files created:")
config_files = list(CONFIG_DIR.glob('*.json'))
for config_file in sorted(config_files):
    print(f"  ✓ {config_file.name}")

print("\n" + "=" * 50)
print("PIPELINE CONFIGURATION COMPLETED")
print("=" * 50)
print("Ready for individual step testing!")
print("Next: Run 03_individual_step_testing.ipynb")

CONFIGURATION SUMMARY
Pipeline: XGBoost_3_Step_Pipeline
Version: 1.0.0
Description: End-to-end XGBoost pipeline with training, evaluation, and calibration
Created: 2025-08-26T08:23:40.868390

Step Configurations:

  XGBoostTraining:
    Type: training
    Description: Train XGBoost model on synthetic dataset
    Input: train_data.csv
    Output: xgboost_model.pkl

  XGBoostModelEval:
    Type: evaluation
    Description: Evaluate trained XGBoost model on evaluation dataset
    Output: predictions.csv

  ModelCalibration:
    Type: calibration
    Description: Calibrate model predictions using isotonic regression
    Output: calibrated_model.pkl

Configuration files created:
  ✓ modelcalibration_config.json
  ✓ pipeline_config.json
  ✓ pipeline_definition.json
  ✓ xgboostmodeleval_config.json
  ✓ xgboosttraining_config.json

PIPELINE CONFIGURATION COMPLETED
Ready for individual step testing!
Next: Run 03_individual_step_testing.ipynb
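
When a later notebook reads `pipeline_definition.json` back, the edges will not compare equal to `PIPELINE_EDGES`: JSON has no tuple type, so `json.dump` writes each `(source, target)` tuple as an array and `json.load` returns it as a list. A minimal loader sketch that restores the tuples is shown below; `load_pipeline_definition` is a hypothetical helper, and the round trip runs against a temporary file rather than the real `configs/` directory.

```python
import json
import tempfile
from pathlib import Path


def load_pipeline_definition(path):
    """Load a saved pipeline definition and convert its edges back to tuples."""
    with open(path) as f:
        definition = json.load(f)
    definition["edges"] = [tuple(edge) for edge in definition["edges"]]
    return definition


original_edges = [
    ("XGBoostTraining", "XGBoostModelEval"),
    ("XGBoostModelEval", "ModelCalibration"),
]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "pipeline_definition.json"
    with open(demo_path, "w") as f:
        json.dump({"name": "XGBoost_3_Step_Pipeline", "edges": original_edges}, f, indent=2)

    with open(demo_path) as f:
        raw_edges = json.load(f)["edges"]  # lists, not tuples
    restored = load_pipeline_definition(demo_path)

print(raw_edges == original_edges)
print(restored["edges"] == original_edges)
```

The first comparison is `False` and the second is `True`, which is why the conversion matters for any code that compares edges or uses them as dictionary keys.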
