# MLOps Pipeline Workflow & Team Notes

## Project Overview
This notebook implements an end-to-end MLOps data pipeline using the **Olist Brazilian E-Commerce dataset**. The goal was to demonstrate a production-style workflow covering data ingestion, cataloging, exploratory analysis, feature engineering, feature storage, and dataset splitting, all on AWS services chosen to keep cost low.

The pipeline was intentionally built step by step so that it mirrors production MLOps practice rather than reading as a one-off modeling notebook.

---

## High-Level Workflow

### 1. Raw Data Ingestion (S3 Data Lake)
- Created an S3 bucket to act as a data lake.
- Uploaded all **9 raw CSV files** from the Kaggle Olist dataset.
- Organized raw data under: s3:///raw/olist/ingest_date=YYYY-MM-DD/
- Each dataset was placed into its own subfolder to support Athena’s directory-based table requirements.
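
The folder layout described above can be sketched as a small key-building helper. This is a minimal illustration rather than the notebook's own code: the function name `dataset_key` is hypothetical, and the bucket is omitted because S3 keys are bucket-relative.

```python
from datetime import date


def dataset_key(ingest_date: date, filename: str) -> str:
    """Build the S3 key for one raw CSV, giving each dataset its own folder."""
    # Folder name = file name without the .csv extension.
    folder = filename[: -len(".csv")] if filename.endswith(".csv") else filename
    return f"raw/olist/ingest_date={ingest_date.isoformat()}/{folder}/{filename}"


key = dataset_key(date(2026, 1, 25), "olist_orders_dataset.csv")
print(key)
```

Keeping one CSV per folder means each Athena table can later point at a directory that contains only its own data.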

---

### 2. Data Cataloging & Querying (Athena)
- Created an Athena database (`olist_datalake`).
- Defined **external tables** for each dataset directly from JupyterLab (no Glue crawler required).
- Verified schemas and row counts using Athena queries.
- This step enabled SQL-based access to the data and served as the cataloging layer for downstream analysis.
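
To actually read row counts back, the rows of a finished query have to be parsed. The sketch below shows one way to turn a single page of an Athena result into plain dictionaries. The helper name `rows_from_athena_result` is hypothetical, and `sample` is a hand-written stand-in shaped like a `get_query_results` response, with placeholder values rather than real query output.

```python
from typing import Dict, List, Optional


def rows_from_athena_result(response: dict) -> List[Dict[str, Optional[str]]]:
    """Convert one page of query results into a list of {column: value} dicts."""
    rows = response["ResultSet"]["Rows"]
    if not rows:
        return []
    # For SELECT queries, the first row of the first page holds the column names.
    header = [cell.get("VarCharValue") for cell in rows[0]["Data"]]
    return [
        dict(zip(header, (cell.get("VarCharValue") for cell in row["Data"])))
        for row in rows[1:]
    ]


# Stand-in response; the values are placeholders, not real results.
sample = {
    "ResultSet": {
        "Rows": [
            {"Data": [{"VarCharValue": "order_status"}, {"VarCharValue": "c"}]},
            {"Data": [{"VarCharValue": "delivered"}, {"VarCharValue": "123"}]},
            {"Data": [{"VarCharValue": "shipped"}, {}]},  # empty cell = NULL
        ]
    }
}
print(rows_from_athena_result(sample))
```

With boto3, the input would come from `athena.get_query_results(QueryExecutionId=qid)`; larger results span several pages, which this sketch does not handle.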

---

### 3. Exploratory Data Analysis (SageMaker + Pandas)
- Loaded Athena tables into Pandas using `awswrangler`.
- Performed sanity checks on row counts and joins.
- Built an **order-level analytical view** by aggregating:
  - order items
  - payments
  - customer attributes
- Engineered a target variable (`is_late`) based on delivery vs. estimated delivery dates.
- Observed class imbalance (~8% late deliveries), motivating careful splitting and evaluation later.
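
The labeling rule for `is_late` can be illustrated without pandas. The following is a dependency-free sketch on made-up orders; it mirrors the rule that an order counts as late only when both dates are present and delivery came after the estimate.

```python
from datetime import datetime
from typing import Optional


def is_late(delivered: Optional[datetime], estimated: Optional[datetime]) -> int:
    """Return 1 only when both dates are known and delivery came after the estimate."""
    if delivered is None or estimated is None:
        return 0  # missing dates are labeled "not late"
    return int(delivered > estimated)


# Made-up orders for illustration: (delivered, estimated)
orders = [
    (datetime(2018, 1, 10), datetime(2018, 1, 15)),  # early
    (datetime(2018, 2, 20), datetime(2018, 2, 18)),  # late
    (None, datetime(2018, 3, 1)),                    # never delivered
    (datetime(2018, 4, 5), datetime(2018, 4, 5)),    # exactly on the estimate
]
labels = [is_late(d, e) for d, e in orders]
late_rate = sum(labels) / len(labels)
print(labels, late_rate)
```

One consequence of this rule is that undelivered orders fall into the "not late" class; filtering to delivered orders first would be a reasonable alternative.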

---

### 4. Feature Engineering
- Created leakage-safe, order-level features using only information available at purchase time:
  - pricing, freight, number of items/sellers
  - payment information
  - time-based features (day of week, hour of day)
  - customer state
- Maintained a **canonical feature dataset** for analysis and splitting.
- Created a **Feature Store–compatible version** with strict data types.
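
A simple guard can make the "leakage-safe" requirement explicit. The sketch below is an assumption-laden illustration: `POST_PURCHASE_COLUMNS` and `leaked_columns` are hypothetical names, and the set of post-purchase columns is an assumed selection from the Olist schema, not a list taken from the notebook.

```python
# Assumed set of columns that are only known after the purchase is made.
POST_PURCHASE_COLUMNS = {
    "order_status",
    "order_delivered_carrier_date",
    "order_delivered_customer_date",
    "review_score",
}

# Feature names used in the pipeline (label and identifiers excluded).
FEATURES = [
    "customer_state",
    "num_items",
    "total_price",
    "total_freight_value",
    "num_sellers",
    "payment_value",
    "payment_installments",
    "payment_type",
    "purchase_dow",
    "purchase_hour",
]


def leaked_columns(feature_names):
    """Return, sorted, any feature names that carry post-purchase information."""
    return sorted(set(feature_names) & POST_PURCHASE_COLUMNS)


print(leaked_columns(FEATURES))
print(leaked_columns(FEATURES + ["order_delivered_customer_date", "review_score"]))
```

Running such a check before writing features turns an accidental leak into an immediate, visible failure rather than an inflated metric later.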

---

### 5. SageMaker Feature Store (Offline Store)
- Created a **SageMaker Feature Group** (offline store only to control cost).
- Used `order_id` as the record identifier.
- Used a strictly formatted ISO-8601 `event_time` with UTC (`Z`) as the event time feature.
- Successfully ingested ~99k feature records into Feature Store.
- Offline store data is persisted in S3 for training and future reuse.
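
The `event_time` format can be checked before ingestion. This is a minimal sketch with hypothetical helper names (`to_event_time`, `is_valid_event_time`); it assumes that naive timestamps are treated as UTC, which is what appending `Z` to the purchase timestamp effectively does.

```python
import re
from datetime import datetime, timezone

# Matches strings such as 2017-10-02T10:56:33Z
EVENT_TIME_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$")


def to_event_time(ts: datetime) -> str:
    """Format a timestamp as an ISO-8601 string with a trailing Z."""
    if ts.tzinfo is not None:
        ts = ts.astimezone(timezone.utc)  # convert aware values to UTC first
    return ts.strftime("%Y-%m-%dT%H:%M:%SZ")  # naive values are assumed to be UTC


def is_valid_event_time(value: str) -> bool:
    """Check that a string has the strict format, including the Z suffix."""
    return EVENT_TIME_PATTERN.match(value) is not None


stamp = to_event_time(datetime(2017, 10, 2, 10, 56, 33))
print(stamp, is_valid_event_time(stamp), is_valid_event_time("2017-10-02 10:56:33"))
```

Validating every value up front is cheaper than discovering a malformed timestamp partway through a large ingestion run.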

---

### 6. Dataset Splitting (Time-Based)
- Performed a **time-based split** using `event_time` to avoid temporal leakage:
  - Train: ~40%
  - Validation: ~10%
  - Test: ~10%
  - Production reserve: ~40%
- Persisted each split as Parquet files to: s3:///splits/olist/features/version=v1/
- This mirrors a real production setup where recent data is reserved for inference.
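
The split boundaries can be sketched on a toy list of records. The function name `time_based_split` is hypothetical, and integer percentages are used here instead of float multiplication to keep the cut points exact; the records are synthetic.

```python
def time_based_split(records, cum_pct=(40, 50, 60)):
    """Sort by event_time and cut into train/val/test/prod at cumulative percentages."""
    ordered = sorted(records, key=lambda r: r["event_time"])
    n = len(ordered)
    train_end, val_end, test_end = (n * p // 100 for p in cum_pct)
    return {
        "train": ordered[:train_end],
        "val": ordered[train_end:val_end],
        "test": ordered[val_end:test_end],
        "prod": ordered[test_end:],  # most recent records, reserved for inference
    }


# Twenty synthetic records, one per day, deliberately in reverse order.
records = [{"event_time": f"2018-01-{day:02d}T00:00:00Z"} for day in range(20, 0, -1)]
splits = time_based_split(records)
print({name: len(rows) for name, rows in splits.items()})
```

Because ISO-8601 strings in a single fixed format sort chronologically, sorting on the string column is enough; every training record precedes every validation record in time.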

---

## Key Engineering Decisions & Lessons Learned

- **Athena LOCATION must point to directories**, not individual files.
- **Feature Store requires strict ISO-8601 timestamps with timezone** — missing the `Z` suffix causes ingestion failures.
- Maintaining two separate datasets is a best practice in real MLOps systems:
  - canonical feature data (analysis-friendly)
  - Feature Store–safe data (schema-restricted)
- Time-based splitting is critical to avoid data leakage in temporal datasets.
- Offline Feature Store provides the required functionality while minimizing cost.

---

## Cost Management Notes
- SageMaker compute was stopped immediately after completion.
- Feature Store **online store was intentionally disabled** to avoid ongoing charges.
- S3 storage costs are minimal and safe to keep until final submission.
- Cleanup (Feature Group deletion, S3 cleanup) should only be done **after submission**.

---

## For Teammates
If you need to re-run or extend this work:
1. Start at the Athena read step (no need to re-upload raw data).
2. Do **not** re-run ingestion unless changing the feature schema.
3. Always stop SageMaker compute when finished.

This notebook represents a complete MLOps data pipeline.

## Model Benchmark and Evaluation (Module 4)

### Baseline Model
As a benchmark, we implemented a simple heuristic model that always predicts an order will be delivered on time. This reflects the majority class in the dataset and establishes a lower bound for model performance.

Due to class imbalance (~86% of orders are not late), the baseline achieves high accuracy but fails to identify late deliveries, resulting in zero precision, recall, and F1-score.
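
The arithmetic behind this result can be worked through from confusion-matrix counts. The sketch below uses a hypothetical test set of 1,000 orders with 140 late ones (chosen to match the ~86% figure, not taken from the real data), and the helper name `binary_metrics` is illustrative.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall, and F1, returning 0.0 on empty denominators."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Always predicting "on time": no positives predicted, so tp = fp = 0.
baseline = binary_metrics(tp=0, fp=0, fn=140, tn=860)
print(baseline)
```

Accuracy equals the share of the majority class, while every metric that depends on true positives collapses to zero, which is why accuracy alone is a poor yardstick here.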

### First Iteration Model (XGBoost v1)
We trained a first-pass XGBoost binary classifier in Amazon SageMaker using a limited set of engineered features related to order size, payment behavior, and purchase timing.

The model was evaluated using SageMaker Batch Transform on the held-out test dataset. Batch Transform was selected over a real-time endpoint to minimize cost and ensure automatic resource cleanup.

### Results Summary
- The XGBoost model achieved an AUC of approximately **0.56**, only slightly above the 0.5 expected from random guessing, indicating that it learned a weak discriminative signal.
- Overall accuracy matched the baseline model due to class imbalance and use of a default classification threshold.
- Precision, recall, and F1-score remained low, highlighting the need for future improvements such as class weighting, threshold tuning, and feature expansion.
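
Two of the suggested improvements, class weighting and threshold tuning, can be sketched in a few lines. This is an illustration on made-up validation scores, not a result from the notebook; the helper names are hypothetical, and the negative-to-positive ratio is only a common starting point for XGBoost's `scale_pos_weight` hyperparameter.

```python
def scale_pos_weight(labels):
    """Ratio of negatives to positives, a common starting value for scale_pos_weight."""
    positives = sum(labels)
    return (len(labels) - positives) / positives


def f1_at(labels, probs, threshold):
    """F1 score when predicting positive for every probability >= threshold."""
    tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(labels, probs) if y == 0 and p >= threshold)
    fn = sum(1 for y, p in zip(labels, probs) if y == 1 and p < threshold)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_threshold(labels, probs, candidates):
    """Pick the candidate threshold with the highest F1."""
    return max(candidates, key=lambda t: f1_at(labels, probs, t))


# Made-up validation data: scores stay low, as is typical with a rare positive class.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.22, 0.30, 0.25, 0.35]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5]

print(scale_pos_weight(labels))
print(best_threshold(labels, probs, candidates), f1_at(labels, probs, 0.5))
```

On these toy scores the default 0.5 threshold predicts no late orders at all, while a lower threshold recovers both. The threshold should be chosen on the validation split and only then applied to the test split.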

### Key Takeaways
This iteration establishes a complete, cost-aware MLOps workflow including data ingestion, feature engineering, model training, evaluation, and deployment via batch inference. While performance improvements are needed, this version serves as a strong baseline for future model iterations and CI/CD integration in later modules.



In [None]:
import boto3

import sagemaker
bucket = sagemaker.Session().default_bucket()
prefix = "raw/olist/ingest_date=2026-01-25/"

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)

keys = [obj["Key"] for obj in resp.get("Contents", [])]
print("Found files:", len(keys))
for k in keys:
    print(k)

In [None]:
import boto3

results_prefix = "athena-results/"

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket=bucket, Prefix=results_prefix, MaxKeys=1)

if "Contents" in resp:
    print("✅ Athena results prefix exists:", f"s3://{bucket}/{results_prefix}")
else:
    # create a zero-byte object so the prefix exists
    s3.put_object(Bucket=bucket, Key=results_prefix)
    print("✅ Created Athena results prefix:", f"s3://{bucket}/{results_prefix}")

In [None]:
import time
import boto3

REGION = boto3.session.Session().region_name
athena = boto3.client("athena", region_name=REGION)

ATHENA_OUTPUT = f"s3://{bucket}/athena-results/"
DB = "olist_datalake"

def run_athena(sql: str, database: str = "default"):
    res = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": ATHENA_OUTPUT},
    )
    qid = res["QueryExecutionId"]
    while True:
        q = athena.get_query_execution(QueryExecutionId=qid)
        state = q["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if state != "SUCCEEDED":
        reason = q["QueryExecution"]["Status"].get("StateChangeReason", "Unknown")
        raise RuntimeError(f"Athena query failed: {state} - {reason}\nSQL:\n{sql}")
    return qid

run_athena(f"CREATE DATABASE IF NOT EXISTS {DB};", database="default")
print("✅ Database ready:", DB)

In [None]:
import boto3

base_prefix = "raw/olist/ingest_date=2026-01-25/"

files = [
    "olist_customers_dataset.csv",
    "olist_geolocation_dataset.csv",
    "olist_order_items_dataset.csv",
    "olist_order_payments_dataset.csv",
    "olist_order_reviews_dataset.csv",
    "olist_orders_dataset.csv",
    "olist_products_dataset.csv",
    "olist_sellers_dataset.csv",
    "product_category_name_translation.csv",
]

s3 = boto3.client("s3")

for f in files:
    src_key = base_prefix + f
    folder = f.replace(".csv", "")  # folder name = file name without .csv
    dst_key = f"{base_prefix}{folder}/{f}"
    
    # copy
    s3.copy_object(
        Bucket=bucket,
        CopySource={"Bucket": bucket, "Key": src_key},
        Key=dst_key
    )
    print("✅ Copied to:", dst_key)

print("\nDone. Next we’ll point Athena tables at these folders.")

In [None]:
resp = s3.list_objects_v2(Bucket=bucket, Prefix=base_prefix, MaxKeys=50)
for obj in resp.get("Contents", []):
    print(obj["Key"])

In [None]:
RAW_BASE = f"s3://{bucket}/{base_prefix}"

def create_csv_table(table_name: str, columns_ddl: str, folder_name: str):
    sql = f"""
    CREATE EXTERNAL TABLE IF NOT EXISTS {DB}.{table_name} (
      {columns_ddl}
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES (
      'separatorChar' = ',',
      'quoteChar'     = '\"',
      'escapeChar'    = '\\\\'
    )
    STORED AS TEXTFILE
    LOCATION '{RAW_BASE}{folder_name}/'
    TBLPROPERTIES ('skip.header.line.count'='1');
    """
    run_athena(sql, database=DB)
    print(f"✅ Created: {DB}.{table_name}")

create_csv_table(
    "olist_customers_dataset",
    """
    customer_id string,
    customer_unique_id string,
    customer_zip_code_prefix int,
    customer_city string,
    customer_state string
    """,
    "olist_customers_dataset"
)

create_csv_table(
    "olist_geolocation_dataset",
    """
    geolocation_zip_code_prefix int,
    geolocation_lat double,
    geolocation_lng double,
    geolocation_city string,
    geolocation_state string
    """,
    "olist_geolocation_dataset"
)

create_csv_table(
    "olist_order_items_dataset",
    """
    order_id string,
    order_item_id int,
    product_id string,
    seller_id string,
    shipping_limit_date string,
    price double,
    freight_value double
    """,
    "olist_order_items_dataset"
)

create_csv_table(
    "olist_order_payments_dataset",
    """
    order_id string,
    payment_sequential int,
    payment_type string,
    payment_installments int,
    payment_value double
    """,
    "olist_order_payments_dataset"
)

create_csv_table(
    "olist_order_reviews_dataset",
    """
    review_id string,
    order_id string,
    review_score int,
    review_comment_title string,
    review_comment_message string,
    review_creation_date string,
    review_answer_timestamp string
    """,
    "olist_order_reviews_dataset"
)

create_csv_table(
    "olist_orders_dataset",
    """
    order_id string,
    customer_id string,
    order_status string,
    order_purchase_timestamp string,
    order_approved_at string,
    order_delivered_carrier_date string,
    order_delivered_customer_date string,
    order_estimated_delivery_date string
    """,
    "olist_orders_dataset"
)

create_csv_table(
    "olist_products_dataset",
    """
    product_id string,
    product_category_name string,
    product_name_lenght int,
    product_description_lenght int,
    product_photos_qty int,
    product_weight_g int,
    product_length_cm int,
    product_height_cm int,
    product_width_cm int
    """,
    "olist_products_dataset"
)

create_csv_table(
    "olist_sellers_dataset",
    """
    seller_id string,
    seller_zip_code_prefix int,
    seller_city string,
    seller_state string
    """,
    "olist_sellers_dataset"
)

create_csv_table(
    "product_category_name_translation",
    """
    product_category_name string,
    product_category_name_english string
    """,
    "product_category_name_translation"
)

In [None]:
run_athena(f"SHOW TABLES IN {DB};", database=DB)
print("✅ SHOW TABLES succeeded")

run_athena(f"SELECT COUNT(*) FROM {DB}.olist_orders_dataset;", database=DB)
print("✅ COUNT orders succeeded")

run_athena(f"SELECT order_status, COUNT(*) c FROM {DB}.olist_orders_dataset GROUP BY 1 ORDER BY c DESC;", database=DB)
print("✅ GROUP BY order_status succeeded")

In [None]:
#6.1
import awswrangler as wr
import pandas as pd

DB = "olist_datalake"

orders = wr.athena.read_sql_query(
    sql=f"SELECT * FROM {DB}.olist_orders_dataset",
    database=DB,
    ctas_approach=False
)

order_items = wr.athena.read_sql_query(
    sql=f"SELECT * FROM {DB}.olist_order_items_dataset",
    database=DB,
    ctas_approach=False
)

payments = wr.athena.read_sql_query(
    sql=f"SELECT * FROM {DB}.olist_order_payments_dataset",
    database=DB,
    ctas_approach=False
)

customers = wr.athena.read_sql_query(
    sql=f"SELECT * FROM {DB}.olist_customers_dataset",
    database=DB,
    ctas_approach=False
)

print("orders:", orders.shape)
print("order_items:", order_items.shape)
print("payments:", payments.shape)
print("customers:", customers.shape)

In [None]:
#6.2
# Parse timestamps
timestamp_cols = [
    "order_purchase_timestamp",
    "order_approved_at",
    "order_delivered_carrier_date",
    "order_delivered_customer_date",
    "order_estimated_delivery_date",
]
for col in timestamp_cols:
    orders[col] = pd.to_datetime(orders[col], errors="coerce")

# Aggregations
items_agg = (
    order_items.groupby("order_id")
    .agg(
        num_items=("order_item_id", "count"),
        total_price=("price", "sum"),
        total_freight_value=("freight_value", "sum"),
        num_sellers=("seller_id", "nunique"),
    )
    .reset_index()
)

payments_agg = (
    payments.groupby("order_id")
    .agg(
        payment_value=("payment_value", "sum"),
        payment_installments=("payment_installments", "max"),
        payment_type=("payment_type", lambda x: x.value_counts().index[0]),
    )
    .reset_index()
)

eda_df = (
    orders
    .merge(items_agg, on="order_id", how="left")
    .merge(payments_agg, on="order_id", how="left")
    .merge(customers[["customer_id", "customer_state"]], on="customer_id", how="left")
)

# Time features
eda_df["purchase_dow"] = eda_df["order_purchase_timestamp"].dt.dayofweek
eda_df["purchase_hour"] = eda_df["order_purchase_timestamp"].dt.hour

# Label: late delivery
eda_df["is_late"] = (
    (eda_df["order_delivered_customer_date"].notna()) &
    (eda_df["order_estimated_delivery_date"].notna()) &
    (eda_df["order_delivered_customer_date"] > eda_df["order_estimated_delivery_date"])
).astype(int)

print("Orders rows:", len(orders))
print("EDA rows:", len(eda_df))
print("Row loss:", len(orders) - len(eda_df))
print("Late rate:\n", eda_df["is_late"].value_counts(normalize=True))

In [None]:
#7.0 + 7B
# Canonical features (with purchase timestamp)
feat = eda_df[[
    "order_id",
    "order_purchase_timestamp",
    "customer_state",
    "num_items",
    "total_price",
    "total_freight_value",
    "num_sellers",
    "payment_value",
    "payment_installments",
    "payment_type",
    "purchase_dow",
    "purchase_hour",
    "is_late"
]].copy()

feat["customer_state"] = feat["customer_state"].fillna("unknown").astype(str)
feat["payment_type"] = feat["payment_type"].fillna("unknown").astype(str)

for c in ["num_items", "num_sellers", "payment_installments", "purchase_dow", "purchase_hour", "is_late"]:
    feat[c] = feat[c].fillna(0).astype(int)

for c in ["total_price", "total_freight_value", "payment_value"]:
    feat[c] = feat[c].fillna(0.0).astype(float)

# Create strict ISO-8601 event time WITH timezone "Z"
feat["event_time"] = (
    pd.to_datetime(feat["order_purchase_timestamp"], errors="coerce")
    .dt.strftime("%Y-%m-%dT%H:%M:%SZ")
)

feat = feat.dropna(subset=["order_id", "event_time"]).reset_index(drop=True)

# FeatureStore-safe version (remove datetime64 column)
feat_fs = feat.drop(columns=["order_purchase_timestamp"]).copy()
feat_fs["event_time"] = feat_fs["event_time"].astype(str)

print("✅ feat shape:", feat.shape)
print("✅ feat_fs shape:", feat_fs.shape)
print("event_time sample:", feat_fs["event_time"].head().tolist())

In [None]:
#7.1
import boto3
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

bucket = "aai540-olist-mlops-chris-7f3k2p"
region = boto3.session.Session().region_name
sess = sagemaker.Session()
role = sagemaker.get_execution_role()

feature_group_name = "olist-order-features-v1"
offline_store_s3_uri = f"s3://{bucket}/feature-store/{feature_group_name}/"

fg = FeatureGroup(name=feature_group_name, sagemaker_session=sess)

print("Region:", region)
print("Role:", role)
print("Offline store URI:", offline_store_s3_uri)

In [None]:
#7.2
import time
from botocore.exceptions import ClientError

sm = boto3.client("sagemaker", region_name=region)

def feature_group_exists(name: str) -> bool:
    try:
        sm.describe_feature_group(FeatureGroupName=name)
        return True
    except ClientError as e:
        if "ResourceNotFound" in str(e):
            return False
        raise

def wait_for_fg_created(name: str, timeout_sec: int = 600, poll_sec: int = 10):
    start = time.time()
    while True:
        desc = sm.describe_feature_group(FeatureGroupName=name)
        status = desc.get("FeatureGroupStatus")
        offline_status = desc.get("OfflineStoreStatus", {}).get("Status", "UNKNOWN")
        print(f"Status={status}, OfflineStoreStatus={offline_status}")
        if status == "Created" and offline_status in ("Active", "UNKNOWN"):
            return desc
        if status in ("CreateFailed", "DeleteFailed"):
            raise RuntimeError(f"Feature Group failed with status={status}. Details: {desc}")
        if time.time() - start > timeout_sec:
            raise TimeoutError(f"Timed out waiting for Feature Group to be Created: {name}")
        time.sleep(poll_sec)

if feature_group_exists(feature_group_name):
    print(f"✅ Feature Group already exists: {feature_group_name}")
else:
    fg.load_feature_definitions(data_frame=feat_fs)
    fg.create(
        s3_uri=offline_store_s3_uri,
        record_identifier_name="order_id",
        event_time_feature_name="event_time",
        role_arn=role,
        enable_online_store=False,
    )
    print("⏳ Create request submitted")

wait_for_fg_created(feature_group_name)
print("✅ Feature Group ready")

In [None]:
#7.3
ingest_response = fg.ingest(data_frame=feat_fs, max_workers=2, wait=True)
print("✅ Ingest complete")

In [None]:
#7.4
import boto3

sm = boto3.client("sagemaker", region_name=region)
desc = sm.describe_feature_group(FeatureGroupName=feature_group_name)

print("FeatureGroupStatus:", desc["FeatureGroupStatus"])
print("OfflineStoreStatus:", desc["OfflineStoreStatus"]["Status"])
print("S3 Offline Store Uri:", desc["OfflineStoreConfig"]["S3StorageConfig"]["S3Uri"])

In [None]:
#8.0
import awswrangler as wr

bucket = "aai540-olist-mlops-chris-7f3k2p"

# Use feat_fs for splits (Feature Store compatible)
feat_sorted = feat_fs.sort_values("event_time").reset_index(drop=True)
n = len(feat_sorted)

train_end = int(n * 0.40)
val_end   = int(n * 0.50)
test_end  = int(n * 0.60)

train_df = feat_sorted.iloc[:train_end]
val_df   = feat_sorted.iloc[train_end:val_end]
test_df  = feat_sorted.iloc[val_end:test_end]
prod_df  = feat_sorted.iloc[test_end:]

print("✅ Split sizes")
print("train:", len(train_df))
print("val:  ", len(val_df))
print("test: ", len(test_df))
print("prod: ", len(prod_df))

split_base = f"s3://{bucket}/splits/olist/features/version=v1/"
wr.s3.to_parquet(train_df, f"{split_base}train/", dataset=True, mode="overwrite")
wr.s3.to_parquet(val_df,   f"{split_base}val/",   dataset=True, mode="overwrite")
wr.s3.to_parquet(test_df,  f"{split_base}test/",  dataset=True, mode="overwrite")
wr.s3.to_parquet(prod_df,  f"{split_base}prod/",  dataset=True, mode="overwrite")

print("✅ Wrote splits to:", split_base)

In [None]:
#9.0

In [None]:
import boto3

s3 = boto3.client("s3")
prefix = "splits/olist/features/version=v1/"

resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=50)
print("Objects found:", resp.get("KeyCount", 0))
for obj in resp.get("Contents", [])[:20]:
    print(obj["Key"])

In [None]:
#M4 Start
try:
    print(len(train))
except NameError:
    print("❌ Kernel is fresh — variables not loaded")

In [None]:
import awswrangler as wr
import pandas as pd
import numpy as np

bucket = "aai540-olist-mlops-chris-7f3k2p"
split_base = f"s3://{bucket}/splits/olist/features/version=v1/"

In [None]:
train = wr.s3.read_parquet(f"{split_base}train/")
val   = wr.s3.read_parquet(f"{split_base}val/")
test  = wr.s3.read_parquet(f"{split_base}test/")

print(train.shape, val.shape, test.shape)

In [None]:
#M4-1 Benchmark Model Baseline

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Ground truth
y_test = test["is_late"].astype(int)

# Baseline prediction: always predict NOT late
y_pred_baseline = pd.Series(0, index=test.index)

baseline_metrics = {
    "accuracy": accuracy_score(y_test, y_pred_baseline),
    "precision": precision_score(y_test, y_pred_baseline, zero_division=0),
    "recall": recall_score(y_test, y_pred_baseline, zero_division=0),
    "f1": f1_score(y_test, y_pred_baseline, zero_division=0),
}

baseline_metrics

In [None]:
#M4-2 1st Model Sage Maker

FEATURES = [
    "num_items",
    "total_price",
    "total_freight_value",
    "num_sellers",
    "payment_value",
    "payment_installments",
    "purchase_dow",
    "purchase_hour",
]

def to_xgb_matrix(df):
    out = df[["is_late"] + FEATURES].copy()
    for c in FEATURES:
        out[c] = pd.to_numeric(out[c], errors="coerce").fillna(0.0)
    out["is_late"] = out["is_late"].astype(int)
    return out

train_xgb = to_xgb_matrix(train)
val_xgb   = to_xgb_matrix(val)
test_xgb  = to_xgb_matrix(test)

train_xgb.head()

In [None]:
#M4-2-2
## NOTE: CSVs already written prior to training — do not rerun
xgb_prefix = f"s3://{bucket}/modeling/xgb_v1/"

train_csv_uri = f"{xgb_prefix}train/train.csv"
val_csv_uri   = f"{xgb_prefix}val/val.csv"
test_csv_uri  = f"{xgb_prefix}test/test.csv"

wr.s3.to_csv(train_xgb, train_csv_uri, index=False, header=False)
wr.s3.to_csv(val_xgb,   val_csv_uri,   index=False, header=False)
wr.s3.to_csv(test_xgb,  test_csv_uri,  index=False, header=False)

print("Train:", train_csv_uri)
print("Val:  ", val_csv_uri)
print("Test: ", test_csv_uri)

In [None]:
#M4-2-3 Train Model
# TRAINING CELL (DO NOT RERUN)
# This cell was executed once to train the initial XGBoost model.
# Re-running this cell will retrain the model and incur additional cost.
# The trained model is reused below via attachment for evaluation and deployment.

import boto3
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.image_uris import retrieve

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
region = boto3.session.Session().region_name

# Built-in XGBoost container
xgb_image = retrieve(
    framework="xgboost",
    region=region,
    version="1.7-1"
)

output_path = f"s3://{bucket}/modeling/xgb_v1/output/"

xgb = Estimator(
    image_uri=xgb_image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",  # budget-friendly
    output_path=output_path,
    sagemaker_session=sess,
)

# Simple, reasonable first-pass hyperparameters
xgb.set_hyperparameters(
    objective="binary:logistic",
    eval_metric="auc",
    num_round=100,
    max_depth=4,
    eta=0.2,
    subsample=0.8,
    colsample_bytree=0.8,
)

train_input = TrainingInput(train_csv_uri, content_type="text/csv")
val_input   = TrainingInput(val_csv_uri, content_type="text/csv")

#xgb.fit({
#    "train": train_input,
#    "validation": val_input
#})

In [None]:
# M4-2-3b Attach to Existing Trained Model (No Retraining)
import boto3, sagemaker
from sagemaker.estimator import Estimator
from sagemaker.image_uris import retrieve

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
region = boto3.session.Session().region_name
sm = boto3.client("sagemaker", region_name=region)

# Pick the most recent completed training job.
# NOTE: this takes the newest completed job of any kind in this account/region,
# so it assumes the XGBoost v1 job is the latest one.
jobs = sm.list_training_jobs(SortBy="CreationTime", SortOrder="Descending", MaxResults=20)["TrainingJobSummaries"]
training_job_name = next((j["TrainingJobName"] for j in jobs if j["TrainingJobStatus"] == "Completed"), None)
if training_job_name is None:
    raise RuntimeError("No completed training job found among the 20 most recent jobs")
print("✅ Using training job:", training_job_name)

# Attach to the finished job (no retraining). Estimator.attach rebuilds the
# estimator, including the model artifact location, from the job description.
xgb = Estimator.attach(training_job_name, sagemaker_session=sess)
print("✅ Attached. Model artifacts:", xgb.model_data)

In [None]:
# M4-2-2b Recreate CSV URIs

xgb_prefix = f"s3://{bucket}/modeling/xgb_v1/"

train_csv_uri = f"{xgb_prefix}train/train.csv"
val_csv_uri   = f"{xgb_prefix}val/val.csv"
test_csv_uri  = f"{xgb_prefix}test/test.csv"

print("Train:", train_csv_uri)
print("Val:  ", val_csv_uri)
print("Test: ", test_csv_uri)

In [None]:
# M4-3.0 Create inference-only TEST input (features only)
# Code developed using ChatGPT (ChatGPT, 2024) as a paired programmer.

# test_xgb currently has: is_late + 8 features
test_infer = test_xgb.drop(columns=["is_late"]).copy()

test_infer_csv_uri = f"{xgb_prefix}test/test_infer.csv"

# IMPORTANT: no header, no index
wr.s3.to_csv(test_infer, test_infer_csv_uri, index=False, header=False)

print("✅ Wrote inference CSV:", test_infer_csv_uri)
print("Shape (should be 9944 x 8):", test_infer.shape)

In [None]:
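# M4-3.1 Batch Transform on TEST (Evaluation) - v2 (features-only)
from sagemaker.transformer import Transformer

# `model_name` is not defined anywhere above. As an assumption, register a
# SageMaker model from the selected training job and reuse the job name as
# the model name (relies on `sm`, `role`, and `training_job_name` from M4-2-3b).
job_desc = sm.describe_training_job(TrainingJobName=training_job_name)
model_name = training_job_name
sm.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": job_desc["AlgorithmSpecification"]["TrainingImage"],
        "ModelDataUrl": job_desc["ModelArtifacts"]["S3ModelArtifacts"],
    },
)

test_transform_output_v2 = f"s3://{bucket}/modeling/xgb_v1/batch-out/test_v2/"

transformer = Transformer(
    model_name=model_name,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path=test_transform_output_v2,
    assemble_with="Line",
    accept="text/csv"
)

print("⏳ Starting batch transform (features-only input)...")
transformer.transform(
    data=test_infer_csv_uri,
    content_type="text/csv",
    split_type="Line"
)

transformer.wait()
print("✅ Batch transform finished:", test_transform_output_v2)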

In [None]:
# M4-3.2 Load predictions and evaluate
import boto3
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score

s3 = boto3.client("s3")

resp = s3.list_objects_v2(Bucket=bucket, Prefix="modeling/xgb_v1/batch-out/test_v2/")
keys = [o["Key"] for o in resp.get("Contents", [])]
out_files = [k for k in keys if k.endswith(".out")]

if not out_files:
    raise RuntimeError(f"No .out files found in test_v2 output. Keys seen: {keys[:10]}")

out_key = sorted(out_files)[-1]
out_uri = f"s3://{bucket}/{out_key}"
print("Reading predictions from:", out_uri)

pred_df = wr.s3.read_csv(out_uri, header=None)
y_prob = pred_df[0].astype(float).reset_index(drop=True)

y_true = test_xgb["is_late"].astype(int).reset_index(drop=True)
y_hat = (y_prob >= 0.5).astype(int)

model_metrics = {
    "auc": roc_auc_score(y_true, y_prob),
    "accuracy": accuracy_score(y_true, y_hat),
    "precision": precision_score(y_true, y_hat, zero_division=0),
    "recall": recall_score(y_true, y_hat, zero_division=0),
    "f1": f1_score(y_true, y_hat, zero_division=0),
}

comparison = pd.DataFrame(
    [baseline_metrics, model_metrics],
    index=["baseline_always_on_time", "xgb_v1"]
)

print("✅ Model metrics:", model_metrics)
comparison

In [None]:
import boto3

sm = boto3.client("sagemaker", region_name=region)
sm.delete_model(ModelName=model_name)
print("✅ Deleted model:", model_name)