# MLOps Pipeline Workflow & Team Notes

## Project Overview
This notebook implements an end-to-end MLOps data pipeline using the **Olist Brazilian E-Commerce dataset**. The goal was to demonstrate a production-style workflow that covers data ingestion, cataloging, exploratory analysis, feature engineering, feature storage, and dataset splitting — all using AWS services in a cost-efficient way.

The pipeline was intentionally built step-by-step to mirror MLOps practices rather than to serve as a one-off modeling notebook.

---

## High-Level Workflow

### 1. Raw Data Ingestion (S3 Data Lake)
- Created an S3 bucket to act as a data lake.
- Uploaded all **9 raw CSV files** from the Kaggle Olist dataset.
- Organized raw data under: `s3://<bucket>/raw/olist/ingest_date=YYYY-MM-DD/`
- Each dataset was placed into its own subfolder to support Athena’s directory-based table requirements.
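
The folder layout described above can be sketched as a small key-building helper. This is only an illustration: the function name `dataset_key` and its `root` default are assumptions, and no AWS calls are made.

```python
from pathlib import PurePosixPath


def dataset_key(filename: str, ingest_date: str, root: str = "raw/olist") -> str:
    """Build the S3 key that places one CSV inside its own per-dataset folder."""
    folder = PurePosixPath(filename).stem  # file name without the .csv suffix
    return f"{root}/ingest_date={ingest_date}/{folder}/{filename}"


print(dataset_key("olist_orders_dataset.csv", "2026-01-25"))
# raw/olist/ingest_date=2026-01-25/olist_orders_dataset/olist_orders_dataset.csv
```

Giving each file its own folder lets one Athena table `LOCATION` map to exactly one dataset.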

---

### 2. Data Cataloging & Querying (Athena)
- Created an Athena database (`olist_datalake`).
- Defined **external tables** for each dataset directly from JupyterLab (no Glue crawler required).
- Verified schemas and row counts using Athena queries.
- This step enabled SQL-based access to the data and served as the cataloging layer for downstream analysis.

---

### 3. Exploratory Data Analysis (SageMaker + Pandas)
- Loaded Athena tables into Pandas using `awswrangler`.
- Performed sanity checks on row counts and joins.
- Built an **order-level analytical view** by aggregating:
  - order items
  - payments
  - customer attributes
- Engineered a target variable (`is_late`) based on delivery vs. estimated delivery dates.
- Observed class imbalance (~8% late deliveries), motivating careful splitting and evaluation later.
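
As a minimal sketch of the labeling rule, the same logic can be written for a single order with the standard library. The function name is an assumption; the rule itself follows the notebook, where an order with a missing delivery date or estimate is counted as not late.

```python
from datetime import datetime
from typing import Optional


def is_late(delivered: Optional[datetime], estimated: Optional[datetime]) -> int:
    """Return 1 only when both dates are known and delivery came after the estimate."""
    if delivered is None or estimated is None:
        return 0  # undelivered or unestimated orders fall into the "not late" class
    return int(delivered > estimated)


print(is_late(datetime(2018, 3, 10), datetime(2018, 3, 8)))  # delivered two days after the estimate
print(is_late(None, datetime(2018, 3, 8)))                   # never delivered
```

One consequence of this rule is that cancelled or still-in-transit orders are counted in the negative class, which is worth keeping in mind when reading the late rate.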

---

### 4. Feature Engineering
- Created leakage-safe, order-level features using only information available at purchase time:
  - pricing, freight, number of items/sellers
  - payment information
  - time-based features (day of week, hour of day)
  - customer state
- Maintained a **canonical feature dataset** for analysis and splitting.
- Created a **Feature Store–compatible version** with strict data types.
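
The time-based features can be illustrated for one purchase timestamp. This is a standard-library sketch with an assumed function name; the notebook computes the same values column-wise with pandas.

```python
from datetime import datetime


def purchase_time_features(purchased_at: datetime) -> dict:
    """Derive features that are already known at the moment of purchase."""
    return {
        "purchase_dow": purchased_at.weekday(),  # Monday=0, same convention as pandas dt.dayofweek
        "purchase_hour": purchased_at.hour,
    }


print(purchase_time_features(datetime(2017, 10, 2, 10, 56, 33)))
# {'purchase_dow': 0, 'purchase_hour': 10}
```

Both values depend only on the purchase timestamp, so they cannot leak delivery information into the model.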

---

### 5. SageMaker Feature Store (Offline Store)
- Created a **SageMaker Feature Group** (offline store only to control cost).
- Used `order_id` as the record identifier.
- Used a strictly formatted ISO-8601 `event_time` with UTC (`Z`) as the event time feature.
- Successfully ingested ~99k feature records into Feature Store.
- Offline store data is persisted in S3 for training and future reuse.
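
A minimal sketch of the `event_time` formatting, plus a pre-ingestion format check. The helper names and the regular expression are assumptions for illustration. As in the notebook, the `Z` suffix labels the naive Olist timestamp as UTC without converting it.

```python
import re
from datetime import datetime

# Second-precision ISO-8601 with a literal "Z" suffix, the format used in this notebook.
EVENT_TIME_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")


def to_event_time(ts: datetime) -> str:
    """Format a timestamp as a strict ISO-8601 string ending in 'Z'."""
    return ts.strftime("%Y-%m-%dT%H:%M:%SZ")


def is_valid_event_time(value: str) -> bool:
    """Check the whole string against the expected format before ingestion."""
    return EVENT_TIME_PATTERN.fullmatch(value) is not None


print(to_event_time(datetime(2017, 10, 2, 10, 56, 33)))  # 2017-10-02T10:56:33Z
print(is_valid_event_time("2017-10-02 10:56:33"))        # False
```

Running a check like this over the column before `ingest` is cheaper than discovering a malformed value partway through ingestion.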

---

### 6. Dataset Splitting (Time-Based)
- Performed a **time-based split** using `event_time` to avoid temporal leakage:
  - Train: ~40%
  - Validation: ~10%
  - Test: ~10%
  - Production reserve: ~40%
- Persisted each split as Parquet files to: `s3://<bucket>/splits/olist/features/version=v1/`
- This mirrors a real production setup where recent data is reserved for inference.
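
The split boundaries can be sketched as plain index arithmetic over the time-sorted rows. The function name is an assumption; the cumulative cut points (0.40, 0.50, 0.60) are the ones used in the notebook.

```python
def split_sizes(n: int, cuts=(0.40, 0.50, 0.60)) -> tuple:
    """Return (train, val, test, prod) row counts for n time-sorted rows."""
    train_end, val_end, test_end = (int(n * c) for c in cuts)
    return (train_end, val_end - train_end, test_end - val_end, n - test_end)


print(split_sizes(99441))
# (39776, 9944, 9944, 39777)
```

Because `int()` truncates, any leftover row ends up in the final (production) slice, which is why that slice has one more row than the training slice.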

---

## Key Engineering Decisions & Lessons Learned

- **Athena LOCATION must point to directories**, not individual files.
- **Feature Store requires strict ISO-8601 timestamps with timezone** — missing the `Z` suffix causes ingestion failures.
- Maintaining two separate datasets is a best practice in real MLOps systems:
  - canonical feature data (analysis-friendly)
  - Feature Store–safe data (schema-restricted)
- Time-based splitting is critical to avoid data leakage in temporal datasets.
- Offline Feature Store provides the required functionality while minimizing cost.

---

## Cost Management Notes
- SageMaker compute was stopped immediately after completion.
- Feature Store **online store was intentionally disabled** to avoid ongoing charges.
- S3 storage costs are minimal, so the stored data is safe to keep until final submission.
- Cleanup (Feature Group deletion, S3 cleanup) should only be done **after submission**.

---

## For Teammates
If you need to re-run or extend this work:
1. Start at the Athena read step (no need to re-upload raw data).
2. Do **not** re-run ingestion unless changing the feature schema.
3. Always stop SageMaker compute when finished.

This notebook represents a complete MLOps data pipeline.

## Model Benchmark and Evaluation (Module 4)

### Baseline Model
As a benchmark, we implemented a simple heuristic model that always predicts an order will be delivered on time. This reflects the majority class in the dataset and establishes a lower bound for model performance.

Because of class imbalance (~86% of orders in the held-out test split are not late, compared with ~92% across the full dataset), the baseline achieves high accuracy but never identifies a late delivery, resulting in zero precision, recall, and F1-score.
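
To make the baseline's behavior concrete, here is a small standard-library sketch of the four metrics, using the convention that an undefined ratio is reported as 0 (as with `zero_division=0` in scikit-learn). The labels below are made up for illustration and are not the project data.

```python
def binary_metrics(y_true, y_pred) -> dict:
    """Accuracy, precision, recall and F1 for 0/1 labels; undefined ratios become 0.0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": correct / len(y_true), "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 86 on-time orders and 14 late orders.
y_true = [0] * 86 + [1] * 14
y_pred = [0] * len(y_true)  # the baseline always predicts "not late"
print(binary_metrics(y_true, y_pred))
```

With no positive predictions there are no true positives, so recall and F1 are zero no matter how high the accuracy is.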

### First Iteration Model (XGBoost v1)
We trained a first-pass XGBoost binary classifier in Amazon SageMaker using a limited set of engineered features related to order size, payment behavior, and purchase timing.

The model was evaluated using SageMaker Batch Transform on the held-out test dataset. Batch Transform was selected over a real-time endpoint to minimize cost and ensure automatic resource cleanup.

### Results Summary
- The XGBoost model achieved an AUC of approximately **0.56**, only modestly above the 0.5 expected from random ranking, indicating that it learned a weak discriminative signal.
- Overall accuracy matched the baseline model due to class imbalance and use of a default classification threshold.
- Precision, recall, and F1-score remained low, highlighting the need for future improvements such as class weighting, threshold tuning, and feature expansion.
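
Two of the improvements listed above, class weighting and threshold tuning, can be sketched as follows. This is an illustrative outline, not the project's implementation: the helper names, scores, and candidate thresholds are assumptions, and in practice the threshold would be chosen on the validation split rather than the test split. The negative-to-positive ratio is a common starting value for XGBoost's `scale_pos_weight` hyperparameter.

```python
def scale_pos_weight(labels) -> float:
    """Ratio of negative to positive examples."""
    pos = sum(labels)
    if pos == 0:
        raise ValueError("no positive examples")
    return (len(labels) - pos) / pos


def f1_at_threshold(labels, scores, threshold: float) -> float:
    """F1 when scores at or above the threshold are predicted positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(labels, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(labels, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(labels, preds) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(labels, scores, candidates) -> float:
    """Pick the candidate threshold with the highest F1."""
    return max(candidates, key=lambda t: f1_at_threshold(labels, scores, t))


# Hypothetical validation labels and predicted probabilities.
labels = [0, 0, 0, 0, 1, 0, 1, 0, 0, 1]
scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.25, 0.30, 0.08, 0.15, 0.35]
print(f1_at_threshold(labels, scores, 0.5))               # 0.0: nothing reaches the default cut-off
print(best_threshold(labels, scores, [0.1, 0.2, 0.3, 0.5]))  # 0.3
```

When predicted probabilities for the rare class rarely exceed 0.5, the default threshold reproduces the always-on-time baseline, which is consistent with the accuracy match noted above.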

### Key Takeaways
This iteration establishes a complete, cost-aware MLOps workflow including data ingestion, feature engineering, model training, evaluation, and deployment via batch inference. While performance improvements are needed, this version serves as a strong baseline for future model iterations and CI/CD integration in later modules.

## Monitoring, Evaluation, and Reporting Summary (Module 5)

### Model Benchmarking and Evaluation
We established a simple benchmark model that always predicts orders as on-time. While this baseline achieved high accuracy due to class imbalance, it had zero recall for late deliveries. We then trained an XGBoost classifier using a small, carefully selected feature set. Model performance was evaluated using batch inference, and metrics such as accuracy and AUC were compared against the benchmark to establish a minimum viable improvement baseline.

### Data Monitoring
Data monitoring was implemented using SageMaker Batch Transform with batch data capture enabled. Production inference inputs and corresponding model outputs were automatically captured to S3. Baseline statistics and constraints were generated using SageMaker Model Monitor and stored in S3. These artifacts enable offline drift detection and future scheduled monitoring without requiring a persistent real-time endpoint, aligning with cost-efficient MLOps best practices.
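
As a rough illustration of what offline drift detection against a stored baseline could look like, the sketch below flags features whose captured mean has moved far from the baseline mean. It is a deliberately simplified stand-in, not the check that SageMaker Model Monitor performs; the function name, the `max_shift` threshold, and all numbers are assumptions.

```python
from statistics import fmean


def drifted_features(baseline: dict, captured: dict, max_shift: float = 3.0) -> list:
    """Flag features whose captured mean is more than max_shift baseline std devs away.

    baseline maps feature name -> (mean, std_dev); captured maps feature name -> list of values.
    """
    flagged = []
    for name, (mean, std) in baseline.items():
        values = captured.get(name, [])
        if not values or std == 0:
            continue  # nothing to compare, or the shift is undefined
        shift = abs(fmean(values) - mean) / std
        if shift > max_shift:
            flagged.append(name)
    return flagged


# Hypothetical baseline statistics and captured inference inputs.
baseline = {"total_price": (120.0, 50.0), "purchase_hour": (14.0, 5.0)}
captured = {"total_price": [400.0, 380.0, 420.0], "purchase_hour": [13, 15, 14]}
print(drifted_features(baseline, captured))  # ['total_price']
```

A check of this kind can run as a scheduled batch job over the captured S3 data, which fits the endpoint-free setup described above.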

### Infrastructure Monitoring
Infrastructure-level monitoring was implemented using Amazon CloudWatch. Metrics for SageMaker training jobs, batch transform jobs, and processing jobs were collected automatically. A centralized CloudWatch dashboard was created to visualize job durations and failure counts, providing operational visibility into the ML system.

### Reports and Artifacts
The system produces the following monitoring artifacts:
- Model training metrics in CloudWatch Logs
- Baseline statistics and constraints in S3
- Captured production input and output data in S3
- CloudWatch dashboard for infrastructure monitoring

This architecture supports scalable monitoring, auditability, and future CI/CD integration while remaining within budget constraints.

In [1]:
import boto3

bucket = "aai540-olist-mlops-chris-7f3k2p"
prefix = "raw/olist/ingest_date=2026-01-25/"

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)

keys = [obj["Key"] for obj in resp.get("Contents", [])]
print("Found files:", len(keys))
for k in keys:
    print(k)

Found files: 10
raw/olist/ingest_date=2026-01-25/
raw/olist/ingest_date=2026-01-25/olist_customers_dataset.csv
raw/olist/ingest_date=2026-01-25/olist_geolocation_dataset.csv
raw/olist/ingest_date=2026-01-25/olist_order_items_dataset.csv
raw/olist/ingest_date=2026-01-25/olist_order_payments_dataset.csv
raw/olist/ingest_date=2026-01-25/olist_order_reviews_dataset.csv
raw/olist/ingest_date=2026-01-25/olist_orders_dataset.csv
raw/olist/ingest_date=2026-01-25/olist_products_dataset.csv
raw/olist/ingest_date=2026-01-25/olist_sellers_dataset.csv
raw/olist/ingest_date=2026-01-25/product_category_name_translation.csv


In [2]:
import boto3

bucket = "aai540-olist-mlops-chris-7f3k2p"
results_prefix = "athena-results/"

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket=bucket, Prefix=results_prefix, MaxKeys=1)

if "Contents" in resp:
    print("✅ Athena results prefix exists:", f"s3://{bucket}/{results_prefix}")
else:
    # create a zero-byte object so the prefix exists
    s3.put_object(Bucket=bucket, Key=results_prefix)
    print("✅ Created Athena results prefix:", f"s3://{bucket}/{results_prefix}")

✅ Athena results prefix exists: s3://aai540-olist-mlops-chris-7f3k2p/athena-results/


In [3]:
import time
import boto3

bucket = "aai540-olist-mlops-chris-7f3k2p"
REGION = boto3.session.Session().region_name
athena = boto3.client("athena", region_name=REGION)

ATHENA_OUTPUT = f"s3://{bucket}/athena-results/"
DB = "olist_datalake"

def run_athena(sql: str, database: str = "default"):
    res = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": ATHENA_OUTPUT},
    )
    qid = res["QueryExecutionId"]
    while True:
        q = athena.get_query_execution(QueryExecutionId=qid)
        state = q["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if state != "SUCCEEDED":
        reason = q["QueryExecution"]["Status"].get("StateChangeReason", "Unknown")
        raise RuntimeError(f"Athena query failed: {state} - {reason}\nSQL:\n{sql}")
    return qid

run_athena(f"CREATE DATABASE IF NOT EXISTS {DB};", database="default")
print("✅ Database ready:", DB)

✅ Database ready: olist_datalake


In [5]:
import boto3

bucket = "aai540-olist-mlops-chris-7f3k2p"
base_prefix = "raw/olist/ingest_date=2026-01-25/"

files = [
    "olist_customers_dataset.csv",
    "olist_geolocation_dataset.csv",
    "olist_order_items_dataset.csv",
    "olist_order_payments_dataset.csv",
    "olist_order_reviews_dataset.csv",
    "olist_orders_dataset.csv",
    "olist_products_dataset.csv",
    "olist_sellers_dataset.csv",
    "product_category_name_translation.csv",
]

s3 = boto3.client("s3")

for f in files:
    src_key = base_prefix + f
    folder = f.replace(".csv", "")  # folder name = file name without .csv
    dst_key = f"{base_prefix}{folder}/{f}"
    
    # copy
    s3.copy_object(
        Bucket=bucket,
        CopySource={"Bucket": bucket, "Key": src_key},
        Key=dst_key
    )
    print("✅ Copied to:", dst_key)

print("\nDone. Next we’ll point Athena tables at these folders.")

✅ Copied to: raw/olist/ingest_date=2026-01-25/olist_customers_dataset/olist_customers_dataset.csv
✅ Copied to: raw/olist/ingest_date=2026-01-25/olist_geolocation_dataset/olist_geolocation_dataset.csv
✅ Copied to: raw/olist/ingest_date=2026-01-25/olist_order_items_dataset/olist_order_items_dataset.csv
✅ Copied to: raw/olist/ingest_date=2026-01-25/olist_order_payments_dataset/olist_order_payments_dataset.csv
✅ Copied to: raw/olist/ingest_date=2026-01-25/olist_order_reviews_dataset/olist_order_reviews_dataset.csv
✅ Copied to: raw/olist/ingest_date=2026-01-25/olist_orders_dataset/olist_orders_dataset.csv
✅ Copied to: raw/olist/ingest_date=2026-01-25/olist_products_dataset/olist_products_dataset.csv
✅ Copied to: raw/olist/ingest_date=2026-01-25/olist_sellers_dataset/olist_sellers_dataset.csv
✅ Copied to: raw/olist/ingest_date=2026-01-25/product_category_name_translation/product_category_name_translation.csv

Done. Next we’ll point Athena tables at these folders.


In [6]:
resp = s3.list_objects_v2(Bucket=bucket, Prefix=base_prefix, MaxKeys=50)
for obj in resp.get("Contents", []):
    print(obj["Key"])

raw/olist/ingest_date=2026-01-25/
raw/olist/ingest_date=2026-01-25/olist_customers_dataset.csv
raw/olist/ingest_date=2026-01-25/olist_customers_dataset/olist_customers_dataset.csv
raw/olist/ingest_date=2026-01-25/olist_geolocation_dataset.csv
raw/olist/ingest_date=2026-01-25/olist_geolocation_dataset/olist_geolocation_dataset.csv
raw/olist/ingest_date=2026-01-25/olist_order_items_dataset.csv
raw/olist/ingest_date=2026-01-25/olist_order_items_dataset/olist_order_items_dataset.csv
raw/olist/ingest_date=2026-01-25/olist_order_payments_dataset.csv
raw/olist/ingest_date=2026-01-25/olist_order_payments_dataset/olist_order_payments_dataset.csv
raw/olist/ingest_date=2026-01-25/olist_order_reviews_dataset.csv
raw/olist/ingest_date=2026-01-25/olist_order_reviews_dataset/olist_order_reviews_dataset.csv
raw/olist/ingest_date=2026-01-25/olist_orders_dataset.csv
raw/olist/ingest_date=2026-01-25/olist_orders_dataset/olist_orders_dataset.csv
raw/olist/ingest_date=2026-01-25/olist_products_dataset.csv


In [7]:
RAW_BASE = f"s3://{bucket}/{base_prefix}"

def create_csv_table(table_name: str, columns_ddl: str, folder_name: str):
    sql = f"""
    CREATE EXTERNAL TABLE IF NOT EXISTS {DB}.{table_name} (
      {columns_ddl}
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES (
      'separatorChar' = ',',
      'quoteChar'     = '\"',
      'escapeChar'    = '\\\\'
    )
    STORED AS TEXTFILE
    LOCATION '{RAW_BASE}{folder_name}/'
    TBLPROPERTIES ('skip.header.line.count'='1');
    """
    run_athena(sql, database=DB)
    print(f"✅ Created: {DB}.{table_name}")

create_csv_table(
    "olist_customers_dataset",
    """
    customer_id string,
    customer_unique_id string,
    customer_zip_code_prefix int,
    customer_city string,
    customer_state string
    """,
    "olist_customers_dataset"
)

create_csv_table(
    "olist_geolocation_dataset",
    """
    geolocation_zip_code_prefix int,
    geolocation_lat double,
    geolocation_lng double,
    geolocation_city string,
    geolocation_state string
    """,
    "olist_geolocation_dataset"
)

create_csv_table(
    "olist_order_items_dataset",
    """
    order_id string,
    order_item_id int,
    product_id string,
    seller_id string,
    shipping_limit_date string,
    price double,
    freight_value double
    """,
    "olist_order_items_dataset"
)

create_csv_table(
    "olist_order_payments_dataset",
    """
    order_id string,
    payment_sequential int,
    payment_type string,
    payment_installments int,
    payment_value double
    """,
    "olist_order_payments_dataset"
)

create_csv_table(
    "olist_order_reviews_dataset",
    """
    review_id string,
    order_id string,
    review_score int,
    review_comment_title string,
    review_comment_message string,
    review_creation_date string,
    review_answer_timestamp string
    """,
    "olist_order_reviews_dataset"
)

create_csv_table(
    "olist_orders_dataset",
    """
    order_id string,
    customer_id string,
    order_status string,
    order_purchase_timestamp string,
    order_approved_at string,
    order_delivered_carrier_date string,
    order_delivered_customer_date string,
    order_estimated_delivery_date string
    """,
    "olist_orders_dataset"
)

create_csv_table(
    "olist_products_dataset",
    """
    product_id string,
    product_category_name string,
    product_name_lenght int,
    product_description_lenght int,
    product_photos_qty int,
    product_weight_g int,
    product_length_cm int,
    product_height_cm int,
    product_width_cm int
    """,
    "olist_products_dataset"
)

create_csv_table(
    "olist_sellers_dataset",
    """
    seller_id string,
    seller_zip_code_prefix int,
    seller_city string,
    seller_state string
    """,
    "olist_sellers_dataset"
)

create_csv_table(
    "product_category_name_translation",
    """
    product_category_name string,
    product_category_name_english string
    """,
    "product_category_name_translation"
)

✅ Created: olist_datalake.olist_customers_dataset
✅ Created: olist_datalake.olist_geolocation_dataset
✅ Created: olist_datalake.olist_order_items_dataset
✅ Created: olist_datalake.olist_order_payments_dataset
✅ Created: olist_datalake.olist_order_reviews_dataset
✅ Created: olist_datalake.olist_orders_dataset
✅ Created: olist_datalake.olist_products_dataset
✅ Created: olist_datalake.olist_sellers_dataset
✅ Created: olist_datalake.product_category_name_translation


In [8]:
run_athena(f"SHOW TABLES IN {DB};", database=DB)
print("✅ SHOW TABLES succeeded")

run_athena(f"SELECT COUNT(*) FROM {DB}.olist_orders_dataset;", database=DB)
print("✅ COUNT orders succeeded")

run_athena(f"SELECT order_status, COUNT(*) c FROM {DB}.olist_orders_dataset GROUP BY 1 ORDER BY c DESC;", database=DB)
print("✅ GROUP BY order_status succeeded")

✅ SHOW TABLES succeeded
✅ COUNT orders succeeded
✅ GROUP BY order_status succeeded
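
The checks above confirm that each query succeeded but do not display the returned rows. A minimal sketch of turning an Athena result page into dictionaries is shown below. It assumes the response shape returned by `athena.get_query_results(QueryExecutionId=qid)` for a `SELECT` query, where the first row of the first page holds the column names. The helper name and the sample response are made up for illustration.

```python
def rows_from_result_set(response: dict) -> list:
    """Convert one page of Athena SELECT results into a list of {column: value} dicts."""
    rows = response["ResultSet"]["Rows"]
    if not rows:
        return []
    header = [cell.get("VarCharValue") for cell in rows[0]["Data"]]
    return [
        dict(zip(header, (cell.get("VarCharValue") for cell in row["Data"])))
        for row in rows[1:]
    ]


# Hand-written sample in the same shape as an Athena response (values are placeholders).
sample = {
    "ResultSet": {
        "Rows": [
            {"Data": [{"VarCharValue": "order_status"}, {"VarCharValue": "c"}]},
            {"Data": [{"VarCharValue": "delivered"}, {"VarCharValue": "3"}]},
            {"Data": [{"VarCharValue": "shipped"}, {"VarCharValue": "1"}]},
        ]
    }
}
print(rows_from_result_set(sample))
```

Values arrive as strings and null cells have no `VarCharValue` key, so counts need an explicit `int()` conversion before comparison.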


In [1]:
#6.1
import awswrangler as wr
import pandas as pd

DB = "olist_datalake"

orders = wr.athena.read_sql_query(
    sql=f"SELECT * FROM {DB}.olist_orders_dataset",
    database=DB,
    ctas_approach=False
)

order_items = wr.athena.read_sql_query(
    sql=f"SELECT * FROM {DB}.olist_order_items_dataset",
    database=DB,
    ctas_approach=False
)

payments = wr.athena.read_sql_query(
    sql=f"SELECT * FROM {DB}.olist_order_payments_dataset",
    database=DB,
    ctas_approach=False
)

customers = wr.athena.read_sql_query(
    sql=f"SELECT * FROM {DB}.olist_customers_dataset",
    database=DB,
    ctas_approach=False
)

print("orders:", orders.shape)
print("order_items:", order_items.shape)
print("payments:", payments.shape)
print("customers:", customers.shape)

2026-01-25 19:16:46,747	INFO worker.py:1852 -- Started a local Ray instance.


orders: (99441, 8)
order_items: (112650, 7)
payments: (103886, 5)
customers: (99441, 5)


In [2]:
#6.2
# Parse timestamps
timestamp_cols = [
    "order_purchase_timestamp",
    "order_approved_at",
    "order_delivered_carrier_date",
    "order_delivered_customer_date",
    "order_estimated_delivery_date",
]
for col in timestamp_cols:
    orders[col] = pd.to_datetime(orders[col], errors="coerce")

# Aggregations
items_agg = (
    order_items.groupby("order_id")
    .agg(
        num_items=("order_item_id", "count"),
        total_price=("price", "sum"),
        total_freight_value=("freight_value", "sum"),
        num_sellers=("seller_id", "nunique"),
    )
    .reset_index()
)

payments_agg = (
    payments.groupby("order_id")
    .agg(
        payment_value=("payment_value", "sum"),
        payment_installments=("payment_installments", "max"),
        payment_type=("payment_type", lambda x: x.value_counts().index[0]),
    )
    .reset_index()
)

eda_df = (
    orders
    .merge(items_agg, on="order_id", how="left")
    .merge(payments_agg, on="order_id", how="left")
    .merge(customers[["customer_id", "customer_state"]], on="customer_id", how="left")
)

# Time features
eda_df["purchase_dow"] = eda_df["order_purchase_timestamp"].dt.dayofweek
eda_df["purchase_hour"] = eda_df["order_purchase_timestamp"].dt.hour

# Label: late delivery (orders missing a delivery date or an estimate are labeled 0, i.e. not late)
eda_df["is_late"] = (
    (eda_df["order_delivered_customer_date"].notna()) &
    (eda_df["order_estimated_delivery_date"].notna()) &
    (eda_df["order_delivered_customer_date"] > eda_df["order_estimated_delivery_date"])
).astype(int)

print("Orders rows:", len(orders))
print("EDA rows:", len(eda_df))
print("Row loss:", len(orders) - len(eda_df))
print("Late rate:\n", eda_df["is_late"].value_counts(normalize=True))

Orders rows: 99441
EDA rows: 99441
Row loss: 0
Late rate:
 is_late
0    0.92129
1    0.07871
Name: proportion, dtype: float64


In [3]:
#7.0 + 7B
# Canonical features (with purchase timestamp)
feat = eda_df[[
    "order_id",
    "order_purchase_timestamp",
    "customer_state",
    "num_items",
    "total_price",
    "total_freight_value",
    "num_sellers",
    "payment_value",
    "payment_installments",
    "payment_type",
    "purchase_dow",
    "purchase_hour",
    "is_late"
]].copy()

feat["customer_state"] = feat["customer_state"].fillna("unknown").astype(str)
feat["payment_type"] = feat["payment_type"].fillna("unknown").astype(str)

for c in ["num_items", "num_sellers", "payment_installments", "purchase_dow", "purchase_hour", "is_late"]:
    feat[c] = feat[c].fillna(0).astype(int)

for c in ["total_price", "total_freight_value", "payment_value"]:
    feat[c] = feat[c].fillna(0.0).astype(float)

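# Create strict ISO-8601 event time WITH timezone "Z"
# Note: the source timestamps are timezone-naive; the "Z" labels them as UTC without converting them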
feat["event_time"] = (
    pd.to_datetime(feat["order_purchase_timestamp"], errors="coerce")
    .dt.strftime("%Y-%m-%dT%H:%M:%SZ")
)

feat = feat.dropna(subset=["order_id", "event_time"]).reset_index(drop=True)

# FeatureStore-safe version (remove datetime64 column)
feat_fs = feat.drop(columns=["order_purchase_timestamp"]).copy()
feat_fs["event_time"] = feat_fs["event_time"].astype(str)

print("✅ feat shape:", feat.shape)
print("✅ feat_fs shape:", feat_fs.shape)
print("event_time sample:", feat_fs["event_time"].head().tolist())

✅ feat shape: (99441, 14)
✅ feat_fs shape: (99441, 13)
event_time sample: ['2017-10-02T10:56:33Z', '2018-07-24T20:41:37Z', '2018-08-08T08:38:49Z', '2017-11-18T19:28:06Z', '2018-02-13T21:18:39Z']


In [4]:
#7.1
import boto3
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

bucket = "aai540-olist-mlops-chris-7f3k2p"
region = boto3.session.Session().region_name
sess = sagemaker.Session()
role = sagemaker.get_execution_role()

feature_group_name = "olist-order-features-v1"
offline_store_s3_uri = f"s3://{bucket}/feature-store/{feature_group_name}/"

fg = FeatureGroup(name=feature_group_name, sagemaker_session=sess)

print("Region:", region)
print("Role:", role)
print("Offline store URI:", offline_store_s3_uri)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
Region: us-east-1
Role: arn:aws:iam::758289042916:role/LabRole
Offline store URI: s3://aai540-olist-mlops-chris-7f3k2p/feature-store/olist-order-features-v1/


In [5]:
#7.2
import time
from botocore.exceptions import ClientError

sm = boto3.client("sagemaker", region_name=region)

def feature_group_exists(name: str) -> bool:
    try:
        sm.describe_feature_group(FeatureGroupName=name)
        return True
    except ClientError as e:
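        # Compare the structured error code instead of searching the message text
        if e.response.get("Error", {}).get("Code") == "ResourceNotFound":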
            return False
        raise

def wait_for_fg_created(name: str, timeout_sec: int = 600, poll_sec: int = 10):
    start = time.time()
    while True:
        desc = sm.describe_feature_group(FeatureGroupName=name)
        status = desc.get("FeatureGroupStatus")
        offline_status = desc.get("OfflineStoreStatus", {}).get("Status", "UNKNOWN")
        print(f"Status={status}, OfflineStoreStatus={offline_status}")
        if status == "Created" and offline_status in ("Active", "UNKNOWN"):
            return desc
        if status in ("CreateFailed", "DeleteFailed"):
            raise RuntimeError(f"Feature Group failed with status={status}. Details: {desc}")
        if time.time() - start > timeout_sec:
            raise TimeoutError(f"Timed out waiting for Feature Group to be Created: {name}")
        time.sleep(poll_sec)

if feature_group_exists(feature_group_name):
    print(f"✅ Feature Group already exists: {feature_group_name}")
else:
    fg.load_feature_definitions(data_frame=feat_fs)
    fg.create(
        s3_uri=offline_store_s3_uri,
        record_identifier_name="order_id",
        event_time_feature_name="event_time",
        role_arn=role,
        enable_online_store=False,
    )
    print("⏳ Create request submitted")

wait_for_fg_created(feature_group_name)
print("✅ Feature Group ready")

✅ Feature Group already exists: olist-order-features-v1
Status=Created, OfflineStoreStatus=UNKNOWN
✅ Feature Group ready


In [6]:
#7.3
ingest_response = fg.ingest(data_frame=feat_fs, max_workers=2, wait=True)
print("✅ Ingest complete")

✅ Ingest complete


In [7]:
#7.4
import boto3

sm = boto3.client("sagemaker", region_name=region)
desc = sm.describe_feature_group(FeatureGroupName=feature_group_name)

print("FeatureGroupStatus:", desc["FeatureGroupStatus"])
print("OfflineStoreStatus:", desc["OfflineStoreStatus"]["Status"])
print("S3 Offline Store Uri:", desc["OfflineStoreConfig"]["S3StorageConfig"]["S3Uri"])

FeatureGroupStatus: Created
OfflineStoreStatus: Active
S3 Offline Store Uri: s3://aai540-olist-mlops-chris-7f3k2p/feature-store/olist-order-features-v1/


In [8]:
#8.0
import awswrangler as wr

bucket = "aai540-olist-mlops-chris-7f3k2p"

# Use feat_fs for splits (Feature Store compatible)
# event_time is a fixed-format ISO-8601 string, so sorting it as text is also chronological
feat_sorted = feat_fs.sort_values("event_time").reset_index(drop=True)
n = len(feat_sorted)

train_end = int(n * 0.40)
val_end   = int(n * 0.50)
test_end  = int(n * 0.60)

train_df = feat_sorted.iloc[:train_end]
val_df   = feat_sorted.iloc[train_end:val_end]
test_df  = feat_sorted.iloc[val_end:test_end]
prod_df  = feat_sorted.iloc[test_end:]

print("✅ Split sizes")
print("train:", len(train_df))
print("val:  ", len(val_df))
print("test: ", len(test_df))
print("prod: ", len(prod_df))

split_base = f"s3://{bucket}/splits/olist/features/version=v1/"
wr.s3.to_parquet(train_df, f"{split_base}train/", dataset=True, mode="overwrite")
wr.s3.to_parquet(val_df,   f"{split_base}val/",   dataset=True, mode="overwrite")
wr.s3.to_parquet(test_df,  f"{split_base}test/",  dataset=True, mode="overwrite")
wr.s3.to_parquet(prod_df,  f"{split_base}prod/",  dataset=True, mode="overwrite")

print("✅ Wrote splits to:", split_base)

✅ Split sizes
train: 39776
val:   9944
test:  9944
prod:  39777
✅ Wrote splits to: s3://aai540-olist-mlops-chris-7f3k2p/splits/olist/features/version=v1/


In [9]:
#9.0

In [10]:
import boto3

s3 = boto3.client("s3")
prefix = "splits/olist/features/version=v1/"

resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=50)
print("Objects found:", resp.get("KeyCount", 0))
for obj in resp.get("Contents", [])[:20]:
    print(obj["Key"])

Objects found: 4
splits/olist/features/version=v1/prod/a8cdd289b48b495ba34f072d8dfa9932.snappy.parquet
splits/olist/features/version=v1/test/ba691db52493437ab3c663c8b98fe955.snappy.parquet
splits/olist/features/version=v1/train/33879ef6e41a44e384d5255caa4ffa7f.snappy.parquet
splits/olist/features/version=v1/val/6fb7a4263ec04acb8f57f8ad0050cfb0.snappy.parquet




In [1]:
#M4 Start
try:
    print(len(train))
except NameError:
    print("❌ Kernel is fresh — variables not loaded")

❌ Kernel is fresh — variables not loaded


In [2]:
import awswrangler as wr
import pandas as pd
import numpy as np

bucket = "aai540-olist-mlops-chris-7f3k2p"
split_base = f"s3://{bucket}/splits/olist/features/version=v1/"

In [3]:
train = wr.s3.read_parquet(f"{split_base}train/")
val   = wr.s3.read_parquet(f"{split_base}val/")
test  = wr.s3.read_parquet(f"{split_base}test/")

print(train.shape, val.shape, test.shape)

2026-01-31 19:18:40,252	INFO worker.py:1852 -- Started a local Ray instance.


(39776, 13) (9944, 13) (9944, 13)


In [4]:
#M4-1 Benchmark Model Baseline

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Ground truth
y_test = test["is_late"].astype(int)

# Baseline prediction: always predict NOT late
y_pred_baseline = pd.Series(0, index=test.index)

baseline_metrics = {
    "accuracy": accuracy_score(y_test, y_pred_baseline),
    "precision": precision_score(y_test, y_pred_baseline, zero_division=0),
    "recall": recall_score(y_test, y_pred_baseline, zero_division=0),
    "f1": f1_score(y_test, y_pred_baseline, zero_division=0),
}

baseline_metrics

{'accuracy': 0.8621279163314561, 'precision': 0.0, 'recall': 0.0, 'f1': 0.0}

In [5]:
#M4-2 1st Model SageMaker

FEATURES = [
    "num_items",
    "total_price",
    "total_freight_value",
    "num_sellers",
    "payment_value",
    "payment_installments",
    "purchase_dow",
    "purchase_hour",
]

def to_xgb_matrix(df):
    out = df[["is_late"] + FEATURES].copy()
    for c in FEATURES:
        out[c] = pd.to_numeric(out[c], errors="coerce").fillna(0.0)
    out["is_late"] = out["is_late"].astype(int)
    return out

train_xgb = to_xgb_matrix(train)
val_xgb   = to_xgb_matrix(val)
test_xgb  = to_xgb_matrix(test)

train_xgb.head()

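|   | is_late | num_items | total_price | total_freight_value | num_sellers | payment_value | payment_installments | purchase_dow | purchase_hour |
|---|---------|-----------|-------------|---------------------|-------------|---------------|----------------------|--------------|---------------|
| 0 | 0 | 2 | 72.89 | 63.34 | 1 | 136.23 | 1 | 6 | 21 |
| 1 | 0 | 1 | 59.5 | 15.56 | 1 | 75.06 | 3 | 0 | 0 |
| 2 | 0 | 0 | 0.0 | 0.0 | 0 | 40.95 | 2 | 1 | 15 |
| 3 | 1 | 3 | 134.97 | 8.49 | 1 | 0.0 | 0 | 3 | 12 |
| 4 | 0 | 1 | 100.0 | 9.34 | 1 | 109.34 | 1 | 6 | 22 |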


In [7]:
#M4-2-2
## NOTE: CSVs already written prior to training — do not rerun
xgb_prefix = f"s3://{bucket}/modeling/xgb_v1/"

train_csv_uri = f"{xgb_prefix}train/train.csv"
val_csv_uri   = f"{xgb_prefix}val/val.csv"
test_csv_uri  = f"{xgb_prefix}test/test.csv"

wr.s3.to_csv(train_xgb, train_csv_uri, index=False, header=False)
wr.s3.to_csv(val_xgb,   val_csv_uri,   index=False, header=False)
wr.s3.to_csv(test_xgb,  test_csv_uri,  index=False, header=False)

print("Train:", train_csv_uri)
print("Val:  ", val_csv_uri)
print("Test: ", test_csv_uri)

Train: s3://aai540-olist-mlops-chris-7f3k2p/modeling/xgb_v1/train/train.csv
Val:   s3://aai540-olist-mlops-chris-7f3k2p/modeling/xgb_v1/val/val.csv
Test:  s3://aai540-olist-mlops-chris-7f3k2p/modeling/xgb_v1/test/test.csv


In [8]:
#M4-2-3 Train Model
# TRAINING CELL (DO NOT RERUN)
# This cell was executed once to train the initial XGBoost model.
# Re-running this cell will retrain the model and incur additional cost.
# The trained model is reused below via attachment for evaluation and deployment.

import boto3
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.image_uris import retrieve

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
region = boto3.session.Session().region_name

# Built-in XGBoost container
xgb_image = retrieve(
    framework="xgboost",
    region=region,
    version="1.7-1"
)

output_path = f"s3://{bucket}/modeling/xgb_v1/output/"

xgb = Estimator(
    image_uri=xgb_image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",  # budget-friendly
    output_path=output_path,
    sagemaker_session=sess,
)

# Simple, reasonable first-pass hyperparameters
xgb.set_hyperparameters(
    objective="binary:logistic",
    eval_metric="auc",
    num_round=100,
    max_depth=4,
    eta=0.2,
    subsample=0.8,
    colsample_bytree=0.8,
)

train_input = TrainingInput(train_csv_uri, content_type="text/csv")
val_input   = TrainingInput(val_csv_uri, content_type="text/csv")

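# Commented out after the initial run so that re-executing this cell does not submit a new training job.
#xgb.fit({
#    "train": train_input,
#    "validation": val_input
#})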

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2026-01-31-18-00-53-460


2026-01-31 18:00:53 Starting - Starting the training job...
2026-01-31 18:01:14 Starting - Preparing the instances for training...
2026-01-31 18:01:38 Downloading - Downloading input data...
  import pkg_resources
[2026-01-31 18:03:35.611 ip-10-2-90-203.ec2.internal:7 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2026-01-31 18:03:35.683 ip-10-2-90-203.ec2.internal:7 INFO profiler_config_parser.py:111] User has disabled profiler.
[2026-01-31:18:03:36:INFO] Imported framework sagemaker_xgboost_container.training
[2026-01-31:18:03:36:INFO] Failed to parse hyperparameter eval_metric value auc to Json.
Returning the value itself
[2026-01-31:18:03:36:INFO] Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
[2026-01-31:18:03:36:INFO] No GPUs detected (normal if no gpus installed)
[2026-01-31:18:03:36:INFO] Running XGBoost Sagemaker in algorithm mode

In [6]:
# M4-2-3b Attach to Existing Trained Model (No Retraining)
import boto3, sagemaker
from sagemaker.estimator import Estimator
from sagemaker.image_uris import retrieve

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
region = boto3.session.Session().region_name
sm = boto3.client("sagemaker", region_name=region)

# Pick the most recent completed training job
jobs = sm.list_training_jobs(SortBy="CreationTime", SortOrder="Descending", MaxResults=20)["TrainingJobSummaries"]
training_job_name = next(j["TrainingJobName"] for j in jobs if j["TrainingJobStatus"] == "Completed")
print("✅ Using training job:", training_job_name)

# Recreate estimator and attach (no retraining)
xgb_image = retrieve(framework="xgboost", region=region, version="1.7-1")
xgb = Estimator(
    image_uri=xgb_image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path=f"s3://{bucket}/modeling/xgb_v1/output/",
    sagemaker_session=sess,
)
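# NOTE: this sets a private SDK attribute. The public way to bind an estimator
# to an existing job is Estimator.attach(training_job_name, sagemaker_session=sess).
xgb._current_job_name = training_job_name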
print("✅ Attached. Ready for Batch Transform.")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
✅ Using training job: sagemaker-xgboost-2026-01-31-18-00-53-460
✅ Attached. Ready for Batch Transform.
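
The job lookup above uses `next(...)` without a default, so it raises `StopIteration` if none of the 20 most recent jobs has status `Completed`. A small sketch of a guarded version; `pick_latest_completed` and the sample summaries are hypothetical and only include the two fields the lookup reads:

```python
def pick_latest_completed(summaries):
    """Return the name of the first 'Completed' job, or None if there is none.

    `summaries` is assumed to be sorted newest-first, as requested above
    with SortBy="CreationTime" and SortOrder="Descending".
    """
    for job in summaries:
        if job.get("TrainingJobStatus") == "Completed":
            return job["TrainingJobName"]
    return None

# Hypothetical summaries, newest first
sample = [
    {"TrainingJobName": "job-c", "TrainingJobStatus": "Failed"},
    {"TrainingJobName": "job-b", "TrainingJobStatus": "Completed"},
    {"TrainingJobName": "job-a", "TrainingJobStatus": "Completed"},
]
latest = pick_latest_completed(sample)
none_found = pick_latest_completed(
    [{"TrainingJobName": "job-x", "TrainingJobStatus": "Stopped"}]
)
print(latest, none_found)
```

"Most recent completed" is only the right job while no newer training jobs exist in the account; pinning the job name explicitly avoids picking up an unrelated run.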


In [8]:
# M4-2-2b Recreate CSV URIs

xgb_prefix = f"s3://{bucket}/modeling/xgb_v1/"

train_csv_uri = f"{xgb_prefix}train/train.csv"
val_csv_uri   = f"{xgb_prefix}val/val.csv"
test_csv_uri  = f"{xgb_prefix}test/test.csv"

print("Train:", train_csv_uri)
print("Val:  ", val_csv_uri)
print("Test: ", test_csv_uri)

Train: s3://aai540-olist-mlops-chris-7f3k2p/modeling/xgb_v1/train/train.csv
Val:   s3://aai540-olist-mlops-chris-7f3k2p/modeling/xgb_v1/val/val.csv
Test:  s3://aai540-olist-mlops-chris-7f3k2p/modeling/xgb_v1/test/test.csv


In [12]:
# M4-3.0 Create inference-only TEST input (features only)
# Code developed using ChatGPT (ChatGPT, 2024) as a pair programmer.

# test_xgb currently has: is_late + 8 features
test_infer = test_xgb.drop(columns=["is_late"]).copy()

test_infer_csv_uri = f"{xgb_prefix}test/test_infer.csv"

# IMPORTANT: no header, no index
wr.s3.to_csv(test_infer, test_infer_csv_uri, index=False, header=False)

print("✅ Wrote inference CSV:", test_infer_csv_uri)
print("Shape (should be 9944 x 8):", test_infer.shape)

✅ Wrote inference CSV: s3://aai540-olist-mlops-chris-7f3k2p/modeling/xgb_v1/test/test_infer.csv
Shape (should be 9944 x 8): (9944, 8)


In [13]:
# M4-3.1 Batch Transform on TEST (Evaluation) - v2 (features-only)
from sagemaker.transformer import Transformer

test_transform_output_v2 = f"s3://{bucket}/modeling/xgb_v1/batch-out/test_v2/"

transformer = Transformer(
    model_name=model_name,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path=test_transform_output_v2,
    assemble_with="Line",
    accept="text/csv"
)

print("⏳ Starting batch transform (features-only input)...")
transformer.transform(
    data=test_infer_csv_uri,
    content_type="text/csv",
    split_type="Line"
)

transformer.wait()
print("✅ Batch transform finished:", test_transform_output_v2)

INFO:sagemaker:Creating transform job with name: sagemaker-xgboost-2026-01-31-19-39-44-779


⏳ Starting batch transform (features-only input)...
  import pkg_resources
[2026-01-31:19:45:17:INFO] No GPUs detected (normal if no gpus installed)
[2026-01-31:19:45:17:INFO] No GPUs detected (normal if no gpus installed)
[2026-01-31:19:45:17:INFO] nginx config: 
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr;
worker_rlimit_nofile 4096;
events {
  worker_connections 2048;
}
http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /dev/stdout combined;
  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }
  server {
    listen 8080 deferred;
    client_max_body_size 0;
    keepalive_timeout 3;
    location ~ ^/(ping|invocations|execution-parameters) {
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_redirect off;
      proxy_read_timeout 60s;
  

In [14]:
# M4-3.2 Load predictions and evaluate
import boto3
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score

s3 = boto3.client("s3")

resp = s3.list_objects_v2(Bucket=bucket, Prefix="modeling/xgb_v1/batch-out/test_v2/")
keys = [o["Key"] for o in resp.get("Contents", [])]
out_files = [k for k in keys if k.endswith(".out")]

if not out_files:
    raise RuntimeError(f"No .out files found in test_v2 output. Keys seen: {keys[:10]}")

out_key = sorted(out_files)[-1]
out_uri = f"s3://{bucket}/{out_key}"
print("Reading predictions from:", out_uri)

pred_df = wr.s3.read_csv(out_uri, header=None)
y_prob = pred_df[0].astype(float).reset_index(drop=True)

y_true = test_xgb["is_late"].astype(int).reset_index(drop=True)
y_hat = (y_prob >= 0.5).astype(int)

model_metrics = {
    "auc": roc_auc_score(y_true, y_prob),
    "accuracy": accuracy_score(y_true, y_hat),
    "precision": precision_score(y_true, y_hat, zero_division=0),
    "recall": recall_score(y_true, y_hat, zero_division=0),
    "f1": f1_score(y_true, y_hat, zero_division=0),
}

comparison = pd.DataFrame(
    [baseline_metrics, model_metrics],
    index=["baseline_always_on_time", "xgb_v1"]
)

print("✅ Model metrics:", model_metrics)
comparison

Reading predictions from: s3://aai540-olist-mlops-chris-7f3k2p/modeling/xgb_v1/batch-out/test_v2/test_infer.csv.out
✅ Model metrics: {'auc': 0.5635208855035949, 'accuracy': 0.8621279163314561, 'precision': 0.0, 'recall': 0.0, 'f1': 0.0}


| | accuracy | precision | recall | f1 | auc |
|---|---|---|---|---|---|
| baseline_always_on_time | 0.862128 | 0.0 | 0.0 | 0.0 | NaN |
| xgb_v1 | 0.862128 | 0.0 | 0.0 | 0.0 | 0.563521 |
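
The `xgb_v1` row matches the always-on-time baseline exactly, with precision and recall both 0. That pattern means no test probability reached the 0.5 cutoff, so every order was predicted on time; AUC is unaffected because it does not depend on a threshold. A minimal standard-library sketch of how the cutoff drives these metrics; the scores below are hypothetical and are not the model's actual outputs:

```python
def metrics_at_threshold(y_true, y_prob, threshold=0.5):
    """Compute accuracy, precision, and recall for one probability cutoff."""
    y_hat = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_hat))
    tp = sum(1 for t, h in pairs if t == 1 and h == 1)
    fp = sum(1 for t, h in pairs if t == 0 and h == 1)
    fn = sum(1 for t, h in pairs if t == 1 and h == 0)
    tn = sum(1 for t, h in pairs if t == 0 and h == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Same convention as zero_division=0 in the cell above
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical labels and scores: every score sits below 0.5
y_true = [0, 0, 0, 1, 0, 1, 0, 0, 0, 0]
y_prob = [0.05, 0.10, 0.08, 0.30, 0.22, 0.25, 0.07, 0.09, 0.11, 0.06]

at_050 = metrics_at_threshold(y_true, y_prob, 0.5)
at_020 = metrics_at_threshold(y_true, y_prob, 0.2)
print("threshold 0.5:", at_050)
print("threshold 0.2:", at_020)
```

On imbalanced data a lower cutoff chosen on the validation split (not the test split) is one common way to trade some precision for recall.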


In [15]:
import boto3

sm = boto3.client("sagemaker", region_name=region)
sm.delete_model(ModelName=model_name)
print("✅ Deleted model:", model_name)

✅ Deleted model: xgb-v1-1769887733


In [1]:
# Implement:
# Implement model monitors on your ML system.
# Implement data monitors on your ML system.
# Implement infrastructure monitors on your ML system.
# Create a monitoring dashboard for your ML endpoint/job on CloudWatch.
# Generate model and data reports on SageMaker.

#0
import time, json
import boto3
import sagemaker
import awswrangler as wr
import pandas as pd
import numpy as np
from sagemaker.image_uris import retrieve
from sagemaker.model import Model

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
region = boto3.session.Session().region_name
sm = boto3.client("sagemaker", region_name=region)
cw = boto3.client("cloudwatch", region_name=region)

bucket = "aai540-olist-mlops-chris-7f3k2p"
xgb_prefix = f"s3://{bucket}/modeling/xgb_v1/"

# Training job from earlier
training_job_name = "sagemaker-xgboost-2026-01-31-18-00-53-460"

# Feature list (names and column order must match training)
FEATURES = [
    "num_items","total_price","total_freight_value","num_sellers",
    "payment_value","payment_installments","purchase_dow","purchase_hour",
]


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


In [2]:
#1
tj = sm.describe_training_job(TrainingJobName=training_job_name)
model_data = tj["ModelArtifacts"]["S3ModelArtifacts"]
print("Model artifact:", model_data)

xgb_image = retrieve(framework="xgboost", region=region, version="1.7-1")

model_name = f"xgb-v1-monitor-{int(time.time())}"
model = Model(
    image_uri=xgb_image,
    model_data=model_data,
    role=role,
    sagemaker_session=sess,
    name=model_name,
)

model.create(instance_type="ml.m5.large")
print("✅ Created model:", model_name)

Model artifact: s3://aai540-olist-mlops-chris-7f3k2p/modeling/xgb_v1/output/sagemaker-xgboost-2026-01-31-18-00-53-460/output/model.tar.gz
✅ Created model: xgb-v1-monitor-1770585486


In [4]:
#2A
import sagemaker, inspect
from sagemaker.transformer import Transformer

print("sagemaker sdk version:", sagemaker.__version__)
print("Transformer.transform signature:\n", inspect.signature(Transformer.transform))


sagemaker sdk version: 2.245.0
Transformer.transform signature:
 (self, data: Union[str, sagemaker.workflow.entities.PipelineVariable], data_type: Union[str, sagemaker.workflow.entities.PipelineVariable] = 'S3Prefix', content_type: Union[str, sagemaker.workflow.entities.PipelineVariable, NoneType] = None, compression_type: Union[str, sagemaker.workflow.entities.PipelineVariable, NoneType] = None, split_type: Union[str, sagemaker.workflow.entities.PipelineVariable, NoneType] = None, job_name: Optional[str] = None, input_filter: Union[str, sagemaker.workflow.entities.PipelineVariable, NoneType] = None, output_filter: Union[str, sagemaker.workflow.entities.PipelineVariable, NoneType] = None, join_source: Union[str, sagemaker.workflow.entities.PipelineVariable, NoneType] = None, experiment_config: Optional[Dict[str, str]] = None, model_client_config: Optional[Dict[str, Union[str, sagemaker.workflow.entities.PipelineVariable]]] = None, batch_data_capture_config: sagemaker.inputs.BatchDataCa

In [5]:
# M5-2B Batch Transform with Batch Data Capture (SDK 2.245.0)

import time
from sagemaker.transformer import Transformer
from sagemaker.inputs import BatchDataCaptureConfig

# Use a single timestamp so the output and capture prefixes always match
run_ts = int(time.time())
transform_output = f"s3://{bucket}/monitoring/batch-transform/output/{run_ts}/"
capture_uri = f"s3://{bucket}/monitoring/batch-transform/capture/{run_ts}/"

batch_capture = BatchDataCaptureConfig(
    destination_s3_uri=capture_uri,
    generate_inference_id=True,
)

transformer = Transformer(
    model_name=model_name,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path=transform_output,
    assemble_with="Line",
    accept="text/csv",
)

print("⏳ Starting batch transform WITH batch data capture...")
transformer.transform(
    data=prod_infer_uri,
    content_type="text/csv",
    split_type="Line",
    batch_data_capture_config=batch_capture,  # parameter name taken from the signature printed in 2A
    wait=True,
    logs=True,
)

print("✅ Transform complete")
print("Transform output:", transform_output)
print("Captured data:", capture_uri)

INFO:sagemaker:Creating transform job with name: sagemaker-xgboost-2026-02-08-21-21-26-827


⏳ Starting batch transform WITH batch data capture...
  import pkg_resources
[2026-02-08:21:26:55:INFO] No GPUs detected (normal if no gpus installed)
[2026-02-08:21:26:55:INFO] No GPUs detected (normal if no gpus installed)
[2026-02-08:21:26:55:INFO] nginx config: 
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr;
worker_rlimit_nofile 4096;
events {
  worker_connections 2048;
}
http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /dev/stdout combined;
  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }
  server {
    listen 8080 deferred;
    client_max_body_size 0;
    keepalive_timeout 3;
    location ~ ^/(ping|invocations|execution-parameters) {
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_redirect off;
      proxy_read_timeout 60s;


In [6]:
#3
# M5-3.0 Data quality baseline dataset (TRAIN features)
from sagemaker.model_monitor.dataset_format import DatasetFormat
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.processing import ProcessingInput, ProcessingOutput

split_base = f"s3://{bucket}/splits/olist/features/version=v1/"
train = wr.s3.read_parquet(f"{split_base}train/")

train_baseline = train[FEATURES].copy()
for c in FEATURES:
    train_baseline[c] = pd.to_numeric(train_baseline[c], errors="coerce").fillna(0.0)

baseline_uri = f"s3://{bucket}/monitoring/baselines/data_quality/train_baseline.csv"
wr.s3.to_csv(train_baseline, baseline_uri, index=False, header=False)

print("✅ Baseline CSV:", baseline_uri, "shape:", train_baseline.shape)

✅ Baseline CSV: s3://aai540-olist-mlops-chris-7f3k2p/monitoring/baselines/data_quality/train_baseline.csv shape: (39776, 8)
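
The baseline cell coerces every feature with `pd.to_numeric(..., errors="coerce")` and then `fillna(0.0)`, so unparseable or missing values become zeros and the baseline statistics cannot tell them apart from real zeros. A standard-library sketch of the same coercion that also counts how many values were filled; `coerce_numeric` and the sample values are hypothetical:

```python
def coerce_numeric(values, fill=0.0):
    """Convert values to float, replacing unparseable or NaN entries with `fill`.

    Returns the cleaned list and the number of entries that were filled.
    """
    cleaned, n_filled = [], 0
    for v in values:
        try:
            x = float(v)
            if x != x:  # NaN is the only float that is not equal to itself
                raise ValueError("NaN")
        except (TypeError, ValueError):
            x = fill
            n_filled += 1
        cleaned.append(x)
    return cleaned, n_filled

# Hypothetical column with a missing value, a bad string, and a NaN
cleaned, n_filled = coerce_numeric(["1", 2, None, "abc", float("nan")])
print(cleaned, n_filled)
```

Logging the fill count per feature before writing the baseline makes it visible when a column is mostly imputed.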


In [7]:
# M5-3.1 Suggest baseline (stats + constraints)
# NOTE (Feb 2026):
# Older SageMaker examples print the results with
# `monitor.baseline_constraints()` and `monitor.baseline_statistics()`.
# Those calls did not work as written under SageMaker SDK v2.245.0
# (the SDK names the constraints accessor `suggested_constraints()`),
# so they are left commented out at the end of this cell.
#
# The baseline IS created successfully; the artifacts are
# retrieved directly from S3 in M5-3.1b below.
monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",  # budget-friendly
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
    sagemaker_session=sess,
)

baseline_results_uri = f"s3://{bucket}/monitoring/baselines/data_quality/results/"
monitor.suggest_baseline(
    baseline_dataset=baseline_uri,
    dataset_format=DatasetFormat.csv(header=False),
    output_s3_uri=baseline_results_uri,
    wait=True,
)

print("✅ Baseline results in:", baseline_results_uri)
# print("Constraints:", monitor.baseline_constraints())
# print("Statistics:", monitor.baseline_statistics())

INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating processing-job with name baseline-suggestion-job-2026-02-08-21-32-41-870


.......................
2026-02-08 21:36:23.809849: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2026-02-08 21:36:23.809892: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2026-02-08 21:36:25.321307: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2026-02-08 21:36:25.321342: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2026-02-08 21:36:25.321368: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ip-10-2-105-159.ec2.internal): /proc/driver/nvidia/version

In [8]:
# M5-3.1b Find baseline statistics + constraints files in S3
import boto3

s3 = boto3.client("s3", region_name=region)

baseline_prefix = baseline_results_uri.replace(f"s3://{bucket}/", "")
resp = s3.list_objects_v2(Bucket=bucket, Prefix=baseline_prefix)

keys = [o["Key"] for o in resp.get("Contents", [])]
print("Found objects:", len(keys))
for k in keys[:50]:
    print(k)

# Try to auto-detect the files we need
stats = [k for k in keys if k.endswith("statistics.json")]
constraints = [k for k in keys if k.endswith("constraints.json")]

print("\nstatistics.json:", stats[-1] if stats else "NOT FOUND")
print("constraints.json:", constraints[-1] if constraints else "NOT FOUND")

baseline_statistics_uri = f"s3://{bucket}/{stats[-1]}" if stats else None
baseline_constraints_uri = f"s3://{bucket}/{constraints[-1]}" if constraints else None

baseline_statistics_uri, baseline_constraints_uri

Found objects: 2
monitoring/baselines/data_quality/results/constraints.json
monitoring/baselines/data_quality/results/statistics.json

statistics.json: monitoring/baselines/data_quality/results/statistics.json
constraints.json: monitoring/baselines/data_quality/results/constraints.json


('s3://aai540-olist-mlops-chris-7f3k2p/monitoring/baselines/data_quality/results/statistics.json',
 's3://aai540-olist-mlops-chris-7f3k2p/monitoring/baselines/data_quality/results/constraints.json')
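
Several cells convert between `s3://` URIs and bucket/key pairs with `str.replace` or `split`. A small sketch of a reusable helper built on `urllib.parse`; `split_s3_uri` is a hypothetical name, and the example URI is the statistics file reported above:

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # The path keeps its leading slash, which is not part of the object key
    return parsed.netloc, parsed.path.lstrip("/")

bucket_name, key = split_s3_uri(
    "s3://aai540-olist-mlops-chris-7f3k2p/monitoring/baselines/data_quality/results/statistics.json"
)
print(bucket_name)
print(key)
```

Unlike `uri.replace(f"s3://{bucket}/", "")`, this does not depend on the `bucket` variable matching the URI, and it fails loudly on a malformed input.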

In [14]:
# M5-3.2a Verify production inference dataset (CSV)

import awswrangler as wr
import pandas as pd

print("Production inference CSV:", prod_infer_uri)

prod_df = wr.s3.read_csv(prod_infer_uri, header=None)
prod_df.columns = FEATURES  # we wrote it without headers

print("✅ prod_infer rows/cols:", prod_df.shape)
print("Columns:", list(prod_df.columns))
prod_df.head(10)

Production inference CSV: s3://aai540-olist-mlops-chris-7f3k2p/modeling/xgb_v1/prod/prod_infer.csv
✅ prod_infer rows/cols: (39777, 8)
Columns: ['num_items', 'total_price', 'total_freight_value', 'num_sellers', 'payment_value', 'payment_installments', 'purchase_dow', 'purchase_hour']


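| | num_items | total_price | total_freight_value | num_sellers | payment_value | payment_installments | purchase_dow | purchase_hour |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 243.9 | 33.71 | 2 | 277.61 | 8 | 3 | 21 |
| 1 | 1 | 86.9 | 13.63 | 1 | 100.53 | 4 | 3 | 21 |
| 2 | 1 | 159.9 | 14.14 | 1 | 174.04 | 4 | 3 | 21 |
| 3 | 1 | 19.9 | 12.48 | 1 | 32.38 | 1 | 3 | 21 |
| 4 | 1 | 149.0 | 45.12 | 1 | 194.12 | 2 | 3 | 21 |
| 5 | 1 | 35.15 | 15.1 | 1 | 50.25 | 2 | 3 | 21 |
| 6 | 1 | 27.9 | 8.27 | 1 | 36.17 | 1 | 3 | 21 |
| 7 | 1 | 15.3 | 11.85 | 1 | 27.15 | 1 | 3 | 21 |
| 8 | 1 | 125.0 | 38.42 | 1 | 163.42 | 2 | 3 | 22 |
| 9 | 1 | 79.9 | 15.31 | 1 | 95.21 | 1 | 3 | 22 |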


In [15]:
# M5-3.2b Verify batch capture artifacts (JSONL)
import boto3, json

s3 = boto3.client("s3", region_name=region)

capture_uri = "s3://aai540-olist-mlops-chris-7f3k2p/monitoring/batch-transform/capture/1770585686/"
prefix = capture_uri.replace(f"s3://{bucket}/", "")

resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
keys = [o["Key"] for o in resp.get("Contents", [])]

print("✅ Capture objects found:", len(keys))
for k in keys[:25]:
    print(k)

# Read the first object (by sorted key) and show its first non-empty line
if not keys:
    raise RuntimeError("No capture files found under: " + capture_uri)

k0 = sorted(keys)[0]
obj = s3.get_object(Bucket=bucket, Key=k0)
text = obj["Body"].read().decode("utf-8", errors="replace")

first_line = next((ln for ln in text.splitlines() if ln.strip()), None)
print("\nFirst capture object:", k0)
print("First line (raw):", first_line[:300], "..." if len(first_line) > 300 else "")

# Try parsing as JSON
try:
    rec = json.loads(first_line)
    print("\n✅ Parsed JSON keys:", list(rec.keys())[:25])
    # Show a couple common locations
    if "inferenceId" in rec:
        print("inferenceId:", rec["inferenceId"])
    if "eventMetadata" in rec:
        print("eventMetadata keys:", list(rec["eventMetadata"].keys()))
except Exception as e:
    print("\n⚠️ Could not parse first line as JSON:", e)

✅ Capture objects found: 2
monitoring/batch-transform/capture/1770585686/input/2026/02/08/21/f1f0c63f-a181-492b-8765-b29a732c06ca.json
monitoring/batch-transform/capture/1770585686/output/2026/02/08/21/a9305a65-6fbb-47aa-b3b7-3a2cc419e728.json

First capture object: monitoring/batch-transform/capture/1770585686/input/2026/02/08/21/f1f0c63f-a181-492b-8765-b29a732c06ca.json
First line (raw): [{"prefix":"s3://aai540-olist-mlops-chris-7f3k2p/modeling/xgb_v1/prod/prod_infer.csv"},""] 

⚠️ Could not parse first line as JSON: 'list' object has no attribute 'keys'


In [16]:
# M5-3.2c Parse capture JSON properly (input + output)
import boto3, json

s3 = boto3.client("s3", region_name=region)

capture_prefix = "monitoring/batch-transform/capture/1770585686/"
resp = s3.list_objects_v2(Bucket=bucket, Prefix=capture_prefix)
keys = sorted([o["Key"] for o in resp.get("Contents", [])])

print("Keys:")
for k in keys:
    print(" -", k)

def read_first_json_record(key):
    obj = s3.get_object(Bucket=bucket, Key=key)
    text = obj["Body"].read().decode("utf-8", errors="replace")
    first_line = next((ln for ln in text.splitlines() if ln.strip()), None)
    data = json.loads(first_line)
    return first_line, data

# Read & show input
input_key = [k for k in keys if "/input/" in k][0]
raw_in, data_in = read_first_json_record(input_key)

print("\n--- INPUT CAPTURE ---")
print("Raw line:", raw_in[:300], "..." if len(raw_in) > 300 else "")
print("Type:", type(data_in))
print("Parsed:", data_in)

# Read & show output
output_key = [k for k in keys if "/output/" in k][0]
raw_out, data_out = read_first_json_record(output_key)

print("\n--- OUTPUT CAPTURE ---")
print("Raw line:", raw_out[:300], "..." if len(raw_out) > 300 else "")
print("Type:", type(data_out))
print("Parsed:", data_out)

Keys:
 - monitoring/batch-transform/capture/1770585686/input/2026/02/08/21/f1f0c63f-a181-492b-8765-b29a732c06ca.json
 - monitoring/batch-transform/capture/1770585686/output/2026/02/08/21/a9305a65-6fbb-47aa-b3b7-3a2cc419e728.json

--- INPUT CAPTURE ---
Raw line: [{"prefix":"s3://aai540-olist-mlops-chris-7f3k2p/modeling/xgb_v1/prod/prod_infer.csv"},""] 
Type: <class 'list'>
Parsed: [{'prefix': 's3://aai540-olist-mlops-chris-7f3k2p/modeling/xgb_v1/prod/prod_infer.csv'}, '']

--- OUTPUT CAPTURE ---
Raw line: [{"prefix":"s3://aai540-olist-mlops-chris-7f3k2p/monitoring/batch-transform/output/1770585686/"},"prod_infer.csv.out"] 
Type: <class 'list'>
Parsed: [{'prefix': 's3://aai540-olist-mlops-chris-7f3k2p/monitoring/batch-transform/output/1770585686/'}, 'prod_infer.csv.out']
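
The capture files above hold manifest-style records rather than per-row payloads: each line is a two-element list with a `prefix` object and a relative name. A minimal sketch that resolves such a line to a full S3 URI, assuming the layout seen in this output and that the two parts are simply concatenated; `resolve_capture_record` is a hypothetical helper:

```python
import json

def resolve_capture_record(line):
    """Resolve one capture line to an S3 URI.

    Assumes the layout observed above:
    [{"prefix": "<s3 uri>"}, "<relative name, possibly empty>"]
    """
    record = json.loads(line)
    if not (isinstance(record, list) and len(record) == 2
            and isinstance(record[0], dict) and "prefix" in record[0]):
        raise ValueError(f"Unexpected capture record: {record!r}")
    return record[0]["prefix"] + record[1]

# The two raw lines printed by the cell above
raw_in = '[{"prefix":"s3://aai540-olist-mlops-chris-7f3k2p/modeling/xgb_v1/prod/prod_infer.csv"},""]'
raw_out = ('[{"prefix":"s3://aai540-olist-mlops-chris-7f3k2p/monitoring/'
           'batch-transform/output/1770585686/"},"prod_infer.csv.out"]')

in_uri = resolve_capture_record(raw_in)
out_uri = resolve_capture_record(raw_out)
print(in_uri)
print(out_uri)
```

Checking the record shape first avoids the `'list' object has no attribute 'keys'` error that appears when the line is treated as a JSON object.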


In [17]:
# M5-4.1 Collect SageMaker job names for infrastructure monitoring

import boto3

sm = boto3.client("sagemaker", region_name=region)

jobs = {
    "TrainingJobs": [],
    "TransformJobs": [],
    "ProcessingJobs": []
}

# Training jobs
for j in sm.list_training_jobs(MaxResults=5)["TrainingJobSummaries"]:
    jobs["TrainingJobs"].append(j["TrainingJobName"])

# Transform jobs
for j in sm.list_transform_jobs(MaxResults=5)["TransformJobSummaries"]:
    jobs["TransformJobs"].append(j["TransformJobName"])

# Processing jobs
for j in sm.list_processing_jobs(MaxResults=5)["ProcessingJobSummaries"]:
    jobs["ProcessingJobs"].append(j["ProcessingJobName"])

jobs

{'TrainingJobs': ['sagemaker-xgboost-2026-02-07-17-53-40-249',
  'sagemaker-xgboost-2026-02-07-16-46-56-471',
  'sagemaker-xgboost-2026-02-07-16-42-34-291',
  'sagemaker-xgboost-2026-01-31-18-00-53-460',
  'a4-1-xgb-1769829697'],
 'TransformJobs': ['sagemaker-xgboost-2026-02-08-21-21-26-827',
  'sagemaker-xgboost-2026-01-31-19-39-44-779',
  'sagemaker-xgboost-2026-01-31-19-29-14-989',
  'a4-1-xgb-batch-1769830620'],
 'ProcessingJobs': ['baseline-suggestion-job-2026-02-08-21-32-41-870',
  'clarify-bias-1770490533',
  'clarify-bias-1770489976',
  'Clarify-Bias-2026-02-07-18-30-09-560',
  'Clarify-Bias-2026-02-07-18-07-08-683']}

In [18]:
# M5-4.3 Create CloudWatch Dashboard for ML Infrastructure
import json
import boto3

cw = boto3.client("cloudwatch", region_name=region)

dashboard_name = "AAI540-Olist-MLops-Dashboard"

dashboard_body = {
    "widgets": [
        {
            "type": "metric",
            "x": 0,
            "y": 0,
            "width": 12,
            "height": 6,
            "properties": {
                "title": "Training Job Duration",
                "metrics": [
                    ["AWS/SageMaker", "TrainingJobDuration",
                     "TrainingJobName", "sagemaker-xgboost-2026-01-31-18-00-53-460"]
                ],
                "stat": "Average",
                "period": 300,
                "region": region
            }
        },
        {
            "type": "metric",
            "x": 12,
            "y": 0,
            "width": 12,
            "height": 6,
            "properties": {
                "title": "Batch Transform Duration",
                "metrics": [
                    ["AWS/SageMaker", "TransformJobDuration",
                     "TransformJobName", "sagemaker-xgboost-2026-02-08-21-21-26-827"]
                ],
                "stat": "Average",
                "period": 300,
                "region": region
            }
        },
        {
            "type": "metric",
            "x": 0,
            "y": 6,
            "width": 12,
            "height": 6,
            "properties": {
                "title": "Processing Job Duration",
                "metrics": [
                    ["AWS/SageMaker", "ProcessingJobDuration",
                     "ProcessingJobName", "baseline-suggestion-job-2026-02-08-21-32-41-870"]
                ],
                "stat": "Average",
                "period": 300,
                "region": region
            }
        },
        {
            "type": "metric",
            "x": 12,
            "y": 6,
            "width": 12,
            "height": 6,
            "properties": {
                "title": "Job Failures (All SageMaker Jobs)",
                "metrics": [
                    ["AWS/SageMaker", "TrainingJobsFailed"],
                    ["AWS/SageMaker", "TransformJobsFailed"],
                    ["AWS/SageMaker", "ProcessingJobsFailed"]
                ],
                "stat": "Sum",
                "period": 300,
                "region": region
            }
        }
    ]
}

cw.put_dashboard(
    DashboardName=dashboard_name,
    DashboardBody=json.dumps(dashboard_body)
)

print("✅ CloudWatch dashboard created:", dashboard_name)

✅ CloudWatch dashboard created: AAI540-Olist-MLops-Dashboard
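
The three duration widgets in the dashboard cell above differ only in title, metric, dimension, job name, and grid position. A sketch of generating them from a small spec list; `metric_widget` is a hypothetical helper, the metric and dimension names are copied from that cell without checking that CloudWatch publishes them, and the region value is an assumption. Nothing here calls `put_dashboard`:

```python
import json

def metric_widget(index, title, metric, dimension, job_name, region,
                  namespace="AWS/SageMaker", stat="Average", period=300):
    """Build one dashboard widget on a two-column grid (24 units wide)."""
    return {
        "type": "metric",
        "x": (index % 2) * 12,   # left or right column
        "y": (index // 2) * 6,   # one row per pair of widgets
        "width": 12,
        "height": 6,
        "properties": {
            "title": title,
            "metrics": [[namespace, metric, dimension, job_name]],
            "stat": stat,
            "period": period,
            "region": region,
        },
    }

# (title, metric, dimension, job name) copied from the dashboard cell above
specs = [
    ("Training Job Duration", "TrainingJobDuration", "TrainingJobName",
     "sagemaker-xgboost-2026-01-31-18-00-53-460"),
    ("Batch Transform Duration", "TransformJobDuration", "TransformJobName",
     "sagemaker-xgboost-2026-02-08-21-21-26-827"),
    ("Processing Job Duration", "ProcessingJobDuration", "ProcessingJobName",
     "baseline-suggestion-job-2026-02-08-21-32-41-870"),
]

region = "us-east-1"  # assumption for this sketch; use the session region in practice
widgets = [metric_widget(i, *spec, region=region) for i, spec in enumerate(specs)]
dashboard_json = json.dumps({"widgets": widgets})
print(len(widgets), "widgets,", len(dashboard_json), "bytes of JSON")
```

A successful `put_dashboard` call only confirms that the dashboard body was accepted, so it is worth opening the dashboard, or listing the namespace's metrics, to confirm each widget actually shows data.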


In [19]:
# M5-5.1 Verify monitoring report artifacts exist

import boto3

s3 = boto3.client("s3", region_name=region)

paths = [
    baseline_statistics_uri,
    baseline_constraints_uri,
    "s3://aai540-olist-mlops-chris-7f3k2p/monitoring/batch-transform/capture/"
]

for p in paths:
    bucket_name, key = p.replace("s3://", "").split("/", 1)
    resp = s3.list_objects_v2(Bucket=bucket_name, Prefix=key)
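    n_objects = resp.get("KeyCount", 0)
    mark = "✅" if n_objects else "❌"  # do not report success for an empty prefix
    print(f"{mark} {p} -> objects:", n_objects)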

✅ s3://aai540-olist-mlops-chris-7f3k2p/monitoring/baselines/data_quality/results/statistics.json -> objects: 1
✅ s3://aai540-olist-mlops-chris-7f3k2p/monitoring/baselines/data_quality/results/constraints.json -> objects: 1
✅ s3://aai540-olist-mlops-chris-7f3k2p/monitoring/batch-transform/capture/ -> objects: 2
