# Part 2: Modeling & Evaluation

## Overview
In this notebook, we load the features prepared in `01_Data_Preparation.ipynb`, train a majority-class baseline, a small logistic-regression benchmark, and an XGBoost model, and evaluate their performance.

### Setup & Configuration
To keep setup fast, we load the pre-split train, validation, and test data directly from S3 instead of re-running the previous notebook.


In [1]:
import sagemaker
import boto3
import os
import pandas as pd
import numpy as np
from sagemaker import image_uris
import awswrangler as wr

# --- Lightweight Setup (Optimized) ---
# Replaces time-consuming %run ./01_Data_Preparation.ipynb

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sm_sess = sagemaker.Session()
bucket = sm_sess.default_bucket()
region = boto3.Session().region_name

print(f"Region: {region}")
print(f"Role: {role}")
print(f"Default Bucket: {bucket}")

# --- Load Dataframes from S3 ---
# Restores variables expected by downstream cells (previously created by 01)
SPLITS_PREFIX = f"s3://{bucket}/datalake/olist/splits/order_level/"

print(f"Loading splits from {SPLITS_PREFIX}...")
df_train = wr.s3.read_parquet(path=SPLITS_PREFIX + "train/")
df_val   = wr.s3.read_parquet(path=SPLITS_PREFIX + "val/")
df_test  = wr.s3.read_parquet(path=SPLITS_PREFIX + "test/")

print("Data loaded successfully.")
print("Train shape:", df_train.shape)


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
Region: us-east-1
Role: arn:aws:iam::587322031938:role/LabRole
Default Bucket: sagemaker-us-east-1-587322031938
Loading splits from s3://sagemaker-us-east-1-587322031938/datalake/olist/splits/order_level/...


2026-02-17 19:39:00,272	INFO worker.py:1852 -- Started a local Ray instance.


Data loaded successfully.
Train shape: (39469, 23)


In [2]:
#imports
import os
import io
import json
import time
import boto3
import numpy as np
import pandas as pd

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report
)
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

import sagemaker
from sagemaker import image_uris
from sagemaker.inputs import TrainingInput
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

# Session / region / role
sm_sess = sagemaker.Session()
region = sm_sess.boto_region_name
role = sagemaker.get_execution_role()

print("Region:", region)
print("Role  :", role)


Region: us-east-1
Role  : arn:aws:iam::587322031938:role/LabRole


In [3]:
# Check: make sure that these exist from Data_Preparation
assert "df_train" in globals(), "df_train not found. Ensure 01_Data_Preparation.ipynb ran successfully."
assert "df_test" in globals(), "df_test not found. Ensure 01_Data_Preparation.ipynb ran successfully."
assert "df_val" in globals(), "df_val not found. Ensure 01_Data_Preparation.ipynb ran successfully."

label_col = "label_satisfied"
for _df, _name in [(df_train,"df_train"), (df_test,"df_test"), (df_val,"df_val")]:
    assert label_col in _df.columns, f"{label_col} missing from {_name}"

print("Train shape:", df_train.shape)
print("Test  shape:", df_test.shape)
print("Val   shape:", df_val.shape)
print("Label prevalence (train):", df_train[label_col].mean().round(4))


Train shape: (39469, 23)
Test  shape: (9867, 23)
Val   shape: (9867, 23)
Label prevalence (train): 0.7717


In [4]:
# Feature configuration 
num_features = [
    "total_items",
    "total_price",
    "total_freight",
    "payment_value_sum",
    "payment_installments_max",
    "delivery_time_days",
    "estimated_time_days",
    "delivered_late",
]
cat_features = ["customer_state", "payment_types"]

# Keep only existing columns 
num_features = [c for c in num_features if c in df_train.columns]
cat_features = [c for c in cat_features if c in df_train.columns]

print("Numeric features:", num_features)
print("Categorical features:", cat_features)

def make_model_frame(df: pd.DataFrame) -> pd.DataFrame:
    cols = num_features + cat_features + [label_col]
    out = df[cols].copy()
    # Ensure types
    for c in num_features:
        out[c] = pd.to_numeric(out[c], errors="coerce")
    out[num_features] = out[num_features].fillna(out[num_features].median(numeric_only=True))
    for c in cat_features:
        out[c] = out[c].fillna("UNK").astype(str)
    out[label_col] = out[label_col].astype(int)
    return out

train_df = make_model_frame(df_train)
test_df  = make_model_frame(df_test)
val_df   = make_model_frame(df_val)

train_df.head()


Numeric features: ['total_items', 'total_price', 'total_freight', 'payment_value_sum', 'payment_installments_max', 'delivery_time_days', 'estimated_time_days', 'delivered_late']
Categorical features: ['customer_state', 'payment_types']


| | total_items | total_price | total_freight | payment_value_sum | payment_installments_max | delivery_time_days | estimated_time_days | delivered_late | customer_state | payment_types | label_satisfied |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 72.89 | 63.34 | 136.23 | 1.0 | 10.206227 | 45.114363 | 0 | RR | credit_card | 0 |
| 1 | 1.0 | 59.5 | 15.56 | 75.06 | 3.0 | 10.206227 | 52.98919 | 0 | RS | credit_card | 0 |
| 2 | 1.0 | 86.9 | 17.16 | 40.95 | 2.0 | 10.206227 | 16.358113 | 0 | SP | credit_card | 0 |
| 3 | 3.0 | 134.97 | 8.49 | 105.28 | 2.0 | 54.813194 | 18.488449 | 1 | SP | UNK | 0 |
| 4 | 1.0 | 100.0 | 9.34 | 109.34 | 1.0 | 10.206227 | 22.07787 | 0 | SP | credit_card | 0 |


### Baseline Model
As a benchmark, we use a majority-class baseline (`DummyClassifier` with `strategy="most_frequent"`) that always predicts an order will be **Satisfied** (Class 1), the majority class (~77% of the training set).

**Performance Note:**
- Because the model always predicts the positive class (1), it achieves **perfect Recall (1.0)** for that class.
- However, it **never identifies an Unsatisfied order (Class 0)**, which is the critical minority case for this business problem.
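
To make the performance note concrete, here is a minimal pure-Python sketch of a majority-class predictor on toy labels. The helper names (`majority_class_predictions`, `recall_for_class`) and the toy data are assumptions made for illustration; the notebook itself uses scikit-learn's `DummyClassifier`.

```python
from collections import Counter

def majority_class_predictions(y_train, n_predictions):
    """Predict the most frequent training label for every row."""
    majority_label = Counter(y_train).most_common(1)[0][0]
    return [majority_label] * n_predictions

def recall_for_class(y_true, y_pred, cls):
    """Fraction of true `cls` rows that were predicted as `cls`."""
    preds_for_cls = [p for t, p in zip(y_true, y_pred) if t == cls]
    if not preds_for_cls:
        return 0.0
    return sum(p == cls for p in preds_for_cls) / len(preds_for_cls)

# Toy labels (hypothetical): 1 = satisfied, 0 = not satisfied
y_train_toy = [1, 1, 1, 0, 1, 1, 0, 1]
y_test_toy = [1, 0, 1, 1]

pred = majority_class_predictions(y_train_toy, len(y_test_toy))
recall_satisfied = recall_for_class(y_test_toy, pred, cls=1)
recall_unsatisfied = recall_for_class(y_test_toy, pred, cls=0)

print("Recall, Class 1:", recall_satisfied)
print("Recall, Class 0:", recall_unsatisfied)
```

Every Class 1 row is recovered and no Class 0 row is, which is why accuracy and positive-class recall alone can flatter this kind of baseline.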


In [5]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

X_train = train_df.drop(columns=[label_col])
y_train = train_df[label_col].values

X_test = test_df.drop(columns=[label_col])
y_test = test_df[label_col].values

# Benchmark A: majority-class baseline
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
pred_dummy = dummy.predict(X_test)

def classification_metrics(y_true, y_pred, y_score=None):
    out = {
        "accuracy": float(accuracy_score(y_true, y_pred)),
        "precision": float(precision_score(y_true, y_pred, zero_division=0)),
        "recall": float(recall_score(y_true, y_pred, zero_division=0)),
        "f1": float(f1_score(y_true, y_pred, zero_division=0)),
    }
    if y_score is not None:
        try:
            out["roc_auc"] = float(roc_auc_score(y_true, y_score))
        except Exception:
            pass
    return out

metrics_dummy = classification_metrics(y_test, pred_dummy)
print("Benchmark A — DummyClassifier:", metrics_dummy)


Benchmark A — DummyClassifier: {'accuracy': 0.7559541907367995, 'precision': 0.7559541907367995, 'recall': 1.0, 'f1': 0.8610181230520605}


In [6]:
# Benchmark B: logistic regression on features
# delivered_late + delivery_time_days + total_price
tiny_feats = [c for c in ["delivered_late", "delivery_time_days", "total_price"] if c in X_train.columns]
assert len(tiny_feats) >= 1, "No tiny benchmark features found; adjust tiny_feats list."

pre = ColumnTransformer(
    transformers=[
        ("num", "passthrough", tiny_feats),
    ],
    remainder="drop",
)

bench_lr = Pipeline(steps=[
    ("pre", pre),
    ("clf", LogisticRegression(max_iter=200, n_jobs=None)),
])
bench_lr.fit(X_train, y_train)

pred_lr = bench_lr.predict(X_test)
proba_lr = None
if hasattr(bench_lr.named_steps["clf"], "predict_proba"):
    proba_lr = bench_lr.predict_proba(X_test)[:, 1]

metrics_lr = classification_metrics(y_test, pred_lr, y_score=proba_lr)
print("Benchmark B — Tiny LogisticRegression:", metrics_lr)


Benchmark B — Tiny LogisticRegression: {'accuracy': 0.7876760920239181, 'precision': 0.7911419887103778, 'recall': 0.9770746748893954, 'f1': 0.8743326735048887, 'roc_auc': 0.639013567635967}
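
Benchmark B is the first model here with a `roc_auc` value, because `classification_metrics` only computes it when scores are passed. One way to read the number: ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way on toy data. `roc_auc_pairwise` is a hypothetical helper, not a scikit-learn function, and its quadratic loop is meant for illustration only.

```python
def roc_auc_pairwise(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("Both classes must be present to compute ROC AUC.")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical labels and scores)
toy_labels = [1, 1, 0, 1, 0]
toy_scores = [0.9, 0.6, 0.7, 0.8, 0.2]
auc = roc_auc_pairwise(toy_labels, toy_scores)

# A constant score cannot rank anything, so every pair is a tie
auc_constant = roc_auc_pairwise([1, 0, 1], [1.0, 1.0, 1.0])

print("Toy AUC:", auc)
print("Constant-score AUC:", auc_constant)
```

Under this reading, a model that outputs the same score for every order sits at 0.5, which is the reference point for interpreting the AUC values reported in this notebook.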


In [7]:
# Full preprocessing for model training 
ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)

preprocess = ColumnTransformer(
    transformers=[
        ("num", "passthrough", num_features),
        ("cat", ohe, cat_features),
    ],
    remainder="drop",
)

# Fit on train only
X_train_mat = preprocess.fit_transform(X_train)
X_val_mat   = preprocess.transform(val_df.drop(columns=[label_col]))
X_test_mat  = preprocess.transform(X_test)

y_val = val_df[label_col].values

# Helper to create XGBoost CSV 
def to_xgb_csv(X_mat, y_vec) -> pd.DataFrame:
    y_vec = np.asarray(y_vec).reshape(-1, 1)
    arr = np.hstack([y_vec, X_mat])
    return pd.DataFrame(arr)

train_xgb = to_xgb_csv(X_train_mat, y_train)
val_xgb   = to_xgb_csv(X_val_mat, y_val)
test_xgb  = to_xgb_csv(X_test_mat, y_test)

train_xgb.shape, val_xgb.shape, test_xgb.shape


((39469, 42), (9867, 42), (9867, 42))
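
The `to_xgb_csv` helper above places the label in column 0 ahead of the features, and the upload cell that follows writes the files without a header or index. That matches the CSV layout the SageMaker built-in XGBoost algorithm expects for training: target in the first column, no header row. A minimal standard-library sketch of the same layout, where `rows_to_xgb_csv` and the toy rows are assumptions for illustration:

```python
import csv
import io

def rows_to_xgb_csv(features, labels):
    """Serialize rows as headerless CSV text with the label in the first column."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same number of rows")
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for y, row in zip(labels, features):
        writer.writerow([y, *row])  # label first, then the feature values
    return buf.getvalue()

# Toy rows (hypothetical): two features per order
toy_features = [[2.0, 72.89], [1.0, 59.5]]
toy_labels = [0, 1]

csv_text = rows_to_xgb_csv(toy_features, toy_labels)
print(csv_text)
```

For Batch Transform the label column is dropped again, since the model should only see features at inference time.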

In [8]:
import awswrangler as wr

if "DATALAKE_BUCKET" in globals() and isinstance(DATALAKE_BUCKET, str) and len(DATALAKE_BUCKET) > 0:
    bucket = DATALAKE_BUCKET
else:
    bucket = sm_sess.default_bucket()

base_prefix = f"s3://{bucket}/modeling/xgb-baseline/"

train_prefix = base_prefix + "train/"
val_prefix   = base_prefix + "val/"
test_prefix  = base_prefix + "test/"

# Writes one or more CSV files under each prefix, overwriting existing data
wr.s3.to_csv(train_xgb, path=train_prefix, index=False, header=False, dataset=True, mode="overwrite")
wr.s3.to_csv(val_xgb,   path=val_prefix,   index=False, header=False, dataset=True, mode="overwrite")
wr.s3.to_csv(test_xgb,  path=test_prefix,  index=False, header=False, dataset=True, mode="overwrite")

print("Uploaded dataset prefixes:")
print("  train:", train_prefix)
print("  val  :", val_prefix)
print("  test :", test_prefix)



Uploaded dataset prefixes:
  train: s3://sagemaker-us-east-1-587322031938/modeling/xgb-baseline/train/
  val  : s3://sagemaker-us-east-1-587322031938/modeling/xgb-baseline/val/
  test : s3://sagemaker-us-east-1-587322031938/modeling/xgb-baseline/test/


### First Iteration Model (XGBoost v1)
We trained a first-pass XGBoost binary classifier in Amazon SageMaker using a limited set of engineered features related to order size, payment behavior, delivery timing, and customer state.

The model was evaluated using SageMaker Batch Transform on the held-out test dataset. Batch Transform was selected over a real-time endpoint to minimize cost and ensure automatic resource cleanup.
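
Batch Transform writes its predictions to S3 as text. The evaluation cell later in this notebook reads one score per line and applies a 0.5 cutoff, then compares the result with `y_test` row by row, so it relies on the output having one line per test row in the same order. The sketch below shows that parsing step in isolation. `parse_scores`, `apply_threshold`, and the sample text are hypothetical and stand in for a real `.out` file.

```python
def parse_scores(raw_text):
    """Read one probability per non-empty line, taking the first comma-separated field."""
    return [
        float(line.split(",")[0])
        for line in raw_text.strip().splitlines()
        if line.strip()
    ]

def apply_threshold(scores, cutoff=0.5):
    """Convert probabilities to 0/1 predictions; scores at the cutoff count as 1."""
    return [int(s >= cutoff) for s in scores]

# Hypothetical transform output for three test rows
sample_output = "0.91\n0.12\n0.50\n"
expected_rows = 3

scores = parse_scores(sample_output)
if len(scores) != expected_rows:
    raise ValueError("Row count mismatch between transform output and test set")

predictions = apply_threshold(scores)
print(predictions)
```

A row-count check like the one above is a cheap guard against scoring against a stale or partial output file.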


In [9]:
# Define paths used in training/validation
s3_train = train_prefix
s3_val   = val_prefix
s3_test  = test_prefix

output_path = f"s3://{bucket}/modeling/output"
transform_output = base_prefix + "transform-output/"

# Image URI
xgb_image = image_uris.retrieve(
    framework="xgboost",
    region=region,
    version="1.5-1"
)

# Estimator Definition
xgb = sagemaker.estimator.Estimator(
    image_uri=xgb_image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=output_path,
    sagemaker_session=sm_sess
)

xgb.set_hyperparameters(
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.8,
    objective="binary:logistic",
    num_round=50
)

# Transformer Definition
transformer = xgb.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=transform_output
)


No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config


In [10]:
train_input = TrainingInput(s3_data=os.path.dirname(s3_train) + "/", content_type="text/csv")
val_input   = TrainingInput(s3_data=os.path.dirname(s3_val) + "/",   content_type="text/csv")

# --- COST SAFETY CHECK ---
import boto3
import time
from urllib.parse import urlparse
from sagemaker.model import Model
s3_client = boto3.client('s3')

def check_s3_prefix_has_contents(bucket_name, prefix):
    resp = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=prefix)
    return resp.get('KeyCount', 0) > 0

# Parse the output path defined in previous cells
p = urlparse(output_path)
out_bucket = p.netloc
out_key_prefix = p.path.lstrip('/')

if check_s3_prefix_has_contents(out_bucket, out_key_prefix):
    print(f'Found existing training artifacts in {output_path}. Skipping Training to save cost.')
    # Find latest model artifact
    resp = s3_client.list_objects_v2(Bucket=out_bucket, Prefix=out_key_prefix)
    contents = sorted(resp.get('Contents', []), key=lambda x: x['LastModified'], reverse=True)
    model_uri = None
    for c in contents:
        if c['Key'].endswith('/output/model.tar.gz'):
            model_uri = f's3://{out_bucket}/{c["Key"]}'
            break
    if model_uri:
        print(f'   Using latest model artifact: {model_uri}')
        # Recreate Estimator/Model so next cells work
        xgb_model = Model(
            image_uri=xgb_image,
            model_data=model_uri,
            role=role,
            sagemaker_session=sm_sess
        )
        xgb_model.create()
        # Swap xgb (Estimator) to xgb_model (Model) for transformer usage
        xgb = xgb_model
    else:
        print('   Output dir exists but no model found. Retraining...')
        xgb.fit({'train': train_input, 'validation': val_input}, logs=False)
else:
    print('No existing training artifacts found. Starting Training...')
    xgb.fit({'train': train_input, 'validation': val_input}, logs=False)

# --- FIX: Re-instantiate Transformer ---
# Ensure transformer uses the correct model (whether trained now or loaded from S3)
transformer = xgb.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=transform_output
)
print("Re-created transformer object linked to current model.")

INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2026-02-17-19-39-09-126


No existing training artifacts found. Starting Training...

2026-02-17 19:39:10 Starting - Starting the training job..
2026-02-17 19:39:25 Starting - Preparing the instances for training..
2026-02-17 19:39:42 Downloading - Downloading input data....
2026-02-17 19:40:08 Downloading - Downloading the training image........
2026-02-17 19:40:53 Training - Training image download completed. Training in progress...
2026-02-17 19:41:09 Uploading - Uploading generated training model..
2026-02-17 19:41:22 Completed - Training job completed

INFO:sagemaker:Creating model with name: sagemaker-xgboost-2026-02-17-19-41-25-768



Re-created transformer object linked to current model.


In [13]:
# For transform, we provide features only
test_features_only = test_xgb.drop(columns=[0])  

# Write as a dataset under a prefix 
test_features_prefix = f"{base_prefix}test/features_only/"

wr.s3.to_csv(
    test_features_only,
    path=test_features_prefix,
    index=False,
    header=False,
    dataset=True,
    mode="overwrite",
)

print("Transform input prefix (features):", test_features_prefix)
print("Transform output path            :", transform_output)

# --- COST SAFETY CHECK ---
t_parse = urlparse(transform_output)
t_bucket = t_parse.netloc
t_prefix = t_parse.path.lstrip('/')

if check_s3_prefix_has_contents(t_bucket, t_prefix):
    print(f'Found existing transform output in {transform_output}. Skipping Transform.')
else:
    print('No existing transform output. Starting Batch Transform...')
    transformer.transform(
        data=test_features_prefix,
        content_type='text/csv',
        split_type='Line',
    )
    transformer.wait()
    print('Batch transform complete.')


INFO:sagemaker:Creating transform job with name: sagemaker-xgboost-2026-02-17-19-48-56-610


Transform input prefix (features): s3://sagemaker-us-east-1-587322031938/modeling/xgb-baseline/test/features_only/
Transform output path            : s3://sagemaker-us-east-1-587322031938/modeling/xgb-baseline/transform-output/
No existing transform output. Starting Batch Transform...

In [14]:

# List objects under the transform output prefix to find the output file
s3 = boto3.client("s3")
parsed_out = urlparse(transform_output)
out_prefix = parsed_out.path.lstrip('/')

resp = s3.list_objects_v2(Bucket=bucket, Prefix=out_prefix)
keys = [obj["Key"] for obj in resp.get("Contents", [])]
print("Output objects:", keys)

# Batch Transform names each output object after its input file with ".out" appended
candidate = next((k for k in keys if k.endswith(".out")), None)
if candidate is None:
    raise RuntimeError("Could not find batch transform output .out file. Check S3 output prefix listing above.")

print("Using output file:", candidate)

obj = s3.get_object(Bucket=bucket, Key=candidate)
raw = obj["Body"].read().decode("utf-8").strip().splitlines()

# Each line is a probability 
y_score = np.array([float(x.strip().split(",")[0]) for x in raw])
y_pred = (y_score >= 0.5).astype(int)

metrics_xgb = classification_metrics(y_test, y_pred, y_score=y_score)

print("SageMaker XGBoost metrics:", metrics_xgb)
print()
print("Classification report:")
print(classification_report(y_test, y_pred, digits=4))


Output objects: ['modeling/xgb-baseline/transform-output/6fd3245b847643078fcadf0d7767f540.csv.out']
Using output file: modeling/xgb-baseline/transform-output/6fd3245b847643078fcadf0d7767f540.csv.out
SageMaker XGBoost metrics: {'accuracy': 0.8098712881321577, 'precision': 0.8079426365140651, 'recall': 0.9819010591232069, 'f1': 0.8864681675139191, 'roc_auc': 0.7298518167310201}

Classification report:
              precision    recall  f1-score   support

           0     0.8317    0.2770    0.4156      2408
           1     0.8079    0.9819    0.8865      7459

    accuracy                         0.8099      9867
   macro avg     0.8198    0.6294    0.6510      9867
weighted avg     0.8137    0.8099    0.7715      9867



In [15]:
# Side-by-side comparison
compare = pd.DataFrame([
    {"model": "Benchmark A: Dummy (most_frequent)", **metrics_dummy},
    {"model": f"Benchmark B: Tiny LR ({', '.join(tiny_feats)})", **metrics_lr},
    {"model": "SageMaker: XGBoost (batch transform)", **metrics_xgb},
])

# Reorder columns
cols = ["model"] + [c for c in ["accuracy","precision","recall","f1","roc_auc"] if c in compare.columns]
compare = compare[cols]
compare


| | model | accuracy | precision | recall | f1 | roc_auc |
|---|---|---|---|---|---|---|
| 0 | Benchmark A: Dummy (most_frequent) | 0.755954 | 0.755954 | 1.0 | 0.861018 | |
| 1 | Benchmark B: Tiny LR (delivered_late, delivery... | 0.787676 | 0.791142 | 0.977075 | 0.874333 | 0.639014 |
| 2 | SageMaker: XGBoost (batch transform) | 0.809871 | 0.807943 | 0.981901 | 0.886468 | 0.729852 |




### Results Summary
- The XGBoost model achieved a **ROC AUC of approximately 0.73**, compared with 0.64 for the logistic-regression benchmark, so it ranks satisfied and unsatisfied orders better than the simpler model.
- **Accuracy (0.81)** exceeded both the majority-class baseline (0.76) and the logistic-regression benchmark (0.79).
- For the positive class (Satisfied), **Precision (~0.81)** and **Recall (~0.98)** were strong. For the minority class (Unsatisfied), precision was ~0.83 but recall was only ~0.28 at the 0.5 threshold, so most unsatisfied orders are still missed.
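
The per-class numbers are easier to see as a confusion matrix. The counts below are not printed anywhere in the notebook. They are reconstructed from the classification report above (support of 2408 and 7459, together with the reported recalls), so treat them as an inference. The sketch recomputes the metrics from those counts, with `safe_div` as a hypothetical helper.

```python
# Confusion-matrix counts inferred from the classification report (an assumption,
# derived from support 2408 / 7459 and the reported per-class recall values).
tn, fp = 667, 1741    # true Class 0: predicted 0, predicted 1
fn, tp = 135, 7324    # true Class 1: predicted 0, predicted 1

def safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0

recall_0 = safe_div(tn, tn + fp)        # share of unsatisfied orders that were caught
precision_0 = safe_div(tn, tn + fn)     # share of "unsatisfied" predictions that were right
recall_1 = safe_div(tp, tp + fn)
precision_1 = safe_div(tp, tp + fp)
accuracy = safe_div(tp + tn, tp + tn + fp + fn)
balanced_accuracy = (recall_0 + recall_1) / 2

print(f"Class 0  precision={precision_0:.4f}  recall={recall_0:.4f}")
print(f"Class 1  precision={precision_1:.4f}  recall={recall_1:.4f}")
print(f"accuracy={accuracy:.4f}  balanced_accuracy={balanced_accuracy:.4f}")
```

The balanced accuracy equals the macro-average recall in the report, and it sits well below plain accuracy because of the low Class 0 recall.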
