# Amazon SageMaker Batch Transform: Associate prediction results with their corresponding input records


---

_**Use SageMaker's XGBoost to train a binary classification model, then predict whether each tumor in a batch file is malignant**_

_**It also shows in detail how to use the input/output joining and filtering features of Batch Transform**_

---



## Background
The purpose of this notebook is to train a model using SageMaker's XGBoost and UCI's breast cancer diagnostic data set to illustrate how to run batch inference and how to use the Batch Transform I/O join feature. UCI's breast cancer diagnostic data set is available at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29. The data set is also available on Kaggle at https://www.kaggle.com/uciml/breast-cancer-wisconsin-data. The goal is to use this data set to build a predictive model of whether a breast mass image indicates a benign or malignant tumor.



---

## Setup

Let's start by specifying:

* The SageMaker role ARN used to give training and batch transform access to your data. The snippet below uses the same role as your SageMaker notebook instance. Otherwise, specify the full ARN of a role with the AmazonSageMakerFullAccess policy attached.
* The S3 bucket that you want to use for training and storing model objects.

In [1]:
# !pip3 install -U sagemaker

In [2]:
import os
import boto3
import sagemaker

role = sagemaker.get_execution_role()
sess = sagemaker.Session()
region = sess.boto_region_name

bucket = sess.default_bucket()
prefix = "DEMO-breast-cancer-prediction-xgboost-highlevel"

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


---
## Data sources

> Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

> Breast Cancer Wisconsin (Diagnostic) Data Set [https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)].

> _Also see:_ Breast Cancer Wisconsin (Diagnostic) Data Set [https://www.kaggle.com/uciml/breast-cancer-wisconsin-data].

## Data preparation


Let's download the data, save it locally as data.csv, and take a look at it.

In [3]:
import pandas as pd
import numpy as np

s3 = boto3.client("s3")

filename = "wdbc.csv"
s3.download_file(
    f"sagemaker-example-files-prod-{region}", "datasets/tabular/breast_cancer/wdbc.csv", filename
)
data = pd.read_csv(filename, header=None)

# specify columns extracted from wdbc.names
data.columns = [
    "id",
    "diagnosis",
    "radius_mean",
    "texture_mean",
    "perimeter_mean",
    "area_mean",
    "smoothness_mean",
    "compactness_mean",
    "concavity_mean",
    "concave points_mean",
    "symmetry_mean",
    "fractal_dimension_mean",
    "radius_se",
    "texture_se",
    "perimeter_se",
    "area_se",
    "smoothness_se",
    "compactness_se",
    "concavity_se",
    "concave points_se",
    "symmetry_se",
    "fractal_dimension_se",
    "radius_worst",
    "texture_worst",
    "perimeter_worst",
    "area_worst",
    "smoothness_worst",
    "compactness_worst",
    "concavity_worst",
    "concave points_worst",
    "symmetry_worst",
    "fractal_dimension_worst",
]

# save the data
data.to_csv("data.csv", sep=",", index=False)

data.sample(8)

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
349,899147,B,11.95,14.96,77.23,426.7,0.1158,0.1206,0.01171,0.01787,...,12.81,17.72,83.09,496.2,0.1293,0.1885,0.03122,0.04766,0.3124,0.0759
436,908916,B,12.87,19.54,82.67,509.2,0.09136,0.07883,0.01797,0.0209,...,14.45,24.38,95.14,626.9,0.1214,0.1652,0.07127,0.06384,0.3313,0.07735
282,89122,M,19.4,18.18,127.2,1145.0,0.1037,0.1442,0.1626,0.09464,...,23.79,28.65,152.4,1628.0,0.1518,0.3749,0.4316,0.2252,0.359,0.07787
371,9012568,B,15.19,13.21,97.65,711.8,0.07963,0.06934,0.03393,0.02657,...,16.2,15.73,104.5,819.1,0.1126,0.1737,0.1362,0.08178,0.2487,0.06766
139,868871,B,11.28,13.39,73.0,384.8,0.1164,0.1136,0.04635,0.04796,...,11.92,15.77,76.53,434.0,0.1367,0.1822,0.08669,0.08611,0.2102,0.06784
30,853401,M,18.63,25.11,124.8,1088.0,0.1064,0.1887,0.2319,0.1244,...,23.15,34.01,160.5,1670.0,0.1491,0.4257,0.6133,0.1848,0.3444,0.09782
308,893526,B,13.5,12.71,85.69,566.2,0.07376,0.03614,0.002758,0.004419,...,14.97,16.94,95.48,698.7,0.09023,0.05836,0.01379,0.0221,0.2267,0.06192
94,862028,M,15.06,19.83,100.3,705.6,0.1039,0.1553,0.17,0.08815,...,18.23,24.23,123.5,1025.0,0.1551,0.4203,0.5203,0.2115,0.2834,0.08234


#### Key observations:
* The data has 569 observations and 32 columns.
* The first field is the 'id' attribute that we will want to drop before batch inference and add to the final inference output next to the probability of malignancy.
* The second field, 'diagnosis', is an indicator of the actual diagnosis ('M' = Malignant; 'B' = Benign).
* There are 30 other numeric features that we will use for training and inference.

Let's replace the M/B diagnosis with a 1/0 boolean value. 

In [4]:
# Encode the label as an integer: 1 for malignant ("M"), 0 for benign ("B")
data["diagnosis"] = (data["diagnosis"] == "M").astype(int)
data.sample(8)

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
312,89382602,0,12.76,13.37,82.29,504.1,0.08794,0.07948,0.04052,0.02548,...,14.19,16.4,92.04,618.8,0.1194,0.2208,0.1769,0.08411,0.2564,0.08253
497,914580,0,12.47,17.31,80.45,480.1,0.08928,0.0763,0.03609,0.02369,...,14.06,24.34,92.82,607.3,0.1276,0.2506,0.2028,0.1053,0.3035,0.07661
53,857392,1,18.22,18.7,120.3,1033.0,0.1148,0.1485,0.1772,0.106,...,20.6,24.13,135.1,1321.0,0.128,0.2297,0.2623,0.1325,0.3021,0.07987
420,906539,0,11.57,19.04,74.2,409.7,0.08546,0.07722,0.05485,0.01428,...,13.07,26.98,86.43,520.5,0.1249,0.1937,0.256,0.06664,0.3035,0.08284
91,861799,1,15.37,22.76,100.2,728.2,0.092,0.1036,0.1122,0.07483,...,16.43,25.84,107.5,830.9,0.1257,0.1997,0.2846,0.1476,0.2556,0.06828
162,871201,1,19.59,18.15,130.7,1214.0,0.112,0.1666,0.2508,0.1286,...,26.73,26.39,174.9,2232.0,0.1438,0.3846,0.681,0.2247,0.3643,0.09223
105,863030,1,13.11,15.56,87.21,530.2,0.1398,0.1765,0.2071,0.09601,...,16.31,22.4,106.4,827.2,0.1862,0.4099,0.6376,0.1986,0.3147,0.1405
240,88350402,0,13.64,15.6,87.38,575.3,0.09423,0.0663,0.04705,0.03731,...,14.85,19.05,94.11,683.4,0.1278,0.1291,0.1533,0.09222,0.253,0.0651


Let's split the data as follows: 80% for training, 10% for validation, and 10% set aside for our batch inference job. We drop the 'id' field from the training and validation sets because 'id' is not a training feature. For the batch set, however, we keep 'id': we'll filter it out just before inference so that the input features match those of the training set, and then join it back to the inference results. We do drop the 'diagnosis' attribute from the batch set, since that is what we are trying to predict.

In [5]:
# data split in three sets, training, validation and batch inference
rand_split = np.random.rand(len(data))
train_list = rand_split < 0.8
val_list = (rand_split >= 0.8) & (rand_split < 0.9)
batch_list = rand_split >= 0.9

data_train = data[train_list].drop(["id"], axis=1)
data_val = data[val_list].drop(["id"], axis=1)
data_batch = data[batch_list].drop(["diagnosis"], axis=1)
data_batch_noID = data_batch.drop(["id"], axis=1)

In [6]:
len(rand_split), len(train_list), len(val_list), len(batch_list)

(569, 569, 569, 569)
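
Note that `len()` of a boolean mask is always the total number of rows (569 here), so the check above does not reveal how many rows landed in each split. A minimal sketch of counting the selected rows instead, using plain Python lists as a stand-in for the NumPy masks (the fixed seed and the names `draws` and `sizes` are illustrative assumptions, not part of the notebook):

```python
import random

rng = random.Random(0)  # fixed seed, only so the sketch is repeatable
n_rows = 569
draws = [rng.random() for _ in range(n_rows)]

# Same thresholds as the notebook's split
train_mask = [d < 0.8 for d in draws]
val_mask = [0.8 <= d < 0.9 for d in draws]
batch_mask = [d >= 0.9 for d in draws]

# len() reports the mask length; sum() counts the True entries
sizes = {
    "train": sum(train_mask),
    "validation": sum(val_mask),
    "batch": sum(batch_mask),
}
print(len(train_mask), sizes)
```

With the notebook's NumPy masks, `train_list.sum()` should give the equivalent count.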

Let's upload those data sets to S3.

In [7]:
train_file = "train_data.csv"
data_train.to_csv(train_file, index=False, header=False)
sess.upload_data(train_file, key_prefix="{}/train".format(prefix))

validation_file = "validation_data.csv"
data_val.to_csv(validation_file, index=False, header=False)
sess.upload_data(validation_file, key_prefix="{}/validation".format(prefix))

batch_file = "batch_data.csv"
data_batch.to_csv(batch_file, index=False, header=False)
sess.upload_data(batch_file, key_prefix="{}/batch".format(prefix))

batch_file_noID = "batch_data_noID.csv"
data_batch_noID.to_csv(batch_file_noID, index=False, header=False)
sess.upload_data(batch_file_noID, key_prefix="{}/batch".format(prefix))

's3://sagemaker-us-east-1-087941614028/DEMO-breast-cancer-prediction-xgboost-highlevel/batch/batch_data_noID.csv'

---

## Training job and model creation

The cell below uses the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) to kick off the training job using both our training set and validation set. Note that the objective is set to 'binary:logistic', which trains a model to output a probability between 0 and 1 (here, the probability of a tumor being malignant).
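
As a rough illustration of what a logistic objective implies, the model's raw score (often called the margin, or log-odds) is passed through the logistic function to produce a probability. The sketch below is generic math rather than SageMaker or XGBoost code, and the margin values are made up:

```python
import math


def logistic(margin: float) -> float:
    """Map a raw score (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


for margin in (-3.0, 0.0, 3.0):  # hypothetical raw scores
    print(margin, round(logistic(margin), 4))
```

A margin of zero corresponds to a probability of 0.5; large positive margins approach 1 and large negative margins approach 0.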

In [8]:
%%time
from time import gmtime, strftime

job_name = "xgb-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
output_location = "s3://{}/{}/output/{}".format(bucket, prefix, job_name)
image = sagemaker.image_uris.retrieve(
    framework="xgboost", region=boto3.Session().region_name, version="1.7-1"
)

sm_estimator = sagemaker.estimator.Estimator(
    image,
    role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size=50,
    input_mode="File",
    output_path=output_location,
    sagemaker_session=sess,
)

sm_estimator.set_hyperparameters(
    objective="binary:logistic",
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.8,
    verbosity=0,
    num_round=100,
)

train_data = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/train".format(bucket, prefix),
    distribution="FullyReplicated",
    content_type="text/csv",
    s3_data_type="S3Prefix",
)
validation_data = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/validation".format(bucket, prefix),
    distribution="FullyReplicated",
    content_type="text/csv",
    s3_data_type="S3Prefix",
)
data_channels = {"train": train_data, "validation": validation_data}

# Start training the model by calling the fit method on the estimator
sm_estimator.fit(inputs=data_channels, job_name=job_name, logs=True)

INFO:sagemaker:Creating training-job with name: xgb-2026-02-20-15-23-46


2026-02-20 15:23:51 Starting - Starting the training job...
2026-02-20 15:24:05 Starting - Preparing the instances for training...
2026-02-20 15:24:52 Downloading - Downloading the training image......
  import pkg_resources
[2026-02-20 15:25:50.722 ip-10-2-238-106.ec2.internal:7 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2026-02-20 15:25:50.804 ip-10-2-238-106.ec2.internal:7 INFO profiler_config_parser.py:111] User has disabled profiler.
[2026-02-20:15:25:51:INFO] Imported framework sagemaker_xgboost_container.training
[2026-02-20:15:25:51:INFO] Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
[2026-02-20:15:25:51:INFO] No GPUs detected (normal if no gpus installed)
[2026-02-20:15:25:51:INFO] Running XGBoost Sagemaker in algorithm mode
[2026-02-20:15:25:51:INFO] Determined 0 GPU(s) available on the instance.
[2026-02-20:15:25:51:INFO] Determine

In [9]:
job_name

'xgb-2026-02-20-15-23-46'

In [10]:
sm_estimator

<sagemaker.estimator.Estimator at 0x7fad6d44f3e0>

In [11]:
data_channels

{'train': <sagemaker.inputs.TrainingInput at 0x7fad6ddf6330>,
 'validation': <sagemaker.inputs.TrainingInput at 0x7fad6c6daf90>}

In [12]:
desc = sm_estimator.latest_training_job.describe()
desc

{'TrainingJobName': 'xgb-2026-02-20-15-23-46',
 'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:087941614028:training-job/xgb-2026-02-20-15-23-46',
 'ModelArtifacts': {'S3ModelArtifacts': 's3://sagemaker-us-east-1-087941614028/DEMO-breast-cancer-prediction-xgboost-highlevel/output/xgb-2026-02-20-15-23-46/xgb-2026-02-20-15-23-46/output/model.tar.gz'},
 'TrainingJobStatus': 'Completed',
 'SecondaryStatus': 'Completed',
 'HyperParameters': {'eta': '0.2',
  'gamma': '4',
  'max_depth': '5',
  'min_child_weight': '6',
  'num_round': '100',
  'objective': 'binary:logistic',
  'subsample': '0.8',
  'verbosity': '0'},
 'AlgorithmSpecification': {'TrainingImage': '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.7-1',
  'TrainingInputMode': 'File',
  'MetricDefinitions': [{'Name': 'train:mae',
    'Regex': '.*\\[[0-9]+\\].*#011train-mae:([-+]?[0-9]*\\.?[0-9]+(?:[eE][-+]?[0-9]+)?).*'},
   {'Name': 'validation:aucpr',
    'Regex': '.*\\[[0-9]+\\].*#011validation-aucpr:([-+]?[0-9]*\

---

## Batch Transform

In SageMaker Batch Transform, we introduced 3 new attributes - __input_filter__, __join_source__ and __output_filter__. In the cells below, we use the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) to kick off several Batch Transform jobs using different configurations of these 3 attributes. Please refer to [this page](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform-data-processing.html) to learn more about how to use them.




#### 1. Create a transform job with the default configurations
Let's first skip these 3 new attributes and inspect the inference results. We'll use it as a baseline to compare to the results with data processing.

In [13]:
%%time

sm_transformer = sm_estimator.transformer(1, "ml.m5.xlarge")

# start a transform job
input_location = "s3://{}/{}/batch/{}".format(
    bucket, prefix, batch_file_noID
)  # use input data without ID column
sm_transformer.transform(input_location, content_type="text/csv", split_type="Line")
sm_transformer.wait()

INFO:sagemaker:Creating model with name: sagemaker-xgboost-2026-02-20-15-26-34-032
INFO:sagemaker:Creating transform job with name: sagemaker-xgboost-2026-02-20-15-26-34-804


  import pkg_resources
  import pkg_resources
[2026-02-20:15:31:20:INFO] No GPUs detected (normal if no gpus installed)
[2026-02-20:15:31:20:INFO] No GPUs detected (normal if no gpus installed)
[2026-02-20:15:31:20:INFO] nginx config: 
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr;
worker_rlimit_nofile 4096;
events {
  worker_connections 2048;
}
http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /dev/stdout combined;
  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }
  server {
    listen 8080 deferred;
    client_max_body_size 0;
    keepalive_timeout 3;
    location ~ ^/(ping|invocations|execution-parameters) {
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_redirect off;
      proxy_read_timeout 60s;
      proxy_pass http://gun

In [14]:
# Get the latest transform job name
job_name_transform = sm_transformer.latest_transform_job.name

# Getting description on Transform Job
sm_client = boto3.client("sagemaker")

response = sm_client.describe_transform_job(
    TransformJobName=job_name_transform
)
response


{'TransformJobName': 'sagemaker-xgboost-2026-02-20-15-26-34-804',
 'TransformJobArn': 'arn:aws:sagemaker:us-east-1:087941614028:transform-job/sagemaker-xgboost-2026-02-20-15-26-34-804',
 'TransformJobStatus': 'Completed',
 'ModelName': 'sagemaker-xgboost-2026-02-20-15-26-34-032',
 'TransformInput': {'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix',
    'S3Uri': 's3://sagemaker-us-east-1-087941614028/DEMO-breast-cancer-prediction-xgboost-highlevel/batch/batch_data_noID.csv'}},
  'ContentType': 'text/csv',
  'CompressionType': 'None',
  'SplitType': 'Line'},
 'TransformOutput': {'S3OutputPath': 's3://sagemaker-us-east-1-087941614028/sagemaker-xgboost-2026-02-20-15-26-34-804',
  'AssembleWith': 'None',
  'KmsKeyId': ''},
 'TransformResources': {'InstanceType': 'ml.m5.xlarge', 'InstanceCount': 1},
 'CreationTime': datetime.datetime(2026, 2, 20, 15, 26, 34, 964000, tzinfo=tzlocal()),
 'TransformStartTime': datetime.datetime(2026, 2, 20, 15, 30, 2, 420000, tzinfo=tzlocal()),
 'Transf

Let's inspect the output of the Batch Transform job in S3. It should show the list of probabilities of tumors being malignant.

The helper defined in the next cell:

- Works out where SageMaker saved the batch transform predictions in S3.
- Downloads that .out file locally.
- Reads it into pandas so you can analyze the predictions.

The cell after it calls the helper and shows the first 8 prediction results.

In [15]:
import re


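def get_csv_output_from_s3(s3uri, batch_file):
    """Download '<batch_file>.out' from the transform output location and load it as a DataFrame."""
    file_name = "{}.out".format(batch_file)
    # Split the full object URI into its bucket and key parts
    match = re.match(r"s3://([^/]+)/(.*)", "{}/{}".format(s3uri, file_name))
    print(f"Here is the matched batch output -> {match}")
    output_bucket, output_prefix = match.group(1), match.group(2)
    s3.download_file(output_bucket, output_prefix, file_name)
    # The output file has no header row
    return pd.read_csv(file_name, sep=",", header=None)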
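
The regular expression in the helper splits an S3 URI into a bucket and an object key. The same split can be sketched with the standard library's `urllib.parse`. The function name `split_s3_uri` and the example URI below are illustrative and not part of the notebook:

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Return (bucket, key) for a URI of the form s3://bucket/key."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI, shaped like a transform output location
bucket_name, key = split_s3_uri("s3://example-bucket/some-transform-job/batch_data.csv.out")
print(bucket_name, key)
```

Unlike the bare `re.match`, this variant raises a clear error when the input is not an S3 URI instead of failing later on `None`.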

In [16]:
output_df = get_csv_output_from_s3(sm_transformer.output_path, batch_file_noID)
output_df.head(8)

Here is the matched batch output -> <re.Match object; span=(0, 103), match='s3://sagemaker-us-east-1-087941614028/sagemaker-x>


Unnamed: 0,0
0,0.9359
1,0.913676
2,0.98123
3,0.962197
4,0.021804
5,0.108707
6,0.040128
7,0.602044


#### 2. Join the input and the prediction results 
Now, let's associate the prediction results with their corresponding input records. We can also use __input_filter__ to exclude the ID column, so there is no need to keep a separate ID-free file in S3.

* Set __input_filter__ to "$[1:]": indicates that we are excluding column 0 (the 'ID') before processing the inferences and keeping everything from column 1 to the last column (all the features or predictors)  
  
  
* Set __join_source__ to "Input": indicates our desire to join the input data with the inference results  

* Leave __output_filter__ at its default ('$'), indicating that the joined input and inference results will be saved as output.
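
To make the data flow concrete, here is a minimal pure-Python sketch of what these settings do to a single CSV record. It only imitates the behavior described above and makes no SageMaker calls; the row is shortened from the notebook's sample data, and `fake_model` is a hypothetical stand-in for the trained model:

```python
def fake_model(features):
    """Hypothetical stand-in for the model: returns a fixed probability."""
    return 0.9359


record = "84458202,13.71,20.83,90.2"  # ID followed by a few (truncated) features
columns = record.split(",")

model_input = columns[1:]             # input_filter="$[1:]" drops column 0 (the ID)
prediction = fake_model(model_input)  # the model never sees the ID

joined = columns + [str(prediction)]  # join_source="Input" appends the result to the input
print(",".join(joined))
```

The ID never reaches the model, yet it is still present in the joined output line.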

In [17]:
# content_type / accept and split_type / assemble_with are required to use IO joining feature
sm_transformer.assemble_with = "Line"
sm_transformer.accept = "text/csv"

# start a transform job
input_location = "s3://{}/{}/batch/{}".format(
    bucket, prefix, batch_file
)  # use input data with the ID column, because input_filter will drop it
sm_transformer.transform(
    input_location,
    split_type="Line",
    content_type="text/csv",
    input_filter="$[1:]",
    join_source="Input",
)
sm_transformer.wait()

INFO:sagemaker:Creating transform job with name: sagemaker-xgboost-2026-02-20-15-32-08-337


  import pkg_resources
[2026-02-20:15:36:50:INFO] No GPUs detected (normal if no gpus installed)
[2026-02-20:15:36:50:INFO] No GPUs detected (normal if no gpus installed)
[2026-02-20:15:36:50:INFO] nginx config: 
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr;
worker_rlimit_nofile 4096;
events {
  worker_connections 2048;
}
http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /dev/stdout combined;
  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }
  server {
    listen 8080 deferred;
    client_max_body_size 0;
    keepalive_timeout 3;
    location ~ ^/(ping|invocations|execution-parameters) {
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_redirect off;
      proxy_read_timeout 60s;
      proxy_pass http://gunicorn;
    }
    location /

Let's inspect the output of the Batch Transform job in S3. It should show the list of tumors identified by their original feature columns and their corresponding probabilities of being malignant.

In [18]:
output_df = get_csv_output_from_s3(sm_transformer.output_path, batch_file)
output_df.head(8)

Here is the matched batch output -> <re.Match object; span=(0, 98), match='s3://sagemaker-us-east-1-087941614028/sagemaker-x>


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,84458202,13.71,20.83,90.2,577.9,0.1189,0.1645,0.09366,0.05985,0.2196,...,28.14,110.6,897.0,0.1654,0.3682,0.2678,0.1556,0.3196,0.1151,0.9359
1,84667401,13.73,22.61,93.6,578.3,0.1131,0.2293,0.2128,0.08025,0.2069,...,32.01,108.8,697.7,0.1651,0.7725,0.6943,0.2208,0.3596,0.1431,0.913676
2,848406,14.68,20.13,94.74,684.5,0.09867,0.072,0.07395,0.05259,0.1586,...,30.88,123.4,1138.0,0.1464,0.1871,0.2914,0.1609,0.3029,0.08216,0.98123
3,853612,11.84,18.7,77.93,440.6,0.1109,0.1516,0.1218,0.05182,0.2301,...,28.12,119.4,888.7,0.1637,0.5775,0.6956,0.1546,0.4761,0.1402,0.962197
4,859464,9.465,21.01,60.11,269.4,0.1044,0.07773,0.02172,0.01504,0.1717,...,31.56,67.03,330.7,0.1548,0.1664,0.09412,0.06517,0.2878,0.09211,0.021804
5,859983,13.8,15.79,90.43,584.1,0.1007,0.128,0.07789,0.05069,0.1662,...,20.86,110.3,812.4,0.1411,0.3542,0.2779,0.1383,0.2589,0.103,0.108707
6,8610629,13.53,10.94,87.91,559.2,0.1291,0.1047,0.06877,0.06556,0.2403,...,12.49,91.36,605.5,0.1451,0.1379,0.08539,0.07407,0.271,0.07191,0.040128
7,8611161,13.34,15.86,86.49,520.0,0.1078,0.1535,0.1169,0.06987,0.1942,...,23.19,96.66,614.9,0.1536,0.4791,0.4858,0.1708,0.3527,0.1016,0.602044


#### 3. Update the output filter to keep only ID and prediction results
Let's change __output_filter__ to "$[0,-1]", indicating that when presenting the output, we only want to keep column 0 (the 'ID') and the last column (the inference result, i.e. the probability of a given tumor being malignant).
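
The index list in this filter reads much like Python list indexing, where `-1` refers to the last element. A small sketch on an already-joined record (values shortened for illustration; this is plain Python, not the service's filter engine):

```python
joined = ["84458202", "13.71", "20.83", "90.2", "0.9359"]  # ID, features, prediction

keep = [0, -1]  # mirrors the column selection in output_filter="$[0,-1]"
output = [joined[i] for i in keep]
print(",".join(output))
```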

In [19]:
# start another transform job
sm_transformer.transform(
    input_location,
    split_type="Line",
    content_type="text/csv",
    input_filter="$[1:]",
    join_source="Input",
    output_filter="$[0,-1]",
)
sm_transformer.wait()

INFO:sagemaker:Creating transform job with name: sagemaker-xgboost-2026-02-20-15-37-42-643


  import pkg_resources
[2026-02-20:15:42:25:INFO] No GPUs detected (normal if no gpus installed)
[2026-02-20:15:42:25:INFO] No GPUs detected (normal if no gpus installed)
[2026-02-20:15:42:25:INFO] nginx config: 
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr;
worker_rlimit_nofile 4096;
events {
  worker_connections 2048;
}
http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /dev/stdout combined;
  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }
  server {
    listen 8080 deferred;
    client_max_body_size 0;
    keepalive_timeout 3;
    location ~ ^/(ping|invocations|execution-parameters) {
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_redirect off;
      proxy_read_timeout 60s;
      proxy_pass http://gunicorn;
    }
    location /

Now, let's inspect the output of the Batch Transform job in S3 again. It should show 2 columns: each ID and its corresponding probability of being malignant.

### ✅ Why run three transform jobs?
- The first job returned predictions only: good for downstream ML pipelines or when you just need the raw outputs.
- The second job returned the predictions joined with the full input: good for analysis, debugging, or reporting, because you can trace predictions back to the original records (including IDs).
- The third job returned only the input 'ID' and the prediction: a compact result that still lets you match each prediction to its record.

In [20]:
output_df = get_csv_output_from_s3(sm_transformer.output_path, batch_file)
output_df.head(8)

Here is the matched batch output -> <re.Match object; span=(0, 98), match='s3://sagemaker-us-east-1-087941614028/sagemaker-x>


Unnamed: 0,0,1
0,84458202,0.9359
1,84667401,0.913676
2,848406,0.98123
3,853612,0.962197
4,859464,0.021804
5,859983,0.108707
6,8610629,0.040128
7,8611161,0.602044
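
The job returns probabilities rather than class labels. If a hard label is needed, one option is to apply a cutoff. The sketch below uses the eight rows shown above and a cutoff of 0.5, which is an assumption made for illustration; a suitable threshold depends on the relative cost of false negatives and false positives:

```python
# (ID, probability of malignancy) pairs copied from the output above
rows = [
    (84458202, 0.9359),
    (84667401, 0.913676),
    (848406, 0.98123),
    (853612, 0.962197),
    (859464, 0.021804),
    (859983, 0.108707),
    (8610629, 0.040128),
    (8611161, 0.602044),
]

THRESHOLD = 0.5  # assumed cutoff, not taken from the notebook

# "M" = malignant, "B" = benign, matching the data set's original labels
labels = {tumor_id: ("M" if prob >= THRESHOLD else "B") for tumor_id, prob in rows}
for tumor_id, label in labels.items():
    print(tumor_id, label)
```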


In summary, we can use the 3 newly introduced attributes - __input_filter__, __join_source__, __output_filter__ - to:
1. Filter / select useful features from the input dataset. e.g. exclude ID columns.
2. Associate the prediction results with their corresponding input records.
3. Filter the original or joined results before saving to S3. e.g. keep ID and probability columns only.

Next, take the trained model artifact from S3, pair it with the XGBoost container image, and register it as a SageMaker model so it's ready for deployment.

In [21]:
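# Note: this rebinds the name `sagemaker` from the SDK module to a low-level boto3 client.
# The remaining cells call the client through this name.
sagemaker = boto3.client("sagemaker")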
model_name = job_name
print(model_name)

xgb-2026-02-20-15-23-46


In [22]:
# info = sagemaker.describe_training_job(TrainingJobName="xgb-2026-02-19-15-58-52")
# import boto3
# sagemaker = boto3.client("sagemaker", region_name="us-east-1")

info = sagemaker.describe_training_job(TrainingJobName=model_name)
info

{'TrainingJobName': 'xgb-2026-02-20-15-23-46',
 'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:087941614028:training-job/xgb-2026-02-20-15-23-46',
 'ModelArtifacts': {'S3ModelArtifacts': 's3://sagemaker-us-east-1-087941614028/DEMO-breast-cancer-prediction-xgboost-highlevel/output/xgb-2026-02-20-15-23-46/xgb-2026-02-20-15-23-46/output/model.tar.gz'},
 'TrainingJobStatus': 'Completed',
 'SecondaryStatus': 'Completed',
 'HyperParameters': {'eta': '0.2',
  'gamma': '4',
  'max_depth': '5',
  'min_child_weight': '6',
  'num_round': '100',
  'objective': 'binary:logistic',
  'subsample': '0.8',
  'verbosity': '0'},
 'AlgorithmSpecification': {'TrainingImage': '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.7-1',
  'TrainingInputMode': 'File',
  'MetricDefinitions': [{'Name': 'train:mae',
    'Regex': '.*\\[[0-9]+\\].*#011train-mae:([-+]?[0-9]*\\.?[0-9]+(?:[eE][-+]?[0-9]+)?).*'},
   {'Name': 'validation:aucpr',
    'Regex': '.*\\[[0-9]+\\].*#011validation-aucpr:([-+]?[0-9]*\

In [23]:
info = sagemaker.describe_training_job(TrainingJobName=job_name)
model_data = info["ModelArtifacts"]["S3ModelArtifacts"]

primary_container = {"Image": image, "ModelDataUrl": model_data}

In [24]:
primary_container

{'Image': '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.7-1',
 'ModelDataUrl': 's3://sagemaker-us-east-1-087941614028/DEMO-breast-cancer-prediction-xgboost-highlevel/output/xgb-2026-02-20-15-23-46/xgb-2026-02-20-15-23-46/output/model.tar.gz'}

In [25]:
# Create a SageMaker model from the training artifact and the container image
create_model_response = sagemaker.create_model(
    ModelName=model_name, ExecutionRoleArn=role, PrimaryContainer=primary_container
)

print(create_model_response["ModelArn"])

arn:aws:sagemaker:us-east-1:087941614028:model/xgb-2026-02-20-15-23-46


In [26]:
# Inspect Training Job Details
info

{'TrainingJobName': 'xgb-2026-02-20-15-23-46',
 'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:087941614028:training-job/xgb-2026-02-20-15-23-46',
 'ModelArtifacts': {'S3ModelArtifacts': 's3://sagemaker-us-east-1-087941614028/DEMO-breast-cancer-prediction-xgboost-highlevel/output/xgb-2026-02-20-15-23-46/xgb-2026-02-20-15-23-46/output/model.tar.gz'},
 'TrainingJobStatus': 'Completed',
 'SecondaryStatus': 'Completed',
 'HyperParameters': {'eta': '0.2',
  'gamma': '4',
  'max_depth': '5',
  'min_child_weight': '6',
  'num_round': '100',
  'objective': 'binary:logistic',
  'subsample': '0.8',
  'verbosity': '0'},
 'AlgorithmSpecification': {'TrainingImage': '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.7-1',
  'TrainingInputMode': 'File',
  'MetricDefinitions': [{'Name': 'train:mae',
    'Regex': '.*\\[[0-9]+\\].*#011train-mae:([-+]?[0-9]*\\.?[0-9]+(?:[eE][-+]?[0-9]+)?).*'},
   {'Name': 'validation:aucpr',
    'Regex': '.*\\[[0-9]+\\].*#011validation-aucpr:([-+]?[0-9]*\

---
This code:
- Creates a deployment blueprint for your model.
- Specifies the model, hardware type, and number of instances.
- Produces an EndpointConfig ARN, which you’ll use to spin up a real‑time inference endpoint.

👉 Think of it as: “I’ve trained a model, now I’m defining how it should be hosted (what machine, how many copies, what variant name).”

In [27]:
# Create Endpoint Configuration

# Create an endpoint config name. Here we create one based on the date
# so that we can search endpoints based on creation time.
endpoint_config_name = 'lab4-1-endpoint-config' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

instance_type = 'ml.m5.xlarge'

endpoint_config_response = sagemaker.create_endpoint_config(
    EndpointConfigName=endpoint_config_name, # You will specify this name in a CreateEndpoint request.
    # List of ProductionVariant objects, one for each model that you want to host at this endpoint.
    ProductionVariants=[
        {
            "VariantName": "variant1", # The name of the production variant.
            "ModelName": sm_estimator.latest_training_job.name, 
            "InstanceType": instance_type, # Specify the compute instance type.
            "InitialInstanceCount": 1 # Number of instances to launch initially.
        }
    ]
)


print(f"Created EndpointConfig: {endpoint_config_response['EndpointConfigArn']}")


Created EndpointConfig: arn:aws:sagemaker:us-east-1:087941614028:endpoint-config/lab4-1-endpoint-config2026-02-20-15-43-17


In [28]:
# Deploy our model to real-time endpoint

endpoint_name = 'lab4-1-endpoint' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

create_endpoint_response = sagemaker.create_endpoint(
                                EndpointName=endpoint_name,
                                EndpointConfigName=endpoint_config_name)

In [29]:
# Wait for endpoint to spin up
import time
sagemaker.describe_endpoint(EndpointName=endpoint_name)

while True:
    print("Getting Job Status")
    res = sagemaker.describe_endpoint(EndpointName=endpoint_name)
    state = res["EndpointStatus"]
    
    if state == "InService":
        print("Endpoint in Service")
        break
    elif state == "Creating":
        print("Endpoint still creating...")
        time.sleep(60)
    else:
        print("Endpoint Creation Error - Check Sagemaker Console")
        break

Getting Job Status
Endpoint still creating...
Getting Job Status
Endpoint still creating...
Getting Job Status
Endpoint still creating...
Getting Job Status
Endpoint still creating...
Getting Job Status
Endpoint in Service

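
The loop above keeps polling for as long as the status is `Creating`. A hedged variant with an overall timeout is sketched below. `wait_for_endpoint` and its parameters are hypothetical names; `describe` is any callable that returns a dict with an `EndpointStatus` key, such as the boto3 client's `describe_endpoint` method. boto3 also provides a built-in waiter, `get_waiter("endpoint_in_service")`, which may be simpler in practice.

```python
import time


def wait_for_endpoint(describe, endpoint_name, poll_seconds=60, timeout_seconds=1800):
    """Poll until the endpoint leaves 'Creating' or the timeout expires.

    Returns the final status string (for example 'InService' or 'Failed').
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = describe(EndpointName=endpoint_name)["EndpointStatus"]
        if status != "Creating":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{endpoint_name} still Creating after {timeout_seconds}s")
        time.sleep(poll_seconds)


# Demonstration with a fake describe function (no AWS calls are made)
_statuses = iter(["Creating", "Creating", "InService"])


def fake_describe(EndpointName):
    return {"EndpointStatus": next(_statuses)}


final_status = wait_for_endpoint(fake_describe, "demo-endpoint", poll_seconds=0)
print(final_status)
```

In this notebook, the call would look something like `wait_for_endpoint(sagemaker.describe_endpoint, endpoint_name)`, where `sagemaker` is the boto3 client created earlier.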

The next cell:

1. Connects to your SageMaker endpoint.
2. Sends one row of input data (features only, no ID) in CSV format.
3. Gets back the model's prediction.
4. Prints the prediction result.
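
Step 2 relies on the payload being a single header-less CSV line that contains only the feature columns. A minimal sketch of building such a line by hand (the values are an illustrative subset; a real row has 30 features):

```python
features = [13.71, 20.83, 90.2, 577.9]  # illustrative subset of one record's features

# One comma-separated line: no header, no ID column, no trailing newline
payload = ",".join(str(value) for value in features)
print(payload)
```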

In [30]:
# Invoke Endpoint

sagemaker_runtime = boto3.client("sagemaker-runtime", region_name=region)

response = sagemaker_runtime.invoke_endpoint(
                            EndpointName=endpoint_name,
                            ContentType='text/csv',
                            Body=data_batch_noID.to_csv(header=None, index=False).strip('\n').split('\n')[0]  # Sending one row
                            )
print(response['Body'].read().decode('utf-8'))

0.935899555683136



In [31]:
# Examine the full response (metadata plus the streaming body object)

response

{'ResponseMetadata': {'RequestId': 'd5e85da1-83b2-41a7-a155-d46aa6a36838',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'd5e85da1-83b2-41a7-a155-d46aa6a36838',
   'x-amzn-invoked-production-variant': 'variant1',
   'date': 'Fri, 20 Feb 2026 15:47:19 GMT',
   'content-type': 'text/csv; charset=utf-8',
   'content-length': '18',
   'connection': 'keep-alive'},
  'RetryAttempts': 0},
 'ContentType': 'text/csv; charset=utf-8',
 'InvokedProductionVariant': 'variant1',
 'Body': <botocore.response.StreamingBody at 0x7fad6751b820>}

### Part 1: Set Up Model Group

In [32]:
import time

model_description = "XGBoost models trained on UCI Breast Cancer dataset to predict benign vs malignant tumors, with support for batch inference and I/O join in SageMaker."
# Append a Unix timestamp so the group name is unique across runs
model_package_group_name = "xgboost-breast-cancer-detection-" + str(round(time.time()))
model_package_group_input_dict = {
    "ModelPackageGroupName": model_package_group_name,
    "ModelPackageGroupDescription": model_description,
}

create_model_package_group_response = sagemaker.create_model_package_group(**model_package_group_input_dict)
print('ModelPackageGroup Arn : {}'.format(create_model_package_group_response['ModelPackageGroupArn']))

ModelPackageGroup Arn : arn:aws:sagemaker:us-east-1:087941614028:model-package-group/xgboost-breast-cancer-detection-1771602439


In [33]:
response = sagemaker.describe_model_package_group(
    ModelPackageGroupName=model_package_group_name
)
response

{'ModelPackageGroupName': 'xgboost-breast-cancer-detection-1771602439',
 'ModelPackageGroupArn': 'arn:aws:sagemaker:us-east-1:087941614028:model-package-group/xgboost-breast-cancer-detection-1771602439',
 'ModelPackageGroupDescription': 'XGBoost models trained on UCI Breast Cancer dataset to predict benign vs malignant tumors, with support for batch inference and I/O join in SageMaker.',
 'CreationTime': datetime.datetime(2026, 2, 20, 15, 47, 19, 296000, tzinfo=tzlocal()),
 'CreatedBy': {'IamIdentity': {'Arn': 'arn:aws:sts::087941614028:assumed-role/LabRole/SageMaker',
   'PrincipalId': 'AROARI6N2RHGPI6CLO7W5:SageMaker'}},
 'ModelPackageGroupStatus': 'Completed',
 'ResponseMetadata': {'RequestId': 'bda5a81c-669d-4179-b43c-3a55001da76d',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'bda5a81c-669d-4179-b43c-3a55001da76d',
   'strict-transport-security': 'max-age=47304000; includeSubDomains',
   'x-frame-options': 'DENY',
   'content-security-policy': "frame-ancestors 'n

### Part 2: Set Up Model Package

In [34]:
import boto3

# Note: despite its name, `session` is a boto3 SageMaker client, not a sagemaker.Session
session = boto3.client('sagemaker')

response = session.create_model_package(
    ModelPackageGroupName=model_package_group_name,
    ModelPackageDescription="XGBoost model trained on UCI Breast Cancer dataset to classify benign vs malignant tumors.",
    InferenceSpecification={
        "Containers": [primary_container],
        "SupportedContentTypes": ["text/csv"],
        "SupportedResponseMIMETypes": ["text/csv", "application/json"],
        "SupportedRealtimeInferenceInstanceTypes": ["ml.m5.large", "ml.m5.xlarge"],
        "SupportedTransformInstanceTypes": ["ml.m5.large", "ml.m5.xlarge"]
    }
)

model_package_arn = response["ModelPackageArn"]

print(f"Model Package ARN: {model_package_arn}")

Model Package ARN: arn:aws:sagemaker:us-east-1:087941614028:model-package/xgboost-breast-cancer-detection-1771602439/1
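
The trailing `/1` in the ARN is the version number of the package within its group, and later packages registered in the same group receive higher versions. As a hedged sketch, the most recently created package in a group could be looked up as follows. The helper name `latest_model_package` is hypothetical; `client` is assumed to be a boto3 SageMaker client (or any object exposing `list_model_packages` with the same keyword arguments).

```python
def latest_model_package(client, group_name):
    """Return the summary dict of the newest package in a group, or None.

    Hypothetical helper: `client` is assumed to behave like
    boto3.client("sagemaker").
    """
    resp = client.list_model_packages(
        ModelPackageGroupName=group_name,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    summaries = resp.get("ModelPackageSummaryList", [])
    # An empty list means nothing has been registered in the group yet
    return summaries[0] if summaries else None
```

In this notebook the call might be `latest_model_package(session, model_package_group_name)`, and the returned summary would include fields such as `ModelPackageArn` and `ModelPackageVersion`.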


In [35]:
details = session.describe_model_package(
    ModelPackageName=model_package_arn
)
details

{'ModelPackageGroupName': 'xgboost-breast-cancer-detection-1771602439',
 'ModelPackageVersion': 1,
 'ModelPackageArn': 'arn:aws:sagemaker:us-east-1:087941614028:model-package/xgboost-breast-cancer-detection-1771602439/1',
 'ModelPackageDescription': 'XGBoost model trained on UCI Breast Cancer dataset to classify benign vs malignant tumors.',
 'CreationTime': datetime.datetime(2026, 2, 20, 15, 47, 19, 809000, tzinfo=tzlocal()),
 'InferenceSpecification': {'Containers': [{'Image': '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.7-1',
    'ImageDigest': 'sha256:b4f13edb198529c460692015797fa1ca6a8ff1ed64a149297174d922121b8fc4',
    'ModelDataUrl': 's3://sagemaker-us-east-1-087941614028/DEMO-breast-cancer-prediction-xgboost-highlevel/output/xgb-2026-02-20-15-23-46/xgb-2026-02-20-15-23-46/output/model.tar.gz',
    'ModelDataETag': '0544f60408cde4b72b34d0e47775b7f6'}],
  'SupportedTransformInstanceTypes': ['ml.m5.large', 'ml.m5.xlarge'],
  'SupportedRealtimeInferenceInstance

### Part 3: Write the Model Card

In [38]:
from sagemaker.model_card.model_card import (
    ModelCard,
    ModelCardStatusEnum,
    TrainingDetails,
    ObjectiveFunction,
    Function,
)

# `sess` is the sagemaker.Session created in the Setup section
my_card = ModelCard(
    name="cancer-xgboost-modelcard-v1",
    status=ModelCardStatusEnum.DRAFT,
    sagemaker_session=sess,
)

# create Function object (only the function string is accepted here)
fn = Function(function="Maximize")

# wrap Function inside ObjectiveFunction (must be ObjectiveFunction instance)
obj_fn = ObjectiveFunction(function=fn)

# attach the objective function and a free-text training observation
my_card.training_details = TrainingDetails(
    objective_function=obj_fn,
    training_observations="Trained on UCI Breast Cancer dataset using XGBoost"
)

# create the model card
my_card.create()
print("Model card created")


INFO:sagemaker.model_card.model_card:Creating model card with name: cancer-xgboost-modelcard-v1


Model card created


In [40]:
import boto3

client = boto3.client("sagemaker", region_name="us-east-1")

response = client.describe_model_card(ModelCardName="cancer-xgboost-modelcard-v1")
response


{'ModelCardArn': 'arn:aws:sagemaker:us-east-1:087941614028:model-card/cancer-xgboost-modelcard-v1',
 'ModelCardName': 'cancer-xgboost-modelcard-v1',
 'ModelCardVersion': 1,
 'Content': '{"training_details": {"objective_function": {"function": {"function": "Maximize"}}, "training_observations": "Trained on UCI Breast Cancer dataset using XGBoost"}}',
 'ModelCardStatus': 'Draft',
 'CreationTime': datetime.datetime(2026, 2, 20, 15, 52, 59, 745000, tzinfo=tzlocal()),
 'CreatedBy': {},
 'LastModifiedTime': datetime.datetime(2026, 2, 20, 15, 52, 59, 745000, tzinfo=tzlocal()),
 'LastModifiedBy': {},
 'ResponseMetadata': {'RequestId': '3112a498-934c-444b-a284-bc0e7533ac6c',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '3112a498-934c-444b-a284-bc0e7533ac6c',
   'strict-transport-security': 'max-age=47304000; includeSubDomains',
   'x-frame-options': 'DENY',
   'content-security-policy': "frame-ancestors 'none'",
   'cache-control': 'no-cache, no-store, must-revalidate',
   'x-
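
The `Content` field in the response above is a JSON string, not a dictionary, so it has to be decoded before individual fields can be read. The following minimal sketch parses the content shown in the output; the helper name `summarize_card` is hypothetical, and it only handles the two fields that this notebook populated.

```python
import json

# Content string copied from the describe_model_card output above
content = (
    '{"training_details": {"objective_function": {"function": {"function": "Maximize"}}, '
    '"training_observations": "Trained on UCI Breast Cancer dataset using XGBoost"}}'
)


def summarize_card(content_json):
    """Extract the objective and training observations from model card content."""
    card = json.loads(content_json)
    details = card.get("training_details", {})
    # The objective is nested as objective_function -> function -> function
    objective = details.get("objective_function", {}).get("function", {}).get("function")
    return {
        "objective": objective,
        "observations": details.get("training_observations"),
    }


print(summarize_card(content))
```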

In [None]:
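# Delete the endpoint when you are done; it incurs charges while it is in service.
# Uncomment the line below to run the deletion.

# sagemaker.delete_endpoint(EndpointName=endpoint_name)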
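
Deleting the endpoint does not remove the endpoint configuration or the model it references. A minimal cleanup sketch is shown below, assuming `client` is a boto3 SageMaker client. The helper name `delete_inference_resources` and the `model_name` argument are hypothetical; pass whichever names were used when the resources were created.

```python
def delete_inference_resources(client, endpoint_name, endpoint_config_name=None, model_name=None):
    """Delete an endpoint and, optionally, its endpoint config and model.

    Hypothetical helper: `client` is assumed to behave like
    boto3.client("sagemaker"). Returns the list of deleted resources.
    """
    deleted = []
    client.delete_endpoint(EndpointName=endpoint_name)
    deleted.append(("endpoint", endpoint_name))
    if endpoint_config_name:
        client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
        deleted.append(("endpoint_config", endpoint_config_name))
    if model_name:
        client.delete_model(ModelName=model_name)
        deleted.append(("model", model_name))
    return deleted
```

In this notebook the call might be `delete_inference_resources(sagemaker, endpoint_name, endpoint_config_name)`. Registered model packages and model cards are separate resources and are not affected by this helper.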

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/sagemaker_batch_transform|batch_transform_associate_predictions_with_input|Batch Transform - breast cancer prediction with high level SDK.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/sagemaker_batch_transform|batch_transform_associate_predictions_with_input|Batch Transform - breast cancer prediction with high level SDK.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/sagemaker_batch_transform|batch_transform_associate_predictions_with_input|Batch Transform - breast cancer prediction with high level SDK.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/sagemaker_batch_transform|batch_transform_associate_predictions_with_input|Batch Transform - breast cancer prediction with high level SDK.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/sagemaker_batch_transform|batch_transform_associate_predictions_with_input|Batch Transform - breast cancer prediction with high level SDK.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/sagemaker_batch_transform|batch_transform_associate_predictions_with_input|Batch Transform - breast cancer prediction with high level SDK.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/sagemaker_batch_transform|batch_transform_associate_predictions_with_input|Batch Transform - breast cancer prediction with high level SDK.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/sagemaker_batch_transform|batch_transform_associate_predictions_with_input|Batch Transform - breast cancer prediction with high level SDK.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/sagemaker_batch_transform|batch_transform_associate_predictions_with_input|Batch Transform - breast cancer prediction with high level SDK.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/sagemaker_batch_transform|batch_transform_associate_predictions_with_input|Batch Transform - breast cancer prediction with high level SDK.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/sagemaker_batch_transform|batch_transform_associate_predictions_with_input|Batch Transform - breast cancer prediction with high level SDK.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/sagemaker_batch_transform|batch_transform_associate_predictions_with_input|Batch Transform - breast cancer prediction with high level SDK.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/sagemaker_batch_transform|batch_transform_associate_predictions_with_input|Batch Transform - breast cancer prediction with high level SDK.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/sagemaker_batch_transform|batch_transform_associate_predictions_with_input|Batch Transform - breast cancer prediction with high level SDK.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/sagemaker_batch_transform|batch_transform_associate_predictions_with_input|Batch Transform - breast cancer prediction with high level SDK.ipynb)
