# Assignment 4.1 — Model Store Exercise

**Name:** Nitin Kumar Mishra  
**Date:** Oct 2, 2025  

## Summary of Work

In this assignment, I implemented a Model Store using SageMaker’s Model Registry.  

1. **Model Package Group**  
   - Created `xgboost-breast-cancer-detection` to track versions of the breast cancer detection model.  

2. **Model Package**  
   - Registered a trained XGBoost model (binary classification on WDBC dataset).  
   - Linked container image, model artifact S3 URI, and evaluation metrics stored as `metrics.json`.  

3. **Model Card**  
   - Documented the model overview, dataset, hyperparameters, metrics, intended use, and limitations.  
   - Key evaluation results:  
     - Accuracy = 0.9825  
     - AUC = 1.0  
     - Precision = 1.0  
     - Recall = 0.9524  
     - Confusion Matrix = [[36, 0], [1, 20]]  
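
The accuracy, precision, and recall above follow directly from the confusion matrix. The sketch below is a quick cross-check, assuming scikit-learn's `[[TN, FP], [FN, TP]]` layout with 1 = malignant as the positive class. The helper name `metrics_from_confusion` is mine and is not part of the notebook.

```python
# Confusion matrix reported above, laid out as [[TN, FP], [FN, TP]]
# (scikit-learn's convention for labels 0 = benign, 1 = malignant).
confusion = [[36, 0], [1, 20]]


def metrics_from_confusion(cm):
    """Derive accuracy, precision, and recall from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": round((tp + tn) / total, 4),   # 56 / 57
        "precision": round(tp / (tp + fp), 4),     # 20 / 20
        "recall": round(tp / (tp + fn), 4),        # 20 / 21
    }


print(metrics_from_confusion(confusion))
# {'accuracy': 0.9825, 'precision': 1.0, 'recall': 0.9524}
```

AUC cannot be recovered this way because it depends on the ranking of the predicted probabilities, not on the thresholded labels.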

## Conclusion
This exercise demonstrated how to manage ML models in production environments by:  
- Organizing models into versioned groups,  
- Registering artifacts with metrics,  
- Documenting qualitative and quantitative details in a Model Card.  

Together, these practices support reproducibility, governance, and maintainability of ML models at scale.

## Step 1: Set up

**Purpose:** Set up the notebook environment and verify the AWS and SageMaker configuration.  
This step will:

- import required libraries,
- create a SageMaker session,
- attempt to obtain the execution role,
- set (and print) the default S3 bucket and working prefix,
- show the repository files in the current directory.
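
Because the setup cell leaves `role` as `None` when no execution role can be resolved, later cells that pass `role` to an estimator could fail far from the real cause. A minimal sketch of a fail-fast guard follows. The helper `require_role` and the ARN pattern are my own assumptions (standard `aws` partition, 12-digit account ID) and are not part of the SageMaker SDK.

```python
import re

# Assumed pattern: standard "aws" partition, 12-digit account ID, any role path.
ROLE_ARN_PATTERN = re.compile(r"^arn:aws:iam::\d{12}:role/.+$")


def require_role(role):
    """Return the role ARN unchanged, or raise if it is missing or malformed."""
    if role is None:
        raise ValueError("No execution role found; set `role` to an IAM role ARN.")
    if not ROLE_ARN_PATTERN.match(role):
        raise ValueError(f"Not an IAM role ARN: {role!r}")
    return role


# Placeholder ARN taken from the hint printed by the setup cell.
print(require_role("arn:aws:iam::123456789012:role/YourSageMakerRole"))
```

Calling such a guard right after the setup cell would surface a missing role before any training job is created.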

In [2]:
# Step 1 — Environment & config check for Assignment 4.1
import os
import sys
import json
import boto3
import sagemaker
from botocore.exceptions import ClientError

print("Python:", sys.version.splitlines()[0])
print("boto3 version:", boto3.__version__)
print("sagemaker sdk version:", sagemaker.__version__)

# boto3 session & region
boto_sess = boto3.session.Session()
region = boto_sess.region_name or "us-east-1"
print("AWS region:", region)

# Try to get SageMaker execution role (works in Studio / training notebooks)
role = None
try:
    role = sagemaker.get_execution_role()
    print("SageMaker execution role obtained via sagemaker.get_execution_role().")
except Exception as e:
    # not fatal — show helpful message and leave role as None
    print("Could not get role via sagemaker.get_execution_role():", str(e))
    print("If running outside SageMaker Studio, set role manually, e.g.:")
    print("role = 'arn:aws:iam::123456789012:role/YourSageMakerRole'")

# Create SageMaker session and default bucket (if possible)
sagemaker_session = sagemaker.session.Session(boto_session=boto_sess)
try:
    default_bucket = sagemaker_session.default_bucket()
except Exception as e:
    default_bucket = None
    print("Unable to determine default bucket automatically:", str(e))

print("SageMaker default bucket:", default_bucket)
# Set a prefix for this assignment (helps keep S3 organized)
prefix = "aai-540/assignment4-1"
print("S3 prefix to use:", prefix)

# Show current working directory contents (repo files)
print("\nFiles in current directory:")
for f in sorted(os.listdir(".")):
    print(" -", f)

# Quick validation: check if the expected notebook from the repo exists
expected_nb = "01-train-and-deploy.ipynb"
print("\nExpected sample notebook present?:", os.path.exists(expected_nb))

# If role is None, show the AWS caller identity (for debugging)
if role is None:
    try:
        sts = boto_sess.client("sts")
        identity = sts.get_caller_identity()
        print("\nAWS caller identity (useful for debugging):")
        print(json.dumps(identity, indent=2))
    except ClientError as ce:
        print("Unable to call STS to get caller identity:", str(ce))

Python: 3.12.9 | packaged by conda-forge | (main, Feb 14 2025, 08:00:06) [GCC 13.3.0]
boto3 version: 1.37.1
sagemaker sdk version: 2.245.0
AWS region: us-east-1
SageMaker execution role obtained via sagemaker.get_execution_role().
SageMaker default bucket: sagemaker-us-east-1-533267190630
S3 prefix to use: aai-540/assignment4-1

Files in current directory:
 - .ipynb_checkpoints
 - 01-train-and-deploy.ipynb
 - Nitin_Assignment 3.1 Model Training and Deployment.ipynb
 - README.md
 - data

Expected sample notebook present?: True


## Step 2 — Final Data Preparation (clean, split, upload)

This cell:
- Downloads the Breast Cancer WDBC dataset if not already present  
- Applies correct column headers  
- Cleans the data:
  - Drops the `id` column  
  - Maps `diagnosis`: Malignant = 1, Benign = 0  
  - Moves `diagnosis` to the **first column**, where SageMaker's built-in XGBoost expects the label in CSV input  
- Creates **train/validation/test splits** (80/10/10, stratified)  
- Saves them as CSV (no header, no index) for SageMaker XGBoost  
- Creates a **batch inference file** (`batch_test_no_id.csv`) with only features (no id, no label)  
- Uploads all files to S3 under your assignment prefix and prints URIs + sample rows
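
The 80/10/10 split is done in two passes: 20% of the rows are held out first, and that held-out part is then halved. The sketch below only reproduces the row arithmetic, assuming (as scikit-learn's `train_test_split` does for a fractional `test_size`) that the held-out count is rounded up. `split_sizes` is a hypothetical helper for illustration.

```python
import math


def split_sizes(n_rows, test_fraction):
    """Return (kept, held_out) row counts, rounding the held-out share up."""
    held_out = math.ceil(n_rows * test_fraction)
    return n_rows - held_out, held_out


n_train, n_temp = split_sizes(569, 0.2)    # 80% train, 20% held out
n_val, n_test = split_sizes(n_temp, 0.5)   # held-out rows halved into val/test
print(n_train, n_val, n_test)
# 455 57 57
```

With 569 rows, the exact 80/10/10 shares are not whole numbers, which is why the split yields 455/57/57.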

In [3]:
# Step 2 — Final Preprocessing (label first, features after)

import os
import pandas as pd
from sklearn.model_selection import train_test_split

# Config
os.makedirs("data", exist_ok=True)
local_file = "data/wdbc.csv"
src_bucket = f"sagemaker-example-files-prod-{region}"
src_key = "datasets/tabular/breast_cancer/wdbc.csv"

# Download dataset if not exists
if not os.path.exists(local_file):
    boto_sess.client("s3").download_file(src_bucket, src_key, local_file)

# Column headers (from dataset description)
headers = [
    "id","diagnosis",
    "radius_mean","texture_mean","perimeter_mean","area_mean","smoothness_mean",
    "compactness_mean","concavity_mean","concave points_mean","symmetry_mean","fractal_dimension_mean",
    "radius_se","texture_se","perimeter_se","area_se","smoothness_se","compactness_se","concavity_se","concave points_se","symmetry_se","fractal_dimension_se",
    "radius_worst","texture_worst","perimeter_worst","area_worst","smoothness_worst","compactness_worst","concavity_worst","concave points_worst","symmetry_worst","fractal_dimension_worst"
]

# Load dataset
df = pd.read_csv(local_file, header=None)
df.columns = headers

# Encode labels: M=1, B=0
df["diagnosis"] = df["diagnosis"].astype(str).str.strip().str.upper().map({"M":1,"B":0}).astype(int)

# Drop id column
df = df.drop(columns=["id"])

# Ensure label is the FIRST column
proc = pd.concat([df["diagnosis"], df.drop(columns=["diagnosis"])], axis=1)

print("Dataset shape:", proc.shape)
print("Label distribution:", proc.iloc[:,0].value_counts().to_dict())

# Train/val/test split (80/10/10, stratified)
train_df, temp_df = train_test_split(proc, test_size=0.2, random_state=42, stratify=proc.iloc[:,0])
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42, stratify=temp_df.iloc[:,0])

print("Splits -> Train:", train_df.shape, "Val:", val_df.shape, "Test:", test_df.shape)

# Save CSVs (no header, no index, enforce float format)
train_path = "data/train.csv"
val_path   = "data/validation.csv"
test_path  = "data/test.csv"

train_df.to_csv(train_path, header=False, index=False, float_format="%.6f")
val_df.to_csv(val_path, header=False, index=False, float_format="%.6f")
test_df.to_csv(test_path, header=False, index=False, float_format="%.6f")

# Batch file for inference: features only (drop label column)
batch_no_id_path = "data/batch_test_no_id.csv"
test_features_only = test_df.iloc[:,1:]  # drop first col (label)
test_features_only.to_csv(batch_no_id_path, header=False, index=False, float_format="%.6f")

print("\nLocal files created:")
print("-", train_path, val_path, test_path, batch_no_id_path)

# Upload to S3
s3_train = sagemaker_session.upload_data(train_path, key_prefix=f"{prefix}/train")
s3_val   = sagemaker_session.upload_data(val_path,   key_prefix=f"{prefix}/validation")
s3_test  = sagemaker_session.upload_data(test_path,  key_prefix=f"{prefix}/test")
s3_batch = sagemaker_session.upload_data(batch_no_id_path, key_prefix=f"{prefix}/batch")

print("\nUploaded to S3:")
print(" - Train:", s3_train)
print(" - Val  :", s3_val)
print(" - Test :", s3_test)
print(" - Batch:", s3_batch)

# Sanity check: show first column unique values in train (label)
print("\nUnique label values in train.csv (should be [0,1]):",
      pd.read_csv(train_path, header=None).iloc[:,0].unique())

# Show a few sample lines from train.csv
print("\nSample lines from train.csv:")
with open(train_path) as f:
    for i, line in enumerate(f):
        print(line.strip())
        if i >= 4: break

Dataset shape: (569, 31)
Label distribution: {0: 357, 1: 212}
Splits -> Train: (455, 31) Val: (57, 31) Test: (57, 31)

Local files created:
- data/train.csv data/validation.csv data/test.csv data/batch_test_no_id.csv

Uploaded to S3:
 - Train: s3://sagemaker-us-east-1-533267190630/aai-540/assignment4-1/train/train.csv
 - Val  : s3://sagemaker-us-east-1-533267190630/aai-540/assignment4-1/validation/validation.csv
 - Test : s3://sagemaker-us-east-1-533267190630/aai-540/assignment4-1/test/test.csv
 - Batch: s3://sagemaker-us-east-1-533267190630/aai-540/assignment4-1/batch/batch_test_no_id.csv

Unique label values in train.csv (should be [0,1]): [1 0]

Sample lines from train.csv:
1,16.020000,23.240000,102.700000,797.800000,0.082060,0.066690,0.032990,0.033230,0.152800,0.056970,0.379500,1.187000,2.466000,40.510000,0.004029,0.009269,0.011010,0.007591,0.014600,0.003042,19.190000,33.880000,123.800000,1150.000000,0.118100,0.155100,0.145900,0.099750,0.294800,0.084520
0,12.320000,12.390000,78.850

In [4]:
import pandas as pd

# Load back the local train/validation CSVs as raw data (no headers)
train_check = pd.read_csv("data/train.csv", header=None)
val_check   = pd.read_csv("data/validation.csv", header=None)

print("Train shape:", train_check.shape)
print("Val shape:", val_check.shape)

# Look at the last six columns. Step 2 wrote the label as column 0,
# so every column shown here is a feature.
print("\nSample rows (last 5 cols + label):")
print(train_check.iloc[:5, -6:])

# NOTE: iloc[:, -1] selects the last feature (fractal_dimension_worst), not the label.
# The values printed below are therefore feature values; use iloc[:, 0] to inspect labels.
print("\nUnique label values in train.csv:", train_check.iloc[:,-1].unique())
print("Unique label values in val.csv:", val_check.iloc[:,-1].unique())

Train shape: (455, 31)
Val shape: (57, 31)

Sample rows (last 5 cols + label):
        25      26      27       28      29       30
0  0.11810  0.1551  0.1459  0.09975  0.2948  0.08452
1  0.13850  0.1266  0.1242  0.09391  0.2827  0.06771
2  0.09402  0.1936  0.1838  0.05601  0.2488  0.08151
3  0.14190  0.7090  0.9019  0.24750  0.2866  0.11550
4  0.13380  0.2117  0.3446  0.14900  0.2341  0.07421

Unique label values in train.csv: [0.08452 0.06771 0.08151 0.1155  0.07421 0.05525 0.07924 0.1014  0.07371
 0.07127 0.09438 0.08082 0.08524 0.09185 0.06487 0.07246 0.09584 0.09981
 0.07408 0.06174 0.06641 0.08665 0.07484 0.1252  0.07427 0.08024 0.1027
 0.06777 0.07615 0.08677 0.07474 0.1034  0.173   0.07055 0.07722 0.05695
 0.09251 0.0782  0.08468 0.06658 0.09124 0.08025 0.07037 0.08113 0.08061
 0.08666 0.1162  0.06037 0.08982 0.07313 0.07191 0.07083 0.08198 0.06954
 0.0757  0.07147 0.1059  0.09026 0.08216 0.06958 0.1405  0.082   0.07735
 0.05974 0.09209 0.07867 0.09585 0.07062 0.09136 0.06878 0
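
The check above reads the last column, which holds a feature because Step 2 writes the label first. A label check would read column 0 instead. Here is a minimal sketch using only the standard library on a hypothetical three-column excerpt. The rows are illustrative (the third is invented), and `label_values` is my own helper.

```python
import csv
import io

# Hypothetical three-column excerpt in the Step 2 layout: label first, then features.
sample_csv = "1,16.020000,23.240000\n0,12.320000,12.390000\n1,14.500000,20.100000\n"


def label_values(csv_text, label_col=0):
    """Collect the distinct values found in one column of header-less CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    return sorted({row[label_col] for row in reader if row})


print(label_values(sample_csv))        # first column: the 0/1 labels
print(label_values(sample_csv, -1))    # last column: feature values, not labels
```

On the real files, the equivalent pandas check is `train_check.iloc[:, 0].unique()`, which is what the Step 2 sanity check already prints.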

## Step 3 — Train baseline XGBoost model

This cell starts an XGBoost training job on one `ml.m5.large` instance with baseline hyperparameters (`num_round=100`, `max_depth=5`, `eta=0.2`, AUC as the evaluation metric).
When it finishes, the cell prints:

- Training job name
- Final model artifact S3 URI

If you want to tune hyperparameters or switch instance types, edit the values in the code cell before running.
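
The model artifact URI printed at the end of training is derived from the estimator's `output_path` and the training job name. The sketch below reconstructs that layout as it appears in this notebook's recorded `ModelDataUrl`. `model_artifact_uri` is a hypothetical helper, and the layout is inferred from this run, not from a guarantee in the SDK.

```python
def model_artifact_uri(bucket, prefix, job_name):
    """Rebuild the artifact location seen in this notebook's outputs."""
    output_path = f"s3://{bucket}/{prefix}/model"
    # Layout observed in this run: <output_path>/<job name>/output/model.tar.gz
    return f"{output_path}/{job_name}/output/model.tar.gz"


print(model_artifact_uri(
    "sagemaker-us-east-1-533267190630",
    "aai-540/assignment4-1",
    "aai-540-xgboost-2025-10-02-16-57-07-654",
))
```

In the notebook itself, `est.model_data` remains the source of truth; a helper like this is only useful for locating artifacts from the job name afterwards.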

In [5]:
# Step 3 — Train XGBoost with corrected data (label in first column)

import sagemaker
from sagemaker.inputs import TrainingInput

# Training image (SageMaker built-in XGBoost)
image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost", 
    region=region, 
    version="1.5-1", 
    image_scope="training"
)

# Output path for model artifacts
output_path = f"s3://{default_bucket}/{prefix}/model"

# Define Estimator
est = sagemaker.estimator.Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    volume_size=10,
    output_path=output_path,
    sagemaker_session=sagemaker_session,
    base_job_name="aai-540-xgboost"
)

# Hyperparameters (simple baseline)
est.set_hyperparameters(
    objective="binary:logistic",
    eval_metric="auc",
    num_round=100,
    max_depth=5,
    eta=0.2
)

# Training inputs (pointing to the corrected train/val S3 URIs)
train_input = TrainingInput(s3_train, content_type="text/csv")
val_input   = TrainingInput(s3_val,   content_type="text/csv")

print("Starting training job with corrected dataset... (this will take a few minutes)")
est.fit({"train": train_input, "validation": val_input})

# Capture job info
training_job_name = est.latest_training_job.job_name
model_artifact_s3 = est.model_data

print("\n✅ Training job finished successfully")
print("Training job name:", training_job_name)
print("Model artifact S3 URI:", model_artifact_s3)

INFO:sagemaker:Creating training-job with name: aai-540-xgboost-2025-10-02-16-57-07-654


Starting training job with corrected dataset... (this will take a few minutes)
2025-10-02 16:57:09 Starting - Starting the training job...
2025-10-02 16:57:23 Starting - Preparing the instances for training...
2025-10-02 16:57:46 Downloading - Downloading input data...
2025-10-02 16:58:31 Downloading - Downloading the training image......
2025-10-02 16:59:43 Training - Training image download completed. Training in progress.
  from pandas import MultiIndex, Int64Index
[2025-10-02 16:59:37.889 ip-10-0-68-41.ec2.internal:7 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2025-10-02 16:59:37.912 ip-10-0-68-41.ec2.internal:7 INFO profiler_config_parser.py:111] User has disabled profiler.
[2025-10-02:16:59:38:INFO] Imported framework sagemaker_xgboost_container.training
[2025-10-02:16:59:38:INFO] Failed to parse hyperparameter eval_metric value auc to Json.
Returning the value itself
[2025-10-02:16:59:38:INFO] Failed to parse hyper

## Step 4 — Batch Transform on test dataset

This cell:
- Creates a SageMaker Model object from the trained artifact  
- Launches a Batch Transform job on `batch_test_no_id.csv`  
- Saves the predictions to S3 and shows the output location
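
Batch Transform writes one output object per input object. In this run the predictions file is the input file name with `.out` appended, under the transformer's `output_path`. A small sketch of that naming follows; `transform_output_uri` is a hypothetical helper, and the naming rule is taken from the file found in Step 5.

```python
import posixpath


def transform_output_uri(output_path, input_uri):
    """Expected S3 URI of the predictions file for a single input object."""
    input_name = posixpath.basename(input_uri)
    return f"{output_path.rstrip('/')}/{input_name}.out"


print(transform_output_uri(
    "s3://sagemaker-us-east-1-533267190630/aai-540/assignment4-1/batch-output",
    "s3://sagemaker-us-east-1-533267190630/aai-540/assignment4-1/batch/batch_test_no_id.csv",
))
```

Building the key directly like this avoids listing the prefix, which could otherwise pick up `.out` files left by earlier transform jobs.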


In [6]:
# Step 4 — Batch Transform inference

from sagemaker.model import Model

# Create a SageMaker Model from the training artifact
model = Model(
    image_uri=image_uri, 
    model_data=model_artifact_s3,
    role=role,
    sagemaker_session=sagemaker_session
)

# Create Transformer for batch inference
transformer = model.transformer(
    instance_count=1,
    instance_type="ml.m5.large",
    strategy="SingleRecord",
    assemble_with="Line",
    output_path=f"s3://{default_bucket}/{prefix}/batch-output",
    accept="text/csv"
)

# Start Batch Transform
print("Starting Batch Transform job...")
transformer.transform(
    data=s3_batch,
    content_type="text/csv",
    split_type="Line",
    input_filter="$[0:]"
)
transformer.wait()

print("\n Batch Transform job completed")
print("Output saved to:", transformer.output_path)

INFO:sagemaker:Creating model with name: sagemaker-xgboost-2025-10-02-17-00-56-185
INFO:sagemaker:Creating transform job with name: sagemaker-xgboost-2025-10-02-17-00-56-859


Starting Batch Transform job...
  from pandas import MultiIndex, Int64Index
[2025-10-02:17:06:20:INFO] No GPUs detected (normal if no gpus installed)
[2025-10-02:17:06:20:INFO] No GPUs detected (normal if no gpus installed)
[2025-10-02:17:06:20:INFO] nginx config: 
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr;
worker_rlimit_nofile 4096;
events {
  worker_connections 2048;
}
http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /dev/stdout combined;
  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }
  server {
    listen 8080 deferred;
    client_max_body_size 0;
    keepalive_timeout 3;
    location ~ ^/(ping|invocations|execution-parameters) {
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_redirect off;
      proxy_read_timeout 60s;
 

## Step 5 — Evaluate Batch Transform predictions

This cell:
- Downloads the predictions from S3  
- Reads them into pandas  
- Loads the true labels from `test.csv`  
- Combines predictions + labels  
- Computes Accuracy, AUC, Precision, Recall, Confusion Matrix  
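
AUC is computed from the predicted probabilities, while accuracy, precision, and recall use labels thresholded at 0.5, so a model can reach AUC = 1.0 and still miss a positive. The toy sketch below shows this in plain Python. The data and both helper names are illustrative assumptions, not values from the test set.

```python
def confusion_at_threshold(y_true, y_prob, threshold=0.5):
    """Return [[TN, FP], [FN, TP]] after labelling scores above the threshold as 1."""
    cm = [[0, 0], [0, 0]]
    for truth, prob in zip(y_true, y_prob):
        cm[truth][int(prob > threshold)] += 1
    return cm


def pairwise_auc(y_true, y_prob):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: one positive scores 0.40, below the threshold but above every negative.
toy_true = [0, 0, 0, 1, 1, 1]
toy_prob = [0.05, 0.10, 0.30, 0.40, 0.80, 0.95]

toy_cm = confusion_at_threshold(toy_true, toy_prob)
(tn, fp), (fn, tp) = toy_cm
print(toy_cm, pairwise_auc(toy_true, toy_prob), round(tp / (tp + fn), 4))
# [[3, 0], [1, 2]] 1.0 0.6667
```

The same pattern explains the reported results: perfect ranking (AUC = 1.0) with one malignant case falling below the 0.5 cutoff (recall = 0.9524).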


In [7]:
# Step 5 — Download and evaluate predictions

import boto3
import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, recall_score, confusion_matrix

# Local download path
pred_local = "data/batch_predictions.csv"

# S3 location (from transformer.output_path)
s3_output_path = transformer.output_path
bucket_name = s3_output_path.replace("s3://", "").split("/")[0]
prefix_path = "/".join(s3_output_path.replace("s3://", "").split("/")[1:])

# Get the actual file name (SageMaker writes a .out file inside the prefix)
s3 = boto3.client("s3")
objs = s3.list_objects_v2(Bucket=bucket_name, Prefix=prefix_path)
pred_file_key = [o["Key"] for o in objs.get("Contents", []) if o["Key"].endswith(".out")][0]

print("Prediction file found in S3:", pred_file_key)

# Download it
s3.download_file(bucket_name, pred_file_key, pred_local)
print("Downloaded predictions to:", pred_local)

# Read predictions
preds = pd.read_csv(pred_local, header=None)
preds.columns = ["prediction"]

# Load true labels from test.csv
test_df = pd.read_csv("data/test.csv", header=None)
y_true = test_df.iloc[:,0]   # label is first column
y_pred_prob = preds["prediction"].values
y_pred = (y_pred_prob > 0.5).astype(int)  # threshold at 0.5

# Compute metrics
acc = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_pred_prob)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)

print("\n Evaluation Metrics on Test Set")
print("Accuracy :", round(acc,4))
print("AUC      :", round(auc,4))
print("Precision:", round(prec,4))
print("Recall   :", round(rec,4))
print("\nConfusion Matrix:\n", cm)

Prediction file found in S3: aai-540/assignment4-1/batch-output/batch_test_no_id.csv.out
Downloaded predictions to: data/batch_predictions.csv

 Evaluation Metrics on Test Set
Accuracy : 0.9825
AUC      : 1.0
Precision: 1.0
Recall   : 0.9524

Confusion Matrix:
 [[36  0]
 [ 1 20]]


## Part 1 — Create Model Package Group

We create a Model Package Group to track all versions of our breast cancer detection model.
Each retraining run or hyperparameter change is registered in this group as a new numbered version.
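
Each registered version is addressed by an ARN that ends in the group name and version number, as in the ARN returned in Part 2. The sketch below splits such an ARN apart. `parse_model_package_arn` is a hypothetical helper, and the format is inferred from the ARN recorded in this notebook.

```python
def parse_model_package_arn(arn):
    """Split a versioned model package ARN into (group_name, version)."""
    # The sixth colon-separated field looks like "model-package/<group>/<version>".
    resource = arn.split(":", 5)[5]
    kind, group, version = resource.split("/")
    if kind != "model-package":
        raise ValueError(f"Not a model package ARN: {arn}")
    return group, int(version)


print(parse_model_package_arn(
    "arn:aws:sagemaker:us-east-1:533267190630:model-package/xgboost-breast-cancer-detection/1"
))
# ('xgboost-breast-cancer-detection', 1)
```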


In [8]:
# Part 1 — Create Model Package Group

import boto3

sm_client = boto3.client("sagemaker")

model_package_group_name = "xgboost-breast-cancer-detection"
model_package_group_desc = "Tracks all versions of the breast cancer XGBoost detection model (binary classification)."

# Create the group (safe create — won't fail if already exists)
try:
    sm_client.create_model_package_group(
        ModelPackageGroupName=model_package_group_name,
        ModelPackageGroupDescription=model_package_group_desc
    )
    print(" Model Package Group created:", model_package_group_name)
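except sm_client.exceptions.ClientError as e:
    # ResourceInUse is a ClientError subclass. An existing group may instead surface as a
    # validation error mentioning "already exists" (assumption), so accept both and re-raise the rest.
    if e.response["Error"]["Code"] != "ResourceInUse" and "already exists" not in str(e):
        raise
    print(" Model Package Group already exists:", model_package_group_name)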

# Describe to verify
group_info = sm_client.describe_model_package_group(ModelPackageGroupName=model_package_group_name)
print("\nModel Package Group Info:\n", group_info)

 Model Package Group created: xgboost-breast-cancer-detection

Model Package Group Info:
 {'ModelPackageGroupName': 'xgboost-breast-cancer-detection', 'ModelPackageGroupArn': 'arn:aws:sagemaker:us-east-1:533267190630:model-package-group/xgboost-breast-cancer-detection', 'ModelPackageGroupDescription': 'Tracks all versions of the breast cancer XGBoost detection model (binary classification).', 'CreationTime': datetime.datetime(2025, 10, 2, 17, 10, 24, 451000, tzinfo=tzlocal()), 'CreatedBy': {'UserProfileArn': 'arn:aws:sagemaker:us-east-1:533267190630:user-profile/d-gbpbykg68mhz/default-1758125121268', 'UserProfileName': 'default-1758125121268', 'DomainId': 'd-gbpbykg68mhz', 'IamIdentity': {'Arn': 'arn:aws:sts::533267190630:assumed-role/LabRole/SageMaker', 'PrincipalId': 'AROAXYKJTSNTJSTPBT2I5:SageMaker'}}, 'ModelPackageGroupStatus': 'Completed', 'ResponseMetadata': {'RequestId': 'eb53f027-dcdd-41bb-a331-6be72caebfd1', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'eb53f027-

## Part 2 — Register Model Package (with metrics)

We register the trained model into the `xgboost-breast-cancer-detection` group.
The package includes:
- Model artifact S3 location
- XGBoost inference container image URI
- Supported content/response types
- Evaluation metrics (Accuracy, AUC, Precision, Recall)
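
The metrics file in this notebook stores each metric as a bare number. As far as I know, SageMaker's model quality documentation describes a nested layout in which each scalar is wrapped as an object with `value` and `standard_deviation` keys. The sketch below converts to that layout under that assumption, which should be checked against the current documentation. `to_model_quality_format` is a hypothetical helper, and the confusion matrix is left out because its documented shape differs.

```python
import json


def to_model_quality_format(flat_metrics):
    """Wrap each scalar metric as {"value": ..., "standard_deviation": "NaN"}."""
    wrapped = {
        name: {"value": value, "standard_deviation": "NaN"}
        for name, value in flat_metrics.items()
    }
    return {"binary_classification_metrics": wrapped}


# Scalar results reported in Step 5.
flat = {"accuracy": 0.9825, "auc": 1.0, "precision": 1.0, "recall": 0.9524}
print(json.dumps(to_model_quality_format(flat), indent=2))
```

Registration works with either layout, since the Model Registry only stores the S3 URI of the file.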

In [10]:
import json

metrics_report = {
    "binary_classification_metrics": {
        "accuracy": round(acc, 4),
        "auc": round(auc, 4),
        "precision": round(prec, 4),
        "recall": round(rec, 4),
        "confusion_matrix": cm.tolist()
    }
}

metrics_file = "data/metrics.json"
with open(metrics_file, "w") as f:
    json.dump(metrics_report, f, indent=2)

print("Saved metrics to", metrics_file)

Saved metrics to data/metrics.json


In [11]:
s3_metrics_uri = sagemaker_session.upload_data(metrics_file, key_prefix=f"{prefix}/metrics")
print("Metrics JSON uploaded to:", s3_metrics_uri)

Metrics JSON uploaded to: s3://sagemaker-us-east-1-533267190630/aai-540/assignment4-1/metrics/metrics.json


In [12]:
response = sm_client.create_model_package(
    ModelPackageGroupName=model_package_group_name,
    ModelPackageDescription="XGBoost Breast Cancer Detection model with evaluation metrics",
    InferenceSpecification={
        "Containers": [
            {
                "Image": image_uri,
                "ModelDataUrl": model_artifact_s3
            }
        ],
        "SupportedContentTypes": ["text/csv"],
        "SupportedResponseMIMETypes": ["text/csv"],
        "SupportedRealtimeInferenceInstanceTypes": ["ml.m5.large"],
        "SupportedTransformInstanceTypes": ["ml.m5.large"]
    },
    ModelApprovalStatus="Approved",
    ModelMetrics={
        "ModelQuality": {
            "Statistics": {
                "ContentType": "application/json",
                "S3Uri": s3_metrics_uri
            }
        }
    }
)

model_package_arn = response["ModelPackageArn"]
print(" Model Package created")
print("Model Package ARN:", model_package_arn)

# Verify
pkg_info = sm_client.describe_model_package(ModelPackageName=model_package_arn)
print("\nModel Package Info:\n", pkg_info)

 Model Package created
Model Package ARN: arn:aws:sagemaker:us-east-1:533267190630:model-package/xgboost-breast-cancer-detection/1

Model Package Info:
 {'ModelPackageGroupName': 'xgboost-breast-cancer-detection', 'ModelPackageVersion': 1, 'ModelPackageArn': 'arn:aws:sagemaker:us-east-1:533267190630:model-package/xgboost-breast-cancer-detection/1', 'ModelPackageDescription': 'XGBoost Breast Cancer Detection model with evaluation metrics', 'CreationTime': datetime.datetime(2025, 10, 2, 17, 13, 31, 950000, tzinfo=tzlocal()), 'InferenceSpecification': {'Containers': [{'Image': '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.5-1', 'ImageDigest': 'sha256:c764382b16cd0c921f1b2e66de8684fb999ccbd0c042c95679f0b69bc9cdd12c', 'ModelDataUrl': 's3://sagemaker-us-east-1-533267190630/aai-540/assignment4-1/model/aai-540-xgboost-2025-10-02-16-57-07-654/output/model.tar.gz', 'ModelDataETag': 'b310c7a8ee6a4743fba4b0d35098b40d'}], 'SupportedTransformInstanceTypes': ['ml.m5.large'], 'Supp

## Part 3 — Create Model Card

We now create a Model Card to document:
- Model overview (name, description, algorithm, owner)
- Training details (framework, dataset, hyperparameters)
- Evaluation details (metrics + confusion matrix)
- Intended uses and limitations


In [15]:
# Part 3 — Create Model Card (schema-compliant minimal version)

model_card_name = "xgboost-breast-cancer-card"

model_card_content = {
    "model_overview": {
        "model_id": model_card_name,
        "model_name": "XGBoost Breast Cancer Detection",
        "model_description": "Binary classification model predicting malignant vs benign tumors using the WDBC dataset.",
        "problem_type": "BinaryClassification",
        "algorithm_type": "XGBoost"
    },
    "intended_uses": {
        "intended_uses": "Assist researchers and clinicians in tumor classification. Not a substitute for medical diagnosis."
    },
    "training_details": {
        "training_observations": "Model trained on the Wisconsin Diagnostic Breast Cancer dataset (569 samples, 30 features). Used XGBoost with hyperparameters: max_depth=5, eta=0.2, objective=binary:logistic, num_round=100."
    },
    "evaluation_details": [
        {
            "name": "Test Set Evaluation",
            "evaluation_observation": (
                f"Evaluation performed on hold-out test set (57 samples). "
                f"Results: Accuracy={round(acc,4)}, AUC={round(auc,4)}, "
                f"Precision={round(prec,4)}, Recall={round(rec,4)}. "
                f"Confusion Matrix={cm.tolist()}."
            )
        }
    ]
}

import json
try:
    sm_client.create_model_card(
        ModelCardName=model_card_name,
        ModelCardStatus="Draft",
        Content=json.dumps(model_card_content)
    )
    print(" Model Card created:", model_card_name)
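except (sm_client.exceptions.ConflictException, sm_client.exceptions.ResourceInUse):
    # ConflictException is included on the assumption that CreateModelCard reports an existing card that way.
    print(" Model Card already exists:", model_card_name)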

# Describe to verify
card_info = sm_client.describe_model_card(ModelCardName=model_card_name)
print("\nModel Card Info:\n", card_info)

 Model Card created: xgboost-breast-cancer-card

Model Card Info:
 {'ModelCardArn': 'arn:aws:sagemaker:us-east-1:533267190630:model-card/xgboost-breast-cancer-card', 'ModelCardName': 'xgboost-breast-cancer-card', 'ModelCardVersion': 1, 'Content': '{"model_overview": {"model_id": "xgboost-breast-cancer-card", "model_name": "XGBoost Breast Cancer Detection", "model_description": "Binary classification model predicting malignant vs benign tumors using the WDBC dataset.", "problem_type": "BinaryClassification", "algorithm_type": "XGBoost"}, "intended_uses": {"intended_uses": "Assist researchers and clinicians in tumor classification. Not a substitute for medical diagnosis."}, "training_details": {"training_observations": "Model trained on the Wisconsin Diagnostic Breast Cancer dataset (569 samples, 30 features). Used XGBoost with hyperparameters: max_depth=5, eta=0.2, objective=binary:logistic, num_round=100."}, "evaluation_details": [{"name": "Test Set Evaluation", "evaluation_observation