### AWS Cloud Club UWaterloo

# SageMaker workshop

This workshop explores a tabular, [binary classification](https://en.wikipedia.org/wiki/Binary_classification) use-case with significant **class imbalance**: predicting whether a passenger on the Titanic survived.

In this notebook, you'll first tackle the challenge with AutoML using [Amazon SageMaker Canvas AutoML](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-build-model.html), and then dive deeper with the [SageMaker built-in XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html).

## Contents

> ℹ️ **Tip:** You can use the Table of Contents panel in the left sidebar in JupyterLab / SageMaker Studio to view and navigate sections.

1. **[Prepare our environment](#Prepare-our-environment)**
1. **[Fetch the example dataset](#Fetch-the-example-dataset)**
1. **[SageMaker Canvas](#SageMaker-Canvas)**
1. **[XGBoost](#XGBoost)**

## Prepare our environment

To read datasets directly from Amazon S3 object storage to in-memory dataframes with Pandas, we'll need to install the [s3fs](https://s3fs.readthedocs.io/en/latest/) library which is not included by default in SageMaker Studio Distribution (at v1.9):

In [None]:
%pip install s3fs

To get started, we'll need to:

- **Import** some useful libraries (as in any Python notebook)
- **Configure**:
    - The [Amazon S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html#CoreConcepts) and folder where **data** should be stored (to keep our environment tidy)
    - The [IAM role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html) defining what **permissions** the jobs you create will have
- **Connect** to AWS in general (with [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)) and SageMaker in particular (with the [sagemaker SDK](https://sagemaker.readthedocs.io/en/stable/)), to use the cloud services

Run the cell below, to set these up.

> ℹ️ **Tip:** Just like in a regular [JupyterLab notebook](https://jupyterlab.readthedocs.io/en/stable/user/interface.html), you can run a code cell by clicking into the target cell and then pressing the play (▶️) button in the toolbar, or `Shift+Enter` on the keyboard.

In [None]:
%load_ext autoreload
%autoreload 2

# Python Built-Ins:
import json
import time
import os

# External Dependencies:
import boto3  # General-purpose AWS SDK for Python
import numpy as np  # For matrix operations and numerical processing
import pandas as pd  # Tabular data utilities
import sagemaker  # High-level SDK specifically for Amazon SageMaker
from sagemaker.automl.automlv2 import (
    AutoMLDataChannel,
    AutoMLTabularConfig,
    AutoMLV2 as AutoMLV2Estimator,
)
from sagemaker.feature_store.feature_group import FeatureGroup

# Local Helper Functions:
import util

# Setting up SageMaker parameters
sgmk_session = sagemaker.Session()  # Connect to SageMaker APIs
region = sgmk_session.boto_session.region_name  # The AWS Region we're using (e.g. 'us-east-1')
bucket_name = sgmk_session.default_bucket()  # Select an Amazon S3 bucket
bucket_prefix = "awscc-sm/titanic"  # Location in the bucket to store our files
sgmk_role = sagemaker.get_execution_role()  # IAM Execution Role to use for permissions

print(f"s3://{bucket_name}/{bucket_prefix}")
print(sgmk_role)

## Fetch the example dataset

This example uses [this dataset](https://www.kaggle.com/datasets/shubhamgupta012/titanic-dataset), which contains information about the passengers on the Titanic.

In the following cells we'll download the dataset locally, store it in Amazon S3, and **also** load a transformed copy into [Amazon SageMaker Feature Store](https://aws.amazon.com/sagemaker/feature-store/).

> ℹ️ **Tip:** You can train and deploy models in SageMaker **without using** SageMaker Feature Store, but we introduce it in this example to show you a wider range of SageMaker features.

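Unzip the archive (this assumes `archive.zip`, downloaded from the dataset page linked above, is already in the notebook's working directory):

In [None]:
!unzip -o archive.zip
!mkdir -p data
!mv SVMtrain.csv ./data/titanic.csv  # Moving also removes the original, so no separate `rm` is needed
!rm archive.zip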
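If you'd rather not rely on shell commands, the same extraction can be sketched in pure Python. This is only an illustrative alternative: the `extract_dataset` helper is hypothetical (it is not part of the workshop's `util` module), and it assumes the archive contains a member named `SVMtrain.csv`, as in the cell above.

```python
import shutil
import zipfile
from pathlib import Path


def extract_dataset(archive_path, member_name, dest_path):
    """Extract one file from a zip archive to dest_path, creating folders as needed."""
    dest = Path(dest_path)
    dest.parent.mkdir(parents=True, exist_ok=True)  # Like `mkdir -p`: no error if it exists
    with zipfile.ZipFile(archive_path) as archive:
        # Copy the member's bytes straight to the destination file:
        with archive.open(member_name) as src, open(dest, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return dest


# Example usage (assumes archive.zip is in the working directory):
# extract_dataset("archive.zip", "SVMtrain.csv", "data/titanic.csv")
```

Unlike the shell version, this can be re-run safely: the target folder may already exist, and an existing `data/titanic.csv` is simply overwritten.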

Upload the dataset

In [None]:
raw_data_path = os.path.join("data", "titanic.csv")

print("Uploading raw dataset to Amazon S3:")
raw_data_s3uri = f"s3://{bucket_name}/{bucket_prefix}/raw.csv"
!aws s3 cp {raw_data_path} {raw_data_s3uri}

In [None]:
%%time
feature_group_name = "awscc-sm-titanic"
print("Loading data to SageMaker Feature Store:")

util.data.load_sample_data(
    raw_data_path,
    fg_s3_uri=f"s3://{bucket_name}/{bucket_prefix}/feature-store",
    feature_group_name=feature_group_name,
    ignore_cols=[
        "PassengerId"
    ],
)

> ⏰ **You don't have to wait** for this cell to finish running: As soon as you reach the `Ingesting data...` step, you're ready to continue on to the next section!

▶️ As soon as you reach the `Ingesting data...` stage, you'll be able to see your "feature group" in the SageMaker Feature Store catalog:

- Open or switch to the tab where you've launched the SageMaker Studio home screen
- Choose `Data > Feature Store` from the sidebar menu to open the Feature Store UI

Note that you can explore the catalog either by "feature group" (table), or by searching for individual features themselves. Descriptions and some tags have already been populated for you, based on the dataset description.

![](img/feature-store-features.png "Screenshot of SMStudio Feature Store UI showing feature catalog")

## SageMaker Canvas

SageMaker Canvas AutoML makes it easy to get started on tabular ML problems, even without extensive data preparation or writing any code. This is because:

- AutoML will automatically explore multiple data pre-processing options, algorithms, and hyperparameters for you - to identify a high-performing model
- Even if you **do** want to perform some manual feature engineering first, Canvas has direct integrations from [SageMaker Data Wrangler](https://aws.amazon.com/sagemaker/data-wrangler/) (SageMaker's low-code/no-code data preparation tool).

▶️ While your data finishes importing to SageMaker Feature Store, let's **start an AutoML experiment using the raw CSV file**

1. **Open** the 🏠 *SageMaker Studio Home* page
1. **Choose** `Canvas` from the *applications* sidebar menu
1. **Click** the `Run Canvas` button to start the Canvas environment
1. (⏰ Once Canvas starts, which could take a minute or two) **Click** `Open canvas` to launch the Canvas UI

![](img/canvas-01-launch.png "Screenshot of SageMaker Studio home showing 'Canvas' application selected, with buttons to stop and open Canvas")

▶️ Import the raw dataset to Canvas ready to build a model:

- **Select** the `Datasets` tab from the left sidebar
- **Click** `Import data > tabular` to launch the flow
    - For **Dataset name**, enter `titanic`
    - For **Data source**, select `Amazon S3`
    - **Browse** to your data at: `Amazon S3/sagemaker-{your region and acct ID}/awscc-sm/titanic/raw.csv`
- Once you've selected your raw CSV, **Click** `Preview` and then `Create dataset`

![](img/canvas-02-datasets-list.png "Screenshot of SageMaker Canvas Datasets tab showing expanded button to import a tabular dataset")

![](img/canvas-03-data-selection.png "Screenshot of SageMaker Canvas dataset import flow showing Amazon S3 source with the raw CSV selected")

Once your dataset is imported successfully, you're ready to create a model from it.

▶️ From the same Canvas Datasets list

- **Select** your new `titanic` dataset by clicking on the checkbox
- **Click** the `Create a model` button above the dataset list

![](img/canvas-04-select-dataset.png "Screenshot of Canvas dataset list with demo dataset selected and 'Create a model' button highlighted")

▶️ Configure your Canvas model:

- For **Model name**, enter `awscc-uw-sm-canvas-model`
- Leave the model type as the default *Predictive analysis*
- For the **Target column**, select `Survived`
- **Un-check** the data column `PassengerID` to **drop** it from the model
- **Select** `Standard build` instead of the default 'Quick build' from the drop-down
- Once you've verified the configuration, **Click "Standard build"** to start the model build process

## XGBoost

Another useful tool to build high-performing models quickly is the set of [built-in algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html) offered by SageMaker for a wide range of use-cases.

Instead of using AutoML to automate the process of data pre-processing and hyperparameter tuning, we can directly use these built-in algorithms (or custom ones) for finer-grained control. In this example, we'll show the [XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html).


### Understand the algorithm requirements

The first step to using any SageMaker built-in algorithm is understanding its overall characteristics and the interface it offers. Here we'll refer to:

- The [algorithm docs](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) to understand the **detail** of the **data formats** and **(hyper)-parameters** it supports - as well as sample notebooks
- The [Common Parameters doc](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html) to compare the **high-level configurations** and capabilities between algorithms.


As discussed on the algorithm doc page, there are two ways to use XGBoost in SageMaker: as a pre-built algorithm (no script required), or as a framework (with your own custom training script).

In this example, we'll use pre-built algorithm mode, so we only need to fetch the container image URI:

In [None]:
image_uri = sagemaker.image_uris.retrieve("xgboost", region=region, version="1.7-1")
print(image_uri)

### Extract batch data from the SageMaker Feature Store

Next, we'll extract a snapshot of data from the (offline/batch) SageMaker Feature Store via serverless SQL query with [Amazon Athena](https://aws.amazon.com/athena/), to prepare for model training.

Feature Store **tracks the history** of records, allowing you to reproduce point-in-time snapshots even when features change over time.

- **Example queries** for time-travel and other views are available through the SageMaker Studio Feature Store UI: From your Feature Group, switch to the "Sample queries" tab.
- The additional `event_time`, `write_time`, `api_invocation_time`, `is_deleted` and `row_number` fields returned in the below query are metadata for this history tracking - so won't be used in the actual model training.

In [None]:
feature_group = FeatureGroup(feature_group_name, sagemaker_session=sgmk_session)
query = feature_group.athena_query()
table_name = query.table_name

data_extract_s3uri = f"s3://{bucket_name}/{bucket_prefix}/data-extract"
!aws s3 rm --quiet --recursive {data_extract_s3uri}  # Clear any previous extractions
print(f"Querying feature store to extract snapshot at:\n{data_extract_s3uri}")
query.run(
    f"""
    SELECT *
    FROM
        (SELECT *,
        row_number()
        OVER
            (PARTITION BY "PassengerID"
            ORDER BY "EventTime" DESC, Api_Invocation_Time DESC, write_time DESC)
        AS row_number
        FROM "sagemaker_featurestore"."{table_name}"
        WHERE "EventTime" <= {time.time()})
    WHERE row_number = 1 AND NOT is_deleted;
    """,
    output_location=data_extract_s3uri,
)
query.wait()

full_df = query.as_dataframe()
print(f"Got {len(full_df)} records")
full_df
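To make the query's logic more concrete, here is a minimal pure-Python sketch of the same idea: keep only records at or before the snapshot time, pick the newest record per ID, and then drop IDs whose newest record is a deletion. The `latest_records` helper and the sample rows are illustrative assumptions, not part of the Feature Store API or the workshop's `util` module.

```python
def latest_records(rows, as_of, id_key="PassengerID"):
    """Return the newest non-deleted record per ID with EventTime <= as_of."""
    newest = {}
    for row in rows:
        if row["EventTime"] > as_of:
            continue  # Ignore records written for times after the snapshot
        # Same ordering as the SQL: EventTime, then API invocation time, then write time
        sort_key = (row["EventTime"], row["api_invocation_time"], row["write_time"])
        current = newest.get(row[id_key])
        if current is None or sort_key > current[0]:
            newest[row[id_key]] = (sort_key, row)
    # Equivalent of `row_number = 1 AND NOT is_deleted`:
    return [row for _, row in newest.values() if not row["is_deleted"]]


# Made-up history: passenger 1 is updated at t=200, passenger 2 is deleted at t=150
sample_rows = [
    {"PassengerID": 1, "EventTime": 100.0, "api_invocation_time": 1, "write_time": 1,
     "is_deleted": False, "Fare": 7.25},
    {"PassengerID": 1, "EventTime": 200.0, "api_invocation_time": 2, "write_time": 2,
     "is_deleted": False, "Fare": 8.05},
    {"PassengerID": 2, "EventTime": 100.0, "api_invocation_time": 1, "write_time": 1,
     "is_deleted": False, "Fare": 71.28},
    {"PassengerID": 2, "EventTime": 150.0, "api_invocation_time": 2, "write_time": 2,
     "is_deleted": True, "Fare": 71.28},
]

snapshot_early = latest_records(sample_rows, as_of=120.0)  # Both passengers, original values
snapshot_late = latest_records(sample_rows, as_of=250.0)  # Only passenger 1, updated value
```

Changing `as_of` is what gives the "time travel" behavior: the same history yields different snapshots depending on the cut-off time.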

### Split and prepare datasets

From the [Input and Output Interface section](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html#InputOutput-XGBoost) of the algorithm doc, we know that XGBoost expects CSV or LibSVM data inputs for training, and optionally validation.
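For CSV training data, the algorithm doc also states that the target variable is expected in the first column, with no header row. The sketch below shows one way to make that ordering explicit rather than relying on the column order returned by the query. The `target_first` helper is hypothetical and the column names are only examples.

```python
def target_first(columns, target):
    """Return the column names reordered so that `target` comes first."""
    columns = list(columns)
    if target not in columns:
        raise ValueError(f"Target column {target!r} not found in {columns}")
    return [target] + [col for col in columns if col != target]


ordered = target_first(["pclass", "survived", "age", "fare"], target="survived")
print(ordered)  # ['survived', 'pclass', 'age', 'fare']
```

With a pandas DataFrame, the result could be applied as `df = df[target_first(df.columns, "survived")]` before writing the CSV files.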

Some extra data preparation is also required because, at the time of writing, this XGBoost algorithm version doesn't fully support string categorical features.

Below we **one-hot encode the categorical fields**, and then split the pre-processed data into randomly shuffled training, validation, and test sets.

In [None]:
df_model_data = full_df.drop(
    columns=[
        "passengerid", "eventtime", "write_time", "api_invocation_time", "is_deleted", "row_number"
    ],
    errors="ignore",  # Your DF may not have 'row_number' if you didn't do a time travel query
)

# One-hot encode categorical variables:
df_model_data = pd.get_dummies(df_model_data, dtype=int)

# Shuffle, then split the dataset 70% / 20% / 10% into train / validation / test:
train_data, validation_data, test_data = np.split(
    df_model_data.sample(frac=1, random_state=1729),
    [int(0.7 * len(df_model_data)), int(0.9 * len(df_model_data))],
)

# Create CSV files for Train / Validation / Test
train_data.to_csv("data/train.csv", index=False, header=False)
validation_data.to_csv("data/validation.csv", index=False, header=False)
test_data.to_csv("data/test.csv", index=False, header=False)

df_model_data


The datasets specific for this algorithm can then be uploaded to Amazon S3, ready to use as inputs to the training job:

In [None]:
model_data_s3uri = f"s3://{bucket_name}/{bucket_prefix}/model-data-xgb"

train_data_s3uri = model_data_s3uri + "/train/data.csv"
train_data.to_csv(train_data_s3uri, index=False, header=False)
validation_data_s3uri = model_data_s3uri + "/validation/data.csv"
validation_data.to_csv(validation_data_s3uri, index=False, header=False)
test_data_s3uri = model_data_s3uri + "/test/data.csv"
test_data.to_csv(test_data_s3uri, index=False, header=False)

### Train a model

With the data prepared in a compatible format, and the parameters collected, we're ready to run a training job through the SageMaker SDK [Estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) class, which provides a high-level wrapper over the underlying [SageMaker CreateTrainingJob API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html).

The training job runs on **separate, containerized infrastructure** from this notebook:

- **You specify** the number and type of instances, and the IAM permissions with which the job runs (which could be separate from the notebook execution role)
- The job is **independent** from the notebook: The input parameters, logs, metrics, and output artifacts are still available through the APIs even if the notebook disconnects/restarts part way through. (See [Estimator.attach(...)](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator.attach) classmethod for re-attaching to previous/ongoing jobs).
- A range of **other infrastructure parameters** are available like:
    - [SageMaker managed spot](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html), to optimize infrastructure costs
    - [Warm pool keep-alive](https://docs.aws.amazon.com/sagemaker/latest/dg/train-warm-pools.html), to speed up start of sequential jobs

In [None]:
%%time

xgb_estimator = sagemaker.estimator.Estimator(
    base_job_name="xgboost",
    role=sgmk_role,  # IAM role for job permissions (to access the S3 data)
    image_uri=image_uri,  # XGBoost algorithm container
    instance_count=1,
    instance_type="ml.m5.xlarge",  # Type of compute instance
    max_run=25 * 60,  # Limit job to 25 minutes

    # OPTIONALLY use spot instances to reduce cost:
    use_spot_instances=True,
    max_wait=30 * 60,  # Maximum clock time (including spot delays)

    output_path=f"s3://{bucket_name}/{bucket_prefix}/train-output",
)

xgb_estimator.set_hyperparameters(
    num_round=50,
    max_depth=5,
    alpha=2.5,
    eta=0.5,
    objective="binary:logistic",
    eval_metric="auc",
)

# Launch a SageMaker Training job by passing the S3 path of the datasets:
xgb_estimator.fit({
    "train": sagemaker.inputs.TrainingInput(train_data_s3uri, content_type="csv"),
    "validation": sagemaker.inputs.TrainingInput(validation_data_s3uri, content_type="csv"),
})
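Because this use-case involves class imbalance, one option worth knowing about is XGBoost's `scale_pos_weight` hyperparameter, which is **not** set in the job above. A commonly suggested starting value is the ratio of negative to positive training examples. The sketch below computes that ratio; the helper name and the example labels are illustrative assumptions, and whether the setting actually improves this model would need to be checked on the validation data.

```python
def suggest_scale_pos_weight(labels):
    """Return the ratio of negative to positive examples in a list of 0/1 labels."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError("No positive examples found")
    return negatives / positives


example_labels = [0, 0, 0, 1, 0, 1, 0, 0]  # Illustrative labels, not the real dataset
weight = suggest_scale_pos_weight(example_labels)
print(weight)  # 3.0
```

If you wanted to try it, the value could be passed alongside the other hyperparameters, for example `xgb_estimator.set_hyperparameters(scale_pos_weight=weight)`, before calling `fit(...)`.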

As well as the logs streamed to the notebook, you can follow the status of the job in:
- The [Training > Training jobs page of the AWS Console for SageMaker](https://console.aws.amazon.com/sagemaker/home?#/jobs)
    - Including links to Amazon CloudWatch console to drill in to job logs and metric graphs
- The Resources > Experiments and trials pane in SageMaker Studio
    - Jobs started without an explicit Experiment configuration will appear under the "Unassigned trial components" folder

### Batch inference

Once the model is trained, we can either deploy it to a real-time endpoint to make inference requests on-demand, or use it to run batch jobs on existing datasets.

In this first example, we'll use [SageMaker Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html) to run batch inference. SageMaker will spin up a temporary cluster, send our data through the model, and shut down the resources as soon as all the input data is processed.

To get started, you can create a [Transformer object](https://sagemaker.readthedocs.io/en/stable/api/inference/transformer.html) directly from `estimator.transformer(...)`. However, in this example we'll go via `create_model()` first so we can easily add the model to SageMaker Model Registry later:

In [None]:
xgb_model = xgb_estimator.create_model()

Because SageMaker Batch Transform orchestrates the process of sending data through the model and consolidating the outputs, there are a range of extra parameters beyond the basic S3 output location and instance size/type.

By default, SageMaker Batch Transform treats each file in the input S3 prefix as one request payload and generates an output file of the same name, appending `.out`. Below we configure more specific handling for tabular data though:

- Interpret each line of input files as a separate record with `split_type`, and interpret each line of output data as separate record with `assemble_with`.
- Make `MultiRecord` batch requests up to `max_payload` Megabytes each - allowing up to `max_concurrent_transforms` concurrent requests per instance.
- Exclude the `survived` target label column (which is present in the test data) from model requests with `input_filter`.
- Include the input data as well as the predictions in the result with `join_source`.

The result will still be a single `.csv.out` file for each `.csv` input, but SageMaker has control of individual request batch sizes to optimize resource use.
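As a rough mental model of how `input_filter` and `join_source` combine for a single CSV record, consider the following local simulation. It does not call SageMaker at all: `simulate_transform_line` and `fake_model` are hypothetical stand-ins, and the example record is made up.

```python
def simulate_transform_line(line, predict):
    """Mimic input_filter="$[1:]" plus join_source="Input" for one CSV record."""
    fields = line.strip().split(",")
    features = fields[1:]  # input_filter="$[1:]": drop the leading target field
    prediction = predict(features)  # Stand-in for the request to the model container
    # join_source="Input": the unfiltered input record, followed by the prediction
    return ",".join(fields + [str(prediction)])


def fake_model(features):
    """Hypothetical stand-in for the model: always returns the same score."""
    return 0.5


print(simulate_transform_line("1,3,22.0,7.25\n", fake_model))  # 1,3,22.0,7.25,0.5
```

Note that the model never sees the leading label field, but the joined output still contains it, which is what makes it possible to compare predictions against actual values afterwards.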

In [None]:
eval_s3uri = f"s3://{bucket_name}/{bucket_prefix}/xgb-evaluation"

xgb_transformer = xgb_model.transformer(
    output_path=eval_s3uri,  # S3 output location
    instance_count=1,  # Number of instances to spin up for the job
    instance_type="ml.m5.large",  # Instance type to use for inference
    strategy="MultiRecord",  # Request inference in batches, for efficiency
    accept="text/csv",  # Request CSV response format
    assemble_with="Line",  # Consolidate response records with newlines between
    max_concurrent_transforms=2,  # Each instance is sent up to N requests concurrently
    max_payload=1,  # Max size per request (in Megabytes)
)

xgb_transformer.base_transform_job_name = "sm101-dm-xgboost"
xgb_transformer.transform(
    test_data_s3uri,
    content_type="text/csv",  # Test data is in CSV format
    split_type="Line",  # Each line of test data is a separate record
    join_source="Input",  # Output joined data including the input features as well as prediction
    input_filter="$[1:]",  # Exclude the leading (actual target value) field
    # wait=True,  # (Default True) Block the notebook kernel until the job completes
    # logs=True,  # (Default True) Stream job logs to the notebook
)

Once the job completes, we can read the results into a dataframe directly from Amazon S3:

In [None]:
df_eval = pd.read_csv(
    eval_s3uri + "/data.csv.out",
    header=None,
    names=test_data.columns.tolist() + ["prob_survive"],
)
df_eval

To generate a report for the model:

In [None]:
report = util.reporting.generate_binary_classification_report(
    y_real=df_eval["survived"].values,
    y_predict_proba=df_eval["prob_survive"].values,
    class_names_list=["Died", "Lived"],
    title="Initial XGBoost model",
)

# Store the model quality report locally and on Amazon S3:
with open("data/report-xgboost.json", "w") as f:
    json.dump(report, f, indent=2)
model_quality_s3uri = f"s3://{bucket_name}/{bucket_prefix}/{xgb_model.name}/model-quality.json"
!aws s3 cp data/report-xgboost.json {model_quality_s3uri}
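The internals of the `util.reporting` helper aren't shown in this notebook, but the core of a binary classification report can be sketched in a few lines. The example below is an independent illustration, not the helper's actual implementation: it turns predicted probabilities into class labels at a threshold, then counts the confusion-matrix cells to derive precision and recall. The function name, the 0.5 threshold, and the sample values are all assumptions.

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Compute confusion-matrix counts, precision and recall at a probability threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # Survived, predicted survived
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # Died, predicted survived
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # Survived, predicted died
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # Died, predicted died
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn, "precision": precision, "recall": recall}


# Made-up labels and scores, purely for illustration:
metrics = binary_metrics([1, 0, 1, 0, 1], [0.9, 0.4, 0.3, 0.6, 0.8])
```

With imbalanced classes, precision and recall for the minority class are usually more informative than overall accuracy, and moving `threshold` trades one off against the other.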

### Register and share the model

The trained model is already available in the SageMaker APIs to deploy and re-use (you should see it, for example, in the [Models page of the SageMaker Console](https://console.aws.amazon.com/sagemaker/home?#/models)).

However, we can improve discoverability and governance by cataloging it in the [SageMaker Model Registry](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html). Here extra metadata can be associated, including I/O formats and the model quality report generated above:

In [None]:
xgb_model.register(
    content_types=["text/csv"],
    response_types=["text/csv"],
    model_package_group_name="awscc-uw-sm-dm",
    description="Initial XGBoost model",
    model_metrics=sagemaker.model_metrics.ModelMetrics(
        model_statistics=sagemaker.model_metrics.MetricsSource(
            content_type="application/json",
            s3_uri=model_quality_s3uri,
        ),
    ),
    domain="MACHINE_LEARNING",
    task="CLASSIFICATION",
    sample_payload_url=test_data_s3uri,
)

You can explore and manage your versioned registry model packages in SageMaker Studio, including **reviewing and approving** new versions to trigger automated deployments.

## Conclusions

In this notebook, we saw how [SageMaker Canvas AutoML](https://aws.amazon.com/sagemaker/canvas/) can accelerate new tabular ML projects to a high-accuracy, deployable model with no coding required. We also saw how you can dive deeper using the [SageMaker built-in algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html) to customize your models without implementing common algorithms from scratch.

We also saw brief intros to how [SageMaker Feature Store](https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store.html) can help catalog shared feature data, and how [SageMaker Model Registry](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html) helps with tracking and managing trained models. For more information on these MLOps features, you can refer to the documentation and the official [SageMaker notebook examples repository](https://github.com/aws/amazon-sagemaker-examples).

# IMPORTANT
### STAY UNTIL THE END TO LEARN HOW TO FREE UP CLOUD RESOURCES TO AVOID BEING CHARGED