# A Tour of SageMaker

<div style="text-align: left"><img src="images/a_tour_of_sagemaker.png" alt="A Tour of SageMaker" style="width: 300px;"/></div>

## Introduction

Welcome to **A Tour of SageMaker**! 😊

In this notebook-shaped presentation, we'll show you **every** nook and cranny ⚙️ of [Amazon SageMaker](https://aws.amazon.com/sagemaker/) - our end-to-end ML service:

* How it supports **all** stages of the ML pipeline - from data collection and labeling to model deployment... and back ♾️
* How it automates the entire ML workflow - welllll, almost... 👀 looking at you [Amazon Augmented AI](https://aws.amazon.com/augmented-ai/) - and last but *certainly* not least
* How to optimize everything for cost 💰

Ready to master the (not so) mystic art of running ML-powered apps with Amazon SageMaker? 🧙

Hope you enjoy the ride! 😉

<div style="text-align: left"><img src="https://media.tenor.com/jNGGYr4g4xAAAAAM/benedict-cumberbatch-dr-strange.gif"/></div>

## Table of Contents

* [Introduction](#Introduction)
* [The Current State of AI/ML](#The-Current-State-of-AI/ML)
* [AWS ML Stack: The Big Picture](#AWS-ML-Stack:-The-Big-Picture)
* [Amazon SageMaker Overview](#Amazon-SageMaker-Overview)
* [The Quick Tour](#The-Quick-Tour)
    - [Prerequisites](#Prerequisites)
    - [Data Preparation](#Data-Preparation)
        - [Download and Explore the Dataset](#Download-and-Explore-the-Dataset)
        - [Prepare and Upload Data](#Prepare-and-Upload-Data)
    - [Model Training](#Model-Training)
    - [Model Deployment](#Model-Deployment)
    - [Model Evaluation](#Model-Evaluation)
    - [Model Tuning](#Model-Tuning)
    - [Clean Up](#Clean-Up)
* [Data Preparation with Data Wrangler](#Data-Preparation-with-Data-Wrangler)
* [Feature Engineering with Processing Jobs](#Feature-Engineering-with-Amazon-SageMaker-Processing)
* [Deploying Models at Scale](#Deploying-Models-at-Scale)
* [Coming Up](#Coming-Up) 🆕
* [References](#References)

## The Current State of AI/ML

[🔝 Back to the top](#A-Tour-of-SageMaker)

If you made it all the way here, you've probably heard these before, or some variation thereof:

> "Data is the new Oil" 🛢️ -- Clive Humby

> "AI is the new electricity" ⚡ -- Andrew Ng

> "ML is the last invention that humanity will ever need to make." 🔮 -- Nick Bostrom

Riding [Gartner's Hype Cycle for AI](https://www.gartner.com/en/articles/what-s-new-in-artificial-intelligence-from-the-2022-gartner-hype-cycle), we can clearly see that while there's plenty of hype in and around AI/ML, there's also a lot of potential.

<img src="https://emtemp.gcom.cloud/ngw/globalassets/en/articles/images/hype-cycle-for-artificial-intelligence-2022.png" style="width: 500px;"/>

Looking at the current state of the so-called digital economy, there's one thing we know for sure - the reach of ML is growing. 📈

🤨 **Don't believe me?**

Let's look at the numbers then:

* By 2025, global spending on AI will reach $204 billion (source: [IDC](https://www.idc.com/getdoc.jsp?containerId=prUS48191221))
* By the end of 2024, 75% of enterprises will shift from piloting to operationalizing AI (source: [Gartner](https://www.gartner.com/en/newsroom/press-releases/2020-06-22-gartner-identifies-top-10-data-and-analytics-technolo))
* 57% say that AI would transform their organization in the next 3 years (source: [Deloitte](https://www2.deloitte.com/us/en/insights/focus/cognitive-technologies/state-of-ai-and-intelligent-automation-in-business-survey.html))

"Well, well, well", you might say, "this looks promising! Maybe I should jump on the [bandwagon](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=1056774) after all"... 

![But that's not the whole story](https://c.tenor.com/hsk-j1UTNFMAAAAC/but-thats-not-the-whole-story-derek-muller.gif)

Here's the scary part (source: [InfoWorld](https://www.infoworld.com/article/3639028/why-ai-investments-fail-to-deliver.html)):

* 85% of all AI/ML projects fail to deliver 😨
* While ~50% never make it to production 😱

⚠️ **What's happening there? Why is the failure rate so high?**

The truth is that ML development can be both **complex** 🤯 and **costly** 💸

Traditionally, there are multiple barriers to adoption at every step of the ML workflow:
* Data collection and preparation can be **time-consuming** and **undifferentiated**
* Choosing the right ML algorithm is often done by **trial and error**
* Lengthy training times often lead to **higher costs**
* Model tuning can involve **very long cycles** and require adjusting thousands of different combinations
* Models need to be **monitored constantly** and **scaled** to meet demand

To make matters worse, many of the tools developers take for granted when building traditional software, such as debuggers, project management, and collaboration tools, are disconnected when it comes to ML development.

**How can we make things simpler?**

Enter [Amazon SageMaker](https://aws.amazon.com/sagemaker/)...

![Et voilà!](https://media.tenor.com/NWqisN5ga_MAAAAC/voila-iron-man.gif)

Amazon SageMaker was built from the ground up to provide every developer and data scientist the ability to build, train, and deploy ML models quickly and at lower cost by providing the tools required for every step of the ML development lifecycle in one integrated, fully managed service.

But before we dive deeper into how Amazon SageMaker does this, let's take a step back and look at what AWS has to offer in terms of AI/ML and where Amazon SageMaker fits.

## AWS ML Stack: The Big Picture

[🔝 Back to the top](#A-Tour-of-SageMaker)

At AWS, our mission is to

> **Put ML in the hands of every developer and data scientist**.

We're constantly creating and innovating on behalf of our customers to deliver the broadest and deepest set of ML capabilities for builders of all levels of expertise. 

The result of this combined effort is what we call the **AWS ML Stack** (*pictured below*).

<img src="images/aws_ml_stack.png" alt="AWS ML Stack" style="width: 900px;"/>

Each layer of the stack is focused on removing what we call [undifferentiated heavy lifting](https://aws.amazon.com/blogs/aws/we_build_muck_s/), i.e. work that adds little or no value to the mission of a company, so that our customers can move faster.

**Let's look at each layer more closely...**

At the **ML Frameworks & Infrastructure** layer (*bottom*), expert practitioners can develop on the framework of their choice as a managed experience in Amazon SageMaker or use the [AWS Deep Learning AMIs](https://aws.amazon.com/machine-learning/amis/), which are fully configured with the latest versions of the most popular deep learning frameworks and tools – including [PyTorch](https://aws.amazon.com/pytorch/), [MXNet](https://aws.amazon.com/mxnet/), TensorFlow, and [many more](https://docs.aws.amazon.com/sagemaker/latest/dg/frameworks.html). AWS provides the broadest and deepest portfolio of compute, networking, and storage infrastructure services with a choice of processors and accelerators - [AWS Trainium / `Trn1` Instances](https://aws.amazon.com/machine-learning/trainium/) and [Habana Gaudi / `DL1` Instances](https://aws.amazon.com/ec2/instance-types/dl1/) for training, [AWS Inferentia / `Inf1` Instances](https://aws.amazon.com/machine-learning/inferentia/) for inference, and even [FPGAs / `F1` Instances](https://aws.amazon.com/ec2/instance-types/f1/) - to meet our customers' unique performance and budget needs for ML.

> 📚 For an overview of the different services that support ML workloads in terms of infrastructure, check out [AWS Machine Learning Infrastructure](https://aws.amazon.com/machine-learning/infrastructure/)

<img src="images/ml_infra_overview.png" alt="ML Infrastructure Overview" style="width: 900px;"/>

At the **ML Services** layer (*middle*) is Amazon SageMaker, which provides every developer and data scientist with the ability to build, train, and deploy ML models *at scale*. It removes the complexity from each step of the ML workflow so you can more easily deploy your ML use cases, anything from predictive maintenance to computer vision to predicting customer behaviors. Customers achieve up to 10x improvement in data scientists' productivity with Amazon SageMaker.

Finally, the **AI Services** layer (*top*) contains services that allow developers to easily add intelligence to any application **without** needing ML skills. We group these services into two sub-groups: **Core** services cover text, documents, chatbots, speech and vision, while **Specialized** services cover Business Processes, Search, Code & DevOps, Industrial and Healthcare.

> 💡 Want to see some of these AI services in action? Head over to the [AWS AI Service Demos](https://ai-service-demos.go-aws.com/)

<img src="https://miro.medium.com/max/1358/0*MNUhgCBYMi5k849I" alt="ML Infrastructure Overview" style="width: 500px;"/>

## Amazon SageMaker Overview

[🔝 Back to the top](#A-Tour-of-SageMaker)

SageMaker is a big service with a lot of different features and capabilities (*pictured below*).

We typically talk about those capabilities as falling into four categories:

1. Data Preparation
2. Model Building
3. Training & Tuning, and
4. Deployment & Management

These four sets of capabilities address the needs that ML builders have and the challenges they face at each stage of a model's lifecycle.

**Let's do a quick (mental) exercise...**

Look carefully at the image below 👀 Read each feature description - don't just F- or Z-scan your way through.

**Is there anything that may be of particular interest to you and your organization at this point in time?**

<img src="images/sagemaker_overview.png" alt="Amazon SageMaker Overview" style="width: 900px;"/>

## The Quick Tour

[🔝 Back to the top](#A-Tour-of-SageMaker)

Now that we know what Amazon SageMaker is and what it has to offer, let's look at a simple use case that showcases a small subset of its capabilities.

### Prerequisites

[🔝 Back to the top](#A-Tour-of-SageMaker)

We'll start by importing some Python libraries and defining some helper functions

In [None]:
import os
import sys
import time

import boto3                         # AWS SDK for Python
import sagemaker                     # Amazon SageMaker SDK for Python

import numpy as np                   # Matrix multiplication and numerical processing
import pandas as pd                  # Munging tabular data
import matplotlib.pyplot as plt      # Charts and visualizations

from IPython.display import (        # Display tools in IPython
    display,
    HTML,
    Image,
    Latex,
    Markdown
)

def printmd(text):
    """Displays a Markdown string"""
    display(Markdown(text))

def printhtml(text):
    """Displays an HTML string"""
    display(HTML(text))

def printtex(text):
    """Displays a LaTeX string"""
    display(Latex(text))

# Debug
printmd(f"Numpy: `{np.__version__}`")
printmd(f"Pandas: `{pd.__version__}`")
printmd(f"Boto3: `{boto3.__version__}`")
printmd(f"SageMaker: `{sagemaker.__version__}`")

Let's gather some constants that we'll use later in this demo

In [None]:
# Initialize SageMaker client
sagemaker_client = boto3.client('sagemaker')

# Manages interactions with the Amazon SageMaker APIs
sagemaker_session = sagemaker.session.Session()

# The AWS Region that we're using
region = sagemaker_session.boto_region_name

# The IAM execution role assumed by SageMaker
role = sagemaker.get_execution_role()

# The S3 bucket to be used by this session
bucket = sagemaker_session.default_bucket()

# Where we'll store our data and model artifacts
prefix = "smlabs/sagemaker_tour"

printmd(f"Region 🌎: `{region}`")
printmd(f"Bucket 🪣: `{bucket}`")
printmd(f"Role 👷: `{role}`")

### Data Preparation

[🔝 Back to the top](#A-Tour-of-SageMaker)

> This section was adapted from [Download, Prepare and Upload Training Data](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-ex-data.html)

#### Download and Explore the Dataset

[🔝 Back to the top](#A-Tour-of-SageMaker)

In this demo, we'll use the [Bank Marketing Data Set](https://archive.ics.uci.edu/ml/datasets/bank+marketing) (Moro *et al.*, 2014).

The data comes from direct marketing campaigns of a Portuguese banking institution 🏦

The marketing campaigns were based on phone calls 📞

Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed (`yes`) 👍 or not (`no`) 👎

There are four datasets:

1. `bank-additional-full.csv` with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010)

2. `bank-additional.csv` with 10% of the examples (4119), randomly selected from 1), and 20 inputs.

3. `bank-full.csv` with all examples and 17 inputs, ordered by date (older version of this dataset with fewer inputs).

4. `bank.csv` with 10% of the examples and 17 inputs, randomly selected from 3) (older version of this dataset with fewer inputs).

The goal is to predict whether the client will subscribe (`yes/no`) to a term deposit as indicated by the variable `y`.

Let's start by downloading the dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)

In [None]:
!wget -N "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip"
!unzip -o bank-additional.zip
# This will create a folder named 'bank-additional'

For this demo, we'll use the full dataset with all the inputs (`bank-additional-full.csv`)

In [None]:
# Load dataset from CSV file
data = pd.read_csv('./bank-additional/bank-additional-full.csv')
pd.set_option('display.max_columns', 500)    # Make sure we can see all columns
pd.set_option('display.max_rows', 50)        # Show at most 50 rows

# Debug
data.head()

which we'll store in the default bucket for later use

In [None]:
input_source = sagemaker_session.upload_data('./bank-additional/bank-additional-full.csv', bucket=bucket, key_prefix=f'{prefix}/input_data')

Let's talk a little bit about the data and its attributes

In [None]:
!cat bank-additional/bank-additional-names.txt

Now we can start asking some questions...

**1/ How are the features distributed?**

In [None]:
# Frequency tables for each categorical feature
printmd("##### Categorical Features ")
for column in data.select_dtypes(include=['object']).columns:
    display(pd.crosstab(index=data[column], columns='% observations', normalize='columns'))

# Histograms for each numeric feature
printmd("##### Numeric Features ")
display(data.describe())
%matplotlib inline
hist = data.hist(bins=30, sharey=True, figsize=(12, 12))

**2/ How are the features related to one another?**

In [None]:
pd.plotting.scatter_matrix(data, figsize=(12, 12))
plt.show()

**What conclusions can you draw from the information above?**

#### Prepare and Upload Data

[🔝 Back to the top](#A-Tour-of-SageMaker)

Our preprocessing pipeline will be very simple.

We will create a variable to indicate that there was no prior contact (`no_previous_contact`) and another one to indicate that the individual is not currently employed (`not_working`).

Finally, we will convert all categorical variables into dummy/indicator variables.

In [None]:
data['no_previous_contact'] = np.where(data['pdays'] == 999, 1, 0)                            # Indicator variable to capture when pdays takes a value of 999 (no prior contact)
data['not_working'] = data['job'].isin(['student', 'retired', 'unemployed']).astype(int)      # Indicator for individuals not actively employed
model_data = pd.get_dummies(data, dtype=int)                                                  # Convert categorical variables into sets of 0/1 indicators
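
To make the effect of these three steps easier to see, here is a minimal sketch on a made-up, three-row frame. The `toy` data and its values are assumptions for illustration only; the column names mirror the real dataset.

```python
import pandas as pd

# Toy records that mimic three columns of the Bank Marketing data (values are made up)
toy = pd.DataFrame({
    "pdays": [999, 3, 999],
    "job": ["student", "admin.", "retired"],
    "y": ["no", "yes", "no"],
})

# 999 is the dataset's sentinel for "client was never contacted before"
toy["no_previous_contact"] = (toy["pdays"] == 999).astype(int)

# Flag clients whose job category implies they are not actively employed
toy["not_working"] = toy["job"].isin(["student", "retired", "unemployed"]).astype(int)

# One indicator column per category; dtype=int keeps the output as 0/1
toy_encoded = pd.get_dummies(toy, dtype=int)

print(toy_encoded.columns.tolist())
```

Note how the target `y` is expanded into a `y_no` / `y_yes` pair, which is why both columns show up again when we build the training files.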

We'll also remove all the economic features (see description above) and `duration` from our data.

In [None]:
model_data = model_data.drop([
    'duration',
    'emp.var.rate',
    'cons.price.idx',
    'cons.conf.idx',
    'euribor3m',
    'nr.employed'
], axis=1)

We'll split the data into three datasets: **train**, **validation**, **test**

In [None]:
# Randomly sort the data then split out first 70%, second 20%, and last 10%
train_data, validation_data, test_data = np.split(model_data.sample(frac=1, random_state=42), [int(0.7 * len(model_data)), int(0.9 * len(model_data))])
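
If the one-liner above feels a bit dense, the same 70/20/10 split can be sketched with explicit cut points. This is an equivalent illustration on a toy frame, not the notebook's own code; `toy`, `train_end` and `validation_end` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame with 100 rows standing in for model_data
toy = pd.DataFrame({"x": np.arange(100)})

# Shuffle once with a fixed seed so that the split is reproducible
shuffled = toy.sample(frac=1, random_state=42)

# Cut points at 70% and 90% of the rows
train_end = int(0.7 * len(shuffled))
validation_end = int(0.9 * len(shuffled))

train = shuffled.iloc[:train_end]
validation = shuffled.iloc[train_end:validation_end]
test = shuffled.iloc[validation_end:]

print(len(train), len(validation), len(test))
```

Because the three slices are taken from a single shuffled frame, no row can end up in more than one dataset.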

In this demo, we'll use the [built-in XGBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) container which supports CSV and libsvm formats for training and inference.

We'll keep the original CSV format, but there are a few restrictions:

1. The first column must be the target variable and
2. The CSV should **not** include headers

In [None]:
pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('train.csv', index=False, header=False)
pd.concat([validation_data['y_yes'], validation_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('validation.csv', index=False, header=False)
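
A quick way to convince ourselves that the layout is right (target first, no header) is to serialize a tiny frame into memory and look at the text. The sketch below uses made-up values; only the `y_no` / `y_yes` column names come from the notebook.

```python
import io

import pandas as pd

# Toy encoded frame: two features plus the y_no / y_yes indicator pair
toy = pd.DataFrame({
    "age": [30, 45],
    "y_no": [1, 0],
    "y_yes": [0, 1],
    "campaign": [1, 3],
})

# Target first, then every feature; both label indicators are removed from the features
ordered = pd.concat([toy["y_yes"], toy.drop(["y_no", "y_yes"], axis=1)], axis=1)

# Serialize the same way the notebook does: no index, no header row
buffer = io.StringIO()
ordered.to_csv(buffer, index=False, header=False)
csv_text = buffer.getvalue()

print(csv_text)
```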

Finally, we upload a copy of the data to the default S3 bucket where SageMaker can access it

In [None]:
train_source = sagemaker_session.upload_data('train.csv', bucket=bucket, key_prefix=f'{prefix}/train')
validation_source = sagemaker_session.upload_data('validation.csv', bucket=bucket, key_prefix=f'{prefix}/validation')

# Debug
print(train_source)
!aws s3 ls {bucket}/{prefix}/train/
print(validation_source)
!aws s3 ls {bucket}/{prefix}/validation/

### Model Training

[🔝 Back to the top](#A-Tour-of-SageMaker)

As we mentioned in the previous section, we'll use [XGBoost](https://github.com/dmlc/xgboost) to predict whether an individual will subscribe to the product. **But what exactly is it?**

XGBoost is a popular and efficient open-source implementation of the gradient boosted trees algorithm - a supervised learning algorithm that attempts to accurately predict a target variable by combining the estimates of a set of simpler, weaker models.

It is a widely used tool among Kaggle competitors and can usually be found in many winning submissions.

> For additional details, check out [How XGBoost Works](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost-HowItWorks.html)
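
To build some intuition for "combining the estimates of a set of simpler, weaker models", here is a deliberately simplified sketch of gradient boosting for squared error, using one-split trees (stumps) in plain NumPy. It is **not** how XGBoost is implemented (XGBoost adds regularization, second-order gradient information and many optimizations); all function names are hypothetical, and `eta` simply mirrors the learning-rate hyperparameter we set later.

```python
import numpy as np

def fit_stump(x, residuals):
    """Finds the threshold split on a 1-D feature that minimizes squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:          # Skip the max so the right side is never empty
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    """Evaluates a single stump on a 1-D feature."""
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, rounds=50, eta=0.2):
    """Fits each new stump to the residuals of the current ensemble."""
    base = y.mean()
    prediction = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(rounds):
        stump = fit_stump(x, y - prediction)              # Weak learner fits what is still unexplained
        prediction += eta * predict_stump(stump, x)       # Shrink its contribution by the learning rate
        stumps.append(stump)
    return base, stumps, prediction

# Made-up regression problem
x = np.linspace(0, 6, 50)
y = np.sin(x)

base, stumps, fitted = boost(x, y)
baseline_mse = np.mean((y - base) ** 2)
boosted_mse = np.mean((y - fitted) ** 2)
print(f"MSE with the mean only: {baseline_mse:.4f}, after boosting: {boosted_mse:.4f}")
```

Each stump on its own is a poor model, but because every round targets the remaining error, the ensemble fits the training data better than the constant baseline.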

The first thing we need is the location of the image that contains SageMaker's implementation of XGBoost

In [None]:
container = sagemaker.image_uris.retrieve(region=boto3.Session().region_name, framework='xgboost', version='latest')
printmd(f"XGBoost Image 📦: `{container}`")

Then, we create input channels for the train and validation datasets that we uploaded to the S3 bucket

In [None]:
# Define dataset paths
train_path = f"s3://{bucket}/{prefix}/train/"
validation_path = f"s3://{bucket}/{prefix}/validation/"
test_path = f"s3://{bucket}/{prefix}/test/"
output_path = f"s3://{bucket}/{prefix}/output/"

# Declare inputs
s3_input_train = sagemaker.inputs.TrainingInput(s3_data=train_path, content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data=validation_path, content_type='csv')

Finally, we can start training our model

> Wondering which instance types are available? Head over to [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/), scroll down to the **On-Demand Pricing** section, select the tab that matches your use case and the AWS region you're in for a list of supported instance types.

In [None]:
# 1. Create estimator
xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    instance_count=1, 
                                    instance_type='ml.m5.large',
                                    output_path=output_path,
                                    sagemaker_session=sagemaker_session)

# 2. Set hyperparameters
# https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        num_round=100)

# 3. Start training
xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}) 

### Model Deployment

[🔝 Back to the top](#A-Tour-of-SageMaker)

Now that the training phase is finished, we can deploy the model

In [None]:
xgb_predictor = xgb.deploy(initial_instance_count=1,
                           instance_type='ml.t2.medium')

### Model Evaluation

[🔝 Back to the top](#A-Tour-of-SageMaker)

One easy way to evaluate the model is to compare actual vs predicted values.

In our case, since we're simply trying to predict whether the customer subscribes to a product (`1`) or not (`0`), this will produce a standard 2-class confusion matrix.

To pass data to our endpoint, we'll serialize it as a CSV string and decode the resulting CSV.

In [None]:
xgb_predictor.serializer = sagemaker.serializers.CSVSerializer()

Now, we'll create a simple function to make the predictions

In [None]:
def predict(data, predictor, rows=100):
    """Returns predictions for a dataset by invoking a predictor endpoint"""
    # Split dataset into mini-batches of rows
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        predictions = ','.join([predictions, predictor.predict(array).decode('utf-8')])
    return np.fromstring(predictions[1:], sep=',')  # predictions[1:] skips the leading comma added by the first join

# Drop the target columns before sending the features to the endpoint
predictions = predict(test_data.drop(['y_no', 'y_yes'], axis=1).to_numpy(), xgb_predictor)
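
The batching logic can be tried out without a live endpoint. The sketch below swaps the real predictor for a hypothetical `FakePredictor` that returns one made-up score per row as comma-separated bytes; that response format is an assumption for illustration, so check what your container actually returns.

```python
import numpy as np

class FakePredictor:
    """Hypothetical stand-in for a SageMaker predictor; returns one score per row as CSV bytes."""

    def predict(self, batch):
        scores = batch.mean(axis=1)  # Made-up "model": the average of the features
        return ",".join(f"{score:.2f}" for score in scores).encode("utf-8")

def predict_in_batches(data, predictor, rows=100):
    """Sends the data in chunks of at most `rows` rows and collects the scores."""
    n_batches = int(data.shape[0] / float(rows) + 1)
    scores = []
    for batch in np.array_split(data, n_batches):
        response = predictor.predict(batch).decode("utf-8")
        scores.extend(float(value) for value in response.split(","))
    return np.array(scores)

# 250 rows, 2 features each
toy_features = np.arange(500, dtype=float).reshape(250, 2)
toy_scores = predict_in_batches(toy_features, FakePredictor())
print(toy_scores.shape)
```

Splitting into small batches keeps each request well below the endpoint's payload limit, at the cost of a few extra round trips.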

Let's check what the confusion matrix looks like

In [None]:
pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions), rownames=['actuals'], colnames=['predictions'])
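
A confusion matrix is easier to compare across models once it is reduced to a few numbers. Here is a minimal sketch that derives accuracy, precision and recall from the four cells; the labels and scores are made up and merely stand in for `test_data['y_yes']` and the endpoint output.

```python
import numpy as np
import pandas as pd

# Made-up labels and scores
actuals = np.array([1, 0, 0, 1, 0, 1, 0, 0])
scores = np.array([0.9, 0.2, 0.6, 0.4, 0.1, 0.8, 0.3, 0.7])
predicted = np.round(scores).astype(int)   # Same 0.5 cut-off as np.round(predictions)

matrix = pd.crosstab(index=actuals, columns=predicted,
                     rownames=["actuals"], colnames=["predictions"])

# Rows are actual classes, columns are predicted classes
tn, fp = matrix.loc[0, 0], matrix.loc[0, 1]
fn, tp = matrix.loc[1, 0], matrix.loc[1, 1]

accuracy = (tp + tn) / matrix.to_numpy().sum()
precision = tp / (tp + fp)   # Of the predicted subscribers, how many actually subscribed?
recall = tp / (tp + fn)      # Of the actual subscribers, how many did we catch?

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

With an imbalanced target like this one, precision and recall are usually more informative than accuracy alone.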

Not bad, but... **can we do better?**

### Model Tuning

[🔝 Back to the top](#A-Tour-of-SageMaker)

[Amazon SageMaker automatic model tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html), also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. 

It then selects the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose.
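
Conceptually, the loop looks like the sketch below: sample a configuration from the ranges, "train", record the metric, keep the best. This toy version uses plain random search and a made-up scoring function instead of real training jobs; SageMaker's default strategy is Bayesian optimization, which picks the next candidates more cleverly. `validation_score` and `random_search` are hypothetical names.

```python
import random

def validation_score(eta, max_depth):
    """Made-up stand-in for a training job's validation metric (peaks at eta=0.3, max_depth=6)."""
    return 1.0 - (eta - 0.3) ** 2 - 0.01 * (max_depth - 6) ** 2

def random_search(n_jobs=20, seed=0):
    """Tries n_jobs random configurations and returns the best one found."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_jobs):
        candidate = {"eta": rng.uniform(0, 1), "max_depth": rng.randint(1, 10)}
        score = validation_score(**candidate)
        if best is None or score > best["score"]:
            best = {**candidate, "score": score}
    return best

best = random_search()
print(best)
```

More jobs means a better chance of landing near the optimum, which is exactly the trade-off that `max_jobs` controls.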

In [None]:
from sagemaker.tuner import (
    ContinuousParameter,
    IntegerParameter,
    HyperparameterTuner
)

# 1. Define hyperparameter ranges
# https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-define-ranges.html
hyperparameter_ranges = {'eta': ContinuousParameter(0, 1),
                         'min_child_weight': ContinuousParameter(1, 10),
                         'alpha': ContinuousParameter(0, 2),
                         'max_depth': IntegerParameter(1, 10)}

# 2. Define objective metric
# https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-define-metrics.html
objective_metric_name = 'validation:auc'

# 3. Initialize hyperparameter tuner
tuner = HyperparameterTuner(xgb,
                            objective_metric_name,
                            hyperparameter_ranges,
                            max_jobs=20,
                            max_parallel_jobs=3)

# 4. Start model tuning
tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})

Return the best training job

In [None]:
tuner.best_training_job()

Deploy it

In [None]:
tuner_predictor = tuner.deploy(initial_instance_count=1,
                               instance_type='ml.t2.medium')

and let's look at the new confusion matrix

In [None]:
tuner_predictor.serializer = sagemaker.serializers.CSVSerializer()
predictions = predict(test_data.drop(['y_no', 'y_yes'], axis=1).to_numpy(), tuner_predictor)
pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions), rownames=['actuals'], colnames=['predictions'])

### Clean Up

[🔝 Back to the top](#A-Tour-of-SageMaker)

Don't forget to remove the hosted endpoints or you'll start accruing costs!

In [None]:
xgb_predictor.delete_endpoint(delete_endpoint_config=True)
tuner_predictor.delete_endpoint(delete_endpoint_config=True)

## Data Preparation with Data Wrangler

[🔝 Back to the top](#A-Tour-of-SageMaker)

[SageMaker Data Wrangler](https://aws.amazon.com/sagemaker/data-wrangler) is a feature of Amazon SageMaker Studio that provides an end-to-end solution to import, prepare, transform, featurize, and analyze data. 

You can integrate a Data Wrangler data preparation flow into your ML workflows to simplify and streamline data pre-processing and feature engineering using little to no coding. 

You can also add your own Python scripts and transformations to customize workflows.

Data Wrangler provides the following core functionalities to help you analyze and prepare data for machine learning applications.

* **Import** – Connect to and import data from Amazon Simple Storage Service (Amazon S3), Amazon Athena (Athena), Amazon Redshift, Snowflake, and Databricks.

* **Data Flow** – Create a data flow to define a series of ML data prep steps. You can use a flow to combine datasets from different data sources, identify the number and types of transformations you want to apply to datasets, and define a data prep workflow that can be integrated into an ML pipeline.

* **Transform** – Clean and transform your dataset using standard transforms like string, vector, and numeric data formatting tools. Featurize your data using transforms like text and date/time embedding and categorical encoding.

* **Generate Data Insights** – Automatically verify data quality and detect abnormalities in your data with Data Wrangler Data Insights and Quality Report.

* **Analyze** – Analyze features in your dataset at any point in your flow. Data Wrangler includes built-in data visualization tools like scatter plots and histograms, as well as data analysis tools like target leakage analysis and quick modeling to understand feature correlation.

* **Export** – Export your data preparation workflow to a different location. The following are example locations:

    - Amazon Simple Storage Service (Amazon S3) bucket

    - Amazon SageMaker Model Building Pipelines – Use SageMaker Pipelines to automate model deployment. You can export the data that you've transformed directly to the pipelines.

    - Amazon SageMaker Feature Store – Store the features and their data in a centralized store.

    - Python script – Store the data and their transformations in a Python script for your custom workflows.

📚 **Exercise:** Try to replicate the data preparation steps from [The Quick Tour](#The-Quick-Tour) section using Data Wrangler.

<img src="images/data_wrangler_flow.png"/>


## Feature Engineering with Amazon SageMaker Processing

[🔝 Back to the top](#A-Tour-of-SageMaker)

[SageMaker Processing Jobs](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html) provide a simplified, managed experience on SageMaker to run data processing workloads like feature engineering, data validation, model evaluation, and model interpretation.

SageMaker takes your script, copies the data from S3, and runs the processing container (either from a built-in image or a custom image that you provide).

In the end, the output will be stored in an S3 bucket that you specify.
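
From the script's point of view, this boils down to a simple contract: read files from a local input folder, write files to a local output folder. The sketch below mimics that contract with temporary folders so that it can run anywhere; `run_step` and the toy CSV are assumptions for illustration, and in a real job the folders would be paths such as `/opt/ml/processing/input` and `/opt/ml/processing/output`.

```python
import os
import tempfile

import pandas as pd

def run_step(input_dir, output_dir, filename="data.csv"):
    """Reads a CSV from input_dir, applies a toy transform, and writes it under output_dir."""
    data = pd.read_csv(os.path.join(input_dir, filename))
    data["no_previous_contact"] = (data["pdays"] == 999).astype(int)

    # Create the output sub-folder if it is not there yet
    os.makedirs(os.path.join(output_dir, "train"), exist_ok=True)
    out_path = os.path.join(output_dir, "train", "train.csv")
    data.to_csv(out_path, index=False, header=False)
    return out_path

with tempfile.TemporaryDirectory() as workdir:
    # Local stand-ins for the container's input and output folders
    input_dir = os.path.join(workdir, "input")
    output_dir = os.path.join(workdir, "output")
    os.makedirs(input_dir)
    pd.DataFrame({"pdays": [999, 5]}).to_csv(os.path.join(input_dir, "data.csv"), index=False)

    out_path = run_step(input_dir, output_dir)
    with open(out_path) as handle:
        result_lines = handle.read().splitlines()

print(result_lines)
```

Keeping the paths configurable, as the processing script in this section does with command line arguments, makes this kind of local dry run possible before paying for a managed job.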

> By default, your input data is read from an Amazon S3 bucket. Alternatively, you can use [Amazon Athena](https://aws.amazon.com/athena) or [Amazon Redshift](https://aws.amazon.com/redshift/) as input sources.

<img src="https://docs.aws.amazon.com/images/sagemaker/latest/dg/images/Processing-1.png"/>

Let's see how this works in practice by mimicking the preprocessing steps we did in [The Quick Tour](#The-Quick-Tour).

First, we need a processing script

In [None]:
%%writefile preprocessing.py
"""
Processing script for the Bank Marketing Data Set
https://archive.ics.uci.edu/ml/datasets/bank+marketing
"""

import argparse
import os

import pandas as pd
import numpy as np

def parse_args():
    """Parses command line arguments"""

    # Initialize parser
    parser = argparse.ArgumentParser()

    # Define arguments
    parser.add_argument(
        '--filepath',
        type=str,
        default='/opt/ml/processing/input/'
    )
    parser.add_argument(
        '--filename',
        type=str,
        default='bank-additional-full.csv'
    )
    parser.add_argument(
        '--outputpath',
        type=str,
        default='/opt/ml/processing/output/'
    )
    parser.add_argument(
        '--categorical_features',
        type=str,
        default='y, job, marital, education, default, housing, loan, contact, month, day_of_week, poutcome'  # pylint: disable=line-too-long
    )

    return parser.parse_known_args()

def main():
    """Main entrypoint"""
    # Process arguments
    args, _ = parse_args()

    # Load data
    data = pd.read_csv(os.path.join(args.filepath, args.filename))

    # Change the value . into _
    data = data.replace(regex=r'\.', value='_')
    data = data.replace(regex=r'\_$', value='')

    # Create a variable to indicate that there was no prior contact (no_previous_contact) and
    # another one to indicate whether the individual is currently employed (not_working).
    data["no_previous_contact"] = (data["pdays"] == 999).astype(int)
    data["not_working"] = data["job"].isin(["student", "retired", "unemployed"]).astype(int)

    # Drop duration and economics features
    data = data.drop([
        'duration',
        'emp.var.rate',
        'cons.price.idx',
        'cons.conf.idx',
        'euribor3m',
        'nr.employed'
    ], axis=1)

    # Encode categorical features
    data = pd.get_dummies(data)

    # Train, validation, test split
    train_data, validation_data, test_data = np.split(  # pylint: disable=unbalanced-tuple-unpacking
        data.sample(frac=1, random_state=42), [int(0.7 * len(data)), int(0.9 * len(data))])

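    # Make sure the output folders exist before writing to them
    for split in ('train', 'validation', 'test'):
        os.makedirs(os.path.join(args.outputpath, split), exist_ok=True)

    # Prepare data for upload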
    ## Train
    pd.concat([
        train_data['y_yes'],
        train_data.drop(['y_yes','y_no'], axis=1)
    ], axis=1).to_csv(
        os.path.join(args.outputpath, 'train/train.csv'), index=False, header=False
    )
    ## Validation
    pd.concat([
        validation_data['y_yes'],
        validation_data.drop(['y_yes','y_no'], axis=1)
    ], axis=1).to_csv(
        os.path.join(args.outputpath, 'validation/validation.csv'), index=False, header=False)
    ## Test
    test_data['y_yes'].to_csv(
        os.path.join(args.outputpath, 'test/test_y.csv'), index=False, header=False
    )
    test_data.drop(['y_yes','y_no'], axis=1).to_csv(
        os.path.join(args.outputpath, 'test/test_x.csv'), index=False, header=False
    )

    print("Exiting processing job")

if __name__ == '__main__':
    main()


We'll use [scikit-learn Processor](https://docs.aws.amazon.com/sagemaker/latest/dg/use-scikit-learn-processing-container.html) to run our job

In [None]:
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

# 1. Initialize processor
sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=role,
    instance_type="ml.m5.large",
    instance_count=1, 
    base_job_name='sm-tour-skprocessing'
)

# 2. Start processing job
sklearn_processor.run(
    code='preprocessing.py',
    inputs=[
        ProcessingInput(
            source=input_source, 
            destination="/opt/ml/processing/input",
            s3_input_mode="File",
            s3_data_distribution_type="ShardedByS3Key"
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name="train_data", 
            source="/opt/ml/processing/output/train",
            destination=train_path,
        ),
        ProcessingOutput(
            output_name="validation_data",
            source="/opt/ml/processing/output/validation",
            destination=validation_path
        ),
        ProcessingOutput(
            output_name="test_data",
            source="/opt/ml/processing/output/test",
            destination=test_path
        ),
    ]
)

In [None]:
# Debug
!aws s3 ls {train_path}
!aws s3 ls {validation_path}
!aws s3 ls {test_path}
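If you'd rather check the outputs from Python than from the CLI, here is a minimal sketch. It assumes `train_path`, `validation_path` and `test_path` are plain `s3://bucket/prefix` URIs; `split_s3_uri` and `list_keys` are hypothetical helper names introduced for illustration, and the client is passed in so the function works with any object that exposes `list_objects_v2`.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_keys(s3_client, uri):
    """Return the object keys stored under an S3 URI.

    Only the first page of results is read (up to 1,000 keys),
    which is plenty for the handful of CSV files written by the job.
    """
    bucket, prefix = split_s3_uri(uri)
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    # 'Contents' is absent from the response when nothing matches the prefix
    return [obj["Key"] for obj in response.get("Contents", [])]
```

In the notebook, something like `list_keys(boto3.client("s3"), test_path)` should then show `test_x.csv` and `test_y.csv` if the job finished successfully.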

## Deploying Models at Scale

[🔝 Back to the top](#A-Tour-of-SageMaker)

After training a model, there are many ways to deploy it using SageMaker:

* [Real-Time Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html) for persistent, real-time endpoints that make one prediction at a time

    <img src="https://docs.aws.amazon.com/images/sagemaker/latest/dg/images/multi-model-endpoints-diagram.png" style="width: 400px;"/>

* [Serverless Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html) for workloads that have idle periods between traffic spurts and can tolerate cold starts

    <img src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2022/10/21/image001-9-1024x468.png" style="width: 600px;"/>

* [Asynchronous Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html) for requests with large payload sizes up to 1 GB, long processing times, and near real-time latency requirements

    <img src="https://docs.aws.amazon.com/images/sagemaker/latest/dg/images/async-architecture.png" style="width: 600px;"/>

* [Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html) to get predictions for an entire dataset

    <img src="https://docs.aws.amazon.com/images/sagemaker/latest/dg/images/batch-transform-data-processing.png" style="width: 600px;"/>

**Which one should you choose?**
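As a rough first pass, the four descriptions above can be condensed into a rule of thumb. The sketch below is only that: `Workload` and `suggest_inference_option` are hypothetical names, the flags are a simplification of the criteria listed above, and the order of the checks is an assumption rather than official guidance.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Simplified description of an inference workload (illustrative only)."""
    whole_dataset_at_once: bool = False     # offline scoring of an entire dataset
    large_payload_or_slow: bool = False     # large payloads or long processing times
    idle_periods: bool = False              # traffic comes in spurts
    tolerates_cold_starts: bool = False     # occasional extra latency is acceptable


def suggest_inference_option(workload: Workload) -> str:
    """Map a workload to one of the four SageMaker inference options."""
    if workload.whole_dataset_at_once:
        return "Batch Transform"
    if workload.large_payload_or_slow:
        return "Asynchronous Inference"
    if workload.idle_periods and workload.tolerates_cold_starts:
        return "Serverless Inference"
    # Default: a persistent endpoint serving one prediction at a time
    return "Real-Time Inference"


print(suggest_inference_option(Workload(idle_periods=True, tolerates_cold_starts=True)))
```

A rule like this only narrows the field. Latency and cost still have to be measured on your own model and traffic.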

In this section, we will use the [Amazon SageMaker Serverless Inference Benchmarking Toolkit](https://aws.amazon.com/blogs/machine-learning/introducing-the-amazon-sagemaker-serverless-inference-benchmarking-toolkit/) to test different endpoint configurations and pit the optimal one against a comparable real-time hosting instance.
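To see why such a comparison matters for cost 💰, consider this back-of-the-envelope sketch. It assumes a deliberately simplified billing model: a real-time endpoint is charged per instance-hour whether or not it receives traffic, while a serverless endpoint is charged for the compute time spent serving requests. All function names and prices are hypothetical placeholders, and data processing charges are ignored.

```python
HOURS_PER_MONTH = 730.0  # average number of hours in a month


def monthly_realtime_cost(price_per_hour, instance_count=1, hours=HOURS_PER_MONTH):
    """Cost of keeping persistent instances up all month, busy or idle."""
    return price_per_hour * instance_count * hours


def monthly_serverless_cost(invocations, avg_duration_s, price_per_second):
    """Cost of serverless compute, billed only while requests are processed."""
    return invocations * avg_duration_s * price_per_second


def break_even_invocations(price_per_hour, avg_duration_s, price_per_second):
    """Monthly invocations above which the real-time instance becomes cheaper."""
    return monthly_realtime_cost(price_per_hour) / (avg_duration_s * price_per_second)


# Placeholder prices, NOT actual AWS pricing
realtime = monthly_realtime_cost(price_per_hour=0.2)
serverless = monthly_serverless_cost(
    invocations=1_000_000, avg_duration_s=0.1, price_per_second=0.00002
)
print(f"real-time: ${realtime:.2f}/month, serverless: ${serverless:.2f}/month")
```

The benchmarking toolkit replaces these guesses with measured latencies and a cost estimate for each tested configuration.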

Let's start by installing the benchmarking library.

In [None]:
!pip install sm-serverless-benchmarking

Create a model

In [None]:
model = xgb.create_model(image_uri=container, role=role)
model.create(instance_type="ml.m5.xlarge")

Start a benchmark run

In [None]:
from sm_serverless_benchmarking.sagemaker_runner import run_as_sagemaker_job
from sm_serverless_benchmarking.utils import convert_invoke_args_to_jsonl

# 1. Provide a representative set of examples
ser = sagemaker.serializers.CSVSerializer()
bench_data = test_data.drop(['y_no', 'y_yes'], axis=1).to_numpy()
random_samples = bench_data[np.random.choice(bench_data.shape[0], size=20, replace=False)]
example_invoke_args = [
    {"Body": ser.serialize(sample), "ContentType": "text/csv"} for sample in random_samples
]
example_args_file = convert_invoke_args_to_jsonl(example_invoke_args, output_path=".")

# 2. Start benchmark run
print("Starting benchmark run ⏱️")
bench_run = run_as_sagemaker_job(
    role=role,
    model_name=model.name,
    invoke_args_examples_file=example_args_file,
    stability_benchmark_invocations=2500,
    concurrency_benchmark_invocations=2500,
)

# 3. Get results
print("Waiting for Processing Job", end='')
job_name = bench_run.latest_job.job_name
while True:
    status = sagemaker_client.describe_processing_job(ProcessingJobName=job_name)['ProcessingJobStatus']
    # Stop polling on any terminal state, not just success (avoids an endless loop on failure)
    if status in ("Completed", "Failed", "Stopped"):
        break
    print(".", end='')
    time.sleep(30)
if status != "Completed":
    raise RuntimeError(f"Benchmark job {job_name} ended with status {status}")
printmd(
    f"\nOutputs were uploaded to `{bench_run.latest_job.outputs[0].destination}`"
)
bench_report = boto3.client('s3').get_object(
    Bucket=bucket,
    Key=f"{job_name}/output/benchmark_outputs/benchmarking_report/benchmarking_report.html"
)['Body'].read().decode("utf-8")
printhtml(bench_report)

Don't forget to clean up everything afterwards

In [None]:
model.delete()

## Coming Up

[🔝 Back to the top](#A-Tour-of-SageMaker)

**This project is still under construction!** 🚧

We're accepting feature requests, bug reports and pull requests.

Here's what you can expect in the not-so-distant future:

* Training Large Models with [Distributed Training](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html) and [Managed Spot Training](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html)
* Human in the Loop with [Augmented AI](https://aws.amazon.com/augmented-ai/)
* MLOps with [SageMaker Pipelines](https://aws.amazon.com/sagemaker/pipelines/)
* ... and much, much more

See you soon! 😉

## References

[🔝 Back to the top](#A-Tour-of-SageMaker)

<div style="text-align: left"><img src="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F90cb787d-c3a1-48c4-86d1-84e7456a949a_500x213.gif"/></div>

### Tutorials

* [Build, Train, and Deploy a Machine Learning Model with Amazon SageMaker](https://aws.amazon.com/getting-started/hands-on/build-train-deploy-machine-learning-model-sagemaker/)
* [Train and tune a deep learning model at scale](https://aws.amazon.com/getting-started/hands-on/train-tune-deep-learning-model-amazon-sagemaker/)

### Learning

* [Amazon SageMaker Technical Deep Dive Series](https://www.youtube.com/playlist?list=PLhr1KZpdzukcOr_6j_zmSrvYnLUtgqsZz)
* [Dive into Deep Learning](https://www.d2l.ai/) – an interactive, notebook-shaped DL book

### Guides

* [Deploy a Model in Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-deployment.html)
* [Use Amazon SageMaker Built-in Algorithms or Pre-trained Models](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html)
* [Buy and Sell Amazon SageMaker Algorithms and Models in AWS Marketplace](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-marketplace.html)
* [Using SageMaker JumpStart Models](https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-models.html)
* [Using Your Own Algorithm or Model](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers-notebooks.html)
* [Using Amazon Augmented AI for Human Review](https://docs.aws.amazon.com/sagemaker/latest/dg/a2i-use-augmented-ai-a2i-human-review-loops.html)

### Code

* [AWSome SageMaker](https://github.com/aws-samples/awesome-sagemaker) – a curated list of references for Amazon SageMaker
* [Amazon SageMaker Examples](https://github.com/aws/amazon-sagemaker-examples) – these are automatically available when using SageMaker Notebook Instances
* [Hugging Face Notebooks > SageMaker](https://github.com/huggingface/notebooks/tree/main/sagemaker) – sample notebooks that demonstrate how to build, train and deploy [🤗 Transformers](https://github.com/huggingface/transformers) with Amazon SageMaker
* [Amazon Augmented AI Sample Notebooks](https://github.com/aws-samples/amazon-a2i-sample-jupyter-notebooks)
* [Optimizing NLP models with Amazon EC2 `Inf1` instances in Amazon SageMaker](https://github.com/aws-samples/aws-inferentia-huggingface-workshop)

### Infrastructure

* [AWS ML Infrastructure](https://aws.amazon.com/machine-learning/infrastructure/) – an overview of the different services that support ML-specific workloads
* [How to choose the right GPU for DL](https://towardsdatascience.com/choosing-the-right-gpu-for-deep-learning-on-aws-d69c157d8c86) – a must read
* HW accelerators - [AWS Trainium / `Trn1` Instances](https://aws.amazon.com/machine-learning/trainium/) and [Habana Gaudi / `DL1` Instances](https://aws.amazon.com/ec2/instance-types/dl1/) for training, [AWS Inferentia / `Inf1` Instances](https://aws.amazon.com/machine-learning/inferentia/) for inference, and even [FPGAs / `F1` Instances](https://aws.amazon.com/ec2/instance-types/f1/)

### Blogs

* [Deploying ML models using SageMaker Serverless Inference (Preview)](https://aws.amazon.com/blogs/machine-learning/deploying-ml-models-using-sagemaker-serverless-inference-preview/)
* [Host Hugging Face transformer models using Amazon SageMaker Serverless Inference](https://aws.amazon.com/blogs/machine-learning/host-hugging-face-transformer-models-using-amazon-sagemaker-serverless-inference/)
* [Introducing the Amazon SageMaker Serverless Inference Benchmarking Toolkit](https://aws.amazon.com/blogs/machine-learning/introducing-the-amazon-sagemaker-serverless-inference-benchmarking-toolkit/)
* [Bring your own model with Amazon SageMaker script mode](https://aws.amazon.com/blogs/machine-learning/bring-your-own-model-with-amazon-sagemaker-script-mode/)
* [Speed up YOLOv4 inference to twice as fast on Amazon SageMaker](https://aws.amazon.com/blogs/machine-learning/speed-up-yolov4-inference-to-twice-as-fast-on-amazon-sagemaker/)
* [Accelerate BERT inference with Hugging Face Transformers and AWS Inferentia](https://huggingface.co/blog/bert-inferentia-sagemaker)
* [Build custom Amazon SageMaker PyTorch models for real-time handwriting text recognition](https://aws.amazon.com/blogs/machine-learning/build-custom-amazon-sagemaker-pytorch-models-for-real-time-handwriting-text-recognition/)
* [Julien Simon’s substack](https://substack.com/profile/100614256-julien-simon) – Chief Evangelist @ 🤗, former Global Technical Evangelist (AI/ML) @ AWS

### Frameworks

* [TensorFlow on AWS](https://aws.amazon.com/tensorflow/)
* [PyTorch on AWS](https://aws.amazon.com/pytorch/)
* [Hugging Face on Amazon SageMaker](https://aws.amazon.com/machine-learning/hugging-face/) (and [Amazon SageMaker on Hugging Face](https://huggingface.co/docs/sagemaker/index))
* … and [many more](https://docs.aws.amazon.com/sagemaker/latest/dg/frameworks.html)

### Articles

* (Moro *et al.*, 2014) A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems
* (Sculley *et al.*, 2015) Hidden Technical Debt in Machine Learning Systems