# SageMaker

- Notebook environment
- Training Service
- Hosting Service

## Algorithms for natural language processing (NLP)

### Amazon SageMaker provides built-in algorithms for natural language processing:

- BlazingText algorithm provides highly optimized implementations of the Word2vec and text classification algorithms.
- Sequence-to-Sequence (seq2seq) is a supervised learning algorithm where the input is a sequence of tokens (for example, text or audio) and the generated output is another sequence of tokens.
- Object2Vec generalizes the well-known Word2Vec embedding technique for words, which is optimized in the Amazon SageMaker BlazingText algorithm.

### Training may require GPUs, which are much better suited to the workload than CPUs. However, GPUs are not cost-effective to keep running when no model is being trained. You can take advantage of a decoupled architecture: use an ETL service such as AWS Glue or Amazon EMR, both of which run Apache Spark, for your ETL jobs, and use Amazon SageMaker to train, test, and deploy your models.

## Create a model in Amazon SageMaker
### You need:

- The Amazon S3 path where the model artifacts are stored 
- The Docker registry path for the image that contains the inference code 
- A name that you can use for subsequent deployment steps
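The three inputs above map onto the `create_model` call in boto3. The following is a minimal sketch rather than a definitive implementation: the model name, image URI, S3 path, and role ARN are hypothetical placeholders. Note that the API also expects an execution role, which the list above does not mention.

```python
def build_create_model_request(model_name, image_uri, model_data_url, role_arn):
    """Assemble keyword arguments for the SageMaker CreateModel API."""
    return {
        "ModelName": model_name,              # name reused in later deployment steps
        "PrimaryContainer": {
            "Image": image_uri,               # Docker registry path of the inference image
            "ModelDataUrl": model_data_url,   # S3 path of the model artifacts
        },
        "ExecutionRoleArn": role_arn,         # IAM role SageMaker assumes
    }


# All values below are hypothetical placeholders.
request = build_create_model_request(
    model_name="churn-model-v1",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-inference:latest",
    model_data_url="s3://my-bucket/churn/model.tar.gz",
    role_arn="arn:aws:iam::123456789012:role/MySageMakerRole",
)

# With AWS credentials configured, the request could be sent like this:
# import boto3
# boto3.client("sagemaker").create_model(**request)
```

Building the request as a plain dictionary keeps the sketch runnable without AWS access; only the commented lines talk to the service.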

## Create an endpoint configuration for an HTTPS endpoint
### You need:

- The name of one or more models in production variants
- The ML compute instances that you want Amazon SageMaker to launch to host each production variant. When hosting models in production, you can configure the endpoint to elastically scale the deployed ML compute instances. For each production variant, you specify the number of ML compute instances that you want to deploy. When you specify two or more instances, Amazon SageMaker launches them in multiple Availability Zones, which ensures continuous availability. Amazon SageMaker manages deploying the instances.
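An endpoint configuration like the one described above can be sketched as a request for the boto3 `create_endpoint_config` call. This is an illustration under assumptions: the configuration name, variant name, model name, and instance type are hypothetical, and the helper function is not part of any SDK.

```python
def build_endpoint_config_request(config_name, variants):
    """Assemble keyword arguments for the SageMaker CreateEndpointConfig API.

    `variants` is a list of tuples:
    (variant_name, model_name, instance_type, instance_count, weight).
    """
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": variant_name,
                "ModelName": model_name,
                "InstanceType": instance_type,
                "InitialInstanceCount": instance_count,
                "InitialVariantWeight": weight,
            }
            for (variant_name, model_name, instance_type,
                 instance_count, weight) in variants
        ],
    }


# Hypothetical names; two instances so that they can be spread across
# Availability Zones.
request = build_endpoint_config_request(
    "churn-endpoint-config",
    [("AllTraffic", "churn-model-v1", "ml.m5.large", 2, 1.0)],
)

# With AWS credentials configured:
# import boto3
# sm = boto3.client("sagemaker")
# sm.create_endpoint_config(**request)
# sm.create_endpoint(EndpointName="churn-endpoint",
#                    EndpointConfigName="churn-endpoint-config")
```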

## Create an HTTPS endpoint
### You need to provide the endpoint configuration to Amazon SageMaker. The service launches the ML compute instances and deploys the model or models as specified in the configuration.

- Model packages: These are used to create deployable SageMaker models. You can create your own algorithm, package it using the model package APIs, and publish it to AWS Marketplace.
- Models: Models are created from model artifacts. They are similar to mathematical equations with variables: you supply values for the variables and get an output. The model artifacts are stored in S3 and are used for inference by the endpoints.
- Endpoint configurations: Amazon SageMaker allows you to deploy multiple weighted models to a single endpoint, so you can route a specific share of requests to each model. For example, suppose you have one model in use and want to replace it with a new model, but you cannot simply remove the model that is already serving traffic. In this scenario, you can set the variant weights (`InitialVariantWeight` in the endpoint configuration) so that the endpoint serves 80% of the requests with the old model and 20% with the new model. This is a common production scenario when the data changes rapidly and the model needs to be retrained and tuned periodically. Another use case is testing a model with live data: a certain percentage of requests is routed to the new model, and the results are monitored to assess its accuracy on unseen real-time data.
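The 80/20 split described above follows from the variant weights: each variant receives traffic in proportion to its weight divided by the sum of all weights. The sketch below computes those shares and builds a request for the boto3 `update_endpoint_weights_and_capacities` call. The endpoint and variant names are hypothetical, and the helper functions are illustrative only.

```python
def traffic_shares(variant_weights):
    """Return the fraction of requests each variant receives."""
    total = sum(variant_weights.values())
    if total <= 0:
        raise ValueError("at least one variant needs a positive weight")
    return {name: weight / total for name, weight in variant_weights.items()}


def build_weight_update(endpoint_name, variant_weights):
    """Assemble keyword arguments for UpdateEndpointWeightsAndCapacities."""
    return {
        "EndpointName": endpoint_name,
        "DesiredWeightsAndCapacities": [
            {"VariantName": name, "DesiredWeight": float(weight)}
            for name, weight in variant_weights.items()
        ],
    }


# Weights of 4 and 1 give an 80% / 20% split between the two variants.
weights = {"old-model": 4, "new-model": 1}
shares = traffic_shares(weights)
update = build_weight_update("my-endpoint", weights)

# With AWS credentials configured:
# import boto3
# boto3.client("sagemaker").update_endpoint_weights_and_capacities(**update)
```

Shifting more traffic to the new model is then a matter of sending another update with different weights, without redeploying the endpoint.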

### Target tracking
- With a target tracking scaling policy, you specify a single metric and target value, such as SageMakerVariantInvocationsPerInstance = 1000, and SageMaker scales as needed to stay near it. This strategy is very common because it is the easiest to configure.

### Simple
- When configured to use the simple scaling policy, SageMaker triggers a scaling event on a given metric at a given threshold with a fixed amount of scaling. For example, "when SageMakerVariantInvocationsPerInstance > 1000, add 10 instances." This strategy requires a bit more configuration but also provides more control than the target-tracking strategy.

### Step scaling
- Step scaling, the most configurable scaling policy, allows SageMaker to trigger a scaling event on a given metric at various thresholds, with a configurable amount of scaling at each threshold. For example, "when SageMakerVariantInvocationsPerInstance > 1000, add 10 instances; when SageMakerVariantInvocationsPerInstance > 2000, add 50 instances," and so on. This strategy requires the most configuration but provides the most control for situations such as spiky traffic.
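The step-scaling example above (add 10 instances above 1000, add 50 above 2000) can be sketched as a lookup over step adjustments. The dictionary keys mirror the `StepAdjustments` fields of Application Auto Scaling, where bounds are offsets from the alarm threshold. The selection function itself is a simplified assumption about how a step is chosen, not the service's actual implementation.

```python
# Bounds are offsets from the alarm threshold (1000 in this example).
STEP_ADJUSTMENTS = [
    {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 1000,
     "ScalingAdjustment": 10},   # metric between 1000 and 2000
    {"MetricIntervalLowerBound": 1000,
     "ScalingAdjustment": 50},   # metric at 2000 or above
]


def pick_step_adjustment(metric_value, threshold, step_adjustments):
    """Return how many instances to add for the observed metric value."""
    if metric_value <= threshold:
        return 0  # the alarm is not breached, so nothing changes
    breach = metric_value - threshold
    for step in step_adjustments:
        # A missing bound means the interval is open on that side.
        lower = step.get("MetricIntervalLowerBound", float("-inf"))
        upper = step.get("MetricIntervalUpperBound", float("inf"))
        if lower <= breach < upper:
            return step["ScalingAdjustment"]
    return 0


for value in (900, 1500, 2500):
    print(value, "->", pick_step_adjustment(value, 1000, STEP_ADJUSTMENTS))
```

A simple scaling policy corresponds to the special case of a single step with no upper bound.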

### Multi-model endpoints
- SageMaker's multi-model endpoints are a cost-effective way to deploy your models. Consider an ML use case with data from 50 US states. Instead of hosting 50 models on 50 endpoints and paying for all of them when you know that traffic to some states will be much sparser than to others, you can consolidate the 50 models into 1 multi-model endpoint to fully utilize the endpoint's compute capacity and reduce the hosting cost.
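With a multi-model endpoint, the caller chooses the model per request through the `TargetModel` parameter of the boto3 `invoke_endpoint` call. The sketch below assumes one artifact per state named `<state>.tar.gz` under the endpoint's S3 prefix. That naming scheme, the endpoint name, and the sample payload are all hypothetical.

```python
def target_model_for(state_code):
    """Map a US state code to its model artifact (assumed naming scheme)."""
    return f"{state_code.lower()}.tar.gz"


def build_invoke_request(endpoint_name, state_code, csv_row):
    """Assemble keyword arguments for the SageMaker Runtime InvokeEndpoint API."""
    return {
        "EndpointName": endpoint_name,
        "ContentType": "text/csv",
        "TargetModel": target_model_for(state_code),  # selects the model to use
        "Body": csv_row.encode("utf-8"),
    }


# Hypothetical endpoint name and feature row.
request = build_invoke_request("churn-by-state", "NY", "42,0,1,118.5")

# With AWS credentials configured:
# import boto3
# boto3.client("sagemaker-runtime").invoke_endpoint(**request)
```

All 50 state models share the same instances, so a rarely used model costs nothing extra while it is idle.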

## Offline vs. Online Stores

### An online store is a feature storage option in SageMaker Feature Store that is designed to stay online at all times. Online means that the store behaves like an online application, one that responds to read and write requests immediately. "Immediately" is subjective, but in technical terms it means response latency low enough that users do not notice a delay. Besides low latency, the other aspect that makes the online store "online" is the high throughput of transactions it can serve at the same time. Imagine hundreds of thousands of users visiting your application: you want it to handle that traffic with high throughput and low latency.

### Why do we need an online store that has low latency? In many ML use cases, inference needs to respond immediately to a user's action and return the results to the user. The inference process typically includes querying features for a particular data point and sending the features as a payload to the ML model. For example, an online auto insurance quote application has an ML model that takes a driver's information to predict their risk level and suggest a quote. This application needs to pull vehicle-related features from a feature store based on the car make provided by the user. You'd expect a modern application to return a quote immediately. Therefore, an ideal architecture should keep latency low both for pulling features from a feature store and for making an ML inference. We can't have a system where the ML model responds immediately but gathering features from various databases and locations takes seconds or minutes.


### An offline store in SageMaker Feature Store is designed to provide more versatile functionality by keeping all records over time. You can access features as they were at any given point in time for a variety of use cases. This comes at the cost of higher-latency responses, because the offline store uses slower and less expensive storage.

### An offline store complements the online store for ML use cases where low latency isn't a requirement. For example, when building an ML training dataset to reproduce a particular model for compliance purposes, you need to access historic features in order to rebuild a model that was created in the past. ML training is typically not expected to complete within seconds anyway, so you don't necessarily need sub-second performance when querying a feature store for training data.
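The difference between the two stores can be illustrated with a toy in-memory model. This is not the SageMaker Feature Store API. It is a sketch assuming that the online store keeps only the latest record per identifier, while the offline store appends every record so that historic values can be looked up. The class, method names, and sample features are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFeatureStore:
    online: dict = field(default_factory=dict)   # record_id -> (event_time, features)
    offline: list = field(default_factory=list)  # every (record_id, event_time, features)

    def put_record(self, record_id, event_time, features):
        # The offline store keeps the full history.
        self.offline.append((record_id, event_time, features))
        # The online store keeps only the most recent record per identifier.
        current = self.online.get(record_id)
        if current is None or event_time >= current[0]:
            self.online[record_id] = (event_time, features)

    def get_latest(self, record_id):
        """Online-style read: a single key lookup."""
        return self.online[record_id][1]

    def get_as_of(self, record_id, as_of):
        """Offline-style read: scan history for the newest record at or before `as_of`."""
        candidates = [
            (event_time, features)
            for rid, event_time, features in self.offline
            if rid == record_id and event_time <= as_of
        ]
        if not candidates:
            return None
        return max(candidates, key=lambda item: item[0])[1]


store = ToyFeatureStore()
store.put_record("toyota", 1, {"avg_claim": 1200})
store.put_record("toyota", 5, {"avg_claim": 1350})

latest = store.get_latest("toyota")       # what real-time inference would read
historic = store.get_as_of("toyota", 3)   # what a reproduced training set would read
```

The online read is a constant-time lookup, whereas the offline read scans the history, which mirrors the latency trade-off described above.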

## SageMaker Auto Scaling example

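```python
import boto3

# Application Auto Scaling is the service that manages SageMaker endpoint scaling.
autoscaling_client = boto3.client('application-autoscaling')

# Hypothetical endpoint and variant names; replace them with your own.
# The variant must already be registered with register_scalable_target.
resource_id = 'endpoint/my-endpoint/variant/AllTraffic'

response = autoscaling_client.put_scaling_policy(
    PolicyName='Invocations-ScalingPolicy',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        # Target of 4,000 invocations per instance per minute.
        'TargetValue': 4000.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance',
        },
        # Cooldown periods, in seconds.
        'ScaleInCooldown': 600,
        'ScaleOutCooldown': 300,
    },
)
```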

### In this example, we employ a scaling strategy called target tracking scaling. Target tracking scaling scales the instances in and out to keep a specific metric near a target value, such as instance CPU load or the number of inference requests per instance per minute. We use the latter (SageMakerVariantInvocationsPerInstance) in this configuration so that each instance handles about 4,000 requests per minute before another instance is added. ScaleInCooldown and ScaleOutCooldown are the periods, in seconds, after the last scaling activity before autoscaling can scale in or out again. With our configuration, SageMaker will not scale in (remove an instance) within 600 seconds of the last scale-in activity, and will not scale out (add an instance) within 300 seconds of the last scale-out activity.
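The arithmetic behind this configuration can be sketched as follows. This is a simplified illustration, not the actual Application Auto Scaling algorithm: it assumes the desired count is the total request rate divided by the per-instance target, rounded up, and then clamped to minimum and maximum capacities. Those limits (1 and 10 here) are hypothetical values that would be set through `register_scalable_target`.

```python
import math


def desired_instance_count(total_invocations_per_minute, target_per_instance,
                           min_instances=1, max_instances=10):
    """Estimate how many instances keep each one at or below the target."""
    needed = math.ceil(total_invocations_per_minute / target_per_instance)
    return max(min_instances, min(max_instances, needed))


def cooldown_elapsed(now, last_activity_time, cooldown_seconds):
    """Return True once enough time has passed since the last scaling activity."""
    return now - last_activity_time >= cooldown_seconds


# 10,000 requests per minute against a target of 4,000 per instance.
instances = desired_instance_count(10_000, 4000)

# A scale-out finished at t=800 s; with ScaleOutCooldown=300, another
# scale-out is blocked at t=1000 s and allowed at t=1200 s.
blocked = not cooldown_elapsed(1000, 800, 300)
allowed = cooldown_elapsed(1200, 800, 300)
```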

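### There are two other commonly used scaling strategies: step scaling and scheduled scaling. Step scaling is selected with PolicyType='StepScaling' and lets you define the number of instances to scale in or out based on the size of the alarm breach for a certain metric. Scheduled scaling is not a PolicyType value; it is configured separately through scheduled actions that change the capacity limits at set times.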

### A multi-model endpoint is a type of real-time endpoint in SageMaker that allows multiple models to be deployed behind the same endpoint. There are many use cases in which you would build models for each customer or for each geographic area, and depending on the characteristics of the incoming data point, you would apply the corresponding ML model. Take the telecommunications churn prediction use case that we tackled in Chapter 3, Data Preparation with SageMaker Data Wrangler, as an example. We may get more accurate ML models if we train them by state because there may be regional differences in terms of competition among local telecommunication providers. And if we do train ML models for each US state, you can also easily imagine that the utilization of each model might not be completely equal. Actually, quite the contrary.

### Model utilization tends to track the population of each state. Your New York model is going to be used more frequently than your Alaska model. In this scenario, if you host an endpoint for each state, you will have to pay for instances even for the least utilized endpoint. With multi-model endpoints, SageMaker helps you reduce costs by reducing the number of endpoints needed for your use case.

### Code Example