# How to optimize the Meta Llama-3 70B Amazon JumpStart model for inference using Amazon SageMaker model optimization jobs
**Recommended kernel(s):** This notebook can be run with any Amazon SageMaker Studio kernel.

In this notebook, you will learn how to apply state-of-the-art optimization techniques to an Amazon JumpStart model (JumpStart model ID: `meta-textgeneration-llama-3-70b`) using Amazon SageMaker ahead-of-time (AOT) model optimization capabilities. Each example includes the deployment of the optimized model to an Amazon SageMaker endpoint. In all cases, the inference image is the SageMaker-managed [LMI (Large Model Inference)](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-container-docs.html) Docker image. LMI images feature a [DJL Serving](https://github.com/deepjavalibrary/djl-serving) stack powered by the [Deep Java Library](https://djl.ai/).

You will successively:
1. Deploy a pre-optimized variant of the Amazon JumpStart model with speculative decoding enabled (using a SageMaker-provided draft model). For popular models, the JumpStart team selects and applies the best optimization configurations for you.
2. Customize speculative decoding with an open-source draft model.
3. Quantize the model weights using the AWQ algorithm.
4. Compile the model for deployment on AWS Inferentia 2 accelerated hardware.

**Notices:**
* Make sure that the `ml.p4d.24xlarge` and `ml.inf2.48xlarge` instance types required for this tutorial are available in your AWS Region.
* Make sure that the value of your "ml.p4d.24xlarge for endpoint usage" and "ml.inf2.48xlarge for endpoint usage" Amazon SageMaker service quotas allow you to deploy at least one Amazon SageMaker endpoint using these instance types.

This notebook leverages the [`ModelBuilder` class](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-modelbuilder-creation.html) from the [`sagemaker` Python SDK](https://sagemaker.readthedocs.io/en/stable/index.html) to abstract away container and model server management and tuning. With `ModelBuilder`, you can work with JumpStart models, Hugging Face Hub models, and custom models by pointing to an S3 path that contains your model data. This sample focuses on the JumpStart optimization path.

### License agreement
* This model is under the Meta license; please refer to the original model card.
* This notebook is a sample notebook and not intended for production use.

### Execution environment setup
This notebook requires the following third-party Python dependencies:
* AWS [`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html#)
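* AWS [`sagemaker`](https://sagemaker.readthedocs.io/en/stable/index.html) with a version greater than or equal to 2.225.0
* Hugging Face [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/index), used in section 2 to download the draft model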

Let's install or upgrade these dependencies using the following command:

In [None]:
%pip install "sagemaker>=2.225.0" boto3 huggingface_hub --upgrade --quiet --no-warn-conflicts

### Setup

In [None]:
from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
from sagemaker.session import Session
import logging
import huggingface_hub
from pathlib import Path

In [None]:
sagemaker_session = Session()

artifacts_bucket_name = sagemaker_session.default_bucket()
execution_role_arn = sagemaker_session.get_caller_identity_arn()

js_model_id = "meta-textgeneration-llama-3-70b"
gpu_instance_type = "ml.p4d.24xlarge"
neuron_instance_type = "ml.inf2.48xlarge"

In [None]:
response = "Hello, I'm a language model, and I'm here to help you with your English."

sample_input = {
    "inputs": "Hello, I'm a language model,",
    "parameters": {"max_new_tokens": 128, "top_p": 0.9, "temperature": 0.6},
}

sample_output = [{"generated_text": response}]

schema_builder = SchemaBuilder(sample_input, sample_output)

## 1. Deploy a pre-optimized deployment configuration with speculative decoding (SageMaker-provided draft model)
The `meta-textgeneration-llama-3-70b` JumpStart model is available with multiple pre-optimized deployment configurations. Optimized model artifacts for each configuration have already been created by the JumpStart team and are readily available for deployment. In this section, you will deploy one of these pre-optimized configurations to an Amazon SageMaker endpoint.

### What is speculative decoding?
Speculative decoding is an inference optimization technique introduced by [Y. Leviathan et al. (ICML 2023)](https://arxiv.org/abs/2211.17192) to accelerate the decoding process of large, and therefore slow, LLMs for latency-critical applications. The key idea is to use a smaller, less powerful but faster model called the ***draft model*** to generate candidate tokens that are then validated by the larger, more powerful but slower model called the ***target model***. At each iteration, the draft model generates $K>1$ candidate tokens. Then, using a single forward pass of the larger target model, none, some, or all of the candidate tokens are accepted. The more closely the draft model is aligned with the target model, the better its guesses, the higher the candidate token acceptance rate, and therefore the higher the speedup. The larger the size gap between the target and the draft model, the larger the potential speedup.
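
To make the accept/reject loop concrete, here is a minimal sketch of one speculative decoding iteration. It is a simplified greedy variant, not the rejection-sampling scheme of the paper and not what the LMI container runs: a candidate is accepted only if it equals the target model's own greedy choice. The names `speculative_step`, `generate`, `draft_next`, and `target_next` are hypothetical, and the two "models" are toy functions over integer tokens.

```python
from typing import Callable, List

NextTokenFn = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextTokenFn,
                     target_next: NextTokenFn, k: int) -> List[int]:
    """Run one greedy speculative decoding iteration and return the new tokens."""
    # 1. The draft model proposes k candidate tokens autoregressively.
    candidates: List[int] = []
    context = list(prefix)
    for _ in range(k):
        token = draft_next(context)
        candidates.append(token)
        context.append(token)

    # 2. The target model verifies the candidates. A real system scores all
    #    positions in a single batched forward pass; this toy loops for clarity.
    accepted: List[int] = []
    context = list(prefix)
    for token in candidates:
        target_token = target_next(context)
        if target_token != token:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(target_token)
            return accepted
        accepted.append(token)
        context.append(token)

    # 3. Every candidate was accepted, so the target contributes one extra token.
    accepted.append(target_next(context))
    return accepted


def generate(prefix: List[int], draft_next: NextTokenFn,
             target_next: NextTokenFn, k: int, max_new_tokens: int) -> List[int]:
    """Repeat speculative steps until max_new_tokens tokens have been produced."""
    output = list(prefix)
    while len(output) - len(prefix) < max_new_tokens:
        output.extend(speculative_step(output, draft_next, target_next, k))
    return output[: len(prefix) + max_new_tokens]


def target_next(context: List[int]) -> int:
    """Toy target model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def draft_next(context: List[int]) -> int:
    """Toy draft model: agrees with the target except right after token 4."""
    return 0 if context[-1] == 4 else (context[-1] + 1) % 10


tokens = generate([0], draft_next, target_next, k=4, max_new_tokens=12)
```

In this greedy variant, every emitted token is either confirmed or produced by the target model, so the output matches plain greedy decoding with the target model alone. The benefit comes from needing fewer target passes when the draft model guesses well.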

Let's start by creating a `ModelBuilder` instance for the model:

In [None]:
model_builder = ModelBuilder(
    model=js_model_id,
    schema_builder=schema_builder,
    sagemaker_session=sagemaker_session,
    role_arn=execution_role_arn,
    log_level=logging.ERROR
)

For each optimization configuration, the JumpStart team has computed key performance metrics, such as time-to-first-token (TTFT) latency and throughput, for multiple hardware types and concurrency levels. Let's visualize these metrics using the `display_benchmark_metrics` method:

In [None]:
model_builder.display_benchmark_metrics()

Now, let's pick and deploy the `lmi-optimized` pre-optimized configuration to an `ml.p4d.24xlarge` instance. The `lmi-optimized` configuration enables speculative decoding with a SageMaker-provided draft model, so you don't have to supply one.

In [None]:
model_builder.set_deployment_config(config_name="lmi-optimized", instance_type=gpu_instance_type)

The currently set deployment configuration can be inspected using the `get_deployment_config` method:

In [None]:
model_builder.get_deployment_config()

Now, let's build the `Model` instance and use it to deploy the selected optimized configuration. This operation may take a few minutes.

In [None]:
optimized_model = model_builder.build()

In [None]:
predictor = optimized_model.deploy(accept_eula=True)

Once the deployment has finished successfully, you can send queries to the model by simply using the predictor's `predict` method:

In [None]:
predictor.predict(sample_input)
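
If you want a rough feel for how the different optimized endpoints in this notebook compare, one option is to time a few calls before cleaning up. The following is a minimal sketch, not part of the `sagemaker` SDK: `time_calls` is a hypothetical helper, and `fake_predict` is a stand-in used only so the snippet runs without an endpoint. It measures end-to-end request latency, not time-to-first-token, which would require a streaming response.

```python
import statistics
import time
from typing import Any, Callable, Dict


def time_calls(predict_fn: Callable[[Any], Any], payload: Any,
               n_calls: int = 5) -> Dict[str, float]:
    """Call predict_fn(payload) n_calls times and summarize wall-clock latency."""
    latencies = []
    for _ in range(n_calls):
        start = time.perf_counter()
        predict_fn(payload)
        latencies.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(latencies),
        "min_s": min(latencies),
        "max_s": max(latencies),
    }


def fake_predict(payload: Any) -> Any:
    """Stand-in for predictor.predict so the sketch runs offline."""
    return [{"generated_text": "Hello"}]


stats = time_calls(fake_predict, {"inputs": "Hello"}, n_calls=3)
```

With a live endpoint, you would pass the real method instead, for example `time_calls(predictor.predict, sample_input)`.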

In [None]:
# Clean up
predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)

## 2. Customize the speculative decoding with an open-source draft model, then deploy the optimized model
In this section, instead of relying on a pre-optimized model, you will use the Amazon SageMaker optimization toolkit to enable speculative decoding on the `meta-textgeneration-llama-3-70b` JumpStart model. In this example, the draft model comes from the Hugging Face model hub. We use the `huggingface_hub` package to download the artifacts and then upload them to S3; optionally, you can also provide your HF model ID instead. The draft model here is Meta-Llama-3-8B; make sure you have been granted access to its artifacts on Hugging Face.

In [None]:
custom_draft_model_id = "meta-llama/Meta-Llama-3-8B"

hf_local_download_dir = Path.cwd() / "model_repo"
hf_local_download_dir.mkdir(exist_ok=True)

huggingface_hub.snapshot_download(
    repo_id=custom_draft_model_id,
    revision="main",
    local_dir=hf_local_download_dir,
    local_dir_use_symlinks=False,
)

In [None]:
!rm -rf model_repo/.ipynb_checkpoints
!rm -rf model_repo/.cache
!rm -rf model_repo/.gitattributes
!rm -rf model_repo/original

In [None]:
custom_draft_model_uri = sagemaker_session.upload_data(
    path=hf_local_download_dir.as_posix(),
    bucket=artifacts_bucket_name,
    key_prefix="spec-dec-custom-draft-model",
)

In [None]:
# The URI must point to the folder holding the uncompressed model artifacts
draft_uri = custom_draft_model_uri + "/"
draft_uri

In [None]:
!aws s3 ls {draft_uri} #verify model artifacts

In [None]:
model_builder = ModelBuilder(
    model=js_model_id,
    schema_builder=schema_builder,
    sagemaker_session=sagemaker_session,
    role_arn=execution_role_arn,
    log_level=logging.ERROR
)

The optimization operation may take a few minutes.

In [None]:
optimized_model = model_builder.optimize(
    instance_type=gpu_instance_type,
    accept_eula=True,
    speculative_decoding_config={
        "ModelSource": draft_uri
    },
)

Now let's deploy the optimized model to an Amazon SageMaker endpoint. This operation may take a few minutes.

In [None]:
predictor = optimized_model.deploy(accept_eula=True)

Once the deployment has finished successfully, you can send queries to the model by simply using the predictor's `predict` method:

In [None]:
predictor.predict(sample_input)

In [None]:
# Clean up
predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)

## 3. Run optimization job to quantize the model using AWQ, then deploy the quantized model
In this section, you will quantize the `meta-textgeneration-llama-3-70b` JumpStart model with the AWQ quantization algorithm by running an Amazon SageMaker optimization job. 

### What is quantization?
In our particular context, quantization means casting the weights of a pre-trained LLM to a data type with a lower number of bits and therefore a smaller memory footprint. The benefits of LLM quantization include:
* Reduced hardware requirements for model serving: A quantized model can be served using less expensive and more available GPUs or even made accessible on consumer devices or mobile platforms.
* Increased space for the KV cache to enable larger batch sizes and/or sequence lengths.
* Lower decoding latency. As the decoding process is memory-bandwidth bound, less data movement from reduced weight sizes directly improves decoding latency, unless offset by dequantization overhead.
* A higher compute-to-memory access ratio (through reduced data movement), known as arithmetic intensity. This allows for fuller utilization of available compute resources during decoding.  
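
The first two benefits can be illustrated with back-of-the-envelope arithmetic. The sketch below assumes a nominal 70 billion parameters and counts weight storage only, ignoring quantization metadata (scales and zero points), activations, and the KV cache. The GPU memory totals are assumptions taken from the public instance specifications (8 x 40 GB for `ml.p4d.24xlarge`, 4 x 24 GB for `ml.g5.12xlarge`), and `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 70e9        # assumption: nominal parameter count
P4D_24XL_GPU_GB = 8 * 40  # assumption: total GPU memory on ml.p4d.24xlarge
G5_12XL_GPU_GB = 4 * 24   # assumption: total GPU memory on ml.g5.12xlarge

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16-bit weights
int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # 4-bit weights

# Memory left for the KV cache and activations after loading the weights.
headroom_fp16_p4d = P4D_24XL_GPU_GB - fp16_gb
headroom_int4_g5 = G5_12XL_GPU_GB - int4_gb

print(f"FP16 weights: {fp16_gb:.0f} GB, INT4 weights: {int4_gb:.0f} GB")
print(f"FP16 fits on g5.12xlarge: {fp16_gb < G5_12XL_GPU_GB}")
print(f"INT4 fits on g5.12xlarge: {int4_gb < G5_12XL_GPU_GB}")
```

Under these assumptions, the 16-bit weights alone exceed the memory of the smaller instance, while the 4-bit weights fit with room to spare for the KV cache.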

AWQ (Activation-aware Weight Quantization) is a post-training weight-only quantization algorithm introduced by [J. Lin et al. (MLSys 2024)](https://arxiv.org/abs/2306.00978) that makes it possible to quantize LLMs to low-bit integer types like INT4 with virtually no loss in model accuracy.
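
To show what casting weights to a low-bit integer type involves, here is a minimal sketch of plain round-to-nearest quantization of one group of weights to 4 bits. This is deliberately not AWQ: AWQ additionally rescales weight channels using activation statistics before a rounding step of this kind, and the optimization job handles all of that for you. The helper names `quantize_group` and `dequantize_group` are hypothetical.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int = 4) -> Tuple[List[int], float, float]:
    """Map floats to integer codes in [0, 2**bits - 1] with a shared scale and offset."""
    levels = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels
    if scale == 0.0:
        scale = 1.0  # constant group: any scale works, avoid division by zero
    codes = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes: List[int], scale: float, w_min: float) -> List[float]:
    """Recover approximate float weights from the integer codes."""
    return [c * scale + w_min for c in codes]


weights = [-0.8, -0.1, 0.0, 0.3, 0.75]
codes, scale, w_min = quantize_group(weights, bits=4)
restored = dequantize_group(codes, scale, w_min)

# Rounding to the nearest level bounds the per-weight error by half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored as a 4-bit code plus a small amount of shared metadata per group (`scale` and `w_min`), which is where the memory saving comes from.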

In [None]:
model_builder = ModelBuilder(
    model=js_model_id,
    schema_builder=schema_builder,
    sagemaker_session=sagemaker_session,
    role_arn=execution_role_arn,
    log_level=logging.ERROR
)

Quantizing the model is as easy as supplying the following inputs:
* The location of the unquantized model artifacts, here the Amazon SageMaker JumpStart model ID.
* The quantization configuration.
* The Amazon S3 URI where the output quantized artifacts need to be stored.

Everything else (compute provisioning and configuration, for example) is managed by SageMaker. This operation takes around 120 minutes.

In [None]:
optimized_model = model_builder.optimize(
    instance_type=gpu_instance_type,
    accept_eula=True,
    quantization_config={
        "OverrideEnvironment": {
            "OPTION_QUANTIZE": "awq",
        },
    },
    output_path=f"s3://{artifacts_bucket_name}/awq-quantization/",
)

Now let's deploy the quantized model to an Amazon SageMaker endpoint. This operation may take a few minutes.

In [None]:
quantized_instance_type = "ml.g5.12xlarge"  # We can use a smaller instance type once quantized
predictor = optimized_model.deploy(instance_type=quantized_instance_type, accept_eula=True)

Once the deployment has finished successfully, you can send queries to the model by simply using the predictor's `predict` method:

In [None]:
predictor.predict(sample_input)

In [None]:
# Clean up
predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)

## 4. Run optimization job to compile the model using the AWS Neuron Compiler, then deploy the compiled model to an Inferentia 2 endpoint
In this section, you will use an Amazon SageMaker optimization job as a managed model compiler to compile the `meta-textgeneration-llama-3-70b` JumpStart model for AWS Inferentia 2 hardware using [AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/). The optimization job allows you to decouple compilation from deployment: the model is compiled once, and the compiled artifacts can be deployed many times. In other words, the compilation overhead is paid once instead of occurring upon each deployment.

In [None]:
model_builder = ModelBuilder(
    model=js_model_id,
    schema_builder=schema_builder,
    sagemaker_session=sagemaker_session,
    role_arn=execution_role_arn,
)

Now let's compile the model for Inferentia 2. This operation takes around 40 minutes.

In [None]:
optimized_model = model_builder.optimize(
    instance_type=neuron_instance_type,
    accept_eula=True,
    compilation_config={
        "OverrideEnvironment": {
            "OPTION_TENSOR_PARALLEL_DEGREE": "24",
            "OPTION_N_POSITIONS": "8192",
            "OPTION_DTYPE": "fp16",
            "OPTION_ROLLING_BATCH": "auto",
            "OPTION_MAX_ROLLING_BATCH_SIZE": "4",
            "OPTION_NEURON_OPTIMIZE_LEVEL": "2",
        }
    },
    output_path=f"s3://{artifacts_bucket_name}/neuron/",
)
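
The tensor parallel degree of 24 in the configuration above lines up with the number of NeuronCores on the target instance. The sketch below checks that arithmetic under stated assumptions: the hardware figures (12 Inferentia 2 accelerators, 2 NeuronCores and 32 GB of memory per accelerator) come from the public `ml.inf2.48xlarge` specification rather than from this notebook, and the 70 billion parameter count is nominal.

```python
# Assumptions: public ml.inf2.48xlarge figures, not values read from the SDK.
ACCELERATORS = 12
NEURON_CORES_PER_ACCELERATOR = 2
MEMORY_GB_PER_ACCELERATOR = 32

total_cores = ACCELERATORS * NEURON_CORES_PER_ACCELERATOR
total_memory_gb = ACCELERATORS * MEMORY_GB_PER_ACCELERATOR

# Subset of the compilation environment used in the optimization job above.
override_env = {
    "OPTION_TENSOR_PARALLEL_DEGREE": "24",
    "OPTION_DTYPE": "fp16",
}
tp_degree = int(override_env["OPTION_TENSOR_PARALLEL_DEGREE"])

# The model is sharded across tp_degree cores, so it cannot exceed the core count.
if tp_degree > total_cores:
    raise ValueError(f"tensor parallel degree {tp_degree} exceeds {total_cores} cores")

# Nominal fp16 weight footprint (2 bytes per parameter), in total and per core.
NUM_PARAMS = 70e9  # assumption: nominal parameter count
weights_gb = NUM_PARAMS * 2 / 1e9
weights_gb_per_core = weights_gb / tp_degree

print(f"{total_cores} cores, {total_memory_gb} GB accelerator memory")
print(f"fp16 weights: {weights_gb:.0f} GB, about {weights_gb_per_core:.1f} GB per core")
```

Under these assumptions, the sharded fp16 weights leave memory available for the KV cache implied by `OPTION_N_POSITIONS` and `OPTION_MAX_ROLLING_BATCH_SIZE`.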

Now let's deploy the compiled model to an Amazon SageMaker endpoint powered by Inferentia 2 hardware. This operation may take a few minutes.

In [None]:
predictor = optimized_model.deploy(accept_eula=True, model_data_download_timeout=3600, volume_size=512)

Once the deployment has finished successfully, you can send queries to the model by simply using the predictor's `predict` method:

In [None]:
predictor.predict(sample_input)

In [None]:
# Clean up
predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)