# Use Amazon SageMaker to deploy Mistral-7B-Instruct-v0.2 from the Hugging Face Hub

### Before running the code

You need a valid [AWS CLI profile](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html) to run the code. Set one up by running `aws configure --profile <profile_name>` in your terminal and entering your AWS Access Key ID and AWS Secret Access Key when prompted. Both are available in the [Security Credentials](https://console.aws.amazon.com/iam/home?region=us-east-1#/security_credentials) section of the AWS console.

```bash
$ aws configure --profile <profile_name>
AWS Access Key ID [None]: <your_access_key_id>
AWS Secret Access Key [None]: <your_secret_access_key>
Default region name [None]: us-west-2
Default output format [None]: json
```

We recommend configuring the default profile by running `aws configure` without the `--profile` option, because this notebook uses the default profile. Make sure to set `Default output format` to `json`.

> Note: If you don't have the AWS CLI installed, you will get a `command not found: aws` error. Follow the instructions [here](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) to install it.
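
Before running the rest of the notebook, it can help to confirm which account a profile resolves to. The sketch below assumes a hypothetical helper name, `describe_caller`, and takes the boto3 session as an argument; it relies only on the STS `get_caller_identity` call that the notebook itself uses later.

```python
def describe_caller(session):
    """Return the account ID and ARN that the given boto3 session resolves to.

    `session` is expected to be a boto3.Session, for example
    boto3.Session(profile_name="default"). It is passed in so that the helper
    (a hypothetical name, not part of the notebook) has no hard-coded profile.
    """
    identity = session.client("sts").get_caller_identity()
    return {"account": identity["Account"], "arn": identity["Arn"]}
```

Calling `describe_caller(boto3.Session(profile_name="default"))` should raise an error if the credentials are missing or invalid, which surfaces configuration problems before any resources are created.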

For more details on how to deploy a model on Amazon SageMaker, you can refer to this document:

https://huggingface.co/docs/sagemaker/inference#deploy-a-model-from-the--hub


### Install Extra Libraries

In [2]:
import sys

!{sys.executable} -m pip install -q boto3
!{sys.executable} -m pip install -q huggingface-hub

### Import dependencies
First, we import libraries and create a boto3 session. We will use the default profile here, but you can also specify a profile name.

In [None]:
import glob
import json
import os
from datetime import datetime
from pathlib import Path

import boto3

In [2]:
session = boto3.Session(profile_name='default')
region = "us-west-2"
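# Use the session created above so the account ID comes from the same profile
account_id = session.client('sts', region_name=region).get_caller_identity().get('Account')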
datetime_timestamp = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
datetime_day = datetime.now().strftime("%Y-%m-%d")

s3_client = session.client("s3", region_name=region)
sm_client = session.client("sagemaker", region_name=region)
sm_runtime_client = session.client("sagemaker-runtime", region_name=region)

### Download model
We will download the model from the Hugging Face Hub. You can find the model in the [model hub](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2). For this example, we will use the `mistralai/Mistral-7B-Instruct-v0.2` model.

In [4]:
from huggingface_hub import snapshot_download
from pathlib import Path

local_model_path = Path("./Mistral_7B_Instruct_model")
local_model_path.mkdir(exist_ok=True)
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
model_commit_hash = "b70aa86578567ba3301b21c8a27bea4e8f6d6d61"


In [5]:
snapshot_download(repo_id=model_name, revision=model_commit_hash, cache_dir=local_model_path)

Fetching 16 files: 100%|██████████| 16/16 [00:00<00:00, 144631.17it/s]


'Mistral_7B_Instruct_model/models--mistralai--Mistral-7B-Instruct-v0.2/snapshots/b70aa86578567ba3301b21c8a27bea4e8f6d6d61'

### Upload model to Amazon S3
We will create a bucket in Amazon S3 and upload the model to the bucket. You can find more details on how to create a bucket in Amazon S3 [here](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html).

In [6]:
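def create_s3_bucket(bucket_name):
    # us-east-1 is the default location and rejects an explicit LocationConstraint
    kwargs = {"Bucket": bucket_name}
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    try:
        s3_client.create_bucket(**kwargs)
    except Exception as e:
        # For example BucketAlreadyOwnedByYou when the notebook is re-run
        print(e)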

def upload_files_to_s3(local_model_directory, s3_bucket_name, s3_directory_prefix):
    """
    Uploads all files in a local directory to an S3 bucket with a specified directory prefix.

    Args:
        local_model_directory (str): The local directory containing the files to be uploaded.
        s3_bucket_name (str): The name of the S3 bucket.
        s3_directory_prefix (str): The directory prefix to be added to the S3 file paths.

    Returns:
        None
    """
    all_files_paths = glob.glob(local_model_directory + "/**/*", recursive=True)

    for file_path in all_files_paths:
        if Path(file_path).is_dir():
            continue

        relative_file_path = file_path.replace(f"{local_model_directory}/", "")
        s3_file_path = os.path.join(s3_directory_prefix, str(relative_file_path))

        print(f"Uploading {file_path} to {s3_file_path}")
        s3_client.upload_file(file_path, s3_bucket_name, s3_file_path)
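
`upload_files_to_s3` derives each S3 key with string replacement and `os.path.join`, which would produce backslashes on Windows. The following is a minimal sketch of the same local-path-to-key mapping built on `pathlib`; the helper name `iter_s3_keys` is hypothetical and not part of the original notebook.

```python
from pathlib import Path, PurePosixPath


def iter_s3_keys(local_dir, s3_prefix):
    """Yield (local_path, s3_key) pairs for every file under local_dir.

    Hypothetical helper for illustration. S3 keys always use forward slashes,
    so the relative path is re-joined with PurePosixPath regardless of the OS.
    """
    root = Path(local_dir)
    for path in sorted(root.rglob("*")):
        if path.is_dir():
            continue  # only files are uploaded, directories are implied by keys
        relative = path.relative_to(root)
        key = PurePosixPath(s3_prefix).joinpath(*relative.parts)
        yield path, str(key)
```

Each pair could then be passed to `s3_client.upload_file(str(local_path), s3_bucket_name, s3_key)`.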

In [7]:
s3_model_prefix = "Uniflow/LLM/Mistral_7B_Instruct_model"  # folder where model checkpoint will go
model_snapshot_path = list(local_model_path.glob("**/snapshots/*"))[0]
s3_code_prefix = "Uniflow/LLM/Mistral_7B_Instruct_code"
print(f"s3_code_prefix: {s3_code_prefix}")
print(f"model_snapshot_path: {model_snapshot_path}")

s3_code_prefix: Uniflow/LLM/Mistral_7B_Instruct_code
model_snapshot_path: Mistral_7B_Instruct_model/models--mistralai--Mistral-7B-Instruct-v0.2/snapshots/b70aa86578567ba3301b21c8a27bea4e8f6d6d61


In [8]:
s3_bucket_name = f"uniflow-llm-{account_id}-{region}"
create_s3_bucket(s3_bucket_name)
upload_files_to_s3(str(model_snapshot_path), s3_bucket_name, s3_model_prefix)

An error occurred (BucketAlreadyOwnedByYou) when calling the CreateBucket operation: Your previous request to create the named bucket succeeded and you already own it.
Uploading Mistral_7B_Instruct_model/models--mistralai--Mistral-7B-Instruct-v0.2/snapshots/b70aa86578567ba3301b21c8a27bea4e8f6d6d61/special_tokens_map.json to Uniflow/LLM/Mistral_7B_Instruct_model/special_tokens_map.json
Uploading Mistral_7B_Instruct_model/models--mistralai--Mistral-7B-Instruct-v0.2/snapshots/b70aa86578567ba3301b21c8a27bea4e8f6d6d61/pytorch_model.bin.index.json to Uniflow/LLM/Mistral_7B_Instruct_model/pytorch_model.bin.index.json
Uploading Mistral_7B_Instruct_model/models--mistralai--Mistral-7B-Instruct-v0.2/snapshots/b70aa86578567ba3301b21c8a27bea4e8f6d6d61/generation_config.json to Uniflow/LLM/Mistral_7B_Instruct_model/generation_config.json
Uploading Mistral_7B_Instruct_model/models--mistralai--Mistral-7B-Instruct-v0.2/snapshots/b70aa86578567ba3301b21c8a27bea4e8f6d6d61/config.json to Uniflow/LLM/Mistra

### Create role
We will create an execution role that will be used by SageMaker to access AWS resources.

In [9]:
def create_role(role_name):
    """
    Creates an IAM role for SageMaker deployment.

    Parameters:
    role_name (str): The name of the IAM role to be created.

    Returns:
    str: The ARN (Amazon Resource Name) of the created IAM role.
    """
    iam_client = session.client("iam")

    # Check if role already exists
    try:
        get_role_response = iam_client.get_role(RoleName=role_name)
        print(f"IAM Role '{role_name}' already exists. Skipping creation.")
        return get_role_response["Role"]["Arn"]
    except iam_client.exceptions.NoSuchEntityException:
        pass

    assume_role_policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }

    create_role_response = iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(assume_role_policy_document),
    )

    attach_policy_response = iam_client.attach_role_policy(
        RoleName=role_name,
        PolicyArn="arn:aws:iam::aws:policy/AmazonSageMakerFullAccess",
    )

    attach_policy_response = iam_client.attach_role_policy(
        RoleName=role_name,
        PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
    )

    print(f"IAM Role '{role_name}' created successfully!")

    role_arn = create_role_response["Role"]["Arn"]

    return role_arn

We name the role `UniflowSageMakerEndpointRole-v1` in this notebook. You can change it to your own role name.

In [10]:
role_name = "UniflowSageMakerEndpointRole-v1"
role_arn = create_role(role_name)

IAM Role 'UniflowSageMakerEndpointRole-v1' created successfully!
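
`create_role` attaches the AWS-managed `AmazonS3ReadOnlyAccess` policy, which grants read access to every bucket in the account. If a narrower grant is preferred, a policy scoped to the notebook's bucket and prefix could look like the sketch below. The helper name `build_s3_read_policy` is hypothetical, and whether the serving container needs any further S3 permissions has not been verified here.

```python
import json


def build_s3_read_policy(bucket_name, prefix):
    """Build an IAM policy document that grants read access to one S3 prefix.

    Hypothetical helper for illustration: a narrower alternative to the
    AWS-managed AmazonS3ReadOnlyAccess policy.
    """
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket itself
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket_name}",
            },
            {
                # Reading is limited to objects under the given prefix
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/{prefix}/*",
            },
        ],
    }


# "example-bucket" is a placeholder; the notebook would pass s3_bucket_name
policy_document = json.dumps(build_s3_read_policy("example-bucket", "Uniflow/LLM"))
```

The resulting JSON string could be attached as an inline policy with `iam_client.put_role_policy(RoleName=role_name, PolicyName=..., PolicyDocument=policy_document)` in place of the managed policy.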


### Deploy model
Next, we deploy the model to an endpoint. There will be 3 steps to this process:

First, we create a model in SageMaker. This will be a reference to the model artifacts in S3.

Second, we create an endpoint configuration. This will be a reference to the model in SageMaker.

Third, we create an endpoint. SageMaker will spin up an instance to host the model.

Before deploying the model, we need to create model artifacts and inference code. We will create a `model.tar.gz` file that contains the inference code, requirements and serving properties. We will then upload it to Amazon S3.

In [11]:
!mkdir -p Mistral_7B_Instruct_code

In [12]:
%%writefile Mistral_7B_Instruct_code/model.py
from djl_python import Input, Output
import torch
import logging
import math
import os
import json
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda"

def load_model(properties):
    tensor_parallel = properties["tensor_parallel_degree"]
    model_location = properties['model_dir']
    if "model_id" in properties:
        model_location = properties['model_id']
    logging.info(f"Loading model in {model_location}")
    
    tokenizer = AutoTokenizer.from_pretrained(model_location, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_location, trust_remote_code=True)
    model = model.eval().half()
    model.to(device)
    
    return model, tokenizer


model = None
tokenizer = None
generator = None

def handle(inputs: Input):
    global model, tokenizer
    if not model:
        model, tokenizer = load_model(inputs.get_properties())

    if inputs.is_empty():
        return None
    data = inputs.get_as_json()
    
    input_sentence = data["inputs"]
    params = data.get("parameters", {})  # tolerate requests without "parameters"
    history = data.get("history", [])
    max_new_tokens = params.get("max_new_tokens", 1000)
    do_sample = params.get("do_sample", True)

    # Earlier turns come first; the new user message goes last so the chat
    # template sees the roles in conversation order.
    messages = list(history)
    messages.append({"role": "user", "content": input_sentence})
    
    outputs = Output()

    encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
    model_inputs = encodeds.to(device)
    generated_ids = model.generate(model_inputs, max_new_tokens=max_new_tokens, do_sample=do_sample)
    decoded = tokenizer.batch_decode(generated_ids)
    result = {"outputs": decoded[0]}

    outputs.add_as_json(result)
    return outputs


Overwriting Mistral_7B_Instruct_code/model.py
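
The handler combines `inputs` with an optional `history` list before applying the chat template. The sketch below shows how a client might assemble and check such a conversation before sending it. It assumes that `history` holds earlier turns as `{"role": ..., "content": ...}` dictionaries and that the Mistral instruct chat template expects roles to alternate between `user` and `assistant`; the helper name `build_messages` is hypothetical.

```python
def build_messages(input_sentence, history=None):
    """Assemble chat messages with earlier turns first and the new user turn last.

    Raises ValueError when roles do not alternate user/assistant, which is the
    order assumed here for the Mistral instruct chat template.
    """
    messages = list(history or [])  # copy so the caller's list is not modified
    messages.append({"role": "user", "content": input_sentence})
    for index, message in enumerate(messages):
        expected = "user" if index % 2 == 0 else "assistant"
        if message["role"] != expected:
            raise ValueError(
                f"message {index} has role {message['role']!r}, expected {expected!r}"
            )
    return messages
```

Validating on the client side turns a malformed history into an immediate, readable error instead of a failed endpoint invocation.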


In [13]:
requirements = [
    "transformers==4.36.2",
    "accelerate==0.25.0"
]

with open("Mistral_7B_Instruct_code/requirements.txt", "w") as f:
    f.write("\n".join(requirements))

In [14]:
serving_properties = [
    "engine=Python",
    "option.tensor_parallel_degree=1",
    # Reuse s3_model_prefix so the path matches the upload location above
    f"option.s3url=s3://{s3_bucket_name}/{s3_model_prefix}/",
]

with open("Mistral_7B_Instruct_code/serving.properties", "w+") as f:
    f.write("\n".join(serving_properties))

In [15]:
!rm -f model.tar.gz
!tar czvf model.tar.gz Mistral_7B_Instruct_code

s3_client.upload_file("model.tar.gz", s3_bucket_name, f"{s3_code_prefix}/model.tar.gz")

Mistral_7B_Instruct_code/
Mistral_7B_Instruct_code/requirements.txt
Mistral_7B_Instruct_code/model.py
Mistral_7B_Instruct_code/serving.properties


After uploading the model artifacts to Amazon S3, we will create a SageMaker model. We will then create an endpoint configuration and deploy the model to an endpoint.

In [None]:

model_name = f"Mistral-7B-Instruct-{datetime_timestamp}"
inference_image_uri = (
    f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.23.0-deepspeed0.9.5-cu118"
)
s3_code_artifact = f"s3://{s3_bucket_name}/{s3_code_prefix}/model.tar.gz"

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role_arn,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": s3_code_artifact
    },
    
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

The following code will create a SageMaker Endpoint Configuration.

In [None]:
endpoint_config_name = f"{model_name}-config"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.4xlarge",
            "InitialInstanceCount": 1,
            # "VolumeSizeInGB" : 400,
            # "ModelDataDownloadTimeoutInSeconds": 2400,
            "ContainerStartupHealthCheckTimeoutInSeconds": 10*60,
        },
    ],
)
print(f"Created Endpoint Config: {endpoint_config_response['EndpointConfigArn']}")

Then we will create a SageMaker Endpoint.

In [None]:
endpoint_name = f"{model_name}-endpoint"

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

In [None]:
import time

def wait_for_endpoint_creation(sm_client, endpoint_name):
    while True:
        resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        
        print(f"Status: {status}")
        
        if status != "Creating":
            break
        
        time.sleep(60)
    
    print(f"Arn: {resp['EndpointArn']}")
    print(f"Status: {status}")

wait_for_endpoint_creation(sm_client, endpoint_name)
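
`wait_for_endpoint_creation` stops polling as soon as the status is no longer `Creating`, whether the endpoint ended up `InService` or `Failed`, and it has no upper time limit. A possible variant that distinguishes these cases is sketched below. The function name, the default timeout, and the injectable `sleep` parameter are assumptions made for illustration.

```python
import time


def wait_until_in_service(sm_client, endpoint_name, timeout_s=3600, poll_s=60, sleep=time.sleep):
    """Poll describe_endpoint until the endpoint leaves the "Creating" state.

    Returns the status once it is "InService". Raises RuntimeError for any
    other non-Creating status (for example "Failed") and TimeoutError when the
    time budget is used up. `sleep` is injectable so the loop can be exercised
    without real waiting.
    """
    waited = 0
    while True:
        response = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status == "InService":
            return status
        if status != "Creating":
            reason = response.get("FailureReason", "no failure reason reported")
            raise RuntimeError(f"Endpoint {endpoint_name} is {status}: {reason}")
        if waited >= timeout_s:
            raise TimeoutError(f"Endpoint {endpoint_name} still Creating after {waited}s")
        sleep(poll_s)
        waited += poll_s
```

boto3 also provides a built-in `endpoint_in_service` waiter, available through `sm_client.get_waiter("endpoint_in_service")`, which may be sufficient when custom status reporting is not needed.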

### Invoke endpoint
Finally, we invoke the endpoint with a sample input.

In [23]:
def invoke_endpoint(endpoint_name, input_text):
    """
    Invokes the SageMaker endpoint.

    Args:
        endpoint_name (str): The name of the SageMaker endpoint.
        input_text (str): The input text to be processed by the endpoint.

    Returns:
        dict: The response from the SageMaker endpoint.
    """

    parameters = {
        "do_sample": True,
        "max_new_tokens": 128,
    }

    prompt = f"{input_text}"

    payload = json.dumps({"inputs": prompt, "parameters": parameters})

    response = sm_runtime_client.invoke_endpoint(
        EndpointName=endpoint_name, ContentType="application/json", Body=payload
    )

    return json.loads(response["Body"].read().decode("utf-8"))

In [24]:
input_text = "Tell me about Amazon SageMaker"
response = invoke_endpoint(endpoint_name, input_text)
print(response)

{'outputs': "<s> [INST] Tell me about Amazon SageMaker [/INST] Amazon SageMaker is a fully managed platform provided by Amazon Web Services (AWS) that allows developers and data scientists to build, train, and deploy machine learning models quickly and at scale. It provides various tools, algorithms, and pre-built templates to help you get started with your machine learning projects. Here are some key features and benefits of Amazon SageMaker:\n\n1. Fully managed: You don't need to manage any infrastructure, such as servers, storage, or networking, as Amazon SageMaker manages all of that for you.\n2. Bring your own data: Sage"}
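
Because the handler decodes the full generated sequence, the `outputs` field echoes the prompt together with the `[INST]` markers. A small sketch for keeping only the model's reply follows; it assumes the marker format shown in the sample output above, and the helper name `extract_reply` is hypothetical.

```python
def extract_reply(generated_text):
    """Return only the text after the last [/INST] marker, without the </s> token."""
    reply = generated_text.rsplit("[/INST]", 1)[-1]
    return reply.replace("</s>", "").strip()
```

For the response above, `extract_reply(response["outputs"])` would return the text starting at "Amazon SageMaker is a fully managed platform".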


That's the end of this notebook. You can find more details on how to deploy a model on Amazon SageMaker [here](https://huggingface.co/docs/sagemaker/inference#deploy-a-model-from-the--hub).

Don't forget to delete the endpoint when you are done with this notebook. SageMaker endpoints are billed for as long as they are running, so you will keep incurring charges until the endpoint is deleted.
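
A minimal sketch of that cleanup is shown here, assuming the `sm_client`, `endpoint_name`, `endpoint_config_name`, and `model_name` values created earlier in this notebook. The function name `delete_endpoint_resources` is hypothetical.

```python
def delete_endpoint_resources(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, then its endpoint configuration, then the model."""
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)
```

Calling `delete_endpoint_resources(sm_client, endpoint_name, endpoint_config_name, model_name)` removes the SageMaker resources only. The S3 bucket contents and the IAM role are left in place and would need to be removed separately if they are no longer needed.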

## End of the notebook

Check more Uniflow use cases in the [example folder](https://github.com/CambioML/uniflow/tree/main/example/model#examples)!

<a href="https://www.cambioml.com/" title="Title">
    <img src="../image/cambioml_logo_large.png" style="height: 100px; display: block; margin-left: auto; margin-right: auto;"/>
</a>