# Llama 3.2 Vision stateful inference with SageMaker

## Contents

This notebook runs on a SageMaker notebook instance with the `conda_pytorch_p310` kernel and demonstrates how to use TorchServe to deploy the Llama 3.2 Vision model on SageMaker.

This is the code accompanying the workshop <TODO>


## Step 0: Upgrade SageMaker and import dependencies

In [169]:
!python --version && aws --version

Python 3.10.14
aws-cli/1.34.16 Python/3.10.14 Linux/5.10.224-212.876.amzn2.x86_64 botocore/1.35.35


In [170]:
!pip install -Uq pip
!pip install -Uq sagemaker
!pip install torch-model-archiver
!pip install -Uq botocore
!pip install -Uq boto3



In [171]:
!pip install python-dotenv
from dotenv import load_dotenv
import os
load_dotenv(override=True)  # Loads the variables from .env



True

In [172]:
# Confirm the Hugging Face token was loaded from .env without echoing the secret
assert os.environ.get("TS_HF_TOKEN_VALUE"), "TS_HF_TOKEN_VALUE is not set"
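
Printing a token in a notebook leaves it in the saved cell output. A minimal sketch of a safer check follows; `mask_secret` is a hypothetical helper (not part of the workshop code) that shows only the first few characters, so the presence of the token can be confirmed without revealing it.

```python
import os


def mask_secret(value: str, visible: int = 4) -> str:
    """Return `value` with everything after the first `visible` characters replaced by '*'."""
    if len(value) <= visible:
        # Very short values are masked completely.
        return "*" * len(value)
    return value[:visible] + "*" * (len(value) - visible)


# Show that the token is set without exposing it in the notebook output.
token = os.environ.get("TS_HF_TOKEN_VALUE", "")
print(mask_secret(token) if token else "TS_HF_TOKEN_VALUE is not set")
```

If a token has already been printed and the notebook shared, the usual remedy is to revoke it and issue a new one.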

In [173]:
import os
import shutil
import importlib
import botocore

In [174]:
import sagemaker
from sagemaker import image_uris
import boto3
import os
import time
import json
from pathlib import Path

In [175]:
import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers
barebone_session = sagemaker.session.Session()  # barebone SageMaker session to get the current region
# region name of the current SageMaker environment
region = barebone_session.boto_region_name
boto3_session = boto3.session.Session(region_name=region)
# Create a SageMaker runtime client object for invoking the endpoint
smr = boto3.client('sagemaker-runtime', region_name=region)
# Create a SageMaker client object
sm = boto3.client('sagemaker', region_name=region)
# execution role for the endpoint
role = sagemaker.get_execution_role()
# SageMaker session for interacting with different AWS APIs
sess = sagemaker.session.Session(boto3_session, sagemaker_client=sm, sagemaker_runtime_client=smr)
# account ID of the current SageMaker environment
account = sess.account_id()

# Configuration:
bucket_name = sess.default_bucket()
prefix = "torchserve"
output_path = f"s3://{bucket_name}/{prefix}"
model_name = "llama32vision-sm"
print(f'account={account}, region={region}, role={role}, output_path={output_path}')

account=043632497353, region=us-west-2, role=arn:aws:iam::043632497353:role/sm-vision-stateful-role-SageMakerEndpointRole-OiHgNy330sKT, output_path=s3://sagemaker-us-west-2-043632497353/torchserve


## Step 1: Build a custom (bring-your-own) TorchServe Docker container and push it to Amazon ECR

1. Create an ECR repo: https://docs.aws.amazon.com/AmazonECR/latest/userguide/repository-create.html
2. Get Base Image: https://github.com/aws/deep-learning-containers/blob/master/available_images.md

In [176]:
baseimage = f"763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-inference:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker"
reponame = "llama32-11b-vision-stateful"
versiontag = "1.0"
print("Run the command printed below in a terminal to execute ./build_and_push.sh. A terminal gives better feedback.")
print(f"cd docker && ./build_and_push.sh {reponame} {versiontag} {baseimage} {region} {account}")
print("If you run this command in a terminal, you can skip the next cell.")

Run the command printed below in a terminal to execute ./build_and_push.sh. A terminal gives better feedback.
cd docker && ./build_and_push.sh llama32-11b-vision-stateful 1.0 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker us-west-2 043632497353
If you run this command in a terminal, you can skip the next cell.


In [177]:
# %%capture build_output

# # Build our own docker image
# !cd docker && ./build_and_push.sh {reponame} {versiontag} {baseimage} {region} {account}

In [178]:
# URI of the custom image pushed to ECR by build_and_push.sh
container = f"{account}.dkr.ecr.{region}.amazonaws.com/{reponame}:{versiontag}"
print(baseimage)


763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker
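
The `container` value above follows the usual private ECR naming scheme of account, region, repository, and tag. As a rough sketch, the same string can be assembled by a small helper that also catches an obviously malformed account ID. `ecr_image_uri` is a hypothetical name, and the `amazonaws.com` suffix is an assumption that holds for standard commercial regions only.

```python
def ecr_image_uri(account: str, region: str, repo: str, tag: str) -> str:
    """Build a private ECR image URI of the form used for `container` above."""
    # AWS account IDs are 12-digit numbers; reject anything else early.
    if not (account.isdigit() and len(account) == 12):
        raise ValueError(f"expected a 12-digit AWS account ID, got {account!r}")
    if not repo or not tag:
        raise ValueError("repository name and tag must be non-empty")
    # Assumption: standard partition, where the registry domain ends in amazonaws.com.
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"


# Example with a placeholder account ID.
print(ecr_image_uri("123456789012", "us-west-2", "llama32-11b-vision-stateful", "1.0"))
```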


## Step 2: Build TorchServe Model Artifacts and Upload to S3

In [179]:
# Remove the output of any previous archiver run
!rm -rf code/{model_name}

In [180]:
!cd code && torch-model-archiver --model-name {model_name} --version 1.0 --handler handler/custom_handler.py --config-file handler/model-config.yaml --archive-format no-archive --extra-files handler/ -f

In [181]:
!cd code && aws s3 cp {model_name} {output_path}/{model_name} --recursive

upload: llama32vision-sm/data_types.py to s3://sagemaker-us-west-2-043632497353/torchserve/llama32vision-sm/data_types.py
upload: llama32vision-sm/__init__.py to s3://sagemaker-us-west-2-043632497353/torchserve/llama32vision-sm/__init__.py
upload: llama32vision-sm/custom_handler.py to s3://sagemaker-us-west-2-043632497353/torchserve/llama32vision-sm/custom_handler.py
upload: llama32vision-sm/inference_api.py to s3://sagemaker-us-west-2-043632497353/torchserve/llama32vision-sm/inference_api.py
upload: llama32vision-sm/MAR-INF/MANIFEST.json to s3://sagemaker-us-west-2-043632497353/torchserve/llama32vision-sm/MAR-INF/MANIFEST.json
upload: llama32vision-sm/utils.py to s3://sagemaker-us-west-2-043632497353/torchserve/llama32vision-sm/utils.py
upload: llama32vision-sm/model-config.yaml to s3://sagemaker-us-west-2-043632497353/torchserve/llama32vision-sm/model-config.yaml


In [182]:
s3_uri = f"{output_path}/{model_name}/"
print(s3_uri)

s3://sagemaker-us-west-2-043632497353/torchserve/llama32vision-sm/
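
Before uploading, it can help to confirm that the archiver produced the expected layout. The sketch below checks a local model directory for a few of the files that appear in the upload listing above. `missing_artifacts` and the `REQUIRED_FILES` tuple are hypothetical; adjust the list to whatever your handler actually needs.

```python
from pathlib import Path

# Files taken from the upload listing above; treat this list as an assumption.
REQUIRED_FILES = (
    "MAR-INF/MANIFEST.json",
    "model-config.yaml",
    "custom_handler.py",
)


def missing_artifacts(model_dir, required=REQUIRED_FILES):
    """Return the relative paths from `required` that are not files under `model_dir`."""
    root = Path(model_dir)
    return [rel for rel in required if not (root / rel).is_file()]


# Hypothetical usage in this notebook:
#   missing = missing_artifacts(f"code/{model_name}")
#   assert not missing, f"archiver output is incomplete: {missing}"
```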


## Step 3: Create SageMaker Endpoint

### 3.1 Create Model

In [183]:
from datetime import datetime

instance_type = "ml.p4d.24xlarge"
endpoint_name = sagemaker.utils.name_from_base(model_name)

model = Model(
    name=model_name + datetime.now().strftime("%Y-%m-%d-%H-%M-%S"),
    # Enable SageMaker uncompressed model artifacts via "S3DataType": "S3Prefix"
    model_data={
        "S3DataSource": {
                "S3Uri": s3_uri,
                "S3DataType": "S3Prefix",
                "CompressionType": "None",
        }
    },
    image_uri=container,
    role=role,
    sagemaker_session=sess,
    env={
        # TorchServe configuration file
        "TS_CONFIG_FILE": "/home/model-server/config.properties",
        # Disable token authorization for REST APIs
        "TS_DISABLE_TOKEN_AUTHORIZATION": "true", 
        # Headers to indicate Session ID
        "TS_HEADER_KEY_SEQUENCE_ID": "X-Amzn-SageMaker-Session-Id",
        "TS_REQUEST_SEQUENCE_ID": "X-Amzn-SageMaker-Session-Id",
        # Headers to indicate closed session
        "TS_HEADER_KEY_SEQUENCE_END": "X-Amzn-SageMaker-Closed-Session-Id",
        "TS_REQUEST_SEQUENCE_END": "X-Amzn-SageMaker-Closed-Session-Id",
        # Enable system metrics aggregation
        "TS_DISABLE_SYSTEM_METRICS": "false",
        # Hugging Face token passed to the container (loaded from .env in Step 0)
        "TS_HF_TOKEN": os.environ["TS_HF_TOKEN_VALUE"]
    },
)
print(model)

<sagemaker.model.Model object at 0x7f585803f550>
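
The `model_data` dictionary above is what tells SageMaker to read the artifacts as an uncompressed S3 prefix rather than a `model.tar.gz`. A minimal sketch of a helper that builds it is shown below. `uncompressed_model_data` is a hypothetical name, and the trailing-slash normalization assumes, per the SageMaker `S3ModelDataSource` documentation, that a key name prefix should end with a forward slash.

```python
def uncompressed_model_data(s3_uri: str) -> dict:
    """Build a `model_data` dict pointing at an uncompressed S3 prefix."""
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"expected an s3:// URI, got {s3_uri!r}")
    # Assumption: the prefix should end with "/" so it is treated as a key name prefix.
    if not s3_uri.endswith("/"):
        s3_uri += "/"
    return {
        "S3DataSource": {
            "S3Uri": s3_uri,
            "S3DataType": "S3Prefix",
            "CompressionType": "None",
        }
    }


# Example with a placeholder bucket name.
print(uncompressed_model_data("s3://example-bucket/torchserve/llama32vision-sm"))
```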


### 3.2 Deploy Model and Create Endpoint

In [184]:
model.deploy(
    initial_instance_count=1, # increase the number of instances based on your load
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    #volume_size=512, # increase the size to store large model
    model_data_download_timeout=3600, 
    container_startup_health_check_timeout=3600, 
)

--------------!
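
By default `model.deploy` blocks until the endpoint is ready, which is what the progress dashes above show. If you deploy with `wait=False` instead, you need to poll the endpoint status yourself. The following is a sketch of a generic polling loop, not the SDK's own waiter; `wait_for_status` is a hypothetical helper, and the status source is passed in as a callable so the loop can be exercised without AWS access.

```python
import time


def wait_for_status(get_status, target="InService", failed=("Failed",),
                    poll_seconds=30.0, timeout_seconds=3600.0, sleep=time.sleep):
    """Call `get_status()` until it returns `target`; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status == target:
            return status
        if status in failed:
            raise RuntimeError(f"endpoint entered terminal status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Hypothetical usage with the SageMaker client `sm` created in Step 0:
#   wait_for_status(
#       lambda: sm.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
#   )
```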

### 3.3 Create a Predictor

In [185]:
predictor = sagemaker.predictor.Predictor(
    endpoint_name=model.endpoint_name,
    sagemaker_session=sess
)
print(predictor)

Predictor: {'endpoint_name': 'llama32vision-sm-2024-10-07-22-04-35-140', 'sagemaker_session': <sagemaker.session.Session object at 0x7f585803e200>, 'serializer': <sagemaker.base_serializers.IdentitySerializer object at 0x7f590e9a5630>, 'deserializer': <sagemaker.base_deserializers.BytesDeserializer object at 0x7f590e9a5cf0>}


In [186]:
# predictor = sagemaker.predictor.Predictor(
#     endpoint_name='llava-sm-2024-09-04-06-35-10-354',
#     sagemaker_session=sess
# )
# print(predictor)

## Step 4: Run Inference

In [187]:
#Add necessary modules path to sys.path
import os, sys

demo_data_path = os.path.join(os.getcwd(), "code/handler")
if demo_data_path not in sys.path:
    sys.path.append(demo_data_path)

In [188]:
#Install dependencies
!pip install torch dataclasses_json



### 4.1 Open Session 1

In [189]:
image_url="https://images.pexels.com/photos/1519753/pexels-photo-1519753.jpeg"

In [191]:
%%time
from data_types import (
    BaseRequest,
    CloseSessionRequest,
    StartSessionRequest,
    TextPromptRequest,
    OpenSessionResponse,
    TextPromptResponse,
    CloseSessionResponse
)

ts_request_sequence_id = "SessionId"


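def send_and_check_request(r, seq_id):
    """Send a request on the given session and return the first line of the response body."""
    response = smr.invoke_endpoint(
        EndpointName=endpoint_name,
        Body=r.to_json(),
        ContentType="application/json",
        SessionId=seq_id,  # "NEW_SESSION" opens a session; otherwise pass an existing session ID
    )
    assert response["ResponseMetadata"]["HTTPStatusCode"] == 200, f"Sending request failed: {r}"
    return response['Body'].readlines()[0]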

open_request = StartSessionRequest(
    type="start_session",
    path=image_url,
)

open_response = send_and_check_request(open_request, "NEW_SESSION")
open_response = OpenSessionResponse.from_json(open_response)
print(open_response)
assert open_response.session_id.startswith("ts-seq-")

OpenSessionResponse(session_id='ts-seq-387e7c95-e46c-43d2-9bca-d1d4d3ac071a')
CPU times: user 14.2 ms, sys: 7.74 ms, total: 21.9 ms
Wall time: 1.17 s


In [192]:
open_response.session_id

'ts-seq-387e7c95-e46c-43d2-9bca-d1d4d3ac071a'
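
Each request in this flow is a small JSON document whose `type` field selects the action. The sketch below uses stand-in dataclasses to illustrate the bodies sent over the wire; the real request classes live in `code/handler/data_types.py` and may carry additional fields, so `StartSession`, `TextPrompt`, `CloseSession`, and `to_body` here are hypothetical names used only for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class StartSession:
    path: str                      # URL of the image to load into the session
    type: str = "start_session"


@dataclass
class TextPrompt:
    session_id: str                # ID returned when the session was opened
    prompt_text: str
    type: str = "send_text_prompt"


@dataclass
class CloseSession:
    session_id: str
    type: str = "close_session"


def to_body(request) -> str:
    """Serialize a request dataclass to the JSON string sent as the request body."""
    return json.dumps(asdict(request))


print(to_body(StartSession(path="https://example.com/image.jpg")))
print(to_body(TextPrompt(session_id="ts-seq-example", prompt_text="describe the picture")))
print(to_body(CloseSession(session_id="ts-seq-example")))
```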

### 4.2 Send Text Prompt 1

In [193]:
%%time
text_prompt_request1 = TextPromptRequest(
    type="send_text_prompt",
    session_id=open_response.session_id,
    prompt_text="describe the picture"
)

text_prompt_response1 = send_and_check_request(text_prompt_request1, open_response.session_id)
text_prompt_response1 = TextPromptResponse.from_json(text_prompt_response1)
print(text_prompt_response1.response_text)
assert text_prompt_response1.response_text

end_header_id|>

This aerial image presents a stunning bird's-eye view of a tropical island, showcasing a lush forest, a small house, and a vibrant turquoise sea.

The island's forest is characterized by a diverse array of green trees, with a few palm trees scattered throughout. A small, white house with a gray roof is nestled among the trees, accompanied by a smaller, red-roofed structure to its left. The shoreline is marked by a rocky area, where the sea meets the land, and the water's edge is dotted with large rocks and boulders.

The sea itself is a brilliant turquoise hue, with a subtle gradient of lighter shades towards the center. The overall atmosphere of the image exudes a sense of serenity and tranquility, evoking a peaceful and idyllic setting.<|eot_id|>
CPU times: user 4.95 ms, sys: 185 μs, total: 5.13 ms
Wall time: 7.91 s
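
The response above still contains chat-template markers: a dangling `end_header_id|>` fragment at the start and `<|eot_id|>` at the end. A minimal sketch of a client-side cleanup is shown below, assuming markers of the form `<|name|>`. `clean_response` is a hypothetical helper; trimming these tokens in the handler would be an alternative.

```python
import re

# Complete special tokens such as <|eot_id|> or <|end_header_id|>.
_SPECIAL_TOKEN = re.compile(r"<\|[a-z_]+\|>")
# A token whose leading "<|" was cut off, e.g. "end_header_id|>" at the very start.
_LEADING_FRAGMENT = re.compile(r"^\s*[a-z_]+\|>")


def clean_response(text: str) -> str:
    """Strip chat-template markers and surrounding whitespace from a model response."""
    text = _LEADING_FRAGMENT.sub("", text)
    text = _SPECIAL_TOKEN.sub("", text)
    return text.strip()


print(clean_response("end_header_id|>\n\nThere is no mountain in the picture.<|eot_id|>"))
```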


### 4.3 Send Text Prompt 2

In [194]:
%%time
text_prompt_request2 = TextPromptRequest(
    type="send_text_prompt",
    session_id=open_response.session_id,
    prompt_text="is there a mountain in the picture, describe it"
)

text_prompt_response2 = send_and_check_request(text_prompt_request2, open_response.session_id)
text_prompt_response2 = TextPromptResponse.from_json(text_prompt_response2)
print(text_prompt_response2.response_text)
assert text_prompt_response2.response_text

end_header_id|>

There is no mountain in the picture. The image shows a tropical island with a rocky shoreline and a dense forest of palm trees and other vegetation. The water is a bright blue color, indicating that it is likely a tropical or subtropical region. The overall atmosphere of the image suggests a warm and sunny day, with the sun shining down on the island and the water.<|eot_id|>
CPU times: user 4.92 ms, sys: 0 ns, total: 4.92 ms
Wall time: 3.65 s


### 4.4 Close session

In [195]:
# close session
close_request = CloseSessionRequest(
    type="close_session",
    session_id=open_response.session_id,
)
    
close_response = send_and_check_request(
    close_request, open_response.session_id
)

close_response = CloseSessionResponse.from_json(close_response)
assert close_response.success
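
If a prompt cell raises before the close request runs, the session stays open on the endpoint. One way to guard against that is a context manager that always sends the close request on exit. This is a sketch only: `stateful_session` is a hypothetical helper, and the open and close actions are passed in as callables so the pattern does not depend on the exact request classes.

```python
from contextlib import contextmanager


@contextmanager
def stateful_session(open_session, close_session):
    """Open a session, yield its ID, and close it even if the body raises."""
    session_id = open_session()
    try:
        yield session_id
    finally:
        close_session(session_id)


# Hypothetical usage with the helpers defined in this notebook:
#   def _open():
#       raw = send_and_check_request(open_request, "NEW_SESSION")
#       return OpenSessionResponse.from_json(raw).session_id
#
#   def _close(sid):
#       req = CloseSessionRequest(type="close_session", session_id=sid)
#       send_and_check_request(req, sid)
#
#   with stateful_session(_open, _close) as sid:
#       ...  # send text prompts using sid
```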

In [196]:
# Clean up: delete the endpoint, its configuration, and the model to stop incurring charges
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()