# SageMaker-TGI-for-LLM
* **conda_pytorch_p310**

The following diagram provides an overview of the ML model packaging process.

- **Step 1** Storing model artifacts and serving/scoring logic
- **Step 2** Creating and pushing a container to ECR that is used to host your model on SageMaker which performs inference
- **Step 3** Validating the container which can succesfully host your model on SageMaker
- **Step 4** Packaging the ML model into a Model Package
- **Step 5** Validating this ML model package by deploying it with Amazon SageMaker 
- **Step 6** Listing the ML model in AWS Marketplace

<img src="images/ml-model-publishing-workflow.png"/>


In [1]:
install_needed = True
# install_needed = False

In [12]:
%%bash
#!/bin/bash

DAEMON_PATH="/etc/docker"
MEMORY_SIZE=10G

FLAG=$(cat $DAEMON_PATH/daemon.json | jq 'has("data-root")')
# echo $FLAG

if [ "$FLAG" == true ]; then
    echo "Already revised"
else
    echo "Add data-root and default-shm-size=$MEMORY_SIZE"
    sudo cp $DAEMON_PATH/daemon.json $DAEMON_PATH/daemon.json.bak
    sudo cat $DAEMON_PATH/daemon.json.bak | jq '. += {"data-root":"/home/ec2-user/SageMaker/.container/docker","default-shm-size":"'$MEMORY_SIZE'"}' | sudo tee $DAEMON_PATH/daemon.json > /dev/null
    sudo service docker restart
    echo "Docker Restart"
fi

sudo curl -L "https://github.com/docker/compose/releases/download/v2.7.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose

Already revised


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 24.5M  100 24.5M    0     0  44.1M      0 --:--:-- --:--:-- --:--:--  236M


In [1]:
import sys
import IPython

if install_needed:
    print("installing deps and restarting kernel")
    !{sys.executable} -m pip install --upgrade pip --quiet
    !{sys.executable} -m pip install -U sagemaker transformers huggingface_hub --quiet
    IPython.Application.instance().kernel.do_shutdown(True)

NameError: name 'install_needed' is not defined

# Start

In [2]:
%load_ext autoreload
%autoreload 2

### Model Store

In [3]:
import os
import time
import boto3
import logging

from pathlib import Path
import huggingface_hub

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
from sagemaker.pytorch.model import PyTorchModel

from sagemaker import get_execution_role
from sagemaker.session import Session

sagemaker_session = Session()

artifacts_bucket_name = sagemaker_session.default_bucket()

execution_role_arn = get_execution_role()
region = sagemaker_session.boto_region_name

os.environ['HF_HOME'] = '/home/ec2-user/SageMaker/.cache'



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


#### https://sagemaker.readthedocs.io/en/stable/doc_utils/pretrainedmodels.html

In [4]:
from sagemaker import instance_types

ref_model_id = "meta-textgeneration-llama-3-8b-instruct"
instance_type = instance_types.retrieve_default(
    model_id=ref_model_id,
    model_version="2.2.1",
    scope="inference")
print(instance_type)

Model 'meta-textgeneration-llama-3-8b-instruct' requires accepting end-user license agreement (EULA). See https://jumpstart-cache-prod-us-west-2.s3.us-west-2.amazonaws.com/fmhMetadata/eula/llama3Eula.txt for terms of use.


Using model 'meta-textgeneration-llama-3-8b-instruct' with version '2.2.1'. You can upgrade to version '2.11.2' to get the latest model specifications. Note that models may have different input/output signatures after a major version upgrade.


Using vulnerable JumpStart model 'meta-textgeneration-llama-3-8b-instruct' and version '2.2.1'.


ml.g5.12xlarge


<br>

## [**Step 1**] Preparing model artifacts
---

In [5]:
model_id='meta-llama/Meta-Llama-3.1-8B-Instruct'

model_name = model_id.split("/")[-1].lower()
model_name = model_name.replace(".", "-")
model_name

'meta-llama-3-1-8b-instruct'

In [6]:
hf_local_download_dir = Path.cwd() / model_name
hf_local_download_dir.mkdir(exist_ok=True)

huggingface_hub.snapshot_download(
    repo_id=model_id,
    revision="main",
    local_dir=hf_local_download_dir
)

Fetching 17 files:   0%|          | 0/17 [00:00<?, ?it/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

config.json:   0%|          | 0.00/855 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/184 [00:00<?, ?B/s]

USE_POLICY.md:   0%|          | 0.00/4.69k [00:00<?, ?B/s]

.gitattributes:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

README.md:   0%|          | 0.00/44.0k [00:00<?, ?B/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

consolidated.00.pth:   0%|          | 0.00/16.1G [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

params.json:   0%|          | 0.00/199 [00:00<?, ?B/s]

LICENSE:   0%|          | 0.00/7.63k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/2.18M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/55.4k [00:00<?, ?B/s]

'/home/ec2-user/SageMaker/2025/INFERENCE/on-boarding-process/meta-llama-3-1-8b-instruct'

In [7]:
!rm -rf shell && mkdir shell

In [8]:
%%writefile shell/triton_model_compression_upload.sh

cd meta-llama-3-1-8b-instruct
tar cvf - * | pigz > model.tar.gz

cd ..
sudo rm -rf compressed_model && mkdir compressed_model
mv meta-llama-3-1-8b-instruct/model.tar.gz compressed_model/

Writing shell/triton_model_compression_upload.sh


In [9]:
!sh ./shell/triton_model_compression_upload.sh

config.json
generation_config.json
LICENSE
model-00001-of-00004.safetensors
model-00002-of-00004.safetensors
model-00003-of-00004.safetensors
model-00004-of-00004.safetensors
model.safetensors.index.json
original/
original/params.json
original/consolidated.00.pth
original/tokenizer.model
README.md
special_tokens_map.json
tokenizer_config.json
tokenizer.json
USE_POLICY.md


In [10]:
compressed_model_path = f"s3://{artifacts_bucket_name}/{model_name}/compressed_model"
compressed_model_path

's3://sagemaker-us-west-2-322537213286/meta-llama-3-1-8b-instruct/compressed_model'

In [11]:
!aws s3 sync ./compressed_model/ $compressed_model_path

upload: compressed_model/model.tar.gz to s3://sagemaker-us-west-2-322537213286/meta-llama-3-1-8b-instruct/compressed_model/model.tar.gz


<br>

## [**Step 2**] Creating and pushing a container to ECR
---

In [12]:
image_uri = get_huggingface_llm_image_uri(
  backend="huggingface", # or lmi
  region=region
)
image_uri

'763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi3.0.1-gpu-py311-cu124-ubuntu22.04-v2.1'

In [50]:
account = sagemaker.Session().account_id()
ecr_image_uri = image_uri.replace("763104351884", account)
ecr_image_uri

'322537213286.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi3.0.1-gpu-py311-cu124-ubuntu22.04-v2.1'

In [51]:
!rm -rf docker && mkdir docker

In [52]:
%%writefile docker/sagemaker-entrypoint.sh
#!/bin/bash

if [[ -z "${HF_MODEL_ID}" ]]; then
  echo "HF_MODEL_ID must be set"
  exit 1
fi
export MODEL_ID="${HF_MODEL_ID}"

if [[ -n "${HF_MODEL_REVISION}" ]]; then
  export REVISION="${HF_MODEL_REVISION}"
fi

if [[ -n "${SM_NUM_GPUS}" ]]; then
    NUM_SHARD="${SM_NUM_GPUS}"
else
    NUM_SHARD=$(nvidia-smi --list-gpus | wc -l)
fi

export NUM_SHARD


if [[ -n "${HF_MODEL_QUANTIZE}" ]]; then
  export QUANTIZE="${HF_MODEL_QUANTIZE}"
fi

if [[ -n "${HF_MODEL_TRUST_REMOTE_CODE}" ]]; then
  export TRUST_REMOTE_CODE="${HF_MODEL_TRUST_REMOTE_CODE}"
fi

text-generation-launcher --port 8080


Writing docker/sagemaker-entrypoint.sh


In [53]:
%%writefile docker/Dockerfile

# FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.3.0-tgi2.2.0-gpu-py310-cu121-ubuntu22.04-v2.0
FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi3.0.1-gpu-py311-cu124-ubuntu22.04-v2.1

ENV HF_MODEL_ID "/opt/ml/model"
ENV HF_MODEL_QUANTIZE "bitsandbytes"
ENV HF_MODEL_TRUST_REMOTE_CODE "true"
# ENV SM_NUM_GPUS "4"

COPY sagemaker-entrypoint.sh entrypoint.sh
RUN chmod +x entrypoint.sh

ENTRYPOINT ["./entrypoint.sh"]

Writing docker/Dockerfile


In [54]:
%%writefile docker/build_and_push.sh

original_image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi3.0.1-gpu-py311-cu124-ubuntu22.04-v2.1"

algorithm_name="huggingface-pytorch-tgi-inference"

cd docker

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

target_image_uri="${account}.dkr.ecr.us-west-2.amazonaws.com/${algorithm_name}:2.4.0-tgi3.0.1-gpu-py311-cu124-ubuntu22.04-v2.1"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1


if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin "763104351884.dkr.ecr.us-west-2.amazonaws.com"

# docker pull $original_image_uri
# docker image tag $original_image_uri $target_image_uri

docker build -f Dockerfile -t ${target_image_uri} .

# Get the login command from ECR and execute it directly
aws ecr get-login-password --region ${region}|docker login --username AWS --password-stdin ${target_image_uri}

docker push ${target_image_uri}

Writing docker/build_and_push.sh


In [55]:
!sh ./docker/build_and_push.sh > /dev/null 2>&1

<br>

## [**Step 3**] Validating the container for hosting your model on SageMaker
---

SageMaker 호스팅 엔드포인트로 배포하기 전에 로컬 모드 엔드포인트로 배포할 수 있습니다. 로컬 모드는 현재 개발 중인 환경에서 도커 컨테이너를 실행하여 SageMaker 프로세싱/훈련/추론 작업을 에뮬레이트할 수 있습니다. 추론 작업의 경우는 Amazon ECR의 딥러닝 프레임워크 기반 추론 컨테이너를 로컬로 가져오고(docker pull) 컨테이너를 실행하여(docker run) 모델 서버를 시작합니다.


### SageMaker Endpoint (Local Mode)

로컬 모드는 필수로 수행할 필요는 없지만, 디버깅에 많은 도움이 됩니다. 또한, 로컬 모드 사용 시에는 모델을 S3에 반드시 업로드할 필요 없이 로컬 디렉터리에서도 로드할 수 있습니다. (`container` 변수 참조)

In [56]:
import boto3
import time
import json


# Set to True to enable SageMaker to run locally
local_mode = True
# local_mode = False
if local_mode:
    from sagemaker.local import LocalSession
    instance_type = "local_gpu"
    sm_session = LocalSession()
    sm_session.config = {'local': {'local_code': True}}
    sm_client = sagemaker.local.LocalSagemakerClient()
    smr_client = sagemaker.local.LocalSagemakerRuntimeClient()
    model_data=f"file://{Path.cwd()}/{model_name}"
else:
    instance_type = "ml.g5.12xlarge"
    sm_session = sagemaker.Session()
    sm_client = boto3.client("sagemaker")
    smr_client = boto3.client("sagemaker-runtime")
    model_data = f"{compressed_model_path}/model.tar.gz"

instance_count = 1
ts = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
sm_model_name = f"{model_name}-{ts}"
endpoint_config_name = f"{model_name}-endpoint-config-{ts}"
endpoint_name = f"{model_name}-endpoint-{ts}"
model_data

print(f'--- SageMaker Model Name: {sm_model_name}')
print(f'--- Endpoint Config Name: {endpoint_config_name}')     
print(f'--- Endpoint Name: {endpoint_name}')
print(f'--- Model Data: {model_data}')

--- SageMaker Model Name: meta-llama-3-1-8b-instruct-2025-03-30-13-43-28
--- Endpoint Config Name: meta-llama-3-1-8b-instruct-endpoint-config-2025-03-30-13-43-28
--- Endpoint Name: meta-llama-3-1-8b-instruct-endpoint-2025-03-30-13-43-28
--- Model Data: file:///home/ec2-user/SageMaker/2025/INFERENCE/on-boarding-process/meta-llama-3-1-8b-instruct


In [57]:
# env_var = {
#     'HF_MODEL_ID': "/opt/ml/model",
#     'SM_NUM_GPUS':'4',
#     'HF_MODEL_QUANTIZE':'bitsandbytes',
#     'HF_MODEL_TRUST_REMOTE_CODE' : 'true'
# }

env_var = {
}

container = {
    "Image": ecr_image_uri,
    "ModelDataUrl": model_data,
    "Environment": env_var
}

In [58]:
create_model_response = sm_client.create_model(
    ModelName=sm_model_name, 
    ExecutionRoleArn=execution_role_arn, 
    PrimaryContainer=container,
)

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": instance_type,
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
            'ModelDataDownloadTimeoutInSeconds': 300,
            'ContainerStartupHealthCheckTimeoutInSeconds': 300,
            
        },
    ],
)
#print("Model Arn: " + create_model_response["ModelArn"])

In [59]:
!docker ps

CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES


In [60]:
!docker kill 162db8246dcf

Error response from daemon: Cannot kill container: 162db8246dcf: container 162db8246dcfd79d796765ab39f9f4d227e267075b7e9983d5012f5d59e0cf75 is not running


In [None]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, 
    EndpointConfigName=endpoint_config_name
)

Attaching to pd4ifaupem-algo-1-2yipb
pd4ifaupem-algo-1-2yipb  | [2m2025-03-30T13:43:44.626581Z[0m [32m INFO[0m [2mtext_generation_launcher[0m[2m:[0m Args {
pd4ifaupem-algo-1-2yipb  |     model_id: "/opt/ml/model",
pd4ifaupem-algo-1-2yipb  |     revision: None,
pd4ifaupem-algo-1-2yipb  |     validation_workers: 2,
pd4ifaupem-algo-1-2yipb  |     sharded: None,
pd4ifaupem-algo-1-2yipb  |     num_shard: Some(
pd4ifaupem-algo-1-2yipb  |         1,
pd4ifaupem-algo-1-2yipb  |     ),
pd4ifaupem-algo-1-2yipb  |     quantize: Some(
pd4ifaupem-algo-1-2yipb  |         Bitsandbytes,
pd4ifaupem-algo-1-2yipb  |     ),
pd4ifaupem-algo-1-2yipb  |     speculate: None,
pd4ifaupem-algo-1-2yipb  |     dtype: None,
pd4ifaupem-algo-1-2yipb  |     kv_cache_dtype: None,
pd4ifaupem-algo-1-2yipb  |     trust_remote_code: true,
pd4ifaupem-algo-1-2yipb  |     max_concurrent_requests: 128,
pd4ifaupem-algo-1-2yipb  |     max_best_of: 2,
pd4ifaupem-algo-1-2yipb  |     max_stop_sequences: 4,
pd4ifaupem-algo-1-

pd4ifaupem-algo-1-2yipb  | [2m2025-03-30T13:43:48.140288Z[0m [32m INFO[0m [2mtext_generation_launcher[0m[2m:[0m Files are already present on the host. Skipping download.
pd4ifaupem-algo-1-2yipb  | [2m2025-03-30T13:43:48.573660Z[0m [32m INFO[0m [1mdownload[0m: [2mtext_generation_launcher[0m[2m:[0m Successfully downloaded weights for /opt/ml/model
pd4ifaupem-algo-1-2yipb  | [2m2025-03-30T13:43:48.573960Z[0m [32m INFO[0m [1mshard-manager[0m: [2mtext_generation_launcher[0m[2m:[0m Starting shard [2m[3mrank[0m[2m=[0m0[0m


pd4ifaupem-algo-1-2yipb  | [2m2025-03-30T13:43:50.884336Z[0m [32m INFO[0m [2mtext_generation_launcher[0m[2m:[0m Using prefix caching = True
pd4ifaupem-algo-1-2yipb  | [2m2025-03-30T13:43:50.884389Z[0m [32m INFO[0m [2mtext_generation_launcher[0m[2m:[0m Using Attention = flashinfer


pd4ifaupem-algo-1-2yipb  | [2m2025-03-30T13:43:58.589420Z[0m [32m INFO[0m [1mshard-manager[0m: [2mtext_generation_launcher[0m[2m:[0m Waiting for shard to be ready... [2m[3mrank[0m[2m=[0m0[0m


pd4ifaupem-algo-1-2yipb  | [2m2025-03-30T13:44:08.598646Z[0m [32m INFO[0m [1mshard-manager[0m: [2mtext_generation_launcher[0m[2m:[0m Waiting for shard to be ready... [2m[3mrank[0m[2m=[0m0[0m


pd4ifaupem-algo-1-2yipb  | [2m2025-03-30T13:44:18.608128Z[0m [32m INFO[0m [1mshard-manager[0m: [2mtext_generation_launcher[0m[2m:[0m Waiting for shard to be ready... [2m[3mrank[0m[2m=[0m0[0m


pd4ifaupem-algo-1-2yipb  | [2m2025-03-30T13:44:28.617612Z[0m [32m INFO[0m [1mshard-manager[0m: [2mtext_generation_launcher[0m[2m:[0m Waiting for shard to be ready... [2m[3mrank[0m[2m=[0m0[0m


pd4ifaupem-algo-1-2yipb  | [2m2025-03-30T13:44:38.626444Z[0m [32m INFO[0m [1mshard-manager[0m: [2mtext_generation_launcher[0m[2m:[0m Waiting for shard to be ready... [2m[3mrank[0m[2m=[0m0[0m


pd4ifaupem-algo-1-2yipb  | [2m2025-03-30T13:44:43.064370Z[0m [32m INFO[0m [2mtext_generation_launcher[0m[2m:[0m Using prefill chunking = True
pd4ifaupem-algo-1-2yipb  | [2m2025-03-30T13:44:43.436020Z[0m [32m INFO[0m [2mtext_generation_launcher[0m[2m:[0m Server started at unix:///tmp/text-generation-server-0
pd4ifaupem-algo-1-2yipb  | [2m2025-03-30T13:44:43.530807Z[0m [32m INFO[0m [1mshard-manager[0m: [2mtext_generation_launcher[0m[2m:[0m Shard ready in 54.949727443s [2m[3mrank[0m[2m=[0m0[0m
pd4ifaupem-algo-1-2yipb  | [2m2025-03-30T13:44:43.617329Z[0m [32m INFO[0m [2mtext_generation_launcher[0m[2m:[0m Starting Webserver
pd4ifaupem-algo-1-2yipb  | [2m2025-03-30T13:44:43.733578Z[0m [32m INFO[0m [2mtext_generation_router_v3[0m[2m:[0m [2mbackends/v3/src/lib.rs[0m[2m:[0m[2m125:[0m Warming up model
pd4ifaupem-algo-1-2yipb  | [2m2025-03-30T13:44:43.757194Z[0m [32m INFO[0m [2mtext_generation_launcher[0m[2m:[0m Using optimized Triton 

pd4ifaupem-algo-1-2yipb  | [2m2025-03-30T13:44:51.787930Z[0m [32m INFO[0m [2mtext_generation_launcher[0m[2m:[0m KV-cache blocks: 64210, size: 1
pd4ifaupem-algo-1-2yipb  | [2m2025-03-30T13:44:52.040248Z[0m [32m INFO[0m [2mtext_generation_launcher[0m[2m:[0m Cuda Graphs are disabled (CUDA_GRAPHS=None).
pd4ifaupem-algo-1-2yipb  | [2m2025-03-30T13:44:52.040659Z[0m [32m INFO[0m [2mtext_generation_router_v3[0m[2m:[0m [2mbackends/v3/src/lib.rs[0m[2m:[0m[2m137:[0m Setting max batch total tokens to 64210
pd4ifaupem-algo-1-2yipb  | [2m2025-03-30T13:44:52.040680Z[0m [33m WARN[0m [2mtext_generation_router_v3::backend[0m[2m:[0m [2mbackends/v3/src/backend.rs[0m[2m:[0m[2m39:[0m Model supports prefill chunking. `waiting_served_ratio` and `max_waiting_tokens` will be ignored.
pd4ifaupem-algo-1-2yipb  | [2m2025-03-30T13:44:52.040706Z[0m [32m INFO[0m [2mtext_generation_router_v3[0m[2m:[0m [2mbackends/v3/src/lib.rs[0m[2m:[0m[2m166:[0m Using backend 

In [None]:
!docker ps

### Inference Test

In [None]:
prompt = "The diamondback terrapin or simply terrapin is a species of turtle native to the brackish coastal tidal marshes of the"
response = "The diamondback terrapin or simply terrapin is a species of turtle native to the brackish coastal tidal marshes of the east coast."

sample_input = {
    "inputs": prompt,
    "parameters": {
        "max_tokens":256,
        "top_p": 0.9,
        "temperature": 0.6,
        "max_tokens": 512,
        "stop": ["<|eot_id|>"]
    }
}

In [None]:
%%time
response = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Accept="application/json",
    ContentType="application/json",
    Body=json.dumps(sample_input)
)
data = response["Body"].read()
output = json.loads(data)
output[0]['generated_text']

In [28]:
!docker ps

CONTAINER ID   IMAGE                                                                                                                            COMMAND                  CREATED          STATUS          PORTS                                       NAMES
43c5c1691597   322537213286.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.3.0-tgi2.2.0-gpu-py310-cu121-ubuntu22.04-v2.0   "./entrypoint.sh ser…"   55 seconds ago   Up 54 seconds   0.0.0.0:8080->8080/tcp, :::8080->8080/tcp   e1j29wus72-algo-1-33w75


In [30]:
!docker kill 43c5c1691597

e1j29wus72-algo-1-33w75 exited with code 137
Aborting on container exit...
43c5c1691597


Exception in thread Thread-7:
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/sagemaker/local/image.py", line 955, in run
    _stream_output(self.process)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/sagemaker/local/image.py", line 1021, in _stream_output
    raise RuntimeError(f"Failed to run: {process.args}. Process exited with code: {exit_code}")
RuntimeError: Failed to run: ['docker-compose', '-f', '/tmp/tmpaj3ui2vo/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit']. Process exited with code: 137

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/sagemaker/local/image.py", line 960, in run
    raise RuntimeEr

### Validating the container in SageMaker Endpoint

In [31]:
import boto3
import time
import json


# Set to True to enable SageMaker to run locally
local_mode = False

if local_mode:
    from sagemaker.local import LocalSession
    instance_type = "local_gpu"
    sm_session = LocalSession()
    sm_session.config = {'local': {'local_code': True}}
    sm_client = sagemaker.local.LocalSagemakerClient()
    smr_client = sagemaker.local.LocalSagemakerRuntimeClient()
    model_data=f"file://{Path.cwd()}/{model_name}"
else:
    instance_type = "ml.g5.12xlarge" ###### instance type
    
    sm_session = sagemaker.Session()
    sm_client = boto3.client("sagemaker")
    smr_client = boto3.client("sagemaker-runtime")
    model_data = f"{compressed_model_path}/model.tar.gz"

instance_count = 1
ts = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
sm_model_name = f"{model_name}-{ts}"
endpoint_config_name = f"{model_name}-endpoint-config-{ts}"
endpoint_name = f"{model_name}-endpoint-{ts}"

print(f'--- SageMaker Model Name: {sm_model_name}')
print(f'--- Endpoint Config Name: {endpoint_config_name}')     
print(f'--- Endpoint Name: {endpoint_name}')
print(f'--- Model Data: {model_data}')


--- SageMaker Model Name: meta-llama-3-1-8b-instruct-2024-08-15-03-10-33
--- Endpoint Config Name: meta-llama-3-1-8b-instruct-endpoint-config-2024-08-15-03-10-33
--- Endpoint Name: meta-llama-3-1-8b-instruct-endpoint-2024-08-15-03-10-33
--- Model Data: s3://sagemaker-us-west-2-322537213286/meta-llama-3-1-8b-instruct/compressed_model/model.tar.gz


In [32]:
env_var = {}

container = {
    "Image": ecr_image_uri,
    "ModelDataUrl": model_data,
    "Environment": env_var
}

In [33]:
create_model_response = sm_client.create_model(
    ModelName=sm_model_name, 
    ExecutionRoleArn=execution_role_arn, 
    PrimaryContainer=container,
)

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": instance_type,
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
            'ModelDataDownloadTimeoutInSeconds': 300,
            'ContainerStartupHealthCheckTimeoutInSeconds': 300
            
        },
    ]
)

print("Model Arn: " + create_model_response["ModelArn"])
print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Model Arn: arn:aws:sagemaker:us-west-2:322537213286:model/meta-llama-3-1-8b-instruct-2024-08-15-03-10-33
Endpoint Config Arn: arn:aws:sagemaker:us-west-2:322537213286:endpoint-config/meta-llama-3-1-8b-instruct-endpoint-config-2024-08-15-03-10-33


In [34]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, 
    EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

Endpoint Arn: arn:aws:sagemaker:us-west-2:322537213286:endpoint/meta-llama-3-1-8b-instruct-endpoint-2024-08-15-03-10-33


In [35]:
from IPython.display import display, HTML
def make_console_link(region, endpoint_name, task='[SageMaker LLM Serving]'):
    endpoint_link = f'<b> {task} <a target="blank" href="https://console.aws.amazon.com/sagemaker/home?region={region}#/endpoints/{endpoint_name}">Check Endpoint Status</a></b>'   
    return endpoint_link

endpoint_link = make_console_link(region, endpoint_name)
display(HTML(endpoint_link))

In [36]:
resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(30)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-west-2:322537213286:endpoint/meta-llama-3-1-8b-instruct-endpoint-2024-08-15-03-10-33
Status: InService


In [37]:
prompt = "The diamondback terrapin or simply terrapin is a species of turtle native to the brackish coastal tidal marshes of the"
response = "The diamondback terrapin or simply terrapin is a species of turtle native to the brackish coastal tidal marshes of the east coast."

sample_input = {
    "inputs": prompt,
    "parameters": {
        "max_tokens":256,
        "top_p": 0.9,
        "temperature": 0.6,
        "max_tokens": 512,
        "stop": ["<|eot_id|>"]
    }
}

In [38]:
%%time
response = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Accept="application/json",
    ContentType="application/json",
    Body=json.dumps(sample_input)
)
data = response["Body"].read()
output = json.loads(data)
output[0]['generated_text']

CPU times: user 12.7 ms, sys: 0 ns, total: 12.7 ms
Wall time: 7.86 s


'The diamondback terrapin or simply terrapin is a species of turtle native to the brackish coastal tidal marshes of the southeastern United States and northeastern Mexico. It is the only species of turtle that lives in the brackish waters of the southeastern United States.\nThe diamondback terrapin is a medium-sized turtle that has a distinctive diamond-shaped marking on its shell. It is a carnivorous turtle that feeds on a variety of prey including crabs, shrimp, and fish.\nDiamondback terrapins are an important part of their ecosystem, serving as both predators and prey for other animals. They are also an important'

### Clean up

In [39]:
def delete_endpoint(client, endpoint_name):
    response = client.describe_endpoint(EndpointName=endpoint_name)
    EndpointConfigName = response['EndpointConfigName']
    
    response = client.describe_endpoint_config(EndpointConfigName=EndpointConfigName)
    model_name = response['ProductionVariants'][0]['ModelName']
    
    client.delete_model(ModelName=model_name)    
    client.delete_endpoint_config(EndpointConfigName=EndpointConfigName) 
    client.delete_endpoint(EndpointName=endpoint_name)
   
    print(f'--- Deleted model: {model_name}')
    print(f'--- Deleted endpoint_config: {EndpointConfigName}')     
    print(f'--- Deleted endpoint: {endpoint_name}')

In [40]:
delete_endpoint(sm_client, endpoint_name)

--- Deleted model: meta-llama-3-1-8b-instruct-2024-08-15-03-10-33
--- Deleted endpoint_config: meta-llama-3-1-8b-instruct-endpoint-config-2024-08-15-03-10-33
--- Deleted endpoint: meta-llama-3-1-8b-instruct-endpoint-2024-08-15-03-10-33


<br>

## [**Step 4**] Packaging the ML model into a Model Package
---
이 **step**에서는 아티팩트(ECR 이미지 및 학습된 모델 아티팩트)를 ModelPackage로 패키징하는 방법을 살펴봅니다. 이 작업을 완료하면 AWS 마켓플레이스에서 제품을 사전 학습된 모델로 등록할 수 있습니다.

**Note:** 모델을 여러 하드웨어 유형(CPU/GPU/Inferentia)에 배포할 수 있는 경우, 일반적으로 사용되는 컨테이너 이미지가 각각 다르기 때문에 각각에 대해 모델패키지를 생성하고 MP 목록에 다른 버전으로 추가해야 합니다.  

### 모델 패키지 사전 준비
모델 패키지는 추론에 필요한 모든 요소를 패키지로 묶은 모델 아티팩트에 대한 재사용 가능한 추상화 형태입니다. 이는 모델 데이터 위치(선택 사항)와 함께 사용할 추론 이미지를 정의하는 추론 사양으로 구성됩니다. ModelPackage는 AWS 마켓플레이스에 판매자로 등록할 AWS 계정에서 생성해야 합니다.

In [41]:
import os
strPythonPath = !which python
strValidatePath = os.path.join(strPythonPath[0].rsplit("/", 2)[0], "lib/python3.10/site-packages/botocore/validate.py")
print ("vi " + strValidatePath)

vi /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/botocore/validate.py


<div class="alert alert-info"> <strong> Note </strong>
모델패키지를 생성할 때 아래와 같은 오류가 발생할 수 있습니다:

```
~/anaconda3/envs/python3/lib/python3.8/site-packages/botocore/validate.py in serialize_to_request(self, parameters, operation_model)
    380             if report.has_errors():
--> 381                raise ParamValidationError(report=report.generate_report())
    382         return self._serializer.serialize_to_request(
    383             parameters, operation_model

ParamValidationError: Parameter validation failed:
Invalid length for parameter ValidationSpecification.ValidationProfiles, value: 0, valid min length: 1
```

이 이슈를 해결하기 위해 다음 경로의 `~/anaconda3/envs/python3/lib/python3.8/site-packages/botocore/validate.py`에서 아래 코드를 제거하거나 코멘트 처리가 필요합니다. 경로는 노트북의 상황에 따라 변경될 수 있으므로 바로 위의 cell 코드의 수행 결과에서 정확한 위치를 파악하시기 바랍니다.
    
```
380 if report.has_errors():
381                 raise ParamValidationError(report=report.generate_report())
```

커널을 재시작한 다음, [**Step4**]의 아래 부터 재시작을 합니다.


</div>

In [42]:
import os
import time
import boto3
import logging

from pathlib import Path
import huggingface_hub

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
from sagemaker.pytorch.model import PyTorchModel

from sagemaker import get_execution_role
from sagemaker.session import Session

sagemaker_session = Session()

artifacts_bucket_name = sagemaker_session.default_bucket()

execution_role_arn = get_execution_role()
region = sagemaker_session.boto_region_name

s3_client = sagemaker_session.boto_session.client("s3")
sm_runtime = boto3.client("sagemaker-runtime")

In [43]:
model_id='meta-llama/Meta-Llama-3.1-8B-Instruct'

model_name = model_id.split("/")[-1].lower()
model_name = model_name.replace(".", "-")
model_name

'meta-llama-3-1-8b-instruct'

In [44]:
compressed_model_path = f"s3://{artifacts_bucket_name}/{model_name}/compressed_model"
compressed_model_path

's3://sagemaker-us-west-2-322537213286/meta-llama-3-1-8b-instruct/compressed_model'

In [45]:
model_data = f"{compressed_model_path}/model.tar.gz"
model_data

's3://sagemaker-us-west-2-322537213286/meta-llama-3-1-8b-instruct/compressed_model/model.tar.gz'

In [46]:
image_uri = get_huggingface_llm_image_uri(
  backend="huggingface", # or lmi
  region=region
)
account = sagemaker.Session().account_id()
ecr_image_uri = image_uri.replace("763104351884", account)
ecr_image_uri

'322537213286.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.3.0-tgi2.2.0-gpu-py310-cu121-ubuntu22.04-v2.0'

### 모델 패키지 생성
모델 패키지 생성 프로세스에서는 다음을 지정해야 합니다:
  1. 도커 이미지
  2. 모델 아티팩트
    - tar.gz 형태로 압축된 모델 아티팩트가 제공되어야 합니다.
        
판매자(및 구매자)에게 Amazon SageMaker에서 제품이 작동한다는 확신을 주기 위해, AWS Marketplace에 제품을 리스팅하기 전에 SageMaker는 기본적인 유효성 검사를 위와 같이 진행하였습니다. 이 유효성 검사 프로세스가 성공해야만 제품을 AWS Marketplace에 리스팅할 수 있습니다. 이 유효성 검사 프로세스는 사용자가 제공한 유효성 검사 프로필과 샘플 데이터를 사용하여 모델을 사용하여 계정에서 변환 작업을 생성하여 추론 이미지가 SageMaker에서 작동하는지 확인합니다.

다음으로, ML 모델에 적합한 인스턴스 크기를 식별해야 하며, ML 모델 위에서 성능 테스트를 실행하여 이를 확인할 수 있습니다.

**Note:** 모델 튜닝 외에도 인스턴스 유형을 식별할 때 모델의 요구 사항을 고려해야 합니다.  모델이 GPU 리소스를 사용하지 않는 경우 GPU 인스턴스 유형을 포함하지 마세요. 마찬가지로 모델이 GPU 리소스를 사용하지만 단일 GPU만 사용할 수 있는 경우, 여러 개의 GPU가 있는 인스턴스 유형을 포함하지 마세요. 성능상의 이점은 없이 사용자의 인프라 요금만 증가시킬 수 있기 때문입니다.

### 테스트용 데이터 만들기

In [47]:
prompt = "The diamondback terrapin or simply terrapin is a species of turtle native to the brackish coastal tidal marshes of the"
response = "The diamondback terrapin or simply terrapin is a species of turtle native to the brackish coastal tidal marshes of the east coast."

sample_input = {
    "inputs": prompt,
    "parameters": {
        "max_tokens":256,
        "top_p": 0.9,
        "temperature": 0.6,
        "max_tokens": 512,
        "stop": ["<|eot_id|>"]
    }
}

In [48]:
import json
json_line = json.dumps(sample_input)
s3_client.put_object(Bucket=artifacts_bucket_name, Key=f"{model_name}/validation-input-json/input.jsonl", Body=json_line)

{'ResponseMetadata': {'RequestId': 'NPA6EEYAV3C816GR',
  'HostId': '2unFu6ytFjRKnqOz40TUy7wSTObt2JyudBQPn/21M7Z55ZDNvUtqbFiyVn2lEOB/k9v/+NOaE7I=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': '2unFu6ytFjRKnqOz40TUy7wSTObt2JyudBQPn/21M7Z55ZDNvUtqbFiyVn2lEOB/k9v/+NOaE7I=',
   'x-amz-request-id': 'NPA6EEYAV3C816GR',
   'date': 'Thu, 15 Aug 2024 04:14:11 GMT',
   'x-amz-version-id': 'ot_UI5ZMQ84F.jiFIm1HSfWjzYeW616a',
   'x-amz-server-side-encryption': 'AES256',
   'etag': '"599e5a9aa81aa0f79b0fdf1d064e7621"',
   'server': 'AmazonS3',
   'content-length': '0'},
  'RetryAttempts': 0},
 'ETag': '"599e5a9aa81aa0f79b0fdf1d064e7621"',
 'ServerSideEncryption': 'AES256',
 'VersionId': 'ot_UI5ZMQ84F.jiFIm1HSfWjzYeW616a'}

In [49]:
validation_file_name = "input.jsonl"
validation_input_path = f"s3://{artifacts_bucket_name}/{model_name}/validation-input-json/"
validation_output_path = f"s3://{artifacts_bucket_name}/{model_name}/validation-output-jsonl/"
validation_input_path

's3://sagemaker-us-west-2-322537213286/meta-llama-3-1-8b-instruct/validation-input-json/'

### 패키지 생성

In [50]:
instance_count = 1
ts = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
sm_model_name = f"{model_name}-{ts}"

print(f'--- SageMaker Model Name: {sm_model_name}')

# Define parameters
model_description = "marketplace-model-test" #"<<YourModelDescription>>"

# <<YourSupportedContentTypes>>
supported_content_types = ["application/json"] #["text/csv", "application/json", "application/json", "application/jsonlines"]

# <<YourSupportedResponseMIMETypes>>
supported_response_MIME_types = [ 
    "application/json",
]

supported_realtime_inference_instance_types = ["ml.g5.2xlarge", "ml.g5.4xlarge", "ml.g5.12xlarge", "ml.g5.16xlarge", "ml.g5.24xlarge","ml.g5.48xlarge"]
supported_batch_transform_instance_types = ["ml.g5.2xlarge"] #  Don't use batch transform. And, the Batch Transform validation step is not required

--- SageMaker Model Name: meta-llama-3-1-8b-instruct-2024-08-15-04-14-13


In [51]:
model_package = sagemaker_session.sagemaker_client.create_model_package(
    ModelPackageName=sm_model_name,
    ModelPackageDescription=model_description,
    InferenceSpecification={
        "Containers": [
            {
                "Image": ecr_image_uri,
                "ModelDataUrl": model_data
            }
        ],
        "SupportedTransformInstanceTypes": supported_batch_transform_instance_types,
        "SupportedRealtimeInferenceInstanceTypes": supported_realtime_inference_instance_types,
        "SupportedContentTypes": supported_content_types,
        "SupportedResponseMIMETypes": supported_response_MIME_types,
    },
    CertifyForMarketplace=True,  # Make sure to set this to True
   ValidationSpecification={
        'ValidationRole': execution_role_arn,
        'ValidationProfiles': [
            {
                'ProfileName': "validation",
                'TransformJobDefinition': {
                    'MaxConcurrentTransforms': 1,
                    'MaxPayloadInMB': 64,
                    'BatchStrategy': 'SingleRecord',
                    'TransformInput': {
                        'DataSource': {
                            'S3DataSource': {
                                'S3DataType': 'S3Prefix',
                                'S3Uri': f'{validation_input_path}input.jsonl'
                            }
                        },
                        'ContentType': 'application/json',
                        'CompressionType': 'None',
                        'SplitType': 'None'
                    },
                    'TransformOutput': {
                        'S3OutputPath': f'{validation_output_path}output.json',
                        'Accept': 'application/json',
                        'AssembleWith': 'None',
                    },
                    'TransformResources': {
                        'InstanceType': supported_batch_transform_instance_types[0],
                        'InstanceCount': 1,
                    }
                }
            },
        ]
    },
)

In [52]:
model_package_list = []
sm_client = boto3.client("sagemaker")
model_list_pack = sm_client.list_model_packages()
model_package_list = model_list_pack['ModelPackageSummaryList']
NextToken = model_list_pack.get('NextToken')

while True:
    if model_list_pack.get('NextToken'):
        NextToken = model_list_pack.get('NextToken')
        model_list_pack = sm_client.list_model_packages(NextToken=NextToken)
        model_package_list.extend(model_list_pack['ModelPackageSummaryList'])
    else:
        break

model_package_list 

[{'ModelPackageName': 'meta-llama-3-1-8b-instruct-2024-08-15-04-14-13',
  'ModelPackageArn': 'arn:aws:sagemaker:us-west-2:322537213286:model-package/meta-llama-3-1-8b-instruct-2024-08-15-04-14-13',
  'ModelPackageDescription': 'marketplace-model-test',
  'CreationTime': datetime.datetime(2024, 8, 15, 4, 14, 14, 393000, tzinfo=tzlocal()),
  'ModelPackageStatus': 'Pending'},
 {'ModelPackageName': 'meta-llama-3-8b-2024-08-10-13-53-12',
  'ModelPackageArn': 'arn:aws:sagemaker:us-west-2:322537213286:model-package/meta-llama-3-8b-2024-08-10-13-53-12',
  'ModelPackageDescription': 'marketplace-model-test',
  'CreationTime': datetime.datetime(2024, 8, 10, 13, 53, 13, 813000, tzinfo=tzlocal()),
  'ModelPackageStatus': 'Completed'},
 {'ModelPackageName': 'meta-llama-3-8b-2024-08-10-12-50-29',
  'ModelPackageArn': 'arn:aws:sagemaker:us-west-2:322537213286:model-package/meta-llama-3-8b-2024-08-10-12-50-29',
  'ModelPackageDescription': 'marketplace-model-test',
  'CreationTime': datetime.datetime(

In [54]:
# ModelPackageName='meta-llama-3-1-8b-instruct-2024-08-04-23-52-19'
ModelPackageName = sm_model_name
sm_client.describe_model_package(ModelPackageName=ModelPackageName)['ModelPackageStatusDetails']
# sm_client.delete_model_package(ModelPackageName=ModelPackageName)

{'ValidationStatuses': [{'Name': 'validation', 'Status': 'Completed'}],
 'ImageScanStatuses': [{'Name': '322537213286.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference@sha256:05c36b82431608ce9dd04fc6552c4876acd8ad3b90e03ca8729f25b3ff1ce752',
   'Status': 'Completed'}]}

In [55]:
# sagemaker_session.wait_for_model_package(model_package_name=sm_model_name) # If failure occurs navigate to SageMaker Console > My marketplace model packages > select the failed ModelPackage for details. 

다음을 실행하기 전에, [Model Packages console from Amazon SageMaker](https://console.aws.amazon.com/sagemaker/home?region=us-east-1#/model-packages/my-resources)을 열어서 모델 생성의 성공했는지를 확인해야 합니다.
모델을 선택하고 **Validation** 탭을 열어서 validation 결과를 확인할 수 있습니다.

<br>

## [**Step 5**] Validating this ML model package by deploying it with Amazon SageMaker
---

##### 모델 패키지에서 모델 객체 생성

#### SageMaker 모델을 Endpoint로 배포

In [57]:
from sagemaker import ModelPackage

model = ModelPackage(
    role=execution_role_arn,
    model_package_arn=model_package["ModelPackageArn"],
    # model_package_arn="arn:aws:sagemaker:us-west-2:322537213286:model-package/meta-llama-3-1-8b-instruct-2024-08-04-04-24-21",
    sagemaker_session=sagemaker_session
)

In [58]:
model.deploy(
    initial_instance_count=1,
    # instance_type=supported_realtime_inference_instance_types[0],
    instance_type='ml.g5.12xlarge',
    endpoint_name=sm_model_name,
    model_data_download_timeout=600,
    container_startup_health_check_timeout=300,
)
model.endpoint_name

--------------!

'meta-llama-3-1-8b-instruct-2024-08-15-04-14-13'

#### boto3로 예시 호출

In [59]:
prompt = "The diamondback terrapin or simply terrapin is a species of turtle native to the brackish coastal tidal marshes of the"
response = "The diamondback terrapin or simply terrapin is a species of turtle native to the brackish coastal tidal marshes of the east coast."

sample_input = {
    "inputs": prompt,
    "parameters": {
        "max_tokens":256,
        "top_p": 0.9,
        "temperature": 0.6,
        "stop": ["<|eot_id|>"]
    }
}

In [60]:
%%time
response = sm_runtime.invoke_endpoint(
    EndpointName=model.endpoint_name,
    ContentType="application/json",
    Accept="application/json",
    Body=json.dumps(sample_input),
)

json.load(response["Body"])

CPU times: user 8.62 ms, sys: 4.4 ms, total: 13 ms
Wall time: 7.88 s


[{'generated_text': 'The diamondback terrapin or simply terrapin is a species of turtle native to the brackish coastal tidal marshes of the southeastern United States and northeastern Mexico. The diamondback terrapin is the official state reptile of Maryland. The species is divided into five subspecies, each with its own unique characteristics and geographic range. The diamondback terrapin is a medium-sized turtle that can grow up to 10 inches in length and weigh up to 3 pounds. They have a distinctive diamond-shaped pattern on their shell, which is brown or black with yellow or orange markings.\nThe diamondback terrapin is a'}]

#### 생성된 endpoint configuration 과 endpoint 정리 

In [61]:
model.sagemaker_session.delete_endpoint(model.endpoint_name)
model.sagemaker_session.delete_endpoint_config(model.endpoint_name)

- 이 모델은 필수가 아니므로 삭제해도 됩니다. 
- 배포 가능한 모델을 삭제한다는 점에 유의하세요. 
- 모델 패키지는 삭제하지 않습니다.

In [62]:
model.delete_model()

##### AWS 마켓플레이스에 모델을 게시하려면 모델 패키지 ARN을 지정해야 합니다. 다음 모델 패키지 ARN을 복사합니다. 

In [63]:
model_package["ModelPackageArn"]

'arn:aws:sagemaker:us-west-2:322537213286:model-package/meta-llama-3-1-8b-instruct-2024-08-15-04-14-13'

<br>

## [**Step 6**] Listing the ML model in AWS Marketplace
---

1.  모델 파트너는 AWS 마켓플레이스에서 [public profile](https://docs.aws.amazon.com/marketplace/latest/userguide/seller-registration-process.html#seller-public-profile)을 생성하고 seller로 등록합니다.
마켓플레이스의 상품은 무료 상품으로 등록되므로 세금 정보를 제공할 필요가 없습니다.

2. 세이지메이커 콘솔의 [Model Packages](https://console.aws.amazon.com/sagemaker/home?region=us-east-1#/model-packages/my-resources) 섹션에서 이 노트북에서 생성한 엔티티를 찾을 수 있습니다. 성공적으로 생성되고 유효성이 검사되었다면 해당 엔티티를 선택하고 **Publish new ML Marketplace listing**를 선택할 수 있을 것입니다.

<img src="images/publish-to-marketplace-action.png"/>

리스팅을 작성할 수 있는 [AWS Marketplace Management portal](https://aws.amazon.com/marketplace/management/ml-products/)로 리디렉션됩니다.

<img src="images/listing.png"/>

1. 모델이 여러 하드웨어 유형을 대상으로 하는 경우 각 ModelPackage를 별도의 버전으로 목록에 추가하는 것을 잊지 마세요.
2. 추가를 클릭하고 모델 정보를 입력합니다. Product visibility을 'Public'로 설정해야 합니다.

<img src="images/public.png"/>

3. 테스트를 진행할 account 에 대해 모델 접근을 위한 Allowlist에 추가합니다. 예) account `171503325295`, `572320329544` and `559110549532` for access to the model. 
For region support select: `us-east-1, us-west-2, eu-west-1, eu-central-1, eu-west-2, ap-northeast-1, ap-south-1, ca-central-1, us-east-2, ap-northeast-2`
<img src="images/allowlist-accs.png"/>

4. Pricing and terms 하에 pricing 모델을 설정합니다.
**Inference based pricing (custom metering) at $0**

(선택 사항) 컨테이너가 아래를 구현하지 않은 경우 이를 확인하고 다음을 진행하세요. 

```
I confirm that my model package supports the response header for custom metering. Example response header: X-Amzn-Inference-Metering:
{"Dimension": "inference.count", "ConsumedUnits": 3}
I understand that in absence of this header, default metering will be used instead.
```

<img src="images/inference-based-pricing.png"/>

5. Listing 상태는 다음과 같이 표시되어야 합니다:
**Do not click Sign off and publish**

<img src="images/status-1.png"/>

6. Vissibility status of the listing should be `Limited`.

<img src="images/status-2.png"/>




**Resources**
* [Publishing your product in AWS Marketplace](https://docs.aws.amazon.com/marketplace/latest/userguide/ml-publishing-your-product-in-aws-marketplace.html)


https://medium.com/@aliasghar.arabi/deploy-llama3-on-aws-inferentia-using-sagemaker-lmi-and-djl-serving-aa241db17aa3