## 1. Preparation

Based on https://github.com/triton-inference-server/fastertransformer_backend/tree/dev/v1.1_beta

* Work on an ml.p3.16xlarge notebook instance so that the model can be converted and tested in local mode.
* The images are large, so increase the disk size when creating the notebook and change the Docker image path.
* Follow the fastertransformer_backend README to git clone the repositories (fastertransformer_backend, triton, FasterTransformer).

### Move the Docker image path to EBS

In [None]:
%%bash

#!/usr/bin/env bash

echo '{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}' > daemon.json

sudo cp daemon.json /etc/docker/daemon.json && rm daemon.json

DAEMON_PATH="/etc/docker"
MEMORY_SIZE=10G

FLAG=$(cat $DAEMON_PATH/daemon.json | jq 'has("data-root")')
# echo $FLAG

if [ "$FLAG" == true ]; then
    echo "Already revised"
else
    echo "Add data-root and default-shm-size=$MEMORY_SIZE"
    sudo cp $DAEMON_PATH/daemon.json $DAEMON_PATH/daemon.json.bak
    sudo cat $DAEMON_PATH/daemon.json.bak | jq '. += {"data-root":"/home/ec2-user/SageMaker/.container/docker","default-shm-size":"'$MEMORY_SIZE'"}' | sudo tee $DAEMON_PATH/daemon.json > /dev/null
    sudo service docker restart
    echo "Docker Restart"
fi

sudo docker info | grep Root
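
The `jq` step in the cell above is a shallow merge of two keys into `daemon.json`, skipped when `data-root` is already set. As a rough illustration only, the same logic can be sketched in Python; the helper name `add_docker_settings` is hypothetical and the sketch works on an in-memory dict rather than on `/etc/docker/daemon.json`.

```python
import json


def add_docker_settings(daemon_cfg: dict, data_root: str, shm_size: str) -> dict:
    """Return a copy of daemon_cfg with data-root and default-shm-size added.

    Mirrors the has("data-root") check: an already-revised config is returned unchanged.
    """
    if "data-root" in daemon_cfg:
        return dict(daemon_cfg)
    merged = dict(daemon_cfg)  # shallow copy, like jq's top-level object merge
    merged.update({"data-root": data_root, "default-shm-size": shm_size})
    return merged


base = {"runtimes": {"nvidia": {"path": "nvidia-container-runtime", "runtimeArgs": []}}}
updated = add_docker_settings(base, "/home/ec2-user/SageMaker/.container/docker", "10G")
print(json.dumps(updated, indent=4))
```

Calling the helper a second time on `updated` leaves it unchanged, which is the same idempotence the `Already revised` branch gives the shell script.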

### Pull the SageMaker Triton image (us-east-1 as the example)
* ECR login is required

In [None]:
triton_image_account_id_map = {
    'us-east-1': '785573368785',
    'us-east-2': '007439368137',
    'us-west-1': '710691900526',
    'us-west-2': '301217895009',
    'eu-west-1': '802834080501',
    'eu-west-2': '205493899709',
    'eu-west-3': '254080097072',
    'eu-north-1': '601324751636',
    'eu-south-1': '966458181534',
    'eu-central-1': '746233611703',
    'ap-east-1': '110948597952',
    'ap-south-1': '763008648453',
    'ap-northeast-1': '941853720454',
    'ap-northeast-2': '151534178276',
    'ap-southeast-1': '324986816169',
    'ap-southeast-2': '355873309152',
    'cn-northwest-1': '474822919863',
    'cn-north-1': '472730292857',
    'sa-east-1': '756306329178',
    'ca-central-1': '464438896020',
    'me-south-1': '836785723513',
    'af-south-1': '774647643957'
}

import boto3
region = boto3.Session().region_name
triton_image_account_id = triton_image_account_id_map[region]
triton_image_uri = f'{triton_image_account_id}.dkr.ecr.{region}.amazonaws.com/sagemaker-tritonserver:21.08-py3'
print(triton_image_uri)
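
The f-string above hard-codes the `amazonaws.com` domain, while the account map also lists `cn-north-1` and `cn-northwest-1`, whose ECR endpoints use `amazonaws.com.cn`. A minimal sketch of a URI builder that handles both cases follows; the function name `ecr_image_uri` is hypothetical and not part of boto3 or the SageMaker SDK.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str) -> str:
    """Build an ECR image URI, switching the domain for China regions."""
    # Regions in the China partition (cn-*) use the amazonaws.com.cn domain.
    suffix = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
    return f"{account_id}.dkr.ecr.{region}.{suffix}/{repository}:{tag}"


print(ecr_image_uri("785573368785", "us-east-1", "sagemaker-tritonserver", "21.08-py3"))
print(ecr_image_uri("472730292857", "cn-north-1", "sagemaker-tritonserver", "21.08-py3"))
```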

In [None]:
%%sh -s "$region" "$triton_image_account_id" "$triton_image_uri"
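# `aws ecr get-login` was removed in AWS CLI v2; get-login-password is its replacement.
aws ecr get-login-password --region "$1" | docker login --username AWS --password-stdin "$2.dkr.ecr.$1.amazonaws.com"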
docker pull $3
docker image ls

### git clone

In [None]:
%%sh 
git clone https://github.com/triton-inference-server/fastertransformer_backend.git -b dev/v1.1_beta
git clone https://github.com/triton-inference-server/server.git # Provides tools needed to test this backend
git clone -b dev/v5.0_beta https://github.com/NVIDIA/FasterTransformer # Used to convert the checkpoint and process the Triton output
ln -s server/qa/common .
cp serve config.pbtxt fastertransformer_backend && cp Dockerfile.sm fastertransformer_backend/docker

In [None]:
import boto3
account_id = boto3.client("sts").get_caller_identity()["Account"]
algorithm_name = "sm-triton-ft"
#version = "latest"
version = "21.08-al2-py3"
image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{algorithm_name}:{version}"
print(image_uri)

### Log in to ECR, then build and push the Docker image

* Take the [Dockerfile](https://github.com/triton-inference-server/fastertransformer_backend/blob/dev/v1.1_beta/docker/Dockerfile), replace the base image with the SageMaker Triton image, and replace the serve file at the end.
* The serve file is the [original](https://github.com/triton-inference-server/server/blob/main/docker/sagemaker/serve) with only the final launch command changed, following the launch command of the FasterTransformer backend.
* Copy Dockerfile.sm into the directory that holds the original Dockerfile (workspace/fastertransformer_backend/docker),
* copy the serve file into the parent directory (workspace/fastertransformer_backend), and then run docker build (from a terminal).

```
docker build -t {account_number}.dkr.ecr.us-east-1.amazonaws.com/sm-triton-ft:21.08-al2-py3 -f docker/Dockerfile.sm .
```

* Before pushing, create the ECR repository sm-triton-ft, log in to ECR, and set up push permissions.

In [None]:
%%sh -s "$algorithm_name" "$region" "$image_uri"

algorithm_name=$1
region=$2
fullname=$3

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Log in to ECR by piping a temporary password to docker login (the registry host is the part of the URI before the first "/")
aws ecr get-login-password --region "${region}" | docker login --username AWS --password-stdin "${fullname%%/*}"

echo "[Note] Please copy the command below and run it in the terminal. You can run it directly in jupyter notebook, but it is recommended to run it in a terminal for debugging."
echo ""
echo "cd /home/ec2-user/SageMaker/ft-triton/fastertransformer_backend && docker build -t ${fullname} -f docker/Dockerfile.sm ."
echo "docker push ${fullname}"

<br>

## 2. Create the model and upload it to S3

Create the model
* See [fastertransformer_backend README.md How to set the model configuration](https://github.com/triton-inference-server/fastertransformer_backend/tree/dev/v1.1_beta#how-to-set-the-model-configuration) and its "Prepare Triton GPT model store" section.

Edit config.pbtxt
* tensor_para_size = 8
* model_checkpoint_path = "/opt/ml/model/fastertransformer/1/8-gpu"
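
The two settings above can be edited by hand. As a rough sketch, they could also be patched from Python, assuming `config.pbtxt` stores each setting as a `parameters { key: "..." value: { string_value: "..." } }` block. The helper `set_parameter` is hypothetical, and the starting values in `sample` are made up for illustration, so compare against the actual file before relying on it.

```python
import re


def set_parameter(config_text: str, key: str, value: str) -> str:
    """Replace the string_value of one `parameters` entry in config.pbtxt text."""
    pattern = re.compile(
        r'(key:\s*"' + re.escape(key) + r'"\s*value:\s*\{\s*string_value:\s*")[^"]*(")'
    )
    # A function replacement avoids backslash handling in the new value.
    new_text, count = pattern.subn(lambda m: m.group(1) + value + m.group(2), config_text)
    if count != 1:
        raise ValueError(f"expected exactly one '{key}' parameter, found {count}")
    return new_text


# Hypothetical excerpt; the real file contains more entries.
sample = '''
parameters {
  key: "tensor_para_size"
  value: {
    string_value: "1"
  }
}
parameters {
  key: "model_checkpoint_path"
  value: {
    string_value: "./models/fastertransformer/1/1-gpu"
  }
}
'''

patched = set_parameter(sample, "tensor_para_size", "8")
patched = set_parameter(patched, "model_checkpoint_path", "/opt/ml/model/fastertransformer/1/8-gpu")
print(patched)
```

The `8-gpu` directory name matches the `-infer_gpu_num 8` argument passed to the checkpoint converter later in this section.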

Compress the model and upload it to S3

In [None]:
import os
WORKSPACE = os.getcwd()
SRC_MODELS_DIR = f"{WORKSPACE}/models"
TRITON_MODELS_STORE = f"{WORKSPACE}/triton-model-store"
TRITON_DOCKER_IMAGE = image_uri

### Download the model (Megatron GPT-3 345M)

In [None]:
!wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -P models
!wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -P models
!wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O megatron_lm_345m_v0.0.zip
!mkdir -p {SRC_MODELS_DIR}/megatron-models/345m
!unzip megatron_lm_345m_v0.0.zip -d models/megatron-models/345m
!mkdir {TRITON_MODELS_STORE}/fastertransformer/1 -p

### Create the model

In [None]:
ARGUMENT_STR = f"-i {SRC_MODELS_DIR}/megatron-models/345m/release/ " + \
    f"-o {TRITON_MODELS_STORE}/fastertransformer/1 " + \
    "-trained_gpu_num 1 -infer_gpu_num 8 -head_num 16"

!echo {ARGUMENT_STR}

!docker run --rm -it --gpus=all \
    -e SRC_MODELS_DIR={SRC_MODELS_DIR} \
    -e TRITON_MODELS_STORE={TRITON_MODELS_STORE} \
    -v {WORKSPACE}:{WORKSPACE} \
    {TRITON_DOCKER_IMAGE} \
    bash -c "python {WORKSPACE}/FasterTransformer/examples/pytorch/gpt/utils/megatron_ckpt_convert.py {ARGUMENT_STR}"

In [None]:
converted_model_path = f'{TRITON_MODELS_STORE}/fastertransformer/1/'
!ls {converted_model_path}

### Compress the model

In [None]:
%%time
serve_dir = "triton-serve-ft"
!rm -rf {serve_dir} && mkdir -p {serve_dir}/fastertransformer/
!cp -r {converted_model_path} {serve_dir}/fastertransformer/1/
!cp config.pbtxt {serve_dir}/fastertransformer
!tar -C {serve_dir}/ -czf model.tar.gz fastertransformer
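
Before uploading, it can help to confirm that the archive has the layout the cell above builds: `fastertransformer/config.pbtxt` plus a `fastertransformer/1/` version directory. The sketch below uses only the standard library. `check_model_archive` is a hypothetical helper, and the demo runs on a throwaway archive with a placeholder `weights.bin` file instead of the real `model.tar.gz`.

```python
import os
import tarfile
import tempfile


def check_model_archive(path: str, model_name: str = "fastertransformer") -> list:
    """Return the expected entries that are missing from a model.tar.gz."""
    with tarfile.open(path, "r:gz") as tar:
        # Normalize a possible leading "./" in member names.
        names = [n[2:] if n.startswith("./") else n for n in tar.getnames()]
    missing = []
    if f"{model_name}/config.pbtxt" not in names:
        missing.append(f"{model_name}/config.pbtxt")
    if not any(n.startswith(f"{model_name}/1/") for n in names):
        missing.append(f"{model_name}/1/<weights>")
    return missing


# Demo on a temporary archive that imitates the layout built above.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "fastertransformer", "1", "8-gpu"))
    open(os.path.join(tmp, "fastertransformer", "config.pbtxt"), "w").close()
    open(os.path.join(tmp, "fastertransformer", "1", "8-gpu", "weights.bin"), "w").close()
    demo = os.path.join(tmp, "demo.tar.gz")
    with tarfile.open(demo, "w:gz") as tar:
        tar.add(os.path.join(tmp, "fastertransformer"), arcname="fastertransformer")
    demo_missing = check_model_archive(demo)

print(demo_missing)  # an empty list means nothing is missing
```

In the notebook, `check_model_archive("model.tar.gz")` would run the same check on the real archive.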

### Upload the model to S3

In [None]:
import boto3, json, sagemaker, time
from sagemaker import get_execution_role

sess = boto3.Session()
sm = sess.client("sagemaker")
sagemaker_session = sagemaker.Session(boto_session=sess)
role = get_execution_role()
client = boto3.client("sagemaker-runtime")
model_uri = sagemaker_session.upload_data(path="model.tar.gz", key_prefix=serve_dir)

<br>

## 3. Local mode test

* See [fastertransformer_backend Run Serving on Single Node](https://github.com/triton-inference-server/fastertransformer_backend/tree/dev/v1.1_beta#run-serving-on-single-node)
* See the [SageMaker Triton example](https://github.com/aws/amazon-sagemaker-examples/blob/1072934944e5270f7f2fb0d9e0e1a86ce96aa57e/sagemaker-triton/nlp_bert/triton_nlp_bert.ipynb)

In [None]:
!pip install tritonclient[http]

In [None]:
!pip install torch transformers

In [None]:
sm_local_model_name = "triton-ft-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
model = sagemaker.model.Model(image_uri=image_uri, model_data=model_uri, role=role, 
                              name=sm_local_model_name)
model.deploy(initial_instance_count=1, instance_type='local_gpu')

In [None]:
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.local import LocalSession

local_sess = LocalSession()
predictor = Predictor(
    endpoint_name=model.endpoint_name, 
    sagemaker_session=local_sess,
    serializer=JSONSerializer()
)

In [None]:
import os
import torch
import numpy as np
from transformers import GPT2Tokenizer
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype, InferenceServerException

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

def make_payload(tokenizer, decode_length=24):
    max_length = 4
    num_samples = 4
    
    sample1 = tokenizer("Machine Learning skills require", max_length=max_length, truncation=True)['input_ids']
    sample2 = tokenizer("Study, play,", max_length=max_length, truncation=True)['input_ids']
    sample3 = tokenizer("Amazon's biggest success", max_length=max_length, truncation=True)['input_ids']
    sample4 = tokenizer("Amazon SageMaker is", max_length=max_length, truncation=True)['input_ids']    
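    # Every prompt must tokenize to exactly max_length tokens; a shorter prompt
    # would make the list ragged and np.array would raise.
    input_start_ids = np.array([sample1, sample2, sample3, sample4], np.uint32)

    # Reshape to (batch, 1, seq_len), the layout this notebook sends as INPUT_ID.
    input_start_ids = input_start_ids.reshape([input_start_ids.shape[0], 1, input_start_ids.shape[1]])
    input_data = np.tile(input_start_ids, (1, 1, 1))  # reps of 1 on every axis: a no-op copy
    # One length per sample, shape (batch, 1).
    input_len = np.array([[sentence.size] for sentence in input_start_ids], np.uint32)
    # Number of tokens to generate per sample, same shape as input_len.
    output_len = np.ones_like(input_len).astype(np.uint32) * decode_length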

    payload = {
        "inputs": [
            {"name": "INPUT_ID", "shape": input_data.shape, "datatype": np_to_triton_dtype(input_data.dtype), 
             "data": input_data.tolist()},
            {"name": "REQUEST_INPUT_LEN", "shape": input_len.shape, "datatype": np_to_triton_dtype(input_len.dtype), 
             "data": input_len.tolist()},
            {"name": "REQUEST_OUTPUT_LEN", "shape": output_len.shape, "datatype": np_to_triton_dtype(output_len.dtype),
             "data": output_len.tolist()}
        ]
    }    
    
    return payload
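
Each entry in the payload above declares a `shape` next to nested `data`, and a mismatch between the two is an easy mistake to make. A minimal sketch of a consistency check follows. The helpers `nested_shape` and `check_payload` are hypothetical, and the token ids in `demo` are made up instead of coming from the tokenizer.

```python
def nested_shape(data):
    """Infer the shape of a regular nested list by following the first element."""
    shape = []
    while isinstance(data, list):
        shape.append(len(data))
        data = data[0] if data else None
    return shape


def check_payload(payload):
    """Raise ValueError if any input's declared shape differs from its data."""
    for tensor in payload["inputs"]:
        declared = list(tensor["shape"])
        actual = nested_shape(tensor["data"])
        if declared != actual:
            raise ValueError(f"{tensor['name']}: declared {declared}, data has {actual}")
    return True


# Two samples of four made-up token ids each, laid out as (batch, 1, seq_len).
demo = {
    "inputs": [
        {"name": "INPUT_ID", "shape": [2, 1, 4], "datatype": "UINT32",
         "data": [[[11, 12, 13, 14]], [[21, 22, 23, 24]]]},
        {"name": "REQUEST_INPUT_LEN", "shape": [2, 1], "datatype": "UINT32",
         "data": [[4], [4]]},
        {"name": "REQUEST_OUTPUT_LEN", "shape": [2, 1], "datatype": "UINT32",
         "data": [[24], [24]]},
    ]
}

print(check_payload(demo))
```

The same call works on the dict returned by `make_payload`, since `list(tensor["shape"])` also accepts the tuples that NumPy produces.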

In [None]:
from datetime import datetime
payload = make_payload(tokenizer, decode_length=24)

num_infers = 2
latency = np.zeros(num_infers)
for i in range(num_infers):
    start_time = datetime.now()     
    outputs = predictor.predict(payload)
    stop_time = datetime.now()
    delta = ((stop_time - start_time).total_seconds()* 1000.0)
    latency[i] = delta

avg = np.average(latency)
p50 = np.quantile(latency, 0.50)
p95 = np.quantile(latency, 0.95) 
p99 = np.quantile(latency, 0.99)
print(f'avg latency: {avg:.4f} ms')    
print(f'p50 latency: {p50:.4f} ms')
print(f'p95 latency: {p95:.4f} ms')
print(f'p99 latency: {p99:.4f} ms')    

In [None]:
import json
output_json = json.loads(outputs)['outputs'][0]
num_output_samples = output_json['shape'][0]
#print(output_json['shape']) # num_batches x 1 x (input length + decode length)
output_tokens = output_json['data']
output_tokens = np.reshape(output_tokens, (num_output_samples, -1))

for k in range(num_output_samples):
    text = tokenizer.decode(output_tokens[k], clean_up_tokenization_spaces=True)
    print(f'[Output sample {k+1}]')
    print(text + '\n')

### Delete the endpoint and model

In [None]:
predictor.delete_endpoint()
model.delete_model()

<br>

## 4. Create and test the endpoint

In [None]:
sm_model_name = "triton-ft-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

container = {
    "Image": image_uri,
    "ModelDataUrl": model_uri,
}

create_model_response = sm.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

print("Model Arn: " + create_model_response["ModelArn"])

In [None]:
endpoint_config_name = "triton-ft-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.p3.16xlarge",
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

In [None]:
endpoint_name = "triton-ft-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_endpoint_response = sm.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

In [None]:
from IPython.display import display, HTML
def make_endpoint_link(region, endpoint_name, endpoint_task):
    endpoint_link = f'<b><a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={region}#/endpoints/{endpoint_name}">{endpoint_task} Review Endpoint</a></b>'
    return endpoint_link 
        
endpoint_link = make_endpoint_link(region, endpoint_name, '[Deploy model from S3]')
display(HTML(endpoint_link))

In [None]:
# sagemaker.Session().wait_for_endpoint(endpoint_name, poll=5)
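
The benchmark cells that follow only work once the endpoint is `InService`, so some form of waiting is needed if the call above stays commented out. A minimal polling sketch follows. `wait_until_in_service` is a hypothetical helper that takes any callable returning a dict with an `EndpointStatus` key, and the demo feeds it canned statuses instead of calling AWS.

```python
import time


def wait_until_in_service(describe, poll_seconds=5.0, timeout_seconds=1800.0, sleep=time.sleep):
    """Poll describe() until EndpointStatus is InService, or raise on failure/timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = describe()["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError("endpoint creation failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"endpoint still {status} after {timeout_seconds} s")
        sleep(poll_seconds)


# Demo with canned statuses; no AWS call is made here.
fake_statuses = iter(["Creating", "Creating", "InService"])
final = wait_until_in_service(
    lambda: {"EndpointStatus": next(fake_statuses)},
    poll_seconds=0,
    sleep=lambda seconds: None,
)
print(final)
```

In the notebook, the callable would be `lambda: sm.describe_endpoint(EndpointName=endpoint_name)`. boto3 also provides a built-in waiter, `sm.get_waiter("endpoint_in_service")`, which does the same job.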

In [None]:
from datetime import datetime
from tqdm import tqdm


def benchmark(sm_runtime_client, endpoint_name, payload, num_infers=20):
    latency = np.zeros(num_infers)
    t = tqdm(range(num_infers), position=0, leave=True)

    for i in t:
        start_time = datetime.now()          
        response = sm_runtime_client.invoke_endpoint(
            EndpointName=endpoint_name, ContentType="application/octet-stream", Body=json.dumps(payload)
        )
        outputs = json.loads(response["Body"].read().decode("utf8"))
        stop_time = datetime.now()
        delta = ((stop_time - start_time).total_seconds()* 1000.0)
        latency[i] = delta  
        
    avg = np.average(latency)
    p50 = np.quantile(latency, 0.50)
    p95 = np.quantile(latency, 0.95) 
    p99 = np.quantile(latency, 0.99)
    print(f'avg latency: {avg:.4f} ms')    
    print(f'p50 latency: {p50:.4f} ms')
    print(f'p95 latency: {p95:.4f} ms')
    print(f'p99 latency: {p99:.4f} ms')
    
    return outputs

Because GPT is an auto-regressive model, latency increases as decode_length grows.
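
One way to observe this is to time the same request at several decode lengths. The sketch below is illustrative only: `sweep_decode_lengths` is a hypothetical helper, and `fake_run` is a stand-in that sleeps in proportion to the decode length, so its numbers say nothing about the real endpoint.

```python
import statistics
import time


def sweep_decode_lengths(run_once, decode_lengths, repeats=3):
    """Return {decode_length: median latency in ms} for a callable run_once(n)."""
    results = {}
    for n in decode_lengths:
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_once(n)
            samples.append((time.perf_counter() - start) * 1000.0)
        results[n] = statistics.median(samples)
    return results


def fake_run(decode_length):
    # Stand-in for an endpoint call: pretend each generated token costs 1 ms.
    time.sleep(0.001 * decode_length)


demo = sweep_decode_lengths(fake_run, [1, 4, 8], repeats=1)
for length, ms in demo.items():
    print(f"decode_length={length}: {ms:.2f} ms")
```

Against the real endpoint, `run_once` could wrap `client.invoke_endpoint` with a payload from `make_payload(tokenizer, decode_length=n)`.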

In [None]:
payload = make_payload(tokenizer, decode_length=24)
outputs = benchmark(client, endpoint_name, payload, num_infers=50)

In [None]:
print(outputs)

In [None]:
output_json = outputs['outputs'][0]
num_output_samples = output_json['shape'][0]
#print(output_json['shape']) # num_batches x 1 x (input length + decode length)
output_tokens = output_json['data']
output_tokens = np.reshape(output_tokens, (num_output_samples, -1))

for k in range(num_output_samples):
    text = tokenizer.decode(output_tokens[k], clean_up_tokenization_spaces=True)
    print(f'[Output sample {k+1}]')
    print(text + '\n')

## Clean up

In [None]:
sm.delete_endpoint(EndpointName=endpoint_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm.delete_model(ModelName=sm_model_name)

In [None]:
!sudo rm -rf common fastertransformer_backend FasterTransformer server triton-model-store triton-serve-ft models models.tar.gz
!rm megatron_lm_345m_v0.0.zip model.tar.gz