# Qwen2 deployment guide
In this tutorial, you will deploy Qwen2 to a SageMaker endpoint using the Large Model Inference (LMI) container from the AWS Deep Learning Containers (DLC), then run inference against it.

Make sure the following permissions are granted before running the notebook:

- S3 bucket push access
- SageMaker access

## Step 1: Upgrade the SageMaker SDK and import dependencies

In [1]:
%pip install sagemaker --upgrade  --quiet

In [2]:
import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment
account_id = sess.account_id()  # account_id of the current SageMaker Studio environment

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


## Step 2: Start preparing model artifacts
In the LMI container, the following artifacts help set up the model:
- serving.properties (required): Defines the model server settings
- model.py (optional): A Python file that defines the core inference logic
- requirements.txt (optional): Any additional pip packages to install
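
Since serving.properties is a plain key=value file, it can be sanity-checked before packaging. The following is a minimal sketch, not part of the LMI tooling: `parse_properties` is a hypothetical helper, the bucket name `my-bucket` is a placeholder, and treating `engine` and `option.model_id` as the keys to check is an assumption made for this example.

```python
def parse_properties(text: str) -> dict:
    """Parse key=value lines, skipping blank lines and '#' comment lines."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {raw!r}")
        props[key.strip()] = value.strip()
    return props


# Placeholder content; 'my-bucket' is hypothetical
sample = """\
engine=Python
option.model_id=s3://my-bucket/models/Qwen2-7B-Instruct/
# Adjust based on model size and instance type
option.max_rolling_batch_size=16
option.tensor_parallel_degree=1
"""

props = parse_properties(sample)
# Keys this example chooses to check for (an assumption, not an official list)
missing = [k for k in ("engine", "option.model_id") if k not in props]
print(props["option.tensor_parallel_degree"], missing)
```

Running a check like this on the generated file catches a malformed line locally rather than at endpoint startup.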

In [3]:
model_name = 'Qwen2-7B-Instruct'
# model_name = 'Qwen2-72B-Instruct-AWQ'

In [4]:
import os

# Directory and file paths
dir_path = '.'
serving_file_path = os.path.join(dir_path, 'serving.properties')

# Create the directory structure
os.makedirs(os.path.dirname(serving_file_path), exist_ok=True)

serving_content = f'''\
engine=Python
option.model_id=s3://sagemaker-us-west-2-452145973879/models/{model_name}/
option.dtype=fp16
option.task=text-generation
option.rolling_batch=vllm
option.max_model_len=8192
option.device_map=auto
option.enable_streaming=true
# Adjust the following based on model size and instance type
option.max_rolling_batch_size=16
# option.output_formatter=jsonlines  # if set, the default output is streaming
'''

# option.max_model_len=131072

if model_name == 'Qwen2-72B-Instruct-AWQ':
    serving_content += 'option.tensor_parallel_degree=4\n'
elif model_name == 'Qwen2-7B-Instruct':
    serving_content += 'option.tensor_parallel_degree=1\n'
    
if 'AWQ' in model_name:
    serving_content += 'option.quantize=awq\n'

with open(serving_file_path, 'w') as file:
    file.write(serving_content)

In [6]:
%%sh
mkdir mymodel
mv serving.properties mymodel/
tar czvf mymodel.tar.gz mymodel/
rm -rf mymodel

mymodel/
mymodel/serving.properties
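
The same archive can be built without shell commands. This is a minimal sketch using Python's standard `tarfile` module; `pack_model_dir` is a hypothetical helper, and the demonstration writes a placeholder serving.properties into a scratch directory so it does not touch the real file. Unlike the `tar` command above, it stores only the file entry and no separate `mymodel/` directory entry.

```python
import os
import tarfile
import tempfile


def pack_model_dir(src_file: str, archive_path: str, arc_dir: str = "mymodel") -> list:
    """Create a gzip tarball holding src_file under arc_dir/ and return member names."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(src_file, arcname=f"{arc_dir}/{os.path.basename(src_file)}")
    # Re-open the archive to confirm what was actually written
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


# Demonstration in a scratch directory with a placeholder file
workdir = tempfile.mkdtemp()
props_path = os.path.join(workdir, "serving.properties")
with open(props_path, "w") as f:
    f.write("engine=Python\n")

members = pack_model_dir(props_path, os.path.join(workdir, "mymodel.tar.gz"))
print(members)
```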


## Step 3: Build the SageMaker endpoint
In this step, we will build a SageMaker endpoint from scratch.

### Getting the container image URI

[Large Model Inference available DLC](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers)


In [7]:
# image_uri = image_uris.retrieve(
#         framework="djl-deepspeed",
#         region=sess.boto_session.region_name,
#         version="0.27.0"
#     )

image_uri = "763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.28.0-lmi10.0.0-cu124-v1.0"
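
The URI above is hardcoded for us-west-2. A minimal sketch of composing it for another region follows. `lmi_image_uri` is a hypothetical helper, and it assumes the registry account ID and the `amazonaws.com` domain are the same in the target region, which does not hold everywhere, so check the DLC image list linked above before relying on it.

```python
def lmi_image_uri(region: str, tag: str, account: str = "763104351884") -> str:
    """Compose a djl-inference image URI.

    Assumption: the registry account and domain match the target region;
    some regions use a different account or domain.
    """
    return f"{account}.dkr.ecr.{region}.amazonaws.com/djl-inference:{tag}"


# Tag copied from the cell above
uri = lmi_image_uri("us-west-2", "0.28.0-lmi10.0.0-cu124-v1.0")
print(uri)
```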

### Upload the artifact to S3 and create the SageMaker model

In [8]:
s3_code_prefix = f"deploy/{model_name}/code"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("mymodel.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

model = Model(image_uri=image_uri, model_data=code_artifact, role=role)

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-west-2-452145973879/Qwen2-72B-Instruct-AWQ/code/mymodel.tar.gz


In [1]:
%rm mymodel.tar.gz

### Create SageMaker endpoint

Specify the instance type and the endpoint name.

In [None]:
if model_name == 'Qwen2-7B-Instruct':
    instance_type = "ml.g5.xlarge"  # 7B
elif model_name == 'Qwen2-72B-Instruct-AWQ':
    instance_type = "ml.g5.12xlarge"  # 72B

endpoint_name = sagemaker.utils.name_from_base(model_name)

print(f"endpoint_name: {endpoint_name}")

model.deploy(initial_instance_count=1,
             instance_type=instance_type,
             endpoint_name=endpoint_name
            )

# our requests will be in JSON format, so we specify the JSON serializer; responses come back as raw bytes
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
)

endpoint_name: Qwen2-72B-Instruct-AWQ-2024-06-21-01-12-11-100
------
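
The instance type chosen here and the `option.tensor_parallel_degree` written in Step 2 have to agree: ml.g5.xlarge has one GPU and ml.g5.12xlarge has four. One way to keep them together is a single lookup table, sketched below. `DeployConfig`, `DEPLOY_CONFIGS`, `deploy_config`, and `extra_serving_lines` are hypothetical names for this example; the values are the ones used in this notebook. An unknown model name raises an error instead of leaving `instance_type` undefined.

```python
from typing import NamedTuple


class DeployConfig(NamedTuple):
    instance_type: str
    tensor_parallel_degree: int
    quantize: str  # empty string means no quantization option


# Values taken from the cells in this notebook
DEPLOY_CONFIGS = {
    "Qwen2-7B-Instruct": DeployConfig("ml.g5.xlarge", 1, ""),
    "Qwen2-72B-Instruct-AWQ": DeployConfig("ml.g5.12xlarge", 4, "awq"),
}


def deploy_config(model_name: str) -> DeployConfig:
    """Look up the settings for a model, failing loudly on unknown names."""
    try:
        return DEPLOY_CONFIGS[model_name]
    except KeyError:
        raise ValueError(f"no deployment config for {model_name!r}") from None


def extra_serving_lines(cfg: DeployConfig) -> str:
    """Render the model-specific lines to append to serving.properties."""
    lines = [f"option.tensor_parallel_degree={cfg.tensor_parallel_degree}"]
    if cfg.quantize:
        lines.append(f"option.quantize={cfg.quantize}")
    return "\n".join(lines) + "\n"


cfg = deploy_config("Qwen2-72B-Instruct-AWQ")
print(cfg.instance_type)
print(extra_serving_lines(cfg), end="")
```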

## If the model has already been deployed

In [None]:
# import sagemaker

# endpoint_name = 'lmi-model-2024-06-11-09-18-56-099'
# predictor = sagemaker.Predictor(
#     endpoint_name=endpoint_name, 
#     sagemaker_session=sess,
#     serializer=serializers.JSONSerializer()
# )

# predictor

## Step 4: Test and benchmark the inference

First, send a single test request to the endpoint.

### non-streaming single test

In [None]:
import json
from transformers import AutoTokenizer  # only the tokenizer is needed to build the prompt

model_dir = f'/home/ec2-user/SageMaker/efs/Models/{model_name}'
tokenizer = AutoTokenizer.from_pretrained(model_dir)

prompt = """'下面是一段agent与customer的对话\nagent: 咁嘅\ncustomer: 你好呀,我正在用你們的抽濕機\ncustomer: 用咗半年到啦咁而家呢佢壞咗呀佢呢\nagent: 嗨\ncustomer: 濕濕咗嘅時候呢嗰啲水呢直接流落去地下\ncustomer: 接不到那些水\nagent: 請問點稱呼呀?\nagent: 余先生,請問你可否給我抽濕口罩?\ncustomer: 我姓余的\nagent: 手機的型號\ncustomer: RADY200H\nagent: 多謝大家\ncustomer: Thank you for watching\nagent: OK\ncustomer: 拜拜\nagent: 又係咪check我嗰個水準\nagent: 放入去嘅時候,位置係正常\nagent: 唔好擺尾之類\ncustomer: 冇嘅冇排名\nagent: Check過冇問題嘅,如果係可以搵錢\nagent: 咁請問你嗰個抽濕機係買咗一年到?嘛係咪?\nagent: 上年九月抽濕機\ncustomer: 上年九月買的\nagent: 请问您的购买单和保用证是否存在?\ncustomer: 喺度嘅\nagent: 不存在的,如果我可以找师傅上来帮你看,麻烦你出示给他\nagent: 請問您的地址在哪裡?\ncustomer: 柴灣道111號\ncustomer: 高威港\nagent: 高威國幾座幾樓幾室\ncustomer: 第四座,四座八樓私宅\nagent: SipoCat\ncustomer: 係冇差\nagent: 麻煩你等一等我睇下如果係柴灣嘅可以幾時安排到過嚟先\ncustomer: 唔\nagent: 嚟緊星期二 23 號 12 點,\nagent: 屋企會唔會有忍\ncustomer: 哎呀\nagent: 咁我呢度幫你安排返啦係余生嘅到時上嚟之前師傅可以打返90997\nagent: 910 呢個電話高威國四座扮cat嘅啱啱\ncustomer: 我留兩個電話給你\nagent: 好呀你再講呀\ncustomer: 我太太梁小姐\nagent: 90256922,呢個係搵邊位??\nagent: 即係阿梁小姐\nagent: 打誰的電話先?\nagent: 由我講吧!\ncustomer: 你打這個先啦,我再留多個屋企電話比你啦\ncustomer: 以5687021\nagent: 屋企電話25687021,如果係到時打電話嘅時候,可以搵返梁小姐係\nagent: 902-56922\ncustomer: 係冇錯\nagent: 好得,我呢度幫你安排返\nagent: 23 號 12 點至 5 點過嚟\ncustomer: 星期五至五點,星期二。\nagent: 係,冇錯,有咩其他可以幫到你?\ncustomer: OK,好呀,\nagent: 都係打電話嚟,多謝!\ncustomer: 係冇?啦,\nagent: 拜拜\ncustomer: OK\n\n\n请从上面的对话中抽取如下信息，并以json格式返回，如果对话中没有提到相应字段的内容，则填""：\n {"customerType": "个人", "customerName": "溫先生", "phoneNumber1": "", "phoneNumber2": "", "email": "", "address": "楊逸居第三座,九樓,A7", "productBrand": "Toshiba", "productCategoryName": "", "serialNumber": "ER-GD400HK", "srType": "维修", "srSubType": "维修", "symptomDescription": "燈膽燒了", "customerRequest": "維修法", "refNo": "", "selloutInvoiceNum": "", "salesDealerName": "", "installDealerName": ""}\n \n注意只返回抽取的json格式的结果，不返回其它额外信息。 \n'"""
system_prompt = "You are a helpful assistant."

messages = [
        {"role": "system", "content": system_prompt},
    ]
    
messages.append({"role": "user", "content": prompt})
    
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# print(prompt)

response = predictor.predict(
    {
        "inputs": prompt, 
         "parameters": 
         {
             "max_new_tokens": 4096,
             # Add any other sampling parameters as needed
             "temperature": 0.7,
             "top_k": 5,
             "top_p": 0.9,
             # "stop_token_ids": [], 
             # "stop": ["\nASSISTANT", "\nUSER:"],
             "include_stop_str_in_output": False,
             # "skip_special_tokens": True,
             "ignore_eos": False,
             "repetition_penalty": 1,
         }
    }
)

print('response: ', response.decode('utf-8'))
print('generated_text: ', json.loads(response.decode('utf-8'))['generated_text'])
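
The prompt above asks the model to return only a JSON object, but generated text can still carry extra words around it. Below is a minimal sketch of pulling the first JSON object out of `generated_text` with the standard `json` module. `extract_json_object` is a hypothetical helper, and the sample string is made up for illustration; it is not real endpoint output.

```python
import json


def extract_json_object(generated_text: str) -> dict:
    """Return the first JSON object that can be decoded from generated_text."""
    decoder = json.JSONDecoder()
    start = generated_text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            obj, _ = decoder.raw_decode(generated_text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        start = generated_text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")


# Made-up sample text, not real model output
sample_text = 'Result: {"customerName": "Mr. Yu", "phoneNumber1": "90256922"} (end)'
fields = extract_json_object(sample_text)
print(fields["phoneNumber1"])
```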

### streaming single test

In [None]:
import sagemaker
import io
import json

class LineIterator:
    """
    A helper class for parsing the byte stream input. 
    
    The output of the model will be in the following format:
    ```
    b'{"outputs": [" a"]}\n'
    b'{"outputs": [" challenging"]}\n'
    b'{"outputs": [" problem"]}\n'
    ...
    ```
    
    While usually each PayloadPart event from the event stream will contain a byte array 
    with a full json, this is not guaranteed and some of the json objects may be split across
    PayloadPart events. For example:
    ```
    {'PayloadPart': {'Bytes': b'{"outputs": '}}
    {'PayloadPart': {'Bytes': b'[" problem"]}\n'}}
    ```
    
    This class accounts for this by concatenating bytes written via the 'write' function
    and then exposing a method which will return lines (ending with a '\n' character) within
    the buffer via the 'scan_lines' function. It maintains the position of the last read 
    position to ensure that previous bytes are not exposed again. 
    """
    
    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
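            if line and line[-1] == ord('\n'):
                self.read_pos += len(line)
                return line[:-1]  # strip the trailing newline
            try:
                chunk = next(self.byte_iterator)
            except StopIteration:
                # Stream ended: return any trailing bytes that lack a final
                # newline instead of looping on them forever
                if line:
                    self.read_pos += len(line)
                    return line
                raise
            if 'PayloadPart' not in chunk:
                # chunk is a dict, so format it rather than concatenating it to a str
                print(f'Unknown event type: {chunk}')
                continue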
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk['PayloadPart']['Bytes'])
            
            
client = boto3.client('sagemaker-runtime')

body = {"inputs": prompt, "parameters": {"max_new_tokens":512}, "stream": True}
resp = client.invoke_endpoint_with_response_stream(EndpointName=endpoint_name, Body=json.dumps(body), ContentType="application/json")
event_stream = resp['Body']
        
for line in LineIterator(event_stream):
    event = json.loads(line)
    # fall back to an empty string if an event carries no "token" field
    print((event.get("token") or {}).get("text", ""), end='')
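
To see the line reassembly work without a live endpoint, the same idea can be exercised against a hand-made event list. This is a minimal sketch: `iter_json_lines` is a hypothetical generator-based alternative to `LineIterator`, and `fake_stream` imitates `PayloadPart` events in which one JSON line is split across two chunks. The `token`/`text` shape follows the fields read in the cell above.

```python
import json


def iter_json_lines(event_stream):
    """Yield one parsed JSON object per newline-terminated line in the stream."""
    buffer = b""
    for event in event_stream:
        part = event.get("PayloadPart")
        if part is None:
            continue  # skip events that carry no payload bytes
        buffer += part["Bytes"]
        # Emit every complete line currently in the buffer
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield json.loads(line)
    if buffer.strip():
        yield json.loads(buffer)  # trailing line without a final newline


# Hand-made events: the first JSON line is split across two chunks
fake_stream = [
    {"PayloadPart": {"Bytes": b'{"token": {"text": "Hel'}},
    {"PayloadPart": {"Bytes": b'lo"}}\n{"token": {"text": " world"}}\n'}},
]
text = "".join(obj["token"]["text"] for obj in iter_json_lines(fake_stream))
print(text)
```

Because bytes are joined before parsing, a multi-byte UTF-8 character that straddles two chunks is also decoded correctly.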

### streaming gradio service

In [None]:
import gradio as gr
import json

from transformers import AutoTokenizer  # only the tokenizer is needed to build the prompt

model_dir = f'/home/ec2-user/SageMaker/efs/Models/{model_name}'
tokenizer = AutoTokenizer.from_pretrained(model_dir)

def response(message, history, system_prompt):
    
    # print('message:', message)
    # print('history:', history)
    
    messages = [
        {"role": "system", "content": system_prompt},
    ]
    
    for human, ai in history:
        messages.append({"role": "user", "content": human})
        messages.append( {"role": "assistant", "content": ai})
    
    messages.append({"role": "user", "content": message})
    
    # prompt = tokenizer.apply_chat_template(messages, tokenize=False)
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    # print(f"prompt: {prompt}")
    
    body = {
            "inputs": prompt, 
             "parameters": 
             {
                 "do_sample": True,
                 "max_new_tokens": 4096,
                 # Add any other sampling parameters as needed
                 "temperature": 0.7,
                 "top_k": 20,
                 "top_p": 0.8,
                 # "stop_token_ids": [], 
                 # "stop": ["[INST]"],
                 "skip_special_tokens": True,
                 "ignore_eos": False,
                 "repetition_penalty": 1.05,
             },
            "stream": True        
    }
    
    resp = client.invoke_endpoint_with_response_stream(EndpointName=endpoint_name, Body=json.dumps(body), ContentType="application/json")
    event_stream = resp['Body']
    
    response_text = ''
    for line in LineIterator(event_stream):
        event = json.loads(line)
        # fall back to an empty string if an event carries no "token" field
        response_text += (event.get("token") or {}).get("text", "")
        
        yield response_text
    
demo = gr.ChatInterface(response, 
                        chatbot=gr.Chatbot(render_markdown=False), 
                        additional_inputs=[gr.Textbox("You are a helpful assistant.", label="System Prompt")],
                        title='Chatbot (Qwen2-72B-Instruct-AWQ)',
                        description='Welcome! I am your chatbot. Ask me anything.')

demo.launch(share=True)
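
The handler above turns the Gradio history into a message list and lets the tokenizer render it. The sketch below reproduces both steps without loading a tokenizer, which is handy for inspecting the final prompt. `build_messages` and `render_chatml` are hypothetical helpers, and the rendering assumes the ChatML layout (`<|im_start|>` / `<|im_end|>` markers); the tokenizer's own chat template remains the authoritative source.

```python
def build_messages(system_prompt, history, message):
    """Convert (user, assistant) history pairs plus a new message into chat messages."""
    messages = [{"role": "system", "content": system_prompt}]
    for human, ai in history:
        messages.append({"role": "user", "content": human})
        messages.append({"role": "assistant", "content": ai})
    messages.append({"role": "user", "content": message})
    return messages


def render_chatml(messages, add_generation_prompt=True):
    """Render messages in ChatML form (assumed layout, see the note above)."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")  # cue the model to answer
    return "".join(parts)


msgs = build_messages("You are a helpful assistant.", [("Hi", "Hello!")], "How are you?")
chat_prompt = render_chatml(msgs)
print(chat_prompt)
```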

## Clean up the environment

In [None]:
# sess.delete_endpoint(endpoint_name)
# sess.delete_endpoint_config(endpoint_name)
# model.delete_model()
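
The same cleanup can be written against a low-level SageMaker client. The sketch below is an illustration under stated assumptions: `delete_endpoint_resources` is a hypothetical helper, it assumes the endpoint config shares the endpoint's name (as the commented calls above imply), and `RecordingClient` is a stand-in that only records calls so the example runs without touching AWS. With a real client you would pass `boto3.client('sagemaker')` instead.

```python
class RecordingClient:
    """Stand-in for a SageMaker client; it records calls instead of deleting anything."""

    def __init__(self):
        self.calls = []

    def delete_endpoint(self, EndpointName):
        self.calls.append(("delete_endpoint", EndpointName))

    def delete_endpoint_config(self, EndpointConfigName):
        self.calls.append(("delete_endpoint_config", EndpointConfigName))

    def delete_model(self, ModelName):
        self.calls.append(("delete_model", ModelName))


def delete_endpoint_resources(sm_client, endpoint_name, model_name=None):
    """Delete the endpoint, its config (assumed to share its name), and optionally the model."""
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_name)
    if model_name is not None:
        sm_client.delete_model(ModelName=model_name)


# 'my-endpoint' and 'my-model' are placeholder names
stub = RecordingClient()
delete_endpoint_resources(stub, "my-endpoint", model_name="my-model")
print(stub.calls)
```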