## Deploying large language models with SageMaker JumpStart

The diagram below shows how a large language model is deployed with SageMaker:

<img src="imgs/sagemaker_deploy_model.jpg" style="width: 850px;"></img>


### Deployable models

Several models are provided here, each with a script for deploying it:
 * LLaMA
 * LLaMA 2 family
 * Falcon family

The required scripts are in the corresponding `djl-*` folders.

The acceleration frameworks that can be used, such as Hugging Face or DeepSpeed, depend on the model.


### Deployment

Preparation:
1. Upgrade boto3 and the SageMaker Python SDK
2. Prepare inference.py and requirements.txt

In [None]:
# If needed, upgrade the SageMaker Python SDK and boto3 (the AWS SDK for Python)
# !pip install --upgrade boto3
# !pip install --upgrade sagemaker
# !pip install ipywidgets==7.0.0 --quiet
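
To decide whether the upgrade is needed at all, it can help to print the installed versions first. The helper below is a minimal sketch that uses only the standard library; `installed_version` is a hypothetical name introduced for illustration.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the two SDKs this notebook depends on
for name in ("boto3", "sagemaker"):
    print(name, installed_version(name) or "not installed")
```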

In [1]:
import boto3
import sagemaker

account_id = boto3.client('sts').get_caller_identity().get('Account')
region_name = boto3.session.Session().region_name

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name  # region name of the current session (public attribute)
bucket = sagemaker_session.default_bucket()
role = sagemaker.get_execution_role()
print(bucket)

sagemaker-us-east-1-568765279027
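
The bucket name printed above follows the pattern `sagemaker-{region}-{account_id}`, which is how the SDK names its default bucket. As a small sketch, the same name can be rebuilt from the `region_name` and `account_id` values collected in the cell above; `default_bucket_name` is a hypothetical helper, not an SDK function.

```python
def default_bucket_name(region: str, account_id: str) -> str:
    """Rebuild the default bucket name from its two components."""
    return f"sagemaker-{region}-{account_id}"


# Values taken from the output printed above
print(default_bucket_name("us-east-1", "568765279027"))
```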


Next, we deploy the model with SageMaker.

In [2]:
model_id, model_version = (
    "huggingface-llm-falcon-7b-instruct-bf16",
    "*",  # "*" selects the latest available version
)

In [6]:
from sagemaker.jumpstart.model import JumpStartModel

# ??JumpStartModel

In [10]:
from sagemaker.jumpstart.model import JumpStartModel
my_model = JumpStartModel(
    model_id=model_id,
    model_version=model_version,
    name='mt-jump-falcon-7b-instruct-model'
)

In [11]:
print(my_model.image_uri)

763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.0-tgi0.8.2-gpu-py39-cu118-ubuntu20.04


In [12]:
# The model artifact can be downloaded to inspect the inference code.
print(my_model.model_data)

s3://jumpstart-cache-prod-us-east-1/huggingface-infer/prepack/v1.0.0/infer-prepack-huggingface-llm-falcon-7b-instruct-bf16.tar.gz
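
To download that artifact, the S3 URI first has to be split into a bucket and a key. The sketch below shows one way to do it; `split_s3_uri` and `download_artifact` are hypothetical helpers, and the download assumes the notebook role is allowed to read the JumpStart bucket.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def download_artifact(uri: str, local_path: str) -> None:
    """Download the object behind `uri` to `local_path` (needs AWS credentials)."""
    import boto3  # imported lazily so split_s3_uri works without boto3 installed

    bucket, key = split_s3_uri(uri)
    boto3.client("s3").download_file(bucket, key, local_path)


# Example: download_artifact(my_model.model_data, "model.tar.gz")
```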


As the image URI above shows, JumpStart deploys the Falcon model with a *TGI 0.8.2* container.
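
The container version can also be read programmatically from the image URI. The following is a sketch that assumes the URI keeps the `account.dkr.ecr.region.amazonaws.com/repository:tag` layout seen above and that the tag embeds the version as `tgi<version>`; `parse_image_uri` is a hypothetical helper.

```python
import re
from typing import Dict, Optional

IMAGE_URI_PATTERN = re.compile(
    r"^(?P<account>\d+)\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_image_uri(uri: str) -> Dict[str, Optional[str]]:
    """Split an ECR image URI into its parts and pull the TGI version from the tag."""
    match = IMAGE_URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"unexpected image URI layout: {uri!r}")
    parts: Dict[str, Optional[str]] = dict(match.groupdict())
    # Assumption: the tag contains 'tgi' followed by a dotted version number
    tgi = re.search(r"tgi(\d+(?:\.\d+)*)", parts["tag"] or "")
    parts["tgi_version"] = tgi.group(1) if tgi else None
    return parts


info = parse_image_uri(
    "763104351884.dkr.ecr.us-east-1.amazonaws.com/"
    "huggingface-pytorch-tgi-inference:2.0.0-tgi0.8.2-gpu-py39-cu118-ubuntu20.04"
)
print(info["repository"], info["tgi_version"])
```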

In [14]:
endpoint_name = 'mt-jump-falcon-7b-instruct-g5'

In [17]:
predictor = my_model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type='ml.g5.2xlarge'
)

----------------!
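
A rough memory estimate helps when choosing the instance type. The sketch below assumes the model has about 7 billion parameters stored in bf16 (2 bytes each) and that `ml.g5.2xlarge` provides a single GPU with 24 GB of memory; both figures are assumptions for illustration, and the KV cache and activations need additional room on top of the weights.

```python
def weight_memory_gib(num_parameters: float, bytes_per_parameter: int) -> float:
    """Memory needed for the raw model weights, in GiB."""
    return num_parameters * bytes_per_parameter / 1024**3


# Assumption: ~7e9 parameters, bf16 weights (2 bytes per parameter)
falcon_7b_bf16_gib = weight_memory_gib(7e9, 2)

# Assumption: 24 GB (decimal) of GPU memory, converted to GiB
gpu_memory_gib = 24e9 / 1024**3

print(f"weights: {falcon_7b_bf16_gib:.1f} GiB, GPU: {gpu_memory_gib:.1f} GiB")
print("weights fit on one GPU:", falcon_7b_bf16_gib < gpu_memory_gib)
```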

If deployment fails with an error, delete the endpoint and its model as shown below.

In [15]:
from sagemaker import serializers, deserializers

# Re-attach to the endpoint by name so it can be deleted without the original predictor
# endpoint_name = 'mt-llama2-7b-g4dn'
del_predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer())
# Uncomment to delete the model first, then the endpoint
# del_predictor.delete_model()
# del_predictor.delete_endpoint()


In [18]:
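# Prompts are kept in Chinese as originally run; English glosses are in the comments
inputs = [
    {"inputs": "写一首关于交通信号灯的诗"},        # "Write a poem about traffic lights"
    {"inputs": "陨石为什么总能落在陨石坑里?"},      # "Why do meteorites always land in craters?"
    {"inputs": "为什么爸妈结婚没叫我参加婚礼?"}     # "Why wasn't I invited to my parents' wedding?"
]

response = predictor.predict(inputs[2])

# The TGI container returns a list such as [{'generated_text': '...'}]
# print("\n\nQuestion: ", inputs[0]["inputs"], "\nAnswer:\n", response[0]["generated_text"])
# response = predictor.predict(inputs[1])
# print("\n\nQuestion: ", inputs[1]["inputs"], "\nAnswer:\n", response[0]["generated_text"])
# response = predictor.predict(inputs[2])
# print("\n\nQuestion: ", inputs[2]["inputs"], "\nAnswer:\n", response[0]["generated_text"])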

In [19]:
response

[{'generated_text': "\nI'm sorry, I cannot answer that question as I do not have enough context. Can"}]
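
The reply above stops mid-sentence, which is consistent with a short default generation limit. TGI accepts generation settings under a `parameters` key next to `inputs`. The sketch below builds such a payload; `build_payload` is a hypothetical helper, and the default of 256 new tokens is an arbitrary choice for illustration.

```python
from typing import Any, Dict, Optional


def build_payload(
    prompt: str,
    max_new_tokens: int = 256,
    temperature: Optional[float] = None,
) -> Dict[str, Any]:
    """Build a request body with the prompt and its generation parameters."""
    parameters: Dict[str, Any] = {"max_new_tokens": max_new_tokens}
    if temperature is not None:
        # Only send temperature when the caller sets it explicitly
        parameters["temperature"] = temperature
    return {"inputs": prompt, "parameters": parameters}


payload = build_payload("Why is the sky blue?", max_new_tokens=128)
print(payload)
# With a live endpoint: response = predictor.predict(payload)
```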

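### Invoking the model through the SageMaker endpoint
Now that the model is deployed to a SageMaker endpoint, we can call it for inference by endpoint name. The deployed model therefore stays usable even after this notebook is stopped.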

In [17]:
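import json
import boto3

# 'sagemaker-runtime' is the canonical boto3 service name for invoking endpoints
client = boto3.client('sagemaker-runtime')

def query_endpoint(encoded_json):
    """Send a JSON payload to the endpoint and return the generated text."""
    response = client.invoke_endpoint(EndpointName=endpoint_name, ContentType='application/json', Body=encoded_json)
    model_predictions = json.loads(response['Body'].read())
    # The TGI container returns a list such as [{'generated_text': '...'}]
    generated_text = model_predictions[0]["generated_text"]
    return generated_text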
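
The response shape depends on what serves the model: the TGI container used above returns a list of `{'generated_text': ...}` objects, whereas a custom inference script may return a dict. A parser that tolerates both could be sketched as follows; the `outputs` key for the dict form is an assumption, and `extract_text` is a hypothetical helper.

```python
import json
from typing import Union


def extract_text(body: Union[str, bytes]) -> str:
    """Pull the generated text out of a JSON response body."""
    data = json.loads(body)
    # TGI-style response: [{"generated_text": "..."}]
    if isinstance(data, list) and data and isinstance(data[0], dict):
        if "generated_text" in data[0]:
            return data[0]["generated_text"]
    # Assumed custom-script response: {"outputs": "..."}
    if isinstance(data, dict) and "outputs" in data:
        return data["outputs"]
    raise ValueError(f"unrecognized response shape: {data!r}")


print(extract_text(b'[{"generated_text": "hello"}]'))
```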



In [22]:
%%time

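# Prompt gloss: "Information extraction: The 2022 World Cup champion was Argentina, and Messi was the MVP. Question: country name, person name. Answer:"
payload = {"inputs": "信息抽取：\n2022年世界杯的冠军是阿根廷队伍，梅西是MVP\n问题：国家名，人名\n答案：", "parameters": {"temperature": 0.01}}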
resp_test = query_endpoint(json.dumps(payload).encode('utf-8'))
print(resp_test)

国家名:阿根廷
人名:梅西
CPU times: user 0 ns, sys: 4.17 ms, total: 4.17 ms
Wall time: 556 ms


In [19]:
%%time
# Prompt gloss: "Why is the sky blue?"
payload = {"inputs": "天为什么是蓝色的？"}
answer = query_endpoint(json.dumps(payload).encode('utf-8'))
print(answer)
print(len(answer))

天空之所以呈现蓝色,是由于光的散射现象造成的。当太阳光穿过大气层时,光被大气中的分子和小颗粒散射,这些颗粒包括氧气、氮气和水蒸气等分子。这些分子吸收较短波长的光,如紫色和蓝色,而较长波长的光,如红色和橙色,则被分散得更少。

由于蓝色光的波长比红色光短,因此它更容易被分散,而在大气层中被散射的程度也更高,因此在天空的观察中,我们看到了大量的蓝色光。这也是为什么在日落或日出时,太阳光穿过更长的大气层路径,较多的光被散射为红色和橙色,天空呈现出橙色或红色的原因。
231
CPU times: user 3.32 ms, sys: 386 µs, total: 3.71 ms
Wall time: 8.22 s
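
Latency grows with the length of the reply: the 231-character answer above took 8.22 s, while the short extraction answer earlier took 556 ms. A quick throughput figure can be derived from the printed numbers; the helper name `chars_per_second` is hypothetical.

```python
def chars_per_second(num_chars: int, wall_seconds: float) -> float:
    """Average number of generated characters per second of wall time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return num_chars / wall_seconds


# Numbers taken from the cell output above
rate = chars_per_second(231, 8.22)
print(f"{rate:.1f} characters/s")
```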


In [20]:
%%time
# Prompt gloss: "What is 4 plus 2?"
payload = {"inputs": "4加2等于几？"}
resp_test = query_endpoint(json.dumps(payload).encode('utf-8'))
print(resp_test)

4加2等于6。
CPU times: user 3.48 ms, sys: 0 ns, total: 3.48 ms
Wall time: 446 ms


In [21]:
%%time
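# ChatGLM supports passing the chat history through `history`
# Prompt gloss: "And multiplied by 3?", with the history "Math: what is 3 plus 8? Answer:" -> "3 plus 8 equals 11."
payload = {
    "inputs": "再乘以3呢？",
    "history": [("数学计算：\n3加8等于几？\n答案：", "3加8等于11。")]}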
resp_test = query_endpoint(json.dumps(payload).encode('utf-8'))
print(resp_test)

如果你将11乘以3,那么答案是33。
CPU times: user 871 µs, sys: 3.07 ms, total: 3.94 ms
Wall time: 874 ms
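
For a multi-turn conversation, the history has to be carried from one request to the next. The class below is a minimal sketch of that bookkeeping. It assumes the endpoint accepts the `history` field in the form used above; `ChatSession` is a hypothetical helper, and whether the field is honored depends on the inference script behind the endpoint.

```python
from typing import Any, Dict, List, Tuple


class ChatSession:
    """Keep (question, answer) pairs and build payloads that include them."""

    def __init__(self) -> None:
        self.history: List[Tuple[str, str]] = []

    def build_payload(self, question: str) -> Dict[str, Any]:
        # JSON has no tuples, so each pair is sent as a two-element list
        return {"inputs": question, "history": [list(turn) for turn in self.history]}

    def record(self, question: str, answer: str) -> None:
        """Store a finished turn so later payloads include it."""
        self.history.append((question, answer))


session = ChatSession()
first = session.build_payload("What is 3 plus 8?")
# With a live endpoint: answer = query_endpoint(json.dumps(first).encode("utf-8"))
session.record("What is 3 plus 8?", "3 plus 8 equals 11.")
print(session.build_payload("And multiplied by 3?"))
```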


### Deleting the endpoint

In [24]:
predictor.delete_model()
predictor.delete_endpoint()
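
If one of the two calls fails, for example because the model was already removed, the other should still run so that no billable endpoint is left behind. The sketch below wraps both calls in the same order as the cell above; `cleanup` is a hypothetical helper that works with any object exposing `delete_model` and `delete_endpoint`.

```python
from typing import Any, List


def cleanup(predictor: Any) -> List[str]:
    """Delete the model, then the endpoint; return the errors that occurred."""
    errors: List[str] = []
    for step in (predictor.delete_model, predictor.delete_endpoint):
        try:
            step()
        except Exception as exc:  # keep going so the remaining step still runs
            errors.append(f"{step.__name__}: {exc}")
    return errors


# With a live endpoint: print(cleanup(predictor) or "deleted")
```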

In [9]:
from typing import Dict, Optional
from sagemaker.djl_inference import DeepSpeedModel

class MTDeepSpeedModel(DeepSpeedModel):
    """DeepSpeedModel that also writes `option.trust_remote_code` to the serving properties."""

    def __init__(
        self,
        model_id: str,
        role: str,
        trust_remote_code: bool = True,
        **kwargs,
    ):
        super().__init__(
            model_id, role, **kwargs,
        )
        self.trust_remote_code = trust_remote_code

    def generate_serving_properties(self, serving_properties=None) -> Dict[str, str]:
        # Start from the parent's properties, then add the extra option
        serving_properties = super().generate_serving_properties(
            serving_properties=serving_properties
        )
        serving_properties["option.trust_remote_code"] = self.trust_remote_code
        return serving_properties

In [10]:
model = MTDeepSpeedModel(
    model_id="s3://sagemaker-us-east-1-568765279027/mt_models_uploaded/THUDM--chatglm2-6b",
    role=role,
    number_of_partitions=1,
    trust_remote_code=True,
    max_tokens=4096,
    dtype="fp16",
    task="text-generation"
)

In [9]:
model.generate_serving_properties()

{'engine': 'DeepSpeed',
 'option.entryPoint': 'djl_python.deepspeed',
 'option.model_id': 's3://sagemaker-us-east-1-568765279027/mt_models_uploaded/THUDM--chatglm2-6b',
 'option.tensor_parallel_degree': 1,
 'option.task': 'text-generation',
 'option.dtype': 'fp16',
 'option.max_tokens': 4096,
 'option.triangular_masking': True,
 'option.return_tuple': True,
 'option.trust_remote_code': True}
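
DJL Serving reads these options from a `serving.properties` file made of `key=value` lines. To preview what such a file could look like for the dictionary above, a rendering sketch follows. `render_serving_properties` is a hypothetical helper, not part of the SDK, and writing booleans in lowercase is an assumption based on the usual properties-file convention.

```python
from typing import Any, Dict


def render_serving_properties(properties: Dict[str, Any]) -> str:
    """Render a properties dict as 'key=value' lines, one per entry."""
    lines = []
    for key, value in properties.items():
        if isinstance(value, bool):
            # Assumption: booleans are written as 'true' / 'false'
            value = str(value).lower()
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


print(render_serving_properties({
    "engine": "DeepSpeed",
    "option.tensor_parallel_degree": 1,
    "option.dtype": "fp16",
    "option.trust_remote_code": True,
}))
```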

In [None]:
from sagemaker.djl_inference import DeepSpeedModel

instance_type = "ml.g4dn.2xlarge"
# NOTE: `dir_name` is not defined in this section; set it before running this cell
endpoint_name = 'mt-'+dir_name+'-g4dn'

model = MTDeepSpeedModel(
    djl_version="0.23.0",
    model_id="s3://sagemaker-us-east-1-568765279027/mt_models_uploaded/THUDM--chatglm2-6b",
    role=role,
    number_of_partitions=1,
    trust_remote_code=True,
    max_tokens=4096,
    dtype="fp16",
    task="text-generation"
)


predictor = model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=600
)

----------------

Then deploy the model as a SageMaker endpoint.