# Qwen2-72B-Instruct-AWQ deployment guide
In this tutorial, you will deploy a Large Model Inference (LMI) container from the AWS Deep Learning Containers (DLC) to SageMaker and run inference with it.

Make sure the following permissions are granted before running the notebook:

- S3 bucket push access
- SageMaker access

## Step 1: Upgrade the SageMaker SDK and import dependencies

In [1]:
# %pip install sagemaker --upgrade  --quiet

Note: you may need to restart the kernel to use updated packages.


In [31]:
import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment
account_id = sess.account_id()  # account_id of the current SageMaker Studio environment

## Step 2: Start preparing model artifacts
In the LMI container, we expect the following artifacts to help set up the model:
- serving.properties (required): Defines the model server settings
- model.py (optional): A Python file that defines the core inference logic
- requirements.txt (optional): Any additional pip packages to install

In [32]:
# option_model_dir=s3://sagemaker-us-west-2-452145973879/models/Qwen2-72B-Instruct-AWQ/
# s3://sagemaker-us-west-2-452145973879/models/Qwen2-72B-Instruct-AWQ

In [69]:
%%writefile serving.properties
engine=Python
option.model_id=s3://sagemaker-us-west-2-452145973879/models/Qwen2-72B-Instruct-AWQ/
option.dtype=fp16
option.quantize=awq
option.task=text-generation
option.rolling_batch=vllm
option.tensor_parallel_degree=4
option.max_model_len=25456
option.device_map=auto
option.enable_streaming=true
# Adjust the following based on model size and instance type
option.max_rolling_batch_size=16
option.output_formatter=jsonlines

Writing serving.properties
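
Since `serving.properties` is a plain `key=value` file, it can also be generated from Python, which is convenient when the same notebook targets several instance types. The following is a minimal sketch; the helper name `render_properties` is hypothetical, and the settings shown are a subset of the ones written above.

```python
def render_properties(settings: dict) -> str:
    """Render a dict as key=value lines, one setting per line."""
    lines = []
    for key, value in settings.items():
        if isinstance(value, bool):
            value = str(value).lower()  # properties files use true/false
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


settings = {
    "engine": "Python",
    "option.rolling_batch": "vllm",
    "option.tensor_parallel_degree": 4,
    "option.enable_streaming": True,
}
text = render_properties(settings)
print(text)
```

The returned string can be written to `serving.properties` with a regular file write instead of the `%%writefile` magic.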


In [70]:
%%sh
mkdir mymodel
mv serving.properties mymodel/
tar czvf mymodel.tar.gz mymodel/
rm -rf mymodel

mymodel/
mymodel/serving.properties
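
The packaging step can also be done without shell commands. The sketch below uses the standard-library `tarfile` module to build the same layout (a top-level `mymodel/` folder containing `serving.properties`). It runs in a temporary directory, and the helper name `package_model_dir` is hypothetical.

```python
import tarfile
import tempfile
from pathlib import Path


def package_model_dir(src_dir: Path, archive_path: Path) -> list:
    """Create a gzip tarball with src_dir as the top-level folder; return member names."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(src_dir, arcname=src_dir.name)
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


# Demonstrate in a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "mymodel"
    model_dir.mkdir()
    (model_dir / "serving.properties").write_text("engine=Python\n")
    members = package_model_dir(model_dir, Path(tmp) / "mymodel.tar.gz")

print(members)
```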


## Step 3: Start building SageMaker endpoint
In this step, we will build a SageMaker endpoint from scratch.

### Getting the container image URI

[Large Model Inference available DLC](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers)


In [71]:
# image_uri = image_uris.retrieve(
#         framework="djl-deepspeed",  # "djl-deepspeed"
#         region=sess.boto_session.region_name,
#         version="0.27.0"
#     )

image_uri = "763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.28.0-lmi10.0.0-cu124-v1.0"
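
The hard-coded URI above is tied to `us-west-2`. A small helper can rebuild it for another region. This is only a sketch: the helper name `lmi_image_uri` is hypothetical, the default registry account is simply the one that appears in the URI above, and some regions use a different account ID, so check the linked list of available images before relying on it.

```python
def lmi_image_uri(region: str, tag: str, account: str = "763104351884") -> str:
    """Build a djl-inference image URI; the account ID is an assumption per region."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/djl-inference:{tag}"


uri = lmi_image_uri("us-west-2", "0.28.0-lmi10.0.0-cu124-v1.0")
print(uri)
```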

### Upload the artifact to S3 and create the SageMaker model

In [72]:
s3_code_prefix = "Qwen2-72B-Instruct-AWQ/code"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("mymodel.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

model = Model(image_uri=image_uri, model_data=code_artifact, role=role)

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-west-2-452145973879/Qwen2-72B-Instruct-AWQ/code/mymodel.tar.gz


### Create SageMaker endpoint

You need to specify the instance type to use and the endpoint name.

In [None]:
instance_type = "ml.g5.12xlarge"
endpoint_name = sagemaker.utils.name_from_base("lmi-model")

model.deploy(initial_instance_count=1,
             instance_type=instance_type,
             endpoint_name=endpoint_name
            )

# requests are serialized as JSON; no deserializer is set, so responses come back as raw bytes
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
)
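
Before deploying, it can help to sanity-check that the quantized weights fit on the chosen instance. The sketch below is a rough estimate under stated assumptions: a nominal 72 billion parameters (taken from the model name), 4 bits per weight for AWQ, and four 24 GiB GPUs on `ml.g5.12xlarge`, matching `option.tensor_parallel_degree=4`. It ignores quantization scales, any unquantized layers, activations, and the KV cache, so the real footprint is higher.

```python
def quantized_weight_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight size in GiB, ignoring quantization metadata."""
    return num_params * bits_per_weight / 8 / 2**30


weights_gib = quantized_weight_gib(72e9, 4)  # nominal parameter count, 4-bit AWQ
num_gpus = 4                                 # GPUs on ml.g5.12xlarge
gpu_mem_gib = 24                             # memory per GPU
per_gpu_gib = weights_gib / num_gpus         # even split under tensor parallelism

print(f"weights: {weights_gib:.1f} GiB, per GPU: {per_gpu_gib:.1f} of {gpu_mem_gib} GiB")
```

Whatever memory is left per GPU is what remains for the KV cache, which is one reason `option.max_model_len` and `option.max_rolling_batch_size` may need tuning per instance type.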

-----

## If the model has already been deployed

In [64]:
# import sagemaker

# endpoint_name = 'lmi-model-2024-06-11-09-18-56-099'
# predictor = sagemaker.Predictor(
#     endpoint_name=endpoint_name, 
#     sagemaker_session=sess,
#     serializer=serializers.JSONSerializer()
# )

# predictor

<sagemaker.base_predictor.Predictor at 0x7f733f5ed240>

## Step 4: Test and benchmark the inference

First, let's send a non-streaming request with a prompt formatted by the model's chat template.
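
The request in this step is formatted with the tokenizer's chat template. To make that step less opaque, here is a minimal sketch of a ChatML-style rendering, which is the layout Qwen2 instruct models are generally understood to use. The function `to_chatml` is hypothetical and only approximates the template shipped with the tokenizer; for real requests, keep using `tokenizer.apply_chat_template`.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Approximate a ChatML-style prompt from a list of role/content dicts."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


msgs = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi"},
]
chatml = to_chatml(msgs)
print(chatml)
```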

In [62]:
import json

from transformers import AutoTokenizer

model_dir = '/home/ec2-user/SageMaker/efs/Models/Qwen2-72B-Instruct-AWQ'
tokenizer = AutoTokenizer.from_pretrained(model_dir)


prompt = """'下面是一段agent与customer的对话\nagent: 咁嘅\ncustomer: 你好呀,我正在用你們的抽濕機\ncustomer: 用咗半年到啦咁而家呢佢壞咗呀佢呢\nagent: 嗨\ncustomer: 濕濕咗嘅時候呢嗰啲水呢直接流落去地下\ncustomer: 接不到那些水\nagent: 請問點稱呼呀?\nagent: 余先生,請問你可否給我抽濕口罩?\ncustomer: 我姓余的\nagent: 手機的型號\ncustomer: RADY200H\nagent: 多謝大家\ncustomer: Thank you for watching\nagent: OK\ncustomer: 拜拜\nagent: 又係咪check我嗰個水準\nagent: 放入去嘅時候,位置係正常\nagent: 唔好擺尾之類\ncustomer: 冇嘅冇排名\nagent: Check過冇問題嘅,如果係可以搵錢\nagent: 咁請問你嗰個抽濕機係買咗一年到?嘛係咪?\nagent: 上年九月抽濕機\ncustomer: 上年九月買的\nagent: 请问您的购买单和保用证是否存在?\ncustomer: 喺度嘅\nagent: 不存在的,如果我可以找师傅上来帮你看,麻烦你出示给他\nagent: 請問您的地址在哪裡?\ncustomer: 柴灣道111號\ncustomer: 高威港\nagent: 高威國幾座幾樓幾室\ncustomer: 第四座,四座八樓私宅\nagent: SipoCat\ncustomer: 係冇差\nagent: 麻煩你等一等我睇下如果係柴灣嘅可以幾時安排到過嚟先\ncustomer: 唔\nagent: 嚟緊星期二 23 號 12 點,\nagent: 屋企會唔會有忍\ncustomer: 哎呀\nagent: 咁我呢度幫你安排返啦係余生嘅到時上嚟之前師傅可以打返90997\nagent: 910 呢個電話高威國四座扮cat嘅啱啱\ncustomer: 我留兩個電話給你\nagent: 好呀你再講呀\ncustomer: 我太太梁小姐\nagent: 90256922,呢個係搵邊位??\nagent: 即係阿梁小姐\nagent: 打誰的電話先?\nagent: 由我講吧!\ncustomer: 你打這個先啦,我再留多個屋企電話比你啦\ncustomer: 以5687021\nagent: 屋企電話25687021,如果係到時打電話嘅時候,可以搵返梁小姐係\nagent: 902-56922\ncustomer: 係冇錯\nagent: 好得,我呢度幫你安排返\nagent: 23 號 12 點至 5 點過嚟\ncustomer: 星期五至五點,星期二。\nagent: 係,冇錯,有咩其他可以幫到你?\ncustomer: OK,好呀,\nagent: 都係打電話嚟,多謝!\ncustomer: 係冇?啦,\nagent: 拜拜\ncustomer: OK\n\n\n请从上面的对话中抽取如下信息，并以json格式返回，如果对话中没有提到相应字段的内容，则填""：\n {"customerType": "个人", "customerName": "溫先生", "phoneNumber1": "", "phoneNumber2": "", "email": "", "address": "楊逸居第三座,九樓,A7", "productBrand": "Toshiba", "productCategoryName": "", "serialNumber": "ER-GD400HK", "srType": "维修", "srSubType": "维修", "symptomDescription": "燈膽燒了", "customerRequest": "維修法", "refNo": "", "selloutInvoiceNum": "", "salesDealerName": "", "installDealerName": ""}\n \n注意只返回抽取的json格式的结果，不返回其它额外信息。 \n'"""
system_prompt = "You are a helpful assistant."

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt},
]

# Render the conversation with the model's chat template and open the assistant turn
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# print(prompt)

response = predictor.predict(
    {
        "inputs": prompt, 
         "parameters": 
         {
             "max_new_tokens": 4096,
             # Add any other sampling parameters as needed
             "temperature": 0.7,
             "top_k": 5,
             "top_p": 0.9,
             # "stop_token_ids": [], 
             # "stop": ["\nASSISTANT", "\nUSER:"],
             "include_stop_str_in_output": False,
             # "skip_special_tokens": True,
             "ignore_eos": False,
             "repetition_penalty": 1,
         }
    }
)

print('response: ', response.decode('utf-8'))
print('generated_text: ', json.loads(response.decode('utf-8'))['generated_text'])

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


response:  {"generated_text": "{\"customerType\": \"个人\", \"customerName\": \"余先生\", \"phoneNumber1\": \"90256922\", \"phoneNumber2\": \"25687021\", \"email\": \"\", \"address\": \"柴灣道111號高威港第四座四座八樓私宅\", \"productBrand\": \"\", \"productCategoryName\": \"抽濕機\", \"serialNumber\": \"RADY200H\", \"srType\": \"维修\", \"srSubType\": \"维修\", \"symptomDescription\": \"濕濕咗嘅時候呢嗰啲水呢直接流落去地下, 接不到那些水\", \"customerRequest\": \"\", \"refNo\": \"\", \"selloutInvoiceNum\": \"\", \"salesDealerName\": \"\", \"installDealerName\": \"\"}"}
generated_text:  {"customerType": "个人", "customerName": "余先生", "phoneNumber1": "90256922", "phoneNumber2": "25687021", "email": "", "address": "柴灣道111號高威港第四座四座八樓私宅", "productBrand": "", "productCategoryName": "抽濕機", "serialNumber": "RADY200H", "srType": "维修", "srSubType": "维修", "symptomDescription": "濕濕咗嘅時候呢嗰啲水呢直接流落去地下, 接不到那些水", "customerRequest": "", "refNo": "", "selloutInvoiceNum": "", "salesDealerName": "", "installDealerName": ""}
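
Note that the response is JSON whose `generated_text` field is itself a JSON string, so the extracted fields need a second decode. The sketch below shows one way to do that defensively. The helper name `extract_fields` is hypothetical, and the sample payload is a shortened version of the output above.

```python
import json


def extract_fields(response_bytes: bytes):
    """Decode the endpoint response, then decode generated_text as JSON if possible."""
    outer = json.loads(response_bytes.decode("utf-8"))
    text = outer["generated_text"]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None  # the model did not return valid JSON


inner = {"customerName": "余先生", "phoneNumber1": "90256922", "email": ""}
raw = json.dumps(
    {"generated_text": json.dumps(inner, ensure_ascii=False)}, ensure_ascii=False
).encode("utf-8")

fields = extract_fields(raw)
print(fields["customerName"], fields["phoneNumber1"])
```

Returning `None` on a decode failure lets the caller decide whether to retry the request or log the raw text.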


In [65]:
import sagemaker
import io
import json

class LineIterator:
    """
    A helper class for parsing the byte stream input. 
    
    The output of the model will be in the following format:
    ```
    b'{"outputs": [" a"]}\n'
    b'{"outputs": [" challenging"]}\n'
    b'{"outputs": [" problem"]}\n'
    ...
    ```
    
    While usually each PayloadPart event from the event stream will contain a byte array 
    with a full json, this is not guaranteed and some of the json objects may be split across
    PayloadPart events. For example:
    ```
    {'PayloadPart': {'Bytes': b'{"outputs": '}}
    {'PayloadPart': {'Bytes': b'[" problem"]}\n'}}
    ```
    
    This class accounts for this by appending incoming bytes to an internal buffer
    and returning complete lines (ending with a '\n' character) from '__next__'.
    It tracks the last read position to ensure that previously returned bytes
    are not exposed again.
    """
    
    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line and line[-1] == ord('\n'):
                self.read_pos += len(line)
                return line[:-1]  # strip the trailing newline
            try:
                chunk = next(self.byte_iterator)
            except StopIteration:
                if line:
                    # Stream ended without a trailing newline: return the remainder once
                    # instead of looping forever on the exhausted iterator
                    self.read_pos += len(line)
                    return line
                raise
            if 'PayloadPart' not in chunk:
                # chunk is a dict, so print it as a separate argument rather than concatenating
                print('Unknown event type:', chunk)
                continue
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk['PayloadPart']['Bytes'])
            
            
            
client = boto3.client('sagemaker-runtime')

body = {"inputs": prompt, "parameters": {"max_new_tokens":512}, "stream": True}
resp = client.invoke_endpoint_with_response_stream(EndpointName=endpoint_name, Body=json.dumps(body), ContentType="application/json")
event_stream = resp['Body']

# for line in LineIterator(event_stream):
    
#     resp = line.decode("utf-8")
#     if resp.startswith('{"generated_text": "'):
#         print(resp.strip('{"generated_text": "'), end='')
#     elif resp.endswith('"}'):
#         print(resp.strip('"}'), end='')
#     else:
#         print(resp, end='')
        
for line in LineIterator(event_stream):
    resp = json.loads(line)
    print(resp.get("token").get('text'), end='')

{"customerType": "个人", "customerName": "余先生", "phoneNumber1": "90256922", "phoneNumber2": "25687021", "email": "", "address": "柴灣道111號高威港第四座四座八樓私宅", "productBrand": "", "productCategoryName": "抽濕機", "serialNumber": "RADY200H", "srType": "维修", "srSubType": "维修", "symptomDescription": "濕濕咗嘅時候呢嗰啲水呢直接流落去地下, 接不到那些水", "customerRequest": "", "refNo": "", "selloutInvoiceNum": "", "salesDealerName": "", "installDealerName": ""}
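
To see why the buffering matters, the sketch below feeds a fake event stream in which one JSON line is split across two `PayloadPart` events. It uses a compact generator, `iter_json_lines`, which is a hypothetical stand-in for the class above, and it assumes each line carries the text under `token.text`, as in the loop above.

```python
import json


def iter_json_lines(events):
    """Yield one parsed JSON object per newline-terminated line across PayloadPart events."""
    buffer = b""
    for event in events:
        part = event.get("PayloadPart")
        if part is None:
            continue  # skip events that carry no payload bytes
        buffer += part["Bytes"]
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line:
                yield json.loads(line)
    if buffer.strip():
        yield json.loads(buffer)  # final line without a trailing newline


fake_events = [
    {"PayloadPart": {"Bytes": b'{"token": {"text": "Hel'}},
    {"PayloadPart": {"Bytes": b'lo"}}\n{"token": {"text": " world"}}\n'}},
]
text = "".join(obj["token"]["text"] for obj in iter_json_lines(fake_events))
print(text)
```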

## Streaming chat demo with Gradio

In [68]:
import gradio as gr
import json

from transformers import AutoTokenizer

model_dir = '/home/ec2-user/SageMaker/efs/Models/Qwen2-72B-Instruct-AWQ'
tokenizer = AutoTokenizer.from_pretrained(model_dir)

def response(message, history, system_prompt):
    
    # print('message:', message)
    # print('history:', history)
    
    messages = [
        {"role": "system", "content": system_prompt},
    ]
    
    for human, ai in history:
        messages.append({"role": "user", "content": human})
        messages.append( {"role": "assistant", "content": ai})
    
    messages.append({"role": "user", "content": message})
    
    # Render the conversation with the model's chat template and open the assistant turn
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    # print(f"prompt: {prompt}")
    
    body = {
            "inputs": prompt, 
             "parameters": 
             {
                 "do_sample": True,
                 "max_new_tokens": 4096,
                 # Add any other sampling parameters as needed
                 "temperature": 0.7,
                 "top_k": 20,
                 "top_p": 0.8,
                 # "stop_token_ids": [], 
                 # "stop": ["[INST]"],
                 "skip_special_tokens": True,
                 "ignore_eos": False,
                 "repetition_penalty": 1.05,
             },
            "stream": True        
    }
    
    resp = client.invoke_endpoint_with_response_stream(EndpointName=endpoint_name, Body=json.dumps(body), ContentType="application/json")
    event_stream = resp['Body']
    
    response_text = ''
    for line in LineIterator(event_stream):
        resp = json.loads(line)
        response_text += resp.get("token").get('text')
        
        yield response_text
    
demo = gr.ChatInterface(response, 
                        chatbot=gr.Chatbot(render_markdown=False), 
                        additional_inputs=[gr.Textbox("You are a helpful assistant.", label="System Prompt")],
                        title='Chatbot (Qwen2-72B-Instruct-AWQ)',
                        description='Welcome! I am your chatbot. Ask me anything.')

demo.launch(share=True)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Running on local URL:  http://127.0.0.1:7867
Running on public URL: https://efcaf1b498754f8cb1.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
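
The chat callback above rebuilds the full conversation on every turn. Pulling that logic into a small pure function makes it easy to test without Gradio or an endpoint. This is a sketch: `build_messages` is a hypothetical helper, and it assumes the history arrives as (user, assistant) pairs, which is what the callback above iterates over.

```python
def build_messages(system_prompt, history, message):
    """Convert (user, assistant) history pairs plus a new message into chat messages."""
    messages = [{"role": "system", "content": system_prompt}]
    for human, ai in history:
        messages.append({"role": "user", "content": human})
        messages.append({"role": "assistant", "content": ai})
    messages.append({"role": "user", "content": message})
    return messages


history = [("Hello", "Hi, how can I help?")]
msgs = build_messages("You are a helpful assistant.", history, "What is AWQ?")
print([m["role"] for m in msgs])
```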




In [None]:
# import gradio as gr
# import json
# from transformers import AutoModelForCausalLM, AutoTokenizer

# model_dir = '/home/ec2-user/SageMaker/efs/Models/Half-NSFW_Noromaid-7b'
# tokenizer = AutoTokenizer.from_pretrained(model_dir)

# # hyperparameters for llm
# parameters = {
#     "do_sample": True,
#     "top_p": 0.7,
#     "temperature": 0.7,
#     "top_k": 50,
#     "max_new_tokens": 256,
#     "repetition_penalty": 1.03,
#     "stop": ["<|endoftext|>"]
#   }

# with gr.Blocks() as demo:
#     gr.Markdown("## Chat with Mistral-7B LLM using Amazon SageMaker")
#     with gr.Column():
#         chatbot = gr.Chatbot()
#         with gr.Row():
#             with gr.Column():
#                 message = gr.Textbox(label="Chat Message Box", placeholder="Chat Message Box", show_label=False)
#             with gr.Column():
#                 with gr.Row():
#                     submit = gr.Button("Submit")
#                     clear = gr.Button("Clear")

#     def respond(message, chat_history):
#         # convert chat history to prompt
#         converted_chat_history = ""
#         if len(chat_history) > 0:
#           for c in chat_history:
#             converted_chat_history += f"<|prompter|>{c[0]}<|endoftext|><|assistant|>{c[1]}<|endoftext|>"
#         prompt = f"{converted_chat_history}<|prompter|>{message}<|endoftext|><|assistant|>"

#         # send request to endpoint
#         llm_response = predictor.predict({"inputs": prompt, "parameters": parameters})
#         decoded_response = llm_response.decode('utf-8')
#         response_json = json.loads(decoded_response)
#         parsed_response = response_json["generated_text"]

#         # remove prompt from response
#         # parsed_response = llm_response[0]["generated_text"][len(prompt):]
#         chat_history.append((message, parsed_response))
#         return "", chat_history

#     submit.click(respond, [message, chatbot], [message, chatbot], queue=False)
#     clear.click(lambda: None, None, chatbot, queue=False)

# demo.launch(share=True)

In [None]:
import gradio as gr
import json

from transformers import AutoModelForCausalLM, AutoTokenizer


model_dir = '/home/ec2-user/SageMaker/efs/Models/Half-NSFW_Noromaid-7b'
tokenizer = AutoTokenizer.from_pretrained(model_dir)

def response(message, history, system_prompt):
    
    print('message:', message)
    print('history:', history)
    
    # """
    # <s>[INST] Instruction [/INST] Model answer</s>[INST] Follow-up instruction [/INST]
    # """
    # prompt = f"[INST]{system_prompt}[/INST]\n"
    
#     messages = [
#     {"role": "user", "content": "What is your favourite condiment?"},
#     {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
#     {"role": "user", "content": "Do you have mayonnaise recipes?"}
# ]
    
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Who are you?"},
        {"role": "assistant", "content": "I'm your bad girl."}
    ]
    
    for human, ai in history:
        messages.append({"role": "user", "content": human})
        messages.append( {"role": "assistant", "content": ai})
    
    messages.append({"role": "user", "content": message})
    
    prompt = tokenizer.apply_chat_template(messages, tokenize=False)
    
    print(f"prompt: {prompt}")
    
    response = predictor.predict(
        {
            "inputs": prompt, 
             "parameters": 
             {
                 "max_tokens": 512,
                 # Add any other sampling parameters as needed
                 "temperature": 0.7,
                 "top_k": 5,
                 "top_p": 0.9,
                 "stop_token_ids": [], 
                 "stop": ["[INST]"],
                 "skip_special_tokens": True,
                 "ignore_eos": False,
                 "repetition_penalty": 1,
             }
        }
    )
    
    # response = requests.post(f'{url}?streaming=true', json=data, stream=True, timeout=200)
    print(f"response: {response}")
    decoded_response = response.decode('utf-8')
    response_json = json.loads(decoded_response)
    parsed_response = response_json["generated_text"]
    
    # chat_history.append((message, parsed_response))
    # return "", chat_history
    return parsed_response
    
    # if response.status_code == 200:
#     #     for chunk in response.iter_content(chunk_size=1024):
#     #         if chunk:
#     #             response_text += chunk.decode('utf-8')
#     #             yield response_text
    
#     else:
#         # print(f"Error in streaming service: {response.status_code}")
#         yield ''

demo = gr.ChatInterface(response, 
                        chatbot=gr.Chatbot(render_markdown=False), 
                        additional_inputs=[gr.Textbox("""Pretend you have the following role.\nCharacter name: Tifa Lockhart\nSpecies: human\nGender: female
Mind: dumb, innocent, happy, indulgent, submissive\nPersonality: dumb, innocent, happy, indulgent, submissive\nFeatures: long  dark hair, red eyes, athletic, slender, huge breasts, curvy, great ass
Clothes: sleeveless top, no bra, exposed midriff, miniskirt\nSexual orientation: straight\nHeight: 162 centimeters tall\nAge: 25\nLoves: cum, sex, being molested, being groped, perverts, her friends, pretty much everything else
Hates: hawaiian pizza\nDescription: \nmember of the anti-Shinra militant group, manages a bar located in the Sector 7 slums, knows combat techniques, doesn't uses bras, always smiling, has a crush on Cloud Strife, 
Cloud doesn't corresponds her, accept every petition with a smile no matter what\nshe don't mind people taking advantage of her, will comply anything Someone asked her to do.
""", label="System Prompt")],
                        title='Chatbot (Half-NSFW_Noromaid-7b-int4/)',
                        description='Welcome! I am your chatbot. Ask me anything.')

demo.launch(share=True)

## Clean up the environment

In [None]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()
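
If one of the delete calls above raises (for example because a resource was already removed), the remaining calls are skipped. A best-effort wrapper keeps going and reports what failed. The sketch below uses stand-in actions so it runs anywhere; `run_cleanup` is a hypothetical helper, not part of the SageMaker SDK.

```python
def run_cleanup(steps):
    """Run every (name, action) pair; return a list of (name, error) for failed steps."""
    failures = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:  # keep going so later resources still get deleted
            failures.append((name, exc))
    return failures


# Stand-in actions for illustration only.
def fail():
    raise RuntimeError("endpoint config not found")


deleted = []
failures = run_cleanup([
    ("endpoint", lambda: deleted.append("endpoint")),
    ("endpoint_config", fail),
    ("model", lambda: deleted.append("model")),
])
print(deleted, [name for name, _ in failures])
```

In the notebook, each action would wrap one of the three delete calls above, for example `lambda: sess.delete_endpoint(endpoint_name)`.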