# Deploy open-source Large Language Models on Amazon SageMaker

#### Before we start
This notebook runs with the <mark>Data Science 3.0</mark> kernel.

In [None]:
!pip install tiktoken
import json
import sagemaker

## Section 1: Deploy a Llama model from SageMaker JumpStart

SageMaker JumpStart provides pretrained, open-source models for a wide range of problem types to help you get started with machine learning. You can incrementally train and tune these models before deployment. JumpStart also provides solution templates that set up infrastructure for common use cases, and executable example notebooks for machine learning with SageMaker. You can also access JumpStart models using the SageMaker Python SDK. For information about how to use JumpStart models programmatically, see [Use SageMaker JumpStart Algorithms with Pretrained Models](https://sagemaker.readthedocs.io/en/stable/overview.html#use-built-in-algorithms-with-pre-trained-models-in-sagemaker-python-sdk).

You can access the pretrained models, solution templates, and examples through the JumpStart landing page in Amazon SageMaker Studio. 

This module gives you an opportunity to explore the capabilities of the Llama 2 model on language tasks such as abstractive question answering and text summarization.

To get started with the example, in the left-hand navigation pane, go to Home and, under SageMaker JumpStart, choose Models, notebooks, solutions. You’re presented with a range of solutions, foundation models, and other artifacts that can help you get started with a specific model or a specific business problem or use case. If you want to experiment in a particular area, you can use the search function. Or you can simply browse the artifacts to find the relevant model or business solution for your needs. To start exploring the Llama 2 models, complete the following steps:

To deploy Llama 2, go to the Foundation Models section. In the search bar, search for the Llama model and select Llama-2-7b. You can use the following screenshot to follow along step by step.

![image](./image.JPG)

Click View model. A new window opens where you can configure and deploy the model.

Under Deployment Configuration, choose the instance type ml.g5.2xlarge, specify your endpoint name (or leave the default), and then click Deploy.

You can also deploy the model using the SageMaker Python SDK by clicking on the <mark>Notebook</mark> tab and opening the notebook that is shown.

Once the model endpoint is in service, you can use the additional sample notebook to run inference against the deployed model.
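As a rough illustration of what an inference request to a text-generation endpoint can look like, the sketch below builds a JSON request body and extracts the generated text from a JSON response. The request schema (`inputs` plus a `parameters` object) and the response keys (`generation` or `generated_text`) are assumptions based on common text-generation containers, and the helper names `build_llama_payload` and `extract_generation` are hypothetical. Check the sample notebook for the exact format your endpoint expects.

```python
import json


def build_llama_payload(prompt, max_new_tokens=64, top_p=0.9, temperature=0.6):
    """Build a JSON request body (assumed schema: 'inputs' + 'parameters')."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    body = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "top_p": top_p,
            "temperature": temperature,
        },
    }
    return json.dumps(body)


def extract_generation(raw_response):
    """Return the generated text from a JSON response string.

    Accepts either a list of results or a single result object, and looks
    for one of two assumed key names.
    """
    parsed = json.loads(raw_response)
    first = parsed[0] if isinstance(parsed, list) else parsed
    for key in ("generation", "generated_text"):
        if key in first:
            return first[key]
    raise KeyError("no generated text found in response")


# Example with a mocked response string (no endpoint call is made here)
request_body = build_llama_payload("Summarize: SageMaker JumpStart hosts pretrained models.")
print(request_body)
print(extract_generation('[{"generation": "JumpStart offers pretrained models."}]'))
```

The request body produced here is what you would pass to the endpoint client of your choice; the mocked response only demonstrates the parsing step.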

In this section we explored how to deploy LLMs using SageMaker JumpStart, which uses the SageMaker JumpStart containers. In the next section we will explore using the Hugging Face Deep Learning Container to deploy supported LLMs.

## Section 2: Deploy FLAN-T5-XXL using the Hugging Face DLC
In this section, we will deploy the open-source FLAN-T5-XXL model on SageMaker for real-time inference. For this deployment we will be using the Hugging Face LLM Inference Deep Learning Container (DLC).

#### What is Hugging Face LLM Inference DLC?
Hugging Face LLM DLC is a new purpose-built Inference Container to easily deploy LLMs in a secure and managed environment. The DLC is powered by Text Generation Inference (TGI), an open-source, purpose-built solution for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation using Tensor Parallelism and dynamic batching for the most popular open-source LLMs, including StarCoder, BLOOM, GPT-NeoX, Llama, and T5. Text Generation Inference is already used by customers such as IBM, Grammarly, and the Open-Assistant initiative. It implements optimizations for all supported model architectures, including:

* Tensor Parallelism and custom CUDA kernels
* Optimized transformers code for inference using [flash-attention](https://github.com/HazyResearch/flash-attention) on the most popular architectures
* Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
* [Continuous batching of incoming requests](https://github.com/huggingface/text-generation-inference/tree/main/router) for increased total throughput
* Accelerated weight loading (start-up time) with [safetensors](https://github.com/huggingface/safetensors)
* Logits warpers (temperature scaling, top-k, repetition penalty, ...)
* Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
* Stop sequences, Log probabilities
* Token streaming using Server-Sent Events (SSE)
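
To make the "logits warpers" item more concrete, here is a minimal pure-Python sketch of temperature scaling followed by top-k filtering. It illustrates the general technique only and is not TGI's implementation; the function names `warp_logits` and `softmax` are hypothetical.

```python
import math


def warp_logits(logits, temperature=1.0, top_k=None):
    """Apply temperature scaling, then keep only the top_k highest logits.

    Logits outside the top k are set to -inf so that they receive zero
    probability after softmax. Ties at the threshold are all kept.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    if top_k is not None and 0 < top_k < len(scaled):
        threshold = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [x if x >= threshold else -math.inf for x in scaled]
    return scaled


def softmax(logits):
    """Convert logits to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]
print(softmax(warp_logits(logits)))                            # unchanged distribution
print(softmax(warp_logits(logits, temperature=0.5)))           # sharper distribution
print(softmax(warp_logits(logits, temperature=0.5, top_k=2)))  # last token masked out
```

A temperature below 1 concentrates probability on the most likely tokens, which is why the low `temperature` values used later in this notebook give near-deterministic output.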

Officially supported model architectures are currently: 
* [BLOOM](https://huggingface.co/bigscience/bloom) / [BLOOMZ](https://huggingface.co/bigscience/bloomz)
* [MT0-XXL](https://huggingface.co/bigscience/mt0-xxl)
* [Galactica](https://huggingface.co/facebook/galactica-120b)
* [SantaCoder](https://huggingface.co/bigcode/santacoder)
* [GPT-Neox 20B](https://huggingface.co/EleutherAI/gpt-neox-20b) (joi, pythia, lotus, rosey, chip, RedPajama, open assistant)
* [FLAN-T5-XXL](https://huggingface.co/google/flan-t5-xxl) (T5-11B)
* [Llama](https://github.com/facebookresearch/llama) (vicuna, alpaca, koala)
* [Starcoder](https://huggingface.co/bigcode/starcoder) / [SantaCoder](https://huggingface.co/bigcode/santacoder)
* [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) / [Falcon 40B](https://huggingface.co/tiiuae/falcon-40b)

With the new Hugging Face LLM Inference DLCs on Amazon SageMaker, AWS customers can benefit from the same technologies that power highly concurrent, low latency LLM experiences like [HuggingChat](https://hf.co/chat), [OpenAssistant](https://open-assistant.io/), and Inference API for LLM models on the Hugging Face Hub. 

Let's get started!

To deploy FLAN-T5-XXL to Amazon SageMaker, we set up our environment and define our endpoint configuration, including the HF_MODEL_ID, instance_type, etc. We will use an ml.g5.12xlarge instance type. We then pass the URI of the Hugging Face DLC image to create our model object, ready to be deployed.
This is an example of how to deploy open-source LLMs to Amazon SageMaker for inference using the new Hugging Face LLM Inference Container.
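
Before deploying, it can help to sanity-check the container configuration. The sketch below builds the environment dictionary used in this section and verifies that the input length stays below the total token budget, since the total covers both the input and the generated text. The helper name `build_tgi_env` is hypothetical, and the strict "less than" check is an assumption about what the container accepts.

```python
import json


def build_tgi_env(model_id, num_gpus, max_input_length, max_total_tokens, quantize=None):
    """Build the container environment as a dict of strings.

    Environment variable values must be strings, so numbers are
    serialized with json.dumps.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if max_input_length >= max_total_tokens:
        raise ValueError("max_input_length must be smaller than max_total_tokens")
    env = {
        "HF_MODEL_ID": model_id,
        "SM_NUM_GPUS": json.dumps(num_gpus),
        "MAX_INPUT_LENGTH": json.dumps(max_input_length),
        "MAX_TOTAL_TOKENS": json.dumps(max_total_tokens),
    }
    if quantize:
        env["HF_MODEL_QUANTIZE"] = quantize
    return env


config_check = build_tgi_env("google/flan-t5-xxl", 4, 4096, 5120)
print(config_check)
```

With these values, up to 5120 - 4096 = 1024 tokens remain for generation when the input uses its full budget.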


In [None]:
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.g5.12xlarge"
number_of_gpu = 4
health_check_timeout = 900
role = sagemaker.get_execution_role()

In [None]:
# Define Model and Endpoint configuration parameter
config = {
  'HF_MODEL_ID': "google/flan-t5-xxl", # model_id from hf.co/models
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
  'MAX_INPUT_LENGTH': json.dumps(4096),  # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(5120),  # Max length of the generation (including input text)
  #'HF_MODEL_QUANTIZE': "bitsandbytes", # uncomment to quantize
}


from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="0.8.2"
)

# print ecr image uri
print(f"llm image uri: {llm_image}")
# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config
)

# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  # volume_size=400, # If using an instance with local SSD storage, volume_size must be None, e.g. p4 but not p3
  container_startup_health_check_timeout=health_check_timeout, # 15 minutes (900 s) to be able to load the model
)


In [None]:
from sagemaker.huggingface.model import HuggingFacePredictor

# Replace with the name of your own endpoint (for example, llm.endpoint_name
# from the deployment cell above)
endpoint_name = 'huggingface-pytorch-tgi-inference-2023-07-24-23-59-42-211'
predictor = HuggingFacePredictor(endpoint_name=endpoint_name)

# Generation parameters sent with every request
parameters = {
    "do_sample": True,
    "top_p": 0.7,
    "temperature": 0.2,
    "top_k": 50,
    "max_new_tokens": 256,
    "repetition_penalty": 1.03,
    "stop": ["<|endoftext|>"],
}


# Prompt template functions
def intent_template(text, predictor):
    """Ask the model for the topic of a customer conversation."""
    prompt = f"""
        Extract the topic of the customer conversation 
        "Input:\n\n{text}"
        Output:
    """
    print(prompt)

    response = predictor.predict({"inputs": prompt, "parameters": parameters})
    return response[0]["generated_text"]


def summary_template(text, predictor):
    """Ask the model for a short summary of the customer's request."""
    prompt = f"""
        Provide a short summary of why the customer made contact. Do not provide answers.
        "Input:\n\n{text}"
        Output:
    """

    response = predictor.predict({"inputs": prompt, "parameters": parameters})
    return response[0]["generated_text"]


# This is how you would run inference on your endpoint
text = "i have faced some issues when booking a flight from Melbourne to Amsterdam. When i was trying to use the online booking system, it times out and i coudln't get back to it?can you please help me?"

print(intent_template(text, predictor))
print(summary_template(text, predictor))


The cell above shows how to invoke the FLAN model directly on the SageMaker endpoint, without going through a RESTful API request. You can explore this approach in your own time.

## Section 3: Prompting FLAN-T5-XXL to extract intents
In the following cells we will interact with the deployed FLAN-T5-XXL model and use it to extract intents from chatbot conversations.
We first define a few simple prompt templates as text strings that accept user input text (the chatbot conversation in this scenario).
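
One way to keep prompt templates simple and reusable is to store them as format strings and fill in the conversation text at call time. The sketch below assumes a hypothetical `PROMPT_TEMPLATES` dictionary and `render_prompt` helper; the prompt wording mirrors the templates used in this section.

```python
# Hypothetical registry of prompt templates; {text} is replaced at render time
PROMPT_TEMPLATES = {
    "intent": (
        "Extract the topic of the customer conversation\n"
        "Input:\n\n{text}\n"
        "Output:"
    ),
    "summary": (
        "Provide a short summary of why the customer made contact. "
        "Do not provide answers.\n"
        "Input:\n\n{text}\n"
        "Output:"
    ),
}


def render_prompt(task, text):
    """Fill the template for `task` with the (stripped) conversation text."""
    if task not in PROMPT_TEMPLATES:
        raise KeyError(f"unknown task: {task}")
    return PROMPT_TEMPLATES[task].format(text=text.strip())


print(render_prompt("intent", "  My booking timed out and I lost my seat selection.  "))
```

Keeping the templates in one place makes it easy to compare how different wordings affect the model's output.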

In [None]:
import requests

def query_endpoint_with_json_payload(url, data, payload):
    response = requests.post(
        url,
        headers=data,
        json=payload,
    )
    return response

# def parse_response_multiple_texts(query_response):
#     return query_response.json()[0]["generated_text"]

In [None]:
question = "Which instances can I use with Managed Spot Training in SageMaker?"

In [None]:
_MODEL_CONFIG_ = {
    "Flan_T5_XXL": {
        "aws_region": "us-east-1",
        "endpoint_name": "demo-FlanT5-Endpoint",
        "api_url": "https://kj72lukej0.execute-api.us-east-1.amazonaws.com/prod/flan",
        "headers": {
            "Content-Type": "application/json",
            "Accept": "application/json",
            "Authorization": "xxx",  # insert the authentication code
        },
        # "parse_function": parse_response_multiple_texts,
        # "prompt": """{context}\n\nGiven the above context, answer the following question:\n{question}\nAnswer: """,
    },
}

payload = {
    "text_inputs": question,
    "do_sample": True,
    "top_p": 0.7,
    "temperature": 0.2,
    "top_k": 50,
    "max_new_tokens": 256,
    "repetition_penalty": 1.03
}
#print(payload)
model_id="Flan_T5_XXL"
api_url = _MODEL_CONFIG_[model_id]["api_url"]
data=  _MODEL_CONFIG_[model_id]["headers"]

query_response = query_endpoint_with_json_payload(
        api_url,
        data,
        payload,
        )
import json

print(query_response.__attrs__)
print(json.dumps(query_response.json(), indent=2))


# generated_texts = _MODEL_CONFIG_[model_id]["parse_function"](query_response)
# print(f"For model: {model_id}, the generated output is: {generated_texts}\n")

In [None]:
text= "I have faced some issues when booking a flight from Melbourne to Amsterdam. When i was trying to use the online booking system, it times out and i coudln't get back to it?can you please help me?" 

parameters= {
    "do_sample": True,
    "top_p": 0.7,
    "temperature": 0.2,
    "top_k": 50,
    "max_new_tokens": 256,
    "repetition_penalty": 1.03,
     }
text_inputs = f"""
    Extract the topic of the customer conversation 
    "Input:\n\n{text}"
    Output:
    """
# Merge the prompt and the generation parameters into a single request payload
payload = {"text_inputs": text_inputs, **parameters}
#print(payload)

query_response = query_endpoint_with_json_payload(
        api_url,
        data,
        payload,
        )
print(json.dumps(query_response.json(), indent=2))

In [None]:
def intent_template(text, parameters, query_response=None):
    """Ask the model for the topic of a customer conversation.

    `query_response` is unused and kept only so existing calls still work.
    """
    # Build the prompt as a plain string (a trailing comma after the
    # closing quotes would turn it into a tuple)
    text_inputs = f"""
    Extract the topic of the customer conversation 
    "Input:\n\n{text}"
    Output:
    """
    payload = {"text_inputs": text_inputs, **parameters}

    response = query_endpoint_with_json_payload(api_url, data, payload)
    return json.dumps(response.json(), indent=2)


def summary_template(text, parameters, query_response=None):
    """Ask the model for a short summary of the customer's request.

    `query_response` is unused and kept only so existing calls still work.
    """
    text_inputs = f"""
    Provide a short summary of why the customer made contact. Do not provide answers.
    "Input:\n\n{text}"
    Output:
    """
    payload = {"text_inputs": text_inputs, **parameters}

    response = query_endpoint_with_json_payload(api_url, data, payload)
    return json.dumps(response.json(), indent=2)

In [None]:
text= "I have faced some issues when booking a flight from Melbourne to Amsterdam. When i was trying to use the online booking system, it times out and i coudln't get back to it?can you please help me?" 

parameters= {
    "do_sample": True,
    "top_p": 0.7,
    "temperature": 0.2,
    "top_k": 50,
    "max_new_tokens": 256,
    "repetition_penalty": 1.03,
     }

print(intent_template(text, parameters, query_response))
print(summary_template(text, parameters, query_response))


## Section 4: Prompt engineering with specific intent labels
In this section we will perform additional prompt engineering: we pass the specific intent labels to the model and ask it to pick one of those intents when extracting the intent from the chat session.
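
Because a generative model is not guaranteed to return one of the supplied labels verbatim, it can be useful to map its output back onto the label set. The sketch below builds a label-constrained prompt and then matches the model's answer against the candidate labels, falling back to fuzzy matching for small spelling differences. The helper names `build_label_prompt` and `match_label`, the example labels, and the 0.8 similarity cutoff are assumptions for illustration.

```python
import difflib


def build_label_prompt(candidate_labels, customer_text):
    """Build a prompt that asks the model to choose one of the given labels."""
    return (
        "Classify the input text only from the labels listed below\n"
        f'"Labels": {list(candidate_labels)}\n'
        f'"Input": {customer_text.strip()}\n'
        '"Output":'
    )


def match_label(model_output, candidate_labels, cutoff=0.8):
    """Map raw model output to one of the candidate labels, or None."""
    cleaned = model_output.strip().strip('"').lower()
    lookup = {label.lower(): label for label in candidate_labels}
    if cleaned in lookup:
        return lookup[cleaned]
    # Fall back to the closest label if it is similar enough
    close = difflib.get_close_matches(cleaned, list(lookup), n=1, cutoff=cutoff)
    return lookup[close[0]] if close else None


labels = ["Baggage", "Booking", "Refund"]  # hypothetical labels
print(build_label_prompt(labels, "I have just landed and can not find my luggage."))
print(match_label(" baggage ", labels))
print(match_label("weather", labels))
```

Returning `None` for unmatched output lets the calling code decide whether to retry, fall back to a default intent, or flag the conversation for review.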

In [None]:
import pandas as pd
import tiktoken
from tqdm import tqdm
import json
import pickle

In [None]:
def num_tokens_from_string(string: str, encoding_name: str="cl100k_base") -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [None]:
reasons_df = pd.read_csv("genai-workshop/data/Reasons.csv")
reasons_df.head()


In [None]:
reasons_df['intent'].unique()

In [None]:
reason_tree = reasons_df.groupby('intent')['intent'].unique().apply(list).to_dict()
sub_intent_tree = reasons_df.groupby('intent')['sub_intent'].unique().apply(list).to_dict()

In [None]:
print(list(reason_tree.values())[:5])
print(list(sub_intent_tree.values())[:5])

In [None]:
def intent_template(candidate_labels, customer_feedback, parameters, query_response=None):
    """Ask the model to classify the input text into one of the candidate labels.

    `query_response` is unused and kept only so existing calls still work.
    """
    text_inputs = f"""
        Classify the input text only from the labels listed below
        "Labels": {candidate_labels}
        "Input": {customer_feedback}
        "Output":"""

    # Merge the prompt and the generation parameters into a single request payload
    payload = {"text_inputs": text_inputs, **parameters}

    response = query_endpoint_with_json_payload(api_url, data, payload)
    return json.dumps(response.json(), indent=2)

In [None]:
results = []

conv= "I have just landed and can not find my luggage. "

parameters= {
    "do_sample": True,
    "top_p": 0.7,
    "temperature": 0.2,
    "top_k": 50,
    "max_new_tokens": 256,
    "repetition_penalty": 1.03,
     }

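# First pass: pick the top-level intent from all known intents
call_intent = intent_template(list(reason_tree.keys()), conv, parameters, query_response)

# Second pass: pick a sub-intent from the labels listed under the predicted intent
predicted_intent = json.loads(call_intent).get("generated_texts")[0].strip()
call_sub_intent = intent_template(sub_intent_tree.get(predicted_intent), conv, parameters, query_response)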

results.append({
        "call_intent":call_intent, 
         "call_sub_intent":call_sub_intent
           })

results