# Deploy FLAN-T5-XXL using the Hugging Face LLM DLC


This notebook runs with the <mark>Data Science</mark> kernel.


In this section, we deploy the open-source FLAN-T5-XXL model on Amazon SageMaker for real-time inference. For this deployment we use the Hugging Face LLM Inference Deep Learning Container (DLC).

#### What is Hugging Face LLM Inference DLC?
The Hugging Face LLM DLC is a purpose-built inference container for deploying LLMs in a secure, managed environment. The DLC is powered by Text Generation Inference (TGI), an open-source solution for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation using tensor parallelism and dynamic batching for the most popular open-source LLMs, including StarCoder, BLOOM, GPT-NeoX, Llama, and T5. Text Generation Inference is already used by customers such as IBM, Grammarly, and the Open-Assistant initiative. It implements optimizations for all supported model architectures, including:

* Tensor parallelism and custom CUDA kernels
* Optimized transformers code for inference using [flash-attention](https://github.com/HazyResearch/flash-attention) on the most popular architectures
* Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
* [Continuous batching of incoming requests](https://github.com/huggingface/text-generation-inference/tree/main/router) for increased total throughput
* Accelerated weight loading (start-up time) with [safetensors](https://github.com/huggingface/safetensors)
* Logits warpers (temperature scaling, top-k, repetition penalty, ...)
* Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
* Stop sequences, Log probabilities
* Token streaming using Server-Sent Events (SSE)
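
To make the "logits warpers" item concrete, here is a minimal pure-Python sketch of temperature scaling followed by top-k filtering. It illustrates the general technique only and is not TGI's implementation; the function names `apply_temperature`, `top_k_filter`, and `softmax` and the example logits are assumptions chosen for illustration.

```python
import math


def apply_temperature(logits, temperature):
    """Divide logits by the temperature; values below 1 sharpen the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return [x / temperature for x in logits]


def top_k_filter(logits, k):
    """Keep the k largest logits and mask the rest with -inf.

    Ties at the threshold are all kept, so more than k entries can survive.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]


def softmax(logits):
    """Convert logits to probabilities; masked (-inf) entries get probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token logits for a 4-token vocabulary
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(top_k_filter(apply_temperature(logits, 0.5), k=2))
print(probs)
```

Lower temperatures and smaller `k` values both concentrate probability on the most likely tokens, which is why low `temperature` settings produce near-deterministic output.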

Officially supported model architectures are currently: 
* [BLOOM](https://huggingface.co/bigscience/bloom) / [BLOOMZ](https://huggingface.co/bigscience/bloomz)
* [MT0-XXL](https://huggingface.co/bigscience/mt0-xxl)
* [Galactica](https://huggingface.co/facebook/galactica-120b)
* [SantaCoder](https://huggingface.co/bigcode/santacoder)
* [GPT-Neox 20B](https://huggingface.co/EleutherAI/gpt-neox-20b) (joi, pythia, lotus, rosey, chip, RedPajama, open assistant)
* [FLAN-T5-XXL](https://huggingface.co/google/flan-t5-xxl) (T5-11B)
* [Llama](https://github.com/facebookresearch/llama) (vicuna, alpaca, koala)
* [StarCoder](https://huggingface.co/bigcode/starcoder)
* [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) / [Falcon 40B](https://huggingface.co/tiiuae/falcon-40b)

With the new Hugging Face LLM Inference DLCs on Amazon SageMaker, AWS customers can benefit from the same technologies that power highly concurrent, low latency LLM experiences like [HuggingChat](https://hf.co/chat), [OpenAssistant](https://open-assistant.io/), and Inference API for LLM models on the Hugging Face Hub. 

Let's get started!

To deploy FLAN-T5-XXL to Amazon SageMaker, we set up our environment and define our endpoint configuration, including the `HF_MODEL_ID`, the instance type, and related settings. We will use a **g5.12xlarge** instance type. We then retrieve the Hugging Face LLM DLC image URI and pass it to the model object, which is then ready for deployment.
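
A rough sizing check shows why a multi-GPU instance is a reasonable choice here. The sketch below assumes about 11 billion parameters (the T5-11B size listed above), 2 bytes per parameter for half-precision weights, and 4 GPUs with 24 GB each on ml.g5.12xlarge. It counts model weights only and ignores activations and runtime overhead, so treat the result as a lower bound rather than a precise requirement.

```python
def weights_gb(num_params, bytes_per_param):
    """Approximate memory needed for model weights, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 11e9      # assumption: roughly 11B parameters (T5-11B)
BYTES_FP16 = 2         # assumption: half-precision weights
NUM_GPUS = 4           # GPUs on ml.g5.12xlarge
GPU_MEMORY_GB = 24     # assumption: 24 GB of memory per GPU

total_gb = weights_gb(NUM_PARAMS, BYTES_FP16)
per_gpu_gb = total_gb / NUM_GPUS  # weights per GPU when sharded evenly

print(f"weights: {total_gb:.1f} GB total, {per_gpu_gb:.1f} GB per GPU")
print(f"fits on a single GPU with headroom: {total_gb < 0.8 * GPU_MEMORY_GB}")
```

Under these assumptions the weights alone nearly fill a single 24 GB GPU, while sharding them across four GPUs leaves most of each GPU's memory free for batching and generation.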

The following example shows how to deploy an open-source LLM to Amazon SageMaker for inference using the Hugging Face LLM DLC.
It also includes a prompt template and shows how to run inference against the endpoint.

In [None]:
import json

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

# IAM role that SageMaker assumes to create the model and endpoint
role = sagemaker.get_execution_role()

# SageMaker config
instance_type = "ml.g5.12xlarge"
number_of_gpu = 4
health_check_timeout = 900  # seconds

In [None]:
# Define model and endpoint configuration parameters
config = {
  'HF_MODEL_ID': "google/flan-t5-xxl", # model_id from hf.co/models
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPUs used per replica
  'MAX_INPUT_LENGTH': json.dumps(4096),  # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(5120),  # Max length of the generation (including input text)
  #'HF_MODEL_QUANTIZE': "bitsandbytes", # uncomment to quantize
}



from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="0.8.2"
)

# print ecr image uri
print(f"llm image uri: {llm_image}")
# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config
)

# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  # volume_size=400, # If using an instance with local SSD storage, volume_size must be None, e.g. p4 but not p3
  container_startup_health_check_timeout=health_check_timeout, # 900 s = 15 minutes to be able to load the model
)
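
SageMaker passes container environment variables as strings, which is why the numeric settings above are wrapped in `json.dumps`. The sketch below is a hypothetical pre-flight check: the helper `validate_tgi_env` is an illustrative name and not part of the SageMaker SDK. It verifies that every value is a string and that `MAX_INPUT_LENGTH` is smaller than `MAX_TOTAL_TOKENS`, so that some token budget is left for generation.

```python
import json


def validate_tgi_env(env):
    """Return a list of problems found in a TGI-style environment dict."""
    problems = []

    # Every environment value should already be a string
    for key, value in env.items():
        if not isinstance(value, str):
            problems.append(f"{key} must be a string, got {type(value).__name__}")

    # The input limit must leave room for at least one generated token
    try:
        max_input = int(env["MAX_INPUT_LENGTH"])
        max_total = int(env["MAX_TOTAL_TOKENS"])
    except (KeyError, ValueError, TypeError):
        problems.append("MAX_INPUT_LENGTH and MAX_TOTAL_TOKENS must be integer strings")
    else:
        if max_input >= max_total:
            problems.append("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")

    return problems


# Same values as the configuration used in this notebook
example_env = {
    "HF_MODEL_ID": "google/flan-t5-xxl",
    "SM_NUM_GPUS": json.dumps(4),
    "MAX_INPUT_LENGTH": json.dumps(4096),
    "MAX_TOTAL_TOKENS": json.dumps(5120),
}
print(validate_tgi_env(example_env))
```

Running a check like this before `deploy` can surface a misconfiguration in seconds instead of after a failed container start-up.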


In [None]:
from sagemaker.huggingface.model import HuggingFacePredictor

# Name of an already deployed endpoint; for the endpoint created above, use llm.endpoint_name
endpoint_name = 'huggingface-pytorch-tgi-inference-2023-07-24-23-59-42-211'
predictor = HuggingFacePredictor(endpoint_name=endpoint_name)

parameters = {
    "max_length": 500, # This is not used
    "max_new_tokens": 500, # Default value
    "temperature": 0.01,
    "top_p": 0.1,
}

## prompt template function
def intent_template(text, predictor):

    prompt = f"""
        Extract the topic of the customer conversation 
        "Input:\n\n{text}"
        Output:
    """
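    print(prompt)

    # Generation settings are read from the module-level `parameters` dict
    response = predictor.predict({
        "inputs": prompt,
        "parameters": parameters,
    })

    # The response is a list; take the generated text of the first entry
    generated_text = response[0]["generated_text"]

    return generated_text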


# This is how you would run inference on your endpoint

text = "i have faced some issues when booking a flight from Melbourne to Amsterdam. When i was trying to use the online booking system, it times out and i coudln't get back to it?can you please help me?"

# Override the generation parameters defined earlier; intent_template reads this dict
parameters = {
    "do_sample": True,
    "top_p": 0.7,
    "temperature": 0.2,
    "top_k": 50,
    "max_new_tokens": 256,
    "repetition_penalty": 1.03,
    "stop": ["<|endoftext|>"],
}

print(intent_template(text, predictor))
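# summary_template is not defined in this notebook; define it like intent_template before uncommenting
# print(summary_template(text, predictor))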
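
The request and response handling can also be exercised without a live endpoint. The sketch below uses a hypothetical `FakePredictor` stand-in that returns a response in the same `[{"generated_text": ...}]` shape that the code above reads. `build_prompt` and `extract_topic` are illustrative helper names, and the canned reply is made up for the example. `textwrap.dedent` strips the leading indentation that an indented triple-quoted string would otherwise send to the model.

```python
import textwrap


def build_prompt(text):
    """Build the topic-extraction prompt without leading indentation."""
    template = textwrap.dedent("""\
        Extract the topic of the customer conversation.
        Input:

        {text}
        Output:""")
    return template.format(text=text)


class FakePredictor:
    """Hypothetical stand-in for a SageMaker predictor; records the last payload."""

    def __init__(self, canned_text):
        self.canned_text = canned_text
        self.last_payload = None

    def predict(self, payload):
        self.last_payload = payload
        return [{"generated_text": self.canned_text}]


def extract_topic(text, predictor, parameters):
    """Send the prompt and return the stripped generated text of the first result."""
    response = predictor.predict({
        "inputs": build_prompt(text),
        "parameters": parameters,
    })
    return response[0]["generated_text"].strip()


fake = FakePredictor(" flight booking issue ")
topic = extract_topic("My online booking timed out.", fake, {"max_new_tokens": 64})
print(topic)
```

Passing `parameters` explicitly, rather than reading a module-level dict, makes it clearer which generation settings each call uses.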
