# Function calling on LLM hosted on Amazon SageMaker

Function calling is the ability to reliably connect LLMs to external tools to enable effective tool usage and interaction with external APIs.

Function calling is an important ability for building LLM-powered chatbots or agents that need to retrieve context for an LLM or interact with external tools by converting natural language into API calls.

Functional calling enables developers to create:

* Conversational agents that can efficiently use external tools to answer questions. For example, the query "What is the weather like in Belize?" will be converted to a function call such as get_current_weather(location: string, unit: 'celsius' | 'fahrenheit').
* LLM-powered solutions for extracting and tagging data (e.g., extracting people names from a Wikipedia article).
* Applications that can help convert natural language to API calls or valid database queries.
* Conversational knowledge retrieval engines that interact with a knowledge base.

In this example, we will be hosting [Meta-Llama-3-8B](https://huggingface.co/Trelis/Meta-Llama-3-8B-Instruct-function-calling) on sagemaker and invoke it using function calling. 

Meta released [Llama 3](https://huggingface.co/blog/llama3), the next iteration of the open-access Llama family. Llama 3 comes in two sizes: [8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) for efficient deployment and development on consumer-size GPU, and [70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B-instruct) for large-scale AI native applications. Both come in base and instruction-tuned variants. The vanilla llama3 models doesnot support function calling hence we will be using finetuned model which has support for function calling


Lets get started!


## 1. Setup development environment

We are going to use the `sagemaker` python SDK to deploy Llama3 to Amazon SageMaker. We need to make sure to have an AWS account configured and the `sagemaker` python SDK installed. 

In [99]:
!pip install sagemaker --upgrade --quiet

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
aiobotocore 2.12.2 requires botocore<1.34.52,>=1.34.41, but you have botocore 1.34.135 which is incompatible.
awscli 1.33.1 requires botocore==1.34.119, but you have botocore 1.34.135 which is incompatible.[0m[31m
[0m

In [100]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it not exists
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")


sagemaker role arn: arn:aws:iam::716256856266:role/service-role/AmazonSageMaker-ExecutionRole-20240603T162068
sagemaker session region: us-east-1


Compared to deploying regular Hugging Face models we first need to retrieve the container uri and provide it to our `HuggingFaceModel` model class with a `image_uri` pointing to the image. To retrieve the new Hugging Face LLM DLC in Amazon SageMaker, we can use the `get_huggingface_llm_image_uri` method provided by the `sagemaker` SDK. This method allows us to retrieve the URI for the desired Hugging Face LLM DLC based on the specified `backend`, `session`, `region`, and `version`. You can find the available versions [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-text-generation-inference-containers)

_Note: At the time of writing this blog post the latest version of the Hugging Face LLM DLC is not yet available via the `get_huggingface_llm_image_uri` method. We are going to use the raw container uri instead._


In [101]:
# COMMENT IN WHEN PR (https://github.com/aws/sagemaker-python-sdk/pull/4314) IS MERGED
# from sagemaker.huggingface import get_huggingface_llm_image_uri

# # retrieve the llm image uri
# llm_image = get_huggingface_llm_image_uri(
#   "huggingface",
#   version="2.0.0"
# )
llm_image = f"763104351884.dkr.ecr.{sess.boto_region_name}.amazonaws.com/huggingface-pytorch-tgi-inference:2.1-tgi2.0-gpu-py310-cu121-ubuntu22.04"

# print ecr image uri
print(f"llm image uri: {llm_image}")

llm image uri: 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.1-tgi2.0-gpu-py310-cu121-ubuntu22.04


## 2. Hardware requirements

Llama 3 comes in 2 different sizes - 8B & 70B parameters. The hardware requirements will vary based on the model size deployed to SageMaker. Below is a set up minimum requirements for each model size we tested.

| Model                                                              | Instance Type       | Quantization   | # of GPUs per replica |
| ------------------------------------------------------------------ | ------------------- | -------------- | --------------------- |
| [Llama 8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)   | `(ml.)g5.2xlarge`   | `-`            | 1                     |
| [Llama 70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) | `(ml.)g5.12xlarge`  | `gptq` | `awq` | 8                     |
| [Llama 70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) | `(ml.)g5.48xlarge`  | `-` | 8                     |
| [Llama 70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) | `(ml.)p4d.24xlarge` | `-`            | 8                     |



## 3. Deploy Llama 3 to Amazon SageMaker

To deploy [Llama 3 8B](https://huggingface.co/Trelis/Meta-Llama-3-8B-Instruct-function-calling) to Amazon SageMaker we create a `HuggingFaceModel` model class and define our endpoint configuration including the `hf_model_id`, `instance_type` etc.  Llama 3 8B instruct function calling is  fine-tuned for function calling. We will interact with llama using the common OpenAI format of `messages`. 

```json
{
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What is deep learning?" }
  ],
}
```

In [102]:
import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.g5.12xlarge"
health_check_timeout = 900





# Define Model and Endpoint configuration parameter
config = {
  'HF_MODEL_ID': "Trelis/Meta-Llama-3-8B-Instruct-function-calling", # model_id from hf.co/models
  'SM_NUM_GPUS': "1", # Number of GPU used per replica
  'MAX_INPUT_LENGTH': "2048",  # Max length of input text
  'MAX_TOTAL_TOKENS': "4096",  # Max length of the generation (including input text)
  'MAX_BATCH_TOTAL_TOKENS': "8192",  # Limits the number of tokens that can be processed in parallel during the generation
  'MESSAGES_API_ENABLED': "true", # Enable the messages API
  'HUGGING_FACE_HUB_TOKEN': "" #Update this
}

# check if token is set


# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config
)


After we have created the `HuggingFaceModel` we can deploy it to Amazon SageMaker using the `deploy` method. We will deploy the model with the `ml.g5.12xlarge` instance type. 

In [103]:
# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout, # 10 minutes to be able to load the model
)


-----------------!

SageMaker will now create our endpoint and deploy the model to it. This can takes a 10-15 minutes. 

## 4. Run inference and chat with the model

After our endpoint is deployed we can run inference on it. We will use the `predict` method from the `predictor` to run inference on our endpoint. We can inference with different parameters to impact the generation. Parameters can be defined as in the `parameters` attribute of the payload. You can find supported parameters in the [here](https://huggingface.co/docs/text-generation-inference/messages_api). 

The Messages API allows us to interact with the model in a conversational way. We can define the role of the message and the content. The role can be either `system`,`assistant` or `user`. The `system` role is used to provide context to the model and the `user` role is used to ask questions or provide input to the model. 

```json
{
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What is deep learning?" }
  ],
}
```

In [104]:
# Prompt to generate
messages=[
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What is deep learning?" }
  ]

# Generation arguments
parameters = {
    "model": "Trelis/Meta-Llama-3-8B-Instruct-function-calling", # placholder, needed
    "top_p": 0.6,
    "temperature": 0.9,
    "max_tokens": 512,
    #"stop": ["<|eot_id|>"],
    #"tools"=tools
}

Okay lets test it.

In [105]:
chat = llm.predict({"messages" :messages, **parameters})

print(chat["choices"][0]["message"]["content"].strip())

Deep learning is a subset of machine learning that involves the use of artificial neural networks to model and analyze data. It is called "deep" because it involves multiple layers of interconnected nodes or "neurons" that process and transform the data as it flows through the network.

In traditional machine learning, algorithms are designed to learn from data by identifying patterns and making predictions based on those patterns. In contrast, deep learning algorithms are designed to learn from data by identifying complex patterns and relationships that are not easily discernible by humans.

Deep learning is particularly well-suited for tasks that involve:

1. Image recognition: Deep learning algorithms can be trained to recognize objects, scenes, and activities in images and videos.
2. Natural language processing: Deep learning algorithms can be trained to understand and generate human language, including text and speech.
3. Speech recognition: Deep learning algorithms can be trained

# Function calling

As a basic example, let's say we asked the model to check the weather in a given location 

The LLM alone would not be able to respond to this request because it has been trained on a dataset with a cutoff point. The way to solve this is to combine the LLM with an external tool. You can leverage the function calling capabilities of the model to determine an external function to call along with its arguments and then have it return a final response. Below is a simple example of how you can achieve this using the Sagemaker's invoke model API.



In [106]:
FUNCTION_METADATA=[
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "This function gets the current weather in a given city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "The city, e.g., San Francisco"
                    },
                    "format": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "The temperature unit to use."
                    }
                },
                "required": ["city"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_clothes",
            "description": "This function provides a suggestion of clothes to wear based on the current weather",
            "parameters": {
                "type": "object",
                "properties": {
                    "temperature": {
                        "type": "string",
                        "description": "The temperature, e.g., 15 C or 59 F"
                    },
                    "condition": {
                        "type": "string",
                        "description": "The weather condition, e.g., 'Cloudy', 'Sunny', 'Rainy'"
                    }
                },
                "required": ["temperature", "condition"]
            }
        }
    }    
]


In [107]:
messages=[
    {
        "role": "function_metadata",
        "content": "FUNCTION_METADATA"
    },
    
    {
        "role": "function_call",
        "content": "{\n    \"name\": \"get_current_weather\",\n    \"arguments\": {\n        \"city\": \"London\"\n    }\n}"
    },
    {
        "role": "user",
        "content": "What is the current weather in London?"
    }
]


In [108]:
parameters = {
    "model": "Trelis/Meta-Llama-3-8B-Instruct-function-calling", # placholder, needed
    "top_p": 0.6,
    "temperature": 0.9,
    "max_tokens": 512,
    "stop": ["<|eot_id|>"],
    
}



chat = llm.predict({"messages" :messages, **parameters})

print(chat["choices"][0]["message"]["content"].strip())

{
    "name": "get_current_weather",
    "arguments": {
        "city": "London"
    }
}


## 6. Clean up

To clean up, we can delete the model and endpoint.


In [109]:
llm.delete_model()
llm.delete_endpoint()