# Provisioned Throughput GTE serving example


Provisioned Throughput provides optimized inference for Foundation Models with performance guarantees for production workloads.

This example walks through:

- Downloading the model from Hugging Face with the transformers library
- Logging the model in a format that provisioned throughput supports, and registering it in Databricks Unity Catalog or the Workspace Registry
- Enabling optimized serving on the model

## Step 1: Log the model for serving

In [0]:
# Update and install required dependencies
!pip install -U mlflow
!pip install -U transformers
!pip install -U torch
!pip install -U torchvision
!pip install -U accelerate
dbutils.library.restartPython()

In [0]:
from transformers import AutoModel, AutoTokenizer

gte = "Alibaba-NLP/gte-large-en-v1.5"
model = AutoModel.from_pretrained(gte, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(gte, trust_remote_code=True)

To enable optimized serving, include the extra metadata dictionary when you log the model with mlflow.transformers.log_model, as shown below:

metadata = {"task": "llm/v1/embeddings"}
This specifies the API signature used for the model serving endpoint. GTE is an embedding model, so the task is llm/v1/embeddings.

In [0]:
import mlflow
from mlflow.models.signature import ModelSignature
from mlflow.types.schema import ColSpec, Schema, TensorSpec
import numpy as np

# Define the model input and output schema
input_schema = Schema([ColSpec(type="string", name=None)])
output_schema = Schema([TensorSpec(type=np.dtype("float64"), shape=(-1,))])

signature = ModelSignature(inputs=input_schema, outputs=output_schema)

# Define an example input
input_example = {
    "input": np.array([
        "Welcome to Databricks!"
    ])
}

In [0]:
from transformers import pipeline
import mlflow

# If you are not using Models in Unity Catalog, comment out the line below
# and use a plain model name instead of the three-level namespace
mlflow.set_registry_uri('databricks-uc')
CATALOG = "ml_demo"
SCHEMA = "models"
registered_model_name = f"{CATALOG}.{SCHEMA}.gte-large"

# Start a new MLflow run
with mlflow.start_run():
    mlflow.transformers.log_model(
        transformers_model=pipeline(
            "feature-extraction",
            model=model,
            tokenizer=tokenizer
        ),
        artifact_path="gte-large",
        task="llm/v1/embeddings",
        metadata={"task": "llm/v1/embeddings"},
        registered_model_name=registered_model_name
    )


## Step 2: View optimization information for your model

Modify the cell below to change the model name. The model optimization information API returns the throughput chunk size for your model: the number of tokens per second that corresponds to one throughput unit for that specific model.
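
Since one throughput unit equals one chunk, a target rate in tokens per second has to be rounded to a whole number of chunks. The sketch below illustrates that arithmetic, assuming throughput is requested in whole multiples of the chunk size (as the chunk_size*2 and chunk_size*3 values in Step 3 suggest). The helper name round_up_to_chunk and the chunk size of 970 are hypothetical; the real chunk size comes from the API response.

```python
import math

def round_up_to_chunk(target_tokens_per_second: float, chunk_size: int) -> int:
    """Return the smallest multiple of chunk_size that covers the target rate."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Always provision at least one unit, even for a very small target.
    units = max(1, math.ceil(target_tokens_per_second / chunk_size))
    return units * chunk_size

# Hypothetical chunk size; the real value comes from the optimization info API.
example_chunk_size = 970
print(round_up_to_chunk(2000, example_chunk_size))  # 3 units -> 2910
```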

In [0]:
import requests
import json

# Name of the registered MLflow model
model_name = registered_model_name

# Version of the registered MLflow model to serve (hardcoded to the first version)
model_version = 1

# Workspace URL (no trailing slash) and API token; fill these in for your workspace
API_ROOT = "https://e2-demo-field-eng.cloud.databricks.com"
API_TOKEN = ""

headers = {"Content-Type": "application/json", "Authorization": f"Bearer {API_TOKEN}"}

response = requests.get(url=f"{API_ROOT}/api/2.0/serving-endpoints/get-model-optimization-info/{model_name}/{model_version}", headers=headers)

if 'optimizable' not in response.json() or not response.json()['optimizable']:
    raise ValueError("Model is not eligible for provisioned throughput")

print(json.dumps(response.json(), indent=4))
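
The cell above hardcodes the workspace URL and leaves the token blank. One alternative is to read both from environment variables so that no secret is stored in the notebook. This is a minimal sketch, assuming variables named DATABRICKS_HOST and DATABRICKS_TOKEN have been set beforehand; this notebook does not set them, and the helper names load_api_settings and build_headers are hypothetical.

```python
import os

def load_api_settings(env=os.environ):
    """Read the workspace URL and token from environment variables.

    DATABRICKS_HOST and DATABRICKS_TOKEN are assumed names, not set by this notebook.
    """
    host = env.get("DATABRICKS_HOST", "").rstrip("/")  # avoid "//" when joining paths
    token = env.get("DATABRICKS_TOKEN", "")
    if not host or not token:
        raise RuntimeError("Set DATABRICKS_HOST and DATABRICKS_TOKEN first")
    return host, token

def build_headers(token: str) -> dict:
    """Build JSON request headers carrying a bearer token."""
    return {"Content-Type": "application/json", "Authorization": f"Bearer {token}"}

# Example with placeholder values instead of real credentials.
host, token = load_api_settings({
    "DATABRICKS_HOST": "https://example.cloud.databricks.com/",
    "DATABRICKS_TOKEN": "dapi-placeholder",
})
print(f"{host}/api/2.0/serving-endpoints")
```

Stripping the trailing slash keeps URLs built with f"{host}/api/..." free of a double slash.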

## Step 3: Configure and create your model serving GPU endpoint
Modify the cell below to change the endpoint name. After calling the create endpoint API, the logged GTE model is automatically deployed with optimized serving.

In [0]:
# Set the name of the MLflow endpoint
endpoint_name = "gte-large_sb"

In [0]:
chunk_size = response.json()['throughput_chunk_size']

# Specify the minimum provisioned throughput 
min_provisioned_throughput = chunk_size*2

# Specify the maximum provisioned throughput 
max_provisioned_throughput = chunk_size*3

In [0]:
data = {
    "name": endpoint_name,
    "config": {
        "served_entities": [
            {
                "entity_name": model_name,
                "entity_version": model_version,
                "min_provisioned_throughput": min_provisioned_throughput,
                "max_provisioned_throughput": max_provisioned_throughput,
            }
        ]
    },
}

headers = {"Content-Type": "application/json", "Authorization": f"Bearer {API_TOKEN}"}

response = requests.post(url=f"{API_ROOT}/api/2.0/serving-endpoints", json=data, headers=headers)

print(json.dumps(response.json(), indent=4))
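
Checking the throughput values locally before posting can catch a swapped or zero bound early. This is a minimal sketch, assuming both bounds should be positive whole multiples of the chunk size with the minimum no larger than the maximum. build_endpoint_payload is a hypothetical helper, the chunk size of 970 is made up, and the field names mirror the request body above.

```python
def build_endpoint_payload(endpoint_name, model_name, model_version,
                           chunk_size, min_units, max_units):
    """Build the create-endpoint request body from whole throughput units."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 1 <= min_units <= max_units:
        raise ValueError("need 1 <= min_units <= max_units")
    return {
        "name": endpoint_name,
        "config": {
            "served_entities": [
                {
                    "entity_name": model_name,
                    "entity_version": model_version,
                    "min_provisioned_throughput": min_units * chunk_size,
                    "max_provisioned_throughput": max_units * chunk_size,
                }
            ]
        },
    }

# Hypothetical chunk size of 970; the names reuse the values from this example.
payload = build_endpoint_payload("gte-large_sb", "ml_demo.models.gte-large", 1, 970, 2, 3)
entity = payload["config"]["served_entities"][0]
print(entity["min_provisioned_throughput"], entity["max_provisioned_throughput"])
```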

## Step 4: Query your endpoint
After your endpoint is ready, you can query it by making an API request. Depending on the model size and complexity, it can take 30 minutes or more for the endpoint to become ready.
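
Because the wait can be long, and an endpoint that fails to deploy may never report ready, it can be useful to bound the polling with a timeout. The sketch below is one way to do that. wait_until_ready is a hypothetical helper, and the state strings are simulated so that the example runs without a workspace.

```python
import time

def wait_until_ready(get_state, timeout_s=3600.0, interval_s=10.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns "READY" or the timeout elapses.

    get_state is any zero-argument callable returning the endpoint's
    ready state as a string. Returns True if ready, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if get_state() == "READY":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)

# Simulated states so the sketch runs without a workspace.
states = iter(["NOT_READY", "NOT_READY", "READY"])
print(wait_until_ready(lambda: next(states), sleep=lambda s: None))
```

In the notebook, get_state could wrap the requests.get call from the polling cell below and return its ["state"]["ready"] field.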

In [0]:
import time

In [0]:
API_ROOT = "https://e2-demo-field-eng.cloud.databricks.com"
API_TOKEN = ""
data = {
    "input": ["Welcome to Databricks!"]
}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_TOKEN}"
}

# Check whether the endpoint is ready, sleeping 10 seconds between checks
while True:
    state = requests.get(
        url=f"{API_ROOT}/api/2.0/serving-endpoints/{endpoint_name}",
        headers=headers
    ).json()["state"]["ready"]
    if state == "READY":
        print("Endpoint is ready to be queried")
        break
    else:
        time.sleep(10)


response = requests.post(
    url=f"{API_ROOT}/serving-endpoints/{endpoint_name}/invocations",
    json=data,
    headers=headers
)

print(json.dumps(response.json()))
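
Once the response arrives, the embedding vectors can be extracted and compared. This sketch assumes the endpoint returns an OpenAI-style embeddings body: a data list whose items each carry an embedding vector. That response shape is an assumption, the sample values are made up, and both helper names are hypothetical, so check the JSON printed above before relying on it.

```python
import math

def extract_embeddings(body: dict) -> list:
    """Return the embedding vectors from an OpenAI-style embeddings response."""
    return [item["embedding"] for item in body["data"]]

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors; real GTE embeddings are much longer.
sample_body = {
    "data": [
        {"index": 0, "embedding": [1.0, 0.0, 0.0]},
        {"index": 1, "embedding": [0.0, 1.0, 0.0]},
    ]
}
first, second = extract_embeddings(sample_body)
print(cosine_similarity(first, first), cosine_similarity(first, second))
```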