# Hugging Face Provisioned Throughput (PT) Models
This notebook registers Hugging Face models to Unity Catalog and deploys them via Model Serving.

In [0]:
%pip install --upgrade transformers
%restart_python

## Load Model from HuggingFace
Our serving journey starts with loading the model from Hugging Face. We use the `Auto` classes from the Hugging Face `transformers` package because of their compatibility with MLflow and Unity Catalog. You may need a Hugging Face API token to access a gated repo such as Llama.

In [0]:
import os
os.environ["HF_TOKEN"] = dbutils.secrets.get('shm', 'hftoken')
hf_model_id = "microsoft/Phi-3.5-mini-instruct"

The pattern for deploying every Hugging Face model is the same: load the model and tokenizer with the `AutoModelForCausalLM` and `AutoTokenizer` classes, then register them to Unity Catalog with the `mlflow.transformers` flavor.

In [0]:
import mlflow
import transformers
import re

task = "llm/v1/chat"
model = transformers.AutoModelForCausalLM.from_pretrained(hf_model_id)
tokenizer = transformers.AutoTokenizer.from_pretrained(hf_model_id)

mlflow.set_registry_uri("databricks-uc")
registry = mlflow.MlflowClient()

my_uc_catalog = "shm"
my_uc_schema = "default"
uc_model_name = hf_model_id.split("/")[-1].replace(".","")

transformers_model = {"model": model, "tokenizer": tokenizer}

with mlflow.start_run():
    model_info = mlflow.transformers.log_model(
        transformers_model=transformers_model,
        artifact_path="model",
        task=task,
        registered_model_name=f"{my_uc_catalog}.{my_uc_schema}.{uc_model_name}",
        metadata={
            "task": task,
            "pretrained_model_name": uc_model_name,
        },
    )
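The naming step in the cell above can be isolated into a small helper. This is a minimal sketch: `uc_full_name` is a hypothetical function, not part of MLflow, and it only mirrors the notebook's own rule of taking the last path segment of the Hugging Face ID and dropping periods, since periods separate the catalog, schema, and model levels of a Unity Catalog name.

```python
def uc_full_name(hf_model_id: str, catalog: str, schema: str) -> str:
    """Build a three-level Unity Catalog name from a Hugging Face model ID.

    Hypothetical helper: it mirrors the notebook's rule (last path segment,
    periods removed) and does no other validation.
    """
    model_name = hf_model_id.split("/")[-1].replace(".", "")
    return f"{catalog}.{schema}.{model_name}"


print(uc_full_name("microsoft/Phi-3.5-mini-instruct", "shm", "default"))
# shm.default.Phi-35-mini-instruct
```

Keeping the rule in one function makes it easier to reuse the same name when registering the model and when creating endpoints later.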

In [0]:
# Get the API endpoint and token for the current notebook context
API_ROOT = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get()
API_TOKEN = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()

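## Two Types of Model Serving Deployment
There are two ways we can deploy the registered model: 'classic' GPU model serving, or accelerated provisioned throughput. Before creating any endpoint, we call the `get-model-optimization-info` API to see what Databricks reports for the registered model version, including its `throughput_chunk_size`.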

In [0]:
import requests
import json

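# "Content-Type" was misspelled as "Context-Type" and used a non-standard media type
headers = {"Content-Type": "application/json", "Authorization": f"Bearer {API_TOKEN}"}

# Registered model name and version, in the form <catalog>.<schema>.<model>/<version>
model_path = f"{my_uc_catalog}.{my_uc_schema}.{uc_model_name}/{model_info.registered_model_version}"
response = requests.get(
    url=f"{API_ROOT}/api/2.0/serving-endpoints/get-model-optimization-info/{model_path}",
    headers=headers,
)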

print(json.dumps(response.json(), indent=4))
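Before reading `throughput_chunk_size` from the response, it can help to fail early with a readable message if the field is absent. The sketch below assumes only what this notebook relies on, namely that the response JSON carries a `throughput_chunk_size` key. `chunk_size_from_info` is a hypothetical helper, and the sample dictionary holds a made-up value for illustration, not real API output.

```python
def chunk_size_from_info(info: dict) -> int:
    """Return the throughput chunk size from an optimization-info response.

    Hypothetical helper. This sketch treats a missing key as a sign that the
    model version cannot be deployed with provisioned throughput.
    """
    if "throughput_chunk_size" not in info:
        raise ValueError(f"No throughput_chunk_size in response: {info}")
    return int(info["throughput_chunk_size"])


# Made-up sample payload; in the notebook this would be response.json()
sample_info = {"throughput_chunk_size": 980}
print(chunk_size_from_info(sample_info))
```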

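The first option is provisioned throughput. The cell below creates an endpoint whose maximum provisioned throughput is the `throughput_chunk_size` reported above and whose minimum is zero, so that the endpoint can scale to zero.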

In [0]:
max_provisioned_throughput = response.json()['throughput_chunk_size']

from mlflow.deployments import get_deploy_client
client = get_deploy_client("databricks")

client.create_endpoint(
    name=f"shm_{uc_model_name}_acc",
    config={
        "served_entities": [{
            "entity_name": f"{my_uc_catalog}.{my_uc_schema}.{uc_model_name}",
            "entity_version": model_info.registered_model_version,
            "min_provisioned_throughput": 0,  # Must be zero for scale to zero
            "max_provisioned_throughput": max_provisioned_throughput,
            "scale_to_zero_enabled": True
        }]
    }
)
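If one chunk of throughput is not enough, the served-entity dictionary can be generated rather than written by hand. The following is a sketch under two assumptions that are not confirmed by this notebook: that provisioned throughput values are whole multiples of the chunk size, and that a minimum of one chunk is a sensible floor when scale to zero is disabled. `pt_served_entity` and its parameters are hypothetical names.

```python
def pt_served_entity(entity_name, entity_version, chunk_size,
                     max_chunks=1, scale_to_zero=True):
    """Build one served entity for a provisioned throughput endpoint.

    Hypothetical helper. Assumes throughput is set in multiples of chunk_size.
    """
    if max_chunks < 1:
        raise ValueError("max_chunks must be at least 1")
    # The notebook sets the minimum to zero when scale to zero is enabled;
    # falling back to one chunk otherwise is an assumption of this sketch.
    min_throughput = 0 if scale_to_zero else chunk_size
    return {
        "entity_name": entity_name,
        "entity_version": entity_version,
        "min_provisioned_throughput": min_throughput,
        "max_provisioned_throughput": max_chunks * chunk_size,
        "scale_to_zero_enabled": scale_to_zero,
    }


entity = pt_served_entity("shm.default.Phi-35-mini-instruct", "1", 980, max_chunks=2)
print(entity["max_provisioned_throughput"])
```

The returned dictionary has the same keys as the entry passed to `client.create_endpoint` above, so it could be placed inside the `served_entities` list.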

In [0]:
from mlflow.deployments import get_deploy_client
client = get_deploy_client("databricks")

client.create_endpoint(
    name=f"shm_{uc_model_name}_cl",
    config={
        "served_entities": [{
            "entity_name": f"{my_uc_catalog}.{my_uc_schema}.{uc_model_name}",
            "entity_version": model_info.registered_model_version,
            "workload_type": "GPU_LARGE",
            "workload_size": "Small",
            "scale_to_zero_enabled": True
        }]
    }
)