# Log and serve a Hugging Face model
This notebook shows how to log and serve a Hugging Face transformers model for the `text-generation` task.


## To read about Model Serving at Snowflake
Documentation: https://docs.snowflake.com/en/developer-guide/snowflake-ml/model-registry/container

In [None]:
# Import python packages
import pandas as pd

# We can also use Snowpark for our analyses!
from snowflake.snowpark.context import get_active_session

session = get_active_session()

In [None]:
from snowflake.ml import version
from snowflake import snowpark
from snowflake.ml.registry import registry as registry_module
from snowflake.ml.model import openai_signatures

print(f"snowflake-ml-python=={version.VERSION}")

registry = registry_module.Registry(
    session=session,
)

# make sure to be on `snowflake-ml-python>=1.13.0`

snowflake-ml-python==1.15.0
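
The minimum-version requirement noted above can also be checked in code rather than by eye. The following is a minimal sketch, assuming plain numeric release strings such as `1.15.0`; the helper names `parse_version` and `meets_minimum` are hypothetical and not part of `snowflake-ml-python`.

```python
def parse_version(text):
    """Turn a release string such as '1.15.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".")[:3])


def meets_minimum(installed, minimum="1.13.0"):
    """Return True when `installed` is at least `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


print(meets_minimum("1.15.0"))  # True
print(meets_minimum("1.12.9"))  # False
```

In this notebook the installed version would come from `version.VERSION`, for example `assert meets_minimum(version.VERSION)`. Pre-release suffixes are not handled by this sketch.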


### Note: The notebook requires an external access integration (EAI) to download files from Hugging Face
This feature lets us log a large model without requiring a GPU.

Keep a lookout for a new API, `HuggingFacePipelineModel.log_model_and_create_service()` (to be released in snowflake-ml-python>=1.15.0),
which logs the model remotely without having to download the HF model locally. That feature doesn't require an EAI to Hugging Face.

For now, we download the model weights locally using `download_snapshot=True`.

In [None]:
# Note: this requires an EAI to download files from Huggingface
# This feature lets us log a large model without requiring GPU

# import os
from snowflake.ml.model.models import huggingface_pipeline


model = huggingface_pipeline.HuggingFacePipelineModel(
    model="google/medgemma-4b-it",
    task="text-generation",
    # TODO: provide token if the model is gated
    # token=os.getenv("HF_TOKEN"),
    # token="hf_...",
    download_snapshot=True,
)
model

Fetching 15 files:   0%|          | 0/15 [00:00<?, ?it/s]

<snowflake.ml.model.models.huggingface_pipeline.HuggingFacePipelineModel at 0x3741421d0>

In [9]:
# list the directory where the model files are downloaded
! ls {model.repo_snapshot_dir}

README.md                        model.safetensors.index.json
added_tokens.json                preprocessor_config.json
chat_template.jinja              processor_config.json
config.json                      special_tokens_map.json
generation_config.json           tokenizer.json
model-00001-of-00002.safetensors tokenizer.model
model-00002-of-00002.safetensors tokenizer_config.json


Log the Hugging Face pipeline to Snowflake.

In [None]:
mv = registry.log_model(
    model=model,
    model_name="med_gemma_4b",
    # provides OpenAI Chat Completions compatible signature (input output format) to interact with the transformers model
    signatures=openai_signatures.OPENAI_CHAT_SIGNATURE,
)
mv

Create a service from the logged model. This usually takes a few minutes, depending on node availability in your compute pool.
Once the service is up and running, you can run inference against the model using SQL, the Python API, or the REST endpoint (if ingress is enabled).

In [None]:
service_name = "med_gemma_service"

mv.create_service(
    # Reuse the variable so later cells refer to the same service
    service_name=service_name,
    service_compute_pool="<service_compute_pool>",
    gpu_requests="1",
    # Enable ingress only if a REST endpoint is required
    ingress_enabled=True,
)
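
Because service creation depends on node availability, later cells may need to wait before sending requests. The following is a minimal, generic polling sketch. It assumes you supply your own `is_ready` callable; how readiness is actually checked is not shown in this notebook, so `wait_until_ready` and its parameters are hypothetical names used only for illustration.

```python
import time


def wait_until_ready(is_ready, timeout_s=600.0, interval_s=10.0, sleep=time.sleep):
    """Poll `is_ready()` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Stand-in readiness check that succeeds on the third call.
attempts = iter([False, False, True])
print(wait_until_ready(lambda: next(attempts), sleep=lambda _: None))  # True
```

The `sleep` argument is injectable so the loop can be exercised without real delays.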

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful medical assistant."},
    {
        "role": "user",
        "content": "How do you differentiate bacterial from viral pneumonia?",
    },
]

# Create a pd.DataFrame whose columns are OpenAI chat completions arguments:
x_df = pd.DataFrame.from_records(
    [
        {
            "messages": messages,
            "max_completion_tokens": 250,
            "temperature": 0.9,
            "stop": None,
            "n": 3,
            # Note streaming is not supported yet
            "stream": False,
            "top_p": 1.0,
            "frequency_penalty": 0.1,
            "presence_penalty": 0.2,
        }
    ],
)

# To get the model version object:
# mv = registry.get_model("<model_name>").version("<version_name>")
# The output is OpenAI Chat Completions compatible
output_df = mv.run(X=x_df, service_name=service_name)
output_df

| | id | object | created | model | choices | usage |
|---|---|---|---|---|---|---|
| 0 | chatcmpl-43a83d69ada04062b709bfeabba377c1 | chat.completion | 1758827776.0 | /shared/model/model/models/MED_GEMMA_4B/model | [{'finish_reason': 'stop', 'index': 0, 'logpro... | {'completion_tokens': 750, 'prompt_tokens': 27... |


In [28]:
# Print the first choice
print(output_df["choices"][0][0]["message"]["content"])

Okay, I can help you understand how a medical assistant can differentiate between bacterial and viral pneumonia.

**How a Medical Assistant can help differentiate between bacterial and viral pneumonia:**

**1. Medical History and Symptoms:**

*   **Onset:**
    *   **Bacterial pneumonia:** Often has a more sudden onset, can be more sudden and severe.
    *   **Viral pneumonia:** Often has a more gradual onset, can be more gradual and more gradual.
*   **Onset:**
    *   **Bacterial pneumonia:** Often has a more sudden onset, can be more sudden and severe.
    *   **Viral pneumonia:** Often has a more gradual onset, can be more gradual and more gradual.
*   **Symptoms:**
    *   **Bacterial pneumonia:**
        *   **Symptoms:** Often has a more sudden onset, can be more sudden and severe.
        *   **Symptoms:** Often has a more sudden onset, can be more sudden and severe.
        *   **Symptoms:** Often has a more sudden onset, can be more sudden and severe.
        *   **Symptoms:*
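
The request above sets `n` to 3, so each `choices` cell holds several completions, while the cell above prints only the first. The following is a minimal sketch for collecting all of them. It assumes each choice is a dict with `index` and `message` keys, as the indexing in the cell above suggests; `choice_texts` is a hypothetical helper and the sample data is hand-written, not real model output.

```python
def choice_texts(choices):
    """Return the message content of each choice, ordered by its index."""
    ordered = sorted(choices, key=lambda choice: choice["index"])
    return [choice["message"]["content"] for choice in ordered]


# Hand-written sample in the shape of one `choices` cell (not real model output).
sample_choices = [
    {"index": 1, "finish_reason": "length", "message": {"role": "assistant", "content": "second"}},
    {"index": 0, "finish_reason": "stop", "message": {"role": "assistant", "content": "first"}},
]

for number, text in enumerate(choice_texts(sample_choices)):
    print(f"choice {number}: {text}")
```

In this notebook the equivalent call would be `choice_texts(output_df["choices"][0])`.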

In [29]:
! pip install openai -q

In [30]:
# Optional: To get OpenAI chat completion object
# Note: requires openai python SDK `pip install openai`
import openai


def convert_to_openai_completion(df):
    completions = []
    for _, row in df.iterrows():
        completions.append(openai.types.chat.ChatCompletion(**row))

    return completions


completions = convert_to_openai_completion(output_df)

completions[0].choices[0].message.content

'Okay, I can help you understand how a medical assistant can differentiate between bacterial and viral pneumonia.\n\n**How a Medical Assistant can help differentiate between bacterial and viral pneumonia:**\n\n**1. Medical History and Symptoms:**\n\n*   **Onset:**\n    *   **Bacterial pneumonia:** Often has a more sudden onset, can be more sudden and severe.\n    *   **Viral pneumonia:** Often has a more gradual onset, can be more gradual and more gradual.\n*   **Onset:**\n    *   **Bacterial pneumonia:** Often has a more sudden onset, can be more sudden and severe.\n    *   **Viral pneumonia:** Often has a more gradual onset, can be more gradual and more gradual.\n*   **Symptoms:**\n    *   **Bacterial pneumonia:**\n        *   **Symptoms:** Often has a more sudden onset, can be more sudden and severe.\n        *   **Symptoms:** Often has a more sudden onset, can be more sudden and severe.\n        *   **Symptoms:** Often has a more sudden onset, can be more sudden and severe.\n      

In [None]:
ingress_url = session.sql(f"SHOW ENDPOINTS IN SERVICE {service_name}").collect()[0][
    "ingress_url"
]
ingress_url

'bdk7a2g-sfengineering-mlplatformtest.awsuswest2preprod8.pp-snowflakecomputing.app'

## To consume the REST endpoint

Call the model for inference through the REST endpoint.

In [None]:
from snowflake.ml.utils import connection_params

conn_cfg = connection_params.SnowflakeLoginOptions(
    connection_name="...",  # Optional
)

# The programmatic access token (PAT) is read from the connection's `password` field
pat_token = conn_cfg.get("password")
headers = {
    "Authorization": f'Snowflake Token="{pat_token}"',
    "Content-Type": "application/json",
}
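
If the token is not stored in a connection configuration, the same headers can be built from an environment variable. This is a minimal sketch: the variable name `SNOWFLAKE_PAT` and the helper `build_headers` are assumptions made for illustration, while the header format is copied from the cell above.

```python
import os


def build_headers(token):
    """Build request headers in the format used in the cell above."""
    if not token:
        raise ValueError("a non-empty token is required")
    return {
        "Authorization": f'Snowflake Token="{token}"',
        "Content-Type": "application/json",
    }


# SNOWFLAKE_PAT is a hypothetical variable name chosen for this sketch.
token = os.environ.get("SNOWFLAKE_PAT", "")
if token:
    headers_from_env = build_headers(token)
```

Failing early on an empty token avoids sending a request that would be rejected anyway.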

In [None]:
import requests

URL = f"https://{ingress_url}/--call--"


def invoke_endpoint(chat_requests, headers):
    # Each payload row is [row_index, request]
    data_array = [[i, chat_request] for i, chat_request in enumerate(chat_requests)]
    payload = {"data": data_array}

    return requests.post(URL, headers=headers, json=payload)


messages = [
    {"role": "system", "content": "You are a helpful medical assistant."},
    {
        "role": "user",
        "content": "How do you differentiate bacterial from viral pneumonia?",
    },
]

chat_requests = [
    {
        "messages": messages,
        "max_completion_tokens": 250,
        "temperature": 0.9,
        "stop": None,
        "n": 3,
        # Note streaming is not supported yet
        "stream": False,
        "top_p": 1.0,
        "frequency_penalty": 0.1,
        "presence_penalty": 0.2,
    }
]

response = invoke_endpoint(chat_requests, headers)
response.status_code

200
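
The request body sent by `invoke_endpoint` wraps every chat request in a `[row_index, request]` pair under a top-level `data` key. The following is a minimal sketch that builds and prints such a body without making a network call; `build_payload` is a hypothetical helper and the requests are shortened, hand-written examples.

```python
import json


def build_payload(chat_requests):
    """Wrap each request in a [row_index, request] pair under a "data" key."""
    return {"data": [[i, request] for i, request in enumerate(chat_requests)]}


example_requests = [
    {"messages": [{"role": "user", "content": "Hello"}], "n": 1},
    {"messages": [{"role": "user", "content": "Hi again"}], "n": 1},
]
payload = build_payload(example_requests)
print(json.dumps(payload, indent=2))
```

Inspecting the payload this way can help when debugging a non-200 response.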

In [19]:
import openai


def convert_responses_to_openai_completion(responses):
    completions = []
    data = responses["data"]
    for item in data:
        response = item[-1]
        completions.append(openai.types.chat.ChatCompletion(**response))

    return completions


openai_completions = convert_responses_to_openai_completion(response.json())
print(openai_completions[0].choices[0].message.content)

Okay, I can help you understand how a medical assistant can differentiate between bacterial and viral pneumonia.

**How a Medical Assistant can help differentiate between bacterial and viral pneumonia:**

**1. Medical History and Symptoms:**

*   **Onset:**
    *   **Bacterial pneumonia:** Often has a more sudden onset, can be more sudden and severe.
    *   **Viral pneumonia:** Often has a more gradual onset, can be more gradual and more gradual.
*   **Onset:**
    *   **Bacterial pneumonia:** Often has a more sudden onset, can be more sudden and severe.
    *   **Viral pneumonia:** Often has a more gradual onset, can be more gradual and more gradual.
*   **Symptoms:**
    *   **Bacterial pneumonia:**
        *   **Symptoms:** Often has a more sudden onset, can be more sudden and severe.
        *   **Symptoms:** Often has a more sudden onset, can be more sudden and severe.
        *   **Symptoms:** Often has a more sudden onset, can be more sudden and severe.
        *   **Symptoms:*
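
When several requests are sent in one call, it can help to map each response row back to its request without the `openai` SDK. The following is a minimal sketch. It assumes the first element of each response row echoes the row index sent in the request and the last element is the completion, as the `item[-1]` access above suggests; `contents_by_row` is a hypothetical helper and the sample body is hand-written, not real service output.

```python
def contents_by_row(response_body):
    """Map each row index to the list of message contents in that row's result."""
    results = {}
    for row in response_body["data"]:
        row_index, completion = row[0], row[-1]
        results[row_index] = [c["message"]["content"] for c in completion["choices"]]
    return results


# Hand-written sample body, not real service output.
sample_body = {
    "data": [
        [0, {"choices": [{"index": 0, "message": {"role": "assistant", "content": "A"}}]}],
        [1, {"choices": [{"index": 0, "message": {"role": "assistant", "content": "B"}}]}],
    ]
}
print(contents_by_row(sample_body))
```

In this notebook the equivalent call would be `contents_by_row(response.json())`.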