In [None]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Get started with your deployed model on GKE

<table><tbody><tr>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fcommunity%2Fmodel_garden%2Fgke_model_ui_deployment_notebook_auto.ipynb">
      <img alt="Google Cloud Colab Enterprise logo" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" width="32px"><br> Run in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/gke_model_ui_deployment_notebook_auto.ipynb">
      <img alt="GitHub logo" src="https://github.githubassets.com/assets/GitHub-Mark-ea2971cee799.png" width="32px"><br> View on GitHub
    </a>
  </td>
</tr></tbody></table>

# Overview

This notebook guides you through the first step of testing your recently
deployed model with text prompts. Depending on your deployed model's inference
setup, the notebook uses either Text Generation Inference
([TGI](https://huggingface.co/docs/text-generation-inference/en/index)) or
[vLLM](https://developers.googleblog.com/en/inference-with-gemma-using-dataflow-and-vllm/#:~:text=model%20frameworks%20simple.-,What%20is%20vLLM%3F,-vLLM%20is%20an),
two efficient serving frameworks that improve inference performance on GPUs.
Ready to see your deployed model respond? Run the cells below and start
experimenting with different prompts!

### Prerequisites

Before proceeding with this notebook, ensure you have already deployed a model
using the Google Cloud Console. You can find an overview of AI and Machine
Learning services on
[GKE AI/ML](https://console.cloud.google.com/kubernetes/aiml/overview).

### Objective

Enable prompt-based testing of the AI model deployed on GKE.

### GPUs

GPUs let you accelerate specific workloads running on your nodes, such as
machine learning and data processing. GKE provides a range of machine type
options for node configuration, including machine types with NVIDIA H100, L4,
and A100 GPUs.

### Understanding the Inference Frameworks

Your model is running on one of two popular and efficient serving frameworks:
vLLM or Text Generation Inference (TGI). The following sections provide a brief
overview of each to give you context on the underlying technology powering your
model.

#### TGI

TGI is a highly optimized open-source LLM serving framework that can increase
serving throughput on GPUs. TGI includes features such as:

*   Optimized transformer implementation with PagedAttention
*   Continuous batching to improve the overall serving throughput
*   Tensor parallelism and distributed serving on multiple GPUs

To learn more, refer to the
[TGI documentation](https://github.com/huggingface/text-generation-inference/blob/main/README.md).

#### vLLM

vLLM is another fast and easy-to-use library for LLM inference and serving,
known for its high throughput and memory efficiency. Key features include:

*   PagedAttention: Stores the attention key-value (KV) cache in fixed-size
    blocks, which reduces memory waste for long sequences and dynamic
    workloads.
*   Continuous batching: Maximizes GPU utilization by batching incoming
    requests.
*   High-throughput serving: Designed for production-level serving with low
    latency.
*   Optimized CUDA kernels.

To learn more, refer to the
[vLLM documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/open-models/vllm/use-vllm).
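
To build intuition for why continuous batching improves throughput, the toy
simulation below compares it with static batching. It is a simplified sketch,
not how TGI or vLLM are implemented: the function names are hypothetical, each
request is reduced to a number of decode steps, and every active request is
assumed to advance by one token per step.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Steps needed when each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        # The whole batch is held until its slowest member completes.
        total += max(lengths[i : i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Steps needed when a freed slot is refilled from the queue immediately."""
    queue = deque(lengths)
    active = []  # Remaining decode steps for each in-flight request.
    steps = 0
    while queue or active:
        # Fill any free slots before running the next decode step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Advance every request by one token and drop the finished ones.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Hypothetical workload: output lengths (in tokens) of four requests.
lengths = [2, 8, 3, 8]
print(static_batching_steps(lengths, batch_size=2))      # 16
print(continuous_batching_steps(lengths, batch_size=2))  # 13
```

In this toy workload the short requests no longer wait for the long ones in
their batch, so the same work finishes in fewer steps.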

In [None]:
# @title # Connect to Google Cloud Project
# @markdown #### Run this cell to configure your Google Cloud environment for Kubernetes (GKE) operations.
# @markdown
# @markdown #### Actions:
# @markdown 1.  **Connects to Project:** Retrieves and sets your Google Cloud project ID.
# @markdown 2.  **Enables the GKE API:** Enables `container.googleapis.com` for the project.
# @markdown 3.  **Installs `kubectl`:** Installs the Kubernetes command-line tool.

import os

# Get the default cloud project id.
PROJECT_ID = os.environ["GOOGLE_CLOUD_PROJECT"]

# Set up gcloud.
! gcloud config set project "$PROJECT_ID"
! gcloud services enable container.googleapis.com

# Add kubectl to the set of available tools.
! mkdir -p /tools/google-cloud-sdk/.install
! gcloud components install kubectl --quiet

In [None]:
# @title # Chat completion for text-only models {vertical-output: true}
# @markdown Run this cell to prompt the model server for a prediction.
# @markdown
# @markdown * **user_prompt (string):** This is the text prompt you provide to the language model. It's the question or instruction you want the model to respond to (e.g., "Explain neural networks").
# @markdown * **temperature (number):** This parameter controls the randomness of the model's output. It influences how the model selects the next token in the sequence it generates. Typical values range from 0.2 to 1.0.
# @markdown * **max_tokens (number):** This parameter sets the maximum number of tokens (words or sub-word units) that the model is allowed to generate in its response.
# @markdown

import json
import subprocess

import ipywidgets as widgets
from IPython.display import Markdown, clear_output, display

CLUSTER = ""  # @param {type:"string", isTemplate:true}
REGION = ""  # @param {type:"string", isTemplate:true}
NAMESPACE = ""  # @param {type:"string", isTemplate:true}
DEPLOYMENT = ""  # @param {type:"string", isTemplate:true}
POD_PORT = ""  # @param {type:"string", isTemplate:true}


def _run_kubectl(cmd, timeout=60):
    """Runs a CLI command (kubectl or gcloud) and returns its stripped stdout."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=timeout
        )
        return result.stdout.strip()
    except subprocess.CalledProcessError as e:
        raise RuntimeError(
            f"Kubectl command failed: {' '.join(e.cmd)}\nStderr: {e.stderr}"
        ) from e
    except subprocess.TimeoutExpired as e:
        raise RuntimeError(f"Kubectl command timed out: {' '.join(e.cmd)}") from e


def fetch_cluster_credentials(cluster, region, project_id):
    """Ensures credentials for the target GKE cluster."""
    cred_cmd = [
        "gcloud",
        "container",
        "clusters",
        "get-credentials",
        cluster,
        f"--location={region}",
        f"--project={project_id}",
    ]
    _run_kubectl(cred_cmd)


def get_deployment_selector_labels(deployment_name, namespace):
    """Retrieves the selector labels for a given Kubernetes deployment."""
    cmd = [
        "kubectl",
        "get",
        "deployment",
        deployment_name,
        "-n",
        namespace,
        "-o",
        "json",
    ]
    deployment_json = _run_kubectl(cmd)
    deployment_data = json.loads(deployment_json)

    selector_labels = (
        deployment_data.get("spec", {}).get("selector", {}).get("matchLabels")
    )
    if not selector_labels:
        raise RuntimeError(
            f"No selector labels found for deployment '{deployment_name}' in"
            f" namespace '{namespace}'."
        )
    return selector_labels


def get_running_pod_name(deployment_name, namespace):
    """Retrieves the name of a running pod associated with a deployment."""
    selector_labels = get_deployment_selector_labels(deployment_name, namespace)
    label_selector_str = ",".join(f"{k}={v}" for k, v in selector_labels.items())

    cmd = [
        "kubectl",
        "get",
        "pods",
        "-n",
        namespace,
        "-o",
        "json",
        "-l",
        label_selector_str,
        "--field-selector=status.phase=Running",
    ]
    pods_json = _run_kubectl(cmd)
    pods_data = json.loads(pods_json)

    if not pods_data.get("items"):
        raise RuntimeError(
            f"No running pods found for deployment '{deployment_name}' in namespace"
            f" '{namespace}' with selector '{label_selector_str}'."
        )
    return pods_data["items"][0]["metadata"]["name"]


def check_vllm_inference_label(pod_name, namespace):
    """Checks if the specified pod has the vLLM inference server label."""
    cmd = ["kubectl", "get", "pod", pod_name, "-n", namespace, "-o", "json"]
    pod_json = _run_kubectl(cmd)
    labels = json.loads(pod_json).get("metadata", {}).get("labels", {})
    return labels.get("ai.gke.io/inference-server") == "vllm"


def send_inference_request(
    request_payload, pod_name, pod_port, is_vllm_inference, namespace
):
    """Sends an inference request to the specified pod and returns the model's response."""
    # Pass the command as an argument list so the JSON payload needs no shell
    # quoting or escaping.
    curl_cmd = [
        "kubectl", "exec", "-n", namespace, pod_name, "--",
        "curl", "-s", "-X", "POST",
        f"http://localhost:{pod_port}/generate",
        "-H", "Content-Type: application/json",
        "-d", json.dumps(request_payload),
    ]

    response_raw = _run_kubectl(curl_cmd)

    if not response_raw:
        raise RuntimeError(f"Empty response received from pod '{pod_name}'.")

    try:
        first_line = response_raw.splitlines()[0]
        data = json.loads(first_line)
    except json.JSONDecodeError as e:
        raise RuntimeError(
            f"Failed to decode JSON response from pod: {e}. Raw: {response_raw}"
        ) from e
    except IndexError:
        raise RuntimeError(
            f"Unexpected empty response line from pod. Raw: {response_raw}"
        )

    if is_vllm_inference:
        predictions = data.get("predictions")
        if isinstance(predictions, list) and predictions:
            return predictions[0]
        raise RuntimeError(f"Unexpected vLLM response format. Raw data: {data}")
    else:  # TGI format
        generated_text = data.get("generated_text")
        if generated_text is not None:
            return generated_text
        raise RuntimeError(f"Unexpected TGI response format. Raw data: {data}")


# --- Main Execution Logic ---


def execute_chat_completion(
    deployment_name, namespace, pod_port, user_prompt, temperature, max_tokens
):
    """Executes the full chat completion process.

    Fetches cluster credentials, finds a running pod, determines the inference
    server type, sends the request, and returns the model's response.
    """
    display(Markdown("Establishing cluster credentials..."))
    fetch_cluster_credentials(CLUSTER, REGION, PROJECT_ID)

    display(Markdown("Retrieving pod information..."))
    pod_name = get_running_pod_name(deployment_name, namespace)
    display(Markdown(f"Successfully identified pod: `{pod_name}`"))

    is_vllm = check_vllm_inference_label(pod_name, namespace)

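    if is_vllm:
        request_payload = {
            "prompt": user_prompt,
            "max_tokens": max_tokens,
            "temperature": temperature,
        }
    else:
        # TGI's /generate endpoint reads sampling options from a nested
        # "parameters" object and names the length limit "max_new_tokens".
        parameters = {"max_new_tokens": max_tokens}
        if temperature > 0:
            # TGI rejects a non-positive temperature, so omit it at 0.
            parameters["temperature"] = temperature
        request_payload = {"inputs": user_prompt, "parameters": parameters}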
    display(Markdown("Sending inference request..."))
    response = send_inference_request(
        request_payload, pod_name, pod_port, is_vllm, namespace
    )

    return response


# --- Widgets Setup ---
user_prompt_widget = widgets.Textarea(
    value="What is AI?",
    description="User Prompt:",
    layout=widgets.Layout(width="95%", height="100px"),
)

temperature_widget = widgets.FloatSlider(
    value=0.50, min=0.0, max=1.0, step=0.01, description="Temperature:"
)

max_tokens_widget = widgets.IntSlider(
    value=250, min=1, max=2048, step=1, description="Max Tokens:"
)

submit_button = widgets.Button(description="Submit")
output_area_response = widgets.Output()


# --- Submit Button Logic ---
def on_submit_clicked(b):
    with output_area_response:
        clear_output()
        display(Markdown("Loading..."))

        try:
            model_response = execute_chat_completion(
                DEPLOYMENT,
                NAMESPACE,
                POD_PORT,
                user_prompt_widget.value,
                temperature_widget.value,
                max_tokens_widget.value,
            )
            clear_output()
            display(Markdown(f"**Response:**\n\n{model_response}"))
        except Exception as e:
            clear_output()
            display(Markdown(f"**An error occurred:**\n```\n{e}\n```"))


# --- Display Widgets ---
submit_button.on_click(on_submit_clicked)
display(
    user_prompt_widget,
    temperature_widget,
    max_tokens_widget,
    submit_button,
    output_area_response,
)

# Next Steps: Integrating the GKE Service Endpoint

After deploying a model on Google Kubernetes Engine (GKE) and verifying it in
this notebook, the next step is to integrate it into your applications. This
involves making HTTP requests to the service's endpoint from your application
code.
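
Once the service is reachable from your application, a call could look like the
following sketch. It uses only the Python standard library and assumes the
vLLM-style `/generate` payload used earlier in this notebook; for a TGI server
the request body would use `inputs` instead of `prompt`. The function names and
the base URL are placeholders, not part of any GKE or serving-framework API.

```python
import json
import urllib.request


def build_generate_request(base_url, prompt, max_tokens=250, temperature=0.5):
    """Builds a POST request for the model server's /generate endpoint."""
    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url, prompt, timeout=60, **kwargs):
    """Sends the request and returns the decoded JSON response."""
    request = build_generate_request(base_url, prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (placeholder address; replace with your exposed endpoint):
# result = generate("http://SERVICE_ADDRESS:PORT", "What is AI?")
```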

### Exposing the Service

To make your deployed model accessible to applications, you'll need to expose
its service endpoint. Google Kubernetes Engine offers several ways to do this:

1.  **Ingress:** Configure an Ingress resource to route external HTTP(S) traffic
    to your service. Set up Ingress for either an internal Load Balancer
    (accessible only within your VPC) or an external Load Balancer (accessible
    from the internet).
    [Learn more about GKE Ingress](https://cloud.google.com/kubernetes-engine/docs/concepts/ingress).
2.  **Gateway API:** A more modern and feature-rich API for managing traffic
    routing in Kubernetes. Similar to Ingress, Gateway API allows you to define
    how external and internal traffic should be directed to your services.
    [Explore GKE Gateway API](https://cloud.google.com/kubernetes-engine/docs/concepts/gateway-api).
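
As an illustration of the Ingress option, the sketch below builds a minimal
Ingress manifest as a Python dictionary and prints it as JSON, which
`kubectl apply -f` accepts. The resource, namespace, and Service names and the
port are placeholders, and the helper function is hypothetical. It assumes a
Service already fronts your model Pods; an internal Ingress also has
networking prerequisites described in the GKE documentation.

```python
import json


def build_ingress_manifest(name, namespace, service_name, service_port, internal=True):
    """Returns a minimal Ingress that sends all traffic to one Service."""
    # GKE selects the load balancer type from this annotation:
    # "gce-internal" for an internal load balancer, "gce" for an external one.
    ingress_class = "gce-internal" if internal else "gce"
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {"kubernetes.io/ingress.class": ingress_class},
        },
        "spec": {
            "defaultBackend": {
                "service": {"name": service_name, "port": {"number": service_port}}
            }
        },
    }


# Placeholder names; replace them with your own resources.
manifest = build_ingress_manifest("model-ingress", "default", "model-service", 8000)
print(json.dumps(manifest, indent=2))
```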

### Setting Up Autoscaling

Ensure your model serving can handle varying traffic by configuring the
Horizontal Pod Autoscaler (HPA). HPA automatically scales the number of Pods
based on resource utilization or custom metrics, optimizing performance and
cost.
[See how to configure HPA](https://cloud.google.com/kubernetes-engine/docs/how-to/horizontal-pod-autoscaling).
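
A minimal HPA manifest could be generated as in the sketch below. The helper
function and all names and thresholds are placeholders. CPU utilization is
used only because it is a built-in resource metric; for GPU-backed model
servers a custom metric from the serving framework is often a better scaling
signal.

```python
import json


def build_hpa_manifest(deployment, namespace, min_replicas, max_replicas, cpu_percent):
    """Returns an autoscaling/v2 HPA that scales a Deployment on CPU utilization."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{deployment}-hpa", "namespace": namespace},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [
                {
                    "type": "Resource",
                    "resource": {
                        "name": "cpu",
                        "target": {
                            "type": "Utilization",
                            "averageUtilization": cpu_percent,
                        },
                    },
                }
            ],
        },
    }


# Placeholder values; save the output to a file and run `kubectl apply -f` on it.
hpa = build_hpa_manifest("model-deployment", "default", 1, 4, 60)
print(json.dumps(hpa, indent=2))
```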

### Setting Up Monitoring

Monitor the health and performance of your deployed model using Google Cloud
Managed Service for Prometheus. Configure your model serving to expose
Prometheus metrics for comprehensive insights.
[Get started with Google Cloud Managed Prometheus](https://cloud.google.com/kubernetes-engine/docs/how-to/configure-automatic-application-monitoring).
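
Prometheus metrics are exposed as plain text, typically on a `/metrics` HTTP
path. The sketch below parses a small sample of that text format to show what
the scraped data looks like. The metric names are made up for illustration,
and the parser is deliberately simplified: it assumes each sample line has no
trailing timestamp.

```python
def parse_prometheus_text(text):
    """Parses Prometheus text exposition lines into a {series: value} dict."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field on the line.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Hypothetical metrics, not the actual names exported by TGI or vLLM.
sample = """
# HELP demo_requests_total Total requests handled.
# TYPE demo_requests_total counter
demo_requests_total{status="ok"} 42
demo_queue_depth 3
"""
print(parse_prometheus_text(sample))
```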

### Additional Resources:

*   #### Kubernetes Documentation:

    *   Services:
        https://kubernetes.io/docs/concepts/services-networking/service/

*   #### Google Cloud Documentation:

    *   Google Kubernetes Engine (GKE):
        https://cloud.google.com/kubernetes-engine
    *   Cloud Load Balancing:
        https://cloud.google.com/load-balancing/docs/ingress
    *   Gateway API on GKE:
        https://cloud.google.com/kubernetes-engine/docs/concepts/gateway-api
    *   Learn about GPUs in GKE:
        https://cloud.google.com/kubernetes-engine/docs/concepts/gpus

*   #### Python requests Library:

    *   https://requests.readthedocs.io/en/latest/

*   #### LangChain with Google Integrations:

    *   LangChain documentation for Google integrations:
        https://python.langchain.com/docs/integrations/providers/google/