In [None]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Get started with your deployed model on GKE

<table><tbody><tr>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fcommunity%2Fmodel_garden%2Fgke_model_ui_deployment_notebook.ipynb">
      <img alt="Google Cloud Colab Enterprise logo" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" width="32px"><br> Run in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/gke_model_ui_deployment_notebook.ipynb">
      <img alt="GitHub logo" src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" width="32px"><br> View on GitHub
    </a>
  </td>
</tr></tbody></table>

# Overview

This notebook walks you through the first step of testing your recently deployed model with text prompts. Depending on your deployment's inference setup, the notebook uses either [Text Generation Inference (TGI)](https://huggingface.co/docs/text-generation-inference/en/index) or [vLLM](https://developers.googleblog.com/en/inference-with-gemma-using-dataflow-and-vllm/#:~:text=model%20frameworks%20simple.-,What%20is%20vLLM%3F,-vLLM%20is%20an), two efficient frameworks for serving LLMs on GPUs. Ready to see your deployed model respond? Run the cells below and start experimenting with different prompts!

### Prerequisites

Before proceeding with this notebook, ensure you have already deployed a model using the Google Cloud Console. You can find an overview of AI and Machine Learning services on [GKE AI/ML](https://console.cloud.google.com/kubernetes/aiml/overview).


### Objective

Enable prompt-based testing of the AI model deployed on GKE.

### GPUs

GPUs let you accelerate specific workloads running on your nodes, such as machine learning and data processing. GKE provides a range of machine type options for node configuration, including machine types with NVIDIA H100, L4, and A100 GPUs.

### Understanding the Inference Frameworks

Your model is running on one of two popular and efficient serving frameworks: vLLM or Text Generation Inference (TGI). The following sections provide a brief overview of each to give you context on the underlying technology powering your model.


#### TGI

TGI is a highly optimized open-source LLM serving framework that can increase serving throughput on GPUs. TGI includes features such as:

* Optimized transformer implementation with PagedAttention
* Continuous batching to improve the overall serving throughput
* Tensor parallelism and distributed serving on multiple GPUs

To learn more, refer to the [TGI documentation](https://github.com/huggingface/text-generation-inference/blob/main/README.md).

#### vLLM

vLLM is another fast and easy-to-use library for LLM inference and serving, known for its high throughput and memory efficiency. Key features include:

* PagedAttention: Efficient memory management for handling long sequences and dynamic workloads.
* Continuous batching: Maximizes GPU utilization by batching incoming requests.
* High-throughput serving: Designed for production-level serving with low latency.
* Optimized CUDA kernels.

To learn more, refer to the [vLLM documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/open-models/vllm/use-vllm).
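
The two frameworks expect differently shaped request bodies. As a rough sketch, the helper below builds a JSON body for each: TGI's `/generate` route takes the prompt under `inputs` and sampling options under `parameters`, while vLLM's simple API server takes `prompt` with sampling options at the top level. The function name `build_generate_body` is hypothetical, and a customized serving container may accept a different schema, so treat the field names as assumptions to check against your deployment.

```python
import json


def build_generate_body(framework, prompt, max_tokens=250, temperature=0.5):
    """Return a JSON-serializable request body for a /generate route.

    Hypothetical helper: field names follow the public TGI and vLLM
    docs, but a customized serving container may expect something else.
    """
    if framework == "tgi":
        # TGI reads sampling options from a nested "parameters" object.
        return {
            "inputs": prompt,
            "parameters": {
                "max_new_tokens": max_tokens,
                "temperature": temperature,
            },
        }
    if framework == "vllm":
        # vLLM's simple API server reads sampling options at the top level.
        return {
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": temperature,
        }
    raise ValueError(f"Unknown framework: {framework!r}")


print(json.dumps(build_generate_body("tgi", "What is AI?"), indent=2))
print(json.dumps(build_generate_body("vllm", "What is AI?"), indent=2))
```

Keeping the body construction in one place makes it easier to switch frameworks without touching the code that sends the request.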

In [None]:
# @title # Connect to Google Cloud Project
# @markdown #### Run this cell to configure your Google Cloud environment for Kubernetes (GKE) operations.

# @markdown #### Actions:
# @markdown 1.  **Connects to Project & Region:** Retrieves and sets your Google Cloud project ID and region.
# @markdown 2.  **Enables the GKE API:** Enables `container.googleapis.com` in your project.
# @markdown 3.  **Installs `kubectl`:** Installs the Kubernetes command-line tool.

import os

# Get the default cloud project id.
PROJECT_ID = os.environ["GOOGLE_CLOUD_PROJECT"]

# Get the default region for launching jobs.
REGION = os.environ["GOOGLE_CLOUD_REGION"]

# Set up gcloud.
! gcloud config set project "$PROJECT_ID"
! gcloud services enable container.googleapis.com

# Add kubectl to the set of available tools.
! mkdir -p /tools/google-cloud-sdk/.install
! gcloud components install kubectl --quiet

In [None]:
# @title # Select Cluster and Deployment { vertical-output: true }

# @markdown ## Instruction:

# @markdown This cell provides interactive dropdown menus to select a Google Kubernetes Engine (GKE) cluster and a deployment within that cluster.

# @markdown ***Please select a cluster and deployment before proceeding.***

import json
import subprocess

import ipywidgets as widgets
from IPython.display import display

SELECTED_DEPLOYMENT = None


def get_clusters(p, r):
    try:
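        result = subprocess.run(
            [
                "gcloud",
                "container",
                "clusters",
                "list",
                "--project",
                p,
                "--region",
                r,
                "--format=value(name)",
            ],
            capture_output=True,
            text=True,
            check=True,
        )
        # Drop blank lines so an empty result yields [] rather than [""],
        # which would otherwise hide the "No clusters found" message.
        return [name for name in result.stdout.splitlines() if name.strip()]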
    except subprocess.CalledProcessError as e:
        print(f"Error: {e}")
        return []


def get_deployments(c, r):
    try:
        subprocess.run(
            [
                "gcloud",
                "container",
                "clusters",
                "get-credentials",
                c,
                "--location",
                r,
            ],
            capture_output=True,
            text=True,
            check=True,
        )
        deployments = json.loads(
            subprocess.run(
                ["kubectl", "get", "deployments", "-o", "json"],
                capture_output=True,
                text=True,
                check=True,
            ).stdout
        )
        return [i["metadata"]["name"] for i in deployments["items"]]
    except subprocess.CalledProcessError as e:
        print(f"Error: {e}")
        return []


def create_deployment_dropdown(cluster_name, region, on_select_deployment):
    deployments = get_deployments(cluster_name, region)
    deployments_with_prompt = ["Select Deployment"] + deployments
    deployment_dropdown = widgets.Dropdown(
        options=deployments_with_prompt,
        description="Deployments",
        disabled=False,
        # `width` is not a Dropdown trait; size the widget through its layout.
        layout=widgets.Layout(width="max-content"),
    )
    deployment_dropdown.observe(
        lambda c: on_select_deployment(c["new"])
        if c["type"] == "change" and c["name"] == "value"
        else None,
        names="value",
    )
    return deployment_dropdown


def on_deployment_select(deployment_name):
    global SELECTED_DEPLOYMENT
    SELECTED_DEPLOYMENT = deployment_name
    print(f"Selected deployment: {SELECTED_DEPLOYMENT}")


def on_cluster_change(change):
    if change["type"] == "change" and change["name"] == "value":
        if change["new"] == "Select Cluster":
            return
        deployment_dropdown = create_deployment_dropdown(
            change["new"], REGION, on_deployment_select
        )
        display(deployment_dropdown)


clusters = get_clusters(PROJECT_ID, REGION)
if clusters:
    # @markdown Run this cell to display the Cluster dropdown menu:
    clusters_with_prompt = ["Select Cluster"] + clusters
    cluster_dropdown = widgets.Dropdown(
        options=clusters_with_prompt, description="Clusters", disabled=False
    )
    cluster_dropdown.observe(on_cluster_change, names="value")
    display(cluster_dropdown)
else:
    print(f"No clusters found in {PROJECT_ID}/{REGION}.")

In [None]:
# @title # Chat completion for text-only models {run:"auto", vertical-output: true}

# @markdown You may send prompts to the model server for prediction.
# @markdown
# @markdown * **user_prompt (string):** This is the text prompt you provide to the language model. It's the question or instruction you want the model to respond to (e.g., "Explain neural networks").

# @markdown * **temperature (number):** This parameter controls the randomness of the model's output. It influences how the model selects the next token in the sequence it generates. Typical values range from 0.2 to 1.0.

# @markdown * **max_tokens (number):** This parameter refers to the maximum number of tokens (words or sub-word units) that the model is allowed to generate in its response.

from IPython.display import HTML


def get_deployment_pod_name(deployment):
    try:
        label = deployment + "-app"
        pods = json.loads(
            subprocess.run(
                ["kubectl", "get", "pods", "-o", "json", "-l", f"app={label}"],
                capture_output=True,
                check=True,
            ).stdout
        )
        return pods["items"][0]["metadata"]["name"] if pods["items"] else None
    except (
        subprocess.CalledProcessError,
        json.JSONDecodeError,
        KeyError,
        IndexError,
    ):
        return None


def check_vllm_label(pod_name):
    """Checks if the pod has the 'ai.gke.io/inference-server=vllm' label."""
    try:
        result = subprocess.run(
            ["kubectl", "get", "pod", pod_name, "-o", "json"],
            capture_output=True,
            check=True,
        )
        labels = json.loads(result.stdout)["metadata"]["labels"]
        return labels.get("ai.gke.io/inference-server") == "vllm"
    except (subprocess.CalledProcessError, KeyError, json.JSONDecodeError):
        return False


def process_response(request, pod_name, pod_endpoint, is_vllm):
    # Pass the JSON body as a single argument (no shell involved) so quotes
    # in the prompt cannot break the command.
    result = subprocess.run(
        ["kubectl", "exec", pod_name, "--", "curl", "-s", "-X", "POST",
         f"http://{pod_endpoint}/generate",
         "-H", "Content-Type: application/json", "-d", json.dumps(request)],
        capture_output=True,
        text=True,
    )
    try:
        data = json.loads(result.stdout)
        if is_vllm:
            return data["predictions"][0]
        else:
            return data["generated_text"]
    except (json.JSONDecodeError, KeyError, IndexError) as e:
        return f"Error: {e}, Raw: {result.stdout or result.stderr}"


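# Fail early with a readable message if the previous cell was skipped.
if not SELECTED_DEPLOYMENT or SELECTED_DEPLOYMENT == "Select Deployment":
    raise ValueError("Select a cluster and deployment in the previous cell first.")

deployment_pod = get_deployment_pod_name(SELECTED_DEPLOYMENT)
if deployment_pod is None:
    raise RuntimeError(f"No pod found for deployment {SELECTED_DEPLOYMENT!r}.")
is_vllm_inference = check_vllm_label(deployment_pod)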

user_prompt = "What is AI?"  # @param {type: "string"}
temperature = 0.50  # @param {type: "number"}
max_tokens = 250  # @param {type: "number"}

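max_tokens = 250 if max_tokens is None else int(max_tokens)
temperature = 0.5 if temperature is None else temperature

if is_vllm_inference:
    # vLLM takes sampling options at the top level of the request body.
    request = {
        "prompt": user_prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
else:
    # TGI's /generate route reads sampling options from "parameters".
    request = {
        "inputs": user_prompt,
        "parameters": {"max_new_tokens": max_tokens, "temperature": temperature},
    }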

model_service = SELECTED_DEPLOYMENT + "-service"
output = !kubectl get endpoints {model_service}
# The ENDPOINTS column may list several comma-separated addresses; use the first.
pod_endpoint = output[1].split()[1].split(",")[0]

# @markdown ### Response:
response = process_response(request, deployment_pod, pod_endpoint, is_vllm_inference)
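import html

# Escape the model output so any markup it contains is shown as text.
HTML(
    '<div style="overflow-x: auto; font-size: 16px; line-height:'
    f' 1.8;">{html.escape(str(response))}</div>'
)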

# Next Steps: Integrating the GKE Service Endpoint

After successfully deploying a model on Google Kubernetes Engine (GKE) and verifying it via a notebook, the next step is to integrate it into various applications. This involves making HTTP requests to the service's endpoint from your application code.

### Exposing the Service

To make your deployed model accessible to applications, you'll need to expose its service endpoint. Google Kubernetes Engine offers several ways to do this:

1.  **Ingress:** Configure an Ingress resource to route external HTTP(S) traffic to your service. Set up Ingress for either an internal Load Balancer (accessible only within your VPC) or an external Load Balancer (accessible from the internet). [Learn more about GKE Ingress](https://cloud.google.com/kubernetes-engine/docs/concepts/ingress).
2.  **Gateway API:** A more modern and feature-rich API for managing traffic routing in Kubernetes. Similar to Ingress, Gateway API allows you to define how external and internal traffic should be directed to your services. [Explore GKE Gateway API](https://cloud.google.com/kubernetes-engine/docs/concepts/gateway-api).
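
Once the service is exposed, application code can call it over plain HTTP. The sketch below uses only the Python standard library and assumes a vLLM-style body and a `/generate` route like the one used earlier in this notebook. `MODEL_ENDPOINT_ADDRESS` is a placeholder, and `build_generate_request` and `send` are hypothetical helper names, not part of any library.

```python
import json
import urllib.request


def build_generate_request(base_url, body):
    """Build (but do not send) a POST request to <base_url>/generate."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Placeholder address: replace with your Ingress or Gateway address.
req = build_generate_request(
    "http://MODEL_ENDPOINT_ADDRESS",
    {"prompt": "What is AI?", "max_tokens": 250, "temperature": 0.5},
)
print(req.get_method(), req.full_url)
# Call send(req) once the placeholder points at a reachable endpoint.
```

Separating request construction from sending lets you inspect or log the exact payload before any network call is made.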

### Setting Up Autoscaling

Ensure your model serving can handle varying traffic by configuring the Horizontal Pod Autoscaler (HPA). HPA automatically scales the number of Pods based on resource utilization or custom metrics, optimizing performance and cost. [See how to configure HPA](https://cloud.google.com/kubernetes-engine/docs/how-to/horizontal-pod-autoscaling).
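
As an illustration, an HPA can be described as an `autoscaling/v2` manifest. The sketch below builds one as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The deployment name, replica bounds, and CPU target are placeholder assumptions, and `build_hpa_manifest` is a hypothetical helper. CPU utilization is used here only because it needs no extra setup; GPU-bound model servers may scale better on custom metrics.

```python
import json


def build_hpa_manifest(deployment, min_replicas=1, max_replicas=3, cpu_percent=60):
    """Return an autoscaling/v2 HorizontalPodAutoscaler manifest as a dict."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{deployment}-hpa"},
        "spec": {
            # The workload whose replica count the HPA adjusts.
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            # Scale on average CPU utilization across the Pods.
            "metrics": [
                {
                    "type": "Resource",
                    "resource": {
                        "name": "cpu",
                        "target": {
                            "type": "Utilization",
                            "averageUtilization": cpu_percent,
                        },
                    },
                }
            ],
        },
    }


# "my-model-deployment" is a placeholder; use your own deployment name.
hpa = build_hpa_manifest("my-model-deployment")
print(json.dumps(hpa, indent=2))
```

Save the printed JSON to a file and apply it with `kubectl apply -f <file>`. Scaling out only succeeds if the cluster has, or can provision, GPU nodes for the additional Pods.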

### Setting Up Monitoring

Monitor the health and performance of your deployed model using Google Cloud Managed Service for Prometheus. Configure your model serving to expose Prometheus metrics for comprehensive insights. [Get started with Google Cloud Managed Prometheus](https://cloud.google.com/kubernetes-engine/docs/how-to/configure-automatic-application-monitoring).
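
If you configure scraping manually, Managed Service for Prometheus uses a `PodMonitoring` resource to select Pods and scrape their metrics endpoint. The sketch below builds such a manifest as a Python dictionary. It assumes the Pods carry the `app=<deployment>-app` label that this notebook uses to find them and that the server exposes metrics at `/metrics`; the port value and the helper name `build_pod_monitoring` are hypothetical, so check them against your own deployment.

```python
import json


def build_pod_monitoring(deployment, port="metrics", interval="30s"):
    """Return a PodMonitoring manifest (Managed Service for Prometheus)."""
    return {
        "apiVersion": "monitoring.googleapis.com/v1",
        "kind": "PodMonitoring",
        "metadata": {"name": f"{deployment}-monitoring"},
        "spec": {
            # Assumption: Pods are labeled app=<deployment>-app.
            "selector": {"matchLabels": {"app": f"{deployment}-app"}},
            "endpoints": [
                # Assumption: `port` names (or numbers) the container port
                # that serves Prometheus metrics at /metrics.
                {"port": port, "path": "/metrics", "interval": interval}
            ],
        },
    }


# "my-model-deployment" is a placeholder; use your own deployment name.
monitoring = build_pod_monitoring("my-model-deployment")
print(json.dumps(monitoring, indent=2))
```

As with other manifests, the printed JSON can be saved to a file and applied with `kubectl apply -f <file>`.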

### Additional Resources:

* #### Kubernetes Documentation:
   * Services: https://kubernetes.io/docs/concepts/services-networking/service/

* #### Google Cloud Documentation:
   * Google Kubernetes Engine (GKE): https://cloud.google.com/kubernetes-engine
   * Cloud Load Balancing: https://cloud.google.com/load-balancing/docs/ingress
   * Gateway API on GKE: https://cloud.google.com/kubernetes-engine/docs/concepts/gateway-api
   * Learn about GPUs in GKE: https://cloud.google.com/kubernetes-engine/docs/concepts/gpus

* #### Python requests Library:
   * https://requests.readthedocs.io/en/latest/

* #### LangChain with Google Integrations:
   * LangChain Google integrations: https://python.langchain.com/docs/integrations/providers/google/