# Run LLM inference on Cloud Run GPUs with Gemma 3

## Deploy a Gemma model with a prebuilt container
This guide shows how to run LLM inference on Cloud Run GPUs with Gemma 3 and Ollama. It has the following objectives:
 - Deploy Ollama with the Gemma 3 model on a GPU-enabled Cloud Run service using a prebuilt container.
 - Use the deployed Cloud Run service with the Google Gen AI SDK.

Gemma is a family of generative artificial intelligence (AI) models that you can use in a wide variety of generation tasks, including question answering, summarization, and reasoning. Gemma models are provided with open weights and permit responsible commercial use, allowing you to tune and deploy them in your own projects and applications.

#### Deploy the container image
Deploying the container image makes Ollama with Gemma 3 available as a Cloud Run service. Use the following `gcloud run deploy` command to deploy your Cloud Run service:
gcloud run deploy {SERVICE_NAME} \
 --image {IMAGE} \
 --concurrency 4 \
 --cpu 8 \
 --set-env-vars OLLAMA_NUM_PARALLEL=4 \
 --set-env-vars=API_KEY={YOUR_API_KEY} \
 --gpu 1 \
 --gpu-type nvidia-l4 \
 --max-instances 1 \
 --memory 32Gi \
 --allow-unauthenticated \
 --no-cpu-throttling \
 --timeout=600 \
 --region {REGION}
 
Replace the following variables:

 - SERVICE_NAME: The unique name for your Cloud Run service.
 - IMAGE: The Docker image to deploy. This can be one of our pre-built images or an image you built yourself from this repository.
 - YOUR_API_KEY: The key that clients must present to access your service. Set it to a strong, unique string of your choice, and do not reuse an API key from another service. If you're deploying from AI Studio, the key is generated on your behalf. See the Authentication section below for more details.
 - REGION: The Google Cloud region where your Cloud Run service will be deployed (e.g., us-central1). Ensure this region supports the specified GPU type. See GPU support for Cloud Run services for more details. If you're deploying from AI Studio, this defaults to europe-west1.
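
One way to produce a strong, unique value for `YOUR_API_KEY` is Python's standard `secrets` module. The following is a minimal sketch; the helper name `generate_api_key` and the 32-byte default are illustrative choices, not requirements of the container.

```python
import secrets


def generate_api_key(num_bytes: int = 32) -> str:
    """Return a random URL-safe string suitable for use as an API key.

    The helper name and default length are illustrative assumptions.
    """
    return secrets.token_urlsafe(num_bytes)


api_key = generate_api_key()
# Print only the length so the key itself does not end up in notebook output.
print(f"Generated a key with {len(api_key)} characters")
```

Store the generated value somewhere safe, because the same string is needed later when you configure the client.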

After successful deployment, the gcloud command will output the Cloud Run service URL. Save this URL as <cloud_run_url> for interacting with your service.

#### Supported Models and Pre-Built Docker Images:

 - gemma-3-1b-it
 - gemma-3-4b-it
 - gemma-3-12b-it
 - gemma-3-27b-it
 - gemma-3n-e2b-it
 - gemma-3n-e4b-it
 
These images have the respective Gemma models bundled:

 - us-docker.pkg.dev/cloudrun/container/gemma/gemma3-1b
 - us-docker.pkg.dev/cloudrun/container/gemma/gemma3-4b
 - us-docker.pkg.dev/cloudrun/container/gemma/gemma3-12b
 - us-docker.pkg.dev/cloudrun/container/gemma/gemma3-27b
 - us-docker.pkg.dev/cloudrun/container/gemma/gemma3n-e2b
 - us-docker.pkg.dev/cloudrun/container/gemma/gemma3n-e4b
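
The model names and image paths above pair up one to one. As a convenience, the pairing can be captured in a small lookup table. This is only a sketch: the dictionary `MODEL_IMAGES` and the helper `image_for_model` are hypothetical names, and the pairing is inferred from the matching names in the two lists.

```python
IMAGE_PREFIX = "us-docker.pkg.dev/cloudrun/container/gemma"

# Pairing inferred from the two lists above (model name -> prebuilt image).
MODEL_IMAGES = {
    "gemma-3-1b-it": f"{IMAGE_PREFIX}/gemma3-1b",
    "gemma-3-4b-it": f"{IMAGE_PREFIX}/gemma3-4b",
    "gemma-3-12b-it": f"{IMAGE_PREFIX}/gemma3-12b",
    "gemma-3-27b-it": f"{IMAGE_PREFIX}/gemma3-27b",
    "gemma-3n-e2b-it": f"{IMAGE_PREFIX}/gemma3n-e2b",
    "gemma-3n-e4b-it": f"{IMAGE_PREFIX}/gemma3n-e4b",
}


def image_for_model(model: str) -> str:
    """Return the prebuilt image path for a supported model name."""
    try:
        return MODEL_IMAGES[model]
    except KeyError:
        supported = ", ".join(sorted(MODEL_IMAGES))
        raise ValueError(f"Unknown model {model!r}; expected one of: {supported}") from None


print(image_for_model("gemma-3-4b-it"))
```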

### Copy the following command to your console:

In [None]:
print("gcloud run deploy llm-gemma3-4b \
 --image us-docker.pkg.dev/cloudrun/container/gemma/gemma3-4b \
 --concurrency 4 \
 --cpu 8 \
 --set-env-vars OLLAMA_NUM_PARALLEL=4 \
 --set-env-vars=API_KEY=TESTKEY12345 \
 --gpu 1 \
 --gpu-type nvidia-l4 \
 --max-instances 1 \
 --memory 32Gi \
 --allow-unauthenticated \
 --no-cpu-throttling \
 --timeout=600 \
 --region us-central1")
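
Instead of editing the command string by hand, you could assemble it from the variables described earlier. The sketch below is one possible approach: `build_deploy_command` is a hypothetical helper, it reuses the flag values from the command above, and it keeps `--concurrency` equal to `OLLAMA_NUM_PARALLEL` as that command does. `shlex.join` quotes any argument that needs it.

```python
import shlex


def build_deploy_command(service_name, image, api_key, region, parallel=4):
    """Return the gcloud deploy command as a single shell-safe string."""
    args = [
        "gcloud", "run", "deploy", service_name,
        "--image", image,
        "--concurrency", str(parallel),
        "--cpu", "8",
        "--set-env-vars", f"OLLAMA_NUM_PARALLEL={parallel}",
        "--set-env-vars", f"API_KEY={api_key}",
        "--gpu", "1",
        "--gpu-type", "nvidia-l4",
        "--max-instances", "1",
        "--memory", "32Gi",
        "--allow-unauthenticated",
        "--no-cpu-throttling",
        "--timeout", "600",
        "--region", region,
    ]
    return shlex.join(args)


print(
    build_deploy_command(
        service_name="llm-gemma3-4b",
        image="us-docker.pkg.dev/cloudrun/container/gemma/gemma3-4b",
        api_key="TESTKEY12345",  # Example value from this guide; use your own key
        region="us-central1",
    )
)
```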

## Send prompts to the LLM Cloud Run service using the Google Gen AI SDK

#### Initializing Google Gen AI SDK Client

In [None]:
from google import genai
from google.genai.types import HttpOptions
GEMMA_CLOUD_RUN_ENDPOINT = "https://llm-gemma3-4b-25570882233.us-central1.run.app"  # TODO: Replace with your Cloud Run endpoint
GEMMA_CLOUD_RUN_API_KEY = "TESTKEY12345"  # Must match the API_KEY value set at deploy time
# Configure the client to use your Cloud Run endpoint and API key
client = genai.Client(api_key=GEMMA_CLOUD_RUN_API_KEY, http_options=HttpOptions(base_url=GEMMA_CLOUD_RUN_ENDPOINT))
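
Hardcoding the endpoint and key is convenient in a tutorial, but you may prefer to read them from the environment. The following is a minimal sketch of that idea. The helper `load_endpoint_config` is hypothetical, and the environment variable names simply mirror the Python variable names used above.

```python
import os


def load_endpoint_config(env=os.environ):
    """Read the Cloud Run endpoint and API key from a mapping of settings.

    Raises RuntimeError if either value is missing or empty.
    """
    endpoint = env.get("GEMMA_CLOUD_RUN_ENDPOINT", "").rstrip("/")
    api_key = env.get("GEMMA_CLOUD_RUN_API_KEY", "")
    if not endpoint or not api_key:
        raise RuntimeError(
            "Set GEMMA_CLOUD_RUN_ENDPOINT and GEMMA_CLOUD_RUN_API_KEY first."
        )
    return endpoint, api_key


# Placeholder values so the sketch runs without a real deployment.
example_env = {
    "GEMMA_CLOUD_RUN_ENDPOINT": "https://example.invalid/",
    "GEMMA_CLOUD_RUN_API_KEY": "placeholder-key",
}
endpoint, api_key = load_endpoint_config(example_env)
print(endpoint)
```

Calling `load_endpoint_config()` with no argument reads from the real process environment, and the returned pair can be passed to `genai.Client` in the same way as the constants above.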

#### Generate content (non-streaming): 

In [None]:
response = client.models.generate_content(
   model="gemma-3-4b-it", # Replace model with the Gemma 3 model you selected, such as "gemma-3-4b-it".
   contents=["How does AI work?"]
)
print(response.text)
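
To see roughly what the SDK call above sends over the wire, it can help to build the equivalent HTTP request by hand. The sketch below only constructs the request and does not send it. It assumes that the service accepts the Gemini API REST path `/v1beta/models/{model}:generateContent` and the `x-goog-api-key` header; treat both as assumptions to verify against your deployment. The helper name and the placeholder endpoint are hypothetical.

```python
import json
import urllib.request


def build_generate_request(endpoint, api_key, model, prompt):
    """Build (but do not send) a generateContent request.

    Assumes a Gemini-API-style REST path and API key header.
    """
    url = f"{endpoint.rstrip('/')}/v1beta/models/{model}:generateContent"
    body = {"contents": [{"role": "user", "parts": [{"text": prompt}]}]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )


request = build_generate_request(
    "https://example.invalid",  # Placeholder; use your Cloud Run URL
    "placeholder-key",
    "gemma-3-4b-it",
    "How does AI work?",
)
print(request.get_method(), request.full_url)
# To send it to a real deployment: urllib.request.urlopen(request)
```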

#### Generate content (streaming):

In [None]:
response = client.models.generate_content_stream(
   model="gemma-3-4b-it", # Replace model with the Gemma 3 model you selected, such as "gemma-3-4b-it".
   contents=["Write a story about a magic backpack. You are the narrator of an interactive text adventure game."]
)
for chunk in response:
   print(chunk.text, end="")
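
If you need the complete text after streaming finishes, you can accumulate the chunks as they arrive. This is a minimal sketch: `collect_stream_text` is a hypothetical helper, and it skips chunks whose `text` is empty or `None`, which is assumed here to be possible for some chunks. Stand-in objects replace the real stream so the example runs without a deployed service.

```python
from types import SimpleNamespace


def collect_stream_text(chunks):
    """Join the text of all chunks, skipping chunks without text."""
    pieces = []
    for chunk in chunks:
        text = getattr(chunk, "text", None)
        if text:
            pieces.append(text)
    return "".join(pieces)


# Stand-in chunks that mimic objects with a .text attribute.
fake_stream = [
    SimpleNamespace(text="Once "),
    SimpleNamespace(text=None),
    SimpleNamespace(text="upon a time"),
]
print(collect_stream_text(fake_stream))
```

With a real deployment, you would pass the iterator returned by `client.models.generate_content_stream(...)` instead of `fake_stream`.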

#### Generate content with the OpenAI-compatible endpoint

In [None]:
from openai import OpenAI

prompt = "Write a story about a magic backpack. You are the narrator of an interactive text adventure game."

MODEL="gemma3:4b"
client = OpenAI(
    api_key=GEMMA_CLOUD_RUN_API_KEY,
    base_url=f"{GEMMA_CLOUD_RUN_ENDPOINT}/v1",
)
chat_response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
            ],
        }
    ],
    temperature=0.5,
    # extra_headers={"Authorization": f"Bearer {identity_token}"},  # Uncomment if the service requires an identity token
)
print(f"Prompt: {prompt}")
print(chat_response.choices[0].message.content)

## Basic App: Weather Agent

In [None]:
import asyncio
import importlib
import json
import os
import warnings

import pandas as pd
from google.adk.agents import Agent
from google.adk.models.lite_llm import LiteLlm  # For multi-model support
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.adk.tools.tool_context import ToolContext
from google.genai import types  # For creating message Content/Parts
from IPython.display import HTML, Markdown, display

# Ignore all warnings
warnings.filterwarnings("ignore")

import logging

logging.basicConfig(level=logging.ERROR)

In [None]:
LOCATION = "us-central1"
os.environ["GOOGLE_CLOUD_LOCATION"] = LOCATION
os.environ["GOOGLE_GENAI_USE_VERTEXAI"] = "TRUE"  # Use Vertex AI API

In [None]:
%%bash
# Create the agents directory if needed, then write the settings to a .env file
mkdir -p adk_agents
echo "GOOGLE_CLOUD_LOCATION=$GOOGLE_CLOUD_LOCATION
GOOGLE_GENAI_USE_VERTEXAI=$GOOGLE_GENAI_USE_VERTEXAI" > adk_agents/.env

In [None]:
!mkdir -p ./adk_agents/agent1_weather_gemma/

In [None]:
%%writefile ./adk_agents/agent1_weather_gemma/__init__.py
# pylint: skip-file
from . import agent

In [None]:
%%writefile ./adk_agents/agent1_weather_gemma/agent.py
from google.adk.agents import LlmAgent
from google.adk.models.lite_llm import LiteLlm

root_agent = LlmAgent(
    name="weather_agent_v1",
    model=LiteLlm(
        # The "openai/" prefix routes LiteLLM through the service's OpenAI-compatible API
        model="openai/gemma3:4b",
        # TODO: Replace with your Cloud Run endpoint, followed by /v1
        api_base="https://llm-gemma3-4b-25570882233.us-central1.run.app/v1",
        api_key="TESTKEY12345",  # The API_KEY value set at deploy time
        # If the endpoint also needs authentication headers, pass them here:
        # extra_headers=auth_headers
    ),
    description="Provides average weather information for specific cities.",
    instruction="You are a helpful weather assistant. "
                "When the user asks for the weather in a specific city, "
                "use information about average weather.",
    tools=[],  # No tools are registered; the model answers from its own knowledge
)

In [None]:
from adk_agents.agent1_weather_gemma import agent

importlib.reload(agent)  # Force reload

APP_NAME = "weather_tutorial_app"
USER_ID = "user_1"
SESSION_ID = "session_001"  # Using a fixed ID for simplicity

session_service = InMemorySessionService()

# Create the specific session where the conversation will happen
session = await session_service.create_session(
    app_name=APP_NAME, user_id=USER_ID, session_id=SESSION_ID
)

runner = Runner(
    agent=agent.root_agent,  # The agent we want to run
    app_name=APP_NAME,  # Associates runs with our app
    session_service=session_service,  # Uses our session manager
)

In [None]:
async def call_agent_async(query: str, runner, user_id, session_id):
    """Sends a query to the agent and prints the final response."""
    print(f"\n>>> User Query: {query}")

    content = types.Content(role="user", parts=[types.Part(text=query)])

    final_response_text = "Agent did not produce a final response."  # Default

    # Key Concept: run_async executes the agent logic and yields Events.
    # We iterate through events to find the final answer.
    async for event in runner.run_async(
        user_id=user_id, session_id=session_id, new_message=content
    ):
        # You can uncomment the line below to see *all* events during execution
        # print(f"  [Event] Author: {event.author}, Type: {type(event).__name__}, Final: {event.is_final_response()}, Content: {event.content}")

        # Key Concept: is_final_response() marks the concluding message for the turn.
        if event.is_final_response():
            if event.content and event.content.parts:
                # Assuming text response in the first part
                final_response_text = event.content.parts[0].text
            elif (
                event.actions and event.actions.escalate
            ):  # Handle potential errors/escalations
                final_response_text = f"Agent escalated: {event.error_message or 'No specific message.'}"
            # Add more checks here if needed (e.g., specific error codes)
            break  # Stop processing events once the final response is found

    print(f"<<< Agent Response: {final_response_text}")
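
The function above reads only the first part of the final response. If a response carries its text in several parts, you could join them instead. The following sketch shows one way to do that. `join_text_parts` is a hypothetical helper, and stand-in objects mimic the `content.parts[i].text` shape used above so the example runs on its own.

```python
from types import SimpleNamespace


def join_text_parts(content):
    """Concatenate the text of every part, ignoring parts without text."""
    parts = getattr(content, "parts", None) if content is not None else None
    if not parts:
        return ""
    return "".join(part.text for part in parts if getattr(part, "text", None))


# Stand-in for an event's content with one non-text part in the middle.
fake_content = SimpleNamespace(
    parts=[
        SimpleNamespace(text="London is mild "),
        SimpleNamespace(text=None),
        SimpleNamespace(text="and often rainy."),
    ]
)
print(join_text_parts(fake_content))
```

Inside `call_agent_async`, this would replace `event.content.parts[0].text` with `join_text_parts(event.content)`.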

In [None]:
await call_agent_async(
    "What is the weather like in London?",
    runner=runner,
    user_id=USER_ID,
    session_id=SESSION_ID,
)

### Clean up
When you have finished, clean up your cloud resources to avoid incurring further charges. For this guide, that means deleting the Cloud Run service that you deployed.
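
For the service deployed in this guide, cleanup can be done with `gcloud run services delete`. In keeping with the earlier cell that prints the deploy command, the sketch below prints the matching delete command for you to copy to your console. The helper name `build_delete_command` is hypothetical, and the service name and region are the example values used above.

```python
import shlex


def build_delete_command(service_name, region):
    """Return the gcloud command that deletes a Cloud Run service."""
    return shlex.join(
        ["gcloud", "run", "services", "delete", service_name, "--region", region]
    )


print(build_delete_command("llm-gemma3-4b", "us-central1"))
```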

Copyright 2025 Google LLC

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.