# Run LLM inference on Cloud Run GPUs with Gemma 3

## Deploy a Gemma model with a prebuilt container
This guide shows how to run LLM inference on Cloud Run GPUs with Gemma 3 and Ollama. It has the following objectives:
 - Deploy Ollama with the Gemma 3 model on a GPU-enabled Cloud Run service using a prebuilt container.
 - Use the deployed Cloud Run service with the Google Gen AI SDK.

#### Deploy the container image
Deploy the prebuilt container image to make Ollama with Gemma 3 available as a Cloud Run service. Use the following gcloud run deploy command, replacing {SERVICE_NAME}, {IMAGE}, {YOUR_API_KEY}, and {REGION} with your own values:
gcloud run deploy {SERVICE_NAME} \
 --image {IMAGE} \
 --concurrency 4 \
 --cpu 8 \
 --set-env-vars OLLAMA_NUM_PARALLEL=4 \
 --set-env-vars=API_KEY={YOUR_API_KEY} \
 --gpu 1 \
 --gpu-type nvidia-l4 \
 --max-instances 1 \
 --memory 32Gi \
 --allow-unauthenticated \
 --no-cpu-throttling \
 --timeout=600 \
 --region {REGION}
 
The Cloud Run service is configured with:
 - One NVIDIA L4 GPU per instance.
 - A maximum of four concurrent requests per instance (--concurrency 4), matching the number of request slots (OLLAMA_NUM_PARALLEL) that Ollama makes available per model for concurrent inference requests.
 - A maximum of one Cloud Run service instance (--max-instances 1). This value should not exceed your GPU quota per project and per region.
 - CPU throttling disabled (--no-cpu-throttling), which is required to use GPUs.
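
The link between --concurrency and OLLAMA_NUM_PARALLEL can be kept in sync by generating the command from a single value. The following is a minimal Python sketch rather than part of the original guide: the helper name build_deploy_command and its parameters are hypothetical, and the flags are copied from the command above.

```python
import shlex


def build_deploy_command(service, image, region, api_key, parallel=4, max_instances=1):
    """Return the gcloud run deploy argument list used in this guide.

    Hypothetical helper: `parallel` drives both --concurrency and
    OLLAMA_NUM_PARALLEL so the two values cannot drift apart.
    """
    return [
        "gcloud", "run", "deploy", service,
        "--image", image,
        # Cloud Run concurrency matches Ollama's request slots.
        "--concurrency", str(parallel),
        "--cpu", "8",
        "--set-env-vars", f"OLLAMA_NUM_PARALLEL={parallel}",
        "--set-env-vars", f"API_KEY={api_key}",
        "--gpu", "1",
        "--gpu-type", "nvidia-l4",
        "--max-instances", str(max_instances),
        "--memory", "32Gi",
        "--allow-unauthenticated",
        "--no-cpu-throttling",
        "--timeout=600",
        "--region", region,
    ]


command = build_deploy_command(
    service="llm-gemma3-4b",
    image="us-docker.pkg.dev/cloudrun/container/gemma/gemma3-4b",
    region="us-central1",
    api_key="TESTKEY12345",
)
# Print a shell-safe, single-line version that can be pasted into a console.
print(shlex.join(command))
```

Building the command as a list also makes it straightforward to pass to subprocess.run later without shell quoting issues.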

#### Supported Models and Pre-Built Docker Images:

 - gemma-3-1b-it
 - gemma-3-4b-it
 - gemma-3-12b-it
 - gemma-3-27b-it
 - gemma-3n-e2b-it
 - gemma-3n-e4b-it
 
These images have the respective Gemma models bundled:

 - us-docker.pkg.dev/cloudrun/container/gemma/gemma3-1b
 - us-docker.pkg.dev/cloudrun/container/gemma/gemma3-4b
 - us-docker.pkg.dev/cloudrun/container/gemma/gemma3-12b
 - us-docker.pkg.dev/cloudrun/container/gemma/gemma3-27b
 - us-docker.pkg.dev/cloudrun/container/gemma/gemma3n-e2b
 - us-docker.pkg.dev/cloudrun/container/gemma/gemma3n-e4b
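
To avoid pairing a model name with the wrong image, the two lists above can be expressed as a lookup table. This is a sketch under the assumption that the lists correspond item by item in the order shown; the names MODEL_IMAGES and image_for_model are hypothetical.

```python
IMAGE_REPOSITORY = "us-docker.pkg.dev/cloudrun/container/gemma"

# Model name -> image name, assuming the two lists above match in order.
MODEL_IMAGES = {
    "gemma-3-1b-it": "gemma3-1b",
    "gemma-3-4b-it": "gemma3-4b",
    "gemma-3-12b-it": "gemma3-12b",
    "gemma-3-27b-it": "gemma3-27b",
    "gemma-3n-e2b-it": "gemma3n-e2b",
    "gemma-3n-e4b-it": "gemma3n-e4b",
}


def image_for_model(model):
    """Return the prebuilt image path for a supported model name."""
    try:
        return f"{IMAGE_REPOSITORY}/{MODEL_IMAGES[model]}"
    except KeyError:
        supported = ", ".join(sorted(MODEL_IMAGES))
        raise ValueError(f"Unsupported model {model!r}; expected one of: {supported}") from None


print(image_for_model("gemma-3-4b-it"))
```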

### Copy the following command to your console:

In [None]:
print("gcloud run deploy llm-gemma3-4b \
 --image us-docker.pkg.dev/cloudrun/container/gemma/gemma3-4b \
 --concurrency 4 \
 --cpu 8 \
 --set-env-vars OLLAMA_NUM_PARALLEL=4 \
 --set-env-vars=API_KEY=TESTKEY12345 \
 --gpu 1 \
 --gpu-type nvidia-l4 \
 --max-instances 1 \
 --memory 32Gi \
 --allow-unauthenticated \
 --no-cpu-throttling \
 --timeout=600 \
 --region us-central1")

## Send prompts to the Cloud Run service using the Google Gen AI SDK

#### Initializing Google Gen AI SDK Client

In [None]:
from google import genai
from google.genai.types import HttpOptions

# TODO: Replace with your Cloud Run endpoint.
GEMMA_CLOUD_RUN_ENDPOINT = "https://llm-gemma3-4b- ... .us-central1.run.app"
GEMMA_CLOUD_RUN_API_KEY = "TESTKEY12345"

# Configure the client to use your Cloud Run endpoint and API key.
client = genai.Client(
    api_key=GEMMA_CLOUD_RUN_API_KEY,
    http_options=HttpOptions(base_url=GEMMA_CLOUD_RUN_ENDPOINT),
)
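
Hard-coding the endpoint and API key is convenient in a notebook, but it is easy to leak the key when sharing the file. One possible alternative, sketched here, reads both values from environment variables. The function names load_gemma_settings and make_client and the environment variable names are assumptions for illustration; only genai.Client and HttpOptions come from the cell above.

```python
import os


def load_gemma_settings(environ=os.environ):
    """Read the Cloud Run endpoint and API key from environment variables."""
    endpoint = environ.get("GEMMA_CLOUD_RUN_ENDPOINT", "").strip()
    api_key = environ.get("GEMMA_CLOUD_RUN_API_KEY", "").strip()
    if not endpoint.startswith("https://"):
        raise ValueError("GEMMA_CLOUD_RUN_ENDPOINT must be an https:// URL")
    if not api_key:
        raise ValueError("GEMMA_CLOUD_RUN_API_KEY is not set")
    return endpoint, api_key


def make_client(environ=os.environ):
    """Create a Gen AI SDK client that targets the Cloud Run service."""
    # Imported lazily so load_gemma_settings works without the SDK installed.
    from google import genai
    from google.genai.types import HttpOptions

    endpoint, api_key = load_gemma_settings(environ)
    return genai.Client(api_key=api_key, http_options=HttpOptions(base_url=endpoint))
```

With both variables exported in the environment, client = make_client() can stand in for the hard-coded configuration.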

#### Generate content (non-streaming): 

In [None]:
response = client.models.generate_content(
   model="gemma-3-4b-it", # Replace model with the Gemma 3 model you selected, such as "gemma-3-4b-it".
   contents=["How does AI work?"]
)
print(response.text)

#### Generate content (streaming):

In [None]:
response = client.models.generate_content_stream(
   model="gemma-3-4b-it", # Replace model with the Gemma 3 model you selected, such as "gemma-3-4b-it".
   contents=["Write a story about a magic backpack. You are the narrator of an interactive text adventure game."]
)
for chunk in response:
   print(chunk.text, end="")
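
If the full text is needed after streaming finishes, the chunks can be accumulated while they are printed. The sketch below assumes that a chunk's text attribute may be empty or None and skips such chunks; collect_stream is a hypothetical helper, and the fake chunks only stand in for a real response so the example runs without a deployed service.

```python
from types import SimpleNamespace


def collect_stream(chunks, echo=True):
    """Print streamed text as it arrives and return the concatenated result."""
    parts = []
    for chunk in chunks:
        text = getattr(chunk, "text", None)
        if not text:
            # Skip chunks that carry no text instead of printing "None".
            continue
        if echo:
            print(text, end="", flush=True)
        parts.append(text)
    return "".join(parts)


# Stand-in chunks; with a real service, pass the generate_content_stream result.
fake_chunks = [
    SimpleNamespace(text="Once "),
    SimpleNamespace(text=None),
    SimpleNamespace(text="upon a time."),
]
story = collect_stream(fake_chunks, echo=False)
```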

### Clean up
When you have finished, clean up your cloud resources. Deleting the Cloud Run service you deployed avoids further charges for it.
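
As a sketch of the cleanup step, the snippet below prints a gcloud run services delete command for the example service. It assumes the service name llm-gemma3-4b and region us-central1 used earlier in this guide; build_delete_command is a hypothetical helper.

```python
import shlex


def build_delete_command(service, region):
    """Return the gcloud argument list that deletes a Cloud Run service."""
    # --quiet skips the interactive confirmation prompt.
    return ["gcloud", "run", "services", "delete", service, "--region", region, "--quiet"]


delete_command = build_delete_command("llm-gemma3-4b", "us-central1")
print(shlex.join(delete_command))

# To run the deletion from Python instead of copying the command, uncomment:
# import subprocess
# subprocess.run(delete_command, check=True)
```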

Copyright 2025 Google LLC

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.