# Deploy the Gemma 3 4B instruct model for inference using Amazon SageMaker AI

This notebook demonstrates how to deploy and use the Gemma 3 4B instruct model on Amazon SageMaker AI. Gemma is a family of lightweight, open-weight language models developed by Google, designed to be efficient and easy to use. By following this guide, you'll learn how to set up the model, deploy it as an endpoint, and interact with it for both text and image-based tasks.

The model deployed here is Gemma 3 4B instruct (Hugging Face model ID: `google/gemma-3-4b-it`). The inference container is [Hugging Face TGI](https://github.com/huggingface/text-generation-inference/releases/tag/v3.2.0) (Text Generation Inference), using the Amazon SageMaker [TGI 3.2.0](https://github.com/aws/deep-learning-containers/releases?q=tgi+AND+gpu&expanded=true) image.

[Gemma 3 models](https://ai.google.dev/gemma/docs/core) are multimodal: they accept text and image input and generate text output, with open weights for both pre-trained and instruction-tuned variants. Gemma 3 has a large 128K context window, multilingual support for over 140 languages, and is available in more sizes than previous versions. Gemma 3 models are well suited to a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources, such as laptops, desktops, or your own cloud infrastructure, which broadens access to state-of-the-art AI models.


**License agreement**
- This model is gated on Hugging Face. Refer to the original [model card](https://huggingface.co/google/gemma-3-4b-it) for the license terms.
- This notebook is a sample notebook and not intended for production use.

### Install or upgrade SageMaker

In [1]:
%pip install -Uq sagemaker

Note: you may need to restart the kernel to use updated packages.


### Set up

In [1]:
import json
import time

import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

try:
    # Works when running inside SageMaker (Studio or a notebook instance)
    role = sagemaker.get_execution_role()
except ValueError:
    # Outside SageMaker, look up the role by name; replace the name with your own role
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

session = sagemaker.Session()



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


In [2]:
HF_MODEL_ID = "google/gemma-3-4b-it"

base_name = HF_MODEL_ID.split('/')[-1].replace('.', '-').lower()
model_lineage = HF_MODEL_ID.split("/")[0]
base_name

'gemma-3-4b-it'

### Create SageMaker Model 

Amazon SageMaker is a fully managed service that provides developers and data scientists the ability to build, train, and deploy machine learning (ML) models quickly. SageMaker removes the heavy lifting from each step of the ML process, making it easier to develop high-quality models. The SageMaker Python SDK provides open-source APIs and containers to train and deploy models on SageMaker, using several different ML and deep learning frameworks.

[Hugging Face](https://huggingface.co/) is a popular open-source platform and company that specializes in natural language processing (NLP) and artificial intelligence. Amazon SageMaker AI lets customers train, fine-tune, and run inference using Hugging Face models for Natural Language Processing (NLP) on SageMaker AI. You can use Hugging Face for both training and inference. 

AWS and Hugging Face collaborate to simplify and accelerate adoption of Natural Language Processing models.

For inference, you can use your own trained Hugging Face model or one of the pre-trained Hugging Face models to deploy an inference job with [SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/hugging-face.html). With this collaboration, you only need one line of code to deploy both your trained models and pre-trained models with SageMaker AI. You can run inference jobs without writing any custom inference code, or you can customize the inference logic by providing your own Python script.


Hosting large language models like Gemma on cloud platforms such as Amazon SageMaker offers several advantages:

1. **Scalability**: Easily adjust resources based on demand.
2. **Cost-efficiency**: Pay only for the compute resources you use.
3. **Managed infrastructure**: AWS handles the underlying infrastructure, allowing you to focus on model deployment and usage.
4. **Integration**: Seamlessly connect with other AWS services for comprehensive AI/ML pipelines.
5. **Security**: Leverage AWS's robust security features to protect your model and data.

By using SageMaker, we can deploy Gemma in a production-ready environment with minimal overhead.


#### Set up Hugging Face token
Gemma 3 4B instruct is a gated model, so you need to provide your [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens).

In [None]:
hf_token = 'hf_xxxxxxxxxx' #change to your own token
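
To keep the token out of the notebook itself, one option is to read it from an environment variable. The sketch below is only an illustration: the helper name `load_hf_token` and the variable name `HF_TOKEN` are assumptions, not part of the original notebook.

```python
import os


def load_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the Hugging Face token stored in an environment variable.

    The variable name is an assumption for this sketch; use whatever
    name your environment actually defines.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face token."
        )
    return token
```

With the variable set, `hf_token = load_hf_token()` can replace the hard-coded assignment above.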

#### Set up model environment variables 

In [4]:
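hub = {
    "HF_MODEL_ID": HF_MODEL_ID,          # model to pull from the Hugging Face Hub
    "ENDPOINT_SERVER_TIMEOUT": "1200",
    "SM_NUM_GPUS": "1",                  # the instance type used below has one GPU
    "HUGGING_FACE_HUB_TOKEN": hf_token,  # required because the model is gated
    "PREFIX_CACHING": "0",
    "USE_PREFIX_CACHING": "0",
}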

#### Set image URI
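The [image URI](https://github.com/aws/deep-learning-containers/releases?q=tgi+AND+gpu&expanded=true) currently has to be hard-coded. The URI below points to the `us-west-2` registry; adjust the Region in the URI to match the Region you deploy in.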

In [5]:
tgi_image_uri = '763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.6.0-tgi3.2.0-gpu-py311-cu124-ubuntu22.04-v2.0'
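
If you deploy in another Region, the URI can be assembled from its parts instead of edited by hand. This is a minimal sketch: `tgi_image_uri_for` is a hypothetical helper, and it assumes that the registry account and image tag from the URI above are also published in your Region, which is not true for every Region, so check the release page linked above.

```python
def tgi_image_uri_for(
    region: str,
    account: str = "763104351884",
    tag: str = "2.6.0-tgi3.2.0-gpu-py311-cu124-ubuntu22.04-v2.0",
) -> str:
    """Build the TGI container URI for a Region.

    Account ID and tag are copied from the hard-coded URI above; whether
    they exist in the target Region is an assumption to verify.
    """
    return (
        f"{account}.dkr.ecr.{region}.amazonaws.com/"
        f"huggingface-pytorch-tgi-inference:{tag}"
    )
```

For example, `tgi_image_uri = tgi_image_uri_for(session.boto_region_name)` would follow the Region of the SageMaker session created earlier.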

#### Create HuggingFaceModel

`HuggingFaceModel` is a class provided by the Amazon [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html) that simplifies deploying models from the Hugging Face Hub on Amazon SageMaker.

In [6]:
model_name = base_name + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
model_name

'gemma-3-4b-it2025-04-07-07-11-21'
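
The generated name runs the model name and the timestamp together. A small helper can add a separator and keep the result within SageMaker's naming rules (alphanumeric characters and hyphens, at most 63 characters). The helper name `make_resource_name` is hypothetical, and the sketch is one possible approach rather than what the notebook does.

```python
import re
import time


def make_resource_name(base: str, suffix: str = "", max_len: int = 63) -> str:
    """Join base, a UTC timestamp, and an optional suffix with hyphens."""
    stamp = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
    raw = "-".join(part for part in (base, stamp, suffix) if part)
    # Replace any run of characters other than letters and digits with one hyphen
    cleaned = re.sub(r"[^a-zA-Z0-9]+", "-", raw).strip("-")
    # Truncate to the length limit without leaving a trailing hyphen
    return cleaned[:max_len].rstrip("-")
```

Note that truncation cuts from the end, so a very long base name would lose part of the timestamp or suffix.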

In [7]:
gemma_tgi_model = HuggingFaceModel(
    image_uri=tgi_image_uri,
    env=hub,
    role=role,
    name=model_name,
    sagemaker_session=session
)

### Deploy

Deploying the model creates a SageMaker endpoint, a fully managed HTTPS endpoint for real-time inference. This example uses the `ml.g5.2xlarge` instance type, which has a single GPU. Provisioning can take several minutes. Because `deploy` is called with `wait=False`, the call returns immediately and the endpoint status is polled in a separate cell.

In [8]:
endpointName = model_name+"endpoint"

In [10]:
pretrained_tgi_predictor = gemma_tgi_model.deploy(
    endpoint_name=endpointName,
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # 1 GPU
    wait=False
)

In [24]:
import time
client = boto3.client('sagemaker')

# Poll every 30 seconds until the endpoint leaves the "Creating" state
while True:
    response = client.describe_endpoint(EndpointName=endpointName)
    status = response['EndpointStatus']
    if status != "Creating":
        print("Finished deploy, endpoint status: " + status)
        break
    time.sleep(30)

Finished deploy, endpoint status: InService
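
The status check can also be wrapped in a reusable helper with an upper bound on the wait. The sketch below is an illustration under assumptions: `wait_for_endpoint` is a hypothetical name, and the status lookup is passed in as a callable so the helper does not depend on a live AWS client.

```python
import time
from typing import Callable


def wait_for_endpoint(
    get_status: Callable[[], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 1800.0,
) -> str:
    """Poll get_status() until it returns something other than "Creating".

    Returns the final status (for example "InService" or "Failed") and
    raises TimeoutError if the endpoint is still creating at the deadline.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status != "Creating":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("Endpoint was still creating when the timeout expired.")
        time.sleep(poll_seconds)
```

With the `client` and `endpointName` from the cell above, the call would look like `wait_for_endpoint(lambda: client.describe_endpoint(EndpointName=endpointName)['EndpointStatus'])`. The boto3 SageMaker client also provides an `endpoint_in_service` waiter that serves a similar purpose.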


### Invocation

Once the endpoint is deployed, we can send requests to it for inference. The Gemma model can handle both text-only and multimodal (text + image) inputs. 

**Model Input:**
- Text string, such as a question, a prompt, or a document to be summarized
- Images, normalized to 896 x 896 resolution and encoded to 256 tokens each
- Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size

**Model Output:**
- Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
- Total output context of 8192 tokens

We'll demonstrate both types of interactions in the following examples.


#### Option 1 - Invoke using the predictor

**Text as model Input**

In [26]:
pretrained_tgi_predictor.predict({
	"inputs": "Hi, what can you help me with?",
})

[{'generated_text': "\n\nI'm an AI assistant created by Google. I can assist you with a variety of tasks, including:\n\n*   **Answering your questions:** I can provide information on a huge range of topics. Just ask!\n*   **Generating creative text formats:** I can write stories, poems, code, scripts, musical pieces, email, letters, etc.\n*   **Summarizing text:** I can condense long articles or documents into shorter summaries.\n*   **Translating languages:** I can translate between many different languages.\n*   **Brainstorming ideas:** I can help you come up with ideas for projects, stories, or anything else.\n*   **Performing calculations:** I can do math problems.\n*   **Following your instructions:** I can execute your commands and requests.\n\n**To help me assist you best, please be as specific as possible with your requests.**\n\nSo, what's on your mind? Do you have a question, need help with something, or just want to chat?"}]
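
Text requests can also carry generation settings in a `parameters` object next to `inputs`. The sketch below builds such a payload. The helper name `build_text_payload` and the default values are assumptions chosen for illustration, while `max_new_tokens`, `temperature`, `top_p`, and `do_sample` are TGI generation parameters.

```python
def build_text_payload(
    prompt: str,
    max_new_tokens: int = 256,
    temperature: float = 0.7,
    top_p: float = 0.9,
) -> dict:
    """Build a text-generation request body with sampling parameters."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,  # cap on generated tokens
            "temperature": temperature,        # higher values give more varied output
            "top_p": top_p,                    # nucleus sampling threshold
            "do_sample": True,                 # sample instead of greedy decoding
        },
    }
```

The result can be passed to the predictor in the same way as the plain payload, for example `pretrained_tgi_predictor.predict(build_text_payload("Hi, what can you help me with?", max_new_tokens=128))`.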

**Multimodality - Image as input**

In [3]:
from IPython.display import Image as IPyImage
IPyImage(url="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG", height=300, width= 300)

In [29]:
import json

payload = {
  "messages": [
    {
      "role": "system",
      "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "image_url", 
          "image_url": {"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
        },
        {"type": "text", "text": "What animal is on the candy?"}
      ]
    }
  ]
}

response = pretrained_tgi_predictor.predict(payload)
print(response['choices'][0]['message']['content'])

# Print usage statistics
print("=== Token Usage ===")
usage = response['usage']
print(f"Prompt Tokens: {usage['prompt_tokens']}")
print(f"Completion Tokens: {usage['completion_tokens']}")
print(f"Total Tokens: {usage['total_tokens']}")

Based on the image, the animal on the candy is a **turtle**. You can clearly see the shell shape printed on the teal candy.
=== Token Usage ===
Prompt Tokens: 284
Completion Tokens: 29
Total Tokens: 313
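
The example above references an image by public URL. For an image that is only available locally, one approach is to embed it as a base64 data URI in the same `image_url` field. This sketch assumes the TGI container accepts `data:` URIs there, which you should verify against your container version; `image_message` is a hypothetical helper name.

```python
import base64


def image_message(image_bytes: bytes, question: str, mime_type: str = "image/jpeg") -> dict:
    """Build a chat payload that embeds an image as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
                    },
                    {"type": "text", "text": question},
                ],
            }
        ]
    }
```

A typical call would read the file with `open(path, "rb").read()` and pass the bytes in. Large images increase the request size, so the endpoint's payload limit applies.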


#### Option 2 - Invoke using the endpoint name

**Text as model Input**

In [52]:
import json
import boto3

client = boto3.client('sagemaker-runtime')

input_text = "Hi, what can you help me with?"
input_data = {"inputs": input_text}
encoded_body = json.dumps(input_data).encode('utf-8')

response = client.invoke_endpoint(
    EndpointName=endpointName,
    Body=encoded_body,
    ContentType='application/json'
)

print(response['Body'].read().decode('utf-8'))

[{"generated_text":"\n\nI'm a large language model, created by the Gemma team at Google DeepMind. I can take text and images as inputs and output text. As an open-weights model, I'm widely available for public use!\n\nHere are some things I can do:\n\n*   **Answer your questions:** I can try my best to provide informative and comprehensive answers.\n*   **Generate creative content:** I can write stories, poems, code, scripts, musical pieces, email, letters, etc.\n*   **Translate languages:** I can translate text from one language to another.\n*   **Summarize text:** I can provide concise summaries of longer texts.\n*   **Follow your instructions:** I’ll do my best to follow your instructions and complete your requests thoughtfully.\n\nHow can I help you today?"}]


**Multimodality - Image as input**

In [54]:
imagetext_input = payload
imagetext_encoded_body = json.dumps(imagetext_input).encode('utf-8')

response2 = client.invoke_endpoint(
    EndpointName=endpointName,
    Body=imagetext_encoded_body,
    ContentType='application/json'
)

print(response2['Body'].read().decode('utf-8'))

{"object":"chat.completion","id":"","created":1744011147,"model":"google/gemma-3-4b-it","system_fingerprint":"3.2.0-native","choices":[{"index":0,"message":{"role":"assistant","content":"Okay, let's take a look! \n\nThe animal on the candy is a **turtle**. You can see the shell pattern clearly printed on the candy. \n\nDo you want to know anything more about these candies?"},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":284,"completion_tokens":47,"total_tokens":331}}
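
Unlike the predictor, `invoke_endpoint` returns the raw response body, so the JSON has to be decoded before the answer can be read. A minimal sketch, assuming the chat-completion response shape shown in the output above (`extract_chat_text` is a hypothetical helper name):

```python
import json
from typing import Tuple


def extract_chat_text(raw_body: str) -> Tuple[str, dict]:
    """Return the assistant message and the usage statistics from a chat response."""
    data = json.loads(raw_body)
    message = data["choices"][0]["message"]["content"]
    usage = data.get("usage", {})  # empty dict if the response has no usage block
    return message, usage
```

Because the response stream can only be read once, store the decoded body first, for example `body = response2['Body'].read().decode('utf-8')`, and then call `extract_chat_text(body)`.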


### Clean up

After you've finished experimenting with the model, it's important to clean up the resources to avoid ongoing charges. The following steps will guide you through deleting the endpoint, endpoint configuration, and model.

In [None]:
pretrained_tgi_predictor.delete_model()
pretrained_tgi_predictor.delete_endpoint(delete_endpoint_config=True)

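**Or, using the boto3 client directly:**

In [56]:
client = boto3.client('sagemaker')

client.delete_model(ModelName=model_name)
# The SageMaker Python SDK gives the endpoint configuration the same name as the endpoint
client.delete_endpoint_config(EndpointConfigName=endpointName)
client.delete_endpoint(EndpointName=endpointName)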

{'ResponseMetadata': {'RequestId': 'ca8b1157-041a-42c1-8b70-a1449455fc0c',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'ca8b1157-041a-42c1-8b70-a1449455fc0c',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Mon, 07 Apr 2025 07:34:21 GMT',
   'content-length': '0'},
  'RetryAttempts': 0}}