# Gemma 3: Google's multimodal, multilingual, long context open LLM

Google has released [**Gemma 3**](https://huggingface.co/collections/google/gemma-3-release-67c6c6f89c4f76621268bb6d), a new iteration of its Gemma family of models. The models range from 1B to 27B parameters, have a context window of up to 128k tokens, accept images and text, and support 140+ languages.


|  | Gemma 2 | Gemma 3 |
| :---- | :---- | :---- |
| Size Variants | <li>2B <li>9B <li>27B | <li>1B <li>4B <li>12B <li>27B |
| Context Window Length | 8k | <li>32k (1B) <li>128k (4B, 12B, 27B) |
| Multimodality (Images and Text) | ❌ | <li>❌ (1B) <li>✅ (4B, 12B, 27B) |
| Multilingual Support | – | <li>English (1B) <li>140+ languages (4B, 12B, 27B) |

All the [models are on the Hub](https://huggingface.co/collections/google/gemma-3-release-67c6c6f89c4f76621268bb6d) and tightly integrated with the Hugging Face ecosystem.

> *Both pre-trained and instruction-tuned models are released. Gemma-3-4B-IT beats Gemma-2-27B-IT, while Gemma-3-27B-IT beats Gemini 1.5 Pro across benchmarks.*

| ![pareto graph](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/gemma3/lmsys.png) |
| :---- |
| Gemma 3 27B sits in the Pareto sweet spot (Source: [Gemma3 Tech Report](https://goo.gle/Gemma3Report)) |

## What is Gemma 3?



| Pre-trained | Instruction-tuned | Multimodal | Multilingual | Input Context Window |
| :---- | :---- | :---- | :---- | :---- |
| [gemma-3-1b-pt](http://hf.co/google/gemma-3-1b-pt) | [gemma-3-1b-it](http://hf.co/google/gemma-3-1b-it) | ❌ | English | 32K |
| [gemma-3-4b-pt](http://hf.co/google/gemma-3-4b-pt) | [gemma-3-4b-it](http://hf.co/google/gemma-3-4b-it) | ✅ | 140+ languages | 128K |
| [gemma-3-12b-pt](http://hf.co/google/gemma-3-12b-pt) | [gemma-3-12b-it](http://hf.co/google/gemma-3-12b-it) | ✅ | 140+ languages | 128K |
| [gemma-3-27b-pt](http://hf.co/google/gemma-3-27b-pt) | [gemma-3-27b-it](http://hf.co/google/gemma-3-27b-it) | ✅ | 140+ languages | 128K |

> [!NOTE]  
> While these are multimodal models, you can use them as *text-only* models (as LLMs) without loading the vision encoder in memory.
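
The model table above can be read as a small lookup problem. The sketch below encodes it as plain data and picks the smallest instruction-tuned checkpoint that satisfies a request. The helper name `pick_gemma3_variant` is hypothetical, and reading 32K and 128K as 32,768 and 131,072 tokens is an assumption made for illustration.

```python
# Rows transcribed from the table above. The exact token counts are an
# assumption: "32K" and "128K" are read as multiples of 1024.
VARIANTS = [
    {"repo_id": "google/gemma-3-1b-it", "params_b": 1, "multimodal": False, "context": 32 * 1024},
    {"repo_id": "google/gemma-3-4b-it", "params_b": 4, "multimodal": True, "context": 128 * 1024},
    {"repo_id": "google/gemma-3-12b-it", "params_b": 12, "multimodal": True, "context": 128 * 1024},
    {"repo_id": "google/gemma-3-27b-it", "params_b": 27, "multimodal": True, "context": 128 * 1024},
]


def pick_gemma3_variant(needs_images: bool, prompt_tokens: int) -> str:
    """Return the smallest checkpoint that fits the request (hypothetical helper)."""
    for variant in sorted(VARIANTS, key=lambda v: v["params_b"]):
        if needs_images and not variant["multimodal"]:
            continue  # the 1B model is text-only
        if prompt_tokens > variant["context"]:
            continue  # the prompt would not fit in the context window
        return variant["repo_id"]
    raise ValueError("no Gemma 3 variant satisfies these requirements")


print(pick_gemma3_variant(needs_images=False, prompt_tokens=2_000))    # google/gemma-3-1b-it
print(pick_gemma3_variant(needs_images=True, prompt_tokens=2_000))     # google/gemma-3-4b-it
print(pick_gemma3_variant(needs_images=False, prompt_tokens=100_000))  # google/gemma-3-4b-it
```

Picking by parameter count alone ignores quality differences between sizes, so treat the result as a lower bound on what will work rather than a recommendation.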

## Technical Enhancements in Gemma 3

The three core enhancements in Gemma 3 over Gemma 2 are:

* Longer context length  
* Multimodality  
* Multilinguality



In [1]:
!pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3

Collecting git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3
  Cloning https://github.com/huggingface/transformers (to revision v4.49.0-Gemma-3) to /tmp/pip-req-build-kpq5ztzh
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-kpq5ztzh
  Running command git checkout -q 1c0f782fe5f983727ff245c4c1b3906f9b99eec2
  Resolved https://github.com/huggingface/transformers to commit 1c0f782fe5f983727ff245c4c1b3906f9b99eec2
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... done
  Created wheel for transformers: filename=transformers-4.50.0.dev0-py3-none-any.whl size=10936457 sha256=88f8addc688fad85e185359bc2061fe107cac792b15850173585a393044ff383
  Stored in directory: /tmp/pip-eph

In [2]:
HUGGINGFACE_TOKEN = "hf_..."
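
A token pasted into a notebook is easy to leak when the notebook is shared. As a hedged alternative, the sketch below reads it from an environment variable. `HF_TOKEN` is the variable name the Hugging Face Hub client conventionally reads; the `read_hf_token` helper itself is hypothetical.

```python
import os


def read_hf_token(env_var: str = "HF_TOKEN"):
    """Return the token stored in `env_var`, or None if it is unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None


HUGGINGFACE_TOKEN = read_hf_token()
if HUGGINGFACE_TOKEN is None:
    # Assumption: without a token or a cached login, gated checkpoints may not download.
    print("HF_TOKEN is not set; downloading gated checkpoints may fail.")
```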

In [3]:
import torch
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-4b-it", # "google/gemma-3-12b-it", "google/gemma-3-27b-it"
    device="cuda",
    torch_dtype=torch.bfloat16,
    token=HUGGINGFACE_TOKEN
)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Device set to use cuda


![Novak-Djokovic-Serbia-US-Open-2023](https://cdn.britannica.com/78/249578-050-01D46C9B/Novak-Djokovic-Serbia-US-Open-2023.jpg)

In [4]:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://cdn.britannica.com/78/249578-050-01D46C9B/Novak-Djokovic-Serbia-US-Open-2023.jpg"},
            {"type": "text", "text": "Give the sport name and player name in this image."}
        ]
    }
]

output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])

Here's the information based on the image:

*   **Sport:** Tennis
*   **Player:** Novak Djokovic
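
The `messages` payload is a plain list of dictionaries, so it can be assembled by a small helper. `build_image_prompt` below is a hypothetical convenience function, not part of `transformers`; it only mirrors the structure used in the cell above.

```python
def build_image_prompt(image_url: str, question: str, role: str = "user") -> list:
    """Build a single-turn chat message holding one image part and one text part."""
    return [
        {
            "role": role,
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]


messages = build_image_prompt(
    "https://cdn.britannica.com/78/249578-050-01D46C9B/Novak-Djokovic-Serbia-US-Open-2023.jpg",
    "Give the sport name and player name in this image.",
)
print(messages[0]["content"][1]["text"])
```

The result can be passed to the pipeline in the same way as the hand-written list, for example `pipe(text=messages, max_new_tokens=200)`.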


Alternatively, we can pass a local copy of the image instead of its URL.

In [5]:
!wget https://cdn.britannica.com/78/249578-050-01D46C9B/Novak-Djokovic-Serbia-US-Open-2023.jpg

--2025-08-25 13:26:13--  https://cdn.britannica.com/78/249578-050-01D46C9B/Novak-Djokovic-Serbia-US-Open-2023.jpg
Resolving cdn.britannica.com (cdn.britannica.com)... 108.138.94.29, 108.138.94.88, 108.138.94.126, ...
Connecting to cdn.britannica.com (cdn.britannica.com)|108.138.94.29|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 198808 (194K) [image/jpeg]
Saving to: ‘Novak-Djokovic-Serbia-US-Open-2023.jpg.1’


2025-08-25 13:26:14 (7.26 MB/s) - ‘Novak-Djokovic-Serbia-US-Open-2023.jpg.1’ saved [198808/198808]



In [6]:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "Novak-Djokovic-Serbia-US-Open-2023.jpg"},
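            # Persian prompt, in English: "What sport is this? Say only the name of the sport, in Persian and in one word, with no explanation."
            {"type": "text", "text": " این چه ورزشی است؟ فقط در یک کلمه اسم ورزش رابه فارسی بگو و توضیح نده."}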
        ]
    }
]

output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])

تENNIS


![bmw](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRkMa5nc8TsQv49NV66I15S_E70CIlWUjxLCg&s)

In [7]:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRkMa5nc8TsQv49NV66I15S_E70CIlWUjxLCg&s"},
            # Persian prompt, in English: "Describe the image."
            {"type": "text", "text": "تصویر را تشریح کن"}
        ]
    }
]

output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])

حتما، در اینجا یک شرح از تصویر است:

**تصویر خودرو:**

تصویر یک خودروی BMW M3 Competition سفید رنگ است. این خودرو از سری M3 جدید است که طراحی جدیدی دارد. این خودرو دارای بدنه ای اسپرت، خطوط تیز و ظاهری تهاجمی است.

**ویژگی‌های کلیدی:**

*   **رنگ:** سفید
*   **طراحی:** ظاهری اسپرت و تهاجمی با المان‌های M
*   **تجهیزات:** دارای اسپویلر جلو، ورودی‌های هوا بزرگتر، و لاستیک‌های چرخه‌ای تیره
*   **چراغ‌ها:** چراغ‌های جلوی LED با طراحی خاص
*   **تایمر:** تایمر دیجیتال با المان‌های M
*   **ملحفه:** پلاک خودروی المانی (LS 3372)
*   **موقعیت


![password](https://huggingface.co/spaces/big-vision/paligemma-hf/resolve/main/examples/password.jpg)

In [12]:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/spaces/big-vision/paligemma-hf/resolve/main/examples/password.jpg"},
            # Persian prompt, in English: "What is the password in this image?"
            {"type": "text", "text": "رمز در این تصویر چیست؟"}
        ]
    }
]

output = pipe(text=messages, max_new_tokens=200,
              generate_kwargs={"do_sample": False})
print(output[0]["generated_text"][-1]["content"])

رمز عبور در این تصویر "aaeu" است.



#### Detailed Inference with Transformers

The transformers integration comes with two new model classes:

1. `Gemma3ForConditionalGeneration`: For 4B, 12B, and 27B vision language models.  
2. `Gemma3ForCausalLM`: For the 1B text-only model, and for loading the vision language models as if they were plain language models (omitting the vision tower).
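
The rule in the list above can be written down as a tiny lookup. `model_class_for` is a hypothetical helper that returns the class *name* as a string, so it runs without importing `transformers`; the size labels are those of the released checkpoints.

```python
VISION_SIZES = {"4b", "12b", "27b"}  # vision language checkpoints
TEXT_ONLY_SIZES = {"1b"}             # text-only checkpoint


def model_class_for(size: str, text_only: bool = False) -> str:
    """Return the name of the transformers class suggested for a Gemma 3 size."""
    size = size.lower()
    if size in TEXT_ONLY_SIZES:
        return "Gemma3ForCausalLM"
    if size in VISION_SIZES:
        # Omitting the vision tower means loading the checkpoint as a plain LM.
        return "Gemma3ForCausalLM" if text_only else "Gemma3ForConditionalGeneration"
    raise ValueError(f"unknown Gemma 3 size: {size!r}")


for size in ("1b", "4b", "27b"):
    print(size, model_class_for(size), model_class_for(size, text_only=True))
```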

In the snippet below we use the model to query an image. The `Gemma3ForConditionalGeneration` class is used to instantiate the vision language model variants. To use the model we pair it with the `AutoProcessor` class. Running inference is as simple as creating the `messages` list, applying a chat template on top, processing the inputs and calling `model.generate`.


In [27]:
del pipe

In [34]:
import gc
import torch

# Run the garbage collector
gc.collect()

# Free the GPU memory cache
torch.cuda.empty_cache()
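
Deleting the pipeline only drops the Python reference. The memory goes back to the GPU once the garbage collector has run and PyTorch's cached blocks are released. The sketch below bundles both steps. `release_gpu_memory` is a hypothetical helper, and it guards the `torch` import so that it also runs on machines without PyTorch or a GPU.

```python
import gc


def release_gpu_memory() -> int:
    """Run the garbage collector, then empty the CUDA cache if one is available.

    Returns the number of unreachable objects reported by gc.collect().
    """
    collected = gc.collect()
    try:
        import torch
    except ImportError:
        return collected  # PyTorch is not installed: nothing else to free
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return collected


freed = release_gpu_memory()
print(f"garbage collector found {freed} unreachable objects")
```

Call it only after every reference to the model has been deleted (for example with `del pipe`); tensors that are still referenced cannot be freed.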


In [29]:
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

ckpt = "google/gemma-3-4b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(
    ckpt,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    token=HUGGINGFACE_TOKEN
)
processor = AutoProcessor.from_pretrained(ckpt, token=HUGGINGFACE_TOKEN)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

![password](https://huggingface.co/spaces/big-vision/paligemma-hf/resolve/main/examples/password.jpg)

In [33]:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/spaces/big-vision/paligemma-hf/resolve/main/examples/password.jpg"},
            # Persian prompt, in English: "What is the password in this image?"
            {"type": "text", "text": "رمز در این تصویر چیست؟"}
        ]
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

input_len = inputs["input_ids"].shape[-1]

generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)



رمز موجود در تصویر "aaeu" است.
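
`model.generate` returns the prompt tokens followed by the newly generated ones, which is why the cell above slices at `input_len` before decoding. The toy example below reproduces that slicing with plain Python lists. The token ids are made up, and `strip_prompt` is a hypothetical helper.

```python
def strip_prompt(sequence: list, input_len: int) -> list:
    """Return only the tokens that follow the first `input_len` prompt tokens."""
    return sequence[input_len:]


prompt_ids = [2, 105, 2364, 107]            # made-up ids standing in for the prompt
full_output = prompt_ids + [818, 5434, 1]   # prompt followed by made-up new tokens

new_tokens = strip_prompt(full_output, input_len=len(prompt_ids))
print(new_tokens)  # [818, 5434, 1]
```

Without the slice, the decoded string would start with the rendered chat template and the user prompt instead of the answer.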


##### Source: [https://github.com/huggingface/blog/blob/main/gemma3.md](https://github.com/huggingface/blog/blob/main/gemma3.md)