<a href="https://colab.research.google.com/github/Sartify/Gemma-3-Notebooks/blob/main/scripts/text_inference_gemma_with_image.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Install the required dependencies for using Hugging Face's Transformers library
# This specific version includes support for Google's Gemma 3 model
!pip install -q git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3

# Upgrade and install bitsandbytes, a library for efficient model loading and quantization
!pip install -U -q bitsandbytes
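
Because the notebook depends on a specific Transformers tag, it can be worth confirming which versions actually ended up in the environment once the install finishes. The following is a minimal sketch that uses only the standard library; the helper name `installed_version` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the two packages installed by the cell above.
for name in ("transformers", "bitsandbytes"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If either line prints `not installed`, the runtime most likely needs a restart or the install cell needs to be re-run.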


In [2]:
# Import necessary libraries

# Hugging Face's Transformers library for working with Gemma 3 model
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

# PIL (Python Imaging Library) for handling image processing
from PIL import Image

# Requests library to fetch images from URLs
import requests

# PyTorch for tensor computations and model execution
import torch

In [3]:
# Define the model ID - you can change this to any available Gemma 3 model variant
model_id = "google/gemma-3-4b-it"

# Load the Gemma 3 model for conditional generation
# device_map="auto" lets Accelerate place the weights on the available devices (GPU if present, otherwise CPU)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto"
).eval()  # Switch to evaluation mode (disables training-only behavior such as dropout)

# Load the corresponding processor for the selected model
# The processor is responsible for preprocessing text and images before feeding them into the model
processor = AutoProcessor.from_pretrained(model_id)
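
How much memory the weights need depends mostly on the dtype they are held in. As a rough sketch: the two checkpoint shards downloaded for `google/gemma-3-4b-it` in this run were about 4.96 GB and 3.64 GB, 8.6 GB in total. Assuming the shards store 2-byte bfloat16 values (an assumption, not something the notebook states), that implies roughly 4.3 billion parameters. The hypothetical helpers below scale that figure to other precisions. They ignore activations, the KV cache, and quantization overhead, so the results are better read as lower bounds.

```python
# Bytes used per parameter for a few common storage precisions.
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "int4": 0.5}

# Shard sizes in GB, as reported when the checkpoint was downloaded.
SHARD_SIZES_GB = (4.96, 3.64)


def estimate_param_count(shard_sizes_gb, stored_dtype="bfloat16"):
    """Estimate the parameter count from on-disk size (assumes `stored_dtype`)."""
    total_bytes = sum(shard_sizes_gb) * 1e9
    return total_bytes / BYTES_PER_PARAM[stored_dtype]


def weight_memory_gb(num_params, dtype):
    """Approximate memory (GB) needed to hold `num_params` weights in `dtype`."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


num_params = estimate_param_count(SHARD_SIZES_GB)
for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gb(num_params, dtype):5.1f} GB")
```

Under these assumptions the weights come to roughly 17.2 GB in float32 and 8.6 GB in bfloat16. That is one reason to consider a reduced-precision or quantized load, for example with the `bitsandbytes` package installed earlier, on smaller GPUs.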

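
The next cell passes the model a chat-style `messages` list in which each turn's `content` is a list of typed parts (`text` or `image`). As a sketch of that structure, the hypothetical helper `build_messages` below assembles the same layout from a prompt and an optional image URL. It is plain Python, not part of the Transformers API, and the default system prompt simply mirrors the one used in this notebook.

```python
from typing import Any, Dict, List, Optional


def build_messages(
    prompt: str,
    image_url: Optional[str] = None,
    system_prompt: str = "You are a helpful assistant.",
) -> List[Dict[str, Any]]:
    """Build a system + user conversation in the typed-content chat format."""
    user_content: List[Dict[str, str]] = []
    if image_url is not None:
        # The image part comes first, followed by the text question about it.
        user_content.append({"type": "image", "image": image_url})
    user_content.append({"type": "text", "text": prompt})

    return [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
        {"role": "user", "content": user_content},
    ]


example_messages = build_messages(
    "Do you have any idea what is this device?",
    image_url="https://raw.githubusercontent.com/Sartify/Gemma-3-Notebooks/main/images/image1.jpg",
)
print(example_messages[1]["content"][0]["type"])  # image
```

A list built this way can be handed to `processor.apply_chat_template` in the same way as the hand-written `messages` in the next cell. Omitting `image_url` yields a text-only conversation.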

In [9]:
# Define the conversation messages for the model
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
        # System message that sets the behavior of the AI assistant
    },
    {
        "role": "user",
        "content": [
            # Image-based input for multimodal understanding
            # The AI model will analyze the image along with the text prompt
            {"type": "image", "image": "https://raw.githubusercontent.com/Sartify/Gemma-3-Notebooks/main/images/image1.jpg"},

            # Text prompt asking about the device in the image
            {"type": "text", "text": "Do you have any idea what is this device?"}
        ]
    }
]

# Render the conversation with the model's chat template and tokenize it
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,  # Append the tokens that open the model's turn so it starts a reply
    tokenize=True,  # Return token IDs instead of a formatted prompt string
    return_dict=True,  # Return a dictionary of model inputs (e.g. input_ids)
    return_tensors="pt"  # Return PyTorch tensors
).to(model.device, dtype=torch.bfloat16)  # Move to the model's device; only floating-point tensors are cast to bfloat16

# Record the prompt length so the prompt tokens can be stripped from the output
input_len = inputs["input_ids"].shape[-1]

# Run generation in inference mode, which disables gradient tracking
with torch.inference_mode():
    # Greedy decoding (do_sample=False) of at most 1024 new tokens
    generation = model.generate(**inputs, max_new_tokens=1024, do_sample=False)

    # generate() returns prompt + completion; keep only the newly generated tokens
    generation = generation[0][input_len:]

# Decode the generated output into human-readable text
decoded = processor.decode(generation, skip_special_tokens=True)

# Print the final response
print(decoded)

Okay, let's analyze the image!

Based on the visible markings, this is an **Elegoo UNO R3 microcontroller board**. 

Here's what I can tell you from the image:

*   **Elegoo:** This is the manufacturer.
*   **UNO R3:** This is the model number, indicating it's a version of the popular Arduino Uno board.
*   **Digital (плата):** This is Russian for "board"
*   **The pin headers:** These are the rows of pins used to connect the board to other components (sensors, LEDs, etc.)
*   **The components:** You can see a few capacitors and a resistor.

**What is it used for?**

The Elegoo UNO R3 is a versatile microcontroller board that's commonly used for:

*   **Arduino projects:** It's a great starting point for learning about electronics and programming.
*   **Robotics:** Controlling motors, sensors, and other robotic components.
*   **Home automation:** Creating smart devices and systems.
*   **Experimentation:**  A general-purpose platform for building and testing electronic circuits.

**Do