# SightSafe 👁️
### Offline Visual Narrator powered by Gemma 3n
Helping people with low vision understand their surroundings – privately and without an internet connection.
This notebook shows a working proof-of-concept for SightSafe, an on-device assistive app that:
• Captures an image from the user (camera or gallery).
• Generates a rich scene description with Gemma 3n.
• Lets the user ask follow-up questions about the image.
• Runs entirely offline – perfect for real-world mobility and privacy.
The same logic can be wrapped in a mobile UI (Android/iOS) or deployed on a Raspberry Pi / Jetson for wearable glasses.
Key Gemma 3n capabilities used
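• Multimodal: interleaved image + text understanding.
• On-device ready: the E2B variant (roughly 2B effective parameters, nested inside the larger E4B model) keeps RAM < 4 GB.
• Mix’n’Match sub-models: the nested design allows a trade-off between the fast 2B and the higher-quality 4B configuration. This notebook loads one checkpoint at a time.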
Why this matters – 285 million people worldwide live with visual impairment. Offline, private narration removes data-plan costs, latency, and privacy concerns often associated with cloud vision APIs.

Install & set-up
We use the bleeding-edge Transformers >= 4.54 (Gemma 3n support) and optionally Unsloth for memory-efficient fine-tuning.
Running inside Kaggle/Colab? Skip CUDA drivers – they are pre-installed.

First run only: uncomment and run the following two cells once, restart the kernel, then comment both cells out again.

In [None]:
#!pip uninstall -y torch torchvision torchaudio numpy scipy scikit-learn tensorflow keras

In [None]:
#%pip install -qU \
#  numpy==1.26.4 \
#  torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
#  fsspec==2025.3.0 gcsfs==2025.3.0 "rich<14" \
#  transformers accelerate timm unsloth kagglehub

In [1]:
import os
# Tell Transformers to skip the TensorFlow and Flax backends (USE_TF / USE_FLAX are the variables it reads).
os.environ["USE_TF"]   = "0"
os.environ["USE_FLAX"] = "0"

Download Gemma 3n
Google hosts the Gemma 3n weights on Kaggle Hub. Change the variant at the end of the path (e.g. gemma-3n-e4b-it) to switch model sizes.

In [2]:
import kagglehub, os, gc, torch
from pathlib import Path
MODEL_REPO = "google/gemma-3n/transformers/gemma-3n-e2b-it"  # E2B (about 2B effective parameters), instruction-tuned
MODEL_DIR = Path(kagglehub.model_download(MODEL_REPO))
print("Model files stored in:", MODEL_DIR)

Model files stored in: /kaggle/input/gemma-3n/transformers/gemma-3n-e2b-it/1
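
Choosing between the two sizes can be automated. The sketch below picks a variant from the amount of free memory. The `pick_variant` helper and its 6 GB threshold are assumptions for illustration, not measured requirements, and the `gemma-3n-e4b-it` name assumes the larger variant is published under the same path pattern.

```python
def pick_variant(free_ram_gb: float, threshold_gb: float = 6.0) -> str:
    """Return a Kaggle Hub model handle based on free memory (hypothetical rule)."""
    base = "google/gemma-3n/transformers/"
    # Below the assumed threshold, fall back to the smaller E2B checkpoint.
    name = "gemma-3n-e4b-it" if free_ram_gb >= threshold_gb else "gemma-3n-e2b-it"
    return base + name


print(pick_variant(3.0))
print(pick_variant(8.0))
```

The returned string could be passed to `kagglehub.model_download` in place of the hard-coded `MODEL_REPO`.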


Quick image → description demo
We load the multimodal Gemma 3n checkpoint and send an image together with a prompt asking for a scene description aimed at someone who cannot see it. The reply is capped at 128 new tokens, so it may stop mid-sentence.

In [5]:
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained(MODEL_DIR, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
        MODEL_DIR, torch_dtype="auto", device_map="auto")
# ==== finished loading model === 

#reference image
IMAGE_URL = "https://images.pexels.com/photos/845451/pexels-photo-845451.jpeg?h=512"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": IMAGE_URL},
            {"type": "text",  "text": "Describe this scene for someone who cannot see it."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,          # <-- let the processor handle tokenization
    return_dict=True,
    return_tensors="pt"
).to(model.device, dtype=model.dtype)

gen = model.generate(**inputs, max_new_tokens=128)
description = processor.batch_decode(gen[:, inputs["input_ids"].shape[-1]:],
                                     skip_special_tokens=True)[0]
print(description)

The scene is a bright, modern office space, captured from a slightly elevated angle looking down at a person working at a desk. The dominant feature is a large window that stretches almost the entire height of the frame on the left side, flooding the room with natural light. The light is intense, suggesting it might be daytime.

Below the window, a long wooden desk is visible, running from the top of the frame towards the viewer. It's cluttered with various office items. On the desk, there are multiple computer monitors, keyboards, and other peripherals – it looks like a shared workspace or a densely occupied area. 

A
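
The demo above fetches its image from a URL, which needs a connection. For the offline case, the same message structure can point at a local file instead. The helper below is a sketch: `build_messages` is a hypothetical name, and it assumes the processor accepts a local file path in the `image` field just as it accepts a URL.

```python
from pathlib import Path


def build_messages(image_path, prompt="Describe this scene for someone who cannot see it."):
    """Build a single-turn chat message list around a local image file."""
    path = Path(image_path)
    if not path.is_file():
        # Fail early with a clear error instead of inside the processor.
        raise FileNotFoundError(f"No image at {path}")
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": str(path)},
            {"type": "text", "text": prompt},
        ],
    }]
```

The result can be passed to `processor.apply_chat_template` exactly like the `messages` list in the cell above.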


Follow-up Q&A about the same image
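The model itself is stateless, so the image stays in context only because the full message history (first turn with the image, plus every reply) is sent again on each call. The user does not need to re-upload anything. Ask clarifying questions – "Is it safe to cross the street?", "How many people do you see?", etc.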

In [8]:
# First user message + image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": IMAGE_URL},
            {"type": "text",  "text": "Describe this scene for someone who cannot see it."}
        ]
    }
]

def run(messages, max_tokens=128):
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt"
    ).to(model.device, dtype=model.dtype)

    gen   = model.generate(**inputs, max_new_tokens=max_tokens)
    reply = processor.batch_decode(
                gen[:, inputs["input_ids"].shape[-1]:],
                skip_special_tokens=True
            )[0].strip()
    return reply

# 1️⃣ assistant answers
answer = run(messages)
print("Assistant:", answer)
messages.append(
    {"role": "assistant",
     "content": [{"type": "text", "text": answer}]}
)

# 2️⃣ user follow-up question
question = "Tell me different details?"
messages.append(
    {"role": "user",
     "content": [{"type": "text", "text": question}]}
)

# assistant reply to follow-up
answer2 = run(messages, max_tokens=64)
print("\nAssistant:", answer2)
messages.append(
    {"role": "assistant",
     "content": [{"type": "text", "text": answer2}]}
)

# …continue alternating as needed

Assistant: The scene is viewed from a high angle, looking down at a modern office workspace. The dominant feature is a man sitting at a desk, working on a computer. He's wearing headphones and a light-colored long-sleeved shirt. His posture suggests he's focused on the screen. 

The desk is made of a light-colored wood and has several computer components: a monitor, a keyboard, and a mouse. Wires are visible connecting these devices. To the left of the man, another desk is partially visible, also with a monitor and other equipment. 

The office is bright, likely due to large

Assistant: Okay, let's delve into more details of this office scene, describing it in a more comprehensive way:

The photograph captures a contemporary office environment from a slightly elevated perspective, looking down onto a row of workstations. The main subject is a man diligently working at his desk. He appears to be in his late
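
The append-and-rerun pattern in the cell above can be wrapped in a small class so that each question is a single call. This is a sketch, not part of the original notebook: `ChatSession` is a hypothetical name, and `generate_fn` stands in for the `run` function defined above (any callable that maps a message list to a reply string).

```python
class ChatSession:
    """Keeps the message history for one image and appends each turn."""

    def __init__(self, image, generate_fn):
        self.image = image
        self.generate_fn = generate_fn  # e.g. the `run` function from the cell above
        self.messages = []

    def ask(self, text, **kwargs):
        content = [{"type": "text", "text": text}]
        if not self.messages:
            # Only the first user turn carries the image.
            content.insert(0, {"type": "image", "image": self.image})
        self.messages.append({"role": "user", "content": content})
        reply = self.generate_fn(self.messages, **kwargs)
        self.messages.append(
            {"role": "assistant", "content": [{"type": "text", "text": reply}]}
        )
        return reply


# Stand-in generator so the sketch runs without the model loaded.
def echo(messages, **kwargs):
    return f"(reply to turn {len(messages)})"


session = ChatSession("photo.jpg", echo)
print(session.ask("Describe this scene for someone who cannot see it."))
print(session.ask("How many people do you see?"))
```

With the model loaded, `ChatSession(IMAGE_URL, run)` followed by `ask(..., max_tokens=64)` would follow the same two-turn flow as the cell above.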


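(Optional) Fine-tune with Unsloth LoRA
For higher accuracy on assistive narration, you can LoRA-tune Gemma 3n on a dataset such as VizWiz Captions.
Here we show a tiny 20-sample demo to keep the runtime under 2 minutes. Remove the [:20] slice to train on the full split. Unsloth needs a supported GPU, so the cell fails on CPU-only runtimes, as the error output below shows.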

In [9]:
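# Sketch following the usual Unsloth + TRL pattern; it was not run here (needs a supported GPU).
from unsloth import FastModel
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

sft_model, sft_tokenizer = FastModel.from_pretrained(str(MODEL_DIR), max_seq_length=4096)
dataset = load_dataset('MBZUAI/VizWiz_Captions', split='train[:20]')

def format_example(e):
    # Text-only sample: '[IMAGE]' is a literal placeholder, no pixels are passed.
    return { 'text': f'User: [IMAGE]\nDescribe this image.\nAssistant: {e["caption"]}' }
dataset = dataset.map(format_example, remove_columns=dataset.column_names)

# Attach LoRA adapters to the language layers; the vision tower stays frozen.
sft_model = FastModel.get_peft_model(
        sft_model, r=8, lora_alpha=16, lora_dropout=0.05,
        finetune_vision_layers=False, finetune_language_layers=True)

# Train with TRL's SFTTrainer (the model object itself has no fit() method).
trainer = SFTTrainer(
        model=sft_model, tokenizer=sft_tokenizer, train_dataset=dataset,
        args=SFTConfig(dataset_text_field='text', per_device_train_batch_size=2,
                       num_train_epochs=1, learning_rate=1e-4, output_dir='outputs'))
trainer.train()
sft_model.save_pretrained('sightsafe-lora')  # saves only the LoRA adapter weights
gc.collect(); torch.cuda.empty_cache()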


Please restructure your imports with 'import unsloth' at the top of your file.
  from unsloth import FastLanguageModel


NotImplementedError: Unsloth currently only works on NVIDIA GPUs and Intel GPUs.
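
Note that the `text` field built in the cell above contains only a literal `[IMAGE]` placeholder, so that run would tune the language side on captions without ever seeing pixels. Training on the pictures themselves needs each sample in the same chat-message structure used for inference. A sketch of that conversion follows; `to_chat_example` is a hypothetical helper, and the `image` and `caption` column names are assumptions about the dataset schema.

```python
def to_chat_example(example, prompt="Describe this image."):
    """Convert one captioned image record into a two-turn chat sample."""
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "image", "image": example["image"]},
                {"type": "text", "text": prompt},
            ]},
            {"role": "assistant", "content": [
                {"type": "text", "text": example["caption"]},
            ]},
        ]
    }


sample = to_chat_example({"image": "kitchen.jpg", "caption": "A kettle on a stove."})
print(sample["messages"][1]["content"][0]["text"])  # prints: A kettle on a stove.
```

A vision-aware data collator would still be needed to turn these samples into pixel tensors at training time; that part is outside this sketch.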

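Lightweight Gradio demo (runs offline)
Upload a photo, click Describe, and optionally type a question about it. The Fast 2B checkbox is a placeholder: the demo always uses the checkpoint loaded earlier and only reports the selected mode, because switching sub-models on the fly is not implemented here. Gradio is not in the install list above, so install it first (pip install gradio); otherwise the cell fails with the ModuleNotFoundError shown below.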

In [1]:
import gradio as gr

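def describe_and_chat(image, question, use_fast):
    if image is None:
        return 'Please upload a photo first.', ''
    chosen_submodel = '2B' if use_fast else '4B'
    # The mix'n'match trick would load weight slices; here we only report the choice in the UI.
    msgs = [{ 'role':'user', 'content':[{'type':'image','image':image}, {'type':'text','text':question or 'Describe this scene in detail.'}] }]
    # tokenize=True and return_dict=True are needed to get tensors rather than a prompt string.
    inp = processor.apply_chat_template(msgs, add_generation_prompt=True, tokenize=True,
                                        return_dict=True, return_tensors='pt').to(model.device, dtype=model.dtype)
    gen = model.generate(**inp, max_new_tokens=128)
    # Decode only the newly generated tokens, not the echoed prompt.
    reply = processor.decode(gen[0, inp['input_ids'].shape[-1]:], skip_special_tokens=True).strip()
    return reply, f'Selected mode: {chosen_submodel} (placeholder; the loaded checkpoint is used).'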

with gr.Blocks() as app:
    gr.Markdown('# SightSafe – Offline Visual Narrator')
    image = gr.Image(type='filepath', label='Upload photo')
    question = gr.Textbox(label='Optional follow-up question', placeholder='What colour is the bus?')
    use_fast = gr.Checkbox(label='Fast 2B mode (less VRAM, quicker)')
    btn = gr.Button('Describe')
    out_text = gr.Textbox(label='Assistant')
    meta = gr.Markdown()
    btn.click(describe_and_chat, [image, question, use_fast], [out_text, meta])

app.launch(debug=False, share=False)

ModuleNotFoundError: No module named 'gradio'

Next Steps
Port the pipeline to an Android app using the Google AI Edge SDK.
Add text-to-speech with pyttsx3 for auditory output.
Expand Q&A with long-term scene memory using RecurrentGemma techniques.
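
For the text-to-speech step, one practical detail is that generation stops at `max_new_tokens`, so replies can end mid-sentence, as in the outputs above. The sketch below drops a trailing fragment before speaking. `complete_sentences` and `speak` are hypothetical helpers; `speak` assumes `pyttsx3` is installed and imports it lazily so the rest of the code runs without it.

```python
import re


def complete_sentences(text):
    """Return text up to the last sentence-ending mark; unchanged if none is found."""
    text = text.strip()
    # A sentence end is '.', '!' or '?' followed by whitespace or end of text.
    matches = list(re.finditer(r"[.!?](?=\s|$)", text))
    if not matches:
        return text
    return text[: matches[-1].end()]


def speak(text, rate=170):
    """Read text aloud with pyttsx3 (assumed installed)."""
    import pyttsx3  # lazy import: only needed when audio output is wanted
    engine = pyttsx3.init()
    engine.setProperty("rate", rate)  # speaking rate in words per minute
    engine.say(complete_sentences(text))
    engine.runAndWait()


print(complete_sentences("The office is bright. It is likely due to large"))
```

The simple pattern treats abbreviations such as "e.g." as sentence ends, which is acceptable for trimming but would need refinement for sentence-by-sentence playback.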