# Insight to Impact: Seeing Through Another’s Eyes
**Hands‑on A (20 minutes)**

This mini‑lab lets you _feel_ visual limitations first (color‑vision deficiency and low vision), then shows how modern open‑vocabulary perception, Color Vision Deficiency (CVD)‑aware overlays, and optional text‑to‑speech (TTS) can restore information.

1) **Experience** CVD simulation (what some users actually see).  
2) **Assist** with AI: open‑vocabulary detection → promptable segmentation → high‑contrast overlays.  
3) **Create** accessibility: make small edits (palette / TTS / blur) to help people see better.

**Backends (switchable)**: **[OmDet‑Turbo](https://github.com/om-ai-lab/OmDet)** (default, fastest) · **[Florence‑2](https://huggingface.co/microsoft/Florence-2-large) (experimental)** · **[Grounding‑DINO](https://github.com/IDEA-Research/GroundingDINO) (baseline)**  

**Segmentation**: **SAM v1**

**UI**: Gradio (image upload/webcam), prompt text, CVD type, TTS, exports.

## 0. Runtime setup

In [None]:
# Tip: Use a GPU runtime (Runtime → Change runtime type → GPU).
# Keep installs minimal to stay within 15 minutes.

# Core
%pip -q install torch torchvision --index-url https://download.pytorch.org/whl/cu121
%pip -q install opencv-python pillow matplotlib numpy scipy gTTS gradio ultralytics supervision

# Transformers for OmDet-Turbo / Florence-2
%pip -q install --upgrade transformers accelerate timm huggingface_hub

# GroundingDINO
!git clone -q https://github.com/IDEA-Research/GroundingDINO.git
%pip -q install -e GroundingDINO

# SAM 2 + fallback SAM v1
!git clone -q https://github.com/facebookresearch/sam2.git
%pip -q install -e sam2
%pip -q install git+https://github.com/facebookresearch/segment-anything.git

fatal: destination path 'GroundingDINO' already exists and is not an empty directory.
  Preparing metadata (setup.py) ... done
fatal: destination path 'sam2' already exists and is not an empty directory.
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
  Building editable for SAM-2 (pyproject.toml) ... done
  Preparing metadata (setup.py) ... done


## 1. Download weights (fast and small)

In [None]:
import os, urllib.request, pathlib

WEIGHTS = pathlib.Path("weights"); WEIGHTS.mkdir(exist_ok=True)

# Grounding-DINO (Swin-T OGC)
GDINO_URL = "https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth"
GDINO_PATH = WEIGHTS / "groundingdino_swint_ogc.pth"
if not GDINO_PATH.exists():
    print("Downloading GroundingDINO weights…")
    urllib.request.urlretrieve(GDINO_URL, GDINO_PATH)

# SAM2 tiny
SAM2_URL = "https://huggingface.co/facebook/sam2-hiera-tiny/resolve/main/sam2_hiera_tiny.pt"
SAM2_PATH = WEIGHTS / "sam2_hiera_tiny.pt"
if not SAM2_PATH.exists():
    print("Downloading SAM2 (tiny) weights…")
    urllib.request.urlretrieve(SAM2_URL, SAM2_PATH)

# SAM v1 ViT-B
SAMV1_URL = "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth"
SAMV1_PATH = WEIGHTS / "sam_vit_b_01ec64.pth"
if not SAMV1_PATH.exists():
    print("Downloading SAM v1 (ViT-B) weights…")
    urllib.request.urlretrieve(SAMV1_URL, SAMV1_PATH)

print("✓ Weights ready")

✓ Weights ready
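
The three download blocks above all follow the same check-then-fetch pattern. A minimal sketch of a reusable helper is shown here; `fetch_if_missing` and the `.part` temporary suffix are hypothetical names, not part of the lab. Downloading to a temporary name first makes it less likely that an interrupted download is mistaken for a complete checkpoint on the next run.

```python
import os
import pathlib
import urllib.request

def fetch_if_missing(url, dest):
    """Download `url` to `dest` unless a non-empty file is already there."""
    dest = pathlib.Path(dest)
    if dest.exists() and dest.stat().st_size > 0:
        return dest  # already cached, nothing to do
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")  # hypothetical temp suffix
    urllib.request.urlretrieve(url, tmp)
    os.replace(tmp, dest)  # rename only after the download has finished
    return dest
```

For example, `fetch_if_missing(SAMV1_URL, SAMV1_PATH)` could stand in for the SAM v1 block.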


## 2. Feel it first: CVD simulation (approximate)

In [None]:
import numpy as np, cv2
from PIL import Image
import matplotlib.pyplot as plt

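def simulate_cvd(image_rgb, mode="deutan"):
    """Approximate a CVD view by mixing the RGB channels with a 3x3 matrix."""
    # Accept UI labels such as "deutan (Green)" by keeping only the first word
    mode = str(mode).split()[0].lower() if str(mode).strip() else "none"
    img = image_rgb.astype(np.float32) / 255.0
    if mode=="protan":
        M = np.array([[0.567,0.433,0],[0.558,0.442,0],[0,0.242,0.758]])
    elif mode=="deutan":
        M = np.array([[0.625,0.375,0],[0.7,0.3,0],[0,0.3,0.7]])
    elif mode=="tritan":
        M = np.array([[0.95,0.05,0],[0,0.433,0.567],[0,0.475,0.525]])
    else:
        M = np.eye(3)  # "none" or unknown: leave the image unchanged
    # Each output pixel is M @ [r, g, b]
    return (np.clip(img @ M.T, 0, 1) * 255).astype(np.uint8)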

print("Loaded CVD simulation utilities. Use the Gradio UI to try it interactively.")

Loaded CVD simulation utilities. Use the Gradio UI to try it interactively.
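
To see why these matrices matter, the sketch below applies the deutan matrix from `simulate_cvd` to one pure red and one pure green pixel and compares how far apart the two colors are before and after. The helper names `apply_matrix` and `rgb_distance` are hypothetical, and Euclidean distance in RGB is only a rough stand-in for perceived difference.

```python
import numpy as np

# Deutan approximation, copied from simulate_cvd above
DEUTAN = np.array([[0.625, 0.375, 0.0],
                   [0.7,   0.3,   0.0],
                   [0.0,   0.3,   0.7]])

def apply_matrix(image_rgb, M):
    """Mix the channels of an (H, W, 3) uint8 image with a 3x3 matrix."""
    img = image_rgb.astype(np.float32) / 255.0
    return (np.clip(img @ M.T, 0, 1) * 255).astype(np.uint8)

def rgb_distance(a, b):
    """Euclidean distance between two RGB triples (hypothetical helper)."""
    return float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))

# A 1x2 image: one pure red pixel next to one pure green pixel
swatch = np.array([[[255, 0, 0], [0, 255, 0]]], dtype=np.uint8)
simulated = apply_matrix(swatch, DEUTAN)

before = rgb_distance(swatch[0, 0], swatch[0, 1])
after = rgb_distance(simulated[0, 0], simulated[0, 1])
print(f"red/green distance: {before:.0f} before, {after:.0f} after simulation")
```

The two colors end up much closer together after simulation, which is the kind of lost contrast the overlays in the later steps try to compensate for.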


## 3. Models: Detector backend + SAM v1

In [None]:
import sys, os
sys.path.append(os.path.abspath("GroundingDINO"))

import torch, torch.nn.functional as F, cv2, numpy as np
from PIL import Image
import groundingdino.datasets.transforms as T
from groundingdino.models import build_model
from groundingdino.util.slconfig import SLConfig
from groundingdino.util.utils import clean_state_dict

device = "cuda" if torch.cuda.is_available() else "cpu"

# --- Grounding-DINO ---
def load_groundingdino(cfg_path, ckpt_path):
    args = SLConfig.fromfile(cfg_path)
    model = build_model(args)
    checkpoint = torch.load(str(ckpt_path), map_location="cpu")
    model.load_state_dict(clean_state_dict(checkpoint["model"]), strict=False)
    model.eval()
    return model.to(device)

GDINO_CFG = "GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py"
GDINO_PATH = "weights/groundingdino_swint_ogc.pth"
gdino = load_groundingdino(GDINO_CFG, GDINO_PATH)

# ---- SAM v1 ----
from segment_anything import sam_model_registry, SamPredictor

SAMV1_PATH = "weights/sam_vit_b_01ec64.pth"

sam = sam_model_registry["vit_b"](checkpoint=SAMV1_PATH)
sam.to(device)  # `device` was selected above, before loading Grounding-DINO
sam_predictor = SamPredictor(sam)

use_sam2 = False
print(f"Using SAM v1 on {device}.")

final text_encoder_type: bert-base-uncased
Using SAM v1 on cpu.


## 4. Detection (OmDet‑Turbo / Florence‑2 experimental / Grounding‑DINO)

In [None]:
import time, re
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

# --- OmDet‑Turbo (default) ---
om_processor = AutoProcessor.from_pretrained("omlab/omdet-turbo-swin-tiny-hf")
om_model = AutoModelForZeroShotObjectDetection.from_pretrained(
    "omlab/omdet-turbo-swin-tiny-hf",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto"
).eval()

def omdet_detect(image_rgb, prompt_text, box_threshold=0.25):
    queries = [q.strip() for q in prompt_text.split(",") if q.strip()]
    if len(queries) == 0:
        return np.zeros((0,4), dtype=int), np.zeros((0,), dtype=float), []
    inputs = om_processor(text=queries, images=image_rgb, return_tensors="pt").to(om_model.device)
    with torch.inference_mode(), torch.autocast(device_type="cuda", enabled=torch.cuda.is_available()):
        outputs = om_model(**inputs)
    target_sizes = torch.tensor([image_rgb.shape[:2]], device=om_model.device)
    results = om_processor.post_process_grounded_object_detection(outputs, threshold=box_threshold, target_sizes=target_sizes)[0]
    boxes  = results["boxes"].detach().cpu().numpy().astype(int) if "boxes" in results else np.zeros((0,4), dtype=int)
    scores = results["scores"].detach().cpu().numpy() if "scores" in results else np.zeros((0,), dtype=float)
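    # Prefer class names reported by post-processing (the key name is assumed to vary
    # across transformers releases); otherwise fall back to cycling through the queries.
    names = results.get("text_labels")
    if names is None:
        names = results.get("classes")
    if names is not None and len(names) == len(boxes) and all(isinstance(n, str) for n in names):
        labels = list(names)
    else:
        labels = [queries[i % len(queries)] for i in range(len(boxes))]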
    return boxes, scores, labels

# --- Grounding‑DINO ---
def gdino_detect(image_bgr, text_prompt, box_threshold=0.25, text_threshold=0.25):
    H, W = image_bgr.shape[:2]
    image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
    image_pil = Image.fromarray(image_rgb)

    transform = T.Compose([
        T.RandomResize([800], max_size=1333),
        T.ToTensor(),
        T.Normalize([0.485,0.456,0.406],[0.229,0.224,0.225])
    ])
    image, _ = transform(image_pil, None)
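    # Grounding-DINO works best with lowercase phrases separated by " . " and ending in "."
    caption = " . ".join(s.strip().lower() for s in text_prompt.split(",") if s.strip()) + " ."
    with torch.no_grad():
        outputs = gdino(image[None].to(device), captions=[caption])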
    logits = outputs["pred_logits"].cpu().sigmoid()[0]
    boxes = outputs["pred_boxes"].cpu()[0]
    filt = logits.max(dim=1).values > box_threshold
    boxes = boxes[filt]; scores = logits[filt].max(dim=1).values

    if boxes.numel()==0:
        return np.zeros((0,4), dtype=int), np.zeros((0,), dtype=float), []

    cxcywh = boxes.numpy()
    xyxy = np.zeros_like(cxcywh)
    xyxy[:,0] = (cxcywh[:,0] - cxcywh[:,2]/2.0) * W
    xyxy[:,1] = (cxcywh[:,1] - cxcywh[:,3]/2.0) * H
    xyxy[:,2] = (cxcywh[:,0] + cxcywh[:,2]/2.0) * W
    xyxy[:,3] = (cxcywh[:,1] + cxcywh[:,3]/2.0) * H
    xyxy = np.clip(xyxy, 0, [W-1, H-1, W-1, H-1]).astype(np.int32)

    labels = [s.strip() for s in text_prompt.split(",") if s.strip()] or ["object"]
    return xyxy, scores.numpy(), labels[:len(xyxy)]

# --- Florence‑2 (experimental; safe no-op if unavailable) ---
try:
    from transformers import AutoProcessor as FProcessor, AutoModelForCausalLM as FModel
    fl_processor = FProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)
    fl_model = FModel.from_pretrained(
        "microsoft/Florence-2-base",
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
        device_map="auto", trust_remote_code=True
    ).eval()
    FLORENCE_READY = True
except Exception as e:
    FLORENCE_READY = False
    fl_processor = None; fl_model = None

def florence2_detect(image_rgb, prompt_text):
    if not FLORENCE_READY:
        return np.zeros((0,4), dtype=int), np.zeros((0,), dtype=float), []
    # Task token and post-processing follow the Florence-2 model card (assumed unchanged)
    task = "<OPEN_VOCABULARY_DETECTION>"
    H, W = image_rgb.shape[:2]
    inputs = fl_processor(text=task + prompt_text, images=Image.fromarray(image_rgb), return_tensors="pt")
    inputs = inputs.to(fl_model.device, fl_model.dtype)  # match the model's device and float dtype
    with torch.inference_mode():
        ids = fl_model.generate(**inputs, max_new_tokens=256)
    text = fl_processor.batch_decode(ids, skip_special_tokens=False)[0]
    # Boxes come back as location tokens; the processor converts them to pixel coordinates
    parsed = fl_processor.post_process_generation(text, task=task, image_size=(W, H)).get(task, {})
    boxes = np.array(parsed.get("bboxes", []), dtype=float).reshape(-1, 4).astype(int)
    labels = list(parsed.get("bboxes_labels", []))
    # Florence-2 does not report confidences, so every box gets a score of 1.0
    scores = np.ones((len(boxes),), dtype=float)
    return boxes, scores, labels
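
Grounding‑DINO reports boxes as normalized `(cx, cy, w, h)`, which `gdino_detect` converts to pixel corners one column at a time. The sketch below shows the same conversion in vectorized form; `cxcywh_to_xyxy` is a hypothetical helper name, and the input is assumed to be an `(N, 4)` array with values in [0, 1].

```python
import numpy as np

def cxcywh_to_xyxy(boxes, W, H):
    """Convert normalized center-format boxes to integer pixel corners."""
    boxes = np.asarray(boxes, dtype=np.float32).reshape(-1, 4)
    cx, cy, w, h = boxes.T
    # Corners in normalized units: center minus/plus half the size
    xyxy = np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)
    # Scale x by the image width and y by the image height
    xyxy = xyxy * np.array([W, H, W, H], dtype=np.float32)
    # Keep every corner inside the image
    return np.clip(xyxy, 0, [W - 1, H - 1, W - 1, H - 1]).astype(np.int32)

# A box centered in a 200x100 image, covering half of each dimension
print(cxcywh_to_xyxy([[0.5, 0.5, 0.5, 0.5]], W=200, H=100))
```

The `reshape(-1, 4)` keeps the empty case well defined, so a prompt with no detections still yields a `(0, 4)` array.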

## 5. Segmentation + CVD‑aware overlays + TTS

In [None]:
OKABE_ITO = [
    (0,114,178), (230,159,0), (0,158,115), (213,94,0), (86,180,233), (240,228,66), (204,121,167)
]

def pick_palette(cvd_type):
    # Accept UI labels such as "protan (Red)" by keeping only the first word
    key = str(cvd_type).split()[0].lower() if str(cvd_type).strip() else "none"
    if key in ("protan","deutan"):
        order = [0,1,6,4,5]
    elif key=="tritan":
        order = [3,2,6,1,0]
    elif key=="custom":
        order = [1,3,6]  # TODO (Exercise 1): tweak your own set
    else:
        order = list(range(len(OKABE_ITO)))
    return [OKABE_ITO[i] for i in order]

def segment_from_boxes(image_rgb, boxes_xyxy):
    # Nothing to segment: skip the expensive image embedding entirely
    if len(boxes_xyxy)==0: return []
    sam_predictor.set_image(image_rgb)
    boxes = np.array(boxes_xyxy)
    # Rescale the boxes to the resolution SAM uses internally
    tb = sam_predictor.transform.apply_boxes_torch(torch.from_numpy(boxes), image_rgb.shape[:2])
    with torch.no_grad():
        masks, scores, _ = sam_predictor.predict_torch(point_coords=None, point_labels=None, boxes=tb.to(sam_predictor.device), multimask_output=False)
    # One mask per box, returned as an (N, H, W) array of 0/1 values
    return (masks.squeeze(1).cpu().numpy()>0.5).astype(np.uint8)

def overlay_masks(image_bgr, masks, boxes, cvd_type="deutan", alpha=0.45):
    out = image_bgr.copy()
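    # The Okabe-Ito palette is stored as RGB, but this function draws on a BGR image
    palette = [c[::-1] for c in pick_palette(cvd_type)]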
    for i, m in enumerate(masks):
        color = palette[i % len(palette)]
        fill = np.zeros_like(out); fill[m.astype(bool)] = np.array(color, dtype=np.uint8)
        out = cv2.addWeighted(out, 1.0, fill, alpha, 0)
        edges = cv2.morphologyEx(m, cv2.MORPH_GRADIENT, np.ones((5,5), np.uint8))
        out[edges>0] = (255,255,255)
    for i, box in enumerate(boxes):
        color = palette[i % len(palette)]
        cv2.rectangle(out, (int(box[0]),int(box[1])), (int(box[2]),int(box[3])), color, 3)
    lab = cv2.cvtColor(out, cv2.COLOR_BGR2LAB)
    l,a,b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8,8))
    l2 = clahe.apply(l)
    return cv2.cvtColor(cv2.merge([l2,a,b]), cv2.COLOR_LAB2BGR)

def describe_angle(box, W, H):
    # TODO (Exercise 2): try zone-based wording
    cx = (box[0]+box[2])/2; cy=(box[1]+box[3])/2
    dx, dy = cx - W/2, H/2 - cy  # dy is positive toward the top of the image
    # Clockwise angle from straight up, so 0° is 12 o'clock and 90° is 3 o'clock
    ang = np.degrees(np.arctan2(dx, dy)) % 360
    hours = int(((ang + 15) % 360)//30) or 12
    return f"{hours} o'clock"

from gtts import gTTS
import tempfile, json

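def tts_file(text, speed=1.0, lang="en", out_path=None):
    """Synthesize `text` to an MP3 file and return its path, or None on failure."""
    try:
        # gTTS only offers normal or slow speech, so any speed below 1.0 maps to slow
        tts = gTTS(text=text, lang=lang, slow=(speed < 1.0))
        if out_path is None:
            out_path = tempfile.NamedTemporaryFile(suffix=".mp3", delete=False).name
        tts.save(out_path)
        return out_path
    except Exception as e:
        print("TTS failed:", e)
        return None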
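
The audio callouts describe where an object sits relative to the image center using a clock face. Below is a minimal sketch of one such mapping, assuming 12 o'clock points to the top of the frame and 3 o'clock to the right; `clock_direction` is a hypothetical name used only for this illustration.

```python
import math

def clock_direction(box_xyxy, W, H):
    """Map a box center to the nearest clock hour around the image center."""
    x1, y1, x2, y2 = box_xyxy
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    dx = cx - W / 2   # positive to the right
    dy = H / 2 - cy   # positive toward the top (image y grows downward)
    # Angle measured clockwise from straight up, in [0, 360)
    ang = math.degrees(math.atan2(dx, dy)) % 360
    # 30 degrees per hour; the +15 centers each bin on its hour, and 0 becomes 12
    hour = int(((ang + 15) % 360) // 30) or 12
    return f"{hour} o'clock"

W, H = 640, 480
samples = {
    "right edge": (600, 220, 640, 260),
    "top edge": (300, 0, 340, 40),
    "bottom edge": (300, 440, 340, 480),
    "left edge": (0, 220, 40, 260),
}
for name, box in samples.items():
    print(name, "->", clock_direction(box, W, H))
```

A box exactly at the image center has no direction; in this sketch it falls back to 12 o'clock.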

## 6. Gradio UI (fast path, OmDet‑Turbo by default)

In [None]:
%pip -q install nest_asyncio gradio
import gradio as gr, time, json

def detect_router(image_rgb, prompt_text, backend, box_thr, text_thr):
    if backend=="omdet-turbo":
        return omdet_detect(image_rgb, prompt_text, box_threshold=box_thr)
    elif backend=="florence2":
        return florence2_detect(image_rgb, prompt_text)
    else:
        bgr = cv2.cvtColor(image_rgb, cv2.COLOR_RGB2BGR)
        return gdino_detect(bgr, prompt_text, box_threshold=box_thr, text_threshold=text_thr)

def run_pipeline(img, prompt_text, backend, cvd_type, simulate, tts, tts_speed, box_thr, text_thr, low_vision):
    if img is None or not prompt_text.strip():
        return None, None, "{}"
    img_rgb = img
    if simulate!="none":
        img_rgb = simulate_cvd(img_rgb, simulate)
    if low_vision:
        img_rgb = cv2.GaussianBlur(img_rgb, (21,21), 8)

    t0 = time.time()
    boxes, scores, labels = detect_router(img_rgb, prompt_text, backend, box_thr, text_thr)
    det_ms = (time.time()-t0)*1000

    masks = segment_from_boxes(img_rgb, boxes)
    out_bgr = overlay_masks(cv2.cvtColor(img_rgb, cv2.COLOR_RGB2BGR), masks, boxes, cvd_type=cvd_type, alpha=0.45)
    out_rgb = cv2.cvtColor(out_bgr, cv2.COLOR_BGR2RGB)

    audio_path = None
    W,H = img_rgb.shape[1], img_rgb.shape[0]
    records = []
    for i, box in enumerate(boxes):
        label = labels[i % len(labels)] if labels else "object"
        angle = describe_angle(box, W, H)
        score = float(scores[i]) if len(scores)>i else 1.0
        records.append({"label":label, "box_xyxy":[int(v) for v in box], "angle":angle, "score":score})
    # Timestamped export names so repeated runs do not overwrite each other
    ts = time.strftime("%Y%m%d_%H%M%S")
    img_out = f"annotated_{ts}.png"
    json_out = f"detections_{ts}.json"
    aud_out = f"audio_callouts_{ts}.mp3"

    # Save the annotated image and the detection records
    cv2.imwrite(img_out, cv2.cvtColor(out_rgb, cv2.COLOR_RGB2BGR))
    with open(json_out,"w") as f:
        f.write(json.dumps({"detections":records, "meta":{"backend":backend,"cvd":cvd_type,"sim":simulate,"latency_ms":det_ms}}, indent=2))

    # Synthesize audio only when it is enabled and something was detected
    audio_path = None
    if tts and len(records) > 0:
        try:
            phrases = [f"{r['label']} at {r['angle']}" for r in records]
            audio_path = tts_file(". ".join(phrases), speed=tts_speed or 1.0, out_path=aud_out)
        except Exception as e:
            print("TTS failed:", e)
            audio_path = None

    return out_rgb, audio_path, json.dumps({"detections":records, "latency_ms":det_ms}, indent=2)

with gr.Blocks(title="Accessible AI: Seeing Through Another’s Eyes") as demo:
    gr.Markdown("### Step 1: Feel the limitation → Step 2: Assist with AI → Step 3: Create accessibility")
    with gr.Row():
        with gr.Column():
            in_img = gr.Image(label="Upload or webcam", sources=["upload","webcam"], type="numpy")
            prompt = gr.Textbox(label="Text prompts (comma‑separated)", value="sofa, bottle, table, carpet")
            backend = gr.Radio(choices=["omdet-turbo","florence2","grounding-dino"], value="omdet-turbo", label="Detector backend")
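            # (label, value) pairs: the UI shows the label, the callback receives the value
            cvd = gr.Radio(choices=[("none","none"),("protan (Red)","protan"),("deutan (Green)","deutan"),("tritan (Blue)","tritan"),("custom","custom")], value="deutan", label="CVD overlay")
            simulate = gr.Radio(choices=[("none","none"),("protan (Red)","protan"),("deutan (Green)","deutan"),("tritan (Blue)","tritan")], value="none", label="Simulate CVD")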
            low_vision = gr.Checkbox(False, label="Simulate low‑vision blur")
            tts_chk = gr.Checkbox(False, label="Enable audio (English)")
            tts_speed = gr.Slider(0.5, 1.0, value=1.0, step=0.25, label="TTS speed")
            box_thr = gr.Slider(0.1, 0.6, value=0.25, step=0.05, label="Box threshold")
            text_thr = gr.Slider(0.1, 0.6, value=0.25, step=0.05, label="Text threshold (DINO only)")
            run_btn = gr.Button("Run")
        with gr.Column():
            out_img = gr.Image(label="AI-assisted view (overlay)", height=430)
            out_audio = gr.Audio(label="Audio callouts", type="filepath")
            out_json = gr.JSON(label="Detections JSON")
            gr.Markdown("**Exports** (timestamped): `annotated_<timestamp>.png`, `detections_<timestamp>.json`, `audio_callouts_<timestamp>.mp3`")
    run_btn.click(run_pipeline, inputs=[in_img, prompt, backend, cvd, simulate, tts_chk, tts_speed, box_thr, text_thr, low_vision],
                  outputs=[out_img, out_audio, out_json])

demo.launch(share=False, inbrowser=False, debug=True, show_error=True)
print("UI ready. Select OmDet‑Turbo for the fastest path.")

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Note: opening Chrome Inspector may crash demo inside Colab notebooks.
* To create a public link, set `share=True` in `launch()`.


<IPython.core.display.Javascript object>

ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/applications.py", line 1133, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py",

Keyboard interruption in main thread... closing server.
UI ready. Select OmDet‑Turbo for the fastest path.
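
Each run exports its detections as JSON next to the annotated image. The sketch below shows one way to write and reload such a file with the standard library only. The helper name `save_detections`, its `out_dir` argument, and the sample values are hypothetical; the record fields mirror those built in `run_pipeline`.

```python
import json
import pathlib
import tempfile
import time

def save_detections(records, meta, out_dir="."):
    """Write detections plus run metadata to a timestamped JSON file."""
    ts = time.strftime("%Y%m%d_%H%M%S")
    path = pathlib.Path(out_dir) / f"detections_{ts}.json"
    path.write_text(json.dumps({"detections": records, "meta": meta}, indent=2))
    return path

# Made-up sample values, for illustration only
records = [{"label": "bottle", "box_xyxy": [120, 80, 180, 240],
            "angle": "2 o'clock", "score": 0.71}]
meta = {"backend": "omdet-turbo", "cvd": "deutan", "sim": "none", "latency_ms": 42.0}

with tempfile.TemporaryDirectory() as d:
    path = save_detections(records, meta, out_dir=d)
    loaded = json.loads(path.read_text())
    print(path.name, "->", len(loaded["detections"]), "detection(s)")
```

Keeping the metadata beside the detections makes it easier to compare backends or thresholds across runs later.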


## 7. Mini‑Exercises


1. **Make colors more legible** *(edit `pick_palette`)*  
   - Change the `custom` palette to maximize separability under your chosen CVD type.

2. **Change the narration** *(edit `describe_angle` or phrases in `run_pipeline`)*  
   - Replace clock‑face with zones: `"top-left / center / bottom-right"`.
   - Try more helpful wording: `"Caution: wheelchair at center-left"`.

3. **Low‑vision** *(try in the UI)*  
   - Turn on **Low‑vision blur** and see how overlays restore edges and contrast.
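
For Exercise 2, one possible starting point is a 3×3 grid of zones. The sketch below is only one way to phrase it; `describe_zone` and the exact zone names are assumptions you can change.

```python
def describe_zone(box_xyxy, W, H):
    """Name the 3x3 grid cell that contains the center of a box."""
    x1, y1, x2, y2 = box_xyxy
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    # Column and row index in a 3x3 grid, clamped to stay inside the image
    col = max(0, min(int(3 * cx / W), 2))   # 0 = left, 1 = center, 2 = right
    row = max(0, min(int(3 * cy / H), 2))   # 0 = top,  1 = center, 2 = bottom
    horiz = ("left", "center", "right")[col]
    vert = ("top", "center", "bottom")[row]
    if row == 1 and col == 1:
        return "center"
    if row == 1:
        return f"center-{horiz}"   # e.g. "center-left"
    if col == 1:
        return f"{vert}-center"    # e.g. "top-center"
    return f"{vert}-{horiz}"       # e.g. "bottom-right"

W, H = 300, 300
for box in [(0, 0, 60, 60), (120, 120, 180, 180), (0, 120, 60, 180)]:
    print(box, "->", describe_zone(box, W, H))
```

Calling this in place of `describe_angle` inside `run_pipeline` would turn a callout into something like "bottle at top-left".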