
feat(telegram): unique photo filenames + caption-aware auto-vision #23

Merged
jkyberneees merged 2 commits into main from feat/telegram-vision-caption-and-unique-filenames
Jun 7, 2026

@jkyberneees
Contributor

Summary

Two fixes for the Telegram photo flow, reported from live use.

1. Filename collision — "image already processed"

DownloadPhoto/DownloadVoice named files photo_<fileID[:16]>.<ext>. Telegram file_ids share a long constant prefix (e.g. AgACAgIAAxkBAAI…) that encodes file-type/datacenter/version — the bytes that actually distinguish one file from another come after char 16. Truncating kept only the shared prefix, so every photo mapped to the same filename and overwrote the previous one, making the bot treat each new image as already-seen.

Fix: new fileIDSuffix() hashes the full file_id (SHA-256, first 16 hex chars) for a genuinely unique suffix. Applied to both photo and voice downloads.
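
A minimal sketch of the hashing scheme described above, using only the standard library. The name fileIDSuffix comes from this PR, but its exact signature is an assumption; truncatedSuffix and photoFilename are hypothetical helpers added here only to contrast the old and new behavior.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// fileIDSuffix hashes the full Telegram file_id with SHA-256 and keeps the
// first 16 hex characters (64 bits) of the digest. Signature is assumed.
func fileIDSuffix(fileID string) string {
	sum := sha256.Sum256([]byte(fileID))
	return hex.EncodeToString(sum[:])[:16]
}

// truncatedSuffix reproduces the old scheme (fileID[:16]) for comparison.
// Hypothetical helper; the length guard is an addition for safety.
func truncatedSuffix(fileID string) string {
	if len(fileID) > 16 {
		return fileID[:16]
	}
	return fileID
}

// photoFilename builds photo_<suffix>.<ext>. Hypothetical helper that mirrors
// the naming pattern quoted in the description.
func photoFilename(fileID, ext string) string {
	return "photo_" + fileIDSuffix(fileID) + "." + ext
}
```

With two made-up IDs that share their first 16 characters, truncatedSuffix returns the same string for both, while fileIDSuffix returns different 16-character suffixes. A 64-bit suffix makes an accidental collision very unlikely at this scale, though not impossible.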

2. Caption-aware auto-vision

A photo can carry a caption (the user's actual request), which was silently dropped — and the agent had to discover/call vision itself.

Fix:

  • Message gains a Caption field; OnPhotoMessage now receives it.
  • New vision.auto_describe config (default true, mirrors transcription.auto_transcribe).
  • On a photo, the bot runs the vision model first (focused by the caption when present) to extract a description, then passes [description] + caption to the agent so that it answers the request using the description.
  • Falls back to the path-based message when auto-describe is off or vision fails.

The extracted description stays wrapped in <untrusted_content> boundaries (image text is untrusted input); the caption is the user's own trusted request.

Behavior

| Input | Before | After |
| --- | --- | --- |
| Two different photos | Same filename → "already processed" | Distinct filenames, each processed |
| Photo + caption "what breed?" | Caption dropped; path handed to agent | Vision extracts description focused on the caption → agent answers "what breed?" |
| Photo, no caption | Path handed to agent | Vision describes → agent summarizes |

Config

Docker configs (config.restricted.json, config.godmode.json) ship vision.auto_describe: true. Note: like auto_transcribe, the default-true only applies when the vision section is entirely absent, so a present section must set the flag explicitly.
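
The default-only-when-absent behavior can be illustrated with a small sketch. The type names, the JSON layout beyond vision.auto_describe, and the resolveVision signature are assumptions made for illustration, not the project's actual config code.

```go
package main

import "encoding/json"

// VisionConfig is a hypothetical stand-in for the real vision section.
type VisionConfig struct {
	AutoDescribe bool `json:"auto_describe"`
}

// Config uses a pointer so that an absent "vision" section can be told apart
// from a present one with zero-valued fields.
type Config struct {
	Vision *VisionConfig `json:"vision"`
}

// resolveVision applies the default (auto_describe = true) only when the
// whole vision section is missing. A present section is taken as written, so
// an omitted flag decodes to false.
func resolveVision(raw []byte) (VisionConfig, error) {
	var cfg Config
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return VisionConfig{}, err
	}
	if cfg.Vision == nil {
		return VisionConfig{AutoDescribe: true}, nil
	}
	return *cfg.Vision, nil
}
```

Under these assumptions, `{}` resolves to auto_describe true, `{"vision": {}}` resolves to false, and `{"vision": {"auto_describe": true}}` resolves to true, which is why the shipped Docker configs set the flag explicitly.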

Tests

  • TestDownloadPhoto_PrefixCollisionAvoided — regression: two IDs sharing a prefix produce different filenames.
  • TestDownloadVoice_HashedFileIDSuffix / TestDownloadPhoto_HashedFileIDSuffix — hashed suffix, raw prefix absent.
  • TestHandleUpdate_PhotoMessage — asserts caption threading.
  • TestResolveVision_Defaults / TestResolveVision_AutoDescribePreserved: auto_describe default + explicit values.

All packages build, go vet clean, tests pass under -race.

Docs

docs/CHEATSHEET.md and docs/TELEGRAM.md updated (auto-describe flow, new filename scheme, updated handler signature).

🤖 Generated with Claude Code

Two fixes for the Telegram photo flow:

1) Filename collision ("image already processed"). DownloadPhoto/DownloadVoice
   named files photo_<fileID[:16]>.<ext>, but Telegram file_ids share a long
   constant prefix (e.g. "AgACAgIAAxkBAAI…") — the distinguishing bytes come
   *after* char 16. Truncating kept only the shared prefix, so every photo
   mapped to the same filename and overwrote the last one. Now we hash the full
   file_id (SHA-256, first 16 hex chars) for a genuinely unique suffix. Adds a
   prefix-collision regression test.

2) Caption-aware vision. Photos can carry a caption (the user's request), which
   was silently dropped, and the agent had to discover/call vision itself. Now:
   - Message gains a Caption field; OnPhotoMessage receives it.
   - New vision.auto_describe config (default true, mirrors auto_transcribe).
   - On a photo, the bot runs the vision model FIRST (focused by the caption if
     present) to extract a description, then injects "[description] + caption"
     to the agent so it answers the request. Falls back to the path-based
     message when auto-describe is off or vision fails.

Docker configs ship vision.auto_describe=true. Docs (CHEATSHEET, TELEGRAM)
updated. All packages build, vet clean, tests pass under -race.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jun 7, 2026

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ Deployment successful! | odek | 714bbc1 | Commit Preview URL, Branch Preview URL | Jun 07 2026, 03:14 PM |

…e funcs

vprotocol auto-repair (§6.2 property tests). The photo-handler message
composition lived inline in an untested closure in package main, leaving the
new branching logic (caption present/absent, vision success/fallback)
unexercised — the binding weakness in the verification η.

Extract three pure functions — photoVisionPrompt, photoVisionMessage,
photoFallbackMessage — and cover them with unit tests, including a regression
that the <untrusted_content> wrapping is preserved verbatim when the
description is injected into the agent (axis 2.8). No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jkyberneees
Contributor Author

vprotocol v5.2.7 — Verification Certificate

PR: #23 feat/telegram-vision-caption-and-unique-filenames · head 714bbc1 · 14 files · +330/-67 LOC
Generator: Claude Opus 4.8 (claude-opus-4-8) · Class: GeneratedCode (code + tests, same model/session)
Run date: 2026-06-07 · Single-model pipeline (B=C=D=E), monoculture fallback


Pre-scan (§0)

Deterministic scan of the diff for injection markers / verdict tokens / new exec sinks: clean. The one new untrusted→LLM path (image description → agent message) is delimited: the vision tool wraps the description in nonce'd <untrusted_content> boundaries before injection, and a regression test now asserts the wrapping survives. Axis 2.8 → pass.

Nine Axes

| Axis | Verdict | Notes |
| --- | --- | --- |
| 2.1 Semantic Correctness | ✅ pass | Explicit error/fallback paths (download, vision, json); empty-caption handled |
| 2.2 Behavioral Contract | ⚠️ warn | No independent spec; PR description is the contract. OnPhotoMessage signature change — all callers + tests updated |
| 2.3 Security Surface | ✅ pass | Caption is user-trusted; description untrusted-wrapped; path from MediaDir; SHA-256 used only for filenames |
| 2.4 Structural Integrity | ✅ pass | Mirrors voice auto-transcribe; fileIDSuffix + 3 pure composers, single-responsibility |
| 2.5 Behavioral Exploration | ✅ pass | Collision regression (shared-prefix ids), empty/oversized caption, vision-error fallback all covered |
| 2.6 Dependency Integrity | ✅ pass | No new deps; stdlib crypto/sha256, encoding/hex |
| 2.7 Generator Provenance | ⚠️ warn | Code + tests: same model, same session → correlated (gates ρ) |
| 2.8 Adversarial Surface | ✅ pass | New image→prompt path explicitly delimited + provenance-tagged; verified by test |
| 2.9 Documentation Coverage | ✅ pass | auto_describe + filename scheme + handler signature documented (CHEATSHEET, TELEGRAM) |

η Derivation (re-derived post-repair)

| Signal | Weight | Value | Note |
| --- | --- | --- | --- |
| m (mutation kill) | 0.34 | 0.62 | composition branches now unit-tested; no mutation runner (estimated) |
| o (oracle agreement) | 0.24 | 0.38 | no independent Agent-C contract |
| b (branch coverage) | 0.14 | 0.70 | changed-line branches covered; handler orchestration still integration-only |
| f (fuzz survival) | 0.09 | 0.90 | no crashes; defensive (estimated, no fuzzer) |
| s (SAST clean) | 0.04 | 1.00 | go vet clean |
| t (static depth) | 0.10 | 1.00 | typed; compiler + vet clean on changed lines |
| d (doc coverage) | 0.05 | 1.00 | config/user surface documented |

η_raw = 0.671 · ρ = 0.24 (family +0.10, version +0.05, spec_independence +0.05, AST ~0.02, shared-mutants ~0.02)
η = clamp(0.671 − 0.24, 0, 1) = 0.431

Verdict: HumanReviewRequired

Binding gates: η 0.431 < 0.80, and ρ 0.24 ∈ (0.20, 0.30] → HumanReviewRequired regardless of η. A single model authoring both code and tests cannot self-certify higher — independent human review is the protocol-mandated next step.
ΔDebt ≈ 0.3 h (Low) · Ci_estimated: true · LOC 397 < 1,500 (standard pipeline).

Auto-Repair Applied

§6.2 property tests (commit 714bbc1): extracted the photo-handler message composition (untested closure in package main) into three pure functions — photoVisionPrompt, photoVisionMessage, photoFallbackMessage — and added unit tests incl. an untrusted-wrapping-preservation regression. Raised η 0.333 → 0.431 by closing the testing gap on the new branching logic. No behavior change.

Open items for the human reviewer

  • Axis 2.7 (correlated generator): confirm the mock/test assumptions match real llama-mtmd-cli + Telegram behavior — code and tests share a single author.
  • Axis 2.2: no formal spec; verify the injected-message phrasing actually elicits the intended "extract then answer" behavior from the production model.
  • The handler orchestration (download→vision→dispatch wiring) remains integration-only by design, consistent with the existing voice handler.

Generated by vprotocol v5.2.7 auto-repair mode (single-model pipeline; ρ applied at full strength per §0.1 monoculture fallback).

@jkyberneees jkyberneees merged commit 903e453 into main Jun 7, 2026
7 checks passed