
IrishMehta/AutoFrame


Autonomic Photography Assistant

This project runs an iterative photo-optimization loop for a portrait objective.

The pipeline captures overview images from an Android phone, summarizes the scene with Gemini on Vertex AI, retrieves similar real examples from earlier runs, proposes a new {pan, orientation, z, zoom_ratio} setting, captures a candidate image, scores it, stores the attempt, and repeats until the run finishes or stalls.


Reference Visuals

These images are AI-generated and are included only as illustrative reference material. They may contain inaccuracies; where they disagree with the code or runtime behavior, the code and runtime behavior take precedence.

System Overview

(Image: illustrative system overview)

UI Example

(Image: illustrative UI example)

Flow

Step | What happens
1 | Start from the current local initial pose: pan=0, orientation=portrait, z=23, zoom_ratio=1.0
2 | Capture PHOTO_ASSISTANT_OVERVIEW_COUNT overview images (2 in the current local env)
3 | Summarize the scene with Gemini using VERTEX_SUMMARY_MODEL (gemini-2.5-flash in the current local env)
4 | Retrieve up to 4 good and 4 bad prior examples
5 | Propose the next {pan, orientation, z, zoom_ratio} with Gemini reasoning using VERTEX_REASONER_MODEL (gemini-2.5-flash in the current local env)
6 | In manual mode, move the phone or rig yourself and confirm the capture
7 | In hardware mode, the ESP8266 rig applies the proposal automatically
8 | Score the candidate image with Gemini using VERTEX_SCORER_MODEL (gemini-2.5-flash in the current local env)
9 | Store the attempt, then continue, stall, or finish
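Step 9 leaves the stop decision abstract. The sketch below shows one plausible rule built from the stop-related settings in the example .env (PHOTO_ASSISTANT_GOOD_ENOUGH_THRESHOLD, PHOTO_ASSISTANT_STALL_WINDOW, PHOTO_ASSISTANT_STALL_DELTA, PHOTO_ASSISTANT_MAX_ITERATIONS). The StopConfig type, the decide_next_step function, and the exact stall rule are assumptions for illustration, not AutoFrame's implementation.

```python
from dataclasses import dataclass


@dataclass
class StopConfig:
    # Defaults copied from the example .env; how AutoFrame interprets them is assumed.
    good_enough_threshold: float = 8.8
    stall_window: int = 3
    stall_delta: float = 0.2
    max_iterations: int = 15


def decide_next_step(scores: list[float], cfg: StopConfig) -> str:
    """Return "finish", "stall", or "continue" from the scores recorded so far."""
    if not scores:
        return "continue"
    # Finish as soon as any candidate is good enough.
    if max(scores) >= cfg.good_enough_threshold:
        return "finish"
    # Finish when the iteration budget is used up.
    if len(scores) >= cfg.max_iterations:
        return "finish"
    # Stall when the last `stall_window` attempts improved the best score
    # by less than `stall_delta` compared with everything before them.
    if len(scores) > cfg.stall_window:
        best_before = max(scores[: -cfg.stall_window])
        best_recent = max(scores[-cfg.stall_window :])
        if best_recent - best_before < cfg.stall_delta:
            return "stall"
    return "continue"


cfg = StopConfig()
print(decide_next_step([6.0, 6.1, 6.05, 6.1], cfg))  # stall
```

Separating "stall" from "finish" lets a caller report a stuck run differently from one that reached a good-enough score.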

Current Local Model Configuration

Component | Model
Scene understanding | Vertex Gemini gemini-2.5-flash
Reasoning / proposal generation | Vertex Gemini gemini-2.5-flash
Candidate image scoring | Vertex Gemini gemini-2.5-flash

Each stage reads its model name from its own variable:

  • VERTEX_REASONER_MODEL: reasoning / proposal generation
  • VERTEX_SUMMARY_MODEL: scene understanding
  • VERTEX_SCORER_MODEL: candidate image scoring

VERTEX_VISION_MODEL still works as a legacy fallback: when the split variables are not set, it supplies both the summary and scorer models.

The code defaults in app/config.py remain gemini-2.5-flash for reasoning and gemini-2.5-flash-lite for summary and scoring. The table above reflects the current local .env values, with machine-specific paths and secrets left out.
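A minimal sketch of how the per-stage model names could be resolved, assuming each split variable wins when set, VERTEX_VISION_MODEL fills in only the summary and scorer roles, an empty value counts as unset, and the app/config.py defaults apply last. The resolve_models function and the CODE_DEFAULTS mapping are hypothetical.

```python
# Hypothetical mirror of the app/config.py defaults described above.
CODE_DEFAULTS = {
    "reasoner": "gemini-2.5-flash",
    "summary": "gemini-2.5-flash-lite",
    "scorer": "gemini-2.5-flash-lite",
}


def resolve_models(env: dict[str, str]) -> dict[str, str]:
    """Pick a model per stage: split variable, then legacy fallback, then code default."""
    legacy = env.get("VERTEX_VISION_MODEL") or ""

    def pick(split_key: str, role: str, legacy_applies: bool) -> str:
        value = env.get(split_key) or ""  # assumed: empty string means "not set"
        if value:
            return value
        # VERTEX_VISION_MODEL only covers the vision stages (summary and scoring).
        if legacy_applies and legacy:
            return legacy
        return CODE_DEFAULTS[role]

    return {
        "reasoner": pick("VERTEX_REASONER_MODEL", "reasoner", legacy_applies=False),
        "summary": pick("VERTEX_SUMMARY_MODEL", "summary", legacy_applies=True),
        "scorer": pick("VERTEX_SCORER_MODEL", "scorer", legacy_applies=True),
    }


# Only the legacy variable set: both vision stages use it, reasoning keeps its default.
print(resolve_models({"VERTEX_VISION_MODEL": "gemini-2.5-flash"}))
```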

Current Runtime Assumptions

  • android_http is the main camera adapter for live runs.
  • vertex is the main reasoner for live runs.
  • gemini is the main VLM for live runs.
  • The current local env starts each run from:
    • pan=0
    • orientation=portrait
    • z=23
    • zoom_ratio=1.0
  • The current local env uses:
    • hardware as the default execution mode
    • retrieval_top_k=20
    • max_iterations=15
    • overview_count=2
    • 4 good retrieved examples
    • 4 bad retrieved examples
    • 3 session-memory items
  • The first proposal is seeded by an overview-derived framing prior before retrieval examples are allowed to dominate.
  • Retrieval excludes execution_mode="mock" examples.
  • Mock runs no longer write new examples into long-term retrieval memory.
  • Retrieval also excludes synthetic examples.
  • The current local Vertex env uses:
    • VERTEX_REASONER_TIMEOUT_SECONDS=40
    • VERTEX_REASONER_RETRY_COUNT=2
    • VERTEX_REASONER_RETRY_BACKOFF_SECONDS=3
    • no explicit scorer timeout override, so scorer timeout falls back to the code default
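The retrieval rules above can be sketched as a filter-then-rank step. This assumes each stored example carries an execution mode, a synthetic flag, a good/bad label, and a similarity score against the current scene; the Example fields and select_prompt_examples are hypothetical names, and applying the filters before the top_k cut is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Example:
    # Hypothetical record shape; real examples.jsonl fields may differ.
    example_id: str
    execution_mode: str  # e.g. "manual", "hardware", "mock"
    synthetic: bool
    is_good: bool
    similarity: float  # similarity to the current scene, higher is closer


def select_prompt_examples(
    candidates: list[Example],
    top_k: int = 20,
    good_limit: int = 4,
    bad_limit: int = 4,
) -> tuple[list[Example], list[Example]]:
    # Drop mock-run and synthetic examples before ranking.
    eligible = [
        ex for ex in candidates if ex.execution_mode != "mock" and not ex.synthetic
    ]
    # Keep the top_k most similar, then split into good and bad prompt examples.
    ranked = sorted(eligible, key=lambda ex: ex.similarity, reverse=True)[:top_k]
    good = [ex for ex in ranked if ex.is_good][:good_limit]
    bad = [ex for ex in ranked if not ex.is_good][:bad_limit]
    return good, bad
```

With the local settings, at most 8 examples reach the prompt even though up to 20 are retrieved.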

Phone Camera Setup

The phone side is the separate Android Studio project in android-camera-server/.

It exposes:

  • GET /health
  • POST /capture
  • GET /shot.jpg

The Android app currently serves HTTP on port 8080.

The Python pipeline expects:

PHOTO_ASSISTANT_ANDROID_CAPTURE_URL
PHOTO_ASSISTANT_ANDROID_SNAPSHOT_URL

Typical direct-network values:

PHOTO_ASSISTANT_ANDROID_CAPTURE_URL=http://<phone-ip>:8080/capture
PHOTO_ASSISTANT_ANDROID_SNAPSHOT_URL=http://<phone-ip>:8080/shot.jpg

POST /capture accepts query parameters for:

  • zoom_ratio (or the legacy zoom)
  • orientation, such as portrait or landscape
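A sketch of one capture round trip against the phone server, using only the Python standard library. It assumes POST /capture needs no request body and that GET /shot.jpg then returns the newest frame; build_capture_url and capture_and_fetch are hypothetical helpers, and 192.0.2.10 is a documentation placeholder address.

```python
import urllib.parse
import urllib.request


def build_capture_url(capture_url: str, zoom_ratio: float, orientation: str) -> str:
    # zoom_ratio is the current parameter name; the server also accepts legacy "zoom".
    query = urllib.parse.urlencode({"zoom_ratio": zoom_ratio, "orientation": orientation})
    return f"{capture_url}?{query}"


def capture_and_fetch(
    capture_url: str,
    snapshot_url: str,
    zoom_ratio: float = 1.0,
    orientation: str = "portrait",
    timeout: float = 10.0,  # matches PHOTO_ASSISTANT_ANDROID_TIMEOUT_SECONDS in the example .env
) -> bytes:
    """Trigger a capture, then download the resulting JPEG bytes."""
    request = urllib.request.Request(
        build_capture_url(capture_url, zoom_ratio, orientation), data=b"", method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        response.read()  # assumed: the body is only an acknowledgement
    with urllib.request.urlopen(snapshot_url, timeout=timeout) as response:
        return response.read()


print(build_capture_url("http://192.0.2.10:8080/capture", 1.0, "portrait"))
# http://192.0.2.10:8080/capture?zoom_ratio=1.0&orientation=portrait
```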

ESP8266 Rig Setup

The hardware execution mode expects the ESP8266 rig controller to expose:

  • GET /health
  • GET /pose
  • POST /apply
  • POST /task/start
  • POST /task/end

Set:

PHOTO_ASSISTANT_ESP8266_BASE_URL=http://<esp-ip>

The currently deployed ESP8266 controller source is stored in esp8266_code.txt as a reference snapshot of what is running on the rig. That file shows:

  • the controller serves HTTP on port 80
  • AutoFrame-compatible endpoints include GET /health, GET /pose, GET or POST /apply, and GET or POST for both /task/start and /task/end
  • the deployed home pose is pan=0, orientation=portrait, z=23
  • accepted orientation values are portrait and landscape
  • the configured height range is 23 to 90 cm
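Before sending a proposal to the rig, a client can reject poses the deployed controller will not accept. This sketch checks the orientation values and the 23 to 90 cm height range listed above. The JSON body shape for POST /apply, the 10-second timeout, and the helper names validate_pose and apply_pose are assumptions; the real request format is whatever esp8266_code.txt implements.

```python
import json
import urllib.request

# Limits taken from the deployed controller description above.
ALLOWED_ORIENTATIONS = {"portrait", "landscape"}
Z_MIN_CM, Z_MAX_CM = 23, 90


def validate_pose(pan: float, orientation: str, z: float) -> dict:
    """Return a pose dict, or raise ValueError if the rig would reject it."""
    if orientation not in ALLOWED_ORIENTATIONS:
        raise ValueError(f"unsupported orientation: {orientation!r}")
    if not Z_MIN_CM <= z <= Z_MAX_CM:
        raise ValueError(f"z={z} cm is outside the rig range {Z_MIN_CM}-{Z_MAX_CM} cm")
    return {"pan": pan, "orientation": orientation, "z": z}


def apply_pose(base_url: str, pan: float, orientation: str, z: float, timeout: float = 10.0) -> bytes:
    # Assumed request format: JSON body with pan, orientation, and z.
    body = json.dumps(validate_pose(pan, orientation, z)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/apply",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Validating on the Python side surfaces a bad proposal before any motor moves.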

Environment

HPC workflow

Before running any Python command in the HPC environment, run:

module load mamba
source activate autoframe
export HF_HOME="/scratch/ihmehta/hf_cache"

Local Windows workflow

Use the repo venv interpreter explicitly to avoid python / pip mismatches:

py -3.11 -m venv autoframe
.\autoframe\Scripts\python.exe -m pip install -r requirements.txt

Important:

  • Both google-auth and requests are required for Vertex authentication through Application Default Credentials (ADC).
  • app.config re-reads the repo .env whenever get_settings() is called.
  • Values in .env currently override inherited process environment variables with the same keys.
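Together, the last two points mean a value in .env wins over the same key inherited from the shell, and edits to .env take effect on the next get_settings() call. A minimal sketch of that precedence, assuming plain KEY=VALUE lines without quoting or export prefixes; load_settings is a hypothetical stand-in, not the code in app.config.

```python
import os
from pathlib import Path


def load_settings(env_path: Path) -> dict[str, str]:
    """Merge the process environment with .env, letting .env win on shared keys."""
    settings = dict(os.environ)
    if env_path.exists():
        # Re-read on every call, so edits to .env apply without a restart.
        for raw_line in env_path.read_text(encoding="utf-8").splitlines():
            line = raw_line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings
```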

Example .env

This example reflects the current local env values, except that network URLs, project identifiers, and credential paths stay portable and placeholder-safe.

PHOTO_ASSISTANT_PROJECT_ROOT=.
PHOTO_ASSISTANT_DATA_DIR=./data
PHOTO_ASSISTANT_LOG_LEVEL=INFO

PHOTO_ASSISTANT_ESP8266_BASE_URL=http://<esp-ip>

PHOTO_ASSISTANT_DEFAULT_MODE=hardware
PHOTO_ASSISTANT_CAMERA_KIND=android_http
PHOTO_ASSISTANT_VLM_KIND=gemini
PHOTO_ASSISTANT_REASONER_KIND=vertex
PHOTO_ASSISTANT_DRY_RUN=false

PHOTO_ASSISTANT_ANDROID_CAPTURE_URL=http://<phone-ip>:8080/capture
PHOTO_ASSISTANT_ANDROID_SNAPSHOT_URL=http://<phone-ip>:8080/shot.jpg
PHOTO_ASSISTANT_ANDROID_TIMEOUT_SECONDS=10.0

PHOTO_ASSISTANT_INITIAL_PAN=0
PHOTO_ASSISTANT_INITIAL_ORIENTATION=portrait
PHOTO_ASSISTANT_INITIAL_Z=23

PHOTO_ASSISTANT_RETRIEVAL_TOP_K=20
PHOTO_ASSISTANT_PROMPT_GOOD_EXAMPLES_LIMIT=4
PHOTO_ASSISTANT_PROMPT_BAD_EXAMPLES_LIMIT=4
PHOTO_ASSISTANT_PROMPT_SESSION_MEMORY_LIMIT=3
PHOTO_ASSISTANT_OBJECTIVE_THRESHOLD=0.35
PHOTO_ASSISTANT_OBJECTIVE_WEIGHT=0.65
PHOTO_ASSISTANT_SCENE_WEIGHT=0.35
PHOTO_ASSISTANT_MAX_ITERATIONS=15
PHOTO_ASSISTANT_GOOD_ENOUGH_THRESHOLD=8.8
PHOTO_ASSISTANT_STALL_WINDOW=3
PHOTO_ASSISTANT_STALL_DELTA=0.2
PHOTO_ASSISTANT_OVERVIEW_COUNT=2

PHOTO_ASSISTANT_FOLDER_CAMERA_DIR=./data/captures/folder_source

VERTEX_PROJECT=<your-vertex-project-id>
VERTEX_LOCATION=us-central1
VERTEX_REASONER_MODEL=gemini-2.5-flash
VERTEX_SUMMARY_MODEL=gemini-2.5-flash
VERTEX_SCORER_MODEL=gemini-2.5-flash
VERTEX_REASONER_TIMEOUT_SECONDS=40
VERTEX_REASONER_RETRY_COUNT=2
VERTEX_REASONER_RETRY_BACKOFF_SECONDS=3
VERTEX_ENDPOINT_OVERRIDE=
VERTEX_ACCESS_TOKEN=
GOOGLE_APPLICATION_CREDENTIALS=./credentials/vertex-service-account.json
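The three VERTEX_REASONER_* values above suggest a timeout-plus-retry wrapper around each reasoning call. This sketch assumes RETRY_COUNT counts retries after the first attempt (so 2 means up to 3 calls), that the backoff is a fixed pause rather than exponential, and that only timeouts and connection errors are retried; call_with_retries and its retry_on parameter are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    call: Callable[[float], T],
    timeout_seconds: float = 40.0,  # VERTEX_REASONER_TIMEOUT_SECONDS
    retry_count: int = 2,  # VERTEX_REASONER_RETRY_COUNT (retries after the first try)
    backoff_seconds: float = 3.0,  # VERTEX_REASONER_RETRY_BACKOFF_SECONDS (fixed pause)
    retry_on: tuple[type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(timeout_seconds), retrying transient failures with a fixed backoff."""
    attempt = 0
    while True:
        try:
            return call(timeout_seconds)
        except retry_on:
            if attempt >= retry_count:
                raise  # out of retries: surface the last error
            attempt += 1
            sleep(backoff_seconds)
```

Injecting sleep keeps the wrapper testable without real delays; other exception types propagate on the first failure.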

Initialize Data

Run once after creating the environment:

.\autoframe\Scripts\python.exe -m app.cli init-data

If old mock or synthetic examples are present in the dataset, clean them out and validate the artifacts:

.\autoframe\Scripts\python.exe -m app.cli remove-mock-examples
.\autoframe\Scripts\python.exe -m app.cli remove-synthetic-examples
.\autoframe\Scripts\python.exe -m app.cli validate-jsonl

Run The Pipeline

Manual run

.\autoframe\Scripts\python.exe -m app.cli run-objective "Capture a flattering portrait of one person" `
  --execution-mode manual `
  --camera-kind android_http `
  --reasoner-kind vertex `
  --vlm-kind gemini

During a manual run:

  1. Let the system capture the overview images.
  2. Wait for the proposed pan / orientation / z / zoom_ratio.
  3. Move the phone or rig to that position.
  4. Confirm, skip, retry, or abort in the CLI.
  5. Let the system score the result and continue.

Hardware run

.\autoframe\Scripts\python.exe -m app.cli run-objective "Capture a flattering portrait of one person" `
  --execution-mode hardware `
  --camera-kind android_http `
  --reasoner-kind vertex `
  --vlm-kind gemini

Mock run

.\autoframe\Scripts\python.exe -m app.cli run-objective "Capture a wide group portrait with background context" `
  --execution-mode mock `
  --camera-kind mock `
  --reasoner-kind mock `
  --vlm-kind mock

Mock runs are useful for wiring checks, but they do not contribute retrieval examples for live runs.

Maintenance Commands

Rebuild retrieval artifacts:

.\autoframe\Scripts\python.exe -m app.cli rebuild-embeddings

Validate dataset consistency:

.\autoframe\Scripts\python.exe -m app.cli validate-jsonl

Remove stored mock examples:

.\autoframe\Scripts\python.exe -m app.cli remove-mock-examples

Remove stored synthetic examples:

.\autoframe\Scripts\python.exe -m app.cli remove-synthetic-examples

Rescore stored examples that still have accessible image files:

.\autoframe\Scripts\python.exe -m app.cli rescore-examples

Add a real example manually:

.\autoframe\Scripts\python.exe -m app.cli offline-add-example ... --execution-mode manual

Export or import the retrieval dataset:

.\autoframe\Scripts\python.exe -m app.cli export-dataset .\dataset_export.json
.\autoframe\Scripts\python.exe -m app.cli import-dataset .\dataset_export.json

API Server

Start the API:

.\autoframe\Scripts\python.exe -m uvicorn app.main:app --reload --host 127.0.0.1 --port 8000

Useful endpoints:

  • GET /health
  • GET /ui
  • GET /dataset/ui
  • GET /ui/config
  • POST /objectives/run
  • POST /objectives/start
  • POST /retrieve/debug
  • POST /score/debug
  • POST /scene-summary/debug
  • POST /offline/add-example
  • POST /dataset/capture-overview
  • POST /dataset/capture-score
  • POST /dataset/save-example
  • POST /dataset/reset-rig

Where Outputs Go

  • data/captures/: saved overview and candidate images
  • data/sessions/: full session state
  • data/traces/: event logs
  • data/debug/: prompt/debug artifacts per iteration
  • data/examples/examples.jsonl: long-term retrieval memory source of truth
  • data/examples/*.npy: retrieval embeddings
  • data/examples/id_map.json: embedding row to example id mapping
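Because examples.jsonl is the source of truth and id_map.json maps embedding rows to example ids, a quick consistency check can catch drift between the two. This sketch assumes each JSONL record has an "id" field and that id_map.json is either a JSON list indexed by row or an object keyed by row number; check_id_map is hypothetical and does not replace validate-jsonl.

```python
import json
from pathlib import Path


def check_id_map(examples_path: Path, id_map_path: Path) -> list[str]:
    """Report mismatches between examples.jsonl ids and id_map.json rows."""
    example_ids: set[str] = set()
    with examples_path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                example_ids.add(json.loads(line)["id"])  # assumed field name

    raw_map = json.loads(id_map_path.read_text(encoding="utf-8"))
    # Accept either ["id0", "id1", ...] or {"0": "id0", "1": "id1", ...}.
    if isinstance(raw_map, list):
        row_to_id = dict(enumerate(raw_map))
    else:
        row_to_id = {int(row): example_id for row, example_id in raw_map.items()}

    problems = []
    for row, example_id in sorted(row_to_id.items()):
        if example_id not in example_ids:
            problems.append(f"row {row}: id {example_id!r} not found in examples.jsonl")
    for example_id in sorted(example_ids - set(row_to_id.values())):
        problems.append(f"example {example_id!r} has no embedding row")
    return problems
```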

Key Files

Current Limitations

  • Manual mode still requires a human to move the device between iterations.
  • Hardware mode assumes the rig is reachable directly over HTTP.
  • Large images and remote Vertex calls add noticeable latency.
  • Retrieval and dataset maintenance use examples.jsonl, id_map.json, and the embedding .npy files directly.

About

AutoFrame enables automatic photography by positioning the camera through VLM reasoning and 3-axis hardware support.
