This project runs an iterative photo-optimization loop for a portrait objective.
The pipeline captures an overview image from an Android phone, summarizes the scene with Gemini on Vertex, retrieves prior real examples, proposes a new {pan, orientation, z, zoom_ratio} setting, captures a candidate image, scores it, and stores the attempt. It repeats until the score is good enough, progress stalls, or the iteration limit is reached.
- Reference Visuals
- Flow
- Current Local Model Configuration
- Current Runtime Assumptions
- Phone Camera Setup
- ESP8266 Rig Setup
- Environment
- Example .env
- Initialize Data
- Run The Pipeline
- Maintenance Commands
- API Server
- Where Outputs Go
- Key Files
- Current Limitations
These images are AI-generated and are included only as illustrative reference material. They may contain inaccuracies and should not be treated as the source of truth over the code and runtime behavior.
| Step | What happens |
|---|---|
| 1 | Start from the current local initial pose: pan=0, orientation=portrait, z=23 |
| 2 | Capture PHOTO_ASSISTANT_OVERVIEW_COUNT overview images; the current local env uses 2 |
| 3 | Gemini scene summary uses VERTEX_SUMMARY_MODEL; the current local env uses gemini-2.5-flash |
| 4 | Retrieval finds up to 4 good and 4 bad prior examples |
| 5 | Gemini reasoning uses VERTEX_REASONER_MODEL (gemini-2.5-flash in the current local env) to propose the next {pan, orientation, z, zoom_ratio} |
| 6 | In manual mode, you move the phone or rig and confirm capture |
| 7 | In hardware mode, the ESP8266 rig applies the proposal automatically |
| 8 | Gemini scoring uses VERTEX_SCORER_MODEL; the current local env uses gemini-2.5-flash |
| 9 | The run stores the attempt and either continues, stalls, or finishes |
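
The continue / stall / finish decision in step 9 can be sketched as follows. This is a minimal illustration, not the orchestrator's actual logic: the function name `decide_next_step` is hypothetical, and the stall rule (the best recent score beats the best earlier score by less than the stall delta) is an assumed reading of PHOTO_ASSISTANT_STALL_WINDOW and PHOTO_ASSISTANT_STALL_DELTA. The constants mirror the example .env values.

```python
from typing import List

GOOD_ENOUGH_THRESHOLD = 8.8  # PHOTO_ASSISTANT_GOOD_ENOUGH_THRESHOLD
STALL_WINDOW = 3             # PHOTO_ASSISTANT_STALL_WINDOW
STALL_DELTA = 0.2            # PHOTO_ASSISTANT_STALL_DELTA
MAX_ITERATIONS = 15          # PHOTO_ASSISTANT_MAX_ITERATIONS


def decide_next_step(scores: List[float]) -> str:
    """Return "finish", "stall", or "continue" for the scores seen so far.

    Assumed semantics (not confirmed by the source):
    - finish when the latest score reaches the good-enough threshold
      or the iteration budget is used up;
    - stall when the best of the last STALL_WINDOW scores beats the best
      earlier score by less than STALL_DELTA.
    """
    if not scores:
        return "continue"
    if scores[-1] >= GOOD_ENOUGH_THRESHOLD or len(scores) >= MAX_ITERATIONS:
        return "finish"
    if len(scores) > STALL_WINDOW:
        best_before = max(scores[:-STALL_WINDOW])
        best_recent = max(scores[-STALL_WINDOW:])
        if best_recent - best_before < STALL_DELTA:
            return "stall"
    return "continue"
```

Under these assumptions, a run with scores 6.0, 6.1, 6.0, 6.1 would be reported as stalled, while 5.0, 6.0, 7.0, 8.0 would continue.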

| Component | Model |
|---|---|
| Scene understanding | Vertex Gemini gemini-2.5-flash |
| Reasoning / proposal generation | Vertex Gemini gemini-2.5-flash |
| Candidate image scoring | Vertex Gemini gemini-2.5-flash |
The current local env/config split is:
- VERTEX_REASONER_MODEL
- VERTEX_SUMMARY_MODEL
- VERTEX_SCORER_MODEL
VERTEX_VISION_MODEL still works as a legacy fallback that sets both summary and scorer when the split variables are not provided.
Code defaults in app/config.py are still gemini-2.5-flash for reasoning and gemini-2.5-flash-lite for summary and scoring. The table above reflects the current local .env behavior, without exposing machine-specific paths or secrets.
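
The precedence described above (split variable, then legacy VERTEX_VISION_MODEL, then code default) can be sketched like this. The helper `resolve_models` is hypothetical and is not the code in app/config.py. It assumes each split variable falls back independently and that an empty value counts as unset.

```python
from typing import Dict, Mapping

# Code defaults as described for app/config.py.
DEFAULT_REASONER = "gemini-2.5-flash"
DEFAULT_VISION = "gemini-2.5-flash-lite"


def resolve_models(env: Mapping[str, str]) -> Dict[str, str]:
    """Pick reasoner, summary, and scorer models from environment values."""
    # Legacy variable covers summary and scorer when their split vars are unset.
    legacy_vision = env.get("VERTEX_VISION_MODEL") or DEFAULT_VISION
    return {
        "reasoner": env.get("VERTEX_REASONER_MODEL") or DEFAULT_REASONER,
        "summary": env.get("VERTEX_SUMMARY_MODEL") or legacy_vision,
        "scorer": env.get("VERTEX_SCORER_MODEL") or legacy_vision,
    }
```

With an empty environment this sketch yields the code defaults. With the example .env values it yields gemini-2.5-flash for all three roles.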
- android_http is the main camera adapter for live runs.
- vertex is the main reasoner for live runs.
- gemini is the main VLM for live runs.
- The current local env starts each run from:
  - pan=0
  - orientation=portrait
  - z=23
  - zoom_ratio=1.0
- The current local env uses:
  - hardware as the default execution mode
  - retrieval_top_k=20
  - max_iterations=15
  - overview_count=2
  - 4 good retrieved examples
  - 4 bad retrieved examples
  - 3 session-memory items
- The first proposal is seeded by an overview-derived framing prior before retrieval examples are allowed to dominate.
- Retrieval excludes execution_mode="mock" examples.
- Mock runs no longer write new examples into long-term retrieval memory.
- Retrieval also excludes synthetic examples.
- The current local Vertex env uses:
  - VERTEX_REASONER_TIMEOUT_SECONDS=40
  - VERTEX_REASONER_RETRY_COUNT=2
  - VERTEX_REASONER_RETRY_BACKOFF_SECONDS=3
  - no explicit scorer timeout override, so scorer timeout falls back to the code default
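
The reasoner retry settings above could be applied roughly as in the sketch below. It is an illustration under stated assumptions, not the project's retry code: `call_with_retries` is a hypothetical helper, the retry count is read as the number of extra attempts after the first, and the backoff is treated as a fixed delay. The 40-second timeout would belong to the underlying HTTP call and is not shown.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    call: Callable[[], T],
    retry_count: int = 2,          # VERTEX_REASONER_RETRY_COUNT
    backoff_seconds: float = 3.0,  # VERTEX_REASONER_RETRY_BACKOFF_SECONDS
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying up to `retry_count` extra times on any exception."""
    attempt = 0
    while True:
        try:
            return call()
        except Exception:
            if attempt >= retry_count:
                raise  # retries exhausted: surface the last error
            attempt += 1
            sleep(backoff_seconds)  # fixed delay (assumed, not exponential)
```

Passing `sleep` as a parameter keeps the helper easy to exercise in tests without real delays.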
The phone side is the separate Android Studio project in android-camera-server/.
It exposes:
- GET /health
- POST /capture
- GET /shot.jpg
The Android app currently serves HTTP on port 8080.
The Python pipeline expects:
PHOTO_ASSISTANT_ANDROID_CAPTURE_URL
PHOTO_ASSISTANT_ANDROID_SNAPSHOT_URL
Typical direct-network values:
PHOTO_ASSISTANT_ANDROID_CAPTURE_URL=http://<phone-ip>:8080/capture
PHOTO_ASSISTANT_ANDROID_SNAPSHOT_URL=http://<phone-ip>:8080/shot.jpg
POST /capture accepts query parameters for:
- zoom_ratio or legacy zoom
- orientation such as portrait or landscape
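
As an illustration of the query parameters above, the following sketch builds a capture URL. The helper name `build_capture_url` is hypothetical, and 192.0.2.10 is a placeholder documentation address. The sketch only constructs the URL. Sending it as a POST and handling the response are left out because the response format is not described here.

```python
from typing import Optional
from urllib.parse import urlencode


def build_capture_url(
    capture_url: str,
    zoom_ratio: Optional[float] = None,
    orientation: Optional[str] = None,
) -> str:
    """Append the documented query parameters to the capture URL."""
    params = {}
    if zoom_ratio is not None:
        params["zoom_ratio"] = zoom_ratio
    if orientation is not None:
        params["orientation"] = orientation
    # Leave the URL untouched when no parameter was requested.
    return f"{capture_url}?{urlencode(params)}" if params else capture_url


url = build_capture_url(
    "http://192.0.2.10:8080/capture", zoom_ratio=1.5, orientation="portrait"
)
```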
hardware mode expects the ESP8266 rig controller to expose:
- GET /health
- GET /pose
- POST /apply
- POST /task/start
- POST /task/end
Set:
PHOTO_ASSISTANT_ESP8266_BASE_URL=http://<esp-ip>
The currently deployed ESP8266 controller source is stored in esp8266_code.txt as a reference snapshot of what is running on the rig. That file shows:
- the controller serves HTTP on port 80
- AutoFrame-compatible endpoints include GET /health, GET /pose, GET or POST /apply, and GET or POST for both /task/start and /task/end
- the deployed home pose is pan=0, orientation=portrait, z=23
- accepted orientation values are portrait and landscape
- the configured height range is 23 to 90 cm
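
The rig constraints listed above suggest a simple check before a proposal is sent to the controller. The sketch below is an assumption-laden example rather than the logic in app/services/rig.py: `RigPose` and `normalize_pose` are hypothetical names, clamping z (instead of rejecting it) is a design choice made here for illustration, and pan is passed through unchanged because no pan range is documented.

```python
from dataclasses import dataclass

ALLOWED_ORIENTATIONS = ("portrait", "landscape")
Z_MIN_CM = 23.0  # lower end of the configured height range
Z_MAX_CM = 90.0  # upper end of the configured height range


@dataclass(frozen=True)
class RigPose:
    pan: float
    orientation: str
    z: float


def normalize_pose(pan: float, orientation: str, z: float) -> RigPose:
    """Reject unknown orientations and clamp z into the rig's height range."""
    if orientation not in ALLOWED_ORIENTATIONS:
        raise ValueError(f"unsupported orientation: {orientation!r}")
    clamped_z = min(max(z, Z_MIN_CM), Z_MAX_CM)
    return RigPose(pan=pan, orientation=orientation, z=clamped_z)
```

For example, a proposed z of 120 would be clamped to 90 under this sketch, and an orientation such as "square" would raise a ValueError.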
Before any Python command on the HPC environment, run:
module load mamba
source activate autoframe
export HF_HOME="/scratch/ihmehta/hf_cache"
Use the repo venv interpreter explicitly to avoid python / pip mismatches:
py -3.11 -m venv autoframe
.\autoframe\Scripts\python.exe -m pip install -r requirements.txt
Important:
- google-auth and requests are both required for Vertex ADC auth.
- The repo .env is re-read by app.config whenever get_settings() is called.
- .env values currently overwrite inherited process environment values for the same keys.
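
The overwrite behavior noted above can be illustrated with a small sketch. It is not the parser used by app.config. `parse_env_text` and `apply_env` are hypothetical helpers that assume plain KEY=VALUE lines and show only the precedence rule: values from .env replace values already present in the process environment.

```python
from typing import Dict, MutableMapping


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def apply_env(text: str, environ: MutableMapping[str, str]) -> None:
    """Overwrite existing keys in `environ` with values from the .env text."""
    environ.update(parse_env_text(text))
```

In a real process the second argument would be `os.environ`. One consequence of this precedence is that exporting a variable in the shell does not override a value for the same key that is set in .env.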
This example reflects the current local env values, except that network URLs, project identifiers, and credential paths stay portable and placeholder-safe.
PHOTO_ASSISTANT_PROJECT_ROOT=.
PHOTO_ASSISTANT_DATA_DIR=./data
PHOTO_ASSISTANT_LOG_LEVEL=INFO
PHOTO_ASSISTANT_ESP8266_BASE_URL=http://<esp-ip>
PHOTO_ASSISTANT_DEFAULT_MODE=hardware
PHOTO_ASSISTANT_CAMERA_KIND=android_http
PHOTO_ASSISTANT_VLM_KIND=gemini
PHOTO_ASSISTANT_REASONER_KIND=vertex
PHOTO_ASSISTANT_DRY_RUN=false
PHOTO_ASSISTANT_ANDROID_CAPTURE_URL=http://<phone-ip>:8080/capture
PHOTO_ASSISTANT_ANDROID_SNAPSHOT_URL=http://<phone-ip>:8080/shot.jpg
PHOTO_ASSISTANT_ANDROID_TIMEOUT_SECONDS=10.0
PHOTO_ASSISTANT_INITIAL_PAN=0
PHOTO_ASSISTANT_INITIAL_ORIENTATION=portrait
PHOTO_ASSISTANT_INITIAL_Z=23
PHOTO_ASSISTANT_RETRIEVAL_TOP_K=20
PHOTO_ASSISTANT_PROMPT_GOOD_EXAMPLES_LIMIT=4
PHOTO_ASSISTANT_PROMPT_BAD_EXAMPLES_LIMIT=4
PHOTO_ASSISTANT_PROMPT_SESSION_MEMORY_LIMIT=3
PHOTO_ASSISTANT_OBJECTIVE_THRESHOLD=0.35
PHOTO_ASSISTANT_OBJECTIVE_WEIGHT=0.65
PHOTO_ASSISTANT_SCENE_WEIGHT=0.35
PHOTO_ASSISTANT_MAX_ITERATIONS=15
PHOTO_ASSISTANT_GOOD_ENOUGH_THRESHOLD=8.8
PHOTO_ASSISTANT_STALL_WINDOW=3
PHOTO_ASSISTANT_STALL_DELTA=0.2
PHOTO_ASSISTANT_OVERVIEW_COUNT=2
PHOTO_ASSISTANT_FOLDER_CAMERA_DIR=./data/captures/folder_source
VERTEX_PROJECT=<your-vertex-project-id>
VERTEX_LOCATION=us-central1
VERTEX_REASONER_MODEL=gemini-2.5-flash
VERTEX_SUMMARY_MODEL=gemini-2.5-flash
VERTEX_SCORER_MODEL=gemini-2.5-flash
VERTEX_REASONER_TIMEOUT_SECONDS=40
VERTEX_REASONER_RETRY_COUNT=2
VERTEX_REASONER_RETRY_BACKOFF_SECONDS=3
VERTEX_ENDPOINT_OVERRIDE=
VERTEX_ACCESS_TOKEN=
GOOGLE_APPLICATION_CREDENTIALS=./credentials/vertex-service-account.json
Run once after creating the environment:
.\autoframe\Scripts\python.exe -m app.cli init-data
If old mock or synthetic examples are present in the dataset, clean them out and validate the artifacts:
.\autoframe\Scripts\python.exe -m app.cli remove-mock-examples
.\autoframe\Scripts\python.exe -m app.cli remove-synthetic-examples
.\autoframe\Scripts\python.exe -m app.cli validate-jsonl

.\autoframe\Scripts\python.exe -m app.cli run-objective "Capture a flattering portrait of one person" `
--execution-mode manual `
--camera-kind android_http `
--reasoner-kind vertex `
--vlm-kind gemini
During a manual run:
- Let the system capture the overview image.
- Wait for the proposed pan / orientation / z / zoom_ratio.
- Move the phone or rig to that position.
- Confirm, skip, retry, or abort in the CLI.
- Let the system score the result and continue.
.\autoframe\Scripts\python.exe -m app.cli run-objective "Capture a flattering portrait of one person" `
--execution-mode hardware `
--camera-kind android_http `
--reasoner-kind vertex `
--vlm-kind gemini

.\autoframe\Scripts\python.exe -m app.cli run-objective "Capture a wide group portrait with background context" `
--execution-mode mock `
--camera-kind mock `
--reasoner-kind mock `
--vlm-kind mock
Mock runs are useful for wiring checks, but they do not contribute retrieval examples for live runs.
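
The exclusion of mock and synthetic examples from retrieval can be pictured with the following sketch. It is illustrative only and is not the filtering in app/memory/retrieval.py: `filter_live_examples` is a hypothetical helper, the `execution_mode` field comes from the documented behavior, and the boolean `synthetic` field is an assumed way of marking synthetic examples.

```python
from typing import Any, Dict, List

Example = Dict[str, Any]


def filter_live_examples(examples: List[Example]) -> List[Example]:
    """Drop mock and synthetic examples before retrieval."""
    return [
        ex for ex in examples
        if ex.get("execution_mode") != "mock" and not ex.get("synthetic", False)
    ]


candidates = [
    {"id": "a", "execution_mode": "manual"},
    {"id": "b", "execution_mode": "mock"},
    {"id": "c", "execution_mode": "hardware", "synthetic": True},
    {"id": "d", "execution_mode": "hardware"},
]
kept_ids = [ex["id"] for ex in filter_live_examples(candidates)]
```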
Rebuild retrieval artifacts:
.\autoframe\Scripts\python.exe -m app.cli rebuild-embeddings
Validate dataset consistency:
.\autoframe\Scripts\python.exe -m app.cli validate-jsonl
Remove stored mock examples:
.\autoframe\Scripts\python.exe -m app.cli remove-mock-examples
Remove stored synthetic examples:
.\autoframe\Scripts\python.exe -m app.cli remove-synthetic-examples
Rescore stored examples that still have accessible image files:
.\autoframe\Scripts\python.exe -m app.cli rescore-examples
Add a real example manually:
.\autoframe\Scripts\python.exe -m app.cli offline-add-example ... --execution-mode manual
Export or import the retrieval dataset:
.\autoframe\Scripts\python.exe -m app.cli export-dataset .\dataset_export.json
.\autoframe\Scripts\python.exe -m app.cli import-dataset .\dataset_export.json
Start the API:
.\autoframe\Scripts\python.exe -m uvicorn app.main:app --reload --host 127.0.0.1 --port 8000
Useful endpoints:
- GET /health
- GET /ui
- GET /dataset/ui
- GET /ui/config
- POST /objectives/run
- POST /objectives/start
- POST /retrieve/debug
- POST /score/debug
- POST /scene-summary/debug
- POST /offline/add-example
- POST /dataset/capture-overview
- POST /dataset/capture-score
- POST /dataset/save-example
- POST /dataset/reset-rig
- data/captures/: saved overview and candidate images
- data/sessions/: full session state
- data/traces/: event logs
- data/debug/: prompt/debug artifacts per iteration
- data/examples/examples.jsonl: long-term retrieval memory source of truth
- data/examples/*.npy: retrieval embeddings
- data/examples/id_map.json: embedding row to example id mapping
- app/services/orchestrator.py: main optimization loop
- app/services/vlm.py: Gemini scene summary and scoring
- app/services/sas_reasoner.py: Gemini reasoning
- app/services/prompt_builder.py: prompt construction and example selection
- app/memory/retrieval.py: retrieval scoring and filtering
- app/services/camera.py: Android HTTP camera adapter
- app/services/rig.py: ESP8266 rig controller
- android-camera-server/: phone-side camera server
- Manual mode still requires a human to move the device between iterations.
- Hardware mode assumes the rig is reachable directly over HTTP.
- Large images and remote Vertex calls add noticeable latency.
- Retrieval and dataset maintenance use examples.jsonl, id_map.json, and the embedding .npy files directly.
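
Because these three artifacts are edited directly, they can drift apart. The sketch below shows one way to spot such drift. It rests on several assumptions that the source does not confirm: each JSONL record has an `id` field, id_map.json holds a list whose index is the embedding row, and the embedding row count is supplied by the caller. `load_example_ids` and `find_mismatches` are hypothetical helpers and are not the implementation of the validate-jsonl command.

```python
import json
from typing import Dict, List


def load_example_ids(jsonl_text: str) -> List[str]:
    """Collect the assumed `id` field from each non-empty JSONL line."""
    return [
        json.loads(line)["id"]
        for line in jsonl_text.splitlines()
        if line.strip()
    ]


def find_mismatches(
    example_ids: List[str], id_map: List[str], embedding_rows: int
) -> Dict[str, object]:
    """Compare example ids, the row-to-id map, and the embedding row count."""
    known = set(example_ids)
    mapped = set(id_map)
    return {
        "missing_from_map": sorted(known - mapped),  # examples with no row
        "orphaned_in_map": sorted(mapped - known),   # rows with no example
        "row_count_ok": embedding_rows == len(id_map),
    }


report = find_mismatches(
    load_example_ids('{"id": "a"}\n{"id": "b"}\n'), ["a", "c"], 2
)
```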

