Make VLM analysis frame count configurable (beyond fixed 4-frame grid) #203

Description

@lucas-oma

Is your feature request related to a problem? Please describe.

On SharpAI Aegis v0.2.9 (macOS, Apple Silicon), VLM clip analysis appears to be limited to a fixed 4-frame composite per clip. Saved analyses in ~/.aegis-ai/media/video_chats/ consistently reference four sequential frames (e.g. "all four timestamps", "frame three", "four sequential moments").

For security monitoring, that often isn't enough temporal context. Short events like quick deliveries, brief intrusions, etc. can fall between the sampled frames in a ~1-minute clip. Right now I cannot configure how many frames the VLM sees within each clip.
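To put a rough number on that gap, here is a minimal sketch. It assumes the frames are spread evenly over the clip, which I have not confirmed, and `sample_gap_seconds` is a hypothetical helper, not something from the Aegis code.

```python
def sample_gap_seconds(clip_seconds: float, frame_count: int) -> float:
    """Approximate time between frames if they are spread evenly over the clip.

    Assumption: even spacing. The real Aegis selection logic may differ.
    """
    if frame_count < 1:
        raise ValueError("frame_count must be at least 1")
    return clip_seconds / frame_count


if __name__ == "__main__":
    # Compare the current 4-frame grid with larger frame counts on a 60 s clip.
    for n in (4, 9, 16):
        gap = sample_gap_seconds(60, n)
        print(f"{n} frames over 60 s: roughly one frame every {gap:.1f} s")
```

Under that assumption, a 60-second clip with 4 frames leaves about 15 seconds between samples, so an event that lasts only a few seconds can easily be missed entirely.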

From inspecting the installed app, the motion-composite pipeline seems to cap grid mode at 4 frames (render4FrameGrid), and I couldn't find a user-facing setting in the Aegis UI or ~/.aegis-ai/ config files to change this.


Describe the solution you'd like

A configurable setting (global and/or per-camera) for how many frames are sent to the VLM per clip, for example:

  • Grid size: 2×2 (4), 3×3 (9), 4×4 (16), or a custom frame count
  • Or simply a number of frames sampled per clip: 4, 6, 10, etc.

Ideally this would be exposed in AI Setup or per-camera settings, with a short note on the tradeoff: more frames = richer temporal context, but higher VLM latency/cost and potentially lower per-frame resolution when composited into a single image.
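The resolution side of that tradeoff can be sketched as follows. This is only an illustration: the 1024×1024 composite size is a placeholder I picked, not a value taken from Aegis, and `grid_layout` is a hypothetical helper.

```python
import math


def grid_layout(frame_count: int, canvas_w: int = 1024, canvas_h: int = 1024):
    """Return (cols, rows, tile_w, tile_h) for a near-square grid of frames.

    Assumption: all frames are tiled into one composite of fixed size.
    The default canvas size is a placeholder, not the size Aegis uses.
    """
    if frame_count < 1:
        raise ValueError("frame_count must be at least 1")
    cols = math.ceil(math.sqrt(frame_count))
    rows = math.ceil(frame_count / cols)
    return cols, rows, canvas_w // cols, canvas_h // rows


if __name__ == "__main__":
    for n in (4, 6, 9, 10, 16):
        cols, rows, tile_w, tile_h = grid_layout(n)
        print(f"{n} frames: {cols}x{rows} grid, each tile {tile_w}x{tile_h} px")
```

With that placeholder canvas, going from a 2×2 to a 4×4 grid halves each tile's width and height (512 px down to 256 px), which is the kind of detail loss a settings note could warn about.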


Describe alternatives you've considered

  1. Custom DeepCamera analysis skill: viable for advanced users (e.g. multi-frame ffmpeg sampling like smarthome-bench), but requires building and maintaining a separate pipeline outside the default Aegis flow.
  2. Patching app.asar: technically possible but unsupported, breaks on every update, and is not a sustainable solution. I'm also not sure whether the license permits it.

Additional context

  • App: SharpAI Aegis 0.2.9
  • Platform: macOS (Apple Silicon)
  • VLM: Local engine (llama.cpp), Qwen3.5-2B
  • Camera: Local webcam (testing)
  • Data dir: ~/.aegis-ai/

From the app bundle, the pipeline appears to sample frames at sampleFps: 5 (up to ~60 stored frames), then choose between a stroboscopic composite (busy motion) or a fixed 4-frame 2×2 grid (simpler scenes). Aegis docs mention frame extraction at 0.1–5 fps, but it's unclear how that relates to the final 4-frame VLM input — clarifying or unifying those settings would help.
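For reference, this is roughly how I imagine a configurable frame count mapping onto the stored frames. It is a sketch under my own assumptions, not a reconstruction of the actual pipeline: `pick_frame_indices` is a hypothetical name, and taking the midpoint of each equal time segment is just one reasonable selection strategy.

```python
def pick_frame_indices(stored_frames: int, frame_count: int) -> list[int]:
    """Pick evenly spaced indices from a buffer of stored frames.

    Assumption: frames are stored in time order and the goal is even
    temporal coverage. The requested count is capped at the buffer size.
    """
    if stored_frames < 1 or frame_count < 1:
        raise ValueError("stored_frames and frame_count must be at least 1")
    n = min(frame_count, stored_frames)
    # Split the buffer into n equal segments and take each segment's midpoint.
    return [int((i + 0.5) * stored_frames / n) for i in range(n)]


if __name__ == "__main__":
    # ~60 stored frames, as described above; 4 is the current fixed grid size.
    for n in (4, 9, 16):
        print(n, pick_frame_indices(60, n))
```

A single user-facing value (global or per-camera) could feed `frame_count` here, while the existing 0.1–5 fps extraction setting would only determine how many frames end up in the buffer.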

Happy to provide sample video_chats JSON outputs or help test a beta build if this gets implemented.
