Is your feature request related to a problem? Please describe.
On SharpAI Aegis v0.2.9 (macOS, Apple Silicon), VLM clip analysis appears to be limited to a fixed 4-frame composite per clip. Saved analyses in ~/.aegis-ai/media/video_chats/ consistently reference four sequential frames (e.g. "all four timestamps", "frame three", "four sequential moments").
For security monitoring, that often isn't enough temporal context. Short events like quick deliveries, brief intrusions, etc. can fall between the sampled frames in a ~1-minute clip. Right now I cannot configure how many frames the VLM sees within each clip.
From inspecting the installed app, the motion-composite pipeline seems to cap grid mode at 4 frames (render4FrameGrid), and I couldn't find a user-facing setting in the Aegis UI or ~/.aegis-ai/ config files to change this.
Describe the solution you'd like
A configurable setting (global and/or per-camera) for how many frames are sent to the VLM per clip, for example:
- Grid size: 2×2 (4), 3×3 (9), 4×4 (16), or a custom frame count
- or simply a number of frames sampled per clip: 4, 6, 10, etc.
Ideally this would be exposed in AI Setup or per-camera settings, with a short note on the tradeoff: more frames = richer temporal context, but higher VLM latency/cost and potentially lower per-frame resolution when composited into a single image.
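To make the resolution tradeoff concrete, here is a rough sketch of how the per-frame tile size shrinks as the frame count grows. The 1024×1024 composite size and the `grid_layout` helper are my own assumptions for illustration; I don't know what dimensions Aegis actually uses for the composite.

```python
import math


def grid_layout(frame_count, composite_px=(1024, 1024)):
    """Return (cols, rows, tile_w, tile_h) for a near-square grid.

    composite_px is a hypothetical composite image size, not a value
    taken from Aegis.
    """
    if frame_count < 1:
        raise ValueError("frame_count must be at least 1")
    # Smallest near-square grid that holds all frames.
    cols = math.ceil(math.sqrt(frame_count))
    rows = math.ceil(frame_count / cols)
    width, height = composite_px
    # Each frame gets an equal share of the composite.
    return cols, rows, width // cols, height // rows


for n in (4, 6, 9, 16):
    cols, rows, tile_w, tile_h = grid_layout(n)
    print(f"{n:>2} frames -> {cols}x{rows} grid, {tile_w}x{tile_h} px per frame")
```

Under that assumption, going from 4 to 16 frames halves each tile's width and height (512 px down to 256 px), which is the kind of note that would be useful next to the setting.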
Describe alternatives you've considered
- Custom DeepCamera analysis skill: viable for advanced users (e.g. multi-frame ffmpeg sampling like smarthome-bench), but requires building and maintaining a separate pipeline outside the default Aegis flow.
- Patching app.asar: technically possible but unsupported, breaks on updates, and not sustainable (also, I'm not sure whether this would be permitted under the license).
Additional context
- App: SharpAI Aegis 0.2.9
- Platform: macOS (Apple Silicon)
- VLM: Local engine (llama.cpp), Qwen3.5-2B
- Camera: Local webcam (testing)
- Data dir: ~/.aegis-ai/
From the app bundle, the pipeline appears to sample frames at sampleFps: 5 (up to ~60 stored frames), then choose between a stroboscopic composite (busy motion) or a fixed 4-frame 2×2 grid (simpler scenes). Aegis docs mention frame extraction at 0.1–5 fps, but it's unclear how that relates to the final 4-frame VLM input — clarifying or unifying those settings would help.
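For illustration, this is roughly what I'd expect a configurable frame count to do when picking frames out of the stored buffer. It is a sketch under my own assumptions: `pick_frame_indices` is a hypothetical helper, and evenly spaced selection is my guess at a reasonable default, not the app's confirmed behavior.

```python
def pick_frame_indices(stored, wanted):
    """Pick `wanted` evenly spaced indices from a buffer of `stored` frames.

    Hypothetical helper: the first and last stored frames are always
    included when wanted >= 2.
    """
    if stored < 1 or wanted < 1:
        raise ValueError("stored and wanted must both be at least 1")
    if wanted >= stored:
        # Fewer stored frames than requested: use them all.
        return list(range(stored))
    if wanted == 1:
        # A single frame: take the middle of the buffer.
        return [stored // 2]
    span = stored - 1
    steps = wanted - 1
    # Integer arithmetic with round-half-up, so results are deterministic.
    return [(i * span + steps // 2) // steps for i in range(wanted)]


# With ~60 stored frames, 4 frames leave gaps of about 20 frames each,
# while 9 frames shrink the gaps to about 7 or 8 frames.
print(pick_frame_indices(60, 4))
print(pick_frame_indices(60, 9))
```

A single setting like this would also make it clearer how the sampling rate and the final VLM input relate, since the frame count would be applied on top of whatever the buffer holds.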
Happy to provide sample video_chats JSON outputs or help test a beta build if this gets implemented.