Releases: EdgeFirstAI/profiler-cli
Releases · EdgeFirstAI/profiler-cli
v1.2.1
Fixed
- Models with float16 (fp16) inputs and outputs now produce detections. Previously these models — including every TensorRT engine produced by EdgeFirst Studio with
fp16precision — completed without errors but returned no predictions, leavingpredictions.parquetempty. Validated on Jetson platforms running TensorRT; other platforms and runtimes have not been verified yet.
v1.2.0
Added
- The F2 Studio explorer now shows a coloured deployability indicator next to every artifact in a training session. Green means this build of the profiler on this host can run the artifact; yellow means the artifact targets accelerator hardware that can't be confirmed without a runtime probe; red means the artifact targets a different SoC or the matching backend isn't built into this profiler. For yellow and red, a short inline reason explains the diagnosis at a glance — for example
[Hailo-8L] (Hailo runtime and accessory not confirmed)or[NXP i.MX 95 Neutron] (TFLite backend not built into this profiler). A short legend appears below the list. The indicator is informational only and does not block selection — you can still attempt to profile any artifact. - New pipelining guide explaining how the validation pipeline overlaps its six stages so the NPU stays busy while the CPU prepares the next frame, what guarantees the bounded-channel design gives you about latency and memory, and how to read stalls in the F4 Pipeline Stages panel. Linked from the README and the architecture overview.
Changed
- The training session list in the F2 Studio explorer now shows the most recently trained model at the top instead of the bottom. The session you most likely want to validate is the first one highlighted when you open an experiment.
- TensorRT engine-load failures now surface the underlying TensorRT runtime reason — plan-header size mismatch, version mismatch, compute-capability mismatch — instead of the previous generic "deserializeCudaEngine failed" message. Incompatible-artifact failures are now actionable from the profiler's error output alone.
Fixed
- Validation sessions in EdgeFirst Studio no longer show as "complete" while the profiler CLI is still running. Each stage of a newly-created validation session is now correctly displayed as not-yet-started until work for that stage begins, and the post-profiling analysis step stays visible as a separate stage that remains pending until the EdgeFirst Validator app picks it up and runs. If the validator fails to launch after the profiler finishes uploading results, the analysis stage is now marked with an error instead of being left in a misleading "queued" state.
- EdgeFirst Studio validation sessions launched from the F2 Studio tab for ONNX models now record the actual execution provider in the session name. A CUDA run on a Linux x86_64 host produces a session named like
…-x86_64-onnx-cuda, a CoreML run on macOS produces…-macos-onnx-coreml, and a plain CPU run produces…-onnx-cpu. Previously every ONNX session ended in…-onnx-cpuregardless of which provider actually ran, which made CPU runs and accelerated runs indistinguishable in Studio's session listings. The execution-provider modal also now opens before the Studio session is created, so the name is correct on the first attempt rather than relying on a rename afterwards. - TensorRT
.engineartifacts produced by EdgeFirst Studio training sessions now load on Jetson targets. Previously every such artifact failed at deserialization, so TensorRT validation on Jetson required pointing the profiler at an.onnxand rebuilding the engine locally.
v1.1.0
Added
- Prebuilt
libhailort_profiler.sonow builds in CI for both linux-aarch64 (RPi5 / AI Kit) and linux-x86_64 (Hailo-8 PCIe). Each shim is built against HailoRT v4.23.0 compiled from source (no Hailo Developer Zone account required), uploaded as a workflow artifact, then bundled into the matching release archive and wheel via the same glob-based injection used forlibtrt_shim.so. The wheel-injection and archive-packaging steps were generalized to iterate overshims/prebuilt/<arch>/*.soso future vendor shims drop in without further CI surgery. ProfilingStatus::Failed { error: String, frames_done, frames_total, total_time_ms }variant insrc/tui/event.rs. Carries the human-readable failure message, the last-known progress at the moment of failure, and the wall-clock elapsed time captured at the transition (so the dashboard's "Elapsed" stops counting on a dead run). Future framework steps will promote theerrorfield to a typedDiagnosticwith structuredcode/stage/hintfields; the wire-level PROF: protocol is unchanged in this release.- Strict pre-flight for
--provider cuda(strict_cuda_preflightinsrc/inference/ort_backend.rs). Verifies each CUDA / cuDNN runtime SONAME the EP plugin needs is loadable; on a miss, the run hard-errors with the exactpip install nvidia-*-cu12 ...command instead of silently running on CPU. The check runs only when CUDA is the selected provider, so CPU users on partial setups never see CUDA-specific warnings. Paired witherror_on_failure()on the CUDA EP dispatch so ort propagates registration failures instead of swallowing them. - F4 Profiler completion popup now shows two persistent per-artifact progress bars (predictions.parquet and trace.pftrace) during Studio uploads, with byte/percent labels and pending/active/done/skipped state colouring. Replaces the previous ◌→✓ icon-only step indicators that flashed past too quickly for the user to see on fast uploads; each bar now stays at 100% green once its artifact finishes so the visual trail is preserved. A new
artifact_bar_lineshelper backed by anArtifactStateenum centralises the rendering; theSkippedvariant covers the case wheretrace.pftracewas not produced but the session still completed (rendered dim with a "skipped (not produced)" label instead of a perpetually spinning ◌).UploadProgressnow stores per-artifact byte counters (parquet_current/_total,trace_current/_total) that are fed by routingBackgroundEvent::DownloadProgressevents on phase labels matchingtui::event::UPLOAD_LABEL_PREDICTIONS/UPLOAD_LABEL_TRACE. - F2 Studio validation screen now has an explicit "start new session" affordance. On the Complete / Failed page, Enter resets the phase state machine and returns to the Explorer with a freshly-loaded projects list; Esc resets and returns to the current breadcrumb level. The footer hint reads
Press Enter to start a new session • Esc to return to Explorer. Previously the screen had no way out other than quitting and relaunching the binary.
Changed
- TUI provider modal now binds
g(for GPU) instead ofufor CUDA, with the hint text shortened to "Run with CUDA".mfor CoreML andcfor CPU are unchanged.ONNX.mdupdated to match. - TUI-created Studio validation sessions now carry a platform-aware name and a multi-line description. The name is
<model_stem>-<host>-<npu>(e.g.yolov5m-t-27e4_quant-u8-i8_smart-imx95-neutron) — the model artifact's filename has its runtime extension and any.imx95/.hailo8lstyle hint stripped, then the runtime-resolved<host>-<npu>tag is appended. The host is detected from/sys/firmware/devicetree/base/compatible(covering imx8mp / imx95 / imx93 / orin-nano / rpi5) on Linux, short-circuits tomacos/windowson those targets, and falls back touname -mfor DT-less Unix hosts; the NPU is derived from the model contents (TFLiteNeutronGraphblob scan), the filename multi-suffix (Hailo.hailo8l.hef/.hailo8.hef/.hailo10.hef), and a host-side probe oflibvx_delegate.sofor generic TFLite on i.MX 8M Plus. Filename hints from Studio are treated as user-facing conventions only — the runtime host and the model contents are the source of truth, so a.imx95.tfliterunning on i.MX 8M Plus correctly resolves toimx8mp-vsi(orimx8mp-cpu), neverimx95-neutron. The session description block is 4-7 lines (gracefully degrading on hosts where a probe is unavailable): hostname, kernelVersion(sysname + release, no arch suffix), SoC / CPU model + core count, NPU name with a short detection note, total memory, and the profiler version. All Linux-specific probes (/sys/firmware/devicetree/*,/proc/cpuinfo,/proc/meminfo) sit behindcfg(target_os = "linux")andlibc::unamesits behindcfg(unix), so the module compiles cleanly on macOS today and will accept a Windows backend without touching public API. New SoMs and accelerators extend two enums (HostTag,NpuTag) with one variant + one match arm each — seesrc/platform/. - TUI validation workflow now downloads the model before creating the Studio session so the platform identification has the model bytes in hand. The session is created with the final name and description on the first call instead of being renamed afterwards. The
downloadstage inset_session_stagescovers both model + dataset; after the reorder, stages are initialized with the stage's progress label set to"Downloading dataset"to describe the remaining work (the model is already on disk). The stage itself is opened inrunningand is not explicitly transitioned tocomplete— same as onmain— leaving the existing server-side stage machine to resolve it when a later stage moves to running. The "Downloading model" progress bar is unchanged from the user's perspective — only the Studio-side label and the moment of session creation have shifted. The blockingstd::fs::readfor the Neutron-blob scan is nowtokio::fs::readso large models don't stall the tokio worker thread. tui/components/profiler.rs::detect_delegateis now a thin shim over the newplatform::delegate_hint_for_host()helper; the inline/sys/firmware/devicetree/base/compatibleprobe lived in two places (also the configure-panel default) and is now centralized insrc/platform/host.rs.
Fixed
- 5000-image validation runs no longer drop the last
pipeline_depthframes withPboDisconnected. The HALImageProcessorwas moved into the preprocess thread closure, so when the preprocess loop exited at end-of-run the processor was dropped, tearing down its GL thread / PBOSenderwhile the inference thread still held depth-many PBO-backed staging tensors with onlyWeakSenders. Wrapping the processor inArc<Mutex<>>and holding oneArcin the orchestrator scope past all thread joins keeps the GLSenderalive until inference fully drains, so everypre_tensor.map()succeeds. Mutex is uncontended in practice (only the preprocess thread ever callsconvert()). - Inference-thread skip log now names the failing file. The decode and preprocess skip sites already logged
frame_idandpath, butinfer.rsonly carried the error message — which sent thePboDisconnectedregression hunt down the wrong rabbit hole.PreprocessedFramenow carriesimage_paththrough fromDecodedFrame, and the infer-thread error WARN includes it. - Pipeline summary splits the skip count by stage:
decode=N preprocess=N infer=Ninstead of the previous lump-sum "decode or processing failure".OrchestratorResultgaineddecode_skipped/preprocess_skipped/infer_skippedfields (the existingskipped_framesis now their sum). - yolo26 ONNX exports with
model.end2end = false(pre-NMS head) now decode correctly and producepredictions.parquet.edgefirst-hal0.23'sDecoderVersion::is_end_to_end()treats theYolo26variant as unconditionally embedded-NMS, so the builder rejected the model withInvalidConfig("Yolo26 Segmentation Detection must have num_features be 6 + num_protos = ...")and the pipeline silently fell into timing-only mode. The profiler now rewritesdecoder_versionto"yolov8"whenever it seesdecoder_version: "yolo26"paired withmodel.end2end: false— the head shape (4 box + N class + M mask coefs × anchors) is structurally identical to yolov8, so HAL's existing yolov8 DFL path handles it correctly. Compat shim only; reverts to passthrough once HAL gains an explicitend2endfield. - TUI validation sessions no longer stay stuck on the "Running" dashboard when the child pipeline reports a fatal error (e.g.
libonnxruntime.sonot found).BackgroundEvent::Errorpreviously only setshared.pipeline_errorand showed an error popup, leavingProfilingStatusinRunningforever — the dashboard kept counting elapsed time on zero frames and the F2 Studio tab stayed onRunningInference, even though the Studio backend session was correctly errored viareport_session_error. A newProfilingStatus::Failedvariant gives the Error event a terminal target; the dispatcher now transitions Running → Failed, clearsprofiling_active, and bridges the failure to the Studio screen by enqueueingValidationPhase::Failedonbg_tx. The error popup behaviour is unchanged. - TUI now also surfaces silent child-exit failures (panic, SIGKILL, or any path that closes the child's stderr without emitting
PROF:completeorPROF:error).ingest_prof_streamsynthesises aBackgroundEvent::Error("Child process exited without producing a result...")on EOF when no terminal event was seen, so the same Running → Failed transition fires. - TUI partial-failure runs (
run_bench_internalinvoke fails mid-loop atsrc/tui/bench.rs:166-173) now correctly stay in the Failed state. The child emitsPROF:errorfor the failing iteration and then still emitsPROF:completewith the partial stats from the iterations that did succeed; pre-fix, the parent's ...
v1.0.3
Fixed
- Validation sessions no longer hang at 100 % inference. The pipeline orchestrator's infer thread held a clone of the buffer-return channel sender for its error path and never dropped it before the post-loop drain — leaving itself as the last live sender so
buffer_rxcould never disconnect. The thread then blocked onfor _buf in buffer_rx {}forever, which in turn blocked the engine hand-back, the thread joins, and the entire publish flow. Symptom was a progress bar pinned at5000/5000with no CPU, no memory growth, and no further output. The clone is now dropped immediately afterdrop(tx)so the drain terminates as soon as the decoder exits. - ONNX Runtime per-op profiling no longer spams
Maximum number of events reached, could not record profile event.partway through long runs. ORT's profiler uses a fixed-capacity 1 M-event ring buffer in upstreamcore/common/profiler.ccand the buffer is not exposed as a session config option in any released ORT version. A YOLO-class graph emits ~240–500 events persession.run, so a 5 000-frame run overflows the cap around frame 4 100 and loses every subsequent per-op event.OrtEnginenow counts invocations and ends profiling early onceORT_PROFILE_INVOCATION_CAP(3 000 frames) is reached; the cached profile path is returned later by the traitend_profiling()call so the post-loop parser is unchanged. Aggregate stage timing (tensor_ref,session_run,output_copy) keeps running for every subsequent frame through the existingStagedTimer.
Changed
- Post-loop ORT profile emission now streams the JSON file event-by-event instead of
serde_json::from_reader-ing the whole array into aVec<OrtTraceEvent>. ORT writes one JSON object per line with comma separators (Chrome-tracing convention), so the newstream_ort_profileparser walks the file line-by-line, buffers nodes until the per-frameSequentialExecutor::Executeenvelope event, then yields oneOrtFrameProfileand frees the buffer. Combined withstream_emit_ort_trace_events, this drops peak post-loop memory from roughly 200 MB to ~48 KB for a 3 000-frame capped run and eliminates the multi-second pause that used to happen between the last inference and the start of the publish step. - CLI now renders dataset, model, predictions, and trace transfers as indicatif progress bars matching the style used by
edgefirst-cli. Downloads show decimal bytes (model artifact, trace) or item counts (dataset enumeration); uploads show decimal bytes. The TUI path is unchanged — it continues to consume the structuredPROF:event stream and reports progress through its dashboard. - Inference loop progress is now an indicatif bar instead of a per-100-frame
[N/Total] X.X FPS | elapsed: Ysprintln!. The bar shows elapsed and ETA, the current/total frame counts with thousands separators, and a custom{fps}key formatted to one decimal place (avoids indicatif's default{per_sec}which renders the full f64). Skipped under--emit-prof(TUI mode) so the parent's PROF: event parser doesn't see ANSI escape sequences. - Several "stuck at the end" silences now have status lines so neither the CLI nor the TUI parent process looks frozen:
Inference complete. Finalizing results...between the last frame and stats computation,ORT profile: <path>before the streaming parse begins,Finalizing trace file...between the pipeline return and the pftraceBufWriterdrop, andPublishing results to Studio...before the upload. Combined with the streaming ORT parse and the deadlock fix, the gap between "last frame" and "results published" is now visibly attributed instead of being one long silence.
v1.0.2
Added
- Prebuilt
libtrt_shim.sonow ships in linux-aarch64 release archives and wheels. Jetson Orin users on JetPack 6.x canpip install edgefirst-profilerorcurl install.sh | bashand run TensorRT models out of the box — no more "TensorRT shim not found" with a "build it on the Jetson" follow-up. The shim is built in CI from the NVIDIA L4T TensorRT 10.3 container. Users on other JetPack versions can still build from source viashims/trt-shim/README.md.
Changed
- TensorRT shim search path now checks binary-relative locations first:
dirname(profiler_binary)/libtrt_shim.sofor flat archives and manual drop-ins, anddirname(profiler_binary)/../lib/libtrt_shim.sofor pip-installed wheels.$TRT_SHIM_PATH,LD_LIBRARY_PATH, and the/usr/local/lib//usr/libfallbacks remain.
v1.0.1
Fixed
- Release archives on
EdgeFirstAI/profiler-cliGitHub Releases now ship a.sha256sidecar alongside each.tar.gz/.zip. The v1.0.0 release was missing these sidecars, causinginstall.shandinstall.ps1to fail their checksum-verification step with a 404 on the first install attempt (worked around manually for v1.0.0 by uploading sidecars after the fact). install.shno longer printstmpdir: unbound variableduring cleanup on success. The script's EXIT trap referenced alocalvariable that had gone out of scope by the time the trap fired; the variable is now process-scoped and the trap defensively defaults if unset.
Changed
- CI builders for the wheel and binary matrix switched from
ubuntu-latestandubuntu-22.04-armtoubuntu-22.04-largeandubuntu-22.04-arm-largerespectively. The larger SKUs cut wall-clock build times on the long-pole entries. - CHANGELOG body reformatted with one paragraph per line so GitHub Release pages, PyPI project descriptions, and other markdown consumers can wrap the text to fit their layout instead of preserving the source-file hard wraps.
v1.0.0
Initial public release of the EdgeFirst Profiler CLI — an on-target
profiling agent for AI vision pipelines, designed to characterize
real-world inference performance on the hardware your model will
actually run on.
Platforms
- Linux x86_64 (glibc 2.17+ / manylinux2014)
- Linux aarch64 (glibc 2.17+ / manylinux2014)
- macOS arm64 (macOS 11+)
Windows support is planned and not yet shipped; the installer prints
a clear message for Windows users in the meantime.
Edge hardware support
- NVIDIA Jetson Orin / Orin Nano — CUDA via ONNX Runtime, optional
TensorRT execution provider. - NXP i.MX 95 — Neutron NPU via the TFLite delegate.
- NXP i.MX 8M Plus — VSI NPU via the TFLite delegate.
- Raspberry Pi 5 — CPU baseline plus optional accelerators.
- Kinara Ara-2 — DVM models via the
ara2-proxydaemon. - Hailo-8 / Hailo-8L — via HailoRT.
- Apple Silicon — CoreML execution provider via ONNX Runtime.
Execution providers
ONNX Runtime is the default, lowest-common-denominator backend. Vendor
backends opt in per build: TFLite (with NXP Neutron and VSI delegates),
Hailo, TensorRT, Apple CoreML.
Features
- Per-operator timing dashboard with a live terminal UI (TUI).
- Pipeline orchestrator with 4-thread pipelined validation for
realistic-throughput measurements. - EdgeFirst Studio integration: launch profiling sessions from Studio,
report status and progress, publish traces and validation runs to
your Studio account. - Offline-first: runs without a Studio account; Studio sign-in is
only required to publish or pull data.
Distribution
- Standalone binary archives via GitHub Releases on
EdgeFirstAI/profiler-cli
(install.sh/install.ps1resolve the latest version). - Python wheels on PyPI as
edgefirst-profiler— the wheel ships the
same native binary as the standalone archive, installed onto your
PATHviapip install edgefirst-profiler.