A fast Python progress bar library with a C++ core. Windows-first.
Built to beat tqdm, rich.progress, and alive-progress on cold
import, per-iteration overhead, peak it/s, memory footprint, tail
latency, multi-bar throughput, and first-frame latency — simultaneously.
```python
import barflow

# Fastest: `for _ in progress:` runs at 160+ M it/s — faster than
# a bare `for _ in range(n): pass` because Py_None is immortal.
with barflow.Progress(total=n, desc="Crunching") as p:
    for _ in p:
        do_work()

# When you also need the iterated values:
for x in barflow.track(range(1_000_000), desc="Working"):
    ...

# Event-driven / manual:
with barflow.Progress(total=n, desc="Streaming") as p:
    for chunk in data:
        process(chunk)
        p.advance(len(chunk))
```

Numbers below are from `benchmarks/bench.py` on Windows 11 / Python 3.13 / N = 20,000,000 iterations (5 runs per data point, best wall time for rate measurements, min CPU time for CPU measurements). Baseline bare `for _ in range(n): pass` is 145.75 M it/s (6.9 ns/iter, 140.6 ms of CPU). Raw output lives in `benchmarks/bench_raw.md`; methodology and platform notes are in `benchmarks/results.md`.
| Axis | BarFlow | tqdm | rich | alive-progress |
|---|---|---|---|---|
| Cold import (ms) | 1.21 | 72.27 | 74.96 | 30.12 |
| Overhead, `for _ in p:` (ns/iter) | 0.0 | 7.4 | 471.9 | 384.9 |
| Overhead, `track(...)` (ns/iter) | 3.0 | 7.4 | 471.9 | 384.9 |
| Peak it/s, display off | 160.8 M | 70.2 M | 2.1 M | 2.6 M |
| Peak it/s, display on | 101.8 M | 19.6 M | 2.1 M | 2.1 M |
| Python heap peak (1 M iters) | 486 B | 298 KB | 661 KB | 3.4 MB |
| Tail latency p99.9 (ns) | 100 | 200 | 2,200 | 2,200 |
| First-frame latency (µs) | 32 | 97 | 921 | n/a |
| Multi-bar, 4 tasks (M it/s) | 43.9 M | 8.8 M | 2.2 M | n/a |
| Metadata churn (M it/s) | 29.6 M | 6.6 M | 2.0 M | 1.9 M |
| Total CPU, display on (ms for 20 M) | 188 | 953 | 9,297 | 9,391 |
BarFlow wins on every axis — 25–62× faster cold import, zero
measurable overhead on its iteration fast path (faster than a bare
`for _ in range(n)` because `Py_None` is immortal on 3.12+ and skips
the store-cycle refcount work that `range`'s small-int yields incur),
5.2× display-on throughput vs tqdm, ~50× vs rich / alive,
600×+ less Python heap peak, 2× tighter tail latency, 3×
faster first-frame paint than tqdm, and ~50× less CPU than
rich / alive over 20 M iterations. The sub-1.0 CPU/wall ratio
reflects the decoupled render thread: it wakes on a 50 ms timeout,
formats into a preallocated buffer, and spends most of its life
parked on a condition variable, so the producer loop never pays for
rendering inline.
| Library | Cold import (ms) | vs BarFlow |
|---|---|---|
| barflow | 1.21 | 1× |
| alive | 30.12 | 25× |
| tqdm | 72.27 | 60× |
| rich | 74.96 | 62× |
Measured by timing `python -c "from <lib> import ..."` in a
subprocess and subtracting a bare-interpreter baseline
(`python -c "pass"`), so the number is just the work the library
does at import time. BarFlow's module graph is deliberately lazy:
`themes`, `columns`, `style`, `spinners`, `hooks`, and `aio` are
all resolved on first attribute access via `__getattr__`, so the
cold import only pays for the C extension load and an `__init__.py`
that does nothing but expose `Progress`, `Tracker`, and `track`.
Display is disabled (`disable=True` where the library supports it)
so we measure the cost of advancing the counter, not rendering.
ns/iter is over the bare for-loop baseline.
| Variant | M it/s | ns/iter over baseline |
|---|---|---|
| barflow-iter | 160.76 | 0.0 |
| barflow-track | 101.13 | 3.0 |
| tqdm | 70.23 | 7.4 |
| barflow-tick | 65.39 | 8.4 |
| alive | 2.55 | 384.9 |
| rich | 2.09 | 471.9 |
`barflow-iter` is `for _ in p:` — BarFlow's `Progress` type implements the iteration protocol directly, so `FOR_ITER` dispatches `tp_iternext` without the CPython vectorcall trampoline. The iternext body is three x86 instructions (load, `fetch_add`, return `Py_None`), and `Py_None`'s immortal refcount on 3.12+ means the loop's `STORE_FAST _` is free. Net result: below the bare for-loop baseline, because a `range`-driven loop still does refcount work on its cached small-int yields. `barflow-track` is the `for x in barflow.track(iterable):` wrapper, used when you also need the yielded values. `barflow-tick` is the manual `Progress.tick()` call from Python, which pays the full CPython vectorcall dispatch overhead per call. Use the iteration protocol above when you don't have a source iterable.
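In pure Python, the shape of that fast path is an iterator whose `__next__` bumps a counter and returns `None`. This sketch models the semantics only; the real type is a C++ extension and pays none of this interpreter overhead:

```python
class CountingIter:
    """Pure-Python model of the counting iterator: n steps, each yields None."""

    def __init__(self, total):
        self.count = 0
        self.total = total

    def __iter__(self):
        return self

    def __next__(self):
        if self.count >= self.total:
            raise StopIteration
        self.count += 1  # the C++ version is a single atomic fetch_add
        return None      # None is immortal on 3.12+: no refcount traffic

p = CountingIter(5)
assert list(p) == [None] * 5
assert p.count == 5
```

Returning `None` rather than the index is the point: the loop body discards it, and an immortal singleton makes that discard free.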
Comparator libraries write into an `io.StringIO` sink with
`force_terminal=True` so no real console I/O is measured. BarFlow
writes to its native Windows console path (no sink parameter),
which makes the comparison conservatively worse for BarFlow.
| Library | M it/s | vs BarFlow |
|---|---|---|
| barflow | 101.76 | 1× |
| tqdm | 19.57 | 5.20× |
| rich | 2.11 | 48× |
| alive-progress | 2.09 | 49× |
BarFlow's render loop emits delta frames: each column's
previously-rendered bytes are cached, and on the next frame the
render thread emits `\x1b[<n>C` (cursor-right) over unchanged
spans instead of re-writing the bytes. On a real TTY this cuts
bytes-written per frame by roughly 60% for the default layout;
the sink-based benchmark above does not exercise the delta path,
so the number you see is a lower bound — real terminals get
more.
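The delta-render idea can be sketched in a few lines: compare the new frame to the cached one, and replace each unchanged run with a cursor-right escape when the escape is shorter than re-emitting the bytes. This is an illustrative model, not BarFlow's C++ renderer:

```python
def delta_frame(prev: str, new: str) -> str:
    """Emit `new` assuming the cursor sits at column 0 over `prev`.

    Unchanged runs become \x1b[<n>C (cursor right n cells) when that
    saves bytes; everything else is written verbatim.
    """
    out, i = [], 0
    while i < len(new):
        j = i
        # length of the run that matches the previously rendered frame
        while j < len(new) and j < len(prev) and new[j] == prev[j]:
            j += 1
        run = j - i
        escape = f"\x1b[{run}C"
        if run > len(escape):        # only skip when it actually saves bytes
            out.append(escape)
            i = j
        else:
            out.append(new[i])
            i += 1
    return "".join(out)

prev = "[#####-----] 50% | 500/1000"
new  = "[######----] 60% | 600/1000"
payload = delta_frame(prev, new)
assert len(payload) < len(new)  # fewer bytes than rewriting the full frame
```

A real renderer also has to know the cursor's true position and handle frames that shrink; the sketch ignores both.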
| Library | tracemalloc peak | RSS import | RSS run |
|---|---|---|---|
| barflow | 486 B | 236 KB | 192 KB |
| tqdm | 298 KB | 6.83 MB | 992 KB |
| rich | 661 KB | 6.58 MB | 1.77 MB |
| alive-progress | 3.41 MB | 1.64 MB | 3.79 MB |
tracemalloc peak is the high-water mark of the Python heap over
a 1 M-iteration run (`bench_memory.py`). BarFlow's ~500 bytes is
effectively one `Progress` object's shell — the counter, output
buffer, render thread, and render scratch all live in C-owned
storage that tracemalloc cannot see. Competitors allocate
hundreds of KB to several MB of Python objects per run.
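The tracemalloc measurement is easy to reproduce for any workload (a generic sketch, not `bench_memory.py` itself). Note the caveat from above: tracemalloc only sees Python-heap allocations, so C-owned storage never shows up in the peak:

```python
import tracemalloc

def heap_peak(fn):
    """Return the Python-heap high-water mark, in bytes, while fn() runs."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak

def workload():
    # Any code under measurement goes here; this just churns some ints.
    total = 0
    for i in range(100_000):
        total += i
    return total

peak = heap_peak(workload)
print(f"Python heap peak: {peak} B")
```

`get_traced_memory()` returns `(current, peak)` since tracing started, which is why the helper starts and stops tracing around the workload.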
| Library | p50 | p90 | p99 | p99.9 | max |
|---|---|---|---|---|---|
| barflow | 100 ns | 100 ns | 100 ns | 100 ns | 7.80 µs |
| tqdm | 100 ns | 200 ns | 200 ns | 200 ns | 28.00 µs |
| rich | 500 ns | 600 ns | 800 ns | 2.20 µs | 153.20 µs |
| alive-progress | 500 ns | 600 ns | 700 ns | 2.20 µs | 138.00 µs |
Per-iter timestamps recorded with `perf_counter_ns()` across 100 K
iterations (`bench_tail_latency.py`). BarFlow is the only library
whose p99.9 does not diverge from its p50 — the render thread
never spills work onto the producer, so there is no jitter source
to create tail spikes.
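Recording per-iteration gaps with `perf_counter_ns()` and reading off nearest-rank percentiles takes only a few lines (a generic sketch; the real harness is `bench_tail_latency.py`):

```python
import time

def iter_latency_percentiles(iterable, qs=(0.50, 0.90, 0.99, 0.999)):
    """Per-iteration wall-time gaps in ns at the requested quantiles."""
    stamps = []
    for _ in iterable:
        stamps.append(time.perf_counter_ns())
    # Gap between consecutive timestamps = cost of one iteration + timer read.
    gaps = sorted(b - a for a, b in zip(stamps, stamps[1:]))
    return {q: gaps[min(int(q * len(gaps)), len(gaps) - 1)] for q in qs}

pcts = iter_latency_percentiles(range(10_000))
print({q: f"{v} ns" for q, v in pcts.items()})
```

Each gap includes the cost of the `perf_counter_ns()` call itself, so the absolute numbers carry a small constant bias; the percentile shape is what matters.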
| Library | median | min | p90 |
|---|---|---|---|
| barflow | 32 µs | 28 µs | 41 µs |
| tqdm | 97 µs | 93 µs | 109 µs |
| rich | 921 µs | 845 µs | 1.05 ms |
BarFlow paints a synchronous first frame in `Progress.__enter__`
before the render thread takes over, eliminating the 50 ms
"blank bar" window that would otherwise be visible for
short-lived jobs. Measured by `bench_first_frame.py`.
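The synchronous-first-frame trick is easy to model: paint once inline in `__enter__`, then hand off to a timer thread. This is a pure-Python sketch with a hypothetical `render` callable, not BarFlow's implementation:

```python
import threading

class FirstFramePainter:
    """Paint one frame inline on enter, then refresh on a background timer."""

    def __init__(self, render, interval=0.05):
        self.render = render
        self.interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def _loop(self):
        # wait() returns False on timeout, True once _stop is set.
        while not self._stop.wait(self.interval):
            self.render()

    def __enter__(self):
        self.render()          # first frame is synchronous: no blank window
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        self.render()          # final frame reflects the finished state

frames = []
with FirstFramePainter(lambda: frames.append("frame")):
    pass
assert len(frames) >= 2  # at least the inline first frame plus the final one
```

Even a job that finishes before the first timeout fires still gets a painted bar, which is the whole point of the inline frame.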
| Library | wall time | aggregate |
|---|---|---|
| barflow | 22.8 ms | 43.9 M it/s |
| tqdm | 114.1 ms | 8.8 M it/s |
| rich | 465.8 ms | 2.2 M it/s |
| alive-progress | — | skipped (no clean multi-task API) |
4 tasks × 250 K ticks each, driven round-robin from one thread
(`bench_multibar.py`). BarFlow stays lock-free — every task has
its own cache-line-padded counter, and the render thread walks
the task vector under a mutex that the hot path never touches.
| Library | wall time | it/s |
|---|---|---|
| barflow | 33.8 ms | 29.6 M |
| tqdm | 152.4 ms | 6.6 M |
| rich | 514.1 ms | 2.0 M |
| alive-progress | 526.9 ms | 1.9 M |
1 M ticks, `set_description` called every 1000 ticks with a
pre-generated 40-char string (`bench_metadata_churn.py`).
BarFlow exposes `set_description(str)` and
`set_task_description(task_id, str)`, which briefly acquire the
render mutex to swap the description; the lock-free tick hot
path is unaffected.
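The split described above — metadata swaps take a lock, ticks never do — can be modelled in Python with a hypothetical task object (the real hot path is a C++ atomic, not a Python attribute):

```python
import threading

class Task:
    """Tick counter with a mutex-guarded description, lock-free on ticks."""

    def __init__(self, desc=""):
        self.n = 0                      # hot path: bare increment, no lock
        self._desc = desc
        self._render_mutex = threading.Lock()

    def tick(self):
        self.n += 1                     # never touches the mutex

    def set_description(self, desc):
        with self._render_mutex:        # brief: just swap the reference
            self._desc = desc

    def snapshot(self):
        with self._render_mutex:        # render thread reads under the lock
            return self.n, self._desc

t = Task("start")
for i in range(1, 5001):
    t.tick()
    if i % 1000 == 0:
        t.set_description(f"step {i}")
assert t.snapshot() == (5000, "step 5000")
```

Only the infrequent metadata path ever contends with the render thread, so churn leaves the tick throughput untouched.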
`time.process_time()` sums user+system time across every thread
of the process (Windows `GetProcessTimes`, Linux
`CLOCK_PROCESS_CPUTIME_ID`, macOS `task_info`), so a background
render thread cannot hide from this measurement.
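That property is easy to sanity-check: burn CPU on a background thread while the main thread just waits, and confirm `time.process_time()` still charges the work to the process. A small standalone check, not the benchmark itself:

```python
import threading
import time

def burn(deadline):
    # Busy-wait so the thread accumulates CPU time rather than sleep time.
    x = 0
    while time.perf_counter() < deadline:
        x += 1
    return x

cpu0 = time.process_time()
t = threading.Thread(target=burn, args=(time.perf_counter() + 0.2,))
t.start()
t.join()                     # main thread sleeps here, accruing no CPU
cpu_used = time.process_time() - cpu0
print(f"background thread charged {cpu_used * 1e3:.0f} ms of process CPU")
```

Since the main thread is parked in `join()`, nearly all of the reported CPU time comes from the background thread, which is exactly why a render thread shows up in the CPU table.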
| Library | Mode | CPU ms (best of 5) | Extra ns/iter | CPU / wall |
|---|---|---|---|---|
| barflow | display-off | 187.5 | 2.3 | 0.96 |
| barflow | display-on | 187.5 | 2.3 | 0.96 |
| tqdm | display-off | 265.6 | 6.2 | 0.93 |
| tqdm | display-on | 953.1 | 40.6 | 0.95 |
| alive-progress | display-off | 7,671.9 | 376.6 | 0.98 |
| alive-progress | display-on | 9,390.6 | 462.5 | 0.98 |
| rich | display-off | 9,437.5 | 464.8 | 0.96 |
| rich | display-on | 9,296.9 | 457.8 | 0.98 |
Two things stand out:
- BarFlow's CPU cost is identical whether the display is on or off. Turning the bar on adds no measurable per-iter CPU because the render thread wakes on a 50 ms condition-variable timeout and spends the rest of its life parked. The producer loop sees the same hot path in both modes.
- tqdm's CPU grows 3.6× when the display turns on (266 →
  953 ms), because rendering runs inline on the producer thread.
  Rich and alive-progress sit near ~50× BarFlow's CPU cost in
  both modes — they pay hundreds of nanoseconds of dict/lock work
  per `advance()` call before any rendering happens.
- Numbers are from a single Windows 11 box; absolute values will
  differ on Linux / macOS but the ratios are stable in repeated
  runs. Re-run `python benchmarks/bench.py --n 20000000 --runs 5`
  to reproduce the main table, and `python benchmarks/bench_*.py`
  for each extra axis (tail latency, memory, first-frame,
  multi-bar, metadata churn).
- tqdm is run with `mininterval=0.05` (matching BarFlow's default)
  rather than its out-of-box 0.10, so the comparison isolates
  per-render work from render frequency instead of giving tqdm a
  free 2× render-skip advantage.
- `time.process_time()` resolution is ~15 ms on Windows, so the
  smallest CPU numbers (barflow: 187 ms) sit only ~12 ticks above
  the noise floor. Differences against tqdm (5×) and rich/alive
  (~50×) are well outside that window.
- Display-on throughput is measured against an `io.StringIO` sink,
  which skips Windows console latency. On a real TTY, BarFlow's
  delta-render (cursor-advance over unchanged column spans) gives
  it additional headroom that the StringIO harness cannot see.
```shell
pip install barflow
```
Wheels are published for Windows (AMD64), Linux (x86_64, aarch64), and
macOS (x86_64, arm64) for CPython 3.13 and 3.14, including the
free-threaded cp313t / cp314t builds.
- Zero-overhead iteration. `for _ in progress:` runs at 160+ M it/s —
  below the bare `for _ in range(n)` baseline, because `FOR_ITER`
  dispatches directly to `tp_iternext` (no vectorcall trampoline) and
  Py_None is immortal on 3.12+ (no refcount work on `STORE_FAST`).
- C++ hot path. `tick`, `advance`, and `Tracker`'s iter-next are single
  `std::atomic::fetch_add` calls with no locks and no Python-level
  bookkeeping. Task counters are cache-line padded so the render
  thread's reads never false-share with producer writes.
- Decoupled renderer. A background thread wakes on a 50 ms
  condition-variable timeout and formats into a preallocated buffer.
  The producer never blocks.
- Delta-render. The render loop caches each column's previous bytes
  and emits `\x1b[<n>C` cursor-advance over unchanged spans instead of
  rewriting the frame. Roughly 60% fewer bytes written per frame on
  the default layout.
- Synchronous first frame. `Progress.__enter__` paints one frame
  inline before the render thread takes over, so short-lived jobs
  don't see the 50 ms blank-bar window.
- Windows-first. Unconditional `ENABLE_VIRTUAL_TERMINAL_PROCESSING`,
  UTF-16 transcoded `WriteConsoleW` chunked at 32 KB, legacy-console
  fallback. No `colorama` dependency. A reusable `wscratch`
  transcoding buffer means steady-state frames are zero-alloc.
- Multi-task + columns. 9 built-in column types
  (description/bar/percent/count/rate/elapsed/eta/spinner/text),
  rich-style column API, themes, ANSI cursor stacking for nested bars.
  `Progress.set_description(str)` and
  `set_task_description(task_id, str)` expose metadata churn without
  touching the lock-free hot path.
- Spinner DSL. Compositional factories
  (`frame`/`scrolling`/`bouncing`/`alongside`/`sequential`) compile to
  precomputed frame tables at `__enter__`.
- `print()` interception. `capture_output=True` reroutes `sys.stdout`
  through `write_above()` so user prints appear above live bars
  without tearing.
- asyncio. `barflow.aio.atrack(aiter)` wraps async iterables.
- Tiny cold import. `import barflow` is ~1.2 ms (baseline-subtracted
  median) — 25–62× faster than the alternatives. All non-core
  submodules (`themes`, `columns`, `spinners`, `style`, `hooks`,
  `aio`) are lazy-loaded via PEP 562 `__getattr__`.
- Sub-kilobyte Python heap. Peak `tracemalloc` usage across a
  1 M-iteration run is ~500 bytes, vs 300 KB (tqdm), 660 KB (rich),
  and 3.4 MB (alive-progress).
```python
import barflow
from barflow.columns import (
    SpinnerColumn, DescriptionColumn, BarColumn, PercentColumn,
    CountColumn, RateColumn, EtaColumn,
)

# Simplest form — when you need the iterated values
for x in barflow.track(range(1000), desc="task"):
    ...

# Fastest form — when you just need a counter
with barflow.Progress(total=1000, desc="task") as p:
    for _ in p:
        do_work()

# Custom columns
with barflow.Progress(
    SpinnerColumn(name="dots"), " ",
    DescriptionColumn(), " ",
    BarColumn(width=40, color="magenta"), " ",
    PercentColumn(), " ",
    CountColumn(), " | ", EtaColumn(),
    total=1000, desc="build",
) as p:
    for _ in range(1000):
        p.tick()

# Named theme
with barflow.Progress(theme="classic", total=1000) as p:
    ...

# Multi-task
with barflow.Progress(theme="classic") as p:
    dl = p.add_task(total=100, desc="download")
    ex = p.add_task(total=100, desc="extract")
    for i in range(100):
        p.update(dl, 1)
        p.update(ex, 1)

# Live prints during a bar
with barflow.Progress(total=100, capture_output=True) as p:
    for i in range(100):
        if i % 10 == 0:
            print(f"checkpoint {i}")  # appears above the bar
        p.tick()

# asyncio
import asyncio, barflow.aio as aio

async def main():
    async for x in aio.atrack(some_async_iter(), total=1000):
        ...

asyncio.run(main())
```

See docs/DESIGN.md for the full architecture: atomic hot path,
background render thread, column pipeline, Windows console handling,
and the benchmarks methodology.
Requires Visual Studio 2022+ (Windows) or GCC/Clang + Python headers (POSIX) and Python ≥ 3.13.
```shell
# Windows
build.bat

# POSIX
python -m pip install -e .
```
MIT. See LICENSE.