# virtual-frame

Foundational infrastructure for deterministic data pipelines. Written in Rust with Python bindings via PyO3.
virtual-frame provides low-level building blocks — columnar storage, bitmask-filtered views, compensated summation, NFA regex, string distance metrics, and a deterministic RNG — that can serve as a substrate for reproducible data processing, including LLM training data preparation. It is not a complete LLM data toolkit; it does not yet provide large-scale sharding, dedup pipelines, tokenizer-aware transforms, dataset versioning, provenance capture, or parallel ingestion. Those are the kinds of things you would build on top of this library.
## Features

- TidyView — Virtual views over columnar data. Filters flip bits in a packed bitmask (`BitMask`); selects narrow a projection index. The base `Rc<DataFrame>` is shared, not cloned, across chained operations. Note: `materialize()` does allocate when you need a concrete output DataFrame.
- NFA Regex — Thompson NFA simulation with an O(n*m) worst-case guarantee. No backtracking, no catastrophic blowup. Supports Perl-style syntax: character classes, quantifiers (greedy and lazy), anchors, word boundaries, alternation.
- NLP Primitives — Levenshtein edit distance, Jaccard n-gram similarity, character/word n-gram extraction, whitespace and word-punctuation tokenizers, term frequency, cosine similarity.
- Kahan Summation — Compensated floating-point accumulation. Every aggregation (sum, mean, variance, standard deviation) uses Kahan summation to reduce rounding drift. See Limitations for precision boundaries.
- SplitMix64 RNG — Deterministic PRNG. Same seed produces the same sequence. Supports uniform f64, Box-Muller normal, and `fork()` for independent substreams. See Determinism Design for what this does and does not guarantee.
- CSV Ingestion — Single-pass type-inferring parser with streaming aggregation (sum, min/max) that operates in O(ncols) memory without materializing the full dataset.
- Columnar Storage — Typed column vectors (Int64, Float64, String, Bool) with borrowed `ColumnKeyRef` keys for group-by and join index construction, avoiding per-row string cloning.
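The edit-distance primitive above is the classic dynamic-programming algorithm. As a reference point, here is a minimal Python sketch of Levenshtein distance — illustrative only, not the library's Rust implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
    # prev[j] holds the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,        # deletion
                curr[j - 1] + 1,    # insertion
                prev[j - 1] + cost  # substitution (or match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

The two-row formulation keeps memory at O(len(b)) rather than the full O(n*m) table.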
## Installation

```shell
pip install virtual-frame
```

Or, for Rust, add to `Cargo.toml`:

```toml
[dependencies]
virtual-frame = "0.1"
```

## Quick Start (Python)

```python
import virtual_frame as vf

# Load data
df = vf.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "dept": ["eng", "eng", "sales", "sales"],
    "salary": [90000.0, 85000.0, 70000.0, 75000.0],
})

# Create a virtual view and chain operations
tv = vf.TidyView(df)
result = tv.filter_gt_float("salary", 72000.0).select(["name", "salary"])
print(result.materialize())  # 3 rows

# Group and summarise
summary = tv.group_summarise(["dept"], "salary", "mean", "avg_salary")
print(summary.get_column("avg_salary"))  # [87500.0, 72500.0]

# NFA regex
print(vf.regex_find_all(r"\d+", "order-42 item-7 qty-100"))
# [(6, 8), (14, 15), (20, 23)]

# NLP
print(vf.levenshtein("kitten", "sitting"))  # 3
print(vf.char_ngrams("hello", 2))  # {'el': 1, 'he': 1, 'll': 1, 'lo': 1}

# Deterministic RNG
rng = vf.Rng(42)
print(rng.next_f64())  # 0.741565...

# Kahan-compensated sum
print(vf.kahan_sum([0.1] * 10_000_000))  # 1000000.0
```

## Rust Usage

```rust
use virtual_frame::dataframe::DataFrame;
use virtual_frame::column::Column;
use virtual_frame::tidyview::TidyView;
use virtual_frame::expr::{col, binop, BinOp, DExpr};

let df = DataFrame::from_columns(vec![
    ("x".into(), Column::Int(vec![1, 2, 3, 4, 5])),
    ("y".into(), Column::Float(vec![10.0, 20.0, 30.0, 40.0, 50.0])),
]).unwrap();

let view = TidyView::new(df);
let filtered = view
    .filter(&binop(BinOp::Gt, col("y"), DExpr::LitFloat(25.0)))
    .unwrap();
assert_eq!(filtered.nrows(), 3); // rows where y > 25
```

## Architecture

TidyView = `Rc<DataFrame>` + `BitMask` + `ProjectionMap` + `Option<Ordering>`
- BitMask: One bit per row, packed into 64-bit words. A million-row filter costs ~122 KB of bitmask memory. Chained filters AND their bitmasks together.
- ProjectionMap: Tracks visible column indices. `select()` narrows the map without touching column data.
- Ordering: Lazy sort permutation via `arrange()`. Only materialized into a concrete DataFrame when `materialize()` is called.
- ColumnKeyRef: Borrowed keys into column data for group-by and join index construction. Single-key operations use `BTreeMap<ColumnKeyRef, usize>` directly, avoiding one `Vec` allocation per row.
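To make the packed-bitmask layout concrete, here is a small Python sketch of the word-packing scheme described above. The class name and methods are illustrative, not the library's API:

```python
class BitMask:
    """One bit per row, packed into 64-bit words."""

    def __init__(self, nrows: int):
        self.nrows = nrows
        self.words = [0] * ((nrows + 63) // 64)

    def set(self, i: int) -> None:
        # Word index = i / 64, bit position = i % 64.
        self.words[i >> 6] |= 1 << (i & 63)

    def test(self, i: int) -> bool:
        return (self.words[i >> 6] >> (i & 63)) & 1 == 1

    def and_with(self, other: "BitMask") -> None:
        # Chained filters combine by AND-ing word by word.
        self.words = [a & b for a, b in zip(self.words, other.words)]

    def count(self) -> int:
        return sum(bin(w).count("1") for w in self.words)

    def memory_bytes(self) -> int:
        return len(self.words) * 8

# A million rows pack into 15,625 words = 125,000 bytes (~122 KiB).
m = BitMask(1_000_000)
print(m.memory_bytes())  # 125000
```

This is where the "~122 KB per million-row filter" figure comes from: 1,000,000 bits / 8 = 125,000 bytes.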
## Determinism Design

This library is designed for reproducible results through several mechanisms:
- Kahan summation for all floating-point reductions (sum, mean, variance, standard deviation)
- `BTreeMap`/`BTreeSet` everywhere — never `HashMap`/`HashSet`, which have non-deterministic iteration order
- SplitMix64 RNG with explicit seed threading
- No reliance on FMA in reduction paths (FMA can change rounding behavior across platforms)
What this means in practice: Given identical inputs and the same seed, all operations in this library should produce identical outputs. The test suite includes determinism checks that verify repeated execution yields the same results.
What this does not yet prove: Cross-platform bit-identity has not been validated with a CI matrix across Linux/macOS/Windows/ARM. The determinism properties are enforced by algorithm design (ordered containers, compensated summation, explicit RNG), but independent platform-pair verification is not yet published. If you depend on cross-platform reproducibility in production, you should validate on your target platforms.
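The SplitMix64 step function is small enough to sketch in full. This Python version uses the published SplitMix64 constants; the `fork()` shown (seeding a child stream from the parent's next output) and the u64-to-f64 mapping are plausible constructions for illustration, not necessarily what the library does:

```python
MASK64 = (1 << 64) - 1

class SplitMix64:
    """Reference SplitMix64 step (64-bit counter + finalizer)."""

    def __init__(self, seed: int):
        self.state = seed & MASK64

    def next_u64(self) -> int:
        # Weyl-sequence increment, then two multiply-xorshift mixing rounds.
        self.state = (self.state + 0x9E3779B97F4A7C15) & MASK64
        z = self.state
        z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
        z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
        return z ^ (z >> 31)

    def next_f64(self) -> float:
        # Top 53 bits -> uniform double in [0, 1).
        return (self.next_u64() >> 11) * (1.0 / (1 << 53))

    def fork(self) -> "SplitMix64":
        # Illustrative: child stream seeded from the next parent output.
        return SplitMix64(self.next_u64())
```

Because the entire state is one u64 and every draw is a pure function of it, "same seed, same sequence" holds by construction.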
## Limitations

- Kahan summation precision boundary: Single Kahan compensation captures one level of rounding error. For extreme cases (e.g., summing values where the accumulator and individual values differ by more than ~2^52), the compensation term itself can lose precision. The test suite validates Kahan accuracy for practical ranges (10M summands of 0.1). For cases requiring higher precision, consider second-order compensation or arbitrary-precision arithmetic.
- Single-threaded: All operations run on a single thread. There is no parallel ingestion or parallel group-by. This is a design choice (determinism is trivial without concurrency), but it means throughput is bounded by single-core speed.
- No null/missing value support: Columns are dense typed vectors with no NA/null sentinel. Missing data must be handled before loading.
- No string interning: String columns store owned `String` values. For datasets with high-cardinality string columns, memory usage may be higher than interned alternatives.
- Python GIL bound: The Python bindings hold the GIL during all operations. Long-running computations will block other Python threads.
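The single-level compensation behind the Kahan precision boundary noted above fits in a few lines. A minimal Python sketch, independent of the library:

```python
def kahan_sum(xs) -> float:
    """Compensated summation: carry lost low-order bits forward."""
    s = 0.0
    c = 0.0  # running compensation for rounding error
    for x in xs:
        y = x - c        # fold the previous step's error back in
        t = s + y        # low-order bits of y may be lost here...
        c = (t - s) - y  # ...and are recovered algebraically into c
        s = t
    return s

# Naive accumulation drifts; the compensated sum stays at the
# correctly rounded result for this classic example.
print(kahan_sum([0.1] * 1_000_000))  # 100000.0
```

The boundary mentioned above is the point where `c` itself can no longer represent the residual — second-order ("Kahan-Babuska") variants carry additional compensation terms for that regime.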
## Scope

This is foundational infrastructure: columnar storage, filtered views, joins, grouped aggregation, regex extraction, string distance, tokenization, deterministic RNG, and compensated arithmetic. These are building blocks.
This is not yet a complete LLM data preparation toolkit. The harder problems in that space — large-scale deduplication, near-duplicate detection at corpus scale, tokenizer-aware transforms, dataset versioning and provenance, contamination checks, sharded parallel processing with preserved determinism — are not implemented. They could be built on top of this library's primitives, but they are not included today.
## Tests

95 tests covering all modules:
| Module | Tests | What's covered |
|---|---|---|
| bitmask | 6 | Word boundaries, AND, set iteration, memory sizing |
| column | 4 | Gather, length, borrowed keys, NaN ordering |
| dataframe | 4 | Construction, duplicates, length mismatch |
| expr | 2 | Row evaluation, columnar fast path |
| kahan | 3 | Compensation accuracy, determinism, count |
| regex_engine | 31 | Literals, classes, quantifiers, anchors, lazy/greedy, split, determinism |
| nlp | 17 | Edit distance, Jaccard, n-grams, tokenization, TF, cosine similarity |
| csv | 11 | Type inference, streaming, delimiters, line endings, max_rows |
| rng | 3 | Determinism, f64 range, fork independence |
| tidyview | 14 | Filter, chain, select, group-by, arrange, join, sample, distinct, snapshot semantics |
Run with `cargo test`.
## License

MIT