# virtual-frame

Foundational infrastructure for deterministic data pipelines. Written in Rust with Python bindings via PyO3.
virtual-frame provides low-level building blocks — columnar storage, bitmask-filtered views, compensated summation, NFA regex, string distance metrics, and a deterministic RNG — that can serve as a substrate for reproducible data processing, including LLM training data preparation. It is not a complete LLM data toolkit; it does not yet provide large-scale sharding, dedup pipelines, tokenizer-aware transforms, dataset versioning, provenance capture, or parallel ingestion. Those are the kinds of things you would build on top of this library.
## Features

- TidyView — Virtual views over columnar data. Filters flip bits in a packed bitmask (`BitMask`); selects narrow a projection index. The base `Rc<DataFrame>` is shared, not cloned, across chained operations. Note: `materialize()` does allocate when you need a concrete output DataFrame.
- NFA Regex — Thompson NFA simulation with an O(n*m) worst-case guarantee. No backtracking, no catastrophic blowup. Supports Perl-style syntax: character classes, quantifiers (greedy and lazy), anchors, word boundaries, alternation.
- NLP Primitives — Levenshtein edit distance, Jaccard n-gram similarity, character/word n-gram extraction, whitespace and word-punctuation tokenizers, term frequency, cosine similarity.
- Kahan Summation — Compensated floating-point accumulation. Every aggregation (sum, mean, variance, standard deviation) uses Kahan summation to reduce rounding drift. See Limitations for precision boundaries.
- SplitMix64 RNG — Deterministic PRNG. Same seed produces the same sequence. Supports uniform f64, Box-Muller normal, and `fork()` for independent substreams. See Determinism Design for what this does and does not guarantee.
- CSV Ingestion — Single-pass type-inferring parser with streaming aggregation (sum, min/max) that operates in O(ncols) memory without materializing the full dataset.
- Columnar Storage — Typed column vectors (Int64, Float64, String, Bool) with borrowed `ColumnKeyRef` keys for group-by and join index construction, avoiding per-row string cloning.
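The edit-distance primitive above is the classic dynamic-programming algorithm. As a reference point, here is a minimal Python sketch of Levenshtein distance — illustrative only, not the library's Rust implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
    # prev[j] holds the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,        # deletion
                curr[j - 1] + 1,    # insertion
                prev[j - 1] + cost  # substitution (or match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

The two-row formulation keeps memory at O(len(b)) rather than the full O(n*m) table.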
## Installation

```shell
pip install virtual-frame
```

Or, for Rust, add to `Cargo.toml`:

```toml
[dependencies]
virtual-frame = "0.1"
```

## Quick Start (Python)

```python
import virtual_frame as vf

# Load data
df = vf.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "dept": ["eng", "eng", "sales", "sales"],
    "salary": [90000.0, 85000.0, 70000.0, 75000.0],
})

# Create a virtual view and chain operations
tv = vf.TidyView(df)
result = tv.filter_gt_float("salary", 72000.0).select(["name", "salary"])
print(result.materialize())  # 3 rows

# Group and summarise
summary = tv.group_summarise(["dept"], "salary", "mean", "avg_salary")
print(summary.get_column("avg_salary"))  # [87500.0, 72500.0]

# NFA regex
print(vf.regex_find_all(r"\d+", "order-42 item-7 qty-100"))
# [(6, 8), (14, 15), (20, 23)]

# NLP
print(vf.levenshtein("kitten", "sitting"))  # 3
print(vf.char_ngrams("hello", 2))  # {'el': 1, 'he': 1, 'll': 1, 'lo': 1}

# Deterministic RNG
rng = vf.Rng(42)
print(rng.next_f64())  # 0.741565...

# Kahan-compensated sum
print(vf.kahan_sum([0.1] * 10_000_000))  # 1000000.0
```

## Rust Usage

```rust
use virtual_frame::dataframe::DataFrame;
use virtual_frame::column::Column;
use virtual_frame::tidyview::TidyView;
use virtual_frame::expr::{col, binop, BinOp, DExpr};

let df = DataFrame::from_columns(vec![
    ("x".into(), Column::Int(vec![1, 2, 3, 4, 5])),
    ("y".into(), Column::Float(vec![10.0, 20.0, 30.0, 40.0, 50.0])),
]).unwrap();

let view = TidyView::new(df);
let filtered = view
    .filter(&binop(BinOp::Gt, col("y"), DExpr::LitFloat(25.0)))
    .unwrap();
assert_eq!(filtered.nrows(), 3); // rows where y > 25
```

## Architecture

TidyView = `Rc<DataFrame>` + `BitMask` + `ProjectionMap` + `Option<Ordering>`
- BitMask: One bit per row, packed into 64-bit words. A million-row filter costs ~122 KB of bitmask memory. Chained filters AND their bitmasks together.
- ProjectionMap: Tracks visible column indices. `select()` narrows the map without touching column data.
- Ordering: Lazy sort permutation via `arrange()`. Only materialized into a concrete DataFrame when `materialize()` is called.
- ColumnKeyRef: Borrowed keys into column data for group-by and join index construction. Single-key operations use `BTreeMap<ColumnKeyRef, usize>` directly, avoiding one `Vec` allocation per row.
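To make the packed-bitmask layout concrete, here is a small Python sketch of the word-packing scheme described above. The class name and methods are illustrative, not the library's API:

```python
class BitMask:
    """One bit per row, packed into 64-bit words."""

    def __init__(self, nrows: int):
        self.nrows = nrows
        self.words = [0] * ((nrows + 63) // 64)

    def set(self, i: int) -> None:
        # Word index = i / 64, bit position = i % 64.
        self.words[i >> 6] |= 1 << (i & 63)

    def test(self, i: int) -> bool:
        return (self.words[i >> 6] >> (i & 63)) & 1 == 1

    def and_with(self, other: "BitMask") -> None:
        # Chained filters combine by AND-ing word by word.
        self.words = [a & b for a, b in zip(self.words, other.words)]

    def count(self) -> int:
        return sum(bin(w).count("1") for w in self.words)

    def memory_bytes(self) -> int:
        return len(self.words) * 8

# A million rows pack into 15,625 words = 125,000 bytes (~122 KiB).
m = BitMask(1_000_000)
print(m.memory_bytes())  # 125000
```

This is where the "~122 KB per million-row filter" figure comes from: 1,000,000 bits / 8 = 125,000 bytes.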
## Determinism Design

This library is designed for reproducible results through several mechanisms:
- Kahan summation for all floating-point reductions (sum, mean, variance, standard deviation)
- `BTreeMap`/`BTreeSet` everywhere — never `HashMap`/`HashSet`, which have non-deterministic iteration order
- SplitMix64 RNG with explicit seed threading
- No reliance on FMA in reduction paths (FMA can change rounding behavior across platforms)
What this means in practice: Given identical inputs and the same seed, all operations in this library should produce identical outputs. The test suite includes determinism checks that verify repeated execution yields the same results.
What this does not yet prove: Cross-platform bit-identity has not been validated with a CI matrix across Linux/macOS/Windows/ARM. The determinism properties are enforced by algorithm design (ordered containers, compensated summation, explicit RNG), but independent platform-pair verification is not yet published. If you depend on cross-platform reproducibility in production, you should validate on your target platforms.
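The SplitMix64 step function is small enough to sketch in full. This Python version uses the published SplitMix64 constants; the `fork()` shown (seeding a child stream from the parent's next output) and the u64-to-f64 mapping are plausible constructions for illustration, not necessarily what the library does:

```python
MASK64 = (1 << 64) - 1

class SplitMix64:
    """Reference SplitMix64 step (64-bit counter + finalizer)."""

    def __init__(self, seed: int):
        self.state = seed & MASK64

    def next_u64(self) -> int:
        # Weyl-sequence increment, then two multiply-xorshift mixing rounds.
        self.state = (self.state + 0x9E3779B97F4A7C15) & MASK64
        z = self.state
        z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
        z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
        return z ^ (z >> 31)

    def next_f64(self) -> float:
        # Top 53 bits -> uniform double in [0, 1).
        return (self.next_u64() >> 11) * (1.0 / (1 << 53))

    def fork(self) -> "SplitMix64":
        # Illustrative: child stream seeded from the next parent output.
        return SplitMix64(self.next_u64())
```

Because the entire state is one u64 and every draw is a pure function of it, "same seed, same sequence" holds by construction.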
## Limitations

- Kahan summation precision boundary: Single Kahan compensation captures one level of rounding error. For extreme cases (e.g., summing values where the accumulator and individual values differ by more than ~2^52), the compensation term itself can lose precision. The test suite validates Kahan accuracy for practical ranges (10M summands of 0.1). For cases requiring higher precision, consider second-order compensation or arbitrary-precision arithmetic.
- Single-threaded: All operations run on a single thread. There is no parallel ingestion or parallel group-by. This is a design choice (determinism is trivial without concurrency), but it means throughput is bounded by single-core speed.
- No null/missing value support: Columns are dense typed vectors with no NA/null sentinel. Missing data must be handled before loading.
- No string interning: String columns store owned `String` values. For datasets with high-cardinality string columns, memory usage may be higher than interned alternatives.
- Python GIL bound: The Python bindings hold the GIL during all operations. Long-running computations will block other Python threads.
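The single-level compensation behind the Kahan precision boundary noted above fits in a few lines. A minimal Python sketch, independent of the library:

```python
def kahan_sum(xs) -> float:
    """Compensated summation: carry lost low-order bits forward."""
    s = 0.0
    c = 0.0  # running compensation for rounding error
    for x in xs:
        y = x - c        # fold the previous step's error back in
        t = s + y        # low-order bits of y may be lost here...
        c = (t - s) - y  # ...and are recovered algebraically into c
        s = t
    return s

# Naive accumulation drifts; the compensated sum stays at the
# correctly rounded result for this classic example.
print(kahan_sum([0.1] * 1_000_000))  # 100000.0
```

The boundary mentioned above is the point where `c` itself can no longer represent the residual — second-order ("Kahan-Babuska") variants carry additional compensation terms for that regime.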
## Scope

This is foundational infrastructure: columnar storage, filtered views, joins, grouped aggregation, regex extraction, string distance, tokenization, deterministic RNG, and compensated arithmetic. These are building blocks.
This is not yet a complete LLM data preparation toolkit. The harder problems in that space — large-scale deduplication, near-duplicate detection at corpus scale, tokenizer-aware transforms, dataset versioning and provenance, contamination checks, sharded parallel processing with preserved determinism — are not implemented. They could be built on top of this library's primitives, but they are not included today.
## Tests

95 tests covering all modules:
| Module | Tests | What's covered |
|---|---|---|
| bitmask | 6 | Word boundaries, AND, set iteration, memory sizing |
| column | 4 | Gather, length, borrowed keys, NaN ordering |
| dataframe | 4 | Construction, duplicates, length mismatch |
| expr | 2 | Row evaluation, columnar fast path |
| kahan | 3 | Compensation accuracy, determinism, count |
| regex_engine | 31 | Literals, classes, quantifiers, anchors, lazy/greedy, split, determinism |
| nlp | 17 | Edit distance, Jaccard, n-grams, tokenization, TF, cosine similarity |
| csv | 11 | Type inference, streaming, delimiters, line endings, max_rows |
| rng | 3 | Determinism, f64 range, fork independence |
| tidyview | 14 | Filter, chain, select, group-by, arrange, join, sample, distinct, snapshot semantics |
Run with `cargo test`.
## License

MIT