AdamEzzat1/virtual-frame

virtual-frame

Foundational infrastructure for deterministic data pipelines. Written in Rust with Python bindings via PyO3.

virtual-frame provides low-level building blocks — columnar storage, bitmask-filtered views, compensated summation, NFA regex, string distance metrics, and a deterministic RNG — that can serve as a substrate for reproducible data processing, including LLM training data preparation. It is not a complete LLM data toolkit; it does not yet provide large-scale sharding, dedup pipelines, tokenizer-aware transforms, dataset versioning, provenance capture, or parallel ingestion. Those are the kinds of things you would build on top of this library.

Features

  • TidyView — Virtual views over columnar data. Filters flip bits in a packed bitmask (BitMask); selects narrow a projection index. The base Rc<DataFrame> is shared, not cloned, across chained operations. Note: materialize() does allocate when you need a concrete output DataFrame.
  • NFA Regex — Thompson NFA simulation with O(n*m) worst-case guarantee. No backtracking, no catastrophic blowup. Supports Perl-style syntax: character classes, quantifiers (greedy + lazy), anchors, word boundaries, alternation.
  • NLP Primitives — Levenshtein edit distance, Jaccard n-gram similarity, character/word n-gram extraction, whitespace and word-punctuation tokenizers, term frequency, cosine similarity.
  • Kahan Summation — Compensated floating-point accumulation. Every aggregation (sum, mean, variance, standard deviation) uses Kahan summation to reduce rounding drift. See Limitations for precision boundaries.
  • SplitMix64 RNG — Deterministic PRNG. Same seed produces the same sequence. Supports uniform f64, Box-Muller normal, and fork() for independent substreams. See Determinism Design for what this does and does not guarantee.
  • CSV Ingestion — Single-pass type-inferring parser with streaming aggregation (sum, min/max) that operates in O(ncols) memory without materializing the full dataset.
  • Columnar Storage — Typed column vectors (Int64, Float64, String, Bool) with borrowed ColumnKeyRef keys for group-by and join index construction, avoiding per-row string cloning.
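To make the NLP primitives concrete, here is a minimal, self-contained Levenshtein edit distance in Python. This is an illustrative sketch of the standard dynamic-programming algorithm, not virtual-frame's Rust implementation:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic DP edit distance, O(len(a) * len(b)) time.
    # Keep only the previous row, so memory is O(len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on match)
            ))
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

The row-reuse trick matters for long strings: the full matrix would be quadratic in memory, while a single rolling row is linear.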

Install

Python

pip install virtual-frame

Rust

[dependencies]
virtual-frame = "0.1"

Quick Start (Python)

import virtual_frame as vf

# Load data
df = vf.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "dept": ["eng", "eng", "sales", "sales"],
    "salary": [90000.0, 85000.0, 70000.0, 75000.0],
})

# Create a virtual view and chain operations
tv = vf.TidyView(df)
result = tv.filter_gt_float("salary", 72000.0).select(["name", "salary"])
print(result.materialize())  # 3 rows

# Group and summarise
summary = tv.group_summarise(["dept"], "salary", "mean", "avg_salary")
print(summary.get_column("avg_salary"))  # [87500.0, 72500.0]

# NFA regex
print(vf.regex_find_all(r"\d+", "order-42 item-7 qty-100"))
# [(6, 8), (14, 15), (20, 23)]

# NLP
print(vf.levenshtein("kitten", "sitting"))  # 3
print(vf.char_ngrams("hello", 2))  # {'el': 1, 'he': 1, 'll': 1, 'lo': 1}

# Deterministic RNG
rng = vf.Rng(42)
print(rng.next_f64())  # 0.741565...

# Kahan-compensated sum
print(vf.kahan_sum([0.1] * 10_000_000))  # 1000000.0

Quick Start (Rust)

use virtual_frame::dataframe::DataFrame;
use virtual_frame::column::Column;
use virtual_frame::tidyview::TidyView;
use virtual_frame::expr::{col, binop, BinOp, DExpr};

let df = DataFrame::from_columns(vec![
    ("x".into(), Column::Int(vec![1, 2, 3, 4, 5])),
    ("y".into(), Column::Float(vec![10.0, 20.0, 30.0, 40.0, 50.0])),
]).unwrap();

let view = TidyView::new(df);
let filtered = view
    .filter(&binop(BinOp::Gt, col("y"), DExpr::LitFloat(25.0)))
    .unwrap();
assert_eq!(filtered.nrows(), 3); // rows where y > 25

Architecture

TidyView = Rc<DataFrame> + BitMask + ProjectionMap + Option<Ordering>
  • BitMask: One bit per row, packed into 64-bit words. A million-row filter costs ~122 KB of bitmask memory. Chained filters AND their bitmasks together.
  • ProjectionMap: Tracks visible column indices. select() narrows the map without touching column data.
  • Ordering: Lazy sort permutation via arrange(). Only materialized into a concrete DataFrame when materialize() is called.
  • ColumnKeyRef: Borrowed keys into column data for group-by and join index construction. Single-key operations use BTreeMap<ColumnKeyRef, usize> directly, avoiding one Vec allocation per row.
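The packed-bitmask idea above can be sketched in a few lines of Python. This is a toy model of the technique (one bit per row in 64-bit words, chained filters ANDed word-by-word), not the library's actual BitMask type:

```python
class BitMask:
    """One bit per row, packed into 64-bit words."""

    def __init__(self, nrows: int, fill: bool = True):
        self.nrows = nrows
        nwords = (nrows + 63) // 64
        self.words = [(1 << 64) - 1 if fill else 0] * nwords
        self._trim()

    def _trim(self):
        # Zero out padding bits past nrows in the last word.
        extra = len(self.words) * 64 - self.nrows
        if extra and self.words:
            self.words[-1] &= (1 << (64 - extra)) - 1

    def clear(self, row: int):
        # A failed filter predicate clears the row's bit.
        self.words[row // 64] &= ~(1 << (row % 64))

    def and_with(self, other: "BitMask"):
        # Chained filters combine by ANDing their masks.
        self.words = [a & b for a, b in zip(self.words, other.words)]

    def count(self) -> int:
        return sum(bin(w).count("1") for w in self.words)

# A million rows pack into 15,625 words (~122 KiB of mask memory).
print(len(BitMask(1_000_000).words) * 8)  # 125000 bytes
```

The cost model follows directly: every chained filter touches only the mask, never the column data, until materialize() gathers the surviving rows.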

Determinism Design

This library is designed for reproducible results through several mechanisms:

  • Kahan summation for all floating-point reductions (sum, mean, variance, standard deviation)
  • BTreeMap/BTreeSet everywhere — never HashMap/HashSet, which have non-deterministic iteration order
  • SplitMix64 RNG with explicit seed threading
  • No reliance on FMA in reduction paths (FMA can change rounding behavior across platforms)

What this means in practice: Given identical inputs and the same seed, all operations in this library should produce identical outputs. The test suite includes determinism checks that verify repeated execution yields the same results.

What this does not yet prove: Cross-platform bit-identity has not been validated with a CI matrix across Linux/macOS/Windows/ARM. The determinism properties are enforced by algorithm design (ordered containers, compensated summation, explicit RNG), but independent platform-pair verification is not yet published. If you depend on cross-platform reproducibility in production, you should validate on your target platforms.
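For reference, the SplitMix64 algorithm itself is small enough to sketch in Python. The mixing constants below are the standard published ones; the fork() seeding rule shown is an assumption for illustration, not necessarily virtual-frame's scheme:

```python
MASK = (1 << 64) - 1  # emulate u64 wrapping arithmetic

class SplitMix64:
    def __init__(self, seed: int):
        self.state = seed & MASK

    def next_u64(self) -> int:
        # Standard SplitMix64 finalizer over a Weyl sequence.
        self.state = (self.state + 0x9E3779B97F4A7C15) & MASK
        z = self.state
        z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E9B5) & MASK
        z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK
        return z ^ (z >> 31)

    def next_f64(self) -> float:
        # Top 53 bits -> uniform double in [0, 1).
        return (self.next_u64() >> 11) * (2.0 ** -53)

    def fork(self) -> "SplitMix64":
        # ASSUMPTION: seed the child substream from the parent's
        # next output (one plausible scheme, not the library's rule).
        return SplitMix64(self.next_u64())
```

Because the generator is pure integer arithmetic with explicit wrapping, the same seed yields the same sequence on any platform, which is what makes explicit seed threading a determinism mechanism rather than just a convenience.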

Limitations

  • Kahan summation precision boundary: Single Kahan compensation captures one level of rounding error. For extreme cases (e.g., summing values where the accumulator and individual values differ by more than ~2^52), the compensation term itself can lose precision. The test suite validates Kahan accuracy for practical ranges (10M summands of 0.1). For cases requiring higher precision, consider second-order compensation or arbitrary-precision arithmetic.
  • Single-threaded: All operations run on a single thread. There is no parallel ingestion or parallel group-by. This is a design choice (determinism is trivial without concurrency), but it means throughput is bounded by single-core speed.
  • No null/missing value support: Columns are dense typed vectors with no NA/null sentinel. Missing data must be handled before loading.
  • No string interning: String columns store owned String values. For datasets with high-cardinality string columns, memory usage may be higher than interned alternatives.
  • Python GIL bound: The Python bindings hold the GIL during all operations. Long-running computations will block other Python threads.
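The Kahan precision boundary described above is easy to observe with a small experiment. Below is a plain-Python version of single-compensation Kahan summation, shown as a sketch of the technique rather than the library's Rust code, compared against a naive loop and math.fsum as a correctly rounded reference:

```python
import math

def kahan_sum(xs):
    total = 0.0
    c = 0.0  # running compensation for lost low-order bits
    for x in xs:
        y = x - c
        t = total + y
        c = (t - total) - y  # recovers the rounding error of total + y
        total = t
    return total

vals = [0.1] * 100_000
naive = sum(vals)
exact = math.fsum(vals)  # correctly rounded reference sum
print(abs(kahan_sum(vals) - exact) <= abs(naive - exact))  # True
```

In the benign regime the compensated result tracks the correctly rounded sum while the naive loop drifts; the failure mode noted above appears only when summands and accumulator differ by more than the ~2^52 spread that one compensation term can absorb.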

What This Is and Is Not

This is foundational infrastructure: columnar storage, filtered views, joins, grouped aggregation, regex extraction, string distance, tokenization, deterministic RNG, and compensated arithmetic. These are building blocks.

This is not yet a complete LLM data preparation toolkit. The harder problems in that space — large-scale deduplication, near-duplicate detection at corpus scale, tokenizer-aware transforms, dataset versioning and provenance, contamination checks, sharded parallel processing with preserved determinism — are not implemented. They could be built on top of this library's primitives, but they are not included today.

Test Suite

95 tests covering all modules:

Module        Tests  What's covered
bitmask           6  Word boundaries, AND, set iteration, memory sizing
column            4  Gather, length, borrowed keys, NaN ordering
dataframe         4  Construction, duplicates, length mismatch
expr              2  Row evaluation, columnar fast path
kahan             3  Compensation accuracy, determinism, count
regex_engine     31  Literals, classes, quantifiers, anchors, lazy/greedy, split, determinism
nlp              17  Edit distance, Jaccard, n-grams, tokenization, TF, cosine similarity
csv              11  Type inference, streaming, delimiters, line endings, max_rows
rng               3  Determinism, f64 range, fork independence
tidyview         14  Filter, chain, select, group-by, arrange, join, sample, distinct, snapshot semantics

Run with: cargo test

License

MIT

About

Deterministic data-pipeline primitives for LLM training data preparation — zero-copy TidyView, NFA regex, NLP primitives, Kahan summation. Python bindings via PyO3.
