ColStore

A memory-mapped columnar binary format for fast, memory-efficient I/O on structured arrays. colstore lets you write a tabular dataset to a single .cstore file once and then load arbitrary row/column subsets without materializing the rest. Internally, columns are stored back-to-back as raw NumPy bytes, reads use np.memmap, and fancy-index gathers run through a parallel C++ kernel (OpenMP + software prefetching) bound via Cython. Process memory stays bounded by the size of the output you ask for; the source file is never fully read into RAM.

Install

pip install colstore

Building from source needs a C++17 compiler and CMake ≥ 3.18. On macOS install libomp (brew install libomp) to get the parallel kernel; without it the build still succeeds but the kernel runs single-threaded.

Quick start

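import numpy as np
import pandas as pd
from colstore import ColStore

# Example input: any pandas DataFrame with fixed-size numeric columns.
df = pd.DataFrame({
    "price": np.random.rand(1000).astype(np.float32),
    "qty": np.arange(1000, dtype=np.int32),
})

# Write and open in one call. `.cstore` is the canonical extension.
ds = ColStore.from_dataframe(df, "data.cstore")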

# Indexing returns lazy views; no data is read yet.
ds['price']                          # ColumnView
ds[100:200]                          # TableView
ds[100:200, 'price']                 # ColumnView
ds[100:200, ['price', 'qty']]        # TableView
ds[[1, 5, 9], ['price', 'qty']]      # TableView (fancy rows + cols)

# Materialize through one of the to_* methods.
indices = [1, 5, 9]                             # any integer row indices
ds['price'].to_array()                          # 1D ndarray
ds[indices, ['price', 'qty']].to_dict()         # dict of 1D arrays
ds[indices, ['price', 'qty']].to_record()       # structured ndarray
ds[indices, ['price', 'qty']].to_dataframe()    # pandas DataFrame

Writing from other sources

from colstore import ColStore
import numpy as np

# From a dict of 1D arrays.
ColStore.from_dict(
    {"x": np.arange(100, dtype=np.float32), "y": np.arange(100, dtype=np.int64)},
    "data.cstore",
)

# From a structured (record) array.
records = np.empty(100, dtype=[("price", np.float32), ("qty", np.int32)])
ColStore.from_records(records, "data.cstore")

Each factory returns an opened ColStore ready to read from.

Configuration

from colstore import set_max_workers, set_default_madvise, set_default_backend

set_max_workers(8)                # parallel gathers across columns
set_default_madvise("sequential") # OS read-ahead hint for sorted-index reads
set_default_backend("cpp")        # gather kernel: cpp | numpy | numba

On-disk format

[magic 8B = b"CSTORE\x00\x01"]
[manifest_len 8B (u64 little-endian)]
[manifest_json]
[zero-padding to 64-byte alignment]
[column_0 raw bytes][column_1 raw bytes]...[column_n raw bytes]

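The manifest is a small JSON object recording format_version, n_rows, and per-column {name, dtype}. Column data is stored byte-for-byte in its original dtype; columns are laid out back-to-back with no per-row overhead, so each column occupies exactly n_rows × itemsize bytes and its offset follows from the manifest alone.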
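The layout above can be parsed with nothing but the standard library and NumPy. The following is a minimal sketch, not colstore's actual reader or writer: the helper names (write_sketch, read_header, open_column) are hypothetical, and it assumes the per-column entries live under a manifest key called "columns", that dtypes are recorded as NumPy dtype strings such as "<f4", that format_version is 1, and that the 64-byte alignment is measured from the start of the file.

```python
import json
import struct
import tempfile
from pathlib import Path

import numpy as np

MAGIC = b"CSTORE\x00\x01"  # 8-byte magic from the layout above
ALIGN = 64                 # column data starts on a 64-byte boundary


def _pad_to(offset, align=ALIGN):
    """Round offset up to the next multiple of align."""
    return (offset + align - 1) // align * align


def write_sketch(path, columns):
    """Write a dict of equal-length 1D arrays in the layout described above."""
    n_rows = len(next(iter(columns.values())))
    manifest = {
        "format_version": 1,  # assumed value
        "n_rows": n_rows,
        # "columns" is an assumed key name for the per-column entries.
        "columns": [{"name": k, "dtype": v.dtype.str} for k, v in columns.items()],
    }
    blob = json.dumps(manifest).encode("utf-8")
    with open(path, "wb") as f:
        f.write(MAGIC)
        f.write(struct.pack("<Q", len(blob)))  # u64 little-endian
        f.write(blob)
        f.write(b"\x00" * (_pad_to(f.tell()) - f.tell()))
        for arr in columns.values():
            f.write(np.ascontiguousarray(arr).tobytes())


def read_header(path):
    """Return (manifest, data_offset) without touching column data."""
    with open(path, "rb") as f:
        if f.read(8) != MAGIC:
            raise ValueError("not a .cstore file")
        (manifest_len,) = struct.unpack("<Q", f.read(8))
        manifest = json.loads(f.read(manifest_len))
    return manifest, _pad_to(16 + manifest_len)


def open_column(path, name):
    """Memory-map one column; pages are read only when rows are indexed."""
    manifest, offset = read_header(path)
    n_rows = manifest["n_rows"]
    for col in manifest["columns"]:
        dtype = np.dtype(col["dtype"])
        if col["name"] == name:
            return np.memmap(path, dtype=dtype, mode="r",
                             offset=offset, shape=(n_rows,))
        offset += n_rows * dtype.itemsize  # skip over the preceding column
    raise KeyError(name)


# Round trip: write two columns, then gather three rows from each.
path = Path(tempfile.mkdtemp()) / "sketch.cstore"
write_sketch(path, {
    "price": np.arange(10, dtype=np.float32),
    "qty": np.arange(10, dtype=np.int32) * 2,
})
manifest, data_offset = read_header(path)
price_rows = np.asarray(open_column(path, "price")[[1, 5, 9]])
qty_rows = np.asarray(open_column(path, "qty")[[1, 5, 9]])
```

Because every column has a fixed item size, a column's byte offset is just the data offset plus the sizes of the columns before it, which is why the manifest does not need to store offsets in this sketch.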

Supported dtypes

Fixed-size only: float32, float64, int8/16/32/64, uint8/16/32/64, bool. Object dtype (strings, Python objects) is rejected at write time: the design point is zero-copy random access, which requires a fixed stride so that row i of a column sits at a computable byte offset.
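To catch unsupported columns before calling one of the factories, a pre-flight check along these lines may help. This is a sketch under assumptions: check_columns is a hypothetical helper, not part of colstore, and the exception type colstore itself raises at write time is not stated here, so TypeError is only a stand-in.

```python
import numpy as np

# Dtype names taken from the supported list above.
SUPPORTED = frozenset({
    "float32", "float64",
    "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
    "bool",
})


def check_columns(columns):
    """Raise TypeError for any column whose dtype is not in SUPPORTED."""
    for name, arr in columns.items():
        arr = np.asarray(arr)
        if arr.dtype.name not in SUPPORTED:
            raise TypeError(f"column {name!r}: unsupported dtype {arr.dtype}")


# Fixed-size columns pass silently.
check_columns({
    "price": np.zeros(3, dtype=np.float32),
    "flag": np.array([True, False, True]),
})

# An object column is rejected.
try:
    check_columns({"label": np.array(["a", "b"], dtype=object)})
except TypeError as exc:
    message = str(exc)
```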

License

MIT License - see LICENSE file for details.
